DataSQRL Documentation

DataSQRL is an open-source data engineering harness that provides guardrails and feedback for AI coding agents to build reliable data pipelines, data APIs (REST, MCP, GraphQL), and data products.

DataSQRL ensures coding agents meet the non-functional requirements of production data systems: data quality, scalability, governance, and reliability. It provides deep inspection of SQL, relational validators, and deterministic event-replay simulation to guide agent-generated code through iterative feedback loops.

Why a Data Engineering Harness?

Coding agents can generate SQL queries that produce correct results on test data. But will those queries perform at scale? Handle late-arriving events correctly? Maintain data quality when upstream schemas change? Provide lineage tracking and meet compliance requirements?

These non-functional requirements (data quality, scalability, governance, reliability, and cost efficiency) distinguish data engineering from general software development. General-purpose coding agents aren't equipped to handle them consistently.

A data engineering harness provides the guardrails, feedback loops, and domain-specific constraints that coding agents need. Without a harness, you get pipelines that work in demos but fail in production. With a harness, you get pipelines that embody data engineering best practices.

Learn more about the harness architecture and the design choices behind DataSQRL.

Key Capabilities

DataSQRL provides three capabilities that coding agents need to produce production-grade data systems:

1. Conceptual Framework

DataSQRL extends SQL into a comprehensive framework for data platforms. SQL is the ideal foundation because it offers a mathematical basis (relational algebra), deep introspection through its declarative nature, deterministic validation and optimization, human readability, and strong support from modern LLMs.

The framework separates logical and physical layers:

  • Logical Layer: Expresses what data transformations are needed using SQRL (SQL extended with stream processing semantics)
  • Physical Layer: Represents how data gets processed through engine assignment and configuration

This separation lets agents reason about business logic while the harness handles infrastructure complexity.
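
The separation can be pictured with a small sketch. This is written in Python rather than SQRL, and every name in it (`LogicalTable`, `engine_assignment`, `physical_plan`, the `OrderTotals` query) is hypothetical and chosen for illustration; it is not DataSQRL's internal model. It only shows the idea that the engine choice can change without touching the logical definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalTable:
    """What to compute: a named query with no infrastructure details."""
    name: str
    query: str


# Logical layer: a hypothetical table definition.
logical_plan = [
    LogicalTable(
        "OrderTotals",
        "SELECT customer_id, SUM(amount) AS total FROM Orders GROUP BY customer_id",
    ),
]

# Physical layer: a hypothetical engine assignment, kept apart from the logic.
engine_assignment = {"OrderTotals": "flink"}


def physical_plan(tables, assignment):
    """Pair each logical table with the engine assigned to it."""
    return [(t.name, assignment[t.name], t.query) for t in tables]


plan = physical_plan(logical_plan, engine_assignment)

# Re-planning with a different engine leaves the logical query unchanged.
replanned = physical_plan(logical_plan, {"OrderTotals": "postgres"})
```

In this sketch an agent would only ever edit `logical_plan`; the assignment dictionary stands in for what the harness derives from configuration.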

[Figure: DataSQRL Framework Overview]

2. Comprehensive Validation

DataSQRL validates at every level, from syntax and schema validation through physical plan verification to deployment asset generation:

  • Logical Validation: Syntax, schema consistency, data flow semantics, timestamp propagation, primary key inference
  • Physical Validation: Engine capability matching, data type mapping, topological constraint satisfaction
  • Deployment Validation: Generated artifacts (Flink plans, Postgres schemas, GraphQL models) guaranteed consistent with logical definitions

The validation system provides comprehensive context and suggested fixes, producing better results than agents reasoning about errors independently.
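
To make "engine capability matching" with suggested fixes concrete, here is a minimal Python sketch. The capability sets, the `Finding` type, and `validate_assignment` are assumptions invented for this example and do not reflect DataSQRL's actual validator or the real capabilities of Flink or Postgres. The point is only that a validator can return structured context and a suggestion instead of a bare error.

```python
from dataclasses import dataclass

# Hypothetical capability sets; real engines expose far richer metadata.
ENGINE_CAPABILITIES = {
    "flink": {"stream_join", "window_aggregate", "filter"},
    "postgres": {"filter", "index_lookup"},
}


@dataclass(frozen=True)
class Finding:
    table: str
    message: str
    suggestion: str


def validate_assignment(required, assignment):
    """Check that each table's engine supports every operation the table needs."""
    findings = []
    for table, ops in required.items():
        engine = assignment[table]
        missing = ops - ENGINE_CAPABILITIES[engine]
        if missing:
            # Engines that could run all required operations, in stable order.
            capable = sorted(
                name for name, caps in ENGINE_CAPABILITIES.items() if ops <= caps
            )
            suggestion = (
                f"assign to one of {capable}"
                if capable
                else "split the table across engines"
            )
            findings.append(
                Finding(table, f"{engine} lacks {sorted(missing)}", suggestion)
            )
    return findings


required_ops = {"Sessions": {"window_aggregate", "filter"}}
bad = validate_assignment(required_ops, {"Sessions": "postgres"})
good = validate_assignment(required_ops, {"Sessions": "flink"})
```

Returning the list of capable engines alongside the error is what lets an agent repair the assignment in one step rather than guessing.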

3. Real-World Feedback

Static validation catches many issues but can't substitute for execution feedback. DataSQRL provides:

  • Simulation: Execute pipelines locally in Docker with timestamp-accurate event replay for deterministic, reproducible testing
  • Production Telemetry: Hooks for correlating runtime observations back to source code for autonomous troubleshooting
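
The value of deterministic replay can be illustrated with a short Python sketch. It is not how DataSQRL implements simulation; the `Event` type, the millisecond timestamps, and the `running_total` pipeline are assumptions made for the example. The sketch orders events by event time with a fixed tie-break, so the same input produces the same output regardless of arrival order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # event time in milliseconds (hypothetical unit)
    source: str
    payload: dict


def replay(events):
    """Yield events in event-time order with deterministic tie-breaking."""
    # Ties on timestamp fall back to source name, then original position,
    # so events with equal timestamps are ordered the same way on every run.
    indexed = sorted(
        enumerate(events),
        key=lambda pair: (pair[1].timestamp, pair[1].source, pair[0]),
    )
    for _, event in indexed:
        yield event


def running_total(events):
    """A toy pipeline: cumulative sum of 'amount' in replay order."""
    total = 0
    results = []
    for event in replay(events):
        total += event.payload["amount"]
        results.append((event.timestamp, total))
    return results


events = [
    Event(30, "orders", {"amount": 5}),
    Event(10, "orders", {"amount": 1}),
    Event(20, "refunds", {"amount": -1}),
]
in_order = running_total(events)
shuffled = running_total(list(reversed(events)))
```

Because both runs yield identical results, a test can assert on exact pipeline output, which is what makes replay usable as a feedback signal for an agent.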

Since the entire pipeline is defined in SQL, it remains human-readable and easy to verify. DataSQRL produces detailed execution plans, data lineage graphs, and optimization reports, enabling both automated analysis by agents and manual inspection by engineers.

Quick Start

Check out the Getting Started guide to build a data pipeline with DataSQRL and see how the test-driven feedback loop guides coding agents toward correct solutions.

Explore the DataSQRL Examples for real-world patterns and how to set up an automated data platform with DataSQRL.

DataSQRL Components

1. SQRL Language

SQRL extends Flink SQL to capture the complete logical layer of data pipelines:

  • IMPORT/EXPORT statements for connecting data systems
  • Table functions and relationships for API endpoint definitions
  • Hints to control pipeline structure and execution
  • Subscription syntax for real-time data streaming
  • Stream/state semantics for temporal data processing

2. Interface Design

DataSQRL automatically generates interfaces from your SQRL script for multiple protocols:

  • Data Products as database/data lake views and tables
  • GraphQL APIs with queries, mutations, and subscriptions
  • REST endpoints with GET/POST operations
  • MCP tools/resources for AI agent integration
  • Schema customization and operation control

3. Configuration

JSON configuration defines the physical layer, that is, how the pipeline gets executed:

  • Engines: Data technologies (Flink, Postgres, Kafka, Iceberg, etc.)
  • Connectors: Templates for data sources and sinks
  • Dependencies: External data packages and libraries
  • Compiler options: Optimization and deployment settings

4. Compiler

The DataSQRL compiler implements validation and simulation:

  • Transpiles SQRL scripts into deployment assets
  • Validates logical and physical plans against data engineering constraints
  • Optimizes data processing DAGs across multiple engines
  • Simulates pipeline execution with deterministic event replay

Documentation Guide

Getting Started

  • Getting Started - Build a pipeline with the test-driven feedback loop
  • Examples - Practical examples for specific use cases

Core Documentation

  • SQRL Language - Complete language specification and syntax
  • Interface Design - API generation and data product interfaces
  • Configuration - Engine setup and project configuration
  • Compiler - Command-line interface and compilation options
  • Functions - Built-in functions and custom function libraries

Integration & Deployment

  • Connectors - Ingest from and export to external systems
  • Concepts - Key concepts in stream processing (time, watermarks, etc.)
  • How-To Guides - Best practices and implementation patterns

Advanced Topics

Use Cases

DataSQRL enables AI-assisted automation of:

  • Data Pipelines: Processing data in real time or batch with production-grade reliability
  • Data APIs: Serving processed data through REST, GraphQL, or MCP APIs
  • Data Lakehouse: Producing Apache Iceberg tables with catalog and schema management

Community & Support

DataSQRL is open source and community-driven.

We welcome feedback, bug reports, and contributions to build a data engineering harness that enables safe, reliable automation of data platforms.