DataSQRL Documentation

DataSQRL is a framework for building data pipelines with guaranteed data integrity. It compiles SQL scripts into fully integrated data infrastructure that ingests data from multiple sources, transforms it through stream processing, and serves the results as real-time data APIs, LLM tooling, or Apache Iceberg views.

What is DataSQRL?

DataSQRL simplifies data pipeline development by automatically generating the glue code, schemas, mappings, and deployment artifacts needed to integrate Apache Flink, Postgres, Kafka, GraphQL APIs, and other technologies into a coherent, production-grade data stack.

Key Benefits:

  • ๐Ÿ›ก๏ธ Data Integrity: Exactly-once processing, consistent data across all outputs, automated data lineage
  • ๐Ÿ”’ Production-Ready: Highly available, scalable, observable pipelines using trusted OSS technologies
  • ๐Ÿ”— End-to-End Consistency: Generated connectors and schemas maintain data integrity across the entire pipeline
  • ๐Ÿš€ Developer-Friendly: Local development, CI/CD support, comprehensive testing framework
  • ๐Ÿค– AI-Native: Support for vector embeddings, LLM invocation, and ML model inference

Quick Start

Check out the Getting Started guide to build a real-time data pipeline with DataSQRL in 10 minutes.

Take a look at the DataSQRL Examples Repository for simple and complex use cases implemented with DataSQRL.

Core Components

DataSQRL consists of three main components that work together:

1. SQRL Language

SQRL extends Flink SQL with features specifically designed for reactive data processing:

  • IMPORT/EXPORT statements for connecting data systems
  • Table functions and relationships for interface definitions
  • Hints to control pipeline structure and execution
  • Subscription syntax for real-time data streaming
  • Type system for stream processing semantics
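As a brief illustration, here is a minimal sketch of a SQRL script; the package path, sink, table, and column names are hypothetical, so consult the SQRL Language reference for the full syntax:

    IMPORT ecommerce.Orders;  -- connect a source table from a data package

    /* Define a derived table with := ; it becomes part of the
       compiled pipeline and is exposed through the generated API */
    CustomerSpend := SELECT customerid,
                            COUNT(1)   AS num_orders,
                            SUM(total) AS total_spend
                     FROM Orders
                     GROUP BY customerid;

    EXPORT CustomerSpend TO mysink.CustomerSpend;  -- write results to a sink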

2. Configuration

JSON configuration files that define:

  • Engines: Data technologies (Flink, Postgres, Kafka, etc.)
  • Connectors: Templates for data sources and sinks
  • Dependencies: External data packages and libraries
  • Compiler options: Optimization and deployment settings
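For illustration, a skeletal configuration might enable a set of engines and leave the other sections empty; the keys below are representative placeholders rather than the authoritative schema, which is documented in the Configuration reference:

    {
      "engines": {
        "flink":    { },
        "postgres": { },
        "kafka":    { }
      },
      "dependencies": { },
      "compiler": { }
    }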

3. Compiler

The DataSQRL compiler:

  • Transpiles SQRL scripts into deployment assets
  • Optimizes data processing DAGs across multiple engines
  • Generates schemas, connectors, and API definitions
  • Executes pipelines locally for development and testing
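As a rough usage sketch, the compiler can be invoked through its published Docker image; the script name is a placeholder, and the Compiler reference documents the full command set and flags:

    # Compile a SQRL script into deployment assets
    docker run --rm -v $PWD:/build datasqrl/cmd compile myscript.sqrl

    # Run the compiled pipeline locally for development and testing
    docker run --rm -v $PWD:/build datasqrl/cmd run myscript.sqrl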

Documentation Guide

🚀 Getting Started

📚 Core Documentation

  • SQRL Language - Complete language specification and syntax
  • Configuration - Engine setup and project configuration
  • Compiler - Command-line interface and compilation options
  • Functions - Built-in functions and custom function libraries

🔌 Integration & Deployment

  • Connectors - Ingest from and export to external systems
  • Concepts - Key concepts in stream processing (time, watermarks, etc.)
  • How-To Guides - Best practices and implementation patterns

๐Ÿ› ๏ธ Advanced Topicsโ€‹

Use Cases

DataSQRL is ideal for:

  • Real-time Analytics: Stream processing with consistent data APIs
  • Event-Driven Applications: Reactive systems with subscriptions and alerts
  • Data Lakehouses: Reliable Iceberg tables with automated schema management
  • LLM Applications: Accurate data delivery for AI agents and chatbots
  • Microservices Integration: Consistent data sharing across distributed systems

Architecture

DataSQRL compiles your SQRL scripts into a data processing DAG that's optimized and distributed across multiple engines:

    Data Sources  →  Apache Flink  →  PostgreSQL/Iceberg  →  GraphQL API
         ↓                ↓                   ↓                  ↓
       Kafka            Stream            Database           Real-time
       Topics         Processing            Views               APIs

The compiler automatically generates all of the necessary artifacts:

  • Flink job definitions and SQL plans
  • Database schemas and views
  • Kafka topic configurations
  • GraphQL schemas and resolvers
  • Container and Kubernetes deployment files
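For instance, a derived table like CustomerSpend from the SQRL sketch above would surface in the generated GraphQL schema roughly like this; the shape is illustrative only, not verbatim compiler output:

    type CustomerSpend {
      customerid: Int!
      num_orders: Int!
      total_spend: Float!
    }

    type Query {
      CustomerSpend(customerid: Int): [CustomerSpend!]
    }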

Community & Support

DataSQRL is open source and community-driven. We welcome feedback, bug reports, and contributions to help make data pipeline development faster and more reliable for everyone.