Defining Data Interfaces with FlinkSQL

· 4 min read
Matthias Broecheler
CEO of DataSQRL

FlinkSQL is an amazing innovation in data processing: it packages the power of realtime stream processing within the simplicity of SQL. That means you can start with the SQL you know and introduce stream processing constructs as you need them.

FlinkSQL API Extension

FlinkSQL adds the ability to process data incrementally to the classic set-based semantics of SQL. In addition, FlinkSQL supports source and sink connectors making it easy to ingest data from and move data to other systems. That's a powerful combination which covers a lot of data processing use cases.

In fact, it only takes a few extensions to FlinkSQL to build entire data applications. Let's see how that works.

Building Data APIs with FlinkSQL

-- Mutation endpoint: no connector is configured, so events are captured through the API
CREATE TABLE UserTokens (
   userid BIGINT NOT NULL,
   tokens BIGINT NOT NULL,
   request_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);

-- Aggregation exposed as a query endpoint; the hint defines userid as the query argument
/*+query_by_all(userid) */
TotalUserTokens := SELECT userid, sum(tokens) AS total_tokens,
                          count(tokens) AS total_requests
                   FROM UserTokens GROUP BY userid;

-- Table function defining a query endpoint with explicit arguments
UserTokensByTime(userid BIGINT NOT NULL, fromTime TIMESTAMP NOT NULL, toTime TIMESTAMP NOT NULL) :=
   SELECT * FROM UserTokens WHERE userid = :userid
     AND request_time >= :fromTime AND request_time < :toTime ORDER BY request_time DESC;

-- Subscription endpoint: matching events are pushed to clients in real-time
UsageAlert := SUBSCRIBE SELECT * FROM UserTokens WHERE tokens > 100000;

This script defines a sequence of tables. We introduce := as syntactic sugar for the verbose CREATE TEMPORARY VIEW syntax.
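
For illustration, the TotalUserTokens definition above expands to roughly the following standard FlinkSQL (a sketch; the exact view the compiler generates may differ):

CREATE TEMPORARY VIEW TotalUserTokens AS
SELECT userid, sum(tokens) AS total_tokens,
       count(tokens) AS total_requests
FROM UserTokens GROUP BY userid;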

The UserTokens table does not have a configured connector, which means we treat it as an API mutation endpoint connected to Flink via a Kafka topic that captures the events. This makes it easy to build APIs that capture user activity, transactions, or other types of events.
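
Conceptually, you can think of the mutation endpoint as a Kafka-backed table provisioned behind the scenes, roughly equivalent to hand-writing something like the following (a sketch only; the topic name, broker address, and format are hypothetical placeholders, not the generated configuration):

CREATE TABLE UserTokens (
   userid BIGINT NOT NULL,
   tokens BIGINT NOT NULL,
   request_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
) WITH (
   'connector' = 'kafka',                              -- placeholder settings for illustration
   'topic' = 'usertokens',
   'properties.bootstrap.servers' = 'localhost:9092',
   'format' = 'json'
);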

Next, we sum up the data collected through the API for each user. This is a standard FlinkSQL aggregation query, and we expose the result in our API through the query_by_all hint, which defines the arguments for the query endpoint of that table.

We can also explicitly define query endpoints with arguments through SQL table functions. FlinkSQL supports table functions natively; all we had to do was provide the syntax for defining the function signature.

And last, the SUBSCRIBE keyword in front of the query defines a subscription endpoint for requests exceeding a certain token count; matching events are pushed to clients in real time.

Voilà, we just built ourselves a complete GraphQL API with mutation, query, and subscription endpoints. Run the script above with DataSQRL to see the result:

docker run -it --rm -p 8888:8888 -v $PWD:/build datasqrl/cmd run usertokens.sqrl

Relationships for Complex Data Structures

And for extra credit, we can define relationships in FlinkSQL to represent the structure of our data explicitly and expose it in the API:

User.totalTokens := SELECT * FROM TotalUserTokens t WHERE this.userid = t.userid LIMIT 1;

The User table in this example is read from an upsert Kafka topic using a standard FlinkSQL CREATE TABLE statement.
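
Such a definition might look like the following (a sketch; the column list, topic, brokers, and formats are illustrative placeholders):

CREATE TABLE `User` (                  -- backticks escape the reserved word USER
   userid BIGINT NOT NULL,
   name   STRING,
   PRIMARY KEY (userid) NOT ENFORCED   -- upsert-kafka requires a primary key
) WITH (
   'connector' = 'upsert-kafka',
   'topic' = 'users',
   'properties.bootstrap.servers' = 'localhost:9092',
   'key.format' = 'json',
   'value.format' = 'json'
);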

Code Modularity and Connector Management

Many FlinkSQL projects break the codebase into multiple files for better readability and modularity, or to swap out sources and sinks. That requires extra infrastructure to manage the FlinkSQL files and stitch them together.

How about we do that directly in FlinkSQL?

IMPORT source-data.User;

Here, we import the User table from a separate file within the source-data directory, allowing us to separate the data processing logic from the source configurations. It also enables us to use dependency management to swap out sources for local testing vs production.
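
For illustration, one way to define the imported table is an ordinary FlinkSQL CREATE TABLE placed in the source-data package; a minimal sketch under that assumption, using a hypothetical file-backed table for local testing (connector, path, and columns are placeholders):

CREATE TABLE `User` (
   userid BIGINT NOT NULL,
   name   STRING
) WITH (
   'connector' = 'filesystem',   -- a production package might point at a Kafka topic instead
   'path' = '/data/users',
   'format' = 'json'
);

Swapping the dependency that source-data resolves to then switches between, say, file-backed test data and the production source without touching the processing logic.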

And we can do the same for sinks:

EXPORT UsageAlert TO mysinks.UsageAlert;

In addition to breaking the sink configuration out of the main script, the EXPORT statement functions as an INSERT INTO statement and implicitly creates a STATEMENT SET. That makes the code easier to read.
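
Written out in plain FlinkSQL, the export corresponds roughly to a statement set wrapping an INSERT INTO (a sketch; UsageAlertSink is a hypothetical name standing in for the sink table defined by mysinks.UsageAlert):

EXECUTE STATEMENT SET
BEGIN
   INSERT INTO UsageAlertSink
   SELECT * FROM UsageAlert;
END;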

Learn More

FlinkSQL is a phenomenal extension of the SQL ecosystem into stream processing. With DataSQRL, we are trying to make it easier to build end-to-end data pipelines and complete data applications with FlinkSQL.

Check out the complete example which also covers testing, customization, and deployment. Or read the documentation to learn more.

DataSQRL 0.6 Release: The Streaming Data Framework

· 3 min read
Matthias Broecheler
CEO of DataSQRL

The DataSQRL community is proud to announce the release of DataSQRL 0.6. This release marks a major milestone in the evolution of our open-source project, bringing enhanced alignment with Flink SQL and powerful new capabilities to the real-time serving layer.

DataSQRL 0.6.0 Release

You can find the full release notes and source code on our GitHub release page. To get started with the latest compiler, simply pull the latest Docker image:

docker pull datasqrl/cmd:0.6.0

With DataSQRL 0.6, we are embracing the Flink ecosystem more deeply than ever before. This release introduces a complete re-architecture of the DataSQRL compiler to build directly on top of Flink SQL's parser and planner. By aligning our internal model with Flink SQL semantics, we unlock a host of new capabilities and bring DataSQRL users closer to the vibrant Flink ecosystem.

This architectural shift allows DataSQRL to:

  • Use Flink SQL syntax as the foundation, enabling more intuitive query definitions and easier onboarding for users familiar with Flink.
  • Extend Flink SQL with domain-specific features, such as declarative relationship definitions and functions to define the data interface.
  • Transpile FlinkSQL to database dialects for query execution.

Serving-Layer Power: Functions & Relationships

DataSQRL 0.6 introduces first-class support for defining functions and relationships in your SQRL scripts. These constructs make it easier to model complex application logic in a modular, declarative fashion.
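
For example, a relationship and a function can be declared directly in an SQRL script using plain SQL (a minimal sketch reusing the UserTokens tables from the FlinkSQL interface post above):

-- Relationship: attach each user's aggregated totals to the User table
User.totalTokens := SELECT * FROM TotalUserTokens t WHERE this.userid = t.userid LIMIT 1;

-- Function: a parameterized query endpoint over the UserTokens stream
UserTokensByTime(userid BIGINT NOT NULL, fromTime TIMESTAMP NOT NULL, toTime TIMESTAMP NOT NULL) :=
   SELECT * FROM UserTokens
   WHERE userid = :userid AND request_time >= :fromTime AND request_time < :toTime;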

These features are purpose-built for powering LLM-ready APIs, event-driven architectures, and real-time user-facing applications.

Check out the language documentation for details.

Developer Tooling

DataSQRL 0.6 provides a Docker image for compiling, running, and testing SQRL projects, so you can iterate quickly, check results locally, and run automated tests in CI/CD.

Deployment Artifacts

DataSQRL 0.6 removes deployment profiles and instead generates all deployment artifacts in the build/deploy/plan folder. This makes it easier to integrate with Kubernetes deployment processes (e.g. via Helm) or cloud managed service deployments (e.g. via Terraform).

Breaking Changes & Migration Path

As this is a major release, DataSQRL 0.6 is not backwards compatible with version 0.5. The syntax and internal representation have been updated to align with Flink SQL and to support the new compiler architecture.

To help you transition, we’ve provided updated examples and migration guidance in the DataSQRL examples repository. We recommend starting with one of the updated use cases to get a feel for the new workflow.

Thanks to the Community

This release wouldn’t have been possible without the contributions, bug reports, and thoughtful feedback from our growing community. Whether you opened a pull request, filed an issue, or joined a discussion, thank you. Your support drives this project forward.

We’re excited to see what you build with DataSQRL 0.6. If you haven’t joined the community yet, now’s a great time to get involved: star us on GitHub, try out the latest release, and share your thoughts.

Stay tuned for more updates, and happy building.

Why Temporal Join is Stream Processing’s Superpower

· 8 min read
Matthias Broecheler
CEO of DataSQRL

Stream processing technologies like Apache Flink introduce a new type of data transformation that’s very powerful: the temporal join. Temporal joins add context to data streams while being efficient and fast to execute.

Temporal Join

This article introduces the temporal join, compares it to the traditional inner join, explains when to use it, and why it is a secret superpower.
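
In FlinkSQL, a temporal join uses the FOR SYSTEM_TIME AS OF clause to enrich each stream event with the version of a versioned table that was valid at the event's timestamp; a minimal sketch with illustrative table names:

SELECT o.order_id, o.amount * r.rate AS converted_amount
FROM Orders AS o
JOIN CurrencyRates FOR SYSTEM_TIME AS OF o.order_time AS r
   ON o.currency = r.currency;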


Let's Uplevel Our Database Game: Meet DataSQRL

· 5 min read
Matthias Broecheler
CEO of DataSQRL

We need to make it easier to build data-driven applications. Databases are great if all your application needs is storing and retrieving data. But if you want to build anything more interesting with data - like serving users recommendations based on the pages they are visiting, detecting fraudulent transactions on your site, or computing real-time features for your machine learning model - you end up building a ton of custom code and infrastructure around the database.

You need a queue like Kafka to hold your events, a stream processor like Flink to process data, a database like Postgres to store and query the result data, and an API layer to tie it all together.

And that’s just the price of admission. To get a functioning data layer, you need to make sure that all these components talk to each other and that data flows smoothly between them. Schema synchronization, data model tuning, index selection, query batching … all that fun stuff.

The point is, you need to do a ton of data plumbing if you want to build a data-driven application. All that data plumbing code is time-consuming to develop, hard to maintain, and expensive to operate.

We need to make building with data easier. That’s why we are sending out this call to action to uplevel our database game. Join us in figuring out how to simplify the data layer.

We have an idea to get us started: Meet DataSQRL.