5 posts tagged with "DataSQRL"

Posts about DataSQRL

0.10 Release: Iceberg Mutations

· 3 min read
Ferenc Csaky
Apache Flink PMC
Matthias Broecheler
CEO of DataSQRL

DataSQRL 0.10 has been released, and the headline feature is mutation support for Iceberg tables: DataSQRL can now manage Apache Iceberg tables as sources and sinks.

Why is that a big deal? Up to this point, DataSQRL could read from and write to Apache Iceberg tables, but you had to manage them explicitly. This new release makes it easy to share data through Apache Iceberg between DataSQRL pipelines.

Data Fast and Slow

Let's back up a bit. Before 0.10 you could create tables in DataSQRL like this:

CREATE TABLE Clickstream (
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);

If you don't specify a connector explicitly, DataSQRL will manage the table for you, create the Kafka topics, and wire everything up. This allows you to expose mutation endpoints in GraphQL that store mutation events in Kafka and process them in Flink, making it easy to build event-driven microservices.

Kafka is great if you need data fast – in milliseconds. But it requires a separate Kafka cluster, which is challenging to maintain and costly. For many data use cases, you don't need data that quickly. Minutes and hours are just fine.

That's where Apache Iceberg shines. It's a table format for sharing data that does not require a separate data system to operate. All you need is local or cloud storage.

With the 0.10 release, DataSQRL supports Apache Iceberg for managed tables. Use Kafka for the fast data and Iceberg for the slow data. This allows you to balance speed with operational simplicity and cost.

To make this possible, DataSQRL 0.10 introduces a breaking change: you have to explicitly annotate where the tables you create are persisted, either iceberg or kafka. Use the engine SQL hint:

/*+ engine(iceberg) */
CREATE TABLE Clickstream (
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);

That's it. As you migrate to 0.10, make sure to add /*+ engine(kafka) */ to any managed tables in your existing DataSQRL projects.
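As an illustration, migrating the managed Clickstream table from the first example could look like the sketch below. It only adds the kafka engine hint described above; the table definition itself is assumed to stay unchanged.

```sql
/*+ engine(kafka) */
CREATE TABLE Clickstream (
  -- columns unchanged from the pre-0.10 definition
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);
```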

There are a lot more goodies in the 0.10 release. Check out the complete changelog for details. Shout out to Ferenc for driving this release and to our newest team members Wellington and Mate for making their first contributions to DataSQRL. Thank you!

What's Next?

We are working steadily toward the 1.0 release of DataSQRL. Most of the big features are in, and we are primarily focusing on hardening what is there: additional test coverage, handling edge cases, and extending support.

One area of work that's particularly interesting is partition strategy optimization. We are extending support for statistics to make that happen. More on this soon.

Flink SQL Runner: Run Flink SQL Without JARs or Glue Code

· 3 min read
Matthias Broecheler
CEO of DataSQRL

Apache Flink has long been a powerhouse for streaming and batch data processing. And with the rise of Flink SQL, developers can now build sophisticated pipelines using a declarative language they already know. But getting Flink SQL applications into production still comes with friction: packaging JARs, managing connectors, injecting secrets, and wiring up deployment infrastructure.

Flink SQL Runner is here to change that. It's an open-source toolkit that simplifies development, deployment, and operation of Flink SQL applications—locally or in Kubernetes—without manual JAR assembly or scripting custom infrastructure pipelines.

Defining Data Interfaces with FlinkSQL

· 4 min read
Matthias Broecheler
CEO of DataSQRL

FlinkSQL is an amazing innovation in data processing: it packages the power of realtime stream processing within the simplicity of SQL. That means you can start with the SQL you know and introduce stream processing constructs as you need them.

FlinkSQL adds the ability to process data incrementally to the classic set-based semantics of SQL. In addition, FlinkSQL supports source and sink connectors making it easy to ingest data from and move data to other systems. That's a powerful combination which covers a lot of data processing use cases.

In fact, it only takes a few extensions to FlinkSQL to build entire data applications. Let's see how that works.

Building Data APIs with FlinkSQL

CREATE TABLE UserTokens (
  userid BIGINT NOT NULL,
  tokens BIGINT NOT NULL,
  request_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp'
);

/*+ query_by_all(userid) */
TotalUserTokens := SELECT userid, sum(tokens) AS total_tokens,
                          count(tokens) AS total_requests
                   FROM UserTokens GROUP BY userid;

UserTokensByTime(userid BIGINT NOT NULL, fromTime TIMESTAMP NOT NULL, toTime TIMESTAMP NOT NULL) :=
  SELECT * FROM UserTokens WHERE userid = :userid
    AND request_time >= :fromTime AND request_time < :toTime ORDER BY request_time DESC;

UsageAlert := SUBSCRIBE SELECT * FROM UserTokens WHERE tokens > 100000;

This script defines a sequence of tables. We introduce := as syntactic sugar for the verbose CREATE TEMPORARY VIEW syntax.
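To illustrate the shorthand, the TotalUserTokens definition above would read roughly as follows in plain FlinkSQL. This is a sketch of the equivalence only; the query_by_all hint is a DataSQRL extension and is left out here.

```sql
-- Plain FlinkSQL form of: TotalUserTokens := SELECT ...
CREATE TEMPORARY VIEW TotalUserTokens AS
SELECT userid,
       sum(tokens) AS total_tokens,
       count(tokens) AS total_requests
FROM UserTokens
GROUP BY userid;
```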

The UserTokens table does not have a configured connector, which means we treat it as an API mutation endpoint connected to Flink via a Kafka topic that captures the events. This makes it easy to build APIs that capture user activity, transactions, or other types of events.

Why Temporal Join is Stream Processing’s Superpower

· 8 min read
Matthias Broecheler
CEO of DataSQRL

Stream processing technologies like Apache Flink introduce a new type of data transformation that’s very powerful: the temporal join. Temporal joins add context to data streams while being efficient and fast to execute.

This article introduces the temporal join, compares it to the traditional inner join, explains when to use it, and why it is a secret superpower.
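As a rough sketch of what a temporal join looks like in Flink SQL, consider enriching a stream of orders with the currency rate that was valid when each order was placed. The Orders and CurrencyRates tables and their columns are hypothetical and used only for illustration.

```sql
-- Hypothetical tables, not from the article:
--   Orders:        append-only stream with event-time attribute order_time
--   CurrencyRates: versioned table with primary key currency and an
--                  event-time attribute (e.g. a changelog of rate updates)
-- Each order is enriched with the rate that was valid at order_time.
SELECT o.order_id,
       o.price,
       o.currency,
       r.conversion_rate,
       o.order_time
FROM Orders AS o
JOIN CurrencyRates FOR SYSTEM_TIME AS OF o.order_time AS r
  ON o.currency = r.currency;
```

The FOR SYSTEM_TIME AS OF clause is what pins the lookup to the order's event time rather than to whatever rate happens to be current.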

Let's Uplevel Our Database Game: Meet DataSQRL

· 5 min read
Matthias Broecheler
CEO of DataSQRL

We need to make it easier to build data-driven applications. Databases are great if all your application needs is storing and retrieving data. But if you want to build anything more interesting with data - like serving users recommendations based on the pages they are visiting, detecting fraudulent transactions on your site, or computing real-time features for your machine learning model - you end up building a ton of custom code and infrastructure around the database.

You need a queue like Kafka to hold your events, a stream processor like Flink to process data, a database like Postgres to store and query the result data, and an API layer to tie it all together.

And that’s just the price of admission. To get a functioning data layer, you need to make sure that all these components talk to each other and that data flows smoothly between them. Schema synchronization, data model tuning, index selection, query batching … all that fun stuff.

The point is, you need to do a ton of data plumbing if you want to build a data-driven application. All that data plumbing code is time-consuming to develop, hard to maintain, and expensive to operate.

We need to make building with data easier. That’s why we are sending out this call to action to uplevel our database game. Join us in figuring out how to simplify the data layer.

We have an idea to get us started: Meet DataSQRL.