0.10 Release: Iceberg Mutations

3 min read
Ferenc Csaky
Apache Flink PMC
Matthias Broecheler
CEO of DataSQRL

DataSQRL 0.10 has been released, and the headline feature is mutation support for Iceberg tables. DataSQRL can now manage Apache Iceberg tables as sources and sinks.

Why is that a big deal? Up to this point, DataSQRL could read and write to Apache Iceberg tables, but you had to manage them explicitly. This new release makes it easy to share data through Apache Iceberg between DataSQRL pipelines.

Data Fast and Slow

Let's back up a bit. Before 0.10 you could create tables in DataSQRL like this:

CREATE TABLE Clickstream (
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);

If you don't specify a connector explicitly, DataSQRL will manage the table for you, create the Kafka topics, and wire everything up. This allows you to expose mutation endpoints in GraphQL that store mutation events in Kafka and process them in Flink, making it easy to build event-driven microservices.

Kafka is great if you need data fast – in milliseconds. But it requires a separate Kafka cluster, which is costly and challenging to maintain. For many data use cases, you don't need data that quickly. Minutes and hours are just fine.

That's where Apache Iceberg shines. It's a table format for sharing data that does not require a separate data system to operate. All you need is local or cloud storage.

With the 0.10 release, DataSQRL supports Apache Iceberg for managed tables. Use Kafka for the fast data and Iceberg for the slow data. This allows you to balance speed with operational simplicity and cost.

To make this possible, DataSQRL 0.10 introduces a breaking change: you must explicitly annotate where the tables you create should be persisted, iceberg or kafka, using the engine SQL hint:

/*+ engine(iceberg) */
CREATE TABLE Clickstream (
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);

That's it. When migrating existing DataSQRL projects to 0.10, make sure to annotate any managed tables with /*+ engine(kafka) */ so they keep their current Kafka-backed behavior.
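For instance, the Clickstream table from earlier would migrate like this – a sketch that simply adds the kafka hint to the existing definition:

```sql
-- Existing managed table, now explicitly pinned to Kafka
/*+ engine(kafka) */
CREATE TABLE Clickstream (
  user_id STRING,
  event_time TIMESTAMP_LTZ(3) NOT NULL METADATA FROM 'timestamp',
  ad_id STRING,
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
);
```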

There are a lot more goodies in the 0.10 release. Check out the complete changelog for details. Shout out to Ferenc for driving this release and to our newest team members Wellington and Mate for making their first contributions to DataSQRL. Thank you!

What's Next?

We are working steadily toward the 1.0 release of DataSQRL. Most of the big features are in, and we are primarily focusing on hardening what is there: additional test coverage, handling edge cases, and extending support.

One area of work that's particularly interesting is partition strategy optimization. We are extending support for statistics to make that happen. More on this soon.