How-To Guides

Practical guides for developing data pipelines with DataSQRL

Project Structure

We recommend the following project structure to support testing and deploying to multiple environments:

project-root/
├── sources-prod/                  <-- contains source connectors for prod
├── sources-testdata/              <-- contains test data
├── snapshots/                     <-- snapshots for test cases, generated by DataSQRL
├── tests/                         <-- (optional) GraphQL test queries
├── components.sqrl                <-- table definitions imported into main script
├── mainscript.sqrl                <-- main SQRL script
├── mainscript_package_prod.json   <-- configuration for prod
└── mainscript_package_test.json   <-- configuration for testing
  • Create one folder for each collection of data sources. Sources that represent the same type of data but target different environments (test vs. prod) share the same prefix (e.g. sources-prod and sources-testdata).
  • Create one package.json configuration file per environment. Each file references the same main script but maps the data sources differently in the dependencies section of the configuration and (optionally) uses different engines and configurations; see the sketch after this list.
  • By default, DataSQRL uses the tests and snapshots directories. If you have multiple test suites or run the same tests with different sources, append a distinguishing suffix (e.g. -api or -regression) to both directory names and specify them explicitly in the configuration file or via compiler flags.
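
For illustration, a per-environment configuration could look like the sketch below. The dependencies section aliases a logical source name to a concrete folder; the exact configuration keys vary by DataSQRL version, so treat the structure as an assumption and consult the configuration reference for the precise syntax.

mainscript_package_test.json (illustrative sketch):

{
  "script": {
    "main": "mainscript.sqrl"
  },
  "dependencies": {
    "sources": "sources-testdata"
  }
}

The prod configuration would be identical except that the sources alias points at sources-prod, so the main script can import from sources.* without knowing which environment it runs in.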

Testing

DataSQRL supports running automated tests for your SQRL pipeline: annotate test cases with the /*+test */ hint or place test queries in the tests folder (or any other folder passed via the --tests command-line option).
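
For example, a test case is an ordinary table definition annotated with the hint; the table and column names below are hypothetical:

/*+test */
OrderCountTest := SELECT customerid, COUNT(*) AS num_orders
                  FROM Orders GROUP BY customerid;

When the tests run, DataSQRL executes the query and compares the result against the snapshot stored in the snapshots directory.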

The best practice for writing test cases is to modularize your sources so that you can dynamically link different sources for local development, testing, and production. In many cases, you can use the same sources for testing and local development in a single folder.

Test data should contain explicit event timestamps for all records, which makes test cases completely deterministic. Explicit timestamps also let you reproduce production failures as local test cases: capture the data that caused the failure, keep its original timestamps, and you no longer have to externally simulate the sequence of events that triggered the failure in the first place.
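
For example, each record in a test-data file could carry its own event timestamp; the field names below are hypothetical:

{"orderid": 1001, "customerid": 5, "price": 34.99, "event_time": "2024-03-07T10:15:30Z"}
{"orderid": 1002, "customerid": 7, "price": 17.50, "event_time": "2024-03-07T10:16:02Z"}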

In addition, this practice lets you build up a repository of failures and edge cases that runs automatically to catch regressions.

Script Imports

If your main script gets too big, or you want to reuse table definitions across multiple scripts, move the definitions to a separate SQRL script and import it into the main script.

Inline Script Imports

Inline imports place the table and function definitions from another script into the current scope and require that table and function names do not clash with those in the importing script.

IMPORT myscript.*;

This statement imports all tables and functions from the SQRL script myscript.sqrl in the same folder.
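
For instance, myscript.sqrl might contain a shared table definition like the following hypothetical one, which becomes available in the importing script under its original name:

-- myscript.sqrl
SpendingByCustomer := SELECT customerid, SUM(price) AS total_spend
                      FROM Orders GROUP BY customerid;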

Data Discovery

DataSQRL automatically generates table definitions with connector configuration and schemas for JSON Lines files (extension .jsonl) and CSV files (extension .csv) within the project directory. This makes it easy to import data from such files into a SQRL project.

For example, to import data from a file orders.jsonl in the folder mydata, you write:

IMPORT mydata.orders;

When you run the compiler, it reads the file, auto-discovers the schema, and creates the table configuration file orders.table.sql, which you can then import like any other source.
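
Since the generated file has a .sql extension, it contains a SQL table definition. The sketch below shows roughly what such a definition could look like for the orders example; the actual columns and connector options are derived by the compiler from your data, so treat every detail here as an assumption:

CREATE TABLE orders (
  orderid BIGINT,
  customerid BIGINT,
  price DOUBLE,
  event_time TIMESTAMP_LTZ(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 'mydata/orders.jsonl',
  'format' = 'json'
);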

To disable automatic discovery of data for a directory, place a file called .nodiscovery into that directory.

Manual Subgraph Elimination with Noop Function

Sometimes the Flink optimizer is too smart for its own good and pushes down predicates in a way that makes common-subgraph identification impossible. This can result in much larger job graphs, poor performance, or high state-maintenance costs.

To inhibit predicate pushdown, SQRL provides the noop function, which takes an arbitrary list of arguments and always returns true. The function serves no purpose other than preventing the optimizer from pushing predicates past it.
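
For example, wrapping a column in noop prevents the two downstream filters from being pushed into the shared base table, so the optimizer keeps a single common subgraph; the table and column names below are hypothetical:

OrdersShared := SELECT * FROM Orders WHERE noop(orderid);
HighValue := SELECT * FROM OrdersShared WHERE price > 100;
LowValue := SELECT * FROM OrdersShared WHERE price <= 100;

Since noop always returns true, OrdersShared contains the same rows as Orders, but the optimizer can no longer push the price predicates past the noop call.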