Incremental queries in Dataform always include two modes for updating the destination table: incremental & full refresh. However, sometimes, it’s necessary to implement safety measures against an accidental full refresh. The reasons for that include:
- Protecting the previously added data in the table in case it’s difficult or impossible to recover it later (for example in a daily snapshot table),
- Additional protection against a costly full refresh.
How can a full refresh happen by mistake?
Dataform workflow invocations target specific queries that will be executed. However, they can also include dependencies or dependents of the listed queries. The queries included through the dependency tree can extend the workflow to include many more queries than those explicitly listed.
Because of these query dependencies, it can sometimes be confusing what the final list of queries to be run will be. In the above workflow execution, any dependencies of the selected query would also run with full refresh.
This lack of visibility is one of the reasons that can cause accidents. The full refresh option in the workflow will apply to all included dependencies and dependents.
How do we prevent accidental full refreshes? The native way.
Thanks to Krisztián Korpa for pointing out that there’s a native method in Dataform to prevent accidental full refreshes.
You can include protected: true in your .sqlx file’s config, preventing the query from ever running in full refresh mode.
config {
type: "incremental",
protected: true
}
select
current_datetime() as datetime
What it does is simply disregard the full refresh version of the query. There’s no error being thrown; instead, the query will just complete in incremental mode.
How do we prevent accidental full refreshes? The hacky way.
Another way to prevent accidental table rebuilds or even full table scan queries is to make the query error out in case it’s run in full refresh mode.
However, not just any error will do. Dataform executes query workflows as BigQuery procedures. The procedure includes both the incremental and full refresh versions of the query, which means that both need to at least somewhat look like valid SQL for the procedure to execute.
The query must fail in a way that is not yet detected when BigQuery evaluates the procedure.
The query below looks valid enough to pass the procedure evaluation. However, it will throw an error when the query job starts.
select
current_datetime() as datetime,
"break"=1 as break_non_incremental_query
We can implement it in two ways:
- As a pre_operation that will only prevent workflow executions,
- As a regular SQL statement inside the query, that will also prevent the preview version of the query from running.
config {
type: "incremental"
}
js {
const breakNonIncrementalQuery = (incremental, preOperation) => {
if (!incremental) {
const sql = '"break"=1';
// define a BigQuery variable for the pre_operations block
if (preOperation) {
return `declare break_non_incremental_query default (
select ${sql}
)`;
}
// otherwise, just return a regular SQL formula
return `${sql} as break_non_incremental_query`;
}
return '';
};
}
select
current_datetime() as datetime,
-- implementing this will break all full refresh versions of the query
${breakNonIncrementalQuery(incremental())}
-- implementing this will only break full refresh workflow executions
pre_operations {
${breakNonIncrementalQuery(incremental(), true)}
}
The above .sqlx file contains a JS function that can do both.
We can now implement the function in the desired place to prevent accidental full refreshes.
Final thoughts
One of the lessons here is to be careful when declaring dependencies in Dataform. While dependencies are perhaps the greatest feature in Dataform, they can sometimes accidentally extend the full refresh to cover more tables than intended. That’s why it’s important to implement safety measures against accidental table rebuilds.
Have you encountered this issue in Dataform? What was your solution?