Dataform Tips – Protect an Incremental Table from Accidental Full Refreshes

Incremental queries in Dataform always include two modes for updating the destination table: incremental & full refresh. However, sometimes, it’s necessary to implement safety measures against an accidental full refresh. The reasons for that include:

  1. Protecting the previously added data in the table in case it’s difficult or impossible to recover it later (for example in a daily snapshot table),
  2. Additional protection against a costly full refresh.

How can a full refresh happen by mistake?

Dataform workflow invocations target specific queries that will be executed. However, they can also include dependencies or dependents of the listed queries. The queries included through the dependency tree can extend the workflow to include many more queries than those explicitly listed.

dataform workflow execution options with full refresh enabled

Because of these query dependencies, it can sometimes be confusing what the final list of queries to be run will be. In the above workflow execution, any dependencies of the selected query would also run with full refresh.

This lack of visibility is one of the reasons that can cause accidents. The full refresh option in the workflow will apply to all included dependencies and dependents.

How do we prevent accidental full refreshes? The native way.

Thanks to Krisztián Korpa for pointing out that there’s a native method in Dataform to prevent accidental full refreshes.

You can include protected: true in your .sqlx file’s config, preventing the query from ever running in full refresh mode.

config {
  type: "incremental",
  protected: true
}

select
  current_datetime() as datetime

What it does is simply disregard the full refresh version of the query. There’s no error being thrown; instead, the query will just complete in incremental mode.

How do we prevent accidental full refreshes? The hacky way.

Another way to prevent accidental table rebuilds or even full table scan queries is to make the query error out in case it’s run in full refresh mode.

However, not just any error will do. Dataform executes query workflows as BigQuery procedures. The procedure includes both the incremental and full refresh versions of the query, which means that both need to at least somewhat look like valid SQL for the procedure to execute.

The query must fail in a way that is not yet detected when BigQuery evaluates the procedure.

The query below looks valid enough to pass the procedure evaluation. However, it will throw an error when the query job starts.

select
  current_datetime() as datetime,
  "break"=1 as break_non_incremental_query

We can implement it in two ways:

  1. As a pre_operation that will only prevent workflow executions,
  2. As a regular SQL statement inside the query, that will also prevent the preview version of the query from running.
config {
  type: "incremental"
}

js {
  const breakNonIncrementalQuery = (incremental, preOperation) => {
    if (!incremental) {
      const sql = '"break"=1';
      
      // define a BigQuery variable for the pre_operations block
      if (preOperation) {
        return `declare break_non_incremental_query default (
          select ${sql}
        )`;
      }
      
      // otherwise, just return a regular SQL formula
      return `${sql} as break_non_incremental_query`;
    }
    return '';
  };
}

select
  current_datetime() as datetime,
  -- implementing this will break all full refresh versions of the query
  ${breakNonIncrementalQuery(incremental())}

-- implementing this will only break full refresh workflow executions
pre_operations {
  ${breakNonIncrementalQuery(incremental(), true)}
}

The above .sqlx file contains a JS function that can do both.

We can now implement the function in the desired place to prevent accidental full refreshes.

Final thoughts

One of the lessons here is to be careful when declaring dependencies in Dataform. While dependencies are perhaps the greatest feature in Dataform, they can sometimes accidentally extend the full refresh to cover more tables than intended. That’s why it’s important to implement safety measures against accidental table rebuilds.

Have you encountered this issue in Dataform? What was your solution?

👋 Want to stay up to date on the latest blog posts? Follow me on LinkedIn.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top