Dataform Tips – Managing Column Descriptions

Many of us consider writing documentation a bit of a dull task. Organizations make it even worse by hoarding the documentation in centralized but isolated systems such as Confluence. This approach can easily lead to a situation where writing documentation becomes a project of its own. 

Ideally, the documentation would evolve continuously whenever the queries behind the views and tables are updated. For this reason, the column descriptions in the Dataform .sqlx configuration are super useful, as they allow updating the documentation in the same place where the query logic is defined. The final descriptions will be visible in the BigQuery table itself.

In this post, I’ll share one approach to making documentation management a bit easier using a separate documentation file. The same file can be referenced across the queries in the Dataform project, thus avoiding describing similar columns repeatedly.

Dataform column descriptions

Let’s start with an example query with the documentation defined in the columns object of the config block.

config {
    type: "table",
    description: "Example 1",
    schema: "blog_dataform_documentation",
    columns: {
        date: 'The date of the event',
        event_name: 'Name of the event, such as "purchase"',
        items: {
            description: 'An array containing all ecommerce products related to the event',
            columns: {
                item_name: 'Ecommerce product name',
                item_brand: 'Ecommerce product brand',
                item_variant: 'Additional variant information about the ecommerce product',
                quantity: 'The quantity of ecommerce products in the event, such as purchase'
            }
        }
    }
}

select
  current_date() as date,
  'purchase' as event_name,
  array(
    (select as struct 'iPhone 16' as item_name, 'Apple' as item_brand, '128 Gb' as item_variant, 1 as quantity)
  ) as items

union all

select
  current_date() as date,
  'purchase' as event_name,
  array(
    (select as struct 'Pixel 9' as item_name, 'Google' as item_brand, '256 Gb' as item_variant, 2 as quantity)
  ) as items

The example configuration writes the descriptions for the nested items like this:

But what if you want to reference some of the same definitions in another .sqlx file?

You can accomplish this easily by moving the descriptions into a separate .js file, such as documentation.js. Save the documentation.js file under the includes folder in Dataform.

const columns = {
    date: 'The date of the event',
    event_name: 'Name of the event, such as "purchase"',
    items: {
        description: 'An array containing all ecommerce products related to the event',
        columns: {
            item_name: 'Ecommerce product name',
            item_brand: 'Ecommerce product brand',
            item_variant: 'Additional variant information about the ecommerce product',
            quantity: 'The quantity of ecommerce products in the event, such as purchase'
        }
    }
};

module.exports = {
    columns
}

Now, we can reference the same descriptions across all queries inside the Dataform repository. It doesn’t matter if all of the fields defined in documentation.js don’t exist in the .sqlx file referencing them. Dataform will just ignore the unnecessary ones.

Below is another example query referencing the common documentation.js file for column descriptions.

config {
    type: "table",
    description: "Example 2",
    schema: "blog_dataform_documentation",
    columns: documentation.columns
}

select
  date,
  item.item_brand,
  item.item_name,
  sum(quantity) as items_sold
from
  ${ref('example1')},
  unnest(items) as item
group by
  date,
  item_brand,
  item_name

Documentation function

There’s still one issue. What if the same field names that exist inside nested structs also appear as individual column names, like in the above example?

For example, the Google Analytics 4 BigQuery export includes an items array containing individual items that have the item_name attribute. If you work with GA4 ecommerce data, you’ll likely have a flat table that contains the same item_name attribute as the name of an individual column. In the Dataform column descriptions, these columns need to be documented separately.

I’ve defined a documentation function to avoid having to write multiple versions of the same column descriptions. The function takes the nested record descriptions and automatically includes flattened versions of each key in the final documentation object. Naturally, this means that the field name should mean the same regardless of if it’s nested or not.

The function also allows splitting the documentation into multiple objects to make it more human-readable and thus easier to maintain.

const eventColumns = {
    date: 'The date of the event',
    event_name: 'Name of the event, such as "purchase"'
};

const ecomColumns = {
    items: {
        description: 'An array containing all ecommerce products related to the event',
        columns: {
            item_name: 'Ecommerce product name',
            item_brand: 'Ecommerce product brand',
            item_variant: 'Additional variant information about the ecommerce product',
            quantity: 'The quantity of ecommerce products in the event, such as purchase'
        }
    },
    items_sold: 'Total number of items sold'
};

/**
 * Return one object that contains all column descriptions.
 * https://cloud.google.com/dataform/docs/document-tables
 
 * Args:
 *   documentationObjects: An array of objects that contains column descriptions in Dataform format. Data from multiple objects will be merged into one object.
 *   flatten: If set to true, all nested column descriptions will also be added as top-level column descriptions. Useful if the same columns are used in unnested flat tables.

 * Returns:
 *   An object to be used for setting column descriptions in dataform .sqlx files.
 */

const getColumnDescriptions = (documentationObjects, flatten) => {
    const columns = {};

    // extract data from the source object and save it in the final columns object
    const setObjKeys = (sourceObj) => {
        // https://stackoverflow.com/questions/8085004/iterate-through-nested-javascript-objects
        const iterateNestedColumns = (obj) => {
            const stack = [obj];
            while (stack?.length > 0) {
                const currentObj = stack.pop();
                Object.keys(currentObj).forEach(key => {
                    if (typeof currentObj[key] === 'string' && key !== 'description') {
                        columns[key] = currentObj[key];
                    }

                    if (typeof currentObj[key] === 'object' && currentObj[key] !== null) {
                        stack.push(currentObj[key]);
                    }
                });
            }
        };

        Object.entries(sourceObj).forEach(e => {
            // copy the column description to the common object
            columns[e[0]] = e[1];
            // include nested keys also as top-level flattened keys
            if (flatten && typeof e[1] === 'object') {
                iterateNestedColumns(e[1]);
            }
        });
    };

    documentationObjects.forEach(o => {
        setObjKeys(o);
    });

    // sort the object keys alphabetically
    return Object.keys(columns).sort().reduce(
        (obj, key) => {
            obj[key] = columns[key];
            return obj;
        }, {}
    );
};

// merge all column description objects into one
const columns = getColumnDescriptions([
    eventColumns,
    ecomColumns
], true);

module.exports = {
    columns
}

The updated version of documentation.js now covers the descriptions for both example queries. Of course, in the real world, with references to dozens of tables, the file will get bloated. That’s why it can be useful to organize the descriptions into separate objects based on the context.

The final object that the queries will reference looks like this:

{
    "date": "The date of the event",
    "event_name": "Name of the event, such as \"purchase\"",
    "item_brand": "Ecommerce product brand",
    "item_name": "Ecommerce product name",
    "item_variant": "Additional variant information about the ecommerce product",
    "items": {
        "description": "An array containing all ecommerce products related to the event",
        "columns": {
            "item_name": "Ecommerce product name",
            "item_brand": "Ecommerce product brand",
            "item_variant": "Additional variant information about the ecommerce product",
            "quantity": "The quantity of ecommerce products in the event, such as purchase"
        }
    },
    "items_sold": "Total number of items sold",
    "quantity": "The quantity of ecommerce products in the event, such as purchase"
}

Final words

It’s always nice to get access to a well-documented dataset. However, it happens too rarely, which is why I think any small improvements to the process are worth considering.

Do you have any tips on how to manage documentation? Please share in the comments!

👋 Want to stay up to date on the latest blog posts? Follow me on LinkedIn.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top