If you’re working with dbt and find yourself copying the same column descriptions across multiple models, this post is for you. We’ll show you how to eliminate that repetition using a simple but powerful technique!
The need to create common column descriptions
At FreeAgent, we process a lot of event data flowing into our data platform. While each event is unique, they all share common elements, such as where the event originated or the time it was emitted. With well over 100 of these event models, that adds up to hundreds of duplicate descriptions to maintain. When working on our data pipelines in Dagster and dbt, we wanted a simple way to define these common attributes once, rather than manually rewriting or copying them across every event-based model. Repeating ourselves isn’t just tedious; it also creates a high risk of inconsistencies and makes updating descriptions a laborious task.
Using Jinja docs to create re-usable descriptions
We found an elegant solution by combining dbt’s doc function, which is called from Jinja, with a dedicated docs.md file. This approach allows us to centralise our common column descriptions and reuse them.
Let’s look at how we implemented this for a column called data_source. This column is present on all our event models and indicates whether an event came from our desktop app, mobile app, or another source.
1. Create a docs.md File
First, we created a docs.md file within our dbt project’s models folder. This file serves as a single repository for all our reusable documentation snippets.
For instance, to describe our data_source column, we added the following:
{% docs event_data_source %}
The source of the event that was emitted, e.g. DESKTOP, MOBILE_WEB, etc
{% enddocs %}
In this snippet:
- {% docs event_data_source %} and {% enddocs %} define a Jinja documentation block.
- event_data_source is a unique identifier for this particular documentation snippet.
- The text within the block is the actual description we want to reuse.
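As a sketch of how the same file might grow, a second block can sit alongside the first. The block name event_emitted_at and its wording are hypothetical, chosen only to illustrate the "time it was emitted" attribute mentioned earlier; the top-level Jinja comment is there purely to flag that assumption.

```jinja
{% docs event_data_source %}
The source of the event that was emitted, e.g. DESKTOP, MOBILE_WEB, etc
{% enddocs %}

{# Hypothetical second block: the name and wording below are illustrative assumptions #}
{% docs event_emitted_at %}
The time at which the event was emitted.
{% enddocs %}
```

Each block needs its own unique name, because that name is what the doc function looks up.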
2. Use the doc Function in dbt Models
Now, whenever we define a column that needs this description, we simply reference it using the doc function in our dbt model’s YAML configuration:
- name: data_source
  description: '{{ doc("event_data_source") }}'
  data_type: varchar(20)
And that’s it! When dbt compiles our models, it will pull the description associated with event_data_source from docs.md and insert it directly into the model’s metadata.
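To make the reuse concrete, here is a minimal sketch of a properties file in which two event models point at the same docs block. The file path and both model names are hypothetical placeholders rather than our real models; only the data_source column and the event_data_source block come from the example above.

```yaml
# models/events/schema.yml (hypothetical path)
version: 2

models:
  - name: event_invoice_created          # hypothetical model name
    columns:
      - name: data_source
        # Resolved from the event_data_source block in docs.md
        description: '{{ doc("event_data_source") }}'
        data_type: varchar(20)

  - name: event_bank_feed_connected      # hypothetical model name
    columns:
      - name: data_source
        # Same reference, so both models share one description
        description: '{{ doc("event_data_source") }}'
        data_type: varchar(20)
```

After running dbt docs generate, both columns should show the same description. Editing the block in docs.md updates both without touching this file.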
Benefits of this approach
Implementing common descriptions for columns used across multiple models has brought several advantages:
- DRY Principle Adherence: We’ve eliminated redundant documentation efforts. Where appropriate, descriptions are defined once and reused across every model that needs them, reducing repetition.
- Enhanced Consistency: With a single source of truth for common descriptions, we ensure that all mentions of a specific data element are described identically.
- Time Savings: Data engineers no longer need to manually type or copy-paste descriptions. This saves time during model development and iteration, and reduces the likelihood of errors.
- Easier Maintenance: If a common description needs to be updated, we only need to modify it in one place (docs.md). This change then propagates automatically to all models using that doc reference.
- Improved Discoverability: Centralising common terms in docs.md can also serve as a useful reference point for understanding frequently used concepts within our data landscape.
Summary
By combining dbt’s doc function with a simple docs.md file, we’ve streamlined our dbt documentation process. This approach makes our documentation more efficient, consistent, and maintainable. It’s a simple change, but it has a significant impact: not only does it allow our team to focus more on data transformation and less on repetitive documentation tasks, but it also builds confidence for our data consumers. When they explore our data through Dagster’s asset catalog, they find consistent, unified descriptions for common elements across all our models, making it much easier to understand what the data truly represents.
What’s your experience with this approach? Have you found other creative ways to centralise dbt documentation? We’d love to hear your approaches!