During my first weeks at Human37, I had the opportunity to discover a plethora of data-related tools. Today, I want to highlight one of them in particular: dbt (data build tool).
Introduction and definition
There are two ways to use dbt:
A CLI (command line interface) that you run from your terminal (free and open source);
A hosted platform with an integrated development environment (IDE), namely dbt Cloud (subject to costs).
In my role at Human37, I have only used dbt Cloud, so this article focuses on that version.
dbt Cloud is a data engineering tool. It offers a development framework that uses modular SQL to transform your data, centralize your code, and collaborate with your team on a single source of truth. Several other features are available, such as:
Running SQL against several data platforms;
Easy testing and documentation;
Version control through Git.
The purpose of this article is to share four pieces of advice (based on my personal experience) to ease your first experience with dbt Cloud.
Take the dbt Fundamentals course
First and foremost, the dbt Fundamentals course is a must-do. As more and more SaaS companies now offer onboarding programs, these courses have become an excellent first step into a new tool, and dbt is no exception. The course walks you through the main concepts (models, sources, tests, etc.) along with practical exercises, and gives you a general overview of what you can do with dbt Cloud. The documentation tied to the course is also well explained, so it is pretty easy to progress. For my part, I managed to complete it on a sunny day in Brussels (yes, it is possible). More advanced courses are also available. Oh, and it's free, so I guess you have no more excuses.
‘DBT is basically SQL’
My second piece of advice is to have a basic understanding of SQL. 'DBT is basically SQL'. That is what you can read in this Medium article, and after several weeks of use, I couldn't agree more. A significant part of your time on the platform will consist of writing SQL queries to create your models and transform your data. Even though it is not strictly required, my 'academic' experience with SQL gave me an edge when I started: since I spent less time learning how to formulate a proper query, I was able to focus on the core concepts. Let me reassure you, you don't have to be an SQL expert to start working with dbt, but some knowledge of the language is extremely beneficial if you want to leverage dbt's full potential. There are plenty of good introductory SQL courses online, so pick the one you prefer. Similarly, a basic understanding of relational databases and Git is useful for following best practices.
Adopting best practices
My next piece of advice: adopt best practices as early as possible. It might seem straightforward when you go through the fundamentals course, but it is harder than it sounds.
A first core feature of dbt that I really enjoy is the modularity of models, which offers flexibility and variety in use. There are several ways to take advantage of it. For my part, I like to use different transformation layers. For example, I usually create staging models as a first layer to clean my raw data. In other words, I use them to make 'basic' changes that I want applied in any (future) use case. Then, I can easily recombine the models depending on the focus of a given project. It is also a good way to separate the different steps of your data transformations, which improves the clarity and readability of your work. Moreover, dbt's lineage feature lets you visualize the dependencies between models when you select one.
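As a minimal sketch of this layered approach (the model, source, and column names here are hypothetical), a staging model applies the 'basic' cleaning once so that every downstream model can build on the same clean base:

```sql
-- models/staging/stg_customers.sql (hypothetical example)
-- Staging layer: rename columns and normalize values once,
-- so these changes apply to every future use case.
select
    id as customer_id,
    lower(email) as email,
    created_at as signup_at
from {{ source('crm', 'raw_customers') }}
```

Downstream models can then reference it with {{ ref('stg_customers') }} instead of the raw table, which is also how dbt builds the lineage graph between models.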
A second best practice is to get used to testing your data. You can run tests on your models by simply setting up a YAML file (such as 'source.yml'). I particularly use the unique and not_null tests, as they are critical for the cases I am dealing with at the moment. A straightforward example: on a (let's say) customer model, test that each customer (i.e. each entry) has an id (not_null test). Similarly, one could check that there is no redundancy in the model, which could potentially be a problem (unique test). As a whole, the ease of implementing and running these tests can be extremely beneficial for the robustness of your models.
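As a minimal sketch (the model and column names are hypothetical), both tests are declared in a YAML file alongside your models:

```yaml
# models/staging/schema.yml (hypothetical example)
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - not_null   # every customer must have an id
          - unique     # no duplicate customer ids
```

Running dbt test then executes both checks against the model and reports any failing rows.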
Last but not least, the documentation. Documenting your work is extremely valuable, particularly in data-oriented projects. It helps you keep track of how you structured and transformed your data, contextualizing your entire work. In the end, it ensures that your work can be understood by any other member of your team (including new members during onboarding). With dbt's documentation tool, the creation process is eased by bringing together the coding and the documentation work (often treated as separate processes). With a command as simple as dbt docs generate, you can create an updated version of your documentation at any moment. You can also document your models manually in a YAML file living in your project.
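As a small sketch (the names are hypothetical), these manual descriptions live in the same kind of YAML file as your tests:

```yaml
# models/staging/schema.yml (hypothetical example)
version: 2

models:
  - name: stg_customers
    description: "One row per customer, cleaned from the raw CRM export."
    columns:
      - name: customer_id
        description: "Unique identifier of the customer."
```

Running dbt docs generate then compiles these descriptions, together with the lineage graph, into a browsable documentation site.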
dbt might be one of the most interesting tools I have used since I started working at Human37. I hope these four pieces of advice will help you get started with the tool and enjoy it as much as I do.