Building a media performances summary with DBT and BigQuery

Rémi Peeters
Jun 22, 2024
6 min read

Advertisers are continuously spending more and more budget in advertising. According to Statista, the total industry ad spend will increase by 47% by 2026 compared to 2022. Moreover, there is an increasing complexity in the advertising strategies where marketeers now tend to diversify their investments using a wide variety of platforms.

This article aims at providing you with an introduction to the reconciliation of media and analytics data as well as highlighting some key attention points along the way. Today, advertisers have media data in different ad platforms and mainly do attribution using GA4. The challenge is to reconcile that data, which lives in different systems, and bring it together. There are multiple ways to achieve this:

The old Excel/Sheets manual way: manually export data from advertising platforms and manipulate it in Excel or Sheets.
All-in-one data visualization platforms: platforms such as Adverity or Funnel offer some ETL, data harmonization and visualization features.
Custom Solution: how to build a media summary using BigQuery as data warehouse and DBT as orchestration tool on top of it.

We will explore the third route which consists in a custom solution using BigQuery and DBT.

START WITH THE END IN MIND

The media summary can be visualized in many ways: an Excel report, a Looker Studio dashboard, etc. What does really matter is to consolidate the foundations by focusing on what is inside the summary, the actual contents. We recommend:

Making an exhaustive list of your data sources: this will give you a good overview of the connections you will need. This can also impact the overall project budget.
Listing the measurements required: get a good grip on the metrics to be ingested as well as those to be calculated (for instance, ratios). This implies aligning on a common definition with all stakeholders.
Listing the dimensions required: define the level of granularity needed. Having a report by channel requires less data and work than a report by ad set or creative. This can also impact the budget, specifically the one of the ETL tool.
Involving stakeholders: create a mockup of the final product with the key stakeholders. By doing so, a feeling of involvement will be created and adoption of the dashboard will be easier.

EXTRACT AND LOAD DATA INTO A WAREHOUSE

The first step in the process of creating the media summary is to extract and load the media data in a data warehouse. To do so, there are a myriad of solutions available on the market. In the end, we strongly advise comparing the pricing models and answering the following questions:

Can the ETL platform sync with my warehouse?
Do I need specific dimensions and metrics or should I retrieve them all? Can I create custom reports or is the extraction standard for everyone?
What is the granularity level I need? Do I have to stay at channel level or go deeper at campaign or ad levels?

Those questions are essentials as the answers will strongly impact the amount of data to collect. ETL vendors have specific pricing models and the price can vary.

To properly structure your warehouse, we suggest creating a dataset dedicated to being the landing zone of the data (for instance: “landing_zone_media”). There will be one (or multiple) table by data source ingested available in this landing zone.

As an example, assuming we want to import data from Facebook Ads, Google Ads and Snapchat Ads, we would use the BigQuery native connector for Google Ads and an ETL like Fivetran for Facebook and Snapchat Ads data:

BUILDING THE NECESSARY MODELS

Raw data ingested, the next step is to build the models for the media summary. What we usually see is people using scheduled queries usually running on a daily basis in order to create or replace an existing table that puts together data from different sources. This technique has some limitations:

Maintenance overhead: creating scheduled queries can lead to heavy maintenance should you need to adapt your code in some of them, especially due to dependancies.
Lineage: seeing dependencies between tables within BigQuery might not be the clearest option.

Therefore, our preferred solution is dbt (linked to github for version control). Plugged on top of your data warehouse, dbt helps in managing data transformations and having a clear view on lineage. Within dbt, for each data source, create:

A staging folder: staging models will query the source data and have the most basic transformations such as formatting numbers or dates. The model should be as close as possible to the source data.
An intermediary folder: intermediary models contain more advanced transformations that should be separated from the staging ones.
A media summary folder: the media summary is the final model that will be consumed and contains the extra customization required by the stakeholders.

Taking Google Ads as an example, this would be our recommended structure within dbt:

google_ads (root folder)
staging
stg_google_ads_base_metrics.sql
stg_google_ads_conversion_metrics.sql
intermediary
int_google_ads_base_metrics.sql
int_google_ads_conversion_metrics.sql
media_summary
google_ads_media_summary.sql

The staging models are pretty basic and are matching the source as closely as possible. For the base metrics, you could have:

SELECT

date, 
name AS campaign_name,
ad_group AS placement_name,
keyword AS ad_name, 
cost_micros AS media_cost,
impressions,
clicks

FROM {{ source('google_ads', 'campaigns_basic_metrics') }}

The final step consists in joining the different models together to create the media summary. To do so, create one last folder called “activation” or “media summary” with a model that joins data from the different media sources. The lineage looks like this:

TIP N°1: CREATE A REUSABLE APPROACH BETWEEN DIFFERENT DATA SOURCES

Ad platforms are most of the time structured the same way, using a three layers setup. Facebook’s lowest level would be the ads which belong to placement. Those placements are part of campaigns. The structure is the same for Google Ads where keywords are categorized in ad groups and those are part of campaigns.

To deal with this, we suggest keeping the same three layers setup but harmonizing the names between the sources. For instance:

campaign_name: highest level with campaigns from both sources.
placement_name: middle level containing placements from Facebook or ad groups from Google Ads.
ad_name: lowest level reuniting ads from Facebook and keywords from Google Ads.

Your media summary model would look like:

{{ config(

    materialized='table',
    partition_by={
      "field": "date",
      "data_type": "date",
      "granularity": "day"
    }

)}}

SELECT

	date,
	"Google" AS source_name, 
	campaign_name, 
	placement_name, 
	ad_name,
	media_cost,
	impressions,
	clicks

FROM {{ ref("google_ads_media_summary") }}
UNION ALL 
SELECT

	date, 
	"Facebook" AS source_name, 
	campaign_name, 
	placement_name, 
	ad_name,
	media_cost,
	impressions,
	clicks

FROM {{ ref("google_ads_media_summary") }}

TIP N°2: USE A STRUCTURED NAMING CONVENTION

You might need additional information in your final reporting. You can take advantage of the three layers setup in order to add any information required by the stakeholders.

At campaign level, describe the overall goal of the campaign. For a January Sales campaign, you could use “012024_sales_conversions” to indicate that the campaign has been made for the sales and the objective is to drive conversions.

At placement level, describe your targeting, who you are trying to reach. It is very likely that your placements will share different information based on the channel they are related to, but some will be similar.

Finally, at ad level, describe the format of the ad or the language to give additional details.

TIP N°3: BE STRICT ON THE USE OF UTMS

So far, we have seen how to connect media data between two different sources. But campaign performance does not stop at media data. Analytics data should also be included. Question: How do I connect media and analytics data?

In Google Ads, you can build the following report:

Date	Channel	Campaign	Impressions	Clicks
17/11/2023	Google	Campaign 1	100	50
17/11/2023	Google	Campaign 1	200	40
17/11/2023	Google	Campaign 1	300	90
Total	Google	Campaign 1	600	180

You can build the following report out of Google Analytics:

Date	utm_source	utm_medium	utm_campaign	Sessions	Conversions
17/11/2023	google	cpc	Campaign 1	45	10
17/11/2023	google	cpc	Campaign 1	39	15
17/11/2023	google	cpc	Campaign 1	80	25
Total	google	cpc	Campaign 1	164	50

A single report combining the two tables can be achieved by using the same entry in the campaign_name in Google Ads as in the utm_campaign of that campaign:

Date	Campaign	Impressions	Clicks	Sessions	Conversions
17/11/2023	Campaign 1	100	50	45	10
17/11/2023	Campaign 1	200	40	39	15
17/11/2023	Campaign 1	300	90	80	25
Total	Campaign 1	600	180	164	50

This can be achieved by using a left join in the media summary model.

TIP N°4: MANAGE ATTRIBUTION DIFFERENCES BETWEEN PLATFORMS

Attribution in media is a set of rules that states to which channel your conversions will be attributed. And it is an important piece to bear in mind when creating a media summary. When analyzing data from sources, you will probably see something like this:

Date	Channel	Campaign	Conversions
17/11/2023	google	Campaign 1	10
17/11/2023	google	Campaign 1	11
17/11/2023	google	Campaign 1	13
Total	google	Campaign 1	34

Date	Channel	Campaign	Conversions
17/11/2023	facebook	Campaign 1	14
17/11/2023	facebook	Campaign 1	10
17/11/2023	facebook	Campaign 1	12
Total	facebook	Campaign 1	36

Based on the channel you are using, conversions will be counted differently. Creating such a media summary would not be correct. By doing so, you are mixing conversions from platforms having different attribution models and conversion windows. To sum up, you have more conversions than what you should expect:

Date	Channel	Campaign	Conversions
17/11/2023	google	Campaign 1	14
17/11/2023	google	Campaign 1	10
17/11/2023	google	Campaign 1	12
17/11/2023	facebook	Campaign 1	14
17/11/2023	facebook	Campaign 1	10
17/11/2023	facebook	Campaign 1	12
Total		Campaign 1	70

To solve this limitation, two options are available: you can either code your own attribution model or use a third-party attribution platform that will take care of this for you. GA4 is a free option that uses a single attribution model to give credit to advertising channels. By using it, you avoid counting twice the same conversion and apply the same model to all sources:

Date	Channel	Campaign	Conversions
17/11/2023	Google Ads	Campaign 1	9
17/11/2023	Google Ads	Campaign 1	8
17/11/2023	Google Ads	Campaign 1	11
17/11/2023	Facebook	Campaign 1	12
17/11/2023	Facebook	Campaign 1	9
17/11/2023	Facebook	Campaign 1	13
Total		Campaign 1	61

MEDIA SUMMARY: FINAL OUTPUT EXAMPLE

Before anything, let’s add the missing piece of the overall architecture:

The dashboard below is a simple example of media summary that can be achieved quickly using bigquery and dbt. This one contains the most basic information needed to monitor media spent and revenue generated by campaigns as well as the possibility to apply filters.