Rémi Peeters

Building a media performance summary with dbt and BigQuery

Advertisers keep increasing their advertising budgets. According to Statista, total industry ad spend will grow by 47% between 2022 and 2026. Moreover, advertising strategies are becoming more complex: marketers now tend to diversify their investments across a wide variety of platforms.


This article aims to provide an introduction to the reconciliation of media and analytics data, and to highlight some key attention points along the way. Today, advertisers hold media data in different ad platforms and mainly rely on GA4 for attribution. The challenge is to reconcile that data, which lives in different systems, and bring it together. There are multiple ways to achieve this:


  1. The old Excel/Sheets manual way: manually export data from advertising platforms and manipulate it in Excel or Sheets.

  2. All-in-one data visualization platforms: platforms such as Adverity or Funnel offer some ETL, data harmonization and visualization features.

  3. Custom solution: build a media summary using BigQuery as the data warehouse and dbt as the transformation and orchestration tool on top of it.


We will explore the third route: a custom solution using BigQuery and dbt.


START WITH THE END IN MIND


The media summary can be visualized in many ways: an Excel report, a Looker Studio dashboard, etc. What really matters is to consolidate the foundations by focusing on the actual contents of the summary. We recommend:


  1. Making an exhaustive list of your data sources: this will give you a good overview of the connections you will need. This can also impact the overall project budget. 

  2. Listing the measurements required: get a good grip on the metrics to be ingested as well as those to be calculated (for instance, ratios). This implies aligning on common definitions with all stakeholders.

  3. Listing the dimensions required: define the level of granularity needed. Having a report by channel requires less data and work than a report by ad set or creative. This can also impact the budget, specifically the one of the ETL tool. 

  4. Involving stakeholders: create a mockup of the final product with the key stakeholders. This creates a feeling of involvement and makes adoption of the dashboard easier.


EXTRACT AND LOAD DATA INTO A WAREHOUSE


The first step in the process of creating the media summary is to extract and load the media data into a data warehouse. A myriad of solutions is available on the market for this. We strongly advise comparing their pricing models and answering the following questions:


  • Can the ETL platform sync with my warehouse? 

  • Do I need specific dimensions and metrics or should I retrieve them all? Can I create custom reports or is the extraction standard for everyone? 

  • What is the granularity level I need? Do I have to stay at channel level or go deeper at campaign or ad levels? 


Those questions are essential, as the answers will strongly impact the amount of data to collect. ETL vendors have specific pricing models, and the price can vary accordingly.


To properly structure your warehouse, we suggest creating a dataset dedicated to being the landing zone of the data (for instance: “landing_zone_media”). This landing zone will contain one (or multiple) tables per ingested data source.
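For instance, the landing dataset can be created directly in BigQuery SQL (the dataset name, location and description below are example values to adapt to your setup):

-- Create a dataset acting as the landing zone for raw media data
CREATE SCHEMA IF NOT EXISTS `landing_zone_media`
OPTIONS (
  location = 'EU',  -- example location
  description = 'Landing zone for raw media data ingested from the ad platforms'
);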


As an example, assuming we want to import data from Facebook Ads, Google Ads and Snapchat Ads, we would use the native BigQuery connector (the BigQuery Data Transfer Service) for Google Ads and an ETL tool like Fivetran for the Facebook Ads and Snapchat Ads data:


BUILDING THE NECESSARY MODELS


With the raw data ingested, the next step is to build the models for the media summary. A common approach is to use scheduled queries, typically running daily, to create or replace a table that combines data from the different sources. This technique has some limitations:


  • Maintenance overhead: a collection of scheduled queries can become hard to maintain when you need to adapt the code in some of them, especially because of dependencies between queries.

  • Lineage: BigQuery offers no clear way to visualize the dependencies between tables.


Therefore, our preferred solution is dbt (linked to GitHub for version control). Plugged on top of your data warehouse, dbt helps manage data transformations and gives a clear view of lineage. Within dbt, for each data source, create:


  • A staging folder: staging models will query the source data and have the most basic transformations such as formatting numbers or dates. The model should be as close as possible to the source data. 

  • An intermediary folder: intermediary models contain more advanced transformations that should be separated from the staging ones.

  • A media summary folder: the media summary is the final model that will be consumed and contains the extra customization required by the stakeholders.


Taking Google Ads as an example, this would be our recommended structure within dbt:


  • google_ads (root folder)
    • staging
      • stg_google_ads_base_metrics.sql
      • stg_google_ads_conversion_metrics.sql
    • intermediary
      • int_google_ads_base_metrics.sql
      • int_google_ads_conversion_metrics.sql
    • media_summary
      • google_ads_media_summary.sql
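The staging models reference the raw landing tables through dbt's source() function; for that reference to resolve, the tables must be declared in a sources file. A minimal sketch, assuming the landing dataset from earlier (the file name and path are arbitrary):

# google_ads/staging/_google_ads__sources.yml
version: 2

sources:
  - name: google_ads
    schema: landing_zone_media  # the BigQuery dataset acting as landing zone
    tables:
      - name: campaigns_basic_metrics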


The staging models are pretty basic and match the source as closely as possible. For the base metrics, you could have:


SELECT
  date,
  name AS campaign_name,
  ad_group AS placement_name,
  keyword AS ad_name,
  cost_micros / 1000000 AS media_cost,  -- Google Ads reports cost in micros
  impressions,
  clicks
FROM {{ source('google_ads', 'campaigns_basic_metrics') }}
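An intermediary model then builds on top of the staging layer. A minimal sketch of int_google_ads_base_metrics.sql, assuming the only transformation needed is an aggregation to one row per day, campaign, placement and ad (your own logic will depend on the source):

-- Aggregate staging data to one row per date / campaign / placement / ad
SELECT
  date,
  campaign_name,
  placement_name,
  ad_name,
  SUM(media_cost) AS media_cost,
  SUM(impressions) AS impressions,
  SUM(clicks) AS clicks
FROM {{ ref('stg_google_ads_base_metrics') }}
GROUP BY date, campaign_name, placement_name, ad_name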

The final step consists of joining the different models together to create the media summary. To do so, create one last folder called “activation” or “media summary” with a model that joins data from the different media sources. The lineage looks like this:



TIP N°1: CREATE A REUSABLE APPROACH BETWEEN DIFFERENT DATA SOURCES


Ad platforms are most of the time structured the same way, using a three-layer setup. On Facebook, the lowest level is the ad, which belongs to a placement; those placements are part of campaigns. The structure is the same in Google Ads, where keywords are grouped into ad groups, which in turn belong to campaigns.


To deal with this, we suggest keeping the same three-layer setup but harmonizing the names between the sources. For instance:


  • campaign_name: highest level with campaigns from both sources.

  • placement_name: middle level containing placements from Facebook or ad groups from Google Ads. 

  • ad_name: lowest level combining ads from Facebook and keywords from Google Ads.


Your media summary model would look like: 


{{ config(
    materialized='table',
    partition_by={
      "field": "date",
      "data_type": "date",
      "granularity": "day"
    }
) }}

SELECT
  date,
  "Google" AS source_name,
  campaign_name,
  placement_name,
  ad_name,
  media_cost,
  impressions,
  clicks
FROM {{ ref("google_ads_media_summary") }}

UNION ALL

SELECT
  date,
  "Facebook" AS source_name,
  campaign_name,
  placement_name,
  ad_name,
  media_cost,
  impressions,
  clicks
FROM {{ ref("facebook_ads_media_summary") }}
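As an alternative to writing the UNION ALL by hand, the union_relations macro from the dbt_utils package can generate it once the per-source summary models share the same columns. A minimal sketch, assuming dbt_utils is installed:

{{ config(materialized='table') }}

{{ dbt_utils.union_relations(
    relations=[
      ref('google_ads_media_summary'),
      ref('facebook_ads_media_summary')
    ]
) }}

The macro also adds a _dbt_source_relation column identifying the origin of each row, which can replace the hard-coded source_name.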

TIP N°2: USE A STRUCTURED NAMING CONVENTION


You might need additional information in your final reporting. You can take advantage of the three layers setup in order to add any information required by the stakeholders. 


At campaign level, describe the overall goal of the campaign. For a January sales campaign, you could use “012024_sales_conversions” to indicate that the campaign was created for the sales period and that the objective is to drive conversions.


At placement level, describe your targeting: who you are trying to reach. Your placements will likely carry different information depending on the channel they relate to, but some elements will be shared.


Finally, at ad level, describe the format of the ad or the language to give additional details.
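A structured convention also pays off downstream, because each part of the name can be parsed back into its own column. A minimal sketch in BigQuery SQL, assuming an underscore-separated pattern like “012024_sales_conversions” (the model and column names are illustrative):

SELECT
  campaign_name,
  -- Parse the underscore-separated convention into dedicated columns
  SPLIT(campaign_name, '_')[SAFE_OFFSET(0)] AS campaign_period,     -- e.g. '012024'
  SPLIT(campaign_name, '_')[SAFE_OFFSET(1)] AS campaign_theme,      -- e.g. 'sales'
  SPLIT(campaign_name, '_')[SAFE_OFFSET(2)] AS campaign_objective   -- e.g. 'conversions'
FROM {{ ref('media_summary') }}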


TIP N°3: BE STRICT ON THE USE OF UTMS


So far, we have seen how to connect media data between two different sources. But campaign performance does not stop at media data: analytics data should also be included. The question then becomes: how do I connect media and analytics data?


In Google Ads, you can build the following report: 

| Date       | Channel | Campaign   | Impressions | Clicks |
|------------|---------|------------|-------------|--------|
| 17/11/2023 | Google  | Campaign 1 | 100         | 50     |
| 17/11/2023 | Google  | Campaign 1 | 200         | 40     |
| 17/11/2023 | Google  | Campaign 1 | 300         | 90     |
| Total      | Google  | Campaign 1 | 600         | 180    |

You can build the following report out of Google Analytics: 

| Date       | utm_source | utm_medium | utm_campaign | Sessions | Conversions |
|------------|------------|------------|--------------|----------|-------------|
| 17/11/2023 | google     | cpc        | Campaign 1   | 45       | 10          |
| 17/11/2023 | google     | cpc        | Campaign 1   | 39       | 15          |
| 17/11/2023 | google     | cpc        | Campaign 1   | 80       | 25          |
| Total      | google     | cpc        | Campaign 1   | 164      | 50          |

A single report combining the two tables can be achieved by using the same value for the campaign name in Google Ads and for the utm_campaign of that campaign:

| Date       | Campaign   | Impressions | Clicks | Sessions | Conversions |
|------------|------------|-------------|--------|----------|-------------|
| 17/11/2023 | Campaign 1 | 100         | 50     | 45       | 10          |
| 17/11/2023 | Campaign 1 | 200         | 40     | 39       | 15          |
| 17/11/2023 | Campaign 1 | 300         | 90     | 80       | 25          |
| Total      | Campaign 1 | 600         | 180    | 164      | 50          |

This can be achieved by using a left join in the media summary model.
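A minimal sketch of that join, assuming both models are aggregated at the date and campaign level and that a ga4_campaign_summary model exposes utm_campaign, sessions and conversions (these names are assumptions):

SELECT
  media.date,
  media.source_name,
  media.campaign_name,
  media.media_cost,
  media.impressions,
  media.clicks,
  ga4.sessions,
  ga4.conversions
FROM {{ ref('media_summary') }} AS media
-- LEFT JOIN keeps media rows even when no sessions were tracked
LEFT JOIN {{ ref('ga4_campaign_summary') }} AS ga4
  ON media.date = ga4.date
  AND LOWER(media.campaign_name) = LOWER(ga4.utm_campaign)

Note that if the media summary is kept at ad level while GA4 data sits at campaign level, aggregate the media side to campaign level first, otherwise sessions and conversions will be duplicated across ads.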


TIP N°4: MANAGE ATTRIBUTION DIFFERENCES BETWEEN PLATFORMS


Attribution in media is the set of rules that determines which channel gets credit for your conversions. It is an important piece to bear in mind when creating a media summary. When analyzing data from your sources, you will probably see something like this:

| Date       | Channel | Campaign   | Conversions |
|------------|---------|------------|-------------|
| 17/11/2023 | google  | Campaign 1 | 10          |
| 17/11/2023 | google  | Campaign 1 | 11          |
| 17/11/2023 | google  | Campaign 1 | 13          |
| Total      | google  | Campaign 1 | 34          |

| Date       | Channel  | Campaign   | Conversions |
|------------|----------|------------|-------------|
| 17/11/2023 | facebook | Campaign 1 | 14          |
| 17/11/2023 | facebook | Campaign 1 | 10          |
| 17/11/2023 | facebook | Campaign 1 | 12          |
| Total      | facebook | Campaign 1 | 36          |

Depending on the channel, conversions are counted differently. Creating a media summary by simply summing them would not be correct: you would be mixing conversions from platforms with different attribution models and conversion windows. As a result, you end up with more conversions than you should expect:

| Date       | Channel  | Campaign   | Conversions |
|------------|----------|------------|-------------|
| 17/11/2023 | google   | Campaign 1 | 10          |
| 17/11/2023 | google   | Campaign 1 | 11          |
| 17/11/2023 | google   | Campaign 1 | 13          |
| 17/11/2023 | facebook | Campaign 1 | 14          |
| 17/11/2023 | facebook | Campaign 1 | 10          |
| 17/11/2023 | facebook | Campaign 1 | 12          |
| Total      |          | Campaign 1 | 70          |

To solve this limitation, two options are available: you can either code your own attribution model or use a third-party attribution platform that will take care of this for you. GA4 is a free option that uses a single attribution model to give credit to advertising channels. By using it, you avoid counting the same conversion twice and apply the same model to all sources:

| Date       | Channel    | Campaign   | Conversions |
|------------|------------|------------|-------------|
| 17/11/2023 | Google Ads | Campaign 1 | 9           |
| 17/11/2023 | Google Ads | Campaign 1 | 8           |
| 17/11/2023 | Google Ads | Campaign 1 | 11          |
| 17/11/2023 | Facebook   | Campaign 1 | 12          |
| 17/11/2023 | Facebook   | Campaign 1 | 9           |
| 17/11/2023 | Facebook   | Campaign 1 | 13          |
| Total      |            | Campaign 1 | 62          |

MEDIA SUMMARY: FINAL OUTPUT EXAMPLE


Before anything, let’s add the missing piece of the overall architecture: 


The dashboard below is a simple example of a media summary that can be achieved quickly using BigQuery and dbt. It contains the most basic information needed to monitor media spend and revenue generated by campaigns, along with the possibility to apply filters.



Interested? Feel free to reach out!



