One of the shortcomings of the GA4 BigQuery export is its lack of session-level traffic source data. Data on traffic sources are only available at the user and event levels. Having to rely on event-level data makes analyzing session-level traffic acquisition quite inconvenient.
This blog post will show how to recreate the GA4 session traffic source dimensions using BigQuery export event data. The result will be a lookup table containing the last non-direct source of traffic for each unique session id.
Before jumping into the SQL queries, let’s go through the basics.
- 2023-03-24: Improvements to the last non-direct attribution. Read the last non-direct traffic source from the last preceding event traffic source.
- 2023-03-29: Improvements to the query performance in the first session traffic source query. Use group by approach instead of a window function.
- 2023-10-24: Use the newly added collected_traffic_source struct in the GA4 export for getting the event-level traffic source parameters.
What kind of traffic source data is available in BigQuery?
The most common source of confusion in the GA4 export seems to relate to the poorly named traffic_source parameters. Traffic_source contains the user-level first-touch traffic source, not the session’s traffic source.
Additionally, the individual events in BigQuery can contain event parameters such as source, medium, and campaign. These are the event-level traffic source details. The parameters are nested inside the event_params field of the event.
For help with querying nested event parameters, see my earlier post.
Collected traffic source
Update: Since June 2023, you can also find the event-level traffic source details using the newly added collected_traffic_source struct in the GA4 BigQuery export.
Collected_traffic_source is a collection of the same event-level traffic source details that you can find nested inside the event_params array. Compared to having to unnest each parameter, the struct format makes the data much more easily accessible.
However, as the data in this new format is only available since June 2023, you’ll need to rely on the event_params for the data older than that. That is not an issue. I’ll demonstrate later how you can use event_params to build a struct that looks exactly like the collected_traffic_source struct.
How is the session traffic source defined in GA4?
In Universal Analytics, the traffic source dimensions were quite clear. That’s because a changed source of traffic during a session would always create a new session. Therefore, there could only ever be one source of traffic per session.
GA4 is different. A change in the source of traffic will no longer trigger a new session. Instead, if a new traffic source is detected within the session, the session will continue with the traffic source details captured in the events’ parameters.
So how does GA4 report session source and medium?
I tested this by retrieving the session source and medium per unique session id from the GA4 Data API and compared that to the event-level source and medium data from the BigQuery export. In cases where the session included multiple different source and medium pairs, the source / medium reported in the GA4 interface was the one that appeared first.
In the following example session, the visit starts directly without traffic source data. The first page view is followed by another one that includes the utm_source and utm_medium parameters. Later, there’s another utm_source & utm_medium pair with different values. In GA4, the final source / medium of the session is reported as “(direct) / (none)”.
GA4 tracks the session’s first event’s traffic source as the source for the entire session. Automatically generated first_visit and session_start events are not included in this evaluation.
If you want to dig deeper into the GA session source magic, I recommend reading this excellent investigation in the GTM-Gear blog.
How does the last non-direct attribution work in GA4?
So, we went through how the session traffic source is derived. How about the last non-direct attribution or Paid and organic last click, as Google has decided to rename it?
When looking back to find the last non-direct traffic source, GA4 has a slightly different approach. Instead of looking at the first event of each preceding session, GA4 will get the data from the last event that includes traffic source details.
Also, keep in mind that Paid and organic last click is not the default attribution model in GA4. GA4 defaults to the Data-driven attribution model, a more complex model that assigns weights between the different touchpoints that preceded the session. If you wish to build something similar to the Data-driven model yourself, you could utilize the queries in this post as the input data for the model.
To make the process a bit easier to manage, I’ve divided it into two different SQL queries:
- Create a table containing the first and last traffic source per session.
- Identify the last non-direct source of traffic in case the current session had no traffic source data.
The main reason for doing this in two parts is to make the queries less expensive. The last non-direct lookback will need to process a lot of data whenever the results table is updated. For example, a 30-day lookback would require 30+1 days of data to increment the results table by one day.
Ideally, I’d save the results of the first SQL query to a date-partitioned table and have that updated incrementally. The following query would then rely on that to fill in the last non-direct traffic source in case the session had no traffic source data.
I’m trying to keep this post simple and on topic, so I’ll leave out the incremental update and other query optimization parts.
You can check out my post on Dataform incremental refresh strategies for help in turning these queries into a proper daily updated data source.
Query the Session Traffic Source
The first SQL query will go through the events and return the first and last traffic source per each unique session id. The values will be null if the traffic source data doesn’t exist.
You can save the query’s results in a new table and reference that in the next query instead of rerunning the whole thing with all event data.
-- extract the event data needed for the session traffic source details with events as ( select cast(event_date as date format 'YYYYMMDD') as date, -- unique session id concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id, user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start, collected_traffic_source, event_timestamp from `<project>.<dataset>.events_*` where (_table_suffix >= '20230701' and _table_suffix <= '20231231') -- starting from 20231102 the sessions_start and first_visit events don't need to be excluded anymore and event_name not in ('session_start', 'first_visit') ) select min(date) as date, session_id, user_pseudo_id, session_start, -- the traffic source of the first event in the session with session_start and first_visit excluded array_agg( collected_traffic_source order by event_timestamp asc limit 1 ) [safe_offset(0)] as session_first_traffic_source, -- the last not null traffic source of the session array_agg( collected_traffic_source ignore nulls order by event_timestamp desc limit 1 ) [safe_offset(0)] as session_last_traffic_source from events where session_id is not null group by session_id, user_pseudo_id, session_start
Update 1: The query utilizes the collected_traffic_source struct that was included in the BigQuery GA4 export in June 2023. With data older than that, you’ll need to query the nested event parameters. Replace collected_traffic_source in the above query (line 9) with the below SQL to generate a similar struct using the data from event_params.
if( (select value.string_value from unnest(event_params) where key in ('campaign_id', 'campaign', 'source', 'medium', 'term', 'content', 'gclid', 'dclid', 'srsltid') and value.string_value is not null limit 1) is not null, ( select as struct (select value.string_value from unnest(event_params) where key = 'campaign_id') as manual_campaign_id, (select value.string_value from unnest(event_params) where key = 'campaign') as manual_campaign_name, (select value.string_value from unnest(event_params) where key = 'source') as manual_source, (select value.string_value from unnest(event_params) where key = 'medium') as manual_medium, (select value.string_value from unnest(event_params) where key = 'term') as manual_term, (select value.string_value from unnest(event_params) where key = 'content') as manual_content, (select value.string_value from unnest(event_params) where key = 'gclid') as gclid, (select value.string_value from unnest(event_params) where key = 'dclid') as dclid, (select value.string_value from unnest(event_params) where key = 'srsltid') as srsltid ), null ) as collected_traffic_source,
Update 2: With the daily exports starting from November 2, 2023, it’s no longer necessary to exclude the session_start and first_visit events as they now also include the required traffic source parameters. However, it’s still best to get the data from the first event of the session, not from a session_start event exclusively. For more details, see my other post.
Results Without Last Non-direct
As a benchmark, I took a small sample of source / medium data from the GA4 interface.
When looking at the data from the newly created table for the same date range, it doesn’t quite match with GA4. That’s because the last non-direct attribution is yet to be applied.
select session_first_traffic_source.manual_source as source, session_first_traffic_source.manual_medium as medium, count(distinct session_id) as sessions from `<project>.<dataset>.<session traffic source table>` where date between '<start>' and '<end>' group by 1, 2 order by sessions desc, manual_source
Query the Last Non-direct Traffic Source
The second SQL query will take the session-level data from the previously created table and fill the missing traffic sources with the last non-direct source.
The query checks if the session already has traffic source data in session_first_traffic_source. If there is none, it will look for the same user’s previous not null value for session_last_traffic_source.
The window order and range in the query are based on session_start, which is the value of the ga_session_id parameter. Ga_session_id is a Unix timestamp that is assigned when the session begins.
select date, session_id, user_pseudo_id, session_start, session_first_traffic_source, ifnull( session_first_traffic_source, last_value(session_last_traffic_source ignore nulls) over( partition by user_pseudo_id order by session_start range between 2592000 preceding and 1 preceding -- 30 day lookback ) ) as session_traffic_source_last_non_direct, from `<project>.<dataset>.<session traffic source table>`
You can adjust the lookback window of the query. Based on running a few tests, 30 days seems close to what GA4 uses.
A larger lookback window will find traffic sources further back in time. However, at the same time, the cost of the query will increase.
Let’s see if the applied last non-direct attribution results in numbers more closely resembling the Traffic acquisition report in GA4. Additionally, we can convert the null values to “(direct)” and “(none)” to make it look more like in the GA4 report.
select ifnull(session_traffic_source_last_non_direct.manual_source, '(direct)' ) as source, ifnull(session_traffic_source_last_non_direct.manual_medium, '(none)' ) as medium, count(distinct session_id) as sessions from `<project>.<dataset>.<session last non-direct traffic source table>` where date between '<start>' and '<end>' group by 1, 2 order by sessions desc, source
The numbers based on my sample data now match the GA4 report almost exactly. The SQL query attributed two more sessions to “google / organic” than what GA4 did. I ran the query with the 90-day lookback window. With a 30-day lookback, there would have been a few more direct sessions and a few less in the other source / mediums.
The query worked quite nicely with my test dataset. Based on that, the base logic seems correct. However, there are a couple of common issues which weren’t present in the dataset that I used.
Google Ads and GCLID, GBRAID, WBRAID Parameters
GA4 has some issues related to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging uses the GCLID parameter for connecting the website visit with the corresponding campaign. With auto-tagging enabled, the usual utm tags shouldn’t be needed.
However, this doesn’t quite work with the BigQuery export. In BigQuery, the sessions that started with an event containing the GCLID parameter can either have missing source and medium details or have something like “google / organic” or “youtube.com / referral” as their source medium.
A simple fix for this is adjusting the first session traffic source query to check the GCLID parameter. If the parameter exists, the source and medium values can be overwritten with “google” and “cpc”. However, this approach could be too simplistic, as other traffic sources can also utilize the GCLID parameter.
with events as ( ... ) select ... -- fix the session_first_traffic_source and session_last_traffic_source structs below -- modify the manual_source and manual_medium fields in case a gclid param was included array_agg( if( collected_traffic_source is not null, ( select as struct collected_traffic_source.* except(manual_source, manual_medium), if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source, if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium ), null ) order by event_timestamp asc limit 1 ) [safe_offset(0)] as session_first_traffic_source, array_agg( if( collected_traffic_source is not null, ( select as struct collected_traffic_source.* except(manual_source, manual_medium), if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source, if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium ), null ) ignore nulls order by event_timestamp desc limit 1 ) [safe_offset(0)] as session_last_traffic_source from events where ... group by ...
Another thing to consider is the GBRAID and WBRAID parameters which are used as alternatives to the GCLID parameter in iOS14+ devices.
Finally, even if that solution worked correctly with your traffic data, it would still lack the campaign details. So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.
Another issue that can cause discrepancies between the GA4 reports and BigQuery data is the use of user ids. When a visitor is assigned a user id mid-session, GA4 will attribute the session’s traffic source to “(direct)” / “(none)”, or I guess the last non-direct source derived from that, regardless of what the session’s first traffic source was.
As of writing this in November 2022, this is still an issue in GA4. However, it doesn’t apply to the event data in BigQuery, so it’s good to keep that in mind when comparing the numbers between GA4 and BigQuery data.
The below image shows unique sessions that have a user id and that were reported as “(direct) / (none)” by GA4. I exported the data from GA4 and joined it with the traffic source table created earlier. As a result, some of the sessions were incorrectly reported as direct even though the traffic source data existed on the event-level.
The following snapshot of some of the events in one of those sessions shows what’s happening. The visit originated from “google / organic”. However, later during the session, a login event occurred, and a user id was assigned to the visitor. That messed up the reporting in GA4 and wiped the original traffic source of the session.
Recreating the session-level GA4 traffic reports with BigQuery data is not quite as straightforward as it could be. It also requires a lot of processing on BigQuery’s side when repeated daily. In a production setting, the process needs to be well-thought and optimized when up-to-date data is needed in multiple places.
Additionally, the issues related to Google Ads traffic and the use of user ids make it difficult to verify the query results by comparing them with the GA4 reports.
Hopefully, Google will make this easier someday and start providing correct session-level traffic data already in the export, just like they’ve been providing with Universal Analytics.
Finally, another thing to consider is whether it makes sense to replicate the GA4 traffic source reporting. With BigQuery, you can define the session’s traffic source however you like. An alternative approach could be to credit all traffic sources within the same session, not just the first, and split the credit between them.