One of the shortcomings of the GA4 BigQuery export is its lack of session-level traffic source data. Data on traffic sources are only available at the user and event levels. Having to rely on event-level data makes analyzing session-level traffic acquisition quite inconvenient.
This blog post will show how to recreate the GA4 session traffic source dimensions using BigQuery export event data. The result will be a lookup table containing the last non-direct source of traffic for each unique session id.
Before jumping into the SQL queries, let’s go through the basics.
- 2023-03-24: Improvements to the last non-direct attribution. Read the last non-direct traffic source from the last preceding event traffic source.
- 2023-03-29: Query performance improvements to the session traffic source query.
What kind of traffic source data is available in BigQuery?
The most common source of confusion in the GA4 export seems to relate to the poorly named traffic_source parameters. Traffic_source contains the user-level first-touch traffic source, not the session’s traffic source.
Additionally, the individual events in BigQuery can contain event parameters such as source, medium, and campaign. The parameters are nested inside the event_params field of the event.
For help with querying nested event parameters, see my earlier post.
How is the session traffic source defined in GA4?
In Universal Analytics, the traffic source dimensions were quite clear. That’s because a changed source of traffic during a session would always create a new session. Therefore, there could only ever be one source of traffic per session.
GA4 is different. A change in the source of traffic will no longer trigger a new session. Instead, if a new traffic source is detected within the session, the session will continue with the traffic source details captured in the events’ parameters.
So how does GA4 report session source and medium?
I tested this by retrieving the session source and medium per unique session id from the GA4 Data API and compared that to the event-level source and medium data from the BigQuery export. In cases where the session included multiple different source and medium pairs, the source / medium reported in the GA4 interface was the one that appeared first.
In the following example session, the visit starts directly without traffic source data. The first page view is followed by another one that includes the utm_source and utm_medium parameters. Later, there’s another utm_source & utm_medium pair with different values. In GA4, the final source / medium of the session is reported as “(direct) / (none)”.
GA4 tracks the session’s first event’s traffic source as the source for the entire session. Automatically generated first_visit and session_start events are not included in this evaluation.
If you want to dig deeper into the GA session source magic, I recommend reading this excellent investigation in the GTM-Gear blog.
How does the last non-direct attribution work in GA4?
So, we went through how the session traffic source is derived. How about the last non-direct attribution or Cross-channel last click, as Google has decided to rename it?
When looking back to find the last non-direct traffic source, GA4 has a slightly different approach. Instead of looking at the first event of each preceding session, GA4 will get the data from the last event that includes traffic source details.
To make the process a bit easier to manage, I’ve divided it into two different SQL queries:
- Create a table containing the first and last traffic source per session.
- Identify the last non-direct source of traffic in case the current session had no traffic source data.
The main reason for doing this in two parts is to make the queries less expensive. The last non-direct lookback will need to process a lot of data whenever the results table is updated. For example, a 90-day lookback would require 90+1 days of data to increment the results table by one day.
Ideally, I’d save the results of the first SQL query to a date-partitioned table and have that updated incrementally. The following query would then rely on that to fill in the last non-direct traffic source in case the session had no traffic source data.
I’m trying to keep this post clear and on topic, so I’ll leave out the incremental update and other query optimization parts. However, covering those using Dataform could very well be a topic for another post. 🙂
Query the Session Traffic Source
The first SQL query will go through the events and return the first and last traffic source per each unique session id. The values will be null if the traffic source data doesn’t exist.
You can save the query’s results in a new table and reference that in the next query instead of rerunning the whole thing with all event data.
-- extract the event data needed for the session traffic source details with events as ( select cast(event_date as date format 'YYYYMMDD') as date, -- unique session id concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id, user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start, -- wrap all traffic source dimensions into a struct for the next step ( select as struct (select value.string_value from unnest(event_params) where key = 'source') as source, (select value.string_value from unnest(event_params) where key = 'medium') as medium, (select value.string_value from unnest(event_params) where key = 'campaign') as campaign ) as traffic_source, event_timestamp from `gtm-tpkxvmj-ntcwm.analytics_298233330.events_*` where (_table_suffix >= '20220101' and _table_suffix <= '20221231') and event_name not in ('session_start', 'first_visit') ) select min(date) as date, session_id, user_pseudo_id, session_start, -- the traffic source of the first event in the session with session_start and first_visit excluded array_agg( if( coalesce(traffic_source.source, traffic_source.medium, traffic_source.campaign) is not null, traffic_source, null ) order by event_timestamp asc limit 1 ) [safe_offset(0)] as session_first_traffic_source, -- the last not null traffic source of the session array_agg( if( coalesce(traffic_source.source,traffic_source.medium,traffic_source.campaign) is not null, traffic_source, null ) ignore nulls order by event_timestamp desc limit 1 ) [safe_offset(0)] as session_last_traffic_source from events where session_id is not null group by session_id, user_pseudo_id, session_start
Results Without Last Non-direct
As a benchmark, I took a small sample of source / medium data from the GA4 interface.
When looking at the data from the newly created table for the same date range, it doesn’t quite match with GA4. That’s because the last non-direct attribution is yet to be applied.
select session_first_traffic_source.source, session_first_traffic_source.medium, count(distinct session_id) as sessions from `<project>.<dataset>.<session traffic source table>` where date between '<start>' and '<end>' group by 1, 2 order by sessions desc, source
Query the Last Non-direct Traffic Source
The second SQL query will take the session-level data from the previously created table and fill the missing traffic sources with the last non-direct source.
The query checks if the session already has traffic source data in session_first_traffic_source. If there is none, it will look for the same user’s previous not null value for session_last_traffic_source.
The window order and range in the query are based on session_start, which is the value of the ga_session_id parameter. Ga_session_id is a Unix timestamp that is assigned when the session begins.
select date, session_id, user_pseudo_id, session_start, session_first_traffic_source, ifnull( session_first_traffic_source, last_value(session_last_traffic_source ignore nulls) over( partition by user_pseudo_id order by session_start range between 2592000 preceding and 1 preceding -- 30 day lookback ) ) as session_traffic_source_last_non_direct, from `<project>.<dataset>.<session traffic source table>`
You can adjust the lookback window of the query. Based on running a few tests, 30 days seems close to what GA4 uses.
A larger lookback window will find traffic sources further back in time. However, at the same time, the cost of the query will increase.
Let’s see if the applied last non-direct attribution results in numbers more closely resembling the Traffic acquisition report in GA4. Additionally, we can convert the null values to “(direct)” and “(none)” to make it look more like in the GA4 report.
select ifnull( session_traffic_source_last_non_direct.source, '(direct)' ) as source, ifnull( session_traffic_source_last_non_direct.medium, '(none)' ) as medium, count(distinct session_id) as sessions from `<project>.<dataset>.<session last non-direct traffic source table>` where date between '<start>' and '<end>' group by 1, 2 order by sessions desc, source
The numbers based on my sample data now match the GA4 report almost exactly. The SQL query attributed two more sessions to “google / organic” than what GA4 did. I ran the query with the 90-day lookback window. With a 30-day lookback, there would have been a few more direct sessions and a few less in the other source / mediums.
The query worked quite nicely with my test dataset. Based on that, the base logic seems correct. However, there are a couple of common issues which weren’t present in the dataset that I used.
Google Ads and GCLID, GBRAID, WBRAID Parameters
GA4 has some issues related to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging uses the GCLID parameter for connecting the website visit with the corresponding campaign. With auto-tagging enabled, the usual utm tags shouldn’t be needed.
However, this doesn’t quite work with the BigQuery export. In BigQuery, the sessions that started with an event containing the GCLID parameter can either have missing source and medium details or have something like “google / organic” or “youtube.com / referral” as their source medium.
A simple fix for this is adjusting the first session traffic source query to check the GCLID parameter. If the parameter exists, the source and medium values can be overwritten with “google” and “cpc”. However, this approach could be too simplistic, as other traffic sources can also utilize the GCLID parameter.
with events as ( select ... -- include the gclid parameter in the traffic_source struct ( select as struct (select value.string_value from unnest(event_params) where key = 'source') as source, (select value.string_value from unnest(event_params) where key = 'medium') as medium, (select value.string_value from unnest(event_params) where key = 'campaign') as campaign, (select value.string_value from unnest(event_params) where key = 'gclid') as gclid ) as traffic_source, ... from `<project>.<dataset>.events_*` where ... ) select ... -- fix the source and medium fields in the structs based on gclid array_agg( if( coalesce(traffic_source.source, traffic_source.medium, traffic_source.campaign, traffic_source.gclid) is not null, ( select as struct if(traffic_source.gclid is not null, 'google', source) as source, if(traffic_source.gclid is not null, 'cpc', medium) as medium, traffic_source.campaign, traffic_source.gclid ), null ) order by event_timestamp asc limit 1 ) [safe_offset(0)] as session_first_traffic_source, array_agg( if( coalesce(traffic_source.source,traffic_source.medium,traffic_source.campaign,traffic_source.gclid) is not null, ( select as struct if(traffic_source.gclid is not null, 'google', source) as source, if(traffic_source.gclid is not null, 'cpc', medium) as medium, traffic_source.campaign, traffic_source.gclid ), null ) ignore nulls order by event_timestamp desc limit 1 ) [safe_offset(0)] as session_last_traffic_source ... from events where ... group by ...
Another thing to consider is the GBRAID and WBRAID parameters which are used as alternatives to the GCLID parameter in iOS14+ devices.
Finalle, even if that solution worked correctly with your traffic data, it would still lack the campaign details. So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.
Another issue that can cause discrepancies between the GA4 reports and BigQuery data is the use of user ids. When a visitor is assigned a user id mid-session, GA4 will attribute the session’s traffic source to “(direct)” / “(none)”, or I guess the last non-direct source derived from that, regardless of what the session’s first traffic source was.
As of writing this in November 2022, this is still an issue in GA4. However, it doesn’t apply to the event data in BigQuery, so it’s good to keep that in mind when comparing the numbers between GA4 and BigQuery data.
The below image shows unique sessions that have a user id and that were reported as “(direct) / (none)” by GA4. I exported the data from GA4 and joined it with the traffic source table created earlier. As a result, some of the sessions were incorrectly reported as direct even though the traffic source data existed on the event-level.
The following snapshot of some of the events in one of those sessions shows what’s happening. The visit originated from “google / organic”. However, later during the session, a login event occurred, and a user id was assigned to the visitor. That messed up the reporting in GA4 and wiped the original traffic source of the session.
Recreating the session-level GA4 traffic reports with BigQuery data is not quite as straightforward as it could be. It also requires a lot of processing on BigQuery’s side when repeated daily. In a production setting, the process needs to be well-thought and optimized when up-to-date data is needed in multiple places.
Additionally, the issues related to Google Ads traffic and the use of user ids make it difficult to verify the query results by comparing them with the GA4 reports.
Hopefully, Google will make this easier someday and start providing correct session-level traffic data already in the export, just like they’ve been providing with Universal Analytics.
Finally, another thing to consider is whether it makes sense to replicate the GA4 traffic source reporting. With BigQuery, you can define the session’s traffic source however you like. An alternative approach could be to credit all traffic sources within the same session, not just the first, and split the credit between them.