One of the shortcomings of the GA4 BigQuery export is its lack of session-level traffic source data. Data on traffic sources are only available at the user and event levels. Having to rely on event-level data makes analyzing session-level traffic acquisition quite inconvenient.
This blog post will show how to recreate the GA4 session traffic source dimensions using BigQuery export event data. The result will be a lookup table containing the last non-direct source of traffic for each unique session id.
Before jumping into the SQL queries, let’s go through the basics.
Post updates
- 2023-03-24: Improvements to the last non-direct attribution. Read the last non-direct traffic source from the last preceding event traffic source.
- 2023-03-29: Improvements to the query performance in the first session traffic source query. Use group by approach instead of a window function.
- 2023-10-24: Use the newly added collected_traffic_source struct in the GA4 export for getting the event-level traffic source parameters.
What kind of traffic source data is available in BigQuery?
The most common source of confusion in the GA4 export seems to relate to the poorly named traffic_source parameters. Traffic_source contains the user-level first-touch traffic source, not the session’s traffic source.
Additionally, the individual events in BigQuery can contain event parameters such as source, medium, and campaign. These are the event-level traffic source details. The parameters are nested inside the event_params field of the event.
For help with querying nested event parameters, see my earlier post.
Collected traffic source
Update: Since June 2023, you can also find the event-level traffic source details using the newly added collected_traffic_source struct in the GA4 BigQuery export.
Collected_traffic_source is a collection of the same event-level traffic source details that you can find nested inside the event_params array. Compared to having to unnest each parameter, the struct format makes the data much more easily accessible.
However, as the data in this new format is only available since June 2023, you’ll need to rely on the event_params for the data older than that. That is not an issue. I’ll demonstrate later how you can use event_params to build a struct that looks exactly like the collected_traffic_source struct.
How is the session traffic source defined in GA4?
In Universal Analytics, the traffic source dimensions were quite clear. That’s because a changed source of traffic during a session would always create a new session. Therefore, there could only ever be one source of traffic per session.
GA4 is different. A change in the source of traffic will no longer trigger a new session. Instead, if a new traffic source is detected within the session, the session will continue with the traffic source details captured in the events’ parameters.
So how does GA4 report session source and medium?
I tested this by retrieving the session source and medium per unique session id from the GA4 Data API and compared that to the event-level source and medium data from the BigQuery export. In cases where the session included multiple different source and medium pairs, the source / medium reported in the GA4 interface was the one that appeared first.
In the following example session, the visit starts directly without traffic source data. The first page view is followed by another one that includes the utm_source and utm_medium parameters. Later, there’s another utm_source & utm_medium pair with different values. In GA4, the final source / medium of the session is reported as “(direct) / (none)”.
GA4 tracks the session’s first event’s traffic source as the source for the entire session. Automatically generated first_visit and session_start events are not included in this evaluation.
If you want to dig deeper into the GA session source magic, I recommend reading this excellent investigation in the GTM-Gear blog.
How does the last non-direct attribution work in GA4?
So, we went through how the session traffic source is derived. How about the last non-direct attribution or Paid and organic last click, as Google has decided to rename it?
When looking back to find the last non-direct traffic source, GA4 has a slightly different approach. Instead of looking at the first event of each preceding session, GA4 will get the data from the last event that includes traffic source details.
Also, keep in mind that Paid and organic last click is not the default attribution model in GA4. GA4 defaults to the Data-driven attribution model, a more complex model that assigns weights between the different touchpoints that preceded the session. If you wish to build something similar to the Data-driven model yourself, you could utilize the queries in this post as the input data for the model.
Query Process
To make the process a bit easier to manage, I’ve divided it into two different SQL queries:
- Create a table containing the first and last traffic source per session.
- Identify the last non-direct source of traffic in case the current session had no traffic source data.
The main reason for doing this in two parts is to make the queries less expensive. The last non-direct lookback will need to process a lot of data whenever the results table is updated. For example, a 30-day lookback would require 30+1 days of data to increment the results table by one day.
Ideally, I’d save the results of the first SQL query to a date-partitioned table and have that updated incrementally. The following query would then rely on that to fill in the last non-direct traffic source in case the session had no traffic source data.
I’m trying to keep this post simple and on topic, so I’ll leave out the incremental update and other query optimization parts.
You can check out my post on Dataform incremental refresh strategies for help in turning these queries into a proper daily updated data source.
Query the Session Traffic Source
The first SQL query will go through the events and return the first and last traffic source per each unique session id. The values will be null if the traffic source data doesn’t exist.
You can save the query’s results in a new table and reference that in the next query instead of rerunning the whole thing with all event data.
-- extract the event data needed for the session traffic source details
with events as (
select
cast(event_date as date format 'YYYYMMDD') as date,
-- unique session id
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start,
collected_traffic_source,
event_timestamp
from
`<project>.<dataset>.events_*`
where
(_table_suffix >= '20230701' and _table_suffix <= '20231231')
-- starting from 20231102 the sessions_start and first_visit events don't need to be excluded anymore
and event_name not in ('session_start', 'first_visit')
)
select
min(date) as date,
session_id,
user_pseudo_id,
session_start,
-- the traffic source of the first event in the session with session_start and first_visit excluded
array_agg(
collected_traffic_source
order by
event_timestamp asc
limit
1
) [safe_offset(0)] as session_first_traffic_source,
-- the last not null traffic source of the session
array_agg(
collected_traffic_source ignore nulls
order by
event_timestamp desc
limit
1
) [safe_offset(0)] as session_last_traffic_source
from
events
where
session_id is not null
group by
session_id,
user_pseudo_id,
session_start
Update 1: The query utilizes the collected_traffic_source struct that was included in the BigQuery GA4 export in June 2023. With data older than that, you’ll need to query the nested event parameters. Replace collected_traffic_source in the above query (line 9) with the below SQL to generate a similar struct using the data from event_params.
if(
(select value.string_value from unnest(event_params) where key in ('campaign_id', 'campaign', 'source', 'medium', 'term', 'content', 'gclid', 'dclid', 'srsltid') and value.string_value is not null limit 1) is not null,
(
select as struct
(select value.string_value from unnest(event_params) where key = 'campaign_id') as manual_campaign_id,
(select value.string_value from unnest(event_params) where key = 'campaign') as manual_campaign_name,
(select value.string_value from unnest(event_params) where key = 'source') as manual_source,
(select value.string_value from unnest(event_params) where key = 'medium') as manual_medium,
(select value.string_value from unnest(event_params) where key = 'term') as manual_term,
(select value.string_value from unnest(event_params) where key = 'content') as manual_content,
(select value.string_value from unnest(event_params) where key = 'gclid') as gclid,
(select value.string_value from unnest(event_params) where key = 'dclid') as dclid,
(select value.string_value from unnest(event_params) where key = 'srsltid') as srsltid
),
null
) as collected_traffic_source,
Update 2: With the daily exports starting from November 2, 2023, it’s no longer necessary to exclude the session_start and first_visit events as they now also include the required traffic source parameters. However, it’s still best to get the data from the first event of the session, not from a session_start event exclusively. For more details, see my other post.
Results Without Last Non-direct
As a benchmark, I took a small sample of source / medium data from the GA4 interface.
When looking at the data from the newly created table for the same date range, it doesn’t quite match with GA4. That’s because the last non-direct attribution is yet to be applied.
select
session_first_traffic_source.manual_source as source,
session_first_traffic_source.manual_medium as medium,
count(distinct session_id) as sessions
from
`<project>.<dataset>.<session traffic source table>`
where
date between '<start>'
and '<end>'
group by
1,
2
order by
sessions desc, manual_source
Query the Last Non-direct Traffic Source
The second SQL query will take the session-level data from the previously created table and fill the missing traffic sources with the last non-direct source.
The query checks if the session already has traffic source data in session_first_traffic_source. If there is none, it will look for the same user’s previous not null value for session_last_traffic_source.
The window order and range in the query are based on session_start, which is the value of the ga_session_id parameter. Ga_session_id is a Unix timestamp that is assigned when the session begins.
select
date,
session_id,
user_pseudo_id,
session_start,
session_first_traffic_source,
ifnull(
session_first_traffic_source,
last_value(session_last_traffic_source ignore nulls) over(
partition by user_pseudo_id
order by
session_start range between 2592000 preceding
and 1 preceding -- 30 day lookback
)
) as session_traffic_source_last_non_direct,
from
`<project>.<dataset>.<session traffic source table>`
You can adjust the lookback window of the query. Based on running a few tests, 30 days seems close to what GA4 uses.
A larger lookback window will find traffic sources further back in time. However, at the same time, the cost of the query will increase.
Results
Let’s see if the applied last non-direct attribution results in numbers more closely resembling the Traffic acquisition report in GA4. Additionally, we can convert the null values to “(direct)” and “(none)” to make it look more like in the GA4 report.
select
ifnull(session_traffic_source_last_non_direct.manual_source,
'(direct)'
) as source,
ifnull(session_traffic_source_last_non_direct.manual_medium,
'(none)'
) as medium,
count(distinct session_id) as sessions
from
`<project>.<dataset>.<session last non-direct traffic source table>`
where
date between '<start>'
and '<end>'
group by
1,
2
order by
sessions desc, source
The numbers based on my sample data now match the GA4 report almost exactly. The SQL query attributed two more sessions to “google / organic” than what GA4 did. I ran the query with the 90-day lookback window. With a 30-day lookback, there would have been a few more direct sessions and a few less in the other source / mediums.
Additional Considerations
The query worked quite nicely with my test dataset. Based on that, the base logic seems correct. However, there are a couple of common issues which weren’t present in the dataset that I used.
Google Ads and GCLID, GBRAID, WBRAID Parameters
GA4 has some issues related to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging uses the GCLID parameter for connecting the website visit with the corresponding campaign. With auto-tagging enabled, the usual utm tags shouldn’t be needed.
However, this doesn’t quite work with the BigQuery export. In BigQuery, the sessions that started with an event containing the GCLID parameter can either have missing source and medium details or have something like “google / organic” or “youtube.com / referral” as their source medium.
A simple fix for this is adjusting the first session traffic source query to check the GCLID parameter. If the parameter exists, the source and medium values can be overwritten with “google” and “cpc”. However, this approach could be too simplistic, as other traffic sources can also utilize the GCLID parameter.
with events as (
...
)
select
...
-- fix the session_first_traffic_source and session_last_traffic_source structs below
-- modify the manual_source and manual_medium fields in case a gclid param was included
array_agg(
if(
collected_traffic_source is not null,
(
select as struct
collected_traffic_source.* except(manual_source, manual_medium),
if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source,
if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium
),
null
)
order by
event_timestamp asc
limit
1
) [safe_offset(0)] as session_first_traffic_source,
array_agg(
if(
collected_traffic_source is not null,
(
select as struct
collected_traffic_source.* except(manual_source, manual_medium),
if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source,
if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium
),
null
) ignore nulls
order by
event_timestamp desc
limit
1
) [safe_offset(0)] as session_last_traffic_source
from
events
where
...
group by
...
Another thing to consider is the GBRAID and WBRAID parameters which are used as alternatives to the GCLID parameter in iOS14+ devices.
Finally, even if that solution worked correctly with your traffic data, it would still lack the campaign details. So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.
User Id
Another issue that can cause discrepancies between the GA4 reports and BigQuery data is the use of user ids. When a visitor is assigned a user id mid-session, GA4 will attribute the session’s traffic source to “(direct)” / “(none)”, or I guess the last non-direct source derived from that, regardless of what the session’s first traffic source was.
As of writing this in November 2022, this is still an issue in GA4. However, it doesn’t apply to the event data in BigQuery, so it’s good to keep that in mind when comparing the numbers between GA4 and BigQuery data.
The below image shows unique sessions that have a user id and that were reported as “(direct) / (none)” by GA4. I exported the data from GA4 and joined it with the traffic source table created earlier. As a result, some of the sessions were incorrectly reported as direct even though the traffic source data existed on the event-level.
The following snapshot of some of the events in one of those sessions shows what’s happening. The visit originated from “google / organic”. However, later during the session, a login event occurred, and a user id was assigned to the visitor. That messed up the reporting in GA4 and wiped the original traffic source of the session.
Summary
Recreating the session-level GA4 traffic reports with BigQuery data is not quite as straightforward as it could be. It also requires a lot of processing on BigQuery’s side when repeated daily. In a production setting, the process needs to be well-thought and optimized when up-to-date data is needed in multiple places.
Additionally, the issues related to Google Ads traffic and the use of user ids make it difficult to verify the query results by comparing them with the GA4 reports.
Hopefully, Google will make this easier someday and start providing correct session-level traffic data already in the export, just like they’ve been providing with Universal Analytics.
Finally, another thing to consider is whether it makes sense to replicate the GA4 traffic source reporting. With BigQuery, you can define the session’s traffic source however you like. An alternative approach could be to credit all traffic sources within the same session, not just the first, and split the credit between them.
Fantastic documtentation, thanks!
Hello,
Very useful article. Indeed, it’s sad that session level traffic source is not available and we need to find workarounds to compile the data and create costly queries to obtain the data.
This is a great post as well as the referenced post and has been a big help!
One enhancement that I’d suggest is picking up the last utm parameters in the session. GA4 does not start a new session when the utm parameters are changed, however, it does look like it will carry any mid-session utm parameters through to the next session.
Therefore you need to retain the parameters for the current session (I used page entrances for this) and the last value of the parameters for the the current session – if these are the same as the first, then it doesn’t matter.
If the utm parameters for the session are not defined then you should look back to the last parameters value. Seems to be more robust.
Thanks for the comment Martin!
That is indeed one thing that was missing in the original version. I’ve updated the queries now to capture both the traffic source of the first event, and also the session’s last traffic source.
The first traffic source is used in evaluating the current session’s traffic source. The last traffic source is only being used in the last non-direct lookback.
Do you ran similiar test but based on conversion events scoped traffic?
I’m not sure if I undestand what you mean.
I would use this session-scoped data and join it with the conversion events using the session id. Of course it might not be 100 % accurate, because the traffic source could change mid session, but it shouldn’t be too far off.
To get a more accurate result, you could write a similar query, but use the conversion events as the touchpoints instead of the session starts.
why have you dropped the automatically generated first_visit and session_start events from this evaluation.
GA4 doesn’t use those either in the session traffic source evaluation. I recommend that you check out this post on the topic: https://gtm-gear.com/posts/ga4-sessions-source-medium/
Hi Taneli, thank you for this post! It’s been incredibly useful in better understanding BQ GA4 data and creating a query to calculate sessions.
I wanted to share my considerations and changes to the script: in the first subquery rather than selecting all events except session_start and first_visit I take all events where source or medium are not null or event is session_start. This allows me to ignore visits that started before midnight (other than the ones with a new source/medium) and include sessions which only have a session_start event (I found several in my dataset).
I’m still considering whether to add a workaround to include also session_start events for hits with user_pseudo_id and ga_session_id null. I wanted to know what your opinion on this is?
Thanks again
Hi Alessandra, you’re welcome!
In the first sub query I just take all the events except session_start and first_visit to avoid an extra step in the query. The next part aggregates the results taking in the data from the first event of the session (except the excluded session_start and first_visit events).
GA4 also does this, it ignores the session_starts and first_visits and looks at the first event of the session regardless if it has a traffic source or not. If you add a filter to include only events with a not null source or medium, then the results would change a bit, as you would be crediting the first non null traffic source instead of the data in the first event.
In the next part of the first query I use min(date) as the date field instead of grouping by date. That should fix the issue with the session splitting into two days.
I’m not sure what those sessions that only have a session_start event could be. Session_start should be generated by an event that was sent to GA4. I think I would just disregard the session if it doesn’t have any other events than the session_start.
Do you have a lot of hits without ga_session_id and user_pseudo_id? Are they consent mode hits without the consent granted? If that’s the case, I think it would be difficult to include them here as you can’t really limit them to just one hit per session.
I’ve only seen a few odd hits missing the ga_session_id in my datasets (consent mode not in use). I’ve just disregarded them as they don’t have the identifier that’s needed for the session scope, and the can’t be joined to the events in other tables because of that.
Hello!
Thank You very much for that article!
Can we derive the campaign name of google/cpc traffic when we look at sessions with gclid parameter filled? I cannot find when I do this filter on GA4 raw data.
The campaign name is not included anywhere in the GA4 export data.
However, it’s possible to fetch the data from the Google Ads Data Transfer BigQuery export. The export has the gclid values and the campaign ids in the p_ads_ClickStats table.
You could join the gclid in the GA4 dataset with the gclid in the p_ads_ClickStats table to get the campaign id. Then you can find the campaign name from the ads_Campaign table from the same Google Ads export.
It’s quite a lot of work, but possible. 🙂
And can You, please, tell what exactly You mean by saying?
”
So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.
”
I need campaign information in google/cpc traffic.
This issue would be fixed by including the manual old school utm tags in the Google Ads links as well. Like utm_campaign=my_campaign etc.
Auto-tagging only ads the gclid parameter to the links. It also requires that the GA4 property is linked with the Google Ads account. However, setting up this link to the Google Ads account doesn’t make the campaign data flow into BigQuery as well. In BigQuery, you only get the gclid value.
Hi Taneli, your post is great! I am trying to replicate UA model with GA4 bigquery export and this is very useful. In my case I passed from 150% increment of sessions direct / none traffic to 33% comparing GA4 vs UA using this technics and others I added, but in this last 33% I have not been able to reduce it yet, there is no more information in event parameters or user properties that give me some light. Do you know if it is a common problem in GA4?
I added some additional steps to the model:
1.- If sessions is first visit, the User Source Medium (user acquisition source medium) is taken into account, direct is excluded.
2. If the above is null, the utm_Source / utm_Medium is taken using regex from the page_referrer. Additionally, the gclid is extracted. If it is present, it is assigned to google/cpc.
3.-If the above is null, the first Source Medium from the session parameters is taken. If the first source medium is null, the last non-null source medium from the session is taken. Similarly, if the gclid is found in the parameters, it is assigned to google/cpc. (Your Model)
4.- If a user has multiple sessions on the same day, sessions with null source medium are replaced with the source medium from the user’s first session where the source medium is not null.
5.-If, even after all this, the source medium is still null, the User Source Medium is taken into account. From my point of view, it is preferable to assign it to the user’s acquisition channel rather than direct traffic.
What do you think?
Hi Ivan, thanks!
Yeah, it can be difficult to match the numbers in GA4. I’m actually seeing less direct traffic using these queries than what I see in GA4. The number of direct traffic depends entirely on the lookback window used in the last non-direct traffic source query. The longer lookback window you specify, the less direct traffic you’ll get.
Also, I’m just trying to replicate GA4’s traffic source attribution as closely as possible here. There could be smarter ways to do it as well, that return different looking results. 🙂
Some comments:
1. If the session is a first visit, shouldn’t the session’s traffic source match with the user traffic source already? If you look at the first event of the session (excluding session_start and first_visit events) it should have the same traffic source details as the user traffic source?
2. Hmm, I think page_referrer shouldn’t be needed as the same data would also be included in the previous page’s page_location. The only case would be if you missed the previous page’s events for some reason.
3. It could be a valid point to look at the later non-null session traffic source if the first event didn’t have anything. However, GA4 only looks at the first event (excluding session_start and first_visit events) when it’s determining the session traffic source.
4. This also happens through the second query: the last non-direct traffic source query. It would attribute the session with the traffic source from the preceding session that had one.
5. You could include this, but it’s again a bit different than the behaviour in GA4. 🙂 You could also increase the lookback window in the last non-direct traffic source query to look further back in time.
Great article!
A question for you – isn’t the lookback window a setting inside the admin console? Therefore the query needs to align with that?
Also, how does this align with channel groupings? And session level channel groupings? Same logic?
Thanks for the great post, really helpful. Isn’t it – session_start range between 2592000 preceding and 1 preceding – instead of current row to see the last session?
Yeah, that’s true. Good catch! I’ll fix the query. Thanks for the comment.
Hi there, thanks for this post! In GA4, I can see a (direct) source with metrics attributed to it, but when querying in BigQuery – there’s no (direct)/direct source at all! Are you now experiencing the same issue or can you still see (direct) in BigQuery? Thanks
2 things I’ve picked up is make sure your adjusting the big query event timestamp using the region your using in GA4.
I have also noticed slightly different behaviour for cpc mobile app data compared with web, where the GA4 interface is attributing a session in August as CPC and the only session that was CPC was the first one way back in Jan. Haven’t seen the same behaviour on web traffic so wonder if app sessions are processed differently from the web.
Same happens to me
Hi Taneli, thanks for posting this. I compared the ARRAY_AGG method, for retrieving the session traffic values, to the FIRST_VALUE method, and I found out there are few rows of mismatch between the two methods (I expected exactly the same results from the two queries). Both are ordering by event_timestamp.
Could you spot the difference ?
Hi,
If you mean the earlier version of my query, I think that might have also had a different value for the attribution lookback window, perhaps 90 days? Both FIRST_VALUE and ARRAY_AGG (for getting the first value of the session) should return the same results. ARRAY_AGG was just had a clear query performance increase over FIRST_VALUE.
Hello,
Coming late to the discussion bu thank you so much for posting this.
My only question: is this query replicating the Data Driven attribution model set by default by GA4?
Hi,
It’s not replicating the Data-driven model. It’s replicating the Paid and organic last click model (or last non-direct click as it’s been referred to as earlier).
I think you could use the data from the first query to build your own model that would resemble the data-driven attribution model. I believe the algorithm that it’s based on is called Shapley Values.
Hi Taneli,
Can I share your code on my github page? With references to this article and you personally of course.
I modified your code a bit, that now it works almost perfectly (at least with my data) compared to what I see in GA4 UI. And this is the only code which works, and I went through 5-7 solutions already.
Maybe this could be helpful to others….
And yeah, many thanks for such a detailed description.
Hi,
Yes, feel free to post your version on Github. 🙂
Thank you!
Here’s the code:
https://gist.github.com/AlexDinahl/7dd82169290dfb28359c3092f5408809
Hi Alex – would you mind sharing your solution? I’d be interested to see what you’ve changed.
Hi Rob,
Sorry, just stuck in the daily routines. Here’s the code:
https://gist.github.com/AlexDinahl/7dd82169290dfb28359c3092f5408809
Hi Taneli, really appreciate your post, your query is super useful. I have a question, what if I want to filter the query to see data for only a specific conversion event, however that event_name has no ‘gclid’, but other events inside the same session have the gclid, so the user should be attributed to a ‘cpc’ medium according to your query. But if I filter to see only that specific event, the user attribution would be ‘direct’. Please see the screenshot in the below link:
https://imageupload.io/en/CcN5V2qNOsJCn2c
Hi Taneli, the material is great! thanks for this article, read your other materials and a lot of things became clearer!
Taneli,
Thank you for your query process. I compared your methodology to several others with our data using a GA4 exploration as a control and it produced the best, closest results (usually with a variance of 0.25-0.30% in total sessions).
One question though: We’re moving the results into tables and noticed that some traffic source fields were not included in the last traffic source. Campaign ID, campaign name, source and medium are included – but not content, term, GCLID, DCLID, and SRSLTID. When we move to the query part where we identify last non-direct, term does appear but not the others. Is there a way to force all fields to be represented in last traffic source and last non-direct? We do have campaign data in all these fields (but, of course, not with organic traffic). Thank you for your help
Hi Taneli, this is the best solution I’ve seen so far. However, i do see a down side of using min(date) for sessions that span over midnight. If for example session starts on Dec 27th and spans over to 28th and transaction happens on the 28th, then you will have no session on the Dec 28th to join with the transaction on that day. So eventually the transaction data is lost. What do you think?
“I’m trying to aggregate values like purchase and purchase_revenue_in_usd in the prep CTE section of my query, but the aggregation isn’t working correctly due to the use of min(date). How can I modify the base query to correctly retrieve metrics for purchase, purchase_revenue_in_usd, and new users?”
Hey Taneli,
great stuff! I have still some questions as the direct and none values unfortunately are still inaccurate. The inaccuracies reach over 200-400 users and sessions. Also if you sum all the values from the users or sessions the total also is more than GA4 gives. So my suspicion is that it counts the same users multiple times and adds them to different dimension values e.g. the same user_pseudo_id for both organic and direct.
Another thing is I am curios what you think about this query: https://optimizationup.com/recreating-ga4-traffic-acquisition-report-in-bigquery/
I think it’s pretty accurate but I don’t understand why he tries to also extract data from traffic_source (which is user based) when he counts session based metrics. Do you have an idea?
p.s. I tried connected with you on Linkedin
Thanks for providing this process! We have compared it to several other processes in the field and consider this one to be the simplest but most accurate implementation. When we compare results to GA4 Explore, we generally get total session variances between 0.2% and 1.8%. A 2% variance is our threshold for accuracy and we haven’t yet seen that high of a variance for this process.
Hi Taneli. Very much a newbie to the world of ga4, so firstly thanks for an excellent article, clear and concise. Do you use an incremental load strategy for the traffic sources query? We have large volumes so trying to understand any techniques to help reduce costs.
Hi Taneli,
I believe the Last non-direct traffic source query is wrong in the line 8. The ifnull is being applied into the column first session_first_traffic_source when it should be session_last_traffic_source.
Hi Edward,
That’s intentional. If the session itself has any traffic source data, the ifnull will get the first traffic source from the event that started the session. If the session’s start didn’t contain any traffic source data it will look back into the preceding sessions. With the preceding sessions, the sessions’ last traffic source data is considered instead.
In GA4, one session can contain multiple traffic sources. The logic in this query should be the same as what GA4 itself uses. Of course, you could adjust it to take the last traffic source instead, in case the session itself has many traffic sources.
Recently Google changed the export schema to include the session traffic fields based on last non direct. How would this impact these queries?
With the new session_traffic_source_last_click fields, you might not need to use a semi complex query like this at all. That is if you are fine with using the standard attribution logic from Google and with the data being available only since July 2024. The data in the new fields should match the last non-direct attribution in the GA4 UI pretty closely.
However, Google also recently broke the Google Ads auto tagging in the EU region. As a result of that, the campaign information is not always available in the new session_traffic_source_last_click fields either. With a custom approach, like the query in this post, it’s still possible to use the individual gclid values and the Google Ads BigQuery Transfer to fill in the correct campaign details.
And of course, the custom approach, that’s based on event-level data, allows adjusting the attribution and defining other kinds of attribution models as well.