Query the GA4 Session Last Non-direct Traffic Source in BigQuery

One of the shortcomings of the GA4 BigQuery export has been its lack of session-level traffic source data, which wasn’t fixed until 2024-07-17. Data on traffic sources used to be available only at the user and event levels, with perhaps the most important session scope missing. Having to rely on event-level data made analyzing session-level traffic acquisition quite inconvenient.

This blog post will show how to recreate the GA4 session traffic source dimensions using BigQuery export event data. The result will be a lookup table containing the last non-direct source of traffic for each unique session id.

Before jumping into the SQL queries, let’s go through the basics.

Post updates

  • 2023-03-24: Improvements to the last non-direct attribution. Read the last non-direct traffic source from the last preceding event traffic source.
  • 2023-03-29: Improvements to the query performance in the first session traffic source query. Use group by approach instead of a window function.
  • 2023-10-24: Use the newly added collected_traffic_source struct in the GA4 export for getting the event-level traffic source parameters.
  • 2024-10-17: Comment on the new session_traffic_source_last_click data.

Comment on the new session_traffic_source_last_click data

Google included the new session_traffic_source_last_click struct in the GA4 BigQuery export on 2024-07-17. After that, they updated the field again on 2024-10-09

With the most recent update, the integrations to SA360, CM360, and DV360 were also included. On top of that, the latest version also includes cross_channel_campaign, which combines all of the integrations into similar session traffic source data that you see in the GA4 UI.

Does it make sense to do the attribution based on event data anymore?

The new session_traffic_source_last_click matches quite closely with the data you see in the GA4 UI. Based on that, it’s a welcome addition to the export.

However, there are still some reasons why it might make sense to utilize the event-level data:

  1. The need to cover data older than 2024-10-09 or 2024-07-17,
  2. Building a custom attribution model,
  3. Fixing GA4’s data processing issues related to traffic sources.

Although I designed the queries in this post to replicate the last non-direct attribution that GA4 applies, they can also be adjusted to support other types of attribution models. Extracting and sessionizing the event-level data is still required for such use cases.

What kind of traffic source data is available in BigQuery?

The most common source of confusion in the GA4 export seems to relate to the poorly named traffic_source parameters. Traffic_source contains the user-level first-touch traffic source, not the session’s traffic source.

Additionally, the individual events in BigQuery can contain event parameters such as source, medium, and campaign. These are the event-level traffic source details. The parameters are nested inside the event_params field of the event.

traffic source data in the ga4 bigquery schema
GA4 event schema

For help with querying nested event parameters, see my earlier post.

Collected traffic source

Update: Since June 2023, you can also find the event-level traffic source details using the newly added collected_traffic_source struct in the GA4 BigQuery export. 

Collected_traffic_source is a collection of the same event-level traffic source details that you can find nested inside the event_params array. Compared to having to unnest each parameter, the struct format makes the data much more easily accessible.

However, as the data in this new format is only available since June 2023, you’ll need to rely on the event_params for the data older than that. That is not an issue. I’ll demonstrate later how you can use event_params to build a struct that looks exactly like the collected_traffic_source struct.

Event traffic source details can be found from the collected_traffic_source struct.
Collected_traffic_source struct contents

How is the session traffic source defined in GA4?

In Universal Analytics, the traffic source dimensions were quite clear. That’s because a changed source of traffic during a session would always create a new session. Therefore, there could only ever be one source of traffic per session.

GA4 is different. A change in the source of traffic will no longer trigger a new session. Instead, if a new traffic source is detected within the session, the session will continue with the traffic source details captured in the events’ parameters.

So how does GA4 report session source and medium?

I tested this by retrieving the session source and medium per unique session id from the GA4 Data API and compared that to the event-level source and medium data from the BigQuery export. In cases where the session included multiple different source and medium pairs, the source / medium reported in the GA4 interface was the one that appeared first.

In the following example session, the visit starts directly without traffic source data. The first page view is followed by another one that includes the utm_source and utm_medium parameters. Later, there’s another utm_source & utm_medium pair with different values. In GA4, the final source / medium of the session is reported as “(direct) / (none)”

example session in BigQuery with different event-level traffic sources
Viewed pages in an example session

GA4 tracks the session’s first event’s traffic source as the source for the entire session. Automatically generated first_visit and session_start events are not included in this evaluation.

If you want to dig deeper into the GA session source magic, I recommend reading this excellent investigation in the GTM-Gear blog.

How does the last non-direct attribution work in GA4?

So, we went through how the session traffic source is derived. How about the last non-direct attribution or Paid and organic last click, as Google has decided to rename it?

When looking back to find the last non-direct traffic source, GA4 has a slightly different approach. Instead of looking at the first event of each preceding session, GA4 will get the data from the last event that includes traffic source details.

GA4 session last non-direct attribution example
In this example, Facebook doesn’t get credited in the first session as it’s not the traffic source of the first event. However, Facebook will be the last non-direct traffic source of the second session.

Also, keep in mind that Paid and organic last click is not the default attribution model in GA4. GA4 defaults to the Data-driven attribution model, a more complex model that assigns weights between the different touchpoints that preceded the session. If you wish to build something similar to the Data-driven model yourself, you could utilize the queries in this post as the input data for the model.

Query Process

To make the process a bit easier to manage, I’ve divided it into two different SQL queries:

  1. Create a table containing the first and last traffic source per session.
  2. Identify the last non-direct source of traffic in case the current session had no traffic source data.

The main reason for doing this in two parts is to make the queries less expensive. The last non-direct lookback will need to process a lot of data whenever the results table is updated. For example, a 30-day lookback would require 30+1 days of data to increment the results table by one day.

Ideally, I’d save the results of the first SQL query to a date-partitioned table and have that updated incrementally. The following query would then rely on that to fill in the last non-direct traffic source in case the session had no traffic source data.

I’m trying to keep this post simple and on topic, so I’ll leave out the incremental update and other query optimization parts.

You can check out my post on Dataform incremental refresh strategies for help in turning these queries into a proper daily updated data source.

Query the Session Traffic Source

The first SQL query will go through the events and return the first and last traffic source per each unique session id. The values will be null if the traffic source data doesn’t exist.

You can save the query’s results in a new table and reference that in the next query instead of rerunning the whole thing with all event data.

-- extract the event data needed for the session traffic source details
with events as (
  select
    cast(event_date as date format 'YYYYMMDD') as date,
    -- unique session id
    concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start,
    collected_traffic_source,
    event_timestamp
  from
    `<project>.<dataset>.events_*`
  where
    (_table_suffix >= '20230701' and _table_suffix <= '20231231')
    -- starting from 20231102 the sessions_start and first_visit events don't need to be excluded anymore
    and event_name not in ('session_start', 'first_visit')
)
select
  min(date) as date,
  session_id,
  user_pseudo_id,
  session_start,
  -- the traffic source of the first event in the session with session_start and first_visit excluded
  array_agg(
    collected_traffic_source
    order by
      event_timestamp asc
    limit
      1
  ) [safe_offset(0)] as session_first_traffic_source,
  -- the last not null traffic source of the session
  array_agg(
    collected_traffic_source ignore nulls
    order by
      event_timestamp desc
    limit
      1
  ) [safe_offset(0)] as session_last_traffic_source
from
  events
where
  session_id is not null
group by
  session_id,
  user_pseudo_id,
  session_start

Update 1: The query utilizes the collected_traffic_source struct that was included in the BigQuery GA4 export in June 2023. With data older than that, you’ll need to query the nested event parameters. Replace collected_traffic_source in the above query (line 9) with the below SQL to generate a similar struct using the data from event_params.

if(
(select value.string_value from unnest(event_params) where key in ('campaign_id', 'campaign', 'source', 'medium', 'term', 'content', 'gclid', 'dclid', 'srsltid') and value.string_value is not null limit 1) is not null,
(
  select as struct
    (select value.string_value from unnest(event_params) where key = 'campaign_id') as manual_campaign_id,
    (select value.string_value from unnest(event_params) where key = 'campaign') as manual_campaign_name,
    (select value.string_value from unnest(event_params) where key = 'source') as manual_source,
    (select value.string_value from unnest(event_params) where key = 'medium') as manual_medium,
    (select value.string_value from unnest(event_params) where key = 'term') as manual_term,
    (select value.string_value from unnest(event_params) where key = 'content') as manual_content,
    (select value.string_value from unnest(event_params) where key = 'gclid') as gclid,
    (select value.string_value from unnest(event_params) where key = 'dclid') as dclid,
    (select value.string_value from unnest(event_params) where key = 'srsltid') as srsltid
),
null
) as collected_traffic_source,

Update 2: With the daily exports starting from November 2, 2023, it’s no longer necessary to exclude the session_start and first_visit events as they now also include the required traffic source parameters. However, it’s still best to get the data from the first event of the session, not from a session_start event exclusively. For more details, see my other post.

Results Without Last Non-direct

As a benchmark, I took a small sample of source / medium data from the GA4 interface.

example ga4 session traffic source data
Example session data from GA4

When looking at the data from the newly created table for the same date range, it doesn’t quite match with GA4. That’s because the last non-direct attribution is yet to be applied.

select
  session_first_traffic_source.manual_source as source,
  session_first_traffic_source.manual_medium as medium,
  count(distinct session_id) as sessions
from
  `<project>.<dataset>.<session traffic source table>`
where
  date between '<start>'
  and '<end>'
group by
  1,
  2
order by
  sessions desc, manual_source
sql query results without last non-direct session traffic source
Sessions per source / medium without last non-direct attribution

Query the Last Non-direct Traffic Source

The second SQL query will take the session-level data from the previously created table and fill the missing traffic sources with the last non-direct source.

The query checks if the session already has traffic source data in session_first_traffic_source. If there is none, it will look for the same user’s previous not null value for session_last_traffic_source.

The window order and range in the query are based on session_start, which is the value of the ga_session_id parameter. Ga_session_id is a Unix timestamp that is assigned when the session begins.

select
  date,
  session_id,
  user_pseudo_id,
  session_start,
  session_first_traffic_source,
  ifnull(
    session_first_traffic_source,
    last_value(session_last_traffic_source ignore nulls) over(
      partition by user_pseudo_id
      order by
        session_start range between 2592000 preceding
        and 1 preceding -- 30 day lookback
    )
  ) as session_traffic_source_last_non_direct,
from
  `<project>.<dataset>.<session traffic source table>`

You can adjust the lookback window of the query. Based on running a few tests, 30 days seems close to what GA4 uses.

A larger lookback window will find traffic sources further back in time. However, at the same time, the cost of the query will increase.

Results

Let’s see if the applied last non-direct attribution results in numbers more closely resembling the Traffic acquisition report in GA4. Additionally, we can convert the null values to “(direct)” and “(none)” to make it look more like in the GA4 report.

select
  ifnull(session_traffic_source_last_non_direct.manual_source, 
    '(direct)'
  ) as source,
  ifnull(session_traffic_source_last_non_direct.manual_medium, 
    '(none)'
  ) as medium,
  count(distinct session_id) as sessions
from
  `<project>.<dataset>.<session last non-direct traffic source table>`
where
  date between '<start>'
  and '<end>'
group by
  1,
  2
order by
  sessions desc, source
sql query results with last non-direct session traffic source

The numbers based on my sample data now match the GA4 report almost exactly. The SQL query attributed two more sessions to “google / organic” than what GA4 did. I ran the query with the 90-day lookback window. With a 30-day lookback, there would have been a few more direct sessions and a few less in the other source / mediums.

Additional Considerations

The query worked quite nicely with my test dataset. Based on that, the base logic seems correct. However, there are a couple of common issues which weren’t present in the dataset that I used.

GA4 has some issues related to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging uses the GCLID parameter for connecting the website visit with the corresponding campaign. With auto-tagging enabled, the usual utm tags shouldn’t be needed.

However, this doesn’t quite work with the BigQuery export. In BigQuery, the sessions that started with an event containing the GCLID parameter can either have missing source and medium details or have something like “google / organic” or “youtube.com / referral” as their source medium.

example session source mediums with a a gclid parameter
Example sessions with a GCLID parameter

A simple fix for this is adjusting the first session traffic source query to check the GCLID parameter. If the parameter exists, the source and medium values can be overwritten with “google” and “cpc”. However, this approach could be too simplistic, as other traffic sources can also utilize the GCLID parameter.


with events as (
  ...
)
select
  ...
  -- fix the session_first_traffic_source and session_last_traffic_source structs below
  -- modify the manual_source and manual_medium fields in case a gclid param was included
  array_agg(
    if(
      collected_traffic_source is not null,
      (
          select as struct 
            collected_traffic_source.* except(manual_source, manual_medium),
            if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source,
            if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium
      ),
      null
    )
    order by
      event_timestamp asc
    limit
      1
  ) [safe_offset(0)] as session_first_traffic_source,
  array_agg(
    if(
      collected_traffic_source is not null,
      (
          select as struct 
            collected_traffic_source.* except(manual_source, manual_medium),
            if(collected_traffic_source.gclid is not null, 'google', collected_traffic_source.manual_source) as manual_source,
            if(collected_traffic_source.gclid is not null, 'cpc', collected_traffic_source.manual_medium) as manual_medium
      ),
      null
    ) ignore nulls
    order by
      event_timestamp desc
    limit
      1
  ) [safe_offset(0)] as session_last_traffic_source
from
  events
where
  ...
group by
  ...

Another thing to consider is the GBRAID and WBRAID parameters which are used as alternatives to the GCLID parameter in iOS14+ devices.

Finally, even if that solution worked correctly with your traffic data, it would still lack the campaign details. So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.

User Id

Another issue that can cause discrepancies between the GA4 reports and BigQuery data is the use of user ids. When a visitor is assigned a user id mid-session, GA4 will attribute the session’s traffic source to “(direct)” / “(none)”, or I guess the last non-direct source derived from that, regardless of what the session’s first traffic source was.

As of writing this in November 2022, this is still an issue in GA4. However, it doesn’t apply to the event data in BigQuery, so it’s good to keep that in mind when comparing the numbers between GA4 and BigQuery data.

The below image shows unique sessions that have a user id and that were reported as “(direct) / (none)” by GA4. I exported the data from GA4 and joined it with the traffic source table created earlier. As a result, some of the sessions were incorrectly reported as direct even though the traffic source data existed on the event-level.

Example session traffic sources with user id sessions
Example user id sessions, traffic source according to GA4 compared to BigQuery

The following snapshot of some of the events in one of those sessions shows what’s happening. The visit originated from “google / organic”. However, later during the session, a login event occurred, and a user id was assigned to the visitor. That messed up the reporting in GA4 and wiped the original traffic source of the session.

example session journey with a traffic source and assigned user id
Example session journey with a user id assigned mid-session

Summary

Recreating the session-level GA4 traffic reports with BigQuery data is not quite as straightforward as it could be. It also requires a lot of processing on BigQuery’s side when repeated daily. In a production setting, the process needs to be well-thought and optimized when up-to-date data is needed in multiple places. 

Additionally, the issues related to Google Ads traffic and the use of user ids make it difficult to verify the query results by comparing them with the GA4 reports.

Finally, another thing to consider is whether it makes sense to replicate the GA4 traffic source reporting. With BigQuery, you can define the session’s traffic source however you like. An alternative approach could be to credit all traffic sources within the same session, not just the first, and split the credit between them.

👋 Want to stay up to date on the latest blog posts? Follow me on LinkedIn.

50 thoughts on “Query the GA4 Session Last Non-direct Traffic Source in BigQuery”

  1. Hello,
    Very useful article. Indeed, it’s sad that session level traffic source is not available and we need to find workarounds to compile the data and create costly queries to obtain the data.

  2. Avatar photo
    Martin Neighbours

    This is a great post as well as the referenced post and has been a big help!

    One enhancement that I’d suggest is picking up the last utm parameters in the session. GA4 does not start a new session when the utm parameters are changed, however, it does look like it will carry any mid-session utm parameters through to the next session.

    Therefore you need to retain the parameters for the current session (I used page entrances for this) and the last value of the parameters for the the current session – if these are the same as the first, then it doesn’t matter.

    If the utm parameters for the session are not defined then you should look back to the last parameters value. Seems to be more robust.

    1. Avatar photo
      Taneli Salonen

      Thanks for the comment Martin!

      That is indeed one thing that was missing in the original version. I’ve updated the queries now to capture both the traffic source of the first event, and also the session’s last traffic source.

      The first traffic source is used in evaluating the current session’s traffic source. The last traffic source is only being used in the last non-direct lookback.

    1. Avatar photo
      Taneli Salonen

      I’m not sure if I undestand what you mean.

      I would use this session-scoped data and join it with the conversion events using the session id. Of course it might not be 100 % accurate, because the traffic source could change mid session, but it shouldn’t be too far off.

      To get a more accurate result, you could write a similar query, but use the conversion events as the touchpoints instead of the session starts.

  3. why have you dropped the automatically generated first_visit and session_start events from this evaluation.

  4. Hi Taneli, thank you for this post! It’s been incredibly useful in better understanding BQ GA4 data and creating a query to calculate sessions.

    I wanted to share my considerations and changes to the script: in the first subquery rather than selecting all events except session_start and first_visit I take all events where source or medium are not null or event is session_start. This allows me to ignore visits that started before midnight (other than the ones with a new source/medium) and include sessions which only have a session_start event (I found several in my dataset).

    I’m still considering whether to add a workaround to include also session_start events for hits with user_pseudo_id and ga_session_id null. I wanted to know what your opinion on this is?

    Thanks again

    1. Avatar photo
      Taneli Salonen

      Hi Alessandra, you’re welcome!

      In the first sub query I just take all the events except session_start and first_visit to avoid an extra step in the query. The next part aggregates the results taking in the data from the first event of the session (except the excluded session_start and first_visit events).

      GA4 also does this, it ignores the session_starts and first_visits and looks at the first event of the session regardless if it has a traffic source or not. If you add a filter to include only events with a not null source or medium, then the results would change a bit, as you would be crediting the first non null traffic source instead of the data in the first event.

      In the next part of the first query I use min(date) as the date field instead of grouping by date. That should fix the issue with the session splitting into two days.

      I’m not sure what those sessions that only have a session_start event could be. Session_start should be generated by an event that was sent to GA4. I think I would just disregard the session if it doesn’t have any other events than the session_start.

      Do you have a lot of hits without ga_session_id and user_pseudo_id? Are they consent mode hits without the consent granted? If that’s the case, I think it would be difficult to include them here as you can’t really limit them to just one hit per session.

      I’ve only seen a few odd hits missing the ga_session_id in my datasets (consent mode not in use). I’ve just disregarded them as they don’t have the identifier that’s needed for the session scope, and the can’t be joined to the events in other tables because of that.

  5. Hello!
    Thank You very much for that article!
    Can we derive the campaign name of google/cpc traffic when we look at sessions with gclid parameter filled? I cannot find when I do this filter on GA4 raw data.

    1. Avatar photo
      Taneli Salonen

      The campaign name is not included anywhere in the GA4 export data.

      However, it’s possible to fetch the data from the Google Ads Data Transfer BigQuery export. The export has the gclid values and the campaign ids in the p_ads_ClickStats table.

      You could join the gclid in the GA4 dataset with the gclid in the p_ads_ClickStats table to get the campaign id. Then you can find the campaign name from the ads_Campaign table from the same Google Ads export.

      It’s quite a lot of work, but possible. 🙂

  6. And can You, please, tell what exactly You mean by saying?


    So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.

    I need campaign information in google/cpc traffic.

    1. Avatar photo
      Taneli Salonen

      This issue would be fixed by including the manual old school utm tags in the Google Ads links as well. Like utm_campaign=my_campaign etc.

      Auto-tagging only ads the gclid parameter to the links. It also requires that the GA4 property is linked with the Google Ads account. However, setting up this link to the Google Ads account doesn’t make the campaign data flow into BigQuery as well. In BigQuery, you only get the gclid value.

  7. Avatar photo
    Ivan Jardon

    Hi Taneli, your post is great! I am trying to replicate UA model with GA4 bigquery export and this is very useful. In my case I passed from 150% increment of sessions direct / none traffic to 33% comparing GA4 vs UA using this technics and others I added, but in this last 33% I have not been able to reduce it yet, there is no more information in event parameters or user properties that give me some light. Do you know if it is a common problem in GA4?

    I added some additional steps to the model:

    1.- If sessions is first visit, the User Source Medium (user acquisition source medium) is taken into account, direct is excluded.
    2. If the above is null, the utm_Source / utm_Medium is taken using regex from the page_referrer. Additionally, the gclid is extracted. If it is present, it is assigned to google/cpc.
    3.-If the above is null, the first Source Medium from the session parameters is taken. If the first source medium is null, the last non-null source medium from the session is taken. Similarly, if the gclid is found in the parameters, it is assigned to google/cpc. (Your Model)
    4.- If a user has multiple sessions on the same day, sessions with null source medium are replaced with the source medium from the user’s first session where the source medium is not null.
    5.-If, even after all this, the source medium is still null, the User Source Medium is taken into account. From my point of view, it is preferable to assign it to the user’s acquisition channel rather than direct traffic.

    What do you think?

    1. Avatar photo
      Taneli Salonen

      Hi Ivan, thanks!

      Yeah, it can be difficult to match the numbers in GA4. I’m actually seeing less direct traffic using these queries than what I see in GA4. The number of direct traffic depends entirely on the lookback window used in the last non-direct traffic source query. The longer lookback window you specify, the less direct traffic you’ll get.

      Also, I’m just trying to replicate GA4’s traffic source attribution as closely as possible here. There could be smarter ways to do it as well, that return different looking results. 🙂

      Some comments:

      1. If the session is a first visit, shouldn’t the session’s traffic source match with the user traffic source already? If you look at the first event of the session (excluding session_start and first_visit events) it should have the same traffic source details as the user traffic source?

      2. Hmm, I think page_referrer shouldn’t be needed as the same data would also be included in the previous page’s page_location. The only case would be if you missed the previous page’s events for some reason.

      3. It could be a valid point to look at the later non-null session traffic source if the first event didn’t have anything. However, GA4 only looks at the first event (excluding session_start and first_visit events) when it’s determining the session traffic source.

      4. This also happens through the second query: the last non-direct traffic source query. It would attribute the session with the traffic source from the preceding session that had one.

      5. You could include this, but it’s again a bit different than the behaviour in GA4. 🙂 You could also increase the lookback window in the last non-direct traffic source query to look further back in time.

      1. Great article!

        A question for you – isn’t the lookback window a setting inside the admin console? Therefore the query needs to align with that?

        Also, how does this align with channel groupings? And session level channel groupings? Same logic?

  8. Thanks for the great post, really helpful. Isn’t it – session_start range between 2592000 preceding and 1 preceding – instead of current row to see the last session?

  9. Hi there, thanks for this post! In GA4, I can see a (direct) source with metrics attributed to it, but when querying in BigQuery – there’s no (direct)/direct source at all! Are you now experiencing the same issue or can you still see (direct) in BigQuery? Thanks

  10. 2 things I’ve picked up is make sure your adjusting the big query event timestamp using the region your using in GA4.
    I have also noticed slightly different behaviour for cpc mobile app data compared with web, where the GA4 interface is attributing a session in August as CPC and the only session that was CPC was the first one way back in Jan. Haven’t seen the same behaviour on web traffic so wonder if app sessions are processed differently from the web.

  11. Hi Taneli, thanks for posting this. I compared the ARRAY_AGG method, for retrieving the session traffic values, to the FIRST_VALUE method, and I found out there are few rows of mismatch between the two methods (I expected exactly the same results from the two queries). Both are ordering by event_timestamp.
    Could you spot the difference ?

    1. Avatar photo
      Taneli Salonen

      Hi,
      If you mean the earlier version of my query, I think that might have also had a different value for the attribution lookback window, perhaps 90 days? Both FIRST_VALUE and ARRAY_AGG (for getting the first value of the session) should return the same results. ARRAY_AGG was just had a clear query performance increase over FIRST_VALUE.

  12. Hello,

    Coming late to the discussion bu thank you so much for posting this.

    My only question: is this query replicating the Data Driven attribution model set by default by GA4?

    1. Avatar photo
      Taneli Salonen

      Hi,

      It’s not replicating the Data-driven model. It’s replicating the Paid and organic last click model (or last non-direct click as it’s been referred to as earlier).

      I think you could use the data from the first query to build your own model that would resemble the data-driven attribution model. I believe the algorithm that it’s based on is called Shapley Values.

  13. Hi Taneli,

    Can I share your code on my github page? With references to this article and you personally of course.

    I modified your code a bit, that now it works almost perfectly (at least with my data) compared to what I see in GA4 UI. And this is the only code which works, and I went through 5-7 solutions already.

    Maybe this could be helpful to others….

    And yeah, many thanks for such a detailed description.

    1. Hi Alex – would you mind sharing your solution? I’d be interested to see what you’ve changed.

  14. Hi Taneli, really appreciate your post, your query is super useful. I have a question, what if I want to filter the query to see data for only a specific conversion event, however that event_name has no ‘gclid’, but other events inside the same session have the gclid, so the user should be attributed to a ‘cpc’ medium according to your query. But if I filter to see only that specific event, the user attribution would be ‘direct’. Please see the screenshot in the below link:
    https://imageupload.io/en/CcN5V2qNOsJCn2c

  15. Hi Taneli, the material is great! thanks for this article, read your other materials and a lot of things became clearer!

  16. Taneli,
    Thank you for your query process. I compared your methodology to several others with our data using a GA4 exploration as a control and it produced the best, closest results (usually with a variance of 0.25-0.30% in total sessions).

    One question though: We’re moving the results into tables and noticed that some traffic source fields were not included in the last traffic source. Campaign ID, campaign name, source and medium are included – but not content, term, GCLID, DCLID, and SRSLTID. When we move to the query part where we identify last non-direct, term does appear but not the others. Is there a way to force all fields to be represented in last traffic source and last non-direct? We do have campaign data in all these fields (but, of course, not with organic traffic). Thank you for your help

  17. Hi Taneli, this is the best solution I’ve seen so far. However, i do see a down side of using min(date) for sessions that span over midnight. If for example session starts on Dec 27th and spans over to 28th and transaction happens on the 28th, then you will have no session on the Dec 28th to join with the transaction on that day. So eventually the transaction data is lost. What do you think?

  18. “I’m trying to aggregate values like purchase and purchase_revenue_in_usd in the prep CTE section of my query, but the aggregation isn’t working correctly due to the use of min(date). How can I modify the base query to correctly retrieve metrics for purchase, purchase_revenue_in_usd, and new users?”

  19. Hey Taneli,

    great stuff! I have still some questions as the direct and none values unfortunately are still inaccurate. The inaccuracies reach over 200-400 users and sessions. Also if you sum all the values from the users or sessions the total also is more than GA4 gives. So my suspicion is that it counts the same users multiple times and adds them to different dimension values e.g. the same user_pseudo_id for both organic and direct.

    Another thing is I am curios what you think about this query: https://optimizationup.com/recreating-ga4-traffic-acquisition-report-in-bigquery/

    I think it’s pretty accurate but I don’t understand why he tries to also extract data from traffic_source (which is user based) when he counts session based metrics. Do you have an idea?

    p.s. I tried connected with you on Linkedin

  20. Thanks for providing this process! We have compared it to several other processes in the field and consider this one to be the simplest but most accurate implementation. When we compare results to GA4 Explore, we generally get total session variances between 0.2% and 1.8%. A 2% variance is our threshold for accuracy and we haven’t yet seen that high of a variance for this process.

  21. Hi Taneli. Very much a newbie to the world of ga4, so firstly thanks for an excellent article, clear and concise. Do you use an incremental load strategy for the traffic sources query? We have large volumes so trying to understand any techniques to help reduce costs.

  22. Hi Taneli,
    I believe the Last non-direct traffic source query is wrong in the line 8. The ifnull is being applied into the column first session_first_traffic_source when it should be session_last_traffic_source.

    1. Avatar photo
      Taneli Salonen

      Hi Edward,

      That’s intentional. If the session itself has any traffic source data, the ifnull will get the first traffic source from the event that started the session. If the session’s start didn’t contain any traffic source data it will look back into the preceding sessions. With the preceding sessions, the sessions’ last traffic source data is considered instead.

      In GA4, one session can contain multiple traffic sources. The logic in this query should be the same as what GA4 itself uses. Of course, you could adjust it to take the last traffic source instead, in case the session itself has many traffic sources.

  23. Recently Google changed the export schema to include the session traffic fields based on last non direct. How would this impact these queries?

    1. Avatar photo
      Taneli Salonen

      With the new session_traffic_source_last_click fields, you might not need to use a semi complex query like this at all. That is if you are fine with using the standard attribution logic from Google and with the data being available only since July 2024. The data in the new fields should match the last non-direct attribution in the GA4 UI pretty closely.

      However, Google also recently broke the Google Ads auto tagging in the EU region. As a result of that, the campaign information is not always available in the new session_traffic_source_last_click fields either. With a custom approach, like the query in this post, it’s still possible to use the individual gclid values and the Google Ads BigQuery Transfer to fill in the correct campaign details.

      And of course, the custom approach, that’s based on event-level data, allows adjusting the attribution and defining other kinds of attribution models as well.

  24. Another question here, how to you handle the Null Ga_Session_ID and User_Pseudo_Id. While adding add measure like purchase, purchase_revenue_in_usd with this on session_id, the revenue from Null Ga_Session_ID and User_Pseudo_Id, is not counted in.

    1. Avatar photo
      Taneli Salonen

      If the ga_session_id and user_pseudo_id are null, that likely means that the event is an “anonymous” consent mode hit.

      It’s not possible to do session-level attribution with events that can’t be tied to a session because of the missing session identifier.

      1. Appreciate your response. With the latest release from google, could Session_traffic_source_last_click be used? What would be the difference, between using Collected_Traffic_Source vs Session_traffic_source_last_click?

      2. Thank you and appreciate your response.
        With Google’s latest release, could Session_traffic_source_last_click be used instead of Collected_Traffic_Source ? do you have any thoughts on that?

        1. Avatar photo
          Taneli Salonen

          I’ve included a comment on the top of the blog post addressing the new Session_traffic_source_last_click data.

          Shortly put, the new data seems to be working as expected with a few caveats. So a much welcome addition to the export. 🙂

  25. On the query for the Query the Session Traffic Source section, the ARRAY_AGG for the last traffic source can “pull” a traffic source which happened after the key event (a purchase).
    For instance, a user brought a product after clicking an ad that lead him to my website. If in the same session the same user change his traffic source by clicking on another ad, the query will say that the source is the last one of the session, not the last source right before the event.
    Maybe we could use LAST_VALUE with ORDER BY and RANGE, but ARRAY_AGG doesn’t support it.
    Do you have any tips for this problem?

    1. Avatar photo
      Taneli Salonen

      Yes, the array_agg for the last traffic source will take whatever was the last traffic source during the session. However, in the later query, the session_last_traffic_source data is used only together with sessions that preceded the ongoing session.

      The attribution query will use session_first_traffic_source as the primary source of traffic and only utilize session_last_traffic_source when looking back to the preceding sessions to get the last non-direct attribution correct. It will not attribute the current session’s traffic source to the one that happened after the the purchase. 🙂

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top