Query the GA4 Session Last Non-direct Traffic Source in BigQuery

One of the shortcomings of the GA4 BigQuery export is its lack of session-level traffic source data. Data on traffic sources are only available at the user and event levels. Having to rely on event-level data makes analyzing session-level traffic acquisition quite inconvenient.

This blog post will show how to recreate the GA4 session traffic source dimensions using BigQuery export event data. The result will be a lookup table containing the last non-direct source of traffic for each unique session id.

Before jumping into the SQL queries, let’s go through the basics.

Post updates

  • 2023-03-24: Improvements to the last non-direct attribution. Read the last non-direct traffic source from the last preceding event traffic source.

What kind of traffic source data is available in BigQuery?

The most common source of confusion in the GA4 export seems to relate to the poorly named traffic_source parameters. Traffic_source contains the user-level first-touch traffic source, not the session’s traffic source.

Additionally, the individual events in BigQuery can contain event parameters such as source, medium, and campaign. The parameters are nested inside the event_params field of the event.

traffic source data in the ga4 bigquery schema
GA4 event schema

For help with querying nested event parameters, see my earlier post.

How is the session traffic source defined in GA4?

In Universal Analytics, the traffic source dimensions were quite clear. That’s because a changed source of traffic during a session would always create a new session. Therefore, there could only ever be one source of traffic per session.

GA4 is different. A change in the source of traffic will no longer trigger a new session. Instead, if a new traffic source is detected within the session, the session will continue with the traffic source details captured in the events’ parameters.

So how does GA4 report session source and medium?

I tested this by retrieving the session source and medium per unique session id from the GA4 Data API and compared that to the event-level source and medium data from the BigQuery export. In cases where the session included multiple different source and medium pairs, the source / medium reported in the GA4 interface was the one that appeared first.

In the following example session, the visit starts directly without traffic source data. The first page view is followed by another one that includes the utm_source and utm_medium parameters. Later, there’s another utm_source & utm_medium pair with different values. In GA4, the final source / medium of the session is reported as “(direct) / (none)”

example session in BigQuery with different event-level traffic sources
Viewed pages in an example session

GA4 tracks the session’s first event’s traffic source as the source for the entire session. Automatically generated first_visit and session_start events are not included in this evaluation.

If you want to dig deeper into the GA session source magic, I recommend reading this excellent investigation in the GTM-Gear blog.

How does the last non-direct attribution work in GA4?

So, we went through how the session traffic source is derived. How about the last non-direct attribution or Cross-channel last click, as Google has decided to rename it?

When looking back to find the last non-direct traffic source, GA4 has a slightly different approach. Instead of looking at the first event of each preceding session, GA4 will get the data from the last event that includes traffic source details.

GA4 session last non-direct attribution example
In this example, Facebook doesn’t get credited in the first session as it’s not the traffic source of the first event. However, Facebook will be the last non-direct traffic source of the second session.

Query Process

To make the process a bit easier to manage, I’ve divided it into two different SQL queries:

  1. Create a table containing the first and last traffic source per session.
  2. Identify the last non-direct source of traffic in case the current session had no traffic source data.

The main reason for doing this in two parts is to make the queries less expensive. The last non-direct lookback will need to process a lot of data whenever the results table is updated. For example, a 90-day lookback would require 90+1 days of data to increment the results table by one day.

Ideally, I’d save the results of the first SQL query to a date-partitioned table and have that updated incrementally. The following query would then rely on that to fill in the last non-direct traffic source in case the session had no traffic source data.

I’m trying to keep this post clear and on topic, so I’ll leave out the incremental update and other query optimization parts. However, covering those using Dataform could very well be a topic for another post. 🙂

Query the Session Traffic Source

The first SQL query will go through the events and return the first and last traffic source per each unique session id. The values will be null if the traffic source data doesn’t exist.

You can save the query’s results in a new table and reference that in the next query instead of rerunning the whole thing with all event data.

-- extract the event data needed for the session traffic source details
with events as (
  select
    cast(event_date as date format 'YYYYMMDD') as date,
    -- unique session id
    concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start,
    (select value.string_value from unnest(event_params) where key = 'source') as source,
    (select value.string_value from unnest(event_params) where key = 'medium') as medium,
    (select value.string_value from unnest(event_params) where key = 'campaign') as campaign,
    event_timestamp
  from
    `<project>.<dataset>.events_*`
  where
    (_table_suffix >= '<start date>' and _table_suffix <= '<end date>')
    and event_name not in ('session_start', 'first_visit')
)
select
  date,
  session_id,
  user_pseudo_id,
  session_start,
  -- the traffic source of the first event in the session with session_start and first_visit excluded
  first_value(
    if(
      coalesce(source, medium, campaign) is not null,
      (
        select
          as struct source, medium, campaign
      ),
      null
    )
  ) over(
    partition by session_id
    order by
      event_timestamp rows between unbounded preceding
      and unbounded following
  ) as session_first_traffic_source,
  -- the last not null traffic source of the session
  last_value(
    if(
      coalesce(source, medium, campaign) is not null,
      (
        select
          as struct source, medium, campaign
      ),
      null
    ) ignore nulls
  ) over(
    partition by session_id
    order by
      event_timestamp rows between unbounded preceding
      and unbounded following
  ) as session_last_traffic_source,
  -- flag the session's first event
  row_number() over(
    partition by session_id
    order by
      event_timestamp
  ) as session_event_number
from
  events qualify session_event_number = 1
  -- select only one row per session

Results Without Last Non-direct

As a benchmark, I took a small sample of source / medium data from the GA4 interface.

example ga4 session traffic source data
Example session data from GA4

When looking at the data from the newly created table for the same date range, it doesn’t quite match with GA4. That’s because the last non-direct attribution is yet to be applied.

select
  session_first_traffic_source.source,
  session_first_traffic_source.medium,
  count(distinct session_id) as sessions
from
  `<project>.<dataset>.<session traffic source table>`
where
  date between '<start>'
  and '<end>'
group by
  1,
  2
order by
  sessions desc, source
sql query results without last non-direct session traffic source
Sessions per source / medium without last non-direct attribution

Query the Last Non-direct Traffic Source

The second SQL query will take the session-level data from the previously created table and fill the missing traffic sources with the last non-direct source.

The query checks if the session already has traffic source data in session_first_traffic_source. If there is none, it will look for the same user’s previous not null value for session_last_traffic_source.

select
  date,
  session_id,
  user_pseudo_id,
  session_start,
  session_first_traffic_source,
  ifnull(
    session_first_traffic_source,
    last_value(session_last_traffic_source ignore nulls) over(
      partition by user_pseudo_id
      order by
        session_start range between 7776000 preceding
        and current row -- 90 day lookback
    )
  ) as session_traffic_source_last_non_direct,
from
  `<project>.<dataset>.<session traffic source table>`

The query is set to lookback 90 days (7776000 seconds). You can change that to any number of days specified in seconds. The window order and range is based on session_start, which is the value of the ga_session_id parameter. Ga_session_id is a Unix timestamp that is assigned when the session begins.

A larger lookback window will find traffic sources further back in time. However, at the same time, the cost of the query will increase.

Results

Let’s see if the applied last non-direct attribution results in numbers more closely resembling the Traffic acquisition report in GA4. Additionally, we can convert the null values to “(direct)” and “(none)” to make it look more like in the GA4 report.

select
  ifnull(
    session_traffic_source_last_non_direct.source,
    '(direct)'
  ) as source,
  ifnull(
    session_traffic_source_last_non_direct.medium,
    '(none)'
  ) as medium,
  count(distinct session_id) as sessions
from
  `<project>.<dataset>.<session last non-direct traffic source table>`
where
  date between '<start>'
  and '<end>'
group by
  1,
  2
order by
  sessions desc, source
sql query results with last non-direct session traffic source

The numbers based on my sample data now match the GA4 report almost exactly. The SQL query attributed two more sessions to “google / organic” than what GA4 did. I ran the query with the 90-day lookback window. With a 30-day lookback, there would have been a few more direct sessions and a few less in the other source / mediums.

Additional Considerations

The query worked quite nicely with my test dataset. Based on that, the base logic seems correct. However, there are a couple of common issues which weren’t present in the dataset that I used.

GA4 has some issues related to tracking Google Ads traffic with auto-tagging enabled. Auto-tagging uses the GCLID parameter for connecting the website visit with the corresponding campaign. With auto-tagging enabled, the usual utm tags shouldn’t be needed.

However, this doesn’t quite work with the BigQuery export. In BigQuery, the sessions that started with an event containing the GCLID parameter can either have missing source and medium details or have something like “google / organic” or “youtube.com / referral” as their source medium.

example session source mediums with a a gclid parameter
Example sessions with a GCLID parameter

A simple fix for this is adjusting the first session traffic source query to check the GCLID parameter. If the parameter exists, the source and medium values can be overwritten with “google” and “cpc”. However, this approach could be too simplistic, as other traffic sources can also utilize the GCLID parameter.

with events as (
  select
    ...
    -- include the gclid parameter in the query
    (select value.string_value from unnest(event_params) where key = 'gclid') as gclid,
    ...
  from
    `<project>.<dataset>.events_*`
  where
    ...
)
select
  ...
  -- fix the source and medium field based on gclid
  first_value(
    if(
      coalesce(source, medium, campaign, gclid) is not null,
      (
        select
          as struct if(gclid is not null, 'google', source) as source,
            if(gclid is not null, 'cpc', medium) as medium,
            campaign,
            gclid
      ),
      null
    )
  ) over(
    partition by session_id
    order by
      event_timestamp rows between unbounded preceding
      and unbounded following
  ) as session_first_traffic_source,
  last_value(
    if(
      coalesce(source, medium, campaign, gclid) is not null,
      (
        select
          as struct if(gclid is not null, 'google', source) as source,
            if(gclid is not null, 'cpc', medium) as medium,
            campaign,
            gclid
      ),
      null
    ) ignore nulls
  ) over(
    partition by session_id
    order by
      event_timestamp rows between unbounded preceding
      and unbounded following
  ) as session_last_traffic_source,
  ...
from
  events

Another thing to consider is the GBRAID and WBRAID parameters which are used as alternatives to the GCLID parameter in iOS14+ devices.

Finalle, even if that solution worked correctly with your traffic data, it would still lack the campaign details. So perhaps the best fix to this problem is to start including utm tags with the auto-tagged Google Ads traffic as well.

User Id

Another issue that can cause discrepancies between the GA4 reports and BigQuery data is the use of user ids. When a visitor is assigned a user id mid-session, GA4 will attribute the session’s traffic source to “(direct)” / “(none)”, or I guess the last non-direct source derived from that, regardless of what the session’s first traffic source was.

As of writing this in November 2022, this is still an issue in GA4. However, it doesn’t apply to the event data in BigQuery, so it’s good to keep that in mind when comparing the numbers between GA4 and BigQuery data.

The below image shows unique sessions that have a user id and that were reported as “(direct) / (none)” by GA4. I exported the data from GA4 and joined it with the traffic source table created earlier. As a result, some of the sessions were incorrectly reported as direct even though the traffic source data existed on the event-level.

Example session traffic sources with user id sessions
Example user id sessions, traffic source according to GA4 compared to BigQuery

The following snapshot of some of the events in one of those sessions shows what’s happening. The visit originated from “google / organic”. However, later during the session, a login event occurred, and a user id was assigned to the visitor. That messed up the reporting in GA4 and wiped the original traffic source of the session.

example session journey with a traffic source and assigned user id
Example session journey with a user id assigned mid-session

Summary

Recreating the session-level GA4 traffic reports with BigQuery data is not quite as straightforward as it could be. It also requires a lot of processing on BigQuery’s side when repeated daily. In a production setting, the process needs to be well-thought and optimized when up-to-date data is needed in multiple places. 

Additionally, the issues related to Google Ads traffic and the use of user ids make it difficult to verify the query results by comparing them with the GA4 reports.

Hopefully, Google will make this easier someday and start providing correct session-level traffic data already in the export, just like they’ve been providing with Universal Analytics.

Finally, another thing to consider is whether it makes sense to replicate the GA4 traffic source reporting. With BigQuery, you can define the session’s traffic source however you like. An alternative approach could be to credit all traffic sources within the same session, not just the first, and split the credit between them.

4 thoughts on “Query the GA4 Session Last Non-direct Traffic Source in BigQuery”

  1. Hello,
    Very useful article. Indeed, it’s sad that session level traffic source is not available and we need to find workarounds to compile the data and create costly queries to obtain the data.

  2. This is a great post as well as the referenced post and has been a big help!

    One enhancement that I’d suggest is picking up the last utm parameters in the session. GA4 does not start a new session when the utm parameters are changed, however, it does look like it will carry any mid-session utm parameters through to the next session.

    Therefore you need to retain the parameters for the current session (I used page entrances for this) and the last value of the parameters for the the current session – if these are the same as the first, then it doesn’t matter.

    If the utm parameters for the session are not defined then you should look back to the last parameters value. Seems to be more robust.

    1. Thanks for the comment Martin!

      That is indeed one thing that was missing in the original version. I’ve updated the queries now to capture both the traffic source of the first event, and also the session’s last traffic source.

      The first traffic source is used in evaluating the current session’s traffic source. The last traffic source is only being used in the last non-direct lookback.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top