BigQuery SQL Cheat Sheet for GA4

One of the significant improvements in Google Analytics 4 over the old Universal Analytics is its event-based data model and the free data export to BigQuery. The export opens up many possibilities for analysis not available through GA4 UI itself.

However, as the data is event-level, meaning that there’s one record per event, you’ll always have to define the formulas for getting the basic stuff you see in the GA4 UI. Furthermore, as the SQL query always starts from the event level, it means that you’ll often need the same formulas again and again.

This article is a collection of some of the most common SQL formulas, plus a few extras, needed when working with the GA4 event data in BigQuery.

List of topics:

  1. Unnest an event parameter into a column
  2. Unnest an event parameter of any type
  3. Utilize custom and standard e-commerce item parameters
  4. Set a dynamic query date range
  5. Extract a clean page path from page_location
  6. Extract URL query parameters
  7. Defining and calculating sessions
  8. Turn a hit-level dimension into session-level
  9. Unnesting and utilizing user properties
  10. Merging visitor identifiers to get an accurate user count
  11. Get the session’s landing page
  12. Previous or next page path
  13. Traffic source data

Post updates

Unnest an event parameter into a column

The GA4 export utilizes nested structures for storing the data. This is maybe best explained using an Excel reference: In Excel, you have sheets consisting of cells. Whereas in BigQuery, you can have a “sheet” with more “sheets” within the cells.

The events’ parameters in the GA4 export are nested like this. Event_params is a property of the event record and contains key-value pairs for all the related parameters. The key field is the name of the event parameter, and the value field is an object containing the parameter’s value in one of its four fields: string_value, int_value, float_value, or double_value.

Different GA4 event dimensions are nested inside the event_params array.

To extract the event parameter’s value and turn it into a column, you must write another select statement inside your query on the row level. Then, using the unnest function, you can access the contents of event_parameters.

select
  event_name,
  (select value.string_value from unnest(event_params) where key = 'page_location') as page_location
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

The example above returns the value from the string_value field. If you want to retrieve the value of a numeric parameter, you should pick the value from one of the other fields instead.

Unnest an event parameter of any type

The previous unnest event parameter example has one problem: what if you don’t know the type of the event parameter’s value?

GA4 detects the data type of a parameter automatically based on the input value. However, the automatic data type detection will cause problems if the common data type of the parameter can’t be detected with all input values. 

For example:

Input valueDetected data type
“A12345”string
“012345”integer
“0123.45”float

If you want to ensure that your query extracts the data from all the possible data type fields, you can simply cast all of them to string and return the first non null value.

select
  event_name,
  (
    select
      coalesce(
        value.string_value,
        cast(value.int_value as string),
        cast(value.double_value as string),
        cast(value.float_value as string)
      )
    from
      unnest(event_params)
    where
      key = '<event parameter name>'
  ) as <event parameter>
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

Utilize custom and standard e-commerce item parameters

The e-commerce items in the BigQuery export are located in another nested field. Starting from October 25, 2023, the export also includes custom item parameters. These are nested even further within the items array. So it’s a nest within a nest.

select
  cast(event_date as date format 'YYYYMMDD') as date,
  event_name,
  item.item_name,
  (select value.string_value from unnest(item.item_params) where key = 'item_sold_as_batch') as item_sold_as_batch,
  item.quantity,
from
  `<your-project>.<ga4-dataset>.events_*`
  unnest(items) as item
where
  event_name = 'purchase'

The above query lists all the purchased items. I’ve also included the item_sold_as_batch custom parameter in the query. The parameter is a true or false value based on whether the item was included in a batch of items or purchased as an individual item.

Notice that the first unnest happens in the query’s cross join. The cross join opens the e-commerce items as single rows. In the query, the items are unnested further to return the item_sold_as_batch parameter as a column.

What if you want to return the data on the event level? The below query checks if purchase events contain any batch items.

select
  cast(event_date as date format 'YYYYMMDD') as date,
  event_name,
  (
    select 
      max(
        if(
          (select value.string_value from unnest(item.item_params) where key = 'item_sold_as_batch') = 'true',
          true,
          false
        )
      )
    from unnest(items) as item
  ) as contains_batch_items
from
  `<your-project>.<ga4-dataset>.events_*`
where
  event_name = 'purchase'

There you have it, a nice double unnest.

Set a dynamic query date range

The GA4 BigQuery export stores the event data in daily export tables. Each table’s name starts with “events_” and ends with that day’s date as the suffix. You can use a wildcard with the table suffix to input a dynamic date range into the query.

The below query counts all events between the current date minus seven days and the previous date.

select
  count(*) as events
from
  `<your-project>.<ga4-dataset>.events_*`
where
  _table_suffix between cast(
    current_date() -7 as string format 'YYYYMMDD'
  )
  and cast(
    current_date() -1 as string format 'YYYYMMDD'
  )

Note that the table suffix filter is combining strings, not actual dates. If you set it to include, for example, all tables with a suffix greater than “20220207” you’ll also end up including all the intraday tables. That’s why it’s good to define the filter using a between clause.

You can test what’s happening by running the following query.

select
  'intraday_20220218' > '20220217'

Extract a clean page path from page_location

The default GA4 page_location field contains the page’s full URL, including any query parameters. However, to make it a bit more readable and more useful in reports, you can exclude the protocol and hostname as well as any query parameters from the string.

select
  regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?#].*',
  ''
) as page_path
from
  `<your-project>.<ga4-dataset>.events_*`

Extract URL query parameters

In the previous example, we took out the query string from the page_location. However, the query string can contain valuable information as well. In most cases, though, it’s more useful as a separate event dimension or table column.

Below, I show two examples on how to extract query parameters: The simpler one just extracts the selected parameter value from the page_location string. The other method reads the page’s query parameters into a nested array of key-value pairs similar to the default event_params array. From this array, you can unnest the parameter of choice later.

select
  -- extract utm_source from page_location
  regexp_extract(
    (select value.string_value from unnest(event_params) where key = 'page_location'), 
    r'(?:\?|&)utm_source=([^&]+)'
  ) as utm_source
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
with url_params as (
  select
    -- read all query parameters into an array
    array(
      (
        select
          as struct split(keyval, '=') [safe_offset(0)] as key,
          split(keyval, '=') [safe_offset(1)] as value
        from
          unnest(
            split(
              split((select value.string_value from unnest(event_params) where key = 'page_location'), '?') [safe_offset(1)],
              '&'
            )
          ) as keyval
      )
    ) as url_query,
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  *,
  -- extract utm_source
  (select value from unnest(url_query) where key = 'utm_source') as utm_source
from
  url_params

Defining and calculating sessions

All GA4 events contain a ga_session_id parameter nested inside event parameters. However, this parameter is not a unique session id. Instead, it’s the session’s start time saved as a Unix timestamp. You can create a unique session id by combining the ga_session_id parameter with user_pseudo_id.

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

However, calculating the number of sessions using this unique session id will not exactly match the numbers in the GA4 UI. This is because GA4 uses the session_start event instead for session calculation.

If you compare the two, you might notice that the numbers can be a bit off: The same unique session id can have multiple session_start events. That’s why I’ve opted to disregard the session_start events and not try to match the numbers in the UI.

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id,
  sum(if(event_name = 'session_start', 1, 0)) as session_start_events
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
group by
  1
order by
  2 desc

Turn a hit-level dimension into session-level

At the time of writing this, GA4 doesn’t yet have session-level custom dimensions. In BigQuery, that’s not an issue; you can take any event parameter and turn it into a session or user-level dimension.

Session-level custom dimensions availability in GA4
Session-level custom dimensions are on the product roadmap for GA4.

The best way to make an event-level field session-level is by utilizing analytic functions. An analytic function uses a window frame to define a group of rows and compute a value over that group. The group of rows could be all rows that share the same unique_session_id (defined earlier).

For example, let’s look at the GA4 default session_engaged dimension. This dimension tells if GA4 measured an engagement during the session or not. However, the dimension is not included in all of the session’s events and the value can change from not engaged to engaged between the session’s events. If you’d like to show all events from engaged sessions, the dimension needs to be converted to the session level first.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    (select value.string_value from unnest(event_params) where key = 'session_engaged') as event_session_engaged
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_session_engaged,
  -- the same dimension as session-level included with all session's events
  max(event_session_engaged) over(partition by unique_session_id) as session_session_engaged
from
  events

This kind of a query is also useful when working with custom event parameters that don’t get included in any of the automatically generated events, like user_engagement

In the example, we look for the max value of the session_engaged parameter over the unique_session_id. In other cases, for example, you might want to return the last first value instead. You can reference the BigQuery documentation for other options.

If you want to make the dimension user-level instead, you can use user_pseudo_id in the window’s partition field.

Unnesting and utilizing user properties

GA4 user properties use a very similar nested field as event parameters. The difference is that the user property’s value will persist in future events once it has been included in one.

Unnesting the user properties works exactly the same as with the event parameters. However, there’s one adjustment that you’ll likely want to make. What if a user property is assigned mid-session or the value of the user property changes during the session? In that case, the last value seen during the session is probably the one that is the most useful.

select
  cast(event_date as date format 'YYYYMMDD') as date,
  event_name,
  last_value (
    (select value.string_value from unnest(user_properties) where key = 'customer_type') ignore nulls
  ) over (
    partition by concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) order by event_timestamp
  )
   as customer_type
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

The above query assigns the session’s last value of the user property for each event that happened during the session. That also makes the data more suitable for group bys, as the same session doesn’t fall into multiple buckets.

Merging visitor identifiers to get an accurate user count

The BigQuery exports includes two different identifiers for visitors:

  1. user_pseudo_id
  2. user_id

User_pseudo_id is the standard cookie identifier assigned by the GA tracking code. User_id, on the other hand, is a custom identifier you can assign to the users who log in.

Additionally, GA4 can use Google Signals as the third identifier for calculating unique users. However, the Google Signals identifiers are not included in the BigQuery export.

To get the most accurate number of unique visitors, you’ll need to utilize both user_pseudo_id and user_id in a way that doesn’t double-count visitors. The user_id is often assigned mid-session, meaning that the first events of the session might lack the id.

with events as(
  select
    cast(event_date as date format 'YYYYMMDD') as date,
    coalesce(
      max(user_id) over(partition by event_date, user_pseudo_id),
      user_pseudo_id
    ) AS merged_user_id,
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)

select
  date,
  count(distinct merged_user_id) as daily_users
from
  events
group by
  date

The above query checks the user_id field on a daily level. If the visitor had a user_id during the day, that id is used for the calculation. Otherwise, the query falls back on user_pseudo_id.

Get the session’s landing page

We can use an analytic function, similar to the previous example, to get the session’s landing page. However, this time, instead of returning the maximum event parameter value of the session, we would return the first one.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    (select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
    (select value.int_value from unnest(event_params) where key = 'entrances') as entrances,
    event_timestamp
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  -- session's first page_location retrieved from any event
  first_value(page_location) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page,
  -- session's first page_location only retrieved from page_view events
  first_value(
    if(
      event_name = 'page_view',
      page_location,
      null
    ) ignore nulls
  ) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page2,
  -- session's first page_location retrieved from events that were tracked as entrances
  first_value(
    if(
      entrances > 0,
      page_location,
      null
    ) ignore nulls
  ) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page3
from
  events

Note that I’ve included three different options in the query. The first one simply returns the first page_location of the session. The second one is adjusted slightly only to take page_view events into account. The third one, instead, uses the entrances parameter for deciding from which event to retrieve the information.

With options two and three, the landing page of a session can sometimes be null.

Previous and next page path

Previous and next page paths can help visualize how the visitors navigate the site’s content and how the different pages are related. Again, we can utilize an analytic function.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    event_timestamp,
    regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?].*',
  ''
) as page_path
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_name,
  page_path,
  event_timestamp,
  -- look for the previous page_path
  if(
    event_name = 'page_view',
    coalesce(
      last_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between unbounded preceding
          and 1 preceding
      ),
      '(entrance)'
    ),
    null
  ) as previous_page,
  -- look for the next page_path
  if(
    event_name = 'page_view',
    coalesce(
      first_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between 1 following
          and unbounded following
      ),
      '(exit)'
    ),
    null
  ) as next_page
from
  events

In this example, we define the analytic function’s window with the previous page’s path to consist of all the events before the current event. Then, from that window, the formula extracts the page path of the events. However, we only want the page path from the page_view events, so other event_names are set to return nulls. The last_value function, used with the window, ignores those null values.

This example query returns the preceding and following page paths only if the event itself is a page_view event. It gets a bit more complex if you want to return the same for all event types. Then, you’ll have to take into account that events on the same page can occur before or after the page_view event.

Traffic source data

The GA4 export includes traffic source data only on event and user levels. Session-level traffic source data is missing.

However getting the session-level data is also possible, but it requires utilizing the event-level fields and building custom attribution on top of that data. I’ve dedicated an entire blog post on how to do that.

with events as(
  
select
  cast(event_date as date format 'YYYYMMDD') as date,
  event_name,
  -- event-level traffic source data
  collected_traffic_source,
  -- user first touch traffic source
  traffic_source
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

Summary

When working with GA4 data in BigQuery, the SQL queries very often share a lot of the same formulas when starting from the event-level data. This article was a collection of some of the most common SQL formulas I find myself repeatedly using. Please let me know in the comments if I missed something important.

3 thoughts on “BigQuery SQL Cheat Sheet for GA4”

  1. Really useful for someone starting to use BG for GA4.

    Basic SQL skills are not enough in my opinion, so those pre built requests are really helpful to understand this logic better.

    Thanks !

    1. I agree with you Morgan,
      I am just starting out with BQ and GA4 and I must say, SQL skills alone are not enough.

      Thanks Taneli!

  2. Very helpful! One small remark: cleaning URL’s would probably be easier with regexp_extract instead of two regexp_replace statements.

    Example:

    SELECT
    REGEXP_EXTRACT(‘https://www.domain.com/category/subcategory?start=24&sz=24#anchor’, ‘https?://([^\?#]+)’)

    This returns http://www.domain.com/category/subcategory

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top