BigQuery SQL Cheat Sheet for GA4

One of the significant improvements in Google Analytics 4 over the old Universal Analytics is its event-based data model and the free data export to BigQuery. The export opens up many possibilities for analysis not available through GA4 UI itself.

However, as the data is event-level, meaning that there’s one record per event, you’ll always have to define the formulas for getting the basic stuff you see in the GA4 UI. Furthermore, as the SQL query always starts from the event level, it means that you’ll often need the same formulas again and again.

This article is a collection of some of the most common SQL formulas, plus a few extras, needed when working with the GA4 event data in BigQuery.

List of topics:

  1. Unnest an event parameter into a column
  2. Unnest an event parameter of any type
  3. Set a dynamic query date range
  4. Extract a clean page path from page_location
  5. Extract URL query parameters
  6. Defining and calculating sessions
  7. Turn a hit-level dimension into session-level
  8. Get the session’s landing page
  9. Previous or next page path

Post updates

Unnest an event parameter into a column

The GA4 export utilizes nested structures for storing the data. This is maybe best explained using an Excel reference: In Excel, you have sheets consisting of cells. Whereas in BigQuery, you can have a “sheet” with more “sheets” within the cells.

The events’ parameters in the GA4 export are nested like this. Event_params is a property of the event record and contains key-value pairs for all the related parameters. The key field is the name of the event parameter, and the value field is an object containing the parameter’s value in one of its four fields: string_value, int_value, float_value, or double_value.

Different GA4 event dimensions are nested inside the event_params array.

To extract the event parameter’s value and turn it into a column, you must write another select statement inside your query on the row level. Then, using the unnest function, you can access the contents of event_parameters.

select
  event_name,
  (select value.string_value from unnest(event_params) where key = 'page_location') as page_location
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

The example above returns the value from the string_value field. If you want to retrieve the value of a numeric parameter, you should pick the value from one of the other fields instead.

Unnest an event parameter of any type

The previous unnest event parameter example has one problem: what if you don’t know the type of the event parameter’s value?

GA4 detects the data type of a parameter automatically based on the input value. However, the automatic data type detection will cause problems if the common data type of the parameter can’t be detected with all input values. 

For example:

Input valueDetected data type
“A12345”string
“012345”integer
“0123.45”float

If you want to ensure that your query extracts the data from all the possible data type fields, you can simply cast all of them to string and return the first non null value.

select
  event_name,
  (
    select
      coalesce(
        value.string_value,
        cast(value.int_value as string),
        cast(value.double_value as string),
        cast(value.float_value as string)
      )
    from
      unnest(event_params)
    where
      key = '<event parameter name>'
  ) as <event parameter>
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

Set a dynamic query date range

The GA4 BigQuery export stores the event data in daily export tables. Each table’s name starts with “events_” and ends with that day’s date as the suffix. You can use a wildcard with the table suffix to input a dynamic date range into the query.

The below query counts all events between the current date minus seven days and the previous date.

select
  count(*) as events
from
  `<your-project>.<ga4-dataset>.events_*`
where
  _table_suffix between cast(
    current_date() -7 as string format 'YYYYMMDD'
  )
  and cast(
    current_date() -1 as string format 'YYYYMMDD'
  )

Note that the table suffix filter is combining strings, not actual dates. If you set it to include, for example, all tables with a suffix greater than “20220207” you’ll also end up including all the intraday tables. That’s why it’s good to define the filter using a between clause.

You can test what’s happening by running the following query.

select
  'intraday_20220218' > '20220217'

Extract a clean page path from page_location

The default GA4 page_location field contains the page’s full URL, including any query parameters. However, to make it a bit more readable and more useful in reports, you can exclude the protocol and hostname as well as any query parameters from the string.

select
  regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?].*',
  ''
) as page_path
from
  `<your-project>.<ga4-dataset>.events_*`

Extract URL query parameters

In the previous example, we took out the query string from the page_location. However, the query string can contain valuable information as well. In most cases, though, it’s more useful as a separate event dimension or table column.

Below, I show two examples on how to extract query parameters: The simpler one just extracts the selected parameter value from the page_location string. The other method reads the page’s query parameters into a nested array of key-value pairs similar to the default event_params array. From this array, you can unnest the parameter of choice later.

select
  -- extract utm_source from page_location
  regexp_extract(
    (select value.string_value from unnest(event_params) where key = 'page_location'), 
    r'(?:\?|&)utm_source=([^&]+)'
  ) as utm_source
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
with url_params as (
  select
    -- read all query parameters into an array
    array(
      (
        select
          as struct split(keyval, '=') [safe_offset(0)] as key,
          split(keyval, '=') [safe_offset(1)] as value
        from
          unnest(
            split(
              split((select value.string_value from unnest(event_params) where key = 'page_location'), '?') [safe_offset(1)],
              '&'
            )
          ) as keyval
      )
    ) as url_query,
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  *,
  -- extract utm_source
  (select value from unnest(url_query) where key = 'utm_source') as utm_source
from
  url_params

Defining and calculating sessions

All GA4 events contain a ga_session_id parameter nested inside event parameters. However, this parameter is not a unique session id. Instead, it’s the session’s start time saved as a Unix timestamp. You can create a unique session id by combining the ga_session_id parameter with user_pseudo_id.

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

However, calculating the number of sessions using this unique session id will not exactly match the numbers in the GA4 UI. This is because GA4 uses the session_start event instead for session calculation.

If you compare the two, you might notice that the numbers can be a bit off: The same unique session id can have multiple session_start events. That’s why I’ve opted to disregard the session_start events and not try to match the numbers in the UI.

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id,
  sum(if(event_name = 'session_start', 1, 0)) as session_start_events
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
group by
  1
order by
  2 desc

Turn a hit-level dimension into session-level

At the time of writing this, GA4 doesn’t yet have session-level custom dimensions. In BigQuery, that’s not an issue; you can take any event parameter and turn it into a session or user-level dimension.

Session-level custom dimensions availability in GA4
Session-level custom dimensions are on the product roadmap for GA4.

The best way to make an event-level field session-level is by utilizing analytic functions. An analytic function uses a window frame to define a group of rows and compute a value over that group. The group of rows could be all rows that share the same unique_session_id (defined earlier).

For example, let’s look at the GA4 default session_engaged dimension. This dimension tells if GA4 measured an engagement during the session or not. However, the dimension is not included in all of the session’s events and the value can change from not engaged to engaged between the session’s events. If you’d like to show all events from engaged sessions, the dimension needs to be converted to the session level first.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    (select value.string_value from unnest(event_params) where key = 'session_engaged') as event_session_engaged
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_session_engaged,
  -- the same dimension as session-level included with all session's events
  max(event_session_engaged) over(partition by unique_session_id) as session_session_engaged
from
  events

This kind of a query is also useful when working with custom event parameters that don’t get included in any of the automatically generated events, like user_engagement

In the example, we look for the max value of the session_engaged parameter over the unique_session_id. In other cases, for example, you might want to return the last first value instead. You can reference the BigQuery documentation for other options.

If you want to make the dimension user-level instead, you can use user_pseudo_id in the window’s partition field.

Get the session’s landing page

We can use an analytic function, similar to the previous example, to get the session’s landing page. However, this time, instead of returning the maximum event parameter value of the session, we would return the first one.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    (select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
    event_timestamp
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  -- session's first page_location retrieved from any event
  first_value(page_location) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page,
  -- session's first page_location only retrieved from page_view events
  first_value(
    if(
      event_name = 'page_view',
      page_location,
      null
    ) ignore nulls
  ) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page2
from
  events

Previous and next page path

Previous and next page paths can help visualize how the visitors navigate the site’s content and how the different pages are related. Again, we can utilize an analytic function.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    event_timestamp,
    regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?].*',
  ''
) as page_path
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_name,
  page_path,
  event_timestamp,
  -- look for the previous page_path
  if(
    event_name = 'page_view',
    coalesce(
      last_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between unbounded preceding
          and 1 preceding
      ),
      '(entrance)'
    ),
    null
  ) as previous_page,
  -- look for the next page_path
  if(
    event_name = 'page_view',
    coalesce(
      first_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between 1 following
          and unbounded following
      ),
      '(exit)'
    ),
    null
  ) as next_page
from
  events

In this example, we define the analytic function’s window with the previous page’s path to consist of all the events before the current event. Then, from that window, the formula extracts the page path of the events. However, we only want the page path from the page_view events, so other event_names are set to return nulls. The last_value function, used with the window, ignores those null values.

This example query returns the preceding and following page paths only if the event itself is a page_view event. It gets a bit more complex if you want to return the same for all event types. Then, you’ll have to take into account that events on the same page can occur before or after the page_view event.

Summary

When working with GA4 data in BigQuery, the SQL queries very often share a lot of the same formulas when starting from the event-level data. This article was a collection of some of the most common SQL formulas I find myself repeatedly using. Please let me know in the comments if I missed something important.

3 thoughts on “BigQuery SQL Cheat Sheet for GA4”

  1. Really useful for someone starting to use BG for GA4.

    Basic SQL skills are not enough in my opinion, so those pre built requests are really helpful to understand this logic better.

    Thanks !

    1. I agree with you Morgan,
      I am just starting out with BQ and GA4 and I must say, SQL skills alone are not enough.

      Thanks Taneli!

  2. Very helpful! One small remark: cleaning URL’s would probably be easier with regexp_extract instead of two regexp_replace statements.

    Example:

    SELECT
    REGEXP_EXTRACT(‘https://www.domain.com/category/subcategory?start=24&sz=24#anchor’, ‘https?://([^\?#]+)’)

    This returns http://www.domain.com/category/subcategory

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top