Why you should avoid using the session_start and first_visit events in GA4

Google recently fixed the missing event parameters issue with the GA4 session_start and first_visit events. Starting from November 2, 2023, the events contain the same parameters as the first event of the session that triggered these events.

While this update fixes one of the problems by making the data more consistent, these two events still have plenty of issues. At best, they are slightly off; in the worst cases, they are complete junk.

Here’s why:

  • number of session_start events ≠ number of sessions

  • number of first_visit events = number of new users in GA4… but the number is actually measured incorrectly.

Also, as a bonus, there seem to be some serious issues with the measurement of new vs. returning users in GA4.

Post updates

  • 2024-04-22: Corrections to how the first_visit event works, new debug query for verifying issues. Thanks Giovani Ortolani Barbosa and Leonardo Lourenço Crespilho for the comments!

Session_start

GA4 generates the session_start event whenever an event has the _ss parameter. Essentially, it’s a duplicate of the session’s first event with a different event name.

A page_view event that includes both the _ss and _fv flags signaling that a session_start and a first_visit event should be generated using the event’s data.

GA4 itself doesn’t use the session_start events anymore for its sessions metric. Instead, the number of sessions in GA4 is an estimate of the number of unique session ids.

The session_start event is based entirely on the client-side script’s ability to include the _ss flag in the event. This client-side evaluation is not flawless. Sometimes, the script incorrectly triggers multiple session_start events for a single session.

As explained by Simo in Measure Slack:

[Image: Simo explains why session_start events get duplicates]

In most cases, the number of session_start events should be close to the number of sessions. However, there can be some discrepancy.

Comparison of sessions vs. the number of session_start events.

By digging deeper using BigQuery data, we see that some sessions include multiple session_start events, while some don’t have any session_start events at all.

Unique session ids with multiple session_start events.

Unique session ids with no session_start events.

I got the above results with a query that didn’t include any time range filter, which avoids potential issues with sessions that span more than one day.

You can test this with your own GA4 property’s data by using the following query.

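select
  -- session key: user_pseudo_id + ga_session_id (ga_session_id alone is not unique across users)
  concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
  -- number of session_start events logged within the session
  countif(event_name = 'session_start') as session_start_events
from
  `<project>.<dataset>.events_*`
group by
  session_id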
Missing session_start events are especially common with sub-properties, even after the November 2, 2023 update

Sub-properties allow you to get a filtered view of a larger entity. Often, the session’s first event has already happened before the user views any content included in the sub-property. Because the _ss flag is evaluated client-side, not in the GA4 property, the sub-property’s data will be missing the session_start event for that session.

Comparison of sessions vs. sessions that included a session_start event in a sub-property.

The above Exploration report uses a session segment with the following criteria.

[Image: segment of sessions with a session_start event]

A proper session_start event would make it easy, for example, to access the session’s first traffic source in BigQuery (without further attribution) by using a simple where clause. However, as the event is so unreliable, it’s still best to just get this information from the first event of the session.
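Getting the traffic source from the session’s first event could look roughly like the sketch below. It assumes the traffic source is available in the source and medium event parameters, and it skips the session_start and first_visit events so that the result doesn’t depend on them. The column aliases are illustrative only.

```sql
with events as (
  select
    concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
    event_timestamp,
    event_name,
    -- assumption: the traffic source is collected in these event parameters
    (select value.string_value from unnest(event_params) where key = 'source') as source,
    (select value.string_value from unnest(event_params) where key = 'medium') as medium
  from
    `<project>.<dataset>.events_*`
  where
    -- ignore the two unreliable, automatically generated events
    event_name not in ('session_start', 'first_visit')
)
select
  session_id,
  -- keep only the earliest event of each session
  array_agg(
    struct(event_name as first_event_name, source, medium)
    order by event_timestamp
    limit 1
  )[offset(0)] as first_event
from
  events
where
  session_id is not null
group by
  session_id
```

The first_event column is a struct, so the values can be read as first_event.source and first_event.medium. If several events share the earliest timestamp, which one gets picked is arbitrary.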

First_visit

GA4 uses the first_visit events for measuring new users. As with the session_start event, this event is also based on a flag added by the client-side tracking code. This flag is called _fv.

The new users metric matches exactly with the number of first_visit events.

But does the new and returning users measurement work?

You would think that adding up new users + returning users would give you roughly the same number as total users. After all, shouldn’t each user belong in one or the other bucket? Of course, a new user could later turn into a returning user. So, depending on the logic used in the calculation, there could be a case where the same user is counted twice. However, at least all users should either be new or returning. 

Well, that is not the case in GA4. In the worst cases, most of the users are neither new nor returning.

New and returning users vs. total users.

The reason behind this discrepancy is that GA4 uses an entirely different logic for the returning users metric compared to the new users metric. In GA4, a returning user is one who had at least one preceding session before the current session.

The new users metric and the first_visit event, on the other hand, are based on a simple cookie check done on the client side. The tracking script checks for the existence of the client id cookie (_ga) and the _ga_<Measurement ID> cookie. If either cookie is missing, the script adds the _fv flag to the event.

The evaluation falls short with sub-properties, which are based on a filtered set of events instead of having their own stream. However, there can also be issues in regular properties.

The query below checks if each user_pseudo_id has logged at least one first_visit event.

with event_data as (
  select
    user_pseudo_id,
    -- true if the user has logged at least one first_visit event
    logical_or(event_name = 'first_visit') as user_has_first_visit
  from
    `<project>.<dataset>.events_*`
  group by
    user_pseudo_id
)
select
  user_has_first_visit,
  count(distinct user_pseudo_id) as users
from
  event_data
group by
  user_has_first_visit

Sometimes, the results can look like this:

[Image: GA4 property with a lot of missing first_visit events]

I don’t know exactly what is behind these issues. However, they seem to occur mainly with properties that share the same parent domain with other GA-tracked sites and GA4 properties.

Below are the results of the same query using my blog’s GA4 data.

[Image: GA4 property with no significant issues in first_visit event collection]

So, we can conclude that the number of new users is sometimes too low. What about the returning users? Looking at the data, it doesn’t quite seem to work as documented.

GA4 automatically tracks the user’s session number. The session number dimension is unavailable in the UI, but we can access it in BigQuery. The evaluation should be as simple as checking if the ga_session_number parameter is greater than one.

select
  count(
    distinct 
    if(
      (select value.int_value from unnest(event_params) where key = 'ga_session_number') > 1, 
      user_pseudo_id,
      null
    )
   ) as returning_users
from
  `<project>.<dataset>.events_*`

Let’s use this query to calculate the returning users number shown earlier and compare it with GA4.

Returning users in BigQuery vs. returning users in GA4.

This GA4 property is configured to use the device-based reporting identity. Regardless of that, the numbers are not even close, even though the logic should be the same.

Doing the same comparison on a property that doesn’t share the same domain with another property gives me a much smaller but still very notable difference.

These findings lead me to believe the returning users metric doesn’t work as documented.

Interestingly, the SQL query utilizing the ga_session_number parameter gives results very close to the new and returning user counts in Universal Analytics. Based on that, this method seems like the most accurate way to get this data.
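The same parameter can be used to put both metrics on a common logic. The sketch below is one possible way to do it, under the assumption that a user counts as new if they have any event with ga_session_number = 1 and as returning if they have any event with ga_session_number > 1 in the queried data. The column names are hypothetical, and this is not how GA4 itself calculates the metrics.

```sql
with event_data as (
  select
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_number') as session_number
  from
    `<project>.<dataset>.events_*`
  where
    user_pseudo_id is not null
),
users as (
  select
    user_pseudo_id,
    -- assumption: session number 1 marks a new user, anything above 1 a returning user
    logical_or(session_number = 1) as is_new_user,
    logical_or(session_number > 1) as is_returning_user
  from
    event_data
  group by
    user_pseudo_id
)
select
  count(*) as total_users,
  countif(is_new_user) as new_users,
  countif(is_returning_user) as returning_users,
  -- users who had their first session and a later session within the queried data
  countif(is_new_user and is_returning_user) as new_and_returning_users,
  -- users with no usable ga_session_number at all
  countif(not ifnull(is_new_user, false) and not ifnull(is_returning_user, false)) as unclassified_users
from
  users
```

With this logic, new_users + returning_users - new_and_returning_users + unclassified_users adds up to total_users, so every user with a session number falls into at least one bucket.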

Final thoughts

Because of the above issues, I’ve been avoiding using the session_start and first_visit events for any analysis. Fortunately, the BigQuery export allows more accurate ways to get the session’s first event or the user’s returning vs. new status than using these two events.

Both events rely on logic that happens on the client side. That is a weird design because there are cases, as described in this post, where this method fails miserably. 

Finally, when it comes to the new and returning users metrics in GA4, it’s almost as if these two were developed by two isolated teams. For these two to be consistent, shouldn’t they at least follow the same logic?

This post is a collection of issues related to these two events I’ve encountered while working with GA4. Please let me know if I’ve missed something!

2 thoughts on “Why you should avoid using the session_start and first_visit events in GA4”

  1. Leonardo Lourenço Crespilho

    Hi Taneli. How are you?

    I think the following cookie/first_visit “problem” doesn’t exist:

    > Multiple properties setup for sites that are under the same parent domain

    The first_visit is sent not when the _ga cookie gets created, but when the _ga_MEASUREMENT-ID one is created. So, it does not matter if multiple properties run on the same domain. Does that make sense?

    Regards.

    1. Taneli Salonen

      Hi Leonardo,

      Yes, you are right. It’s also based on the _ga_MEASUREMENT-ID. If you already have the _ga cookie, but the _ga_MEASUREMENT-ID cookie is missing, then the param gets added. But also, if the _ga_MEASUREMENT-ID cookie is there but the _ga cookie is missing for some reason, then the _fv param is also added.

      I think my tests for that case were a bit too simple.

      I actually found this out a while ago but haven’t managed to update the blog yet. Will do that next. Thanks!
