Google recently fixed the missing event parameters issue with the GA4 session_start and first_visit events. Starting from November 2, 2023, the events contain the same parameters as the first event of the session that triggered these events.
While this update fixes one of the problems by making the data more consistent, there are still a lot of issues related to these two events. At best, they are just a bit off, while in the worst cases, they are complete junk.
- number of session_start events ≠ number of sessions
- number of first_visit events = number of new users in GA4…
- … but the number is actually measured incorrectly.
Also, as a bonus, there seem to be some serious issues with the measurement of new vs. returning users in GA4.
GA4 generates the session_start event whenever an event has the _ss parameter. Essentially, it’s a duplicate of the session’s first event with a different event name.
GA4 itself doesn’t use the session_start events anymore for its sessions metric. Instead, the number of sessions in GA4 is an estimate of the number of unique session ids.
The session_start event is based entirely on the client-side script’s ability to include the _ss flag in the event. This client-side evaluation is not flawless. Sometimes, the script incorrectly triggers multiple session_start events for a single session.
In most cases, the number of session_start events should be close to the number of sessions. However, there can be some discrepancy.
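As a rough sketch, the gap can be checked in BigQuery by counting the session_start events next to the number of distinct session ids. The session key reuses the user_pseudo_id + ga_session_id combination, and `<project>.<dataset>` is a placeholder for your own export. The approx_count_distinct column is only an assumption about how an estimated count might behave; it is not confirmed to match the estimation GA4 does internally.

```sql
-- Sketch: compare the number of session_start events with the number of sessions.
with events as (
  select
    event_name,
    -- Session key: user_pseudo_id combined with the ga_session_id event parameter.
    concat(
      user_pseudo_id,
      (select value.int_value from unnest(event_params) where key = 'ga_session_id')
    ) as session_id
  from `<project>.<dataset>.events_*`
)

select
  countif(event_name = 'session_start') as session_start_events,
  -- Exact number of distinct session ids (null session ids are ignored).
  count(distinct session_id) as sessions_exact,
  -- Estimated distinct count; an assumption, not necessarily GA4's own estimate.
  approx_count_distinct(session_id) as sessions_approx
from events
```

If the first two columns differ, some sessions have either no session_start event or more than one.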
By digging deeper using BigQuery data, we see that some sessions include multiple session_start events, while some don’t have any session_start events at all.
I got the above result with a query that didn’t include any time range filter, thus avoiding potential issues if the sessions overlap days.
You can test this with your own GA4 property’s data by using the following query.
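    select
      -- Session key: user_pseudo_id combined with the ga_session_id event parameter
      concat(
        user_pseudo_id,
        (select value.int_value from unnest(event_params) where key = 'ga_session_id')
      ) as session_id,
      -- Number of session_start events in the session
      countif(event_name = 'session_start') as session_start_events
    from `<project>.<dataset>.events_*`
    group by session_id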
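The per-session result can be long, so it may help to bucket the sessions by how many session_start events they contain. The following is a minimal sketch built on the same session key. The bucket labels and the column name session_starts_per_session are made up for illustration.

```sql
-- Sketch: how many sessions have 0, 1, or 2+ session_start events?
with sessions as (
  select
    concat(
      user_pseudo_id,
      (select value.int_value from unnest(event_params) where key = 'ga_session_id')
    ) as session_id,
    countif(event_name = 'session_start') as session_start_events
  from `<project>.<dataset>.events_*`
  group by session_id
)

select
  case
    when session_start_events = 0 then '0'
    when session_start_events = 1 then '1'
    else '2+'
  end as session_starts_per_session,
  count(*) as sessions
from sessions
-- Events without a ga_session_id produce a null key; leave them out.
where session_id is not null
group by session_starts_per_session
order by session_starts_per_session
```

Sessions in the '0' and '2+' buckets are the ones where the client-side _ss evaluation did not behave as expected.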
Missing session_start events are especially common with sub-properties, even after the November 2, 2023 update.
Sub-properties allow you to get a filtered view of a larger entity. It can often be the case that the session’s first event has already happened before viewing the content included in the sub-property. Because the evaluation is done client-side, not in the GA4 property, the session_start event will be missing.
The above Exploration report uses a session segment with the following criteria.
A proper session_start event would make it easy, for example, to access the session’s first traffic source in BigQuery (without further attribution) by using a simple where clause. However, as the event is so unreliable, it’s still best to just get this information from the first event of the session.
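Getting the information from the session’s first event could look something like the sketch below. It assumes the traffic source is available in the `source` and `medium` event parameters, which is not guaranteed for every session (they can be null, for example, for direct traffic). The session_start and first_visit events are left out so that the result is the first "real" event of the session.

```sql
-- Sketch: read the traffic source from the first event of each session.
with events as (
  select
    concat(
      user_pseudo_id,
      (select value.int_value from unnest(event_params) where key = 'ga_session_id')
    ) as session_id,
    event_timestamp,
    event_name,
    -- Assumption: the source and medium are stored as event parameters.
    (select value.string_value from unnest(event_params) where key = 'source') as source,
    (select value.string_value from unnest(event_params) where key = 'medium') as medium
  from `<project>.<dataset>.events_*`
  where event_name not in ('session_start', 'first_visit')
)

select
  session_id,
  event_name as first_event_name,
  source,
  medium
from events
where session_id is not null
-- Keep only the earliest event of each session.
qualify row_number() over (partition by session_id order by event_timestamp) = 1
```

If several events share the earliest timestamp, the query picks one of them arbitrarily, so add a tie-breaker to the order by if that matters in your data.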
But does the new and returning users measurement work?
You would think that adding up new users + returning users would give you roughly the same number as total users. After all, shouldn’t each user belong in one or the other bucket? Of course, a new user could later turn into a returning user. So, depending on the logic used in the calculation, there could be a case where the same user is counted twice. However, at least all users should either be new or returning.
Well, that is not the case in GA4. In the worst cases, most of the users are neither new nor returning.
One of the reasons behind this discrepancy is that GA4 uses an entirely different logic for the returning users metric compared to the new users metric. In GA4, a returning user is one who has had at least one session before the current session.
The new users metric and the first_visit event, on the other hand, seem to be based on a simplistic cookie value check done on the client side. If the client id cookie doesn’t exist, the tracking script will add the _fv flag to the event.
This evaluation is way too simplistic and falls short in these cases, for example:
- A setup with multiple properties for sites that are under the same parent domain
Both of the above cases are very similar. When multiple properties are used within the same parent domain, the property that gets to drop the client id cookie first is the one that also gets the first_visit event. The first visit evaluation doesn’t work on the property level but on the cookie domain level.
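One way to look for this in the BigQuery export is to count the users who have a first session in the property but no first_visit event in it. This is only a sketch, and it assumes that the ga_session_number parameter is tracked separately for each property, even when the client id cookie is shared across properties.

```sql
-- Sketch: users whose first session in this property has no first_visit event.
with first_sessions as (
  select
    user_pseudo_id,
    countif(event_name = 'first_visit') as first_visit_events
  from `<project>.<dataset>.events_*`
  -- Only events that belong to the user's first session in this property.
  where (select value.int_value from unnest(event_params) where key = 'ga_session_number') = 1
  group by user_pseudo_id
)

select
  count(*) as users_with_first_session,
  countif(first_visit_events = 0) as users_without_first_visit
from first_sessions
```

A high users_without_first_visit count would suggest that another property on the same cookie domain got to set the client id cookie first.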
So, we can conclude that the number of new users is sometimes too low. What about the returning users? Looking at the data, it doesn’t quite seem to work as documented.
GA4 automatically tracks the user’s session number. The session number dimension is unavailable in the UI, but we can access it in BigQuery. The evaluation should be as simple as checking if the ga_session_number parameter is greater than one.
    select
      -- A returning user has at least one event with a session number greater than one
      count(
        distinct if(
          (select value.int_value from unnest(event_params) where key = 'ga_session_number') > 1,
          user_pseudo_id,
          null
        )
      ) as returning_users
    from `<project>.<dataset>.events_*`
Let’s calculate the same number as shown earlier using BigQuery.
This GA4 property is configured to use the device-based reporting identity. Regardless of that, the numbers are not even close, even though the logic should be the same.
Doing the same comparison on a property that doesn’t share the same domain with another property gives me a much smaller but still very notable difference.
These findings lead me to believe the returning users metric doesn’t work as documented.
Interestingly, the SQL query utilizing the ga_session_number parameter gives results very close to the new and returning user counts in Universal Analytics. Based on that, this method seems like the most accurate way to get this data.
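The same parameter can be used to get the new, returning, and total user counts side by side. The sketch below treats a user as new if they have a session with session number 1 in the queried data, and as returning if they have a session with a session number greater than 1. These definitions are assumptions made for illustration, not GA4’s documented logic.

```sql
-- Sketch: new, returning, and total users based on ga_session_number.
with user_sessions as (
  select
    user_pseudo_id,
    (select value.int_value from unnest(event_params) where key = 'ga_session_number') as session_number
  from `<project>.<dataset>.events_*`
),

users as (
  select
    user_pseudo_id,
    min(session_number) as min_session_number,
    max(session_number) as max_session_number
  from user_sessions
  where session_number is not null
  group by user_pseudo_id
)

select
  count(*) as total_users,
  -- Users who had their first session within the queried data.
  countif(min_session_number = 1) as new_users,
  -- Users who had at least one session after their first one.
  countif(max_session_number > 1) as returning_users
from users
```

With this logic, every user is new, returning, or both, so new_users + returning_users is never smaller than total_users. A user who had their first and second sessions within the queried data is counted in both buckets.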
Because of the above issues, I’ve been avoiding using the session_start and first_visit events for any analysis. Fortunately, the BigQuery export allows more accurate ways to get the session’s first event or the user’s returning vs. new status than using these two events.
Both events rely on logic that happens on the client side. That is a weird design because there are cases, as described in this post, where this method fails miserably.
Finally, when it comes to the new and returning users metrics in GA4, it’s almost as if these two were developed by two isolated teams. For these two to be consistent, shouldn’t they at least follow the same logic?
This post is a collection of issues related to these two events I’ve encountered while working with GA4. Please let me know if I’ve missed something!