Core Web Vitals Load Testing: Methodology

Why RUM alone is not enough

Real User Monitoring is the right primary data source for Vitals. It is what your customers actually experienced. The 75th-percentile thresholds Google publishes are calibrated against RUM-style field data.

RUM has two structural blind spots, and both bite at the wrong time.

It is reactive. RUM tells you about yesterday’s traffic. The Vitals regression that shipped this morning will not show up cleanly until enough customers have visited the affected pages, which can take hours or days depending on volume. By the time the dashboard turns yellow, the regression is already live.

It cannot model a concurrency you have not served. The Black Friday plan that requires sustaining 4x normal traffic cannot be validated by RUM, because RUM only sees the traffic you currently have. You need a synthetic load test to know whether the site will hold its Vitals targets at that concurrency.

Synthetic real-browser load testing covers both gaps. You can run it before customers visit (pre-deploy, post-deploy, on a schedule), and you can run it at whatever concurrency you want (5x peak, 10x peak, sustained for an hour).

What “Vitals at load” actually requires

A useful Core Web Vitals load test has a specific architecture:

One real browser per virtual user. Vitals are measured by the browser, natively. No browser, no native measurement; the alternative is heuristic approximations that drift from field data.
Isolated browser instances. Memory, CPU, cache, cookies, and network stack must not cross between virtual users. Shared-browser models produce contention that has nothing to do with real users.
Real network conditions. A 100 Gbps lab network does not predict customer experience on residential broadband. Configurable throttling, or runs from geographies that match the customer base, is the way to model this honestly.
Per-session capture. Every virtual user needs an addressable session with video, network log, console output, and Vitals values. Aggregates without per-session evidence cannot answer “for which users did LCP spike?”.
Configurable traffic shape. Ramp-up, steady-state, ramp-down. Match the shape that exposes the regression you care about (gradual capacity, spike, soak).

Any tool missing one of these produces numbers that look like Vitals but do not behave like Vitals.
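The traffic-shape requirement can be made concrete with a small sketch. The Python below is illustrative only: the TrafficShape class, its field names, and the linear ramp are assumptions made for this example, not the configuration format of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficShape:
    """Hypothetical ramp-up / steady-state / ramp-down profile, in seconds."""
    peak_users: int
    ramp_up_s: float
    steady_s: float
    ramp_down_s: float

    def users_at(self, t: float) -> int:
        """Target number of concurrent virtual users t seconds into the run."""
        if t < 0:
            return 0
        if t < self.ramp_up_s:
            # Linear ramp from 0 to peak (an assumption; real tools may step).
            return round(self.peak_users * t / self.ramp_up_s)
        t -= self.ramp_up_s
        if t < self.steady_s:
            return self.peak_users
        t -= self.steady_s
        if t < self.ramp_down_s:
            return round(self.peak_users * (1 - t / self.ramp_down_s))
        return 0


# Two shapes for two different questions (numbers are made up).
spike = TrafficShape(peak_users=1000, ramp_up_s=60, steady_s=300, ramp_down_s=60)
soak = TrafficShape(peak_users=200, ramp_up_s=600, steady_s=3600, ramp_down_s=600)
```

Since each virtual user is one isolated browser, users_at(t) is also the number of browser instances that would need to be alive at that point in the run.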

A useful test scenario

A Core Web Vitals load test is built from the same primitives as any real-browser test:

A scenario. The user journey, from navigation through to conversion or whatever the meaningful endpoint is. Build it in a visual editor; reuse it across performance tests, smoke tests, and monitors.
A traffic shape. How many virtual users, ramped over how long, sustained for how long. Match the shape to the question (capacity rehearsal vs release validation vs sustained soak).
A region. Where the virtual users originate. Latency from London is different from latency from Frankfurt or São Paulo, and Vitals are sensitive to round-trip time (RTT) and TLS handshake time.
Vitals capture per session. LCP, INP, CLS, FCP, captured natively by every virtual user. Aggregated across the run, addressable per session for debugging.

The output is a report with five views: time-series Vitals across the run, per-URL Vitals breakdown, per-session detail with video, console logs, and network logs. From any view, you can drill into the worst sessions and see exactly what made them slow.
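The "aggregated across the run, addressable per session" idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical SessionSample record and a nearest-rank 75th percentile; a real report may interpolate percentiles differently and would carry INP, CLS, FCP, video, and logs alongside LCP.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionSample:
    """Hypothetical per-session record; field names are assumptions."""
    session_id: str
    url: str
    lcp_ms: float


def p75(values):
    """Nearest-rank 75th percentile (one of several valid definitions)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p75 needs at least one sample")
    return ordered[math.ceil(0.75 * len(ordered)) - 1]


def p75_lcp_by_url(samples):
    """Aggregate view: one p75 LCP value per URL."""
    by_url = defaultdict(list)
    for s in samples:
        by_url[s.url].append(s.lcp_ms)
    return {url: p75(vals) for url, vals in by_url.items()}


def worst_sessions(samples, url, limit=3):
    """Drill-down view: the slowest sessions for one URL, slowest first."""
    on_url = [s for s in samples if s.url == url]
    return sorted(on_url, key=lambda s: s.lcp_ms, reverse=True)[:limit]


# Made-up data for illustration.
samples = [
    SessionSample("s1", "/product", 2100),
    SessionSample("s2", "/product", 2300),
    SessionSample("s3", "/product", 2600),
    SessionSample("s4", "/product", 4200),
    SessionSample("s5", "/cart", 1800),
    SessionSample("s6", "/cart", 1900),
]
```

Keeping the session_id on every sample is what makes the drill-down possible: the aggregate says /product is slow, and worst_sessions names the sessions whose video and network log are worth opening.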

Budgeting Vitals in CI

The pattern that works for most teams:

Set per-URL Vitals budgets aligned with Google’s thresholds (LCP 2.5s, INP 200ms, CLS 0.1 at the 75th percentile).
Run a post-deploy real-browser smoke test against the deployed environment after every promotion.
Fail the build (or the promotion) when a page busts its budget.
Use diff reporting between the current and previous green runs to surface what changed and where.
Keep an explicit override for accepted regressions, documented per release.

This puts Vitals on the same footing as a unit test. Regressions stop reaching customers. The conversation moves from “RUM looks bad this week, what changed?” to “the build failed because the candidate’s LCP on /product is 3.1s vs 2.4s on main.”
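A budget gate of this kind can be sketched as follows. It assumes the 75th-percentile values per URL have already been computed for the candidate run and for the previous green run; the function name, the dictionary layout, and the override format are hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Google's published "good" thresholds, evaluated at the 75th percentile.
DEFAULT_BUDGET = {"lcp_ms": 2500.0, "inp_ms": 200.0, "cls": 0.1}


@dataclass(frozen=True)
class Violation:
    url: str
    metric: str
    candidate: float
    budget: float
    baseline: Optional[float]  # same metric in the previous green run, if any


def check_budgets(candidate, baseline, budgets=None, overrides=frozenset()):
    """Return budget violations for a candidate run.

    candidate / baseline: {url: {metric: p75 value}}
    budgets: optional per-URL overrides of DEFAULT_BUDGET
    overrides: set of (url, metric) pairs accepted as known regressions
    """
    violations = []
    for url, metrics in candidate.items():
        budget = {**DEFAULT_BUDGET, **(budgets or {}).get(url, {})}
        for metric, value in metrics.items():
            limit = budget.get(metric)
            if limit is None or value <= limit or (url, metric) in overrides:
                continue
            previous = baseline.get(url, {}).get(metric)
            violations.append(Violation(url, metric, value, limit, previous))
    return violations


def exit_code(violations):
    """Non-zero fails the build or promotion step."""
    return 1 if violations else 0


# Made-up runs mirroring the /product example in the text.
main_run = {"/product": {"lcp_ms": 2400, "inp_ms": 170, "cls": 0.04}}
candidate_run = {"/product": {"lcp_ms": 3100, "inp_ms": 180, "cls": 0.05}}
violations = check_budgets(candidate_run, main_run)
```

Carrying the baseline value in each violation is what turns a bare failure into a diff: the message can say 3100 ms against a 2500 ms budget, up from 2400 ms on the last green run. Passing overrides={("/product", "lcp_ms")} models a documented, accepted regression.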

Testing Suite is the Evaluat product for this pattern. It is on the roadmap and the Testing Suite page explains the planned shape.

Common mistakes

Trusting single-user lab data. Lighthouse on a fast network with no concurrent load does not predict Vitals at peak traffic.
Aggregating Vitals across all pages. Page-level budgets matter. A site-wide LCP of 2.6s can hide a single critical page at 4.2s.
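Reporting the mean. The published threshold is the 75th percentile. The mean hides the slow tail that the threshold is meant to expose, and on typical page-timing distributions it sits below the 75th percentile, so it reads optimistic.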
Treating Vitals as a one-time audit. Add a third-party tag, change a CDN config, ship a new framework version, and Vitals shift. The test has to run continuously to catch drift.
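Two of these mistakes are easy to show with numbers. The sketch below uses made-up LCP samples and a nearest-rank 75th percentile (an assumption; other percentile definitions give slightly different values) to show a mean that passes while the p75 fails, and a pooled site-wide figure that hides a slow page.

```python
import math
from statistics import mean

LCP_BUDGET_MS = 2500


def p75(values):
    """Nearest-rank 75th percentile."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p75 needs at least one sample")
    return ordered[math.ceil(0.75 * len(ordered)) - 1]


# Mistake: reporting the mean. Same samples, two different verdicts.
lcp_ms = [1800, 1900, 2000, 2200, 2400, 2900, 3100, 3300]
mean_lcp = mean(lcp_ms)  # under budget
p75_lcp = p75(lcp_ms)    # over budget

# Mistake: aggregating across pages. Pooling lets a busy, fast page
# dilute a slow, critical one.
by_page = {
    "/": [1500, 1700, 1900, 2100, 2400, 2600],
    "/checkout": [3900, 4200],
}
site_wide = p75([v for vals in by_page.values() for v in vals])
per_page = {url: p75(vals) for url, vals in by_page.items()}
```

With this data the mean is 2450 ms against a p75 of 2900 ms, and the pooled p75 is 2600 ms while /checkout on its own sits at 4200 ms.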

For the underlying methodology of running tests in real browsers, see Real-browser load testing. For the Vitals themselves, see LCP and INP.

Common questions

Do synthetic Vitals match RUM Vitals?

They will not match exactly. RUM aggregates across every device, network, and traffic pattern your customers bring. Synthetic is a controlled environment with fixed conditions. The value of synthetic is the controlled comparison: deploy A vs deploy B, peak load vs idle, with vs without the new third-party tag. RUM is the ground truth for steady state. Synthetic is the ground truth for change.

How is this different from PageSpeed Insights or Lighthouse?

PageSpeed Insights and Lighthouse run a single browser against a single URL with no concurrent load. Useful for one-off audits. Useless for measuring what Vitals will look like when 1,000 users are hitting the same server at the same time. Real-browser load testing covers the concurrency case.

Should I gate releases on Vitals budgets?

Most teams that adopt this find that the regressions it stops outweigh the friction it adds. The pattern is: per-URL LCP and INP thresholds, enforced in CI by a post-deploy smoke test, with a manual override for known-acceptable degradations. The build fails when a page busts its budget; the team either fixes it or signs off on the regression in writing.

What concurrencies should I test at?

Match the test to the question. For capacity rehearsals, test at 5x to 10x your peak measured concurrency. For release validation, test at typical peak. For monitoring, test at a constant low concurrency from chosen regions to baseline drift. There is no single right number, but a single-user Vitals capture is almost never enough.
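As a rough sketch, the rule of thumb above can be turned into a small planning helper. The function name and the default monitoring concurrency of 5 are assumptions for illustration; the multipliers are the ones given in the answer.

```python
def plan_concurrency(peak_concurrency, monitor_users=5):
    """Map each test purpose to a (min, max) virtual-user range.

    peak_concurrency: measured peak concurrent users.
    monitor_users: hypothetical constant low concurrency for monitoring.
    """
    if peak_concurrency <= 0:
        raise ValueError("peak_concurrency must be positive")
    return {
        "capacity_rehearsal": (5 * peak_concurrency, 10 * peak_concurrency),
        "release_validation": (peak_concurrency, peak_concurrency),
        "monitoring": (monitor_users, monitor_users),
    }


plan = plan_concurrency(400)
```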

Want to see this measured on your app?

30 minutes. We build a scenario on your real customer journey, run a small test, and walk you through the report with your data in it.
