Performance regression testing: making Core Web Vitals a CI/CD release gate

A green test suite tells you the code behaves correctly. It says nothing about whether the page got slower. Performance regression testing closes that gap: set Core Web Vitals budgets, measure every build against a baseline, and fail the pipeline when a change busts one. This guide wires that gate into CI/CD, from baselining main to the regressions only load reveals.

Written by: Evaluat Staff

A Core Web Vitals release gate: across pull requests, Largest Contentful Paint stays under a 2.3-second budget until one build regresses to 2.4 seconds and the gate blocks it, even though that is still within Google's 2.5-second good threshold.

What is performance regression testing?

Performance regression testing compares the performance of a new build against a known-good baseline and fails the build when a change makes things measurably slower. A regression is just a change that moves a metric the wrong way. Functional tests ask whether the code is correct; performance regression tests ask whether it got slower, a question a passing suite never answers.

The term is broad. It also covers backend throughput, query latency, and API response times. This guide covers the front-end slice: the speed your users feel in the browser, measured with Core Web Vitals and enforced as a gate in your CI/CD pipeline (the automated pipeline that builds, tests, and ships your code on every change).

Core Web Vitals are the three metrics Google uses to score page experience. Largest Contentful Paint (LCP) measures loading, Interaction to Next Paint (INP) measures responsiveness, and Cumulative Layout Shift (CLS) measures visual stability. Google’s “good” thresholds, assessed at the 75th percentile of real visits, are 2.5 seconds for LCP, 200 milliseconds for INP, and 0.1 for CLS. The 75th percentile is the point where three of four visits are at least this good, chosen so a handful of outliers cannot flatter the score.
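As a rough illustration of the 75th-percentile rule, the sketch below applies a nearest-rank percentile to a made-up list of LCP samples and checks the result against the 2.5-second threshold quoted above. The sample values and function names are assumptions for illustration, and nearest-rank is only one of several percentile definitions, so Google's own tooling may compute the figure differently.

```python
import math

# Google's "good" thresholds as quoted in the paragraph above.
GOOD_THRESHOLDS = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}


def percentile_75(samples):
    """Nearest-rank 75th percentile: the smallest sample that at least
    75% of all samples do not exceed."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def passes_good(metric, samples):
    """True when the 75th percentile is within the 'good' threshold."""
    return percentile_75(samples) <= GOOD_THRESHOLDS[metric]


# Hypothetical LCP values (ms) from eight visits, including one slow outlier.
lcp_visits_ms = [1400, 1600, 1900, 2100, 2200, 2400, 2600, 5200]

print(percentile_75(lcp_visits_ms))          # 2400
print(passes_good("lcp_ms", lcp_visits_ms))  # True
```

Six of the eight visits are at 2400 ms or better, so the page counts as good here even though its two slowest visits are past the threshold.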

Why gate, rather than audit once and move on? Because speed regresses. Research by Google found that “most performance improvements tend to regress within six months.” A page you tuned in January is slow again by July unless something keeps watching. And the cost is real: in a 2020 study Deloitte ran for Google, brands that improved mobile load time by 0.1 seconds saw retail conversions rise 8.4% and average order value rise 9.2% on average. (The study improved four timing metrics together across the journey, so read it as a correlation between speed and revenue, not a single dial you turn.) The field shows how common the failure is: in 2024, around 43% of mobile sites and 54% of desktop sites passed the Core Web Vitals assessment, which means more than half of mobile sites and nearly half of desktop sites ship an experience Google rates as needing improvement or poor on at least one metric.

The target itself moves too. In March 2024, Interaction to Next Paint replaced First Input Delay as a Core Web Vital. A gate written against last year’s metric set can quietly stop testing the thing Google now measures. New to all of this? Start with what performance testing is.

Lab, field, and load: where a regression hides

A Core Web Vitals regression can appear in three different places, and a release gate that watches only one of them will miss the other two. Knowing which is which is the whole design.

Field data is what real users actually experienced, collected by real user monitoring (RUM) or Google’s Chrome User Experience Report. It is the ground truth, and Google’s guidance is to “always concentrate on field Core Web Vitals over Lighthouse metrics and scores.” The catch is that field data is reactive: it reports yesterday’s traffic, after the regression already shipped.

Lab data is a synthetic measurement under fixed conditions, the kind a tool like Lighthouse produces. It is repeatable and fast, so it can run on a pull request and catch a regression before it reaches a single user. That is what makes lab the natural engine of a CI gate.

Load is the dimension almost every CI tutorial skips. A lab run is one page, loaded once, by one synthetic user, with no one else on the server. Real traffic is hundreds of users at the same time. LCP climbs as the server slows under contention, and INP degrades as the main thread and the backend compete. A build that passes a single-user lab gate can still ship a regression that only appears at peak. For the methodology of measuring Vitals at realistic concurrency, see Core Web Vitals at load.

There is one more wrinkle, which the table below makes explicit: a standard lab page load cannot even measure INP, because INP needs interactions and a cold navigation has none.

| Where it runs | What it answers | Cadence | Which Vitals it can capture | Blind spot |
| --- | --- | --- | --- | --- |
| Field / RUM | What real users got | Continuous, after release | LCP, INP, CLS (real) | Reactive; cannot test a change before it ships |
| Lab navigation (Lighthouse CI) | Did this build regress in a clean room | Every commit or PR | LCP, CLS, TBT (proxy for INP) | One user, no concurrency, synthetic network |
| Lab interaction (user flows) | Did a specific interaction regress | Per build or pre-release | INP (real), LCP, CLS | Still one user, no concurrency |
| Load (real browsers at concurrency) | Do Vitals hold at peak | Pre-release or scheduled | LCP, INP, CLS, FCP under load | Heavier to run; not per-commit |

The practical gate uses these as a staircase: lab navigation on every PR for the cheap, fast signal; scripted interaction where INP matters; field as the ground-truth backstop; and a load stage before release for the regressions concurrency hides.

What you need before you start

  • A per-build preview URL. CI needs a deployed version of the candidate to measure. A preview deployment per pull request, or a static build served inside the job, both work.
  • The journeys that matter. Pick the three to five pages where slowness costs you most, not every route. A bloated gate is a slow, noisy gate.
  • A baseline. You cannot detect a regression without a known-good number to compare against. Step 2 captures one.
  • A CI runner you control. Shared, burstable runners add measurement noise; Step 4 explains how to fight it.
  • A decision: block or warn. Agree up front whether a busted budget fails the build or just comments on the PR. You can start with warn and promote to block later.

Step 1: Choose the pages and journeys to defend

Start narrow. Pick the handful of URLs where a regression has real consequences and leave the rest out of the gate for now. A gate that checks every page is slow, noisy, and quickly ignored.

For our worked example, a storefront at example.shop, the list is three pages: the home page (/), a product page (/product), and checkout (/checkout). These are the steps that carry revenue, and they exercise different code paths: a marketing page, a data-heavy template, and an interaction-heavy form. We carry these same three URLs and their numbers through every step below.

When this works, you have a short, agreed list of URLs, and the team understands why each one is on it.

Step 2: Capture a baseline from your main branch

A budget pinned to Google’s absolute thresholds only catches pages that are already bad. It does nothing for a page that slides from good to slightly-less-good, which is what most regressions actually look like. To catch those, you need a baseline: the current, known-good numbers on your main branch.

Measure each URL on main several times and keep the median. Run-to-run variance is real, so a single measurement is not a baseline. Google notes that the median of five Lighthouse runs is about twice as stable as one run, so collect three to five and take the middle value.

For example.shop, the median LCP on main comes out at 1.8 seconds for the home page, 2.1 seconds for the product page, and 2.4 seconds for checkout, with CLS at 0.05 and Total Blocking Time around 220 milliseconds on the product page. Those numbers are the baseline.
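A minimal sketch of how a baseline like this might be computed, assuming you have already collected several LCP readings per URL. The individual run values below are invented so that their medians land on the figures above; the function name and the three-run minimum are illustrative choices, not part of any tool.

```python
from statistics import median


def baseline_from_runs(runs_by_url, min_runs=3):
    """Return the median reading per URL, refusing to call a handful
    of runs a baseline."""
    baseline = {}
    for url, runs in runs_by_url.items():
        if len(runs) < min_runs:
            raise ValueError(f"{url}: need at least {min_runs} runs, got {len(runs)}")
        baseline[url] = median(runs)
    return baseline


# Hypothetical LCP readings in milliseconds, five runs per page on main.
lcp_runs_ms = {
    "/": [1750, 1800, 1790, 1920, 1810],
    "/product": [2050, 2100, 2180, 2090, 2400],
    "/checkout": [2400, 2350, 2390, 2460, 2410],
}

baseline_ms = baseline_from_runs(lcp_runs_ms)
for url, value in baseline_ms.items():
    print(f"{url}: {value} ms")
```

Note how the 2400 ms outlier on /product does not move the median, which is the reason to prefer it over a single run or a mean.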

When this works, you have a recorded median per URL that you trust enough to defend. For the metric behind the headline number, see Largest Contentful Paint explained.

Step 3: Turn the baseline into budgets

A budget is the line a metric is not allowed to cross. Set each budget at the baseline plus a small tolerance, so normal noise passes but a genuine slowdown fails. A common starting tolerance is roughly 10%, or a fixed margin such as 150 milliseconds on LCP.

For the product page, a baseline LCP of 2.1 seconds becomes a budget of 2.3 seconds. Now the regression that matters gets caught: a pull request that pushes the product page to 2.4 seconds fails the gate, even though 2.4 seconds is still inside Google’s absolute 2.5-second threshold. A budget anchored to your baseline sees the slide that an absolute threshold sleeps through.
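The arithmetic behind that budget can be sketched as follows. Rounding down to the nearest 100 ms is an assumption made here to reproduce the 2.3-second figure; the function names are illustrative.

```python
GOOGLE_GOOD_LCP_MS = 2500  # Google's absolute "good" threshold for LCP


def budget_from_baseline(baseline_ms, tolerance_pct=10, step_ms=100):
    """Baseline plus a percentage tolerance, rounded down to a tidy step.
    Integer arithmetic avoids floating-point surprises."""
    with_tolerance = baseline_ms * (100 + tolerance_pct) // 100
    return with_tolerance // step_ms * step_ms


def budget_from_margin(baseline_ms, margin_ms=150):
    """Alternative: baseline plus a fixed margin."""
    return baseline_ms + margin_ms


baseline_ms = 2100                              # product page median on main
budget_ms = budget_from_baseline(baseline_ms)   # 2310 rounded down to 2300
candidate_ms = 2400                             # the pull request under test

fails_budget = candidate_ms > budget_ms                 # True: the gate blocks it
passes_google = candidate_ms <= GOOGLE_GOOD_LCP_MS      # True: still "good" to Google

print(budget_ms, budget_from_margin(baseline_ms), fails_budget, passes_google)
```

The fixed-margin variant gives 2250 ms for the same page, a slightly stricter line; either way the 2400 ms build fails the budget while passing the absolute threshold.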

Lighthouse CI reads these as assertions. CLS is unitless, while LCP and TBT are in milliseconds:

{
  "ci": {
    "collect": { "numberOfRuns": 5 },
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["error", { "maxNumericValue": 2300 }],
        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1 }],
        "total-blocking-time": ["warn", { "maxNumericValue": 300 }]
      }
    }
  }
}

Three notes. To give each URL its own budget, use an assertMatrix with a matchingUrlPattern per page, or a per-path budget.json. To compare against a baseline branch instead of relying only on static budgets, the Lighthouse CI server (@lhci/server) stores historical runs and surfaces regressions against them. And ratchet: when you legitimately make a page faster, lower its budget to lock the win in. A budget should only ever move down.
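To make the first and third notes concrete, here is a small sketch that generates a per-URL assertMatrix and applies the ratchet rule. The home and checkout budgets, the helper names, and the URL regex are assumptions for illustration; check the generated patterns against your real preview URLs before relying on them.

```python
import json
import re


def ratchet(current_budget, proposed_budget):
    """A budget only moves down: keep whichever value is stricter."""
    return min(current_budget, proposed_budget)


def url_pattern(path):
    """Regex matching exactly this path on any host (hypothetical helper)."""
    return f"^https?://[^/]+{re.escape(path)}$"


def build_config(lcp_budgets_ms, runs=5):
    """Build a lighthouserc-style dict with one assertMatrix entry per page."""
    matrix = [
        {
            "matchingUrlPattern": url_pattern(path),
            "assertions": {
                "largest-contentful-paint": ["error", {"maxNumericValue": budget}],
            },
        }
        for path, budget in lcp_budgets_ms.items()
    ]
    return {"ci": {"collect": {"numberOfRuns": runs}, "assert": {"assertMatrix": matrix}}}


# /product uses the 2300 ms budget from the text; the other two are illustrative.
lcp_budgets_ms = {"/": 2000, "/product": 2300, "/checkout": 2600}
config = build_config(lcp_budgets_ms)
print(json.dumps(config, indent=2))

print(ratchet(2300, 2500))  # 2300: a slower measurement never loosens the budget
print(ratchet(2300, 2100))  # 2100: a genuine win tightens it
```

Writing the printed JSON to lighthouserc.json replaces the single shared assertions block with one budget per page.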

Step 4: Wire the gate into the pipeline and fail on a real regression

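With URLs, a baseline, and budgets in hand, the pipeline job is small. A widely used community GitHub Action, treosh/lighthouse-ci-action, wraps Lighthouse CI: it runs Lighthouse against your preview URLs the configured number of times and fails the build when a URL exceeds its budget. Which of those runs an assertion judges is governed by Lighthouse CI’s aggregationMethod option, so confirm it is set to a median rather than assuming it. The workflow: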

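# .github/workflows/perf-gate.yml
name: Performance gate
on: pull_request
jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Lighthouse CI
        uses: treosh/lighthouse-ci-action@v12
        with:
          # One URL per line; swap in your own per-PR preview host.
          urls: |
            https://deploy-preview.example.shop/
            https://deploy-preview.example.shop/product
            https://deploy-preview.example.shop/checkout
          # Request five runs from the action as well, matching numberOfRuns in the config.
          runs: 5
          # Budgets (assertions) come from the Step 3 config file.
          configPath: ./lighthouserc.json
          # Uploads reports to temporary public storage; turn off for private previews.
          temporaryPublicStorage: true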
The level on each assertion decides the consequence. An error fails the job and blocks the merge; a warn reports without blocking. Promote a metric to error once you trust its stability, and keep the rest on warn while you tune.

Stability is the part teams underestimate. Free and burstable CI runners are, in Google’s words, “typically quite volatile,” so a gate on a noisy runner fails on luck rather than on regressions. Defend against it three ways: take the median of three to five runs, use a dedicated runner instead of a shared one, and never run Lighthouse jobs concurrently on the same machine.
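One way to find out whether a runner is too noisy for a blocking gate is to measure the same unchanged commit several times and count how many batches would have failed. The sketch below does that with invented readings; the batch values and the function name are assumptions, and the budget is the 2300 ms product-page figure from Step 3.

```python
from statistics import median


def flaky_batches(batches, budget_ms):
    """Return the batch medians that would have failed the gate even
    though the code under test never changed."""
    medians = [median(batch) for batch in batches]
    return [m for m in medians if m > budget_ms]


BUDGET_MS = 2300

# Hypothetical LCP readings (ms): three batches of five runs, same commit.
shared_runner = [
    [2050, 2600, 2700, 2100, 2500],
    [2100, 2080, 2150, 2120, 2090],
    [2900, 2050, 2450, 2400, 2100],
]
dedicated_runner = [
    [2090, 2110, 2100, 2130, 2080],
    [2100, 2120, 2095, 2105, 2110],
    [2085, 2100, 2115, 2090, 2105],
]

print(flaky_batches(shared_runner, BUDGET_MS))     # [2500, 2400]
print(flaky_batches(dedicated_runner, BUDGET_MS))  # []
```

If unchanged code fails two batches out of three, as on the invented shared runner here, keep the assertion on warn until the runner is stable.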

One accuracy trap. A default Lighthouse run is a cold navigation with no interactions, so it cannot measure INP; it reports Total Blocking Time as a lab proxy. To gate real INP you have to script the interaction, with Lighthouse user flows or a real-browser test that performs the actual clicks and taps. See Interaction to Next Paint explained for what you are protecting. Finally, keep a documented manual override for regressions you consciously accept; the gate exists to stop accidents, not to overrule engineering judgment.

When this works, a pull request that busts a budget shows a red check naming the URL and the metric, for example “largest-contentful-paint on /product is 2.4 s, budget 2.3 s.”
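The error-versus-warn behavior and the shape of that message can be sketched in a few lines. This is a simplified model of a gate, not Lighthouse CI's actual implementation; the class, function, and measured values are hypothetical, and the numbers are left in raw units (milliseconds for LCP and TBT).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    metric: str
    level: str        # "error" blocks the merge, "warn" only reports
    max_value: float


def evaluate(url_path, measured, assertions):
    """Return (exit_code, messages). Only an 'error' breach makes the
    exit code non-zero; a 'warn' breach is reported but does not block."""
    exit_code = 0
    messages = []
    for a in assertions:
        value = measured[a.metric]
        if value <= a.max_value:
            continue
        messages.append(
            f"{a.level}: {a.metric} on {url_path} is {value:g}, budget {a.max_value:g}"
        )
        if a.level == "error":
            exit_code = 1
    return exit_code, messages


assertions = [
    Assertion("largest-contentful-paint", "error", 2300),
    Assertion("cumulative-layout-shift", "error", 0.1),
    Assertion("total-blocking-time", "warn", 300),
]

# Hypothetical measurements for the regressed pull request.
measured = {
    "largest-contentful-paint": 2400,
    "cumulative-layout-shift": 0.05,
    "total-blocking-time": 320,
}

exit_code, messages = evaluate("/product", measured, assertions)
print("\n".join(messages))
print("exit code:", exit_code)
```

Here the LCP breach sets the exit code to 1 and blocks the merge, while the TBT breach appears in the output without affecting the result.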

Step 5: Add a load stage for Vitals under concurrency

Everything so far runs one synthetic user against a quiet server. That is the right engine for a per-commit gate, and Lighthouse CI is genuinely the right tool for it. But it is blind to the regression that only shows up under traffic, and that gap is where most “it was fast in staging” incidents come from.

Run the same journeys at realistic concurrency before release, as a pre-release or nightly stage rather than on every pull request, since load runs cost more than a single audit. Set a separate budget for Vitals at your target concurrency, then compare the load run against the single-user baseline from Step 2. The delta is the regression that traffic creates.

On example.shop, the product page holds LCP at 2.1 seconds for one user. At 500 concurrent virtual users its LCP climbs to 3.4 seconds and INP to 410 milliseconds, both past Google’s absolute thresholds, while the per-PR lab gate stayed green the whole time. The single-user gate passed a build that one real session under load would have flagged.
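The comparison itself is simple once both sets of numbers exist. The sketch below uses the product-page figures from this section and, as an assumption, sets the load budget at Google's absolute thresholds; your own load budget may be tighter or looser. The function and key names are illustrative.

```python
def compare_under_load(single_user, under_load, load_budget):
    """Per metric: the value under load, its delta against the
    single-user baseline (None if the baseline lacks that metric),
    and whether it busts the load budget."""
    report = {}
    for metric, loaded in under_load.items():
        base = single_user.get(metric)  # lab navigation has no INP baseline
        report[metric] = {
            "loaded": loaded,
            "delta": None if base is None else loaded - base,
            "over_budget": loaded > load_budget[metric],
        }
    return report


single_user = {"lcp_ms": 2100}                  # Step 2 baseline
under_load = {"lcp_ms": 3400, "inp_ms": 410}    # at 500 concurrent virtual users
load_budget = {"lcp_ms": 2500, "inp_ms": 200}   # assumption: Google's thresholds

report = compare_under_load(single_user, under_load, load_budget)
failed = [metric for metric, row in report.items() if row["over_budget"]]

for metric, row in report.items():
    print(metric, row)
print("failed:", failed)
```

The 1300 ms LCP delta is the part of the regression that traffic creates, and it is invisible to any single-user run.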

This is the dimension Evaluat is built for: a real-browser performance testing platform that runs each virtual user in its own isolated browser and captures LCP, INP, CLS, and FCP per session under load. When a build busts its load budget, the per-session video, network log, and console log show which user hit the wall and why, so a failed gate is a starting point for debugging rather than a bare number. The methodology lives in Core Web Vitals at load; productizing this pattern as a CI gate is the planned shape of Evaluat’s Testing Suite.

How do you know it worked?

You know the gate is working when it changes what reaches your main branch. Four things confirm it: a pull request that busts a budget shows a red check, the failing run names the offending URL and metric, the diff against the baseline is legible to a reviewer, and the load-stage report exposes the worst sessions to open. Inspect each in turn.

Click into a failing run and you should see the measured value next to the budget, for example /product at 2.4 seconds against a 2.3-second budget. The diff against baseline is legible enough that a reviewer can see /product went from 2.1 to 2.4 seconds without reading a log. The load-stage report shows per-URL Vitals at your target concurrency, with the worst sessions addressable so you can open the one that broke. If all four hold, the gate is doing its job: regressions stop at the pull request, not in the field.

Common problems and fixes

| Symptom | Cause | Fix |
| --- | --- | --- |
| The gate fails randomly on unchanged code | A single run on a noisy, shared CI runner | Take the median of three to five runs, use a dedicated runner, and avoid concurrent Lighthouse jobs |
| The build passes but the field still slows down | Lab is one device, one network, no concurrency | Throttle to match your users, keep RUM as the ground truth, and add the load stage |
| The budget never fails | Budgets set at Google’s absolute thresholds, well above your baseline | Set budgets from the baseline plus a small tolerance, and ratchet them down as you improve |
| A new third-party tag tanks the score | Tags add main-thread work and bytes | Budget resource counts and sizes, and test the page with and without the tag |
| INP never appears in the report | A cold navigation has no interactions to measure | Script the interaction with Lighthouse user flows, or measure INP in a real-browser run |

Make the gate routine

The loop is short: measure a baseline on main, set budgets from it, run the lab gate on every pull request, fail only on real regressions, and add a load stage before release for the slowdowns concurrency hides. Then ratchet the budgets down as the site gets faster. The gate turns page speed from a thing you audit once into a property the pipeline defends on every change.

Common questions

What is performance regression testing?

Performance regression testing compares a build against a known-good baseline and fails when a change makes pages measurably slower. On the front end, the metrics you compare are usually Core Web Vitals: Largest Contentful Paint, Interaction to Next Paint, and Cumulative Layout Shift. It runs in CI, like a unit test for speed.

How do I test Core Web Vitals in a CI/CD pipeline?

Run a lab tool such as Lighthouse CI against a preview deployment on every pull request, take the median of three to five runs per URL, and assert each page against a budget. If a page exceeds its budget, the job exits non-zero and the build fails. Add a separate load stage to check Vitals under concurrency, which a single-user lab run cannot see.

Should I block a build on a Core Web Vitals regression, or just warn?

Start with warnings while you tune budgets and learn your pipeline variance, then promote the metrics you trust to blocking. Keep a documented manual override for regressions you consciously accept. The goal is to stop accidental slowdowns, not to halt every release over measurement noise.

Why does my Lighthouse score change between runs?

Lab results vary with network jitter, CPU contention, and background processes, especially on shared CI runners. Google notes that the median of five runs is about twice as stable as one run. Use the median of three to five runs on a dedicated runner to keep the gate from failing on noise.

Does Lighthouse measure INP in CI?

Not from a standard run. A cold page navigation has no interactions, and INP needs them, so Lighthouse reports Total Blocking Time as a lab proxy instead. To gate real INP you have to script the interaction, using Lighthouse user flows or a real-browser test that performs the clicks and taps you care about.

Should the CI gate use lab data or field data?

Both, for different jobs. Field data from real user monitoring is the ground truth for what users experience, but it is reactive and arrives after release. Lab data is synthetic and repeatable, so it can catch a regression on a pull request before it ships. Gate on lab, confirm on field.

Can Lighthouse CI test Core Web Vitals under load?

No. Lighthouse runs one page load with one synthetic user and no concurrency, so it cannot show how Vitals behave at peak traffic. LCP rises as the server slows under load and INP degrades as the main thread contends. Measuring Vitals under load needs a real-browser load test run as a separate pre-release stage.
