Performance regression testing: making Core Web Vitals a CI/CD release gate

A green test suite proves your code is correct. It says nothing about whether the page got slower. Performance regression testing closes that gap: set Core Web Vitals budgets, measure every build against a baseline, and fail the pipeline when a change busts one. This guide wires that gate into CI/CD, from baselining main to the regressions only load reveals.

Written by: Ahmad Farzan · 7 June 2026 · Updated 18 July 2026

A Core Web Vitals release gate: across pull requests, Largest Contentful Paint stays under a 2.3-second budget until one build regresses to 2.4 seconds and the gate blocks it, even though that is still within Google's 2.5-second good threshold.

Summary

Performance regression testing compares every new build against a known-good baseline and fails the build when a change makes pages measurably slower. It matters because speed doesn't stay fixed: Google's research found most performance improvements regress within six months, and a Deloitte study for Google linked a tenth of a second of mobile speed to a conversion lift of more than eight percent. The gate is built on Core Web Vitals: Largest Contentful Paint, Interaction to Next Paint, and Cumulative Layout Shift. Start by measuring your main branch three to five times and keeping the median, because Google notes the median of five runs is about twice as stable as one. Then set each budget at that baseline plus a small tolerance, roughly ten percent, so a page that slides from good to slightly worse still fails, even while it's inside Google's absolute thresholds. Wire the check into your pipeline so a pull request that busts a budget shows a red check naming the page and the metric. Two blind spots remain. A cold page load has no interactions, so it can't measure Interaction to Next Paint; you have to script the clicks. And a lab run is one user on a quiet server: in the worked example, a product page held just over two seconds for a single user but climbed to nearly three and a half at five hundred concurrent users. So baseline your main branch, set budgets, gate every pull request, and add a real-browser load stage before release.


What is performance regression testing?

Performance regression testing compares the performance of a new build against a known-good baseline and fails the build when a change makes things measurably slower. A regression is just a change that moves a metric the wrong way. Functional tests ask whether the code is correct; performance regression tests ask whether it got slower, a question a passing suite never answers.

The term is broad. It also covers backend throughput, query latency, and API response times. This guide covers the front-end slice: the speed your users feel in the browser, measured with Core Web Vitals and enforced as a gate in your CI/CD pipeline (the automated pipeline that builds, tests, and ships your code on every change).

Core Web Vitals are the three metrics Google uses to score page experience. Largest Contentful Paint (LCP) measures loading, Interaction to Next Paint (INP) measures responsiveness, and Cumulative Layout Shift (CLS) measures visual stability. Google’s “good” thresholds, assessed at the 75th percentile of real visits, are 2.5 seconds for LCP, 200 milliseconds for INP, and 0.1 for CLS. The 75th percentile is the point where three of four visits are at least this good, chosen so a handful of outliers cannot flatter the score.
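To make the 75th-percentile rule concrete, here is a minimal sketch. It assumes a simple nearest-rank percentile, which is not necessarily how Google's own field tooling computes it, and the visit samples are invented for illustration.

```python
import math

# Google's "good" thresholds for the three Core Web Vitals.
GOOD_THRESHOLDS = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}


def percentile_75(samples):
    """Nearest-rank 75th percentile: the smallest sample that at least
    three of every four samples are less than or equal to."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def is_good(metric, samples):
    """True when the 75th percentile sits at or under the good threshold."""
    return percentile_75(samples) <= GOOD_THRESHOLDS[metric]


# Hypothetical LCP samples in milliseconds from eight visits.
lcp_visits = [1900, 2100, 2200, 2300, 2400, 2450, 3100, 5200]

print(percentile_75(lcp_visits))       # 2450
print(is_good("lcp_ms", lcp_visits))   # True
```

The two slowest visits sit above the 75th percentile, so they do not decide the verdict in either direction.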

Why gate, rather than audit once and move on? Because speed regresses. Research by Google found that “most performance improvements tend to regress within six months.” A page you tuned in January is slow again by July unless something keeps watching. And the cost is real: in a 2020 study Deloitte ran for Google, brands that improved mobile load time by 0.1 seconds saw retail conversions rise 8.4% and average order value rise 9.2% on average. (The study improved four timing metrics together across the journey, so read it as a correlation between speed and revenue, not a single dial you turn.) The field shows how common the failure is: in 2024, around 43% of mobile sites and 54% of desktop sites passed the Core Web Vitals assessment, which means roughly 57% of mobile sites and 46% of desktop sites ship an experience Google rates as needing improvement or poor.

The target itself moves too. In March 2024, Interaction to Next Paint replaced First Input Delay as a Core Web Vital. A gate written against last year’s metric set can quietly stop testing the thing Google now measures. New to all of this? Start with the complete performance testing guide.

Lab, field, and load: where a regression hides

A Core Web Vitals regression can appear in three different places, and a release gate that watches only one of them will miss the other two. Knowing which is which is the whole design.

Field data is what real users actually experienced, collected by real user monitoring (RUM) or Google’s Chrome User Experience Report. It is the ground truth, and Google’s guidance is to “always concentrate on field Core Web Vitals over Lighthouse metrics and scores.” The catch is that field data is reactive: it reports yesterday’s traffic, after the regression already shipped.

Lab data is a synthetic measurement under fixed conditions, the kind a tool like Lighthouse produces. It is repeatable and fast, so it can run on a pull request and catch a regression before it reaches a single user. That is what makes lab the natural engine of a CI gate.

Load is the dimension almost every CI tutorial skips. A lab run is one page, loaded once, by one synthetic user, with no one else loading shared services. Real traffic puts backends, CDNs, and third parties under concurrency. LCP can climb as those services slow, and network-bound interactions can degrade in each independent browser. Customer devices do not share a main thread, so the damage comes from shared services responding more slowly, not from browsers competing with each other for CPU. For the methodology of measuring Vitals at realistic concurrency, see Core Web Vitals at load.

There is one more wrinkle the table makes explicit: a standard lab page load cannot even measure INP, because INP needs interactions and a cold navigation has none.

Where it runs	What it answers	Cadence	Which Vitals it can capture	Blind spot
Field / RUM	What real users got	Continuous, after release	LCP, INP, CLS (real)	Reactive; cannot test a change before it ships
Lab navigation (Lighthouse CI)	Did this build regress in a clean room	Every commit or PR	LCP, CLS, TBT (proxy for INP)	One user, no concurrency, synthetic network
Lab interaction (user flows)	Did a specific interaction regress	Per build or pre-release	INP (real), LCP, CLS	Still one user, no concurrency
Load (real browsers at concurrency)	Do Vitals hold at peak	Pre-release or scheduled	LCP, INP, CLS, FCP under load	Heavier to run; not per-commit

The practical gate uses these as a staircase: lab navigation on every PR for the cheap, fast signal; scripted interaction where INP matters; field as the ground-truth backstop; and a load stage before release for the regressions concurrency hides.

What you need before you start

A per-build preview URL. CI needs a deployed version of the candidate to measure. A preview deployment per pull request, or a static build served inside the job, both work.
The journeys that matter. Pick the three to five pages where slowness costs you most, not every route. A bloated gate is a slow, noisy gate.
A baseline. You cannot detect a regression without a known-good number to compare against. Step 2 captures one.
A CI runner you control. Shared, burstable runners add measurement noise; Step 4 explains how to fight it.
A decision: block or warn. Agree up front whether a busted budget fails the build or just comments on the PR. You can start with warn and promote to block later.

Step 1: Choose the pages and journeys to defend

Start narrow. Pick the handful of URLs where a regression has real consequences and leave the rest out of the gate for now. A gate that checks every page is slow, noisy, and quickly ignored.

For our worked example, a storefront at example.shop, the list is three pages: the home page (/), a product page (/product), and checkout (/checkout). These are the steps that carry revenue, and they exercise different code paths: a marketing page, a data-heavy template, and an interaction-heavy form. We carry these same three URLs and their numbers through every step below.

When this works, you have a short, agreed list of URLs, and the team understands why each one is on it.

Step 2: Capture a baseline from your main branch

A budget pinned to Google’s absolute thresholds only catches pages that are already bad. It does nothing for a page that slides from good to slightly-less-good, which is what most regressions actually look like. To catch those, you need a baseline: the current, known-good numbers on your main branch.

Measure each URL on main several times and keep the median. Run-to-run variance is real, so a single measurement is not a baseline. Google notes that the median of five Lighthouse runs is about twice as stable as one run, so collect three to five and take the middle value.

For example.shop, the median LCP on main comes out at 1.8 seconds for the home page, 2.1 seconds for the product page, and 2.4 seconds for checkout, with CLS at 0.05 and Total Blocking Time around 220 milliseconds on the product page. Those numbers are the baseline.
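As a rough sketch of the baselining step: the individual run values below are invented for illustration, chosen only so that their medians match the example.shop baselines quoted above. The `baseline` helper is hypothetical, not part of any tool.

```python
from statistics import median

# Hypothetical LCP readings in milliseconds: five runs per URL on main.
runs_on_main = {
    "/": [1760, 1800, 1790, 1950, 1810],
    "/product": [2080, 2100, 2300, 2050, 2120],
    "/checkout": [2400, 2380, 2650, 2410, 2390],
}


def baseline(runs, minimum_runs=3):
    """Median of several runs; a single reading is not a baseline."""
    if len(runs) < minimum_runs:
        raise ValueError(f"need at least {minimum_runs} runs, got {len(runs)}")
    return median(runs)


baselines = {url: baseline(runs) for url, runs in runs_on_main.items()}
print(baselines)  # {'/': 1800, '/product': 2100, '/checkout': 2400}
```

Note how the one slow outlier in each list (1950, 2300, 2650) leaves the median untouched, which is the reason to prefer it over a mean or a single run.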

Before you trust the harness, take one manual reading too: run a quick speed test on the same URLs and compare. Evaluat Pulse returns LCP, CLS, FCP, and TTFB from one real-browser load, along with Evaluat’s A to F composite grade. It does not produce a representative INP on a cold load. Compare under like-for-like controlled conditions; a disagreement is a reason to inspect the runner, not proof that either result represents every customer device.

When this works, you have a recorded median per URL that you trust enough to defend. For the metric behind the headline number, see Largest Contentful Paint explained.

Step 3: Turn the baseline into budgets

A budget is the line a metric is not allowed to cross. Set each budget at the baseline plus a small tolerance, so normal noise passes but a genuine slowdown fails. A common starting tolerance is roughly 10%, or a fixed margin such as 150 milliseconds on LCP.

For the product page, a baseline LCP of 2.1 seconds plus 10% is 2.31 seconds, which rounds to a budget of 2.3 seconds. Now the regression that matters gets caught: a pull request that pushes the product page to 2.4 seconds fails the gate, even though 2.4 seconds is still inside Google’s absolute 2.5-second threshold. A budget anchored to your baseline sees the slide that an absolute threshold sleeps through.
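The arithmetic can be sketched in a few lines. The helper names are hypothetical, and rounding down to the nearest 100 milliseconds is an assumption made here to land on a tidy number; it is not something Lighthouse CI requires.

```python
GOOGLE_GOOD_LCP_MS = 2500


def budget_ms(baseline_ms, tolerance_pct=10, round_to_ms=100):
    """Baseline plus a percentage tolerance, rounded down to a tidy step.
    Integer arithmetic avoids floating-point surprises."""
    with_tolerance = baseline_ms * (100 + tolerance_pct) // 100
    return with_tolerance // round_to_ms * round_to_ms


def passes(measured_ms, limit_ms):
    """A metric passes when it is at or under its limit."""
    return measured_ms <= limit_ms


product_budget = budget_ms(2100)
print(product_budget)                      # 2300

regressed_lcp = 2400
print(passes(regressed_lcp, product_budget))       # False: the gate blocks it
print(passes(regressed_lcp, GOOGLE_GOOD_LCP_MS))   # True: an absolute threshold would not
```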

Lighthouse CI reads these as assertions. CLS is unitless, while LCP and TBT are in milliseconds:

{
  "ci": {
    "collect": { "numberOfRuns": 5 },
    "assert": {
      "assertions": {
        "largest-contentful-paint": ["error", { "maxNumericValue": 2300, "aggregationMethod": "median" }],
        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1, "aggregationMethod": "median" }],
        "total-blocking-time": ["warn", { "maxNumericValue": 300, "aggregationMethod": "median" }]
      }
    }
  }
}

Three notes. To give each URL its own budget, use an assertMatrix with a matchingUrlPattern per page, or a per-path budget.json. To compare against a baseline branch instead of relying only on static budgets, the Lighthouse CI server (@lhci/server) stores historical runs and surfaces regressions against them. And ratchet: when you legitimately make a page faster, lower its budget to lock the win in. A budget should only ever move down.
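Two of those notes lend themselves to a short sketch: generating a per-URL assertMatrix and ratcheting a budget. The URL patterns and the home and checkout budget values are illustrative assumptions, and the helper functions are hypothetical; only the `assertMatrix`, `matchingUrlPattern`, and `assertions` keys come from Lighthouse CI's config format.

```python
import json


def ratchet(current_budget_ms, candidate_budget_ms):
    """A budget only ever moves down: keep the tighter of the two."""
    return min(current_budget_ms, candidate_budget_ms)


def assert_matrix(lcp_budgets_ms):
    """Build a Lighthouse CI config with one LCP assertion per URL pattern."""
    return {
        "ci": {
            "assert": {
                "assertMatrix": [
                    {
                        "matchingUrlPattern": pattern,
                        "assertions": {
                            "largest-contentful-paint": [
                                "error",
                                {"maxNumericValue": budget},
                            ]
                        },
                    }
                    for pattern, budget in lcp_budgets_ms.items()
                ]
            }
        }
    }


# Illustrative budgets in milliseconds, keyed by a regex for each page.
budgets = {".*/$": 1900, ".*/product$": 2300, ".*/checkout$": 2500}

# The product page got faster, so a re-baselined budget of 2000 replaces 2300.
budgets[".*/product$"] = ratchet(budgets[".*/product$"], 2000)
# A slower re-baseline never loosens the budget.
budgets[".*/checkout$"] = ratchet(budgets[".*/checkout$"], 2700)

print(json.dumps(assert_matrix(budgets), indent=2))
```

Writing the output to lighthouserc.json keeps the budgets in one reviewable place, so a change to any limit shows up in a diff.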

Step 4: Wire the gate into the pipeline and fail on a real regression

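With URLs, a baseline, and budgets in hand, the pipeline job is small. A widely used community GitHub Action, treosh/lighthouse-ci-action, runs Lighthouse CI against your preview URLs for the configured number of runs and fails the build when a URL exceeds its budget. Lighthouse CI assertions default to an optimistic aggregation across runs, so set aggregationMethod to median on an assertion if you want the gate to judge the median: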

# .github/workflows/perf-gate.yml
name: Performance gate
on: pull_request
jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Lighthouse CI
        uses: treosh/lighthouse-ci-action@v12
        with:
          # The three gated pages on the per-PR preview deployment
          urls: |
            https://deploy-preview.example.shop/
            https://deploy-preview.example.shop/product
            https://deploy-preview.example.shop/checkout
          # The assertions (the budgets) live in the Lighthouse CI config
          configPath: ./lighthouserc.json
          # Upload reports to temporary public storage so reviewers can open them
          temporaryPublicStorage: true

The level on each assertion decides the consequence. An error fails the job and blocks the merge; a warn reports without blocking. Promote a metric to error once you trust its stability, and keep the rest on warn while you tune.

Stability is the part teams underestimate. Free and burstable CI runners are, in Google’s words, “typically quite volatile,” so a gate on a noisy runner fails on luck rather than on regressions. Defend against it three ways: take the median of three to five runs, use a dedicated runner instead of a shared one, and never run Lighthouse jobs concurrently on the same machine.

One accuracy trap. A default Lighthouse run is a cold navigation with no interactions, so it cannot measure INP; it reports Total Blocking Time as a lab proxy. To gate real INP you have to script the interaction, with Lighthouse user flows or a real-browser test that performs the actual clicks and taps. See Interaction to Next Paint explained for what you are protecting. Finally, keep a documented manual override for regressions you consciously accept; the gate exists to stop accidents, not to overrule engineering judgment.

When this works, a pull request that busts a budget shows a red check naming the URL and the metric, for example “largest-contentful-paint on /product is 2.4 s, budget 2.3 s.”
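The block-versus-warn logic can be sketched as follows. This is an illustration of the idea, not Lighthouse CI's implementation; the `Assertion` type, the helper names, and the measured values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    url: str
    metric: str
    level: str      # "error" blocks the merge, "warn" only reports
    budget_ms: int


def fmt(ms):
    """Seconds for values of a second or more, milliseconds otherwise."""
    return f"{ms / 1000:.1f} s" if ms >= 1000 else f"{ms} ms"


def evaluate(assertions, measured):
    """Split busted budgets into blocking failures and non-blocking warnings."""
    failures, warnings = [], []
    for a in assertions:
        value = measured[(a.url, a.metric)]
        if value <= a.budget_ms:
            continue  # within budget
        message = f"{a.metric} on {a.url} is {fmt(value)}, budget {fmt(a.budget_ms)}"
        (failures if a.level == "error" else warnings).append(message)
    return failures, warnings


assertions = [
    Assertion("/product", "largest-contentful-paint", "error", 2300),
    Assertion("/product", "total-blocking-time", "warn", 300),
]
# Hypothetical medians from the pull request's preview build.
measured = {
    ("/product", "largest-contentful-paint"): 2400,
    ("/product", "total-blocking-time"): 320,
}

failures, warnings = evaluate(assertions, measured)
for line in failures + warnings:
    print(line)

# A CI script would pass this value to sys.exit so the job fails on errors only.
exit_code = 1 if failures else 0
```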

Step 5: Add a load stage for Vitals under concurrency

Everything so far runs one synthetic user against a quiet server. That is the right engine for a per-commit gate, and Lighthouse CI is genuinely the right tool for it. But it is blind to the regression that only shows up under traffic, and that gap is where most “it was fast in staging” incidents come from.

Run the same journeys at realistic concurrency before release, as a pre-release or nightly stage rather than on every pull request, since load runs cost more than a single audit. Set a separate budget for Vitals at your target concurrency, then compare the load run against the single-user baseline from Step 2. The delta is the regression that traffic creates.

On example.shop, the product page holds LCP at 2.1 seconds for one user. At 500 concurrent virtual users its LCP climbs to 3.4 seconds and INP to 410 milliseconds, both past Google’s absolute thresholds, while the per-PR lab gate stayed green the whole time. The gate passed a build that a single real-browser session under load would have caught.
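A minimal sketch of that comparison, using the example.shop numbers. It assumes Google's good thresholds double as the load-stage budgets, which is one possible choice, not a rule; `load_report` is a hypothetical helper.

```python
# Assumed load-stage budgets: Google's "good" thresholds in milliseconds.
LOAD_BUDGETS = {"lcp_ms": 2500, "inp_ms": 200}

# Single-user baseline from Step 2. A cold load yields no INP to compare.
single_user = {"lcp_ms": 2100}
# Product page at 500 concurrent virtual users.
at_load = {"lcp_ms": 3400, "inp_ms": 410}


def load_report(baseline, loaded, budgets):
    """For each metric measured under load, report the value, the delta
    against the single-user baseline (when one exists), and the verdict."""
    report = {}
    for metric, value in loaded.items():
        report[metric] = {
            "value": value,
            "delta_vs_single_user": (
                value - baseline[metric] if metric in baseline else None
            ),
            "over_budget": value > budgets[metric],
        }
    return report


report = load_report(single_user, at_load, LOAD_BUDGETS)
for metric, row in report.items():
    print(metric, row)
```

Here the LCP delta of 1,300 milliseconds is the regression that traffic creates, and it is invisible to any single-user run.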

Performance Testing can supply the real-browser run and per-session evidence for this stage. Productizing it as a managed CI gate is the planned shape of Evaluat’s Testing Suite, not a live feature today. The methodology lives in Core Web Vitals at load.

How do you know it worked?

You know the gate is working when it changes what reaches your main branch. Four things confirm it: a pull request that busts a budget shows a red check, the failing run names the offending URL and metric, the diff against the baseline is legible to a reviewer, and the load-stage report exposes the worst sessions to open. Inspect each in turn.

Click into a failing run and you should see the measured value next to the budget, for example /product at 2.4 seconds against a 2.3-second budget. The diff against baseline is legible enough that a reviewer can see /product went from 2.1 to 2.4 seconds without reading a log. The load-stage report shows per-URL Vitals at your target concurrency, with the worst sessions addressable so you can open the one that broke. If all four hold, the gate is doing its job: regressions stop at the pull request, not in the field.

Common problems and fixes

Symptom	Cause	Fix
The gate fails randomly on unchanged code	A single run on a noisy, shared CI runner	Take the median of three to five runs, use a dedicated runner, and avoid concurrent Lighthouse jobs
The build passes but the field still slows down	Lab is one device, one network, no concurrency	Throttle to match your users, keep RUM as the ground truth, and add the load stage
The budget never fails	Budgets set at Google’s absolute thresholds, well above your baseline	Set budgets from the baseline plus a small tolerance, and ratchet them down as you improve
A new third-party tag tanks the score	Tags add main-thread work and bytes	Budget resource counts and sizes, and test the page with and without the tag
INP never appears in the report	A cold navigation has no interactions to measure	Script the interaction with Lighthouse user flows, or measure INP in a real-browser run

Make the gate routine

The loop is short: measure a baseline on main, set budgets from it, run the lab gate on every pull request, fail only on real regressions, and add a load stage before release for the slowdowns concurrency hides. Then ratchet the budgets down as the site gets faster. The gate turns page speed from a thing you audit once into a property the pipeline defends on every change.

Join the Testing Suite design-partner waitlist to discuss planned CI gates.

About the author

Ahmad Farzan · Founder at Evaluat

Founder of Evaluat. Has spent years building and load-testing Adobe Commerce and Magento storefronts, and built Evaluat to test sites the way real browsers actually hit them.

FAQ

What is performance regression testing?

Performance regression testing compares a build against a known-good baseline and fails when a change makes pages measurably slower. On the front end, the metrics you compare are usually Core Web Vitals: Largest Contentful Paint, Interaction to Next Paint, and Cumulative Layout Shift. It runs in CI, like a unit test for speed.

How do I test Core Web Vitals in a CI/CD pipeline?

Run a lab tool such as Lighthouse CI against a preview deployment on every pull request, take the median of three to five runs per URL, and assert each page against a budget. If a page exceeds its budget, the job exits non-zero and the build fails. Add a separate load stage to check Vitals under concurrency, which a single-user lab run cannot see.

Should I block a build on a Core Web Vitals regression, or just warn?

Start with warnings while you tune budgets and learn your pipeline variance, then promote the metrics you trust to blocking. Keep a documented manual override for regressions you consciously accept. The goal is to stop accidental slowdowns, not to halt every release over measurement noise.

Why does my Lighthouse score change between runs?

Lab results vary with network jitter, CPU contention, and background processes, especially on shared CI runners. Google notes that the median of five runs is about twice as stable as one run. Use the median of three to five runs on a dedicated runner to keep the gate from failing on noise.

Does Lighthouse measure INP in CI?

Not from a standard run. A cold page navigation has no interactions, and INP needs them, so Lighthouse reports Total Blocking Time as a lab proxy instead. To gate real INP you have to script the interaction, using Lighthouse user flows or a real-browser test that performs the clicks and taps you care about.

Should the CI gate use lab data or field data?

Both, for different jobs. Field data from real user monitoring is the ground truth for what users experience, but it is reactive and arrives after release. Lab data is synthetic and repeatable, so it can catch a regression on a pull request before it ships. Gate on lab, confirm on field.

Can Lighthouse CI test Core Web Vitals under load?

No. Lighthouse runs one page load with one synthetic user and no concurrency, so it cannot show how Vitals behave at peak traffic. LCP can rise as shared services slow, and network-bound interactions can degrade in each independent browser. Measuring that needs a real-browser load test as a separate pre-release stage.
