
What is performance testing? A QA engineer's guide to testing under real traffic

Your app works fine for one user. Then a launch sends three thousand at once and pages crawl. Performance testing is how QA teams measure speed, stability, and scale under real traffic, on purpose, in a test instead of in production. This guide covers what performance testing is, its types, and when to run it.

Written by: Evaluat Staff · 20 May 2026

Figure: Load testing. Concurrent virtual users ramp up, hold at a steady load, then ramp down, with each virtual user running in its own browser.

What is performance testing?

Performance testing is a type of software testing that measures how a system behaves under load: how fast it responds, how stable it stays, and how well it scales as traffic grows. Instead of checking whether a feature works, it checks how well the system works when many people use it at the same time.

That distinction matters. Functional testing asks one question: does the feature do what it should. Performance testing asks a different one: does the system stay fast and stable when real traffic arrives. A checkout flow can pass every functional test and still fall over when a thousand people check out at once.

The traffic in a performance test is not real people. It is virtual users: scripted or recorded sessions that each behave like one visitor moving through your app. A tool runs hundreds or thousands of them at once, against an environment you control, so you find the limits on purpose, in a test, instead of in production where a failure costs customers.
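To make the idea of a virtual user concrete, here is a minimal sketch in Python. It is not how any particular tool works internally: `fake_request`, `virtual_user`, and `run_users` are hypothetical names, and the "request" is a short simulated delay standing in for a real HTTP call or browser action.

```python
import asyncio
import random
import time


async def fake_request(step: str) -> float:
    """Stand-in for a real HTTP call or browser action (hypothetical)."""
    started = time.perf_counter()
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated latency
    return time.perf_counter() - started


async def virtual_user(user_id: int, journey: list) -> list:
    """One virtual user walks the journey step by step, like one visitor."""
    timings = []
    for step in journey:
        elapsed = await fake_request(step)
        timings.append((user_id, step, elapsed))
    return timings


async def run_users(count: int, journey: list) -> list:
    """Run `count` virtual users concurrently and flatten their timings."""
    per_user = await asyncio.gather(
        *(virtual_user(i, journey) for i in range(count))
    )
    return [row for rows in per_user for row in rows]


if __name__ == "__main__":
    results = asyncio.run(run_users(50, ["search", "product", "checkout"]))
    print(f"{len(results)} timed steps from 50 virtual users")
```

Each virtual user runs its own journey in order, while the users as a group run at the same time. That is the property a real tool scales up to hundreds or thousands of sessions.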

Performance testing vs load testing

Performance testing is the umbrella; load testing is one type within it. Performance testing covers every way a system can be put under pressure, while load testing checks one specific condition: expected peak traffic. People often use the two terms interchangeably, but they are not synonyms: every load test is a performance test, and not every performance test is a load test.

The confusion is common because load testing is the type most teams start with. It answers the everyday question, “will we survive the traffic we actually expect,” and it is the baseline the other types build on. If you only ever load test, you learn how the system behaves on a good day. The other types of performance testing tell you what happens on a bad one: a surge you did not plan for, a slow leak over a long weekend, a database that has quietly doubled in size.

The types of performance testing

Performance testing splits into a handful of types, each asking a different question about the system under pressure. The most common are load, stress, spike, soak, scalability, and volume testing. You choose the type by the risk you are testing for, and mature teams use several together.

Type	What it checks	When to use it
Load testing	Behavior at expected peak traffic	Validate a release or rehearse for a known event
Stress testing	Where and how the system breaks past its limit	Find the ceiling and the failure mode
Spike testing	Survival of a sudden traffic surge	Flash sales, viral posts, breaking news
Soak (endurance) testing	Slow degradation over hours, such as memory leaks	Long-running stability
Scalability testing	How performance changes as you add resources	Capacity planning
Volume testing	Behavior with large data volumes, not just users	Database and storage growth

Start with load testing, then add the others as risk demands. A retailer expecting a sale needs spike testing; a service that runs for days needs soak testing; a team adding servers needs scalability testing. They are not interchangeable, and each catches a failure the others miss.

Why performance testing matters for QA teams

Performance testing matters because speed and stability are features your users feel, and most performance failures only appear under load. A page that is fast for one tester can crawl for a thousand real users, and that gap costs conversions, trust, and sometimes the whole sale.

The numbers are blunt. A 2020 study by Google and Deloitte, Milliseconds Make Millions, found that a 0.1 second improvement in mobile site speed lifted retail conversions by 8.4% and travel conversions by 10.1%. Portent’s 2022 analysis of more than 100 million page views found that pages loading in one second convert at 3.05%, falling to 1.12% by three seconds. Google and SOASTA’s 2017 benchmarks showed the probability of a bounce rises 32% as load time grows from one to three seconds.

Most sites are not keeping up. The HTTP Archive’s 2025 Web Almanac found only 48% of mobile sites pass Core Web Vitals. Those costs land hardest at peak, which is exactly when you most want the traffic to convert. And a failure at that moment has a face: a real person who tried to buy and could not.

How does performance testing work?

A performance test follows four steps: model a real user journey, decide how much traffic to send and how fast, run it against a production-like environment, then read the results against a target. Each step is a decision, and the test is only as honest as those decisions.

Design the scenario

A scenario is the journey one virtual user performs, the sequence of steps a real visitor would take: search, open a product, add to basket, check out. The closer the scenario is to genuine behavior, the more honest the test. A test of your cheapest page tells you almost nothing about your checkout.

Set the load profile

The load profile is how many virtual users you run and how you introduce them. Concurrency is the number active at the same time. Ramp is how quickly you reach that number. Ramping 1,000 users over ten minutes tests steady growth; dropping all 1,000 at once tests a spike. Pick the shape that matches the event you are worried about.
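A load profile can be written down as a small function of time. The sketch below is one way to express it, assuming a linear ramp, a flat hold, and an optional linear ramp down; `target_users` and its parameter names are illustrative, not taken from any specific tool.

```python
def target_users(t: float, peak: int, ramp_s: float, hold_s: float,
                 ramp_down_s: float = 0.0) -> int:
    """Virtual users that should be active t seconds into the test."""
    if t < 0:
        return 0
    if ramp_s > 0 and t < ramp_s:
        # Linear ramp up from zero to peak.
        return round(peak * t / ramp_s)
    if t < ramp_s + hold_s:
        # Steady hold at peak concurrency.
        return peak
    end = ramp_s + hold_s + ramp_down_s
    if ramp_down_s > 0 and t < end:
        # Linear ramp down from peak to zero.
        return round(peak * (end - t) / ramp_down_s)
    return 0


# Steady growth: 1,000 users over ten minutes (the hold length is an assumption).
print(target_users(300, peak=1000, ramp_s=600, hold_s=300))  # 500

# Spike: no ramp at all, so all 1,000 users are active from the first second.
print(target_users(0, peak=1000, ramp_s=0, hold_s=300))  # 1000
```

The two calls use the same peak and differ only in `ramp_s`, which is the whole difference between a steady-growth test and a spike test.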

Run the test

Run the scenario against an environment that resembles production: the same infrastructure size, the same data volume, ideally the same regions your users come from. Testing against an empty database on an undersized box produces numbers that feel reassuring and mean nothing.

Read the metrics

A performance test produces a handful of core metrics. Response time is how long a request takes. Throughput is how many requests the system handles per second. Error rate is the share of requests that fail. Because averages hide pain, read response time as percentiles: the p95 is the response time that 95% of requests met or beat, while the slowest 5% took longer. The p95 and p99 are where real users feel the system struggle, so QA teams budget against them, not against the average.
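The three core metrics are simple to compute from raw samples. The sketch below assumes each sample is a `(response_time_ms, ok)` pair and uses the nearest-rank definition of a percentile; tools differ in how they interpolate, so treat the exact method as an assumption. The sample data is made up to show how an average can hide the slow tail.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, duration_s):
    """samples: list of (response_time_ms, ok) collected over duration_s seconds."""
    if not samples:
        raise ValueError("no samples")
    times = [ms for ms, _ in samples]
    failures = sum(1 for _, ok in samples if not ok)
    return {
        "mean_ms": sum(times) / len(times),
        "p95_ms": percentile(times, 95),
        "p99_ms": percentile(times, 99),
        "throughput_rps": len(samples) / duration_s,
        "error_rate": failures / len(samples),
    }


# Illustrative data: 94 fast requests and 6 slow, failing ones in 10 seconds.
samples = [(200, True)] * 94 + [(3000, False)] * 6
report = summarize(samples, duration_s=10)
print(report["mean_ms"], report["p95_ms"])  # 368.0 3000
```

The mean of 368 ms looks healthy, while the p95 of 3,000 ms shows what the unlucky users actually waited for.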

Putting it together

A concrete example makes the four steps click. Say your analytics show a typical peak of 200 concurrent users, and a sale next week is expected to triple that. You build a scenario that logs in, searches, opens a product, and checks out. You set the load profile to ramp from zero to 600 virtual users over five minutes, then hold for fifteen, running from the regions most of your customers sit in. You point it at a staging environment sized like production, with a realistic catalogue loaded.

The report comes back. Throughput plateaus at 480 requests per second. Error rate stays near zero until about 500 users, then climbs to 4% as checkout calls begin timing out. The p95 response time on the checkout step jumps from 900 milliseconds to six seconds at peak. None of that showed up at 200 users. You have found a real limit, with a week to fix it, instead of meeting it live during the sale.
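Reading a report like this comes down to finding the first load level where a target is missed. The sketch below shows one way to do that. The `Step` type and `first_breach` function are hypothetical, the targets (2,000 ms p95, 1% errors) are assumptions, and the intermediate data points are invented for illustration; only the 200-user and 600-user figures echo the example above.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    users: int         # concurrent virtual users at this point in the ramp
    p95_ms: float      # p95 response time of the checkout step
    error_rate: float  # share of failed requests, 0.04 means 4%


def first_breach(steps, max_p95_ms: float, max_error_rate: float) -> Optional[Step]:
    """Return the lowest-load step that misses either target, or None."""
    for step in sorted(steps, key=lambda s: s.users):
        if step.p95_ms > max_p95_ms or step.error_rate > max_error_rate:
            return step
    return None


# Illustrative measurements; the 400 and 500 user rows are made up.
steps = [
    Step(users=200, p95_ms=900, error_rate=0.0),
    Step(users=400, p95_ms=1100, error_rate=0.001),
    Step(users=500, p95_ms=2500, error_rate=0.01),
    Step(users=600, p95_ms=6000, error_rate=0.04),
]

limit = first_breach(steps, max_p95_ms=2000, max_error_rate=0.01)
print(limit.users if limit else "no breach")  # 500
```

With these assumed targets the system holds up to 400 users and misses the p95 target at 500, which is the number to take back to the team that owns checkout.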

What performance tests miss: server response vs real user experience

Most performance tools stop at the server’s reply. They can report that the backend answered in 50 milliseconds, but they never render the page, run your JavaScript, or load your third-party tags, so the experience that drives every conversion number above is the part they never measure.

Figure: A single page load on a timeline. An HTTP-level test stops at the server's first byte (TTFB), while the part users actually wait for, Largest Contentful Paint and Interaction to Next Paint, unfolds later in the browser.

The split is architectural. Protocol-level tools like k6 and JMeter generate HTTP requests directly from a script or test plan, which makes them fast, cheap, and ideal for APIs and very high concurrency. What they do not do is open a browser, so they report server response time rather than Core Web Vitals, Google’s metrics for loading, interactivity, and visual stability. (k6 ships a browser mode that can read Web Vitals, but that is a separate mode, not its core, and it keeps no per-session forensics.) A 50 millisecond reply counts for little if the Largest Contentful Paint lands four seconds later because a few marketing tags are blocking the main thread.

And the gap is wide in the field. The 2025 Web Almanac put the median mobile Total Blocking Time at 1,916 milliseconds, up 58% in a year, and found only 77% of mobile sites scoring well on Interaction to Next Paint. Nearly all of that is browser-side work a request-level test is structurally blind to.

Real-browser performance testing closes the gap by giving every virtual user an actual browser, so the numbers come from the same rendering path a customer’s machine would run. This is the approach Evaluat takes: each virtual user gets its own isolated browser, and every report captures Core Web Vitals, session video, network logs, and console output per user. For pure API tests or extreme concurrency on a tight budget, protocol tools are still the better fit; for user-facing journeys, the experience only shows up in a browser. Our guides to the three load-testing models and to measuring Web Vitals under load go deeper.

When should QA run performance tests?

Run performance tests at three moments: before known traffic events, as a gate in the release pipeline, and on a schedule to catch drift. Performance testing is most valuable when it is routine, not a scramble the week before a launch. The earlier a regression is caught, the cheaper it is to fix.

Before a launch, sale, or campaign, a capacity rehearsal tells you whether the system survives the traffic you are about to invite. In continuous integration, a smaller smoke test can gate releases: the build fails when a key page busts its performance budget, the same way it fails on a broken unit test. On a schedule, a steady low-concurrency run from your users’ regions baselines performance so you notice slow degradation before users do. Build a scenario once, use it everywhere, and the cost of running it often drops close to zero.
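A release gate of this kind can be a few lines of code. The sketch below assumes the smoke test has already produced a p95 per page in milliseconds; the page names, the budget values, and the `check_budgets` and `main` helpers are all hypothetical.

```python
import sys

# Hypothetical p95 budgets in milliseconds, one per key page.
BUDGETS = {"checkout": 2000, "search": 1000}


def check_budgets(measured_p95_ms: dict, budgets: dict) -> list:
    """Return one message per page that is missing or over its budget."""
    violations = []
    for page, limit in budgets.items():
        actual = measured_p95_ms.get(page)
        if actual is None:
            violations.append(f"{page}: no measurement")
        elif actual > limit:
            violations.append(f"{page}: p95 {actual} ms exceeds budget {limit} ms")
    return violations


def main(measured_p95_ms: dict) -> int:
    """Print violations to stderr and return a process exit code."""
    violations = check_budgets(measured_p95_ms, BUDGETS)
    for line in violations:
        print(line, file=sys.stderr)
    return 1 if violations else 0
```

In a pipeline, a script would end with `sys.exit(main(measured))`, so a non-zero exit code fails the build the same way a broken unit test does. Treating a missing measurement as a failure is a design choice: a gate that passes silently when the test did not run is not much of a gate.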

Common performance testing mistakes

The mistakes that waste performance tests are mostly about realism. A technically clean run against an unrealistic setup gives confident, wrong answers. Watch for these five.

Testing an unrealistic environment. An undersized box with an empty database will not behave like production. Match infrastructure and data volume.
Using averages instead of percentiles. The average hides the slow tail where users actually suffer. Read p95 and p99.
Modeling only the happy path. Real users log in, search, abandon, and retry. A scenario that loads a single page tests almost nothing.
Measuring servers, not users. Request-level timings miss rendering, JavaScript, and third-party tags. If the experience matters, test in a browser.
Testing once, before launch. Performance regresses with every deploy. A one-off test is stale within a sprint.

Start testing under real traffic

Performance testing is how QA teams replace hope with evidence. You pick the type that matches the risk, model real journeys, send realistic traffic, read the percentiles, and fix what breaks before a customer finds it. Start with load testing at your measured peak, gate releases on a budget, and test the experience your users actually get, not just the response your servers send. The teams that do this well treat performance as a standing release gate, not a fire drill the week of a launch.

Evaluat runs every virtual user in a real browser and captures Core Web Vitals, session video, and network and console logs for each one, so when something breaks at peak you can open the session and watch the moment it happened.


FAQ

What is the difference between performance testing and load testing?

Performance testing is the umbrella term; load testing is one type within it. Load testing checks behavior at expected peak traffic, while performance testing covers every way a system can be put under pressure, including stress, spike, and soak testing. People use the terms interchangeably, but load testing is a subset.

What are the main types of performance testing?

The common types are load testing (expected peak), stress testing (past the breaking point), spike testing (a sudden surge), soak or endurance testing (sustained load over hours), scalability testing (performance as you add resources), and volume testing (large data volumes). You pick the type by the risk you are testing for.

How is performance testing different from functional testing?

Functional testing asks whether a feature works correctly for one user. Performance testing asks how well the system works when many people use it at once. A checkout flow can pass every functional test and still time out under load, so teams run both.

What metrics matter most in performance testing?

Response time, throughput, and error rate are the core three. Read response time as percentiles rather than averages: p95 and p99 show what your slowest users experience, which is where systems actually hurt. For user-facing pages, add Core Web Vitals, since server response time alone does not capture what people see in the browser.

When should QA run performance tests?

At three moments: before known traffic events like launches and sales, as a gate inside the release pipeline, and on a schedule to catch slow drift. Performance regresses with almost every deploy, so a one-off test before launch goes stale within a sprint. Routine testing catches regressions while they are cheap to fix.

Is performance testing manual or automated?

Performance testing is automated by nature. A tool generates and runs the virtual users for you, because no team can coordinate hundreds or thousands of real people clicking at once. The manual part is design and interpretation: defining a realistic scenario, choosing the load profile, and reading the results against a target.

Do you need real browsers for performance testing?

It depends on what you are testing. Protocol-level tools that send HTTP requests work well for APIs and backend services. But to measure what users actually experience, including Core Web Vitals, the test has to run in a real browser, because rendering, JavaScript, and third-party tags only happen there.
