What is performance testing?
Performance testing is a type of software testing that measures how a system behaves under load: how fast it responds, how stable it stays, and how well it scales as traffic grows. Instead of checking whether a feature works, it checks how well the system works when many people use it at the same time.
That distinction matters. Functional testing asks one question: does the feature do what it should. Performance testing asks a different one: does the system stay fast and stable when real traffic arrives. A checkout flow can pass every functional test and still fall over when a thousand people check out at once.
The traffic in a performance test is not real people. It is virtual users: scripted or recorded sessions that each behave like one visitor moving through your app. A tool runs hundreds or thousands of them at once, against an environment you control, so you find the limits on purpose, in a test, instead of in production where a failure costs customers.
Performance testing vs load testing
Performance testing is the umbrella; load testing is one type within it. Performance testing covers every way a system can be put under pressure, while load testing checks one specific condition: expected peak traffic. People often use the two terms interchangeably, but they are not synonyms.
The reason the confusion is so common is that load testing is the type most teams start with. It answers the everyday question, “will we survive the traffic we actually expect,” and it is the baseline the other types build on. If you only ever load test, you learn how the system behaves on a good day. The other types of performance testing tell you what happens on a bad one: a surge you did not plan for, a slow leak over a long weekend, a database that has quietly doubled in size.
The types of performance testing
Performance testing splits into a handful of types, each asking a different question about the system under pressure. The most common are load, stress, spike, soak, scalability, and volume testing. You choose the type by the risk you are testing for, and mature teams use several together.
| Type | What it checks | When to use it |
|---|---|---|
| Load testing | Behavior at expected peak traffic | Validate a release or rehearse for a known event |
| Stress testing | Where and how the system breaks past its limit | Find the ceiling and the failure mode |
| Spike testing | Survival of a sudden traffic surge | Flash sales, viral posts, breaking news |
| Soak (endurance) testing | Slow degradation over hours, such as memory leaks | Long-running stability |
| Scalability testing | How performance changes as you add resources | Capacity planning |
| Volume testing | Behavior with large data volumes, not just users | Database and storage growth |
Start with load testing, then add the others as risk demands. A retailer expecting a sale needs spike testing; a service that runs for days needs soak testing; a team adding servers needs scalability testing. They are not interchangeable, and each catches a failure the others miss.
Why performance testing matters for QA teams
Performance testing matters because speed and stability are features your users feel, and most performance failures only appear under load. A page that is fast for one tester can crawl for a thousand real users, and that gap costs conversions, trust, and sometimes the whole sale.
The numbers are blunt. A 2020 study by Google and Deloitte, Milliseconds Make Millions, found that a 0.1 second improvement in mobile site speed lifted retail conversions by 8.4% and travel conversions by 10.1%. Portent’s 2022 analysis of more than 100 million page views found that pages loading in one second convert at 3.05%, falling to 1.12% by three seconds. Google and SOASTA’s 2017 benchmarks showed the probability of a bounce rises 32% as load time grows from one to three seconds.
Most sites are not keeping up. The HTTP Archive’s 2025 Web Almanac found only 48% of mobile sites pass Core Web Vitals. The cost of a slow page lands hardest at peak traffic, which is exactly when you most want visitors to convert. And a failure at that moment has a face: a real person who tried to buy and could not.
How does performance testing work?
A performance test follows four steps: model a real user journey, decide how much traffic to send and how fast, run it against a production-like environment, then read the results against a target. Each step is a decision, and the test is only as honest as those decisions.
Design the scenario
A scenario is the journey one virtual user performs, the sequence of steps a real visitor would take: search, open a product, add to basket, check out. The closer the scenario is to genuine behavior, the more honest the test. A test of your cheapest page tells you almost nothing about your checkout.
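One way to picture a scenario is as an ordered list of steps, each with a pause that mimics a visitor reading the page. The sketch below is illustrative only: the `Step` type, the URL paths, and the think times are assumptions made up for this example, not the format of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str            # label that would appear in a report
    path: str            # hypothetical URL path, not a real endpoint
    think_time_s: float  # pause after the step, mimicking a real visitor


# Hypothetical checkout journey following the steps named above.
CHECKOUT_JOURNEY = [
    Step("search", "/search?q=shoes", 3.0),
    Step("open product", "/product/123", 5.0),
    Step("add to basket", "/basket/add", 2.0),
    Step("check out", "/checkout", 0.0),
]


def journey_think_time(steps):
    """Total scripted pause in one pass through the journey, in seconds."""
    return sum(step.think_time_s for step in steps)
```

Keeping the journey as data makes it easy to review against analytics: if real visitors rarely go straight from search to checkout, the list should not either.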
Set the load profile
The load profile is how many virtual users you run and how you introduce them. Concurrency is the number active at the same time. Ramp is how quickly you reach that number. Ramping 1,000 users over ten minutes tests steady growth; dropping all 1,000 at once tests a spike. Pick the shape that matches the event you are worried about.
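The difference between a ramp and a spike can be sketched as a function that says how many virtual users should be active at a given moment. The function name and the linear ramp shape are assumptions for illustration; real tools offer their own ways to describe stages.

```python
def target_users(elapsed_s: float, peak_users: int, ramp_s: float) -> int:
    """Virtual users that should be active `elapsed_s` seconds into the test.

    ramp_s > 0 gives a linear ramp up to peak_users; ramp_s == 0 gives a spike.
    """
    if elapsed_s < 0:
        return 0
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return peak_users
    return int(peak_users * elapsed_s / ramp_s)


checkpoints = (0, 150, 300, 600)  # seconds into the test

# Steady growth: 1,000 users introduced over ten minutes.
ramp = [target_users(t, 1000, 600) for t in checkpoints]

# Spike: all 1,000 users dropped in at once.
spike = [target_users(t, 1000, 0) for t in checkpoints]
```

Here `ramp` grows through 0, 250, 500, and 1,000 users, while `spike` sits at 1,000 from the first second. Same peak, very different stress on caches, connection pools, and autoscaling.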
Run the test
Run the scenario against an environment that resembles production: the same infrastructure size, the same data volume, ideally the same regions your users come from. Testing against an empty database on an undersized box produces numbers that feel reassuring and mean nothing.
Read the metrics
A performance test produces a handful of core metrics. Response time is how long a request takes. Throughput is how many requests the system handles per second. Error rate is the share of requests that fail. Because averages hide pain, read response time as percentiles: the p95 is the response time that 95% of requests met or beat, while the slowest 5% took longer. The p95 and p99 are where real users feel the system struggle, so QA teams budget against them, not against the average.
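A small sketch shows why the average misleads. It uses the nearest-rank method for percentiles, which is one common convention among several, and the sample data is invented: 90 fast requests and 10 slow ones.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms, failures, window_s):
    """durations_ms: one response time per request; failures: number of failed requests."""
    total = len(durations_ms)
    return {
        "throughput_rps": total / window_s,
        "error_rate": failures / total,
        "mean_ms": sum(durations_ms) / total,
        "p95_ms": percentile(durations_ms, 95),
        "p99_ms": percentile(durations_ms, 99),
    }


# Invented data: 90 requests at 100 ms and 10 requests at 3,000 ms over 10 seconds.
durations = [100] * 90 + [3000] * 10
report = summarize(durations, failures=2, window_s=10)
```

The mean comes out at 390 ms, which looks healthy, while the p95 is 3,000 ms. One visitor in ten waited three seconds, and only the percentile shows it.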
Putting it together
A concrete example makes the four steps click. Say your analytics show a typical peak of 200 concurrent users, and a sale next week is expected to triple that. You build a scenario that logs in, searches, opens a product, and checks out. You set the load profile to ramp from zero to 600 virtual users over five minutes, then hold for fifteen, running from the regions most of your customers sit in. You point it at a staging environment sized like production, with a realistic catalogue loaded.
The report comes back. Throughput plateaus at 480 requests per second. Error rate stays near zero until about 500 users, then climbs to 4% as checkout calls begin timing out. The p95 response time on the checkout step jumps from 900 milliseconds to six seconds at peak. None of that showed up at 200 users. You have found a real limit, with a week to fix it, instead of meeting it live during the sale.
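Reading a report like this against a target amounts to finding the lowest concurrency at which a threshold is crossed. The sketch below assumes hypothetical targets (1% errors, a 2,000 ms p95) and partly invented readings: the 900 ms and 6,000 ms p95 values and the 4% error rate echo the example, the intermediate rows are made up to fill the curve.

```python
from typing import NamedTuple, Optional


class LoadStep(NamedTuple):
    users: int
    error_rate: float  # fraction of failed requests, 0.04 == 4%
    p95_ms: float      # p95 response time of the checkout step


def first_breach(steps, max_error_rate, max_p95_ms) -> Optional[LoadStep]:
    """Return the lowest-concurrency step that exceeds either target, or None."""
    for step in sorted(steps, key=lambda s: s.users):
        if step.error_rate > max_error_rate or step.p95_ms > max_p95_ms:
            return step
    return None


# Illustrative readings; the 400- and 500-user rows are invented.
readings = [
    LoadStep(200, 0.00, 900),
    LoadStep(400, 0.00, 1200),
    LoadStep(500, 0.01, 2500),
    LoadStep(600, 0.04, 6000),
]

# Hypothetical targets: at most 1% errors and a 2,000 ms p95 on checkout.
breach = first_breach(readings, max_error_rate=0.01, max_p95_ms=2000)
```

With these numbers the first breach is at 500 users, on latency rather than errors, which is the kind of detail that tells you where to start looking.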
What performance tests miss: server response vs real user experience
Most performance tools stop at the server’s reply. They can report that the backend answered in 50 milliseconds, but they never render the page, run your JavaScript, or load your third-party tags, so the experience that drives every conversion number above is the part they never measure.
The split is architectural. Protocol-level tools like k6 and JMeter issue HTTP requests directly from code, which makes them fast, cheap, and ideal for APIs and very high concurrency. What they do not do is open a browser, so they report server response time rather than Core Web Vitals, Google’s metrics for loading, interactivity, and visual stability. (k6 ships a browser mode that can read Web Vitals, but that is a separate mode, not its core, and it keeps no per-session forensics.) A 50 millisecond reply counts for little if the Largest Contentful Paint lands four seconds later because a few marketing tags are blocking the main thread.
And the gap is wide in the field. The 2025 Web Almanac put the median mobile Total Blocking Time at 1,916 milliseconds, up 58% in a year, and found only 77% of mobile sites scoring well on Interaction to Next Paint. Nearly all of that is browser-side work a request-level test is structurally blind to.
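To see what "request-level" means in practice, here is a minimal sketch that times one HTTP request against a throwaway local server, using only Python's standard library. The handler, the page body, and the `/tag.js` script are invented for the example; it is not how k6 or JMeter are implemented, only an illustration of what a protocol-level measurement captures.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class Page(BaseHTTPRequestHandler):
    """Throwaway local server standing in for the system under test."""

    def do_GET(self):
        # The page references a script, but a protocol-level client never runs it.
        body = b"<html><script src='/tag.js'></script><h1>Checkout</h1></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def timed_get(host, port, path="/"):
    """Return (status, body, elapsed_ms) for one request, measured at the HTTP level."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        status = response.status
    finally:
        conn.close()
    return status, body, (time.perf_counter() - start) * 1000


server = HTTPServer(("127.0.0.1", 0), Page)  # port 0 lets the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
status, body, elapsed_ms = timed_get("127.0.0.1", server.server_address[1])
server.shutdown()
server.server_close()
```

The timer stops once the HTML bytes arrive. The script tag in the body is just text to this client: nothing fetches it, parses it, or paints the page, so none of that time appears in `elapsed_ms`.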
Real-browser performance testing closes the gap by giving every virtual user an actual browser, so the numbers come from the same rendering path a customer’s machine would run. This is the approach Evaluat takes: each virtual user gets its own isolated browser, and every report captures Core Web Vitals, session video, network logs, and console output per user. For pure API tests or extreme concurrency on a tight budget, protocol tools are still the better fit; for user-facing journeys, the experience only shows up in a browser. The guides on the three load-testing models and on measuring Web Vitals under load go deeper.
When should QA run performance tests?
Run performance tests at three moments: before known traffic events, as a gate in the release pipeline, and on a schedule to catch drift. Performance testing is most valuable when it is routine, not a scramble the week before a launch. The earlier a regression is caught, the cheaper it is to fix.
Before a launch, sale, or campaign, a capacity rehearsal tells you whether the system survives the traffic you are about to invite. In continuous integration, a smaller smoke test can gate releases: the build fails when a key page busts its performance budget, the same way it fails on a broken unit test. On a schedule, a steady low-concurrency run from your users’ regions baselines performance so you notice slow degradation before users do. Build a scenario once, use it everywhere, and the cost of running it often drops close to zero.
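A release gate of this kind can be as small as a function that compares a test result to a budget and returns a process exit code. The budget values, metric names, and result dictionary below are hypothetical; in a real pipeline the result would be parsed from whatever report your test tool writes.

```python
import sys

# Hypothetical budgets; derive real ones from your own baseline runs.
BUDGET = {"p95_ms": 1500, "error_rate": 0.01}


def budget_violations(result: dict, budget: dict) -> list:
    """Return one message for every metric in `result` that exceeds its budget."""
    return [
        f"{metric}: {result[metric]} exceeds budget {limit}"
        for metric, limit in budget.items()
        if result[metric] > limit
    ]


def gate(result: dict, budget: dict = BUDGET) -> int:
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    violations = budget_violations(result, budget)
    for line in violations:
        print(line, file=sys.stderr)
    return 1 if violations else 0


# Made-up smoke-test result; a CI step would pass the return value to sys.exit().
exit_code = gate({"p95_ms": 1320, "error_rate": 0.002})
```

Because most CI systems treat a non-zero exit code as a failed step, this is enough to stop a release. The same function can run against the scheduled baseline, with a tighter budget, to flag slow drift.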
Common performance testing mistakes
The mistakes that waste performance tests are mostly about realism. A technically clean run against an unrealistic setup gives confident, wrong answers. Watch for these five.
- Testing an unrealistic environment. An undersized box with an empty database will not behave like production. Match infrastructure and data volume.
- Using averages instead of percentiles. The average hides the slow tail where users actually suffer. Read p95 and p99.
- Modeling only the happy path. Real users log in, search, abandon, and retry. A scenario that loads a single page tests almost nothing.
- Measuring servers, not users. Request-level timings miss rendering, JavaScript, and third-party tags. If the experience matters, test in a browser.
- Testing once, before launch. Performance regresses with every deploy. A one-off test is stale within a sprint.
Start testing under real traffic
Performance testing is how QA teams replace hope with evidence. You pick the type that matches the risk, model real journeys, send realistic traffic, read the percentiles, and fix what breaks before a customer finds it. Start with load testing at your measured peak, gate releases on a budget, and test the experience your users actually get, not just the response your servers send. The teams that do this well treat performance as a standing release gate, not a fire drill the week of a launch.
Evaluat runs every virtual user in a real browser and captures Core Web Vitals, session video, and network and console logs for each one, so when something breaks at peak you can open the session and watch the moment it happened.