Why average response time misleads you: reading p95 and p99

Your dashboard says average response time is 420 milliseconds. Half your users see 100 milliseconds, one in a hundred waits over five seconds, and the average describes none of them. p95 and p99 read response time from the slow end, where the failures you run a performance test to find actually live.

Written by: Evaluat Staff

[Figure: A right-skewed histogram of response times. Most requests cluster fast around a 100-millisecond median, with a long tail of slow ones stretching right. The mean, about 420 milliseconds, sits on the sparse downslope where almost no requests land, while p95 at 2,800 milliseconds and p99 at 5,200 milliseconds sit far out in the tail. The average describes no actual user.]

What are p95 and p99 response times?

The p95 response time is the value that 95% of your requests come in at or under, so only the slowest 5% were worse. The p99 is the same line drawn at 99%. Reading p95 and p99 instead of the average is the difference between a dashboard that looks calm and one that tells the truth.

A percentile is just a ranking. Line up every request from fastest to slowest, and the 95th percentile is the one standing at the 95% mark: 95 in a hundred were at least that quick, and the worst five were slower. The p50, or median, is the request in the middle, so it describes the typical visit. The mean, the arithmetic average, adds every response time and divides by the count. It feels like the obvious summary, and it is the one that hides the most.

The distinction is not academic. p95 and p99 are where regressions and lost conversions live, because they describe the experience of your unluckiest users rather than an imagined average one. For where these percentiles sit among the other numbers a run produces, see the eight metrics every report should include, part of our complete guide to performance testing. This guide is about one of them, and why it is the metric beginners most often misread.

Why does the average response time mislead you?

The average misleads because a handful of slow requests barely move it, while a single huge outlier can drag it past the typical experience. The mean is a blend of fast and slow that matches no actual user. It can understate your worst sessions and overstate your normal ones, sometimes in the same report, and the number alone never tells you which.

Picture 100 requests. Ninety return in 100 milliseconds. The other ten are slow: 1,500, 1,800, 2,000, 2,400, 2,800, 3,200, 3,800, 4,500, 5,200, and 6,000 milliseconds. The median is 100 milliseconds, because the middle request sits squarely in the fast group. The mean works out to about 420 milliseconds. The p95 is 2,800 milliseconds, and the p99 is 5,200 milliseconds.
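To check those four figures yourself, here is a minimal Python sketch. It assumes the nearest-rank definition of a percentile (sort, then take the value at the rounded-up rank). The `percentile` helper and the `timings_ms` name are illustrative, not part of any tool. Some libraries interpolate between neighbouring values by default and will report slightly different numbers.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: the value at the rounded-up pct% position."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# The 100 requests from the example: 90 fast, 10 slow (milliseconds).
slow = [1500, 1800, 2000, 2400, 2800, 3200, 3800, 4500, 5200, 6000]
timings_ms = [100] * 90 + slow

summary = {
    "median": median(timings_ms),
    "mean": mean(timings_ms),
    "p95": percentile(timings_ms, 95),
    "p99": percentile(timings_ms, 99),
}
print(summary)
```

Run on the 100 requests above, this gives a median of 100, a mean of 422, a p95 of 2,800, and a p99 of 5,200 milliseconds.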

Read those four numbers side by side. The median says the typical visit is quick. The average, at 420 milliseconds, sounds slightly slow but survivable, and it describes none of the hundred requests: nobody actually waited 420 milliseconds. One visitor in twenty waited longer than 2.8 seconds. One in a hundred waited longer than 5.2. The average smoothed a real, painful tail into a figure that looks fine on a slide.

The mean can mislead the other way too. Take ninety-nine requests at 10 milliseconds and one that hangs for 100 seconds. The mean is over a second, yet ninety-nine percent of requests were effectively instant. Now the average makes a healthy system look broken. The lesson is not that the mean is always optimistic or always pessimistic. It is that the mean is unpredictable, because a few values can pull it anywhere.
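The same check on this second data set, again as a sketch with illustrative variable names, shows the mean landing far from what nearly everyone experienced.

```python
from statistics import mean, median

# 99 requests at 10 ms and one that hangs for 100 seconds (100,000 ms).
timings_ms = [10] * 99 + [100_000]

avg_ms = mean(timings_ms)        # dragged up by the single outlier
typical_ms = median(timings_ms)  # what the middle request saw
share_fast = sum(t <= 10 for t in timings_ms) / len(timings_ms)

print(f"mean={avg_ms} ms, median={typical_ms} ms, fast share={share_fast:.0%}")
```

The mean comes out just over 1,000 milliseconds while the median stays at 10, which is the mismatch the paragraph describes.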

The median is a safer default than the mean, but it has the opposite blind spot: it tells you the typical visit and nothing about the tail. That is why you read the median alongside p95 and p99, not instead of them.

That tail is where the money leaves. Akamai and SOASTA’s 2017 State of Online Retail Performance report found that a 100-millisecond delay in load time cut conversion rates by 7%, a two-second delay raised bounce rates by 103%, and 53% of mobile visitors abandoned a page that took longer than three seconds. The requests that breach those thresholds are exactly the ones sitting in your p95 and p99, and exactly the ones the average buries.

How percentiles read the tail, and why the tail grows at scale

Percentiles read the tail by ranking requests, not averaging them, so the slowest responses keep their own line instead of dissolving into a mean. To find the p95, you sort every response time and take the value at the 95% position. At scale, that tail matters far more than its size suggests.

There is no averaging involved, which is the entire point. A typical latency distribution is right-skewed: a tall cluster of fast responses near the left, and a long thin tail of slow ones stretching to the right. The mean gets pulled toward that tail and lands somewhere on the sparse downslope, where almost no requests actually are. The median stays at the dense peak, and p95 and p99 stay anchored to real requests out in the tail. Each percentile points at an experience a real user had.

Percentiles also need enough data to be trustworthy. Measure only twenty requests and your p99 is effectively a single sample that one fluke can move. Collect thousands of measurements before you read the far tail, and keep the raw timings, because a percentile cannot be recovered from an average after the fact.
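A rough way to see why small samples cannot support a far-tail percentile: under the nearest-rank definition (an assumption here, since tools differ), count how many samples sit beyond the p99 position. The function names are illustrative.

```python
import math


def nearest_rank(n_samples, pct):
    """1-based position of the pct-th percentile among n sorted samples."""
    return math.ceil(pct * n_samples / 100)


def samples_beyond(n_samples, pct):
    """How many samples are slower than the percentile position."""
    return n_samples - nearest_rank(n_samples, pct)


for n in (20, 100, 1000, 10000):
    print(n, nearest_rank(n, 99), samples_beyond(n, 99))
```

With 20 samples the p99 position is the 20th of 20, so the p99 is simply the slowest request. With 10,000 samples there are 100 measurements beyond it, and one fluke no longer decides the number.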

The tail at scale

Here is the part that surprises people new to performance testing: a slow tail does not stay a small problem, it compounds with scale. In their 2013 paper The Tail at Scale, Google’s Jeffrey Dean and Luiz André Barroso give the canonical example. Suppose one server responds in 10 milliseconds most of the time, but its 99th-percentile latency is one second, so one request in a hundred takes a full second. That sounds harmless.

Now suppose rendering a single page requires calling 100 such servers in parallel and waiting for all of them. The chance that all 100 come back fast is 0.99 to the 100th power, about 37%. Which means roughly 63% of your page loads wait a full second on at least one straggler. A one-in-a-hundred event at the server became a nearly two-in-three event for the user.
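The arithmetic can be sketched in a few lines of Python. It assumes, as the example does, that each server is slow independently of the others. `p_any_slow` is a hypothetical helper name.

```python
def p_any_slow(p_fast, fanout):
    """Chance that at least one of `fanout` parallel calls hits the slow tail,
    assuming each call is fast with probability p_fast, independently."""
    return 1 - p_fast ** fanout


# Each server is fast 99% of the time; vary how many a page waits on.
for fanout in (1, 10, 100):
    print(fanout, round(p_any_slow(0.99, fanout), 3))
```

At a fan-out of 1 the chance of a slow page is 1%. At 100 it is about 63%, the figure from the paper's example.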

Modern pages fan out like this constantly: microservices, ad calls, third-party tags, database shards. The more pieces a request touches, the more the slow tail of each piece becomes the common experience of the whole. That is why a 1% tail is never a rounding error, and why teams that run reliable systems watch p99, not the average.

Should you target p95 or p99?

Target the percentile that matches the cost of a slow request. For most user-facing pages, p95 is the working number: it covers everyone except your unluckiest one in twenty. Reach for p99 or higher when a single slow request is expensive, as in payments or checkout, where one timeout is a lost sale. Read them together, never one in isolation.

| Percentile | What it tells you | When to lead with it |
| --- | --- | --- |
| p50 (median) | The typical visit | Sanity-checking the normal case |
| p95 | The common bad case, one in twenty | Most user-facing budgets and SLOs |
| p99 | The rare bad case, one in a hundred | Checkout, payments, anything where one slow request loses a sale |
| p99.9 | The extreme tail, one in a thousand | Only at high scale or on critical paths, where it is costly to chase |

A fast way to read these together is the gap between them. When p99 sits close to the median, the system is consistent and the average is roughly honest. When p99 is five or ten times the median, you have a tail problem: most users are fine and a meaningful minority is not, and the average is hiding them. The ratio, not the raw number, is often the quickest signal that something is wrong.
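One way to turn that reading into a check, sketched under assumptions: the nearest-rank `percentile` helper, the `tail_ratio` name, and the threshold of 5 are all illustrative. The threshold is taken from the "five or ten times" rule of thumb above, not from any standard.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at the rounded-up pct% position."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tail_ratio(timings, pct=99):
    """How many times slower the pct-th percentile is than the median."""
    return percentile(timings, pct) / percentile(timings, 50)


TAIL_PROBLEM_RATIO = 5  # assumed cut-off, from the rule of thumb in the text

# Hypothetical runs, in milliseconds.
steady = [100] * 95 + [120] * 5
skewed = [100] * 90 + [1500, 1800, 2000, 2400, 2800,
                       3200, 3800, 4500, 5200, 6000]

for name, run in (("steady", steady), ("skewed", skewed)):
    ratio = tail_ratio(run)
    print(name, ratio, "tail problem" if ratio >= TAIL_PROBLEM_RATIO else "consistent")
```

The steady run has a ratio of 1.2. The skewed run, the same 100 requests used earlier in this guide, has a ratio of 52.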

Two cautions sit alongside the table. First, you cannot average percentiles. The p95 of two servers is not the average of their two p95 values, because a percentile is a position in a distribution, not a quantity you can add up. Combine the raw measurements, then compute the percentile once. Second, picking a target is a service-level decision: Google’s site reliability practice frames it as choosing an objective on the distribution, such as 99% of requests under 300 milliseconds, then tracking how much of your error budget the slow tail spends.
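Both cautions can be sketched with a small invented data set. Everything here is an assumption for illustration: the two servers and their timings, the nearest-rank `percentile` helper, and the way the error budget is counted (the share of requests allowed to miss the threshold).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at the rounded-up pct% position."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical timings in milliseconds for two servers behind one service.
server_a = [100] * 100              # consistently fast
server_b = [100] * 92 + [2000] * 8  # mostly fast, with a slow tail

# Wrong: average the per-server p95 values.
averaged_p95 = (percentile(server_a, 95) + percentile(server_b, 95)) / 2

# Right: pool the raw measurements, then take the percentile once.
combined = server_a + server_b
combined_p95 = percentile(combined, 95)

# Example objective from the text: 99% of requests under 300 ms.
TARGET_PCT = 99
THRESHOLD_MS = 300
slow = sum(t >= THRESHOLD_MS for t in combined)
allowed_slow = len(combined) * (100 - TARGET_PCT) / 100
budget_spent = slow / allowed_slow  # 1.0 means the budget is exactly used up

print(averaged_p95, combined_p95, slow, allowed_slow, budget_spent)
```

Here the averaged figure is 1,050 milliseconds, a response time no request had, while the pooled p95 is 100. Against the example objective, 8 slow requests against an allowance of 2 means the error budget is spent four times over.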

It is worth keeping response-time percentiles separate from one number they are often confused with. The 75th percentile you see quoted for Core Web Vitals is a field experience metric judged across real visits, not the p95 or p99 of request latency. They are different measurements. The shared idea is the one to hold onto: the industry grades performance at a percentile, not a mean. In 2025 only 48% of mobile sites passed Core Web Vitals, and the threshold that decides a pass is assessed at the 75th percentile of real visits.

Common mistakes when reading p95 and p99

Most percentile mistakes come from trusting a number without checking how it was made. The average slips back into reports, percentiles get averaged together, and load tests quietly understate the very tail they were run to find. Four come up often enough to name, and each has a fix.

  • Reporting the mean anyway. A single average is a figure a few requests can drag either way, and it hides the slow tail where users suffer. Read p50, p95, and p99 together against your target, and set thresholds at p95 or p99.
  • Trusting your load test’s p99. This is the subtle one. Many load generators suffer from coordinated omission, a measurement flaw named by Gil Tene and explained well by ScyllaDB. When the system under test stalls, the tool waits on its in-flight request and stops sending new ones, so the requests that would have been slowest are never issued. They vanish from the distribution, and your p99 looks healthy. In one documented benchmark, a load test reported a p99 of 47 milliseconds while the same release in production showed 1.8 seconds, a 38-fold gap that the test never registered. Use a tool that corrects for it, and validate against what a real user actually experienced.
  • Averaging percentiles. Rolling up the p95 of ten servers into one mean produces a number that means nothing. Aggregate the raw data, then take the percentile once.
  • Chasing p99.9 before p95 is healthy. The far tail is expensive to chase. If your p95 is still red, optimizing the one-in-a-thousand request is effort in the wrong place. Fix the common bad case first, then decide how far into the tail your product needs to go.
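Coordinated omission is easier to see in a toy simulation than in prose. The sketch below is not how any particular load generator works. It assumes a single connection, an intended pace of one request every 10 milliseconds, a 1-millisecond service time, and one invented one-second stall. It records each request two ways: from the moment it was actually sent, and from the moment it was supposed to be sent.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at the rounded-up pct% position."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


INTERVAL_MS = 10   # intended pacing: one request every 10 ms (assumed)
N_REQUESTS = 1000

service_ms = [1] * N_REQUESTS
service_ms[500] = 1000  # a single one-second stall (assumed)

naive = []      # latency measured from the actual send time
corrected = []  # latency measured from the intended send time
free_at = 0     # when the single connection is next available

for i, service in enumerate(service_ms):
    intended = i * INTERVAL_MS
    sent = max(intended, free_at)  # the tool waits for the in-flight request
    done = sent + service
    naive.append(done - sent)
    corrected.append(done - intended)
    free_at = done

naive_p99 = percentile(naive, 99)
corrected_p99 = percentile(corrected, 99)
print(naive_p99, corrected_p99)
```

In this run the naive p99 is 1 millisecond, because only one request in a thousand was measured as slow. Measured from the intended send time, the p99 is 910 milliseconds: the stall delayed more than a hundred requests that the naive timing recorded as fast.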

There is a fifth mistake that the others lead to: treating the percentile as the diagnosis. A percentile tells you that the tail is slow and roughly how slow. It does not tell you which user, which request, or why. That is a different job, and it is where a number stops being enough.

How Evaluat reads the tail

Evaluat is a real-browser performance testing platform that reports the full distribution, p50 through p99, alongside a configurable Apdex score, so you never have to read the tail through an average. What it adds is the step after the percentile: the session behind it.

Every virtual user is a real browser, and every session is recorded. So when your p99 spikes, you do not stop at the number. You open the specific slow session and watch what happened: the network waterfall, the console errors, the video of the page stalling. The percentile tells you the tail is bad; the session tells you why. A failure at peak isn’t a percentile. It’s a session.

This is a deliberate boundary. Running a real browser for every virtual user is heavier than pacing raw HTTP requests, so for a pure API load test or an extreme requests-per-second target, a protocol-level tool like k6 or JMeter is the lighter fit, and they report p95 latency perfectly well. What they cannot show is the rendered experience, or the session behind the slow request. When the question is what your users actually felt at peak, that takes a real browser.

Read the percentile, then open the session

Reading p95 and p99 response time is how you stop the average from lying to you. The mean describes no one. The median tells you the typical visit, and the tail tells you who is suffering and by how much. Read the percentiles together, set a target that matches the cost of a slow request, check that your measurement is not quietly omitting the tail, and when a percentile turns red, open the session behind it. The number tells you where to look. The session tells you what to fix.

Test in real browsers. Debug in real sessions. Book a demo.

Common questions


What is the difference between p95 and p99 response time?

The p95 is the response time 95% of requests came in under; the p99 is the response time 99% came in under. p95 covers all but your slowest one in twenty requests, while p99 reaches further into the tail, to the slowest one in a hundred. A healthy p95 with a p99 several times higher means most users are fine but a small, often high-value, minority is not. Read them together rather than picking one.

What is a good p95 response time?

It depends on the action, so set the target before the test. As rough guidance, many user-facing APIs aim for a few hundred milliseconds, and web pages often target a one to three second load. Measure your p95 against that goal rather than against the average, and tie the target to the cost of a slow request for that specific journey.

Why is p99 higher than the average?

Because response time distributions are right-skewed: most requests are fast, and a long tail of slow ones stretches out to the right. The p99 sits far out in that tail, while the average is pulled only part of the way toward it, because the bulk of fast requests holds it down. In most real systems the p99 is several times the mean, which is exactly why the average hides what your slowest users experience. The exception is a rare, extreme outlier, which can lift the mean above the p99, as a single 100-second hang among ninety-nine 10-millisecond requests does.

How do you calculate p95 and p99?

Sort every response time from fastest to slowest, then take the value at the 95% or 99% position in that order. With 100 requests, those are the 95th and 99th values: the p95 is the value 95 of them came in at or under, and the p99 the value 99 came in at or under, so only the slowest five sit beyond p95 and the slowest one beyond p99. For an accurate figure you need enough samples and the raw measurements, not pre-averaged buckets, because you cannot recover a percentile from a mean.

What is tail latency?

Tail latency is the slow end of your response-time distribution, the requests captured by high percentiles like p95, p99, and p99.9. These are the slowest responses, and they matter more than their small share suggests, because a page that fans out to many backend calls hits the tail far more often than any single call would. Watching tail latency, not the average, is how you see what your unluckiest users get.

Why does my load test show a good p99 but production is slow?

The usual cause is coordinated omission, a flaw in how many load generators measure. When the system stalls, the tool waits on its in-flight request and stops issuing new ones, so the requests that would have been slowest are never sent and never counted. The result is a p99 that looks healthy in the test and falls apart in production. Use a tool that corrects for it, and validate against what real users actually experience.

Should I ever use the average response time?

Use it as context, never as your headline number. The mean is a blend that matches no actual user, and it can both understate your slow tail and overstate your typical case, so it cannot be trusted alone. Lead with the median for the typical visit, and the p95 and p99 for the bad case, and keep the average only as a rough cross-check.
