Saturday, June 28, 2014

#LatencyTipOfTheDay: Median Server Response Time: The number that 99.9999999999% of page views can be worse than.

The math is simple:

The median server response time (MSRT) is measured per request. Pages have many requests.

% of page views that will see a response worse than the MSRT = (1 - (0.5 ^ N)) * 100%.

Where N is the number of [resource requests / objects / HTTP GETs] per page.

Plug in 42, and you get:

(1 - (0.5 ^ 42)) * 100% = 99.999999999977%

Why 42?

Well, beyond that being the obvious answer, it also happens to be the average number of resource requests per page across top sites according to Google's 4-year-old stats, collected across billions of web pages at the time [1]. Things have changed since then, and the current number is much higher, but with anything over 50 resource requests per page, which is pretty much everything these days, both my calculator and Excel overflow with too many 9s, and say it's basically 100%. Since I figured 12 9s makes the point well enough, I didn't bother trying a big decimal calculator to compute the infinitesimal chance of someone actually seeing the median or better server response times for a page load in 2014...
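For anyone who does want the big decimal calculator, here is a minimal Python sketch of the formula above. It uses the standard `decimal` module so the trailing digits don't round away to 100%. The function name and the 100-digit precision are illustrative choices, not part of the original calculation.

```python
from decimal import Decimal, getcontext

# 100 significant digits keeps every 9 for pages with up to a few hundred requests.
getcontext().prec = 100

def pct_pages_worse_than_median(n_requests):
    """Percent of page views with at least one request slower than the median,
    assuming each request independently has a 50% chance of landing above it."""
    return (1 - Decimal("0.5") ** n_requests) * 100

for n in (42, 50, 119):
    print(n, pct_pages_worse_than_median(n))
```

With N = 42 the printed value starts with the same 99.999999999977 shown above; larger N just adds more 9s.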

[1] Sreeram Ramachandran: "Web metrics: Size and number of resources", May 2010.



Discussion Note:



It's been noted by a few people that this calculation assumes that there is no strong time-correlation of bad or good results. Which is absolutely true. This calculation is valid if every request has an even chance of experiencing a larger-than-median result regardless of what previous results have seen. A strong time correlation would decrease the number of pages that would see worse-than-median results (down to a theoretical 50% in "every response in hour 1 was faster than every result in hour 2" situations). Similarly, a strong time anti-correlation will increase the number of pages that would see a worse-than-median result up to a theoretical 100% (e.g. when every two consecutive response times lie on two opposite sides of the median).

So in reality, if there is some time correlation involved, my number of 9's may be exaggerated. Instead of 99.9999999999% of page views experiencing a response time worse than the median server response time, maybe it's "only" 99.9% of page views that are really that bad. ;-)

Without establishing actual time-correlation or anti-correlation information, the best you can do is act on the basic information at hand. And the only thing we know about the median in most systems (on its own, with no other measured information) is that the chances of seeing a number above it is 50%.
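The two theoretical extremes in this note can be checked with a small deterministic sketch. The data sets are made up purely for illustration (42 requests per page, 1000 pages), and the helper name is hypothetical.

```python
from statistics import median

def pct_pages_above_median(times, n_per_page):
    """Percent of pages with at least one response strictly above the overall median."""
    med = median(times)
    pages = [times[i:i + n_per_page] for i in range(0, len(times), n_per_page)]
    hit = sum(1 for page in pages if any(t > med for t in page))
    return 100.0 * hit / len(pages)

N = 42            # requests per page
PAGES = 1000
total = N * PAGES

# Perfect time correlation: every response in the first half of the run
# is faster than every response in the second half.
correlated = list(range(total))

# Perfect anti-correlation: consecutive responses alternate between
# the two sides of the median.
anti_correlated = [1 if i % 2 == 0 else 100 for i in range(total)]

print(pct_pages_above_median(correlated, N))       # the 50% floor
print(pct_pages_above_median(anti_correlated, N))  # the 100% ceiling
```

Real systems sit somewhere between these two bounds; the independent-requests formula is what you get when nothing is known about correlation.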

#LatencyTipOfTheDay: MOST page loads will experience the 99%'lie server response

Yes. MOST of the page view attempts will experience the 99%'lie server response time in modern web applications. You didn't read that wrong.
This simple fact seems to surprise many people. Especially people who spend much of their time looking at pretty lines depicting averages, 50%'lie, 90%'lie or 95%'lies of server response time in feel-good monitoring charts. I am constantly amazed by how little attention is paid to the "higher end" of the percentile spectrum in most application monitoring, benchmarking, and tuning environments. Given the fact that most user interactions will experience those numbers, the adjacent picture comes to mind.

Oh, and in case the message isn't clear, I am also saying that:

- MOST of your users will experience the 99.9%'lie once in ten page view attempts

- MOST of your users will experience the 99.99%'lie once in 100 page view attempts

- More than 5% of your shoppers/visitors/customers will experience the 99.99%'lie once in 10 page view attempts.


So, how does this work: Simple. It's math.


For most (>50%) web pages to possibly avoid experiencing the 99%'ile of server response time, the number of resource requests per page would need to be smaller than 69.

Why 69?

Here is the math:

- The chances of a single resource request avoiding the 99%'lie is 99%. [Duh.]

- The chances of all N resource requests in a page avoiding the 99%'ile is (0.99 ^ N) * 100%.

- (0.99 ^ 69) * 100%  = 49.9%

So with 69 resource requests or more per page, MOST (> 50% of) page loads are going to fail to avoid the 99%'ile. And the users waiting for those pages to fill will experience the 99%'ile for at least some portion of the web page. This is true even if you assume perfect parallelism for all resource requests within a page (none of the requests issued depend on previous requests being answered). Reality is obviously much worse than that, since requests in pages do depend on previous responses, but I'll stick with what we can claim for sure.

The percentage of page view attempts that will experience your 99%'lie server response time (even assuming perfect parallelism in all requests) will be bound from below by:

% of page view attempts experiencing 99%'ile >= (1 - (0.99 ^ N)) * 100%

Where N is the number of [resource requests / objects / HTTP GETs] per page.
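A minimal Python sketch of this bound, plus a brute-force search for the page size at which "most" kicks in. The function names are illustrative, and the independence assumption is the same one the formula makes.

```python
def pct_pages_hitting_percentile(percentile, n_requests):
    """Lower bound on the percent of page views that see at least one response
    at or beyond the given percentile, assuming independent requests."""
    return (1 - (percentile / 100) ** n_requests) * 100

def smallest_n_for_most_pages(percentile):
    """Smallest number of requests per page at which more than 50% of
    page views hit the given percentile."""
    n = 1
    while pct_pages_hitting_percentile(percentile, n) <= 50:
        n += 1
    return n

print(smallest_n_for_most_pages(99))           # 69
print(pct_pages_hitting_percentile(99, 119))   # a 119-request page
```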

So, How many server requests are involved in loading a web page? 


The total number of server requests issued by a single page load obviously varies by application, but it appears to be a continually growing number in modern web applications. So to back my claims I went off to the trusty web and looked for data.

According to some older stats collected for a sample of several billions of pages processed as part of Google's crawl and indexing pipeline, the number of HTTP GET requests per page on "Top sites" hit the obvious right answer (42, Duh!) in mid 2010 (see [1]). For "All sites" it was 44. But those tiny numbers are so 2010...

According to other sources, the number of objects per page has been growing steadily, with ~49 around the 2009-2010 timeframe (similar to but larger than Google's estimates), and crossed 100 GETs per page in late 2012 (see [2]). But that was almost two years ago.

And according to a very simple and subjective measurement done with my browser just now, loading this blog's web page (before this posting) involved 119 individual resource requests. So nearly 70% of the page views of this blog are experiencing Blogger's 99%'ile.

To further make sure that I'm not smoking something, I hand checked a few common web sites I happened to think of, and none of the request counts came in at 420:

Site                                            # of requests   page loads that would experience the 99%'ile
                                                                [(1 - (.99 ^ N)) * 100%]
amazon.com                                      190             85.2%
kohls.com                                       204             87.1%
jcrew.com                                       112             67.6%
saksfifthavenue.com                             109             66.5%
--                                              --              --
nytimes.com                                     173             82.4%
cnn.com                                         279             93.9%
--                                              --              --
twitter.com                                     87              58.3%
pinterest.com                                   84              57.0%
facebook.com                                    178             83.3%
--                                              --              --
google.com (yes, that simple noise-free page)   31              26.7%
google.com search for "http requests per page"  76              53.4%

So yes. There is one web page on this list for which most page loads will not experience the 99%'lie. "Only" 1/4 of visits to google.com's clean and plain home page will see that percentile. But if you actually use google search for something, you are back on my side of the boat....
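The percentages in the table can be reproduced from the request counts with a few lines of Python. The request counts are the ones listed above; the function and variable names are illustrative.

```python
REQUESTS_PER_PAGE = {
    "amazon.com": 190,
    "kohls.com": 204,
    "jcrew.com": 112,
    "saksfifthavenue.com": 109,
    "nytimes.com": 173,
    "cnn.com": 279,
    "twitter.com": 87,
    "pinterest.com": 84,
    "facebook.com": 178,
    "google.com (home page)": 31,
    "google.com (search)": 76,
}

def pct_pages_seeing_p99(n_requests):
    """(1 - 0.99^N) * 100%, assuming independent requests."""
    return (1 - 0.99 ** n_requests) * 100

for site, n in REQUESTS_PER_PAGE.items():
    print(f"{site:<26}{n:>5}{pct_pages_seeing_p99(n):>8.1f}%")
```

Python rounds to one decimal place here, so a row or two may differ from the table in the last digit.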

What the ^%&*! are people spending their time looking at?

Given these simple facts, I am constantly amazed by the number of people I meet who never look at numbers above the 95%'ile, and spend most of their attention on medians or averages. Even if we temporarily ignore the fact that the 95%'ile is irrelevant (as in too optimistic) for more than half of your page views, there is less than a 3% chance of a modern web app page view avoiding the 95%'ile of server response time. This means that the 90%'ile, 75%'ile, median, and [usually] the average are completely irrelevant to 97% of your page views.

So please, Wake Up! Start looking at the right end of the percentile spectrum...

References:


[1] Sreeram Ramachandran: "Web metrics: Size and number of resources", May 2010.
[2] "Average Number of Web Page Objects Breaks 100", Nov 2012


Discussion Note: 


It's been noted by a few people that these calculations assume that there is no strong time-correlation of bad or good results. This is absolutely true. The calculations I use are valid if every request has the same chance of experiencing a larger-than-percentile-X result regardless of what previous results have seen. A strong time correlation would decrease the number of pages that would see worse-than-percentile-X results (down to a theoretical (100%-X) in theoretically perfect "all responses in a given page are either above or below the X%'ile" situations). Similarly, a strong time anti-correlation (e.g. a repeated pattern going through the full range of response time values every 100 responses) will increase the number of pages that would see a worse-than-percentile-X result, up to a theoretical 100%.

So in reality, my statement of "most" (and the 93.9% computed for cnn.com above) may be slightly exaggerated. Maybe instead of >50% of your page views seeing the 99%'ile, it's "only" 20% of page views that really are that bad... ;-)

Without time-correlation (or anti-correlation) information, the best you can do is act on the basic information at hand. And the only thing we know about a given X %'ile in most systems (on its own, with no other measured information about correlative behavior) is that the chances of seeing a number above it is (100%-X).

Saturday, June 21, 2014

#LatencyTipOfTheDay: Q: What's wrong with this picture? A: Everything!

Question: What's wrong with this picture:


Answer: Everything!

This single chart (source redacted to protect the guilty) is a great depiction of my last four #LatencyTipOfTheDay posts in one great visual:

1. Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.

Well, that one is self explanatory, and the math behind it is simple. But I keep meeting people who think that looking at average numbers tells them something about the behavior of the system that produced them... Which is especially curious when what they are trying to do is monitor health, readiness, and responsiveness behavior. This chart summarizes averages of the things it plots. Just like everyone else. I hear that on the average these days, the tooth fairy pays $2/tooth.

2. You can't average percentiles. Period.

See those averages posted (at the bottom of the chart) for the 25%, 50%, 75%, 90%, and 95% lines? It's not just that average numbers are low in meaning in their own right. Averaging percentiles is so silly mathematically that it may be a good way to build a random number generator. Read this #LatencyTipOfTheDay post for an explanation of just how meaningless averages of percentiles are.

3. If you are not measuring and/or plotting Max, what are you hiding (from)?

This is a classic "feel good" chart. The chart is pretty. And it looks informative. But it really isn't. Not unless all you care about is the things that worked well, and you don't want to show anything about a single thing that didn't work well. The main practical function of such a chart is to distract the reader, and make them look at the pretty lines that tell the story of what the good experiences were, so that they won't ask questions about how often bad experiences happened.

The way a chart like this achieves this nefarious purpose is simple: it completely ignores, as in "does not display any information about" and "throws away all indication of", the worst 5% of server operations in any given interval, or in the timespan charted as a whole.

This chart only displays the "best 95%" of results. You can literally have up to 5% of server side page response times take 2 minutes each, and this pretty picture would stay the same.

Whenever charts that show response time or latency fail to display the Maximum measurements along with lower quantiles (like the 95%, or the median, or even the fuzzy average), ask yourself: "what are they hiding?".

4. Measure what you need to monitor. Don't just monitor what you happen to be able to easily measure.

This chart plots the information it has. Unfortunately the information it has is not the important information needed for monitoring server response behavior...

Monitoring is mostly supposed to be about "this is how bad the bad stuff was, and this is how much bad stuff we had". It's been a while since I met anyone operating servers who only cared about the fastest operation of the day. Or only the best 25% of operations. Or only about the better half of operations. Or only the better 95% of operations, but didn't care at all about the worst 5% of operations. But that's exactly (and only) what the 25%'ile, median, and 95% lines in this chart display. It is literally a chart showing "this is how good the good stuff was".

Percentiles matter for monitoring server response times, and they matter to several 9s in the vast majority of server applications. The fact that your measurements or data stores only provided you with common case latencies (and yes, 95% is still common case) is no excuse. You may as well be monitoring a white board.

You see, when you actually ask scary questions like "how many of our users will observe the 99.99%'lie of server side HTTP request time each day?" you'll typically get the scary answer that makes you realize you really want to watch that number closely. Much more so than the median or the 95%'ile. Not asking that question is where the problems start.

For example, the statement "roughly 10% of all users will experience our 99.99%'ile once a day or more" is true when each page view involves ~20 HTTP GETs on the average, and the average user does ~50 page views or refreshes during a day. Both of these are considered low numbers for what retail sites, or social networking sites, or online application sites experience, for example. And most users (roughly 63%) will experience the 99.9%'ile at least once in 50 page views under the same assumptions. So if you care about what a huge portion of your user base is actually experiencing regularly, you really care about the 99.9%'ile and 99.99%'ile.
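The per-user arithmetic is the same formula applied to a day's worth of requests. Here is a minimal sketch using the ~20 GETs per page and ~50 page views per day quoted above; the function name is illustrative, and requests are assumed independent.

```python
def pct_users_hitting_percentile(percentile, gets_per_page, page_views_per_day):
    """Percent of users who see at least one response at or beyond the given
    percentile during a day, assuming independent requests."""
    requests_per_day = gets_per_page * page_views_per_day
    return (1 - (percentile / 100) ** requests_per_day) * 100

for p in (99.9, 99.99):
    pct = pct_users_hitting_percentile(p, gets_per_page=20, page_views_per_day=50)
    print(f"{p}%'ile: {pct:.1f}% of users per day")
```

Under these assumptions the 99.99%'ile comes out a little under 10% of users, and the 99.9%'ile at roughly 63%.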

But even though I have yet to meet an ops team that does not need to monitor the 99.9% and/or 99.99%'lie of their server response times, it's rare that I meet one that actually does monitor at those levels. And the reason is that they are usually staring at dashboards full of this sort of feel good chart, which distract them enough to not think about the fact that they are not measuring what they need to monitor. Instead, they are monitoring what they happen to be able to measure...

Once you know what you want to be watching, the measurement and the retention of data is where the focus should start, as those pretty picture things usually break early in the process because the things that need monitoring were simply not being measured. For example, many measurement systems (e.g. Yammer metrics) do not even attempt to actually measure percentiles, and model them statistically instead. Of the ones that actually do measure percentiles on the data they receive (e.g. StatsD, if you sent it un-summarized data about each and every server response time), the percentile measurements are typically done on relatively short intervals and summarized (per interval) in storage, which means two things: (A) You can only measure very low numbers of 9s, and cannot produce the sort of percentiles you actually should be monitoring, and (B) the percentiles cannot be usefully aggregated across intervals (see discussion).

(A) happens because intervals are "short", and cannot individually produce any useful data about "high number of 9s". E.g. there is no way to report useful information on the 99.9%'lie or 99.99%'lie in a 5 second interval when the operation rate is 100 ops/sec.
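One rough way to see this, as a sketch: an interval needs at least 1 / (1 - p) samples before even a single sample can sit beyond percentile p. That rule of thumb is an assumption used here for illustration, and it is a bare minimum, not a recipe for a statistically solid estimate. The function name is hypothetical.

```python
from fractions import Fraction
from math import ceil

def min_samples_for_percentile(percentile):
    """Smallest sample count at which one sample can lie beyond the percentile.
    `percentile` is a string such as "99.9" so the arithmetic stays exact."""
    tail = 1 - Fraction(percentile) / 100
    return ceil(1 / tail)

ops_per_sec = 100
interval_sec = 5
samples_per_interval = ops_per_sec * interval_sec  # 500

for p in ("99", "99.9", "99.99"):
    needed = min_samples_for_percentile(p)
    verdict = "ok" if samples_per_interval >= needed else "not enough samples"
    print(f"{p}%'ile needs >= {needed} samples, interval has {samples_per_interval}: {verdict}")
```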

(B) happens because unless percentile data is computed (correctly) across long enough periods of time, it is basically useless for deducing percentiles across more than a single interval. Collecting the data for producing percentiles over longer periods is relatively straightforward (after all, it's being done for each interval with low numbers of 9s), but has some storage volume and speed related challenges with commonly used percentile computation techniques. Challenges that HdrHistogram completely solves now, BTW.
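To illustrate the "collect across longer periods" idea: if each interval keeps its counts per value rather than only its percentile summary, the intervals can be added together and percentiles computed on the sum. The sketch below uses a plain `collections.Counter` with made-up data as a stand-in; it is not HdrHistogram or its API, and it ignores the storage and speed concerns mentioned above.

```python
from collections import Counter
from fractions import Fraction
from math import ceil

def percentile_from_counts(counts, percentile):
    """Nearest-rank percentile from a {value: count} histogram.
    `percentile` is a string such as "99.9" so the rank is computed exactly."""
    total = sum(counts.values())
    rank = max(1, ceil(Fraction(percentile) / 100 * total))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

# Hypothetical run: 20 intervals of 500 samples each (5 s at 100 ops/sec).
# Each interval has 499 responses of 1 ms and one slow response (10, 20, ... 200 ms).
interval_histograms = [Counter({1: 499, 10 * (i + 1): 1}) for i in range(20)]

# Merging is just adding the counts together.
merged = Counter()
for histogram in interval_histograms:
    merged.update(histogram)

print(percentile_from_counts(merged, "99.9"))   # over all 10,000 samples
print(percentile_from_counts(merged, "99.99"))
```

No single 500-sample interval can say anything useful about the 99.9%'ile, but the merged counts can.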

So what can you do about this stuff? A lot. But it starts with sitting down and deciding what you actually need to be monitoring. Without monitoring things that matter, you are just keeping your operations and devops folks distracted, and keeping your developers busy feeding them distracting nonsense.

And stop looking at feel good charts. Start asking for pretty charts of data that actually matters to your operations...

#LatencyTipOfTheDay: If you are not measuring and/or plotting Max, what are you hiding (from)?

When you monitor response times or latency in any form, and don't measure Max values for response times or latency, or don't monitor the ones you do measure, you are invariably ignoring a system requirement, and hiding the fact that you are ignoring it.

Monitoring against required behavior


The goal of most monitoring systems is (or should be) to provide status knowledge about a system's behavior compared to its business requirements, and to help you avoid digging where you don't need to (no need to look into the "cause" of things being "just fine"). Drilling into details is what you do when you know a problem exists. But first you need to know the problem exists, which starts with knowing that you are failing to meet some required behavior.

The reason everyone should be measuring and plotting max values is simple: I don't know of any applications that don't actually have a requirement for Max response time or latency. I do know of many teams that don't know what the requirement is, have never talked about it, think the requirement doesn't exist, and don't test for it. But the requirement is always there.

"But we don't have a max requirement"


Yeah, right.

Whenever someone tells me "we don't have a max time requirement", I answer with "so it's ok for your system to be completely unresponsive for 3 days at a time, right?".

When they say no (they usually use something more profound than a simple "no"), I calmly say "so your requirement is to never have response time be longer than 3 days then..."

They will usually "correct me" at that point, and eventually come up with some number that is reasonable for their business needs. At which point they discover that they are not watching the numbers for that requirement.

So if you have the power to measure this Max latency or response time stuff yourself, or to require it from others, start doing so right away, and start looking at it.

Uncovering insanity


Beyond being a universally useful and critical-to-watch requirement, Max is also a great sanity checker for other values. It's harder to measure Max wrong (although the thing many tools report and display as "max" is a bogus form of "sampled" max), and it's really hard to hide from it once you plot or display it.

The first conversations that happen when people start to look at max values after monitoring their stuff for a while without them often start with: "I understand that the pretty lines showing 95%'lie, average, and median all appear great, and hover around 70msec +/- 100msec, and that we've been making them better for months..., but if that's really the case, what the &$#^! is this 7 second max time doing here, and how come it happens several times an hour? And why has nobody said anything about this before? ..."

Apologies


Who knows? You may find that those nice fuzzy feelings the 95%'lie charts have been giving you are well justified. So no harm then.

For those of you who find an uglier truth because I made you look, and don't like what you see, I sincerely pre-apologize for having written this...



#LatencyTipOfTheDay : Measure what you need to monitor. Don't just monitor what you happen to be able to easily measure.

Wednesday, June 18, 2014

#LatencyTipOfTheDay: Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.

Averages are silly things when it comes to latency. I've yet to meet an application that actually has a use for, or a valid business or technical requirement for an average latency. Short of the ones that are contractually required to produce and report on this silly number, of course. Contracts are good reasons to measure things.

So I often wonder why people measure averages, or require them at all. Let alone use averages as primary indicators of behavior, and as a primary target for tuning, planning, and monitoring.

My opinion is that this fallacy comes from a natural tendency towards "Latency wishful thinking". We really wish that latency behavior exhibited one of those nice bell curve shapes we learned about in some math class. And if it did, the center (mode) of that bell curve would probably be around where the average number is. And on top of that is where the tooth fairy lives.

Unfortunately, response times and latency distributions practically NEVER look like that. Not even in idle systems. Systems latency distributions are strongly multi-modal, with things we call "tails" or "outliers" or "high percentiles" that fall so many standard deviations away from the average that the Big Bang would happen 17 more times before that result were possible in one of those nice bell curve shapey thingies.

And when the distribution is nothing like a bell curve, the average tends to fall in surprising places.

I commonly see latency and response time data sets where the average is higher than the 99%'ile.

And I commonly see ones where the average is smaller than the median.

And everything in between.

The average can (and will) fall pretty much anywhere in a wide range of allowed values. The math bounding how far the average can go is simple. All you need is to come up with the two most extreme data sets:

A) The biggest possible Average is equal to Max.

This one is trivial. If all results are the same, the average is equal to the Max.  It's also clearly the highest value it could be. QED.

B) The smallest possible Average  is equal to (Median / 2).

By definition, half of the data points are equal to or larger than the median. So the smallest possible contribution the results above the median can have towards the average would be if they were all equal to the median. As to the other half, the smallest possible contribution the results below the median could have towards the average is 0, which would happen if they are all 0. So the smallest possible average would occur when half the results are 0 and the other half is exactly equal to the Median, leading to Average = (Median / 2). QED.

Bottom line: The average is a random number that falls somewhere between Median/2 and Max.
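Both extremes, and the "average above the 99%'ile" case, are easy to reproduce with made-up data. This is a minimal sketch; the data sets and the nearest-rank percentile helper are illustrative assumptions.

```python
from statistics import mean, median

def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile for a whole-number percentile (e.g. 99)."""
    ordered = sorted(values)
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

# Case A: every result is identical, so the average equals the max.
all_equal = [7] * 1001

# Case B: just under half the results are 0, the rest sit exactly at the median.
half_zero = [0] * 500 + [10] * 501

# A long-tailed set: 99% of results take 1 ms, 1% take 1000 ms.
long_tail = [1] * 990 + [1000] * 10

print(mean(all_equal), max(all_equal))                          # average == max
print(mean(half_zero), median(half_zero))                       # average just above median / 2
print(mean(long_tail), nearest_rank_percentile(long_tail, 99))  # average above the 99%'ile
```

Case B uses an odd number of samples so that the median is unambiguous; the average approaches Median / 2 as the set grows.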

#LatencyTipOfTheDay: You can't average percentiles. Period.

I run into the following situation way too often:

You have some means of measuring and collecting latency, and you want to report on its percentile behavior over time. You usually capture the data either in some form of high-fidelity histogram, or as raw data (each operation has its own latency info). You then summarize the data on a per-interval basis and place it in some log. E.g. you may have a log with a summary line per interval, describing something like the 90%'ile, 95%'ile, 99%'ile, 99.9%'ile and Max of all results seen in the last 5 seconds.

Since raw data is (unfortunately) much more commonly used for this because it's hard to get accurate percentile information from most histograms, you worry about the space needed to keep the full data used to produce this summary over time. So you only keep the summaries but throw away the data.

Then you have this nice log file, with one line per summary interval, and over a run (hour, day, whatever), the log file has all sorts of interval data points for each percentile level, which will often shift rapidly over time. And while this data is useful for "what happened when?" needs, it is terrible for summarizing and getting a feel for the overall behavior of the system during the run. It needs to be further summarized to be comprehensible. 

So how do you produce a summary report on the behavior over longer periods of time (e.g. per hour, or per day, or of the whole run) from such per-interval summary reports? 

Well, here is what I see many people do at this point. Ready for this?

They produce a summary from the summaries. Since they really want to report the 90%'lie, 95%'lie, 99%'lie, 99.9%'lie, and Max values for the whole run, but only have the summary data on a per-interval basis, they average the interval summaries to build the overall summary. 

E.g. The 99%'lie for the run is computed as the average of the 99%'lie report of all intervals in the log. Same for the other percentiles.

Any time you see a timeline chart plotting some percentile, and there is an average reported in text to give you a feel for what the overall average behavior of the spiky line you are looking at is, that's what you are most likely looking at...

So what's wrong with that?

At first blush, you may think that the hourly average of the per-5-second-interval 99%'ile value has some meaning. You know averages can be inaccurate and misrepresenting at times, but this value could still give you some "feel" for the 99%'ile of the system, right?

Wrong.

The average of 1,000 99%'ile measurements has as much to do with the 99%'ile behavior of a system as the current temperature on the surface of Mars does. Maybe less.

The simplest way to demonstrate the absurdity of this "average of percentiles" calculation is to examine its obvious effect on two "percentile" values that are hard to hide from: The Max (the 100%'ile value) and the Min (the 0%'ile value).

If you had the Maximum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Maximum value for the whole run, would an average of the 1,000 max values produce anything useful? No. You'd need to use the Maximum of the Maximums. Which will actually work, but only for the unique edge case of 100%...

But if you had the Minimum value recorded for each 5 second interval over a 5000 second run, and want to deduce the Minimum value for the whole run, you'd have to look for the Minimum of the Minimums. Which also works, but only for the unique edge case of 0%.

The only thing you can deduce about the 99%'lie of a 5000 second run for which you have 1,000 5 second summary 99%'lie values is that it falls somewhere between the Maximum and Minimum of those 1,000 99%'lie values...

Bottom line: You can't average percentiles. It doesn't work. Period.
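A small made-up run shows how far off the average of per-interval percentiles can land. This is a sketch under stated assumptions: the data is hypothetical (100 intervals of 100 samples, two of which are a stall), and the nearest-rank helper is one of several reasonable percentile definitions.

```python
from statistics import mean

def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile for a whole-number percentile (e.g. 99)."""
    ordered = sorted(values)
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

# 98 intervals where every response takes 1 ms,
# and 2 stalled intervals where every response takes 1000 ms.
intervals = [[1] * 100 for _ in range(98)] + [[1000] * 100 for _ in range(2)]

per_interval_p99 = [nearest_rank_percentile(iv, 99) for iv in intervals]
all_samples = [v for iv in intervals for v in iv]

averaged_p99 = mean(per_interval_p99)                # the "summary of summaries"
true_p99 = nearest_rank_percentile(all_samples, 99)  # computed on the raw data
max_of_maxes = max(max(iv) for iv in intervals)      # works, but only for 100%
min_of_mins = min(min(iv) for iv in intervals)       # works, but only for 0%

print(averaged_p99, true_p99, max_of_maxes, min_of_mins)
```

Here the averaged value comes out around 21 ms while the 99%'ile of the whole run is 1000 ms. All the interval summaries really pin down is that the true value lies between the smallest and largest per-interval 99%'ile.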