When doing systems research, a large part of making the case for a new system or a new feature is the presentation of our experimental results.
To create our graphs, we run experiments (e.g., based on benchmarks) and present various metrics (e.g., latency, throughput). In some cases we face the dilemma of how many times we should run an experiment to report a trustworthy average of the metric of interest.
To consider this question a bit more deeply, we have to think about how each point of our graph is generated. For example, consider a plot that presents the latency of 4 different (let’s say TPC-H) queries. How many times should we run each query before we present numbers (e.g., the average latency)? What is the standard deviation of the runs we actually did? Is 5% a good number? Should we show the average or the median?
There is no one-size-fits-all answer to these questions. To reason about them, we should go back to some basic probability theory and statistics.
The act of running the experiment multiple times to find the “correct” value of the metric implies that what we measure behaves like a random variable, which in turn means that we make some (potentially silent) assumptions that this random variable follows a specific distribution (e.g., normal). Based on such an assumption, we can start answering the above questions. Note that even if a random variable does not follow the normal distribution, most of the basic concepts hold as long as the distribution has a single peak and is symmetric (in other words, it looks like a normal distribution). An extra benefit of a symmetric distribution with a single peak is that the average and the median coincide.
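A quick, informal way to check this shape assumption is to compare the sample mean and median of the collected measurements and to look at their skewness. The sketch below does exactly that; the latency values are made up for illustration.

```python
# Sketch: a quick shape check on the measured metric. The latency values
# below are made up for illustration; replace them with real measurements.
import numpy as np
from scipy import stats

latencies_ms = np.array([102.3, 98.7, 101.1, 99.5, 100.2,
                         103.8, 97.9, 100.9, 99.1, 101.6])

mean = latencies_ms.mean()
median = np.median(latencies_ms)
skew = stats.skew(latencies_ms)

print(f"mean={mean:.2f} ms, median={median:.2f} ms, skewness={skew:.2f}")
# If the mean and the median are close and the skewness is near zero, the
# "single peak, roughly symmetric" assumption is plausible; a long tail of
# slow runs would show up as a gap between the two and a positive skew.
```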
If we “run our experiment” enough times, the law of large numbers tells us that the sample average converges to the true average, and the central limit theorem tells us that the sample average is approximately normally distributed around it. How many repetitions are enough? A good rule of thumb is more than 30. In fact, if we have 30 sample values of our random variable (that follows a normal distribution), the true average, say μ, lies with probability 99% in the interval (X-0.5*S, X+0.5*S), where X is the experimental average and S is the experimental standard deviation. If we are willing to relax the probability to 95%, the interval becomes narrower: (X-0.37*S, X+0.37*S). Of course, the usefulness of this confidence interval depends on the relative values of X and S (which depend on the nature of the observed metric).
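These two factors are easy to double-check. A minimal sketch, assuming the Student-t machinery described in the next paragraph and using scipy:

```python
# Check of the factors above (0.5 and 0.37): the multiplier on S is the
# two-sided critical value of Student's t with N-1 degrees of freedom,
# divided by sqrt(N).
import math
from scipy import stats

N = 30
for confidence in (0.99, 0.95):
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=N - 1)
    print(f"{confidence:.0%}: X +/- {t_crit / math.sqrt(N):.2f}*S")
# Prints roughly 0.50*S at 99% and 0.37*S at 95%.
```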
The discussion above focuses on 30 repetitions of the experiment. What if each repetition is expensive (in terms of time, or even monetary cost)? How do these confidence intervals change with a smaller number of repetitions? To answer this question, we need to shift our attention to small-sample statistics. A good primer can be found at https://saylordotorg.github.io/text_introductory-statistics/s11-02-small-sample-estimation-of-a-p.html (published by Saylor Academy, 2012). The core result is that, given a number of repetitions (i.e., a sample size) N, the true average μ of the random variable in question lies with probability P in the interval (X-T(P,N)*S/sqrt(N), X+T(P,N)*S/sqrt(N)), where X is the experimental average, S is the experimental standard deviation, and T(P,N) is the two-sided critical value of Student’s t-distribution with N-1 degrees of freedom; it depends on the target probability P with which we want μ to fall in the above interval and on the size of the sample (i.e., the number of times we run our experiment).
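To make the formula concrete, here is a minimal sketch that computes X, S, and the interval for a given P; it uses scipy, and the latency values in the example are made up, not real measurements.

```python
# Sketch: compute the experimental average X, the experimental standard
# deviation S, and the confidence interval for the true average mu at
# probability P. T(P, N) is the two-sided critical value of Student's t
# distribution with N-1 degrees of freedom.
import math
from scipy import stats

def confidence_interval(samples, p=0.95):
    n = len(samples)
    x = sum(samples) / n                                          # X
    s = math.sqrt(sum((v - x) ** 2 for v in samples) / (n - 1))   # S
    t_crit = stats.t.ppf(1 - (1 - p) / 2, df=n - 1)               # T(P, N)
    half = t_crit * s / math.sqrt(n)
    return x - half, x + half

# Hypothetical example: five repetitions of a query, latency in ms.
print(confidence_interval([101.2, 99.8, 103.4, 100.1, 102.0], p=0.95))
```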
Now, assume that we want to run an experiment that requires a one-time warmup budget W and an execution cost C per run, so the total benchmarking budget is W+N*C. The benefit of increasing N is a more accurate estimate of the true average μ; the downside is a higher overall benchmarking cost. To put some numbers on this trade-off, consider running N=1, N=2, N=5, or N=10 times, aiming for 95% confidence that μ falls within the interval. In these four cases, the confidence intervals are (the multipliers can be reproduced with the sketch that follows the two lists below):
- For N=1: no interval can be computed; a single run gives no estimate of S
- For N=2: (X-8.985*S, X+8.985*S)
- For N=5: (X-1.242*S, X+1.242*S)
- For N=10: (X-0.715*S, X+0.715*S)
The intervals with 99% confidence would be:
- For N=1: again, no interval can be computed from a single run
- For N=2: (X-45.012*S, X+45.012*S)
- For N=5: (X-2.059*S, X+2.059*S)
- For N=10: (X-1.028*S, X+1.028*S)
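As mentioned above, the multipliers in these lists can be reproduced, side by side with the total benchmarking cost, using the sketch below; the values of W and C are hypothetical placeholders, not measurements.

```python
# Sketch: half-width multipliers on S for a few sample sizes, next to the
# total benchmarking cost W + N*C. W and C below are made-up placeholders.
import math
from scipy import stats

W, C = 60.0, 30.0   # hypothetical warmup cost and cost per execution (e.g., seconds)

print(f"{'N':>3} {'cost':>7} {'95%':>9} {'99%':>9}")
for n in (2, 5, 10, 30):
    cost = W + n * C
    m95 = stats.t.ppf(0.975, df=n - 1) / math.sqrt(n)
    m99 = stats.t.ppf(0.995, df=n - 1) / math.sqrt(n)
    print(f"{n:>3} {cost:>7.0f} {m95:>8.3f}S {m99:>8.3f}S")
# N = 1 is omitted: a single run gives no estimate of S, hence no interval.
```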
Note that in several practical use cases, confidence intervals at 95% are considered adequate; however, the usefulness of the intervals also depends on the measured values X and S.
In this brief note we discussed how to quantify the impact of reducing the number of experiment repetitions on the statistics we calculate. Another relevant question, left for future discussion, is how to sample from a large population of measurements (say, the latencies of short transactions) so that we do not have to pay the cost of storing/updating the value of the metric too often.