What is it about?
Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. We provide a statistically rigorous methodology that allows a researcher to determine the number of repetitions at each level of an experiment (e.g. benchmark iterations, VM invocations, or rebuilds) that is necessary and sufficient to obtain a given level of precision. We also show how to present results with an effect size confidence interval.
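As an illustration, here is a minimal Python sketch of the style of calculation involved: given the time cost of one repetition at each level and the variance each level contributes (both estimated from an initial exploratory run), it computes the repetition counts at the lower levels that buy the most precision per unit of experiment time. The formula follows the spirit of the paper but is simplified, and the numbers in the example are hypothetical.

import math

def optimal_repetitions(costs, variances):
    # costs[i]: time cost of one repetition at level i (lowest level first,
    #           e.g. costs[0] = one benchmark iteration, costs[1] = the
    #           overhead of one VM invocation).
    # variances[i]: variance contributed by level i, estimated from an
    #               initial exploratory experiment.
    # Returns an optimal repetition count for each level below the top;
    # the count at the topmost level is then raised until the desired
    # precision is reached.
    return [
        max(1, math.ceil(math.sqrt(
            (costs[i + 1] * variances[i]) / (costs[i] * variances[i + 1]))))
        for i in range(len(costs) - 1)
    ]

# Hypothetical two-level experiment: cheap, noisy iterations inside
# expensive VM invocations.
print(optimal_repetitions(costs=[1.0, 20.0], variances=[4.0, 0.5]))  # [13]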
Why is it important?
Improvements in performance are often small, so it's essential that researchers account for variation in their results. However, running experiments many times is time-consuming. For example, running the SPEC INT benchmarks to measure the speedup of a compiler optimisation, using 30 different linking orders, took 3 days on a fast computer. Perhaps unsurprisingly, it is therefore common for researchers to report results in ways that seem to make their work impossible to repeat, or that do not convincingly demonstrate their claims of performance improvement. We offer a sound experimental methodology that makes the best use of experiment time. We establish, both formally and in practice, the optimum number of repetitions of experiments to achieve the most precise results for a given experiment time. We compare our methodology with the heuristics-based practice common today, and show that the latter often leads to either too few or too many experiments. Too few repetitions lead to unsound results; too many waste time. We show how to estimate error bounds, based on effect sizes and confidence intervals, both when evaluating the performance of a single system and when comparing execution times of two versions of a system: such ratios are commonly used but hardly ever qualified with error bounds.
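To make the last point concrete, the following is a small Python sketch of one common way to attach error bounds to a speedup ratio: a bootstrap percentile confidence interval for the ratio of mean execution times. This is a simplified, flat resampling over independent runs, not the paper's exact construction (which also accounts for the multi-level structure of the experiment), and the timing data below are hypothetical.

import random

def ratio_ci(old, new, reps=10000, conf=0.95, seed=0):
    # Bootstrap percentile confidence interval for mean(new) / mean(old).
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        o = [rng.choice(old) for _ in old]  # resample baseline runs
        n = [rng.choice(new) for _ in new]  # resample optimised runs
        ratios.append((sum(n) / len(n)) / (sum(o) / len(o)))
    ratios.sort()
    lo = ratios[int(reps * (1 - conf) / 2)]
    hi = ratios[int(reps * (1 + conf) / 2) - 1]
    return lo, hi

# Hypothetical per-run execution times (seconds) for a baseline and an
# optimised build of the same system.
baseline  = [10.2, 10.5, 9.9, 10.1, 10.4, 10.0, 10.3, 10.2]
optimised = [9.1, 9.4, 8.9, 9.0, 9.3, 9.2, 9.0, 9.1]
print(ratio_ci(baseline, optimised))  # e.g. roughly (0.88, 0.91)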
Read the Original
This page is a summary of: Rigorous benchmarking in reasonable time, January 2013, ACM (Association for Computing Machinery),
DOI: 10.1145/2491894.2464160.