2009 ControlledExperimentsOnTheWeb
- (Kohavi et al., 2009) ⇒ Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. (2009). “Controlled Experiments on the Web: survey and practical guide.” In: Data Mining and Knowledge Discovery, 18(1). doi:10.1007/s10618-008-0114-1
Subject Headings: Online Controlled Experiment.
Notes
- The paper is a longer version of the KDD-2007 paper (Kohavi, Henne & Sommerfield, 2007): http://exp-platform.com/Documents/GuideControlledExperiments.pdf
Cited By
- http://scholar.google.com/scholar?q=%222009%22+Controlled+Experiments+on+the+Web%3A+Survey+and+Practical+Guide
- http://dl.acm.org/citation.cfm?id=1485071.1485091&preflayout=flat#citedby
Quotes
Author Keywords
Controlled experiments; A/B testing; e-commerce; Website optimization; MultiVariable Testing; MVT
Abstract
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
1. Introduction
…
Controlled experiments provide a methodology to reliably evaluate ideas. Unlike other methodologies, such as post-hoc analysis or interrupted time series (quasi experimentation) (Charles and Melvin 2004), this experimental design methodology tests for causal relationships (Keppel et al. 1992, pp. 5–6). Most organizations have many ideas, but the return-on-investment (ROI) for many may be unclear and the evaluation itself may be expensive. As shown in the next section, even minor changes can make a big difference, and often in unexpected ways. A live experiment goes a long way in providing guidance as to the value of the idea.
2. Motivating examples
…
3. Controlled experiments
…
3.1 Terminology
The terminology for controlled experiments varies widely in the literature. Below we define key terms used in this paper and note alternative terms that are commonly used.
Overall Evaluation Criterion (OEC) (Roy 2001). A quantitative measure of the experiment’s objective. In statistics this is often called the Response or Dependent Variable (Mason et al. 1989; Box et al. 2005); other synonyms include Outcome, Evaluation metric, Performance metric, or Fitness Function (Quarto-vonTivadar 2006). Experiments may have multiple objectives and a scorecard approach might be taken (Kaplan and Norton 1996), although selecting a single metric, possibly as a weighted combination of such objectives is highly desired and recommended (Roy 2001, p. 50). A single metric forces tradeoffs to be made once for multiple experiments and aligns the organization behind a clear objective. A good OEC should not be short-term focused (e.g., clicks); to the contrary, it should include factors that predict long-term goals, such as predicted lifetime value and repeat visits. Ulwick describes some ways to measure what customers want (although not specifically for the web) (Ulwick 2005).
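As an aside on combining multiple objectives into a single OEC, the following is a minimal Python sketch; the metric names and weights are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a weighted OEC; metric names and weights are hypothetical,
# chosen only to illustrate collapsing several objectives into one number.
def weighted_oec(metrics: dict, weights: dict) -> float:
    """Combine per-user metrics into a single Overall Evaluation Criterion."""
    return sum(weights[name] * value for name, value in metrics.items())

# Example: weight long-term signals (repeat visits, revenue) more heavily than raw clicks.
weights = {"clicks_per_user": 0.1, "revenue_per_user": 0.6, "repeat_visits": 0.3}
user_metrics = {"clicks_per_user": 12.0, "revenue_per_user": 3.5, "repeat_visits": 2.0}
print(weighted_oec(user_metrics, weights))
```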
Factor. A controllable experimental variable that is thought to influence the OEC. Factors are assigned Values, sometimes called Levels or Versions. Factors are sometimes called Variables. In simple A/B tests, there is a single factor with two values: A and B.
Variant. A user experience being tested by assigning levels to the factors; it is either the Control or one of the Treatments. Sometimes referred to as Treatment, although we prefer to specifically differentiate between the Control, which is a special variant that designates the existing version being compared against and the new Treatments being tried. In case of a bug, for example, the experiment is aborted and all users should see the Control variant.
Experimental unit. The entity over which metrics are calculated before averaging over the entire experiment for each variant. Sometimes called an item. The units are assumed to be independent. On the web, the user is a common experimental unit, although some metrics may have user-day, user-session or page views as the experimental units. For any of these, randomization by user is preferred. It is important that the user receive a consistent experience throughout the experiment, and this is commonly achieved through randomization based on user IDs stored in cookies. We will assume that randomization is by user, with some suggestions in the Appendix for when randomization by user is not appropriate.
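A common way to keep a user's experience consistent is to hash a persistent user ID (for example, one stored in a cookie) into a variant bucket. The sketch below is illustrative only: the hash function, salting scheme, and bucket layout are assumptions, not the paper's implementation.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("Control", "Treatment")) -> str:
    """Deterministically map a user ID to a variant so the user always sees the same experience."""
    # Salt the hash with the experiment name so different experiments get independent splits.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("user-12345", "checkout-redesign"))  # same inputs -> same variant every time
```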
Null hypothesis. The hypothesis, often referred to as H0, that the OECs for the variants are not different and that any observed differences during the experiment are due to random fluctuations.
Confidence level. The probability of failing to reject (i.e., retaining) the null hypothesis when it is true.
Power. The probability of correctly rejecting the null hypothesis, H0, when it is false. Power measures our ability to detect a difference when it indeed exists.
A/A test. Sometimes called a Null Test (Peterson 2004). Instead of an A/B test, you exercise the experimentation system, assigning users to one of two groups, but expose them to exactly the same experience. An A/A test can be used to (i) collect data and assess its variability for power calculations, and (ii) test the experimentation system (the Null hypothesis should be rejected about 5% of the time when a 95% confidence level is used).
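To illustrate point (ii), the sketch below simulates repeated A/A tests on two identically distributed groups and checks that roughly 5% of them reject the null hypothesis at the 95% confidence level; the data-generating parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
num_tests, rejections = 1000, 0
for _ in range(num_tests):
    # Both groups come from the same distribution, so any "difference" is pure noise.
    a = rng.normal(loc=10.0, scale=2.0, size=5000)
    b = rng.normal(loc=10.0, scale=2.0, size=5000)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        rejections += 1

print(f"False rejection rate: {rejections / num_tests:.3f}")  # should land near 0.05
```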
Standard deviation (Std-Dev). A measure of variability, typically denoted by σ.
Standard error (Std-Err). For a statistic, it is the standard deviation of the sampling distribution of the sample statistic (Mason et al. 1989). For a mean of n independent observations, it is [math]\displaystyle{ \hat{\sigma}/\sqrt{n} }[/math], where [math]\displaystyle{ \hat{\sigma} }[/math] is the estimated standard deviation.
3.2 Hypothesis testing and sample size
To evaluate whether one of the treatments is different than the Control, a statistical test can be done. We accept a Treatment as being statistically significantly different if the test rejects the null hypothesis, which is that the OECs are not different. We will not review the details of the statistical tests, as they are described very well in many statistical books (Mason et al. 1989; Box et al. 2005; Keppel et al. 1992). What is important is to review the factors that impact the test:
- Confidence level. Commonly set to 95%, this level implies that 5% of the time we will incorrectly conclude that there is a difference when there is none (Type I error). All else being equal, increasing this level reduces our power (below).
- Power. Commonly desired to be around 80–95%, although not directly controlled. If the Null Hypothesis is false, i.e., there is a difference in the OECs, the power is the probability of determining that the difference is statistically significant. (A Type II error is one where we retain the Null Hypothesis when it is false.)
- Standard error. The smaller the Std-Err, the more powerful the test. There are three useful ways to reduce the Std-Err (a brief illustrative sketch follows this list):
- The estimated OEC is typically a mean of large samples. As shown in Sect. 3.1, the Std-Err of a mean is inversely proportional to the square root of the sample size, so increasing the sample size, which usually implies running the experiment longer, reduces the Std-Err and hence increases the power for most metrics. See the example in 3.2.1.
- Use OEC components that have inherently lower variability, i.e., the Std-Dev, σ, is smaller. For example, conversion probability (0–100%) typically has lower Std-Dev than number of purchase units (typically small integers), which in turn has a lower Std-Dev than revenue (real-valued). See the example in 3.2.1.
- Lower the variability of the OEC by filtering out users who were not exposed to the variants, yet were still included in the OEC. For example, if you make a change to the checkout page, analyze only users who got to the page, as everyone else adds noise, increasing the variability. See the example in 3.2.3.
- Effect. The difference in OECs for the variants, i.e. the mean of the Treatment minus the mean of the Control. Larger differences are easier to detect, so great ideas are unlikely to be missed. Conversely, Type II errors are more likely when the effects are small.
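To make the Std-Err discussion in the list above concrete, here is a minimal sketch comparing the variability of three hypothetical OEC components (conversion, purchase units, revenue); the distributions are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # hypothetical users; the distributions below are invented for illustration

converted = rng.random(n) < 0.05                        # conversion: 0/1 per user
units = np.where(converted, rng.poisson(2, n) + 1, 0)   # purchase units: small integers
revenue = units * rng.exponential(scale=25.0, size=n)   # revenue: real-valued, heavy-tailed

# Lower-variability components yield a smaller Std-Err of the mean (sigma_hat / sqrt(n)).
for name, x in [("conversion", converted.astype(float)), ("units", units), ("revenue", revenue)]:
    print(f"{name:>10}: std-dev={x.std(ddof=1):8.3f}   std-err of mean={x.std(ddof=1) / np.sqrt(n):.5f}")
```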
Two formulas are useful to share in this context. The first is the t-test, used in A/B tests (single factor hypothesis tests):
- [math]\displaystyle{ t = \frac{\overline{O}_B - \overline{O}_A}{\hat{\sigma}_d} }[/math]
where [math]\displaystyle{ \overline{O}_A }[/math] and [math]\displaystyle{ \overline{O}_B }[/math] are the estimated OEC values (e.g., averages), [math]\displaystyle{ \hat{\sigma}_d }[/math] is the estimated standard deviation of the difference between the two OECs, and t is the test result. Based on the confidence level, a threshold t is established (e.g., 1.96 for large samples and 95% confidence) and if the absolute value of t is larger than the threshold, then we reject the Null Hypothesis, claiming the Treatment’s OEC is therefore statistically significantly different than the Control’s OEC. We assume throughout that the sample sizes are large enough that it is safe to assume the means have a Normal distribution by the Central Limit Theorem (Box et al. 2005, p. 29; Boos and Hughes-Oliver 2000) even though the population distributions may be quite skewed.
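A minimal sketch of this two-sample comparison follows, with the standard deviation of the difference estimated as the square root of the sum of the two squared standard errors; the data are synthetic placeholders.

```python
import numpy as np

def t_statistic(control: np.ndarray, treatment: np.ndarray) -> float:
    """t = (mean of Treatment - mean of Control) / estimated std-dev of the difference."""
    se_control_sq = control.var(ddof=1) / len(control)        # squared Std-Err of Control mean
    se_treatment_sq = treatment.var(ddof=1) / len(treatment)  # squared Std-Err of Treatment mean
    return (treatment.mean() - control.mean()) / np.sqrt(se_control_sq + se_treatment_sq)

rng = np.random.default_rng(3)
control = rng.exponential(scale=10.0, size=50_000)    # synthetic OEC values for the Control
treatment = rng.exponential(scale=10.5, size=50_000)  # synthetic values with a modest lift

t = t_statistic(control, treatment)
print(f"t = {t:.2f} -> {'reject' if abs(t) > 1.96 else 'retain'} H0 at 95% confidence")
```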
A second formula is a calculation for the minimum sample size, assuming the desired confidence level is 95% and the desired power is 80% (van Belle 2002, p. 31)
- [math]\displaystyle{ n = \frac{16\sigma^2}{\Delta^2} }[/math]
where n is the number of users in each variant and the variants are assumed to be of equal size, [math]\displaystyle{ \sigma^2 }[/math] is the variance of the OEC, and [math]\displaystyle{ \Delta }[/math] is the sensitivity, or the amount of change you want to detect. (It is well known that one could improve the power of comparisons of the treatments to the control by making the sample size of the control larger than for the treatments when there is more than one treatment and you are only interested in the comparison of each treatment to the control. If, however, a primary objective is to compare the treatments to each other then all groups should be of the same size as given by Formula 2.) The coefficient of 16 in the formula provides 80% power, i.e., it has an 80% probability of rejecting the null hypothesis that there is no difference between the Treatment and Control if the true mean of the Treatment differs from the true mean of the Control by [math]\displaystyle{ \Delta }[/math]. Even a rough estimate of standard deviation in Formula 2 can be helpful in planning an experiment. Replace the 16 by 21 in the formula above to increase the power to 90%.
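Formula 2 lends itself to a small calculator; the conversion rate and sensitivity below are hypothetical values used only to illustrate the arithmetic.

```python
def min_sample_size(std_dev: float, sensitivity: float, power_coefficient: int = 16) -> int:
    """n = 16 * sigma^2 / Delta^2 per variant; use 21 instead of 16 for roughly 90% power."""
    return int(round(power_coefficient * std_dev**2 / sensitivity**2))

# Hypothetical example: a 5% conversion rate, detecting a 5% relative change
# (an absolute change of 0.05 * 0.05 = 0.0025 in conversion probability).
p = 0.05
sigma = (p * (1 - p)) ** 0.5                 # std-dev of a Bernoulli conversion metric
delta = 0.05 * p                             # absolute sensitivity to detect
print(min_sample_size(sigma, delta))                         # users per variant at 80% power
print(min_sample_size(sigma, delta, power_coefficient=21))   # users per variant at ~90% power
```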
A more conservative formula for sample size (for 90% power) has been suggested (Wheeler 1974):
- [math]\displaystyle{ n = \left(\frac{4r\sigma}{\Delta}\right)^2 }[/math]
where r is the number of variants (assumed to be approximately equal in size). The formula is an approximation and intentionally conservative to account for multiple comparison issues when conducting an analysis of variance with multiple variants per factor (Wheeler 1975; van Belle 2002). The examples below use the first formula.
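For comparison, a minimal sketch of Wheeler's conservative formula, using the same hypothetical conversion-rate setup as in the sketch above.

```python
def wheeler_sample_size(num_variants: int, std_dev: float, sensitivity: float) -> int:
    """n = (4 * r * sigma / Delta)^2 per variant; intentionally conservative, roughly 90% power."""
    return int(round((4 * num_variants * std_dev / sensitivity) ** 2))

p = 0.05
sigma, delta = (p * (1 - p)) ** 0.5, 0.05 * p
for r in (2, 3, 4):  # Control plus one, two, or three Treatments
    print(f"r={r}: Wheeler n per variant = {wheeler_sample_size(r, sigma, delta):,}")
```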
References
| Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|
| Ron Kohavi; Roger Longbotham; Dan Sommerfield; Randal M. Henne | 18(1) | 2009 | Controlled Experiments on the Web: Survey and Practical Guide | | Data Mining and Knowledge Discovery | | 10.1007/s10618-008-0114-1 | 2009 ControlledExperimentsOnTheWeb | 2009 |