2024 FalsePositivesinABTests

Subject Headings:

Notes

Cited By

Quotes

Abstract

A/B tests, or online controlled experiments, are used heavily in the software industry to evaluate implementations of ideas, as the paradigm is the gold standard in science for establishing causality: the changes introduced in the treatment caused the changes to the metrics of interest with high probability. What distinguishes software experiments, or A/B tests, from experiments in many other domains is their scale (e.g., over 100 experiment treatments may launch on a given workday in large companies) and the small effect sizes that matter to the business (e.g., a 3% improvement to conversion rate from a single experiment is a cause for celebration). The humbling reality is that most experiments fail to improve key metrics, and success rates of only about 10-20% are most common. With low success rates, the industry-standard alpha threshold of 0.05 implies a high probability of false positives. We begin with motivation for why false positives are expensive in many software domains. We then offer several approaches to estimate the true success rate of experiments, given the observed "win" rate (statistically significant positive improvements), and show examples from Expedia and Optimizely. We offer a modified procedure for experimentation, based on group sequential testing, that selectively extends experiments to reduce false positives and increase power, at a small increase in runtime. We conclude with a discussion of the difference between ideas and experiments in practice, terms that are often incorrectly used interchangeably.
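The following is a minimal illustrative sketch (not taken from the paper) of the two back-of-the-envelope calculations the abstract alludes to: applying Bayes' rule to estimate how likely a statistically significant "win" is a false positive when the true success rate is low, and inverting an observed win rate into an implied true success rate. The specific numbers (10% success rate, 80% power, a 12% observed win rate) and function names are assumptions chosen for illustration, not values or methods reported by the authors.

def false_positive_risk(true_success_rate, alpha=0.05, power=0.8):
    """P(no real improvement | significant positive result), via Bayes' rule.

    Assumes a two-sided test at level alpha, so a null experiment produces a
    significant *positive* result with probability alpha / 2. Illustrative only.
    """
    p_win_given_effect = power
    p_win_given_null = alpha / 2
    p_win = (true_success_rate * p_win_given_effect
             + (1 - true_success_rate) * p_win_given_null)
    return (1 - true_success_rate) * p_win_given_null / p_win


def implied_true_success_rate(observed_win_rate, alpha=0.05, power=0.8):
    """Invert observed win rate = pi*power + (1-pi)*alpha/2 to estimate pi."""
    return (observed_win_rate - alpha / 2) / (power - alpha / 2)


if __name__ == "__main__":
    # With only ~10% of ideas truly improving metrics and 80% power, roughly
    # one in five significant "wins" is a false positive under these assumptions.
    print(f"False positive risk at 10% true success rate: {false_positive_risk(0.10):.2f}")
    # An observed win rate of 12% maps back to an implied true success rate of ~12%.
    print(f"Implied true success rate for a 12% win rate: {implied_true_success_rate(0.12):.2f}")

Under these assumed inputs the first calculation gives a false positive risk of about 0.22, which is the kind of figure that motivates adjusting the standard alpha = 0.05 procedure; the second is one simple way to estimate the true success rate from an observed win rate, in the spirit of (but not identical to) the approaches the paper describes.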

References

Ron Kohavi, and Nanyu Chen. (2024). "False Positives in A/B Tests." doi:10.1145/3637528.3671631