2020 CausalMetaMediationAnalysisInfe
- (Wang, Yin et al., 2020) ⇒ Zenan Wang, Xuan Yin, Tianbo Li, and Liangjie Hong. (2020). “Causal Meta-Mediation Analysis: Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments.” In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Subject Headings:
Notes
Cited By
Quotes
Author Keywords
causal inference; meta-analysis; mediation analysis; experiment; dose-response function; A/B test; evaluation metric; business KPI.
Abstract
It is common in the internet industry to use offline-developed algorithms to power online products that contribute to the success of a business. Offline-developed algorithms are guided by offline evaluation metrics, which are often different from online business key performance indicators (KPIs). To maximize business KPIs, it is important to pick a north star among all available offline evaluation metrics. By noting that online products can be measured by online evaluation metrics, the online counterparts of offline evaluation metrics, we decompose the problem into two parts. As the offline A/B test literature works out the first part: counterfactual estimators of offline evaluation metrics that move the same way as their online counterparts, we focus on the second part: causal effects of online evaluation metrics on business KPIs. The north star of offline evaluation metrics should be the one whose online counterpart causes the most significant lift in the business KPI. We model the online evaluation metric as a mediator and formalize its causality with the business KPI as dose-response function (DRF). Our novel approach, causal meta-mediation analysis, leverages summary statistics of many existing randomized experiments to identify, estimate, and test the mediator DRF. It is easy to implement and to scale up, and has many advantages over the literature of mediation analysis and meta-analysis. We demonstrate its effectiveness by simulation and implementation on real data.
1 INTRODUCTION
Nowadays it is common in the internet industry to develop algorithms that power online products using historical data. The algorithm that improves evaluation metrics computed from historical data will be tested against the one that has been in production to assess the lift in key performance indicators (KPIs) of the business in online A/B tests. Here we refer to metrics calculated from historical data as offline metrics and metrics calculated in online A/B tests as online metrics. In many cases, offline evaluation metrics are different from online business KPIs. For instance, a ranking algorithm, which powers search pages in e-commerce platforms, typically optimizes for relevance by predicting purchase or click probabilities of items. It could be tested offline (offline A/B tests) for rank-aware evaluation metrics, for example, normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), or mean average precision (MAP), which are calculated from the test set of historical purchase or click-through feedback of users. Most e-commerce platforms, however, deem sitewide gross merchandise value (GMV) as their business KPI and test for it online. There could be various reasons not to directly optimize for business KPIs offline or use business KPIs as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty. Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners, because it is not clear which offline evaluation metric should be adopted to guide the offline development of algorithms in order to maximize online business KPIs.
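For concreteness, the sketch below shows how the rank-aware metrics mentioned above are typically computed from a single ranked list of historical feedback. This is a minimal illustration of the standard definitions, not code from the paper; the list `ranked_feedback` and the function names are hypothetical.

```python
# Illustrative sketch (not from the paper): rank-aware offline metrics for one
# ranked list of binary purchase/click labels from historical feedback.
import math

def dcg(relevance):
    # Discounted cumulative gain with the standard log2 position discount.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

def ndcg(relevance):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevance):
    # 1 / position of the first relevant item; 0 if nothing is relevant.
    for i, rel in enumerate(relevance):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

def average_precision(relevance):
    # Mean of precision@k over the positions k that hold relevant items.
    hits, precisions = 0, []
    for i, rel in enumerate(relevance):
        if rel > 0:
            hits += 1
            precisions.append(hits / (i + 1))
    return sum(precisions) / hits if hits else 0.0

ranked_feedback = [0, 1, 0, 1, 1]   # hypothetical labels for one search
print(ndcg(ranked_feedback), reciprocal_rank(ranked_feedback),
      average_precision(ranked_feedback))
```

Averaging such per-query values over a user's queries, and then over users in the test set, yields NDCG, MRR, and MAP as offline evaluation metrics.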
The challenge essentially asks for the causal effects of increasing offline evaluation metrics on business KPIs, e.g., how business KPIs would change for a 10% increase in an offline evaluation metric. The offline evaluation metric for which a 10% increase could result in the most significant lift in business KPIs should be the north star to guide algorithm development. Algorithms developed offline power online products, and online products contribute to the success of the business (see Figure 1). By noting that online products can be measured by online evaluation metrics, the online counterparts of offline evaluation metrics, we decompose the problem into two parts. The offline A/B test literature (see, e.g., Gilotte et al. [6]) works out the first part (the black arrow): counterfactual estimators of offline evaluation metrics to bridge the inconsistency between changes of offline and online evaluation metrics. We focus on the second part (the red arrow): the causality between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs). The offline evaluation metric whose online counterpart causes the most significant lift in online business KPIs should be the north star. Hence, the question for us becomes: how would business KPIs change for a 10% increase in an online evaluation metric?
- Figure 1: The Causal Path from Algorithms to Business. Algorithms (guided by offline evaluation metrics: NDCG, MRR, MAP, ...) → online products (measured by online evaluation metrics: NDCG, MRR, MAP, ...) → business (measured by online business KPIs: sitewide GMV, ...).
Randomized controlled trials, or online A/B tests, are a popular way to measure the causal effects of online product changes on business KPIs. Unfortunately, they cannot answer our question directly. In online A/B tests, in order to compare the business KPIs caused by different values of an online evaluation metric, we would need to fix the metric at different values for the treatment and control groups. Take the ranking algorithm as an example. If we could fix online NDCG of the search page at 0.22 and 0.2 for the treatment and control groups respectively, then we would know how sitewide GMV would change for a 10% increase in online NDCG at 0.2. However, this experimental design is impossible, because most online evaluation metrics depend on users' feedback and thus cannot be directly controlled.
We address the question by developing a novel causal-inference approach. We model the causality between online evaluation metrics and business KPIs by a dose-response function (DRF) in the potential outcome framework [13, 14]. DRF originates from medicine and describes the magnitude of the response of an organism given different doses of a stimulus. Here we use it to depict the value of a business KPI given different values of an online evaluation metric. Unlike doses of stimuli, values of online evaluation metrics cannot be directly manipulated. However, they could differ between treatment and control groups in experiments whose treatments are not algorithms, such as user interface/user experience (UI/UX) design, marketing, etc. This could be due to the "fat hand" [19, 29] nature of online A/B tests, whereby a single intervention can change many causal variables at once. A change of the tested feature, which is not an algorithm, could induce users to change their engagement with algorithm-powered online products, so that values of online evaluation metrics would change. For instance, in an experiment of UI design, users might change their search behaviors because of the new UI design, so that values of online NDCG, which depends on search interaction, would change, even though the ranking algorithm does not change. The evidence suggests that online evaluation metrics could be mediators that (partially) transmit causal effects of treatments on business KPIs in experiments where treatments are not necessarily algorithm-related. Hence, we formalize the problem as the identification, estimation, and test of the mediator DRF.
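To make the mediator-DRF framing concrete, the following is a minimal formalization in potential-outcome notation. The symbols are illustrative shorthand for this entry, not necessarily the paper's own notation.

```latex
% Sketch of the mediator dose-response function in potential-outcome notation.
% Y_i(m), \mu(m), and \Delta(m_0) are illustrative shorthand.
\begin{align*}
  Y_i(m) &:\ \text{potential business KPI of experimental unit } i
            \text{ if its online evaluation metric were set to } m,\\
  \mu(m) &:= \mathbb{E}\big[\,Y_i(m)\,\big]
            \quad\text{(mediator dose-response function)},\\
  \Delta(m_0) &:= \mu(1.1\,m_0) - \mu(m_0)
            \quad\text{(the KPI lift from a 10\% increase at baseline } m_0\text{)}.
\end{align*}
```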
In the mediation analysis literature, there are two popular identification techniques: sequential ignorability (SI) and instrumental variables (IV). SI assumes each potential mediator is independent of all potential outcomes conditional on the assigned treatment, whereas IV permits dependence between unknown factors and mediators but forbids the existence of direct effects of the treatment. Rather than making these stringent assumptions, we leverage trial characteristics to explain the average direct effect (ADE) in each experiment so that we can tease it out of the average treatment effect (ATE) to identify the causal mediation. The utilization of trial characteristics means we have to use data from many trials, because we need variation in trial characteristics. Hence, we develop our framework as a meta-analysis and propose an algorithm that only uses summarized results from many existing experiments, gaining the advantages of easy implementation and scalability.
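The excerpt does not spell out the estimator, so the sketch below only illustrates the general idea described above: across many experiments, the observed ATE on the KPI is modeled as an ADE part explained by trial characteristics plus an indirect part driven by how much the treatment moved the mediator, with the mediator DRF parameterized and recovered by regression on trial-level summary statistics. The quadratic DRF, the linear ADE model, and all variable names are assumptions for illustration, not the paper's specification.

```python
# Rough illustration (not the paper's estimator) of recovering a mediator
# dose-response function from trial-level summary statistics.
# Assumed per-experiment summaries: ATE on the KPI (tau), mean mediator value
# in control (m_c) and treatment (m_t), and trial characteristics x.
# Assumed model: tau_j = x_j @ gamma              (average direct effect, ADE)
#                       + f(m_t_j) - f(m_c_j)     (indirect effect via mediator)
# with a hypothetical quadratic DRF f(m) = b1*m + b2*m**2.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 500

# Simulated trial-level summary statistics.
x = rng.normal(size=(n_trials, 3))                # trial characteristics
m_c = rng.uniform(0.1, 0.4, size=n_trials)        # control-group mediator mean
m_t = m_c + rng.normal(0, 0.03, size=n_trials)    # treatment shifts the mediator
gamma_true, b1_true, b2_true = np.array([0.5, -0.2, 0.1]), 2.0, -1.5
tau = (x @ gamma_true
       + b1_true * (m_t - m_c) + b2_true * (m_t**2 - m_c**2)
       + rng.normal(0, 0.05, size=n_trials))      # noisy observed ATEs

# Regress ATEs on trial characteristics (absorbing ADE) and mediator contrasts.
design = np.column_stack([x, m_t - m_c, m_t**2 - m_c**2])
coef, *_ = np.linalg.lstsq(design, tau, rcond=None)
b1_hat, b2_hat = coef[3], coef[4]
print("recovered DRF coefficients:", b1_hat, b2_hat)

# The fitted DRF answers the motivating question, e.g., the KPI lift from a
# 10% increase in the online metric at a baseline value of 0.2.
m0 = 0.2
lift = (b1_hat * 1.1 * m0 + b2_hat * (1.1 * m0) ** 2) - (b1_hat * m0 + b2_hat * m0**2)
print("estimated KPI lift for a 10% increase at m0=0.2:", lift)
```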
Most meta-analyses rely on summarized results from different studies with different raw data sources. Therefore, it is almost impossible to learn much beyond the distribution of ATEs. Fortunately, the internet industry produces plentiful randomized trials with consistently defined metrics, and thus presents an opportunity for performing a more sophisticated meta-analysis. The literature is lacking in this area, and we create the framework of causal meta-mediation analysis (CMMA) to fill the gap.
Another prominent strength of our approach in real applications is that, for a new product that has been shipped online but has few A/B tests of its own, it is plausible to explore the causality between its online metrics and business KPIs from the many A/B tests of other products. The values of the new product's online metrics can differ between treatment and control groups in experiments of other products ("fat hand" [19, 29]), which makes it possible to solve for the mediator DRF of the new product without its own A/B tests.
Note that our approach can be applied to any evaluation metric that is defined at the experimental-unit level, like the metrics discussed in the offline A/B test literature. The experimental unit means the unit of randomization in online A/B tests. For example, in search page experiments, the experimental unit is typically the user. Also, the evaluation metric can be any combination of existing experimental-unit-level metrics.
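As a small illustration of what "experimental-unit-level" means here, the snippet below computes a per-user combination of two unit-level metrics and collapses it into the per-arm trial-level summaries that a meta-analysis over many experiments would consume. All names and values are hypothetical placeholders, not the paper's data or code.

```python
# Hypothetical illustration: experimental-unit-level metrics, their combination,
# and per-arm trial-level summary statistics for one experiment.
from statistics import mean

users = [
    # (arm, per-user NDCG, per-user MRR) -- placeholder values
    ("control", 0.20, 0.35), ("control", 0.18, 0.30),
    ("treatment", 0.23, 0.36), ("treatment", 0.21, 0.40),
]

def combined_metric(ndcg, mrr, w=0.5):
    # Any combination of unit-level metrics is itself a unit-level metric,
    # e.g., a weighted average of NDCG and MRR (the weight w is arbitrary here).
    return w * ndcg + (1 - w) * mrr

summary = {
    arm: mean(combined_metric(n, m) for a, n, m in users if a == arm)
    for arm in ("control", "treatment")
}
print(summary)  # trial-level summary of the online metric per experiment arm
```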
To summarize, our contributions in this paper include:
(1) This is the first study that offers a framework to choose the north star among all available offline evaluation metrics for algorithm development to maximize business KPIs when offline evaluation metrics and business KPIs are different. We decompose the problem into two parts. Since the offline A/B test literature works out the first part: counterfactual estimators of offline evaluation metrics to bridge the inconsistency between changes of offline and online metrics, we work out the second part: inferring causal effects of online evaluation metrics on business KPIs. The offline evaluation metric whose online counterpart causes the most significant lift in business KPIs should be the north star. We show the implementation of our framework on data from Etsy.com.
(2) Our novel approach CMMA combines mediation analysis and meta-analysis to identify, estimate, and test the mediator DRF. It relaxes the standard SI assumption and overcomes the limitation of IV, both of which are popular in the causal mediation literature. It extends meta-analysis to solve causal mediation, whereas the meta-analysis literature only learns the distribution of ATE. We demonstrate its effectiveness by simulation and show its performance is superior to other methods.
(3) Our novel approach CMMA uses only trial-level summary statistics (i.e., meta-data) of many existing trials, which makes it easy to implement and to scale up. It can be applied to all experimental-unit-level evaluation metrics or any combination of them. Because it solves the causality problem of a product by leveraging trials of all products, it could be particularly useful in real applications for a new product that has been shipped online but has few A/B tests.
2 LITERATURE REVIEW
We draw on two strands of literature: mediation analysis and meta-analysis. We briefly discuss them in turn.
2.1 Mediation Analysis
Our framework expands on causal mediation analysis. Mediation analysis is actively conducted in various disciplines, such as psychology [15, 24], political science [7, 12], economics [9], and computer science [16]. A recent application in the internet industry reveals that the performance of a recommendation system could be cannibalized by search in an e-commerce website [29]. Mediation analysis originates from the seminal paper of Baron and Kenny [3], where they proposed a parametric estimator based on the linear structural equation model (LSEM). LSEM, by far, is still widely used by applied researchers because of its simplicity. Since then, Robins and Greenland [21], Pearl [16], and other causal inference researchers have formalized the definition of causal mediation and pinpointed assumptions for its identification [17, 20, 22] in various complicated scenarios. The progress features extensive usage of structural equation models and causal diagrams (e.g., NPSEM-IE of Pearl [16] and FRCISTG of Robins [20]).
As researchers extend the potential outcome framework of Rubin [23] to causal mediation, alternative identification and more general estimation strategies have been developed. Imai et al. [12] achieved the non-parametric identification and estimation of mediation effects of a single mediator under the assumption of SI. After analyzing other well-known models such as LSEM [3] and FRCISTG [20], they concluded that the assumptions of most models can be either boiled down to or replaced by SI. However, SI is stringent, which ignites many in-depth discussions around it (see, e.g., the discussion between Pearl [17, 18] and Imai et al. [11]).
Another popular identification strategy for causal mediation is IV, which is a signature technique in economics [1, 2]. Sobel [26] used treatment as an IV to identify mediation effects without SI. However, as Imai et al. [12] pointed out, IV assumptions may be undesirable because they require that all causal effects of the treatment pass through the mediator (i.e., complete mediation [3]). Small [25] proposed a new method to construct IVs that allows direct effects of the treatment (i.e., partial mediation [3]) but assumes that the ADE of the treatment is the same for different segments of the population.
2.2 Meta-Analysis
Our method only uses summary statistics of many past experiments. Analyzing summarized results from many experiments is termed meta-analysis and is common in analytical practice [5, 27]. In the literature, meta-analysis is used for mitigating the problem of external validity in a single experiment and for learning knowledge that is hard to recover when analyzing data in isolation, such as heterogeneous treatment effects (see, e.g., Browne and Jones [4], Higgins and Thompson [10]). Besides, a significant advantage of meta-analysis is that it is easy to scale, because it only takes summarized results from many different experiments.
Peysakhovich and Eckles [19] took one step toward performing mediation analysis using data from many experiments. They used treatment assignments as IVs to identify causal mediation, which is similar to Sobel [26], but lacked a justification for why more than one experiment is needed and did not address the limitations of IV that we discussed above. Our framework shows that having access to many experiments enables identifying causal mediation without SI and overcoming the limitation of IV, both of which are hard to achieve with only one experiment.
References
- [1] Joshua Angrist, Guido Imbens, and Donald Rubin. 1996. Identification of Causal Effects Using Instrumental Variables. J. Amer. Statist. Assoc. 91, 434 (6 1996), 444.
- [2] Joshua Angrist and Alan Krueger. 2001. Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Journal of Economic Perspectives 15, 4 (11 2001), 69–85.
- [3] Reuben Baron and David Kenny. 1986. The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51, 6 (1986), 1173–1182.
- [4] Will Browne and Mike Jones. 2017. What works in e-commerce - a meta-analysis of 6700 online experiments. Qubit Digital Ltd (2017), 1–21.
- [5] Harris Cooper, Larry Hedges, and Jeffrey Valentine. 2009. The handbook of research synthesis and meta-analysis. Russell Sage Foundation.
- [6] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18). Association for Computing Machinery, New York, NY, USA, 198–206.
- [7] Donald Green, Shang Ha, and John Bullock. 2010. Enough already about "Black Box" experiments: Studying mediation is more difficult than most scholars suppose. Annals of the American Academy of Political and Social Science 628, 1 (2010), 200–208.
- [8] William Greene. 2011. Econometric analysis (7 ed.). Pearson Education Inc. 1232 pages.
- [9] James Heckman and Rodrigo Pinto. 2015. Econometric Mediation Analyses: Identifying the Sources of Treatment Effects from Experimentally Estimated Production Technologies with Unmeasured and Mismeasured Inputs. Econometric Reviews 34 (2015), 6–31.
- [10] Julian Higgins and Simon Thompson. 2002. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21, 11 (6 2002), 1539–1558.
- [11] Kosuke Imai, Luke Keele, Dustin Tingley, and Teppei Yamamoto. 2014. Comment on Pearl: Practical implications of theoretical results for causal mediation analysis. Psychological Methods 19, 4 (2014), 482–487.
- [12] Kosuke Imai, Luke Keele, and Teppei Yamamoto. 2010. Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statist. Sci. (2010).
- [13] Guido Imbens. 2000. The Role of the Propensity Score in Estimating Dose-Response Functions. Biometrika 87, 3 (2000), 706–710.
- [14] Guido Imbens and Keisuke Hirano. 2004. The Propensity Score with Continuous Treatments. (2004).
- [15] David MacKinnon, Amanda Fairchild, and Matthew Fritz. 2006. Mediation Analysis. Annual Review of Psychology 58, 1 (12 2006), 593–614.
- [16] Judea Pearl. 2001. Direct and indirect effects. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 411–420.
- [17] Judea Pearl. 2014. Interpretation and identification of causal mediation. Psychological Methods 19, 4 (2014), 459–481.
- [18] Judea Pearl. 2014. Reply to Commentary by Imai, Keele, Tingley, and Yamamoto Concerning Causal Mediation Analysis. Psychological Methods 19, 4 (2014), 488–492.
- [19] Alexander Peysakhovich and Dean Eckles. 2018. Learning causal effects from many randomized experiments using regularized instrumental variables. In The Web Conference 2018 (WWW 2018). ACM, New York, NY.
- [20] James Robins. 2003. Semantics of causal DAG models and the identification of direct and indirect effects. Highly Structured Stochastic Systems (1 2003), 70–82.
- [21] James Robins and Sander Greenland. 1992. Identifiability and exchangeability for direct and indirect effects. Epidemiology 3, 2 (1992), 143–155.
- [22] James Robins and Thomas Richardson. 2010. Alternative graphical causal models and the identification of direct effects. Causality and psychopathology: finding the determinants of disorders and their cures (2010).
- [23] Donald Rubin. 2003. Basic concepts of statistical inference for causal effects in experiments and observational studies. (2003).
- [24] Derek Rucker, Kristopher Preacher, Zakary Tormala, and Richard Petty. 2011. Mediation Analysis in Social Psychology: Current Practices and New Recommendations. Social and Personality Psychology Compass 5, 6 (2011), 359–371.
- [25] Dylan Small. 2012. Mediation analysis without sequential ignorability: Using baseline covariates interacted with random assignment as instrumental variables. Journal of Statistical Research 46, 2 (2012), 91–103.
- [26] Michael Sobel. 2008. Identification of Causal Parameters in Randomized Studies With Mediating Variables. Journal of Educational and Behavioral Statistics 33, 2 (2008), 230–251.
- [27] Tom Stanley and Hristos Doucouliagos. 2012. Meta-regression analysis in economics and business. Routledge.
- [28] Jeffrey Wooldridge. 2010. Econometric analysis of cross section and panel data. MIT Press, Cambridge, MA. 1096 pages.
- [29] Xuan Yin and Liangjie Hong. 2019. The Identification and Estimation of Direct and Indirect Effects in A/B Tests Through Causal Mediation Analysis. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 2989–2999.
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Liangjie Hong; Zenan Wang; Xuan Yin; Tianbo Li | | 2020 | Causal Meta-Mediation Analysis: Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments | | | | | 2020 CausalMetaMediationAnalysisInfe | 2020