DataFun Summit #2023
Deep-Learning-Based Causal Inference for the Combined Effects of Multiple Experiments
张任宇 (Renyu "Philip" Zhang), Kuaishou Economist; Associate Professor, The Chinese University of Hong Kong
Oct 21, 2023

Who Am I?
张任宇 Philip Zhang: a "three-in-one" data scientist and junior faculty member in operations management: scholar, teacher, and Internet-industry practitioner.
o Kuaishou Economist (2018-)
o Associate Professor (tenured), CUHK Business School (2021-); research on applications of data science to strategy evaluation and optimization for Internet platforms; teaches data science applications in business to undergraduate, master's, PhD, and EMBA students
o Assistant Professor, NYU Shanghai (2016-2022)
o PhD in Operations Management, Washington University in St. Louis (2011-2016)
o BS in Mathematics, Peking University (2007-2011)

Deep Learning Meets Double Machine Learning: Causal Inference for Large-Scale Combinatorial Experiments
Renyu (Philip) Zhang, Kuaishou Economist Team and CUHK Business School
(Based on the joint work with Zikun Ye, Zhiqi Zhang, Dennis Zhang, and Heng Zhang)

Outline of the Talk
o Introduction: Problem, Solution and Contributions
o Theory: Debiased Deep Learning and Asymptotics
o Empirics: Validations with Field Experiment Data

[Figure: number of experiments per week on the platform]

Launch Two Experiments to Increase Watching Time
o Exp 1: Get Rewards Sticker
o Exp 2: Send Gift Sticker

Multiple A/B Tests
o Conditions: Control, Exp 1, Exp 2, Exp 1 + Exp 2
o Major Question: How to estimate and infer the combined treatment effect of multiple A/B tests?
Solution 1: Linear Addition
o Effect of (Exp 1 + Exp 2) = Effect of Exp 1 + Effect of Exp 2
o Limitations: non-linearity of treatment effects
  o Marginal decreasing: negative interaction
  o Marginal increasing: positive interaction
o Conditions: Control (No Button), Exp 1 (Reward Sticker), Exp 2 (Gift Sticker)

Solution 2: Full Factorial Design
o Run all treatment combinations: Control (No Sticker), Exp 1 (Reward Sticker), Exp 2 (Gift Sticker), Exp 1 + Exp 2 (Both Stickers)
o Limitations: exponentially many treatment combinations
  o m binary-level tests generate 2^m combinations
  o Impossible to assign even one user to each combination if m ≥ 30
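A quick back-of-the-envelope sketch (plain Python; the user-pool size is an illustrative figure, not the platform's) of why full factorial assignment breaks down as the number of concurrent binary tests m grows:

```python
# Number of treatment combinations for m concurrent binary A/B tests,
# versus a hypothetical pool of 400 million users (illustrative figure only).
users = 400_000_000

for m in (3, 10, 20, 30, 40):
    combos = 2 ** m
    # Average users available per combination under full factorial assignment.
    per_cell = users / combos
    print(f"m={m:2d}: 2^m = {combos:>15,d} combinations, ~{per_cell:,.2f} users per cell")
```

At m = 30 there are already more than a billion cells, so even a very large platform cannot place one user in every combination.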
Solution 3: Machine/Deep Learning
o Directly predict the individual outcome under each treatment combination using end-to-end machine/deep learning (a.k.a. uplift modeling / meta learners)
o Limitations:
  o Not a valid inference method
  o Hard to obtain managerial insights
Research Questions
Without observing outcomes of all treatment combinations:
o Average Treatment Effects: How to estimate and infer the average treatment effect of any treatment combination?
o Best-Treatment Identification: How to identify the optimal treatment combination with the highest effect?
o With minimum interference to the current experimentation platform

Summary: Our Solution and Contributions
o Methodology: Deep Learning captures experimental interactions and heterogeneity; Double Machine Learning (Chernozhukov et al. 2018) ensures valid inference
o Empirical Evaluation: on data from large-scale field experiments
o Contribution 1: A framework to solve the important problem with theoretical guarantees (valid inference and confidence intervals)
o Contribution 2: A structured deep neural network to capture interactions and heterogeneity
o Contribution 3: First to validate double machine learning using large-scale experimental data
Related Literature
o Double/de-biased machine learning (DML): correct the bias of ML estimators through Neyman orthogonal score functions + cross-fitting. Newey (1994), Chernozhukov et al. (2018), Farrell et al. (2020, 2021), Ellickson et al. (2022), Gordon et al. (2022).
o Valid estimation and inference with multiple experiments: Azevedo et al. (2020), Dasgupta et al. (2015), Athey et al. (2021), Pashley and Bind (2019), etc.
o Experiments on online platforms: evaluating and optimizing the strategies of a large-scale online platform. Ye et al. (2022), Zeng et al. (2022), Zhang et al. (2020), Cui et al. (2019, 2020), Feldman et al. (2021), Schwartz et al. (2017), etc.

Outline of the Talk
o Introduction: Problem, Solution and Contributions
o Theory: Debiased Deep Learning and Asymptotics
o Empirics: Validations with Field Experiment Data
Deep Learning Framework: Setup
o The platform runs m A/B tests, each with a binary treatment. We use T = (T_1, T_2, ..., T_m) ∈ {0,1}^m to denote a treatment combination.
o E.g., for m = 4, T = (0, 0, 1, 0) represents that the user is in the control condition of experiments 1, 2, and 4, and in the treatment condition of experiment 3.
o Outcome Y and covariates X; observations {(X_i, T_i, Y_i)}_{i=1}^n.
o Assumption: the observed data {(X_i, T_i, Y_i)}_{i=1}^n is generated by a semiparametric form Y = G(θ(X), T) + ε, with noise ε of zero conditional mean.
  o θ(X) summarizes the user's covariates (high-dimensional).
  o G(θ(X), T) is the link function that combines θ(X) and T for predicting outcomes; it is the key to capturing the interaction between experiments.
  o θ(·) is unknown and learned by DL models.

Why Do We Need a Link Function?
o As the complexity of G increases: higher accuracy of ATE (with infinite data), higher identification requirements, higher computation cost.
o Model selection:
  o Linear link: G = θ_0(X) + θ_1(X)T_1 + θ_2(X)T_2 + ... + θ_m(X)T_m, with an (m+1)-dimensional θ(X).
  o Fully interacted link: G = θ_0(X) + θ_1(X)T_1 + ... + θ_m(X)T_m + all higher-order interaction terms, with a 2^m-dimensional θ(X).
17、2-dimensional()2)17ChoiceChoice ofof LinkLink FunctionFunctionWe adopt the following Generalized Sigmoid Link Function:(),)=+1()1+exp(0()+1()1+()which captures both marginal increasing and decreasing outcomes,and any possible ranges of potential outcomes.where()=(0(),1(),+1()+218ChoiceChoice ofof Li
18、nkLink FunctionFunctionWe adopt the following Generalized Sigmoid Link Function:(),)=+1()1+exp(0()+1()1+()which captures both marginal increasing and decreasing outcomes,and any possible ranges of potential outcomes.19ChoiceChoice ofof LinkLink FunctionFunctionGeneralized Sigmoid Link Function:(),)=
19、3()1+exp(0()+1()1+2()2)marginal decreasing with two experiments=(1,2)+0()+1()1+2()2=(0,0)=(1,0)=(1,1)20How Restrictive is This G function?o This link function is person-specific(depends on).Any ATE can be approximated by the generalized sigmoid link function with an arbitrarily small error.o This li
20、nk function does not assume linear additivity of treatment effects.o Fundamentally,this is a statistical assumption that we need to“test”empirically.o We also tested other link functions.Generalized Sigmoid Link Function:(),)=+1()1+exp(0()+1()1+()21GoalsGoals andand Two-stageTwo-stage MethodMethodGi
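To make the link function concrete, here is a minimal NumPy sketch (illustrative only, with made-up parameter values, not the authors' code) showing that the generalized sigmoid link produces non-additive combined effects:

```python
import numpy as np

def sigmoid_link(theta, t):
    """Generalized sigmoid link G(theta, t) for one user.

    theta: length m+2 vector (theta_0, theta_1, ..., theta_m, theta_{m+1});
    t:     binary treatment vector of length m.
    """
    theta = np.asarray(theta, dtype=float)
    t = np.asarray(t, dtype=float)
    m = t.size
    index = theta[0] + theta[1:m + 1] @ t     # theta_0 + sum_j theta_j * t_j
    return theta[m + 1] / (1.0 + np.exp(index))

# Two experiments (m = 2) with hypothetical parameters for one user:
theta = [0.5, -1.0, -0.8, 10.0]               # theta_0, theta_1, theta_2, theta_3
y00 = sigmoid_link(theta, [0, 0])
y10 = sigmoid_link(theta, [1, 0])
y01 = sigmoid_link(theta, [0, 1])
y11 = sigmoid_link(theta, [1, 1])
# The joint effect differs from the sum of the individual effects,
# i.e., the link captures (here, diminishing) interactions between experiments.
print(y11 - y00, (y10 - y00) + (y01 - y00))
```

With these placeholder parameters the joint effect (about 4.08) is smaller than the sum of the two individual effects (about 4.42), the marginally decreasing pattern described above.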
Goals and Two-Stage Method
Given the data {(X_i, T_i, Y_i)}_{i=1}^n generated by Y = G(θ(X), T) + ε, with only part of the 2^m possible combinations observed, we are interested in estimating and inferring:
o Average Treatment Effect (ATE) for any t ∈ {0,1}^m: τ(t) = E[G(θ(X), t) − G(θ(X), t_0)], where t_0 = (0, 0, ..., 0).
o Best-Treatment Identification: t* = argmax_{t ∈ {0,1}^m} τ(t).
Two-stage method:
o First stage (Training): get the estimator θ̂(·) from the observed data via deep learning.
o Second stage (Estimation and Inference): construct an asymptotically normal ATE estimator using the first-stage result.

First Stage: Deep Learning
Solving the empirical loss minimization problem gives an empirical estimator of θ(·):
θ̂(·) = argmin_{θ ∈ DNN} (1/n) Σ_{i=1}^n (Y_i − G(θ(X_i), T_i))^2.
[Figure: structured DNN; hidden layers map X to θ(X), which is combined with T through the link G(θ(X), T).]
Note: the link function G is given (fixed), not learned.
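A minimal PyTorch sketch of the first stage (an illustration under assumed tensor shapes and placeholder hyperparameters, not the production implementation): a DNN maps covariates X to θ(X) ∈ R^{m+2}, the fixed sigmoid link combines θ(X) with T, and θ̂ minimizes the empirical squared loss. Here x is an (n, d_x) float tensor, t an (n, m) float tensor of 0/1 assignments, and y an (n,) float tensor.

```python
import torch
import torch.nn as nn

class StructuredDNN(nn.Module):
    """Maps covariates x to theta(x) in R^{m+2}; the link G is fixed, not learned."""
    def __init__(self, d_x, m, hidden=20, layers=2):
        super().__init__()
        dims, blocks = [d_x] + [hidden] * layers, []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, m + 2)   # outputs theta_0, ..., theta_{m+1}

    def forward(self, x):
        return self.head(self.body(x))

def link(theta, t):
    """G(theta(x), t) = theta_{m+1} / (1 + exp(theta_0 + sum_j theta_j t_j))."""
    index = theta[:, 0] + (theta[:, 1:-1] * t).sum(dim=1)
    return theta[:, -1] / (1.0 + torch.exp(index))

def fit(model, x, t, y, epochs=100, lr=1e-3):
    """First stage: empirical squared-loss minimization over the DNN class."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((y - link(model(x), t)) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```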
23、),).Structured DNNNote:link function is given23FirstFirst Stage:Stage:IdentificationIdentification&ConvergenceConvergence Generalized Logistic Link Function:(),)=+1()1+exp(0()+1()1+()*Non-restrictive overlapping condition:It requires to observe(+)out of treatment combinationsPropositionProposition (
24、Informal)(Informal)Under mild regularity and network size assumptions on DNN and the non-restrictive overlapping condition of observed treatments*,o()can be nonparametrically identified.o()converges to()sufficiently fast for inference in the 2nd stage(with subsequent debias),i.e.,()()=(1/4).24Overla
25、ppingOverlapping ConditionCondition (m=4)(m=4)=0=1Experiments withSingle TreatmentOverlapping condition1432(),)=+1()1+exp(0()+1()1+()2m+SecondSecond Stage:Stage:Nave ApproachNave ApproachAfter getting estimator()of(),one can immediately constructPlug-in Estimator for ATE at any 0,1:()=1=1 (),(),ATE:
26、()=(),)(),),where=(0,0,0)Best-Treatment Identification=argmax0,1()26WeWe NNeedeed toto DebiasDebias ()A critical issue with the Plug-in estimator:biased estimator due to the large bias from()Why()has the large bias?o Regularization in DNNo Optimization erroro High dimensional 27WeWe NNeedeed toto De
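As a quick illustration of the naïve approach (reusing the hypothetical link and model helpers from the first-stage sketch above), the plug-in estimator is just an average of predicted contrasts; it inherits whatever bias θ̂(·) carries, which is exactly what the next step corrects.

```python
import torch

def plug_in_ate(model, x, t_target):
    """Naive plug-in estimator: tau_hat(t) = mean_i [G(theta_hat(x_i), t) - G(theta_hat(x_i), t_0)]."""
    n, m = x.shape[0], len(t_target)
    with torch.no_grad():
        theta_hat = model(x)                                    # first-stage estimate theta_hat(x_i)
        t_mat = torch.tensor(t_target, dtype=torch.float32).repeat(n, 1)
        t0_mat = torch.zeros(n, m)                              # t_0 = (0, ..., 0), all-control baseline
        return (link(theta_hat, t_mat) - link(theta_hat, t0_mat)).mean().item()
```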
We Need to Debias θ̂(·)
A critical issue with the plug-in estimator: it is biased because of the large bias of θ̂(·).
Solution: double/debiased machine learning, which uses Neyman orthogonality and cross-fitting.

Intuition behind Neyman Orthogonality
[Figure: if the first-stage error were ||θ̂ − θ|| = o_P(n^{−1/2}) (small bias), the plug-in estimator would already be unbiased; at the achievable rate ||θ̂ − θ|| = o_P(n^{−1/4}) (large bias), the plug-in estimator is biased, but the orthogonal estimator is still unbiased. Neyman orthogonality: as long as θ̂(·) lands in this "large-bias but fast-enough" region, we can find an unbiased orthogonal estimator for the ATE.]

Second Stage: Neyman Orthogonality
We need to find a score function ψ that satisfies the following conditions:
o Zero-order moment condition: τ(t) = E[ψ(W, t; θ)], where τ(t) is the ATE, ψ is the score function, W is the data, and θ is the nuisance parameter.
o First-order condition (Neyman orthogonality): ∂_θ E[ψ(W, t; θ)] = 0.
o Neyman orthogonality implies that the estimator is robust w.r.t. first-order errors in θ.
Remark: the score function is derived based on the pathwise derivative approach in semi-parametric statistics (Newey 1994, Chernozhukov et al. 2018, Farrell et al. 2020).

Theorem. Under mild regularity assumptions, for any t ∈ {0,1}^m, ψ(W, t; θ) is a Neyman orthogonal score, where W = (X, T, Y) is the collected data,
Λ(X) = E[∇_θ G(θ(X), T) ∇_θ G(θ(X), T)ᵀ | X],
H(W, t; θ) = G(θ(X), t) − G(θ(X), t_0),
ψ(W, t; θ) = H(W, t; θ) + ∇_θ H(W, t; θ)ᵀ Λ(X)⁻¹ ∇_θ G(θ(X), T) (Y − G(θ(X), T)).
The orthogonal score function decomposes into a plug-in term, H(W, t; θ), and a de-bias term, the remaining correction.
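To illustrate the plug-in + de-bias structure, here is a sketch of the per-observation orthogonal score as reconstructed above (illustrative code continuing the earlier hypothetical PyTorch helpers, not the authors' released implementation). Λ(X) is formed from assumed known assignment probabilities over the observed combinations; the overlapping condition (at least m + 2 observed combinations) is what keeps it invertible.

```python
import torch

def grad_link(theta, t):
    """Closed-form gradient of the sigmoid link G w.r.t. theta (theta: (n, m+2), t: (n, m))."""
    index = theta[:, 0] + (theta[:, 1:-1] * t).sum(dim=1)            # theta_0 + sum_j theta_j t_j
    s = 1.0 / (1.0 + torch.exp(index))                               # G = theta_{m+1} * s
    d_index = -(theta[:, -1] * s * (1.0 - s)).unsqueeze(1)           # dG / d(index), shape (n, 1)
    return torch.cat([d_index, d_index * t, s.unsqueeze(1)], dim=1)  # (n, m+2)

def orthogonal_scores(theta_hat, t_obs, y, t_target, probs):
    """Per-user orthogonal scores psi_i(t) = plug-in term + de-bias term (see the theorem above).

    probs: dict mapping each observed combination (tuple of 0/1) to its assignment probability.
    """
    n, m = t_obs.shape
    t_tgt = torch.tensor(t_target, dtype=torch.float32).repeat(n, 1)
    t0 = torch.zeros(n, m)

    # Plug-in term: H(W, t; theta) = G(theta(x), t) - G(theta(x), t_0)
    plug_in = link(theta_hat, t_tgt) - link(theta_hat, t0)

    # Lambda(x) approximated as sum over observed combinations of p_t * grad G grad G^T
    lam = torch.zeros(n, m + 2, m + 2)
    for comb, p in probs.items():
        g = grad_link(theta_hat, torch.tensor(comb, dtype=torch.float32).repeat(n, 1))
        lam += p * g.unsqueeze(2) @ g.unsqueeze(1)

    # De-bias term: grad H^T Lambda^{-1} grad G(theta, T_obs) * (Y - G(theta, T_obs))
    grad_h = (grad_link(theta_hat, t_tgt) - grad_link(theta_hat, t0)).unsqueeze(1)   # (n, 1, m+2)
    g_obs = grad_link(theta_hat, t_obs).unsqueeze(2)                                 # (n, m+2, 1)
    resid = y - link(theta_hat, t_obs)                                               # (n,)
    debias = (grad_h @ torch.linalg.solve(lam, g_obs)).reshape(n) * resid
    return plug_in + debias
```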
Cross-Fitting and Asymptotic Normality
o Split the dataset D = {(X_i, T_i, Y_i)} into K = 3 folds.
o For each fold k = 1, 2, 3: train θ̂_k(·) on the other folds, then compute the scores ("infer") on fold k.
o Averaging the out-of-fold scores gives the Debiased Deep Learning (DeDL) estimator.

Second Stage: Debiased Deep Learning (DeDL) Estimator
o ATE estimator: τ̂(t) = (1/n) Σ_{k=1}^K Σ_{i ∈ I_k} ψ(W_i, t; θ̂_k).
o Variance estimator: σ̂²(t) = (1/n) Σ_{k=1}^K Σ_{i ∈ I_k} (ψ(W_i, t; θ̂_k) − τ̂(t))².
o The DeDL-ATE estimator τ̂(t) is available for any t ∈ {0,1}^m, including unobservable treatment combinations.
o (1 − α)-confidence interval: [τ̂(t) − z_{1−α/2} σ̂(t)/√n, τ̂(t) + z_{1−α/2} σ̂(t)/√n].
Theorem. Under non-restrictive regularity assumptions, cross-fitting, and ||θ̂_k(·) − θ(·)||_2 = o_P(n^{−1/4}) for all folds k = 1, ..., K:
√n σ̂(t)⁻¹ (τ̂(t) − τ(t)) →d N(0, 1).
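A schematic cross-fitting loop tying the two stages together (illustrative only, reusing the hypothetical StructuredDNN, fit, and orthogonal_scores helpers sketched earlier):

```python
import torch

def dedl_estimate(x, t_obs, y, t_target, probs, n_folds=3, z=1.96):
    """Cross-fitted DeDL estimate of tau(t) with a normal-approximation 95% CI."""
    n = x.shape[0]
    folds = torch.randperm(n).chunk(n_folds)
    scores = torch.empty(n)
    for k, idx in enumerate(folds):
        train_idx = torch.cat([f for j, f in enumerate(folds) if j != k])
        # First stage on the other folds only, so the nuisance estimate is out-of-fold for fold k.
        model = fit(StructuredDNN(x.shape[1], t_obs.shape[1]),
                    x[train_idx], t_obs[train_idx], y[train_idx])
        with torch.no_grad():
            scores[idx] = orthogonal_scores(model(x[idx]), t_obs[idx], y[idx], t_target, probs)
    tau_hat = scores.mean()                       # DeDL ATE estimate for combination t_target
    se = scores.std() / n ** 0.5                  # standard error from the score variance
    return tau_hat.item(), ((tau_hat - z * se).item(), (tau_hat + z * se).item())
```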
Outline of the Talk
o Introduction: Problem, Solution and Contributions
o Theory: Debiased Deep Learning and Asymptotics
o Empirics: Validations with Field Experiment Data

Empirical Validation of DML-type Estimators
Two types of functional-form assumptions in the DeDL estimator:
o Semi-parametric assumption on the DGP with a link function.
o Smoothness condition on θ(·) for the neural network to converge at the o_P(n^{−1/4}) rate.
We need empirical validation to show that these assumptions are indeed "mild".

Field Experiments Setting
We consider m = 3 concurrent A/B tests on recommendation-system upgrades of three pages: the Live Page, the Discover Page, and the For You Page.
o Experiment time: Jan 10 to Feb 1, 2021
o Sample size (with stratified re-sampling): 2,066,606
o Y: video watching time of users; X: demographic features and past behavior; T = (T_1, T_2, T_3): treatment levels
36、ling)2,066,606:video watching time of users:demographic features and past behavior ,:treatment level35PartiallyPartially ObservedObserved SettingSetting inin PracticePracticeTreatment CombinationObservable or NotRelative Treatment Effect(0,0,0)Observable0.000%*(0,0,1)Observable1.091%*(0,1,0)Observab
37、le-0.267%*(1,0,0)Observable0.758%*(1,1,1)Observable2.121%*(1,1,0)Unobservable?(1,0,1)Unobservable?(0,1,1)Unobservable?p0.05;p0.01;p0.001;p0.0001Hold-out conditions23=8 treatment combinations ,:whether treated in Discover Page,Live Page,For You Page36Ground-TruthGround-Truth ATEsATEsTreatment Combina
38、tionObservable or NotRelative Treatment Effect(0,0,0)Observable0.000%*(0,0,1)Observable1.091%*(0,1,0)Observable-0.267%*(1,0,0)Observable0.758%*(1,1,1)Observable2.121%*(1,1,0)Observable0.689%*(1,0,1)Observable2.299%*(0,1,1)Observable1.387%*p0.05;p0.01;p0.001;p0.000123=8 treatment combinations ,:wheth
39、er treated in Discover Page,Live Page,For You PageWe have also tested out other empirical settings with different observed combinations.37BenchmarksBenchmarkso Linear Addition(LA):Assume that ATEs of different individual treatments are linearly and independently additive.o Industry practiceo Linear
o Linear Regression (LR): regress Y on (X, T) and predict the outcomes of unobservable treatment combinations by linear extrapolation, Y = αᵀX + βᵀT + ε (possibly with interaction terms). Still a linear approach, but better leverages the user features.
o Pure Deep Learning (PDL): regress Y on (X, T) using a deep neural network (DNN).
o Structured Deep Learning (SDL): apply the same structured DNN as DeDL without debiasing, i.e., the plug-in estimator at the second stage.
o Debiased Deep Learning (DeDL, proposed): SDL + Neyman orthogonality. Comparing DeDL with SDL highlights the value of bias correction via orthogonality.

Structured DNN
o Feature input layer: 87 nodes; hidden layers: 20 nodes per layer; output: θ_0(X), θ_1(X), θ_2(X), θ_3(X), θ_4(X).
o Treatment input layer: 3 nodes (T_1, T_2, T_3).
o Index: Z = θ_0(X) + θ_1(X)T_1 + θ_2(X)T_2 + θ_3(X)T_3; prediction: Ŷ = θ_4(X) / (1 + exp(Z)).

Main Result I: DeDL Outperforms All Benchmarks
The first empirical validation of a DML-type estimator with large-scale experiments in practice.
o MAPE of ATE estimates over unobserved treatment combinations: LA 30.06%, LR 4.86%, PDL 6.86%, SDL 14.71%, DeDL 1.75%.
o MAE: LA 4.03, LR 1.85, PDL 1.84, SDL 2.27, DeDL 1.34.
o The result is robust to MAE, MSE, significance levels, and across all treatment combinations as well.
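For clarity, the error metrics behind this comparison can be computed as below (a trivial Python sketch; the estimated ATEs are made-up placeholders, while the ground-truth values are the held-out effects from the table above):

```python
def mape(pred, truth):
    """Mean absolute percentage error of estimated vs. ground-truth ATEs."""
    return 100 * sum(abs(p - t) / abs(t) for p, t in zip(pred, truth)) / len(truth)

def mae(pred, truth):
    """Mean absolute error of estimated vs. ground-truth ATEs."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

# Placeholder estimates vs. ground-truth relative effects for the three held-out combinations.
truth = [0.689, 2.299, 1.387]
pred = [0.70, 2.20, 1.50]          # hypothetical estimates, for illustration only
print(mape(pred, truth), mae(pred, truth))
```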
Main Result II
(a) The parametric link function reduces prediction accuracy (SDL 14.71% MAPE vs. PDL 6.86%).
(b) De-biasing via Neyman orthogonality increases ATE estimation accuracy (DeDL 1.75% MAPE vs. SDL 14.71%).
The benefit of (b) outweighs the cost of (a).

Benefit of Neyman Orthogonality
[Figure: the MAPE comparison above (LA 30.06%, LR 4.86%, PDL 6.86%, SDL 14.71%, DeDL 1.75%), annotated with effects (a) and (b).]

Main Result III: DeDL *works* if the first-stage DNN converges.
MAPE Comparison as the DNN Converges
[Figure: estimation MAPE at the second stage plotted against the first-stage DNN validation loss, for DeDL and SDL.]
o Both the DeDL and SDL estimators yield smaller MAPE as the first-stage DNN loss decreases.
o But the SDL estimator performs poorly even with a well-trained DNN.
o Implication: "double/debiased" ML is not for free; one should still train the ML model well at the first stage.

Robustness Check: Good News
o Large number of experiments: the advantage of DeDL over LA, LR, and SDL is even greater when the number of A/B tests is larger.
o Significant DNN errors (θ̂(·) far from θ(·)): if the link function G(·,·) is correctly specified, DeDL still performs best even when additional large biases are introduced into θ̂(·).
Robustness Check: Bad News
o Misspecified link function G(θ(X), T): if the link function G(·,·) is seriously misspecified, DeDL exacerbates the bad performance of SDL.
o Solutions: 1. trial and error on G via the training loss; 2. generalize the framework to a fully nonparametric setting (Chernozhukov et al. 2022, Colangelo and Lee 2020).

Take-aways
o The DeDL framework provides an estimator with provable theoretical guarantees and good empirical performance for analyzing multiple experiments.
o DML-type estimators work very well in practice with real experiment data (in contrast to Gordon et al., 2022).
o The framework is currently utilized by the platform.
o We are trying to open source the framework, and you can find preliminary code here: https:/

Thank You!
Email: philipzhang@cuhk.edu.hk
Website: rphilipzhang.github.io/rphilipzhang/
WeChat: rphilip_zhang
Download Our Paper