Multiply robust estimation of causal effects using linked data

Shanshan Luo (1), Yechi Zhang (2), and Wei Li (2)
(1) School of Mathematics and Statistics, Beijing Technology and Business University
(2) Center for Applied Statistics and School of Statistics, Renmin University of China

Shanshan Luo (BTBU), Robust estimation using linked data, 2023.10.21

Table of Contents
1. Introduction
2. Set up
3. Estimation
4. Design Issue
5. Numerical studies
6. Conclusions

Unmeasured Confounding and Data Linkage I
- Unmeasured confounding remains a persistent challenge in observational studies, leading to biased estimates of causal parameters.
- In the current era of big data, the increasing availability of diverse data sources offers potential remedies.
- Among these, data linkage is a promising approach to mitigating the impact of unmeasured confounding in a primary study of interest.

Unmeasured Confounding and Data Linkage II
- For instance, in healthcare research, linking a claims database from a health plan with an electronic health record database from a delivery system can yield richer patient data.
- The resulting linked cohort, comprising patients present in both data sources, offers an opportunity to improve estimation by incorporating pivotal confounding factors.

Challenges
- However, data linkage may introduce selection bias, because studies conducted within linked databases are often restricted to a subset of the primary study population.
- If this subset does not adequately represent the entire population, restricting the analysis to the linked data may invalidate findings and limit their generalizability (Sun et al., 2022).

Previous studies I
The problem of integrating inferences from primary and supplementary datasets to control for unmeasured confounding has received substantial attention:
1. Data fusion, especially when more detailed confounding information becomes available from an alternative study (Stürmer et al., 2005; McCandless et al., 2012; Lin and Chen, 2014).
2. Two-phase sampling designs (Chatterjee et al., 2003; Wang et al., 2009).
3. Missing data, especially missing covariates (Williamson et al., 2012; Evans et al., 2020).

Previous studies II
- Yang and Ding (2020) introduced a general framework for estimating causal effects by combining a primary dataset with unmeasured confounders and a smaller validation dataset that contributes additional confounding information.
- For scenarios involving heterogeneity between the two data sources, Sun et al. (2022) proposed an estimator that employs two sets of weights to estimate the causal effect.
- However, the estimators proposed by Yang and Ding (2020) and Sun et al. (2022) may not be the most efficient, particularly when the two datasets are heterogeneous.

Our contributions I
1. This paper focuses on semiparametric efficient estimation of the average treatment effect (ATE) using a linked database that may be heterogeneous relative to the primary population.
2. We establish three nonparametric identification formulas and develop three corresponding semiparametric estimators.
3. We derive the efficient influence function (EIF) for the ATE within a nonparametric model and introduce a semiparametric estimator that has the triply robust property (Bickel et al., 1993).

Our contributions II
4. Moreover, we apply the proposed estimator to identify optimal sampling designs by minimizing the asymptotic variance under specific cost constraints.
5. We also discuss the connection between the proposed method and existing approaches for handling missing data, in which individuals outside the linked cohort can be viewed as having missing values for certain confounding variables.

Notation I
- Suppose we are interested in estimating the ATE of a binary treatment $Z$ on an outcome $Y$ within a primary study of interest.
- Let $Y_z$ denote the potential outcome that would be observed if the treatment $Z$ were set to $z$, for $z = 0, 1$.
- Under the causal consistency assumption, the observed outcome can be expressed as $Y = Z Y_1 + (1 - Z) Y_0$.
- The ATE is denoted as $\tau = E(Y_1 - Y_0)$.

Notation II
- The primary dataset collects the treatment variable $Z$ and the outcome variable $Y$, along with a set of observed covariates denoted as $X$.
- A supplementary dataset provides additional covariates, denoted as $V$, but lacks the other variables present in the primary dataset.
- $R_i = 1$ if unit $i$ appears in both datasets, and $R_i = 0$ if unit $i$ appears only in the primary dataset.

Assumptions I
Assumption 1 (Independence of R, X-MAR):
(i) $R \perp\!\!\!\perp (Y, Z, V) \mid X$.
(ii) $0 < \mathrm{pr}(R = 1 \mid X) < 1$ for all $X$.
- Assumption 1 means that the probability of being in the linked cohort depends solely on the fully observed covariates (Sun et al., 2022).
- A weaker version of Assumption 1, known as the MAR assumption, is $R \perp\!\!\!\perp V \mid (X, Y, Z)$ (Robins et al., 1994).
- Although MAR is less stringent than Assumption 1, when MAR holds we can verify Assumption 1 by examining the conditional independence between $R$ and $(Z, Y)$ given $X$.
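
The relation between X-MAR and MAR stated above can be spelled out with two standard properties of conditional independence, weak union and contraction. The display below is an illustrative sketch added here, not a formula taken from the slides.

```latex
\begin{align*}
&\text{Weak union:}
  && R \perp\!\!\!\perp (Y, Z, V) \mid X
     \;\Longrightarrow\; R \perp\!\!\!\perp V \mid (X, Y, Z), \\
&\text{Contraction:}
  && R \perp\!\!\!\perp V \mid (X, Y, Z)
     \ \text{and}\ R \perp\!\!\!\perp (Y, Z) \mid X
     \;\Longrightarrow\; R \perp\!\!\!\perp (Y, Z, V) \mid X.
\end{align*}
```

The first line shows that X-MAR implies MAR. The second shows that, once MAR is granted, X-MAR follows whenever $R$ is independent of $(Y, Z)$ given $X$, a condition that involves only fully observed variables and can therefore be checked in the data.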

Assumptions II
Assumption 2 (Ignorability of Z):
(i) $Z \perp\!\!\!\perp (Y_0, Y_1) \mid (X, V)$;
(ii) $0 < \mathrm{pr}(Z = 1 \mid X, V)$ [...]

Figure: The optimal second-phase sample ratio under different scenarios.

Connection with Missing Data I
- In this section, we discuss connections between our framework and existing methods for missing data analysis, focusing on covariates missing at random.
- The MAR assumption in our setting is stated as Assumption 4 (MAR): $R \perp\!\!\!\perp V \mid (Z, X, Y)$.
- Currently, no existing proposal in the literature achieves multiply robust estimation when covariates are missing at random.

Connection with Missing Data II
- Williamson et al. (2012) introduced a unified approach for estimating causal effects using a multiply robust methodology under MAR.
- However, as elucidated in Evans et al. (2020), achieving multiple robustness entails placing parametric restrictions on specific components of the observed data likelihood while leaving others unrestricted.

Connection with Missing Data III
- As discussed after Assumption 1 (X-MAR), MAR is less restrictive than X-MAR, and when MAR holds the observed data can be used to test X-MAR.
- In this paper, we have introduced a genuinely multiply robust estimator for the ATE under X-MAR.
- The nuisance models employed in this paper are variationally independent and commonly used in causal inference for robust estimation.

Simulation I
We consider the following data-generating mechanism:
- Fully observed covariate: $X \sim N(0, 1)$.
- Selection mechanism: $\mathrm{pr}(R = 1 \mid X) = \mathrm{expit}(0.75 + 0.5X)$, where $\mathrm{expit}(u) = \exp(u) / \{1 + \exp(u)\}$.
- Partially observed covariate: $V \mid X \sim N(0.5 + 0.5X, 1)$.
- Treatment assignment: $\mathrm{pr}(Z = 1 \mid X, V) = \mathrm{expit}(0.5 + 0.5X + 0.6V)$.
- Observed outcome: $Y \mid Z, X, V \sim N(0.5 + 0.5X + 0.5V + 2Z + 2ZX + ZV, 1)$.

Simulation II
- To evaluate the estimators under model misspecification, we adopt an approach similar to that of Kang and Schafer (2007).
- We introduce the transformed variables $|X|^{1/2}$ and $|V|^{1/2}$ and examine how the estimators perform when these are used in place of $X$ and $V$ in the working models.
- We present the results from the following five scenarios:
(i) All the models are correct.
(ii) Only the models for $\mathrm{pr}(R = 1 \mid X)$ and $\mathrm{pr}(Z = 1 \mid X, V)$ in $M_1$ are correct.
(iii) Only the model for $\mathrm{pr}(R = 1 \mid X)$ and the outcome regression on $(Z, X, V)$ in $M_2$ are correct.
(iv) Only the outcome regression on $(Z, X, V)$ and the model $f(V \mid X)$ in $M_3$ are correct.
(v) None of the models are correct.

Simulation III
Table 1: Bias, standard deviation (SD), and 95% coverage probability (CP) of $\hat{\tau}_1$, $\hat{\tau}_2$, $\hat{\tau}_3$, and $\hat{\tau}_{tr}$ by sample size, for each combination of correctly specified models among $M_1$, $M_2$, and $M_3$. [Numerical entries not recoverable.]
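
The data-generating mechanism listed under Simulation I, and the covariate transformation from Simulation II, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `simulate` and `misspecify` are hypothetical, and the coefficients are copied exactly as they appear in the extracted slides, so any minus signs lost in extraction are not restored.

```python
import numpy as np

def expit(u):
    """Logistic function exp(u) / (1 + exp(u))."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate(n, seed=0):
    """Draw one dataset of size n (hypothetical helper, coefficients as extracted)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)                          # fully observed covariate X
    r = rng.binomial(1, expit(0.75 + 0.5 * x))           # linkage indicator R
    v = rng.normal(0.5 + 0.5 * x, 1.0)                   # partially observed covariate V
    z = rng.binomial(1, expit(0.5 + 0.5 * x + 0.6 * v))  # treatment Z
    mean_y = 0.5 + 0.5 * x + 0.5 * v + 2 * z + 2 * z * x + z * v
    y = rng.normal(mean_y, 1.0)                          # observed outcome Y
    v_obs = np.where(r == 1, v, np.nan)                  # V is seen only in the linked cohort
    return {"x": x, "r": r, "v": v, "v_obs": v_obs, "z": z, "y": y}

def misspecify(a):
    """Transformed covariate |a|^(1/2), used in place of a in a misspecified working model."""
    return np.sqrt(np.abs(a))

data = simulate(200_000)

# Under this mechanism E(Y_1 - Y_0 | X, V) = 2 + 2X + V, so the ATE equals
# 2 + 2 * E(X) + E(V) = 2 + 0 + 0.5 = 2.5 for the coefficients as written.
cate = 2 + 2 * data["x"] + data["v"]
print("Monte Carlo ATE:", round(cate.mean(), 2))
print("Linked fraction:", round(data["r"].mean(), 2))
```

Passing `misspecify(data["x"])` and `misspecify(data["v_obs"])` to a working model in place of the original covariates reproduces the kind of misspecification used in scenarios (ii) to (v).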

Simulation Results I
- Table 1 summarizes the performance of the four estimators $\hat{\tau}_1$, $\hat{\tau}_2$, $\hat{\tau}_3$, and $\hat{\tau}_{tr}$ under the above five scenarios.
- The simulation results are based on 200 Monte Carlo runs. For clarity, the bias, empirical standard deviation, and 95% coverage probability are all multiplied by $10^2$.
- Across all sample sizes, the four estimators perform well in scenario (i).
- As expected, the triply robust estimator $\hat{\tau}_{tr}$ exhibits small bias in the first four scenarios (i)-(iv).
- In contrast, the other three estimators $\hat{\tau}_1$, $\hat{\tau}_2$, and $\hat{\tau}_3$ show substantial bias when their respective models are misspecified.

Simulation Results II
- The 95% coverage probabilities of the three semiparametric estimators $\hat{\tau}_1$, $\hat{\tau}_2$, and $\hat{\tau}_3$ are close to the nominal level only when their underlying models are correctly specified.
- The 95% coverage probabilities of $\hat{\tau}_{tr}$ are close to the nominal level in all four scenarios (i)-(iv).
- These results confirm our theoretical findings and demonstrate the advantages of the proposed triply robust estimator.
- We provide additional simulation studies for a binary outcome in the supplementary material, which corroborate the results presented here.

Empirical Example I
- In this section, we investigate the causal effect of physical activity on healthcare expenditures among adults in the United States.
- We analyze two datasets from the 2018 Medical Expenditure Panel Survey (MEPS).

Empirical Example II
The primary dataset comes from the 2018 full-year consolidated file of MEPS, which mainly covers demographics, health status, and health insurance. It contains the following variables:
- The treatment variable $Z$ indicates regular participation in moderate to vigorous physical activity (at least 5 times a week for half an hour).
- The outcome variable $Y$ represents healthcare expenses.
- The baseline covariates $X$ include age, sex, race, marital status, BMI, education, family poverty level, smoking status, cancer diagnosis, and health insurance coverage.

Empirical Example III
- Additionally, we use an auxiliary dataset, also obtained from MEPS, drawn from the 2018 job file. This auxiliary dataset collects supplementary job-related information.
- The linked dataset incorporates additional important confounding variables, including working hours and income, denoted by $V$.
- The primary dataset consists of 18,774 subjects, with 9,174 subjects having $Z = 1$ and 9,600 subjects having $Z = 0$.
- The auxiliary dataset consists of 6,118 subjects, and the resulting linked dataset includes 5,226 subjects.

Empirical Example IV
- Furthermore, under the MAR assumption, we conducted an empirical test of X-MAR. The test used logistic regression to examine whether the selection mechanism depends on the fully observed variables $(Z, X, Y)$.
- The results show no significant evidence against X-MAR when the MAR assumption holds.
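
One way such a check could be carried out is sketched below. The slides only say that logistic regression was used, so the likelihood-ratio comparison is an assumption. It fits the linkage indicator $R$ on $X$ alone and on $(Z, X, Y)$ and compares the two fits. The helpers `fit_logit` and `xmar_lr_test` are hypothetical names, and the data are simulated toy data, not the MEPS files.

```python
import numpy as np

def fit_logit(design, r, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (r - p)
        hess = design.T @ (design * (p * (1 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    loglik = np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))
    return beta, loglik

def xmar_lr_test(r, x, z, y):
    """Likelihood-ratio test of 'R depends on X only' against 'R depends on (Z, X, Y)'."""
    ones = np.ones(len(r))
    _, ll_null = fit_logit(np.column_stack([ones, x]), r)
    _, ll_full = fit_logit(np.column_stack([ones, x, z, y]), r)
    lr = max(2.0 * (ll_full - ll_null), 0.0)
    p_value = np.exp(-lr / 2.0)  # chi-square survival function with 2 degrees of freedom
    return lr, p_value

# Toy data for illustration only (NOT the MEPS data).
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
y = x + z + rng.normal(size=n)
r_ok = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.5 * x))))       # selection depends on X only
r_bad = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.5 * x + y))))  # selection also depends on Y

lr_ok, p_ok = xmar_lr_test(r_ok, x, z, y)
lr_bad, p_bad = xmar_lr_test(r_bad, x, z, y)
print(f"X-only selection:      LR = {lr_ok:.2f}, p = {p_ok:.3f}")
print(f"Y-dependent selection: LR = {lr_bad:.2f}, p = {p_bad:.3g}")
```

A small p-value is evidence that linkage depends on $Z$ or $Y$ beyond $X$, which would contradict X-MAR. As the slides note, the check is informative only when MAR itself is taken to hold.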

Results I
- We employ a linear model for the outcome regression, a normal distribution for the imputation model, and logistic models for the selection mechanism and the propensity score.
- The results from the four estimators are statistically significant and show a consistent pattern, suggesting that regular physical activity may lead to reduced healthcare costs (Kokkinos et al., 2011; Booth et al., 2012).

Results II
Table: Estimates of the causal effect of physical activity on total healthcare expenditures (in units of 1,000 US dollars).

Estimator            Point estimate   Standard error   95% confidence interval
$\hat{\tau}_1$       -1.00            0.51             (-1.99, -0.01)
$\hat{\tau}_2$       -1.06            0.48             (-2.00, -0.11)
$\hat{\tau}_3$       -1.20            0.55             (-2.28, -0.12)
$\hat{\tau}_{tr}$    -1.35            0.56             (-2.45, -0.24)
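
As a quick consistency check on the table, the reported intervals can be compared with normal-approximation (Wald) intervals, estimate ± 1.96 × standard error. The slides do not say how the intervals were computed, so the Wald form is an assumption. The keys in `rows` are illustrative labels for the four estimators.

```python
# (point estimate, standard error, reported 95% CI), in units of 1,000 US dollars
rows = {
    "tau_1":  (-1.00, 0.51, (-1.99, -0.01)),
    "tau_2":  (-1.06, 0.48, (-2.00, -0.11)),
    "tau_3":  (-1.20, 0.55, (-2.28, -0.12)),
    "tau_tr": (-1.35, 0.56, (-2.45, -0.24)),
}

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

def wald_ci(estimate, se, z=Z_975):
    """Normal-approximation confidence interval: estimate +/- z * se."""
    return (estimate - z * se, estimate + z * se)

max_gap = 0.0
for name, (est, se, reported) in rows.items():
    lo, hi = wald_ci(est, se)
    max_gap = max(max_gap, abs(lo - reported[0]), abs(hi - reported[1]))
    print(f"{name}: Wald ({lo:.2f}, {hi:.2f}) vs reported {reported}")

print(f"largest endpoint difference: {max_gap:.3f}")
```

The recomputed endpoints differ from the reported ones by less than 0.02, which is what one would expect from rounding the estimates and standard errors to two decimals. All four intervals exclude zero, in line with the statement that the estimates are statistically significant.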

Conclusions I
- It is crucial to consider potential selection bias when estimating causal effects from linked data. Neglecting this bias could lead to biased estimates and incorrect conclusions.
- Our study has introduced three nonparametric strategies for identifying the ATE.
- We have presented a triply robust and locally efficient estimator that is easy to implement.
- We have investigated the asymptotic properties of the proposed estimator, establishing its consistency and asymptotic normality.

Conclusions II
- We have explored the use of various flexible machine learning techniques for nuisance model estimation.
- We have discussed potential extensions of the proposed method, including optimal sampling designs and the estimation of nonlinear causal estimands.
- Several possible extensions for future work:
  - Develop theories and methodologies for scenarios with multiple data sources.
  - Explore optimal sample size and sampling rate allocation under MAR.

Thank you!

Led by Professor Zhi Geng, the causal inference team at Beijing Technology and Business University (BTBU) was established in 2022. It works on the foundational theory and methods of causal inference and on related applications. The team has produced a series of results on causal effect evaluation, causal discovery, causal attribution, causal recommender systems, and causality-based fairness evaluation, as well as on applied causal inference in biomedicine, food safety, and internet and IT. Since its founding, the team has published in leading international journals in statistics, machine learning, and artificial intelligence, including Biometrika, Journal of Machine Learning Research, Biometrics, Statistica Sinica, Statistics in Medicine, Artificial Intelligence, and TNNLS, and at leading international conferences including ICML, NeurIPS, ICLR, AAAI, KDD, IJCAI, WWW, and UAI. The Causal Inference Branch of the Chinese Association for Applied Statistics was founded at BTBU in December 2022, and the "First Annual Conference on Causal Inference", hosted by BTBU, was held successfully in May 2023.

References
Bickel, P. J., Klaassen, C., Ritov, Y., and Wellner, J. A. (1993). Efficient and Adaptive Inference in Semiparametric Models. Johns Hopkins University Press, Baltimore, MD.
Booth, F. W., Roberts, C. K., and Laye, M. J. (2012). Lack of Exercise is a Major Cause of Chronic Diseases. Comprehensive Physiology, 2(2):1143.
Chatterjee, N., Chen, Y.-H., and Breslow, N. E. (2003). A Pseudoscore Estimator for Regression Problems with Two-Phase Sampling. Journal of the American Statistical Association, 98(461):158-168.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1):C1-C68.
Evans, K., Fulcher, I., and Tchetgen, E. J. T. (2020). A Coherent Likelihood Parametrization for Doubly Robust Estimation of a Causal Effect with Missing Confounders. arXiv:2007.10393.
Hájek, J. (1971). Comment on a Paper by D. Basu. In Godambe, V. P. and Sprott, D. A., editors, Foundations of Statistical Inference, page 236. Holt, Rinehart and Winston, Toronto.
Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association, 47(260):663-685.
Kang, J. D. and Schafer, J. L. (2007). Demystifying Double Robustness: a Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science, 22(4):523-539.
Kokkinos, P., Sheriff, H., and Kheirbek, R. (2011). Physical Inactivity and Mortality Risk. Cardiology Research and Practice, 2011:924945.
Lin, H.-W. and Chen, Y.-H. (2014). Adjustment for Missing Confounders in Studies Based on Observational Databases: 2-Stage Calibration Combining Propensity Scores from Primary and Validation Data. American Journal of Epidemiology, 180(3):308-317.
McCandless, L. C., Richardson, S., and Best, N. (2012). Adjustment for Missing Confounders Using External Validation Data and Propensity Scores. Journal of the American Statistical Association, 107(497):40-51.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of Regression Coefficients when Some Regressors are not Always Observed. Journal of the American Statistical Association, 89(427):846-866.
Stürmer, T., Schneeweiss, S., Avorn, J., and Glynn, R. J. (2005). Adjusting Effect Estimates for Unmeasured Confounding with Validation Data Using Propensity Score Calibration. American Journal of Epidemiology, 162(3):279-289.
Sun, J. W., Wang, R., Li, D., and Toh, S. (2022). Use of Linked Databases for Improved Confounding Control: Considerations for Potential Selection Bias. American Journal of Epidemiology, 191(4):711-723.
Wang, W., Scharfstein, D., Tan, Z., and MacKenzie, E. J. (2009). Causal Inference in Outcome-Dependent Two-Phase Sampling Designs. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(5):947-969.
Williamson, E. J., Forbes, A., and Wolfe, R. (2012). Doubly Robust Estimators of Causal Exposure Effects with Missing Data in the Outcome, Exposure or a Confounder. Statistics in Medicine, 31(30):4382-4400.
Yang, S. and Ding, P. (2020). Combining Multiple Observational Data Sources to Estimate Causal Effects. Journal of the American Statistical Association, 115(531):1540-1554. PMID: 33088006.