Counterfactual No-Harm Criterion: Individual Risk and Trustworthy Policy Learning

Peng Wu
Joint work with Zhi Geng, Yue Liu, Haoxuan Li, and Chunyuan Zheng.
Beijing Technology and Business University
October 17, 2023
Peng Wu (DataFun Causal Inference Online Summit '23) — Counterfactual No-Harm Criterion

Outline
1 Introduction
2 Sharp Bounds of the No-Harm Criterion
3 No-Harm Trustworthy Policy Learning
4 Experiments

Introduction

Background
1 Policy learning determines the individuals who should be treated based on their covariates, and it is important that humans can trust a decision made by an algorithm.
2 A trustworthy algorithm is expected to meet various advanced requirements, including fairness, diversity, explainability, accountability, safety, etc.
3 In this talk, we discuss the "harmlessness" of policy learning.

What is No-Harm?
Hippocratic oath: "First do no harm."
Isaac Asimov's Laws of Robotics: "A robot may not injure a human being or, through inaction, allow a human being to come to harm."
What is No-Harm? A Toy Example
Consider two policies:
- The first policy is useful for 70% of patients but harms 30% of patients.
- The second policy is useful for 40% of patients and harms no one.
The two policies have the same average causal effect (40%). Clearly, the second policy is preferable.
However, if the second policy is useful for only 30% of patients (still harming no one), which policy is preferred: the first or the second?
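The arithmetic behind the toy example can be checked directly: writing each policy in terms of the proportions of benefited and harmed patients, the average causal effect is the benefit rate minus the harm rate. A minimal sketch (the stratum proportions are taken from the example above; the function name is illustrative):

```python
# Average causal effect (ACE) of a policy = P(benefit) - P(harm), where
# "benefit" means Y(1)=1, Y(0)=0 and "harm" means Y(1)=0, Y(0)=1.
def average_causal_effect(p_benefit: float, p_harm: float) -> float:
    return p_benefit - p_harm

policy_1 = average_causal_effect(p_benefit=0.70, p_harm=0.30)   # helps 70%, harms 30%
policy_2 = average_causal_effect(p_benefit=0.40, p_harm=0.00)   # helps 40%, harms none
policy_2b = average_causal_effect(p_benefit=0.30, p_harm=0.00)  # variant: helps only 30%
# policy_1 ≈ 0.4, policy_2 = 0.4, policy_2b = 0.3
```

The first two policies are indistinguishable by the average causal effect even though only one of them causes harm, which is exactly the gap the no-harm criterion targets.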
Notation
Observed data (X_i, T_i, Y_i), i = 1, ..., n:
- X_i ∈ 𝒳: covariates; T_i ∈ {0, 1}: binary treatment; Y_i ∈ {0, 1}: binary outcome;
- Y_i(1), Y_i(0): potential outcomes. Either Y_i(0) or Y_i(1) can be observed for each unit, but not both (for treated units Y(1) is observed and Y(0) is missing, shown as "?" in the data table; for control units the reverse).
- Individual treatment effect: Y_i(1) − Y_i(0).
- Conditional average treatment effect (CATE): τ(x) = E[Y(1) − Y(0) | X = x], which is the average causal effect in the subpopulation X = x.

Identifiability of τ(x)
Assumption 1 (Strong Ignorability): 0 < P(T = 1 | X = x) < 1 and (Y(1), Y(0)) ⫫ T | X = x.
Under Assumption 1, τ(x) = E[Y | T = 1, X = x] − E[Y | T = 0, X = x] is identifiable.

Sharp Bounds of the No-Harm Criterion

Let Y_{s,t} denote the principal stratum with Y(0) = s and Y(1) = t, so that Y_{0,1} is the benefited stratum and Y_{1,0} is the harmed stratum. A policy π can treat (π(x) = 1) according to different criteria relative to a treatment cost c(x) > 0, for example:
- τ(x) − c(x) > 0, equivalently P(Y_{0,1} | X = x) − P(Y_{1,0} | X = x) − c(x) > 0 (CATE-based);
- E[Y(1) | X = x] − c(x) > 0, equivalently P(Y_{0,1} | X = x) + P(Y_{1,1} | X = x) − c(x) > 0 (recommendation-based);
with indifference when P(Y_{0,1} | X = x) − P(Y_{1,0} | X = x) = c(x). The cost function c(x) helps to control the upper bound of FNA(π), the fraction negatively affected by the policy, FNA(π) = P(Y(0) = 1, Y(1) = 0, π(X) = 1).

Corollary 1 (Relation to the cost). For the upper bounds w̄_FNA(π) in Lemma 1 and ū_FNA(π) in Theorem 1, the optimal policy π* satisfies
  w̄_FNA(π*) ≤ E[ {(1 − c(X))/2} π*(X) ],  and  ū_FNA(π*) ≤ E[ {(1 − c(X))/2}^2 π*(X) ].
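The basic idea of bounding the harmed stratum can be illustrated with the classical Fréchet–Hoeffding bounds: given only the marginals μ_0(x) = P(Y(0) = 1 | X = x) and μ_1(x) = P(Y(1) = 1 | X = x), the harm probability P(Y(0) = 1, Y(1) = 0 | X = x) is only partially identified. A minimal sketch (this shows the standard Fréchet bounds, not the sharper bounds of Lemma 1 / Theorem 1; function names are illustrative):

```python
import numpy as np

def harm_bounds(mu0, mu1):
    """Frechet-Hoeffding bounds on P(Y(0)=1, Y(1)=0 | X=x) from the
    marginals mu0 = P(Y(0)=1 | x) and mu1 = P(Y(1)=1 | x)."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    lower = np.maximum(0.0, mu0 - mu1)  # harm is at least the negative CATE
    upper = np.minimum(mu0, 1.0 - mu1)  # harm cannot exceed either marginal
    return lower, upper

def fna_bounds(pi, mu0, mu1):
    """Bounds on FNA(pi) = E[pi(X) * P(Y(0)=1, Y(1)=0 | X)], by averaging
    the per-x bounds over the treated-by-policy units."""
    lower, upper = harm_bounds(mu0, mu1)
    pi = np.asarray(pi, float)
    return float(np.mean(pi * lower)), float(np.mean(pi * upper))
```

For instance, with μ_0 = (0.6, 0.2), μ_1 = (0.5, 0.9) and π ≡ 1, `fna_bounds` returns the interval [0.05, 0.3] for the fraction harmed.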
No-Harm Trustworthy Policy Learning

Optimal Policy at a Given Level of No-Harm
Denote by π* the optimal target policy satisfying the no-harm criterion,
  max_π R(π; c, α)  subject to  ū_FNA(π) ≤ δ,   (1)
where δ is a pre-specified level of allowed harm and
  R(π; c, α) = E[ π(X){Y(1) − c(X)} + α Y(0){1 − π(X)} ]  for α ∈ {0, 1},
which is a general form of policy reward for different utility functions. For example, R(π; c, 1) is the reward under the utility U(X, T, Y) = Y − T c(X), and R(π; c, 0) is the reward under U(X, T, Y) = T Y − T c(X).

Learned Policy
Let π̂ be the learned policy, derived by optimizing the empirical form of Eq. (1),
  max_π R̂(π; c, α)  subject to  û_FNA(π) ≤ δ,   (2)
where R̂(π; c, α) and û_FNA(π) are the corresponding estimators of R(π; c, α) and ū_FNA(π), obtained as follows. Let e(x) := P(T = 1 | X = x) and μ_t(x) := E[Y | T = t, X = x] for t = 0, 1, and define
  ψ(Z; e, μ_0, μ_1) = { T(Y − μ_1(X))/e(X) + μ_1(X) − c(X) } π(X) + α { (1 − T)(Y − μ_0(X))/(1 − e(X)) + μ_0(X) } (1 − π(X)),
  φ(Z; e, μ_0, μ_1) = { (1 − T)(Y − μ_0(X))/(1 − e(X)) + μ_0(X) } π(X) − { T(Y − μ_1(X))/e(X) + μ_1(X) } μ_0(X) π(X),
where Z = (T, X, Y).

Lemma 2. R(π; c, α) = E[ψ(Z; e, μ_0, μ_1)] and ū_FNA(π) = E[φ(Z; e, μ_0, μ_1)].
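The two scores of Lemma 2 translate directly into vectorized code. A sketch (names are illustrative; `pi`, `c`, and `alpha` follow the definitions above, and all arguments are arrays evaluated at the sample points):

```python
import numpy as np

def psi(T, Y, pi, e, mu0, mu1, c, alpha):
    """Score whose mean is the reward R(pi; c, alpha)."""
    dr1 = T * (Y - mu1) / e + mu1              # augmented-IPW term for Y(1)
    dr0 = (1 - T) * (Y - mu0) / (1 - e) + mu0  # augmented-IPW term for Y(0)
    return (dr1 - c) * pi + alpha * dr0 * (1 - pi)

def phi(T, Y, pi, e, mu0, mu1):
    """Score whose mean is the harm upper bound u_FNA(pi)."""
    dr1 = T * (Y - mu1) / e + mu1
    dr0 = (1 - T) * (Y - mu0) / (1 - e) + mu0
    return dr0 * pi - dr1 * mu0 * pi
```

At the true nuisances, averaging `phi` recovers E[π(X) μ_0(X)(1 − μ_1(X))], consistent with Lemma 2.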
From Lemma 2, it is natural to define the estimators of R(π; c, α) and ū_FNA(π) as
  R̂(π; c, α) = (1/n) Σ_{i=1}^n ψ(Z_i; ê, μ̂_0, μ̂_1),  û_FNA(π) = (1/n) Σ_{i=1}^n φ(Z_i; ê, μ̂_0, μ̂_1),
where ê(x) and μ̂_t(x), t = 0, 1, are the estimators of e(x) and μ_t(x), respectively, obtained using the sample-splitting technique.

Asymptotic Properties of Estimators
Theorem 2. Suppose that ‖ê(x) − e(x)‖_2 · ‖μ̂_t(x) − μ_t(x)‖_2 = o_P(n^{−1/2}) for all x ∈ 𝒳 and t ∈ {0, 1}. Then
(a) R̂(π; c, α) is consistent and asymptotically normal,
  √n { R̂(π; c, α) − R(π; c, α) } → N(0, σ_1^2), where σ_1^2 = V[ψ(Z; e, μ_0, μ_1)];
(b) if μ_0(x) = μ_0(x; β) is a parametric model, û_FNA(π) is consistent and asymptotically normal,
  √n { û_FNA(π) − ū_FNA(π) } → N(0, σ_2^2), where
  σ_2^2 = V[ φ(Z; e, μ_0, μ_1) − s(X) E{ ∂μ_0(X; β)/∂β · (1 − μ_1(X)) π(X) } ],
and s(X) is the influence function of the estimator of β.
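The sample-splitting step can be sketched as follows: the nuisances are fit on the complement of each fold and the scores are evaluated on the held-out fold, so no observation is scored with nuisance estimates trained on itself. This is schematic — `fit_nuisances` here returns trivial constant (marginal-mean) predictors as a stand-in for whatever flexible regression is actually used, which is an assumption for illustration only:

```python
import numpy as np

def fit_nuisances(X, T, Y):
    """Illustrative nuisance fitter: constant (marginal-mean) estimates.
    In practice this would be replaced by flexible regression/ML models."""
    e_hat, mu0_hat, mu1_hat = T.mean(), Y[T == 0].mean(), Y[T == 1].mean()
    return (lambda x: np.full(len(x), e_hat),
            lambda x: np.full(len(x), mu0_hat),
            lambda x: np.full(len(x), mu1_hat))

def cross_fit_estimates(X, T, Y, pi, c, alpha, n_folds=2, seed=0):
    """Cross-fitted R_hat(pi; c, alpha) and u_FNA_hat(pi) from the two scores."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    psi_vals, phi_vals = np.empty(n), np.empty(n)
    for k, fold in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        e_f, mu0_f, mu1_f = fit_nuisances(X[train], T[train], Y[train])
        e, mu0, mu1 = e_f(X[fold]), mu0_f(X[fold]), mu1_f(X[fold])
        dr1 = T[fold] * (Y[fold] - mu1) / e + mu1
        dr0 = (1 - T[fold]) * (Y[fold] - mu0) / (1 - e) + mu0
        psi_vals[fold] = (dr1 - c[fold]) * pi[fold] + alpha * dr0 * (1 - pi[fold])
        phi_vals[fold] = dr0 * pi[fold] - dr1 * mu0 * pi[fold]
    return psi_vals.mean(), phi_vals.mean()
```

Each unit is scored exactly once, on the fold where it was held out, and the two fold-wise score vectors are then averaged.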
17、Trustworthy Policy LearningProperties of the Learned PolicyR(;c,)R(;c,)R(;c,)R(;c,)which are the regret of the learned policy,and error of the estimated reward of learnedpolicy,respectively.Theorem 3(Main Result 2)Suppose that for all ,(x)=(x;)is acontinuously differentiable and convex function with
18、 respect to,where is acompact set,under the assumptions in Theorem 1,then we have(a)The expected reward of the learned policy is consistent,andR(;c,)R(;c,)=OP(1/n);(b)The estimated reward of the learned policy is consistent,andR(;c,)R(;c,)=OP(1/n).Peng Wu(DataFun 因果推断在线峰会 23)Counterfactual No-Harm C
19、riterion25/34No-Harm Trustworthy Policy LearningTheorem 4(Main Result 3)Suppose that is a P-G-C class,t(x)and e(x)areuniformly consistent estimators of t(x)and e(x)for t=0,1,respectively,and a for any and 0 a 1,then we have(a)R(;c,)R(;c,)P 0;and(b)R(;c,)R(;c,)P 0.Peng Wu(DataFun 因果推断在线峰会 23)Counterf
Experiments

Simulation Setups
Two semi-synthetic datasets:
- The IHDP (Infant Health and Development Program) dataset. It comprises 672 units (123 treated, 549 control) and 25 covariates measuring aspects of children and their mothers. Goal: examine the effects of specialist home visits on future cognitive test scores.
- The Jobs dataset, based on the National Supported Work program. It includes 2,570 units (237 treated, 2,333 control) and 17 covariates. Goal: examine the effects of job training on income and employment status after training.

To determine the ground truth of harm, we simulate potential outcomes:
  Y_i(0) ~ Bern( σ(w_0ᵀ x_i + ε_{0,i}) ),  Y_i(1) ~ Bern( σ(w_1ᵀ x_i + ε_{1,i}) ),
where σ(·) is the sigmoid function, w_0 ~ N_{[−1,1]}(0, 1) follows a truncated normal distribution, w_1 ~ Unif(−1, 1) follows a uniform distribution, ε_{0,i} ~ N(η_0, 1), and ε_{1,i} ~ N(η_1, 1). We set the noise parameters η_0 = 1 and η_1 = 3 for IHDP and η_0 = 0 and η_1 = 2 for Jobs.
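The generation of the ground-truth potential outcomes above can be sketched as follows (the covariate matrix here is a random stand-in for the IHDP/Jobs covariates, and η_0, η_1 are the location parameters just described; the truncated normal is drawn by rejection):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_potential_outcomes(X, eta0, eta1, seed=0):
    """Y(0) ~ Bern(sigmoid(w0'x + eps0)), Y(1) ~ Bern(sigmoid(w1'x + eps1)),
    with w0 a standard normal truncated to [-1, 1] and w1 uniform on [-1, 1]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w0 = rng.normal(size=d)
    while np.any(np.abs(w0) > 1):          # rejection sampling = truncation
        bad = np.abs(w0) > 1
        w0[bad] = rng.normal(size=bad.sum())
    w1 = rng.uniform(-1, 1, size=d)
    eps0 = rng.normal(eta0, 1, size=n)
    eps1 = rng.normal(eta1, 1, size=n)
    y0 = rng.binomial(1, sigmoid(X @ w0 + eps0))
    y1 = rng.binomial(1, sigmoid(X @ w1 + eps1))
    return y0, y1

# Example with placeholder covariates (the study uses the real IHDP covariates)
X = np.random.default_rng(1).normal(size=(500, 25))
y0, y1 = simulate_potential_outcomes(X, eta0=1, eta1=3)
```

With both Y(0) and Y(1) in hand, membership in the harmed stratum {Y(0) = 1, Y(1) = 0} is known exactly for every simulated unit.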
Goal and Evaluation Metrics
Goal. The goal of our policy learning is to maximize the reward and the resulting change in welfare while satisfying the no-harm criterion.
In this simulation, there are 65 and 252 units in the "harmful treatment" strata on IHDP and Jobs, respectively. We define the no-harm criterion as the learned policy harming less than 20% of them, i.e., 13 units for IHDP and 50 units for Jobs.

Evaluation Metrics.
- Reward. For CATE-based policy learning: Σ_{i=1}^n (Y_i(1) − c) π(x_i) + Y_i(0)(1 − π(x_i)). For recommendation-based policy learning: Σ_{i=1}^n (Y_i(1) − c) π(x_i).
- The change in welfare:
  W(π) = Σ_{i=1}^n { Y_i(1) π(x_i) + Y_i(0)(1 − π(x_i)) } − Σ_{i=1}^n Y_i(0) = Σ_{i=1}^n (Y_i(1) − Y_i(0)) π(x_i).
- The true harm: Σ_{i=1}^n I{ Y_i(0) = 1, Y_i(1) = 0 } π(x_i).
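Because both potential outcomes are simulated, the three metrics above are direct sums over the sample. A minimal sketch (array names are illustrative):

```python
import numpy as np

def evaluate_policy(pi, y0, y1, cost):
    """Oracle evaluation metrics, computable only because Y(0) and Y(1)
    are both simulated: reward, change in welfare, and true harm."""
    reward = np.sum((y1 - cost) * pi + y0 * (1 - pi))      # CATE-based reward
    welfare_change = np.sum((y1 - y0) * pi)                # W(pi)
    true_harm = np.sum((y0 == 1) & (y1 == 0) & (pi == 1))  # harmed and treated
    return reward, welfare_change, true_harm
```

For example, with y0 = (1, 0, 1, 0), y1 = (0, 1, 1, 0), π = (1, 1, 0, 0) and cost 0.1, the policy harms exactly one unit (the first) while its welfare change is zero.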
Results
(Result figures omitted.)

Conclusion
- We formalize the no-harm criterion for policy learning from a principal stratification perspective.
- We propose a novel upper bound for the fraction negatively affected by the policy.
- We propose an estimator of the upper bound, and show the consistency and asymptotic normality of the estimator.
- Based on the estimators for the policy reward and harm rate, we further propose a policy learning approach that satisfies the no-harm criterion, and prove its consistency to the optimal policy reward for parametric and nonparametric policy classes, respectively.

Main References
[1] Kush R. Varshney (2022). Trustworthy Machine Learning. Independently Published.
[2] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal.
[3] Kallus, N. (2022). Treatment effect risk: Bounds and inference. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22).
[4] Kallus, N. (2022). What's the harm? Sharp bounds on the fraction negatively affected by treatment. arXiv preprint arXiv:2205.10327.
[5] Kitagawa, T. and Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica.
[6] Haoxuan Li, Chunyuan Zheng, Yixiao Cao, Zhi Geng, Yue Liu*, and Peng Wu* (2023). Trustworthy Policy Learning under the Counterfactual No-Harm Criterion. ICML '23.
[7] Peng Wu, Peng Ding, Zhi Geng, and Yue Liu. Individual Benefit and Risk: Bounds and Inference. Working paper.

Join the Causal Inference Team at Beijing Technology and Business University
Led by Professor Zhi Geng, the causal inference team at Beijing Technology and Business University was founded in 2022 and works on the foundational theory, methodology, and applications of causal inference. The team has produced a series of results on causal effect evaluation, causal discovery, causal attribution, causal recommender systems, causal reinforcement learning, causality-based fairness evaluation, and applications of causal inference in biomedicine, food safety, and the Internet/IT industry. Since its founding, the team's work has been published in top international journals in statistics, machine learning, and artificial intelligence, including Biometrika, Journal of Machine Learning Research, Biometrics, Statistica Sinica, Statistics in Medicine, Artificial Intelligence, and TNNLS, as well as at top conferences including ICML, NeurIPS, ICLR, AAAI, KDD, IJCAI, WWW, and UAI. Interested candidates are welcome to get in touch.

Thanks!