Data-Driven Decision Optimization for Real-World Scenarios

Xianyuan Zhan, Assistant Researcher / Assistant Professor, Institute for AI Industry Research (AIR), Tsinghua University

CONTENTS
01 Real-World Challenges for Data-Driven Decision-Making
02 Offline Reinforcement Learning (RL)
03 Hybrid Offline-and-Online RL

01 Real-World Challenges for Data-Driven Decision-Making

Decision-Making Applications in the Real World
- Gaming AI
- Robotics
- Autonomous driving
- Industrial & energy systems
- Logistics
- Scheduling

Real-World Challenges for Sequential Decision-Making Methods
Conventional decision-making tasks assume the standard interaction loop: the agent sends an action, the environment returns a state and a reward. In real-world tasks, interacting with the real system is risky, and only historical data (offline datasets) is available:
- Not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Severe sim-to-real transfer issues
- Only offline logged data is available
Most conventional methods fail. Is there a data-driven solution?

Overview of Data-Driven Sequential Decision Making
Approaches can be placed along two axes: the level of system interaction (high / medium / no interaction) and the amount of offline data (none / small / medium / large).
- High interaction, little offline data: online RL. Typical scenario: gaming AI. Lots of research and relatively mature technology, but limited real-world applications.
- Medium interaction: sample-efficient online RL; some research.
- No interaction, medium-to-large offline data: offline RL/IL, few-shot IL/RL/planning, and sample-efficient offline IL/RL/planning. Typical scenarios: mission-critical system optimization, robotics, autonomous driving. Lots of real-world application scenarios, but a lack of research: hard problems and many unknowns.

02 Offline Reinforcement Learning (RL)

Introduction to Reinforcement Learning
An agent (e.g., a gaming AI or a robot) interacts with its environment in a loop of action, state, and reward.

Challenges of DRL in Real-World Applications
Conventional RL tasks assume access to a real system or simulator during training. Real-world tasks provide only historical state, action, and reward data (offline logged data):
- Not possible to interact with the real environment during training
- A perfect simulation environment may not exist
- Severe sim-to-real transfer issues
- Only offline logged data is available
Most existing DRL algorithms fail in this setting.

Offline Reinforcement Learning (Offline RL): Settings and Notations
Learn a policy purely from offline logged data, with no further interaction with the environment.

Bootstrapping Error Accumulation and Distribution Shift
Counterfactual queries lead to distributional shift: function approximators (the policy, the Q-function, or both) are trained under one distribution but evaluated on a different one, because the visited states change under the new policy. The problem is worsened by the act of maximizing the expected return.
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction. NeurIPS 2019.

How to Make Offline RL Work
Some ideas:
- Adding policy constraints: enforce behavior regularization (many model-free offline RL methods).
- Value function regularization: penalize the value/Q-function, either by modifying the Q-function training objective or by penalizing with uncertainty estimates.
- Model-based methods: solve a pessimistic MDP; penalize rewards on OOD data. (Kidambi, Rajeswaran, Netrapalli, Joachims. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.)
- Strict in-sample learning: keep rewards/values unaltered.
Common principle: conservatism / pessimism, either by making rewards/values pessimistic or by keeping them unaltered and learning strictly in-sample.

Over-Conservatism in Existing Methods
Policy constraints, value regularization, and in-sample learning all share a weakness: over-conservatism degrades performance and generalization in unknown areas, especially on partially covered datasets. With full data coverage, the learned policy π(a|s) can stay within the data distribution; with partial coverage, a good policy may have to explore OOD regions that these methods refuse to enter.

How Well Do Q-Functions Perform in OOD Areas?
Deep Q-functions interpolate well within the convex hull of the training data, but struggle to extrapolate beyond it.
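The "penalize the Q-function" idea above can be sketched in a tabular setting with a CQL-style penalty: push down a soft-maximum of Q(s, ·) over all actions while pulling Q back up at the action actually logged in the dataset. This is an illustrative sketch under simplified assumptions (tabular Q, single transition, made-up function name), not the exact objective of any method discussed here.

```python
import numpy as np

def conservative_q_update(Q, s, a, r, s_next, gamma=0.99, lr=0.1, alpha=1.0):
    """One tabular Q-learning step with a CQL-style conservatism penalty.

    Loss sketch: 0.5 * (Q(s,a) - target)^2
                 + alpha * (logsumexp_a' Q(s,a') - Q(s,a_data)),
    i.e., a TD term on the logged transition plus a term that pushes all
    action values down (softmax-weighted) and the in-data action back up.
    """
    td_target = r + gamma * Q[s_next].max()      # standard TD target
    grad = np.zeros_like(Q[s])
    # gradient of the logsumexp term: softmax over Q(s, ·) pushes all actions down
    z = np.exp(Q[s] - Q[s].max())
    grad += alpha * z / z.sum()
    grad[a] -= alpha                             # pull up the in-data action
    grad[a] += Q[s, a] - td_target               # TD error on the logged transition
    Q[s] -= lr * grad                            # gradient-descent step
    return Q
```

Starting from `Q = np.zeros((4, 3))`, a single update on a rewarding logged transition raises the value of the observed action while driving the two unobserved actions below zero, which is exactly the pessimistic treatment of OOD actions described above.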
Theoretical Explanation
The geometry of the dataset, i.e., the distance to data samples, matters.

How to Measure the Distance to a Dataset?
Learn a state-conditioned distance function d(s, a) that outputs the distance from a state-action pair to the training data.

What Can the Distance Function Do?
It enables a distance-function-based convex-hull constraint. DOGE (Distance-sensitive Offline rl with better GEneralization) applies it as a minimalist modification to standard offline RL.
Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv 2022.

Theoretical Analysis of DOGE
DOGE achieves a tighter performance bound than data-support-constraint methods, characterized via the concentrability coefficient and a suboptimality constant.

Experiments
DOGE outperforms SOTA methods on D4RL benchmarks.

Generalization Ability of DOGE
On the AntMaze Large dataset, DOGE enjoys better generalization performance than policy-constraint, value-regularization, and in-sample-learning methods.

Insight from DOGE
The generalization of DNNs and the geometry of datasets are largely overlooked in existing offline RL methods. It is necessary to take them into consideration when designing new, effective offline RL algorithms.

RL-Based vs. Imitation-Based Methods
- RL-based methods (policy constraints, value regularization, uncertainty penalties, ...) enjoy out-of-distribution generalization but suffer from distribution shift.
- Imitation-based methods avoid distribution shift but are too conservative to surpass the dataset.
How can we avoid distribution shift and still enjoy the benefit of out-of-distribution generalization? One line of work is goal-conditioned supervised learning: DT, TT, RvS.
Eysenbach B, et al. Imitating Past Successes Can Be Very Suboptimal. arXiv:2206.03378, 2022.
Brandfonbrener D, et al. When Does Return-Conditioned Supervised Learning Work for Offline Reinforcement Learning? arXiv:2206.01079, 2022.
Paster K, McIlraith S, Ba J. You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments. arXiv:2205.15967, 2022.

A Motivating Example
(Green arrows: transitions in the dataset; grid color: state value V(s).)
Action-stitching vs. state-stitching:
- Action-stitching: choose an action in the data that leads to the next state in the data with the highest V(s). Has more of an imitation flavor; very conservative.
- State-stitching: choose any action that leads to the next state in the data with the highest V(s). Allows OOD actions, but needs some guidance.

Policy-Guided Offline Reinforcement Learning
The standard RL approach couples a policy evaluation step (learn Q(s, a)) and a policy maximization step (learn π(a|s)). Dynamic programming (DP) is powerful, but the coupled Q and π updates require tricky OOD regularizers, leading to instability and conservatism. How can we preserve DP while learning the value and the policy in a decoupled way?
A different view:
1. Learn V(s) instead of Q(s, a): learn a value function better than the one implied by the data.
2. Learn a guide-policy g(s'|s) to determine where to go: output the optimal next state given the current state.
3. Learn a task-irrelevant execute-policy π(a|s, s'): determine which action can produce the given next state.

Policy-Guided Offline Reinforcement Learning (POR)
1. Use expectile regression to obtain an upper-confidence estimate of the state value V(s) (credit to Kostrikov et al., ICLR 2022).
2. Train the guide-policy g with respect to V(s).
3. Train the execute-policy on dataset transitions. Given a state s, the final action is determined by composing the guide-policy and the execute-policy.
Xu et al. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.

Experiments
POR performs strongly on benchmarks. For multi-task adaptation, POR only needs to re-train the guide-policy and can reuse the same execute-policy for all tasks; the execute-policy can even be trained with data from different tasks, enabling superior data efficiency.

Offline RL for Real-World Applications
Combustion optimization for thermal power generating units (TPGUs) using offline RL. Offline learning: train a thermal combustion process simulator and an offline RL algorithm on historical operational data. Online operation: feed optimized actions for the control variables (coal feeder, coal mill, turbine, induced draft fan, forced draft fan, water pump, valves) into the control loop of the real boiler, with water, steam, wind, and coal as inputs and electricity, smoke, and pollutants as outputs. Training goals: use less coal, generate more electricity, emit less pollution.
Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022; Spotlight, RL4RealLife Workshop, ICML 2021.
The system was developed at JD Technology and has been deployed in real power plants.

Offline RL for Real-World Applications
Why is this task hard?
- System complexity: coal mill, boiler, steamer, and heater components; around 10,000 sensors.
- Complex dynamics: coal pulverizing, burning, and steam circulation; complex physical and chemical processes.
- High-dimensional control: 100+ major control elements; continuous control.
- Modeling restrictions: no interaction with the system during training; no high-fidelity simulator; only offline logged data.
- Domain expertise: requires a large amount of domain knowledge; lots of safety constraints.
- Long-term optimization: needs to optimize long-term combustion performance.
- Multi-objective optimization: improve combustion efficiency while reducing NOx emissions.

MORE: An Improved Model-Based Offline RL with Restrictive Exploration
- MORE tackles offline policy learning under constraints with an imperfect simulator.
- Safe policy optimization: uses two types of Q-functions, one for reward maximization and one for cost evaluation. Policy optimization is performed on carefully combined real and simulated data.
- Restrictive exploration and hybrid training. Intuition: only consider samples on which the data-driven simulator is certain, then further distinguish whether those samples are in-distribution or not. Offline training uses a special local buffer combining real samples with positive and negative simulated samples. Samples are filtered out if the model is uncertain or lacks prediction robustness (model-sensitivity-based filtering) or if their data density is too low (data-density-based filtering).

Real-World Experiments
In real-world experiments at the CHN Energy Langfang Power Station, the optimized control strategy achieved maximum increases of 0.56%, 0.65%, and 0.51% in combustion efficiency.

03 Hybrid Offline-and-Online RL

Limitations in Both Online & Offline Approaches
- Dynamics gap: high-fidelity simulators are hard to construct.
- Limited coverage: an offline dataset with sufficient state-space coverage for offline RL training is impractically large.
Is it possible to combine learning from limited real data (as in offline RL) with unrestricted exploration through imperfect simulators (as in online RL), to address the drawbacks of both approaches?

H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Dynamics-aware policy evaluation:
- Minimize the dynamics-gap-weighted soft-maximum of Q values: push down Q values on high-dynamics-gap samples.
- Maximize Q values on data: pull up Q values on real offline data samples.
- Learn on both offline data and online simulated samples.
- Fix the Bellman error caused by the dynamics gap: use the dynamics ratio as an importance-sampling weight, which can be interpreted as an adaptive adjustment added to rewards.
Theoretical analysis is provided in the paper.
Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.

H2O: Dynamics-Aware Hybrid Offline-and-Online RL
Real-world validation on a wheel-footed robot.
- Standing still: SAC and DARC are unable to keep the robot balanced and quickly fail after initialization. The robot with the H2O policy remains steady after 11 seconds, while CQL bumps into the ground and goes out of control at 12 s.
- Moving straight: DARC and SAC fail at the beginning. H2O achieves good control performance, keeping the robot balanced and closely following the target velocity (v = 0.2 m/s), while CQL exceeds v by a fairly large margin, nearly doubling the desired target velocity. The angle curves also show that the robot controlled by H2O runs more smoothly than with CQL.

Promising Research Directions for Data-Driven Methods
Application areas for decision optimization, grouped by problem type:
- Have data, and possibly a simulator/model: complex industrial system optimization, robot control, autonomous driving. Traditional methods: first-principles modeling of decision mechanisms, classical control theory, simulator-based online RL.
- Optimization problems with explicit formulations: logistics scheduling, production planning. Traditional method: operations research (OR).
Frontier research directions:
- AI for math/OR: using offline imitation/reinforcement learning to accelerate large-scale mixed-integer programming.
- Preference-aware offline imitation learning: fusing expert and non-expert data.
- Hybrid RL: fusing simulated and real data.
- Offline imitation/RL algorithms with strong generalization under limited samples.
- Offline RL under safety constraints.
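The hybrid-RL direction above (fusing simulated and real data, as in H2O) rests on the dynamics-ratio reweighting idea: simulated transitions whose dynamics disagree with the real system should contribute less to Bellman updates. A minimal sketch follows; the function names, the clipping scheme, and the assumption that log-likelihoods come from some learned dynamics models are all illustrative, not the paper's exact formulation.

```python
import numpy as np

def dynamics_ratio_weight(logp_real, logp_sim, clip=10.0):
    """Importance weight p_real(s'|s,a) / p_sim(s'|s,a), clipped for stability.

    The two log-likelihoods would come from learned models (or a discriminator)
    of the real and simulated transition dynamics.
    """
    return float(np.clip(np.exp(logp_real - logp_sim), 1.0 / clip, clip))

def weighted_bellman_loss(q_sa, r, q_next_max, w, gamma=0.99):
    """Squared Bellman error on a simulated transition, scaled by the
    dynamics ratio w: high-gap transitions (small w) contribute less."""
    target = r + gamma * q_next_max
    return w * (q_sa - target) ** 2
```

A transition the simulator models faithfully (log-likelihoods agree, so w is near 1) contributes its full Bellman error, while a high-dynamics-gap transition is down-weighted toward the clipping floor.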
References

Offline Planning
- Zhan, X., Zhu, X., Xu, H. Model-Based Offline Planning with Trajectory Pruning. IJCAI 2022.

Offline RL
- Zhan, X., et al. DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Zhan, X., Zhu, X. Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning. AAAI 2022.
- Xu, H., Jiang, L., Li, J., Zhan, X. A Policy-Guided Imitation Approach for Offline Reinforcement Learning. NeurIPS 2022.
- Li, J., Zhan, X., et al. Distance-Sensitive Offline Reinforcement Learning. arXiv preprint.
- Xu, H., Zhan, X., Li, J., Yin, H. Offline Reinforcement Learning with Soft Behavior Regularization. arXiv preprint.

Offline IL
- Xu, H., Zhan, X., Yin, H., Qin, H. Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations. ICML 2022.
- Zhang, W., et al. Discriminator-Guided Model-Based Offline Imitation Learning. CoRL 2022.

Offline-and-Online Hybrid RL
- Niu, H., et al. When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. NeurIPS 2022.

Thank you for watching!