张希_RLChina talk1126_watermark.pdf

Agent Learning and Decision-Making with Large Language Models: Building the Reinforcement Learning World Model
Xi Sheryl Zhang (张希), Nov. 26, 2023

Why Reinforcement Learning?
- Active learning vs. passive learning paradigms: the main distinction from supervised learning.
- An active learner interacts with the environment at training time, say, by posing queries or performing experiments.
- A passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it.

RL Nomenclature
- The Partially Observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state.
- Dynamics may be known or unknown, stationary or non-stationary.
[Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial]

RL Algorithm: Policy Evaluation & Improvement
- MDP control via Generalized Policy Iteration (GPI).
- FQ1: How good is a specific policy? (policy evaluation)
- FQ2: How can we learn a good policy? (policy improvement)
[Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, The MIT Press, 2018]
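To make the GPI loop concrete, here is a minimal tabular policy-iteration sketch; the (S, A, S) transition-tensor layout is our own illustrative assumption, not something from the talk.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-8):
    """Generalized Policy Iteration on a finite MDP.

    P: transition tensor, shape (S, A, S), with P[s, a, s'] = Pr(s' | s, a)
    R: reward matrix, shape (S, A)
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)   # start from an arbitrary policy
    V = np.zeros(S)
    while True:
        # FQ1 (policy evaluation): iterate the Bellman expectation backup
        while True:
            V_new = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # FQ2 (policy improvement): act greedily w.r.t. the action values
        Q = R + gamma * (P @ V)            # shape (S, A)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # policy stable => optimal
            return pi, V
        pi = pi_new
```

Alternating the two inner steps until the policy stabilizes is exactly the GPI scheme the slide refers to.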

Computational RL Anatomy
- Deep RL: the same anatomy instantiated with deep neural networks (parameters w).
[Sergey Levine and Chelsea Finn, Deep Reinforcement Learning, Decision Making, and Control, ICML 2017 Tutorial]

Successes Obtained via DRL
- DQN (Atari), AlphaGo (Go), MuZero (Chess), AlphaStar (StarCraft).
- Checkmate-in-one task: larger language models are better at finding legal chess moves, but struggle to find checkmating moves.
- How about using an LM? None of the BIG-G models tested can solve the checkmate-in-one task. Not so good.
[Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models, 2023]

Natural Language in RL
- How might intelligent agents ground language understanding in their own embodied perception?

Natural Language in RL
- Task-agnostic control: relabeled goal-conditioned behavioral cloning (GCBC) and Learning from Play (LfP).
- How can LLMs power autonomous agents?
[Learning Latent Plans from Play, 2019]
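A hedged sketch of how relabeled GCBC turns unlabeled play into goal-conditioned supervision; the buffer layout, `policy` interface, and relabeling window are placeholders, not details from the talk.

```python
import random
import torch
import torch.nn.functional as F

def gcbc_step(play_buffer, policy, optimizer, batch_size=256, window=32):
    """One gradient step of relabeled goal-conditioned behavioral cloning.

    play_buffer: list of trajectories, each a list of (obs, action) tensors.
    policy: network mapping (obs, goal) -> predicted action.
    """
    obs_b, goal_b, act_b = [], [], []
    for _ in range(batch_size):
        traj = random.choice(play_buffer)
        t = random.randrange(len(traj) - 1)
        # Hindsight relabeling: a state actually reached later in the same
        # play sequence is treated as the goal the agent was pursuing.
        k = random.randrange(t + 1, min(t + window, len(traj)))
        obs_b.append(traj[t][0])
        act_b.append(traj[t][1])
        goal_b.append(traj[k][0])
    obs, goal, act = torch.stack(obs_b), torch.stack(goal_b), torch.stack(act_b)
    pred = policy(obs, goal)
    loss = F.mse_loss(pred, act)   # BC regression loss for continuous actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```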

A Gentle Start: PaLM-E
- Results on general language tasks and on planning success; an overview of the transfer learning demonstrated by PaLM-E.
- Some conclusions:
  - Dataset: the full mixture is better than a single robot;
  - Params: with a frozen LLM, training only an encoder is feasible;
  - A PaLM-E agent can be utilized in tasks such as VQA, NLG, etc.
[PaLM-E: An Embodied Multimodal Language Model, 2023]

PaLM-E Pipeline
- Input and scene representations for different sensor modalities: low-level skills/actions, a state-estimation vector, a Vision Transformer (ViT), object-centric representations, the Object Scene Representation Transformer (OSRT), and entity referrals.
- Model scales: 4B ViT + PaLM-8B, 22B ViT + PaLM-62B, 22B ViT + PaLM-540B (LLM frozen).
- Abilities: perception, visually-grounded dialogue, and planning.
- It could be posed as building the World Model.
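The pipeline's core trick is that continuous sensor readings are projected into the frozen LLM's token-embedding space. A minimal sketch of that idea follows; `embed` and `forward_embeddings` are assumed interfaces of the LM wrapper, not the actual PaLM-E API.

```python
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Inject continuous sensor tokens into a frozen decoder-only LM,
    in the spirit of the PaLM-E pipeline (names and sizes are placeholders)."""

    def __init__(self, llm, vision_encoder, d_vision, d_llm):
        super().__init__()
        self.llm = llm                         # pretrained LM, kept frozen
        for p in self.llm.parameters():
            p.requires_grad = False            # "frozen LLM" from the slide
        self.vision_encoder = vision_encoder   # e.g. a ViT; this is trained
        self.project = nn.Linear(d_vision, d_llm)  # map into LM token space

    def forward(self, image, text_token_ids):
        vis = self.vision_encoder(image)             # (B, N_patches, d_vision)
        vis_tokens = self.project(vis)               # (B, N_patches, d_llm)
        txt_tokens = self.llm.embed(text_token_ids)  # (B, T, d_llm); assumed API
        # Sensor observations become "words" in the same embedding space.
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.llm.forward_embeddings(seq)      # assumed API: decode from embeddings
```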

Learning-Theoretic View
- How can LLMs power autonomous agents?
  - Understand the world;
  - Plan the tasks;
  - Hasten learning efficiency;
  - Improve algorithmic generalization;
  - Provide an intrinsic cost.
[Mindstorms in Natural Language-Based Societies of Mind, 2023; Yann LeCun, A Path Towards Autonomous Machine Intelligence, 2022]

[Architecture diagram: a decision-making platform spanning deployable hardware control systems (control, navigation, autonomous vehicles, robotics), trainable computation resources, interaction interfaces (HCI: params, results, visualization, correction; datasets, demos, environments), decision foundation models, and an algorithmic platform (perception-oriented models, state abstraction, generative models, inference, planning algorithms, large cross-modality models, sub-goal discovery, policy search, LLMs, ADPP, RLHF, causality, knowledge bases, evaluation). The pipeline runs Perception → Planning → Decision → Action: Building the World Model.]

Delve into a World Model
- How can we imitate human intelligence?
- Joint Embedding Predictive Architectures (JEPA) vs. the MDP predictive model.
[Yann LeCun, A Path Towards Autonomous Machine Intelligence]

Computational RL: More World Models
- How can we imitate human intelligence?
- Denoised MDP [Wang et al., ICML 2022]; DreamerV2 [Hafner et al., ICLR 2021]: a variational bound for the world model, plus critic and actor losses.
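For reference, the "variational bound" listed here is the Dreamer-style world-model objective; in the notation of the cited papers it can be written as follows (reconstructed from DreamerV2, not copied from the slide):

```latex
\mathcal{L}_{\text{model}}
  \;=\; \mathbb{E}_{q}\!\left[\sum_{t}
      -\ln p\!\left(o_t \mid h_t, s_t\right)
      \;-\;\ln p\!\left(r_t \mid h_t, s_t\right)
      \;+\;\beta\,\mathrm{KL}\!\left[\,q\!\left(s_t \mid h_t, o_t\right)
        \,\middle\|\, p\!\left(s_t \mid h_t\right)\right]\right]
```

The critic and actor losses are then trained on trajectories imagined under the prior p(s_t | h_t).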

Can LLMs Help in Understanding the World?
- Deep RL sketch: is a latent space based on vision alone sufficient for learning a complex world?
- State abstraction: not that simple! [Amy Zhang, 2019]

Can LLMs Help in Understanding the World?
- Hallucination! A heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped.
[ReAct: Synergizing Reasoning and Acting in Language Models, 2023; Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection, 2023]
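A toy version of such a stopping heuristic, with thresholds that are purely illustrative:

```python
def should_stop(trajectory, max_steps=30, repeat_window=3):
    """Cut a trajectory that is inefficient or looping; a cheap proxy for
    detecting hallucinated reasoning (thresholds are our own choices)."""
    if len(trajectory) >= max_steps:          # inefficiency cap
        return True
    recent = trajectory[-repeat_window:]
    # The same action repeated verbatim within the window suggests a loop.
    return len(recent) == repeat_window and len(set(recent)) == 1
```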

Mindstorms in Natural Language-Based Societies of Mind (2023)
- More than one LLM.

Dreamer: A Magnificent Backbone for MBRL
- Dreamer series, past and present: planning in latent spaces.
- The Recurrent State Space Model (RSSM) is the key contribution of PlaNet, and the structure of this dynamics model has been utilized consistently throughout the subsequent Dreamer series.
- A variational encoder infers approximate state posteriors from past observations and actions.
[Learning Latent Dynamics for Planning from Pixels, 2019]
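A compact sketch of one RSSM transition, following the deterministic-plus-stochastic split described above (the single-GRU structure and layer sizes are simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSM(nn.Module):
    """One step of the Recurrent State Space Model (PlaNet / Dreamer).
    The latent state splits into a deterministic path h_t and a stochastic path s_t."""

    def __init__(self, state_dim, action_dim, hidden_dim, embed_dim):
        super().__init__()
        self.cell = nn.GRUCell(state_dim + action_dim, hidden_dim)
        self.prior_net = nn.Linear(hidden_dim, 2 * state_dim)              # p(s_t | h_t)
        self.post_net = nn.Linear(hidden_dim + embed_dim, 2 * state_dim)   # q(s_t | h_t, o_t)

    def step(self, s_prev, a_prev, h_prev, obs_embed=None):
        # Deterministic recurrence: h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        h = self.cell(torch.cat([s_prev, a_prev], dim=-1), h_prev)
        prior_mean, prior_std = self.prior_net(h).chunk(2, dim=-1)
        prior_std = F.softplus(prior_std) + 1e-3
        if obs_embed is None:   # imagination: no observation, sample the prior
            s = prior_mean + prior_std * torch.randn_like(prior_std)
            return s, h, (prior_mean, prior_std), None
        # Posterior: the variational encoder conditions on the observation
        post_mean, post_std = self.post_net(torch.cat([h, obs_embed], -1)).chunk(2, -1)
        post_std = F.softplus(post_std) + 1e-3
        s = post_mean + post_std * torch.randn_like(post_std)   # reparameterized sample
        return s, h, (prior_mean, prior_std), (post_mean, post_std)
```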

Dreamer v1: Making Decisions by Analytic Gradients Instead of Planning
- Learning long-horizon behaviors by latent imagination; strong empirical performance on visual control.
- Gradients propagate through the learned dynamics; actions come from a stochastic policy (like SAC) trained against a value target.
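A sketch of that imagination-based update, reusing the RSSM sketch above; the λ-return recursion and loss shapes follow the Dreamer recipe, but the hyperparameters and interfaces are assumptions:

```python
import torch

def imagine_and_learn(rssm, actor, critic, reward_fn, s0, h0,
                      horizon=15, gamma=0.99, lam=0.95):
    """Dreamer-style actor-critic update: roll the learned model forward in
    latent space and backpropagate value estimates through the differentiable
    dynamics, instead of planning at decision time."""
    s, h = s0, h0
    rewards, values = [], []
    for _ in range(horizon):
        a = actor(torch.cat([s, h], -1))     # sampled, reparameterized action
        s, h, _, _ = rssm.step(s, a, h)      # prior rollout: no observations
        feat = torch.cat([s, h], -1)
        rewards.append(reward_fn(feat))
        values.append(critic(feat))
    # TD(lambda) targets over the imagined trajectory, bootstrapped at the end
    returns = [values[-1]]
    for t in reversed(range(horizon - 1)):
        target = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * returns[0])
        returns.insert(0, target)
    returns = torch.stack(returns)
    actor_loss = -returns.mean()             # ascend V_lambda via model gradients
    critic_loss = ((torch.stack(values) - returns.detach()) ** 2).mean()
    return actor_loss, critic_loss
```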

Computational RL: Our Design
- It aims to describe how we understand the world; this architecture is by no means unique.

Computational RL: Our Design
- State abstraction: an off-policy model with a non-stationarity assumption; latent space learning.
- Goal: understanding the observed world for decision making, on control tasks provided by the DeepMind Control Suite.
[Liu, S., Zhang, X., Li, Y., Zhang, Y., & Cheng, J. (2023, May). On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning. In International Conference on Learning Representations.]

Computational RL: Our Design
- Key idea: data manipulation can make understanding easier; make the world look pretty much stationary, say, by parameterization.
- Representation-invariant framework: properties of functions (smoothness, convexity) are discussed for invariant representation learning.
- Control the observation distribution in a high-dimensional space: a Gaussian.
[Balestriero et al., Learning in High Dimension Always Amounts to Extrapolation, arXiv 2021]
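As a generic illustration of the "data manipulation" idea (not the exact objective of the ICLR 2023 paper above), one can regularize Q-learning so that random image transformations of the same observation yield consistent values:

```python
import torch

def augmentation_invariant_q_loss(q_net, encoder, augment, obs, action,
                                  target_q, n_aug=2):
    """TD loss plus an invariance term across random views of one observation.

    augment: stochastic image transformation (e.g. random shift/crop).
    encoder: maps augmented observations to a latent state abstraction.
    """
    qs = []
    for _ in range(n_aug):
        z = encoder(augment(obs))        # latent state for one random view
        qs.append(q_net(z, action))
    q_stack = torch.stack(qs)            # (n_aug, B, 1)
    td_loss = ((q_stack - target_q.unsqueeze(0)) ** 2).mean()
    # Invariance term: value estimates under different views should agree,
    # which nudges the representation toward transformation invariance.
    invariance = q_stack.var(dim=0).mean()
    return td_loss + invariance
```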

Computational RL: Our Design
- Insights: a theoretical understanding of the possible roles of SSL via data augmentation, Q-learning, regularization via statistics, and inductive bias.

DreamerV3: SOTA Challenger
- Dreamer v3 is a strong baseline with out-of-the-box usability, requiring no adjustment of hyperparameters.
- DreamerV3 is the first algorithm capable of autonomously collecting diamonds in Minecraft from scratch, without any human data or pre-training.
- What if there were a pre-training stage for DreamerV3?

[Architecture diagram repeated: Perception → Planning → Decision → Action, Building the World Model.]
- A simulation platform with thousands of diverse open-ended tasks.
- An internet-scale multimodal Minecraft knowledge base.
- A novel algorithm for embodied agents with large-scale pre-training.

Can LLMs Help in Agent Planning?
[MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, 2022]

Can LLMs Help in Agent Planning?
- Yes, it can.
[Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023]

Can LLMs Help in Agent Planning?
- MCTS utilizes the LLM's commonsense knowledge to generate the initial belief over states.
- Random rollouts; actions are chosen based on Q values, visit counts, and the LLM policy; Q values are updated from rollout returns; the LLM serves as the world model.
[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023]
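A minimal sketch of the selection and update rules implied by this slide, with the LLM prior stored per node; the `Node` interface is our own, not the paper's.

```python
import math

class Node:
    """Search node; `llm_prior` holds Pr(action | state description) elicited
    from the LLM, used as the policy prior in selection."""

    def __init__(self, llm_prior):
        self.llm_prior = llm_prior                  # dict: action -> prior prob
        self.visits = {a: 0 for a in llm_prior}
        self.q = {a: 0.0 for a in llm_prior}

    def select(self, c_puct=1.0):
        # PUCT-style choice mixing Q value, visit count, and the LLM prior
        total = sum(self.visits.values()) + 1
        def score(a):
            u = c_puct * self.llm_prior[a] * math.sqrt(total) / (1 + self.visits[a])
            return self.q[a] + u
        return max(self.llm_prior, key=score)

    def update(self, action, ret):
        # Running-mean Q update from a random rollout return
        self.visits[action] += 1
        self.q[action] += (ret - self.q[action]) / self.visits[action]
```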

Can LLMs Help in Agent Planning?
- LLM as a commonsense world model; LLM as a heuristic policy.
- Insight: how do we choose between L-Model and L-Policy*? One idea is the minimum description length (MDL) principle: theoretical analysis suggests that a hypothesis with a shorter description length has a smaller generalization error and is preferred.
- *L-Policy: treat the LLM as a policy and query it directly for the next actions.
[Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, 2023]
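That MDL argument can be grounded in the standard Occam bound (a textbook result, not taken from the slides): for a prefix-free encoding with description length |h| bits and an i.i.d. sample of size m, with probability at least 1 − δ,

```latex
\operatorname{err}(h)\;\le\;\widehat{\operatorname{err}}(h)
  \;+\;\sqrt{\frac{|h|\ln 2+\ln(2/\delta)}{2m}}\qquad\text{for all }h,
```

so whichever of L-Model and L-Policy admits the shorter description enjoys the tighter generalization guarantee.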

[Architecture diagram repeated: Perception → Planning → Decision → Action, Building the World Model.]

Can LLMs Help in RL Generalization?
- "The study of generalisation in deep reinforcement learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments." [1]
- "The goal of generalization in RL is to make RL algorithms perform well in test domains that are unseen during training." [2]
- Enable agents to adapt to the real world: agents that work in many tasks and adapt to new environments.
[1] Kirk R., Zhang A., Grefenstette E., et al. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. Journal of Artificial Intelligence Research, 2023, 76: 201-264.
[2] Ni T., Eysenbach B., Salakhutdinov R. Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. In International Conference on Machine Learning. PMLR, 2022: 16691-16723.

Generalization in RL
- How to generalize: meta-learning, robust RL, representation learning, multi-task learning.
- Meta-learning for RL enables fast online adaptation; multi-task learning for RL solves different tasks; adversarial and robust RL deal with environment interference; Decision Transformers scale toward large decision models; state abstraction and representation learning handle environment changes.

Similarity Metric
- Robust representations of the visual scene should be insensitive to irrelevant objects or details. [1]
- The Policy Similarity Metric defines a notion of similarity between states originating from different environments by the proximity of the long-term optimal behavior from these states. [2]
- Bisimulation metric: similar behavior with different rewards vs. different behavior with similar rewards.
[1] Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., & Levine, S. (2020, October). Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations.
[2] Agarwal, R., Machado, M. C., Castro, P. S., & Bellemare, M. G. (2020, October). Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning. In International Conference on Learning Representations.

Similarity Metric
- Policy Similarity Metric vs. Bisimulation Metric: both combine a local term with a time-discounted, long-term Wasserstein term over transition distributions; the bisimulation metric measures the local difference in reward, while the PSM measures the local difference in optimal behavior. [2]
- Contrastive learning architecture: PSEs align labeled states that have the same distance from the obstacle, i.e., the invariant feature that generalizes across tasks.
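The two recursions being contrasted can be written side by side (standard fixed-point forms from the cited papers; c and γ are the respective discounts):

```latex
% Bisimulation metric: local reward difference + discounted Wasserstein term
d(s_i,s_j)=(1-c)\,\lvert r_{s_i}-r_{s_j}\rvert
  + c\,W_1(d)\bigl(P(\cdot\mid s_i),\,P(\cdot\mid s_j)\bigr)

% Policy similarity metric: local optimal-behavior difference instead of reward
d^{*}(x,y)=\mathrm{DIST}\bigl(\pi^{*}(\cdot\mid x),\,\pi^{*}(\cdot\mid y)\bigr)
  + \gamma\,W_1(d^{*})\bigl(P^{\pi^{*}}(\cdot\mid x),\,P^{\pi^{*}}(\cdot\mid y)\bigr)
```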

Is Reward Enough with LLMs?
- Conclusion: LLMs are efficient in-context learners. They are able to provide reward signals that are consistent with a user's objectives from examples; even a single example with an explanation will suffice.
- An LLM is able to identify well-known objectives and provide objective-aligned reward signals in a zero-shot setting.
- It can train objective-aligned agents when ground-truth rewards are not present in complex, longer-horizon tasks.
[Reward Design with Language Models, 2023]
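A minimal sketch of using an LLM as a proxy reward model in this spirit; the prompt format and the `llm` callable are assumptions, not the paper's interface.

```python
def llm_reward(llm, task_description, episode_summary):
    """Query an LLM for a binary, objective-aligned reward signal.

    llm: callable taking a prompt string and returning a short completion.
    """
    prompt = (
        f"Objective: {task_description}\n"
        f"Episode: {episode_summary}\n"
        "Did the agent satisfy the objective? Answer Yes or No."
    )
    answer = llm(prompt)   # one short completion per episode
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```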

More Important Topics (Perception → Planning → Decision → Action: Building the World Model)
- RLHF for language-based decision making.
- Limitations of LLMs trained without actions.
- Adopting online/offline RL with LLMs.
- Designing long-term/short-term memory.
- Long-horizon reasoning in RL.
- Hierarchical RL with high-level skill discovery.
- Can AGI agents be trained with a world model?

Next talk: "When Representation Learning Meets Causal Inference", Tuesday, December 5, 2023.

Social Life Incentivizes the Evolution of Intelligence
- "Because corvids and apes share these cognitive tools, we argue that complex cognitive abilities evolved multiple times in distantly related species with vastly different brain structures in order to solve similar socioecological problems." (Science, vol. 306, issue 5703, pp. 1903-1907)
- An open-ended world could produce: theory of mind, negotiation, social skills, empathy, real language understanding.
- Human-level concept learning vs. grandmaster level in StarCraft II: which one is more difficult? How can self-play be used in a world model? [Ilya Sutskever, 2019]

Thank You! Q&A
