Huang Shiyu: OpenRL, a Reinforcement Learning Framework Supporting Large-Model Training, and the PluginStore for the Large-Model Era (slide deck, 61 pages)
OpenRL: A Unified Reinforcement Learning Framework
Speaker: Huang Shiyu, 4Paradigm

About the speaker: Huang Shiyu is a reinforcement learning scientist at 4Paradigm and leads the open-source OpenRL Lab. He received both his bachelor's and Ph.D. degrees from the Department of Computer Science, Tsinghua University, advised by Profs. Jun Zhu and Ting Chen, and spent an exchange period at CMU advised by Prof. Deva Ramanan. His research focuses on reinforcement learning, multi-agent RL, and distributed RL. He has published at venues and journals including ICLR, CVPR, AAAI, NeurIPS, Nature Machine Intelligence, ICML, AAMAS, and Pattern Recognition. TiZero, the Google Research Football agent developed under his leadership, ranked first on the Jidi platform. He has also worked at Tencent AI Lab, Huawei Noah's Ark Lab, SenseTime, and RealAI.

Contents
1. Background of reinforcement learning
2. Introduction to OpenRL
3. Future development of OpenRL
4. Introduction to OpenPlugin

Part 01: Introduction & Motivation

What is Reinforcement Learning?
Goal of RL: Artificial General Intelligence (AGI). Example: reinforcement learning in dog training.

What else?
- Robotics, Autonomous Driving (OpenAI 2019, CARLA 2017)
- Industrial Design, Quantitative Trading (PrefixRL 2022, FinRL 2020)
- Chat Bot
- Multi-agent RL, Competitive RL (TiZero 2023, Honor of Kings Arena 2022)

Do RL in a Unified Framework
- Various RL algorithms
- Various environments
- Multi-agent & self-play
- Offline RL

Part 02: OpenRL, an Open-Source RL Framework

Main Features of OpenRL: Friendly to beginners
- Install with: pip install openrl (or: docker pull openrllab/openrl)
- Train from the command line: openrl --mode train --env CartPole-v1
- Documentation (English and Chinese) and tutorials
Main Features of OpenRL: Customizable capabilities for professionals
- Configure everything via YAML: python train_ppo.py --config mpe_ppo.yaml
- Override options from the command line: python train_ppo.py --seed 1 --lr 5e-4
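As a sketch of what a config such as mpe_ppo.yaml might contain (the key names below are illustrative assumptions, not OpenRL's documented option names; consult OpenRL's docs for the real ones):

```yaml
# Hypothetical mpe_ppo.yaml: PPO settings for an MPE environment.
seed: 0
lr: 5.0e-4
episode_length: 25
use_recurrent_policy: true
experiment_name: mpe_ppo_run
```

Command-line flags such as --seed and --lr then override the corresponding YAML values for quick experiments.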
- Track your experiments via Wandb
- Track your experiments via Tensorboard
- Customize the Wandb output: https:/
- Abstract & modularized design: Reward Module, Policy Module, Value Module, Algorithm

Customize Reward Model
Reference: Chen, Wenze, et al. "DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization." arXiv preprint arXiv:2207.05631 (2022).

For language-model training, the reward model can combine several terms:
- Intent Reward: when the text generated by the agent is close to the expected intent, the agent receives a higher reward.
- METEOR Metric Reward: METEOR is a metric for evaluating text-generation quality; it measures how similar generated texts are to the expected ones. This metric is used as reward feedback to optimize the agent's text-generation performance.
- KL Divergence Reward: this reward limits how far the text generated by the agent deviates from the pre-trained model, preventing reward hacking.

Main Features of OpenRL: Support Offline RL
- Learn from interaction
- Learn from expert data
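The three reward terms above (intent, METEOR-style similarity, KL penalty) are combined into one scalar reward. Below is a minimal self-contained sketch of that composition; the scoring functions are toy stand-ins rather than OpenRL's actual reward modules, and the weight beta is an illustrative choice:

```python
import math

def intent_reward(generated: str, expected_intent: str) -> float:
    # Toy stand-in: 1.0 if the expected intent keyword appears, else 0.0.
    return 1.0 if expected_intent in generated else 0.0

def similarity_reward(generated: str, reference: str) -> float:
    # Toy stand-in for a METEOR-style metric: unigram overlap ratio.
    gen, ref = set(generated.split()), set(reference.split())
    return len(gen & ref) / max(len(ref), 1)

def kl_penalty(p, q):
    # KL(p || q) between the policy's and the pre-trained model's
    # next-token distributions; assumes q > 0 wherever p > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_reward(generated, expected_intent, reference, p, q, beta=0.1):
    # Additive composition: encourage intent and similarity,
    # subtract a scaled KL penalty to stay close to the base model.
    return (intent_reward(generated, expected_intent)
            + similarity_reward(generated, reference)
            - beta * kl_penalty(p, q))

r = total_reward("book a flight to Paris", "flight",
                 "book a flight to Paris", [0.5, 0.5], [0.5, 0.5])
print(r)  # 2.0: intent hit (1.0) + full overlap (1.0), zero KL penalty
```

The point is only the shape of the composition: task rewards added, KL divergence subtracted; in OpenRL these terms would live in customized reward modules.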
Main Features of OpenRL: Customizable capabilities for professionals
- Dictionary observation space support
- Serial or parallel environment training
- Support for models such as LSTM, GRU, Transformer, etc.
- Automatic mixed precision (AMP) training
- Data collection with a half-precision policy network

Main Features of OpenRL: Build on top of others
- Datasets
- Models

Main Features of OpenRL: Gallery

Main Features of OpenRL: High performance
- Training CartPole on a laptop takes only a few seconds.
- +17% speedup for language-model training.
- Ranking 1st on Google Research Football.
- +43% performance improvement on LLM training, compared with RL4LMs.

TiZero
Lin, Fanqi, et al. "TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play." arXiv preprint arXiv:2302.07515 (2023).

Part 03: Future Release

Large-Scale RL
- Large Model
- Large Cluster
- Large Population

Large-Scale RL: Large Population
Yang, Xinyi, et al. "Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation." arXiv preprint arXiv:2302.04094 (2023).

Open RL via Sharing
- Share models
- Share code
- Share results

Scan the QR code to try OpenRL! Visit:

Part 04: PluginStore for LLM

Why?
Think about pip for Python packages (apt / yum / brew / dnf / npm, ...)!
Think about the App Store.
- Standardize plugins.
- Provide a simple way to use and share LLM plugins.

Main Features of OpenPlugin: Installation
pip install openplugin-py

Main Features of OpenPlugin: Usage
- Install a plugin: op install
- Install locally: op install .
- Reinstall a plugin: op reinstall
- Uninstall a plugin: op uninstall
- Run a plugin: op run
- List installed plugins: op list
op is all you need!

Main Features of OpenPlugin: Usage
Provides a config API for the SageGPT/ChatGPT platforms:
- the JSON manifest is served at server_host/ai-plugin.json
- the YAML spec is served at server_host/openapi.yaml

Main Features of OpenPlugin: Build on top of others
You can share your plugin with others!
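As a rough illustration of the manifest served at server_host/ai-plugin.json: the field names below follow the publicly documented ChatGPT plugin manifest convention, so the exact file OpenPlugin generates may differ, and server_host stands in for your own server address:

```json
{
  "schema_version": "v1",
  "name_for_human": "QRcode Plugin",
  "name_for_model": "qrcode",
  "description_for_human": "Generate QR codes from text.",
  "description_for_model": "Generate a QR code image for a given input string.",
  "auth": { "type": "none" },
  "api": {
    "type": "openapi",
    "url": "http://server_host/openapi.yaml"
  },
  "logo_url": "http://server_host/logo.png",
  "contact_email": "owner@example.com",
  "legal_info_url": "http://server_host/legal"
}
```

The "api.url" field points the platform at the openapi.yaml served by the same config API.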
https:/

Main Features of OpenPlugin: Plugin Store
- ikun_plugin
- todo_plugin
- QRcode_plugin
- ...

Main Features of OpenPlugin: QRcode_plugin
- Support for placeholders
- Plugin structure

Main Features of OpenPlugin: How to use QRcode_plugin
Step 0: Find a server
Step 1: pip install openplugin-py
Step 2: op install QRcode_plugin
Step 3: op run QRcode
Step 4: Get the JSON and YAML files
Step 5: Register the plugin on the SageGPT or ChatGPT website
Step 6: Finished! Have fun!

Main Features of OpenPlugin: QRcode_plugin demo

Try OpenPlugin, click Star! Visit: https:/

Thank you for listening.