DataFunSummit #2023
An Initial Exploration of Large Models for Multi-Agent Reinforcement Learning
Xiaotian Hao, Ph.D. student, Tianjin University
NOAH'S ARK LAB

Contents
01 Challenges for large multi-agent decision models: why does reinforcement learning need large models, and what challenges do multi-agent decision models face?
02 Action semantics network: ICLR-21, Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
03 Permutation invariance and permutation equivariance: ICLR-23, Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
04 Cross-task automatic curriculum learning: AAMAS-23, PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

01 Challenges for large multi-agent decision models

Basic concepts: What is a cooperative multi-agent system?
- Multi-"hero" cooperation in game AI: AlphaStar (DeepMind), Dota 2 (OpenAI Five), Honor of Kings (Tencent).
- Multi-user, multi-item recommendation; multi-vehicle coordination in intelligent warehouses; multi-resource scheduling and collaborative optimization (cloud computing, production scheduling); multi-vehicle dispatch at DiDi (ride-hailing); multi-vehicle transport and delivery optimization.
- A large number of real-world problems can be modeled as cooperative control and optimization problems involving multiple participants that jointly optimize one (or several) shared objective functions.

Basic concepts: Modeling cooperative multi-agent reinforcement learning
- Multiagent Markov Decision Processes (MMDP); Decentralized Partially Observable MDP (Dec-POMDP).
- Joint policy $\boldsymbol{\pi} = \langle \pi_1, \dots, \pi_n \rangle$, optimized as $\boldsymbol{\pi}^{*} = \arg\max_{\boldsymbol{\pi}} \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, \boldsymbol{a}_t)\big]$.
- StarCraft example. Per-entity observation features: type, distance, relative x/y coordinates, health, armor; actions: no-op, move up/down/left/right, attack a specific enemy unit.
Three difficulties:
1. Curse of dimensionality: the state/observation space grows exponentially with the number of entities, and the joint action space explodes exponentially as well.
2. Low sample efficiency.
3. Poor generality and generalization.
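To make difficulty 1 concrete, a back-of-envelope count with illustrative numbers (not from the talk): since all agents act simultaneously, the joint action space is $|\mathcal{A}_{\text{joint}}| = \prod_{i=1}^{n} |\mathcal{A}_i|$, so with $n = 10$ agents and $|\mathcal{A}_i| = 16$ actions each there are already $16^{10} \approx 1.1 \times 10^{12}$ joint actions.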
What is a large model for multi-agent reinforcement learning? Design models with good generalization, so that one model can solve many similar problems.
- Same game, different scenarios (StarCraft maps): MMM2, 1c3s5z, 2m_vs_1z, 3s_vs_5z, 3s5z, 3s5z_vs_3s6z.
- Different games, different scenarios: StarCraft, Dota 2, Honor of Kings.

What benefits can larger models bring to reinforcement learning?
- Large models have already achieved breakthrough results in natural language processing and computer vision (ChatGPT-3.5 has roughly 175 billion parameters).
- In reinforcement learning: BBF (Bigger, Better, Faster) [1], measured by the environment samples needed to reach human-level performance on Atari (over 26 games, Atari-100k).
- BBF reaches performance similar to the model-based EfficientZero with at least a 4x reduction in runtime.
- Recipe: larger network + self-supervision + increased replay ratio + parameter resets.
[1] Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML-2023.
What challenges do large multi-agent reinforcement learning models face?
- Different entity numbers and types: scenarios differ in the number and kinds of agents (or entities).
- Different feature inputs: entity features differ, so observations (obs) and states differ, and the network input dimensions and semantics differ (e.g., obs features: type, distance, relative x/y coordinates, health, armor, ...).
- Different action spaces: the policy network's output dimensions and semantics differ.
- Different rewards: the value network's output scales differ.

Describing multi-agent systems uniformly, by analogy with language models: align multiagent systems and languages.
- Language model: a vocabulary of words; sentences describe the objective world; word -> tokenizer -> word vector (word2vec) -> neural network (the model backbone).
- Entity-factored description of a multiagent system: an attribute table, an entity table, and an action table (action semantics) together define the system, much like a relational database; attribute -> tokenizer -> entity vector -> neural network (the model backbone).
- Entity attributes such as type, position, health, and armor describe the observation/state; actions such as move up, move down, and attack carry the action semantics.
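As a concrete illustration of the "attributes -> tokenizer -> entity vectors" pipeline, here is a minimal PyTorch sketch; the attribute fields, layer sizes, and class name are hypothetical, not the talk's implementation:

```python
import torch
import torch.nn as nn

# A minimal sketch of the "entity tokenizer" analogy: each row of the
# entity table (type, position, health, armor, ...) is embedded into a
# fixed-length entity vector, the way a word is mapped to a word vector.
class EntityTokenizer(nn.Module):
    def __init__(self, n_types: int, n_numeric: int, d_model: int = 64):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, d_model)  # categorical attribute
        self.num_proj = nn.Linear(n_numeric, d_model)   # numeric attributes
        self.out = nn.Linear(d_model, d_model)

    def forward(self, entity_type, numeric_feats):
        # entity_type: (batch, m) ints; numeric_feats: (batch, m, n_numeric)
        h = self.type_emb(entity_type) + self.num_proj(numeric_feats)
        return self.out(torch.relu(h))  # (batch, m, d_model) entity vectors

tok = EntityTokenizer(n_types=5, n_numeric=4)
vecs = tok(torch.randint(0, 5, (2, 6)), torch.randn(2, 6, 4))
print(vecs.shape)  # torch.Size([2, 6, 64])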
Three key design priors:
1. Action semantics network: ICLR-2021, Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
2. Permutation invariance, permutation equivariance, and variable-length model inputs: ICLR-2023, Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
3. Transfer learning and cross-task automatic curriculum learning: AAMAS-2023, PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.

02 Action semantics network
ICLR-2021, Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
ASN (Action Semantics Network)
- ASN considers different actions' influence on other agents and designs the neural network based on the action semantics, e.g., move actions versus attack actions.
- Move actions affect only the acting agent itself, while an attack-like action is directed at one specific entity, so each such action is scored from the pair of the agent and its target entity.
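A minimal sketch of this design, assuming hypothetical dimensions and names (ASNHead, pair_q); the real ASN architecture has more structure:

```python
import torch
import torch.nn as nn

# Sketch of the ASN idea: entity-uncorrelated actions (move) are scored from
# the agent's own state, while each entity-correlated action (attack enemy j)
# is scored from the pair (agent embedding, enemy j's features).
class ASNHead(nn.Module):
    def __init__(self, d_self: int, d_enemy: int, n_move: int, d_h: int = 64):
        super().__init__()
        self.encode_self = nn.Sequential(nn.Linear(d_self, d_h), nn.ReLU())
        self.move_q = nn.Linear(d_h, n_move)                 # one Q per move action
        self.pair_q = nn.Sequential(                         # shared across enemies
            nn.Linear(d_h + d_enemy, d_h), nn.ReLU(), nn.Linear(d_h, 1))

    def forward(self, self_feats, enemy_feats):
        # self_feats: (batch, d_self); enemy_feats: (batch, n_enemy, d_enemy)
        h = self.encode_self(self_feats)
        q_move = self.move_q(h)                              # (batch, n_move)
        h_rep = h.unsqueeze(1).expand(-1, enemy_feats.size(1), -1)
        q_attack = self.pair_q(torch.cat([h_rep, enemy_feats], -1)).squeeze(-1)
        return torch.cat([q_move, q_attack], dim=-1)         # (batch, n_move+n_enemy)

head = ASNHead(d_self=32, d_enemy=16, n_move=5)
q = head(torch.randn(4, 32), torch.randn(4, 3, 16))
print(q.shape)  # torch.Size([4, 8])
```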
03 Permutation invariance and permutation equivariance
ICLR-2023, Boosting MARL via Permutation Invariant and Permutation Equivariant Networks

Motivation: Entity-factored modeling in MARL
- A multiagent environment typically consists of entities, including learning agents and non-player objects.
- Both the state $s$ and each agent's observation $o$ are usually composed of the features of the $m$ entities $[x_1, \dots, x_m]$, each $x_j \in \mathcal{X}$.
- The curse of dimensionality: if the state $s$ or the observation $o$ is simply represented as a concatenation of the entities' features in a fixed order, the state/observation space grows exponentially as the entity number increases, which results in low sample efficiency and poor scalability of existing MARL methods.
- Main idea: multiagent systems containing homogeneous agents exhibit symmetry. Building $\pi$/$Q$ functions that are insensitive to the entities' order shrinks the state/observation space by a factor of $m!$, i.e., from $|\mathcal{X}|^{m}$ (concatenation in a fixed order) to $|\mathcal{X}|^{m}/m!$, thus alleviating the curse of dimensionality.
- Intuition: the state characterizes objective information about the set of entities and should not change with the input order; with 6 homogeneous agents, all $6! = 720$ orderings describe the same underlying state.
- Typical multi-agent environments contain two types of actions: entity-correlated actions $\mathcal{A}^{\text{equiv}}$ and normal (entity-uncorrelated) actions $\mathcal{A}^{\text{inv}}$.
- Entity-correlated actions $\mathcal{A}^{\text{equiv}}$: e.g., attack which enemy entity or heal which ally entity (StarCraft), pass the ball to which teammate (Football).
- Normal (entity-uncorrelated) actions $\mathcal{A}^{\text{inv}}$: e.g., move in different directions.

Motivation: Designing permutation-insensitive $\pi$/$Q$ functions
- To build $\pi$/$Q$ functions insensitive to the order of the entity features $[x_1, \dots, x_m]$, we should take the type of the actions into consideration. Let $M$ be an arbitrary permutation matrix operating on $[x_1, \dots, x_m]^{T}$.
- For entity-correlated actions $\mathcal{A}^{\text{equiv}}$, permuting the input entities' order should also permute the corresponding outputs' order, keeping a one-to-one correspondence (permutation equivariance, PE): $f(Mx) = M f(x)$.
- For normal (entity-uncorrelated) actions $\mathcal{A}^{\text{inv}}$, permuting the input entities' order should not change the outputs (permutation invariance, PI): $f(Mx) = f(x)$.
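A small numeric check of the two definitions, using sum-pooling for PI and an entity-wise shared weight for PE; the functions are illustrative stand-ins, not the paper's networks:

```python
import numpy as np

# Numeric check of the two properties for a permutation matrix M:
#   PI:  f(Mx) == f(x)      (sum-pool over entities)
#   PE:  g(Mx) == M g(x)    (the same weight applied to every entity)
rng = np.random.default_rng(0)
m, d = 4, 3
x = rng.normal(size=(m, d))
M = np.eye(m)[rng.permutation(m)]        # random permutation matrix
W = rng.normal(size=(d, 1))

f = lambda z: z.sum(axis=0) @ W          # permutation invariant
g = lambda z: z @ W                      # permutation equivariant

print(np.allclose(f(M @ x), f(x)))       # True
print(np.allclose(g(M @ x), M @ g(x)))   # True
```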
Method: Designing permutation invariant and permutation equivariant policy networks
- The target is to inject the PI and PE inductive biases into the policy or the value function.
- Tasks that contain only one of the two action types can be treated as special cases.

Method: Minimal Modification Principle (ease of use)
- Existing algorithms have invested a lot in handling MARL-specific problems; e.g., they usually incorporate RNNs into the backbone to handle partially observable inputs.
- We propose to modify only the input module A and the output module D, keeping the intermediate backbone modules B and C unchanged: viewing the agent network as the composition $y = D(C(B(A(x))))$, only A and D are replaced with PI/PE counterparts.
Method: Dynamic Permutation Network (DPN)
- The core idea of DPN is to always assign the same weight matrix to the same entity's features, no matter where that entity appears in the input. A weight selection network whose output dimension is $m$ picks the matrix for each entity.
- PI input layer A: whatever the order of the $m$ entities in the observation, each entity is embedded by its consistently selected weight matrix and the embeddings are aggregated, so layer A's output is unchanged under input permutations (PI).
- PE output layer D: the $j$-th output of D is computed from the $j$-th entity with its consistently selected weights, so an input order change results in the same output order change (PE).
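A simplified sketch of the PI input layer under this idea; the soft (softmax) selection and all sizes are my assumptions for differentiability and brevity, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

# Simplified sketch of DPN's PI input layer: a selection network scores m
# candidate weight matrices for each entity, so the same entity tends to be
# embedded by the same matrix regardless of where it appears in the input.
class DPNInputLayer(nn.Module):
    def __init__(self, m: int, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, d_in, d_out) * 0.1)  # m candidates
        self.select = nn.Linear(d_in, m)                          # scores per entity

    def forward(self, x):
        # x: (batch, m, d_in). Summing over entities keeps the output PI.
        probs = self.select(x).softmax(dim=-1)           # (batch, m, m)
        emb = torch.einsum('bmi,kio->bmko', x, self.W)   # all candidate embeddings
        chosen = (probs.unsqueeze(-1) * emb).sum(dim=2)  # soft-selected per entity
        return chosen.sum(dim=1)                         # (batch, d_out)
```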
Method: Hyper Policy Network (HPN)
- How can the weight matrices that embed the input and the output be generated automatically? We use hypernetworks to directly generate the weight matrices of the input layer and the output layer from each entity's own features.
- PI input layer A: no matter what order the $m$ entities appear in the observation, each entity is embedded with a weight matrix generated from its own features and the embeddings are summed, so layer A's output is permutation invariant.
- PE output layer D: the $j$-th output of D is computed with weights generated from entity $j$'s features, so an input order change results in the same output order change, achieving PE.
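A minimal runnable sketch of this hypernetwork construction, assuming a single-layer backbone and illustrative sizes; the class and field names are mine, not the paper's code:

```python
import torch
import torch.nn as nn

# Minimal sketch of HPN: a hypernetwork generates an input weight matrix from
# each entity's own features (so the entity-to-weight assignment is order
# independent), entity embeddings are sum-pooled (PI), and per-entity output
# weights produce one logit per entity-correlated action (PE).
class HPN(nn.Module):
    def __init__(self, d_in: int, d_h: int, n_inv_actions: int):
        super().__init__()
        self.hyper_in = nn.Linear(d_in, d_in * d_h)    # generates W_in(x_j)
        self.backbone = nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU())
        self.inv_head = nn.Linear(d_h, n_inv_actions)  # move-like actions
        self.hyper_out = nn.Linear(d_in, d_h)          # generates w_out(x_j)

    def forward(self, x):
        # x: (batch, m, d_in)
        b, m, d_in = x.shape
        W_in = self.hyper_in(x).view(b, m, d_in, -1)       # per-entity weights
        z = torch.einsum('bmi,bmih->bmh', x, W_in).sum(1)  # PI sum-pooling
        h = self.backbone(z)
        q_inv = self.inv_head(h)                           # order independent
        w_out = self.hyper_out(x)                          # (b, m, d_h)
        q_equiv = torch.einsum('bh,bmh->bm', h, w_out)     # permutes with input
        return q_inv, q_equiv

net = HPN(d_in=8, d_h=32, n_inv_actions=5)
q_inv, q_equiv = net(torch.randn(2, 6, 8))
print(q_inv.shape, q_equiv.shape)  # torch.Size([2, 5]) torch.Size([2, 6])
```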
Experiments: StarCraft Multi-Agent Challenge (SMAC)
- Two teams, allies and enemies: the ally units are controlled by the agents, while the enemy units are controlled by the built-in rule-based bots.
- SOTA performance: we achieve 100% win rates on all hard and super-hard maps.
- Our design follows the Minimal Modification Principle and can be easily plugged into any MARL algorithm to boost its performance; it is very easy to use: HPN+QPLEX vs QPLEX (in SMAC), HPN+MAPPO vs MAPPO (in SMAC).

Experiments: StarCraft Multi-Agent Challenge V2 (SMAC-V2)
- Random start positions and random unit types.

Experiments: Multiagent Particle Environment (MPE) and Google Research Football (GRF)
- For MPE, actions consist only of movements, so only the PI property is needed.
- For GRF, each agent has 19 discrete actions, including moving, sliding, shooting, and passing: HPN+QMIX vs QMIX (in Google Football).

Experiments: Transferability of HPN
- Apart from achieving PI and PE, another benefit of HPN is that it naturally handles variable numbers of inputs and outputs. HPN can therefore be used to design more efficient multitask-learning and transfer-learning algorithms (the network architecture itself generalizes and transfers).
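Continuing the hypothetical HPN sketch above, nothing in that construction fixes the entity count $m$, so the very same parameters accept a different number of entities:

```python
# Reusing the HPN sketch above: 9 entities instead of 6, same parameters.
q_inv, q_equiv = net(torch.randn(2, 9, 8))
print(q_inv.shape, q_equiv.shape)  # torch.Size([2, 5]) torch.Size([2, 9])
```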
Summary
- Simple but efficient implementations of PI and PE modules for MARL, trainable in an end-to-end way.
- A plug-in module for the policy function or the value function of any MARL algorithm.
- Achieves SOTA results on typical MARL benchmarks.
- Code: https:/

04 Cross-task automatic curriculum learning
AAMAS-2023, PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

Motivation: Cross-task curriculum learning (given multiple tasks, how to adaptively schedule the training order)
- Target task T: very hard, nearly impossible to learn from scratch.
- Starting task T1 and a candidate task set {T2, ..., Tn}, from which tasks are selected to form an "optimal learning sequence".
- Two core questions: (1) Which task is the right next curriculum to learn? (2) How can knowledge learned in earlier curricula be reused in the new one? Simply reloading the previous policy model fails, because different tasks have different model input dimensions (agent numbers, types, states, observations).

Method: Select curricula by difficulty and task similarity; transfer and reuse policies via the HPN architecture
- Difficulty via evaluated return: evaluate the current policy on every task in the candidate set to obtain its return; rank the tasks by return and keep the middle 40%, from which the next curriculum is drawn (see the sketch after this list).
- Task similarity: among the moderately difficult tasks, select the one closest to the target task (relevance matters; the chosen curriculum must actually help toward the target task). Similarity is measured by the difference between rollout state-visitation distributions.
- Example: at the first curriculum selection, 5m_vs_6m and 25m have similar difficulty, but the former is the better intermediate curriculum because it is more relevant to the target task: 5m -> 5m_vs_6m -> 8m_vs_10m beats 5m -> 25m -> 8m_vs_10m.

Method: Policy transfer and reuse via the HPN architecture (overall framework)
- The model supports variable-length inputs and outputs (different agent numbers and types).
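A compact sketch of the two-stage selection rule above; evaluate_return and state_visitation are hypothetical stand-ins for policy evaluation and rollout statistics, and the L2 distance between mean state features is a cheap proxy for the state-visitation divergence used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins; in PORTAL these come from actually running the
# current policy on each candidate task.
def evaluate_return(policy, task):
    return rng.normal()

def state_visitation(policy, task):
    return rng.normal(size=(100, 8))   # sampled states from rollouts

def select_next_task(policy, candidates, target_task):
    # 1) Difficulty filter: keep the middle 40% of tasks by evaluated return.
    returns = {t: evaluate_return(policy, t) for t in candidates}
    ranked = sorted(candidates, key=lambda t: returns[t])
    lo, hi = int(0.3 * len(ranked)), int(0.7 * len(ranked))
    moderate = ranked[lo:hi] or ranked
    # 2) Similarity filter: pick the moderate task closest to the target task.
    target_mu = state_visitation(policy, target_task).mean(axis=0)
    dist = lambda t: np.linalg.norm(state_visitation(policy, t).mean(0) - target_mu)
    return min(moderate, key=dist)

print(select_next_task(None, ['5m_vs_6m', '25m', '6m', '10m', '15m'], '8m_vs_10m'))
```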
Experiments: Can PORTAL improve the learning efficiency on all curricula?
- Marines: 5m (initial), 5m_vs_6m, 6m, 7m, 8m_vs_9m, 8m_vs_10m, 10m, 15m, 20m, 25m, 7m_vs_9m (final).
- Stalkers & Zealots (S&Z): 2s3z (initial), 2s4z, 2s5z, 3s4z, 2s3z_vs_2s4z, 2s3z_vs_2s5z, 1s4z_vs_4s1z, 1s4z_vs_5s1z, 3s5z, 1s4z_vs_6s1z, 3s5z_vs_3s6z, 3s5z_vs_3s7z, 3s5z_vs_8s2z, 3s5z_vs_4s6z, 3s5z_vs_4s7z, 3s5z_vs_4s8z (final).
- Medivac & Marauders & Marines (MMM): MMM0 (initial), MMM1, MMM2, MMM3, MMM4, MMM5, MMM6, MMM7, MMM8, MMM9, MMM10 (final).
- Example: after the policy has learned on 5m_vs_6m, the tasks with moderate difficulty are 8m_vs_10m, 20m, and 25m, which shows the necessity of the task-similarity measure.

Summary
- An overview of the challenges facing large multi-agent decision models, and a scheme for describing and modeling multi-agent systems by analogy with language models.
- Three key design priors:
  1. Action semantics network: ICLR-2021, Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
  2. Permutation invariance, permutation equivariance, and variable-length model inputs: ICLR-2023, Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
  3. Transfer learning and cross-task automatic curriculum learning: AAMAS-2023, PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.
- We welcome collaboration on further research into large models for reinforcement learning!
- The Reinforcement Learning Lab at Tianjin University welcomes you. Lab homepage: http://rl.beiyang.ren/

Thanks for watching.