RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch
Longbo Huang, Institute for Interdisciplinary Information Sciences, Tsinghua University
RL China 2023

Deep Reinforcement Learning (DRL) comes with a high training cost:
- AlphaGo-Zero: 4 TPUs, 40 days
- OpenAI Five: 256 GPUs, 180 days
- Rainbow DQN: a single GPU, one week
DRL training is costly, and many target scenarios are resource-limited: roughly 10^6 W versus 20 W (5000x).

Prior Works on Lightweight DL

Iterative pruning
- Prior works in SL: Han et al. (2015; 2016); Srinivas et al. (2017); Zhu & Gupta (2018); Hu et al. (2016); Guo et al. (2016); Dong et al. (2017); Molchanov et al. (2019b); Louizos et al. (2018); Tartaglione et al. (2018); Molchanov et al. (2017); Schwarz et al. (2021)
- Prior works in RL: Livne & Cohen (2020); Yu et al. (2020); Lee et al. (2021); Vischer et al. (2022)
- Saves inference cost: yes. Saves training cost: no. Sparsity in RL: 70%-99%

Dynamic sparse training
- Prior works in SL: Bellec et al. (2017); Mocanu et al. (2018); Mostafa & Wang (2019); Dettmers & Zettlemoyer (2019); Evci et al. (2020)
- Prior works in RL: Sokar et al. (2021); Graesser et al. (2022)
- Saves inference cost: yes. Saves training cost: yes. Sparsity in RL: 50%-90%

Iterative pruning: training followed by pruning, giving much fewer parameters with matchable performance. Dynamic sparse training: training while the topology evolves, exploring new connections based on gradients.

Can an efficient DRL agent be trained from scratch with an ultra-sparse network throughout?
Obstacles in Sparse DRL Training

Issue 1: Nonstationary learning target. In SL the regression targets are fixed, whereas in RL the TD target keeps changing across iterations as the value estimates are updated (the slide illustrates TD targets drifting over iterations 0, 1, and 2).

Issue 2: Nonstationary data distribution. In SL, training uses a fixed dataset of labeled pairs; in RL, the agent interacts with the environment, stores transitions in a replay buffer, and trains on data whose distribution shifts as the policy changes.

In SL, dynamic sparse training adjusts the topology with a gradient-based evolution scheduler; whether the same recipe carries over to RL training is unclear. Our approach: stabilize DRL with robust value learning.

RLx2: Rigged Reinforcement Learning Lottery. Tan, Hu, Pan, Huang, Huang, "RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch," ICLR 2023 (Spotlight).

Framework: an actor and a critic are trained from samples collected into a replay buffer; both networks undergo drop/grow topology evolution, and robust value learning replaces the one-step TD loss with a delayed multi-step TD loss and the fixed-capacity replay buffer with a dynamic-capacity one.
Topology Evolution (Evci et al., 2020): gradient-based link adaptation
- Grow links with large gradients
- Remove links with small absolute weights
- $O((1-s)N\log N)$ implementation complexity

The pseudo-code of our scheme is given in Algorithm 1, where $\odot$ is the element-wise multiplication operator and $M$ is the binary mask representing the sparse topology of the network. The update fraction anneals during the training process according to $\alpha_t = \frac{\alpha_0}{2}\left(1 + \cos\frac{\pi t}{T_{\text{end}}}\right)$, where $\alpha_0$ is the initial update fraction and $T_{\text{end}}$ is the total number of iterations. Finding the top-$k$ links with maximum gradients in Line 10 can be efficiently implemented such that Algorithm 1 has time complexity $O((1-s)N\log N)$ (detailed in Appendix A.1), where $s$ is the total sparsity. Besides, the topology adjustment happens very infrequently during training, i.e., every 10000 steps in our setup, so the cost of this step is negligible (detailed in Appendix C.3). Our topology evolution scheme can be implemented efficiently on resource-limited devices.

Algorithm 1: Topology Evolution (Evci et al., 2020)
1: $N_l$: number of parameters in layer $l$
2: $\theta_l$: parameters in layer $l$
3: $M_l$: sparse mask of layer $l$
4: $s_l$: sparsity of layer $l$
5: $L$: loss function
6: $\alpha_t$: update fraction in training step $t$
7: for each layer $l$ do
8:   $k = \alpha_t (1 - s_l) N_l$
9:   $I_{\text{drop}} = \text{ArgTopK}(-|\theta_l \odot M_l|, k)$
10:  $I_{\text{grow}} = \text{ArgTopK}_{i \notin (\theta_l \odot M_l) \setminus I_{\text{drop}}}(|\nabla_{\theta_l} L|, k)$
11:  Update $M_l$ according to $I_{\text{drop}}$ and $I_{\text{grow}}$
12:  $\theta_l \leftarrow \theta_l \odot M_l$
13: end for
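To make the drop/grow rule concrete, here is a minimal PyTorch-style sketch of one evolution step for a single layer, in the spirit of Algorithm 1. It assumes dense gradients are available for the grow criterion (as in RigL); the function names, default values, and mask layout are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch

def update_fraction(t, t_end, alpha0=0.3):
    # Cosine annealing of the update fraction:
    # alpha_t = (alpha0 / 2) * (1 + cos(pi * t / T_end)); alpha0 is illustrative.
    return 0.5 * alpha0 * (1.0 + math.cos(math.pi * t / t_end))

def evolve_layer(weight, grad, mask, alpha_t, sparsity):
    """One drop/grow step: drop the k active links with the smallest |weight|,
    grow the k currently inactive links with the largest |gradient|."""
    k = int(alpha_t * (1.0 - sparsity) * weight.numel())
    if k == 0:
        return mask

    w = (weight * mask).abs().flatten()
    g = grad.abs().flatten()
    active = mask.flatten().bool()

    # Drop: among active links, the k with the smallest magnitude.
    w_masked = torch.where(active, w, torch.full_like(w, float("inf")))
    drop_idx = torch.topk(w_masked, k, largest=False).indices

    # Grow: among inactive links, the k with the largest gradient magnitude.
    g_masked = torch.where(active, torch.full_like(g, float("-inf")), g)
    grow_idx = torch.topk(g_masked, k).indices

    new_mask = mask.flatten().clone()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)
```

After the mask is updated, the layer weights would be re-masked ($\theta \leftarrow \theta \odot M$), so newly grown links start from zero; in the setup described above this adjustment only runs every few thousand gradient steps, so its overhead is negligible.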
4.2 ROBUST VALUE LEARNING

Delayed Multi-step TD Loss:
- The TD estimate can be unreliable due to the sparse network.
- Improve the estimate with a multi-step target and a delay.
Figure 3: Sparse model comparison in Ant-v3.

As discussed above, value function learning is crucial in sparse training. Specifically, we find that under sparse models, robust value learning not only serves to guarantee the efficiency of bootstrap training as in dense models, but also guides the gradient-based topology exploration of the sparse network during training. Figure 3 compares the performance of the masks (i.e., sparse network topology) obtained by RigL and RLx2 (i.e., RigL + robust value learning) on Ant-v3. Here we use a method similar to (Frankle & Carbin, 2019) for evaluating the obtained sparse mask:[2] 1) first initialize a random sparse topology; 2) keep adjusting the topology during the training and obtain the final mask; 3) train a sparse agent with the obtained mask (the mask is fixed throughout this training phase; only the weights are restored to their initial values as in the first step at the beginning). It can be clearly observed that the mask obtained by RLx2 significantly outperforms that obtained by solely using RigL (Appendix C.4 provides details and experiments in other environments, where similar results are observed). To achieve robust value estimation and properly guide the topology search, RLx2 utilizes two novel components: i) delayed multi-step TD targets to bootstrap value estimation; ii) a dynamic-capacity replay buffer to eliminate the potential data inconsistency due to policy change during training.
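The three-step mask-evaluation procedure described above can be sketched as follows; `train_with_evolution` and `train_with_fixed_mask` are hypothetical stand-ins for full DRL training runs, not functions from the paper's code.

```python
import copy

def evaluate_learned_mask(agent, train_with_evolution, train_with_fixed_mask):
    # 1) Start from a randomly initialized sparse topology and save the weights.
    init_weights = copy.deepcopy(agent.state_dict())

    # 2) Train while letting the topology evolve; keep only the final mask.
    final_mask = train_with_evolution(agent)

    # 3) Rewind the weights to their initial values, freeze the final mask,
    #    and train again; the score of this run measures the mask's quality.
    agent.load_state_dict(init_weights)
    return train_with_fixed_mask(agent, final_mask)
```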
4.2.1 MULTI-STEP TD TARGET

In TD learning, a TD target is generated, and the value network is iteratively updated by minimizing a squared loss induced by the TD target. Single-step methods generate the TD target by combining the one-step reward and the discounted target network output, i.e., $T_1 = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}); \bar\theta)$. However, a sparse network parameter $\hat\theta = \theta \odot M$, obtained from its dense counterpart, will inevitably reside in a smaller hypothesis space due to using fewer parameters. This means that the output of the sparse value network $\hat\theta$ can be unreliable and may lead to inaccurate value estimation. Denote the fitting error of the value network as $\epsilon(s,a) = Q(s,a;\theta) - Q^{\pi}(s,a)$. One sees that this error may be larger under a sparse model compared to that under a dense network.

To overcome this issue, we adopt a multi-step target, i.e., $T_n = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n Q(s_{t+n}, \pi(s_{t+n}); \bar\theta)$, where the target combines an $n$-step sample and the output of the sparse value network after $n$ steps, both appropriately discounted. By doing so, we reduce the expected error between the TD target and the true target. Specifically, Eq. (4) shows the expected TD error between the multi-step TD target $T_n$ and the true Q-value $Q^{\pi}$ associated with the target policy, conditioned on transitions from the behavior policy $b$ (see the detailed derivation in Appendix A.2):

$$\mathbb{E}_b[T_n(s,a)] - Q^{\pi}(s,a) = \underbrace{\big(\mathbb{E}_b[T_n(s,a)] - \mathbb{E}_{\pi}[T_n(s,a)]\big)}_{\text{policy inconsistency error}} + \underbrace{\gamma^n\,\mathbb{E}_{\pi}\big[\epsilon(s_n, \pi(s_n))\big]}_{\text{network fitting error}} \qquad (4)$$

[2] (Frankle & Carbin, 2019) obtains the mask, i.e., the "lottery ticket", by pruning a pretrained dense model. Our sparse mask is the final mask obtained by dynamic sparse training.
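As a concrete illustration of the target $T_n$, the sketch below computes it from an $n$-step reward segment; `q_target` and `policy` are hypothetical handles for the target critic $Q(\cdot,\cdot;\bar\theta)$ and the target policy, and episode-termination handling is omitted for brevity.

```python
import torch

def n_step_td_target(rewards, state_n, q_target, policy, gamma=0.99):
    # T_n = sum_{k=0}^{n-1} gamma^k * r_{t+k}
    #       + gamma^n * Q(s_{t+n}, pi(s_{t+n}); theta_bar).
    # With n = 1 this reduces to the usual one-step target T_1.
    n = len(rewards)
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    with torch.no_grad():
        bootstrap = q_target(state_n, policy(state_n))
    return ret + (gamma ** n) * bootstrap
```

In terms of Eq. (4), a larger $n$ shrinks the network fitting error by the factor $\gamma^n$, at the price of a potentially larger policy inconsistency error when the stored rewards come from an older behavior policy.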
The two error terms in Eq. (4) are what the delayed multi-step TD target and the dynamic-capacity buffer (introduced next) are designed to address.

The multi-step target has been studied in existing works (Bertsekas & Ioffe, 1996; Precup, 2000; Munos et al., 2016) for improving TD learning. In our case, we also find that introducing a multi-step target reduces the network fitting error by a multiplicative factor $\gamma^n$, as shown in Eq. (4). On the other hand, it has been observed, e.g., in Fedus et al. (2020), that an immediate adoption of multi-step TD targets may cause a larger policy inconsistency error (the first term in Eq. (4)). Thus, we adopt a delayed scheme to suppress policy inconsistency and further improve value learning. Specifically, at the early stage of training, we use one-step TD targets to better handle the quickly changing policy during this period, where a multi-step target may not be meaningful. Then, after several training epochs, when the policy change becomes less abrupt, we permanently switch to multi-step TD targets to exploit their better approximation of the value function.
4.2.2 DYNAMIC-CAPACITY BUFFER

The second component of RLx2 for robust value learning is a novel dynamic buffer scheme for controlling data inconsistency. Off-policy algorithms use a replay buffer to store collected data and train networks with sampled batches from the buffer.
Their performance generally improves when larger replay capacities are used (Fedus et al., 2020). However, off-policy algorithms with unlimited-size replay buffers can suffer from policy inconsistency due to the following two aspects.

(i) Inconsistent multi-step targets: In off-policy algorithms with multi-step TD targets, the value function is updated to minimize the squared loss in Eq. (3) on transitions sampled from the replay buffer, i.e., the reward sequence $r_t, r_{t+1}, \ldots, r_{t+n}$ collected during training. However, the fact that the policy can evolve during training means that the data in the replay buffer, used for Monte-Carlo approximation of the current policy, may have been collected under a different behavior policy $b$ (Hernandez-Garcia & Sutton, 2019; Fedus et al., 2020). As a result, it may lead to a large policy inconsistency error in Eq. (4), causing inaccurate estimation.

(ii) Mismatched training data: In practice, the agent minimizes the value loss $\hat L(\theta)$ with respect to the sampled values in the mini-batch $B_t$, given by

$$\hat L(\theta) = \frac{1}{|B_t|} \sum_{(s_i,a_i) \in B_t} \big(Q(s_i, a_i; \theta) - T\big)^2 \qquad (5)$$

Compared to Eq. (3), the difference between the distribution of transitions in the mini-batch $B_t$ and the true transition distribution induced by the current policy also leads to a mismatch in the training objective (Fujimoto et al., 2019). Indeed, our analysis in Appendix A.4 shows that training performance is closely connected to policy consistency.

Motivated by our analysis, we introduce a dynamically-sized buffer to reduce the policy gap based on the policy distance of the collected data. The formal scheme is given in Algorithm 3. We introduce the following policy distance measure to evaluate the inconsistency of data in the buffer, i.e.,

$$D(B, \phi) = \frac{1}{K} \sum_{(s_i,a_i) \in \mathrm{Old}_K(B)} \big\| \pi(s_i; \phi) - a_i \big\|^2, \qquad (6)$$

where $B$ denotes the current replay buffer, $\mathrm{Old}_K(B)$ denotes the oldest $K$ transitions in $B$, and $\pi(\cdot;\phi)$ is the current policy. Here $K$ is a hyperparameter. We calculate the $D(B,\phi)$ value every $\Delta_b$ steps. If $D(B,\phi)$ rises above a pre-specified threshold $D_0$, we pop items from $B$ in a first-in-first-out (FIFO) order until the distance measure $D$ falls below the threshold.

Figure 4: Dynamic buffer capacity and policy inconsistency.

A visualization of the number of stored samples and the proposed policy distance metric during training is shown in Figure 4. We see that the policy distance oscillates in the early stage as the policy evolves, but it is tightly controlled and does not violate the threshold condition, which effectively addresses the off-policyness issue. As the policy converges, the policy distance tends to decrease and converge (Appendix C.8 also shows that the performance is insensitive to the policy threshold $D_0$).
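The buffer logic described above can be sketched as follows. The check over the oldest $K$ transitions, the FIFO popping, and the threshold $D_0$ come from the text (the formal version is the paper's Algorithm 3); the deque-based storage, the default values, and the method names are illustrative assumptions.

```python
from collections import deque
import numpy as np

class DynamicCapacityBuffer:
    """Replay buffer that shrinks when its oldest data drifts too far
    from the current policy, per the distance D(B, phi) in Eq. (6)."""

    def __init__(self, K=1000, D0=0.1):
        self.data = deque()  # transitions (state, action, reward, next_state)
        self.K = K           # number of oldest transitions used in D(B, phi)
        self.D0 = D0         # policy-distance threshold

    def add(self, transition):
        self.data.append(transition)

    def policy_distance(self, policy):
        # D(B, phi) = (1/K) * sum over Old_K(B) of ||pi(s_i; phi) - a_i||^2
        oldest = list(self.data)[:self.K]
        return float(np.mean([np.sum((policy(s) - a) ** 2)
                              for s, a, *_ in oldest]))

    def maybe_shrink(self, policy):
        # Called every Delta_b steps: pop transitions in FIFO order until the
        # distance falls back below D0.  Recomputing D after every pop is
        # wasteful, but keeps the sketch close to the textual description.
        while len(self.data) > self.K and self.policy_distance(policy) > self.D0:
            self.data.popleft()
```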
Dynamic-Capacity Replay Buffer:
- Off-policy RL can suffer from inconsistency.
- Introduce a policy distance measure.

Empirical Results:
- 7x-20x model compression
- 20x-50x FLOPs reduction
- Less than 3% performance degradation

5 EXPERIMENTS

In this section, we investigate the performance improvement of RLx2 in Section 5.1 and the importance of each component in RLx2 in Section 5.2. In particular, we pay extra attention to the role topology evolution plays in sparse training in Section 5.3. Our experiments are conducted in four popular MuJoCo environments: HalfCheetah-v3 (Hal.), Hopper-v3 (Hop.), Walker2d-v3 (Wal.), and Ant-v3 (Ant.), for RLx2 with two off-policy algorithms, TD3 and SAC.
Instantiations of RLx2 on TD3 and SAC are provided in Appendix B. Each result is averaged over eight random seeds.

5.1 COMPARATIVE EVALUATION

Table 2: Comparison of RLx2 with sparse training baselines. Here "Sp." refers to the sparsity level (percentage of model size reduced), and "Total Size" refers to the total parameters of both critic and actor networks (detailed calculation of training and inference FLOPs is given in Appendix C.3). The right five columns show the final performance of different methods. The "Total Size", "FLOPs", and "Performance" values are all normalized w.r.t. the original large dense model (detailed in Appendix C.2).

Alg.  Env.  Actor Sp.  Critic Sp.  Total Size  FLOPs (Train)  FLOPs (Test)  Tiny(%)  SS(%)   SET(%)  RigL(%)  RLx2(%)
TD3   Hal.  90%        85%         0.133x      0.138x         0.100x        86.3     77.1    92.6    90.8     99.8
TD3   Hop.  98%        95%         0.040x      0.043x         0.020x        64.5     67.7    66.5    90.6     97.0
TD3   Wal.  97%        95%         0.043x      0.045x         0.030x        60.8     42.9    39.3    35.7     98.1
TD3   Ant.  96%        88%         0.093x      0.100x         0.040x        16.5     49.6    62.5    68.5     103.9
TD3   Avg.  95%        91%         0.077x      0.081x         0.048x        57.0     59.3    65.2    71.4     99.7
SAC   Hal.  90%        80%         0.180x      0.197x         0.100x        95.0     75.4    94.8    89.8     102.2
SAC   Hop.  98%        95%         0.044x      0.048x         0.020x        89.1     81.6    103.9   110.0    109.7
SAC   Wal.  90%        90%         0.100x      0.113x         0.100x        73.8     83.4    95.8    81.9     103.2
SAC   Ant.  90%        75%         0.220x      0.239x         0.100x        49.6     49.3    79.8    90.9     105.6
SAC   Avg.  92%        85%         0.136x      0.149x         0.080x        76.9     72.4    93.6    93.2     105.2
All   Avg.  94%        88%         0.107x      0.115x         0.064x        67.0     65.9    79.4    82.3     102.4

Table 2 summarizes the comparison results. In our experiments, we compare RLx2 with the following baselines: (i) Tiny, which uses tiny dense networks with the same number of parameters as the sparse model in training.
(ii) SS, which uses static sparse networks with random initialization. (iii) SET (Bellec et al., 2017), which uses dynamic sparse training by dropping connections according to magnitude and growing connections randomly. Note that previous work (Sokar et al., 2021) also adopts the SET algorithm for topology evolution in reinforcement learning; our implementation reaches better performance due to different hyperparameters. (iv) RigL (Evci et al., 2020), which uses dynamic sparse training by dropping and growing connections with magnitude and gradient criteria, respectively, the same as RLx2's topology evolution procedure.

In our experiments, we allow the actor and critic networks to take different sparsities. We define an ultimate compression ratio, i.e., the largest sparsity level under which the performance degradation under RLx2 is within 3% of that under the original dense models. This can also be understood as the minimum size of the sparse model that retains the full performance of the original dense model. We present the performance comparison results in Table 2 based on the ultimate compression ratio. The performance of each algorithm is evaluated with the average reward per episode over the last 30 policy evaluations of the training (policy evaluation is conducted every 5000 steps). Hyperparameters are fixed across all four environments for TD3 and SAC, respectively, and are presented in Appendix C.2.
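For reference, the ultimate compression ratio can be read off a sparsity sweep with a small helper like the one below; `returns_by_sparsity` is a hypothetical mapping from sparsity level to final average episode return, not data from the paper.

```python
def ultimate_compression(returns_by_sparsity, dense_return, tol=0.03):
    # Largest sparsity whose final return stays within 3% of the dense model;
    # 1 / (1 - sparsity) is then roughly the achieved compression ratio.
    feasible = [s for s, r in returns_by_sparsity.items()
                if r >= (1.0 - tol) * dense_return]
    if not feasible:
        return 0.0, 1.0
    s_max = max(feasible)
    return s_max, 1.0 / (1.0 - s_max)
```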
Performance. Table 2 shows that RLx2 performs best among all baselines in all four environments by a large margin (except for a performance close to RigL with SAC in Hopper). In addition, tiny dense networks (Tiny) and random static sparse networks (SS) perform worst on average. SET and RigL are better yet fail to maintain the performance in Walker2d-v3 and Ant-v3, which means robust value learning is necessary under sparse training. To further validate the performance of RLx2, we compare different methods under different sparsity levels in Hopper-v3 and Ant-v3 in Figure 5, showing that RLx2 has a significant performance gain over the other baselines.

Model Compression. RLx2 achieves superior compression ratios (the reciprocal of the total size), with minor performance degradation (less than 3%). Specifically, RLx2 with TD3 achieves 7.5-

Better Handle Sparsity. Figure 5: Performance comparison under different model sparsity.

Acceleration in FLOPs. Different from knowledge-distillation/BC-based methods, e.g., Livne & Cohen (2020); Vischer et al. (2022); Lee et al. (2021), RLx2 uses a sparse network throughout training. Thus, it has the additional advantage of immensely accelerating training and saving computation, i.e., 12x training acceleration and 20x inference acceleration for RLx2-TD3, and 7x training acceleration and 12x inference acceleration for RLx2-SAC.

5.2 ABLATION STUDY

We conduct a comprehensive ablation study on the three critical components of RLx2 on TD3, i.e., topology evolution, multi-step TD target, and dynamic-capacity buffer, to examine the effect of each component in RLx2 and their robustness to hyperparameters.
In addition, we provide a sensitivity analysis for the algorithm hyperparameters, e.g., the initial mask update fraction, mask update interval, buffer adjustment interval, and buffer policy distance threshold, in Appendix C.8.

Topology evolution. RLx2 drops and grows connections with magnitude and gradient criteria, respectively, which has been adopted in RigL (Evci et al., 2020) for deep supervised learning. To validate the necessity of our topology evolution criteria, we compare RLx2 with three baselines, which replace the topology evolution scheme in RLx2 with Tiny, SS, and SET, while keeping the other components in RLx2 unchanged.[4] The left part of Table 3 shows that RigL as the topology adjustment scheme (the resulting scheme is RLx2 when using RigL) performs best among the four baselines. We also observe that Tiny performs worst, which is consistent with the conclusion in existing works (Zhu & Gupta, 2018) that a sparse network may contain a smaller hypothesis space and lead to performance loss, which necessitates a topology evolution scheme.

Table 3: Ablation study on topology evolution and multi-step target, where the performance (%) is normalized with respect to the performance of dense models.

       Topology Evolution                  Multi-step Target
Env.   Tiny    SS      SET     RLx2       1-step  2-step  3-step  4-step  5-step
Hal.   93.3    86.1    100.1   99.8       96.5    101.7   99.8    98.8    97.0
Hop.   74.4    84.2    88.8    97.0       77.9    91.7    97.0    84.0    87.5
Wal.   84.1    83.8    89.4    98.1       73.9    93.7    98.1    99.1    99.3
Ant.   28.7    80.2    83.5    103.9      103.9   105.1   103.9   96.7    94.5
Avg.   70.1    83.6    90.4    99.7       88.1    98.1    99.7    94.6    94.6

Figure 6: Performance with different buffer schemes.

Multi-step TD targets. We also compare different step lengths in the multi-step TD targets for RLx2 in the right part of Table 3. We find that multi-step TD targets with a step length of 3 obtain the maximum performance. In particular, multi-step TD targets improve the performance dramatically in Hopper-v3 and Walker2d-v3, while the improvement in HalfCheetah-v3 and Ant-v3 is minor.
Dynamic-capacity buffer. We compare different buffer sizing schemes, including our dynamic scheme, different fixed-capacity buffers, and an unlimited buffer. Figure 6 shows that our dynamic-capacity buffer performs best among all buffer settings. Smaller

[4] Take Algorithm 4 in Appendix B for an example. Only lines 16 and 21 of "Topology Evolution()" are changed while the other parts remain unchanged. We also regard a static topology as a special case of topology evolution.

Effective Dynamic Buffer: dynamic vs. static.

Summary

RLx2: Rigged Reinforcement Learning Lottery
- Three design components: gradient-based topology evolution, delayed multi-step value learning, and a dynamic buffer.
- Ultra-sparse with good performance: 7x-20x model compression, 20x-50x FLOPs reduction, less than 3% performance loss.
- Framework: the actor and critic each undergo prune/grow topology evolution while interacting through a dynamic-size replay buffer.

Thank you! More info: https:/