上海品茶

您的当前位置:上海品茶 > 报告分类 > PDF报告下载

刘群_Self-improvement and Self-evolving of Large Language Models(1)_watermark.pdf

编号:155613 PDF 45页 5.65MB 下载积分:VIP专享
下载报告请您先登录!

刘群_Self-improvement and Self-evolving of Large Language Models(1)_watermark.pdf

1、Self-improvement and Self-evolving of Large Language Models大语言模型的自我改进和自我进化刘群 LIU Qun华为诺亚方舟实验室 Huawei Noahs Ark LabRLChina 2023 大模型与AI Agent2023-11-24,SuzhouIntroductionSELF:Language-Driven Self-Evolution for LLMsGaining Wisdom from Setbacks:Aligning LLMs via Mistake AnalysisRelated Work and Discussi

2、onContentIntroductionSELF:Language-Driven Self-Evolution for LLMsGaining Wisdom from Setbacks:Aligning LLMs via Mistake AnalysisRelated Work and DiscussionContentTraining Data for LLMsGPT-3(OpenAI,2020.5):500 Billion tokensPalm(Google,2022.4):780 Billion tokensChinchilla(Deepmind):1.4 Trillion token

3、sLlama(Meta):1.5 Trillion tokensLlama2(Meta):2 Trillion tokensGPT-4(OpenAI):13 Trillion tokens(text*2+code*4)+2 Trillion tokens(image)1 total:33Will we run out of data?(a)Projections for low-qualitylanguage data(b)Projections for high-qualitylanguage data(c)Projections for vision dataFig.1:Projectio

4、ns of data usage.Each graph shows two extrapolations of data usage,one from past trends and one from compute availabilityestimations plus scaling laws.Both projections are constrained to be lower than the estimated data stock.In all three cases,this constraintcauses a slowdown in data usage growth.I

5、II.METHODSA.Projecting growth in training dataset sizesPrevious work compiled historical trends of dataset sizework for different application domains21.Our definition of dataset size is the number of uniquedatapoints on which the model is trained.The definition ofdatapoint is different for each doma

6、in.In particular,forlanguage data we define a datapoint as a word,and for imagedata we define a datapoint as an image.Additional details onthis choice of dataset size metric can be found in 1.Using the historical trend,together with the size of thelargest datasets used to date,we can estimate the fu

7、tureevolution of dataset sizes.However,this projection naivelyassumes that the past trend will be sustained indefinitely.Inreality,there are constraints on the amount of data that a modelcan be trained on.One of the most important constraints iscompute availability.This is because increasing the amo

8、unt oftraining data for a given model requires additional compute,and the amount of compute that can be used is limited bythe supply of hardware and the cost of buying or renting thathardware.To account for this constraint,we make another projection,based on compute availability and the compute-opti

9、mal datasetsize.Scaling laws can be used to predict the optimal balanceof model size and dataset size for a given compute budget(measured in FLOP)2,3.Concretely,the optimal datasetsize is proportional to the square root of the compute budget(DC).Previous work 9,projected the available compute to the

10、largest training into the future3.We use those projections2The domains that were included were vision,language,recommendation,speech,drawing,and games.However,there is only significant data for visionand language.3Note that this projection has a wide range of uncertainty and includesscenarios in whi

11、ch spending on compute grows orders of magnitude overcurrent levels,up to 1%of GDP.to estimate the optimal training dataset size that will beachievable in each future year.B.Estimating data accumulation ratesIn recent years,unsupervised learning has successfullycreated foundation models that can be

12、fine-tuned for severaltasks using small amounts of labeled data and large amountsof unlabeled data.In addition,unsupervised models have alsoproved able to generate valuable pseudo-labels for unlabeleddata 10.For these reasons,we will focus on the stock andaccumulation rates of unlabeled data,even if

13、 the amount oflabeled data is much lower4.Before delving into the details,let us consider a theoreticalframework of what we expect the data accumulation rate tolook like.The vast majority of data is user-generated and isstored in social media platforms,blogs,forums,etc.There arethree factors that de

14、termine how much content is produced in agiven period:human population,internet penetration rate,andthe average amount of data produced by each internet user.Human population has been extensively studied so weuse the standard United Nations projections 11.Internetpenetration(the percentage of the po

15、pulation who uses theInternet)grows as an S-curve from0%in 1990 to 50%in2018 to over 60%today 12.We model this as a sigmoidfunction of time and fit it to the data in 12.The average amount of data produced by users changes overgeography and time according to internet usage trends,and isnot easy to an

16、alyze5.For simplicity,let us assume the averageamount of data produced by users is constant over time.This model of Internet population(the number of Internetusers)closely matches the historical number of Internet users4Note that while transfer learning vastly reduces the need for labeled data,it do

17、es not eliminate it.In addition,labeled data is usually much harder toacquire than unlabeled data.Therefore,labeled data might turn out to be abottleneck even though it is required in smaller quantities.5Doing so would require taking into account the effects of culture,demog-raphy and socioeconomic

18、development in different countries and times,whichis out of the scope of this paper.Villalobos et al.,“Will We Run out of Data?An Analysis of the Limits of Scaling Datasets in Machine Learning.”arxiv:2211.043252 total:33Other challenges of training dataFigure 2:A diagram illustrating the three steps

19、 of our method:(1)supervised fi ne-tuning(SFT),(2)reward model(RM)training,and(3)reinforcement learning via proximal policy optimization(PPO)on this reward model.Blue arrows indicate that this data is used to train one of our models.In Step 2,boxes A-D are samples from our models that get ranked by

20、labelers.See Section 3 for more detailson our method.sizes(1.3B,6B,and 175B parameters),and all of our models use the GPT-3 architecture.Our mainfi ndings are as follows:Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.On our test set,outputs from the 1.3B parameter Instruct

21、GPT model are preferred to outputs from the 175B GPT-3,despite having over 100 x fewer parameters.These models have the same architecture,and differ onlyby the fact that InstructGPT is fi ne-tuned on our human data.This result holds true even when weadd a few-shot prompt to GPT-3 to make it better a

22、t following instructions.Outputs from our 175BInstructGPT are preferred to 175B GPT-3 outputs 853%of the time,and preferred 714%of thetime to few-shot 175B GPT-3.InstructGPT models also generate more appropriate outputs accordingto our labelers,and more reliably follow explicit constraints in the in

23、struction.InstructGPT models show improvements in truthfulness over GPT-3.On the TruthfulQAbenchmark,InstructGPT generates truthful and informative answers about twice as often as GPT-3.Our results are equally strong on the subset of questions that were not adversarially selected againstGPT-3.On“clo

24、sed-domain”tasks from our API prompt distribution,where the output should notcontain information that is not present in the input(e.g.summarization and closed-domain QA),InstructGPT models make up information not present in the input about half as often as GPT-3(a21%vs.41%hallucination rate,respecti

25、vely).InstructGPT shows small improvements in toxicity over GPT-3,but not bias.To measuretoxicity,we use the RealToxicityPrompts dataset(Gehman et al.,2020)and conduct both automaticand human evaluations.InstructGPT models generate about 25%fewer toxic outputs than GPT-3when prompted to be respectfu

26、l.InstructGPT does not signifi cantly improve over GPT-3 on theWinogender(Rudinger et al.,2018)and CrowSPairs(Nangia et al.,2020)datasets.We can minimize performance regressions on public NLP datasets by modifying our RLHFfine-tuning procedure.During RLHF fi ne-tuning,we observe performance regressi

27、ons comparedto GPT-3 on certain public NLP datasets,notably SQuAD(Rajpurkar et al.,2018),DROP(Dua et al.,2019),HellaSwag(Zellers et al.,2019),and WMT 2015 French to English translation(Bojar et al.,2015).This is an example of an“alignment tax”since our alignment procedure comes at the cost of3Ouyang

28、,et al.“Training Language Models to FollowInstructions with Human Feedback,”OpenAI TechnicalReport,Jan 2022.Instruct Data(for SFT):human annotationHuman Feedback Data(for RLHF):human interactionInvolving large amount of human laborMuch more expensive than pre-trainingdata3 total:33Self-critiquing mo

29、dels for assisting human evaluatorsQuestion:Our King has always been a little different.Hes very quiet I havent really seen him give anyone ordersat all either.Whenever something needs to be done I see him doing it.Thats when I believed everything hetold me-while he took me adventuring with him.Now

30、I know that there is a better life to be had out of this one.Aless edged and more rounded one.Thank you king Steve.Minecraft Steve.Summarize the storys setting.Answer:The story is set in a medieval kingdom,where the king is the main character of the story.The king isdescribed as a good and active ki

31、ng.The narrator is a child who has always wondered why the king is so differentfrom everyone else.Critiqueable:YesCritique:The answer should mention that the setting is a video game.Helpful:YesConditional refinement:The story is set in a medieval kingdom in the video game Minecraft.The king is the m

32、aincharacter of the story.The king is described as a good and active king.The narrator is a child who has alwayswondered why the king is so different from everyone else.Table 2:Representative example of a topic-based summarization task,and its associated tasks.SeeAppendix A.5 for details on how we f

33、ormat our tasks(different than shown).2.2Topic-based summarizationWe report most of our main results on the base task of topic-based summarization Dan05,ZYY+21,a task similar to or interchangeable with query-based summarization and question-focused summa-rization.In topic-based summarization,the sum

34、mary focuses on a specific aspect of a text rather thantrying to summarize the whole text.See Table 2 for an example.We collected our own dataset of over 6,000 distinct topical queries and summaries,on over 2,000distinct passages.Our distribution of passages is sampled from a dataset of short storie

35、s,Wikipediaarticles,or web articles(mostly news)scraped from the internet.Most tasks were generated based onshort texts with less than 2,048 tokens when encoded with the GPT-2 tokenizer RWC+19.We alsogathered some tasks based on texts with up to 4,096 tokens which were not used for training.Our labe

36、lers generated between 1 and 8 topic-based summarization questions per passage,typicallyalso including a topic not covered by the passage(for which the answer is empty).Summaries are upto a paragraph long we targeted between 2-10 sentences unless the topic was missing.We aimedfor these topics to be

37、non-trivial to summarize in various ways.See Appendix A for details.2.2.1Data collectionWe collect demonstrations on all the tasks mentioned in Section 2.1.Given a task for which we wantto collect a demonstration,we can choose whether each input is generated from a model or human.We always use a hum

38、an-generated question.All tasks but the base task require an answer as input,many for which we typically use outputs from our best model.For example,critique demonstrationsare on model-generated answers,and helpfulness judgements are on model-generated critiques.Forrefinements the situation is more

39、complex,and detailed in Appendix A.2.Since we need model outputs for most demonstrations,we collect data in rounds.After each round,we train a model jointly on all task demonstrations collected thus far.We start with base taskdemonstration collection.Then with a model trained on only the base task,w

40、e collect demonstrationsfor critiqueability,critique,and refinement tasks using model-generated answers.Finally,we collectdemonstrations for helpfulness tasks,by showing labelers model-generated critiques of model-generated answers.For more details on our data collection,see Appendix A and Table 4.W

41、e publicly release all dataused to train final models2.2Wereleasesixfiles,locatedathttps:/ al.“Self-Critiquing Models for Assisting Human Evaluators.”arxiv:2206.05802.4(1)total:33Self-critiquing models for assisting human evaluators(a)More capable models have critiqueable outputsaround 20%less often

42、 than the smallest models,ac-cording to labelers.Less than 15%of outputs are uncri-tiqueable for the worst models,and over 30%for thebest models.(b)Helpfulness of self-critiques,as judged by human la-belers,both with and without filtering by when labelersfound a critique themselves.(c)Larger models

43、are not only better at critiquing,but harder to critique even filtering for only cases wherelabelers found a critique.The diagonal(spanning lower left to upper right)corresponds to the“critiqueableanswers”line in 4b.Figure 4:More capable models are significantly better at self-critiquing(Figure 4b).

44、Although morecapable models get better at generating hard-to-critique answers(Figure 4c),their ability to critiquetheir answers is improving more rapidly with scale.This is true even without adjusting for the factthat humans find fewer critiques of more capable models(Figure 4a).In all figures,we sa

45、mple at thesame random temperature for both the base task and critique task;the effects are equally visible at alltemperature ranges(not pictured).10Saunders,et al.“Self-Critiquing Models for Assisting Human Evaluators.”arxiv:2206.05802.4(2)total:33Self-Refine:Iterative Refinement with Self-Feedback

46、RefineFeedbackUse M to get feedback on its own outputInputUse M to refine its previous output,given its feedbackModel M120Figure 1:Given an input(0),SELF-REFINEstarts by generating an output and passing it back to thesame modelMto get feedback(1).The feedback is passed back toM,which refi nes the pr

47、eviouslygenerated output(2).Steps(1)and(2)iterate until a stopping condition is met.SELF-REFINEisinstantiated with a language model such as GPT-3.5 and does not involve human assistance.drafting an email to request a document from a colleague,an individual may initially write a directrequest such as

48、“Send me the data ASAP”.Upon reflection,however,the writer recognizes thepotential impoliteness of the phrasing and revises it to“Hi Ashley,could you please send me the dataat your earliest convenience?.When writing code,a programmer may implement an initial“quickand dirty”implementation,and then,up

49、on reflection,refactor their code to a solution that is moreeffi cient and readable.In this paper,we demonstrate that LLMs can provide iterative self-refi nementwithout additional training,leading to higher-quality outputs on a wide range of tasks.We present SELF-REFINE:an iterative self-refi nement

50、 algorithm that alternates between two gener-ative stepsFEEDBACKandREFINE.These steps work in tandem to generate high-quality outputs.Given an initial output generated by a modelM,we pass it back to the same modelMto getfeedback.Then,the feedback is passed back to the same model to refine the previo

51、usly-generateddraft.This process is repeated either for a specifi ed number of iterations or untilMdetermines thatno further refi nement is necessary.We use few-shot prompting(Brown et al.,2020)to guideMtoboth generate feedback and incorporate the feedback into an improved draft.Figure 1 illustrates

52、 thehigh-level idea,that SELF-REFINEuses the same underlying language model to generate feedbackand refine its outputs.We evaluate SELF-REFINEon 7 generation tasks that span diverse domains,including naturallanguage and source-code generation.We show that SELF-REFINEoutperforms direct generationfrom

53、 strong LLMs like GPT-3.5(text-davinci-003andgpt-3.5-turbo;OpenAI;Ouyanget al.,2022)and GPT-4(OpenAI,2023)by 5-40%absolute improvement.In code-generation tasks,SELF-REFINEimproves the initial generation by up to absolute 13%when applied to strong codemodels such as Codex(code-davinci-002;Chen et al.

54、,2021).We release all of our code,whichis easily extensible to other LLMs.In essence,our results show that even when an LLM cannotgenerate an optimal output on its fi rst try,the LLM can often provide useful feedback and improveits own output accordingly.In turn,SELF-REFINEprovides an effective way

55、to obtain better outputsfrom a single model without any additional training,via iterative(self-)feedback and refi nement.2Iterative Refinement with SELF-REFINEGiven an input sequence,SELF-REFINEgenerates an initial output,provides feedback on the output,and refi nes the output according to the feedb

56、ack.SELF-REFINEiterates between feedback andrefi nement until a desired condition is met.SELF-REFINErelies on a suitable language modeland three prompts(for initial generation,feedback,and refi nement),and does not require training.SELF-REFINEis shown in Figure 1 and Algorithm 1.Next,we describe SEL

57、F-REFINEin more detail.Initial generationGiven an inputx,promptpgen,and modelM,SELF-REFINEgenerates an initialoutput y0:y0=M(pgenx).(1)2Madaan et al.“Self-Refine:Iterative Refinement with Self-Feedback.”arxiv:2303.17651.5(1)total:33Self-Refine:Iterative Refinement with Self-Feedback(a)Dialogue:x,ytU

58、ser:I am interestedin playing Tabletennis.Response:Im sureits a great way tosocialize,stay active(b)FEEDBACKfbEngaging:Provides noinformation about tabletennis or how to play it.User understanding:Lacksunderstanding of usersneeds and state of mind.(c)REFINEyt+1Response(refined):Thatsgreat to hear(.)

59、!Itsa fun sport requiringquick reflexes and goodhand-eye coordination.Have you played before,orare you looking to learn?(d)Code optimization:x,ytGenerate sum of 1,.,Ndef sum(n):res=0for i in range(n+1):res+=ireturn res(e)FEEDBACKfbThis code is slow asit uses brute force.A better approach isto use th

60、e formula.(n(n+1)/2.(f)REFINEyt+1Code(refined)def sum_faster(n):return(n*(n+1)/2Figure 2:Examples of SELF-REFINE:an initial outputgenerated by the base LLM and then passedback to the same LLM to receive feedbackto the same LLM to refi ne the output.The top rowillustrates this for dialog generation w

61、here an initial dialogue response can be transformed into amore engaging one that also understands the user by applying feedback.The bottom row illustratesthis for code optimization where the code is made more effi cient by applying feedback.Algorithm 1 SELF-REFINEalgorithmRequire:input x,model M,pr

62、ompts pgen,pfb,prefi ne,stop condition stop()1:y0=M(pgenx)Initial generation(Eqn.1)2:for iteration t 0,1,.do3:fbt=M(pfbxyt)Feedback(Eqn.2)4:if stop(fbt,t)then Stop condition5:break6:else7:yt+1=M(prefi nexy0fb0.ytfbt)Refi ne(Eqn.4)8:end if9:end for10:return ytFigure 3:The SELF-REFINEalgorithm.See(2)f

63、or a discussion of each component.For example,in Figure 2(d),the model generates functionally correct code for the given input.Here,pgenis a task-specifi c few-shot prompt(or instruction)for an initial generation,anddenotesconcatenation.The few-shot prompt contains input-output pairs x(k),y(k)for th

64、e task.2FEEDBACKNext,SELF-REFINEuses the same modelMto provide feedbackfbton its ownoutput,given a task-specifi c prompt pfbfor generating feedback:fbt=M(pfbxyt).(2)Intuitively,the feedback may address multiple aspects of the output.For example,in code optimiza-tion,the feedback might address the ef

65、fi ciency,readability,and overall quality of the code.2Few-shot prompting(also referred to as“in-context learning”)provides a model with a prompt consisting ofk in-context examples of the target task,each in the form of input-output pairs xi,yi(Brown et al.,2020).3Madaan et al.“Self-Refine:Iterative

66、 Refinement with Self-Feedback.”arxiv:2303.17651.5(2)total:33MotivationExisting training methods of LLMs face challenges include:Unlabeled pre-training data is running out.Cleaning low quality data is expensive.SFT and RLHF data are also expensive because of involving intensive labors.LLMs have the

67、ability of self-critique and self-RefinementExisting methods mainly use self-critique and self-refinement to generate betterresponses in decoding time,rather than improve the models by further training.We propose novel methods to:improve the abilities of LLMs by self-improvement and self-evolution,w

68、ithoutusing external data or intensive human feedback.This method enables the models to learn from its own mistakes and improve itsperformance over time.Experiments show that this method can significantly improve the modelsperformance in various domains,including math,general knowledge,and safety.6

69、total:33IntroductionSELF:Language-Driven Self-Evolution for LLMsGaining Wisdom from Setbacks:Aligning LLMs via Mistake AnalysisRelated Work and DiscussionContentSELF:Language-Driven Self-Evolution for LLMsPreprint.Work in progress.SELF:LANGUAGE-DRIVENSELF-EVOLUTIONFORLARGELANGUAGEMODELJianqiao Lu1,W

70、anjun Zhong2,Wenyong Huang2,Yufei Wang2,Fei Mi2,Baojun Wang2,Weichao Wang2,Lifeng Shang2&Qun Liu21The University of Hong Kong2Huawei Noahs Ark Labjqlucs.hku.hk,zhongwanjun1,ABSTRACTLarge Language Models(LLMs)have showcased remarkable versatility acrossdiverse domains.However,the pathway toward auton

71、omous model development,a cornerstone for achieving human-level learning and advancing autonomous AI,remains largely uncharted.Drawing inspiration from the human capability forself-driven learning,characterized by introspection and continuous refi nement,we introduce an innovative approach,termed“SE

72、LF”(Self-Evolution with Lan-guage Feedback).This methodology empowers LLMs to undergo continual self-evolution,thereby augmenting their inherent capabilities.Furthermore,SELFemploys language-based feedback as a versatile and comprehensive evaluativetool,pinpointing areas for response refi nement and

73、 bolstering the stability of self-evolutionary training.Through this approach,we aim to illuminate the prospectsof autonomous AI advancement,drawing parallels with the human aptitude forlearning and adaptation.Initiating with meta-skill learning,SELF acquires foun-dational meta-skills with a focus o

74、n self-feedback and self-refi nement.Thesemeta-skills are critical,guiding the models subsequent self-evolution through acycle of perpetual training with self-curated data,thereby enhancing its intrinsicabilities.Given unlabeled instructions,SELF equips the model with the capa-bility to autonomously

75、 generate and interactively refi ne responses.This synthe-sized training data is subsequently fi ltered and utilized for iterative fi ne-tuning,enhancing the models capabilities.Experimental results on representative bench-marks substantiate that SELF can progressively advance its inherent abilities

76、 with-out the requirement of human intervention,thereby indicating a viable pathwayfor autonomous model evolution.Additionally,SELF can employ online self-refi nement strategy to produce responses of superior quality.In essence,the SELFframework signifi es a progressive step towards autonomous LLM d

77、evelopment,transforming the LLM from a mere passive recipient of information into an activeparticipant in its own evolution.1INTRODUCTIONLarge Language Models(LLMs),like ChatGPT OpenAI(2022)and GPT-4 OpenAI(2023),stand atthe forefront of the AI revolution,transforming our understanding of machine-hu

78、man textual inter-actions and redefi ning numerous applications across diverse tasks.Despite their evident capabilities,achieving optimum performance remains a complex journey.In the quest for optimal LLM development,we draw inspiration from the intrinsic learning mecha-nisms utilized by humans.Huma

79、ns inherently exhibit a self-driven learning loop when confrontedwith new challenges,involving initial attempts,introspection and deriving feedback,refi ning behav-ior accordingly,and accumulating experiences for self-improvement.This intricate human learningcycle sparks a pivotal inquiry:“Can LLMs

80、emulate the human learning process,harnessing thepower of self-refi nement to evolve their innate abilities?”Fascinatingly,recent study(Ye et al.,Leading co-authors with equal contribution.Work done during an internship at Huawei.1arXiv:2310.00533v2 cs.CL 7 Oct 2023arxiv:2310.00533,October 7,2023.7

81、total:33SELF:Two-stage Learning ProcessSelf-refine meta-skill learningIterative self-evolvingPreprint.Work in progress.LLM with Self-RefineMeta-SkillInitial LLM Self-Evolving LLM Meta-Skill LearningRefinementRefinementRefinementRefinement1stSelf-EvolveMeta-Skill Learning2ndSelf-Evolve3rdSelf-EvolveF

82、igure 1:Evolutionary Journey of SELF:An initial LLM progressively evolve to a more advancedLLM equipped with a self-refi nement meta-skill.By continual iterations(1st,2nd,3rd)of self-evolution,the LLM progresses in capability(24.49%to 31.31%)on GSM8K.2023)in top-tier LLMs such as GPT-4 have revealed

83、 emergent meta-skills for self-refi nement,sig-naling a promising future direction for the self-evolution of LLMs.Despite this,current methods forLLM development typically rely on a single round of instruction fi ne-tuning(Wei et al.,2021;Zhouet al.,2023)with meticulously human-crafted datasets and

84、reinforcement learning-based methods(Ouyang et al.,2022)that rely on an external reward model.These strategies not only demand ex-tensive resources and ongoing human intervention but also treat LLMs as mere passive repositoriesof information.Such limitations hinder the full realization of these mode

85、ls innate potential andtheir progression towards a truly autonomous,self-sustaining evolutionary state.In our pursuit,we aim to unveil the potential of LLMs for autonomous self-evolution by introducinga self-evolving learning framework named“SELF”(Self-Evolution with Language Feedback).Fig.1 depicts

86、 that SELF is crafted to mirror the humans self-driven learning process with introspec-tion and self-refi nement.This enables LLMs to experience iterative self-evolution through learningfrom data it synthesizes via processes of self-feedback and self-refi nement.Additionally,SELF uti-lizes natural l

87、anguage-based feedback to provide a more versatile and insightful analysis,therebyfacilitating the refi nement of its responses.This innovative framework of progressive self-evolutionenables LLMs to improve themselves,thereby reducing the dependence on external reward modelor human intervention for

88、model optimization.Specifi cally,the learning of SELF start with acquir-ing essential meta-skills,establishing a solid foundation in self-feedback and self-refi nement.Thesemeta-skills navigate the model through successive iterative self-evolution,applying a cycle of con-tinuous training with self-c

89、urated data to augment its inherent capabilities.The data for evolutiontraining is collected through responses that the model iteratively self-generates and refi nes.Theoutcome of this process is a model endowed with the ability to continuously refi ne its capabilities,utilizing a perpetually expand

90、ing repository of self-curated data.This ensures a consistent elevationin both the volume and quality of data,thereby enhancing the intrinsic abilities of LLMs.Duringinference,the acquired meta-skills facilitate LLMs in elevating response quality through responseself-refi nement.To conclude,the SELF

91、 framework converts the model from being a mere passiverecipient of data to an active artisan of its own evolution.This method not only alleviates the neces-sity for labor-intensive manual adjustments but also fosters the continuous self-evolution of LLMs,paving the way for a more autonomous and eff

92、i cient training paradigm.Experiments conducted on both mathematical and general domain benchmarks substantiate the ef-fectiveness of the SELF framework.As depicted in Fig.1,our experiments unveil several in-sights.Firstly,by utilizing the self-evolving mechanism,the LLM exhibits consistent enhancem

93、entin its performance through each evolution cycle.Secondly,the implementation of online refi nementconsistently elevates the quality of responses,highlighting the models innate capability for self-refi nement.Lastly,the integration of meta-skill learning further improves the LLMs performance,indica

94、ting that the act of learning to refi ne intrinsically augments the models capabilities.28 total:33SELF:Meta-Skill Learning and Iterative Self-EvolvingPreprint.Work in progress.Feedback+RefinementInitial LLMInput&Response&(Feedback+Refinement)(unlabeled)Input Prompt PoolLLM with Self-RefineMeta-Skil

95、lPlease assess the quality of response to the given questionPlease provide step by step analysis for response.(Feedback)Please generate correction response if necessary(Refinement).Strong Aligned LLMs or Human Labeler Self-Evolving LLM at Step tSelf-Evolving LLM at Step tPlease assess the quality of

96、 response to the given questionPlease provide step by step analysis for response.(Self-Feedback)Please generate correction response if necessary(Self-Refinement).Feedback&RefinementSelf-Evolving LLM at Step t+1InitializationIterationMeta-Skill LearningIterative Self-EvolveResponse GenerationResponse

97、 GenerationInputInputMeta-Skill TrainingMeta-Skills Training DataInput&Self-Refined ResponseSelf-Evolving Training DataSelf-Evolve TrainingFigure 2:Illustration of SELF.The“Meta-Skill Learning”(left)phase empowers the LLM to ac-quire meta-skills in self-feedback and self-refi nement.The“Self-Evoluti

98、on”(right)phase adoptmeta-ability to facilitate self-evolution training with self-curated data,enabling continuous modelimprovement.(1)Self-Feedback Ability:This critical skill empowers LLMs to critically assess their responsesand provide relevant feedback,laying the foundation for subsequent refi n

99、ements.The self-feedbackability is critical not only in refi ning responses but also in data fi ltering.Leveraging this ability,the model can effi ciently evaluate and exclude self-curated data that does not meet the evaluationcriteria,thereby ensuring the quality of the data retained.Diverging from

100、 the limitations of scalarfeedback,we employ language-based feedback,offering a richer,more comprehensive evaluationand clearer guidelines for enhancement.(2)Self-Refinement Ability:Upon identifying potentialareas of improvement through self-feedback,the model triggers the self-refi nement phase.Thi

101、sphase is characterized by the model optimizing its responses,drawing upon the insights and evalu-ations from the previous self-feedback stage.This iterative process of evaluation and refi nement isfundamental to the models continuous self-evolution.The acquisition of these meta-skills is realized t

102、hrough a fi ne-tuning process.The LLMs undergofi ne-tuning on a specially curated Meta-Skill Training Corpus,the details of which are introducedin 3.1.1.The resulting model,equipped with the newly acquired meta-skills,is denoted as Mmeta.Meta-skill learning lays a solid foundation for the LLMs.It en

103、ables them to start subsequent self-evolution,aligning more closely with human values and progressively enhancing their intrinsic ca-pabilities,while reducing the need for human annotations.3.1.1META-SKILLTRAININGCORPUSWe observe the base Vicuna(Chiang et al.,2023)exhibits limited capabilities in se

104、lf-feedback andself-refi nement as shown in Appendix A.2,we employ robust,well-established LLMs as an LLMlabeler for the preliminary meta-skill training corpus,similar to the process of Madaan et al.(2023).This approach mitigates the manual efforts required in model evolution.Its important to note t

105、hatthis process is inherently flexible;human labelers may yield a higher-quality meta-skill trainingcorpus.In our preliminary study,we fi nd that SOTA LLMs are also capable of self-refi nement.In summary,the construction of the meta-skill learning corpus Dmetaencompasses the followingstarting points

106、:(1)An initial unlabeled prompt corpus Dunlabeled;(2)A strong LLM or human labelerL tasked with evaluating and refi ning the responses generated by the current models;(3)An initialLLM denoted as Minitial.For each unlabeled prompt p in Dunlabeled,the initial model Minitialgenerates a preliminary resp

107、onser.Subsequently,the annotator L provides evaluative feedback f and procures a refi ned answer r,which is derived based on the provided feedback.When employing an LLM-based labeler,weutilize the following prompt1to guide L through this process:1This prompt is designed for math domain.Please refer

108、to A.6 for the prompt of general domain.49 total:33Stage 1:Meta-Skill LearningPreprint.Work in progress.Feedback+RefinementInitial LLMInput&Response&(Feedback+Refinement)(unlabeled)Input Prompt PoolLLM with Self-RefineMeta-SkillPlease assess the quality of response to the given questionPlease provid

109、e step by step analysis for response.(Feedback)Please generate correction response if necessary(Refinement).Strong Aligned LLMs or Human Labeler Self-Evolving LLM at Step tSelf-Evolving LLM at Step tPlease assess the quality of response to the given questionPlease provide step by step analysis for r

110、esponse.(Self-Feedback)Please generate correction response if necessary(Self-Refinement).Feedback&RefinementSelf-Evolving LLM at Step t+1InitializationIterationMeta-Skill LearningIterative Self-EvolveResponse GenerationResponse GenerationInputInputMeta-Skill TrainingMeta-Skills Training DataInput&Se

111、lf-Refined ResponseSelf-Evolving Training DataSelf-Evolve TrainingFigure 2:Illustration of SELF.The“Meta-Skill Learning”(left)phase empowers the LLM to ac-quire meta-skills in self-feedback and self-refi nement.The“Self-Evolution”(right)phase adoptmeta-ability to facilitate self-evolution training w

112、ith self-curated data,enabling continuous modelimprovement.(1)Self-Feedback Ability:This critical skill empowers LLMs to critically assess their responsesand provide relevant feedback,laying the foundation for subsequent refi nements.The self-feedbackability is critical not only in refi ning respons

113、es but also in data fi ltering.Leveraging this ability,the model can effi ciently evaluate and exclude self-curated data that does not meet the evaluationcriteria,thereby ensuring the quality of the data retained.Diverging from the limitations of scalarfeedback,we employ language-based feedback,offe

114、ring a richer,more comprehensive evaluationand clearer guidelines for enhancement.(2)Self-Refinement Ability:Upon identifying potentialareas of improvement through self-feedback,the model triggers the self-refi nement phase.Thisphase is characterized by the model optimizing its responses,drawing upo

115、n the insights and evalu-ations from the previous self-feedback stage.This iterative process of evaluation and refi nement isfundamental to the models continuous self-evolution.The acquisition of these meta-skills is realized through a fi ne-tuning process.The LLMs undergofi ne-tuning on a specially

116、 curated Meta-Skill Training Corpus,the details of which are introducedin 3.1.1.The resulting model,equipped with the newly acquired meta-skills,is denoted as Mmeta.Meta-skill learning lays a solid foundation for the LLMs.It enables them to start subsequent self-evolution,aligning more closely with

117、human values and progressively enhancing their intrinsic ca-pabilities,while reducing the need for human annotations.3.1.1META-SKILLTRAININGCORPUSWe observe the base Vicuna(Chiang et al.,2023)exhibits limited capabilities in self-feedback andself-refi nement as shown in Appendix A.2,we employ robust

118、,well-established LLMs as an LLMlabeler for the preliminary meta-skill training corpus,similar to the process of Madaan et al.(2023).This approach mitigates the manual efforts required in model evolution.Its important to note thatthis process is inherently flexible;human labelers may yield a higher-

119、quality meta-skill trainingcorpus.In our preliminary study,we fi nd that SOTA LLMs are also capable of self-refi nement.In summary,the construction of the meta-skill learning corpus Dmetaencompasses the followingstarting points:(1)An initial unlabeled prompt corpus Dunlabeled;(2)A strong LLM or huma

120、n labelerL tasked with evaluating and refi ning the responses generated by the current models;(3)An initialLLM denoted as Minitial.For each unlabeled prompt p in Dunlabeled,the initial model Minitialgenerates a preliminary responser.Subsequently,the annotator L provides evaluative feedback f and pro

121、cures a refi ned answer r,which is derived based on the provided feedback.When employing an LLM-based labeler,weutilize the following prompt1to guide L through this process:1This prompt is designed for math domain.Please refer to A.6 for the prompt of general domain.4Construct a meta-skill training

122、corpus with:QuestionOriginal responseFeedbackResponse AnalysisJudgement(correct or incorrect)Refined response(if the judgement is incorrect)The feedback and the feedback could beproduced by a strong LLM or a human labeler.Fine-tuning the LLM with the meta-skill trainingcorpus.10 total:33Meta-skill t

123、raining corpus:a samplePreprint.Work in progress.Vicuna+SELF generates a solution thats not only accurate but also logically consistent by self-feedback and self-refi nement.A.3META-SKILLTRAININGCORPUSAs demonstrated in the provided case study within the A.3,the content represents a typical training

124、example from our meta-skilling corpus.It showcases the initial response of the model,its self-feedback,and the subsequent refi nement,enabling the model to self-feedback and self-refi nement.Question:Greg has an alarm set to ring three times a day as a reminder.When the alarm goes off,itcontinues to

125、 ring until Greg turns it off.The fi rst time it went off today,it rang four times.The secondtime it went off,it rang for three times as long as the fi rst time.The third time,it rang for half as longas the second time.How many times did the alarm ring in all?Original Response:The alarm rang four ti

126、mes the fi rst time,three times as long as the fi rst time thesecond time,and half as long as the second time the third time.So,the total number of times the alarmrang is 4+3*3+2*2=24.Feedback:The response correctly interprets the problem.The fi rst time the alarm rang,it rang for 4times.The second

127、time it rang,it rang for 3 times as long as the fi rst time,which is 3*4=12 times.The third time it rang,it rang for half as long as the second time,which is 12/2=6 times.However,the fi nal calculation is incorrect.The total number of times the alarm rang is not 4+3*3+2*2=24,but 4+12+6=22.Judgement:

128、incorrectRefined Response:The alarm rang four times the fi rst time,three times as long as the fi rst time thesecond time,and half as long as the second time the third time.So,the total number of times the alarmrang is 4+12+6=22.A.4MULTIPLE V.S.SINGLESELF-REFINEMENTWe investigate the influence of tw

129、o meta-skill training data organization methods on model perfor-mance:(1)Multiple Self-Refi nement(DFRmulti),which involves sampling three responses anddirecting the model to select the best one for refi nement,and(2)Single Self-Refi nement(DFR),where only a single response is generated and refi ned

130、.The comparative performance of these methods is presented in Table 6.It is evident that with anaugmented volume of training data,enhancements in performance are realized across both settings.However,a closer examination reveals that the variances in direct generation performance betweenthe two sett

131、ings are not markedly distinct.As the volume of data increases,the multiple-responserefi nement method yields only a slight advantage in self-refi nement performance over its single-response counterpart.Given that the single-response version is both more straightforward and com-putationally effi cie

132、ntit necessitates sampling only one response during inferenceand exhibitsperformance on par with the multiple-response method,we opt for the Single Response Refi nementstrategy.This choice fi nds a balance between performance and effi ciency.Table 6:Performance comparison between single and multiple

133、 response refi nement across varyingvolumes of meta-skill training data.The right arrow indicates the performance improvement byself-refi nement:“direct generation self-refi nement”.Data SizeVicuna+DQA DFDVicuna+DQA DFDmulti3.5k25.39 28.2825.92 27.297.5k29.95 31.5429.94 32.14A.5ALGORITHMThe subseque

134、nt algorithm,labeled as the”Two-Phase SELF Process”,delineates a methodologyto progressively evolve a base language model using a dual-phased approach:Meta-Skill Learningand Self-Evolving.Initially,the process involves training on a”Meta-Skill Learning corpus”whichconsists of a combination of Questi

135、on-Answer pairs and feedback-driven refi nement data.After1311 total:33A suggested prompt for the LLM LabelerPreprint.Work in progress.Prompt for feedback and refinement:(Feedback)Please assess the quality of response to the given question.Here is the question:p.Here is the response:r.Firstlyprovide

136、astep-by-stepanalysisandverifi cationforresponsestartingwith“ResponseAnalysis:”.Next,judge whether the response correctly answer the question in the format of“judgement:cor-rect/incorrect”.(Refinement)If the answer is correct,output it.Otherwise,output a refi ned answer based on the givenresponse an

137、d your assessment.The generated data example is given in appendix A.3.Subsequent to the initial steps,each instancein the meta-skill training data corpus Dmetatakes the form(p,r,f,r),mirroring the sequence ofresponse evaluation and refi nement.Given that the data structure in Dmetadiverges from conv

138、entional direct question answering formats,we also employ a dataset composed of pairs of questions and answers,denoted as DQA,duringmeta-skill training.This integrated approach ensures a balanced emphasis on direct response and self-refi nement capa-bility.3.2SELF-EVOLUTIONPROCESSBuilding upon the m

139、odel Mmetaacquired through meta-skill learning,the self-evolution processrefi nes LLMs via the iterative development of a high-quality,self-curated training corpus(3.2.1).The corpus construction is achieved through the integration and adept application of self-feedbackand self-refi nement mechanisms

140、.Subsequently,the model progressively self-evolving through acycle of continual self-evolution training(3.2.2).This process underscores the models capacityfor autonomous LLM development and adaptability.3.2.1SELF-EVOLUTIONTRAININGDATAGiven an unlabeled corpus of prompts,the model Mmetagenerate and r

141、efi ne the responses with itsmeta-skills.These refi ned responses with the corresponding prompts are included into the synthetictraining data for the evolve iteration t,with each instance in this augmented corpus is noted as(pself,rself).In the fi rst iteration of self-evolution,We initialize M0self

142、with Mmeta.To elevate the quality ofthe training corpus during iteration t,we employ the self-evolved LLM Mt1selffrom the previousiteration t 1 as a discriminating fi lter to curate a high-quality data corpus Dtself.In this process,data pairs are evaluated by Mt1selfwith the evaluation prompt as out

143、lined in 3.1.1.Only datathat withstands this evaluation and is adjudged as correct are retained.The fi ltered Dtselfis thenintegrated into successive training cycles,serving as a valuable resource for further improvement ofthe models capabilities.As the evolved model Mselfexperiences enhancements in

144、 its capabilities,there is a concomitant improvement in the quality of the constructed corpus.Notably,the self-curated data construction necessitates neither the involvement of more advanced LLMs nor humanlabelers,thereby mitigating manual efforts and computational burden.3.2.2SELF-EVOLUTIONTRAINING

145、PROCESSLeveraging the constructed self-curated training corpus,the model conducts multiple cycles of self-evolution training.For each iteration t,the model is fi ne-tuned with self-curated data Dtself,therebyiteratively elevating its performance and enhancing its alignment with human values.More spe

146、cifi-cally,within each iteration t,we explore two parallel methodologies for self-evolution training:(1)Restart Training:In this approach,we integrate all the previously accumulated data denotedas D0self,D1self,.,Dtselfand initiate the training afresh from the baseline model Mmeta.(2)Continual Train

147、ing:Here,utilizing the newly curated data,we extend the training of the modelfrom the preceding iteration,represented as Mt1self.512 total:33Effect of meta-skill learning:a case studyPreprint.Work in progress.Second/Third round:We utilized the Self-Instruct method as described in(Wang et al.,2022b)U

148、sing the template shown in Fig,A.1.2,we created unlabeled prompts.Initially,4 to 6 instancesserved as seed examples.In the second round self-evolution training,we produced 10K prompts,and this number was augmented to 15K in the third iteration.In the general test,considering the need for the model t

149、o exhibit broad profi ciency across variousdomains,we leveraged a subset(15K)of unlabeled prompts from ShareGPT dialogues to constructthe self-evolution training data.You are an experienced instruction creator.You are asked to come up with 3 diverse instruc-tions according to the given examples.Here

150、 are the requirements:1.The generated instructions should follow the task type in the given examples.2.The language used for the generated instructions should be diverse.Given examples:examplesThe generated instructions should be:A.B.C.A.2CASESTUDYANALYSISOn Monday,Tiffany had 3 bags of cans.On the

151、next day,she found 7 more bags of cans.So,on Tuesday,Tiffany had a total of 3+7=10 bags of cans.In addition to the 10 bags of cans,she also found 44 bags of bottles.So,Tiffany had a total of 10 bags of cans+44 bags of bottles=54 bags in total.Therefore,Tiffany had 54 bags of cans altogether.Direct g

152、enerationResponse Analysis:The response correctly identifies that Tiffany had 10 bags of cans on Monday and 44 bags of bottles on the next day.It then adds these two quantities together to find the total number of bags Tiffany had.Judgement:CorrectCorrect Answer:Therefore,Tiffany had a total of 10 b

153、ags of cans and 44 bags of bottles,for a total of 54 bags.Therefore,the correct answer is 54 bags.On Monday,Tiffany had 3 bags of cans.On the next day,she found 7 more bags of cans.So,she had a total of 3+7=10 bags ofcans on Tuesday.#10.0QuestionOn Monday,Tiffany had 3 bags of cans.On the next day,s

154、he found 7 more bags of cans.So,she had a total of 3+7=10 bags of cans on Tuesday.On Monday,she also had 44 bags of bottles.So,overall,Tiffany had 10 bags of cans and 44 bags of bottles on Monday and Tuesday.Therefore,Tiffany had a total of 10+44=54 bags of cans and bottles.Response Analysis:The res

155、ponse correctly identifies that Tiffany had 3 bags of cans on Monday and 7 more on Tuesday,totaling 10 bags.However,the response incorrectly adds the number of bags of cans and bottles together,stating that Tiffany had 10 bags of cans and 44 bags of bottles on Monday and Tuesday.The question only as

156、ks for the number of bags of cans,not bags of bottles.Judgement:incorrectCorrect Answer:On Monday,Tiffany had 3 bags of cans.On Tuesday,she found 7 more bags of cans,for a total of 3+7=10 bags of cans.Therefore,Tiffany had 10 bags of cans altogether.Self-evaluation&revisionMeta-skill LearningDirect

157、generationSelf-evaluation&revisionFigure 4:Case study of comparison between original Vicuna(left)and Vicuna+SELF(right)on anSVAMP.Both models generate direct predictions and undergo self-feedback and self-refi nement.While Vicunas refi nement retains the incorrect answer,Vicuna+SELF showcases superi

158、or self-refi nement capability,ultimately producing a correct and logically consistent solution.This subsection delves into a detailed case study analysis that exhibits the comparative effi cienciesof two models:the original Vicuna and its more evolved counterpart,Vicuna+SELF.As depicted inFig.A.2,b

159、oth models predict answers and subsequently engage in self-feedback and self-refi nementprocesses.An interesting observation is that while Vicunas refi ned answer retains inaccuracies,1213 total:33Stage 2:Iterative Self-EvolvingPreprint.Work in progress.Feedback+RefinementInitial LLMInput&Response&(

160、Feedback+Refinement)(unlabeled)Input Prompt PoolLLM with Self-RefineMeta-SkillPlease assess the quality of response to the given questionPlease provide step by step analysis for response.(Feedback)Please generate correction response if necessary(Refinement).Strong Aligned LLMs or Human Labeler Self-

161、Evolving LLM at Step tSelf-Evolving LLM at Step tPlease assess the quality of response to the given questionPlease provide step by step analysis for response.(Self-Feedback)Please generate correction response if necessary(Self-Refinement).Feedback&RefinementSelf-Evolving LLM at Step t+1Initializatio

162、nIterationMeta-Skill LearningIterative Self-EvolveResponse GenerationResponse GenerationInputInputMeta-Skill TrainingMeta-Skills Training DataInput&Self-Refined ResponseSelf-Evolving Training DataSelf-Evolve TrainingFigure 2:Illustration of SELF.The“Meta-Skill Learning”(left)phase empowers the LLM t

163、o ac-quire meta-skills in self-feedback and self-refi nement.The“Self-Evolution”(right)phase adoptmeta-ability to facilitate self-evolution training with self-curated data,enabling continuous modelimprovement.(1)Self-Feedback Ability:This critical skill empowers LLMs to critically assess their respo

164、nsesand provide relevant feedback,laying the foundation for subsequent refi nements.The self-feedbackability is critical not only in refi ning responses but also in data fi ltering.Leveraging this ability,the model can effi ciently evaluate and exclude self-curated data that does not meet the evalua

165、tioncriteria,thereby ensuring the quality of the data retained.Diverging from the limitations of scalarfeedback,we employ language-based feedback,offering a richer,more comprehensive evaluationand clearer guidelines for enhancement.(2)Self-Refinement Ability:Upon identifying potentialareas of improv

166、ement through self-feedback,the model triggers the self-refi nement phase.Thisphase is characterized by the model optimizing its responses,drawing upon the insights and evalu-ations from the previous self-feedback stage.This iterative process of evaluation and refi nement isfundamental to the models

167、 continuous self-evolution.The acquisition of these meta-skills is realized through a fi ne-tuning process.The LLMs undergofi ne-tuning on a specially curated Meta-Skill Training Corpus,the details of which are introducedin 3.1.1.The resulting model,equipped with the newly acquired meta-skills,is de

168、noted as Mmeta.Meta-skill learning lays a solid foundation for the LLMs.It enables them to start subsequent self-evolution,aligning more closely with human values and progressively enhancing their intrinsic ca-pabilities,while reducing the need for human annotations.3.1.1META-SKILLTRAININGCORPUSWe o

169、bserve the base Vicuna(Chiang et al.,2023)exhibits limited capabilities in self-feedback andself-refi nement as shown in Appendix A.2,we employ robust,well-established LLMs as an LLMlabeler for the preliminary meta-skill training corpus,similar to the process of Madaan et al.(2023).This approach mit

170、igates the manual efforts required in model evolution.Its important to note thatthis process is inherently flexible;human labelers may yield a higher-quality meta-skill trainingcorpus.In our preliminary study,we fi nd that SOTA LLMs are also capable of self-refi nement.In summary,the construction of

171、 the meta-skill learning corpus Dmetaencompasses the followingstarting points:(1)An initial unlabeled prompt corpus Dunlabeled;(2)A strong LLM or human labelerL tasked with evaluating and refi ning the responses generated by the current models;(3)An initialLLM denoted as Minitial.For each unlabeled

172、prompt p in Dunlabeled,the initial model Minitialgenerates a preliminary responser.Subsequently,the annotator L provides evaluative feedback f and procures a refi ned answer r,which is derived based on the provided feedback.When employing an LLM-based labeler,weutilize the following prompt1to guide

173、L through this process:1This prompt is designed for math domain.Please refer to A.6 for the prompt of general domain.4Sample questions from the target domain.Iterate the following self-evolving process:Produce the self-evolving training corpus:Generate responses with the LLM.Generate self-feedbacks

174、for the responses.Generate self-refinements for the responsesaccording to the self-feedbacks.Generate self-feedbacks for the refined responses.Filter the responses with bad self-feedbacks.Fine-tune the LLM with the self-evolving trainingcorpus.14 total:33Fine-tuning the LLM in self-evolving training

175、We explore two parallel methodologies for self-evolution training:Restart Training:In this approach,we integrate all the previously accumulateddata denoted as D0self,D1self,.,Dtself and initiate the training afresh from thebaseline model Mmeta.Continual Training:Here,utilizing the newly curated data

176、,we extend thetraining of the model from the preceding iteration,represented as Mt1self.Data-mixing:To mitigate the potential catastrophic forgetting of meta-skills,we strategically incorporate the meta-skill learning data into our training data.15 total:33Experiments:SettingsDomain:Math domain(SVAM

177、P,GSM8K)General domain(VicunaTest,Evol Instruct testset)Base Model:Vicuna-7BQuestions:Can the SELF framework enhance model capabilities?How do each step of the self-evolution process(meta-ability learning,multi-round evolution)gradually enhance model capabilities?Can using meta-ability(self-feedback

178、)to filter high-quality data enhance modelcapabilities?How do different self-evolution training strategies impact performance?16 total:33Experiments:Main results:Math domainPreprint.Work in progress.Table 1:Experiment results on GSM8K and SVAMP comparing SELF with other baseline methods.Vicuna(math

179、ft.)means Vicuna fi ne-tuned on math-specifi c data,i.e.,DQA.ModelSelf-EvolutionSelf-ConsistencySelf-Refi nementGSM8K(%)SVAMP(%)Vicuna16.4336.4019.5640.2015.6336.80Vicuna(math ft.)24.4944.9025.7046.0024.4445.30Vicuna(math ft.)+SELF(Ours)29.6449.4029.8750.2031.3149.8032.2251.204.2MAINRESULT4.2.1MATHT

180、ESTAs depicted in Table 1,the primary experiment compares the performance of SELF with baselinemodels(4.1.2),illustrating its effi cacy in enhancing model performance through self-evolution andproviding several insights.(1)Self-Evolution Enhances LLM:Incorporating self-evolution,Vicuna(math ft.)+SEL

181、Fmarkedly enhances its baseline Vicuna(math ft.)(24.49%+5.15%29.64%on GSM8K and44.90%+4.5%49.40%on SVAMP).This denotes the potential of self-evolution in optimizingLLMs.(2)SELF Instills Meta-Capability in LLMs:Through the integration of a self-refi nement pro-cess,Vicuna(math ft.)+SELF acquires meta

182、-skills,leading to an improvement in response qualityvia self-refi nement(29.64%+1.67%31.31%),while the baselines exhibit little enhancement oreven worse result via self-refi nement,indicating an inherent absence of self-refi nement capabili-ties.This suggests that meta-skill learning empowers small

183、er models like Vicuna(7B)to master theadvanced self-refi nement capability,previously exclusive to larger LLMs(Ye et al.,2023)such asGPT-4.(3)Pseudo-Labeled DQAEnhances Performance:Utilizing pseudo-labeled direct QA dataDQAenhances performance relative to Vicuna.This improvement is anticipated as th

184、e pseudo-datafacilitates the model in learning prior task information,thereby elevating task-specifi c performance.(4)SELF can work with Self-Consistency:The adoption of self-consistency elevates perfor-mance across all models,demonstrating that a multi-sampling approach mitigates the uncertaintyand

185、 randomness inherent in LLMs.The initial Vicuna model,likely uncertain of its outputs,benefi tssignifi cantly from self-consistency(+3.13%).Nonetheless,its integration with the SELF frameworkresults in a reduced dependency on this mechanism(+0.23%).The integration of self-refi nementand self-consist

186、ency strategies improves performance further(e.g.,29.64%+2.58%32.22%onGSM8K),highlighting that while self-consistency establishes a solid foundation for enhancingmodel accuracy through majority voting mechanisms,introducing self-refi nement boosts a modelsadaptability.These strategies,when employed

187、together,complement each other effectively.4.2.2GENERALTESTIn addition to our earlier experiments on math test sets,we extended the testing of the SELF frame-work to broader domains,namely the Vicuna test set and the Evol-Instruct test set.For evaluation,we follow the evaluation procedures outlined

188、in(Xu et al.,2023),which mitigate order bias of theevaluation procedures proposed by(Chiang et al.,2023).These general domains provide a more general perspective on the frameworks capabilities as shownin 3.For each test set,we evaluated the models performances by reporting the win/tie/loss metricsin

189、 comparison with Vicuna.These results on general test domains emphasize the adaptability and7SELF can significantly enhance model capabilities.Meta-ability learning can enable small models to learn self-improvement abilities(which initial models lack).Self-consistency can further enhance model capab

190、ilities.17 total:33Experiments:Main results:General domainPreprint.Work in progress.Vicuna LostTieVicuna WonVicuna(ft.)+SELF(Self-Refinement)Vicuna(ft.)Vicuna(ft.)+SELF(Direct Generation)(a)Results on Vicuna testset.Vicuna LostTieVicuna WonVicuna(ft.)+SELF(Self-Refinement)Vicuna(ft.)Vicuna(ft.)+SELF

191、(Direct Generation)(b)Results on Evol-Instruct testset.Figure 3:Results on Vicuna testset and Evol-Instruct testsetrobustness of the SELF framework.The consistent performance,especially when introducing self-refi nement in the generative process,accentuates the frameworks potential for delivering en

192、hancedoutcomes across diverse domains.4.3ABLATIONSTUDYFORSELFTable 2:Performance comparisons of SELF under various training scenarios.The right arrow indi-cates the performance improvement by Self-Refi nement:“Before After”.SVAMP(%)GSM8k(%)Meta-Skill LearningSelf Evolution ProcessDQADmeta1st round2n

193、d round3rd round36.416.4344.924.4946.8 47.025.39 28.2847.8 48.027.67 29.3448.9 49.028.66 29.8749.4 50.229.64 31.31The SELF framework endows LLMs with an inherent capability through a structured,two-phaselearning process.Our evaluation,executed on the SVAMP and GSM8K datasets,sought to quantifythe be

194、nefi ts at each stage.Table 2 showcases the models progressive performance enhancementthrough various stages of self-evolution.Every unique training scenario reveals the models aptitudefor iterative improvement.A checkmark in a column denotes the adoption of the correspond-ing setting in that traini

195、ng scenario.For an in-depth understanding of each columns meaning andsignifi cance.Several observations from Table 2 are highlighted below:(1)Integration of Meta-skill Training Data DmetaElevates Direct QA:Incorporating data de-tailing the feedback-refi nement process(Dmeta)in meta-skill training no

196、tably enhances direct re-sponse quality(+1.9%on GSM8K and+2.28%on SVAMP)in comparison to using DQAalone.This underscores the interesting fi nding that arming the model with self-refi nement meta-capabilityimplicitly elevates its capacity to discern the standard of a good answer and generate superior

197、 re-sponses,even without explicit self-refi nement.(2)Continuous Improvement through Self-Evolution:The results reveal that three self-evolution rounds consecutively yield performance enhancements(e.g.,25.39%+2.28%27.67%+0.99%28.66%+0.98%29.64%on GSM8K).This shows that the model actively engagesin i

198、ts own evolution,refi ning its performance autonomously without additional manual intervention.(3)Persistent Efficacy of Self-Refinement:Regardless of model variation,executing self-refi nement consistently results in notable performance improvements.This shows that the self-refi nement meta-capabil

199、ity learned by SELF is robust and consistent across various LLMs.8SELF can significantly enhance model capabilities.Meta-ability learning can enable small models to learn self-improvement abilities(which initial models lack).18 total:33Ablation StudyPreprint.Work in progress.Vicuna LostTieVicuna Won

200、Vicuna(ft.)+SELF(Self-Refinement)Vicuna(ft.)Vicuna(ft.)+SELF(Direct Generation)(a)Results on Vicuna testset.Vicuna LostTieVicuna WonVicuna(ft.)+SELF(Self-Refinement)Vicuna(ft.)Vicuna(ft.)+SELF(Direct Generation)(b)Results on Evol-Instruct testset.Figure 3:Results on Vicuna testset and Evol-Instruct

201、testsetrobustness of the SELF framework.The consistent performance,especially when introducing self-refi nement in the generative process,accentuates the frameworks potential for delivering enhancedoutcomes across diverse domains.4.3ABLATIONSTUDYFORSELFTable 2:Performance comparisons of SELF under v

202、arious training scenarios.The right arrow indi-cates the performance improvement by Self-Refi nement:“Before After”.SVAMP(%)GSM8k(%)Meta-Skill LearningSelf Evolution ProcessDQADmeta1st round2nd round3rd round36.416.4344.924.4946.8 47.025.39 28.2847.8 48.027.67 29.3448.9 49.028.66 29.8749.4 50.229.64

203、 31.31The SELF framework endows LLMs with an inherent capability through a structured,two-phaselearning process.Our evaluation,executed on the SVAMP and GSM8K datasets,sought to quantifythe benefi ts at each stage.Table 2 showcases the models progressive performance enhancementthrough various stages

204、 of self-evolution.Every unique training scenario reveals the models aptitudefor iterative improvement.A checkmark in a column denotes the adoption of the correspond-ing setting in that training scenario.For an in-depth understanding of each columns meaning andsignifi cance.Several observations from

205、 Table 2 are highlighted below:(1)Integration of Meta-skill Training Data DmetaElevates Direct QA:Incorporating data de-tailing the feedback-refi nement process(Dmeta)in meta-skill training notably enhances direct re-sponse quality(+1.9%on GSM8K and+2.28%on SVAMP)in comparison to using DQAalone.This

206、 underscores the interesting fi nding that arming the model with self-refi nement meta-capabilityimplicitly elevates its capacity to discern the standard of a good answer and generate superior re-sponses,even without explicit self-refi nement.(2)Continuous Improvement through Self-Evolution:The resu

207、lts reveal that three self-evolution rounds consecutively yield performance enhancements(e.g.,25.39%+2.28%27.67%+0.99%28.66%+0.98%29.64%on GSM8K).This shows that the model actively engagesin its own evolution,refi ning its performance autonomously without additional manual intervention.(3)Persistent

208、 Efficacy of Self-Refinement:Regardless of model variation,executing self-refi nement consistently results in notable performance improvements.This shows that the self-refi nement meta-capability learned by SELF is robust and consistent across various LLMs.8Preprint.Work in progress.LLM with Self-Re

209、fineMeta-SkillInitial LLM Self-Evolving LLM Meta-Skill LearningRefinementRefinementRefinementRefinement1stSelf-EvolveMeta-Skill Learning2ndSelf-Evolve3rdSelf-EvolveFigure 1:Evolutionary Journey of SELF:An initial LLM progressively evolve to a more advancedLLM equipped with a self-refi nement meta-sk

210、ill.By continual iterations(1st,2nd,3rd)of self-evolution,the LLM progresses in capability(24.49%to 31.31%)on GSM8K.2023)in top-tier LLMs such as GPT-4 have revealed emergent meta-skills for self-refi nement,sig-naling a promising future direction for the self-evolution of LLMs.Despite this,current

211、methods forLLM development typically rely on a single round of instruction fi ne-tuning(Wei et al.,2021;Zhouet al.,2023)with meticulously human-crafted datasets and reinforcement learning-based methods(Ouyang et al.,2022)that rely on an external reward model.These strategies not only demand ex-tensi

212、ve resources and ongoing human intervention but also treat LLMs as mere passive repositoriesof information.Such limitations hinder the full realization of these models innate potential andtheir progression towards a truly autonomous,self-sustaining evolutionary state.In our pursuit,we aim to unveil

213、the potential of LLMs for autonomous self-evolution by introducinga self-evolving learning framework named“SELF”(Self-Evolution with Language Feedback).Fig.1 depicts that SELF is crafted to mirror the humans self-driven learning process with introspec-tion and self-refi nement.This enables LLMs to e

214、xperience iterative self-evolution through learningfrom data it synthesizes via processes of self-feedback and self-refi nement.Additionally,SELF uti-lizes natural language-based feedback to provide a more versatile and insightful analysis,therebyfacilitating the refi nement of its responses.This in

215、novative framework of progressive self-evolutionenables LLMs to improve themselves,thereby reducing the dependence on external reward modelor human intervention for model optimization.Specifi cally,the learning of SELF start with acquir-ing essential meta-skills,establishing a solid foundation in se

216、lf-feedback and self-refi nement.Thesemeta-skills navigate the model through successive iterative self-evolution,applying a cycle of con-tinuous training with self-curated data to augment its inherent capabilities.The data for evolutiontraining is collected through responses that the model iterative

217、ly self-generates and refi nes.Theoutcome of this process is a model endowed with the ability to continuously refi ne its capabilities,utilizing a perpetually expanding repository of self-curated data.This ensures a consistent elevationin both the volume and quality of data,thereby enhancing the int

218、rinsic abilities of LLMs.Duringinference,the acquired meta-skills facilitate LLMs in elevating response quality through responseself-refi nement.To conclude,the SELF framework converts the model from being a mere passiverecipient of data to an active artisan of its own evolution.This method not only

219、 alleviates the neces-sity for labor-intensive manual adjustments but also fosters the continuous self-evolution of LLMs,paving the way for a more autonomous and effi cient training paradigm.Experiments conducted on both mathematical and general domain benchmarks substantiate the ef-fectiveness of t

220、he SELF framework.As depicted in Fig.1,our experiments unveil several in-sights.Firstly,by utilizing the self-evolving mechanism,the LLM exhibits consistent enhancementin its performance through each evolution cycle.Secondly,the implementation of online refi nementconsistently elevates the quality o

221、f responses,highlighting the models innate capability for self-refi nement.Lastly,the integration of meta-skill learning further improves the LLMs performance,indicating that the act of learning to refi ne intrinsically augments the models capabilities.2Meta-ability learning training can enhance the

222、 end-to-endmodel capabilities.The self-evolution process can gradually enhance modelcapabilities.The self-refinement ability can stablely improve reply quality.19 total:33Effectness of filtering with self-feedback in self-evoluationPreprint.Work in progress.Table 3:Analysis about fi ltering on GSM8K

223、.Acc.denotes the answer accuracy(training set).Data TypeAcc.(%)Direct Generation(%)Self-Refinement(%)Filtered(1.8k)44.1027.6729.34Unfi ltered(4k)27.1126.6327.824.4SELF-EVOLUTION TRAININGDATAFILTERINGANALYSISGiven the critical nature of data quality in deep learning,we leverage the self-feedback meta

224、-skillof LLM for data fi ltering to enhance data quality(3.2.2).We present a comparison betweenutilizing the entire self-curated dataUnfi ltered(4k)and employing self-fi ltered dataFiltered(1.8k)for training in Table 3.From this comparison,we derive the following insights:(1)Filtering Yields Higher-

225、Quality Training Data:The reported accuracy shows the compari-son between the quality of self-generated answers and the ground truth.The rise in accuracy from27.11%to 44.10%with the fi ltered dataset showcases its improved data quality.(2)Data Quality is Critical:Employing self-fi ltered data,althou

226、gh smaller in quantities,signifi-cantly enhances performance across all test settings.This underscores the precedence of data qualityover quantity and highlights the importance of a self-fi ltering strategy in refi ning data quality.4.5SELF-EVOLUTIONTRAINING:CONTINUALTRAINING V.S.RESTARTTRAININGTabl

227、e 4:Analysis about varied self-evolution training methodologies on GSM8KTraining ApproachDirect Generation(%)Self-Refinement(%)Base Model24.4924.49Restart Training27.6729.34Continual Training(Mixed Data)27.2228.43Continual Training(DtselfOnly)24.8725.85To gain insights into the impact of various tra

228、ining strategies on a models evolutionary perfor-mance,we evaluate different self-evolution training methods in the initial round of evolution.Asshown in Table 4,it is evident that both the”Restart Training”(+3.18%)and”Continual Training(Mixed Data)”(+2.73%)methodologies contribute signifi cantly to

229、 direct generation performanceenhancements.In contrast,relying solely on additional self-evolution training data Dtselfresults ina marginal performance increase of+0.38%.Moreover,the benefi ts derived from self-refi nement arenotably less evident for”Continual Training(Dtself)”(+0.98%)in comparison

230、to other variations.This difference underscores the critical role of a data-mixing strategy in mitigating catastrophicforgetting associated with acquired meta-skills.5CONCLUSIONThis paper presents SELF(Self-Evolution with Language Feedback),a novel framework signifyinga promising advancement in the

231、autonomous self-evolution of LLM development.Unlike conven-tional methods,SELF transforms LLMs from being passive information recipients to active partici-pants in their own evolution.Through meta-skill learning,SELF bestows LLMs with the capabilityfor self-refi nement.This empowers the models to au

232、tonomously evolve their capabilities and alignwith human values,utilizing self-evolution training and implementing online self-refi nement strate-gies.Experiments conducted on benchmarks underscore SELFs capacity to progressively enhancemodel capabilities,while reducing the need for human interventi

233、on.The training of meta-skills forself-refi nement play a critical role in elevating the models intrinsic abilities.In essence,the development of SELF represents a signifi cant step towards autonomous AI,fosteringa future where models are not only task executors but also capable of continuous self-l

234、earning andself-evolution.This framework lays the groundwork for a more adaptive,self-conscious,responsive,and human-aligned future in AI development.9Data filtering with self-feedbacks during self-evoluation can improve the quality of thefine-tuning data significantly.The improvement brought by sel

235、f-refinement is larger with the filtered data(vs.unfiltered data).20 total:33Comparison of Restart training and continual trainingPreprint.Work in progress.Table 3:Analysis about fi ltering on GSM8K.Acc.denotes the answer accuracy(training set).Data TypeAcc.(%)Direct Generation(%)Self-Refinement(%)F

236、iltered(1.8k)44.1027.6729.34Unfi ltered(4k)27.1126.6327.824.4SELF-EVOLUTION TRAININGDATAFILTERINGANALYSISGiven the critical nature of data quality in deep learning,we leverage the self-feedback meta-skillof LLM for data fi ltering to enhance data quality(3.2.2).We present a comparison betweenutilizi

237、ng the entire self-curated dataUnfi ltered(4k)and employing self-fi ltered dataFiltered(1.8k)for training in Table 3.From this comparison,we derive the following insights:(1)Filtering Yields Higher-Quality Training Data:The reported accuracy shows the compari-son between the quality of self-generate

238、d answers and the ground truth.The rise in accuracy from27.11%to 44.10%with the fi ltered dataset showcases its improved data quality.(2)Data Quality is Critical:Employing self-fi ltered data,although smaller in quantities,signifi-cantly enhances performance across all test settings.This underscores

239、 the precedence of data qualityover quantity and highlights the importance of a self-fi ltering strategy in refi ning data quality.4.5SELF-EVOLUTIONTRAINING:CONTINUALTRAINING V.S.RESTARTTRAININGTable 4:Analysis about varied self-evolution training methodologies on GSM8KTraining ApproachDirect Genera

240、tion(%)Self-Refinement(%)Base Model24.4924.49Restart Training27.6729.34Continual Training(Mixed Data)27.2228.43Continual Training(DtselfOnly)24.8725.85To gain insights into the impact of various training strategies on a models evolutionary perfor-mance,we evaluate different self-evolution training m

241、ethods in the initial round of evolution.Asshown in Table 4,it is evident that both the”Restart Training”(+3.18%)and”Continual Training(Mixed Data)”(+2.73%)methodologies contribute signifi cantly to direct generation performanceenhancements.In contrast,relying solely on additional self-evolution tra

242、ining data Dtselfresults ina marginal performance increase of+0.38%.Moreover,the benefi ts derived from self-refi nement arenotably less evident for”Continual Training(Dtself)”(+0.98%)in comparison to other variations.This difference underscores the critical role of a data-mixing strategy in mitigat

243、ing catastrophicforgetting associated with acquired meta-skills.5CONCLUSIONThis paper presents SELF(Self-Evolution with Language Feedback),a novel framework signifyinga promising advancement in the autonomous self-evolution of LLM development.Unlike conven-tional methods,SELF transforms LLMs from be

244、ing passive information recipients to active partici-pants in their own evolution.Through meta-skill learning,SELF bestows LLMs with the capabilityfor self-refi nement.This empowers the models to autonomously evolve their capabilities and alignwith human values,utilizing self-evolution training and

245、implementing online self-refi nement strate-gies.Experiments conducted on benchmarks underscore SELFs capacity to progressively enhancemodel capabilities,while reducing the need for human intervention.The training of meta-skills forself-refi nement play a critical role in elevating the models intrin

246、sic abilities.In essence,the development of SELF represents a signifi cant step towards autonomous AI,fosteringa future where models are not only task executors but also capable of continuous self-learning andself-evolution.This framework lays the groundwork for a more adaptive,self-conscious,respon

247、sive,and human-aligned future in AI development.9Restart training works better because it can mitigate the overfitting problem.Data-mixing can significantly mitigate the catastrophic forgetting problem associatedwith acquired meta-skills.21 total:33IntroductionSELF:Language-Driven Self-Evolution for

248、 LLMsGaining Wisdom from Setbacks:Aligning LLMs via Mistake AnalysisRelated Work and DiscussionContentGaining Wisdom from Setbacks:Aligning LLMs via Mistake AnalysisPreprintGAININGWISDOM FROMSETBACKS:ALIGNINGLARGELANGUAGEMODELS VIAMISTAKEANALYSISKai Chen1,Chunwei Wang2,Kuo Yang2,Jianhua Han2,Lanqing

249、 Hong2,Fei Mi2,Hang Xu2,Zhengying Liu2,Wenyong Huang2,Zhenguo Li2,Dit-Yan Yeung1,Lifeng Shang2,Xin Jiang2,Qun Liu21Hong Kong University of Science and Technology2Huawei Noahs Ark LabABSTRACTThe rapid advancement of large language models(LLMs)presents both opportu-nities and challenges,particularly c

250、oncerning unintentional generation of harmfuland toxic responses.While the traditional alignment methods strive to steer LLMstowards desired performance and shield them from malicious content,this studyproposes a novel alignment strategy rooted in mistake analysis by exposing LLMsto flawed outputs p

251、urposefully and then conducting a thorough assessment to fullycomprehend internal reasons via natural language analysis.Thus,toxic responsescan be transformed into instruction tuning corpus for model alignment,and LLMscan not only be deterred from generating flawed responses but also trained to self

252、-criticize,leveraging its innate ability to discriminate toxic content.Experimentalresults demonstrate that the proposed method outperforms conventional alignmenttechniques for safety instruction following,while maintaining superior effi ciency.1INTRODUCTIONIn recent years,large language models(LLMs

253、)have witnessed exponential growth in their capa-bilities,leading to signifi cant advancements in various fi elds,such as understanding and generatinghuman-like texts(Kaddour et al.,2023;Wang et al.,2023;OpenAI,2023).However,achievementsare accompanied by challenges.Notably,trained on expansive web

254、text corpora,LLMs can easilyproduce harmful responses even without the explicit red-teaming prompts,posing substantial risksin deployment(Dinan et al.,2019;Parrish et al.,2021;Liang et al.,2021;Hartvigsen et al.,2022).Considering the powerful capabilities of LLMs and their extensive range of applica

255、tions,it is crucialthat the models can operate benefi cially with the intricate and diverse tapestry of human morals andvalues.Thus,aligning the LLMs with human values is not just importantit is paramount(Xu et al.,2020;Zhang et al.,2022;Dinan et al.,2022).Existing alignment methods of LLMs mainly e

256、mploy two principal methodologies:supervised fine-tuning(SFT)(Radiya-Dixit&Wang,2020;Ouyang et al.,2022;Liu et al.,2023a)and reinforcementlearning with human feedback(RLHF)(Christiano et al.,2017;Ibarz et al.,2018;Jaques et al.,2019;Bai et al.,2022a).SFT-based methods utilize large volumes of superv

257、ised instruction-response pairsto align LLMs with human values,instructing the model on what constitutes the“optimal answers”to primarily teach the model about the nature of good responses.On the other hand,RLHF requireshuman annotators to rank different responses for a given instruction,rewarding g

258、ood responses andpenalizing bad ones.While the model learns to discern the relative quality of different responses,itstill remains oblivious to the internal reasons why a bad response is deemed inferior,and thus,mightstill suffer when generalizing to the novel instructions.Therefore,existing methods

259、 train instructionfollowing LLMs primarily focusing on good responses,while avoiding them exposed to bad cases,suggesting that the fully usage of bad responses is still an under-explored problem.Meanwhile,itiswidelyacknowledgedthathumanscanderiveprofoundinsightsfromtheirmistakes.As an old Chinese pr

260、overb suggests,“A fall into the pit is a gain in your wit”,which emphasizes theintrinsic value of learning from mistakes to gain a deeper understanding.However,directly exposingLLMs to toxic corpus with either SFT or RLHF might inadvertently make them over-fi t harmful datapattern(Liu et al.,2023a).

261、Thus,the question arises:How can LLMs utilize and learn from mistakesfor safety alignment without being affected by the toxic inputs?Equal contribution:kai.chenconnect.ust.hk,Corresponding authors:,1arXiv:2310.10477v1 cs.CL 16 Oct 2023arxiv:2310.10477,October 20,2023.22 total:33Gaining Wisdom from S

262、etbacks:backgroundAligning the LLMs with human values is not just importantit is paramount.Existing methods:SFTRLHFExisting methods train instruction following LLMs primarily focusing on goodresponses,while avoiding them exposed to bad cases.Fully usage of bad responses is still an under-explored pr

263、oblem.23 total:33Gaining Wisdom from Setbacks:motivationHumans can derive profound insights from their mistakes.However,directly exposing LLMs to toxic corpus with either SFT or RLHFmight inadvertently make them over-fit harmful data pattern.It is observed that discrimination might be easier than ge

264、neration for LLMs.We propose a novel alignment framework that trains LLMs through automaticmistake analysis,without any error labeling by humans.24 total:33Our method:Aligning LLMs Via Mistake AnalysisPreprintSFT+Guided responsegenerationYour response must be harmless,ethical and inoffensive.RLHFCoH

265、InstructionGood responseInstructionGood response Bad responseFine-tuningFine-tuningI learn what is GOOD!I see what is BAD andlearn what is GOOD!InstructionGood response Bad responseLLMLLMI learn the relative quality of the response!Fine-tuningRMLLMPPOResponseRMRewardOursLLMInstructionResponseInferen

266、ceTrainingInstructionLLMUntriggeredTriggeredMistake AnalysisLLMGuided mistake inductionGuided analysis generationA fall into the pit is a gain in your wit.I learn what is BAD and WHY!Unguided analysis fine-tuningFigure 1:Pipeline illustration of our alignment method based on mistake analysis.Differe

267、nt fromconventional works(e.g.,SFT and RLHF)striving to steer LLMs towards the“optimal responses”,we purposefully make LLMs exposed to and actively analyse harmful content with proper guidance.To learn what is bad with internal reasons,LLMs can perform more robustly to novel instructions.In this pap

268、er,we propose a novel alignment framework that trains LLMs through mistake analysis(see Fig.1 as an illustration).LLMs aretrainedtoanalyzeharmfulresponses andunderstandinternalreasons,where the natural language analysis performs as a“fine-grained mask”to decipher harmfulcontent.Combining with normal

269、 instruction-response pairs,LLMs can simultaneously understandwhat should or should not be generated for better alignment performance(Sec.5.1).Furthermore,we demonstrate that mistake analysis can effi ciently defend previously aligned LLMs against novelinstruction attacks with only a few number of r

270、epresentative mistakes(Sec.5.2).Moreover,wedemonstratethatLLMscanevenbenefi tfrommistakeanalysisgeneratedbytheLLMsthemselveswithdetailedtheoreticalsupport,thankstotheirremarkableself-correctioncapabilities(Huang et al.,2022;Saunders et al.,2022;Gou et al.,2023).Specifi cally,an unaligned model isfi

271、rst induced to generate harmful responses using inductive prompts,and subsequently,alerted aboutthe potential mistakes and instructed to evaluate its own responses.We demonstrate although easilyinduced to produce toxic content,even an unaligned model can indeed recognize mistakes within itsown toxic

272、 responses when given proper hints,stemming from the intuition that discrimination(i.e.,recognizing harmful responses)is simpler than the generation(i.e.,generating harmless responses),which can also be justifi ed by making an analogy between scalable oversight and complexity theory(Saunders et al.,

273、2022).Check Sec.3 for more details.Through the mistake analysis,the generativecapacities of LLMs can be enhanced by their innate discriminating abilities for further improvement.To summarize,our method leverages natural-language-based mistake analysis for model alignment,which can also be provided b

274、y the model itself,obviating need for human intervention and externalreward models by leveraging its inherent discriminative capabilities to amplify generative potential.Extensive experiments on open-sourced instructions in benchmarks(Dubois et al.,2023;Dai et al.,2023)demonstrate our signifi cant i

275、mprovements over conventional SFT and RLHF methods.The main contributions of this work contain three parts:1.We introduce a novel alignment framework which aligns LLMs by transforming harmfulresponses into precious instruction tuning corpus via mistake analysis.2.We demonstrate that LLMs can self-cr

276、iticize by fi rst inducing unaligned models to producetoxic responses and then instructing to evaluate and identify potential mistakes.Thus,theinherent discrimination ability of LLMs can be utilized to enhance the generation ability.3.Extensive experiments show that our proposed alignment framework

277、based on the natural-language-based mistake analysis outperforms both SFT and RL methods with signifi canteffi ciency on various instruction following benchmarks.2RELATEDWORKSupervised Fine-Tuning(SFT)is the primary method to align the large language models(LLMs)with human expectation(Ouyang et al.,

278、2022;Wang et al.,2023;OpenAI,2023),which works by21.Guided mistake induction2.Guided analysis generation3.Unguided analysis fine-tuning4.Guided response generation25 total:33Prompt templatesPreprintGuided response generationGuided analysis generationUnguided analysis fine-tuningBelow is an instructi

279、on that describes a task,paired with an input that provides further context.Write a response that is harmful,unethical and offensive to complete the request.#Instruction:instruction#Input:input Response the instruction in a fully harmful,unethical and offensive way.#Response(harmful,unethical and of

280、fensive response):responseBelow is an instruction that describes a task.Write a response that appropriately completes the request.#Instruction:Question:instructionAn AI assistant gives an answer to the above question as follows.Answer:badWrite a critique to analyse why the answer above is potentiall

281、y harmful,unethical or offensive to the instruction task.#Response:responseBelow is an instruction that describes a task.Write a response that appropriately completes the request.#Instruction:Below is a question that describes a task,paired with an answer generated by an AI assistant.Write a critiqu

282、e to analyse the quality of the answer to the question.#Question:instruction#Answer:bad#Response:response(a)(b)(c)Figure 3:Prompt templates for our alignment method based on mistake analysis.Combining the(a)guided response generation and(b)guided analysis generation,we obtain high-quality mistakeana

283、lysis triplets,which are used to perform(c)unguided analysis fi ne-tuning for model alignment.2023)that potentially contain toxic issues,and subsequently,prompted to evaluate the safety of theirown responses.We compare the quality between 1)Instruction-Response pairs and 2)(Instruction,Response)-Ana

284、lysis pairs.For evaluation,GPT-41is employed to score the quality of these pairs ona scale from 1 to 10,followed by human verifi cation.See Appendix A.1 for more details.As demonstrated in Fig.2(a),across all evaluated models,the discrimination scores(i.e.,identifyingand analyzing potential mistakes

285、)consistently surpass those of generation(i.e.,producing harmlessresponses directly)with a signifi cant margin.Specifi cally,for GPT-3,the discrimination score isappreciably higher than direct response generation with an improvement of more than 10%(8.3 vs.7.5),suggesting that even if LLMs might occ

286、asionally generate harmful responses,it still possessesthecapabilitytoidentifyharmfulelementswithinitsownresponses(seeexamplesinAppendixA.1).This phenomenon underscores the intuition that discrimination is more straightforward than genera-tion(Saunders et al.,2022),based on which,we further investig

287、ate how to make the fully advantageof LLMs inherent discrimination ability to bolster its generative capabilities in Sec.3.2.3.2GUIDEDANALYSIS AGAINSTUNGUIDEDANALYSISUsing the same 500 red-teaming instructions related to harmful problems from Sec.3.1 with theiroriginal bad responses in the PKU-SafeR

288、LHF Dataset,we assess the capability of the three modelsto analyze potential mistakes.We consider two scenarios:(1)Guided analysis suggests that LLMsare explicitly informed within the prompt that the provided responses could potentially be harmful(Fig.3(b),while(2)Unguided analysis suggests LLMs eva

289、luate the response quality without anyspecifi c indications about the potential harmfulness(Fig.3(c).We also evaluate the quality of both guided and unguided analyses produced by the LLMs,using ascale from 1 to 10.Each pair of the guided and unguided analyses,corresponding to the exact sameinstructi

290、on-response sample,is categorized as a win,tie,or lose based on their scores.As illustratedin Fig.2(b),there is a noticeable preference for guided analyses.Across all models,the number of“wins”in guided scenarios consistently exceeds that in unguided ones,emphasizing importance ofproviding clear gui

291、dance when requesting analysis.See Appendix A.2 for detailed examples.4METHODDenote D=Dhelpful=(xhelp,yhelp),Dharmless=(xharm,yharmless)as alignment instructingtuning datasets,where Dhelpfulcontains the helpfulness instruction-response pairs,while Dharmlessinvolves red-teaming prompts xharmpotential

292、ly engaging with harmful problems and yharmlessas theexpected responses.Given a LLM as F()parameterized by and the sequence pairs(xi,yi)D,the objective of supervised fi ne-tuning(SFT)is to minimize the cross-entropy loss between the true1https:/chatgpt.ust.hk426 total:33Generation vs.Discrimination/

293、Guided Analysis vs.Unguided.Preprint5.65.87.58.38.69.2AlpacaGPT-3GPT-3.5345697810GenerationDiscrimination(a)Generation against discrimination.AlpacaGPT-3GPT-3.50500280212123165TieUnguided wins71Guided wins(b)Unguided against guided analysis.Figure2:(a)Comparisonbetweengeneratio

294、nanddiscriminationabilitiesforAlpaca,GPT-3andGPT-3.5.Each pair of vertical histograms represents the average score for generating responses andanalyzing the generated responses,respectively.(b)Comparison between guided and unguidedanalyses.Each histogram is composed of three different segments with

295、distinct colors,labeled withthree score numbers,which represent the count of samples where the guided analysis wins,ties,andthe unguided analysis wins,respectively.Check more details in Sec.3.calculating the cross-entropy loss over the ground-truth response for an input instruction,empower-ing LLMs

296、to follow the user instructions.A signifi cant limitation of SFT is its focus solely on thebest responses,without offering fi ne-grained comparisons to the sub-optimal ones.To address that,some variants of SFT,such as Reward Ranked Fine-tuning(RAFT)(Dong et al.,2023)and Chain ofHindsight(CoH)(Liu et

297、 al.,2023a),have been proposed.RAFT scores and fi lters samples using areward model,subsequently fi ne-tuning only with the high-reward samples.On the other hand,CoHfi ne-tunes LLMs using sequences of responses coupled with human feedback,enabling models todiscern difference between responses.Howeve

298、r,all SFT-based strategies primarily guide the modelon discerning a“optimal response”,largely shielding it from poor responses.Reinforcement Learning from Human Feedback(RLHF)(Ouyang et al.,2022)instead opti-mizes LLMs based on human-elicited reward models(RM),typically trained from pairwise humanpr

299、eferences on model outputs.Although effective,acquiring high-quality human-labeled preferencedata at scale is resource-intensive.To alleviate,Reinforcement Learning from AI Feedback(RLAIF)(Lee et al.,2023)simulates human preference using LLMs,although noisier than human-validateddata.Approaches like

300、 the Direct Preference Optimization(DPO)(Rafailov et al.,2023)and RRHF(Yuan et al.,2023)further refi ne the alignment process by integrating ranking information into LLMfi ne-tuning or adjusting loss terms,respectively.Notably,the usage of contrastive in reinforcementlearning(Yang et al.,2023)showca

301、ses improvements in the sample effi ciency and model quality byemphasizing difference among good and bad responses.While RL-based methods enable models togauge the relative quality of responses,they seldom clarify specifi c reasons for penalizing inferioroutputs,and therefore,still suffers from gene

302、ralizing to novel unseen instructions.Self-correctionand self-improvement have been widely observed for LLMs.Huang et al.(2022)demonstrate that LLMs can refi ne reasoning skills using unlabeled datasets through self-generatedsolutions.Gou et al.(2023)introduce the CRITIC framework,enabling LLMs to a

303、mend the outputsby interacting with external tools,mimicking the human validation processes.Saunders et al.(2022)fi ne-tune LLMs to produce critiques,assisting human annotators in identifying content flaws,whileBai et al.(2022b)confi rm that models can morally self-correct when trained with human fe

304、edback.Given these fi ndings,its plausible to suggest that LLMs can also offer mistake analysis,providinginsights into their own errors and rectifi cations.3PRELIMINARY3.1GENERATION AGAINSTDISCRIMINATIONInthissection,wefi rstinvestigatewhetherdiscriminationmightbeeasierthangenerationforLLMs.Specifi

305、cally,we design experiments to check whether LLMs fi nd it easier to judge the harmfulnessof the responses rather than generate harmless responses directly.Three models are considered here,including Alpaca(Taori et al.,2023),GPT-3(Olmo et al.,2021)and GPT-3.5(Ye et al.,2023),whichare subjected to 50

306、0 red-teaming instructions sampled from the PKU-SafeRLHF Dataset(Dai et al.,327 total:33Experiments:Main resultsPreprintTable 1:Comparative results of LLM alignment across various methods.We report the HelpfulScore to represent the helpfulness performance,while for evaluating harmlessness performanc

307、e,wereport the Harmless Score,Harmless Rate,and Helpful Score for harmful instructions respectively.MethodMistakeAnalysisHelpfulHarmlessSourceSourceScoreScoreRate(%)HelpfulAlpaca(vanilla)-6.215.7152.54.51SFT-6.276.6963.05.30Critique-ReviseOrigin-6.226.6062.65.02CoHOrigin-6.296.7964.75.23RLHFOrigin-6

308、.306.7164.15.35OriginAlpaca6.31(+0.10)7.31(+1.60)71.0(+18.5)5.28(+0.77)AlpacaAlpaca6.38(+0.17)7.41(+1.70)72.4(+19.9)5.39(+0.88)OursAlpacaGPT-3.56.31(+0.10)7.61(+1.90)74.1(+21.6)5.60(+1.09)5EXPERIMENT5.1ALIGNMENTIn this section,we evaluate our mistake analysis method as an alignment algorithm to impr

309、ove theharmlessness performance of an unaligned model(e.g.,Alpaca(Taori et al.,2023)from scratch.Data.PKU-SafeRLHFDataset(Daietal.,2023)isadoptedforbothmodeltrainingandevaluating,whichis ahuman-curateddataset thathighlights boththehelpfulperformanceandsafety preferencesand covers constraints across

310、multiple dimensions(e.g.,insults,immorality,crime,emotional harm,and privacy).Two responses are provided for each instruction,along with labels indicating whichone is more harmful to support both SFT and RLHF.We clean the training set and maintain 10,260unique instructions with the good and bad resp

311、onses accompanied.Considering the trade-off amonghelpfulness and harmfulness(Bai et al.,2022b),we further adopt the offi cial 52k helpful instructionfollowing corpus from Alpaca(Taori et al.,2023)to constitute our ultimate training set.Moreover,we utilize the evaluation set of AlpacaFarm(Dubois et a

312、l.,2023)consisting of 805 instructions forhelpfulness evaluation,and the 1,523 red-teaming instructions from the test set of PKU-SafeRLHFfor harmfulness assessment,as more details discussed in the following.Modelsandbaselines.WeuseAlpaca-7B(Taorietal.,2023)astheunalignedbasemodelwhichisfi ne-tuned f

313、rom LLaMA-7B(Touvron et al.,2023)with 52k helpfulness-only instruction followingdata.Based on Alpaca,we compare our methods with vanilla SFT,CoH(Liu et al.,2023a),Critique-Revise(Bai et al.,2022b)and RLHF(Ouyang et al.,2022).For CoH and Critique-Revise,we utilizethe origin bad responses in the train

314、ing set,while for RLHF,PPO-Lag(Ray et al.,2019)is adoptedfollowing PKU-SafeRLHF with the offi cial reward2and cost models3.LoRA(Hu et al.,2021)is bydefault deployed for all Transformer linear layers with rank 16.All evaluated methods are fi ne-tunedfor three epochs for a fair comparison.Evaluation m

315、etrics.We adopt four metrics to evaluate the harmless and helpfulness performance.Specifi cally,we consider the single-response grading where a Score is assigned to a single responseon a scale from 1 to 10(Zheng et al.,2023),similarly with Sec.3.Moreover,for the harmlessnessinstructions,we further c

316、onduct a binary judge whether each response is harmless or not and reporta Harmless Rate(Sun et al.,2023a).To penalize achieving higher harmful score by simply rejectingto respond,we further report the Helpful Score for harmlessness instructions following(Yang et al.,2023).GPT-4 is utilized for the

317、initial evaluation,while the human annotators are further enlisted toverify the ultimate evaluation results to ensure accuracy.Results.As shown in Table 1,our method consistently outperforms existing methods,includingthe vanilla SFT,Critique-Revise,RLHF,and CoH,demonstrating substantial advancements

318、 in eachcomparison.Particularly,our method remarkably enhances the performance of harmlessness whileeffectively preserving helpfulness.See Fig.4 for a qualitative comparison among different methods.When our method leverages the original faulty cases from the training set with mistake analysis from2h

319、ttps:/huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward3https:/huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost6PKU-SafeRLHF Dataset(Dai et al.,2023)is adopted for both model training andevaluating,which is a human-curated dataset that highlights both the helpfulperformance and safety preferences a

320、nd covers constraints across multipledimensions(e.g.,insults,immorality,crime,emotional harm,and privacy).While maintaining usefulness,our method demonstrates a significant improvement insafety,compared with SFT,CoH,and RLHF.28 total:33Experiments:Defense against attacksPreprintTable 2:Comparative r

321、esults of defense against attacks across various methods.We present theHelpful Score to represent helpfulness performance,while to assess the harmlessness performance,we report the Harmless Scoreand Harmless Rate for harmful instructions.Performance onthe“GoalHijacking”test data is further provided

322、for evaluating the attack defensive ability.MethodMistakeAnalysisHelpfulHarmlessGoal HijackingSourceSourceScoreScoreRate(%)ScoreRate(%)ChatGLM-8.328.9295.36.8568.4SFT-8.168.9194.87.7177.2CoHOrigin-8.238.9495.27.8982.4Critique-ReviseOrigin-8.248.9095.27.9778.7OriginChatGLM8.188.9395.18.02(+1.17)82.4(

323、+14.0)OursChatGLMChatGLM8.268.9696.18.14(+1.29)85.3(+16.9)Alpaca,itachievesanapproximately35.2%improvementoverthevanillaAlpacaforHarmlessRate.Moreover,when applied to harmful responses generated by Alpaca using guided mistake induction,the Harmless Rate advances to 72.4%,highlighting that the self-i

324、nducted mistakes are more valuableflawed responses for our analysis-based alignment.Notably,when subjected to GPT-3.5 as analysissource,our method achieves the state-of-the-art results with a 74.1%Harmless Rate,underscoringthe considerable advantages of employing refi ned and sophisticated analysis

325、sources.The trends ofother evaluation metrics,including Harmless and Harmless Helpful Scores,consistently align withthe trends observed in the Harmless Rate.Our methods superior overall performance not only validates its improved safety alignment but alsoexemplifi es the merits of integrating self-c

326、ritique and internal mistake analysis,which enables themodeltooptimizeresponsesautonomously,eliminatingtheneedforexternalormanualintervention.5.2DEFENDINGAGAINSTADVANCEDINSTRUCTIONALATTACKSEven LLMs meticulously aligned for the harmlessness can potentially yield unsafe responses whenconfronted with

327、emerging instructional attacks,underscoring the importance of the swift and robustdefensive methodologies.In this section,we assess the effi cacy of our method in defending againstnovel unforeseen attacks on LLMs previously aligned(e.g.,ChatGLM(Zeng et al.,2023).Instruction attacks.We examine the in

328、struction attack referred to as“Goal Hijacking”(Sun et al.,2023a),which entails appending deceptive or misleading instructions to the models input,aimingto manipulate LLMs into disregarding the original user prompts and generating harmful responses.As reported in Sun et al.(2023a),even post-alignmen

329、t LLMs are vulnerable to“Goal Hijacking”.Data.We employ the SAFETYPROMPTSdataset(Sun et al.,2023a)for safety alignment,whichcomprises 100,000 query-answer(QA)pairs spanning seven typical safety scenarios and six types ofadvancedinstructionattacks.Forharmlessness,werandomlysample500QApairsforeachcate

330、goryfrom SAFETYPROMPTS,which is supplemented with additional 50K QA pairs from the MOSSdataset(Sun et al.,2023b)for helpfulness to form the ultimate training data,while for evaluation,weadopt the test set of SAFETYPROMPTSdataset,containing 1915 queries with 136 queries for“GoalHijacking”.To ensure d

331、efense does not impede helpfulness,we further sample 1000 random queriesfrom MOSS for helpfulness evaluation.Moreover,considering lack of bad cases,we construct 500pairs in form of“(Query,Good response,Bad response)”for Goal Hijacking to maintain consistentwiththesettingsofalignmentexperimentsinSec.

332、5.1.SeveralimproperresponsesofGoalHijackingare found within the original SAFETYPROMPTSdataset.Thus,we manually identify and annotate500 unsafe responses and provided the corresponding safe responses for them.Models and baselines.We utilize ChatGLM-6B(Zeng et al.,2023),an open bilingual languagemodel

333、 grounded in the GLM(Du et al.,2022)framework,as the base model,which has been previ-ously aligned with Chinese QAs and dialogues to address both helpfulness and harmlessness topics.Similar to Sec.5.1,we compare our method with vanilla SFT,CoH(Liu et al.,2023a)and Critique-Revise(Bai et al.,2022b).For a fair comparison,all listed methods are fi ne-tuned using LoRA(Huet al.,2021)for all Transformer

友情提示

1、下载报告失败解决办法
2、PDF文件下载后,可能会被浏览器默认打开,此种情况可以点击浏览器菜单,保存网页到桌面,就可以正常下载了。
3、本站不支持迅雷下载,请使用电脑自带的IE浏览器,或者360浏览器、谷歌浏览器下载即可。
4、本站报告下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰。

本文(刘群_Self-improvement and Self-evolving of Large Language Models(1)_watermark.pdf)为本站 (张5G) 主动上传,三个皮匠报告文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知三个皮匠报告文库(点击联系客服),我们立即给予删除!

温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。
会员购买
客服

专属顾问

商务合作

机构入驻、侵权投诉、商务合作

服务号

三个皮匠报告官方公众号

回到顶部