
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Microsoft

arXiv:2404.14219v2 [cs.CL] 23 Apr 2024

Abstract

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).

1 Introduction

The striking progress of AI in the last few years can be largely attributed to major efforts throughout the world towards scaling up to ever-larger models and datasets. Large Language Models (LLMs) have steadily increased in size, from a mere billion parameters just five years ago (GPT-2 had 1.5 billion parameters [RWC+19]) to a trillion parameters today. The impetus for this effort originates in the seemingly predictable improvement one obtains by training large models, the so-called scaling laws [KMH+20, HBM+22, MRB+23]. However, these laws assume a "fixed" data source. This assumption is now significantly disrupted by the existence of frontier LLMs themselves, which allow us to interact with data in novel ways. In our previous works on the phi models [GZA+23, LBE+23, JBA+23] it was shown that a combination of LLM-based filtering of web data and LLM-created synthetic data enables performance in smaller language models that was typically seen only in much larger models. For example, our previous model trained on this data recipe, phi-2 (2.7B parameters), matched the performance of models 25 times larger trained on regular data. In this report we present a new model, phi-3-mini (3.8B parameters), trained for 3.3T tokens on larger and more advanced versions of the datasets used in phi-2. With its small size, phi-3-mini can easily be inferenced locally on a modern phone (see Figure 1), yet it achieves a quality that seems on-par with models such as Mixtral 8x7B [JSR+24] and GPT-3.5.

User: Explain why it is surprising that one can build a language model small enough to fit on a phone, yet almost as powerful as ChatGPT. Just use one funny sentence.

phi-3-mini: It's like fitting a supercomputer in a flip phone, but instead of breaking the phone, it just breaks the internet with its tiny, yet mighty, linguistic prowess!

User: Okay now more serious answer, and note that this was achieved solely by changing the training data.

phi-3-mini: The development of a compact language model that rivals the capabilities of ChatGPT, while fitting on a phone, is a testament to the power of data-driven machine learning. By meticulously curating and optimizing the training dataset, researchers can significantly reduce the model's size without compromising its performance.

2 Technical Specifications

The phi-3-mini model is a transformer decoder architecture [VSP+17], with default context length 4K. We also introduce a long context version via LongRope [DZZ+24] that extends the context length to 128K, called phi-3-mini-128K.

To best benefit the open source community, phi-3-mini is built upon a similar block structure as Llama-2 [TLI+23] and uses the same tokenizer with a vocabulary size of 32064.¹ This means that all packages developed for the Llama-2 family of models can be directly adapted to phi-3-mini. The model uses 3072 hidden dimension, 32 heads and 32 layers. We trained using bfloat16 for a total of 3.3T tokens. The model is already chat-finetuned, and the chat template is as follows:

<|user|>\n Question <|end|>\n <|assistant|>
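For concreteness, here is a minimal sketch of applying this chat template with the Hugging Face transformers tokenizer. The checkpoint ID below is an assumption (the publicly released phi-3-mini instruct checkpoint), not something specified in this report.

    # Sketch: formatting a prompt with the phi-3-mini chat template.
    # The model ID is an assumption, not part of this report.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

    messages = [{"role": "user", "content": "Explain KV caching in one sentence."}]

    # apply_chat_template inserts the <|user|> ... <|end|> <|assistant|> markers
    # shown above and appends the generation prompt for the assistant turn.
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)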

The phi-3-small model (7B parameters) leverages the tiktoken tokenizer (for better multilingual tokenization) with a vocabulary size of 100352 and has default context length 8K. It follows the standard decoder architecture of a 7B model class, having 32 layers and a hidden size of 4096. To minimize the KV cache footprint, the model also leverages grouped-query attention, with 4 queries sharing 1 key. Moreover, phi-3-small uses alternating layers of dense attention and a novel blocksparse attention to further optimize KV cache savings while maintaining long-context retrieval performance. An additional 10% multilingual data was also used for this model.
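As a back-of-the-envelope illustration of the KV-cache savings from this grouped-query layout (a sketch using the dimensions above; the fp16 cache and uniform head size are assumptions, and the blocksparse layers are not modeled):

    # Rough KV-cache estimate for a phi-3-small-like configuration (sketch).
    layers, hidden, n_heads = 32, 4096, 32   # from the report
    head_dim = hidden // n_heads             # 128 (assumption: uniform head size)
    queries_per_kv = 4                       # 4 queries share 1 key/value head
    kv_heads = n_heads // queries_per_kv     # 8
    bytes_per_elem = 2                       # fp16/bf16 cache (assumption)

    def kv_cache_bytes(seq_len, heads):
        # keys + values, per layer, per position
        return 2 * layers * heads * head_dim * bytes_per_elem * seq_len

    mha = kv_cache_bytes(8192, n_heads)      # multi-head attention baseline
    gqa = kv_cache_bytes(8192, kv_heads)     # grouped-query attention
    print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB ({mha / gqa:.0f}x smaller)")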

Highly capable language model running locally on a cell-phone. Thanks to its small size, phi-3-mini can be quantized to 4 bits so that it only occupies 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on an iPhone 14 with the A16 Bionic chip, running natively on-device and fully offline, achieving more than 12 tokens per second.
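For intuition on the reported footprint, a rough weight-only estimate (ignoring per-group quantization scales, activations, and the KV cache) is consistent with the 1.8GB figure:

    # Weight-memory estimate for a 4-bit quantized 3.8B-parameter model (lower bound).
    params = 3.8e9
    bits_per_weight = 4
    weight_bytes = params * bits_per_weight / 8
    print(f"~{weight_bytes / 2**30:.2f} GiB of 4-bit weights")   # ~1.77 GiB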

Training Methodology. We follow the sequence of works initiated in "Textbooks Are All You Need" [GZA+23], which utilize high quality training data to improve the performance of small language models and deviate from the standard scaling laws. In this work we show that such a method allows us to reach the level of highly capable models such as GPT-3.5 or Mixtral with only 3.8B total parameters (while Mixtral has 45B total parameters, for example). Our training data consists of heavily filtered web data (according to the "educational level") from various open internet sources, as well as synthetic LLM-generated data. Pre-training is performed in two disjoint and sequential phases: phase-1 consists mostly of web sources aimed at teaching the model general knowledge and language understanding; phase-2 merges even more heavily filtered web data (a subset used in phase-1) with some synthetic data that teaches the model logical reasoning and various niche skills.

Data Optimal Regime. Unlike prior works that train language models in either the "compute optimal regime" [HBM+22] or the "over-train regime", we mainly focus on the quality of data for a given scale.² We try to calibrate the training data to be closer to the "data optimal" regime for small models. In particular, we filter the web data to contain the correct level of "knowledge" and keep more web pages that could potentially improve the "reasoning ability" of the model. As an example, the result of a game in the Premier League on a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for "reasoning" for the mini-size models. We compare our approach with Llama-2 in Figure 2.

¹ We remove BoS tokens and add some additional tokens for the chat template.
² Just like for the "compute optimal regime", we use the term "optimal" in an aspirational sense for the "data optimal regime"; we are not implying that we actually found the provably "optimal" data mixture for a given scale.
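The report does not describe the filtering implementation; the following is only an illustrative sketch of thresholding pages on an "educational level" score, with a trivial placeholder heuristic standing in for the LLM-based scorer:

    # Illustrative sketch of "educational level" filtering (not the authors' pipeline).
    from typing import Iterable, Iterator

    def educational_score(page: str) -> float:
        # Placeholder heuristic; in practice this would be an LLM judgment or a
        # small classifier distilled from LLM labels.
        keywords = ("theorem", "definition", "because", "therefore", "step")
        return min(1.0, sum(page.lower().count(k) for k in keywords) / 10)

    def filter_web_pages(pages: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
        # Keep only pages whose estimated educational value clears the threshold.
        for page in pages:
            if educational_score(page) >= threshold:
                yield page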

Figure 1: 4-bit quantized phi-3-mini running natively on an iPhone with the A16 Bionic chip, generating over 12 tokens per second.

Figure 2: Scaling law close to the "Data Optimal Regime" (from left to right: phi-1.5, phi-2, phi-3-mini, phi-3-small) versus the Llama-2 family of models (7B, 13B, 34B, 70B) that were trained on the same fixed data. We plot the log of the MMLU error versus the log of the model size.

To test our data on larger sizes of models, we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture as phi-3-mini, and trained on the same data for slightly more epochs (4.8T tokens in total, as for phi-3-small). The model has 40 heads and 40 layers, with embedding dimension 5120. We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the "data optimal regime" for a 14B parameter model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a "preview".

Post-training. Post-training of phi-3-mini went through two stages, including supervised finetuning (SFT) and direct preference optimization (DPO). SFT leverages highly curated high-quality data across diverse domains, e.g., math, coding, reasoning, conversation, model identity, and safety. The SFT data mix starts with using English-only examples. DPO data covers chat format data, reasoning, and responsible AI (RAI) efforts. We use DPO to steer the model away from unwanted behavior, by using those outputs as "rejected" responses. Besides improvement in math, coding, reasoning, robustness, and safety, post-training transforms a language model into an AI assistant that users can efficiently and safely interact with.
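The report does not restate the DPO objective; for reference, the standard direct preference optimization loss (background material, not necessarily the exact recipe used here) is

\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big],

where y_w is the preferred response, y_l is the rejected response (e.g., the unwanted outputs mentioned above), \pi_{\mathrm{ref}} is a frozen reference model, and \beta is a temperature hyperparameter.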

As part of the post-training process, we developed a long context version of phi-3-mini with the context length limit enlarged to 128K instead of 4K. Across the board, the 128K model quality is on par with the 4K length version, while being able to handle long context tasks. Long context extension has been done in two stages, including long context mid-training and long-short mixed post-training with both SFT and DPO.

3 Academic benchmarks

Below we report the results for phi-3-mini on standard open-source benchmarks measuring the model's reasoning ability (both common sense reasoning and logical reasoning). We compare to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Mixtral-8x7b [JSR+24], Gemma 7B [TMH+24], Llama-3-instruct-8b [AI23], and GPT-3.5. All the reported numbers are produced with the exact same pipeline to ensure that the numbers are comparable. These numbers might differ from other published numbers due to slightly different choices in the evaluation. As is now standard, we use few-shot prompts to evaluate the models, at temperature 0. The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models.³ The number of k-shot examples is listed per benchmark. An example of a 2-shot prompt is described in Appendix A.

³ For example, we found that using # before the Question can lead to a noticeable improvement to phi-3-mini's results across many benchmarks, but we did not do such changes in the prompts.
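As an illustration of this setup (not the Microsoft-internal tool; all helper names are hypothetical), a k-shot prompt in the Appendix A format can be assembled and scored greedily at temperature 0 roughly as follows:

    # Illustrative k-shot evaluation loop (hypothetical helpers, not the internal pipeline).
    def build_prompt(shots, question, options):
        # `shots` is a list of dicts with "question", "options", "answer" keys.
        blocks = []
        for ex in list(shots) + [{"question": question, "options": options, "answer": ""}]:
            opts = "\n".join(f"{label}. {text}" for label, text in ex["options"].items())
            blocks.append(f"Question: {ex['question']}\nOptions:\n{opts}\nAnswer: {ex['answer']}".rstrip())
        return "\n\n".join(blocks)

    def accuracy(model_generate, dataset, shots):
        correct = 0
        for ex in dataset:
            prompt = build_prompt(shots, ex["question"], ex["options"])
            pred = model_generate(prompt, temperature=0.0, max_new_tokens=1).strip()
            correct += (pred == ex["answer"])
        return correct / len(dataset)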

Benchmark | Phi-3-mini 3.8b | Phi-3-small 7b (preview) | Phi-3-medium 14b (preview) | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 (version 1106)
MMLU (5-Shot) [HBK+21] | 68.8 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.0 | 68.4 | 71.4
HellaSwag (5-Shot) [ZHB+19] | 76.7 | 78.7 | 83.0 | 53.6 | 58.5 | 49.8 | 69.5 | 70.4 | 78.8
ANLI (7-Shot) [NWD+20] | 52.8 | 55.0 | 58.7 | 42.5 | 47.1 | 48.7 | 54.8 | 55.2 | 58.1
GSM-8K (0-Shot; CoT) [CKB+21] | 82.5 | 88.9 | 90.3 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1
MedQA (2-Shot) [JPO+20] | 53.8 | 58.2 | 69.4 | 40.9 | 49.6 | 50.0 | 58.9 | 62.2 | 63.4
AGIEval (0-Shot) [ZCG+23] | 37.5 | 45.0 | 48.4 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4
TriviaQA (5-Shot) [JCWZ17] | 64.0 | 59.1 | 75.6 | 45.2 | 72.3 | 75.2 | 73.6 | 82.2 | 85.8
Arc-C (10-Shot) [CCE+18] | 84.9 | 90.7 | 91.0 | 75.9 | 78.6 | 78.3 | 80.5 | 87.3 | 87.4
Arc-E (10-Shot) [CCE+18] | 94.6 | 97.1 | 97.8 | 88.5 | 90.6 | 91.4 | 92.3 | 95.6 | 96.3
PIQA (5-Shot) [BZGC19] | 84.2 | 87.8 | 87.7 | 60.2 | 77.7 | 78.1 | 77.1 | 86.0 | 86.6
SociQA (5-Shot) [BZGC19] | 76.6 | 79.0 | 80.2 | 68.3 | 74.6 | 65.5 | 73.2 | 75.9 | 68.3
BigBench-Hard (0-Shot) [SRR+22, SSS+22] | 71.7 | 75.0 | 81.3 | 59.4 | 57.3 | 59.6 | 68.9 | 69.7 | 68.3
WinoGrande (5-Shot) [SLBBC19] | 70.8 | 82.5 | 81.4 | 54.7 | 54.2 | 55.6 | 58.0 | 62.0 | 68.8
OpenBookQA (10-Shot) [MCKS18] | 83.2 | 88.4 | 87.2 | 73.6 | 79.8 | 78.6 | 81.6 | 85.8 | 86.0
BoolQ (0-Shot) [CLC+19] | 77.2 | 82.9 | 86.6 | - | 72.2 | 66.0 | 78.3 | 77.6 | 79.1
CommonSenseQA (10-Shot) [THLB19] | 80.2 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 73.6 | 78.1 | 79.6
TruthfulQA (10-Shot) [LHE22] | 65.0 | 68.7 | 75.7 | - | 52.1 | 53.0 | 62.0 | 60.1 | 85.8
HumanEval (0-Shot) [CTJ+21] | 59.1 | 59.1 | 55.5 | 47.0 | 28.0 | 34.1 | 60.4 | 37.8 | 62.2
MBPP (3-Shot) [AON+21] | 70.0 | 71.4 | 74.5 | 60.6 | 50.8 | 51.5 | 65.3 | 60.2 | 77.8
Average | 71.2 | 74.9 | 78.2 | - | 61.0 | 62.0 | 68.0 | 69.9 | 75.3
GPQA (2-Shot; CoT) [RHS+23] | 32.8 | 34.3 | - | - | - | - | - | - | 29.0
MT Bench (2 round ave.) [ZCS+23] | 8.38 | 8.70 | 8.91 | - | - | - | - | - | 8.35

4 Safety

Phi-3-mini was developed in accordance with Microsoft's responsible AI principles. The overall approach consisted of safety alignment in post-training, red-teaming, and automated testing and evaluations across dozens of RAI harm categories.

Helpfulness and harmlessness preference datasets [BJN+22, JLD+23], with modifications inspired by [BSA+24], and multiple in-house generated datasets were leveraged to address the RAI harm categories in safety post-training. An independent red team at Microsoft iteratively examined phi-3-mini to further identify areas of improvement during the post-training process. Based on their feedback, we curated additional datasets tailored to address their insights, thereby refining the post-training dataset. This process resulted in a significant decrease of harmful response rates, as shown in Figure 3.

Figure 3: Comparison of harmful response percentages by the Microsoft AI Red Team between phi-3-mini before and after the safety alignment. Note that the harmful response percentages in this chart are inflated numbers, as the red team tried to induce phi-3-mini in an adversarial way to generate harmful responses through multi-turn conversations.

Metric | Phi-3-Mini-4k 3.8b | Phi-3-Mini-128k 3.8b | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b
Ungroundedness | 0.603 | 0.637 | 1.481 | 0.935 | 0.679 | 0.328
Intellectual Property (DR-1) | 23.95% | 21.50% | 24.00% | 56.20% | 38.33% | 37.30%
Harmful Content Continuation (DR-3) | 0.75% | 1.08% | 2.93% | 2.58% | 1.28% | 1.30%
Harmful Content Summarization (DR-3) | 10.00% | 10.20% | 14.35% | 22.33% | 10.33% | 8.20%
Jailbreak (DR-1) | 12.29% | 12.57% | 15.00% | 15.57% | 11.43% | 13.00%

Table 1: Comparison of Microsoft internal multi-turn conversation RAI benchmark results of phi-3-mini and other models. Note that a lower value indicates better performance for all metrics in the table.

Table 1 shows the results of in-house RAI benchmarks for phi-3-mini-4k and phi-3-mini-128k compared to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Gemma 7b [TMH+24], and Llama-3-instruct-8b [AI23]. This benchmark utilized GPT-4 to simulate multi-turn conversations in five different categories and to evaluate the model responses. Ungroundedness, between 0 (fully grounded) and 4 (not grounded), measures whether the information in a response is based on the given prompt. In the other categories, responses were evaluated in terms of the severity of harmfulness, from 0 (no harm) to 7 (extreme harm), and the defect rates (DR-x) were computed as the percentage of samples with a severity score greater than or equal to x.
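As a minimal illustration of this defect-rate definition (the severity scores themselves come from the GPT-4 grading described above; the numbers below are made up):

    # Defect rate DR-x: fraction of responses with harmfulness severity (0-7) >= x.
    def defect_rate(severities, x):
        return sum(s >= x for s in severities) / len(severities)

    severities = [0, 0, 1, 3, 0, 5, 2, 0]   # hypothetical per-sample scores
    print(f"DR-1 = {defect_rate(severities, 1):.1%}, DR-3 = {defect_rate(severities, 3):.1%}")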

Figure 4: Left: phi-3-mini's completion without search. Right: phi-3-mini's completion with search, using the default HuggingFace Chat-UI search ability.

5 Weakness

In terms of LLM capabilities, while the phi-3-mini model achieves a similar level of language understanding and reasoning ability as much larger models, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much "factual knowledge", which can be seen for example with its low performance on TriviaQA. However, we believe such a weakness can be resolved by augmentation with a search engine. We show an example using the HuggingFace default Chat-UI with phi-3-mini in Figure 4.
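A rough sketch of this kind of search augmentation (not the Chat-UI implementation; the search() and generate() helpers are hypothetical):

    # Minimal search-augmentation sketch (hypothetical search() and generate() helpers).
    def answer_with_search(question, search, generate, k=3):
        snippets = search(question, top_k=k)            # e.g., a web search API
        context = "\n".join(f"- {s}" for s in snippets)
        prompt = (
            "Use the following search results to answer the question.\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt, temperature=0.0)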

Another weakness related to the model's capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data.

Despite our diligent RAI efforts, as with most LLMs, there remain challenges around factual inaccuracies (or hallucinations), reproduction or amplification of biases, inappropriate content generation, and safety issues. The use of carefully curated training data, targeted post-training, and improvements from red-teaming insights significantly mitigates these issues across all dimensions. However, there is significant work ahead to fully address these challenges.

References

[AI23] Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date, 2023.

[AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[BJN+22] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.

[BSA+24] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions, 2024.

[BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641, 2019.

[CCE+18] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.

[CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924-2936, 2019.

[CTJ+21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.

[DZZ+24] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens, 2024.

[GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.

[HBK+21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021.

[HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Eliza Rutherford, Trevor Cai, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[JBA+23] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, and Yi Zhang. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.

[JCWZ17] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017.

[JLD+23] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset, 2023.

[JPO+20] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020.

[JSM+23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.

[JSR+24] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.

[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[LBE+23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.

[LHE22] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022.

[MCKS18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018.

[MRB+23] Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264, 2023.

[NWD+20] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding, 2020.

[RHS+23] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.

[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.

[SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

[SSS+22] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022.

[THLB19] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019.

[TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[TMH+24] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology, 2024.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

[ZCG+23] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.

[ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.

[ZHB+19] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791-4800, 2019.

A Example prompt for benchmarks

Question: Solve for x: (1/3)(4 - 3x) = 1/2
Options:
A. 5/6
B. 7/6
C. 5/3
D. 1/6
Answer: A

Question: Which of the following is the body cavity that contains the pituitary gland?
Options:
A. Abdominal
B. Cranial
C. Pleural
D. Spinal
Answer: B

Question: Where was the most famous site of the mystery cults in Greece?
Options:
A. Ephesus
B. Corinth
C. Athens
D. Eleusis
Answer:

B Authors

Marah Abdin, Russell J. Hewett, Olatunji Ruwase, Sam Ade Jacobs, Jamie Huynh, Olli Saarikivi, Ammar Ahmad Awan, Mojan Javaheripi, Amin Saied, Jyoti Aneja, Xin Jin, Adil Salim, Ahmed Awadallah, Piero Kauffmann, Michael Santacroce, Hany Awadalla, Nikos Karampatziakis, Shital Shah, Nguyen Bach, Dongwoo Kim, Ning Shang, Amit Bahree, Mahmoud Khademi, Hiteshi Sharma, Arash Bakhtiari, Lev Kurilenko, Xia Song, Harkirat Behl, James R. Lee, Masahiro Tanaka, Alon Benhaim, Yin Tat Lee, Xin Wang, Misha Bilenko, Yuanzhi Li, Rachel Ward, Johan Bjorck, Chen Liang, Guanhua Wang, Sébastien Bubeck, Weishung Liu, Philipp Witte, Martin Cai, Eric Lin, Michael Wyatt, Caio César Teodoro Mendes, Zeqi Lin, Jiahang Xu, Weizhu Chen, Piyush Madan, Can Xu, Vishrav Chaudhary, Arindam Mitra, Sonali Yadav, Parul Chopra, Hardik Modi, Fan Yang, Allie Del Giorno, Brandon Norick, Ziyi Yang, Gustavo de Rosa, Anh Nguyen, Donghan Yu, Matthew Dixon, Barun Patra, Chengruidong Zhang, Ronen Eldan, Daniel Perez-Becker, Cyril Zhang, Dan Iter, Heyang Qin, Jianwen Zhang, Amit Garg, Thomas Portet, Li Lyna Zhang, Abhishek Goswami, Reid Pryzant, Yi Zhang, Suriya Gunasekar, Sambuddha Roy, Yue Zhang, Emman Haider, Marko Radmilac, Yunan Zhang, Junheng Hao, Corby Rosset, Xiren Zhou
