Scale Zeitgeist: AI Readiness Report 2024

Contents: Introduction, AI Year in Review, Apply AI, Build AI, Evaluate AI, Conclusion, Methodology

Introduction

The hype for generative AI has reached its peak. Developers continue to push the limits, exploring new frontiers with increasingly sophisticated models. At the same time, without a standardized blueprint, enterprises and governments are grappling with the risks versus rewards that come with adopting AI. That's why in our third edition of Scale Zeitgeist: AI Readiness Report, we focused on what it takes to transition from merely adopting AI to actively optimizing and evaluating it.

To understand the state of AI development and adoption today, we surveyed more than 1,800 ML practitioners and leaders directly involved in building or applying AI solutions, and interviewed dozens more. In other words, we removed responses from business leaders or executives who are not equipped to know or understand the challenges of AI adoption first-hand.

Our findings show that of the 60% of respondents who have not yet adopted AI, security concerns and lack of expertise were the top two reasons holding them back. This finding seems to validate the "AI safety" narrative that dominates today's news. Among survey respondents who have adopted AI, many feel they lack the appropriate benchmarks to effectively evaluate models. Specifically, 48% of respondents referenced lacking security benchmarks, and 50% desired industry-specific benchmarks. Additionally, while 79% of respondents cited improving operational efficiency as the key reason for adopting AI, only half are measuring the business impact of their AI initiatives. And while performance and reliability (each at 69%) were indicated as the top reasons for evaluating models, safety ranked lower (55%), running counter to popular narratives.

This report presents expert insights from Scale and its partners across the ecosystem, including frontier AI companies, enterprises, and governments. Whether you are developing your own models (building AI), leveraging existing foundation models (applying AI), or testing models (evaluating AI), there are actionable insights and best practices for everyone.

"The rapid evolution of AI offers both immense opportunities and challenges. Embracing it responsibly, with robust infrastructure and rigorous evaluation protocols, unlocks the potential of AI while safeguarding against the risks, known and unknown."
Alexandr Wang, Founder & CEO, Scale

Year in Review

Advancements in generative AI continued to accelerate in 2023. After the release of OpenAI's ChatGPT in November 2022, the platform reached an estimated 100 million users in just two months. In March 2023, OpenAI released GPT-4, a large language multimodal model that demonstrated human-level performance across industry benchmarks.

Other model builders joined the launch party last year. Google launched Bard, initially running on the LaMDA model and replaced shortly after by PaLM 2 (with improved domain-specific knowledge, including coding and math). Anthropic introduced Claude 2 in the summer with a 100K context window. A week later, Meta unveiled Llama 2 and Code Llama, and included model weights and code for the pretrained model. Google DeepMind closed out 2023 with the release of Gemini, representing a significant improvement in performance as the first model to outperform human experts on the Massive Multitask Language Understanding (MMLU) test. Newer open-source model families like Falcon, Mixtral, and DBRX demonstrated the possibility of local inference while innovating on model architecture to use far less compute. This year, in March 2024, Anthropic launched the family of Claude 3 models, doubling the context window. Just a few days later, Cohere released their Command R generative model, designed for scalability and long-context tasks.

Frontier research underlies many of these model advancements. Some significant advancements include:

1. OpenAI achieved improvements in mathematical reasoning through rewarding chain-of-thought reasoning. Scale contributed to the creation of PRM800K, the full process supervision dataset released as part of this paper.
2. Anthropic uncovered an approach for better model interpretability through analysis of feature activation compared to individual neurons.
3. The Microsoft Research team discovered that a model with a smaller number of parameters relative to state-of-the-art models can demonstrate impressive performance on task-specific benchmarks when fine-tuned with high-quality textbook data.

Generative AI continues to reshape our world

Organizations applying AI are seeking to extract additional value by optimizing AI through prompt engineering, fine-tuning models, and retrieval-augmented generation (RAG). Despite the desire to optimize foundation models, 65% of organizations use models out-of-the-box, 43% of organizations fine-tune models, and 38% use RAG. Fine-tuning can customize models for specific tasks or datasets, significantly enhancing their performance and accuracy on targeted applications. RAG further enhances this by dynamically incorporating external information during the generation process, enabling the model to produce more relevant and contextually rich outputs.
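
To make the RAG pattern concrete, here is a minimal sketch, assuming an in-memory document list, a toy keyword retriever, and a stubbed generate() call standing in for any hosted or self-hosted LLM; none of these names come from the report.

```python
# Minimal RAG sketch. retrieve() uses toy keyword overlap; a production
# system would use an embedding model and vector search instead.
documents = [
    "Our premium plan includes 24/7 support and a 99.9% uptime SLA.",
    "Refunds are processed within 14 business days of approval.",
]

def generate(prompt: str) -> str:
    # Stub standing in for a call to any LLM API.
    return f"[LLM completion grounded in prompt: {prompt[:60]}...]"

def retrieve(query: str, k: int = 1) -> list[str]:
    words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(words & set(d.lower().split())),
                  reverse=True)[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(answer("How long do refunds take?"))
```

The design point is the one the paragraph makes: the model's reply is grounded in retrieved text supplied at generation time rather than in its parametric memory alone.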

Key findings, 2023 to 2024

[Figure: 2023 vs. 2024 comparison of organizations reporting that generative AI forced the creation of an AI strategy; planning to increase investment in commercial and closed-source models over the next three years; considering AI very or highly critical to their business in the next three years; with no plans to work with generative AI; and with generative AI models in production.]

[Figure: "Do you customize generative AI models or use them out of the box?"]

To illustrate the evolving landscape, we see the following changes as important trends in AI over the past year.

Model preferences continue to evolve and remain a key decision for an organization's AI strategy. The largest increase in usage came from closed-source models, with 86% of organizations using these models compared to 37% the year prior. This is likely due to a combination of factors. Many organizations have existing contracts with cloud service providers, who in turn have partnerships with closed-source model developers, making usage of closed-source models easier. Many closed-source models also outperform open-source models out-of-the-box. Despite that, open-source model usage still increased from 41% to 66%. This is likely due to the flexibility open-source models provide for fine-tuning and hosting. The smallest change in model preferences came from organizations that trained their own models, at 24% in 2024.

Similar to last year, 61% of organizations stated improved operational efficiency as the leading driver behind adopting generative AI. Improved customer experience came in second at 55%.

Despite growing adoption, there are still a number of challenges that stall widespread use of generative AI. 61% of respondents cited infrastructure, tooling, or out-of-the-box solutions not meeting their specific needs. Processes like RAG and fine-tuning introduce the complexity of integrating external data sources in real time, ensuring the relevance and accuracy of retrieved information, managing additional computational costs, and addressing potential biases or errors. Fine-tuning requires careful selection of data to avoid overfitting and to ensure models remain generalizable to new, unseen information.

Proprietary data is a key ingredient to power performance enhancements for generative AI models. While Scale's machine learning team proved how fine-tuning can enhance model capabilities, 41% of organizations lack the ML expertise to execute the data transformations and to measure and evaluate results to justify the initial investment.

[Figures: "What positive outcomes have you seen from generative AI adoption?" and "How do you work with generative AI models?"]

What to Expect in 2024

Increasingly Capable Foundation Models

In the coming year, we expect notable advancements in generative AI foundation models to continue. Models like Claude 3 have demonstrated improved performance on various benchmarks, such as scoring 86.8% on the MMLU dataset and 95.0% on the GSM8K math problem set, indicating enhanced capabilities in reasoning and problem-solving. We also expect to see the emergence of more sophisticated multimodal models that can seamlessly integrate and generate content across various modalities, including text, images, audio, and video as both inputs and outputs. As researchers continue to refine these models, we can also anticipate improvements in accuracy and reduced latency, making models more reliable and efficient. The size of these foundation models is also likely to grow, allowing them to capture and leverage even more knowledge and nuance from the vast amounts of data they are trained on.

Expert Insight Will Power Performance Improvements

Human experts will play an increasingly crucial role in model advancements and evaluation. As models start to exhaust the corpus of general information widely available on the internet, models will require additional data to improve their capabilities. While some organizations may look to replace human-generated data with synthetic data for training, models reliant on synthetic data can be susceptible to model collapse. A hybrid human and synthetic data approach can mitigate biases from synthetic data and still reflect nuanced human preferences. The domain-specific knowledge of experts allows them to provide data that captures the nuance, complexity, and diversity needed to supplement model training. Experts are also critical for testing and evaluation alongside reinforcement learning from human feedback, with the knowledge to identify subtle errors, inconsistencies, or biases in order to provide reliable guidance toward preferred model outputs.

While experts are necessary to improve model capabilities, we anticipate organizations defining new roles that are centered around generative AI. Prompt engineers, machine learning researchers, and generative AI experts will collaborate with subject matter experts to ensure AI initiatives are successful. Generative AI will fundamentally change the nature of work.

Evolving Proof-of-Concepts to Scaling Production Deployments

Improvements in model performance and capabilities will motivate leaders to quickly iterate from proof-of-concepts to pilots to production deployments. More user-friendly RAG and fine-tuning solutions will emerge as on-ramps to improve adoption so that organizations can more easily customize models. As start-up costs taper, model effectiveness improves, and more robust evaluation strategies emerge, organizations will be able to more clearly capture and define return on investment.

Increasing Emphasis on Test & Evaluation Practices

Nearly every major model release usurps a different leading model on various benchmarks. Enterprises will want to create their own evaluation methodology consisting of industry benchmarks, automated model metrics, and measures for return on investment to continuously evaluate their preferred model. As model capabilities grow, model builders will place more importance on guardrails, steerability, safety, security, and transparency. Public sector institutions now must consider the White House's OMB Policy and test and evaluate AI systems to ensure that AI is safe.

[Figure: Evolution of generative AI capabilities: domain and functional capabilities are rapidly growing. Examples by domain: Math (multivariate calculus, applying the gradient theorem), Creative Writing (metaphorical stories, lyrical sonnets), Science (biology, genetic expression), Coding (debugging, code optimization).]

Apply AI

Adoption Trends

In a world where innovation moves at the speed of thought, generative AI has emerged as a transformative force. Enterprises and governments are deploying resources, capital, and teams to not just embed models into business processes, but also transform the paradigm of industry operations. This section highlights trends in enterprise AI, including stages of adoption, model preferences, and investment themes for model categories. We'll also dig into leading enterprise AI use cases and the challenges behind AI adoption, and uncover the barriers that prevent organizations from using AI.

[Figure: "Which of the following describes how your company works with generative AI models?" 25% plan on working with generative AI models; 33% experimented with generative AI models; 38% generative AI models in production; 4% no plans to work with generative AI models.]

The Evolution of AI Adoption

22% of organizations have one model in production, with 27% of total respondents reporting multiple models in production. Deploying multiple generative AI models in production allows organizations to leverage specialized capabilities, avoid vendor lock-in, and scale multiple use cases. By comparing performance across models and maintaining flexibility, businesses can adapt to evolving requirements while mitigating risks associated with relying on a single model. The growing number of models in production reflects the progression of proof-of-concepts to production deployments.

49% of organizations are still either evaluating use cases or developing the first model or application. Many organizations are increasingly dedicating time to evaluating use cases to ensure alignment with business objectives. Thorough use case evaluation allows companies to identify applications with high ROI potential, assess feasibility and risk, and prioritize implementation efforts.

Application and model development follows use case selection. Deploying generative AI in an enterprise setting involves a multi-step process, including data preparation and pre-processing, model selection and architecture design, hyperparameter tuning and training, API development for integration, monitoring feedback, and test and evaluation.

Technical organizations are ahead of the curve with generative AI adoption. Software and internet companies are leading the pack, with 48% of organizations reporting generative AI models in production. Conversely, only 24% of government and defense entities have generative AI models in production.

[Figure: "What is the current stage of your AI/ML project?" No model deployed to production: 25% evaluating use cases, 26% developing the first model/application. One or more models deployed: 22% one model/application deployed to production, 27% multiple models/applications deployed to production.]

[Figure: "Which generative AI models do you work with?"]

Model Preferences

Our respondents indicate that their preferred model is the latest version of OpenAI GPT-4, with 58% of enterprises using the latest version and 44% of enterprises using GPT-3.5. Trailing closely behind, 39% of enterprises use Google Gemini. There's a notable drop-off in model selection following these three models, with OpenAI GPT-3 at 26%.

Model selection is critical for generative AI development, as it determines the system's performance, scalability, and alignment with specific task requirements, data characteristics, computational resources, and trade-offs between model complexity and inference speed. Organizations also evaluate model selection through cost trade-offs, comparing investments tied to infrastructure, managed services, and per-token inputs and outputs.
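
As a toy illustration of that trade-off, the back-of-the-envelope sketch below compares hypothetical per-token API pricing against an always-on self-hosted cluster; every price and volume in it is an assumed placeholder, not a vendor quote.

```python
# Toy cost comparison: per-token API pricing vs. self-hosted GPU inference.
monthly_tokens_in = 50_000_000       # prompt tokens per month (assumed)
monthly_tokens_out = 10_000_000      # completion tokens per month (assumed)

api_price_in = 10 / 1_000_000        # $ per input token (hypothetical)
api_price_out = 30 / 1_000_000       # $ per output token (hypothetical)
api_cost = (monthly_tokens_in * api_price_in
            + monthly_tokens_out * api_price_out)

gpu_hourly = 4.0                     # $ per GPU-hour (hypothetical)
gpus, hours = 2, 24 * 30             # small always-on inference cluster
hosting_cost = gpu_hourly * gpus * hours

print(f"API: ${api_cost:,.0f}/mo vs self-hosted: ${hosting_cost:,.0f}/mo")
```

At low volumes the per-token API tends to win; as token volume grows, the fixed cost of dedicated infrastructure amortizes, which is one reason usage patterns and contracts drive model selection.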

OpenAI is overwhelmingly the preferred model vendor. Virality and the ongoing rollout of advanced features positioned OpenAI as the preferred model vendor even as other models demonstrate comparable performance. Note: at the time of the survey, Claude 3, Grok, and DALL·E 3 were not released and thus not included in the survey.

[Figures: "How does your company plan on investing in generative AI over the next 3 years?" and "In which ways has your company implemented AI?"]

Model Investment

Just as the leading preferred models are closed-source commercial models, planned investments in these categories of models reflect usage trends. 72% of organizations plan to increase investments in commercial closed-source models. A lower percentage of organizations plan to invest in open-source models, at 67%. While open-source models provide organizations with greater control, many leading commercial closed-source models are closely tied to leading cloud service providers. Enterprises can draw down from cloud spend commitments through use of partner models (e.g., Amazon and Anthropic, Microsoft and OpenAI).

Last year, organizations referenced the ability to develop new products or services as the leading reason to adopt generative AI. This year, improved operational efficiency is the key driver behind adopting generative AI. Generative AI use cases reflect this shift in priorities. The leading use cases for generative AI adoption are computer programming and content generation.

Deploying and Customizing AI Use Cases

Coding copilots are becoming mainstream, with technical users being early adopters of solutions like GitHub Copilot, Code Llama, and Devin. Model vendors have responded to demand for content generation with prompt templates that guide users to effective content creation questions for functions including Marketing, Product Management, and Public Relations.

Organizations can optimize generative AI models for specific use cases through the following techniques:

- Prompt engineering: guiding the model's output through carefully crafted input prompts
- Fine-tuning: training the model on domain-specific data
- Retrieval-Augmented Generation (RAG): enhancing the model's knowledge by integrating information from external sources during the generation process

Teams are likely to maximize their AI investments by adopting these techniques.
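
As an illustration of the second technique, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers library; the gpt2 checkpoint and the two-line corpus are placeholders for exposition, not recommendations from the report.

```python
# Minimal supervised fine-tuning sketch: continue training a causal LM on a
# tiny domain-specific corpus. Real projects need a proper dataset, an eval
# split, and careful hyperparameters to avoid the overfitting noted earlier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

corpus = [  # hypothetical domain data
    "Q: What is our refund window? A: 30 days from delivery.",
    "Q: Which plan includes SSO? A: The enterprise tier.",
]
batch = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few passes over the toy corpus
    # For causal-LM fine-tuning, the labels are the input ids themselves.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```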

For organizations that already fine-tune their own models, 39% saw improved performance on domain-specific tasks compared to out-of-the-box models.

"With fine-tuning, there's always the issue of data that we fine-tune on, and compute. We can address hallucination and bias with better data. Frequency of fine-tuning helps, but it's an expensive procedure; most of the work that happens is on the data side. We're always on the search for more volume of data and better annotations."
Mohammed Minhaas, Data Engineer

Barriers to AI Adoption and Implementation

Despite rapid advancements in the field, organizations still face challenges with AI implementation. 61% of organizations specified that infrastructure, tooling, or out-of-the-box solutions don't meet their needs. Insufficient tooling for tasks such as data preparation, model training, and deployment, combined with the lack of standardized frameworks for integrating generative AI into existing systems, can hinder the scalability and efficiency of AI implementations, leading to increased complexity and higher costs.

54% of organizations struggle with insufficient budget. Finding a home on the balance sheet for new generative AI projects limits the pace of adoption. 52% also have concerns about data privacy. Fine-tuning can use vast amounts of potentially sensitive training data. The risk of data breaches, unauthorized access, or misuse of personal information during the data collection, storage, and processing stages can expose organizations to legal liabilities and reputational damage, particularly in industries with stringent data protection regulations. For example, certain health and human service providers must ensure their AI models abide by federal non-discrimination and privacy laws.

Beyond these obstacles, organizations grapple with how to effectively employ fine-tuning and RAG techniques. 32% of organizations pinpoint evaluating performance as the top obstacle to fine-tuning, with 31% of respondents citing data transformation as an issue. Similarly, the leading challenge for employing RAG is evaluating performance, as stated by 28% of respondents.

Some organizations are still not adopting AI. Most organizations that have not adopted AI cite data and security concerns (28%) or a lack of expertise (26%). Other initiatives taking priority was the leading reason software and internet companies did not adopt AI.

Overcoming privacy and security concerns will require organizations to implement test and evaluation protocols to ensure that models are safe to use. 79% of organizations assert that they already test or evaluate (T&E) models. Participants responded that the leading reasons for T&E were to measure performance and reliability (both at 67%), followed by measuring business impact (59% of respondents). While 42% of respondents used benchmarks to evaluate model performance, there are still shortcomings in existing benchmarks. Specifically, 48% of respondents referenced lacking security benchmarks, and 50% of respondents cited missing industry-specific benchmarks. Instead, to address the safety and reliability of generative models, 56% of organizations follow leaderboards for public benchmarks as industry standards.

The adoption of generative AI in enterprise organizations presents both opportunities and challenges. Careful model selection, fine-tuning, prompt engineering, and data augmentation techniques are essential for optimizing performance and tailoring models to specific use cases. However, enterprises must also navigate complex challenges related to infrastructure, tooling, data security, privacy, and safety. Addressing these concerns requires significant investment in data management, model transparency, and governance frameworks to ensure responsible and effective deployment.

[Figures: "What are the top challenges in implementing AI technologies at your company?" and "If you have not yet adopted AI, why have you not adopted it?"]

"RAG aims to address a key challenge with LLMs: while they are very creative, they lack factual understanding of the world and struggle to explain their reasoning. RAG tackles this by connecting LLMs to known data sources, like a bank's general ledger, using vector search on a database. This augments the LLM prompts with relevant facts. However, implementing RAG presents its own challenges. It requires creating and maintaining the external data connection, setting up a fast vector database, and designing vector representations of the data for efficient search. Companies need to consider if they require a purpose-built database optimized for vector search. Keeping this vectorized representation of truth up-to-date is tricky. As the underlying data sources change over time and users ask new questions, the vector database needs to evolve as well. Deciding if and how to incorporate user assumptions into the vector representations is a philosophical question that also has practical implications for implementation. The industry is still grappling with how to design RAG systems that can continually improve over time."
Jon Barker, Customer Engineer, Google
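
A toy sketch of the maintenance loop the quote describes: re-embedding and upserting a document whenever its source changes so retrieval stays in sync. The embed() stand-in and in-memory index are hypothetical, not any particular vector database.

```python
# Toy vector-index maintenance: when a source document changes, re-embed and
# upsert it so retrieval reflects the latest truth.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding (not semantically meaningful); a real system would
    # call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

index: dict[str, np.ndarray] = {}  # doc_id -> embedding vector

def upsert(doc_id: str, text: str) -> None:
    index[doc_id] = embed(text)  # overwrite stale vectors on every change

def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # On unit-normalized vectors, cosine similarity is just the dot product.
    return sorted(index, key=lambda d: float(index[d] @ q), reverse=True)[:k]

upsert("ledger-2024", "FY2024 general ledger totals ...")
upsert("ledger-2024", "FY2024 general ledger totals, restated in Q3 ...")  # refresh
print(search("FY2024 ledger"))
```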

Build AI

Pushing the Boundaries: AI's Rapid Advancement Across Domains

As highlighted in the Year in Review section of this report, we've seen a significant leap in model capabilities in the past year. The latest models have revolutionized programming, writing clean, efficient code from natural language prompts with an almost human-like understanding of intent. But the advancements don't stop there. We're not far away from a world where AI agents effortlessly communicate across language barriers, solve complex mathematical equations, explain scientific concepts, and even make new discoveries. Moreover, AI is rapidly advancing in its ability to perceive and generate content across multiple modalities, including text, images, audio, and video.

The race between leaders like OpenAI, Anthropic, Google, Meta, and others is driving the rapid advancement of foundation models. Each lab is pushing the boundaries of what's possible, releasing new models that leapfrog the capabilities of predecessors. However, the pace of releases is not constant. The survey data reveals that it typically takes companies three to six months to develop a model and deploy it to production. For the top labs, major releases are often spaced six to nine months apart, waiting until achieving a significant step-change in performance before unveiling a new model. We expect this six-to-nine-month release cadence to continue over the coming year. However, the pace could decelerate as organizations encounter data limitations and struggle to achieve meaningful improvements over current models' performance.

The following sections will explore the key pillars needed to build effective models, including model architecture innovations, computational resource trends, and the high-quality data imperative. We'll also discuss future investments and priorities in the AI landscape, providing insights into the advancements shaping the future of AI.

The key pillars of effective AI models

Developing industry-leading AI requires a combination of:

- Thoughtful model architectures
- Vast computational resources
- Carefully curated datasets

[Figure: Timeline of model releases, 2019-2024, across Meta, OpenAI, Google, and Anthropic (BERT and GPT-2 through Claude 3 and Gemini).]

Model Architecture

New neural network designs and techniques are enabling the development of larger, more capable models that can tackle increasingly complex tasks. One promising new approach is the use of sparse expert models, which allows for efficient training of massive networks by activating only relevant subsets of neurons for each input. This enables models to specialize in different domains while still maintaining the ability to generalize across tasks. Recent open-source models like Falcon, Mixtral, and DBRX demonstrate the potential of these architectures, scoring high on performance benchmarks with significantly fewer parameters and computational resources when compared to traditional models. Similarly, xAI's Grok model showcases the power of sparse expert models in natural language processing, excelling across a wide range of language tasks while maintaining high efficiency.
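
To make the sparse-expert idea concrete, here is a minimal mixture-of-experts layer in PyTorch, in which a router activates only the top-k experts per token; the sizes, k, and expert design are illustrative and do not reproduce any named model's architecture.

```python
# Minimal sparse mixture-of-experts layer: the router scores all experts but
# only the top-k run per token, so most parameters stay inactive per input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # per-token routing
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):            # only k experts fire per token
            for e in idx[:, slot].unique():
                rows = idx[:, slot] == e      # tokens routed to expert e
                out[rows] += weights[rows, slot, None] * self.experts[int(e)](x[rows])
        return out

print(SparseMoE()(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```

With k experts active out of n, only roughly k/n of the expert parameters do work for any given token, which is the efficiency property the paragraph describes.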

[Figure: Key challenges in training and developing advanced AI models.]

Computational Resources Trends

Demand for compute continues to grow, with model training requiring huge clusters of specialized accelerators like GPUs and TPUs. The industry is undergoing a significant shift away from traditional CPUs toward these accelerator architectures optimized for AI workloads. This transition brings significant challenges in terms of infrastructure, tooling, and resource management. The survey highlights the magnitude of this shift, with over 48% of respondents rating compute resource management as "most challenging" or "very challenging."

"CPUs consume about 80% of IT workloads today. GPUs consume about 20%. That's going to flip in the short term, meaning 3 to 5 years. Many industry leaders that I've talked to at Google and elsewhere believe that in 3 to 5 years, 80% of IT workloads will be running on some type of architecture that is not CPU, but rather some type of chip architecture like a GPU."
Jon Barker, Customer Engineer, Google

This rapid transition toward more costly GPU- and TPU-centric workloads presents a number of challenges. While these accelerators offer unparalleled performance for AI tasks, they also require a different programming model, tooling ecosystem, and set of optimization techniques compared to traditional CPU-based workloads. Further, large models are usually trained across many accelerators distributed across many machines in parallel, requiring complex orchestration frameworks. To address these challenges, PyTorch introduced Fully Sharded Data Parallel (FSDP), a data parallelism paradigm that shards model parameters, gradients, and optimizer states across data-parallel workers, enabling more efficient memory usage and training of larger models.
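
A minimal sketch of how FSDP wraps a model, assuming a multi-GPU job launched with torchrun; the stand-in transformer and training step are illustrative only.

```python
# FSDP sketch: wrapping the model shards parameters, gradients, and optimizer
# state across data-parallel workers. Assumes one process per GPU, launched
# with torchrun; the model and loss below are stand-ins.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

model = torch.nn.Transformer(d_model=512, num_encoder_layers=6).cuda()
model = FSDP(model)  # parameters are now sharded across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
src = torch.rand(10, 32, 512, device="cuda")  # placeholder batch
tgt = torch.rand(10, 32, 512, device="cuda")
loss = model(src, tgt).sum()  # stand-in objective, not a real training loss
loss.backward()               # gradients are reduced and re-sharded by FSDP
optimizer.step()
```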

In addition to the challenge of compute resource management, model builders also face obstacles due to a lack of suitable tools and frameworks. 38% of respondents indicated that the absence of AI-specific libraries, frameworks, and platforms is a major challenge holding back their AI projects. These tools are crucial for abstracting away the complexities of distributed computing and accelerator programming, allowing researchers to focus on model development and experimentation.

Unlocking AI Potential: Domain-Specific, Human-Generated Datasets

Data is the fuel that powers AI models, and the quality, quantity, and diversity of that data is critical to building effective, unbiased systems. The survey results highlight the importance of high-quality datasets, with labeling quality as the top challenge in preparing data for training models. Obtaining extremely high-quality labels while minimizing the time required to get that labeled data is a significant hurdle for model builders. This highlights the need for efficient data labeling processes and tools that can maintain high standards while expediting the labeling process.

Large, web-scraped datasets have been instrumental in pre-training foundation models. The next leap in capabilities will require more targeted, domain-specific data that captures the nuances and edge cases that only human experts can provide. The advent of generative AI and large language models (LLMs) has fundamentally changed what it means to create high-quality training and evaluation data. For open-ended use cases, such as question answering, coding, and agentic use cases, advancements in AI capabilities will be bottlenecked by the supervision we can feed into these models.

"Even if you train long enough with enough GPUs, you'll get similar results with any modern model. It's not about the model, it's about the data that it was trained with. The difference between performance is the volume and quality of data, especially human feedback data. You absolutely need it. That will determine your success."
Ashiqur Rahman, Machine Learning Researcher, Kimberly-Clark

Human-labeled data plays a critical role in aligning models with user preferences and real-world requirements. Techniques like reinforcement learning from human feedback (RLHF) can help guide models toward desired behaviors and outputs, but they require a steady stream of high-quality, human-generated labels and rankings.
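
To show what those human rankings feed into, here is a toy sketch of the pairwise (Bradley-Terry style) loss commonly used to train RLHF reward models; the scores below are made-up stand-ins for a reward model's outputs on chosen versus rejected responses.

```python
# Pairwise preference loss sketch: push the reward model to score the
# human-preferred response above the rejected one for the same prompt.
import torch
import torch.nn.functional as F

chosen = torch.tensor([1.3, 0.2, 2.1])     # r(prompt, preferred response)
rejected = torch.tensor([0.4, -0.5, 1.9])  # r(prompt, dispreferred response)

# Maximize log sigmoid(r_chosen - r_rejected): the log-probability that the
# chosen response wins under a Bradley-Terry preference model.
loss = -F.logsigmoid(chosen - rejected).mean()
print(float(loss))  # lower loss = larger margins in the preferred direction
```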

Future Investments & Priorities

69% of respondents rely on unstructured data like text, images, audio, and video to train their models. However, data quality emerges as the top challenge in acquiring training data, ranked as the largest obstacle by 35% of respondents. To address this, 55% of organizations are leveraging internal labeling teams, while 50% engage specialized data labeling services and 29% leverage crowdsourcing. Organizations are scaling their annotation efforts with managed labeling services, with 40% of users receiving high-quality labeled data within one week to one month. Managed labeling services allow companies to scale up labeling operations, reduce overhead, and access expert annotators on demand. Managed labeling services also handle project management, quality assurance, and annotator recruiting, and increasingly offer specialized expertise in areas like coding, mathematics, and languages.

[Figures: Common approaches for data annotation; Top challenges in preparing high-quality training data for AI models.]

The demand for specific types of Scale's Data Streams provides insights into the priorities and use cases driving AI development. Among the most sought-after Data Streams are:

1. Coding, Reasoning, and Precise Instruction Following
2. Languages
3. Multimodal Data

Going forward, we expect to see increased adoption of human-in-the-loop pipelines that leverage subject matter experts to refine model outputs and provide targeted feedback. This creates a virtuous "data flywheel" effect, where model usage results in new high-quality training data for continuous improvement. Multimodal data collection spanning text, speech, images, and video will also be a key priority as organizations seek to build AI systems that can perceive, reason, and interact more naturally.

One notable new trend is the acquisition of proprietary data from platforms like Reddit, as exemplified by the recent multi-year data partnership between Reddit and Google. This deal, reportedly valued at $60 million per year, emphasizes the value placed on unique, human-generated content for training the next generation of models. However, simply acquiring vast amounts of data is not enough. To truly stay ahead of the curve, organizations must also invest in robust human-in-the-loop (HITL) pipelines that can process and label data across an ever-expanding range of modalities. As AI systems become more sophisticated, they will require not just text, but also speech, images, video, and even more complex data types like 3D scenes and sensor data.

Moreover, the rise of reinforcement learning from human feedback (RLHF) has fundamentally changed how models are evaluated. RLHF requires "on-policy" human supervision, where human raters provide feedback on the actual outputs generated by the model during the training process. Additionally, traditional evaluation methods that rely on fixed sets of labels are no longer sufficient. Instead, organizations must conduct side-by-side comparisons of their old and new model responses across a large number of prompts before each release. This approach captures the nuances and edge cases that emerge as models become more sophisticated and ensures that improvements are aligned with user expectations.
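
A minimal sketch of that side-by-side release check, assuming human raters have already recorded one old-vs-new verdict per prompt; the verdicts and the shipping threshold are illustrative assumptions, not survey findings.

```python
# Side-by-side (A/B) evaluation sketch: compute the new model's win rate over
# the old one from per-prompt human verdicts, ignoring ties.
from collections import Counter

verdicts = ["new", "new", "old", "tie", "new", "old", "new"]  # one per prompt
counts = Counter(verdicts)
decided = counts["new"] + counts["old"]  # ties don't count toward the rate

win_rate = counts["new"] / decided if decided else 0.0
print(f"new model preferred on {win_rate:.0%} of decided comparisons")
if win_rate >= 0.55:  # hypothetical release bar
    print("candidate clears the side-by-side bar for release")
```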

Building scalable labeling programs that address multimodal capabilities is a critical challenge for model builders. It will require a combination of advanced tooling, specialized annotator training, and close collaboration between domain experts and machine learning teams. Managed labeling services with expertise across a wide range of modalities will be increasingly sought after to help organizations navigate this complex landscape. By fusing diverse input modalities and investing in human-in-the-loop pipelines, models can develop richer, more contextual representations that mirror how humans process information and engage with their environments. Organizations that can effectively harness multimodal data and scale their labeling capabilities will be well-positioned to unlock new frontiers in AI.

[Figure: Data Flywheel: human feedback -> data -> model training & output -> human feedback.]

Evaluate AI

Evaluating Model Performance

As foundation models grow in capability and impact, comprehensive model evaluation has become paramount whether you are building or applying models. In contrast to common headlines, assessing foundation models is not just about safety. In fact, performance, reliability, and security were indicated as the top three reasons survey respondents evaluate models, with safety ranking as a lower priority.

[Figure: Evaluation criteria for models in use: 68% reliability, 67% performance, 62% security, 54% safety, 6% N/A.]

Despite this focus on evaluation, developing robust evaluation frameworks is an evolving challenge. Models must be assessed holistically, accounting for performance on real-world use cases as well as potential risks. Traditional academic benchmarks are generally not representative of production scenarios, and models have been overfitted to these existing benchmarks due to their presence in the public domain. Leading organizations are moving towards comprehensive private test suites that probe model behavior across diverse domains and capabilities. Universally agreed-upon third-party benchmarks are crucial for objectively evaluating and comparing the performance of large language models. Researchers, developers, and users can select models based on standardized, transparent metrics.

87% of model builders who apply AI indicated that they evaluate models or applications; 72% of enterprises who apply AI indicated the same.

To understand current evaluation practices, the survey asked respondents how they measure model performance ("Evaluation practices for model performance"). The data shows that automated model metrics and human preference ranking are the fastest ways to identify issues, with over 70% of respondents discovering problems within one week. This highlights the value of quantitative and qualitative evaluation approaches to rapidly surface model performance problems. The prevalence of human evaluations is notable (41%), reflecting the importance of subjective judgments in assessing generative outputs. Techniques like preference ranking, where human raters compare model samples, can capture nuanced quality distinctions.

The survey results suggest that a multi-faceted evaluation strategy is necessary, as no single approach dominates. While automated metrics and business impact assessments are widely used, the data indicates the need to incorporate a variety of quantitative and qualitative techniques to comprehensively evaluate models. When asked why they conduct model evaluations, 69% of respondents selected performance, another 69% selected reliability, and 63% selected security as main objectives.

Stress-testing models is an important defense against failure modes such as hallucination and bias. Techniques like red teaming, where expert testers try to elicit unsafe behaviors, can surface vulnerabilities. Careful prompt engineering can also help assess models' resilience against malicious prompts or out-of-distribution inputs. The results highlight the importance of continuous monitoring, as models can degrade or exhibit new issues over time. Over 40% of respondents evaluate their models following any changes or prior to major releases, highlighting the shift toward continuous evaluation that goes beyond one-time assessments.

While model evaluation plays a crucial role in measuring AI performance, leaders responsible for applying AI in their organizations must also demonstrate tangible business outcomes from their initiatives. Almost half of respondents evaluate models based on their direct impact on KPIs like operational efficiency or customer satisfaction. Grounding evaluations in downstream outcomes ensures that models are not just technically proficient but actually valuable in practice.

"Evaluating generative AI performance is complex due to evolving benchmarks, data drift, model versioning, and the need to coordinate across diverse teams. The key question is how the model performs on specific data and use cases. Centralized oversight of the data flow is essential for effective model evaluation and risk management in order to achieve high acceptance rates from developers and other stakeholders."
Babar Bhatti, AI Customer Success Lead, IBM

Challenges with model evaluation today

Despite progress, many gaps remain in current model evaluation practices. Performance and usability benchmarks are critical to ensure models meet rising user expectations, while vertical-specific standards will be key as AI permeates different sectors. Industry groups like the National Institute of Standards and Technology (NIST) are working to define comprehensive evaluation standards. Scale's Safety, Evaluations, and Analysis Lab (SEAL) is also working to develop robust evaluation frameworks.

The data reveals room for improvement in measuring the business impact of AI models ("Model evaluation challenges: gaps in benchmarking for model builders and enterprises applying AI"). For key outcomes like revenue, profitability, and strategic decision-making, only half of the organizations are assessing business impact. This represents an opportunity for enterprises to more clearly link model performance to tangible business results, ensuring that AI investments are delivering real value.

Evaluating AI Systems in Production

Robust evaluation practices are essential not just during model development, but also when deploying and monitoring AI systems in real-world production environments. The survey highlights how both model builders and enterprises are investing in evaluation capabilities ("Practices for evaluating AI systems in production").

On the "Build" side, organizations recognize the importance of comprehensive evaluations and employ a combination of internal dashboards and external platforms to gain a holistic understanding of model performance. 46% of organizations have internal teams with dedicated test and evaluation platforms, while 64% leverage internal proprietary platforms. Adoption of third-party evaluation consultancies (23%) and platforms (40%) is also prevalent, demonstrating the value of external expertise and tools in the evaluation process.

For enterprises focused on "Applying" AI, the investment patterns are similar but with a blend of internal and external solutions: 42% have internal teams using external evaluation platforms, 49% use proprietary internal platforms, 38% adopt third-party platforms, and 21% engage external consultants.

These results underscore the complexity of validating AI system performance, safety, and alignment with real-world operating conditions and business objectives. Effective evaluation requires a blend of skilled in-house teams, robust tools and frameworks, and external specialist support. Looking ahead, evaluation methodology must evolve in lockstep with AI capabilities. Multidisciplinary research at the intersection of machine learning, software engineering, and social science is needed to define rigorous standards. Scalable infrastructure for human-in-the-loop evaluation pipelines will also be critical. With sustained effort and investment, the industry can build generative models that are not only powerful but truly reliable and beneficial.

"As AI systems become more advanced and influential, it's crucial that we prioritize AI safety. The rapid progress in large language models and generative AI is both awe-inspiring and sobering: while these technologies could help solve some of humanity's greatest challenges, they also pose catastrophic risks if developed without sufficient safeguards. At the Center for AI Safety, our research focuses on the important problem of AI safety: mitigating the various risks posed by AI systems. We also need proactive governance strategies to navigate the high-stakes landscape of powerful AI, including establishing international cooperation, safety standards, and regulatory oversight. While the era of advanced AI presents tremendous potential, we must not underestimate the risks and challenges ahead. It's crucial that the AI community comes together to prioritize safety, so we can chart a course towards a future where AI is a profound positive force for the world."
Dan Hendrycks, Center for AI Safety (CAIS)

Conclusion

Whether you are building or applying AI, model optimization and evaluation is key to unlocking performance and ROI. The pace of innovation for generative AI continues to accelerate. While the 2023 AI Readiness Report focused on how enterprises could adopt AI, this year's report examined challenges and best practices to apply, build, and evaluate AI. The two most significant trends to emerge in our analysis are:

1. The growing need for model evaluation frameworks and private benchmarks
2. The continued challenges of optimizing models for specific use cases without sufficient tooling for data preparation, model training, and deployment

At Scale, our mission is to accelerate the development of AI applications. The Scale Zeitgeist: AI Readiness Report supports that mission. We will continue to shed light on the latest trends, challenges, and what it really takes to build, apply, and evaluate AI.

About Scale

Scale is fueling the generative AI revolution. Built on a foundation of high-quality data and expert insight, Scale powers the world's most advanced models. Our years of deep partnership with every major model builder enable our platform to empower any organization to apply and evaluate AI.

Methodology

This survey was conducted online within the United States by Scale AI from February 20, 2024, to March 29, 2024. We received 2,302 responses from ML practitioners (e.g., ML engineers, data scientists, development operations, etc.) and leaders involved with AI in their companies. Participants who reported no involvement in AI or ML projects were excluded from the dataset, resulting in a final sample size of 1,800 respondents.

Just over a quarter of the respondents (28%) identified themselves as belonging to the Software and Internet/Telecommunications industry, with the Financial Services/Insurance industry following closely behind at 15%. Business Services accounted for 7%, while the Government and Defense industry represented 4% of the respondents. Among these industries, a majority of respondents specified their employment within the Information Technology department (33%). In terms of seniority within their organizations, nearly a quarter of respondents (24%) identified themselves as Team Leads, 22% as department heads, and 5% as owners.

Sixty-six percent (66%) of respondents report involvement in AI model application and customization (applying AI), while 34% are directly engaged in developing foundational generative AI models (building AI). Consequently, a significant portion of respondents (46%) represent organizations at an advanced stage of AI/ML adoption, with one to multiple models deployed to production and undergoing regular retraining. Approximately 26% are in the process of developing their inaugural model, while 23% are in the phase of evaluating potential use cases, underscoring the significance of and enthusiasm for AI/ML project development.
