Intento & e2f: The State of Machine Translation 2024 (English edition, 61 pages)
The State of Machine Translation 2024
An independent multi-domain evaluation of Machine Translation engines and Large Language Models

52 MT Engines and LLMs
11 Language Pairs
9 Content Domains

Disclaimer

March 25 to May 14, 2024
The MT systems and LLMs covered in this report were accessed between March 25 and May 14, 2024. Some systems may have been updated since that time period.

Automatic scoring
This report uses semantic similarity and LLM-based DQF-MQM scores. To pick the top model for your use case, you may need a review by a human linguist or subject matter expert to address particular business requirements.

Stock models only
If you consider customizing an engine on your data, your choice may vary from what is suggested here. In the solutions we build for our clients, top picks often include Amazon, DeepL, Google, Microsoft, ModernMT, and Systran, depending on the languages and the amount of available training data.

Data limitations
The evaluation used plain text data. Results often differ for tagged text with some MT vendors and language pairs because of imperfect inline tag support. This report has also used segment-wise translation rather than leveraging the full text capabilities of LLMs and some MT systems.

Valid for a specific dataset
This report shows how the systems performed only on the datasets listed on slide 14. We run multiple evaluations for our clients using various language pairs and domains, and often observe different MT system rankings than those provided in this report.

There's no "best" MT system or LLM
MT performance depends on how similar your data is to the data used to train the vendors' models, their algorithms, and your quality requirements.

Trademarks
All third-party trademarks, registered trademarks, product names, and company names or logos mentioned in the Report are the property of their respective owners, and the use of such Third-Party Trademarks inures to the benefit of each owner. The use of such Third-Party Trademarks is intended to describe the third-party goods or services and does not constitute an affiliation by Intento and its licensors with such company or an endorsement or approval by such company of Intento or its licensors or their respective products or services.

Domains? What are these?
A domain is a corpus from a specific source that may differ from other domains in topic, genre, style, level of formality, et cetera*. Basically, a combination of industry sector and content type.
*As defined in "Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey" by Danielle Saunders.

Executive Summary
The machine translation landscape is evolving as multilingual LLMs capable of high-quality translations are developed and
deployed at an accelerating pace.
- We've evaluated 52 systems overall (+15 to 2023), among which there are 24 Large Language Models (+19 to 2023).
- 55% of all top-performing models are LLMs (it was 17% in 2023).
- The largest LLM is not always the best for translation (even from the same provider).
- In 25% of all cases, LLMs are significantly better than any MT, more so in the Colloquial, Education, and Entertainment domains.
- In 12% of cases, MT is better than any LLM, more so in English to Arabic and in the IT and Legal domains.
- LLMs are 10-100 times less expensive than MT systems, but 50-1000 times slower, so we provide a separate rating for real-time systems. On average, it comes at an 11% penalty in quality.
- Among the analyzed pairs and domains, the Colloquial domain and the English-Arabic pair carry the most critical issues due to the complexity of translation.

52 Machine Translation (MT) Engines and Large Language Models (LLMs) evaluated
9 content domains: General, Colloquial, IT, Entertainment, Hospitality, Education, Healthcare, Legal, Financial
11 language pairs: English to Spanish*, French*, Italian, Portuguese*, German, Dutch, Ukrainian, Korean, Japanese, Chinese*, Arabic
*Spanish (LA), French (European), Portuguese (Brazilian), Chinese (Simplified).

Language expansion across all MT engines
191K unique language pairs (+1K compared to 2023)
658 unique languages

20 MT engines and LLMs show the best results for some language pairs and domains (55% are LLMs):
Amazon, Baidu, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, Command R+, DeepL, GPT-4, GPT-4 Turbo, GPT-4o, Gemini Pro 1.5, Google, Microsoft,
ModernMT, NiuTrans, PaLM 2 Chat Bison, PaLM 2 Text Bison, PaLM 2 Text Unicorn, Tarjama, Yandex

Large Language Models for Translation

1. Expansion across the board
The Large Language Model market has experienced explosive growth in recent years. Among the 52 models we have assessed in this report, nearly half (24) are Large Language Models.

2. Large Language Models are in the 1st tier
Large Language Models, such as GPT-4o, PaLM2 Text Unicorn, and Gemini Pro 1.5, demonstrate performance comparable to top-tier commercial MT systems across most language pairs. While their cost is 10 to 100 times lower, LLMs have a latency 50 to 1,000 times higher than traditional MT engines.

3. LLMs are priced 10 to 100 times lower than traditional MT
On average, LLMs are priced 10 to 100 times lower than traditional MT engines, making them a highly attractive alternative for companies looking to reduce costs without compromising on quality for human post-editing scenarios.

4. 50-1000 times slower than traditional MT
Although LLMs offer lower costs compared to traditional MT engines, they are typically 50 to 1,000 times slower, rendering them unsuitable for real-time translation applications.

5. Open-source LLMs are generally in the 2nd tier
While the performance of open-source LLMs like TowerInstruct 7B v0.2 or Command R approaches top-tier commercial engines, the majority of open-source LLMs produce lower-quality translations due to their more limited multilingual capabilities compared to their commercial counterparts.

6. Customization is possible
The performance of LLMs can be enhanced through the use of Retrieval-Augmented Generation (RAG) or prompt engineering techniques. These methods allow for adjustments in tone of voice, mitigation of gender bias, and incorporation of domain-specific terminology. Moreover, several LLMs can be fine-tuned for translation tasks by leveraging existing translation memories.

About Intento
Intento is a machine translation and multilingual generative AI platform for global enterprise companies. Our Enterprise Language Hub enables companies like Procore and Subway to deliver consistent, authentic language experiences across all markets and audiences. It combines machine translation and generative AI models into automatic translation workflows, customizing them to client data and integrating them into customers' existing software systems for localization, marketing,
customer support, and other business functions. With Intento, clients achieve high-quality, real-time translations for all users and team members worldwide.

The Enterprise Language Hub is ISO-27001 certified, ensuring enterprise top-tier security for GenAI solutions in high-demand industries. Intento also offers ISO-9001-certified expert help for setting up and maintaining MT and AI models and constantly refines these models with new data and user feedback.

Trusted by the global enterprise
We have been evaluating stock Machine Translation models since May 2017. For customers, we also evaluate customizable NMT models (you can get a glimpse here).
As we show in this report, the Machine Translation landscape is complex and dynamic. Models from five different vendors are required to achieve the best quality in popular language pairs, with a dramatic price difference (as much as 200 times).

Enterprise Language Hub
Machine Translation and multilingual Generative AI platform for global businesses. We deliver immediate, tailored, and personalized language experience in all the software systems your customers and teammates already use, supporting over 650 languages.

Language Hub for:

Localization
Save up to 95% on translations by combining best-fit Machine Translation with source quality improvement and automatic post-editing based on GenAI. Let us take care of your custom MT engines. Works with over 15 TMS.

Customer Experience
Empower customer support
teams with real-time machine translation for chats and tickets to help customers 24/7 in their native language. Globalize self-service by on-the-fly translation of knowledge bases and community forums, tailored to your data.

Employee Experience
Our Language Hub translates documents, support tickets, and enterprise apps so everyone's on the same page and included. Works as a translation portal and through integrations with Atlassian, Microsoft, ServiceNow, Zendesk, and other tools.

Enterprise GenAI Portal
Boost your team's productivity with GenAI-based automatic language skills (tone of voice, summarization, and more) while keeping your security team happy. Choose the right GenAI model for every task and amplify it with Machine Translation for non-English content. Pay for actual usage, not seats.

About e2f
e2f services:
- MT detection and MT quality evaluation services that enable organizations to monitor suppliers for compliance with brand standards for human and machine translation.
- Creation of custom Lingosets, or augmented multilingual datasets that represent real human conversational flow. Lingosets serve as benchmarks for conversational AI deployments.
- Golden datasets and training datasets that enable leading MT providers to evaluate and fine-tune engine performance.

Established in 2004, e2f helps people and machines understand each other fluently, regardless of language, content, and culture. e2f solutions empower Fortune 50 brands to monitor, objectively assess,
and improve communications on a global scale. e2f delivers world-class translation and training data with its proprietary technology stack for translation, quality review, and AI services. e2f offers a global resource pool of skilled professionals in virtually all countries and languages. To learn more, contact e2f or visit the e2f website.

Overview
1. MT Engines
2. Datasets
3. Evaluation Methodology
4. Evaluation Results
5. Miscellaneous
6. Takeaways

1. MT Engines and Large Language Models
1.1 Machine Translation Landscape
1.2 Generative AI Landscape
1.3 Evaluated MT Engines and LLMs

1.1 Machine Translation Landscape
Standalone commercial products with an API.
All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this material are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

Generic stock models / Vertical stock models / Custom terminology support
RWS, SYSTRAN, Ubiqus, Yandex, Amazon, Baidu, DeepL, Google, IBM, Microsoft, Rozetta

Static domain adaptation / Dynamic domain adaptation
Alibaba, AppTek, Baidu, CloudTranslation, Globalese, Google, IBM, KantanAI, Microsoft, Omniscien, PangeaMT, Prompsit, SYSTRAN, Tilde, Ubiqus, PROMT, Rozetta, RWS, Yandex
Amazon, ModernMT
AISA, Alibaba, Amazon, AppTek, Baidu, DeepL, Devnagri, eBay, Elia, Fujitsu, Globalese,
Google, GTCom, IBM, iFlyTek, Kakao, Kawamura, Kingsoft, Lesan, Lindat, LingvaNex, Microsoft, Mirai, ModernMT, Naver, NiuTrans, NTT COTOHA, Omniscien, Oracle, PangeaMT, Phrase, Process9, Prompsit, PROMT, Reverso, Rozetta, RWS, SmartMATE, Sogou, SYSTRAN, Tartu NLP, Tarjama, Tencent, TREBE, Tilde, Ubiqus, Unbabel, Yandex, YarakuZen, Youdao
Alibaba, Baidu, CloudTranslation, Lingua Custodia, Microsoft, NiuTrans, Omniscien, PROMT, RoyalFlush, SAP, SYSTRAN, Tartu NLP, Tarjama, Ubiqus, XL8
(Logo captions: powered by NICT; by CHAPSVISION)

Generic Stock Models
Pre-trained models based on data from multiple sources. These models are not pre-adjusted to one particular industry or specialization, such as Legal or Medical translations.

Custom Terminology Support
Allows users to customize the MT models by applying their own glossaries. Depending on the implementation, terminology can be used while training custom models or for
adjusting machine translation results.

Dynamic Domain Adaptation
The model can be incrementally updated on the fly. The adaptation can be done with as few as a single data point and happens in real time. Typically, there's no snapshot of the baseline model created, so the model's performance is affected when
the baseline model is updated by an MT provider.

Vertical Stock Models
Pre-trained models, pre-adjusted to one particular industry or specialization, such as Legal or Medical translations.

Static Domain Adaptation
The baseline MT model can be adjusted using batch training. The training requires a significant amount of data (thousands of parallel segments) and takes time (from hours to days). Once the model is trained, a snapshot of the model is created and does not change until the next batch re-training.

Large Language Models
Large Language Models (LLMs) are trained on massive amounts of data to generate text, follow instructions, and answer questions. These models can be used for various tasks such as content creation, sentiment analysis, text summarization, or translation.

1.2 Generative AI Landscape

Cloud Commercial, Baseline only
- Aleph Alpha: Luminous
- Anthropic: Claude 3 [4] (Opus, Sonnet, Haiku), 2.1, 2.0, Instant
- Google: PaLM2 Unicorn-001, Gemini 1.0 Ultra, Gemini 1.5 Pro
- Microsoft Azure: OpenAI GPT-4, GPT-4 Turbo, Mistral Large, other open models
- OpenAI: GPT-4 Turbo [3]

Cloud Commercial, Customizable
- AI21: Jurassic-2 Ultra [4], Mid [4], Light
- Amazon AWS: Titan, Cohere Command, Meta Llama 2
- Cohere: Command [4]
- Google: PaLM 2 Bison, Gemini 1.0 Pro
- Microsoft Azure: OpenAI GPT-3.5
- Mistral: Small [3,4], Large [3,4]
- NVIDIA: Nemotron-3 [3]
- OpenAI: GPT-3.5 Turbo [3], GPT-4 [5]

Open Commercial [1]
- 01.AI: Yi
- AI21: Jamba
- Alibaba: PolyLM, Qwen, Qwen1.5 [2]
- AllenAI: OLMo
- BAAI: Aquila2
- Baichuan: Baichuan 2
- BAIR: OpenLLaMA, Starling
- Big Science: BLOOM [2]
- Cerebras: Cerebras-GPT
- Cohere: Aya, Command
- Databricks: Dolly 2.0, MPT, DBRX [3]
- Deci: DeciLM [3]
- Eleuther AI: GPT-Neo, GPT-NeoX, Pythia, Polyglot
- Google: T5-FLAN, Gemma
- HuggingFace: Zephyr
- Lianjiia Tech: BELLE
- LLM360: Amber, Crystal
- LLMZoo: Phoenix
- Meta AI: Llama 2 [2,3,4], Llama 3 [2,3,4]
- Microsoft: Phi, Phi-1.5, Phi-2, Phi-3
- Mistral: Mistral 7B [3,4], Mixtral 8x7B [3,4]
- Preferred Networks: PLaMO
- Salesforce: XGen
- Silo AI: Poro, Viking
- Snowflake: Arctic
- Stability AI: Stable LM 2
- StatNLP: TinyLLaMA
- TII (UAE): Falcon [3]
- X.ai: Grok-1

Open Non-commercial
- BAAI: Aquila2 70B
- BAIR: Koala
- Meta AI: Llama
- Stability AI: Stable LM Zephyr, Stable Beluga 1/2
- Stanford: Alpaca, Vicuna
- Unbabel: TowerInstruct

[1] Apache 2.0 or MIT licenses
[2] limited commercial use, read license terms for details
[3] available as a cloud commercial model via Azure
[4] available as a cloud commercial model via AWS
[5] experimental

All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this material are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
Updated on April 24, 2024. For any revisions, please reach out to us at hello@inten.to. Remember to verify the license information, as licenses may change.

1.3 Evaluated MT Engines and LLMs
Large Language Models can be customized with TMs through fine-tuning and RAG, and with terminology via prompt engineering.
[Chart: customization options (None, TM, Glossary, Both) for MT engines and LLMs]

MT engines (vendor: product)
- Alibaba Cloud: General
- Alibaba: E-Commerce MT
- Amazon: Translate
- Baidu: Translate API
- DeepL: API
- Elia: MT API
- HiThink RoyalFlush: Finance Translation
- Globalese: Machine Translation
- Google Cloud: Advanced Translation
- Kawamura by NICT: Translation Engine
- Lingua Custodia: Machine Translation API
- Microsoft: Language Translator
- Mirai: Translator
- ModernMT: Adaptive
- Naver: Papago NMT Commercial
- NiuTrans: Translation Cloud Platform
- Oracle: Machine Translation
- Pangeanic: Machine Translation API
- PROMT: Machine Translation
- SYSTRAN: PNMT
- Tarjama: MT API
- TartuNLP: Neurotõlge MT
- Tencent Cloud: TMT API
- Tilde: Machine Translation API
- TREBE: Machine Translation API
- Ubiqus: Translation API
- Yandex: Translate API
- Youdao: Cloud Translation API

LLMs (model: provider)
- Aya-101: Cohere
- Claude 3 Haiku: Anthropic
- Claude 3 Opus: Anthropic
- Claude 3 Sonnet: Anthropic
- Command-R: Cohere
- Command-R+: Cohere
- Gemini Pro: Google Vertex
- Gemini Pro 1.5: Google Vertex
- GPT-3.5 Turbo: OpenAI
- GPT-4: OpenAI
- GPT-4o: OpenAI
- GPT-4 Turbo: OpenAI
- Jurassic Ultra: AI21
- LLaMA-2: Meta AI
- LLaMA-3: Meta AI
- Mistral Large: Mistral AI
- Mixtral 8x7B: Mistral AI
- PaLM2 Chat Bison: Google Vertex
- PaLM2 Text Bison: Google Vertex
- PaLM2 Unicorn: Google Vertex
- RakutenAI 7B Instruct: RakutenAI
- Titan Text Express: Amazon AWS
- TowerInstruct 13B v0.1: Unbabel
- TowerInstruct 7B Internal v0.2: Unbabel

2. Datasets
2.1 Preparation
2.2 Content Domains and Language Pairs
2.3 Content Samples by Domain
2.4 Sentence Length

2.1 Preparation
The source data collection and initial cleaning were done by Intento.

Open-Source English Texts
- Carefully selected from open-source data
- Found several resources for each domain and selected the ones with suitable license agreements
- Extracted high-quality segments
Data samples for various domains are used according to their licence agreements: Financial data, Hospitality data 1, Hospitality data 2, Legal data, Entertainment data, IT data, Colloquial data.

Filtering to Ensure High-Quality Source
- Collected data for 9 domains using open-source resources
- Removed duplicates, tags, and broken symbols
- Removed segments under 4 words
- Removed segments that were truncated (except for the Colloquial sector) and segments that were longer than one sentence
- Manually checked each segment in every domain to avoid segments with an ambiguous meaning or incorrect tone of voice

The dataset translations and quality assurance were done by e2f.

Quality Assurance
Provided via e2f's TEP portal.
Human translations were compared with ones generated by the leading machine translation engines using e2f's MT Detection
tool to determine the probability that they contained machine-translated and/or post-edited content (MTPE). Strings whose MTPE probability exceeded e2f's threshold triggered expert review, followed by re-translations, which were automatically reassessed. The resulting golden dataset does not bear traces of MTPE.
Quality assurance reports were run on capitalization, punctuation, spelling, numbers, spaces, and typos. Reviewers implemented necessary changes and proofread the dataset prior to final delivery.

Translation by Native Speaking Experts
- Selected native translators with expert-level qualifications and positive feedback in each language and domain.
- For reviews, selected native language experts in editing and proofreading across multiple domains, with positive customer feedback.
- Proofread strings supplied by Intento for compliance with proper English grammar, spelling, and punctuation, and supplied files to translators via e2f's Translation, Editing, and Proofreading (TEP) platform.

2.2 Content Domains and Language Pairs
9 content domains per language pair
11 language pairs per domain

Pair    Colloquial  Education  Entertainment  Financial  General  Healthcare  Hospitality  IT   Legal
en-ar   4674780470473476
en-de   4724754773475474
en-es   4694744664724769
en-fr   473         472        473            475        473      475         472          471  473
en-it   4744764778476473
en-ja   4774754724773473
en-ko   472         478        465            469        470      471         470          470  474
en-nl   473         470        466            472        472      470         476          475  475
en-pt   472         472        472            472        476      481         476          471  472
en-uk   473         474        470            475        475      473         474          476  479
en-zh   474         472        469            476        475      478         474          473  473

2.3 Content Samples by Domain
General: "Walmart is also the largest grocery retailer in the United States."
Finance: "Both operating profit and net sales for the three-month period increased, respectively from 16m and 139m, as compared to the corresponding quarter in 2006."
Hospitality: "Very reasonably priced and the food is excellent, I had pasta which was delicious, and my friend had the Italian meats & cheeses."
Healthcare: "Leishmaniosis caused by Leishmania infantum is a parasitic disease of people and animals transmitted by sand fly vectors."
Legal: "Landlord and Tenant acknowledge and agree that the terms of this Amendment and the Existing Lease are confidential and constitute proprietary information of Landlord and Tenant."
Entertainment: "Further, they are aided by a magnificent cast of co-stars, most notably their secretary, played by Isabel Tuengerthal, who is a rare gem with great comic potential."
Education: "Find what straight lines are represented by the following equation and determine the angles between them."
IT: "The interface is in Python, a dynamic programming language, which is very
appropriate for fast development, but the algorithms are implemented in C++ and are tuned for speed."
Colloquial: "and, in fact, there are two huge lenses that frame the figure on either side"

2.4 Sentence Length
- 11 pairs with English source were translated in total
- Sentences that were too short (under 4 words) were excluded from the dataset.
[Chart: source segment length (words) for en-ar, en-de, en-es, en-fr, en-it, en-ja, en-ko, en-nl, en-pt, en-uk, en-zh]

3. Evaluation Methodology
3.1 Scores to Choose From
3.2 Choosing Engines for Linguistic Quality Assessment using COMET
3.3 Issue Classification and Severity for Intento LQA
3.4 Examples of Issue Classification Using Intento LQA
3.5 How We Choose Best Engines Using Intento LQA

3.1 Scores to Choose From

hLEPOR (syntactic similarity)
Compares similarity of token-based n-grams. Penalizes both omissions and additions. Penalizes paraphrases/synonyms. Penalizes translations of different length.

TER (syntactic similarity)
Measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform a machine translation into the reference translation. Penalizes paraphrases/synonyms. Penalizes translations of different length.

SacreBLEU (syntactic similarity)
Compares token-based similarity of the MT output with the reference segment and averages it over the whole corpus. Penalizes omissions and additions. Penalizes paraphrases/synonyms. Penalizes translations of different length.

BERTScore (semantic similarity)
Analyzes cosine distances between BERT representations of machine translation and human reference (semantic similarity). Does not penalize paraphrases/synonyms. May be unreliable for terminology in domains and languages underrepresented in the BERT model.

COMET (semantic similarity)
Predicts machine translation quality using information from both the source input and the reference translation. Achieves state-of-the-art levels of correlation with human judgement. May penalize paraphrases/synonyms. The version of the model used in the report is wmt22-comet-da.

Intento LQA (Large Language Model-based)
Analyses the quality of machine translation based on both source and reference translation using the DQF-MQM framework. Achieves high correlation with human assessment. May penalize paraphrases/synonyms.

3.2 Choosing Engines for Linguistic Quality Assessment using COMET
While COMET provides valuable insights into the performance of various models, it is important to note that relying solely on COMET may not give a comprehensive understanding of the models' strengths and
weaknesses. In several combinations of domain x pair, up to 24 models fall within the 83% confidence interval, which means further analysis is necessary to fully assess the nuances and differences between these models beyond COMET scoring.

For this evaluation, we also used the Intento LQA metric based on the DQF-MQM framework for machine translation evaluation.

We first scored the translations using COMET to get the ranking of providers for each combination of language pair x domain. We then chose top-runners for each combination based on the following conditions:
- We chose top-runners from the 90th percentile of COMET score in each combination of domain x pair
- We made sure that there were no fewer than 3 providers per language pair x domain
- If there were only LLMs, we added a low-latency model
- If more than one model from one LLM family appears, unless it's a provider with reasonable tiers
between models, we removed all but the one with the highest COMET score

3.3 Issue Classification and Severity for Intento LQA
We use the following issue classification as given in the DQF-MQM framework when working on Intento LQA:
- Accuracy issues: Addition, Omission, Mistranslation, Over-translation, Under-translation, Untranslated text
- Fluency issues: Punctuation, Spelling, Grammar, Grammatical register, Inconsistency, Link/cross-reference, Character encoding
- Terminology issues: Inconsistent use of terminology
- Style issues: Awkward, Inconsistent style, Unidiomatic
- Design issues: Length, Local formatting, Markup, Missing text, Truncation/text expansion
- Locale convention issues: Address, Date, Currency, Measurement, Shortcut key, Telephone format
- Verity issues: Culture-specific reference
- Other issues

As seen in the DQF-MQM description, we are using the following error severity classification:
- Critical (10-point penalty): Has a significant impact and may cause severe implications
- Major (5-point penalty): Has a considerable impact and may confuse or mislead the reader
- Minor (1-point penalty): Has a slight impact; does not cause loss of meaning nor confuse the reader
- Neutral (0-point penalty): Flags problems that are not considered errors, for example preferred stylistic changes. No penalty associated

3.4 Examples of Issue Classification Using Intento LQA

Intento LQA score: 7-point penalty
Source: "I wish I could have said something better about Take The High Ground because I certainly like its talented cast, its talented director Richard Brooks, even the silly theme by Dimitri Tiomkin and Ned Washington, fresh from their Oscar a year before for High Noon."
MT: ?Take The High Ground?High Noon.”
Intento LQA:
- Major error (5-point deduction) - Untranslated text: The titles Take The High Ground and High Noon were not translated into Arabic, which may confuse the reader.
- Minor error (1-point deduction) - Grammar: The phrase ? is awkward and ungrammatical. A better translation would be ?.
- Minor error (1-point deduction) - Fluency: The machine translation is generally understandable but lacks the fluency and natural flow of the reference translation.

Intento LQA score: 10-point penalty
Source: "The Commission considers that this will inevitably have led them to temper their competitive behaviour towards each other."
Machine translation: “欧州委員会、必然的相互競争行動抑制考。”
Intento LQA:
- Major error (5-point deduction) - Addition: The machine translation adds 欧州 (European) which is not present in the source text.
- Major error (5-point deduction) - Mistranslation: The machine translation uses (led to) which implies a past action, whereas the source text implies a future consequence.

3.5 How We Choose Best Engines Using Intento LQA
1. Score all segments according to their penalties, where 100 is the maximum score obtainable by an MT engine or an LLM, and 100 - (all segment penalty points) is the final segment score
2. Average segment-level scores across the corpus
3. Identify a group of top-runners (BEST) within an 83% confidence interval [1, 2] of the leader
[Chart: scores with 83% confidence intervals, BEST group highlighted]

[1] Harvey Goldstein; Michael J. R. Healy. The Graphical Presentation of a Collection of Means. Journal of the Royal Statistical Society, Vol. 158, No. 1 (1995), pp. 175-177.
[2] Payton ME, Greenstone MH, Schenker N. Overlapping confidence intervals or standard error intervals: what do they mean in terms of statistical significance? J Insect Sci. 2003;3:34. doi:10.1093/jis/3.1.34.

4. Evaluation Results
4.1 Best MT Engines per Domain
4.2 Sixteen Engines are Among the Statistically Significant Leaders
4.3 Eleven Engines Provide Minimal Coverage
4.4 GPT-4 Consistently Outperforms Other Engines
4.5 Seven MT Engines Excel Among Real-Time Providers
4.6 Five Real-Time Engines Provide Minimal Coverage
4.7 DeepL Surpasses Other Real-Time Engines
4.8 Few Major or Critical Errors Among Providers
4.9 Mistranslations Are More Common than Other Translation Issues
4.10 Few Major or Critical Errors Among Providers

4.1 Best MT Engines per Domain
In the next two slides, we show the best MT engines by Intento LQA score. Each square shows the best providers for a particular language pair in a specific domain. The color of the square shows the achievable MT quality for this domain compared to other domains in this language pair.
For example, we see that the best engine for the English-Japanese pair in the Educational and Entertainment domains is GPT-4o. Its score for the Educational domain is higher, so we expect less post-editing than in the Entertainment domain.
We showcase both the best MT engines and LLMs overall and also, separately, the best engines suitable for real-time translation, as LLMs have high latency and might not be the best choice for
such a scenario.
The score values are not comparable between different language pairs.
Engines in one bucket provide the best quality for this language pair and domain, with no statistically significant difference between them. They are presented in alphabetical order.

[Figure: available quality and best commercial MT engines and LLMs by domain per Intento LQA score. Columns: Colloquial, Education, Entertainment, Financial, General, Healthcare, Hospitality, IT, Legal. Rows: en-ar, en-de, en-es, en-fr, en-it, en-ja, en-ko, en-nl, en-pt, en-uk, en-zh.]
Engines are shown in alphabetical order as they are statistically indistinguishable and are in the same tier.

4.2 Eighteen Providers Show the Best Results
Amazon, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, Command R+, DeepL, GPT-4 Turbo, GPT-4o, Gemini Pro 1.5, Google, Microsoft, ModernMT, PaLM 2 Chat Bison, PaLM 2 Text Bison, PaLM 2 Text Unicorn, Tarjama, Yandex
- Many engines perform best with English to Spanish, Portuguese, and French.
- Colloquial and Entertainment domains, as well as Japanese and Korean languages, require a careful choice of MT vendor, as relatively few perform at the top level.
- Despite having several comparable engines per language pair, Colloquial and Entertainment domains show relatively low scores, which may indicate the importance of customization or context.
- Recently published GPT-4o outperforms several traditional MT engines across multiple domains and pairs.

Engines are shown
101、res,which may indicate the importance of customization or context.ColloquialEntertainment Recently published outperforms several traditional MT engines across multiple domains and pairs.GPT-4o 4.2 Eighteen Providers Show the Best Results29 of 61The State of Machine Translation 2024Engines are shows
102、in alphabetical order as they are statistically non-distinguishible and are in the same tier.Available quality and best commercial MT engines by domain per Intento LQA score(real-time translation)*2452041631228140*When working with real-time translation engines only,decrease in quality should be exp
103、ected compared to LLM providers,which is represented as percentages in the bottom right corner of each bucket*Engines are shows in alphabetical order as they are statistically non-distinguishible and are in the same tier.30 of 61The State of Machine Translation 2024AmazonBaiduDeepLGoogleMicrosoftMod
104、ernMTNiuTransTarjamaYandexColloquialEducationEntertainmentFinancialGeneralHealthcareHospitalityITLegalenarendeenesenfrenitenjaenkoennlenptenukenzh When it comes to real-time translation,several engines perform best with and.English to Spanish,Arabic,Dutch Arabic,Korean Ukrainian languagesand require
105、 a careful choice of MT vendor when working with real-time translation as few perform at top-level.Entertainment Colloquialand domains show relatively low scores,which highlights the importance of customization and context.GoogleDeepL and showcase superior translation performance in multiple domain
106、x pair combinations.4.3 Nine MT Engines Excel Among Real-Time Providers31 of 61The State of Machine Translation 2024When working with real-time translation engines only,an average 1%decrease in quality should be expected compared to LLM providersEngines are shows in alphabetical order as they are st
107、atistically non-distinguishible and are in the same tier.*For every domain,we provide the minimum number of providers needed to translate all language pairs in this specific domain.*Engines are shows in alphabetical order as they are statistically non-distinguishible and are in the same tier.Educati
108、onAmazon,Claude 3 Opus,GPT-4oEntertainmentDeepL,GPT-4oGeneralAmazon,Claude 3 Opus,DeepL,GPT-4oLegalDeepL,GPT-4oFinancialDeepL,Gemini Pro 1.5HospitalityAmazon,GPT-4oITDeepL,GPT-4o,GoogleColloquialGPT-4oHealthcareDeepL,Gemini Pro 1.5,GoogleMinimal coverage for the best quality*Providers per domain4.4
109、Six Engines Provide Minimal Coverage6 MT engines and LLMs provide minimal coverage*for all pairs and industries,1-4 per domain.32 of 61The State of Machine Translation 202411 language pairs,9 domainsSome providers were tested only in their specific domains and language pairs:HiThink RoyalFlush speci
110、alizes in en-zh translation in the Finance domain TREBE specializes in Iberian languages,and was used for en-es translation Tarjama specialized in Arabic translation4.5 GPT-4 and DeepL Consistently Outperform Other ModelsLanguage pair,domainNumber of cases a provider got into“best”33 of 61The State
of Machine Translation 2024
*For every domain, we provide the minimum number of providers needed to translate all language pairs in this specific domain.
*Engines are shown in alphabetical order, as they are statistically indistinguishable and are in the same tier.
Colloquial: DeepL, Google, Microsoft
Legal: Amazon, DeepL
Education: Amazon, DeepL, Google
Financial: DeepL, Google
Entertainment: Amazon, DeepL, Google
IT: DeepL, Google
Hospitality: Amazon, DeepL, Google
General: DeepL, Google
Healthcare: DeepL, Google, Microsoft
Minimal coverage for the best quality* (providers per domain)
4.6 Four Real-Time Engines Provide Minimal Coverage
4 MT engines provide minimal coverage* for all pairs and industries, 2-3 per domain, in the real-time scenario.
11 language pairs, 9 domains
Some providers were tested only in their specific domains and language pairs:
HiThink RoyalFlush specializes in en-zh translation in the Finance domain
TREBE specializes in Iberian languages and was used for en-es translation
Tarjama specializes in Arabic translation
4.7 DeepL Surpasses Other Real-Time Engines
(Chart axes: language pair, domain; number of cases a real-time provider got into “best”)
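The "minimal coverage" figures in 4.4 and 4.6 come down to finding the fewest providers such that every language pair in a domain has at least one top-tier provider. The following is a minimal sketch of that idea, not necessarily the procedure used for this report. It assumes the input is a mapping from language pair to the set of providers in the best tier for that pair; the function name `minimal_coverage` and the sample data are hypothetical and are not taken from the report's results.

```python
from itertools import combinations


def minimal_coverage(best_by_pair):
    """Return a smallest list of providers covering every language pair.

    best_by_pair maps a language pair (e.g. "en-de") to the set of providers
    that are in the top tier for that pair. The search is exhaustive, which is
    fine for a few dozen providers and small covering sets.
    """
    providers = sorted(set().union(*best_by_pair.values()))
    for size in range(1, len(providers) + 1):
        for combo in combinations(providers, size):
            chosen = set(combo)
            # A combination works if it shares a provider with every pair's best tier.
            if all(chosen & best for best in best_by_pair.values()):
                return list(combo)
    return []  # no covering set exists (some pair has an empty best tier)


# Hypothetical best tiers for one domain (illustration only).
sample = {
    "en-de": {"DeepL", "Google"},
    "en-ja": {"Google"},
    "en-es": {"Amazon", "DeepL"},
    "en-ar": {"Amazon", "Google"},
}
print(minimal_coverage(sample))  # ['Amazon', 'Google']
```

With this sample, no single provider is in the best tier for all four pairs, so the smallest covering set has two providers.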
Across all domain x pair combinations, GPT-4o, DeepL, and Google have the most segments with no or minor issues.
Among analyzed pairs and domains, the Colloquial domain and the English-Arabic pair carry the most major and critical issues due to complexity of translation.
According to the DQF-MQM framework, minor issues are described as having a slight impact on meaning. This broad definition leads to a large proportion of segments being classified as having minor issues. Higher linguistic quality can be achieved using engine customization and glossary support.
4.8 Relatively Low Number of Major or Critical Errors
We present an example of ratings in one combination of domain x pair to showcase the general distribution between different translation issues and lack thereof.
(Chart axis: Translation provider)
Mistranslations represent 80% of all major and critical issues. Most major mistranslations showcase translations with altered overall meaning because of incorrectly used words or phrases.
Unidiomatic translations tend to appear where a non-literal translation is possible but MT did not handle it properly.
The rest of the issues overall represent only 10% of the data, which proves that MT keeps on drastically improving every year.
We present examples of major issues on the next slide.
4.9 Mistranslations Are More Common than Other Translation Issues
*Major or critical issues present in one segment are counted separately.
4.10 Examples of Major and Critical Issues
Mistranslation
Source: “Terribly undercooked pasta, not sure if they have even heard the term al dente pasta as it was hardly cooked at all.”
MT: “茹、言葉知疑問思。”
Under-translation
Source: “Florence Rice runs the gamut from comedienne to heroine.”
MT: “Florence Rice passe de la comédie à l’héroïne.”
Grammar
Source: “Of course the mozzarella is astounding, but the bread and meats and everything else are also just fantastic.”
MT: “Natürlich ist die Mozzarella erstaunlich, aber auch das Brot, die Fleischwaren und alles andere sind einfach fantastisch.”
Unidiomatic
Source: “Pardon My Pups is an enjoyable little film, with Shirley Temple stealing all her scenes as the hero's lively kid sister.”
MT: “,.”
Omission
Source: “You have to have a strong stomach and a firm grip on yourself to sit through this, and I wouldn't recommend trying unless you have a good reason.”
MT: “除非有充分的理由,否则我不建议您尝试。”
Untranslated text
Source: “The yellow to red color of many cheeses, such as Red Leicester, is normally formed from adding annatto.”
MT:?Red Leicester?annatto.”
5. Miscellaneous
5.1 191,010 Language Pairs Across All MT Engines
5.2 LLM Multilinguality Does Not Mean Equal Support of All Languages
5.3 LLMs are 50-1000 Times Slower than Specialized MT Models
5.4 Changes in Providers' Features
5.5 LLMs Are Priced Lower Than MT Engines
5.6 Public Pricing for Model Customization
5.7 Rapid Rise of Independent Cloud Vendors as Number of LLMs Grows 2.5 Times
5.8 Most Models Improved COMET-wise compared to 2023
5.9 Open Source Pre-Trained Models
5.10 Several Open Source Models Deliver Impressive Results
5.11 Large Language Models
5.12 Large Language Models Achieve Remarkable Scores in Top Tiers
39 of 61The State of Machine Transla
tion 2024
From 190,085 in May '23 to 191,010 in May '24
Several new languages added by Google, Yandex, Microsoft, DeepL, and ModernMT
Yakut, Emoji, Uzbek Cyrillic, and Bodo, an Indian language, are new to Intento
Added Tarjama, a new niche MT provider specializing in Arabic translations
(Chart labels: total language pairs, unique language pairs, language pair growth)
5.1 191,010 Language Pairs Across All MT Engines
*Where possible, we have checked via API if all language pairs advertised by the documentation are supported and removed the pairs we were unable to locate in the API.
*As advertised (not validated via API).
*Due to LLMs' multilinguality and the nature of the data they were trained on, there is no definitive list of languages or language combinations they support. For now, we do not include LLMs in this slide.
5.2 LLM Multilinguality Does Not Mean Equal Support of All Languages
41 of 61The State
of Machine Translation 2024
*Multilingual Large Language Models' language support for pairs other than the ones in the current MT Report cannot be confirmed.
(Support matrix of models by language pair)
Models: Aya-101, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, Command-R, Command-R+, Gemini Pro, Gemini Pro 1.5, GPT-3.5 Turbo, GPT-4, GPT-4o, GPT-4 Turbo, Jurassic Ultra, LLaMA-2, LLaMA-3, Mistral Large, Mixtral 8x7B, PaLM2 Chat Bison, PaLM2 Text Bison, PaLM2 Unicorn, RakutenAI 7B Instruct, Titan Text Express, TowerInstruct 13B, TowerInstruct 7B
Language pairs: en-ar, en-de, en-es, en-fr, en-it, en-ja, en-ko, en-nl, en-pt, en-uk, en-zh
Legend: pair is supported; pair is supported and the model appeared in the list of best engines; pair is not supported
There is a 50-1000 times difference in translation time between LLMs and MT engines*
(Chart axis: provider translation time, logarithmic scale)
5.3 LLMs are 50-1000 Times Slower than Specialized MT Models
*Several MT engines' speed was affected by limited quotas (Tilde, Mirai, PROMT, NiuTrans, Tarjama, HiThink,
Kawamura by NICT, TartuNLP, Pangeanic)
Google debuts Gemini 1.5 Flash, upgrades Gemini 1.5 Pro, and introduces new developer features, such as video frame extraction and parallel function calling.
Anthropic presents a new generation of Claude models, Claude 3, including Haiku, Sonnet, and Opus in ascending order of capability.
Yandex Cloud adds several new languages, among which there are three new to Intento: Yakut, Emoji, and Uzbek Cyrillic.
Microsoft expands to 20 Indian languages, with one of them, Bodo, being new to Intento.
Google has given General Access to Adaptive Translation, their new translation engine, which leverages Google LLMs to tailor translations.
DeepL has expanded its language offerings by adding Arabic to its list of supported languages. This addition marks a significant milestone for the company, as Arabic becomes the first right-to-left language available on their platform.
Azure AI Custom Translator welcomes Neural Dictionary, an impressive extension to their dynamic dictionary and phrase dictionary features, and gives access to direct model customization.
OpenAI introduces their new multimodal flagship model, GPT-4o, which shows performance comparable to GPT-4 Turbo but is much more efficient, generating text 2x faster and being 50% cheaper.
5.4 Changes in Providers' Features
5.5 LLMs Are Priced Lower Than MT Engines
(Chart labels: stock provider price ($/1 MLN characters); LLM price ($/1 MLN characters))
$110
price by re
quest
Most Large Language Models are priced 10-100 times lower than traditional Machine Translation engines.
*Open Source Engines
*Prices for LLMs are converted with an estimation of 2.83 characters per token on average.
*Prices provided herein are based on the publicly listed prices at the time of the analysis. Actual prices may vary depending on a variety of factors, including your geographical location and any customary discounts. It is always recommended to contact the vendor directly for the most accurate and up-to-date pricing information.
5.6 Public Pricing for Model Customization
Provider: customization ($); hosting/month ($); translation ($/M characters)
Amazon: free; 200 GB free parallel data storage, $0.023/GB per month for excess data; $60
Command R: $8/M tokens; not specified; $2.10
Globalese: $50; $5.50; $110
Google Vertex: $21.25/hour; not specified; from $0.10
Google v3: $45/hour, $300 max; free; $80
GPT-3.5 Turbo: $8/M tokens; free; $3.20
LLaMA: free; free; free
Microsoft: $10/M source+target characters (max. $300); $10; $40
Mistral AI: free; free; free
ModernMT Human in the Loop: free; free; $50
Titan Text Express: $0.008/1K tokens; not specified; $0.30
SYSTRAN: by request; by request; by request
*Prices are c
onverted with an estimation of 2.83 characters per token.
Commercial (50): AISA, Alibaba, Amazon, Apptek, Baidu, CloudTranslation, DeepL, Elia, Fujitsu, Globalese, Google, GTCom, IBM, iFlyTek, RoyalFlush, Lesan, Lindat, Lingvanex, Kantan, Kawamura/NICT, Kingsoft, Masakhane, Microsoft, Mirai, ModernMT, Naver, Niutrans, NTT COTOHA, Omniscien, Pangeanic, Prompsit, PROMT, Process9, Reverso, Rozetta, RWS, SAP, Sogou, Systran, Tencent, Tilde, Ubiqus, Unbabel, TREBE, XL8, Yandex, YarakuZen, Youdao, Lingua Custodia, Tarjama
Preview/Limited (5): eBay, Kakao, QCRI, Tarjama, Birch.AI
Open Source Pretrained (6): TartuNLP, NeMo by NVIDIA, NLLB by Meta AI, M2M-100, mBART, OPUS
Large Language Models (33): 01.AI, Alibaba, AllenAI, BAAI, Baichuan, Deci, HuggingFace, Lianjiia Tech, LLM360, LLMZoo, Microsoft, Mistral, Preferred Networks, Salesforce, SiloAI, Snowflake, Stanford, StatNLP, TII (UAE), X.ai, AI21, Anthropic, BAIR, BigScience, Cerebras, Cohere, DataBricks, EleutherAI, Google, Meta AI, MosaicML, OpenAI, Stability AI
5.7 Rapid Rise of Independent Cloud Vendors as Number of LLMs Grows 2.5 Times
The new engines are highlighted in blue
5.8 Most Models Improved COMET-wise compared to 2023
Most providers have significantly improved score-wise.
Baidu has lower COMET scores than in the previous year in several pairs. In en-ar, which has the biggest score drop, we observe some moderate to severe mistranslations.
Jurassic Ultra, although having drastically improved in several pairs, shows decreased quality in en-ar, en-zh, and en-ja, most likely due to significant model updates that have happened over the year.
*For certain graphs, COMET was used in place of Intento LQA because the LQA analysis was limited to the top-performing engines, while COMET allowed for evaluation across all engines.
5.9 Open Source
Pre-Trained Models
Llama-2, Llama-3 by Meta AI
Llama family of models contains Llama-2 and Llama-3 open-source large language models of different sizes designed for research, development, and innovation in the AI field. License: Llama2 and Llama3
TowerInstruct 13B, TowerInstruct 7B by Unbabel
Tower family of models includes open-weight multilingual LLMs of different sizes for translation-related tasks, ranging from pre-translation tasks to translation and evaluation tasks. Built on top of Llama-2, they come in two sizes, 7B and 13B. License: CC-BY-NC-4.0
RakutenAI 7B by Rakuten
RakutenAI 7B Instruct builds upon the Mistral model architecture and is based on the Mistral 7B pre-trained checkpoint. It excels at Japanese language understanding while remaining competitive in English, outperforming similar models. License: Apache-2.0
Neurotõlge by TartuNLP
Neurotõlge is a multidirectional machine translation engine developed by the NLP lab at the University of Tartu. In addition to several high-resource languages, the engine supports several low-resource languages from the Finno-Ugric language family. License: MIT License
Aya-101, Command R by Cohere
The Aya model is a massively multilingual open-source generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. License: Apache-2.0
Command R is a 35B parameter generative model optimized for reasoning, summarization, question answering, and multilingual generation in 10 languages. License: CC-BY-NC
Mixtral 8x7B by Mistral AI
Mixtral 8x7B is a generative sparse mixture of experts model (SMoE) with open weights, achieving high quality performance in several languages. License: Apache 2.0
TowerInstruct 7B v0.2 and TowerInstruct 13B v0.1 outperform all other open-source models for most language pairs.
Translation in Arabic is the hardest for open-source models to tackle, with Command R scoring the highest but underperforming compared to commercial engines.
Llama-2 severely underperforms compared to the rest of the engines, while its successor Llama-3 achieves much higher scores in all language pairs except for en-ja, en-ko, and en-zh.
5.10 Several Open Source Models Deliver Impressive Results
*For certain graphs, COMET was used in place of Intento LQA because the LQA analysis was limited to the top-performing engines, while COMET allowed for evaluation across all engines.
49 of 61The State of
Machine Translation 2024
GPT by OpenAI
GPT-4o, GPT-4 Turbo, GPT-4, and GPT-3.5 Turbo are a diverse set of models with different capabilities and price points. All of them are fine-tuned for chat conversations, with GPT-4o and GPT-4 Turbo also having Vision capabilities.
Claude 3 by Anthropic
Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus represent a series of LLMs with increasing levels of performance. The tiered approach allows users to choose the model that best suits their needs in terms of performance, speed, and cost-efficiency.
Titan Text Express by Amazon
Titan Text Express, exclusive to Amazon Bedrock, leverages Amazon's extensive AI and ML expertise. It is a high-performing text model supporting 100+ languages.
Gemini Pro by Google
Gemini is a family of generative AI models that are designed and trained to handle both text and images as input.
PaLM 2 by Google
Text Unicorn and Text Bison represent the PaLM 2 family of LLMs, offering enhanced capabilities in multilingual understanding, logical reasoning, and code generation.
Mistral Large by Mistral AI
Mistral Large, Mistral AI's flagship model, can be used for complex multilingual reasoning tasks, including text understanding, transformation, and code generation.
Jurassic Ultra by AI21
Jurassic Ultra, the flagship model of the Jurassic series, can tackle the most intricate language processing tasks and create advanced generative text applications.
Other LLMs have been introduced earlier as open-source models.
5.11 Large Language Models
50 of 61The State of
Machine Translation 2024
GPT-4o, which has recently been presented as the new flagship model of OpenAI, consistently appears in the 1st tier and scores higher* than other OpenAI models.
PaLM2 Text Unicorn shows the highest results in en-fr out of all LLMs.
Models from the Llama family, Jurassic Ultra, and Titan Text Express generally achieved lower scores in comparison to the other models evaluated in this report.
5.12 Large Language Models Achieve Remarkable Scores in Top Tiers
*For certain graphs, COMET was used in place of Intento LQA because the LQA analysis was limited to the top-performing engines, while COMET allowed for evaluation across all engines.
*Considering the number of Large Language Models, we have only chosen the best out of each model family by language pair to be present on the graph.
6. Takeaways
6.1 Key Conclusions
6.2 Intento Machine Translation and multilingual Generative AI platform for global businesses
6.3 MT Evaluation & MT Maintenance
6.4 Get maximum results from machine translation
6.5 Machine Translation University
6.1 Key Conclusions (1/2)
1. Large Language Models are changing the Machine Translation landscape
Since the 2023 Report, we noticed two new vendors with pre-trained MT models. The growth in the number of new LLM providers has been more substantial. We currently track 94 vendors; 33% of them are LLM vendors (it was just 18% in 2023).
2. LLMs help with the linguistic quality analysis
To enhance our analysis, we have introduced Intento LQA, an LLM-based DQF-MQM-based metric, and combined it with the established COMET framework. It provides more insights into quality issues and enables a more accurate model comparison.
3. LLMs rapidly carve out their share of the market
This time, we assessed a total of 52 engines, out of which 24 were Large Language Models. It's not just about having more LLMs: we have more of them among best models, too. 55% of all top-performing models are LLMs (it was 17% in 2023). The largest LLM is not always the best for translation (even from the same provider), hence we see multiple models from the same family.
4. The quality landscape is quite complex
In 25% of all cases, LLMs are significantly better than any MT. More in Colloquial, Education, and Entertainment. In 12% of cases, MT is better than any LLM. More in English to Arabic, and in IT and Legal domains.
5. LLMs are much cheaper and much slower
LLMs (with simple prompts) are 10-100 times less expensive than MT systems, and the price is not correlated with the quality. However, they are 50-1000 times slower, so we provide a separate rating for real-time systems. On average, the real-time requirement comes at 11% penalty in quality.
6. Few languages and
domains are harder than others
Among the analyzed pairs and domains, the Colloquial domain and the English-Arabic pair carry the most critical issues due to translation complexity. Overall, translations into Japanese and Korean have clear leaders for every domain except IT and Legal, emphasizing the importance of careful model selection.
6.1 Key Conclusions (2/2)
7. Number of supported languages didn't grow much
There are 191,010 unique language pairs across all MT systems. Among them, four new languages and 1,000+ new language pairs, but it's just a 0.5% growth since 2023. Although technically LLMs should support all languages, we see that the quality varies a lot, so it deserves a separate study.
8. 19 best-performing engines overall and 9 best in real-time scenario
Among the 9 domains and 11 language pairs analyzed, 18 MT engines and LLMs emerge as the top performers. When LLMs are excluded due to their high latency, 9 MT engines demonstrate the best performance across the board.
9. Open-source LLMs are generally in the 2nd tier
While the performance of open-source LLMs like TowerInstruct 7B v0.2 or Command R approaches top-tier commercial engines, the majority of open-source LLMs produce lower-quality translations due to their more limited multilingual capabilities compared to their commercial counterparts.
10. Few models are best at translation
Out of multiple tested LLMs, we see only models from Anthropic, Cohere, Google, and OpenAI in leaders. The number of leading MT vendors has also reduced in 2024. This could be attributed to a more detailed quality analysis.
11. MT and LLMs make similar translation errors
We did not see much difference in translation errors between MT engines and LLMs. However, some of the systems have certain quirks. Some systems produce more grammatical errors than others, some can be more awkward in Asian languages; one of the models produces more partially untranslated segments, while yet another is prone to additions.
12. Customization improves quality above this report
Both MT and LLMs can be trained on translation memories and improved with glossaries. LLMs additionally can benefit from prompt engineering and RAG*. These tools can eliminate a significant amount of the errors found in this report. Use them.
*Retrieval Augmented Generation
6.2 Intento Machine Translation and multilingual Generative AI platform for global businesses
(Diagram labels: Enterprise Language Hub; Generative AI; Change Management; Quality Estimation; Machine Translation; Flexible Workflows; Model Customization)
Trusted by the global enterprise
55 of 61The State of M
achine Translation 2024
6.3 Unlock machine translation and generative AI across your entire company with an all-in-one, scalable platform
Book a demo
The right way to jumpstart your AI program
We customize and evaluate AI models for you, configure workflows and integrations, and provide ongoing maintenance.
Keep existing workflows running smoothly
Enterprise Language Hub integrates with the most popular software systems, so you can keep your existing human workflows in localization, document management, and customer support.
Save up to 20x on translation and content production
Combine MT with automatic language skills, such as source content improvement and automatic revision, to save up to 95% on what you spend today.
Central, future-proof enterprise AI deployment
Build and manage AI workflows centrally, tapping into 40+ MT and GenAI providers, ready for any new tech the future brings. ISO-27001 certified.
56 of 61
The State of Machine Translation 2024
Modern machine translation (MT) is powerful but struggles with imperfect source text and cannot leverage context. Enterprise Language Hub overcomes these limitations by making source text more translatable before MT and adding context afterward. It can also use GenAI to automate human post-editing, saving up to 95% of translation costs.
Book a demo
Source quality improvement: Change incorrect formatting, slang, and language errors before translation. Up to 25% less editing
Machine translation: We help pick the right MT model for each of your languages and tailor it to your terminology and glossary. Up to 70% less effort
Automatic post-editing: Apply your tone of voice, terminology, or other customized language features with generative AI. Another 60% less editing needed
6.4 Get maximum results from machine translation
Learn how to evolve your MT program over time
Learn how to build or improve your MT program
Fast and Safe: Only 5-6 weeks to get a winning MT engine with estimations for effort saved in post-editing and quality in real-time cases, such as support chats
Trusted: We run 15-20 MT Evaluation projects per month for global companies across industries under strict Security, Quality, and Data Protection requirements. ISO 27001 and ISO 9001 certified.
MT Evaluation: Data cleaning; Model training; Test sample translations; Model training analysis; LQA (sample review); Final analysis
MT Maintenance: MT Performance Monitoring & Hot-Swap; Glossary updates; Model updates; MT Quality Monitoring; Post-editing Effort Analysis; MT Evaluation
6.5 MT Evaluation & MT Maintenance for hassle-free Enterprise MT
The State of Machine Translation: An independent multi-domain evaluation of MT engines
Commercially available pre-trained MT models
2261 Market St, #4273, San Francisco, CA 94114
inten.to
3655 Nobel Drive, Suite 520, San Diego, CA
Appendix A
A.1 Best scores per Domain (BLEU)
In the past, we were often as
ked, “OK, but what are the BLEU scores?” Today, it's commonly accepted that one should not use the BLEU score at all. However, since you've asked for it, we decided to give you the highest SacreBLEU scores in each combination of domain and language pair. There's no statistical significance test, as BLEU is a corpus-based score. Please keep in mind that BLEU, as a corpus-level score with a number of parameters, is comparable neither across different languages nor across different datasets and different BLEU implementations.
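To illustrate why BLEU values from different implementations cannot be compared, here is a minimal sketch of a corpus-level BLEU computation. It is a simplified stand-in written for this note, not SacreBLEU and not the scoring code used for this report: it assumes one reference per segment and applies no smoothing. The function names and both tokenizers are illustrative assumptions. The same hypothesis and reference are scored twice, and only the tokenization differs.

```python
import math
import re
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, tokenize=str.split, max_n=4):
    """Simplified corpus-level BLEU (0-100): one reference per segment, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the whole corpus
    totals = [0] * max_n   # hypothesis n-gram counts, summed over the whole corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = tokenize(hyp), tokenize(ref)
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def split_punctuation(text):
    """Alternative tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up segment pair (illustration only, not report data).
hypotheses = ["The bread is fantastic, and the meats are fantastic too."]
references = ["The bread is fantastic and the meats are great too."]

bleu_whitespace = corpus_bleu(hypotheses, references)
bleu_punctuation = corpus_bleu(hypotheses, references, split_punctuation)
print(bleu_whitespace, bleu_punctuation)
```

The two printed values differ even though the translation is the same, because "fantastic," counts as a different token from "fantastic" under whitespace tokenization. Note also that the n-gram counts are pooled over all segments before the score is formed, which is what makes this a corpus-level score rather than an average of segment scores.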