World Bank: 2023 Internet Project SmartFi - Exploring the Application of AI and ML in FinTech News (English edition) (74 pages).pdf


Project SmartFi
Exploring AI/ML for FinTech News

IN COLLABORATION WITH SYNTASA, POWERED BY GOOGLE CLOUD

Public Disclosure Authorized

ABSTRACT

The World Bank Finance and Technology Department, in collaboration with The World Bank Technology and Innovation Lab, partnered with Google Cloud and Syntasa Inc. to learn how artificial intelligence and machine learning could enhance the news sourcing and sentiment of FinTech topics globally. This outcome report shares the key learnings and insights as a part of the exploration and development of a prototype.

ACKNOWLEDGEMENTS

The key learnings outlined in this report were prepared by the Project SmartFi (Smart Finance) team.

World Bank Treasury Finance and Technology (TREFT): Paul Snaith, Patrick Cheng, Jaskaran Singh
World Bank Technology and Innovation Lab (ITSTI): Yusuf Karacaoglu, Stela Mocan, Mora Farhad, Mahesh Chandrahas Karajgi, Oleksandra Postavnicha, Yujuan Sun
World Bank Corporate Procurement: Sanjay Colaco, Shweta Mesipam
Syntasa Incorporated: Shawn Zargham, Michael Finn, Kyle Witt, James Wilson, Eric Bugin, Kareem Sharaf, Ted Blake
Google Cloud: Ryan Wright, Rajat Gupta

Contents

Abbreviations and Acronyms
Section 1: Overview
    Executive Summary
    Project Background
    Project Team & Sponsor
Section 2: Exploration with Artificial Intelligence for Financial News
    Research Approach
    Business Challenge Scope
Section 3: Collaboration with Google Cloud and Syntasa
    Rapid Prototyping with Technology Partners
    Solution Overview and Key Results
    Technical Approach (Syntasa)
Section 4: Learning Outcomes and Future Considerations
    Technical Learnings for World Bank
    Business Learnings and Outcome
Appendix A: Narrative Dashboard Features
Appendix B: Reference Data
Appendix C: Brandwatch
Appendix D: SmartFi Trusted Domains Technical Details
Appendix E: SmartFi Uncertain Domains Technical Details
Appendix F: SmartFi Chinese Language Technical Details

FIGURES AND TABLES

Table 2.1
Figure 3.1: Syntasa Solution
Figure 3.2: Modeled Mentions
Figure 3.3: Word Cloud
Figure 3.4: Domain Source
Figure 3.5: Domain and PDF Sourcing
Figure 3.6: Trending Topics
Figure 3.7: Sentiment Validation
Figure 3.8: Sentiment Model Explainability
Figure 3.9: Solution Architecture
Figure 3.10: Data and AI pipeline
Figure 3.11: Chinese Language App Configuration
Figure 3.12: Topic Modeling Parameters
Figure 3.13: Dashboard Trending Phrases
Figure 3.14: Sentiment Explainability
Figure 3.15: Sentiment Validation
Figure 3.16: Language Translation Performance
Figure 3.17: PDF Sourcing
Figure 3.18: Sentiment Explainability
Figure 3.19: Solution Architecture
Figure 4.1: Topic Modeling
Figure 4.2: Topic Modeling Explainer
Table 4.1: Sentiment Analysis Models

Abbreviations and Acronyms

Abbreviation: Description
AI: Artificial Intelligence
API: Application Programming Interface
App: Application
AWS: Amazon Web Services
BARD AI: Google's Generative AI Tool
BERT: Bidirectional Encoder Representations from Transformers
BI: Business Intelligence
BQ: BigQuery
ChatGPT: OpenAI's Generative AI Tool
DLP: Data Loss Prevention
ETL: Extract Transform Load
FedRAMP: Federal Risk and Authorization Management Program
FinTech: Finance and Technology
FTX: Futures Exchange
GCP: Google Cloud Platform
IAM: Identity and Access Management
IoT: Internet of Things
ITSTI: World Bank Group Technology and Innovation Lab
JSON: JavaScript Object Notation
KPI: Key Performance Indicators
LDA: Latent Dirichlet Allocation
LLM: Large Language Models
LookML: Looker Modeling Language
ML: Machine Learning
NLP: Natural Language Processing
NMF: Non-negative Matrix Factorization
OCR: Optical Character Recognition
POC: Proof of Concept
PoV: Proof of Value
RoBERTa: Variant of BERT model
RPA: Robotic Process Automation
SaaS: Software as a Service
SmartFi: Smart Finance
SME: Subject Matter Expert
TI Lab: World Bank Technology and Innovation Lab
TRE: Treasury
TREFT: World Bank Treasury Financial Technology unit
UI: User Interface
VPC: Virtual Private Cloud

SECTION 1
OVERVIEW

Executive Summary

In today's fast-paced world, it can be challenging to stay informed on the latest financial technology news and trends, which can help to inform decisions for financial and operational strategies. The amount of information and opinions available on the internet can be overwhelming, and it can be challenging to filter out what is most relevant and important for business users. Technology is constantly evolving; new trends
and developments may emerge daily.

To address this challenge, the World Bank Treasury Financial Technology unit (TREFT) and the World Bank Group Technology and Innovation Lab (ITSTI) (hereafter “project team”) worked on a framing exercise to explore how emerging technologies could provide a solution to help users with access to curated, trusted, and relevant news sources that inform them of sentiments across trending topics. The ITSTI lab follows a structured approach using design thinking methodologies to understand the needs, wants, and pain points of end users. The project team identified a sample list of the key topics and terms of interest; various trusted sources (including open source and subscription content, and social media channels); and the geographic areas of interest, to help guide the data requirements. The team also conducted market research to understand how similar problems are being solved, and to build on the in-lab knowledge.

Throughout this research, we worked with the largest search provider, Google Cloud. The Google Cloud Platform (GCP) provides a range of tools and services that are helpful in using machine learning to source news, for example the Cloud Natural Language API to extract entities, sentiments, and insights from news articles, among many other capabilities. We also worked with Google Cloud's partner company, Syntasa Inc., which specializes in sentiment analytics, generating insights through data analytics, and understanding digital behaviors to customize solutions for business users.

With Syntasa, which is powered by Google Cloud, we collaborated on designing and creating a prototype of a dashboard that provides users with the ability to gain insights into sentiment trends so that behavior shifts can be quickly identified by topic and by region. The visualization tool we created also provides flexibility in customizing filters, to enable quick access to digestible FinTech topics that can help users stay up to date with the latest trends and developments in their industries; identify new opportunities; and make informed decisions.

Our collaboration provided the project team with the opportunity not only to explore potential solutions but also to learn from Syntasa how private technology firms blueprint and develop artificial intelligence (AI) and machine learning (ML) prototypes to scale into enterprise adoption. The World Bank Technology and Innovation Lab (TI Lab) technical team worked closely with Syntasa and Google Cloud to learn how data scientists build custom AI/ML models, and test them for accuracy and explainability regarding transparency, accountability, and compliance, and to ensure that AI systems are fair, ethical, and safe to use. This report outlines the technical learnings, value drivers, and capabilities of the solution we developed.

Project Background

The World Bank's Treasury Operations, Financial Technology unit (TREFT) helps lead the treasury's technological advancement initiatives from the
ideation phase through development, and successful implementation in close partnership with the treasury business units and technology developers. TREFT actively engages with the Bank's business units on identifying and implementing suitable technical solutions for business use cases in treasury operations, and their potential development and implementation through in-house and/or off-the-shelf solutions. Such a process requires a constant review of the Bank's internal technology capabilities and comparison with existing industry standards and new market developments. Consequently, it is immensely important for TREFT to selectively monitor new technology trends and solutions, and subsequently to determine their suitability for the improvement of treasury operations. Currently, this process is largely performed manually, with a considerable amount of personnel time and resources dedicated to it on a regular basis. Some of the current challenges include:

- Manual sourcing and consolidation of the most relevant and informative FinTech news and events is tedious.
- Keeping track of market discussions and public sentiment surrounding notable FinTech topics and events.
- Limited search scope in terms of news sources, given the time and resource constraints.
- Determining the authenticity of a news source, its thematic relevance, and potential topical categorization.

In order to tackle these challenges and to systematically harmonize the process of FinTech and technology news sourcing, TREFT sees a unique opportunity to explore an AI system that mimics human methods in order to quickly and efficiently source curated news relevant to the topics of interest for a specific business unit. A related opportunity comes with automating the process of quantifying relevance, measuring sentiment, and determining the bias of news after it has been sourced. This can be accomplished by mirroring human tactics for measuring how relevant an article is, and determining its overall sentiment and bias, a process which can also be supported through AI methods.

Given the existence of these opportunities and the potential benefits of deploying such an AI solution to multiple use cases within treasury, TREFT, along with its partner, Innovation Lab, collaborated in exploring in-house and off-the-shelf solutions which could fulfill the requirements of the use case.

Project Team & Sponsor

TREFT coordinates the efficient internal administration of the World Bank Treasury's Information Technology infrastructure across all institutional projects, maintenance, and budget and planning cycles, ensuring that it remains fit for purpose, up-to-date, secure, and reliable. The unit also develops and maintains appropriate strategic technology planning in relation to Treasury's significant standing in the global financial markets, and leverages that standing to build internal and external partnerships for market and development effect. TREFT's technology initiatives include leading Treasury's participation in large-scale system renewals and emerging technology projects in FinTech fields such as AI/ML, blockchain, RPA, and World Bank finance-wide projects.

The TI Lab is a specialized unit within the World Bank Group's Information and Technology vice presidency, centered around three main pillars: innovation, experimentation, and capacity building. TI Lab works closely with various departments and units within the World Bank Group, as well as with external partners, to identify potential areas where emerging technologies can be applied to solve business and development problems. It aims to assist World Bank Group (WBG) business teams in problem framing, requirement gathering, data preparation, technical guidance, and prototype delivery to help decision makers assess whether an investment is worth embarking on for operationalization. The mandate in the TI Lab is to learn by doing and to share knowledge across teams, for continuous innovation.

SECTION 2
EXPLORATION WITH ARTIFICIAL INTELLIGENCE FOR FINANCIAL NEWS

Research Approach

1. What are the most effective methods for collecting and curating news articles related to a specific topic or set of topics?
2. How accurate and reliable are existing sentiment analysis models for analyzing news articles, and what types of customizations or training are needed to improve their performance?
3. How do different sources of news articles (social media, traditional news outlets, blogs) vary in terms of their sentiment and relevance to specific topics?
4. What are the most effective methods for visualizing and presenting sentiment analysis results to users, and how can these be customized to meet the needs of different stakeholders?
5. How can sentiment analysis be used to identify trends and emerging topics in a specific industry or field, and what types of insights can be gained from this analysis?
6. What are the ethical and legal implications of using sentiment analysis to curate and analyze news articles, and how can these be addressed in the development and implementation of the solution?
7. How do different user groups (analysts, executives, investors) use curated news and sentiment analysis, and what other features and functionalities can be important to these users?

Business Challenge Scope

The scope of the PoC was determined by the project team in collaboration with Syntasa. Foundational data and base material were provided as inputs to the Syntasa team as detailed below:

Relevant topics of interest to TREFT business operations were provided to Syntasa in the form of a holistic Excel document with the following structure. Major themes were developed, and various subtopics were categorized into the themes, which then formed the pool of relevant FinTech and technology-related keywords. To provide additional filter mechanisms and take into account the geographical relevance of the topics, an additional list of geographic locations and regions was provided, with the theme subtopics yielding more specific and relevant search results. A brief example of the structure of the inputs can be seen in Table 2.1, and a detailed overview is provided in Appendix B.

TABLE 2.1

Theme: Asset Tokenization; Digital Currency; Web3
Keywords: Fungible tokens; ICO (Initial Coin Offering); NFT (Non-Fungible Tokens); Programmable Money; Programmable Payments; Carbon tokenization; Security Tokens Offering (STO); CBDC (Central Bank Digital Currency); Delivery versus Payment (DvP); Digital Assets; Digital Wallet; Stablecoin; FOMO (Fear of Missing Out); Instant Payment; Blockchain; Cryptocurrency; DApps (Decentralized Apps); DLT (Distributed Ledger Technology); Decentralized Autonomous Organizations (DAOs); Decentralized Finance (DeFi); Interoperability
Filters:
    List of Regions: (North America, South America, Europe, MENA, Asia, etc.)
    List of Domains: (federalreserve.gov, ecb.europa.eu, bankofcanada.ca, mas.gov.sg, imf.org, etc.)

Value Proposition

The following are value drivers for the proposed solution:

- Stay informed on industry trends and news: Allows users to stay up-to-date on the latest news and developments in the finance and technology industries, including emerging trends and topics.
- Gain insights into sentiment trends: Allows users to quickly identify shifts in sentiment towards specific topics or companies, providing valuable insights into market trends and sentiment.
- Monitor partners: Users could track news and sentiment around member countries, NGOs, commercial banks, and other partners, enabling them to stay informed on their actions and strategies.
- Make data-driven decisions: Accurate and reliable sentiment analysis on desired topics to help users make data-driven decisions based on real-time insights.
- Save time and resources: Users can save time and resources that would otherwise be spent searching for and analyzing news articles manually.

Capabilities that could be included in the dashboard to support these value drivers include:

- Customizable news feeds: Users could customize their news feeds to only show news articles related to specific topics or keywords, ensuring that they only see relevant content.
- Sentiment analysis: Flexibility to filter by sentiment on specified topics or across the geographic landscape to understand how different regions or industries react to fintech.
- Real-time updates: Users may adjust the time horizon to understand how topics in fintech have evolved over time or receive alerts in real time.
- Customizable alerts: Users could set up alerts to notify them of changes in sentiment or news related to specific topics or companies, enabling them to stay informed without constantly monitoring the dashboard.
- Integration with other tools: The dashboard could be integrated with other tools, such as trading platforms or financial analysis tools, allowing users to make data-driven decisions directly from the dashboard.
- Possibility of integrating generative AI in future.

By incorporating these value drivers and capabilities, a dashboard that shows finance and technology-related news with sentiment analysis could provide valuable insights and result in time savings for its users.

SECTION 3
COLLABORATION WITH GOOGLE CLOUD AND SYNTASA

Rapid Prototyping with Technology Partners

Add content on the motivation to learn from the Google Cloud Platform (GCP), and on designing a prototype solution with a technology partner.

About Syntasa

Syntasa is a cloud-based data and AI platform that enables users to connect various data sources, build and deploy customized AI/ML models, and activate them across various channels through dashboards, data shares, and APIs. This tool provides users with visibility into the full data pipeline, including data source, dependencies, and how the data is being used to drive insights. The Syntasa platform is built with leading open-source technologies, and is powered by GCP services.

The Syntasa platform uses the concept of apps (along with the sequencing of those apps) to accelerate time-to-value; improve reliability and efficiency; and provide significant return on investment over home-grown cloud-based solutions. The apps provide low-code or no-code to full-code capabilities, which allows business users, analysts, data scientists, and data engineers to collaborate, and to leverage and share their expertise.

The Syntasa platform runs natively in an organization's GCP with the data stored in Google Cloud Storage and BigQuery. Organizations can keep
their sensitive data inside their virtual private cloud (VPC) and behind their firewall, thus maintaining full control, while leveraging the power of advances in big data processing and AI/ML that are being provided by Syntasa and Google Cloud services.

FIGURE 3.1: Syntasa Solution

Syntasa's capabilities make it a powerful tool for rapid prototyping, enabling users to quickly iterate and refine prototypes based on real-time data and insights. Benefits include:

- Rapid prototyping from a low-code drag-and-drop interface and a full-code interface
- Native support for GCP infrastructure
- Apache Spark and Kubernetes runtime support
- Templatized integrations, processes, and apps to enable consistency and code reuse
- Scalable data and AI app framework for development and production
- Integrated production data + feature + activation pipelines
- Collaboration, version control, and an automated documentation framework
- Advanced job definition, scheduling, and management capabilities with job failure alerts
- Data quality monitoring with visibility into data provenance and lineage
- Business alerting and model performance monitoring

About the Google Cloud Platform

GCP is a suite of cloud computing services offered by Google. It runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, Google Drive, and YouTube. GCP offers a scalable range of services such as computing, networking, storage, big data, security and identity management, management tools, cloud AI, IoT (Internet of Things), and more. Some examples of GCP services are: Compute Engine, App Engine, Kubernetes Engine, Cloud Functions, Cloud Run, Cloud Storage, Cloud SQL, BigQuery, Cloud Pub/Sub, and TensorFlow services.

Global Network

Google Cloud has a worldwide presence. Google's global network, connected via high-speed cables, enables data movement across the globe in a highly performant and secure manner. Google Cloud offers FedRAMP moderate cloud services in
Google Cloud data centers around the world, which gives organizations the ability to move data securely and compliantly from one part of the world to another in order to meet key objectives such as data backup requirements.

BigQuery

BigQuery is Google Cloud's planet-scale, completely serverless, and cost-effective enterprise data warehouse that works across clouds and scales with your data. With BigQuery, Google has separated compute and storage, and connected them via the Petabit network, allowing the compute and storage functions to expand vertically and independently of each other. This allows users to leverage as many compute slots as necessary to answer a query; as a result, BigQuery offers measurable performance gains compared to other analytical systems.

BigQuery Omni: Google gives organizations the ability to leverage BigQuery even if users are housing data with other cloud service providers, or on-premise, with BigQuery Omni. When users deploy BigQuery Omni, they are able to query data that is stored outside Google Cloud, for example in Microsoft Azure or AWS in a tabular format, as if the data were being stored in a Google Cloud BigQuery environment. This capability allows users to receive all the benefits of Google BigQuery without requiring them to move the data across public clouds.

Data Governance: BigQuery allows for row-level and column-level security as well as other IAM-based permissions at the table and dataset levels. Combined with a DLP solution, BigQuery is one of the most extensible and secure solutions in the cloud today, and these data governance capabilities can also be applied to other clouds via BigQuery Omni.

Translation

Google Cloud offers out-of-the-box (OOTB) translation capabilities that allow translation in 100+ languages. These translations do not require any pretraining, and are available as APIs to be consumed. These translations are some of the highest quality translations in the industry. Today Google offers both text and document translation capabilities. We believe that this will allow the World Bank to meet the needs of its global audience effectively.

DocumentAI

DocumentAI is another differentiator for Google Cloud. It allows for extraction of OCR text and key-value pairs from documents with the highest fidelity, and works particularly well with handwritten documents. The suite includes Document Warehouse, which is a hosted repository of documents. Document AI and Document Warehouse are going to be the earliest targets for introducing large language models (LLMs), which will allow a unified cloud search experience, along with natural language processing (NLP)-based offerings like summarization and chatbots.

Looker

Looker is Google's cloud-based
data exploration, discovery, and data analytics platform. Key information is typically stored in a number of different data stores, each with their own schemas and access processes. Looker provides discovery and real-time analysis of data across multiple data stores, which is critical in understanding disparate information from a business and technical perspective.

Looker strikes a balance between governance and self-service in the deployment of analytics. This scalable, real-time approach prevents data sprawl and duplication headaches, including the common issue of having multiple versions of the same business intelligence (BI) reports and dashboards. Looker is capable of presenting dashboards and reports within the application, embedded in portals, and via third-party BI tools such as Tableau.

Looker Blocks are free, reusable, and customizable OOTB templates that provide a head start in creating value from data. With Blocks, nontechnical users can quickly turn data into dashboards that can either be used as-is or be easily customized and blended with other data to meet specific needs. Blocks have been prebuilt to model and visualize a wide range of common use cases such as multicloud cost analysis, data warehouse log analysis, and much more. More than 150 Blocks are available for downloading from the Marketplace: https:/.

Solution Overview and Key Results

The Syntasa Data and AI platform was utilized for this POC to demonstrate rapid prototyping of several sentiment analytics use cases in Google Cloud Platform (GCP). The platform simplifies the use of GCP cloud services for data scientists and analysts, allowing them to either code or visually build their apps. This helps users focus on constructing their data and AI pipelines using familiar user interfaces like Jupyter Notebook or Syntasa's low/no code workflow processes.

The POC involved the creation of six Syntasa apps and Looker dashboards. These apps and dashboards explored a wide range of data and AI capabilities, including data ingestion, topic modeling, sentiment analysis, language translation, trend analysis, and AI explainability. The apps and dashboards covered the following use cases:

- Trusted Domains
- Uncertain Domains
- Chinese Language
- PDF Sourcing
- Trend Analysis
- Sentiment Explanation

Key Results

The key results obtained and demonstrated through dashboards, analysis, and discussions are:

- World Bank Group (WBG) domain experts can gain deeper and quicker insights into their subject areas of interest by leveraging automated AI/ML technologies.
- WBG domain experts can focus their efforts by using customized narrative and topic modeling apps, and dashboards tailored to their needs by defining themes, keywords, data sources, languages, categories, and geographies of their choice.
- Sentiment analytics solutions that leverage large language models (LLMs) can classify positive and negative sentiment with greater than 85 percent accuracy when compared to manually classified relevant text.
- Google Translate APIs outperform open-source models by a wide margin, with 96 percent of translations done by Google Translate being deemed acceptable.
- Automated data and AI pipelines can extract full PDF reports from trusted sites and apply AI-based summarization and topic modeling to help WBG experts track the latest developments in their topics of interest.
- Trend analysis can be fine-tuned to the needs of the WBG team to detect and alert users to rising and falling topics, and to highlight emerging high-visibility events such as the FTX collapse and the
Silicon Valley Bank failure.

For more details on app configuration and dashboard usage, please refer to Technical Approach (Syntasa).

SmartFi-Trusted Domains

The SmartFi-Trusted Domains app was developed to analyze and provide insights from the “SmartFi” content that is available on trusted domains. The app connects to the Brandwatch API; extracts relevant text; loads data into the GCP; filters and transforms the text for topic modeling and sentiment analysis; adds WBG-defined themes; and prepares an analysis-ready dataset for the trusted domain narrative dashboard. More than five years of historical data has been processed, and the production pipeline is updated daily.

The SmartFi-Trusted Domains dashboard, built using Google's Looker, provides comprehensive visibility into FinTech-related articles about and conversations on trusted domains. Users can analyze key performance indicators (KPIs) and time series charts, and can drill down to original news or social media mentions. The dashboard includes filters for geographic regions, trusted domain categories, domain URLs, and sentiment, allowing for granular analysis of specific regions or categories. The screenshot in Figure 3.2 shows a comparison of activity, sentiment, and trends, based on the categories defined by the project teams.

FIGURE 3.2: Modeled Mentions

For more technical detail on data sources and topic modeling, see Technical Approach (Syntasa); and for more detail on the narrative dashboard, see Appendix A.

SmartFi-Uncertain Domains

The SmartFi-Uncertain Domains app focuses on all domains that are not included in the trusted domain list. The workflow consists of multiple steps similar to the ones in the Trusted Domains app, with the addition of a process that uses the Twitter API and Twitter IDs to extract Tweet texts for topic modeling. The data extraction is sampled at 2 percent, and over 3 months of historical data has been processed. The production pipeline is updated daily.

The SmartFi-Uncertain Domains dashboard offers comprehensive visibility into FinTech-related articles on and conversations about websites and social media platforms beyond the trusted domains. Users can analyze the impact of major events, such as the FTX and Silicon Valley Bank collapses, and can explore discussions with and without hashtags. The right panel of Figure 3.3 shows the phrases that were present when authors mentioned cryptocurrency exchange.

FIGURE 3.3: Word Cloud

For more technical detail on data sources and topic modeling, see Technical Approach (Syntasa); and for more detail on the narrative dashboard, see Appendix A.
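
The Trusted Domains workflow (and the similar Uncertain Domains workflow) adds WBG-defined themes to each mention, but the report does not say how that matching is done. The following is only a minimal sketch of one way such a step could work, assuming plain case-insensitive keyword matching. The keywords are taken from Table 2.1; their grouping under each theme, the `THEME_KEYWORDS` mapping, and the `tag_themes` helper are all hypothetical names introduced for illustration, not Syntasa's implementation.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical theme-to-keyword mapping. The keywords appear in Table 2.1,
# but which theme each keyword belongs to is an assumption made here.
THEME_KEYWORDS: Dict[str, List[str]] = {
    "Asset Tokenization": ["fungible tokens", "ICO", "NFT", "security tokens offering"],
    "Digital Currency": ["CBDC", "stablecoin", "digital wallet"],
    "Web3": ["blockchain", "cryptocurrency", "DeFi", "DLT"],
}


def compile_patterns(theme_keywords: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Build one case-insensitive, whole-word regex per theme."""
    patterns = {}
    for theme, keywords in theme_keywords.items():
        alternatives = "|".join(re.escape(keyword) for keyword in keywords)
        patterns[theme] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


PATTERNS = compile_patterns(THEME_KEYWORDS)


def tag_themes(text: str) -> List[str]:
    """Return every theme with at least one keyword present in the text."""
    return [theme for theme, pattern in PATTERNS.items() if pattern.search(text)]


if __name__ == "__main__":
    mention = "The central bank published new guidance on CBDC and stablecoin risks."
    print(tag_themes(mention))
```

A mention can carry several themes at once, which suits dashboard filters that slice the same dataset by category. Whole-word matching keeps short acronyms such as ICO from matching inside longer words, although a production pipeline would likely need something more robust than keyword lookup.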

98、SmartFi-Chinese LanguageThe SmartFi-Chinese Language app was created to demonstrate the translation capabilities of Google Cloud,and compare them with open-source translation routines.The workflow consists of multiple steps,similar to those in the Trusted Domains app,with the addition of multiple tr

99、anslation processes.Only one day of Chinese language mentions(a little less than 1M mentions)were processed,and the production pipeline was not activated.The SmartFi-Chinese Translation dashboard offers the same analytics abilities as the Uncertain Domains dashboard,but with a focus on Chinese langu

100、age content.Users can explore and compare narratives expressed by authors in Chinese,with both the original and translated text displayed side by side.Figure3.4 shows the domains,authors,and sample translated and original text.FIGURE 3.4:Domain SourceFor more detail on comparison of translation algo

101、rithms,see Chinese Translation.Project SmartFi:Exploring AI/ML for FinTech News18SmartFi-PDF SourcingThe SmartFi-PDF Sourcing app was created to demonstrate rapid prototyping capability,exploring both website crawling and search API approaches for automating the extraction of PDF reports from truste

102、d sites.The search API approach was found to be more targeted and efficient.The SmartFi-PDF dashboard provides a faster way to acquire information from trusted data sources,displaying links to PDFs,AI-generated summaries,and topic modeling analysis of the Figure 3.5 below shows that over 5,000 PDF d

103、ocuments were automatically downloaded and analyzed from the European Central Bank site.FIGURE 3.5:Domain and PDF SourcingFor more detail on PDF sourcing implementation,see PDF Sourcing.19Collaboration with Google Cloud andSyntasaSmartFi-Trend AnalysisThe SmartFi-Trend Analysis app analyzes the outp

104、ut of the SmartFi Trusted Domain app to identify rising and falling topics and phrases.Users can customize trend analysis,for example by using a seven-day rolling average to smooth out daily fluctuations.The app has analyzed the trusted domain app output from October 2022 and is updated daily.The Sm

105、artFi-Trending Dashboard displays the results of the trend analysis,allowing users to detect and alert rising and falling topics,and to highlight emerging high-visibility events,such as the FTX collapse and the Silicon Valley Bank failure.The left panel in Figure 3.6 shows the top five topics/phrase

106、s by volume, and the right panel shows the top five rising topics/phrases on Nov 10, 2022. As can be seen, a day before the FTX collapse on Nov 11, 2022, Alameda Research was the top rising phrase in the trusted sources data.

FIGURE 3.6: Trending Topics

SmartFi-Sentiment Models and Explainability

The SmartFi-S

107、entiment Explanation app was created to address two research questions: 1) comparison of different sentiment analysis models; and 2) analysis of gender and race bias. The FinBERT model was found to be over 85 percent accurate for positive and negative sentiment classification when compared to manually cl

108、assified relevant text. The SmartFi-Sentiment validation dashboard provides an in-depth view of the sentiment analysis, allowing users to explore the performance of different sentiment models, such as the FinBERT model and Google's AutoML. The middle panel in Figure 3.7 below shows that the FinBERT m

109、odel was over 85 percent accurate for financial text.

FIGURE 3.7: Sentiment Validation

For more detail on the sentiment analysis implementation, see Sentiment Analysis.

The dashboard also enables users to analyze potential gender and race biases in sentiment classif

110、ication, providing insights into ensuring unbiased analysis of financial narratives.

FIGURE 3.8: Sentiment Model Explainability

For more detail on the model explainability, see Trustworthy and Explainable AI.

Technical Approach (Syntasa)

Data Sources and Preparati

111、on

Reference Data

Working in collaboration with Syntasa, the World Bank Group (WBG) provided a number of parameters to help scope this project, facilitate data collection, and ensure alignment with WBG business objectives. These data were defined by the WBG in a spreadsheet that included themes and keywords

112、 related to financial technology in both English and Chinese; a prioritized list of online news and media websites referred to as Trusted Domains; and geographic regions of interest. (For more detail see Appendix B: Reference Data.)

SmartFi-Trusted Domains

The goal of the SmartFi-Trusted Domains solution

113、is to extract meaningful insights from the WBG Trusted Domains. The SmartFi-Trusted Domains app contains the pipeline that was created in Syntasa to ingest and process the underlying data needed to accomplish this. The app includes a combination of ready-made and custom processes to ingest the Brandwa

114、tch Trusted Domains dataset and the WBG themes, categories, and regions into BigQuery; then process each mention to mitigate noise; apply the predefined WBG themes, categories, and regions; and extract topics, phrases, and companion phrases. Finally, the data is combined into a single curated dataset used for

115、analysis in Looker. Figure 3.9 shows the data and AI pipeline for the SmartFi Trusted Domain app configured in the Syntasa Platform. (For more details see Appendix D: SmartFi Trusted Domains Technical Details.)

FIGURE 3.9: Solution Architecture

SmartFi-Unc

116、ertain Domains

The SmartFi-Uncertain Domains app contains the pipeline that was created in Syntasa to ingest and process the underlying data needed to extract meaningful insights from sources, explicitly excluding the WBG Trusted Domains. A lighter version of the SmartFi-Trusted Domains app, this app in

117、cludes a combination of ready-made and custom processes to ingest the Brandwatch Uncertain Domains dataset and the WBG themes into BigQuery; apply predefined WBG themes; and then extract topics, phrases, and companion phrases. Licensing restrictions prevent Brandwatch from providing any Twitter tweet tex

118、t via the Brandwatch API, so the app also retrieves the full tweet text directly from the Twitter API. Finally, the data is combined into a single curated dataset used for analysis in Looker. Figure 3.10 shows the data and AI pipeline for the SmartFi Uncertain Domain app configured in the Syntasa Platfo

119、rm. (For more detail see Appendix E: SmartFi Uncertain Domains Technical Details.)

FIGURE 3.10: Data and AI pipeline

SmartFi-Chinese Language

The SmartFi-Chinese Language app contains the pipeline that was created in Syntasa to ingest and process the underlying

120、data needed to extract meaningful insights from the Chinese mentions. The app functions identically to the SmartFi-Uncertain Domains app, with the addition of a ready-made translation process to translate Chinese snippet text into English. Figure 3.11 shows the data and the AI pipeline for the SmartFi

121、Chinese Language app configured in the Syntasa platform. (For more detail see Appendix F: SmartFi Chinese Language Technical Details.)

FIGURE 3.11: Chinese Language App Configuration

Topic Modeling

Syntasa has conducted topic modeling on social media and n

122、ews texts to bring to light the most dominant and frequent conversations contained within them. The strategy is to start with a general subject area (for example, text that contains keywords related to finance) and further break it down into expert-defined themes (top-down) and AI-identified topics (bottom-

123、up) for quick discovery of the narratives that are being discussed. Syntasa's focus is on automating the clustering workflow so as to lower manual oversight, work dynamically on either small or big data, automatically discard irrelevant text, and preserve the most dominant clusters, which will also be self

124、-named. An unsupervised clustering approach is most useful because the topics (or classes) are not known beforehand. Likewise, developing a classifier through clustering would not be a suitable solution because it would not be able to discover new conversations as they appear in real time. Some of th

125、e popular approaches to clustering involve algorithms such as KMeans or LDA, which can be used to group similar sentences/text together, but these have some downsides, especially with very diverse text. These algorithms require knowing beforehand how many clusters it

126、 is optimal to create; otherwise the clusters start blending words that have no similarity to each other. Determining the optimal number of clusters (K) requires sampling many different Ks, and having the manual oversight needed to search for that number. There is also no guarantee that the sampling will

127、include the optimal K of the text; rather, the analysis would select only the best K of the sampling. Therefore, searching for optimal K with manual oversight increases computational and labor costs. In the case of social media and news text, conversations can be diverse to the point where it becomes impr

128、actical to find the optimal K needed in order to try to force all of the text into respective clusters. Examining the contents of these clusters is usually done by pulling n-grams, bigrams, or trigrams, with an analyst manually determining the “topic” that is being discussed. Because sentences have long str

129、uctures compared to n-grams, there will be a mixture of unrelated n-grams in a group that is supposed to summarize the cluster content. To overcome the problem of manually naming clusters based on n-grams that likely do not have similarity to each other, a novel approach had to be developed. Rather than

130、 focusing clustering at the sentence level, then examining n-grams, Syntasa developed a strategy that begins with a focus on the n-grams themselves. First, the highest-occurring n-grams are used as “topics” and serve as the cluster centers, which are checked against other non-topic n-grams for similarity. This

131、 allows the topic to be self-named, and only similar phrases to be grouped together; this provides an obvious answer to the content of the cluster. Similarity checks between the n-grams are made by using a BERT embedder with cosine similarity. All of the text that does not get linked to a topic then get

132、s discarded. The discarded text includes very short text that cannot form a valid n-gram; text that contains a valid n-gram that does not occur often enough; or text that contains a valid n-gram that is unrelated to the top topics. Discarding text is a desired side effect, because it does not contaminate the other

133、 text.

FIGURE 3.12: Topic Modeling Parameters

While Syntasa's topic modeling results in rapid self-naming topics, the term “topic” is itself a subjective term. To some analysts, a topic could be as high-level as “crypto” or “banking,” but to others it might be mo

134、re granular; for example, “cryptocurrency exchange,” “blockchain technology,” or “smart contracts.” The more granular the topic, the more topics there will be; this can overwhelm the analysis, but it can also be more informative about the narrative. To preserve both the high-level and granular topics, Syntasa

135、prepares the data carefully for dashboards in order to give the user full filtering capabilities with which they can narrow the conversation further. A user can start with the larger topics that are generated and then delve into the various narratives surrounding it.

136、FIGURE 3.13: Dashboard Trending Phrases

Sentiment Analysis

Syntasa conducted an experiment that involved exploring the use of additional NLP models for producing sentiment classifications and determining the agreement levels between the models and members of the World Bank Group. This was an effo

137、rt to improve on the built-in sentiment model coming from Brandwatch, which proved to be a black box that made unreliable classifications. Two models from Hugging Face's model repository were selected. First, there was a model trained on 124 million tweets that learned colloquial conversation; next, a model na

138、med FinBERT was trained to understand financial terminology. Both models proved to be good in their respective fields. The Twitter model could accurately identify positive or negative text, for example, in the context of reviewing products (in this case, crypto exchanges), whereas the FinBERT model did a b

139、etter job of accurately classifying financial terms (for example “surged 27 percent,” or “$40bn implosion”). If a mix of colloquial talk and formal financial talk is to be collected in the future, an ensemble or combination of these models could be used to capture more of the text accurately.

140、FIGURE 3.14: Sentiment Explainability

In order to evaluate the accuracy of each model (Twitter, FinBERT, and Brandwatch) we asked a WBG domain expert to manually score over 200 texts into negative, neutral, and positive sentiment classes. This enabled us to determine the

141、accuracy of each model using the WBG scores as ground truth. Our model evaluation showed that:

• The Brandwatch sentiment model, at only 24 percent accuracy, was unacceptable, as we had seen in working with other clients.

• The Twitter roBERTa sentiment model was also unacceptable. It was only 47 percent accur

142、ate; after some tuning (setting the confidence score threshold to 70 percent and above) we were able to increase the accuracy, but only to 51 percent.

• The FinBERT model, on the other hand, started with 64 percent accuracy, and after adjusting the confidence score threshold to 70 percent, the accuracy was increased to 75 percent.
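
The effect of a confidence threshold on measured accuracy, as described in the evaluation above, can be illustrated with a minimal sketch. It assumes each prediction carries a model label, a confidence score, and the expert's label. The `Prediction` class, the `accuracy_above_threshold` function, and the sample rows are hypothetical and are not the Syntasa implementation or the WBG evaluation data.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str          # model output: "positive", "neutral", or "negative"
    confidence: float   # model confidence in [0, 1]
    expert_label: str   # manually assigned ground-truth label


def accuracy_above_threshold(predictions, min_confidence=0.70):
    """Return (accuracy, records_kept) for predictions at or above min_confidence.

    Accuracy is None when no prediction meets the threshold.
    """
    kept = [p for p in predictions if p.confidence >= min_confidence]
    if not kept:
        return None, 0
    correct = sum(p.label == p.expert_label for p in kept)
    return correct / len(kept), len(kept)


# Hypothetical sample rows, for illustration only.
sample = [
    Prediction("positive", 0.95, "positive"),
    Prediction("negative", 0.55, "neutral"),
    Prediction("negative", 0.82, "negative"),
    Prediction("neutral", 0.40, "positive"),
    Prediction("positive", 0.75, "negative"),
]

for threshold in (0.0, 0.70):
    accuracy, kept = accuracy_above_threshold(sample, threshold)
    print(f"threshold={threshold:.2f} kept={kept} accuracy={accuracy:.2f}")
```

Raising the threshold trades coverage for accuracy: fewer records qualify, but those that do are more likely to match the expert label. Reporting the number of records kept alongside the accuracy makes that trade-off visible.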

143、The FinBERT model was the only acceptable model we found. We also experimented with ensemble models, in which the answers of the Twitter or FinBERT model could supersede the other model based on higher confidence scores. This did not increase accuracy by a lar

144、ge factor, but it did slightly increase the number of records qualifying above the 70 percent confidence score. The Brandwatch model proved to be the least accurate, and it also did not offer the ability to conduct bias tests or see confidence scores. The WBG expert also classified the text in the experim

145、ent to indicate whether it was relevant or not. More than 75 percent of the text was deemed relevant. When focusing on WBG's relevant text, only positive and negative sentiments, and a confidence score of greater than 60 percent, FinBERT produced an accuracy of 86 percent. While a “relevant” classificatio

146、n will not be available on new live data, a text classifier model could be developed to further narrow down the text. The screenshot of the dashboard in Figure 3.15 shows that 86 percent agreement was captured.

FIGURE 3.15: Sentiment Validation

This dashboa

147、rd was created in order to observe the results of the sentiment model validation test. The three pie charts show, in order: the distribution of the sentiments that WBG supplied; the sentiment distribution of the model selected; and the percentage of agreement between the WBG expert and the model. Filters a

148、llow for the selection of:

• The 3 models + the 2 ensemble methods
• The sentiment outputs of positive, negative, neutral
• Confidence scores
• Agreement between the World Bank Group and the model
• The World Bank Group's relevance indicator

These filters are useful for narrowing down the acceptability of the models

149、, such as on specific classifications and/or at specific confidence levels. In conclusion, we found that FinBERT, an open-source sentiment model, can be an effective way of producing accurate sentiment classifications that are closely in line with WBG's expert opinions. We also demonstrated that accuracy c

150、an be boosted by adjusting the confidence thresholds, and by limiting the scope to just positive and negative sentiment classes.

Chinese Translation

The Helsinki-NLP open-source model https://huggingface.co/Helsinki-NLP/opus-mt-zh-en was used for translation from Chinese into English, with similar models made availab

151、le for all of the most common languages. Syntasa conducted a comparison of two translation options, Hugging Face/Generic Models and Google's Cloud Translation service. These two options differ in several key aspects, including cost, speed, customization, and language support. Hugging Face models are a lower c

152、ost option for translation needs. They are easy to customize by simply swapping the model type, and they offer a relatively low-cost run. However, they are slower than Cloud Translation, and they require language detection capabilities in order to handle multiple languages. Hugging Face models are also limi

153、ted in their language support, as they are designed for specific languages and require additional work to add language detection capabilities.

Google's Cloud Translation service is a fast and flexible option. It can easily be configured to stream directly, and i

154、t is capable of translating any language it can detect. This language-agnostic nature makes it a more flexible option for multilingual projects. However, Cloud Translation is also more expensive than the Hugging Face models ($20 per 1 million characters). This can make it a less suitable option for proje

155、cts that require higher-volume translations and are operating within tight budgets. As part of the comparison, the WBG team manually evaluated the translations produced by each option. The Google translations were found to be superior in quality, mostly because the Hugging Face model did not completely tr

156、anslate the entire text. This analysis indicates that both options can be used, but the Google translation is preferable when a higher-quality translation is required. Figure 3.16 shows the comparison of the cloud vs the model translations, where it can clearly be seen that the cloud translation results

157、 are superior.

FIGURE 3.16: Language Translation Performance (Translation Comparison: Cloud Win 71%, Even 25%, Hugging Face Win 4%)

Ultimately, the choice between Hugging Face models and Google's Cloud Translation service will depend on the specific needs of the project. If cost is a primary concern, Hugging Face model

158、s offer a cost-effective solution. If flexibility and speed are important factors, Cloud Translation is the better option, despite its higher cost. The language requirements of the project should also be considered, since Hugging Face models may require additional work to support multiple languages.

159、PDF Sourcing

To gather PDFs related to specific topics, Syntasa explored two primary methods: search APIs and website crawling. Search API, the most effective solution, was previously offered by Google but is no longer available. As an alternative, Syntasa use

160、d Bing Web Search to gather PDFs related to specific topics. This allowed for a more streamlined and efficient process for sourcing relevant PDFs. While website crawling is an ideal method for gathering every PDF, it proved to be slow, cumbersome, and expensive. Therefore, it is not recommended as a primar

161、y method for PDF sourcing. (See Figure 3.17.)

FIGURE 3.17: PDF Sourcing

Syntasa used the Bart-Large-CNN summarization algorithm to effectively summarize the content of the sourced PDFs. This app can be easily modified to incorporate any other summarization algorithms used by the WBG, or any publicly avail

162、able summarization models. Other models were tested, including Google's Pegasus model, but Syntasa did not perform in-depth evaluation and comparison of the two models. The Bart-Large-CNN algorithm performed sufficiently well for this use case since the focus was on PDF extraction.

163、In conclusion, Syntasa used Bing Web Search as an alternative to the previously offered Google Search API to gather PDFs related to specific topics. The Bart-Large-CNN summarization algorithm was used to effectively summarize the content of the PDFs, proving that it is possible to ex

164、tract PDF documents from specified domains and summarize them in order to increase the efficiency of the current manual process. For future evaluation and exploration, we recommend a more systematic review of various summarization algorithms.

Trustworthy and Explainable AI

For the two sentiment models p

165、ut in place, bias tests were conducted by Syntasa using a Python library called Transformers-Interpret. This library can explain a PyTorch model derived from Hugging Face to display the weights of the features. It uses Facebook's Captum to apply integrated gradients on the features in order to obtain th

166、e weights of each word in the text. By using this library, Syntasa was able to determine whether the way the models reacted to gender or racial terminology was significantly different. Using a word replacement strategy, the same sentence was used while switching out gender or race-related words (for exam

167、ple, “he” or “she”). The sentiment and confidence scores were then examined to determine whether these kinds of words could sway the model. In all of the experiments, the sentiment never changed, regardless of race or gender terms, and the confidence scores showed only slight variability.

FIGURE 3.18: Sentimen

168、t Explainability

The dashboard screenshot in Figure 3.18 shows the outputs of the model trained on Twitter data and how it reacted to identical sentences where gender-based terminology was swapped out. In all three examples, we can see that the sentimen

169、ts were the same whether a male or a female term was used; there were also similar confidence levels. In conducting this experiment, Syntasa now has the structure built for future bias tests that will be able to accommodate new types of testing, as desired by the WBG.

Solution Integration

There are many o

170、ptions available for integrating the Syntasa Data and AI platform and the sentiment analytics solution with the WBG IT environment. Figure 3.19 shows the technical deployment architecture for the Syntasa Platform in GCP, including the GCP services and network configuration.

FIGURE 3.19: Solution Archite

171、cture

The current POC was conducted as a private cloud SaaS where a single-tenant solution was hosted in a dedicated GCP project within a fully controlled Virtual Private Cloud (VPC). A similar solution architecture has been deployed for clients with highly s

172、ensitive data, and this architecture, when deployed in the WBG's GCP organization, can achieve the highest level of compliance, including FedRAMP High. The solution can support Single Sign On to simplify the connectivity from the WBG network, using the existing corporate authentication services. For the POC

173、, since only publicly available data was used, it was determined that the simplest compliant option was to host the solution in a Syntasa-controlled GCP project, and use the WBG GCP billing account. Given the initial success of the POC in demonstrating the potential of using large language models for au

174、tomating text and sentiment analysis for several use cases, and with the additional exploration, development, and testing required to reach a production-ready state, we can envision proceeding with either a similar arrangement of a Syntasa-controlled GCP project, or a WBG-controlled GCP organization, folder, a

175、nd project.

SECTION 4
LEARNING OUTCOMES AND FUTURE CONSIDERATIONS

Technical Learnings for World Bank

Topic Modeling

Topic modeling is a statistical and computational technique used to identify underlying topics or themes within a collection of texts or do

176、cuments. It is a process of extracting meaningful patterns or themes from large volumes of text data. The goal of topic modeling is to identify the most significant topics present in the documents without prior knowledge of the topics. The most commonly used types of topic modeling are Latent Dirichlet

177、 Allocation (LDA) and Non-Negative Matrix Factorization (NMF). LDA is a probabilistic model that assumes that each document contains a mixture of topics, and each topic is a probability distribution over words. The model infers the topics based on the distribution of words in the documents. The output of L

178、DA is a set of topics, along with the distribution of each topic across the documents, and the distribution of each word across the topics. NMF is a matrix factorization technique that decomposes the document-term matrix into two matrices, one representing the topics, and the other representing the words

179、 in the topics. The output of NMF is a set of topics, along with the weight of each word in the topics. Previously, TI Lab worked on several projects that required topic modeling. LDA was used primarily to tackle the grouping of documents into clusters. During our collaboration with Syntasa, they introduce

180、d us to their custom algorithm, which has proven to be more accurate and robust than LDA.

Syntasa's team introduced us to a mix of three different algorithms used to achieve the project's objective. This objective is based on the need for the text snippets to

181、be assigned to multiple topics and for the topics to be named automatically. Since FinTech-related social media posts are extremely diverse, using the LDA algorithm is no longer a viable option.

182、FIGURE 4.1: Topic Modeling

Syntasa clustering:
• Fast clustering (similarity checks using cosine similarity with volume criteria)
• Graph Networks (to link snippets to multiple topics)
• K-means clustering (highest occurring phrases as cluster centers)

183、The K-means topic modeling technique is a clustering method that groups the documents into a fixed number of topics based on the

184、 similarity of their word frequencies. The key disadvantage of this modeling technique is that it requires manual naming of topics, and it assigns only one topic per snippet of text. It is also sensitive to the initial conditions, and the results may vary depending on the random initialization of the al

185、gorithm. However, if used in combination with other algorithms, it can provide valuable insights.

Fast Clustering is another type of topic modeling that works somewhat like hierarchical clustering, but is tuned for speed. It is useful when the number of clusters is unknown and the dataset is quite large. W

186、ith fast clustering, the developer can freely configure the threshold of what is considered to be similar. A high threshold will only find extremely similar sentences; a lower threshold will find more sentences that are less similar to each other.11 https:/

Gra

187、ph Networks can also be used to link multiple topics to a text. Graph Networks represent the documents and topics as nodes in a graph, and the relationships between them as edges. By analyzing the graph, it is possible to identify the most significant topics and their relationships to the documents. Grap

188、h Networks can also be used to visualize the topics and their relationships, making it easier to interpret the results. The solution is fully scalable, using Apache Spark on large data to take advantage of the cloud infrastructure. The outcome for a real-life example is described using the image shown i

189、n Figure 4.2. The phrase that was analyzed is “The blockchain network allows users to avoid Central Banks.” This sentence clearly has more than one topic, and the figure shows how it can be connected to three different topics: Allows Users; Blockchain Technology; and Central Banks.

FIGURE 4.2: To

190、pic Modeling Explainer

Sentiment Analysis

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. O

191、pen source software tools, as well as a range of free and paid sentiment analysis tools such as RoBERTa, Google Cloud translation, and BERT automate sentiment analysis on large collections of texts, including web pages, online news, and blogs. Sentiment analysis is widely used at ITSTI to analyze internal do

192、cuments, risk management, feedback review, online and social media data, and so on. Models pretrained on different datasets have different capabilities and strengths. Sentiment models should be selected based on the specific business demands and the available data. After exploring three Sentiment Analysi

193、s models for financial data in this prototype, we determined that the FinBERT model focuses on financial data and produces better results.

TABLE 4.1: Sentiment Analysis Models

Brandwatch | FinBERT | Twitter roBERTa
Multilingual Sentiment Model | FinBERT is a pre-trained NLP model to analyze sentiment of financial text | roBERTa-base model trained on 58M tweets and finetuned for sentiment analysis with the TweetEval benchmark
Hybrid approach to Sentiment Analysis: Knowledge-Based - ML - Custom Rules | Narrow focus on financial data | Effective at picking up colloquial talk

194、Chinese Translation

AI translation is a machine trans

195、lation process based on complex, deep learning algorithms. Using intelligent behavior, it can understand a source text and generate another text in a different language. Translation is required in order to build a more robust tool covering other languages. Since Chinese is widely used in FinTech-relate

196、d data in Asia, during the Syntasa engagement, we applied both Simplified Chinese and Traditional Chinese themes and keywords to collect media data in Chinese. Then we tested different translation services on snippets, and compared the quality of Google Translation with Hugging Face. The results show tha

197、t Google Translation performs better than Hugging Face in terms of the completeness and accuracy of the content; that is, Google is more comprehensive than Hugging Face and it also works for long and complex texts. It is also more accurate in some key

198、verb translations. Hugging Face also usually misses some content, especially in the context of a long sentence, and it cannot recognize many professional terms and proper nouns, such as brand names (for example, Moutai). But when the sentence is short, Hugging Face is concise and accurate; it is not worse, and

199、sometimes it is even better than Google.

Business Intelligence Tool: Looker

Looker by Google is a business intelligence (BI) and data analytics platform, comparable to Microsoft Power BI. This web-based tool offers plenty of analytics capabilities that businesses can use to explore, discover, visualize, and sh

200、are analysis and insights. Looker earns good marks for reporting granularity and scheduling, drag and drop interface, and prebuilt templates and data models. Looker has more colorful UI graph options and a customizable layout size. It is easy to apply Looker to visualizing results and building enterprise

201、-level products such as dashboards and websites. However, since Looker was integrated into Google's system just a few years ago, it has limited AI and statistical functions. The price is also higher than Power BI.

Business Learnings and Outcome

This section descr

202、ibes how the dashboard can be useful for finance and technology users.

Key Learnings (Technology)

1. Significance of input data: Input data is the foundation of any solution that aspires to use emerging technologies like artificial intelligence. Therefore, it is important to ensure that the data used to tr

203、ain AI models is accurate, representative, and sufficient in quantity.

2. Explainability and transparency: As AI models become more complex, it is important to ensure that they are explainable and transparent. The decision-making process of the model should be easily understood and verified by humans conce

204、rning which data is relevant; what data can be categorized into which theme/keyword; what data to exclude, and so on. Explainability and transparency can also help to build trust in the solution.

3. Continuous technology learning and improvement: One of the significant advantages of AI is that it can impro

205、ve over time; but this requires continuous feedback and training. It is important to continuously monitor and evaluate the performance of AI models, and update them to ensure the relevance of results over time.

Key Learnings (Project)

1. Clear base requirements: It is important to have clear and well-define

206、d requirements for such a PoV. This will help to ensure that everyone involved in the project is on the same page and has a common understanding of what needs to be achieved. Technical scoping sessions are relevant steps in the process of streamlining project requirements, and ensuring their alignment

207、with the relevant business needs. ITSTI, along with TREFT and Syntasa, will set up dedicated scoping sessions at project initiation to clarify the basic project requirements.

2. Stakeholder engagement through collaboration and expertise: It is important to

208、 involve relevant stakeholders throughout project engagement, from the ideation phase through to scoping and development. TREFT has performed the role of the business user collaborating with ITSTI to finalize the business and technical requirements, and has collaborated with Syntasa as the developer of

209、 the solution.

3. Agile approach: An agile approach toward this project enabled the solution to be developed as close to the relevant business needs as possible. Given the possibility of showcasing a key functionality during the engagement and its alignment with the business needs, the project teams test

210、ed the PDF Summarizer function in lieu of API integration.

4. Testing and quality assurance: Continuous manual testing and analysis of parts of output data at different stages of the engagement has helped to maintain business relevance and ensure quality. During this engagement, manual testing

211、was especially important in areas related to topical relevance,quality of translation,and user interface.This helped to prevent issues and ensure that the solution is reliable and effective.Key Business Outcomes:1 Efficiency.The ability to intelligently source relevant FinTech news by mimicking huma

212、n logic,and to present it on a dynamic dashboard powered by Googles Looker platform contributes to streamlining the tedious news-sourcing process,and reveals detailed insights on digital trends,and sentiment on the topics.Such a solution could help to save time and resources that would otherwise be

spent sourcing important news manually. It could also reduce human error in identifying news sources that are potentially biased or irrelevant, as well as gather relevant news sources that a human might miss due to the massive volume of news data on the internet.

2. Scalability. The consolidation and representation of large volumes of data on a dynamic dashboard such as Looker allows the user to customize search criteria based on user needs, and categorize data by drawing its relation to topical areas of interest. The functionality of reviewing market sentiments across multiple topics presents interesting insights that can be used as inputs in creating briefing notes, resources, knowledge material, slide decks, and reports for senior management review and the wider TRE audience.

3. Relevance. Ultimately, this solution can also allow treasury staff to stay fully involved in and informed about the most relevant happenings within the topics of interest, enabling the organization to potentially capitalize on key opportunities for innovation within this space, and leverage these technologies to improve TRE operations.

The applicability of the solution to other use cases is another opportunity. Currently this solution captures news and material on a specific list of topics, and captures them from specific sources as defined by the project team. There is a possibility of changing the list of topics and sources, thus indicating the potential universality of the
base solution (with customized features) across various use cases.

Considerations for Production Solution

Chatbot integration/plug-in (BARD AI or ChatGPT): A solution that could enable the user to source the relevant information by conversing with a chatbot.
Language translation: A solution that could capture resources and materials in a multilingual setting, thus increasing the geographical reach and revealing more significant results.
PDF summarizer: A solution where large text files/PDFs are converted into an easily understandable and brief summary, with suggestions for how it could increase convenience for users.
Expand scope to test intelligence: A solution where the input data is more broadly categorized, and the output data is expected to be even more specific and filtered.

Appendices

APPENDIX A
Narrative Dashboard Features

A templated narrative dashboard was deployed to shorten development time, limiting the scope to first insights. Although each dashboard was then customized to best meet the requirements set forth in this POC, they share many of the same features:

Filters
To facilitate noise mitigation and focused exploration, two types of filters are included on the dashboard: cross-chart filters, and top-level filters. Cross-chart filtering enables users to interact with most of the elements on the dashboard. For example, on the topics table, if a specific topic is selected, the dashboard will filter all of the charts based on the selected topic. Top-level filters appear at the top of the dashboard, providing extensive filtering capabilities and allowing for the selection of a specific date range and time series chart granularity; inclusion and exclusion of any combination of themes, topics, phrases, companion phrases, types of mention (unique vs repeat), page type, domain, author, and/or language can be arranged.

KPI Scorecards
KPI measures at the top of the dashboard include the number of sampled mentions, modeled mentions, percentage of mentions modeled, calculated net sentiment, and oldest and newest mention dates with respect to the applied
filters, providing a high-level overview.

Volume and Sentiment Time Series
These two visualizations, found underneath the scorecards, show FinTech sampled and modeled mentions by volume, and net sentiment over time. In addition to showing how volume and sentiment are changing over time, peaks and valleys are often indicative of significant events of interest that may warrant further investigation.

Countries
The dashboards include a table that shows mention volume, percentage, and sentiment by country, along with a heat map visualization. Through these features, users are able to understand and compare the level of engagement and sentiment in various countries.

Themes
To facilitate top-down analysis, a series of tiles provide mention volume and sentiment by theme; mention volume by theme over time; and mention sentiment by theme over time. As detailed in Reference Data, the themes were provided by WBG SMEs. They include Asset Tokenization, Digital Currency, and Web3, and are consistent across all three narrative dashboards. This is useful for understanding and comparing proportionality and sentiment across various known areas of interest. The time series charts visualize changes in the discussion to help users understand the ebb and flow of engagement and sentiment for these themes.

Topics
Complementary to the top-down approach of themes, topics can be thought of as being constructed from the bottom up. Using AI and natural language processing (NLP), the mentions are analyzed to identify recurring phrases and are dynamically grouped into topics. For example, the phrases “bitcoin,” “btc,” and “ethereum” might be categorized under the topic “cryptocurrency.” See Topic Modeling for more information on the topic modeling implementation. As with themes, the same series of tiles is provided for topics to show how prevalent the initial topics of interest are in digital narratives, as well as additional topics that are emerging from the conversation. Often many of the collected mentions do not fit inside one of the predefined themes. These tiles typically surface previously unrecognized topics of discussion that are taking place outside of the predefined themes, and are likely of interest.

Phrases & Companion Phrases
Phrases are identified by the algorithm using parts of speech to select the most relevant phrases and words. The algorithm also identifies the companion phrases that are used most commonly with each phrase. These tables show the most common phrases and companion phrases in FinTech-related posts and articles by volume and sentiment. Accompanying word clouds allow for visual analysis. Phrase volume and sentiment can be compared in order to understand the multitude of narratives taking place. One particular phrase can also be selected for deep analysis. By reviewing the associated companion phrases, users are able to determine the specific subject matter being discussed in relation to the broader topic of
conversation. For example, when selecting the phrase “bitcoin,” the top two companion phrases that appear might be “ethereum” and “cardano.” This suggests that mentions that include the phrase “bitcoin” are often discussing “ethereum” and “cardano” in relation to bitcoin.

Page Type
Page Type refers to the category of website the mention was found on; that is, news, forums, or blogs, as well as large social media platforms like Twitter, Facebook, and YouTube. The dashboards include the same series of tiles for Page Type as with themes and topics, and provide insights into where the discussion is taking place, comparative sentiments, and changes over time. Reach Estimate is an additional measure included here to explain which Page Type participants are most likely to engage with. (See more in Reference Data.) For example, a minority of the mentions may come from Twitter compared to mentions in news, suggesting that the bulk of the discussion is happening in the news. However, Twitter's significantly higher reach estimate indicates that despite fewer mentions on the platform, significantly more people are likely to be exposed to those mentions.

Domains
A domain tile is included to analyze volume and sentiment. Domain is the domain name of the website from which the mention originated (for example, T). This table allows the user to understand and compare engagement and sentiment across domains, or filter mentions to focus analysis on one or more domains.

Authors
Author is the nickname, user name, or full name of the entity that posted a mention. The authors table displays the author of a given post or comment, the domain the content was posted to, the number and net sentiment of mentions authored, and the author's reach estimate. Users are able to identify key participants, their sentiments, and their relative influence on the discussion.

Mention Details
The original text of the mention is displayed in the Mention Details table. This table reveals the author of the comment, the text of the mention, the originating domain, and the date it was posted, thus providing users with an
expanded context. A URL link button to the original source of the mention is included to facilitate in vivo analysis. An impact score for each mention is also included to help users understand the relative impact a mention is likely to have had in the discussion, as discussed in Reference Data.

APPENDIX B
Reference Data

Themes and Keywords
The relevant smart finance keywords in the list were grouped and categorized by the World Bank Group, generating a total of three themes of interest, based on WBG business use cases: asset tokenization, digital currency, and
Web3. Keywords ranged in specificity from a particular cryptocurrency such as Bitcoin, to more generalized terms, such as digital wallet.

Asset Tokenization
The Asset Tokenization theme contained approximately 21 keywords: Bitcoin; Circle (USDC); Cold Wallet/Hot Wallet; Cryptowinter; Ethereum; Fungible tokens; ICO (Initial Coin Offering); NFT (Non-Fungible Tokens); Programmable Money; Programmable Payments; Sats/Satoshis; Tether (USDT); Stellar Development Foundation; Security Tokens Offering (STO); Digital Assets Platform (DAP); Onyx; Orion; Digital Promissory Note; Digital Financial Market Infrastructure (DFMI); carbon tokenization; carbon credits/certificates.

Digital Currency
The Digital Currency theme contained approximately 24 keywords: Adoption; Apple Pay; CBDC (Central Bank Digital Currency); DCEP (Digital Currency Electronic Payment)/e-CNY/Digital Yuan; Delivery versus Payment (DvP); Digital Assets; Digital Dollar; Digital Euro; Digital Wallet; Double Spending; Fiat currency; Financial inclusion; FOMO (Fear of Missing Out); Google Pay; Instant Payment; MetaPay; Public-Private Partnership (PPP); Retail CBDC; Stablecoin; Wholesale CBDC; Ripple; Retail Central Bank Digital Currency (or Retail CBDC or rCBDC); Wholesale Central Bank Digital Currency (or Whole CBDC or wCBDC); Atomic settlement.

Web3
The Web3 theme contained approximately 21 keywords: Blockchain; Cryptocurrency; dApps (Decentralized Apps); DLT (Distributed Ledger Technology); Ledger; Metaverse; MiCA (Markets in Crypto-Assets Law); Regulation; Smart contract; Traditional Finance/TradFi; Decentralized Exchange (DEX); Oracle; Hyperledger; Decentralized Autonomous Organizations (DAOs); Liquidity Pool; Market Capitalization (Market Cap); Total Value Locked (TVL); Loss/bankruptcy/fraud/hack; Decentralized Finance (DeFi); Interoperability/Interoperable/Bridge; Flash Loans.

Chinese Keywords
The English keywords were later translated to Simplified Chinese and Traditional Chinese to facilitate collection and analysis of FinTech-related data authored in Chinese and likely originating from individuals and media sources closer to the Chinese markets (for example, Singapore). Initial translations were made by Syntasa using the Google Translate service. These initial results were refined by WBG personnel who are fluent in written Chinese, and familiar with relevant cultural references related to smart finance.

Asset Tokenization (Simplified Chinese)
Asset Tokenization keywords in Simplified Chinese shown with multiple synonyms separated by “/”: 资产代币化, 比特币, 世可/Circle/比特币Circle/比特币银行/比特币银行Circle, USDC, 冷钱包/硬件钱包/离线钱包, 热钱包/软件钱包/线上钱包, 加密寒冬/加密货币寒冬, 以太坊, 同质化代币/可替代代币/同质化通证/可替代通证, ICO/首次代币发行/首次发行代币/数字货币首次公开募资/数字货币首次公开发行/首次币发行, NFT/非同质化代币/非可替代代币/非同质化通证/非可替代通证/不可替代代币, 可编程货币/程序化货币, 可编程支付/程序化支付, Sats/Satoshis/中本聪, Tether/稳定币Tether, USDT/泰达币/稳定币USDT, 恒星币/XLM(Stellar)/恒星网络/XLM, STO/证券型通证发行/证券化通证发行, 数字资产平台, DAP/DAP币, Onyx/Onyx币, Orion/Orion币, 数字本票/数字期票, DFMI/数字金融市场基础设施, 碳币

Digital Currency (Simplified Chinese)
Digital Currency keywords in Simplified Chinese shown with multiple synonyms separated by “/”: 数字货币, 采用, 苹果支付, CBDC/中央银行数字货币/央行数字货币, DCEP/数字货币电子支付/数字货币和电子支付工具/DC/EP, 数字人民币/e-CNY, 货银对付/DVP/券款对付, 数字资产, 数字美元, 数字欧元, 电子数字钱包/数字钱包, 双重支付/重复花费/双花, 法定货币, 普惠金融/金融包容性, 错失恐惧症/FOMO/害怕错过/社交控, 谷歌支付/Google Pay, 即时付款, Meta pay/脸书支付, 公私合作制/公共私营合作制/政府和社会资本合作模式/公私伙伴关系/PPP, 零售央行数字货币/零售CBDC/零售中央银行数字货币/零售型央行数字货币/零售型CBDC/rCBDC, 稳定币, 批发央行数字货币/批发CBDC/批发中央银行数字货币/批发央行数字货币/批发型CBDC/wCBDC, 瑞波币, 原子清算/原子结算

Web3 (Simplified Chinese)
Web3 keywords in Simplified Chinese shown with multiple synonyms separated by “/”: web3, 区块链, 加密货币/密码货币/加密数字货币/虚拟货币, dApp/去中心化应用程序/分布式应用程序/去中心化应用/分布式应用, DLT/分布式帐本技术/分布式记账技术/分布式记账方式, 分布式帐本, 分类帐/分类账簿, 元宇宙, 欧盟加密资产市场监管法案/加密货币监管协议/MiCA, 监管, 智能合约, 传统金融/TradFi, 去中心化交易所, 价值中介, Hyperledger/超级账本, DAO/去中心化组织/去中心化自治组织, 流动性池/流动资金池/流动性储备资金, 市值, TVL/总锁定价值/锁定的总价值, 损失, 破产, 欺诈, 黑客, 去中心化金融/分布式金融/DeFi, 互操作性, 可互操作, Bridge/区块链桥, Interoperab, 闪电贷/Flash Loan

Geographical Locations
WBG provided a list of
34 individual and collective countries of interest grouped into six geographic regions:
North America (US, Canada, Mexico, Bahamas, and Caribbean)
South America (Brazil, Ecuador, and Colombia)
Europe (European Union, Euro Area, European Economic Area, Ukraine, and Russia)
Middle East and North Africa (MENA: UAE, Saudi Arabia, Qatar, Israel, Turkey)
Africa (Central African Republic, Democratic Republic of the Congo, Ghana, and South Africa)
Asia (China, Hong Kong, India, Kazakhstan, Singapore, South Korea, Taiwan, Thailand, Japan, Australia, New Zealand, and Vietnam)

Trusted Domains
The WBG provided a list of 82 organizations of prioritized interest relating to the predefined themes and keywords, accompanied by their website address (domain) and grouped into categories by organization type.

Organization Categories
Central Bank, Consultancy, Digital Currency Institution, News Sources, Financial Services, International Development, Regulatory Body, Research Center, Technology Company, and Think Tank.

These organizations represent a combination of authority figures, key players, and news sources participating in the many facets of finance. They are considered by WBG to be generally reliable, authentic, and trustworthy sources of information that is highly relevant to WBG business interests. As such, for the duration of the project the collection was labeled Trusted Domains, referring to their website domains. Notably absent are social media platforms, including Twitter and Facebook.

APPENDIX C
Brandwatch

Brandwatch Social Media Listening Platform
Brandwatch is a social media listening and analytics platform that provides access to a wide range of online data sources including websites, social media platforms, and news. Brandwatch automates the process of capturing data from various sources. The platform uses web crawlers to continuously gather data from millions of websites, including blogs, forums, and news sites. It gathers news articles from thousands of sources, including major news outlets, blogs, and online publications. Users also have access to data from all of the major social media platforms (Facebook, Twitter, Instagram, LinkedIn, YouTube, and Reddit).

Brandwatch's query feature is used to build complex queries to retrieve data that meets specific criteria, using key terms of interest in SQL-like queries to retrieve relevant data such as mentions of a
brand or product, competitor activities, and industry trends. Some of the key capabilities of Brandwatch's query feature include:

Advanced filtering
A wide range of filtering options may be used in a query, allowing users to narrow down their search results to only the data that is relevant to their research. Filters can be applied based on a variety of criteria, including time period, language, country, author, source type, and more. These can also filter out irrelevant data, reducing the amount of noise in the dataset.

Boolean operators
Queries also support Boolean operators, such as AND, OR, NOT, and NEAR. This enables users to create complex search queries that combine multiple search terms and filter criteria.

Although Brandwatch also provides a range of analytics and visualization tools, these capabilities are limited in comparison to those that are easily achievable using Syntasa and Google Cloud. Through the Brandwatch API, we're able to take advantage of this automated data capture with comprehensive coverage provided in near real-time; this can save time and effort compared to manually scraping data from these sources. Brandwatch's mention metadata fields
provide a rich set of information that can be used to filter, analyze, and visualize social media and online content. Here are some of the metadata fields that are available in Brandwatch and commonly used in Syntasa's news and social media narrative solutions.

Snippet
The snippet is the portion of the mention that best matches the query.

Page Type
Describes the kind of website the mention was found on in a more human-readable way. For example: “Blogs”, “YouTube”, “Dark Web”, “QQ”, “Facebook”, “Tumblr”, “Instagram”, “Forums”, “Twitter”, “VK”, “Review”, “Sina Weibo”, “Reddit”, “4Chan”, “LexisNexis Licensed News”, “News”.

Impact
Impact is a Brandwatch metric used to measure the potential impact of an author, site, or mention. It has a logarithmic scale between 0 and 100, normalized for the user's data to help them find what is most interesting for them. The impact score takes into account how much potential a mention has to be seen, as well as how
many times it has been viewed, shared, or retweeted. (A decimal from 0 to 100.)

Reach Estimate
Reach Estimate is a score created by Brandwatch to estimate how many individuals may have seen a piece of content. It is available for multiple data sources, and enables the user to compare the reach of content from different platforms and track development over time. (0, or a positive integer.)

Sentiment
Each mention within a query has a sentiment associated with it. The sentiment of a mention can be positive, negative, or neutral. Sentiment is assigned automatically by the system, but can be set manually if required. Brandwatch's sentiment analysis is based on cutting-edge AI research in the fields of Deep Learning and Natural Language Processing (NLP). Transformer-architecture language models are pretrained on billions of words to develop a deep general knowledge of over 100 languages before being applied to sentiment analysis. This offers a more sophisticated understanding of context, slang, and dialects. These models can detect sentiment indicated by:
Words (including misspelled words), phrases, and sentence structure
Emojis, emoticons, and multiword hashtags
Negation, punctuation, and much more.

APPENDIX D
SmartFi Trusted Domains Technical Details

A Brandwatch query was constructed using the three themes and associated English keywords mentioned in Reference Data. The location filter was set to “worldwide” to enable later geographic analysis in the dashboard. Since the keywords were in English, the language filter was set to English so that only English content is searched and returned, which removes the need for additional translation in the Syntasa app. Pluralized and wild-card variants of the keywords were included in the query. The “NEAR” operator was used to reduce noise created by generic keywords by helping to ensure that their presence in the mention occurred alongside other themes and keywords of interest. (For more details on Brandwatch features and data sources see Appendix C.)

As the title suggests,
the SmartFi-Trusted Domains exploration was primarily focused on the WBG list of 82 organizations of prioritized interest relating to the predefined themes and keywords. As such, advanced filtering was applied in the Brandwatch query to include only the results from those organizations' website domains (Trusted Domains).

The final SmartFi-Trusted Domains dataset in Brandwatch consists of approximately 1M mentions in English from approximately 274k unique authors found worldwide across the 82 Trusted Domains, from January 1, 2018 through February 28, 2023. A “mention” refers to a specific instance of a keyword being mentioned on social media, news sites, blogs, forums, or any other online source that Brandwatch monitors. A mention can be in a tweet, a Facebook post, a blog post, a news article, a forum thread, or any other piece of content that contains the specified keyword.

Syntasa SmartFi-Trusted Domains App

Ingest

Brandwatch dataset
The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Trusted Domains query ID to ingest the Brandwatch Trusted Domains query dataset into BigQuery at a 100 percent sample rate via Brandwatch's commercial API. Each mention in the Brandwatch Trusted Domains dataset contains up to 103 associated mention metadata fields depending on the source, type, and data availability of the mention. These fields include date, author, domain, page type, sentiment, impact, reach, snippet, and geographical information (when available).

In addition to the Brandwatch data, the reference data in Appendix B were also ingested. The reference data were first manually copied into a single Google Sheet with three tabs: Themes and Keywords, Trusted Domains, and Regions and Countries. Three Spark processors, one for each tab, use Python code to access the relevant tab via Google Cloud Storage API and insert them into a BigQuery table.

Process

Noise Filter
Visual analysis in the SmartFi Trusted Domains-Narrative Looker dashboard of the most recent 30 days of mentions revealed a high number of irrelevant forum and review mentions originating from the trusted domains that were categorized as Technology Company. These mentions included knowledge-base articles, technical support forums, and app store reviews from Amazon, Microsoft, Google, and Apple.

Syntasa provides a multitude of ways to implement noise mitigation in the pipeline, including predefined processes with filtering parameters, or the option to define custom scripts and SQL queries. For demonstration in this POC, a Transform process was inserted in the app and an SQL “WHERE” clause was added to the filters to exclude the aforementioned mentions:

WHERE (categories.Category != "Technology Company") AND (data.pageType != "forum") AND (data.pageType != "review")

Themes
A BigQuery (BQ) process was used to label themes associated with each mention based on matching keywords. Referencing the predefined Themes and Keywords, a mention was labeled Asset Tokenization, Digital Currency, or Web3 if the snippet contained at least one keyword associated with one of these themes.

Categories
An organization category was assigned to each mention using the “Join” feature of the same Transform process containing the noise filtration mentioned above. Mentions were labeled with one of the ten Organization categories, based on a matching originating domain.

Regions
A geographic region was assigned to mentions that have an associated country provided by Brandwatch. Although this could easily be done using a process in the app, for demonstration purposes this
was implemented in Looker using LookML. Similarly to how organization categories are assigned, the LookML references the Geographical Locations to assign one of the six predefined regions, based on a matching originating country.

Topic Modeler, Phrases, and Companion Phrases
A ready-made Topic Modeler process is used in the app to identify topics, phrases, and companion phrases. This process consists of Python code running in a Spark processor that applies AI and NLP to analyze each mention, and to identify recurring phrases and categorize them into topics. For example, the phrases “bitcoin,” “btc,” and “Ethereum” might be categorized under the topic “cryptocurrency.” (See Topic Modeling for additional details on Syntasa's implementation.)

The snippet is first cleansed using regular expressions to ensure the snippets processed by the topic modeler consist of
only alphanumeric characters and spaces. Given that the Trusted Domains sources do not include social media, the parameter to include hashtags for analysis was disabled. Mentions with short, nonsensical, and/or unrelated text are automatically discarded by the topic modeler.

As with the observations discussed regarding the noise filter, visual analysis in the SmartFi Trusted Domains-Narrative Looker dashboard of the most recent 30 days of mentions revealed several irrelevant and/or undesirable topics. A series of “stop” words was provided to the topic modeler to suppress these: the, this, an, that, do, these, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, continue reading, also read, rights reserved, privacy policy, not be, total views, use cookies.

The resulting BigQuery table is an expanded view consisting of a row for every unique combination of a topic, phrase, and/or companion phrase
associated with a particular snippet.

Combine
Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ process is used to join the intermediary tables containing the Brandwatch data, themes, categories, regions, topics, phrases, and companion phrases into a single, unified table. The unique mention resource ID is referenced in the LookML to collapse the expanded dataset back, ensuring that each mention and its associated metadata are counted only once during the dashboard analysis.

Activate
Initially, one week of Brandwatch data was ingested, processed to ensure that the pipeline was operating properly, and analyzed in the Looker dashboard to identify data quality issues such as sources of noise. After updating the noise filter and stop words, the process was repeated for the most recent 30 days, and then expanded even further to incorporate mentions from the last five years
(January 1, 2018 to current day) for historical analysis. The last step taken was to enable a scheduled job to automatically ingest and process new Brandwatch data once a day to allow continued analysis moving forward.

APPENDIX E
SmartFi Uncertain Domains Technical Details

For the SmartFi-Uncertain Domains solution, the SmartFi-Trusted Domains Brandwatch query was modified (see SmartFi-Trusted Domains). The same keywords, location filter, and language were used. Data sources include social media (Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and news websites. However, unlike with the SmartFi-Trusted Domains, which focused exclusively on the Trusted Domains, the advanced filtering in the SmartFi-Uncertain Domains query was modified to explicitly exclude results from Trusted Domains.

The final SmartFi-Uncertain Domains dataset in Brandwatch consists of about 83M mentions in English from about 6M unique authors found worldwide from December 1, 2022 through February 28, 2023.

Ingest

Brandwatch Dataset
The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Uncertain Domains query ID to ingest the dataset into BigQuery
at a 1.85 percent sample rate (the maximum Brandwatch provided given the dataset volume) via Brandwatch's commercial API. The metadata fields remain the same as described in the SmartFi-Trusted Domains app.

Themes
In addition to the Brandwatch data, the Themes and Keywords were also ingested, as described in the Trusted Domains app.

Twitter
Full tweet text was retrieved directly from the Twitter API for all tweet IDs included in the Brandwatch dataset through a Spark processor with custom Python code that leverages off-the-shelf libraries such as Requests, Pandas, and JSON. The tweet text is then inserted into the Brandwatch dataset as the mention snippet in a second Spark processor.

Process

Themes
As with the SmartFi-Trusted Domains app, the same BQ process was used to label themes associated with each mention based on matching keywords.

Topic Modeler, Phrases and Companion Phrases
Visual analysis in the SmartFi Uncertain Domains-Narrative Looker dashboard of the most recent 30 days of mentions revealed several irrelevant and/or undesirable topics. No noise filter was implemented in the app. However, a series of stop words was provided to the topic modeler to
suppress these: the, this, an, that, do, these, im, is, has, have, was, had, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, amp, rt, follow, retweet, tweet, quote, comment, the, a, this, an, that, do, these, im, i, is, has, have, was, had, huh, th, else, did, http, https

Combine
Finally, to facilitate analysis in a Looker dashboard, an SQL query in a BQ process is used to join the intermediary tables containing the Brandwatch data, themes, topics, phrases, and companion phrases into a single unified table. The unique mention resource ID is referenced in the LookML to collapse the expanded dataset back, ensuring that each mention and its associated metadata are counted only once for dashboard analysis.

Activate
Initially, one day of Brandwatch data was ingested, processed, and analyzed in the Looker dashboard to ensure that the pipeline was operating properly and to identify sources of noise. After updating the noise filter and stop words, the process was repeated for the most recent seven days and then expanded even further to incorporate mentions from December 1, 2022 to the current day for historical analysis. As with the SmartFi-Trusted Domains, the last step taken was
to enable a scheduled job to automatically ingest and process new Brandwatch data once a day to allow continued analysis moving forward.

APPENDIX F
SmartFi Chinese Language Technical Details

For the SmartFi-Chinese Language solution, the SmartFi-Uncertain Domains Brandwatch query (see SmartFi-Trusted Domains) was modified. The same location filter (Worldwide) was used. However, the language was limited to Chinese and the Simplified Chinese keywords were used in place of the English terms. Again, data sources include social media (Twitter, Facebook, Reddit, Tumblr, YouTube), blogs, forums, and news websites. Trusted Domains were not excluded.

The final SmartFi-Chinese Language app dataset in Brandwatch consists of 69K mentions in Chinese from 17K unique authors found worldwide on February 7, 2023.

Ingest

Brandwatch Dataset
The ready-made Brandwatch API process included with Syntasa was configured with the Brandwatch Chinese Language query ID to ingest the dataset into BigQuery at a 37.5 percent sample rate (the maximum Brandwatch provided given the dataset volume) via Brandwatch's commercial API. The metadata fields remain the same as described in the SmartFi-Trusted Domains app.

Themes
Themes and Keywords were also ingested as described in the Trusted Domains app.

Twitter
As with the SmartFi-Uncertain Domains, the full tweet text was retrieved directly from the Twitter API for all tweet IDs included in the Brandwatch data, and inserted into the Brandwatch dataset as the mention snippet.

Process

Themes, Topics, Phrases and Companion Phrases
Processing for themes, topics, phrases, and companion phrases occurred the same as in the SmartFi-Uncertain Domains app.

Translation
To facilitate theme assignment and topic modeling, the snippet text was translated into English using a ready-made Translate process which uses a pre-trained Opus-MT model available for download on HuggingFace: https://huggingface.co/Helsinki-NLP/opus-mt-zh-en. See Chinese Translation for additional details on Syntasa's Chinese to English translation implementation.

Combine
To facilitate analysis in a Looker dashboard, the SQL query used to join the intermediary tables in the SmartFi-Uncertain Domains was modified to include both the original Chinese snippet and the translated-into-English snippet.

Activate
Only one day (February 7, 2023) of Brandwatch data was ingested, processed, and analyzed in the Looker dashboard, to ensure that the pipeline was operating properly and to allow for proper evaluation.
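
The snippet cleansing, theme labeling, and combine steps described in Appendices D to F can be illustrated with a small Python sketch. This is a minimal illustration under stated assumptions, not Syntasa's implementation: the actual pipeline ran as Spark, Transform, and BigQuery processes, and the field names (`resource_id`, `snippet_zh`, `snippet_en`), the abbreviated keyword lists, the whole-word matching rule, and all function names below are hypothetical.

```python
import re

# Hypothetical, abbreviated subset of the Appendix B keywords per theme.
THEME_KEYWORDS = {
    "Asset Tokenization": ["bitcoin", "ethereum", "nft"],
    "Digital Currency": ["cbdc", "stablecoin", "digital wallet"],
    "Web3": ["blockchain", "smart contract", "defi"],
}


def clean_snippet(text):
    """Reduce a snippet to alphanumeric characters and single spaces."""
    alnum_only = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    return re.sub(r" +", " ", alnum_only).strip()


def label_themes(snippet_en):
    """Return every theme with at least one keyword in the snippet.

    Assumption: keywords match as whole words, case-insensitively.
    """
    padded = " " + clean_snippet(snippet_en).lower() + " "
    return sorted(
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(" " + keyword + " " in padded for keyword in keywords)
    )


def combine(mentions, translations):
    """Join mentions to their English translations by resource ID.

    Each resource ID yields exactly one output row, so repeated rows for
    the same mention are counted only once. Both the original Chinese
    snippet and the translated snippet are kept.
    """
    combined = {}
    for mention in mentions:
        resource_id = mention["resource_id"]
        if resource_id in combined:
            continue  # already accounted for this mention
        snippet_en = translations.get(resource_id, "")
        combined[resource_id] = {
            "resource_id": resource_id,
            "snippet_zh": mention["snippet_zh"],
            "snippet_en": snippet_en,
            "themes": label_themes(snippet_en),
        }
    return list(combined.values())
```

Calling `combine` with a list of mention dictionaries and a dictionary mapping resource IDs to translated text returns one row per mention, each carrying both snippets and its matched themes. Labeling themes on the translated snippet mirrors the report's statement that translation was performed to facilitate theme assignment.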
