AI and Data Fusion: An Outlook on Infrastructure Technology
Chen Wenguang, Ant Group Technology Research Institute / Tsinghua University

Big data: volume, generation velocity, and multimodality
- (Chart: volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025, in zettabytes. Source: Statista 2021.)
- IoT, edge devices, and user behavior generate enormous amounts of data.
- Volume and Velocity; Variety (multimodal data): images, documents, graphs, time series, transactions.

The typical data processing pipeline
- Real-time link: Apps -> Database (MySQL) -> Queue (Kafka) -> Real-time ETL (Flink, Spark) -> OLTP (HBase, KV stores, ES)
- Offline link: ETL (Flink, Spark + HUDI) -> Data lake (MPPDB, HDFS) -> OLAP (Presto, CK) -> Analysts

The typical data + AI pipeline adds three stages on top of the links above:
- Online model update (PyTorch, TF)
- Model serving (PyTorch, TF)
- Batch training/testing (PyTorch, TF)

Main challenges
1. Online/offline consistency
2. Performance problems of JVM-based data processing systems
3. Integration of big data processing with AI

Challenge 1: consistency, with Ant Group's graph computing as the example solution
Architecture 1 of the graph-based risk control solution ("whole-graph risk control"):
- TuGraph DB: a distributed graph database supporting the custom graph query language GQuery
- TuGraph Dataflow: a streaming graph computing system supporting Gremlin
- Components: application, TuGraph DB, message queue, TuGraph Dataflow, historical playback, decision engine, streaming write, rule-based serving, data serving, decision making
- Problem: online and near-line data disagree, so model results disagree.

Architecture 2 keeps exploration, simulation, and production deployment consistent:
- TuGraph DB and TuGraph Dataflow both support the international standard graph query language ISO GQL.
- Online/near-line data consistency: the online database is the source of truth and is synchronized to the near-line system.
- The online and near-line systems use the same query language, avoiding semantic mismatches between different languages. Many details matter here, e.g., the node-limit problem.

Challenge 2: performance problems of JVM-based data processing systems
Spark's processing performance is poor: compared with a handwritten C++ word count, Spark is 12x slower on a single machine.

  Single-machine Spark | Single-threaded C++ | Multi-threaded C++
  Time (s):      186.4 |               58.99 |             15.54

(Setup: dual-socket E5-2680 v4, 256 GB RAM, 44 GB text dataset.)
- Java runtime: heavy data-object conversion overhead, e.g., serialization/deserialization.
- Spark execution strategy: processing one element at a time incurs heavy function-call overhead.

Profiling three big-data workloads (word count WC, PageRank PR, clustering KM) with VTune gives these hotspot shares (% of execution time):

  Hotspot          | WC-R | PR-R | PR-F | KM-R | KM-F
  Virtual calls    | 25.1 | 20.3 |  8.7 |  0.9 | 20.7
  Iterators        |  4.1 |  1.0 |  6.6 |  0   |  0
  Boxing/unboxing  |  0.3 |  3.2 |  9.9 |  0.2 |  0
  Serialization    |  0.4 |  2.2 | 14.6 |  0   |  0
  Total            | 29.9 | 26.7 | 39.8 |  1.1 | 20.7

High memory consumption: on iterative graph algorithms, Spark needs 20x the memory of the raw dataset [1]. The causes again lie mainly in the Java runtime:
- Non-compact object layout: an edge stored as a Java triple takes 76 bytes (object headers, pointers, data), bloating the 16 bytes the data itself needs by nearly 5x.
- Garbage-collected memory management: memory must be over-provisioned to avoid frequent GC.
(Figure: memory usage in GB of GraphX, PowerGraph, and Gemini on enwiki-2013, twitter-2010, uk-2007-05, weibo-2013, and clueweb-12.)

[1] Zhu, Xiaowei, et al. "Gemini: A Computation-Centric Distributed Graph Processing System." OSDI 2016.
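The boxed-versus-compact gap can be illustrated outside the JVM as well. Below is a minimal sketch in plain Python (all names are ours, and Python's per-object numbers differ from the JVM's 76-versus-16 figures) contrasting a "boxed" edge list, where each endpoint is a separate heap object with its own header, against a compact flat buffer holding only the raw 8-byte integers.

```python
# Boxed vs. compact edge storage: each boxed endpoint carries object
# overhead; the compact layout stores only the 16 bytes of payload per edge.
import sys
from array import array

edges = [(10**9 + i, 10**9 - i) for i in range(1000)]  # boxed: list of tuples

# Compact layout: one flat buffer of signed 64-bit ints (src0, dst0, src1, ...)
flat = array("q")
for src, dst in edges:
    flat.append(src)
    flat.append(dst)

# Count the container plus every tuple and every boxed int it points to.
boxed_bytes = sys.getsizeof(edges) + sum(
    sys.getsizeof(t) + sys.getsizeof(t[0]) + sys.getsizeof(t[1]) for t in edges
)
compact_bytes = flat.itemsize * len(flat)

print(compact_bytes // len(edges))  # 16 bytes of payload per edge
print(boxed_bytes // len(edges))    # several times larger per edge
```

Native engines exploit exactly this kind of flat layout, which is also why they avoid the GC over-provisioning described above.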
Chukonu ("Zhuge Nu"), a new big-data processing kernel
Design philosophy:
- Efficiency first: Spark emphasizes scalability but neglects execution efficiency.
- Flexibility first: give users more room to customize and control, as the basis for supporting diverse applications efficiently.
The new design attacks the problems of Spark RDDs:
- A native language (C++) mitigates the problems caused by the Java runtime.
- Against the overhead of Spark's element-at-a-time RDD processing model, Chukonu uses compact representations of common data and compiler optimizations for efficient batch processing.
- It replaces the Spark RDD core with an efficient, full-featured one, assumes the non-core parts of Spark are fine, and supports SQL.
- Performance evaluation: unstructured data
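Chukonu's batching idea can be sketched in a few lines. This is a toy in Python, not Chukonu's actual C++ API: the same map pipeline runs once element-at-a-time, as in Spark's iterator model, and once chunk-at-a-time; the results are identical, but the per-element version pays one function call per record.

```python
# Element-at-a-time vs. batched execution of the same transformation.
calls = {"element": 0, "batch": 0}

def square_element(x):
    calls["element"] += 1  # one call per record (Spark's iterator model)
    return x * x

def square_batch(chunk):
    calls["batch"] += 1    # one call per chunk (batched model)
    return [x * x for x in chunk]

data = list(range(10_000))
CHUNK = 1024

per_element = [square_element(x) for x in data]

batched = []
for i in range(0, len(data), CHUNK):
    batched.extend(square_batch(data[i : i + CHUNK]))

assert per_element == batched            # identical results
print(calls["element"], calls["batch"])  # 10000 calls vs 10 calls
```

In a compiled engine the chunk loop is further specialized by the compiler, which is what turns batching into the large speedups reported above.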
- Performance evaluation: graph computing, machine learning, memory footprint, and TPC-DS.

Challenge 3: integrating big data processing with AI

                | AI                                | Big data processing | "Small data" processing
  Processor     | GPU or AI accelerator             | General-purpose CPU | CPU
  Network       | NVLink + IB / 100 Gbps+           | 10-25 Gbps          | -
  Main language | Python                            | Java/Scala          | Python
  Frameworks    | PyTorch, TensorFlow, PaddlePaddle | Spark, DataFrame    | pandas, NumPy, SciPy, notebooks

- AI's share of data-center computation will keep growing significantly, and it lives mainly in the Python ecosystem.
- Distributed big data processing lives mainly in the Java ecosystem; "small data" processing mainly in the Python ecosystem.
- AI and big data processing also differ greatly at the hardware level.

Problems of keeping the data and AI ecosystems separate (e.g., Spark preprocessing -> TF/PyTorch neural network -> Spark postprocessing):
- Developing, debugging, deploying, and maintaining two software and hardware stacks is more complex.
- Data transfer between the systems costs performance.
- Two kinds of programmers are needed, or programmers who master both.

One attempt: BigDL*, the "Java-ization" of deep learning. Its drawbacks:
- Supports only CPUs, not GPUs or heterogeneous accelerators.
- Re-implements deep-learning modules instead of reusing TensorFlow's.
- Spark itself has performance deficiencies.
(The slide excerpts the BigDL paper, including its Figure 1, "The end-to-end text classification pipeline (including data loading, processing, training, prediction, etc.) on Spark and BigDL.")

* Dai, J. J., Wang, Y., Qiu, X., Ding, D., Zhang, Y., Wang, Y., ... & Wang, J. "BigDL: A Distributed Deep Learning Framework for Big Data." SoCC 2019.

Another attempt: the "Python-ization" of Spark
- PySpark supports DataFrames and SQL.
- Koalas, a pandas-style wrapper over Spark, has been merged into Spark 3.2.
- Nearly half of Spark users already use PySpark.
- Since Python has no static types, compiler optimization is hard; on common queries Python lags Java by roughly 50%.

The vision: fusing the big-data and AI ecosystems
- AI will become the dominant form of computing; the data processing ecosystem should be built around AI.
- Research compilation techniques for data processing so that PySpark reaches the performance of native execution.
- Accelerator support and elastic task scheduling.
- Write once, run anywhere.
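As a reminder of what that pandas-style surface looks like, here is a hypothetical stand-in, standard library only, for the kind of groupBy-and-aggregate call that Koalas forwards to a Spark backend (`group_sum` and its arguments are made up for illustration; in pandas this would be `df.groupby(key)[value].sum()`).

```python
# A tiny groupBy-and-sum, mimicking the pandas/Koalas API shape.
from collections import defaultdict

rows = [
    {"city": "Beijing", "amount": 10.0},
    {"city": "Hangzhou", "amount": 4.0},
    {"city": "Beijing", "amount": 2.5},
]

def group_sum(records, key, value):
    """Illustrative equivalent of df.groupby(key)[value].sum()."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)

print(group_sum(rows, "city", "amount"))  # {'Beijing': 12.5, 'Hangzhou': 4.0}
```

The vision above is that this same user-level code, once compiled well, should run unchanged on a laptop, a Spark cluster, or an accelerator ("write once, run anywhere").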