Accelerate Cloud Training with Alluxio
Data Fun
Lu Qiu, Alluxio

Lu Qiu
- Machine Learning Engineer, Alluxio
- Alluxio PMC maintainer
- Master of Data Science, GWU
- Responsible for integrating Alluxio with deep learning
- Areas: Alluxio fault-tolerant system, journal system, metrics system, and POSIX API; Alluxio integration with cloud

Agenda
- Alluxio and its POSIX API
- Accelerate cloud training with Alluxio
  - Level 1: Storage read accelerating
  - Level 2: Data preprocessing & training
  - Level 3: Data orchestration layer

Alluxio & its POSIX API

Data Orchestration for Analytics & AI in the Cloud

Data accessibility
- Convert from client-side interface to native storage interface

Data locality
- Local performance for remote data with intelligent multi-tiering
- Tiers: hot (RAM), warm (SSD), cold (HDD)
- Read & write buffering
- Transparent to applications
- Policies for pinning, promotion/demotion, TTL
(Diagram labels: on-premises, public cloud, model training, big data ETL, big data query)
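The tiering policies listed above (pinning, promotion/demotion, TTL) can be pictured with a toy model. This is only a sketch for intuition, not Alluxio's implementation: the class names, the rule that new files start in the coldest tier, and the one-tier-per-read promotion are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TIERS = ("RAM", "SSD", "HDD")  # hot, warm, cold, in the order shown on the slide


@dataclass
class CachedFile:
    tier: int                           # index into TIERS
    pinned: bool = False                # pinned files are never demoted in this toy model
    expires_at: Optional[float] = None  # absolute expiry time; None means no TTL


class ToyTieredCache:
    """Hypothetical illustration of promotion, demotion, pinning, and TTL."""

    def __init__(self) -> None:
        self.files: Dict[str, CachedFile] = {}

    def add(self, path: str, now: float, ttl: Optional[float] = None,
            pinned: bool = False) -> None:
        # Assumption for illustration only: new files start in the coldest tier.
        expires_at = None if ttl is None else now + ttl
        self.files[path] = CachedFile(len(TIERS) - 1, pinned, expires_at)

    def read(self, path: str, now: float) -> Optional[str]:
        """Return the tier that served the read, then promote the file one tier."""
        entry = self.files.get(path)
        if entry is None:
            return None
        if entry.expires_at is not None and now >= entry.expires_at:
            del self.files[path]  # TTL elapsed: drop the cached copy
            return None
        served_from = TIERS[entry.tier]
        entry.tier = max(0, entry.tier - 1)  # promotion toward the hot tier
        return served_from

    def demote(self, path: str) -> None:
        """Move an unpinned file one tier toward cold storage."""
        entry = self.files.get(path)
        if entry is not None and not entry.pinned:
            entry.tier = min(len(TIERS) - 1, entry.tier + 1)
```

Repeated reads move a file from HDD to SSD to RAM, `demote` pushes an unpinned file back down, and a read after the TTL deadline finds nothing cached.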
Metadata locality
- Synchronization of changes across clusters
(Diagram labels: old file at path/file1, new file at path/file1, Alluxio Master, metadata synchronization, mutation)

Alluxio POSIX API
- Accessing remote/distributed data as local directories
- Connecting to: HDFS, Amazon S3, Azure, Google Cloud, Ceph, NFS, and many more
(Diagram labels: HDFS #1, object store, NFS, HDFS #2)

Accelerating Cloud Training with Alluxio

Level 1: Accelerating under storage data access
(Diagram labels: training clusters; hot, warm, cold tiers on RAM, SSD, HDD; read buffering; transparent to app)
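Because the POSIX API presents remote data as local directories, a training job can read Alluxio-cached data with ordinary file calls. The following is a minimal sketch under that assumption; the mount point `/mnt/alluxio-fuse/s3/train` and the helper name `iter_training_files` are hypothetical and do not come from the slides.

```python
import os
from typing import Iterator, Sequence, Tuple


def iter_training_files(
    root: str, suffixes: Sequence[str] = (".jpg", ".png")
) -> Iterator[Tuple[str, bytes]]:
    """Yield (path relative to root, file contents) for each matching file."""
    wanted = tuple(suffixes)  # str.endswith needs a tuple, not a list
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            if filename.lower().endswith(wanted):
                full_path = os.path.join(dirpath, filename)
                # Plain open(): nothing here is specific to Alluxio.
                with open(full_path, "rb") as f:
                    yield os.path.relpath(full_path, root), f.read()


if __name__ == "__main__":
    mount_point = "/mnt/alluxio-fuse/s3/train"  # hypothetical FUSE mount path
    if os.path.isdir(mount_point):
        for relative_path, data in iter_training_files(mount_point):
            print(relative_path, len(data))
```

The same function works on any local directory, which is the point of the POSIX interface: the training code does not need to know whether the bytes come from local disk or from a cache in front of remote storage.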
1. Accelerating under storage data access
(Diagram labels: under storage, Kubernetes cloud cluster)

One click to mount UFS to Alluxio
All data located under the mounted S3 path will be cached by Alluxio, which provides data locality for training jobs.

    $ bin/alluxio fs mount /s3 s3://<bucket>/<path> --option aws.accessKeyId=<access-key-id> --option aws.secretKey=<secret-key>

One click to load all training data into Alluxio

    $ bin/alluxio fs distributedLoad /s3

Alluxio at Alibaba: improve throughput
https://www.alluxio.io/blog/efficient-model-training-in-the-cloud-with-kubernetes-tensorflow-and-alluxio/
https://www.alluxio.io/resources/whitepapers/using-alluxio-to-optimize-and-improve-performance-of-kubernetes-based-deep-learning-in-the-cloud/

Alluxio at Microsoft
Task
- More than 400 tasks need to read data from Azure and write data to Azure
- The total data size is larger than 1 TB
- Previously, data was copied directly from the cloud to the training nodes
Challenges
- The request rate is easily exceeded
- Azure blob-fuse requires downloading data from Azure to local storage before starting the tasks, and uploading data to Azure after the tasks finish
- The large amount of data input and output easily causes I/O errors
- GPUs sit idle while waiting for I/O operations
https://www.alluxio.io/resources/videos/speed-up-large-scale-ml-dl-offline-inference-job-with-alluxio/

Alluxio at Microsoft: Alluxio speeds up training by 18%
- Reduce I/O wait time and improve training performance
- Use data pre-caching to improve performance
- Dynamically cache data during training
- Share data across multiple tasks
- Stream reads to disperse I/O requests and avoid exceeding the cloud storage request limit
- Retry automatically to reduce the I/O error rate

Level 2: Data processing & training speed up
(Diagram labels: big data ETL cluster, training clusters, data, read buffering)
2. Data processing to training speed up

Alluxio at Boss Zhipin
Task
- Use Spark/Flink to process data
- Train models on top of the processed data
Previous solution
- Spark/Flink + Ceph + model training
Problems
- Writing temporary files into Ceph puts high pressure on Ceph
- Ceph read/write pressure cannot be controlled, so the cluster is unstable
Solution with Alluxio
- Spark/Flink + Alluxio + Ceph + Alluxio + model training
- Alluxio supports multiple data sources and multiple compute/training frameworks
- Multiple independent Alluxio clusters support multi-tenancy, customized configuration, and access control
https://www.alluxio.io/resources/videos/alluxio-k8s-cloud-native-ai-environment-bosszp-chinese/

Alluxio in BOSSZP
(Diagram labels: big data ETL, model training, HDFS interface, POSIX interface)
- Improve under storage stability
- Speed up the whole data preprocessing to training pipeline
- Can launch more Alluxio clusters to meet burst ETL/training requirements
(Diagram labels: data preprocessing, model training, POSIX interface)

Level 3: Data Orchestration Layer
(Diagram labels: big data ETL cluster, training clusters, data, read buffering, transparent to app)
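With Alluxio sitting between the ETL cluster and the training clusters, both sides can refer to the same file through different interfaces. The sketch below assumes, as the diagram labels suggest, that the ETL side addresses data by an `alluxio://` URI through the HDFS-compatible interface while the training side reads through a POSIX mount. The mount point `/mnt/alluxio-fuse`, the host name `master`, and the helper name are hypothetical, and the sketch further assumes the mount exposes the Alluxio namespace root.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mount point, assumed to expose the Alluxio namespace root.
FUSE_MOUNT = PurePosixPath("/mnt/alluxio-fuse")


def alluxio_uri_to_fuse_path(uri: str, mount: PurePosixPath = FUSE_MOUNT) -> str:
    """Translate an alluxio:// URI into the matching path under the POSIX mount."""
    parsed = urlparse(uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"not an Alluxio URI: {uri!r}")
    # Drop the leading slash so the path is joined below the mount point.
    return str(mount / parsed.path.lstrip("/"))


if __name__ == "__main__":
    # Path an ETL job might have written (hypothetical host and file name).
    etl_output = "alluxio://master:19998/etl/output/part-0000.parquet"
    print(alluxio_uri_to_fuse_path(etl_output))
```

Keeping the translation in one small function means the training code only ever sees local paths, and the mount point can change without touching the data loaders.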
(Diagram labels: under storage system, data preprocessing, big data ETL cluster, data, write buffering, under storage, training clusters)

Alluxio at Momo
- Momo has multiple Alluxio clusters comprising thousands of Alluxio nodes
- Stores more than 100 TB of data
- Alluxio serves Momo's search and training tasks
- Momo continues to develop new use cases for Alluxio
- Alluxio supports multiple under storages and multiple compute/training frameworks
- Accelerates compute/training tasks
- Reduces the metadata and data overhead of under storage
https://www.alluxio.io/resources/videos/ml-and-query-acceleration-at-momo-with-alluxio-chinese/

Alluxio at Momo: billion-scale image training
- 2 billion small files
- PyTorch + Alluxio + Ceph
- Reduce the metadata and data interactions with Ceph to improve performance

Alluxio at Momo: speed up recommendation system model loading
- Upload the recommendation system model to HDFS
- Distributed-load the model from HDFS into Alluxio
- The recommendation system loads the model from Alluxio concurrently
Speed up loading indexes for the ANN system
- Create indexes
- Upload indexes to HDFS (or an object store)
- Nodes load indexes from Alluxio

Alluxio may help you if you have
- Distributed training
- A large amount of data (>= TB) or a large number of small files/images
- Network I/O that cannot satisfy GPU requirements
- Multiple data sources and multiple training/compute frameworks
- A need to keep under storage stable and avoid exceeding request rate limits
- A need to share data between multiple training tasks

Alluxio POSIX API Development

Community cooperation
Community-driven cooperation. Special thanks to:
- Tencent: Baolong Mao, Bing Zheng, Yaolong Liu
- Microsoft: Bingyang Li
- Alibaba: Cheyang
- Nanjing University: Rong Gu, Yili Luo
- Bilibili: Zifan Ni
- AntFinance: Kevin Cai
In production at Microsoft, Bilibili, Momo, Boss Zhipin, etc.

Alluxio 2.8 (on the way)
Improves the stability of Alluxio when integrating with training workloads
- RocksDB metastore core crash with high memory consumption (918e73)
- FUSE segmentation fault error (issue)
- FUSE statfs potential OOM (fabcf47)
- FUSE failed to unmount (5393b6), etc.
Better supports a large number of small files and highly concurrent data access
- Supporting worker registration with millions of blocks (htf8e5e)
- Improving the performance of preloading a large number of small files by 10x while reducing the memory overhead (bc104a9)
- Libfuse3 (6f3fe6f) is supported in Alluxio 2.8 to enable future performance and scalability optimizations

Alluxio AI SIG
- 2022-04-25
- Monday, April 25, 7 PM Pacific Time (PST); Tuesday, April 26, 10 AM Beijing Time (GMT+08)
- Bi-weekly
- Zoom link: https:/
- Notes: https:/