NVIDIA GPU-ACCELERATED DATA PROCESSING FOR RECOMMENDATION SYSTEMS
GTC China, 魏英灿, 2020/12

SPEAKER
魏英灿, GPU Computing Expert.
Graduated from the University of Hong Kong; research interests include domain adaptation for deep learning, generative adversarial networks, and the design and optimization of recommendation algorithms. Before joining NVIDIA he worked at European/American foreign enterprises and internet companies, with many years of experience in image processing, data mining, and recommender system design and development. He currently leads algorithm design and architecture for HugeCTR. Joined NVIDIA Semiconductor (Shanghai) Co., Ltd. in 2020.

AGENDA
- Introduction
- DASK-RAPIDS
- NVTabular
- Spark 3.0
- Benchmark
- Conclusion

INTRODUCTION

FLOW OF RECOMMENDATION SYSTEM
(Slide diagram: multiple data sources (data source 1, 2, 3) are consolidated into structured data; feature engineering produces model inputs; model training produces a model; inference produces the result.)

REAL-TIME ETL ARCHITECTURE
(Slide diagram: data sources such as HBase, Elasticsearch, Hadoop, MySQL, MongoDB, sensors, mobile devices, and app logs feed Kafka; MapReduce and MirrorMaker handle batch movement while Spark Streaming processes the streams, supporting interactive exploration by data scientists and real-time intelligence at the NOC.)

FEATURE ENGINEERING
- Missing-value filling: mean, median, mode, interpolation
- Normalization: min-max, Gaussian, quantile, logistic transformation
- Sampling: EasyEnsemble, BalanceCascade, NearMiss
- Outlier detection & truncation: anomaly detection, 3σ criterion
- Nonlinear transformation: logarithm, polynomial

SUMMARY OF ETL AND FEATURE ENGINEERING
ETL - convert raw data to structured data:
- Extract: aggregate raw data from different sources
- Transform: data cleaning and validation; generate structured data
- Load: load structured data into a data warehouse or data lake
Feature engineering - extract features from structured data:
- Feature extraction: extract features from the feature space constructed by the structured data
- Feature selection: filter/wrapper methods based on various operators
- Feature evaluation: iterate based on model performance
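To make a few of the feature-engineering operations listed above concrete (missing-value filling, outlier truncation, a log transform, and min-max normalization), here is a minimal pandas sketch on a toy table; the column names and percentile thresholds are hypothetical, not taken from the talk.

    import numpy as np
    import pandas as pd

    # Toy structured data with a missing value and a heavy-tailed column.
    df = pd.DataFrame({
        "age": [23, 35, np.nan, 41, 29],
        "purchase_amount": [5.0, 120.0, 35.0, 9000.0, 60.0],
    })

    # Missing-value fill: replace NaNs in "age" with the column median.
    df["age"] = df["age"].fillna(df["age"].median())

    # Outlier truncation: clip "purchase_amount" to the 1st-99th percentile range.
    low, high = df["purchase_amount"].quantile([0.01, 0.99])
    df["purchase_amount"] = df["purchase_amount"].clip(low, high)

    # Nonlinear transform: log1p to compress the heavy tail.
    df["log_amount"] = np.log1p(df["purchase_amount"])

    # Normalization: min-max scale "age" into [0, 1].
    df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

    print(df)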

HIERARCHY - RAPIDS / SPARK 3.0 / NVTABULAR / DASK
(Slide diagram: RAPIDS runs on a single GPU; DASK, Spark 3.0, and NVTabular sit on top of RAPIDS to scale the same GPU-accelerated primitives across many GPUs.)

DASK-RAPIDS

RAPIDS
GPU-accelerated data science ecosystem.
- A suite of open-source software libraries with PyData-like APIs to execute end-to-end data science and analytics pipelines entirely on GPUs, without paying the typical serialization costs.
- RAPIDS also includes support for multi-node deployments, enabling vastly accelerated processing and training on much larger dataset sizes.
- Building bridges into the array ecosystem.
(Slide diagram: data preparation with Dask, cuDF, cuIO, and cuGraph; model training with PyTorch, Chainer, MXNet; visualization and graph analytics - all sharing GPU memory.)
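As an illustration of the "PyData-like API" point, a minimal sketch of the same aggregation in pandas and in cuDF; it assumes a CUDA-capable GPU with the cudf package installed, and the file and column names are hypothetical.

    import pandas as pd
    import cudf

    # CPU: the familiar pandas workflow.
    pdf = pd.read_csv("transactions.csv")
    cpu_result = pdf.groupby("user_id")["amount"].sum()

    # GPU: nearly identical code with cuDF; the dataframe lives in GPU memory
    # and the groupby/aggregation run as CUDA kernels.
    gdf = cudf.read_csv("transactions.csv")
    gpu_result = gdf.groupby("user_id")["amount"].sum()

    # Results can be moved back to pandas for inspection or downstream CPU code.
    print(gpu_result.to_pandas().head())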

RAPIDS UDFS
User-defined functions.
- Enables compiling Python user-defined functions (UDFs) and inlining them into native CUDA kernels, using the Numba Python compiler and the Jitify CUDA just-in-time (JIT) compilation library.
- Provides cuDF users the flexibility of Python with the performance of CUDA as a compiled language.

Example - the same element-wise UDF in cuDF and in pandas:

    # RAPIDS UDF (cuDF)
    import cudf
    from cudf.datasets import randomdata

    df = randomdata(nrows=10, dtypes={"a": int}, seed=12)  # small random frame (illustrative)
    df["a"].applymap(lambda x: x + 5 if x else x - 5)

    # Pandas UDF
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"a": np.random.randint(0, 10, 10)})  # small example frame (illustrative)
    df["a"].map(lambda x: x + 5 if x else x - 5)

PERFORMANCE
(Benchmark slide: end-to-end time in seconds, shorter is better, broken into data load + feature engineering, data conversion, and machine-learning training; a DGX-2 GPU system is compared against CPU clusters of 20 to 100 nodes, with the GPU pipeline several times faster than the CPU baselines. Exact figures are in the original chart.)

CHOOSE RAPIDS?
Pros:
- Pandas/NumPy/Scikit-learn-like API: easy to get hands-on
- End-to-end data science and analytics pipelines entirely on GPUs
- GPU acceleration: columnar processing on much larger dataset sizes
- The cuStreamz series: accelerated Kafka data source
Cons:
- Only works within one GPU (cannot exceed one GPU's memory)
- The RAPIDS API is not identical to the Pandas/NumPy/Scikit-learn API
- Low flexibility for customizing based on libcudf (C++)

WHAT IS DASK
- PyData native: built on top of NumPy, Pandas, Scikit-Learn, etc., with the same APIs (easy to learn) and the same developer community (well trusted)
- Scales: easy to install and use on a laptop; scales out to thousand-node clusters
- Popular: the most common parallelism framework today at PyData and SciPy conferences
- Deployable: HPC (SLURM, PBS, LSF, SGE), cloud (Kubernetes), Hadoop/Spark (YARN)

DEMO
The same workflow expressed with plain PyData, DASK + PyData, and DASK + RAPIDS:

    # PyData (single core)
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("file.csv")
    df.groupby(df.account_id).balance.sum()
    lr = LinearRegression()
    lr.fit(train, test)

    # DASK + PyData (scale out on CPUs)
    import dask.dataframe as dd
    from dask_ml.linear_model import LinearRegression

    df = dd.read_csv("s3://.../2018-*.csv")
    df.groupby(df.account_id).balance.sum()
    lr = LinearRegression()
    lr.fit(train, test)

    # DASK + RAPIDS (scale out on GPUs)
    import dask_cudf
    from cuml.dask.linear_model import LinearRegression

    df = dask_cudf.read_csv("s3://.../2018-*.csv")
    df.groupby(df.account_id).balance.sum()
    lr = LinearRegression()
    lr.fit(train, test)

CHOOSE OR NOT?
Pros:
- Seamlessly scale out PyData operations (nearly no code changes) with GPUs
- RAPIDS ecosystem: BlazingSQL, cuStreamz
- Lazy evaluation (task graph)
- Customization at the GPU level
- Multi-GPU, multi-node
Cons:
- Not all PyData APIs are supported (a lot is lost going Pandas → cuDF → dask-cudf): Parquet/ORC reading, grouping on multiple columns, rolling windows, label encoding
- DASK limitations:
  - Schedulers and graphs add overhead
  - Lack of optimization of operator combinations (such as operator fusion)
  - Inherits all capabilities and limitations of your single-partition function
  - Shuffling is expensive
  - https://distributed.dask.org/en/latest/limitations.html

SUMMARY
Scale up / accelerate versus scale out / parallelize:
- PyData (NumPy, Pandas, Scikit-Learn, Numba): single CPU core
- RAPIDS and others (NumPy → CuPy/PyTorch, Pandas → cuDF, Scikit-Learn → cuML, NetworkX → cuGraph, Numba → Numba): accelerated on a single GPU
- DASK + PyData (Dask Array, Dask-ML, Dask Futures): NumPy, Pandas, and Scikit-Learn distributed across a cluster
- RAPIDS + DASK with OpenUCX: multi-GPU, on a single node (DGX) or across a cluster
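A minimal sketch of the RAPIDS + DASK quadrant above: starting a local multi-GPU cluster with dask-cuda and running a dask_cudf aggregation across all GPUs on the node. It assumes the dask-cuda, dask_cudf, and distributed packages are installed and at least one GPU is visible; the file path and column names are hypothetical.

    import dask_cudf
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    # One Dask worker per visible GPU on this node.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Partitioned, GPU-backed dataframe spread across the workers.
    ddf = dask_cudf.read_parquet("./dataset/train/*.parquet")

    # Lazy task graph; compute() triggers execution on the GPUs.
    per_user = ddf.groupby("user_id")["amount"].sum().compute()
    print(per_user.head())

    client.close()
    cluster.close()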

NVTABULAR

CHALLENGES
Deep learning recommender systems require:
- Extensive experimentation to find the right features (faster iteration times)
- Huge datasets (1 PB+ is common; larger is better) that are complex to process
- Significant effort to deploy to production
Major performance bottlenecks:
- Preparing the dataset: feature transforms can require iteration over the entire dataset, and feature engineering exploration is iterative and done multiple times
- Data loading (50% of training time in a RecSys example); item-by-item dataloading does not scale to tabular data
(Slide diagram: the ETL loop - configure ETL, start ETL, add a feature, restart ETL - around dataset collection, analysis, training, and inference, comparing a CPU-powered and a GPU-powered workflow.)

LOCATION
NVTabular sits between the data lake and DL training, covering feature engineering and data loading:
- Extracting tabular data from a query engine or data lake
- Filtering, frequency filtering, combining features, hashing
- Shuffling data on the fly; output to .csv/.parquet
- Feeding the PyTorch, TensorFlow, and NVIDIA HugeCTR dataloaders
ETL cost = time to run + time to write.
(Slide chart: feature engineering plus processing time measured in hours to days for Spark and DASK on CPU, versus minutes for NVTabular, Dask-cuDF, and cuDF on GPU.)

FEATURE ENGINEERING EXAMPLE

    import glob
    import nvtabular as nvt

    # Specify which variables are categorical and which are continuous.
    cat_names = [...]   # categorical column names
    cont_names = [...]  # continuous column names
    label_name = [...]  # label column

    # Initialize the workflow.
    proc = nvt.Workflow(cat_names=cat_names, cont_names=cont_names, label_name=label_name)

    # Define the location of the training and validation sets.
    train_files = glob.glob("./dataset/train/*.parquet")
    valid_files = glob.glob("./dataset/valid/*.parquet")
    train_dataset = nvt.dataset(train_files, gpu_memory_frac=0.1)
    valid_dataset = nvt.dataset(valid_files, gpu_memory_frac=0.1)

    # Add feature engineering and preprocessing ops to the workflow:
    # log-transform the continuous variables, zero-filling any nulls,
    # and encode categoricals using the defined frequency thresholds.
    proc.add_cont_feature([nvt.ops.ZeroFill(), nvt.ops.LogOp()])
    proc.add_cont_preprocess(nvt.ops.Normalize())
    proc.add_cat_preprocess(nvt.ops.Categorify())

    # Compute statistics, transform the data, and export to disk, creating a
    # new shuffled training dataset and a validation dataset.
    proc.apply(train_dataset, shuffle=True,
               output_path="./processed_data/train",
               num_out_files=len(train_files))
    proc.apply(valid_dataset, shuffle=False,
               output_path="./processed_data/valid",
               num_out_files=len(valid_files))

NVTABULAR FEATURES
- Feature transforms accelerated on GPU
- No limit on dataset size (not bound by GPU or CPU memory)
- A higher-level abstraction: what you want to do rather than how to do it or how to scale it
- Batch dataloading into the major frameworks: PyTorch, TensorFlow, and HugeCTR
- Designed in particular for recommendation system training and inference (building)
- An easy path to production deployment for data transforms
- Consistency between data during training and inference

DASK INTEGRATION ON 0.2
All in DASK.
- Goal: make big-data preprocessing as painless as possible for the typical data scientist or engineer.
- Dataset: a universal Dataset object was introduced to effectively guarantee this translation (file paths or cuDF/Dask-cuDF dataframes in; transformed data out through the Workflow and data-loader APIs to the model, e.g. HugeCTR).
- Dask cluster: a dask.distributed cluster deployed on the system, with the corresponding Client object passed to the NVTabular Workflow.
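A minimal sketch of the Dask-cluster wiring described above, in the spirit of the 0.2-era API used in the earlier feature-engineering example. The client keyword on the Workflow follows the slide's description of "a Client object passed to the Workflow" but is an assumption here, as are the placeholder column names and paths.

    import glob
    import nvtabular as nvt
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    # A dask.distributed cluster with one worker per visible GPU.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Hand the Client to the Workflow so per-column statistics and transforms
    # run across the Dask-CUDA workers rather than a single GPU.
    proc = nvt.Workflow(
        cat_names=["user_id", "item_id"],   # placeholder columns
        cont_names=["price"],
        label_name=["click"],
        client=client,                      # assumed keyword, per the slide text
    )

    train_dataset = nvt.dataset(glob.glob("./dataset/train/*.parquet"), gpu_memory_frac=0.1)
    proc.add_cont_preprocess(nvt.ops.Normalize())
    proc.add_cat_preprocess(nvt.ops.Categorify())
    proc.apply(train_dataset, shuffle=True, output_path="./processed_data/train")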

NVTABULAR WORK FLOW
Training path: data lake (up to 1 EB) → offline preprocessing of specific datasets (up to 10 PB) → ML-ready dataset (up to 1 PB) → NVTabular feature engineering (up to 100 TB) → NVTabular training dataloader (up to 100 TB) → framework training (PyTorch / TensorFlow / HugeCTR) → model.
Inference path: candidate generation (find high recall) plus the config and model weights feed the inference server; NVTabular performs the inference-time transformation of the data (online feature engineering and preprocessing); the online inference server (Triton) serves up to 1 billion inferences per second within a low latency budget and returns recommendations to the web services.

SPARK 3.0

WHAT IS SPARK?
Goal: a unified engine across data sources, workloads, and environments.
- A distributed data processing framework that scales to thousands of nodes
- Supports Java, Scala, R, Python, and SQL
- Workloads: DataFrames/SQL/Datasets APIs, with MLlib, Spark SQL, Spark Streaming, and GraphX built on the RDD API and Spark Core
- Environments: YARN and others; data sources such as JSON, HBase, ...

ACCELERATOR FOR APACHE SPARK 3.0 (PLUGIN)
Distributed scale-out Spark applications on Apache Spark Core: Spark SQL API, DataFrame API, and Spark shuffle.
- RAPIDS Accelerator for Spark: if gpu_enabled(operation, data_type), call out to RAPIDS; else execute the standard Spark operation
- Custom implementation of the Spark shuffle: optimized to use RDMA and GPU-to-GPU direct communication
- JNI bindings map from Java/Scala to C++, on top of the RAPIDS C++ libraries, the UCX libraries, and CUDA

SPARK SQL & DATAFRAME COMPILATION FLOW
The same query can be written with the DataFrame API or SQL:

    bar.groupBy("product_id", "ds")
       .agg((max(col("price")) - min(col("price"))).alias("range"))

    SELECT product_id, ds, max(price) - min(price) AS range
    FROM bar
    GROUP BY product_id, ds

DataFrame → logical plan → physical plan; with the RAPIDS SQL plugin, the physical plan is rewritten into a GPU physical plan, producing RDD[ColumnarBatch] instead of RDD[InternalRow].
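For reference, a self-contained PySpark sketch of the query above, with a tiny hypothetical stand-in for the bar table; both the DataFrame and SQL forms compile to the same physical plan, which is what the RAPIDS SQL plugin can rewrite onto the GPU.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("groupby-range-demo").getOrCreate()

    # Tiny stand-in for the bar table from the slide (hypothetical data).
    bar = spark.createDataFrame(
        [("p1", "2020-12-01", 10.0), ("p1", "2020-12-01", 14.0), ("p2", "2020-12-01", 7.0)],
        ["product_id", "ds", "price"],
    )

    # DataFrame API version of the query on the slide.
    bar.groupBy("product_id", "ds") \
       .agg((F.max("price") - F.min("price")).alias("range")) \
       .show()

    # Equivalent SQL over a temporary view.
    bar.createOrReplaceTempView("bar")
    spark.sql("""
        SELECT product_id, ds, max(price) - min(price) AS range
        FROM bar
        GROUP BY product_id, ds
    """).show()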

SPARK 3.0 ON GPU
Distributed, scale-out data science and AI applications: end-to-end Apache Spark 3.0 pipelines with accelerated Spark components and accelerated ML/DL frameworks, running on the RAPIDS Accelerator for Apache Spark over GPU-accelerated infrastructure.
1. Runs supported SQL operations on the GPU. If an operation is not implemented or not compatible, it falls back to the Spark CPU version.
2. Allows running Spark SQL on a GPU with columnar processing (MLlib and GraphX are not supported).
3. Requires no API changes from the user.
4. Uses the RAPIDS cuDF C++ library with Apache Arrow.
ETL technology stack: Dask-cuDF and the Spark DataFrame sit on top of cuDF/pandas; Scala, PySpark, Python, and Java front ends go through Cython and JNI bindings down to RAPIDS libcudf (C++), the CUDA libraries, and CUDA.
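Because no API changes are required, enabling the accelerator is a matter of configuration. A minimal PySpark sketch with the RAPIDS Accelerator plugin enabled follows; the configuration keys are the commonly documented ones for the plugin, while the jar setup, resource amounts, and data path are deployment-specific assumptions.

    from pyspark.sql import SparkSession

    # Hypothetical setup; in practice the rapids-4-spark and cudf jars must be
    # on the classpath and a GPU must be visible to the executors.
    spark = (
        SparkSession.builder
        .appName("rapids-accelerated-etl")
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enable the RAPIDS Accelerator
        .config("spark.rapids.sql.enabled", "true")
        .config("spark.executor.resource.gpu.amount", "1")
        .config("spark.task.resource.gpu.amount", "0.25")        # 4 concurrent tasks per GPU
        .getOrCreate()
    )

    # Standard Spark SQL / DataFrame code runs unchanged: supported operations are
    # planned onto the GPU, unsupported ones fall back to the CPU.
    df = spark.read.parquet("./dataset/train")
    df.groupBy("product_id").count().show()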

TPC-DS QUERY 38 RESULTS
The entire query is GPU accelerated: roughly a 3x speed-up and 55% cost savings versus the CPU cluster.
- CPU cluster: driver 1x m5dn.large; workers 8x m5dn.2xlarge (64 cores, 256 GB total)
- GPU cluster: driver 1x m5dn.large; workers 8x g4dn.2xlarge (64 cores, 256 GB, 8x T4 GPUs)
- On-demand cluster cost (US West): $4.4488/h
(Slide charts: query time and total cost for the two clusters.)

SPARK 3.0 VS DASK (GPU)
Reasons you might choose Dask-RAPIDS:
- The DASK + RAPIDS ecosystem
- Your use case is complex or does not cleanly fit the Spark computing model
- Lighter setup than Spark/Hadoop; easy to get started with distributed GPU acceleration and pandas-like APIs
- The operations you need are supported by DASK-RAPIDS (efficiency)
- Multi-GPU cuGraph/cuML (Spark MLlib and GraphX are not GPU-accelerated)
Reasons you might choose Spark on GPUs:
- The Spark ecosystem (more robust with Hadoop): one unified platform for every big-data problem
- You prefer Scala or SQL (your workload can be efficiently expressed with the SQL/DataFrame APIs)
- The SQL/DataFrame query optimizer
- The Spark-RAPIDS shuffle manager (UCX)

BENCHMARK

FEATURE ENGINEERING
Feature engineering performance on 10 GB of data: the same pipeline (read Parquet, filter values < 0, fill missing, Categorify with low-frequency filters, log transform, write Parquet) compared across NVTabular, Dask and Spark on GPU (1x, 4x, and 8x V100 32 GB), and Spark on CPU, for a varying number of Parquet files. (Bar chart in the original slide; the GPU pipelines complete in a fraction of the CPU time.)

FEATURE ENGINEERING
The same pipeline on 140 GB of data, plus feature engineering by Dask-Pandas on 10 GB. CPU baselines include Dask on 10 CPU nodes (80 cores) and Spark CPU with 32 and 80 cores, compared against 8x Tesla V100-PCIE-32GB. (Bar chart in the original slide.)

OPERATORS PERFORMANCE
Query performance speed-up over pandas for individual operators (Filter, MinMax, Categorify, Join, ...) across NVTabular 0.2, Dask-RAPIDS, Spark on GPU, Dask-Pandas, and Spark on CPU, measured on Tesla V100-PCIE-32GB. Depending on the operator and stack, the measured speed-ups range from below 1x up to roughly 240x. (Bar chart in the original slide.)

CONCLUSION

FEATURES
- Spark-SQL-GPU: SQL operations accelerated by the spark-rapids plugin; columnar processing; multi-GPU
- NVTabular: high-level preprocessing and feature engineering library based on RAPIDS, designed for recommender system workflows
- DASK-RAPIDS: partitioned, GPU-backed libraries for distributed cluster computing
- RAPIDS (single GPU): cuDF - pandas-like GPU DataFrame operations based on the Apache Arrow columnar memory format; cuGraph, cuxfilter, cuStreamz

WORKFLOW WITH GPUS
ETL → data preparation → train → deploy.
(Slide diagram: sources such as medical devices, prescriptions, medical records, claims, and wearables land in a data lake; DASK-RAPIDS and NVTabular prepare the data; HugeCTR, TensorFlow, or PyTorch train and evaluate; TensorRT and the Triton Inference Server score and serve.)

SCENARIOS
Factors to weigh when choosing a stack:
- Training dataset size: terabytes vs. small
- ETL technology stack: Java, Scala vs. Python, Julia, ...
- Data storage structure: databases, NoSQL, flat data
- The elements in data processing: conversion, cleaning, enrichment, feature engineering
- Team composition and positioning: data scientist vs. data engineering disparity
- ETL time vs. training time
- Application scenario: real-time recommendation systems vs. off-line NLP

RELATED SESSIONS IN GTC CHINA
Learning more about NVIDIA Merlin:
- Merlin: a GPU-accelerated recommender system framework (CNS20590), 王泽囊, APAC AI developer technology manager, NVIDIA
- Merlin HugeCTR: a deep dive into performance optimization (CNS20516), Minseok Lee, GPU computing expert, NVIDIA
- Merlin NVTabular: best practices for GPU-accelerated recommender feature engineering (CNS20624), 黄孟迪, deep learning engineer, NVIDIA
- GPU-accelerated data processing for recommendation systems (CNS20813), 魏英灿, GPU computing expert, NVIDIA
- Integrating HugeCTR embeddings with TensorFlow (CNS20377), 董建兵, GPU computing expert, NVIDIA
- Accelerating CTR inference with a GPU embedding cache (CNS20626), 郁凡, GPU computing expert, NVIDIA
