Large-Scale Distributed Machine Learning in Practice on Apache Spark (26 pages).pdf

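Large-Scale Distributed Machine Learning in Practice on Apache* Spark*
Zhichao Li, Senior Software Development Engineer, Intel Corporation

Legal Notice
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Product performance varies with system configuration. No computer system can be absolutely secure. For more information, see , or contact the original equipment manufacturer or retailer. Tests measure the performance of components in specific systems. Any difference in hardware, software, or configuration may affect actual performance. Consult multiple sources to evaluate the performance of the systems or components you are considering purchasing. For more complete information about performance and benchmark data, visit http:/

Tests such as SYSmark and MobileMark are based on specific computer systems, hardware, software, operating systems, and functions; a change to any of these factors may cause the results to vary. Consult other information and performance tests, including the performance of the product when combined with other products, to fully evaluate the target product. For more information, visit http:/

Intel's SEC filings, including the annual report on Form 10-K, contain a detailed discussion of the factors that could affect Intel's results and plans. All products, computer systems, dates, and figures mentioned are preliminary results based on current expectations and are subject to change without notice. The products described may contain design defects or errors (noted in the errata) that may cause the product to deviate from published specifications. Current errata are available from Intel on request. Intel does not control or audit the third-party benchmark data or web sites referenced in this document. You should visit the referenced web sites to confirm whether the referenced data are accurate. Intel, the Intel logo, and the Intel. Experience What's Inside logo are trademarks of Intel Corporation in the United States and/or other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation. All rights reserved.

Content
- Project Overview
- Distributed ML on Spark
  - Fraud Detection: End-to-End Solution for Top Payments Company
  - Large-scale, Sparse Logistic Regression for Click-through and Purchase Rate Predictions
  - Deep (Convolutional) neural network
- Infrastructure support for distributed ML
  - Parameter server

Project Overview
- Research and open source project initiated by UC Berkeley AMPLab
- Intel is closely collaborating with AMPLab and the community on open source development
  - One of the earliest adopters of Spark* (since 2012)
  - Many key contributions (Netty shuffle, FairScheduler, "yarn-client" mode, etc.)
  - Collaborating on other components in BDAS (e.g., Tachyon*, SparkR, etc.)
- Intel is partnering with many "web-scale" companies
  - Free! No commercial solution or consultations
  - Online-LDA, Word2Vec (merged)
  - SparseML (separate package)
  - E.g., Tencent, PayPal*, Alibaba*, Baidu*/iQiyi, JD.com, Youku*, etc.

BDAS: Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/)
[Figure: BDAS stack diagram. Components shown: Spark Streaming, Spark Core, SampleClean, G-OLA, BlinkDB, Spark SQL, Velox*, SparkR, GraphX, Splash, MLBase, MLlib, ML Pipelines, Mesos*, Hadoop* YARN, HDFS, S3, Ceph*, Tachyon*, Succinct. Legend: AMPLab Developed, Spark Community, In Development, 3rd Party.]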

Fraud Detection on Apache Spark
Goal: given transaction details, classify whether a transaction is fraud or normal.
Evaluation metrics:
- Recall = correctly predicted fraud / all real fraud transactions
- Precision = correctly predicted fraud / all predicted fraud
Fraud can mean:
- Buying with stolen credit cards
- Abusing promotional programs
- Account takeover
- Spamming other users

Intel Customer Story
Problem statement and pain points
- An old rule-based system that needs significant improvement
- Turn to Spark for data statistics and model training
- Need a neural network for fraud detection on their Spark 1.4 cluster
Intel solution
- Implement a neural network on Spark and help integrate it
Business result
- The neural network model performs better than the other algorithms
- The machine learning system overtakes the rule-based system and exceeds expectations
- Improve precision by 15%, improve recall by 30%

Solution Architecture Overview
[Figure: solution architecture. Labels shown: Hive* Table, Spark* DataFrame, Spark Pipeline, Pre-Processing, Feature Engineering, Feature Selection (all features, selected features), Model Training, Neural Net Model, Model Evaluation & Fine Tune, model candidate, Model Ensemble, Post-Processing. Training data (normal, fraud) is sampled into partitions, with "Train one model" repeated per partition; test data yields predictions.]

Tool Stack Overview
[Figure: tool stack. Application: Fraud Detection (Driver). Stages: Pre-processing, Feature Engineering, Feature Selection, Model Training, Model Ensemble, Model Evaluation & Fine Tune, Post-Processing. Components: Apache* Spark ML Pipeline, Spark* SQL, Statistics, String Indexer, OneHotEncoder, WOE, Quantile Discretizer, Standardizer, Sampling Utility, Step-wise Feature Selector, Estimator, Neural Net Model, Bagging Utility, Grid Search, Cross Validation, Binary Class Evaluator, Model Selector. Legend: Spark Community, Intel Developed, Intel Improved, In Development.]

Logistic Regression on Spark* with Mini-Batch SGD
"Canonical" implementation. Repeat:
1. Driver broadcasts W to each worker
2. Workers compute the gradient for the next batch of B records from the training set
   - Each task (running on workers) samples records from its data partition
   - Each task computes its local gradient
3. Aggregate the gradients (possibly through tree aggregation)
4. Driver updates the weights
[Figure: a driver and workers; each worker samples from its partition (Partition 1, Partition 2, ..., Partition n) of the training set. Arrows numbered 1 to 4 correspond to the steps above.]

Network and Memory Bottlenecks
Click-through and purchase rate predictions
- Adopted by top internet companies
- Model size: hundreds of millions to billions of unique features
  - Weight (W) and gradient (G) are both double vectors, with one entry for each unique feature
- Training data: billions to trillions of training samples
  - Partitioned and cached across workers
Costs in each iteration (annotations on the same driver/worker figure):
- Broadcast W (800MB) to each worker
- Each task computes G (800MB)
- Each task sends G (800MB) for aggregation
- Training samples are cached in worker memory

Sparse Logistic Regression
Click-through and purchase rate predictions
- Adopted by top internet companies
- Model size: hundreds of millions to billions of unique features
- Training data: billions to trillions of training samples
Solution
- Cache data using a sparse format
- Use float16 (instead of double values)
- Extra support for binary (0 or 1) values
- Only calculate and sync the gradient for non-zero data
- Better communication
[Figure: the same driver/worker diagram, annotated: gradient is a sparse vector; compacted network communication; data cached using advanced encoding.]
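The four-step mini-batch SGD loop and the "only calculate and sync the gradient for non-zero data" idea from the sparse logistic regression slides can be illustrated together. The following is a minimal single-process sketch, not the implementation described in the talk: partitions are plain Scala collections standing in for cached Spark partitions, gradients are sparse maps, and float16 storage is not modelled. The names `SparseLrSketch`, `Record`, `localGradient`, `merge`, and `train` are hypothetical.

```scala
import scala.util.Random

object SparseLrSketch {
  // A training record: indices of its non-zero (binary) features and a 0/1 label.
  final case class Record(features: Array[Int], label: Double)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Step 2: a task computes its local gradient, touching only non-zero features.
  def localGradient(w: Array[Double], batch: Seq[Record]): Map[Int, Double] = {
    val g = scala.collection.mutable.Map.empty[Int, Double]
    for (r <- batch) {
      val err = sigmoid(r.features.map(i => w(i)).sum) - r.label
      r.features.foreach(i => g(i) = g.getOrElse(i, 0.0) + err)
    }
    g.toMap
  }

  // Step 3: merge two sparse gradients; only non-zero entries are exchanged.
  def merge(a: Map[Int, Double], b: Map[Int, Double]): Map[Int, Double] =
    b.foldLeft(a) { case (acc, (i, v)) => acc.updated(i, acc.getOrElse(i, 0.0) + v) }

  def train(partitions: Seq[Seq[Record]], dim: Int, batchPerTask: Int,
            iterations: Int, lr: Double, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    val w = new Array[Double](dim) // weights held by the "driver"
    for (_ <- 0 until iterations) {
      // Step 1: every task reads the current w (a broadcast on a real cluster).
      val grads = partitions.map { p =>
        val batch = Seq.fill(batchPerTask)(p(rng.nextInt(p.length)))
        localGradient(w, batch)
      }
      val total = grads.reduce(merge)
      val n = (partitions.length * batchPerTask).toDouble
      // Step 4: the driver updates only the weights that received a gradient.
      total.foreach { case (i, gi) => w(i) -= lr * gi / n }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val pos = Seq.fill(4)(Record(Array(0), 1.0)) // feature 0 always means "positive"
    val neg = Seq.fill(4)(Record(Array(1), 0.0)) // feature 1 always means "negative"
    println(train(Seq(pos, neg), 3, 2, 50, 0.5, 42L).mkString(", "))
  }
}
```

Because the merged gradient only carries entries for features that actually occurred in the sampled batches, weights for unseen features (index 2 in the example) are never transmitted or updated.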

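The 800MB figure on the bottleneck slide is consistent with a dense vector of 100 million double entries (8 bytes each). The sketch below works through that arithmetic and compares it with a sparse float16 encoding. The 32-bit index width and the count of one million non-zero gradient entries are illustrative assumptions, not numbers from the talk; `GradientTrafficSketch` and its members are hypothetical names.

```scala
object GradientTrafficSketch {
  val BytesPerDouble = 8L
  val BytesPerIndex = 4L   // assumption: each feature index is a 32-bit integer
  val BytesPerFloat16 = 2L

  // Dense encoding: one double per unique feature, whether or not it is zero.
  def denseBytes(numFeatures: Long): Long = numFeatures * BytesPerDouble

  // Sparse encoding: one (index, float16 value) pair per non-zero entry.
  def sparseBytes(nonZeros: Long): Long = nonZeros * (BytesPerIndex + BytesPerFloat16)

  def main(args: Array[String]): Unit = {
    val features = 100000000L // 100 million unique features
    val nonZeros = 1000000L   // hypothetical number of non-zero gradient entries
    println(s"dense vector:  ${denseBytes(features) / 1000000} MB")
    println(s"sparse vector: ${sparseBytes(nonZeros) / 1000000} MB")
  }
}
```

Under these assumptions the per-task payload drops from 800 MB to 6 MB per iteration; the real saving depends on how many distinct features a batch touches.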
Training a Neural Network
- Multi-Layer Perceptron (MLP): fully connected, feed-forward
- Deep learning: CNN, autoencoder, RBM, etc.
Repeat:
1. Driver broadcasts parameters (weights and biases) to each worker
2. Workers process the next batch of B records from the training set
   - Each task (running on workers) samples records from its data partition
   - Each task computes the forward and backpropagation pass
3. Driver aggregates the gradients
4. Driver updates the parameters (weights and biases)
[Figure: a driver and workers; each worker samples from its partition (Partition 1, Partition 2, ..., Partition n) of the training set. Arrows numbered 1 to 4 correspond to the steps above.]

Distributed Neural Network
- Built on top of standard Big Data platforms
  - Easily utilize your existing clusters
- Engaging industry users and community early
  - Evolving with feedback from real-world use cases
  - Community version compatible with Spark* MLP
- Targeting full function coverage:
  - Auto Encoder, Sparse Encoder
  - Convolution with max and avg pooling
  - RBM and DBN
- Benchmark with popular datasets/models
  - GoogleNet, AlexNet on ImageNet
- Easy MKL integration for Intel Architecture acceleration
- Better communication: all-to-one, all-reduce on Spark (CaffeOnSpark), ParameterServer
- Free community license (https:/

Intuitive API with layer-based interface

    val trainData = loadData()
    val model = new Sequential()
    model += new Convolution()
    model += new maxPooling()
    val criterion = new ClassNLLCriterion()
    val optimizer = new ParallelOptimizer(model, new SGD)
    optimizer.setCrossValidation(evaluator.accuracy)
    optimizer.setPath("./model_save.obj")
    optimizer.optimize(trainData)

Flaw detection in steel product
[Figure, labeled 10/11: CNN architecture. Convolution (5,5), Maxpooling (2,2,2,2), Convolution (5,5), Maxpooling (2,2,2,2), followed by fully connected (FC) layers; the sizes 300, 100, and 5 are marked on the figure.]

Pipeline
[Figure, labeled 10/11: pipeline. Labels shown: Pre-process, Defect Proposal Algorithm 1, Defect Proposal Algorithm 2, Proposal, Classification Model, Defect, Normal.]

Communication Model
[Figure: communication patterns between the driver and tasks: all-to-one, all-reduce (tree aggregation), all-reduce, and parameter server.]

"Parameter Server" support?
- Very large scale model/graph (billions of unique features)
- Leveraging further data sparsity in each worker (only a subset of the weight vector is needed)
- Possible weakly-synchronized model (BSP vs. SSP vs. ASP, etc.)
- Distributed parameter aggregation and update in parallel
- Easy integration with Apache Spark*
- Fault tolerance
- Co-partitioning
Source: Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks. Advances in Neural Information Processing Systems, 2012: 1223-1231.

Reference & Resources
- Intel packages: https:/
- Analytics: https:/

Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at , or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http:/

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Risk Factors
The above statements and any others in this document that refer to future plans and expectations are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "goals," "plans," "believes," "seeks," "estimates," "continues," "may," "will," "should," and variations of such words and similar expressions are intended to identify such forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations.

Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers.

Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges.

Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term.

The amount, timing and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation.

Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property.

Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. We completed our acquisition of Altera on December 28, 2015 and risks associated with that acquisition are described in the "Forward Looking Statements" paragraph of Intel's press release dated June 1, 2015, which risk factors are incorporated by reference herein. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 1/14/16
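Returning to the Communication Model slide: all-to-one and tree aggregation produce the same summed gradient and differ only in who does the adding and in how many rounds. The following is a minimal single-process sketch of that contrast, assuming gradients are small dense vectors; `AggregationSketch`, `allToOne`, and `tree` are hypothetical names and no network transfer is modelled.

```scala
object AggregationSketch {
  type Vec = Vector[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // All-to-one: the driver receives every task's gradient and sums them itself.
  def allToOne(grads: Seq[Vec]): Vec = {
    require(grads.nonEmpty, "need at least one gradient")
    grads.reduce(add)
  }

  // Tree aggregation: neighbouring partial sums are combined pairwise, so the
  // number of partial sums is roughly halved in every round.
  // Returns the total and the number of rounds that were needed.
  def tree(grads: Seq[Vec]): (Vec, Int) = {
    require(grads.nonEmpty, "need at least one gradient")
    var level: Seq[Vec] = grads
    var rounds = 0
    while (level.length > 1) {
      level = level.grouped(2).map(_.reduce(add)).toVector
      rounds += 1
    }
    (level.head, rounds)
  }

  def main(args: Array[String]): Unit = {
    val grads = (1 to 5).map(i => Vector(i.toDouble, 2.0 * i))
    println(allToOne(grads))
    println(tree(grads))
  }
}
```

With all-to-one the driver handles every incoming vector, whereas in the tree variant each round's additions can run on different workers, which is why the mini-batch SGD slide mentions tree aggregation as an option.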

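Two of the parameter server points above, "only a subset of the weight vector is needed" by each worker and "distributed parameter aggregation and update in parallel", can be sketched as a sharded key-value store with pull and push operations. This is a single-process illustration under stated assumptions: sharding by index modulo the shard count and the names `ParameterServerSketch`, `Shard`, `Server`, `pull`, and `push` are hypothetical, and synchronization models (BSP, SSP, ASP), fault tolerance, and co-partitioning are not modelled.

```scala
object ParameterServerSketch {
  // One shard owns the weights whose feature index maps to it.
  final class Shard {
    private val weights = scala.collection.mutable.Map.empty[Int, Double]

    // Return the current value of each requested weight (0.0 if never updated).
    def pull(keys: Seq[Int]): Map[Int, Double] =
      keys.map(k => k -> weights.getOrElse(k, 0.0)).toMap

    // Apply a sparse gradient with a plain SGD step.
    def push(grad: Map[Int, Double], lr: Double): Unit =
      grad.foreach { case (k, g) => weights(k) = weights.getOrElse(k, 0.0) - lr * g }
  }

  final class Server(numShards: Int) {
    private val shards = Vector.fill(numShards)(new Shard)

    private def shardOf(key: Int): Int = Math.floorMod(key, numShards)

    // A worker pulls only the weights for features present in its batch.
    def pull(keys: Seq[Int]): Map[Int, Double] =
      keys.groupBy(k => shardOf(k)).flatMap { case (s, ks) => shards(s).pull(ks) }

    // A worker pushes a sparse gradient; each shard updates only its own keys.
    def push(grad: Map[Int, Double], lr: Double): Unit =
      grad.groupBy { case (k, _) => shardOf(k) }
        .foreach { case (s, g) => shards(s).push(g, lr) }
  }

  def main(args: Array[String]): Unit = {
    val server = new Server(3)
    server.push(Map(7 -> 2.0, 1000000 -> -4.0), 0.5)
    println(server.pull(Seq(7, 1000000, 5)))
  }
}
```

Because every shard holds only the keys that were ever pushed to it, the full weight vector never has to exist on a single machine, which is the property that matters for models with billions of unique features.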