2020 WaveSummit Deep Learning Developer Summit: Collected Speaker Slide Decks (PPT)
Talk: Neural Network Architecture Search (AutoDL Design)

- Search network architectures from scratch at the layer/op level, built on PaddlePaddle and PARL.
- A reinforcement-learning controller (an RNN/LSTM) emits architecture decisions step by step: an fc + multinomial head samples op ids, and an fc + Bernoulli head samples skip connections. Early stopping discards weak candidates during evaluation.
- [Figure: searched (a) Normal Cell and (b) Reduction Cell, composed of skip_connect, sep_conv_3x3, dil_conv_3x3 and max_pool_3x3 operations.]
- On CIFAR-10, searched networks reach 97+% accuracy, improved to about 98% with further training; a demo with the training curve was shown. [Plot residue removed.]
- AutoDL Design is open-sourced under PaddlePaddle/AutoDL, with experiments on MNIST and CIFAR-10; the RNN controller searches cell structures over convolution variants such as 3x3, 1x3 and 3x1.

THANKS
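The controller described above samples one op id per layer (multinomial head) and one Bernoulli decision per possible skip connection. A minimal pure-Python sketch of that sampling step; the op vocabulary and probability tables are illustrative stand-ins, not the talk's actual controller:

```python
import random

# Hypothetical op vocabulary, standing in for the ops shown in the searched cells.
OPS = ["sep_conv_3x3", "dil_conv_3x3", "max_pool_3x3", "skip_connect"]

def sample_architecture(num_layers, op_probs, skip_prob, rng=random):
    """Sample one candidate network the way the controller's heads would:
    a multinomial head picks an op id per layer, and a Bernoulli head decides
    for each earlier layer whether it feeds a skip connection into this one."""
    arch = []
    for layer in range(num_layers):
        op_id = rng.choices(range(len(OPS)), weights=op_probs)[0]
        skips = [int(rng.random() < skip_prob) for _ in range(layer)]
        arch.append({"op": OPS[op_id], "skips": skips})
    return arch

arch = sample_architecture(4, op_probs=[0.4, 0.2, 0.2, 0.2], skip_prob=0.5,
                           rng=random.Random(0))
```

In the real search, the probabilities come from the LSTM state and are updated with the reward (validation accuracy) via policy gradient; here they are fixed just to show the sampling interface.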
Talk: Data Feeding in Paddle Fluid

- Core concepts: batch, batch size, epoch, shuffle; data augmentation examples: random crop, flip.
- MNIST: 60,000 training images, each 28x28x1 bytes with pixel values in [0, 255] (8 bits, 1 byte per pixel) and labels in [0, 9].
- A reader is a Python function that returns a generator yielding samples one by one; Paddle provides an MNIST reader API. Decorators compose readers: a shuffle decorator buffers and permutes samples, and paddle.batch groups a reader's output into batches.
- [Figure: Sample 1..N -> shuffled reader (buf_size=6) -> batch reader (batch_size=3) -> DataFeeder / py_reader -> Paddle C++ backend Executor::Run().]
- Two feeding paths into Executor::Run(): (1) a DataFeeder converts a batch reader's output into Paddle Tensors via DataFeeder.feed() and matches them to data layers; (2) py_reader pushes data from a Python thread into a queue consumed by the C++ backend, decoupling Python-side reading from execution.
- py_reader usage: (1) create it with capacity, shapes and dtypes matching the data layers' shape and dtype; (2) obtain input variables with read_file(py_reader) instead of feeding data layers; (3) bind a decorated reader to the py_reader; (4) call py_reader.start() before each epoch, and call py_reader.reset() when the C++ backend raises EOF at the end of an epoch.
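The reader -> shuffle -> batch composition above can be mirrored in pure Python. This is a sketch of the pattern, not Paddle's actual implementation; the function names follow the talk's vocabulary:

```python
import random

def sample_reader(samples):
    """A reader is just a function returning a generator that yields samples."""
    def reader():
        for s in samples:
            yield s
    return reader

def shuffle(reader, buf_size, rng=random):
    """Buffered shuffle in the style of paddle.reader.shuffle: fill a buffer
    of buf_size samples, emit them in random order, refill, repeat."""
    def shuffled():
        buf = []
        for s in reader():
            buf.append(s)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for b in buf:
                    yield b
                buf = []
        rng.shuffle(buf)
        for b in buf:
            yield b
    return shuffled

def batch(reader, batch_size):
    """Group consecutive samples into lists of batch_size, like paddle.batch."""
    def batched():
        b = []
        for s in reader():
            b.append(s)
            if len(b) == batch_size:
                yield b
                b = []
        if b:
            yield b
    return batched

train_reader = batch(shuffle(sample_reader(range(6)), buf_size=6), batch_size=3)
batches = list(train_reader())
```

Because every stage is itself a reader (a function returning a generator), stages compose freely, which is exactly why the decorator style works well for data pipelines.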
- Softmax regression: given input X and label Y, with weights W and bias b,
      y_i = softmax(sum_j w_ij * x_j + b_i),  where softmax(x_i) = e^{x_i} / sum_j e^{x_j}.
- [Figure: MNIST digit recognition, one head per class net0..net9 producing y0..y9, with bias (+1) inputs.]
- Example network: Data -> conv2d -> pool2d -> ReLU -> conv2d -> pool2d -> ReLU -> fc -> cross_entropy (with Label) -> loss -> avg_loss.
- Fluid loss layers include cross_entropy, linear_chain_crf, bpr_loss, edit_distance, warpctc, dice_loss, mean_iou, log_loss, huber_loss. cross_entropy supports soft_label=True (soft targets) and soft_label=False (hard integer labels).
- Fluid optimizers: SGD, Momentum, Adagrad, Adam, Adamax, DecayedAdagrad, Ftrl, Adadelta, RMSProp, LarsMomentum.
- A fluid.Program comes in two parts: the startup_program (fluid.default_startup_program(), parameter initialization, run once) and the main_program (the training graph, run per iteration). Choose a place such as fluid.CUDAPlace(0) for GPU, create a fluid.Executor(), and call run() on a fluid.Program; run(feed=...) supplies inputs and run(fetch=...) retrieves outputs.
- For multi-GPU training (e.g. export CUDA_VISIBLE_DEVICES=0,1,2,3), compile fluid.default_main_program() into a compiled_program that replicates the graph across devices.

THANKS

Talk: Paddle Fluid Core Concepts

- Fluid describes a model as a Program through a Python API and executes it with a CPU/GPU runtime; higher-level CV/NLP APIs build on the same core concepts: Variables, Operators, Layers, Control Flow, Executor, Save/Restore, Block.
- Program: a container of Blocks holding Variables and Layers; fluid.default_startup_program() and fluid.default_main_program() are created implicitly.
- Layer: a layer expands into one or more operators; layers consume and produce Variables, e.g.
      out = fluid.layers.relu(x)
      out = fluid.layers.fc(input=x, size=1000, act='tanh')
- Variable: a (typically Tensor-valued) value produced and consumed by Layers, e.g.
      var = fluid.layers.fill_constant(shape=[1], dtype='int64', value=5)
- Executor: runs a Program; feed supplies inputs and fetch_list names outputs (e.g. computing x + y = z). Building the Program is compile time; the Executor is run time.
- Worked example: linear regression on the Boston housing data. Features include CRIM (crime rate), ZN (residential lots over 25,000 sq ft), INDUS, CHAS (Charles River: 1 = on the river, 0 = not), NOX, RM, AGE (built before 1940), DIS (distances to five employment centers). Each layer call appends vars and ops to block_0 of the default_main_program:

      import paddle.fluid as fluid
      x = fluid.layers.data(name='x', shape=[13], dtype='float32')
      y = fluid.layers.data(name='y', shape=[1], dtype='float32')
      y_predict = fluid.layers.fc(input=x, size=1, act=None)           # mul + elementwise_add
      cost = fluid.layers.square_error_cost(input=y_predict, label=y)  # elementwise_sub + square
      avg_cost = fluid.layers.mean(x=cost)
      sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
      sgd_optimizer.minimize(avg_cost)   # appends gradient ops (mean_grad, square_grad, ...) and sgd updates

THANKS
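The softmax formula used in the classification walkthrough above is easy to check numerically. A minimal pure-Python version with the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(xs):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j); subtracting the max first
    leaves the result unchanged but avoids overflow for large inputs."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

The outputs are positive and sum to 1, so they can be read as class probabilities, which is what the cross_entropy loss above consumes.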
Talk: PaddleHub, Pre-trained Models and Transfer Learning

- PaddleHub ships pre-trained Paddle models (ResNet, MobileNet, SSD, NASNet, ERNIE, BERT, LAC, Senta, ...) trained with 10,000+ GPU hours, plus a command-line tool and a Finetune API. Install with: pip install paddlehub.
- A module exposes named input/output tensors (Input Tensor 1..3 -> Output Tensor 1..3) that downstream tasks connect to through a DataFeeder.
- Fine-tuning a text classifier:
      dataset = hub.dataset.ChnSentiCorp()
      reader = hub.reader.ClassifyReader(dataset=dataset,
                                         vocab_path=module.get_vocab_path(),
                                         max_seq_len=128)
- The reader tokenizes text ([CLS] ... [SEP] [PAD] ...) into input_ids, position_ids, segment_ids and input_mask; ClassifyReader and SequenceLabelReader cover classification and tagging respectively.
- On top of ERNIE/BERT outputs ([CLS], T1..Tn), PaddleHub builds a Text Classification Task from pooled_output and a Sequence Labeling Task from sequence_output.
- Training can be visualized with: visualdl --logdir=/path/to/log

THANKS

Talk: PaddleNLP

- An industrial-grade NLP toolkit on PaddlePaddle: more models (More), flexible composition (Flexible) and performance (Performance), with pre-trained representations such as BERT, ELMo and ERNIE.
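The four feature arrays the ClassifyReader produces above (input_ids, position_ids, segment_ids, input_mask) can be sketched in pure Python. This illustrates the single-sentence case only; the real reader also inserts [CLS]/[SEP] via its tokenizer, and the token ids below are made up:

```python
def build_features(token_ids, max_seq_len, pad_id=0):
    """Pad or truncate a token-id sequence to max_seq_len and derive the
    companion inputs a BERT/ERNIE-style model expects:
    position ids, segment ids (all 0 for one sentence), and the input mask
    marking real tokens (1) vs padding (0)."""
    ids = list(token_ids)[:max_seq_len]
    n = len(ids)
    pad = max_seq_len - n
    input_ids = ids + [pad_id] * pad
    position_ids = list(range(n)) + [0] * pad
    segment_ids = [0] * max_seq_len
    input_mask = [1] * n + [0] * pad
    return input_ids, position_ids, segment_ids, input_mask

# Hypothetical ids: [CLS]=101, two word-piece ids, [SEP]=102.
feat = build_features([101, 2345, 3456, 102], max_seq_len=8)
```

The input_mask is what lets attention ignore the [PAD] positions, which is why it must line up exactly with the padding in input_ids.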
- Coverage: 10+ NLP tasks, with per-task model menus drawn from BOW, CNN, LSTM, GRU, BiLSTM, BiGRU, TextCNN, BiGRU-CRF, MMDNN, Transformer, BiDAF, BERT and ERNIE.
- Accuracy comparison on three tasks (task labels garbled in extraction), showing ERNIE ahead:
      TextCNN 76.8%  /  BERT 78.6%  /  ERNIE 80.6%
      BiLSTM 91.8%   /  BERT 94.3%  /  ERNIE 95.4%
      BiGRU-CRF 88.0% / BERT 90.2%  /  ERNIE 92.0%
- Reading comprehension: DuReader vs SQuAD scale comparison (figures garbled in extraction).
- See paddlepaddle.org/paddlenlp.

THANKS

Talk: Large-Scale Distributed Training

- Training is forward + backward over a neural network; two scaling modes: Parameter Server mode (CPU & GPU, CPU & FPGA; pipelines A/B over data shards) and Collective mode (Data A..D on GPU1..4 with gradient/parameter synchronization).
- A user Program is transpiled either into a DistributedProgram (ServerProgram + WorkerProgram) for parameter-server training, or into a CollectiveProgram for collective training.
- Typical workload: recommendation/CTR queries over sparse ids (User Id, News Id, Video Id, News Tag, and User Id x News Id crosses) fed through embedding representations into a DNN.
- Worker side: Hogwild!-style training threads with asynchronous IO: async read from the data shard, FF/BP op execution, pull of sparse parameters, and async communication aggregating gradients pushed to the servers. Dense parameters live in the global scope; sparse parameters live in thread scope and are pulled on demand.
- Server side: a key-value parameter table on each node with automatic growth (rows created on first access), built on baidu-rpc.
- Benchmark models: MLP (140M to 180M parameters) and Multi-Field MLP (ReLU-32/128/256/1024 stacks over summed embeddings). Throughput on a CTR task scales from 1 node x 10 threads to 100 nodes x 10 threads for batch sizes 32/128/512. [Plot residue removed.]
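The auto-growing key-value table described above can be sketched as a sharded dict that creates embedding rows on first pull. This is a toy illustration of the idea (no RPC, no eviction); the class name, sharding-by-hash and init range are all assumptions:

```python
import random

class SparseTable:
    """Sketch of an auto-growing sparse parameter table: embedding rows are
    created on first access and sharded by hashing the feature key."""
    def __init__(self, num_shards, dim, rng=random):
        self.shards = [dict() for _ in range(num_shards)]
        self.dim = dim
        self.rng = rng

    def _shard_of(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def pull(self, key):
        shard = self._shard_of(key)
        if key not in shard:  # auto-growth: new feature id -> new row
            shard[key] = [self.rng.uniform(-0.01, 0.01) for _ in range(self.dim)]
        return shard[key]

    def push(self, key, grad, lr=0.1):
        row = self.pull(key)
        for i, g in enumerate(grad):  # in-place SGD step on the sparse row
            row[i] -= lr * g

table = SparseTable(num_shards=4, dim=3)
v = table.pull("user:42")
table.push("user:42", [1.0, 0.0, -1.0])
```

Only the feature ids that actually occur in the data ever materialize rows, which is what makes a self-growing feature space with massive id vocabularies feasible.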
- [Remaining throughput-plot residue removed.] Distributed training also runs on K8S.

THANKS

Talk: PARL, a Deep Reinforcement Learning Framework

- Motivation: DRL algorithms are trick-sensitive and hard to reproduce; PARL targets flexibility, reproducibility and scalable parallelism.
- DQN anatomy: a simulator produces observations; a Q network selects actions along a trajectory; transitions feed a replay buffer; an updating network is trained against a periodically synchronized target network by minimizing the TD loss
      L(theta) = E[(r + gamma * max_{a'} Q(s', a'; theta^-) - Q(s, a; theta))^2].
- Implementing DQN directly in Fluid, step by step:
  Step 1: inputs.
      def get_input():
          return (fluid.layers.data(name='state', ...),
                  fluid.layers.data(name='action', ...),
                  fluid.layers.data(name='reward', ...))
  Step 2: updating and target networks.
      def deep_q_net_update(image):
          conv_1 = fluid.layers.conv2d(...)
          conv_2 = ...
          output = fluid.layers.fc(...)
          return output, vars
      def deep_q_net_target(image):
          ...
  Step 3: build the train, test and synchronizing Programs.
      # Train program
      update_q, update_vars = deep_q_net_update(image)
      target_q, target_vars = deep_q_net_target(image)
      max_target_q = fluid.layers.reduce_max(target_q, ...)
      max_target_q.stop_gradient = True
      loss = fluid.layers.square_error_cost(...)
      # Synchronizing program
      for i, var in enumerate(update_vars):
          sync_ops.append(fluid.layers.assign(update_vars[i], target_vars[i]))
  Step 4: replay memory.
      class ReplayMemory(object):
          def push(...): ...
          def sample_batch(...): ...
  Step 5: training loop.
      rpm = ReplayMemory(...)
      while ...:
          action = ...
          next_state = ...
          rpm.push(...)
          if train_step % synchronizing_interval == 0:
              ...  # run the synchronizing program
          ... = rpm.sample_batch(...)  # train on a batch
- With PARL the same boils down to:
      from parl.algorithms import DQN
      algorithm = DQN(...)
      cost = algorithm.define_learn(...)
      if ...:
          algorithm.sync_target(self.gpu_id)
- PARL abstractions: Model, Agent, Algorithm. The Agent mediates state-action interaction with the Simulator; the Algorithm (Q-Learning, DDPG, PPO, ...) defines define_learn / define_predict / sync_target over policy and value Models.
- A Layer Wrapper enables parameter copying/sharing between Models (e.g. Actor and Learner):
      class LayerFunc(object):
          def __deepcopy__(self, memo): ...
      target_q = copy.deepcopy(update_q)
      algorithm.sync_params_to(target_q, ...)
- Parallelism via a remote decorator:
      def remote_class(cls):
          class ClientWrapper(object):
              def as_remote(self, server_ip, server_port,
                            remote_ip=None, remote_port=None):
                  self._connect_server()
                  reply_thread = threading.Thread(...)
                  reply_thread.setDaemon(True)
                  reply_thread.start()
      @parl.remote_class
      class Actor(object):
          def sample(self):
              return sample_data
      actor = Actor()
      actor.as_remote(...)
      remote_manager = RemoteManager(port=...)
      while True:
          remote_actor = remote_manager.get_remote()
          batch = remote_actor.sample()
- Architecture: many Actors (each with a simulator) on a CPU cluster stream training data through a data server to a GPU Learner, relieving the single-machine bottleneck; comparable in spirit to Ray RLlib (UC Berkeley).
- NeurIPS 2018 "AI for Prosthetics" challenge: champion solution. Top-3 cumulative rewards: 3rd 9947.096, 2nd 9949.93, Ours 9980.46. [Figure: critic and policy networks as stacks of Dense layers over observation and control command.]
- IMPALA benchmark vs RLlib (mean episode rewards after 1 hour; P40 GPU plus CPU workers; python3.5 + paddle 1.3.0; column labels garbled):
      BreakoutNoFrameskip-v4:       538 / 582 / 426 / 495
      BeamRiderNoFrameskip-v4:      3181 / 5308 / 4411 / 3819
      SpaceInvadersNoFrameskip-v4:  843 / 977 / 1516 / 1266
      QbertNoFrameskip-v4:          10850 / 19680 / 15611 / 17538
- IMPALA architecture: a buffer between Workers (Env + Policy) and Learners, with policy-lag correction when training the value and policy networks.
- Releases: v1.0 (2019.01): DQN, Double DQN, Policy Gradient, Proximal Policy Optimization, Distributed DDPG; v1.1 (2019.04): A2C, GA3C, IMPALA; v1.2 (planned): Evolutionary Learning, World Model & Planning, Learner improvements.

THANKS

Talk: Video Classification with PaddlePaddle

- Two model families: end-to-end models, i.e. 3D CNNs and TSN/TSM (C3D, I3D, P3D, Non-local); and two-stage models, i.e. 2D CNN features plus local feature integration (LSTM, Attention Cluster, ...). (Demo: "Flying Cards"; slide details garbled.)
- Model zoo accuracy:
      TSN: Kinetics-400 Top-1 (local) 0.67
      Non-Local: Kinetics-400 Top-1 (local) 0.74 (slide also shows 0.62)
      StNet: Kinetics-400 Top-1 (local & global) 0.69
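Step 4 of the DQN walkthrough above only names ReplayMemory's interface (push, sample_batch). A minimal working version, with a fixed capacity that overwrites the oldest transitions (a common choice, assumed here rather than taken from the slides):

```python
import random

class ReplayMemory:
    """Fixed-capacity experience buffer: push() overwrites the oldest
    transition once full; sample_batch() draws a uniform random batch."""
    def __init__(self, capacity, rng=random):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0          # next slot to overwrite once full
        self.rng = rng

    def push(self, transition):
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample_batch(self, batch_size):
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

rpm = ReplayMemory(capacity=3)
for t in range(5):
    rpm.push(("state", t))
batch = rpm.sample_batch(2)
```

Uniform sampling from a ring buffer is what breaks the temporal correlation between consecutive transitions, which is the point of experience replay in DQN.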
      TSM: Kinetics-400 Top-1 (local) 0.70
      Attention LSTM: Youtube-8M GAP (global) 0.86
      Attention Clusters: Youtube-8M GAP 0.87
      NeXtVLAD (2nd place, Youtube-8M 2018): Youtube-8M GAP 0.87
- TSN: temporally sample several snippets, run a shared ConvNet on each, and average their predictions (segmental consensus). [Figure: ConvNets over sampled snippets feeding a segmental consensus module.]
- Non-Local: pixels should be related with each other across the spatial-temporal space. [Figure: a T x H x W x 1024 feature map projected by 1x1x1 convolutions theta, phi and g to T x H x W x 512, a THW x THW softmax attention map, and a residual connection back to T x H x W x 1024.] Facebook's self-attention-style module; strong accuracy, heavy compute.
- StNet: temporally sample "super-images"; 2D conv on super-images for local spatio-temporal modeling; stacked 3D/2D conv blocks for global spatio-temporal modeling; a Temporal Xception block for long-term dynamics. [Figure: Conv1 -> Res2..Res5 interleaved with temporal modelling blocks (reshape to T x Ci x Hi x Wi, Conv3d(Ci, (3,1,1), 1) + BN3d + ReLU) -> AvgPool -> Temporal Xception Block -> FC.] ActivityNet 2018 best single model; published at AAAI'19.
- TSM: temporally sample several snippets and exchange part of the channels among nearby snippets (temporal shift), so plain 2D convs accomplish 3D spatio-temporal modeling. [Figure: N segments S1..SN -> sampled frames -> 2D conv with a TemporalShift branch and an identity branch.] An improved TSN: simple, efficient, current SOTA.
- Attention LSTM: ActivityNet 2017 best single model; a bidirectional sequence model (LSTM) with self-attention pooling over RGB and audio streams, followed by FC-8192 -> FC-4096 -> sigmoid.
- Attention Clusters (CVPR'18): attention units with a shifting operation (weighting, multiplication, summation) over local feature sets; per-modality clusters (RGB / Flow / Audio) are concatenated into a fully connected + softmax classifier, better capturing combinations of features.
- NeXtVLAD (Youtube-8M 2018 best single model): split a local feature sequence's channels into G groups; apply an end-to-end learnable VLAD encoding per group; add Squeeze-and-Excitation-style per-channel attention; classify at the video level.
- Practicalities: trained on Youtube-8M and Kinetics; one-line launch scripts, e.g. bash scripts/test/test_stnet.sh; inference supports 3D-CNN feature extraction, top-K labels and GAP evaluation, a C++ API, and TensorRT acceleration. Roadmap across Q2-Q4 includes GAN-based models.

THANKS

Talk: Deep Learning in the Forest: Detecting the Red Turpentine Beetle in Half an Hour

PaddlePaddle, an open-source deep learning platform born from industrial practice. Platform panorama:
- Core framework: development (dynamic graph, static graph); training (large-scale distributed training, industrial-grade data processing); inference (Paddle Serving, Paddle Mobile, PaddleSlim, security & encryption).
- Model zoo: PaddleRec, PaddleNLP, PaddleCV.
- Tool components: VisualDL (training visualization), PaddleHub (transfer learning), PARL (reinforcement learning), EDL (elastic deep learning), AutoDL Design (automated network architecture design).
- Service platforms: EasyDL (zero-code custom training and serving), AI Studio (one-stop development platform).
- Model zoo details (60+ officially supported mainstream models):
      PaddleCV: classification nets (VGG, ResNet, SE-ResNeXt, Inception v4, MobileNet); detection nets (Fast R-CNN, Faster R-CNN, Mask R-CNN, SSD, YOLO v3); segmentation nets (ICNet, DeepLab v3+); keypoint detection, face detection, character recognition (OCR), image generation, video classification.
      PaddleNLP: semantic matching nets (SimNet, DAM); dialogue generation, reading comprehension, lexical analysis, machine translation, sentiment analysis; semantic representation (ERNIE, BERT, ELMo); language models (LSTM, GRU).
      PaddleRec: large-scale CTR estimation; ranking nets (Deep Interest Network, DeepCTR, GRU4Rec, GNN); candidate recall and candidate tagging.
- Organized as a task layer over an algorithm layer: comprehensive, pluggable, industrial-grade results.
PaddleNLP: a Chinese NLP toolkit for industrial applications built on PaddlePaddle, "the one that understands Chinese best".
- Application task layer: Chinese lexical analysis, short-text semantic matching, text sentiment classification, dialogue emotion recognition, reading comprehension, machine translation, language models, knowledge-grounded dialogue, a dialogue-model toolbox.
- Base network layer: sequence labeling nets, semantic matching nets, language generation and complex-task nets, text classification nets, language model nets, and semantic representation with pre-trained models (BERT, ERNIE, ELMo).
- ERNIE (Enhanced Representation through kNowledge IntEgration): a Transformer-based model that masks knowledge units rather than single characters (example slide: learning that Harbin is the capital of Heilongjiang and an international ice-and-snow culture city).
- ERNIE leads Chinese NLP benchmarks across the board:
      Task                        Dataset       Metric      BERT            ERNIE
      Natural language inference  XNLI          accuracy    77.20%          78.4%
      Semantic matching           LCQMC         accuracy    87.00%          87.4%
      Named entity recognition    MSRA-NER      F1          92.60%          93.8%
      Sentiment analysis          ChnSentiCorp  accuracy    94.30%          94.3%
      Retrieval QA                NLPCC-DBQA    MRR / F1    94.60% / 80.80% 95.1% / 82.7%

PaddlePaddle video recognition toolkit: shared backbone code, a full range of video recognition models, one-command task launch; covers mainstream practical sequence-modeling algorithms and end-to-end video recognition models, with efficient configuration for training and evaluation.
- StNet: ActivityNet 2018 best single model, AAAI 2019; fuses local and global temporal modeling.
- Attention Cluster: CVPR 2018; per-modality attention aggregation that better captures combinations of features.
- Attention LSTM: ActivityNet 2017 best single model; a more stable temporal model.
- TSN: the classic architecture, first to bring sequence information into video classification and prove its value.
- Non-Local: Facebook's spatio-temporal non-local modeling with a self-attention-like mechanism; strong results, heavy compute.
- TSM: an improved TSN; simple and efficient, the current SOTA.
- NeXtVLAD: Youtube-8M 2018 best single model; de-emphasizes temporal order, well suited to short videos.
- Core business applications: fully automated video classification (no human review), video semantic vectors that measurably lift recommendation/search models, and a video tag set with 96% top-5 accuracy; deployed in Baidu Feed, Baidu Search and the Baidu Cloud VCA system.

(PaddlePaddle panorama transition slide.)

Large-scale distributed training and industrial-grade data processing: multi-node multi-GPU, a large-scale sparse parameter server, and K8S ecosystem support; industry-leading ultra-large-scale parallel deep learning built from industrial practice.
- Distributed training benchmark: ResNet50 throughput, baseline vs bandwidth-efficient communication, on 4 x 4 V100 under different bandwidths (1 Gb/s, 8 Gb/s, 1000 Gb/s, 1000 Gb/s with IB). [Plot residue removed.]
- ResNet50 on ImageNet with FP32 and FP16: scaling from 1x1 (FP32) through 1x8, 2x8 and 4x8 GPUs with FP16. [Plot residue removed.]
- Personalized CTR estimation: total sample throughput per second (y-axis) vs nodes x worker threads (x-axis: 1/25/50/100 nodes x 10 threads) for batch sizes 32/128/512, plus the corresponding speedup curves. [Plot residue removed.]
- Large-scale sparse parameter server: asynchronous high concurrency (thread-level async IO, async compute and async communication on workers; high-concurrency servers), the high-performance BRPC communication library, model-parameter sharding, and an eviction mechanism for stale sparse parameters. Validated on Baidu core businesses (Baidu Feed, Baidu commercial promotion system) against ultra-large data volumes, a massive self-growing feature space, and high-frequency model iteration.
- Industrial-grade data processing: distributed filesystem IO, distributed sample shuffle, a high-performance multi-producer/multi-consumer design, and pluggable IO components in multiple languages. IMDB dataset throughput scales from 1 to 10 threads. [Plot residue removed.]

(PaddlePaddle panorama transition slide.)
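The multi-producer/multi-consumer design highlighted above can be sketched with a bounded queue: producers read data shards onto the queue while consumers drain it concurrently. A minimal threaded illustration (names and the one-sentinel-per-consumer EOF convention are choices of this sketch, not the toolkit's):

```python
import queue
import threading

def run_pipeline(shards, num_consumers):
    """Multi-producer / multi-consumer sketch: one producer thread per data
    shard feeds a bounded queue; consumer threads drain it. A None sentinel
    per consumer signals end-of-data."""
    q = queue.Queue(maxsize=8)   # bounded: producers block when consumers lag
    results = []
    lock = threading.Lock()

    def producer(shard):
        for sample in shard:
            q.put(sample)

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            with lock:
                results.append(item)

    producers = [threading.Thread(target=producer, args=(s,)) for s in shards]
    consumers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        q.put(None)              # one EOF sentinel per consumer
    for t in consumers:
        t.join()
    return results

out = run_pipeline([range(0, 5), range(5, 10)], num_consumers=3)
```

The bounded queue is the important design choice: it provides backpressure, so fast producers cannot outrun memory while slow stages simply block.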
Hardware-software co-designed inference: a high-speed inference engine with benchmarks, PaddleSlim model compression, and Paddle Serving server deployment.
- End-to-end deployment stack: server side (Paddle Serving, C++) and mobile side (Paddle Mobile, ARM), over hardware acceleration libraries for the underlying hardware: NV/AMD GPUs, Intel/Cambricon CPUs, Huawei DSPs and other accelerators; mobile GPUs via Mali, Adreno and Metal. Multi-language support (Python, Javascript), deployment manuals, plus PaddleSlim and security & encryption as companion tools.
- Inference benchmark (PaddlePaddle v1.4.0; GPU P4 single card; CPU E5-2650 v4 with 8 threads; batch_size=1; 10 warmup runs, mean over 10 runs): PaddlePaddle vs a mainstream implementation on ResNet50, ResNet101, MobileNet v1/v2 and GoogleNet, latency in ms on both CPU and GPU. [Chart residue removed.]
- ARM benchmark (v1.4.0; cross-compiled with android ndk r16, gcc 4.9, NEON enabled; ABI armeabi-v7a with neon -mfloat-abi=softfp / armv8; 1 thread; 10 warmup runs, mean over 10 runs): MobileNet v1 FP32 and INT8 latency (ms) on Kirin 960, Snapdragon 835, RK3399, Snapdragon 653 and Snapdragon 625, vs mainstream implementations. [Chart residue removed.]
- Paddle Serving: complete online-serving capability (multiple models per service, A/B testing across model versions, hot model updates), built-in model services (image classification, text classification), extensible hardware (CPU, GPU); seamless hand-off from training to serving; validated on the Baidu commercial ads system and Baidu Feed. [Architecture figure: Client SDK (predictors parser, client configuration management, A/B testing, RPC proto parser) talking to a Server (DAG parser and executor, server configuration management, BRPC service, serving operator base with built-in and user-implemented ops, engine base with CPU/GPU inference engines).]
- PaddleSlim: a model compression toolkit that shrinks models with little accuracy loss; pruning, quantization and distillation behind two lines of Python, with centralized parameter management. Techniques: sensitive filter pruning, several modes of int8 quantization-aware training, and arbitrary combinations of distillation losses (FSP, L2, soft-label). Applied to liveness detection, face detection, face attribute, face alignment and face recognition models.
- MobileNet v1 on ImageNet example: distillation + int8 quantized training: accuracy +1.10%, model size -71.76%; pruning + int8 quantized training: accuracy -1.71%, model size -86.47%.
- [Architecture figure: a Compressor core over Graph/Program wrappers, C++ and Python passes (FCFusePass, ConvBNFusePass, PruneParamPass, FSPDistillationPass) and strategies (SensitivePruner, Quantization, Distillation, UniformPrune); user API over a base framework.]

Edge deployment: hundred-yuan-class hardware and a full toolchain to bring deep learning to the field.
- Flow: base model + business data -> custom model training and download -> model compilation and adaptation -> edge-device deployment and acceleration, on partner boards and chips.
- AI pest-detection case ("AI insect recognition"): a YOLO v3 model trained on beetle images, deployed on the Paddle Pi board (210 chip): runs YOLO v3 locally at 30 FPS; hundred-yuan price; 38 x 38 mm; 1.2 W working power; built-in power supply good for a year in the field; industrial IP65 dust- and water-proof packaging; fully local inference, no network needed, fast and secure.
- Baidu AI Studio provides the complete train/download/compile workflow; the co-developed board and device go on sale in the Baidu AI Market in May.

(PaddlePaddle panorama transition slide.)

AutoDL Design: officially open-sourced, with several automatically designed high-quality models that beat expert-designed networks in specific scenarios. [Figure: searched normal and reduction cells built from skip_connect, sep_conv_3x3, dil_conv_3x3 and max_pool_3x3.]
- Open source: AutoDL Design implemented on PaddlePaddle.
- Results: 98.01% accuracy on CIFAR-10; includes open-sourced models based on Local Rademacher Complexity Regularization.
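PaddleSlim's int8 quantization above maps float weights to 8-bit integers plus a scale. A minimal symmetric uniform quantizer illustrating the size/precision trade-off; this is a generic sketch, not PaddleSlim's actual scheme:

```python
def quantize_int8(weights):
    """Symmetric uniform quantization: map floats in [-max|w|, max|w|]
    to integers in [-127, 127]. Returns the int values and the scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        return [0] * len(weights), 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats: each int costs 1 byte instead of 4."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The reconstruction error per weight is at most half a quantization step (scale/2), which is why int8 storage cuts model size roughly 4x with small accuracy impact; quantization-aware training, as in the slide's numbers, recovers most of the remaining loss.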
PARL: a deep reinforcement learning framework with high flexibility and extensibility, supporting customizable parallel scaling; fuller algorithm coverage, a high-performance communication protocol, and easy-to-customize parallel APIs.
- Won the NeurIPS 2018 AI prosthetics challenge (Target Driven DDPG + Bootstrapping, a thousand CPUs plus a single GPU).
- Trains an Atari agent in under 10 minutes; a Pong agent in 7 minutes on a 32-CPU cluster; one parallel-communication decorator; IMPALA/A2C/GA3C parallel algorithms; 80% higher sample-collection efficiency.

PaddleHub: a simple, easy-to-use pre-trained model manager; transfer learning in 10 lines of code, ready-to-use models, and a command-line tool.
- Ready-made models: image classification, object detection, video classification, lexical analysis, language models, sentiment analysis, Transformer, image generation.
- Components: datasets (NLP/CV DataSet), data processing (NLPReader, CVReader), transfer tasks (text classification, sequence labeling, image classification), optimization strategies (AdamWeightDecayStrategy, L2SPStrategy), the Finetune API (hub.finetune_and_eval), and CLI commands (install, uninstall, show, download, search, list, run, help, version).

(PaddlePaddle panorama transition slide.)

Serving developers, advancing deep learning in China.
- AI Studio "100 million yuan of free compute": free industrial flagship GPUs with a no-install integrated environment. One-person-one-card mode: a V100 training card per person, 16 GB VRAM, up to 2 TB storage; invitation codes grant compute hours, and inviting friends grants more. Remote cluster mode: free high-performance clusters, multi-GPU parallel training, unlimited free use, 12 GB VRAM per card. Log in to AI Studio to start.
- Ecosystem support: for university developers, deep learning teacher-training classes, a collaborative education fund, and AI Studio education edition; for enterprises, Huangpu Academy and the "AI Express Lane" hands-on camps (live dissection of deep learning cases, ready-to-use Code Live sessions, shared Huangpu Academy courses) with a support program for deep learning adoption in 1,000 companies; free online courses, free compute, and continuous competitions.
- PaddlePaddle Roadmap: 2016: PaddlePaddle open-sourced. 2017: the new-generation framework Paddle Fluid released. 2018: upgraded to an end-to-end deep learning platform. 2019.04: PaddleNLP released with the industry's first video recognition toolkit; distributed-training benchmark and large-scale sparse parameter server; Paddle Serving and PaddleSlim for one-stop deployment; AutoDL Design, PARL and PaddleHub released.
- Roadmap, continued. 2019.07: dynamic graph feature-complete with new pipeline parallelism; vision detection and generation toolkits; fully revised documentation; memory-usage optimization and across-the-board static-graph training speedups; a faster inference engine with quick extension to more hardware and fuller FP16 support. 2019.11: flexible conversion between dynamic and static graphs with high-level APIs; across-the-board dynamic-graph training speedups; PaddleHub 2.0 with transfer learning over the most complete pre-trained model library; multiple industry solutions released.

THANKS

Talk: High-Speed Inference Engine and APIs

- (Benchmark slides repeat the inference numbers shown earlier: PaddlePaddle v1.4.0; GPU P4 single card, CPU E5-2650 v4 with 8 threads, batch_size=1, 10 warmup runs, mean over 10 runs; ARM latency for MobileNet v1 FP32 and INT8 on Kirin 960, Snapdragon 835/653/625 and RK3399. Chart residue removed.)
- Engine architecture: a frontend abstract graph is lowered through an IR by an optimizing interpreter, then scheduled onto backend hardware functors, executing on GPU/X86/ARM with op dispatch. Graph optimization fuses chains such as Convolution + ReLU (+ Pooling). Ops carry typed attributes, e.g. name: conv_3, group(int): 1, axis(int): 1, bias_term(bool): false, strides(list): [2, 2], filter_num(int): 32, with parameter tensors such as weight_1 of shape [32, 3, 3, 3].
- Native inference API (C++):
      // Create a config and load the model
      AnalysisConfig config;
      config.SetModel(model_path, params_path);
      // Create the predictor
      auto* predictor = CreatePaddlePredictor(config);
      // Fill inputs through zero-copy tensors
      for (auto& name : predictor->GetInputNames()) {
        auto* input_t = predictor->GetInputTensor(name);
        input_t->Reshape({N, C, H, W});
        input_t->copy_from_cpu(/* CPU data */);
      }
      // Run inference
      predictor->ZeroCopyRun();
      // Fetch outputs
      for (auto& name : predictor->GetOutputNames()) {
        auto out_t = predictor->GetOutputTensor(name);
      }
- Subgraph engines: config.EnableTensorRtEngine(120, batch) offloads supported subgraphs to TensorRT; config.EnableAnakinEngine() offloads to Anakin.
- Multi-threaded serving: clone one predictor per thread instead of sharing one.
      std::vector<std::thread> threads;
      std::vector<std::unique_ptr<PaddlePredictor>> predictors;
      for (int i = 0; i < thread_num; ++i)
        predictors.emplace_back(predictor->Clone());
      // ... each thread runs its own predictor ...
      for (int i = 0; i < thread_num; ++i)
        if (threads[i].joinable()) threads[i].join();
- ARM deployment via the Anakin config:
      contrib::AnakinConfig config;
      config.model_file = anakin_model_path;
      config.target_type = contrib::AnakinConfig::ARM;
      // Create a paddle predictor backed by Anakin
      auto predictor = CreatePaddlePredictor<contrib::AnakinConfig>(config);
      auto& in_names = predictor->GetInputNames();
      std::vector<PaddleTensor> inputs, outputs;
      // set input shapes and names, resize outputs by name, then:
      predictor->Run(inputs, &outputs);

THANKS