
The Exploration of Complex-Scenario Automatic Speech Recognition Based on Large-Scale Data
Complex-scenario ASR in ZOOM
Haoyu (Charlie) Tang
April 24, 2022
Zoom AI/ML Engineering

Content
1. Introduction to automatic speech recognition
2. End-to-end automatic speech recognition
3. Model innovation
4. Training pipeline innovation
5. Large-scale data model training acceleration in ZOOM
6. Summary and next steps

Introduction to automatic speech recognition

What is automatic speech recognition
Automatic Speech Recognition (ASR): generate text from a given audio waveform, \hat{Y} = \arg\max_{Y} P(Y \mid X).
Figure 1: ASR
Conventional method: acoustic model, language model, and pronunciation dictionary/model.
End-to-end method: main model, plus an optional language model.

History of ASR
Figure 2: ASR history [2]
[2] https://sonix.ai/history-of-speech-recognition

Brief history of ASR
Figure 3: ASR history of the recent decade [3]
[3] https://sonix.ai/history-of-speech-recognition

ASR: current problems
Current problems in ASR for the live, meeting, and online-chat scenarios:
1. Spontaneous speech, while most open ASR data is read speech
2. Open-set recognition plus a large vocabulary
3. Noise, especially background music
4. Accent independence
5. Code-switching
6. Free switching between far field and near field as the speaker moves

End-to-end automatic speech recognition

A standard end-to-end (E2E) ASR architecture
Figure 4: A standard end-to-end (E2E) ASR architecture [4]
The figure shows the two standard E2E modeling methods, CTC and encoder-attention-decoder; the two can also be combined into a CTC-ATT architecture.
[4] Watanabe et al., "Hybrid CTC/attention architecture for end-to-end speech recognition".

ATT-CTC training and inference [5]
Loss combination:
\mathcal{L}_{MTL} = \lambda \mathcal{L}_{CTC} + (1 - \lambda)\,\mathcal{L}_{Attention}    (1)
Figure 5: Loss combination
Joint decoding/rescoring:
\hat{C} = \arg\max_{h} \{\, \lambda \log p_{ctc}(h \mid X) + (1 - \lambda) \log p_{att}(h \mid X) \,\}    (2)
[5] Watanabe et al., "Hybrid CTC/attention architecture for end-to-end speech recognition".
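A minimal sketch of the multi-task loss in equation (1), assuming a PyTorch model whose encoder output feeds both a CTC projection and an attention decoder; the tensor shapes, the shared padding/blank index 0, and the weight value are illustrative assumptions rather than the exact training code.

import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, feat_lengths, dec_logits, targets, target_lengths, lam=0.3):
    # Assumed shapes: ctc_log_probs (T, B, V) log-softmax output of the CTC head,
    # dec_logits (B, U, V) attention-decoder output, targets (B, U) padded with 0,
    # where index 0 is the CTC blank and never appears as a real label.
    l_ctc = F.ctc_loss(ctc_log_probs, targets, feat_lengths, target_lengths,
                       blank=0, zero_infinity=True)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), targets, ignore_index=0)
    # Equation (1): L_MTL = lam * L_CTC + (1 - lam) * L_Attention
    return lam * l_ctc + (1.0 - lam) * l_att

At decoding time the same weight reappears through equation (2): each beam hypothesis is rescored with lam * log p_ctc plus (1 - lam) * log p_att.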

Model innovation

Model improvement: from ATT to Transformer
Replace the (B)LSTM(P) [6] throughout the encoder and decoder with a Transformer [7].
Figure 6: Transformer (Xfmr)
[6] Chan et al., "Listen, attend and spell".
[7] Dong, Xu, and Xu, "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition".

Reorder in ASR and neural machine translation
Figure 7: Reordering in NMT [8] and ASR. (a) A reordering example in NMT. (b) A heat map of source-target attention in the S-Xfmr decoder.
[8] Wu, Carpuat, and Shen, "Inversion transduction grammar coverage of Arabic-English word alignment for tree-structured statistical machine translation".

Model improvement: chunk
Since there is no reordering in ASR, unlike in neural machine translation (NMT), a chunked multi-head attention (C-MHA) can be used in the Speech-Transformer (S-Xfmr) in place of full MHA, the core innovation of the original Transformer. This C-MHA can also be viewed as a constraint: each element of a layer is only affected by neighbouring elements of the previous layer.
Figure 8: (a) Chunk-flow attention (CL-MHA) [9]; (b) chunk attention (C-MHA) [10].
[9] Tian et al., "Self-attention transducers for end-to-end speech recognition".
[10] Zhang et al., "Unified streaming and non-streaming two-pass end-to-end model for speech recognition".
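A minimal sketch of the chunk constraint described above, implemented as a boolean attention mask: frames are grouped into fixed-size chunks, and a query frame may only attend within its own chunk (C-MHA) or, in the chunk-flow style, to its own and all earlier chunks. The chunk size and the mask-based formulation are illustrative assumptions, not the exact S-Xfmr implementation.

import torch

def chunk_attention_mask(num_frames, chunk_size, allow_history=False):
    # True marks key positions that a query frame is allowed to attend to.
    chunk_id = torch.arange(num_frames) // chunk_size              # chunk index of each frame
    same_chunk = chunk_id.unsqueeze(1) == chunk_id.unsqueeze(0)    # query and key in the same chunk
    if allow_history:
        past = chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)      # key chunk not later than query chunk
        return same_chunk | past
    return same_chunk

# Usage: disallowed positions get -inf before the softmax in scaled dot-product attention.
mask = chunk_attention_mask(num_frames=8, chunk_size=4)
scores = torch.randn(8, 8)
attn = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)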

Training pipeline innovation

Weak distillation: from conventional model to E2E
As mentioned before, one problem in open-set ASR such as meeting transcription is that you never know which topics will come up in the meeting or which words will be spoken. As shown in Fig. 1, the conventional method contains a pronunciation dictionary; it is hand-crafted work, but it carries extra prior information and does help the conventional method on named entities. Previous work has also shown that the E2E method degrades on named-entity recognition (NER). The best remedy is to add labelled named-entity data, but labelled data is expensive. Another option is therefore to distill the extra information from the conventional method and from unlabelled data into the E2E ASR model.

Weak distillation: seq2seq distillation
Figure 9: seq2seq distillation [13]
[13] Kim and Rush, "Sequence-level knowledge distillation".
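A minimal sketch of sequence-level distillation in the spirit of Kim and Rush [13]: the conventional (teacher) system decodes unlabelled audio, and the E2E student is trained on the resulting pseudo-transcripts as if they were ground truth. The teacher.transcribe and student.training_loss calls are assumed interfaces for this sketch, not a specific library or Zoom API.

def distill_sequence_level(teacher, student, unlabeled_batches, optimizer):
    for wavs in unlabeled_batches:
        # 1) Teacher produces the pseudo-label: its best beam-searched transcript.
        pseudo_text = [teacher.transcribe(wav) for wav in wavs]
        # 2) Student is trained on (audio, pseudo-text) exactly like labelled data,
        #    e.g. with the joint CTC/attention loss shown earlier.
        loss = student.training_loss(wavs, pseudo_text)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()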

Software architecture for iterative pseudo-labeling (IPL)
Figure 10: A common software architecture for IPL

Architecture:

selector = init_selector()
builder = init_builder()
predictor = init_predictor()
filter = init_filter()
initializer = init_initializer()
for cycle in range(cycles):
    # pick a slice of the unlabelled pool for this cycle
    selected_data = selector(cycle, unlabel_data)
    # build or refresh the teacher from the model pool
    teacher_model = builder(model_pool)
    # the teacher transcribes the selected unlabelled audio
    predicted_data = predictor(selected_data, teacher_model)
    # drop low-confidence pseudo-labels
    filtered_data = filter(predicted_data)
    # mix pseudo-labelled and labelled data into training batches
    batcher = init_batcher(filtered_data, label_data)
    # re-initialize the student and train it for this cycle
    student_model = initializer(model_pool)
    for epoch in range(epochs):
        trainer(student_model, batcher)

Plug-in description (Table 1):
Plug-in       Description
Selector      Selects data from the unlabelled dataset according to an index
Predictor     Recognizes the selected data with the teacher model
Filter        Filters the selected data according to score and prediction
Batcher       Organizes unlabelled and labelled data for training
Trainer       Trains the student model with the batcher
Builder       Builds a teacher model
Initializer   Initializes the student model state dict

Architecture rule: low coupling, high cohesion, and simple rules for reuse.
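As a concrete illustration of the plug-in rule above, here is a hypothetical Filter plug-in that keeps only pseudo-labelled utterances whose teacher confidence clears a threshold; the record layout (a dict with a "score" field) and the 0.9 threshold are assumptions for this sketch, not the actual interface.

def init_filter(threshold=0.9):
    # Returns a filter callable matching the loop above: it drops utterances whose
    # teacher confidence score falls below the threshold.
    def filter_fn(predicted_data):
        return [record for record in predicted_data if record["score"] >= threshold]
    return filter_fn

# Usage inside the IPL loop:
#   filter = init_filter()
#   filtered_data = filter(predicted_data)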

Large-scale data model training acceleration in ZOOM

GTC22 Zoom model training [14]
Figure 11: Scale and Accelerate the Distributed Model Training in Kubernetes Cluster
Mainly from our talk at GTC22; the focus today is DDP communication acceleration.
[14] Scale and Accelerate the Distributed Model Training in Kubernetes Cluster. https:/

Components [15]
- DL training: Filter, gradient compression, PowerSGD, PyTorch Lightning, Determined-AI, PyTorch DDP
- Machine learning framework: PyTorch, TensorFlow, Horovod, MPI, NCCL, APEX, CUDA, Tensor Cores
- Kubernetes: SR-IOV, virtual functions, peer memory, operators, PyTorchJob, TFJob, MPIJob, Calico
- OS: RDMA, RoCE, GPUDirect, OFED, firmware, kernel modules
- Server: GPU, smart NIC (Mellanox, EFA), PCIe, NVLink, NVSwitch, NVMe over Fabrics
- Networking: PFC, ETS, LLDP, DCBX
[15] Scale and Accelerate the Distributed Model Training in Kubernetes Cluster. https:/

Figure 12: Comparison of model and data parallelism
- Model parallelism: all workers train on the same data; parts of the model are distributed across GPUs.
- Data parallelism: all workers train on different data; every worker holds the same copy of the model.

ASR model training situation analysis
- Models are usually LSTM- or Transformer-based, with roughly 20M to 200M parameters.
- Data ranges from 10 hours of labelled audio to 100,000 hours.
- Choice between model and data parallelism: data parallelism. For large-scale data training: Distributed Data Parallel (DDP), run on multiple servers.

DDP
Figure 13: DDP
GPU communication:
- Within a server: among the GPUs on the same machine.
- Between servers: over the network.
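A minimal sketch of the multi-server DDP setup described above, assuming the script is launched with torchrun on each node so that rank and world-size environment variables are already set; the single linear layer stands in for the real ASR model.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(80, 512).cuda(local_rank)  # placeholder for the ASR model
    model = DDP(model, device_ids=[local_rank])        # one replica per GPU, gradients synced

    x = torch.randn(16, 80, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()        # gradients are all-reduced across all workers here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Example launch across two servers with 8 GPUs each (addresses are illustrative):
#   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=host0:29500 train_ddp.py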

DDP acceleration
Figure 14: Mixed-precision example
- Consumes less memory: allows larger networks or minibatches; halves the size of activation and gradient tensors.
- Consumes less memory bandwidth: speeds up data-transfer operations; halves memory traffic compared with FP32.
- Accelerates math-bound operations: Tensor Cores are 8x faster than FP32.
- Automatic: identifies the steps that require full precision and uses 32-bit floating point only for those steps; black/white-list control.
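A minimal sketch of the automatic mixed-precision recipe described above, using PyTorch's torch.cuda.amp; the model, data, and optimizer are placeholders rather than the actual training code.

import torch

model = torch.nn.Linear(80, 512).cuda()                 # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                     # rescales the loss so FP16 gradients do not underflow

for step in range(100):
    x = torch.randn(16, 80, device="cuda")
    optimizer.zero_grad()
    # autocast runs Tensor-Core-friendly ops in FP16 and keeps precision-sensitive
    # ops in FP32, which is the black/white-list behaviour mentioned above.
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()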

Smart NIC + RDMA + RoCE
- Smart NIC: a SmartNIC, or smart network interface card (NIC), is a programmable accelerator that makes data-center networking, security, and storage efficient and flexible. (Hardware level)
- RDMA: remote direct memory access from the memory of one server into that of another without involving either operating system; high-throughput, low-latency networking. (OS level)
- RoCE: RDMA over Converged Ethernet. (OS level)
- OFED: the OpenFabrics Enterprise Distribution, open-source software for RDMA providing NIC firmware, drivers, kernel modules, and libraries. (OS level)

RDMA/RoCE
Figure 15: Advantages of RDMA/RoCE
- Zero-copy: no need to copy user application buffers into dedicated NIC buffers.
- Direct user-space hardware interface that bypasses the kernel and TCP/IP on the data path.
- The InfiniBand transport over UDP encapsulation; available for all Ethernet speeds, 10 to 100G.

GPUDirect RDMA
- CPU utilization: the CPU is not involved in the DMA operations.
Figure 16: GPUDirect RDMA

Implementation: NVIDIA Network Operator
Figure 17: NVIDIA Network Operator

RDMA/RoCE test result
Figure 18: RDMA/RoCE test result
This hardware update enables faster multi-node communication: with RDMA/RoCE enabled, multi-node 2-GPU training is around 5x faster.

Gradient compression
- DDP grad_fp16: send gradients in 16 bits instead of 32, giving 2x compression.
- grad_powersgd: use smaller matrices to approximate the original gradient; this can achieve up to 1000x compression, but at some cost in accuracy and additional computation time.
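A minimal sketch of how the two options above map onto PyTorch's built-in DDP gradient-compression hooks; ddp_model is assumed to be the DistributedDataParallel model from the earlier sketch, and only one hook may be registered per model.

from torch.distributed.algorithms.ddp_comm_hooks import default_hooks, powerSGD_hook

# Option 1, grad_fp16: all-reduce gradients in FP16, roughly halving gradient traffic.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

# Option 2, grad_powersgd: low-rank PowerSGD compression, much higher ratios but with
# some accuracy cost and extra computation (register this instead of option 1).
#   state = powerSGD_hook.PowerSGDState(process_group=None,
#                                       matrix_approximation_rank=1,
#                                       start_powerSGD_iter=1000)
#   ddp_model.register_comm_hook(state, powerSGD_hook.powerSGD_hook)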

Gradient compression test result
Figure 19: Gradient compression accuracy comparison in a 12-GPU setup
From the chart above, grad_fp16 leads to almost no accuracy loss, while grad_powersgd can still cause some accuracy loss. We also noticed earlier model convergence, at epoch 13, for grad_powersgd.

Summary and next steps

Summary:
- Model: an early version of the semi-supervised ASR software architecture.
- Training: DDP acceleration with mixed precision and GPUDirect RDMA; gradient compression with grad_fp16 and grad_powersgd.
Next steps:
- Model: tuning and upgrading semi-supervised ASR; beam-search acceleration.
- Training: communication-efficiency improvement (1-bit Adam optimizer), model-distributed training (ZeRO optimizer), data-I/O improvement (on-the-fly caching, loading to memory).

Presenter
Figure 20: Presenter, Haoyu (Charlie) Tang
Haoyu (Charlie) Tang, haoyu.tang@zoom.us
Working in the Zoom AI/ML Engineering department.
Graduated from the Norwegian University of Science and Technology (NTNU).
Responsible for ASR training and deployment.
GitHub: qmpzzpm; blog: 戚少商

Questions?
Also, we are hiring: Video, Audio, Speech, NLP, Frontend, Service/DevOps, QA/Testing, Java/C++; full-time and interns.
Base: Hangzhou, Suzhou, Hefei. Contact: haoyu.tang@zoom.us

References
Chan et al. "Listen, attend and spell". In: arXiv preprint arXiv:1508.01211 (2015).
Dong, Linhao, Shuang Xu, and Bo Xu. "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884-5888.
Kim, Yoon and Alexander M. Rush. "Sequence-level knowledge distillation". In: arXiv preprint arXiv:1606.07947 (2016).
Scale and Accelerate the Distributed Model Training in Kubernetes Cluster. https:/
Tian et al. "Self-attention transducers for end-to-end speech recognition". In: arXiv preprint arXiv:1909.13037 (2019).
Watanabe, Shinji et al. "Hybrid CTC/attention architecture for end-to-end speech recognition". In: IEEE Journal of Selected Topics in Signal Processing 11.8 (2017), pp. 1240-1253.
Wu, Dekai, Marine Carpuat, and Yihai Shen. "Inversion transduction grammar coverage of Arabic-English word alignment for tree-structured statistical machine translation". In: 2006 IEEE Spoken Language Technology Workshop. IEEE, 2006, pp. 234-237.
Zhang, Binbin et al. "Unified streaming and non-streaming two-pass end-to-end model for speech recognition". In: arXiv preprint arXiv:2012.05481 (2020).
