STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training
Wang Wei (王玮), NLP Algorithm Expert, Alibaba DAMO Academy. 2022/07/30

Foundation Models
ML homogenizes learning algorithms (e.g., logistic regression); DL homogenizes model architectures (e.g., CNN); foundation models homogenize the model itself (e.g., BERT, GPT-3). Figure from "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258.

Foundation Models: Training + Adaptation
Pretrained on broad unannotated (multimodal) data at scale in a self-supervised way; adapted to a wide range of downstream tasks via fine-tuning. One is All. Figure from "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258.
Model Size vs. HW Capacity
Transformer model size: ~2*10^4x growth per 5 years. GPU memory: only ~6x growth per 5 years. We need more GPUs!
[Figure: model size (parameters, B) over time for BERT-base/large, GPT, GPT-2, GPT-3, Megatron-LM, T-NLG, Megatron-Turing-NLG, T5 (base/large/3B/11B/XXL), ALBERT, RoBERTa-large, Zhiyuan-WuDao 2.0, Ali-M6, KUAIMODEL, and sparse models (GShard, Switch-base/large/XXL/C), with a dense vs. sparse legend, plotted against accelerator memory: P100 (12GB), TPU v2 (16GB), V100 (32GB), TPU v3 (32GB), A100 (40GB), A100 (80GB).]
Data parallelism
Distribute data across processors; each shard is processed in parallel, and parameters are updated synchronously. Communication happens at the all-reduce operations that sum the gradients from all processors (sketched below).
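A minimal PyTorch sketch of this synchronous update, assuming a torch.distributed process group has already been initialized (e.g., via torchrun); model, batch, loss_fn, and optimizer are hypothetical placeholders:

```python
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    """One synchronous data-parallel step; each rank sees its own data shard."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    # The all-reduce from the slide: sum gradients from all processors,
    # then average so every replica applies the identical update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
```

In practice torch.nn.parallel.DistributedDataParallel performs this same gradient all-reduce automatically, bucketing gradients and overlapping communication with the backward pass.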
Model parallelism
Pipeline (inter-layer) model parallelism: split sets of layers across multiple devices, e.g., layers 0, 1, 2 and layers 3, 4, 5 on different devices (see the sketch after this list). Tensor (intra-layer) model parallelism: split individual layers across multiple devices, so both devices compute different parts of layers 0-5. These two approaches are complementary.
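A minimal sketch of the pipeline split just described, assuming two GPUs and six illustrative nn.Linear layers: layers 0-2 live on one device, layers 3-5 on the other, and only the activations cross the device boundary.

```python
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Pipeline (inter-layer) parallelism: two stages on two GPUs."""
    def __init__(self, dim=1024):
        super().__init__()
        self.stage0 = nn.Sequential(  # layers 0, 1, 2
            nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        ).to("cuda:0")
        self.stage1 = nn.Sequential(  # layers 3, 4, 5
            nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        ).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))  # activation hop between stages
```

As written, one stage idles while the other computes; real pipelines (GPipe, PipeDream) split each batch into micro-batches to keep both stages busy, which is why the next slide notes that pipeline parallelism can require a large batch size for high throughput.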
Model parallelism: trade-offs
Pipeline (inter-layer) model parallelism: less communication-intensive and generalizable to almost all DNNs, but it can require a large batch size for high throughput (to fill the pipeline between syncs) and can be hard to load-balance across workers. Tensor (intra-layer) model parallelism: works great for large matrices, is simple to implement, and places no restriction on batch size, but needs more communication.
Examples: Megatron (NVIDIA)
For Y = GeLU(XA), split A column-wise; for Z = Dropout(YB), split B row-wise. f is the identity operator in the forward pass and an all-reduce in the backward pass; g is an all-reduce in the forward pass and the identity operator in the backward pass (sketched below).
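A sketch of this MLP partitioning in PyTorch, with illustrative dimensions and dropout rate: each rank holds one column shard of A and one row shard of B, so Y_i = GeLU(X A_i) needs no communication and Z = Dropout(sum_i Y_i B_i) is completed by g's forward all-reduce. Only the forward data flow is shown; a real implementation such as Megatron-LM wraps f and g as autograd functions so the backward-pass all-reduce happens automatically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    """Megatron-style tensor-parallel MLP: the shard held by one rank."""
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        shard = d_ff // dist.get_world_size()
        # This rank's column shard of A and row shard of B.
        self.A_shard = nn.Parameter(torch.randn(d_model, shard) * 0.02)
        self.B_shard = nn.Parameter(torch.randn(shard, d_model) * 0.02)

    def forward(self, x):
        # f is the identity here in the forward pass: every rank already
        # holds the full input X (f's all-reduce happens in backward).
        y_local = F.gelu(x @ self.A_shard)       # Y_i = GeLU(X A_i)
        z_partial = y_local @ self.B_shard       # Y_i B_i
        # g: all-reduce in the forward pass sums the partial products.
        dist.all_reduce(z_partial, op=dist.ReduceOp.SUM)
        # Real implementations synchronize the dropout RNG across ranks.
        return F.dropout(z_partial, p=0.1, training=self.training)
```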
Examples: DeepSpeed (Microsoft)
Compatible with Megatron; supports the Zero Redundancy Optimizer (ZeRO); goes beyond the GPU by using heterogeneous resources.

Beyond the GPU: heterogeneous resources
CPU side: optimized for serial operations (higher clock speeds, fewer execution units); latency-oriented; task parallelism. RAM: 200GB, 1TB, 2TB.
GPU side: built for parallel operations (lightweight threads, many execution units); throughput-oriented; data parallelism. Memory (A100): 40GB/80GB.
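A minimal sketch of the heterogeneous idea, assuming the layers fit in CPU RAM one GPU-resident layer at a time: park the model in large, cheap host memory and stage only the working layer through scarce GPU memory. This is an illustrative forward-pass loop in the spirit of ZeRO-Offload and STRONGHOLD, not either system's actual implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def offloaded_forward(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    """Run a deep stack whose weights live in CPU RAM, one layer on GPU at a time."""
    x = x.to("cuda")
    for layer in layers:      # weights resident in CPU RAM (200GB-2TB)
        layer.to("cuda")      # stage the working layer into GPU memory
        x = layer(x)
        layer.to("cpu")       # evict it to make room for the next layer
    return x
```

Done serially like this, the PCIe transfers would dominate; real systems prefetch the next layer on a separate CUDA stream to overlap copies with compute, and for training they additionally offload gradients and optimizer state to the CPU side.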
DAMO Academy pretrained model family
- Form and document-image pretrained model (ACL 2021): SOTA on the FUNSD, RVL-CDIP, and DocVQA structured-understanding datasets (2021.11); ranked first on the leaderboard (2020.8).
- Super-large Chinese pretrained model (PLUG): the first super-large Chinese text pretrained model (27B parameters) to unify natural language understanding and generation; an initial end-to-end PLUG serving pipeline is in place, with a leading 10X speedup for large-model serving.
- Table pretrained model (STAR) 1.0/2.0 (AAAI 2021): first place on four international table-QA leaderboards; open-sourced the first Chinese pretrained table model; 390 million Chinese tables accumulated across 18 Alibaba Cloud industries; deployed with 10+ intelligent customer-service customers.
- Multilingual pretrained model (VECO); generative pretrained model (PALM); general-purpose pretrained model (StructBERT); multimodal pretrained model (StructVLM); structured pretrained model (StructLM/Bi-VLDoc); dialogue pretrained model (SPACE 1.0/2.0/3.0); platform-level open source, ecosystem, and business applications.
- Knowledge-fused pretrained model (LatticeBERT, ACL 2021): Chinese LatticeBERT ranked first among CLUE base models (2020.9); KGBert ranked first on FewCLUE (2021.06); with 20-shot and KB support, improves KBQA by 3.2%-3.8%.
- Impact: first place on 30+ international leaderboards (understanding, generation, translation, QA).