STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training
Wang Wei (王玮), NLP Algorithm Expert, Alibaba DAMO Academy. 2022/07/30

Foundation Models
ML homogenizes learning algorithms (e.g., logistic regression); DL homogenizes model architectures (e.g., CNN); foundation models homogenize the model itself (e.g., BERT, GPT-3). Figure from "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258.

Foundation Models: Training + Adaptation
Pretrained on broad unannotated (multimodal) data at scale in a self-supervised way; adapted to a wide range of downstream tasks via fine-tuning. One is All. Figure from "On the Opportunities and Risks of Foundation Models", https://arxiv.org/abs/2108.07258.
Model Size vs. HW Capacity
Transformer model size: ~2*10^4x growth per 5 years. GPU memory: only ~6x growth per 5 years. We need more GPUs!
[Figure: model size (parameters, B) over time for BERT-base/large, GPT, GPT-2, GPT-3, Megatron-LM, T-NLG, Megatron-Turing-NLG, T5 (base/large/3B/11B/XXL), ALBERT, RoBERTa-large, Zhiyuan-WuDao 2.0, Ali-M6, KUAIMODEL, and sparse models (GShard, Switch-base/large/XXL/C), with a dense vs. sparse legend, plotted against accelerator memory: P100 (12GB), TPU v2 (16GB), V100 (32GB), TPU v3 (32GB), A100 (40GB), A100 (80GB).]
Data parallelism
Distribute data across processors; each shard is processed in parallel, and parameters are updated synchronously. Communication happens at the all-reduce operations that sum the gradients from all processors (sketched below).
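A minimal PyTorch sketch of this synchronous update, assuming a torch.distributed process group has already been initialized (e.g., via torchrun); model, batch, loss_fn, and optimizer are hypothetical placeholders:

```python
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    """One synchronous data-parallel step; each rank sees its own data shard."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    # The all-reduce from the slide: sum gradients from all processors,
    # then average so every replica applies the identical update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
```

In practice torch.nn.parallel.DistributedDataParallel performs this same gradient all-reduce automatically, bucketing gradients and overlapping communication with the backward pass.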
Model parallelism
Pipeline (inter-layer) model parallelism: split sets of layers across multiple devices, e.g., layers 0, 1, 2 and layers 3, 4, 5 on different devices (see the sketch after this list). Tensor (intra-layer) model parallelism: split individual layers across multiple devices, so both devices compute different parts of layers 0-5. These two approaches are complementary.
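A minimal sketch of the pipeline split just described, assuming two GPUs and six illustrative nn.Linear layers: layers 0-2 live on one device, layers 3-5 on the other, and only the activations cross the device boundary.

```python
import torch
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Pipeline (inter-layer) parallelism: two stages on two GPUs."""
    def __init__(self, dim=1024):
        super().__init__()
        self.stage0 = nn.Sequential(  # layers 0, 1, 2
            nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        ).to("cuda:0")
        self.stage1 = nn.Sequential(  # layers 3, 4, 5
            nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        ).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))  # activation hop between stages
```

As written, one stage idles while the other computes; real pipelines (GPipe, PipeDream) split each batch into micro-batches to keep both stages busy, which is why the next slide notes that pipeline parallelism can require a large batch size for high throughput.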
Model parallelism: trade-offs
Pipeline (inter-layer) model parallelism: less communication-intensive and generalizable to almost all DNNs, but it can require a large batch size for high throughput (to fill the pipeline between syncs) and can be hard to load-balance across workers. Tensor (intra-layer) model parallelism: works great for large matrices, is simple to implement, and places no restriction on batch size, but needs more communication.
Examples: Megatron (NVIDIA)
For Y = GeLU(XA), split A column-wise; for Z = Dropout(YB), split B row-wise. f is the identity operator in the forward pass and an all-reduce in the backward pass; g is an all-reduce in the forward pass and the identity operator in the backward pass (sketched below).
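A sketch of this MLP partitioning in PyTorch, with illustrative dimensions and dropout rate: each rank holds one column shard of A and one row shard of B, so Y_i = GeLU(X A_i) needs no communication and Z = Dropout(sum_i Y_i B_i) is completed by g's forward all-reduce. Only the forward data flow is shown; a real implementation such as Megatron-LM wraps f and g as autograd functions so the backward-pass all-reduce happens automatically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    """Megatron-style tensor-parallel MLP: the shard held by one rank."""
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        shard = d_ff // dist.get_world_size()
        # This rank's column shard of A and row shard of B.
        self.A_shard = nn.Parameter(torch.randn(d_model, shard) * 0.02)
        self.B_shard = nn.Parameter(torch.randn(shard, d_model) * 0.02)

    def forward(self, x):
        # f is the identity here in the forward pass: every rank already
        # holds the full input X (f's all-reduce happens in backward).
        y_local = F.gelu(x @ self.A_shard)       # Y_i = GeLU(X A_i)
        z_partial = y_local @ self.B_shard       # Y_i B_i
        # g: all-reduce in the forward pass sums the partial products.
        dist.all_reduce(z_partial, op=dist.ReduceOp.SUM)
        # Real implementations synchronize the dropout RNG across ranks.
        return F.dropout(z_partial, p=0.1, training=self.training)
```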
Examples: DeepSpeed (Microsoft)
Compatible with Megatron; supports the Zero Redundancy Optimizer (ZeRO); goes beyond the GPU by using heterogeneous resources.

Beyond the GPU: heterogeneous resources
CPU side: optimized for serial operations (higher clock speeds, fewer execution units); latency-oriented; task parallelism. RAM: 200GB, 1TB, 2TB.
GPU side: built for parallel operations (lightweight threads, many execution units); throughput-oriented; data parallelism. Memory (A100): 40GB/80GB.
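A minimal sketch of the heterogeneous idea, assuming the layers fit in CPU RAM one GPU-resident layer at a time: park the model in large, cheap host memory and stage only the working layer through scarce GPU memory. This is an illustrative forward-pass loop in the spirit of ZeRO-Offload and STRONGHOLD, not either system's actual implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def offloaded_forward(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    """Run a deep stack whose weights live in CPU RAM, one layer on GPU at a time."""
    x = x.to("cuda")
    for layer in layers:      # weights resident in CPU RAM (200GB-2TB)
        layer.to("cuda")      # stage the working layer into GPU memory
        x = layer(x)
        layer.to("cpu")       # evict it to make room for the next layer
    return x
```

Done serially like this, the PCIe transfers would dominate; real systems prefetch the next layer on a separate CUDA stream to overlap copies with compute, and for training they additionally offload gradients and optimizer state to the CPU side.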
DAMO Academy pretrained model family
- Form and document-image pretrained model (ACL 2021): SOTA on the FUNSD, RVL-CDIP, and DocVQA structured-understanding datasets (2021.11); ranked first on the leaderboard (2020.8).
- Super-large Chinese pretrained model (PLUG): the first super-large Chinese text pretrained model (27B parameters) to unify natural language understanding and generation; an initial end-to-end PLUG serving pipeline is in place, with a leading 10X speedup for large-model serving.
- Table pretrained model (STAR) 1.0/2.0 (AAAI 2021): first place on four international table-QA leaderboards; open-sourced the first Chinese pretrained table model; 390 million Chinese tables accumulated across 18 Alibaba Cloud industries; deployed with 10+ intelligent customer-service customers.
- Multilingual pretrained model (VECO); generative pretrained model (PALM); general-purpose pretrained model (StructBERT); multimodal pretrained model (StructVLM); structured pretrained model (StructLM/Bi-VLDoc); dialogue pretrained model (SPACE 1.0/2.0/3.0); platform-level open source, ecosystem, and business applications.
- Knowledge-fused pretrained model (LatticeBERT, ACL 2021): Chinese LatticeBERT ranked first among CLUE base models (2020.9); KGBert ranked first on FewCLUE (2021.06); with 20-shot and KB support, improves KBQA by 3.2%-3.8%.
- Impact: first place on 30+ international leaderboards (understanding, generation, translation, QA).