Quantitative Analysis and Best Practices for Distributed Training of Large Language Models: GPT-175B as an Example


IN-DEPTH ANALYSIS OF THE PERFORMANCE FOR GPT-3
颜子杰, NVIDIA (LLM Training Techs)

LARGER MODEL IS THE TREND

CHALLENGES FOR TRAINING LARGE MODELS: High Compute Costs
- Lower bound on the compute per iteration (see https://arxiv.org/abs/2104.04473); a numerical sketch follows below:
  FLOPs per iteration ≈ 96 · B · S · l · h^2 · (1 + S/(6h) + V/(16·l·h))
  where B is the batch size, S the sequence length, l the number of transformer layers, h the hidden size, and V the vocabulary size.
- Total training compute: about 2150 ZettaFLOPs for a 175B model trained on 1.5T tokens (1 ZettaFLOP = 1000 ExaFLOPs).
- On 128 DGX A100 systems this takes roughly 120-170 days at about 50% computing efficiency.

CHALLENGES FOR TRAINING LARGE MODELS: High Memory Costs
- Memory cost with mixed precision and a native implementation; model states total about 3.5TB:
  parameters 350GB (175B × 2 bytes), gradients 350GB, optimizer states 2800GB, activations: ?
- The model cannot fit on a single GPU, or even a single GPU server (35+ A100 80GB GPUs).
- Model parallelism is a MUST across multiple nodes.
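To make the compute bound concrete, here is a minimal sketch (Python, not taken from the deck) that plugs the GPT-3 175B configuration used later in these slides (B=1536, S=2048, l=96, h=12288, V=51200) into the per-iteration lower bound; it reproduces the roughly 2150 ZettaFLOPs quoted above for 1.5T tokens.

    # FLOPs lower bound per iteration: 96*B*S*l*h^2 * (1 + S/(6h) + V/(16*l*h))
    def flops_per_iteration(B, S, l, h, V):
        return 96 * B * S * l * h**2 * (1 + S / (6 * h) + V / (16 * l * h))

    B, S, l, h, V = 1536, 2048, 96, 12288, 51200       # batch, seq length, layers, hidden, vocab
    iterations = 1.5e12 / (B * S)                      # iterations needed to see 1.5T tokens
    total_flops = flops_per_iteration(B, S, l, h, V) * iterations
    print(f"total training FLOPs ~ {total_flops:.2e}") # ~2.15e24, i.e. ~2150 ZettaFLOPs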

CHALLENGES FOR TRAINING LARGE MODELS: Summary
- The model cannot fit on a single GPU or even a single GPU server (35+ A100 80GB GPUs).
- Extremely large computing power is required: about 16K A100*days (not considering efficiency).
- What we need: an efficient framework with model parallelism, and careful co-design of software and system.

NeMo AND MEGATRON
- NeMo and Megatron-LM are NVIDIA's frameworks for efficiently training the world's largest transformer-based models.
- Train transformer models with billions of parameters.
- Achieve high utilization and scaling to thousands of GPUs.

OVERVIEW OF LARGE TRANSFORMER TRAINING TECHNIQUES
- Parallelisms: pipeline parallelism, tensor parallelism, sequence parallelism, expert parallelism.
- Memory optimizations: distributed optimizer (DeepSpeed ZeRO-1), checkpoint activations, selective activation checkpointing.
- Others: FP16/BF16 training, optimized kernels, communication overlapping for PP and TP, etc.
- (On the original slides, blue marks Megatron v2 features and green marks new Megatron v3 features.)

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Model Parallelism
- Tensor model parallelism (intra-layer): splits individual layers across multiple devices. Simple to implement and performs well on large matrices, but requires fine-grained, high-frequency communication.
- Pipeline model parallelism (inter-layer): splits sets of layers across multiple devices. Coarse-grained communication and generalizable to almost all DNNs, but requires a large batch size for high throughput and can suffer from load imbalance across workers.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor model parallelism (MLP)
- f and g are conjugate operators: f is the identity and g is an all-reduce in the forward pass, while in the backward pass f is an all-reduce and g is the identity.

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Tensor model parallelism (Self-attention)
- The same conjugate pair applies: f is the identity and g is an all-reduce in the forward pass; in the backward pass f is an all-reduce and g is the identity.
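The f/g pair above can be sketched directly with torch.distributed. This is a minimal illustration, not Megatron-LM's actual code; it assumes the default process group is the tensor-parallel group and that the MLP weights are pre-sharded (w1 split by columns, w2 by rows). The names CopyToTP, ReduceFromTP and parallel_mlp are invented for this sketch.

    import torch
    import torch.distributed as dist

    class CopyToTP(torch.autograd.Function):
        """f: identity in the forward pass, all-reduce of the gradient in the backward pass."""
        @staticmethod
        def forward(ctx, x):
            return x
        @staticmethod
        def backward(ctx, grad):
            grad = grad.clone()
            dist.all_reduce(grad)        # sum the partial input gradients across TP ranks
            return grad

    class ReduceFromTP(torch.autograd.Function):
        """g: all-reduce in the forward pass, identity in the backward pass (conjugate of f)."""
        @staticmethod
        def forward(ctx, x):
            x = x.clone()
            dist.all_reduce(x)           # sum the partial outputs across TP ranks
            return x
        @staticmethod
        def backward(ctx, grad):
            return grad

    def parallel_mlp(x, w1_shard, w2_shard):
        # w1_shard: [h, 4h/t] (column parallel), w2_shard: [4h/t, h] (row parallel)
        x = CopyToTP.apply(x)                        # f
        y = torch.nn.functional.gelu(x @ w1_shard)   # local GEMM, no communication
        z = y @ w2_shard                             # local GEMM, partial sums of the output
        return ReduceFromTP.apply(z)                 # g: all-reduce the partial sums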

SEQUENCE PARALLELISM (New in Megatron v3)
- Expands upon tensor parallelism by splitting tensors across the sequence dimension.
- Partitioning along the sequence dimension reduces the memory required for the activations.
- Introduces all-gather/reduce-scatter operations between the sequence-parallel and tensor-parallel regions.
- g and ḡ are conjugate: g is an all-gather in the forward pass and a reduce-scatter in the backward pass; ḡ is a reduce-scatter in the forward pass and an all-gather in the backward pass.
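As with f/g, the g/ḡ pair can be sketched as custom autograd functions. This is a minimal sketch, not Megatron-LM's code; it assumes the default process group is the tensor-parallel group, the sequence dimension is dimension 0, and an NCCL backend (required by the tensor collectives used here). GatherSeq and ScatterSeq are names invented for this sketch.

    import torch
    import torch.distributed as dist

    class GatherSeq(torch.autograd.Function):
        """g: all-gather along the sequence dim in forward, reduce-scatter in backward."""
        @staticmethod
        def forward(ctx, x):                            # x: [s/t, b, h] local shard
            t = dist.get_world_size()
            out = torch.empty((x.shape[0] * t, *x.shape[1:]), dtype=x.dtype, device=x.device)
            dist.all_gather_into_tensor(out, x.contiguous())
            return out                                  # [s, b, h]
        @staticmethod
        def backward(ctx, grad):                        # grad: [s, b, h]
            t = dist.get_world_size()
            out = torch.empty((grad.shape[0] // t, *grad.shape[1:]), dtype=grad.dtype, device=grad.device)
            dist.reduce_scatter_tensor(out, grad.contiguous())
            return out                                  # [s/t, b, h]

    class ScatterSeq(torch.autograd.Function):
        """g-bar: reduce-scatter in forward, all-gather in backward (conjugate of g)."""
        @staticmethod
        def forward(ctx, x):
            t = dist.get_world_size()
            out = torch.empty((x.shape[0] // t, *x.shape[1:]), dtype=x.dtype, device=x.device)
            dist.reduce_scatter_tensor(out, x.contiguous())
            return out
        @staticmethod
        def backward(ctx, grad):
            t = dist.get_world_size()
            out = torch.empty((grad.shape[0] * t, *grad.shape[1:]), dtype=grad.dtype, device=grad.device)
            dist.all_gather_into_tensor(out, grad.contiguous())
            return out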

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Pipeline model parallelism (1F1B)
- m: number of micro-batches, p: number of pipeline stages, t_f: forward step time, t_b: backward step time.
- Ideal time = m · (t_f + t_b)
- Bubble time = (p - 1) · (t_f + t_b)
- Total time = (m + p - 1) · (t_f + t_b)
- Bubble time overhead = bubble time / ideal time = (p - 1) / m

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: Interleaved pipeline model parallelism
- With v interleaved stages (model chunks) per device, the bubble time overhead is reduced to (1/v) · (p - 1) / m.
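A quick sanity check of the bubble overhead for the configuration used later in this deck (B=1536, micro-batch size b=1, data parallel d=8, pipeline stages p=16); this is a sketch, and the interleaving factors v are just example values.

    B, b, d, p = 1536, 1, 8, 16
    m = B // (b * d)                      # micro-batches per pipeline per iteration = 192
    bubble = (p - 1) / m                  # 1F1B bubble overhead ~ 7.8%
    for v in (1, 2, 4):                   # v interleaved stages (model chunks) per device
        print(f"v={v}: bubble overhead = {bubble / v:.2%}")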

THE DISTRIBUTED TRAINING OF GPT-3 MODEL: 3D Parallelism

REQUIREMENT OF STORAGE: Activation memory optimizations
- Full checkpointing: store (checkpoint) only the input activations of a group of layers and recompute the other required activations with an extra forward pass during back-propagation. This significantly reduces the memory required for training, at the cost of about 36% extra computation.
- Sequence parallel + selective checkpointing: only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are not computationally expensive to recompute, and use sequence parallelism to distribute the remaining activations. This reduces the recompute overhead from 36% to about 4%.

ACTIVATION CHECKPOINTING: Selective Checkpointing (New in Megatron v3)
- Only checkpoint and recompute the parts of each layer that take up a considerable amount of memory but are not computationally expensive to recompute (called selective activation recomputation).
- Attention operations generally have large input sizes, and thus large activations, while their number of floating-point operations (FLOPs) per input element is very low.
- Exceptions & limitations: this only works well in conjunction with the other parallelism techniques.
- Combined with sequence parallelism, selective checkpointing reduces the recompute overhead from 36% to about 4%.
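For reference, full activation checkpointing can be expressed with stock PyTorch. A minimal sketch (not the Megatron-LM API) in which only each block's input is kept and everything inside the block is recomputed during the backward pass; the toy layer sizes are arbitrary.

    import torch
    from torch.utils.checkpoint import checkpoint

    class CheckpointedStack(torch.nn.Module):
        def __init__(self, blocks):
            super().__init__()
            self.blocks = torch.nn.ModuleList(blocks)

        def forward(self, x):
            for block in self.blocks:
                # store only x; activations inside `block` are recomputed in backward
                x = checkpoint(block, x, use_reentrant=False)
            return x

    layers = [torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
              for _ in range(4)]
    model = CheckpointedStack(layers)
    out = model(torch.randn(2, 128, 256, requires_grad=True))
    out.sum().backward()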

DISTRIBUTED OPTIMIZER (New in Megatron v3)

Method                               | Optimizer states (16M) | Gradients (2M) | Model weights (2M) | Memory cost | Communication cost
Data parallelism                     | Replicated             | Replicated     | Replicated         | 20M         | all-reduce(M)
Distributed optimizer (ZeRO stage 1) | Partitioned            | Replicated     | Replicated         | (4 + 16/N)M | reduce-scatter(M) + all-gather(M)
ZeRO stage 2                         | Partitioned            | Partitioned    | Replicated         | (2 + 18/N)M | reduce-scatter(M) * num_micro_batches + all-gather(M)
ZeRO stage 3                         | Partitioned            | Partitioned    | Partitioned        | 20M/N       | 1.5 * all-reduce(M) * num_micro_batches
(M is the number of parameters, N is the number of devices.)

- Megatron v3 implements a distributed optimizer that shards the optimizer states (momentum, variance, master weights and master gradients) without compromising performance.
- Considering the communication overhead, ZeRO-2 and ZeRO-3 are not currently implemented in Megatron. In fact, the training runs of GPT-3 175B and Turing-NLG 530B did not use ZeRO.
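The memory-cost column above translates directly into a small helper; a sketch that computes per-device model-state bytes for each scheme. Plugging in the full 175B parameters with plain data parallelism reproduces the roughly 3.5TB of model states quoted earlier.

    # M: parameters held per device (before optimizer sharding), N: data-parallel size
    def model_state_bytes(M, N, scheme="dp"):
        return {
            "dp":    20 * M,             # 2 (fp16 weights) + 2 (fp16 grads) + 16 (optimizer)
            "zero1": (4 + 16 / N) * M,   # optimizer states partitioned across N ranks
            "zero2": (2 + 18 / N) * M,   # gradients partitioned as well
            "zero3": 20 * M / N,         # weights, gradients and optimizer all partitioned
        }[scheme]

    print(model_state_bytes(175e9, N=8, scheme="dp") / 1e12, "TB")   # ~3.5 TB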

MEMORY ANALYSIS

STORAGE ANALYSIS: GPT-3 storage breakdown
- Model memory: parameters, gradients, optimizer states.
- Activation memory: transformer layers (self-attention, MLP), embedding layer, all-reduce buffers.
- Extra memory: PyTorch memory management.

REQUIREMENT OF STORAGE: Model parameters
Configuration used throughout: vocabulary size v = 51200, sequence length s = 2048, attention heads a = 96, hidden size h = 12288, layers n = 96, data parallel d = 8, tensor parallel t = 8, pipeline parallel p = 16, micro-batch size b = 1, batch size B = 1536; hence v/t = 6400 and h/t = 1536.
- The full model has about 175B parameters; the model memory per device is governed by the partitioned model size Ψ ≈ 175B / (t · p) ≈ 1.37B parameters.

REQUIREMENT OF STORAGE: Model states, cost of model memory (AMP)
Partitioned model size Ψ ≈ 1.37B
- Parameters: 2Ψ ≈ 2.74GB
- Gradients: 2Ψ ≈ 2.74GB
- Adam optimizer: (4 + 4 + 4 + 4)Ψ ≈ 21.9GB
Note: the four 4s in the optimizer term are the fp32 copy of the parameters, the fp32 copy of the gradients, the fp32 momentum, and the fp32 variance.

REQUIREMENT OF STORAGE: Model states with ZeRO (AMP)
- Parameters: 2Ψ ≈ 2.74GB, reduced to about 0.34GB with ZeRO-3.
- Gradients: 2Ψ ≈ 2.74GB, reduced to about 0.34GB with ZeRO-2.
- Adam optimizer: (4 + 4 + 4 + 4)Ψ ≈ 21.9GB, reduced to about 2.7GB with ZeRO-1.
(The same four fp32 components make up the optimizer term.)
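A sketch that reproduces the model-state numbers above for this configuration (t = 8, p = 16, d = 8), including the effect of sharding the optimizer states across the data-parallel ranks (ZeRO-1 / distributed optimizer).

    total_params, t, p, d = 175e9, 8, 16, 8
    psi = total_params / (t * p)                     # ~1.37e9 parameters per device
    param_gb = 2 * psi / 1e9                         # fp16 weights    ~2.74 GB
    grad_gb  = 2 * psi / 1e9                         # fp16 gradients  ~2.74 GB
    optim_gb = (4 + 4 + 4 + 4) * psi / 1e9           # fp32 params/grads/momentum/variance ~21.9 GB
    print(param_gb, grad_gb, optim_gb, optim_gb / d) # ZeRO-1 shards 21.9 GB down to ~2.7 GB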

REQUIREMENT OF STORAGE: Activation memory (per layer)
- Attention: 11·s·b·h + 5·a·s²·b; MLP: 19·s·b·h; 2 LayerNorm: 4·s·b·h.
- Total per layer = s·b·h·(34 + 5·a·s/h) bytes.

REQUIREMENT OF STORAGE: Full checkpointing, cost of activation memory
- Only the input activation of each transformer layer is stored: sublinear memory cost, but a forward re-computation is needed.
- Full checkpointing significantly reduces the required memory for training, at about 36% compute overhead.
- Per-layer activation memory drops from s·b·h·(34 + 5·a·s/h) to 2·s·b·h; the MLP and attention internals are dropped and recomputed.

REQUIREMENT OF STORAGE: Tensor parallel (t is the TP size)
- Per layer: 8·s·b·h/t + 5·a·s²·b/t + 3·s·b·h + 16·s·b·h/t + 3·s·b·h + 4·s·b·h = s·b·h·(24/t + 5·a·s/(h·t) + 10), covering attention, MLP and the 2 LayerNorms.

REQUIREMENT OF STORAGE: Tensor parallel + sequence parallel
- Per layer: (8 + 5·a·s/h + 3 + 16 + 3 + 4)·s·b·h/t = (s·b·h/t)·(34 + 5·a·s/h); sequence parallelism also divides the LayerNorm and dropout parts by t.

REQUIREMENT OF STORAGE: Tensor parallel + sequence parallel + selective checkpointing
- Per layer: 34·s·b·h/t; the 5·a·s²·b/t attention term is dropped and recomputed.

REQUIREMENT OF STORAGE: Activation summary

Method              | Activation memory per layer | Value  | Recomputation overhead
None                | s·b·h·(34 + 5·a·s/h)        | 2.86GB | 0%
Full checkpointing  | 2·s·b·h                     | 50MB   | 36%
TP + SP + selective | 34·s·b·h/t                  | 106MB  | 4%
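A sketch that evaluates the per-layer activation formulas above with this deck's configuration (s = 2048, b = 1, h = 12288, a = 96, t = 8), reproducing the ~2.86GB, ~50MB and ~106MB figures in the summary table.

    s, b, h, a, t = 2048, 1, 12288, 96, 8
    sbh = s * b * h
    schemes = {
        "none":                sbh * (34 + 5 * a * s / h),
        "full checkpointing":  2 * sbh,
        "tensor parallel":     sbh * (24 / t + 5 * a * s / (h * t) + 10),
        "TP + SP":             sbh / t * (34 + 5 * a * s / h),
        "TP + SP + selective": 34 * sbh / t,
    }
    for name, nbytes in schemes.items():
        print(f"{name:22s} {nbytes / 1e6:9.1f} MB per layer")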

COMMUNICATION COST

DATA PARALLELISM
Name                  | Operation  | Message size          | Group               | Loops
Gradient accumulation | All-reduce | Φ: the model gradient | Data-parallel group | 1
- Communication time ≈ 2(d - 1)/d · Φ / BW, where BW is the bus bandwidth and d the data-parallel size (see the sketch below).
[Figure: measured bus bandwidth (GB/s) vs. message size (MB); BW for a given message size is read off this curve.]

TENSOR PARALLELISM
Name           | Operation                                                  | Group
Self-attention | All-reduce for TP; all-gather and reduce-scatter for TP+SP | Tensor-parallel group
MLP            | All-reduce for TP; all-gather and reduce-scatter for TP+SP | Tensor-parallel group
- The communication size for TP and TP+SP is the same.
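The data-parallel gradient all-reduce above (and the 2(x - 1)/x bus-bandwidth correction that reappears in the tensor-parallel cost below) can be estimated with a small helper. This is a sketch; the bus bandwidth is an assumed value standing in for interpolation of the measured curve.

    # num_params: gradient elements to all-reduce, d: data-parallel size,
    # bus_bw_gbps: assumed bus bandwidth for this message size (interpolated in practice)
    def dp_allreduce_seconds(num_params, d, bus_bw_gbps, bytes_per_elem=2):
        msg_bytes = num_params * bytes_per_elem            # fp16/bf16 gradient message
        effective_bytes = 2 * (d - 1) / d * msg_bytes      # ring all-reduce correction factor
        return effective_bytes / (bus_bw_gbps * 1e9)

    # e.g. the ~1.37B parameters held per device, d=8, assumed 150 GB/s bus bandwidth
    print(dp_allreduce_seconds(1.37e9, d=8, bus_bw_gbps=150))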

TENSOR PARALLELISM: Equation for communication cost
- Per transformer layer, both the self-attention and the MLP block contribute a fixed number of collective calls, so the total cost has the form
  T_TP ≈ m · n · (3 + 3) · [2(t - 1)/t] · Φ / BW
  where Φ is the message size per NCCL call, 2(t - 1)/t is the correction factor for bus bandwidth (all-reduce), BW is the bus bandwidth obtained by interpolation, n is the number of transformer layers, m is the number of micro-batches, and (3 + 3) counts the calls attributed to self-attention and to the MLP.

PIPELINE PARALLELISM: Scatter-gather mechanism
- Activations passed between pipeline stages go over point-to-point (P2P) sends; with tensor parallelism each rank sends only its partition, and the receiving stage reassembles it with an all-gather over NVLink, which is cheap enough to be ignored.

COMPUTATION COST

TRAINING COMPUTATION ANALYSIS OF GPT-3 175B: Per-batch FLOPs estimation
- Model FLOPs (per batch): the per-iteration FLOPs estimate given earlier.
- Model FLOPS per GPU = Model FLOPs / (batch time × number of GPUs).
- A100 training efficiency = Model FLOPS per GPU / 312 TFLOPS.
- We can check whether training efficiency meets expectations by computing the FLOPS achieved per GPU during training. For large GPT models (with TP and PP) this can reach about 150 model TFLOPS per A100; if the value is below 120, there are usually performance issues in the training setup.
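A sketch of this check; the per-batch model FLOPs value reuses the roughly 4.5e18 FLOPs per iteration computed earlier, while batch_time_s and num_gpus are assumed example numbers, not measurements from the deck.

    A100_PEAK = 312e12                                       # FP16/BF16 tensor-core peak, FLOPS

    def a100_efficiency(model_flops_per_batch, batch_time_s, num_gpus):
        per_gpu = model_flops_per_batch / (batch_time_s * num_gpus)
        return per_gpu / 1e12, per_gpu / A100_PEAK           # (TFLOPS per GPU, fraction of peak)

    tflops, frac = a100_efficiency(4.5e18, batch_time_s=28.0, num_gpus=1024)
    print(f"{tflops:.0f} model TFLOPS/GPU ({frac:.0%} of peak)")  # below ~120 usually signals a problem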

CONCLUSION: TAKEAWAY POINTS
- Out of the box: mixed precision training and FlashAttention; BF16 is recommended for larger model training (20B+).
- For large-scale training (both weak scaling and strong scaling), my intuition:
  - If memory is an issue: selective activation checkpointing; distributed optimizer; progressively increase tensor parallelism (with sequence parallelism enabled, keeping hidden/tp at least 1024 and ideally 2048); progressively increase pipeline parallelism; full activation checkpointing.
  - If memory is not an issue: data parallelism and a larger batch size.

Thank you for watching. THANKS
