He Pujiang, AI Software Architect, Intel
The Large Language Model Era: Optimization Strategies for Maximizing CPU Value
QCON 2023 SHANGHAI

Agenda
01 Background (why?)
02 How to optimize large language models on CPU?
03 Maximizing CPU value
04 Summary

Background (Why Consider Maximizing CPU Value?)

Computing Needs in LLM

GPT-J model structure: input embedding, then 28 repeated decoder blocks (Layer Norm, Masked Multi-Head Attention, Layer Norm, FFN MatMuls), and finally a MatMul plus SoftMax producing the probability of the next token. Inside the attention block: QKV MatMul, BMM (Q x K^T), SoftMax, BMM (score x V), output MatMul.

MatMul shapes in GPT-J (assuming prompt token size = 2048, batch size = 1, greedy search):

MatMul                 | 1st token (compute bound)         | Next tokens (memory-read-bandwidth bound)
QKV MatMul in MHA      | A: 2048x4096,   B: 4096x12288     | A: 1x4096,   B: 4096x12288
MHA (1st BMM)          | A: 16x2048x256, B: 16x2048x256    | A: 16x1x256, B: 16x2048x256
Output MatMul in MHA   | A: 2048x4096,   B: 4096x4096      | A: 1x4096,   B: 4096x4096
1st MatMul in FFN      | A: 2048x4096,   B: 4096x16384     | A: 1x4096,   B: 4096x16384
2nd MatMul in FFN      | A: 2048x16384,  B: 16384x4096     | A: 1x16384,  B: 16384x4096

The 1st-token (prefill) phase is compute bound; the next-token (decode) phase is memory-read-bandwidth bound.

GPT Series Model Analysis

Parameters visited during one inference step, for l layers, hidden size h, and vocabulary size V:

    P ≈ l(4h² + 2·4h²) + Vh = 12lh² + Vh

With b bytes per parameter, memory latency and compute latency for a step over B tokens are roughly

    t_mem = P·b / BW,    t_comp = 2·P·B / FLOPS

so the arithmetic intensity in BF16 (b = 2) is

    AI = FLOPs / bytes = 2·P·B / (2·P) = B  FLOPs/byte

Peak AI for SPR-SP with BF16 on AMX: 123.2 TFLOPS / 307.2 GB/s ≈ 401 FLOPs/byte.
- Compute bound (BF16): B × S ≥ 401 (e.g., prefill over long prompts)
- Memory bound (BF16): B × S < 401 (e.g., token-by-token decoding)

Platform capabilities:
- 1 GB/core HBM memory capacity; 64 GB HBM2e
- 1 TB/s memory bandwidth
- up to 112.5 MB shared LLC
- DDR5: 8 channels per CPU at 4800 MT/s (1DPC), 16 DIMMs per socket

CPU is NOT Fully Utilized!

[Chart: metric_CPU utilization% over time (x: 0-3,000; y: 0-70%); utilization remains low throughout.]

LLM Inference Pipeline: a REST/gRPC API feeds text embedding stages (Text Emb1, Text Emb2), a Vector DB, and a Context Retriever in front of the LLM (a pre-trained or fine-tuned model). Example: "Should I attend QCON?" -> "Yes." Pre-processing and post-processing in LLM inference are relatively simple and do not need much CPU resource.

CPU Utilization in LLM Training (offload mode): even for offload LLM training, the CPU is still not fully utilized.

How to Optimize Large Language Models on CPU?

Optimization starts with leveraging high-performance kernels (e.g., oneDNN).
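The roofline analysis in the background section can be made concrete with a small calculator sketch. This is an illustration, not code from the talk: the helper names are hypothetical, and the GPT-J vocabulary size used in the example (50400) is an assumption about the model configuration.

```python
# Hypothetical roofline helpers for the analysis above: parameters visited per
# step, arithmetic intensity, and whether a step is compute or memory bound.

def params_visited(layers, hidden, vocab):
    """Parameters touched per inference step: l*(4h^2 + 2*4h^2) + V*h = 12*l*h^2 + V*h."""
    return 12 * layers * hidden ** 2 + vocab * hidden

def arithmetic_intensity(batch_tokens):
    """For BF16 (2 bytes/param): AI = 2*P*B / (2*P) = B FLOPs/byte."""
    return batch_tokens

def bound_kind(batch_tokens, peak_tflops=123.2, peak_bw_gbs=307.2):
    """Compare AI against the machine balance point (peak FLOPs / peak bytes/s)."""
    peak_ai = peak_tflops * 1e12 / (peak_bw_gbs * 1e9)  # ~401 FLOPs/byte for SPR BF16+AMX
    return "compute-bound" if arithmetic_intensity(batch_tokens) >= peak_ai else "memory-bound"

# GPT-J 6B: 28 layers, hidden size 4096, vocabulary ~50400
p = params_visited(28, 4096, 50400)  # ~5.8e9 parameters visited per step
print(bound_kind(2048))              # prefill over a 2048-token prompt -> compute-bound
print(bound_kind(1))                 # single-token greedy decode -> memory-bound
```

This reproduces the slide's conclusion: prefill easily exceeds the ~401 FLOPs/byte balance point, while one-token-at-a-time decoding sits far below it.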
Further techniques:
- Avoid redundant computing: continuous batching, causal masking, prefix sharing
- Lower precision & sparsity
- Graph fusion
- Minimize memory copies and reorders
- Reuse memory
- Distributed inference with an efficient communication library (oneCCL)
- Runtime tuning

Optimization for Distributed Inference

Distributed inference is built on oneCCL. Improve scalability by minimizing synchronization: one synchronization per layer is enough for some models. With full-stack ownership of both the compute module and the communication module, memory copies between them can be minimized.

Attention Optimization

Both variants below compute Softmax(Q x K^T) x V but change the computing order by splitting the intermediate score matrix:
- SlimAttention (splits the score along one dimension). Advantage: smaller score buffer without redundant computing.
- FlashAttention (splits the score along two dimensions). Advantage: minimal intermediate buffer, with some redundant computing.

Do We Need Paged Attention on CPU? (The diagram shows the KV cache being accessed sequentially: 0th sequential visit, 1st sequential visit.)

Int8 Weight-Only Quantization

Plain min/max quantization maps [min, max] of the fp32/fp16 weights onto the int8 range [-127, 127], covering 100% of the data, so a few outliers stretch the scale. With a histogram of the weight values, the range can instead be clipped to [real min, real max] covering 99.99% of the data before mapping onto [-127, 127]. The fp32/fp16 weight tensor is quantized to int8 weights plus an fp32 scale and an fp32 zero_point.

LLaMA2 7B accuracy:
- HF + AutoGPTQ_INT8: 73.7046%
- xFT + INT8_Convert: 73.7629%

Observation: with histogram-based quantization, we can get very good accuracy in xFasterTransformer.

Maximizing CPU Value

CPU vs. GPU:

CPU                                                  | GPU
A smaller number of larger cores                     | A larger number (thousands) of smaller cores
Low latency                                          | High throughput
Performs fewer instructions per clock                | Performs more instructions per clock
Designed and optimized for complex programs          | Optimized for parallel processing with bulk
with serial processing                               | repetitive calculations
Automatic cache management                           | Allows for manual memory management
Large memory capacity                                | Limited memory capacity

Key HW factors in LLM inference: memory bandwidth, computing, memory capacity.
Key challenges in LLM inference: autoregressive generation, attention (quadratic in sequence length), large model size.

Scenarios where CPU has value:
- Long-tail models (many models, few requests): Model-1 ... Model-N all loaded in memory on one CPU, but not all models serving together
- Offline mode (to maximize throughput)
- Occasional demand
- Very long prompt token size with no strict latency requirement
- Very large model and not enough GPUs
- Hybrid solutions (e.g., speculative sampling)

Speculative Decoding

Image from https:/

A small draft model proposes tokens (token generation); the large target model then verifies the proposed tokens (token verification).

Memory Bandwidth Matching

[Chart: memory bandwidth (GB/s), axis 0-6000; 2S CPU (5th Gen Xeon) at 716.8 GB/s vs. 8 x A10 at 4800 GB/s.]

In practice, draft models are about 15-20x smaller than the target model. Draft models can themselves be cascaded (Draft Model -> Draft2 Model -> Target Model, each passing proposed tokens upward), with a 6.7x figure reported.

Summary

- Why consider CPU? How to optimize on CPU? When to use CPU?
- Try our solution on Xeon, or build your own solution to seek ultimate performance.
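The histogram-based int8 weight-only quantization described above can be sketched as follows. This is a simplified per-tensor illustration with hypothetical function names; a production implementation such as xFasterTransformer operates on real weight tensors, typically per output channel.

```python
import random

def clip_range(values, coverage=0.999):
    """Pick clip bounds covering `coverage` of the data, dropping the extreme
    tails -- the '99.99% data' idea from the slides (coverage is a knob here)."""
    s = sorted(values)
    cut = int(len(s) * (1 - coverage) / 2)
    return s[cut], s[-1 - cut]

def quantize_int8(values, lo, hi):
    """Affine-map [lo, hi] onto the int8 range [-127, 127]; returns int8 codes
    plus an fp32-style scale and zero_point, as on the slide."""
    scale = (hi - lo) / 254.0
    zero_point = -127.0 - lo / scale
    codes = [max(-127, min(127, round(v / scale + zero_point))) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10000)] + [100.0]  # one outlier

# Plain min/max: the single outlier stretches the range and wastes int8 codes.
q1, s1, z1 = quantize_int8(weights, min(weights), max(weights))
# Histogram-based: clip the tails first, so the bulk of the weights keeps precision.
q2, s2, z2 = quantize_int8(weights, *clip_range(weights))

mae = lambda a, b: sum(abs(x - y) for x, y in zip(a, b)) / len(a)
print(mae(dequantize(q1, s1, z1), weights) > mae(dequantize(q2, s2, z2), weights))  # True
```

The outlier itself is clipped and reconstructed poorly under the histogram scheme, but the mean error over the whole tensor drops sharply because every other weight gets a much finer scale, which is why accuracy holds up in the LLaMA2 7B numbers above.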
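The draft/target loop from the speculative decoding slides can be sketched with stand-in models. The `draft` and `target` callables here are toy functions (simple counters), not real LLMs, and the greedy accept/reject rule is a simplification of sampling-based verification.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=12):
    """Greedy speculative decoding: the cheap draft model proposes k tokens,
    the expensive target model verifies them, and the longest agreeing prefix
    is accepted (plus the target's own token at the first disagreement)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft proposes k tokens autoregressively (token generation).
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2) Target scores all k positions (in a real system: one batched pass).
        verified = [target(tokens + proposed[:i]) for i in range(k)]
        # 3) Accept the agreeing prefix; on mismatch, take the target's token.
        n = 0
        while n < k and proposed[n] == verified[n]:
            n += 1
        tokens += proposed[:n]
        if n < k:
            tokens.append(verified[n])
    return tokens[len(prompt):len(prompt) + max_new]

# Toy "models": the target counts 0,1,2,...,9,0,...; the draft mimics it but
# makes a mistake whenever the last token is a 5.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

print(speculative_decode(target, draft, [0]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

Because verification is greedy here, the output is identical to what the target model alone would generate; the win is that the target runs once per batch of k proposals instead of once per token, which is what lets a bandwidth-matched CPU serve as the draft side of a hybrid deployment.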