Deploying Large Model in Edge DC: Practice and Acceleration
Jack Chen, YSEMI Computing (遇贤微电子, dedicated to developing CPUs for high-performance computing and data centers)
7/17/2023

YSEMI Computing Introduction
- Founded in 2020
- 100+ employees across Shenzhen, Shanghai, and Xi'an
- Provides silicon, platforms, and systems for cloud-computing data centers
- First-generation product: a 160-core Armv9 data-center CPU running at 3.2 GHz, with an estimated SPECint2017 score of 620+
- Member of LF Edge, openEuler, ODCC, etc.

Why is a Large Language Model on the edge important?
- Data privacy
- Pervasive AI
- Latency

ChatGPT-like LLM use cases in edge DC
Use cases: code explanation, time-complexity calculation, program code translation, fixing code bugs, paragraph production, story creation, summary description, text categorization, person switch, category FAQ, review generation, text sentiment analysis, advanced sentiment scoring, interview questions and answers, text-to-emoji, language chatbot.
Categories: engineering, communication, content generation, marketing.

Challenges in LLM deployment on the edge
- Demands are huge: the recent rise of ChatGPT-like Large Language Models (LLMs) has driven vigorous AI development on the application side, which places unprecedented demands on edge devices.
- Computing force is limited: GPT-3 has 175 billion parameters, and GPT-4 has even more. Large model sizes and heavy training compute requirements limit usage on edge devices and terminals.
- What edge deployment therefore needs: smaller model size, moderate computing force, and lower but still usable accuracy.
- Target edge devices: IP phone, tablet PC, PC client, mobile client, thin client, TV screen, collaboration software.
Computing force estimation of LLM application

Training and inference computing force:
- According to the official account of the Green Energy Saving Data Center, the total computing power consumed by ChatGPT training is about 3640 PF-days.
- GPT-3 has 175 billion parameters and 96 layers, and inference requires roughly 2N operations per token (N = number of parameters). The computing power for a single token is therefore 2 x 175,000,000,000 ops ≈ 0.35 Tops.

Inference computing force per token:

Model      | Parameters (B) | Layers | Inference CF (Tops/token)
GPT-3-175B | 175            | 96     | 0.35
LLaMA-13B  | 13             | 40     | 0.026

Inference computing force estimation under an edge application scenario

Assumption: an R&D enterprise with 1,000 people, with an average usage per person of 100K tokens/(day*person) and 1M tokens/(day*person) respectively.

Result: based on the model chosen, the servers needed are as below.

Total servers (unit), 100K tokens/(day*person):

Model      | Paras (B) | Layers | CF (Tops/token) | 100 Tops/server | 40 Tops/server | 1 Tops/server
GPT-3-175B | 175       | 96     | 0.35            | 12              | 30             | 1215
LLaMA-13B  | 13        | 40     | 0.026           | 1               | 2              | 90
LLaMA-30B  | 30        | 60     | 0.06            | 2               | 5              | 208
LLaMA-65B  | 65        | 80     | 0.13            | 5               | 11             | 451

Total servers (unit), 1M tokens/(day*person) (scaled from the 100K case with the same formula):

Model      | 100 Tops/server | 40 Tops/server | 1 Tops/server
GPT-3-175B | 122             | 304            | 12153
LLaMA-13B  | 9               | 23             | 903
LLaMA-30B  | 21              | 52             | 2083
LLaMA-65B  | 45              | 113            | 4513

Sizing formula (reconstructed; assuming the daily load is served within about 8 working hours, which reproduces the table values):

    servers = (people x tokens per (day*person) x CF per token) / (server peak CF x busy seconds per day)

For reference, the peak computing force of a 2P server based on the YSEMI CPU is 32 Tops per 2P server.
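The sizing arithmetic above can be checked with a short script. The 8-hour busy window is an assumption (it is what makes the formula reproduce the table values), and the function names are illustrative:

```python
# Servers needed for a daily token load, per the sizing formula above.
BUSY_SECONDS = 8 * 3600  # assume the day's load is served in ~8 working hours

def tops_per_token(params_billion):
    """Per-token compute from the 2N-operations rule: 2 ops per parameter."""
    return 2 * params_billion * 1e9 / 1e12  # ops -> Tera-ops

def servers_needed(people, tokens_per_person_day, cf_per_token, server_tops):
    total_teraops = people * tokens_per_person_day * cf_per_token
    return total_teraops / (server_tops * BUSY_SECONDS)

for name, params in [("GPT-3-175B", 175), ("LLaMA-13B", 13),
                     ("LLaMA-30B", 30), ("LLaMA-65B", 65)]:
    n = servers_needed(1000, 100_000, tops_per_token(params), 100)
    print(f"{name}: {n:.1f} servers of 100 Tops at 100K tokens/(day*person)")
```

With the 32 Tops 2P YSEMI server mentioned above, the same formula gives roughly 2.8 (so 3) servers for LLaMA-13B at 100K tokens/(day*person).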
LLM software architecture in edge server

- K8s production environment: Kubernetes management, orchestration & workflow (Kubeflow), Kubernetes Service, Docker Registry Service, CI/CD, auth, application API, logging & profiling (TensorBoard), Git
- System management: user management, load balance, resource management, model management
- Data & storage: Ceph/HDFS, Hadoop/Spark
- Workloads: LLM containers for inference and distributed training; ONNX and TensorFlow models running on K8s clusters
- Frameworks: TensorFlow, Caffe, PyTorch, MXNet with the YSEMI acceleration library, on CPU+xPU servers

Notes:
- The LLM starts from a trained model such as LLaMA, which supports inference and secondary training in this production environment.
- Multiple LLMs are managed by the model management module (system management).
- Kubernetes is the core scheduling and task management platform of the AI platform.
- An Armv9 ML feature acceleration library accelerates mainstream ML frameworks.

Practice and accelerating (Test Environment, 1/4)
Three platforms:

A72 server:
- CPU: 64 x Arm A72 @ 2400 MHz
- Memory: DDR4, 16384 MB x 3
- OS: Ubuntu 18.04.6 LTS

Raspberry Pi board:
- CPU: quad-core Cortex-A72 @ 1.5 GHz
- Memory: DDR4, 8 GB
- OS: Ubuntu 20.04 LTS

Armv8 server:
- CPU: 96 x Armv8 custom cores @ 2600 MHz
- Memory: DDR4, 16384 MB x 3
- OS: Ubuntu 18.04.6 LTS

Two models:

Model 1, LLaMA: a collection of foundation language models ranging from 7B to 65B parameters. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

Model 2, ChatGLM-6B: an open-source dialogue robot released by the Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University. According to the official introduction, the underlying model is a Chinese-English language model with a scale of 100 billion parameters, optimized for Chinese. This open-source version is a small-scale, 6-billion-parameter version, and only 6 GB of memory is required for local deployment (at the INT4 quantization level).
Practice and accelerating (LLaMA, 2/4)

Raspberry Pi (LLaMA-7B, 4 GB memory):

./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 2 -n 512

With 2 Raspberry Pi cores, the latency is about 1955 ms.

A72 & Armv8 servers (LLaMA-7B/13B):

./main -m ./models/*B/ggml-model-q4_0.bin -t 64 -n 128

- LLaMA-13B (8 GB memory): 2281.98 ms on 64 A72 cores; 1538.4 ms on 64 Armv8 cores
- LLaMA-7B (4 GB memory): 397.8 ms on 64 A72 cores; 262 ms on 64 Armv8 cores
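Treating the figures above as per-token latencies (llama.cpp reports evaluation time per token), throughput and the Armv8-over-A72 speedup follow directly:

```python
# Per-token latencies (ms) on 64 cores, from the measurements above.
latency_ms = {
    "LLaMA-7B":  {"A72": 397.8,   "Armv8": 262.0},
    "LLaMA-13B": {"A72": 2281.98, "Armv8": 1538.4},
}

for model, lat in latency_ms.items():
    tput_a72 = 1000.0 / lat["A72"]    # tokens per second
    tput_v8 = 1000.0 / lat["Armv8"]
    speedup = lat["A72"] / lat["Armv8"]
    print(f"{model}: A72 {tput_a72:.2f} tok/s, Armv8 {tput_v8:.2f} tok/s, "
          f"Armv8 speedup {speedup:.2f}x")
# The custom Armv8 cores come out roughly 1.5x faster per token on both sizes.
```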
Raspberry Pi, LLaMA-7B latency (ms): 1955.3 / 600 at n_threads = 2 / 4, with the following llama.cpp build flags:

n_threads = 2/4 | AVX=0 | AVX2=0 | AVX512=0 | FMA=0 | NEON=1 | ARM_FMA=1 | F16C=0 | FP16_VA=0 | WASM_SIMD=0 | BLAS=0 | SSE3=0 | VSX=0

Profiling

./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 64 -n 512
perf record -F 1000 -g -p 1517241 -- sleep 10
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf.svg

(perf is attached for 10 seconds to the running ./main process by PID, and the collected stacks are folded into a flame graph, perf.svg.)

Practice and accelerating (LLaMA, 3/4): AI inference model optimization

Acceleration approaches:
- Trade memory for efficiency
- Lower the numerical accuracy (quantization)
- Increase the efficiency of matrix computing
- Merge parameters to save steps
- Cut off some branches (pruning)
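The "trade memory / lower accuracy" items above are what the ggml-model-q4_0.bin files in the earlier commands implement: block-wise 4-bit quantization of weights. A minimal sketch of the idea follows; the 32-weight block size matches ggml's q4_0, but the signed-nibble scheme here is a simplification, not the exact q4_0 layout:

```python
import random

BLOCK = 32  # ggml's q4_0 also quantizes weights in blocks of 32

def quantize_q4(block):
    """One float scale per block; weights stored as 4-bit ints in [-8, 7]."""
    scale = max(abs(x) for x in block) / 7.0 or 1.0
    return [max(-8, min(7, round(x / scale))) for x in block], scale

def dequantize_q4(q, scale):
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(64)]

restored = []
for i in range(0, len(weights), BLOCK):
    q, s = quantize_q4(weights[i:i + BLOCK])
    restored.extend(dequantize_q4(q, s))

err = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
print(f"mean abs quantization error: {err:.4f}")
# Storage drops from 32 bits/weight to 4 bits/weight plus one scale per
# block: roughly 7x smaller, at the cost of a small accuracy loss.
```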
Practice and accelerating (ChatGLM-6B, 4/4)

Configuration:
- CPU: Armv8 server, 96 x Armv8 cores @ 2400 MHz
- RAM: DDR4, 16384 MB x 3
- OS: Ubuntu 18.04.6 LTS
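The "only 6 GB of memory at INT4" figure quoted for ChatGLM-6B can be sanity-checked with a rough weight-memory estimate. The ~6.2B parameter count and the idea that runtime adds activations and KV cache on top of the weights are assumptions of this sketch:

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Raw storage for model weights quantized to the given bit width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30  # GiB

# ChatGLM-6B has roughly 6.2 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(6.2, bits):.1f} GB")
# At 4 bits the weights alone take ~2.9 GB; activations and the KV cache
# at runtime push the total toward the ~6 GB figure quoted on the slide.
```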
Creating better processor designs with AI/LLM

1st example: MCU (Chip-Chat, https://arxiv.org/abs/2305.13243)
- Methodology: simple circuits
- Result: an 8-bit processor co-design

2nd example: RISC-V CPU (https://arxiv.org/abs/2306.12456)
- Methodology: generating the circuit logic in the form of a Boolean function satisfying the input-output specification
- Result: 1) the implemented program is executed on a Linux cluster of 68 servers, each equipped with 2 Intel Xeon Gold 6230 CPUs; 2) a CPU with the RISC-V 32IA instruction set is generated from a relatively small set of IO examples in less than 5 hours, and it can successfully run the Linux operating system. The generated CPU runs at 300 MHz, occupies 0.276 mm², and consumes 14.46 mW at 65 nm. Design space: 10^?

Summary

LLM edge inference could mean a high-performance CPU plus efficient acceleration:
- Manageable engineering on the edge
- Private data & accuracy
- High performance & cost efficiency
The edge data center is one of the core usage scenarios for LLM inference.

Enabling AI 2.0 for the edge data center with YSEMI Computing.