VTA-NIC: Deep Learning Inference Serving in Network Interface Cards

Kenji Tanaka (1), Yuki Arikawa (1), Kazutaka Morita (2), Tsuyoshi Ito (1), Takashi Uchida (3), Natsuko Saito (3), Shinya Kaji (3), Takeshi Sakamoto (1)
(1) NTT Device Technology Labs, (2) NTT Software Innovation Center, (3) Fixstars Corporation
Hot Chips 34 Posters, August 21-23, 2022, Virtual Conference

Abstract: VTA-NIC Chip Architecture

We aim to achieve DL inference serving (DLIS) without CPU interference. We integrate hardware data paths as a NIC (Network Interface Card), a REST API parser/deparser, and multiple VTAs (Versatile Tensor Accelerators).
Configuration (VTA-NIC)

  Process node:       16 nm FinFET (Xilinx FPGA)
  Number of cores:    8 VTA cores
  Core frequency:     213 MHz
  MACs per core:      169
  Memory throughput:  19.2 GB/s (DDR4-2400)
  Number precision:   INT8

Abstract: Performance

Power efficiency: The DLIS power efficiency of VTA-NIC is 6.1x better than that of a GPU (Nvidia V100).
Tail latency: At high load, the tail latency of heterogeneous systems unexpectedly increases. With our chip, the tail latency is predictable since it is proportional to the load.
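As a rough cross-check of the configuration table above, peak compute can be derived from the core count, MAC count, and clock frequency. The sketch below assumes each MAC unit retires one MAC per cycle and uses the common 1 MAC = 2 ops convention; neither assumption is stated on the poster.

    # Back-of-envelope peak throughput from the configuration table.
    # Assumptions (not stated on the poster): each MAC unit retires one
    # MAC per cycle, and 1 MAC = 2 ops (multiply + add).
    cores = 8                # 8 VTA cores
    macs_per_core = 169
    freq_hz = 213e6          # 213 MHz
    ops_per_mac = 2

    peak_tops = cores * macs_per_core * ops_per_mac * freq_hz / 1e12
    print(f"Peak INT8 throughput: ~{peak_tops:.2f} TOPS")  # ~0.58 TOPS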
Background

Recently, web applications are often built on microservices, and DL inference serving (DLIS) is one of those microservices [1]. DLIS is provisioned with a special accelerator instance [2]. The microservices/instances are loosely coupled via APIs.
[Figure: CPU and GPU accelerator instances, loosely coupled via APIs.]

Background (continued)

Accelerator instances risk inefficient data movement.
1. Moving data via host processors decreases the accelerator's utilization [3].
   a. In our preliminary experiments, half of the DLIS latency was caused by moving data.
2. Under high-load conditions, the interference of host processors degrades DLIS tail latency by up to 100x [4].
   a. In a real cloud, 9% of light DLIS tasks suffer server tail latency, and half of the serving time is waiting time [5].
[Figure: latency breakdown into queue, input, compute, and output. Model: ResNet-18 (TensorRT, INT8); system: Triton Inference Server; accelerator: Nvidia V100.]

Introduction

Objective: By integrating NN processing cores in the NIC, we eliminate redundant data movement in DLIS.
Challenges: (1) an architecture to serve inference requests to NN processing engines; (2) an architecture to bridge the inter-service communication protocol and the instructions of NN processing engines.
Solution: A VTA-NIC architecture that integrates an open DL inference engine, the VTA (Versatile Tensor Accelerator), into the NIC. Offload host processing to the NIC to achieve DLIS without CPU processing. Integrate a circuit that bridges the VTA's instructions and the web API.
Solution 1: A VTA-NIC architecture that integrates VTAs into the NIC

Motivation: Minimize data movement for inference requests.
Key points: We choose TVM-VTA [6] for its open ecosystem. The VTA-NIC maximizes throughput by integrating multiple VTAs that operate asynchronously (see the sketch below).
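The poster does not detail how requests are distributed across the eight asynchronous VTA cores. The following is a minimal software analogue of that dispatch, assuming a shared request queue and a hypothetical per-core worker; the sleep stands in for inference time.

    import asyncio
    import random

    NUM_CORES = 8  # matches the 8 VTA cores in the configuration table

    async def vta_core(core_id: int, queue: asyncio.Queue) -> None:
        # Hypothetical worker: each core pops the next request as soon as
        # it is free, so the cores run asynchronously with respect to
        # one another.
        while True:
            request_id = await queue.get()
            await asyncio.sleep(random.uniform(0.001, 0.003))  # stand-in for inference
            print(f"core {core_id} served request {request_id}")
            queue.task_done()

    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        cores = [asyncio.create_task(vta_core(i, queue)) for i in range(NUM_CORES)]
        for request_id in range(32):   # enqueue inference requests
            queue.put_nowait(request_id)
        await queue.join()             # wait until every request is served
        for task in cores:
            task.cancel()
        await asyncio.gather(*cores, return_exceptions=True)

    asyncio.run(main())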
Solution 2: Offload host processing to the NIC to achieve DLIS without CPU processing

Motivation: Execute DLIS without CPU processing.
Key points: Packet processing (TCP/IP) is offloaded to the VTA-NIC. Requests can also be sent from the DMA engine.

Solution 3: Integrate a circuit that bridges the VTA's instructions and the web API

Motivation: Include protocols for controlling VTAs in the inter-service communication.
Key points: Develop a circuit to convert REST API calls into VTA instructions. Send HTTP requests to the FPGA's IP address/port with a URN (Uniform Resource Name), as in the table and client-side sketch below.

  API (URN)                                                    VTA-NIC behavior
  GET  /memory/<memory_address>?size=<size>                    DRAM data download
  PUT  /memory/<memory_address>                                DRAM data upload
  POST /device?insn_addr=<memory_address>&insn_count=<offset>  VTA launching
  POST /processing_logic                                       Processing-logic launching
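Since the interface is plain HTTP, any REST client can drive the NIC. The sketch below uses Python's requests library against the table above; the host address, memory addresses, instruction count, and hex-in-URL formatting are all assumptions for illustration, not values from the poster.

    import requests

    # Hypothetical FPGA IP address/port (see the key points above).
    BASE = "http://192.0.2.10:8080"

    INPUT_ADDR, OUTPUT_ADDR, INSN_ADDR = 0x1000, 0x2000, 0x0  # assumed layout
    INSN_COUNT = 64                                           # assumed count

    # 1. Upload an input tensor into NIC-attached DRAM (PUT /memory/...).
    input_bytes = bytes(3 * 224 * 224)  # placeholder ResNet-18 input (zeros)
    requests.put(f"{BASE}/memory/{INPUT_ADDR:#x}", data=input_bytes)

    # 2. Launch the VTA on a preloaded instruction stream (POST /device).
    requests.post(f"{BASE}/device",
                  params={"insn_addr": f"{INSN_ADDR:#x}", "insn_count": INSN_COUNT})

    # 3. Download the result from DRAM (GET /memory/...?size=).
    resp = requests.get(f"{BASE}/memory/{OUTPUT_ADDR:#x}", params={"size": 1000})
    scores = resp.content  # e.g. 1000 INT8 class scores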
Implementation

A NoC and DDR I/F connect between the FPGA dies. The design uses approximately 42% of the Alveo U250's resources. Along the TOE (TCP offload engine) → HTTP parser → VTA core path, the pass-through latency is 468 ns (simulated).

  Configuration (VTA-NIC)
  Process node:    16 nm FinFET (Xilinx FPGA)
  Core frequency:  213 MHz
  SRAM:            40 MB
  LUT:             0.5 M
  DSP:             3.3 K slices

Evaluation 1: Power Efficiency

Motivation: Evaluate the power efficiency of the VTA-NIC.
Key points: DLIS for ResNet-18 (INT8). The reference system was served with Triton Inference Server [7] on an Nvidia V100 GPU. The process node of the Nvidia V100 GPU is more advanced than that of our chip.
Our chip achieved about 6.1x better power efficiency than the GPU.

Evaluation 2: Tail Latency

Motivation: Evaluate unpredictable tail latency under heavy load.
Key points: Generally, P50 latency increases in proportion to the load. We check for unpredictable tail latency by evaluating the gap between P99 and P50 latency (see the sketch below). With the Triton inference server, this gap grows in proportion to the load; in our system, the gap between P50 and P99 latency is constant, and no unpredictable tail latency is observed.
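The P99 minus P50 criterion is easy to state in code. The sketch below computes the gap per load level over synthetic latency samples; the distributions are invented for illustration, and only the metric itself comes from the poster.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic per-request latencies (ms) at increasing load levels.
    loads = [100, 200, 400, 800]  # requests/s, illustrative only
    samples = {q: rng.gamma(shape=4.0, scale=0.5 * q / 100, size=10_000)
               for q in loads}

    for q in loads:
        p50, p99 = np.percentile(samples[q], [50, 99])
        # A P99 - P50 gap that stays constant across loads means the tail is
        # predictable; a gap that grows with load is the pathology observed
        # on the CPU/GPU serving system.
        print(f"load={q:4d} req/s  P50={p50:6.2f} ms  P99={p99:6.2f} ms  "
              f"gap={p99 - p50:6.2f} ms")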
Future Work

Low-latency inference: Our chip performs about 26x worse than the GPU system in terms of P50 latency. The clock speed is 7.15x slower than the V100 GPU's, the memory bandwidth is 23.4x narrower, and the matrix arithmetic unit is 30x smaller than the V100 GPU's. Because many of these performance limitations are due to the FPGA, future chips will likely improve performance.
VTA microarchitecture: The VTA microarchitecture needs modification in the future: parallel operation of the GEMM and ALU units; a cache for data reuse; and cooperative inference across multiple VTA cores to handle large models, which requires shared memory for the VTA cores (see the sketch below).
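One way to read "cooperative inference of multiple VTA cores" is to partition a large layer across cores through shared memory. The numpy sketch below splits a GEMM by output columns over eight simulated cores; it illustrates the partitioning idea only, not the chip's actual (future) scheme.

    import numpy as np

    NUM_CORES = 8

    def cooperative_gemm(x: np.ndarray, w: np.ndarray) -> np.ndarray:
        # Split W column-wise so each "core" computes one slice of the
        # output. With shared memory, every core would read the same
        # activations X; here the cores are simulated sequentially.
        col_slices = np.array_split(np.arange(w.shape[1]), NUM_CORES)
        out = np.empty((x.shape[0], w.shape[1]), dtype=np.int32)
        for cols in col_slices:  # each iteration stands in for one core
            out[:, cols] = x @ w[:, cols]
        return out

    x = np.random.randint(-128, 128, size=(16, 256)).astype(np.int32)    # INT8-range inputs
    w = np.random.randint(-128, 128, size=(256, 1024)).astype(np.int32)  # INT8-range weights
    assert np.array_equal(cooperative_gemm(x, w), x @ w)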
Related Work

NSDI '22, "Re-architecting Traffic Analysis with Neural Network Interface Cards" [8], is closely related work that demonstrates the advantages of BNN inference in the NIC: a small power overhead for inference in the NIC, and reduced tail latency. Our chip realizes these advantages, plus the following benefits:
- Inference can be executed on more general models, including BNNs.
- Serving functions and inter-service communication are integrated in hardware: no special communication protocols need to be installed, and it is easy to connect to other services.
- No special programming languages or compilers are needed, so it can benefit from the OSS ecosystem.

Conclusions

DLIS has been facing a challenge caused by redundant data movement in heterogeneous server architectures, which degrades the performance of both the accelerator and the serving system.
In this work, we proposed an architecture that enables DLIS on NICs:
- A multi-core DL engine that can serve data directly from network clients.
- TVM-VTA is used as the DL engine to exploit its large ecosystem.
- The circuit that bridges the TVM-VTA's instructions and the web API was integrated into a chip.
The advantages of our chip compared to conventional systems are 6.1x better power efficiency and no unpredictable tail latency under heavy load.
References

[1] Simple steps to create scalable processes to deploy ML models as microservices
[2] Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
[3] Solros: A data-centric operating system architecture for heterogeneous computing
[4] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
[5] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
[6] VTA: Versatile Tensor Accelerator
[7] Triton Inference Server
[8] Re-architecting Traffic Analysis with Neural Network Interface Cards