
Machine Learning Hardware: Considerations and Accelerator Approaches
ISSCC 2024 Short Course
(c) 2024 IEEE International Solid-State Circuits Conference

Introduction to Machine Learning Applications and Hardware-Aware Optimizations
Rangharajan Venkatesan, NVIDIA Corporation
February 2024

Outline
- Introduction to machine learning and deep neural networks
- Trends and challenges in hardware design
- Approaches to scaling single-chip performance
  - Quantization
  - Sparsity
- Scaling beyond a single chip with package-level integration
  - Efficient communication architecture
  - Exploiting parallelism

Artificial Intelligence (AI)
Artificial Intelligence: "The science and engineering of creating intelligent machines." - John McCarthy, 1956

Machine Learning (ML)
Machine Learning: "Field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel, 1959

Deep Learning (aka Deep Neural Networks)
Deep Learning: "Seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels." - Yoshua Bengio, 2012

Machine learning algorithms
"A computer program is said to learn from experience (E) with respect to some task (T) and some performance measure (P), if its performance on T, as measured by P, improves with experience E." - Tom Mitchell, 1998
Example: spam classification
- Task (T): Predict whether emails are spam or not spam.
- Experience (E): Observe users labeling emails as spam or not.
- Performance (P): Number of emails that are correctly predicted.
(Acknowledgement: Prof. Sophia Shao, UC Berkeley)

Machine learning algorithms
Most ML algorithms can be described with a simple recipe:
- A dataset -> Experience (E)
- A cost (loss) function -> Performance measure (P)
- A model + an optimization method -> Task (T)
Types of ML algorithms:
- Supervised: training dataset with a label or target; evaluated on a test dataset. Examples: classification, regression.
- Unsupervised: experience dataset without labels. Example: clustering.
- Reinforcement: not a fixed dataset; interact with an environment.
(Acknowledgement: Prof. Sophia Shao, UC Berkeley)

Deep neural networks
- "Deep": processing involves 10s to 100s of layers (Layer 1, Layer 2, ..., Layer N from input to output).
- Example of a simple layer, with input layer (X1, X2, X3), hidden layer, and output layer (Y1..Y4), weights W11..W34:

    Yj = activation( sum_{i=1..3} Wij * Xi )

(V. Sze et al., Synthesis Lectures on Computer Architecture, 2020)
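As a concrete illustration of the layer equation above, here is a minimal NumPy sketch; the sizes, values, and the choice of tanh as the activation are illustrative, not from the slides:

    import numpy as np

    def dense_layer(x, W, activation=np.tanh):
        # y_j = activation(sum_i W[i, j] * x[i])
        return activation(x @ W)

    x = np.array([0.5, -1.0, 2.0])   # 3 inputs: X1..X3
    W = np.random.randn(3, 4)        # weights W_ij for 4 outputs Y1..Y4
    y = dense_layer(x, W)            # 4 output activations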

Examples of DNN applications
Image classification (sheep / dog / cat), object detection, recommendation systems, text summarization, automatic speech recognition, image captioning.

Types of DNNs
- Convolutional Neural Network (CNN)
- Transformers / Large Language Models (LLMs)
- Graph Neural Network (GNN)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM) network
- Deep Belief Network (DBN)
- Generative Adversarial Network (GAN)
- Stable Diffusion (and more)

Basic structure of a CNN
(V. Sze et al., Synthesis Lectures on Computer Architecture, 2020)

Convolution
The core computation is a seven-deep loop nest over batch (N), output channels (M), output rows/columns (E, F), input channels (C), and filter rows/columns (R, S):

    for n = 0 : N
      for m = 0 : M
        for e = 0 : E
          for f = 0 : F
            for c = 0 : C
              for r = 0 : R
                for s = 0 : S
                  Out[e][f][m][n] += Weight[r][s][c][m] * In[e+r][f+s][c][n]

(V. Sze et al., Synthesis Lectures on Computer Architecture, 2020)
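A direct, unoptimized NumPy translation of this loop nest, assuming "valid" convolution so that E = H - R + 1 and F = W - S + 1 (array names and layout are illustrative):

    import numpy as np

    def conv_layer(inp, weight):
        # inp:    [H, W, C, N]  input activations
        # weight: [R, S, C, M]  filters
        H, W, C, N = inp.shape
        R, S, _, M = weight.shape
        E, F = H - R + 1, W - S + 1          # output spatial dims ('valid')
        out = np.zeros((E, F, M, N))
        for e in range(E):
            for f in range(F):
                # accumulate over R, S, C for all m, n at once
                patch = inp[e:e+R, f:f+S, :, :]              # [R, S, C, N]
                out[e, f] = np.einsum('rscn,rscm->mn', patch, weight)
        return out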

Activation functions
- Introduce non-linearity in the network.
(Figure: common activation layers in DNNs.)

Pooling
- Down-sampling operation that provides invariance against minor changes in the image, such as shifting, rotation, etc.
- Can be of different types: max pooling, average pooling.
- Max pooling example: 2x2 max pooling with stride = 2, e.g., Max(1, 2, 4, 6) = 6.
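A minimal NumPy sketch of 2x2 max pooling with stride 2 (assuming the input height and width are even; the example matrix is illustrative):

    import numpy as np

    def max_pool_2x2(x):
        # x: [H, W] -> output: [H//2, W//2], max over each 2x2 block
        H, W = x.shape
        return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    x = np.array([[1, 2, 0, 1],
                  [4, 6, 2, 3],
                  [3, 1, 8, 3],
                  [0, 2, 3, 4]])
    print(max_pool_2x2(x))   # [[6, 3], [3, 8]]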

Fully-connected layer
A matrix-vector multiplication: an M x C weight matrix times a C x 1 input vector produces an M x 1 output.

    for m = 0 : M
      for c = 0 : C
        Out[m] += Weight[m][c] * Input[c]

Example of a CNN: LeNet
- CNN for digit recognition: 3 convolutional layers, 2 subsampling/pooling layers, and 2 fully-connected layers.
- Classifies an input image to one of the 10 digits.
(Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998)

Example of a CNN: AlexNet
- CNN consisting of 5 convolutional layers, 3 pooling layers, and 3 fully-connected layers.
- Won the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.
- Showed that the depth of the model was essential for its high accuracy; the high computational expense was made feasible by using GPUs during training.
- A major landmark that spurred the development of many DNNs.
(A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," NeurIPS, 2012)

Large Language Models (LLMs)
- Deep learning models trained on vast amounts of data, targeting natural language processing (NLP) tasks.
- Key components:
  - Foundation model: a large model that serves as the foundation for specific use cases.
  - Alignment: fine-tuning and adapting the base model to perform different tasks.
- LLMs use a Transformer-based neural network architecture with a very large number of parameters.

What is Attention?
- To understand the meaning of one token, you often need the context of other tokens.
- The attention matrix specifies how much each token is relevant to each other token: with N tokens, the attention matrix is N x N.
- Dataflow: project the input X with weight matrices WQ, WK, WV to obtain Q, K, V; BMM1 computes Q x K^T; a softmax produces the attention matrix; BMM2 multiplies it with V to give Attn(X).
(J. Alammar, 2018, The Illustrated Transformer; A. Vaswani et al., "Attention is all you need," NeurIPS 2017)
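A minimal single-head NumPy sketch of the attention dataflow just described (BMM1, scale, softmax, BMM2); all names and sizes are illustrative:

    import numpy as np

    def attention(X, WQ, WK, WV):
        Q, K, V = X @ WQ, X @ WK, X @ WV           # MatMuls with weights
        S = Q @ K.T / np.sqrt(K.shape[1])          # BMM1 + scale: N x N scores
        P = np.exp(S - S.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # (safe) softmax
        return P @ V                               # BMM2

    N, d = 8, 16                                   # N tokens, model dim d
    X = np.random.randn(N, d)
    out = attention(X, *(np.random.randn(d, d) for _ in range(3)))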

Foundation model
- Term coined by the Stanford Institute for Human-Centered Artificial Intelligence in 2021.
- Training objective: predict the next word. For example, given the sentence "deep learning models execute efficiently on GPU", the model is trained to predict each token from the tokens that precede it: "learning" from "deep", "models" from "deep learning", and so on through the whole sequence.

Alignment
- Few-shot learning: using a few relevant training examples, the base model improves in that specific area.
    English: I live in California. Spanish: Yo vivo en California.
    English: I work at NVIDIA. Spanish: Yo trabajo en NVIDIA.
    English: I believe in science. Spanish: Yo creo en la ciencia.
  Few-shot needs a few examples.
- Zero-shot learning: the base LLM responds to a broad range of requests, typically prompts.
    Translate "I believe in science" from English to Spanish: Yo creo en la ciencia.
  Zero-shot needs no examples!

Alignment steps
(Source: https:/)

Structure of transformers
- A sequence of transformer layers; each transformer layer consists of a multi-head attention block and a feed-forward block, each followed by Add & Norm.
- Different types of computations:
  - MatMul with pre-trained weights: the Query/Key/Value projections (WQ, WK, WV), the output projection, and FC1/FC2 in the feed-forward block.
  - MatMul without weights: BMM1 (Q x K^T) and BMM2 (attention x V) inside multi-head attention.
  - Post-processing / PPU ops: Scale, Softmax, ReLU, Add & Norm.

Softmax computation
- Key non-linear computation found in attention layers.
- Safe softmax prevents overflow in the exponent computation:

    m = max_j x_j
    softmax(x_i) = exp(x_i - m) / sum_j exp(x_j - m)

- Expensive in hardware: repetitive memory accesses and complex compute using the MUFU (multi-function unit). It takes one pass through the vector to calculate the max, one pass to compute the exponentials and their sum, and one pass to normalize.
(M. Milakov et al., "Online normalizer calculation for softmax," arXiv 2018)
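A minimal NumPy sketch contrasting the three-pass safe softmax with Milakov's single-pass online normalizer (variable names are illustrative):

    import numpy as np

    def safe_softmax(x):
        m = np.max(x)                 # pass 1: max
        e = np.exp(x - m)             # pass 2: exponentials and sum
        return e / np.sum(e)          # pass 3: normalize

    def online_softmax(x):
        # fused max + running sum in a single pass (Milakov et al., 2018)
        m, d = -np.inf, 0.0
        for xi in x:
            m_new = max(m, xi)
            d = d * np.exp(m - m_new) + np.exp(xi - m_new)
            m = m_new
        return np.exp(x - m) / d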

Outline (next: Trends and challenges in hardware design)

Growth in application complexity
(S. Bianco et al., "Benchmark Analysis of Representative Deep Neural Network Architectures," IEEE Access, 2018)

Neural network scalability
- For a given compute budget, what is the size of the model to be trained to achieve the best accuracy?
- Tradeoff between model and dataset size.
- In 2020, OpenAI showed that the optimal model size grows quickly.
(J. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv 2020)

Neural network scalability
- In 2022, DeepMind revisited this problem and showed that dataset size should scale with model size (IsoFLOP curves).
- Compute scales quadratically with model size.
(J. Hoffmann et al., "Training Compute-Optimal Large Language Models," arXiv 2022)

Hardware performance challenges
- End of Moore's Law; increasing cost.
(J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 6th edition, 2018; S. Naffziger et al., ISSCC 2020)

Energy efficiency challenge
- Energy efficiency of supercomputers: NVIDIA H100 achieves 65.40 GFLOPS/W.
- Reaching 1 ZettaFLOPS within a 400 MW power budget would require 2600 GFLOPS/W.
(Figure: GFLOPS/W of Green500 supercomputers vs. year. https://www.top500.org/lists/green500/)

Different evaluation metrics
- Accuracy: % predicted correctly, Top-1/Top-5 error, perplexity.
- Performance (throughput, latency): inferences/sec, FLOPS, TOPS, delay.
- Energy efficiency: energy/inference, FLOPS/W, TOPS/W.
- Area efficiency: inferences/sec/mm2, FLOPS/mm2, TOPS/mm2.
- Flexibility: support for different types of neural networks and layers.

Outline (next: Approaches to scaling single-chip performance)

Hardware specialization
- Fixed-function accelerators: customized hardware accelerators to support a class of neural networks.
- Programmable processors: programmable hardware with support for scalar and vector math functions.
- Reconfigurable FPGAs: leverage the reconfigurability of an FPGA to accelerate a specific neural network.

Many DNN accelerators exist
- Different platforms, different power targets, and a wide range of performance.
(Source: https:/)

Outline (next: Quantization)

Quantization
- Deep learning is inherently probabilistic and often over-parameterized: a good use case for approximation using precision scaling with lower-precision data formats (e.g., FP16, FP8, INT8).
- Benefits of lower-precision data formats:
  - Accelerate math using a higher-throughput math pipeline.
  - Reduce memory traffic with fewer bits per value; less on-chip storage is needed.
  - Save energy from data movement.
- Efficiency vs. accuracy tradeoff: efficiency comes from approximation, but approximation generally hurts model accuracy.

DNN quantization
- Convert a high-precision floating-point model to a low-precision data format: accelerate math using a higher-throughput math pipeline; reduce memory traffic and on-chip storage.
- Accelerate math: convert convolutions and matrix multiplications to integer to use tensor cores.
- Add Quantize (Q) and Dequantize (DQ) operations to the neural network graph.
(P. Judd, "Integer Quantization for DNN Inference Acceleration," GTC, 2020)

DNN quantization
- Reduce the memory requirement: read and write integer tensors.
- Fuse operations to have integer inputs and outputs.
(P. Judd, "Integer Quantization for DNN Inference Acceleration," GTC, 2020)

Quantization design choices
- Range of floating-point values to be represented.
- Range of integer values to map to.
- Type of mapping (scale-only vs. affine).
- Scaling granularity (per-tensor, per-channel, etc.).

Approaches
- Post-Training Quantization (PTQ): directly quantize a pre-trained full-precision model for inference.
  - Pros: does not require the complete set of training data; turn-key approach with no finetuning and little hyperparameter tuning.
  - Cons: accuracy loss is more significant.
- Quantization-Aware Training (QAT): finetune a pre-trained full-precision model with quantization in the loop during training.
  - Pros: significantly reduces the accuracy loss from quantization; enables more aggressive reduction of precision.
  - Cons: requires the complete set of training data; significant effort in retraining and hyperparameter tuning.

Uniform symmetric quantization
- Choose a symmetric range of real values [-alpha, +alpha], calibrated based on tensor statistics (maximum or clipped range).
- Identify a symmetric range of integer values to map to: for B bits, [-(2^(B-1) - 1), 2^(B-1) - 1].
- Determine the scale factor for the real-to-integer mapping:

    s = (2^(B-1) - 1) / alpha

- Map a real value x to an integer uniformly by clipping, scaling, and rounding:

    x_q = round( s * clip(x, -alpha, +alpha) )

- Questions:
  - Representable range: how do we calibrate the representable range?
  - Scaling granularity: at what granularity do we apply different scale factors?
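A minimal NumPy sketch of this mapping and its inverse; the slides define only the math, so names and the max-calibration choice below are illustrative:

    import numpy as np

    def quantize(x, alpha, bits=8):
        # s = (2^(B-1) - 1) / alpha; x_q = round(s * clip(x, -alpha, +alpha))
        qmax = 2 ** (bits - 1) - 1
        s = qmax / alpha
        return np.round(s * np.clip(x, -alpha, alpha)).astype(np.int32), s

    def dequantize(x_q, s):
        return x_q / s

    x = np.random.randn(1024)
    x_q, s = quantize(x, alpha=np.max(np.abs(x)), bits=8)   # max calibration
    mse = np.mean((x - dequantize(x_q, s)) ** 2)            # quantization error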

Representable range: Calibration
- Collecting statistics: observe the data distributions of the tensors being quantized.
  - Pre-trained weights: analyze the histogram directly.
  - Activations: collect histograms by feeding examples to produce activations; 100-1000 examples are often sufficient. Use the training set; don't overfit the ranges on the validation data.
- Choose the range: trade off range and precision.
  - Max range: quantize the full range of observed values. Represents outliers, but gives low resolution for inliers.
  - Clipped range: clips outliers and increases resolution for inliers. There are multiple ways to perform the clipping.
  - Percentile calibration: explicitly control how many values are clipped, e.g., the 99th percentile clips 1% of the values.
(Figure: real values in [-alpha, +alpha] map to integers in [-(2^(B-1) - 1), +(2^(B-1) - 1)].)
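A percentile-calibration sketch in NumPy (illustrative; 99.9 is just an example value, and the resulting alpha can feed the quantize() sketch above):

    import numpy as np

    def percentile_alpha(x, pct=99.9):
        # choose alpha so that roughly (100 - pct)% of |x| values are clipped
        return np.percentile(np.abs(x), pct)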

Scaling granularity: Per-tensor
- One floating-point scale factor for the input activation tensor and one for the weight tensor, calibrated based on statistics of the entire tensor (real FP32 data distribution).
- Per-tensor: sizable quantization noise, because alpha is large when computed over the entire tensor.

Scaling granularity: Per-channel
- One floating-point scale factor for the input activation tensor; K floating-point scale factors for the weight tensor (one per output channel), calibrated based on statistics of a sub-tensor.
- Post-training quantization accuracy, FP32 vs. INT8 (H. Wu et al., "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation," arXiv 2020):

    Model            FP32   INT8 (PTQ)
    MobileNet v2     71.9   71.1 (-0.8)
    ResNet50 v1.5    76.2   76.0 (-0.2)
    Inception v4     79.7   79.6 (-0.1)
    ResNeXt101       79.3   79.2 (-0.1)
    EfficientNet b3  81.6   80.3 (-1.3)
    Mask R-CNN       37.9   37.7 (-0.2)
    DeepLabV3        67.4   67.5 (+0.1)
    GNMT             24.3   24.4 (+0.1)
    Transformer      28.3   27.7 (-0.6)
    Jasper            3.9    3.9 (-0.0)
    BERT Large       91.0   90.2 (-0.8)
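A per-channel weight-scale sketch in NumPy (the [K, C, R, S] axis convention and max calibration are illustrative assumptions):

    import numpy as np

    def per_channel_scales(w, bits=8):
        # w: [K, C, R, S]; one scale per output channel K (max calibration)
        alpha = np.abs(w).max(axis=(1, 2, 3))          # [K]
        return (2 ** (bits - 1) - 1) / alpha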

VS-Quant: Per-vector scaling
- Per-channel: sizable quantization noise. Per-vector: small quantization noise, since each scale factor captures local statistics (one scale per vector of V elements along the input-channel dimension).
- Cost: K x R x S x ceil(C/V) floating-point scale factors for the weights: excessive compute and memory overhead.
(S. Dai et al., "VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference", MLSys 2021)

Two-level scaling
- Start with one-level per-vector quantization.
- Quantize the per-vector scale factor itself with per-channel quantization, giving quantization with two levels of scale factors: a low-precision integer scale per vector and a floating-point scale per channel.
- Assuming N = M = 4 and V = 16: 1/16 = 6.25% storage overhead, and a 12-bit x 8-bit multiplication per vector.
(S. Dai et al., "VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference", MLSys 2021)
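A sketch of two-level scaling for one weight vector. Assumptions: 4-bit data, an 8-bit unsigned integer per-vector scale, and a floating-point per-channel scale; this follows the MLSys 2021 scheme only loosely and is not the paper's implementation:

    import numpy as np

    def vsq_two_level(w_vec, s_channel, data_bits=4, scale_bits=8):
        # level 1: per-vector scale for this V-element vector
        s_vec = np.abs(w_vec).max() / (2 ** (data_bits - 1) - 1)
        # level 2: quantize s_vec itself against the per-channel scale
        s_q = np.clip(np.round(s_vec / s_channel), 1, 2 ** scale_bits - 1)
        qmax = 2 ** (data_bits - 1) - 1
        w_q = np.clip(np.round(w_vec / (s_q * s_channel)), -qmax, qmax)
        return w_q, s_q   # reconstruct: w ~ w_q * s_q * s_channel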

VS-Quant: simulated results
- Each point is a synthesized hardware instance: various weight/activation precisions, different weight/activation scale precisions, and different scaling granularities.
- Area efficiency vs. energy efficiency for each point, normalized to the 8-bit baseline; colors/shapes represent accuracy ranges.
- Only points with acceptable accuracy (>= 74.0%) are plotted; top accuracy represents close-to-floating-point accuracy.
- Labels: weight width / activation width / weight scale width / activation scale width; "-" indicates per-channel scaling. (*The amount of scale rounding varies among design points.)
(S. Dai et al., "VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference", MLSys 2021)

Effect of clipping on quantization
- Mean squared error, MSE = E[(x - Q(x))^2], as a metric for quantization error; correlated to task accuracy (Sakr et al., ICML 2017).
- Analytical, for clipping scalar s and bitwidth B:

    J(s) = (4^(-B) / 3) * s^2 * E[ 1_{|x| <= s} ] + E[ (|x| - s)^2 * 1_{|x| > s} ]

- Empirical: averaged over tensor entries after clipping and quantization.
- There exists an optimal choice of clipping scalar s* as a function of bitwidth, confirmed by both analytical and empirical results: accurate inference/training requires the optimal clipping scalar.
(C. Sakr et al., "Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training", ICML 2022)

OCTAV: Optimally Clipped Tensors & Vectors
- Fast minimization of clipped-quantization MSE: a fast recursive algorithm based on the Newton-Raphson method to determine MSE-minimizing clipping scalars.
- Can quickly compute the optimal quantization scalar for every tensor at every iteration of inference or training; each iteration is formulated with minimum quantization noise to improve accuracy.
- Why is OCTAV fast? Analytical, with no histogram or brute-force search: it only requires where, abs, and mean operations, and Newton-Raphson usually converges very fast (in fewer than 10 iterations).
(C. Sakr et al., "Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training", ICML 2022)
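A sketch of the Newton-Raphson recursion on the clipped-MSE objective above; the fixed-point form follows the ICML 2022 paper, but treat this as an illustration rather than the reference implementation:

    import numpy as np

    def octav_clipping_scalar(x, bits=4, iters=10):
        a = np.abs(x)
        s = a.mean()                          # any positive initialization
        for _ in range(iters):
            clipped = a > s                   # entries with |x| > s
            num = np.where(clipped, a, 0.0).mean()
            den = (4.0 ** -bits) / 3.0 * (~clipped).mean() + clipped.mean()
            s = num / den                     # Newton-Raphson update
        return s

Note that, as the slide says, the loop uses only where, abs, and mean.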

OCTAV finetuning results
- OCTAV can be applied to both dynamic and static scaling:
  - Dynamic scaling: compute scale factors dynamically during inference/training.
  - Static scaling: compute scale factors statically prior to inference/training and fix them throughout the process.

    Network                  BERT-large on SQuAD                  BERT-base on SQuAD
    Baseline FP accuracy     91.00                                88.24
    Bitwidth                 4     5     6     7     8            4     5     6     7     8
    Dynamic quantization
      OCTAV                  87.09 89.77 90.51 90.81 90.78        84.51 86.30 87.43 88.28 88.34
      Max-scaled              6.92 80.06 87.71 90.04 90.48        11.51 78.97 85.17 87.46 88.01
    Static quantization
      OCTAV                  87.08 89.54 90.60 90.79 90.61        83.60 85.82 87.14 87.67 88.02
      MSE sweep              85.54 89.77 90.39 90.80 90.55        81.82 84.16 87.14 87.68 87.97
      99.9th perc.           86.98 89.79 89.99 90.07 90.11        81.06 85.78 86.73 86.84 87.34
      99.99th perc.           6.90 87.63 90.38 90.79 90.33        67.90 83.20 86.78 87.60 87.94
      99.999th perc.          4.56  5.66 89.76 90.44 90.83        26.85 82.15 86.27 87.51 88.08

(C. Sakr et al., "Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training", ICML 2022)

Quantization-aware training
- Train with quantization in the loop to recover the accuracy loss from low-precision quantization.
- There are a few problems to consider:
  - Stochastic gradient descent (SGD) prefers floating-point weights and gradients, to allow weights to accumulate small updates from many small gradients.
  - The quantization function is non-differentiable.
(P. Judd, "Integer Quantization for DNN Inference Acceleration," GTC, 2020)

Quantization-aware training
- Problem: SGD prefers floating-point weights and gradients.
- Solution: fake quantization.
  - Combine quantize and de-quantize (FakeQuant): takes floating-point inputs and generates discrete floating-point outputs.
  - The math remains equivalent because Conv/MatMul are linear operations.
(P. Judd, "Integer Quantization for DNN Inference Acceleration," GTC, 2020)

Quantization-aware training
- Problem: the quantization function is non-differentiable.
- Solution: straight-through estimation (STE), Bengio et al., arXiv 2013.
  - Approximates the quantization function as an identity: defines the derivative of the quantization function to be 1, i.e., dQ(x)/dx = 1.
  - Lets the gradient propagate straight through the quantizer.
(Y. Bengio et al., "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation", arXiv, 2013)
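A hedged PyTorch sketch of fake quantization with a straight-through estimator; it is illustrative, not the GTC implementation:

    import torch

    class FakeQuant(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, s, qmax):
            # quantize + dequantize: FP in, discrete FP out
            return torch.clamp(torch.round(x * s), -qmax, qmax) / s

        @staticmethod
        def backward(ctx, grad_out):
            # STE: treat round/clamp as identity, so dQ(x)/dx = 1
            return grad_out, None, None

    x = torch.randn(16, requires_grad=True)
    y = FakeQuant.apply(x, torch.tensor(7.0), 7)   # ~INT4 symmetric range
    y.sum().backward()                             # gradients flow straight through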

Quantizing softmax
- Replace the FP32 exp computation with a fixed-point power of 2.
- The online normalizer eliminates the need for an explicit max pass.
- Use IntMax (an integer running max) along with the power of 2 to convert renormalization into shifts.
- The three variants (reconstructed from the slide's pseudocode):

    # 1) Safe softmax: separate passes for max, sum, and normalize
    m = max_j(x_j);  d = sum_j exp(x_j - m);  y_i = exp(x_i - m) / d

    # 2) Online normalizer: fused max + sum in a single pass (base-2 form)
    m_0 = -inf; d_0 = 0
    for j in 1..V:
        m_j = max(m_{j-1}, x_j)
        d_j = d_{j-1} * 2^(m_{j-1} - m_j) + 2^(x_j - m_j)
    y_i = 2^(x_i - m_V) / d_V

    # 3) Softermax: with an integer max, 2^(m_{j-1} - m_j) becomes a right shift

(J. Stevens et al., "Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers", DAC 2021)

Hardware-aware training workflow
- Hardware-aware re-training to recover accuracy loss: model the hardware optimizations in the forward pass, so that training optimizes the loss while taking those hardware optimizations into account.

Fabricated inference accelerator testchip
- Per-vector scaled quantization (VSQ): 4-bit precision with 8-bit per-vector scale factors.
- Hardware-friendly softmax: reduces data movement and hardware cost.
- Results:
  - Energy efficiency (0.46 V): 95.6 TOPS/W (INT4 VSQ), 39.1 TOPS/W (INT8).
  - Area efficiency (1.05 V): 23.3 TOPS/mm2 (INT4 VSQ), 11.7 TOPS/mm2 (INT8).
(B. Keller et al., "A 17-95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm", VLSI 2022)

Silicon measurement results
- 4-bit VSQ achieves similar accuracy to 8-bit, with a 2.3X energy efficiency gain.
- Without VSQ, 4-bit results in an unacceptable loss in accuracy.
(B. Keller et al., "A 17-95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm", VLSI 2022)

Multiple precision support in GPUs
- Ampere A100 and Hopper H100 dense performance results; 2X higher performance with structured sparsity.
(Ref: NVIDIA H100 Tensor Core GPU Architecture Overview)

    Format               A100    H100    (TFLOPS; TOPS for INT8)
    FP8 Tensor core      NA      1978.9
    FP16                 78      133.8
    FP16 Tensor core     312     989.4
    BF16 Tensor core     312     989.4
    FP32                 19.5    66.9
    TF32 Tensor core     156     494.7
    FP64                 9.7     33.5
    FP64 Tensor core     19.5    66.9
    INT8 Tensor core     624     1978.9

Benefits of quantization
- Peak efficiency results of different accelerator designs.
(Source: https:/)

Outline (next: Sparsity)

Sparsity
- Neural networks exhibit a high degree of sparsity: weights/connections are sparse, and activations at intermediate stages are sparse.
(S. Han et al., "Learning both Weights and Connections for Efficient Neural Network", NeurIPS 2015)

Unstructured sparsity
- Sparse CNN (SCNN): only compute partial products where both operands are non-zero.
- Get rid of the idea of a sliding convolution: it doesn't make sense when most of the operands are 0.
- Vector ops are questionable: most elements of your vector are 0, and you don't know a priori which ones or how many.
(A. Parashar et al., "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", ISCA 2017)

Unstructured sparsity
- Sparse CNN (SCNN) design:
  - Weights and input activations stored in sparse format.
  - Output coordinates computed from input indices.
  - All-to-all multiplier array.
  - Accumulator banks storing partial sums in dense format.
  - Post-processing to compress output activations.
(A. Parashar et al., "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", ISCA 2017)

Unstructured sparsity
- Simulation results of SCNN in a 16nm technology node: SCNN achieves a 2.3X improvement in energy efficiency over a dense CNN (DCNN) accelerator.
(A. Parashar et al., "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", ISCA 2017)

Structured sparsity
- NVIDIA Ampere A100: 2:4 structured sparsity (two non-zeros in each group of four weights).
(Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper)
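A small NumPy sketch of magnitude-pruning weights to the A100-style 2:4 pattern (illustrative; assumes the last dimension's total size is divisible by 4):

    import numpy as np

    def prune_2_to_4(w):
        # keep the 2 largest-magnitude entries per group of 4, zero the rest
        g = w.reshape(-1, 4).copy()
        idx = np.argsort(np.abs(g), axis=1)[:, :2]      # 2 smallest magnitudes
        np.put_along_axis(g, idx, 0.0, axis=1)
        return g.reshape(w.shape)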

Sparsity in transformers/LLMs
- Self-attention layers in transformers exhibit a lot of sparsity: not every token needs to know the context of every other token.
- Sparsity patterns differ across models (e.g., BERT-base with sequence length 384; GPT-2 with sequence length 1024).
(S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023)

Sparsity patterns in LLMs
- Use a mask to exploit the sparsity patterns; masks can be hand-designed or trained.
(H. Shi et al., "SparseBERT: Rethinking the Importance Analysis in Self-Attention," ICML (PMLR), 2021)

Sparsity patterns in LLMs
- Three separate regions in the attention mask:
  - Diagonal matrix: how much attention tokens pay to other, nearby tokens carrying "local" information.
  - Left rectangular matrix: how much attention is paid to tokens carrying "global" information.
  - Upper rectangular matrix: how much attention "global" tokens pay to other tokens.
(S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023)

Sparsity patterns in LLMs
- Sparsity results with fine-tuning:

    Model       Task   Seq-Len  Sparsity  Accuracy            Accuracy Loss
    BERT-Base   SQuAD  384      65%       87.8%               0.1%
    BERT-Base   SQuAD  384      79%       87.2%               0.7%
    BERT-Base   SQuAD  384      85%       86.4%               1.5%
    BERT-Large  SQuAD  384      63%       90.4%               0.4%
    BERT-Large  SQuAD  384      90%       88.8%               2.0%
    GPT-2       Wiki2  1024     59%       21.9 (perplexity)   0.2
    GPT-2       Wiki2  1024     65%       21.8 (perplexity)   0.1

(S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023)

Sparsity patterns in LLMs
- Sparse hardware design: inspired by NVDLA, but with support for the sparsity mask.
- New expander unit: duplicates weights as they are passed into the Vector MAC unit; it reads from a metadata buffer containing the duplicated row IDs.
(S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023)

Sparsity patterns in LLMs
- Hardware simulation results (TSMC 5nm):
  - Area overhead: adds less than 3% area.
  - Energy reduction: BMM1: 4%-73%; BMM2: 16%-88%; average: 57%.
  - Latency reduction: BMM1: 4%-71%; BMM2: 17%-89%; average: 59%.
(Figure: percentage reduction in BMM1/BMM2 energy and latency per layer, BERT-base on SQuAD.)
(S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023)

Single-chip inference performance
- 1000X in 10 years: Kepler (FP32), Pascal (FP16), Turing (INT8), Ampere (INT4), Hopper (sparse FP8).
(Figure: single-chip inference performance in TOPS by GPU generation, 2012-2023.)

Outline (next: Scaling beyond a single chip with package-level integration)

Promise of package-level integration
- Scale beyond monolithic single-chip performance: overcome reticle limits; integrate multiple homogeneous chips.
- Heterogeneous integration: specialized chips with complementary functionalities; could mix process technologies as well.
- Target multiple product SKUs: vary the number of chips to meet different performance targets, with different compute capability and memory capacity.

Challenges
- Data movement: efficient intra-chip and inter-chip communication.
- Scalability: exploit parallelism in computation to scale across multiple chips.

Outline (next: Efficient communication architecture)

Systolic array
- Input activations from different input channels (C) flow into a grid of PEs; partial sums for different output channels (M) flow out.
(V. Sze et al., Synthesis Lectures on Computer Architecture, 2020)

Mesh network-on-chip
- Different packet sizes for different data types.
- Unicast and multicast support.
- Flexible routing protocols.

Hierarchical network
- Efficient communication for large-scale designs: a Network-on-Chip (NoC) within each chip plus a Network-on-Package (NoP) across chips.
- Reduces the number of hops and reduces congestion.
(R. Venkatesan et al., "A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology", HotChips 2019)

Chip-to-chip links
- Ground-Referenced Signaling (GRS) for energy-efficient inter-chip communication:
  - High speed: 11-25 Gbps per pin.
  - High energy efficiency: low voltage swing (200 mV); 0.82-1.75 pJ/bit.
  - High area efficiency: single-ended links; 4 data bumps + 1 clock bump per GRS link.
- Key idea behind NVLink-C2C in Grace Superchips.
(J. Poulton et al., JSSC 2019; Y. Wei et al., ISSCC 2023)

Outline (next: Exploiting parallelism)

Model parallelism
- Exploit parallelism across the weights in a layer.
- Example: an architecture implementing a weight-stationary dataflow. Tile the weights and distribute them to different PEs (PE0..PE3); compute different output activations by streaming in the input activations. A minimal sketch follows the Pipelining slide below.

Data parallelism
- Exploit parallelism across the activations in a layer.
- Example: an architecture implementing an input-stationary dataflow. Tile the input activations and map them to different PEs; each PE computes different output activations by streaming in the weights.

Pipelining
- Parallelism across the layers of the network: execute one or more layers across different groups of processing elements (e.g., layers 1-7 mapped onto pipeline stages Pipe-1 through Pipe-4).
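As promised above, a toy NumPy sketch of the model-parallel split: the weight matrix is tiled across PEs and each PE produces a slice of the outputs (all names and sizes are illustrative):

    import numpy as np

    def model_parallel_matvec(W, x, num_pes=4):
        # tile weights by output rows: each PE holds one tile (weight-stationary)
        tiles = np.array_split(W, num_pes, axis=0)
        # each PE sees the full streamed input and computes its output slice
        return np.concatenate([t @ x for t in tiles])

    W, x = np.random.randn(8, 16), np.random.randn(16)
    assert np.allclose(model_parallel_matvec(W, x), W @ x)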

MCM-based accelerator testchip
- Wide performance range: 4-128 TOPS.
- Ground-Referenced Signaling (GRS) as the MCM interconnect: 100 Gbps between chips in a mesh.
- Hierarchical communication architecture: Network-on-Package (NoP) + Network-on-Chip (NoC).
(B. Zimmer et al., VLSI 2019; R. Venkatesan et al., HotChips 2019)

Performance scalability
- Weak scaling with DriveNet and strong scaling with ResNet-50 (voltage: 1.1 V).
(R. Venkatesan et al., "A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology", HotChips 2019)

Summary
- Deep learning is a key driver for future hardware designs: diverse applications (CNNs, LLMs, and many more), with compute and memory demands that continue to scale at a rapid pace.
- Custom hardware accelerators enable scaling of single-chip performance; hardware-software co-design techniques are highly promising, and quantization and sparsity offer interesting opportunities.
- Package-level integration enables scaling beyond a single chip, overcoming reticle limits; an efficient communication architecture and exploiting parallelism are key to scalability.

References
- NVIDIA H100 Tensor Core GPU Architecture Overview
- NVIDIA A100 Tensor Core GPU Architecture whitepaper
- V. Sze et al., Synthesis Lectures on Computer Architecture, 2020
- B. Zimmer et al., "A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm," VLSI 2019
- S. Dai et al., "VS-QUANT: Per-Vector Scaled Quantization for Accurate Low-Precision Neural Network Inference," MLSys 2021
- A. Parashar et al., "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks," ISCA 2017
- P. Judd, "Integer Quantization for DNN Inference Acceleration," GTC, 2020
- M. Milakov et al., "Online normalizer calculation for softmax," arXiv 2018
- H. Wu et al., "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation," arXiv 2020
- R. Venkatesan et al., "A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology," HotChips 2019
- B. Keller et al., "A 17-95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm," VLSI 2022
- J. Stevens et al., "Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers," DAC 2021
- S. Dai et al., "Efficient Transformer Inference with Statically Structured Sparse Attention," DAC 2023
- J. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv 2020
- C. Sakr et al., "Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training," ICML 2022


Architecture and Design Approaches to ML Hardware Acceleration: Performance Compute Environment
Leland Chang, IBM T.J. Watson Research Center
ISSCC 2024 Short Course

Outline
- Landscape: LLMs and Generative AI
- Specific considerations for high-performance AI accelerators
  - Broad model/use case support
  - A system-level optimization
- Roadmap: Compute efficiency
  - Quantization vs. sparsity
  - Power management
  - Mixed-signal/analog computation
- Roadmap: Communication bandwidth
  - Within core -> core-to-core -> DRAM -> accelerator-to-accelerator
- Summary

AI Is Capturing the Imagination
- Tremendous excitement over AI's potential: transformative impact across industries; what innovation is to come.
- Debates over AI impact/guardrails: ethics, misinformation, data ownership, government regulation; productivity/societal benefits vs. risks.

AI Foundation Models
Foundation models: an inflection point in generalizable and adaptable representations.
- Expert systems (1980s): hand-crafted symbolic representations.
- Machine learning (1980s): task-specific hand-crafted feature representations.
- Deep learning (2012): task-specific learnt feature representations; big data (massive labeled data) + compute.
- Foundation models (2018+): generalizable and adaptable learnt representations; self-supervision at scale + massive unlabeled data + compute.

Foundation Model Workflow
A use model that amortizes the significant training cost across end applications: large pretrained models are fine-tuned for separate downstream tasks.
- Phases: "Pre-Training" -> "Fine-Tuning" -> "Inference".
- Pre-training: a long-running job on massive infrastructure, with distributed training and model validation; data preparation (e.g., remove hate and profanity, deduplicate, etc.).
- Fine-tuning: model adaptation; model tuning with a custom data set for downstream tasks.
- Inference: may have sensitivity to latency/throughput; always cost-sensitive.

Large Language Models (LLMs): Transformers
- Encoder (e.g., BERT), decoder (e.g., GPT, LLaMA), and encoder-decoder architectures.
- Self-attention mechanism: the model considers the relative importance of each input in a sequence; it can be computed in parallel -> maps well to GPU architectures.
- AI accelerator optimization is now focusing on transformer models and their unique microarchitecture features.
(Vaswani, NeurIPS '17)

Explosive Growth in AI Model Size
- No end in sight! -> Drives compute and memory needs in hardware.
- Up to 100B parameters (!), but even 100M-parameter models are quite capable.
- The target can depend on the use case: data center vs. edge application, etc.
- Trainable parameters over time: AlexNet 60M (2012), BERT-Large 340M (2018), GPT-3 175B (2020), LLaMA-2 7B/13B/70B (2023).
(https://epochai.org/mlinputs/visualization)

Putting It All Together: IBM watsonx Example
- watsonx.ai: build and deploy AI apps with ease. "Build with our new studio for foundation models, generative AI and machine learning. With watsonx.ai, you can train, validate, tune and deploy foundation and machine learning models with ease."
- watsonx.data: scale AI workloads from one data store. "Scale analytics and AI workloads for all your data, anywhere with watsonx.data, the industry's only data store that is open, hybrid and governed."
- watsonx.governance: monitor and govern the entire AI lifecycle. "Accelerate responsibility, transparency and explainability in your data and AI workflows with watsonx.governance. This solution helps you direct, manage and monitor your organization's AI activities."

Outline (next: Specific considerations for high-performance AI accelerators)

Performance Compute Environments
- Not form-factor-constrained (vs. edge) -> lots of compute and power available.
- A TDP (thermal design power)-constrained environment: high compute density is important.
- Form factors range from PCIe card to server node to data center rack.
- Software (compiler/framework support) is key! Application developers should not be adversely impacted by hardware changes.

Broad Use Case Support
- A wide range of models:
  - Traditional ML (regression, k-nearest neighbor, support vector machines, ...): remains important; may be well-handled by CPUs within the data center.
  - Mature, capable AI models (multi-layer perceptron, CNN, ...): acceleration needed.
  - Large language models (LLMs): multi-accelerator operation is essential.
- Must support end-to-end AI workloads, including the non-AI portions! But can leverage CPUs for input pre-processing, new/unsupported AI operations, etc.
- Model pre-training vs. fine-tuning vs. inference: for the sake of time, we will focus the circuit-centric discussion on inference, but first a few charts to set the context.

AI Training vs. Inference
- Similar underlying math constructs: inference is a subset of training. But the use cases are quite different and drive different hardware/systems.

    Aspect            Training                  Inference
    Compute phases    Forward/Backward/Update   Forward
    Batch size        Large                     Large or small
    Performance       Throughput                Latency and throughput
    Memory footprint  Large (activations)       Small(er)
    System            Distributed               Single device

Pre-Training vs. Fine-Tuning
Several classes of techniques apply foundation models to downstream tasks, driving a range of compute/communication needs and system form factors (number of devices):
- Pre-training: train the base foundation model. Weights to train: complete model; training epochs: many; training data: full training set; parallel training devices: a lot! (up to 1000+); inference input: prompt.
- Full fine-tuning: adapt the foundation model to a downstream task. Weights to train: complete model; epochs: few; data: small, task-specific; devices: few; inference input: prompt.
- Prompt tuning: train new weights to encode the prompt. Weights to train: small model to encode the prompt; epochs: few; data: small, task-specific; devices: few; inference input: encoded prompt.
- Prompt engineering: modify the input prompt to improve the output. No weights to train, no training epochs, no training data, no parallel training devices; inference input: engineered prompt.

LoRA: Low-Rank Adaptation of LLMs
- Hypothesis: weight updates for fine-tuning are of low intrinsic rank, so they can be represented by (much!) smaller matrices A and B of low rank r (r can be 1 or 2).
- Technique: train A and B only, i.e., 2*d*r weights (vs. d^2 in the full model). During inference, use the merged weights:

    h = W'x = (W + dW) x = (W + BA) x

- Notation: x: input; h: output; d: full matrix dimension; r: rank (can be 1 or 2). Matrix dimensions: W: d x d; B: d x r; A: r x d.
- Hardware impact: "fine-tuning" can be achieved with significantly less compute/communication vs. full fine-tuning.
(Hu, arXiv '21)
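A toy NumPy sketch of the LoRA update, with shapes following the d, r notation above (names, sizes, and initializations are illustrative):

    import numpy as np

    d, r = 512, 2                        # full dim, low rank
    W = np.random.randn(d, d)            # frozen pretrained weights
    B = np.zeros((d, r))                 # trainable, initialized to zero
    A = np.random.randn(r, d) * 0.01     # trainable

    x = np.random.randn(d)
    h = W @ x + B @ (A @ x)              # fine-tuned forward pass
    W_merged = W + B @ A                 # merge for inference: h = W_merged @ x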

RAG: Retrieval-Augmented Generation
- Combine prompt engineering with database retrieval to improve LLM performance and adapt to new, updated data that was not in the pre-training dataset.
- Hardware impact: the AI system must also handle databases!
(Lewis, NeurIPS '20; Gao, arXiv '23)

Outline (next: A system-level optimization)

Accelerators in High Performance Systems
- Challenges: work partitioning/latency + moving data/memory management.
- Software is easier with tight coupling between processor and accelerator; hardware is easier with loose coupling.
- AI = a lot of work to do: focus on attached accelerators.
- Four accelerator attachment-point options (FPGA, RL, fixed-function), from tight to loose coupling: on-processor (on-chip logic), on the memory bus (separate on-node chip), on the PCIe bus, or off-node over the network.
(Figure: attachment options around a processor module with CPU cores, memory controllers, and PCIe/adapter switches.)

Distributed Systems for Inference and Training
- Very large AI models require many accelerators working in parallel.
- Tensor-parallel inference: processing different tensor pieces on parallel devices while minimizing communication.
- Fully-Sharded Data Parallel (FSDP) training: processing different model shards in parallel while minimizing communication.
(https://huggingface.co/transformers/v4.10.1/parallelism.html; https:/)

A System-Wide Optimization
- Hierarchy: a PE (processing element) with register file and multiply-and-accumulate -> a PE array (rows x columns, plus SFP) -> a core with a local memory hierarchy (L0-X/L0-Y, Lx memory) -> a chip with cores, shared memory, and an on-chip interconnect network -> a system of chips with an off-chip network and external memory. Core/memory counts, sizes, and bandwidths are configurable.
- Compute engine: leverages reduced precision/sparsity.
- On-chip memory: local data reuse. On-chip bandwidth: transports the model and intermediate results.
- Off-chip memory: weight + activation storage.
- Chip-to-chip bandwidth: training + large-model inference.

Transformer Models: Implications for Accelerators
Encoder-only models: e.g. BERT; NLP use cases: Sentiment analysis, entity extraction, relationship detection, classification; Significant parallelism (attention) ⇒ Compute-intensive
Decoder-only models: e.g. GPT, LLaMA, PaLM-E; Generative tasks: Summarization, content generation, extraction; Generate one token at a time ⇒ Less compute parallelism ⇒ Memory-intensive
Encoder-Decoder models: e.g. T5; Encoder portion is compute-intensive; Decoder portion is memory-intensive
To support all transformer types, need system-wide balance of compute vs. memory
Vaswani, NeurIPS 17

Performance AI Cores: Dataflow Architectures
AI accelerators typically leverage dataflow architectures to efficiently process 2-D matrix-multiplication operations
Minimize data communication cost by directly propagating partial sums
Additional engines to support 1-D vector operations
[Figure: PE array fed by L0/L1 scratchpads, with SFU16/32 special function units]
Agrawal, ISSCC 21; Venkataramani, ISCA 21

Performance AI Cores: Mixed-Precision Support
To support a broad range of AI use cases (e.g. model accuracy requirements)
To support both training and inference
Both floating-point and fixed-point formats
Scale both performance (TOPS) and efficiency (TOPS/W) across precisions
[Figure: Mixed-precision processing engine — FP pipeline (1x FP16 or 2x HFP8 operands, exponent logic, adder tree, FP16 result) plus a double-pumped INT pipeline (8x INT4 or 4x INT8 sub-SIMD, INT24 result), 8 SIMD lanes]
Agrawal, ISSCC 21; Venkataramani, ISCA 21

A Reminder: Software is Critical!
Software stack to support AI accelerators must provide:
High performance: Lightweight runtime to minimize overheads
Ease of use: Integration w/ widely used AI frameworks
A rapidly evolving SW landscape! Caffe, Theano, Keras, MXNet, Chainer, TensorFlow, PyTorch, CNTK, ONNX, JAX, ... e.g. PyTorch 2.0 release in March 23
[Figure: Software stack (containers) — AI frameworks/models, AI framework integration, model optimizer, compiler, runtime, driver on Red Hat Linux/OpenShift — driving an accelerator card over PCIe]

Outline
Landscape: LLMs and Generative AI
Specific considerations for high-performance AI accelerators
Broad model/use case support
A system-level optimization
Roadmap: Compute efficiency
Quantization vs. Sparsity
Power management
Mixed-signal/analog computation
Roadmap: Communication bandwidth
Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
Summary

AI Compute Efficiency: Work Directions
(Dense) matrix multiplication is at the heart of AI
Focus on improving compute engine power/performance ⇒ Quantization + (some) sparsity
But AI is more than just dense matrix multiplication
How to optimize across varying models/operations/use cases? ⇒ Power management
Can aggressive circuit techniques play a role? ⇒ Mixed-signal/analog computation
Future innovation requires interaction across traditional HW/SW boundaries!

Aggressive Reduced Precision/Quantization
Reducing compute precision (= quantization) can dramatically improve AI performance
Efficient computation: e.g. Multiplier area quadratic vs. #bits
Relieve capacity/bandwidth constraints: More reuse in lower levels of memory hierarchy
High-performance AI accelerators must support a range of precisions
Training: fp32 ⇒ bfloat16; fp8 is next(?)
Inference: fp16 and int8 broadly used; int4 is next(?)
Weight vs. Activation quantization may be different: e.g. LLMs w/ int4 weights & fp16 activations
[Figure: Sign/exponent/mantissa bit layouts — fp32: 1/8/23; fp16: 1/5/10; fp8: 1/5/2; bfloat16: 1/8/7; DLFloat16: 1/6/9; int16: 1/15; int8: 1/7; int4: 1/3]

Key to Quantization: Model Accuracy
Critical to maintain accuracy across a broad range of models!
Many, many tricks are used:
Quantizing some or all of: Weights vs. Activations vs. Gradients
Not doing everything in reduced precision (= mixed precision): e.g. multiplication vs. accumulation, skipping first/last layers, ...
Batch norm, scale factors, non-uniform/trained quantization, ...
Example: fp8 training — Sun, NeurIPS 19

Example: Training Quantization Down to fp8

    Format    (s/e/m)  Bias        Instruction
    FP16      1/6/9    31          FMA: R = A.B + C (all FP16)
    FP8-fwd   1/4/3    11 (0-15)   FMMA: R = A1.B1 + A2.B2 + C; A1,A2,B1,B2: FP8; R,C: FP16
    FP8-bwd   1/5/2    15          —

Different fp8 formats for weights and activations (fp8-fwd) and gradients (fp8-bwd)
fp16 accumulation into an FMMA instruction (2x effective bandwidth/TOPS)
Sun, NeurIPS 19
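For intuition, a rough NumPy sketch of rounding a tensor to such an s/e/m format (illustrative only: it clamps the exponent IEEE-style and saturates the mantissa, ignoring subnormals, NaN/Inf, and the exact saturation rules of these specific formats):

    import numpy as np

    def quantize_sem(x, e_bits, m_bits, bias):
        # Round floats to a sign/exponent/mantissa format: clamp the
        # exponent range, round the mantissa, saturate on overflow.
        sign = np.sign(x)
        mag = np.abs(x) + 1e-45                        # avoid log2(0)
        exp = np.clip(np.floor(np.log2(mag)),
                      1 - bias,                        # min normal exponent
                      (1 << e_bits) - 2 - bias)        # IEEE-style max exponent
        frac = mag / 2.0**exp                          # ideally in [1, 2)
        frac = np.minimum(np.round(frac * 2**m_bits) / 2**m_bits,
                          2.0 - 2.0**-m_bits)          # saturate mantissa
        return sign * frac * 2.0**exp

    w = np.float32([0.013, -1.7, 3.14159, 240.0])
    print(quantize_sem(w, e_bits=4, m_bits=3, bias=11))  # fp8-fwd-style (1/4/3)
    print(quantize_sem(w, e_bits=5, m_bits=2, bias=15))  # fp8-bwd-style (1/5/2)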

Example: Inference Quantization Down to int4
Weight quantization: Statistics-Aware Weight Binning (SAWB)
Exploit weight statistics to better capture shape of weight distribution
Activation quantization: Parameterized Clipping acTivation (PACT)
Automatic tuning of clipping level to balance clipping vs. quantization error
Trainable parameters to learn optimal clipping values
Large-amplitude outliers play an important role
Choi, SysML 19
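The PACT idea reduces to a clip-then-quantize forward pass; a minimal sketch (forward only; in the actual scheme the clipping level alpha is a trainable parameter updated with a straight-through gradient estimator):

    import numpy as np

    def pact_quantize(x, alpha, bits=4):
        # Clip activations to [0, alpha], then quantize uniformly.
        y = np.clip(x, 0.0, alpha)            # clipping error grows as alpha shrinks
        scale = alpha / (2**bits - 1)         # quantization error grows as alpha grows
        return np.round(y / scale) * scale

    acts = np.float32([0.1, 0.9, 2.5, 7.3])   # ReLU outputs with a large outlier
    print(pact_quantize(acts, alpha=3.0, bits=4))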

Overheads in the Quantization Process
While wonderful for hardware designers, quantization creates significant overhead in the deployment of production AI models
Quantization-Aware Training (QAT)
Retrain/finetune model to recover accuracy loss resulting from quantization
Significant additional training epochs may be required
Requires access to training data
Post-Training Quantization (PTQ)
Direct quantization after calibrating scale factors/clip values
No extra training epochs, only a small subset of training data
But harder to maintain model accuracy
Gholami, arXiv 21
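A bare-bones sketch of one common PTQ recipe, symmetric max calibration for int8 (real flows add percentile clipping, per-channel scales, MSE-based search, etc.; all names here are hypothetical):

    import numpy as np

    def calibrate_scale(calib_batches, bits=8):
        # Symmetric max calibration over a small calibration set.
        max_abs = max(np.abs(b).max() for b in calib_batches)
        return max_abs / (2**(bits - 1) - 1)          # e.g. 127 levels per side

    def quantize(x, scale, bits=8):
        q = np.clip(np.round(x / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
        return q.astype(np.int8), scale               # dequantize later as q * scale

    calib = [np.random.randn(64, 128).astype(np.float32) for _ in range(8)]
    s = calibrate_scale(calib)
    q, s = quantize(calib[0], s)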

Sparsity
Potential for zeros in weights and activations
ReLU naturally results in zero-value activations
Low-value weights can be pruned (⇒ more retraining)
Level of sparsity ~50%; Not like HPC (High Performance Computing), which can be far higher ⇒ Aggressive sparse representation formats
Sze, Proc. IEEE 17; Han, NeurIPS 15
Impact on AI accelerator design:
Memory capacity can be reduced: But compute must decode any sparse representation
Memory bandwidth can be reduced: But memory accesses may become more random
Power savings may be straightforward to harness: e.g. via data gating
Improving hardware performance is a challenge: Need "structure" to sparse elements

Hardware-Friendly: Structured Weight Sparsity
Need sparsity to be hardware-friendly to achieve performance speedup
Fine-grained structured pruning: e.g. 2:4 sparsity (50%)
Prune 2 elements out of every group of 4 for 50% sparsity
But accuracy must be maintained: A lossy process
[Figure: Structured pruning granularities — filter-wise, channel-wise, filter-in-filter, group-in-filter-in-filter; weight tensors w/ gray cells are pruned]
Wang, NeurIPS 22
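A minimal sketch of 2:4 magnitude pruning, keeping the 2 largest-magnitude weights in every group of 4 (illustrative; production flows follow this with retraining to recover accuracy):

    import numpy as np

    def prune_2_of_4(w):
        # Zero out the 2 smallest-magnitude weights in each group of 4.
        groups = w.reshape(-1, 4)
        keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # indices of 2 largest
        mask = np.zeros_like(groups, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        return (groups * mask).reshape(w.shape)

    w = np.random.randn(8, 8).astype(np.float32)
    w_sparse = prune_2_of_4(w)                              # exactly 50% zeros
    assert (w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2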

Combining Precision & Sparsity
Challenging to simultaneously apply both quantization and pruning
Difficult to minimize accuracy loss: Both are lossy processes!
Further complicated by need for hardware-friendly formats: Uniform quantization + structured sparsity
Need a generalizable methodology (vs. model/application-specific tricks)
To support AI models across application domains
Despite a common transformer architecture, each domain may have drastically different input signals (= numerical data distribution)
To appropriately intercept pre-training vs. fine-tuning of AI models
Avoid expensive pre-training phase ⇒ Perform quantization/sparsification during fine-tuning (of downloaded pre-trained models) for downstream tasks

Foundation Model Quantization + Pruning
Quantized/sparse fine-tuned models can achieve baseline accuracy
Requires careful use of initialization, quantizers, zero-alignment, dropout, ...
Can simultaneously achieve int4 quantization + 50% structured sparsity

                      Dense FP16   Sparse INT8   Sparse INT4
    Memory capacity   1x           3.2x          5.34x
    Throughput        1x           2.67x         3.78x

Significant throughput/memory capacity improvements when combined w/ optimized hardware architectures
Wang, NeurIPS 22

Outline
Landscape: LLMs and Generative AI
Specific considerations for high-performance AI accelerators
Broad model/use case support
A system-level optimization
Roadmap: Compute efficiency
Quantization vs. Sparsity
Power management
Mixed-signal/analog computation
Roadmap: Communication bandwidth
Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
Summary

Power Variation In AI Workloads
In performance environments, AI accelerator power can vary widely
As determined by compute and memory usage characteristics of the workload
Model classes: Vision models vs. LLM; Query vs. Batching; Precision: FP16, FP8, INT8, INT4
Kar, ISSCC 24

Power Variation Across AI Models
Power consumption follows compute utilization
Utilization for different AI models and different batching depends on compilation and workload mapping
[Figure: Normalized utilization for ResNet (mb1/mb4), BERT-Base (mb1/mb4), BERT-Large (mb1/mb4)]
Kar, ISSCC 24

Power Variation Across Model Layers
Within an inference job, different model layers drive different compute utilization
2D compute array (convolution, matrix multiplication) consumes higher power
Vector engines (softmax, GeLU, etc.) and data transfer consume lower power
[Figure: AI core w/ MPE array, activation/weight buffers, vector engine, and routing; normalized power (leakage vs. active) for MatMul (FP16/INT8/INT4), ReLU, BatchNorm, and Idle]
Kar, ISSCC 24

Power Variation Within A Model Layer
Within a model layer, different phases of operation will consume different amounts of power, e.g.:
Matrix multiplication ⇒ Significant compute ⇒ High power (depends on matrix dimension)
Weight loading ⇒ Communication bottleneck ⇒ Low power
[Figure: Alternating weights-loading (low power) and MatMul (high power) phases; large matrix dimensions ⇒ high power, small dimensions ⇒ small power]
Kar, ISSCC 24

Power Management: Feed Back + Feed Forward
Feedback path via current sensing at 12V card input
Throttle core performance via stalling
Considers power of SoC + discrete components (e.g. DRAM)
Captures true system power limit
Unique to AI workloads: Power can be predicted at compile time
Core throttling per model/model layer via software can improve overall performance
Enhances response beyond closed-loop bandwidth
Kar, ISSCC 24

Feedback Controller
Closed-loop control dynamically stretches high-power workloads
2nd-order IIR filter (fixed-point datapath w/ 16b coefficients) for feedback compensation
Output clipped to convert to a 5b value, which stalls core pipeline
[Figure: ADC-sampled input current compared against target; biquad compensator (coefficients b0, b1, b2, a1, a2) with clipped output enabling core stalls; plant is the AI cores + DRAM behind the voltage regulator]
Kar, ISSCC 24
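In sketch form, such a biquad compensator is a second-order difference equation whose clipped output becomes the stall code (coefficients below are illustrative placeholders, not the values from the paper):

    def make_biquad(b0, b1, b2, a1, a2, clip_max=31):
        # 2nd-order IIR: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
        x1 = x2 = y1 = y2 = 0.0
        def step(err):                                 # err = measured current - target
            nonlocal x1, x2, y1, y2
            y = b0*err + b1*x1 + b2*x2 - a1*y1 - a2*y2
            x2, x1 = x1, err
            y2, y1 = y1, y
            return max(0, min(clip_max, int(y)))       # clip to a 5b stall code
        return step

    ctrl = make_biquad(0.4, 0.2, 0.1, -0.5, 0.1)       # illustrative coefficients
    for sample in [0.0, 2.0, 6.0, 9.0, 4.0]:           # current-over-target samples
        print(ctrl(sample))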

Software-Generated Stall Rate
Compiler can use hardware power model to compute a software-generated stall rate
Contained in program headers: Fetched by core global unit and sent to AI cores
Controller combines software stall rate with IIR filter output
[Figure: Deep-learning compiler flow — input graph ⇒ work division/optimization ⇒ power projection w/ power model ⇒ stall-aware code gen ⇒ program data with stall info; in hardware, final stall rate = signed IIR output + unsigned SSR, with fractional stall distributed to the 32 AI cores]
Kar, ISSCC 24

Power Management for AI: Potential Value
Jointly tuned software stalling (SSR) and configurable saturation (CSAT) of feedback path can allow significant performance improvement
Due to reduced need for hardware margining for worst-case workloads
Depends on system-dependent power excursion specification: Peak current + time constant
[Figure: Performance normalized to baseline for Baseline, CSAT-only, and Closed loop + SSR, at iso peak power over 1us, 100us, and 10ms windows]
Kar, ISSCC 24

Outline
Landscape: LLMs and Generative AI
Specific considerations for high-performance AI accelerators
Broad model/use case support
A system-level optimization
Roadmap: Compute efficiency
Quantization vs. Sparsity
Power management
Mixed-signal/analog computation
Roadmap: Communication bandwidth
Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
Summary

Beyond Digital Precision Scaling
Digital precision scaling: Not too many bits left!
Analog/mixed-signal circuits: Can perform compute at significantly lower power
May require new microarchitecture/workload mappings to harness compute efficiency
May require new AI algorithms to maintain model accuracy
[Figure: Power efficiency at the intersection of circuit design (analog/mixed-signal circuit design), algorithms (leverage resiliency of AI to approximate computations), and software/uArch (exploit uArchitectural insights from compilers)]

Analog/Mixed-Signal In-Memory Compute
Significant research activity using memory arrays: WL & BL voltage/current for MAC
Straightforward for integer arithmetic (inference), but harder for floating point (training)
Emerging memories (RRAM, PCRAM, etc.) could provide improvements vs. SRAM
Challenges: Inherently weight-stationary (compute utilization), model accuracy limitations
Topic to be covered in-depth in 4th talk by N. Shanbhag
[Figure: DACs drive multiple wordlines (V_WL/pulse-width encodes input features or weight bit position); compute on stored weights drives I_cell onto the bitline; accumulation across different binary weights, or different bits of a single weight, yields an analog output I_BL or V_BL]
Kang, ICASSP 14; Zhang, VLSI Circuits 16

Analog/Mixed-Signal Compute Engine
In AI accelerator architectures, dense 2-D (matmul/convolution) processing engines (PEs) dominate power/performance and are often decoupled from 1-D vector engines
Can harness benefits of mixed-signal analog circuits by converting only 2-D engines
Compatible with existing workload mappings + software stacks
[Figure: Digital AI core (SRAM, PE instruction fetch, scratchpad, input/output FIFOs, SFU-16/SFU-32) with the digital PE array replaced by switched-capacitor PEs behind a digital interface]
Agrawal, VLSI Circuits 23

Multibit-MAC = Sum of 1-b MACs
Write each 4-b operand in bit-planes: a_n = 8*a_n,3 + 4*a_n,2 + 2*a_n,1 + a_n,0 and b_n = 8*b_n,3 + 4*b_n,2 + 2*b_n,1 + b_n,0. Then
Sum_n a_n*b_n = Sum_n a_n,0*b_n,0 + 2*Sum_n a_n,0*b_n,1 + 2*Sum_n a_n,1*b_n,0 + ... + 64*Sum_n a_n,3*b_n,3
i.e. a weighted sum of sixteen 1-b dot products with weights 2^(i+j), i, j in {0,1,2,3}
Each 1-b dot product can be computed w/ an analog population counter (PPCTR); the bit shifts and final summation stay in the digital domain
Agrawal, VLSI Circuits 23
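A quick NumPy check of this decomposition for unsigned 4-b operands; each inner 1-b dot product is a population count over 64 bitwise products, which is exactly what the PPCTR computes in the analog domain:

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.integers(0, 16, size=64)         # 4-b activations
    b = rng.integers(0, 16, size=64)         # 4-b weights
    ref = int(np.dot(a, b))                  # multibit MAC, computed directly

    acc = 0
    for i in range(4):                       # activation bit plane
        for j in range(4):                   # weight bit plane
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            popcount = int(np.sum(a_i & b_j))   # 1-b MAC = count of non-zero products
            acc += popcount << (i + j)          # digital shift-and-add backend
    assert acc == ref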

Switched-Capacitor-Based Compute Engine
PPCTRs produce 1b MAC outputs (digital)
Programmable digital backend allows switching between INT2 and INT4 modes
PPCTR latency is 8 cycles, so 4-way time-interleaving allows higher effective throughput
[Figure: Latch-based weight storage array, array of PPCTRs, programmable digital backend; activations in, outputs out, control from IS]
Agrawal, VLSI Circuits 23

Pop Counter Using Mixed-Signal Circuits
Switched-cap circuit counts #non-zeros in 64 bitwise multiplications
1b multiplication (x × w) w/ AND gates
Summation by charge-sharing w/ CDAC: V_CDAC proportional to #non-zero products
SAR ADC performs binary search: Toggle bottom capacitor plates until V_CDAC ≈ V_SAR; result is the final bottom-plate code D5:0
Function is sensitive to noise ⇒ Bit error rate (BER)
Digital back-end processes bit-shift and summation
[Figure: 64 AND-gated inputs (x0·w0 ... x63·w63) charge-share onto an input CDAC; SAR binary-search logic with binary-weighted caps (C, 2C, 4C, ... 32C) produces D5:0]
Agrawal, VLSI Circuits 23

Optimized PPCTR Circuit Implementation
Shared capacitor array saves area + enables use of non-linear capacitors
In "sum" mode, each capacitor is independent
In "sar" mode, capacitors group together as a binary-weighted array
[Figure: Shared capacitor array (W0/C0 ... W126/C126) with sum/sar control, offset calibration, and comparator producing D6:0]
Agrawal, VLSI Circuits 23

Switched-Cap Compute Engine Power/Perf
Can achieve an integer-factor improvement in power efficiency

                        Mixed-Signal Compute   Digital Compute
    Technology          5nm                    7nm
    VDD                 0.65V                  0.55V-0.75V
    Frequency           800 MHz                1.0-1.6 GHz
    Throughput*         104.9 TOPS             130-210 TOPS
    Power Efficiency*   650 TOPS/W             142-264 TOPS/W
    Area Efficiency*    108 TOPS/mm2           52-84 TOPS/mm2

*normalized to 1-b MAC, for 1 corelet
Agrawal, VLSI Circuits 23

Outline
Landscape: LLMs and Generative AI
Specific considerations for high-performance AI accelerators
Broad model/use case support
A system-level optimization
Roadmap: Compute efficiency
Quantization vs. Sparsity
Power management
Mixed-signal/analog computation
Roadmap: Communication bandwidth
Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
Summary

Peak vs. Sustained Performance
Particularly w/ reduced-precision compute, we can pack a massive number of parallel engines into a chip!
Peak TOPS and TOPS/W are great, but...
Sustained performance on real workloads requires feeding compute engines w/ data to keep them busy (utilized)
Data communication is key: From storage ⇒ DRAM ⇒ SRAM ⇒ register files
Accelerator system architectures should be tailored for AI communication patterns
To achieve high utilization for the latest neural networks ⇒ LLMs
To achieve high utilization for inference + fine-tuning + training
To achieve high utilization into the future: Crystal ball

Roofline Visualization: Some Initial Intuition
Performance model to visualize compute vs. memory bandwidth tradeoffs
Compute bottlenecks: Circuit/technology-limited, strongly relieved by reduced precision
Memory bottlenecks ⇒ Off-chip DRAM bandwidth
Attainable performance (FLOPS) depends on a workload's operational intensity (FLOPS/Byte)
Real workloads will span a range of FLOPS/Byte
Low FLOPS/Byte workloads are more memory-bound
But this is just simple intuition
[Figure: Roofline — attainable FLOPS vs. operational intensity; flat peak floating-point performance roof (raised by reduced precision), bandwidth-sloped region for memory-bound workloads; bottlenecks reduced e.g. via architecture]
Williams, Comm. ACM 09
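The roofline itself is one line of arithmetic; a small sketch with illustrative numbers (not any specific accelerator):

    def attainable_tflops(op_intensity, peak_tflops, mem_bw_tbs):
        # Roofline: min(compute roof, bandwidth slope * operational intensity)
        return min(peak_tflops, mem_bw_tbs * op_intensity)

    PEAK, BW = 200.0, 1.0              # 200 TFLOPS peak, 1 TB/s DRAM (assumed)
    for oi in [1, 10, 100, 1000]:      # operational intensity in FLOPS/Byte
        print(oi, attainable_tflops(oi, PEAK, BW))
    # Below PEAK/BW = 200 FLOPS/Byte the workload is memory-bound;
    # reduced precision raises PEAK, which widens the memory-bound region.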

More Than Just DRAM Bandwidth
[Figure: Same core/chip/system hierarchy as before — PE array w/ register files and multiply-and-accumulate, L0/Lx memories, cores, shared memory, on-chip interconnect network, external memory, off-chip network; core/memory count, size, and bandwidth configurable]
Communication within dataflow PE array, across cores
Software-managed memory hierarchy w/ capacity & bandwidth limits
Flexible interconnect topology: Ring/Torus/Crossbar/Bus (for training)

Focus on Compute Utilization
Communication bottlenecks can exist throughout core/chip/system
Depends on specific mapping of AI model to hardware architecture
Depends not just on DRAM bandwidth, but also on-chip memory (scratchpads, register files) capacity/bandwidth and interconnect bandwidths
Depends on compiler optimization/maturity (software!)
A metric that considers the complexity of all bottlenecks: compute utilization
Different AI operations stress different bottlenecks, impact scaled by importance
Matrix multiplication (high utilization) vs. Data relayout (low utilization)
Must be calculated/assessed separately for each AI model
A "good" accelerator optimizes utilization across all use cases (potentially very broad in performance compute environments!)

Dataflow Architecture
Avoid data communication to/from memories; Goal: Maximize data reuse
2-D array of processing engines (PEs) efficiently handles key AI constructs: Matrix multiplication, convolution, ...
Directly pass data between engines w/o going to shared memory
FIFO fabric connects neighbors
Register files facilitate local data reuse
1-D array of engines for vector linear/non-linear operations: Activation functions, cross-slice reduction, ...
[Figure: PE array with L0/L1 scratchpads and SFU16/32 special function units]
Agrawal, ISSCC 21; Venkataramani, ISCA 21

Dataflow Mappings
Different layers of an AI model may optimally use different flows, depending on FLOPS/Byte and specific layer parameters
Register files hold "stationary" data in dataflow processing
Need sufficient register file capacity to handle all dataflows
Weight stationary vs. Output stationary vs. Row stationary (e.g. for convolution)
Chen, ISCA 16
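The difference between these mappings is which operand stays pinned in the PE register file while the loop nest streams the others; a weight-stationary sketch in plain Python (an illustrative scalar model, not a hardware description):

    import numpy as np

    def matmul_weight_stationary(I, W):
        # Each (k, n) weight is loaded into a PE register once and stays
        # "stationary" while all M inputs that need it stream past.
        M, K = I.shape
        _, N = W.shape
        O = np.zeros((M, N))
        for k in range(K):
            for n in range(N):
                w_reg = W[k, n]                    # load weight once (stationary)
                for m in range(M):                 # reuse it M times
                    O[m, n] += I[m, k] * w_reg
        return O

    I = np.random.randn(4, 8)
    W = np.random.randn(8, 5)
    assert np.allclose(matmul_weight_stationary(I, W), I @ W)

An output-stationary mapping would instead pin O[m, n] in the register file and stream weights and inputs past it.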

Software-Managed Scratchpads
AI models are static dataflow graphs
No data-dependent execution paths or irregular memory accesses
Scratchpad memories can effectively be used (vs. traditional caches)
Effectively a large register file ("buffer," "shared memory")
Software-managed data movement (not transparent), separate address space
No need to incur cache management overheads, e.g. tag array, comparators, ...
[Figure: Scratchpad occupies a separate region of the address space, managed by the application, alongside main memory]

On-Chip Memory Capacity
Increasing register file/scratchpad capacity enables more data reuse
Deterministic: Data can be explicitly staged via software (no cache misses)
Reduces required off-chip memory bandwidth
Some break points: e.g. Holding entire model on-chip ⇒ No data transfer needed
On-chip memory needs for AI: Largely traditional embedded memory scaling
More capacity/density, in particular to hit density break points
Low Vmin operation: Near-threshold may be optimal for PE logic (parallelizable!)
Though memory could leverage separate higher array supply
But with LLMs of 10s or 100s of billions of parameters, weights and activations will not fit on-chip

Off-Chip Memory: DRAM
Off-chip weight/activation storage
Weights: Must be held for both inference and training
Activations: Inference (forward pass): Just need activations from previous layer, then can discard
Fine-tuning/Pre-training: Backprop requires activation storage to calculate gradients
Minibatch size: Directly increases required activation storage (e.g. 32x or more)
Need high bandwidth: To balance weight/activation transfer w/ compute
Low pJ/bit within cost/form factor constraints: HBM, LPDDR, GDDR
But with LLMs of 10s or 100s of billions of parameters, DRAM bandwidth of a single accelerator may not be sufficient
e.g. LLaMA-70B (140 GB in fp16) @ 1 TB/s ⇒ 140 ms just to transfer (once)!

Accelerator-to-Accelerator Communication
System network topologies
Direct communication between local accelerators: Tree vs. Mesh vs. Ring vs. Torus vs. ...
Communication through host CPU: Ethernet vs. Infiniband
In practical systems, data communication will depend on:
Type of parallelism employed
Bandwidth at each level of hierarchy
AI model size
Pre-training vs. fine-tuning vs. inference
[Figure: Accelerators grouped under switches; switches linked to a CPU + NIC]

Roadmap Directions Going Forward
Future AI trends will drive need for improved communication bandwidth
AI model size will continue to grow: More data to communicate
Compute efficiency will continue to improve: Exacerbates compute/communication imbalance
Bandwidth improvements needed across the system (ISSCC perspective):
On-chip interconnect ⇒ Digital circuits/architecture
Between chips ⇒ Wireline circuits
To on-chip memory ⇒ Embedded memory/Digital circuits
To off-chip DRAM ⇒ DRAM/Wireline circuits
Analyze compute utilization using our crystal ball: What AI models/use cases will emerge? What core/chip/system architectures will best optimize utilization?

Heterogeneous Integration
Packaging technology innovation can improve communication bandwidth between "heterogeneous" technologies
Today: HBM w/ 2.5D silicon interposer integrates logic die + 3D DRAM stacks
Tomorrow? Chiplet technologies: Very low power chip-to-chip interconnect
3D-stacked memories: Very low power memory interface

Keep an Eye on AI Algorithms/Models!
AI algorithms are very very very very very rapidly evolving:
Tensor-parallel inference: Shrink model size per device
Fully-sharded data parallel training: Shrink model size per device
LoRA fine-tuning: Reduce fine-tuned model weights
Model compression: Shrink model size/precision
Speculative sampling: Add more compute (draft model)
Mixture of experts: A different ballgame
...and more!
New developments in AI algorithms/use cases can significantly change compute/bandwidth/capacity needs (by up to orders of magnitude!)
Dramatically impacts how to optimize hardware systems!
Hu, arXiv 21; https://huggingface.co/transformers/v4.10.1/parallelism.html; Chen, arXiv 23; https://mistral.ai/news/mixtral-of-experts/

Outline
Landscape: LLMs and Generative AI
Specific considerations for high-performance AI accelerators
Broad model/use case support
A system-level optimization
Roadmap: Compute efficiency
Quantization vs. Sparsity
Power management
Mixed-signal/analog computation
Roadmap: Communication bandwidth
Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
Summary

Summary
LLMs/Generative AI will drive further explosive growth in AI acceleration
AI accelerators in performance compute environments will require system-level optimization to offer broad model/use case support
Future roadmap will require innovation in:
Compute efficiency: Quantization/sparsity, power management, mixed-signal/analog computation
Communication bandwidth: Within core ⇒ Core-to-core ⇒ DRAM ⇒ Accelerator-to-accelerator
An exciting time ahead!

Key References (1)
A. Vaswani, et al., "Attention is all you need," NIPS, 2017.
E. J. Hu, et al., "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685, 2021.
P. Lewis, et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS, 2020.
Y. Gao, et al., "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv:2312.10997, 2023.
Tensor Parallelism: https://huggingface.co/transformers/v4.10.1/parallelism.html
Fully-Sharded Data Parallel: https:/ ...
A. Agrawal, et al., "A 7nm 4-core AI chip with 25.6 TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling," ISSCC, 2021.
S. Venkataramani, et al., "RaPiD: AI accelerator for ultra-low precision training and inference," ISCA, 2021.
X. Sun, et al., "Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks," NeurIPS, 2019.
J. Choi, et al., "Accurate and efficient 2-bit quantized neural networks," SysML, 2019.
A. Gholami, et al., "A Survey of Quantization Methods for Efficient Neural Network Inference," arXiv:2103.13630, 2021.
V. Sze, et al., "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, Dec. 2017.
S. Han, et al., "Learning both weights and connections for efficient neural network," NeurIPS, 2015.
N. Wang, et al., "Deep Compression of Pre-trained Transformer Models," NeurIPS, 2022.

Key References (2)
M. Kar, et al., "A Software-Assisted Peak Current Regulation Scheme to Improve Power-Limited Inference Performance in a 5nm AI SoC," ISSCC, 2024.
M. Kang, et al., "An Energy-Efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM," ICASSP, 2014.
J. Zhang, et al., "A Machine-learning Classifier Implemented in a Standard 6T SRAM Array," Symposium on VLSI Circuits, 2016.
A. Agrawal, et al., "A Switched-Capacitor Integer Compute Unit with Decoupled Storage and Arithmetic for Cloud AI Inference in 5nm CMOS," Symp. VLSI Circuits, 2023.
S. Williams, et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, 2009.
Y.-H. Chen, et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA, 2016.
C. Chen, et al., "Accelerating Large Language Model Decoding with Speculative Sampling," arXiv:2302.01318, 2023.
Mixture of Experts (MoE): https://mistral.ai/news/mixtral-of-experts/

Architecture and design approaches to ML hardware acceleration: edge and mobile environments
Marian Verhelst
Professor KU Leuven; research director imec
marian.verhelst@kuleuven.be
February 2024

Making (extreme) edge devices smart
Edge systems = wearables, implantables, smart speakers, drones, cars, ...
[Figure: Today, raw data streams from edge devices to a cloud GPU, which returns information]
Embedded machine learning at the (extreme) edge: www.tinyml.org

tinyML challenges
Workloads: Deep neural networks, decision trees, transformers, support vector machines, neuro-symbolic
Constraints: Small memory, real time (low latency), low energy

Are they? Deep neural networks are everywhere in our edge devices...
Only simple tasks: Keyword spotting in phone ⇒ Speech processing in cloud
Limited processing or bulky battery
Processing limited by affordable cooling (10W)

Edge AI targets
Autonomous Driving: 6 full HD cameras @ 30fps; 10W budget; Model: EfficientNetV2/frame (under est.!); 2 TOPs/frame, 500 TOPs ⇒ 50 TOPs/W
Augmented Reality Glasses: Stereo HD + eye-tracking camera @ 30fps; 100mW budget; Model: EfficientNetV2/frame (under est.!); 1 TOPs/frame, 50 TOPs ⇒ 500 TOPs/W

Neural network processors: state-of-the-art
[Figure: Survey of published neural network processor efficiency vs. performance]
Source: https:/ ...

Overview
PART 1: Efficiency techniques in ML processors
Data reuse: Spatial, Temporal
Reduced precision
Sparsity
Fused processing
PART 2: The need for cross-layer optimization and heterogeneous multi-core
The need for cross-layer optimization
Scheduling/mapping
The need for multi-core
Multi-core exploration
The need for heterogeneity?
Conclusion

A (simplified) neural network layer
One neural network layer: an MxK input matrix times a KxN weight matrix produces an MxN output

    for (m = 0 to M-1);     // for each row
     for (n = 0 to N-1);    // for each output column
      for (k = 0 to K-1);   // for each input channel (column)
       o[m][n] += i[m][k] * w[k][n];
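For reference, the loop nest above is exactly a matrix multiplication; a quick NumPy check (bias and non-linearity omitted, as in the simplified layer):

    import numpy as np

    M, K, N = 4, 8, 3
    i = np.random.randn(M, K)          # layer inputs
    w = np.random.randn(K, N)          # layer weights
    o = np.zeros((M, N))
    for m in range(M):                 # same loop nest as the slide
        for n in range(N):
            for k in range(K):
                o[m, n] += i[m, k] * w[k, n]
    assert np.allclose(o, i @ w)       # the layer is just o = i @ w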

A typical Neural (co)processor unit (NPU)
[Figure: Off-chip DRAM holds weights and activations (in/out); one or more levels of on-chip SRAM/RF buffers hold weights, layer inputs, and layer outputs; the datapath is built from processing elements (MAC)]
Energy per IO transfer (DRAM) >> Energy per memory read/write (SRAM) >> Energy per MUL+ADD (PE)
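A back-of-the-envelope energy model for such an NPU; the per-operation energies below are illustrative placeholders chosen only to reflect the DRAM >> SRAM >> PE ordering, not measured values:

    # Hypothetical per-operation energies (pJ)
    E_DRAM, E_SRAM, E_MAC = 100.0, 5.0, 0.5

    def npu_energy_pj(n_mac, n_sram_accesses, n_dram_transfers):
        # Total energy = compute + on-chip buffering + off-chip traffic
        return (n_mac * E_MAC
                + n_sram_accesses * E_SRAM
                + n_dram_transfers * E_DRAM)

    # Naive mapping: every MAC pulls both operands from DRAM
    print(npu_energy_pj(n_mac=1e6, n_sram_accesses=0, n_dram_transfers=2e6))
    # With reuse: operands staged once into SRAM, then reused from there
    print(npu_energy_pj(n_mac=1e6, n_sram_accesses=2e6, n_dram_transfers=1e4))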

Overview
PART 1: Efficiency techniques in ML processors
1. Data reuse: Spatial, Temporal
2. Reduced precision
3. Sparsity
4. Fused processing
PART 2: The need for cross-layer optimization and heterogeneous multi-core
The need for cross-layer optimization
Scheduling/mapping
The need for multi-core
Multi-core exploration
The need for heterogeneity?
Conclusion

Data reuse in ML processors
Energy per IO transfer > Energy per memory read/write > Energy per MUL+ADD
Remember: every W & I is used multiple times, and O is accumulated!
Exploit data reuse to reduce memory energy by increasing arithmetic intensity (AI)

Arithmetic intensity (AI)
arithmetic intensity (AI) = ops / memory access
Naive (one MAC, all operands from memory): 2 ops / 4 accesses = 0.5

Spatial data reuse: in datapath ⇒ Save E_SRAM

    for (m = 0 to M-1);     // for each row
     for (n = 0 to N-1);    // for each output column
      for (k = 0 to K-1);   // for each input channel
       o[m][n] += i[m][k] * w[k][n];

With a 4x4 PE array (parfor = parallel for):

    for (n = 0 to N-1);
     for (m2 = 0 to M/4-1);
      for (k2 = 0 to K/4-1);
       parfor (m1 = 0 to 3);
        parfor (k1 = 0 to 3);
         o[m][n] += i[m][k] * w[k][n];   // m = 4*m2 + m1; k = 4*k2 + k1

Improved arithmetic intensity: 2*16 ops / (4+16+4+4 accesses) > 1

Spatial data reuse (aka "spatial unrolling") overview
[Figure: Weight reuse (one weight broadcast to S input/output pairs), input reuse (one input broadcast to S weights/outputs), output reuse (S weight-input products accumulated into one output, FMA)]

                         Weight reuse  Input reuse  Output reuse
    Weight BW            1             S            S
    Input BW             S             1            S
    Output BW            S             S            1
    Arithmetic Intensity 2S/(3S+1)     2S/(3S+1)    2S/(2S+2)

Assuming spatial reuse factor S = #MACs; many hybrids exist
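The intensities in the table are easy to check by counting accesses, treating each output accumulation as a read plus a write:

    def arithmetic_intensity(weight_bw, input_bw, output_bw, macs):
        # 2 ops per MAC; each output access costs 2 (read + write back)
        ops = 2 * macs
        accesses = weight_bw + input_bw + 2 * output_bw
        return ops / accesses

    S = 16  # spatial reuse factor = #MACs
    print(arithmetic_intensity(1, S, S, S))   # weight reuse:  2S/(3S+1)
    print(arithmetic_intensity(S, 1, S, S))   # input reuse:   2S/(3S+1)
    print(arithmetic_intensity(S, S, 1, S))   # output reuse:  2S/(2S+2)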

Example: 3D spatial unrolling — Huawei Da Vinci
M, N, and K each unrolled by 16: 16*16*16 = 4096 parallel MACs
Arithmetic intensity = 8192 / (4*16*16) = 8 ops/fetch!

    for (m1 = 0 to M/16-1);
     for (k1 = 0 to K/16-1);
      for (n1 = 0 to N/16-1);
       parfor (m2 = 0 to 15);
        parfor (k2 = 0 to 15);
         parfor (n2 = 0 to 15);
          o[m][n] += i[m][k] * w[k][n];
Liao, HotChips 19

    for (m = 0 to M-1);     // for each row
     for (n = 0 to N-1);    // for each output column
      for (k = 0 to K-1);   // for each input channel
       o[m][n] += i
