SESSION 20 - Machine Learning Accelerators (ISSCC 2024)

[Slide 1 of 44] ISSCC 2024, Session 20: Machine Learning Accelerators

20.1: NVE: A 3nm 23.2TOPS/W 12b-Digital-CIM-Based Neural Engine for High-Resolution Visual-Quality Enhancement on Smart Devices

Ming-En Shih*1, Shih-Wei Hsieh*1, Ping-Yuan Tsai*1, Ming-Hung Lin1, Pei-Kuei Tsung1, En-Jui Chang1, Jenwei Liang1, Shu-Hsin Chang1, Chung-Lun Huang1, You-Yu Nian1, Zhe Wan2, Sushil Kumar2, Cheng-Xin Xue1, Gajanan Jedhe2, Hidehiro Fujiwara3, Haruki Mori3, Chih-Wei Chen1, Po-Hua Huang1, Chih-Feng Juan1, Chung-Yi Chen1, Tsung-Yao Lin1, CH Wang1, Chih-Cheng Chen1, Kevin Jou1
1 MediaTek, HsinChu, Taiwan. 2 MediaTek, San Jose, CA. 3 TSMC, HsinChu, Taiwan.
2024 IEEE International Solid-State Circuits Conference

[Slides 2-3 of 44] Outline
- Introduction
- Architecture and Key Features of the Neural Visual-Enhancement Engine (NVE)
  - 18-row DCIM Core with 4-Cycle Row-Switch Control
  - Convolution Element Fusion
  - Adaptive Data Control and Striping Optimization
- Measurement Results
- Summary

[Slide 4 of 44] Enhancing Video Quality on Smart Devices
- A critical task for achieving a better user experience, yet challenging to realize on power-, bandwidth-, and area-limited devices.
- Pixel-level processing tasks: super resolution (Full-HD 30fps to 4K 30fps) and noise reduction (4K 30fps to 4K 30fps).
- Target platforms and use cases: TV/monitor and mobile; video streaming, video calls, gaming, low-light photography.

[Slide 5 of 44] Challenge of Real-Time High-Res Processing (1/3): High Computational Power
- Inference on high-resolution networks involves substantial GMAC counts: pixel-level networks (FSRCNN 2x/4x at 4K) need roughly 10x to 100x the GMAC per frame of low-resolution classification networks (MobileNetV1, MobileNetV2, EfficientNet-B0).
- Weights and feature maps have low sparsity (PReLU activations and asymmetric quantization keep the percentage of zeros across layers to a few percent), so sparsity exploitation offers little relief.

[Slide 6 of 44] Challenge of Real-Time High-Res Processing (1/3, cont.): High Bit Precision
- High bit precision is required to ensure image quality: for both super resolution and noise reduction, 12b x 12b computation preserves visual quality where 8b x 8b degrades it.

[Slide 7 of 44] Challenge of Real-Time High-Res Processing (2/3): High External Memory Access (EMA)
- High EMA leads to increased bandwidth power.
- SR/NR networks use fewer weights than classification networks, but their feature-map bandwidth per frame is far larger, raising total EMA by roughly 10x.
- For FSRCNN 2x/4x at 4K, bandwidth power is about 30% of total power (MAC power vs. BW power breakdown).

[Slide 8 of 44] Challenge of Real-Time High-Res Processing (3/3): Flexibility with Efficient Support
- The engine must adapt to diverse operators (Conv2D, strided Conv2D, transposed Conv2D) and model structures (N layers per block, M blocks).
- Challenge 3-1: maintaining a high utilization rate (Urate) over the expanded operator range.
- Challenge 3-2: mitigating bandwidth growth in deep-layer models, where internal feature-map storage between layers inflates bandwidth.
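The bandwidth challenge above is easy to sanity-check with a rough estimate. A sketch using the 4K30 frame rate and INT12 precision from the slides, but with an illustrative channel count and depth (not FSRCNN's actual topology):

```python
# Back-of-envelope feature-map bandwidth at 4K30 with INT12 activations.
# CH and LAYERS are illustrative assumptions, not the paper's network.
W, H, FPS = 3840, 2160, 30          # 4K30 (from the slides)
BITS = 12                           # INT12 activations (from the slides)
CH, LAYERS = 16, 6                  # hypothetical small SR network

pixels = W * H
# One layer's activation tensor, in megabytes:
fm_mb = pixels * CH * BITS / 8 / 1e6
# If every intermediate feature map spilled to DRAM (write + read per exchange):
traffic_mb = fm_mb * (LAYERS - 1) * 2
print(f"one feature map: {fm_mb:.0f} MB, per-frame DRAM traffic: {traffic_mb:.0f} MB")
print(f"sustained BW at 30 fps: {traffic_mb * FPS / 1e3:.1f} GB/s")
```

Even this tiny hypothetical network implies tens of GB/s of sustained feature-map traffic, which is why the deck treats EMA, not MACs, as the dominant power term.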

[Slide 9 of 44] Outline (section divider: Architecture and Key Features of NVE)

[Slide 10 of 44] NVE Overall Architecture
[Block diagram] The NVE comprises:
- Conv core: Conv Elements 1-11 linked by a CE Fusion Interface, with a Feature Map Memory (8 SRAMs) under fusion control.
- Per Conv Element: a Data Controller (Data Collector, Data buffer, Data Dispatcher, Conv setting, Row Selector) driving a DCIM macro of 72 banks x 18 rows (12b x 12b MAC, x8), followed by Accumulator, Activate, and Reorder stages into the CE Fusion Interface (datapath widths: 8x12b in, 8x31b/8x33b partial sums, 8x12b out, 5b row select).
- CV core: Resizer, Depth-to-Space, Space-to-Depth, Blend, and CV memory.
- DMA: format converter, address control, buffer, and connection network to external DRAM and a Tightly Coupled Memory (TCM).
- Compiler: graph scheduler, stripe scheduler, and memory allocator, emitting command descriptors.

[Slide 11 of 44] NVE Overall Architecture: Convolution Cores
- 11 stages of pipelined Convolution Elements (CE).
- The MAC units distributed across the pipeline stages process simultaneously.
- Intermediate feature maps are stored internally within each stage.

[Slide 12 of 44] NVE Overall Architecture: Convolution Element
- Modular design of convolution elements.
- Each contains an 18-row Digital-CIM (DCIM) macro.
- Computes 8 sets of 72 12b elements per cycle.
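The per-cycle figures here compose directly into the chip-level numbers reported later in the deck (6336 MACs, 16.5 peak TOPS at 1300MHz). A quick arithmetic cross-check, counting one multiply plus one add as two OPs:

```python
# Cross-check of the MAC count and peak TOPS implied by the architecture slides.
CES = 11                 # 11 pipelined Convolution Elements
SETS, ELEMS = 8, 72      # each DCIM macro computes 8 sets of 72 12b elements/cycle
FMAX = 1.3e9             # 1300 MHz top frequency (from the summary slide)

macs = CES * SETS * ELEMS
peak_tops = macs * 2 * FMAX / 1e12   # 1 MAC = 2 OPs (multiply + add)
print(macs, round(peak_tops, 1))     # 6336 MACs, ~16.5 TOPS
```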

[Slide 13 of 44] NVE Overall Architecture: CV Core and DMA
- The DMA loads command descriptors encoded by the compiler and fetches the input feature map in raster-scan order to the core.
- A dedicated core handles CV tasks (resize, depth-to-space, space-to-depth, blend).

[Slide 14 of 44] Key Features
- Feature 1: 18-row DCIM engine with 4-cycle row switch.
- Feature 2: Convolution Element Fusion.
- Feature 3.1: Adaptive data control. Feature 3.2: Compiler-optimized looping/striping.

[Slide 15 of 44] Advantage of DCIM (Feature 1)
- DCIM offers high energy efficiency by reducing data movement between the processing and memory units: a traditional MAC moves large amounts of data out of memory, whereas compute-in-memory performs the computation within the memory array. (Source: Haruki Mori, ISSCC 7.4, 2023.)

[Slide 16 of 44] Multi-Row DCIM Without Weight Reload (Feature 1)
- DCIM with weight reload introduces extra power overhead.
- Leveraging the small weight footprint of SR/NR networks, an 18-row DCIM is adopted to increase on-macro weight storage and eliminate reloads (2 rows suffice for a ping-pong update scheme).
- Ping-pong baseline: 1.73KB DCIM (72 banks x 2 rows x 8 outputs) plus a 15.6KB weight SRAM and weight-reload power.
- Full-weight design: 15.6KB DCIM (72 banks x 18 rows x 8 outputs), with no extra SRAM storage and no reload power.
- Result: 31% area reduction and 28% power reduction versus the ping-pong baseline.
- Macro design: H. Fujiwara et al., ISSCC 34.4, 2024.
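The two storage figures follow directly from the array geometry; a quick check of the arithmetic (sizes come out in decimal kB, matching the slide's 1.73KB and 15.6KB):

```python
# Check of the DCIM storage figures on the weight-reload slide.
BANKS, OUT, WBITS = 72, 8, 12

pingpong_bytes = BANKS * 2 * OUT * WBITS / 8    # 2-row ping-pong DCIM
full_bytes     = BANKS * 18 * OUT * WBITS / 8   # 18-row full-weight DCIM
print(pingpong_bytes / 1e3, full_bytes / 1e3)   # 1.728 kB and 15.552 kB
```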

[Slide 17 of 44] Operating Order of DCIM (Feature 1)
- Two orders are possible from cycle T to cycle T+1: keep the same input feature (IF) and change the DCIM row (a row switch), or keep the same DCIM row and change the IF.
- In both cases the data buffer supplies (pix, ich) operands and each active DCIM row produces a partial sum (Psum).

[Slide 18 of 44] Row Switching Cycle (Feature 1)
- Extending the number of cycles between row switches leads to better efficiency, but requires a larger local input buffer.
- 2-cycle row switch: the input buffer and DCIM alternate rows every 2 cycles; 4-cycle row switch: every 4 cycles, halving the row-switch frequency.

[Slide 19 of 44] 4-Cycle Row-Switch Data Flow (Feature 1)
- Input data buffer: ping-pong update; input arrives in raster-scan order and is sent to the DCIM in pixel-first order (P0 P1 P2 P3 P4 P5 P6 P7 across ich).
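With a 4-cycle row switch, DCIM row r is active for cycles 4r through 4r+3 while the ping-pong buffer refills in the background. A minimal schedule sketch:

```python
# 4-cycle row-switch schedule: row = cycle // SWITCH.
SWITCH = 4   # cycles between row switches (the design point the deck selects)

def active_row(cycle: int) -> int:
    return cycle // SWITCH

schedule = {}
for c in range(12):                          # a 12-cycle window covers rows 0-2
    schedule.setdefault(active_row(c), []).append(c)
print(schedule)   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
```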

[Slide 20 of 44] 4-Cycle Row-Switch Data Flow (cont.)
- DCIM macro: switch the active row every 4 cycles (row 0 on cycles 0-3, row 1 on cycles 4-7, row 2 on cycles 8-11).

[Slide 21 of 44] 4-Cycle Row-Switch Data Flow (cont.)
- Accumulate the partial sums, then reorder the outputs back to raster-scan order (P0 P1 P2 P3 ... at the output).

[Slide 22 of 44] Balancing Area and Power Efficiency (Feature 1)
- Row-switch cycle selection, measured at 0.46V:

  Switch cycles (T) | Energy eff. (TOPS/W) | Area (mm2) | TOPS/W/mm2
  1                 | 18.7                 | 1.32       | 14.21
  2                 | 21.5                 | 1.33       | 16.17
  4                 | 23.2                 | 1.37       | 16.94
  6                 | 23.9                 | 1.41       | 16.89
  8                 | 24.2                 | 1.45       | 16.65

- 4-cycle switching gives the best combined efficiency (TOPS/W/mm2); system TOPS/W improves from 18.7 (1T) to 23.2 (4T), +24%.
- The high efficiency helps address the increased computation demands of high-resolution network inference.

[Slide 23 of 44] Pipeline and Unified MAC Architecture (Feature 2)
- Target model: Conv A -> Conv B -> Conv C -> Conv D.
- Pipelined MAC (concurrent execution, N/4 MACs per stage): direct stage-to-stage links reduce EMA and on-chip memory, but the design suffers from workload-balance issues.
- Unified MAC (sequential execution, N MACs per layer at times T, T+1, ...): no workload-balance issue, but no direct stage-to-stage data transfer, hence higher EMA and memory usage.

[Slide 24 of 44] Load Balance by Fusion (Feature 2)
- Pipelined MAC with fusion (concurrent execution): adjacent elements are fused so each stage sees a balanced workload, while keeping the direct links that reduce EMA and on-chip memory.

[Slide 25 of 44] Configurable Fusion Mode (Feature 2)
- Load-balanced fusion mapping: the compiler chooses the fusion mode (Single, Fuse-2, Fuse-4 per CE) from the model's load profile.
- Example: layers with 4n, 2n, and n MACs map onto CEs 1-11 as 4-CE fusion, 2-CE fusion, and single layers, respectively.

[Slide 26 of 44] Workflow of 4-Element Fusion (Feature 2)
- Consider a 32-ich, 32-och convolution: the computation load is dispatched equally to four CEs by distributing input channels (ich 0-7, 8-15, 16-23, 24-31), each CE taking 1/4 of the workload, followed by a partial-sum exchange across CEs.

[Slides 27-28 of 44] Workflow of 4-Element Fusion (cont.)
- A naive all-to-one partial-sum exchange creates a long critical path.
- Chaining the exchange shortens the critical path, but extends latency.
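The ich-distributed fusion on these slides can be sketched end to end: each CE computes partial sums over its 8-channel input slice, and the partial-sum exchange recombines them into the full 32-och result. A toy 1x1-convolution model in pure Python (input and weight values are arbitrary):

```python
# 4-CE fusion sketch for a 32ich -> 32och (1x1) convolution: each CE holds an
# 8-channel input slice and produces partial sums for all 32 output channels;
# summing across CEs reproduces the single-CE reference exactly.
ICH = OCH = 32
CES = 4
x = [(i * 7 + 3) % 11 for i in range(ICH)]                        # toy inputs
w = [[(i + 13 * o) % 5 for i in range(ICH)] for o in range(OCH)]  # toy weights

# Reference: full convolution on one CE.
ref = [sum(w[o][i] * x[i] for i in range(ICH)) for o in range(OCH)]

# Fused: CE k handles input channels 8k .. 8k+7 (1/4 of the workload each).
sl = ICH // CES
psums = [[sum(w[o][i] * x[i] for i in range(k * sl, (k + 1) * sl))
          for o in range(OCH)] for k in range(CES)]
# Partial-sum exchange: accumulate the four CEs' contributions per och.
fused = [sum(psums[k][o] for k in range(CES)) for o in range(OCH)]
assert fused == ref
print("fused result matches single-CE reference")
```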

[Slides 29-30 of 44] Workflow of 4-Element Fusion (cont.)
- Final scheme: in addition to distributing ich, each CE collects a different och slice (och 0-7, 8-15, 16-23, 24-31), so the partial-sum exchange has a short critical path with no extended latency.

[Slide 31 of 44] Fusion Dataflow Management (Feature 2)
- Key elements of fusion: an input distributer, the partial-sum exchange (2-CE and 4-CE psum paths among CE1-CE4), and an output collector.

[Slide 32 of 44] Without Fusion vs. With Fusion (Feature 2)
- Utilization improves from 40% to 79% (2x) through workload balancing.
- Normalized power drops 19%: dynamic power stays at 63%, while clock-toggle plus leakage power falls from 37% to 18% via early system power-down.

[Slide 33 of 44] Strided and Transposed Convolutions (Feature 3)
- Strided and transposed convolutions are commonly utilized operators in U-Net models; the goal is to improve utilization from the 25% baseline.
- Conv with stride 2: the baseline masks 3 of every 4 outputs, so Urate = 25%.
- Transposed conv with stride 2: the baseline inserts 75% zeros into the input, so Urate = 25%.
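The 25% baseline for transposed convolution comes from stride-2 zero insertion: upsampling a 2-D map by 2x per dimension leaves only 1 in 4 input positions nonzero. A tiny sketch:

```python
# Zero-insertion view of a stride-2 transposed convolution: a naive dataflow
# feeds the zero-padded map to the MACs, wasting 75% of the input operands,
# which is exactly the 25% baseline Urate quoted on the slide.
x = [[1, 2], [3, 4]]                       # toy 2x2 input tile
up = [[0] * 4 for _ in range(4)]           # zero-padded 4x4 upsampled tile
for r in range(2):
    for c in range(2):
        up[2 * r][2 * c] = x[r][c]
nz = sum(v != 0 for row in up for v in row)
total = 16
print(f"useful MAC inputs: {nz}/{total} = {nz/total:.0%}")  # 25%
```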

[Slide 34 of 44] Adaptive Data Control for Stride-2 Conv (Feature 3)
- Simply skipping computation makes the input-buffer refill the bottleneck: the DCIM idles (0% activity on odd rows) while the data collector refills the buffer from feature-map memory.
- Solution: compute additional output channels while collecting data into the buffer, reusing the buffered window for och/8 passes so buffer-to-DCIM activity stays at 100% on both even and odd rows, reducing DCIM idle time.

[Slide 35 of 44] Adaptive Data Control for Transposed Conv (Feature 3)
- Increase the effective data per dispatch by packing more input channels: (1x1)x32ch, (1x2)x32ch, or (2x2)x16ch windows (16 or 32 packed channels) are selected per even/odd row phase, with the weight taps (w00-w22) split into left/right halves.
- This reduces the zero-data ratio dispatched to the DCIM (buffer-to-DCIM activity around 67-89% across cycles instead of mostly zeros).

[Slide 36 of 44] Utilization with Adaptive Data Control (Feature 3)
- Stride-2 conv: average utilization increases from the 25% baseline to 58% (+33 points).
- Transposed conv: average utilization increases from the 25% baseline to 65% (+40 points).
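For strided convolution, "skipping computation" means evaluating only the strided output positions rather than computing a dense output and masking 3 of 4 results. A 1-D sketch (in 2-D the MAC saving is 4x rather than the 2x shown here):

```python
# Stride-2 convolution: computing only the strided output positions gives the
# same result as a dense convolution followed by 2x subsampling (masking),
# with half the MACs in this 1-D sketch (a quarter in 2-D).
x = list(range(10))          # toy input
k = [1, -2, 1]               # toy 3-tap kernel

def conv1d(x, k, stride):
    n = len(x) - len(k) + 1
    return [sum(k[j] * x[i + j] for j in range(len(k)))
            for i in range(0, n, stride)]

dense_then_mask = conv1d(x, k, 1)[::2]
strided_direct  = conv1d(x, k, 2)
assert strided_direct == dense_then_mask
print("strided conv == masked dense conv, with half the 1-D MACs")
```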

[Slide 37 of 44] Compiler-Optimized Execution Order (Feature 3)
- Full feature-map schedule: loops execute layer by layer over the whole feature map (execution order 0, 1, 2, 3), exchanging intermediate maps via DRAM; large feature maps cause large DRAM bandwidth, though weights are loaded only once.
- Optimized vertical-striping schedule: the feature map is cut into vertical stripes (execution order 0 .. n-1 per loop); intermediate data are exchanged via TCM, and only the loop boundaries exchange via DRAM. The compiler optimizes for minimum DRAM bandwidth within the performance target.

[Slide 38 of 44] Reduced DRAM Access (Feature 3)
- Vertical striping ensures better use of the TCM and is especially effective for models with large feature maps.
- DRAM access is reduced by 38% versus the full-feature-map schedule (DRAM/TCM access size in MB).
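A toy traffic model of the two schedules, with illustrative sizes (stripe-boundary halos and TCM-capacity limits are ignored, which is why this idealized bound is larger than the 38% reduction measured on the slide):

```python
# Toy DRAM-traffic model for the full-FM vs. vertical-striping schedules.
# FM_MB and LOOPS are illustrative assumptions, not the paper's measurements.
FM_MB = 20      # hypothetical per-layer feature-map size (MB)
LOOPS = 4       # four loops, as in the deck's full-FM example

# Full-FM schedule: every inter-loop feature map is written to and read back
# from DRAM, plus one model-input read and one model-output write.
full_fm_dram = 2 * FM_MB * (LOOPS - 1) + 2 * FM_MB
# Striped schedule: inter-loop exchange stays in TCM; DRAM sees only model I/O.
striped_dram = 2 * FM_MB
print(full_fm_dram, striped_dram)   # 160 vs 40 MB
```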

[Slide 39 of 44] Outline (section divider: Measurement Results)

[Slide 40 of 44] Chip Micrograph and Summary

  Technology: 3nm FinFET
  Area: 1.328mm x 1.032mm
  Supply voltage: 0.46-1.0V
  Frequency: 400-1300MHz
  Activation precision: INT12
  Weight precision: INT12
  On-chip memory: 1073KB (CIM: 156KB)
  # of MACs: 6336 (12b x 12b)
  Peak performance: 16.5 TOPS
  Peak area efficiency: 12.0 TOPS/mm2
  Energy efficiency: 23.2 TOPS/W (4K30, 0.46V); 16.3 TOPS/W (4K60, 0.55V)

- Die photo: Conv macros 1-11 with the Conv-core logic, CV core, and DMA.

[Slide 41 of 44] Measured Shmoo Plot
- Diverse operating range for different throughput needs (PASS region spans roughly 0.45-1.0V DVDD, 300-1300MHz): 400MHz at 0.46V (5.1 TOPS), 600MHz at 0.55V (7.6 TOPS), 900MHz at 0.65V, 1GHz at 0.75V, and 1.3GHz at 1.0V (16.5 TOPS).

[Slides 42-43 of 44] Comparison to State-of-the-Art
(Efficiencies compared non-sparse, normalized to 12b x 12b.)

                          VLSI'21[2]  ISSCC'22[3]    ISSCC'23[4]  ISSCC'23[5]  ISSCC'23[6]  This work
  Tech node (nm)          12          4              28           40           18           3
  MAC implementation      Digital MAC Digital MAC    Digital MAC  Digital MAC  Digital CIM  Digital CIM
  Application             Super Res.  MobileNet,TPU  DeepLab V3,  Video SR/FI/ CNN, LSTM,   Super Resolution*1 4K,
                                                     MobileNet/   NR 4K        RNN          Noise Reduction*1 4K
                                                     ResNet,Incep.
  Supply voltage (V)      0.5-0.8     0.55-1.0       0.66-1.3     0.64-1.06    0.525-1.0    0.46-1.0
  Frequency (MHz)         -           1196           100          -            1800         400-1300
  Activation precision    INT8        INT4/8/16,FP16 INT8         INT8         INT1-4       INT12
  Weight precision        INT8        INT4/8/16,FP16 INT8         INT8         INT1-4       INT12
  On-chip memory (KB)     -           994            -            -            CIM: 250     1073 (CIM: 156)
  Core area (mm2)         4.50        4.74           7.81         8.70         4.20         1.37
  # of MACs               2048 8bx8b  8192 8bx8b     1024 8bx8b   -            -            6336 12bx12b
  Peak perf.*2*3 (TOPS)   1.69        8.7            0.46         1.24         6.33         16.5
  Area eff.*2*3 (TOPS/mm2) 0.38       1.53           0.06         0.14         1.51         12.0
  Energy eff.*2 (TOPS/W)  2.22*4      5.15           3.33-4.98    2.67*4       4.44*4       23.2 4K30*5*6 / 16.3 4K60*5*7

  *1: Production-ready model.
  *2: One operation (OP) = one multiplication or one addition, normalized to 12-bit activation and weight.
  *3: Measured at the highest-performance operating point.
  *4: Reports only the most non-sparse point, aligning the evaluation with non-sparse features and weights.
  *5: Weight and feature-map sparsity <= 2%.
  *6: 0.46V, 400MHz. *7: 0.55V, 600MHz.

---

[The transcript jumps here to paper 20.2; the closing slides of 20.1 and the opening slides of 20.2 are missing from the extraction.]

20.2: A 28nm 74.34TFLOPS/W BF16 Heterogenous CIM-Based Accelerator Exploiting Denoising-Similarity for Diffusion Models
2024 IEEE International Solid-State Circuits Conference

[Slide 5 of 38] What is a Diffusion Model? (forward process, partial)
- A fixed diffusion chain q(x_t | x_{t-1}) gradually noises the data; the data converges to complete noise, e.g. N(0, I).
- Forward diffusion process (fixed): fixed Gaussian noise.

[Slide 6 of 38] What is a Diffusion Model?
- Aims to learn to generate data by denoising.
- Reverse step: a reverse chain p(x_{t-1} | x_t) turns noise into an image; DNNs (U-Net, Transformer) are used to predict p(x_{t-1} | x_t).
- Note that this is the "generation" procedure: a reverse denoising process (generative) with a learnt denoising model.

[Slide 7 of 38] Accumulated Quantization Errors Across Steps
- With a low-quantization diffusion model (LQDM), each denoising step x_T -> x_{T-1} -> ... -> x_0 adds quantization error (Err_0, Err_1, ..., Err_T) that accumulates across the chain.
- Generation is also very slow: about 85,000 ms for the 4GB Stable Diffusion v1.5 model.

[Slide 8 of 38] Opportunity of Denoising-Similarity
- Two adjacent denoised images have similar visual effects.
- Most input (IN) values are consistently clustered within a narrow range at all time-steps -> INT type (dense INT-IN).
- The remaining sparse IN values are relatively large with changing distributions -> FP type (sparse FP-IN); their max. and min. values depend on the time-step (t).
- The difference of IN between two adjacent denoising steps therefore splits into dense INT-IN plus sparse FP-IN. [Figures: distribution of IN in adjacent DMs; log-axis exponent histogram.]

[Slide 9 of 38] Outline
- Background and Motivation
- Design Challenge of a Diffusion-Based CIM Accelerator
- Proposed Diffusion Accelerator
  - Sign-Magnitude radix-8 Booth CIM macro (SMB-CIM)
  - Four-operand Exponent CIM with auxiliary mantissa PE
  - In-Memory Redundancy-Search-Elimination
- Measurement Results
- Conclusion
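The fixed forward chain on the earlier slides has the standard closed form x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, with abar_t the running product of (1 - beta_s); as t grows, the signal weight decays and x_t approaches N(0, I). A sketch using a common linear beta schedule (an illustrative choice, not taken from the paper):

```python
import math

# Closed-form forward diffusion: the signal coefficient sqrt(abar_T) collapses
# toward zero, so x_T converges to pure N(0, I) noise, as the slide states.
# The linear beta schedule below is illustrative, not the paper's.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = 1.0
for beta in betas:
    abar *= 1.0 - beta
print(f"signal weight sqrt(abar_T) = {math.sqrt(abar):.1e}")  # effectively 0
```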

- INT-IN acceleration is critical for efficiency improvement.
- Conventional digital CIM loads the input in a bit-serial style: cycle count = input bit count.
- Reducing latency requires an additional adder tree to handle one more input bit, a large area overhead.
[Figure: 1b-input CIM vs. CIM with 2b-parallel input, comparing transistor count per bit and power]

Challenge 2: Higher Latency in FP-CIM
- CIM cannot process FP-IN at high speed like INT-IN.
- Direct FP alignment needs bit-serial exponent addition plus mantissa multiplication by repetitive addition: 4-100 CIM cycles (FP16), 8 CIM cycles (BF16).
- Lengthy alignment cycle latency; repeated access to handle high-precision mantissas.
[Figure: direct FP alignment vs. direct FP logic datapaths]

Challenge 3: Sparse Denoising in FP-CIM
- How to support identifying & utilizing sparse FP-IN?
- Conventional: row-by-row off-CIM zero-detect, with no sparsity optimization.

Energy in 50 iterative denoising steps:
| Dataset          | Total Compute | Redundant | Redundant Ratio |
|------------------|---------------|-----------|-----------------|
| CIFAR-10 (32×32) | 63.8 mJ       | 31.9 mJ   | 49.1%           |
| ImageNet (64×64) | 571.4 mJ      | 365.8 mJ  | 64.3%           |
| CelebA (64×64)   | 544.2 mJ      | 353.7 mJ  | 65.6%           |
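The bit-serial bottleneck of Challenge 1 can be sketched with a small behavioral model (my own illustrative code, not the paper's circuit): each cycle broadcasts one input bit-plane into the weight array, so the cycle count equals the input bit-width regardless of the data.

```python
import numpy as np

def bit_serial_mac(inputs, weights, in_bits=10):
    """Behavioral model of a bit-serial digital-CIM dot product.

    Each cycle broadcasts one input bit-plane to the array; the
    adder-tree result is shifted and accumulated, so cycles == in_bits.
    """
    acc = 0
    cycles = 0
    for b in range(in_bits):                  # one cycle per input bit
        bit_plane = (inputs >> b) & 1         # LSB-first bit-plane
        acc += int(bit_plane @ weights) << b  # adder tree + shift
        cycles += 1
    return acc, cycles

rng = np.random.default_rng(0)
x = rng.integers(0, 2**10, size=64)           # unsigned 10b inputs
w = rng.integers(-8, 8, size=64)              # small signed weights
ref = int(x @ w)
out, cycles = bit_serial_mac(x, w)
assert out == ref and cycles == 10            # latency scales with bit-width
```

This is why processing more input bits per cycle (as the radix-8 Booth encoding later does) directly cuts the cycle count.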

- Similarity between two adjacent denoising steps yields FP-format sparsity: if E_IN,k = 0, then IN_k = 0 and IN_k·W = 0.
- Conventional CIM with sparse E_IN: an input driver performs a zero-detect (= 0?) row by row over data rows #0-#N under external sparsity control; there is no in-CIM sparsity optimization.

Overall Architecture
- The compute flow resembles video inter-frame processing.
- CIM Core #0 (SMB-CIM core) handles the dense INT-ΔIN; Core #1 (4Op-ECIM with ManPE) handles the sparse FP-ΔIN.
- Inter-denoising workflow (layer k, time-step t): a precision separator splits ΔIN(k,t) = IN(k,t+1) − IN(k,t) into dense INT-ΔIN and sparse FP-ΔIN; the reference and difference diffusion results are recovered, non-linearly activated, and IN(k+1,t) is updated.
- IN(k,t+1)·W = (IN(k,t) + ΔIN)·W = IN(k,t)·W + (INT-ΔIN·W) + (FP-ΔIN·W)

Overall Architecture: a heterogeneous CIM-based architecture exploiting denoising-similarity, with three design features:
- Feature 1: sign-magnitude radix-8 Booth CIM (SMB-CIM) core for INT-IN; SMB macros #0-#5 with a radix-8 Booth encoder, data encoders, 256×4 8T sign-magnitude weight MAC arrays, positive/negative adders, and local accumulators #0-#15.
- Feature 2: 4Op-ECIM with ManPE for FP-IN; 4Op-ECIM macros (32Kb, 256×128 8T) with near-memory product exponent alignment and mantissa-MAC units ManPE0-ManPE15.
- Feature 3: in-memory redundancy-search-elimination for sparse FP-IN, using a sparse exponent index buffer and a sparsity-aware compute-allocator (SACA).
- An output memory with a 36b global accumulator collects results; the 4Op-ECIM supports compute, search, and LTA modes.

Sign-Magnitude Radix-8 Booth CIM (SMB-CIM)
- Structure: 48 MAC arrays and 16 accumulators; MAC with sign-magnitude weights and radix-8 Booth-encoded input.
- Each MAC array: 64 input decoders, 64 weight banks (a bank: 4×4 8T cells), 64 shifters (×2/×1/×0), 64 demuxes, 2 unsigned adder trees (one for

positive addition, one for negative addition), and a 12b subtractor.
- Low toggle activity: normally distributed weights have a high occurrence of small values.
- Requires two unsigned adder trees, a large area overhead.
[Figure: MAC array with SM weights; per-bank 4×4 8T cells, ×2/×1/×0 shifters, and dual unsigned adder trees (DUAT); |W| is the magnitude of the SM-represented W]

Sign-Magnitude Radix-8 Booth CIM (SMB-CIM)
- SMB-CIM utilizes radix-8 Booth encoding for the input: a 4-bit octal group A[k+3:k] is processed per cycle, reducing cycles by nearly 67%.
- It fully utilizes the two unsigned adder trees, increasing area efficiency.
- Partial product (PP) of A[k+3:k]: PP = W·(A_k + A_{k+1} + 2·A_{k+2} − 4·A_{k+3})
- Key feature: PP = a·W − b·W with a, b ∈ {0, 1, 2, 4}, covering the products 0W, 1W, 2W, 3W, 4W with a 4b-parallel CIM input.

Operation of the two unsigned adder trees for radix-8 Booth input and SM weight (entries listed for A_{k+3} = 0 / 1; W_S = sign of the SM weight):
- A_{k+2:k} = 0,0,0: product 0 / −4|W|; Pos. AT 0; Neg. AT 0 / 4|W|
- A_{k+2:k} = 0,0,1 or 0,1,0: product |W| / −3|W|; Pos. AT |W|; Neg. AT 0 / 4|W|
- A_{k+2:k} = 0,1,1 or 1,0,0: product 2|W| / −2|W|; Pos. AT 2|W| / 0; Neg. AT 0 / 2|W|
- A_{k+2:k} = 1,0,1 or 1,1,0: product 3|W| / −|W|; Pos. AT 4|W| / 0; Neg. AT |W|
- A_{k+2:k} = 1,1,1: product 4|W| / 0; Pos. AT 4|W| / 0; Neg. AT 0
Example: Dec#0 with A0[k+3:k] = 1001 and W_S0 = 0 yields the product −3|W0| (zeroP = 0, zeroN = 0, wP = 1, wN = 0): |W0| goes to the positive adder tree and 4|W0| to the negative adder tree. Dec#1 with A1[k+3:k] = 0011 and W_S1 = 0 yields 2|W1| (zeroP = 0, zeroN = 1): 2|W1| goes to the positive adder tree and zero to the negative one.
Note: zeroP (zeroN) = 1 means zero is sent to the positive (negative) adder tree; wP (wN) = 1 means |W| is sent to the positive (negative) adder tree.

Conventional CIM vs. Proposed SMB-CIM
- 2.8-to-3.3× speedup with 1.04-to-1.47× power saving under different input distributions, and 1.61× higher area efficiency.
[Figure: power vs. standard deviation of the data (10/25/42/64/96) for 2C/2C, SM/SM, and SM/Booth8 weight/input representations; normalized area efficiency 0.73× (SM/SM) vs. 1.61× (SM/Booth8); 3.33× speedup of SM/Booth8 over the bit-serial 2C/2C architecture]
Notes: area efficiency = throughput/area, evaluated using INT10-IN at 0.9V, 200MHz; power measured at 0.9V, 400MHz with standard deviation = 10; post-synthesis gate-level simulation at 0.9V, 250MHz.

Four-Operand Exponent CIM with ManPE
- Processes FP-IN; the architecture is similar to [VLSI'21].
- The 4Op-ECIM processes exponents, achieving balanced latency with the SMB-CIM.
[Figure: 4Op-ECIM macro with 256×128 8T input & weight exponent cells in subarrays (WLL/WLR/RBL/BL/BLB), row/column drivers & control, sparsity index buffer, mantissa buffer, ManPE (8-bit multipliers for M_W×M_IN, a 32-bit adder, and an accumulator), and near-memory exponent aligners (NMEA #0-#127) with 20T-FS full adders and a comparator producing EP,max]

In-Memory Redundancy-Search-Elimination
- The 4Op-ECIM is reconfigured as a CAM to search for sparse E_IN.
- WLL is off; WLR is set to 1; RWL is set to 0: is RBL discharging?
- Mismatch (E_IN ≠ 0): RBL discharges. Match (E_IN = 0): RBL keeps high.
Operation table:
- E_IN[m−1:0] = 0: BLB keeps high, RBL keeps high; match, sparse (1)
- E_IN[m−1:0] ≠ 0: BLB keeps high, RBL discharges; no match, not sparse (0)
- Unused BL, WLL, and SAs are gated off.
Improvement of the in-memory redundancy search vs. a compare-after-read baseline (evaluated with BF16 data): 8× lower latency and 4.64× lower energy.

In-Memory Redundancy-Search-Elimination: sparsity optimization
- Sparse E_IN is computed in BL/BLB/RBL; the connected sense amplifiers are disabled (SAE = 0 disables SA0-SA3) and no precharge is issued before the read.
- Conventional sparsity optimization for the ManPEs and mantissa memory: if E_IN,k = 0, then IN_k = 0 and IN_k·W = 0, so the ManPE (M_IN,k × M_W) and the 4Op-ECIM (E_IN,k + E_W) operations are skipped.
- Sparse E_IN is read via RBL, additionally eliminating the RBL precharge.
- SACA allocates the sparser of the two rows of E_IN to be computed via RWL & RBL, the power-saving path.
[Figure: normalized energy with sparsity optimization, evaluated in a BF16 DM with 61.7% FP-IN: baseline (no optimization) vs. ManPE skip, 4Op-ECIM skip, SA/RBL disable, and SACA]

Chip Photograph and Summary
- Technology: 28nm CMOS; chip area: 1.87mm × 1.96mm.
- CIM macro area: SMB-CIM 0.90mm × 0.25mm (×6); 4Op-ECIM 0.36mm × 0.075mm (×2).
- Precision: FP16/BF16 weight and input; hybrid INT10-IN, FP16-IN.
- Voltage: 0.6V-1.0V; frequency: 50MHz-540MHz.
- CIM size: 288kb SMB-CIM + 64kb 4Op-ECIM; SRAM size: 200KB.
- Chip power: 8.268mW at 0.6V, 50MHz; 171.0mW at 1.0V, 540MHz.
- System peak performance*1: 6.636 TFLOPS (BF16), 4.424 TFLOPS (FP16).
- SMB-CIM macro energy efficiency*2: 80.3 TFLOPS/W (FP16).
- System energy efficiency*2: 74.34 TFLOPS/W (BF16), 67.89 TFLOPS/W (FP16).
- Area efficiency*2: 1.808 TFLOPS/mm² (BF16), 1.205 TFLOPS/mm² (FP16).
- Energy of 50 iterative DMs*3: 230.88 mJ (BF16), 252.81 mJ (FP16).
Notes: 1. measured at the highest-performance point, 1.0V, 540MHz; 2. measured at the highest energy-efficiency point, 0.65V, 120MHz; 3. only the core module of the diffusion model (e.g., the U-Net module); the auto-encoder and decoder computations are excluded.
Area breakdown: PLL 4.70%, weight buffer 8.29%, SMB-CIM 60.27%, 4Op-ECIM 8.75%, global buffer 12.44%, ManPEs 5.37%, Ctrl 0.17%.
Power breakdown: Ctrl 1.36%, weight buffer 14.41%, SMB-CIM 36.02%, 4Op-ECIM 2.88%, global buffer 24.63%, ManPEs 19.30%, PLL 1.42%.

Test Platform and Voltage-Frequency Scaling
- Test setup: PC, FPGA, and test chip with a DC power supply, oscilloscope, and logic analyzer (control/result paths, logic and state signals, supply power).
- Peak energy efficiency: 74.34TFLOPS/W for BF16 at the highest energy-efficiency point (0.65V, 120MHz).
- Chip power consumption: 8.268mW at 0.6V, 50MHz; 171.0mW at 1.0V, 540MHz.
[Figure: voltage-frequency scaling over 0.6/0.65/0.72/0.81/0.9/1.0V, up to 540MHz]

Performance Improvement
Measurement results (FP16 model precision, 50 steps):
| Dataset          | Baseline FID | Chip FID | Execution Time (s)*3 | Macro Energy Eff. (TFLOPS/W)*4 | System Energy Eff. (TFLOPS/W)*4 |
|------------------|--------------|----------|----------------------|--------------------------------|---------------------------------|
| ImageNet (64×64) | 20.73        | 20.78    | 18.380               | 71.98 (SMB-CIM)                | 65.73                           |
| CIFAR-10 (32×32) | 7.02         | 7.04     | 4.656                | 67.14 (SMB-CIM)                | 58.01                           |
| CelebA (64×64)   | 7.33         | 7.39     | 18.378               | 73.60 (SMB-CIM)                | 67.89                           |
Notes: 1. off-chip memory access is excluded; 2. FP16/BF16 model with INT10-IN and FP16/BF16-IN; 3. at 540MHz, 1.0V; 4. at 120MHz, 0.65V; 5. measured on the U-Net module of SD-v1.5.

Improvement breakdown analysis on Stable Diffusion v1.5 (SD-v1.5, FP16)*5:
[Figure: normalized energy and time per diffusion step for baseline vs. +INT/FP-ΔIN separation, +SMB-CIM, +4Op-ECIM]
*This chip is a heterogeneous CIM chip (SMB-CIM + 4Op-ECIM); it cannot achieve high performance

and energy efficiency with only one CIM technique, so the improvement breakdown for SMB-CIM and 4Op-ECIM cannot be shown separately.
- Denoising similarity: 3.12× energy reduction and 1.82× latency reduction.
- Heterogeneous CIMs: 4.0× energy savings and 3.3× speedup.

Comparison with the State-of-the-Art
| | ISSCC'23 [5] | ISSCC'22 [7] | VLSI'21 [6] | ISSCC'23 [1] | This work |
|---|---|---|---|---|---|
| Technology (nm) | 28 | 28 | 28 | 28 | 28 |
| Task | DNN training | Cloud inference | DNN training | DNN inference | Diffusion model |
| Data precision | INT4/8, FP16/BF16 | INT8/16, BF16, FP32 | BF16 | BF16 | INT10/16, FP16/BF16 |
| Chip area (mm²) | 4.54 | 6.69 | 5.8 | 0.146 (macro) | 3.67 |
| Supply voltage (V) | 0.469-0.9 | 0.6-1.0 | 0.76-1.1 | 0.6-0.9 | 0.6-1.0 |
| Frequency (MHz) | 10-400 | 50-220 | 250 | 81-182 | 50-540 |
| CIM size | 16kb | 96kb | 1280kb | 64kb | 288kb SMB-CIM, 64kb 4Op-ECIM |
| CIM function | FP-MAC | FP-MAC | Element-wise FP-multiply/add | FP-MAC | SMB-CIM: INT-MAC; ECIM: element-wise FP-add |
| CIM area (mm²) | 0.269 | Not reported | Not reported | 0.146 | SMB-CIM: 0.225; ECIM: 0.02625 |
| CIM macro energy eff. (TFLOPS/W) | 17.2-91.3 (FP16, block-wise sparsity) | 23.2 (BF16), 3.0 (FP32) | 13.7 (BF16) | 14.04-31.6 (50% input sparsity) | SMB-CIM: 80.3 (FP16) at 0.65V, 120MHz |
| System power (mW) | 0.87-74.9 | 12.5-69.4 | 1.2-156.1 | Not reported | 8.268-171.0 |
| System peak performance (TFLOPS) | Not reported for FP precision | 1.08 (BF16), 0.14 (FP32) | 0.119 (BF16) | N/A | 6.636 (BF16), 4.424 (FP16) at 1.0V, 540MHz |
| System peak energy eff. (TFLOPS/W) | 16.9 (FP16) | 29.2 (BF16), 3.7 (FP32) | 1.43 (BF16) | N/A | 74.34 (BF16), 67.89 (FP16) at 0.65V, 120MHz |
| Chip area eff. (TFLOPS/mm²) | Not reported | 0.160 (BF16), 0.021 (FP32) | 0.021 (BF16) | 2.05 (BF16, macro only) | 1.808 (BF16), 1.205 (FP16) at 1.0V, 540MHz |

Conclusion
- An energy-efficient heterogeneous CIM processor exploiting denoising-similarity for diffusion models.
- It separates all FP-IN into dense INT-ΔIN and sparse FP-ΔIN.
- Sign-magnitude radix-8 Booth CIM (SMB-CIM) core: processes the dense INT-IN with lower power and higher area efficiency.
- Four-operand exponent CIM (4Op-ECIM) with mantissa PEs: accelerates the sparse FP-IN, balancing latency with the SMB-CIM.
- In-memory redundancy-search-elimination: the 4Op-ECIM identifies stored sparse FP-IN with sparsity optimization.

20.3: A 23.9TOPS/W 0.8V, 130TOPS AI Accelerator with 16× Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications (© 2024 IEEE International Solid-State Circuits Conference)
Koichi Nose, Taro Fujii, Katsumi Togawa, Shunsuke Okumura, Kentaro Mikami, Daichi Hayashi, Teruhito Tanaka, Takao Toi; Renesas Electronics, Tokyo, Japan

Outline
- Background
- Architectural overview
- AI accelerator with flexible N:M pruning method
- Measurement results and comparison
- Example of real-time robot application
- Conclusion

Expansion of Embedded AI Applications
- Target and function: embedded systems with advanced environmental awareness and real-time judgment and control, e.g., human-collaborative robots to solve labor shortages and improve productivity.
- Human-robot interaction systems combine recognition (with AI) and planning/control (with non-AI): environment & human recognition, robot navigation with a SLAM algorithm, and robot arm control with an IK algorithm.

Performance and Power Requirements
- ×10 power efficiency and ×10 performance compared with embedded CPUs.

Heterogeneous Architecture with AI Accelerator
- AI accelerator (DRP-AI): MAC unit (MAC4K plus two MAC2K blocks with a MAC controller), 2MB feature-map buffer, 1MB weight RAM (weight and compression info), and DMACs.
- Two dynamically reconfigurable processors (DRP #0/#1), each with processing elements (216 ALUs), multipliers, dividers, SRAM, and a controller.
- CPUs, DRAM controller, and external LPDDR4 DRAM on an AXI interconnect.

Features of the DRP (Dynamically Reconfigurable Processor)
- Dynamically (within 1 cycle) changes the connections between arithmetic units and memories.
- Combines high performance with flexibility through reconfigurable hardware configurations: spatial plus time-multiplexed computing.

DRP-AI (Flexibility, H/W Switching)
- Flexible AI processing through co-operation between the MAC unit and the DRP, with the flexibility to respond to unexpected calculations during development.
- MAC unit: efficient processing of convolutional and fully connected layers (multiply-accumulate operations).
- DRP #0: complex intermediate operations such as SoftMax and pooling are dynamically performed in hardware, with flexible and fast processing by changing configurations.

Observation of Features in CNN Models
1) Unstructured pruning is required to obtain a high pruning rate (80-90%) while keeping accuracy; structured pruning reaches only 30-50%.
2) Large weights are distributed unevenly within a layer as well as across layers.

Issue of AI Processors for Unstructured-Pruning Models
- Conventional SIMD processing cannot reduce the number of operation cycles: existing hardware does not efficiently skip randomly placed unimportant weights, and the non-zero weight calculations must still be processed, so there is no cycle-time reduction.

Flexible N:M Pruning Method
- Split the weight matrix by M columns into weight groups.
- Select the N significant weights in each row and regroup them into the compressed weight (# of channels = M: 4 cycles for N=4, 1 cycle for N=1).
- The weight coordinates are kept in the compression info and used for the multiplication with the feature map.
- DRP-AI can flexibly change N for each weight group, considering the bias of important weights between groups.

Pruning Effect Comparison
| | General AI accel. (structured) | General AI accel. (unstructured) | 50% pruning arch. (ref [1]) | This work (DRP-AI)* |
|---|---|---|---|---|
| Pruning structure | structured | unstructured | unstructured | unstructured |
| Pruning rate | 0-30% | 0-90% | 0/50% | 0-93% |
| Weight data size | 1/3× | 1/10× | 5/8× | 1/10× |
| Cycle time | 1/3× | ×1 | 1/2× | 1/16× |
*M=16, N=variable. Regardless of the pruning rate, DRP-AI can flexibly process pruned models with the optimal number of cycles.

Optimization of Recognition Accuracy and Speed
- The larger M, the more randomly the pruning positions can be determined: smaller accuracy loss, but a trade-off against increased footprint and memory.
- Simulation results (ResNet50, FP32) for M = 4/8/16/32/∞ vs. ideal unstructured pruning: at M=16, accuracy degradation from the ideal case is small (within 1%) at the optimum pruning rate of most CNN models, while M=4/8 shows large degradation (including accumulator sharing, Tech. 2 in the proceedings).
- At M=32 the compression info becomes large, so M=16 is selected as the optimum:
| M  | Bits of compression info | Bits of compression info + weight |
|----|--------------------------|-----------------------------------|
| 32 | 5                        | 16* (+33% vs. M=16)               |
| 16 | 4                        | 12                                |
| 8  | 3                        | 12*                               |
| 4  | 2                        | 10                                |
*8-bit/4-bit SRAM space is used to store the 5-bit/3-bit data.

Improvement of Power Efficiency
- N:M pruning can reduce power to 1/8: eliminating wasted memory accesses and clock power outweighs the power overhead of the additional compression (N:M pruning) logic (see proceedings, Tech. 2).

Test Chip Microphotograph and Specifications
- Technology: 14nm, 11-metal; package: 2300-pin FCBGA.
- Core: AI accelerator (DRP-AI), reconfigurable processor (DRP), Arm CPU, etc.; external memory: LPDDR4 32-bit (3.2Gbps).
- MAC: max 996MHz; reconfigurable processor: max 420MHz.
- Peak performance: 8.16 dense TOPS (130.55 sparse TOPS); nominal VDD (core): 0.8V.
[Chip photo labels: DRP-AI (MAC, DRP), DRP, dual-CPU ×2, LPDDR4 interface, video in/out, USB interface]

Peak Performance & Power Efficiency of the AI Accelerator
- Performance at the nominal supply voltage (0.8V) is crucial for practical use.
- High performance for robot applications (up to 130TOPS) and high power efficiency for an embedded MPU (up to 23.9TOPS/W).
- Note: a single 3×3 convolution layer is used for the peak measurement.

Power efficiency with actual

AI models
- Power efficiency with actual AI models achieves 10TOPS/W without significant model dependency.
- The AI accelerator can efficiently skip zero weights across a variety of CNN models.

Demonstration
- Multiple AI applications with highly (70-90%) pruned models.
- Real-time evaluation on the DRP-AI test chip with no fan and no heat sink.
- Demonstrated at the ISSCC 2024 demo session (DS2).

Comparison with Prior Works
- With the pruning technology, higher power efficiency is achieved at the nominal supply

voltage.
| | ISSCC'20 7.1 | ISSCC'21 4.2 | ISSCC'21 9.2 | ISSCC'22 15.1 | VLSI'22 [5] | ISSCC'23 22.1 | This work |
|---|---|---|---|---|---|---|---|
| Process (nm) | 7 | 12 | 28 | 4 | 5 | 22 | 14 |
| Supply (V) | 0.575-0.825 | 0.8 | 0.6-0.9 | 0.55-1 | 0.46-1.05 | 0.5-0.8 | 0.696-0.8 |
| Memory (MB) | 2.176 | 0.2 | 2.048 | 0.141 | n/a | n/a | 3 |
| Datatype | INT8/16, FP16 | INT8 | INT8 | INT4/8/16, FP16 | INT4/8 | INT2-8 | INT8 |
| Peak perf. (TOPS) | 3.604 | 60.4 | 1.43 | 39.3 (4b), 19.7 (8b) | 3.6 (INT4), 1.8 (INT8) | 0.637 | 130.55* |
| Power (W) | 0.174 at 0.575V, 1.053 at 0.825V | 4.4 | 0.019 at 0.6V, 0.131 at 0.9V | 5.1 | 5.06 at 0.8V | n/a | 4.3 at 0.696V |
| Power eff. (TOPS/W) | 7.34 at 0.825V, 13.32 at 0.575V | 13.8 at 0.8V | 12.1 at 0.9V, 17.5 at 0.6V | 3.0 at 1V, 11.59 at 0.5V (INT8) | 8 at 1.05V (INT8), 39.1 at 0.46V (INT8) | 5.4 at 0.8V, 12.4 at 0.5V (INT2) | 23.9 at 0.8V, 28.2 at 0.696V |
*3×3 conv., 93% sparse, at nominal Vdd.

Processing Time Acceleration with the Reconfigurable Processor
- Pipeline stages (image processing, AI pre-processing, CNN inference, AI post-processing, output processing) are mapped across the DRP-AI (MAC & DRP), the DRP, and the CPU.
- Inference time (model: YOLOv2): 36.48ms on the CPU vs. 5.64ms, ×6.5 faster.
- The DRP can flexibly process a variety of streaming operations and runs fast like dedicated hardware; AI processing is ×6.5 faster than on the embedded CPU.
- Image-processing libraries optimized for the DRP also accelerate robot applications.

Example of a Real-Time Robot Application
- Visual SLAM (Simultaneous Localization And Mapping) with AI.
- Multi-threaded & pipelined operation using the heterogeneous architecture (DRP-AI, DRP, CPU).
- Wired-logic hardware with dynamic reconfiguration can run multiple algorithms with a small footprint, low latency, low power, and low jitter.
- Dataflow: sensor loading, image scaling, keypoint detection, feature extraction, tracking, and mapping, with CNN object detection offloaded in parallel.
- The DRP switches hardware configurations (data-paths 1/2/3 for image scaling and keypoint detection) within 1ms; data-path reconfiguration completes within 1 cycle: non-AI algorithm acceleration by the DRP.

- ×17 faster tracking than the CPU (sufficient time for real-time robot operation) and ×12 higher power efficiency compared with the embedded CPU; smaller jitter thanks to the wired hardware-based structure of the DRP.
[Figure: tracking time (ms) breakdown (CNN, image processing, tracking, pre/post) for original (all on CPU) vs. offloading CNN to DRP-AI (with 90% sparsity) vs. fully pipelined operation]
- Offload Image processing

214、to DRP820.3 A 23.9TOPS/W 0.8V,130TOPS AI Accelerator with 16Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications 2024 IEEE International Solid-State Circuits Conference30 of 32Outline Background Architectural overview AI accelerator with flexible N:M pr

215、uning method Measurement results and comparison Example of real-time robot application Conclusion20.3 A 23.9TOPS/W 0.8V,130TOPS AI Accelerator with 16Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications 2024 IEEE International Solid-State Circuits Confe

216、rence31 of 32ConclusionDevelopment of heterogeneous embedded MPU for real-time robot applications such as human-collaborative robots CNN inference:23.9TOPS/W and up to 16x performance acceleration with a flexible N:M pruning technology AI+Non-AI combined robot application:17x performance and 6x powe

217、r efficient than embedded CPU with reconfigurable wired logic hardware(DRP)20.3 A 23.9TOPS/W 0.8V,130TOPS AI Accelerator with 16Performance-Accelerable Pruning in 14nm Heterogeneous Embedded MPU for Real-Time Robot Applications 2024 IEEE International Solid-State Circuits Conference32 of 32Acknowled

218、gementThank you for your attention!Contact:Koichi Nose()This paper is based on results obtained from a project,JPNP16007,commissioned by the New Energy and Industrial Technology Development Organization(NEDO).20.3 A 23.9TOPS/W 0.8V,130TOPS AI Accelerator with 16Performance-Accelerable Pruning in 14n

219、m Heterogeneous Embedded MPU for Real-Time Robot Applications 2024 IEEE International Solid-State Circuits Conference33 of 32Please Scan to Rate This Paper20.4:A 28nm Physics Computing Unit Supporting Emerging Physics-Informed Neural Network and Finite Element Method for Real-Time ScientificComputin

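The flexible N:M pruning highlighted in the conclusion keeps at most N nonzero weights in every group of M. The slides do not state the selection rule, so this sketch uses the common magnitude-based choice as an assumed illustration (not the chip's exact algorithm); skipping the zeroed positions is what enables the performance acceleration:

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude weights in each consecutive group of m,
    zeroing the rest (flexible N:M sparsity; the magnitude rule is an assumption)."""
    pruned = []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        # indices of the n entries with the largest absolute value in this group
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

w = [0.1, -0.9, 0.05, 0.4, -0.2, 0.7, 0.0, -0.3]
print(nm_prune(w, 2, 4))  # -> [0.0, -0.9, 0.0, 0.4, 0.0, 0.7, 0.0, -0.3]
```

Because N and M are parameters rather than fixed by the datapath, the same routine covers the range of sparsity patterns ("flexible" N:M) that the accelerator exploits.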
20.4: A 28nm Physics Computing Unit Supporting Emerging Physics-Informed Neural Network and Finite Element Method for Real-Time Scientific Computing on Edge Devices — 2024 IEEE International Solid-State Circuits Conference
Yuhao Ju, Ganqi Xu, Jie Gu — Northwestern University, Evanston, IL

Outline
- Introduction and Motivation
- Algorithms and Architecture of Physics Compute Unit
- Hardware Architecture for Physics-Informed Neural Network
- Diverse Dataflows for Versatile PINN Models
- Input Mesh Data Compression
- Support of Finite Element Method with High Precision
- Measurement Results and Case Study
- Conclusions

Scientific computing on edge devices
- Scientific computing is mostly based on partial differential equations (PDEs), and there is strong demand for real-time physics-based computing on edge devices, e.g. virtual reality, robotics, manufacturing, hazard detection.
- Examples: VR/MR structural deformation (equilibrium equation; solid mechanics and thermodynamics), robot dynamics (Lagrange equation; inverse kinematics), power-system dynamic control (swing equation), additive-manufacturing monitoring (governing equation; fluid thermodynamics), and an e-nose for leak-gas tracking (Gaussian plume model and particle diffusion).

Challenges of existing solutions
- Conventional numerical solutions: Finite Element Method (FEM), Finite Difference Method (FDM), Finite Volume Method (FVM).
- Challenge 1: high power and area costs for PDE solvers.
- Challenge 2: failure to meet real-time latency requirements — e.g. VR/MR structural deformation on an 8450-element mesh takes on the order of 10³-10⁴ms on a GPU, far above the VR frame rate, which this work meets.
- Challenge 3: existing ASICs [J. Mu et al., ISSCC'21; T. Chen et al., JSSC'20] are limited to FDM and the Poisson equation only, and cannot deal with complex physical structures.

Emerging data-driven solution
- Emerging physics-informed neural networks (PINNs) offer 1900×-10000× speedup over conventional solvers across weather-model, heat-sink, wind-power, and fluid-dynamics workloads (from NVIDIA Modulus).
- However, new challenges are observed. Challenge 4: highly diverse PINN models and dataflows. Challenge 5: a speed-accuracy tradeoff.

Proposed unified Physics Computing Unit (PhyCU)
- The first PINN solution for real-time scientific computing on edge devices.
- Jointly supports classic FEM as a high-accuracy option.
- Supports the majority of common physics-based analyses, e.g. fluid dynamics, robot control.
- Special sparsity, data-compression, and dataflow techniques achieve state-of-the-art efficiency and latency: up to ~800× energy reduction and a corresponding latency reduction versus a GPU and prior ASICs [JSSC'20, ESSCIRC'22].

PINN algorithm
- Uses feedforward NNs (CNN, FCN, RNN, etc.) as solvers for inference; the network maps coordinates (x, y, z, t, ...) to the physical status (velocities u, v, pressure p, temperature, ...).
- Physics constraints are applied in training for accurate and faster models; extra computing after the feedforward network evaluates the loss functions (second derivatives, etc.).
- Three additional loss functions are needed for backpropagation: boundary-condition (BC) loss, initial-condition (IC) loss, and physics loss.
- This work focuses only on inference of pre-trained PINN models.
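As a concrete (hypothetical) example of the physics loss above: for the 1-D heat equation u_t = α·u_xx, the PDE residual of a candidate solution can be estimated at collocation points. Real PINN training obtains the derivatives by automatic differentiation through the network; this sketch uses finite differences and an exact solution as a stand-in for the network output, purely to show the structure of the loss:

```python
import math

ALPHA = 0.1  # assumed diffusivity for this toy example

def u(x, t):
    """Stand-in for the network output; here the exact solution
    u = exp(-ALPHA * t) * sin(x), so the PDE residual should be ~0."""
    return math.exp(-ALPHA * t) * math.sin(x)

def physics_loss(points, h=1e-4):
    """Mean squared residual of u_t - ALPHA * u_xx at collocation points,
    with derivatives taken by central finite differences."""
    loss = 0.0
    for (x, t) in points:
        u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
        u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
        loss += (u_t - ALPHA * u_xx) ** 2
    return loss / len(points)

pts = [(0.3, 0.1), (1.0, 0.5), (2.0, 1.0)]
assert physics_loss(pts) < 1e-8  # exact solution -> near-zero physics loss
```

During training, this term is summed with the BC and IC losses; at inference time (the focus of this work) only the feedforward evaluation of `u` remains.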
FEM algorithm
- Element meshing of the object is needed for both FEM and PINN.
- FEM transforms the PDE into a linear equation system: basis-function selection and variational calculus/integration over the element mesh yield A·Th = B.
- The conjugate gradient (CG) method is selected to solve the equation system for Th, giving a 125× convergence speedup compared with prior ASICs (vs. Jacobi, Gauss-Seidel, hybrid check-board, and multi-grid iterative methods on a 16×16×16 Poisson application).
- CG fits well with the matrix operations used in PINN, and requires a symmetric matrix A (usually satisfied for FEM analysis).
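The CG iteration selected above can be sketched in a few lines. This is the textbook conjugate-gradient loop (not the PhyCU hardware schedule), shown for a small symmetric positive-definite system A·x = b:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A (lists of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at 0)
    p = r[:]                      # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # converged
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# 2x2 SPD system: [[4,1],[1,3]] x = [1,2]  ->  x = [1/11, 7/11]
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The dominant cost per iteration is the matrix-vector product `Ap`, which is why accelerating sparse matrix-vector multiplication (as the next slide's OBSAS does) directly speeds up the whole solver.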
Architecture of the Physics Computing Unit (PhyCU)
The proposed PhyCU includes:
- a 9×16 physics processing element (PHY-E) array on 144b buses, with scan I/O, control, VCO, and input data management;
- 4 different SRAM bank groups: input SRAM (78KB), parameter SRAM, top general-purpose (TGP, 288KB) and bottom general-purpose (BGP, 144KB) SRAM banks;
- an input data compression module (IDCM) with IDCM control and an adder-based generator, plus sparsity/power-saving features;
- an Offset-Based Sparsity Address Scheduler (OBSAS) to improve sparse matrix-vector multiplication in CG and PINN;
- two reconfiguration modes with separate accumulators: PINN mode for low latency and FEM mode for high accuracy.
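The slides do not detail the OBSAS storage format, but the sparse matrix-vector product it accelerates in CG and PINN can be illustrated with a generic offset-based (CSR-style) layout — a sketch of the operation, not the chip's actual scheduler:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in compressed sparse row form:
    values/col_idx hold the nonzeros; row_ptr[i]:row_ptr[i+1] spans row i."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # only nonzero MACs are issued
        y.append(acc)
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [1, 0, 2]]  (symmetric, as CG requires)
values = [4.0, 1.0, 3.0, 1.0, 2.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # -> [5.0, 3.0, 3.0]
```

FEM stiffness matrices are mostly zeros (each mesh element couples only neighboring nodes), so issuing work only for stored nonzeros is where the latency and energy savings come from.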
20.5: C-Transformer: A 2.6-18.1µJ/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models — 2024 IEEE International Solid-State Circuits Conference

SNN vs. DNN operations
- SNN: accumulation-based operations (threshold and post-spike generation). Pros: event-driven operation; low power for small input values.
- DNN: MAC-based operations. Pros: high accuracy; low power for large input values.
- C-DNN (ISSCC'23, 22.5) combines the two.

Complementary DNN (C-DNN) [I. Introduction]
- Allocates each workload to the DNN or the SNN depending on input magnitude, using layer-wise and tile-wise division (*ISSCC'23, C-DNN).
- 31% energy reduction for ImageNet classification with ResNet-18.

Complementary characteristics between SNN and DNN [I. Introduction]
- Small input data → high spike sparsity → the SNN domain is efficient.
- Large input data → low spike sparsity → the DNN domain is efficient.
- SNN power varies widely with input value (the number of weight accumulations tracks the spike count), whereas the DNN multiplier's power variation is narrow; for large input values, clock gating in the DNN multiplier saves power.
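The complementary allocation above can be sketched as a simple cost model: per tile, an SNN-style cost proportional to the spike count (which grows with input magnitude) is compared against a fixed MAC cost, and the cheaper domain wins. The thresholds and relative energies here are made-up illustration values, not silicon numbers:

```python
E_AC, E_MAC = 1.0, 5.0  # assumed relative energies: accumulate vs. multiply-accumulate

def spikes_needed(v, n_timesteps=15):
    """Rate-coded activations: larger values need more spikes/timesteps."""
    return min(n_timesteps, round(v * n_timesteps))

def allocate_tile(activations):
    """Return ('SNN'|'DNN', cost): run the tile in whichever domain is cheaper."""
    snn_cost = sum(spikes_needed(a) * E_AC for a in activations)
    dnn_cost = len(activations) * E_MAC
    return ('SNN', snn_cost) if snn_cost < dnn_cost else ('DNN', dnn_cost)

small_tile = [0.05, 0.1, 0.0, 0.2]   # mostly-small values -> few spikes -> SNN
large_tile = [0.9, 0.8, 1.0, 0.7]    # large values -> many spikes -> DNN
print(allocate_tile(small_tile))     # chooses 'SNN'
print(allocate_tile(large_tile))     # chooses 'DNN'
```

The same comparison applied at layer granularity gives the layer-wise division of the original C-DNN, and at tile granularity the tile-wise division.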
Performance of the previous C-DNN processor [I. Introduction]
- C-DNN achieves state-of-the-art energy efficiency together with high accuracy. [Table: inference with C-DNN at 50MHz, 0.7V — accuracy (%), power (mW), and energy efficiency (TOPS/W) for VGG16/ResNet-18 on CIFAR-10 and CIFAR-100 (4b×4b) and VGG16/ResNet-50/MobileNet on ImageNet (8b×8b), e.g. up to 94.1% CIFAR-10 accuracy and up to 85.8TOPS/W; CNN, SNN, and C-DNN efficiencies compared per benchmark.]
- In C-DNN, the SNN path shows the same accuracy as the DNN path with 1/3 of the energy consumption.
- C-Transformer extends C-DNN to realize an ultra-low-power on-device large-language-model system.

On-device Large Language Models (LLMs) [I. Introduction]
- Low power consumption is required due to the battery limitation.
- LLMs require 43× the computation (GOP) and 18× the parameters of CNNs (mT5, T5, GPT-2 vs. ResNet-50, VGG16).
- C-DNN is adopted for the on-device LLM to address these issues.
Challenge 1: high reconfigurability is required [II. Motivation]
- The SNN requires only accumulations while the DNN requires MACs, so in a heterogeneous architecture (dedicated SNN cores for ST, dedicated DNN cores for DT) part of the array idles: at a high ST/DT ratio the DNN cores are idle, and at a low ST/DT ratio the SNN cores are idle (*ST: Spiking-Transformer, DT: DNN-Transformer).
- The ST/DT ratio differs for each layer and model. For language modeling with GPT-2, the ST ratio varies from 45% to 98%; for translation with mT5, it varies from 48% to 99% (across the X·WQ, X·WK, X·WV, Q·Kᵀ, A·V, projection, and FC workloads of each decoder block, including masked self-attention and cross-attention).
- With model compression, the number of strongly related tokens in the attention map varies, so the ratio of large to small values changes dynamically and core utilization drops to 32-36%.
- → A reconfigurable homogeneous DNN-Transformer/Spiking-Transformer core is proposed to increase hardware utilization.

Challenge 2: large external memory access [II. Motivation]
- LLM weights have low sparsity, unlike vision transformers: pruning language models (Longformer, Routing Transformer) quickly degrades BLEU, while vision transformers (DeiT-Base, DeiT-Small) tolerate high sparsity with little accuracy loss.
- This causes large external memory access (EMA) and a large memory footprint: EMA accounts for 68% of the LLM system's energy breakdown, with transformer computation taking the remainder.
Limitation of previous transformer ASICs [II. Motivation]
- EMA energy consumption was excluded in many prior ASICs. [Table: supply voltage (V), technology (nm), frequency (MHz), die area (mm²), precision, performance (TOPS or TFLOPS), and chip energy efficiency (TOPS/W or TFLOPS/W) for five prior designs (ISSCC'23 papers 4, 5, 10; ISSCC'22 papers 2, 3); benchmark efficiencies are reported without EMA — e.g. 25.22TOPS/W (INT8, Adaptive-Span), 8.24TFLOPS/W (FP8), 8.20TOPS/W (ImageNet ViT-B) — and system energy consumption with EMA is N/A in all cases.]
- → We propose 3-stage compression to reduce the large energy consumption due to external memory access.
Proposed processor: C-Transformer — two optimizations [III. C-Transformer Architecture]
- Memory-access optimization: 3-stage compression for external-memory-access reduction.
- Computation optimization: a reconfigurable homogeneous core with a hybrid multiplication/accumulation unit to reduce the computation energy of C-DNN, and output spike speculation for early stopping of the Spiking-Transformer.

Overall architecture of C-Transformer
- 48 homogeneous DT/ST cores (HDSC) for high utilization, connected through a top controller, a network-on-chip (NoC), and a 1-D SIMD core.
- 2 weight generators (WG) for high EMA power reduction, each containing an extended-sign decompression unit, IDX/MSB/Sign/LSB memories (4/4/5/40KB), a router, an implicit weight generation unit (IWGU) with a 2-D MAC array, a W-transformer memory (80KB), and kernel embedding logic.
- Each HDSC contains a workload allocator, a NoC switch, an input loader, input and output memories (IMEM/OMEM), a controller, and an array of hybrid multiplication/accumulation units (HMAU) for reconfigurability; the HMAU's full-adder array is split into left and right parts with a reconfigurable connection, performing 1 MAC operation in DT-mode or 8 AC operations in ST-mode via a connection change.
- An output spike speculation unit enables low-power ST processing.
297、4 IEEE International Solid-State Circuits Conference22 of 61Detail Architecture of HMAU Reconfigurable Connection(RC)is integrated into multiplierIII.C-Transformer ArchitectureW03W02W01W02W01W00W01W00W00HA3,0HA1,3W03W02FA2,2FA2,3W03W02W01FA3,1FA3,2W03HA1,0FA1,1FA1,2HA2,0FA2,1W00FA3,3P7P6P5P4P3P2P1P0

298、W03W02W01W02W01W00W01W00W00000HA3,0HA1,3W03W02FA2,2FA2,3W03W02W01FA3,1FA3,2W03RC(Sel of Multiplexers)HA1,0FA1,1FA1,2HA2,0FA2,1W00FA3,3P7P6P5P4P3P2P1P0I00I01I02I03I00I01I02I0320.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Imp

299、licit Weight Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference23 of 61N-bitMultiplierI03:01 Mult.for CNN(W0 IN0)IN0(N-bit)W0(N-bit)W0&IN00W0&IN01W0&IN02W0&IN03W0&IN04W0&IN0N-1+W0 IN0*InputWeightIN Channels W03:0W03:0W03:0W03:0Detail Architecture of HMAU in

300、DT-Mode DT-Mode:1 multiplication b/w N-bit input and N-bit weightIII.C-Transformer Architecture*RC=Reconfigurable ConnectionW03W02W01W02W01W00W01W00W00000HA3,0HA1,3W03W02FA2,2FA2,3W03W02W01FA3,1FA3,2W03RC(Sel of Multiplexers:1b0)HA1,0FA1,1FA1,2HA2,0FA2,1I00I01I02I03W00FA3,3P7P6P5P4P3P2P1P0W03:0W03:0

301、W03:0W03:020.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference24 of 61N-bitMultiplierI03:01 Mult.for DNN(W0 IN0)IN0(N-bit

302、)W0(N-bit)W0&IN00W0&IN01W0&IN02W0&IN03W0&IN04W0&IN0N-1+W0 IN0*InputWeightIN Channels W03:0W03:0W03:0W03:0Detail Architecture of HMAU in DT-Mode DT-Mode:1 multiplication b/w N-bit input and N-bit weightIII.C-Transformer Architecture*RC=Reconfigurable ConnectionW03W02W01W02W01W00W01W00W00000HA3,0HA1,3

303、W03W02FA2,2FA2,3W03W02W01FA3,1FA3,2W03RC(Sel of Multiplexers:1b0)HA1,0FA1,1FA1,2HA2,0FA2,1I00I01I02I03W00FA3,3P7P6P5P4P3P2P1P0W03:0W03:0W03:0W03:0Simplified20.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Gener

304、ation for Large Language Models 2024 IEEE International Solid-State Circuits Conference25 of 61N x N-bitAccumsW03:0W13:0W23:0W33:0+Partial SumW0&S0t W1&S1tW2&S2tW3&S3tW4&S4tWN-1&SN-1tSN-1tS1tS0tt0t1t2t3N Accum for SNN(i=0N-1Sit&Wi)IN Channels*InputWeightS0tS1tS2tS3tW0&S0W1&S1W2&S2W3&S3Detail Archite

305、cture of HMAU in ST-Mode Step1:Allocating different channels for different rowsIII.C-Transformer Architecture*RC=Reconfigurable ConnectionW03W02W01W12W11W10W21W20W30000HA3,0HA1,3W23W22FA2,2FA2,3W33W32W31FA3,1FA3,2W13RCHA1,0FA1,1FA1,2HA2,0FA2,1S0tS1tS2tS3tW00FA3,3Issue:W03+W12+W21+W30 Accumulation b/

306、w different bit positionW03:0W13:0W23:0W33:020.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference26 of 61N x N-bitAccumsW0

307、3:0W13:0W23:0W33:0+Partial SumW0&S0t W1&S1tW2&S2tW3&S3tW4&S4tWN-1&SN-1tSN-1tS1tS0tt0t1t2t3N Accum for SNN(i=0N-1Sit&Wi)IN Channels*InputWeightS0tS1tS2tS3tW0&S0W1&S1W2&S2W3&S3Detail Architecture of HMAU in ST-Mode Step1:Allocating different channels for different rowsIII.C-Transformer Architecture*RC

308、=Reconfigurable ConnectionW03W02W01W12W11W10W21W20W30000HA3,0HA1,3W23W22FA2,2FA2,3W33W32W31FA3,1FA3,2W13RCHA1,0FA1,1FA1,2HA2,0FA2,1S0tS1tS2tS3tW00FA3,3Issue:W03+W12+W21+W30 Accumulation b/w different bit positionW03:0W13:0W23:0W33:0Simplified20.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Trans

309、former/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference27 of 61N x N-bitAccumsW03:0W13:0W23:0W33:0+Partial SumW0&S0t W1&S1tW2&S2tW3&S3tW4&S4tWN-1&SN-1tSN-1tS1tS0tt0t1t2t3N Accum for

310、SNN(i=0N-1Sit&Wi)IN Channels*InputWeightS0tS1tS2tS3tAlignment Required!Accum.W0&S0W1&S1W2&S2W3&S3Detail Architecture of HMAU in ST-Mode Step1:Allocating different channels for different rowsIII.C-Transformer Architecture*RC=Reconfigurable ConnectionW03W02W01W12W11W10W21W20W30000HA3,0HA1,3W23W22FA2,2

311、FA2,3W33W32W31FA3,1FA3,2W13RCHA1,0FA1,1FA1,2HA2,0FA2,1S0tS1tS2tS3tW00FA3,3Issue:W03+W12+W21+W30 Accumulation b/w different bit positionW03:0W13:0W23:0W33:0Simplified20.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Wei

312、ght Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference28 of 61+Partial SumW0&S0t W1&S1tW2&S2tW3&S3tW4&S4tWN-1&SN-1tSN-1tS1tS0tt0t1t2t3N Accum for SNN(i=0N-1Sit&Wi)IN Channels*InputWeightN x N-bitAccumsS0tS1tS2tS3tW03:0W13:0W23:0W33:0AlignAlignAlignDetail Arc

313、hitecture of HMAU in ST-Mode Step2:Bit position alignment to adjust bit position in columnIII.C-Transformer Architecture*RC=Reconfigurable ConnectionW03W02W01W13W12W11W23W22W33000HA3,0HA1,3W21W20FA2,2FA2,3W32W31W30FA3,1FA3,2W10RCHA1,0FA1,1FA1,2HA2,0FA2,1S0tS1tS2tS3tW00FA3,3Aligned WAligned WBit Posi

314、tion AlignmentW1Aligned W10 3 2 13 2 1 0Simplified20.5:C-Transformer:A 2.6-18.1J/Token Homogeneous DNN-Transformer/Spiking-Transformer Processor with Big-Little Network and Implicit Weight Generation for Large Language Models 2024 IEEE International Solid-State Circuits Conference29 of 61+Partial Su

Detail Architecture of HMAU in ST-Mode, Step 3: Aggregation between right and left part results (slide 29 of 61)
III. C-Transformer Architecture (*RC = Reconfigurable Connection)
Column accumulations run separately over the high bits and the low bits of the weights; with the multiplexer select set to 1'b1, the two halves produce PSUM_High and PSUM_Low, which are added into a single PSUM.
[Figure: the simplified array split into a high-weight-bit half and a low-weight-bit half, each with its own column accumulations feeding PSUM_High and PSUM_Low.]

Detail Architecture of HMAU in ST-Mode, Step 4: Carry aggregation for the complete accumulation result (slide 30 of 61)
III. C-Transformer Architecture (*RC = Reconfigurable Connection)
A final carry aggregation merges PSUM_High and PSUM_Low (multiplexer select 1'b1) into the complete partial sum PSUM.
[Figure: the same simplified array with the carry-aggregation adder combining PSUM_High and PSUM_Low into PSUM.]
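Steps 3 and 4 can be modeled numerically as a split-then-merge of the partial sums; a minimal sketch (the 2b/2b split point of the 4b weights is an assumption for illustration):

```python
# Column accumulations over the high and low weight bits yield PSUM_High and
# PSUM_Low; the carry aggregation shifts and adds them into the complete PSUM.
# The 2b/2b split of the 4b weights is an assumption for illustration.

def psum_split(weights):
    psum_high = sum(w >> 2 for w in weights)   # upper 2 bits of each weight
    psum_low = sum(w & 0b11 for w in weights)  # lower 2 bits of each weight
    return psum_high, psum_low

def carry_aggregate(psum_high, psum_low):
    # The high half sits 2 bit positions above the low half.
    return (psum_high << 2) + psum_low

hi, lo = psum_split([5, 9, 3, 7])
print(carry_aggregate(hi, lo))  # equals 5 + 9 + 3 + 7 = 24
```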

Weight Feeding Logic of HMAU. Issue: weight bandwidth is unbalanced between the two modes, a factor-of-N difference (slide 31 of 61)
III. C-Transformer Architecture (*RC = Reconfigurable Connection)
In DT-mode (multiplexer select 1'b0) the array reuses a single weight, so only 1 weight is required per cycle; in ST-mode (multiplexer select 1'b1) every row holds a distinct per-channel weight, so N weights are required at once.
[Figure: the two array configurations side by side, DT-mode with inputs I00-I03 sharing one weight W0 and ST-mode with spikes S0^t-S3^t and per-row weights W0[3:0]-W3[3:0].]

Weight Feeding Logic of HMAU. DT-Mode: feeding 1 weight cycle by cycle (broadcasting) (slide 32 of 61)
III. C-Transformer Architecture
[Figure: the weight feeder broadcasting the same weight to every row in each cycle.]

Weight Feeding Logic of HMAU. ST-Mode: feeding N weights at the same time (unicasting) (slide 33 of 61)
III. C-Transformer Architecture
[Figure: the weight feeder delivering N distinct weights, one per row, in a single cycle.]
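The bandwidth gap between the two feeding schemes can be sketched as follows (generator names and N = 4 are illustrative):

```python
# DT-mode broadcasts one weight to all N rows each cycle; ST-mode unicasts N
# distinct per-channel weights in a single cycle, an N-fold difference in
# weight bandwidth. Function names here are illustrative.

N = 4

def feed_dt(weight_stream):
    """DT-mode: one weight per cycle, broadcast to every row."""
    for w in weight_stream:
        yield [w] * N

def feed_st(weight_banks):
    """ST-mode: a full bank of N per-channel weights per cycle."""
    for bank in weight_banks:
        assert len(bank) == N
        yield list(bank)

print(next(feed_dt([7])))             # [7, 7, 7, 7]
print(next(feed_st([[5, 9, 3, 7]])))  # [5, 9, 3, 7]
```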

Performance of HMAU: the homogeneous architecture increases energy efficiency by 59% (slide 34 of 61)
III. C-Transformer Architecture
[Figure: bar charts comparing the homogeneous and heterogeneous architectures: core utilization (%) on GPT-2, T5m, and T5 (labels 34%, 36%, 32%) and energy efficiency (TOPS/W), showing the 59% gain.]

Proposed Processor: C-Transformer, 2 Different Optimizations (slide 35 of 61)
III. C-Transformer Architecture
Memory-Access Optimization: 3-stage compression for external memory access reduction.
Computation Optimization: a reconfigurable homogeneous core with the Hybrid Multiplication Accumulation Unit (HMAU) to reduce the computation energy of C-DNN, plus Output Spike Speculation for early stopping of the Spiking-Transformer.

Motivation of Output Spike Speculation (OSS) (slide 36 of 61)
III. C-Transformer Architecture
Rate coding: the input spike probability is proportional to the input magnitude, so the input spike rate (ISR) over a few sampled time-steps approximates the ISR of the entire time-step (TS) window.
[Figure: rate-coded input spike trains S0^t-SN^t over t0-t31, showing a similar input spike rate between the sampled time-steps and the total TS; the membrane potential VMEM crosses the threshold VTH to produce output spikes Sout, with a similar output spike rate between sampled and total TS.]
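The output-spike mechanism in the figure, VMEM integrating toward VTH, can be sketched with a minimal integrate-and-fire neuron; reset-by-subtraction is an assumption, not stated on the slide:

```python
# Minimal integrate-and-fire model of the slide's output spiking: the membrane
# potential VMEM accumulates weighted input each time-step and emits an output
# spike whenever it reaches the threshold VTH.

def integrate_and_fire(weighted_inputs, vth):
    vmem, out = 0, []
    for x in weighted_inputs:
        vmem += x
        if vmem >= vth:
            out.append(1)
            vmem -= vth  # reset by subtraction (assumption)
        else:
            out.append(0)
    return out

print(integrate_and_fire([3, 1, 4, 0, 2, 5], vth=4))  # [0, 1, 1, 0, 0, 1]
```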

Motivation of Output Spike Speculation (OSS), continued (slide 37 of 61)
III. C-Transformer Architecture
Because the input spike rate (ISR) in the sampled time-steps approximates the ISR of the entire TS, the output spike rate (OSR) in the sampled time-steps likewise approximates the OSR of the entire TS.
[Figure: the same spike-train and membrane-potential (VMEM vs. VTH) plots over t0-t31, highlighting similar input and output spike rates between the sampled time-steps and the total time-step window.]
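The rate-coding argument, that the spike rate over a few sampled time-steps tracks the rate over the full window, can be checked with a small Monte-Carlo sketch (Bernoulli coding and sampling 8 of 32 time-steps are illustrative assumptions):

```python
import random

# Bernoulli rate coding: the spike probability per time-step is proportional
# to the input magnitude (here simply equal to it, for a magnitude in [0, 1]).
def rate_code(magnitude, timesteps, rng):
    return [1 if rng.random() < magnitude else 0 for _ in range(timesteps)]

TOTAL_TS, SAMPLED_TS = 32, 8
diffs = []
for seed in range(200):
    train = rate_code(0.6, TOTAL_TS, random.Random(seed))
    isr_sampled = sum(train[:SAMPLED_TS]) / SAMPLED_TS
    isr_total = sum(train) / TOTAL_TS
    diffs.append(abs(isr_sampled - isr_total))

# On average the sampled-window rate stays close to the full-window rate,
# which is what lets OSS speculate the output spike rate early.
print(sum(diffs) / len(diffs))  # small, typically well under 0.2
```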
