Amber: Coarse-Grained Reconfigurable Array-Based SoC for Dense Linear Algebra Acceleration

Kathleen Feng, Alex Carsello, Taeyoung Kong, Kalhan Koul, Qiaoyi Liu, Jackson Melchert, Gedeon Nyengele, Maxwell Strange, Keyi Zhang, Ankita Nayak, Jeff Setter, James Thomas, Kavya Sreedhar, Po-Han Chen, Nikhil Bhagdikar, Zachary Myers, Brandon D'Agostino, Pranil Joshi, Stephen Richardson, Rick Bahr, Christopher Torng, Mark Horowitz, Priyanka Raina

Hot Chips 34, August 22, 2022

Application-Specific Accelerators
- Dedicated hardware accelerators are popular for imaging, vision, and machine learning (ML) applications
- Applications change rapidly, which motivates reconfigurable accelerators
- An accelerator optimized for ResNeXt is not efficient for EfficientNet due to different layer types
- New ASICs are required constantly to keep up with the state of the art
[Figure: Top-1 accuracy (%), axis 70 to 100, for SIFT, AlexNet, Inception V3, ResNeXt, and EfficientNet]

Reconfigurable Accelerator Overheads
- Slow reconfiguration for repurposing idle resources
- Inefficient memory control logic
- Underutilized, costly compute units
[Figure: an FPGA (DSP engines, AI engines, programmable logic) attached to CPU and MEM over a system interconnect, shown for DSP, ML, and crypto workloads. In each case one block is ACTIVE and the other two are IDLE.]

Amber Architecture
- Coarse-grained reconfigurable array (CGRA) for acceleration
- 384 processing elements (PEs): support INT16/BFloat16 operations, 64 B register file
- 128 memory elements (MEMs): 4 KB SRAM with internal streaming memory controllers
- Switch box (SB): routes outputs
- Connection box (CB): brings inputs
- Each column has the same tile type
- Every fourth column is memory
- Global buffer (GLB): streams data and bitstreams to the CGRA
- 16 GLB tiles: each with two 128 KB SRAM banks, load and store units
[Figure: CGRA of PE and MEM tiles joined by routing tracks. In the accelerator subsystem, GLB tiles 0 to 15 feed the PE and MEM tiles through the data/configuration network; a PR region is marked.]
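As a quick consistency check on the tile counts above, the aggregate capacity of each memory level can be computed directly. This is only a sketch: the variable names are illustrative, and binary units (1 KB = 1024 B) are an assumption. The totals agree with the 24 KB, 512 KB, and 4 MB figures quoted for the memory hierarchy later in the talk.

```python
# Aggregate on-chip storage implied by the tile counts on the slides.
# Assumption: KB means 1024 bytes; all names are illustrative.
KIB = 1024

PE_COUNT, PE_REGFILE_BYTES = 384, 64
MEM_COUNT, MEM_SRAM_BYTES = 128, 4 * KIB
GLB_TILES, BANKS_PER_TILE, BANK_BYTES = 16, 2, 128 * KIB

pe_regfile_total = PE_COUNT * PE_REGFILE_BYTES        # all PE register files
mem_tile_total = MEM_COUNT * MEM_SRAM_BYTES           # all MEM tile SRAMs
glb_total = GLB_TILES * BANKS_PER_TILE * BANK_BYTES   # whole global buffer

# "Every fourth column is memory" implies a quarter of the tiles are MEMs.
mem_fraction = MEM_COUNT / (PE_COUNT + MEM_COUNT)

print(f"PE register files: {pe_regfile_total // KIB} KB")
print(f"MEM tiles:         {mem_tile_total // KIB} KB")
print(f"Global buffer:     {glb_total // (KIB * KIB)} MB")
print(f"MEM tile fraction: {mem_fraction}")
```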
[Figure: SoC block diagram. A 64-bit interconnect joins the processor subsystem (32-bit ARM M3 CPU, 32 KB D$, 32 KB I$, DMA engines, application processor), the peripheral subsystem, four 32 KB SRAMs, the DRAM controller (off-chip DRAM), and the accelerator subsystem (GLB tiles, data/configuration network, CGRA PE and MEM tiles).]

Contributions
- Configuration: fast dynamic partial reconfiguration that runs up to eight kernels in parallel
- Memory: efficient streaming memories for affine patterns
- Compute: low-overhead complex arithmetic support

Maximizing Resource Utilization
- Reconfigurable accelerators frequently need to switch applications
  - Edge devices run multiple kernels on limited resources
  - Multiple users share the same hardware in the cloud
- Fast configuration is key to prevent resources from sitting idle
- In a stream of images:
  - Every frame is processed by a camera pipeline
  - Only key frames go through ResNet-18 for object detection
[Figure: CGRA occupancy over time for non-key, key, and non-key frame processing; regions are labeled Camera Pipeline, ResNet, and Idle.]

Dynamic Partial Reconfiguration
- Dynamic partial reconfiguration (DPR) enables repurposing of unused tiles for additional computation during runtime
- First CGRA reconfiguration network specialized for high-performance DPR
- Supports up to eight different kernels
[Figure: the same image stream with DPR; regions are labeled ResNet and Camera Pipeline, with no Idle region.]

Dynamic Partial Reconfiguration in the GLB
- Load DMA fetches both data and configuration
[Figure: GLB tile load path with a 2-level ID, AG, and SG; read data from the bank passes through a parallel-in serial-out (PISO) unit.]

Configuration Network for Dynamic Partial Reconfiguration
[Figure: global buffer with GLB tiles 0 to 15, each containing switches, a load DMA, a store DMA, Bank 0, and Bank 1, attached to the system interconnect through AXI master and AXI slave ports. GLB words are 64 bits wide and CGRA words are 16 bits wide. Configuration flows down the CGRA columns from the I/O tiles to the PE and MEM tiles.]
- Achieves high configuration throughput using
  - Parallel GLB tiles
  - Pipelined configuration network
- Low area overhead by sharing storage between application and configuration data

Performance Benefits of DPR in Amber
- Configures full array in 3.5 µs
- 36.5× more configuration throughput than FPGA

                      Max. Freq.   Interface Bitwidth   Config Energy        Peak Throughput
  Amber DPR           520 MHz      448 bit              57.4 pJ/config       29.1 GB/s
  Amber AXI-Lite      660 MHz      32 bit               39454.5 pJ/config    44 MB/s
  FPGA (Xilinx ICAP)  200 MHz      32 bit               -                    800 MB/s

Camera Pipeline and ResNet with DPR
[Figure: timelines for CGRA regions #1 and #2 over 30 frames. (a) Baseline (no DPR): ResNet on key frame #1 and the camera pipeline on frames #1 to #30 leave regions idle and finish at 902 ms. (b) DPR: ResNet (frame #1) and the camera pipeline on frames #2 to #30 overlap and finish at 511 ms.]

             Frequency   Execution Time    Energy          Energy-Delay Product
             (MHz)       (ms/30 frames)    (J/30 frames)   (J·s/30 frames)
  Baseline   200         902               217             196
  DPR        200         511 (-43.3%)      154 (-29.0%)    78 (-60.2%)
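The DPR figures above can be sanity-checked with simple arithmetic. The sketch below assumes peak throughput is one interface-width transfer per cycle, which is an idealization and not something the slides state; the function names are illustrative. Under that assumption the Amber DPR and Xilinx ICAP rows are reproduced, and their ratio comes out at about 36.4, close to the quoted 36.5×. The same product does not reproduce the 44 MB/s AXI-Lite entry, so that row is left out.

```python
# Back-of-the-envelope check of the DPR table (idealized, names illustrative).

def peak_throughput(bitwidth_bits, freq_hz):
    """Bytes per second, assuming one full-width transfer every cycle."""
    return bitwidth_bits / 8 * freq_hz

def percent_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100

amber_dpr = peak_throughput(448, 520e6)    # Amber DPR row
xilinx_icap = peak_throughput(32, 200e6)   # FPGA (Xilinx ICAP) row

print(f"Amber DPR:   {amber_dpr / 1e9:.1f} GB/s")
print(f"Xilinx ICAP: {xilinx_icap / 1e6:.0f} MB/s")
print(f"Ratio:       {amber_dpr / xilinx_icap:.1f}x")

# Camera pipeline + ResNet, baseline versus DPR, over 30 frames.
print(f"Execution time: {percent_change(902, 511):.1f}%")
print(f"Energy:         {percent_change(217, 154):.1f}%")
print(f"EDP:            {percent_change(196, 78):.1f}%")
```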
On-Chip Streaming Memories for Affine Patterns
- Accelerators commonly use direct memory access engines
  - Too general and have high area/energy overheads
- Amber's on-chip memory controllers are specialized for affine access patterns seen in dense linear algebra applications
- Used in all levels of the memory hierarchy
  - Global buffer
  - MEM
  - PE register file
- Amber memory hierarchy: PE register file 24 KB, memory tile 512 KB, global buffer 4 MB

Affine pattern:
    for y in 0:ry
      for x in 0:rx
        addr = sx*x + sy*y + offset

1. Iteration domain (ID): specifies the range of memory operations
2. Address generator (AG): computes affine addresses from a set of strides and ID values
3. Schedule generator (SG): produces read/write enables, similarly to the AG
- Parameters are extracted from the application by the compiler
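A minimal software model of the three pieces above may help make the roles concrete. It is a sketch, not Amber's hardware: the function names are illustrative, and treating the schedule as an affine function from loop indices to cycle numbers is an assumption based on the note that the SG works similarly to the AG.

```python
from itertools import product

def iteration_domain(extents):
    """Yield loop-index tuples, outermost loop first, innermost varying fastest."""
    return product(*(range(extent) for extent in extents))

def affine(indices, strides, offset):
    """offset + sum(stride_i * index_i), the affine form used by AG and SG."""
    return offset + sum(s * i for s, i in zip(strides, indices))

def address_generator(extents, strides, offset):
    """Addresses touched by the loop nest, in iteration order."""
    return [affine(idx, strides, offset) for idx in iteration_domain(extents)]

def schedule_generator(extents, cycle_strides, start_cycle):
    """Cycles on which the access is enabled (assumed affine, like the AG)."""
    return [affine(idx, cycle_strides, start_cycle)
            for idx in iteration_domain(extents)]

# ry = 2, rx = 3, sy = 8, sx = 1, offset = 4: two rows of three words, row pitch 8.
print(address_generator((2, 3), (8, 1), 4))     # [4, 5, 6, 12, 13, 14]
# One access per cycle within a row, one idle cycle between rows, starting at cycle 10.
print(schedule_generator((2, 3), (4, 1), 10))   # [10, 11, 12, 14, 15, 16]
```

In this model the compiler's job reduces to choosing the extents, strides, and offsets; the controller needs no program counter or instruction fetch.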
Streaming Memory Optimizations in MEM
Streaming memory in MEM has further optimizations:
1. Wide-fetch SRAM: lower access energy per byte (0.81 pJ vs 1.65 pJ for single-fetch SRAM)
2. Resource sharing of ID/AG/SGs to reduce area
3. Recurrence relations: eliminate the multiplier when calculating affine patterns
[Figure: MEM tile built around a 512 x 64-bit single-port SRAM. 16-bit inputs from the SBs pass through serial-in parallel-out (SIPO) units to form 64-bit write data; 64-bit read data passes through parallel-in serial-out (PISO) units to the 16-bit SB outputs. ID, AG, and SG blocks with a 6-level iteration domain produce the read/write addresses and enables. Other labels: chain data in, encode, delay SRAM read.]
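The recurrence-relation idea in item 3 can be illustrated in software. The slides do not give the exact formulation, so the version below is an assumption: it keeps a running address and, whenever loop level d increments and the inner levels wrap to zero, adds one precomputed per-level delta. The per-access work is then an addition only; the multiplications move into delta setup, which happens once at configuration time. All names are illustrative.

```python
from itertools import product

def direct_addresses(extents, strides, offset):
    """Reference: evaluate the affine expression with multiplies each access."""
    return [offset + sum(s * i for s, i in zip(strides, idx))
            for idx in product(*(range(e) for e in extents))]

def recurrence_addresses(extents, strides, offset):
    """Same address stream using only counters and additions per access."""
    levels = len(extents)
    # delta[d]: address change when level d steps and all inner levels rewind.
    deltas = []
    for d in range(levels):
        rewind = sum(strides[k] * (extents[k] - 1) for k in range(d + 1, levels))
        deltas.append(strides[d] - rewind)

    total = 1
    for extent in extents:
        total *= extent

    counters = [0] * levels
    addr = offset
    out = []
    for _ in range(total):
        out.append(addr)
        # Find the innermost level that has not reached its last iteration.
        d = levels - 1
        while d >= 0 and counters[d] == extents[d] - 1:
            counters[d] = 0
            d -= 1
        if d < 0:
            break  # every level wrapped: the loop nest is finished
        counters[d] += 1
        addr += deltas[d]
    return out

print(recurrence_addresses((2, 3), (8, 1), 4))  # [4, 5, 6, 12, 13, 14]
```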
Streaming Memory Optimizations in MEM
- Save another 26.1% area and 30.6% energy using wide-fetch SRAM
- Overall, save 50% in area and 48% in energy
[Figure: bar charts of area (µm², ×10^4) and energy (pJ/access). Area: -32.3% from streaming controllers, then -26.1% from wide-fetch. Energy: -25.0% from streaming controllers, then -30.6% from wide-fetch.]

Complex Arithmetic Operations
- Image processing and computer vision kernels require complex arithmetic operations, but these operations are used infrequently
  - BFloat16 division, natural logarithm, sine, exponential
  - 15% of operations in non-local means (nlmeans) are complex
- How can we support complex operations?
  - Offload to CPU (slow)
  - Dedicated hardware in each PE (expensive)
  - A compromise?
[Figure: % of total ops (axis 0 to 40) by operation: min, max, sub, div, exp, mul, add]

Low-Overhead Complex Arithmetic in Amber
ALU operations:
- INT/BIT: ADD, SUB, ADC, SBC, ABS, GTE, LTE, SEL, MUL, SHR, SHL, OR, AND, XOR
- BFloat: ADD, SUB, CMP, MUL, GETMAN, ADDIEXP, SUBEXP, EXP2F, F2INT, GETFR, INT2F
[Figure: processing element with ALU, LUT, REG, and COND blocks. Inputs arrive from the 16b and 1b CBs; outputs go to the 16b and 1b SBs; other labels are RF input, RF output, and ALU output.]

Division example:
    out = a / b = a * (1/b)
    b   = 1.f * 2^x
    1/b = (1/1.f) * 2^(-x)
        = (1.g * 2^h) * 2^(-x)
        = 1.g * 2^(h-x)
- The DIV LUT in ROM maps f to g, h and is implemented using a MEM tile
[Figure: division dataflow across PE and MEM tiles. GETMAN extracts f from b, the DIV LUT returns g and h, and SUBEXP followed by FP MUL with a produces out.]

[Figure: CGRA w/ dedicated complex ops versus Amber CGRA; bar labels -34.5%, -30.9%, -28.9%, -32.5%, +64.7%; group label: contain no complex ops.]
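The division decomposition out = a * (1/b), with 1/b = 1.g * 2^(h-x), can be modeled in a few lines. This is a sketch under assumptions, not the PE's implementation: it uses Python floats in place of BFloat16 hardware, assumes a 128-entry table indexed by the 7 fraction bits of b, and all names are illustrative. The comments map each step to the operation named on the slide.

```python
import math

FRAC_BITS = 7  # bfloat16 stores 7 fraction bits

def build_recip_lut():
    """For each 7-bit fraction f, store (g, h) with 1/1.f ~= 1.g * 2**h."""
    lut = []
    for f in range(1 << FRAC_BITS):
        recip = 1.0 / (1.0 + f / (1 << FRAC_BITS))  # lies in (0.5, 1.0]
        mant, exp = math.frexp(recip)               # recip = mant * 2**exp
        mant, h = mant * 2.0, exp - 1               # renormalize mant into [1, 2)
        g = min(round((mant - 1.0) * (1 << FRAC_BITS)), (1 << FRAC_BITS) - 1)
        lut.append((g, h))
    return lut

RECIP_LUT = build_recip_lut()  # plays the role of the DIV LUT in ROM

def approx_divide(a, b):
    """Approximate a / b as a * (1/b) using the reciprocal table."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if b < 0 else 1.0
    mant, exp = math.frexp(abs(b))
    mant, x = mant * 2.0, exp - 1                   # |b| = 1.f * 2**x
    f = int((mant - 1.0) * (1 << FRAC_BITS))        # GETMAN: top 7 fraction bits
    g, h = RECIP_LUT[f]                             # LUT lookup
    recip = math.ldexp(1.0 + g / (1 << FRAC_BITS), h - x)  # SUBEXP: exponent h - x
    return sign * a * recip                         # FP MUL

print(approx_divide(1.0, 2.0))   # 0.5
print(approx_divide(6.0, 3.0))   # close to 2.0, limited by the 7-bit g
```

For divisors that are exactly representable with 7 fraction bits, the only error in this model comes from rounding g, which keeps the relative error below 2^-7.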
- Application that has complex ops: 29.7x slower if using CPU
- Saves 37.8% in area compared to implementing sin/exp/ln/div
- Supplementary ops add just +0.2% in area

Agile Accelerator-Compiler Design Flow
- Domain-specific language-based hardware generation flow
- Automatically updates application compiler flow to run applications
[Figure: design flow. PEak program (PE specification) into the PEak compiler gives PE RTL and PE rewrite rules. Lake program (MEM specification) into the Lake compiler gives MEM RTL and MEM rewrite rules. Canal program (interconnect specification) into the Canal compiler gives interconnect RTL and the routing graph. CGRA RTL goes through physical design and integration into the SoC to give the SoC with CGRA. An application in Halide goes through the application compiler (polyhedral scheduler, mapper, pipelining, placer & router, bitstream generator) to give the CGRA bitstream and application processor code.]

Application Flow
- End-to-end compiler maps Halide applications onto CGRA
- 12.4× faster than Vivado FPGA compiler

Halide application:
    conv(x, y) += kernel(r.x, r.y) * input(x + r.x, y + r.y);
    conv.in().compute_root();
    conv.in().tile(x, y, xo, yo, xi, yi, 64, 64).hw_accelerate(xi, xo);
    conv.update().unroll(r.y, 3).unroll(r.x, 3);
    ...pute_at(conv.in(), xo);
    input.stream_to_accelerator();

[Figure: the compiler stages (polyhedral scheduler, mapper, pipelining, placer & router, bitstream generator) turn the Halide application into a CGRA bitstream and CPU code. Graph legend: C = constant, P = PE, M = memory, R = shift register.]

Comparison with State of the Art
- 1.7× better energy efficiency with 36.7× throughput
[Figure: die photo, 4.9 mm × 4.1 mm, with the global buffer, CPU, and CGRA labeled]

                This Work        Whatmough           Schmidt                Rovinski
                                 (VLSI 2019)         (ISSCC 2021)           (VLSI 2019)
  Architecture  SoC with CGRA    SoC with FPGA       Multicore vector CPU   Multicore CPU
  Node          TSMC 16nm        TSMC 16nm           TSMC 16nm              TSMC 16nm
  Area (mm2)    20.1             25                  24.01                  15.25
  Precision     BF16, INT16-64   FP16-64, INT16-64   FP16-64, INT32-64      INT32
  SRAM (MiB)    4.5              9.025               4.5                    3.875
  Voltage (V)   0.84-1.29        0.5-1.0             0.55-1.0               0.60-0.98
  Freq (MHz)    955              1000                1440                   10-1400
  Peak GOPS     367              10-56.2             368.4                  695
  GOPS/W        538              312.4               209.5                  93.04

Benchmark Application Suite
Benchmark apps written in Halide, compared against CPU, GPU, and FPGA:
- Image processing
  - Blur: image blur
  - Unsharp: enhances local contrast by smoothing an image
  - Camera pipeline: processes raw data from an image sensor into a color image
- Computer vision
  - Harris: detects corners
- Machine learning
  - ResNet-18: image classification

Results: Energy-Delay Product
- Amber achieves up to 3902×, 152×, and 88× better EDP than CPU, GPU, and
  FPGA
[Figure: energy-delay product (EDP, J·s/frame, log scale) per benchmark. Platforms: Virtex Ultrascale+ VU9P FPGA, Nvidia Tesla K40 GPU, Xeon CPU (12 cores), Xeon CPU (1 core), ARM Cortex A57 (4 cores), Amber SoC. Bar labels in plotted order, Amber SoC last, followed by the annotated improvement:
  BLUR:     2.3e-03, 9.9e-03, 9.6e-03, 9.4e-04, 1.6e-03, 1.8e-05 (559x)
  UNSHARP:  1.4e-02, 1.2e-02, 7.9e-03, 1.9e-04, 3.1e-04, 1.6e-05 (864x)
  CAMERA:   5.1e-03, 2.4e-02, 6.1e-03, 9.4e-04, 3.5e-04, 6.2e-06 (3902x)
  HARRIS:   1.4e-02, 1.9e-02, 5.9e-03, 2.8e-04, 3.8e-04, 2.0e-05 (931x)
  CONV3_X:  6.0e-05, 6.7e-06 (9x)
  CONV5_X:  1.0e-04, 9.9e-06 (10x)]

Summary of Key Contributions
- Amber is an SoC designed for flexible and efficient acceleration of image processing, computer vision, and machine learning
  - Configuration: fast dynamic partial reconfiguration at runtime
  - Memory: efficient streaming memories for affine patterns
  - Compute: low-overhead complex arithmetic operation support
- Automatic end-to-end compiler maps applications onto Amber
- Amber achieves 3902×, 152×, and 88× better EDP over CPU, GPU, and FPGA, respectively
- Enables efficient domain, rather than single application, acceleration

Funding Acknowledgement: DARPA DSSoC, AHA Center, Stanford SystemX Alliance