HOTCHIPS 2022

An Efficient High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache

Zhiyong Li, Sangjin Kim, Dongseok Im, Donghyeon Han, and Hoi-Jun Yoo
zhiyong_li@kaist.ac.kr
Semiconductor System Lab., School of EE, KAIST

Super-Resolution on Mobile Platform
- Improve user experience with high-quality images
  - Expand image feature channels and maintain a higher image resolution
  - Enhance the quality of streaming media and camera shots for better QoS*
*QoS: Quality of Service
[Figure: example images labeled "720p SR ON!" and "1080p SR ON!"]

Characteristic of SR CNN: Large and Non-zero Feature Resolution
- Maintaining or enlarging feature resolution
  - 9x input image resolution
- Non-ReLU activation functions remove sparsity
  - Zero skipping is impossible
[Figure: classification CNN (224x224x3 input, 1000x1 output) vs. super-resolution CNN (640x360x3 input, 1920x1080x3 output); example images: Butterfly, Tiger, Flower]
*Measured on VGG-16 (classification) and FSRCNN on Set5 (super-resolution)

High Memory Requirement: Large Data Transactions
- 11x more total data transactions than a classification CNN
- HW design issue for on-chip cache units and processing units
[Figure: external memory and on-chip cache with either more PUs or more memory; labels: reuse, large EMA, low throughput]
[Chart: data transactions (MB/frame), feature map and weight: classification CNN 8.3 MB, super-resolution CNN 91 MB (11 times)]
Challenge 1) High Memory Requirement
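
The two ratios quoted on these slides can be checked with a few lines of arithmetic. The sketch below uses only the tensor shapes and MB/frame figures stated above; the helper name `elements` is introduced here for illustration.

```python
# Shapes and traffic figures are copied from the slides; the helper is illustrative.
sr_input = (640, 360, 3)      # SR CNN input (width, height, channels)
sr_output = (1920, 1080, 3)   # SR CNN output (width, height, channels)

def elements(shape):
    """Number of values in a tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

# Output pixels per input pixel for the 3x-per-axis upscaling shown on the slide.
resolution_gain = elements(sr_output) / elements(sr_input)

# Per-frame data transactions reported on the slide (MB/frame).
classification_mb = 8.3
super_resolution_mb = 91.0
transaction_ratio = super_resolution_mb / classification_mb

print(f"output/input resolution: {resolution_gain:.0f}x")
print(f"SR vs. classification traffic: {transaction_ratio:.1f}x")
```

The traffic ratio comes out slightly below 11, which the slide rounds to "11 times".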

Characteristic of SR CNN: Large Bit Precision for QoS
- Inefficient high-bit data transactions and computations
- An efficient processing algorithm and HW are needed
[Chart: relative energy cost [1]: FXP8 MAC x1, FXP16 MAC x3.8, 16b on-chip memory access x12.3]
[Chart: PSNR on Set5 by precision: FXP16 38.69 dB, FXP8 18.85 dB]
Challenge 2) High Bit Precision
[1] A. Raha et al., VLSID 2021
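
The quality comparison above is expressed in PSNR. For readers who want to reproduce such a comparison, the following is a minimal sketch of the standard PSNR definition, 10 * log10(peak^2 / MSE). The pixel values are made up for illustration and are not taken from the talk, and the function name `psnr` is an assumption of this sketch.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy 4-pixel example (hypothetical values, not from the slides).
ref = [52, 55, 61, 59]
out = [54, 55, 60, 57]
print(f"{psnr(ref, out):.2f} dB")
```

Because the scale is logarithmic, the gap between the FXP16 and FXP8 results on the slide corresponds to a very large difference in mean squared error.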

Features of Proposed SR-SoC: 2-Part Optimization with Heterogeneous Hybrid-precision
- SW-HW co-design
  - Heterogeneous Hybrid-Precision: for data compression and enhanced heterogeneity
- Process unit
  - Heterogeneous Accel. Architecture: for energy efficiency
- Cache unit
  - Hierarchical Line Cache Subsystem: for smaller cache footprint and less EMA
- Micro-arch. level
  - Heterogeneous L1 Line Cache: for smaller L1 cache and low power
Proposed Approach

Heterogeneous Hybrid-Precision Architecture: SW-HW Co-designed Architecture
- Heterogeneous hybrid precision for less data transaction and more sparsity
- High-parallelism and skipping architecture for high computing efficiency
- Heterogeneous hierarchical cache with tiled execution for high memory efficiency
Problems
- Large and non-sparse data transactions: 1) large EMA, 2) large on-chip memory power
- Non-sparse 16b data: 3) inefficient high-bit computation
[Figure: proposed approach. Algorithm: input distribution of SR split into inliers and outliers (heterogeneity). Process units: dense stream accelerator with 3x3x54 2D PEs (high parallelism) and sparse accelerator with 1D PEs and skipping control (skipping). Caching units: hierarchical cache with L2 and L1 tiled line caches, short-term and long-term data, regular access vs. unpredictable access with Psum tag check]

Proposed Heterogeneous Hybrid Precision: Divide Data into Sparsity-biased 2 Groups
- Thresholding with the probability density function (p.d.f.)
  - Lower 90% of the distribution: 100% non-zero dense FXP8 group
  - Remainder: 10% non-zero sparse FP8 group
[Figure: distribution of log2|X| in FSRCNN* from LSB to MSB; threshold Th = pdf^-1(90%) at log2(Th); inliers below the threshold are represented in 8b as dense FXP8, outliers above it as sparse FP8 with skipping]
*Dataset from Set5 and Set14 SR datasets; measurement includes input and intermediate activations

Result of Heterogeneous Hybrid-Precision
- Maintains quality (0.5 dB loss) while reducing external memory access by 47%
[Chart: PSNR loss and data transaction (%) vs. outlier ratio (5%, 10%, 20%, 30%), relative to FXP16; FXP16 = FXP8 + FP8; marked operating point: FXP8 + 5% FP8]
[Chart: PSNR (dB) vs. activation precision (bit) for S.VLSI19, S.VLSI21, TCSVT18, ISSCC21, and this work*; axes annotated "more energy-efficient" and "higher quality"]
*Dataset from Set5 and Set14 SR datasets; measurement includes input and intermediate activations
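
The grouping step described above can be illustrated in software. The sketch below is one possible reading, not the chip's actual encoding: it picks the threshold as the magnitude below which 90% of the values fall, stores every value clipped to that threshold as a signed 8-bit code (the dense plane), and keeps the clipped-off remainder of each outlier as an (index, residual) pair (the sparse plane). The residual is left as a Python float rather than an FP8 code, and the names `split_hybrid` and `reconstruct` are introduced here for illustration.

```python
import random

def split_hybrid(values, inlier_fraction=0.90):
    """Split values into a dense 8-bit plane and a sparse list of outlier residuals."""
    magnitudes = sorted(abs(v) for v in values)
    # Threshold: magnitude below which `inlier_fraction` of the data falls.
    cut = min(len(magnitudes) - 1, int(inlier_fraction * len(magnitudes)))
    threshold = magnitudes[cut]
    scale = threshold / 127.0 if threshold > 0 else 1.0

    dense = []   # one signed 8-bit code per value (value clipped to the threshold)
    sparse = []  # (index, residual) pairs, present for outliers only
    for i, v in enumerate(values):
        clipped = max(-threshold, min(threshold, v))
        dense.append(round(clipped / scale))
        if abs(v) > threshold:
            sparse.append((i, v - clipped))  # assumption: residual kept as a float
    return dense, sparse, scale

def reconstruct(dense, sparse, scale):
    """Rebuild approximate values from the dense plane plus sparse residuals."""
    out = [code * scale for code in dense]
    for i, residual in sparse:
        out[i] += residual
    return out

random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic activations
dense, sparse, scale = split_hybrid(acts)
restored = reconstruct(dense, sparse, scale)
max_err = max(abs(a - b) for a, b in zip(acts, restored))
print(f"outlier ratio: {len(sparse) / len(acts):.3f}, max error: {max_err:.4f}")
```

In this model the dense plane has an entry for every position while the sparse plane is non-zero for roughly one value in ten, which is the sparsity imbalance the following slides build the heterogeneous hardware around.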

Proposed Super-resolution SoC Architecture
1. Fixed-point Processing Units (FXPU)
   - For non-zero FXP8 convolution
   - For short-term line cache
2. Floating-point Processing Unit (FPU)
   - For sparse FP8 convolution
   - For short-term tagged line cache
3. Global L2 Line Cache
   - For top-level threshold biasing
   - For long-term overlapped line cache
[Block diagram: four FXPUs (FXPU 0 and FXPU 1 pairs behind switches), each with an input allocator, L1 Line$ (5KB), weight scratchpad (4KB), loop controller, 3x18 WS systolic PE arrays (6-ch 3x3 WS systolic PE array), and adder trees; one FPU with input L1$ (2KB), output L1$ (0.5KB), IDX$, lanes 0 to 17, weight scratchpad (4KB), and Inst.$; four L2 Line$ banks (30KB x4) with TLBs; router, FP2FxP, and PLUT blocks; RISC-V core with L1 I$+D$ (4KB); interconnection network; UART, SPI, I2C, GPIO; AXI 64-bit x4 to off-chip memory; host PC]

Motivations of Heterogeneous Arch.: Sparsity Heterogeneity of HHP Data
- 100% non-zero FXP8 group and 10% non-zero FP8 group
- Inefficient in previous homogeneous architectures
[Figure: dense FXP8: 100% non-zero, regular memory access, large portion of the data; sparse FP8: 10% non-zero with index, unpredictable memory access, small portion of the data]

Data Compression for Non-Sparse SR: Previous Mixed-Precision Hardware
- Divides data into a 16b large-value group and an 8b small-value group
- Low compression ratio with non-zero SR data
- Additional zero-data indexing bits as overhead
[Figure: raw FP16 data split into an FP16 group and an FP8 group with partial use of FP8; non-zero data stored as FP16 plus metadata, with the zero-data index as overhead]
[2] Lee et al., ISSCC 2019

Previous Mixed-Precision Architecture: Homogeneous Mixed-Precision Accelerator
- Homogeneous skipping architecture for higher energy efficiency, BUT
  - Dense low-precision group: skipping overhead is dominant (90% of SR data)
  - Sparse high-precision group: benefits come from only a small portion (10%) of total data
[Chart: TFLOPS/W vs. sparsity (0, 30%, 60%, 90%) for zero-skipping HW and zero-gating HW, annotated "Sparsity lies here!" and "Skipping overhead" for the FP8 group and FP16 group]
[Figure: FP8/FP16 configurable MAC array with column buffer, DMEM, and controller, shown in FP8 mode and in FP16 mode]
[2] Lee et al., ISSCC 2019

Heterogeneous Accelerating Architecture: Efficient Skipping- and Parallelism-exploiting Accelerator
[Figure: heterogeneous cores. One FPU with sparse L1$, SIMD ALU lanes, Idx$, Inst.$, and scoreboard: unpredictable memory access, high throughput. Four FXPUs, each with dense L1$, systolic PE array, and function LUT: regular memory access, high parallelism]

Proposed High Parallelism FXPU: High Parallelism FXPU for Diverse Convolution Kernels
- 3x3 2D PE array
  - MAC operation
  - Output stationary
  - Systolic data reuse
- Input allocator with dense L1$
  - Receives input data
  - Reforms and feeds it to the PE array
- Adder tree with LUT
  - For the activation function
[Figure: input tensor rows (X0, X1, X2, ...) pass through the input allocator into the 3x3 OS systolic array (x9) and then the adder tree]

Proposed Zero-skipping Architecture: Zero-skipping FPU for Sparse Convolutions
- Four 3x1 1D PE lanes
  - Sparse multiply and activation function operation
  - Scoreboard-based parallelism
- Skip controller with sparse L1$
  - Receives input data and skip index
  - Selects weights with the zero index
- Accumulator and reorder L1$
  - Sparse accumulate
[Figure: input vector and Idx$ feed the skip controller; lanes 1 to 3 process the non-zero inputs (X3, X7, ...) under a scoreboard; results go to the sparse accumulator and reorder buffer]

Results of Heterogeneous Accelerating Arch.
[Chart: energy (mJ/frame): baseline FXP16 processor (previous SRNPU [3]) 2.38, hybrid FP-FXP arch. 1.2 (49.6% reduction)]
[Chart: normalized power (%): baseline FXP16 processor (previous SRNPU [3]) 100, hybrid FP-FXP arch. with core power reduced by 46% and memory power reduced by 47%]
*Under grouping threshold of 95%
[3] J. Lee et al., S.VLSI 2019
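
The zero-skipping idea behind the sparse lanes can be shown with a small software model. The sketch below is not the chip's scoreboard or reorder logic; it only demonstrates that a 3x1 convolution over a mostly-zero input gives the same result when the loop visits just the stored (index, value) pairs, while performing far fewer multiply-accumulates. The function names and the input vector are assumptions made for illustration.

```python
def dense_conv1d(x, w):
    """Reference 'valid' 1D correlation: y[o] = sum_k x[o + k] * w[k]."""
    n_out = len(x) - len(w) + 1
    return [sum(x[o + k] * w[k] for k in range(len(w))) for o in range(n_out)]

def sparse_conv1d(nonzeros, length, w):
    """Same result, visiting only the (index, value) pairs of non-zero inputs."""
    n_out = length - len(w) + 1
    y = [0.0] * n_out
    macs = 0
    for i, v in nonzeros:
        # Input i contributes to output o = i - k through weight tap k.
        for k, wk in enumerate(w):
            o = i - k
            if 0 <= o < n_out:
                y[o] += v * wk
                macs += 1
    return y, macs

x = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0]  # hypothetical sparse input
w = [0.5, 1.0, -0.25]                                     # 3x1 kernel
nonzeros = [(i, v) for i, v in enumerate(x) if v != 0.0]

y_dense = dense_conv1d(x, w)
y_sparse, macs = sparse_conv1d(nonzeros, len(x), w)
dense_macs = len(y_dense) * len(w)
print(f"MACs: dense {dense_macs}, zero-skipping {macs}")
```

Because outputs are updated in the order the non-zero inputs arrive rather than in output order, a hardware version needs some form of accumulation and reordering, which is the role the slide assigns to the sparse accumulator and reorder buffer.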

Proposed Hierarchical Cache Subsystem: 2-level Hierarchical Line Cache
- Tile-based hierarchical cache to reduce cache size and power
- Multi-core layer fusion to reduce EMA
[Figure: memory hierarchy from main memory to Level 2, Level 1, and LRF; the cached feature is a small part of the total feature space; FXPU0, FXPU1, ..., FXPUn share the tile-based hierarchical cache with layer fusion]

Proposed Hierarchical Cache Subsystem: Tile-based Execution
- Fetch a small tile of the entire input each time
- Shorter reuse distance between lines
- x12 smaller on-chip cache
[Figure: memory hierarchy from main memory to Level 2, Level 1, and LRF; the cached feature tile is fed to the convolution PE array; cache annotated "x12 smaller"]
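
To see why tiling shrinks a line cache, consider a simple model in which a KxK convolution must keep K rows of the feature map on chip. The sketch below compares keeping full-width rows against keeping rows of one tile plus its overlap columns. The width, tile size, and channel count are hypothetical values chosen for illustration, so the resulting ratio is not the x12 figure from the slide.

```python
def line_buffer_bytes(width, channels, kernel=3, bytes_per_value=1):
    """Bytes needed to hold `kernel` rows of a feature map of the given width."""
    return kernel * width * channels * bytes_per_value

# Hypothetical layer: 640-wide feature map, 16 channels, 8-bit values, 3x3 kernel.
full_width = 640
channels = 16
kernel = 3
tile_width = 64
halo = kernel - 1  # extra columns shared with neighbouring tiles

full = line_buffer_bytes(full_width, channels, kernel)
tiled = line_buffer_bytes(tile_width + halo, channels, kernel)
print(f"full-width rows: {full} B, tiled rows: {tiled} B, ratio: {full / tiled:.1f}x")
```

In this model the saving scales with the ratio of image width to tile width, reduced slightly by the overlap columns each tile has to refetch.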

Result of Hierarchical Cache Subsystem
- Additionally reduces global memory footprint by 66%
- Reduces L2 cache (large memory) access by 18x
[Chart: data transactions (MB) per memory level for this work (Ext. Mem/L2$, L2$/L1$, L1$/PE) and previous work [4] (Ext. Mem/FMem, FMem/PE)]
[Chart: EMA (MB) and on-chip memory footprint (KB) for previous work without fusion, proposed HHP (47%), and proposed hierarchical cache subsystem (66%)]
[4] K. Goetschalckx et al., S.VLSI 2021

Result of Hierarchical Cache Subsystem
- With the smaller cache subsystem: 58.4% power reduction
- With fused layers: 61.8% energy reduction

Normalized power (%):
  Configuration              Core   Memory
  Previous SR NPU [3]        100    100
  Hybrid FP-FXP Arch.        54     53
  Hierarchical Line Cache    54     41.6

Energy per frame (mJ):
  Configuration              Energy   Reduction
  Previous SR NPU [3]        2.38
  Hybrid FP-FXP Arch.        1.2      49.6%
  Hierarchical Line Cache    0.91     61.8%

[3] J. Lee et al., S.VLSI 2019
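
The reduction percentages quoted on this slide follow directly from the reported energy and normalized power values. The short check below recomputes them; the function name `reduction_percent` is introduced here for illustration, and the input numbers are the ones printed on the slide.

```python
def reduction_percent(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Energy per frame (mJ) as reported on the slide.
previous_npu = 2.38
hybrid_arch = 1.2
hierarchical_cache = 0.91

print(f"hybrid arch.: {reduction_percent(previous_npu, hybrid_arch):.1f}% less energy")
print(f"hierarchical cache: {reduction_percent(previous_npu, hierarchical_cache):.1f}% less energy")

# Normalized memory power (%) with the hierarchical line cache.
print(f"memory power: {reduction_percent(100.0, 41.6):.1f}% lower")
```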

Chip Photography and Summary
[Chip photo: die dimensions labeled 4,000 and 2,500; blocks: FPU, RISC-V, Enc./PLUT, Dec./Aggr., FXPU0,1 and FXPU2,3 each with L1 line buffers, L2 line buffer, Inst. cache, Input$]

  Process                    65nm
  Activation precision       FXP8 + 5-10% FP8
  Supply voltage (V)         1.0
  Frequency (MHz)            200
  SR algorithm               FSRCNN / ClassSR
  Frame rate* (fps)          107
  SR energy* (mJ/frame)      0.92
*x4 FSRCNN on Set5 and Set14 datasets

[Table residue: resolution (270p, 1080p, 540p, 4k); method (FSRCNN, ClassSR, GAN OMGD); frame rate (107 fps, 41 fps, 86 fps)]

Conclusion
Energy-Efficient Non-sparse High-quality SRCNN
- Heterogeneous accelerating with hybrid precision
  - 46% processing power, 47% memory access power, and 47% EMA reduction
- Data lifetime-aware hierarchical line cache
  - 53.7% memory access power and 71.8% EMA reduction
A 0.92 mJ/frame Super-resolution SoC for Resource-limited Mobile Applications

Thank You! Questions?
Feel free to contact me!
E-mail: Zhiyong_li@kaist.ac.kr
Zoom Meeting: https:/zoom.us/xxxx (Password: HC_GANPU)
Acknowledgement TBD.