
HC2022.KAIST.ZhiyongLi.v02.pdf


HOTCHIPS 2022
An Efficient High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache
Zhiyong Li, Sangjin Kim, Dongseok Im, Donghyeon Han, and Hoi-Jun Yoo
zhiyong_li@kaist.ac.kr
Semiconductor System Lab., School of EE, KAIST

Slide 2: Super-Resolution on Mobile Platform
- Improve user experience with high-quality images
- Expansion of image feature channels and maintain higher image resolution
- Enhance quality of streaming media/camera shot for better QoS*
*QoS: Quality of Service
[Figure: example frames labeled "720p SR ON!" and "1080p SR ON!"]

Slide 3: Characteristic of SR CNN
- Large and non-zero feature resolution
- Maintaining or enlarging feature resolution: 9x input image resolution
- Non-ReLU activation functions remove sparsity, so zero skipping is impossible
[Figure: Classification CNN* (224x224x3 input, 1000x1 output) compared with Super-Resolution CNN** (640x360x3 input, 1920x1080x3 output); sample images Butterfly, Tiger, Flower]
*Measured VGG-16, **Measured FSRCNN (Set 5)

Slide 4: High Memory Requirement
- Large data transactions: 11x the total data transaction of a classification CNN
- HW design issue of on-chip cache units and processing units
[Figure: Ext. Mem and on-chip cache with either more PUs (large EMA, low throughput) or more memory (reuse across T0, T1, T2 / Mem0, Mem1, Mem2)]
[Figure: Data Transaction (MB/Frame), split into feature map and weight: Classification CNN* 8.3 MB, Super-resolution CNN* 91 MB (11 times)]
Challenge 1) High Memory Requirement
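The 9x and 11x figures quoted for the SR CNN can be checked with a few lines of arithmetic. This is only a sanity check on the slide numbers. It assumes that 8.3 MB belongs to the classification CNN and 91 MB to the super-resolution CNN, which is how the flattened chart text reads. The variable and function names are illustrative.

```python
# Feature-map shapes quoted on the slide: (width, height, channels).
sr_input = (640, 360, 3)
sr_output = (1920, 1080, 3)


def pixels(shape):
    """Number of spatial positions (width x height) in a feature map."""
    width, height, _channels = shape
    return width * height


# Spatial growth from the SR input to the SR output.
resolution_ratio = pixels(sr_output) / pixels(sr_input)

# Per-frame data transactions in MB (assumed assignment, see lead-in).
classification_mb = 8.3
super_resolution_mb = 91.0
transaction_ratio = super_resolution_mb / classification_mb

print(f"output/input pixels: {resolution_ratio:.1f}x")
print(f"SR/classification data transactions: {transaction_ratio:.1f}x")
```

Under that reading, the pixel ratio is exactly 9 and the data-transaction ratio rounds to 11, matching the slide.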

Slide 5: Characteristic of SR CNN
- Large bit precision for QoS
- Inefficient high-bit data transactions and computations
- Efficient processing algorithm & HW is needed
[Figure: Relative Energy Cost [1]: FXP8 MAC x1, FXP16 MAC x3.8, 16b on-chip mem access x12.3]
[Figure: Results on Set5, PSNR by precision: FXP16 38.69 dB, FXP8 18.85 dB]
Challenge 2) High Bit Precision
[1] A. Raha et al., VLSID 2021

Slide 6: Features of Proposed SR-SoC
- 2-part optimization with heterogeneous hybrid-precision
- Heterogeneous Hybrid-Precision: for data compression and enhanced heterogeneity
- Heterogeneous Accel. Architecture: for energy efficiency
- Hierarchical Line Cache Subsystem: for smaller cache footprint and less EMA
- Heterogeneous L1 Line Cache: for smaller L1 cache and low power
[Figure labels: Proposed Approach; SW-HW Co-design; Process Unit; Cache Unit; Micro-Arch. Level]

Slide 7: Heterogeneous Hybrid-Precision Architecture
- SW-HW co-designed architecture
- Heterogeneous hybrid precision for less data transaction & sparsity
- High parallelism & skipping architecture for high computing efficiency
- Heterogeneous hierarchical cache w/ tiled exec. for high mem efficiency
- Problems: large & non-sparse data transactions cause 1) large EMA and 2) large on-chip mem power; non-sparse 16b data causes 3) inefficient high-bit computation
[Figure: Proposed Approach overview. Algorithm: input distribution of SR split into inliers and outliers (heterogeneity). Process units: stream acc. with 3x3x54 2D-PEs for dense data (high parallelism) and skipping acc. with 1D-PEs and skipping ctrl. for sparse data. Caching units: hierarchical $ (L2, L1, tiled line) holding short-term and long-term data, with regular access for dense data and tag check for unpredictable access.]

Slide 8: Proposed Heterogeneous Hybrid Precision
- Divide data into 2 sparsity-biased groups
- Thresholding with the probability density function (p.d.f.)
- Lower 90%: 100% non-zero dense FXP8 group; remainder: 10% non-zero sparse FP8 group
[Figure: distribution of log2|X| in FSRCNN*, LSB to MSB, with threshold Th = pdf^-1(90%) at log2(Th); inliers are represented in 8b as dense FXP8; outliers are stored as sparse FP8 and zeros are skipped]
*Dataset from Set5, Set14 SR dataset; measurement includes input and intermediate activation

Slide 9: Result of Heterogeneous Hybrid-Precision
- Maintaining quality (0.5 dB loss) while reducing external memory access by 47%
[Figure: Data Transaction (%) versus Outlier Ratio, from 0.0 (FXP16 = FXP8 + FP8) through 5%, 10%, 20%, 30%; annotations 43%, 47% (FXP8 + 5% FP8), 0.5, 0.6]
[Figure: PSNR (dB) versus Activation Precision (bit), 8 and 16: S.VLSI19, S.VLSI21, TCSVT18, ISSCC21, and This Work; arrows "More energy-efficient" and "Higher quality"]
*Dataset from Set5, Set14 SR dataset; measurement includes input and intermediate activation
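The thresholding step on the "Proposed Heterogeneous Hybrid Precision" slide can be illustrated with a small sketch. The slides do not spell out the exact encoding. The code below assumes that the threshold is the empirical 90% quantile of |x|, that values at or below it form the dense group, and that larger values are kept as sparse (index, value) pairs. The function name, the Gaussian test data, and the zero placeholder left in the dense group are all assumptions for illustration, not the chip's actual format.

```python
import math
import random


def split_by_magnitude(values, inlier_fraction=0.90):
    """Split values into a dense inlier list and sparse outlier pairs by |x|.

    The threshold is the empirical `inlier_fraction` quantile of |x|.
    Outlier positions hold 0.0 in the dense list (a simplification).
    """
    magnitudes = sorted(abs(v) for v in values)
    # Index of the largest magnitude still counted as an inlier.
    cut = max(0, math.ceil(inlier_fraction * len(magnitudes)) - 1)
    threshold = magnitudes[cut]
    dense = [v if abs(v) <= threshold else 0.0 for v in values]
    sparse = [(i, v) for i, v in enumerate(values) if abs(v) > threshold]
    return threshold, dense, sparse


random.seed(0)
# Hypothetical activations: Gaussian samples stand in for SR feature data.
activations = [random.gauss(0.0, 1.0) for _ in range(1000)]
threshold, dense, sparse = split_by_magnitude(activations)
outlier_ratio = len(sparse) / len(activations)
print(f"threshold={threshold:.3f}, outlier ratio={outlier_ratio:.1%}")
```

Raising `inlier_fraction` to 0.95 corresponds to the 5% outlier setting that appears in the results chart.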

Slide 10: Proposed Super-resolution SoC Architecture
1. FiXed-point Processing Units (FXPU)
   - For non-zero FXP8 convolution
   - For short-term line $
2. Floating-point Processing Unit (FPU)
   - For sparse FP8 convolution
   - For short-term tagged line $
3. Global L2 Line Cache
   - For top-level threshold biasing
   - For long-term overlapped line $
[Figure: SoC block diagram. Top level: off-chip memory, host PC, RISC-V with L1 I$+D$ (4KB), interconnection network, UART/SPI/I2C/GPIO, four L2 Line $ blocks (30KB x4) each with TLB and switch, AXI 64-bit x4. FXPU (x4): input allocator, L1 Line $ (5KB), weight spad (4KB), loop ctrlr., 3x18 WS systolic PE arrays, 6-ch 3x3 WS systolic PE array, adder trees. FPU: switch, router, FP2FxP, PLUT, input L1$ (2KB), output L1$ (0.5KB), IDX$, lanes 0 to 17, W spad (4KB), inst. $.]

Slide 11: Motivations of Heterogeneous Arch.
- Sparsity heterogeneity of HHP data
- 100% non-zero FXP8 group and 10% non-zero FP8 group
- Inefficient in previous homogeneous architecture
[Figure: Heterogeneity. Dense FXP8: 100% non-zero, regular mem. access, large portion. Sparse FP8: 10% non-zero w/ index, unpredictable mem. access, small portion.]

Slide 12: Data Compression for Non-Sparse SR
- Previous mixed-precision hardware [2]
- Divide data into a 16b large-value group and an 8b small-value group
- Low compression ratio with non-zero SR data
- Additional zero-data indexing bits as overhead
[Figure: raw FP16 data split into FP16-group and FP8-group with partial use of FP8 (ratio 0 to 1.0); stored as non-zero data plus FP16 metadata, with the zero data index as overhead]
[2] Lee et al., ISSCC 2019

Slides 13-14: Previous Mixed-Precision Architecture
- Homogeneous mixed-precision accelerator [2]
- Homogeneous skipping arch. for higher energy efficiency, BUT
- Dense low-precision group: skipping overhead dominant (90% of SR data)
- Sparse high-precision group: benefits from only a small portion (10%) of total data
[Figure: TFLOPS/W versus sparsity (0 to 90%) for zero-skipping HW and zero-gating HW, with the FP8-group and FP16-group operating points marked "Sparsity lie here!"; FP8/FP16 configurable MAC array with col-buf, DMEM, and ctrlr, shown in FP8 mode (skipping overhead) on slide 13 and FP16 mode on slide 14]
[2] Lee et al., ISSCC 2019

Slide 15: Heterogeneous Accelerating Architecture
- Efficient skipping- and parallelism-exploiting accelerator
[Figure: Heterogeneous. FPU: sparse L1$, SIMD ALU lane, Idx$, Inst.$, scoreboard; high throughput, unpredictable mem. access. FXPU (x4): dense L1$, systolic PE array, func. LUT; high parallelism, regular mem. access.]

Slide 16: Proposed High Parallelism FXPU
- High-parallelism FXPU for diverse convolution kernels
- 3x3 2D PE array: MAC operation; output-stationary systolic data reuse
- Input allocator w/ dense L1$: receive input data; reform & feed to PE array
- Adder tree w/ LUT: for activation func.
[Figure: input tensor (X0, X1, X2, ...) passing through the input allocator into 3x3 OS systolic arrays (T0, x9) and the adder tree]

Slide 17: Proposed Zero-skipping Architecture
- Zero-skipping FPU for sparse convolutions
- 4 3x1 1D PE lanes: sparse multiply & act. func. operation; scoreboard-based parallelism
- Skip ctrlr. w/ sparse L1$: receive input data & skip index; select weight w/ zero index
- Accumulator & reorder L1$: sparse accumulate
[Figure: input vector and Idx$ feeding the skip controller, scoreboard, lanes 1 to 3, sparse accumulator, and reorder buffer]

Slide 18: Results of Heterogeneous Accelerating Arch.
[Figure: Energy efficiency (mJ/frame): baseline FXP16 processor (previous SRNPU [3]) 2.38; hybrid FP-FXP arch. 1.2 (49.6%)]
[Figure: Normalized Power (%), core and memory: baseline FXP16 processor (previous SRNPU [3]) 100; hybrid FP-FXP arch. annotated 46% and 47%]
*Under grouping threshold of 95%
[3] J. Lee et al., S.VLSI 2019
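The zero-skipping idea behind the FPU slide can be sketched in one dimension: store only the non-zero inputs together with their indices, and let each one contribute to the outputs it touches. This is a generic scatter-style sparse convolution written under that assumption. It does not model the chip's scoreboard, lanes, or reorder buffer, and the function names are illustrative.

```python
def dense_conv1d(signal, kernel):
    """Reference 'valid' 1D correlation that visits every input, zero or not."""
    n_out = len(signal) - len(kernel) + 1
    return [sum(signal[o + k] * kernel[k] for k in range(len(kernel)))
            for o in range(n_out)]


def to_sparse(signal):
    """Keep only non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(signal) if v != 0]


def sparse_conv1d(pairs, length, kernel):
    """Same result as dense_conv1d, iterating only over non-zero inputs.

    Each non-zero input is multiplied by every kernel tap and accumulated
    into the output it contributes to; zero inputs cost no multiplications.
    """
    n_out = length - len(kernel) + 1
    out = [0] * n_out
    macs = 0
    for i, v in pairs:
        for k, w in enumerate(kernel):
            o = i - k  # output position that pairs input i with tap k
            if 0 <= o < n_out:
                out[o] += v * w
                macs += 1
    return out, macs


signal = [0, 0, 5, 0, 0, 0, 0, -3, 0, 0]  # mostly zero, like the outlier group
kernel = [1, 2, 1]                        # 3-tap kernel
result, macs = sparse_conv1d(to_sparse(signal), len(signal), kernel)
dense_macs = (len(signal) - len(kernel) + 1) * len(kernel)
print(result, "MACs:", macs, "of", dense_macs)
```

The multiply count scales with the number of non-zero inputs, which is why skipping pays off for the sparse group but adds only index overhead for the dense one.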

Slides 19 and 21: Proposed Hierarchical Cache Subsystem
- 2-level hierarchical line cache
- Tile-based hierarchical cache to reduce cache size & power
- Multi-core layer fusion to reduce EMA
[Figure: memory hierarchy (main mem, level 2, level 1, LRF) showing the cached feature region within the total feature space, shared by FXPU0, FXPU1, ..., FXPUn; labels "Layer fusion" and "Tile-based hierarchical $"]

Slide 20: Proposed Hierarchical Cache Subsystem
- Tile-based execution
- Fetch a small tile of the entire input each time
- Shorter reuse distance between lines
- x12 smaller on-chip cache
[Figure: same hierarchy (main mem, level 2, level 1, LRF) with the cached feature fed to the conv. PE array; annotation "x12 smaller"]
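The benefit of a short reuse distance between lines can be illustrated with a generic line buffer. A 3x3 convolution needs only the three most recent input rows, so rows can be streamed through a small buffer instead of keeping the whole feature map on chip. This sketch uses a plain three-row buffer, not the two-level tiled cache from the slides, and all names in it are illustrative.

```python
from collections import deque


def conv3x3_full(image, kernel):
    """Reference 'valid' 3x3 correlation with the whole image resident."""
    h, w = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def conv3x3_line_buffer(rows, kernel):
    """Same result, but only three input rows are held at any time.

    `rows` is any iterable yielding one image row at a time, standing in
    for lines streamed from a larger memory into a small line cache.
    """
    window = deque(maxlen=3)  # the oldest row is evicted automatically
    for row in rows:
        window.append(row)
        if len(window) == 3:
            width = len(row)
            yield [sum(window[i][c + j] * kernel[i][j]
                       for i in range(3) for j in range(3))
                   for c in range(width - 2)]


image = [[(r * 7 + c * 3) % 11 for c in range(8)] for r in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
streamed = list(conv3x3_line_buffer(iter(image), kernel))
print(len(streamed), "output rows computed from a 3-row buffer")
```

Tiling the input along its width would shrink each buffered row further, which is the direction the tile-based execution slide points in.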

Slide 22: Result of Hierarchical Cache Subsystem
- Additionally reduce global memory footprint by 66%
- Reduce L2 cache (large memory) access by 18x
[Figure: Data Transactions (MB) for this work and previous work [4] across Ext.Mem/FMem, Fmem/PE, L1$/PE, L2$/L1$, and Ext.Mem/L2$; individual bar values are not cleanly recoverable from the extracted text]
[Figure: EMA (MB) and on-chip mem footprint (KB) for no fusion / previous work (EMA 91 MB), proposed HHP (47%), and proposed hier. $ subsys. (66%)]
[4] K. Goetschalckx et al., S.VLSI 2021

Slide 23: Result of Hierarchical Cache Subsystem
- With smaller cache subsystem: 58.4% power
- With fused layer: 61.8% energy
[Figure: Normalized Power (%), core and memory: previous SR NPU [3] 100 and 100; hybrid FP-FXP arch. 54 and 53 (46%, 47%); hierarchical line cache 54 and 41.6 (58.4%)]
[Figure: Energy/Frame (mJ): previous SR NPU [3] 2.38; hybrid FP-FXP arch. 1.2 (49.6%); hierarchical line cache 0.91 (61.8%)]
[3] J. Lee et al., S.VLSI 2019
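The percentages on the energy chart follow from the mJ/frame values if they are read as reductions relative to the previous SR NPU. The short check below assumes that reading. The dictionary keys mirror the chart labels, and the helper name is illustrative.

```python
# Energy per frame in mJ, as quoted on the chart.
energy_mj_per_frame = {
    "Previous SR NPU": 2.38,
    "Hybrid FP-FXP Arch.": 1.2,
    "Hierarchical Line Cache": 0.91,
}


def reduction_percent(value, reference):
    """Relative reduction of `value` with respect to `reference`, in percent."""
    return (1.0 - value / reference) * 100.0


baseline = energy_mj_per_frame["Previous SR NPU"]
reductions = {
    name: round(reduction_percent(energy, baseline), 1)
    for name, energy in energy_mj_per_frame.items()
}
for name, percent in reductions.items():
    print(f"{name}: {percent}% lower energy per frame than the baseline")
```

The same helper applied to the normalized memory power (100 to 41.6) gives the 58.4% quoted for the smaller cache subsystem.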

Slide 24: Chip Photography and Summary
[Figure: die photo, 4,000 x 2,500, with blocks FPU, RISC-V, Enc./PLUT, Dec./Aggr., FXPU0,1 and FXPU2,3 (each with L1 line buf.), L2 line buffer, inst. cache, input $]

Supported workloads:
  Resolution        Method     Framerate
  270p to 1080p     FSRCNN     107 fps
  540p to 4k        ClassSR    41 fps
  GAN               OMGD       86 fps

Chip summary:
  Process:                  65nm
  Activation precision:     FXP8 + 5~10% FP8
  Supply voltage (V):       1.0
  Frequency (MHz):          200
  SR algorithm:             FSRCNN / ClassSR
  Frame rate* (fps):        107
  SR energy* (mJ/frame):    0.92
*x4 FSRCNN on Set5, Set14 dataset
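Two quantities implied by the summary table can be derived with simple arithmetic. The slides do not state an average power figure. The value below is computed on the assumption that the 0.92 mJ/frame energy and the 107 fps frame rate describe the same x4 FSRCNN operating point, as the shared footnote suggests. All names are illustrative.

```python
# Values quoted in the chip summary.
frame_rate_fps = 107
energy_mj_per_frame = 0.92

# Energy per frame times frames per second gives average power:
# mJ/frame * frame/s = mJ/s = mW.
implied_power_mw = energy_mj_per_frame * frame_rate_fps

# Vertical resolutions of the two SR modes, as (input, output) line counts.
# "4k" is taken here as 2160 lines, which is an assumption.
modes = {"FSRCNN": (270, 1080), "ClassSR": (540, 2160)}
scale_factors = {name: out // inp for name, (inp, out) in modes.items()}

print(f"implied average SR power: {implied_power_mw:.1f} mW")
print("upscaling factors:", scale_factors)
```

Both modes come out as x4 upscaling, consistent with the footnote on the table.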

Slide 25: Conclusion
- Energy-efficient non-sparse high-quality SR CNN
- Heterogeneous accelerating w/ hybrid-precision: 46% processing power & 47% mem access power & 47% EMA
- Data lifetime-aware hierarchical line cache: 53.7% mem access power & 71.8% EMA
- A 0.92 mJ/frame super-resolution SoC for resource-limited mobile applications

Slide 26: Thank You!
Questions? Feel free to contact me!
E-mail: Zhiyong_li@kaist.ac.kr
Zoom Meeting: https:/zoom.us/xxxx (Password: HC_GANPU)
Acknowledgement: TBD.
