Slide 1 of 25
HOTCHIPS 2022
DSPU: A 281.6mW Real-Time Deep Learning-Based Dense RGB-D Data Acquisition with Sensor Fusion and 3D Perception System-on-Chip
Dongseok Im, Gwangtae Park, Zhiyong Li, Junha Ryu, Sanghoon Kang, Donghyeon Han, Jinsu Lee, Wonhoon Park, Hankyul Kwon, and Hoi-Jun Yoo
Semiconductor System Lab., School of EE, KAIST

Slide 2 of 25: 3D Data in Mobile Platforms
- RGB-D data: More Accurate and Versatile Applications
  - CNNs recognize only 2D pictures, but the real world consists of 3D objects
  - RGB-D (3D) data enables exact 3D object recognition
[Figure: Accuracy (mAP) over time for 2D-based and 3D-based methods, CVPR16 through ICCV21. Application labels: Face Recognition, AR/VR, 3D Geometry, High Accuracy]

Slide 3 of 25: DSPU: End-to-end 3D Perception SoC
- A 281 mW and 31.9 fps 3D Object Recognition Processor
  - For Low-Power RGB-D Data Acquisition: CNN-based MDE & Sensor Fusion SW/HW Architecture
  - For Real-time 3D Perception (e.g. 3D Bounding Box): Window-based Search & Point Feature Reuse SW/HW Architecture
[Figure: chip layout, 3.6 mm x 3.6 mm: UMPU Core #0 to #7, DMU Core #0 to #3, UPPU Core #0 and #1, RISC-V Core, Interconnect Network]
[Figure: Low-Power and Real-Time DSPU SoC. Inputs: RGB Cam. and Low-power ToF (RGB Data, Raw Depth Data). Stages: Monocular Depth Estimation, Sensor Fusion, 3D Perception. Outputs: Aligned Dense RGB-D, 3D Bounding Box (Final Result). Time labels t0, t1, t2]
1) MDE: Monocular Depth Estimation
Slide 4 of 25: Challenges of 3D Perception
- Power and Latency Challenges in Mobile Platforms
  - High sensor power (3 W)
  - High latency in CPU+GPU Platform (10 fps)
[Figure: conventional pipeline. Inputs: RGB Camera and ToF (RGB Data, Raw Depth Data). Stages: Camera Sync & Coordinate Align, Depth Inpainting, 3D High-level Perception. Intermediate and final data: Aligned RGB-D, Dense RGB-D, 3D Bounding Box (Final Result). Time labels t0, t1, t2]

Slide 5 of 25: Proposed End-to-end 3D Perception
1. CNN-based MDE for Low-Power Dense RGB-D Acquisition
2. Sensor Fusion for Accurate RGB-D Data
3. Window Search-based PNN for Low-Latency 3D Perception
[Figure: proposed pipeline. Inputs: RGB Camera and Low-power ToF (RGB Data, Raw Depth Data). Stages: Monocular Depth Estimation (CNN), Depth Fusion (KNN + C-Grad), Point Cloud Neural Network (UDS + BQ + CNN). Outputs: Aligned Dense RGB-D, 3D Bounding Box (Final Result). Time labels t0, t1, t2]
1) LP ToF: Low Power ToF, 2) CNN: Convolutional Neural Network, 3) KNN: K-nearest neighbor search, 4) C-Grad: Conjugate-gradient, 5) UDS: Uniform Distance Point Sampling, 6) BQ: Ball Query
Slide 6 of 25: Challenges of Sensor Fusion
- Irregular Sparse Matrix generated by KNN
  - CSR produces Data + Index, but the data size is still large (1.86 MB)
  - SpMM & SpMV result in many data transactions due to low data reuse
[Figure: computing W^T W. Method 1: Inner-Product (No Input Reuse). Method 2: Outer-Product (No Output Reuse)]
Encoded Data Size Breakdown:
  Nonzero Data (16b): 48.2%
  Column Index (15b): 45.1%
  Row Ptr Index (20b): 6.7%
1) CSR: Compressed Sparse Row

Slide 7 of 25: Band Matrix Encoding
- Diagonal BM generated by Window Search-based KNN
  - Hierarchical BM encoding produces Diagonal Index + Data + Small Index
  - Increases the data compression ratio
[Figure: KNN in Window with Boundary yields a Partitioned BM with diagonals D0 to D3. Step 1: Diagonal Encoding (Diag. Idx, Diag. Data). Step 2: Run-length Encoding. Encoded fields: Nonzero Data (16b), Nonzero Idx (4b), Diag. Idx (3b)]

Slide 8 of 25: Band Matrix Decoding for SpMM
- Simultaneous W^T & W Computation
  - Increases both input data and output data reuse
  - Reduces the number of data transactions
[Figure: W^T W computed from diagonals D0 to D3. 1) Output Reuse by Inner-Product, 2) Input Reuse by Outer-Product, 3) Input Reuse by Transpose]
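The idea behind the Band Matrix Encoding slide can be illustrated with a minimal sketch. It assumes plain diagonal storage (one array per nonzero diagonal, keyed by its offset) and does not model the hierarchical partitioning or the run-length step shown on the slide. The function names `encode_diagonals` and `decode_diagonals` and the toy matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def encode_diagonals(mat):
    """Store only the nonzero diagonals of a square matrix as {offset: values}."""
    n = mat.shape[0]
    encoded = {}
    for offset in range(-(n - 1), n):
        diag = np.diagonal(mat, offset=offset)
        if np.any(diag != 0):
            encoded[offset] = diag.copy()
    return encoded

def decode_diagonals(encoded, n):
    """Rebuild the dense matrix from its stored diagonals."""
    mat = np.zeros((n, n))
    for offset, diag in encoded.items():
        rows = np.arange(len(diag)) + max(-offset, 0)
        cols = np.arange(len(diag)) + max(offset, 0)
        mat[rows, cols] = diag
    return mat

# Toy band matrix: each row only touches columns within a window of +/-1,
# mimicking neighbors found by a search restricted to a small window.
n = 6
band = np.zeros((n, n))
for i in range(n):
    for j in range(max(0, i - 1), min(n, i + 2)):
        band[i, j] = 1.0 + i + 0.1 * j

enc = encode_diagonals(band)
restored = decode_diagonals(enc, n)
stored_values = sum(len(d) for d in enc.values())
print(sorted(enc.keys()), stored_values, "of", n * n, "dense entries")
```

Because the position of every value is implied by its diagonal offset, no per-element column index is needed, which is the property the slide contrasts with CSR.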
Slide 9 of 25: Band Matrix Decoding for SpMV
- Simultaneous Lower & Upper Triangle of W^T W Computation
  - Increases both input data and output data reuse
  - Reduces the number of data transactions
[Figure: (W^T W) b computed from diagonals D0 to D3 and vector elements b0 to b3. 1) Output Reuse by Inner-Product, 2) Input Reuse by Outer-Product, 3) Input Reuse by Transpose]

Slide 10 of 25: Performance of BM Codec
- Reduction of Memory Footprint and Data Transactions
  - BM encoding-decoding increases the speed of sensor fusion
Sensor Fusion Latency (ms): Baseline 16.0, This Work 7.5 (53.1% lower)
Data Transaction (MB/Frame): No BM Decoder 125.9, BM Decoder 64.5 (48.8% lower)
Memory Footprint of W^T W Matrix (KB): Raw Data 703000, CSR 1858, This Work 735 (60.5% lower than CSR)
Memory Footprint of W Matrix (KB): Raw Data 703000, CSR 700, This Work 472 (32.6% lower than CSR)
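The reuse described on the Band Matrix Decoding for SpMV slide relies on W^T W being symmetric, so each stored upper-triangle value can serve the mirrored lower-triangle entry as well. The following is a minimal sketch of that idea only; it assumes the matrix is stored as a dictionary of upper diagonals, and the function name `symmetric_band_spmv` and the random test data are hypothetical, not taken from the slides.

```python
import numpy as np

def symmetric_band_spmv(upper_diags, b):
    """Compute y = A @ b for symmetric A given only diagonals with offset >= 0."""
    n = len(b)
    y = np.zeros(n)
    for k, d in upper_diags.items():
        m = n - k
        y[:m] += d * b[k:]       # upper-triangle entries A[i, i+k]
        if k > 0:
            y[k:] += d * b[:m]   # mirrored lower-triangle entries A[i+k, i]
    return y

rng = np.random.default_rng(0)
n, half_width = 8, 2

# Banded W: each row has nonzeros only within +/- half_width of the diagonal.
w = np.zeros((n, n))
for i in range(n):
    lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
    w[i, lo:hi] = rng.standard_normal(hi - lo)

a = w.T @ w  # symmetric, with nonzeros within +/- 2*half_width of the diagonal
upper = {k: np.diagonal(a, offset=k).copy() for k in range(2 * half_width + 1)}
b = rng.standard_normal(n)
y = symmetric_band_spmv(upper, b)
print(np.allclose(y, a @ b))
```

Each diagonal is read once and contributes to two output ranges, which is one way to picture how a single fetch can feed both triangles.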
Slide 11 of 25: Redundant Operations in PNN
- Redundant Convolution OPs at Overlapping Neighbors
  - On average, 50% of neighbors overlap after BQ
  - Their point features cause redundant convolution OPs
[Figure: Ball Query (BQ) forms two groups from points P0 to P3. Each grouped point pairs a PF with a GF, and a 1x1 Convolution with weights W is applied per group. Points shared by both groups (Same PFs) cause Redundant OPs at Overlapping PFs]
1) PF: Point Feature, 2) GF: Group Feature
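The overlap that the Redundant Operations slide refers to can be seen in a small ball-query sketch. The coordinates, radius, and names (`ball_query`, `center_a`, `center_b`) are made-up illustrative assumptions; they are not data from the slides and are not meant to reproduce the reported 50% average.

```python
import numpy as np

def ball_query(points, center, radius):
    """Return indices of points whose L2 distance to center is below radius."""
    dist = np.linalg.norm(points - center, axis=1)
    return set(np.flatnonzero(dist < radius).tolist())

# Four toy points on a line and two nearby group centers.
points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
])
center_a = np.array([0.9, 0.0, 0.0])
center_b = np.array([2.1, 0.0, 0.0])

group_a = ball_query(points, center_a, radius=1.5)
group_b = ball_query(points, center_b, radius=1.5)
shared = group_a & group_b
print(sorted(group_a), sorted(group_b), sorted(shared))
```

Any point in `shared` would have its point feature convolved once per group in a naive implementation, which is the redundancy the slide points out.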
Slide 12 of 25: Point Feature Reuse
- Computational Reuse at Overlapping Point Features
  - Execute the convolution on PFs and GFs separately
  - Reuse the PF convolution results by aggregating corresponding GF results
[Figure: 1x1 Convolution on PF (P0 to P3, weights W) and 1x1 Convolution on GF (per group, weights W), followed by PF Aggregation (+). PF Reuse: the PF convolution results are shared between groups]
1) PF: Point Feature, 2) GF: Group Feature

Slide 13 of 25: Pipelined Architecture
- Point Feature Reuse with the UPPU, UMPU, and DMU
  - Pipelined architecture hides the processing time of each HW unit
[Figure: block diagram. UPPU outputs the Group Point Index and Center Point Index. UMPU (PE Arrays, Output Buffers) produces PF Convolution Results and GF Convolution Results. DMU contains the GF Buffer, PF LUT (Global Memory), Addr Generator (Group Idx, Center Idx), and PF Aggregator, and performs PF Aggregation]
1) UPPU: Unified Point Processing Unit, 2) UMPU: Unified Matrix Processing Unit, 3) DMU: Data Management Unit

Slide 14 of 25: Pipelined Architecture
- Simultaneous Convolution and Ball Query Operations
  - UPPU performs the BQ on 3D point data
  - UMPU computes the convolution on PFs of all 3D point data
[Figure: same block diagram with the UPPU (Ball Query) and UMPU (1x1 Convolution on PF) active. Group Idx entries: 3,2,1 and 3,1,0. GF Buffer and PF LUT (Global Memory) are still empty]
1) UPPU: Unified Point Processing Unit, 2) UMPU: Unified Matrix Processing Unit, 3) DMU: Data Management Unit
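The Point Feature Reuse slide splits one convolution into a PF part and a GF part. A minimal sketch of why this gives the same result follows, assuming the 1x1 convolution acts as a per-point linear map on the concatenation [PF, GF] (bias and activation omitted). All dimensions, the random data, and the names `w_pf`, `w_gf`, and `pf_lut` are hypothetical; the two index lists follow the Group Idx values 3,1,0 and 3,2,1 shown in the diagrams.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, pf_dim, gf_dim, out_dim = 4, 8, 3, 16

pf = rng.standard_normal((num_points, pf_dim))   # point features P0..P3
w_pf = rng.standard_normal((pf_dim, out_dim))    # weight rows applied to PF
w_gf = rng.standard_normal((gf_dim, out_dim))    # weight rows applied to GF
w_full = np.vstack([w_pf, w_gf])                 # weights for [PF, GF]

# Two overlapping groups; group names are placeholders.
groups = {"group_a": [0, 1, 3], "group_b": [1, 2, 3]}
gf = {name: rng.standard_normal((len(idx), gf_dim)) for name, idx in groups.items()}

# Baseline: convolve the concatenated [PF, GF] rows of every group.
baseline = {
    name: np.hstack([pf[idx], gf[name]]) @ w_full for name, idx in groups.items()
}

# Reuse: convolve each PF once, keep it in a lookup table, then add the GF part.
pf_lut = pf @ w_pf
reused = {name: pf_lut[idx] + gf[name] @ w_gf for name, idx in groups.items()}

baseline_pf_rows = sum(len(idx) for idx in groups.values())
reused_pf_rows = num_points
print(baseline_pf_rows, "PF rows convolved without reuse,", reused_pf_rows, "with reuse")
```

The equality holds because a linear map over a concatenated input is the sum of the maps over its parts, so only the small GF part has to be recomputed per group.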
Slide 15 of 25: Pipelined Architecture
- Simultaneous Convolution and PF LUT Update
  - UMPU computes the convolution on GFs
  - PF LUT is updated by new PF convolution results
[Figure: same block diagram with the UMPU (1x1 Convolution on GF) and DMU active. PF LUT (Global Memory) now holds P0 to P3. Group Idx entries: 3,2,1 and 3,1,0]
1) UPPU: Unified Point Processing Unit, 2) UMPU: Unified Matrix Processing Unit, 3) DMU: Data Management Unit
Slide 16 of 25: Pipelined Architecture
- PF Aggregation on the Group C
  - P0, P1, and P3 are loaded from PF LUT by the address generator, and summed up with P0C, P1C, and P3C
[Figure: same block diagram with the DMU active. The Addr Generator uses Group Idx 3,1,0 to read P0, P1, and P3 from the PF LUT (Global Memory). The PF Aggregator adds them to P0C, P1C, and P3C from the GF Buffer]
1) UPPU: Unified Point Processing Unit, 2) UMPU: Unified Matrix Processing Unit, 3) DMU: Data Management Unit

Slide 17 of 25: Pipelined Architecture
- PF Aggregation on the Group C
  - P1, P2, and P3 are loaded from PF LUT by the address generator, and summed up with P1C, P2C, and P3C
[Figure: same block diagram with the DMU active. The Addr Generator uses Group Idx 3,2,1 to read P1, P2, and P3 from the PF LUT (Global Memory). The PF Aggregator adds them to P1C, P2C, and P3C from the GF Buffer]
1) UPPU: Unified Point Processing Unit, 2) UMPU: Unified Matrix Processing Unit, 3) DMU: Data Management Unit
Slide 18 of 25: PNN Performance
- Performance Improvement with Pipelined Architecture
VoteNet
Latency (ms): Baseline 14.6, +Inout Skip 11.3, +PF Reuse 7.7 (annotated reductions: 22.5%, 46.3%)
Energy-Delay Product (uJ·s): Baseline 110, +Inout Skip 54.9, +PF Reuse 20.5 (annotated reductions: 50.1%, 81.4%)
Slide 19 of 25: Challenges of Point Processing
- Different Operations between Point Processing Algorithms
  - Dedicated HW units are required
  - The area overhead of HW units increases
[Figure: three dedicated units. KNN Unit: Neighbor Search in Depth Fusion. UDS Unit: Point Sampling in PNN. BQ Unit: Point Grouping in PNN. Grouping conditions shown: Group Points, s.t. topK(d, K); Group Points, s.t. d < threshold]
1) PNN: Point Cloud-based Neural Network, 2) KNN: K-nearest neighbor search, 3) BQ: Ball Query, 4) UDS: Uniform Distance Point Sampling
Slide 20 of 25: Window Search-based Point Processing
- Point Processing within the Predefined Window
  - The number of operations can be greatly reduced
  - The different point processing algorithms can share "operations", e.g., window generation, L2 distance computation, load/store block data
[Figure: Unified Point Processing Unit with Shared OPs. KNN in Window: Neighbor Search in Depth Fusion. UDS in Window: Point Sampling in PNN. BQ in Window: Point Grouping in PNN. Distance threshold label: dth]
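The sharing described on the Window Search-based Point Processing slide can be sketched in software as two search routines built on the same window-generation and L2-distance helpers. This is only an illustration under assumptions: points are laid out on a depth-image grid, the window is a square of pixels around the center, and UDS is left out because the slides do not define it in enough detail. All function names and the toy grid are hypothetical.

```python
import numpy as np

def window_candidates(grid_xyz, row, col, half):
    """Shared step 1: collect points inside a (2*half+1)^2 pixel window."""
    h, w, _ = grid_xyz.shape
    r0, r1 = max(0, row - half), min(h, row + half + 1)
    c0, c1 = max(0, col - half), min(w, col + half + 1)
    rows, cols = np.meshgrid(np.arange(r0, r1), np.arange(c0, c1), indexing="ij")
    idx = np.stack([rows.ravel(), cols.ravel()], axis=1)
    return idx, grid_xyz[r0:r1, c0:c1].reshape(-1, 3)

def window_distances(grid_xyz, row, col, half):
    """Shared step 2: L2 distance from the center point to each candidate."""
    idx, cand = window_candidates(grid_xyz, row, col, half)
    dist = np.linalg.norm(cand - grid_xyz[row, col], axis=1)
    return idx, dist

def knn_in_window(grid_xyz, row, col, half, k):
    """KNN: keep the k closest candidates (the center itself has distance 0)."""
    idx, dist = window_distances(grid_xyz, row, col, half)
    return idx[np.argsort(dist, kind="stable")[:k]]

def bq_in_window(grid_xyz, row, col, half, radius):
    """Ball query: keep every candidate closer than radius."""
    idx, dist = window_distances(grid_xyz, row, col, half)
    return idx[dist < radius]

# Toy 5x5 grid of 3D points lying on a flat plane: (x, y, z) = (col, row, 0).
h = w = 5
cols, rows = np.meshgrid(np.arange(w), np.arange(h))
grid_xyz = np.stack([cols, rows, np.zeros((h, w))], axis=-1).astype(float)

knn = knn_in_window(grid_xyz, row=2, col=2, half=1, k=5)
bq = bq_in_window(grid_xyz, row=2, col=2, half=1, radius=1.2)
corner_idx, _ = window_candidates(grid_xyz, row=0, col=0, half=1)
print(knn.tolist(), bq.tolist(), len(corner_idx))
```

Only the final selection rule differs between the two routines, which mirrors the slide's point that window generation and distance computation can be shared.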
Slide 21 of 25: Unified Point Processing Unit
- Area Saving by Sharing Common Logic and Buffer
  - Hardware units for the window-based search and output buffers are shared
[Figure: Logic Area Reduction in UPPU. Normalized area of Dedicated HW Units (1.0) versus UPPU, annotated 57.2%. Breakdown: KNN, BQ, UDS, Shared]
[Figure: Unified Point Processing Unit (UPPU) block diagram. Input Buffer (Center 3D Point, Center Depth Pixel), Shared Window Generator, Shared L2 Dist. Calculator (Adder Tree), KNN Module (CMP & Sort, REG Array, Neighbors), BQ Module (CMP, FIFO, Radius), UDS Module (CMP, REG, Thres.), Shared Output Buffer (Group/Center Index Buffer, Group/Center Coordinate Buffer)]
1) PNN: Point Cloud-based Neural Network, 2) KNN: K-nearest neighbor search, 3) BQ: Ball Query, 4) UDS: Uniform Distance Point Sampling
Slide 22 of 25: Chip Photography and Summary
- 64.4% Lower Power Consumption than Previous System
- 53.6% Lower Latency than Previous System
[Figure: chip photograph, 3.6 mm x 3.6 mm: UMPU Core #0 to #7, DMU Core #0 to #3, UPPU Core #0 and #1, RISC-V Core, Interconnect Network]
Specifications
  Technology: Samsung 28 nm
  Die Area: 12.96 mm^2
  SRAM: 806 KB
  Supply Voltage: 0.72-1.1 V
  Max. Frequency: 250 MHz
  ISA: RISC-V
UMPU Performance
  Peak Throughput (TOPS): 4.5 Depth CNN (8b), 1.8 Depth CNN (12b), 0.1 C-Grad (16b), 11.6 Point CNN (8b)
  Power (mW): 544.7 Depth CNN (8b), 640.9 Depth CNN (12b), 545.3 C-Grad (16b), 609.1 Point CNN (8b)
UPPU Performance
  Throughput (TOPS): 1.1 Point Grouping, 0.3 Point Sampling
  Power (mW): 25.1 Point Grouping, 23.0 Point Sampling
Depth Signal Processing Power Consumption (W): VLSI21 1.20, DSPU 0.44 (63.4% lower). Breakdown: Chip, Host CPU, External Memory, ToF & RGB Sensor
End-to-End Depth Signal Processing Latency (ms): VLSI21 67.3, DSPU 31.3 (53.6% lower)
1) VLSI21 System: S. Kim's ASIC (VLSI21) + Host CPU + External Memory + RGB-D Sensor

Slide 23 of 25: Measurement Results
- Visual Results of 3D B-Box Extraction
  - The ToF sensor cannot capture a chair in the back, so it fails to extract the 3D bounding box (B-Box)
  - This work detects all of the objects
[Figure: RGB Image (SUN-RGBD); ToF Sensor + 3D B-Box (Failed, Missing 3D B-Box); This Work (RGB-D + 3D B-Box)]
Slide 24 of 25: Conclusion
- DSPU: Low-power and Real-Time 3D Object Recognition SoC
  - For Low-power and Real-Time 3D Object Recognition:
    - BM Encoder and Decoder for Low Latency
    - PF Reuse with Pipelined Architecture for Low Latency and Energy
    - Shared Unified Point Processing Unit for High Reconfigurability
A 281.6 mW and 31.9 fps Dense RGB-D Acquisition and PNN 3-D Recognition Processor for Mobile 3-D Vision

Slide 25 of 25: Thank You!
Questions? Feel Free to Contact Me!
E-mail: dsimkaist.ac.kr
LinkedIn: https:/