Vision Perception Unit: Next-Generation Smart CMOS Image Sensor

Wenqi Ji, Yuxing Han, Jiangtao Wen, Yubin Hu, Futang Wang, Yuze He, Xi Li and Jun Zhang
Department of Computer Science and Technology, Tsinghua University

Abstract

As we reach the end of Moore's Law and Dennard scaling, it has become highly desirable to design a highly integrated and optimized pipeline specifically for computer vision. A new generation of integrated “smart” visual processors that streamline an end-to-end optimized visual information acquisition and processing pipeline (VIAPP) becomes necessary to lower cost, power consumption, and latency. We describe a new paradigm for VIAPP, the Vision Perception Unit (VPU), wherein electric signals generated by photons are amplified before conversion to digital signals to emulate an initial layer of a convolutional neural network (CNN). The outputs from these layers are then converted to digital signals and processed by the following layers of a deep CNN.

Conventional CMOS Image Sensor

- An image signal processor (ISP) handles color processing, denoising, correction, etc.
- Further pre-processing, e.g., image enhancement and compression, runs in a digital signal processor (DSP).
- All captured image frames are processed at the original high resolution.

VPU: Next-Generation CIS

- Sensing and processing are integrated.
- The DSP takes raw images as input.
- No delay or power consumption from an ISP.
- In-pixel filters are driven by the DSP.

VPU: Process and Domain Specific Architecture

Pixels for Dynamic Architecture

a. Dynamic resolution

b. ROI
c. Dynamic read-out precision (2-bit, 4-bit, 8-bit)

- Partial read-out from the CMOS image sensor.
- Dynamically lower the frequency of the ADCs and clocks, and reduce the bandwidth of data transmission.

[Figure: read-out modes with panel axes T, X, Y and a feedback path labeled Vision Tasks (Video Segmentation, Edge Extraction).]

VPU: Sensing-Processing-Integrated Hardware

Edge Extraction and Video Segmentation on DSP

- The DSP in VPU is optimized for edge extraction and video segmentation in low light for various applications.
- Low light
  - Applicable to 24/7 self-driving, AIoT, CCTV, robots, etc.
  - Suitable for high-frame-rate imaging (1000 fps)
  - Low lens cost
- Edge extraction and video segmentation
  - Basic features and semantic labels used for other CV tasks

Performance of Edge Extraction

- Our U-Net-based edge extraction model works on raw images directly from the image sensor read-out.
- It outputs contour information for gesture recognition and abnormal behavior detection.
- Suitable for extremely low light (20 photons per pixel).

[Figure: edge extraction in VPU compared with state-of-the-art methods using SID for enhancement and HED for edge extraction.]

Dynamic Resolution for Video Segmentation

- Processing video frames with dynamic resolution reduces both read-out cost and computation cost.

[Figure: sensor array with pixel (photodetector and read-out transistors), row decoder, and column amplifiers/caps. Read-out with dynamic resolution utilizes the random access ability of CMOS. VPU DR-Seg: computation with dynamic resolution. Traditional: computation with constant resolution.]

DR-Seg: Feature Fusion & Training Process

- The Cross Resolution Feature Fusion module (CReFF) aggregates HR features into LR features with a local attention mechanism.
- A feature similarity loss is designed to aid the training process.

[Figure: CReFF and training diagram. Labels: Shared 1x1 Conv; HR Frame t; LR Frame t+1 & MV map; HR Frame t+1; Feature Similarity Training; Cross Resolution Feature Fusion; MV_Warp; Upsample; Grouped Conv; Q, K, V; Local Query; Local Attention; Output.]

Performance of DR-Seg

- DR-Seg outperforms the state-of-the-art constant-resolution algorithm by 1.0% mIoU with only 32.97% of the FLOPs.

[Figure: qualitative comparison. Columns: Image; Constant Resolution (PSPNet18); DR-Seg (Ours).]

Results on CamVid dataset

Methods               Resolution    mIoU (%)    GFLOPs
PSPNet18*             1.0x          69.43       309.28
PSPNet18*             0.5x          66.87       77.27
DR0.5-PSP18 (Ours)    1.0x, 0.5x    70.48       101.98
DR0.5-PSP18 (Ours)    1.0x, 0.3x    69.00       56.33

Results on Cityscapes dataset

Methods               Resolution    mIoU (%)    GFLOPs
PSPNet18*             1.0x          69.00       938.52
PSPNet18*             0.5x          63.95       234.63
DR0.5-PSP18 (Ours)    1.0x, 0.5x    69.03       309.69

* Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881-2890 (2017)

Summary

- Suitable for self-driving and AIoT
- Tape-out in 2023

We proposed VPU, the next-generation smart CMOS image sensor. VPU pioneers the integration of image sensing and processing into one chip. VPU can save power through its end-to-end architecture, which reduces the cost of intermediate processing, and through dynamic-resolution algorithms with optimized, dynamically controlled pixels, which reduce the cost of computation and read-out. Our results show that the dynamic-resolution architecture improves the efficiency of video segmentation in VPU while maintaining accuracy. Edge detection by VPU outperforms SOTA methods that use a traditional CIS.
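
The headline DR-Seg figures (1.0% mIoU gain at 32.97% of the FLOPs) can be rechecked from the reported table values. The sketch below is plain arithmetic on those numbers; the dictionary layout and the function name `compare` are illustrative, and assigning the first table to CamVid and the second to Cityscapes follows the order of the slide captions, which is an assumption.

```python
# Reported (mIoU %, GFLOPs) pairs copied from the results tables.
# Assumption: the first table is CamVid and the second is Cityscapes,
# following the order of the captions on the slide.
RESULTS = {
    "CamVid": {
        "baseline": (69.43, 309.28),   # PSPNet18 at 1.0x
        "dr_seg": (70.48, 101.98),     # DR0.5-PSP18 at 1.0x, 0.5x
    },
    "Cityscapes": {
        "baseline": (69.00, 938.52),   # PSPNet18 at 1.0x
        "dr_seg": (69.03, 309.69),     # DR0.5-PSP18 at 1.0x, 0.5x
    },
}


def compare(dataset):
    """Return (mIoU gain in points, DR-Seg GFLOPs as % of the baseline)."""
    base_miou, base_gflops = RESULTS[dataset]["baseline"]
    dr_miou, dr_gflops = RESULTS[dataset]["dr_seg"]
    miou_gain = round(dr_miou - base_miou, 2)
    flops_pct = round(100.0 * dr_gflops / base_gflops, 2)
    return miou_gain, flops_pct


if __name__ == "__main__":
    for name in RESULTS:
        gain, pct = compare(name)
        print(f"{name}: +{gain} mIoU at {pct}% of baseline GFLOPs")
```

For the first table this gives a gain of 1.05 mIoU points at 32.97% of the baseline GFLOPs, consistent with the figures quoted on the slide. For the second table the gain is 0.03 points at roughly one third of the baseline GFLOPs.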

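The three read-out modes listed under "Pixels for Dynamic Architecture" (dynamic resolution, ROI, and dynamic read-out precision) can be emulated in software to see how much data each one removes. This is a minimal sketch, not the sensor's actual control logic: the function names, the 8-bit input frame, row/column skipping as the resolution mechanism, and dropping least significant bits as the precision mechanism are all assumptions made for illustration.

```python
def read_out(frame, roi=None, step=1, bits=8):
    """Emulate a partial read-out of an 8-bit frame given as a list of rows.

    roi  : (top, left, height, width) window to read, or None for the full frame
    step : read every `step`-th row and column (dynamic resolution)
    bits : keep only the `bits` most significant bits of each 8-bit sample
    All three knobs are hypothetical stand-ins for hardware controls.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    if step < 1:
        raise ValueError("step must be at least 1")
    if roi is None:
        roi = (0, 0, len(frame), len(frame[0]))
    top, left, height, width = roi
    shift = 8 - bits  # dropping LSBs models a coarser ADC conversion
    return [
        [frame[r][c] >> shift for c in range(left, left + width, step)]
        for r in range(top, top + height, step)
    ]


def transmitted_bits(readout, bits):
    """Number of bits sent off the array for a given read-out."""
    samples = sum(len(row) for row in readout)
    return samples * bits


if __name__ == "__main__":
    # Synthetic 8x8 frame with values 0, 4, 8, ..., 252.
    frame = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
    full = read_out(frame)
    partial = read_out(frame, roi=(2, 2, 4, 4), step=2, bits=4)
    print(transmitted_bits(full, 8), transmitted_bits(partial, 4))
    print(partial)
```

On the synthetic 8x8 frame, the full 8-bit read-out transmits 512 bits, while a 4x4 ROI read at half resolution and 4-bit precision transmits 16 bits. Fewer conversions and fewer bits are what allow the ADC and clock frequency and the transmission bandwidth to be lowered, as the slide describes.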

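The CReFF slide states that HR features are aggregated into LR features with a local attention mechanism. The sketch below shows generic local attention only: each query position attends to keys within a small spatial window and returns the softmax-weighted sum of the corresponding values. It is not the authors' module. The function name, the `radius` parameter, and the reading that queries come from LR features while keys and values come from HR features are assumptions based on the figure labels, and the warping, upsampling, and grouped convolution steps are omitted.

```python
import math


def local_attention(query, key, value, radius=1):
    """Generic local attention over H x W grids of feature vectors.

    query, key, value: nested lists indexed as [y][x][channel].
    Each query position attends only to positions within `radius` pixels.
    Assumption: in a CReFF-like setting, `query` would come from LR features
    and `key`/`value` from HR features; that mapping is not confirmed here.
    """
    h, w = len(query), len(query[0])
    scale = 1.0 / math.sqrt(len(query[0][0]))
    vdim = len(value[0][0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            q = query[y][x]
            # In-bounds neighbourhood of (y, x).
            coords = [
                (ny, nx)
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            # Scaled dot-product scores against the neighbouring keys.
            scores = [
                scale * sum(a * b for a, b in zip(q, key[ny][nx]))
                for ny, nx in coords
            ]
            # Numerically stable softmax over the window.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            fused = [
                sum(wt * value[ny][nx][d] for wt, (ny, nx) in zip(weights, coords)) / total
                for d in range(vdim)
            ]
            row.append(fused)
        out.append(row)
    return out
```

With `radius=0` each position attends only to itself, so the output equals `value`. Larger radii blend in neighbouring values according to query-key similarity, and the cost grows with the window size rather than with the full image area.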

