
Unlocking the Potential of Deep Video Understanding (2017).pdf


Unlocking the Potential of Deep Video Understanding
Principal Research Manager, Microsoft Research

Agenda
- Introduction to visual intelligence and deep learning
- Deep image understanding
- Deep video understanding
- Applications and go-to-market
- Future technology trends

AI Golden Age
- Big Data
- "Deep" Learning
- Big Compute

Big Data: the largest annotated image database
- Image database organized according to the WordNet hierarchy
- 100K+ concepts/nodes
- 500+ images/node on average
- Tens of millions of human-annotated images
- A knowledge ontology

Architecture advancements: deep learning going "deeper"
- Fully connected neural networks (FNN)
- Convolutional neural networks (CNN)
- Recurrent neural networks (RNN)
- Long Short-Term Memory (LSTM)
- Fully convolutional networks (FCN)
- Deep residual networks (ResNet)

Fully Connected Neural Networks (FNN)
[Figure: network diagram from input to output]
- Neural networks approximate a function that maps known inputs to known outputs
- Weights are learned through training
Limitations of FNN:
- Many parameters/weights
- High computation
- May suffer severe overfitting
- Need a large amount of labeled data

Convolutional Neural Networks (CNN) (LeCun, 1989)
- Shared weights (parameter sharing)
- Locally connected: "locality" preserves position information
- Hierarchical view (multi-resolution)
[Figures: learned low-level filters; learned high-level filters]

RNN/LSTM
Recurrent Neural Networks
- Networks with loops in them, allowing information to persist
- Model time-sequence data
LSTM network (Long Short-Term Memory)
- Capable of learning long-range dependencies
- Models temporal dynamics well
- The memory cell remembers information that occurred many timesteps in the past
[Figure: LSTM unit with input, input gate, forget gate, output gate, memory cell, tanh nonlinearities, and output]
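To make the "many parameters" limitation of FNNs and the effect of CNN weight sharing concrete, the following minimal sketch counts the weights of one layer of each kind. The 224x224x3 input, the 64 output units or filters, and the 3x3 kernel are illustrative assumptions, not values taken from the slides.

```python
def fc_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of one 2-D convolution layer.

    The same kernel is reused at every spatial position, so the count
    does not depend on the image height or width.
    """
    return in_channels * out_channels * kernel * kernel + out_channels


# Hypothetical layer sizes, chosen only for illustration.
height, width, channels = 224, 224, 3
fc = fc_params(height * width * channels, 64)
conv = conv_params(channels, 64, 3)
print(fc, conv)  # 9633856 1792
```

Under these assumptions the fully connected layer needs several thousand times more weights than the convolution, which is the motivation the slides give for shared, locally connected filters.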

Vision AI Scenarios

Public Security
- Object tracking for event analysis
- Privacy in law enforcement
- Networked video monitoring

Sports & Entertainment
- Consumer video AR
- Scene understanding for chatbots
- AI for sport training and coaching

Commerce
- Vision for retail intelligence
- Automated driving

Medicine
- Patient monitoring for hospital
- Medical imaging and augmented clinical

Deep Learning Changed the Landscape of Image Understanding

- Image classification
- Object detection
- Image semantic segmentation
[Figure: example image labeled "boat" and "person"]

ImageNet Classification (1,000 categories)
Example top-5 predictions against the ground truth (GT):

| GT | 1 | 2 | 3 | 4 | 5 |
| horse cart | horse cart | minibus | oxcart | stretcher | half track |
| coucal | coucal | indigo bunting | lorikeet | walking stick | custard apple |
| birdhouse | birdhouse | sliding door | window screen | mailbox | pot |
| komondor | komondor | patio | llama | mobile home | Old English sheepdog |
| forklift | forklift | garbage truck | tow truck | trailer truck | go-kart |
| yellow lady's slipper | yellow lady's slipper | slug | hen-of-the-woods | stinkhorn | coral fungus |

ImageNet Classification challenge winners: top-5 error rates (%), 1,000 categories, 1.2M images
[Figure: bar chart of top-5 error rates. Values in chart order: 2.9 (2016, CUImage), 3.6 (2015/12, MSRA), 4.9 (2015/1, MSRA), 5.1 (Human), 6.7, 11.7, 16.4, 25.8 (2011, SIFT); one of the bars between Human and SIFT is labeled "1st CNN". Annotated numbers of layers: ResNet 152, GoogleNet 22, 19, 8.]

ResNets (Deep Residual Networks)
Key highlights:
- Identity shortcut connections allow signals to propagate every 2 or 3 layers
- Ease optimization
- 150+ layers
[Figure: layer-by-layer network diagram, shown twice: 7x7 conv, 64, /2; pool, /2; six 3x3 conv, 64; eight 3x3 conv, 128 (first with /2); twelve 3x3 conv, 256 (first with /2); six 3x3 conv, 512 (first with /2); avg pool; fc 1000]

K. He, X. Zhang, S. Ren, & J. Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
K. He, X. Zhang, S. Ren, & J. Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Image Classification vs. Object Detection
- Image classification: mostly solved, squeezing the last bits
- Object detection: much harder
Courtesy of Kaiming He

Faster R-CNN: a CNN detector based on region proposals
Pipeline, built up over three slides: image -> CNN -> conv feature map -> RPN -> proposals -> Fast R-CNN, trained end-to-end.
S. Ren, K. He, R. Girshick, Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.
Courtesy of Kaiming He

Performance on PASCAL VOC 07: accuracy and speed

| System | Time | 07 data | 07+12 data |
| DPM v5 2013 | N/A | 33.7 | - |
| R-CNN | 50s | 66.0 | N/A (too slow) |
| Fast R-CNN | 2s | 66.9 | 70.0 |
| Faster R-CNN | 198ms | 69.9 | 73.2 |
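The identity shortcut named in the ResNet highlights earlier in this section can be sketched in a few lines. This is a toy illustration on plain Python lists, assuming a residual branch of two linear layers with a ReLU in between; the function names and weights are hypothetical, and it is not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; each row of `weights` yields one output."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = relu(x + F(x)), where F is linear -> relu -> linear."""
    branch = linear(w2, relu(linear(w1, x)))
    return relu([a + b for a, b in zip(x, branch)])


# With an all-zero branch the block reduces to the identity for
# non-negative inputs: the shortcut carries x through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, zeros))  # [1.0, 2.0]
```

Because the shortcut passes the input through even when the branch contributes nothing, stacking many such blocks does not block the signal, which is one way to read the "ease optimization" highlight.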

Notes on the PASCAL VOC 07 results: accuracy is mean Average Precision in % (higher is better); timings were measured on an Nvidia K40 GPU. Courtesy of Kaiming He.

Semantic Segmentation: Non-Instance vs. Instance
- Non-instance: simpler, mostly solved
- Instance: much harder, far from good
[Figure: non-instance output labels every person pixel "person"; instance output separates person1 through person6]
Jifeng Dai, Kaiming He, & Jian Sun. "Instance-aware Semantic Segmentation via Multi-task Network Cascades". CVPR 2016.
Courtesy of Jifeng Dai

MSRA's 1st-place system in the 2015 MS COCO competition
[Figure: multi-task network cascade. CONVs -> conv feature map -> box instances (RoIs); for each RoI: RoI warping, pooling -> FCs -> mask instances; for each RoI: masking -> FCs -> categorized instances (person, person, person, horse)]
- Solely CNN-based
- End-to-end training
- 12% better than 2nd place
Results on the first 5k images from the COCO test set are available at https:/
Courtesy of Jifeng Dai

1st-place image segmentation system in the 2016 MS COCO competition
COCO Segmentation Challenge 2016
- MSRA won 1st place back-to-back
- 11% relatively better than the 2016 2nd place (Google)
- 33% relatively better than the 2015 1st place (MSRA)
- Excellent on box: 2nd place in detection if public

| Entry | COCO segmentation accuracy (%) |
| MSRA, 2016 1st | 37.6 |
| Google, 2016 2nd | 33.8 |
| MSRA, 2015 1st | 28.4 |
| FAIR, 2015 2nd | 25 |

http://mscoco.org/dataset/#detections-leaderboard
Courtesy of Jifeng Dai

Deep Learning Changed the Landscape of Video Understanding?
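The "relatively better" figures quoted for the COCO Segmentation Challenge can be checked against the chart values with a one-line formula. The sketch below assumes the usual definition of relative improvement, (new - old) / old, and uses the accuracies as printed on the chart; the function name is hypothetical.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Segmentation accuracies (%) as printed on the chart.
msra_2016, google_2016, msra_2015 = 37.6, 33.8, 28.4

print(round(relative_improvement(msra_2016, google_2016)))  # 11
print(round(relative_improvement(msra_2016, msra_2015)))    # 32
```

The first result matches the quoted 11%. The second comes out at about 32% rather than the quoted 33%, possibly because the chart values are themselves rounded.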

- Detection/tracking
- Human pose estimation
- Human action recognition
- Video captioning

Challenges for Video Analytics
- High content variety (4D)
- High data volume
- High storage/computing resource requirements
- Real-time requirement
- High labeling cost
- Lack of training data (in some scenarios, e.g., surveillance)
- Privacy issues
- Positive samples are scarce (e.g., for rare events)

Our Main Approaches: people/vehicle-centric
- People-centric scene understanding
  - Intelligent motion analysis
  - Video face/people detection/tracking/identification/Re-ID
  - Video face redaction and celebrity recognition
  - Human pose estimation
  - Skeleton-based people action classification/detection
  - Fast detection/localization of rare events (e.g., falling down) in streaming video
- Car related features
- Leveraging deep learning: DNN/CNN, RNN/LSTM
[Figure: processing levels for video. Low-level: human detection, human tracking. Middle-level: human segmentation, pose estimation. High-level: crowd counting, human recognition, activity analysis, abnormality detection. Also shown: motion detection.]

Face/People Tracking & Identification

Progress of Single Object Tracking
M. Danelljan, G. Bhat, F. Khan, and M. Felsberg, "ECO: Efficient Convolution Operators for Tracking," in CVPR 2017.

Case Study 1: Video Face Redaction (MSRA's)
- High recall requirement
- CNN-based face detector + tracking
[Figure: YouTube's solution compared with ours]

Demo videos (people tracking)
- Multiple Object Tracking (MOT) Benchmark: https:/
- Detection: Faster-RCNN

MSRA HumanSDK: human attributes
Recognition of 5 attributes:
1. Body Part Visibility
2. Gender
3. Hat Type
4. TopWear Type
5. BottomWear Type
Pipeline: image -> detection -> humans -> classification -> attribute label
Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.

Human Pose Estimation
Motivation: understanding people
- A key step toward understanding people in images/video
  - Motion analysis
  - Human tracking
  - Clothing parsing
  - Human action recognition
Target: given a single RGB image, determine the precise pixel location of the keypoints of the body.

Pose Estimation Challenges
- Occlusion
- Low quality of images
- Scale
- Diverse postures
- Various appearances

Basic Approach for Pose Estimation
Image -> FCN (fully convolutional network) -> heatmap for each joint -> max; trained against ground-truth heatmaps.

Challenges: Diversity of Postures
- Our idea: rotate to upright positions (normalize postures)
[Figure: rotate, scale]

Performance on the LSP test dataset (PCK 0.2)

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total | AUC | Training dataset |
| Tompson et al., NIPS14 [1] | 90.6 | 79.2 | 67.9 | 63.4 | 69.5 | 71 | 64.2 | 72.3 | 47.3 | LSP |
| Fan et al., CVPR15 [2] | 92.4 | 75.2 | 65.3 | 64 | 75.7 | 68.3 | 70.4 | 73 | 43.2 | LSP |
| Carreira et al., CVPR16 [3] | 90.5 | 81.8 | 65.8 | 59.8 | 81.6 | 70.6 | 62 | 73.1 | 41.5 | LSP |
| Chen & Yuille, NIPS14 [4] | 91.8 | 78.2 | 71.8 | 65.5 | 73.3 | 70.2 | 63.4 | 73.4 | 40.1 | LSP |
| Yang et al., CVPR16 [5] | 90.6 | 78.1 | 73.8 | 68.8 | 74.8 | 69.9 | 58.9 | 73.6 | 39.3 | LSP |
| MSRA's (FCN) | 93.8 | 80.3 | 69.7 | 64.7 | 81 | 78.1 | 73.1 | 77.2 | 50.5 | LSP |
| MSRA's (FCN+Refine) | 94 | 80.9 | 70.6 | 65.3 | 82.3 | 78.5 | 73.7 | 77.9 | 50.7 | LSP |
| Bulat et al., CVPR16 [6] | 98.4 | 86.6 | 79.5 | 73.5 | 88.1 | 83.2 | 78.5 | 83.5 | | LSP+LSPET |
| Wei et al., CVPR16 [7] | | | | | | | | 84.32 | | LSP+LSPET |
| Rafi et al., BMVC16 [8] | 95.8 | 86.2 | 79.3 | 75 | 86.6 | 83.8 | 79.8 | 83.8 | 56.9 | LSP+LSPET |
| Yu et al., CVPR16 [9] | 87.2 | 88.2 | 82.4 | 76.3 | 91.4 | 85.8 | 78.7 | 84.3 | 55.2 | LSP+LSPET |
| MSRA's (FCN) | 95.2 | 86.2 | 78.1 | 72.8 | 87 | 85.7 | 81.3 | 83.7 | 56.1 | LSP+LSPET |
| MSRA's (FCN+Refine) | 95.5 | 88.5 | 80 | 73.9 | 89.8 | 85.8 | 81.5 | 85 | 58.5 | LSP+LSPET |
| Insafutdinov et al., ECCV16 [10] | 97.4 | 92.7 | 87.5 | 84.4 | 91.5 | 89.9 | 87.2 | 90.1 | 66.1 | MPII+LSPET+LSP |
| Wei et al., CVPR16 [7] | 97.8 | 92.5 | 87 | 83.9 | 91.5 | 90.8 | 89.9 | 90.5 | 65.4 | MPII+LSPET+LSP |
| Bulat et al., CVPR16 [6] | 97.2 | 92.1 | 88.1 | 85.2 | 92.2 | 91.4 | 88.7 | 90.7 | 63.4 | MPII+LSPET+LSP |
| MSRA's (ResNet-152) | 97 | 91.5 | 86.2 | 82.8 | 89.4 | 89.9 | 87.5 | 89.2 | 63.5 | MPII+LSPET+LSP |
| MSRA's (ResNet+Refine) | 97.3 | 92.2 | 87.1 | 83.5 | 92.1 | 90.6 | 87.8 | 90.1 | 64.8 | MPII+LSPET+LSP |
| MSRA's (Hourglass) | 97.7 | 93 | 88.3 | 84.8 | 92.3 | 90.2 | 90 | 90.9 | 65 | MPII+LSPET+LSP |
| MSRA's (Hg+Refine) | 97.9 | 93.6 | 89 | 85.8 | 92.9 | 91.2 | 90.5 | 91.6 | 65.9 | MPII+LSPET+LSP |

Ongoing: Multi-person Pose Estimation + Tracking (MSRA)

From Pose Estimation to Human Action Recognition
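The step from pose estimation to skeleton-based recognition starts with turning heatmaps into joint coordinates. The sketch below covers only the final "max" step of the basic approach described earlier (FCN, one heatmap per joint, then max), on a tiny hand-written heatmap. The heatmap values and the function name are hypothetical; in a real system the heatmaps would come from the network output.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def decode_joint(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (row, col, score) of the highest-scoring pixel of one joint's heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best[2]:
                best = (r, c, score)
    return best


# Hypothetical 3x3 heatmap for a single joint.
wrist = [
    [0.0, 0.1, 0.0],
    [0.1, 0.2, 0.9],
    [0.0, 0.1, 0.3],
]
print(decode_joint(wrist))  # (1, 2, 0.9)
```

Running this once per joint yields the set of keypoint locations that a skeleton-based action recognizer would consume.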

W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, "Co-occurrence feature learning for skeleton-based action recognition using regularized deep LSTM networks," in AAAI Conference on Artificial Intelligence 2016.
Sijie Song, Cuiling Lan, Wenjun Zeng, et al., "An End-to-end Attention Model for Human Action Recognition from Skeleton Data," in AAAI17.
P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, N. Zheng, "View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data," in ICCV 2017.

Human Action Recognition
- Humans can recognize actions by the motion of key joints (Johansson 1973).

Skeleton-based human action recognition
[Figure: proposed skeleton-based action recognition framework. Human performing an action -> skeleton (captured cost-effectively with Kinect) -> recurrent neural network -> action label]

Spatial-Temporal Attention Model

Spatial attention
- Does the importance of joints differ?
- Basic idea: learn a spatial attention model and add different weights to joints
[Figure: "drinking" and "kicking" examples; the spatial attention model assigns per-joint weights (e.g., 0.4, 0.3) at frames t and t+1]

Temporal attention
- Does the importance of frames differ?
- The temporal attention model fuses the output of all frames based on the attention weights
[Figure: attention weights (0 to 0.7) plotted over time (frames 1 to 13)]

Overall framework: our Spatial-Temporal Attention Network
- Components: deep LSTM network, spatial attention network, temporal attention network
- The attention models are based on LSTM networks. Spatial attention and temporal attention are each a function of two arguments, the input and the hidden output of the LSTM (the formula symbols did not survive extraction).

Results: Visualization of Learned Attention
- The size of each circle indicates the spatial attention.
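The temporal attention step, fusing the output of all frames based on the attention weights, amounts to a weighted sum over frames. The following minimal sketch assumes that per-frame class scores and attention weights are already available; the numbers and names are hypothetical, and how the LSTM-based attention network produces the weights is not shown.

```python
from typing import List


def fuse_frames(frame_scores: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum over frames: fused[k] = sum_t weights[t] * frame_scores[t][k]."""
    if len(frame_scores) != len(weights):
        raise ValueError("one attention weight per frame is required")
    num_classes = len(frame_scores[0])
    return [
        sum(w * scores[k] for w, scores in zip(weights, frame_scores))
        for k in range(num_classes)
    ]


# Three frames, two action classes; the first frame gets the most attention.
scores = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
attention = [0.5, 0.25, 0.25]
print(fuse_frames(scores, attention))  # [0.5, 0.5]
```

With uniform weights the two later frames would outvote the first; here the larger weight on the first frame balances the fused scores, which illustrates how frame importance changes the final decision.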

Comparisons with the State of the Art: accuracy on the NTU dataset

| Method | NTU Cross-Subject (Acc. %) | NTU Cross-View (Acc. %) |
| Lie Group, CVIU16 [3] | 50.1 | 52.8 |
| Dynamic Skeletons, CVPR15 [4] | 60.2 | 65.2 |
| HBRNN, CVPR15 [2] | 59.1 | 64.0 |
| Deep LSTM, CVPR16 [1] | 60.7 | 67.3 |
| Part-aware LSTM, CVPR16 [1] | 62.9 | 70.3 |
| STA-LSTM (MSRA's) [5] | 73.4 | 81.2 |
| View Adaptive LSTM (MSRA's) [6] | 81.5 | 88.4 |

[1] Shahroudy, A.; Liu, J.; Ng, T.-T.; and Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. CVPR 2016.
[2] Du, Y.; Wang, W.; and Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. CVPR 2015.
[3] Vemulapalli, R.; Arrate, F.; and Chellappa, R. R3DG features: Relative 3D geometry-based skeletal representations for human action recognition. CVIU 2016.
[4] Hu, J.-F.; Zheng, W.-S.; Lai, J.; and Zhang, J. Jointly learning heterogeneous features for RGB-D activity recognition. CVPR 2015.
[5] Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end attention model for human action recognition from skeleton data. AAAI17.
[6] Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data. ICCV 2017.

View Adaptive LSTM
[Figure: two examples, each showing the original view next to the learned view]
- Observation: the learned view can preserve the continuity

Go-to-Market
[Figure: timeline with the dates April 30, 2015; Dec 2, 2015; Aug. 26, 2016; Nov. 29, 2017]

Microsoft Cognitive Services
Available today (https:/
- Face Redaction API
- Video Emotion APIs
- Video Tagging API
- Action Recognition API

Case Study 2: XiaoIce's vision skills
[Figure: example image comments generated by XiaoIce: "deep brotherly affection", "very gentle at heart"]

Case Study 3: Build Your Deep Learning Vision Model in Minutes
www.customvision.ai (custom vision)
- Train your own image classifier in the cloud
- Easily access the trained model through web APIs
- Ideal for vertical domain applications

Case Study 5: Consumer Video AR: core technologies
[Figure: example generated caption: "a boy is cleaning the floor"]
