
Perceiving Human: Facial Landmark Localization, Hand Pose Estimation and Pedestrian Depth Estimation
京东 JD AR

Outline
1. Facial Landmark Localization - FASTry
2. Hand Pose Estimation - FastHand
3. Pedestrian Depth Estimation and Segmentation - PDES-Net

FASTry: A Fast, Stable and Accurate Cosmetic Try-On System

[App screenshots: the virtual try-on feature running inside the JD mobile app]

The FASTry System

- A cosmetic try-on system provides a realistic try-on experience for users and helps them efficiently choose a suitable cosmetic.
- FASTry is a real-time try-on system that takes speed, accuracy and stability into consideration at each step to ensure a better user experience.
- The system supports lipstick, blush, eyebrow pencil, eye shadow, eyeliner, mascara and liquid foundation.

[Figure: try-on results for three lipstick styles and three eye-shadow styles]

The FASTry System - Pipeline
- Face bounding-box detection (with tracking) → facial landmark localization → landmark stabilization against the stable landmarks list (the average landmark positions over the previous frames) → virtual makeup rendering.
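A minimal Python sketch of that per-frame loop, with toy stand-ins for the real modules; the function names and the detect-every-10-frames interval are illustrative, not the system's actual API:

```python
import numpy as np

# Toy stand-ins for the real modules: an SSD face detector, a landmark
# CNN, and the stabilizer described later in the deck.
detect_face = lambda frame: np.array([40.0, 40.0, 200.0, 200.0])  # x1, y1, x2, y2
track_face = lambda frame, box: box                 # near-zero-cost tracker
localize_landmarks = lambda frame, box: np.zeros((106, 2))
stabilize = lambda lmk, history: lmk

def try_on_stream(frames, detect_every=10):
    """Run full detection only every few frames; track in between."""
    box, history = None, []
    for i, frame in enumerate(frames):
        if box is None or i % detect_every == 0:
            box = detect_face(frame)                # expensive, occasional
        else:
            box = track_face(frame, box)            # cheap, every frame
        lmk = stabilize(localize_landmarks(frame, box), history)
        history.append(lmk)
        yield lmk                                   # feed the makeup renderer

for lmk in try_on_stream([np.zeros((480, 640, 3))] * 3):
    pass
```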

Face Detection & Tracking
- The first module detects the faces in the video stream; full detection runs only several times a second, while a tracker predicts the likely face location in between at very little cost.
- The detection module is based on SSD. We further reduce the number of parameters and calculations by using depthwise separable convolutions.
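To see why the depthwise separable substitution shrinks the detector, here is a hedged Keras sketch comparing the parameter counts of one 3×3 layer; the channel sizes are illustrative, since the slide does not give the detector's actual widths:

```python
import tensorflow as tf

cin, cout, k = 64, 128, 3

# Standard convolution: one dense k x k x cin filter per output channel.
standard = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, cin)),
    tf.keras.layers.Conv2D(cout, k, padding="same"),
])

# Depthwise separable: a k x k filter per input channel, then a 1x1
# pointwise projection to cout channels.
separable = tf.keras.Sequential([
    tf.keras.Input(shape=(56, 56, cin)),
    tf.keras.layers.SeparableConv2D(cout, k, padding="same"),
])

print(standard.count_params())   # 73,856  (cin*cout*k*k + bias)
print(separable.count_params())  # 8,896   (cin*k*k + cin*cout + bias)
```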

Landmark Detection & Stabilization
- The network consists of two parts: the main network, which predicts landmark coordinates, and an auxiliary network, which predicts Euler angles (pitch, yaw and roll).
- The auxiliary network is used only during training to boost landmark localization accuracy; it costs no time at the test stage.

[Figure: main network built from inverted residual blocks with a flatten-and-concat head; auxiliary network predicting pitch, yaw and roll]

Loss function
- Wing loss is a loss function particularly designed for facial landmark localization:

$$\operatorname{wing}(u) = \begin{cases} w \ln\left(1 + |u|/\epsilon\right) & \text{if } |u| < w \\ |u| - C & \text{otherwise} \end{cases}$$

where $C = w - w\ln(1 + w/\epsilon)$.
- The overall landmark loss sums the wing loss of the x- and y-offsets over all landmarks.

Dynamic weights
- During the validation phase we compute the normalized mean error (NME) of each landmark, and the dynamic weight for each landmark is calculated as

$$Q_n = \frac{\mathrm{NME}_n}{\sum_{m=1}^{N} \mathrm{NME}_m}.$$

Online Hard Keypoints Mining (OHKM)
- The OHKM strategy from human pose estimation is incorporated into our system: only the top K percent of landmarks with the highest errors are used to compute the loss.
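A compact NumPy sketch of the three training ingredients above: wing loss, dynamic weights and OHKM. The hyperparameters (w, epsilon, the top-K ratio) are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def wing(u, w=10.0, eps=2.0):
    """Wing loss; w and eps follow the paper's notation, defaults are illustrative."""
    C = w - w * np.log(1.0 + w / eps)
    au = np.abs(u)
    return np.where(au < w, w * np.log(1.0 + au / eps), au - C)

def dynamic_weights(nme_per_landmark):
    """Q_n = NME_n / sum_m NME_m, computed from per-landmark validation NMEs."""
    nme = np.asarray(nme_per_landmark, dtype=float)
    return nme / nme.sum()

def landmark_loss(pred, gt, dyn_w=None, ohkm_ratio=0.5):
    """pred, gt: (N, 2) arrays. Wing loss per landmark, optionally scaled by
    the dynamic weights, with OHKM keeping only the hardest top-K landmarks."""
    per_lmk = wing(pred - gt).sum(axis=1)          # (N,) loss per landmark
    if dyn_w is not None:
        per_lmk = per_lmk * dyn_w                  # dynamic weights Q_n
    k = max(1, int(ohkm_ratio * len(per_lmk)))     # OHKM: top-K hardest
    return np.sort(per_lmk)[-k:].mean()

pred, gt = np.random.rand(106, 2), np.random.rand(106, 2)
print(landmark_loss(pred, gt, dyn_w=dynamic_weights(np.ones(106))))
```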

Facial Landmark Stabilization (Algorithm 2)
- The stabilization is based on a set that contains the average positions of the facial landmarks over the previous N frames (the stable landmarks list L).
- For each incoming frame k, the Euclidean distance between the current landmark set p_k and each of the previous sets p_{k-1}, ..., p_{k-N} is calculated; the sets whose distance falls below a threshold are added to L, and the stabilized output is the average over L.
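Since the algorithm listing is only partly legible in the slide, the following is a best-effort Python sketch of the stabilization rule as described above; N and the threshold tau are illustrative, not the paper's values:

```python
import numpy as np
from collections import deque

def stabilize(current, history, tau=2.0):
    """Average the current landmark set with those of the previous N frames
    that lie within Euclidean distance tau of it (a sketch of Algorithm 2)."""
    current = np.asarray(current, dtype=float)     # (K, 2) landmark set
    stable = [current]                             # the stable landmarks list L
    for past in history:                           # previous N frames
        if np.linalg.norm(current - past) < tau:   # distance of the two sets
            stable.append(past)
    return np.mean(stable, axis=0)                 # average position over L

history = deque(maxlen=10)                         # N = 10, illustrative
lmk = np.random.rand(106, 2)
smoothed = stabilize(lmk, history)
history.append(smoothed)
```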

Experiment Results - Datasets
- The first part is from the Grand Challenge of 106-point Facial Landmark Localization.
- The second part is a subset of the large-scale face recognition dataset VGGFace2, keeping only images larger than 350×350.
- The third part is selfie videos collected from a micro-video application.
- The final part is selfie videos collected by 61 volunteers.

Table 1: The AR-Landmark107k dataset.

| | Part 1 | Part 2 | Part 3 | Part 4 |
|---|---|---|---|---|
| Constitution | Training set of the JD Landmark dataset [24], based on 300W [28] | A subset of the VGGFace2 dataset [3], images from Google Image Search | Micro-video dataset 1: 6,152 selfie videos | Micro-video dataset 2: 183 selfie videos |
| #images | 18,619 | 10,865 | 68,399 | 9,042 |

Experiment Results - Training
- We implement the network in the TensorFlow framework with a batch size of 16; the network is trained on an NVIDIA Tesla P40 GPU.
- Inference on mobile phones is performed with the Tencent ncnn deep learning framework.

[Tesla P40 spec sheet: NVIDIA Pascal GPU, 3,840 CUDA cores, 24 GB GDDR5, 24 concurrent H.264 1080p30 streams, up to 24 1-GB vGPU instances with profiles from 1 GB to 24 GB, PCIe 3.0 dual slot, 250 W, passive cooling]

Experiment Results - Lightweight and Fast
- We compare our facial landmark network with PFLD 0.25X and PFLD 1X in terms of model size and calculations.
- Processing a frame takes only 7.35 ms on average (over 135 fps) on a Qualcomm ARM 845.

Table 2: Comparison of model size, calculations and processing speed of different landmark localization methods.

| Model | PFLD 1X | PFLD 0.25X | Ours |
|---|---|---|---|
| Size (Mb) | 8.71 | 3.27 | 0.93 |
| Calculations (MFlops) | 274.27 | 52.31 | 9.61 |
| Speed, CPU (ms) | 10.50 | 3.23 | 1.06 |
| Speed, ARM (ms) | 63.21 | 13.98 | 4.28 |

Table 3: Processing speed evaluation of each module.

| Module | Speed (ms) |
|---|---|
| Face detection | 11.80 |
| Face tracking | 0.02 |
| Facial landmark localization | 4.28 |
| Landmark stabilization | 0.48 |
| Processing a frame | 14.47 using detection; 7.35 on average |

Experiment Results - Accuracy (Normalized Mean Error, NME)
- Compared to PFLD 0.25X, we get better accuracy and speed.
- PFLD 1X has the best NME; however, it has a much larger model size and requires much more time for inference.

Table 5: Comparison in normalized mean error on our landmark dataset.

| Method | AR-Landmark107k Part 1,2 | AR-Landmark107k |
|---|---|---|
| PFLD 1X | 3.16 | 2.76 |
| PFLD 0.25X | 3.78 | 3.22 |
| Ours | 3.72 | 3.19 |

[Figure: cumulative error distribution curves of PFLD 1X, PFLD 0.25X and ours on AR-Landmark107k]
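For reference, the NME reported in Table 5 is the mean point-to-point error normalized by a reference distance. A minimal sketch, assuming inter-ocular normalization; the 106-point protocol fixes the actual reference, and the eye indices below are placeholders:

```python
import numpy as np

def nme(pred, gt, left_eye=66, right_eye=79):
    """Mean point-to-point landmark error, normalized by the inter-ocular
    distance of the ground truth (eye indices are hypothetical here)."""
    d = np.linalg.norm(gt[left_eye] - gt[right_eye])
    return np.linalg.norm(pred - gt, axis=1).mean() / d
```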

Experiment Results - Ablation Study

Table 6: Comparison in normalized mean error of different strategies (DW = dynamic weights).

| L2 loss | Wing loss | DW | OHKM | AR-L Part 1,2 | AR-Landmark107k |
|---|---|---|---|---|---|
| ✓ | | | | 4.06 | 3.50 |
| | ✓ | | | 3.78 | 3.33 |
| | ✓ | ✓ | | 3.77 | 3.23 |
| | ✓ | | ✓ | 3.74 | 3.27 |
| | ✓ | ✓ | ✓ | 3.72 | 3.19 |

Experiment Results - Stabilization Methods

Table 4: Comparison in mean precision (20 px) and speed of different face tracking methods on the 300-VW test set.

| Face tracking method | Category 1 | Category 2 | Category 3 | Speed (ms) |
|---|---|---|---|---|
| KCF [11] | 0.697 | 0.937 | 0.822 | 155.35 |
| Our method | 0.944 | 0.874 | 0.873 | 0.02 |

Table 7: Comparison in mean standard deviation and speed of different stabilization methods on the 300-VW test set.

| Method | Category 1 | Category 2 | Category 3 | Speed (ms) |
|---|---|---|---|---|
| dlib [16] | 0.00527 | 0.00658 | 0.00701 | 3.69 |
| dlib [16] + optical flow [25] | 0.00524 | 0.00655 | 0.00762 | 1.01 |
| dlib [16] + our method | 0.00493 | 0.00649 | 0.00694 | |

FastHand: Fast Hand Pose Estimation From a Monocular Camera

1. Motivation
- Gesture recognition is an essential component of robust Human-Robot Interaction (HRI).
- Gesture recognition is a challenging problem in the robotics community, as the hand covers a small area compared to the human body; moreover, it exhibits a high degree of freedom and high similarity, e.g., among the finger joints' visual appearance.
- The timing required for a robot to interact with its operator constitutes another considerable challenge in HRI, since real-time responses are necessary for a reliable application.

2. Contributions
- A novel lightweight encoder-decoder network for 2D hand pose estimation based on heatmap regression, entitled "FastHand". The proposed framework is able to run on a mobile robot in real time.
- A 2D hand landmark dataset based on the YouTube 3D Hands image sequences.
- An extensive experimentation protocol aiming to demonstrate the effectiveness of the proposed method.

[Figure: example gestures - Victory, Love-You, Thumbs-Up]

3. Proposed Method - Framework
- Hand bounding-box detection → stabilization → hand landmark localization.

3. Proposed Method - Module 1: Hand Detection and Stabilization
- Hand detection: we utilize the MobileNet-SSD framework to detect human hands, with MobileNetV2 as the network's backbone and the Single Shot multi-box Detector (SSD) head for detection.
- Stabilization: since recently perceived frames present a higher correlation with the current box, we perform a weighted average over the C most recent boxes with exponentially decreasing weights:

$$P_{\text{cur}} = \frac{\sum_{k=0}^{C} \lambda^{C-k} P_k}{\sum_{k=0}^{C} \lambda^{C-k}}, \qquad 0 < \lambda < 1.$$
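A small NumPy sketch of this smoothing; the slide's garbled formula shows only the λ^(C-k) weighting, so the value of λ and the normalization by the weight sum are assumptions:

```python
import numpy as np

def smooth_box(boxes, lam=0.5):
    """Exponentially weighted average of the C+1 most recent boxes.
    boxes: list of (x1, y1, x2, y2), current frame last; lam is illustrative."""
    boxes = np.asarray(boxes, dtype=float)      # (C+1, 4)
    C = len(boxes) - 1
    w = lam ** (C - np.arange(C + 1))           # weight lam^(C-k) for frame k
    return (w[:, None] * boxes).sum(axis=0) / w.sum()

print(smooth_box([[0, 0, 10, 10], [2, 2, 12, 12], [4, 4, 14, 14]]))
```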

3. Proposed Method - Module 2: Landmark Network

[Figure: encoder-decoder landmark network producing a stack of H × W × C keypoint heatmaps]

4. Experiments - Metrics definition

$$\mathrm{SSE} = \sum_{j=1}^{J} \frac{(x_j - \hat{x}_j)^2 + (y_j - \hat{y}_j)^2}{\max(w, h)} \tag{2}$$

$$\mathrm{EPE} = \frac{1}{J} \sum_{j=1}^{J} \lVert y_j - \hat{y}_j \rVert_2 \tag{3}$$

$$\mathrm{PCK}@0.2 = \frac{1}{J} \sum_{j=1}^{J} \mathbb{1}\!\left( \frac{\lVert y_j - \hat{y}_j \rVert_2}{\max(w, h)} < 0.2 \right) \tag{4}$$

4. Experiments - Ablation study: the impact of using different training sets.

| Dataset | Metric | YouTube 2D Hands | GANerated Hands [22] | Both datasets |
|---|---|---|---|---|
| STB [38] | SSE ↓ | 0.0451 | 0.5343 | 0.0351 |
| | EPE ↓ | 0.0791 | 0.1497 | 0.0669 |
| | PCK@0.2 ↑ | 0.9947 | 0.7932 | 0.9947 |
| RHD [21] | SSE ↓ | 0.0711 | 0.4375 | 0.0708 |
| | EPE ↓ | 0.0682 | 0.1754 | 0.0685 |
| | PCK@0.2 ↑ | 0.9901 | 0.8546 | 0.9900 |
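A minimal NumPy sketch of the three metrics as reconstructed in equations (2)-(4); the SSE normalization is partly illegible in the slide, so its exact form here is an assumption:

```python
import numpy as np

def sse(pred, gt, w, h):
    """Sum of squared errors normalized by max(w, h) of the hand crop (eq. 2)."""
    return np.sum((pred - gt) ** 2) / max(w, h)

def epe(pred, gt):
    """Mean end-point error over the J joints (eq. 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def pck(pred, gt, w, h, alpha=0.2):
    """PCK@alpha (eq. 4): fraction of joints whose normalized error < alpha."""
    err = np.linalg.norm(pred - gt, axis=1) / max(w, h)
    return (err < alpha).mean()
```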

4. Experiments - Comparison with the state of the art.

| Dataset | Metric | SRHandNet [32] | NSRM Hand [18] | InterHand [35] (trained on STB) | InterHand [35] (trained on RHD) | Our method |
|---|---|---|---|---|---|---|
| STB [38] | SSE ↓ | 0.7078 | 0.3530 | 0.0043 | | 0.0351 |
| | EPE ↓ | 0.1326 | 0.1214 | 0.0116 | | 0.0669 |
| | PCK@0.2 ↑ | 0.8526 | 0.8944 | 0.8666 | 0.7246 | 0.9947 |
| RHD [21] | SSE ↓ | 2.5613 | 2.0413 | 1.6935 | | 0.0708 |
| | EPE ↓ | 0.1953 | 0.1726 | | 0.0228 | 0.0685 |
| | PCK@0.2 ↑ | 0.5317 | 0.7177 | 0.9130 | 0.9973 | 0.9900 |

4. Experiments - Speed.

| Device | SRHandNet [32] | NSRM Hand [18] | InterHand [35] | Our method |
|---|---|---|---|---|
| NVIDIA GeForce 940MX GPU | 21.06 FPS | 2.42 FPS | 14.24 FPS | 31.33 FPS |
| NVIDIA Jetson TX2 GPU | 7.77 FPS | 3.65 FPS | 19.16 FPS | 25.05 FPS |

4. Experiments - Model size.

| Model | SRHandNet [32] | NSRM Hand [18] | InterHand [35] | Our method |
|---|---|---|---|---|
| Size (MB) | 71.90 | 139.75 | 41.71 | 13.0 |

[Figure: predicted hand landmarks for the gestures Num-1, Num-3, Num-5, Num-8, Pinky, Thumbs-Up, Love-You and Claw]

4. Experiments - Qualitative results on STB
[Figure: qualitative comparison of SRHandNet, NSRM Hand, InterHand and ours on STB]

4. Experiments - Qualitative results on RHD
[Figure: qualitative comparison of SRHandNet, NSRM Hand, InterHand and ours on RHD]

5. Videos

- Hand poses from different distances in real-world scenarios.
- Different hand poses in real-world scenarios.

6. Conclusions
- In this paper a fast 2D hand pose estimation pipeline, entitled FastHand, is proposed. The core of our framework is a lightweight encoder-decoder network.
- Through hand detection and stabilization, the ROI is cropped and fed into our network for landmark localization.
- The proposed network is trained on our large-scale 2D hand pose dataset and a synthetic one, while it is evaluated on two publicly available datasets.
- In comparison to several state-of-the-art techniques, the proposed approach shows improved performance while executing in real time on a low-cost GeForce 940MX GPU and an NVIDIA Jetson TX2 GPU (over 25 frames per second).

Real-Time Monocular Pedestrian Depth Estimation and Segmentation on Embedded Systems

Related Work - Monocular depth estimation
- Modern depth estimation methods use deep learning techniques trained on large-scale datasets:
  - multi-scale feature fusion and refinement to produce accurate object boundaries;
  - inferring a distribution over possible depths through discrete binary classifications to replace the traditional bilinear interpolation, and guiding the intermediate depth branch with auxiliary losses;
  - a fast monocular depth estimation method which utilizes MobileNet as the encoder and depthwise decomposition in the decoder.

Related Work - Other studies
- A study on 3D pedestrian localization with estimated uncertainty: L. Bertoni et al. use a human pose estimator to detect a set of keypoints and feed them to a feedforward neural network to predict the distance and the uncertainty associated with each pedestrian. Nevertheless, multi-person pose estimation is computationally costly, and the method cannot segment pedestrians from the scene, making it unsuitable for collision avoidance in robot applications.

Motivation
- Depth estimation of a scene has been studied for a long time in the computer vision field for various applications. However, as far as we know, there is no research on pedestrian depth estimation and segmentation of monocular images at the same time, and there is no corresponding dataset.
- We propose a novel low-complexity network architecture, dubbed "PDES-Net", which offers fast and accurate pedestrian depth estimation and segmentation.
- We generate three datasets which can be used for joint pedestrian depth estimation and segmentation.

[Figure: example input with its depth ground truth and semantic ground truth]

Method
[Figure: PDES-Net architecture - a 224×224×3 input passes through a backbone encoder and an ASPP module; a depth branch and a semantic branch, each built from 3×3 depthwise separable convolutions with ×2 and ×6 upsampling and concatenations, decode to 224×224 outputs]

Network pruning
- Pruning is performed over the ASPP branches in the encoder: we retain the two parallel branches with dilation rates of 1 and 9, replace the remaining two by a single branch with a dilation rate of 5, and remove the global average pooling branch.
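A hedged tf.keras sketch of that pruned ASPP head; the filter count and input shape are illustrative, while the parallel-branch structure, the dilation rates 1/5/9, and the removed pooling branch follow the slide:

```python
import tensorflow as tf

def pruned_aspp(x, filters=128):
    """Parallel 3x3 atrous branches at dilation rates 1, 5 and 9, no
    global-average-pooling branch, fused by concat + 1x1 convolution."""
    branches = [
        tf.keras.layers.Conv2D(filters, 3, padding="same",
                               dilation_rate=r, activation="relu")(x)
        for r in (1, 5, 9)
    ]
    x = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(filters, 1, activation="relu")(x)

inp = tf.keras.Input(shape=(28, 28, 320))   # illustrative backbone output
model = tf.keras.Model(inp, pruned_aspp(inp))
model.summary()
```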

Datasets

| Dataset | Description | Camera | Image resolution | #Training set | #Test set |
|---|---|---|---|---|---|
| CAD-60 (Cornell Activity [30]) | Indoor, only one individual | Kinect V1 | 240×320 | 7,437 | 3,737 |
| CAD-120 (Cornell Activity [30]) | | Kinect V1 | 480×640 | 4,653 | 6,048 |
| EPFL RGBD [31] | Lab and corridor, multiple pedestrians | Kinect V2 | 424×512 | 4,560 | 30 |

[Figure: sample frames from CAD-60, CAD-120 and EPFL RGBD]

Ablation study - ASPP
Our network's performance when Atrous Spatial Pyramid Pooling (ASPP) is applied.

| Dataset | Metric | w/o ASPP | with ASPP |
|---|---|---|---|
| CAD-60 [30] | RMSE ↓ | 0.1529 | 0.1526 |
| | δ1 ↑ | 98.71% | 98.72% |
| | People IoU ↑ | 96.80% | 96.82% |
| CAD-120 [30] | RMSE ↓ | 0.3147 | 0.3140 |
| | δ1 ↑ | 97.97% | 97.98% |
| | People IoU ↑ | 96.10% | 96.17% |
| EPFL RGBD [31] | RMSE ↓ | 0.1484 | 0.1461 |
| | δ1 ↑ | 98.47% | 98.53% |
| | People IoU ↑ | 96.08% | 95.97% |

Ablation study - Pruning
Our network's performance when the pruning technique is applied.

| Dataset | Metric | w/o pruning | with pruning |
|---|---|---|---|
| CAD-60 [30] | RMSE ↓ | 0.1524 | 0.1526 |
| | δ1 ↑ | 98.72% | 98.72% |
| | People IoU ↑ | 96.85% | 96.82% |
| CAD-120 [30] | RMSE ↓ | 0.3133 | 0.3140 |
| | δ1 ↑ | 97.98% | 97.98% |
| | People IoU ↑ | 96.14% | 96.17% |
| EPFL RGBD [31] | RMSE ↓ | 0.1483 | 0.1461 |
| | δ1 ↑ | 98.50% | 98.53% |
| | People IoU ↑ | 96.13% | 95.97% |

Inference runtime with and without pruning.

| Device | w/o pruning | with pruning |
|---|---|---|
| Intel Xeon E5-2640 2.40 GHz CPU | | 13.80 FPS |
| NVIDIA Tesla P40 GPU | 179.21 FPS | 199.93 FPS |
| NVIDIA Jetson Nano GPU | 13.56 FPS | 17.23 FPS |

Inference runtime with and without TensorRT.

| Device | w/o TensorRT | with TensorRT |
|---|---|---|
| NVIDIA Jetson Nano GPU | 17.23 FPS | 114.16 FPS |

Speed - different backbones
Inference runtime of PDES-Net when different backbones are used.

| | MobileNetV2 [51] | ResNet-18 [48] | ResNet-50 [48] | VGG16 [47] | MobileNetV1 [21] |
|---|---|---|---|---|---|
| FPS on CPU | 11.24 | 8.11 | 6.49 | 8.89 | 13.80 |
| FPS on Tesla P40 GPU | 120.22 | 169.90 | 112.91 | 170.64 | 199.93 |
| FPS on Jetson Nano GPU | 14.30 | 4.39 | 3.46 | 2.36 | 17.23 |
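The tables above and below rely on three standard metrics. Minimal NumPy definitions, assuming the conventional formulations (RMSE over depth maps, the δ1 < 1.25 threshold accuracy, and mask IoU):

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def delta1(pred, gt):
    """delta_1 accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < 1.25)

def people_iou(pred_mask, gt_mask):
    """Intersection-over-union of predicted and ground-truth pedestrian masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union
```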

Comparison with the baseline techniques - results
Comparative results of the baseline methods against the proposed PDES-Net. In the original slide, green denotes the best result and blue the second best.

| Dataset | Metric | LEDNet [43] | LiteSeg [44] | ESPNet [45] | DFANet [42] | FastDepth [21] | PDES-Net |
|---|---|---|---|---|---|---|---|
| CAD-60 [29] | RMSE ↓ | 0.2300 | 0.1823 | 0.1559 | 0.2833 | 0.1513 | 0.1526 |
| | δ1 ↑ | 98.25% | 98.77% | 98.66% | 94.39% | 96.28% | 98.72% |
| | People IoU ↑ | 97.31% | 94.35% | 95.06% | 94.50% | 90.17% | 96.82% |
| CAD-120 [29] | RMSE ↓ | 0.4076 | 0.2982 | 0.3249 | 0.3323 | 0.5259 | 0.3140 |
| | δ1 ↑ | 95.15% | 98.12% | 98.29% | 98.31% | 88.78% | 97.98% |
| | People IoU ↑ | 96.64% | 96.51% | 94.18% | 94.48% | 87.19% | 96.17% |
| EPFL RGBD [30] | RMSE ↓ | 0.3882 | 0.2206 | 0.1804 | 0.2663 | 0.8427 | 0.1461 |
| | δ1 ↑ | 90.25% | 96.33% | 97.70% | 94.07% | 50.92% | 98.53% |
| | People IoU ↑ | 90.75% | 92.12% | 96.50% | 93.96% | 58.13% | 95.97% |

Comparison with the baseline techniques - speed
Inference runtime comparison between the baselines and the proposed network.

| Device | LEDNet [43] | LiteSeg [44] | ESPNet [45] | FastDepth [21] | DFANet [42] | PDES-Net |
|---|---|---|---|---|---|---|
| Intel Xeon E5-2640 2.40 GHz CPU | 7.79 FPS | 17.65 FPS | 19.67 FPS | 6.79 FPS | 13.66 FPS | 13.80 FPS |
| NVIDIA Tesla P40 GPU | 39.56 FPS | 70.73 FPS | 119.37 FPS | 133.09 FPS | 89.87 FPS | 199.93 FPS |
| NVIDIA Jetson Nano GPU | 4.39 FPS | 12.38 FPS | 11.42 FPS | 8.69 FPS | 7.18 FPS | 17.23 FPS |

Comparison with the baseline techniques - accuracy vs. runtime
[Figure: accuracy (δ1) and runtime (frames per second) on an NVIDIA P40 GPU (left) and an NVIDIA Jetson Nano GPU (right) for various depth estimation frameworks; the EPFL dataset is used for evaluation and input images are resized to 224×224]

Experiments - CAD-60 & CAD-120
[Figure: qualitative results on the CAD-60 dataset (top) and the CAD-120 dataset (bottom): input, depth GT, semantic GT, depth prediction, semantic prediction, refined segmentation and depth prediction]

Experiments - EPFL
[Figure: qualitative results of the proposed pipeline on the EPFL RGBD dataset: input, depth GT, semantic GT, depth prediction, semantic prediction, refined segmentation and depth prediction]

Videos
- Cornell Activity
- EPFL Lab
- EPFL corridor

THANKS
京东
