AIGC-Driven 3D Scene Understanding and Medical Image Parsing
Dr. Zhen Li, Assistant Professor, The Chinese University of Hong Kong (Shenzhen)

Speaker Introduction
Background:
- Ph.D. from the University of Hong Kong (advised by Prof. Yizhou Yu); visiting scholar at the University of Chicago (advised by Prof. Jinbo Xu)
- Assistant Dean / Professor, School of Science and Engineering / Future Network of Intelligence Institute (FNII), CUHK (Shenzhen); Presidential Young Scholar
- Director of the Deep Bit Lab, CUHK (Shenzhen); group of 1 postdoc, 8 Ph.D. students, and 2 master's students
Honors and research:
- Global champion of contact-map prediction in CASP12, which served as a baseline for AlphaFold v1
- PLOS Computational Biology 2018 Innovation and Breakthrough Award (one awardee per year)
- Selected for the China Association for Science and Technology young talent sponsorship program in 2019
- First place in the CAMEO monthly protein-scoring evaluation (May 2022), the 2022 SemanticKITTI segmentation challenge, the 2023 CVPR HOI4D segmentation challenge, and a 2018 global weather-prediction competition; second place in the ICCV 2022 Urban3D challenge, among others
- PI of one NSFC Young Scientists Fund project; PI of the Shenzhen-Hong Kong Class A project "Deep-Learning-Assisted RNA and Protein Structure Prediction and High-Affinity RNA Design for Proteins" (3,000,000 RMB)
- CCF-Tencent Rhino-Bird 2019 Excellence Award and 2022 Rhino-Bird special project; participant in a MOST National Key R&D project; co-lead of an NSFC key project and of a Guangdong-Shenzhen joint-fund key project

Agenda
- AIGC-driven dense captioning and visual grounding for 3D indoor scenes
- AIGC-driven high-precision 3D talking-face driving and generation
- AIGC-driven colonoscopy image generation and parsing

Case Summary
With the rapid development of generative models such as AIGC systems and ChatGPT, we have explored AIGC-driven 3D scene understanding and medical-scene analysis. Through a series of self-developed algorithms and tools, we have studied AIGC-assisted downstream applications in depth: from automatic dense captioning of 3D scenes, to visual grounding in indoor scenes, to 3D-vision-driven high-fidelity talking-face generation, and further to AIGC-assisted parsing of medical scenes. In this talk we introduce the architecture design and engineering practice of our solutions for 3D scene captioning and grounding, 3D talking-face generation, and generated-image-assisted gastrointestinal endoscopy image parsing, and we share our reflections on applying AIGC-driven 3D scene understanding and medical image understanding, together with an outlook on the future evolution of AIGC.

Agenda
- AIGC-driven dense captioning and visual grounding for 3D indoor scenes
- AIGC-driven high-precision 3D talking-face driving and generation
- AIGC-driven colonoscopy image generation and parsing

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Zhihao Yuan 1, Xu Yan 1, Yinghong Liao 1, Ruimao Zhang 1, Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data; 2 CryoEM Center, Southern University of Science and Technology

Background
Visual Grounding (ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language)
Visual grounding (VG) aims at localizing the desired objects or areas in an image or a 3D scene based on an object-related linguistic query.

Background
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
1. Exploit object detection to generate proposal candidates;
2. Localize the described object by fusing language features into the candidates.
Background
ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language. Cons:
1. The object proposals in a large 3D scene are usually redundant;
2. The appearance and attribute information is not sufficiently captured;
3. The relations among proposals, and between proposals and the background, are not fully studied.
ScanRefer still keeps 114 possible candidates after filtering proposals by their objectness scores; each proposal's feature is generated by the detection framework; there is no relation reasoning among proposals.

Method
InstanceRefer:
1. Instance-level candidate representation (a small number of candidates);
2. Multi-level contextual inference (attributes, object relations, and environment).

Method
InstanceRefer Architecture:
Description: "There is a gray and blue leather chair. Placed in a raw with other chairs in the side of the wall." -> GloVe word embedding W -> BiGRU -> word features E.
Language feature encoding (the same as in ScanRefer); a sketch is given below.
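A minimal sketch of this GloVe + BiGRU text encoder in PyTorch. The embedding table would normally be initialized from pretrained GloVe vectors; the hidden size (128) and the pooled sentence feature are illustrative assumptions, not the paper's exact setting.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """GloVe embedding followed by a bidirectional GRU, as in ScanRefer/InstanceRefer."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        # In practice this table is loaded with pretrained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word indices
        w = self.embedding(tokens)        # (batch, seq_len, 300): word embeddings W
        e, _ = self.gru(w)                # (batch, seq_len, 2*hidden): word features E
        sentence_feat = e.mean(dim=1)     # simple pooled sentence feature
        return e, sentence_feat

# usage
enc = LanguageEncoder(vocab_size=10000)
word_feats, sent_feat = enc(torch.randint(0, 10000, (2, 20)))
```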
Method
InstanceRefer Architecture:
The input point cloud P is processed by panoptic segmentation into semantics S and instance masks I, and instances are extracted (e.g., Table, Chair, Chair, Chair): instances are obtained through panoptic segmentation, which predicts both instances and semantics.

Method
InstanceRefer Architecture:
The target category ("Chair") is predicted from the language query, and the instances are filtered accordingly (chairs only), eliminating irrelevant instances.

Method
InstanceRefer Architecture:
The visual feature of each remaining candidate is generated by multi-level referring through three proposed modules: Attribute Perception (AP), Relation Perception (RP), and Global Localization Perception (GLP), which together form the multi-level visual context.

Method
InstanceRefer Architecture:
Each candidate is scored by matching the language feature (aggregated with attention pooling) against its visual feature, yielding similarity scores Q (e.g., 0.95, 0.31, 0.03); the candidate with the largest score is regarded as the output.

Method: Specific Modules
(a) Attribute Perception (AP) module. It constructs a four-layer sparse convolution (SparseConv) network as the feature extractor; after an average pooling, the global attribute-perception feature is obtained. A sketch follows below.
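A minimal sketch of a four-layer convolutional attribute extractor with global average pooling. For simplicity it uses dense 3D convolutions on a voxelized candidate instead of a true SparseConv backend (e.g., torchsparse or MinkowskiEngine); the input channels and layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributePerception(nn.Module):
    """Four conv layers + global average pooling over a voxelized instance.
    Dense stand-in for the SparseConv extractor described on the slide."""
    def __init__(self, in_channels=6, dims=(32, 64, 96, 128)):
        super().__init__()
        layers, c = [], in_channels
        for d in dims:  # four convolution blocks
            layers += [nn.Conv3d(c, d, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm3d(d), nn.ReLU(inplace=True)]
            c = d
        self.backbone = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d(1)  # global average pooling

    def forward(self, voxels):
        # voxels: (batch, in_channels, D, H, W), e.g. occupancy + RGB per voxel
        x = self.backbone(voxels)
        return self.pool(x).flatten(1)       # (batch, 128) attribute feature

ap = AttributePerception()
feat = ap(torch.randn(4, 6, 32, 32, 32))
```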
(b) Relation Perception (RP) module. It uses k-nearest neighbors to construct a graph, where the node features are the instances' semantics obtained from the panoptic segmentation and the edges consist of their semantics and relative positions; a dynamic graph convolution network (DGCNN) is exploited to update the node features, as in the sketch below.
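A rough sketch of one EdgeConv-style layer over the instance graph: k nearest neighbors are found from instance centers, each edge feature concatenates the node semantics, the neighbor semantics, and the relative position, and a shared MLP with max-aggregation updates each node, in the spirit of DGCNN. Feature sizes and k are illustrative, not the paper's values.

```python
import torch
import torch.nn as nn

class RelationPerception(nn.Module):
    """One dynamic-graph (EdgeConv-like) layer over instance candidates."""
    def __init__(self, sem_dim=32, out_dim=128, k=4):
        super().__init__()
        self.k = k
        # edge feature = [node semantics, neighbor semantics, relative position (3)]
        self.mlp = nn.Sequential(nn.Linear(2 * sem_dim + 3, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, sem_feats, centers):
        # sem_feats: (N, sem_dim) semantic features of N instances in the scene
        # centers:   (N, 3) instance centers
        dist = torch.cdist(centers, centers)                        # (N, N)
        knn = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self: (N, k)
        nbr_sem = sem_feats[knn]                                    # (N, k, sem_dim)
        rel_pos = centers[knn] - centers[:, None, :]                # (N, k, 3)
        node = sem_feats[:, None, :].expand(-1, self.k, -1)
        edges = torch.cat([node, nbr_sem, rel_pos], dim=-1)         # (N, k, 2*sem+3)
        return self.mlp(edges).max(dim=1).values                    # (N, out_dim)

rp = RelationPerception()
relation_feats = rp(torch.randn(10, 32), torch.rand(10, 3))
```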
(c) Global Localization Perception (GLP) module. It uses SparseConv layers with height pooling to generate a 3×3 bird's-eye-view (BEV) plane; by combining the language feature, it predicts which grid cell the target object is located in; it then interpolates the probabilities and generates the global perception features by merging features from the AP module.
(d) Matching module. A naive version uses cosine similarity; an enhanced version uses the modular co-attention from MCAN [1]. A sketch of the naive version follows below.
[1] Deep Modular Co-Attention Networks for Visual Question Answering.
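A sketch of the naive matching step: the word features are attention-pooled into one language feature, each candidate's visual feature is compared with it by cosine similarity, and the highest-scoring candidate is the prediction. A softmax-style contrastive loss over the positive score Q+ against the negatives is shown as one common instantiation of such an objective; the exact formulation, dimensions, and temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Matcher(nn.Module):
    def __init__(self, lang_dim=256, vis_dim=256):
        super().__init__()
        self.attn = nn.Linear(lang_dim, 1)        # attention pooling over word features
        self.proj = nn.Linear(lang_dim, vis_dim)  # project language into the visual space

    def forward(self, word_feats, cand_feats):
        # word_feats: (seq_len, lang_dim); cand_feats: (num_candidates, vis_dim)
        alpha = torch.softmax(self.attn(word_feats), dim=0)              # (seq_len, 1)
        lang = self.proj((alpha * word_feats).sum(dim=0))                # (vis_dim,)
        return F.cosine_similarity(cand_feats, lang[None, :], dim=-1)    # (num_candidates,)

def contrastive_loss(scores, positive_idx, temperature=0.1):
    """-log softmax of the positive candidate's score Q+ against the negative scores Q-."""
    return F.cross_entropy(scores[None, :] / temperature, torch.tensor([positive_idx]))

m = Matcher()
scores = m(torch.randn(20, 256), torch.randn(5, 256))
loss = contrastive_loss(scores, positive_idx=0)
prediction = scores.argmax()   # the candidate with the largest score is the output
```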
(e) Contrastive objective, where Q+ and Q- denote the scores of the positive and negative pairs.

Results
ScanRefer: (quantitative comparison table on the slide)
Results
Nr3D/Sr3D: (quantitative comparison table on the slide)

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Thanks for watching!
Zhihao Yuan 1, Xu Yan 1, Yinghong Liao 1, Ruimao Zhang 1, Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data; 2 CryoEM Center, Southern University of Science and Technology

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Zhihao Yuan 1, Xu Yan 1, Yinghong Liao 1, Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data; 2 Shanghai Jiao Tong University; 3 Sun Yat-sen University
Background
Task description (3D dense captioning): Scan2Cap: Context-aware Dense Captioning in RGB-D Scans.

Background: Limitations
The object representations in Scan2Cap are defective, since they are learned solely from sparse 3D point clouds and thus fail to provide strong texture and color information compared with the ones generated from 2D images.
It requires extra 2D input in both the training and inference phases; however, the extra 2D information is usually computation-intensive and unavailable during inference.

Motivation: X-Trans2Cap
We propose a cross-modal knowledge transfer framework for the 3D dense captioning task. During the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature-consistency constraints. A more faithful caption can be generated using only point clouds during inference.
Architecture: X-Trans2Cap
2D and 3D inputs: 2D proposals and 3D proposals of the target and reference objects; the teacher branch receives multi-modal inputs, while the student branch receives 3D inputs only.
Both branches share the same captioning structure (Transformer encoder layers 1 to L followed by a decoder layer producing descriptions) and are each supervised by a caption loss Lce against the ground truth; the student is used both in training and inference, whereas the teacher is used only in training.
A feature-alignment loss Lalign transfers knowledge from the teacher to the student, and Cross-Modal Fusion (CMF) is applied at every encoder level (Level 1 to Level L). A sketch of the resulting training objective follows below.
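A minimal sketch of the teacher-student objective implied by the figure: both branches are trained with a caption cross-entropy loss, and the student's encoder features are pulled toward the teacher's through a feature-consistency term. The MSE form of Lalign and the loss weights are assumptions; the paper's exact alignment loss and fusion details may differ.

```python
import torch
import torch.nn.functional as F

def xtrans2cap_loss(student_logits, teacher_logits, captions,
                    student_feats, teacher_feats, align_weight=1.0):
    """Caption losses for both branches plus a feature-consistency term.

    student_logits / teacher_logits: (batch, seq_len, vocab) decoder outputs
    captions:                        (batch, seq_len) ground-truth token ids
    student_feats / teacher_feats:   (batch, num_proposals, dim) encoder features
    """
    vocab = student_logits.size(-1)
    l_ce_student = F.cross_entropy(student_logits.reshape(-1, vocab), captions.reshape(-1))
    l_ce_teacher = F.cross_entropy(teacher_logits.reshape(-1, vocab), captions.reshape(-1))
    # feature alignment: the student mimics the multi-modal teacher (teacher not updated by it)
    l_align = F.mse_loss(student_feats, teacher_feats.detach())
    return l_ce_student + l_ce_teacher + align_weight * l_align

loss = xtrans2cap_loss(torch.randn(2, 12, 3000), torch.randn(2, 12, 3000),
                       torch.randint(0, 3000, (2, 12)),
                       torch.randn(2, 8, 256), torch.randn(2, 8, 256))
```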
Cross-Modal Fusion (CMF) module: X-Trans2Cap.

Experiments
3D dense captioning with ground-truth proposals (Nr3D and ScanRefer): (comparison tables on the slide)
Experiments
3D dense captioning with detection proposals (Nr3D and ScanRefer): (comparison tables on the slide)
Experiments
Visualization: (qualitative examples on the slide)

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Thanks for watching!
Zhihao Yuan 1, Xu Yan 1, Yinghong Liao 1, Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data; 2 Shanghai Jiao Tong University; 3 Sun Yat-sen University

Agenda
- AIGC-driven dense captioning and visual grounding for 3D indoor scenes
- AIGC-driven high-precision 3D talking-face driving and generation
- AIGC-driven colonoscopy image generation and parsing
Talking Face: Task Introduction
Input: speech/text (the driving signal) and a face image/video (the identity); model: a deep network; output: a talking-face video.
Goal: given text or speech as the driving information, together with a face image or video providing the person's identity, generate a face video whose lip motion stays consistent with the text or speech content.
Challenges: it is a cross-modal learning task mapping the speech/text modality to the image modality, which requires multi-modal feature extractors and cross-modal interaction learning; the human visual system is sensitive to both the image quality of the generated video and the quality of audio-lip synchronization, so generating high-quality talking-face videos is challenging.

Talking Face: Approach
Fine-grained 3D facial vertices provide explicit supervision for the speech-to-lip mapping, and long-range temporal information is taken into account to obtain stable face videos; a temporal 3DMM scheme is adopted (facial keypoints are predicted with point-cloud-parsing algorithms). A sketch of such vertex-level supervision is given below.
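A rough sketch of what explicit vertex-level lip supervision with a temporal term could look like: predicted mouth-region 3D vertices (e.g., decoded from predicted 3DMM coefficients) are regressed toward the ground-truth vertices, plus a smoothness term over consecutive frames. This is an illustrative formulation, not the exact loss used in the work.

```python
import torch

def lip_vertex_loss(pred_verts, gt_verts, temporal_weight=0.5):
    """pred_verts, gt_verts: (T, V, 3) mouth-region 3D vertices over T frames."""
    # explicit speech-to-lip supervision on the 3D vertices
    recon = (pred_verts - gt_verts).norm(dim=-1).mean()
    # long-range temporal stability: penalize jitter between consecutive frames
    vel_pred = pred_verts[1:] - pred_verts[:-1]
    vel_gt = gt_verts[1:] - gt_verts[:-1]
    temporal = (vel_pred - vel_gt).norm(dim=-1).mean()
    return recon + temporal_weight * temporal

loss = lip_vertex_loss(torch.randn(16, 40, 3), torch.randn(16, 40, 3))
```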
Existing Results
Point-cloud-parsing-driven high-definition talking-face generation results: 3D animation, blend result, and generated result (video demos on the slide).
Generation results in different languages: Chinese and German; Cantonese and English (video demos on the slides).

Agenda
- AIGC-driven dense captioning and visual grounding for 3D indoor scenes
- AIGC-driven high-precision 3D talking-face driving and generation
- AIGC-driven colonoscopy image generation and parsing

ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Diffusion Models
Background
(1) Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment, while the scarcity of annotated data limits the effectiveness and generalization of existing models.
(2) The quality of the data generated by GANs or other data-augmentation methods is poor.
(3) Diffusion models have demonstrated remarkable progress in generating multiple modalities of medical data (CT, MRI, ...).

Overview of the Pipeline
GT masks and original images -> ArSDM (diffusion sampler) -> synthesized images -> combined with the original data -> downstream tasks, e.g., segmentation and detection.
Pipeline:
(1) Train a semantic diffusion model (our ArSDM).
(2) For each mask in the training set, sample a synthesized image; the synthesized dataset then has the same number of image-mask pairs as the original dataset.
(3) Combine the original diffusion training set with the synthesized dataset for training the polyp segmentation and detection models (a sketch of this step is given below).
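A minimal sketch of step (3): pairing each training mask with a sampled image and concatenating the synthesized pairs with the original pairs for downstream training. The tensors and the sample_with_mask call are placeholders standing in for the actual ArSDM sampler and data loaders.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholders: original image-mask pairs of the training set.
orig_images = torch.randn(8, 3, 384, 384)
orig_masks = torch.randint(0, 2, (8, 1, 384, 384)).float()

def sample_with_mask(mask):
    """Stand-in for the ArSDM sampler: one synthesized image conditioned on a GT mask."""
    return torch.randn(3, 384, 384)

# one synthesized image per mask -> same number of image-mask pairs as the original set
synth_images = torch.stack([sample_with_mask(m) for m in orig_masks])

original_set = TensorDataset(orig_images, orig_masks)
synthesized_set = TensorDataset(synth_images, orig_masks)   # reuse the GT masks
combined = ConcatDataset([original_set, synthesized_set])
loader = DataLoader(combined, batch_size=4, shuffle=True)   # feeds the seg/det models
```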
Model Architecture
(Figure: diffusion process and sampler with a conditional U-Net; re-weighting module producing a weights map; diffusion loss and refinement loss; PraNet; GT mask as the condition input; noised and estimated images; sample and prediction mask.)

Model Architecture
Mask conditioning: the segmentation masks are used as conditions, similar to semantic masks but with only two categories: foreground (polyp) and background (intestinal wall). The conditional U-Net model is the same as in SDM (Semantic Image Synthesis via Diffusion Models, https://arxiv.org/abs/2207.00050).
Adaptive loss function: based on the ℓ1 loss, a pixel-wise weight matrix is defined that assigns different weights according to the size ratio of the polyp over the background; in code, it is convenient to use the pixel values of the segmentation mask (0, 1), as in the sketch below.
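A sketch of such an adaptive pixel-wise weighting built directly from the binary mask values: polyp pixels are up-weighted in inverse proportion to the area they occupy, and the weights multiply an ℓ1 diffusion loss. The exact weighting formula is an assumption; the paper's definition may differ.

```python
import torch

def adaptive_l1_loss(pred_noise, target_noise, mask, eps=1e-6):
    """pred_noise, target_noise: (B, C, H, W); mask: (B, 1, H, W) with values in {0, 1}."""
    fg_ratio = mask.mean(dim=(1, 2, 3), keepdim=True)             # polyp area / image area
    # small polyps -> large foreground weight; background pixels keep weight 1
    weights = mask * (1.0 - fg_ratio) / (fg_ratio + eps) + (1.0 - mask)
    weights = weights / (weights.mean(dim=(1, 2, 3), keepdim=True) + eps)  # keep loss scale stable
    return (weights * (pred_noise - target_noise).abs()).mean()

loss = adaptive_l1_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64),
                        torch.randint(0, 2, (2, 1, 64, 64)).float())
```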
Model Architecture
Refinement: a pre-trained segmentation model is used to fine-tune the diffusion model, in which the U-Net parameters are updated while the segmentation model parameters are kept fixed. For each time step t, an image needs to be sampled, which is time-consuming. A sketch of the refinement objective follows below.
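A minimal sketch of the refinement idea: an image estimate is recovered from the current denoising step, passed through a frozen pre-trained segmentation network (PraNet on the slides; a placeholder module here), and a segmentation loss against the GT mask is backpropagated only through the predicted noise, i.e., into the diffusion U-Net. The x0 reconstruction follows the standard DDPM parameterization; everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def refinement_loss(seg_model, x_t, t, noise_pred, alpha_bar, gt_mask):
    """x_t: noisy image at step t; noise_pred: U-Net output; alpha_bar: (T,) cumulative schedule."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    # standard DDPM estimate of the clean image x0 from the predicted noise
    x0_est = (x_t - (1 - a).sqrt() * noise_pred) / a.sqrt()
    seg_model.eval()
    for p in seg_model.parameters():      # segmentation model stays frozen
        p.requires_grad_(False)
    pred_mask = seg_model(x0_est)          # gradients still flow back through noise_pred
    return F.binary_cross_entropy_with_logits(pred_mask, gt_mask)

# placeholder frozen segmentation network standing in for PraNet
seg = nn.Conv2d(3, 1, kernel_size=3, padding=1)
x_t = torch.randn(2, 3, 64, 64)
noise_pred = torch.randn(2, 3, 64, 64, requires_grad=True)
loss = refinement_loss(seg, x_t, torch.tensor([10, 200]), noise_pred,
                       torch.linspace(0.99, 0.01, 1000),
                       torch.randint(0, 2, (2, 1, 64, 64)).float())
```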
Experimental Settings
Diffusion training. Training set: Kvasir + CVC-ClinicDB (1,450 image-mask pairs). Image size: padded to equal height and width and then resized to 384×384. Duration: with refinement, around half an NVIDIA A100 day (80 GB memory); without refinement, around one A100 day.
Diffusion sampling. DDIM sampler with 200 steps; random noise as input and the mask as the condition.

Comparison Results
Polyp segmentation: (comparison tables on the slide)
Comparison Results
Polyp detection: (comparison tables on the slide)
Visualization
Original images, masks, and samples (examples on the slide).

Colonoscopy Video Generation with Diffusion Models

PVDM: train the autoencoder on Sky Time-lapse from scratch.
Dataset: Sky Time-lapse, 997 videos, 1,172,641 frames in total. Inputs vs. reconstructions shown on the slide. Training: 1 V100, 1 day.

PVDM: train the autoencoder on LDPolyp, initialized with the Sky Time-lapse weights.
Dataset: LDPolyp, 100 videos, 24,789 frames in total. Inputs vs. reconstructions shown on the slide. Training: 1 V100, 1.6 days.

LVDM: train the 2D autoencoder on LDPolyp, initialized with ImageNet-pretrained weights.
Dataset: LDPolyp, 100 videos, 24,789 frames in total. Inputs vs. reconstructions shown on the slide. Training: 1 V100, 1 day.

LVDM-2 (LVDM 5.1 release codes): train an unconditional diffusion model on LDPolyp.
Dataset: LDPolyp, 100 videos, 24,789 frames in total. Samples shown on the slide. Training: 1 A100 (80 GB), 1 day.
Details: attention layers are added to part of the 3D U-Net; the attention introduces too many parameters, so a batch size of 2 already uses 48 GB of GPU memory; further optimization is needed.

Next Steps
- Multi-modal parsing and generation in 3D scenes.
- Combine video diffusion to strengthen the talking-face results.
- Use condition masks for video-diffusion generation of medical-image scenes.