6-2 Conversational Image Retrieval with User Feedback


Contents
01 Background
02 Unstructured Feedback
03 Structured Feedback
04 Future Work

01 Background
- The fashion domain carries huge economic value.
- There is a vast amount of online clothing data on the Internet; precise image retrieval that meets the user's search intent is a key challenge.
- Conventional paradigms for item search take either text or image as the input query to search for items of interest, e.g., the text query "a blue overcoat with a lapel collar and a belt around the waist", or an image query.
- Flexible image retrieval allows users to search for items with a reference image plus modification feedback, which may be unstructured (e.g., "I want the dress to be black and more professional.") or structured.
- Application: dialog-based / conversational fashion search. At the beginning, the recommended fashion product image may not be the desired one; based on this reference image, the user typically refines the retrieval by providing feedback that describes the relative difference between the currently retrieved reference image and his/her desired one.

02 Structured Feedback

Task: a query image can be described by its associated attributes, Q = {a_1, …, a_{i1}, …, a_{i2}, …, a_N}, and the target image by T = {a_1, …, â_{i1}, …, â_{i2}, …, a_N}.
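As a minimal illustration of this task formulation (the function and attribute names are my own, not from the slides), structured feedback amounts to replacing the to-be-manipulated values in the query's attribute set:

```python
def manipulate_attributes(query_attrs, feedback):
    """Apply structured feedback to a query's attribute set.

    query_attrs: dict mapping attribute name -> value, e.g. {"color": "blue"}
    feedback:    dict of to-be-manipulated attributes -> desired values
    Returns the attribute description of the target item.
    """
    target_attrs = dict(query_attrs)          # copy; untouched attributes carry over
    for attr, desired_value in feedback.items():
        target_attrs[attr] = desired_value    # substitute the unwished value
    return target_attrs

query = {"category": "overcoat", "color": "blue", "collar": "lapel"}
feedback = {"color": "black", "collar": "stand"}
print(manipulate_attributes(query, feedback))
```

The methods below differ only in how this symbolic substitution is realized in feature space.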

Attribute manipulation: a_{i1} and a_{i2} are the to-be-manipulated attributes.

Related Work
- Feature fusion-based: Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search, CVPR 2017.
- Feature substitution-based: Efficient Multi-Attribute Similarity Learning Towards Attribute-based Fashion Search, WACV 2018; Automatic Spatially-aware Fashion Concept Discovery, ICCV 2017; Learning Attribute Representations with Localization for Flexible Fashion Search, CVPR 2018.

Fusion-based Method (Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search, CVPR 2017)
- Pipeline: attribute representation learning, image representation learning, representation fusion, fashion search.
- Fusion-based: learn the latent representation of the target item by directly fusing the visual features of the query image with the semantic features of the wanted attribute(s).
- Notation: the original image representation, the prototype attribute representation, a binary indicator vector over attribute values, the memory matrix, and the manipulated representation.
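A minimal sketch of the fusion idea, where a fixed convex combination stands in for the learned memory-based fusion; all names, sizes, and the blending weight are illustrative assumptions, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                        # toy feature dimensionality
attribute_memory = rng.normal(size=(3, D))   # one prototype vector per attribute value

def fuse(image_feat, wanted, memory, alpha=0.5):
    """Fuse query image features with wanted-attribute prototypes.

    wanted: binary indicator vector selecting the desired attribute values.
    An average of the selected prototypes is blended into the image feature;
    the real model learns this fusion, here it is a fixed convex combination.
    """
    attr_feat = wanted @ memory / max(wanted.sum(), 1)   # average selected prototypes
    fused = (1 - alpha) * image_feat + alpha * attr_feat
    return fused / np.linalg.norm(fused)                 # L2-normalize for retrieval

image_feat = rng.normal(size=D)
wanted = np.array([0.0, 1.0, 0.0])   # indicator: want attribute value #1
print(fuse(image_feat, wanted, attribute_memory).shape)  # (8,)
```

The fused vector is then matched against gallery features to perform the search.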

Substitution-based Method (Learning Attribute Representations with Localization for Flexible Fashion Search, CVPR 2018; Class Activation Mapping from: Learning Deep Features for Discriminative Localization, CVPR 2016: 2921-2929)
- Pipeline: attribute localization, attribute representation learning, optimization.
- Substitution-based: characterize the query image with multiple attributes; attribute manipulation is then conducted by replacing the unwished attribute features with the desired ones.
- Class Activation Mapping is used for localizing the discriminative image regions.
- After training, the features extracted from the training images with the same attribute value are averaged and used for attribute manipulation.
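The averaging-and-replacement step can be sketched as follows (toy two-dimensional features and my own variable names; in the real method the features are localized CNN activations):

```python
import numpy as np

# toy per-attribute feature bank: features of training images grouped by
# (attribute, value) pairs
bank = {
    ("color", "blue"):  [np.array([1.0, 0.0]), np.array([0.8, 0.2])],
    ("color", "black"): [np.array([0.0, 1.0]), np.array([0.2, 0.8])],
}

def prototype(attr, value):
    """Average the features of all training images sharing this attribute value."""
    return np.mean(bank[(attr, value)], axis=0)

def substitute(query_slots, attr, desired_value):
    """Replace the unwished attribute slot of the query with the desired prototype."""
    slots = dict(query_slots)
    slots[attr] = prototype(attr, desired_value)
    return slots

query_slots = {"color": prototype("color", "blue"), "sleeve": np.array([0.5, 0.5])}
target_slots = substitute(query_slots, "color", "black")
print(target_slots["color"])  # -> [0.1 0.9]
```

Retrieval then compares the concatenated attribute slots of the manipulated query against the gallery.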

Motivation
- Existing methods ignore the potential of Generative Adversarial Networks (GANs) in enhancing the visual understanding of target items.
- We aim to boost the performance of content-based fashion search with attribute manipulation by directly generating the target item image: the generated prototype image is mapped into the feature space to retrieve similar items.

Method: the proposed AMGAN, consisting of prototype image generation and metric learning for fashion search.
- Semantic discriminative learning: train the discriminator to accurately classify the attributes against their ground-truth values, and encourage the generator to synthesize the prototype image with the correct attribute manipulation.
- Adversarial metric learning (pair-based): maximize the similarity between the positive pair and minimize the similarity between the negative pair, where the positive image shares the same attribute values as the target; encourage the generator to produce images similar to the positive image so as to fool the learned metric (via a similarity probability).
- Adversarial metric learning (triplet-based): encourage the generated image to be more similar to the positive image than to the negative one (via a relative similarity probability).

Dataset: DARN — 213,636 images, 9 attributes, and 179 possible values. [Figures: examples of online-offline image pairs in DARN; attribute and value examples of DARN.] Junshi Huang, Rogério Schmidt Feris, Qiang Chen, Shuicheng Yan: Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network. ICCV 2015: 1062-1070.
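The pair-based similarity probability and triplet-based relative similarity probability named in the AMGAN method can be illustrated with one common parameterization; the sigmoid/softmax forms below are my own choice for the sketch, since the slides only name the probabilities:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pair_similarity_prob(x, y):
    """Pair-based: probability that (x, y) is a positive pair (sigmoid of similarity)."""
    return 1.0 / (1.0 + math.exp(-cosine(x, y)))

def triplet_relative_prob(anchor, pos, neg):
    """Triplet-based: probability that anchor is closer to pos than to neg (softmax)."""
    sp, sn = math.exp(cosine(anchor, pos)), math.exp(cosine(anchor, neg))
    return sp / (sp + sn)

anchor, pos, neg = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
assert triplet_relative_prob(anchor, pos, neg) > 0.5
```

Maximizing the first probability for positive pairs (and minimizing it for negative pairs), or maximizing the second for triplets, realizes the stated objectives.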

Dataset: Shopping100K — 101,021 images, 12 attributes, and 151 possible values. [Figures: samples in Shopping100K; attribute and value examples of Shopping100K.] Kenan E. Ak, Joo-Hwee Lim, Jo Yew Tham, Ashraf A. Kassim: Efficient Multi-attribute Similarity Learning Towards Attribute-Based Fashion Search. WACV 2018: 1671-1679.

Model Comparison
[Fig. 1: Overall performance comparison on Shopping100K and DARN — (a) Top-K, (b) NDCG@K, (c) MRR@K on Shopping100K; (d) Top-K, (e) NDCG@K, (f) MRR@K on DARN. Symbols denote statistical significance at the 0.05 level.]

Algorithm: for testing, gallery images are ranked by jointly evaluating their cosine similarities to both the local-wise and the global-wise composed query representations.

Dataset
- Adapting existing datasets / creating new datasets: MIT-States (Phillip Isola et al., CVPR 2015); Birds-to-Words (Maxwell Forbes et al., EMNLP 2019); Shoes (Xiaoxiao Guo et al., NeurIPS 2018); CSS (Nam Vo et al., CVPR 2019); CIRR (Zheyuan Liu et al., ICCV 2021); Fashion200k (Xintong Han et al., ICCV 2017); FashionIQ (released at an ICCV 2019 workshop).

MIT-States: contains around 60k images; each image comes with an object/noun label and a state/adjective label (such
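The test-time ranking can be sketched with toy vectors; the equal weighting of the two cosine similarities is my assumption, as the slides only say "jointly evaluating":

```python
import numpy as np

def rank_gallery(local_q, global_q, gallery_local, gallery_global, w=0.5):
    """Rank gallery images by a weighted sum of cosine similarities to the
    local-wise and global-wise composed query representations."""
    def cos(q, g):
        return (q @ g.T) / (np.linalg.norm(q) * np.linalg.norm(g, axis=-1))
    scores = w * cos(local_q, gallery_local) + (1 - w) * cos(global_q, gallery_global)
    return np.argsort(-scores)   # gallery indices, best match first

local_q  = np.array([1.0, 0.0])
global_q = np.array([0.0, 1.0])
gallery_local  = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
gallery_global = np.array([[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
print(rank_gallery(local_q, global_q, gallery_local, gallery_global))  # -> [0 2 1]
```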

as "red tomato" or "new camera"). Example training triplets derived from MIT-States: "unripe banana" → "ripe banana" (replace "unripe" with "ripe"); "cluttered bag" → "empty bag" (replace "cluttered" with "empty"). [Phillip Isola, Joseph J. Lim, Edward H. Adelson: Discovering States and Transformations in Image Collections. CVPR 2015: 1383-1391]

Birds-to-Words: a dataset for relative captioning; it consists of 3,347 image pairs, annotated with 16,067 paragraphs describing the differences between the pairs of images. [Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, Serge J. Belongie: Neural Naturalist: Generating Fine-Grained Image Comparisons. EMNLP/IJCNLP 2019: 708-717]

Fashion200k: more than 200k image-text pairs, crawled from online shopping websites; stop words, symbols, and words occurring fewer than 5 times were removed. Example training triplet for CTI-IR: "blue one shoulder dress" → "black one shoulder dress" (replace "blue" with "black"). [Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S. Davis: Automatic Spatially-Aware Fashion Concept Discovery. ICCV 2017: 1472-1480]
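The Fashion200k-style text cleaning (dropping stop words, symbols, and rare words) can be sketched as follows; the stop-word list and the use of a regex tokenizer are illustrative assumptions, only the "fewer than 5 occurrences" threshold comes from the slides:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "with", "and"}   # illustrative, not the paper's list
MIN_COUNT = 5                                    # drop words occurring fewer than 5 times

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())  # strips symbols and punctuation
    return [w for w in words if w not in STOP_WORDS]

def build_vocab(captions):
    counts = Counter()
    for text in captions:
        counts.update(tokenize(text))
    return {w for w, c in counts.items() if c >= MIN_COUNT}

def clean(text, vocab):
    return " ".join(w for w in tokenize(text) if w in vocab)

captions = ["blue one shoulder dress!"] * 5 + ["the rare-word dress"]
vocab = build_vocab(captions)
print(clean("a blue one shoulder dress", vocab))  # -> blue one shoulder dress
```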

Shoes: a dataset for relative captioning, collected in the scenario of a shopping chat session between a shopping assistant and a customer; 10,751 captions, with one caption per pair of images, annotated through an AMT interface. [Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogério Schmidt Feris: Dialog-based Interactive Image Retrieval. NeurIPS 2018: 676-686]

FashionIQ (released at an ICCV 2019 workshop): contains 77,684 diverse fashion images (dresses, shirts, and tops & tees), side information in the form of textual descriptions and product metadata, attribute labels, and large-scale, high-quality relative captions collected from human annotators. [Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogério Feris: Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. CVPR 2021: 11307-11317]

CSS: the same scenes are rendered as 2D and 3D images, using the CLEVR toolkit to generate the synthesized images; objects vary in Color, Shape, and Size (CSS). Three types of modification text: adding, removing, or changing object attributes. 16K triplets for training and 16K triplets for testing. [Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays: Composing Text and Image for Image Retrieval - an Empirical Odyssey. CVPR 2019: 6439-6448]

Limitations of the previous datasets: non-complex images within narrow domains, and many false negatives — e.g., for the relative caption "are black with a colorful floral print", several gallery images besides the labeled target are plausible matches.

CIRR (Compose Image Retrieval on Real-life images): uses the popular NLVR2 dataset for natural language visual reasoning as the source of images; over 36,000 pairs. [Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould: Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. ICCV 2021: 2105-2114]

32、all the baselines for all the three datasets,which reflects thesuperiority of our CLVC-Net.Performance comparison on FashionIQ,Shoes,and Fashion200k.|Case StudyIllustration of CTI-IR results obtained by our CLVC-Net on three datasets.Failure casesFailure casesgreen boxes:target items|Demo|We are the

33、 first to unify the global-wise and local-wise compositions withmutual enhancement in the context of CTI-IR.We devise two affine transformation-based attentive compositionmodules,towards the fine-grained multi-modal compositions for bothangles.Extensive experiments conducted on three real-world data

34、sets validatethe superiority of our model.Haokun Wen,Xuemeng Song,Xin Yang,Yibing Zhan,Liqiang Nie:Comprehensive Linguistic-Visual CompositionNetwork for Image Retrieval.SIGIR 2021:1369-1378Conclusion|04Future Work Pre-training TechniqueFuture Work|Using CLIP-based FeaturesUsing OSCAR as the composi

35、tion module Liu et al.ICCV 2021Baldrati et al.MMAsia 21 Limited Annotated SamplesFuture Work|Reference imagehas small straps,more plain and more revealingModification textTarget image Case1 from FashionIQ:Potential target images Case2 from Shoes:Reference imageTarget imageModification textare black with a colorful floral printPotential target images非常感谢您的观看|
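As a rough sketch of the CLIP-based direction above (the additive composition and all feature values are illustrative assumptions, not the cited methods' exact designs):

```python
import numpy as np

def compose(image_feat, text_feat):
    """Late-fusion composition of precomputed CLIP-style features:
    add the modification-text feature to the reference-image feature,
    then L2-normalize so cosine similarity reduces to a dot product."""
    q = image_feat + text_feat
    return q / np.linalg.norm(q)

def retrieve(query, gallery):
    """Return gallery indices sorted by cosine similarity to the composed query."""
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(gallery @ query))

rng = np.random.default_rng(1)
image_feat, text_feat = rng.normal(size=16), rng.normal(size=16)
gallery = rng.normal(size=(5, 16))
print(retrieve(compose(image_feat, text_feat), gallery))
```

With strong pre-trained features, even this simple composition is a competitive baseline, which is what makes the pre-training direction attractive.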
