Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations

doi:10.1109/TIP.2023.3311917


Title	Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations
Authors	Zhang, Ruisong 1,2; Wang, Chuang 1,2; Liu, Cheng-Lin 1,2
Journal	IEEE TRANSACTIONS ON IMAGE PROCESSING
Year	2023
Volume	32, pages 5167-5180
Keywords	Visualization; Grounding; Task analysis; Sports equipment; Image reconstruction; Transformers; Training; Weakly supervised learning; visual grounding; cycle consistency; individual and contextual representations
ISSN	1057-7149
DOI	10.1109/TIP.2023.3311917
Corresponding author	Zhang, Ruisong (zhangruisong2019@ia.ac.cn)
Abstract	Visual grounding, aiming to align image regions with textual queries, is a fundamental task for cross-modal learning. We study the weakly supervised visual grounding, where only image-text pairs at a coarse-grained level are available. Due to the lack of fine-grained correspondence information, existing approaches often encounter matching ambiguity. To overcome this challenge, we introduce the cycle consistency constraint into region-phrase pairs, which strengthens correlated pairs and weakens unrelated pairs. This cycle pairing makes use of the bidirectional association between image regions and text phrases to alleviate matching ambiguity. Furthermore, we propose a parallel grounding framework, where backbone networks and subsequent relation modules extract individual and contextual representations to calculate context-free and context-aware similarities between regions and phrases separately. Those two representations characterize visual/linguistic individual concepts and inter-relationships, respectively, and then complement each other to achieve cross-modal alignment. The whole framework is trained by minimizing an image-text contrastive loss and a cycle consistency loss. During inference, the above two similarities are fused to give the final region-phrase matching score. Experiments on five popular datasets about visual grounding demonstrate a noticeable improvement in our method. The source code is available at https://github.com/Evergrow/WSVG.
Funding projects	National Key Research and Development Program [2018AAA0100400]; National Natural Science Foundation of China (NSFC) [U20A20223]; National Natural Science Foundation of China (NSFC) [61721004]; Pioneer Hundred Talents Program of the Chinese Academy of Sciences (CAS) [Y9S9MS08]
WOS keywords	LANGUAGE
WOS research areas	Computer Science; Engineering
Language	English
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
WOS accession number	WOS:001070756500003
Funding organizations	National Key Research and Development Program; National Natural Science Foundation of China (NSFC); Pioneer Hundred Talents Program of the Chinese Academy of Sciences (CAS)
Content type	Journal article
Source URL	http://ir.ia.ac.cn/handle/173211/53032
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems
Author affiliations	1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
Recommended citation (GB/T 7714)	Zhang, Ruisong, Wang, Chuang, Liu, Cheng-Lin. Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32: 5167-5180.
APA	Zhang, Ruisong, Wang, Chuang, & Liu, Cheng-Lin. (2023). Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations. IEEE TRANSACTIONS ON IMAGE PROCESSING, 32, 5167-5180.
MLA	Zhang, Ruisong, et al. "Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations". IEEE TRANSACTIONS ON IMAGE PROCESSING 32 (2023): 5167-5180.