Visual Question Answering With Dense Inter- and Intra-Modality Interactions | |
Liu, Fei1,2; Liu, Jing1,2; Fang, Zhiwei1,2; Hong, Richang3; Lu, Hanqing1,2 | |
Journal | IEEE TRANSACTIONS ON MULTIMEDIA |
2021 | |
Volume | 23; Pages: 3518-3529 |
Keywords | Visualization; Knowledge discovery; Connectors; Encoding; Task analysis; Image coding; Stacking; Visual question answering; attention; dense interactions |
ISSN | 1520-9210 |
DOI | 10.1109/TMM.2020.3026892 |
Corresponding author | Liu, Jing (jliu@nlpr.ia.ac.cn) |
Abstract | Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of the existing VQA approaches is that they only consider a very limited amount of inter-modality interactions, which may be not enough to model latent complex image-question relations that are necessary for accurately answering questions. Besides, most methods neglect the modeling of the intra-modality interactions that is also important to VQA. In this work, we propose a novel DenIII framework for modeling dense inter-, and intra-modality interactions. It densely connects all pairwise layers of the network via the proposed Inter-, and Intra-modality Attention Connectors, capturing fine-grained interplay across all hierarchical levels. The Inter-modality Attention Connector efficiently connects the multi-modality features at any two layers with bidirectional attention, capturing the inter-modality interactions. While the Intra-modality Attention Connector connects the features of the same modality with unidirectional attention, and models the intra-modality interactions. Extensive ablation studies, and visualizations validate the effectiveness of our method, and DenIII achieves state-of-the-art or competitive performance on three publicly available datasets. |
Funding projects | Beijing Natural Science Foundation[4192059] ; Beijing Natural Science Foundation[JQ20022] ; National Natural Science Foundation of China[61922086] ; National Natural Science Foundation of China[61872366] ; National Natural Science Foundation of China[61872364] |
WOS research areas | Computer Science ; Telecommunications |
Language | English |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
WOS accession number | WOS:000709093100007 |
Funding organizations | Beijing Natural Science Foundation ; National Natural Science Foundation of China |
Content type | Journal article |
Source URL | [http://ir.ia.ac.cn/handle/173211/46267] |
Collection | Institute of Automation_National Laboratory of Pattern Recognition_Image and Video Analysis Group |
Corresponding author | Liu, Jing |
Author affiliations | 1. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; 3. Hefei Univ Technol, Sch Comp & Informat, Hefei 230000, Anhui, Peoples R China |
Recommended citation, GB/T 7714 | Liu, Fei, Liu, Jing, Fang, Zhiwei, et al. Visual Question Answering With Dense Inter- and Intra-Modality Interactions[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 3518-3529. |
APA | Liu, Fei, Liu, Jing, Fang, Zhiwei, Hong, Richang, & Lu, Hanqing. (2021). Visual Question Answering With Dense Inter- and Intra-Modality Interactions. IEEE TRANSACTIONS ON MULTIMEDIA, 23, 3518-3529. |
MLA | Liu, Fei, et al. "Visual Question Answering With Dense Inter- and Intra-Modality Interactions". IEEE TRANSACTIONS ON MULTIMEDIA 23 (2021): 3518-3529. |