Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering | |
Z. Wang; X. Liu; L. Chen; L. Wang; Y. Qiao; X. Xie; C. Fowlkes | |
2018 | |
Conference date | 2018 |
Conference location | United States |
Abstract | Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice. |
Content type | Conference paper |
Source URL | [http://ir.siat.ac.cn:8080/handle/172644/13696] |
Collection | Shenzhen Institutes of Advanced Technology_Institute of Advanced Integration Technology (深圳先进技术研究院_集成所) |
Recommended citation (GB/T 7714) | Z. Wang, X. Liu, L. Chen, et al. Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering[C]. In: . United States. 2018. |