Masked Vision-language Transformer in Fashion
Ge-Peng Ji²
Journal: Machine Intelligence Research
Year: 2023
Volume: 20, Issue: 3, Pages: 421-434
Keywords: vision-language, masked image reconstruction, transformer, fashion, e-commerce
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1394-4
Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
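
To make the masked image reconstruction (MIR) objective mentioned in the abstract concrete, the following is a minimal, self-contained sketch of the general idea: image patches are embedded, a random subset is replaced by a learned mask token, and a transformer is trained to regress the missing pixels. This is only an illustrative approximation under assumed dimensions and masking ratio; it is not the authors' implementation (see https://github.com/GewelsJI/MVLT for that) and it omits the language branch and vision-language alignment entirely.

# Illustrative sketch of a masked-image-reconstruction pre-training objective.
# All module names, sizes and the 50% mask ratio are assumptions, not MVLT's settings.
import torch
import torch.nn as nn

class ToyMaskedImageReconstruction(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, mask_ratio=0.5):
        super().__init__()
        self.patch_size = patch_size
        self.mask_ratio = mask_ratio
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.patch_embed = nn.Linear(patch_dim, dim)              # flattened-patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))    # learned [MASK] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(dim, patch_dim)                  # predict raw pixels per patch

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3 * p * p)
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)                             # pixel targets
        tokens = self.patch_embed(patches) + self.pos_embed
        # randomly choose patches to mask and swap in the mask token
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.decoder(self.encoder(tokens))
        # reconstruction loss computed only on the masked positions
        return ((pred - patches) ** 2)[mask].mean()

# usage: loss = ToyMaskedImageReconstruction()(torch.randn(2, 3, 224, 224))
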
Content type: Journal article
Source URL: http://ir.ia.ac.cn/handle/173211/51710
Collection: Institute of Automation_Academic Journals_International Journal of Automation and Computing
Author affiliations:
1. Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
2. International Core Business Unit, Alibaba Group, Hangzhou 310051, China
Recommended citation:
GB/T 7714: Ge-Peng Ji. Masked Vision-language Transformer in Fashion[J]. Machine Intelligence Research, 2023, 20(3): 421-434.
APA: Ge-Peng Ji. (2023). Masked Vision-language Transformer in Fashion. Machine Intelligence Research, 20(3), 421-434.
MLA: Ge-Peng Ji. "Masked Vision-language Transformer in Fashion." Machine Intelligence Research 20.3 (2023): 421-434.