Masked Vision-language Transformer in Fashion
Ge-Peng Ji²
Journal: Machine Intelligence Research
Year: 2023
Volume: 20, Issue: 3, Pages: 421-434
Keywords: vision-language, masked image reconstruction, transformer, fashion, e-commerce
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1394-4
Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
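
To make the masked image reconstruction (MIR) objective mentioned in the abstract concrete, the following is a minimal, self-contained sketch of the general idea: image patches are embedded, a random subset is replaced by a learned mask token, and a transformer is trained to regress the missing pixels. This is only an illustrative approximation under assumed dimensions and masking ratio; it is not the authors' implementation (see https://github.com/GewelsJI/MVLT for that) and it omits the language branch and vision-language alignment entirely.

# Illustrative sketch of a masked-image-reconstruction pre-training objective.
# All module names, sizes and the 50% mask ratio are assumptions, not MVLT's settings.
import torch
import torch.nn as nn

class ToyMaskedImageReconstruction(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, mask_ratio=0.5):
        super().__init__()
        self.patch_size = patch_size
        self.mask_ratio = mask_ratio
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.patch_embed = nn.Linear(patch_dim, dim)              # flattened-patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))    # learned [MASK] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(dim, patch_dim)                  # predict raw pixels per patch

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3 * p * p)
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)                             # pixel targets
        tokens = self.patch_embed(patches) + self.pos_embed
        # randomly choose patches to mask and swap in the mask token
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.decoder(self.encoder(tokens))
        # reconstruction loss computed only on the masked positions
        return ((pred - patches) ** 2)[mask].mean()

# usage: loss = ToyMaskedImageReconstruction()(torch.randn(2, 3, 224, 224))
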
Content type: Journal article
Source URL: http://ir.ia.ac.cn/handle/173211/51710
Collection: Institute of Automation_Academic Journals_International Journal of Automation and Computing
Author affiliations:
1. Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
2. International Core Business Unit, Alibaba Group, Hangzhou 310051, China
Recommended citation:
GB/T 7714: Ge-Peng Ji. Masked Vision-language Transformer in Fashion[J]. Machine Intelligence Research, 2023, 20(3): 421-434.
APA: Ge-Peng Ji. (2023). Masked Vision-language Transformer in Fashion. Machine Intelligence Research, 20(3), 421-434.
MLA: Ge-Peng Ji. "Masked Vision-language Transformer in Fashion." Machine Intelligence Research 20.3 (2023): 421-434.