INCORPORATING LARGE-SCALE VISION-LANGUAGE CORPORA
IN VISUAL UNDERSTANDING
by
Zhaoheng Zheng
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Zhaoheng Zheng
To my advisor Ram,
my fiancée Lan,
and my parents Rongmin and Linna.
Acknowledgments
I would like to express my utmost gratitude to my advisor, Prof. Ram Nevatia, for his unwavering
guidance and support throughout this challenging journey, particularly when the pandemic disrupted our
lives in 2020. Back in 2018, Prof. Nevatia enlightened me when I almost gave up pursuing the Ph.D. His
mentorship has been indispensable; I could not have reached this point without him. From him, I have
learned invaluable lessons not only in computer vision research, but also in the broad philosophy of life. I
feel fortunate to have attended USC and to have had Prof. Nevatia as my advisor over the last five years.
Special thanks to Yue (Rex) Wu for his invaluable mentorship during my internship at Amazon. Rex
was there for me, offering advice and encouragement, particularly when I faced hurdles with my first
top-conference publication. Learning from him was a hands-on experience that profoundly impacted my approach to research. Without Rex, my Ph.D. journey would have been more difficult.
I am also thankful to Prof. Keith Jenkins, Prof. Mohammad Soleymani, Prof. Greg Ver Steeg, Prof. Jesse Thomason, and Prof. Xiang Ren for spending their precious time serving as committee members of my thesis defense, thesis proposal and qualification exam; Prof. Shi-Min Hu and Prof. Fang-Lue Zhang for guiding me into the world of Computer Vision and Artificial Intelligence; Rakesh Chada and Pradeep Natarajan for two short but joyful summers at Amazon Alexa AI; Haidong Zhu, Xuefeng Hu and Arka Sadhu for the continuing collaboration at USC IRIS; Prof. Jia Deng and Prof. David Fouhey for the collaboration at the University of Michigan; and Vibhav Vineet and Neel Josh for the collaboration at Microsoft Research. I also want to say thank you to all current and previous members of the IRIS Computer Vision Lab at USC.
This endeavor would not have been possible without the boundless support and love from my parents,
Rongmin and Linna, and my fiancée, Lan. Their emotional support was my foundation during the most
challenging times; their encouragement was not just words but actions that showed true care. Lan, with
her endless patience and understanding, has been a source of comfort and motivation, reminding me of
the joy and purpose beyond the rigors of academic pursuit. I am also grateful to other family members for
their support and understanding from the other side of the Pacific Ocean.
I want to thank Hongkuan Zhou, Yixiu Liu and Joey Yu for getting through the COVID-19 pandemic
together; Dawei Yang and Chaowei Xiao for their kind help when I first landed in the US knowing nothing.
My journey towards the Ph.D. has been made colorful and joyful by my friends Tiancheng Jin, Jian Guo,
Yiqi Zhong, Zihao Deng, Wenyi Liu, Sida Gao, Haotian Zhang, Jialin Liu, Lidi Zheng, Ruogu Lin, Yunhao
Ge, Jingmin Wei and so many others.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A Very Brief History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 BIPR: Blending Individual and Pairwise Representations for Compositional
Zero-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.2 CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning 6
1.4.3 FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback . . 6
1.4.4 FIT: Fractional Intermediate Tower in Vision-Language Transformers . . . . . . . 7
1.4.5 MoMo: A shared encoder Model for text, image and multi-Modal representations . 7
1.4.6 Large Language Models are Good Prompt Learners for Low-Shot Image Classification 8
Chapter 2: Compositional Learning before Large Vision-Language Models . . . . . . . . . . . . . . 9
2.1 Compositionality in Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3.1 Compatibility Estimation Pipeline . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3.2 Blended Individual and Pairwise Representations (BIPR) . . . . . . . . . 14
2.1.3.3 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.4.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.4.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.4.4 Choices of the Blending Function . . . . . . . . . . . . . . . . . . . . . . 22
2.1.4.5 Contributions of Blending . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.4.6 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Compositionality in Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.3.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 3: CLIP-Powered Compositional Zero-Shot Learning . . . . . . . . . . . . . . . . . . . . . 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Compatibility Estimation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Concept-Aware Intra-Layer Adapters . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 MoA: Mixture of Adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 Primitive Concept Shift on Image Embeddings . . . . . . . . . . . . . . . . . . . . . 43
3.3.5 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4: Vision Language Transformers for Fashion Retrieval with Feedback . . . . . . . . . . . 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 FashionVLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Linguistic Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.3 Image Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 5: FIT: Fractional Intermediate Tower in Vision-Language Transformers . . . . . . . . . . 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Transformer Encoder Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.2 Vision-Language Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 FIT: Fractional Intermediate Tower . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Large-scale Vision-language Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Training Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.2 Pre-training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5.1 Downstream Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.2 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 6: A Unified Encoder for Vision, Language and Multimodal Tasks . . . . . . . . . . . . . . 89
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 The MoMo Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.1 Training Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.2 Multi-Stage Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.5 Simultaneous learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.6 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4.1 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 7: Large Language Models for Low-Shot Image Classification . . . . . . . . . . . . . . . . 105
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.2 Adaptive Prompt Learning with LLMs . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3.3 The LLaMP Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3.4 Training and Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Chapter 8: Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
List of Tables
2.1 Comparison between BIPR and existing methods. Attr., Obj. and Comp. refer to knowledge
of attributes, objects and compositions, respectively. Our model makes use of all of them. . 11
2.2 Statistics of datasets for CZSL. C-GQA has a larger vocabulary than other datasets. . . . . 17
2.3 Quantitative results on generalized CZSL, all numbers are reported in percentage. S and
U refer to best seen and unseen accuracy on the accuracy curve. Note that models labeled
with † have extra knowledge during the training phase, resulting in unfair comparisons.
We discuss this issue in Sec. 2.1.4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Ablation study on blending operation. The column of M reveals actual blending
functions, while columns of Attr. and Obj. indicate which embeddings are blended with
the compositional one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Ablation study on blending on all benchmarks. We use ✗ and ✓ to indicate whether
a model is with or without blending. BIPR consistently improves the performance of the
model. For the BIPR variation without blending, we directly take hcomp as hout. . . . . . . 23
2.6 Effectiveness of the blending operation. We use ✗ and ✓ to indicate whether a model is
with or without blending. BIPR can not only improve the performance of our model, but
show noticeable performance gains on existing CZSL models except CGE. Note that we
run CGE under the same setting as others. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Results of Supervised Object Detection and Attribute Prediction: PA refers to the detection
model used in [5]. LFE refers to the variation with Late Fusion Entanglement. . . . . . . . 31
2.8 Results of Attribute Transfer: We use metrics object mAP, color recall, material recall, as
defined in Sec. 2.2.3.2. SCE and UCE are loss functions defined in Eqn. 2.7, 2.8 and Eqn.
2.9. And PA refers to the detection model in [5]. . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Quantitative results on generalized CZSL in closed world, all numbers are reported in
percentage. S and U refer to best seen and unseen accuracy on the accuracy curve.
CLIP-ZS refers to the vanilla CLIP model without fine-tuning. All CLIP-based models are
run with ViT-L/14 and we conduct extensive experiments in Tab. 3.4. †We run Co-CGE
with similar CLIP features and report our best number of the model. Models published
before CGE are omitted as their performances are inferior to current baselines. . . . . . . . 44
3.2 Quantitative results on generalized CZSL of VAW-CZSL in closed world, all numbers are
reported in percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Quantitative results on generalized CZSL in open world, all numbers are reported in
percentage. S and U refer to best seen and unseen accuracy on the curve. CLIP-ZS refers to
the vanilla CLIP model without fine-tuning. All CLIP-based models are run with ViT-L/14.
Note that our models tested have identical weights as in Tab. 3.1. †We run Co-CGE with
similar CLIP features and report our best number of the model. Models published before
CGE are omitted as their performances are inferior to current baselines. . . . . . . . . . . . 46
3.4 Comparison of the AUC performance on all three benchmarks among CLIP-based models.
ZS and FT stand for zero-shot and fine-tuned. Best results are shown in bold and
runner-ups are underlined. ∆ is calculated between CAILA and the second-best. Numbers
with * are acquired from the CSP paper [149]. †We obtain these numbers by running
Co-CGE on similar CLIP features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Ablation on adapters and MoA modules. V and L refer to Vision and Language, respectively. 51
3.6 Ablation on vision MoA strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Ablation on vision MoA mixture functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Ablation study on learnable prompts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Comparison of related works. The general VLP model VinVL is included for reference
as our FashionVLP is based on this model. The columns T, W, C, R, and L refer to the
inputs: text, whole image features, cropped clothing features, RoI features, and landmark
features, respectively. AttNet refers to our new attention-based module for generating
image encodings by fusing multiple contextual features. . . . . . . . . . . . . . . . . . . . . 57
4.2 Quantitative results on FashionIQ. Our model surpasses the state-of-the-art by a large
margin on all three sub-categories. We report results with both the VAL evaluation
protocol [113, 26] and the Original evaluation protocol. CT denotes CIRPLANT [133]. . . . 66
4.3 Quantitative results on Fashion200K. Our model achieves the best results on Recall@50
and mean recall. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Quantitative results on Shoes. Our model achieves the best results on Recall@50 and
mean recall. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Ablation study on FashionIQ on different contextual image features. PositionalAttn, RoI,
Lmk, Crop and Whole refer to positional attention, RoI encodings, landmark features, and
embeddings from cropped and whole images, respectively. . . . . . . . . . . . . . . . . . . 69
4.6 Ablation study on FashionIQ on different methods of generating landmark representations
(LmkRep) and combining them. Conv Block2 and Conv Block3 indicate that features for
each landmark are extracted from the 2nd and the 3rd convolutional blocks of the feature
extractor, respectively. Norm coords refers to the use of normalized landmark positions as
feature values. For fusion, we compare the effects of the context (Ctx) and the landmark
(Lmk) attention (Attn) modules with simply concatenating the features. . . . . . . . . . . . 71
5.2 Quantitative evaluation on retrieval tasks, all numbers reported in percentage while the
best scores are in bold and the second best are underlined: We report top-1/5/10 recall
rates. FIT achieves the SoTA in all metrics on both MSCOCO and Flickr30K. . . . . . . . . 83
5.3 Quantitative evaluation on vision question answering, visual reasoning, visual entailment,
and zero-shot image-text retrieval, all numbers reported in percentage, while the best
scores are in bold and the second best are underlined: FIT outperforms baselines on
NLVR and SNLI-VE, while achieving second-best on VQA; FIT also beats other models on
zero-shot Flickr retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Ablation Study on FIT: Our designs of multi-layer coverage and top-down pathway
effectively improve the performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Ablation Study on the number of FIT units. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.6 Ablation study on vision encoders: FIT achieves better performance with various vision
encoders, compared with the best baseline model with the same visual backbone. The
column of fv indicates the vision encoder that we experiment with. . . . . . . . . . . . . . 88
5.7 Ablation study on scalability of pre-training data. . . . . . . . . . . . . . . . . . . . . . . . 88
6.1 Comparison of various design choices across several strong multimodal models vs MoMo.
BEIT tokenizer is an externally learnt VQ-VAE based model trained on 250 million
image-text pairs. The number of image-text pairs is the size of pre-training data. Numbers
of parameters are calculated out of the number of model parameters used for running
inference on downstream tasks (retrieval, classification, VQA) listed in the paper. †: We
report the size of the smallest model available in the original paper. . . . . . . . . . . . . . 92
6.2 Comparing MoMo to previous models on multimodal, language, and image tasks. We
report results on dev sets of the GLUE benchmark [206]. We report accuracy/F1 for MRPC
and QQP; the Pearson/Spearman correlation for STS-B; Averaged recall at Top-1/5 for
zero-shot retrieval on COCO and Flickr30K; test-dev VQA score for VQAv2 and accuracy
for all the other tasks. Results of BERT and other methods are taken from [188]. Note
that SimVLM, CLIP and FLAVA are pretrained on much larger datasets than MoMo (1.8B,
400M & 75M vs 27M pairs). Bold signifies the best result on public data while underlined
indicates the overall best result. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Performance comparison between MoMo, FLAVA and CLIP. The numbers for MRPC and
QQP are the averages of accuracy and F1. For STS-B, it is the Matthews correlation. The
numbers for COCO and Flickr30K are from top-1/5 zero-shot text and image retrieval.
For other tasks, we report accuracy. Bold signifies the best result on public data while
underlined indicates the overall best result. Active parameters are the number of model
parameters used during the forward pass for a task. FT, LE, and ZS stand for Fine-Tuning,
Linear Eval and Zero-Shot, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Performance after different training stages in MoMo. Stages 2 and 3 bring in considerable
performance gains for vision and multimodal tasks respectively. . . . . . . . . . . . . . . . 100
6.5 Performance comparison between combined and separate stages 2 and 3. Both the model
variations are trained for 50k steps in this experiment. We observe that separating stage 2
and 3 results in better overall performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6 Ablation study of multiple modules inside MoMo. *Models under CMGA are trained for
50K steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.1 Comparison with state-of-the-art methods on base-to-novel generalization. LLaMP shows
strong generalization results over previous approaches on 11 image classification tasks. Absolute
gains over PSRC are indicated in blue.
∗KAPT is trained with ViT-B/32 image encoder instead of
ViT-B/16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2 Few shot classification results with 16 shots. Numbers in the bracket indicate the average
performance over 1/2/4/8/16 shots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3 Ablation study on the LLM Knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4 Ablation study on the Training Strategy. “%” indicates the ratio of parameters trained
compared to fully tuning a layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Ablation Study on Pre-generated Text Priors. ✗ refers to “without textual priors” and NP
stands for noun phrases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 The CLIP text encoder helps adaptation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7 Study on Vision Tuning Scheme. Our hybrid design achieves the best performance. . . . . 121
List of Figures
2.1 Examples of desired compositionality: Learn individual primitives (large, old, castle,
bear) from known concepts (large bear, old castle) and generalize the knowledge towards
perceptions of novel compositions (large castle, old bear). . . . . . . . . . . . . . . . . . . . 10
2.2 An overview of BIPR. Words describing attributes and objects are processed by W and
Pword, producing individual word embeddings, which are concatenated and fed to Pcomp
for the composition embedding. All three embeddings are fused by the proposed blending
operation M, resulting in hout, which is multiplied with the affinity score α(a, o) to create
G(a, o). On the other side, input image x is processed by the image feature extractor F.
The compatibility score is then computed as the dot product of F(x) and G(a, o). Note
modules that have the same color (W,Pword) are sharing the same weights. . . . . . . . . 13
2.3 Activation maps of images tested on BIPR. We generate the activation map of the
ground-truth for each image through GradCAM [183]. We observe that the model with
blending can better capture the critical region for making the correct prediction. . . . . . . 25
2.4 Attribute Transfer: During training phase, some categories e.g. (a) come with attribute
labels, while some e.g. (b) only have object class labelled. Models need to recognize
attributes on such unlabelled categories, e.g. (c) during testing. . . . . . . . . . . . . . . . . 27
2.5 Our two-stream architecture: Note that solid arrows are active in both inference and back
propagation while dotted arrows are only active in inference. . . . . . . . . . . . . . . . . . 29
2.6 Visualized detection results for VG-20 supervised (row 1), VG-20 attribute transfer (row
2). Object predictions colored in blue belong to reference set while those colored in red
belong to target set in the attribute transfer setting. . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Illustrations of CAILA and previous CLIP-based baselines. CAILA has adapters integrated
into both CLIP encoders and thus better transfers the knowledge from CLIP to CZSL,
resulting in significant performance boosts compared with other CLIP-based baseline
methods. “Van.-CLIP” refers to models with vanilla CLIP architecture. Prompts
highlighted in green are set to be learnable parameters. . . . . . . . . . . . . . . . . . . . . 36
3.2 An overview of CAILA: (a) The main composition compatibility estimation pipeline; (b)
Auxiliary sub-tasks on primitive compatibility during training; (c) The structure of CAILA
layers. Our model extracts concept-specific features by learning different adapters and
fuses them through the Mixture-of-Adapters (MoA) mechanism. Note that for each layer
of encoders in (a) and (b), the weights of encoding blocks of the same concept are shared.
NV , NL and M indicate numbers of layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Details of our vision Mixture-of-Adapter. Latent features of each adapter, zA, zO, zC, are
mixed, and further processed by the upsampling function to generate h′C. h′C is joined with h′A, h′O and the input feature h for output. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Illustrations of concept shift. We perform concept shift by combining the attribute
(melted) feature from one image with the object (candy) feature to create an additional
composition (melted candy) feature. Newly generated features are shuffled with regular
samples during training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Ablation studies: (a) The number of vision MoA layer M; (b) The ratio of concept shift; (c)
The reduction factor of fDown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Fashion image retrieval with textual feedback. The input query to the system includes a
reference image and a comment specifying changes to be made to the image. The system
retrieves fashion items with the desired changes accordingly. . . . . . . . . . . . . . . . . . 54
4.2 FashionVLP Overview. The model processes reference image-feedback pairs and target
image candidates using parallel blocks – Reference and Target. Both blocks extract
image features at multiple contextual levels, namely, whole image, cropped clothing,
fashion landmarks, and regions of interest (RoIs), to focus on different fashion-related
aspects of images. The Reference block fuses these image features with feedback inputs
to generate joint reference embeddings fref through a transformer module that contains
self-attention. The target block fuses image representations through multiple attention
modules to generate target embeddings ftar. The reference and target embeddings are
then compared during training and inference for ranking candidate images for a query
reference image and feedback pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Fashion landmarks visualization for different clothing types. Landmarks reflect essential
points such as neckline, armpits, etc., that provide useful visual cues for fashion retrieval. . 61
4.4 Qualitative results on FashionIQ. We show reference images on the left and top-10
retrievals with descending scores on the right. Ground-truths are shown with boxes.
Feedback in FashionIQ is complex yet realistic and can contain multiple concepts
simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Qualitative results on Fashion200K. We show reference images on the left and top-10
retrievals with descending scores on the right. Ground-truths are shown with boxes. Note
that a query pair can correspond to multiple valid target images in this dataset. Due to the
lack of human annotated feedback, comments in Fashion200K follow the template: replace
[sth] with [sth], and are thus less instructive. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6 Qualitative results on Shoes. We show reference images on the left and top-10 retrievals
with descending scores on the right. Ground-truths are shown with boxes. Feedbacks in
Shoes are fine-grained and contain concepts belonging to the fashion domain of shoes. . . 67
4.7 Visualization of attention on relevant words in textual feedback and different contextual
image features for two sample pairs. Words with the highest attention weights are
shown in bold. For each level of context in the reference block, we visualize the attention
heatmap of the corresponding most attended word, and observe effective correspondence
between bold words and relevant image regions. On the target side, we visualize attention
heatmaps corresponding to our positional and landmark attention modules, showing that
these modules effectively capture important fashion information. Results of attention
in the reference (left) and the target (right) blocks further show that the whole image
modality is insufficient – for example, the upper sample’s whole image representation
for the target image lacks any useful information. Further, fashion landmarks provide
important information for fashion-specific concepts, e.g., strap, hem, etc. . . . . . . . . . . 70
5.1 Comparisons of dual-tower-based vision-language model paradigms. (a) CLIP-like (e.g.
CLIP [168] and ALIGN [85]), whose predictions are directly made from unimodal features
without fusion encoder; (b) METER-like (e.g. METER [49]) whose joint fusion encoder
is built on top of both transformers; (c) ALBEF-like (e.g. ALBEF [118], CODIS [50], and
TCL [230]), whose top text transformer layers are converted into fusion layers with inputs
from the top vision layer; (d) the proposed FIT, which uses a Fractional Intermediate
Tower (FIT) to expand text encoder’s access to multi-layer vision features and enrich
per-layer features through the top-down pathway; and (e) performance comparisons of
SoTA dual-tower solutions on various visual language benchmarks, and ours noticeably
outperforms them after using the FIT. IR and TR refer to image retrieval and text retrieval,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Overview of the FIT model: We follow the prevalent dual-tower structure in our design, with the
proposed FIT module bridging the image encoder and the text encoder. The FIT layer
processes the feature from the vision layer and combines it with the feature from the
previous FIT layer, producing vision features for the multimodal encoder and the next FIT
layer. Post-FIT features are combined with text features through Cross-Attention layers
inside the fusion encoder. We further leverage momentum distillation to tackle the noise
from web data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Variations of the FIT design. Each rectangle represents a layer from the vision encoder or
the multimodal encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 FIT consistently captures larger active regions and stronger peak signal responses than
w/o Multi-layer and w/o Top-down, respectively, where attention maps are generated by
applying Transformer-MM-Explainability [23] on ground-truth caption for each image. . 87
6.1 Comparison with FLAVA [188] and CLIP [168]: MoMo achieves the best macro average
across vision, language and multimodal benchmarks, which is the mean of the average of
all 3 task types, with ≈ 2/5th the parameters and 1/3rd the training image-caption pairs. . 90
6.2 Illustrations of how models process input data across various modalities. Language
tasks: (a) Encoder-only architecture; (b) Encoder-decoder architecture. Vision tasks:
(c) Convolutional Neural Networks; (d) Vision Transformers. Multimodal tasks: (e)
Similarity-based metric; (f) Multimodal transformer with ConvNets; (g) Multimodal
transformer with duo-transformers; (h) MoMo: Multiple tasks across multiple modalities
through an all-in-one unified transformer. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 The architecture and training stages of MoMo. The model is first trained with unimodal
image datasets. Then simultaneously with unimodal image and unimodal text datasets.
And finally, simultaneously, with unimodal text and multimodal image-text datasets. Each
stage is executed sequentially and is initialized from the previous stage model’s weights.
For downstream task fine-tuning, the decoder is discarded and only the encoder weights
are used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Attention maps on predicting masked words through MoMo at stage 3. Heatmaps are
obtained through Transformer-MM-Explainability [23]. MoMo can capture meaningful
regions for MLM through cross-modality attention. . . . . . . . . . . . . . . . . . . . . . . 103
7.1 Demonstration of LLaMP: (a) LLMs can provide visual descriptions for fine-grained object
categories; (b) Zero-shot base-to-novel generalization benefits from the LLM knowledge. . 106
7.2 An Overview of the LLaMP Framework: We first generate the knowledge cache by
passing the query prompt through the LLM D and use the knowledge cache to encode p_l, resulting in the adaptive prompts h̃^i_l = W h^i_l + b^i for the CLIP text encoder. h̃_l is combined
with regular learnable prompts of G to generate the final text feature vector gp. The
image feature vector fp is obtained through a hybrid-tuning strategy combining prompt
learning and low-rank adaptation (LoRA). . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3 Effect of LLM Prompts on Harmonic Mean. 16 prompts achieve the most balanced
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4 Visualization of LLaMP Predictions by GradCAM [183] . . . . . . . . . . . . . . . . . . . . 122
Abstract
As key mediators of human perception, vision and language play critical roles in the development of modern Artificial Intelligence (AI). The size of vision-language corpora has scaled up rapidly in recent years, from thousands to billions of samples, enabling the creation of large foundation models. However, as foundation models are still an emerging concept, a series of problems remain to be explored. This thesis focuses on three specific
problems of leveraging vision-language corpora for visual understanding: i) The capacity of a model to
process concepts of which few or no samples are provided during training; ii) The capacity to perceive
novel concepts by composing the knowledge from existing concepts; iii) Better architectures and training
strategies for building foundational vision-language models.
This thesis starts with a study of compositional learning from pre-VLM times to the post-VLM era.
We introduce a representation blending approach that creates robust features for compositional image
classification and a two-stream architecture that tackles the entanglement in the feature space of the object-attribute detection problem with novel object-attribute pairs. We further design an adaptation approach
to leverage CLIP encoders for compositional image classification.
The second part covers a variety of methods built with multimodal transformer models. For image
retrieval, we propose a framework that assembles multimodal inputs into sequences with which a multimodal transformer encoder can be fine-tuned. The pre-training of vision-language models (VLMs) is also
explored. Specifically, we introduce a fractional intermediate tower that improves the feature expressiveness of dual-tower vision-language models. We further design a unified pipeline that allows a VLM to learn not only from vision-language corpora but also from unimodal visual and linguistic data.
Lastly, we study how to leverage the knowledge of Large Language Models (LLMs) for low-shot image
classification, in a data- and computation-efficient way.
Overall, this thesis covers a spectrum of problems that are critical for leveraging vision-language data
in computer vision tasks. The effectiveness of the introduced approaches has been verified on a wide range of
publicly available benchmarks.
Chapter 1
Introduction
Vision and language are key mediators through which humans perceive the external world and interact
with other members of society. One goal of artificial intelligence (AI) research is to create machines that
are able to perceive visual information and linguistic content. A spectrum of visual understanding tasks involve language: image captioning requires the transition from visual perception to natural language generation; image generation asks for the capacity to convert linguistic inputs into 2D images; Visual Question Answering (VQA) [104] seeks the joint understanding of vision and language inputs. Thus, how to handle both vision and language in one model becomes an important problem, one with profound impacts on our day-to-day applications, including autonomous driving, AI assistants, and AI-generated content.
Large-scale pre-training on big data has become prevalent in the Natural Language Processing (NLP) community. In the meantime, the scale of vision-language corpora has grown from the thousand level [128] to the billion level [182], and so have the sizes of vision-language models, also known as foundation models. With such large amounts of data, a problem arises: how do we incorporate them into better solutions for some, if not all, visual understanding tasks?
1.1 Problem Statement
Among the spectrum of visual understanding tasks, there are three critical problems that mark the intelligence and efficiency of a visual system, including
• Zero-shot inference capacity: A feasible model should be able to process concepts for which few or
even no examples are available during the training phase.
• Compositionality: Having observed certain compositions, an intelligent system should be able to perceive novel concepts by decomposing and recomposing existing ones.
• Model design: Given a certain amount of data, it is critical to study how to design and optimize
models for superior visual understanding performance.
The problems described above are broadly and actively studied in the computer vision and artificial
intelligence research community. In particular, I am interested in the following sub-problems that fall within the scope of the aforementioned problems:
• Compositional zero-shot learning (image classification): Recognize novel attribute-object compositions in images, where these compositions are built from the same vocabulary of individual primitives seen during training.
• Fashion image retrieval with textual feedback: In an online shopping scenario, given a reference
image and feedback from the customer, pick the most relevant alternative in the database. Both the
reference image and the textual feedback can be novel to the system.
• Architectural design and pipeline refinement: Given a certain amount of data, improve the model capacity
by optimizing the model architecture and the training process.
• Low-shot image classification: Recognize objects of which few or no examples have been seen, with textual knowledge extracted from their category names.
The advancement of large-scale vision-language corpora and of the foundation models trained on them has provided a new perspective and a powerful toolbox to solve these problems in a way that goes beyond traditional approaches. This thesis studies how to incorporate vision-language corpora and foundation models into the specific visual understanding tasks mentioned above.
1.2 A Very Brief History
The idea of leveraging large web corpora has been sparked by researchers in Natural Language Processing
(NLP). In NLP, thanks to the internet, collections of text corpora are widely available, e.g. Wikipedia [54]
and Book Corpus [255]. Powered by the transformer architecture [204], a number of pre-trained language models, like BERT [43], BART [115], GPT [18] and T5 [170], have emerged.
The trend of scale growth has appeared in the vision-language domain as well. The sizes of vision-language corpora, measured in the number of image-text pairs, have expanded by multiples of tenfold in
recent years. Human-annotated vision-language datasets, e.g. MSCOCO [128] and Visual Genome [103],
are small but accurate. When creating larger datasets [155, 185], automatic annotation technology is introduced, which brings in noise at the same time. Expanding further in size, recent datasets [22, 198, 118, 182] are created by parsing enormous amounts of web data from social media, resulting in billion-level vision-language corpora, such as ALIGN (1.8B) [118] and LAION (5B) [182].
Researchers in computer vision and NLP have devoted enormous effort to exploring models that can leverage the knowledge in vision-language corpora. Notably, Chen et al. [28] create a joint vision-language sequence for transformers by concatenating object detection features with text embeddings. Radford et al. [168] introduce CLIP to learn directly from image-text pairs collected from the Internet. Recently, as Large Language Models (LLMs) reveal their capacities in the language domain, taking LLMs
as the inference engine for multimodal data becomes a hot topic. Liu et al. [130] design LLaVA, a framework that encodes images into latent features through a pre-trained CLIP [168] model and maps them into the joint latent space of an LLM.
1.3 Challenges
To leverage large vision-language corpora for visual understanding tasks, especially in the era of LLMs, the following challenges have to be tackled:
Cross-Modality Learning. Although the scale of vision-language corpora has reached the billion level, it is still relatively small compared with data from a single modality, e.g. images or pure text. Thus, how to make a foundation model learn not only from multimodal data, but also from unimodal data,
becomes critical.
Multimodal Alignment. Vision data and linguistic data are structured in disparate ways: images are composed of pixels while text is composed of words; a single pixel is meaningless while a word carries concentrated information. However, modern transformer architectures require both modalities to be encoded into a joint latent space. Thus, how to properly encode vision-language data and align semantics across modalities is another important problem.
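To make this concrete, a vision-language transformer typically patchifies the image and embeds text tokens so that both modalities become sequences of vectors in one shared latent dimension. The toy sketch below is generic and illustrative only; it is not tied to any specific model in this thesis, and all names, vocabulary size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class JointTokenizer(nn.Module):
    """Projects image patches and text token ids into one shared latent space."""
    def __init__(self, vocab_size: int = 30522, patch_size: int = 16, dim: int = 512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # A conv with stride == kernel size splits the image into non-overlapping patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, token_ids: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        text_tokens = self.text_embed(token_ids)                           # (B, T, dim)
        patch_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        return torch.cat([text_tokens, patch_tokens], dim=1)               # joint sequence
```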
Efficient Adaptation. Vision-language foundation models are often trained on a large amount of data within a GPU cluster, yet they are trained on general objectives that cannot be directly applied to downstream tasks. Training a model from scratch for each downstream task is also infeasible. Therefore,
it is important to study how to adapt a foundation model to downstream tasks in a data-efficient and/or
computation-efficient way.
1.4 General Approach
In the last five years, the language processing components of vision-language models have significantly
evolved, from employing word embeddings like GloVe [161], to utilizing pre-trained language models such
as BERT [43], and advancing to even larger models including GPT-4, LLaMA, and LLaMA 2 [154, 201, 202],
now commonly referred to as Large Language Models.
Along with the evolution of language models, I have developed a set of methods for the problems stated in Sec. 1.1, incorporating language components at different levels. Initially, linguistic class names are converted into discrete label numbers in the two-stream network for feature disentanglement [129]. We then leverage GloVe embeddings in our feature blending approach [245] for Compositional Zero-Shot Learning, and further design CAILA [248] to adapt CLIP [168] models to the same task. Furthermore, we build FashionVLP [59] for fashion image retrieval, on top of a BERT-style multimodal transformer model. FIT [244] and MoMo [21] are proposed during our exploration of better architectures and training pipelines for learning transformer models from multimodal data. Entering the era of LLMs, we study the
problem of how to leverage knowledge from LLMs in low-shot image classification and introduce LLaMP
[247].
These methods cover various aspects of leveraging vision-language corpora for visual understanding,
including model adaptation, compositional learning and foundation model training. Some of them are
introduced in the following sections.
1.4.1 BIPR: Blending Individual and Pairwise Representations for Compositional Zero-Shot Learning
Compositionality, the ability to combine existing concepts and generalize towards novel compositions, is a
key functionality for intelligent entities. Here, we study the problem of Compositional Zero-Shot Learning
(CZSL), which aims at recognizing novel attribute-object compositions. Previous approaches either learn
a joint embedding for each composition or leverage individual elements such as attributes and objects
separately; however, the information from both compositions and individual concepts is critical to CZSL.
Thus, we propose to blend features from pairs and their corresponding individual components. We name
the compound feature Blended Individual and Pairwise Representation (BIPR). Quantitative evaluations
performed on three popular CZSL datasets, MIT-States, C-GQA, and UT-Zappos, verify that BIPR surpasses the current state-of-the-art approaches. Extensive experiments further demonstrate that the idea
of blending can be generalized to existing CZSL models and shows noticeable improvements.
1.4.2 CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning
Recent research focuses on applying large-scale Vision-Language Pre-trained (VLP) models like CLIP, with their strong generalization ability, to CZSL. However, these methods treat the pre-trained model as a black box and focus on pre- and post-CLIP operations, which do not inherently mine the semantic concepts between the layers inside CLIP. We propose to dive deep into the architecture and insert adapters, a parameter-efficient technique proven to be effective among large language models, into each CLIP encoder layer. We further equip adapters with concept awareness so that concept-specific features of “object”, “attribute”, and “composition” can be extracted. We assess our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, and it achieves state-of-the-art performance compared to existing methods on all of them.
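To illustrate the general idea of intra-layer adaptation, the sketch below wraps a frozen encoder layer with one bottleneck adapter per concept. It is a simplified illustration rather than the actual CAILA implementation; the module names, reduction factor, and per-concept routing are assumptions, and the Mixture-of-Adapters fusion described in Chapter 3 is omitted.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual connection."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

class ConceptAwareLayer(nn.Module):
    """Wraps a frozen CLIP-style encoder layer with one adapter per concept;
    only the adapters receive gradients."""
    def __init__(self, backbone_layer: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.adapters = nn.ModuleDict(
            {c: Adapter(dim) for c in ("attribute", "object", "composition")}
        )

    def forward(self, h: torch.Tensor, concept: str) -> torch.Tensor:
        return self.adapters[concept](self.backbone(h))
```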
1.4.3 FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities
simultaneously. We propose a new vision-language transformer based model, FashionVLP, that brings
the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and
combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel
attention-based approach for fusing target image features without involving text or transformer layers in
the process. Extensive results show that FashionVLP achieves the state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains
complex natural language feedback.
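As a rough illustration of the asymmetric design, the reference side below fuses feedback tokens and multi-level image tokens in one transformer sequence, while target candidates are ranked by cosine similarity against the pooled reference embedding. This is a hedged sketch, not the FashionVLP code; module names, dimensions, and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceEncoder(nn.Module):
    """Fuses feedback-token embeddings and multi-level image features in one
    transformer sequence, then pools a joint reference embedding."""
    def __init__(self, dim: int = 768, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        seq = torch.cat([text_tokens, image_tokens], dim=1)   # (B, T + V, dim)
        return self.encoder(seq)[:, 0]                        # first token as pooled embedding

def rank_candidates(f_ref: torch.Tensor, f_targets: torch.Tensor) -> torch.Tensor:
    """Ranks target embeddings (N, dim) for each reference embedding (B, dim)
    by cosine similarity, highest first."""
    sims = F.normalize(f_ref, dim=-1) @ F.normalize(f_targets, dim=-1).T
    return sims.argsort(dim=-1, descending=True)
```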
1.4.4 FIT: Fractional Intermediate Tower in Vision-Language Transformers
Learning from large-scale image-text pairs, Vision-Language Pre-training (VLP) transformers have been
proven to be effective on various downstream vision-language tasks. Recent mainstream approaches adopt
the dual-tower structure where image and text are first fed into respective transformers and then fused
through communication mechanisms between the two. In this work, we investigate the communication
mechanism in the dual-tower design and propose a novel architecture named Fractional Intermediate
Tower (FIT) to strengthen such communication by taking features from the vision tower and processing
them for various fusion layers in a top-down manner. Our extensive quantitative evaluations on vision-language benchmarks verify that, under similar training circumstances, our model surpasses baselines
on multiple downstream tasks by large margins. Notably, on the COCO image retrieval task, our model
achieves 61.6% top-1 recall, which is 2.6% higher than the SOTA.
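The top-down idea can be sketched as follows: each FIT unit combines one intermediate vision-layer feature with the output of the unit above it and hands a fused feature to the corresponding fusion layer of the text tower. This is a simplified illustration under assumed shapes and unit counts, not the published FIT architecture.

```python
import torch
import torch.nn as nn

class FITUnit(nn.Module):
    """Combines one vision-layer feature with the feature passed down from the unit above."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_feat: torch.Tensor, top_feat: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(vision_feat) + top_feat)

class FractionalIntermediateTower(nn.Module):
    """Top-down pathway over the last `num_units` vision-tower layers."""
    def __init__(self, dim: int, num_units: int = 6):
        super().__init__()
        self.units = nn.ModuleList([FITUnit(dim) for _ in range(num_units)])

    def forward(self, vision_feats: list) -> list:
        # vision_feats: per-layer features ordered bottom -> top.
        fused, top = [], torch.zeros_like(vision_feats[-1])
        for unit, feat in zip(self.units, reversed(vision_feats[-len(self.units):])):
            top = unit(feat, top)      # enrich the current layer with top-down context
            fused.append(top)          # one fused feature per fusion layer
        return fused[::-1]             # re-order bottom -> top for the fusion encoder
```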
1.4.5 MoMo: A shared encoder Model for text, image and multi-Modal representations
We propose a self-supervised shared encoder model that achieves strong results on several visual, language
and multimodal benchmarks while being data, memory and run-time efficient. We make three key contributions. First, in contrast to most existing works, we use a single transformer with all the encoder layers
processing both the text and the image modalities. Second, we propose a stage-wise training strategy
where the model is first trained on images, then jointly with unimodal text and image datasets and finally
jointly with text and text-image datasets. Third, to preserve information across both the modalities, we
propose a training pipeline that learns simultaneously from gradient updates of different modalities at each
training update step. The results on downstream text-only, image-only and multimodal tasks show that
our model is competitive with several strong models while using fewer parameters and less pre-training
data. For example, MoMo performs competitively with FLAVA on multimodal (+3.1), image-only (+1.1) and
text-only (-0.1) tasks despite having 2/5th the number of parameters and using one third the image-text
training pairs. Finally, we ablate various design choices and further show that increasing model size produces significant performance gains indicating potential for substantial improvements with larger models
using our approach.
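The simultaneous-learning idea, one optimizer step per iteration that accumulates gradients from every active modality, can be sketched as below. The loss heads and batch objects are placeholders, not the MoMo training code.

```python
def training_step(model, optimizer, text_batch, image_batch, pair_batch):
    """One update that accumulates gradients from unimodal text, unimodal image,
    and image-text batches before a single optimizer step, so no modality
    overwrites the progress of another. The loss heads are hypothetical."""
    optimizer.zero_grad()
    losses = {
        "text": model.masked_language_loss(text_batch),      # placeholder loss heads
        "image": model.masked_image_loss(image_batch),
        "pair": model.image_text_matching_loss(pair_batch),
    }
    total = sum(losses.values())
    total.backward()                   # gradients from all modalities are summed in .grad
    optimizer.step()
    return {name: loss.item() for name, loss in losses.items()}
```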
1.4.6 Large Language Models are Good Prompt Learners for Low-Shot Image Classification
Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt
learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge,
emerge as the complement. Thus, in this work, we discuss the integration of LLMs to enhance pre-trained
VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt
learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11
datasets.
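Conceptually, LLaMP turns LLM hidden states obtained from a class-description query into prompt vectors for the CLIP text encoder. The sketch below only illustrates that projection step under assumed shapes; the linear map, the pooling of the last positions, and the default of 16 prompts are illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class LLMPromptProjector(nn.Module):
    """Maps hidden states of a frozen LLM (run on a class-description query)
    to k prompt tokens in the CLIP text-embedding space."""
    def __init__(self, llm_dim: int, clip_dim: int, num_prompts: int = 16):
        super().__init__()
        self.num_prompts = num_prompts
        self.proj = nn.Linear(llm_dim, clip_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, seq_len, llm_dim); keep the last k positions as the knowledge cache.
        prompts = self.proj(llm_hidden[:, -self.num_prompts:, :])
        return prompts  # (B, num_prompts, clip_dim), prepended to the CLIP text input
```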
Chapter 2
Compositional Learning before Large Vision-Language Models
2.1 Compositionality in Image Classification
2.1.1 Introduction
The ability to combine existing concepts and generalize to novel compositions, also known as compositionality, is a key functionality for an intelligent entity. It should be able to perceive individual primitives
from known concepts and combine them into novel compositions. Some examples of compositionality are
shown in Fig. 2.1. It is natural for a human to decompose individual primitives (large, old, castle, bear) from
known concepts (large bear, old castle) and gain the knowledge of novel concepts (large castle, old bear) by
composing individual primitives. Researchers have formulated the problem as Compositional Zero-Shot
Learning (CZSL) [145, 147, 125, 215, 165, 143, 146, 177, 123, 178], where models are expected to generalize
the knowledge learned from a set of seen attribute-object pairs and recognize novel compositions.
In CZSL, all individual primitives are seen during training but only a fraction of potential compositional pairs; the challenge is to learn a compositional model from these examples without overfitting to them. One
approach [125, 8] is to obtain the distribution of attribute-object compositions by multiplying marginal
distributions of attributes and objects that are learned separately; though such marginal distributions are
easier to learn due to their presence in the training data, simply multiplying the two ignores correlations
between attributes and objects. Many recent approaches [165, 215, 231, 148, 147, 145, 211, 146] learn joint representations for compositions and show substantial improvement in accuracy over previous methods; however, such methods suffer from the lack of data as samples of novel test compositions are not available during training.

Figure 2.1: Examples of desired compositionality: Learn individual primitives (large, old, castle, bear) from known concepts (large bear, old castle) and generalize the knowledge towards perceptions of novel compositions (large castle, old bear).
We propose to leverage the knowledge from both compositions and individual primitives by blending
representations of pairs and their individual components. We name the compound embedding Blended
Individual and Pairwise Representation (BIPR). We first compute individual features from the input text, including attributes and objects, and further combine the individual representations into a compositional embedding. Before the final classification, both individual and compositional representations are fused through
a blending function to create a comprehensive representation for each attribute-object composition.
Distinctions between typical previous approaches and our model, BIPR, are shown in Tab. 2.1. We
argue that, though previous models compute joint feature vectors out of individual attribute and object
embeddings, these individual embeddings are strongly influenced by the joint learning scheme and, thus,
Method Attr. Obj. Comp.
SymNet, Causal, SCEN ✓ ✓ ✗
AoP, LE+, TAFE-Net, TMN, OADis, CGE, CompCos, ProtoProp ✗ ✗ ✓
BIPR (Ours) ✓ ✓ ✓
Table 2.1: Comparison between BIPR and existing methods. Attr., Obj. and Comp. refer to knowledge of
attributes, objects and compositions, respectively. Our model makes use of all of them.
the joint feature vector largely reflects knowledge of the composition and not of the individual primitives.
On the contrary, our approach leverages all three types of information via the proposed blending operation.
We evaluate our approach on three popular CZSL datasets: MIT-States [83], C-GQA [146] and UT-Zappos [234, 235]. Our experiments demonstrate that our model outperforms the state of the art on all three benchmarks under the generalized evaluation protocol [165]. Additionally, we show that the blending operation can be generalized to existing approaches and improves their
performance without adding extra complications.
2.1.2 Related Work
Visual Attributes. The concept of visual attributes is widely leveraged in computer vision to support
various perception needs: Farhadi et al. [52] use attributes to achieve fine-grained recognition of visual
objects, while Lampert et al. [109, 108] utilize attributes to bridge the gap between visual concepts and
zero-shot learning. Visual attributes also play an important role in image captioning [232, 5], visual question
answering [121, 227], person re-identification [129, 25], etc. However, the common practice [5, 109, 157]
for attribute recognition, which treats attributes as an independent factor, fails to capture relations between different attribute-object pairs, although some researchers [140, 82] have started to investigate the correlations
between attributes and objects.
Zero-Shot Learning (ZSL). Unlike conventional fully-supervised learning, ZSL requires models to
learn from side information without observing any visual training samples [108]. The side information
comes from multiple non-visual resources such as attributes [108], word embeddings [210, 189], and text
descriptions [174]. Notably, Zhang et al. [239] propose to learn a deep embedding model bridging the
seen and the unseen, while [24, 222, 254] investigate generative models that produce features for novel
categories. Moreover, [210, 89] integrate Graph Convolution Networks (GCN) [101] to better generalize
over unseen categories.
Compositional Zero-Shot Learning (CZSL). Previous CZSL approaches are built with pre-trained
image encoders, e.g., ResNet, and separate word embeddings, e.g., GloVe [161]. More specifically, Li et
al. [125] investigate the symmetrical property between objects and attributes, while Atzmon et al. [8]
study the causal influence between the two. Moreover, Li et al. [123] construct a Siamese network with
contrastive learning to learn better object/attribute prototypes. On the other hand, joint representations of
compositions can be leveraged in multiple ways. [165] utilizes joint embeddings to control gating functions
for the modular network, while [215, 231, 148, 147] treat them as categorical centers in the joint latent
space. Furthermore, some approaches [145, 211, 146, 177, 142] directly take compositional embeddings as
classifier weights, while OADis [178] disentangles attributes and objects in the visual space.
Our method exploits the information from all three types of concepts through the blending operation and thus achieves better performance.
2.1.3 Approach
The problem of CZSL can be formulated as follows. We denote the training set by T = {(x, y)|x ∈ X , y ∈
Ys}, where X contains images represented in the RGB color space and Ys is a set of seen composition
labels which are available during the training phase. Each label y = (a, o) is a pair of attribute a ∈ A and
object category o ∈ O. When testing, CZSL expects models to predict a set of unseen compositions Yu
Figure 2.2: An overview of BIPR. Words describing attributes and objects are processed by W and Pword, producing individual word embeddings, which are concatenated and fed to Pcomp for the composition embedding. All three embeddings are fused by the proposed blending operation M, resulting in hout, which is multiplied with the affinity score α(a, o) to create G(a, o). On the other side, the input image x is processed by the image feature extractor F. The compatibility score is then computed as the dot product of F(x) and G(a, o). Note that modules with the same color (W, Pword) share the same weights.
that is mutually exclusive with the training labels Ys: Ys ∩ Yu = ∅. Note that Ys and Yu share the same sets
A and O; CZSL assumes that each a ∈ A and each o ∈ O appears in the training set and only the composition
(a, o) ∈ Yu is novel. Following [165, 221, 146], we focus on generalized CZSL, where the test set contains
both seen and unseen labels, formally denoted by Ytest = Ys ∪ Yu.
Most recent works [146, 165, 8] study the generalized CZSL problem under the closed world setting,
where Ytest is a subset of the complete composition set Y : A × O. The closed world setting assumes that
Yu are known during testing. In this paper, we follow the closed world setting.
We provide an overview of our model in Fig. 2.2: Images and words are projected separately into a
shared latent space where the compatibility scores between images and compositions are computed. We
elaborate our model design in the rest of this section: We first introduce the general compatibility estimation pipeline in Sec. 2.1.3.1 and then describe how the blending operation is performed in Sec. 2.1.3.2,
followed by details of training and testing procedures in Sec. 2.1.3.3.
2.1.3.1 Compatibility Estimation Pipeline
As different attributes can lead to significant appearance shifts even inside the same object category, performing attribute and object predictions separately may be ineffective. Hence, we model attribute-object
compositions jointly and learn a combined estimation function to measure the compatibility of input image
x and query composition (a, o).
The entire pipeline can be represented as C(x, a, o) : X × A × O → R. The estimation function C contains two components: the image feature extractor F : R^{H×W×3} → R^d and the composition embedding generator G : A × O → R^d, where d denotes the number of channels of each representation. Given an image x and a composition (a, o), the compatibility score is defined as the dot product of F(x) and G(a, o), formally

C(x, a, o) = F(x) · G(a, o).   (2.1)
To recognize the composition inside an image x, we compute compatibility scores of all candidate pairs
through the estimation function C and assign the image with the composition that has the highest compatibility score. Ideally, the model will yield the highest score when the correct composition (a, o) is given.
The output of G can also be viewed as the weights of the final linear layer inside a traditional classifier.
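To make the pipeline concrete, here is a minimal PyTorch sketch of the compatibility estimation in Eqn. 2.1; the callables image_encoder and comp_embedder, as well as the tensor shapes, are illustrative assumptions for this sketch rather than the exact implementation.

import torch

def compatibility_scores(image_encoder, comp_embedder, images, candidate_pairs):
    """Compute C(x, a, o) for every candidate (a, o), as in Eqn. 2.1.

    images:          tensor of shape (B, 3, H, W)
    candidate_pairs: list of (attribute_id, object_id) tuples
    Returns a (B, |candidates|) score matrix; the argmax over dim 1 gives the
    predicted composition for each image.
    """
    img_feat = image_encoder(images)                           # (B, d)
    comp_feat = torch.stack(
        [comp_embedder(a, o) for a, o in candidate_pairs]      # each (d,)
    )                                                          # (|candidates|, d)
    return img_feat @ comp_feat.t()                            # dot products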
2.1.3.2 Blended Individual and Pairwise Representations (BIPR)
Learning a joint compatibility estimator is a common practice among recent CZSL methods [145, 146, 143,
165]. However, recent works focus on learning a joint embedding of a composition (a, o) and fail to capture the information from individual attributes and objects. We propose Blended Individual and Pairwise
Representations (BIPR), which fuses information from both the composition and its individual elements, the attribute and the object, to produce more accurate predictions. More specifically, our BIPR acts as the
composition embedding generator G as discussed in Sec. 2.1.3.1.
An overview of BIPR is shown in Fig. 2.2. The module consists of four components: the word embedder W, the word projector Pword, the composition projector Pcomp, and the blending function M. The word embedder W : A ∪ O → R^emb maps individual words, either attributes or objects, into a latent embedding space: va = W(a), vo = W(o). The word vectors va and vo are further transformed into a latent space by the word projector Pword : R^emb → R^d,

ha = Pword(va), ho = Pword(vo).   (2.2)

We then concatenate these two latent representations ha and ho and pass them through the composition projector Pcomp : R^{2d} → R^d:

hcomp = Pcomp([ha; ho]).   (2.3)
One straightforward way to utilize hcomp is to simply take it as the output of G and compute the dot product of hcomp and F(x) as the compatibility score. However, this only makes use of the knowledge from the joint composition and fails to leverage priors from the individual elements. In BIPR, we further blend representations from individual primitives and compositions to create a comprehensive classifier weight:

G(a, o) = hout = M(ha, ho, hcomp).   (2.4)
The actual blending function M can be selected from various options, e.g., the mean function, the max function or even a self-attention module as in [204]. We discuss the effectiveness and different variations of M in the Experiments Section.
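As an illustration of the structure above, the following is a minimal PyTorch sketch of the BIPR embedding generator G with mean blending; the embedding dimension, the hidden width and the use of a single shared lookup table for W are assumptions of this sketch, not the exact model.

import torch
import torch.nn as nn

class BIPR(nn.Module):
    def __init__(self, num_attrs, num_objs, emb_dim=600, d=1024):
        super().__init__()
        # W: word embedder; attribute and object indices are assumed to live in
        # one shared table, with object indices offset by num_attrs.
        self.word_embedder = nn.Embedding(num_attrs + num_objs, emb_dim)
        # P_word: projects a word embedding into the d-dim latent space.
        self.p_word = nn.Sequential(nn.Linear(emb_dim, 4096), nn.ReLU(),
                                    nn.Linear(4096, d))
        # P_comp: maps the concatenated pair [h_a; h_o] to a composition embedding.
        self.p_comp = nn.Sequential(nn.Linear(2 * d, 4096), nn.ReLU(),
                                    nn.Linear(4096, d))

    def forward(self, attr_idx, obj_idx):
        h_a = self.p_word(self.word_embedder(attr_idx))       # individual attribute rep.
        h_o = self.p_word(self.word_embedder(obj_idx))        # individual object rep.
        h_comp = self.p_comp(torch.cat([h_a, h_o], dim=-1))   # pairwise rep. (Eqn. 2.3)
        h_out = (h_a + h_o + h_comp) / 3.0                    # blending M = mean (Eqn. 2.4)
        return nn.functional.normalize(h_out, dim=-1)         # L2-normalized G(a, o)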
2.1.3.3 Training and Testing
Objective. As our model only has access to seen compositions Ys, we create our training objective upon
Ys and ignore other compositions during training. More specifically, given an image x, we compute the
compatibility score f(x, a, o) for all (a, o) ∈ Ys. We then jointly optimize F and G by the cross-entropy
loss:
L = −(1/|T|) Σ_{(xi, ai, oi) ∈ T} log [ exp{C(xi, ai, oi)} / Σ_{(aj, oj) ∈ Ys} exp{C(xi, aj, oj)} ].   (2.5)
Intuitively, the cross-entropy loss will force the model to produce a higher compatibility score when
(x, a, o) matches and lower the score when a non-label composition occurs.
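A small sketch of the objective in Eqn. 2.5: the compatibility scores over all seen compositions are treated as the logits of a standard cross-entropy loss. The callable score_fn stands in for C(x, a, o) and is an assumption of this sketch.

import torch.nn.functional as F

def czsl_loss(score_fn, images, label_idx, seen_pairs):
    """Cross-entropy over seen compositions only (Eqn. 2.5).

    score_fn(images, seen_pairs) -> (B, |Y_s|) compatibility scores
    label_idx: index of the ground-truth (a, o) pair within seen_pairs, shape (B,)
    """
    logits = score_fn(images, seen_pairs)       # C(x_i, a_j, o_j) for all seen pairs
    return F.cross_entropy(logits, label_idx)   # softmax + negative log-likelihood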
Inference. The generalized CZSL task requires models to perform recognition over a joint set of seen
and unseen compositions. Thus, for each test sample x, we estimate the compatibility score between x
and every candidate (a, o) inside the search space Ys ∪ Yu. We predict the image x as the composition
that has the highest compatibility score:
ŷ = argmax_{(a, o) ∈ Ys ∪ Yu} C(x, a, o).   (2.6)
We apply the prediction protocol to all datasets that we evaluate BIPR on.
2.1.4 Experiments
In this section, we present the details of our experiments: experimental settings are discussed in Sec. 2.1.4.1
and baseline models are introduced in Sec. 2.1.4.2. We report quantitative results in Sec. 2.1.4.3, conduct an ablation study in Sec. 2.1.4.4, discuss the effectiveness of blending in Sec. 2.1.4.5, and show
visualizations in Sec. 2.1.4.6.
Dataset      Attr.  Obj.  Train (Seen)  Val (Seen)  Val (Unseen)  Test (Seen)  Test (Unseen)
MIT-States   115    245   1k            300         300           400          400
C-GQA        453    870   7k            1k          1k            1k           1k
UT-Zappos    16     12    83            15          15            18           18
Table 2.2: Statistics of datasets for CZSL. C-GQA has a larger vocabulary than the other datasets.
2.1.4.1 Experiment Settings
Datasets. We evaluate BIPR on three popular datasets for CZSL: MIT-States [83], C-GQA [146] and
UT-Zappos [234, 235]. MIT-States contains natural objects associated with various fine-grained attributes.
Though it is relatively noisy [8] due to the data collection process, it still acts as an effective benchmark
for evaluating CZSL models. Derived from GQA [81], C-GQA has a richer set of compositions and a larger
vocabulary for attributes and objects, compared with MIT-States. UT-Zappos contains images of shoes
with various styles, some of which are not even distinguishable in the word embedding space. As for
splits, we follow [146] for C-GQA, and [165] for MIT-States and UT-Zappos, i.e., the closed world setting defined in [146, 165, 8]. Statistically, the numbers
of images in train/val/test are 30k/10k/10k for MIT-States, 23k/3k/3k for UT-Zappos, and 26k/7k/5k for
C-GQA. Detailed statistics on individual primitives and compositions are shown in Tab. 2.2.
Evaluation Metrics. Our evaluation is performed under the generalized CZSL protocol adopted by
[146, 165, 8, 143]. [165, 221] argue that it is unreasonable to evaluate only Yu as significant biases may be
present during training and model selection. Thus, they suggest computing both seen and unseen accuracy
with various bias values added to unseen categories and taking the Area Under the Curve (AUC) as the
core metric. We select our models with the best AUC on the validation set of each dataset and report their
performance on the test set.
Also, the best seen accuracy and the best unseen accuracy are calculated when other candidates are filtered
out by specific bias terms. To assess a model's capability of balancing between seen and unseen categories,
we also report the best Harmonic Mean (HM), defined as 2 · seen · unseen / (seen + unseen).
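For reference, the following is a simplified sketch of how AUC and the best HM could be computed from accuracies measured over a bias sweep; the exact protocol follows [165], and this is only an approximation for illustration.

import numpy as np

def generalized_czsl_metrics(seen_acc, unseen_acc):
    """seen_acc, unseen_acc: 1-D arrays of accuracies measured at a series of
    bias values added to unseen-composition scores before the argmax."""
    auc = abs(np.trapz(unseen_acc, seen_acc))     # area under the seen/unseen accuracy curve
    hm = 2 * seen_acc * unseen_acc / np.maximum(seen_acc + unseen_acc, 1e-8)
    return auc, hm.max()                          # AUC and best harmonic mean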
Implementation Details: We build our model on the PyTorch [159] framework. For the image feature
extractor, we pick ResNet-18 [68], initialized from a model pre-trained on ImageNet [41], to maintain
consistency with previous approaches. We use 2-layer MLPs with a 4096-dim hidden layer for both Pword
and Pcomp. The final embedding after blending, hout, is L2-normalized to stabilize the training process. For
word embeddings, we follow [146] and adopt a concatenation of pre-trained fastText [16] and word2vec
[144] embeddings for all datasets.
As for optimization, we use the Adam optimizer with a weight decay of 5e-5. The learning rate is set to
5e-6 for F and 5e-5 for G. The batch size is set to 128 for all three datasets. Our experiments are run
on one NVIDIA RTX 2080Ti GPU.
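A minimal sketch of the optimizer configuration described above, using PyTorch parameter groups to assign F and G different learning rates; the module names image_encoder and bipr are placeholders, not the exact code.

import torch

def build_optimizer(image_encoder, bipr):
    """Adam with per-module learning rates and a weight decay of 5e-5."""
    return torch.optim.Adam(
        [
            {"params": image_encoder.parameters(), "lr": 5e-6},  # F: ResNet-18 backbone
            {"params": bipr.parameters(), "lr": 5e-5},           # G: BIPR embedding generator
        ],
        weight_decay=5e-5,
    )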
2.1.4.2 Baselines
In this section, we discuss baseline models we are comparing BIPR with:
• Attributes as Operators (AoP) [147] treats attributes as modifiers of object categories and projects
object embeddings to the joint space.
• Label Embed + (LE+) [145] passes word embeddings of individual primitives to an MLP to create a joint
embedding for each composition. Note that LE+ allows input word embeddings to be updated during
training.
• Task-driven Modular Network (TMN) [165] builds a modular network with gating functions controlled by input word embeddings.
• SymNet [125] studies the symmetrical property of attribute-object compositions and creates a set of
group axioms to guide the learning process.
• Causal [8] formulates the CZSL problem from a causal perspective and builds a causal embedding model
to learn disentangled representations of basic elements of visual objects. We only include its results on
UT-Zappos due to the lack of published numbers.
• CompCos [143] is designed to estimate feasibility scores of unseen compositions from their seen neighbors.
• Prototype Propagation (ProtoProp) [177] learns independent prototypes of attribute and objects,
which are propagated through a compositional graph to produce pair-level prototypes.
• Object Attribute Disentanglement (OADis) [178] investigates the correlation between objects and
attributes inside visual space. The authors propose to regularize the learning process using decomposed
visual features.
• Siamese Contrastive Embedding Network (SCEN) [123] adopts contrastive learning to learn prototypes for objects and attributes separately, which is further improved by increasing the diversity of
training pairs.
• Compositional Graph Embedding (CGE) [146] builds compositional graphs with nodes of attributes,
objects and compositions, and leverages GCN to propagate information through different nodes.
• Co-CGE [142] combines the systems of CompCos and CGE, achieving better performance than its
individual components.
Note that most baselines mentioned above freeze the image feature extractor and use pre-computed
features. As discussed in [146], previous methods easily overfit and yield even worse performance when
the image backbone is jointly trained. So it is non-trivial to migrate models running with frozen image
Model MIT-States C-GQA UT-Zappos
AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑)
AoP 1.6 9.9 14.3 17.4 0.7 5.9 17.0 5.6 25.9 40.8 59.8 54.2
LE+ 2.0 10.7 15.0 20.1 0.8 6.1 18.1 5.6 25.7 41.0 53.0 61.9
TMN 2.9 13.0 20.2 20.1 1.1 7.5 23.1 6.5 29.3 45.0 58.7 60.0
SymNet 3.0 16.1 24.2 25.2 2.1 11.0 26.8 10.3 23.4 40.4 49.8 57.4
Causal - - - - - - - - 23.3 31.8 39.7 26.6
CompCos 4.5 16.4 25.3 24.6 2.6 12.4 28.1 11.2 28.7 43.1 59.8 62.5
ProtoProp - - - - 3.7 15.1 26.4 18.1 34.7 50.2 62.1 65.7
OADis 5.9 18.9 31.1 25.6 - - - - 30.0 44.4 59.5 65.5
SCEN 5.3 18.4 29.9 25.2 2.9 12.4 28.9 12.1 32.0 47.8 63.5 63.1
CGE† 6.5 21.4 32.8 28.0 4.2 15.5 33.5 16.0 33.5 60.5 64.5 71.5
Co-CGE† 6.6 20.0 32.1 28.3 4.1 14.4 33.3 14.9 33.9 48.1 62.3 66.3
BIPR (Ours) 6.9 20.8 33.7 27.9 4.5 16.9 33.4 16.7 35.6 50.3 62.5 69.7
Table 2.3: Quantitative results on generalized CZSL, all numbers are reported in percentage. S and U refer
to best seen and unseen accuracy on the accuracy curve. Note that models labeled with † have extra
knowledge during the training phase, resulting in unfair comparisons. We discuss this issue in Sec. 2.1.4.2.
features into the end-to-end training scheme. CGE overcomes this issue by wrapping all embeddings into
a compositional graph. However, CGE assumes that test compositions are known when building graphs,
while other methods hold the assumption that test compositions are only known during the test phase.
Thus, we argue that the comparison against CGE/Co-CGE is unfair as they take advantage of extra
knowledge during training. For readers’ reference, we report quantitative results of CGE/Co-CGE in Tab. 2.3,
labeled with †.
2.1.4.3 Quantitative Results
In this section, we present quantitative results in detail. Such results verify the effectiveness of our
method, which surpasses current SOTA on most metrics. Performance numbers of different models are
reported in Tab. 2.3.
Results on MIT-States. Results show that our model overcomes the label noise and achieves SOTA.
More specifically, on the core metric, AUC, we observe a 1.0% improvement, from 5.9% to 6.9%, compared
with OADis. BIPR also beats CGE and Co-CGE, by margins of 0.4% and 0.5%, despite the extra knowledge
they have during training. Furthermore, regarding harmonic mean, BIPR achieves 20.8%, comparable with
Model Settings MIT-States
M Attr. Obj. AUC (↑) HM (↑) S (↑) U (↑)
BIPR (Ours)
Mean ✓ ✓ 6.9 20.8 33.7 27.9
Mean ✓ ✗ 5.4 18.5 30.4 24.0
Mean ✗ ✓ 6.5 20.4 32.0 27.7
Attn. ✓ ✓ 6.3 19.8 31.2 27.0
Max ✓ ✓ 5.8 18.7 30.3 26.5
Concat. ✓ ✓ 6.3 19.7 31.9 26.9
Table 2.4: Ablation study on blending operation. The column of M reveals actual blending functions,
while columns of Attr. and Obj. indicate which embeddings are blended with the compositional one.
CGE and better than all other baselines. On best seen accuracy, BIPR reaches 33.7%, higher than other
methods. Moreover, our method achieves 27.9% on best unseen accuracy, 2.3% higher than OADis, the
most recent baseline.
Results on C-GQA. Our evaluation results on C-GQA further verify the advantage of BIPR, especially
when the number of unseen compositions is larger. Although C-GQA is the hardest among all benchmarks,
our model is able to surpass other baselines and achieves SOTA. On AUC, our model achieves 4.5%, 0.8%
higher than ProtoProp. Furthermore, BIPR also beats CGE and Co-CGE by a noticeable margin on AUC.
In addition, we also observe remarkable improvements over other baselines in terms of Harmonic Mean;
BIPR is 1.8% higher than ProtoProp, 1.4% higher than CGE and 2.5% higher than Co-CGE.
Results on UT-Zappos. UT-Zappos has much fewer attributes and object categories, compared with
its counterparts, and is thus much easier, so the gap between various methods is smaller. Noticeably, our
model, BIPR, outperforms all other baselines and achieves an AUC of 35.6%, 0.9% higher than the second
best model, ProtoProp. Moreover, on best unseen accuracy, BIPR achieves 69.7%, only lower than CGE
which benefits from the extra knowledge during training.
2.1.4.4 Choices of the Blending Function
To understand our model in-depth, we conduct experiments to explore the best choice for the blending
process and report results in Tab. 2.4. We explore options in two directions: what should be blended and
what operation should be taken.
We observe that blending pairwise features solely with object embeddings also improves the performance,
by a smaller margin. Interestingly, attribute embeddings are not beneficial when fused alone but can add
to the performance when mixed together with object embeddings. In our opinion, the drop is due to
considerable appearance shifts between different objects with the same attribute, making the attribute-alone blending too noisy.
We further experiment with various types of blending functions in addition to the mean function
chosen for our final model, including a self-attention (Attn.) module [204], the maximum function (Max.) and
feature concatenation (Concat.). For concatenation, we combine all three types of embeddings and pass
the joint feature vector through one linear layer. Evaluation results show that blending by self-attention
or concatenation, which have extra trainable parameters, does improve the performance but is not as good
as the counterpart with the mean function. The maximum is even worse than no blending. We attribute
the drop to the dominance of specific features that can be created by the maximum function.
2.1.4.5 Contributions of Blending
One may question whether the blending operation is the key to performance gains. In this section, we
discuss the effectiveness of blending on BIPR. For comparison, we skip the blending function M and
directly let hout be hcomp. We take this model as the non-blending variation of BIPR. Furthermore, to
explore the generality of the idea of blending, we further apply BIPR to existing CZSL models and compare
them with the original models.
Model Blend MIT-States
AUC (↑) HM (↑) S (↑) U (↑)
BIPR (Ours)
✗ 6.0 19.3 31.1 25.6
✓ 6.9 20.8 33.7 27.9
∆ +0.9 +1.5 +2.6 +2.3
C-GQA
AUC (↑) HM (↑) S (↑) U (↑)
✗ 3.2 13.8 32.5 12.6
✓ 4.5 16.9 33.4 16.7
∆ +1.3 +2.9 +0.9 +4.1
UT-Zappos
AUC (↑) HM (↑) S (↑) U (↑)
✗ 30.3 44.2 62.2 67.9
✓ 34.0 48.9 61.9 68.0
∆ +3.7 +4.7 -0.3 +0.1
Table 2.5: Ablation study on blending on all benchmarks. We use ✓ and ✗ to indicate whether a model is
trained with or without blending. Blending consistently improves the performance of the model. For the BIPR variation
without blending, we directly take hcomp as hout.
Evaluation results in Tab. 2.5 verify that BIPR brings significant improvements to our CZSL model, on
all three benchmarks we experiment with. Noticeably, on MIT-States, the blended representation increases
the AUC from 6.0% to 6.9%. We also observe a 1.5% improvement on HM, 2.6% on best seen accuracy as
well as 2.3% in best-unseen accuracy. Furthermore, on C-GQA, BIPR improves the AUC by 1.3%, 40% of
the non-blending performance. The improvement on HM is also significant, reaching 2.9%. Best seen and
unseen accuracy are also improved by 0.9% and 4.1%, respectively. Additionally, on UT-Zappos, BIPR also
shows significant improvements on AUC and HM.
Additionally, experiments on existing models, shown in Tab. 2.6, further confirm that the blending
operation can be plugged into existing models and achieve performance gains. For CGE, instead of its
original setting, we run in the same setting as others where test pairs are not available during training.
Model Blend MIT-States
AUC (↑) HM (↑) S (↑) U (↑)
AoP
✗ 1.6 9.9 14.3 17.4
✓ 2.7 12.3 20.4 19.6
∆ +1.1 +2.4 +6.1 +2.2
LE+
✗ 2.0 10.7 15.0 20.1
✓ 2.5 12.1 17.6 20.9
∆ +0.5 +1.4 +2.6 +0.8
TMN
✗ 2.9 13.0 20.2 20.1
✓ 3.0 13.5 21.4 19.8
∆ +0.1 +0.5 +1.2 -0.3
CompCos
✗ 4.5 16.4 25.3 24.6
✓ 4.7 16.8 26.3 24.9
∆ +0.2 +0.4 +1.0 +0.3
CGE
✗ 6.3 19.7 31.9 27.0
✓ 6.3 19.7 31.8 27.6
∆ +0.0 +0.0 -0.1 +0.6
Table 2.6: Effectiveness of the blending operation. We use ✗ and ✓ to indicate whether a model is with
or without blending. BIPR not only improves the performance of our model, but also shows noticeable
performance gains on existing CZSL models except CGE. Note that we run CGE under the same setting as
others.
Quantitative results in the table show that BIPR improves recent models except for the CGE model. We attribute that difference to CGE’s graph convolution design, which implicitly fuses the information between
graph nodes.
2.1.4.6 Visualization
To further study how models benefit from BIPR, we employ GradCAM [183] to visualize activation maps
of ground-truth compositions for images tested on BIPR and its non-blending variation, shown in Fig. 2.3.
In our visualization, regions that generate stronger activations are highlighted, indicating where the model
is paying attention to for classification.
Figure 2.3: Activation maps of images tested on BIPR with and without blending, for the compositions caramelized sugar, coiled hose, engraved frame and folded bag. We generate the activation map of the ground-truth composition for each image through GradCAM [183]. We observe that the model with blending can better capture the critical region for making the correct prediction.
From the visualization, we observe that the model with the proposed blending operation can better
capture impactful regions for recognizing the composition. For example, when recognizing caramelized
sugar, BIPR looks at the brown region that reflects the caramelization of the sugar. Furthermore, given an
image of coiled hose, our BIPR model gives higher attention to the region showing coiled and hose. Another
interesting example is the sample of engraved frame. We notice that our model pays special attention to
the engraving, while the non-blending variation fails to do so.
2.2 Compositionality in Object Detection
2.2.1 Introduction
Object detection has seen tremendous progress through deep neural networks [175, 67, 19, 156, 150, 51] and
availability of large scale datasets such as MS-COCO [128] and Visual Genome [105]. In addition to objects,
attributes are useful in distinguishing among members of the same category. While attribute recognition
is a classic topic when applied to homogeneous patches, the task is much more complex when applied to
an entire object. Recently, joint prediction of objects and attributes has been explored under scene-graph
generation [126], symbolic VQA [233], dense captioning [87] and image captioning [5]. In particular, the
model used in [5] has been widely adopted as the feature extractor for VQA tasks, and as such forms a
competitive baseline in our experiments. However, prior work does not evaluate performance on novel
object-attribute pairs; in this paper, we explore the usual object-attribute detection problem and its extension
to the recognition of novel object-attribute pairs.
In Fig. 2.4, we show some examples of objects with color attributes. Note that a “red car” can be
distinguished from a “silver car” based on color. We also note that the property of color is not specific to the car.
Unlike naming color on patches [64, 218], recognizing the color of an object is more challenging. Typical
objects are not of a single, uniform hue with further variations due to changes in surface orientation,
illumination, reflections, shadows, and highlights. The material composition may also not be uniform;
for example, a car has both metal and glass components. One other difficulty is created with the use of
rectangular bounding boxes for object proposals which mix background pixels with object pixels. We do
not aim to separate these influences; instead, as in object classification, we aim to learn from examples
where the variations are accounted for in a holistic feature vector.
A further challenge is that common detection datasets do not come with attribute annotations; even
in those, such as Visual Genome [105], that do provide attributes, a large proportion of objects is not
Figure 2.4: Attribute transfer: during the training phase, some categories, e.g. (a) a shirt labeled red, come with attribute labels, while others, e.g. (b) a car, only have the object class labeled. Models need to recognize attributes of such unlabeled categories, e.g. (c) a red car, during testing.
annotated with attributes. Additionally, as shown in Fig. 2.4, it is not reasonable to expect training data to
contain all possible attribute-category pairs; a desirable model needs to recognize novel attribute-category
pairs not encountered in training, and we name this task attribute transfer.
There is an inherent conflict between the feature requirements of the category and attribute classification
tasks: the former aims to be largely attribute-invariant and the latter to be largely invariant to the category
class. Simply appending attribute classification heads to the end of a two-stage detection pipeline (for instance, Faster
R-CNN [175]) entangles the features for these two conflicting needs, weighing on the performance of both object
detection and attribute prediction.
To eliminate potential entanglement in feature space, we separate the feature extraction into two
streams. More specifically, in our proposed model, category classifier and attribute classifier are fed with
separate features from two independent convolutional backbones, while region proposals are shared.
We evaluate the accuracy of single-stream and two-stream variants on VG-20, which we construct from
the Visual Genome [105] dataset. We further construct novel splits from this dataset and investigate the
ability of the models to transfer attributes. Our experiments show that, in a single-stream architecture,
incorporating attribute heads results in a significant drop in object detection mAP, whereas there is little
or no loss in the two-stream variants; under novel attribute-category combinations, the two-stream
architecture also achieves higher attribute transfer accuracy.
Our contributions are: (i) we eliminate the feature entanglement and resolve the internal conflict between object detection and attribute classification through the two-stream design; (ii) VG-20, a new subset
of Visual Genome and splits in this dataset for evaluating attribute inference and transfer performance;
(iii) demonstration of significant improvements over baselines for attribute inference and transfer tasks.
2.2.2 Method
R-CNN Detection Structure: Recent detection structures in the R-CNN family are composed of four
parts: a deep convolutional backbone such as ResNet [68] or VGG [187], the Region Proposal Network
(RPN) [175], an RoI feature extractor and a classification module. Specifically, the convolutional backbone computes image-level features from input images, and the RPN takes these features to generate proposals, or in
other words, RoIs. The RoI feature extractor extracts features for these regions of interest via RoI Pooling
operations. The classification module uses these RoI features to classify and regress the bounding box. In
our case, we additionally have an attribute head for attribute classification.
Attribute Recognition with Object Embedding: Anderson et al. [5] introduced an additional substructure designed for attribute recognition. An embedding layer maps object labels into features, which
are concatenated with RoI features, followed by a linear layer and finally fed to the classification heads
for prediction. Such a structure brings object information to the attribute classification. But the conflict
between object detection and attribute recognition remains.
Two-Stream Architecture: The proposed architecture follows the R-CNN pipeline, with the backbone and RoI feature extractor divided into two independent streams, as shown in Fig. 2.5. The object
head and box head are kept in the object stream while the attribute stream makes attribute predictions.
We choose to use the same stream for both color and material attributes based on similar requirements of
feature extraction. On top of the R-CNN pipeline, we add an attribute feature extractor which uses the
RoIs from the RPN. The RPN is integrated into the object stream and takes features from the object backbone to
Figure 2.5: Our two-stream architecture (backbone, RPN, RoI pooling, feature extractors, and classifiers for the object/box heads and the color/material heads). Note that solid arrows are active in both inference and back-propagation while dotted arrows are only active in inference.
generate region proposals. Region proposals are shared by both streams because attributes are properties
associated with certain objects and computing proposals solely for attributes would be meaningless.
Cross Link: In the ordinary two-stream design, the object stream (the top side) and the attribute
stream (the bottom side) make predictions from separate features computed by independent feature extractors. But unlike objects and color, objects and material are highly correlated. While some objects
appear in some colors only, most man-made objects can appear in a variety of colors whereas the material
property is much more constrained. To leverage such correlation, we add a cross link, the red dotted arrow
in Fig. 2.5, from the object stream to the attribute stream. Features are concatenated before the prediction
layer. Furthermore, the gradient from the material head to the object stream is blocked so that utilizing
object features for attributes will not impair object detection.
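A sketch of the cross link, assuming RoI features attr_feat from the attribute stream and obj_feat from the object stream; detaching the object features blocks the gradient from the material head back into the object stream, as described above. All names are illustrative.

import torch

def material_logits(material_head, attr_feat, obj_feat):
    """Cross link: the material head also sees object-stream features, but
    gradients do not flow back into the object stream."""
    fused = torch.cat([attr_feat, obj_feat.detach()], dim=-1)  # stop-gradient on object features
    return material_head(fused)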
Objectives: The overall loss function L is the sum of four components Lrpn,Lloc,Lcls,Lattr, that is,
L = Lrpn+Lloc+Lcls+Lattr. In terms of Lloc and Lcls, we follow the same objective function proposed in
[58]. As for Lrpn, the loss function of the RPN, we follow the one defined in [175]. Finally, Lattr = Lcolor + Lmat.
Here,
Lcolor = H(σ(zcolor), ycolor), (2.7)
Lmat = H(σ(zmat), ymat). (2.8)
Note that H is the cross-entropy loss, σ refers to the softmax function, and z, y are inference scores and
labels respectively. We name Lattr as Separated Cross-Entropy loss (SCE) given that it is the sum of two
independent cross-entropy functions.
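A short sketch of the Separated Cross-Entropy (SCE) attribute loss in Eqns. 2.7-2.8, with independent softmax cross-entropy terms for color and material; the tensor names are illustrative.

import torch.nn.functional as F

def sce_loss(z_color, y_color, z_mat, y_mat):
    """L_attr = L_color + L_mat, each a softmax cross-entropy (Eqns. 2.7-2.8)."""
    return F.cross_entropy(z_color, y_color) + F.cross_entropy(z_mat, y_mat)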
2.2.3 Experiments
We introduce our data preparation in Sec. 2.2.3.1 and detail our experimental setup in Sec. 2.2.3.2, followed
by quantitative results in Sec. 2.2.3.3 and qualitative visualizations in Sec. 2.2.3.4.
2.2.3.1 Data Preparation
To evaluate the performance of our approach, we construct a subset of Visual Genome [105]; specifically,
we adopt the split and reorganized scene graphs created by Hudson et al. [80]. The Visual Genome dataset
consists of 108k images along with over 1,600 object categories and around 400 attribute annotations associated with objects. However, many categories in the dataset overlap with other categories (for example,
“man” and “person” are labeled as two different categories) and it also suffers from a long-tailed attribute
distribution. Therefore, we pick the 12 most descriptive colors and the 4 most common materials from the dataset.
Regarding object categories, we select 20 categories that have sufficient attribute annotations for our task;
we thus call our dataset VG-20. In total, we have 180k samples for training and 25k for testing, with
around one-third of them possessing attribute annotations. Note that each bounding box is counted as one
Model Object mAP Color Recall Material Recall
@.5 @.5 @.5
PA + SCE 24.95 67.77 56.98
PA + UCE 25.35 68.74 56.01
Single-Stream (SS) 25.13 68.59 61.22
SS Detection Only 38.18 - -
Two-Stream (TS) 38.17 72.40 63.83
TS + Cross Link 38.30 73.11 65.39
TS + LFE 28.37 72.72 63.50
Table 2.7: Results of Supervised Object Detection and Attribute Prediction: PA refers to the detection
model used in [5]. LFE refers to the variation with Late Feature Entanglement.
Model Object mAP Color Recall Material Recall
@.5 @.5 @.5
Target
PA + SCE 25.09 50.89 47.85
PA + UCE 24.13 52.75 46.40
Single-Stream 24.87 48.39 49.92
Two-Stream (TS) 38.11 54.98 49.27
TS + Cross Link 38.39 61.01 52.28
TS + LFE 27.39 46.55 48.93
Reference
PA + SCE 23.27 67.26 59.70
PA + UCE 22.32 66.41 58.62
Single-Stream 22.76 67.39 61.48
Two-Stream (TS) 37.67 69.43 62.67
TS + Cross Link 38.26 71.73 66.14
TS + LFE 26.11 68.91 63.25
Table 2.8: Results of Attribute Transfer: We use metrics object mAP, color recall, material recall, as defined
in Sec. 2.2.3.2. SCE and UCE are loss functions defined in Eqn. 2.7, 2.8 and Eqn. 2.9. And PA refers to the
detection model in [5].
object sample and some bounding boxes do not have associated attribute annotations; we preserve these
as they are useful in both training and evaluating object detectors.
2.2.3.2 Experimental Setup
We explore two settings w.r.t attribute annotations:
• Supervised: all attribute annotations are available during training phase.
• Attribute Transfer: objects are divided into two groups by their object labels, reference categories
Xref and target categories Xtgt. During training, objects in Xref keep their attribute annotations
while those in Xtgt do not have access to the attribute annotations. That is, the model needs to
transfer attributes from Xref to Xtgt which brings additional complexity.
For fair evaluation of attribute transfer over all categories, we divide the objects into two groups XA
and XB, which satisfy following properties: XA ∩ XB = ∅, XA ∪ XB = Xall and keep |XA|=|XB|=10.
We let Xref = XA, Xtgt = XB in one run and vice versa in the other. Quantitative numbers are averaged
over those two runs.
Evaluation Metrics: For object detection, we adopt the commonly used mean Average Precision
(mAP@0.5). Furthermore, to measure both detection and recognition performances simultaneously, we
define “attribute recall" (attribute could be color or material) as the ratio of objects whose bounding boxes
and attributes are detected and recognized by the model to all objects with valid attribute annotations.
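A simplified sketch of this attribute-recall computation: among ground-truth objects that carry a valid attribute annotation, count those for which some detection matches the box (IoU ≥ 0.5) with the correct attribute. Matching details (score thresholds, one-to-one assignment) are omitted, and the data structures are assumptions of this sketch.

def attribute_recall(gt_objects, detections, iou_fn, iou_thr=0.5):
    """gt_objects: list of dicts with 'box' and 'attr' (None if unlabeled).
    detections:  list of dicts with predicted 'box' and 'attr'.
    Returns the fraction of attribute-annotated objects that are both
    localized and correctly attributed."""
    annotated = [g for g in gt_objects if g["attr"] is not None]
    hit = 0
    for g in annotated:
        if any(iou_fn(g["box"], d["box"]) >= iou_thr and d["attr"] == g["attr"]
               for d in detections):
            hit += 1
    return hit / max(len(annotated), 1)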
Figure 2.6: Visualized detection results (ground truth vs. predictions) for VG-20 supervised (row 1) and VG-20 attribute transfer (row 2). Object predictions colored in blue belong to the reference set while those colored in red belong to the target set in the attribute transfer setting.
Baselines: We compare our model against two baseline approaches and one variation of our design.
• Single-Stream: A single-stream version of our model.
32
• Peter Anderson Model (PA): The R-CNN-like structure proposed in [5]. For fair comparison, we integrate an FPN into this model and retrain it with our data splits. The original PA model uses the Unified
Cross-Entropy loss:
Lattr = H(σ(z), y), (2.9)
where each color and material is treated as an attribute. We compare with two variants of PA,
one trained with Unified Cross-Entropy (UCE) and the other with Separated Cross-Entropy (SCE),
referred to as PA + UCE and PA + SCE, respectively.
• Late Feature Entanglement (LFE): A variation of our two-stream model where features from both
streams are explicitly entangled. More specifically, RoI features from both streams are concatenated
before classification so that all classifiers share identical features.
Implementation Details: We adopt ResNet-101 [68] as our backbones in both streams and the design
of RPN follows [175]. We build the feature extractors following Feature Pyramid Networks (FPN) in [127].
During training, both streams of our model are initialized with the pre-trained weights from MSCOCO
[128]. The model is trained with the Adam optimizer [99] with a learning rate of 5e-5. The batch size is set to
12.
2.2.3.3 Quantitative Results
We report results for VG-20 in both supervised and attribute transfer setting.
(a) Supervised: As seen in Table 2.7 for VG-20, compared with the detection-only model, the single-stream detection plus attribute inference model brings down the object mAP by more than 10%. The
two-stream variants do not exhibit this drop and also show a ∼3% boost on color recall and a ∼2%
improvement on material recall. We also compare with our implementation of [5] with two different
cross-entropy (PA + SCE and PA + UCE) which give similar results as single-stream models.
Furthermore, the late-stage feature entanglement does not show improvements in attribute recognition and even impairs the performance of object detection, dragging down object mAP by ∼10%.
By comparing Single-Stream with TS + LFE, we demonstrate that, even though both object detection
and attribute recognition benefit from the increased number of parameters, the feature entanglement
between the object stream and the attribute stream leads to a significant deterioration in object-related performance.
(b) Attribute Transfer: Results on VG-20, shown in Tab. 2.8, also show noticeable improvements. In the color
domain, our models increase color recall by more than 10%. Finally, results on
reference set are consistent with those in supervised setting as expected.
Effectiveness of the Cross Link: As shown in Tab. 2.8, the cross link improves the performance
of our two-stream model especially in transferring attributes. The link improves the color recall and
material recall in target categories by around 6% and 3%, respectively. Such results reflect that with
less supervision, the cross link enables the attribute stream to learn from the object stream, resulting
in the gain in attribute transfer.
2.2.3.4 Visualization
We visualize detection results of our two-stream model in Fig. 2.6 (only objects with confidence ≥ 0.5 are
shown).
Supervised: Predictions are shown in the first row. We note that the ground-truth annotations in VG are
sparse: (i) only some objects are annotated with bounding boxes; (ii) even among objects with bounding
boxes, only some are annotated with their color and material attributes. Though some objects are not
annotated in the ground truth, our model provides reasonably dense predictions of objects and attributes.
Attribute Transfer: Examples in the second row show that our model can transfer attributes in real-world images. The colors of animals and materials of doors are well transferred.
Chapter 3
CLIP-Powered Compositional Zero-Shot Learning
3.1 Introduction
As discussed in Sec. 2.1.1, the ability to recognize new attribute-object compositions based on a set
of observed pairs, named Compositional Zero-Shot Learning (CZSL) [145], is a sine qua non for an intelligent entity. However, the inherent challenge in CZSL lies in the capacity to identify unobserved novel
compositions without compromising the recognition of previously observed combinations. Conventional
approaches [125, 8, 123, 142, 178, 165, 215, 231, 148, 147, 145, 211, 146] often suffer from training biases. Even though recent methods employ large-scale Vision-Language Pre-training (VLP) models with
strong generalization ability, e.g., CLIP [168], to mitigate this issue, they simply treat VLP models as
frozen black-box encoders and fail to exploit their full potential. Here, we therefore explore how to
more effectively extract and utilize the knowledge embedded in pre-trained vision-language models for
the recognition of novel attribute-object compositions.
More specifically, to adapt VLP models for CZSL, some researchers apply prompt-tuning [149, 250, 251]
or fine-tune the model with extra adaptation layers [56] on top of CLIP. However, prompt-tuning methods, depicted in Figure 3.1(a), only learn trainable prompts, while CLIP-Adapter, shown in Figure 3.1(b),
only adds external modules outside CLIP. Both strategies abstain from altering the fundamental CLIP encoder, consequently retaining CLIP as a static black box. Nayak et al. [149] have shown that exhaustively
(a) Prompt Tuning [149]; (b) CLIP-Adapter [56]; (c) CAILA (Ours); (d) Comparisons of AUC on MIT-States [83] with different backbones (ViT-B/32 and ViT-L/14) for Van.-CLIP-ZS, Van.-CLIP-FT, CLIP-Adapter, Co-CGE, CSP, DFSP and CAILA (Ours).
Figure 3.1: Illustrations of CAILA and previous CLIP-based baselines. CAILA has adapters integrated into both CLIP encoders and thus better transfers the knowledge from CLIP to CZSL, resulting in significant performance boosts compared with other CLIP-based baseline methods. “Van.-CLIP” refers to models with the vanilla CLIP architecture. Prompts highlighted in green are set to be learnable parameters.
fine-tuning CLIP falls short of attaining practicable performance. Thus, we argue that properly optimizing
features across layers through a task-specific design is critical to effectively harnessing the knowledge embedded in CLIP. A feasible CLIP-based CZSL should: i) have task-specific designs for CZSL; ii) be capable
of extracting concept-specific features related to compositions and individual primitives.
Hence, we propose CAILA, Concept-Aware Intra-Layer Adapters, which satisfies the given prerequisites; we substantiate its superiority over other CLIP-based methods in Fig. 3.1(d).
Fig. 3.1(c) highlights the difference between CAILA and other VLP-based methods. Instead of prompt tuning or full fine-tuning, we adopt adapters [74] to transfer knowledge from VLP models while avoiding
strong training biases.
Moreover, given that adapters are low-overhead components, it is feasible to employ a variety of
adapters to extract concept-wise representations. More specifically, CAILA integrates a group of adapters
into each layer of both encoders; each group possesses concept-specific components to extract knowledge
corresponding to particular concepts, including attributes, objects, and compositions. To merge features
extracted by various concept-aware adapters, we propose the Mixture-of-Adapters (MoA) mechanism for
both the vision and the text encoder. In addition, the property that CAILA can extract concept-specific features
allows us to further propose Primitive Concept Shift, which generates additional vision embeddings by
combining the attribute feature from one image and the object feature from another for a more comprehensive understanding.
We evaluate our approach on three popular CZSL datasets: MIT-States [83], C-GQA [146] and UT-Zappos [234, 235], under both closed world and open world settings. We also report the performance of
CAILA in closed world on VAW-CZSL [178], a newly released benchmark. Our experiments show that,
in both scenarios, our model beats the state of the art on all benchmarks, following the generalized
evaluation protocol [165], by significant margins.
To summarize, our contributions are as follows: (i) we propose CAILA, which is the first model exploring CZSL-oriented designs with CLIP models to balance model capacity and training bias robustness; (ii)
we design the Mixture-of-Adapters (MoA) mechanism to fuse the knowledge from concept-aware adapters
and improve generalizability; (iii) we further enrich the training data and exploit the power of CAILA
through Primitive Concept Shifts; (iv) we conduct extensive experiments to explore the optimal setup
for CAILA on CZSL. Quantitative experiments show that our model outperforms the SOTA by significant
margins in both closed world and open world, on all benchmarks.
3.2 Related Work
Parameter-Efficient Tuning. Recent research on large-scale pre-training models [168, 85, 77, 118,
59] has achieved superior performance on various downstream tasks, compared with regular approaches.
Various works [74, 195, 91] show that tuning adapters [74] on the language side yields comparable results
with fully fine-tuned variants, while Chen et al. [30] investigate the adaptation of image encoders on dense
prediction tasks. For CZSL, a few models [251, 149] leverage the knowledge of CLIP through prompt tuning
[114], while Gao et al. [56] attach a post-processor to CLIP for knowledge transfer. Though these methods
show strong performance on CZSL against regular models, they treat the CLIP model as a black box and
keep it completely frozen. In CAILA, we open up the CLIP black box by integrating intra-layer adapters
into both the image and text encoders.
3.3 Approach
The problem of CZSL can be formulated as follows. We denote the training set by T = {(x, y)|x ∈ X , y ∈
Ys}, where X contains images represented in the RGB color space and Ys is a set of seen composition
labels which are available during the training phase. Each label y = (a, o) is a pair of attribute a ∈ A and
object category o ∈ O. When testing, CZSL expects models to predict a set of unseen compositions Yu
that is mutually exclusive with the training labels Ys: Ys ∩ Yu = ∅. Note that Ys and Yu share the same sets
A and O; CZSL assumes that each a ∈ A and each o ∈ O appears in the training set and only the composition
(a, o) ∈ Yu is novel. Following [165, 221, 146], we focus on generalized CZSL, where the test set contains
both seen and unseen labels, formally denoted by Ytest = Ys ∪ Yu.
Most recent works [146, 165, 8] study the generalized CZSL problem under the closed world setting,
where Ytest is a subset of the complete composition set Y : A × O. The closed world setting assumes
that Yu is known during testing and thus greatly reduces the size of the search space. On the contrary,
Mancini et al. [143] argue that such a constraint should not be applied to the search space and introduce the
open world setting, where models are required to search over the complete set of compositions, formally
Ys ∪ Yu = Y. In this paper, we investigate the problem in both closed world and open world.
3.3.1 Compatibility Estimation Pipeline
As different attributes can lead to significant appearance shifts even inside the same object category, performing attribute and object predictions separately may be ineffective. Hence, we model attribute-object
compositions jointly and learn a combined estimation function to measure the compatibility of input image
x and query composition (a, o). In addition, we let the model estimate attribute and object compatibilities
as auxiliary sub-tasks during training.
The estimation of composition compatibility is represented as C(x, a, o) : X × A × O → R. It contains two components: the image feature extractor FC : R^{H×W×3} → R^d and the text embedding generator G : A × O → R^d, where d denotes the number of channels of each representation. Given an image x and a composition (a, o), the compatibility score is defined as the dot product of FC(x) and G(a, o), formally

C(x, a, o) = FC(x) · G(a, o).   (3.1)
Furthermore, as CZSL requires models to recognize novel pairs composed of known attributes and
objects, it is important for a model to possess the capability of primitive feature extraction that is disentangled from training compositions. Thus, we make our model extract features corresponding to primitives
and estimate the compatibility between vision features and text representations during training. Similar
to Eqn. 3.1, we have
C(x, a) = FA(x) · GA(a), C(x, o) = FO(x) · GO(o). (3.2)
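A sketch of how the three compatibility scores in Eqns. 3.1-3.2 could be combined during training; the use of three separate cross-entropy terms and their unweighted sum are assumptions of this sketch, not the exact formulation.

import torch.nn.functional as F

def caila_training_loss(scores_comp, scores_attr, scores_obj,
                        y_comp, y_attr, y_obj):
    """scores_*: (B, num_candidates) compatibility logits for the composition,
    attribute and object branches; y_*: ground-truth indices.  All three
    branches are used during training, while only the composition branch
    (Eqn. 3.1) is used at inference."""
    return (F.cross_entropy(scores_comp, y_comp)
            + F.cross_entropy(scores_attr, y_attr)
            + F.cross_entropy(scores_obj, y_obj))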
Figure 3.2: An overview of CAILA: (a) the main composition compatibility estimation pipeline; (b) auxiliary sub-tasks on primitive compatibility during training; (c) the structure of CAILA layers, where concept-aware adapter blocks (attribute, object and composition) are inserted after the frozen self-attention and feed-forward sub-layers. Our model extracts concept-specific features by learning different adapters and fuses them through the Mixture-of-Adapters (MoA) mechanism. Note that for each layer of the encoders in (a) and (b), the weights of encoding blocks of the same concept are shared. NV, NL and M indicate numbers of layers.
All three compatibility scores contribute independently to the loss function, while C(x, a, o) is leveraged
during inference. More specifically, our framework learns separate representations through CAILA discussed in Sec. 3.3.2 and conducts knowledge fusion through Mixture-of-Adapters (MoA), which will be
covered in Sec. 3.3.3.
Following [168], we create a prompt template similar to "a photo of [CLASS]" for each compatibility estimation sub-task. For composition compatibility, we feed the text encoder with "a photo of
[ATTRIBUTE] [OBJECT]"; we use "a photo of [ATTRIBUTE] object" and "a photo of [OBJECT]"
for attribute and object compatibilities, respectively. Similar to [149], we only make [CLASS] prompts
trainable. For both encoders F and G, we take the output hidden state of the [CLS] token as the representation.
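To illustrate the three prompt forms, here is a small sketch of the template strings; the handling of learnable [CLASS] prompts (replacing class-name tokens with trainable embeddings as in [149]) is omitted, and only the textual templates are shown.

def build_prompts(attribute, obj):
    """Prompt templates for the three compatibility sub-tasks."""
    return {
        "composition": f"a photo of {attribute} {obj}",
        "attribute":   f"a photo of {attribute} object",
        "object":      f"a photo of {obj}",
    }

# e.g. build_prompts("large", "bear")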
3.3.2 Concept-Aware Intra-Layer Adapters
Though CLIP-based CZSL approaches [149, 56, 251] have achieved significant improvements compared
with earlier methods [145, 146, 143, 165, 149], the CLIP encoder is considered as a black box and no modifications are made to improve its generalizability. Thus, we propose to improve CLIP-based CZSL models
in both modalities with CAILA, Concept-Aware Intra-Layer Adapters.
As shown in Fig. 3.2 (a)(b), we take the CLIP image encoder as F and the text encoder as G, while
adding concept awareness to both encoders when estimating compatibilities of different concepts. Fig. 3.2
(c) demonstrates how adapters are integrated into a regular transformer encoding block. For each encoding block, we add adapters behind the frozen self-attention layer and the feed-forward network. More
specifically, given the input hidden state h of an adapter, we compute the latent feature z with the down-sampling operator fDown, followed by the activation function σ. The output h′ of the adapter is obtained by up-sampling z and summing it with h through a skip connection. Formally, we have

z = σ(fDown(h)),   h′ = fUp(z) + h,   (3.3)

where both fDown and fUp are fully-connected layers.
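A minimal PyTorch sketch of the adapter block in Eqn. 3.3: a bottleneck down-projection, a non-linearity, an up-projection and a residual connection. The bottleneck width and choice of activation are placeholders.

import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h' = f_up(sigma(f_down(h))) + h  (Eqn. 3.3)."""
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # f_down
        self.act = nn.GELU()                        # sigma
        self.up = nn.Linear(bottleneck, d_model)    # f_up

    def forward(self, h):
        return self.up(self.act(self.down(h))) + h  # skip connection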
To extract concept-specific features, at each depth level, we create three encoding blocks corresponding to attribute, object, and composition, respectively. As in Fig. 3.2(c), encoding blocks at the same
level share the same weights except for the adapter layers. Inputs from both modalities are processed
by encoders equipped with the different types of encoding blocks, and features related to each of the three
concepts are produced. During training, vision-language compatibility scores for “attribute”, “object” and
“composition” are estimated. More specifically, the encoders referred to in Fig. 3.2(a) and (b) are the same ones;
there are no extra side encoders for the auxiliary sub-tasks.
Figure 3.3: Details of our vision Mixture-of-Adapters layer. The latent features of each adapter, zA, zO and zC, are mixed and further processed by the up-sampling function to generate h′C. h′C is then joined with h′A, h′O and the input feature h to form the output.
3.3.3 MoA: Mixture of Adapters
To aggregate the knowledge extracted by adapters corresponding to attributes, objects, and compositions,
we propose Mixture-of-Adapters mechanisms for both the vision side and language side of the encoder.
On the vision side, we perform a two-stage feature extraction. As shown in Fig. 3.2 (a), for the first NV − M layers, we extract features related to the attribute (hA) and the object (hO) through corresponding encoding blocks, which are further concatenated and processed by the trailing M ternary MoA layers. An example of the vision MoA layer is shown in Fig. 3.3. Given the hidden state h, we extract latent features zA, zO and zC from the adapters. We then combine all three features to create z′C, followed by fUp:

z′C = Avg(zA, zO, zC),   h′C = fUp(z′C).   (3.4)

We further combine h′C with the outputs of the attribute and object adapters, h′A and h′O, to create the output:

h′ = Avg(h′A, h′O, h′C) + h.   (3.5)
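A rough PyTorch sketch of such a ternary vision MoA layer (Eqs. 3.4 and 3.5) is given below; the exact placement of the residual connections inside each branch and the choice of activation are assumptions of this sketch rather than a specification of the released code.

```python
import torch
import torch.nn as nn

class VisionMoA(nn.Module):
    """Sketch of a ternary vision Mixture-of-Adapters layer (Eqs. 3.4-3.5)."""

    def __init__(self, d_model: int, reduction: int = 4):
        super().__init__()
        d_latent = d_model // reduction
        # One bottleneck adapter per concept: attribute, object, composition.
        self.down = nn.ModuleDict({k: nn.Linear(d_model, d_latent) for k in ("attr", "obj", "comp")})
        self.up = nn.ModuleDict({k: nn.Linear(d_latent, d_model) for k in ("attr", "obj", "comp")})
        self.act = nn.GELU()  # sigma; the exact activation is an assumption

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Latent features z_A, z_O, z_C from the three adapters.
        z = {k: self.act(self.down[k](h)) for k in ("attr", "obj", "comp")}
        h_attr = self.up["attr"](z["attr"]) + h          # attribute branch output h'_A
        h_obj = self.up["obj"](z["obj"]) + h             # object branch output h'_O
        # Eq. 3.4: mix the latents, then upsample, to obtain the composition branch h'_C.
        z_comp = torch.stack([z["attr"], z["obj"], z["comp"]]).mean(dim=0)
        h_comp = self.up["comp"](z_comp)
        # Eq. 3.5: average the three branches and add the skip connection.
        return torch.stack([h_attr, h_obj, h_comp]).mean(dim=0) + h
```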
Figure 3.4: Illustration of concept shift. We perform concept shift by combining the attribute (melted) feature from one image with the object (candy) feature from another to create an additional composition (melted candy) feature. Newly generated features are shuffled with regular samples during training.
The output of the last mixture layer is L2-normalized and adopted as FC(x) for compatibility estimation. An ablation study on this module is discussed in Sec. 3.4.3.
Unlike the vision side, where attributes and objects are deeply entangled within the same input image, on the language side we can create disentangled language inputs through different prompt templates for attributes and objects separately. Thus, we adopt a simple mixture strategy for language adapters. We compute the compositional embedding through NL encoding blocks for the composition and combine it with the primitive language embeddings:

G(a, o) = Avg(GA(a), GO(o), GC(a, o)).   (3.6)
3.3.4 Primitive Concept Shift on Image Embeddings
Due to the limited diversity of training data, current CZSL models often suffer from training biases. As
discussed in Sec. 3.3.3, in addition to the composition-related feature, CAILA extracts attribute- and object-oriented features during the first stage of FC. That motivates us to leverage these primitive-specific features to create additional embeddings for certain compositions. As it changes the labels of the original images, e.g. from melted butter to melted candy, we call it primitive concept shift.
Closed World • MIT-States • C-GQA • UT-Zappos
Model AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑)
Without CLIP:
CompCos [143] 4.5 16.4 25.3 24.6 2.6 12.4 28.1 11.2 28.7 43.1 59.8 62.5
ProtoProp [177] - - - - 3.7 15.1 26.4 18.1 34.7 50.2 62.1 65.7
OADis [178] 5.9 18.9 31.1 25.6 - - - - 30.0 44.4 59.5 65.5
SCEN [123] 5.3 18.4 29.9 25.2 2.9 12.4 28.9 12.1 32.0 47.8 63.5 63.1
CGE [146] 6.5 21.4 32.8 28.0 4.2 15.5 33.5 16.0 33.5 60.5 64.5 71.5
Co-CGE [142] 6.6 20.0 32.1 28.3 4.1 14.4 33.3 14.9 33.9 48.1 62.3 66.3
CAPE [94] 6.7 20.4 32.1 28.0 4.6 16.3 33.0 16.4 35.2 49.5 62.3 68.5
With CLIP:
CLIP-ZS [168] 11.0 26.1 30.2 46.0 1.4 8.6 7.5 25.0 5.0 15.6 15.8 49.1
CoOp [251] 13.5 29.8 34.4 47.6 4.4 17.1 26.8 20.5 18.8 34.6 52.1 49.3
Co-CGE† [142] 17.0 33.1 46.7 45.9 5.7 18.9 34.1 21.2 36.3 49.7 63.4 71.3
CSP [149] 19.4 36.3 46.6 49.9 6.2 20.5 28.8 26.8 33.0 46.6 64.2 66.2
DFSP [137] 20.6 37.3 46.9 52.0 10.5 27.1 38.2 32.9 36.0 47.2 66.7 71.7
CAILA (Ours) 23.4 39.9 51.0 53.9 14.8 32.7 43.9 38.5 44.1 57.0 67.8 74.0
Table 3.1: Quantitative results on generalized CZSL in closed world, all numbers are reported in percentage.
S and U refer to best seen and unseen accuracy on the accuracy curve. CLIP-ZS refers to the vanilla
CLIP model without fine-tuning. All CLIP-based models are run with ViT-L/14 and we conduct extensive
experiments in Tab. 3.4. †We run Co-CGE with similar CLIP features and report our best number of the
model. Models published before CGE are omitted as their performances are inferior to current baselines.
Fig. 3.4 demonstrates the process of concept shift: given one sample x0 of melted butter and one sample x1 of pressed candy, we create a new sample of melted candy in the feature space by combining the attribute-oriented feature hA of x0 and the object-oriented feature hO of x1. The newly combined feature is further processed by the vision MoA layers described in Sec. 3.3.3, leading to an embedding representing melted candy. Such a change can be viewed as an "object shift" from melted butter or an "attribute shift" from pressed candy. Thus, we name this process "primitive concept shift". In practice, we randomly pick a proportion of samples for shifting and ensure that the new label after shifting still lies in the training set. We discuss the effectiveness of the shifting in Sec. 3.4.3.
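The following sketch illustrates how such a shift could be applied to a batch in feature space; the tensor shapes, the default 10% ratio, and the helper name are assumptions made for illustration.

```python
import random
import torch

def primitive_concept_shift(h_attr: torch.Tensor, h_obj: torch.Tensor,
                            labels, seen_pairs, ratio: float = 0.1):
    """Sketch of primitive concept shift on a batch.

    h_attr, h_obj: attribute-/object-oriented features from the first stage of F_C (batch-first).
    labels: list of (attribute, object) tuples for the batch.
    seen_pairs: set of (attribute, object) compositions present in the training set.
    """
    B = h_attr.size(0)
    new_attr, new_obj, new_labels = h_attr.clone(), h_obj.clone(), list(labels)
    for i in random.sample(range(B), k=int(ratio * B)):
        j = random.randrange(B)                  # donor sample for the object feature
        shifted = (labels[i][0], labels[j][1])   # e.g. melted butter + pressed candy -> melted candy
        if shifted in seen_pairs:                # keep the shifted label inside the seen set
            new_obj[i] = h_obj[j]
            new_labels[i] = shifted
    return new_attr, new_obj, new_labels         # fed to the trailing vision MoA layers
```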
Although there are previous explorations [123, 215] in generating novel features in the latent space, our method is novel in two essential aspects: i) Wei et al. [215] generate features directly from word embeddings, while our method leverages disentangled vision features that carry richer and more diverse knowledge; ii) Li et al. [123] use generated features to augment primitive vision encoders, while ours augments the entire model through CAILA for both compositions and individual primitives.
Closed World • VAW-CZSL
Model AUC (↑) HM (↑) S (↑) U (↑)
Without CLIP:
CompCos [143] 5.6 14.2 23.9 18.0
OADis [178] 6.1 15.2 24.9 18.7
CGE [146] 5.1 13.0 23.4 16.8
With CLIP:
CLIP-ZS [168] 2.6 11.9 12.8 27.8
CSP [149] 8.5 23.3 31.9 33.6
DFSP [137] 14.1 31.1 40.1 40.9
CAILA (Ours) 17.2 34.6 41.6 49.2
Table 3.2: Quantitative results on generalized CZSL of VAW-CZSL in closed world, all numbers are reported
in percentage.
3.3.5 Training and Testing
Objective. We optimize our model with a main loss on attribute-object compositions and auxiliary losses
on attributes and objects. As our model only has access to seen compositions Ys, we create our training
objective upon Ys and ignore other compositions during training. More specifically, given an image x, we
compute the compatibility score C(x, a, o), C(x, a) and C(x, o) for all (a, o) ∈ Ys. We then jointly optimize
F and G by the cross-entropy loss with temperature:
L = −(1/|T|) Σ_i [ log( e^{C(xi, ai, oi)/τC} / Σ_j e^{C(xi, aj, oj)/τC} )
              + log( e^{C(xi, ai)/τA} / Σ_j e^{C(xi, aj)/τA} )
              + log( e^{C(xi, oi)/τO} / Σ_j e^{C(xi, oj)/τO} ) ].   (3.7)
Intuitively, the cross-entropy loss will force the model to produce a higher compatibility score when
(x, a, o) matches and lower the score when a non-label composition occurs.
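A compact sketch of Eq. 3.7, assuming the compatibility scores have already been computed as logits over the candidate sets, is shown below; the function signature is ours, and the default temperatures follow the implementation details in Sec. 3.4.1.

```python
import torch
import torch.nn.functional as F

def caila_loss(comp_logits, attr_logits, obj_logits,
               target_comp, target_attr, target_obj,
               tau_c=0.01, tau_a=0.0005, tau_o=0.0005):
    """Sketch of Eq. 3.7: temperature-scaled cross-entropy over seen compositions,
    attributes, and objects. Each *_logits tensor holds raw compatibility scores
    C(x, .) of shape (B, num_candidates); targets are index tensors of shape (B,)."""
    return (F.cross_entropy(comp_logits / tau_c, target_comp)
            + F.cross_entropy(attr_logits / tau_a, target_attr)
            + F.cross_entropy(obj_logits / tau_o, target_obj))
```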
Inference. The generalized CZSL task requires models to perform recognition over a joint set of seen
and unseen compositions. Thus, for each test sample x, we estimate the compatibility score between x
and every candidate (a, o) inside the search space Ys ∪ Yu. We predict the image x as the composition
that has the highest compatibility score:
ŷ = argmax_{(a,o) ∈ Ys ∪ Yu} C(x, a, o).   (3.8)
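A minimal sketch of this prediction rule, assuming L2-normalized visual and pair embeddings so that the inner product serves as the compatibility score, could look as follows.

```python
import torch

def predict(image_feat: torch.Tensor, pair_feats: torch.Tensor, pair_labels):
    """Sketch of Eq. 3.8: pick the candidate (a, o) in Ys ∪ Yu with the highest compatibility.

    image_feat: (D,) L2-normalized visual feature F_C(x).
    pair_feats: (P, D) L2-normalized text embeddings G(a, o) for every candidate pair.
    pair_labels: list of P (attribute, object) tuples aligned with pair_feats.
    """
    scores = pair_feats @ image_feat          # cosine similarities as compatibility scores
    return pair_labels[int(scores.argmax())]
```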
Open World ◦ MIT-States ◦ C-GQA ◦ UT-Zappos
Model AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑) AUC (↑) HM (↑) S (↑) U (↑)
Without CLIP:
CompCos [143] 0.8 5.8 21.4 7.0 0.43 3.3 26.7 2.2 18.5 34.5 53.3 44.6
CGE [146] 1.0 6.0 32.4 5.1 0.47 2.9 32.7 1.8 23.1 39.0 61.7 47.7
KG-SP [92] 1.3 7.4 28.4 7.5 0.78 4.7 31.5 2.9 26.5 42.3 61.8 52.1
Co-CGECW [142] 1.1 6.4 31.1 5.8 0.53 3.4 32.1 2.0 23.1 40.3 62.0 44.3
Co-CGEopen [142] 2.3 10.7 30.3 11.2 0.78 4.8 32.1 3.0 23.3 40.8 61.2 45.8
With CLIP:
CLIP-ZS [168] 3.0 12.8 30.1 14.3 0.27 4.0 7.5 4.6 2.2 11.2 15.7 20.6
CoOp (a)[251] 4.7 16.1 36.8 16.5 0.73 5.7 20.9 4.5 19.5 35.6 61.8 39.3
CoOp (b)[251] 2.8 12.3 34.6 9.3 0.70 5.5 21.0 4.6 13.2 28.9 52.1 31.5
Co-CGE† [142] 5.6 17.7 38.1 20.0 0.91 5.3 33.2 3.9 28.4 45.3 59.9 56.2
CSP [149] 5.7 17.4 46.3 15.7 1.20 6.9 28.7 5.2 22.7 38.9 64.1 44.1
DFSP [137] 6.8 19.3 47.5 18.5 2.40 10.4 38.3 7.2 30.3 44.0 66.8 60.0
CAILA (Ours) 8.2 21.6 51.0 20.2 3.08 11.5 43.9 8.0 32.8 49.4 67.8 59.7
Table 3.3: Quantitative results on generalized CZSL in open world, all numbers are reported in percentage.
S and U refer to best seen and unseen accuracy on the curve. CLIP-ZS refers to the vanilla CLIP model
without fine-tuning. All CLIP-based models are run with ViT-L/14. Note that our models tested have
identical weights as in Tab. 3.1. †We run Co-CGE with similar CLIP features and report our best number
of the model. Models published before CGE are omitted as their performances are inferior to current
baselines.
We apply the prediction protocol to all benchmarks.
3.4 Experiments
3.4.1 Experiment Settings
Datasets. We evaluate CAILA on four popular datasets: MIT-States [83], C-GQA [146], UT-Zappos [234,
235] and VAW-CZSL [178]. For splits, we follow [146] for C-GQA, [178] for VAW-CZSL, and [165] for MIT-States/UT-Zappos. Statistically, the numbers of images in train/val/test are 29k/10k/10k for MIT-States,
23k/3k/3k for UT-Zappos, 26k/7k/5k for C-GQA, and 72k/10k/10k for VAW-CZSL.
Scenarios. We perform evaluation of CZSL models on both closed and open world scenarios and denote
them as • and ◦, respectively. Regarding the closed world setting, we follow [146, 165, 8] and conduct CZSL
with a limited search space. We further run models in the open world scenario, proposed by Mancini et
al. [143], to assess the scalability of CZSL models. It is worth noting that C-GQA becomes much more
challenging under the open world setting, as the size of the search space drastically increases from 2k
to nearly 400k. We also notice similar space expansions on MIT-States, while the number of possible
compositions does not increase much on UT-Zappos.
Evaluation Metrics. Our evaluation follows the generalized CZSL protocol adopted by [146, 165, 8,
143]. [165, 221] argue that it is unreasonable to evaluate only Yu as significant biases enter during training
and model selection. They suggest computing both seen and unseen accuracy with various bias values
added to unseen categories and taking the Area Under the Curve (AUC) as the core metric. We select our
models with the best AUC on val sets and report performance on test sets.
Furthermore, best-seen accuracy and best-unseen accuracy are calculated when other candidates are
filtered out by specific bias terms. We also report best Harmonic Mean (HM), defined as (2 ∗ seen ∗
unseen)/(seen + unseen).
Implementation Details: We build our model on the PyTorch [159] framework. For optimization, we use the Adam optimizer with a weight decay of 5e−5. The learning rate is set to 2e−5. The batch size is set to 32 for all datasets. The temperatures τC, τA, and τO are set to 0.01, 0.0005 and 0.0005, respectively. Most of the experiments are run on two NVIDIA A100 GPUs. We set the number of vision MoA layers M to 6 by default. For the downsampling function fDown, we set the reduction factor to 4. Ablation studies on these settings can be found in Sec. 3.4.3.
3.4.2 Quantitative Results
In this section, we present quantitative results in detail under both closed world and open world settings.
Such results verify the effectiveness of our method, which surpasses the current SOTA on most metrics,
in both scenarios.
Closed World Results. Performance in the closed world scenario is reported in Tab. 3.1 and 3.2. On
MIT-States, results show that CAILA overcomes the label noise and achieves SOTA. More specifically, on
AUC, we observe a 2.8% improvement, from 20.6% to 23.4%. Furthermore, regarding HM, CAILA achieves
Image Encoder / Model / Closed World AUC on •MIT-States, •C-GQA, •UT-Zappos
ViT B/32
CLIP-ZS* [168] 7.5 1.2 2.4
CLIP-FT [168] 10.9 7.6 21.1
Co-CGE† [142] 12.2 5.0 31.2
CSP* [149] 12.4 5.7 24.2
DFSP [137] 13.2 - 23.3
CAILA (Ours) 16.1 10.4 39.0
∆ +2.9 (21.9%) +2.8 (36.8%) +7.8 (25.0%)
ViT L/14
CLIP-ZS* [168] 11.0 1.4 5.0
CLIP-FT* [168] 14.4 10.5 4.8
CoOp* [251] 13.5 4.4 18.8
CLIP-Adapter* [56] 9.5 3.2 31.5
Co-CGE† [142] 17.0 5.7 36.3
CSP* [149] 19.4 6.2 33.0
DFSP [137] 20.6 10.5 36.0
CAILA (Ours) 23.4 14.8 44.1
∆ +2.8 (13.6%) +4.3 (41.0%) +7.8 (21.5%)
Table 3.4: Comparison of the AUC performance on all three benchmarks among CLIP-based models. ZS
and FT stand for zero-shot and fine-tuned. Best results are shown in bold and runner-ups are underlined.
∆ is calculated between CAILA and the second-best. Numbers with * are acquired from the CSP paper
[149]. †We obtain these numbers by running Co-CGE on similar CLIP features.
39.9%, outperforming all baselines. When it comes to best seen and unseen accuracy, our model improves
by ∼4% and ∼2%, respectively.
Our results on C-GQA further verify the advantage of CAILA, especially when the number of unseen
compositions is larger. On AUC, our model achieves a 4.3% absolute improvement (about 40% relative over the previous SOTA), from
10.5% to 14.8%. HM is also improved by 5.6%. Moreover, improvements of best seen and unseen accuracy
are 5.7% and 5.6%.
UT-Zappos has far fewer attribute and object categories than its counterparts and is thus much easier, as the gap between various methods is smaller. It is nonetheless notable that our model, CAILA, outperforms all other baselines, with a 7.2% improvement on the AUC metric.
Moreover, on the recently released benchmark VAW-CZSL, CAILA achieves noticeable improvements over baseline models, particularly over the newly published DFSP [137]: it improves the AUC by 3.1% while boosting the harmonic mean by 3.5%.
Open World Results. We further conduct experiments under the open world setting to evaluate the
robustness of CAILA. Results are shown in Tab. 3.3. Noticeably, open world is much harder than closed
world, as performance on all benchmarks drops drastically, while CAILA achieves SOTA on most metrics
in this scenario without any filtering techniques adopted in the previous papers [143, 142, 149].
On MIT-States, our approach clearly beats SOTA on all metrics, particularly AUC. Our model improves AUC from 6.8% to 8.2% and achieves a 21.6% harmonic mean. Moreover, CAILA improves seen accuracy by 3.5% and unseen accuracy by 0.2%.
The performance of CAILA on C-GQA in the open world scenario is consistent with the one in closed
world. More specifically, our model achieves 3.08% AUC, about 1.28× that of DFSP [137]. We also observe a ∼10%
relative improvement on harmonic mean, from 10.4% to 11.5%. CAILA achieves 5.6% and 0.8% boosts on
seen and unseen.
Regarding UT-Zappos, our model also brings in performance gains. It achieves a 49.4% harmonic mean,
4.1% higher than Co-CGE. CAILA also achieves the best AUC of 32.8%, at least 2% higher than the other baselines.
Comparisons between CLIP-based Methods. We further make head-to-head comparisons between
CAILA and other approaches built with CLIP in Tab. 3.4, with variations on the vision encoder: ViT-B/32
[47] and ViT-L/14. Results verify CAILA’s effectiveness and consistency with different visual backbones.
In particular, CAILA achieves >35% relative improvements on C-GQA against other baselines.
Discussion. Given that CLIP is trained on a web-scale dataset, ensuring fair comparisons between
CLIP-based [137, 142, 149] and CLIP-free methods [92, 143, 146] can be difficult, particularly as CLIP-based
methods significantly outperform CLIP-free ones. We follow the setting in existing CLIP-based methods
[137, 142, 149, 168, 251] with a focus on enhancing CLIP-based CZSL. Comparisons between CAILA and
fine-tuned CLIP models show that a partially tuned model can beat its fully fine-tuned counterpart by a large margin, justifying that CAILA better suppresses training biases while remaining sharp on knowledge transfer for CZSL, and is thus a better way to exploit CLIP knowledge.

Figure 3.5: Ablation studies: (a) the number of vision MoA layers M; (b) the ratio of concept shift; (c) the reduction factor of fDown.
3.4.3 Ablation Studies
We conduct ablation studies with CLIP ViT-B/32 on MIT-States in the closed world setting.
Adapter and MoA. We evaluate different adapter/MoA settings on MIT-States and report results
in Tab. 3.5. We observe that compared with CSP [149], adding adapters to either side of encoders can
effectively improve the performance while attaching adapters to both sides shows further improvements.
Experiments in the last three rows verify that our Mixture-of-Adapters mechanism further improves the
performance when it is applied on both sides.
Vision Mixture Strategies. We compare different ways of mixing z and h′ inside the vision MoA layer, as described in Eqns. 3.4 and 3.5. Tab. 3.6 shows the results of mixing only one of the feature vectors or none at all. The last row corresponds to averaging FA(x), FO(x), FC(x) without intra-layer mixture, which is similar to the language-side MoA. Experiment results demonstrate that mixing both z and h′ as proposed in Sec. 3.3.3 yields optimal performance, while applying a strategy similar to the language side hurts.
Vision Mixture Functions. We evaluate various mixture functions of vision MoA besides the default
mean function, including summation (Sum.), element-wise multiplication (Mul.), and concatenation (Concat.). We add one linear layer after “Concat” to align the feature dimension with upcoming operations.
Adapter MoA • MIT-States
V L V L AUC (↑) HM (↑) S (↑) U (↑)
CSP [149] 12.4 28.6 36.4 42.5
✓ 14.0 30.1 41.4 42.0
✓ 13.9 30.5 40.3 42.8
✓ ✓ 14.4 30.7 42.2 43.2
✓ ✓ ✓ 15.4 31.4 43.4 44.5
✓ ✓ ✓ 15.2 31.7 41.6 44.8
✓ ✓ ✓ ✓ 16.1 32.9 43.3 45.6
Table 3.5: Ablation on adapters and MoA modules. V and L refer to Vision and Language, respectively.
Closed World Mixture • MIT-States
Model z h′ AUC (↑) HM (↑) S (↑) U (↑)
CAILA (Ours)
✓ ✓ 16.1 32.9 43.3 45.6
✓ 15.8 32.2 43.3 45.2
✓ 15.5 31.7 43.0 45.1
15.5 32.0 42.7 44.8
Table 3.6: Ablation on vision MoA strategies.
Results in Tab. 3.7 show that the “Mean” operation performs the best. We also notice that the variation
with "Sum." performs worse, possibly because summation greatly changes the magnitude of the feature
vector.
Learnable Prompts. We perform experiments to study the effect of learnable prompts in our framework. Results reported in Tab. 3.8 show that our model remains competitive even with prompt embeddings fixed. This confirms that the performance gains of CAILA come from the designs discussed in the Approach section.
CAILA Setups. We explore different aspects of our setup and show the results in Fig. 3.5. Fig. 3.5(a)
demonstrates that CAILA performs better with MoA layers and achieves the best performance with 6
MoA layers on the vision side, which is also better than the single-stage MoA when M=0; Fig. 3.5(b)
indicates that replacing 10% of a batch with post-shift features can increase the AUC while adding more
shift reduces it; In Fig. 3.5(c), we find that the optimal reduction factor for the latent feature z is 4, while
Closed World Mix. Fn.
• MIT-States
Model AUC (↑) HM (↑) S (↑) U (↑)
CAILA (Ours)
Mean 16.1 32.9 43.3 45.6
Sum. 14.6 30.7 42.8 42.1
Mul. 15.8 32.2 43.3 45.0
Concat. 15.2 31.9 41.8 44.8
Table 3.7: Ablation on vision MoA mixture functions.
Closed World • MIT-States
Model AUC (↑) HM (↑) S (↑) U (↑)
CAILA(Ours) 16.1 32.9 43.3 45.6
w/o Learnable Prompts 15.8 32.1 43.5 44.6
DFSP [137] 13.2 29.4 36.7 43.4
CSP [149] 12.4 28.6 36.4 42.5
Table 3.8: Ablation study on learnable prompts.
using higher reduction factors does not affect the performance significantly and can be considered for
efficiency reasons.
Chapter 4
Vision Language Transformers for Fashion Retrieval with Feedback
4.1 Introduction
The task of feedback-based fashion image retrieval involves fetching images of clothing items that match
a customer’s needs and preferences. A customer starts with an initial request to search for a fashion item
and participates in multiple turns of interaction with the conversational assistant until they get the result
that they are satisfied with. A key challenge in this use-case is to retrieve a new candidate image based
on both the previously retrieved image and the new feedback provided by the customer. Figure 4.1 shows
examples of feedback-based fashion image retrieval.
Substantial progress [205, 113, 72, 26] has been made on this topic by designing strong image-text
composers using image and text features from separate neural networks. Recently, Vision-Language Pretrained (VLP) transformers [240, 124, 136, 116, 252, 27, 196, 193, 256] have been shown to be capable of
learning joint representations for images and text directly by training on large-scale image-text corpora.
In this work, we propose a VLP transformer-based model, FashionVLP, for fashion image retrieval with
textual feedback, which leverages prior knowledge from large corpora and image features from multiple
fashion-related context levels.
Our model is composed of two parallel blocks – one for processing the reference image and the feedback, and another for processing target images. The reference block starts with extracting image features
Figure 4.1: Fashion image retrieval with textual feedback ((a) example of simple modification; (b) example of compound modification). The input query to the system includes a reference image and a comment specifying changes to be made to the image. The system retrieves fashion items with the desired changes accordingly.
at multiple levels of context: (1) whole image, (2) cropped image of clothing, (3) regions around fashion
landmarks [134], and (4) regions of interest determined by a pretrained object detector. These features
along with object tags from the detector and word tokens from the textual feedback are then fed into a
multi-layer transformer model to compute a final joint representation for reference. On the target side,
features at contexts (1)–(3) are computed using only image feature extractors for efficient low-cost inference, and fused using a contextual attention mechanism instead of transformer layers to generate a target
encoding for each candidate image. The model is trained using cosine similarity and a batch-based classification loss where the target for each reference image is used as a negative sample for other reference
images. Finally, retrieval is performed by ranking candidates using cosine similarities between reference
and target encodings.
We evaluate FashionVLP on three common fashion image retrieval datasets: FashionIQ [220], Shoes
[15] and Fashion200K [63]. Unlike other datasets, FashionIQ contains real human comments on specific
reference-target image pairs and is hence much more challenging for the fashion image retrieval task. Results show that FashionVLP improves performance on FashionIQ by a significant 23% relative gain. This validates the capability of our framework in dealing with complicated real-life image-feedback pairs when conducting fashion image retrieval. Our model also surpasses the state-of-the-art on
Shoes and Fashion200K datasets.
Our work makes the following contributions:
• We propose a new transformer-based model that leverages prior knowledge from large image-text corpora for fashion image retrieval with textual feedback.
• We provide a way for effectively incorporating multiple levels of fashion-related visual context for both
reference and candidate images within our asymmetric design.
• Our model outperforms previous works on benchmark datasets, with 23% relative gain on FashionIQ.
4.2 Related Work
Fashion Image Retrieval with Textual Feedback: The classic image retrieval task is a long-standing
fundamental problem [203, 32] in computer vision, which requires comparison of reference and target
images in a scalable way. Tremendous advances have been made in this field recently with the development
of deep learning based methods [152, 7, 167, 60]. Alternative formulations use natural language text-based
queries for image retrieval [180, 228, 139, 34, 242, 249].
Fashion image retrieval with textual feedback is different from the classic image retrieval problem as it
takes both a reference image and a textual feedback for modifying the reference as query inputs, as shown
in Fig. 4.1. Intuitively, this task can be solved via text-based visual relationship reasoning [179, 97, 162],
where text features are injected into image feature extractors to get modified image encodings, which are
then used for retrieval [162]. However, these methods do not explicitly combine visual and textual features
into a joint semantic space, leading to poor performance.
In contrast, previous methods developed specifically for this task typically fuse the image and textual
inputs into joint embeddings for retrieval. For example, TIRG [205] learns a gated feature and a residual
feature for each image-text query and composes them into a joint encoding. The CIRPLANT [133] model
fuses linguistic and visual information using a transformer while VAL [26] learns multiple transformers
for the same at various levels through an attention mechanism. The objective function of VAL is designed
to measure the feature similarities in a hierarchical manner. Hosseinzadeh et al. [73] compose images and
text through locally bounded features (LBF). The state-of-the-art CosMo [113] models content and style
changes between images and uses deep Multi-modal Non-Local (DMNL) [209] blocks to compose different
types of changes.
Vision-Language Pretrained Transformers: Unlike transformers used in natural language processing [169, 18, 43], image classification [46, 199, 132], object detection [20, 253], and video understanding [57],
Vision-Language Pre-trained (VLP) transformers are trained through self-supervision on large image-text
corpora to capture prior multi-modal knowledge contained within them [252, 119, 27, 116, 193, 240, 124].
In this work, we apply VLPs to the problem of fashion image retrieval with textual feedback, so that we
can benefit from the rich multi-modal information contained in their model weights. Our model is based
on the state-of-the-art VLP VinVL [240], but is tailored with architectural additions for fashion retrieval
and trained in a metric learning manner. Table 4.1 compares our model with previous works.
4.3 FashionVLP
As shown in Table 4.1, previous works in this domain share two common aspects: (1) only the whole image is used as input, and it is represented as global features after average pooling from convolutional feature maps, and (2) customized modules are used for composing reference image and text features. However, both considerations are somewhat idealistic in the context of fashion image retrieval. More precisely, utilizing only whole fashion images implicitly requires robust feature extractors that generalize across fashion
Table 4.1: Comparison of related works. The general VLP model VinVL is included for reference as our
FashionVLP is based on this model. The columns T, W, C, R, and L refer to the inputs: text, whole image
features, cropped clothing features, RoI features, and landmark features, respectively. AttNet refers to our
new attention-based module for generating image encodings by fusing multiple contextual features.
Method / Reference: T W C R L / Target: W C L / Fusion / Feats.
TIRG ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ Residual CNN
VAL ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ Transformer CNN
LBF ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ Cross-Attn CNN
CosMo ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ DMNL CNN
CIRPLANT ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ VLP CNN
VinVL ✓ ✓ ✗ ✓ ✗ – – – VLP –
FashionVLP ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ VLP AttNet
items with variations in size, rotation, pose, background, etc. Further, the use of global image features
assumes that these are themselves sufficient and contain enough local information for retrieval. Using
custom heuristic modules for image and text composition further raises concerns about generalization
across fashion item types and variation in textual feedback. In order to address both considerations,
we design a novel method for fashion retrieval with textual feedback.
Our method incorporates a VLP module for multi-modal information fusion as VLPs are known to
generalize well across several domains [119, 27, 116, 193, 240, 124]. The inclusion of VLPs also brings the
prior knowledge contained in large image-text corpora to the feedback-based fashion retrieval domain.
Further, the core transformer design of VLPs allows for composition of additional modalities, e.g. , regions
of interest (RoIs) [124, 240]. In order to better fit the fashion retrieval task, we introduce a novel overcomplete image representation that fuses multiple levels of fashion-related contextual information, namely,
whole image, cropped clothing region, fashion landmarks, and regions of interest. Intuitively, such a
representation provides direct and explicit inputs that are correlated with words in the feedback, and thus
eases the fusion of linguistic and visual information, improving generalization.
Figure 4.2: FashionVLP Overview. The model processes reference image-feedback pairs and target image candidates using parallel blocks, Reference and Target. Both blocks extract image features at multiple contextual levels, namely, whole image, cropped clothing, fashion landmarks, and regions of interest (RoIs), to focus on different fashion-related aspects of images. The Reference block fuses these image features with feedback inputs to generate joint reference embeddings fref through a transformer module that contains self-attention. The Target block fuses image representations through multiple attention modules to generate target embeddings ftar. The reference and target embeddings are then compared during training and inference for ranking candidate images for a query reference image and feedback pair.
4.3.1 Overview
Fashion image retrieval with textual feedback requires effective fusion of information between visual attributes of the reference image and the linguistic content of the feedback. The tremendous success of Vision-Language Pre-trained (VLP) transformers at learning joint representations of such data makes them extremely suitable for this task. The efficacy of transformers is attributed to their self-attention mechanism, which allows information from non-adjacent inputs to be fused directly, unlike in traditional recurrent networks. Specifically, a transformer applies linear projections to its input features X ∈ R^(N×dmodel) to produce representations Q ∈ R^(N×dk), K ∈ R^(N×dk) and V ∈ R^(N×dv). The output of a self-attention module is then computed as:

Attn(Q, K, V) = softmax(QK^T / √dk) V ∈ R^(N×dv).   (4.1)

The learning capacity of the attention module can be further improved by the multi-head design, formulated as:

MultiHead(Q, K, V) = Concat(h1, h2, . . . , hh) WO,   (4.2)
hi = Attn(Qi, Ki, Vi),   (4.3)

where WO is a linear layer that projects the concatenated features back to dmodel dimensionality. The output of the attention module is post-processed [204] by a Feed-Forward Network and several Layer Normalization [9] layers.
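For reference, a minimal PyTorch sketch of the multi-head self-attention in Eqs. 4.1–4.3 is shown below; it is a generic implementation of the formulation rather than the exact module used in FashionVLP.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of Eqs. 4.1-4.3: scaled dot-product self-attention with h heads."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # W_O in Eq. 4.2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        # Project and split into heads: (B, h, N, d_k).
        q, k, v = (proj(x).view(B, N, self.h, self.d_k).transpose(1, 2)
                   for proj in (self.q_proj, self.k_proj, self.v_proj))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)  # Eq. 4.1
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)  # concatenate heads
        return self.w_o(out)  # Eq. 4.2
```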
Multi-layer transformers are built by stacking multiple transformer blocks sequentially. A typical example is BERT [43], which is designed for natural language tasks but has been extended to the multi-modal
domain. Many VLP models are initialized with original BERT weights and further trained on domain-specific pre-training tasks [136, 116, 124].
In this work, we propose a new VLP model, FashionVLP, for fashion image retrieval with textual
feedback. Figure 4.2 provides an overview of our framework. The basic setup of the retrieval task requires
learning representations of (a) the reference image and the textual feedback and (b) target images in a
database in order to compare and search for candidate images to present to the user. Intuitively, our
model consists of two parallel blocks – the reference block and the target block for encoding (a) and (b),
respectively.
The reference block extracts features at multiple levels of context: (1) whole image, (2) cropped image of clothing, (3) regions around fashion landmarks [134], and (4) regions of interest determined by a
pretrained object detector. These features are then fused with object tags from the detector and word tokens from the feedback through a multi-layer transformer. The transformer output is treated as the final
joint encoding fref for the reference image and feedback pair. The target block also computes encodings
at contexts (1)–(3) using the same image feature extraction layers, but these are fused using a contextual
attention mechanism instead of the transformer layers to reduce computation costs at inference, and projected to the dimensionality of fref to generate representations ftar for target images. This allows for a
scalable design for efficiently computing embeddings for fast-growing reference databases.
The model is trained using cosine similarity to compare reference and target embeddings. Subsequently, retrieval is performed by ranking candidate images using their similarities with the given reference image and textual feedback. In the following sections, we describe the computation of text and image
features and the training methodology.
4.3.2 Linguistic Embedding
We tokenize the textual feedback through the pre-trained Oscar [124] tokenizer from VinVL [240]. The text is represented as a sequence of word tokens t = {w1, w2, . . . , wT}, where T is the length of the text. We append a special [CLS] token to the beginning of the sequence. When the feedback has multiple sentences, we combine all the tokens into one sequence but separate sentences using [SEP] tokens. The tokens are then mapped to R^(T×dmodel) by an embedding layer. Finally, we add positional encoding to the sequence to preserve positional information.
Figure 4.3: Fashion landmarks visualization for different clothing types. Landmarks reflect essential points
such as neckline, armpits, etc., that provide useful visual cues for fashion retrieval.
4.3.3 Image Embeddings
Our model employs a ResNet [68] as the backbone feature extractor for whole, cropped, and landmark representations. We use a publicly available (https://git.io/JPAO4) pretrained Cascaded Pyramid Network [29]
to extract fashion landmarks from input images. As shown in Figure 4.3, these landmarks are different for
each clothing category and capture fashion-related semantics, e.g. , hem line, waist line, etc. Although not
trained on shoes, the model is capable of effectively capturing meaningful points like tip, heel, etc.
Whole Image Representation: We obtain spatial image representations from the last convolutional block of a ResNet feature extractor with dimg channels by flattening it into a feature sequence f^spat_whole ∈ R^(HW×dimg). The self-attention mechanism in the transformer layers of the reference block allows features from all positions on the feature map to be modeled simultaneously. Hence, we directly use this feature sequence as a part of the reference image representation. In the target block, however, we fuse these spatial features into global features. Global features are commonly computed by average-pooling spatial features, which fails to preserve location-specific salience. Therefore, we propose a positional attention module, a 1 × 1 convolution layer with dimg filters, to extract a global representation f^glob_whole ∈ R^dimg from the spatial features f^spat_whole ∈ R^(HW×dimg):

f^glob_whole = PositionalAttn(f^spat_whole) ∗ f^spat_whole.   (4.4)
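A possible PyTorch sketch of this positional attention module is given below; the gating nonlinearity and the final sum over spatial positions are assumptions of the sketch, since Eq. 4.4 only specifies the 1 × 1 convolution and the element-wise re-weighting.

```python
import torch
import torch.nn as nn

class PositionalAttention(nn.Module):
    """Sketch of Eq. 4.4: a 1x1 convolution produces per-position weights that
    re-weight spatial features before pooling them into a global vector."""

    def __init__(self, d_img: int):
        super().__init__()
        self.attn = nn.Conv2d(d_img, d_img, kernel_size=1)  # 1x1 conv with d_img filters

    def forward(self, f_spat: torch.Tensor) -> torch.Tensor:
        # f_spat: (B, d_img, H, W) spatial feature map from the ResNet backbone.
        weights = torch.sigmoid(self.attn(f_spat))   # assumed gating nonlinearity
        weighted = weights * f_spat                  # PositionalAttn(f) * f
        return weighted.flatten(2).sum(dim=-1)       # (B, d_img); pooling over positions is assumed
```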
Cropped Clothing Representation: We use fashion landmarks to generate cropped clothing images from the given (whole) images in order to process their "zoomed in" versions through the feature extractor and better capture features from clothing regions. We then compute cropped clothing encodings in the same way as whole image embeddings. Specifically, the reference block computes f^spat_crop to provide as input to the transformer, while the target block further generates f^glob_crop using positional attention.
Fashion Landmark Representation: We explicitly incorporate fashion semantics in our model by extracting feature maps corresponding to L fashion landmark positions from the second convolutional block of the ResNet feature extractor, which preserves more localized information. We then project these features to match the number of channels in the whole and cropped encodings, producing f^spat_lmk ∈ R^(L×dimg). This is then directly used as input to the transformer as a part of the image representation in the reference block. However, in the target block, we use a landmark attention module, another 1 × 1 convolution with dimg filters, to combine f^spat_lmk ∈ R^(L×dimg) and generate f^glob_lmk ∈ R^dimg.
RoI-level Representation: Due to the size of image-text corpora typically used to train VLPs, CNNs are
usually not integrated into the framework. Instead, existing models extract RoI-level features through a
pre-trained object detector. The semantic information in RoIs is indeed crucial for feedback-based fashion
image retrieval. For instance, when a customer asks for changes on sleeves, a model with RoI information
would be able to place higher attention on RoIs corresponding to arms. Therefore, we include RoI-level features, fRoI , extracted by a pre-trained object detector as part of our image representation in the reference
block.
We use a publicly available [240] Faster-RCNN-C4 [175] with ResNeXt-152 [225] backbone (X152-C4)
trained on MSCOCO [128], Visual Genome [103], Objects365 [184], and OpenImage V5 [107], following [240]. The RoIs are filtered by a confidence threshold ϵ. In addition, [124, 240] state that the object
category for each region can act as an anchor between images and text. We follow the setting in [124] and
append object tags to the end of the linguistic input in the reference block, separated by the [SEP] token.
Region Position Encoding: In order to preserve positional information of the extracted RoI features, we encode their region position into a 6-dimensional vector as

fpos = [ x1/w, y1/h, x2/w, y2/h, (x2 − x1)/w, (y2 − y1)/h ],   (4.5)

where [x1, y1, x2, y2] denotes the bounding box of the RoI and h, w are the image dimensions. We combine fpos with fRoI to create position-aware region representations. We similarly combine f^spat_whole and f^spat_crop with their corresponding position encodings according to their field of view.
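Eq. 4.5 translates directly into a few lines of code; the sketch below assumes boxes given in pixel coordinates as an (R, 4) tensor.

```python
import torch

def region_position_encoding(boxes: torch.Tensor, w: float, h: float) -> torch.Tensor:
    """Sketch of Eq. 4.5: encode each RoI box [x1, y1, x2, y2] into a 6-d vector
    of normalized corner coordinates plus normalized width and height."""
    x1, y1, x2, y2 = boxes.unbind(dim=-1)  # boxes: (R, 4) in pixel coordinates
    return torch.stack(
        [x1 / w, y1 / h, x2 / w, y2 / h, (x2 - x1) / w, (y2 - y1) / h], dim=-1
    )
```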
Combined Reference Image Representation: The final image representation fed into the transformer layers in the reference block consists of (1) f^spat_whole, (2) f^spat_crop, (3) fRoI, (4) f^spat_lmk, and (5) position encodings for (1)–(3).
Fused Target Representation: We combine f^glob_whole, f^glob_crop, and f^glob_lmk using a contextual attention module, which is a 1 × 1 convolution layer, to get a fused target representation.
4.3.4 Model Training
The transformer takes three input segments: linguistic features, object tags, and image features. We take the output hidden state of the [CLS] token from the transformer as the reference embedding fref. Meanwhile, for the target representation, we extract the fused features and project them into the joint feature space to get ftar. We then compute the similarity of fref and ftar with a kernel function κ.
Figure 4.4: Qualitative results on FashionIQ. We show reference images on the left and top-10 retrievals with descending scores on the right. Ground-truths are shown with boxes. Feedback in FashionIQ is complex yet realistic and can contain multiple concepts simultaneously.
We adopt a batch-based classification loss [205], where each entry inside a batch acts as a negative
sample for all other entries. This objective function converges faster [181] than triplet loss, especially on
complex datasets. For a batch of B image-text pairs, the loss is defined as:
L = (1/B) Σ_{i=1}^{B} − log [ exp(κ(f^i_mod, f^i_tar)) / Σ_{j=1}^{B} exp(κ(f^i_mod, f^j_tar)) ].   (4.6)

The kernel κ in Equation (4.6) can be any metric, but we use the inner product in this work, resulting in cosine similarity.
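A minimal sketch of this batch-based objective, assuming L2-normalized embeddings so that the inner-product kernel equals cosine similarity, is shown below; the function name is ours.

```python
import torch
import torch.nn.functional as F

def batch_classification_loss(f_mod: torch.Tensor, f_tar: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. 4.6: each reference/feedback embedding is matched against all
    target embeddings in the batch; the diagonal entries are the positives."""
    sims = f_mod @ f_tar.t()                                     # (B, B) kappa(f_mod^i, f_tar^j)
    targets = torch.arange(f_mod.size(0), device=f_mod.device)   # positive index i for row i
    return F.cross_entropy(sims, targets)
```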
We train our network by fine-tuning the transformer together with the feature extractor and the attention modules. The feature extractors in reference and target blocks share weights to prevent overfitting.
We do not fine-tune the object detector as the Region Proposal Network [175] inside X152-C4 cannot be
trained without separate loss functions.
4.4 Evaluation
We evaluate models on FashionIQ [220], Shoes [15], and Fashion200K [63]. We compare our model with
state-of-the-art methods: TIRG [205], VAL [26], and CosMo [113]. We additionally present results of visual-reasoning-based baselines: RN [179], MRN [97], and FiLM [162]. In the following sections, we describe the
experiment setup, present evaluation results, and discuss ablation studies.
4.4.1 Experiment Setup
Implementation Details: We use ImageNet [41] pretrained ResNet-50 for FashionIQ and Shoes, and
ResNet-18 for Fashion200K, as image feature extractors following [113]. We employ BERT-base [43]
from [240] as the transformer. We use the Adam [100] optimizer with β = (0.55, 0.999) and train models
for 100 epochs, halving the learning rate every 10 epochs. We use batch sizes of 80 and 92 for FashionIQ
and Shoes, respectively, with an initial learning rate of 4e−4 and a warm-up period of 150 iterations. We
set batch size as 200, initial learning rate as 1e−3, and warm-up period as 500 iterations for Fashion200K
due to its large size. The detection confidence threshold ϵ is set to 0.5.
Inference: During inference, we process queries and candidates in the dataset separately. Candidate features are extracted by the target block containing only image feature extractors and attention modules,
while reference and feedback queries are processed by the reference block including the transformer module as described in Section 4.3.1. We then compute cosine similarities for ranking the candidates.
Evaluation Metric: Models are evaluated using the standard top-K recall metric for image retrieval, denoted as R@K. Performance is compared specifically on the average of R@10 and R@50 as a metric of
overall performance.
4.4.2 Results
FashionIQ [220]: This is a fashion retrieval dataset with interactive natural language captions. Items
belong to three types: Dresses, Tops&Tees, and Shirts. It contains 77K images in total with 46K images for
training and 18K image pairs available. Each pair has two crowdsourced captions that describe changes
from the reference to the target. The feedback is complicated and sometimes a sentence includes multiple
Table 4.2: Quantitative results on FashionIQ. Our model surpasses the state-of-the-art by a large margin
on all three sub-categories. We report results with both the VAL evaluation protocol [113, 26] and the
Original evaluation protocol. CT denotes CIRPLANT [133].
Dress Toptee Shirt Overall
Method R@10 R@50 R@10 R@50 R@10 R@50 R@10 R@50 Mean
VAL [113, 26] Evaluation Protocol
RN [179] 15.44 38.08 21.10 44.77 18.33 38.63 18.29 40.49 29.39
MRN [97] 12.32 32.18 18.11 36.33 15.88 34.33 15.44 34.28 24.86
FiLM [162] 14.23 33.34 17.30 37.68 15.04 34.09 15.52 35.04 25.28
TIRG [205] 14.87 34.66 18.26 37.89 19.08 39.62 17.40 37.39 27.40
CT [133] 17.45 40.41 17.53 38.81 21.64 45.38 18.87 41.53 30.20
VAL [26] 21.12 42.19 25.64 49.49 21.03 43.44 22.60 45.04 33.82
CosMo [113] 25.64 50.30 29.21 57.46 24.90 49.18 26.58 52.31 39.45
FashionVLP 32.42 60.29 38.51 68.79 31.89 58.44 34.27 62.51 48.39
Original Evaluation Protocol
Image Only 4.46 13.19 5.46 13.21 6.13 13.64 5.35 13.35 9.35
Concat 14.92 34.95 14.28 34.73 12.71 30.08 13.92 33.25 23.59
TIRG [205] 14.13 34.61 14.79 34.37 13.10 30.91 14.01 33.30 23.66
CosMo [113] 21.39 44.45 21.32 46.02 16.90 37.49 19.87 42.62 31.25
FashionVLP 26.77 53.20 28.51 57.47 22.67 46.22 25.98 52.30 39.14
concepts to be changed, e.g. , “is patterned and has a halter neckline”, “is black with floral patterns”, etc.
The complex yet realistic nature of the feedback in this dataset makes it exceptionally challenging for the
retrieval task.
We follow the evaluation protocol of [26, 113], where the candidate set is constructed by unifying all
reference and target images in the test set. This reduces the number of images for retrieval, compared with
the original test set, resulting in higher performance for all models. We evaluate models on the reduced
set (VAL [26] evaluation protocol) for fair comparison with previous works, but also report results for the
original evaluation protocol for future reference.
Quantitative results are presented in Table 4.2. Our model outperforms the previous state-of-the-art
by a large margin on all metrics. Specifically, for the VAL evaluation protocol, our approach achieves a
relative improvement of more than 29% on R@10 and 19% on R@50. Furthermore, we observe a broad
Figure 4.5: Qualitative results on Fashion200K. We show reference images on the left and top-10 retrievals with descending scores on the right. Ground-truths are shown with boxes. Note that a query pair can correspond to multiple valid target images in this dataset. Due to the lack of human annotated feedback, comments in Fashion200K follow the template "replace [sth] with [sth]" and are thus less instructive.
Figure 4.6: Qualitative results on Shoes. We show reference images on the left and top-10 retrievals with descending scores on the right. Ground-truths are shown with boxes. Feedback in Shoes is fine-grained and contains concepts belonging to the fashion domain of shoes.
23% relative improvement over all fashion types, indicating that our model generalizes well across them.
Finally, FashionVLP also shows an overall 25% relative improvement for the original evaluation protocol.
Figure 4.4 presents some examples of retrieval. As shown, feedback sentences in FashionIQ are complex
and contain multiple concepts. Our model is able to capture such diverse concepts and retrieve good
candidate images.
Fashion200K [63]: This is a large-scale fashion dataset with images from various online shopping websites. It contains more than 200K images (training: 172K, testing: 33K) and a feedback vocabulary of more
than 5K words. Images are labeled with descriptions like “blue women’s embroidered midi-dress”, and
attributes including product information and user reviews. In our experiments, we only utilize images
and their descriptions. Following [205], we generate textual feedback through an automated process that
Table 4.3: Quantitative results on Fashion200K. Our model achieves the best results on Recall@50 and
mean recall.
Method R@10 R@50 Mean
RN [179] 40.5 62.4 51.4
MRN [97] 40.0 61.9 50.9
FiLM [162] 39.5 61.9 50.7
TIRG [205] 42.5 63.8 53.2
VAL [26] 49.0 68.8 58.9
CosMo [113] 50.4 69.3 59.8
FashionVLP 49.9 70.5 60.2
compares attributes between pairs of images. The feedback is structured in the form of “replace [sth] with
[sth]”, which is much simpler than feedback in FashionIQ and Shoes.
Our results presented in Table 4.3 show that our model outperforms the previous state-of-the-art by a
relative improvement of 1.7% on R@50. Although R@10 is slightly lower, our model achieves an overall
relative improvement of 0.6%. We attribute smaller gains on Fashion200K to the fixed unnatural templated
nature of feedback in this dataset, as shown in Figure 4.5. Such text is closer to attribute-like feedback [26]
than to natural language sentences. As FashionVLP aims to bring the benefits of strong natural language
priors to the task of fashion retrieval, such knowledge is not as beneficial for Fashion200K. However, our
model still achieves the best results on this dataset.
Qualitative results in Figure 4.5 show that multiple images are considered correct for a query if their
captions are identical. Results show that our model can recognize attribute changes in the feedback and
retrieve images accordingly.
Shoes [15]: This dataset was originally collected to extract attribute information from web images. Guo
et al. [62] tagged the images with captions in natural language for fashion image retrieval. We use the
original splits in [62], which provides 10K training pairs and 4.6K test queries.
Table 4.4 shows that our model achieves the best results on this dataset, with relative improvements of
2.2% on R@50, 1.5% on R@10, and 1.9% on average. Qualitative results in Figure 4.6 show that our model
Table 4.4: Quantitative results on Shoes. Our model achieves the best results on Recall@50 and mean
recall.
Method R@10 R@50 Mean
RN [179] 45.10 71.45 58.27
MRN [97] 41.70 67.01 54.35
FiLM [162] 38.89 68.30 53.59
TIRG [205] 45.45 69.39 57.32
VAL [26] 49.12 73.53 61.32
CosMo [113] 48.36 75.64 62.00
FashionVLP 49.08 77.32 63.20
Table 4.5: Ablation study on FashionIQ on different contextual image features. PositionalAttn, RoI, Lmk,
Crop and Whole refer to positional attention, RoI encodings, landmark features, and embeddings from
cropped and whole images, respectively.
Method R@10 R@50 Mean
FashionVLP 34.27 62.51 48.39
w/o PositionalAttn 33.75 61.43 47.59
w/o Lmk 33.28 60.77 47.02
w/o Lmk, Crop 32.60 59.75 46.18
w/o Lmk, Crop, RoI 31.67 60.06 45.86
w/o Lmk, Crop, Whole 31.34 59.84 45.59
can perceive both simple visual changes like color and complex visual properties like patterns and shoe
models for retrieving candidate images.
4.4.3 Ablation Studies
We present ablation studies to provide insights into how different contextual image information and model
components affect performance. We perform these studies on FashionIQ as it contains complex and realistic feedback.
Contextual image features: We analyze the effect of positional attention, landmark, cropped clothing,
and RoI features on the retrieval performance by evaluating versions of our model trained without these
encodings. Results in Table 4.5 show that excluding these contextual pieces reduces performance. Specifically, using global average pooling instead of positional attention to combine spatial features results in a
Figure 4.7: Visualization of attention on relevant words in textual feedback and different contextual image features for two sample pairs. Words with the highest attention weights are shown in bold. For each level of context in the reference block, we visualize the attention heatmap of the corresponding most attended word, and observe effective correspondence between bold words and relevant image regions. On the target side, we visualize attention heatmaps corresponding to our positional and landmark attention modules, showing that these modules effectively capture important fashion information. Results of attention in the reference (left) and the target (right) blocks further show that the whole image modality is insufficient; for example, the upper sample's whole image representation for the target image lacks any useful information. Further, fashion landmarks provide important cues for fashion-specific concepts, e.g., strap, hem, etc.
1.7% relative reduction in mean recall. Removing landmark features causes a 2.83% relative drop. Additionally excluding cropped clothing encodings results in a 4.57% drop. Further removing RoI features causes
5.23% degradation. Excluding whole image encodings instead of RoI features as in VinVL [240] leads to a
5.79% relative drop.
Fashion landmark features and fusion methods: We first study the effects of different methods of
generating landmark features: (1) normalized landmark coordinates, (2) indexed features from the third
convolutional block of the ResNet feature extractor, and (3) those from the second convolutional block.
Results in Table 4.6 show that using features from the lower (second) block achieves the best performance,
indicating that the fine-grained local information provided by this block is useful for fashion retrieval.
We also study the effects of different methods of fusing image features in the target block. Adding contextual and landmark attention to combine whole, cropped clothing, and landmark features (second convolutional block) provides 3.8% relative improvement compared to simply concatenating the said features.
Table 4.6: Ablation study on FashionIQ on different methods of generating landmark representations
(LmkRep) and combining them. Conv Block2 and Conv Block3 indicate that features for each landmark
are extracted from the 2nd and the 3rd convolutional blocks of the feature extractor, respectively. Norm
coords refers to the use of normalized landmark positions as feature values. For fusion, we compare the
effects of the context (Ctx) and the landmark (Lmk) attention (Attn) modules with simply concatenating
the features.
LmkRep Ctx Attn Lmk Attn R@10 R@50 Mean
Conv Block2
✓ ✓ 34.27 62.51 48.39
✓ ✗ 33.17 61.42 47.29
✗ ✗ 32.15 61.09 46.62
Conv Block3
✓ ✓ 33.63 61.85 47.74
✓ ✗ 32.09 60.48 46.29
✗ ✗ 33.81 61.12 47.46
Norm coords ✓ – 32.82 61.02 46.92
✗ – 32.70 61.10 46.90
Of this, incorporating the landmark attention module for fusing landmark features before the contextual
attention provides 2.3% improvement.
Attention Visualization: In order to further analyze the above two ablation studies, we visualize attention maps in Figure 4.7. For reference images, we first extract the most attended words from the query text and then visualize their corresponding attention on image features. We find that RoI features best capture broad concepts, e.g., design and dress, whereas fashion landmarks are useful for specific attributes, e.g., hem and straps. Cropped clothing features provide access to zoomed-in regions like the neck. Our model is also able to reason about ambiguous concepts like spaghetti and focus on relevant parts of the image. On the target side, adding spatial attention helps remove irrelevant information and focus on important regions like cloth design and sleeves.
Chapter 5
FIT: Fractional Intermediate Tower in Vision-Language Transformers
5.1 Introduction
Recent advances in deep neural networks have drawn the community’s attention from unimodal tasks to
multimodal tasks, among which problems related to vision-language understanding are under the spotlight, e.g. image-text retrieval [163, 128] and Visual Question Answering (VQA) [61]. Inspired by BERT [43], Vision-Language Pre-training (VLP) models have demonstrated their competence in tackling vision-language tasks by perceiving both vision and language contents jointly. Typically, a VLP model is fed with a large amount of image-text pairs, trained by enforcing the model to learn multimodal knowledge, and fine-tuned on downstream tasks. Great success in multimodal fine-tuning [59] and image generation, e.g. styleCLIP [160], HairCLIP [216], and StableDiffusion [176], together show that VLP plays an important role in computer vision research.
Achieving impressive performance in natural language processing, transformers [204] have been adopted
by a majority of VLP approaches as the backbone architecture of both unimodal feature extractors and
multimodal aggregation. Recently, vision transformers have revealed their capacity in visual feature extraction, and dual-tower VLP transformers have become mainstream, where images and text are processed separately
through unimodal encoders and fused afterward.
Figure 5.1: Comparisons of dual-tower-based vision-language model paradigms. (a) CLIP-like (e.g. CLIP [168] and ALIGN [85]), whose predictions are directly made from unimodal features without a fusion encoder; (b) METER-like (e.g. METER [49]), whose joint fusion encoder is built on top of both transformers; (c) ALBEF-like (e.g. ALBEF [118], CODIS [50], and TCL [230]), whose top text transformer layers are converted into fusion layers with inputs from the top vision layer; (d) the proposed FIT, which uses a Fractional Intermediate Tower (FIT) to expand the text encoder's access to multi-layer vision features and enrich per-layer features through the top-down pathway; and (e) performance comparisons of SoTA dual-tower solutions on various vision-language benchmarks, where ours noticeably outperforms them after using the FIT. IR and TR refer to image retrieval and text retrieval, respectively.
As shown in Fig. 5.1, a number of design paradigms have been deployed by various models following
the dual-tower design. For example, CLIP [168] and ALIGN [85] (see Fig. 5.1-(a)) choose to directly compute
the similarity between unimodal features without a fusion encoder, while METER [49] adopts the joint
fusion design (see Fig. 5.1-(b)) and builds fusion layers on top of vision and language transformers, with
attention layers aggregating the information from both sides. Moreover, ALBEF [118], CODIS [50], and
TCL [230] (see Fig. 5.1-(c)) convert the top half of the language transformer into a multimodal encoder,
while features from the top vision layer are fed to all fusion layers.
We argue that these paradigms have their own limitations: not having a fusion encoder prevents a CLIP-like model from learning cross-modal knowledge, while the other paradigms, which do have an extra encoder for multimodal fusion, take features only from the top vision layer, omitting features from lower levels that are potentially beneficial. Although vision transformers, unlike conventional CNN models, do not have a native receptive field for each layer, Dosovitskiy et al. [47] argue that the average distances spanned by attention weights differ across vision layers, indicating that each layer has its own effective receptive field. This implies that, although features from the top layer carry strong global information, lower-level vision features can contain specific local information that has not been exploited by the fusion encoder. Thus, we attempt to answer the question: how can we effectively learn such unexploited local information in dual-tower VLPs?
In this paper, we propose the Fractional∗ Intermediate Tower (FIT), a tower between the unimodal encoders, to better leverage vision features for multimodal fusion. As shown in Fig. 5.1-(d), FIT processes features from the vision transformer and yields particular features for each fusion layer. Furthermore, the top-down pathway enriches lower-level features with high-level information, making multilateral vision knowledge accessible to each fusion layer. Concurrently, FIBER [48] proposed symmetric connections between the vision and the language towers, while FIT follows an asymmetric design to accommodate the top-down pathway. Unlike FIBER's bottom-up design, FIT allows fusion layers to leverage not only local information from lower levels but also global knowledge through the top-down design. Moreover, we convert the top M layers of the language tower into fusion layers by adding cross-attention layers.
To assess the performance of our model, we follow the common “pre-train and fine-tune” scheme for
VLP models. Similar to [118, 50, 28, 49, 230], we pre-train our model on four million image-text pairs and
fine-tune it on multiple vision-language downstream tasks including image-text retrieval [128, 163], VQA
[61], NLVR [194] and visual entailment [224]. Quantitative evaluations are performed on these downstream tasks. As shown in Fig. 5.1-(e), FIT achieves new state-of-the-art (SoTA) results on multiple downstream tasks, verifying the effectiveness of the FIT module. Specifically, on COCO image-text retrieval, our
model achieves 77.8% top-1 text recall (TR@1) and 61.6% top-1 image recall (IR@1), 1.6% and 2.6% higher
than the SoTA under the same training settings.
∗Inspired by fractional distillation, a process in petroleum refining that separates crude oil into various products for different purposes.
Figure 5.2: Overview of the FIT model. We follow the prevalent dual-tower design, with the proposed FIT module bridging the image encoder and the text encoder. Each FIT layer processes the feature from the corresponding vision layer and combines it with the feature from the previous FIT layer, producing vision features for the multimodal encoder and the next FIT layer. Post-FIT features are combined with text features through Cross-Attention layers inside the fusion encoder. We further leverage momentum distillation to tackle the noise from web data.
5.2 Background
In this section, we briefly discuss building blocks used in previous transformer-based VLP models.
5.2.1 Transformer Encoder Blocks
A typical transformer [204] encoder block consists of two components: the attention layer and the feedforward network. Given a token embedding, an attention layer produces a weighted average of embeddings
from all other tokens in the sequence. Formally, given Query (Q), Key (K), and Value (V ) as the input,
we have
Attention(Q, K,V ) = Softmax(
QKT
√
dK
)V . (5.1)
More specifically, an attention layer is referred to as a Self-Attention layer when Q, K,V come from
the same input. Furthermore, attention layers are considered as fusion layers, namely Cross-Attention, in
scenarios where multiple input sequences are involved, e.g. , in transformer decoders or vision-language
fusion encoders. To combine information from two input sequences, a Cross-Attention layer computes Q
from one and extracts K,V from another.
The feed-forward network following the attention layer is designed to further process the output so that it better fits the next transformer block. A feed-forward network consists of two linear transformations with an activation function in between. Built with attention layers and feed-forward networks, transformer encoder blocks are the fundamental element of VLP transformers, a prevailing topic in the field.
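For illustration, the scaled dot-product attention in Eq. 5.1 and a transformer encoder block with an optional cross-attention input can be sketched in PyTorch as follows. This is a minimal sketch for exposition rather than the exact implementation used in our experiments; the module and argument names (e.g., EncoderBlock, context) are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # Eq. 5.1: Softmax(Q K^T / sqrt(d_K)) V
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

class EncoderBlock(nn.Module):
    """A minimal transformer encoder block: (cross-)attention + feed-forward.

    When `context` is None the block performs self-attention (Q, K, V from the
    same sequence); otherwise it performs cross-attention, taking Q from `x`
    (e.g., text) and K, V from `context` (e.g., image features).
    """
    def __init__(self, dim=768, n_heads=12, ffn_dim=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, context=None):
        q = self.norm1(x)
        kv = q if context is None else context
        h, _ = self.attn(q, kv, kv)          # self- or cross-attention
        x = x + h                            # residual connection
        return x + self.ffn(self.norm2(x))   # feed-forward + residual
```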
5.2.2 Vision-Language Transformers
VLP transformers aim at creating models that can process both modalities jointly by training on a large-scale image-text corpus. Prior to ViT, a large proportion of VLP approaches [85, 196, 120, 28, 193, 124, 241,
207, 59, 77] rely on external object detectors [127, 175, 58, 5, 67] to create regional feature sequences. Some
other methods [79, 78, 186, 88] take grid features from CNNs and reorganize them into input sequences.
With ViT prevailing in the vision community, recent VLP models [168, 188, 118, 230, 49, 50, 117, 238, 48]
embrace the dual-tower design, where images and text are processed by separate transformers. Notably,
CLIP [168] trains the model with an image-text contrastive loss without any multimodal encoder; METER [49]
builds the fusion encoder on top of both unimodal encoders and adopts cross-attention layers to process
cross-modality knowledge, while ALBEF [118] aligns unimodal features and takes features from the top
vision layer for the fusion encoder; Yang et al. [230] investigate combinations of various contrastive losses;
VLMo [12] aims at learning a universal multimodal transformer through mixture-of-modality-experts.
Compared with previous methods, fusion layers with FIT can obtain local information from lower
levels while not losing global knowledge with the help of the top-down pathway inside FIT. Besides, we
adopt Self-Attention layers to process unimodal features and pick Cross-Attention layers for multimodal
fusion, with text embeddings as Q and image embeddings as K and V. In practice, we follow the multi-head attention design of [204] to improve expressiveness.
5.3 Approach
As illustrated in Fig. 5.2, our framework contains an image encoder, a text encoder, and the proposed FIT
in between. Similar to previous dual-tower VLP models discussed in Sec. 5.2, both encoders are multi-layer
transformers, which consist of a stack of transformer encoder blocks. The image encoder f_v processes an image I into a sequence of hidden states {v_cls, v_1, . . . , v_n}, where v_cls is the embedding of the [CLS] token; the text encoder f_t converts the input sentence into a sequence of embeddings {t_cls, t_1, . . . , t_m}, where t_cls represents the embedding of the [CLS] token. Similar to [118], we convert the top M layers of the text encoder into the multimodal encoder by adding a Cross-Attention layer to each transformer layer.
During both training and inference phases, image embeddings from the top M layers are processed by FIT
layers which generate layer-specific features for each fusion layer. Post-FIT features are fused with text
hidden states through the multimodal encoder. Our models are trained with contrastive losses, Image-Text
Matching (ITM) loss, and Masked Language Modeling (MLM) loss. Details are discussed in Sec. 5.4.1.
5.3.1 FIT: Fractional Intermediate Tower
In existing dual-transformer VLP models, only features from the top vision layer are considered during the
fusion phase. We argue that features extracted by other layers can have certain information that is omitted
by the top layer. Hence, making such features available to the multimodal encoder is beneficial. Motivated
by this, we propose FIT, which improves the coverage of vision features during the fusion process by two
means: (i) FIT leverages vision features from multiple layers of the vision transformer, making the vision
knowledge more accessible to the multimodal encoder; (ii) The top-down pathway enables each fusion
layer to extract information from various vision layers rather than a certain layer decided manually. Our
ablation study in Sec. 5.5.3 and visualization in Fig. 5.4 confirm the effectiveness of both designs of FIT.
Multi-layer Coverage. Unlike conventional CNN models that have pooling layers explicitly expanding
the receptive field as the network goes deeper, ViT stacks a number of transformer layers and performs
self-attention over the entire sequence. However, Dosovitskiy et al. [47] observe that the average distances spanned by attention weights at different layers vary, reflecting various region coverages. While
features from the top layer have the strongest global information, features from lower layers can bear
useful local information that is omitted along the path bottom-up. Thus, we propose to feed features from
multiple vision layers to the multimodal encoder. More specifically, we denote the features from each layer as {V^(1), . . . , V^(N_v)} and take the features from the last M layers, {V^(N_v−M+1), . . . , V^(N_v)}, for FIT, followed by the multimodal encoder.
Top-down Pathway on FIT Layers. Though the multi-layer coverage enriches the vision information that the multimodal encoder has access to, such access for each fusion layer should not be constrained to a certain vision layer. Thus, we design a top-down pathway to enrich vision features so that each fusion layer can have information from various vision layers, as shown in Fig. 5.2. For the features from the i-th vision layer V^(i)_in, we denote the embedding of the [CLS] token as v^(i)_cls and the embeddings of the patches as v^(i)_patch. Inside the i-th FIT layer, to better align the feature with knowledge from other layers, the features V^(i)_in are processed by a projection layer p^(i)_in, followed by a Layer Normalization layer [9]. We refer to the output as the side feature V^(i)_side. The projection layer p^(i)_in consists of two linear operators, g^(i)_cls and g^(i)_patch, processing v^(i)_cls and v^(i)_patch separately. Formally,

\[ p^{(i)}_{in}\big(V^{(i)}\big) = \Big[\, g^{(i)}_{cls}\big(v^{(i)}_{cls}\big) : g^{(i)}_{patch}\big(v^{(i)}_{patch}\big) \,\Big]. \tag{5.2} \]

The [· : ·] operation stands for the concatenation of two feature vectors. The side feature V^(i)_side is combined with V^(i+1)_FIT from the (i+1)-th FIT layer to produce V^(i)_FIT, which is propagated downwards to the (i−1)-th layer. Formally,

\[ V^{(i)}_{FIT} = \begin{cases} V^{(i)}_{side}, & i = N_v, \\ V^{(i)}_{side} + V^{(i+1)}_{FIT}, & i < N_v. \end{cases} \tag{5.3} \]

V^(i)_FIT is then processed by another projection layer p^(i)_out, similar to p^(i)_in, followed by another Layer Normalization layer. The output V^(i)_out is fed to the corresponding fusion layer for multimodal fusion through a Cross-Attention layer.
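The FIT computation in Eqs. 5.2-5.3 can be summarized by the following PyTorch sketch. The class and variable names (FITLayer, run_fit, fit_above) are illustrative, and the sketch omits details such as initialization; it is not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class FITLayer(nn.Module):
    """Illustrative sketch of one FIT layer (Eqs. 5.2-5.3).

    The input projection applies separate linear maps to the [CLS] embedding
    and the patch embeddings, followed by LayerNorm, producing the side feature
    V_side. The top-down pathway adds the (i+1)-th FIT feature before an output
    projection produces the feature consumed by the fusion (cross-attention) layer.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.g_cls_in, self.g_patch_in = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.g_cls_out, self.g_patch_out = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, v_in, fit_above=None):
        # v_in: (B, 1 + n_patches, dim) hidden states from the i-th vision layer.
        cls, patch = v_in[:, :1], v_in[:, 1:]
        v_side = self.norm_in(torch.cat([self.g_cls_in(cls), self.g_patch_in(patch)], dim=1))  # Eq. 5.2
        v_fit = v_side if fit_above is None else v_side + fit_above                            # Eq. 5.3
        cls_f, patch_f = v_fit[:, :1], v_fit[:, 1:]
        v_out = self.norm_out(torch.cat([self.g_cls_out(cls_f), self.g_patch_out(patch_f)], dim=1))
        return v_fit, v_out  # v_fit feeds the (i-1)-th FIT layer; v_out feeds the fusion layer


def run_fit(fit_layers, vision_feats):
    # Top-down pass: fit_layers ordered top-first, vision_feats ordered bottom-up.
    v_fit, outputs = None, []
    for layer, v in zip(fit_layers, reversed(vision_feats)):
        v_fit, v_out = layer(v, v_fit)
        outputs.append(v_out)
    return list(reversed(outputs))  # aligned bottom-up with the fusion layers
```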
5.4 Large-scale Vision-language Pre-training
In this section, we discuss pre-training objectives and the setup of pre-training experiments.
5.4.1 Training Objectives
We pre-train FIT with a contrastive loss L_con, an ITM loss L_itm, and an MLM loss L_mlm:

\[ \mathcal{L} = \mathcal{L}_{con} + \mathcal{L}_{itm} + \mathcal{L}_{mlm}. \tag{5.4} \]

More specifically, L_con is the average of the image-text contrastive loss L_itc and the intra-modal contrastive loss L_imc; formally, \( \mathcal{L}_{con} = \frac{1}{2}(\mathcal{L}_{itc} + \mathcal{L}_{imc}) \).
Image-Text Contrastive (ITC). The ITC loss is designed to align representations from the image and text encoders. Following [118, 66], we maintain two K-length queues holding the most recent features generated by the momentum model, denoted as V˜ and T˜. The similarity between two embeddings is estimated by s(v, t) = g_v(v_cls)^T g'_t(t_cls) and s(t, v) = g_t(t_cls)^T g'_v(v_cls), where g_v and g_t are linear projections mapping unimodal features into a lower dimension (256-d), while g'_v and g'_t are their momentum counterparts. Formally, we define the ITC loss as:

\[ \mathcal{L}_{itc} = -\frac{1}{2}\,\mathbb{E}_{(v,t)}\!\left[ \log \frac{\exp\big(s(v, t'_{+})/\tau\big)}{\sum_{t' \in \tilde{T}} \exp\big(s(v, t')/\tau\big)} + \log \frac{\exp\big(s(t, v'_{+})/\tau\big)}{\sum_{v' \in \tilde{V}} \exp\big(s(t, v')/\tau\big)} \right], \tag{5.5} \]

where τ is a learnable temperature parameter. More specifically, we turn off the FIT module and the cross-attention layers when computing this loss to ensure that the unimodal embeddings are well aligned.
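A simplified PyTorch sketch of Eq. 5.5 is given below, assuming the [CLS] features have already been projected to 256-d and L2-normalized; queue maintenance and distributed details are omitted, and the function name and arguments are illustrative.

```python
import torch
import torch.nn.functional as F

def itc_loss(v_emb, t_emb, v_emb_m, t_emb_m, v_queue, t_queue, temp):
    """Illustrative image-text contrastive loss (cf. Eq. 5.5).

    v_emb, t_emb:     projected, normalized [CLS] features from the main model, (B, 256).
    v_emb_m, t_emb_m: their momentum-model counterparts, (B, 256).
    v_queue, t_queue: (256, K) queues of recent momentum features.
    """
    # Candidates: in-batch momentum features followed by the queued features,
    # so the positive for example i sits at column i.
    t_all = torch.cat([t_emb_m.t(), t_queue], dim=1)   # (256, B + K)
    v_all = torch.cat([v_emb_m.t(), v_queue], dim=1)

    logits_i2t = v_emb @ t_all / temp                  # (B, B + K)
    logits_t2i = t_emb @ v_all / temp
    targets = torch.arange(v_emb.size(0), device=v_emb.device)

    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))
```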
Intra-Modal Contrastive (IMC). Yang et al. [230] suggest that the inter-modality alignment enforced by the ITC loss is insufficient and propose to align the unimodal representations produced by the main model and the momentum model. Here, we define the similarity functions as s(v, v') = g_v(v_cls)^T g'_v(v'_cls) and s(t, t') = g_t(t_cls)^T g'_t(t'_cls). The IMC loss is then defined as:

\[ \mathcal{L}_{imc} = -\frac{1}{2}\,\mathbb{E}_{(v,t)}\!\left[ \log \frac{\exp\big(s(v, v'_{+})/\tau\big)}{\sum_{v' \in \tilde{V}} \exp\big(s(v, v')/\tau\big)} + \log \frac{\exp\big(s(t, t'_{+})/\tau\big)}{\sum_{t' \in \tilde{T}} \exp\big(s(t, t')/\tau\big)} \right]. \tag{5.6} \]

Following [230], we take different augmentations of an image as inputs for f_v and its momentum counterpart when computing L_imc on the vision side. Note that FIT and the cross-attention layers are turned off for the IMC loss as well.
Image-Text Matching (ITM). In addition to the feature alignment guided by contrastive losses, we leverage an ITM loss, which is computed on positive (matched) and negative (not matched) image-text pairs, to boost multimodal fusion. More specifically, we take the [CLS] embedding produced by the multimodal encoder and pass it through a two-way linear layer for classification, followed by a softmax function to produce probability scores prob(v, t). Following [118, 230], we perform hardness-aware sampling by adopting ITC scores of image-text pairs as sampling weights. Formally, we compute the cross-entropy loss H between labels y^(v,t) and predictions:

\[ \mathcal{L}_{itm} = \mathbb{E}_{(v,t)}\, H\big(\mathrm{prob}(v, t),\, y^{(v,t)}\big). \tag{5.7} \]
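The hardness-aware sampling and the ITM classification described above can be sketched as follows. The helper names (fuse_fn, itm_head) are placeholders for the multimodal encoder and the two-way classifier; this is a simplified illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def itm_loss(sim_i2t, sim_t2i, fuse_fn, itm_head, images, texts):
    """Illustrative ITM loss (cf. Eq. 5.7) with hardness-aware negative sampling.

    sim_i2t / sim_t2i: in-batch ITC similarity matrices, (B, B).
    images:  image tensors, (B, ...);  texts: tokenized text tensors, (B, L).
    fuse_fn: runs the multimodal encoder and returns the fused [CLS] state.
    itm_head: a two-way linear classifier on the fused [CLS] state.
    """
    bsz = sim_i2t.size(0)
    eye = torch.eye(bsz, dtype=torch.bool, device=sim_i2t.device)

    # Sample one hard negative per example, weighted by ITC similarity
    # (the diagonal, i.e., the matched pair, is excluded).
    w_i2t = F.softmax(sim_i2t, dim=1).masked_fill(eye, 0)
    w_t2i = F.softmax(sim_t2i, dim=1).masked_fill(eye, 0)
    neg_text_idx = torch.multinomial(w_i2t, 1).squeeze(1)   # hard negative text per image
    neg_img_idx = torch.multinomial(w_t2i, 1).squeeze(1)    # hard negative image per text

    # Positive pairs plus the two kinds of negative pairs.
    img_in = torch.cat([images, images, images[neg_img_idx]], dim=0)
    txt_in = torch.cat([texts, texts[neg_text_idx], texts], dim=0)
    labels = torch.cat([torch.ones(bsz), torch.zeros(2 * bsz)]).long().to(images.device)

    logits = itm_head(fuse_fn(img_in, txt_in))               # (3B, 2)
    return F.cross_entropy(logits, labels)
```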
Masked Language Modeling (MLM). We further take the MLM loss from BERT [43] as one of our multimodal fusion loss functions. The MLM task requires the model to reconstruct tokens that have been randomly masked out†. Unlike BERT, where only language information is utilized for reconstruction, we let the multimodal encoder take inputs from both the vision and language sides so that the model can learn to leverage information from both modalities. Formally, the MLM loss is defined as:

\[ \mathcal{L}_{mlm} = \mathbb{E}_{(v,t_m)}\, H\big(\mathrm{prob}(v, t_m),\, y^{t_m}\big), \tag{5.8} \]

where t_m are masked text sequences and y^{t_m} are the corresponding labels.

†Following BERT, we replace masked tokens with the [MASK] token 80% of the time, a random token 10% of the time, and leave them unchanged 10% of the time.
5.4.2 Pre-training Setup
Datasets. Following previous work [28, 118, 49, 230, 50], we construct our pre-training data with four
public datasets: COCO [128], SBU Captions [155], Visual Genome [104] and Conceptual Captions [185].
The dataset consists of 4.0M unique images and 5.1M image-text pairs. For clarity, we refer to the data as
4M in the following sections.
Momentum Distillation. We follow a similar momentum distillation design to [118], which is intended to alleviate the labeling noise from large-scale web data. The momentum model is updated from the main model with a momentum m and produces pseudo-labels as part of the supervision. Formally, we write the update as

\[ \theta' = m\,\theta' + (1 - m)\,\theta, \tag{5.9} \]

where θ' are the parameters of the momentum model and θ are those of the main model. During training, we incorporate the pseudo-labels when computing the ITC and MLM losses, with a certain weight α. Furthermore, in order to align the main model and the momentum model, we compute the IMC loss between them for both modalities, as suggested in [230].
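The exponential-moving-average update in Eq. 5.9 corresponds to the following short PyTorch sketch (illustrative only):

```python
import torch

@torch.no_grad()
def momentum_update(model, momentum_model, m=0.995):
    """Eq. 5.9: theta' <- m * theta' + (1 - m) * theta, applied parameter-wise."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)
```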
Implementation Details. Our implementation is built on the PyTorch [159] framework, and pre-training experiments are conducted on 32 NVIDIA Tesla V100 GPUs. Following [49], we pick ViT-B/16 [47] with CLIP [168] initialization as our image encoder and RoBERTa-Base [131] initialized with pre-trained weights as our text encoder. We set M = 12 so that all layers of the text encoder are leveraged for fusion. We also apply FIT to all 12 layers. Apart from the regular transformer layers, all FIT layers and cross-attention layers are randomly initialized from normal distributions. We set the length of the queues to 65,535 with a momentum of 0.995. The distillation weight α is set to 0.4. During the pre-training stage, we optimize our model with AdamW [135] using a learning rate of 2×10^-4, warmed up from 1×10^-5 over the first 2,000 iterations. We take random crops of 256 × 256 from images and augment them with random color jittering, random grayscale conversion, random horizontal flipping, random Gaussian blurring, and RandAugment [38], as described in [230]. For MLM, we set the masking rate to 30% for pre-training. During fine-tuning, we set the image size to 384 and interpolate the image positional encodings, except for VQA, where we use 576, following [49].
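As an illustration of the warmup described above, the optimizer and learning-rate scheduler could be configured roughly as follows; the subsequent decay schedule and per-parameter-group settings are omitted, and the function name is hypothetical.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=2e-4, warmup_lr=1e-5, warmup_iters=2000):
    """Sketch: AdamW with a linear warmup from 1e-5 to 2e-4 over 2,000 iterations."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_iters:
            # Linear ramp from warmup_lr to base_lr, expressed as a multiplier of base_lr.
            return (warmup_lr + (base_lr - warmup_lr) * step / warmup_iters) / base_lr
        return 1.0  # decay afterwards is omitted in this sketch

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```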
5.5 Experiments
In this section, we show results on downstream tasks and conduct an ablation study for a better understanding of FIT.
Method | #Img (M) | MSCOCO (5K): TR@1 / TR@5 / TR@10 / IR@1 / IR@5 / IR@10 | Flickr30K (1K): TR@1 / TR@5 / TR@10 / IR@1 / IR@5 / IR@10
Methods with different training setups: ^1 extra data; ^2 bounding box labels; ^3 unimodal pre-training
VLMo^3 [12] | 4 | 74.8 / 93.1 / 96.9 / 57.2 / 82.6 / 89.8 | 92.3 / 99.4 / 99.9 / 79.3 / 95.7 / 97.8
FIBER^2 [48] | 4 | 75.1 / 93.9 / 97.4 / 59.0 / 84.0 / 91.0 | 95.1 / 99.6 / 99.9 / 84.1 / 97.5 / 98.9
ALBEF (14M)^1 [118] | 14 | 77.6 / 94.3 / 97.2 / 60.7 / 84.3 / 90.5 | 95.9 / 99.8 / 100.0 / 85.6 / 97.5 / 98.9
X-VLM^{1,2} [238] | 16 | 81.2 / 95.6 / 98.2 / 63.4 / 85.8 / 91.5 | 97.1 / 100.0 / 100.0 / 86.9 / 97.3 / 98.7
BLIP^1 [117] | 129 | 81.9 / 95.4 / 97.8 / 64.3 / 85.7 / 91.5 | 97.3 / 99.9 / 100.0 / 87.3 / 97.6 / 98.9
ALIGN^1 [85] | 1.2k | 77.0 / 93.5 / 96.9 / 59.9 / 83.3 / 89.8 | 95.3 / 99.8 / 100.0 / 84.9 / 97.4 / 98.6
Methods with similar training setups
Visual Parsing [229] | 0.2 | – / – / – / – / – / – | 87.0 / 98.4 / 99.5 / 73.5 / 93.1 / 96.4
SOHO [78] | 0.2 | 66.4 / 88.2 / 93.8 / 50.6 / 78.0 / 86.7 | 86.5 / 98.1 / 99.3 / 72.5 / 92.7 / 96.1
UNITER [28] | 4 | 65.7 / 88.6 / 93.8 / 52.9 / 79.9 / 88.0 | 87.3 / 98.0 / 99.2 / 75.6 / 94.1 / 96.8
VILLA [55] | 4 | – / – / – / – / – / – | 87.9 / 97.5 / 98.8 / 76.3 / 94.2 / 96.8
OSCAR [124] | 4 | 70.0 / 91.1 / 95.5 / 54.0 / 80.8 / 88.5 | – / – / – / – / – / –
ViLT [98] | 4 | 61.5 / 86.3 / 92.7 / 42.7 / 72.9 / 83.1 | 83.5 / 96.7 / 98.6 / 64.4 / 88.7 / 93.8
UNIMO [122] | 4 | – / – / – / – / – / – | 89.7 / 98.4 / 99.1 / 74.7 / 93.5 / 96.1
ALBEF [118] | 4 | 73.1 / 91.4 / 96.0 / 56.8 / 81.5 / 89.2 | 94.3 / 99.4 / 99.8 / 82.8 / 96.7 / 98.4
CODIS [50] | 4 | 75.3 / 92.6 / 96.6 / 58.7 / 82.8 / 89.7 | 95.1 / 99.4 / 99.9 / 83.3 / 96.1 / 97.8
TCL [230] | 4 | 75.6 / 92.8 / 96.7 / 59.0 / 83.2 / 89.9 | 94.9 / 99.5 / 99.8 / 84.0 / 96.7 / 98.5
METER [49] | 4 | 76.2 / 93.2 / 96.8 / 57.1 / 82.7 / 90.1 | 94.3 / 99.6 / 99.9 / 82.2 / 96.3 / 98.4
ImageBERT [166] | 6 | 66.4 / 89.8 / 94.4 / 50.5 / 78.7 / 87.1 | 87.0 / 97.6 / 99.2 / 73.1 / 92.6 / 96.0
VinVL [241] | 6 | 74.6 / 92.6 / 96.3 / 58.1 / 83.2 / 90.1 | – / – / – / – / – / –
Ours (FIT) | 4 | 77.8 / 94.0 / 97.0 / 61.6 / 84.7 / 91.0 | 96.2 / 99.7 / 100.0 / 86.4 / 97.6 / 98.8
Table 5.2: Quantitative evaluation on retrieval tasks; all numbers are reported in percentage, the best scores are in bold, and the second best are underlined. We report top-1/5/10 recall rates. FIT achieves the SoTA in all metrics on both MSCOCO and Flickr30K.
5.5.1 Downstream Tasks
Image-Text Retrieval. The retrieval task includes two subtasks: i) Image Retrieval (IR), where texts are the queries, and ii) Text Retrieval (TR), where images are the queries. We perform evaluations of the retrieval task on two datasets: COCO [128] and Flickr30K [163]. For quantitative evaluations, we take the model trained on the 4M pre-training dataset and fine-tune it on both datasets separately. In addition, we perform the zero-shot evaluation on Flickr30K with the model fine-tuned on COCO, following [118, 49, 230]. We adopt the two-stage retrieval strategy of [118]: we first compute the ITC similarity score with the cross-attention module turned off; the top-k candidates are then re-ranked by their ITM scores computed by the multimodal encoder.
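The two-stage retrieval procedure can be sketched as below, where itm_score_fn is a placeholder for running the multimodal encoder on an image-text pair; the sketch ranks texts for each image, and the image-retrieval direction is symmetric.

```python
import torch

@torch.no_grad()
def retrieve_topk(img_feats, txt_feats, itm_score_fn, k=128):
    """Illustrative two-stage retrieval: rank all candidates by ITC similarity,
    then re-rank the top-k with the (more expensive) ITM score.

    img_feats, txt_feats: L2-normalized ITC embeddings, (N_img, d) / (N_txt, d).
    itm_score_fn(i, j):   returns the ITM matching score for image i and text j.
    """
    sim = img_feats @ txt_feats.t()                       # stage 1: cheap ITC similarity
    topk = sim.topk(k, dim=1).indices                     # candidate texts per image
    ranks = []
    for i in range(img_feats.size(0)):
        scores = torch.tensor([itm_score_fn(i, int(j)) for j in topk[i]])
        ranks.append(topk[i][scores.argsort(descending=True)])  # stage 2: ITM re-ranking
    return torch.stack(ranks)                              # re-ranked text indices per image
```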
Visual Question Answering (VQA [61]). VQA asks a model to predict the answer given an image and
a question, which requires a joint understanding of both modalities. We follow [49] and treat VQA as a
classification problem with 3,129 categories. More specifically, we pick a Multi-Layer Perceptron (MLP) as
Method | #Img (M) | VQA: test-dev / test-std | NLVR2: dev / test-P | SNLI-VE: val / test | Flickr30K Zero-Shot Retrieval (1K): TR@1 / TR@5 / TR@10 / IR@1 / IR@5 / IR@10
Methods with different training setups: ^1 extra data; ^2 bounding box labels; ^3 unimodal pre-training
VLMo^3 [12] | 4 | 76.6 / 76.9 | 82.8 / 83.3 | – / – | – / – / – / – / – / –
FIBER^2 [48] | 4 | 78.6 / 78.5 | 84.6 / 85.5 | – / – | – / – / – / – / – / –
ALBEF (14M)^1 [118] | 14 | 75.8 / 76.0 | 82.6 / 83.1 | 80.8 / 80.9 | 94.1 / 99.5 / 99.7 / 82.8 / 96.3 / 98.1
X-VLM^{1,2} | 16 | 78.2 / 78.4 | 84.4 / 84.8 | – / – | – / – / – / – / – / –
OFA^1 [207] | 54 | 78.0 / 78.1 | – / – | 89.3 / 89.2 | – / – / – / – / – / –
FLAVA^{1,3} [188] | 70 | 72.5 / – | – / – | 78.9 / – | 67.7 / 94.0 / – / 65.2 / 89.4 / –
BLIP^1 [117] | 129 | 78.2 / 78.2 | 82.5 / 83.1 | – / – | 96.0 / 99.9 / 100.0 / 85.0 / 96.8 / 98.6
CLIP^1 [168] | 400 | – / – | – / – | – / – | 88.0 / 98.7 / 99.4 / 68.7 / 90.6 / 95.2
ALIGN^1 [85] | 1.2k | – / – | – / – | – / – | 88.6 / 98.7 / 99.7 / 75.7 / 93.8 / 96.8
SimVLM^1 [214] | 1.8k | 77.8 / 78.1 | 81.7 / 81.8 | 84.2 / 84.2 | – / – / – / – / – / –
Methods with similar training setups
Visual Parsing [229] | 0.2 | 74.0 / 74.2 | 77.6 / 78.1 | – / – | – / – / – / – / – / –
OSCAR [124] | 4 | 73.2 / 73.4 | 78.1 / 78.4 | – / – | – / – / – / – / – / –
UNITER [28] | 4 | 72.7 / 72.9 | 77.2 / 77.9 | 78.6 / 78.3 | 80.7 / 95.7 / 98.0 / 66.2 / 88.4 / 92.9
ViLT [98] | 4 | 71.3 / – | 75.7 / 76.1 | – / – | 73.2 / 93.6 / 96.5 / 55.0 / 82.5 / 89.8
UNIMO [122] | 4 | 73.3 / 74.0 | – / – | 80.0 / 79.1 | – / – / – / – / – / –
VILLA [55] | 4 | 73.4 / 73.7 | 78.4 / 79.3 | 79.5 / 79.0 | – / – / – / – / – / –
ALBEF [118] | 4 | 74.5 / 74.7 | 80.2 / 80.5 | 80.1 / 80.3 | 90.5 / 98.8 / 99.7 / 76.8 / 93.7 / 96.7
CODIS [50] | 4 | 74.9 / 75.0 | 80.5 / 80.8 | 80.5 / 80.4 | 91.7 / 99.3 / 99.8 / 79.7 / 94.8 / 97.3
TCL [230] | 4 | 74.9 / 74.9 | 80.5 / 81.3 | 80.5 / 80.3 | 93.0 / 99.1 / 99.6 / 79.6 / 95.1 / 97.4
METER [49] | 4 | 77.7 / 77.6 | 82.3 / 83.1 | 80.9 / 81.2 | 90.9 / 98.3 / 99.5 / 79.6 / 95.0 / 97.3
ImageBERT [166] | 6 | – / – | – / – | – / – | 70.7 / 90.2 / 94.0 / 54.3 / 79.6 / 87.5
VinVL [241] | 6 | 76.0 / 76.1 | 82.1 / 83.1 | – / – | – / – / – / – / – / –
Ours (FIT) | 4 | 76.2 / 76.5 | 82.6 / 83.2 | 81.2 / 81.3 | 93.0 / 99.6 / 99.9 / 82.2 / 95.9 / 97.9
Table 5.3: Quantitative evaluation on visual question answering, visual reasoning, visual entailment, and zero-shot image-text retrieval; all numbers are reported in percentage, the best scores are in bold, and the second best are underlined. FIT outperforms baselines on NLVR2 and SNLI-VE while achieving second-best on VQA; FIT also beats other models on zero-shot Flickr30K retrieval.
the classifier and take the last hidden states of the [CLS] token as the input. The classifier is initialized
with weights from the pre-trained MLM head.
Natural Language for Visual Reasoning (NLVR2 [194]). The task of NLVR2 requires a model to decide whether the given text describes a pair of images. Given our joint-encoding design, we encode the text with each image and take the feature of the [CLS] token from the last layer as the representation. We then concatenate the representations for each image-text pair and feed them to an MLP for classification.
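A minimal sketch of this pairwise encoding scheme is shown below; encode_fn is a placeholder for the full image-text encoder returning the fused [CLS] feature, and the MLP sizes are illustrative.

```python
import torch
import torch.nn as nn

class NLVRHead(nn.Module):
    """Illustrative NLVR2 head: the text is encoded jointly with each of the two
    images, the resulting [CLS] features are concatenated, and an MLP predicts
    whether the sentence describes the image pair."""
    def __init__(self, encode_fn, dim=768):
        super().__init__()
        self.encode_fn = encode_fn  # returns the fused [CLS] feature for (image, text)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, image_a, image_b, text):
        cls_a = self.encode_fn(image_a, text)
        cls_b = self.encode_fn(image_b, text)
        return self.mlp(torch.cat([cls_a, cls_b], dim=-1))  # 2-way logits
```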
Visual Entailment (SNLI-VE‡
[224]). The VE task requires a model to identify whether an image semantically entails the given text. Models are supposed to predict from three given answers: entailment, neutral,
and contradiction. Similar to [118, 49, 230], we treat VE as a 3-way classification and add an MLP-based
classifier on top of the model, with the hidden states from the [CLS] token as input.
‡Results on SNLI-VE need to be interpreted carefully as its data are reported to be noisy [44, 93].
5.5.2 Quantitative Evaluation
We perform quantitative evaluations on the downstream tasks discussed in Sec. 5.5.1. We compare FIT against other models pre-trained with a similar training setup and provide results of models pre-trained under other setups in gray for readers' reference. For baseline models that have multiple model sizes, we report numbers of the Base model. For other baseline models, we combine the numbers from the summarizing tables of [118, 50, 230, 49]. For image-text retrieval, we report top-k recall (k ∈ {1, 5, 10}) for both text retrieval (TR) and image retrieval (IR). Quantitative results of fine-tuned models are shown in Tab. 5.2, while results for other downstream benchmarks are shown in Tab. 5.3.
Fine-tuned MSCOCO Retrieval. On fine-tuned COCO retrieval, our model outperforms the current SoTA under similar training circumstances for both text retrieval (METER) and image retrieval (TCL). Notably, on text retrieval, FIT achieves 77.8% in terms of R@1, 1.6% higher than the second-best model. We also observe 0.8% and 0.2% improvements on R@5 and R@10, respectively. On the image side, our model gets 61.6%, 2.6% better than the runner-up. Besides, FIT also beats other baselines on top-5 and top-10 recalls. Moreover, it even performs better than ALBEF (14M) and ALIGN, which consume more images during pre-training, on both text and image retrieval.
Fine-tuned Flickr30K Retrieval. On Flickr30K, our model also competes favorably against the baseline models. On the text side, FIT achieves 96.2% R@1, 1.1% higher than CODIS and 1.9% higher than METER. It beats the baselines on top-5 and top-10 recalls, achieving 99.7% and 100.0%, respectively. On image retrieval, FIT outperforms TCL by 2.4% on IR@1, reaching a recall of 86.4%. The performance of FIT on image retrieval even beats ALBEF (14M), which is trained with 3x more data than FIT.
VQA, NLVR2 and SNLI-VE. We report results on the official benchmarks of VQA, NLVR2, and SNLI-VE in Tab. 5.3. On VQA, our model achieves 76.2% on test-dev and 76.5% on test-std, second-best among the baselines and only worse than METER. We consider the gap reasonable as METER has 50% more transformer layers than FIT, and thus roughly 50% more parameters. Besides, on NLVR2, our model outperforms METER and
Figure 5.3: Variations of the FIT design: (a) FIT; (b) w/o Multi-layer; (c) w/o Top-down. Each rectangle represents a layer from the vision encoder or the multimodal encoder.
Method | MSCOCO (5K): TR@1 / TR@5 / IR@1 / IR@5 | Flickr30K (1K): TR@1 / TR@5 / IR@1 / IR@5
Ours (FIT) | 77.8 / 94.0 / 61.6 / 84.7 | 96.2 / 99.7 / 86.4 / 97.6
w/o Multi-layer | 72.3 / 91.6 / 56.9 / 82.3 | 93.9 / 99.5 / 82.4 / 96.5
w/o Top-down | 77.3 / 93.7 / 60.5 / 84.1 | 96.0 / 99.8 / 86.1 / 97.5
Table 5.4: Ablation study on FIT: our designs of multi-layer coverage and the top-down pathway effectively improve the performance.
achieves SoTA on both dev and test sets, scoring at 82.6% and 83.2%, respectively. FIT also surpasses METER
on SNLI-VE, with margins of 0.3% on the val set and 0.1% on the test set.
Zero-Shot Flickr30K Retrieval. In addition to the fine-tuned evaluation, we further report the results of
zero-shot retrieval in Tab. 5.3 to reflect the generalizability of models. We perform the zero-shot retrieval
evaluation on Flickr30K. Notably, our model gets 93.0% on TR@1, similar to TCL, while FIT marginally
improves on TR@5 and TR@10. On the image side, FIT achieves 82.2% on IR@1, 2.5% higher than the
second-best result from baseline models. We also observe >0.5% improvements on both IR@5 and IR@10.
5.5.3 Ablation Study
We conduct ablation studies on various aspects and compare models in terms of their performance on image-text retrieval on MSCOCO and Flickr30K. Experiments are run with the CLIP-ViT+RoBERTa backbone unless otherwise specified.
The FIT Design. To verify the effectiveness of FIT, including multi-layer coverage and the top-down pathway, we conduct an ablation study on several variations. Illustrations are shown in Fig. 5.3 and quantitative
Figure 5.4: FIT consistently captures larger active regions and stronger peak signal responses than w/o Multi-layer and w/o Top-down, respectively. Attention maps are generated by applying Transformer-MM-Explainability [23] to the ground-truth caption for each image (e.g., "A green sign says thruway one fourth mile", "A clock that is on the side of a building", "Fruits on a plate including pineapple bananas and an orange", "A red fire hydrant next to two red poles").
M | MSCOCO (5K): TR@1 / TR@5 / IR@1 / IR@5 | Flickr30K (1K): TR@1 / TR@5 / IR@1 / IR@5
12 | 77.8 / 94.0 / 61.6 / 84.7 | 96.2 / 99.7 / 86.4 / 97.6
9 | 77.7 / 94.1 / 61.2 / 84.4 | 96.5 / 99.9 / 86.2 / 97.3
6 | 77.6 / 93.9 / 61.1 / 84.6 | 95.7 / 99.8 / 86.0 / 97.3
Table 5.5: Ablation study on the number of FIT units.
results are reported in Tab. 5.4. Evaluation results reveal that both designs bring noticeable improvements
to the model. More specifically, the multi-layer coverage brings ∼3% improvements on both TR@1 and
IR@1, while the top-down pathway contributes another 0.5% and 1% on TR@1 and IR@1 on MSCOCO.
Furthermore, we visualize attention maps of image-text pairs tested on FIT in Fig. 5.4, which are obtained
through the method described in [23]. We observe that, compared with other ablated variations, attention
maps generated by FIT have a larger area of active response and stronger peak signals.
Number of FIT Units. As discussed in Sec. 5.3.1, we convert the top M layers of the text encoder into
fusion encoder layers. We further study the impact of the number of FIT units by setting M = 6, 9, 12
and report results in Tab. 5.5. We observe that setting M = 12 has the best overall performance while
M = 6, 9 also yields reasonable performance, verifying that FIT produces feasible results under various
settings.
Method | f_v | MSCOCO (5K): TR@1 / TR@5 / IR@1 / IR@5 | Flickr30K (1K): TR@1 / TR@5 / IR@1 / IR@5
Ours (FIT) | CLIP | 77.8 / 94.0 / 61.6 / 84.7 | 96.2 / 99.7 / 86.4 / 97.6
METER [49] | CLIP | 76.2 / 93.2 / 57.1 / 82.7 | 94.3 / 99.6 / 82.2 / 96.3
Ours (FIT) | DeiT | 75.8 / 93.2 / 59.7 / 83.6 | 95.3 / 99.6 / 84.2 / 96.9
CODIS [50] | DeiT | 75.3 / 92.6 / 58.7 / 82.8 | 95.1 / 99.4 / 83.3 / 96.1
TCL [230] | DeiT | 75.6 / 92.8 / 59.0 / 83.2 | 94.9 / 99.5 / 84.0 / 96.7
Ours (FIT) | Swin | 75.4 / 92.3 / 58.3 / 82.6 | 94.6 / 99.6 / 83.7 / 96.7
METER [49] | Swin | 73.0 / 92.0 / 54.9 / 81.4 | 92.4 / 99.0 / 79.0 / 95.6
Table 5.6: Ablation study on vision encoders: FIT achieves better performance with various vision encoders, compared with the best baseline model with the same visual backbone. The f_v column indicates the vision encoder that we experiment with.
Method | #Img | MSCOCO (5K): TR@1 / TR@5 / IR@1 / IR@5 | Flickr30K (1K): TR@1 / TR@5 / IR@1 / IR@5
FIT | 4M | 77.8 / 94.0 / 61.6 / 84.7 | 96.2 / 99.7 / 86.4 / 97.6
FIT | 14M | 79.6 / 94.8 / 62.9 / 85.3 | 96.6 / 99.6 / 86.6 / 97.7
Table 5.7: Ablation study on the scalability of pre-training data.
The Visual Backbone. The choice of the visual backbone can be impactful for VLP models. We experiment with FIT using various visual backbones and compare our model with SoTA models pre-trained with the same backbone. We train our model with the ViT-based DeiT [199] and Swin-Transformer [132]. For DeiT, we use the same setup as our CLIP-ViT model, as they share similar ViT structures; for Swin-Transformer, we create multi-layer entries on every two layers and set M = 9 for the fusion. The results shown in Tab. 5.6 verify the robustness of FIT with different vision encoders. More specifically, with DeiT, FIT beats TCL on both MSCOCO and Flickr30K. On the other hand, our model significantly outperforms METER with a similar Swin-Transformer backbone, which further justifies our design of FIT and the robustness of our model.
Scalability. We test FIT on 14M data similar to ALBEF [118], and report its performance in Tab. 5.7. It
shows that FIT with 14M is better than the 4M version, demonstrating the scalability and utility of FIT.
Chapter 6
A Unified Encoder for Vision, Language and Multimodal Tasks
6.1 Introduction
Self-supervised learning has proven to be a very effective training mechanism for learning representations that transfer well to various downstream tasks [43, 65, 168, 10]. It has been shown to be successful across text, image, speech and video modalities. In a unimodal setting, inputs are fed into a single model, e.g. a Transformer [204], while features are learned through self-supervision (such as masked input reconstruction) on large datasets [43, 131, 11, 65]. In multimodal settings, specifically with vision and language, image and text features are typically processed by separate unimodal encoders, with a multimodal encoder on top, encoding the joint sequence to learn cross-modality interactions. Such multimodal models rely on either huge training corpora or models with an enormous number of parameters. For instance, CLIP [168] is trained on 400M image-text pairs, while ALIGN [85] and SimVLM [214] are trained on about 1B image-text pairs. Recently, FLAVA [188] has been proposed to reduce the reliance on large corpora while still serving a multitude of tasks effectively. However, FLAVA requires 75M text samples, 70M image-text pairs and an external image tokenizer trained with 250M image-text pairs [172]. Furthermore, for its model architecture, FLAVA leverages three different encoders for visual, linguistic and multimodal representations.
In this work, we propose a simple unified model design where the same Transformer encoder processes
both the unimodal vision/language inputs and the cross-modal vision+language inputs. Specifically, once
Figure 6.1: Comparison with FLAVA [188] and CLIP [168]: MoMo achieves the best macro average across
vision, language and multimodal benchmarks, which is the mean of the average of all 3 task types, with ≈
2/5th the parameters and 1/3rd the training image-caption pairs.
the text tokens and the image patches are embedded into the latent space, they are processed through
the same transformer encoder layers for all tasks. As shown in Fig. 6.1, with about 1/3rd the dataset size and 2/5th the number of parameters, our model, MoMo, outperforms FLAVA on macro average accuracy, which is computed over multiple vision, language and vision-language benchmarks. Due to the reduced model size and the shared parameters, our model is considerably faster during inference in comparison to other models (up to a 35% speedup for multimodal tasks such as Visual Question Answering). Furthermore, in addition to providing data efficiency and run-time efficiency, our model simplifies production deployment pipelines by eliminating the need to maintain separate models for each of those modalities. This is enabled by its shared encoder structure, which handles all language, vision and multimodal tasks.
We also propose a new training strategy to address challenges in learning multimodal representations
by the shared encoder. Since all the encoder parameters are shared across modalities, if the model is trained
independently on datasets from different modalities, there tends to be information loss. To mitigate this,
Figure 6.2: Illustrations of how models process input data across various modalities. Language tasks: (a) encoder-only or decoder-only architectures (BERT, RoBERTa, GPT); (b) encoder-decoder architectures (T5, BART). Vision tasks: (c) convolutional neural networks (ResNet, Faster R-CNN); (d) vision transformers (ViT, Swin). Multimodal tasks: (e) similarity-based metric (CLIP); (f) multimodal transformer with ConvNets (MDETR, PixelBERT); (g) multimodal transformer with duo-transformers (FLAVA, METER); (h) MoMo (ours): multiple tasks across multiple modalities through an all-in-one unified transformer.
we design a multi-stage training pipeline where we jointly iterate through both modalities during each
training step (Section 6.3.2). We accumulate gradients from both modalities and perform a single weight
update. Our experiments show that such a design leads to effective multimodal representation learning. We use a masked input prediction [43, 65, 11, 226] objective for the unimodal stages. For cross-modal learning, we mask both modalities at higher fractions to enable rich cross-modal interactions. A key design
feature of our model is that the encoder processes only the visible part of the inputs, regardless of their
modality. This leads to significantly improved training efficiency.
6.2 Related Work
Language Transformers. Transformers, introduced by Vaswani et al. [204], first proved successful on several Natural Language Processing (NLP) tasks. A variety of methods have been built on top
Model | Public datasets? | Types of inputs accepted | External image processing? | #Pairs Image-Text | #Params Text | #Params Image | #Params Multimodal
MoMo (ours) | Yes | Text, Image, Multimodal | No | 27M | 110M | 86M | 110M
MoMo-Large (ours) | Yes | Text, Image, Multimodal | No | 27M | 335M | 303M | 335M
FLAVA [188] | Yes | Text, Image, Multimodal | Image tokenization (BEIT) | 70M | 110M | 86M | 241M
CLIP [168] | No | Text, Image | No | 400M | 63M | 86M | –
BEIT-3 [208] | Yes | Text, Image, Multimodal | Image tokenization (BEIT) | 21M | 1.0B† | 1.0B† | 1.9B†
FLAMINGO [3] | No | Image, Multimodal | No | 2.1B | – | 1.4B† | 3.2B†
LEMON [77] | No | Multimodal | Region features & object detection features | 200M | – | – | 110M
VLMo [13] | Yes | Text, Image, Multimodal | Image tokenization (BEIT) | 10M | 110M | 110M | 130M
Coca [236] | No | Image, Multimodal | No | 4.8B | – | 86M | 383M
UNITER [28] | Yes | Multimodal | Region features | 5M | – | – | 110M
Table 6.1: Comparison of various design choices across several strong multimodal models vs. MoMo. The BEIT tokenizer is an externally learned VQ-VAE-based model trained on 250 million image-text pairs. The number of image-text pairs is the size of the pre-training data. Parameter counts are the number of model parameters used for running inference on the downstream tasks (retrieval, classification, VQA) listed in the paper. †: We report the size of the smallest model available in the original paper.
of the transformer architecture with different objectives. Notably, BERT [43], with an encoder-only architecture as in Fig. 6.2(a), leverages the bidirectional attention mechanism and trains the model with Masked Language Modeling (MLM) on a large text corpus. RoBERTa [131], ELECTRA [36], ALBERT [110], and DeBERTa [69] introduced extensions to BERT for Language Modeling (LM). Other language transformer models [18, 115] adopt the encoder-decoder architecture with an autoregressive objective and unidirectional attention.
Vision Transformers. Prior to transformers, the prevailing paradigm for vision tasks was to process images through convolutional neural networks (ConvNets), as in Fig. 6.2(c). Inspired by BERT, Dosovitskiy et al. [47] treat an image as a sequence of image patches and propose to perform vision tasks through the Vision Transformer (ViT), a BERT-like architecture. Numerous efforts have been made to improve the performance of ViT: DeiT [199] introduces distillation into the training phase; CaiT [200] explores deeper ViT structures, while VOLO [237] designs a sliding window mechanism to aggregate local features; Swin-Transformer [132] designs a hierarchical transformer structure with shifting windows. Furthermore, BEiT
[11] tokenizes image patches through an external tokenizer and proposes to pre-train ViT with a BERT-style objective named Masked Image Modeling (MIM). He et al. [65] show that ViT generalizes better under pixel-level supervision with aggressive masking.
Multimodal Transformers. The success of transformer-based architectures in the field of NLP has sparked research efforts on multi-modality transformers, especially in the domain of Vision-Language Pre-training (VLP). Prior to ViT, models relied on external image processors to produce inputs for transformer-based models. [85, 196, 120, 28, 193, 124, 241, 59] leverage external object detectors to generate input sequences of regional features. On the other hand, [79, 78, 186, 88, 3] convert feature grids from Convolutional Neural Networks (CNNs) into visual input sequences. Recent research [168, 188, 230, 50, 49, 98, 2] converts vision inputs into sequences through linear projection. Notably, CLIP [168] leverages a contrastive loss to extract visual-linguistic embeddings from noisy web image data; FLAVA [188] aims at unifying vision, language and multimodal tasks with one VLP model; SimVLM [214] encodes images through a CNN and feeds inputs from both modalities into an encoder-decoder transformer. VATT [2] proposes a modality-agnostic transformer to align features from different modalities; VLMo [13] proposes to unify unimodal and multimodal tasks with a set of modality-specific experts but, unlike MoMo, still requires particular modules for each modality.
The existing vision-language models can be broadly categorized into single-stream or dual-stream architectures (Fig. 6.2(e)-(g)). The dual-stream models have separate encoders for text and image (and typically also a separate multimodal encoder), while the single-stream models process them through a joint encoder. MoMo is a single-stream model. However, unlike most other single-stream models, it is optimized to do well across all of text-only, image-only and multimodal tasks (Table 6.1). VLMo [13] deviates from these common architectural choices and uses modality-specific experts to better capture text, image and multimodal features. Their work also studies a variation of this architecture that does not use these specific experts, thereby making the architecture closer to MoMo. However, this was only
Figure 6.3: The architecture and training stages of MoMo. The model is first trained on unimodal image datasets, then simultaneously on unimodal image and unimodal text datasets, and finally simultaneously on unimodal text and multimodal image-text datasets. Each stage is executed sequentially and is initialized from the previous stage's model weights. For downstream task fine-tuning, the decoder is discarded and only the encoder weights are used.
evaluated on multimodal and image-only tasks. MoMo studies training a shared encoder model to work
well simultaneously on unimodal text and image tasks along with the multimodal tasks. Also, due to the
shared encoder design, MoMo is computationally efficient during inference for multimodal tasks.
6.3 The MoMo Framework
Our goal is to build a compact and efficient model that is capable of handling both unimodal and multimodal
tasks. We show that this is possible through the training pipeline design described in Section 6.3.2 and
modeling architecture choices in Section 6.3.3.
6.3.1 Training Objectives
The loss functions used during different stages of training MoMo are described below:
Masked Image Modeling (MIM). Following [65], we adopt an encoder-decoder architecture during training and ask the model to reconstruct patches that are randomly masked out, at the pixel level. During training, we remove masked patches at the encoding stage and restructure the input sequence with [MASK] tokens before the reconstruction. As shown in Eqn. 6.1, where x and x̂ are the original and reconstructed pixels, we use the Mean Squared Error (MSE) to measure the loss between the reconstructed image and the original image:

\[ \mathcal{L}_{MIM} = \lVert x - \hat{x} \rVert_2^2. \tag{6.1} \]
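A simplified sketch of this masked-reconstruction objective is given below, assuming the image has already been split into flattened patches; the encoder and decoder signatures are hypothetical placeholders, and, following the MAE recipe of [65], the error in the sketch is computed on the masked patches.

```python
import torch
import torch.nn.functional as F

def mim_loss(pixels, encoder, decoder, mask_ratio=0.75):
    """Illustrative MIM objective (cf. Eq. 6.1) with hypothetical encoder/decoder modules.

    pixels: (B, N, P) images already split into N flattened patches of P pixels each.
    The encoder only sees the visible patches; the decoder is responsible for
    re-inserting [MASK] tokens and predicting pixels at the masked positions.
    """
    B, N, _ = pixels.shape
    n_keep = int(N * (1 - mask_ratio))

    # Randomly select visible patch indices per example.
    noise = torch.rand(B, N, device=pixels.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]

    visible = torch.gather(pixels, 1, ids_keep.unsqueeze(-1).expand(-1, -1, pixels.size(-1)))
    latent = encoder(visible, ids_keep)          # encode only the visible patches
    pred = decoder(latent, ids_keep, ids_mask)   # predicted pixels for masked patches, (B, N - n_keep, P)

    target = torch.gather(pixels, 1, ids_mask.unsqueeze(-1).expand(-1, -1, pixels.size(-1)))
    return F.mse_loss(pred, target)              # MSE between reconstruction and original pixels
```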
Masked Language Modeling (MLM). Similar to [43, 131], we adopt MLM when training MoMo on text datasets. We randomly mask out a portion of the tokens of the input sequence and ask the model to reconstruct the missing tokens. We utilize the cross-entropy loss, given in Eqn. 6.2, to measure the disparity between predictions and targets. The difference between our approach and [43, 131] is that we follow the encoder-decoder structure of [65], where masked tokens are removed for the encoder and reconstructed through a separate decoder. This modification accelerates training, as the input sequence of the encoder is much shorter:

\[ \mathcal{L}_{MLM} = \sum_{x \in X_M} -\log P(\hat{x} = x). \tag{6.2} \]
Cross-Modal Masking Loss (CMM). To enhance interactions between the modalities (Fig. 6.4), we use a cross-modal masking loss on the concatenated image-text sequence. Specifically, both the image tokens (M_I) and the text tokens (M_T) are masked at 75%, and the model is asked to reconstruct the masked tokens using the visible tokens from both modalities:

\[ \mathcal{L}_{CMM} = \sum_{x \in X_{M_I}} \mathcal{L}_{MIM} + \sum_{x \in X_{M_T}} \mathcal{L}_{MLM}. \tag{6.3} \]
Global Contrastive Loss. To align image-text representations, we also include an image-text contrastive loss [168, 188], shown in Eqn. 6.4, along with the masked token reconstruction loss during stage 3. Similar to FLAVA, we use a global contrastive loss where the image-text pairs from all GPU nodes are accumulated before the loss calculation:

\[ \mathcal{L}_{GC} = \sum_{i} -\log \frac{\exp\big(s(f^{(i)}_{img}, f^{(i)}_{text})/\tau\big)}{\sum_{j} \exp\big(s(f^{(i)}_{img}, f^{(j)}_{text})/\tau\big)}. \tag{6.4} \]
Image-Text Matching Loss. To further enhance cross-modality interactions, we also use an image-text matching loss following previous work [188]. Specifically, we collect positive and negative image-text pairs in each batch guided by similarity scores from the contrastive learning [118]. Then, we apply a classifier on top of the [MASK] token to predict whether the image-text pair matches.
6.3.2 Multi-Stage Training
A key component of our training pipeline is learning simultaneously from multiple modalities.
To effectively train MoMo with multiple modalities, we develop a three-stage training pipeline as illustrated in Figure 6.3:
Stage 1: The model is trained on unimodal image data with the MIM loss at a masking ratio of 75%.
Stage 2: After stage 1, the model is further trained simultaneously on unimodal image data and text data. For images, the masked patch reconstruction loss with 75% masking is used; for text, the masked token reconstruction loss with 15% masking is used.
Stage 3: The model is initialized with the weights from stage 2 and is trained simultaneously on unimodal text data and multimodal image-text data. For unimodal text data, the masked token reconstruction loss with 15% masking is used. For multimodal image-text data, a combination of the cross-modal masking, global contrastive, and image-text matching losses is used.
6.3.3 Model Architecture
As illustrated in Figure 6.3, our model is a transformer-based structure containing an encoder and a shallow
decoder. We follow the design of ViT for both encoder and decoder. We use separate decoders for each
stage as we found this to be effective while jointly training on image and text modalities. We discuss the
effect of separate decoders in Sec.6.4.1. During downstream fine-tuning, all decoders are discarded and
only the encoder weights are updated. Our model, MoMo, consists of 12 encoder layers with a hidden
dimension of 768 and 8 decoder layers with a hidden dimension of 512. The token embedding weights and
the language modeling output head weights are tied. Following MAE [65], the encoder processes only the
visible part of the inputs for all stages.
6.3.4 Datasets
We use ImageNet-1K [41] for Stage 1 training and Wikipedia & Books Corpus for Stage 2 training. For
Stage 3, we use a subset of the Public Multimodal Dataset (PMD) from FLAVA [188]. This subset consists
of COCO [128], SBU Captions [155], Conceptual Captions [185, 22], Visual Genome [104] and RedCaps
[42] datasets. We refer to this dataset as PMD-Subset. The text samples in PMD-Subset have an average length of ≈12 tokens. Statistically, we have 6.6M documents for unimodal language training, 1M images for unimodal vision training and 26.6M image-text pairs for multimodal training. For a fair comparison, we
follow the data filtering process from [28] to exclude validation and test images that appear in downstream
tasks.
# | Public data | Method | VQAv2 | COCO | Flickr30K | SST-2 | RTE | MRPC | QQP | MNLI | QNLI | STS-B | ImageNet linear eval
1 | ✓ | BERTbase [43] | – | – | – | 92.5 | 62.5 | 81.9/87.6 | 90.6/87.4 | 84.4 | 91.0 | 88.1 | –
2 | ✗ | CLIP-ViT-B/16 [168] | 55.3 | 55.2 | 81.6 | 88.2 | 55.2 | 74.9/65.0 | 76.8/53.9 | 33.5 | 50.5 | 16.0 | 80.2
3 | ✗ | SimVLMbase [214] | 77.9 | – | – | 90.9 | 63.9 | 75.2/84.4 | 90.4/87.2 | 83.4 | 88.6 | – | 80.6
4 | ✓ | VisualBERT [120] | 70.8 | – | – | 89.4 | 56.6 | 71.9/82.1 | 89.4/86.0 | 81.6 | 87.0 | 81.8 | –
5 | ✓ | UNITERbase [28] | 72.7 | – | – | 89.7 | 55.6 | 69.3/80.3 | 89.2/85.7 | 80.9 | 86.0 | 75.3 | –
6 | ✓ | VL-BERTbase [193] | 71.2 | – | – | 89.8 | 55.7 | 70.6/81.8 | 89.0/85.4 | 81.2 | 86.3 | 82.9 | –
7 | ✓ | ViLBERT [136] | 70.6 | – | – | 90.4 | 53.7 | 69.0/79.4 | 88.6/85.0 | 79.9 | 83.8 | 77.9 | –
8 | ✓ | LXMERT [196] | 72.4 | – | – | 90.2 | 57.2 | 69.7/80.4 | 75.3/75.3 | 80.4 | 84.2 | 75.3 | –
9 | ✓ | UniT [76] | 67.0 | – | – | 89.3 | – | – | 90.6/– | 81.5 | 88.0 | – | –
10 | ✓ | CLIP-ViT-B/16 (PMD) [188] | 59.8 | 50.6 | 72.5 | 83.5 | 53.1 | 63.5/68.7 | 75.4/43.0 | 32.9 | 49.5 | 13.7 | 73.0
11 | ✓ | FLAVA [188] | 72.8 | 56.3 | 79.1 | 90.9 | 57.8 | 81.4/86.9 | 90.4/87.2 | 80.3 | 87.3 | 85.7 | 75.5
12 | ✓ | MoMo (ours) | 71.2 | 64.0 | 78.2 | 89.6 | 59.6 | 81.3/87.0 | 90.2/86.6 | 78.1 | 86.0 | 87.0 | 75.7
13 | ✓ | MoMo-Large (ours) | 75.0 | 72.0 | 84.1 | 90.4 | 63.9 | 82.8/88.0 | 90.7/87.7 | 80.6 | 88.0 | 88.0/87.8 | 79.1
Table 6.2: Comparing MoMo to previous models on multimodal (VQAv2, COCO, Flickr30K), language (GLUE), and image (ImageNet) tasks. We report results on the dev sets of the GLUE benchmark [206]. We report accuracy/F1 for MRPC and QQP, the Pearson/Spearman correlation for STS-B, the averaged top-1/5 recall for zero-shot retrieval on COCO and Flickr30K, the test-dev VQA score for VQAv2, and accuracy for all other tasks. Results of BERT and other methods are taken from [188]. Note that SimVLM, CLIP and FLAVA are pre-trained on much larger datasets than MoMo (1.8B, 400M & 75M vs. 27M pairs). Bold signifies the best result on public data, while underlined indicates the overall best result.
6.3.5 Simultaneous learning
Apart from the three-stage pipeline, another key ingredient of MoMo is training simultaneously on multiple modalities. Specifically, during training, we simultaneously iterate through the image and text datasets and learn from both modalities during each training step. For the gradient update, we propose to accumulate the gradients from both modalities and perform a single update per training step. We refer to this as Cross-Modality Gradient Accumulation (CMGA); a sketch is given below.
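A minimal sketch of CMGA is shown below; the loss functions and batch formats are placeholders, and details such as gradient clipping, learning-rate scheduling, and mixed precision are omitted.

```python
import torch

def cmga_step(model, optimizer, image_batch, text_batch, image_loss_fn, text_loss_fn):
    """Illustrative Cross-Modality Gradient Accumulation: compute the loss on a batch
    from each modality, accumulate gradients from both backward passes, and perform a
    single optimizer update."""
    optimizer.zero_grad()

    loss_img = image_loss_fn(model, image_batch)   # e.g., masked patch reconstruction
    loss_img.backward()                            # gradients accumulate in .grad

    loss_txt = text_loss_fn(model, text_batch)     # e.g., masked token reconstruction
    loss_txt.backward()                            # adds to the existing gradients

    optimizer.step()                               # single weight update per training step
    return loss_img.item(), loss_txt.item()
```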
6.3.6 Implementation details
For all stages, we use 64 V100 GPUs to train the model. During all stages, we append a [CLS] token at the beginning of the sequence. This token is used for the contrastive loss computation and for classification tasks.
Stage 1: We follow the settings from MAE [65]. Specifically, the model is trained for 1,600 epochs on the ImageNet dataset with a base learning rate (LR) of 1.5e-4, a batch size of 64, a linear warmup of 40 epochs
NLP tasks (active parameters: MoMo 110M, FLAVA 110M, CLIP 63M)
Dataset | Eval. | MoMo (PMD-Sub) | FLAVA (PMD) | CLIP (400M [168])
MNLI [217] | FT | 78.1 | 80.3 | 33.5
MRPC [45] | FT | 84.2 | 84.2 | 69.9
QQP [84] | FT | 88.4 | 88.7 | 65.3
SST-2 [190] | FT | 89.6 | 90.9 | 88.2
QNLI [171] | FT | 86.0 | 87.3 | 50.5
RTE [40, 14] | FT | 59.6 | 57.8 | 55.2
STS-B [1] | FT | 87.0 | 85.7 | 16.0
NLP Avg. | | 81.8 | 82.1 | 54.1

Multimodal tasks (active parameters: MoMo 110M, FLAVA 241M, CLIP 220M)
Dataset | Eval. | MoMo (PMD-Sub) | FLAVA (PMD) | CLIP (400M [168])
VQAv2 [61] | FT | 71.2 | 72.5 | 54.8
Flickr30K [163] TR@1 | ZS | 74.2 | 67.7 | 82.2
Flickr30K [163] TR@5 | ZS | 93.7 | 93.2 | 96.6
Flickr30K [163] IR@1 | ZS | 60.9 | 65.2 | 62.1
Flickr30K [163] IR@5 | ZS | 84.8 | 89.4 | 85.7
COCO [128] TR@1 | ZS | 59.3 | 42.1 | 52.5
COCO [128] TR@5 | ZS | 84.1 | 70.4 | 76.7
COCO [128] IR@1 | ZS | 42.7 | 38.4 | 33.1
COCO [128] IR@5 | ZS | 70.6 | 67.5 | 58.4
Multimodal Avg. | | 71.3 | 68.2 | 66.9

Vision tasks (active parameters: MoMo 86M, FLAVA 86M, CLIP 86M)
Dataset | Eval. | MoMo (PMD-Sub) | FLAVA (PMD) | CLIP (400M [168])
ImageNet [41] | LE | 75.7 | 75.5 | 80.2
Food101 [17] | LE | 90.7 | 88.5 | 91.6
CIFAR10 [106] | LE | 95.0 | 92.8 | 94.9
CIFAR100 [106] | LE | 80.5 | 77.7 | 81.1
Cars [102] | LE | 71.6 | 70.9 | 86.0
Aircraft [141] | LE | 51.2 | 47.3 | 51.4
DTD [35] | LE | 79.7 | 77.3 | 78.5
Flowers102 [151] | LE | 97.8 | 96.4 | 97.1
MNIST [111] | LE | 97.0 | 98.4 | 99.0
STL10 [37] | LE | 98.2 | 98.9 | 99.1
EuroSAT [70] | LE | 95.8 | 97.3 | 95.4
GTSRB [192] | LE | 80.7 | 79.5 | 88.6
SST [168] | LE | 56.7 | 57.1 | 74.7
Caltech101 [53] | LE | 92.2 | 95.7 | 95.5
Pets [158] | LE | 91.7 | 84.8 | 91.7
Vision Avg. | | 83.6 | 82.5 | 86.7

Table 6.3: Performance comparison between MoMo, FLAVA and CLIP. The pre-training datasets are PMD-Subset (MoMo), PMD (FLAVA), and the 400M-pair dataset of [168] (CLIP). The numbers for MRPC and QQP are the averages of accuracy and F1. For STS-B, it is the Matthews correlation. The numbers for COCO and Flickr30K are top-1/5 zero-shot text and image retrieval. For other tasks, we report accuracy. Bold signifies the best result on public data while underlined indicates the overall best result. Active parameters are the number of model parameters used during the forward pass for a task. FT, LE, and ZS stand for Fine-Tuning, Linear Eval and Zero-Shot, respectively.
followed by a cosine decay. We use random resizing and cropping as the only augmentation. For our default model, we re-use weights from the publicly available MAE model.
Stage 2: The data loader during this stage iterates simultaneously through the unimodal text and image datasets. We repeat the ImageNet data five times during a single epoch so that the image data loader roughly matches the text data loader's size. The model is trained for 100 epochs. The learning rate is set to 5e-5
matches the text data loader’s size. The model is trained for 100 epochs. The learning rate is set to 5e-5
with a linear warmup for 10 epochs followed by cosine decay until the end of training. The per-device
batch size is set to 64.
Stage 3: The model is trained for 100 epochs without any data repetition at each epoch. Since the
text and multimodal datasets are iterated simultaneously, an "epoch" here covers roughly 1/4th of the
multimodal dataset (the same size as the unimodal text dataset). For image augmentation, we use random
resize/cropping and RandAugment [39]. The learning rate is set to 1e-4 with a linear warmup for 10 epochs
Modality | Dataset | Eval. | Stage 1 | Stage 2 | Stage 3
Language | MNLI [217] | FT | 59.0 | 78.5 | 78.1
Language | MRPC [45] | FT | 74.7 | 84.7 | 84.2
Language | QQP [84] | FT | 67.4 | 88.4 | 88.4
Language | SST-2 [190] | FT | 80.1 | 89.4 | 89.6
Language | QNLI [171] | FT | 61.5 | 86.7 | 86.0
Language | RTE [40, 14] | FT | 52.7 | 59.2 | 59.6
Language | STS-B [1] | FT | 7.4 | 86.2 | 87.0
Vision | ImageNet [41] | LE | 66.9 | 52.9 | 75.7
Multimodal | VQAv2 [61] | FT | 56.7 | 57.1 | 71.2
Multimodal | Flickr30K [163] | ZS | – | – | 78.2
Multimodal | COCO [128] | ZS | – | – | 64.0
Table 6.4: Performance after different training stages in MoMo. Stages 2 and 3 bring considerable performance gains for vision and multimodal tasks, respectively.
followed by cosine decay. The per-device batch size is 64.
6.4 Experiments
In this section, we perform quantitative evaluations over a span of downstream tasks, along with an ablation study.
6.4.1 Quantitative Evaluation
In this section, we compare MoMo quantitatively with existing models across multimodal, language, and vision tasks.
Comparison with state-of-the-art VLP models. We compare the full MoMo model with several
VLP models on multimodal tasks, language tasks and ImageNet linear probing. Results are reported in
Tab. 6.2. MoMo either matches or outperforms most of the multimodal models on the majority of the tasks. Notably, MoMo surpasses other baselines, including CLIP, which is trained with 20x the data of MoMo.
In-depth comparison with FLAVA and CLIP. We further compare our model in-depth with FLAVA
and CLIP over a broader spectrum of tasks. Results are reported in Tab. 6.3. For evaluation protocol, we
Modality | Dataset | Eval. | Combined | Separate (Δ)
Language | MNLI [217] | FT | 74.0 | 76.5 (+2.5)
Language | MRPC [45] | FT | 76.6 | 79.5 (+2.9)
Language | QQP [84] | FT | 86.5 | 87.8 (+1.3)
Language | SST-2 [190] | FT | 87.6 | 89.4 (+1.8)
Language | QNLI [171] | FT | 82.4 | 85.0 (+2.6)
Language | RTE [40, 14] | FT | 53.4 | 57.4 (+4.0)
Language | STS-B [1] | FT | 71.2 | 83.0 (+11.8)
Vision | ImageNet [41] | LE | 70.9 | 69.4 (-1.5)
Multimodal | VQAv2 [61] | FT | 69.8 | 68.4 (-1.4)
Multimodal | Flickr30K [163] | ZS | 68.9 | 66.9 (-2.0)
Multimodal | COCO [128] | ZS | 54.5 | 53.1 (-1.4)
Table 6.5: Performance comparison between combined and separate stages 2 and 3 (the columns refer to the last two training stages). Both model variations are trained for 50k steps in this experiment. We observe that separating stages 2 and 3 results in better overall performance.
follow the settings in [188], including Fine-Tuning (FT), Linear Evaluation (LE) and Zero-Shot (ZS). For
zero-shot, models are evaluated on tasks without training. For linear evaluation, we freeze model weights
and train an extra linear layer on top of it. For text and image retrieval tasks, MoMo’s encoder is used as
both the text and image encoder.
On unimodal NLP tasks, MoMo matches FLAVA performance despite using a shared encoder for both
modalities. MoMo remains competitive in each of the individual benchmarks. On unimodal vision tasks,
MoMo surpasses FLAVA on more than half of the selected benchmarks and remains comparable on others,
achieving an average accuracy of 83.6%, 1.1% higher than FLAVA.
On multimodal tasks, MoMo performs better on average (+3.1%) than the state-of-the-art multimodal
model FLAVA. Performance gains are especially prominent on COCO retrieval as MoMo improves over
FLAVA by 17.2% and 4.3% for TR@1 and IR@1 respectively. However, the 110M parameter MoMo lags
behind on VQA and Flickr30K image retrieval. We hypothesize this gap could be further reduced by additionally pre-training MoMo on datasets of the same scale as FLAVA. Furthermore, the slightly lower VQA
score seems reasonable since MoMo uses 2/5 of the number of parameters that FLAVA uses for this task.
Modality Dataset Eval. Stage 2: simultaneous training? (Yes / No, ∆) Stage 2: shared decoder? (Yes / No, ∆) Stage 3: simultaneous training? (Yes / No, ∆) CMGA∗ (✗ / ✓, ∆)
Language
MNLI [217] FT 78.5 78.7 (+0.2) 77.5 78.5 (+1.0) 78.1 78.7 (+0.6) 75.4 75.6 (+0.2)
MRPC [45] FT 84.7 83.3 (-1.4) 86.4 84.7 (-1.7) 84.2 78.3 (-5.9) 77.6 78.6 (+1.0)
QQP [84] FT 88.3 88.1 (-0.2) 87.8 88.4 (+0.6) 88.4 88.0 (-0.4) 87.2 87.7 (+0.5)
SST-2 [190] FT 89.4 90.1 (+0.7) 89.4 89.4 (+0.0) 89.6 88.5 (-1.1) 87.6 89.2 (+1.6)
QNLI [171] FT 86.7 88.0 (+1.3) 86.0 86.7 (+0.7) 86.0 84.2 (-1.8) 84.1 85.9 (+1.8)
RTE [40, 14] FT 59.2 61.0 (+1.8) 58.4 59.2 (+0.8) 59.6 54.1 (-5.5) 55.9 56.6 (+0.7)
STS-B [1] FT 86.2 84.5 (-1.7) 84.7 86.2 (+1.5) 87.0 82.5 (-4.5) 84.3 85.0 (+0.7)
Vision ImageNet [41] LE 52.9 9.5 (-43.4) 51.8 52.9 (+1.1) 75.7 73.6 (-2.1) 71.9 72.2 (+0.3)
Multimodal
COCO [41] ZS - - - - 62.0 64.6 (+2.6) 54.1 55.4 (+1.3)
Flickr30K [41] ZS - - - - 78.7 78.1 (-0.6) 67.4 70.5 (+3.1)
VQAv2 [41] FT - - - - 71.2 71.5 (+0.3) 69.3 70.2 (+0.9)
Table 6.6: Ablation study of multiple modules inside MoMo. *Models under CMGA are trained for 50K
steps.
Increasing Model Size. We conduct a larger-scale experiment where we replace the ViT-B backbone
with ViT-L and train on the same data. The resulting model, MoMo-Large, contains 335M parameters. As
seen in Tab. 6.2, scaling up the model leads to remarkable performance gains across all tasks. On VQAv2,
MoMo-Large improves over our base model by 3.8%, while the improvements on image-text retrieval are 8.0% and
5.9% for COCO and Flickr30K, respectively. We also observe broad improvements on language tasks,
while the performance on ImageNet is improved by 3.4%.
6.4.2 Ablation Study
Multi-Stage Training. Earlier work [98, 197] reveals that initializing a multimodal model with vision
weights is more effective than initializing it with pre-trained text weights. Along similar lines, we first train
MoMo with unimodal image data before learning unimodal text and multimodal image-text representations. The performance of MoMo after each stage on various tasks is reported in Table 6.4. After stage 1,
the model achieves reasonable image classification performance, while results on other tasks remain suboptimal. Stage 2 helps the model learn language-related information and thus improves the performance on
language tasks. Stage 3 further enhances the model's capacity on multimodal tasks. We note that the performance on image classification degrades after Stage 2. We attribute this drop to the shared encoder design for learning different modalities.
Figure 6.4: Attention maps on predicting masked words through MoMo at stage 3. Heatmaps are obtained through Transformer-MM-Explainability [23]. MoMo can capture meaningful regions for MLM through cross-modality attention. (Example captions shown in the figure: “A white sink next to a toilet”, “Plant in window”, “A cat is napping with a pile of sneakers”, “Look up and see these beautiful palms in the breeze”, “Person walking past the grand temple”, “View down the main street from atop roman structure”.)
However, we observe that this drop is recovered and further improved
during Stage 3, while language performance is retained. During stage 3, we also evaluated a variant that
includes image-only training (along with text-only and multimodal objectives). However, this variant did not show significant
differences on downstream tasks, so we decided to save compute and skip this additional objective.
Combining Stages 2 and 3. We conduct experiments to investigate whether stages 2 and 3 can be
combined. In the combined setting, the model is trained with the unimodal image, unimodal text and
multimodal image-text losses at each training update step. We report the results in Tab. 6.5. We observe
that when stages 2 and 3 are combined, the performance on language tasks drops considerably (-3.8 points
on average), while the performance on image and multimodal tasks shows some improvement (+1.4 on average).
Simultaneous Training with Multiple Modalities. For this ablation study, we investigate the effect
of not doing simultaneous training as described in Sec. 6.3.5, i.e., during stage 2 the model is trained only on
the text dataset (the image dataset is removed), and during stage 3 the model is trained only on multimodal
datasets (the text-only dataset is removed). Table 6.6 shows the performance difference between MoMo and
those variations. At stage 2, performance on language tasks remains similar without simultaneous training,
while its vision counterpart drops from 52.9% to 9.5%. Such a drop indicates that learning text features can
result in a loss of image information if image data are not involved at that stage. Similarly, though with a lower
magnitude, results on text tasks degrade without simultaneous training at Stage 3.
Cross-Modality Gradient Accumulation. The “CMGA” column in Table 6.6 shows that MoMo consistently achieves better performance on all tasks with Cross-modality gradient accumulation (Sec. 6.3.5).
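CMGA is defined in Sec. 6.3.5; as a rough illustration only, the sketch below assumes it amounts to accumulating gradients from the per-modality losses of one step into the shared encoder before a single optimizer update. The loss functions and batches are placeholders, not MoMo's actual objectives.

```python
def cmga_step(model, optimizer, text_batch, multimodal_batch,
              text_loss_fn, multimodal_loss_fn):
    """One illustrative stage-3 update with cross-modality gradient accumulation."""
    optimizer.zero_grad()

    # Gradients from the unimodal text objective accumulate in the .grad buffers.
    loss_text = text_loss_fn(model, text_batch)
    loss_text.backward()

    # Gradients from the multimodal objective add to the ones already stored.
    loss_mm = multimodal_loss_fn(model, multimodal_batch)
    loss_mm.backward()

    # A single parameter update therefore sees signal from both modalities.
    optimizer.step()
    return loss_text.item(), loss_mm.item()
```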
Shared Decoders. As described in Sec. 6.3, we adopt separate decoders for different modalities. In the
“Decoder” column of Tab. 6.6, we show the performance comparison if the decoder is, instead, shared at
stage 2. Sharing the decoder results in a slight degradation in average performance.
MLM with cross-modality knowledge. As discussed in Sec. 6.3, we use a cross-modal masking
objective to enrich multimodal features. In Fig. 6.4, we visualize the attention maps on images for MLM
predictions. The results suggest that the model is able to capture the corresponding region in an image
given a text description referring to it.
Chapter 7
Large Language Models for Low-Shot Image Classification
7.1 Introduction
Low-shot image classification tasks, including few-shot and zero-shot variants, require learning from a set of
class names along with a limited, or even empty, set of images. Such capabilities are crucial for the extension and
generalization of vision systems. Vision-Language (VL) models trained on large-scale web data, such as
CLIP [168] and ALIGN [85], provide a new paradigm due to their generalization capabilities, which include
zero-shot classification, and have been used in recent work [95, 250, 251, 96, 90, 112, 138]. Due to the
scarcity of images for training, methods built for both tasks rely heavily on category names alone as
the source of class-specific knowledge, resulting in a shortage of distinguishable descriptions. Meanwhile,
Large Language Models (LLMs), e.g. GPT-4 [154] and LLaMA [201, 202], have demonstrated their encyclopedic knowledge and thus can provide linguistic visual descriptions of objects. Here, we investigate how
to leverage LLMs for low-shot image classification.
The emergence of prompt learning has provided an efficient way to adapt large pre-trained models.
Previous work has explored various strategies to prompt vision-language (VL) models, including vision-conditioned text prompt learning [250], joint VL prompt learning [95] and self-regulated VL prompts [96].
On the text side, regardless of the learning strategy, learned prompt vectors are shared across all categories. The only difference among text inputs is the class name. In low-shot scenarios where visual data
Figure 7.1: Demonstration of LLaMP: (a) LLMs can provide visual descriptions for fine-grained object categories; (b) Zero-shot base-to-novel generalization benefits from the LLM knowledge. In panel (a), the query “In one sentence, describe the distinctive appearance of a Yak-40, a type of aircraft.” yields the LLaMA response “The Yak-40 has a unique trijet configuration with a large passenger window section and a sloping nose, along with three engines mounted on the rear of the aircraft, creating an unmistakable silhouette in the sky.”, from which phrases such as “sloping nose”, “trijet configuration”, “three engines” and “large passenger window section” are highlighted.
is limited, the extraction of class-specific knowledge from textual inputs becomes essential. However, the
current paradigm, which relies on the CLIP text encoder to distinguish between class names, faces challenges, particularly with fine-grained target categories. For example, in FGVCAircraft [141], the class name
“Yak-40” can barely provide any information for recognizing the object.
Large Language Models, trained on large text corpora, are good candidates to serve as a complement. As shown in Fig. 7.1a, when queried about “Yak-40”, the LLM generates a sentence detailing the visual
appearance of the Yak-40 that can be further parsed into noun phrases and integrated into text prompts,
providing richer information compared with the ordinary prompt. We also show in Fig. 7.1b that by simply
incorporating noun phrases extracted from an LLM’s response, the performance of ordinary CLIP models is improved by more than 1% without any training. Although recent prompt-learning-based methods
have shown notable improvements, it is non-trivial to apply them to textual visual descriptions generated by LLMs. Thus, instead of directly taking LLM generations as the textual input, we aim at producing
class-specific representations by adapting LLMs to low-shot image classification.
One challenge of this adaptation is the domain gap between vision and language. When trained exclusively on textual corpora, the latent feature space of an LLM significantly diverges from that of its visual
counterpart. Even worse, the data scarcity in the low-shot scenario makes it virtually impossible to
align the two spaces through a plain contrastive loss. We argue that the CLIP text encoder, which is trained to
project features from the language domain into the joint VL domain, can serve as the bridge. Thus, we
propose the LLaMP framework, Large Language Models as Prompt learners, which leverages LLMs to
learn informative prompts for CLIP models. In LLaMP, we treat the LLM as a prompt learner for the CLIP
text encoder. More specifically, for each object category, LLaMP extracts corresponding knowledge from
the LLM and yields class-specific prompt vectors, which are further combined with class-agnostic prompt
embeddings (as in previous approaches) and encoded by the CLIP text encoder. We design an efficient
tuning pipeline that avoids fully fine-tuning the language model while performing effective adaptation.
Following the protocol in [250, 251], we evaluate LLaMP in two typical scenarios: zero-shot base-to-novel generalization [251] and few-shot image classification. For each scenario, we run LLaMP on
11 datasets covering a spectrum of tasks. On average, LLaMP achieves a 1.3% boost on the harmonic
mean against the state-of-the-art PSRC [96], and 9.6% over the ordinary CLIP [168], on base-to-novel
generalization. We also observe an average improvement of 0.94% on 16-shot image classification.
In summary, our approach makes use of Large Language Models to improve performance in low-shot
image classification scenarios. The main contributions are: i) To the best of our knowledge, we are the
first to investigate how to use the encyclopedic knowledge inherent in Large Language Models (LLMs)
to enhance low-shot image classification; ii) We design a framework, LLaMP, to effectively adapt LLMs
for image classification without training the entire language model, and achieve state-of-the-art results in both
few-shot and zero-shot settings; iii) We conduct extensive analysis investigating the effectiveness of each
component of LLaMP, and discuss the optimal setup for LLM-aided image classification.
7.2 Related Work
Large Language Models (LLMs). Recent years have witnessed remarkable progress in scaling up the
size and capabilities of LLMs. Zhang et al. [243] first introduced a suite of transformers pre-trained at
scale, followed by PaLM [33]. ChatGPT/GPT-4 [153, 154] emerged as milestone conversational models,
demonstrating impressive conversational abilities as generalist models. Vicuna [31] advanced further by
learning from ChatGPT, while LLaMA [201] demonstrated that larger-scale training yields stronger foundation models. The subsequent LLaMA-2 [202] and PaLM-2 [6] achieved further gains in scale, efficiency
and reasoning. Most recently, Almazrouei et al. [4] released Falcon, a 40B model.
Prompt Learning. With the progress of large-scale vision-language models such as CLIP [168] and
ALIGN [85], which reveal strong zero-shot transferability, prompt learning has emerged as an
efficient learning scheme, where learnable prompts are appended to the input to fine-tune models. For
low-shot image classification, CoOp [251] and CoCoOp [250], which model context words as learnable
vectors to automate prompt engineering, have shown significant improvements over regular CLIP. MaPLe
[95] further employed a hierarchical multi-modal prompting strategy across transformer blocks for progressive feature modeling. Kan et al. [90] incorporated external knowledge by designing knowledge-aware
prompts and an adaptation head for better generalization. Lee et al. [112] used masked attention to prevent
internal representation shift for better generalization. Khattak et al. [96] further improved prompt learning
by guiding prompts to balance task-specific and task-agnostic knowledge via mutual agreement maximization and prompt ensembling.
Figure 7.2: An overview of the LLaMP framework. We first generate the knowledge cache by passing the query prompt (e.g., “Describe a Chevrolet Corvette ZR1 2012”) through the LLM $\mathcal{D}$ and use the knowledge cache to encode $p_l$, resulting in the adaptive prompts $\tilde{h}^i_l = W^i h_l + b^i$ for the CLIP text encoder $\mathcal{G}$. $\tilde{h}_l$ is combined with regular learnable prompts of $\mathcal{G}$ to generate the final text feature vector $g_p$. The image feature vector $f_p$ is obtained through a hybrid-tuning strategy combining prompt learning and low-rank adaptation (LoRA).
7.3 Approach
7.3.1 Preliminaries
Similar to previous CLIP-based learning approaches, we consider the classification problem as an image-text matching problem. We denote the image encoder and the text encoder in CLIP-like models as $\mathcal{F}$
and $\mathcal{G}$, parameterized by $\theta_\mathcal{F}$ and $\theta_\mathcal{G}$, respectively. An input image $x \in \mathbb{R}^{C \times H \times W}$ is split into $M$ equal-sized patches, which are converted into a sequence of embeddings $\tilde{x} = \{e_{cls}, e_1, e_2, \ldots, e_M\}$. The visual
input sequence $\tilde{x}$ is encoded by the image encoder, producing the image feature $f = \mathcal{F}(\tilde{x})$. On the
text side, the class label $y$ and the associated name are formatted as “A photo of [STH]” and tokenized
into a sequence of tokens $\tilde{y} = \{t_{bos}, t_1, t_2, \ldots, t_L, t_{eos}\}$, where $L$ is the length of the input tokens. The
input sequence is then encoded into $g = \mathcal{G}(\tilde{y})$. For image classification, target class labels $\{1, 2, \ldots, C\}$
are encoded into text features $g_i$. Classification is done by picking the class that has the highest
similarity with the vision feature: $\hat{y} = \arg\max_i \mathcal{C}(f, g_i)$, where $\mathcal{C}$ is the softmax cosine-similarity function
$$\mathcal{C}(f, g_i) = \frac{\exp(f \cdot g_i / \tau)}{\sum_{j=1}^{C} \exp(f \cdot g_j / \tau)}$$
with temperature $\tau$.
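The matching rule above can be written in a few lines of PyTorch; the encoders themselves are omitted, and the function name and temperature value are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(image_feature, class_text_features, tau=0.01):
    """Pick the class whose text feature g_i has the highest softmax cosine
    similarity with the image feature f, as in C(f, g_i) above.

    image_feature:       tensor of shape (d,)   -- f
    class_text_features: tensor of shape (C, d) -- g_1 ... g_C
    """
    f = F.normalize(image_feature, dim=-1)
    g = F.normalize(class_text_features, dim=-1)
    logits = f @ g.t() / tau           # cosine similarities scaled by 1/tau
    probs = logits.softmax(dim=-1)     # softmax over the C classes
    return probs.argmax().item(), probs
```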
Multimodal Prompt Learning. Given the size of the CLIP model, fine-tuning the entire model
becomes infeasible. As both the image and text encoders are built with the standard transformer architecture,
prompt learning, which tunes the model by combining trainable prompts with hidden states, has been
applied to the text encoder [250, 251], the image encoder [86, 212, 213], or both [95, 96, 173]. Similar to
[96, 173], we build our method following the vision-language prompting paradigm with deep prompting
[86, 96], which inserts prompts not only into the input layer but also into later encoder layers.
More specifically, for each transformer layer that takes prompts, we define $V$ learnable visual prompts
$p_v = \{p_v^1, p_v^2, \ldots, p_v^V\}$ and $T$ learnable language prompts $p_t = \{p_t^1, p_t^2, \ldots, p_t^T\}$. For the $i$-th vision
encoder layer, visual prompts $p_v^i$ are appended to the input embeddings: $\tilde{x}_p^i = \{e_{cls}^i, e_1^i, e_2^i, \ldots, e_M^i, p_v^i\}$.
The prompt-augmented vision feature, $f_p = \mathcal{F}(\tilde{x}_p)$, is produced by jointly encoding the prompts and the
image. As the ViT [47] architecture in CLIP adopts the bi-directional attention mechanism, the placement
of $p_v$ has no effect on $f_p$. On the language side, prompts are concatenated with the input of the $i$-th text
encoder layer: $\tilde{y}_p^i = \{t_{bos}^i, p_t^i, t_1^i, t_2^i, \ldots, t_L^i, t_{eos}^i\}$. $\tilde{y}_p$ is further processed by the text encoder, resulting in
the prompt-augmented language feature $g_p = \mathcal{G}(\tilde{y}_p)$. The prompts to the first layer, $p_t^1$, are
initialized with the embeddings of “A photo of a”.
Low-Rank Adaptation (LoRA) [75]. As a parameter-efficient tuning technique, LoRA is designed
to adapt large transformer models without updating the original model weights. The LoRA technique is, in
particular, applied to linear projection layers. More specifically, for a linear layer with weight $W_0 \in \mathbb{R}^{d \times k}$,
LoRA creates $\Delta W$ by learning two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$:
$$h = (W_0 + \Delta W)x = W_0 x + BAx. \quad (7.1)$$
We adopt a hybrid tuning scheme on the vision encoder, which performs prompt learning on the first few
layers and applies LoRA on the rest.
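A minimal PyTorch sketch of Eqn. 7.1 follows; the rank, scaling and initialization are common LoRA defaults and are assumptions here, not the exact configuration used in LLaMP.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer W0 and learn a low-rank update BA (Eqn. 7.1)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # W0 (and its bias) stay frozen
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, rank))          # B in R^{d x r}, zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # h = W0 x + B A x (scaled); zero-initializing B makes the update start at 0.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```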
7.3.2 Adaptive Prompt Learning with LLMs
The goal of prompt tuning is to find a set of optimal prompts $p = \{p_v, p_t\}$ which maximizes the log
likelihood of $P(x, y \mid \theta_\mathcal{F}, \theta_\mathcal{G})$ over the target downstream distribution $(x, y) \sim (X, Y)$:
$$p = \arg\max_{p} \; \mathbb{E}_{(x,y)\sim(X,Y)} \log \mathcal{C}(\mathcal{F}(x; p_v), \mathcal{G}(y; p_t)) \quad (7.2)$$
However, the $p$ optimized through Eqn. 7.2 has two issues. First, $p$ is shared across all categories of the
downstream task, while the optimal prompt for each category might be different. Second, in low-shot
scenarios, $p$ is usually empirically estimated from a limited training set $X_{train}$ with limited categories
$\{1, 2, \ldots, C_{base}\}$, and therefore such $p$ can often be over-fitted to the small training set $X_{train}$ and fail to
generalize to novel categories outside $\{1, 2, \ldots, C_{base}\}$.
To overcome these problems, we propose to learn a meta function on the language side, $p_t = \Theta(y)$,
which can adaptively estimate the optimal prompt for each category. An intuitive way to estimate proper
prompts $p$ for category name $y$ is to take advantage of the knowledge of a pre-trained Large Language
Model (LLM) $\mathcal{D}$ and extract discriminative descriptions of category $y$. For example, given the input text
$z$: “Describe $\{y\}$”,
$$p_t = \{p_1, p_2, \ldots, p_k\} = \mathcal{D}(z), \quad (7.3)$$
where $p_i$ is sequentially generated by $\mathcal{D}$ such that
$$p_i = \mathcal{D}(z, t_1, \ldots, t_{i-1}) = \mathcal{D}^{(i)}(z), \qquad t_i = \mathcal{M}(p_i), \quad (7.4)$$
where $\mathcal{D}^{(i)}$ is the $i$-th forward iteration of $\mathcal{D}$, and $\mathcal{M}$ maps continuous hidden states into discrete language
tokens. To accelerate the process and to obtain $p$ in one pass, we approximate the above process with $K$
learnable prompts $p_l = \{\theta_1, \ldots, \theta_K\}$ so that
$$p_t = \Theta(y) = \mathcal{D}(\{\theta_1, \ldots, \theta_K\} \mid z) \quad (7.5)$$
Discussion. While Large Language Models (LLMs) possess robust foundational knowledge within the
linguistic domain, it is not feasible to directly substitute the text encoder of CLIP with an LLM. The reason
lies in the inherent divergence between the LLM’s latent space, which is purely language-oriented, and
the image-focused latent space of vision encoders. Attempting a direct alignment via contrastive learning
would require an extensive dataset that is typically beyond the scope of low-shot learning. To bridge
this gap, we introduce LLaMP—an adaptive prompt learning framework that leverages the LLM to craft
class-specific prompt vectors, to reinforce the text encoder for low-shot image classification.
7.3.3 The LLaMP Framework
Fig. 7.2 shows an overview of the LLaMP framework. For convenience, we denote the decoder-only LLM
as $\mathcal{D}$. The input to the decoder $\mathcal{D}$ consists of two components: textual prompts $y$ in the form of sentences,
tokenized as $\tilde{y}$, and learnable prompts $p_l$. We append the prompt embeddings to the end of the input sequence
and obtain the last hidden states of $\mathcal{D}$ as the feature $h_l$:
$$h_l = \mathcal{D}(\tilde{y}, p_l)[L+1 : L+K], \quad L = \mathrm{Length}(\tilde{y}). \quad (7.6)$$
The hidden states of $\mathcal{D}$ are then mapped to the input space of the CLIP text encoder by the projection matrix
$W \in \mathbb{R}^{d_1 \times d_2}$, where $d_1$ and $d_2$ are respectively the hidden sizes of the LLM $\mathcal{D}$ and the CLIP text encoder
$\mathcal{G}$. A set of prompt-specific biases $b \in \mathbb{R}^{K \times d_2}$ are added:
$$\tilde{h}_l = W h_l + b \quad (7.7)$$
We combine $\tilde{h}_l$ from the LLM with regular learnable prompts, as in previous approaches [96], to construct
the input for the CLIP text encoder. Similar to deep prompting [86, 96], we create layer-specific prompts
through different $W$ matrices and $b$ vectors. For the $i$-th layer, we let $\tilde{h}^i_l = W^i h_l + b^i$ and the entire
sequence is constructed as
$$\tilde{y}^i_l = \{t^i_{bos}, p^i_t, t^i_1, t^i_2, \ldots, t^i_L, \tilde{h}^i_l, t^i_{eos}\} \quad (7.8)$$
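A compact sketch of the projection step in Eqns. 7.7-7.8 is shown below. The default dimensions (4096 for LLaMA-7B hidden states and 512 for the CLIP ViT-B/16 text encoder), the depth of 9 and K = 16 follow the implementation details in Sec. 7.4.1, but the module itself is illustrative rather than the exact LLaMP code.

```python
import torch
import torch.nn as nn

class LLMToCLIPPrompts(nn.Module):
    """Map the K LLM hidden states h_l to layer-specific CLIP prompts
    h~_l^i = W^i h_l + b^i (Eqn. 7.7), one (W^i, b^i) pair per prompted layer."""

    def __init__(self, llm_dim=4096, clip_dim=512, num_layers=9, num_prompts=16):
        super().__init__()
        self.W = nn.ParameterList(
            [nn.Parameter(torch.empty(clip_dim, llm_dim)) for _ in range(num_layers)]
        )
        # Prompt-specific biases b^i in R^{K x d2}, as described above.
        self.b = nn.ParameterList(
            [nn.Parameter(torch.zeros(num_prompts, clip_dim)) for _ in range(num_layers)]
        )
        for W in self.W:
            nn.init.xavier_uniform_(W)

    def forward(self, h_l):
        # h_l: (K, llm_dim) hidden states from the LLM for the K learnable prompts.
        # Returns one (K, clip_dim) prompt tensor per prompted CLIP text layer.
        return [h_l @ W.t() + b for W, b in zip(self.W, self.b)]
```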
LLM Knowledge Cache. A Large Language Model (LLM), as implied by its name, typically comprises
billions of parameters. For example, the most compact LLaMA [201, 202] model has 7B parameters. Thus,
even performing prompt learning on an LLM becomes impractical: the memory consumption to store gradients for back-propagation can go beyond the limit of mainstream GPUs. Instead, the causal attention
mechanism inherent in decoder-only LLMs, where the embedding of an input token only depends on the
preceding tokens, facilitates a feasible workaround.
As previously mentioned, the prompt embeddings $p_l$ are appended to the end of the text tokens $\tilde{y}$. Due to the causal attention mechanism, $\tilde{y}$ is encoded independently of $p_l$. Thus, we design a two-stage
process, where we create the LLM knowledge cache by passing $\tilde{y}$ through $\mathcal{D}$ and leverage the cache to
convert $p_l$ into class-specific embeddings for the CLIP text encoder $\mathcal{G}$.
To compute the attention of a token, the only dependency is the Key and Value vectors of the preceding tokens. Thus, we adopt the KV-cache [164, 219], a technique used in inference acceleration of LLMs,
to create the knowledge cache. At the first stage, we pass the text tokens $\tilde{y}$ through the language model $\mathcal{D}$
and save the Keys and Values as the knowledge cache for the second stage. Once computed, the knowledge cache remains fixed throughout the entire training process and bears the information needed
for further computation. Thus, in LLaMP, we leverage the knowledge cache obtained at the first stage to
generate class-specific prompt embeddings.
At the second stage, we create class-specific prompt embeddings from the pre-computed knowledge
cache. As $p_l$ is not initialized in the natural language domain, it need not pass through the entire LLM;
instead, we insert those prompts $p_l$ into the last layer of the LLM, $\mathcal{D}_N$. This is achieved by encoding them
alongside the cache from $\tilde{y}$, as in
$$H_l = \mathcal{D}_N(K_{\tilde{y}}, V_{\tilde{y}}, p_l), \quad (7.9)$$
where $K_{\tilde{y}}, V_{\tilde{y}}$ represent the knowledge cache. This design enables LLaMP to efficiently learn informative
prompt embeddings for the CLIP encoder $\mathcal{G}$. It accomplishes this by incurring modest training costs,
compared with training the entire LLM. Simultaneously, it maintains the essential knowledge inherent in
the LLM decoder $\mathcal{D}$.
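The two-stage process can be sketched as follows. Stage 1 assumes a HuggingFace-style causal LM that returns per-layer key/value tensors when use_cache=True; stage 2 is shown as a simplified single-head attention over that cache (the real decoder layer is multi-head and includes layer norms and an FFN), with only the Query and Output projections trainable, matching the training targets described next. All names here are illustrative, not the actual LLaMP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def build_knowledge_cache(llm, text_token_ids):
    """Stage 1: run the frozen LLM once over the tokenized class description
    and keep the per-layer keys/values as the fixed knowledge cache.
    (The exact cache format depends on the transformers version.)"""
    out = llm(input_ids=text_token_ids, use_cache=True)
    return out.past_key_values


class CachedPromptAttention(nn.Module):
    """Stage 2 (simplified): the K learnable prompts attend to the cached
    keys/values of the description; only Q and O projections are trainable."""

    def __init__(self, dim=4096):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # trainable: distills from the cache
        self.o_proj = nn.Linear(dim, dim)   # trainable: projects to the latent space

    def forward(self, prompts, cached_keys, cached_values):
        # prompts: (K, dim); cached_keys / cached_values: (L, dim), kept frozen.
        q = self.q_proj(prompts)
        attended = F.scaled_dot_product_attention(
            q.unsqueeze(0), cached_keys.unsqueeze(0), cached_values.unsqueeze(0)
        ).squeeze(0)
        return self.o_proj(attended)        # (K, dim) class-specific hidden states
```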
Training Targets of LLaMP. Although the training strategy in Eqn. 7.9 has reduced the number of
learnable parameters, a full decoder layer inside an LLM still consists of an enormous number of parameters. For
example, one layer in LLaMA-7B holds about 200M parameters, making training of the entire layer costly. Moreover, as the
goal is to leverage the knowledge from the LLM, altering a full layer can lead to the loss of that knowledge. Thus, we only tune a subset of the layer. As
shown in Fig. 7.2, a typical decoder layer has two major components: the self-attention module, consisting of
Query, Key, Value and Output projection layers, and the Feed-Forward Network (FFN). LLaMP targets the
Query and Output projection layers inside the self-attention module. By updating the Query layer, the LLM
prompts $p_l$ learn to distill pertinent information from the knowledge cache, and the Output layer
projects it to the latent space. We keep the Key and Value layers frozen to ensure the alignment between
$p_l$ and the knowledge cache. We leave the FFN unchanged to preserve the knowledge. Further discussions
regarding these choices are made in Sec. 7.4.3.
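As a rough illustration of the “LP + QO” setting, the snippet below freezes a HuggingFace-style LLaMA model and re-enables gradients only for the Query and Output projections of the last decoder layer; the attribute paths (model.layers, self_attn.q_proj / o_proj) and the prompt initialization are assumptions of this sketch.

```python
import torch

def configure_llamp_trainables(llm, num_prompts=16):
    """Freeze the whole LLM, then unfreeze only the Q/O projections of the
    last decoder layer; return separately learnable prompts p_l."""
    for p in llm.parameters():
        p.requires_grad_(False)

    last_layer = llm.model.layers[-1]       # the only layer the prompts pass through
    for name, p in last_layer.named_parameters():
        if "self_attn.q_proj" in name or "self_attn.o_proj" in name:
            p.requires_grad_(True)          # K/V projections and the FFN stay frozen

    # The K learnable prompt embeddings are a free parameter, always trainable.
    p_l = torch.nn.Parameter(torch.randn(num_prompts, llm.config.hidden_size) * 0.02)
    return p_l
```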
Textual Priors from Pre-Generated Responses. We extend the initial prompt, “In one sentence,
describe the distinctive appearance of [STH]”, by incorporating the response generated by the
language model into the input sequence. This approach enriches the base content: the generated text provides a clear and explicit description of the object’s appearance, acting as a valuable informative prior for
language model adaptation. However, it is common for responses from an LLM to include filler words like
“sure” for sentence structure coherence. To refine the input, we parse the noun phrases from the LLM’s
response through spaCy [71], an NLP engine, and merge them with the initial prompt, forming a more
focused and informative language prior.
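The noun-phrase parsing step can be done with spaCy's noun_chunks; the sketch below also merges the parsed phrases into the “A photo of [STH] with [NP]” template used later for augmentation. The pipeline name and function name are illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline with a parser works

def noun_phrase_prompts(class_name, llm_response):
    """Parse noun phrases from the LLM's description and merge them into the
    'A photo of [STH] with [NP]' template."""
    doc = nlp(llm_response)
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    return [f"A photo of {class_name} with {np}" for np in noun_phrases]

# e.g. noun_phrase_prompts("a Yak-40", "The Yak-40 has a unique trijet configuration ...")
```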
Textual Augmentations. Following the insights of Khattak et al. [96], which highlight the performance benefits of diverse textual inputs, we aim to further augment the text inputs used in the CLIP text
encoder. Our approach, building upon the methods in [96, 250], incorporates hand-crafted templates and
expands their diversity through a two-step process: i) We introduce noun phrases into the existing templates for CLIP, for example, transforming “A photo of [STH]” to “A photo of [STH] with [NP]”,
thereby enriching the descriptive content; ii) We create a variety of new prompt templates for the LLM
similar to “In one sentence, describe the distinctive appearance of [STH]” through GPT-4
[154], to further diversify the text input.
7.3.4 Training and Inference
Similar to PSRC [96], our objective function consists of three components: the main cross-entropy loss
$\mathcal{L}_{CE}$, a feature-level L1 regularization $\mathcal{L}_{l1}$, and a soft distillation loss $\mathcal{L}_{dist}$. Given $C$ training categories and
$N$ training samples, $\mathcal{L}_{CE}$ is defined as
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_i \log \frac{\exp(f^i_p \cdot g^i_p / \tau)}{\sum_j \exp(f^i_p \cdot g^j_p / \tau)}. \quad (7.10)$$
The L1 regularization is computed between the learned features $f_p, g_p$ and the pre-trained CLIP features $\hat{f}, \hat{g}$:
$$\mathcal{L}_{l1} = \frac{1}{N} \sum_i \lambda_v |f^i_p - \hat{f}^i| + \frac{1}{C} \sum_i \lambda_t |g^i_p - \hat{g}^i|, \quad (7.11)$$
where $\lambda_v$ and $\lambda_t$ are coefficients. The prediction of LLaMP is further bound by the KL-divergence between the
predicted distributions of LLaMP and vanilla CLIP:
$$\mathcal{L}_{dist} = \lambda_{dist} D_{KL}(f_p \cdot g_p, \hat{f} \cdot \hat{g}). \quad (7.12)$$
We sum all three losses up as the final objective function: $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{l1} + \mathcal{L}_{dist}$.
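A sketch of the combined objective is given below. The coefficient defaults follow the implementation details in Sec. 7.4.1, the temperature is a placeholder, the L1 terms are written as element-wise means, and the direction of the KL term simply follows the order written in Eqn. 7.12; none of this is the exact LLaMP code.

```python
import torch
import torch.nn.functional as F

def llamp_loss(f_p, g_p, f_hat, g_hat, labels, tau=0.01,
               lambda_v=10.0, lambda_t=25.0, lambda_dist=2.5):
    """L = L_CE + L_l1 + L_dist (Eqns. 7.10-7.12).

    f_p:   (N, d) learned image features   f_hat: (N, d) frozen CLIP image features
    g_p:   (C, d) learned text features    g_hat: (C, d) frozen CLIP text features
    labels: (N,) ground-truth class indices
    """
    logits = f_p @ g_p.t() / tau                # (N, C) scaled similarities
    loss_ce = F.cross_entropy(logits, labels)   # Eqn. 7.10

    # Eqn. 7.11: L1 regularization toward the frozen pre-trained CLIP features.
    loss_l1 = lambda_v * (f_p - f_hat).abs().mean() \
            + lambda_t * (g_p - g_hat).abs().mean()

    # Eqn. 7.12: KL divergence between LLaMP's and vanilla CLIP's predictions.
    logits_hat = f_hat @ g_hat.t() / tau
    loss_dist = lambda_dist * F.kl_div(
        logits_hat.log_softmax(dim=-1),         # log-probs of the frozen CLIP model
        logits.softmax(dim=-1),                 # probs predicted by LLaMP
        reduction="batchmean",
    )
    return loss_ce + loss_l1 + loss_dist
```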
During training, we randomly sample one LLM template as the input of LLaMP for each batch. For
inference, we compute the probability distribution predicted from each input template and average them.
7.4 Experiments
7.4.1 Experiment Setup
Datasets. Similar to previous work [96, 95, 250], in our study, we evaluate LLaMP performance over a
spectrum of classification tasks with 11 datasets, including ImageNet [41] and Caltech101 [53] for generic
image classification, OxfordPets [158], StanfordCars [102], Flowers102 [151], Food101 [17], and FGVCAircraft [141] for fine-grained classification, SUN397 [223] for scene recognition, UCF101 [191] for action
recognition, DTD [35] for texture classification, and EuroSAT [70] for satellite image recognition.
Scenarios & Metrics. We evaluate LLaMP on two typical low-shot scenarios: zero-shot base-to-novel
generalization and few-shot image classification. In zero-shot base-to-novel generalization, the base classes
are seen during training, while the novel classes are unseen. We measure model performance through
Method
Average ImageNet [41] Caltech101 [53] OxfordPets [158]
Base Novel HM Base Novel HM Base Novel HM Base Novel HM
CLIP [168] 69.34 74.22 71.70 72.43 68.14 70.22 96.84 94.00 95.40 91.17 97.26 94.12
CoOp [251] 82.69 63.22 71.66 76.47 67.88 71.92 98.00 89.81 93.73 93.67 95.29 94.47
CoCoOp [250] 80.47 71.69 75.83 75.98 70.43 73.10 97.96 93.81 95.84 95.20 97.69 96.43
KAPT∗ [90] 78.41 70.52 74.26 71.10 65.20 68.02 97.10 93.53 95.28 93.13 96.53 94.80
ProDA [138] 81.56 72.30 76.65 76.66 70.54 73.47 97.74 94.36 96.02 95.43 97.76 96.58
MaPLe [95] 82.28 75.14 78.55 75.40 70.32 72.72 98.27 93.23 95.68 95.43 97.83 96.62
RPO [112] 81.13 75.00 77.78 76.60 71.57 74.00 97.97 94.37 96.03 94.63 97.50 96.05
PSRC [96] 84.26 76.10 79.97 77.60 70.73 74.01 98.10 94.03 96.02 95.33 97.30 96.30
LLaMP 85.16 77.71 81.27 77.99 71.27 74.48 98.45 95.85 97.13 96.31 97.74 97.02
∆ w.r.t. PSRC +0.90 +1.61 +1.30 +0.39 +0.54 +0.47 +0.35 +1.82 +1.11 +0.98 +0.44 +0.72
Method
StanfordCars [102] Flowers102[151] Food101 [17] FGVCAircraft [141]
Base Novel HM Base Novel HM Base Novel HM Base Novel HM
CLIP [168] 63.37 74.89 68.65 72.08 77.80 74.83 90.10 91.22 90.66 27.19 36.29 31.09
CoOp [251] 78.12 60.40 68.13 97.60 59.67 74.06 88.33 82.26 85.19 40.44 22.30 28.75
CoCoOp [250] 70.49 73.59 72.01 94.87 71.75 81.71 90.70 91.29 90.99 33.41 23.71 27.74
KAPT∗ [90] 69.47 66.20 67.79 95.00 71.20 81.40 86.13 87.06 86.59 29.67 28.73 29.19
ProDA [138] 72.94 74.00 73.47 95.92 72.46 82.56 90.71 92.05 91.38 37.44 35.61 36.50
MaPLe [95] 74.70 71.20 72.91 97.70 68.68 80.66 90.30 88.57 89.43 36.90 34.13 35.46
RPO [112] 73.87 75.53 74.69 94.13 76.67 84.50 90.33 90.83 90.58 37.33 34.20 35.70
PSRC [96] 78.27 74.97 76.58 98.07 76.50 85.95 90.67 91.53 91.10 42.73 37.87 40.15
LLaMP 81.56 74.54 77.89 97.82 77.40 86.42 91.05 91.93 91.49 47.30 37.61 41.90
∆ w.r.t. PSRC +3.29 -0.43 +1.31 -0.25 +0.90 +0.47 +0.38 +0.40 +0.39 +4.57 -0.26 +1.75
Method
SUN397 [223] DTD [35] EuroSAT [70] UCF101 [191]
Base Novel HM Base Novel HM Base Novel HM Base Novel HM
CLIP [168] 69.36 75.35 72.23 53.24 59.90 56.37 56.48 64.05 60.03 70.53 77.50 73.85
CoOp [251] 80.60 65.89 72.51 79.44 41.18 54.24 92.19 54.74 68.69 84.69 56.05 67.46
CoCoOp [250] 79.74 76.86 78.27 77.01 56.00 64.85 87.49 60.04 71.21 82.33 73.45 77.64
KAPT∗ [90] 79.40 74.33 76.78 75.97 58.30 65.97 84.80 67.57 75.21 80.83 67.10 73.33
ProDA [138] 80.82 78.70 79.75 80.36 59.18 68.16 94.07 73.23 82.35 83.00 78.66 80.77
MaPLe [95] 78.47 76.93 77.79 80.67 56.48 66.44 83.90 66.00 73.88 85.23 71.97 78.04
RPO [112] 80.60 77.80 79.18 76.70 62.13 68.61 86.63 68.97 76.79 83.67 75.43 79.34
PSRC [96] 82.67 78.47 80.52 83.37 62.97 71.75 92.90 73.90 82.32 87.10 78.80 82.74
LLaMP 83.41 79.90 81.62 83.49 64.49 72.77 91.93 83.66 87.60 87.13 80.66 83.77
∆ w.r.t. PSRC +0.74 +1.43 +1.10 +0.12 +1.52 +1.02 -0.97 +9.76 +5.28 +0.03 +1.86 +1.03
Table 7.1: Comparison with state-of-the-art methods on base-to-novel generalization. LLaMP shows strong
generalization results over previous approaches on 11 image classification tasks. Absolute gains over PSRC are
indicated in blue.
∗KAPT is trained with ViT-B/32 image encoder instead of ViT-B/16.
accuracies of base and novel classes, and the harmonic mean of the two. For few-shot classification, we
assess the accuracy with 16 shots per class.
16-Shot Classification
Method Average ImageNet [41] Caltech [53] Pets [158] Cars [102] Flowers [151] Food [17] Aircraft [141] SUN397 [223] DTD [35] EuroSAT [70] UCF101 [191]
CLIP [168] 78.79 (65.02) 67.31 95.43 85.34 80.44 97.37 82.90 45.36 73.28 69.96 87.21 82.11
CoOp [251] 79.89 (73.82) 71.87 95.57 91.87 83.07 97.07 84.20 43.40 74.67 69.87 84.93 82.23
CoCoOp [250] 74.90 (70.70) 70.83 95.16 93.34 71.57 87.84 87.25 31.21 72.15 63.04 73.32 78.14
MaPLe [95] 81.79 (75.58) 72.33 96.00 92.83 83.57 97.00 85.33 48.40 75.53 71.33 92.33 85.03
PSRC [96] 82.87 (77.90) 73.17 96.07 93.67 83.83 97.60 87.50 50.83 77.23 72.73 92.43 86.47
LLaMP 83.81 (78.50) 73.49 97.08 94.21 86.07 98.06 87.62 56.07 77.02 74.17 91.31 86.84
Table 7.2: Few shot classification results with 16 shots. Numbers in the bracket indicate the average performance over 1/2/4/8/16 shots.
Method LLM Base Novel HM
CLIP 69.34 74.22 71.70
✓ 70.95 74.93 72.79
LLaMP 82.21 76.44 79.22
✓ 85.16 77.71 81.27
Table 7.3: Ablation study on the LLM Knowledge.
Implementation Details. We build LLaMP on the PyTorch [159] framework. All models are
trained with 2 NVIDIA A100 40GB GPUs. For LLaMP, we adopt LLaMA2-7B [202] as the language model
$\mathcal{D}$, and ViT-B/16 [47] as the image encoder, following [96, 95, 250, 251]. On the text side, we set the prompt
learning depth to 9. To tune the vision encoder, we adopt the hybrid tuning scheme, which performs deep
prompt learning on the first 6 layers and LoRA on the rest. Similar to [75], LoRA is applied to the Query
and Value projection layers inside attention modules. The number of $p_l$ prompts, $K$, is set to 16. We set a
global learning rate of 2e-4 with a batch size of 8. The learning rate of the LoRA modules is set to 2e-5. $\lambda_t$, $\lambda_v$
and $\lambda_{dist}$ are set to 25, 10 and 2.5, respectively.
7.4.2 Quantitative Evaluation
Zero-Shot Base-to-Novel Generalization. LLaMP outperforms existing state-of-the-art prompt learning methods on most metrics of 11 classification datasets in the base-to-novel generalization benchmark.
As shown in Tab. 7.1, compared to the latest model PSRC [96], LLaMP achieves average gains of 0.90%
in base accuracy, 1.61% in novel accuracy, and 1.30% in harmonic mean. Moreover, LLaMP
consistently achieves higher harmonic means (HM) than other models. These improvements indicate that our approach better balances performance on base and novel data, thus achieving stronger
generalization compared to prior prompt learning techniques.
In particular, LLaMP excels on fine-grained datasets requiring detailed analysis. On FGVCAircraft,
LLaMP surpasses PSRC by 4.57% on base accuracy and 1.75% on HM, highlighting its strong understanding of detailed aircraft features. Furthermore, on EuroSAT, LLaMP achieves improvements of 9.76% and
5.28% on novel accuracy and HM, respectively. We also observe similar performance gains on StanfordCars, where LLaMP outperforms PSRC by 3.29% on base accuracy and 1.31% on HM. The information embedded
in the LLM enables LLaMP to capture and utilize the rich semantic information necessary for distinguishing
between closely related categories.
Few-Shot Classification. LLaMP also achieves improvements across these classification datasets on
few-shot classification tasks, reaching an average classification accuracy of 83.81%, as shown in Tab. 7.2. Notably, on
FGVCAircraft and StanfordCars, LLaMP shows a significant improvement over PSRC, further demonstrating that the knowledge from language models benefits the recognition of fine-grained object categories,
which aligns with our observation on zero-shot base-to-novel generalization. Moreover, on DTD, where
MaPLe and PSRC achieve around 72% accuracy, LLaMP achieves a higher accuracy of 74.17%, underscoring its ability to recognize textures.
7.4.3 Ablation Study
Is the knowledge from the LLM helping? In Tab. 7.3, we show that the knowledge from the LLM helps in
two ways: without any training, the performance of the ordinary CLIP model can be improved by introducing noun
phrases, and the LLaMP framework shows further improvement after training.
LP QO KV FFN % Base Novel HM
✓ .03 85.00 77.29 80.96
✓ ✓ ✓ 33 85.20 77.45 81.14
✓ ✓ ✓ 83 85.05 77.73 81.22
✓ ✓ ✓ ✓ 100 85.23 77.56 81.21
✓ ✓ 17 85.16 77.71 81.27
Table 7.4: Ablation study on the Training Strategy. “%” indicates the ratio of parameters trained compared
to fully tuning a layer.
Method Priors Base Novel HM
LLaMP
✗ 84.90 77.59 81.08
Plain 85.26 77.56 81.22
NP 85.16 77.71 81.27
Table 7.5: Ablation Study on Pre-generated Text Priors. ✗ refers to “without textual priors” and NP stands
for noun phrases.
Noun phrases are parsed from the LLM's responses to the prompt “Describe [STH]”. We then use the
template “A photo of [STH] with [NP]” to generate the NP-augmented text embedding for CLIP, and
take the average of all augmented embeddings for classification. In Tab. 7.3 we show that even ordinary
CLIP can benefit from incorporating LLMs' knowledge.
Furthermore, the comparison between LLaMP and LLaMP without the LLM indicates that merely integrating LoRA [75] into the vision encoder is not beneficial. “LLaMP without LLM” is essentially an
ordinary prompt learning model plus LoRA in the vision encoder. We show that the improved vision
encoding capacity only helps when the quality of the text embeddings is enhanced by incorporating
LLMs' knowledge through LLaMP.
Decoder Training Strategy. We categorize the trainable parameters of $\mathcal{D}_N$ into four groups: learnable
prompts (LP), Query and Output projections (QO), Key and Value projections (KV), and the feed-forward
network (FFN). Tab. 7.4 indicates that LLaMP can achieve desirable results by learning just the prompts of
$\mathcal{D}$. Going one step further, adding QO to the optimization achieves the best performance. Although other setups
introduce many more trainable parameters, they cannot surpass the “LP + QO” strategy.
Method Base Novel HM
LLM Only 81.74 35.82 49.81
LLaMP 85.16 77.71 81.27
Table 7.6: The CLIP text encoder helps adaptation.
Scheme Base Novel HM
Prompt × 9 84.67 77.28 80.81
LoRA × 12 84.89 77.27 80.90
Prompt ×6 + LoRA ×6 85.16 77.71 81.27
Table 7.7: Study on Vision Tuning Scheme. Our hybrid design achieves the best performance.
Effect of Textual Priors. We study the effect of pre-generated textual priors on LLaMP. We compare
three different approaches: without textual priors, using plain responses as the prior, and LLaMP, which
takes parsed noun phrases. Tab. 7.5 shows that LLaMP can achieve over 81% on HM without pre-generated
priors, while adding parsed noun phrases as textual priors further pushes the HM to 81.27%.
CLIP as the bridge. One may wonder whether it is possible to replace the CLIP text encoder with a large
language model. Here, we study two setups: i) LLM as encoder, which treats the output of the language
model, $\tilde{h}_l$, as the text feature; ii) LLaMP, which treats $\tilde{h}_l$ as part of the text input prompt. Tab. 7.6 reveals that
relying solely on the LLM results in poor accuracy on novel categories. This supports our hypothesis that
aligning LLMs with vision encoders generally requires a more extensive dataset. Furthermore, LLaMP's
design significantly improves novel accuracy, by more than 40%.
Vision Training Strategy. As ViT-B/16 has 12 transformer layers, we compare different vision training strategies within LLaMP in Tab. 7.7. Apart from the default hybrid scheme, we evaluate setups including prompt learning in the first 9 layers (P9), a setup similar to PSRC [96], and LoRA [75] in all 12 layers.
The results suggest that the hybrid scheme leverages the strengths of both prompt learning and LoRA, addressing
potential bottlenecks in the vision encoder and enhancing the overall performance of LLaMP.
Figure 7.3: Effect of the number of LLM prompts on the harmonic mean (HM of 80.94 / 80.96 / 81.00 / 81.27 / 81.11 for 2 / 4 / 8 / 16 / 24 prompts). 16 prompts achieve the most balanced performance.
Figure 7.4: Visualization of LLaMP predictions by GradCAM [183]. (Examples: classname “An-12” with noun phrases “four engines”, “turboprop aircraft”, “large vertical fin”; classname “Industrial Buildings” with “a cluster”, “rectangular structures”, “flat roofs”, “straight lines”.)
Number of LLM Prompts. We vary the number of LLM prompts and study their effects on LLaMP.
As in Fig. 7.3, using 16 prompts optimizes the LLM’s capabilities, achieving the highest harmonic mean at
81.27%.
Visualizations. In Fig. 7.4, we visualize the gradient heatmaps of input images from FGVCAircraft and
EuroSAT through GradCAM [183]. The figure shows that LLaMP can capture distinctive features that
match the LLM's description.
Chapter 8
Conclusions and Future Work
In the previous chapters, I addressed various problems in leveraging vision-language corpora for visual understanding and discussed my approaches to problems including zero-shot inference, compositionality and
vision-language pre-training. In this chapter, I conclude the thesis and discuss a few potential directions for
the future.
8.1 Conclusions
In Chapter 2, we study the compositional learning problem before the introduction of VLMs, covering both image
classification and object detection. For image classification, we argue that a feasible CZSL model should consider the knowledge from both individual primitives and joint compositions. Thus, we propose Blended
Individual and Pairwise Representations (BIPR) [245], which fuses the knowledge from both sides through
a blending operation. Experiments show that BIPR achieves SOTA on three CZSL benchmarks: MIT-States, C-GQA and UT-Zappos. Furthermore, we explore the task of jointly detecting objects and predicting their attributes. We show that naively attaching attribute heads to an R-CNN structure and jointly
training object categories and attributes leads to a significant drop in object detection performance due to
feature entanglement. We eliminate such feature entanglement via a two-stream pipeline with separate
networks [246]. We validate our approach on a subset of Visual Genome, VG-20. Experiments show that
our method effectively improves the performance on both tasks.
In Chapter 3, we explore how to leverage large-scale Vision-Language Pre-trained (VLP)
models, particularly CLIP, more effectively for compositional zero-shot learning. Unlike previous methods,
which treat CLIP as a black box, we propose to slightly modify the architecture and attach Concept-Aware
Intra-Layer Adapters (CAILA) [248] to each layer of the CLIP encoder to enhance the knowledge transfer from CLIP to CZSL. Moreover, we design a mixture-of-adapters mechanism to further improve the
generalizability of the model. Quantitative evaluations demonstrate that CAILA achieves significant improvements on all three common benchmarks. Due to the lack of an infeasible-pair filter, however, CAILA's performance
drops from the closed-world to the open-world setting, where the number of possible pairs greatly increases. We
also provide comprehensive discussions on deciding the optimal setup.
In Chapter 4, we present a new vision-language transformer-based model, FashionVLP [59], which
leverages prior knowledge from large image-text corpora and multiple contextual image features to effectively perform fashion image retrieval with textual feedback. Our model also provides a novel attention-based approach for effectively fusing visual information from diverse visual contexts when learning candidate image embeddings. Furthermore, we present an efficient framework for generating embeddings of
candidate images, which lie in the same latent space as the joint encodings of reference images and feedback queries, by excluding the parameter-heavy transformer layers from the computation. Our model achieves state-of-the-art results on benchmark datasets. In particular, it
achieves more than 23% relative improvement on FashionIQ, which contains complex yet realistic natural
language feedback for fashion image retrieval.
In Chapter 5, we investigate how to better leverage the multimodal knowledge inside Vision-Language
Pre-trained (VLP) transformers. Unlike existing dual-tower VLP paradigms, which only utilize features from
the top vision layer and omit lower-level features with richer local information, we propose the Fractional Intermediate Tower (FIT) [244], a bridging module that provides the fusion encoder with access
to multi-layer vision features and enriches vision features through a top-down pathway. Quantitative
evaluations on downstream tasks verify that FIT beats SoTA solutions on various vision-language tasks.
Further experiments show that FIT is robust across multiple visual backbones.
In Chapter 6, we made three key contributions: i) We propose a compact and efficient transformer
model that can be applied to vision-only, language-only and vision-language tasks without requiring
modality-specific encoders; ii) We propose a new training pipeline that involves simultaneous learning
from multiple modalities via a hierarchical stage-wise training; iii) By leveraging masked input reconstruction, contrastive and image-text matching objectives, MoMo [21] is competitive with the state-of-the-art
multimodal models that are both larger in size and are trained on much larger datasets. We also verify
the scalability of MoMo through experiments with a larger transformer backbone. We present quantitative comparisons against similarly sized baseline models, demonstrating MoMo's effectiveness. We further
present ablation studies to illustrate the impact of different design choices. In the future, we plan to extend
MoMo to larger models, include more pre-training data and incorporate extra modalities.
In Chapter 7, our study shows that the encyclopedic knowledge from LLMs is beneficial for low-shot
image classification as extra class-specific information. To leverage such knowledge, we propose LLaMP
[247], a framework that adapts LLMs as prompt learners for the CLIP model. Over two common low-shot
scenarios: zero-shot generalization and few-shot learning, LLaMP demonstrates notable improvements
compared with previous state-of-the-art methods on a spectrum of datasets. While LLaMP reveals an effective way
of leveraging LLMs' knowledge, the two modalities, vision and language, only interact at the finest feature
level. Given the broader LLM-aided knowledge on the language side, the performance could potentially be
further improved by introducing language priors at earlier vision encoding stages.
8.2 Future Work
As large foundation models become prevalent in recent research, there are a variety of directions that can
be explored in the future.
LLM interface for multimodal data.
Recent advances in LLMs have revealed the power of scaling laws and of large models trained on enormous language corpora. It remains unclear how to combine the power of LLMs with other modalities, including images, audio, hand poses, etc. It is important for both academia and industry to
make LLMs benefit tasks that go beyond text. We can create encoders prior to the LLM interface and
project inputs from each modality into a joint input space that can be accommodated by large transformer models.
Multimodal scaling of large foundation models.
Autoregressive training with large-scale language data has proven effective in recent model releases, e.g. GPT-4 [154] and LLaMA [201, 202]. However, there are obstacles in applying a similar strategy
to vision and multimodal data. One issue is that, unlike language, which has well-defined token-level self-supervision, images and videos cannot be decomposed into natural “tokens”. Although some work [65, 11]
has proposed to tokenize vision data by training autoencoders, whether such a practice can scale remains
unknown. One workaround is to leverage the multimodal correspondence in vision-language
corpora such as image-text pairs, which leads to another problem: how can we obtain data at a scale comparable to language corpora? A potential solution is to synthesize such data through
game engines, where the correspondence can be established automatically.
Efficient adaptation of large foundation models.
Though I have introduced a few methods for adapting large foundation models to downstream tasks,
these methods still require a certain level of fine-tuning and GPUs with large memory, e.g.
A100 or H100. I would like to push forward in this direction and explore models that can be adapted
on consumer-level machines or even edge devices, without causing a significant performance drop on
downstream tasks.
Bibliography
[1] Eneko Agirre, Lluís Márquez, and Richard Wicentowski. “Proceedings of the Fourth International
Workshop on Semantic Evaluations (SemEval-2007)”. In: Proceedings of the Fourth International
Workshop on Semantic Evaluations (SemEval-2007). 2007.
[2] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and
Boqing Gong. “Vatt: Transformers for multimodal self-supervised learning from raw video, audio
and text”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 24206–24221.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson,
Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. “Flamingo: a visual language
model for few-shot learning”. In: arXiv preprint arXiv:2204.14198 (2022).
[4] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli,
Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay,
Quentin Malartic, et al. Falcon-40B: an open large language model with state-of-the-art
performance. Tech. rep. Technology Innovation Institute, 2023.
[5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and
Lei Zhang. “Bottom-up and top-down attention for image captioning and visual question
answering”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018,
pp. 6077–6086.
[6] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos,
Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. “Palm 2 technical report”. In:
arXiv preprint arXiv:2305.10403 (2023).
[7] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. “NetVLAD: CNN
architecture for weakly supervised place recognition”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2016, pp. 5297–5307.
[8] Yuval Atzmon, Felix Kreuk, Uri Shalit, and Gal Chechik. “A causal view of compositional
zero-shot recognition”. In: arXiv preprint arXiv:2006.14610 (2020).
[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer normalization”. In: arXiv preprint
arXiv:1607.06450 (2016).
[10] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. “Data2vec:
A general framework for self-supervised learning in speech, vision and language”. In: arXiv
preprint arXiv:2202.03555 (2022).
[11] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. “BEiT: BERT Pre-Training of Image
Transformers”. In: International Conference on Learning Representations. 2022. url:
https://openreview.net/forum?id=p-BhZSz59o4.
[12] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal,
Subhojit Som, Songhao Piao, and Furu Wei. “VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts”. In: Advances in Neural Information Processing Systems. Ed. by
Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho. 2022. url:
https://openreview.net/forum?id=bydKs84JEyw.
[13] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal,
Subhojit Som, and Furu Wei. “VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts”. In: 2021 Neural Information Processing Systems. Nov. 2021. url:
https://www.microsoft.com/en-us/research/publication/vlmo-unified-vision-language-pretraining-with-mixture-of-modality-experts/.
[14] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. “The Fifth PASCAL
Recognizing Textual Entailment Challenge.” In: TAC. 2009.
[15] Tamara L Berg, Alexander C Berg, and Jonathan Shih. “Automatic attribute discovery and
characterization from noisy web data”. In: European Conference on Computer Vision. Springer.
2010, pp. 663–676.
[16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. “Enriching word vectors
with subword information”. In: TACL 5 (2017), pp. 135–146.
[17] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101–mining discriminative
components with random forests”. In: European conference on computer vision. Springer. 2014,
pp. 446–461.
[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are
few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901.
[19] Zhaowei Cai and Nuno Vasconcelos. “Cascade r-cnn: Delving into high quality object detection”.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018,
pp. 6154–6162.
[20] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. “End-to-end object detection with transformers”. In: European Conference on
Computer Vision. Springer. 2020, pp. 213–229.
[21] Rakesh Chada, Zhaoheng Zheng, and Pradeep Natarajan. “MoMo: A shared encoder Model for
text, image and multi-Modal representations”. In: arXiv preprint arXiv:2304.05523 (2023).
[22] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. “Conceptual 12m: Pushing
web-scale image-text pre-training to recognize long-tail visual concepts”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 3558–3568.
[23] Hila Chefer, Shir Gur, and Lior Wolf. “Generic attention-model explainability for interpreting
bi-modal and encoder-decoder transformers”. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2021, pp. 397–406.
[24] Shiming Chen, Wenjie Wang, Beihao Xia, Qinmu Peng, Xinge You, Feng Zheng, and Ling Shao.
“FREE: Feature Refinement for Generalized Zero-Shot Learning”. In: ICCV. 2021, pp. 122–131.
[25] Xiaodong Chen, Xinchen Liu, Wu Liu, Xiao-Ping Zhang, Yongdong Zhang, and Tao Mei.
“Explainable Person Re-Identification With Attribute-Guided Metric Distillation”. In: ICCV. 2021,
pp. 11813–11822.
[26] Yanbei Chen, Shaogang Gong, and Loris Bazzani. “Image Search With Text Feedback by
Visiolinguistic Attention Learning”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). June 2020.
[27] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and
Jingjing Liu. “Uniter: Learning universal image-text representations”. In: (2019).
[28] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and
Jingjing Liu. “Uniter: Universal image-text representation learning”. In: European conference on
computer vision. Springer. 2020, pp. 104–120.
[29] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. “Cascaded
pyramid network for multi-person pose estimation”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 7103–7112.
[30] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. “Vision
Transformer Adapter for Dense Predictions”. In: The Eleventh ICLR. 2022.
[31] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An
Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. Mar. 2023. url:
https://lmsys.org/blog/2023-03-30-vicuna/.
[32] Sumit Chopra, Raia Hadsell, and Yann LeCun. “Learning a similarity metric discriminatively, with
application to face verification”. In: 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05). Vol. 1. IEEE. 2005, pp. 539–546.
[33] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra,
Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al.
“Palm: Scaling language modeling with pathways”. In: arXiv preprint arXiv:2204.02311 (2022).
[34] Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, and Diane Larlus.
“Probabilistic embeddings for cross-modal retrieval”. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2021, pp. 8415–8424.
130
[35] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi.
“Describing textures in the wild”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2014, pp. 3606–3613.
[36] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. “ELECTRA:
Pre-training Text Encoders as Discriminators Rather Than Generators”. In: International
Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=r1xMH1BtvB.
[37] Adam Coates, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in
unsupervised feature learning”. In: Proceedings of the fourteenth international conference on
artificial intelligence and statistics. JMLR Workshop and Conference Proceedings. 2011,
pp. 215–223.
[38] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. “Randaugment: Practical
automated data augmentation with a reduced search space”. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition workshops. 2020, pp. 702–703.
[39] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. “RandAugment: Practical Automated
Data Augmentation with a Reduced Search Space”. In: Advances in Neural Information Processing
Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran
Associates, Inc., 2020, pp. 18613–18624. url:
https://proceedings.neurips.cc/paper/2020/file/d85b63ef0ccb114d0a3bb7b7d808028f-Paper.pdf.
[40] Ido Dagan, Oren Glickman, and Bernardo Magnini. “The pascal recognising textual entailment
challenge”. In: Machine learning challenges workshop. Springer. 2005, pp. 177–190.
[41] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale
hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition.
Ieee. 2009, pp. 248–255.
[42] Karan Desai, Gaurav Kaul, Zubin Trivadi Aysola, and Justin Johnson. “RedCaps: Web-curated
image-text data created by the people, for the people”. In: Thirty-fifth Conference on Neural
Information Processing Systems Datasets and Benchmarks Track (Round 1). 2021. url:
https://openreview.net/forum?id=VjJxBi1p9zh.
[43] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep
bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[44] Virginie Do, Oana-Maria Camburu, Zeynep Akata, and Thomas Lukasiewicz. “e-snli-ve:
Corrected visual-textual entailment with natural language explanations”. In: arXiv preprint
arXiv:2004.03744 (2020).
[45] Bill Dolan and Chris Brockett. “Automatically constructing a corpus of sentential paraphrases”.
In: Third International Workshop on Paraphrasing (IWP2005). 2005.
[46] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
131
[47] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale”. In: International Conference on Learning Representations. 2021. url:
https://openreview.net/forum?id=YicbFdNTTy.
[48] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li,
Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, and Lijuan Wang. “Coarse-to-Fine
Vision-Language Pre-training with Fusion in the Backbone”. In: Advances in Neural Information
Processing Systems. Ed. by Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho.
2022. url: https://openreview.net/forum?id=o4neHaKMlse.
[49] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang,
Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. “An empirical study of training
end-to-end vision-and-language transformers”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2022, pp. 18166–18176.
[50] Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, and Trishul Chilimbi.
“Multi-modal alignment using representation codebook”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 15651–15660.
[51] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. “CenterNet:
Keypoint Triplets for Object Detection”. In: The IEEE International Conference on Computer Vision
(ICCV). Oct. 2019.
[52] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. “Describing objects by their attributes”.
In: CVPR. 2009, pp. 1778–1785.
[53] Li Fei-Fei, Rob Fergus, and Pietro Perona. “Learning generative visual models from few training
examples: An incremental bayesian approach tested on 101 object categories”. In: 2004 conference
on computer vision and pattern recognition workshop. IEEE. 2004, pp. 178–178.
[54] Wikimedia Foundation. Wikimedia Downloads. url: https://dumps.wikimedia.org.
[55] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. “Large-scale
adversarial training for vision-and-language representation learning”. In: arXiv preprint
arXiv:2006.06195 (2020).
[56] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li,
and Yu Qiao. “Clip-adapter: Better vision-language models with feature adapters”. In: arXiv
preprint arXiv:2110.04544 (2021).
[57] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. “Video action transformer
network”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2019, pp. 244–253.
[58] Ross Girshick. “Fast r-cnn”. In: Proceedings of the IEEE international conference on computer vision.
2015, pp. 1440–1448.
132
[59] Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue Wu, Varsha Hedau, and
Pradeep Natarajan. “FashionVLP: Vision Language Transformer for Fashion Retrieval with
Feedback”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2022, pp. 14105–14115.
[60] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. “Deep image retrieval: Learning
global representations for image search”. In: European conference on computer vision. Springer.
2016, pp. 241–257.
[61] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. “Making the V in
VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering”. In:
Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[62] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris.
“Dialog-based Interactive Image Retrieval”. In: Advances in Neural Information Processing Systems.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31.
Curran Associates, Inc., 2018. url:
https://proceedings.neurips.cc/paper/2018/file/a01a0380ca3c61428c26a231f0e49a09-Paper.pdf.
[63] Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao,
and Larry S Davis. “Automatic spatially-aware fashion concept discovery”. In: Proceedings of the
IEEE international conference on computer vision. 2017, pp. 1463–1471.
[64] Xudong Han, Philip Schulz, and Trevor Cohn. “Grounding learning of modifier dynamics: An
application to color naming”. In: Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP). 2019, pp. 1488–1493.
[65] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. “Masked
autoencoders are scalable vision learners”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2022, pp. 16000–16009.
[66] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. “Momentum contrast for
unsupervised visual representation learning”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2020, pp. 9729–9738.
[67] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask r-cnn”. In: Proceedings of
the IEEE international conference on computer vision. 2017, pp. 2961–2969.
[68] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[69] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. “DEBERTA:
DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION”. In: International
Conference on Learning Representations. 2021. url: https://openreview.net/forum?id=XPZIaotutsD.
133
[70] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. “Eurosat: A novel dataset
and deep learning benchmark for land use and land cover classification”. In: IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019), pp. 2217–2226.
[71] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. “spaCy:
Industrial-strength Natural Language Processing in Python”. In: (2020). doi:
10.5281/zenodo.1212303.
[72] Mehrdad Hosseinzadeh and Yang Wang. “Composed Query Image Retrieval Using Locally
Bounded Features”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). June 2020.
[73] Mehrdad Hosseinzadeh and Yang Wang. “Composed query image retrieval using locally bounded
features”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2020, pp. 3596–3605.
[74] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe,
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. “Parameter-efficient transfer learning for
NLP”. In: ICML. PMLR. 2019, pp. 2790–2799.
[75] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen. “Lora: Low-rank adaptation of large language models”. In: arXiv preprint
arXiv:2106.09685 (2021).
[76] Ronghang Hu and Amanpreet Singh. “Unit: Multimodal multitask learning with a unified
transformer”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021,
pp. 1439–1449.
[77] Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and
Lijuan Wang. “Scaling up vision-language pre-training for image captioning”. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 17980–17989.
[78] Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. “Seeing
out of the box: End-to-end pre-training for vision-language representation learning”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 12976–12985.
[79] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. “Pixel-bert: Aligning
image pixels with text by deep multi-modal transformers”. In: arXiv preprint arXiv:2004.00849
(2020).
[80] Drew A Hudson and Christopher D Manning. “GQA: A New Dataset for Real-World Visual
Reasoning and Compositional Question Answering”. In: Conference on Computer Vision and
Pattern Recognition (CVPR) (2019).
[81] Drew A Hudson and Christopher D Manning. “Gqa: A new dataset for real-world visual
reasoning and compositional question answering”. In: CVPR. 2019, pp. 6700–6709.
134
[82] Sung Ju Hwang, Fei Sha, and Kristen Grauman. “Sharing features between objects and their
attributes”. In: CVPR. 2011, pp. 1761–1768.
[83] Phillip Isola, Joseph J Lim, and Edward H Adelson. “Discovering states and transformations in
image collections”. In: CVPR. 2015, pp. 1383–1391.
[84] Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First Quora Dataset Release: Question Pairs.
2017. url: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs (visited on
04/03/2019).
[85] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le,
Yun-Hsuan Sung, Zhen Li, and Tom Duerig. “Scaling up visual and vision-language
representation learning with noisy text supervision”. In: International Conference on Machine
Learning. PMLR. 2021, pp. 4904–4916.
[86] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan,
and Ser-Nam Lim. “Visual prompt tuning”. In: European Conference on Computer Vision. Springer.
2022, pp. 709–727.
[87] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. “DenseCap: Fully Convolutional Localization
Networks for Dense Captioning”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2016.
[88] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and
Nicolas Carion. “MDETR-modulated detection for end-to-end multi-modal understanding”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 1780–1790.
[89] Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P Xing.
“Rethinking knowledge graph propagation for zero-shot learning”. In: CVPR. 2019,
pp. 11487–11496.
[90] Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, and Feng Zheng.
“Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models”. In: ICCV. 2023,
pp. 15670–15680.
[91] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. “Compacter: Efficient low-rank
hypercomplex adapter layers”. In: NeurIPS 34 (2021), pp. 1022–1035.
[92] Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. “Kg-sp: Knowledge guided
simple primitives for open world compositional zero-shot learning”. In: CVPR. 2022,
pp. 9336–9345.
[93] Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do,
Zeynep Akata, and Thomas Lukasiewicz. “e-vil: A dataset and benchmark for natural language
explanations in vision-language tasks”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2021, pp. 1244–1254.
135
[94] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Alain Pagani,
Didier Stricker, and Muhammad Zeshan Afzal. “Learning Attention Propagation for
Compositional Zero-Shot Learning”. In: WACV. 2023, pp. 3828–3837.
[95] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and
Fahad Shahbaz Khan. “Maple: Multi-modal prompt learning”. In: CVPR. 2023, pp. 19113–19122.
[96] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan,
Ming-Hsuan Yang, and Fahad Shahbaz Khan. “Self-regulating Prompts: Foundational Model
Adaptation without Forgetting”. In: ICCV. 2023, pp. 15190–15200.
[97] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and
Byoung-Tak Zhang. “Multimodal residual learning for visual qa”. In: Advances in neural
information processing systems. 2016, pp. 361–369.
[98] Wonjae Kim, Bokyung Son, and Ildoo Kim. “Vilt: Vision-and-language transformer without
convolution or region supervision”. In: International Conference on Machine Learning. PMLR.
2021, pp. 5583–5594.
[99] Diederick P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In:
International Conference on Learning Representations (ICLR). 2015.
[100] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[101] Thomas N. Kipf and Max Welling. “Semi-Supervised Classification with Graph Convolutional
Networks”. In: 5th ICLR, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
OpenReview.net, 2017. url: https://openreview.net/forum?id=SJU4ayYgl.
[102] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. “3d object representations for
fine-grained categorization”. In: Proceedings of the IEEE international conference on computer
vision workshops. 2013, pp. 554–561.
[103] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. “Visual genome: Connecting
language and vision using crowdsourced dense image annotations”. In: arXiv preprint
arXiv:1602.07332 (2016).
[104] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. “Visual genome: Connecting
language and vision using crowdsourced dense image annotations”. In: International journal of
computer vision 123.1 (2017), pp. 32–73.
[105] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. “Visual genome: Connecting
language and vision using crowdsourced dense image annotations”. In: International Journal of
Computer Vision 123.1 (2017), pp. 32–73.
136
[106] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: (2009).
[107] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset,
Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and
Vittorio Ferrari. “The Open Images Dataset V4: Unified image classification, object detection, and
visual relationship detection at scale”. In: IJCV (2020).
[108] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. “Attribute-based classification for
zero-shot visual object categorization”. In: TPAMI 36.3 (2013), pp. 453–465.
[109] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. “Learning to detect unseen object
classes by between-class attribute transfer”. In: CVPR. 2009, pp. 951–958.
[110] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. “Albert: A lite bert for self-supervised learning of language representations”. In:
arXiv preprint arXiv:1909.11942 (2019).
[111] Yann LeCun and Corinna Cortes. “MNIST handwritten digit database”. In: (2010). url:
http://yann.lecun.com/exdb/mnist/.
[112] Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyeong Choi, Sanghyeok Lee, and Hyunwoo J Kim.
“Read-only Prompt Optimization for Vision-Language Few-shot Learning”. In: ICCV. 2023,
pp. 1401–1411.
[113] Seungmin Lee, Dongwan Kim, and Bohyung Han. “CoSMo: Content-Style Modulation for Image
Retrieval With Text Feedback”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). June 2021.
[114] Brian Lester, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient
prompt tuning”. In: arXiv preprint arXiv:2104.08691 (2021).
[115] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 7871–7880.
[116] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. “Unicoder-vl: A universal encoder
for vision and language by cross-modal pre-training”. In: Proceedings of the AAAI Conference on
Artificial Intelligence. Vol. 34. 07. 2020, pp. 11336–11344.
[117] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. “BLIP: Bootstrapping Language-Image
Pre-training for Unified Vision-Language Understanding and Generation”. In: ICML. 2022.
[118] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and
Steven Chu Hong Hoi. “Align before fuse: Vision and language representation learning with
momentum distillation”. In: Advances in neural information processing systems 34 (2021),
pp. 9694–9705.
137
[119] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. “Visualbert: A
simple and performant baseline for vision and language”. In: arXiv preprint arXiv:1908.03557
(2019).
[120] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. “What does bert
with vision look at?” In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. 2020, pp. 5265–5275.
[121] Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. “Tell-and-Answer: Towards Explainable
Visual Question Answering using Attributes and Captions”. In: EMNLP. 2018, pp. 1338–1346.
[122] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang.
“Unimo: Towards unified-modal understanding and generation via cross-modal contrastive
learning”. In: arXiv preprint arXiv:2012.15409 (2020).
[123] Xiangyu Li, Xu Yang, Kun Wei, Cheng Deng, and Muli Yang. “Siamese Contrastive Embedding
Network for Compositional Zero-Shot Learning”. In: CVPR. 2022, pp. 9326–9335.
[124] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang,
Houdong Hu, Li Dong, Furu Wei, et al. “Oscar: Object-semantics aligned pre-training for
vision-language tasks”. In: European Conference on Computer Vision. Springer. 2020, pp. 121–137.
[125] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. “Symmetry and group in attribute-object
compositions”. In: CVPR. 2020, pp. 11316–11325.
[126] Xiaodan Liang, Lisa Lee, and Eric P. Xing. “Deep Variation-Structured Reinforcement Learning
for Visual Relationship and Attribute Detection”. In: 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2017), pp. 4408–4417.
[127] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
“Feature pyramid networks for object detection”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017, pp. 2117–2125.
[128] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. “Microsoft coco: Common objects in context”. In: European
conference on computer vision. Springer. 2014, pp. 740–755.
[129] Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, Zhilan Hu, Chenggang Yan, and Yi Yang.
“Improving person re-identification by attribute and identity learning”. In: PR 95 (2019),
pp. 151–161.
[130] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. “Visual instruction tuning”. In: arXiv
preprint arXiv:2304.08485 (2023).
[131] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “Roberta: A robustly optimized bert
pretraining approach”. In: arXiv preprint arXiv:1907.11692 (2019).
138
[132] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
“Swin transformer: Hierarchical vision transformer using shifted windows”. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. 2021, pp. 10012–10022.
[133] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. “Image Retrieval on
Real-Life Images With Pre-Trained Vision-and-Language Models”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV). Oct. 2021, pp. 2125–2134.
[134] Ziwei Liu, Sijie Yan, Ping Luo, Xiaogang Wang, and Xiaoou Tang. “Fashion landmark detection in
the wild”. In: European Conference on Computer Vision. Springer. 2016, pp. 229–245.
[135] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In: International
Conference on Learning Representations. 2018.
[136] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. “Vilbert: Pretraining task-agnostic
visiolinguistic representations for vision-and-language tasks”. In: Advances in neural information
processing systems 32 (2019).
[137] Xiaocheng Lu, Song Guo, Ziming Liu, and Jingcai Guo. “Decomposed soft prompt guided fusion
enhancing for compositional zero-shot learning”. In: CVPR. 2023, pp. 23560–23569.
[138] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. “Prompt distribution
learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2022, pp. 5206–5215.
[139] Andres Mafla, Rafael S. Rezende, Lluis Gomez, Diane Larlus, and Dimosthenis Karatzas. “StacMR:
Scene-Text Aware Cross-Modal Retrieval”. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision (WACV). Jan. 2021, pp. 2220–2230.
[140] Dhruv Mahajan, Sundararajan Sellamanickam, and Vinod Nair. “A joint learning framework for
attribute models and object descriptions”. In: ICCV. 2011, pp. 1227–1234.
[141] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. “Fine-grained
visual classification of aircraft”. In: arXiv preprint arXiv:1306.5151 (2013).
[142] Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. “Learning
graph embeddings for open world compositional zero-shot learning”. In: TPAMI (2022).
[143] Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. “Open
World Compositional Zero-Shot Learning”. In: CVPR. 2021, pp. 5222–5230.
[144] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
representations of words and phrases and their compositionality”. In: NeurIPS. 2013,
pp. 3111–3119.
[145] Ishan Misra, Abhinav Gupta, and Martial Hebert. “From red wine to red tomato: Composition
with context”. In: CVPR. 2017, pp. 1792–1801.
139
[146] Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. “Learning graph
embeddings for compositional zero-shot learning”. In: CVPR. 2021, pp. 953–962.
[147] Tushar Nagarajan and Kristen Grauman. “Attributes as operators: factorizing unseen
attribute-object compositions”. In: ECCV. 2018, pp. 169–185.
[148] Zhixiong Nan, Yang Liu, Nanning Zheng, and Song-Chun Zhu. “Recognizing unseen
attribute-object pair with generative model”. In: AAAI. 2019, pp. 8811–8818.
[149] Nihal V. Nayak, Peilin Yu, and Stephen Bach. “Learning to Compose Soft Prompts for
Compositional Zero-Shot Learning”. In: ICLR. 2023. url:
https://openreview.net/forum?id=S8-A2FXnIh.
[150] Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and
Ling Shao. “Enriched Feature Guided Refinement Network for Object Detection”. In: The IEEE
International Conference on Computer Vision (ICCV). Oct. 2019.
[151] Maria-Elena Nilsback and Andrew Zisserman. “Automated flower classification over a large
number of classes”. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image
Processing. IEEE. 2008, pp. 722–729.
[152] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. “Large-scale image
retrieval with attentive deep local features”. In: Proceedings of the IEEE international conference on
computer vision. 2017, pp. 3456–3465.
[153] OpenAI. ChatGPT. Available at https://openai.com/chatgpt. 2023.
[154] OpenAI. GPT-4 Technical Report. 2023. arXiv: 2303.08774 [cs.CL].
[155] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. “Im2text: Describing images using 1 million
captioned photographs”. In: Advances in neural information processing systems 24 (2011).
[156] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. “Libra
r-cnn: Towards balanced learning for object detection”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2019, pp. 821–830.
[157] Devi Parikh and Kristen Grauman. “Relative attributes”. In: ICCV. 2011, pp. 503–510.
[158] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. “Cats and dogs”. In: 2012
IEEE conference on computer vision and pattern recognition. IEEE. 2012, pp. 3498–3505.
[159] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. “Pytorch: An imperative style,
high-performance deep learning library”. In: Advances in neural information processing systems 32
(2019).
[160] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. “Styleclip:
Text-driven manipulation of stylegan imagery”. In: ICCV. 2021, pp. 2085–2094.
140
[161] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove: Global vectors for word
representation”. In: EMNLP. 2014, pp. 1532–1543.
[162] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. “Film: Visual
reasoning with a general conditioning layer”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 32. 1. 2018.
[163] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and
Svetlana Lazebnik. “Flickr30k entities: Collecting region-to-phrase correspondences for richer
image-to-sentence models”. In: Proceedings of the IEEE international conference on computer vision.
2015, pp. 2641–2649.
[164] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury,
Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. “Efficiently scaling transformer
inference”. In: Proceedings of Machine Learning and Systems 5 (2023).
[165] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato.
“Task-driven modular networks for zero-shot compositional learning”. In: ICCV. 2019,
pp. 3593–3602.
[166] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. “Imagebert: Cross-modal
pre-training with large-scale weak-supervised image-text data”. In: arXiv preprint
arXiv:2001.07966 (2020).
[167] Filip Radenović, Giorgos Tolias, and Ondřej Chum. “CNN image retrieval learns from BoW:
Unsupervised fine-tuning with hard examples”. In: European conference on computer vision.
Springer. 2016, pp. 3–20.
[168] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. “Learning transferable visual
models from natural language supervision”. In: International Conference on Machine Learning.
PMLR. 2021, pp. 8748–8763.
[169] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
“Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[170] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. “Exploring the limits of transfer learning with a unified
text-to-text transformer”. In: Journal of machine learning research 21.140 (2020), pp. 1–67.
[171] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. “SQuAD: 100, 000+
Questions for Machine Comprehension of Text”. In: EMNLP. 2016.
[172] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
and Ilya Sutskever. “Zero-shot text-to-image generation”. In: International Conference on Machine
Learning. PMLR. 2021, pp. 8821–8831.
141
[173] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and
Fahad Shahbaz Khan. “Fine-tuned clip models are efficient video learners”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 6545–6554.
[174] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. “Learning deep representations of
fine-grained visual descriptions”. In: CVPR. 2016, pp. 49–58.
[175] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object
detection with region proposal networks”. In: Advances in neural information processing systems
28 (2015).
[176] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
“High-resolution image synthesis with latent diffusion models”. In: CVPR. 2022, pp. 10684–10695.
[177] Frank Ruis, Gertjan Burghouts, and Doina Bucur. “Independent Prototype Propagation for
Zero-Shot Compositionality”. In: NeurIPS 34 (2021).
[178] Nirat Saini, Khoi Pham, and Abhinav Shrivastava. “Disentangling Visual Embeddings for
Attributes and Objects”. In: CVPR. 2022, pp. 13658–13667.
[179] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu,
Peter Battaglia, and Timothy Lillicrap. “A simple neural network module for relational
reasoning”. In: Advances in Neural Information Processing Systems 30 (2017).
[180] Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. “Adversarial representation learning
for text-to-image matching”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 5814–5824.
[181] Florian Schroff, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face
recognition and clustering”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015, pp. 815–823.
[182] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman,
Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. “Laion-5b:
An open large-scale dataset for training next generation image-text models”. In: Advances in
Neural Information Processing Systems 35 (2022), pp. 25278–25294.
[183] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh,
and Dhruv Batra. “Grad-cam: Visual explanations from deep networks via gradient-based
localization”. In: ICCV. 2017, pp. 618–626.
[184] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and
Jian Sun. “Objects365: A large-scale, high-quality dataset for object detection”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision. 2019, pp. 8430–8439.
[185] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. “Conceptual captions: A
cleaned, hypernymed, image alt-text dataset for automatic image captioning”. In: Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
2018, pp. 2556–2565.
142
[186] Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang,
Zhewei Yao, and Kurt Keutzer. “How Much Can CLIP Benefit Vision-and-Language Tasks?” In:
International Conference on Learning Representations. 2021.
[187] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale
Image Recognition”. In: International Conference on Learning Representations. 2015.
[188] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba,
Marcus Rohrbach, and Douwe Kiela. “Flava: A foundational language and vision alignment
model”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2022, pp. 15638–15650.
[189] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D Manning, and
Andrew Y Ng. “Zero-shot learning through cross-modal transfer”. In: arXiv preprint
arXiv:1301.3666 (2013).
[190] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. “Recursive deep models for semantic compositionality over a sentiment
treebank”. In: Proceedings of the 2013 conference on empirical methods in natural language
processing. 2013, pp. 1631–1642.
[191] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human
actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012).
[192] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. “The German traffic sign
recognition benchmark: a multi-class classification competition”. In: The 2011 international joint
conference on neural networks. IEEE. 2011, pp. 1453–1460.
[193] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. “VL-BERT:
Pre-training of Generic Visual-Linguistic Representations”. In: International Conference on
Learning Representations. 2019.
[194] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. “A corpus for
reasoning about natural language grounded in photographs”. In: arXiv preprint arXiv:1811.00491
(2018).
[195] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. “Vl-adapter: Parameter-efficient transfer learning for
vision-and-language tasks”. In: CVPR. 2022, pp. 5227–5237.
[196] Hao Tan and Mohit Bansal. “LXMERT: Learning Cross-Modality Encoder Representations from
Transformers”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing. 2019.
[197] Hao Tan and Mohit Bansal. “LXMERT: Learning Cross-Modality Encoder Representations from
Transformers”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019,
pp. 5100–5111. doi: 10.18653/v1/D19-1514.
143
[198] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland,
Damian Borth, and Li-Jia Li. “Yfcc100m: The new data in multimedia research”. In:
Communications of the ACM 59.2 (2016), pp. 64–73.
[199] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and
Hervé Jégou. “Training data-efficient image transformers & distillation through attention”. In:
International Conference on Machine Learning. PMLR. 2021, pp. 10347–10357.
[200] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou.
“Going deeper with image transformers”. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. 2021, pp. 32–42.
[201] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. “Llama: Open
and efficient foundation language models”. In: arXiv preprint arXiv:2302.13971 (2023).
[202] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. “Llama 2: Open
foundation and fine-tuned chat models”. In: arXiv preprint arXiv:2307.09288 (2023).
[203] Matthew Turk and Alex Pentland. “Eigenfaces for recognition”. In: Journal of cognitive
neuroscience 3.1 (1991), pp. 71–86.
[204] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[205] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. “Composing
text and image for image retrieval-an empirical odyssey”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2019, pp. 6439–6448.
[206] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman.
“GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”.
In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018,
pp. 353–355. doi: 10.18653/v1/W18-5446.
[207] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou,
Jingren Zhou, and Hongxia Yang. “Ofa: Unifying architectures, tasks, and modalities through a
simple sequence-to-sequence learning framework”. In: International Conference on Machine
Learning. PMLR. 2022, pp. 23318–23340.
[208] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal,
Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. “Image as a foreign language: Beit
pretraining for all vision and vision-language tasks”. In: arXiv preprint arXiv:2208.10442 (2022).
[209] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-local neural networks”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 7794–7803.
144
[210] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. “Zero-shot recognition via semantic embeddings
and knowledge graphs”. In: CVPR. 2018, pp. 6857–6866.
[211] Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E Gonzalez. “Tafe-net: Task-aware
feature embeddings for low shot learning”. In: CVPR. 2019, pp. 1831–1840.
[212] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren,
Guolong Su, Vincent Perot, Jennifer Dy, et al. “Dualprompt: Complementary prompting for
rehearsal-free continual learning”. In: European Conference on Computer Vision. Springer. 2022,
pp. 631–648.
[213] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su,
Vincent Perot, Jennifer Dy, and Tomas Pfister. “Learning to prompt for continual learning”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022,
pp. 139–149.
[214] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. “SimVLM:
Simple Visual Language Model Pretraining with Weak Supervision”. In: International Conference
on Learning Representations. 2021.
[215] Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. “Adversarial fine-grained
composition learning for unseen attribute-object recognition”. In: ICCV. 2019, pp. 3741–3749.
[216] Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang,
and Nenghai Yu. “Hairclip: Design your hair by text and reference image”. In: CVPR. 2022,
pp. 18072–18081.
[217] Adina Williams, Nikita Nangia, and Samuel R Bowman. “A Broad-Coverage Challenge Corpus for
Sentence Understanding through Inference”. In: NAACL-HLT. 2018.
[218] Olivia Winn and Smaranda Muresan. “‘Lighter’Can Still Be Dark: Modeling Comparative Color
Descriptions”. In: Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers). 2018, pp. 790–795.
[219] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. “Transformers:
State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations. Online: Association for
Computational Linguistics, Oct. 2020, pp. 38–45. url:
https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[220] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and
Rogerio Feris. “Fashion iq: A new dataset towards retrieving images by natural language
feedback”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2021, pp. 11307–11317.
145
[221] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. “Zero-shot learning—a
comprehensive evaluation of the good, the bad and the ugly”. In: TPAMI 41.9 (2018),
pp. 2251–2265.
[222] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. “Feature generating networks for
zero-shot learning”. In: CVPR. 2018, pp. 5542–5551.
[223] Jianxiong Xiao, Krista A Ehinger, James Hays, Antonio Torralba, and Aude Oliva. “Sun database:
Exploring a large collection of scene categories”. In: International Journal of Computer Vision
119.1 (2016), pp. 3–22.
[224] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. “Visual entailment: A novel task for
fine-grained image understanding”. In: arXiv preprint arXiv:1901.06706 (2019).
[225] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. “Aggregated residual
transformations for deep neural networks”. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2017, pp. 1492–1500.
[226] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu.
“Simmim: A simple framework for masked image modeling”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 9653–9663.
[227] Peixi Xiong, Huayi Zhan, Xin Wang, Baivab Sinha, and Ying Wu. “Visual query answering by
entity-attribute graph matching and reasoning”. In: CVPR. 2019, pp. 8357–8366.
[228] Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, and Heng Tao Shen. “Cross-modal
attention with semantic consistence for image–text matching”. In: IEEE transactions on neural
networks and learning systems 31.12 (2020), pp. 5412–5425.
[229] Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo.
“Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training”.
In: Advances in Neural Information Processing Systems 34 (2021), pp. 4514–4528.
[230] Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng,
Trishul Chilimbi, and Junzhou Huang. “Vision-Language Pre-Training with Triple Contrastive
Learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2022, pp. 15671–15680.
[231] Muli Yang, Cheng Deng, Junchi Yan, Xianglong Liu, and Dacheng Tao. “Learning unseen
concepts via hierarchical decomposition and composition”. In: CVPR. 2020, pp. 10248–10256.
[232] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. “Boosting image captioning with
attributes”. In: ICCV. 2017, pp. 4894–4902.
[233] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Joshua B. Tenenbaum.
“Neural-symbolic vqa: Disentangling reasoning from vision and language understanding”. In:
Advances in Neural Information Processing Systems. 2018, pp. 1039–1050.
[234] A. Yu and K. Grauman. “Fine-Grained Visual Comparisons with Local Learning”. In: CVPR. 2014.
146
[235] A. Yu and K. Grauman. “Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic
Images”. In: ICCV. 2017.
[236] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu.
“Coca: Contrastive captioners are image-text foundation models”. In: arXiv preprint
arXiv:2205.01917 (2022).
[237] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. “Volo: Vision outlooker for
visual recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[238] Yan Zeng, Xinsong Zhang, and Hang Li. “Multi-Grained Vision Language Pre-Training: Aligning
Texts with Visual Concepts”. In: International Conference on Machine Learning. PMLR. 2022,
pp. 25994–26009.
[239] Li Zhang, Tao Xiang, and Shaogang Gong. “Learning a deep embedding model for zero-shot
learning”. In: CVPR. 2017, pp. 2021–2030.
[240] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and
Jianfeng Gao. “VinVL: Revisiting Visual Representations in Vision-Language Models”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June
2021, pp. 5579–5588.
[241] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and
Jianfeng Gao. “Vinvl: Revisiting visual representations in vision-language models”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 5579–5588.
[242] Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. “Context-aware attention network for
image-text retrieval”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2020, pp. 3536–3545.
[243] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen,
Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. “Opt: Open pre-trained
transformer language models”. In: arXiv preprint arXiv:2205.01068 (2022).
[244] Zhaoheng Zheng, Rakesh Chada, Yue Wu, Pradeep Natarajan, and Ram Nevatia. “FIT: Fractional
Intermediate Tower in Vision-Language Transformers”. In: Technical Report (2022).
[245] Zhaoheng Zheng and Ram Nevatia. “BIPR: Blending Individual and Pairwise Representations for
Compositional Zero-Shot Learning”. In: Technical Report (2022).
[246] Zhaoheng Zheng, Arka Sadhu, and Ram Nevatia. “Improving Object Detection and Attribute
Recognition by Feature Entanglement Reduction”. In: 2021 IEEE International Conference on Image
Processing (ICIP). IEEE. 2021, pp. 2214–2218.
[247] Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, and Ram Nevatia. “Large Language
Models are Good Prompt Learners for Low-Shot Image Classification”. In: CVPR (2024).
147
[248] Zhaoheng Zheng, Haidong Zhu, and Ram Nevatia. “CAILA: Concept-Aware Intra-Layer Adapters
for Compositional Zero-Shot Learning”. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision. 2024, pp. 1721–1731.
[249] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen.
“Dual-path convolutional image-text embeddings with instance loss”. In: ACM Transactions on
Multimedia Computing, Communications, and Applications (TOMM) 16.2 (2020), pp. 1–23.
[250] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. “Conditional prompt learning
for vision-language models”. In: CVPR. 2022, pp. 16816–16825.
[251] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. “Learning to prompt for
vision-language models”. In: IJCV 130.9 (2022), pp. 2337–2348.
[252] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. “Unified
vision-language pre-training for image captioning and vqa”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 34. 07. 2020, pp. 13041–13049.
[253] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. “Deformable detr:
Deformable transformers for end-to-end object detection”. In: arXiv preprint arXiv:2010.04159
(2020).
[254] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. “A generative
adversarial approach for zero-shot learning from noisy texts”. In: CVPR. 2018, pp. 1004–1013.
[255] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. “Aligning Books and Movies: Towards Story-Like Visual Explanations by
Watching Movies and Reading Books”. In: The IEEE International Conference on Computer Vision
(ICCV). Dec. 2015.
[256] Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin, Ben Chen, Haoming Zhou,
Minghui Qiu, and Ling Shao. “Kaleido-BERT: Vision-Language Pre-training on Fashion Domain”.
In: CoRR abs/2103.16110 (2021). arXiv: 2103.16110. url: https://arxiv.org/abs/2103.16110.
148
Abstract
As key mediators of human perception, vision and language corpora play critical roles in the development of modern Artificial Intelligence (AI). The size of vision-language corpora has scaled up rapidly in recent years, from thousands to billions of samples, enabling the creation of large foundation models. However, as an emerging area, it still leaves a series of problems to be explored.
We start with a study of compositional learning, spanning from pre-VLM times to the post-VLM era. We introduce a representation blending approach that creates robust features for compositional image classification, as well as a two-stream architecture that reduces feature-space entanglement in object-attribute detection with novel object-attribute pairs. We further design an adaptation approach to leverage CLIP encoders for compositional image classification.
The second part covers a variety of methods built with multimodal transformer models. For image retrieval, we propose a framework that assembles multimodal inputs into sequences on which a multimodal transformer encoder can be fine-tuned. We also explore the pre-training of vision-language models (VLMs). Specifically, we introduce a fractional intermediate tower that improves the feature expressiveness of dual-tower vision-language models. We further design a unified pipeline that allows a VLM to learn not only from vision-language corpora but also from unimodal visual and linguistic data.
Lastly, we study how to leverage the knowledge of Large Language Models (LLMs) for low-shot image classification, in a data- and computation-efficient way.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Grounding language in images and videos
Bridging the visual reasoning gaps in multi-modal models
Multimodal reasoning of visual information and natural language
Adapting pre-trained representation towards downstream tasks
Shape-assisted multimodal person re-identification
Incorporating aggregate feature statistics in structured dynamical models for human activity recognition
Event detection and recounting from large-scale consumer videos
Aggregating symbols for language models
Building generalizable language models for code processing
Emphasizing the importance of data and evaluation in the era of large language models
Towards generalizable expression and emotion recognition
Data-efficient image and vision-and-language synthesis and classification
Syntax-aware natural language processing techniques and their applications
Externalized reasoning in language models for scalable and trustworthy AI
3D deep learning for perception and modeling
Modeling, learning, and leveraging similarity
Federated and distributed machine learning at scale: from systems to algorithms to applications
Scaling recommendation models with data-aware architectures and hardware efficient implementations
Leveraging cross-task transfer in sequential decision problems
Towards understanding language in perception and embodiment
Asset Metadata
Creator
Zheng, Zhaoheng
(author)
Core Title
Incorporating large-scale vision-language corpora in visual understanding
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
05/17/2024
Defense Date
04/25/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer vision, deep learning, large multimodal models, OAI-PMH Harvest, vision and language
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Nevatia, Ramakant (committee chair)
Creator Email
zh2.amoy@gmail.com,zz_621@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113939963
Unique identifier
UC113939963
Identifier
etd-ZhengZhaoh-12927.pdf (filename)
Legacy Identifier
etd-ZhengZhaoh-12927
Document Type
Dissertation
Rights
Zheng, Zhaoheng
Internet Media Type
application/pdf
Type
texts
Source
20240517-usctheses-batch-1152 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu