Adapting Pre-trained Representation Towards Downstream Tasks
by
Xuefeng Hu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Xuefeng Hu
Dedication
To my beloved parents.
Acknowledgements
I would like to express my deepest gratitude to my advisor, Professor Ram Nevatia, for his genuine support
and invaluable advice throughout my Ph.D. study and the completion of this dissertation. His guidance has
been a cornerstone of my Ph.D. study. I am also grateful to my Ph.D. committee members, Professors Aram
Galstyan, Keith Jenkins, Bistra Dilkina, and Greg Ver Steeg, for their insights and feedback during my Ph.D.
Dissertation Defense, Proposal, and Qualifying exams.
Special thanks to my labmates, Arka Sadhu, Zhaoheng Zheng, and Haidong Zhu, for their priceless
discussions and support during our shared journey in the Ph.D. program. I must also acknowledge all of
my coauthors, whose contributions were vital to the projects presented in this dissertation, with a special
mention to Zhenheng Yang for helping me complete my first paper.
Chapter 3 (SimPLE: Similar pseudo label exploitation for semi-supervised classification) and Chapter
5 (EZAD: Efficient Feature Distillation for Zero-shot Annotation Object Detection) of this dissertation are
based on research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-
19-1-1000, and Chapter 2 (SPAN: Spatial pyramid attention network for image manipulation localization)
of this dissertation is based on research sponsored by the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute
reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, the Defense
Advanced Research Projects Agency or the U.S. Government.
Chapter 4 (MixNorm: Test-Time Adaptation through Online Normalization Estimation) of this dissertation was completed during my internship at Meta (Facebook) in 2021. The major part of Chapter 6 (ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation) of this dissertation was completed during my internship at Amazon in 2022. I would like to thank Ke Zhang, Cheng-Hao Kuo, Min Sun, Gokhan Uzunbas and Ser-Nam Lim for their guidance and support during my internships.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xviii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Pre-CLIP: Adaptation for Vision Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization . . 2
1.1.2 SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification . . . 3
1.1.3 MixNorm: Test-Time Adaptation through Online Normalization Estimation . . . . . 4
1.2 Post-CLIP: Adaptation for Vision-Language Models. . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 EZAD: Efficient Feature Distillation for Zero-Shot Annotation Object Detection . . 6
1.2.2 ReCLIP: Refines Contrastive Language Image Pre-training with Source-Free
Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 BaFTA: Backprop-Free Test-Time Adaptation for Zero-shot Vision Language Models 8
Chapter 2: SPAN: Spatial pyramid attention network for image manipulation localization . . . 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Manipulation Detection and Localization . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Local Self-Attention Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Positional Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Pyramid Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.5 Framework Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.3 Evaluation and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.4 Additional Discussion on Dataset Characteristics . . . . . . . . . . . . . . . . . . . 30
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 3: SimPLE: Similar pseudo label exploitation for semi-supervised classification . . . 34
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Consistency Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Pseudo-labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 Augmentation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.3 Pseudo-labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.4 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.4.1 Pair Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.4.2 Motivation for Different Loss Formulations . . . . . . . . . . . . . . . . 45
3.3.5 SimPLE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Datasets and Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Baseline Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 SSL for Transfer Learning Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.2 Ablation Study over CIFAR-100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.3 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.3.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.3.3 Augmentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.3.4 Analysis on Confidence Threshold . . . . . . . . . . . . . . . . . . . . . 55
3.5.3.5 Pair Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 4: MixNorm: Test-Time Adaptation through Online Normalization Estimation . . . . 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Test-Time Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Source-Free Unsupervised Domain Adaptation . . . . . . . . . . . . . . . . . . . . 64
4.2.3 Zero-Shot Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.4 Normalization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Test-Time Adaptation with Mix-Norm Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 Test-Time Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Test-Time Adaptation in Zero-Shot Learning Setting . . . . . . . . . . . . . . . . . 76
4.4.4 Source-Free Unsupervised Domain Adaptation . . . . . . . . . . . . . . . . . . . . 77
4.4.5 Hyper Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Chapter 5: EZAD: Efficient Feature Distillation for Zero-shot Annotation Object Detection . 80
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Adapt Vision-language Model Feature . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.2 Generate CLIP Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.3 Model Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Implementation Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Comparison with Current Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.3 Experiments in Few-shot Detection Settings . . . . . . . . . . . . . . . . . . . . . . 96
5.4.4 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.5 Ablation Studies on CLIP’s Feature Adaptation and CLIP Proposal. . . . . . . . . . 99
5.4.6 Ablation Study on Semantic Regressor and Per CLIP Proposal Distillation Weight. . 100
5.4.7 More Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.8 Statistic and Visualization Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 6: ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free
Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2.1 Large-Scale Vision-Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2.2 Unsupervised Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2.3 Learnable Language Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.3.1 Projection Space to Align Visual and Text . . . . . . . . . . . . . . . . . . . . . . . 114
6.3.2 Pseudo Label Generation for VLM . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.3 Source-Free Adaptation for Vision-Language Model via Cross-Modality Self-Training . . . 117
6.3.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.4 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.4.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.2 Comparison with POUF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.3 Ablations Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.3.1 Effectiveness of ReCLIP Components . . . . . . . . . . . . . . . . . . . 126
6.4.3.2 Comparison on Pseudo Label Generations . . . . . . . . . . . . . . . . . 126
6.4.3.3 Comparison on other Vision-Language Models . . . . . . . . . . . . . . . 127
6.4.3.4 Choice on Learnable Modules . . . . . . . . . . . . . . . . . . . . . . . . 128
6.4.3.5 Inductive Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4.3.6 Pseudo Label Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4.4 Runtime Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.4.5 Limitation and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Chapter 7: BaFTA: Backprop-free Test-Time Adaptation for Zero-Shot Visual Language
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3.1 Estimate Class Embedding with Projected Online Clustering . . . . . . . . . . . . . 138
7.3.2 Prediction Aggregation with Rényi Entropy . . . . . . . . . . . . . . . . . . . . . . 139
7.3.3 Algorithm and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.4.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4.2.1 Comparison on Predictions from BaFTA. . . . . . . . . . . . . . . . . . 146
7.4.2.2 Comparison on Prediction Aggregation Method. . . . . . . . . . . . . . . 148
7.4.2.3 Ablation Studies on Rényi Entropy Order α. . . . . . . . . . . . . . . . . 148
7.4.3 Inference Time Efficiency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 8: Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
List of Tables
2.1 Pixel-level localization performance under the metrics of AUC and F1 comparison on
validation set (SynVal) of the pre-training data and four different benchmarks. We could
not find the model that ManTra-Net reported in the paper. Hence, we also report the
performance of ManTra-Net’s default GitHub model where we share the same feature
extractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Pixel-level localization performance under the metrics of AUC and F1 comparison on
validation set of the pre-training data and four different benchmarks. For NIST16, Coverage
and CASIA, all models are fine-tuned on the corresponding training splits unless specifically
stated. *We found there is an overlap of images between training and test data of NIST16. . 26
2.3 Pixel-level AUC and F1 comparison on multiple manipulation types, evaluated on the NIST16 [121] dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Comparison of SPAN variants evaluated with F1 metric on SynVal, Columbia, Coverage,
CASIA and NIST16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Robustness Analysis of SPAN over NIST16 and Columbia Dataset . . . . . . . . . . . . . . 28
3.1 CIFAR-100 Top-1 Test Accuracy. ∗: using our implementation. †: reported in FixMatch [145]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 CIFAR-10 and SVHN Top-1 Test Accuracy. All experiments use WRN 28-2. †: The accuracy is reported in ReMixMatch [12] and using its own implementation. ‡: Fully supervised baseline using all the labels and simple augmentation (flip-and-crop). . . . . . . 47
3.3 Mini-ImageNet Top-1 Test Accuracy. ∗: using our implementation. †: The score is reported in [73] and using its own implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 DomainNet-Real pre-trained model transfer to Mini-ImageNet. All experiments use WRN 28-2. The model is considered converged when its validation accuracy reaches 95% of its highest validation accuracy. §: using labeled training set only. ∗: using our implementation. . . . . 49
3.5 ImageNet-1K pre-trained model transfer to DomainNet-Real. All experiments use ResNet-50. The model is considered converged when its validation accuracy reaches 95% of its highest validation accuracy. §: using labeled training set only. ∗: using our implementation. . . . . 50
3.6 Ablation on CIFAR-100. All experiments use WRN 28-2 . . . . . . . . . . . . . . . . . . . 51
3.7 Hyperparameters for CIFAR-10, SVHN, CIFAR-100 and MiniImageNet. . . . . . . . . . . 53
3.8 Hyperparameters for Transfer: DomainNet-Real to Mini-ImageNet (DN-R to M-IN) and
Transfer: ImageNet-1K to DomainNet-Real (IN-1K to DN-R) experiments. . . . . . . . . . 54
3.9 Augmentation details. Applied in order. Descriptions are from [136]. . . . . . . . . . . . . 55
4.1 Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All scores are
average error rates from all 15 corruption datasets at severity level 5. All methods adopt
Wide-ResNet-28 as backbone architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All methods
are tested with 10,000 randomly ordered samples composed of corrupted images from
15 different corruptions types at severity level 5. All methods adopt Wide-ResNet-28 as
backbone architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All scores are
average error rates from all 15 corruption datasets at severity level 5. All methods adopt
Wide-ResNet-40 as backbone architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All methods
are tested with 10,000 randomly ordered samples composed of corrupted images from
15 different corruptions types at severity level 5. All methods adopt Wide-ResNet-40 as
backbone architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on ImageNet-C datasets. All scores are
average error rates from all 15 corruption datasets at severity level 5. All methods adopt
ResNet-50 as backbone architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed
affine parameters, MixNormBN (Algorithm 3) on ImageNet-C datasets. All methods are
tested with 50,000 randomly ordered samples composed of corrupted images from 15
different corruptions types at severity level 5. All methods adopt ResNet-50 as backbone
architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Ablation studies on different ways to collect normalization statistics, tested with CIFAR-10C, with a mixed set of corruptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.8 Ablation studies on the effect of number of augmentations to MixNormBN method when
collecting normalization statistics, tested with ImageNet-C and ResNet-50, with a mixed
set of corruptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.9 Dataset Size and Class Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.10 Zero-Shot Image Classification Accuracy. We replaced the Batch-Norm layers in the pre-trained CLIP ResNet-50 model and tested it under the zero-shot setting on the standard benchmarks. Results show that MixNorm improves CLIP's performance. . . . . . . . . . . 77
4.11 Error rate for Source-Free Unsupervised Domain Adaptation experiment on Digits Sets
from SVHN to MNIST/MNIST-M/USPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.12 Hyper-Parameters of MixNorm Experiments. In Figures 4.5 and 4.6 we reported scores for both MixNorm with and without fixed affine parameters. MixNorm uses the same hyper-parameters in both cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.13 Hyper-Parameters of MixNormBN Experiments. . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 The Recall of RPN on Novel and Base categories on the COCO or LVIS training set. The RPN is biased toward the base categories. . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Evaluation results on COCO benchmark. All the models are trained with the ResNet-50
backbone. Mask indicates whether the model is trained with Mask annotations. Our model
achieves 4% improvement on Novel with 1/3 training time of ViLD. . . . . . . . . . . . . . 90
5.3 Evaluation results on LVIS-Fbase benchmark. The ViLD* is our reproduced result of ViLD
with a shorter training schedule. EZAD outperforms ViLD in both 2x and 4x settings due
to the efficient feature distillation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Evaluation results on COCO. EZAD outperforms ViLD in both 1x and 3x settings, showing
our distillation is more efficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Adapting CLIP to the detection dataset's domain. The table presents the classification accuracy (ACC) of CLIP (w/ or w/o adaptation) when it is applied to classify the COCO dataset's instances. The ACC is aggregated based on the size of the instances. After the adaptation, the ACC is improved by a large margin in three different settings, especially for small objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6 Ablation study on CLIP’s feature adaptation and CLIP Proposal using COCO detection
benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.7 Ablation study on Semantic-based regressor and per CLIP Proposal distillation weight
using COCO detection benchmark. SB Reg means the Semantic-based regressor, and
PPDW means the per CLIP Proposal distillation weight. . . . . . . . . . . . . . . . . . . . 95
5.8 The effective distillation region of different proposals. The table presents the intersection between the proposal and the novel GT bboxes over the novel GT bboxes (IoGT), the number of proposals that have high IoGT (IoGT≥0.8, IoGT≥0.5), and the percentage of these proposals among all proposals. Our CLIP proposals can cover more novel-category instances, thus improving the distillation efficiency. . . . . . . . . . . . . . . . . . . . . . . 96
5.9 Evaluation results on the novel categories of the PASCAL VOC few-shot benchmark. MF R-CNN means Meta Faster R-CNN. Our model's zero-shot performance on the novel categories matches TFA's performance in its 3-shot setting. Our model also performs better on the base categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.10 Evaluation results on the novel categories of the COCO few-shot benchmark. MF R-CNN means Meta Faster R-CNN. Our model's zero-shot performance on the novel categories matches TFA's performance in its 10-shot setting. . . . . . . . . . . . . . . . . . . . . . . 98
5.11 The classification accuracy (ACC) of the unadapted CLIP on COCO instances with different
sizes of the GT bboxes to crop the instances. We decide to use the 1.2x enlarged GT bbox
to crop the instance since it has the best average ACC. . . . . . . . . . . . . . . . . . . . . . 101
5.12 Ablation study on using CLIP Proposals as distillation in COCO benchmark. The model
trained with CLIP Proposals has much better performance on novel categories. . . . . . . . 102
6.1 Classification accuracies (%) on 22 benchmarks. * on FGVC, Caltech101, Oxford-IIIT Pet
and Flowers102, CLIP reported mean-class-accuracy. All other scores in this table are top-1
accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Comparison of ReCLIP and published scores from POUF[151] on Office-Home[159], both
use CLIP-single as base model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3 Comparison of classification accuracy of different versions of ReCLIP on the ablation datasets. ReCLIP with Label Sharing (Figure 6.5) is shown to be the most effective compared to ReCLIP-V, ReCLIP-T (Figure 6.4) and their simply assembled predictions (ReCLIP w/o Label Sharing). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4 Pseudo label accuracy with different methods. Label Propagation on projection space P2 is
shown to be the most effective and stable method in generating accurate pseudo labels. . . . 127
6.5 Ablation studies on the effectiveness of ReCLIP with different model architectures and pre-training strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.6 Fully supervised fine-tuning accuracy of CLIP with different learnable modules on ablation
datasets. On AID, fine-tuning weights from Text Encoder Layer-Norm is shown to be
most effective; On CIFAR10 and CIFAR100, fine-tuning weights from Visual Encoder
Layer-Norm is shown to be most effective. . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Inductive and Transductive performance comparison of ReCLIP on ablation datasets. . . . . 129
6.8 ReCLIP pseudo label Quality. Results are generated with vanilla CLIP ViT-L/16 checkpoint,
on the first epoch of ReCLIP before the training algorithms update the model weights. . . . . 129
6.9 Metadata and runtime comparison of AaD, POUF and ReCLIP on the 22 evaluation benchmarks. Time reported in hours (h). . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1 Taxonomy of CLIP adaptation methods for downstream classification. In this work,
we adopt TPT as the main baseline for comparison as it is the state-of-the-art test-time
adaptation algorithm without requirement of external resources. . . . . . . . . . . . . . . . 135
7.2 Zero-Shot vs. Linear Evaluation top-1 accuracy reported by CLIP [131]. The Linear Evaluation protocol assesses the quality of visual embeddings by training a fully-supervised linear classifier over the frozen visual embeddings. This Linear Evaluation result implies: 1) the zero-shot performance of CLIP is largely limited by the quality of the zero-shot classifier, i.e., the text embeddings of class names; 2) the native visual embeddings of CLIP are classified well by a linear classifier, which suggests the distinctiveness of visual embeddings across target classes and leads to an opportunity to leverage the neighboring relationships to enhance test-time performance. . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3 Comparison of top-1 accuracy on ImageNet and the Natural Distribution Shifts (NDS)
Benchmarks. All methods evaluated in zero-shot classification setting, except 1) CoOp,
CoCoOp and ProGrad are fine-tuned on ImageNet with 4 or 16 examples per category; 2)
SuS-X uses external knowledge such as Stable Diffusion or LAION-5B to construct support
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.4 Top-1 Accuracy on 10 Fine-grained Benchmarks. All baselines are evaluated in zero-shot
classification setting, except CoOp being fine-tuned on ImageNet with 16 examples per
category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5 Comparison over different BaFTA predictions. BaFTA-RA (Rényi Aggregation): standard predictions over augmented views aggregated with Rényi Entropy; BaFTA-OC (Online Clustering): predictions generated with the clustering centroids; BaFTA-single: adapted with single-template CLIP; BaFTA-Avg: combines online clustering and augmentation predictions with a naive average instead of weighting with Rényi Entropy. BaFTA-RA, BaFTA-OC, and BaFTA refer to pi, p̂i and p̃i from Eq. 7.6 respectively. All results produced with CLIP (ViT-B/16). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.6 Comparison of different methods to aggregate predictions pb from augmented views. All results are top-1 accuracy reported with CLIP (ViT-B/16) on ImageNet with 64 augmented views for each test example. Please refer to the text for details on the notation. . . . . . . . 148
7.7 Row 1: Accuracy standard deviation of Rényi Entropy aggregated predictions over varying α. Row 2-3: Effectiveness of Projection P∗ (Eq. 7.3) in improving embedding distribution. Results produced with CLIP (ViT-B/16) embeddings, demonstrated by the top-1 accuracy improvement of kNN classifier with k = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.8 Inference time comparison of TPT and BaFTA. All results reported with ImageNet
examples, evaluated on NVIDIA A40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . 152
List of Figures
1.1 Dissertation chapters overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Examples of images manipulated by splicing (left) and copy-move (right) techniques,
SPAN predictions and Ground-truth Masks. . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Overview of the Spatial Pyramid Attention Network. There are three blocks in SPAN
framework: feature extraction (blue), pyramid spatial attention propagation (green) and
decision module (orange). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Self-Attention Block: We perform the self-attention mechanism over each target pixel Xi,j and its local neighborhood Pi,j. With learnable projections Mq, Mk and Mv, we prepare Keys and Values from Pi,j and the Query from Xi,j for the attention mechanism. . . . . . . . . . 19
2.4 Local Attention Pyramid Propagation: Going through self-attention blocks with proper
dilation distances (e.g. dilation distance of 1 in the purple blocks and 3 in the orange
blocks), information from each pixel location is propagated in a pyramid structure (e.g.
each pixel on level-2 encodes information from 9 pixels on level-1 and 81 pixels on level-0). 20
2.5 Comparison of SPAN prediction results on the Coverage, CASIA, Columbia and NIST16 datasets, with ManTra-Net predictions. From top to bottom: Manipulated Image, Ground-truth mask, ManTra-Net prediction, SPAN prediction and fine-tuned SPAN prediction. There is no fine-tuning result on the Columbia dataset as there is no training split in Columbia. . . . . 29
2.6 Qualitative Results of SPAN predictions on Coverage and NIST16 datasets, with
comparison to RGB-N predictions. From left to right: Manipulated Image, Ground-truth
mask, RGB-N prediction, SPAN (pre-training) prediction and SPAN(fine-tuned) prediction. . 30
2.7 Qualitative Results on NIST16[121]. Top to bottom: Manipulated Image, Ground-Truth
Mask, SPAN(fine-tuned) Prediction Mask. Red box contains failure samples. . . . . . . . . 30
2.8 Qualitative Results on CASIAv1[34]. Top to bottom: Manipulated Image, Ground-Truth
Mask, SPAN(fine-tuned) Prediction Mask. Red box contains failure samples. . . . . . . . . 31
2.9 Qualitative Results on Columbia[119]. Top to bottom: Manipulated Image, Ground-Truth
Mask, SPAN Prediction Mask. Red box contains failure samples. . . . . . . . . . . . . . . . 31
2.10 Qualitative Results on Coverage[170]. Top to bottom: Manipulated Image, Ground-Truth
Mask, SPAN(fine-tuned) Prediction Mask. Red box contains failure samples. . . . . . . . . 31
2.11 The characteristic difference between Columbia, Coverage and CASIA. . . . . . . . . . . . 32
2.12 Demonstration of NIST16[121] information overlapping between its training and testing
splits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Illustration of an image set with a limited amount of labeled images among a large number
of unlabeled images. Unlike unsupervised learning methods that only exploit the structure
from unlabeled data, and supervised learning methods that only look at the limited amount
of labeled data, semi-supervised learning utilizes information from both labeled and
unlabeled data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 An overview of the proposed SimPLE algorithm. SimPLE optimizes the classification
network with three training objectives: 1) supervised loss LX for augmented labeled
data; 2) unsupervised loss LU that aligns the strongly augmented unlabeled data with
pseudo labels generated from weakly augmented data; 3) Pair Loss LP that minimizes the
statistical distance between predictions of strongly augmented data, based on the similarity
and confidence of their pseudo labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Pair Loss Overview. Given a pseudo label ql (red) which is a probability vector representing
the guessed class distribution, if the highest entry in ql surpasses the confidence threshold
τc, ql will become an “anchor”. Then, for any pseudo label and image tuple qr (light blue)
and vr (dark blue), if the overlapping proportion (i.e. similarity) between ql and qr is greater
than the confidence threshold τs, this tuple (qr, vr) will contribute toward the Pair Loss by
pushing model’s prediction of a strongly augmented version of vr to the “anchor” ql (green
arrow). During this process, if either threshold cannot be satisfied, ql, qr and vr will be rejected. . . . 45
3.4 Ratio of pairs that pass both confidence and similarity thresholds. The green line is SimPLE and the grey line is SimPLE without Pair Loss. . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Ratio of high-confidence predictions. The green line is SimPLE and the grey line is SimPLE without Pair Loss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Examples from ImageNet-C dataset [61]. ImageNet-C contains 75 corrupted copies of the
original ImageNet-1K test set, with 5 severity levels and 15 different corruption types. This
dataset is used to simulate the distribution shift between training domain (original) and test
domains (corrupted). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Method overview for Test-Time Adaptation with MixNorm layers. Before inference, we replace all the Batch-Norm layers in the pre-trained model with the newly proposed MixNorm
layers. The MixNorm layers are initialized by the training set (source) statistics (means
and variances) and the corresponding weights and biases from the pre-trained Batch-Norm
layers. During inference, each input is paired with another augmented view for calculation
in the MixNorm layers and the final prediction is made only on the original input. . . . . . . 63
4.3 CLIP performs classification on target classes by comparing visual embeddings with the
text embeddings generated from class names. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Illustration of the MixNorm Layer. Inside the MixNorm layer we estimate the normalization
statistics by looking at both global statistics from historical data and local statistics from
augmented views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Comparison of MixNorm, with and without learnable affine weights (Algorithm 2), and
MixNormBN (Algorithm 3) methods to the baseline method TENT in two different test
time settings i) test samples come only from a single type of corruption; ii) test samples
come from mixed types of corruptions. For the Single Type experiments, we report average
error rates over all 15 corruption types at severity 5. Numerical results are reported in the
Appendix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Comparison of MixNorm and MixNormBN methods to the baseline method TENT, with
a different backbone architecture (WideResNet40). Numerical results are reported in the
Appendix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 The difference between zero-shot annotation object detection (ZAD), zero-shot object
detection (ZSD), and open-vocabulary detection (OVD). Our ZAD can have novel instances
in training images and novel category names, but no additional annotations are allowed. . . . 81
5.2 Overview of EZAD. The key contributions of EZAD are that we bridge the domain gap and create better distillation regions (CLIP proposals). . . . . . . . . . . . . . . . . . . . . . . 84
5.3 The pipeline of CLIP Proposal generation. The CLIP Proposals and their features are generated before the detector training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 The training and inference pipeline of EZAD. The figure presents the model structures in
the RoI head of the two-stage detector. We use the per CLIP Proposal distillation weight
and semantic-based regressor to further improve the model performance. . . . . . . . . . . . 85
5.5 Visualization results on COCO. The first row presents the results of the model trained with the adapted CLIP features from CLIP Proposals. The second row presents the results of the model trained with raw CLIP features from RPN proposals. . . . . . . . . . . . . . . . . . 99
5.6 Visualization of using CLIP Proposals or RPN proposals as distillation regions in COCO
setting. The blue boxes and green boxes represent the GT bboxes of the novel and base
categories. The red boxes represent the CLIP proposals or the RPN proposals with the
highest IoU with the novel GT bboxes. The visualization shows the CLIP proposals can
cover more novel objects even though the boxes may not be accurate. . . . . . . . . . . . . 103
5.7 The tSNE embeddings of the COCO GT instance features from the unadapted CLIP and the adapted CLIP. The GT features from the adapted CLIP form denser clusters, indicating that the features become more discriminative and that CLIP is adapted to the detection dataset's domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1 The t-SNE plot of visual and text embeddings from CLIP on the CIFAR10 [86] test set. It is clear to see the misalignment in the vision-language space: the text embedding of a class name is adjacent to those of other classes, but distant from image embeddings of the same class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Demonstration of the design of the Learnable Prompt. t∗ represents a learnable token embedding that is inserted at the beginning of the sequence of inputs to the transformer-based text encoder. “BOS” and “EOS” stand for “beginning of sentence” and “end of sentence”, and they serve as the special tokens for the text encoder to identify the beginning and end of the input sentence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Demonstration of Feature Redundancy Removal (left) and Label Propagation (right). Left: P0 shows the original distribution of visual and text embeddings of CLIP, where text embeddings are close to each other but distant from visual embeddings; P1 = UU⊤ removes the class-agnostic information from visual embeddings and pulls visual and text embeddings closer. P2 = U′U′⊤ separates the text embeddings by removing the redundant information from them. The similarity values demonstrated in this example are calculated based on average statistics from the CIFAR10 test set. Right: the Label Propagation process generates pseudo labels for unlabeled training images by propagating label information from labeled text embeddings (category names) to unlabeled visual embeddings (training images) through nearest-neighbor connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4 Flow Chart of ReCLIP-V and ReCLIP-T. Orange symbols describe the loss and gradients
path of ReCLIP-V, and blue symbols describe the loss and gradients path of ReCLIP-T.
Black symbols describe the common steps that both ReCLIP-V and ReCLIP-T have. . . . . 115
6.5 Flow Chart of Pseudo Labels Sharing. The cross-modality self-training algorithm merges
the pseudo labels from ReCLIP-T and ReCLIP-V at the end of each epoch and updates the
encoders only on high-confidence pseudo labels agreed by both. . . . . . . . . . . . . . . . 116
7.1 Overview of the Backpropagation-Free Test-Time Adaptation algorithm BaFTA. Instead
of prompt-tuning, we employ online clustering to directly estimate class embeddings in a
projection space that aligns visual and text embeddings. The class centroids are initialized
with text embeddings of class names, and updated incrementally with online test examples
assigned to the class. For each test example, we generate two sets of predictions. The
first set measures cosine similarity between visual embeddings of augmented views and
class name text embeddings. The second set measures cosine similarity between visual
embeddings and online-clustering centroids. Predictions are aggregated with reliability
estimated by Rényi Entropy for final results. . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.2 α-accuracy curves on 15 datasets, with α ∈ [0.1, 0.99]. In order to fit all curves into one plot with a unified value range, all curves are normalized by subtracting the maximum accuracy within the curve. The bold red curve represents the averaged accuracy over the 15 datasets, and achieves its maximum value at α = 0.5 and α = 0.6. This plot indicates that prediction aggregation accuracy is not highly sensitive to the choice of α, with most curves exhibiting less than a 0.3% change in accuracy across the α range [0.1, 0.99]. . . . . . . . . . . . . . 150
7.3 tSNE plots of original and projected visual embeddings from evaluation datasets. . . . . . . 151
Abstract
In recent years, the field of computer vision and machine learning has witnessed a paradigm shift, characterized by a dramatic increase in the scale of model parameters and training data sizes. This evolution has
led to significant enhancements in model accuracy and robustness, transcending the traditional, task-specific
expert models. The field has now pivoted towards universal, large-scale pre-trained visual representations,
which enable impressive zero-shot and few-shot solutions for a wide array of downstream tasks.
Despite these advancements, the application of pre-trained models to specific downstream tasks, each
with its unique conditions and domain-specific challenges, often exposes inherent limitations. This dissertation aims to tackle these challenges. The research journey comprises a spectrum of approaches from fully
supervised to source-free and test-time adaptation, with diverse applications such as image classification,
object detection, and forensic detection. This dissertation introduces novel architectures such as SPAN,
which is one of the pioneering works to employ the self-attention mechanism in the field of computer vision, as well as innovative adaptation algorithms like ReCLIP and BaFTA, which enhance zero-shot classification performance with unsupervised vision-text alignment. This dissertation marks a transition from classic visual representations, like those pretrained with ImageNet, to cutting-edge vision-language models like CLIP, and addresses some of the most pressing challenges in the field.
The works of this dissertation play an important role in bridging the gap between generic visual representations and the specific, nuanced requirements of various real-world tasks. In doing so, they establish new benchmarks for optimizing the performance of machine learning models in practical applications, reinforcing the role of advanced computational techniques in solving complex real-world problems.
Chapter 1
Introduction
The field of computer vision and machine learning has experienced a transformative shift in the past decade.
From ImageNet [32] to CLIP [131], from SimCLR [78] to DINOv2 [123], the increase in parameter numbers
and training set sizes has brought tremendous improvements in model precision and robustness. Instead of
having task-specific, data-heavy expert models, the field has pivoted towards employing universal visual
representations, trained on extensive, diverse datasets, which have enabled few-shot and zero-shot solutions
on a myriad of downstream tasks.
However, applying these pre-trained models to downstream tasks, each with its unique conditions and
domain-specific challenges, has proven to be a challenging problem. There is a critical need to bridge
the gap between these generic representations and the specific requirements of individual tasks to optimize
real-world performance.
This dissertation aims to tackle these challenges. It is composed of works ranging from fully-supervised to test-time adaptation, covering diverse applications from image classification to forensic detection. These
works mark the transition from traditional visual representations, such as those pretrained with ImageNet,
to more advanced vision-language models like CLIP. This dissertation introduces innovative methodologies
and algorithms designed to tackle some of the most pressing challenges in the field.
Adaptation for Vision Models:
  SPAN (ECCV 2020): Fully-Supervised Adaptation, Classification → Forensic Detection
  SimPLE (CVPR 2021): Semi-Supervised Adaptation, Image Classification
  MixNorm (arXiv 2021): Test-Time Adaptation, Image Classification
Adaptation for Vision-Language Models:
  EZAD (WACV 2024): Fully-Supervised Adaptation, Classification → Zero-Shot Detection
  ReCLIP (WACV 2024, Oral): Source-Free Adaptation, Zero-Shot Classification
  BaFTA (ICML 2024, in preparation): Backprop-Free Test-Time Adaptation, Zero-Shot Classification
Figure 1.1: Dissertation chapters overview.
As illustrated in Figure 1.1, this dissertation is structured into two primary segments, differentiated by
the emergence of large-scale foundational vision-language models like CLIP, which have enabled openvocabulary, zero-shot solutions on numerous downstream tasks.
1.1 Pre-CLIP: Adaptation for Vision Models.
In the first part, which comprises Chapters 2, 3, and 4, three techniques will be discussed. These techniques aim at adapting pre-trained computer vision models under varying supervision conditions: cross-task fully-supervised adaptation, semi-supervised adaptation, and test-time adaptation.
1.1.1 SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization
The journey of this dissertation begins with a fully-supervised, cross-task adaptation method called SPAN. In
Chapter 2, we introduce this novel architecture, SPAN (Spatial Pyramid Attention Network), which bridges
pre-trained image manipulation classification models to forensic detection tasks through fully-supervised
adaptation.
The goal of Image Forensic Detection, or Image Manipulation Localization, is to localize tampered
pixels within manipulated images. ManTraNet [176], which created robust representations for this task through extensive 385-way forensic type classification training, was the benchmark method at the time. ManTraNet bridges the gap between classification-trained representations and the forensic detection task with an LSTM-based detection head, which integrates signals from varying resolution levels. However, our
analysis suggests that ManTraNet’s design insufficiently harnesses the spatial correlations among different
image segments, thus limiting its detection performance.
To address this, we introduce the SPAN architecture, which aims to model inter-patch relationships
across multiple scales through a pyramid structure of local self-attention blocks. SPAN also includes a novel
position projection technique, enabling precise spatial encoding for image patches. With extensive training over a broad synthetic dataset, SPAN effectively bridges the classification-trained representation to the forensic detection task, and is also adaptable to specific downstream datasets with further fine-tuning.
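To make the idea of a local self-attention block concrete, the following is a minimal PyTorch sketch (not the released SPAN code); the module name, the neighborhood radius, and the use of F.unfold to gather dilated neighbors are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention(nn.Module):
    # Each pixel attends to a dilated square neighborhood of itself (illustrative sketch).
    def __init__(self, channels, radius=1, dilation=1):
        super().__init__()
        self.radius, self.dilation = radius, dilation
        self.q = nn.Linear(channels, channels)  # M_q: query projection for the center pixel
        self.k = nn.Linear(channels, channels)  # M_k: key projection for the neighborhood
        self.v = nn.Linear(channels, channels)  # M_v: value projection for the neighborhood

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        k_size = 2 * self.radius + 1
        pad = self.radius * self.dilation
        # Gather the dilated neighbors of every pixel: (B, C * k_size^2, H * W)
        neigh = F.unfold(x, kernel_size=k_size, dilation=self.dilation, padding=pad)
        neigh = neigh.view(b, c, k_size * k_size, h * w).permute(0, 3, 2, 1)  # (B, HW, N, C)
        center = x.flatten(2).transpose(1, 2)                                 # (B, HW, C)
        q = self.q(center).unsqueeze(2)                                       # (B, HW, 1, C)
        k, v = self.k(neigh), self.v(neigh)                                   # (B, HW, N, C)
        attn = torch.softmax((q * k).sum(-1) / c ** 0.5, dim=-1)              # (B, HW, N)
        out = (attn.unsqueeze(-1) * v).sum(dim=2)                             # (B, HW, C)
        return out.transpose(1, 2).view(b, c, h, w)

# Stacking such blocks with growing dilation (e.g. 1, 3, 9) gives each output pixel an
# exponentially growing receptive field, which is the pyramid propagation idea of SPAN.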
In extensive experiments on standard image forensics benchmarks, SPAN demonstrated better performance than ManTraNet and other state-of-the-art methods at the time, both quantitatively and qualitatively. It is worth highlighting that SPAN is one of the pioneering works to employ the self-attention
mechanism within the realm of computer vision, and is still considered an important benchmark in the field
of image forensic detection.
1.1.2 SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification
Because it must adapt across tasks, shifting from classification to detection, training a model like SPAN requires a large amount of labeled data. This can be a significant constraint in real-world situations where labeled data is scarce. This challenge motivated me to explore methods that make the most of limited labeled data.
In Chapter 3 of this dissertation, we present SimPLE (Similar Pseudo Label Exploitation), a novel semi-supervised learning algorithm that can also be applied as a semi-supervised adaptation method to enhance
pre-trained models on target datasets with limited labeled examples.
In many classification problems, large-scale datasets are available but only a small portion of the data is labeled. Semi-supervised training aims to improve accuracy not just by using the labeled data, but also by utilizing the larger pool of unlabeled data. Previous research such as MixMatch [13] and FixMatch [145] has shown progress by using consistency constraints between different augmentations of labeled and unlabeled data. Building upon this, SimPLE introduces a unique approach focusing on the connections among similar, high-confidence unlabeled data. The proposed Pair Loss minimizes the differences between high-confidence pseudo labels that meet a certain level of similarity.
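As a rough illustration of this idea (a simplified sketch, not the exact loss of Chapter 3), the snippet below keeps only pseudo labels whose top confidence exceeds a threshold, pairs them with samples whose pseudo labels are sufficiently similar, and pushes the paired predictions toward the confident anchor; the threshold values, the dot-product similarity, and the KL distance are placeholder assumptions.

import torch
import torch.nn.functional as F

def pair_loss(pseudo_labels, strong_preds, conf_thresh=0.95, sim_thresh=0.9):
    # pseudo_labels: (N, K) soft pseudo labels from weakly augmented views.
    # strong_preds:  (N, K) predicted probabilities on strongly augmented views.
    conf, _ = pseudo_labels.max(dim=1)              # confidence of each pseudo label
    anchors = conf > conf_thresh                    # high-confidence labels act as anchors
    sim = pseudo_labels @ pseudo_labels.t()         # pairwise label similarity (assumption)
    pairs = anchors.unsqueeze(1) & (sim > sim_thresh)
    pairs.fill_diagonal_(False)
    if not pairs.any():
        return strong_preds.new_zeros(())
    anchor_idx, other_idx = pairs.nonzero(as_tuple=True)
    # Push the prediction of the paired, strongly augmented sample toward the anchor label.
    return F.kl_div(strong_preds[other_idx].clamp_min(1e-8).log(),
                    pseudo_labels[anchor_idx], reduction="batchmean")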
By combining the Pair Loss with methods from the MixMatch family [13, 12, 145], SimPLE outperforms earlier algorithms in the semi-supervised adaptation evaluation, where models start from ImageNet or DomainNet-Real pre-trained weights. It also achieves state-of-the-art performance on classic semi-supervised learning and few-shot learning benchmarks such as CIFAR-10, CIFAR-100, SVHN, and Mini-ImageNet.
1.1.3 MixNorm: Test-Time Adaptation through Online Normalization Estimation
In Chapter 4, we progress to test-time adaptation, which is among the most challenging and practical of the domain adaptation settings. With access only to a model that is pre-trained offline, the goal of test-time adaptation is to rapidly adapt the model to the unlabeled test samples while making predictions at inference time.
Previous methods have focused on adapting the Batch Normalization Layers in deep learning models [99, 149, 162], significantly enhancing model performance in the face of test-time domain shifts. However, these methods generally assume that test examples are available in large batches and under similar
distribution shifts, which limits their practicality in real-world settings where samples may come from diverse distributions and in varying batch sizes.
To overcome these limitations, we propose replacing Batch-Norm Layers with our novel MixNorm operation. MixNorm adaptively predicts normalization statistics from individual inputs. It not only considers
global statistics derived from all historical samples but also local statistics computed from each incoming test
sample and its spatial augmentations. This innovative approach demonstrates notable improvements over
state-of-the-art methods like TENT [162]. This is particularly evident in our two new evaluation protocols,
which assess test-time adaptation methods across various batch sizes and against a mix of test samples from
different domains. Beyond test-time adaptation, the MixNorm layer also shows enhanced performance in
source-free domain adaptation and zero-shot classification.
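The sketch below illustrates this kind of online normalization layer; it mixes an exponential-moving-average "global" estimate with "local" statistics from the current input (and any augmented views packed into the batch). The mixing weight, momentum, and constructor interface are illustrative assumptions rather than the exact MixNorm formulation.

import torch
import torch.nn as nn

class OnlineMixNorm2d(nn.Module):
    # Hedged sketch of a MixNorm-style replacement for BatchNorm at test time.
    def __init__(self, bn: nn.BatchNorm2d, mix=0.5, momentum=0.05):
        super().__init__()
        # Initialize from the pre-trained Batch-Norm layer (source statistics + affine weights).
        self.register_buffer("global_mean", bn.running_mean.clone())
        self.register_buffer("global_var", bn.running_var.clone())
        self.weight, self.bias = bn.weight, bn.bias
        self.mix, self.momentum, self.eps = mix, momentum, bn.eps

    def forward(self, x):  # x: (B, C, H, W); B may include augmented views of the test sample
        # Local statistics from the current input(s) and their augmentations.
        local_mean = x.mean(dim=(0, 2, 3))
        local_var = x.var(dim=(0, 2, 3), unbiased=False)
        # Update the global estimate online with an exponential moving average.
        with torch.no_grad():
            self.global_mean.mul_(1 - self.momentum).add_(self.momentum * local_mean)
            self.global_var.mul_(1 - self.momentum).add_(self.momentum * local_var)
        mean = self.mix * local_mean + (1 - self.mix) * self.global_mean
        var = self.mix * local_var + (1 - self.mix) * self.global_var
        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]

# Usage sketch: before inference, every nn.BatchNorm2d in the pre-trained model would be
# swapped for OnlineMixNorm2d(bn), as described in Figure 4.2.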
1.2 Post-CLIP: Adaptation for Vision-Language Models.
The emergence of CLIP has revolutionized domain adaptation. Instead of adapting models towards closed-set categories seen during pre-training or requiring labeled examples for novel categories, the vision-language representation of CLIP has enabled open-vocabulary zero-shot solutions over various downstream tasks. In the second segment (Chapters 5, 6, and 7) of this dissertation, I explore techniques that further improve vision-language models for downstream tasks under different settings: classification to detection, source-free unsupervised adaptation, and test-time adaptation.
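For reference, zero-shot classification with such a vision-language model reduces to comparing image embeddings against text embeddings of class-name prompts. The short sketch below shows only this matching step; it assumes the two encoders (e.g., a CLIP checkpoint) are given, and the logit scale of 100 is taken as a typical value rather than a prescribed constant.

import torch

def zero_shot_classify(image_embeds, text_embeds, logit_scale=100.0):
    # image_embeds: (N, D) visual embeddings from the frozen image encoder.
    # text_embeds:  (K, D) text embeddings of prompts such as "a photo of a {class}".
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    logits = logit_scale * image_embeds @ text_embeds.t()  # scaled cosine similarities
    return logits.argmax(dim=-1)                           # predicted class index per image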
1.2.1 EZAD: Efficient Feature Distillation for Zero-Shot Annotation Object Detection
The second segment of this dissertation starts with EZAD (Efficient Zero-Shot Annotation Detection) in
Chapter 5, an innovative approach that extends the capabilities of the classification-based model CLIP to
object detection tasks.
Object detection, a cornerstone of computer vision, typically restricts detectors to pre-trained categories,
limiting their ability to recognize new, unseen categories. Zero-Shot Object Detection (ZSD) and Open-Vocabulary Object Detection (OVD) have emerged to address this issue, yet they face limitations. ZSD
involves detecting new categories known at training but not present in the training data, while OVD assumes
that novel category names are unavailable during training. It is common for object detection training samples to contain unannotated background objects, initially considered irrelevant during the annotation stage.
However, these objects may gain relevance in specific tasks as novel categories. Nonetheless, the limitations
of both ZSD and OVD prevent them from leveraging this valuable information within the existing data.
To bridge these gaps, we introduce Zero-shot Annotation Object Detection (ZAD). ZAD is a new zero-shot object detection setting that allows for the presence of unannotated novel categories in training data,
with the flexibility to add more categories at inference time. ZAD alleviates the challenge of scarce labeled
object detection examples by allowing algorithms to leverage partially labeled object detection examples.
To address the core challenges of ZAD, we propose the novel algorithm EZAD (Efficient Feature Distillation for ZAD). Building upon the prior state-of-the-art method ViLD [51], EZAD learns to map image
regions to a vision-language feature space through distillation, uses text embeddings of novel category
names, and makes detection predictions by matching them with high cosine similarity scores to image features of novel objects. In addition, EZAD recognizes the domain gap between CLIP’s training data and
detection datasets and fine-tunes the LayerNorm layers in CLIP with base category instances. The pre-distillation adaptation significantly improves classification accuracy for both base and novel categories. To
leverage unannotated background objects from training images, EZAD selects anchor boxes with potential
background objects indicated by high CLIP confidence and uses them as candidate proposals for knowledge
distillation from CLIP. A semantic-based regressor is also included, which further enhances the accuracy of
box regression.
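As a rough illustration of the matching step, the snippet below scores region embeddings against category text embeddings by cosine similarity. The function name, temperature value and softmax readout are assumptions for illustration, not the exact EZAD implementation.

import torch
import torch.nn.functional as F

def classify_regions(region_feats, text_embeds, temperature=0.01):
    """Hypothetical sketch: assign detected regions to category names by cosine similarity.

    region_feats: (R, D) region embeddings distilled into the vision-language space.
    text_embeds:  (C, D) text embeddings of base + novel category names.
    Returns per-region class probabilities of shape (R, C).
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_feats @ text_embeds.t() / temperature  # cosine similarity scaled by a temperature
    return logits.softmax(dim=-1)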
EZAD demonstrates superior performance in practical scenarios. When only provided with the names of
novel categories, EZAD surpasses ViLD by 4% in novel category detection on the COCO dataset and shows
better performance on both base and novel categories on the LVIS dataset, all with a much shorter training
time. This success highlights the effectiveness of our adapted features and CLIP Proposals in improving
distillation quality and training efficiency, validating EZAD as a robust solution in the realm of zero-shot
and open-vocabulary object detection.
1.2.2 ReCLIP: Refines Contrastive Language Image Pre-training with Source-Free Domain
Adaptation
In Chapter 6, we explore source-free adaptation for visual-language models, enhancing their classification
ability in the target domain with the novel algorithm ReCLIP (Refines Contrastive Language Image Pre-Training with Source-Free Domain Adaptation). This approach does not require any labeled data.
Visual-language models like CLIP, despite their impressive zero-shot classification capabilities, face
significant challenges due to domain gaps in both visual and text inputs. These gaps limit the model’s
performance, especially in less common domains or fine-grained datasets. For instance, CLIP struggles
with visual embeddings in domains like PatchCamelyon and CLEVR, and its text embeddings often fail
to capture discriminative information in datasets such as RESISC45 and Birdsnap. Additionally, a notable
misalignment between visual and text embeddings further worsens the issue, as observed in several prior
studies [102, 151].
To address these challenges, we propose ReCLIP, a novel source-free domain adaptation method for
refining CLIP models. ReCLIP uniquely tackles the misalignment of visual and text embeddings by learning a projection subspace, eliminating redundant dimensions and class-agnostic information. It leverages
the neighboring relationship between aligned embeddings and uses label propagation to generate accurate
pseudo-labels in the target domain. ReCLIP performs cross-modality self-training with high-confidence
pseudo-labels, iteratively refining embedding spaces and label assignments through two parallel components for updating the text and visual encoders. This iterative process not only improves the quality of
embeddings but also enhances the assignment of pseudo-labels.
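As an illustration of the pseudo-label generation step, the sketch below runs a generic label-propagation scheme over a kNN affinity graph, with text embeddings acting as class anchors for the unlabeled visual embeddings. The function name, graph construction and hyperparameters are assumptions and do not reproduce ReCLIP's exact formulation.

import torch
import torch.nn.functional as F

def propagate_pseudo_labels(visual_embeds, text_embeds, alpha=0.99, k=10, iters=20):
    """Sketch of pseudo-label generation by label propagation (generic formulation,
    not necessarily ReCLIP's exact one). Text embeddings act as labeled class anchors
    and unlabeled visual embeddings receive soft labels through a kNN affinity graph.
    """
    feats = F.normalize(torch.cat([text_embeds, visual_embeds], dim=0), dim=-1)
    n_cls = text_embeds.shape[0]
    sim = feats @ feats.t()

    # Sparsify: keep only the top-k affinities per node (excluding self-similarity).
    sim.fill_diagonal_(0)
    topk = sim.topk(k, dim=-1)
    W = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values.clamp(min=0))
    W = (W + W.t()) / 2
    d_inv_sqrt = W.sum(-1).clamp(min=1e-8).pow(-0.5)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Initial labels: one-hot for the class anchors, zeros for the test samples.
    Y = torch.zeros(feats.shape[0], n_cls)
    Y[:n_cls] = torch.eye(n_cls)
    F_mat = Y.clone()
    for _ in range(iters):                    # iterative form of F = (I - alpha*S)^(-1) Y
        F_mat = alpha * S @ F_mat + (1 - alpha) * Y
    return F_mat[n_cls:].argmax(dim=-1)       # pseudo-labels for the visual embeddings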
Through extensive experiments and ablation studies, ReCLIP is demonstrated to have consistent and
significant improvements over CLIP and other baseline methods. It notably improves the average accuracy
of CLIP from 69.83% to 74.94% across 22 datasets, showcasing the effectiveness of our approach in mitigating domain gaps and enhancing classification performance in the target domain without relying on any
labeled data.
1.2.3 BaFTA: Backprop-Free Test-Time Adaptation for Zero-shot Vision Language Models
Following the success of ReCLIP, I moved forward to the challenging and practical task of test-time adaptation. Different from the MixNorm method introduced in Chapter 4, Chapter 7 presents BaFTA (Backprop-Free Test-Time Adaptation), which focuses on the test-time adaptation of vision-language models to enhance
their zero-shot classification ability at inference time.
Test-time prompt tuning (TPT) [113] is a pioneering work in performing test-time adaptation for vision-language models. TPT improves model accuracy by refining learnable text prompts during inference, optimizing an unsupervised objective using augmentations.
Unlike TPT, which relies on backpropagation training, BaFTA operates directly within CLIP's visual-text embedding space using an innovative online clustering algorithm. This method addresses common test-time adaptation challenges, such as optimizing learning rates without validation data, essential for balancing
performance enhancement with model stability.
BaFTA’s effectiveness stems from its ability to refine class embeddings by leveraging neighboring information among test example visual embeddings. This approach is rooted in the observation that visual
embeddings in CLIP are inherently discriminative but are often constrained by imprecise text embeddings
in zero-shot performance. BaFTA employs two crucial designs to counter this: firstly, it projects the embedding space, following ReCLIP’s recommendation, to reduce the disparity between visual and text embeddings, thereby enhancing clustering outcomes. Secondly, it combines clustering-based predictions with
those derived from randomly augmented views of test examples. By using Rényi Entropy, BaFTA assesses
the reliability of these predictions, leading to an aggregated prediction that optimizes both accuracy and
robustness.
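The following sketch illustrates the reliability-weighted aggregation idea: predictions from different augmented views are weighted by the negated Rényi entropy of each view's prediction. The entropy order, the softmax weighting and the function names are illustrative assumptions rather than the exact BaFTA procedure of Chapter 7.

import torch

def renyi_entropy(p, alpha=0.5, eps=1e-12):
    """Rényi entropy of order alpha for probability vectors along the last dimension."""
    return torch.log((p.clamp(min=eps) ** alpha).sum(dim=-1)) / (1 - alpha)

def aggregate_predictions(pred_list, alpha=0.5):
    """Sketch: combine predictions from augmented views (and/or clustering) by reliability.

    Views whose predictions have lower Rényi entropy (i.e., are more confident) receive
    larger weights. The weighting form shown here is a simplifying assumption.
    """
    preds = torch.stack(pred_list)                 # (V, N, C) probabilities per view
    reliability = -renyi_entropy(preds, alpha)     # (V, N): higher = more reliable
    weights = torch.softmax(reliability, dim=0)    # normalize across views
    return (weights.unsqueeze(-1) * preds).sum(dim=0)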
BaFTA’s efficacy is demonstrated through extensive experiments, where it significantly enhances the
zero-shot classification accuracy of pre-trained vision-language models during inference. This suggests
BaFTA as a practical and efficient approach for test-time enhancement, without the need for extensive training or external data.
Chapter 2
SPAN: Spatial pyramid attention network for image manipulation
localization
In this chapter*, we introduce a novel framework called the Spatial Pyramid Attention Network (SPAN) for
the detection and localization of various image manipulations.
Image Forensic Detection or Image Manipulation Localization is a task focused on predicting manipulated pixels within tampered images. Prior to the development of SPAN, the state-of-the-art method known
as ManTra-Net [176] had generated generic and robust representations for this task through extensive training
involving a 385-way forensic type classification pretext task.
To connect the classification-trained representation with the forensic detection task, ManTra-Net adopted
an LSTM [64]-based detection head, which merges signals from different resolution levels for adaptation.
However, we believe that ManTra-Net overlooked crucial spatial information between different image patches,
thereby limiting its detection performance.
Therefore, we have proposed the SPAN architecture, which efficiently and effectively models the relationships between image patches at multiple scales by constructing a pyramid of local self-attention blocks.
Our design also includes a novel position projection to encode the spatial positions of the patches. SPAN is
*This chapter is published as an ECCV 2020 paper [69]: Hu, Xuefeng, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri,
Zhenheng Yang, and Ram Nevatia. "SPAN: Spatial pyramid attention network for image manipulation localization." In Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pp. 312-328.
Springer International Publishing, 2020.
initially trained on a generic synthetic dataset to bridge the gap between classification and forensic detection
tasks. However, it can also be further fine-tuned for specific downstream datasets.
It’s worth highlighting that our proposed method, SPAN, represents one of the pioneering works to
employ the self-attention mechanism within the realm of computer vision, and it is still considered an
important benchmark in the field of image forensic detection.
2.1 Introduction
Fast development of image manipulation techniques is allowing users to create modifications and compositions of images that look “authentic” at relatively low cost. Typical manipulation methods include splicing, copy-move, removal and various kinds of image enhancements to produce “fake” or “forged” images. We
aim to localize the manipulated regions of different tampering types in images.
Figure 2.1: Examples of images manipulated by splicing (left) and copy-move (right) techniques, SPAN predictions and Ground-truth Masks.
There has been work on both detection and localization of manipulations in recent years. Among the
methods that include localization, many deal with only one or a few types of manipulations such as splicing [27, 71, 82], copy-move [26, 133, 170, 173, 175], removal [205], and enhancement [9, 10]. Some recent papers have proposed more general solutions that are not specific to manipulation types; these include RGB-Noise (RGB-N) Net [200] and the Manipulation Tracing Network (ManTra-Net) [176]. The two methods differ in the granularity of their localization (bounding boxes in [200] vs. masks in [176]). Our method is also designed for pixel-level mask predictions of multiple manipulation types.
A key assumption underlying forged region localization is that the color, intensity or noise distributions
of manipulated regions are somehow different from those of the untampered ones. These distinctions are
typically not transparent to a human observer but the expectation is that they can be learned by a machine.
Ability to capture and model the relationship between tampered and authentic regions is crucial for this.
RGB-N [200] proposes a Faster R-CNN [135] based method to detect tampered regions; hence, its predictions are limited to rectangular boxes. It also requires fine-tuning to adapt to a new dataset. ManTra-Net [176]
combines the task of image manipulation type classification and localization into a unified framework and
achieves comparable results to RGB-N but without fine-tuning on the target evaluation data. ManTra-Net
makes pixel-level predictions. However, its localization module only models the “vertical” relationship of the same points on different scales of the feature map but does not model the spatial relations between image patches. We propose to model both the “vertical” relationship using multi-scale propagation and the spatial
relationship using a self-attention module in the Spatial Pyramid Attention Network (SPAN).
SPAN is composed of three blocks: a feature extractor, a spatial pyramid attention block and a decision
module. We adopt the pre-trained feature extractor provided by [176] as our feature extractor. To build the
relationship between different image patches, we construct a hierarchical structure of self-attention layers
with the features as input.
The hierarchical structure is designed to first calculate local self-attention blocks, and the local information is propagated through a pyramid hierarchy. This enables an efficient and multi-scale modeling of
neighbors. Inspired by [156], within each self-attention layer, we calculate a new representation for each
pixel conditioned on its relationship to its neighborhood pixels. To better capture the spatial information of
each neighbor in the self-attention block, we also introduce a positional projection which is more suitable
for our localization task, instead of the original positional embedding method used for the machine translation task in [156]. The decision module, which is composed of a few 2D convolutional layers, is applied on top of the output from the pyramid spatial attention propagation module to predict the localization mask.
We conducted comprehensive experiments on five different popular benchmarks and compared with
previous works on pixel-level manipulation localization of different tampering types. Our proposed SPAN
model outperforms current state-of-the-art (SoTA) methods ManTra-Net [176] and RGB-N [200] with the
same setting on almost every benchmark (e.g., 11.21% improvement over ManTra-Net on Columbia, without
fine-tuning; 9.5% improvement over RGB-N on Coverage, with fine-tuning).
In summary we make three contributions in this work: (1) we designed a novel Spatial Pyramid Attention
Network architecture that efficiently and explicitly compares patches through the local self-attention block
on multiple scales; (2) we introduced positional projection in the self-attention mechanism, instead of classic
positional embedding used in text processing; (3) we show that the ability to compare and model relationships between image patches at different scales results in higher accuracy compared to state-of-the-art methods.
2.2 Related Work
2.2.1 Manipulation Detection and Localization
Most existing methods detect and localize a specific type of manipulation only, such as copy-move [26, 133, 170], removal [205], enhancement [9, 10], and splicing [27, 71, 82, 141]. Among these types, splice detection is the most studied topic. For this task, MFCN [141] proposed a Multi-task Fully Convolutional Network [105] with an enhancement on edge detection, jointly trained towards both edge detection and tampered region detection. MAG [82] proposed a Generative-Adversarial-based pipeline to train its model with augmented retouched samples, based on a new splicing dataset.
The types of manipulation, however, may not always be known in advance and a single image may contain multiple types of manipulations. Therefore, it is important to have systems that detect general types of
manipulations. Some recent works [176, 200, 6, 7] have addressed this problem and shown success in building models that are robust in detecting multiple manipulation techniques. J-LSTM [6] is an LSTM-based patch comparison method that finds tampered regions by detecting edges between the tampered patches and authentic patches. H-LSTM [7] improved this method by introducing a separate encoder-decoder structure to refine the predicted mask, and a Hilbert-Curve route to process the image patches, to establish better context information. Both methods operate on fixed-size image patches, which might cause failure when the tampered region size does not follow this assumption.
RGB-N [200] proposes a two-stream Faster R-CNN network [135], which combines the regular RGB-based Faster R-CNN with a parallel module that works on the noise information generated by the Steganalysis Rich Model (SRM) [42]. RGB-N is pre-trained on a synthetic tampering dataset generated based on MS-COCO [104]. Due to the R-CNN architecture, RGB-N is limited to localizing to a rectangular box, whereas real objects are not necessarily rectangular. RGB-N also requires fine-tuning on specific datasets to achieve high accuracy.
ManTra-Net[176] has proposed a system that jointly learns the Manipulation-Tracing feature for both
image manipulation classification and forgery localization. ManTra-Net is composed of a VGG[144] based
feature extractor and an LSTM[64] based detection module. The feature extractor is trained towards 385
types of image manipulation, then it is trained together with the detection module on a synthetic dataset for
multi-type manipulation detection. While [200, 6, 7] require fine-tuning on different manipulation distributions to achieve state-of-the-art performance, ManTra-Net achieves comparable results without any training
on the domain datasets.
2.2.2 Attention Mechanism
The attention mechanism, first introduced by [4], has brought tremendous benefits to many fields, such as Machine Translation [4, 156], Image Captioning [183], Image Question Answering [185] and Object Detection [2, 65, 97]. The attention mechanism helps the neural network build input-aware connections to focus more on meaningful entities, such as words or regions, by replacing the classic learnable fixed weights with input-dependent weights.
The self-attention mechanism proposed in [156] calculates mutual attention among a group of inputs. It has succeeded in capturing long-term word relationships within the same sentence and has been dominating Machine Translation in recent years. The Image Transformer [126] made an attempt to bring the self-attention mechanism to image generation, by adopting an encoder-decoder structure based on pixel-wise self-attention. The Non-local Neural Network [167] uses the self-attention mechanism to model non-local relationships between pixel features, and achieves improvements in activity recognition, image classification and object detection. However, the non-local blocks are of $O(N^4)$ complexity in both time and space for an $N \times N$ input, which implies that they can only be applied when the spatial resolution is reduced. For the tasks addressed in [167], good results are obtained by applying the non-local layer to reduced-resolution maps; however, for manipulation detection, it may be harmful to reduce the spatial resolution, as the small differences between pixels that are important in tracing forgery regions may be lost during the pooling operations.
2.3 Method
We describe our proposed Spatial Pyramid Attention Network (SPAN) in this section. We first provide an
overview of the framework, then the details of each module and lastly how the whole framework can be
trained.
2.3.1 Overview
SPAN is composed of three blocks: feature extractor, pyramid spatial attention propagation and decision
module. We adopt the pre-trained feature extractor as proposed in [176]. The feature extraction network
is trained on synthesized data generated based on images sampled from the Dresden Image Database [49],
using classification supervision against 385-type image manipulation data. The feature extractor adopts the
Figure 2.2: Overview of the Spatial Pyramid Attention Network. There are three blocks in SPAN framework:
feature extraction (blue), pyramid spatial attention propagation (green) and decision module (orange).
Wider & Deeper VGG Network [144, 188] as the backbone architecture, along with Bayer [9] and SRM
[42, 201] Convolutional Layers to extract rich features from both visual artifacts and noise patterns.
On top of the embeddings extracted using the pre-trained feature extractor, we apply a convolution layer
to adapt the feature to feed into the spatial attention module. The pyramid spatial attention propagation
module is designed to establish the spatial relationship on multiple scales of the pixel representation. Five
layers of the local self-attention blocks are recursively applied to extract information from neighboring
pixels or patches. To preserve the details from multi-scale layers, we add the input back to the output after
each self-attention block, through the residual link which is proposed in [58].
To generate the final mask output, 2D convolutional blocks are applied. A Sigmoid activation is applied
to predict the soft tampering mask.
2.3.2 Local Self-Attention Block
Consider an image feature tensor $X$ of size $H \times W \times D$, where $H$ and $W$ are the spatial dimensions and $D$ is the feature dimension. For simplicity, we use $X_{i,j}$ to represent the pixel at the $i$-th row and $j$-th column, where each $X_{i,j} \in \mathbb{R}^D$. We calculate the Local Self-Attention value, $LSA(X_{i,j} \mid X, N, t)$, for each position $X_{i,j}$, by looking at it and its $(2N+1) \times (2N+1)$ neighborhood, dilated by distance $t$. For example, when $N = 1$, the $3 \times 3$ neighborhood looks like:

\begin{pmatrix} X_{i-t,j-t} & X_{i-t,j} & X_{i-t,j+t} \\ X_{i,j-t} & X_{i,j} & X_{i,j+t} \\ X_{i+t,j-t} & X_{i+t,j} & X_{i+t,j+t} \end{pmatrix}

For simplicity, we rename the $(2N+1)^2$ neighbors linearly from top-left to bottom-right as $\{Y_1, Y_2, \ldots, Y_{(2N+1)^2}\}$. Then, the LSA over $X_{i,j}$ is calculated as
\label {selfattneq} LSA(X_{i,j}|X,N,t) = \frac {1}{C}\sum _{l=1}^{(2N+1)^2} \exp {\left (\frac {\langle M^kY_l, M^qX_{i,j} \rangle }{\sqrt {D}}\right )} M^vY_l (2.1)
where $\langle \cdot, \cdot \rangle$ represents the inner product, $C = \sum_{l=1}^{(2N+1)^2} \exp\left(\frac{\langle M^k Y_l, M^q X_{i,j} \rangle}{\sqrt{D}}\right)$ is the soft-max normalization factor, and $M^k, M^q, M^v \in \mathbb{R}^{D \times D}$ are learnable weights.
Equation 2.1 can be rewritten as
LSA(X_{i,j}|X,N,t)^\top = SoftMax\left (\frac {(M^qX_{i,j})^\top M^kP_{i,j}}{\sqrt {D}}\right )(M^vP_{i,j})^\top (2.2)
where $P_{i,j} = [Y_1, \ldots, Y_{(2N+1)^2}] \in \mathbb{R}^{D \times (2N+1)^2}$, and $M^k P_{i,j}$, $M^v P_{i,j}$, $M^q X_{i,j}$ correspond to the Keys, Values and Query in the attention mechanism literature [156, 167, 4].
Relationships between the neighborhood $\{Y_1, Y_2, \ldots, Y_{(2N+1)^2}\}$ and the pixel $X_{i,j}$ can be explicitly modeled by the quadratic form $\langle M^k Y_l, M^q X_{i,j} \rangle$, and used to help build the feature for manipulation localization. Such information cannot be easily extracted by any convolutional network with a similar number of parameters, as convolution blocks only add pixels together rather than modeling their mutual relationships.
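For illustration, the following PyTorch sketch implements the local self-attention of Equation 2.2 (without positional projection), gathering each pixel's dilated neighborhood with unfold. It is an assumed re-implementation rather than the released SPAN code, and it zero-pads image borders for simplicity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention(nn.Module):
    """Sketch of the local self-attention of Eq. 2.2 (positional projection omitted).

    For each pixel, Keys/Values come from its (2N+1)x(2N+1) neighborhood dilated by t,
    and the Query comes from the pixel itself.
    """
    def __init__(self, dim, N=1, t=1):
        super().__init__()
        self.dim, self.N, self.t = dim, N, t
        self.Mk = nn.Linear(dim, dim, bias=False)
        self.Mq = nn.Linear(dim, dim, bias=False)
        self.Mv = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                   # x: (B, D, H, W)
        B, D, H, W = x.shape
        k = 2 * self.N + 1
        # Gather each pixel's dilated neighborhood: (B, D*k*k, H*W) -> (B, H*W, k*k, D)
        neigh = F.unfold(x, kernel_size=k, dilation=self.t, padding=self.N * self.t)
        neigh = neigh.view(B, D, k * k, H * W).permute(0, 3, 2, 1)
        query = x.view(B, D, H * W).permute(0, 2, 1)        # (B, H*W, D)

        K = self.Mk(neigh)                                  # (B, H*W, k*k, D)
        V = self.Mv(neigh)
        Q = self.Mq(query).unsqueeze(2)                     # (B, H*W, 1, D)

        attn = (Q * K).sum(-1) / D ** 0.5                   # (B, H*W, k*k)
        attn = attn.softmax(dim=-1)
        out = (attn.unsqueeze(-1) * V).sum(2)               # (B, H*W, D)
        return out.permute(0, 2, 1).reshape(B, D, H, W)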
Figure 2.3: Self-Attention Block: We perform the self-attention mechanism over each target pixel $X_{i,j}$ and its local neighborhood $P_{i,j}$. With learnable projections $M^q$, $M^k$ and $M^v$, we prepare Keys and Values from $P_{i,j}$ and the Query from $X_{i,j}$ for the attention mechanism.
2.3.3 Positional Projection
As shown in the Transformer work [156], positional embedding is essential in the self-attention calculation to enable the model to tell the difference between inputs with different temporal order. In our case, it is also important that the model knows the relative spatial relationship between the neighbor pixel and the query pixel. In machine translation models [156], positional embeddings are learnable weights that are directly added to the inputs at each input position. This is reasonable for language tasks, because the word embedding space is typically trained to support vector addition and subtraction corresponding to linguistic meaning. However, as our embedding space does not follow these requirements, we propose to use learnable matrix projections $\{M_l\}_{1 \le l \le (2N+1)^2}$ to represent the $(2N+1)^2$ possible relative spatial relationships. With positional projection, the LSA block now becomes:
\label {self-attn-final} LSA(X_{i,j}|X,N,t) = \frac {1}{C}\sum _{l=1}^{(2N+1)^2} \exp {\left (\frac {\langle M^kM_lY_l, M^qX_{i,j} \rangle }{\sqrt {D}}\right )} M^vY_l (2.3)
with soft-max normalization factor $C = \sum_{l=1}^{(2N+1)^2} \exp\left(\frac{\langle M^k M_l Y_l, M^q X_{i,j} \rangle}{\sqrt{D}}\right)$.
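To make the difference from additive positional embeddings concrete, the following tiny sketch applies a separate learnable projection matrix to each relative position before the Key projection, as in Equation 2.3; the identity initialization and variable names are illustrative assumptions.

import torch
import torch.nn as nn

# Each of the (2N+1)^2 relative positions gets its own learnable D x D matrix M_l,
# applied to the neighbor Y_l before the Key projection, instead of adding a
# positional embedding vector to the input as in [156].
D, num_pos = 64, 9                       # e.g. N = 1 -> 3 x 3 = 9 relative positions
M_pos = nn.Parameter(torch.stack([torch.eye(D) for _ in range(num_pos)]))  # (9, D, D)

Y = torch.randn(num_pos, D)              # the 9 neighbors of one query pixel
Y_projected = torch.einsum('ldk,lk->ld', M_pos, Y)   # Y_l <- M_l Y_l for each position l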
Figure 2.4: Local Attention Pyramid Propagation: Going through self-attention blocks with proper dilation distances (e.g. dilation distance of 1 in the purple blocks and 3 in the orange blocks), information from each pixel location is propagated in a pyramid structure (e.g. each pixel on level-2 encodes information from 9 pixels on level-1 and 81 pixels on level-0).
In Figure 2.3 we illustrate a local self-attention block when $N = 1$. Given an image tensor $X$, we first prepare the neighborhood $P_{i,j}$ with 9 positionally projected neighbors for each pixel $X_{i,j}$. Note that for edge and corner pixels, the neighborhood sizes are 6 and 4, respectively. The neighbors are then projected into Keys, Values and the Query through the matrix projections $M^k$, $M^v$ and $M^q$. Finally, the Keys, Values and Query are assembled into the output $X'_{i,j}$ (Equation 2.3).
2.3.4 Pyramid Propagation
The local self-attention blocks compare pixels and their neighbors within a limited distance $Nt$. However, images might have large tampered regions, which require comparisons between pixels far away from each other. Instead of simply enlarging $N$ to have better pixel coverage, we iteratively apply our local self-attention blocks to propagate the information in a pyramid structure.
As shown in Figure 2.4, the purple and orange stacks represent local self-attention blocks, with $N = 1, t = 1$ and $N = 1, t = 3$, respectively. Through the self-attention block, each pixel on layer 1 represents 9 pixels from layer 0; each pixel on layer 2 represents 9 pixels from layer 1, and therefore 81 pixels from layer 0. With properly set dilation distances, pixels at the top of an $h$-layer local self-attention structure can reach $(2N+1)^{2h}$ pixels in the bottom layer.
There are two benefits of the pyramidal design: 1) We can use a small value for N, which makes
each self-attention block efficient to compute; 2) in upper layers, pixels not only represent themselves, but
encode information from their local regions. Therefore, by comparing those pixels, the self-attention blocks
compare information from two different patches.
Analysis of block size: As demonstrated in Section 2.3.4, an $h$-layer $M \times M$ self-attention block structure with dilation distances $\{1, M, \ldots, M^{h-1}\}$ relates the output pixels to most pixels of the input, where $M = 2N+1$ is the neighborhood size of each self-attention block. Each pixel in the final layer covers an $M^h \times M^h$ region on the input features. Therefore, to cover an $S \times S$ image, $h = O(\log_M S)$ layers would be needed. As $S^2$ self-attention calculations are needed at each layer, and each self-attention block is of $O(M^2)$ complexity in both time and memory, the total complexity with respect to $M$ is $O(S^2 M^2 \log_M S)$, which reaches its minimum at $M = 2N+1 = 3$ among possible integer values.
The $3 \times 3$ blocks not only offer the largest coverage under the same memory and time budget, but also offer comparisons over more scales. In our SPAN implementation, we adopt a 5-layer, $3 \times 3$ self-attention block structure, with dilation distances 1, 3, 9, 27, 81, as demonstrated in Figure 2.2. The proposed structure compares and models the relationship between neighboring regions at five different scales, of 3, 9, 27, 81 and 243.
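The choice of $M = 3$ can be checked with a few lines of arithmetic; the script below reproduces the coverage and the relative cost $S^2 M^2 h$ for several candidate block sizes, using an illustrative input size of $S = 243$.

def layers_needed(S, M):
    """Smallest h with M**h >= S, i.e. h = ceil(log_M S)."""
    h, cov = 0, 1
    while cov < S:
        cov, h = cov * M, h + 1
    return h, cov

S = 243  # cover a 243 x 243 input, matching the 5-layer SPAN setting
for M in (3, 5, 7, 9):                        # candidate neighborhood sizes, M = 2N + 1
    h, cov = layers_needed(S, M)
    cost = S * S * M * M * h                  # ~ S^2 * M^2 per layer, times h layers
    print(f"M={M}: layers={h}, coverage={cov}x{cov}, relative cost={cost:,}")

# M = 3 needs 5 layers (coverage 243 x 243) and gives the smallest S^2 * M^2 * h
# among the candidates, matching the choice of 3 x 3 blocks with dilations 1, 3, 9, 27, 81.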
2.3.5 Framework Training
To train SPAN, we freeze the feature extractor and train the following parts in an end-to-end fashion, supervised by the binary ground-truth mask, in which 1 labels tampered pixels and 0 labels authentic pixels. We adopt Binary Cross-Entropy (BCE) as our loss function:
\label {loss} \displaystyle Loss(X^{pred}, B) = \frac {1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W} \left (-B_{i,j}\log X^{pred}_{i,j}-\left (1-B_{i,j}\right )\log \left (1-X^{pred}_{i,j}\right )\right )
where $X^{pred}$ is the predicted output mask and $B$ is the binary ground-truth mask. $B_{i,j}$ and $X^{pred}_{i,j}$ represent the corresponding pixel values at location $(i, j)$.
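For illustration, a single SPAN training step with this loss could look like the PyTorch sketch below; the function and variable names are hypothetical, and the model is assumed to output a Sigmoid-activated soft mask.

import torch
import torch.nn.functional as F

def span_training_step(model, images, gt_masks, optimizer):
    """Sketch of one training step with the pixel-wise BCE loss above.

    `model` is assumed to output a soft tampering mask in [0, 1]; the feature
    extractor inside it is kept frozen, as described in Section 2.3.5.
    """
    optimizer.zero_grad()
    pred_masks = model(images)                              # (B, 1, H, W), values in [0, 1]
    loss = F.binary_cross_entropy(pred_masks, gt_masks)     # averaged over all pixels
    loss.backward()
    optimizer.step()
    return loss.item()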
2.4 Experiments
In this section, we describe experiments on five different datasets to explore the effectiveness of the SPAN model. These datasets include (1) the base training dataset proposed in [176]; (2) NIST16 [121]; (3) Columbia [119]; (4) Coverage [170]; (5) CASIA [34]. We follow the evaluation protocols as in [176, 200, 82]. We compare with other state-of-the-art (SoTA) methods under setups with and without fine-tuning for specific datasets. We compare only with methods that attempt to detect general manipulations and are not tuned to a specific manipulation type, such as splicing [141, 82].
2.4.1 Datasets
Following the practices in [176], we used the synthetic dataset proposed in [176] for training and validation.
Four other popular benchmarks for manipulation detection are used for fine-tuning and evaluation.
• Synthetic Dataset[176] is composed of four subsets: the removal and enhancements datasets from
[176], the splicing dataset from [174] and the copy-move dataset from [173]. All samples from the
synthetic dataset are of size 224 × 224.
• NIST16[121] contains 584 samples with binary ground-truth masks. Samples from NIST16 are manipulated by one of the types from Splicing, Copy-Move and Removal, and are post-processed to hide
visible traces. A 404:160 training-testing split is provided by [200].
• Columbia[119] is a Splicing-based dataset containing 180 images, with provided edge masks. We transform the edge masks into binary region masks, with tampered pixels as positive and authentic pixels as negative. All images in Columbia are used for testing only.
Method                 SynVal F1   Columbia AUC   Coverage AUC   CASIA AUC   NIST16 AUC
ManTra-Net [176]          48.39         82.4           81.9         81.7         79.5
ManTra-Net (GitHub)           -        77.95          76.87        79.96        78.05
SPAN                      81.45        93.61          92.22        79.72        83.95
Table 2.1: Pixel-level localization performance under the metrics of AUC and F1 on the validation set (SynVal) of the pre-training data and four different benchmarks. We could not find the model that ManTra-Net reported in the paper; hence, we also report the performance of ManTra-Net's default GitHub model, which shares the same feature extractor as ours.
• Coverage[170] is a Copy-Move based dataset, containing only 100 samples, with provided binary
ground-truth masks, with post-processing to remove the visible traces of manipulation. A 75:25
training-testing split is provided by [200].
• CASIA[34] is composed of CASIAv1 and CASIAv2 splits. CASIAv2 contains 5123 images, CASIAv1
contains 921 images; both with provided binary ground-truth masks. Samples from both subsets are
manipulated by either Splicing or Copy-Move operations. Image enhancement techniques including
filtering and blurring are applied to the samples for post-processing. According to [200], CASIAv2 is
the train split, and CASIAv1 is the test split.
2.4.2 Implementation Details
As discussed in Section 2.3.4 and demonstrated in Figure 2.2, we adopt the 5-layer 3×3 self-attention block
structure, with dilation distance 1, 3, 9, 27, 81. We use the residual link to preserve information from each
level, and use the new proposed positional projection to capture the spatial relationships between neighbors.
We report the performance using this setting, unless otherwise specified; ablation on different model choices
is given in Section 2.4.3.
To train SPAN, we set batch size to 4 with 1000 batches per epoch. The batches are uniformly sampled
from the synthetic dataset described above. The models are optimized by the Adam optimizer [81] with an initial learning rate of $10^{-4}$ without decay. The validation loss, precision, recall and F1 are evaluated at each epoch. The learning rate is halved if the validation loss fails to decrease for 10 epochs, until it reaches $10^{-7}$. The training is stopped early if the validation loss fails to decrease for 30 epochs. The best model is picked according to the validation loss.
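This schedule maps naturally onto standard PyTorch utilities; the sketch below is an illustrative approximation (with a dummy model and a placeholder validation loss), not the exact training script.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)     # stand-in for the trainable SPAN modules
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10, min_lr=1e-7)

best_val, stale_epochs = float("inf"), 0
for epoch in range(200):
    # ... training on 1000 uniformly sampled batches of size 4 would go here ...
    val_loss = torch.rand(1).item()        # placeholder for the real validation loss
    scheduler.step(val_loss)               # halve the learning rate on a 10-epoch plateau
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0   # best model picked by validation loss
    else:
        stale_epochs += 1
        if stale_epochs >= 30:
            break                          # early stopping after 30 epochs without improvement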
Following the setting from [200, 6, 173], we also fine-tune our model on the training splits from the four
standard benchmarks for some of our experiments, after it is trained on the synthetic data. For NIST16 and
Coverage, we use the exact same split provided by RGB-N [200]; for CASIA, we use CASIAv2 as training
and CASIAv1 as testing, as stated in RGB-N [200].
Stopping criteria for fine-tuning are set as follows. For NIST16 and Coverage, we use cross validation instead of a fixed validation set, due to the small training split sizes. For CASIA, we randomly sampled 300 images from its training split to construct a validation split. The model with the best validation AUC score is picked and used for evaluation on the test set.
Note that the pre-trained feature extractor achieves its peak performance on original-size images, as described in [176]. However, a fixed input size is required by the detection module for fine-tuning and generalization. Therefore, during the fine-tuning and inference stages, we extract the features from the original-sized image first, and then resize the feature tensor to a fixed resolution of 224×224, for optimal performance. The total run time for a 256×384 image is around 0.11 seconds.
2.4.3 Evaluation and Comparison
We evaluate and compare SPAN with current SoTA methods [176, 200, 6, 7, 84, 111, 40] on the four
benchmarks under different setups: (1) pre-training only; (2) pre-training + fine-tuning. We also explored the
effectiveness of each proposed module by conducting ablation studies on the four standard benchmarks. We
use pixel-level Area Under the Receiver Operating Characteristic Curve (AUC) and F1 score for comparison
against the state-of-the-art general type forgery localization methods, according to [200] and [176].
Pre-training only. We compare the performances of SPAN and ManTra-Net [176] under the same
setup: both models are trained on synthetic dataset as in [176] and evaluated on validation split of the
synthetic dataset (SynVal) and four different datasets. To fairly compare with [176], all images in these
four datasets are used as test data. As presented in Table 2.1, SPAN outperforms ManTra-Net by a large
margin on SynVal. With the same feature extraction backbone, our SPAN is more effective at modeling the spatial relationships among different patches and thus achieves better localization accuracy. Compared with ManTra-Net [176] on the other four datasets, SPAN also shows better performance under the metric of
AUC, demonstrating that our proposed model has good capability of generalizing to other datasets without
extra adaptation.
SPAN achieves performance gains of over 10% in AUC compared to ManTra-Net on Columbia [119] and Coverage [170]. The performance boost on CASIA [34] is not as significant. There are two possible reasons: 1) tampered regions in Columbia and Coverage have more varied scales than those in the CASIA dataset, and SPAN benefits from our more effective multi-scale modeling; 2) CASIA samples have a lower average resolution (256 × 384) compared to the other datasets (NIST: 2448 × 3264; Columbia: 666 × 1002; Coverage: 209 × 586). As we show in Table 2.5, the performance gap between SPAN and ManTra-Net decreases when test images are resized to lower resolutions.
Pre-training + fine-tuning. We now compare SPAN with other SoTA methods [6, 7, 200] under the fine-tuning setup. We also report scores of traditional unsupervised signal analysis models (ELA [84], NOI1 [111] and CFA1 [40]) here, as they are also evaluated over the testing splits rather than over the whole dataset. For a fair comparison, we follow the same practices as in [200]: (1) directly evaluate on the Columbia dataset; (2) fine-tune our model on the Coverage and CASIA training splits and evaluate on the test splits; (3) fine-tune the model on the training split of NIST16 provided by [200] and evaluate on the test split. The results shown in Table 2.2 demonstrate that SPAN without fine-tuning already outperforms RGB-N and other methods by a large
Method         Supervision     Columbia       Coverage       CASIA          NIST16*
                               AUC    F1      AUC    F1      AUC    F1      AUC    F1
ELA[84]        unsupervised    58.1   47.0    58.3   22.2    61.3   21.4    42.9   23.6
NOI1[111]      unsupervised    54.6   57.4    58.7   26.9    61.2   26.3    48.7   28.5
CFA1[40]       unsupervised    72.0   46.7    48.5   19.0    52.2   20.7    50.1   17.4
J-LSTM[6]      fine-tuned      -      -       61.4   -       -      -       76.4   -
H-LSTM[7]      fine-tuned      -      -       71.2   -       -      -       79.4   -
RGB-N[200]     fine-tuned      85.8   69.7    81.7   43.7    79.5   40.8    93.7*  72.2*
SPAN           pre-training    93.6   81.5    91.2   53.5    81.4   33.6    83.6   29.0
SPAN           fine-tuned      -      -       93.7   55.8    83.8   38.2    96.1*  58.2*
Table 2.2: Pixel-level localization performance comparison under the metrics of AUC and F1 on four different benchmarks. For NIST16, Coverage and CASIA, all models are fine-tuned on the corresponding training splits unless specifically stated. *We found there is an overlap of images between the training and test data of NIST16.
margin on Columbia and Coverage, further proving that our spatial attention module has a strong generalization capability. With fine-tuning, the performance on all four datasets further improves.
J-LSTM and H-LSTM also make predictions by comparing image patches. Two possible reasons our
method achieves better performance are: 1) J-LSTM and H-LSTM look at patches at a single scale only,
which might limit them from detecting tampered regions that are very large or very small; 2) J-LSTM treats
patches independently and adopts an LSTM structure to process patches linearly, which might lose track of
the spatial and context information of those patches; H-LSTM attempts to alleviate this problem by taking
in patches along a specifically designed Hilbert-Curve route, but it could still be hard for a linear LSTM to
explicitly model spatial information. SPAN considers patches with their context and neighboring pixels, and has the ability to model spatial relationships through the positional projection.
There is not a large performance gain on the CASIA dataset compared to RGB-N [200], which is pre-trained on an object detection dataset (MS-COCO [104]). One observation on the CASIA dataset is that, compared to Columbia and Coverage, where there is a combination of both tampered objects and random tampered regions, most tampering in CASIA images occurs on objects. We put the NIST16 numbers in the last column as
Method                          Splicing        Removal        Copy-Move
                               AUC    F1       AUC    F1      AUC    F1
ManTra-Net [176] (GitHub)      85.89  38.56    65.52  14.86   79.84  15.03
SPAN (pre-training)            90.27  42.66    77.15  15.73   82.82  13.81
SPAN (fine-tuned)              99.15  82.94    90.95  49.95   90.94  40.54
Table 2.3: Pixel-level AUC and F1 comparison on multiple manipulation types, evaluated on the NIST16 [121] dataset.
F1                 SynVal   Columbia   Coverage   CASIA   NIST16
SPAN (Res)          79.36      68.76      44.53   31.56    27.92
SPAN (Res+PE)       79.72      68.38      49.13   26.01    28.05
SPAN (LSTM+PP)      78.99      80.56      49.09   29.78    27.70
SPAN (Res+PP)       81.76      81.45      53.47   33.63    28.99
Table 2.4: Comparison of SPAN variants evaluated with the F1 metric on SynVal, Columbia, Coverage, CASIA and NIST16.
we observed that there are visually almost identical images in both the training and testing splits, following the same protocol as in [200].
Manipulation type analysis. In Table 2.3 we present SPAN's performance on different manipulation types when evaluated on the NIST16 [121] dataset. For comparison, we also generate the per-class results by directly evaluating the model provided in the ManTra-Net GitHub repository†. Without fine-tuning on NIST16 (comparing the first two rows), our SPAN model performs consistently better than ManTra-Net on all three manipulation types, demonstrating that our proposed spatial attention model is effective regardless of the tampering type. SPAN results can be further improved with adaptation to this specific dataset.
Ablation studies. We explored how much each proposed component contributes to the final performance. We explore: (1) how to combine the outputs from different layers of the multi-scale attention module (convolutional LSTM (LSTM) [182] and Residual link (Res)); (2) how to model self-attention (position projection (PP) and position embedding (PE)). Besides the variants in these two modules, the number of self-attention layers is set to 5 and $N = 1$. The comparison of the different variants is presented in Table 2.4.
Comparing the first two rows, Residual link performs better than LSTM on fusing the multi-scale attention
†https://github.com/ISICV/ManTraNet
Pixel AUC                            NIST16                 Columbia
                                SPAN   ManTra-Net      SPAN   ManTra-Net
No manipulation                 83.95     78.05        93.6      77.95
Resize (0.78x)                  83.24     77.43        89.99     71.66
Resize (0.25x)                  80.32     75.52        69.08     68.64
No manipulation                 83.95     78.05        93.6      77.95
GaussianBlur (kernel size=3)    83.1      77.46        78.97     67.72
GaussianBlur (kernel size=15)   79.15     74.55        67.7      62.88
No manipulation                 83.95     78.05        93.6      77.95
GaussianNoise (sigma=3)         75.17     67.41        75.11     68.22
GaussianNoise (sigma=15)        67.28     58.55        65.8      54.97
No manipulation                 83.95     78.05        93.6      77.95
JPEGCompress (quality=100)      83.59     77.91        93.32     75
JPEGCompress (quality=50)       80.68     74.38        74.62     59.37
Table 2.5: Robustness analysis of SPAN over the NIST16 and Columbia datasets.
features. One possible explanation is that the Residual link is easier to optimize compared to the more complex LSTM; with a strong feature encoded by our attention module, SPAN with the Residual link converges better.
We also compare the performances of SPAN using two types of position modeling: position embedding
as in [156] and our proposed position projection. We compare the three variants Res+PP, Res+PE and Res only in Table 2.4. Under the SPAN model setup, Res and Res+PE perform similarly across different datasets. Replacing PE with PP as the positional modeling achieves a significant performance improvement on all datasets.
Robustness. We conducted experiments to explore the robustness of SPAN to various image modifications in Table 2.5. To produce modified samples, we apply the standard OpenCV built-in functions AREAResize, GaussianBlur, GaussianNoise, JPEGCompress on NIST16 and Columbia. SPAN demonstrates more robust performance to compression but is more sensitive to resizing.
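For reference, perturbations of this kind can be produced with a few OpenCV and NumPy calls, as in the hedged sketch below; the helper name and parameter handling are assumptions, and the exact functions used for the experiments may differ.

import cv2
import numpy as np

def perturb(image, kind, param):
    """Sketch of the robustness perturbations in Table 2.5 using OpenCV/NumPy.

    Illustrates resize, Gaussian blur, Gaussian noise and JPEG compression only.
    """
    if kind == "resize":                      # param: scale factor, e.g. 0.78 or 0.25
        h, w = image.shape[:2]
        return cv2.resize(image, (int(w * param), int(h * param)), interpolation=cv2.INTER_AREA)
    if kind == "blur":                        # param: odd kernel size, e.g. 3 or 15
        return cv2.GaussianBlur(image, (param, param), 0)
    if kind == "noise":                       # param: sigma, e.g. 3 or 15
        noisy = image.astype(np.float32) + np.random.normal(0, param, image.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    if kind == "jpeg":                        # param: quality, e.g. 100 or 50
        _, buf = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, param])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)
    raise ValueError(kind)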
Qualitative Results. Figure 2.5 shows some SPAN and ManTra-Net [176] results. The examples show that SPAN produces better predictions compared to ManTra-Net in three commonly observed cases: 1) interiors of tampered regions (green circles); 2) the correct manipulated object (blue circles); and 3) noisy false negative predictions (orange circles).
Figure 2.5: Comparison of SPAN prediction results on the Coverage, CASIA, Columbia and NIST16 datasets, with ManTra-Net predictions. From top to bottom: Manipulated Image, Ground-truth mask, ManTra-Net prediction, SPAN prediction and fine-tuned SPAN prediction. There is no fine-tuning result on the Columbia dataset as there is no training split in Columbia.
In Figure 2.6, we present qualitative results in comparison with RGB-N [200]. As mentioned above, RGB-N (which is built off Faster R-CNN [135]) generates region-level predictions and thus only predicts rectangular masks, restricting it from capturing the more accurate shape of the tampered region, as shown in the second row. On the other hand, there are also many test samples where the tampered region is naturally a rectangle, like the one in the first row of Figure 2.6.
We also present more visualization results of SPAN on four different datasets: NIST16 in Figure 2.7,
CASIA in Figure 2.8, Columbia in Figure 2.9 and Coverage in Figure 2.10. We sampled 5 examples (shown in the first 5 columns of each image, surrounded by the green box) and 2 additional failure cases (shown in the last two columns of each image, surrounded by the red box) from each dataset.
Figure 2.6: Qualitative Results of SPAN predictions on Coverage and NIST16 datasets, with comparison to
RGB-N predictions. From left to right: Manipulated Image, Ground-truth mask, RGB-N prediction, SPAN
(pre-training) prediction and SPAN(fine-tuned) prediction.
Figure 2.7: Qualitative Results on NIST16[121]. Top to bottom: Manipulated Image, Ground-Truth Mask,
SPAN(fine-tuned) Prediction Mask. Red box contains failure samples.
2.4.4 Additional Discussion on Dataset Characteristics
In Section 2.4.3 we mentioned that the characteristic difference between Columbia, Coverage and CASIA could be a possible reason for the different scales of performance improvement that SPAN achieved over RGB-N [200]. In Figure 2.11, three images from each of these three datasets are presented. As shown, in CASIA the manipulated regions tend to correspond to semantic objects, compared to the other two datasets.
In Section 2.4.3 we mentioned that there is an image overlap issue between training and test data on
NIST16[121], according to the training and testing split provided by RGB-N[200]. There are 584 images in
the NIST16 dataset, with 292 pairs of a unique image and its compressed JPEG version. Following the same practice
Figure 2.8: Qualitative Results on CASIAv1[34]. Top to bottom: Manipulated Image, Ground-Truth Mask,
SPAN(fine-tuned) Prediction Mask. Red box contains failure samples.
Figure 2.9: Qualitative Results on Columbia[119]. Top to bottom: Manipulated Image, Ground-Truth Mask,
SPAN Prediction Mask. Red box contains failure samples.
Figure 2.10: Qualitative Results on Coverage[170]. Top to bottom: Manipulated Image, Ground-Truth
Mask, SPAN(fine-tuned) Prediction Mask. Red box contains failure samples.
Figure 2.11: The characteristic difference between Columbia, Coverage and CASIA.
Figure 2.12: Demonstration of NIST16 [121] information overlapping between its training and testing splits.
in [200], the test split contains 160 images. 151 out of the 160 (88%) test images have a visually similar version in the training split and share the same ground-truth mask. Some overlapping images are shown in Figure 2.12. From top to bottom, the image in the training split, the image in the testing split and the ground-truth mask of the two images are presented.
2.5 Conclusion
We presented a Spatial Pyramid Attention Network (SPAN) that models the relationships between patches
on multiple scales through a pyramid structure of local self-attention blocks, to detect and localize multiple image manipulation types with or without fine-tuning. SPAN outperforms the state-of-the-art models [200, 176]. The method is both accurate and robust in general-type manipulation detection and localization, indicating that modeling patch relationships at different scales helps capture the essential information in manipulation localization. However, SPAN may be less effective with lower image resolutions.
Chapter 3
SimPLE: Similar pseudo label exploitation for semi-supervised classification
Owing to its cross-task nature, training an adaptation model like SPAN for classification-to-detection adaptation demands a substantial amount of labeled data. However, in practical environments where labeled
data is often scarce, this presents a significant challenge. This limitation has inspired me to investigate
approaches that facilitate adaptation with limited labeled data.
In this chapter *, we introduce SimPLE, an innovative semi-supervised learning algorithm. SimPLE
adapts ImageNet pre-trained models for various classification benchmarks.
A typical scenario in classification tasks involves having a large dataset for training, but only a fraction
of it is labeled. The objective of semi-supervised training in this context is to enhance classification accuracy
by utilizing not just the labeled data, but also the vast amount of unlabeled data. Earlier studies [13, 12, 145]
have made significant progress by exploiting the consistency constraint between differently augmented labeled and unlabeled data. Building on this, our proposed SimPLE algorithm introduces a novel unsupervised objective that concentrates on the underexplored relationships among similar high-confidence unlabeled data. Our newly proposed Pair Loss aims to minimize the statistical distance between high-confidence pseudo labels that exceed a specific similarity threshold.
*This chapter is published as a CVPR 2021 paper [70]: Zijian Hu, Zhengyu Yang, Xuefeng Hu, and Ram Nevatia, "Simple:
Similar pseudo label exploitation for semi-supervised classification," in the Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 15099-15108, 2021
By integrating the Pair Loss with techniques from the MixMatch family [13, 12, 145], SimPLE has not only achieved substantial improvements over previous algorithms in the semi-supervised adaptation setting, where models are initialized with weights pre-trained on ImageNet [87] or DomainNet-Real [128], but has also demonstrated state-of-the-art performance on classic semi-supervised learning benchmarks like CIFAR10, CIFAR100, SVHN, and Mini-ImageNet [161].
3.1 Introduction
Deep learning has recently achieved state-of-the-art performance on many computer vision tasks. One
major factor in the success of deep learning is the large labeled datasets. However, labeling large datasets is
very expensive and often not feasible, especially in domains that require expertise to provide labels. Semi-Supervised Learning (SSL), on the other hand, can take advantage of partially labeled data, which is much more readily available, as shown in Figure 3.1.
A critical problem in semi-supervised learning is how to generalize the information learned from limited
label data to unlabeled data. Following the continuity assumption that close data have a higher probability
of sharing the same label [19], many approaches have been developed [150, 197, 36], including the recently
proposed Label Propagation [73].
Another critical problem in semi-supervised learning is how to directly learn from the large amount of
unlabeled data. Maintaining consistency between differently augmented unlabeled data has been recently
studied and proved to be an effective way to learn from unlabeled data in both self-supervised learning
[20, 54] and semi-supervised learning [90, 140, 13, 12, 145, 115, 152, 194, 181]. Other than consistency
regularization, a few other techniques have also been developed for semi-supervised learning to leverage
the unlabeled data, such as entropy minimization [114, 94, 50] and generic regularization [63, 108, 193,
192, 160].
The recently proposed MixMatch [13] combined the above techniques and designed a unified loss function to let the model learn from differently augmented labeled and unlabeled data, together with the mix-up [193] technique, which encourages convex behavior between samples to increase the model's generalization ability. ReMixMatch [12] further improves MixMatch by introducing the Distribution Alignment and Augmentation Anchoring techniques, which allow the model to accommodate and leverage heavily augmented samples. FixMatch [145] simplifies its previous works by introducing a confidence threshold
Figure 3.1: Illustration of an image set with a limited amount of labeled images among a large number of unlabeled images. Unlike unsupervised learning methods that only exploit the structure from unlabeled data, and supervised learning methods that only look at the limited amount of labeled data, semi-supervised learning utilizes information from both labeled and unlabeled data.
into its unsupervised objective function and achieves state-of-the-art performance over the standard benchmarks.
However, while Label Propagation [73] mainly focuses on the relationship between labeled data and unlabeled data, and the MixMatch family [13, 12, 145] primarily focuses on the relationship between differently
augmented unlabeled samples, the relationship between different unlabeled samples is less studied.
In this paper, we propose to take advantage of the relationship between different unlabeled samples.
We introduce a novel Pair Loss, which minimizes the distance between similar unlabeled samples of high
confidence.
Combining the techniques developed by the MixMatch family [13, 12, 145], we propose the SimPLE
algorithm. As shown in Figure 3.2, the SimPLE algorithm generates pseudo labels of unlabeled samples
by averaging and sharpening the predictions on multiple weakly augmented variations of the same sample.
Then, we use both the labels and pseudo labels to compute the supervised cross-entropy loss and unsupervised L2 distance loss. These two terms push the decision boundaries to go through low-density areas
Figure 3.2: An overview of the proposed SimPLE algorithm. SimPLE optimizes the classification network with three training objectives: 1) supervised loss $L_X$ for augmented labeled data; 2) unsupervised loss $L_U$ that aligns the strongly augmented unlabeled data with pseudo labels generated from weakly augmented data; 3) Pair Loss $L_P$ that minimizes the statistical distance between predictions of strongly augmented data, based on the similarity and confidence of their pseudo labels.
and encourage consistency among different variations of the same samples. Finally, with the newly proposed Pair Loss, we harness the relationships among the pseudo labels of different samples by encouraging
consistency among different unlabeled samples which share a great similarity.
Our contributions are fourfold:
• We propose a novel unsupervised loss term that leverages the information from high-confidence similar unlabeled data pairs.
• Combining the techniques from the MixMatch family [13, 12, 145] with the new Pair Loss, we developed the novel SimPLE algorithm for semi-supervised learning.
• We performed extensive experiments on the standard benchmarks and demonstrated the effectiveness of the proposed Pair Loss. SimPLE outperforms the state-of-the-art methods on CIFAR100 and Mini-ImageNet and is on par with the state-of-the-art methods on CIFAR10 and SVHN.
• We also evaluated our algorithm in a realistic setting where SSL methods are applied on pre-trained
models, where the newly proposed SimPLE algorithm also outperforms the current state-of-the-art methods.
3.2 Related Work
3.2.1 Consistency Regularization
Consistency regularization is widely used in the field of SSL. It refers to the idea that a model’s response to
an input should remain consistent when perturbations are applied to the input or the model. The idea was first proposed in [90, 140]. In its simplest form, the regularization can be achieved via the loss term:
\| \mathrm {p}_\text {model}(y | A(x); \theta ) - \mathrm {p}_{\text {model}^\prime }(y | A(x); \theta ) \|_2^2 (3.1)
The stochastic transformation $A(x)$ can be either domain-specific data augmentation [13, 90, 140, 12], dropout [140], random max pooling [140], or adversarial transformation [115]. A further extension of this idea is to “perturb” the model, $\mathrm{p}_{\text{model}'}$, instead of the input. The perturbation can be an ensemble of the model at different time steps [90, 152], or an adversarial perturbation of the model's parameters $\theta$ [194]. Also, many works choose to minimize the cross-entropy instead of the L2 norm [115, 181, 12, 145].
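A minimal sketch of this consistency term, assuming a stochastic augmentation function and the squared L2 form of Equation 3.1, is shown below; treating one branch as a fixed target is a common but not universal choice.

import torch

def consistency_loss(model, x, augment):
    """Minimal sketch of Eq. 3.1: the squared L2 distance between predictions on two
    stochastic views of the same input. `augment` stands in for any transformation A(x)."""
    p1 = model(augment(x)).softmax(dim=-1)
    with torch.no_grad():                       # one branch is often treated as the target
        p2 = model(augment(x)).softmax(dim=-1)
    return ((p1 - p2) ** 2).sum(dim=-1).mean()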
Augmentation Anchoring: Augmentation Anchoring is first proposed by ReMixMatch [12] and further developed in FixMatch [145]. It is a form of consistency regularization that involves applying different levels of perturbation to the input. A model's response to a slightly perturbed input is regarded as the “anchor”, and we try to align the model's response to a severely perturbed input to the anchor. For example, we can slightly perturb the input by applying an “easy” augmentation such as horizontal flipping, and severely perturb the input by applying a “hard” augmentation such as Gaussian blurring. As the model's response to a slightly perturbed input is more stable, including Augmentation Anchoring increases the stability of the regularization process.
3.2.2 Pseudo-labeling
Pseudo labels are artificial labels generated by the model itself and are used to further train the model. Lee [94] picks the class with the highest probability predicted by the model as the pseudo label. However, pseudo labels are only used during the fine-tuning phase of the model, which has been pre-trained. When we minimize the entropy on pseudo labels, we encourage the decision boundaries among clusters of unlabeled samples to lie in the low-density region, as required by the low-density separation assumption [19]. In this paper, for simplicity, we use a single lower-case letter, $p \in \Delta^N$ (the $N$-probability simplex), to represent either a hard label (a one-hot vector) or a soft label (a vector of probabilities). A simple yet powerful extension of pseudo-labeling is to filter pseudo labels based on a confidence threshold [41, 145]. We define the confidence of a pseudo label as the highest probability of it being any class (i.e., $\max_i(p_i)$). For simplicity, from now on, we will use $\max(p)$ as a shorthand notation for the confidence of any label $p$. With a predefined confidence threshold $\tau_c$, we reject all pseudo labels whose confidence is below the threshold (i.e., $\max(p) < \tau_c$). The confidence threshold allows us to focus on labels with high confidence (low entropy) that are away from the decision boundaries.
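The confidence filter amounts to a one-line mask, as in the sketch below; the threshold value of 0.95 is only an example.

import torch

def filter_by_confidence(pseudo_labels, tau_c=0.95):
    """Sketch of confidence-based filtering: keep only pseudo labels p with max(p) > tau_c.

    pseudo_labels: (N, C) soft labels; returns a boolean mask over the N samples.
    """
    confidence = pseudo_labels.max(dim=-1).values
    return confidence > tau_c

# Usage idea: mask the per-sample unsupervised losses with filter_by_confidence(q) before averaging.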
3.2.3 Label Propagation
Label propagation is a graph-based idea that tries to build a graph whose nodes are the labeled and unlabeled
samples, and edges are weighted by the similarity between those samples [19]. Although it is traditionally
considered a transductive method [150, 197], it has recently been used in an inductive setting as a way to generate pseudo labels. In [73], the authors measure the similarity between feature representations of labeled and unlabeled samples embedded by a CNN. Then, each sample is connected with its K most similar neighbors to construct the affinity graph. After pre-training the model in a supervised fashion, they alternate between training the model and propagating labels on the graph. The idea of using K nearest neighbors to build the graph efficiently is proposed in [36], as most edges in the graph should have a weight close to 0. Our similarity threshold, τs, plays a similar role.
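As a rough sketch of this graph construction (our own simplification, not the implementation of [73] or [36]), a sparse affinity matrix can be built from CNN features as follows:

```python
import torch

def knn_affinity(features: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Sparse affinity matrix from cosine similarity, keeping only the top-k
    neighbors per sample; all other edges are set to zero, following the
    intuition that most edge weights are close to 0."""
    f = torch.nn.functional.normalize(features, dim=-1)
    sim = f @ f.t()                              # (N, N) cosine similarities
    sim.fill_diagonal_(-1.0)                     # exclude self-edges
    topk_vals, topk_idx = sim.topk(k, dim=-1)
    affinity = torch.zeros_like(sim)
    affinity.scatter_(1, topk_idx, topk_vals.clamp(min=0))
    return 0.5 * (affinity + affinity.t())       # symmetrize the graph
```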
Algorithm 1 SimPLE algorithm
1: Input: Batch of labeled examples and their one-hot labels X = ((xb, yb); b ∈ (1, ..., B)), batch of unlabeled examples U = (ub; b ∈ (1, ..., B)), sharpening temperature T, number of weak augmentations K, number of strong augmentations Kstrong, confidence threshold τc, similarity threshold τs.
2: for b = 1 to B do
3:   x̃b = Aweak(xb)   ▷ Apply weak data augmentation to xb
4:   for k = 1 to K do
5:     ũb,k = Aweak(ub)   ▷ Apply the k-th round of weak data augmentation to ub
6:   end for
7:   for k = 1 to Kstrong do
8:     ûb,k = Astrong(ub)   ▷ Apply the k-th round of strong augmentation to ub
9:   end for
10:  q̄b = (1/K) Σ_{k=1}^{K} pmodel′(ỹ | ũb,k; θ)   ▷ Average predictions across all weakly augmented ub using the EMA model
11:  qb = Sharpen(q̄b, T)   ▷ Apply temperature sharpening to the average prediction
12: end for
13: X̂ = ((x̃b, yb); b ∈ (1, ..., B))   ▷ Weakly augmented labeled examples and their labels
14: Û = ((ûb,k, qb); b ∈ (1, ..., B), k ∈ (1, ..., Kstrong))   ▷ Strongly augmented unlabeled examples and their guessed labels
15: LX = (1/|X′|) Σ_{x,y ∈ X̂} H(y, pmodel(ỹ | x; θ))   ▷ Compute supervised loss
16: LU = (1/(L|Û|)) Σ_{u,q ∈ Û} 1(max(q) > τc) ‖q − pmodel(ỹ | u; θ)‖²₂   ▷ Compute thresholded unsupervised loss
17: LP = PairLoss(Û, τc, τs)   ▷ Compute Pair Loss
18: return LX + λU LU + λP LP   ▷ Compute loss L from X̂ and Û
3.3 Method
To take full advantage of the vast quantity of unlabeled samples in SSL problems, we propose the SimPLE
algorithm that focuses on the relationship between unlabeled samples. In the following section, we first
describe the semi-supervised image classification problem. Then, we develop the major components of our
methods and incorporate everything into our proposed SimPLE algorithm.
3.3.1 Problem Description
We define the semi-supervised image classification problem as follows. In an L-class classification setting,
let X = ((xb, yb) ; b ∈ (1, . . . , B)) be a batch of labeled data, and U = (ub; b ∈ (1, . . . , B)) be a batch
of unlabeled data. Let pmodel (˜y | x; θ) denote the model’s predicted softmax class probability of input x
parameterized by weight θ.
3.3.2 Augmentation Strategy
Our algorithm uses Augmentation Anchoring [12, 145], in which pseudo labels obtained from weakly augmented samples act as "anchors", and we align the strongly augmented samples to these anchors. Our weak augmentation, following MixMatch [13], ReMixMatch [12], and FixMatch [145], consists of a random crop followed by a random horizontal flip. For strong augmentation, we use RandAugment [29] or a fixed augmentation strategy that contains difficult transformations such as random affine and color jitter. For every batch, RandAugment randomly selects a fixed number of augmentations from a predefined pool; the intensity of each transformation is determined by a magnitude parameter. In our experiments, we find that our method adapts to high-intensity augmentation very quickly. Thus, we simply fix the magnitude to the highest possible value.
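A possible torchvision realization of these two augmentation levels for 32 × 32 inputs is sketched below; a recent torchvision with `RandAugment` is assumed, and the crop size, padding, and parameter values are illustrative rather than our exact configuration.

```python
from torchvision import transforms

# Weak augmentation: random crop + random horizontal flip.
weak_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
])

# Strong augmentation, option 1: RandAugment with the magnitude fixed high
# (torchvision >= 0.11 provides RandAugment; 30 is the largest default bin).
strong_aug_randaug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=30),
])

# Strong augmentation, option 2: a fixed policy with harder transforms
# such as random affine and color jitter.
strong_aug_fixed = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=25, translate=(0.2, 0.2), scale=(0.8, 1.2), shear=8),
    transforms.ColorJitter(contrast=(0.75, 1.5)),
])
```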
3.3.3 Pseudo-labeling
Our pseudo labeling is based on the label guessing technique used in [13]. We first take the average of the model's predictions on several weakly augmented versions of the same unlabeled sample as its pseudo label. As the prediction is averaged over K slight perturbations of the same input instead of K severe perturbations [13] or a single perturbation [12, 145], the guessed pseudo label should be more stable. Then, we use the sharpening operation defined in [13] to reduce the entropy of the label's distribution:
\operatorname{Sharpen}(p, T) := \frac{p^{1/T}}{\mathbf{1}^\top p^{1/T}} \qquad (3.2)
As the peak of the pseudo label’s distribution is “sharpened”, the network will push this sample further away
from the decision boundary. Additionally, following the practice of MixMatch [13], we use the exponential
moving average of the model at each time step to guess the labels.
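A short sketch of the sharpening operation and this label guessing step is given below; `ema_model` and `weak_views` are hypothetical stand-ins for the EMA model and the K weakly augmented views of one unlabeled batch.

```python
import torch

def sharpen(p: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """Temperature sharpening of Eq. 3.2: raise to the power 1/T and renormalize."""
    p_t = p ** (1.0 / T)
    return p_t / p_t.sum(dim=-1, keepdim=True)

def guess_labels(ema_model, weak_views, T: float = 0.5) -> torch.Tensor:
    """Average the EMA model's predictions over the weakly augmented views of the
    same unlabeled batch, then sharpen the average (illustrative sketch)."""
    with torch.no_grad():
        probs = [torch.softmax(ema_model(v), dim=-1) for v in weak_views]
        q_bar = torch.stack(probs, dim=0).mean(dim=0)
    return sharpen(q_bar, T)
```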
3.3.4 Loss
Our loss consists of three terms: the supervised loss LX , the unsupervised loss LU, and the Pair Loss LP.
\mathcal{L} = \mathcal{L}_\mathcal{X} + \lambda_\mathcal{U} \mathcal{L}_\mathcal{U} + \lambda_\mathcal{P} \mathcal{L}_\mathcal{P} \qquad (3.3)

\mathcal{L}_\mathcal{X} = \frac{1}{|\mathcal{X}'|} \sum_{x,y \in \hat{\mathcal{X}}} \mathrm{H}\big(y,\, \mathrm{p}_\text{model}(\tilde{y} \mid x; \theta)\big) \qquad (3.4)

\mathcal{L}_\mathcal{U} = \frac{1}{L\,|\hat{\mathcal{U}}|} \sum_{u,q \in \hat{\mathcal{U}}} \mathbb{1}\big(\max(q) > \tau_c\big)\, \big\| q - \mathrm{p}_\text{model}(\tilde{y} \mid u; \theta) \big\|_2^2 \qquad (3.5)
LX calculates the cross-entropy of weakly augmented labeled samples; LU is the L2 distance between the predictions on strongly augmented samples and their pseudo labels, filtered by the confidence threshold. Notice that LU only enforces consistency among different perturbations of the same sample, not consistency among different samples.
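A compact sketch of how LX and LU could be computed for one batch is shown below; the function name, arguments, and the assumption of a single strong view per sample are illustrative, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def supervised_and_unsupervised_loss(model, x, y, u_strong, q, tau_c, num_classes):
    """Sketch of L_X (Eq. 3.4) and L_U (Eq. 3.5) for one batch."""
    # L_X: cross-entropy on weakly augmented labeled samples (y is one-hot).
    log_px = F.log_softmax(model(x), dim=-1)
    loss_x = -(y * log_px).sum(dim=-1).mean()

    # L_U: confidence-masked L2 between pseudo labels q and predictions on the
    # strongly augmented views, normalized by L * |U_hat| as in Eq. 3.5.
    pu = torch.softmax(model(u_strong), dim=-1)
    mask = (q.max(dim=-1).values > tau_c).float()
    loss_u = (mask * ((q - pu) ** 2).sum(dim=-1)).sum() / (num_classes * q.shape[0])
    return loss_x, loss_u
```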
3.3.4.1 Pair Loss
As we aim to exploit the relationship among unlabeled samples, we hereby introduce a novel loss term, Pair
Loss, that allows information to propagate implicitly between different unlabeled samples. In Pair Loss,
we use a high confidence pseudo label of an unlabeled point, p, as an “anchor.” All unlabeled samples
whose pseudo labels are similar enough to p need to align their predictions under severe perturbation to
the "anchor." Figure 3.3 offers an overview of this selection process. During this process, the similarity threshold "extends" our confidence threshold in an adaptive manner, as a sample whose pseudo label confidence is below the threshold can still be selected by the loss and be pushed to a higher confidence level. Formally, we define the Pair Loss as follows:
\mathcal{L}_\mathcal{P} = \frac{1}{\binom{K'B}{2}} \sum_{\substack{i,j \in [|\mathcal{U}'|],\, i \ne j \\ (v_l, q_l) = \mathcal{U}'_i \\ (v_r, q_r) = \mathcal{U}'_j}} \varphi_{\tau_c}\big(\max(q_l)\big) \cdot \varphi_{\tau_s}\big(f_\text{sim}(q_l, q_r)\big) \cdot f_\text{dist}\big(q_l,\, \mathrm{p}_\text{model}(\tilde{y} \mid v_r; \theta)\big) \qquad (3.6)
Here, τc and τs denote the confidence threshold and the similarity threshold, respectively. φt(x) = 1(x>t) · x is a hard threshold function controlled by the threshold t. fsim(p, q) measures the similarity between two probability vectors p, q with the Bhattacharyya coefficient [14]. The coefficient is bounded in [0, 1] and represents the size of the overlapping portion of the two discrete distributions:
f_\text{sim}(p, q) = \sqrt{p}^\top \sqrt{q} \qquad (3.7)
fdist (p, q) measures the distance between two probability vectors p, q. As fsim (p, q) ∈ [0, 1], we choose
the distance function to be fdist (p, q) = 1 − fsim (p, q).
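The sketch below implements the Bhattacharyya coefficient and a batched version of the Pair Loss; the variable names and the exact pair normalization are our own choices for illustration.

```python
import torch

def bhattacharyya(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Pairwise Bhattacharyya coefficients (Eq. 3.7): sqrt(p) . sqrt(q)."""
    return torch.sqrt(p) @ torch.sqrt(q).t()

def pair_loss(q: torch.Tensor, pred_strong: torch.Tensor,
              tau_c: float = 0.95, tau_s: float = 0.9) -> torch.Tensor:
    """A sketch of the Pair Loss (Eq. 3.6). q: (M, L) pseudo labels; pred_strong:
    (M, L) predictions on the strongly augmented views of the same M samples."""
    M = q.shape[0]
    conf = q.max(dim=-1).values                      # max(q_l), shape (M,)
    sim = bhattacharyya(q, q)                        # f_sim(q_l, q_r), shape (M, M)
    dist = 1.0 - bhattacharyya(q, pred_strong)       # f_dist(q_l, p(y | v_r)), (M, M)

    # Hard threshold phi_t(x) = 1(x > t) * x, applied to the anchor confidence
    # (rows) and to the pairwise label similarity.
    phi_conf = torch.where(conf > tau_c, conf, torch.zeros_like(conf))
    phi_sim = torch.where(sim > tau_s, sim, torch.zeros_like(sim))

    weights = phi_conf.unsqueeze(1) * phi_sim        # (M, M) pair weights
    weights.fill_diagonal_(0.0)                      # exclude i == j pairs
    num_pairs = M * (M - 1) / 2                      # number of unordered pairs
    return (weights * dist).sum() / num_pairs
```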
Figure 3.3: Pair Loss Overview. Given a pseudo label ql (red), a probability vector representing the guessed class distribution, if the highest entry in ql surpasses the confidence threshold τc, ql becomes an "anchor". Then, for any pseudo label and image tuple qr (light blue) and vr (dark blue), if the overlapping proportion (i.e., similarity) between ql and qr is greater than the similarity threshold τs, the tuple (qr, vr) contributes toward the Pair Loss by pushing the model's prediction on a strongly augmented version of vr toward the "anchor" ql (green arrow). If either threshold cannot be satisfied, ql, qr, vr are rejected.
Although our analysis shows that cos(cos⁻¹(√τc) + cos⁻¹(τs))² is the infimal confidence a label needs to have in order to be selected by both thresholds, such low-confidence labels are rarely selected in practice. Based on empirical evidence, we believe this is because a label p that passes the high confidence threshold typically has a near one-hot distribution. Thus, for another label q to fall within the similarity threshold of p, it must also have relatively high confidence. Due to this property, the Pair Loss is not very sensitive to the choices of the hyperparameters τs, τc, which we show empirically in section 3.5.2.
3.3.4.2 Motivation for Different Loss Formulations
We follow MixMatch [13] in choosing supervised loss LX and unsupervised loss LU terms. We use the
Bhattacharyya coefficient [14] in our Pair Loss because it measures the overlap between two distributions
and allows a more intuitive selection of the similarity threshold τs. Although we believe that the Bhattacharyya coefficient [14] is more suitable than the L2 distance (or 2 − L2) for measuring the similarity between two distributions, we keep the L2 distance in the unsupervised loss term to provide a better comparison with MixMatch [13]. Moreover, as cross-entropy measures the entropy and is asymmetric, it is not a good distance measure between distributions. In our experiments, we observe that SimPLE with an L2 Pair Loss has 0.53% lower test accuracy than the original.
3.3.5 SimPLE Algorithm
By putting together all the components introduced in this section, we now present the SimPLE algorithm.
During training, for a mini-batch of samples, SimPLE first augments labeled and unlabeled samples with both weak and strong augmentations. The pseudo labels of the unlabeled samples are obtained by averaging and then sharpening the model's predictions on the weakly augmented unlabeled samples. Finally, we optimize the loss terms based on the augmented samples and pseudo labels. During testing, SimPLE uses the exponential moving average of the model's weights to make predictions, as done by MixMatch [13].
Figure 3.2 gives an overview of SimPLE, and the complete algorithm is in algorithm 1.
Method 10000 labels Backbone
MixMatch∗ 64.01% WRN 28-2
MixMatch Enhanced 67.12% WRN 28-2
SimPLE 70.82% WRN 28-2
MixMatch† [145] 71.69% WRN 28-8
ReMixMatch† [145] 76.97% WRN 28-8
FixMatch [145] 77.40% WRN 28-8
SimPLE 78.11% WRN 28-8
Table 3.1: CIFAR-100 Top-1 Test Accuracy. ∗: using our implementation. †: reported in FixMatch [145].
3.4 Experiments
Unless specified otherwise, we use Wide ResNet 28-2 [187] as our backbone and AdamW [107] with weight decay for optimization in all experiments. We also use the exponential moving average (EMA) of the network parameters, updated at every training step, for evaluation and label guessing.
CIFAR-10 / SVHN
Method 1000 labels 4000 labels 1000 labels 4000 labels
VAT† [12] 81.36% 88.95% 94.02% 95.80%
MeanTeacher† [12] 82.68% 89.64% 96.25% 96.61%
MixMatch [13] 92.25% 93.76% 96.73% 97.11%
ReMixMatch [12] 94.27% 94.86% 97.17% 97.58%
FixMatch [145] − 95.69% 97.64% −
SimPLE 94.84% 94.95% 97.54% 97.31%
Fully Supervised†‡ 95.75% (CIFAR-10) 97.3% (SVHN)
Table 3.2: CIFAR-10 and SVHN Top-1 Test Accuracy. All experiments use WRN 28-2. †: The accuracy is reported in ReMixMatch [12] using its own implementation. ‡: Fully supervised baseline using all the labels and simple augmentation (flip-and-crop).
Method 4000 labels Backbone K
MixMatch∗ 55.47% WRN 28-2 2
MixMatch Enhanced 60.50% WRN 28-2 7
SimPLE 66.55% WRN 28-2 7
MeanTeacher† [73] 27.49% ResNet-18 –
Label Propagation [73] 29.71% ResNet-18 –
SimPLE 49.39% ResNet-18 7
Table 3.3: Mini-ImageNet Top-1 Test Accuracy. ∗: using our implementation. †: The score is reported in [73] using its own implementation.
To have a fair comparison with MixMatch, we implemented an enhanced version of MixMatch by combining it with Augmentation Anchoring [12]. To report test accuracy, we take the checkpoint with the highest validation accuracy and report its test accuracy. By default, our experiments use fixed hyperparameters τc = 0.95, τs = 0.9 and an EMA decay of 0.999.
3.4.1 Datasets and Evaluations
• CIFAR-10: A dataset with 60K images of shape 32 × 32 evenly distributed across 10 classes. The
training set has 50K images, and the test set contains 10K images. Our validation set size is 5000 for
CIFAR-10. The results are available in table 3.2.
• SVHN: SVHN consists of 10 classes. Its training set has 73257 images, and the test set contains
26032 images. Each image in SVHN is 32 × 32. Our validation set size is 5000 for SVHN. The
results are available in table 3.2.
• CIFAR-100: Similar to CIFAR-10, CIFAR-100 also has 50K training images and 10K test images
but with 100 classes. The image size is 32 × 32, the same as CIFAR-10. Our validation set size is
5000 for CIFAR-100. The results are available in table 3.1.
• Mini-ImageNet: Mini-ImageNet was first introduced in [161] for few-shot learning. The dataset contains 100 classes, each with 600 images of size 84 × 84. For SSL evaluation, our protocol follows that of [73], in which 500 images are selected from each class to form the training set, and the remaining 100 images per class are used for testing. Since [73] does not specify its validation split, we use a total of 7200 training images (72 per class) as the validation set; this is the same validation set size as in [161].
• DomainNet-Real [128]: DomainNet-Real has 345 categories with unbalanced numbers of images
per class following a long tail distribution. We use this dataset for transfer learning experiments in
section 3.5.1. For our evaluations, we resize the image to 84 × 84 and use 11-shot per class (a total of
3795) for the labeled training set.
3.4.2 Baseline Methods
We compare with the following baseline methods: FixMatch [145], MixMatch [13], ReMixMatch [12], VAT
[114], MeanTeacher [153], and Label Propagation [73].
Method 4000 labels Convergence step
Supervised w/ EMA§ 48.83% 4K
MixMatch∗ from scratch 50.31% 150K
MixMatch∗ 53.39% 69K
MixMatch Enhanced∗ from scratch 52.83% 734K
MixMatch Enhanced∗ 55.75% 7K
SimPLE from scratch 59.92% 338K
SimPLE 58.73% 53K
Table 3.4: DomainNet-Real pre-trained model transferred to Mini-ImageNet. All experiments use WRN 28-2. The model is considered converged when its validation accuracy reaches 95% of its highest validation accuracy. §: using the labeled training set only. ∗: using our implementation.
3.5 Results
For all datasets, our labeled and unlabeled set split is done by randomly sampling the same number of images from each class without replacement. In general, our hyperparameter choices follow those of MixMatch [13] and FixMatch [145].
CIFAR-100: We set the loss weights to λU = 150, λP = 150. As shown in table 3.1, SimPLE achieves a significant improvement on CIFAR-100. For a better comparison with [145], we include experiments using the same optimizer (SGD), hyperparameters, and backbone network (WRN 28-8 with 23M parameters). With the larger backbone, our method still provides improvements over the baseline methods. SimPLE is better than FixMatch by 0.7% and takes only 4.7 hours of training to converge, while FixMatch takes about 8 hours. We consider convergence to be achieved when the validation accuracy reaches 95% of its highest value.
CIFAR-10, SVHN: For CIFAR-10, we set λU = 75 and λP = 75; we set λU = λP = 250 for SVHN.
For both datasets, we use SGD with cosine learning rate decay [109], with the decay rate set to 7π/16, following FixMatch [145].
In table 3.2, we find that SimPLE is on par with ReMixMatch [12] and FixMatch [145]. ReMixMatch,
FixMatch, and SimPLE are very close to the fully supervised baseline with less than 1% difference in test
accuracy. SimPLE is less effective on these domains because the leftover samples are difficult ones whose
pseudo labels are not similar to any of the high confidence pseudo labels. In this case, no pseudo labels can
pass the two thresholds in Pair Loss and contribute to the loss. We observe that the percentage of pairs in a batch that passes both thresholds stabilizes early in the training process (the percentage is 12% for SVHN and 10% for CIFAR-10). Thus, Pair Loss does not bring as much performance gain as it does on the more complicated datasets.
Mini-ImageNet: To examine the scalability of our method, we conduct experiments on Mini-ImageNet.
Mini-ImageNet is a more complex dataset because its categories and images are sampled directly from
ImageNet. Although the image size is scaled down to 84 × 84, it is still much more complicated than
CIFAR-10, CIFAR-100, and SVHN. Therefore, Mini-ImageNet is an excellent candidate to illustrate the
scalability of SimPLE.
In addition to WRN 28-2 experiments on Mini-ImageNet, we also apply the SimPLE algorithm on
ResNet-18 [57] for a fair comparison with prior works. The results are in table 3.3. In general, our method
outperforms all other methods by a large margin on Mini-ImageNet regardless of the backbone. Our method scales well to this more challenging dataset.
Method 3795 labels Convergence step
Supervised w/ EMA§ 42.91% 4K
MixMatch∗ 35.34% 5K
MixMatch Enhanced∗ 35.16% 5K
SimPLE 50.90% 65K
Table 3.5: ImageNet-1K pre-trained model transferred to DomainNet-Real. All experiments use ResNet-50. The model is considered converged when its validation accuracy reaches 95% of its highest validation accuracy. §: using the labeled training set only. ∗: using our implementation.
3.5.1 SSL for Transfer Learning Task
In real-world applications, a common scenario is where the target task is similar to existing datasets. Transfer
learning is helpful in this situation if the target domain has sufficient labeled data. However, this is not
Ablation Augmentation Type λP τc τs K 10000 labels
SimPLE RandAugment 150 0.95 0.9 2 70.82%
SimPLE RandAugment 150 0.95 0.9 7 73.04%
w/o Pair Loss RandAugment 0 0.95 0.9 2 69.07%
w/o Pair Loss RandAugment 0 0.95 0.9 7 69.94%
w/o RandAugment fixed 150 0.95 0.9 2 67.91%
w/o RandAugment, w/o Pair Loss fixed 0 0.95 0.9 2 67.41%
τc = 0.75 RandAugment 150 0.75 0.9 2 71.96%
τs = 0.7 RandAugment 150 0.95 0.7 2 70.85%
τc = 0.75, τs = 0.7 RandAugment 150 0.75 0.7 2 71.48%
λP = 50 RandAugment 50 0.95 0.9 2 71.34%
λP = 250 RandAugment 250 0.95 0.9 2 71.42%
Table 3.6: Ablation on CIFAR-100. All experiments use WRN 28-2
guaranteed. Therefore, SSL methods need to perform well when starting from a pre-trained model on a
different dataset. Another benefit of using a pre-trained model is having fast convergence, which is important
for time-sensitive applications.
Since prior SSL methods often neglect this scenario, in this section we evaluate our algorithm, MixMatch [13], and a supervised baseline in the transfer setting. The supervised baseline uses only the labeled training data and parameter EMA for evaluation. All transfer experiments use fixed augmentations.
Our first experiment is the adaptation from DomainNet-Real to Mini-ImageNet; the result is in table 3.4.
We observe that the pre-trained models are on par with training from scratch but converge 5 ∼ 100 times
faster. Under the transfer setting, SimPLE is 7.57% better than MixMatch and 9.9% better than the supervised
baseline.
The experiment in table 3.5 is for transferring from ImageNet-1K [31] to DomainNet-Real. Since
ImageNet-1K pre-trained ResNet-50 [57] is readily available in many machine learning libraries (e.g., PyTorch), we evaluate the performance and the convergence speed using ImageNet-1K pre-trained ResNet-50
to mimic real-world applications.
On DomainNet-Real, MixMatch is about 7% lower than the supervised baseline, while SimPLE has 8%
higher accuracy than the baseline. MixMatch Enhanced, despite having Augmentation Anchoring, does not
outperform MixMatch.
It is clear that SimPLE performs well in the pre-trained setting and surpasses MixMatch and the supervised baselines by a large margin. This behavior is consistent across datasets and network architectures. MixMatch, on the other hand, does not improve performance in the pre-trained setting.
Compared to training from scratch, the pre-trained models do not always provide performance improvements since the pre-trained models might have domain bias that is not easy to overcome. For example, in our
DomainNet-Real to Mini-ImageNet experiment, the pre-trained test accuracy is slightly lower than training
from scratch. However, the convergence speed is significantly faster (∼8 to 10 times) when starting from a
pre-trained model.
3.5.2 Ablation Study over CIFAR-100
In this section, we conducted ablation studies on CIFAR-100 with WRN 28-2 to evaluate the effectiveness
of different parts of our system. The results are available in table 3.6. We choose CIFAR-100 because it has
a reasonable number of classes (reasonably complicated) and a small image size (fast enough for training).
We observe that Pair Loss significantly improves the performance. With a more diverse augmentation policy or an increased number of augmentations, the advantage of the Pair Loss grows. Also, SimPLE
is robust to threshold change. One possible explanation for the robustness is that since a pair must pass both
thresholds to contribute to the loss, changing one of them may not significantly affect the overall number of
pairs that pass both thresholds.
3.5.3 Analysis and Discussion
3.5.3.1 Hyperparameters
As mentioned in section 3.4, our hyperparameters are almost identical to those of MixMatch [13] and FixMatch [145]. We use the same network architecture and similar hyperparameters as FixMatch for CIFAR-10, SVHN, and CIFAR-100 (WRN 28-8). We conducted the ablation study on WRN 28-2 with hyperparameters similar to those of MixMatch for simplicity. We also evaluated SimPLE on Mini-ImageNet with WRN 28-2 and ResNet-18. We use the same α (beta distribution parameter for mix-up [193]) and T (temperature for sharpening) across all experiments. Notice that only MixMatch and MixMatch Enhanced use mix-up. The full details of our hyperparameter choices can be found in table 3.7. Our transfer experiment configurations are in table 3.8.
Columns, left to right: CIFAR-10 (WRN 28-2), SVHN (WRN 28-2), CIFAR-100 (WRN 28-8), CIFAR-100 (WRN 28-2), Mini-ImageNet (WRN 28-2), Mini-ImageNet (ResNet-18)
τc 0.95
τs 0.9
λU 75 250 150 150 300 300
λP 75 250 150 150 300 300
lr 0.03
K 7 7 4 2 7
T 0.5
α 0.75
weight decay 0.0005 0.0005 0.001 0.04 0.02 0.02
batch size 64 64 64 64 64 16
EMA decay 0.999
backbone WRN 28-2 WRN 28-2 WRN 28-8 WRN 28-2 WRN 28-2 ResNet-18
optimizer SGD SGD SGD AdamW AdamW AdamW
Nesterov True
momentum 0.9
lr scheduler cosine decay
lr decay rate 7π / 16
Table 3.7: Hyperparameters for CIFAR-10, SVHN, CIFAR-100 and MiniImageNet.
3.5.3.2 Optimization
For CIFAR-10, SVHN, and CIFAR-100 (WRN 28-8), we use SGD with Nesterov momentum set to 0.9.
We also use cosine learning rate decay [109] with a decay rate of 7π/16, following FixMatch. For CIFAR-100 (WRN 28-2), Mini-ImageNet, and the transfer experiments, we use AdamW [107] without learning rate scheduling, following MixMatch. Details are available in tables 3.7 and 3.8.
DN-R to M-IN IN-1K to DN-R
τc 0.95
τs 0.9
λU 300
λP 300
feature lr 0.0002 0.00002
classifier lr 0.002
K 2
T 0.5
α 0.75
weight decay 0.02
batch size 16
EMA decay 0.999
backbone WRN 28-2 ResNet 50
optimizer AdamW
Table 3.8: Hyperparameters for Transfer: DomainNet-Real to Mini-ImageNet (DN-R to M-IN) and Transfer: ImageNet-1K to DomainNet-Real (IN-1K to DN-R) experiments.
3.5.3.3 Augmentations
Our augmentations are implemented on GPU with Kornia [136]. In table 3.9, we list the transformations used by the fixed augmentations of tables 3.4 and 3.5. For RandAugment [29], we follow the exact same settings as FixMatch [145]. Note that we only report the changed augmentation parameters; the omitted values are the same as the default parameters in Kornia [136].
Transformation, description, and parameters:
Random Horizontal Flip: horizontally flip an image randomly with a given probability p. Parameters: p = 0.5.
Random Resized Crop: random crop of a given size and resizing of the cropped patch to another size. Parameters: scale = (0.8, 1), ratio = (1, 1).
Random 2D GaussianBlur: creates a Gaussian filter for image blurring; the blurring is randomly applied with probability p. Parameters: p = 0.5, kernel size = (3, 3), sigma = (1.5, 1.5).
Color Jitter: randomly change the brightness, contrast, saturation, and hue of given images. Parameters: contrast = (0.75, 1.5).
Random Erasing: erases a randomly selected rectangle for each image in the batch, putting the value to zero. Parameters: p = 0.1.
Random Affine: random affine transformation of the image keeping the center invariant. Parameters: degrees = (−25, 25), translate = (0.2, 0.2), scale = (0.8, 1.2), shear = (−8, 8).
Table 3.9: Augmentation details. Applied in order. Descriptions are from [136].
3.5.3.4 Analysis on Confidence Threshold
Theorem 1. ∀p, q ∈ ∆N, if φτc(max(p)) · φτs(fsim(p, q)) > 0, then max(q) > cos(cos⁻¹(√τc) + cos⁻¹(τs))².
Since φτc(max(p)) · φτs(fsim(p, q)) > 0, we have:

max(p) > τc   and   fsim(p, q) > τs.

Denote j = arg max_i p_i, i.e., the confidence of p is attained at the j-th coordinate, p_j = max(p). Denote e_j ∈ ∆N as the elementary vector whose j-th element is 1 and all other elements are 0. In the square-root probability space, the two conditions become:

√e_j^⊤ √p = max(√p) > √τc   and   √p^⊤ √q > τs.

Notice that because ‖p‖₁ = ‖q‖₁ = ‖e_j‖₁ = 1, we have ‖√p‖₂ = ‖√q‖₂ = ‖√e_j‖₂ = 1. Therefore, √p, √q, and √e_j lie on the unit n-sphere S^n. Denote the geodesic distance between any two points x, y ∈ S^n as d_{S^n}(x, y) = cos⁻¹(x^⊤y / (‖x‖₂‖y‖₂)) = cos⁻¹(x^⊤y). Since cos⁻¹ is decreasing, the two inequalities above imply:

d_{S^n}(√p, √e_j) < cos⁻¹(√τc)   and   d_{S^n}(√p, √q) < cos⁻¹(τs).

By the triangle inequality for the geodesic distance:

d_{S^n}(√q, √e_j) ≤ d_{S^n}(√q, √p) + d_{S^n}(√p, √e_j) < cos⁻¹(√τc) + cos⁻¹(τs).

Since cos is decreasing on [0, π], it follows that

√q_j = √q^⊤ √e_j > cos(cos⁻¹(√τc) + cos⁻¹(τs)),

and therefore max(q) ≥ q_j > cos(cos⁻¹(√τc) + cos⁻¹(τs))².
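As a quick numeric sanity check of this bound (our own illustration), the infimal confidence at the default thresholds τc = 0.95, τs = 0.9 evaluates to roughly 0.61:

```python
import math

def infimal_confidence(tau_c: float, tau_s: float) -> float:
    """Lower bound of Theorem 1: cos(arccos(sqrt(tau_c)) + arccos(tau_s))^2."""
    return math.cos(math.acos(math.sqrt(tau_c)) + math.acos(tau_s)) ** 2

print(infimal_confidence(0.95, 0.9))  # ~0.608: any label q selected by both
                                      # thresholds has max(q) above this value
```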
3.5.3.5 Pair Loss
In this section, we provide additional information for two of the ablation studies in table 3.6 on CIFAR-100, to demonstrate the effectiveness of Pair Loss in encouraging more unlabeled samples to have accurate and high confidence predictions. Specifically, we compare the performance of the SimPLE algorithm with and without Pair Loss on the following measurements: 1) the percentage of unlabeled samples with high confidence pseudo labels; 2) the percentage of unlabeled sample pairs that pass both the confidence and similarity thresholds; 3) the percentage of false-positive unlabeled sample pairs, i.e., pairs that pass both thresholds but belong to different categories.
Figure 3.4: Ratio of pairs pass both confidence and similarity thresholds. The green line is SimPLE and the
grey line is SimPLE without Pair Loss
As shown in figure 3.4, the ratio of pairs that pass both the confidence and similarity thresholds increases by 16.67%, with a consistently near-0% false positive rate, indicating that Pair Loss encourages the model to make more consistent and similar predictions for unlabeled samples from the same class.
Figure 3.5: Ratio of high confidence prediction. The green line is SimPLE and the grey line is SimPLE
without Pair Loss
As shown in figure 3.5, with Pair Loss the percentage of unlabeled samples with high confidence labels increases by 7.5%, and the prediction accuracy increases by 2%, as shown in table 3.6. These two results indicate that Pair Loss encourages the model to make high confidence and accurate predictions on more unlabeled samples, which matches our expectation that Pair Loss aligns samples with lower confidence pseudo labels to their similar high confidence counterparts during training and thereby improves prediction accuracy.
3.6 Conclusion
We proposed SimPLE, a semi-supervised learning algorithm. SimPLE improves on previous works [13, 12,
145] by considering a novel unsupervised objective, Pair Loss, which minimizes the statistical distance between high confidence pseudo labels with similarity above a certain threshold. We have conducted extensive
experiments over the standard datasets and demonstrated the effectiveness of the SimPLE algorithm. Our
method shows significant performance gains over previous state-of-the-art algorithms on CIFAR-100 and
Mini-ImageNet [161], and is on par with the state-of-the-art methods on CIFAR-10 and SVHN. Furthermore, SimPLE also outperforms the state-of-the-art methods in the transfer learning setting, where models
are initialized by the weights pre-trained on ImageNet [87], or DomainNet-Real [128].
Chapter 4
MixNorm: Test-Time Adaptation through Online Normalization Estimation
Advancing to a more practical and challenging setting, I next turn to the task of test-time adaptation, which requires no labeled examples and performs adaptation during inference. In this chapter*,
we introduce MixNorm, a straightforward yet potent method for Test-Time domain adaptation. MixNorm
rapidly adapts a pre-trained source model to target domain samples by recalculating the batch-norm statistics.
Previous studies in this field typically adhere to two assumptions: test samples are presented in large batches and originate from a single test distribution. However, these assumptions often do not hold in real-world scenarios. To address this, we propose two novel evaluation settings: one with variable batch sizes and another considering multiple or mixed distributions. Distinct from earlier methods that depend on large, single-distribution batches at test time to derive stable batch-norm statistics, MixNorm operates independently of such constraints and can accurately estimate batch-norm statistics even from a single sample. The method not only markedly surpasses the state of the art in our new test-time adaptation settings but also shows significant improvement in other tasks such as Source-Free Unsupervised Domain Adaptation and Zero-Shot Classification.
*This Chapter is uploaded as an arXiv preprint [66]: Hu, Xuefeng, Gokhan Uzunbas, Sirius Chen, Rui Wang, Ashish Shah,
Ram Nevatia, and Ser-Nam Lim. "Mixnorm: Test-time adaptation through online normalization estimation." arXiv preprint
arXiv:2110.11478 (2021).
4.1 Introduction
It is common to find that a well-trained deep learning model performs unexpectedly poorly in real-world
settings. Many factors play a role in this challenging problem, but one of the most probable causes is a domain shift from the training data ([130]). To study domain shift in production environments and to evaluate the generalization ability of deep learning models, many benchmarks have been proposed ([138, 129, 61]). Figure 4.1 shows samples from one of these benchmarks, ImageNet-C ([61]), in which synthesized corruptions are added to ImageNet ([32]) test images to simulate domain shift.
To increase deep learning models' robustness under domain shift, various approaches have been developed in different settings, such as Domain Adaptation ([127]), Unsupervised Domain Adaptation ([171]), Source-Free Domain Adaptation ([101]) and Fully Test-Time Adaptation ([162]). Among these, Fully Test-Time Adaptation is one of the most challenging settings. With access only to a model that is pre-trained offline, the goal of Test-Time Adaptation (TTA) is to rapidly adapt the model to the test samples while making predictions. The task is essentially Domain Adaptation plus three more challenging conditions:
• Online: The model gets updated during inference, and has access to each test sample only once.
• Source-Free: The model does not have access to training data after the offline training is done. It will
only have access to the pre-trained weights.
• Unsupervised: There is no label for the test data.
Several works have studied the Test-Time Adaptation problem in recent years ([99, 149, 162]). Existing methods focus on adapting the Batch Normalization layers ([72]) in deep learning models and have significantly improved their performance against test-time domain shift.
Figure 4.1: Examples from ImageNet-C dataset [61]. ImageNet-C contains 75 corrupted copies of the
original ImageNet-1K test set, with 5 severity levels and 15 different corruption types. This dataset is used
to simulate the distribution shift between training domain (original) and test domains (corrupted).
Such Batch-Norm-based methods make two underlying assumptions in order to estimate stable normalization statistics during evaluation: 1) test data arrive in large batches; 2) all test samples come from the same shifted distribution (a single corruption type, a single domain, etc.). However, such assumptions might not be practical in the real world, where samples may arrive in varying batch sizes, be collected in different ways, or exhibit different types of corruptions. Further, during inference, the system might not have the freedom to postpone prediction in order to collect enough data to apply a test-time adaptation algorithm to the incoming test samples.
To eliminate the dependency on large batch sizes and a single distribution, we propose to replace the Batch-Norm layers with a novel MixNorm operation that adaptively estimates the normalization statistics from a single input. The MixNorm layer considers not only global statistics calculated from all historical samples, but also local statistics calculated from the incoming test sample and its spatial augmentations. Figure 4.2 illustrates our overall approach. The novel MixNorm method demonstrates significant improvement over the state-of-the-art method TENT ([162]), especially in the two newly proposed evaluation protocols, where all Test-Time Adaptation methods are evaluated at various batch sizes and against mixtures of test samples from different shifted domains. In addition to the Test-Time Adaptation task, the proposed MixNorm layer demonstrates improvement on two other tasks, Source-Free Unsupervised Domain Adaptation and Zero-Shot Image Classification.
Our contributions are:
• We propose two new evaluation conditions for the task of Test-Time Adaptation, which avoid the
impractical assumptions made by previous protocols.
• We propose a novel and simple way to adaptively estimate the test-time normalization statistics from a single sample.
• We performed extensive experiments that demonstrate the effectiveness of the new MixNorm Layer
in various tasks, such as Test-Time Adaptation, Source-Free Unsupervised Domain Adaptation and
Zero-Shot Image Classification.
4.2 Related Work
Distribution shift between training domain and evaluation domain often causes a drop in the performance of
deep learning models. In recent years, many proposals have been made to increase deep learning models' robustness against distribution shifts, a field of research known as Domain Adaptation. Domain Adaptation methods can be classified into different settings depending on factors such as access to source (training) samples, the existence of new or missing categories, and the availability of supporting samples in the evaluation domain. In
this paper, we focus on settings where there is no access to the source (training) samples. In particular,
we compare our new method with existing methods on Test-Time Adaptation, Source-Free Unsupervised
Domain Adaptation and Zero-Shot Classification settings.
Figure 4.2: Method overview for Test-Time Adaptation with MixNorm layers. Before inference, we replace all the Batch-Norm layers in the pre-trained model with the newly proposed MixNorm layers. The MixNorm layers are initialized with the training set (source) statistics (means and variances) and the corresponding weights and biases from the pre-trained Batch-Norm layers. During inference, each input is paired with an augmented view for the calculations inside the MixNorm layers, and the final prediction is made only on the original input.
4.2.1 Test-Time Adaptation
The task of Test-Time Adaptation (TTA) is one of the most challenging settings in the field of Domain
Adaptation, as it requires rapid adaptation while making predictions, without access to anything but the
pre-trained model and testing samples. BN ([99]) was one of the earliest works to explore the TTA task. It focuses on adapting the Batch-Norm layers in deep learning models. A traditional Batch-Norm module collects running means and variances from the training data at training time, and uses these collected statistics at test time. BN argued that when there is a distribution shift between training and testing data, the model should not keep using the same training statistics; instead, it should collect statistics directly from the online batch. BN demonstrates that this simple approach of using online statistics is effective in handling the distribution shifts caused by corruptions. However, one clear drawback of the method is that it requires relatively large batches to obtain stable online statistics. Following the work of BN ([99]), TTT ([149]) proposed to update not only the batch-norm statistics but also the affine weights and biases, which are learnable parameters in Batch-Norm layers that further re-scale the normalized output. TTT performs a
single step gradient update over the batch-norm parameters after each prediction, with an unsupervised loss
defined by rotation prediction task. TENT ([162]) further proposed to use entropy loss instead of rotation
prediction loss and achieved state-of-the-art performance on CIFAR-10C, CIFAR-100C and ImageNet-C
datasets.
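For reference, the entropy objective used by TENT-style adaptation can be sketched as follows; the reduction and the choice of which parameters to optimize are our own illustration.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax predictions (TENT-style objective)."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

# Typically only the normalization layers' affine parameters are adapted, e.g.:
# bn_params = [p for m in model.modules()
#              if isinstance(m, torch.nn.BatchNorm2d) for p in m.parameters()]
# optimizer = torch.optim.Adam(bn_params, lr=1e-3)
```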
Most of the existing works in TTA follow the same evaluation protocol proposed by [99], where the
models are adapted using large batches of test samples coming from a single type of corruption. However,
such an assumption is not realistic in practice, because the test samples in the real world could come in any
batch size (in particular, real time high performance inference platforms often process one input at a time,
[1]), or from any shifted distribution.
4.2.2 Source-Free Unsupervised Domain Adaptation
Similar to the task of TTA, in Source-Free Unsupervised Domain Adaptation (UDA), the models only have
access to pre-trained models and test samples. Different than the TTA setting, Source-Free UDA models
have access to the evaluation samples as many times as needed during adaptation.
SHOT [101] is one of the first works to propose UDA without needing any source data. In SHOT, the parameters are fine-tuned with a self-supervised loss that minimizes the prediction entropy on the target samples, while keeping the classification head intact. SHOT demonstrates that it is possible to obtain reasonable performance on standard benchmarks such as Office ([138]), Office-Home ([159]) and VisDA ([129]). Another method, TENT ([162]), has also shown potential for the Source-Free UDA task, with promising performance on digit datasets (e.g., adapting an SVHN model to MNIST/MNIST-M/USPS). TENT cannot reach SHOT's performance when scaled to much larger datasets such as VisDA ([129]), but
TENT adapts to target domain inputs with much cheaper operations and sees each test sample only once
during testing.
Algorithm 2 MixNorm Algorithm
1: Input: Input examples X1, ..., XT; visual feature encoder V; augmentation function Aug that produces N views of stochastic augmentations; training set mean µ^0 and variance σ^0; exponential moving speed τ = 0.001; mixing scale m = 0.05; weights and bias α, β.
2: for t = 1 to T do
3:   Ft ← V(Xt)   ▷ Generate regular feature Ft ∈ R^{D×H×W}
4:   F′t ← V(Aug(Xt))   ▷ Generate augmented features F′t ∈ R^{N×D×H×W}
5:   µ^t ← (1 − τ)µ^{t−1} + τ · Ft.mean(1, 2)
6:   σ^t ← (1 − τ)σ^{t−1} + τ · (Ft − µ).var(1, 2)   ▷ Update µ and σ with the most recent sample
7:   µ_global ← µ^t, σ_global ← σ^t   ▷ Get global statistics
8:   µ_local ← F′t.mean(0, 2, 3)
9:   σ_local ← (F′t − µ_local).var(0, 2, 3)   ▷ Get local statistics from the augmentations
10:  µ_mixed ← (1 − m)µ_global + m·µ_local
11:  σ_mixed ← (1 − m)σ_global + m·σ_local   ▷ Mix global and local statistics
12:  Ft ← α(Ft − µ_mixed)/σ_mixed + β   ▷ Calculate the shifted Ft
13:  return Ft as the prediction for Xt
14: end for
Algorithm 3 MixNormBN Algorithm
1: Input: Input examples X1, ..., XT; batch size B; visual feature encoder V; augmentation function Aug that produces N views of stochastic augmentations; training set mean µ^0 and variance σ^0; max moving speed τ_max = 0.9; mixing scale m = 0.05; weights and bias α, β.
2: for t = 1 to ⌊T/B⌋ do
3:   Ft ← V(X_{(t−1)B+1}, ..., X_{tB})   ▷ Generate regular batch features Ft ∈ R^{B×D×H×W}
4:   F′t ← V(Aug(X_{(t−1)B+1}, ..., X_{tB}))   ▷ Generate augmented features F′t ∈ R^{B×N×D×H×W}
5:   τ ← τ_max · 10^{−3/B}   ▷ Adjust τ based on the batch size B
6:   µ^t ← (1 − τ)µ^{t−1} + τ · Ft.mean(0, 1, 2)
7:   σ^t ← (1 − τ)σ^{t−1} + τ · (Ft − µ).var(0, 1, 2)   ▷ Update µ and σ with the most recent batch
8:   µ_global ← µ^t, σ_global ← σ^t   ▷ Get global statistics
9:   µ_local ← F′t.mean(0, 1, 3, 4)
10:  σ_local ← (F′t − µ_local).var(0, 1, 3, 4)   ▷ Get local statistics from the augmentations
11:  µ_mixed ← (1 − m)µ_global + m·µ_local
12:  σ_mixed ← (1 − m)σ_global + m·σ_local   ▷ Mix global and local statistics
13:  Ft ← α(Ft − µ_mixed)/σ_mixed + β   ▷ Calculate the shifted Ft
14:  return Ft as the predictions for X_{(t−1)B+1}, ..., X_{tB}
15: end for
4.2.3 Zero-Shot Classification
Unlike Test-Time Adaptation or Source-Free UDA, where methods are developed to close the domain gap during the evaluation stage, the goal of Zero-Shot Classification ([178]) is to increase the generalization ability during the training stage so that the model can be applied to any test domain without adaptation. Without access to any support samples from the new categories, Zero-Shot Classification methods typically take advantage of auxiliary semantic cues such as label embeddings ([43]) or attributes ([91]) to establish knowledge about the new categories.
Along with the success of large-scale pre-training ([17]), self-supervised learning and contrastive learning ([20]), CLIP ([131]) has brought a huge improvement over almost all of the common benchmark datasets. CLIP performs contrastive learning over 400 million web-retrieved pairs of images and captions by pulling the visual and text representations of the same pair together and pushing those of different pairs apart. At inference time, CLIP makes classification predictions by matching the visual embedding of a query image with the text embeddings of the category names (wrapped in a template such as "a photo of {}", or a list of templates whose embeddings are averaged), and selects the category with the highest cosine similarity as the prediction, as shown in Figure 4.3. CLIP is capable of performing classification on novel tasks without any training examples, as long as the category names are provided. CLIP has demonstrated outstanding zero-shot classification accuracy, e.g., 76.3% top-1 accuracy on ImageNet without seeing any examples from the dataset [131].
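A minimal zero-shot classification sketch in the CLIP style is shown below, assuming the open-source CLIP package released with [131]; the class names and the image path "query.jpg" are hypothetical placeholders.

```python
import torch
import clip               # assumes the open-source CLIP package from [131]
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["bird", "cat", "deer", "dog"]                 # illustrative classes
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # hypothetical file

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(prompts)
    # Cosine similarity = dot product of the L2-normalized embeddings.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = image_emb @ text_emb.t()

prediction = class_names[scores.argmax(dim=-1).item()]
```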
Figure 4.3: CLIP performs classification on target classes by comparing visual embeddings with the text embeddings generated from class names.
4.2.4 Normalization Methods
Batch Normalization is one of the most common normalization methods in neural networks, alongside many other variants such as instance normalization, group normalization and, more recently, layer normalization in transformer architectures. Batch Normalization has been shown to be a key component of nearly all convolutional neural networks since its introduction; however, normalization methods still have unsolved issues, such as the discrepancy between training time and test time and the need for a large batch size to estimate statistics that approximate the true distribution. A few recent works aim to make normalization work with small batch sizes, which can also be utilized at inference time to close the gap between training and test time. While these methods provide useful guidance and evidence on the training/testing settings in which normalization methods may be effective, they are not specifically designed for test-time adaptation, source-free unsupervised domain adaptation or zero-shot classification.
4.3 Test-Time Adaptation with Mix-Norm Layer
In this section, we define our new MixNorm layer and provide details on how it can be utilized for closing the domain gap at test time in deep neural networks (DNNs). Given a pre-trained DNN, we replace its existing Batch-Norm layers ([72]). At test time, for arbitrarily sized batches, our new MixNorm layer calculates empirical normalization statistics from two sources:
• Global Statistics: The Batch-Norm layers of the pre-trained model store the statistics of the training distribution. Our MixNorm layer uses these training statistics as a global anchor point. The global statistics, initialized with the training statistics, are updated by an exponential moving average over the online test samples.
• Local Statistics: We create additional augmented views of the test sample to form a small augmentation batch. We calculate the empirical statistics of this augmentation batch to capture the local distribution shift of the test sample. The augmentations are spatial, including random resized crops and random flipping. In practice, we use only one additional augmentation, as it gives the best performance.
In a typical deep learning model such as ResNet ([58]) or WideResNet ([188]), there are multiple batch-norm layers across the whole backbone network. Figure 4.4 below illustrates our proposed operation in only one such layer. As shown earlier in Figure 4.2, after the encoding layers and right before our MixNorm layer, an input image X ∈ R^{3×224×224} becomes a feature tensor F ∈ R^{D×H×W}. For simplicity, we denote the entry at the i-th row and j-th column of F by F_{i,j} ∈ R^D.
As depicted in Figure 4.4, our global statistics µ^t, σ^t are initialized with the train-time statistics copied from the batch-norm layers of the pre-trained network:

µ^0 = µ^{training},   σ^0 = σ^{training},

and they are slowly updated by the statistics of each new sample. At the (t + 1)-th test example:

µ^{t+1} = (1 − τ)µ^t + τµ,   σ^{t+1} = (1 − τ)σ^t + τσ,

where τ is the exponential moving speed, and µ and σ are the means and variances computed from F:

µ = (Σ_{i=0}^{H} Σ_{j=0}^{W} F_{ij}) / (HW),   σ = (Σ_{i=0}^{H} Σ_{j=0}^{W} (F_{ij} − µ)²) / (HW).

On the other hand, we calculate the local statistics µ_local, σ_local from the augmented feature tensor F′:

µ_local = (Σ_{i=0}^{H} Σ_{j=0}^{W} F′_{ij}) / (HW),   σ_local = (Σ_{i=0}^{H} Σ_{j=0}^{W} (F′_{ij} − µ_local)²) / (HW).

Renaming µ^{t+1}, σ^{t+1} to µ_global, σ_global, we obtain our final statistics:

µ_mixed = (1 − m)µ_global + m·µ_local,   σ_mixed = (1 − m)σ_global + m·σ_local.

Finally, with µ_mixed, σ_mixed, we perform the normalization operation and forward the resulting F to the next layers:

F = α (F − µ_mixed) / (ε + √σ_mixed) + β,

where α, β, ε are the parameters copied from the batch-norm layers of the pre-trained network.
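A sketch of this computation as a PyTorch module is shown below. It assumes a batching convention of our own choosing: row 0 of the incoming feature tensor belongs to the original test sample and the remaining rows to its augmented view(s), mirroring Figure 4.2; it also assumes the original Batch-Norm layer has affine parameters and running statistics.

```python
import torch
import torch.nn as nn

class MixNorm2d(nn.Module):
    """Sketch of the MixNorm layer (Algorithm 2) for a single test sample plus
    its augmented view(s); not the reference implementation."""

    def __init__(self, bn: nn.BatchNorm2d, tau: float = 0.001, m: float = 0.05):
        super().__init__()
        self.tau, self.m, self.eps = tau, m, bn.eps
        # Copy affine weights/bias and training-set statistics from the BN layer.
        self.alpha = nn.Parameter(bn.weight.detach().clone())
        self.beta = nn.Parameter(bn.bias.detach().clone())
        self.register_buffer("mu_global", bn.running_mean.detach().clone())
        self.register_buffer("var_global", bn.running_var.detach().clone())

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (1 + N, D, H, W); row 0 is the original sample, rows 1.. are views.
        with torch.no_grad():
            # Update the global statistics with the original sample only.
            mu = f[:1].mean(dim=(0, 2, 3))
            var = f[:1].var(dim=(0, 2, 3), unbiased=False)
            self.mu_global.mul_(1 - self.tau).add_(self.tau * mu)
            self.var_global.mul_(1 - self.tau).add_(self.tau * var)
        # Local statistics from the augmented view(s).
        mu_local = f[1:].mean(dim=(0, 2, 3))
        var_local = f[1:].var(dim=(0, 2, 3), unbiased=False)
        mu_mix = ((1 - self.m) * self.mu_global + self.m * mu_local)[None, :, None, None]
        var_mix = ((1 - self.m) * self.var_global + self.m * var_local)[None, :, None, None]
        f_hat = (f - mu_mix) / (self.eps + torch.sqrt(var_mix))
        return self.alpha[None, :, None, None] * f_hat + self.beta[None, :, None, None]
```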
Figure 4.4: Illustration of the MixNorm layer. Inside the MixNorm layer we estimate the normalization statistics by looking at both global statistics from historical data and local statistics from augmented views.
While the above process describes the procedure in Algorithm 2, we also present a variant of MixNorm in Algorithm 3. Different from the vanilla MixNorm, which only considers a single sample, Algorithm 3 updates the global statistics with batch statistics and uses an adaptive moving speed τ, which approximates Batch-Norm behavior for large batch sizes and MixNorm behavior for small batch sizes.
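To mirror the copy-and-initialize step of Figure 4.2, the Batch-Norm layers of a pre-trained network can be swapped for MixNorm layers with a small module-surgery helper; the sketch below reuses the MixNorm2d class from the previous sketch and is our own illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

def convert_to_mixnorm(module: nn.Module, tau: float = 0.001, m: float = 0.05) -> nn.Module:
    """Recursively replace every BatchNorm2d with a MixNorm2d initialized from it."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, MixNorm2d(child, tau=tau, m=m))  # MixNorm2d: sketched above
        else:
            convert_to_mixnorm(child, tau=tau, m=m)
    return module

# Usage sketch: forward the original input together with one augmented view and
# keep only the prediction for the original input (row 0).
# model = convert_to_mixnorm(pretrained_model).eval()
# batch = torch.cat([x, augment(x)], dim=0)   # x, augment(x): (1, 3, H, W) each
# logits = model(batch)[:1]
```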
4.4 Experiments and Results
Datasets. For the Test-Time Adaptation task, we follow TENT ([162]) and evaluate our method over the
standard benchmarks CIFAR-10C and ImageNet-C ([61]). CIFAR-10C includes 75 subsets composed of 15 types of corruptions at 5 severity levels. It comprises the 10,000 images, each of size 32x32 pixels, from the 10 classes of the CIFAR-10 test set, with each image modified by the corresponding corruption type at a certain severity level. Similarly, ImageNet-C is composed of a set of 75 common visual corruptions,
applied to the original ImageNet validation set, with 50,000 images of 224x224 pixels from 1000 classes.
For Source-Free Unsupervised Domain Adaptation task, we follow the experiment setup from [162],
and use SVHN ([118]) as the source domain and MNIST ([93])/MNIST-M ([44])/USPS ([118]) as the target
domains. SVHN, MNIST, MNIST-M and USPS are all 10-class digits datasets, with training/testing data
sizes of 73257/26032, 60000/10000, 60000/10000 and 7291/2007 respectively.
For Zero-Shot Image Classification, we follow the setup from CLIP ([131]) and evaluate our method on
CIFAR-10/100 ([85]), STL-10 ([24]), Stanford Cars ([83]) and Food101 ([16]). Details of these datasets are
provided in the Appendix C.
Implementation Details. For Test-Time Adaptation experiments, we follow the optimization setting
from TENT ([162]), which uses Adam optimizer on CIFAR-10C with learning rate 0.001 and SGD optimizer
on ImageNet-C with learning rate 0.00025, both optimized by the entropy minimization loss on output logits
as described in [162]. For CIFAR-10C we compare with TENT on both Wide-ResNet-28 and Wide-ResNet40 ([188]) and for ImageNet-C we compare with TENT on ResNet-50 ([58]) following the official public
model provided by the TENT repository†
.
†
https://github.com/DequanWang/tent
For the Source-Free Unsupervised Domain Adaptation experiments, we replicated the TENT ([162]) and BN ([99]) methods using the pre-trained SVHN model repository‡, since the original ResNet-26 model reported in TENT ([162]) was not public. For hyper-parameter selection, on CIFAR-10C/ImageNet-C we use the "Gaussian Noise at Severity 5" set to choose the hyper-parameters, and then use the same hyper-parameters in all experiments. For the Source-Free UDA and Zero-Shot Classification experiments, as there is no trivial way to split a validation set, we report the optimal hyper-parameters for each dataset. The details of the selected hyper-parameters are provided in Appendix B.
4.4.1 Test-Time Adaptation
To understand the merit of our new method in real-world scenarios, we introduce two new conditions during evaluation. In TENT, error rates for each of the 15 corruption types at severity 5 are reported with a fixed batch size of 200 on CIFAR-10C and 64 on ImageNet-C. In addition to this standard protocol, we compare our method with TENT under two new conditions:
1) Different Batch Sizes. In addition to the fixed 200/64 batch size in the previous protocol, we also compare our method with TENT at batch sizes of 1, 5, 8, 16, 32, 64, 100, 200;
2) Mixed Distribution. In addition to the standard protocol, where the model's final error rate is averaged over 15 separate evaluations, each of which contains only one type of corruption, we further evaluate the models with all 15 corruption types shuffled and mixed together.
Figure 4.5 shows our main results, with complete results presented in Tables 4.5, 4.6, 4.1 and 4.2. Tables 4.3 and 4.4 present the error rates shown in Figure 4.6.
Blue curves represent the state-of-the-art model TENT, and the red and orange curves represent the MixNorm model with and without learnable affine weights α and β. Green curves represent the performance of MixNormBN
‡
https://github.com/aaron-xichen/pytorch-playground
Figure 4.5: Comparison of the MixNorm method, with and without learnable affine weights (Algorithm 2), and the MixNormBN method (Algorithm 3) to the baseline method TENT in two different test-time settings: i) test samples come only from a single type of corruption; ii) test samples come from mixed types of corruptions. For the Single Type experiments, we report average error rates over all 15 corruption types at severity 5. Numerical results are reported in Appendix A.
as described in Algorithm 3. TENT's performance drops as the batch size decreases, while MixNorm's performance stays stable for any batch size. In addition to the improvement at small batch sizes, Figure 4.5 also shows the advantage of MixNorm when test samples come from different distributions. In TENT's testing protocol, all test samples are collected from the same corruption distribution, which makes it easier to obtain reliable normalization statistics, especially when the batch size is large. However, when tested with batches of mixed inputs sampled from different corruption types, this assumption does not hold, and TENT's performance drops significantly compared to MixNorm. This result holds even when testing with larger batch sizes (up to 64 on ImageNet-C and up to 200 on CIFAR-10C). We also observe that when the batch size is large enough and all samples come from a similar distribution, TENT might have an advantage over MixNorm, because it can estimate stable normalization statistics from the empirical samples. To take advantage of large batch sizes, we also test our MixNormBN method (green curves), which again reports consistently better scores than TENT at all batch sizes.
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 90.00% 21.76% 21.94% 22.33%
5 29.21% 21.57% 21.94% 22.47%
8 25.81% 21.53% 21.94% 22.16%
16 22.07% 21.48% 21.94% 20.94%
32 20.65% 21.53% 21.94% 20.19%
64 19.69% 21.43% 21.94% 19.43%
100 19.11% 21.26% 21.94% 18.94%
200 18.58% 21.20% 21.94% 18.52%
Table 4.1: Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All scores are average error rates from all
15 corruption datasets at severity level 5. All methods adopt Wide-ResNet-28 as backbone architectures.
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 89.69% 31.94% 32.01% 32.01%
5 39.39% 31.96% 32.01% 33.24%
8 37.15% 31.98% 32.01% 32.67%
16 35.22% 32.06% 32.01% 32.18%
32 33.18% 32.17% 32.01% 31.73%
64 32.74% 32.09% 32.01% 30.87%
100 32.52% 32.51% 32.01% 30.24%
200 33.12% 31.96% 32.01% 30.33%
Table 4.2: Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All methods are tested with 10,000 randomly ordered samples composed of corrupted images from 15 different corruption types at severity level
5. All methods adopt Wide-ResNet-28 as backbone architectures.
4.4.2 Ablation Studies
In this section, we present the results of several ablations on architecture selection, ways of collecting
normalization statistics and number of augmentations.
Architectures: Figure 4.6 provides the results on CIFAR-10C where both TENT and MixNorm adopt
a more powerful backbone, WideResNet40, as opposed to the WideResNet28 used in Figure 4.5. Along
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 90.00% 15.09% 15.59% 15.62%
5 23.18% 14.60% 15.59% 15.61%
8 19.32% 14.45% 15.59% 15.46%
16 15.53% 14.23% 15.59% 14.26%
32 14.13% 14.19% 15.59% 13.53%
64 13.12% 14.41% 15.59% 12.78%
100 12.59% 13.96% 15.59% 12.32%
200 12.08% 13.85% 15.59% 11.95%
Table 4.3: Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All scores are average error rates from all
15 corruption datasets at severity level 5. All methods adopt Wide-ResNet-40 as backbone architectures.
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 89.81% 18.80% 15.99% 18.52%
5 27.92% 18.80% 15.99% 20.21%
8 24.57% 18.80% 15.99% 19.97%
16 21.07% 18.80% 15.99% 18.71%
32 18.76% 18.80% 15.99% 17.45%
64 17.12% 18.80% 15.99% 16.44%
100 17.12% 18.80% 15.99% 16.53%
200 16.43% 18.80% 15.99% 17.67%
Table 4.4: Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on CIFAR-10C datasets. All methods are tested with 10,000 randomly ordered samples composed of corrupted images from 15 different corruption types at severity level
5. All methods adopt Wide-ResNet-40 as backbone architectures.
Along with the ResNet-50 used in the ImageNet-C experiments, these results demonstrate the effectiveness of MixNorm on architectures with different depths and widths.
Normalization Statistics Estimation: In Table 4.7, we present an ablation study on different ways to
collect normalization parameters. They include: 1) Instance Norm: each feature is normalized by itself
along the spatial dimensions; 2) Augmentation (Local) Norm: features are normalized within the local batch
of original samples and corresponding augmented views, i.e., only local statistics are used; 3) Fixed Global
Norm: each feature is normalized by the fixed statistics stored in the pre-trained weights learned during
training; 4) Moving Global Norm: each feature is normalized by the exponential moving statistics, i.e., only
global statistics are used.
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 99.87% 74.44% 74.44% 74.03%
5 81.08% 74.43% 74.44% 74.89%
8 75.74% 74.44% 74.44% 72.93%
16 69.85% 74.41% 74.44% 69.02%
32 65.51% 74.41% 74.44% 65.34%
64 62.74% 74.43% 74.44% 62.69%
Table 4.5: Single Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on ImageNet-C datasets. All scores are average error rates from all
15 corruption datasets at severity level 5. All methods adopt ResNet-50 as backbone architectures.
Batch Size TENT MixNorm w/ Fixed Parameters MixNorm MixNormBN
1 99.86% 78.88% 78.89% 88.73%
5 90.59% 78.87% 78.89% 85.06%
8 88.72% 78.87% 78.89% 84.75%
16 87.27% 78.85% 78.89% 83.99%
32 86.39% 78.87% 78.89% 83.55%
64 86.09% 78.86% 78.89% 82.78%
Table 4.6: Mixed Distribution Error Rates of TENT, MixNorm (Algorithm 2) with and without fixed affine
parameters, MixNormBN (Algorithm 3) on ImageNet-C datasets. All methods are tested with 50,000 randomly ordered samples composed of corrupted images from 15 different corruption types at severity level
5. All methods adopt ResNet-50 as backbone architectures.
As shown in rows 1 and 2 of the table, Augmentation Norm improves on Instance Norm by including augmented samples in addition to the original samples, but both perform poorly due to the absence of global statistics. Rows 3 and 4 show that Moving Global Norm works better than Fixed Global Norm, which is expected due to the presence of domain shift. Rows 5, 6, and 7 indicate that combining the global and local statistics boosts performance by a significant margin. Row 8 indicates that the learnable affine weights, α and β, do not seem to provide the additional boost to performance observed in TENT, possibly because MixNorm can already make accurate predictions without adjustment by the affine weights.
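To make the mixed normalization concrete, the following is a minimal PyTorch-style sketch written for illustration; the layer name MixNorm2d, the argument names m (local scale) and tau (moving speed), and the exact convex-combination rule are assumptions based on the description above and on Table 4.7, not the released implementation.

```python
import torch
import torch.nn as nn

class MixNorm2d(nn.Module):
    """Sketch of a MixNorm-style layer: mixes exponential-moving global
    statistics with local statistics computed from the current sample
    (and, optionally, its augmented view)."""
    def __init__(self, num_features, m=0.2, tau=1e-3, eps=1e-5):
        super().__init__()
        self.m, self.tau, self.eps = m, tau, eps   # local scale, moving speed
        # global statistics; in practice initialized from the pre-trained BatchNorm buffers
        self.register_buffer("global_mean", torch.zeros(num_features))
        self.register_buffer("global_var", torch.ones(num_features))

    def forward(self, x, x_aug=None):
        # local statistics from the sample and (optionally) its augmented view
        local = x if x_aug is None else torch.cat([x, x_aug], dim=0)
        local_mean = local.mean(dim=(0, 2, 3))
        local_var = local.var(dim=(0, 2, 3), unbiased=False)
        # update global statistics with an exponential moving average (speed tau)
        with torch.no_grad():
            self.global_mean.lerp_(local_mean, self.tau)
            self.global_var.lerp_(local_var, self.tau)
        # convex combination of global and local statistics (local scale m)
        mean = (1 - self.m) * self.global_mean + self.m * local_mean
        var = (1 - self.m) * self.global_var + self.m * local_var
        return (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
```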
Augmentations: Table 4.8 shows that one augmentation is enough to obtain good performance, and the number of augmentations used in the local statistics calculation does not have a significant influence on performance. Therefore, we use one augmentation for faster computation.
(Figure 4.6 plot area: CIFAR-10C Error Rates (Single Type) and CIFAR-10C Error Rates (Mixed Types); mean error rate versus batch size for TENT, MixNorm w/ fixed weights, MixNorm, and MixNormBN.)
Figure 4.6: Comparison of MixNorm and MixNormBN methods to the baseline method TENT, with a
different backbone architecture (WideResNet40). Numerical results are reported in the Appendix A.
No. Experiment Name Moving Speed Global Scale Local Scale Error Rate
1 Instance Norm 0 0 1 78.21%
2 Augmentation (Local) Norm 0 0 1 68.73%
3 Fixed Global Norm 0 1 0 50.50%
4 Moving Global Norm 0.001 1 0 38.45%
5 Mixed Norm 0.001 0.95 0.95 34.45%
6 Mixed Norm 0.001 0.9 0.1 32.66%
7 Mixed Norm 0.001 0.8 0.2 31.96%
8 Mixed Norm w/ fixed params 0.001 0.8 0.2 32.01%
Table 4.7: Ablation studies on different ways to collect normalization statistics, tested with CIFAR-10C,
with a mixed set of corruptions.
Number of Augmentations 1 3 5 7 9 15
ImageNet-C Error Rate 58.16% 58.33% 58.35% 58.44% 58.49% 58.56%
Table 4.8: Ablation studies on the effect of number of augmentations to MixNormBN method when collecting normalization statistics, tested with ImageNet-C and ResNet-50, with a mixed set of corruptions.
4.4.3 Test-Time Adaptation in Zero-Shot Learning Setting
For Zero-Shot Classification, samples are evaluated separately at batch size of 1. Therefore, we use the
regular MixNorm with fixed affine weights for comparison. We replace the Batch-Norm layers from the pretrained CLIP models, with backbone architectures of ResNet-50, and then use the updated ResNet-50 model
to calculate the visual features in Zero-Shot Classification. Details of datasets are provided in table 4.9. The
76
results in Table 4.10 shows MixNorm improves the CLIP’s zero-shot learning performance on CIFAR-100,
Stanford-Cars and CIFAR-10 and is on par with CLIP on Food-101 and STL10. This experiment aims to
show MixNorm can bring reasonable improvements even over extremely large-scale pre-training models.
Interestingly, the results show that our new method can bring larger improvements when the CLIP has a
lower performance, which implicitly could be indication of the presence of larger domain shifts.
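For reference, the zero-shot evaluation pipeline can be sketched as follows with the open-source CLIP package; the helper replace_batchnorm_with_mixnorm is hypothetical and stands in for the normalization-layer swap described above, and the prompt template and class names are illustrative only.

```python
import torch
import clip  # OpenAI CLIP package; RN50 backbone as in Table 4.10

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Hypothetical helper that swaps every BatchNorm2d in the visual encoder
# for a MixNorm layer initialized from the BatchNorm running statistics.
# model.visual = replace_batchnorm_with_mixnorm(model.visual)

class_names = ["airplane", "automobile", "bird"]   # e.g., CIFAR-10 class names
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def zero_shot_predict(image):                      # image: a PIL.Image
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = img_feat @ text_feat.T            # cosine similarities
    return class_names[logits.argmax(dim=-1).item()]
```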
Datasets Classes Train Size Test Size
CIFAR-10 10 50000 10000
CIFAR-100 100 50000 10000
Stanford-Cars 196 8144 8041
Food-101 101 75750 25250
STL10 10 1000 8000
Table 4.9: Dataset Size and Class Number
Model CIFAR-100 Stanford-Cars CIFAR-10 Food-101 STL10
CLIP (RN50) 40.97% 53.80% 72.26% 78.99% 94.84%
MixNorm 45.13% 54.01% 75.36% 78.78% 94.67%
Table 4.10: Zero-Shot Image Classification Accuracy. We replaced the Batch-Norm layers in the pre-trained CLIP ResNet-50 weights with MixNorm and tested under the zero-shot setting on standard benchmarks. Results show that MixNorm improves CLIP's performance.
4.4.4 Source-Free Unsupervised Domain Adaptation
Model SVHN → MNIST SVHN→USPS SVHN→MNIST-M
Source 23.84% 37.02% 45.45%
BN ([99]) 22.70% 36.87% 44.04%
TENT ([162]) 8.17% 34.63% 42.28%
MixNormBN 7.95% 25.71% 42.19%
Table 4.11: Error rate for Source-Free Unsupervised Domain Adaptation experiment on Digits Sets from
SVHN to MNIST/MNIST-M/USPS.
For the Source-Free UDA experiments, since we compare to TENT ([162]) and BN ([99]), both of which use large batch sizes (200), we use MixNormBN for a fair comparison. As shown in Table 4.11, MixNormBN outperforms both TENT and BN by a significant margin on all three datasets, indicating the effectiveness of MixNormBN for Source-Free UDA and for types of domain shift beyond image corruptions.
4.4.5 Hyper Parameter Selection
In this section we report the hyper-parameters used for each experiment. As mentioned in Section 4, we use the "Gaussian Noise at Severity 5" set to choose the hyper-parameters for the CIFAR-10C and ImageNet-C experiments. For the other datasets, because no proper validation set is available, we pick the hyper-parameters that give the best performance. All learning rates are the same as those used by the other baselines. Finding optimal hyper-parameters remains a challenging problem for Test-Time Adaptation methods.
Scale m Moving Speed τ Learning Rate
CIFAR10-C Single 0.05 1.00E-03 1.00E-03
CIFAR10-C Mixed 0.2 1.00E-03 1.00E-03
ImageNet-C Single 0.01 1.00E-03 2.50E-04
ImageNet-C Mixed 0.05 1.00E-06 2.50E-04
CIFAR-10 0.01 1.00E-06 -
CIFAR-100 0.01 1.00E-06 -
STL10 0.01 1.00E-06 -
Stanford-Cars 0.01 1.00E-07 -
Food-101 0.005 1.00E-07 -
Table 4.12: Hyper-Parameters of MixNorm Experiments. In Figures 4.5 and 4.6 we report scores for MixNorm both with and without fixed affine parameters; MixNorm uses the same hyper-parameters in both cases.
Scale m Max Moving Speed τmax Learning Rate
CIFAR10-C Single 0.01 0.75 1.00E-03
CIFAR10-C Mixed 0.2 0.75 1.00E-03
ImageNet-C Single 0.01 1.1 2.50E-04
ImageNet-C Mixed 0.05 1.1 2.50E-04
MNIST 0.1 0.75 0.75
MNIST-M 0.01 0.75 0.05
USPS 0.01 0.5 0.75
Table 4.13: Hyper-Parameters of MixNormBN Experiments.
4.5 Conclusion
We have developed a novel way to estimate batch-norm statistics at test time in order to rapidly adapt a source model to target test samples. Unlike previous methods that require a large batch from a single distribution to compute stable batch-norm statistics, our proposed method eliminates any dependency on online batches and can estimate accurate batch-norm statistics from single samples. The proposed method significantly outperforms the state-of-the-art models in the newly proposed real-world settings and also demonstrates the potential to improve zero-shot learning performance.
Chapter 5
EZAD: Efficient Feature Distillation for Zero-shot Annotation Object
Detection
As briefly touched upon in Section 4.2.3, the emergence of large-scale pre-trained vision-language models, such as CLIP [131], has significantly transformed the domain adaptation landscape. Unlike traditional models, which were limited to adapting to predefined categories, CLIP's visual-language representation supports open-vocabulary, zero-shot applications across a multitude of tasks. The second part of this dissertation, spanning Chapters 5 to 7, is dedicated to advancing vision-language models for a variety of downstream tasks. This includes classification-to-detection adaptation, source-free unsupervised adaptation, and test-time adaptation.
In this chapter*, we discuss EZAD (Efficient Feature Distillation for Zero-shot Annotation Object Detection), a novel algorithm that repurposes pre-trained visual-language classification models like CLIP for object detection. This involves leveraging unlabeled background objects, redefining the CLIP representation for knowledge distillation, and implementing semantic-aware box regression. EZAD, which combines CLIP with the Mask R-CNN framework, significantly enhances the efficacy of zero-shot object detection while ensuring computational efficiency. In our evaluations, EZAD outperforms previous distillation-based methods by 4% on the COCO dataset and shows a 3% improvement on the LVIS dataset.
*This chapter is published as a WACV 2024 paper [66]: Liu, Zhuoming*, Xuefeng Hu*, and Ram Nevatia. "Efficient Feature
Distillation for Zero-Shot Annotation Object Detection." In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pp. 893-902. 2024.
(Figure 5.1 panels: ZSD: ❌ Novel Instance, ✔️ Additional Data, ✔️ Novel's Name; ZAD: ✔️ Novel Instance, ❌ Additional Data, ✔️ Novel's Name; OVD: ✔️ Novel Instance, ✔️ Additional Data, ❌ Novel's Name.)
Figure 5.1: The difference between zero-shot annotation object detection (ZAD), zero-shot object detection
(ZSD), and open-vocabulary detection (OVD). Our ZAD can have novel instances in training images and
novel category names, but no additional annotations are allowed.
5.1 Introduction
Object detection is a fundamental task in computer vision. Typically, an object detector is trained on a dataset with specific categories and cannot be extended to new categories. To detect instances of categories not seen during training, researchers started studying Zero-Shot object Detection (ZSD) and Open-Vocabulary object Detection (OVD). In both ZSD and OVD settings, a set of "base" categories is provided with annotated training examples (bounding boxes), and the rest, called "novel" categories, are not provided with any annotated examples. In ZSD, the names of the novel categories are known at training time, but it is further posited[5] that the training data should not have any instances of the novel categories; this is very restrictive and requires the training data to be screened for novel categories, which defeats the purpose of the user adding categories at will. OVD does not impose this restriction but assumes that the novel category names are unavailable during training and may be added freely at inference time. Also, to detect more novel categories, the training of OVD sometimes needs additional images or captions.
We propose an alternative setting that we call the Zero-shot Annotation object Detection (ZAD) task, which allows the training data to contain unannotated novel category instances and allows a model to mine this information at training time if novel category names are available in advance. However, additional category names may also be added at inference time, as in OVD. Thus, ZAD falls between ZSD and OVD, as Fig 5.1 shows. We believe ZAD is closer to a real-life setting: no additional data or annotations are provided, so a user can add categories that were not considered at annotation time, while the detector can still exploit the latent information in the training images for novel category names that are added at training time.
To recognize novel instances when only their names are available, it is natural to consider a vision-language feature space. In recent years, large-scale vision-language classification models trained on millions of image-text pairs, for instance CLIP[131] and ALIGN[75], have become available. Different solutions have been proposed to leverage these models to enable a detector to detect novel objects. Some researchers[191, 45, 202, 196] use these models to generate pseudo-labels and train their models with the pseudo-labels and image-caption pairs, while others train prompts[38, 172, 88] to turn CLIP into a detector. However, all these methods need additional captions or images, which are unavailable in our ZAD setting. In contrast, pure distillation-based methods[51, 190] learn a mapping from image regions to the CLIP feature space by distillation to detect novel objects, which is applicable when no additional data is provided.
To assess how suitable the features are for distillation, we first apply CLIP to classify the instances in the COCO[104] dataset. We find that the classification accuracy (ACC) is only 46%, which is much lower than the ACC of the classifier in Faster R-CNN[135] (about 90%). This indicates a domain gap between the training data of CLIP and the detection dataset, which makes the mapping from image regions to the vision-language feature space harder to learn. In addition, since the distillation is conducted on specific image regions, how to select these regions is an important question. ViLD[51] trains an RPN with base category annotations and applies the RPN to the training images again to generate proposals as distillation regions. As Table 5.1 shows, these proposals are biased toward regions with base category instances, which limits the novel information obtained by the detector and harms the distillation efficiency.
To address these problems, we propose Efficient Feature Distillation for Zero-shot Annotation object Detection (EZAD). To bridge the domain gap, we find that simply fine-tuning the LayerNorm layers in CLIP with the base category instances significantly improves the ACC on both base and novel categories (Fig 5.2, 1). For the distillation regions, we expect them to contain novel objects so that novel category information can be introduced into the detector.
Dataset Eval On AR@100 AR@300 AR@1000
COCO Base 56.69 61.45 64.32
Novel 34.66 43.62 51.34
LVIS Base 42.14 49.84 55.09
Novel 35.63 45.07 51.82
Table 5.1: The Recall of RPN on Novel and Base categories on COCO or LVIS training set. The RPN biases
to the base category.
To make the best use of the only information available in the ZAD setting, the names of the novel categories, we use CLIP to select these regions with the help of the novel category names. The selected regions, in which CLIP believes there is a novel category instance, are named CLIP Proposals (Fig 5.2, 2).
After adapting the feature space and generating the CLIP Proposals, EZAD learns a mapping from image regions to the vision-language feature space by distillation, which is achieved by minimizing the L1 loss between the features of the CLIP Proposals extracted by CLIP and those produced by our model (Fig 5.2, 3). Once the model is trained, among all potential regions given by the RPN, EZAD recognizes novel objects by using the novel category names' text embeddings, which have high cosine similarity scores with the image features of the novel objects (Fig 5.2, 4). To further improve the model performance, we introduce a semantic-based regressor, which takes the text embedding as additional information for regression.
Provided only with the names of the novel categories, EZAD outperforms ViLD by 4% on novel categories in COCO with a much shorter training schedule. On the LVIS dataset, our method achieves better performance than ViLD on both base and novel categories. This indicates that the adapted features and the CLIP Proposals benefit both distillation quality and training efficiency.
(Figure 5.2 diagram: 1. Adaptation of CLIP's Feature Space; 2. CLIP Proposal Selection; 3. CLIP Proposal Feature Distillation, with an L1 loss between CLIP Proposal features from CLIP-Visual and from the detection backbone after RoI Align; 4. Novel Object Detection with Text Embedding, using cosine similarity between RPN proposal features and text embeddings from CLIP-Text.)
Figure 5.2: Overview of EZAD. The key contributions of EZAD are that we bridge the domain gap and create better distillation regions (CLIP Proposals).
(Figure 5.3 diagram, offline CLIP Proposal generation: generate anchors; crop regions based on the anchors; extract features with CLIP-Visual and compute objectness scores against the text embeddings T1, ..., TN; select the regions with high objectness scores after NMS; save the CLIP Proposals, their features, and their objectness scores.)
Figure 5.3: The pipeline of CLIP Proposal generation. The CLIP Proposals and their features are generated before the detector training.
5.2 Related Work
Zero-Shot object Detection (ZSD) and Open-Vocabulary object Detection (OVD) aim to learn knowledge from the base categories and generalize it to the novel categories, enabling the model to detect novel objects. Bansal et al.[5], Zhu et al.[204], Rahman et al.[132], and Xie et al.[180] focus on ZSD, in which novel instances cannot exist in the training images. However, in most object detection datasets, not all instances in an image are annotated, which makes the ZSD setting too restrictive. Zhong et al.[196], Zhou et al.[202], and Zareian et al.[191] pre-train their models with image-caption pairs to learn the vision-language feature space for open-vocabulary detection. Gao et al.[45], Long et al.[106], Wang et al.[164], and Zhao et al.[195] train their models with pseudo-labels. Feng et al.[38], Kuo et al.[88], and Wu et al.[172] design prompts to make use of CLIP for detection.
(Figure 5.4 diagram: during training, CLIP Proposal features are distilled with an L1 loss, RPN proposal features are classified against the base category text embeddings with a cross-entropy loss, and the regression branch concatenates each proposal feature with the text embedding of its assigned category before an L1 loss; at inference, cosine similarities are computed against both base and novel text embeddings.)
Figure 5.4: The training and inference pipeline of EZAD. The figure presents the model structure in the RoI head of the two-stage detector. We use the per-CLIP-Proposal distillation weight and the semantic-based regressor to further improve the model performance.
Ma et al.[110] and Gu et al.[51] distill visual features from CLIP, enabling object detectors to detect novel instances. Wang et al.[166] propose CondHead, which can be added to different OVD models to further improve their performance. All these methods aim to solve the OVD problem, and most of them need additional training data, except the distillation-based[110, 51, 166] and some prompt-based methods[88, 172]. The distillation-based and prompt-based methods can fit our ZAD setting, and the distillation-based methods have better overall performance.
5.3 Method
Setting definition. Our method aims to handle Zero-shot Annotation object Detection (ZAD). In ZAD, all categories are divided into two parts: the base categories Cb and the novel categories Cn. The names of the base and expected novel categories are known before model training. The training images may contain novel category instances, but only the base categories are annotated. No additional data is available for training except the training images and the base category annotations. The model is trained on the set of base classes and tested on both base and novel classes.
Method Overview. Inspired by ViLD [51], our proposed method, Efficient Feature Distillation for Zero-shot Annotation object Detection (EZAD), maps the image features into CLIP's multi-modal feature space. The mapping is learned by distilling knowledge from CLIP on selected distillation regions of each training image. The knowledge distillation is conducted by minimizing the L1 loss between the features from CLIP and the features from our model in the selected regions. After learning the mapping, given the proposals from the RPN, our model can recognize and detect novel objects by using the text embeddings of the novel categories.
In Section 5.3.1, we describe how to adapt the CLIP features to the domain of the detection dataset. In Section 5.3.2, we demonstrate how to select the distillation regions in each image. In Section 5.3.3, we discuss our model structure and how we train the model and learn the mapping with the adapted features from the selected regions.
5.3.1 Adapt Vision-language Model Feature
To understand how good the image features from CLIP are, we first apply CLIP to classify the instances in the COCO [104] dataset. We first extract the feature of each instance from CLIP's vision encoder. The feature of instance i can be expressed as ins_i = V(Crop(I, GT_i(1.2x))), where V is CLIP's vision encoder and Crop(I, GT_i(1.2x)) means cropping the region from image I based on the 1.2x-enlarged GT bbox GT_i(1.2x). We use the 1.2x-enlarged bboxes since they help CLIP yield the best classification accuracy; we present more details in Section 5.4.1. We generate the text embedding for each COCO category from CLIP's text encoder. We calculate the cosine similarity between the image feature and the text embeddings and select the category with the highest cosine score as the predicted category.
We notice that when directly applying CLIP to classify the COCO instances, the classification accuracy (ACC) is only about 50%, which is much lower than the ACC of the classifier in a well-trained detector, indicating a large distribution gap between the training data of CLIP and detection datasets. To bridge the gap, inspired by [80], we simply fine-tune CLIP's layer normalization layers by minimizing the cross-entropy loss over all base category instances in the detection dataset. This simple method boosts the ACC on COCO to about 80%. Moreover, using the adapted CLIP features for distillation helps improve the detection results.
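A minimal sketch of the LayerNorm-only fine-tuning described above is given below (PyTorch with the open-source CLIP package). The data loading, the base-category text embeddings, and the fixed logit scale of 100 are assumptions for illustration; this is not the exact training recipe.

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                         # work in fp32 for simple fine-tuning

# freeze everything, then unfreeze only the LayerNorm layers of the vision encoder
for p in model.parameters():
    p.requires_grad_(False)
ln_params = []
for module in model.visual.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():
            p.requires_grad_(True)
            ln_params.append(p)

optimizer = torch.optim.AdamW(ln_params, lr=1e-4)

# `base_text_feat` holds the (fixed, L2-normalized) text embeddings of the base
# category names; `images` are 1.2x-enlarged instance crops with base labels.
def finetune_step(images, labels, base_text_feat):
    img_feat = model.encode_image(images.to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ base_text_feat.T          # cosine-similarity logits
    loss = nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad(); loss.backward()
    torch.nn.utils.clip_grad_norm_(ln_params, 0.1)        # gradient clipping, as in Sec. 5.4.1
    optimizer.step()
    return loss.item()
```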
5.3.2 Generate CLIP Proposals
To obtain useful information about the novel categories, we need to select meaningful image regions for distillation. We expect these regions to contain some novel category objects and thus introduce information about the novel categories. ViLD trains an RPN with base category annotations and uses the proposals from this RPN as the distillation regions. However, since the RPN is trained only to predict base categories, its proposals are biased toward areas that contain base categories and ignore areas that potentially contain novel category instances. Instead of using the RPN to determine where to distill, we use CLIP [131]. Trained on 400 million image-text pairs collected from the internet, CLIP can discriminate a large number of categories. Therefore, if a region contains a novel object, CLIP should yield high confidence on it. We name these regions CLIP Proposals. Fig 5.3 demonstrates how to generate CLIP Proposals.
To select image regions as CLIP Proposals, we first generate anchors over the image I. Then we crop the image based on the anchors and extract the features of these anchors from CLIP's vision encoder. We generate the text embeddings T_i for a given dictionary from CLIP's text encoder. In addition to information about the novel categories, we also need knowledge of the base categories; thus, we use all category names as the dictionary. We classify each anchor by calculating the cosine similarity scores between its image feature and all text embeddings. We use the post-softmax score of the predicted category as the objectness score o_i of the anchor. We finally select the anchors with high objectness scores after Non-Maximum Suppression (NMS) as CLIP Proposals.
All CLIP Proposals C_i and their CLIP features c_i are generated offline. The feature c_i can be expressed as V(Crop(I, C_i)), where V and Crop are the same as in the definition of ins_i. We also add the 1.2x-enlarged base category GT bboxes as part of the CLIP Proposals. Our experiments show that even though these CLIP Proposals are noisy, they are still meaningful regions for distillation and help our detector perform better on novel categories.
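The offline CLIP Proposal generation can be sketched as below. The anchors, the preprocessing function, and the concatenated text embeddings are assumed to be prepared elsewhere, and the logit scale and NMS threshold are illustrative values rather than the exact ones used here.

```python
import torch
from torchvision.ops import nms

@torch.no_grad()
def generate_clip_proposals(image, anchors, clip_model, preprocess, text_feat,
                            iou_thr=0.5, top_k=1000):
    """Sketch of offline CLIP Proposal generation. `anchors` is an (A, 4) tensor
    of xyxy boxes and `text_feat` holds the L2-normalized text embeddings of
    all (base + novel) category names."""
    crops = torch.stack([preprocess(image.crop(tuple(a.tolist()))) for a in anchors])
    feat = clip_model.encode_image(crops.to(text_feat.device))
    feat = feat / feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * feat @ text_feat.T).softmax(dim=-1)      # cosine logits -> probabilities
    objectness, _ = probs.max(dim=-1)                         # score of the predicted category
    keep = nms(anchors.float().to(objectness.device), objectness.float(), iou_thr)[:top_k]
    return anchors[keep.cpu()], feat[keep], objectness[keep]  # proposals, features, scores
```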
5.3.3 Model Structure
Compared with a traditional two-stage detector, our model structure has three main differences: CLIP Proposal feature distillation, a cosine-similarity-based classifier, and a semantic-based regressor. The model structure overview is shown in Fig 5.4.
Proposals’ Feature Distillation. To obtain knowledge from CLIP and map the image feature from our
model into the CLIP’s feature space, we distill the knowledge from CLIP by minimizing the L1 loss between
the CLIP Proposals’ feature from the CLIP’s vision encoder ci and the one from our model c
′
i
. The c
′
i
can be
expressed as Convc(Align(Bbone(I), Ci)), where Bbone(I) means getting the feature map by passing the
image I through the backbone Bbone, and Align means doing the RoIAlign base on the CLIP Proposal Ci
on the feature map. Convc means passing the feature after the RoIAlign through the convolution and linear
layers in the classification branch.
From the CLIP Proposal generation, we know the objectness score of each proposal. A proposal with a higher objectness score has a higher probability of containing an object; therefore, we assign it a higher weight. We directly use the objectness score as the weight, and the distillation loss is formulated as:
L_{dist} = \frac{1}{M}\sum_{i=1}^{M} o_i \left\vert c_i - c'_i \right\vert_1 \qquad (5.1)
where o_i is the objectness score of CLIP Proposal C_i and M is the number of CLIP Proposals in one image.
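Equation (5.1) corresponds to the following short sketch (editor's illustration; tensor shapes are assumptions):

```python
import torch

def distillation_loss(student_feat, clip_feat, objectness):
    """Objectness-weighted L1 distance between the detector's CLIP-Proposal
    features c'_i (student_feat) and the pre-computed CLIP features c_i
    (clip_feat). Shapes: (M, D), (M, D), (M,)."""
    l1 = (clip_feat - student_feat).abs().sum(dim=-1)   # |c_i - c'_i|_1 per proposal
    return (objectness * l1).mean()                     # average over the M proposals
```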
Cosine-similarity-based Classifier. By distilling knowledge from CLIP, we are able to map the image features of our model into the CLIP feature space. Instead of using a learnable linear layer as the classifier, we use text embeddings generated by CLIP's text encoder. In the training phase, we only need the names of the base categories, which are converted into text embeddings B_i. For each proposal P_i given by the Region Proposal Network (RPN), we generate its feature p_i, which can be expressed as p_i = Conv_c(Align(Bbone(I), P_i)), where Conv_c, Align, and Bbone are the same as in the definition of c'_i. The classification loss is given by:
L_{cls} = \frac{1}{N}\sum_{i=1}^{N} L_{CE}\left(\mathrm{softmax}(\boldsymbol{cos}_i),\, y_i\right) \qquad (5.2)
where N is the total number of proposals and y_i is the assigned label for proposal P_i. The vector cos_i for proposal P_i is defined as [cos(p_i, B_1), ..., cos(p_i, B_n), cos(p_i, BG)], in which n is the number of base categories, cos is the cosine similarity score, and BG is a learnable vector for the background.
At inference time, we also need to detect the novel categories. We generate the text embeddings for both the base categories B_i and the novel categories N_i. The vector cos_i becomes [cos(p_i, B_1), ..., cos(p_i, B_n), cos(p_i, N_1), ..., cos(p_i, N_k), cos(p_i, BG)], where k is the number of novel categories.
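For illustration, the classifier head can be sketched as follows; the module name and the absence of a temperature/logit scale (which is common in practice) are assumptions made to mirror Eq. (5.2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-similarity classifier sketch. `class_embed` holds base-category
    text embeddings at training time and base + novel embeddings at inference
    time; `bg_embed` is the learnable background vector BG."""
    def __init__(self, embed_dim):
        super().__init__()
        self.bg_embed = nn.Parameter(torch.randn(1, embed_dim))

    def forward(self, proposal_feat, class_embed):
        # proposal_feat: (N, D) RoI features p_i mapped into CLIP's space
        weights = torch.cat([class_embed, self.bg_embed], dim=0)   # (num_classes + 1, D)
        cos = F.normalize(proposal_feat, dim=-1) @ F.normalize(weights, dim=-1).T
        return cos                                                 # softmax + CE applied outside

# training:  loss = F.cross_entropy(classifier(p, base_embed), labels)
# inference: scores = classifier(p, torch.cat([base_embed, novel_embed], dim=0))
```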
Semantic-based Regressor. To improve the performance of the regression module, we take the semantic information of each category into consideration: we concatenate the text embedding with the proposal feature to predict the bbox. For each foreground proposal P_i given by the RPN, we generate its regression feature r_i, which can be expressed as Conv_r(Align(Bbone(I), P_i)), where Align and Bbone are the same as in the definition of c'_i, and Conv_r means passing the RoIAlign output through the convolution and linear layers in the regression branch. The regression loss is defined as:
L_{reg} = \frac{1}{K}\sum_{i=1}^{K} L_{1}\left(\mathrm{Linear}(\mathrm{Cat}(r_i, B_{y_i})),\, a_i\right) \qquad (5.3)
Method Mask Epoch Novel Category Knowledge Source AP50
Base Novel Overall
ZSD-YOLO[180] ✗ 50 CLIP’s feature 31.7 13.6 19.0
CORA[172] ✗ 5 CLIP 35.5 35.1 35.4
F-VLM[88] ✓ 6 CLIP - 28.0 38.0
PBBL[45] ✓ - ALBEF[96], CLIP, Pseudo-label 46.1 30.8 42.1
OVOS[110] ✗ 36 CLIP’s feature 51.3 20.3 43.2
VTP-OVD[106] ✓ - CLIP’s feature, Pseudo-label 31.5 29.8 46.1
OADP[164] ✓ - CLIP’s feature, Pseudo-label 53.3 30.0 47.2
CondHead[166] ✓ 107 CLIP’s feature 60.8 29.8 52.7
OV-DETR[190] ✗ 50 CLIP’s feature 61.0 29.4 52.7
ViLD[51] ✓ 107 CLIP’s feature 59.5 27.6 51.3
Ours ✓ 36 CLIP’s feature, Novel Category Name 59.9 31.6 52.1
Table 5.2: Evaluation results on the COCO benchmark. All models are trained with the ResNet-50 backbone. Mask indicates whether the model is trained with mask annotations. Our model achieves a 4% improvement on Novel with 1/3 of ViLD's training time.
where K is the total number of foreground proposals, Linear is the linear layer for bbox prediction, Cat means concatenation, B_{y_i} is the text embedding of the assigned GT label y_i for proposal P_i, and a_i is the GT bbox for proposal P_i. At inference time, since we no longer have the GT label, we concatenate r_i with the text embedding B_{pred_i} or N_{pred_i} of the predicted category of proposal P_i, where pred_i = arg max(cos_i).
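Equation (5.3) can be sketched as below; the module name and the feature/embedding dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBoxRegressor(nn.Module):
    """Semantic-based regressor sketch: the regression feature r_i is
    concatenated with the text embedding of the assigned category (training)
    or predicted category (inference) before the final box prediction."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim + embed_dim, 4)   # box regression targets

    def forward(self, reg_feat, class_embed):
        # reg_feat: (K, feat_dim) foreground features r_i
        # class_embed: (K, embed_dim) text embeddings B_{y_i} (or B/N_{pred_i})
        return self.linear(torch.cat([reg_feat, class_embed], dim=-1))

# training: loss_reg = F.l1_loss(regressor(r, text_embed[gt_labels]), gt_boxes)
```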
Finally, our overall loss function is given by:
L = L_{dist} + L_{cls} + L_{reg} \qquad (5.4)
5.4 Experiments and Results
We first present our model results on the COCO[104] and LVIS[52] detection benchmarks in Section 5.4.2. In Section 5.4.4 we compare our method with ViLD to show its efficiency. Finally, we conduct ablation studies with visualization analysis.
5.4.1 Implementation Details.
We use the publicly available pre-trained CLIP model ViT-B/32 as the open-vocabulary classification model,
with an input size of 224×224.
Adapt Image-language Model Feature. Based on the detection setting used for training and evaluating our detector, we adapt CLIP to two detection domains: the COCO[104] detection domain and the LVIS[52] detection domain. We fine-tune the layer normalization layers in CLIP with the base category instances in COCO or LVIS, depending on the detection setting, and keep all other parameters fixed. All base category instances are cropped using 1.2x-enlarged GT bboxes. We apply zero padding to convert each cropped region into a square and then apply CLIP's default preprocessing pipeline.
We use CLIP to predict the category of each cropped region and calculate the cross-entropy loss against the GT label of each region. We fine-tune the model by optimizing this cross-entropy loss, using the AdamW optimizer with a learning rate of 0.0001 and batch size 4, and clip the L2 norm of the gradients when it is larger than 0.1. We fine-tune the model for 12 epochs.
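The cropping and padding step can be sketched as follows (a PIL-based illustration; the handling of boxes that extend past the image border and the even split of padding are assumptions):

```python
from PIL import Image, ImageOps

def crop_instance(image: Image.Image, bbox, enlarge=1.2):
    """Enlarge the GT bbox by `enlarge` (1.2x here), crop it, and zero-pad the
    crop to a square before CLIP's own preprocessing. `bbox` is (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * enlarge, (y2 - y1) * enlarge
    crop = image.crop((int(cx - w / 2), int(cy - h / 2), int(cx + w / 2), int(cy + h / 2)))
    side = max(crop.size)
    pad_w, pad_h = side - crop.size[0], side - crop.size[1]
    # zero padding (black borders), split evenly on both sides
    return ImageOps.expand(crop, border=(pad_w // 2, pad_h // 2,
                                         pad_w - pad_w // 2, pad_h - pad_h // 2), fill=0)
```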
Generate CLIP Proposals. When generating CLIP Proposals, we use the CLIP model described above as a classifier to select the distillation regions. If the adapted CLIP features are used to train the detector, we use the adapted CLIP to generate the CLIP Proposals; otherwise, we use the unadapted CLIP.
We generate the CLIP Proposals on all training images of the detection dataset, based on the detection setting in use. We first resize each image with the aspect ratio maintained, so that its width is at most 1333 pixels or its height is at most 800 pixels.
We generate anchors on each image with a stride of 32 pixels, 5 sizes (32, 64, 128, 256, 512), and 3 aspect ratios (1:1, 2:1, 1:2). We select the top 1000 anchors after NMS as CLIP Proposals on each image. We filter out anchors that have high IoU with the base category GT bboxes to reduce redundancy, since we add the 1.2x-enlarged base category GT bboxes as part of the CLIP Proposals. During model training, we randomly select a fixed subset of 200 CLIP Proposals on each image.
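The anchor grid can be reproduced with a sketch like the one below; the exact anchor parameterization (how sizes and aspect ratios map to widths and heights, and the clipping behavior) is an assumption:

```python
import torch

def generate_anchors(img_w, img_h, stride=32,
                     sizes=(32, 64, 128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Dense anchors every `stride` pixels with 5 sizes and 3 aspect ratios,
    clipped to the image; returns an (A, 4) tensor of xyxy boxes."""
    boxes = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    anchors = torch.tensor(boxes)
    anchors[:, 0::2] = anchors[:, 0::2].clamp(0, img_w)   # clip x coordinates
    anchors[:, 1::2] = anchors[:, 1::2].clamp(0, img_h)   # clip y coordinates
    return anchors
```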
Detection Setting. In the COCO detection setting, the dataset is divided into 48 base categories and 17 novel categories; 15 categories without a synset in the WordNet hierarchy are removed.
We filter out training images that do not have base category annotations. Following the setting in [191], we filter out validation images that have neither base category instances nor novel category instances. The training set contains 107761 images and 665387 base category instances. The validation set contains 4836 images, 28538 base category instances, and 33152 novel category instances. We evaluate the model in the generalized setting, which evaluates the base and novel categories at the same time. AP50 is used as the evaluation metric.
In the LVIS detection setting, the dataset is divided into 866 base categories (405 frequent categories and 461 common categories) and 337 novel categories (the 337 rare categories). Our LVIS-Fbase split uses the frequent categories as the base (405 categories) and the common and rare categories as the novel (461 common and 337 rare categories). The training set contains 98531 images and 1200258 base category instances. The validation set contains 19442 images, 230427 base category instances, and 14280 novel category instances. We aggregate the model performance on frequent, common, and rare categories separately. AP is used as the evaluation metric.
5.4.2 Comparison with Current Methods
In this section, we evaluate EZAD on the COCO and LVIS detection benchmarks.
Datasets and Evaluation Metrics. We evaluate EZAD on COCO and LVIS (v1). For the COCO dataset, we use train2017 for training and val2017 for validation. Following [191], we divide COCO into 48 base and 17 novel categories. For the LVIS dataset, we use the training/validation images for training/evaluation. Previous work uses the Frequent and Common categories as the base (866 categories) and the Rare categories as the novel (337 categories).
Method Epoch AP
Freq Comm Rare All
ViLD* 24 24.9 12.2 11.2 17.5
Ours 24 30.9 14.3 12.5 20.5
ViLD* 48 26.4 13.2 11.3 18.5
Ours 48 31.9 15.2 13.1 21.3
Table 5.3: Evaluation results on LVIS-Fbase benchmark. The ViLD* is our reproduced result of ViLD with
a shorter training schedule. EZAD outperforms ViLD in both 2x and 4x settings due to the efficient feature
distillation.
We argue that in this split the rare category objects are so sparse (less than 0.5% of the annotations in the validation set) that the model's performance on them is not representative. Therefore, we propose a new split called LVIS-Fbase, which uses the Frequent categories as the base (405 categories) and both the Common and Rare categories as the novel (461 common and 337 rare categories). On COCO, AP50 is used as the evaluation metric, while on LVIS AP is used.
Model. We train a Mask R-CNN[56] model with a ResNet-50[58] FPN[103] backbone. The backbone is pre-trained on ImageNet[32]. We use SGD as the optimizer with batch size 4, learning rate 0.005, momentum 0.9, and weight decay 0.0001. We adopt a linear warmup for the first 500 iterations with a warm-up ratio of 0.001. On COCO, we train our model for 36 epochs and divide the learning rate by 10 at epochs 27 and 33. On LVIS, we train our model for 48 epochs and divide the learning rate by 10 at epochs 32 and 44. We train our model with multi-scale train-time augmentation.
Baselines. We compare our method with different Zero-Shot Detection (ZSD) and Open-Vocabulary Detection (OVD) methods that do not use additional datasets and can therefore fit the Zero-shot Annotation Detection (ZAD) setting. We mainly compare EZAD with ViLD[51] and CondHead[166], which use distillation to obtain information about the novel categories from CLIP and have the best overall performance on the COCO dataset. However, they use large-scale jittering data augmentation[48] with an extremely long training schedule.
Results. Table 5.2 shows the results of EZAD on the COCO detection benchmark. Both CORA[172] and F-VLM[88] are prompt-based methods; although they need less training time and have strong novel category performance, their performance on the base categories is much worse than that of other methods.
Method Epoch AP50
Base Novel Overall
ViLD* 12 48.3 17.2 40.2
Ours 12 55.7 30.4 49.0
ViLD* 36 56.0 24.2 48.5
Ours 36 59.9 31.6 52.1
Table 5.4: Evaluation results on COCO. EZAD outperforms ViLD in both 1x and 3x settings, showing our
distillation is more efficient.
Method Base Novel General
L M S Avg L M S Avg L M S Avg
w/o Adaptation 69.9 70.2 46.6 66.8 90.7 82.3 48.5 77.7 61.3 62.2 36.9 53.9
w/ Adaptation 92.5 89.5 80.3 87.5 91.1 81.6 66.3 81.9 84.8 80.7 68.8 78.3
Table 5.5: Adapting CLIP to the detection dataset’s domain. The table presents the classification accuracy
(ACC) of CLIP (with or without adaptation) when it is applied to classify the COCO dataset's instances. The ACC
is aggregated based on the size of the instances. After the adaptation, the ACC is improved by a huge margin
in three different settings, especially for small objects.
ZSD-YOLO[180] and OVOS[110] use one-stage detectors and suffer from many false positives in the detection results, which leads to low AP on novel categories. PBBL, VTP-OVD, and OADP are trained with pseudo-labels, which introduces noise and harms the model performance on base categories. OV-DETR[190], CondHead[166], and ViLD[51] are pure distillation-based methods. OV-DETR uses a heavier transformer architecture and achieves a better result than ViLD, while CondHead proposes a new detection head that improves ViLD's performance. Our EZAD achieves 59.9% and 31.6% on base and novel categories, respectively. Its performance is 4% better than ViLD on novel categories, and it performs better in the base and overall settings while using 1/3 of ViLD's training time.
Table 5.3 shows the results of EZAD on our LVIS-Fbase benchmark. ViLD is trained for 468 epochs on LVIS, which is too long to be fully reproduced; we reproduce ViLD with 24 and 48 epochs. Our method uses ViLD's model structure and distillation pipeline, and the improvement of our method comes from the distillation features and distillation regions. Therefore, reproducing ViLD with a shorter training schedule does not affect the fairness of the comparison.
CLIP Feature Distill Region Base Novel Overall
Raw RPN Proposal 48.8 17.5 40.6
Adapted RPN Proposal 56.9 24.6 48.5
Raw CLIP Proposal 48.7 19.3 41.7
Adapted CLIP Proposal 55.7 30.4 49.0
Table 5.6: Ablation study on CLIP’s feature adaptation and CLIP Proposal using COCO detection benchmark.
SB Reg PPDW Base Novel Overall
55.5 28.2 48.3
✓ 55.8 29.8 48.5
✓ ✓ 55.7 30.4 49.0
Table 5.7: Ablation study on Semantic-based regressor and per CLIP Proposal distillation weight using
COCO detection benchmark. SB Reg means the Semantic-based regressor, and PPDW means the per CLIP
Proposal distillation weight.
With 48 epochs, our method shows a 5% improvement on Frequent and a 2% improvement on both Common and Rare over ViLD with the same training schedule. In CLIP Proposal generation, EZAD only uses frequent and common object names to generate proposals; therefore, EZAD does not take advantage of the category names for the Rare categories. We believe that the performance gains on the rare and frequent categories can be attributed to eliminating the domain gap, whereas the improvement on the common categories results from both the CLIP Proposals and the domain adaptation.
Because we adopt the ZAD setting, novel instances are present in our training images. This enables us to effectively leverage the information from these novel instances and thus achieve better performance than existing methods, which cannot be done under Zero-Shot Detection (ZSD). On the other hand, we do not need additional annotations like many OVD methods do, meaning our ZAD setting can be applied to more real-life scenarios.
Proposal IoGT #(IoGT≥0.8) #(IoGT≥0.5)
RPN 0.340 362818 (9%) 610157 (15%)
CLIP(Ours) 0.365 563799 (14%) 870830 (21%)
Table 5.8: The effective distillation region of different proposals. The table presents the Intersection between
the proposal and the novel GT bboxes over novel GT bboxes (IoGT), the number of proposals that have high
IoGT (IoGT≥0.8, IoGT≥0.5), and the percentage of these proposals in all proposals. Our CLIP proposals
can cover more novel category instances, thus improving the distillation efficiency.
5.4.3 Experiments in Few-shot Detection Settings
In few-shot object detection, the model is trained on the base categories' annotations and evaluated on novel categories. The only difference is that in few-shot detection, each novel category has the same number of annotated objects (i.e., K shots), which can be used to improve the model's performance on the novel categories before evaluation. We directly evaluate our model on the few-shot benchmarks without using this K-shot additional information.
Datasets and Evaluation Metrics. We evaluate our approach on PASCAL VOC 2007+2012 and
COCO. For the few-shot PASCAL VOC dataset, we combine the trainval set of 2007 with the one of 2012
as training data. PASCAL VOC 2007 test set is used for evaluation. The 20 classes are divided into 15 base
classes and 5 novel classes. We evaluate our model in three different base/novel splits used in [168]. Split
1 has 14631 training images with 41084 base category instances, and the validation set has 4952 images,
10552 base category instances, and 1480 novel instances. Split 2 has 14779 training images with 40397
base category instances, and the validation set has 4952 images, 10447 base category instances, and 1585
novel instances. Split 3 has 14318 training images with 40511 base category instances, and the validation
set has 4952 images, 10605 base category instances, and 1427 novel instances.
For the few-shot COCO dataset, we use the COCO train2017 as training data and evaluate our model on
the COCO val2017. The 20 categories that exist in PASCAL VOC are used as the novel categories, while
the rest of the 60 categories are used as the base categories. The training set has 98459 images and 367189
base category instances. The validation set has 5000 images and 15831 base category instances and 36781
novel category instances.
AP50 is used as the evaluation metric in PASCAL VOC, while AP and AP50 are used in COCO.
Model. Following previous work in few-shot detection, we train a Faster R-CNN[135] model with
ResNet-101 FPN backbone. The backbone is pretrained on ImageNet. We use SGD as the optimizer with
batch size 4, learning rate 0.005, momentum 0.9, and weight decay 0.0001. We also adopt linear warmup
for the first 500 iterations, with a warm up ratio is 0.001. We apply multi-scale train-time augmentation. For
the PASCAL VOC dataset, we train the model for 21 epochs and divide the learning rate by 10 at epoch 15
and epoch 18. For the COCO dataset, we train the model for 18 epochs and divide the learning rate by 10 at
epoch 14 and epoch 16.
Baselines. We compare EZAD’s performance with two few-shot detection models, TFA[168] and Meta
Faster R-CNN [53] as the baselines. The TFA model with linear layer as the classifier is noted as TFA w/fc,
while the model with cosine classifier is noted as TFA w/cos.
Results. Table 5.9 shows the results on the PASCAL VOC dataset. EZAD achieves 40.9% novel AP50 averaged over the three splits. EZAD's performance matches TFA's 3-shot performance on split 1 and split 2 and is 4.7% higher than TFA on split 3. Compared with TFA's performance on the base categories, EZAD is 1.8% higher. Meta Faster R-CNN generates proposals for each category on each image, which requires multiple forward passes; its inference will be much slower if the dataset has a large number of novel categories. Compared with Meta Faster R-CNN, EZAD outperforms it without using any additional annotations by 1.6%, 3%, and 6.9% on the three splits, respectively. Table 5.10 shows the results on the COCO dataset. EZAD achieves 10.2% and 22.2% in AP and AP50, respectively, matching TFA's 10-shot performance and exceeding Meta Faster R-CNN's 2-shot performance by 2.6% and 5.9% in AP and AP50, respectively. Our model's zero-shot performance in the few-shot setting shows the power of the adapted multi-modal feature space and validates the effectiveness of using CLIP Proposals as distillation regions.
Method Shot Novel AP50
Split1 Split2 Split 3 Avg
TFA w/fc 1 36.8 18.2 27.7 27.6
TFA w/fc 2 29.1 29.0 33.6 30.6
TFA w/fc 3 43.6 33.4 42.5 39.8
TFA w/cos 1 39.8 23.5 30.8 31.4
TFA w/cos 2 36.1 26.9 34.8 32.6
TFA w/cos 3 44.7 34.1 42.8 40.5
MF R-CNN 1 43.0 27.7 40.6 37.1
Ours 0 44.6 30.7 47.5 40.9
Split1 Base(AP50): TFA (3-Shot)=79.1, Ours=80.8
Table 5.9: Evaluation results on the novel categories of the PASCAL VOC few-shot benchmark. MF R-CNN means Meta Faster R-CNN. Our model's zero-shot performance on the novel categories matches TFA's 3-shot performance. Our model also performs better on the base categories.
Method Shot AP AP50
TFA w/fc 10 10.0 19.2
TFA w/cos 10 10.0 19.1
MF R-CNN 2 7.6 16.3
Ours 0 11.0 23.5
Table 5.10: Evaluation results on the novel categories of the COCO few-shot benchmark. MF R-CNN means Meta Faster R-CNN. Our model's zero-shot performance on the novel categories matches TFA's 10-shot performance.
5.4.4 Efficiency Evaluation
In this section, we compare EZAD with our reproduced ViLD to show the efficiency of our method. In Table 5.4, we present our model and our reproduced ViLD on COCO with 1x and 3x training schedules; our method is consistently better than ViLD in both settings. Table 5.3 shows EZAD and our reproduced ViLD on the LVIS dataset with 2x and 4x training schedules; our method shows substantial improvement over ViLD with the same training schedule. These results suggest that the adapted feature space and the CLIP Proposals improve the distillation quality and efficiency, and thus the model performance.
Figure 5.5: The visualization result on the COCO. The first row presents the results of the model trained
with the adapted CLIP features from CLIP Proposals. The second row presents the results of the model
trained with raw CLIP features from RPN proposals.
5.4.5 Ablation Studies on CLIP’s Feature Adaptation and CLIP Proposal.
In this section, we conduct ablation studies using the COCO detection benchmark. All the experiment details
are the same as mentioned in section 5.4.2. We train our detector for 12 epochs in all experiments of this
section.
Table 5.5 presents the classification results of adapting CLIP to the COCO dataset's domain. We evaluate the classification accuracy (ACC) on the instances in the COCO validation set, using the splits in Section 5.4.2. Since the novel setting is a 17-way classification, which is easier than the other two settings, the ACC in the novel setting is much higher than in the other two.
Before the adaptation, the ACC in the general setting is only about 53.9%, which is much lower than the ACC of the classifier in a well-trained detector. This indicates a huge domain gap between the training data of CLIP and the detection dataset: while the objects in CLIP's training images are clear, large, and at the center of the image, the objects in the detection dataset may be small and occluded. Although we only fine-tuned CLIP on base category instances, we observe a huge improvement in all three settings, especially for small objects. The average ACC reaches 87.5%, 81.9%, and 78.3% in the Base, Novel, and General settings, respectively, and the ACC for small objects improves by 33.7%, 17.8%, and 31.9% in the three settings. This indicates that the simple fine-tuning method can effectively bridge the domain gap and make the features more discriminative.
Table 5.6 shows the effectiveness of the CLIP Proposals and CLIP's feature adaptation in the detection setting. Using the adapted CLIP features for distillation consistently improves both base and novel category performance, regardless of which distillation regions are used. Our model's performance on novel categories benefits from using CLIP Proposals as distillation regions (30.4% vs. 24.6% with adapted CLIP features and 19.3% vs. 17.5% with raw CLIP features). We believe that the adaptation of CLIP and the CLIP Proposals are complementary, which makes the improvement given by CLIP Proposals with adapted CLIP (5.8%) larger than with raw CLIP (1.8%).
The performance on base categories of models trained with CLIP Proposals is slightly worse than that of models trained with RPN proposals, since the CLIP Proposals focus more on regions with novel categories. Our experiments show that with a longer training schedule, the performance gap on the base categories can be eliminated while the advantage on the novel categories is maintained.
5.4.6 Ablation Study on Semantic Regressor and Per CLIP Proposal Distillation Weight.
In this section, we conduct ablation studies using the COCO detection benchmark. All the experiment details
are the same as mentioned in section 5.4.2. We train our detector for 12 epochs in all experiments of this
section.
Table 5.7 shows the effects of the semantic-based regressor and the per-CLIP-Proposal distillation weight. All experiments use the adapted CLIP features and CLIP Proposals as distillation regions. The semantic-based regressor helps the model perform better on both base and novel categories, showing that the semantic meaning of the categories provides useful information to the regressor. Combining the semantic-based regressor and the per-CLIP-Proposal distillation weight, the AP50 on novel categories reaches 30.4%. This indicates that reweighting the distillation boxes further improves the distillation quality.
Bbox Size General
L M S Avg
0.8x GT 62.3 54.0 23.2 47.1
1.0x GT 64.0 61.9 32.9 53.4
1.2x GT 61.3 62.2 36.9 53.9
1.5x GT 56.7 59.5 40.6 52.6
2.0x GT 50.5 52.6 42.9 48.9
Table 5.11: The classification accuracy (ACC) of the unadapted CLIP on COCO instances with different
sizes of the GT bboxes to crop the instances. We decide to use the 1.2x enlarged GT bbox to crop the
instance since it has the best average ACC.
5.4.7 More Ablation Studies
Table 5.11 presents experimental results on how the size of the bounding box (bbox) used to crop the instances in the COCO [104] dataset affects the classification accuracy (ACC) of the unadapted CLIP [131]. For large objects, the more accurate the bbox, the higher the ACC CLIP can achieve; for small objects, CLIP needs more background information to classify them correctly. In all settings, the average ACC over the three object sizes is still much lower than that of the classifier in a well-trained detector, indicating the domain gap between CLIP's training data and the detection dataset. We use the 1.2x-enlarged GT bbox to crop the base GT instances since it gives the highest average ACC.
In Table 5.12, we train all models with the adapted CLIP features. For the models trained for 12 epochs, the performance on novel categories of the model trained with RPN proposals is 5.8% lower than that of the model trained with CLIP Proposals, though the former has slightly better performance on base categories. For the models trained for 36 epochs, the two models (RPN proposals and CLIP Proposals) have similar performance on base categories, and the model trained with CLIP Proposal features still has much better novel category performance. This indicates that the negative effect of CLIP Proposals on base category performance is negligible and can be alleviated by a longer training schedule. It also shows that the base category information provided by the distillation is redundant, which may accelerate model convergence on the base categories but does not improve the model performance.
Epoch Distill Region Base Novel Overall
12 RPN Proposal 56.9 24.6 48.5
12 CLIP Proposal 55.7 30.4 49.0
36 RPN Proposal 60.2 24.3 50.8
36 CLIP Proposal 59.9 31.6 52.1
Table 5.12: Ablation study on using CLIP Proposals as distillation regions in the COCO benchmark. The model trained with CLIP Proposals has much better performance on novel categories.
5.4.8 Statistic and Visualization Analysis.
To compare which proposals provide more meaningful novel category information, we compare the effective distillation regions of different proposals on an 8000-image subset of the COCO training set in Table 5.8. We calculate the intersection between each proposal and the novel GT bboxes over the novel GT bbox area (IoGT) and report the number and percentage of proposals with high IoGT. Our CLIP Proposals are 6% higher than the RPN proposals in the percentage of high-IoGT proposals, meaning that the CLIP Proposals cover more potential novel instances and improve the distillation efficiency.
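The IoGT statistic can be computed as in the following sketch (editor's illustration; the box format is assumed to be xyxy):

```python
import torch

def iogt(proposals, gt_boxes):
    """For each proposal, the intersection with each novel GT box divided by
    the GT box area; `proposals` is (N, 4), `gt_boxes` is (M, 4), and the
    result is an (N, M) matrix."""
    lt = torch.max(proposals[:, None, :2], gt_boxes[None, :, :2])   # top-left of intersection
    rb = torch.min(proposals[:, None, 2:], gt_boxes[None, :, 2:])   # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / gt_area[None, :]
```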
Figure 5.5 provides some visualizations of detected novel objects on the COCO benchmark. The first row presents results from the model trained with adapted CLIP features from CLIP Proposals (AFCP); the second row presents results from the model trained with raw CLIP features from RPN proposals (RFRP). The results show that although both models can localize the objects correctly, the AFCP model has higher classification accuracy than the RFRP model, thanks to the adapted features and the more meaningful distillation regions.
Fig 5.6 shows the visualization of using CLIP Proposals or RPN proposals as distillation regions in the COCO setting. The blue and green boxes represent the GT bboxes of the novel and base categories, and the red boxes represent the CLIP Proposals or RPN proposals with the highest IoU with the novel GT bboxes. The three images on the left show that the CLIP Proposals cover most of the novel category objects, although the boxes may not be accurate, while the RPN regards some of the novel objects as background and simply ignores them. Although the CLIP Proposals are not accurate, the features extracted from these boxes are accurate and meaningful.
Figure 5.6: Visualization of using CLIP Proposals or RPN proposals as distillation regions in the COCO setting. The blue and green boxes represent the GT bboxes of the novel and base categories. The red boxes represent the CLIP Proposals or RPN proposals with the highest IoU with the novel GT bboxes. The visualization shows that the CLIP Proposals can cover more novel objects even though the boxes may not be accurate.
[Figure 5.7 panel labels: "Unadapted CLIP Feature Space" and "Adapted CLIP Feature Space"]
Figure 5.7: The t-SNE embeddings of the COCO GT instance features from the unadapted CLIP and the adapted CLIP. The GT features from the adapted CLIP form denser clusters, indicating that the features become more discriminative and that CLIP has been adapted to the detection dataset domain.
boxes are accurate and meaningful. This phenomenon is also supported by the experiments in [51]. Therefore, using the CLIP Proposals as distillation regions provides more novel-category information and improves the detector's performance on novel categories.
Fig 5.7 shows the t-SNE embeddings of the COCO instance features from the unadapted CLIP and the adapted CLIP. We collect 20 GT instances for each base and novel category in the COCO setting, extract their
features from the unadapted CLIP or the adapted CLIP, and then generate the t-SNE embeddings from these features. The GT instances in the adapted CLIP feature space form dense clusters. This indicates that CLIP's feature space has been adapted to the COCO dataset domain and that the features become more discriminative after adaptation, improving the classification accuracy. The dots that do not form a dense cluster mostly come from the "person" category: since person instances usually appear together with instances of other categories and are occluded by other objects, their features are more scattered.
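As a minimal sketch of this visualization step (assuming precomputed per-instance feature arrays; the crop-and-encode details are omitted), the t-SNE embedding could be generated as follows with scikit-learn:

```python
import numpy as np
from sklearn.manifold import TSNE

# feats_unadapted, feats_adapted: (num_instances, dim) arrays of GT-box features
# labels: (num_instances,) array of category ids, 20 instances per category
def tsne_embed(features, seed=0):
    """Project instance features to 2D for visualization."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return TSNE(n_components=2, random_state=seed, init="pca").fit_transform(features)

# emb_unadapted = tsne_embed(feats_unadapted)
# emb_adapted = tsne_embed(feats_adapted)
# Each embedding can then be scattered and colored by `labels` to compare cluster density.
```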
5.5 Conclusion
We propose a new setting for detecting novel instances, called Zero-shot Annotation object Detection (ZAD), to move one step closer to real-life scenarios, and present our method, Efficient Feature Distillation for Zero-shot Annotation Detection (EZAD). EZAD successfully bridges the domain gap between the classification and detection datasets and selects meaningful distillation regions (CLIP Proposals) for obtaining knowledge from CLIP. Benefiting from these two improvements, EZAD outperforms previous distillation-based work on both COCO and LVIS with a much shorter training schedule. We believe our work provides a solid solution for applying zero-shot annotation detection in real life, and we hope our method can inspire other works in the future.
Chapter 6
ReCLIP: Refine Contrastive Language Image Pre-Training with Source
Free Domain Adaptation
As highlighted in earlier chapters, Vision-Language Models like CLIP [131] have paved the way for open-vocabulary, zero-shot classification. This chapter* delves into enhancing such classification capabilities on downstream tasks without the need for any additional labeled data.
In response to these challenges, we introduce ReCLIP, an innovative source-free domain adaptation method for Vision-Language Models (VLMs) that requires neither source data nor labeled target data. ReCLIP first learns a projection space to align visual and text embeddings and generate pseudo labels. It then employs cross-modality self-training with these pseudo labels to iteratively update the visual and text encoders, refine the labels, and bridge the domain gaps and misalignments. Through extensive experimentation, ReCLIP is demonstrated to significantly outperform all baselines, boosting the average accuracy of CLIP from 69.83% to 74.94% across 22 image classification benchmarks.
*This chapter is published as a WACV 2024 Oral paper [67] and has been selected in the WACV 2024 Best Paper Finalist:
Hu, Xuefeng, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, et al. "ReCLIP: Refine contrastive language
image pre-training with source free domain adaptation." In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pp. 2994-3003. 2024.
[Figure 6.1 legend: text embeddings of the ten CIFAR10 prompts "A photo of airplane", "A photo of automobile", "A photo of bird", "A photo of cat", "A photo of deer", "A photo of dog", "A photo of frog", "A photo of horse", "A photo of ship", "A photo of truck"]
Figure 6.1: The t-SNE plot of visual and text embeddings from CLIP on the CIFAR10 [86] test set. The misalignment in the vision-language space is clear: the text embedding of a class name is adjacent to those of other classes, but distant from the image embeddings of the same class.
6.1 Introduction
Despite the impressive zero-shot classification and semantic understanding ability that CLIP [131] gains from large-scale pre-training over 400 million image-caption pairs, we still observe domain gaps in both the image and text inputs that impact CLIP's performance. The visual domain gap between source and target images has long been a challenge for computer vision models [165, 28], and CLIP's visual embeddings have been observed to be limited when data comes from less common domains, e.g., PatchCamelyon [158], CLEVR [77], etc. On the other hand, the domain gap in text is also a challenge for vision-language models. The performance of CLIP is often limited by the text embeddings rather than the visual embeddings, especially on fine-grained datasets, e.g., RESISC45 [22] and Birdsnap [11], where CLIP creates distinctive visual embeddings but the text embeddings generated from class names fail to capture discriminative information.
In addition to the gaps in the visual and text domains, we have identified significant misalignment between visual and text embeddings across most datasets. Some recent studies [102, 151] have also observed
similar modality gaps across various contrastive-learned vision-language models. Figure 6.1 provides an example of this issue on the widely used benchmark CIFAR10. We believe there are two primary reasons for these misalignments. Firstly, text embeddings may be redundant: CLIP was trained to work with millions of captions and concepts, whereas target-domain categories might only activate a limited number of feature dimensions, leaving the remaining ones inactive and redundant; these redundant dimensions can dominate the similarity calculation. Secondly, visual embeddings may contain a significant amount of class-agnostic information: since CLIP uses real captions for training, it preserves rich information such as lighting, color, texture, and relationships, but only a small portion of this information is crucial for classification.
Therefore, adapting both the visual and text representations and re-aligning the visual and text embeddings are crucial for improving the target-domain performance of vision-language models like CLIP. However, traditional domain adaptation methods have significant limitations in this context. One major challenge is that these methods either require labeled target-domain examples (e.g., semi-supervised domain adaptation [186, 139, 30]) or source-domain examples (e.g., unsupervised domain adaptation [78, 117, 143]). In contrast, typical use cases of CLIP only have access to unlabeled target images, which calls for source-free unsupervised domain adaptation that needs neither source data nor labeled target data. Another challenge is that existing methods assume conditions that may not hold for vision-language models. For instance, most existing methods [101, 162, 184] assume a lightweight classifier, while a vision-language model uses a large text encoder to generate classification weights from category descriptions, a module that adds both flexibility and complexity to adaptation. Thus, the lack of labeled data from the source and target domains and the presence of multiple adaptable modules make it essential to develop a novel source-free domain adaptation algorithm for vision-language models.
More recently, POUF [151] also proposes to address the misaligned embeddings of a vision-language model through source-free adaptation. However, the unsupervised objective of POUF considers each target
example independently, instead of taking advantage of the neighboring relationships over the entire embedding space. Moreover, POUF cannot leverage multiple template-augmented text embeddings as used in CLIP and our proposed method, which limits its performance during adaptation.
To take advantage of the unified vision-language space, and to address the challenges of the visual and text domain gaps and the cross-modality misalignment, we propose ReCLIP, a novel source-free domain adaptation method to Refine CLIP models. Firstly, ReCLIP addresses the misalignment of visual and text embeddings from CLIP by learning a projection subspace that removes redundant dimensions and class-agnostic information and realigns the embeddings. ReCLIP then utilizes the neighboring relationships between the aligned embeddings and employs label propagation to produce accurate pseudo labels in the target domain. Secondly, ReCLIP leverages cross-modality self-training with high-confidence pseudo labels to iteratively refine the embedding spaces and label assignments. Two parallel components are deployed to update the text and visual encoders. The first component fine-tunes the text encoder while freezing the visual encoder, pulling the text embedding of a label closer to the embeddings of images assigned that label. Meanwhile, the second component fine-tunes the visual encoder so that images under the same label move closer to each other and to the multi-template augmented text embedding of the label. During fine-tuning, each component learns cross-modality consistency in the target domain, leading to new label assignments. ReCLIP selects labels agreed upon by both components as high-confidence ones for the next iteration. This iterative process improves the quality of the visual and text embeddings and significantly enhances the assignment of pseudo labels.
Our contributions are summarized as follows:
• We propose ReCLIP, a novel source-free domain adaptation method for vision-language models, which enhances CLIP's classification ability on target domains without labeled data;
• We identify the cross-modality misalignment issue between CLIP's visual and language embeddings, and address the issue with an efficient projection-based component in ReCLIP;
• We propose a novel cross-modality self-training algorithm that uses high-quality, commonly agreed pseudo labels and cross-modality consistency to mitigate the domain gaps from both visual and text inputs;
• With extensive experiments and ablation studies, ReCLIP produces consistent and significant improvements over CLIP and other baseline methods; ReCLIP improves the average accuracy of CLIP from 69.83% to 74.94% on 22 datasets.
6.2 Related Works
6.2.1 Large-Scale Vision-Language Models
As previously introduced in Section 4.2.3, large-scale pre-trained vision-language models such as CLIP [131] and ALIGN [75] have demonstrated impressive zero-shot classification ability and strong generalization ability. In addition to these large-scale pre-trained models, there are also models that focus on efficient training with additional self-supervised objectives, such as DeCLIP [28] and SLIP [116]. In this work, we adopt CLIP as our main base model, as it remains the most representative vision-language model, with outstanding zero-shot classification performance and publicly available model weights. In addition, we also demonstrate the effectiveness of our method with different base models in the ablation studies.
Augmented prompts through multiple templates. CLIP makes classification predictions by matching the visual embeddings of query images with the text embeddings of category names (wrapped in template text such as "A photo of a {}"), and selects the category with the highest cosine similarity as the prediction, as described in Section 4.2.3.
To further align these text embeddings with the pre-training distribution generated from real captions, CLIP prepares a long list of templates with various contexts for each of the 27 benchmarks it evaluates on.
Instead of using just one template, CLIP's reported scores are produced with the averaged text embeddings from a list of templated prompts for each category to boost performance.
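A minimal sketch of this multi-template averaging, assuming the open-source CLIP package and an illustrative two-template list (the actual per-dataset template lists are the ones released with CLIP):

```python
import torch
import clip

model, _ = clip.load("ViT-L/14")
templates = ["a photo of a {}.", "a blurry photo of a {}."]  # illustrative subset

@torch.no_grad()
def class_embeddings(class_names):
    """Average the text embeddings of each class over all templates."""
    device = next(model.parameters()).device
    weights = []
    for name in class_names:
        prompts = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(prompts)
        emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each template embedding
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())           # re-normalize the average
    return torch.stack(weights)                      # (num_classes, dim)
```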
Limitations of CLIP. We observe the following conditions under which CLIP's performance could be improved. 1) Inaccurate Text Description. The accuracy of CLIP can sometimes be drastically improved when the classification weights are fine-tuned with full supervision; e.g., on EuroSAT, the accuracy of CLIP improves from 59.9% to 98.2% [131]. This indicates that CLIP has good-quality default visual representations, but its zero-shot performance is limited by the quality of the text-generated classification weights. This is often observed on fine-grained datasets (e.g., AID [177], FGVC [112], EuroSAT [59], etc.), where the class names cannot fully capture the visual differences between classes (e.g., "737-200" and "747-200" as class names from FGVC). 2) Visual Gap. On some datasets, there are clear gaps for CLIP to be further improved even after fully supervised fine-tuning of the classification weights. For example, fine-tuned CLIP achieves only 42.9% on Country211 [131], and 85.97% on PatchCamelyon [158] (a binary classification task on which the state-of-the-art system achieves 97.50%). This indicates that the visual encoder of CLIP can also be further improved.
3) Visual-Text Misalignment. Recent studies [102, 154] have also shown that the modality gap between visual and text embeddings caused by contrastive pre-training can limit the performance of CLIP. By modifying the contrastive temperature during pre-training [102], or by minimizing the gap during few-shot fine-tuning [154], these works suggest that mitigating the modality gap benefits the classification ability of CLIP.
6.2.2 Unsupervised Domain Adaptation
Unsupervised Domain Adaptation (UDA) is a task aimed at improving target domain performance of models
that were pre-trained on a related but different source domain. Many techniques have been developed [78,
117, 143, 148, 169], including a recent method designed for visual-language models[89]. However, most
of these techniques are not ideal for the purpose of improving CLIP's zero-shot performance, as they often require access to source-domain data, whereas we do not require access to CLIP's training data.
As mentioned in Chapter 4, Source-Free Adaptation defines a more challenging setting than UDA, in which labeled training examples are available in neither the source nor the target domain. SHOT [101] is one of the first Source-Free Adaptation (SFDA) methods. SHOT updates the feature extractor with cluster-based pseudo labels and an information entropy loss, while keeping the classifier frozen. AaD [184] improves SHOT by replacing the information entropy loss with a novel Attracting-and-Dispersing (AaD) loss. This simple but effective approach achieves state-of-the-art performance on the SFDA task.
More recently, POUF [151] also proposes to mitigate the misaligned embeddings through source-free domain adaptation for vision-language models. However, the optimization objective of POUF limits its performance in two ways: 1) the training of POUF imposes a dependency on the number of text encoder inputs (prompts), which prevents POUF from using multiple templates to boost performance, especially on datasets with a large number of classes; 2) the training objective considers each image separately and fails to leverage neighboring relationships.
6.2.3 Learnable Language Prompt
During the large-scale contrastive pre-training, CLIP [131] was trained to match the visual and text embeddings of training images and their caption sentences, such as "A Golden Retriever dog sitting on grass". However, at inference time, category descriptions are usually provided as phrases such as "Golden Retriever" or just "Dog" rather than as complete caption sentences. To mitigate this gap, CLIP proposes to use templates that wrap the category description phrase into a complete sentence to generate better text embeddings.
For optimal performance, CLIP [131] further notes that specific templates that provide context for the category names can help generate better text embeddings for classification. For example, CLIP finds
[Figure 6.2 content: the prompt "A Photo of Dog" is formed from a template and the category name "Dog", tokenized into embeddings including [BOS], [EOS], tokens t0–t5, and the learnable token t*, and passed through the Text Encoder to produce the resulting text embedding t]
Figure 6.2: Demonstration of the design of the Learnable Prompt. t* represents a learnable token embedding that is inserted at the beginning of the sequence of inputs to the transformer-based text encoder. "BOS" and "EOS" stand for "beginning of sentence" and "end of sentence"; they serve as the special tokens for the text encoder to identify the beginning and end of the input sentence.
the template prompt "A photo of {category name}, a type of pet" works best for Oxford-IIIT Pet [125]. CLIP has designed different lists of template prompts for all the datasets it was evaluated on. The details can be found in the official GitHub repository: https://github.com/openai/CLIP/blob/main/data/prompts.md.
As demonstrated by CLIP [131], well-chosen template prompts can play a vital role in generating accurate text embeddings. However, this process largely depends on heuristic design. Our goal for the learnable language prompt design is to make the prompt learnable and to avoid having different template prompts for different datasets. Additionally, this can also serve as an efficient and stable way to fine-tune the performance of CLIP.
We start from the default template prompt "A photo of {category name}", and insert an additional learnable token embedding t* at the beginning of the sentence, right after the Begin-Of-Sentence (BOS) token, as shown in Figure 6.2. t* is initialized with the embedding value of the word "is" so that it provides reasonable performance before it is fine-tuned. During the fine-tuning process, the token t* is made learnable while the token embeddings of all other words are kept fixed.
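As a rough PyTorch sketch of this idea (the module and attribute names below, such as `token_embedding`, the insertion index, and dropping the final padding token to preserve the context length, are assumptions based on a CLIP-like text encoder rather than the exact implementation):

```python
import torch
import torch.nn as nn

class LearnablePromptToken(nn.Module):
    """Inserts one learnable token embedding right after [BOS] in the token sequence."""

    def __init__(self, token_embedding: nn.Embedding, is_token_id: int):
        super().__init__()
        # Initialize the learnable token t* with the embedding of the word "is".
        init = token_embedding.weight[is_token_id].detach().clone()
        self.t_star = nn.Parameter(init)            # the only trainable parameter here
        self.token_embedding = token_embedding      # frozen word embeddings

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) for "A photo of {category name}" prompts
        with torch.no_grad():
            emb = self.token_embedding(token_ids)   # (batch, seq_len, dim)
        t_star = self.t_star.expand(emb.size(0), 1, -1)
        # Keep [BOS] first, insert t*, then the rest of the prompt tokens
        # (the last padding token is dropped so the sequence length is unchanged).
        return torch.cat([emb[:, :1], t_star, emb[:, 1:-1]], dim=1)
```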
6.3 Method
[Figure 6.3 content (left): visual and text embeddings under the three projections P0 (identity), P1 = UU^⊤, and P2 = U'U'^⊤, with inter-class text-text similarities of 81.63%, 81.63%, and 0.05% and intra-class visual-text similarities of 22.79%, 92.96%, and 90.19%; (right): labeled text embeddings and unlabeled visual embeddings before and after Label Propagation, which produces pseudo labels]
Figure 6.3: Demonstration of Feature Redundancy Removal (left) and Label Propagation (right). Left: P0 shows the original distribution of visual and text embeddings of CLIP, where text embeddings are close to each other but distant from visual embeddings; P1 = UU^⊤ removes the class-agnostic information from visual embeddings and pulls visual and text embeddings closer; P2 = U'U'^⊤ further separates the text embeddings by removing the redundant information from them. The similarity values in this example are calculated from average statistics on the CIFAR10 test set. Right: the Label Propagation process generates pseudo labels for unlabeled training images by propagating label information from labeled text embeddings (category names) to unlabeled visual embeddings (training images) through nearest-neighbor connections.
We describe our method, ReCLIP, which refines CLIP's classification performance with access only to the pre-trained model and the target domain data listed below:
• the pre-trained vision-language model M = {MT , MV }, with text encoder MT and visual encoder MV ;
• unlabeled target images X = {x1, x2, ..., xn};
• target class names C = {c1, c2, ..., cm}.
Our goal is to increase the classification accuracy of M on the target data X. As the first method to study the source-free adaptation problem for vision-language models, we approach this problem in two steps: (1) we align visual and text embeddings by removing class-agnostic and redundant information in a learned projection space (Section 6.3.1), and show how to assign pseudo labels to images in the projection space via label propagation (Section 6.3.2); (2) we utilize the pseudo labels to further mitigate the visual and text domain gaps by efficiently updating both the visual and text encoders, through a cross-modality self-training algorithm that updates embeddings and pseudo labels in an iterative fashion (Section 6.3.3).
6.3.1 Projection Space to Align Visual and Text
Figure 6.1 demonstrates the misalignment issue of text and visual embeddings on CIFAR10 [86], which we have also observed across all the ablation datasets. The plot indicates that the text embeddings of different class names are closer to each other than to the images of the corresponding categories. We also validate the misalignment with quantitative statistics, as shown in Figure 6.3: the average cosine similarity between text embeddings is 82%, while the average similarity between visual and text embeddings of the same category is only 23%. This indicates that the unified vision-language space of CLIP is far from well aligned.
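The two statistics above can be reproduced with a few lines of tensor code; the sketch below assumes L2-normalized embedding matrices and is only meant to illustrate how the averages are computed.

```python
import torch

def misalignment_stats(text_emb, image_emb, image_labels):
    """text_emb: (m, d) normalized class-name embeddings;
    image_emb: (n, d) normalized image embeddings;
    image_labels: (n,) ground-truth class index of each image."""
    # Average inter-class text-text cosine similarity (off-diagonal entries only).
    tt = text_emb @ text_emb.t()
    m = text_emb.size(0)
    inter_text = (tt.sum() - tt.diag().sum()) / (m * (m - 1))

    # Average intra-class visual-text cosine similarity.
    it = image_emb @ text_emb.t()                       # (n, m)
    intra = it[torch.arange(image_emb.size(0)), image_labels].mean()
    return inter_text.item(), intra.item()
```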
As highlighted in Section 6.1, although the visual and text embeddings from CLIP convey rich information, much of it can be redundant and class-agnostic for the target classification task. This redundancy can result in misalignment between the text and visual embeddings. We therefore propose a projection-based method to eliminate the redundancy from both visual and text embeddings.
Remove class-agnostic information from visual embeddings. A straightforward way to remove class-agnostic information from visual features is to project all visual embeddings onto the span of the text embeddings. Assume we have a d-dimensional representation space R^d and m classes whose text embeddings are T = [t1, ..., tm] ∈ R^{m×d}, where ti = Mt(ci) for i ∈ {1, 2, ..., m}. With the Singular Value Decomposition

U, S, V = svd(T),

we get U = [e1, e2, ..., em] as an orthonormal basis of the span of T, which defines a projection matrix P1 = UU^⊤. Then, for every f ∈ R^d, we can calculate f' = fP1 with

ek · (f − f') = 0,  ∀k ∈ {1, ..., m},
where f − f' is the class-agnostic information that does not contribute to the classification. As shown in Figure 6.3, P1 increases the average similarity between image and text embeddings of the same category to 92.96% on CIFAR10.
Remove redundant information from text embeddings. As suggested by Principal Component Analysis, the first basis vector e1 of U is the major component on which most of {t1, ..., tm} overlap. Removing this major component e1 makes all text embeddings nearly perpendicular to each other. Therefore, with U' = [e2, e3, ..., em] we define a new projection matrix P2 = U'U'^⊤. As shown in Figure 6.3, P2 successfully separates the text embeddings of different classes to an average cosine similarity of 0.05%, while maintaining a high intra-class visual-text similarity of 90.19% on CIFAR10.
In addition to the improvement in the CIFAR10 statistics, experiments on pseudo label generation also indicate the effectiveness of the embedding space induced by P2 in improving clustering performance, as demonstrated in Section 6.4.3.2.
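For concreteness, a minimal NumPy sketch of the two projections, assuming a row matrix `T` of L2-normalized class-name embeddings (the variable names are illustrative; the basis is taken from the right singular vectors of T, which span the same subspace the chapter denotes by U = [e1, ..., em]):

```python
import numpy as np

def projection_matrices(T):
    """T: (m, d) matrix of class-name text embeddings.

    Returns P1, which projects features onto the span of the text embeddings,
    and P2, which additionally drops the shared major component so that the
    projected text embeddings become nearly perpendicular to each other."""
    _, _, Vt = np.linalg.svd(T, full_matrices=False)   # Vt: (m, d)
    E = Vt.T                                           # (d, m), columns span the text subspace
    P1 = E @ E.T                                       # project onto span of text embeddings
    P2 = E[:, 1:] @ E[:, 1:].T                         # drop the first (shared) component
    return P1, P2

def project(features, P):
    """Apply a projection and re-normalize, as done before label propagation."""
    out = features @ P
    return out / np.linalg.norm(out, axis=-1, keepdims=True)
```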
[Figure 6.4 content: the Visual Encoder and Text Encoder produce visual and text embeddings from images and prompts such as "A photo of Bird/Cat/Deer/Dog"; both are multiplied by the projection matrix P2; the projected embeddings go through Label Propagation to obtain pseudo labels; cosine similarities to the projected text embeddings (ReCLIP-T) or to cluster centers (ReCLIP-V) are compared with the pseudo labels via a cross-entropy loss, and the gradients update the layer-norm weights of the text encoder (ReCLIP-T path) or the visual encoder (ReCLIP-V path)]
Figure 6.4: Flow chart of ReCLIP-V and ReCLIP-T. Orange symbols describe the loss and gradient path of ReCLIP-V, and blue symbols describe the loss and gradient path of ReCLIP-T. Black symbols describe the steps common to both ReCLIP-V and ReCLIP-T.
[Figure 6.5 content: ReCLIP-T and ReCLIP-V each produce pseudo labels; both models are updated with the commonly agreed labels]
Figure 6.5: Flow chart of Pseudo Label Sharing. The cross-modality self-training algorithm merges the pseudo labels from ReCLIP-T and ReCLIP-V at the end of each epoch and updates the encoders only on the high-confidence pseudo labels agreed upon by both.
6.3.2 Pseudo Label Generation for VLM
The projection matrix P2 removes the redundancies and aligns the visual and text embeddings, which enables the generation of pseudo labels through Label Propagation [74], a semi-supervised learning method that propagates label information from labeled to unlabeled data points through nearest-neighbor connections, as demonstrated in Figure 6.3. Although in source-free adaptation we do not have access to labeled data points, the embedding alignment through P2 allows us to treat the text embeddings of class names as labeled points and the visual embeddings of images as unlabeled points.
With labeled examples {t̂i} (the m class-name embeddings) and unlabeled examples {v̂j} (the n image visual embeddings), we form the union set

L = [t̂1, t̂2, ..., t̂m, v̂1, v̂2, ..., v̂n] ∈ R^{d×(m+n)}.

Following Label Propagation [74], we first produce the affinity matrix Ak through k-nearest-neighbor affinity ranking, Ak = topk(L^⊤L), where topk(·) keeps the top k highest values per row of the full affinity matrix L^⊤L. Then, with normalization and symmetrization, we have

W = D^{-1/2}(Ak + Ak^⊤)D^{-1/2},
where D := diag(W1_{m+n}) is the degree matrix, 1_{m+n} is the all-ones (m+n)-vector, and W is the normalized adjacency matrix that defines the random-walk probability. A label matrix Y ∈ R^{(m+n)×m} is defined with elements

Yji := 1 if j = i and j ≤ m, and 0 otherwise,

i.e., Yji is 1 for the text embedding entries at the corresponding column, and 0 otherwise. Then, the pseudo label matrix Z can be estimated by solving the random-walk problem with initial state Y, propagation probability matrix W, and diffusion magnitude α:

Z := (I − αW)^{-1} Y,    (6.1)

where (I − αW)^{-1} is the closed-form solution to the random-walk problem. As (I − αW) ∈ R^{(m+n)×(m+n)} is not sparse, the calculation of its inverse is very time-consuming; we therefore use the conjugate gradient (CG) method to approximately solve Equation 6.1, following the suggestion from [74]. Finally, with Equation 6.1 solved, the pseudo label is given by

ỹj := argmax_i z_{m+j, i},

where ỹj is the pseudo label of image xj, and z_{j,i} is the (j, i) element of matrix Z.
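A compact sketch of this pseudo-label generation step using SciPy's conjugate gradient solver; the clipping of negative affinities and the dense matrix construction are simplifications relative to the actual implementation, while k = 20 and α = 0.99 follow the values reported later in the implementation details.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def label_propagation(text_emb, image_emb, k=20, alpha=0.99):
    """text_emb: (m, d) projected class embeddings (the labeled points);
    image_emb: (n, d) projected image embeddings (the unlabeled points).
    Returns pseudo labels for the n images."""
    L = np.concatenate([text_emb, image_emb], axis=0)      # (m+n, d), rows are embeddings
    m, n = text_emb.shape[0], image_emb.shape[0]

    # k-NN affinity: keep the top-k similarities per row (excluding self-similarity).
    sims = L @ L.T
    np.fill_diagonal(sims, -np.inf)
    A = np.zeros_like(sims)
    topk = np.argpartition(-sims, k, axis=1)[:, :k]
    rows = np.arange(m + n)[:, None]
    A[rows, topk] = np.clip(sims[rows, topk], 0, None)

    # Symmetrize and normalize: W = D^{-1/2} (A + A^T) D^{-1/2}.
    W = A + A.T
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    W = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Solve (I - alpha * W) Z = Y column by column with conjugate gradient.
    Y = np.zeros((m + n, m))
    Y[np.arange(m), np.arange(m)] = 1.0
    Amat = sp.csr_matrix(np.eye(m + n) - alpha * W)
    Z = np.stack([cg(Amat, Y[:, c], atol=1e-6)[0] for c in range(m)], axis=1)

    return Z[m:].argmax(axis=1)                             # pseudo label per image
```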
6.3.3 Source-Free Adaptation for Vision-Language Model via Cross-Modality Self-Training
Vision-language models present a new challenge to adaptation algorithms, as both the visual and text encoders need to be adapted. In this section, we discuss how to mitigate the domain gaps in the visual and text domains, and propose a cross-modality self-training algorithm that uses the pseudo labels from Section 6.3.2 to iteratively update the label assignments and the visual and text encoders.
Algorithm 4 Visual and Text Encoder Self-Training: ReCLIP-V and ReCLIP-T
Require: Vision-Language Pre-trained Model M = {Mv, Mt}
Require: Unlabeled Images X = {x1, ..., xn}
Require: Class Names C = {c1, ..., cm}
Require: Mode = ReCLIP-V or ReCLIP-T    ▷ ReCLIP-V updates Mv with Mt frozen; ReCLIP-T updates Mt with Mv frozen
for epoch ← 1 to Max Epoch do
    {t1, ..., tm} ← Mt({c1, ..., cm})
    {v1, ..., vn} ← Mv({x1, ..., xn})    ▷ Calculate visual and text embeddings
    U, S, V ← svd([t1, ..., tm]), where U = [e1, ..., em]
    P2 ← [e2, ..., em][e2, ..., em]^⊤    ▷ Prepare projection matrix with Singular Value Decomposition
    t̂i ← tiP2 / ∥tiP2∥,  v̂j ← vjP2 / ∥vjP2∥    ▷ Align visual and text embeddings in the projection space
    L ← {t̂1, ..., t̂m, v̂1, ..., v̂n}
    Ỹ ← Label_Propagation(L)    ▷ Generate pseudo labels through Label Propagation
    if Mode = ReCLIP-T then
        Ŷ ← [v̂1, ..., v̂n]^⊤[t̂1, ..., t̂m]    ▷ Generate predictions through cosine similarity
        LossT ← Cross-Entropy(Ŷ, Ỹ)
        Back-propagate over Mt
    else if Mode = ReCLIP-V then
        wi ← (Σ_{j: Ỹj=i} vj) / (Σ_{j: Ỹj=i} 1), for i ∈ {1, 2, ..., m}    ▷ Average embedding of each class i
        ŵi ← wi / ∥wi∥, for i ∈ {1, 2, ..., m}
        Ŷ ← [v̂1, ..., v̂n]^⊤[ŵ1, ..., ŵm]    ▷ Generate predictions through cosine similarity
        LossV ← Cross-Entropy(Ŷ, Ỹ)
        Back-propagate over Mv
    end if
end for
The self-training algorithm of ReCLIP consists of two parallel components. ReCLIP-T aims at closing the text domain gap by pushing text embeddings towards the visual embeddings of the same class, fine-tuning the text encoder with the visual encoder frozen. ReCLIP-V aims at closing the visual domain gap by pushing visual embeddings of the same class closer to each other, fine-tuning the visual encoder with the text encoder frozen. On top of ReCLIP-V and ReCLIP-T, we integrate the commonly agreed pseudo labels to produce high-confidence training signals. For inference, we add the prediction logits from both ReCLIP-V and ReCLIP-T to make the final prediction.
ReCLIP-T: Text Encoder Training. We optimize the text encoder Mt with a simple cross-entropy loss LossT := CE(Ŷ^T, Ỹ) between the pseudo labels Ỹ and the cosine-similarity prediction logits Ŷ^T = [v̂1, ..., v̂n]^⊤[t̂1, ..., t̂m].
Algorithm 5 ReCLIP with Pseudo Label Sharing
Require: Component 1 M1 = {M1v, M1t} (for ReCLIP-V)
Require: Component 2 M2 = {M2v, M2t} (for ReCLIP-T)
Require: Unlabeled Images X = {x1, ..., xn}
Require: Class Names C = {c1, ..., cm}
Self-Training Adaptation Stage:
for epoch ← 1 to Max Epoch do
    Ŷ1, Ỹ1 ← ReCLIP-V(M1, X, C)
    Ŷ2, Ỹ2 ← ReCLIP-T(M2, X, C)    ▷ ReCLIP-V/T generate predictions Ŷ1, Ŷ2 and pseudo labels Ỹ1, Ỹ2
    Commonly Agreed Index Map Q ← (Ỹ1 = Ỹ2)    ▷ Boolean index; True indicates Ỹ1 agrees with Ỹ2
    LossV ← Cross-Entropy(Ŷ1[Q], Ỹ1[Q])
    LossT ← Cross-Entropy(Ŷ2[Q], Ỹ2[Q])    ▷ Only calculate loss on entries where Q is True
    Back-propagate M1v with LossV
    Back-propagate M2t with LossT
end for
Inference Stage:
Ŷ1 ← ReCLIP-V(M1, X, C)    ▷ Generate inference predictions from ReCLIP-V/T
Ŷ2 ← ReCLIP-T(M2, X, C)    ▷ At inference time, ReCLIP-V/T skip the pseudo label generation
Ŷ ← (Ŷ1 + Ŷ2) / 2    ▷ Aggregate prediction logits from both components
return argmax_i ŷji as the prediction for image xj    ▷ Ŷ = {ŷji}, where ŷji is the probability of image xj on class i
The objective of the adaptation on the text encoder is to push the text embeddings {t̂i} closer to the image embeddings {v̂j} of the same class, based on the pseudo label assignments Ỹ. Figure 6.4 presents the details of ReCLIP-T.
ReCLIP-V: Visual Encoder Training. The goal of visual encoder adaptation is to push the visual embeddings {v̂j} of the same class closer to each other, to form a better feature space for classification. As contrastive loss is expensive and places constraints on the batch size, we instead push each visual embedding closer to the center of its class rather than to other visual embeddings. Specifically, in ReCLIP-V we optimize the visual encoder Mv with a cross-entropy loss LossV := CE(Ŷ^V, Ỹ) between the pseudo labels Ỹ and the cosine-similarity logits Ŷ^V = [v̂1, ..., v̂n]^⊤[ŵ1, ..., ŵm], where ŵ1, ..., ŵm are the class centers calculated based on Ỹ. Figure 6.4 presents the details of ReCLIP-V.
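As a rough PyTorch sketch of the ReCLIP-V objective described above (projected, L2-normalized embeddings and pseudo labels are assumed to be given; the temperature value is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def reclip_v_loss(visual_emb, pseudo_labels, num_classes, temperature=0.01):
    """visual_emb: (n, d) projected and L2-normalized visual embeddings;
    pseudo_labels: (n,) pseudo label of each image (long tensor)."""
    # Class centers: average the embeddings assigned to each class, then normalize.
    centers = torch.zeros(num_classes, visual_emb.size(1), device=visual_emb.device)
    centers.index_add_(0, pseudo_labels, visual_emb)
    counts = torch.bincount(pseudo_labels, minlength=num_classes).clamp(min=1)
    centers = centers / counts.unsqueeze(1)
    centers = F.normalize(centers, dim=1)

    # Cosine-similarity logits against the class centers, then cross-entropy
    # with the pseudo labels.
    logits = visual_emb @ centers.t() / temperature
    return F.cross_entropy(logits, pseudo_labels)
```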
High-Confidence Pseudo Label Sharing. ReCLIP-V updates the similarities among visual embeddings with LossV, while ReCLIP-T updates the projection matrix and text embeddings with LossT. As these two modules separately optimize the visual and text encoders with different objectives, their pseudo labels may start to diverge after a certain number of epochs, resulting in different views in which only the commonly agreed samples are likely to be correctly classified. ReCLIP therefore collects pseudo labels from both ReCLIP-V and ReCLIP-T at the end of each epoch, and updates both models with only the commonly agreed pseudo labels Ỹ, as illustrated in Figure 6.5.
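The label-sharing step itself reduces to a simple agreement mask; a minimal sketch (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def shared_losses(logits_v, labels_v, logits_t, labels_t):
    """Compute LossV and LossT only on examples where the two components'
    pseudo labels agree (the commonly agreed, high-confidence subset)."""
    agree = labels_v.eq(labels_t)                      # boolean index map Q
    if agree.sum() == 0:
        zero = logits_v.sum() * 0.0                    # keep the graph valid if nothing agrees
        return zero, zero
    loss_v = F.cross_entropy(logits_v[agree], labels_v[agree])
    loss_t = F.cross_entropy(logits_t[agree], labels_t[agree])
    return loss_v, loss_t
```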
6.3.4 Algorithms
As described in Section 6.3.3, ReCLIP is composed of two parallel components designed for visual and text encoder fine-tuning, namely ReCLIP-V and ReCLIP-T. On top of ReCLIP-T and ReCLIP-V, we integrate the pseudo labels by keeping only the commonly agreed ones, producing high-confidence training signals for both sides. In this section, we present the detailed description of ReCLIP-T and ReCLIP-V in Algorithm 4, and the pseudo label sharing in Algorithm 5.
6.4 Experiment and Results
Baselines. We use the following methods for comparison:
• CLIP [131]: the state-of-the-art zero-shot image classification model. We choose CLIP with the ViT-L/14 architecture as the main baseline model for comparison and adaptation. We report both the published results from Radford et al. [131] and our reproduction, denoted as report and multi respectively. Both report and multi are prepared with the official prompt template lists provided by Radford et al. [131]. In addition, we also report the results we reproduced with a single template ("A photo of a {}"), denoted as single;
• AaD [184]: a state-of-the-art SFDA method. We adapt the official code to apply it to CLIP and our benchmarks;
• POUF [151]: a recent SFDA method that also aims to mitigate misaligned visual and text embedding spaces. Since POUF does not report on the benchmarks where CLIP has published scores, we produce its results on these benchmarks using its official code. We report the best-performing version of POUF, which fine-tunes the entire model.
Evaluation and Datasets.
• Main Results: for the SFDA comparison between ReCLIP, POUF, AaD, and the base model CLIP, we use a comprehensive list of 21 common image classification benchmarks out of the 27 benchmarks from Radford et al. [131], excluding the 6 datasets on which CLIP is evaluated with custom splits or protocols that had not been released at the time of this submission (KITTI [47], UCF101 [146], VOC2007 [37], Kinetics700 [18], HatefulMemes [79], CLEVR [77]). SFDA evaluation is performed on 22 benchmarks in total: the ablation dataset AID [177], which we used for hyper-parameter selection, and the 21 benchmarks (Caltech101 [95], CIFAR10 [86], CIFAR100 [86], ImageNet [32], SUN397 [179], Birdsnap [11], Country211 [131], DTD [23], EuroSAT [59], FER2013 [189], FGVC [112], Flowers [120], Food101 [15], GTSRB [147], MNIST [33], Oxford Pet [125], PCam [158], SST2 [131], RESISC45 [22], Cars [83], STL10 [25]) from the 27 benchmarks CLIP reported in Radford et al. [131]. Detailed statistics on the number of images and number of classes are reported in Table 6.9.
• Comparison with POUF: for an additional comparison with POUF on its published scores, we evaluate ReCLIP on Office-Home [159], which contains 15,588 images from four different domains: 2,427 Art (Ar) images, 4,365 Clipart (Cl) images, 4,439 Product (Pr) images, and 4,357 Real-World (Rw) images.
• Ablation Studies: we choose AID, CIFAR10, CIFAR100 and SUN397 as ablation datasets to represent
datasets with different sizes and characteristics.
For SFDA evaluation in Section 6.4.1, AaD and ReCLIP use CLIP-multi as base model, and POUF uses
CLIP-single due to its design. For experiments on Office-Home, both ReCLIP and POUF use CLIP-single
as base model.
Unless otherwise specified, we perform our experiments in a transductive manner, where the SFDA methods ReCLIP, POUF, and AaD first perform adaptation on the unlabeled test data of each dataset, and the adapted models are then evaluated on the same test data following the standard CLIP inference protocol. For all benchmarks, we use top-1 classification accuracy as our metric.
Implementation Details. For the self-training of ReCLIP, we fine-tune the layer-normalization [3] weights with all other weights frozen, as this has been shown to be one of the most effective and stable options for adapting models with noisy supervision [162]. For the SFDA evaluation, we use AID [177] to select the best hyper-parameters for ReCLIP, POUF, and AaD, and then use the same set of hyper-parameters for all 22 datasets during evaluation. We match the maximum number of adaptation steps for all methods, set to min{5000 iterations, 50 epochs}. For the evaluation on Office-Home, we select the hyper-parameters on the Real-World (Rw) domain and use the same hyper-parameters across all domains for evaluation.
For ReCLIP, we use a learning rate of 10^-3, weight decay of 10^-4, momentum of 0.9, a batch size of 64, a maximum length of min{5000 iterations, 50 epochs}, and SGD optimization on both the visual and text encoders. For Birdsnap, Country211, SUN397, and ImageNet, which have more than 200 classes, we use a batch size of 32, due to the large memory occupied by the text inputs, to fit the training on a single V100 GPU. For Label Propagation, we use propagation strength α = 0.99 and neighbor size k = 20. For datasets with more than 500 classes (Birdsnap, ImageNet), we notice that the accuracy of the pseudo labels generated by label propagation becomes unstable, and additional hyper-parameter tuning is required to achieve good performance. To maintain stable performance, we turn off label propagation and simply use model predictions as
pseudo labels on datasets with over 500 categories (Birdsnap, ImageNet). For all other datasets, we follow the exact process described in Algorithms 4 and 5.
For both AaD and POUF, we tested different hyper-parameters and report the best-performing settings: a learning rate of 10^-3, weight decay of 10^-3, momentum of 0.9, and SGD optimization for AaD, and a learning rate of 10^-2, weight decay of 10^-3, momentum of 0.9, and SGD optimization for POUF. For both AaD and POUF, we extended their default training length to match the training length of ReCLIP, with a batch size of 64 for min{5000 iterations, 50 epochs} steps on AaD, and a batch size of 32 for min{10000 iterations, 100 epochs} steps on POUF.
For ReCLIP on Office-Home, we use the Real-World (Rw) domain to choose the hyper-parameters. We use the SGD optimizer with a learning rate of 10^-2 on the visual encoder and 10^-3 on the text encoder, a batch size of 64, and 5000 iterations as the maximum number of steps across all domains. For label propagation, we use k = 10 due to the smaller dataset size.
6.4.1 Main Results
Table 6.1 columns: Avg Acc, AID[177], Birdsnap[11], Caltech101[95], CIFAR10[86], CIFAR100[86], Country211[131], DTD[23], EuroSAT[59], FER2013[189], FGVC[112], Flowers[120], Food101[15], GTSRB[147], ImageNet[32], MNIST[33], Oxford Pet[125], PCam[158], SST2[131], RESISC45[22], Cars[83], STL10[25], SUN397[179]
CLIP-report 70.08 - 48.30 92.6* 96.20 77.90 32.70 55.30 59.90 57.50 36.1* 78.7* 92.90 50.30 75.30 87.20 93.50 58.80 64.00 71.60 77.3* 99.30 67.70
CLIP-single 65.53 61.30 51.88 92.02 95.19 77.18 25.78 52.50 56.03 52.22 30.18 74.19 92.56 45.57 73.46 52.63 93.21 57.75 52.39 63.29 76.45 99.47 66.42
CLIP-multi 69.83 68.73 52.48 91.63 95.60 78.22 31.84 55.37 60.00 56.39 31.59 79.04 93.08 50.59 75.52 76.23 93.62 62.43 68.92 69.66 77.88 99.36 67.97
AaD 46.53 69.83 52.42 91.45 96.54 80.18 0.47 55.43 11.12 16.91 32.37 78.61 0.99 51.26 0.11 89.81 93.62 49.95 49.92 2.51 0.52 99.41 0.25
AaD peak 71.79 70.33 52.58 91.93 96.55 80.46 31.90 55.59 76.18 55.67 32.43 79.22 93.04 52.83 75.53 91.95 93.73 64.03 68.97 71.01 77.96 99.42 67.96
POUF 69.73 64.83 52.91 92.97 96.06 80.39 28.19 56.65 67.95 55.92 32.88 75.62 92.71 51.47 73.05 91.22 94.20 66.57 48.22 67.54 76.72 99.50 68.38
POUF peak 69.76 64.87 52.96 92.97 96.06 80.39 28.22 56.75 67.95 55.92 32.91 75.62 92.73 51.47 73.06 91.22 94.20 66.75 48.60 67.54 76.72 99.53 68.38
ReCLIP 74.94 77.97 52.96 93.02 96.95 82.32 31.92 60.85 78.75 58.07 36.63 82.05 94.15 66.81 75.81 90.88 95.61 70.15 73.48 78.41 77.96 99.58 74.41
ReCLIP peak 75.85 79.27 53.28 93.10 97.04 83.42 31.95 61.38 79.94 58.29 38.70 83.14 94.18 69.14 76.01 97.11 96.05 70.56 73.48 79.31 79.26 99.59 74.53
Table 6.1: Classification accuracies (%) on 22 benchmarks. *On FGVC, Caltech101, Oxford-IIIT Pet, and Flowers102, CLIP reported mean per-class accuracy. All other scores in this table are top-1 accuracy.
In Table 6.1 we present the SFDA accuracy of ReCLIP, AaD and POUF over 22 datasets. Besides the
accuracy from the final epoch of self-training, we report the accuracy from the peak-performing epoch for
AaD, POUF and ReCLIP as well, denoted as peak.
ReCLIP achieves consistent and significant improvements over CLIP on 21 datasets and comparable performance on Country211. ReCLIP improves the average top-1 accuracy of CLIP by 5.11% and 6.02% at the final and peak epochs respectively over the 22 datasets, without accessing any labeled data, and outperforms both baseline adaptation methods, AaD and POUF, by a clear margin.
AaD achieves a 1.96% improvement over CLIP at its peak epochs. However, it encounters drastic performance drops at the final epochs, losing 25.26% of the average accuracy due to collapsed unsupervised training on target datasets such as Food101, SUN397, and ImageNet. Meanwhile, ReCLIP maintains its performance at the final epochs, with only a 0.91% difference from the peak epochs. These results suggest the effectiveness of the high-quality, commonly agreed pseudo labels of ReCLIP in stabilizing the noisy self-training and preventing model collapse.
POUF achieves a 4.20% improvement over its base model, CLIP-single. However, this improvement is counteracted by its inability to employ multiple prompts to enhance the text embedding quality, as suggested by CLIP [131]. Multiple templates create a large number of prompts, which are unlikely to fit in the same mini-batch for text encoder optimization. ReCLIP also experiences this limitation when fine-tuning the text encoder. However, thanks to the dual-component structure of ReCLIP, although ReCLIP-T also only uses a single template for text-encoder optimization, ReCLIP-V can still take advantage of the multi-template augmented text embeddings and provides better pseudo labels to ReCLIP-T through pseudo-label sharing. In addition to the advantage brought by the multi-template augmented text embeddings, ReCLIP also takes advantage of the neighboring relationships over the entire visual-text embedding space, while POUF does not, which also contributes to ReCLIP's better performance. More evidence and discussion on this are covered in Section 6.4.2.
Avg Ar Cl Pr Rw
CLIP single 82.45 82.70 68.10 89.10 89.90
POUF-prompt 84.28 83.70 71.20 91.40 90.80
POUF 86.10 86.20 73.80 92.70 91.70
Label Propagation 84.94 83.27 73.49 91.89 91.09
ReCLIP 87.00 86.11 75.97 93.90 92.01
Table 6.2: Comparison of ReCLIP and published scores from POUF[151] on Office-Home[159], both use
CLIP-single as base model.
Country211 is designed to predict geo-location from visual appearance, whereas CLIP tends to describe an image by its actual content and texture. As shown in [131], CLIP can only achieve 42.9% even after its classifier is fine-tuned in a fully supervised way. It is therefore challenging to obtain improvement on this dataset through source-free domain adaptation.
6.4.2 Comparison with POUF
In Table 6.2 we present the comparison between the published scores of POUF and ReCLIP on the OfficeHome, where both methods use CLIP-single (ViT/B-16) as base model. We also include the Label Propagation pseudo label accuracy generated on our projected CLIP embeddings prior to any updates on the base
model. It is shown that the Label Propagation accuracy already outperforms POUF-prompt, which finetunes the learnable text prompt. Moreover, ReCLIP achieves clear improvement over POUF over most of
the domains, with 2.17%↑ on Cl, 1.20%↑ on Pr, 0.31%↑ on Rw and on-par performance on Ar. These results
indicate that ReCLIP can still outperform POUF without using multi-template augmented embeddings.
6.4.3 Ablation Studies
In this section, we present ablation studies comparing different versions of ReCLIP, different pseudo label generation methods, and various VLMs as base models. We use AID, CIFAR10, CIFAR100, and SUN397 as our ablation datasets, and the test set of each dataset is equally split into two fixed partitions. We report
the ablation results in an inductive manner, where models are first adapted on partition 1 and then evaluated on partition 2. Note that the results in this section are not directly comparable to those in Section 6.4.1 because of the different evaluation partitions.
6.4.3.1 Effectiveness of ReCLIP Components
CIFAR10 CIFAR100 AID SUN397
Vanilla CLIP 95.54 76.48 64.87 67.25
Label Propagation 96.38 80.66 74.73 70.54
ReCLIP-V 96.69 80.84 79.47 67.15
ReCLIP-T 96.50 81.10 79.07 70.12
ReCLIP (w/o Label Sharing) 97.40 82.80 80.01 71.10
ReCLIP (w/ Label Sharing) 97.48 84.14 82.53 71.34
Table 6.3: Comparison of classification accuracy of different versions of ReCLIP on the ablation datasets. ReCLIP with Label Sharing (Figure 6.5) is shown to be the most effective, compared to ReCLIP-V, ReCLIP-T (Figure 6.4), and their simply assembled predictions (ReCLIP w/o Label Sharing).
In Table 6.3 we present the comparison between different versions of ReCLIP. As shown, Label Propagation creates pseudo labels with significantly improved accuracy compared to vanilla CLIP. On top of Label Propagation, both ReCLIP-V and ReCLIP-T (Figure 6.4) are shown to be effective in providing further improvements. In ReCLIP (w/o Label Sharing) we present the results of simply assembling the predictions from separately trained ReCLIP-V and ReCLIP-T at inference time. Comparing the last two rows of Table 6.3, we observe that ReCLIP (w/ Label Sharing) clearly improves over ReCLIP (w/o Label Sharing), which indicates that the commonly agreed pseudo labels stabilize the noisy adaptation process and improve both ReCLIP-V and ReCLIP-T.
6.4.3.2 Comparison on Pseudo Label Generations
In Table 6.4, we compare methods for pseudo label generation. For clustering-based methods, we assign the same pseudo label to examples from the same cluster, based on an in-cluster majority vote. For the k-NN Classifier and Label Propagation methods, we experiment with them on the original CLIP feature space P0 and on the projection spaces P1 and P2 described in Figure 6.3. For k-NN Classifiers, we assign each example the majority-vote prediction within its k-nearest neighborhood, with k equal to the average sample count per class. For Label Propagation on P0, we select the example with the highest confidence from each class as the labeled example to perform label propagation, as a baseline. Label Propagation on P1 and P2 is as described in Section 6.3.1.
AID CIFAR10 CIFAR100 SUN397
Vanilla CLIP 68.80 95.59 78.21 67.97
Hierarchical Clustering 55.20 36.52 9.27 46.93
Spectrum Clustering 68.10 61.25 57.35 27.45
k-means Clustering 72.73 95.07 49.43 43.66
k-NN Classifier (P0) 72.30 93.74 69.46 60.72
k-NN Classifier (P1) 72.76 95.77 77.81 63.07
k-NN Classifier (P2) 72.43 95.76 78.19 63.29
Label Propagation (P0) 60.80 94.01 63.58 51.77
Label Propagation (P1) 60.43 96.23 45.41 33.41
Label Propagation (P2) 76.36 96.31 81.56 70.44
Table 6.4: Pseudo label accuracy with different methods. Label Propagation on the projection space P2 is shown to be the most effective and stable method for generating accurate pseudo labels.
Table 6.4 indicates that k-NN based methods achieve better performance on the projection spaces P1 and P2, which demonstrates the effectiveness of P1 and P2 in refining CLIP's visual embeddings. For Label Propagation, P2 gives a significant improvement over P0 and P1, indicating its effectiveness in aligning CLIP's visual and text embeddings.
6.4.3.3 Comparison on other Vision-Language Models
ReCLIP is designed to improve the classification performance of vision-language models in general, not only CLIP. We test the effectiveness of ReCLIP on SLIP [116] and DeCLIP [98], both of which improve CLIP by adding self-supervised learning objectives during pre-training. We also test ReCLIP on versions of CLIP with smaller architectures. As shown in Table 6.5, ReCLIP demonstrates steady and significant improvements on various vision-language models and architectures.
CIFAR10 CIFAR100 AID SUN397
Init → Adapt Init → Adapt Init → Adapt Init → Adapt
SLIP (ViT-L/16) 89.45 → 91.80 56.69 → 67.61 48.13 →64.07 55.56 → 65.28
DeCLIP (ViT-B/32) 90.57 → 94.50 66.58 → 77.10 53.53 →65.93 63.05 → 66.90
CLIP (RN50) 71.46 → 82.73 42.32 → 53.15 53.43 →65.97 59.76 → 65.38
CLIP (ViT-B/32) 89.83 → 92.15 65.25 → 71.09 60.83 →76.80 62.96 → 68.30
Table 6.5: Ablation Studies on the effectiveness of ReCLIP on different model architecture and pre-training
strategies.
6.4.3.4 Choice on Learnable Modules
CIFAR10 CIFAR100 AID SUN397
Vanilla CLIP 95.54 76.48 64.87 67.25
Learnable Text Prompts 97.50 82.18 93.73 75.27
Learnable Visual Prompts [76] 96.70 80.68 74.27 68.09
Text Encoder Layer-Norm 97.32 83.30 94.8 78.47
Visual Encoder Layer-Norm 97.8 85.16 69.40 68.30
Table 6.6: Fully supervised fine-tuning accuracy of CLIP with different learnable modules on ablation
datasets. On AID, fine-tuning weights from Text Encoder Layer-Norm is shown to be most effective; On
CIFAR10 and CIFAR100, fine-tuning weights from Visual Encoder Layer-Norm is shown to be most effective.
In Table 6.6, we evaluate different learnable modules by comparing their fully supervised fine-tuning performance. As suggested in [162], fine-tuning the normalization weights is efficient and stable compared to fine-tuning all weights in the self-training of ReCLIP.
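A brief PyTorch sketch of this choice, freezing everything except the LayerNorm affine weights of a given encoder (the module-type check is an assumption about how the encoder is built; CLIP's official implementation uses a LayerNorm subclass, which isinstance still matches):

```python
import torch.nn as nn

def freeze_all_but_layernorm(encoder: nn.Module):
    """Make only the layer-normalization affine parameters trainable."""
    for param in encoder.parameters():
        param.requires_grad = False
    for module in encoder.modules():
        # Match nn.LayerNorm and its subclasses (e.g., CLIP's custom LayerNorm).
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True

# trainable = [p for p in encoder.parameters() if p.requires_grad]
# optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9, weight_decay=1e-4)
```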
Recent research [76], as well as POUF [151], also suggests that learnable prompts can be effective in providing stable and fast performance improvements when fine-tuning Transformer-based models [157, 35]. In Table 6.6, we perform Visual Prompt tuning following [76], together with our own Text Prompt design. Please refer to Appendix VII for more details.
As shown in Table 6.6, fine-tuning the Layer-Norm weights of the Visual Encoder gives the best fully supervised accuracy on both CIFAR10 and CIFAR100, while fine-tuning the Layer-Norm weights of the Text Encoder gives the best fully supervised accuracy on AID. As described in Section 6.2.1, on some datasets (including AID), the performance of CLIP is mainly limited by poor-quality text embeddings produced from inaccurate class names. In this case, fine-tuning the text encoder achieves better performance, as we observe. The results in Table 6.6 suggest the necessity of fine-tuning CLIP on both the visual and text sides to handle different scenarios.
CIFAR10 CIFAR100 AID SUN397
CLIP 95.60 78.22 68.73 67.97
ReCLIP (Transductive) 97.04 83.42 79.27 71.25
ReCLIP (Inductive) 96.92 82.30 79.87 74.53
Table 6.7: Inductive and transductive performance comparison of ReCLIP on the ablation datasets.
6.4.3.5 Inductive Results
We perform the SFDA evaluation in Table 6.1 in a transductive manner, to follow the protocols of AaD [184] and POUF [151] and to fully utilize the test examples. However, ReCLIP can also be applied in an inductive manner, so that the adaptation only has to be performed once for the target domain and the adapted model remains effective on new, unseen examples from that domain. In Table 6.7 we run ReCLIP in an inductive setting, where ReCLIP performs self-training on the training split of a dataset (0.5 to 5 GPU-hours) and inference on the test split (similar to the CLIP inference time). ReCLIP achieves similar improvements in the inductive setting as in the transductive setting.
6.4.3.6 Pseudo Label Quality
Table 6.8 columns: Average, AID[177], Birdsnap[11], Caltech101[95], CIFAR10[86], CIFAR100[86], Country211[131], DTD[23], EuroSAT[59], FER2013[189], FGVC[112], Flowers[120], Food101[15], GTSRB[147], ImageNet[32], MNIST[33], Oxford Pet[125], PCam[158], SST2[131], RESISC45[22], Stanford Cars[83], STL10[25], SUN397[179]
CLIP repro 69.83 68.73 52.48 91.63 95.60 78.22 31.84 55.37 60.00 56.39 31.59 79.04 93.08 50.59 75.52 76.23 93.62 62.43 68.92 69.66 77.88 99.36 67.97
ReCLIP (pseudo label) 72.54 74.50 43.25 91.91 96.56 81.40 26.30 59.04 73.36 57.15 36.33 82.55 93.95 60.64 25.11 82.85 94.77 62.46 68.86 77.63 77.66 99.52 70.54
Table 6.8: ReCLIP pseudo label quality. Results are generated with the vanilla CLIP ViT-L/14 checkpoint, on the first epoch of ReCLIP, before the training algorithm updates the model weights.
In Table 6.8 we report the pseudo label accuracy of ReCLIP at the first epoch, before the self-training algorithm updates the model weights. From Table 6.8 we observe that label propagation over the projected visual and text embeddings yields ReCLIP pseudo labels with consistently improved accuracy over CLIP, except on Birdsnap and ImageNet, which have more than 500 categories, as discussed in Appendix IV. The results in Table 6.8 demonstrate the effectiveness of our version of the label propagation method in generating reliable pseudo labels for vision-language models.
6.4.4 Runtime Analysis
Table 6.9 columns: Average, AID[177], Birdsnap[11], Caltech101[95], CIFAR10[86], CIFAR100[86], Country211[131], DTD[23], EuroSAT[59], FER2013[189], FGVC[112], Flowers[120], Food101[15], GTSRB[147], ImageNet[32], MNIST[33], Oxford Pet[125], PCam[158], SST2[131], RESISC45[22], Stanford Cars[83], STL10[25], SUN397[179]
Image Number 1500 2,149 9,146 10,000 10,000 21,100 1,880 5000 3,574 3,333 6,149 25,250 12,630 50,000 10,000 3,669 32,768 1,821 25,200 8,041 8,000 19,850
Class Number 30 500 102 10 100 211 47 10 8 100 102 102 43 1,000 10 37 2 2 45 196 10 397
AaD (h) 1.19 0.49 0.56 0.98 1.26 1.26 1.30 0.42 4.39 0.71 0.71 1.24 1.24 1.29 1.29 1.27 0.77 1.31 0.38 1.34 1.26 1.30 1.32
POUF (h) 6.18 4.51 7.07 5.61 5.80 5.71 7.30 5.50 5.60 3.73 5.02 5.82 6.38 6.41 13.58 5.74 4.13 6.79 4.91 6.33 5.97 5.92 8.19
ReCLIP (h) 2.35 0.68 0.97 2.94 1.62 2.68 1.58 1.08 1.82 0.90 1.24 2.73 5.66 3.82 3.23 2.19 0.95 2.99 0.61 3.12 4.17 2.18 4.63
Table 6.9: Metadata and runtime comparison of AaD, POUF, and ReCLIP on the 22 evaluation benchmarks. Time is reported in hours (h).
We present the runtime required by the SFDA methods AaD, POUF, and ReCLIP in Table 6.9. We matched all methods to the same number of training steps for a fair comparison. As shown by the results, AaD takes an average of 1.19 hours to adapt, ReCLIP takes 2.35 hours, and POUF takes 6.18 hours. ReCLIP is not much slower than AaD even though it trains two sets of encoders at the same time, except on datasets with more categories, due to the time required by the Label Propagation process. POUF is much slower than both AaD and ReCLIP due to its less efficient implementation. Nevertheless, all three algorithms are efficient in practice, as the adaptation only has to be applied once for each new target domain.
6.4.5 Limitation and Future Work
As mentioned in the Implementation Details section, we have observed that on datasets with more than
500 classes (Birdsnap, ImageNet), the accuracy of pseudo labels generated by label propagation becomes
unstable, and it requires additional hyperparameter tuning to achieve good performance. To maintain stable
performance, we have turned off label propagation and simply used model predictions as our pseudo labels
on datasets with over 500 categories. Studies on how the hyper-parameters influence the label propagation
performance on datasets with more than 500 categories will be important future work to further improve
ReCLIP.
Another future direction will be the utilization of augmentation consistency. Augmentation Consistency
has been shown to be a very powerful unsupervised training signal and has been widely applied in unsupervised methods [20, 21, 55]. Due to the scope and complexity of this project, we have not explored the usage
of augmentation consistency in source-free domain adaptation. It will be important future work to explore
the combination of the current ReCLIP with augmentation consistency to further improve the adaptation
performance.
6.5 Conclusion
We introduced ReCLIP, a novel solution for source-free domain adaptation of vision-language models. ReCLIP first uses a novel projection space to re-align visual and text embeddings and to generate reliable pseudo labels for the target classification task. ReCLIP then applies cross-modality self-training with these pseudo labels, which iteratively enhances the label assignments and the visual and text embeddings. Compared to the previous methods AaD and POUF, ReCLIP provides an effective and stable solution to the source-free adaptation problem for vision-language models. ReCLIP significantly improves CLIP, increasing the average accuracy from 69.83% to 74.94% across 22 datasets.
Chapter 7
BaFTA: Backprop-free Test-Time Adaptation for Zero-Shot Visual
Language Models
Building on the foundations set by ReCLIP, we shift our focus back to the challenging and practical realm of test-time adaptation. To improve CLIP's performance within the zero-shot paradigm, various test-time prompt tuning methods have been developed. These methods aim to refine class embeddings through unsupervised learning objectives during inference. However, they often face difficulties in choosing appropriate learning rates, which is crucial to avoid model instability when no validation data is available during test-time training.
In this chapter, we introduce an innovative, backpropagation-free method for test-time adaptation in
vision-language models. Unlike existing approaches that fine-tune text prompts to refine class embeddings,
our method takes a different route. It directly estimates class centroids through online clustering within a
specially projected embedding space that aligns text and visual embeddings. To further enhance accuracy,
we dynamically combine predictions from both the estimated and the original class embeddings. This is
complemented by predictions from various augmented views. The reliability of each prediction is assessed
using Rényi entropy, allowing us to judiciously weigh the contributions of each source. Our extensive
experiments demonstrate that this approach not only bypasses the challenges of previous methods but also
consistently surpasses the state-of-the-art in test-time adaptation by a significant margin.
7.1 Introduction
In the pursuit of further enhancing Vision-Language Model performance, various adaptation and fine-tuning
techniques have emerged to bridge the domain gap when applied to downstream tasks. For instance, [199]
and [198] fine-tune text prompts for VLMs, tailoring them to specific downstream tasks with few-shot adaptation. Moreover, in the realm of zero-shot classification, numerous approaches have been proposed to boost
VLM performance without necessitating labeled data. For example, [68] and [151] improve CLIP through
source-free unsupervised adaptation using unlabeled test examples, while [155] and [46] enhance CLIP with
training-free methods by leveraging external resources. Furthermore, test-time prompt tuning algorithms,
exemplified by [113] and [124], refine learnable text prompts during inference through the optimization of
an unsupervised objective using augmentations, leading to improved model accuracy.
As shown in Table 7.1, the test-time prompt-tuning method, TPT ([113]), excels in adaptation without
the need for labeled data or external resources. This characteristic underscores its practicality and versatility as an effective means to enhance the performance of CLIP, especially within the context of zero-shot
classification. However, as pointed out by [122], test-time adaptation methods such as [162] and [100], as well as methods like TPT, often encounter the intricate challenge of determining an optimal learning rate in the absence of validation data. Striking the right balance is crucial: achieving maximum improvement while simultaneously safeguarding against model instability during test-time adaptation.
To avoid the difficulty of determining an optimal learning rate, and to fully harness the potential of each test example without concerns about model instability, we propose the Backpropagation-Free Test-time Adaptation algorithm BaFTA. Instead of refining the class embeddings with backpropagation training in the prompt token space, BaFTA directly refines the class embeddings within the unified visual-text embedding space of CLIP, leveraging the neighboring information among the visual embeddings of test examples with an online clustering algorithm. Our approach is motivated by the observation that the visual embeddings from CLIP are often sufficiently discriminative for effective classification; the sub-optimal zero-shot performance is instead largely caused by imprecise text embeddings generated from uninformative class names. Therefore, we opt to harness the neighboring information within the visual embeddings of test examples to enhance CLIP's test-time performance.
To further enhance the performance of online clustering predictions, we propose two pivotal designs. First, building upon the recommendation from [68], we execute the online clustering algorithm in a projected embedding space. This projection helps alleviate the disparity between CLIP's visual and text embeddings, contributing to improved clustering outcomes. Second, recognizing that clustering-based predictions can sometimes be swayed by a biased distribution of test examples, we combine these clustering-based predictions with standard predictions derived from randomly augmented views of the test examples.
We employ Rényi Entropy to gauge the reliability of these predictions, ultimately arriving at an aggregated
prediction that benefits from the strengths of both approaches while ensuring accuracy and robustness.
The significance of our work can be summarized in four key contributions:
• We introduce BaFTA, a novel Backpropagation-Free Test-time Adaptation algorithm designed to enhance the zero-shot classification capability of vision-language models at inference time, without requiring any labeled examples or backpropagation training.
• We propose an effective online clustering method to directly refine the class embeddings of vision-language models within a projected space that aligns the visual and text embeddings.
• We present a simple technique to dynamically aggregate the predictions from the clustering-estimated
and original class embeddings, as well as from various augmented views, by evaluating the reliability
of each prediction using Rényi entropy.
Methods Task Setting Labeled Data Back-Prop Training External Resource
CoOp ([199]) Few-Shot Fine-Tuning Yes Yes None
CoCoOp ([198]) Few-Shot Fine-Tuning Yes Yes None
ReCLIP ([68]) Source-Free Adaptation No Yes None
POUF ([151]) Source-Free Adaptation No Yes None
TPT ([113]) Test-Time Adaptation No Yes None
DiffTPT ([39]) Test-Time Adaptation No Yes Stable Diffusion
Sus-X ([155]) Test-Time Adaptation No No LAION-5B
Hierarchy-CLIP ([46]) Test-Time Adaptation No No ImageNet Hierarchy
BaFTA (Ours) Test-Time Adaptation No No None
Table 7.1: Taxonomy of CLIP adaptation methods for downstream classification. In this work, we adopt TPT as the main baseline for comparison, as it is the state-of-the-art test-time adaptation algorithm that does not require external resources.
• Through comprehensive experiments, we validate BaFTA and its components, affirming its effectiveness in significantly improving the zero-shot classification accuracy of pre-trained vision-language
models during inference.
7.2 Background
In this section, we revisit the large-scale pre-trained vision language model CLIP ([131]) and test-time
prompt tuning algorithm TPT ([113]) for the necessary background and formulation before we introduce
our method in Section 7.3.
Zero-Shot Image Classification with VLM. A pre-trained vision-language model such as CLIP consists of two parallel components M = {M_v, M_t}, where M_v is the visual encoder and M_t is the text encoder. Given test images D_test = {x_i}_{i=1}^I and target class names C = {c_j}_{j=1}^J, the pre-trained vision-language model M performs zero-shot classification by generating the adaptive classification weights from the text embeddings of the target class names, t_j = M_t(θ_0(c_j)) for j ∈ {1, 2, ..., J}, where θ_0 is a text prompt template such as “a photo of {class name}” that wraps the class names c_j into full sentences θ_0(c_j). To further improve the quality of the text embeddings, CLIP provides lists of templates {θ_z}_{z=1}^Z to align the text embeddings
[Figure 7.1 diagram: multi-template prompts (e.g., “a photo of a {}”, “a cropped photo of {}”) are passed through the text encoder to produce class text embeddings, which initialize the projected class centroids; each test example and its augmented views are passed through the visual encoder to obtain projected visual embeddings used by projected online clustering (assign and update centroids); prediction logits from the text embeddings and from the class centroids, computed via cosine similarity, are merged by a Rényi-entropy weighted average into the final prediction.]
Figure 7.1: Overview of the Backpropagation-Free Test-Time Adaptation algorithm BaFTA. Instead of
prompt-tuning, we employ online clustering to directly estimate class embeddings in a projection space
that aligns visual and text embeddings. The class centroids are initialized with text embeddings of class
names, and updated incrementally with online test examples assigned to the class. For each test example,
we generate two sets of predictions. The first set measures cosine similarity between visual embeddings
of augmented views and class name text embeddings. The second set measures cosine similarity between
visual embeddings and online-clustering centroids. Predictions are aggregated with reliability estimated by
Rényi Entropy for final results.
with the distribution of real caption sentences used in pre-training, and generates the text embeddings for
each class by taking the average of these templates,
t_j = \frac{1}{Z} \sum_{z=1}^{Z} M_t(\theta_z(c_j)).
Then, the prediction y_i can be obtained by selecting the class j whose text embedding t_j has the highest cosine similarity with the visual embedding M_v(x_i), i.e.,
y_i = \argmax_j \left\langle \frac{M_v(x_i)}{\lVert M_v(x_i) \rVert}, \frac{t_j}{\lVert t_j \rVert} \right\rangle.
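For illustration, the multi-template averaging and cosine-similarity prediction above can be sketched in a few lines of Python with NumPy. The arrays below are random stand-ins for the encoder outputs M_v(x_i) and M_t(θ_z(c_j)); the function names and shapes are illustrative assumptions rather than CLIP's actual API.

import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_predict(visual_embeds, template_text_embeds):
    # template_text_embeds: (Z, J, D); average over the Z templates, then L2-normalize to obtain t_j
    class_embeds = l2_normalize(template_text_embeds.mean(axis=0))
    # cosine similarity reduces to a dot product of L2-normalized embeddings
    logits = l2_normalize(visual_embeds) @ class_embeds.T   # shape (N, J)
    return logits.argmax(axis=1)                            # y_i = argmax_j <v_i/||v_i||, t_j/||t_j||>

rng = np.random.default_rng(0)
fake_visual = rng.normal(size=(4, 512))     # stand-in for M_v(x_i), N = 4 test images
fake_text = rng.normal(size=(7, 10, 512))   # stand-in for M_t(theta_z(c_j)), Z = 7 templates, J = 10 classes
print(zero_shot_predict(fake_visual, fake_text))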
Test-Time Prompt Tuning for VLM. To further enhance the zero-shot generalization ability of the vision-language model M, TPT proposes to learn an adaptive text template θ at inference time. For each test example x_i, TPT first prepares a mini-batch of randomly augmented views {x_i^1, x_i^2, ..., x_i^B} and performs a single gradient descent step to optimize the entropy minimization loss over the high-confidence predictions among the augmented views,
\theta_i = \theta_0 - \delta \nabla_\theta \left( \sum_{b=1}^{B} \mathbbm{1}[H(M(x_i^b)) < \tau] \, H(M(x_i^b)) \right) \Big|_{\theta=\theta_0}
where H(·) is the entropy function, τ is the entropy threshold for high-confidence augmented view selection, and δ is the learning rate. M(x_i^b) = softmax\left( [M(x_i^b; c_j)]_{j=1}^{J} \right) is the estimated probability distribution of augmented view x_i^b over the target classes c_1, ..., c_J, with
M(x_i^b; c_j) = \left\langle \frac{M_v(x_i^b)}{\lVert M_v(x_i^b) \rVert}, \frac{M_t(\theta(c_j))}{\lVert M_t(\theta(c_j)) \rVert} \right\rangle
as the cosine similarity between visual embedding M_v(x_i^b) and text embedding M_t(θ(c_j)). Then, with the adapted text prompt θ_i, TPT produces the prediction for test example x_i as the averaged prediction from the high-confidence (estimated by entropy H(·)) augmented views:
y_i = \argmax _{j} \sum _{b=1}^B \mathbbm {1}[H(M(x_i^b)) < \tau ] M(x_i^b; c_j) \label {entropy_conf} (7.1)
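A minimal sketch of the confidence-filtered averaging in Equation 7.1 is given below (Python/NumPy). The array probs stands for the softmax predictions of the B augmented views of one test example, tau is a hypothetical threshold value, and the fallback used when no view passes the threshold is our own assumption, not part of TPT.

import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def confidence_filtered_prediction(probs, tau):
    # probs: (B, J) softmax predictions over J classes for B augmented views
    keep = entropy(probs) < tau                 # keep only low-entropy (high-confidence) views
    if not keep.any():                          # fallback (assumption): use all views if none pass
        keep = np.ones(len(probs), dtype=bool)
    return probs[keep].mean(axis=0).argmax()    # average the kept predictions, then take argmax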
7.3 Method
As investigated by [122], test-time adaptation algorithms frequently encounter the challenge of choosing an appropriate learning rate in the absence of validation data during unsupervised training. On one hand, opting for a small learning rate restricts the enhancement of the model. On the other hand, employing a large learning rate risks triggering model collapse. TPT adopts a relatively large learning rate to expedite improvement, but restarts from the original model for each test example to prevent potential model collapse.
In this work, we present a novel backpropagation-free solution that directly refines the class embeddings in the aligned visual-text embedding space instead of in the prompt token space. Our BaFTA algorithm performs Backpropagation-Free Test-time Adaptation for Vision-Language Models, and brings three major advantages over test-time prompt tuning methods like TPT:
• BaFTA avoids the use of backpropagation to update model weights. As a result, it significantly reduces the risk of model collapse during unsupervised adaptation.
• In contrast to test-time adaptation algorithms like TPT that require frequent restarts to prevent model collapse, BaFTA possesses the capability to scale as more test examples become available, and to leverage the relationships between neighboring examples.
• BaFTA can leverage the multi-template prompts provided by CLIP to enhance text embedding quality. In contrast, prompt-tuning methods are constrained to single-template prompts due to computational cost.
In the following sections, we first present the motivation and primary concepts behind the estimation of class embeddings using projected online clustering during inference, as outlined in Section 7.3.1. Subsequently, we discuss the pivotal findings that enhance the performance of online clustering, as elaborated in Section 7.3.2. Finally, we present a comprehensive overview of BaFTA with the complete algorithm in Section 7.3.3.
7.3.1 Estimate Class Embedding with Projected Online Clustering
As shown in Table 7.2, CLIP generates discriminative visual embeddings on various downstream tasks,
but the zero-shot classification performance is often limited by the imprecise text embeddings generated
from uninformative class names. For example, FGVC Aircraft ( [112]) uses codenames such as 707-320
and A300B4 as class names, which are hardly informative for CLIP to generate proper text embeddings to
capture the visual difference between classes.
Conversely, the results of linear evaluation suggest that the visual embeddings from CLIP exhibit a high
degree of distinctiveness among target classes, enabling the linear classifier to attain remarkable classification accuracy. This finding opens up an opportunity to leverage the neighboring information within these
visual embeddings to further enhance classification performance.
Given a set of visual embeddings {v_i | v_i = M_v(x_i)}_{i=1}^I arriving in order, we can obtain a set of cluster centroids w_j as class embeddings using the online clustering algorithm [8]:
w_j &= \frac {t_j}{\lVert t_j \rVert } & \text {initialize centroids with text embedding $t_j$}\nonumber \\ w_{y_i} &= \frac {k_{y_i}w_{y_i} + v_i}{\lVert k_{y_i}w_{y_i} + v_i\rVert } & \text {update upon example $v_i$ with prediction $y_i$} \label {online_cluster_eq}\\ k_{y_i} &= k_{y_i} + 1 & \text {update counter $k_{y_i}$ for class $y_i$} \nonumber (7.2)
where k_{y_i} records the number of examples that contributed to the calculation of w_{y_i} before v_i, which adjusts the magnitude of w_{y_i} to accommodate the new cluster member v_i.
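The update above amounts to a running, re-normalized mean per class. A minimal sketch in Python with NumPy, assuming the text embeddings and incoming visual embeddings are given as arrays, is:

import numpy as np

def init_centroids(text_embeds):
    # w_j <- t_j / ||t_j||, k_j <- 0
    w = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return w, np.zeros(len(text_embeds), dtype=int)

def online_update(w, k, v_i, y_i):
    # Fold the new member v_i into centroid w_{y_i} and increment its counter (Eq. 7.2)
    merged = k[y_i] * w[y_i] + v_i
    w[y_i] = merged / np.linalg.norm(merged)
    k[y_i] += 1
    return w, k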
For the online clustering to achieve optimal performance, we perform the online clustering within the
projection space described in Section 6.3.1 which aligns the visual and text embedding spaces of CLIP.
Specifically, for any embedding x, we transform it with the projection matrix P^* before performing the online clustering algorithm:
P^*(x):= \frac {U'U'^\top x}{\lVert U'U'^\top x\rVert } & & U'= [e_2, e_3, ..., e_J] \label {proj_eq} (7.3)
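For completeness, the projection of Eq. 7.3 can be sketched as follows (Python/NumPy). The basis U' = [e_2, ..., e_J] is assumed to be precomputed by the construction of Section 6.3.1, which is not repeated here; only the projection and re-normalization step is shown.

import numpy as np

def project(x, U_prime):
    # P*(x) = U' U'^T x / ||U' U'^T x||, with U_prime of shape (D, J-1) holding e_2, ..., e_J as columns
    proj = U_prime @ (U_prime.T @ x)
    return proj / np.linalg.norm(proj)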
7.3.2 Prediction Aggregation with Rényi Entropy
The online clustering algorithm presented in Section 7.3.1 yields accurate estimations of the embedding
centroids for classes that have a sufficient quantity of seen test examples. However, when it comes to
Cars Caltech101 DTD EuroSAT FGVC Food101 Flower102 Pets UCF101 SUN397 ImageNet
CLIP (RN50) Zero-Shot 55.8 82.1 41.7 41.1 19.3 81.1 65.9 85.4 63.6 59.6 59.6
CLIP (RN50) Linear-Eval 78.3 89.6 76.4 95.2 49.1 86.4 96.1 88.2 81.6 73.3 73.3
CLIP (ViT-B/16) Zero-Shot 65.6 89.3 46.0 54.1 27.1 89.2 70.4 88.9 69.8 65.2 68.6
CLIP (ViT-B/16) Linear-Eval 86.7 94.7 79.2 97.1 59.5 92.8 98.1 93.1 88.4 78.4 80.2
Table 7.2: Zero-Shot vs. Linear Evaluation top-1 accuracy reported by CLIP ([131]). The Linear Evaluation protocol assesses the quality of visual embeddings by training a fully-supervised linear classifier over the frozen visual embeddings. The Linear Evaluation results imply: 1) the zero-shot performance of CLIP is largely limited by the quality of the zero-shot classifier, i.e., the text embeddings of the class names; 2) the native visual embeddings of CLIP are classified well by a linear classifier, which suggests the distinctiveness of the visual embeddings across target classes and creates an opportunity to leverage neighboring relationships to enhance test-time performance.
Algorithm 6 BaFTA: Backprop-Free Test-Time Adaptation for zero-shot VLM.
Require: Vision-Language Pre-trained Model M = {M_v, M_t}
Require: Test Samples X = {x_i}_{i=1}^I; Class Names C = {c_j}_{j=1}^J; Template Prompts {θ_z}_{z=1}^Z
  t_j ← (1/Z) Σ_z M_t(θ_z(c_j))  ▷ Prepare multi-template text embeddings for each class
  t̂_j ← P^*(t_j | {t_1, ..., t_J})  ▷ Projected text embeddings (Eq. 7.3)
  w_j ← t̂_j, k_j ← 0  ▷ Initialize class centroids w_j and counters k_j for each class
  for i ← 1 to I do
    {x_i^b}_{b=1}^B ← A(x_i)  ▷ Generate B views with random augmentation function A(·)
    v_i^b ← M_v(x_i^b)  ▷ Visual embedding for each augmented view
    v̂_i^b ← P^*(v_i^b)  ▷ Projected visual embedding (Eq. 7.3)
    p_i^b ← softmax([cos(v_i^b, t_j)]_{j=1}^J)  ▷ Cosine similarity between visual embedding v_i^b and text embeddings t_j (Eq. 7.4)
    p̂_i^b ← softmax([cos(v̂_i^b, w_j)]_{j=1}^J)  ▷ Cosine similarity between projected visual embedding v̂_i^b and class centroids w_j (Eq. 7.5)
    p̃_i ← (1/R) Σ_b Re(p_i^b) p_i^b + (1/R) Σ_b Re(p̂_i^b) p̂_i^b  ▷ Prediction aggregation (Eq. 7.6)
    y_i ← argmax_j p̃_i  ▷ Get prediction for example x_i
    v̂_i ← (1/B) Σ_{b=1}^B v̂_i^b
    w_j ← (k_j w_j + v̂_i) / ∥k_j w_j + v̂_i∥, k_j ← k_j + 1 for j = y_i  ▷ Update centroid and counter of predicted class y_i (Eq. 7.2)
    Output y_i as the prediction for x_i
  end for
classes with only a limited number of examples, the estimations of embedding centroids can become notably
biased. In datasets featuring a large number of classes like ImageNet1k ([32]), certain categories might
remain unassigned or have very few examples assigned to them until the adaptation process concludes. This
situation reduces the reliability of centroid estimation for these classes. Consequently, it becomes imperative
to implement a mechanism for filtering out predictions with low reliability.
On the other hand, we follow TPT ([113]) and leverage random augmentations to improve the prediction quality on test examples. For each test example x_i, we prepare B augmented views {x_i^1, ..., x_i^B}, which result in B distinct predictions {p_i^1, ..., p_i^B} that also need to be filtered so that only the reliable ones are preserved. As described in Equation 7.1, TPT selects the predictions p_i^b by thresholding their entropy H(p_i^b) < τ, as low-entropy predictions tend to be more confident.
In contrast, we draw inspiration from a study in the area of speech recognition [92] and opt for Rényi Entropy to estimate the reliability of each prediction. This decision is motivated by the stronger correlation between Rényi Entropy and prediction accuracy observed in that study. For each test example x_i, we generate regular predictions p_i^b by calculating the softmax-cosine similarity between the visual embedding v_i^b and the text embeddings t_j:
p_{i}^b = softmax\left (\left [cos(v_i^b, t_j)\right ]_{j=1}^J\right ), \label {pd1} (7.4)
and also online-clustering predictions p̂_i^b by comparing v_i^b with the class centroids w_j:
\hat {p_{i}^b} = softmax\left (\left [cos(P^*(v_i^b), w_j)\right ]_{j=1}^J\right ). \label {pd2}
(7.5)
Note that we use the projected visual embeddings P^*(v_i^b) to calculate p̂_i^b, because the w_j are calculated in the projection space. Then, we estimate the reliability of each prediction p with the Rényi entropy:
Re(p) = \frac {1}{\alpha -1} \log \sum _{j=1}^J (p[j])^{\alpha } \label {eq:renyi}
Finally, we aggregate the predictions {p_i^b} and {p̂_i^b} with their Rényi entropy as the weight:
\Tilde {p_i} &= \frac {1}{R}\left (\sum _{b=1}^B Re(p_i^b) p_i^b + \sum _{b=1}^B Re(\hat {p_i^b}) \hat {p_i^b}\right ) \nonumber \\ &= \frac {1}{R}(p_i + \hat {p_i}) \label {eq:renyi2}
(7.6)
where R = Σ_{b=1}^B (Re(p_i^b) + Re(p̂_i^b)) is the normalization factor that ensures p̃_i sums to 1.
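A minimal sketch of the reliability weighting and aggregation of Eq. 7.6 follows (Python/NumPy). We use the exponential form of the Rényi entropy above, (Σ_j p[j]^α)^{1/(α-1)}, as the weight, which is the form adopted in our implementation (Section 7.4); α = 0.5 and the input array shapes are illustrative assumptions.

import numpy as np

def renyi_reliability(p, alpha=0.5, eps=1e-12):
    # Exponential form of Re(p): (sum_j p[j]^alpha)^(1/(alpha-1));
    # larger values correspond to more confident (lower-entropy) predictions
    return ((p + eps) ** alpha).sum(axis=-1) ** (1.0 / (alpha - 1.0))

def aggregate(p_text, p_cluster, alpha=0.5):
    # p_text, p_cluster: (B, J) predictions from Eq. 7.4 and Eq. 7.5 for one test example
    w1 = renyi_reliability(p_text, alpha)
    w2 = renyi_reliability(p_cluster, alpha)
    R = w1.sum() + w2.sum()                     # normalization factor so that p_tilde sums to 1
    p_tilde = ((w1[:, None] * p_text).sum(axis=0) + (w2[:, None] * p_cluster).sum(axis=0)) / R
    return p_tilde.argmax(), p_tilde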
7.3.3 Algorithm and Overview
We demonstrate the overview of BaFTA in Figure 7.1. Instead of employing prompt-tuning, which entails back-propagation and the risk of potential model collapse during unsupervised training, BaFTA takes
a backpropagation-free approach. We directly refine the class embeddings with online clustering (as detailed in Section 7.3.1) in a projection space that aligns the visual and text embeddings (as detailed in
Section 6.3.1). For each test instance, BaFTA generates two sets of predictions. The first set follows the
standard contrastive VLM classification protocol, measuring cosine similarity between visual embeddings
of augmented views and the text embeddings of class names. The second set measures cosine similarity
between visual embeddings and centroids obtained through online clustering. These predictions are subsequently combined, considering their reliability as evaluated by Rényi Entropy (as outlined in Section 7.3.2),
to yield the final results. For a comprehensive understanding of BaFTA’s procedures, please also refer to
Algorithm 6.
7.4 Experiment and Results
Baselines. We conduct experiments comparing BaFTA with several benchmark models and algorithms. Our comparisons include the baseline model CLIP ([131]) and the state-of-the-art test-time prompt-tuning algorithm TPT ([113]) introduced in Section 7.2. For CLIP, we report both the single-template (denoted as CLIP) and multi-template versions. We also include Hierarchy-CLIP ([46]) and SuS-X ([155]) in the ImageNet evaluation, as both enhance prompt quality with training-free methods. Furthermore, we introduce CoOp ([199]), CoCoOp ([198]) and ProGrad ([203]) as additional baselines for comparison with few-shot prompt-tuning methods, aligning with the experiments from TPT.
Datasets. We conduct our experiments over two sets of datasets, following the experimental setup of [113] and [199], which include: 1) ImageNet Robustness Evaluation with ImageNet ([32]) and its Natural Distribution Shift (NDS) variants ImageNet-V2 ([134]), ImageNet-R ([60]), ImageNet-Sketch ([163]) and ImageNet-A ([62]); 2) Fine-Grained Datasets with Stanford Cars ([83]), Caltech101 ([95]), Describable Textures (DTD, [23]), EuroSAT ([59]), FGVC Aircraft ([112]), Food101 ([15]), Flowers102 ([120]), Oxford-IIIT-Pets ([125]), UCF101 ([146]) and SUN397 ([179]).
Implementation Details. In our experiments, we employ the ViT-B/16 and ResNet50 checkpoints from CLIP as the baseline models for comparison and adaptation. In line with the TPT implementation, we utilize a simple combination of RandomResizedCrop and RandomFlip to prepare 63 augmented views, constituting a mini-batch of 64 images for each test image. This choice, as previously observed in [113], strikes a suitable balance between runtime efficiency and performance. We employ the exponential form of Rényi Entropy with order α = 0.5 following [92]. For experiments on top of CoOp, we use the 16-shot fine-tuned model and ensemble the predictions generated from the CoOp embeddings with our predictions using Rényi entropy. We adopt this approach instead of directly replacing the prompts because we have observed that CoOp embeddings sometimes perform less effectively than the multi-template embeddings provided by CLIP. For all other BaFTA results, we use the official template sets provided by CLIP to generate the text embeddings. Unless otherwise specified, all BaFTA results are reported with a warm-up schedule of 10J examples (where J is the number of classes) before the online clustering predictions are aggregated into the final prediction. For the embedding projection matrix, we use U' = [e_2, ..., e_J] for all datasets, except for datasets with more than 150 categories such as ImageNet, for which we use U' = [e_2, ..., e_150] for best performance.
7.4.1 Main Results
ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-Sketch NDS Avg
CLIP (ViT-B/16) 66.73 47.87 60.86 73.98 46.09 57.20
Multi-Template 68.34 49.89 61.88 77.65 48.24 59.42
Hierarchy-CLIP 68.86 31.07 62.00 60.62 48.26 50.48
TPT 68.98 54.77 63.45 77.06 47.94 60.81
BaFTA 71.43 58.19 64.46 79.06 50.51 63.06
CoOp (16-shot) 71.51 49.71 64.20 75.21 47.99 59.28
CoCoOp (16-shot) 71.02 50.63 64.07 76.18 48.75 59.91
ProGrad (4-shot) 70.45 49.45 63.35 75.21 48.17 59.05
TPT+CoOp 73.61 57.95 66.83 77.27 49.29 62.84
TPT+CoCoOp 71.07 58.47 64.85 78.65 48.47 62.61
BaFTA + CoOp 74.42 59.21 67.15 79.00 51.39 64.19
CLIP (RN50) 58.16 21.83 51.41 56.15 33.37 40.69
Multi-Template 59.81 23.24 52.91 60.72 35.48 43.09
TPT 60.74 26.67 54.70 59.11 35.09 43.89
SuS-X (Stable Diff) 61.84 - - 61.76 36.30 -
SuS-X (LAION-5B) 61.89 - - 62.10 37.83 -
BaFTA 62.01 26.91 55.26 59.79 36.37 44.58
CoOp (16-shot) 63.33 23.06 55.40 56.60 34.67 42.43
CoCoOp (16-shot) 62.81 23.32 55.72 57.74 34.48 42.82
ProGrad (4-shot) 62.17 23.05 54.79 56.77 34.40 42.25
TPT+CoOp 64.73 30.32 57.83 58.99 35.86 45.75
TPT+CoCoOp 62.93 27.40 56.60 59.88 35.43 44.83
BaFTA + CoOp 65.92 29.39 58.22 59.45 36.84 45.98
Table 7.3: Comparison of top-1 accuracy on ImageNet and the Natural Distribution Shifts (NDS) Benchmarks. All methods evaluated in zero-shot classification setting, except 1) CoOp, CoCoOp and ProGrad
are fine-tuned on ImageNet with 4 or 16 examples per category; 2) SuS-X uses external knowledge such as
Stable Diffusion or LAION-5B to construct support set.
In Table 7.3 and Table 7.4 we present the comprehensive results of the backpropagation-free test-time adaptation algorithm BaFTA in comparison to baseline methods across five ImageNet robustness benchmarks and ten fine-grained classification benchmarks.
As illustrated in Table 7.3, BaFTA exhibits a substantial improvement over the baseline CLIP model
on ImageNet, achieving enhancements of 3.07% and 6.11% on ViT-B/16 and RN50 models, respectively.
Notably, BaFTA achieves these results without the need for backpropagation training during adaptation.
Furthermore, BaFTA surpasses the state-of-the-art test-time prompt tuning method TPT by notable margins
Average Cars Caltech101 DTD EuroSAT FGVC Food101 Flower102 Pets UCF101 SUN397
CLIP (ViT B/16) 63.58 65.48 93.35 44.27 42.01 23.67 83.65 67.44 88.25 65.13 62.59
Multi-Template 64.59 66.11 93.55 45.04 50.42 23.22 82.86 66.99 86.92 65.16 65.63
CoOp (16-shot) 63.88 64.51 93.70 41.92 46.39 18.47 85.30 68.71 89.14 66.55 64.15
TPT 65.10 66.87 94.16 47.75 42.44 24.78 84.67 68.98 87.79 68.04 65.50
BaFTA 68.52 69.44 94.08 50.30 50.49 27.00 87.03 73.81 92.61 71.13 69.34
CLIP (RN50) 55.82 55.70 85.88 40.37 23.69 15.66 73.97 61.75 83.57 58.84 58.80
Multi-Template 56.63 55.89 87.26 40.37 25.79 16.11 74.82 62.77 82.97 59.48 60.85
CoOp (16-shot) 56.18 55.32 86.53 37.29 26.20 15.12 75.59 61.55 87.00 59.05 58.15
TPT 57.66 58.46 87.02 40.84 28.33 17.58 74.88 62.69 84.49 60.82 61.46
BaFTA 63.20 58.29 87.95 44.03 39.26 18.15 77.69 66.67 88.76 64.26 62.99
Table 7.4: Top-1 Accuracy on 10 Fine-grained Benchmarks. All baselines are evaluated in the zero-shot classification setting, except CoOp, which is fine-tuned on ImageNet with 16 examples per category.
of 2.45% and 1.17% on ViT-B/16 and RN50. Additionally, when applied on top of the few-shot fine-tuned
prompts from CoOp, BaFTA further enhances CoOp’s performance by significant margins. On the Natural
Distribution Shifts Benchmarks, BaFTA even outperforms CoOp with a remarkable margin of 3.78% and
2.15% on the ViT-B/16 and RN50 models (as well as on the fine-grained datasets as shown in Table 7.4).
This indicates that test-time adaptation provides superior results compared to cross-domain generalization
through few-shot supervised methods. In Table 7.4, BaFTA exhibits even larger improvements over TPT on
the fine-grained datasets, with notable margins of 3.42% and 5.54% on ViT-B/16 and RN50, respectively.
This performance improvement is possibly attributed to the better online clustering performance on datasets
with fewer target categories. These results underline the effectiveness of BaFTA, even without the use of
backpropagation training, solidifying its position as a valuable and robust test-time adaptation method.
The performance of CoCoOp and ProGrad is similar to that of CoOp on both RN50 and ViT-B/16. BaFTA
achieves comparable performance to ProGrad, CoOp, and CoCoOp on ImageNet and demonstrates superior performance on datasets with natural distribution shifts. Notably, ProGrad, CoCoOp, and CoOp
employ prompts fine-tuned with few-shot examples from ImageNet, whereas BaFTA operates without any
supervision, highlighting its effectiveness and flexibility. Additionally, BaFTA can be combined with such
fine-tuned prompts to provide further improvements, which significantly outperforms the baseline methods
ProGrad, CoOp, CoCoOp, TPT+CoOp and TPT+CoCoOp.
BaFTA outperforms SuS-X on ImageNet, although with slightly lower performance on ImageNet-R and
ImageNet-Sketch. It’s noteworthy that SuS-X leverages external resources such as Stable Diffusion ([137])
and LAION-5B ([142]) to construct support sets and transform the zero-shot problem into a few-shot problem. In contrast, BaFTA operates without any labels or external resources, showcasing its label-free and
resource-independent nature.
7.4.2 Ablation Studies
7.4.2.1 Comparison on Predictions from BaFTA.
Average ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-S Cars Caltech101 DTD EuroSAT FGVC Food101 Flower102 Pets UCF101 SUN397
CLIP single 56.92 66.73 47.87 60.86 73.98 46.09 55.70 85.88 40.37 23.69 15.66 73.97 61.75 83.57 58.84 58.80
TPT 64.21 68.98 54.77 63.45 77.06 47.94 66.87 94.16 47.75 42.44 24.78 84.67 68.98 87.79 68.04 65.50
BaFTA single 63.14 66.31 55.73 61.17 76.00 45.89 66.78 91.12 45.92 40.95 24.30 85.92 66.54 85.91 67.99 66.55
CLIP multi 63.46 68.34 49.89 61.88 77.65 48.24 66.11 93.55 45.04 50.42 23.22 82.86 66.99 86.92 65.16 65.63
BaFTA-RA 65.88 70.53 57.87 64.45 79.03 49.40 67.96 93.87 46.93 47.88 27.09 86.66 71.42 89.23 69.07 66.74
BaFTA-OC 62.53 66.77 52.43 50.15 74.74 46.65 64.99 92.12 49.41 49.52 24.78 85.72 58.60 89.31 68.60 64.11
BaFTA-Avg 65.58 69.40 53.55 63.22 76.56 48.57 67.97 93.61 48.97 48.43 26.28 85.96 72.77 91.18 69.27 67.94
BaFTA 67.26 71.43 58.19 64.46 79.06 50.51 69.44 94.08 50.30 50.49 27.00 87.03 73.81 92.61 71.13 69.34
Table 7.5: Comparison over different BaFTA predictions. BaFTA-RA (Rényi Aggregation): standard predictions over augmented views aggregated with Rényi Entropy; BaFTA-OC (Online Clustering): predictions generated with the clustering centroids; BaFTA-single: adapted with single-template CLIP; BaFTA-Avg: combines online clustering and augmentation predictions with a naive average instead of weighting with Rényi Entropy. BaFTA-RA, BaFTA-OC and BaFTA refer to the p_i, p̂_i and p̃_i from Eq. 7.6, respectively. All results produced with CLIP (ViT-B/16).
Table 7.5 presents the ablation results of BaFTA, assessing the accuracy of each of its prediction sets: p̂_i, p_i, and p̃_i, as described in Equation 7.6. The results reveal that simply applying Rényi Entropy to aggregate predictions from augmented views yields a 2.40% average accuracy improvement across the 15 datasets. The predictions generated with the online clustering centroids improve CLIP's performance on ImageNet-A, DTD, FGVC, Food101, Oxford-Pets, and UCF101, but do not show improvement on the others. This discrepancy may be attributed to two factors: 1) the accuracy is calculated over the entire dataset, where the clustering centroids are not yet stable on earlier examples; 2) some of the clustering centroids might become unreliable on datasets with a biased distribution or a large number of categories, such as ImageNet and SUN397. However, thanks to the Rényi Entropy aggregation, BaFTA is capable of leveraging the reliable predictions among the clustering-based predictions (p̂_i) and achieves an additional 1.40% improvement over p_i, resulting in a total improvement of 3.80% across the 15 datasets.
The results of BaFTA-single provide an ablation comparison on the effectiveness of templates. BaFTA-single employs the single-template CLIP as the base model for adaptation, resulting in a notable 6.22% improvement in averaged accuracy across the 15 datasets. Although BaFTA-single exhibits slightly less improvement compared to TPT, it boasts the advantages of requiring no back-propagation and of a five times faster inference speed. Furthermore, BaFTA is capable of taking advantage of the multi-template ensemble (as suggested by official CLIP), achieving an additional 4.12% improvement and reaching 67.26% averaged accuracy on the 15 datasets, which is not attainable by TPT.
The results of BaFTA-Avg provide an ablation comparison on the effectiveness of Rényi Entropy aggregation. In the BaFTA-Avg experiment, we substitute the Rényi Entropy aggregation with a simple average function to merge predictions from online clustering and augmentations. The results demonstrate that BaFTA-Avg yields a 2.12% improvement over the base model CLIP-multi, highlighting the effectiveness of augmentations and online clustering. Furthermore, BaFTA achieves an additional 1.68% improvement over BaFTA-Avg, underscoring the effectiveness of Rényi Entropy aggregation over naive averaging.
CLIP  p^b (averaged)  softmax(γp^b)  max(p^b)p^b  1[H(p^b)<τ]p^b  Ĥ(p^b)p^b  Re(p^b)p^b (α=0.25)  Re(p^b)p^b (α=0.5)  Re(p^b)p^b (α=0.75)
68.34 69.43 70.19 58.69 70.40 69.87 70.52 70.53 70.30
Table 7.6: Comparison of different methods to aggregate the predictions p^b from augmented views. All results are top-1 accuracy reported with CLIP (ViT-B/16) on ImageNet with 64 augmented views for each test example. Please refer to the text for details on the notation.
7.4.2.2 Comparison on Prediction Aggregation Method.
In Table 7.6, we present the ablation results on the choice of the aggregation function that merges the prediction results from the augmented views. We use the over-line, X̄ = (1/B) Σ_{b=1}^B X^b, to denote the average over the B augmented views. From left to right, we have: 1) CLIP: the baseline prediction without augmentation; 2) the averaged prediction p^b; 3) the soft majority-vote prediction softmax(γp^b); 4) the weighted-average prediction max(p^b)p^b with confidence estimated by the maximum entry of p^b; 5) the average of the low-entropy (high-confidence) predictions 1[H(p^b) < τ]p^b, as adopted by TPT; 6) the weighted-average prediction Ĥ(p^b)p^b with confidence estimated by the normalized entropy Ĥ(p^b) = (H_max − H(p^b))/H_max; 7) the weighted-average prediction Re(p^b)p^b with confidence estimated by the Rényi entropy Re(p^b), with entropy order α = 0.25, 0.50, 0.75; each weighted form is averaged over the B views. As shown in the table, the Rényi entropy at order 0.50 provides the best results among all the options.
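For reference, the aggregation candidates compared above can be sketched as follows (Python/NumPy). The synthetic probabilities, the threshold tau, the temperature gamma, and the use of the exponential Rényi weight at α = 0.5 are illustrative assumptions rather than the exact experimental configuration.

import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def weighted_average(probs, weights):
    return (weights[:, None] * probs).sum(axis=0) / weights.sum()

B, J, tau, gamma = 64, 1000, 2.0, 10.0                            # hypothetical sizes and hyperparameters
probs = np.random.default_rng(0).dirichlet(np.ones(J), size=B)    # stand-in for per-view predictions

plain_avg = probs.mean(axis=0)                                    # averaged prediction
soft_vote = np.exp(gamma * plain_avg); soft_vote /= soft_vote.sum()  # soft majority vote
max_weight = weighted_average(probs, probs.max(axis=1))           # weighted by max probability
h = entropy(probs)
low_ent = probs[h < tau].mean(axis=0) if (h < tau).any() else plain_avg  # TPT-style thresholding
norm_ent = weighted_average(probs, (np.log(J) - h) / np.log(J))   # weighted by normalized entropy
renyi = weighted_average(probs, (probs ** 0.5).sum(axis=-1) ** -2.0)  # exponential Rényi weight, alpha = 0.5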
7.4.2.3 Ablation Studies on Rényi Entropy Order α.
Average ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-S Cars Caltech101 DTD EuroSAT FGVC Food101 Flower102 Pets UCF101 SUN397
accuracy std. (%) 0.12 0.14 0.93 0.17 0.16 0.27 0.08 0.07 0.06 0.27 0.10 0.03 0.07 0.05 0.15 0.06
kNN w/o P^* 61.56 63.03 44.29 50.56 71.19 44.29 62.49 92.29 43.85 58.37 22.35 68.66 84.56 85.01 68.68 63.77
kNN w/ P^* 64.04 66.62 48.91 55.41 77.91 47.62 67.01 93.55 45.74 53.75 23.52 69.35 86.33 89.64 69.36 65.87
Table 7.7: Row 1: accuracy standard deviation of Rényi Entropy aggregated predictions over varying α. Rows 2-3: effectiveness of the projection P^* (Eq. 7.3) in improving the embedding distribution. Results produced with CLIP (ViT-B/16) embeddings, demonstrated by the top-1 accuracy improvement of a kNN classifier with k = 5.
In order to assess the sensitivity of Rényi Entropy aggregation performance to the entropy order α,
we analyze the accuracy of Rényi Entropy aggregated predictions from augmented views with varying α,
specifically α ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99}, over all 15 datasets used in our study. In
all experiments, we employ CLIP-ViT-B/16 as the base model and evaluate BaFTA-RA to investigate the
influence of α on prediction aggregation without the impact of online clustering results.
Figure 7.2 illustrates the α−accuracy curve across all 15 datasets. The curves are normalized by subtracting the maximum value within each curve, ensuring they are plotted within the same value range. The
bold red curves represent the averaged accuracy over the 15 datasets, revealing that the average performance
peaks at α = 0.5 and α = 0.6. Additionally, the plot indicates that prediction aggregation accuracy is
relatively insensitive to the choice of α, with most curves exhibiting less than a 0.3% change in accuracy
across the α range [0.1, 0.99]. Most datasets achieve peak performance with α in the range of [0.3, 0.8], and
selecting α = 0.5 guarantees the performance to be within 0.25% from the peak.
In Table 7.7, we display the accuracy standard deviation of Rényi Entropy aggregated predictions over
varying α. The table reveals that the accuracy standard deviation is less than 0.3% on most datasets, with
the exception of ImageNet-A (corresponding to the orange curve in Figure 7.2). ImageNet-A is composed
of challenging outlier examples where machine learning models often falter. It is possible that CLIP produces less confident and flatter prediction logits on ImageNet-A, rendering its performance more sensitive
to variations in α compared to other datasets.
Table 7.7 provides evidence of the effectiveness of the projection P^* in enhancing the distribution of CLIP embeddings for clustering, as proposed in [68]. The results demonstrate a 2.48% improvement in averaged k-nearest neighbor (kNN) classifier accuracy across the 15 datasets after projecting the CLIP embeddings with P^*. This improvement signifies that P^* successfully enhances the neighboring relationships among CLIP embeddings in the projection space, which, in turn, will benefit the online clustering process.
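The kNN comparison in Table 7.7 can be reproduced in spirit with the short sketch below (Python with scikit-learn). The random arrays, the train/test split, and the orthonormal basis U_prime are placeholders for the CLIP embeddings, the actual evaluation splits, and the Section 6.3.1 basis, respectively.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def knn_accuracy(features, labels, k=5):
    # L2-normalize so that Euclidean kNN behaves like cosine-similarity kNN
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    x_tr, x_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
    return KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr).score(x_te, y_te)

def project_all(features, U_prime):
    proj = features @ U_prime @ U_prime.T        # apply P* row-wise, then re-normalize
    return proj / np.linalg.norm(proj, axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))          # placeholder for CLIP visual embeddings
labels = rng.integers(0, 10, size=1000)          # placeholder labels
U_prime = np.linalg.qr(rng.normal(size=(512, 9)))[0]   # placeholder orthonormal basis
print(knn_accuracy(features, labels), knn_accuracy(project_all(features, U_prime), labels))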
[Figure 7.2 plot: x-axis, Rényi Entropy order α (0.1 to 1.0); y-axis, accuracy normalized by subtracting the maximum accuracy of each curve; one curve per dataset (ImageNet, ImageNet-A/V2/R/Sketch, Cars, Caltech101, DTD, EuroSAT, FGVC, Food101, Flower102, Pets, UCF101, SUN397) plus the bold average curve.]
Figure 7.2: α-accuracy curves on 15 datasets, with α ∈ [0.1, 0.99]. In order to fit all curves into one plot with a unified value range, all curves are normalized by subtracting the maximum accuracy within the curve. The bold red curve represents the averaged accuracy over the 15 datasets and achieves its maximum value at α = 0.5 and α = 0.6. This plot indicates that the prediction aggregation accuracy is not highly sensitive to the choice of α, with most curves exhibiting less than a 0.3% change in accuracy across the α range [0.1, 0.99].
In Figure 7.3, we present t-SNE plots for Oxford-IIIT-Pets, Describable Textures, and Stanford Cars
to visually showcase the distribution differences between original CLIP visual embeddings and projected
visual embeddings.
As illustrated by the t-SNE plots, the projection space effectively transforms sparse clusters into denser
formations, leading to improved online clustering results. Additionally, we observed that the enhancement
in clustering brought by the projection is potentially correlated with the classification accuracy on the respective datasets. For instance, CLIP attains an 86.92% zero-shot accuracy on Oxford-IIIT-Pets, and the projection significantly improves its clustering quality. In contrast, CLIP achieves only 45.04% accuracy on Describable Textures, where the improvement the projection provides in clustering quality is relatively subtle.
(a) Oxford-IIIT-Pets (Original) (b) Oxford-IIIT-Pets (Projected)
(c) Describable Textures (Original) (d) Describable Textures (Projected)
(e) Stanford Cars (Original) (f) Stanford Cars (Projected)
Figure 7.3: t-SNE plots of original and projected visual embeddings from the evaluation datasets.
Time Per Example (ms) TPT BaFTA
RN50 841.0 158.7
ViT-B/16 873.0 183.8
Table 7.8: Inference time comparison of TPT and BaFTA. All results reported with ImageNet examples,
evaluated on NVIDIA A40 GPU.
7.4.3 Inference Time Efficiency Analysis
In Table 7.8, we present a comparison of inference times between TPT and BaFTA using both ViT-B/16
and RN50 backbones. The inference time per example (in milliseconds) was calculated by recording the
total time required to complete a 10,000-iteration inference on ImageNet examples, with a single example
processed per iteration. All experiments were conducted on a computation node equipped with an AMD
EPYC 7313 CPU (32 cores), 256 GB memory, and a single NVIDIA A40 GPU (48GB).
As indicated in the table, BaFTA exhibits a notable advantage, being approximately 5 times faster than
TPT. The significant difference in inference time for TPT can be attributed to two main factors: 1) TPT
requires two forward passes and one backward pass in each iteration, whereas BaFTA requires only a single
forward pass; 2) TPT requires recomputing the classification embeddings through the text encoder at each forward pass, while BaFTA runs the text encoder only once offline and updates the classification embeddings directly in the embedding space during inference.
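The timing protocol above can be reproduced with a simple loop such as the following sketch (PyTorch). The function and argument names are placeholders, and the CUDA synchronization calls are included because GPU execution is asynchronous.

import time
import torch

def ms_per_example(infer_fn, example, iters=10000):
    # Average wall-clock milliseconds per example over `iters` single-example iterations
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        infer_fn(example)                        # one test example per iteration
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters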
7.5 Conclusion
In this chapter, we have focused on enhancing the performance of large-scale pre-trained vision-language
models, exemplified by CLIP, in the context of zero-shot image classification. While various test-time
prompt tuning methods have been developed to refine class embeddings during inference, they often grapple
with the challenge of selecting appropriate learning rates in the absence of validation data during test-time
training. To address this challenge, we have introduced a novel backpropagation-free method for test-time
adaptation in vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our
approach directly estimates class centroids using online clustering within a projected embedding space that
aligns text and visual embeddings. We have also proposed a dynamic aggregation technique for predictions,
leveraging both estimated and original class embeddings, as well as distinct augmented views. This aggregation is guided by an assessment of prediction reliability using Rényi entropy. Our comprehensive experimentation has consistently demonstrated that our approach outperforms state-of-the-art test-time adaptation
methods by a significant margin. This work contributes to improving vision-language models, offering a
practical solution for real-world applications.
Chapter 8
Epilogue
This dissertation contributes to the ongoing evolution of computer vision and machine learning, marking a transition from traditional methods to advanced, adaptable systems through a journey that spans from the era of ImageNet to the realm of vision-language models and open-vocabulary, zero-shot solutions.
The innovative methodologies, from SPAN in fully-supervised settings to BaFTA in zero-shot test-time
adaptation, represent a spectrum of approaches that bridge the gap between generic visual-language representations and specific task requirements. The research presented in this dissertation has demonstrated
advancements in various applications, such as image classification, forensic detection and object detection. The development of algorithms like EZAD, ReCLIP, and BaFTA highlights the growing potential of
vision-language models in reshaping computer vision and machine learning.
This dissertation aims to contribute to the field of domain adaptation, highlighting the shift towards
adaptable, efficient, and robust solutions. It is intended as a stepping stone for future research, pointing towards the importance of adaptability and continued innovation in meeting real-world challenges in computer
vision. In summary, this work reflects a moment in the ongoing journey of machine learning and computer
vision, hoping to inspire and inform future advancements.
Bibliography
[1] Michael J. Anderson, Benny Chen, Stephen Chen, Summer Deng, Jordan Fix, Michael K., and
et al. “First-Generation Inference Accelerator Deployment at Facebook”. In: ArXiv abs/2107.04140
(2021).
[2] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. “Multiple object recognition with visual
attention”. In: arXiv preprint arXiv:1412.7755 (2014).
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer normalization”. In: arXiv
preprint arXiv:1607.06450 (2016).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly
learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014).
[5] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. “Zero-shot
object detection”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 384–400.
[6] Jawadul H Bappy, Amit K Roy-Chowdhury, Jason Bunk, Lakshmanan Nataraj, and BS Manjunath.
“Exploiting spatial structure for localizing manipulated image regions”. In: Proceedings of the
IEEE international conference on computer vision. 2017, pp. 4970–4979.
[7] Jawadul H Bappy, Cody Simons, Lakshmanan Nataraj, BS Manjunath, and
Amit K Roy-Chowdhury. “Hybrid lstm and encoder–decoder architecture for detection of image
forgeries”. In: IEEE Transactions on Image Processing 28.7 (2019), pp. 3286–3300.
[8] Wesam Barbakh and Colin Fyfe. “Online clustering algorithms”. In: International journal of neural
systems 18.03 (2008), pp. 185–194.
[9] Belhassen Bayar and Matthew C Stamm. “A deep learning approach to universal image
manipulation detection using a new convolutional layer”. In: Proceedings of the 4th ACM
Workshop on Information Hiding and Multimedia Security. 2016, pp. 5–10.
[10] Belhassen Bayar and Matthew C Stamm. “Constrained convolutional neural networks: A new
approach towards general purpose image manipulation detection”. In: IEEE Transactions on
Information Forensics and Security 13.11 (2018), pp. 2691–2706.
[11] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and
Peter N Belhumeur. “Birdsnap: Large-scale fine-grained visual categorization of birds”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014,
pp. 2011–2018.
[12] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and
Colin Raffel. “ReMixMatch: Semi-Supervised Learning with Distribution Matching and
Augmentation Anchoring”. In: International Conference on Learning Representations. Apr. 2020.
[13] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and
Colin A. Raffel. “MixMatch: A Holistic Approach to Semi-Supervised Learning”. In: Neural
Information Processing Systems. Dec. 2019, pp. 5049–5059.
[14] A. Bhattacharyya. “On a Measure of Divergence between Two Multinomial Populations”. In:
Sankhyā: The Indian Journal of Statistics (1933-1960) 7.4 (1946), pp. 401–406. ISSN: 00364452.
URL: http://www.jstor.org/stable/25047882.
[15] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101–mining discriminative
components with random forests”. In: European conference on computer vision. Springer. 2014,
pp. 446–461.
[16] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101–mining discriminative
components with random forests”. In: European conference on computer vision. Springer. 2014,
pp. 446–461.
[17] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are
few-shot learners”. In: arXiv preprint arXiv:2005.14165 (2020).
[18] Joao Carreira and Andrew Zisserman. “Quo vadis, action recognition? a new model and the
kinetics dataset”. In: proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2017, pp. 6299–6308.
[19] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. “Semi-Supervised Learning”. In: IEEE
Transactions on Neural Networks 20.3 (Mar. 2010). DOI:
10.7551/MITPRESS/9780262033589.001.0001.
[20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for
contrastive learning of visual representations”. In: arXiv preprint arXiv:2002.05709 (2020).
[21] Xinlei Chen and Kaiming He. “Exploring simple siamese representation learning”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 15750–15758.
[22] Gong Cheng, Junwei Han, and Xiaoqiang Lu. “Remote sensing image scene classification:
Benchmark and state of the art”. In: Proceedings of the IEEE 105.10 (2017), pp. 1865–1883.
[23] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi.
“Describing textures in the wild”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2014, pp. 3606–3613.
[24] Adam Coates, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in
unsupervised feature learning”. In: Proceedings of the fourteenth international conference on
artificial intelligence and statistics. JMLR Workshop and Conference Proceedings. 2011,
pp. 215–223.
[25] Adam Coates, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in
unsupervised feature learning”. In: Proceedings of the fourteenth international conference on
artificial intelligence and statistics. JMLR Workshop and Conference Proceedings. 2011,
pp. 215–223.
[26] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. “Efficient dense-field copy–move forgery
detection”. In: IEEE Transactions on Information Forensics and Security 10.11 (2015),
pp. 2284–2297.
[27] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. “Splicebuster: A new blind image
splicing detector”. In: 2015 IEEE International Workshop on Information Forensics and Security
(WIFS). IEEE. 2015, pp. 1–6.
[28] Gabriela Csurka. “Domain adaptation for visual applications: A comprehensive survey”. In: arXiv
preprint arXiv:1702.05374 (2017).
[29] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. “Randaugment: Practical
Automated Data Augmentation With a Reduced Search Space”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. June 2020. URL:
https://openaccess.thecvf.com/content_CVPRW_2020/html/w40/Cubuk_Randaugment_Practical_
Automated_Data_Augmentation_With_a_Reduced_Search_Space_CVPRW_2020_paper.html.
[30] Hal Daumé III, Abhishek Kumar, and Avishek Saha. “Frustratingly easy semi-supervised domain
adaptation”. In: Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language
Processing. 2010, pp. 53–59.
[31] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. “ImageNet: A large-scale hierarchical
image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009,
pp. 248–255. DOI: 10.1109/CVPR.2009.5206848.
[32] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale
hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern
recognition. Ieee. 2009, pp. 248–255.
[33] Li Deng. “The mnist database of handwritten digit images for machine learning research”. In:
IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142.
[34] Jing Dong, Wei Wang, and Tieniu Tan. “Casia image tampering detection evaluation database”. In:
2013 IEEE China Summit and International Conference on Signal and Information Processing.
IEEE. 2013, pp. 422–426.
[35] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[36] Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. “Low-shot learning with
large-scale diffusion”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 3349–3358.
[37] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual
Object Classes Challenge 2007 (VOC2007) Results.
http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[38] Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie,
and Lin Ma. “Promptdet: Expand your detector vocabulary with uncurated images”. In: arXiv
preprint arXiv:2203.16513 (2022).
[39] Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. “Diverse Data
Augmentation with Diffusions for Effective Test-time Prompt Tuning”. In: arXiv preprint
arXiv:2308.06038 (2023).
[40] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. “Image forgery
localization via fine-grained analysis of CFA artifacts”. In: IEEE Transactions on Information
Forensics and Security 7.5 (2012), pp. 1566–1577.
[41] Geoffrey French, Michal Mackiewicz, and Mark H. Fisher. “Self-ensembling for visual domain
adaptation”. In: International Conference on Learning Representations. Feb. 2018.
[42] Jessica Fridrich and Jan Kodovsky. “Rich models for steganalysis of digital images”. In: IEEE
Transactions on Information Forensics and Security 7.3 (2012), pp. 868–882.
[43] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean,
Marc’Aurelio Ranzato, and Tomas Mikolov. “Devise: A deep visual-semantic embedding model”.
In: Advances in Neural Information Processing Systems (2013).
[44] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
International conference on machine learning. PMLR. 2015, pp. 1180–1189.
[45] Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, and
Caiming Xiong. “Towards open vocabulary object detection without human-provided bounding
boxes”. In: arXiv preprint arXiv:2111.09452 (2021).
[46] Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam,
Laurent Itti, Balaji Lakshminarayanan, and Jiaping Zhao. “Improving Zero-shot Generalization and
Robustness of Multi-modal Models”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2023, pp. 11093–11101.
[47] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. “Vision meets robotics: The
kitti dataset”. In: The International Journal of Robotics Research 32.11 (2013), pp. 1231–1237.
[48] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and
Barret Zoph. “Simple copy-paste is a strong data augmentation method for instance segmentation”.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 2918–2928.
[49] Thomas Gloe and Rainer Böhme. “The dresden image database for benchmarking digital image
forensics”. In: Journal of Digital Forensic Practice 3.2-4 (2010), pp. 150–159.
[50] Yves Grandvalet and Yoshua Bengio. “Semi-supervised learning by entropy minimization”. In:
Advances in neural information processing systems. 2005, pp. 529–536.
[51] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. “Open-vocabulary object detection via
vision and language knowledge distillation”. In: arXiv preprint arXiv:2104.13921 (2021).
[52] Agrim Gupta, Piotr Dollar, and Ross Girshick. “Lvis: A dataset for large vocabulary instance
segmentation”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2019, pp. 5356–5364.
[53] Guangxing Han, Shiyuan Huang, Jiawei Ma, Yicheng He, and Shih-Fu Chang. “Meta faster r-cnn:
Towards accurate few-shot object detection with attentive feature alignment”. In: Proceedings of
the AAAI Conference on Artificial Intelligence. Vol. 36. 1. 2022, pp. 780–789.
[54] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. “Momentum Contrast for
Unsupervised Visual Representation Learning”. In: Computer Vision and Pattern Recognition. June
2020, pp. 9729–9738. DOI: 10.1109/CVPR42600.2020.00975.
[55] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. “Momentum contrast for
unsupervised visual representation learning”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2020, pp. 9729–9738.
[56] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask r-cnn”. In: Proceedings of
the IEEE international conference on computer vision. 2017, pp. 2961–2969.
[57] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image
Recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). June 2016.
[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[59] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. “Eurosat: A novel dataset
and deep learning benchmark for land use and land cover classification”. In: IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019), pp. 2217–2226.
[60] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo,
Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and
Justin Gilmer. “The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution
Generalization”. In: ICCV (2021).
[61] Dan Hendrycks and Thomas Dietterich. “Benchmarking neural network robustness to common
corruptions and perturbations”. In: arXiv preprint arXiv:1903.12261 (2019).
[62] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. “Natural
adversarial examples”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2021, pp. 15262–15271.
[63] Geoffrey E Hinton and Drew Van Camp. “Keeping the neural networks simple by minimizing the
description length of the weights”. In: Proceedings of the sixth annual conference on
Computational learning theory. 1993, pp. 5–13.
[64] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8
(1997), pp. 1735–1780.
[65] Jie Hu, Li Shen, and Gang Sun. “Squeeze-and-excitation networks”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 7132–7141.
[66] Xuefeng Hu, Gokhan Uzunbas, Sirius Chen, Rui Wang, Ashish Shah, Ram Nevatia, and
Ser-Nam Lim. “Mixnorm: Test-time adaptation through online normalization estimation”. In: arXiv
preprint arXiv:2110.11478 (2021).
[67] Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao,
Xiao Zeng, Min Sun, et al. “ReCLIP: Refine contrastive language image pre-training with source
free domain adaptation”. In: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision. 2024, pp. 2994–3003.
[68] Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao,
Xiao Zeng, Min Sun, Cheng-Hao Kuo, and Ram Nevatia. “ReCLIP: Refine Contrastive Language
Image Pre-Training with Source Free Domain Adaptation”. In: arXiv preprint arXiv:2308.03793
(2023).
[69] Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, and
Ram Nevatia. “SPAN: Spatial pyramid attention network for image manipulation localization”. In:
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XXI 16. Springer. 2020, pp. 312–328.
[70] Zijian Hu, Zhengyu Yang, Xuefeng Hu, and Ram Nevatia. “Simple: Similar pseudo label
exploitation for semi-supervised classification”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2021, pp. 15099–15108.
[71] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. “Fighting fake news: Image
splice detection via learned self-consistency”. In: Proceedings of the European Conference on
Computer Vision (ECCV). 2018, pp. 101–117.
[72] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by
reducing internal covariate shift”. In: International conference on machine learning. PMLR. 2015,
pp. 448–456.
[73] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. “Label Propagation for Deep
Semi-Supervised Learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). June 2019.
[74] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. “Label propagation for deep
semi-supervised learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2019, pp. 5070–5079.
[75] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le,
Yun-Hsuan Sung, Zhen Li, and Tom Duerig. “Scaling up visual and vision-language representation
learning with noisy text supervision”. In: International Conference on Machine Learning. PMLR.
2021, pp. 4904–4916.
[76] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and
Ser-Nam Lim. “Visual prompt tuning”. In: arXiv preprint arXiv:2203.12119 (2022).
[77] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and
Ross Girshick. “Clevr: A diagnostic dataset for compositional language and elementary visual
reasoning”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2017, pp. 2901–2910.
[78] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. “Contrastive adaptation network
for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2019, pp. 4893–4902.
[79] Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh,
Pratik Ringshia, and Davide Testuggine. “The hateful memes challenge: Detecting hate speech in
multimodal memes”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 2611–2624.
[80] Konwoo Kim, Michael Laskin, Igor Mordatch, and Deepak Pathak. How to Adapt Your Large-Scale
Vision-and-Language Model. 2022. URL: https://openreview.net/forum?id=EhwEUb2ynIa.
[81] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[82] Vladimir V Kniaz, Vladimir Knyaz, and Fabio Remondino. “The Point Where Reality Meets
Fantasy: Mixed Adversarial Generators for Image Splice Detection”. In: Advances in Neural
Information Processing Systems. 2019, pp. 215–226.
[83] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. “3D Object Representations for
Fine-Grained Categorization”. In: 4th International IEEE Workshop on 3D Representation and
Recognition (3dRR-13). Sydney, Australia, 2013.
[84] Neal Krawetz and Hacker Factor Solutions. “A picture’s worth”. In: Hacker Factor Solutions 6.2
(2007), p. 2.
[85] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: Citeseer (2009).
[86] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: (2009).
[87] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep
Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25.
Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012,
pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf (visited on 12/07/2018).
[88] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. “F-vlm:
Open-vocabulary object detection upon frozen vision and language models”. In: arXiv preprint
arXiv:2209.15639 (2022).
[89] Zhengfeng Lai, Sol Vesdapunt, Ning Zhou, Jun Wu, Cong Phuoc Huynh, Xuelu Li, Kah Kuen Fu,
and Chen-Nee Chuah. “PADCLIP: Pseudo-labeling with adaptive debiasing in CLIP for
unsupervised domain adaptation”. In: ICCV (2023).
[90] Samuli Laine and Timo Aila. “Temporal Ensembling for Semi-Supervised Learning”. In: arXiv:
Neural and Evolutionary Computing (Oct. 2016).
[91] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. “Attribute-based classification for
zero-shot visual object categorization”. In: IEEE transactions on pattern analysis and machine
intelligence 36.3 (2013), pp. 453–465.
[92] Aleksandr Laptev and Boris Ginsburg. “Fast Entropy-Based Methods of Word-Level Confidence
Estimation for End-to-End Automatic Speech Recognition”. In: 2022 IEEE Spoken Language
Technology Workshop (SLT). IEEE. 2023, pp. 152–159.
[93] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied
to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[94] Dong-Hyun Lee. “Pseudo-label: The simple and efficient semi-supervised learning method for
deep neural networks”. In: Workshop on challenges in representation learning, ICML. Vol. 3. 2013.
[95] Li, Andreeto, Ranzato, and Perona. Caltech 101. Apr. 2022. DOI: 10.22002/D1.20086.
[96] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and
Steven Chu Hong Hoi. “Align before fuse: Vision and language representation learning with
momentum distillation”. In: Advances in neural information processing systems 34 (2021),
pp. 9694–9705.
[97] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. “Selective kernel networks”. In: Proceedings
of the IEEE conference on computer vision and pattern recognition. 2019, pp. 510–519.
[98] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and
Junjie Yan. “Supervision exists everywhere: A data efficient contrastive language-image
pre-training paradigm”. In: arXiv preprint arXiv:2110.05208 (2021).
[99] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. “Revisiting batch
normalization for practical domain adaptation”. In: arXiv preprint arXiv:1603.04779 (2016).
[100] Jian Liang, Ran He, and Tieniu Tan. “A comprehensive survey on test-time adaptation under
distribution shifts”. In: arXiv preprint arXiv:2303.15361 (2023).
[101] Jian Liang, Dapeng Hu, and Jiashi Feng. “Do we really need to access the source data? source
hypothesis transfer for unsupervised domain adaptation”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 6028–6039.
[102] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. “Mind the
gap: Understanding the modality gap in multi-modal contrastive representation learning”. In:
Advances in Neural Information Processing Systems 35 (2022), pp. 17612–17625.
[103] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
“Feature pyramid networks for object detection”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017, pp. 2117–2125.
[104] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. “Microsoft coco: Common objects in context”. In: European
conference on computer vision. Springer. 2014, pp. 740–755.
[105] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic
segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015, pp. 3431–3440.
[106] Yanxin Long, Jianhua Han, Runhui Huang, Xu Hang, Yi Zhu, Chunjing Xu, and Xiaodan Liang.
“POVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object
Detection”. In: arXiv preprint arXiv:2211.00849 (2022).
[107] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In: International
Conference on Learning Representations. 2019. URL:
https://openreview.net/forum?id=Bkg6RiCqY7.
[108] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint
arXiv:1711.05101 (2017).
[109] Ilya Loshchilov and Frank Hutter. “Sgdr: Stochastic gradient descent with warm restarts”. In: arXiv
preprint arXiv:1608.03983 (2016).
[110] Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and
Weiming Hu. “Open-Vocabulary One-Stage Detection With Hierarchical Visual-Language
Knowledge Distillation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). June 2022, pp. 14074–14083.
[111] Babak Mahdian and Stanislav Saic. “Using noise inconsistencies for blind image forensics”. In:
Image and Vision Computing 27.10 (2009), pp. 1497–1503.
[112] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-Grained Visual Classification of
Aircraft. Tech. rep. 2013. arXiv: 1306.5151 [cs-cv].
[113] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and
Chaowei Xiao. “Test-Time Prompt Tuning for Zero-shot Generalization in Vision-Language
Models”. In: NeurIPS. 2022.
[114] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. “Virtual Adversarial Training: A Regularization
Method for Supervised and Semi-Supervised Learning”. In: IEEE Transactions on Pattern Analysis
and Machine Intelligence 41.8 (2019), pp. 1979–1993. DOI: 10.1109/TPAMI.2018.2858821.
[115] Takeru Miyato, Shin-Ichi Maeda, Masanori Koyama, and Shin Ishii. “Virtual Adversarial Training:
A Regularization Method for Supervised and Semi-Supervised Learning”. In: IEEE Transactions
on Pattern Analysis and Machine Intelligence 41.8 (Aug. 2019), pp. 1979–1993. DOI:
10.1109/TPAMI.2018.2858821.
[116] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. “Slip: Self-supervision meets
language-image pre-training”. In: Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI. Springer. 2022, pp. 529–544.
[117] Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. “Fixbi: Bridging domain spaces
for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 1094–1103.
[118] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. “Reading
digits in natural images with unsupervised feature learning”. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011).
[119] Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. “Columbia image splicing detection evaluation
dataset”. In: DVMM Lab, Columbia University and CalPhotos Digital Library (2009).
[120] Maria-Elena Nilsback and Andrew Zisserman. “Automated flower classification over a large
number of classes”. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image
Processing. IEEE. 2008, pp. 722–729.
[121] NIST. NIST Nimble 2016 Datasets. https://www.nist.gov/itl/iad/mig/. 2016.
[122] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and
Mingkui Tan. “Towards stable test-time adaptation in dynamic wild world”. In: arXiv preprint
arXiv:2302.12400 (2023).
[123] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov,
Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. “Dinov2: Learning
robust visual features without supervision”. In: arXiv preprint arXiv:2304.07193 (2023).
[124] Tae Ha Park and Simone D’Amico. “Robust multi-task learning and online refinement for
spacecraft pose estimation across domain gap”. In: Advances in Space Research (2023).
[125] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. “Cats and dogs”. In: 2012
IEEE conference on computer vision and pattern recognition. IEEE. 2012, pp. 3498–3505.
[126] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. “Image transformer”. In: arXiv preprint arXiv:1802.05751 (2018).
[127] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. “Visual domain
adaptation: A survey of recent advances”. In: IEEE signal processing magazine 32.3 (2015),
pp. 53–69.
[128] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. “Moment
matching for multi-source domain adaptation”. In: Proceedings of the IEEE International
Conference on Computer Vision. 2019, pp. 1406–1415.
[129] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko.
“Visda: The visual domain adaptation challenge”. In: arXiv preprint arXiv:1710.06924 (2017).
[130] Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer.
Dataset shift in machine learning. Mit Press, 2009.
[131] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. “Learning transferable visual
models from natural language supervision”. In: arXiv preprint arXiv:2103.00020 (2021).
[132] Shafin Rahman, Salman Khan, and Nick Barnes. “Improved visual-semantic alignment for
zero-shot object detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 34. 07. 2020, pp. 11932–11939.
[133] Yuan Rao and Jiangqun Ni. “A deep learning approach to detection of splicing and copy-move
forgeries in images”. In: 2016 IEEE International Workshop on Information Forensics and Security
(WIFS). IEEE. 2016, pp. 1–6.
[134] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. “Do imagenet
classifiers generalize to imagenet?” In: International conference on machine learning. PMLR.
2019, pp. 5389–5400.
[135] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object
detection with region proposal networks”. In: Advances in neural information processing systems.
2015, pp. 91–99.
[136] Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary R. Bradski. “Kornia: an Open
Source Differentiable Computer Vision Library for PyTorch”. In: Winter Conference on
Applications of Computer Vision. 2020. URL: https://arxiv.org/pdf/1910.02190.pdf.
[137] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
“High-resolution image synthesis with latent diffusion models”. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 2022, pp. 10684–10695.
[138] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. “Adapting visual category models to
new domains”. In: European conference on computer vision. Springer. 2010, pp. 213–226.
[139] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. “Semi-supervised
domain adaptation via minimax entropy”. In: Proceedings of the IEEE/CVF international
conference on computer vision. 2019, pp. 8050–8058.
[140] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. “Regularization with stochastic
transformations and perturbations for deep semi-supervised learning”. In: Neural Information
Processing Systems. Dec. 2016, pp. 1171–1179.
[141] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. “Image splicing localization using a multi-task
fully convolutional network (MFCN)”. In: Journal of Visual Communication and Image
Representation 51 (2018), pp. 201–209.
[142] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman,
Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. “Laion-5b:
An open large-scale dataset for training next generation image-text models”. In: Advances in
Neural Information Processing Systems 35 (2022), pp. 25278–25294.
[143] Astuti Sharma, Tarun Kalluri, and Manmohan Chandraker. “Instance level affinity-based transfer
for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 5361–5371.
[144] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image
recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[145] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk,
Alex Kurakin, Han Zhang, and Colin Raffel. “FixMatch: Simplifying Semi-Supervised Learning
with Consistency and Confidence”. In: arXiv: Learning (Jan. 2020).
[146] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human
actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012).
[147] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. “Man vs. computer: Benchmarking machine
learning algorithms for traffic sign recognition”. In: Neural Networks 32 (2012), pp. 323–332.
ISSN: 0893-6080. DOI: 10.1016/j.neunet.2012.02.016.
[148] Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. “Unsupervised domain adaptation through
self-supervision”. In: arXiv preprint arXiv:1909.11825 (2019).
[149] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. “Test-time
training with self-supervision for generalization under distribution shifts”. In: International
Conference on Machine Learning. PMLR. 2020, pp. 9229–9248.
[150] Martin Szummer and Tommi Jaakkola. “Partially labeled classification with Markov random
walks”. In: Advances in neural information processing systems. 2002, pp. 945–952.
[151] Korawat Tanwisuth, Shujian Zhang, Huangjie Zheng, Pengcheng He, and Mingyuan Zhou.
“POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models”. In: arXiv preprint
arXiv:2305.00350 (2023).
[152] Antti Tarvainen and Harri Valpola. “Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results”. In: International Conference
on Learning Representations. Jan. 2017.
[153] Antti Tarvainen and Harri Valpola. “Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results”. In: Advances in Neural
Information Processing Systems. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc., 2017, pp. 1195–1204. URL:
https://proceedings.neurips.cc/paper/2017/file/68053af2923e00204c3ca7c6a3150cf7-Paper.pdf.
[154] Vishaal Udandarao. “Understanding and Fixing the Modality Gap in Vision-Language Models”.
Master’s thesis. University of Cambridge, 2022.
[155] Vishaal Udandarao, Ankush Gupta, and Samuel Albanie. “Sus-x: Training-free name-only transfer
of vision-language models”. In: arXiv preprint arXiv:2211.16198 (2022).
[156] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural
information processing systems. 2017, pp. 5998–6008.
[157] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural
information processing systems 30 (2017).
[158] Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. “Rotation
equivariant CNNs for digital pathology”. In: International Conference on Medical image
computing and computer-assisted intervention. Springer. 2018, pp. 210–218.
[159] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. “Deep
hashing network for unsupervised domain adaptation”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017, pp. 5018–5027.
[160] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. “Interpolation
Consistency Training for Semi-supervised Learning”. In: International Joint Conference on
Artificial Intelligence. Aug. 2019, pp. 3635–3641. DOI: 10.24963/IJCAI.2019/504.
[161] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra.
“Matching Networks for One Shot Learning”. In: Advances in Neural Information Processing
Systems. Ed. by D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett. Vol. 29. Curran
Associates, Inc., 2016, pp. 3630–3638. URL:
https://proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.
[162] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. “Tent: Fully
test-time adaptation by entropy minimization”. In: arXiv preprint arXiv:2006.10726 (2020).
[163] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. “Learning Robust Global
Representations by Penalizing Local Predictive Power”. In: Advances in Neural Information
Processing Systems. 2019, pp. 10506–10518.
[164] Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, and Si Liu.
“Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection”. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 11186–11196.
[165] Mei Wang and Weihong Deng. “Deep visual domain adaptation: A survey”. In: Neurocomputing
312 (2018), pp. 135–153.
[166] Tao Wang. “Learning to detect and segment for open vocabulary object detection”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 7051–7060.
[167] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-local neural networks”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2018,
pp. 7794–7803.
[168] Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. “Frustratingly
simple few-shot object detection”. In: arXiv preprint arXiv:2003.06957 (2020).
[169] Guoqiang Wei, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. “Metaalign: Coordinating domain
alignment and classification for unsupervised domain adaptation”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 16643–16653.
[170] Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing Shen, and
Stefan Winkler. “COVERAGE—A novel database for copy-move forgery detection”. In: 2016
IEEE International Conference on Image Processing (ICIP). IEEE. 2016, pp. 161–165.
[171] Garrett Wilson and Diane J Cook. “A survey of unsupervised deep domain adaptation”. In: ACM
Transactions on Intelligent Systems and Technology (TIST) 11.5 (2020), pp. 1–46.
[172] Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. “CORA: Adapting CLIP for
Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching”. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 7031–7040.
[173] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. “Busternet: Detecting copy-move image
forgery with source/target localization”. In: Proceedings of the European Conference on Computer
Vision (ECCV). 2018, pp. 168–184.
[174] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. “Deep matching and validation network: An
end-to-end solution to constrained image splicing localization and detection”. In: Proceedings of
the 25th ACM international conference on Multimedia. 2017, pp. 1480–1502.
[175] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. “Image copy-move forgery detection via an
end-to-end deep neural network”. In: 2018 IEEE Winter Conference on Applications of Computer
Vision (WACV). IEEE. 2018, pp. 1907–1915.
[176] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. “Mantra-net: Manipulation tracing
network for detection and localization of image forgeries with anomalous features”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019,
pp. 9543–9552.
[177] Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and
Xiaoqiang Lu. “AID: A benchmark data set for performance evaluation of aerial scene
classification”. In: IEEE Transactions on Geoscience and Remote Sensing 55.7 (2017),
pp. 3965–3981.
[178] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. “Zero-shot learning—a
comprehensive evaluation of the good, the bad and the ugly”. In: IEEE transactions on pattern
analysis and machine intelligence 41.9 (2018), pp. 2251–2265.
[179] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. “Sun database:
Large-scale scene recognition from abbey to zoo”. In: 2010 IEEE computer society conference on
computer vision and pattern recognition. IEEE. 2010, pp. 3485–3492.
[180] Johnathan Xie and Shuai Zheng. “Zero-shot Object Detection Through Vision-Language
Embedding Alignment”. In: 2022 IEEE International Conference on Data Mining Workshops
(ICDMW). IEEE. 2022, pp. 1–15.
[181] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. “Unsupervised data
augmentation for consistency training”. In: Advances in Neural Information Processing Systems 33
(2020).
[182] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo.
“Convolutional LSTM network: A machine learning approach for precipitation nowcasting”. In:
Advances in neural information processing systems. 2015, pp. 802–810.
[183] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov,
Rich Zemel, and Yoshua Bengio. “Show, attend and tell: Neural image caption generation with
visual attention”. In: International conference on machine learning. 2015, pp. 2048–2057.
[184] Shiqi Yang, Shangling Jui, Joost van de Weijer, et al. “Attracting and dispersing: A simple
approach for source-free domain adaptation”. In: Advances in Neural Information Processing
Systems 35 (2022), pp. 5802–5815.
[185] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. “Stacked attention networks
for image question answering”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2016, pp. 21–29.
[186] Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. “Semi-supervised domain
adaptation with subspace learning for visual recognition”. In: Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition. 2015, pp. 2142–2150.
[187] Sergey Zagoruyko and Nikos Komodakis. “Wide Residual Networks”. In: Proceedings of the
British Machine Vision Conference (BMVC). Ed. by Edwin R. Hancock, Richard C. Wilson, and
William A. P. Smith. BMVA Press, Sept. 2016, pp. 87.1–87.12. ISBN: 1-901725-59-6. DOI:
10.5244/C.30.87.
[188] Sergey Zagoruyko and Nikos Komodakis. “Wide residual networks”. In: arXiv preprint
arXiv:1605.07146 (2016).
[189] Lutfiah Zahara, Purnawarman Musa, Eri Prasetyo Wibowo, Irwan Karim, and Saiful Bahri Musa.
“The facial emotion recognition (FER-2013) dataset for prediction system of micro-expressions
face using the convolutional neural network (CNN) algorithm based Raspberry Pi”. In: 2020 Fifth
international conference on informatics and computing (ICIC). IEEE. 2020, pp. 1–9.
[190] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. “Open-vocabulary detr
with conditional matching”. In: European Conference on Computer Vision. Springer. 2022,
pp. 106–122.
[191] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. “Open-vocabulary object
detection using captions”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2021, pp. 14393–14402.
[192] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. “Three mechanisms of weight
decay regularization”. In: arXiv preprint arXiv:1810.12281 (2018).
[193] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. “mixup: Beyond
Empirical Risk Minimization”. In: International Conference on Learning Representations. Feb.
2018.
[194] Liheng Zhang and Guo-Jun Qi. “WCP: Worst-Case Perturbations for Semi-Supervised Deep
Learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). June 2020. URL:
http://openaccess.thecvf.com/content_CVPR_2020/html/Zhang_WCP_Worst-Case_Perturbations_for_Semi-Supervised_Deep_Learning_CVPR_2020_paper.html (visited on
06/19/2020).
[195] Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, BG Vijay Kumar,
Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N Metaxas. “Exploiting unlabeled
data with vision and language models for object detection”. In: European Conference on Computer
Vision. Springer. 2022, pp. 159–175.
[196] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li,
Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. “Regionclip: Region-based language-image
pretraining”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2022, pp. 16793–16803.
[197] Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf.
“Learning with Local and Global Consistency”. In: Neural Information Processing Systems. Dec.
2003.
[198] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. “Conditional Prompt Learning
for Vision-Language Models”. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). 2022.
[199] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. “Learning to Prompt for
Vision-Language Models”. In: International Journal of Computer Vision (IJCV) (2022).
[200] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. “Learning rich features for image
manipulation detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 1053–1061.
[201] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. “Two-stream neural networks for
tampered face detection”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW). IEEE. 2017, pp. 1831–1839.
[202] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. “Detecting
twenty-thousand classes using image-level supervision”. In: Computer Vision–ECCV 2022: 17th
European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX. Springer. 2022,
pp. 350–368.
[203] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. “Prompt-aligned gradient for
prompt tuning”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision.
2023, pp. 15659–15669.
[204] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. “Don’t Even Look Once: Synthesizing
Features for Zero-Shot Detection”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2020, pp. 11693–11702.
[205] Xinshan Zhu, Yongjun Qian, Xianfeng Zhao, Biao Sun, and Ya Sun. “A deep learning approach to
patch-based image inpainting forensics”. In: Signal Processing: Image Communication 67 (2018),
pp. 90–99.
Abstract
In recent years, the field of computer vision and machine learning has witnessed a paradigm shift, characterized by a dramatic increase in the scale of model parameters and training data. This evolution has led to significant gains in model accuracy and robustness, moving beyond traditional, task-specific expert models. The field has now pivoted towards universal, large-scale pre-trained visual representations, which enable impressive zero-shot and few-shot solutions for a wide array of downstream tasks.
Despite these advancements, applying pre-trained models to specific downstream tasks, each with its own conditions and domain-specific challenges, often exposes inherent limitations. This dissertation aims to tackle these challenges. The research spans a spectrum of approaches, from fully supervised to source-free and test-time adaptation, with diverse applications such as image classification, object detection, and forensic detection. It introduces novel architectures such as SPAN, which pioneered the use of the self-attention mechanism for image manipulation localization, as well as innovative adaptation algorithms such as ReCLIP and BaFTA, which enhance zero-shot classification performance through unsupervised vision-text alignment. The dissertation marks a transition from classic visual representations, such as those pre-trained on ImageNet, to cutting-edge vision-language models such as CLIP, and overcomes some of the most pressing challenges in the field.
The works in this dissertation play an important role in bridging the gap between generic visual representations and the specific, nuanced requirements of real-world tasks. In doing so, they establish new benchmarks for optimizing the performance of machine learning models in practical applications, reinforcing the role of advanced computational techniques in solving complex, real-world problems.
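To make the zero-shot classification setting concrete, the sketch below shows the standard CLIP zero-shot baseline that adaptation methods such as ReCLIP and BaFTA start from: class names are wrapped in text prompts, and an image is assigned to the class whose text embedding is most similar to its image embedding. This is a minimal illustration only, not code from the dissertation; it assumes the open-source clip package (github.com/openai/CLIP), and the image path and label set are hypothetical placeholders.

    # Zero-shot classification with CLIP: illustrative sketch only.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    class_names = ["dog", "cat", "car"]  # hypothetical label set
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        # Normalize so the dot product below is a cosine similarity.
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

    print(class_names[probs.argmax().item()])

Adaptation methods of the kind described above refine this baseline without target labels, for example by aligning the visual and text embedding spaces on unlabeled target data before classification.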
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Transfer learning for intelligent systems in the wild
Effective incremental learning and detector adaptation methods for video object detection
Grounding language in images and videos
Bridging the visual reasoning gaps in multi-modal models
Towards more occlusion-robust deep visual object tracking
Visual knowledge transfer with deep learning techniques
Incorporating large-scale vision-language corpora in visual understanding
Scaling recommendation models with data-aware architectures and hardware efficient implementations
Hashcode representations of natural language for relation extraction
Event detection and recounting from large-scale consumer videos
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Automatic evaluation of open-domain dialogue systems
Visual representation learning with structural prior
Efficient and accurate object extraction from scanned maps by leveraging external data and learning representative context
Creating cross-modal, context-aware representations of music for downstream tasks
Multimodal representation learning of affective behavior
Robust causal inference with machine learning on observational data
Invariant representation learning for robust and fair predictions
Fast and label-efficient graph representation learning
Learning controllable data generation for scalable model training
Asset Metadata
Creator
Hu, Xuefeng (author)
Core Title
Adapting pre-trained representation towards downstream tasks
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
07/24/2024
Defense Date
01/19/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer vision,domain adaptation,forensic detection,image classification,machine learning,OAI-PMH Harvest,object detection,representation learning,semi-supervised learning,source-free domain adaptation,test-time adaptation,unsupervised domain adaptation,vision-language model,zero-shot learning
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Nevatia, Ram (committee chair), Galstyan, Aram (committee member), Jenkins, Keith (committee member)
Creator Email
xuefengh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113810132
Unique identifier
UC113810132
Identifier
etd-HuXuefeng-12627.pdf (filename)
Legacy Identifier
etd-HuXuefeng-12627
Document Type
Dissertation
Format
theses (aat)
Rights
Hu, Xuefeng
Internet Media Type
application/pdf
Type
texts
Source
20240125-usctheses-batch-1122 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu