VISUAL REPRESENTATION LEARNING
WITH STRUCTURAL PRIOR
by
Bowen Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Bowen Zhang
Acknowledgements
Writing the acknowledgments is always challenging for me, as there are so many people I want to thank. I feel fortunate to have met, shared joyful moments with, and received enormous help from my advisors, labmates, and friends.
I want to express my gratitude to my advisors, Prof. Fei Sha and Prof. Leana Golubchik. I want to thank Fei for his wisdom, kindness, and patience. Discussing and planning projects with Fei was one of the most joyful parts of my Ph.D. study. There were countless moments when I was inspired by Fei's sparkling ideas, and his insightful suggestions provided new directions for many of my projects. After Fei joined Google, Leana served as my advisor during my final year. Leana is a great advisor. I am grateful that she served as my committee chair, provided me with a great workspace with a nice view, and tolerated and supported my machine learning and computer vision projects.
Similarly, I am thankful to Prof. Haipeng Luo, Prof. Laurent Itti, Prof. Shri Narayanan, and Prof. Ram Nevatia for serving on my committee for both my Ph.D. qualification exam and my Ph.D. proposal. I also want to thank Laurent and Shri for being on my defense committee. Without your insightful suggestions, guidance, and comments, this dissertation would have been impossible to finish.
Several of my Ph.D. projects were delivered in collaboration with Hexiang Hu. We started collaborating in my first Ph.D. year. Hexiang's planning, writing, presentation, and communication skills laid the foundation for those projects. I still remember the countless nights at USC when we fought against deadlines. I am grateful for that.
I am fortunate to have had multiple internship opportunities at Tencent AI Lab, Amazon AWS, Google AI, and the Google Research Brain team, collaborating with Boqing Gong, Yuanjun Xiong, Hao Chen, Meng Wang, Vihan Jain, Eugene Ie, Haoshuo Huang, Ming Zhao, Joonseok Lee, Pete Shaw, Linlu Qiu, Jiahui Yu, Andrew Dai, Chris Fifty, Yonghui Wu, and Ruoming Pang. These internships and collaborations allowed me to leverage enormous computational resources to explore a wide range of projects, which profoundly broadened my horizons.
My friends and labmates at ShaLab and QED-Lab brought me many precious memories during these
five years. Specifically, I would like to thank Beer, Harry, Ke, and Zhiyun for helping me settle down and
introducing me to the lab; Bill, Michiel, Robby, Shariq, Seb, and Yury for leading and hosting interesting
discussions on our weekly ShaLab lunch; Aaron, Chao-Kai, and Melissa for the great collaboration in my
first two years; Han-Jia, Liyu, Ivy, Jeremy, Yiming, Bowen, and Zhuojin for the wonderful Chinese food
and dim sum. It is so amazing to meet and work with talented labmates on a daily basis.
I would not have survived these years without the help of the wonderful staff at the USC Viterbi School of Engineering. Nina Shilling has been our lab manager for four years. She always ensured that we had enough coffee, tea, snacks, and office supplies, and she made our life in the lab so much easier. It is great to have Lizsl De Leon as our director of student affairs. I will always remember her efficiency, kindness, and one-day email replies.
Finally, I want to thank my family and my girlfriend for their endless love and understanding. I would not even have started this journey without their encouragement and support. This thesis is dedicated to them.
This dissertation is partially supported by NSF IIS-1833137.
Table of Contents
Acknowledgements ii
List of Tables ix
List of Figures xii
Abstract xiv
I Background 1
Chapter 1: Introduction 2
1.1 Hierarchical Visual Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contributions and Outlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Published Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Works Covered in the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Other Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
II Hierarchical Video Representation via Language Structure Prior 8
Chapter 2: Cross-Modal and Hierarchical Video Modeling via Text 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Settings and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Flat Sequence Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Hierarchical Sequence Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 Final Learning Objective and Its Extensions . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1.1 Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1.2 Baselines and Our Methods. . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1.3 Implementation Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.2 Results on Video-Paragraph Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2.1 Main Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2.2 Low-level Loss is Beneficial. . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.2.3 Learning with Weak Correspondences at Low-level. . . . . . . . . . . . . 21
2.4.2.4 Learning with Video Proposal Methods. . . . . . . . . . . . . . . . . . . 22
2.4.2.5 Retrieval with Incomplete Video and Paragraph. . . . . . . . . . . . . . . 22
2.4.3 Results on Video Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.4 Results on Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3: Hierarchical Video Modeling for Moment Localization 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Problem Setting and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Two-Stage MLVC: Retrieval and Localization . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 HierArchical Multi-Modal Encoder (HAMMER) . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Learning HAMMER for MLVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.5 Two-stage Inference with HAMMER . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Experimental Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 MLVC Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Ablation Studies and Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.3.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.3.2 Evaluate Design Choices of the HAMMER . . . . . . . . . . . . . . . . . 41
3.4.3.3 Qualitative Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
III Hierarchical Image Representation via Language Structure Prior 45
Chapter 4: Modeling Image Hierarchically with Denotation Graph 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Denotation Graph (DG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Learning with Denotation Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.1 Matching Texts with Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1.1 Matching Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1.2 Negative Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Learning to Be More Specific . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.3 Learning to Predict Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.4 The Final Learning Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.3 Zero/Few-Shot and Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.5 Image Retrieval from Abstract Concepts . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 5: Hierarchical Visual Grounded Concept Composition 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Concept & Relation Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 COMPOSER: Recursive Modeling of the Compositional Structure . . . . . . . . . . . . . . . 66
5.3.1 Encoding Primitives and Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.2 Recursive Concept Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3.3 Learning COMPOSER with Visual-Semantic Alignments . . . . . . . . . . . . . . . 69
5.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.3 Analysis and Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
IV Hierarchical Video Representation via Visual Structure 80
Chapter 6: Modeling Video Representation via Co-training on both Image and Video 82
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2.1 Video Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2.2 Training Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.1 Video Transformer for Action Recognition . . . . . . . . . . . . . . . . . . . . . . 86
6.3.1.1 TimeSFormer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.1.2 Standard Training Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.2 Action Recognition Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3.3 COVER: Co-train Videos and Images for Action Recognition . . . . . . . . . . . . 89
6.3.3.1 Co-train Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3.3.2 Co-train Video and Image Data . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1.1 Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1.2 Implementation details. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4.3 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.3.1 Co-training with multiple datasets . . . . . . . . . . . . . . . . . . . . . . 96
6.4.3.2 Loss weight for image and video classification . . . . . . . . . . . . . . . 96
6.4.3.3 Transfer learning on other datasets . . . . . . . . . . . . . . . . . . . . . 97
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
V Application of Structural Visual Representation 99
Chapter 7: Visual Storytelling via Predicting Anchor Words in the Stories 100
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3.1 What is an anchor word? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3.2 Model and Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.3.3 Other Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4.1 Experimental Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
VI Conclusion and Future Work 112
Chapter 8: Conclusion 113
Chapter 9: Future Work 115
9.1 Acquiring the Visual Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.2 Obtaining a General-purpose Visual Representation . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography 118
Appendices 130
A Appendix for Cross-Modal and Hierarchical Video Modeling via Text . . . . . . . . . . . . 130
A.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
A.1.1 Video and Text Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
A.1.2 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2.1 Ablation Study with Different Learning Objectives . . . . . . . . . . . . . 131
A.2.2 Low-level Loss is Beneficial . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2.3 Ablation Study on Reconstruction Balance Term . . . . . . . . . . . . . . 132
A.2.4 Performance on ActivityNet Validation Set 2 . . . . . . . . . . . . . . . . 133
B Appendix for Hierarchical Video Modeling for Moment Localization . . . . . . . . . . . . . 133
B.1 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.1 Visual Feature Representation . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.2 Subtitle (ASR) Feature Representation . . . . . . . . . . . . . . . . . . . 134
B.1.3 Model Architecture with ASR input . . . . . . . . . . . . . . . . . . . . . 135
B.1.4 Model Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
B.1.5 Model Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
B.2 Illustration on the TVR Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
C Appendix for Modeling Image Hierarchically With Denotation Graph . . . . . . . . . . . . 136
C.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
C.1.1 Constructing Denotation Graphs . . . . . . . . . . . . . . . . . . . . . . 136
C.1.2 Model architectures of ViLBERT and UNITER . . . . . . . . . . . . . . 137
C.1.3 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
C.1.4 Text Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
C.1.5 Visual Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
C.2 Full Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
C.2.1 Complete Results on COCO Dataset . . . . . . . . . . . . . . . . . . . . . 139
C.2.2 Complete Results on FLICKR30K Dataset . . . . . . . . . . . . . . . . . 140
C.2.3 Ablation Study on Hyper-parameters . . . . . . . . . . . . . . . . . . . . 140
C.2.4 Full Results on Zero/Few-Shot and Transfer Learning . . . . . . . . . . . 141
C.3 Visualization of Model’s Predictions on Denotation Graphs . . . . . . . . . . . . . . 142
D Appendix for Hierarchical Visual grounded Concept Composition . . . . . . . . . . . . . . 143
D.1 Extracting Primitives & Predicates from the Constituency Tree . . . . . . . . . . . . 143
D.2 Details on Generation of Compositional Evaluation Splits . . . . . . . . . . . . . . 144
D.3 Implementation Details of COMPOSER and Baselines . . . . . . . . . . . . . . . . . 146
D.4 Additional Experiments on COMPOSER . . . . . . . . . . . . . . . . . . . . . . . . 148
E Appendix for Modeling Video Representation via Co-training on both Image and Video . . . 150
E.1 Implementation Details of Co-train Video and Image Data . . . . . . . . . . . . . . 150
E.2 Additional Experiments on COVER . . . . . . . . . . . . . . . . . . . . . . . . . . 150
E.2.1 Model variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
E.2.2 Input configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
E.2.3 Top-1 and Top-5 accuracy of COVER . . . . . . . . . . . . . . . . . . . . 152
List of Tables
2.1 Video paragraph retrieval on ActivityNet (val1). Standard deviations from 3 randomly seeded
experiments are also reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Video paragraph retrieval on DiDeMo dataset. S2VT method is re-implemented for retrieval
task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Ablation studies on the learning objectives. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Performance of using proposal instead of ground truth on ActivityNet dataset . . . . . . . . 22
2.5 Results for video captioning on ActivityNet . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Results for action recognition on ActivityNet (low-level embeddings) . . . . . . . . . . . . 24
3.1 MLVC Results on ActivityNet and TVR datasets . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 VR results on ActivityNet Captions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 TL results on ActivityNet Captions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Ablation study on sub-tasks (VR=Video Retrieval, TL=Temporal Localization, FM=Frame
MLM, CM=Clip MLM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Ablation study on task weights (VR=Video Retrieval, TL=Temporal Localization, MLM=Masked
Language Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 Ablation study on Cross-modal Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Ablation study on different clip lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 Ablation study on weight sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.9 Ablation study on clip position embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Key statistics of the two DGs: averaged over all nodes in the graph, internal nodes, and
leaf nodes (formatted as all/internal/leaf) . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Text-based Image Retrieval (Higher is better) . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Image-based Text Retrieval (Higher is better) . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Image Retrieval via Text (Transfer Learning) . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Zero/Few-shot Learning for Referring Expression (Reported in R@1 on validation, TestA
and TestB data) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Image Recognition on UNSEEN Attribute-Object Pairs on the MIT-STATE Dataset . . . . . 60
4.7 Ablation Studies of Learning from DG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 Statistics of the concepts and predicates in the F30K and C30K datasets. . . . . . . . . . . . 72
5.2 Text-to-Image retrieval results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Study of different primitive encodings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Study of different modulators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Comparison between ViLBERT and COMPOSER on multi-level visual-semantic alignment
supervision (MVSA). All results are reported in R1. . . . . . . . . . . . . . . . . . . . . . . 76
5.6 Results on COMPOSER of different complexity. All results are reported in R1. (PT: Predicate
Transformer, CT: Composition Transformer) . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.7 Performance under different parsing qualities. . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1 Top-1 accuracy comparison between the standard training paradigm and COVER using TimeSFormer
pretrained on ImageNet-21k (I21K) with a 448 × 448 × 16 (H × W × T) input setting.
COVER co-trained with multiple image and video datasets achieves better performance on
all datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Performance of TimeSFormer trained using 224 × 224 image resolution and evaluated on
normal frame order and reversed frame order. Reversed frame order means the order of
frames is reversed at test time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.3 Comparison with the SoTA pretrained on ImageNet-21k (I21k). Our COVER results are
based on the TimeSFormer architecture. Compared with TimeSFormer, COVER achieves
significant improvements across all datasets. Compared with ViViT [10] and VidTr [204],
COVER improves performance on the K400, K600, K700, and MiT datasets. The views are
denoted as (# of temporal crops) × (# of spatial crops). Dash (“-”) means the results are not
applicable. “Joint w/ K400” means the model is finetuned on SSv2+MiT+ImageNet and K400
jointly. We note our improvements compared with TimeSFormer of the same input sizes. . . . 93
6.4 Comparison with the SoTA pretrained on larger-scale datasets. COVER pretrained on JFT-300M
surpasses all SoTA by a margin. COVER pretrained on an even larger dataset (JFT-3B)
establishes a new set of SoTA on all datasets. Dash (“-”) means the results are not
applicable. “Joint w/ K400” means the model is finetuned on SSv2+MiT+ImageNet and K400
jointly. We note our improvements compared with the previous state of the art. . . . . . . . . 94
6.5 Detailed co-training Top-1 accuracy with the model pretrained on JFT-3B. By co-training with
more datasets, COVER keeps improving performance on the K400, K600, K700, SSv2, and MiT
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.6 Top-1 accuracy of COVER under different image classification loss weights. With a larger
w_image, the model improves the ImageNet (ImNet) performance but sacrifices video
dataset results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.1 Corpus statistics of stories and captions on the VIST dataset [72] . . . . . . . . . . . . . . . 101
7.2 Adding ground-truth words as anchor words to a SEQ2SEQ model significantly improves
its performance where only image features are used. The higher numerical value indicates
better performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3 Comparison of state-of-the-art methods for the visual storytelling task on the VIST dataset.
Our “Image Only” model is a reimplementation of XE+SS [173] using the authors’ publicly
available code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.4 Evaluating human performance by automatic evaluation procedures. Machine outperforms
human in all metrics. See text for details. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.5 Human evaluation on the generated stories . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.6 Better storytelling uses bigger vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.7 Which stories are preferred by human readers . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.1 Ablation study on the learning objectives. . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.2 Ablation study of τ on ActivityNet (val2). . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.3 Performance of video and paragraph retrieval on ActivityNet (val2). Standard deviations
from 3 randomly seeded experiments are also reported. . . . . . . . . . . . . . . . . . . . . 133
C.1 Text-based Image Retrieval Performance of ViLBERT trained with different number of DG
levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
C.2 Results on Cross-Modal Retrieval on COCO dataset 1K test split (Higher is better) . . . . . 140
C.3 Results on Cross-Modal Retrieval on COCO dataset 5K test split (Higher is better) . . . . . 141
C.4 Results on Text-based Image Retrieval on FLICKR30K test split (Higher is better) . . . . . . 142
C.5 Ablation studies on the impact of λ1 and λ2 of ViLBERT Reduced on Text-based Image
Retrieval on the FLICKR30K dataset (Higher is better) . . . . . . . . . . . . . . . . . . . . 142
C.6 Ablation studies on the impact of λ1 and λ2 of ViLBERT on Text-based Image Retrieval on
the FLICKR30K dataset (Higher is better) . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.7 Transferrability of the learned representations . . . . . . . . . . . . . . . . . . . . . . . . . 143
D.1 Explanatory example of extracting predicates and sub-concepts from a concept . . . . . . . 146
D.2 Ablation Study on COMPOSER with Different Margin for MVSA and Order Objectives. . . . 149
D.3 Ablation Study MVSA Objective: Comparing NLL to Hinge Loss. . . . . . . . . . . . . . . 150
E.1 Comparison of different input settings for co-training on Kinetics, SSv2, and MiT datasets. . 152
E.2 Top-1 and Top-5 accuracy of COVER on K400, K600, K700, SSv2, and MiT. . . . . . . . . 152
List of Figures
1.1 Given videos and images, the feature extractor is optimized to extract visual representations
that benefit the downstream tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The text naturally exhibits the structure of an image. The caption “two black dogs are sitting
on the grass” illustrates that there are two objects (dogs), their color is black, and their status
is sitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 In the pure vision scenario, videos are perfect for learning the temporal information. Images
are suitable for exploiting the appearance information. . . . . . . . . . . . . . . . . . . . . 4
2.1 Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea
is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own
semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips)
are mapped into a local embedding space where the corresponding pairs of clips and sentences are
placed close to each other. As a whole, the videos and the paragraphs are mapped into a global
semantic space where their embeddings are close. See Fig. 2.3 and texts for details. . . . . . . . . . 11
2.2 Flat sequence modeling of videos and texts, ignoring the hierarchical structures in either and regard-
ing the video (paragraph) as a sequence of frames (words). . . . . . . . . . . . . . . . . . . . . 13
2.3 Hierarchical cross-modal modeling of videos and texts. We differ from previous works [100, 128] in
two aspects (components in red color): layer-wise reconstruction through decoders, and matching at
both global and local levels. See texts for details. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Retrieval performance improves given more observed clips/sentences. . . . . . . . . . . . . 23
2.5 t-SNE visualization of off-the-shelf video embeddings of HSE on the ActivityNet v1.3 training
and validation sets. Points are marked with their action classes. . . . . . . . . . . . . . . . 25
3.1 Overview of the HAMMER model. The model contains two cross-modal encoders, a frame encoder
and a clip encoder on top of it. The outputs of the model are contextualized frame-level and clip-
level features, which are used by downstream task-specific modules, e.g. video retrieval and temporal
localization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Comparison of Video Retrieval performances under different video duration. Results are reported in
Median Rank (MedRank) on the ActivityNet Captions (Lower is better). . . . . . . . . . . . . . . 39
3.3 Comparison of Temporal Localization performances under different video duration. Results are re-
ported in Mean IoU (mIoU) on the ActivityNet Captions (Higher is better). . . . . . . . . . . . . 39
3.4 Illustration of temporal localization using different hierarchies of HAMMER as well as the
final HAMMER model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 The paired text can show the objects in the figure and provide their relations. The text “a
gray and white cat is chasing a brown mouse” indicates that the two objects in the figure are a
cat and a mouse. The relation between those two objects is that the cat is chasing the mouse. . 46
4.1 (Left) A schematic example of denotation graph showing the hierarchical organization of lin-
guistic expression (adapted from https://shannon.cs.illinois.edu/DenotationGraph/) (Right)
A random-subgraph from the denotation graph extracted from the FLICKR30K dataset, with
images attached to concepts at different levels of hierarchy. . . . . . . . . . . . . . . . . . 50
4.2 Image Retrieval using Mid-level Linguistic Expression on FLICKR30K Denotation Graph. The re-
sults are reported in Mean Average Precision (Mean AP). . . . . . . . . . . . . . . . . . . . . . 62
5.1 An illustrative example image database organized by the Concept & Relation Graph (CRG). 64
5.2 The overall design of proposed COMPOSER model. . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Details of the composition procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4 COMPOSER’s results on generalization splits of different compound divergence over text
description (evaluated under the F30K→COCO setting). . . . . . . . . . . . . . . . . . . . 74
5.5 Interpreting the COMPOSER using visual-semantic alignment scores, formatted as [s_GT, s_Negative].
The left figure corresponds to a correctly retrieved example, and the right figure corresponds
to an incorrectly retrieved example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1 The comparison between the proposed COVER and the standard training paradigm. The
difference is that COVER is co-trained on multiple image and video datasets during finetuning,
while the standard training paradigm finetunes on only one dataset. . . . . . . . . . . . . . . 84
6.2 COVER adopts a multi-task learning strategy. Each dataset has its own classifier. For image
datasets, we consider images as single-frame videos; therefore, the temporal multi-head
attention does not affect the image input. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3 Comparison of transfer learning with the representation learned by COVER and by the standard
training paradigm. A→B means the model is trained on dataset A and then finetuned on dataset B.
COVER achieves better performance than the standard training paradigm by a margin. . . . 97
7.1 Conceptual diagram of our approach for visual storytelling. The key difference from a
typical SEQ2SEQ model is the component of predicting anchor words (or their embeddings)
from the images. The predictions are then fused with the image features as the inputs to the
sequence transducer for generating desired narrative sentences. . . . . . . . . . . . . . . . . 102
7.2 Qualitative visualization of narrative stories generated by humans and by different methods.
Similar sentences for similar images are annotated in blue. . . . . . . . . . . . . . . . . . . . 110
7.3 Qualitative visualization of good quality stories that are generated with anchor words. The
sentences with anchor words are annotated as red in the generated stories. . . . . . . . . . . 111
8.1 The approaches developed in this thesis are based on three modalities: images, videos, and
text. We focus on obtaining hierarchical visual representations from parallel image and text
data, parallel video and text data, and the pure visual domain with images and videos. . . . . . 114
9.1 The example of video scene graph from [75]. Each frame is annotated with objects . . . . . 116
A.1 Recall vs Rank curves of Video to Paragraph and Paragraph to Video retrieval of both
HSE[τ =0] and HSE. All results are collected from models based on InceptionV3 feature on
ActivityNet validation set 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.1 Illustration of temporal localization results using HAMMER and its individual hierarchies,
the frame and clip encoder. The top two are successful examples and the bottom one is a
failed example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
C.1 Architecture of (a) ViLBERT, (b) UNITER. The ⊗ symbol denotes the element-wise product. The [CLS]
represents the embedding of the [CLS] token in the last UNITER layer. . . . . . . . . . . . . 138
C.2 FLICKR30K Denotation Graph: Given Text and Retrieve Image. Qualitative example of ViL-
BERT + DG successfully retrieves the text matched images. We mark the correct sample in green
and incorrect one in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
C.3 FLICKR30K Denotation Graph: Given Text and Retrieve Image. Qualitative example of ViL-
BERT + DG fails to retrieve the text matched images. We mark the correct sample in green and
incorrect one in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
D.1 Details of the Composition Transformer model. . . . . . . . . . . . . . . . . . . . . . . . . 149
E.1 Comparison of the base model and the large model using COVER pretrained on JFT-3B. The large
model consistently improves over the base model on all datasets. . . . . . . . . . . . . . . . 151
Abstract
Visual representation learning is crucial for building a robust and effective visual understanding system.
The goal is to build general-purpose representations that benefit multiple downstream tasks (e.g., image/video
classification, segmentation, and retrieval). With access to large-scale datasets and advances in learning
methods, sophisticated neural architectures and novel training approaches have been proposed to improve
visual representations. However, obtaining a versatile representation remains an open question. This thesis
aims to leverage visual structure to obtain more general visual representations.
The key observation is that visual data (i.e., images and videos) contain structure and can be decomposed
into atomic components such as objects, attributes, and clips. For example, an image can be decomposed into
objects, which can be further described by attributes. Similarly, a video can describe a complex scene
composed of multiple clips or shots, each depicting a semantically coherent event or action. As atomic
components are shareable across modalities and tasks, we expect that a hierarchical visual representation
compiled from atomic representations can achieve better generalization. In this thesis, we study two scenarios
for obtaining visual structure: from parallel visual and text data, and from the pure visual domain. With the
proposed hierarchical visual representations, we achieve state-of-the-art performance on video and text
retrieval, moment localization in a video corpus, image and text retrieval, action recognition, and visual
storytelling.
Part I
Background
Chapter 1
Introduction
Artificial intelligence is about designing an intelligent agent that is able to perceive and interact with the
environment [141]. The fundamental ability of an intelligent agent is to perceive the visual world. As
humans, we understand the visual world through images and videos. For example, images are convenient for
describing an object or a scene, while videos are helpful for illustrating an action, an event, or a story. Designing
an intelligent agent to understand images and videos is an important and not yet solved problem. In this
thesis, I present a set of visual representation learning methods developed during my Ph.D. study, which can
efficiently extract visual features that benefit multiple downstream tasks such as video classification, image-text
retrieval, video-text retrieval, and visual storytelling.
Visual representation, as a key component of artificial intelligence systems, has been studied for decades.
The main goal of visual representation learning is to embed an image or a video into a semantically rich,
fixed-dimensional embedding for downstream tasks, as shown in Fig. 1.1. Early efforts focused on a two-stage
feature extraction approach. Concretely, salient points (e.g., corners) are first detected, and then hand-designed feature descriptors
Figure 1.1: Given videos and images, the feature extractor is optimized to extract visual representations that
benefit the downstream tasks.
are used to transform the salient points into embeddings. Representative salient point detectors for images
include the Harris Corner Detector [60], SIFT [113], and SURF [15]. As video contains an additional temporal
dimension, the space-time interest point [96], an extension of the Harris Corner Detector, was proposed
to extract spatial-temporal corner points in videos. Feature descriptors then use the surrounding pixels to
describe the salient points. Specifically, the Histogram of Oriented Gradients [35] counts the occurrences
of gradient orientations in localized regions around a salient point as a feature representation, and the Histogram of
Optical Flow [35] counts the occurrences of optical flow directions as the salient point representation. Recently,
deep learning based visual representation methods have set a new milestone in feature representation. Unlike
the traditional two-stage approach, the deep learning based approach relies on a neural network to directly
encode an image (or a video) into an embedding. These simplified and unified approaches achieve state-of-the-art
performance on multiple downstream tasks (e.g., image classification [37], action classification [80], image-text
retrieval [135, 108], and video-text retrieval [91, 64]). State-of-the-art deep learning based feature
extractors include ResNet [62], VGG-Net [151], bottom-up feature representations [4], and I3D [21].
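To make the two paradigms concrete, the sketch below contrasts a hand-designed two-stage pipeline (SIFT keypoint detection followed by local descriptors) with a deep extractor that maps a whole image to a single embedding. It is a minimal illustration assuming OpenCV and a recent torchvision are installed; the preprocessing constants follow common ImageNet conventions and are not taken from the cited works.

```python
# Minimal sketch contrasting two-stage hand-designed features (SIFT) with a
# deep, end-to-end image embedding (ResNet-50). Illustrative only; not the
# exact pipelines of the works cited above.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T


def handcrafted_features(image_path: str):
    """Stage 1: detect salient points; stage 2: describe them with local descriptors."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)  # K keypoints, K x 128 descriptors
    return keypoints, descriptors


def deep_embedding(image_path: str) -> torch.Tensor:
    """Encode the whole image into one fixed-dimensional embedding with a pretrained CNN."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # drop the classification head, keep the 2048-d features
    backbone.eval()
    preprocess = T.Compose([
        T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    rgb = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        return backbone(preprocess(rgb).unsqueeze(0)).squeeze(0)  # 2048-d embedding
```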
1.1 Hierarchical Visual Representation
Both hand-designed and deep learning based visual representation approaches achieve good performance.
However, most prior work treats images and videos as sets of pixels, ignoring their internal hierarchical
structure. Concretely, an image can be decomposed into objects, and objects can be further specified
by attributes such as shape, color, and category. Similarly, a video can describe a complex scene that is
composed of multiple clips or shots, where each depicts a semantically coherent event or action.
The main hypothesis of this thesis is that we can obtain a better visual representation by modeling the
hierarchical visual structure. The task is challenging, as it is not obvious how to obtain hierarchical structures
from the visual domain. The most straightforward way to obtain the structure is to annotate each visual component
within the images and videos. Concretely, the annotation should provide accurate bounding boxes for
the objects and depict the relationships between pairs of objects. Annotating large-scale datasets in this way is
tedious and labor-intensive, and it is hard to define which sets of objects and relations should be labeled.
Instead of exhaustively annotating the visual components, we consider obtaining the structural relations
from cross-domain data. Here we consider two types of cross-domain scenarios: cross-modal data (vision
and language) and pure visual data (images and videos). For the cross-modal scenario, we use the paired text
Figure 1.2: The text naturally exhibits the structure of an image. The caption “two black dogs are sitting on
the grass” illustrates that there are two objects (dogs), their color is black, and their status is sitting.
Figure 1.3: In the pure-vision scenario, videos are well suited to learning temporal information, while images are
suitable for exploiting appearance information.
to reveal the visual structure. Consider an image captioned “two black dogs are sitting on the grass” (Fig. 1.2):
the text naturally reveals that the image contains two objects, dogs, whose color is black. In the pure visual
scenario, as shown in Fig. 1.3, we leverage videos to exploit the temporal structure and use images to model the
spatial structure.
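As a toy illustration of how paired text can expose this structure, the snippet below uses an off-the-shelf dependency parser to pull (object, attribute) pairs out of a caption. This is only a sketch of the general idea, assuming spaCy and its small English model are available; the methods in Chapters 4 and 5 rely on denotation graphs and constituency parses rather than this exact heuristic.

```python
# Toy sketch: recover object/attribute structure from a caption with spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`;
# this heuristic is illustrative only, not the parsing used in later chapters.
import spacy

nlp = spacy.load("en_core_web_sm")


def caption_structure(caption: str):
    doc = nlp(caption)
    objects = []
    for token in doc:
        if token.pos_ == "NOUN":
            # adjectival and numeric modifiers attached to the noun
            attrs = [c.text for c in token.children if c.dep_ in ("amod", "nummod")]
            objects.append({"object": token.text, "attributes": attrs})
    return objects


print(caption_structure("Two black dogs are sitting on the grass."))
# e.g. [{'object': 'dogs', 'attributes': ['Two', 'black']}, {'object': 'grass', 'attributes': []}]
```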
With these insights and observations, this thesis is organized into four major parts: hierarchical
video representation via language structure prior, hierarchical image representation via language structure
prior, hierarchical video representation via visual structure, and applications of the hierarchical visual
representation.
1.2 Contributions and Outlines
Part II of the thesis covers hierarchical video representation via language structure prior. In this part,
we investigate learning hierarchical visual representations from video and text, focusing on paired
video and paragraph data. Videos contain a hierarchical structure by nature: a video can be decomposed into
clips, which can be further decomposed into frames. Similar to the hierarchical structure within videos, a
paragraph contains sentences with different topics, which collectively convey a coherent message or story.
We investigate modeling techniques for such hierarchical sequential data, where there are correspondences
across multiple granularities between video and paragraph (e.g., videos and paragraphs, clips and
sentences). We demonstrate that by modeling the hierarchical structure within each modality and aligning
at multiple granularities across modalities, we achieve state-of-the-art performance in video and paragraph
retrieval as well as zero-shot action recognition. We further demonstrate that our approach achieves
state-of-the-art performance on the challenging task of moment localization from a video corpus with hierarchical
modeling. This part covers two of my published works (one at ECCV 2018 [198] and the other on
arXiv [196]).
Part III addresses hierarchical image representation via language structure prior. We
study learning hierarchical visual representations from images and text. This part exploits the image structure
by grounding image and text at multiple granularities (token, phrase, and sentence), where the alignment
is automatically mined from paired image and text data. In this direction, we propose and analyze two
approaches: 1) grounding tokens, phrases, and sentences with images explicitly, and 2) recursively encoding
and grounding visual concepts, from coarse-level ones such as nouns that refer to objects to
finer-grained ones such as sentences with multiple levels of composition. By hierarchically modeling the
visually grounded expressions, we show strong empirical results on multiple tasks, such as image and text
retrieval and referring expression. This part is supported by two of my published works (one at
EMNLP’20 [195] and the other in Findings of EMNLP’21 [197]).
Part IV investigates hierarchical video modeling from the visual domain. Videos and images both contain
structure: images are well suited to revealing spatial hierarchies, while videos, containing multiple frames,
are well suited to exploiting temporal structure. We study how to obtain a universal video representation
that excels at modeling both appearance and temporal structure by jointly training the video representation
on multiple image and video datasets. The proposed approach establishes
new state-of-the-art results on five challenging video benchmarks without a dataset-specific fine-tuning
stage. The learned representation also exhibits strong transfer learning ability on downstream tasks. This
part is supported by one of my published works on arXiv [201].
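The sketch below illustrates the co-training idea at a high level: a shared video backbone with one classification head per dataset, where images are treated as single-frame videos so the same temporal model can consume both. The backbone, the per-dataset loss weights, and the training step here are illustrative placeholders, not the actual implementation described in Chapter 6.

```python
# High-level sketch of co-training one video backbone on image and video datasets.
# `backbone` is a stand-in for a video transformer mapping (B, T, C, H, W) to
# (B, feat_dim); images are passed as single-frame videos (T = 1).
import torch
import torch.nn as nn


class CoTrainModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: dict):
        super().__init__()
        self.backbone = backbone                      # shared across all datasets
        self.heads = nn.ModuleDict({                  # one classifier per dataset
            name: nn.Linear(feat_dim, n) for name, n in num_classes.items()
        })

    def forward(self, clips: torch.Tensor, dataset: str) -> torch.Tensor:
        # clips: (B, T, C, H, W); for an image dataset, T == 1
        feats = self.backbone(clips)                  # (B, feat_dim)
        return self.heads[dataset](feats)


def co_training_step(model, optimizer, batches: dict, loss_weights: dict):
    """One step mixing mini-batches from several datasets with per-dataset loss weights."""
    optimizer.zero_grad()
    total = 0.0
    for name, (clips, labels) in batches.items():
        logits = model(clips, name)
        total = total + loss_weights.get(name, 1.0) * nn.functional.cross_entropy(logits, labels)
    total.backward()
    optimizer.step()
    return total.item()
```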
Part V discusses an application of the hierarchical visual representation. Here, we study the
advantage of the structural representation on a challenging text generation task, visual storytelling. Given
a sequence of images, visual storytelling asks the model to generate a story consisting of a sequence of narrative
sentences based on the images. We conducted one preliminary work [199] that generates the story without
exploiting the visual structure. Our insight is that generating a story inherently relies on a hierarchical
visual representation: the representation of the whole image sequence can determine the topic and the story
plot, while each image’s representation can help generate a specific sentence within the story. This
part is supported by one of my published works at the ICCV’19 workshop [199].
1.3 Published Works
1.3.1 Works Covered in the Thesis
Chapter 2 corresponds to our ECCV’18 work [198]
• Bowen Zhang*, Hexiang Hu*, and Fei Sha. Cross-Modal and Hierarchical Modeling of Video and
Text. In ECCV, 2018
Chapter 3 corresponds to our Arxiv work [196]
• Bowen Zhang*, Hexiang Hu*, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie,
and Fei Sha. A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus. Arxiv,
2020
Chapter 4 corresponds to our EMNLP’20 work [195]
• Bowen Zhang*, Hexiang Hu*, Vihan Jain, Eugene Ie, and Fei Sha. Learning to Represent Image and
Text with Denotation Graph. In EMNLP, 2020
Chapter 5 corresponds to our Findings of EMNLP’21 work [197]
• Bowen Zhang, Hexiang Hu, Linlu Qiu, Pete Shaw, and Fei Sha. Visually Grounded Concept Compo-
sition. In Findings of EMNLP, 2021
Chapter 6 corresponds to our Arxiv work [201]
• Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, and Fei Sha.
Co-training Transformer with Videos and Images Improves Action Recognition. Arxiv, 2021
Chapter 7 corresponds to our ICCV’19 workshop work [199]
• Bowen Zhang, Hexiang Hu, and Fei Sha. Visual Storytelling via Predicting Anchor Word Embeddings
in the Stories. In ICCV’19 Workshop on Closing the Loop Between Vision and Language, 2019
1.3.2 Other Works
Besides the work covered in this thesis, during Ph.D., I also worked on online action detection, word embed-
ding, text summarization, and compositional generalization. Those works are published on EMNLP’18 [1],
EMNLP’21 [136], BayLearn’19 [2], and Arxiv [194].
• Melissa Ailem, Bowen Zhang, Aurélien Bellet, Pascal Denis, and Fei Sha. A Probabilistic Model for
Joint Learning of Word Embeddings from Texts and Images. In EMNLP, 2018
• Melissa Ailem, Bowen Zhang, and Fei Sha. Topic Augmented Generator for Abstractive Summariza-
tion. In BayLearn, 2019
• Bowen Zhang, Hao Chen, Meng Wang, and Yuanjun Xiong. Online Action Detection in Streaming
Videos with Time Buffer. Arxiv, 2020
• Linlu Qiu, Hexiang Hu, Bowen Zhang, Pete Shaw, and Fei Sha, Systematic Generalization on gSCAN:
What is Nearly Solved and What is Next? In EMNLP, 2021.
Part II
Hierarchical Video Representation via Language Structure Prior
This part presents multiple approaches to obtaining hierarchical video representations. Concretely,
we consider two real-world applications: video and text retrieval and moment localization in a video
corpus (MLVC).
The video and text retrieval task requires the model to retrieve the most relevant video given a text
query. Concretely, the model is presented with multiple pairs of videos and texts during training and
needs to figure out which pairs are aligned. At inference time, the model is given a single text
query and a set of videos and needs to locate the video corresponding to the text query in the video
corpus. This vital task has multiple real-world applications (e.g., video search and video recommendation).
Chapter 2 presents a hierarchical video representation that efficiently handles the video and
text retrieval task. The hierarchical visual structure is unveiled from the corresponding text description. We
show that the hierarchical video representation outperforms a flat representation
by a large margin.
With the hierarchical video representation, we study a more challenging task in Chapter 3: moment
localization from a video corpus. Specifically, the model needs to find the video corresponding to the text
query and then locate the specific clip within the aligned video. Due to the task’s difficulty, we decompose it into
two sub-problems, video retrieval and clip localization. With this in mind, we design a hierarchical video
representation approach to solve both sub-tasks in a unified fashion. Concretely, we obtain video representations
at three levels: frame level, clip level, and video level. The clip-level representation is obtained
by encoding the frame-level representations, and the video-level representation is further obtained by encoding the
clip-level representations. With this hierarchical representation, video retrieval can be handled at the level of the video
representation, while clip localization can be tackled at the level of the clip representations. We show that the
hierarchical video representation efficiently and effectively handles the MLVC task, improving the
state-of-the-art performance on multiple benchmarks.
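A minimal sketch of this frame-to-clip-to-video encoding is given below, using GRUs as the sequence encoders. The interface (pre-extracted frame features grouped into fixed-length clips) and the hidden sizes are assumptions for illustration; the models developed in Chapters 2 and 3 differ in architecture and training details.

```python
# Minimal sketch of a hierarchical video encoder: frame features -> clip
# embeddings -> a video embedding, each level encoded with a GRU.
import torch
import torch.nn as nn


class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, frame_dim=2048, clip_dim=512, video_dim=512):
        super().__init__()
        self.frame_gru = nn.GRU(frame_dim, clip_dim, batch_first=True)
        self.clip_gru = nn.GRU(clip_dim, video_dim, batch_first=True)

    def forward(self, frames: torch.Tensor):
        # frames: (B, num_clips, frames_per_clip, frame_dim), pre-extracted features
        b, n_clips, n_frames, d = frames.shape
        # encode each clip from its frames (the last hidden state summarizes the clip)
        _, clip_h = self.frame_gru(frames.reshape(b * n_clips, n_frames, d))
        clip_emb = clip_h[-1].reshape(b, n_clips, -1)   # (B, num_clips, clip_dim)
        # encode the video from its clip embeddings
        _, video_h = self.clip_gru(clip_emb)
        video_emb = video_h[-1]                         # (B, video_dim)
        return clip_emb, video_emb


# clip_emb supports clip-level tasks (e.g. localization); video_emb supports
# video-level tasks (e.g. retrieval).
encoder = HierarchicalVideoEncoder()
clip_emb, video_emb = encoder(torch.randn(2, 3, 8, 2048))
```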
The following chapters are organized as follows: Chapter 2 discusses hierarchical video modeling and its
application to video and text retrieval, and Chapter 3 focuses on applying the hierarchical video representation
to the MLVC task.
Chapter 2
Cross-Modal and Hierarchical Video Modeling via Text
2.1 Introduction
Recently, there has been intense interest in multi-modal learning of vision and language. A few challenging
tasks have been proposed: visual semantic embedding (VSE) [85, 83, 34], image captioning [167, 182,
78, 108], and visual question answering (VQA) [9, 206, 22]. To jointly understand these two modalities of
data and make inferences over them, the main intuition is that different types of data can share a common
semantic representation space. Examples are embedding images and visual categories [51], embedding
images and texts for VSE [85], and embedding images, questions, and answers for VQA [69]. Once embedded
into this common (vector) space, similarities and distances among originally heterogeneous data can be
captured by learning algorithms.
While there has been a rich body of study on how to discover this shared semantic representation for structures
such as images, noun phrases (visual object or action categories), and sentences (such as captions, questions,
and answers), less is known about how to do so for more complex structures such as videos and paragraphs
of text¹. There are conceptual challenges: while complex structured data can be mapped to vector spaces
(for instance, using deep architectures [92, 62]), it is not clear whether the intrinsic structures in those data’s
original format, after being transformed into vectorial representations, still maintain their correspondence
and relevance across modalities.
Take the dense video description task as an example [91]. The task is to describe a video which is
made of short, coherent and meaningful clips. (Note that those clips could overlap temporally.) Due to its
narrowly focused semantic content, each clip is then describable with a sentence. The description for the
¹We use paragraphs and documents interchangeably throughout this work.
Figure 2.1: Conceptual diagram of our approach for cross-modal modeling of videos and texts. The main idea is
to embed both the low-level units (clips and sentences) and the high-level units (videos and paragraphs) in their own
semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a
local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole,
the videos and the paragraphs are mapped into a global semantic space where their embeddings are close. See Fig. 2.3
and the text for details.
whole video is then a paragraph of text with sentences linearly arranged in order. Arguably, a corresponding
pair of a video and its descriptive paragraph can be embedded into a semantic space where their embeddings
are close to each other, using a vanilla learning model that ignores the boundaries of clips and sentences
and treats them as sequences of continually flowing visual frames and words. However, with such a modeling
strategy, it is unclear if and how the correspondences at the “lower level” (i.e., clips versus sentences)
are useful, either in deriving the embeddings or in using the embeddings to perform downstream tasks such as
video or text retrieval.
Addressing these deficiencies, we propose a novel cross-modal learning approach to model videos
and texts jointly. The main idea is schematically illustrated in Fig. 2.1. Our approach is mindful of the intrinsic
hierarchical structures of both videos and texts, and models them with hierarchical sequence learning
models such as GRUs [33]. However, as opposed to methods which disregard low-level correspondences,
we exploit them by deriving loss functions that ensure the embeddings of the clips and sentences are also
in accordance in their own (shared) semantic space. Those low-level embeddings in turn strengthen the
desideratum that videos and paragraphs be embedded coherently. We demonstrate the advantages of the
proposed model in a range of tasks including video and text retrieval, zero-shot action recognition, and video
description.
11
The rest of the chapter is organized as follows. In section 2.2, we discuss related work. We describe
our proposed approach in section 2.3, followed by extensive experimental results and ablation studies in
section 2.4. We conclude in section 2.5.
2.2 Related Work
Hierarchical Sequence Embedding Models. Embedding images, videos, and textual data has been very
popular with the rise of deep learning. The most related works to ours are [100] and [128]. The former mod-
els the paragraph using a hierarchical auto-encoder for text modeling [100], and the later uses a hierarchical
RNN for videos and a one-layer RNN for caption generation. In contrast, our work models both modalities
hierarchically and learn the parameters by leveraging the correspondences across modalities. Works moti-
vated by other application scenarios usually explore hierarchical modeling in one modality [127, 187, 203].
Cross-modal Embedding Learning. There has been a rich history to learn embeddings for images and
smaller linguistic units (such as words and noun phrases). DeViSE [51] learns to align the latent embeddings
of visual data and names of the visual object categories. ReViSE [162] uses auto-encoders to derive embed-
dings for images and words which allow them to leverage unlabeled data. In contrast to previous methods,
our approach models both videos and texts hierarchically, bridging the embeddings at different granularities
using discriminative loss computed on corresponded pairs (i.e. videos vs. paragraphs).
Action Recognition in Videos. Deep learning has brought significant improvement to video understand-
ing [150, 161, 48, 170, 200, 175] on large-scale action recognition datasets [63, 153, 80] in the past decade.
Most of them [150, 48, 170] employed deep convolutional neural network to learn appearance feature and
motion information respectively. Based on the spatial-temporal feature from these video modeling methods,
we learn video semantic embedding to match the holistic video representation to text representation. To
evaluate the generalization of our learned video semantic representation, we evaluate the model directly on
the challenging action recognition benchmark. (Details in Section 2.4.4)
12
Word!
"
Word!
#
$%&
'
Paragraph Encoder
$%&
(
Video Encoder
A Paragraph with Multiple Sentences A Video with Multiple Clips
Global Alignments ()
*+,-.
./0.
)
Frame1
2
Frame1
"
Frame1
3
'
Paragraph Embedding
(
Video Embedding
Word!
2
Figure 2.2: Flat sequence modeling of videos and texts, ignoring the hierarchical structures in either and regarding
the video (paragraph) as a sequence of frames (words).
2.3 Approach
We begin by describing the problem settings and introducing necessary notations. We then describe the
standard sequential modeling technique, ignoring the hierarchical structures in the data. Finally, we describe
our approach.
2.3.1 Settings and Notations
We are interested in modeling videos and texts that are paired in correspondence In the later section, we
describe how to generalize this where there is no one to one correspondence.
A videov has n clips (or subshots), where each clipc
i
contains n
i
frames. Each frame is represented
by a visual feature vectorx
ij
. This feature vector can be derived in many ways, for instance, by feeding
the frame (and its contextual frames) to a convolution neural net and using the outputs from the penultimate
layer. Likewise, we assume there is a paragraph of texts describing the video. The paragraphp contains n
sentences, one for each video clip. Lets
i
denote theith sentence andw
ij
the feature for thejth word out of
n
′
i
words. We denote byD ={(v
k
,p
k
)} a set of corresponding videos and text descriptions.
We compute a clip vector embedding c
i
from the frame features {x
ij
}, and a sentence embedding
s
i
from the word features{w
ij
}. From those, we derive v and p, the embedding for the video and the
paragraph, respectively.
2.3.2 Flat Sequence Modeling
Many sequence-to-sequence (SEQ2SEQ) methods leverage the encoder-decoder structure [157, 117] to model
the process of transforming from the input sequence to the output sequence. In particular, the encoder,
13
Word!
"#"
Word!
$"
Word!
$# $
Word!
%"
Word !
%#%
&'(
)
(")
Sentence Encoder
,
"
Embedding
,
$
Embedding
,
%
Embedding
&'(
)
($)
Paragraph Encoder
)
Paragraph Embedding
Frame-
""
Frame-
"#"
Frame-
$"
Frame-
$#$
Frame-
%"
Frame-
%#%
&'(
.
(")
Clip Encoder
/
"
Embedding
/
$
Embedding
/
%
Embedding
.
Video Embedding
A Paragraph with Multiple Sentences A Video with Multiple Clips
Global Alignments (0
1231
)
Local Alignments (0
456
)
Reconstruction Reconstruction
Reconstruction
7&(
)
($)
Paragraph Decoder
&'(
.
($)
Video Encoder
7&(
)
(")
Sentence Decoder
7&(
.
($)
Video Decoder
7&(
.
(")
Clip Decoder
Word!
""
Reconstruction
Figure 2.3: Hierarchical cross-modal modeling of videos and texts. We differ from previous works [100, 128] in two
aspects (components in red color): layer-wise reconstruction through decoders, and matching at both global and local
levels. See texts for details.
which is composed of a layer of long short-term memory units (LSTMs) [67] or Gated Recurrent Units
(GRUs) [33], transforms the input sequence into a vector as the embeddingh. The similarly constructed
decoder takesh as input and outputs another sequence.
The original SEQ2SEQ methods do not consider the hierarchical structures in videos or texts. We refer
the embeddings as flat sequence embedding (FSE):
v = ENC
v
({x
ij
}), p= ENC
p
({w
ij
}), (2.1)
Fig. 2.2 schematically illustrates this idea. We measure how well the videos and the texts are aligned by
the following cosine similarity
MATCH(v,p)=v
⊤
p/∥v∥∥p∥ (2.2)
2.3.3 Hierarchical Sequence Modeling
One drawback of flat sequential modeling is that the LSTM/GRU layer needs to have a sufficient number
of units to model well the potential long-range dependency among video frames (or words). This often
complicates learning as the optimization becomes difficult [131].
14
We leverage the hierarchical structures in those data to overcome this deficiency: a video is made of
clips which are made of frames. In parallel, a paragraph of texts is made of sentences which in turn are
made of words. Similar ideas have been explored in [128, 100] and other previous works. The basic idea is
illustrated in Fig. 2.3, where we also add components in red color to highlight our extensions.
Hierarchical Sequence Embedding. Given the hierarchical structures in Fig. 2.3, we can compute the
embeddings using the forward paths
c
i
= ENC
(1)
v
({x
ij
,j =1,2,n
i
}), v = ENC
(2)
v
({c
i
})
s
i
= ENC
(1)
p
({w
ij
,j =1,2,n
′
i
}), p= ENC
(2)
p
({s
i
})
(2.3)
Learning with Discriminative Loss. For videos and texts have strong correspondences where clips and
sentences are paired, we optimize the encoders such that videos and texts are matched. To this end, we
define two loss functions, corresponding to the matching at the low-level and the high-level respectively:
ℓ
HIGH
MATCH
=
X
k
X
k
′
̸=k
[α + MATCH(v
k
,p
k
)− MATCH(v
k
′,p
k
)]
+
+[α + MATCH(v
k
,p
k
)− MATCH(v
k
,p
k
′)]
+
(2.4)
ℓ
LOW
MATCH
=
X
k
X
i
X
(k
′
,i
′
)̸=(k,i)
[β + MATCH(c
ki
,s
ki
)− MATCH(c
k
′
i
′,s
ki
)]
+
+[β + MATCH(c
ki
,s
ki
)− MATCH(c
ki
,s
k
′
i
′)]
+
(2.5)
These losses are margin-based losses [143] whereα andβ are positive numbers as the margins to separate
matched pairs from unmatched ones. The function[·]
+
is the standard hinge loss function.
Learning with Contrastive Loss. Assuming videos and texts are well clustered, we use the following loss
to model their clustering in their own space.
ℓ
HIGH
CLUSTER
=
X
k
X
k
′
̸=k
[γ +1− MATCH(v
k
′,v
k
)]
+
+[γ +1− MATCH(p
k
′,p
k
)]
+
(2.6)
ℓ
LOW
CLUSTER
=
X
k
X
i
X
(k
′
,i
′
)̸=(k,i)
[η +1− MATCH(c
k
′
i
′,c
ki
)]
+
+[η +1− MATCH(s
k
′
i
′,s
ki
)]
+
(2.7)
15
Note that the self-matching values MATCH(v
k
,v
k
) and MATCH(p
k
,p
k
) are 1 by definition. This loss can be
computed on videos and texts alone and does not require them being matched.
Learning with Unsupervised Layer-wise Reconstruction Loss. Thus far, the matching loss focuses on
matching across modality. The clustering loss focuses on separating between video/text data so that they do
not overlap. None of them, however, focuses on the quality of the modeling data itself. In what follows,
we propose a layer-wise reconstruction loss – when minimized, this loss ensures the learned video/text
embedding faithfully preserves information in the data.
We first introduce a set of layer-wise decoders for both videos and texts. The key idea is to pair the
encoders with decoders so that each pair of functions is an auto-encoder. Specifically, the decoder is also
a layer of LSTM/GRU units, generating sequences of data. Thus, at the level of video (or paragraph), we
will have a decoder to generate clips (or sentences). And at the level of clips (or sentences), we will have a
decoder to generate frames (or words). Concretely, we would like to minimize the difference between what
are generated by the decoders and what are computed by encoders on the data. Let
{ˆ c
i
}= DEC
(2)
v
(v),{ˆ s
i
}= DEC
(2)
p
(p) (2.8)
be the two (high-level) decoders for videos and texts respectively. And similarly, for the decoder at the
low-level
{ˆ x
ij
}= DEC
(1)
v
(ˆ c
i
),{ˆ w
ij
}= DEC
(1)
p
(ˆ s
i
) (2.9)
where the low-level decoders take each generated clip and sentence embeddings as inputs and output
sequences of generated frame and word embeddings.
ℓ
RECONSTRUCT
(v,p)=
X
i
{∥ˆ c
i
− c
i
∥
2
2
+
1
n
i
X
j
∥ˆ x
ij
− x
ij
∥
2
2
}
+
X
i
{∥ˆ s
i
− s
i
∥
2
2
+
1
n
′
i
X
j
∥ˆ w
ij
− w
ij
∥
2
2
} (2.10)
Using those generated embeddings, we can construct a loss function characterizing how well the en-
coders encode the data pair(v,p) (see Eq 2.10).
16
2.3.4 Final Learning Objective and Its Extensions
The final learning objective is to balance all those loss quantities
ℓ=ℓ
HIGH
+ℓ
LOW
+τ X
k
ℓ
RECONSTRUCT
(v
k
,p
k
) (2.11)
where the high-level and low-level losses are defined as
ℓ
HIGH
=ℓ
HIGH
MATCH
+ℓ
HIGH
CLUSTER
, ℓ
LOW
=ℓ
LOW
MATCH
+ℓ
LOW
CLUSTER
(2.12)
In our experiments, we will study the contribution by each term.
Learning under Weak Correspondences. Our idea can be also extended to the common setting where
only high-level alignments are available. In fact, high-level coarse alignments of data are easier and more
economical to obtain, compared to fine-grained alignments between each sub-level sentence and video clip.
Since we do not have enough information to define the low-level matching loss ℓ
LOW
MATCH
exactly, we resort
to approximation. We first define an averaged matching over all pairs of clips and sentences for a pair of
video and paragraph
MATCH(v,p)=
1
nm
X
c
i
X
s
j
MATCH(c
i
,s
j
) (2.13)
where we relax the assumption that there is precisely the same number of sentences and clips. We use
this averaged quantity to approximate the low-level matching loss
˜
ℓ
LOW
MATCH
=
X
k
X
k
′
̸=k
[β ′
+ MATCH(v
k
,p
k
)− MATCH(v
k
′,p
k
)]
+
+[β ′
+ MATCH(v
k
,p
k
)− MATCH(v
k
,p
k
′)]
+
(2.14)
This objective will push a clip embedding closer to the embeddings of the sentences belonging to the cor-
responding video (and vice versa for sentences to the corresponding video). A more refined approximation
involving a soft assignment of matching can also be derived, which will be left for future work.
17
2.4 Experiments
We evaluate and demonstrate the advantage of learning hierarchical cross-modal embedding with our pro-
posed approach on several tasks: (i) large-scale video-paragraph retrieval (Section 2.4.2), (ii) down-stream
tasks such as video captioning (Section 2.4.3), and (iii) action recognition (Section 2.4.4).
2.4.1 Experiment Setups
2.4.1.1 Datasets.
We evaluate on three large-scale video datasets:
(1) ActivityNet Dense Caption [91]. This variant of ActivityNet contains densely labeled temporal seg-
ments for 10,009 training and 4,917/4,885 (val1/val2) validation videos. Each video contains multiple clips
and a corresponding paragraph with sentences aligned to the clips. In all our retrieval experiments, we
follow the setting in [91] and report retrieval metrics such as recall@k (k=1,5,50) and median rank (MR).
Following [91] we use ground-truth clip proposals as input for our main results. In addition, we also study
our algorithm with a heuristic proposal method (see Section 2.4.2.4). In the main text, we report all results
on validation set 1 (val1). Please refer to the Supp. Material for the results on val2. For video caption ex-
periment, we follow [91] and evaluate on the validation set (val1 and val2). Instead of using action proposal
method, ground-truth video segmentation is used for training and evaluation. Performances are reported in
Bleu@K, METEOR and CIDEr.
(2) DiDeMo [8]. The original goal of DiDeMo dataset is to locate the temporal segments that correspond to
unambiguous natural language descriptions in a video. We re-purpose it for the task of video and paragraph
retrieval. It contains 10,464 videos, 26,892 video clips and 40,543 sentences. The training, validation and
testing split contain 8,395, 1,065 and 1,004 videos and corresponding paragraphs, respectively. Each video
clip may correspond to one or many sentences. For the video and paragraph retrieval task, paragraphs
are constructed by concatenating all sentences that corresponding to one video. Similar to the setting in
ActivityNet, we use the ground-truth clip proposals as input.
(3) ActivityNet Action Recognition [63]. We use ActivityNet V1.3 for aforementioned off-the-shelf action
recognition. The dataset contains 14,950 untrimmed videos with 200 action classes, which is split into
training and validation set. Training and validation set have 10,024 and 4,926 videos, respectively. Among
18
all 200 action classes, 189 of the action classes have been covered by the vocabulary extracted from the
paragraph corpus and 11 of the classes are unseen.
2.4.1.2 Baselines and Our Methods.
We use the FSE method (as described in Section 2.3.2) as a baseline model. It ignores the clip and sentence
structures in the videos and paragraphs. We train a one-layer GRU directly on the extracted frame/word
features and take their outputs as the embedding representing each modality. Results with C3D features are
also included (see Table 2.1).
Our method has two variants: whenτ = 0, the method (HSE[τ =0]) simplifies to a hierarchical sequence
models as used in [100, 128] except that they do not consider cross-modal learning with cross-modal match-
ing loss while we do. We consider this as a very strong baseline. Whenτ ̸=0, the HSE takes full advantage
of layer-wise reconstruction with multiple decoders, at different levels of the hierarchy. In our experiments,
this method gives the best results.
2.4.1.3 Implementation Details.
Following the settings proposed in [91], we extract the C3D features [161] pretrained on Sports-1M
dataset [79] for raw videos in ActivityNet. PCA is then used to reduce the dimensionality of the feature
to 500. To verify the generalization of our model across different sets of visual feature, as well as leveraging
the state-of-the-art video models, we also employed recently proposed TSN-Inception V3 network [170]
pre-trained on Kinetics [80] dataset to extract visual features. Similarly, we extract TSN-Inception V3 fea-
ture for videos in Didemo dataset. We do not fine-tuning the convolutional neural network on the video along
the training to reduce the computational cost. For word embedding, we use 300 dimension GloVe [132] fea-
tures pre-trained on 840B common web-crawls. In all our experiments, we use GRU as sequence encoders.
For HSE, we choose τ = 0.0005 from tuning this hyper-parameter on the val2 set of ActivityNet retrieval
dataset. The sameτ value is used for experiments on DiDeMo, without further tuning. (More details in the
Appendix)
19
Table 2.1: Video paragraph retrieval on ActivityNet (val1). Standard deviation from 3 random seeded
experiments are also reported.
Paragraph ⇒ Video Video ⇒ Paragraph
R@1 R@5 R@50 MR R@1 R@5 R@50 MR
C3D Feature with Dimensionality Reduction [161]
LSTM-YT [165] 0.0 4.0 24.0 102.0 0.0 7.0 38.0 98.0
NO CONTEXT [166] 5.0 14.0 32.0 78.0 7.0 18.0 45.0 56.0
DENSE online[91] 10.0 32.0 60.0 36.0 17.0 34.0 70.0 33.0
DENSE full[91] 14.0 32.0 65.0 34.0 18.0 36.0 74.0 32.0
FSE 12.6± 0.4 33.2± 0.3 77.6± 0.3 12.0 11.5± 0.5 31.8± 0.3 77.7± 0.3 13.0
HSE[τ =0] 32.8± 0.3 62.3± 0.4 90.5± 0.1 3.0 32.0± 0.6 62.5± 0.5 90.5± 0.3 3.0
HSE[τ =5e-4] 32.7± 0.7 63.2± 0.4 90.8± 0.2 3.0 32.8± 0.4 63.2± 0.2 91.2± 0.3 3.0
Inception-V3 pre-trained on Kinetics [171]
FSE 18.2± 0.2 44.8± 0.4 89.1± 0.3 7.0 16.7± 0.8 43.1± 1.1 88.4± 0.3 7.3
HSE[τ =0] 43.9± 0.6 75.8± 0.2 96.9± 0.3 2.0 43.3± 0.6 75.3± 0.6 96.6± 0.2 2.0
HSE[τ =5e-4] 44.4± 0.5 76.7± 0.3 97.1± 0.1 2.0 44.2± 0.6 76.7± 0.3 97.0± 0.3 2.0
Table 2.2: Video paragraph retrieval on DiDeMo dataset. S2VT method is re-implemented for retrieval task.
Paragraph ⇒ Video Video ⇒ Paragraph
R@1 R@5 R@50 MR R@1 R@5 R@50 MR
S2VT [166] 11.9 33.6 76.5 13.0 13.2 33.6 76.5 15.0
FSE 13.9± 0.7 36.0± 0.8 78.9± 1.6 11.0 13.1± 0.5 33.9± 0.4 78.0± 0.8 12.0
HSE[τ =0] 30.2± 0.8 60.5± 1.1 91.8± 0.7 3.3 29.4± 0.4 58.9± 0.7 91.9± 0.6 3.7
HSE[τ =5e-4] 29.7± 0.2 60.3± 0.9 92.4± 0.3 3.3 30.1± 1.2 59.2± 0.9 92.1± 0.5 3.0
2.4.2 Results on Video-Paragraph Retrieval
In this section, we first compare our proposed approach to the state-of-the-art algorithms, and then perform
ablation studies on variants of our method, to evaluate the proposed learning objectives.
2.4.2.1 Main Results.
We reported our results on ActivityNet Dense Caption val1 set and DiDeMo test set as Table 2.1 and Ta-
ble 2.2, respectively. For both C3D and Inception V3 feature, we observed performances on our hierarchical
models improved the previous state-of-the-art result by a large margin (on Recall@1, over ∼ 15% im-
provement with C3D and∼ 30% improvement with InceptionV3). DENSE full [91], which models the flat
sequences of clips, outperforms our FSE baseline as they augment each segment embedding with a weighted
aggregated context embedding. However, it fails to model more complex temporal structures of video and
paragraph, which leads to inferior performance to our HSE models.
Comparing to our flat baseline model, both HSE[τ =0] and HSE[τ =5e-4] improve performances over all met-
rics in retrieval. It implies that hierarchical modeling can effectively capture the structure information and
20
Table 2.3: Ablation studies on the learning objectives.
Paragraph ⇒ Video Video ⇒ Paragraph
Dataset ℓ
LOW
R@1 R@5 R@50 R@1 R@5 R@50
ActivityNet
HSE[τ =0]
✗ 41.8± 0.4 74.1± 0.6 96.6± 0.1 40.5± 0.4 73.9± 0.6 96.3± 0.1
WEAK 42.6± 0.4 74.8± 0.3 96.7± 0.1 41.3± 0.2 74.7± 0.4 96.5± 0.1
STRONG 43.9± 0.6 75.8± 0.2 96.9± 0.3 43.3± 0.6 75.3± 0.6 96.6± 0.2
HSE[τ =5e-4]
✗ 42.5± 0.3 74.8± 0.1 96.9± 0.0 41.6± 0.2 74.7± 0.6 96.6± 0.1
WEAK 43.0± 0.6 75.2± 0.4 96.9± 0.1 41.5± 0.1 75.2± 0.6 96.8± 0.2
STRONG 44.4± 0.5 76.7± 0.3 97.1± 0.1 44.2± 0.6 76.7± 0.3 97.0± 0.3
DiDeMo
HSE[τ =0]
✗ 27.1± 1.9 59.1± 0.4 92.2± 0.3 27.3± 1.0 57.6± 0.5 91.3± 1.2
WEAK 28.0± 0.8 58.9± 0.5 91.4± 0.6 28.3± 0.3 58.5± 0.6 91.2± 0.3
STRONG 30.2± 0.8 60.5± 1.1 91.8± 0.7 29.4± 0.4 58.9± 0.7 91.9± 0.6
HSE[τ =5e-4]
✗ 28.1± 0.8 59.5± 1.1 91.7± 0.7 28.2± 0.8 58.1± 0.5 90.9± 0.5
WEAK 28.7± 2.1 59.1± 0.2 91.6± 0.7 28.3± 0.8 59.2± 0.6 91.1± 0.1
STRONG 29.7± 0.2 60.3± 0.9 92.4± 0.3 30.1± 1.2 59.2± 0.9 92.1± 0.5
relationships over clips and sentences among videos and paragraphs. Moreover, we observe that HSE[τ =5e-4]
consistently improves over HSE[τ =0] across most retrieval metrics on both datasets. This attributes the impor-
tance of our layer-wise reconstruction objectives, which suggests that better generalization performances.
2.4.2.2 Low-level Loss is Beneficial.
Table 2.1 and Table 2.2 have shown results with optimizing both low-level and high-level objectives. In
Table 2.3, we further performed ablation studies on the learning objectives. Note that rows with✗ represent
learning without low-level loss ℓ
LOW
. In all scenarios, joint learning with both low-level and high-level
correspondences improves the retrieval performance.
2.4.2.3 Learning with Weak Correspondences at Low-level.
As mentioned in Section 2.3, our method can be extended to learn the low-level embedding with weak
correspondence. We evaluate its effectiveness on both ActivityNet and DiDeMo datasets. Performance are
listed in Table 2.3. Note that for the rows of “weak”, no auxiliary alignments between sentences and clips
are available during training.
Clearly, including low-level loss with weak correspondence (ie, correspondence only at the high-level)
obtained superior performances when compared to models that do not include low-level loss at all. On
several occasions, it even attains the same competitive result as including low-level loss with strong corre-
spondences at the clip/sentence levels.
21
Table 2.4: Performance of using proposal instead of ground truth on ActivityNet dataset
P ⇒ V V ⇒ P
Proposal Method # Segments R@1 R@5 R@1 R@5 Precision Recall
HSE + SSN - 10.4 31.9 10.8 31.7 1.5 17.1
HSE + UNIFORM
1 18.0 45.5 16.5 44.9 63.2 31.1
2 20.0 48.9 18.4 47.6 61.8 46.0
3 20.0 48.6 18.2 47.9 55.3 50.6
4 20.5 49.3 18.7 48.1 43.2 45.5
HSE + GROUND TRUTH - 44.4 76.7 44.2 76.7 100.0 100.0
FSE - 18.2 44.8 16.7 43.1 - -
2.4.2.4 Learning with Video Proposal Methods.
As using ground-truth temporal segments of videos is not a natural assumption, we perform experiments to
validate the effectiveness of our method with proposal methods. Specifically, we experiment with two differ-
ent proposal approaches: SSN [205] pre-trained on ActivityNet action proposal and a heuristic uniform pro-
posal. For uniform proposal of K segments, we meant naturally segmenting a video into K non-overlapping
and equal-length temporal segments.
The results are summarized in Table 2.4 (with columns of precision and recall being the performance
metrics of the proposal methods). There are two main conclusions from these results: (1) The segments
of Dense Caption dataset deviate significantly from the action proposals, therefore a pre-trained action
proposal algorithm performs poorly. (2) Even with heuristic proposal methods, the performance of HSE
is mostly better than (or comparable with) FSE. We leave to future work on identifying stronger methods for
proposals.
2.4.2.5 Retrieval with Incomplete Video and Paragraph.
In this section, we investigate the correlation between the number of observed clips and sentences and
models’ performance of video and paragraph retrieval. In this experiment, we gradually increase the number
of clips and sentences observed by our model during the testing and obtained the Figure 2.4, on ActivityNet.
When the video/paragraph contains fewer clips/sentences than the number of observations we required, we
take all those available clips/sentences for computing the video/paragraph embedding. (On average 3.65
clips/sentences per video/paragraph)
From Figure 2.4, we note that increasing the number of the observed clips and sentences leads to im-
proved performance results in retrievals. We can see that when observing only one clip and sentence, our
22
1 1.5 2 2.5 3 3.5 4 4.5 5
Number of Segments/Sentences Observed
10
20
30
40
50
60
70
80
Retrieval Performance
Video to Paragarph Retrieval
Recall@1 - HSE
Recall@5 - HSE
Recall@1 - FSE
Recall@5 - FSE
1 1.5 2 2.5 3 3.5 4 4.5 5
Number of Segments/Sentences Observed
10
20
30
40
50
60
70
80
Retrieval Performance
Paragarph to Video Retrieval
Recall@1 - HSE
Recall@5 - HSE
Recall@1 - FSE
Recall@5 - FSE
Figure 2.4: Retrieval performance improves given more observed clips/sentences.
Table 2.5: Results for video captioning on ActivityNet
B@1 B@2 B@3 B@4 M C
LSTM-YT [165] 18.2 7.4 3.2 1.2 6.6 14.9
S2VT [166] 20.4 9.0 4.6 2.6 7.9 21.0
HRNN [187] 19.5 8.8 4.3 2.5 8.0 20.2
DENSE [91] 26.5 13.5 7.1 4.0 9.5 24.6
DVC [105] 19.6 9.9 4.6 1.6 10.3 25.2
FSE 17.9 8.2 3.6 1.7 8.7 32.1
HSE[τ =0] 19.6 9.4 4.2 2.0 9.2 39.5
HSE[τ =5e-4] 19.8 9.4 4.3 2.1 9.2 39.8
model already outperforms the previous state-of-the-art method as well as our baseline FSE that observes
the entire sequence. With observing less than the average length of clips and sentences, our learned model
can achieve∼ 70% of the final performance.
2.4.3 Results on Video Captioning
Setups. In addition to the video paragraph retrieval, we evaluate our learned embeddings for video cap-
tioning. Specifically, we follow [91] and train a caption model [167] on top of the pre-trained video embed-
dings. Similar to [91], we concatenate the clip-level feature with contextual video-level feature, and build a
two-layer LSTM as a caption generator. We randomly initialized the word embedding as well as LSTM and
trained the model for 25 epochs with learning rate of 0.001. We use the ground-truth proposal throughout
training and evaluation following the setting of [91, 105]. During testing, beam search is used with beam=5.
Results are reported in Table 2.5.
23
Table 2.6: Results for action recognition on ActivityNet (low-level embeddings)
Zero-Shot Train
Transfer Classifier
Top-1 Top-5 Top-1 Top-5
FV-VAE [137] - - 78.6 -
TSN [171] - - 88.1 -
FSE 48.3 79.4 74.4 94.1
HSE[τ =0] 50.2 84.4 74.7 94.3
HSE[τ =5e-4] 51.4 83.8 75.3 94.3
RANDOM 0.5 2.5 0.5 2.5
Results. We observe that our proposed model outperforms baseline over most metrics. Meanwhile, HSE
also improves over previous approaches such as LSTM-YT, S2VT, and HRNN on B@2, METEOR, and CIDEr
by a margin. HSE achieves comparable results with DVC in all criterions. However, both HSE and HSE[τ =0]
failed to obtain close performance to DENSE [91]. This may due to the fact that DENSE [91] carefully learns
to aggregate the context information of a video clip for producing high-quality caption, while optimized for
video-paragraph retrieval our embedding model does not equip with such capability. However, it is worth
noting that our model obtains higher CIDEr score compared to all existing methods. We empirically observe
that fine-tuning the pre-trained video embedding does not lead to further performance improvement.
2.4.4 Results on Action Recognition
To evaluate the effectiveness of our model, we take the off-the-shelf clip-level embeddings trained on video-
paragraph retrieval for action recognition (on ActivityNet with non-overlapping training and validation data).
We use two action recognition settings to evaluate, namely zero-shot transfer and classification .
Setups. In the zero-shot setting, we directly evaluate our low-level embedding model learned in the video
and text retrieval, via treating the phrases of actions as sentences and use the sentence-level encoder to
encode the action embedding. We take the raw video and apply clip-level video encoder to extract the
feature for retrieving actions. No re-training is performed and all models have no access to the actions’ data
distribution. Note though action are not directly used as sentences during the training, some are available as
verbs in the vocabulary. Meanwhile, as we are using pre-trained word vector (GloVe), it allows the transfer
to unseen actions. In the classification setting, we discriminatively train a simple classifier to measure
the classification accuracy. Concretely, a one-hidden-layer Multi-Layer Perceptron (MLP) is trained on the
clip-level embeddings. We do not fine-tune the pre-trained clip-level video embedding here.
24
-40 -30 -20 -10 0 10 20 30 40 50
-40
-30
-20
-10
0
10
20
30
40
Arm wrestling
Capoeira
Futsal
Longboarding
Playing congas
Playing drums
Rafting
Snow tubing
Surfing
Using the balance beam
-20 -15 -10 -5 0 5 10 15 20 25
-25
-20
-15
-10
-5
0
5
10
15
20
25
Arm wrestling
Capoeira
Futsal
Longboarding
Playing congas
Playing drums
Rafting
Snow tubing
Surfing
Using the balance beam
ActivityNet Training Set ActivityNet Validation Set
Figure 2.5: T-SNE visualization of off-the-shelf video embedding of HSE on ActivityNet v1.3 training and
validation set. Points are marked with its action classes.
Results. We report results of above two settings on the ActivityNet validation set (see Table 2.6). We
observe that our learned low-level embeddings allow superior zero-shot transfer to action recognition, with-
out accessing any training data. This indicates that semantics of actions are indeed well reserved in the
learned embedding models. More interestingly, we can see that both HSE[τ =0] and HSE improve the perfor-
mance over FSE. It shows that our hierarchical modeling of video benefits not only high-level embedding
but also low-level embedding. A similar trend is also observed in the classification setting. Our method
achieves comparable performance to the state-of-the-art video modeling approach such as FV-VAE [137].
Note TSN [171] is fully supervised thus not directly comparable.
2.4.5 Qualitative Results
We use t-SNE [118] to visualize our results in the video to paragraph and paragraph to video retrieval task.
Fig 2.5 shows that the proposed method can cluster the embedding of videos with regard to its action classes.
To further explain the retrieval quality, we provide qualitative visualization in the Supp. Material.
25
2.5 Conclusion
In this chapter, we propose a novel cross-modal learning approach to model videos and texts jointly, which
leverages the intrinsic hierarchical structures of both videos or texts. Specifically, we consider the corre-
spondences of videos and texts at multiple granularities, and derived loss functions to align the embeddings
for the paired clips and sentences, as well as paired video and paragraph in accordance in their own semantic
spaces. Another important component of our model is layer-wise reconstruction, which ensures that learned
embeddings capture video (paragraph) and clips (words) at different levels. Moreover, we further extend
our learning objective so that it allows to handle a more generalized learning scenario where only video
paragraph correspondence exists. We demonstrate the advantage of our proposed model in a range of tasks
including video and text retrieval, zero-shot action recognition and video caption.
26
Chapter 3
Hierarchical Video Modeling for Moment Localization
3.1 Introduction
With over 70% of the current internet traffics being video data [126], a growing number of videos are being
created, shared, and consumed over time. To effectively and efficiently search, browse, and navigate video
contents, an intelligent system needs to understand the rich and complex semantic information in them.
For this type of use cases, the recently proposed task of moment localization in video corpus (MLVC)
highlights several challenges in semantic understanding of videos [43, 98]. The goal of MLVC is to find a
video segment that corresponds to a text query from a corpus of untrimmed and unsegmented videos, with
a significant amount of variation in factors such as the type of contents, lengths, visual appearance, quality,
and so on.
This task can be seen as “finding a needle in the haystack”. It is different from searching videos with
broad queries such as genres or names of the artists. In contrast, the text query needs to be semantically
congruent to a relatively short segment in a much longer target video. For example, the query “LeBron
James shot over Yao Ming” matches only a few seconds of clips in a game of hours long. Thus, MLVC
requires semantic understanding of videos at a more fine-grained level than video retrieval, which typically
only targets the whole video. Furthermore, finding the corresponding segment for a text query requires
combing through all videos in a corpus and all possible segments in each video. For a large corpus with long
videos, it is not feasible to have such computational complexity that depends on the square of the (averaged)
number of frames.
In this chapter, we address this challenge by representing videos at multiple scales of granularity. At the
coarse-grained level, the representation captures semantic information in a video over long temporal spans
27
CCLS1 CCLS2 CCLS3
Contextualized Clip
Representation
,
,
.
A child carry a
bucket of water
Text Query
Frame EncoderΦ
Video as Clips
Contextualized Frame
Representation
Video
Encoder
A B C sos D eos
Cross-Modal
Transformer
Text Transformer
A B C sos D eos
Cross Modal
Attention
Video
Encoder
A B C sos D eos
Cross-Modal
Transformer
Text Transformer
A B C sos D eos
Cross Modal
Attention
Clip Encoder Ψ Clip-level
Representation
Figure 3.1: Overview of the HAMMER model. The model contains two cross-modal encoders, a frame encoder and
a clip encoder on top of it. The outputs of the model are contextualized frame-level and clip-level features, which are
used by downstream task-specific modules, e.g. video retrieval and temporal localization.
(e.g., clips), allowing us to retrieve the most relevant set of videos for a text query. At the fine-grained level,
the representation captures semantic information in short temporal spans (e.g., frames) to allow for precise
localization of the most relevant video segments among the retrieved videos.
We propose a novel hierarchical multi-modal encoder (HAMMER) to implement this idea. HAMMER
uses cross-modal attention to combine the information between the text and visual modalities. The cross-
modal learning occurs hierarchically at 3 scales: frame, clip, and video (as a whole). Frames are the most
fine-grained building blocks of a video. Each clip consists of a non-overlapping set of frames with equal
length, and is in turn the building block of the final video-level representation. The architecture of the model
is illustrated in Fig. 3.1. The frame-level representation is obtained from a text-visual cross-modal encoder
operated on video frames, while the clip-level representation is built upon the frame-level representation
with a similar encoder.
The introduction of clip-level representation encoder is important as it allows us to capture both coarse-
and fine-grained semantic information. In contrast, existing approaches for MLVC [98, 43] and other visual-
language tasks [51, 85, 44, 24] typically pack information of different granularity into a single vector em-
bedding, making it hard to balance the differing demands between retrieving a long video and localizing a
short segment.
We apply HAMMER to MLVC task on two large-scale datasets, ActivityNet Captions [91] and TVR [98].
We train it with a multi-tasking approach combining three objectives: video retrieval, temporal localization,
and an auxiliary masked language modeling. Our experiments demonstrate the efficacy of HAMMER and
establish state-of-the-art performance on all the tasks simultaneously—video retrieval, moment localization
in single video and moment localization in video corpus. To better understand the inner-workings of our
28
model, we compare it with a strong FLAT baseline, a video encoder without any hierarchical representation.
Since the longer videos tend to be less homogeneous, it becomes decidedly important to represent the videos
at multiple levels of granularity. Our analysis shows that the performance of a FLAT baseline declines, when
the number of frames irrelevant to the text query increases. On the other hand, the performance of our
proposed HAMMER model is robust to the length of the videos, showing that our hierarchical approach is
not affected by the increase of irrelevant information and can flexibly handle longer videos.
Our contributions are summarized as follows:
• We propose a novel model architecture HAMMER that represents videos hierarchically and improves
video modeling at long-term temporal scales.
• We demonstrate the efficacy of HAMMER on two large-scale datasets, i.e., ActivityNet Captions and
TVR, outperforming previous state-of-the-art methods.
• We carry out a detailed analysis of HAMMER and show that it particularly improves the performance
of video retrieval over strong baselines on long videos.
• We conduct a thorough ablation study to understand the effects of different design choices in our
model.
3.2 Related Work
Most existing MLVC approaches consider text-based video retrieval [183, 39, 165, 129, 123] and temporal
localization [64, 52, 181, 109, 25, 139] as separate tasks.
Video Retrieval (VR) is a task that ranks candidate videos based on their relevance to a descriptive text
query. Standard cross-modal retrieval methods [91, 166] represent both video and text as two holistic em-
beddings, which are then used for computing the similarity as the ranking score. When the text query is a
lengthy paragraph, hierarchical modeling is applied to both modalities separately [198, 145], leading to a
significant improvement on the performance of text-based video retrieval. Different from prior work, in this
study we consider a more realistic problem where we use a single query sentence that describes only a small
segment to retrieve the entire video. For instance, the text query “Add the onion paste to the mixture” may
corresponds to a temporal segment of a few seconds in a long cooking video.
29
Temporal Localization (TL) aims at localizing a video segment (usually a short fraction of the entire video)
described by a query sentence inside a video. Two types of methods have been proposed to tackle this
challenge, namely the top-down (or proposal-based) approach [64, 181, 52] and the bottom-up (or proposal-
free) approach [25, 115, 26, 189, 190]. The top-down approach first generates multiple clip proposals before
matching them with a query sentence to localize the most relevant clip from the proposals. The bottom-up
approach first calculates a query-aware video representation as a sequence of features, then predicts the start
and end times of the clip described by the query.
Moment Localization in Video Corpus (MLVC) is first proposed by Escorcia et al. [43]. They consider
a practical setting where they require models to jointly retrieve videos and localize moments corresponding
to natural language queries from a large collection of untrimmed and unsegmented videos. They devised
a top-down localization approach that compares text embeddings on uniformly partitioned video chunks.
Recently, Lei et al. [98] proposed a new dataset, TVR, that considers a similar task called Moment Retrieval
in Video-Subtitle Corpus, which requires a model to align a query text temporally with a video clip, using
multi-modal information from video and subtitles (derived from Automatic Speech Recognition or ASR).
3.3 Method
We first describe the problem setting of MLVC and introducing the notations in §3.3.1. In §3.3.2, we
describe a general strategy of decomposing MLVC into two sub-tasks, VR and TL [180, 115]. The main
purpose is to reduce computation and to avoid the need to search all possible segments of all videos. In
§3.3.3, we present a novel HierArchical Multi-Modal EncodeR (HAMMER) model and describe how it is
trained in §3.3.4. Finally, we describe key details for inference in §3.3.5.
3.3.1 Problem Setting and Notations
We represent a videov as a sequence of N frames{x
t
|t = 1,...,N}, wherex
t
is a visual feature vector
representing the t-th frame. Given a text queryh (e.g., a sentence), our goal is to learn a parameterized
function (i.e., neural networks) that accurately estimates the conditional probability p(m|h), wherem is
a video segment given bym = {x
t
|t = t
s
,...,t
e
}. t
s
and t
e
stand for the indices of the starting and the
ending frames of the segment in a videov. Note that for a video corpusV with an average length of N
frames, the number of all possible segments isO(|V|× N
2
). Thus, in a large corpus, exhaustive search for
30
the best segmentm corresponding toh is not feasible. In what follows, we describe how to address this
challenge.
To localize the momentm that best corresponds to a text queryh, we need to identify
m
∗ =argmax
m
p(m|h)=argmax
m
X
v
p(m|v,h)p(v|h) (3.1)
Note that the conditional probability is factorized into two components. If we assumem uniquely belongs
to only one video in the corpusV, then the marginalization over the videov is vacuous and can be discarded.
This leads to
max
m
p(m|h)=max
v
max
m∈v
p(m|v,h)p(v|h). (3.2)
The training data are available in the form of (h
(i)
,v
(i)
,m
(i)
) wherem
(i)
⊂ v
(i)
is the matched segment
to the queryh
(i)
.
3.3.2 Two-Stage MLVC: Retrieval and Localization
As aforementioned, this inference of Eq. (3.2) is infeasible for large-scale corpora and/or long videos. Thus,
we approximate it by
m
∗ ≈ argmax
m∈v
∗ p(m|v
∗ ,h), withv
∗ =argmax
v
p(v|h) (3.3)
This approximation allows us to build two different learning components and stage them together to solve
MLVC. This approach has been applied in a recent work on the task [98]. We give a formal summary below.
Video Retrieval (VR) identifies the best video v
∗ by minimizing the negative log-likelihood ofp(v|h)
ℓ
VR
=− X
i
logp(v
(i)
|h
(i)
) (3.4)
wherev
(i)
is the ground-truth video for the text queryh
(i)
. This is a rather standard (cross-modal) retrieval
problem, which has been widely studied in the literature. (See §3.2 for some references.)
31
Temporal Localization (TL) models p(m|v,h). While it is possible to model O(N
2
) possible segments
in a video withN frames, we choose to model it with the probabilities of identifying the correct starting (t
s
)
and ending (t
e
) frames:
p(m|v,h)≈ p(t
s
|v,h)· p(t
e
|v,h)· I[t
e
>t
s
] (3.5)
Here, we considert
s
andt
e
to be independent to efficiently approximate p(m|v,h). The indicator function
I[·] simply stipulates that the ending frame needs to be after the starting frame.
To model each of the factors, we treat it as a frame classification problem, annotating each frame with
one of the three possible labels: BEGIN and END marks the starting and ending frames respectively, with
all other frames asOTHER. We denote this asB,E,O classification scheme. During training, we optimize
(the sum of) the frame-wise cross-entropy between the model’s predictions and the labels. We denote the
training loss as
ℓ
TL
=− X
i
X
t
f
(i)
t
logp(y
(i)
t
|v
(i)
,h
(i)
), (3.6)
wheref
(i)
t
is the true label for the framex
t
of the videov
(i)
, andy
(i)
t
is the corresponding prediction of the
model.
This type of labeling schemes have been widely used in the NLP community, for example, recently for
span-based question and answering [77, 49].
3.3.3 HierArchical Multi-Modal Encoder (HAMMER)
Our first contribution is to introduce the hierarchical modeling approach to parameterize the conditional
probability p(v|h) for the VR sub-task and the labeling model p(y|v,h) for the TL sub-task. In the next
section, we describe novel learning algorithms for training our model.
Main idea Video and text are complex and structural objects. They are naturally in “temporally” linear
orders of frames and words. More importantly, semantic relatedness manifests in both short-range and long-
range contextual dependencies. To this end, HAMMER infuses textual and visual information hierarchically
at different temporal scales. Figure 3.1 illustrates the architecture of HAMMER. A key element here is to
introduce cross-modal attention at both the frame level and the clip level.
32
Clip-level Representation We introduce an intermediate-level temporal unit with a fixed length of M
frames, and refer to them as a clipc
k
={x
t
|t = (k− 1)· M,...,k· M− 1}, wherek = 1,...,⌈N/M⌉.
As such, a video can also be hierarchically organized as a sequence of non-overlapping video clipsv =
{c
k
|k =1,...,⌈N/M⌉}. M is a hyper-parameter to be adjusted on different tasks and datasets. We empha-
size while sometimes segments and clips are used interchangeably, we refer to “segment” as a set of frames
that are also the visual grounding of a text query, and “clip” as a collection of temporally contiguous frames.
We treat them as holding memory slots for aggregated lower-level semantic information in frames.
Cross-modal Transformers HAMMER has two cross-modal Transformers. At the frame-level, the frame
encoder Φ takes as input both the frame sequence of a video clip and the text sequence of a query, and
outputs the contextualized visual frame features {Φ( x
t
;c
k
,h)} for each clip c
k
. The frame encoder Φ encodes the local and short-range contextual dependencies among the frames of the same clip.
We also introduce a Clip CLS Token (CCLS
k
) for eachc
k
[116]. The contextual embedding of this token
gives the representation of the clip:
ϕ k
=Φ( CCLS
k
;c
k
,h) (3.7)
Contextual embeddings for all clips are then fed into a higher-level clip encoder Ψ , also with cross-modal
attention to the input text, yielding a set of contextualized clip representation
Ψ v
={Ψ( ϕ k
;v,h)|k =1,...,⌈N/M⌉}. (3.8)
Note thatΨ v
now encodes the global and longer-range contextual dependencies among all frames (through
clips)
1
.
To summarize, our model has 3 levels of representations: the contextualized frames{Φ( x
t
;c
k
,h)}, the
clips{ϕ k
}, and the entire videoΨ v
. Next, we describe how to use them to form our learning algorithms.
3.3.4 Learning HAMMER for MLVC
The different levels of representation allows for the flexibility for modeling the two subtasks (VR and TL)
with semantic information across different temporal scales.
1
Alternatively, we can summarize it (into a vector, in lieu of the set) through various reduction operations such as pooling or
introducing a video-level CLS tokenVCLS.
33
Modeling Video Retrieval We use the contextualized clips to compute the video-query compatibility score
for a query h and its corresponding video v. In order to retrieve the likely relevant videos as much as
possible, we need a coarse-grained matching that focuses more on higher-level semantic information.
Specifically, we identify the best matching among all clip embeddings Ψ( c
k
;v,h) and use it as the
matching score for the whole video:
p(v|h)∝f(v,h)=max
k
{θ ⊤
VR
· Ψ( ϕ k
;v,h)}
(3.9)
whereθ VR
is a linear projection to extract the matching scores
2
. The conditional probability is normalized
with respect to all videos in the corpus (though in practice, a set of positive and negative ones).
Modeling Temporal Localization As in the previous section, we treat localization as classifying a frame
intoB,E, orO:
p(y
t
|v,h)≈ p(c
k
|v,h)· p(y
t
|c
k
,h) (3.10)
Note that each frame can belong to only one clip c
k
so there is no need to marginalize over c
k
. The
probabilityp(c
k
|v,h) measures the likelihood ofc
k
containing a labely
t
in one of its frames. The second
factor measures the likelihood that the specific frame x
k
is labeled asy
t
. Clearly, these two factors are on
different semantic scales and are thus modeled separately:
p(c
k
|v,h)∝u
⊤
· [Ψ( ϕ k
;v,h), Ψ( TCLS;v,h)] (3.11)
p(y
t
|c
k
,h)∝w
⊤
yt
· [Φ( x
t
;c
k
,h),Φ( TCLS;c
k
,h)] (3.12)
whereTCLS is a textCLS token summarizing the query embedding.
Masked Multi-Modal Model Masked language modeling has been widely adopted as a pre-training task
for language modeling [116, 156, 38]. The main idea is to backfill a masked text token from its contexts,
i.e., the other tokens in a sentence.
The multi-modal modeling task in this chapter can similarly benefit from this idea. During training, we
mask randomly some text tokens. We expect the model to achieve two things: (1) using the partially masked
text query to retrieve and localize which acts as a regularization mechanism; (2) better text grounding by
2
An alternative design is to pool all Ψ( ϕ k
;v,h) and then perform a linear projection. However, this type of polling has a
disadvantage that a short but relevant segment – say within a clip – can be overwhelmed by all other clips. Empirically, we also
find the current formalism works better. A similar finding is also discovered in [198].
34
recovering the masked tokens with the assistance of the multimodal context, i.e., both the textual context
and the visual information in the frames and the clips.
To incorporate a masked query to the loss functionsℓ
VR
andℓ
TL
of the model we applyh⊗ (1− mask)
to replace h, where mask is a binary mask vector for text tokens, 1 is a one-valued vector of the same
size, and⊗ indicates element-wise multiplication. We introduce another loss to backfill the missing tokens
represented byh⊗ mask:
ℓ
MASK
=− logp(h⊗ mask|v,h⊗ (1− m)) (3.13)
This probability is computed using bothΦ( ·) for frames andΨ( ·) for clips.
Multi-Task Learning Objective We use a weighted combination of video retrieval, moment localization,
and masked multi-modal modeling objectives as our final training objective:
ℓ=E
m
[λ VR
· ℓ
VR
+λ TL
· ℓ
TL
+λ MASK
· ℓ
MASK
], (3.14)
where the expectation is taken with respect to random masking. Since the VR and TL subtasks share the
same model and output representations, the final objective needs to balance different goals and is multi-
tasking in nature. We provide a detailed ablation study in §3.4 to analyze the choice of weights.
3.3.5 Two-stage Inference with HAMMER
For the model inference of HAMMER, we perform two sequential stages, i.e., video retrieval and temporal
localization, to accomplish the task of moment localization in video corpus. For video retrieval, we use
HAMMER and the linear regressor to compute pairwise compatibility scores as in Eq. (3.9) with respect to
the text queryh and all videosV in the corpus. Next, we perform temporal localization on the top ranked
videos. Specifically, we predict the start and end frame with HAMMER to localize the temporal segmentm
following Eq. (3.5). Then we greedily label the frame with the maximump(t
s
|v,h) as the start frame and
maximump(t
e
|v,h) as the end frame. Here we have an additional constraint to consider — the predicted
end frame must appear after the start frame prediction. This two-stage inference reduces the complexity to
O(|V|+N), which is significantly better comparing to the O(|V|· N
2
) complexity of [43].
35
3.4 Experiments
In this section, we perform experiments with the proposed HAMMER model. We first introduce the datasets
and setups of our experiments in §3.4.1. Next, we present the main results of the HAMMER model in
§3.4.2, contrasting against a strong baseline FLAT as well as other existing methods. We then confuct a
thorough ablation study in §3.4.3 to evaluate the importance of various design choices for the HAMMER
model. Finally, we carry out qualitative analysis of our model to better demonstrate its behaviour.
3.4.1 Experimental Setups
Datasets We experiment on two popular MLVC datasets:
ActivityNet Captions [91] contains∼ 20K videos, each has 3.65 temporally localized query sentences
on average. The mean video duration is 180 seconds and the average query length is 13.48 words, which
spans over 36 seconds of the video. There are 10,009 videos for training and 4,917 videos for validation
(val 1 split). We follow prior work [43, 65] to train our models and evaluate them on the val 1 split.
TVR [98] contains∼ 22K videos in total, of which∼ 17.5K videos are in the training set and 2,180 are
in the validation set. The dataset contains videos from movies, cartoons, and TV-shows. The videos are
on average 76.2 seconds long and contain 5 temporally localized sentences per video. The moments in the
videos are 9.1 seconds long and described by sentences containing 13.4 words on average. We make use of
the subtitle (ASR) features together with the video feature in TVR dataset, following prior works [98, 102].
We make use of multiple popular choices of video features on these two datasets as existing litera-
ture [43, 65, 98], which includes the appearance-only features (ResNet152 [62] pre-trained on ImageNet [37]),
spatio-temporal features (I3D [21] pre-trained on Kinetics [80]), and their combinations. We present the de-
tails of feature preparation in the Suppl. Material.
Evaluation Metrics We use different evaluation metrics for different video understanding tasks:
Video Retrieval (VR) We report Recall@k and Median Rank (MedR or MedRank) as the evaluation
metrics for Video Retrieval as suggested in the literature.
Temporal Localization (TL) We report both mean IoU (mIoU) and average precision with IoU={0.3,
0.5, 0.7} as the evaluation metrics. Here, IoU measures the Intersection over Union between the ground
truth and predicted video segments, i.e., the localization accuracy.
36
Table 3.1: MLVC Results on ActivityNet and TVR datasets
IoU=0.5 IoU=0.7
Model & Feature R1 R10 R100 R1 R10 R100
ActivityNet
MCN [64] R 0.02 0.18 1.26 0.01 0.09 0.70
CAL [43] R 0.21 1.32 6.82 0.12 0.89 4.79
FLAT R 0.34 2.28 10.09 0.21 1.28 5.69
HAMMER R 0.51 3.29 12.01 0.30 1.87 6.94
FLAT I 2.57 13.07 30.66 1.51 7.69 17.67
HAMMER I 2.94 14.49 32.49 1.74 8.75 19.08
TVR
XML [98] I+R – – – 2.62 6.39 22.00
HERO
1
[102] I+R – – – 2.98 10.65 18.25
FLAT I+R 8.45 21.14 30.75 4.61 11.29 16.24
HAMMER I+R 9.19 21.28 31.25 5.13 11.38 16.71
Moment Localization in Video Corpus (MLVC) We use Recall@k with IoU=p for the main evaluation
metrics [43, 98]. Specifically, we measure whether the correct localized segment exists in the top k of the
ranked videos. Here, a localized segment is correct if it overlaps with the ground truth segment over an IoU
of{0.5, 0.7}.
Baseline and the HAMMER Models In HAMMER, we use two encoders, i.e., the frame and clip encoders,
with multiple Transformer [163] layers to represent the visual (and ASR) features as well as the text query
features (details in Figure 3.1). Each encoder contains 1 layer of Transformer for visual input, 5 layers of
Transformers for the text query input, and 1 layer of cross-modal Transformer between the visual and text
query inputs. When ASR is provided (i.e., in TVR), we add one additional Transformer layer to incorporate
the ASR input, with another cross-modal Transformer layer that cross-attends between the query input and
ASR features. The processed ASR and visual features are concatenated. Meanwhile, we design a FLAT
model as a strong baseline. The FLAT model has a similar architecture as HAMMER, except that it only
uses the frame encoder to capture the visual (and ASR) features. We provide complete details about the
architectural configurations and model optimization in the the Suppl. Material.
3.4.2 MLVC Experiments
Main Results Table 3.1 presents a comparison between the proposed HAMMER and other methods on the
two MLVC benchmarks. We observe that, irrespective of the feature types, HAMMER outperforms FLAT
1
We compare against their model without large-scale pre-training for fair comparison.
37
Table 3.2: VR results on ActivityNet Captions.
Model R1 R10 R100 MedR↓
FLAT 5.37 29.14 71.64 29
HAMMER 5.89 30.98 73.38 26
Table 3.3: TL results on ActivityNet Captions.
Model IoU=0.3 IoU=0.5 IoU=0.7 mIoU↑
FLAT 57.58 39.60 22.59 40.98
HAMMER 59.18 41.45 24.27 42.68
noticeably, which in turn outperforms most published results on both datasets. On ActivityNet, we observe
that models using I3D features (denoted as I) outperform their counterparts with ResNet152 (denoted as R)
features, by a significant margin. It indicates the importance of spatio-temporal feature representation in the
MLVC tasks.
Meanwhile, we note that our FLAT model outperforms the baselines on the TVR dataset, which is mainly
due to the introduction of the cross-modal Transformer between query and visual+ASR features (see §3.4.3
for a detailed study). On both datasets, HAMMER establishes the new state-of-the-art results for the MLVC
task (without using additional data). This result shows a clear benefit of hierarchical structure modeling in
video for the MLVC task.
Table 3.2 and 3.3 contrast HAMMER to the FLAT model in more details by comparing their performance
on the tasks of video retrieval and temporal localization separately. The results are reported on the Activi-
tyNet with models using the I3D features. In both cases, HAMMER achieves significantly better performance
than the baseline FLAT model.
Comparing Models on Videos of Different Duration We discuss the potential reasons for HAMMER
to outperform the FLAT model. Since HAMMER learns video representation at multiple granularities, we
hypothesize that it should be able to focus on the task-relevant parts of a video without getting distracted
by irrelevant parts. Specifically for the task of sentence-based video retrieval which requires matching the
relevant frames in the video with the text query, HAMMER would be less sensitive to the presence of non-
matching frames and hence be robust to the length of the video. To verify this, we analyze HAMMER’s
performance on videos with different lengths for the task of video retrieval and temporal localization.
Figure 3.2: Comparison of Video Retrieval performance under different video durations. Results are reported in Median Rank (MedRank) on ActivityNet Captions (lower is better).
Figure 3.3: Comparison of Temporal Localization performance under different video durations. Results are reported in Mean IoU (mIoU) on ActivityNet Captions (higher is better).
We compare the performance of HAMMER and FLAT on videos with different durations for the video
retrieval task in Fig. 3.2. The metric used for comparison is the median rank where lower numbers indicate
better performance. Firstly, while the performance of the FLAT model is inconsistent (e.g., its performance on the longest videos is worse than on the second-to-longest videos), the HAMMER model's performance consistently improves with the length of the video. Secondly, the performance of the HAMMER model is best for the longest videos in the dataset. Finally, while both models perform sub-optimally on the shortest videos, HAMMER still outperforms FLAT on those videos.
Table 3.4: Ablation study on sub-tasks (VR=Video Retrieval, TL=Temporal Localization, FM=Frame MLM, CM=Clip MLM)

                     Video Retrieval           Temporal Localization
VR  TL  FM  CM       R1    R10    R100         IoU=0.5  IoU=0.7  mIoU
✓                    4.93  29.02  72.15        –        –        –
✓       ✓            5.52  30.53  73.02        –        –        –
✓           ✓        5.45  30.45  73.24        –        –        –
✓       ✓   ✓        5.67  30.20  72.67        –        –        –
    ✓                –     –      –            39.02    22.74    40.28
    ✓   ✓            –     –      –            39.27    22.04    40.30
    ✓       ✓        –     –      –            39.13    22.38    40.51
    ✓   ✓   ✓        –     –      –            39.16    22.82    40.64
✓   ✓                5.22  30.22  72.70        40.59    23.70    42.01
✓   ✓   ✓            5.57  30.97  73.09        41.17    24.04    42.45
✓   ✓       ✓        5.85  30.82  73.54        41.30    23.94    42.43
✓   ✓   ✓   ✓        5.89  30.98  73.38        41.45    24.27    42.68
We further compare the temporal localization performance of HAMMER and FLAT models in Fig. 3.3.
The results are reported using mean IoU, where higher numbers indicate better performance. It shows that HAMMER consistently achieves higher performance than FLAT across all videos, irrespective of their length.
Overall, the analysis shows a clear advantage of using HAMMER over FLAT, which is especially pronounced for longer videos, hence supporting our central modeling argument.
3.4.3 Ablation Studies and Analyses
In this section, we evaluate the effectiveness of learning objectives and various design choices for HAMMER.
We note that all the ablation studies in this section are conducted on ActivityNet Captions using the I3D
features.
3.4.3.1 Learning Objectives
As mentioned above, HAMMER is optimized with three objectives jointly, namely video retrieval (VR), temporal localization (TL), and masked language modeling (MLM). We study the contribution of the different objectives in Table 3.4. It is worth noting that here we differentiate between the MLM objective applied to the frame encoder (denoted as FM) and to the clip encoder (denoted as CM). Firstly, the objectives of VR and TL are complementary to each other, and jointly optimizing the two surpasses the single-task performance on both tasks simultaneously. Secondly, CM and FM applied individually benefit both the VR and TL tasks, and using them in unison results in the best performance. This verifies the effectiveness of the MLM objective in improving the text representation. Finally, the best performance is achieved by combining all the objectives, confirming their complementary nature.

Table 3.5: Ablation study on task weights (VR=Video Retrieval, TL=Temporal Localization, MLM=Masked Language Model)

                          IoU=0.5                 IoU=0.7
λ_VR  λ_TL   λ_MLM        R1    R10    R100       R1    R10   R100
1.0   0.1    0.1          1.65  9.18   20.81      0.87  4.88  10.50
1.0   0.5    0.1          2.15  10.75  23.41      1.10  5.68  12.16
1.0   1.0    0.1          2.02  10.95  24.74      1.10  6.07  13.12
1.0   5.0    0.1          2.94  14.49  32.49      1.74  8.75  19.08
1.0   10.0   0.1          2.35  14.25  31.84      1.42  8.53  18.76
Weights of Different Objectives  We also conduct detailed experiments to investigate the influence of the different objectives' weights (λ_VR, λ_TL, and λ_MLM). Table 3.5 shows that it is important to balance the weights between VR and TL. The best performance is achieved when the weight ratio of VR to TL is set to 1:5. For MLM, we find that the best loss weight is 0.1, and we thus use this value throughout all our experiments.
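To make the weighting concrete, the following is a minimal sketch of how the three objectives could be combined with the weights reported above. It assumes PyTorch, and the individual loss values are placeholders for illustration; this is not the actual HAMMER training code.

```python
# A minimal sketch of the weighted multi-task objective, assuming PyTorch and
# the weights reported above (lambda_VR : lambda_TL : lambda_MLM = 1.0 : 5.0 : 0.1).
# The per-objective losses passed in are placeholders for illustration only.
import torch

def total_loss(loss_vr: torch.Tensor,
               loss_tl: torch.Tensor,
               loss_mlm: torch.Tensor,
               lambda_vr: float = 1.0,
               lambda_tl: float = 5.0,
               lambda_mlm: float = 0.1) -> torch.Tensor:
    """Weighted combination of the VR, TL, and MLM objectives."""
    return lambda_vr * loss_vr + lambda_tl * loss_tl + lambda_mlm * loss_mlm

# Example with dummy per-batch loss values:
loss = total_loss(torch.tensor(0.9), torch.tensor(0.4), torch.tensor(2.1))
print(loss)  # tensor(3.1100) = 1.0*0.9 + 5.0*0.4 + 0.1*2.1
```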
3.4.3.2 Evaluating Design Choices of HAMMER
We study the importance of a few design choices in the HAMMER model. Specifically, we evaluate the
following:
• Effect of the cross-modal Transformer layer
• Effect of different clip lengths for the clip representation
• Effect of parameter sharing for frame and clip encoders
• Effect of an additional clip-level position embedding
We present the results and discussion on these experiments in the following paragraphs.
Cross-Modal Transformer is Essential. Both frame and clip encoders contain one layer of cross-modal
(X-modal) Transformer between text query and video inputs. To verify its effectiveness, we compare with
an ablation model without this layer. Table 3.6 shows almost 100% relative improvement in the R1, R10
metrics when using the X-modal Transformer, proving it is essential to the success of HAMMER.
Table 3.6: Ablation study on the cross-modal Transformer

                        IoU=0.5                 IoU=0.7
Model     X-Modal       R1    R10    R100       R1    R10   R100
HAMMER    ✗             1.38  8.89   26.35      0.84  5.08  15.27
HAMMER    ✓             2.94  14.49  32.49      1.74  8.75  19.08
Figure 3.4: Illustration of temporal localization using different hierarchies of HAMMER (clip encoder and frame encoder) as well as the final HAMMER model, compared against the ground truth, for two example queries: "The video ends with a black and green background of the words Polo Tips in green and a green image of a person on a horse holding a stick." and "He walks out the door of the shop and walks down the street."
Table 3.7: Ablation study on different clip lengths

                          IoU=0.5                 IoU=0.7
Model     Clip Length     R1    R10    R100       R1    R10   R100
HAMMER    16              2.70  14.06  31.85      1.63  8.16  18.60
HAMMER    32              2.94  14.49  32.49      1.74  8.75  19.08
HAMMER    64              2.78  14.69  32.08      1.70  9.00  18.71
Optimal Length of the Clip-Level Representation.  Recall that in HAMMER, the frame encoder takes a clip with a fixed number of frames and outputs a clip-level representation. Here, we examine the performance under different clip lengths, summarized in Table 3.7. Overall, we observe that the model's performance is robust to the clip length chosen for the experiments, and 32 is the optimal length for the clip representation (with a maximum video length of 128).
Parameter Sharing between Frame/Clip Encoders.  We also consider whether the frame encoder and the clip encoder in the HAMMER model may share the same set of parameters, as weight sharing could regularize the model capacity and therefore improve generalization. Table 3.8 indicates, however, that untying the encoder weights achieves slightly better performance, potentially thanks to its greater flexibility.

Table 3.8: Ablation study on weight sharing

                            IoU=0.5                 IoU=0.7
Model     Weight Sharing    R1    R10    R100       R1    R10   R100
HAMMER    ✗                 2.94  14.49  32.49      1.74  8.75  19.08
HAMMER    ✓                 2.89  14.17  30.31      1.69  8.05  17.24

Position Embedding for Clip Encoder.  Position embedding is an important model input, as it indicates the temporal boundary of each video frame segment. Since the HAMMER model also has a clip encoder that takes the aggregated "Clip CLS" tokens as input, it is natural to ask whether we need a position encoding for each clip representation. Thus, we compare two models, with and without the additional clip position encoding. Table 3.9 shows that the clip position embedding is indeed important for achieving superior performance.

Table 3.9: Ablation study on clip position embeddings

                            IoU=0.5                 IoU=0.7
Model     Clip Position     R1    R10    R100       R1    R10   R100
HAMMER    ✗                 2.82  14.39  32.01      1.76  8.59  18.63
HAMMER    ✓                 2.94  14.49  32.49      1.74  8.75  19.08
3.4.3.3 Qualitative Visualization
To better understand the behavior of the HAMMER model, we present a couple of examples of temporal localization. Figure 3.4 lists the predicted spans from the frame and clip encoders as well as from the entire HAMMER model. In both examples, the frame encoder of HAMMER makes an incorrect prediction of the temporal timestamps, which is then corrected by the prediction from the clip encoder. Overall, HAMMER makes more accurate predictions with respect to the ground-truth video segment.
3.5 Conclusion
In this chapter, we propose a hierarchical multi-modal encoder (HAMMER) that captures video dynamics at three scales of granularity: frame, clip, and video. By modeling videos hierarchically, HAMMER achieves significantly better performance than the baseline approaches on the moment localization in video corpus task, and further establishes a new state-of-the-art on two challenging datasets, ActivityNet Captions and TVR. Extensive studies verify the effectiveness of the proposed architectures and learning objectives.
Part III
Hierarchical Image Representation via Language Structure Prior
Part III focuses on obtaining hierarchical image representations from parallel language. The paired language naturally reveals the objects in an image and the relations between them. For example, consider an image described by the caption "a cat chasing a mouse" (Fig. 3.5): the text indicates that the image contains two animals, a cat and a mouse, and that the relation between them is that the cat is chasing the mouse. We verify the quality of the hierarchical image representation on the image and text retrieval task, and observe that the hierarchical visual representation model advances the state-of-the-art by a clear margin. In this part, we present two ways to obtain the hierarchical image representation: learning from the denotation graph (DG), and learning from the concept and relation graph (CRG).
Figure 3.5: The paired text reveals the objects in the image and their relations. The text "a gray and white cat is chasing a brown mouse" indicates that the two objects in the image are a cat and a mouse, and that the relation between them is that the cat is chasing the mouse.
In Chapter 4, we propose to learn the hierarchical structure from an image-and-text graph called the denotation graph. Concretely, each node in the denotation graph represents a phrase and its corresponding visual denotations (images). The nodes are organized hierarchically, where the parent nodes carry more abstract meanings than the child nodes. The nodes/phrases are identified by applying constituency parsing to the sentences. The denotation graph was initially proposed by Young et al. for the natural language inference task [186]; this chapter proposes using the graph to learn the hierarchical image representation. Specifically, we design multiple learning objectives to enforce sentence-level alignment between images and text as well as alignment between images and phrases. With the denotation graph and the proposed learning objectives, our method improves image and text retrieval accuracy by a large margin. Furthermore, we observe that learning with the denotation graph brings consistent improvements across different architectures.
Chapter 5 introduces a visual representation learning method that captures multiple levels of semantic information and supports compositional generalization. Following the DG, we propose a concept and relation graph to characterize the relations between phrases and images. With this graph, we propose a novel architecture, COMPOSER, that recursively composes visually grounded phrases into sentences. The proposed approach improves both in-domain image-text retrieval performance and zero-shot cross-dataset transfer performance by a large margin.
Chapter 4
Modeling Images Hierarchically with Denotation Graphs
4.1 Introduction
There has been an abundant amount of aligned visual and language data such as text passages describ-
ing images, narrated videos, subtitles in movies, etc. Thus, learning how to represent visual and language
information when they are semantically related has been a very actively studied topic. There are many vision+language applications: image retrieval with descriptive sentences or captions [14, 13, 68, 186], image captioning [27, 182], visual question answering [9], visual navigation with language instructions [6], visual object localization via short text phrases [135], and others. A recurring theme is to learn the representation
of these two streams of information so that they correspond to each other, highlighting the notion that many
language expressions are visually grounded.
A standard approach is to embed the visual and the language information as points in a (joint) visual-
semantic embedding space [51, 86, 44]. One can then infer whether the visual information is aligned with
the text information by checking how these points are distributed.
How do we embed visual and text information? Earlier approaches focus on embedding each stream
of information independently, using models that are tailored to each modality. For example, for image, the
embedding could be the features at the last fully-connected layer from a deep neural network trained for
classifying the dominant objects in the image. For text, the embedding could be the last hidden outputs from
a recurrent neural network.
Recent approaches, however, have introduced several innovations [116, 99, 31]. The first is to contextu-
alize the embeddings of one modality using information from the other one. This is achieved by using co-
attention or cross-attention (in addition to self-attention) in Transformer layers. The second is to leverage the
power of pre-training [138, 38]: given a large number of parallel corpora of images and their descriptions,
it is beneficial to identify pre-trained embeddings on these data such that they are useful for downstream
vision+language tasks.
Despite such progress, there is a missed opportunity of learning stronger representations from those parallel corpora. As a motivating example, suppose we have two paired examples: one is an image x_1 corresponding to the text s_1 of TWO DOGS SAT IN FRONT OF PORCH, and the other is an image x_2 corresponding to the text s_2 of TWO DOGS RUNNING ON THE GRASS. Existing approaches treat the two pairs independently and compute the embeddings for each pair without acknowledging that both texts share the common phrase s_1 ∩ s_2 = TWO DOGS and that the images have the same visual category of two dogs.

We hypothesize that learning the correspondence between the common phrase s_1 ∩ s_2 and the set of images {x_1, x_2}, though not explicitly annotated in the training data, is beneficial. Enforcing the alignment due to this additionally constructed pair introduces a form of structural constraint: the embeddings of x_1 and x_2 have to convey similar visual information that is congruent to the similar text information in the embeddings of s_1 and s_2.
In this chapter, we validate this hypothesis and show that extracting additional and implied correspon-
dences between the texts and the visual information, then using them for learning leads to better represen-
tation, which results in a stronger performance in downstream tasks. The additional alignment information
forms a graph where the edges indicate how visually grounded concepts can be instantiated at both abstract
levels (such as TWO DOGS) and specific levels (such as TWO DOGS SAT IN FRONT OF THE PORCH). These
edges and the nodes that represent the concepts at different abstraction levels form a graph, known as deno-
tation graph, previously studied in the NLP community [186, 94, 135] for grounding language expressions
visually.
Our contributions are to propose creating visually-grounded denotation graphs to facilitate representa-
tion learning. Concretely, we apply the technique originally developed for the FLICKR30K dataset [186] also
to COCO dataset [108] to obtain denotation graphs that are grounded in each domain respectively (§ 4.3).
We then show how the denotation graphs can be used to augment training samples for aligning text and
image (§ 4.4). Finally, we show empirically that the representation learned with denotation graphs leads to
stronger performance in downstream tasks (§ 4.5).
Figure 4.1: (Left) A schematic example of a denotation graph showing the hierarchical organization of linguistic expressions (adapted from https://shannon.cs.illinois.edu/DenotationGraph/). (Right) A random sub-graph from the denotation graph extracted from the FLICKR30K dataset, with images attached to concepts at different levels of the hierarchy.
4.2 Related Work
Learning representation for image and text Single-stream methods learn each modality separately and
align them together with a simple fusion model, often an inner product between the two representations.
Frome et al. [51] learns the joint embedding space for images and labels and use the learned embeddings for
zero-shot learning. Kiros et al. [86] uses bi-directional LSTMs to encode sentences and then maps images
and sentences into a joint embedding space for cross-modal retrieval and multi-modal language models.
Li et al. [101] designs a high-level visual reasoning module to contextualize image entity features and
obtain a more powerful image representation. Vendrov et al. [164] improves image retrieval performance
by exploiting the hypernym relations among words. There is a large body of work that has been focusing on
improving the visual or text embedding functions [152, 42, 125, 73, 59].
Another line of work, referred to as cross-stream methods infer fine-grained alignments between local
patterns of visual (i.e., local regions) and linguistic inputs (i.e., words) between a pair of image and text,
then use them to derive the similarity between the image and the text. SCAN [97] uses cross-modal attention
mechanism [182] to discover such latent alignments. Inspired by the success of BERT [38], recent efforts
have conducted visual-linguistic pre-training on large-scale datasets [146], using a powerful sequence model
such as deep Transformers [116, 99, 31, 154, 104]. The pre-training strategies of these methods typically
involve many self-supervised learning tasks, including the image-text matching [116], masked language
modeling [38, 116] and masked region modeling [31].
In contrast to those works, we focus on exploiting additional correspondences between image and text that are not explicitly given in many image and text datasets. By analyzing the linguistic structures of the texts in those datasets, we are able to discover more correspondences that can be used for learning representations. We show that the learned representation is more powerful in downstream tasks.
Vision+Language Tasks  There is a large collection of tasks combining vision and language, including image captioning [29, 47, 68, 78, 93], visual QA [9], text-based image verification [155, 70], visual commonsense reasoning [191], and so on. In the context of this chapter, we focus on studying cross-modality retrieval [13, 14, 56, 68, 186, 198], as well as transfer learning on downstream tasks, including compositional attribute-object recognition [74, 121] and referring expressions [36, 81, 90, 122]. Please refer to § 4.5 for an explanation of these tasks.
4.3 Denotation Graph (DG)
Visually grounded text expressions denote the images (or videos) they describe. When examined together,
these expressions reveal structural relations that do not manifest when each expression is studied in isolation.
In particular, through linguistic analysis, these expressions can be grouped and partially ordered and thus
form a relation graph, representing how (visually grounded) concepts are shared among different expres-
sions and how different concepts are related. This insight was explored by [186] and the resulting graph is
referred to as a denotation graph, schematically shown in the top part of Fig. 4.1. In this work, we focus on
constructing denotation graphs from the FLICKR30K and the COCO datasets, where the text expressions are
sentences describing images.
Formally, a denotation graph G is a polytree where a node v_i in the graph corresponds to a pair of a linguistic expression s_i and a set of images X_i = {x_1, x_2, ..., x_{n_i}}. A directed edge e_ij from a node v_i to its child v_j represents a subsumption relation between s_i and s_j. Semantically, s_i is more abstract (generic) than s_j, and the tokens in s_i can be a subset of s_j's. For example, TWO DOGS describes all the images which TWO DOGS ARE RUNNING describes, though less specifically. Note that the subsumption relation is defined on the semantics of these expressions. Thus, the tokens do not have to be exactly matched on their surface forms. For instance, IN FRONT OF PERSON or IN FRONT OF CROWD are also generic concepts that subsume IN FRONT OF A CROWD OF PEOPLE; see the right-hand side of Fig. 4.1 for another example.

More formally, the set of images that correspond to v_i is the union of all the images corresponding to v_i's children ch(v_i): X_i = ∪_{v_j ∈ ch(v_i)} X_j. We also use pa(v_j) to denote the set of v_j's parents.
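As a concrete (hypothetical) illustration of this definition, the sketch below shows one way such a graph node could be represented in code, with the image denotation of an internal node computed as the union over its children. The class and field names are assumptions chosen for illustration, not the dissertation's implementation.

```python
# A hypothetical sketch of a denotation-graph node: X_i of an internal node is
# the union of its children's denotations, mirroring the definition above.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class DGNode:
    expression: str                                 # linguistic expression s_i
    children: List["DGNode"] = field(default_factory=list)   # ch(v_i)
    parents: List["DGNode"] = field(default_factory=list)    # pa(v_i)
    images: Set[str] = field(default_factory=set)   # set directly only on leaves

    def denotation(self) -> Set[str]:
        """X_i: the node's own images plus the union over its children."""
        result = set(self.images)
        for child in self.children:
            result |= child.denotation()
        return result

leaf1 = DGNode("two dogs sat in front of porch", images={"img_1"})
leaf2 = DGNode("two dogs running on the grass", images={"img_2"})
parent = DGNode("two dogs", children=[leaf1, leaf2])
print(parent.denotation())  # {'img_1', 'img_2'}
```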
Table 4.1: Key statistics of the two DGs: averaged over all nodes in the graph, internal nodes, and leaf nodes (formatted as all/internal/leaf)

Dataset            DG-FLICKR30K       DG-COCO
# of edges         1.94M              4.57M
# of nodes         597K/452K/145K     1.41M/841K/566K
# of tokens/node   6.78/4.45/14.04    5.88/4.07/8.58
# of images/node   4.46/5.57/1.00     5.06/7.79/1.00
Denotation graphs (DG) can be seen as a hierarchical organization of semantic knowledge among con-
cepts and their visual groundings. In this sense, they generalize the tree-structured object hierarchies that
have been often used in computer vision. The nodes in the DG are composite phrases that are semantically
richer than object names and the relationship among them is also richer.
Constructing DG  We used the publicly available tool (https://github.com/aylai/DenotationGraph), following Young et al. [186]. For details, please refer to the Suppl. Material and the references therein. Once the graph is constructed, we attach the images to the proper nodes by taking the set union of the images of each node's children, starting from the sentence-level nodes.
DG-FLICKR30K and DG-COCO  We regenerate a DG on the FLICKR30K dataset [186] and construct a new DG on the COCO [108] dataset. (The original DG, publicly available at https://shannon.cs.illinois.edu/DenotationGraph/, contains 1.75 million nodes, significantly fewer than ours, due to the difference in the version of the NLP toolkit.) The two datasets come from different visual and text domains: the former contains more iconic social media photos, while the latter focuses on photos with complex scenes and more objects. Figure 4.1 shows a random sub-graph of DG-FLICKR30K.
Table 4.1 lists the key statistics of the two DGs. We note that in both graphs, a large number of internal
nodes (more abstract concepts or phrases) are introduced. For such concepts, the linguistic expressions are
much shorter and the number of images they correspond to is also larger.
4.4 Learning with Denotation Graphs
The denotation graphs, as described in the previous section, provide rich structures for learning represen-
tations of text and image. In what follows, we describe three learning objectives, starting from the most obvious one that matches images and their descriptions (§ 4.4.1), followed by learning to discriminate be-
tween general and specialized concepts (§ 4.4.2) and learning to predict concept relatedness (§ 4.4.3). We
perform ablation studies of those objectives in § 4.5.4.
4.4.1 Matching Texts with Images
We suppose the image x and the text s are represented by (a set of) vectors φ(x) and ψ(s), respectively. A common choice for φ(·) is the last layer of a convolutional neural network [62, 178] and for ψ(·) the contextualized word embeddings from a Transformer network [163]. The embedding of the multimodal pair is a vector-valued function over φ(x) and ψ(s):

    v(x, s) = f(φ(x), ψ(s))    (4.1)

There are many choices of f(·,·). The simplest one is to concatenate the two arguments. We can also use the element-wise product between the two if they have the same embedding dimension [86], or complex mappings parameterized by layers of attention networks and convolutions [116, 31]; we experimented with some of them in our empirical studies.
4.4.1.1 Matching Model
We use the following probabilistic model to characterize the joint distribution

    p(x, s) ∝ exp(θ⊤ v(x, s))    (4.2)

where the exponent s(x, s) = θ⊤ v(x, s) is referred to as the matching score. To estimate θ, we use maximum likelihood estimation

    θ* = argmax_θ Σ_{v_i} Σ_k log p(x_ik, s_i)    (4.3)

where x_ik is the k-th element in the set X_i. However, this probability is intractable to compute, as it requires enumerating all possible pairs of (x, s). To approximate it, we use negative sampling.
4.4.1.2 Negative Sampling
For each (randomly selected) positive sample (x_ik, s_i), we explore 4 types of negative examples and assemble them as a negative sample set D⁻_ik:

Visually mismatched pair  We randomly sample an image x⁻ ∉ X_i to pair with s_i, i.e., (x⁻, s_i). Note that we automatically exclude the images from v_i's children.

Semantically mismatched pair  We randomly sample a text s_j ≠ s_i to form the pair (x_ik, s_j). Note that we constrain s_j not to include concepts that could be more abstract than s_i, as the more abstract ones can certainly be used to describe the specific images x_ik.

Semantically hard pair  We randomly sample a text s_j that corresponds to an image x_j that is visually similar to x_ik to form (x_ik, s_j). See [116] for details.

DG Hard Negatives  We randomly sample a sibling (but not cousin) node v_j of v_i such that x_ik ∉ X_j to form (x_ik, s_j).

Note that the last 3 types of pairs have increasing degrees of semantic confusability. In particular, the 4th type of negative sampling is only possible with the help of a denotation graph. In that type of negative sample, s_j is semantically very close to s_i (by construction) yet they denote different images. The "semantically hard pair", on the other end, is not as hard as the last type, as s_i and s_j could be very different despite high visual similarity.

With the negative samples, we estimate θ as the minimizer of the following negative log-likelihood

    ℓ_MATCH = − Σ_{v_i} Σ_k log [ exp(s(x_ik, s_i)) / Σ_{(x̂, ŝ) ∼ D_i} exp(s(x̂, ŝ)) ]    (4.4)

where D_i = D⁻_ik ∪ {(x_ik, s_i)} contains both the positive and negative examples.
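The sketch below illustrates Eq. (4.4) as a softmax over one positive pair and its sampled negatives. It assumes PyTorch and that the matching scores s(x, s) have already been computed by the model, so the tensor shapes and values are illustrative only.

```python
# A minimal sketch of the matching loss in Eq. (4.4), assuming PyTorch.
# `pos_score` is s(x_ik, s_i); `neg_scores` are the scores of the sampled
# negative pairs in D^-_ik (visually mismatched, semantically mismatched,
# semantically hard, and DG hard negatives).
import torch
import torch.nn.functional as F

def match_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # Candidate set D_i = {positive} ∪ negatives; the positive is at index 0.
    logits = torch.cat([pos_score.view(1), neg_scores.view(-1)]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)
    # Cross-entropy = negative log of the softmax probability of the positive pair.
    return F.cross_entropy(logits, target)

loss = match_loss(torch.tensor(2.3), torch.tensor([0.4, 1.1, 1.9, 2.0]))
print(loss)
```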
4.4.2 Learning to Be More Specific
The hierarchy in the denotation graph introduces an opportunity for learning image and text representations that are sensitive to fine-grained distinctions. Concretely, consider a parent node v_i with an edge to a child node v_j. While the description s_j matches any image in its children nodes, the parent node's description s_i at a higher level is more abstract. For example, the concepts INSTRUMENT and PLAY PERCUSSION INSTRUMENT in Fig. 4.1 are a pair of examples showing that the latter more accurately describes the image(s) at the lower level.

To incorporate this modeling notion, we introduce

    ℓ_SPEC = Σ_{e_ij} Σ_k [s(x_jk, s_i) − s(x_jk, s_j)]_+    (4.5)

as a specificity loss, where [h]_+ = max(0, h) denotes the hinge loss. The loss is to be minimized such that the matching score for the less specific description s_i is smaller than that for the more specific description s_j.
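A minimal sketch of the specificity loss in Eq. (4.5), assuming PyTorch: for each parent/child edge, the score of the child's image with the more abstract (parent) description should be lower than with the more specific (child) description. The inputs are assumed to be precomputed matching scores.

```python
# A minimal sketch of the specificity loss in Eq. (4.5), assuming PyTorch.
# `score_abstract` holds s(x_jk, s_i) with the parent (more abstract) description;
# `score_specific` holds s(x_jk, s_j) with the child (more specific) description.
import torch

def spec_loss(score_abstract: torch.Tensor, score_specific: torch.Tensor) -> torch.Tensor:
    # [h]_+ = max(0, h): penalize whenever the abstract description scores higher.
    return torch.clamp(score_abstract - score_specific, min=0).sum()

print(spec_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.9, 1.5])))  # tensor(0.3000)
```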
4.4.3 Learning to Predict Structures
Given the graph structure of the denotation graph, we can also improve the accuracy of image and text representation by modeling high-order relationships. Specifically, for a pair of nodes v_i and v_j, we want to predict whether there is an edge from v_i to v_j, based on each node's corresponding embedding of a pair of image and text. Concretely, this is achieved by minimizing the following negated likelihood

    ℓ_EDGE = − Σ_{e_ij} Σ_{k,k′} log p(e_ij = 1 | v(x_ik, s_i), v(x_jk′, s_j))    (4.6)

We use a multi-layer perceptron with a binary output to parameterize the log-probability.
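A minimal sketch of the edge predictor behind Eq. (4.6), assuming PyTorch: a small MLP takes the multimodal embeddings of a candidate parent/child pair and outputs an edge logit, trained with binary cross-entropy. Layer sizes are illustrative assumptions, not the configuration used in the dissertation.

```python
# A minimal sketch of the edge classifier behind Eq. (4.6), assuming PyTorch.
# It takes the multimodal embeddings v(x_ik, s_i) and v(x_jk', s_j) of a
# candidate parent/child pair and predicts whether the edge e_ij exists.
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, v_parent: torch.Tensor, v_child: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([v_parent, v_child], dim=-1))  # edge logit

dim = 32
predictor = EdgePredictor(dim)
logit = predictor(torch.randn(4, dim), torch.randn(4, dim))
# l_EDGE is a negated log-likelihood, i.e., binary cross-entropy on the logits:
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(4, 1))
print(loss)
```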
4.4.4 The Final Learning Objective
We combine the above loss functions as the final learning objective for learning on the DG

    ℓ_DG = ℓ_MATCH + λ_1 · ℓ_SPEC + λ_2 · ℓ_EDGE    (4.7)

where λ_1 and λ_2 are the hyper-parameters that trade off the different losses. Setting them to 1.0 seems to work well. The performance under different λ_1 and λ_2 is reported in the Suppl. Material. We study how each component could affect the learning of representation in § 4.5.4.
4.5 Experiments
We examine the effectiveness of using denotation graphs to learn image and text representations. We first
describe the experimental setup and key implementation details (§ 4.5.1). We then describe key image-text
matching results in § 4.5.2, followed by studies about the transfer capability of our learned representation
(§ 4.5.3). Next, we present ablation studies over different components of our model (§ 4.5.4). Finally, we
validate how well abstract concepts can be used to retrieve images, using our model (§ 4.5.5).
4.5.1 Experimental Setup
We list major details in the following to provide context, with the full details documented in the Suppl.
Material for reproducibility.
Embeddings and Matching Models Our aim is to show denotation graphs improve state-of-the-art meth-
ods. To this end, we experiment with two recently proposed state-of-the-art approaches and their variants
for learning from multi-modal data: ViLBERT [116] and UNITER [31]. The architecture diagrams and the
implementation details are in the Suppl. Material, with key elements summarized in the following.
Both the approaches start with an image encoder, which obtains a set of embeddings of image patches,
and a text encoder which obtains a sequence of word (or word-piece) embeddings. For ViLBERT, text tokens
are processed with Transformer layers and fused with the image information with 6 layers of co-attention
Transformers. The output of each stream is then element-wise multiplied to give the fused embedding of
both streams. For UNITER, both streams are fed into 12 Transformer layers with cross-modal attention. A
special token CLS is used, and its embedding is regarded as the fused embedding of both streams.
For ablation studies, we use a smaller ViLBERT for rapid experimentation: ViLBERT (Reduced) where
there are 3 Transformer layers and 2 co-attention Transformers for the text stream, and 1 Transformer layer
for the image stream.
Constructing Denotation Graphs As described in §4.3, we construct denotation graphs DG-FLICKR30K
and DG-COCO from the FLICKR30K [186] and the COCO [108] datasets. FLICKR30K was originally de-
veloped for the tasks of image-based and text-based retrieval. It contains 29,000 images for training, 1,000
images for validation, and 1,000 images for testing. COCO is a significantly larger dataset, developed for the
image captioning task. It contains 565,515 sentences with 113,103 images. We evaluate on both the 1,000
images testing split and the 5,000 images testing split (in the Suppl. Material), following the setup in [78].
Key characteristics for the two DGs are reported in Table 4.1.
Evaluation Tasks  We evaluate the learned representations on three common vision+language tasks. In
text-based image retrieval, we evaluate two settings: the text is either a sentence or a phrase from the test
corpus. In the former setting, the sentence is a leaf node on the denotation graph, and in the latter case,
the phrase is an inner node on the denotation graph, representing more general concepts. We evaluate the
FLICKR30K and the COCO datasets, respectively. The main evaluation metrics we use are precisions at
recall R@M where M = 1,5 or 10 and RSUM which is the sum of the 3 precisions [176]. Conversely, we
also evaluate using the task of image-based text retrieval to retrieve the right descriptive text for an image.
In addition to the above cross-modal retrieval, we also consider two downstream evaluation tasks, i.e.,
Referring Expression and Compositional Attribute-Object Recognition. (1) Referring Expression is
a task where the goal is to localize the corresponding object in the image given an expression [81]. We
evaluate on the dataset REFCOCO+, which contains 141,564 expressions with 19,992 images. We follow
the previously established protocol to evaluate on the validation split, the TestA split, and the TestB split.
We are primarily interested in zero-shot/few-shot learning performance. (2) Compositional Attribute-Object
Recognition is a task that requires a model to learn from images of SEEN (attribute, object) label pairs, such
that it can generalize to recognize images of UNSEEN (attribute, object) label pairs. We evaluate this task
on the MIT-STATE dataset [74], following the protocol by [121]. The training split contains 34,562 images
from 1,262 SEEN labels, and the test split contains 19,191 images from 700 UNSEEN labels. We report the
Top-1, 2, 3 accuracies on the UNSEEN test set as evaluation metrics.
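As a quick reference for the retrieval metrics used throughout this section, the sketch below computes R@M and RSUM from the rank of the ground-truth item for each query. It is a generic illustration written in Python, not the evaluation code used in the dissertation.

```python
# A generic sketch of the retrieval metrics used in this section: R@M is the
# percentage of queries whose ground-truth item is ranked within the top M,
# and RSUM is the sum of R@1, R@5, and R@10. Written for illustration only.
from typing import List

def recall_at(ranks: List[int], m: int) -> float:
    """`ranks` holds the 1-based rank of the ground truth for each query."""
    return 100.0 * sum(r <= m for r in ranks) / len(ranks)

def rsum(ranks: List[int]) -> float:
    return sum(recall_at(ranks, m) for m in (1, 5, 10))

ranks = [1, 3, 12, 2, 7]      # hypothetical ranks for five queries
print(recall_at(ranks, 1))    # 20.0
print(rsum(ranks))            # 20.0 + 60.0 + 80.0 = 160.0
```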
Training Details  Both the ViLBERT and UNITER models are pre-trained on the Conceptual Captions dataset [146], and the pre-trained models are released publicly. (The UNITER [31] model performs an additional online hard-negative mining during the training of image-text matching to improve its results; we did not use it, as it is computationally very costly.) On DG-FLICKR30K, ViLBERT and UNITER are trained with a minibatch size of 64; ViLBERT is trained for 17 epochs and UNITER for 15 epochs, with a learning rate of 0.00004. On DG-COCO, ViLBERT is trained for 17 epochs and UNITER for 15 epochs with a minibatch size of 64 and a learning rate of 0.00004. The hyperparameters in Eq. (4.7) are set to 1.0, unless specified otherwise (see the Suppl. Material).
Table 4.2: Text-based Image Retrieval (Higher is better)
Method R@1 R@5 R@10 RSUM
FLICKR30K
ViLBERT 59.1 85.7 92.0 236.7
ViLBERT + DG 63.8 87.3 92.2 243.3
UNITER 62.9 87.2 92.7 242.8
UNITER + DG 66.4 88.2 92.2 246.8
COCO 1K Test Split
ViLBERT 62.3 89.5 95.0 246.8
ViLBERT + DG 65.9 91.4 95.5 252.7
UNITER 60.7 88.0 93.8 242.5
UNITER + DG 62.7 88.8 94.4 245.9
COCO 5K Test Split
ViLBERT 38.6 68.2 79.0 185.7
ViLBERT + DG 41.8 71.5 81.5 194.8
UNITER 37.8 67.3 78.0 183.1
UNITER + DG 39.1 68.0 78.3 185.4
4.5.2 Main Results
Table 4.2 and Table 4.3 report the performance on cross-modal retrieval. On both datasets, models trained with denotation graphs considerably outperform the corresponding models trained without them.
For the image-based text retrieval task, ViLBERT and UNITER on FLICKR30K suffer a small drop in R@10 when DG is used. On the same task, UNITER on the COCO 5K Test Split decreases more when DG is used. However, note that on both splits of COCO, ViLBERT is a noticeably stronger model, and using DG improves its performance.
4.5.3 Zero/Few-Shot and Transfer Learning
Transfer across Datasets Table 4.4 illustrates that the learned representations assisted by the DG have
better transferability when applied to another dataset (TARGET DOMAIN) that is different from the SOURCE
DOMAIN dataset which the DG is based on. Note that the representations are not fine-tuned on the TARGET
DOMAIN. The improvement on the direction COCO →FLICKR30K is stronger than the reverse one, pre-
sumably because the COCO dataset is bigger than FLICKR30K. (R@5 and R@10 are reported in the Suppl.
Material.)
Table 4.3: Image-based Text Retrieval (Higher is better)
Method R@1 R@5 R@10 RSUM
FLICKR30K
ViLBERT 76.8 93.7 97.6 268.1
ViLBERT + DG 77.0 93.0 95.0 265.0
UNITER 78.3 93.3 96.5 268.1
UNITER + DG 78.2 93.0 95.9 267.1
COCO 1K Test Split
ViLBERT 77.0 94.1 97.2 268.3
ViLBERT + DG 79.0 96.2 98.6 273.8
UNITER 74.4 93.9 97.1 265.4
UNITER + DG 77.7 95.0 97.5 270.2
COCO 5K Test Split
ViLBERT 53.5 79.7 87.9 221.1
ViLBERT + DG 57.5 84.0 90.1 232.2
UNITER 52.8 79.7 87.8 220.3
UNITER + DG 51.4 78.7 87.0 217.1
Table 4.4: Image Retrieval via Text (Transfer Learning)
SOURCE FLICKR→COCO COCO→FLICKR
→TARGET R@1 RSUM R@1 RSUM
ViLBERT 43.5 199.5 49.0 209.0
+ SOURCE DG 44.9 200.5 52.8 218.2
Zero/Few-shot Learning for Referring Expression  We evaluate our model on the task of referring expression, a supervised learning task, in the settings of zero-shot and few-shot transfer learning. In zero-shot learning, we do not fine-tune the model on the referring expression dataset (i.e., REFCOCO+).
Instead, we perform a "counterfactual" inference, where we measure the drop in the compatibility score (between a text describing the referred object and the image of all candidate regions) as we remove individual candidate regions. The region that causes the biggest drop in compatibility score is selected; as a result, the selected region is the one most likely to correspond to the description. In the setting of few-shot learning, we fine-tune our COCO-pre-trained model on the referring expression task in an end-to-end fashion on the referring expression dataset (i.e., REFCOCO+).
The results in Table 4.5 suggest that when the amount of labeled data is limited, training with DG
performs better than training without. When the amount of data is sufficient for end-to-end training, the
advantage of training with DG diminishes.
Table 4.5: Zero/Few-shot Learning for Referring Expression (Reported in R@1 on validation, TestA and
TestB data)
Setting→ 0% (Zero-shot) 25% 50% 100%
Method Val TestA TestB Val TestA TestB Val TestA TestB Val TestA TestB
ViLBERT 35.7 41.8 29.5 67.2 74.0 57.1 68.8 75.6 59.4 71.0 76.8 61.1
ViLBERT + DG-COCO 36.1 43.3 29.6 67.4 74.5 57.3 69.3 76.6 59.3 71.0 77.0 60.8
Table 4.6: Image Recognition on UNSEEN Attribute-Object Pairs on the MIT-STATE Dataset
Method Top-1 Top-2 Top-3
VisProd [121] 13.6 16.1 20.6
RedWine [121] 12.1 21.2 27.6
SymNet [107] 19.9 28.2 33.8
ViLBERT pre-trained on
N/A 16.2 26.3 33.3
COCO 17.9 28.8 36.2
DG-COCO 19.4 30.4 37.6
Compositional Attribute-Object Recognition  We evaluate our model on supervised compositional attribute-object recognition [121], and report results on recognizing UNSEEN attribute-object labels on the MIT-STATE test data [74]. Specifically, we treat the text of the image labels (i.e., attribute-object pairs as compound phrases) as sentences to fine-tune the ViLBERT models, using the ℓ_MATCH objective. Table 4.6 reports the results (in top-K accuracies) of both prior methods and variants of ViLBERT, which are trained from scratch (N/A), pre-trained on COCO, and pre-trained on DG-COCO, respectively. ViLBERT models pre-trained with parallel pairs of images and texts (i.e., COCO and DG-COCO) improve significantly over the baseline that is trained on MIT-STATE from scratch. The model pre-trained with DG-COCO achieves the best results among the ViLBERT variants. It performs on par with the previous state-of-the-art method in top-1 accuracy and outperforms it in top-2 and top-3 accuracies.
4.5.4 Ablation Studies
The rich structures encoded in the DGs give rise to several components that can be incorporated into learning representations. We study whether they are beneficial to the performance on the downstream task of text-based image retrieval. In the notation of § 4.4, those components are: (1) removing "DG HARD NEGATIVES" from the ℓ_MATCH loss and only using the other 3 types of negative samples (§ 4.4.1); (2) aligning images with more specific text descriptions (§ 4.4.2); (3) predicting the existence of edges between pairs of nodes (§ 4.4.3).
Table 4.7: Ablation Studies of Learning from DG (RSUM on FLICKR30K, text-based image retrieval)

ViLBERT variants →                        Reduced   Full
w/o DG                                    215.4     236.7
w/ DG
  + ℓ_MATCH − DG HARD NEGATIVES           221.5     236.5
  + ℓ_MATCH                               228.4     241.7
  + ℓ_MATCH + ℓ_SPEC                      228.8     242.6
  + ℓ_MATCH + ℓ_SPEC + ℓ_EDGE             231.2     243.3
Table 4.7 shows the results from the ablation studies. We report results on two versions of ViLBERT. In ViLBERT (Reduced), the number of parameters in the model is significantly reduced by making the model less deep, and thus faster for development; instead of being pre-trained, it is trained on the FLICKR30K dataset directly for 15 epochs with a minibatch size of 96 and a learning rate of 4e-5. In ViLBERT (Full), we use the aforementioned settings. We report RSUM on the FLICKR30K dataset for the task of text-based image retrieval.
Firstly, all models with DG perform better than the models without DG. Secondly, the components of DG HARD NEGATIVES, ℓ_SPEC, and ℓ_EDGE contribute positively, and their gains are cumulative.
4.5.5 Image Retrieval from Abstract Concepts
The leaf nodes in a DG correspond to complete sentences describing images. The inner nodes are shorter phrases that describe more abstract concepts and correspond to a broader set of images; refer to Table 4.1 for some key statistics in this aspect.
Fig. 4.2 contrasts how well abstract concepts can be used to retrieve images. The concepts are the language expressions corresponding to the leaf nodes, the nodes one level above the leaves (LEAF-1), or two levels above (LEAF-2) in DG-FLICKR30K. Since abstract concepts tend to correspond to multiple images, we use mean average precision (mAP) to measure the retrieval results. ViLBERT w/ DG outperforms ViLBERT significantly, and the improvement is stronger when the concepts are more abstract.
It is interesting to note that while the ℓ_MATCH used in ViLBERT w/ DG incorporates learning representations that align images at both specific and abstract levels, such learning benefits all levels: the improvement in retrieving at abstract levels does not sacrifice retrieval at specific levels.
Figure 4.2: Image Retrieval using Mid-level Linguistic Expressions on the FLICKR30K Denotation Graph, reported in Mean Average Precision (Mean AP) at the LEAF, LEAF-1, and LEAF-2 levels of the DG hierarchy: ViLBERT attains 70.8, 42.6, and 32.0, while ViLBERT w/ DG attains 74.1, 52.4, and 44.2, respectively.
4.6 Conclusion
Aligned image and text data is rich in semantic correspondence. In this chapter, we show that we can make fuller use of text annotations than treating them as "categorical" labels. Concretely, denotation graphs (DGs) encode structural relations that can be automatically extracted from those texts with linguistic analysis tools. We proposed several ways to incorporate DGs into representation learning and validated the proposed approach on several tasks. For future directions, we plan to investigate other automatic tools for curating more accurate denotation graphs with complex compositions of fine-grained concepts.
Chapter 5
Hierarchical Visual Grounded Concept Composition
5.1 Introduction
Visually grounded text expressions denote the images they describe. These expressions of visual concepts
are naturally organized hierarchically in sub-expressions. The organization reveals structural relations that
do not manifest when the sub-expressions are studied in isolation. For example, the phrase “a soccer ball
in a gift-box” is a compound of two shorter phrases, i.e., “a soccer ball” and “a gift-box”, but carries
the meaning of spatial relationship “something in something” that goes beyond the two shorter phrases
separately. The compositional structure of the grounded expression requires a concept learner to understand
what primitive concepts are visually appearing and how the compound relating multiple primitives modifies
their appearance.
Existing approaches [86, 44, 116, 31, 24] tackle visual grounding via end-to-end learning, which typically learns to align image and text information using neural networks without explicit modeling of the structures. While neural networks have shown strong generalization capabilities on test examples that are i.i.d. with respect to the training distribution [38], they often struggle with out-of-domain examples of novel compositional structures in many tasks, such as Visual Reasoning [76, 11, 134], Semantic Parsing [50, 82], and (Grounded) Command Following [95, 23, 66, 140].
In this work, we investigate how complex concepts, composed of simpler ones, are grounded in images at the level of sentences as well as phrases and tokens. In particular, we investigate whether the structure of how those concepts are composed can be exploited as a modeling prior to improve visual grounding. To this end, we design the Concept & Relation Graph (CRG), which is derived from constituency parse trees. The
resulting CRG is a graph-structured database where concept nodes encode language expressions of concepts and their image denotations (e.g., the set of images corresponding to the concept), and predicate nodes define how a concept is semantically composed of its child concepts. Our graph is related to the denotation graph [186, 195] but differs in two key aspects. First, our graph extracts the concepts without specifically crafted heuristic rules: our graph construction relies on constituency parsing and is thus more scalable than the hand-written rules initially developed for denotation graphs. (The technique of denotation graphs has been developed and evaluated on English corpora, and its multilingual utility may depend on the parsing techniques available for languages other than English.) Moreover, CRG's predicates can explicitly encode richer information than the subsumption relationships implicitly expressed in denotation graphs. An illustrative example of the graph is shown in Figure 5.1.

Figure 5.1: An illustrative example image database organized by the Concept & Relation Graph (CRG).
In addition to CRG, we propose the Concept cOMPOSition transformER (COMPOSER), which leverages the structure of text expressions to recursively encode grounded concept embeddings, from coarse-level ones such as the noun words that refer to objects, to finer-grained ones with multiple levels of composition. The Transformer [163] is used as a building block in our model to encode the predicates and perform grounded
concept composition. We learn COMPOSER using the task of visual-semantic alignment. Unlike tradi-
tional approaches, we perform hierarchical learning of visual-semantic alignment, which aligns the image
to words, phrases, and sentences, and preserves the order of matching confidences.
We conduct experiments on multi-modal matching and show that COMPOSER achieves strong grounding
capability in both sentence-to-image and phrase-to-image retrieval on popular benchmarks. We validate the generalization capability of COMPOSER by designing an evaluation procedure for a more challenging compositional generalization task that uses test examples with maximum compound divergence (MCD) with respect to the training data [147, 82]. Experiments show that COMPOSER is more robust to compositional generalization than other approaches.

Figure 5.2: The overall design of the proposed COMPOSER model.
Our contributions are summarized as follows:
• We study the composition structure of visually grounded concepts and design the Concept & Relation Graph, which reflects such structure.
• We propose the Concept cOMPOSition transformER (COMPOSER), which recursively composes concepts using the child concepts and semantically meaningful rules, leading to strong compositional generalization performance.
• We propose a new evaluation task to assess models' compositional generalization performance on text-to-image matching, and conduct comprehensive experiments to evaluate both baseline models and COMPOSER.
5.2 Concept & Relation Graph
We introduce multi-modal Concept and Relation Graph (CRG), a graph composed of concept and predicate
nodes, which compose visually grounded descriptive phrases and sentences. Figure 5.1 provides an illustra-
tive example. The concepts include the sentences and the intermediate phrases, shown in blue nodes. The
primitives are the leaf nodes (typically noun words) that refer to visual objects, shown in green nodes. The
predicates (in red nodes) are n-ary functions that define the meaning of the concept composition. Their
“signatures” are made of the semantic lexicons, the number of arguments, and the syntactic type of the
arguments. They combine primitives or simpler concepts into more complex ones.
Identifying concepts and relations.  Given pairs of aligned images and sentences, we first parse each sentence into a constituency tree using a state-of-the-art syntactic parser [87]. We use the sentence's constituent tags to identify concepts and their relations. The relations are regarded as n-ary functions with
placeholders denoted with constituency tags. We refer to such functions as predicates. Simpler concepts are
arguments to the predicates, and the return values of the functions are complex concepts. The edges of the
graph represent the relationship between predicates and their arguments. We restrict the type of constituents
that can be concepts and how the predicates can be formed.
A concrete example is as follows: given an input concept “two dogs running on the grass”, the algorithm
extracts the predicate “[NP] running on [NP]” and child concepts “two dogs” and “the grass”. Here
we use syntactic placeholders to replace the concept phrases. Details are in the Suppl. Material. This idea is closely related to semantically augmented parse trees [53], though we focus on visually grounded concepts. With the generated text graph, we then group all images that refer to the same concept to form the image denotation, similar to [186] and [195]. The image denotation is the set of images that contain the
referred concept. For example, the image denotation of the concept “ball” is all the images that have the
visual object category “ball”.
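To make the extraction step concrete, the following is a minimal sketch of how a predicate template and its child concepts could be pulled out of a constituency parse. It assumes an NLTK-style tree and, for brevity, a flattened parse where the argument NPs are immediate children; the placeholder format and the NP-only restriction are illustrative simplifications, not the exact rules used in this work.

```python
# A simplified sketch of predicate extraction from a constituency parse,
# assuming NLTK. The flat bracketed parse and the NP-only argument rule are
# illustrative assumptions.
from nltk import Tree

def extract_predicate(parse: Tree):
    """Replace NP constituents with indexed placeholders.

    Returns the predicate template (a templated sentence with syntactic
    blanks) and the list of child concept phrases that fill the blanks.
    """
    template_tokens, child_concepts = [], []
    for child in parse:  # iterate over the immediate constituents
        if isinstance(child, Tree) and child.label() == "NP":
            child_concepts.append(" ".join(child.leaves()))
            template_tokens.append(f"[NP]{len(child_concepts)}")
        elif isinstance(child, Tree):
            template_tokens.append(" ".join(child.leaves()))
        else:
            template_tokens.append(child)
    return " ".join(template_tokens), child_concepts

# "two dogs running on the grass", with a hand-written flattened parse:
tree = Tree.fromstring("(S (NP two dogs) (VBG running) (IN on) (NP the grass))")
predicate, concepts = extract_predicate(tree)
print(predicate)  # -> "[NP]1 running on [NP]2"
print(concepts)   # -> ['two dogs', 'the grass']
```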
Finding visually grounded concepts.  We take paired images and texts (in this chapter, texts refer to sentences) and convert the texts into derived trees of predicates and primitives. We can then collect all the concepts from these trees (which include the sentence itself) and identify the set of images that corresponds to each concept over the entire dataset. As a result, we associate an image denotation with each concept in the form of words, phrases, and sentences, which creates a multi-modal graph database as in Figure 5.1.
5.3 COMPOSER: Recursive Modeling of the Compositional Structure
The main idea of COMPOSER is to recursively compose primitive concepts into sentences of complex struc-
ture, using composition rules defined by the predicates. Figure 5.2 presents a conceptual diagram of the
high-level idea. Concretely, it first takes the primitive word embeddings as input and performs cross-modal attention to obtain their visually grounded word embeddings. Next, COMPOSER calls the composition procedure to modify or combine primitive or intermediate concepts, according to the description of its
predicates. At the end of this recursive procedure, we obtain the desired sentence concept embedding. In the
rest of this section, we first discuss the notation and backgrounds, then introduce how primitives and predi-
cates are encoded (§ 5.3.1), and present the recursive composition procedures in detail (§ 5.3.2). Finally, we
discuss the learning objectives (§ 5.3.3).
Notation.  We denote a paired image and sentence as (x, y) and the corresponding concepts and predicates for a tree (x, U, E), where U and E correspond to the set of primitives and the set of predicates, respectively. We also denote the set of all concepts from a sentence y as C, where U ̸⊂ C and y ∈ C.
Multi-head attention mechanism.  Multi-Head Attention (MHA) [163] is the building block of our model. It takes three sets of input elements, i.e., the key set K, the query set Q, and the value set V, and performs scaled dot-product attention:

    MHA(K, Q, V) = FFN( Softmax(Q⊤K / √d) · V )

Here, d is the dimension of the elements in K and Q, and FFN is a feed-forward neural network. With different choices of K and V, MHA can be categorized as self-attention (SelfAtt) or cross-attention (CrossAtt), which correspond to the variants with K and V including only single-modality or cross-modality features, respectively.
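The sketch below illustrates the attention operation above for a single head, written in NumPy for clarity. The feed-forward network is reduced to a single linear map, and the dimensions and random inputs are illustrative assumptions; rows are treated as elements, so Q K^T in the code corresponds to Q⊤K in the column convention of the formula.

```python
# A minimal single-head sketch of MHA(K, Q, V) = FFN(Softmax(QK^T / sqrt(d)) V),
# written in NumPy. The "FFN" here is a single linear map for brevity.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(K, Q, V, W_ffn):
    d = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n_q, n_k) attention weights
    return (attn @ V) @ W_ffn              # attended values through the "FFN"

d = 8
queries = np.random.randn(3, d)            # e.g., word embeddings (the query set Q)
keys = values = np.random.randn(5, d)      # e.g., object-centric visual features (cross-attention)
out = mha(keys, queries, values, W_ffn=np.random.randn(d, d))
print(out.shape)  # (3, 8): one contextualized vector per query element
```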
5.3.1 Encoding Primitives and Predicates
Given a paired image and sentence (x, y), we parse the sentence into the tree of primitives and predicates (x, U, E). We represent the image as a set of visual feature vectors {φ}, which are the object-centric features from an object detector [5]. Note that we do not use structural information beyond object proposals/regions. COMPOSER takes the primitives and predicates as input and outputs the visually grounded concept embeddings, with both the primitives and the predicates represented as continuous vectors with different degrees of contextualization.
Figure 5.3: Details of the composition procedure.
Representing primitives with visual context.  The primitive concepts refer to tokens which can be visually grounded, and we represent them as word embeddings contextualized with visual features. As such, we use a one-layer Transformer with the CrossAtt mechanism, by setting K = {w⊤_K · φ}, V = {w⊤_V · φ}, and Q = {w⊤_Q · u}, which essentially uses the word embedding to query the visual features and outputs the grounded primitive embeddings Û = {û}. Note that the output is always a single vector for each primitive, as it is a single word.
Representing predicates as neural templates.  A predicate e is a semantic n-place function that combines multiple concepts into one. We represent it as a template sentence with words and syntactic placeholders, such as "[NP]_1 running on [NP]_2", where those syntactic placeholders denote the positions and types of the arguments. We encode such template sentences via the SelfAtt mechanism, using a multi-layer Predicate Transformer (PT). The output of this model is a contextualized sequence of the words and syntactic placeholders, denoted as ê.
5.3.2 Recursive Concept Composition
With the encoded primitives Û and predicates Ê, COMPOSER then performs multiple recursive composition steps to obtain the grounded concept embedding v(x, y), representing the visual-linguistic embedding of the sentence and the image, as shown in Figure 5.2. To further illustrate this process, we detail the composition function below, as shown in Figure 5.3.
Input concept modulation.  We use a modulator to bind the arguments in the predicate to the input child concepts. Given an encoded predicate ê = {[NP]_1, running, on, [NP]_2, with, [NP]_3} and an input concept c_1 = "a man", the modulator is a neural network that takes the concept embedding c_1 and its corresponding syntactic placeholder [NP]_1 as input and outputs a modulated embedding. This embedding is then reassembled with the embeddings of the non-arguments in the predicate and used in the later stage. For example, the output sequence becomes {Mod([NP]_1, c_1), running, on, Mod([NP]_2, c_2), with, Mod([NP]_3, c_3)} after the modulator has processed each pair of input concept and syntactic placeholder. Various choices of neural networks are available for this modulator, such as a Multi-Layer Perceptron (MLP) or Feature-wise Linear Modulation (FiLM) [133]. COMPOSER uses FiLM for its strong empirical performance.
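A minimal sketch of a FiLM-style modulator is given below, assuming PyTorch. The layer sizes and the choice to predict the scale and shift from the placeholder embedding are illustrative assumptions, not the exact COMPOSER configuration.

```python
# A minimal sketch of a FiLM-style modulator Mod([NP]_i, c_i), assuming PyTorch:
# the placeholder embedding predicts a per-dimension scale (gamma) and shift
# (beta) applied to the child concept embedding.
import torch
import torch.nn as nn

class FiLMModulator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(dim, 2 * dim)

    def forward(self, placeholder_emb: torch.Tensor, concept_emb: torch.Tensor):
        gamma, beta = self.to_gamma_beta(placeholder_emb).chunk(2, dim=-1)
        return gamma * concept_emb + beta  # feature-wise linear modulation

dim = 16
mod = FiLMModulator(dim)
placeholder = torch.randn(1, dim)   # embedding of "[NP]_1"
concept = torch.randn(1, dim)       # embedding of the child concept "a man"
print(mod(placeholder, concept).shape)  # torch.Size([1, 16])
```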
Contextualization with visual context.  After concept modulation, we obtain a sequence of embeddings for the non-argument words of the predicate and the bound child concepts, which is then fed as input to a Composition Transformer (CT) model. This Transformer has multiple layers, with both CrossAtt layers that attend to the object-centric visual features and SelfAtt layers that contextualize between tokens. Refer to the Suppl. Material for the detailed network architecture.
Given that our model is recursive by nature, the computational complexity of CT is proportional to the depth of the tree. We provide a comprehensive study in § 5.4.3 to show the correlation between the parameters/complexity and the model's performance.
5.3.3 Learning COMPOSER with Visual-Semantic Alignments
With the composed grounded concept embedding v(x, y), we use visual-semantic alignment as the primary objective to learn COMPOSER. To this end, we compute the alignment score by learning an additional linear regressor θ:

s(x, y) = θ^⊤ · v(x, y) ∝ p(x, y),

where p(x, y) is the probability that the sentence and the image are a good matching pair. We then learn the sentence-to-image alignment by minimizing the negative log-likelihood (NLL):

ℓ_MATCH = − Σ_i log [ exp(s(x_i, y_i)) / Σ_{(x̂, ŷ) ∼ D_i} exp(s(x̂, ŷ)) ]
with D_i = {(x_i, y_i)} ∪ D_i^−. To properly normalize the probability, it is necessary to sample a set of negative examples to contrast. Thus, we generate D_i^− using the strategy of [116].
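A minimal sketch of ℓ_MATCH for one positive pair and its sampled negatives is shown below, assuming the composed embeddings are already computed; the tensor shapes are illustrative.

```python
# The matching NLL from above: a linear readout of the composed embedding,
# with the positive pair contrasted against sampled negatives.
import torch
import torch.nn.functional as F

def match_loss(theta: torch.Tensor, v_pos: torch.Tensor, v_negs: torch.Tensor) -> torch.Tensor:
    """theta: (dim,) linear regressor.
    v_pos: (dim,) embedding v(x_i, y_i) of the ground-truth pair.
    v_negs: (num_neg, dim) embeddings of sampled negative pairs from D_i^-."""
    s_pos = v_pos @ theta                              # scalar alignment score
    s_negs = v_negs @ theta                            # (num_neg,)
    logits = torch.cat([s_pos.view(1), s_negs])        # positive pair at index 0
    # cross-entropy with target index 0 equals the negative log-likelihood above
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```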
Multi-level visual-semantic alignment (MVSA). Since COMPOSER composes grounded concepts recursively from the primitives, we obtain the embeddings of all the intermediate concepts automatically. Therefore, it is natural to extend the alignment learning objective to all those intermediate concepts. We optimize the triplet hinge loss [86]:

ℓ_MVSA = Σ_i Σ_{c ∈ C_i} [α − s(x_i, c) + s(x_i, c^−)]_+ + [α − s(x_i, c) + s(x_i^−, c)]_+

where [h]_+ = max(0, h) denotes the hinge loss and α is the margin to be tuned. We derive the negative concepts c^− from the negative sentences in D_i^−. We observe that negative concepts at the word/phrase level are noisier than those at the sentence level, because many are common objects present in the positive image, which leads to ambiguity in learning. Therefore, we choose the hinge loss over NLL because it is more robust to label noise [18].
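The per-concept MVSA term can be written compactly as in the sketch below, assuming precomputed alignment scores; the margin value is a placeholder.

```python
# The MVSA triplet hinge term for a single intermediate concept, using
# alignment scores s(image, concept) computed elsewhere.
import torch

def mvsa_hinge(s_pos: torch.Tensor, s_neg_concept: torch.Tensor,
               s_neg_image: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """s_pos = s(x_i, c); s_neg_concept = s(x_i, c^-); s_neg_image = s(x_i^-, c)."""
    zero = torch.zeros_like(s_pos)
    return (torch.maximum(zero, alpha - s_pos + s_neg_concept)
            + torch.maximum(zero, alpha - s_pos + s_neg_image))
```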
Learning to preserve orders in the tree. Finally, we use an order-preserving objective proposed by [195] to ensure that a fine-grained concept (closer to the sentence) produces a more confident alignment score than a coarse-grained concept (closer to the primitives):

ℓ_ORDER = Σ_i Σ_{e_jk} [β − s(x_i, c_j) + s(x_i, c_k)]_+

Here, e_jk represents a predicate connecting c_j and c_k, with c_j being the fine-grained parent concept that is closer to the sentence and c_k being the coarse-grained child concept that is closer to the primitives. β is the margin that controls how strictly the order of the embeddings should be preserved.

The complete learning objective is a weighted combination of the three individual losses defined above, with the loss weights λ_1 = 1 and λ_2 = 1:

ℓ = ℓ_MATCH + λ_1 · ℓ_MVSA + λ_2 · ℓ_ORDER
The details of model optimization and hyper-parameter settings are included in the Suppl. Material.
5.4 Related Work
Generalization in grounded language understanding. Many evaluation methods have been proposed to assess models' generalization capabilities in grounded language understanding. [76] proposes a synthetic dataset, i.e., CLEVR, to evaluate the generalization of visual question answering models to novel objects and attributes. [121] proposes to evaluate the compositional generalization capability of visual models w.r.t. short phrases consisting of attributes and objects. [23] and [66] evaluate RL agents' capability to generalize to novel compositions of shape, size, and color in 3D simulators, and show that RL agents generalize poorly. gSCAN [140] provides a systematic benchmark to assess command following in a grounded environment. In this work, we focus on assessing models' compositional generalization under visual context.
Compositional networks. State-of-the-art visually grounded language learning models typically use deep Transformer models [163] such as ViLBERT [116], LXMERT [158] and UNITER [31]. Though effective on data from an i.i.d. distribution, these models do not explicitly exploit the structure of the language and are thus prone to fail on compositional generalization. In contrast, another thread of works [7, 185, 119, 148, 174] parses the language into an executable program composed as a graph of atomic neural modules, where each module is designed to perform an atomic task and is learned end-to-end. Such models show almost perfect performance on synthetic benchmarks [76] but perform subpar on real-world data [186, 28], which are noisy and highly variable. Unlike them, we propose a compositional neural network based on the Transformer architecture, which extends state-of-the-art neural networks to explicitly exploit language structure.

In this section, we perform experiments to validate the proposed COMPOSER model on the tasks of sentence-to-image retrieval and phrase-to-image retrieval. We begin by introducing the setup in § 5.4.1 and then present the main results in § 5.4.2, comparing models on their in-domain, cross-dataset, and compositional generalization performance. Finally, we perform an analysis and ablation study of our model design in § 5.4.3.
Table 5.1: Statistics of the concepts and predicates in the F30K and C30K datasets.
Dataset # concepts # predicates # primitives Avg height
F30K 408,464 122,196 10,755 3.09
C30K 345,331 88,623 9,683 2.86
5.4.1 Experiment Setup
Datasets. We perform experiments on the COCO-caption (COCO) [28] and Flickr30K (F30K) [186] datasets. Each image in these two datasets is associated with five sentences. Flickr30K contains 31,000 images, and we use the same data split as [44], with 29,000 training images, 1,000 test images, and 1,000 validation images. COCO contains 123,287 images in total. For fast iteration, we use a training subset, COCO30K (C30K), which contains the same number of images as F30K. COCO results are reported on the 1K test split of [78]. We also conducted experiments on the full COCO training split. For both the Flickr30K and COCO test splits, there are 5,000 text queries and 1,000 candidate images to be retrieved. We report recall@1 (R1) and recall@5 (R5) as the primary retrieval metrics.
Compositional generalization evaluation. To generate evaluations of compositional generalization, we use a method similar to that of [147] and [82], which maximizes the compound divergence between the distribution of compounds in the evaluation set and that in the training set. Here, compounds are defined based on the predicates occurring in captions. Following this method, we first calculate the overall divergence of compounds from the evaluation data to the training data using the predicates from all the sentences. Then, for each sentence in the evaluation data, we calculate the compound divergence with this specific example removed. We rank the sentences by the difference in compound divergence. Finally, we choose the top-K sentences with the largest compound divergence differences, together with their corresponding images, to form the evaluation splits.
Using this method, we generate evaluation splits with 1,000 images and 5,000 text queries, COCO-
MCD and F30K-MCD, to assess models trained on F30K and COCO, respectively. Therefore, these splits
assess both compositional generalization and cross-dataset transfer. Defining such splits across datasets is
also helpful to achieve greater compound divergence than is otherwise possible, given the small amount of
available in-domain test data. More details are included in the Suppl. Material.
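A simplified sketch of this split construction is given below: it scores each evaluation sentence by how much removing it changes the compound (predicate) divergence from the training distribution and keeps the top-K. The divergence used here is a simple Bhattacharyya-style quantity, which may differ from the exact weighting of [147, 82], so treat it as illustrative only.

```python
# Illustrative sketch of ranking evaluation sentences by leave-one-out
# compound-divergence differences (simplified divergence, assumed inputs).
from collections import Counter
from math import sqrt
from typing import List

def divergence(p: Counter, q: Counter) -> float:
    p_total, q_total = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 1.0 - sum(sqrt((p[k] / p_total) * (q[k] / q_total))
                     for k in keys if p[k] and q[k])

def top_k_mcd(eval_predicates: List[List[str]], train_counts: Counter, k: int) -> List[int]:
    """eval_predicates[i] lists the predicates (compounds) of evaluation sentence i."""
    eval_counts = Counter(p for preds in eval_predicates for p in preds)
    base = divergence(eval_counts, train_counts)
    deltas = []
    for i, preds in enumerate(eval_predicates):
        reduced = eval_counts - Counter(preds)          # leave this sentence out
        deltas.append((base - divergence(reduced, train_counts), i))
    return [i for _, i in sorted(deltas, reverse=True)[:k]]
```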
Table 5.2: Text-to-Image retrieval results.
(a) Models trained on Flickr30K (F30K)
Eval on F30K COCO COCO-MCD
Method R1 R5 R1 R5 R1 R5
VSE 46.84 77.16 25.60 54.36 21.82 47.58
ViLBERT 50.94 80.86 30.50 58.98 24.44 51.44
COMPOSER 54.02 80.27 33.81 63.19 29.20 57.13
(b) Models trained on COCO30K (C30K)
Eval on COCO F30K F30K-MCD
Method R1 R5 R1 R5 R1 R5
VSE 45.74 81.22 27.66 55.92 23.44 47.90
ViLBERT 48.08 81.10 31.12 58.88 24.02 49.34
COMPOSER 47.87 80.93 34.29 61.00 26.91 51.46
CRG construction. We constructed two CRGs on the F30K and C30K datasets, using the procedure described in § 5.2. The key statistics of the generated graphs are shown in Table 5.1.
Baselines and our approach. We compare COMPOSER to two strong baseline methods, i.e., ViLBERT [116] and VSE [86]. We make sure all models use the same object-centric visual features extracted from the Up-Down object detector [5] for a fair comparison. For the texts, both ViLBERT and the re-implemented VSE use the pre-trained BERT model as initialization. For COMPOSER, we only initialize the Predicate Transformer with the pre-trained BERT, using its first six layers. Note that the ViLBERT results are reproduced using the codebase from its authors. ViLBERT is not pre-trained on any additional image-text pair data, to prevent information leakage in both cross-dataset evaluation and compositional generalization. Therefore, we used the pre-trained BERT models provided by HuggingFace to initialize the text stream of ViLBERT, and then followed the rest of the procedure in the original ViLBERT paper. Please refer to the Suppl. Material for complete details.
5.4.2 Main Results
We compare COMPOSER with ViLBERT [116] and VSE [86] on F30K and COCO for in-domain, zero-shot cross-dataset transfer, and compositional generalization (e.g., F30K→COCO-MCD) evaluation. The notation A→B means that the model is trained on A and evaluated on B. We report the results of sentence-to-image retrieval in this chapter and defer more ablation study results to the Suppl. Material.
Figure 5.4: COMPOSER's results on generalization splits of different compound divergence over text description (evaluated under the F30K→COCO setting).
In-domain performance. Table 5.2 presents the in-domain performance on both the F30K and COCO datasets. First, we observe that both COMPOSER and ViLBERT consistently outperform VSE, which is expected, as ViLBERT contains a cross-modal Transformer with stronger modeling capacity. Compared to ViLBERT, COMPOSER performs on par.
Zero-shot cross-dataset transfer. We also consider zero-shot cross-dataset transfer, where we evaluate models on a dataset different from the training dataset. In this setting, COMPOSER outperforms ViLBERT and VSE significantly. Concretely, in the F30K→COCO setting, COMPOSER improves R1 and R5 by 11.0% and 7.0% relative to ViLBERT. There are 10.0% and 4.2% relative improvements on R1 and R5 in the other transfer direction.
Compositional generalization. On the max compound divergence (MCD) splits, COMPOSER outperforms the baselines by a clear margin for both F30K- and C30K-trained models (shown in Table 5.2). To further characterize the performance on compositional generalization, we create 16 test splits on each dataset with different compound divergence (from 0.15 to 0.31, where 0.31 is the max CD) and present the results in Figure 5.4. As CD increases, the performance of both COMPOSER and ViLBERT decreases. Compared to ViLBERT, we observe that COMPOSER is relatively more robust to this distribution shift, as its relative improvement grows as CD increases.
Table 5.3: Study of different primitive encodings.
F30K→F30K F30K→COCO
CrossAtt? R1 R5 R1 R5
✗ 52.38 79.09 33.33 60.97
✓ 54.02 80.27 33.81 63.19
Table 5.4: Study of different modulators.
F30K→F30K F30K→COCO
Modulation R1 R5 R1 R5
Replace 52.84 79.79 32.63 61.61
MLP 52.92 79.89 33.39 61.41
FiLM 54.02 80.27 33.81 63.19
5.4.3 Analysis and Ablation Study
We perform several ablation studies to analyze COMPOSER, and provide qualitative results to demonstrate
the model’s interpretability.
Is CrossAtt in primitive encoding useful? Table 5.3 compares variants of COMPOSER with and without CrossAtt for primitive encoding, and shows that CrossAtt improves all metrics in in-domain and cross-dataset evaluation.
Which modulator works better? We consider three modulators to combine the input concepts with the syntax token embeddings for later composition, namely Replace, MLP, and FiLM. Replace directly substitutes the input concept embedding for the syntax embedding; this is an inferior approach by design, as it ignores the relative position of each concept. The MLP modulator applies a multi-layer neural network to the concatenated syntax and input concept embeddings. The FiLM modulator uses the syntax embedding to infer the parameters of an affine transformation, which is then applied to the input concept. We show the results in Table 5.4. Replace achieves the worst performance, indicating the importance of identifying the position of input concepts. COMPOSER chooses FiLM as the modulator given its strong performance over all metrics.
Is MVSA supervision useful? We evaluate the influence of multi-level visual-semantic alignment on sentence- and phrase-to-image retrieval. In the phrase-to-image experiments, we sample 5 non-sentence
Table 5.5: Comparison between ViLBERT and COMPOSER on multi-level visual-semantic alignment super-
vision (MVSA). All results are reported in R1.
F30K→F30K F30K→COCO
Method Sentence Phrase Sentence Phrase
ViLBERT 50.94 18.34 30.50 15.00
+ MVSA 48.90 23.55 29.90 18.73
COMPOSER 52.52 21.04 32.87 18.29
+ MVSA 54.02 22.70 33.81 18.81
Figure 5.5: Interpreting the COMPOSER using visual-semantic alignment scores, formatted as [s_GT, s_Negative]. The left figure corresponds to a correctly retrieved example, and the right figure corresponds to an incorrectly retrieved example.
concepts from the CRG for each annotation in the corresponding test data and use them as queries to report results (in R1). Table 5.5 presents the results. With MVSA, COMPOSER outperforms ViLBERT on both sentence- and phrase-based retrieval by a noticeable margin, indicating the advantage of capturing mid-level alignment in our model design. Secondly, MVSA improves both COMPOSER and ViLBERT on phrase-to-image retrieval over their counterparts without MVSA. However, adding MVSA to ViLBERT leads to a degradation in sentence-to-image retrieval, showing that ViLBERT is incapable of mastering visual alignments for both sentences and phrases simultaneously. COMPOSER with MVSA improves on both sentence- and phrase-level retrieval, showing strong multi-granular visual-semantic alignment ability.
Performance vs. complexity trade-off. We compare variants of COMPOSER with different parameter and computation budgets, which use different numbers of layers for the Predicate Transformer (PT) and the Composition Transformer (CT). The results are shown in Table 5.6. First, we keep the size of CT fixed and vary the size of PT: a marginal performance decrease occurs as the number of PT layers goes down. Then we keep the size of PT fixed and decrease the capacity of CT, which leads to a significant performance drop, showing the essential role CT plays. Besides having superior results, COMPOSER has (at least
Table 5.6: Results of COMPOSER variants with different complexity. (PT: Predicate Transformer, CT: Composition Transformer)
F30K→F30K F30K→COCO
PT CT R1 R5 R1 R5 # Param FLOPS
2 5 51.72 79.71 33.67 60.38 129M 35.40G
4 5 53.32 79.73 33.83 61.61 143M 37.11G
6 5 54.02 80.27 33.81 63.19 157M 38.40G
6 3 47.92 76.85 25.74 51.45 136M 29.40G
6 1 34.47 62.61 21.25 43.86 115M 19.98G
ViLBERT 50.94 80.86 30.50 58.98 235M 24.44G
Table 5.7: Performance under different parsing qualities.
F30K→F30K F30K→COCO
Pruning Probability R1 R5 R1 R5
Un-pruned 54.02 80.27 33.81 63.19
Probability=0.1 49.12 76.92 31.12 58.86
Probability=0.3 48.46 76.60 30.40 58.16
Probability=0.5 47.44 76.24 30.38 57.62
33%) fewer parameters than the ViLBERT model, which indicates that a potential performance gain could be achieved with a larger COMPOSER model.
For computational complexity, we observe that the full COMPOSER model is 50% less efficient than a ViLBERT model, due to its recursive nature. Meanwhile, we notice that the number of CT layers contributes a significant amount to the total computation time, as every two additional layers add ∼10G FLOPS.
Performance under different parsing qualities. The CRG is generated based on a constituency parser. We investigate the performance of COMPOSER with CRGs of different parsing qualities. Given a parse tree, we randomly remove its branches with a probability of 0.1, 0.3, or 0.5 to generate a tree with degraded parsing quality, and we evaluate COMPOSER on the resulting CRGs. We summarize the results in Table 5.7. When parsing quality drops, both in-domain and cross-dataset transfer performance drop. The performance degrades by 12% when half of the parse may be missing. We expect that with better parsing quality, COMPOSER can achieve better performance.
Interpreting COMPOSER's decision. Despite its solid performance, COMPOSER is also highly interpretable. Specifically, we visualize its alignment scores along the concept composition procedure in Figure 5.5. Empirically, we observe that most failures are caused by visual grounding mistakes at the primitive-concept level. The errors then propagate "upwards" through the concept composition.
For instance, the left example shows that COMPOSER is confused between the ground-truth and negative images when only the text of the shared visual concept "a bald man" is presented. As more information is given, the confusion is resolved immediately, as the model notices that the target sentence is composed not only of the above subject but also of the prepositional phrase "by the beer pumps at the bar", which reflects the visual environment.
Results of model trained on full COCO dataset. We trained our model on the full COCO training split and evaluated it on both the in-domain and cross-dataset transfer tasks. However, COMPOSER underperforms ViLBERT in this setting, achieving 56.06% and 44.24% in R1 for the in-domain and cross-dataset evaluation tasks, while ViLBERT obtains 56.83% and 46.62% in R1, respectively. We hypothesize that this negative result is largely due to the limited model capacity of the proposed COMPOSER, as it has 78M fewer parameters compared to ViLBERT. Meanwhile, we also observe that COMPOSER performs worse than ViLBERT in fitting the training data. However, further scaling up the COMPOSER model requires more computational resources.
5.5 Conclusion
In this chapter, we propose the concept and relation graph (CRG) to explore the composition structure in visually grounded text data. We further develop a novel concept composition neural network (COMPOSER) on top of the CRG, which leverages the structure to explicitly compose concepts from the coarse-grained primitives to the fine-grained sentences. We conduct extensive experiments to validate our model on large-scale image-text matching benchmarks. Compared with prior methods, COMPOSER achieves significant improvements, particularly in zero-shot cross-dataset transfer and compositional generalization. Despite these highlights, there are many challenges that COMPOSER does not address in the scope of this chapter. First, it requires high-quality parsing results to achieve strong performance, which are not readily available in languages beyond English. Moreover, similar to other recursive neural networks, COMPOSER is computationally demanding, which limits its scalability to large-scale data.
Part IV
Hierarchical Video Representation via Visual Structure
The previous two parts focus on obtaining the hierarchical visual representation from paired text. However, the approaches discussed in those parts cannot handle the pure-vision scenario, where parallel text is not available. In this part, we focus on acquiring the hierarchical video representation from the pure-vision scenario.
The key insight is that the video representation should be capable of handling both appearance (spatial) and motion (temporal) information. As a video contains multiple clips and frames, it is a perfect source for learning the motion representation. In comparison, static frames (images) are ideal for learning appearance information.
With this insight in mind, in chapter 6, we propose to obtain a hierarchical video representation by jointly training on both images and videos. Here, a hierarchical video representation means that the representation should capture both appearance and temporal information. By jointly training on five video datasets and one image dataset, our proposed approach surpasses the state-of-the-art on five challenging benchmarks by a large margin.
Chapter 6
Modeling Video Representation via Co-training on both Image and Video
6.1 Introduction
Action recognition has received significant attention from the research community, as many applications can
benefit from improved action recognition modeling, such as video retrieval [54, 91], video captioning [91],
video QA [103], etc. Datasets are one dimension of improvement. Video datasets have evolved from hun-
dreds of videos in controlled environments [144] to millions of videos crawled from the Internet [124, 80].
In addition to quantity, the scope of videos has also broadened. For example, the topics covered by ac-
tion recognition datasets have evolved from simple body motions like waving and handshaking to more
complex activities present in our daily life. Simultaneously, with the increase of data and class distribu-
tion complexity, modeling architecture complexity has likewise increased [150, 169, 21, 17, 10, 46, 112].
Among these architectures, Transformer based approaches have recently demonstrated state-of-the-art per-
formance on several benchmarks [112, 17, 204, 10, 172]. However, since Transformer models are more
data-hungry [160] and action recognition datasets are relatively small in scale, large Transformer models are
typically first trained on image datasets and later finetuned on the target action recognition dataset.
While the current pre-training and fine-tuning action recognition paradigm is straightforward and man-
ifests strong empirical results, it may be overly restrictive for building general-purpose action-recognition
models. Compared to a dataset like ImageNet [37] that covers a large range of object recognition classes,
action recognition datasets like Kinetics [80] and Something-Something-v2 (SSv2) [58] pertain to limited
topics. For example, Kinetics focuses on actions like “cliff diving” and “ice climbing” while SSv2 contains
information related to object-agnostic activities like "pretending to put something onto something else." As a result, adapting an action recognition model that has been fine-tuned on Something-Something-v2 to a disparate
Table 6.1: Top-1 accuracy comparison between standard training paradigm and COVER using TimeSFormer
pretrained on ImageNet-21k (I21K) with 448× 448× 16 (H× W× T) input setting. COVER co-trained with
multiple image and video datasets achieves better performance on all datasets.
Training Paradigm Standard COVER ∆
Kinetics-400 (K400) 79.7 83.1 +3.4
Kinetics-600 (K600) 81.8 84.5 +2.7
Kinetics-700 (K700) - 74.9 -
Something-Something-v2 (SSv2) 62.2 64.7 +2.5
Moments-in-Time (MiT) - 41.5 -
dataset like Kinetics is likely to result in poor performance. Differences in objects and video backgrounds among datasets further complicate learning a general-purpose action recognition model, and even though video datasets may be increasing in size, prior work [10, 46, 112] suggests significant data augmentation and regularization are necessary to achieve strong performance. This latter finding may indicate that the model quickly overfits to the target dataset, which hinders its capacity to generalize to other action recognition tasks.
In this work, we aim to build a training strategy for a general-purpose action recognition model. Inspired by prior works in vision and language that demonstrate a single Transformer model can be extended to many downstream tasks [154, 116], we propose to leverage both image and video data to jointly train a single action recognition model. This approach is buttressed by two main findings. First, disparate video datasets cover a diverse set of activities, and training on them together in a single model could lead to a model that excels in a wide range of activities. Second, video is a perfect source for learning motion information, while images are great for exploiting appearance structure. Compared to ImageNet, action recognition datasets have relatively little data. Combined with the high spatial redundancy among frames in a clip, this suggests that leveraging a diverse distribution of image examples may be beneficial for building robust spatial representations in video models.
With this background as motivation, we suggest a new training scheme: co-training video and image for action recognition (COVER). Similar to the typical pre-training and fine-tuning paradigm, COVER first pre-trains the model on an image dataset, but during fine-tuning, it simultaneously trains a single model on multiple action recognition and image datasets to build robust spatial and temporal representations of video data. More concretely, COVER adopts a multi-task learning setup to train multiple datasets within one model, with the total loss for a given batch equal to the weighted sum of the losses across each dataset in that batch. Our empirical findings suggest this approach is competitive across several action recognition
Figure 6.1: Comparison between the proposed COVER and the standard training paradigm. The difference is that COVER is co-trained on multiple image and video datasets during finetuning, while the standard training paradigm finetunes on only one dataset.
benchmarks. Moreover, unlike the current paradigm of pre-train once and fine-tune on each downstream ac-
tion recognition benchmark, COVER learns generalizable representations which can be used on downstream
action recognition tasks without any additional finetuning. Our empirical findings in Table 6.1 indicate this
simple approach leads to improved performance on Kinetics, Something-Something-v2, and Moments-in-
Time when compared with the typical pre-train and fine-tune paradigm.
Our contributions are three-fold:
• We analyze simultaneous training of image and video data for the purpose of modeling action recog-
nition data.
• We propose COVER, an approach that learns robust spatial and temporal representations via simulta-
neous learning across multiple action recognition and image datasets. The learned representations can
be immediately applied to several downstream tasks to perform competitively with prior pre-training
and fine-tuning paradigms.
• COVER establishes new state-of-the-art results on multiple datasets.
6.2 Related Work
Action recognition modeling persists as a challenging problem in the field of computer vision. Recent work in this domain largely focuses on two dimensions to improve performance: modeling changes and training paradigms.
6.2.1 Video Transformers
Recently, inspired by the Visual Transformer [40] and BERT [38] in the domain of Natural Language Processing, action recognition modeling has begun to adopt transformer-based architectures such as TimeSFormer [17], ViViT [10], and Multiscale Vision Transformer [46]. In this class of models, TimeSFormer and ViViT directly extend the Visual Transformer into the video domain by adding temporal attention layers. Based on this architecture, TimeSFormer and ViViT find that large-scale image data pretraining is beneficial, and with this training policy, they can surpass the performance of ConvNet-based architectures [21, 150, 161]. Further exploring training policies, Multiscale Vision Transformer finds that, with carefully tuned parameters, strong performance can be achieved without pre-training on a large-scale image dataset. A separate approach, Video Swin Transformer [112], inspired by the Swin Transformer [111], adopts 3D shifted windows to learn spatially and temporally aware video representations. While modeling approaches have undoubtedly driven much of the recent advancement of action recognition, we instead direct our focus to these models' training policy and to how training on varying data distributions may yield general-purpose action recognition models.
6.2.2 Training Paradigms
We define training policy as the technique(s) used to train a model. In this sense, and for the domain of action
recognition, Two-Stream ConvNet [150] and I3D [21] were the first approaches to leverage image data to
improve video modeling. More concretely, both models leveraged the feature extractors learned on image
data during “pre-training” and would later “fine-tune” on a downstream video benchmark. The concept
of using image data to learn spatial relationships which can be transferred to video understanding has also
been applied to recent transformer-based approaches [17, 10, 112, 12]. For instance, OmniSource [41] extends 3D ConvNets by pre-training on both image and video datasets and then fine-tuning on a target dataset. Similarly, UniDual is also tailored to ConvNets and proposes fine-tuning on one image and one video dataset to improve video modeling performance. Frozen-in-Time [41] uses image-text and video-text data to pre-train a spatial-temporal transformer, and then fine-tunes on a target video-text retrieval dataset. Unlike OmniSource and UniDual, COVER is tailored to transformer-based architectures. More specifically, OmniSource and UniDual require an additional layer to process image data, whereas COVER directly processes video and image data without any modification to the model architecture. Moreover, unlike OmniSource and Frozen-in-Time, which fine-tune the learned weights on each downstream dataset during evaluation, our approach simplifies this process by building representations that are generalizable across multiple datasets. UniDual fine-tunes the model on a single image and a single video dataset; this differs from COVER, which explores learning multiple video datasets simultaneously. Finally, our empirical results suggest COVER outperforms OmniSource and UniDual on several action recognition datasets.
6.3 Method
To motivate our suggested change, we first describe the typical spatio-temporal transformer framework,
paying special attention to improving generalization by pre-training on image data and fine-tuning on video
data. Next, we build on this analysis to present co-training image and video for action recognition (COVER).
COVER changes the typical training paradigm by leveraging multiple image and video datasets to train a
spatio-temporal transformer model.
6.3.1 Video Transformer for Action Recognition
The action recognition framework is composed of two components, an action recognition model f and a
training policy. In this chapter, we describe a simple spatio-temporal attention-factorized Transformer (i.e., TimeSFormer [17]) as a prototypical action recognition model.
6.3.1.1 TimeSFormer
TimeSFormer is an extension of the Visual Transformer [40]. Similar to a ViT model, TimeSFormer can be reduced to a sequence of self-attention blocks; however, rather than using a single spatial self-attention mechanism, TimeSFormer augments the self-attention block with a temporal attention mechanism. Analogous to the Transformer model introduced for natural language processing, both spatial and temporal attention mechanisms are formulated as Multi-Head Attention (MHA). MHA takes three sets of input elements, i.e., the key set K, the query set Q, and the value set V, and performs scaled dot-product attention as:
MHA(K, Q, V) = FFN(Softmax(Q^⊤ K / √d) · V)
Here, FFN is a feed-forward neural network and d is the dimension of K and Q. With different choices of K and V, MHA can be categorized as either spatial attention or temporal attention. Spatial attention corresponds to K and V sampled from the same frame, and temporal attention corresponds to K and V sampled across the frames of a video clip. Each TimeSFormer block contains one layer of temporal attention and one layer of spatial attention.
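A single-head sketch of this divided space-time attention is shown below; the projections, residual connections, multiple heads, and FFN of the real TimeSFormer block are omitted, and the tensor layout is an assumption for illustration.

```python
# A simplified single-head view of temporal-then-spatial attention over
# patch embeddings of shape (frames, patches, dim).
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def divided_space_time_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (T, S, D) tokens for T frames and S patches per frame."""
    t, s, d = x.shape
    # temporal attention: each spatial location attends across frames
    xt = x.permute(1, 0, 2)                          # (S, T, D)
    xt = attention(xt, xt, xt).permute(1, 0, 2)      # back to (T, S, D)
    # spatial attention: each frame attends across its own patches
    return attention(xt, xt, xt)                     # (T, S, D)
```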
TimeSFormer takes n video frames x_video as input. The frames are first uniformly cropped into a sequence of image patches of size (n, s, h, w), where s is the number of image patches within one frame, and h and w represent the spatial resolution. The image patches are then fed into L TimeSFormer blocks through multiple spatial attention and temporal attention layers. An affine transformation is then applied to the learned representation t_video = f(x_video) to attain a probability distribution across all label classes, c_video = MLP(t_video).
In this chapter, we describe TimeSFormer using θ_s and θ_t, where θ_s represents the parameters within the spatial attention layers and θ_t represents the parameters within the temporal attention layers. We denote the classification layer as θ_MLP.
6.3.1.2 Standard Training Paradigm
Due to the large number of parameters and the limited size of video datasets, the standard training policy follows a classical pretraining and finetuning approach. The model is first pretrained on a large object recognition image dataset D_image and then finetuned on the target downstream video dataset D_video. Specifically, during the pre-training stage, the temporal attention layers are all removed. Only the parameters of the spatial attention layers θ_s are optimized, by minimizing the training loss

θ_s = argmin_{θ_s} ℓ({y_image}, {c_image})   (6.1)

where {c_image} = MLP(f({x_image}; θ_s)) is the image classification probability, {x_image} and {y_image} denote the images and labels in a mini-batch randomly sampled from the image dataset D_image, and ℓ is the cross-entropy loss function.
After pre-training, both the spatial attention layers and the temporal attention layers are finetuned on the target video dataset:

(θ_s, θ_t) = argmin_{θ_s, θ_t} ℓ({y_video}, {c_video})
Table 6.2: Performance of TimeSFormer trained using 224× 224 image resolution and evaluated on normal frame order and reversed frame order. Reversed frame order means the order of frames is reversed at test time.
Frame order Normal Reversed ∆
K400 78.1 78.0 -0.1
SSv2 58.8 22.6 -36.2
where {c_video} = MLP(f({x_video}; θ_s, θ_t)) is the video classification probability and (x_video, y_video) are sampled from the video dataset D_video.
6.3.2 Action Recognition Analysis
A robust learned representation of video data should be descriptive in both the spatial and temporal dimensions. Our empirical findings suggest the typical pre-training and fine-tuning paradigm may limit the
model’s capacity to construct generalizable representations by fine-tuning on a single, and relatively small,
downstream action recognition task.
Expanding on this last point, it is likely each video dataset incorporates a dataset-specific bias. Applying
this hypothesis to two popular video datasets, K400 and SSv2, we find SSv2 focuses on object interaction
and relies on complex temporal reasoning to achieve strong performance [10]. On the other hand, K400
focuses on interactions among humans and objects. Given the strong performance of non-temporal action
recognition models on this dataset, complex temporal reasoning may be significantly less important than
learning robust representations comprised of spatial information.
To further analyze the inherent dataset bias hypothesis, we reverse the order of frames in a clip at test time. We follow the standard paradigm by training a TimeSFormer model [17] using a 224× 224 image
resolution with 8 frames. We report our findings with both the normal and reversed ordering of frames along
the temporal dimension in Table 6.2. Our findings indicate the model may learn strong temporal-based
representations when training on SSv2, while temporal information appears less important for K400. For
example, the difference in test accuracy on K400 is small at -0.1%, but the difference in test accuracy on
SSv2 is -36.2%.
Another facet of our analysis relates to the limited size and scope of publicly available video datasets.
Given the difference in focus of each dataset, the representations learned on one dataset distribution may
not generalize to that of other datasets. To analyze this dimension, we conduct an experiment by training
a TimeSFormer model to achieve strong performance on the Kinetics-400 dataset and then reinitialize the
classification layer, freeze the feature extraction layers, and fine-tune the model on SSv2. Our findings
indicate this approach would only yield an accuracy of 19.5% on SSv2, and we interpret this result to
indicate the feature extraction mechanisms learned by one action recognition model may have difficulty
generalizing to other datasets.
A possible solution to mitigate inherent dataset biases would be to collect a single large-scale video
dataset covering a diverse range of actions and events. However, such a collection would be challenging
to design and time-consuming to create. An added layer of complexity relates to deciding the label classes
of this dataset, and mapping a single video to one or more classes is non-trivial and would require careful
design. A different approach is to learn representations applicable to many disparate action recognition
datasets. Rather than learning video representations specific to a single dataset during fine-tuning, we instead
suggest learning a single model across many action recognition dataset distributions.
6.3.3 COVER: Co-train Videos and Images for Action Recognition
COVER leverages different action recognition dataset distributions, as well as large-scale object recogni-
tion, to construct a general-purpose feature-extraction model. We first introduce the mechanism of learning
representations suited for multiple video datasets and then describe the process of integrating image data
into the fine-tuning paradigm.
6.3.3.1 Co-train Videos
To learn from N video datasets, we adopt a multi-task learning paradigm and equip the action recognition model f with N classification heads {MLP_i}_{i=1}^{N}. However, pre-training is unchanged. Notably, we adopt the typical pre-training policy of learning all non-temporal parameters on a large-scale object recognition dataset by minimizing the cross-entropy loss in Eq. 6.1.
In the co-training policy, we follow the pretraining and finetuning strategy. Similar to the standard training policy, the spatial attention layers θ_s are learned by minimizing the cross-entropy loss in Eq. 6.1. During fine-tuning, COVER learns both the spatial and temporal attention layers across samples (x^i_video, y^i_video) ∼ D^i_video from the N datasets jointly, where (x^i_video, y^i_video) is a video and its label sampled from dataset i.
Figure 6.2: COVER adopts a multi-task learning strategy. Each dataset has its own classifier. For the image dataset, we consider images as single-frame videos; therefore, the temporal multi-head attention does not affect the image input.
All video samples are processed by the model f via the shared parameters θ_s and θ_t. The sample representations are then distributed to the appropriate classification head to obtain the classification probability c^i_video = MLP_i(f(x^i_video; θ_s, θ_t)). We calculate the training loss for samples in video dataset i as

ℓ^i_video = ℓ({y^i_video}, {c^i_video})

To optimize the parameters θ_s and θ_t, we minimize the weighted sum of the loss functions across all N datasets:

(θ_s, θ_t) = argmin_{θ_s, θ_t} Σ_i w_i · ℓ^i_video

where w_i is the weight for the loss function of dataset i. Jointly learning action recognition feature extractors across multiple video datasets conveys several advantages. First, as the model is directly trained on multiple datasets, the learned video representations are more general and can be directly evaluated on those datasets without additional fine-tuning. Second, and as emphasized in prior work [17], there may be benefits from expanding the scope and quantity of action recognition examples: attention-based models may easily overfit to a smaller video distribution, thus degrading the generalization of the learned representations, and training on multiple datasets mitigates this challenge by reducing the risk of overfitting. Finally, as indicated in Sect. 6.3.2, certain datasets may focus on different inductive biases of video modeling. For example, one dataset may emphasize the modeling of temporal information while others emphasize spatial representation learning. Jointly learning on both distributions may lead to more robust feature extractors that encode both appearance and motion information to improve performance on action recognition benchmarks.
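A minimal sketch of this multi-dataset co-training objective is given below, with a shared backbone and one classification head per dataset; the interfaces and batch format are assumptions rather than the exact training code.

```python
# Multi-task co-training sketch: a shared backbone f (theta_s, theta_t) with
# per-dataset heads and a weighted sum of per-dataset cross-entropy losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDatasetModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: dict):
        super().__init__()
        self.backbone = backbone                               # shared parameters
        self.heads = nn.ModuleDict({name: nn.Linear(feat_dim, n)
                                    for name, n in num_classes.items()})

    def forward(self, x: torch.Tensor, dataset: str) -> torch.Tensor:
        return self.heads[dataset](self.backbone(x))

def cotrain_loss(model: MultiDatasetModel, batches: dict, weights: dict) -> torch.Tensor:
    """batches maps a dataset name to (videos, labels); weights maps it to w_i."""
    total = 0.0
    for name, (videos, labels) in batches.items():
        logits = model(videos, name)
        total = total + weights[name] * F.cross_entropy(logits, labels)
    return total
```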
6.3.3.2 Co-train Video and Image Data
To maintain strong spatial representations, COVER trains the model f on both image and video datasets. Similar to the training policy of transformer-based video models, we first pre-train the spatial attention layers θ_s using a large object recognition dataset, and then fine-tune the entire model (θ_s, θ_t) using both the video datasets D^i_video and the image datasets D^j_image, as shown in Fig. 6.2.
We adapt an object recognition image task to the action recognition video paradigm with minimal modification by considering an image as a video with only one frame. In this context, we can directly create a batch of both image x^j_image ∼ D^j_image and video x^i_video ∼ D^i_video data as input to the TimeSFormer model f. For the object recognition task, we obtain classification outputs c^j_image = MLP(f(x^j_image; θ_s, θ_t)), and for the video datasets, we denote the dataset-specific classification outputs as c^i_video = MLP(f(x^i_video; θ_s, θ_t)). The weighted loss for co-training over both images and videos is

ℓ_image+video = Σ_i w^i_video · ℓ({y^i_video}, {c^i_video}) + Σ_j w^j_image · ℓ({y^j_image}, {c^j_image})

where w^j_image and w^i_video represent the loss weights for image dataset j and video dataset i, respectively. We minimize ℓ_image+video to optimize the parameters θ_s and θ_t.
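The sketch below illustrates how image batches can be folded into the same objective by treating each image as a single-frame video, so the shared backbone needs no architectural change; the tensor layout and function names are assumptions.

```python
# Folding images into the co-training step as single-frame videos. With T = 1,
# temporal attention attends over a single frame and is effectively a no-op.
import torch

def images_as_videos(images: torch.Tensor) -> torch.Tensor:
    """images: (B, C, H, W) -> (B, 1, C, H, W), i.e., single-frame videos
    (assumed (B, T, C, H, W) layout for the video model)."""
    return images.unsqueeze(1)

def image_video_loss(model, video_batches, image_batches, w_video, w_image,
                     cotrain_loss):
    # reuse the weighted video loss from the previous sketch, then add the
    # image datasets after converting them to single-frame videos
    loss = cotrain_loss(model, video_batches, w_video)
    for name, (images, labels) in image_batches.items():
        logits = model(images_as_videos(images), name)
        loss = loss + w_image[name] * torch.nn.functional.cross_entropy(logits, labels)
    return loss
```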
The comparison between the standard training policy and COVER is summarized in Fig. 6.1.
6.3.4 Discussion
An ideal video representation should capture both the appearance and motion information of a video. Although video datasets are a natural source for learning motion information, the spatial information contained within a video clip may be limited. This is due to redundancy among frames and the relatively "small" size of video datasets compared to classical image datasets. Therefore, even by simultaneously training a model on many video datasets, the model's capacity to learn appearance information may be hindered. Although image dataset pretraining may provide a good starting point for obtaining appearance knowledge, it is possible that the robust spatial representations are diluted during fine-tuning on highly spatially redundant video data. Reducing the robustness of the learned spatial representations would likely decrease model performance; however, these representations may be maintained by continuing to train on object recognition alongside action recognition during the fine-tuning stage of learning.
6.4 Experiment
In this section, we first present the experiment setup and the implementation details. Then we study the
performance of COVER on five large-scale video datasets.
6.4.1 Experiment Setup
6.4.1.1 Datasets.
We evaluate our approach on 5 challenging video datasets: Kinetics-400 (K400) [80], Kinetics-600 (K600) [19], Kinetics-700 (K700) [20], Something-Something-v2 (SSv2) [58], and Moments in Time (MiT) [124]. The Kinetics dataset is a collection of datasets containing three variants, Kinetics-400, Kinetics-600, and Kinetics-700, with 400, 600, and 700 classes, respectively. The Kinetics datasets focus on daily human-object interaction. The SSv2 dataset contains 174 classes with 168K videos for training and 24K videos for evaluation. It contains videos with object-agnostic actions. The Moments-in-Time dataset is one of the largest datasets for video understanding. It contains 791K videos for training and 33K videos for evaluation, which cover human actions as well as animal actions. The Moments-in-Time dataset covers a wide range of videos, from YouTube clips to cartoons. For all datasets, we use the standard training and testing splits and report results on the testing splits.
6.4.1.2 Implementation details.
We conducted experiments using TimeSFormer [17], as it achieves a good balance between efficiency and performance. The TimeSFormer model contains 24 TimeSFormer blocks. Within each block, there is one temporal multi-head attention layer followed by one spatial multi-head attention layer. Each multi-head attention layer has 16 attention heads with 1024 hidden dimensions in total.
Given a video, we first uniformly sample 16 frames across the entire video. Then image patches with a resolution of 448× 448 are randomly cropped from the sampled frames to form the input. Similarly, when
Table 6.3: Comparison with the SoTA pretrained on ImageNet-21k (I21k). Our COVER results are based on the TimeSFormer architecture. Compared with TimeSFormer, COVER achieves significant improvements across all datasets. Compared with ViViT [10] and VidTr [204], COVER improves performance on the K400, K600, K700, and MiT datasets. The views are denoted as # of temporal crops × # of spatial crops. Dash ("-") means the results are not applicable. Joint w/K400 means the model is finetuned on SSv2+MiT+ImageNet and K400 jointly. We note our improvements compared with TimeSFormer of the same input sizes.
Model Pretrain Finetune K400 K600 K700 SSv2 MiT Views
Video Swin [112] I21k K400 84.9 - - - - 10× 5
K600 - 86.1 - - - 10× 5
ViViT [10] I21k K400 81.3 - - - - 4× 3
K600 - 83.0 - - - 4× 3
SSv2 - - - 65.9 - 4× 3
MiT - - - - 38.5 4× 3
VidTr [204] I21k K400 80.5 - - - - 1× 3
K700 - - 70.8 - - 1× 3
SSv2 - - - 63.0 - 1× 3
TimeSFormer [17] I21k K400 79.7 - - - - 1× 3
(448× 448× 16) K600 - 81.8 - - - 1× 3
SSv2 - - - 62.2 - 1× 3
COVER I21k Joint w/K400 83.1 (+3.4) - - 64.2 41.3 1× 3
(TimeSFormer) Joint w/K600 - 84.5 (+2.7) - 64.7 (+2.5) 41.5 1× 3
(448× 448× 16) Joint w/K700 - - 74.9 64.7 (+2.5) 41.5 1× 3
co-training with images, we randomly crop a 448× 448 image patch. We apply only random cropping and random horizontal flipping to augment the training data. Note that horizontal flipping is not applied to SSv2 videos, as two of the SSv2 action categories, "Pushing something from left to right" and "Pushing something from right to left", are symmetric. Following TimeSFormer, we evaluate our model by averaging the predictions from 1× 3 views of cropping. Specifically, during evaluation, we obtain three spatial crops from each video; we do not perform temporal cropping during evaluation.
We study the performance under three pre-training image datasets: ImageNet-21k [37], JFT-300M [88], and JFT-3B [193]. We finetune the pre-trained model on video and image datasets using a mini-batch of 128. Within each batch, we sample videos and images from all finetuning datasets, with a sampling rate proportional to the size of each dataset. The model is optimized using SGD with momentum set to 0.9 and is trained for 20 epochs. The initial learning rate is set to 5e−3; it drops to 5e−4 and 5e−5 at epochs 11 and 15, respectively.
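The size-proportional sampling and the step learning-rate schedule described above can be sketched as follows; the dataset sizes used here are rough placeholders, not the exact training-set counts.

```python
# Sketch of dataset-proportional sampling within a mixed batch and the
# step learning-rate schedule (placeholder dataset sizes).
import random

def sample_dataset(dataset_sizes: dict) -> str:
    """Pick a dataset name with probability proportional to its size."""
    names, sizes = zip(*dataset_sizes.items())
    return random.choices(names, weights=sizes, k=1)[0]

def learning_rate(epoch: int, base_lr: float = 5e-3) -> float:
    if epoch >= 15:
        return base_lr * 0.01   # 5e-5
    if epoch >= 11:
        return base_lr * 0.1    # 5e-4
    return base_lr              # 5e-3

# example: assign 128 slots of one mixed mini-batch to datasets
sizes = {"K400": 240_000, "SSv2": 168_000, "MiT": 791_000, "ImageNet": 1_280_000}  # placeholders
batch_assignment = [sample_dataset(sizes) for _ in range(128)]
```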
Table 6.4: Comparison with the SoTA pretrained on larger-scale datasets. COVER pretrained on JFT-300M surpasses all SoTA by a margin. COVER pretrained on an even larger dataset (JFT-3B) establishes a new set of SoTA on all datasets. Dash ("-") means the results are not applicable. Joint w/K400 means the model is finetuned on SSv2+MiT+ImageNet and K400 jointly. We note our improvements compared with the previous state-of-the-art.
Model Pretrain Finetune K400 K600 K700 SSv2 MiT Views
Video Swin [112] I21k+K400 SSv2 - - - 69.6 - 1× 3
TokenLearner[142] JFT-300M K400 85.4 - - - - 4× 3
K600 - 86.3 - - - 4× 3
ViViT [10] JFT-300M K400 84.8 - - - - 4× 3
K600 - 85.8 - - - 4× 3
MoViNet [89] None K700 - - 72.3 - - 1× 1
VATT [3] AudioSet K400 82.1 - - - - 4× 3
K600 - 83.6 - - - 4× 3
MiT - - - - 41.1 4× 3
OmniSource [41] IG-65M K400 83.6 - - - - 10× 3
COVER JFT-300M Joint w/K400 86.3 (+0.9) - - 69.3 45.0 (+3.9) 1× 3
(TimeSFormer) Joint w/K600 - 86.8 (+0.5) - 69.8 (+0.2) 44.5 1× 3
(448x448x16) Joint w/K700 - - 78.5 (+6.2) 69.7 44.8 1× 3
TimeSFormer [17] JFT-3B K400 85.0 - - - - 1× 3
(Reimplemented) K600 - 86.0 - - - 1× 3
(448x448x16) K700 - - 78.0 - - 1× 3
SSv2 - - - 67.2 - 1× 3
MiT - - - - 44.9 1× 3
COVER JFT-3B Joint w/K400 87.2 (+1.8) - - 70.8 46.1 (+5.0) 1× 3
(TimeSFormer) Joint w/K600 - 87.9 (+1.6) - 70.9 (+1.3) 45.9 1× 3
(448x448x16) Joint w/K700 - - 79.8 (+7.5) 70.6 45.9 1× 3
6.4.2 Main results
We summarize the COVER performance in Table 6.3 and Table 6.4. We report our model performance under three pre-training datasets: ImageNet-21k, JFT-300M, and JFT-3B. For each pre-training setting, we co-train our model on SSv2, MiT, ImageNet, and different versions of Kinetics jointly.
We compare COVER with TimeSFormer under the same ImageNet-21k pre-training setting in Table 6.3. With the same architecture, COVER co-trained on multiple datasets achieves 2.4%, 2.3%, and 2.3% improvement on K400, K600, and SSv2, respectively. Compared with ViViT and VidTr, COVER improves K400, K600, K700, and MiT performance by a margin, which verifies the effectiveness of the proposed approach. COVER achieves lower performance than Video Swin Transformer, owing to the latter's more advanced architecture.
When pretrained on a larger-scale image dataset, JFT-300M, COVER surpasses the previous best performance by 0.9%, 0.5%, 0.2%, and 3.9% on K400, K600, SSv2, and MiT in top-1 accuracy. Pretraining on an even larger image dataset, JFT-3B, further boosts the top-1 accuracy to 87.2% on K400, 87.9% on K600,
Table 6.5: Detailed co-training Top-1 accuracy with the model pretrained on JFT-3B. By co-training with more datasets, COVER keeps improving performance on the K400, K600, K700, SSv2, and MiT datasets.
(a) The ablation of co-training on K400+SSv2+MiT+ImNet
Training strategy K400 SSv2 MiT
Train independently 85.0 67.2 44.9
K400+SSv2 86.4 69.9 -
K400+SSv2+MiT 87.2 70.6 45.9
K400+SSv2+MiT+ImNet 87.2 70.8 46.1
(b) The ablation of co-training on K600+SSv2+MiT+ImNet
Training strategy K600 SSv2 MiT
Train independently 86.0 67.2 44.9
K600+SSv2 87.3 70.4 -
K600+SSv2+MiT 87.9 70.5 45.9
K600+SSv2+MiT+ImNet 87.9 70.9 45.9
(c) The ablation of co-training on K700+SSv2+MiT+ImNet
Training strategy K700 SSv2 MiT
Train independently 78.0 67.2 44.9
K700+SSv2 79.1 70.6 -
K700+SSv2+MiT 79.6 70.7 45.8
K700+SSv2+MiT+ImNet 79.8 70.6 45.9
79.8% on K700, 70.9% on SSv2, and 46.1% on MiT. We also report the TimeSFormer baseline pretrained on JFT-3B; COVER shows a strong improvement over the baseline on all five datasets by a large margin. Our results indicate that the training methodology for Transformer models is important: orthogonal to improving the model architecture, co-training a simple spatio-temporal transformer with multiple datasets can achieve superior performance.
COVER does not alter the architecture or the evaluation scheme. Thus, COVER has the same inference speed as TimeSFormer.
6.4.3 Ablation study
COVER improves the SoTA top-1 accuracy across all datasets. In this section, we conduct ablation studies on COVER pretrained on JFT-3B to empirically analyze its performance gains.
Table 6.6: Top-1 accuracy of COVER under different image classification loss weights. With a larger w_image, the model improves ImageNet (ImNet) performance but sacrifices video dataset results.
Models finetuned on K400+SSv2+MiT+ImNet
w_image w_video K400 SSv2 MiT ImNet
0 1 87.2 70.6 45.9 -
0.5 1 87.2 70.8 46.1 86.1
0.75 1 87.2 70.6 45.8 86.6
6.4.3.1 Co-training with multiple datasets
Our main argument is that co-training over multiple image and video datasets can improve action recognition performance. To verify this argument, we summarize the ablation results in Table 6.5. Note that all the ablation studies are conducted under the same input setting, 448× 448× 16 (H× W× T), and the same architecture pretrained on the JFT-3B dataset.
We first limit the scope to two datasets, K400 and SSv2, and compare the co-training performance with the results of training independently. Co-training improves the top-1 accuracy on K400 by 1.4% and on SSv2 by 2.7%, which indicates that jointly learning K400 and SSv2 can enhance the performance of both tasks. We further include the MiT dataset in co-training. Compared with co-training on just two datasets, co-training on all three improves K400 and SSv2 performance by 0.8% and 0.7%, respectively. We observe that co-training with MiT also improves MiT performance by 1.0%. Finally, we co-train both image and video datasets together. Adding ImageNet to the co-training datasets further improves SSv2 and MiT by 0.2% each, establishing a new SoTA. We observe a similar improvement for the models co-trained with K600+SSv2+MiT+ImageNet and K700+SSv2+MiT+ImageNet.
6.4.3.2 Loss weight for image and video classification
Our co-training paradigm involves two losses, the image classification loss and the video classification loss.
We study the performance under different loss weights in Table 6.6. By increasing the image classification
loss weight, we observe that the model is encouraged to learn better appearance representations. Thus,
ImageNet accuracy improves from 86.1% to 86.6%. However, as the model focuses on learning better
appearance information, the model's ability to capture motion structure is reduced. We observe 0.2% top-1 accuracy drops on the SSv2 and MiT datasets, which indicates that it is vital to balance the image and video classification loss weights.
Figure 6.3: Comparison of transfer learning with the representations learned by COVER and by the standard training paradigm, on (a) K400 and (b) SSv2. A→B means the model is trained on dataset A and then finetuned on dataset B. COVER achieves better performance than the standard training paradigm by a margin.
6.4.3.3 Transfer learning on other datasets
We use transfer learning as a showcase to verify the quality of the video representations. Specifically, we train on the source dataset, then finetune and evaluate on the target dataset. The results are summarized in Fig. 6.3. We first consider K400 as the target dataset. COVER co-trained on SSv2 and MiT improves the top-1 accuracy on K400→K400 by 1.3%, SSv2→K400 by 1.7%, and MiT→K400 by 0.4%. Similarly, we observe that when transferring to SSv2, COVER achieves 2%, 1.8%, and 1.1% improvement over SSv2→SSv2, K400→SSv2, and MiT→SSv2, respectively. The improvements in transfer learning show that COVER co-trained on multiple datasets learns better visual representations than the standard training paradigm, which is useful for downstream tasks.
6.5 Conclusion
In this chapter, we present COVER, a training paradigm that jointly learns action recognition and object recognition tasks in a single model for the purpose of constructing a general-purpose action recognition framework. Our analysis indicates it may be beneficial to integrate many video and image datasets into one multi-task learning paradigm. We highlight the importance of continuing to learn on image data during fine-tuning to maintain robust spatial representations. Our empirical findings suggest COVER can learn a single model which achieves impressive performance across many action recognition datasets without an additional stage of fine-tuning on each downstream application. In particular, COVER sets new state-of-the-art performance on the Kinetics-400, Kinetics-600, Kinetics-700, Something-Something-v2, and Moments-in-Time datasets.
Part V
Application of Structural Visual Representation
Chapter 7
Visual Storytelling via Predicting Anchor Words in the Stories
7.1 Introduction
Visual storytelling, i.e., narrating a sequence of images, is a challenging task [130, 72]. It demands a deeper understanding of the underlying storyline of the images. The process is naturally subjective, depending on the narrator's personal perspective (and the Zeitgeist at the time of the events). It often focuses more on conveying the narrator's own interpretation than on describing the images in factual terms.
For example, as aptly pointed out by the creators of the popular dataset VIST [72], concatenating the
descriptions of the images, either in isolation or in the context of all other images, does not give rise to a
desirable narrative story. Table 7.1 illustrates the difference in the corpus statistics on the aforementioned
dataset (summarized over images which have been annotated with both a narrative sentence and a descriptive
caption). On those images, there is a clear preference for using more verbs and adverbs when narrating than when captioning. In fact, despite being similar in length, stories and captions use very different sets of words. At least 40% of the words that appear in stories do not appear in captions.
While this discrepancy has been well documented, it is not clear how this insight could be used to devise effective models for visual storytelling. The task seems to gravitate naturally toward the SEQ2SEQ approach, where we learn a mapping that encodes a sequence of image features and then decodes it by outputting a sequence of words [57]. This approach met with some initial success, and other approaches have since built on it [71, 168, 173].
In this chapter, we take a step toward identifying what might be needed for generating a narrative story.
We hypothesize that each narrative story needs to have a sequence of anchor words. For simplicity, we
assume one anchor word per image. The anchor words form a strong prior on what can be “said” about the
Table 7.1: Corpus statistics of stories and captions on the VIST dataset [72]

                     Stories    Captions
  Vocabulary Size    29,614     24,534
  Avg. Sent. Length  11.4       11.9
  # of Nouns         6,831      7,772
  # of Verbs         5,217      3,202
  # of Adjectives    2,089      2,018
  # of Adverbs       1,505      286
images. To narrate a sequence of images, our learning model just needs to predict the anchor word for each
image in turn and then supply the anchor words to a standard SEQ2SEQ model to generate the story.
But then, what are the anchor words? They are not explicitly given in the annotated dataset and would need to be inferred by the learning model while training the SEQ2SEQ model for story generation. As a
first step, however, we have shown that we can use the words in the ground-truth stories as anchor words
and learn a predictive model (from the image features) to predict the anchor words when the ground-truth
stories are not available, for example, when we use the model to narrate a test sequence of images.
As opposed to several best-performing models for the same task, our model is simple in design and does not need reinforcement learning algorithms for optimization [71, 168, 173]. Yet it attains the best performance on several automatic evaluation metrics.
We describe the idea of using anchor words in section 7.3, supported by the evidence that such words,
when added to a vanilla SEQ2SEQ model for story generation, significantly improve its performance. We
then describe how to train a predictive model to predict the anchor words. In section 7.4, we report our
evaluation results and conclude in section 7.5.
7.2 Related Work
There is a large body of work in the intersection of vision and language, e.g. [108, 167, 9].
Image captioning is closely related to visual storytelling. SEQ2SEQ and its variants are among the most
popular learning approaches for the task [182, 167]. More recently, researchers have studied modeling image caption generation together with the personality of the narrator [149], to reflect subjective personal preferences in language.
[Figure 7.1 (schematic): image features and the anchor-word predictions F(·) are fused and fed to a bi-directional GRU, whose hidden states ĥ_1, ..., ĥ_5 are each passed to a sentence decoder that produces the narrative sentences Ŝ_1, ..., Ŝ_5.]
Figure 7.1: Conceptual diagram of our approach for visual storytelling. The key difference from a typical
SEQ2SEQ model is the component of predicting anchor words (or their embeddings) from the images. The
predictions are then fused with the image features as the inputs to the sequence transducer for generating
desired narrative sentences.
From the very beginning, the creators of the dataset for visual storytelling highlighted the difference between captioning and narration [72]. In essence, narrative stories go beyond the factual enumeration of objects and activities depicted in the images, which is often adequate for image captioning. For example, “the man is smiling” is not as effective a narrative sentence as “the man is having a good time”.
Recent approaches for visual storytelling have used reinforcement learning (RL) to optimize complicated models, such as hierarchical ones [71], or to directly optimize evaluation metrics [173]. The approach proposed in this chapter has the advantage of a simplified design and learning procedure, yet attains the best performance on several evaluation metrics.
7.3 Approach
The task of visual storytelling is to generate a sequence of narrative sentences {S_i, i = 1, 2, ..., N}, one for each of the N images {I_i, i = 1, 2, ..., N}. The order of the images is important and is fixed. Each generated sentence S_i may contain a variable number of words.
Table 7.2: Adding ground-truth words as anchor words to a SEQ2SEQ model significantly improves its performance over using image features only. Higher values indicate better performance.

                B@4          M            R            C
  Image Only    13.9 ± 0.4   35.2 ± 0.1   29.5 ± 0.3    8.4 ± 0.9
  Anchoring
    with Noun   17.2 ± 0.2   39.0 ± 0.2   33.8 ± 0.1   15.7 ± 0.4
    with Verb   16.5 ± 0.1   37.9 ± 0.2   34.7 ± 0.2   13.0 ± 0.2
    with Adj.   15.2 ± 0.3   36.6 ± 0.1   31.9 ± 0.2   11.3 ± 0.2
    with Adv.   14.9 ± 0.3   35.9 ± 0.1   31.0 ± 0.2   10.4 ± 0.3
The main idea behind our approach is fairly straightforward. For each image, we learn and apply a model
to predict its corresponding anchor word. The predicted anchor word (or more precisely, its embedding) is
then concatenated with the image feature. The combined feature is fed into a SEQ2SEQ model [157], which generates the narrative sentence as output. Fig. 7.1 schematically illustrates the model design.
The key challenge is to learn the anchor word prediction model when the dataset does not provide anchor
words explicitly. We begin by describing how we overcome this challenge. Then we introduce our model in
detail.
7.3.1 What is an anchor word?
We are inspired by the comparison between narrative stories and captions on the same sequence of images,
shown in Table 7.1. In particular, a large number of words used in narration do not appear in captions.
Intuitively speaking, they are less likely to be visually grounded.
Thus, we conjecture that possible candidates for anchor words are the words in the narrative sentences. The analysis results reported in Table 7.2 confirm the usefulness of this hypothesis.
Specifically, we train a model as in Fig. 7.1 with two variants. In the first variant, we do not use the
separate model to predict the anchor words. Instead, we supply only the image features. The results are
reported in the row labeled as “Image Only”. In the second variant (“Anchoring”), we select all the noun
(alternatively, verb, adjective, or adverb) words in the story as anchor words – one word per sentence in the
story. We then train the SEQ2SEQ model by combining the image feature and the word embedding end to
end. The results are reported in rows labeled with the part-of-speech (POS) tags of the selected words. For
simplicity, all anchor words have the same POS tag. If there are multiple words with that POS tag in a sentence, we randomly select one. Details of the metrics are explained in the next section.
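The following is a minimal sketch of this anchor-word selection step. It assumes an off-the-shelf POS tagger such as NLTK (the exact tagger used is not specified here), and the helper name select_anchor_word is hypothetical.

```python
import random
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def select_anchor_word(sentence, pos_prefix="NN"):
    """Pick one anchor word with the given POS prefix (e.g. nouns) from a
    ground-truth story sentence; return None if no such word exists."""
    tokens = nltk.word_tokenize(sentence.lower())
    tagged = nltk.pos_tag(tokens)
    candidates = [word for word, tag in tagged if tag.startswith(pos_prefix)]
    return random.choice(candidates) if candidates else None
```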
There are two points worth making. First, adding anchor words, irrespective of their type, significantly improves the performance of the SEQ2SEQ model that uses image features only. Note that the “Image Only” results are on par with state-of-the-art results [173]. Secondly, among all POS tags, nouns as anchor words seem to be the most beneficial on all metrics except ROUGE, where verbs help more.
In the rest of this chapter, we use nouns in the stories as the anchor words.
7.3.2 Model and Learning
The data for our learning task is augmented with a list of anchor words {w_i | i = 1, 2, ..., N} corresponding to the images. In what follows, we explain how to learn each component of our models.
Anchor word predictor We learn a model F(I_i) to predict w_i. F(·) is parameterized by a one-hidden-layer multi-layer perceptron (MLP) with ReLU non-linearity. The input could be the features of the i-th image or of all the images in the same sequence; in practice, we do not observe a significant difference.
To be able to generalize to new anchor words, we do not predict the identity of the anchor word per se.
Instead, we predict its embedding and cast learning F(·) as a standard regression problem. To obtain the
target (i.e., the “ground-truth” embedding) for the word w_i, we take the embeddings from the “Anchoring” model in Table 7.2, where the anchor words’ embeddings are trained end-to-end to generate the desired
stories. F(·) is then optimized to reduce the mean squared error between the predictions and the target
embeddings for all the anchor words.
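A minimal sketch of such a predictor is given below. The 2048-d input and 512-d output follow the dimensions stated in Section 7.3.3, while the hidden width and the class name AnchorPredictor are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AnchorPredictor(nn.Module):
    """One-hidden-layer MLP with ReLU that regresses an image feature to the
    embedding of its anchor word (F(.) in the text)."""

    def __init__(self, feat_dim=2048, hidden_dim=2048, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, image_features):
        return self.net(image_features)

# Training reduces the mean squared error against the target embeddings taken
# from the end-to-end "Anchoring" model:
#   loss = nn.functional.mse_loss(predictor(image_features), target_embeddings)
```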
Story generation model Similar to most state-of-the-art visual storytelling methods [173, 72], we use a
SEQ2SEQ model [157] as the story generator. Concretely, a bidirectional gated recurrent unit (GRU) network [33] is used to encode the concatenated feature of the image and the predicted anchor word embedding and to
produce a sequence of hidden states
v_i = BiGRU(I_i, F(I_i))    (7.1)
The sequence of hidden states is then decoded by a one-layer GRU decoder, where each hidden state is decoded into a narrative sentence. Both the encoder and the decoder are trained to maximize the likelihood of the ground-truth stories.
104
7.3.3 Other Implementation Details
To facilitate reproducibility, our code will be made publicly available at www.anonymity.com. Meanwhile, we describe some key implementation details in the rest of this section.
Visual and textual representation We extract the 2,048-dimensional feature from the penultimate layer of ResNet-152 [62] as the visual representation. We randomly initialize 512-dimensional vectors for each of the 9,837 most frequent words in the vocabulary and fine-tune them during training. Note that the anchor words share the word embeddings with the words in the decoding vocabulary.
Model details The concatenated feature of the image and the anchor word embedding is projected into a 2,048-dimensional feature vector with a one-hidden-layer MLP with a 2,048-dimensional hidden layer. Then, a one-layer BiGRU with 256-dimensional hidden states generates the 512-dimensional contextual embedding v_i, which serves as the hidden-state representation. A standard SEQ2SEQ decoder, a one-layer GRU with a 512-dimensional hidden state, is used on top of these hidden states to generate the story.
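A rough PyTorch sketch of the encoder side is shown below; the class name StoryEncoder, the use of ReLU inside the projection MLP, and the exact wiring are assumptions for illustration, and the decoder, teacher forcing, and scheduled sampling are omitted.

import torch
import torch.nn as nn

class StoryEncoder(nn.Module):
    """Project [image feature; anchor embedding] with a one-hidden-layer MLP,
    then run a one-layer BiGRU over the 5 images to get contextual states v_i."""
    def __init__(self, img_dim=2048, word_dim=512, proj_dim=2048, gru_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim + word_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Bidirectional, so each contextual embedding v_i has 2 * gru_dim = 512 dims.
        self.bigru = nn.GRU(proj_dim, gru_dim, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, image_feats, anchor_embeds):
        # image_feats: (batch, 5, 2048); anchor_embeds: (batch, 5, 512)
        x = self.proj(torch.cat([image_feats, anchor_embeds], dim=-1))
        v, _ = self.bigru(x)   # (batch, 5, 512), one v_i per image
        return v

The one-layer GRU decoder with a 512-dimensional hidden state would then be unrolled from each v_i to emit one sentence per image; it is left out here for brevity.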
Optimization As mentioned, the model is trained in two stages. In the first stage, ground-truth anchor words (nouns in the stories) are used to train the encoder-decoder as well as the embeddings end to end. The model is trained with mini-batches and ADAM for 100 epochs. Each mini-batch contains 64 sampled stories. The initial learning rate is set to 4e-4, and scheduled sampling [16] is used. The probability of scheduled sampling is initially set to 0.05, increased by 0.05 every 5 epochs, and fixed at 0.25 after 25 epochs.
In the second stage, the predictor F(·) is trained. Specifically, we use the model that achieves the highest METEOR score on the validation set in the first-stage training as a pre-trained model. We train the predictor with the encoder-decoder model end to end, keeping the encoder-decoder as well as the word embeddings fixed. We use the same optimization hyper-parameters as in the first stage.
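For concreteness, the scheduled-sampling schedule described above can be computed as follows; this is a sketch of our reading of the schedule (the exact 0- or 1-based epoch indexing is an assumption), not a verbatim excerpt of the training code.

def scheduled_sampling_prob(epoch, start=0.05, step=0.05, every=5, cap=0.25):
    """Probability of feeding the model's own prediction instead of the
    ground-truth token: starts at 0.05, +0.05 every 5 epochs, capped at 0.25."""
    return min(start + step * (epoch // every), cap)

# e.g. epochs 0-4 -> 0.05, epochs 5-9 -> 0.10, ...; later epochs stay at 0.25
assert scheduled_sampling_prob(0) == 0.05
assert scheduled_sampling_prob(12) == 0.15
assert scheduled_sampling_prob(40) == 0.25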
In addition, a third stage of fine-tuning could be conducted, which trains the anchor word predictor and the sequence generator together end to end. However, we observe that this further fine-tuning may result in slightly degraded empirical performance. One potential explanation is that the joint optimization of the regression and maximum likelihood objectives makes for a difficult optimization problem. We leave this to future studies.
Inference At inference time (i.e., narrating a sequence of images), we perform beam search for sentence decoding with a beam size of 3.
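For reference, a generic beam-search sketch in the spirit of this decoding step is shown below; step_fn, bos, and eos are placeholders for the GRU decoder step and the special start/end tokens, and the function is an illustrative simplification rather than our exact decoding code (for instance, it does not length-normalize scores).

import heapq

def beam_search(step_fn, start_state, bos, eos, beam_size=3, max_len=30):
    """Generic beam search (a sketch).
    step_fn(state, token) -> (new_state, [(next_token, log_prob), ...])."""
    beams = [(0.0, [bos], start_state, False)]   # (log_prob, tokens, state, done)
    for _ in range(max_len):
        candidates = []
        for logp, tokens, state, done in beams:
            if done:
                candidates.append((logp, tokens, state, True))
                continue
            new_state, expansions = step_fn(state, tokens[-1])
            for tok, tok_logp in expansions:
                candidates.append((logp + tok_logp, tokens + [tok],
                                   new_state, tok == eos))
        # keep only the beam_size highest-scoring hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
        if all(done for _, _, _, done in beams):
            break
    return max(beams, key=lambda b: b[0])[1]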
7.4 Experiments
7.4.1 Experimental Setups
Table 7.3: Comparison with state-of-the-art methods for the visual storytelling task on the VIST dataset. B@n: BLEU-n; M: METEOR; R: ROUGE; C: CIDEr. Our "Image Only" model is a reimplementation of XE+SS [173] with the authors' publicly available code.

Method                        B@1        B@2        B@3        B@4        M          R          C
AREL [173]                    63.8       39.1       23.2       14.1       35.0       29.5       9.4
Show, Reward and Tell [168]   43.4       21.4       10.4       5.2        -          -          11.4
HSRL w/ Joint Training [71]   -          -          -          12.3       35.2       30.8       10.7
SEQ2SEQ+Heuristics [72]       -          -          -          -          31.4       -          -
H-Attn-Rank [188]             -          -          20.8       -          33.9       29.8       7.4
StoryAnchor:
  Image Only                  62.2±2.5   38.3±1.7   22.7±0.8   13.9±0.4   35.2±0.1   29.5±0.3   8.4±0.9
  w/ Predicted Nouns          65.1±0.3   40.0±0.2   23.4±0.0   14.0±0.1   35.5±0.0   30.0±0.1   9.9±0.1
Table 7.4: Evaluating human performance with the automatic evaluation procedures. The machine models outperform the human storyteller on all metrics; see text for details.

Method                        B@1        B@2        B@3        B@4        M          R          C
Human                         51.2±0.2   25.0±0.2   11.7±0.2   5.6±0.2    28.4±0.1   24.5±0.1   7.8±0.1
StoryAnchor:
  Image Only                  58.6±0.2   34.7±0.2   20.0±0.1   11.2±0.1   34.0±0.1   28.3±0.1   8.8±0.1
  w/ Predicted Nouns          60.7±0.2   35.8±0.2   20.3±0.1   11.9±0.1   34.5±0.0   28.9±0.0   10.1±0.1
  w/ Ground-truth Nouns       65.1±0.2   40.3±0.1   23.9±0.1   14.7±0.0   37.7±0.1   32.3±0.1   16.2±0.1
Dataset We use the VIST dataset [72] for evaluation. It contains 10,032 visual albums with 50,136 stories. Each album contains 2 image sequences, where each sequence has 5 images and 1 to 4 annotated stories; each album contains 5 distinct reference stories in total. Each story contains five narrative sentences, corresponding to the five grounded images respectively. After removing broken images, there are 8,024 albums with 40,098 stories for training, 998 albums with 4,988 stories for validation, and 1,010 albums with 5,050 stories for testing. There is no overlap among the splits.
Table 7.5: Human evaluation on the generated stories.

              StoryAnchor w/ Predicted Nouns   AREL    Tie
Relevance     53.2%                            40.4%   6.4%
Concreteness  45.1%                            38.1%   16.8%
Coherence     48.9%                            42.3%   8.8%
Evaluation We follow the evaluation setup used in [72, 173, 188, 71]. For each testing album, we sample one image sequence and generate a story based on that image sequence. The story is then scored against all 5 reference stories of that album. We use the evaluation code provided by [188].¹ We report the average BLEU, METEOR, ROUGE, and CIDEr over the test split. For BLEU, we report 1-, 2-, 3-, and 4-gram scores. We evaluate over 3 random runs and compute the means and the variances of the metrics.
Identifying anchor words We use the NLTK POS tagger to obtain the tags. Each sentence contains on average 2.63 nouns, 2.0 verbs, 0.8 adjectives, and 0.5 adverbs. We use 'UNK' as the anchor word when a sentence contains no word with the corresponding POS tag.
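A small sketch of this selection with NLTK is shown below; the function name pick_anchor_word and the random tie-breaking are illustrative, and the per-sentence fallback to 'UNK' follows the description above.

import random
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def pick_anchor_word(sentence, pos_prefix="NN"):
    """Pick one word with the given POS prefix (e.g. 'NN' for nouns) from a
    sentence; fall back to 'UNK' if no such word is present."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    candidates = [word for word, tag in tagged if tag.startswith(pos_prefix)]
    return random.choice(candidates) if candidates else "UNK"

story = ["The bazzar was filled with many unique items .",
         "Because of the rain , few people came out ."]
anchors = [pick_anchor_word(s) for s in story]   # e.g. ['items', 'rain']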
7.4.2 Main Results
We compare our method (StoryAnchor) to several state-of-the-art methods [72, 188, 173, 168, 71]. As Table 7.3 shows, our model performs significantly better than the others on almost all evaluation measures. On ROUGE and CIDEr, recent approaches using reinforcement learning seem to perform well.
We also used Amazon Mechanical Turk (AMT) to conduct a human evaluation comparing the outputs of our model with those from AREL [173]. Following [173], we design three questions to evaluate the relevance, concreteness, and coherence of the generated stories with respect to the image sequences. 150 generated stories from the test split are evaluated; for each story, 5 workers are assigned. The results are reported in Table 7.5. Our approach performs better.
¹ https://github.com/lichengunc/vist_eval. This is the most commonly used evaluation script nowadays.
Table 7.6: Better storytelling uses a bigger vocabulary.

Methods                              Vocab. Size   Sent. Length
AREL [173]                           661           8.30
StoryAnchor: Image Only              777           8.34
StoryAnchor: w/ Predicted Nouns      856           8.98
StoryAnchor: w/ Ground-truth Nouns   1729          8.54
7.4.3 Analysis
Better storytelling entails a bigger vocabulary? Table 7.6 contrasts the outputs of several methods by measuring the vocabulary size and the average sentence length of the generated stories. The results are somewhat expected in that better models tend to use a bigger vocabulary in the generated stories. While a detailed analysis is needed, we conjecture that many words in stories are not "visually grounded". Thus, the current modeling does not discover such words easily from image features, or from co-occurrences with "visual words" in the training corpus. Hence, adding ground-truth nouns is the most direct and beneficial way to increase the performance metrics.
Can machines beat humans in visual storytelling? Is visual storytelling, as a general AI problem, fundamentally out of reach of machines? Are the metrics being used now to guide the design of our systems the right ones? Are there design flaws in the dataset, the task specification, or the metrics, such that we might not make actual progress despite the illusory improvement in benchmark results?
The results in Table 7.4 highlight the challenges of addressing those questions. There, we assess how well a human storyteller would do. For each album, we randomly select one human-written ground-truth story as the "generated" story and treat the other 4 as "reference" stories. We then evaluate human performance by scoring the generated story. For a fair comparison, we re-evaluate all of the learning models with 4 sampled reference stories. Mean evaluation performance and variance per metric are reported, collected over five random runs.
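The leave-one-out protocol can be sketched as follows; here score_fn stands in for the VIST evaluation toolkit (assumed to return a single scalar score), albums is a hypothetical list of five ground-truth stories per album, and averaging per-story scores is a simplification of the corpus-level metrics computed by the actual evaluation code.

import random
import statistics

def human_performance(albums, score_fn, runs=5, seed=0):
    """Leave-one-out human 'performance': for each album, one ground-truth
    story acts as the hypothesis and the remaining 4 as references."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(runs):
        per_album = []
        for stories in albums:                 # each album: a list of 5 GT stories
            hyp = rng.choice(stories)
            refs = [s for s in stories if s is not hyp]
            per_album.append(score_fn(hyp, refs))
        run_scores.append(statistics.mean(per_album))
    return statistics.mean(run_scores), statistics.stdev(run_scores)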
Clearly, the learning models outperform the human storyteller significantly on every metric! Yet, our "Turing test" suggests the opposite. In Table 7.7, over 450 stories (3 for each of the 150 sequences of images), we report the percentages of 150 AMT workers' preferences for the stories from the two learning models and the one human annotator. In this study, human storytelling is much more preferred.
Table 7.7: Which stories are preferred by human readers.

StoryAnchor w/ Predicted Nouns   AREL    Human   Unsure
19.9%                            18.0%   57.8%   4.2%
The significant misalignment between human evaluation and the automatic evaluation metrics is likely a bottleneck for developing new methods for this task.
7.4.4 Qualitative Results
We provide in Fig. 7.2 a few examples illustrating the stories narrated by the learning models and by humans. We also show the ground-truth anchor words that are targeted and the ones that are predicted.
In both cases, we can observe that the learning methods tend to use similar sentences for images that are similar (colored in blue in the examples). While such correlation is desirable for tasks such as image captioning, it is not ideal for narrative sentences, which need to exhibit thematic development. In other words, even for images that are similar (for example, due to temporal proximity), the sentences for such images need to be sufficiently different to constitute a story.
Note that the predicted noun anchor words are completely different from the ground-truth nouns. The examples in Fig. 7.3 have stories that are closer to the ground-truth stories. Fig. 7.3 shows that stories generated with anchor words (colored in red in the examples) can describe the image sequence more concretely than stories generated without anchor words.
7.5 Conclusion
The StoryAnchor model proposed in this chapter is simpler in design than several state-of-the-art models [173, 71] in which hierarchical modeling and reinforcement learning are deployed. Yet, it attains the best results on most automatic evaluation metrics. The key insight is to use "anchor words" to model the evolution of the underlying storyline. Crudely, those words are the "topics" or "states" of the narrator. While those notions are not explicitly annotated in the current dataset, we have selected the nouns in the ground-truth stories as targets for learning an anchor word predictor. Our future direction is to investigate how to learn those latent concepts automatically from data.
Story by human: The bazzar was filled with many unique items. Because of the rain, few people came out. As the
rain cleared it began to get crowded. [male] found a unique stall. The owner was a crochety old lady.
AREL[173]: There were a lot of people at the market today. We had a lot of people there. There were a lot of people
there. The food was delicious. We had a lot of fun.
StoryAnchor: Image Only: We went to the local market today. There were a lot of people there. There was a lot of
people there. We had a lot of food to choose from. I had a great time there.
StoryAnchor: w/ Predicted Nouns: We went to the market to get some groceries. There was a lot of people at the
market. The streets were filled with people. There were a lot of people there. The kids were very excited to see them.
StoryAnchor: w/ Ground-truth Nouns: We went to the local market to buy some food items. The rain was rain, but
the rain was rain. The rain started to rain. [male]’s favorite part of the day was the local restaurant. The owner of the
owner was very happy to see him.
Predicted Nouns: something; something; something; what; what
Ground-truth Nouns: items; rain; rain; [male]; owner
Story by human: This year she brought her dog to watch the halloween dog parade. There are so many cute costumes,
like this little devil. This dog looks just like a jack ’o lantern. Some dogs prefer a more ‘royal’ look. But the winner
of this years award for cute, goes to this little prisoner.
AREL[173]: I went to the park yesterday. Even the dog was excited to see him. Even the dog was excited to see him.
The dogs were so happy to be there. And the dog had a good time.
StoryAnchor: Image Only: This is a picture of a man. He even got to pet the dog. I took a picture of the dog. He
even brought a dog with his dog. The dog had a great time.
StoryAnchor: w/ Predicted Nouns: This was my first day of the year. The dog was so excited to be there. Even the
dog was excited to be there. This is [male]’s dog. The dog had a great time.
StoryAnchor: w/ Ground-truth Nouns: This year, I went to a halloween party this year. The dog was cute and cute.
Even the dog was very happy. Even the dogs were very happy. Even the dog had a good time.
Predicted Nouns: momentous; extended; something; something; something;
Ground-truth Nouns: year; cute; dog; dogs; UNK
Figure 7.2: Qualitative visualization of narrative stories generated by humans and by different methods. Similar sentences for similar images are annotated in blue.
Story by human: It was the day my boy graduated. I kept the pamphlet that had his name in it. The faculty congratulated
my boy on his work. He looked so proud on that stage. I hope that his smile lasts long into his adulthood.
AREL[173]: The awards ceremony was held at the organization. They were a lot of people there. We had a lot of
people there. The president of the school was a lot of fun. I had a great time.
StoryAnchor: Image Only: I had a great time at the graduation ceremony. The plaque was the. He was very happy
to be there. He was very proud of his accomplishments. I had a great time there.
StoryAnchor: w/ Predicted Nouns: Today was the day of the graduation. It was a very special day. We were so
happy to be there. The man and his wife were there to celebrate their accomplishments. Everyone was very happy to
be there.
StoryAnchor: w/ Ground-truth Nouns: Today was the day of the graduation. We had a pamphlet at the reception.
The boy was so happy to be there. He was so excited to see the stage. I was so happy to be there.
Predicted Nouns: I; UNK; momentous; momentous; wally
Ground-truth Nouns: boy; pamphlet; boy; stage; smile
Figure 7.3: Qualitative visualization of good-quality stories generated with anchor words. The sentences with anchor words are annotated in red in the generated stories.
Part VI
Conclusion and Future Work
Chapter 8
Conclusion
In this thesis, we developed multiple approaches for visual representation learning. Traditional methods improve visual representations along several directions: designing new training objectives [61, 30], improving the neural architecture by instilling domain knowledge [159, 62, 21, 170], optimizing the visual representation via neural architecture search (NAS) [207, 89], and proposing large-scale datasets for acquiring better visual representations [37, 21, 108, 135, 80]. These traditional approaches have improved benchmark performance over recent decades. However, the visual representations learned by traditional methods may be optimized only for a specific downstream task and have limited cross-dataset and cross-domain generalization ability. The key reason may be that the traditional approaches treat images and videos as compilations of pixels: the visual representation is optimized for specific tasks and specific datasets, and thus achieves inferior performance on cross-task and cross-dataset transfer.
In this thesis, orthogonal to those directions, we emphasize that visual components have structure. Specifically, visual components (images and videos) can be decomposed into atomic components (e.g., objects, attributes, actions, etc.). The representations of those atomic components should generalize across datasets and tasks. We then assemble those atomic representations to form the visual representation for images and videos. With this insight in mind, we focused on visual representation learning with a structural prior. The key hypothesis is that by modeling the visual representation hierarchically, we obtain a representation that generalizes well across multiple tasks and benchmarks.
Under this hypothesis, we built approaches based on three modalities, images, videos, and text, and focused on learning hierarchical visual representations in two domains: 1) the cross-modality domain with parallel visual and text data, and 2) the pure visual domain with images and videos. An overview of the thesis is summarized in Fig. 8.1.
[Figure 8.1 diagram: the three modalities (video, image, text) connected to hierarchical video representation with language structure prior, hierarchical image representation with language structure prior, and hierarchical video representation with visual structure.]
Figure 8.1: The approaches developed in this thesis are based on three modalities: images, videos, and text. We focus on obtaining hierarchical visual representations from parallel image and text data, parallel video and text data, and the pure visual domain with images and videos.
We showed that by modeling the visual representation with a structural prior, we established state-of-the-art performance on video-text retrieval (Chapter 2), zero-shot action recognition (Chapter 2), video captioning (Chapter 2), moment localization in video corpus (Chapter 3), in-domain and cross-dataset image-text retrieval (Chapters 4 & 5), action recognition (Chapter 6), and visual storytelling (Chapter 7).
Chapter 9
Future Work
As shown in the previous chapters, hierarchical visual representations can improve performance over flat visual representations on multiple tasks. But this is not the end of this research direction, as there is still much room to further improve the representation and its performance. This chapter discusses two possible directions for future work: 1) acquiring the visual structure and 2) obtaining a general-purpose visual representation.
9.1 Acquiring the Visual Structure
Obtaining the visual structure is an important and essential step for hierarchical visual representation modeling. In the previous chapters, we showed that we can obtain the fine-grained visual structure (e.g., objects, relations, etc.) from parallel text data, and the coarse-grained visual structure (e.g., appearance and motion) from the pure visual domain. A more intriguing question is how to obtain the fine-grained visual structure from the pure visual domain alone.
Specifically, given an image, we hope for an approach that can automatically extract the objects, their attributes, and their relations from the image. There is a long line of research in this direction.
direction. Scene graph is a promising approach for extracting objects and characterizing their relations from
images and videos. Concretely, scene graph contains objects as the nodes. The edge between the objects
describes the relationship between objects. Each node is also annotated with multiple attributes. Tradition-
ally, human annotations are required for obtaining the scene graph. However, current visual datasets contain
multi-millions of images and videos. It is labor-intensive to annotate the scene graph for such large-scale
datasets. Recently, multiple approaches [106, 114, 179, 184, 192, 202] were proposed to extract the scene
115
Figure 9.1: The example of video scene graph from [75]. Each frame is annotated with objects
and their relations. Annotating the scene graph for image is labor-intensive. Acquiring the scene graph for
videos is a challenging task requiring more research efforts.
116
graph from static images. Krishna et al. [75] extended those approach for extracting scene graph from
videos, as Fig. 9.1. Although there is considerable progress in the research domain, how to extract scene
graphs from videos and images with an open set of object categories and relations is still an open ques-
tion. With the better scene graph extraction method, I believed we can achieve better performance with the
hierarchical visual representation.
9.2 Obtaining a General-purpose Visual Representation
Traditional visual representations are optimized on a specific dataset and a specific task. These representations might not transfer well across datasets and tasks. Acquiring a general-purpose visual representation is an interesting research direction and is still an open question.
In Chapter 5, we took a step in this direction by showing that explicitly grounding and composing visually grounded concepts achieves better in-domain performance and out-of-domain generalization. Chapter 5 focuses on the image and text domain; a more interesting question is whether this type of method can be extended to the pure visual domain. Specifically, images and videos can be decomposed into atomic components (e.g., objects, relations, and attributes). We can first acquire the representations of the atomic components and store them in a feature bank. Then, the image and video representation is simply a re-composition of the atomic components' representations. As the atomic representations generalize across domains, the corresponding image and video representations should achieve better cross-task and cross-dataset generalization ability. Because the atomic feature bank is separate from the method that composes the image and video representation, when new visual components are added we do not need to re-train the whole representation but only augment the atomic feature bank. This would help us obtain the representation more efficiently and effectively.
Bibliography
[1] Melissa Ailem, Bowen Zhang, Aurelien Bellet, Pascal Denis, and Fei Sha. A probabilistic model for
joint learning of word embeddings from texts and images. EMNLP, 2018.
[2] Melissa Ailem, Bowen Zhang, and Fei Sha. Topic augmented generator for abstractive summariza-
tion. Arxiv, 2019.
[3] Hassan Akbari, Linagzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing
Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text.
Arxiv, 2021.
[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei
Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In
CVPR, 2018.
[5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei
Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In
CVPR, 2018.
[6] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid,
Stephen Gould, and Anton van den Hengel. Vision-and-Language Navigation: Interpreting visually-
grounded navigation instructions in real environments. In CVPR, 2018.
[7] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR,
2016.
[8] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell.
Localizing moments in video with natural language. In ICCV, 2017.
[9] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zit-
nick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
[10] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid.
Vivit: A video vision transformer. ICCV, 2021.
[11] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and
Aaron Courville. Systematic generalization: What is required and can it be learned? In ICLR, 2019.
[12] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and
image encoder for end-to-end retrieval. In CVPR, 2021.
[13] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and Michael I
Jordan. Matching words and pictures. JMLR, 2003.
[14] Kobus Barnard and David Forsyth. Learning the semantics of words and pictures. In ICCV, 2001.
[15] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV, 2006.
[16] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence
prediction with recurrent neural networks. In Neurips, 2015.
[17] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video
understanding? NeurIPS, 2021.
[18] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label
noise. In ACML, 2011.
[19] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short
note about kinetics-600. Arxiv, 2018.
[20] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700
human action dataset. Arxiv, 2019.
[21] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics
dataset. In CVPR, 2017.
[22] Wei-Lun Chao, Hexiang Hu, and Fei Sha. Being negative but constructively: Lessons learnt from
creating better visual question answering datasets. NAACL-HLT, 2018.
[23] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Ra-
jagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language ground-
ing. In AAAI, 2018.
[24] Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Wang Changhu. Learning the best pooling
strategy for visual semantic embedding. In CVPR, 2021.
[25] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally grounding natural
sentence in video. In EMNLP, 2018.
[26] Long Chen, Chujie Lu, Siliang Tang, Jun Xiao, Dong Zhang, Chilie Tan, and Xiaolin Li. Rethinking
the bottom-up framework for query-based video localization. In AAAI, 2020.
[27] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and
C Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. ArXiv
1504.00325, 2015.
[28] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and
C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv
preprint:1504.00325, 2015.
[29] Xinlei Chen and C Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption
generation. In CVPR, 2015.
[30] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision
transformers. In CVPR, 2021.
[31] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and
Jingjing Liu. Uniter: Learning universal image-text representations. ECCV, 2020.
[32] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Hol-
ger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for
statistical machine translation. EMNLP, 2014.
[33] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence modeling. Arxiv, 2014.
[34] Guillem Collell and Marie-Francine Moens. Is an image worth more than a thousand words? on the
fine-grain semantic differences between visual and linguistic representations. In COLING, 2016.
[35] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow
and appearance. In ECCV, 2006.
[36] Robert Dale and Ehud Reiter. Computational interpretations of the gricean maxims in the generation
of referring expressions. Cognitive science, 1995.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale
hierarchical image database. In CVPR, 2009.
[38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. NAACL-HLT, 2019.
[39] Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. Word2VisualVec: Cross-media retrieval by visual
feature prediction. Arxiv, 2016.
[40] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is
worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[41] Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, and Dahua Lin. Omni-sourced webly-
supervised learning for video recognition. In ECCV, 2020.
[42] Aviv Eisenschtat and Lior Wolf. Linking image and text with 2-way nets. In CVPR, 2017.
[43] Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell. Temporal localiza-
tion of moments in video collections with natural language. ICCV, 2019.
[44] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic
embeddings with hard negatives. BMVC, 2017.
[45] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving visual-
semantic embeddings with hard negatives. BMVC, 2018.
[46] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and
Christoph Feichtenhofer. Multiscale vision transformers. Arxiv, 2021.
[47] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao,
Xiaodong He, Margaret Mitchell, John C Platt, et al. From captions to visual concepts and back. In
CVPR, 2015.
[48] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network
fusion for video action recognition. In CVPR, 2016.
[49] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski.
Entities as experts: Sparse memory access with entity supervision. EMNLP, 2020.
[50] Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasi-
vam, Rui Zhang, and Dragomir Radev. Improving text-to-SQL evaluation methodology. In ACL,
2018.
[51] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise:
A deep visual-semantic embedding model. In Neurips, 2013.
[52] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via
language query. In ICCV, 2017.
[53] Ruifang Ge and Raymond Mooney. Learning a compositional semantic parser using an existing
syntactic parser. In ACL-IJCNLP, 2009.
[54] Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. Coot: Cooperative
hierarchical transformer for video-text representation learning. In NeurIPS, 2020.
[55] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In AISTATS, 2010.
[56] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A multi-view embedding space for
modeling internet images, tags, and their semantics. IJCV, 2014.
[57] Diana Gonzalez-Rico and Gibran Fuentes-Pineda. Contextualize, show and tell: a neural visual
storyteller. Arxiv, 2018.
[58] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne West-
phal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The
"something something" video database for learning and evaluating visual common sense. In ICCV,
2017.
[59] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving
textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
[60] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey vision conference,
1988.
[61] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked au-
toencoders are scalable vision learners. Arxiv, 2021.
[62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In CVPR, 2016.
[63] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A
large-scale video benchmark for human activity understanding. In CVPR, 2015.
[64] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell.
Localizing moments in video with natural language. In ICCV, 2017.
[65] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell.
Localizing moments in video with temporal language. In EMNLP, 2018.
[66] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David
Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language
learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
[67] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
[68] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task:
Data, models and evaluation metrics. JAIR, 2013.
[69] Hexiang Hu, Wei-Lun Chao, and Fei Sha. Learning answer embeddings for visual question answer-
ing. In CVPR, 2018.
[70] Hexiang Hu, Ishan Misra, and Laurens van der Maaten. Evaluating text-to-image matching using
binary image selection (BISON). In ICCV workshop, 2019.
[71] Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hi-
erarchically structured reinforcement learning for topically coherent visual story generation. AAAI,
2019.
[72] Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal,
Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling.
In NAACL-HLT, 2016.
[73] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and order for
image and sentence matching. In CVPR, 2018.
[74] Phillip Isola, Joseph J Lim, and Edward H Adelson. Discovering states and transformations in image
collections. In CVPR, 2015.
[75] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compo-
sitions of spatio-temporal scene graphs. In CVPR, 2020.
[76] J. Johnson, B. Hariharan, L.J.P. van der Maaten, L. Fei-Fei, C.L. Zitnick, and R.B. Girshick. Clevr:
A diagnostic dataset for compositional language and elementary visual reasoning. CVPR, 2017.
[77] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert:
Improving pre-training by representing and predicting spans. T-ACL, 2020.
[78] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions.
In CVPR, 2015.
[79] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei.
Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[80] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya-
narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action
video dataset. arXiv preprint arXiv:1705.06950, 2017.
[81] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to
objects in photographs of natural scenes. In EMNLP, 2014.
[82] Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin,
Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang,
Marc van Zee, and Olivier Bousquet. Measuring compositional generalization: A comprehensive
method on realistic data. In ICLR, 2020.
[83] Douwe Kiela and Léon Bottou. Learning Image Embeddings using Convolutional Neural Networks
for Improved Multi-Modal Semantics. In EMNLP, 2014.
[84] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2014.
[85] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with
multimodal neural language models. Neurips Workshop, 2014.
[86] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with
multimodal neural language models. NeurIPS Workshop Deep Learning, 2014.
[87] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. In ACL, 2018.
[88] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly,
and Neil Houlsby. Big transfer (bit): General visual representation learning. In ECCV, 2020.
[89] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing
Gong. Movinets: Mobile video networks for efficient video recognition. In CVPR, 2021.
[90] Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. What are you talking about?
text-to-image coreference. In CVPR, 2014.
[91] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning
events in videos. In ICCV, 2017.
[92] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolu-
tional neural networks. In Neurips, 2012.
[93] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C
Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. T-
PAMI, 2013.
[94] Alice Lai and Julia Hockenmaier. Learning to predict denotational probabilities for modeling entail-
ment. In ACL, 2017.
[95] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills
of sequence-to-sequence recurrent networks. In ICML, 2018.
[96] Ivan Laptev. On space-time interest points. IJCV, 2005.
[97] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for
image-text matching. In ECCV, 2018.
[98] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. TVR: A large-scale dataset for video-subtitle
moment retrieval. arXiv:2001.09099, 2020.
[99] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder
for vision and language by cross-modal pre-training. ArXiv 1908.06066, 2019.
[100] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs
and documents. ACL, 2015.
[101] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Visual semantic reasoning for image-
text matching. In ICCV, 2019.
[102] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical
encoder for video+language omni-representation pre-training, 2020.
[103] Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou,
Xin Eric Wang, William Yang Wang, et al. Value: A multi-task benchmark for video-and-language
understanding evaluation. Arxiv, 2021.
[104] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A
simple and performant baseline for vision and language. ArXiv 1908.03557, 2019.
[105] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly localizing and describing
events for dense video captioning. In CVPR, 2018.
[106] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation
from objects, phrases and region captions. In CVPR, 2017.
[107] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object composi-
tions. In CVPR, 2020.
[108] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[109] Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. Cross-modal
moment localization in videos. In ACM MM, 2018.
[110] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. Arxiv, 2019.
[111] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[112] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin
transformer. Arxiv, 2021.
[113] David G Lowe. Object recognition from local scale-invariant features. In CVPR, 1999.
[114] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with
language priors. In ECCV, 2016.
[115] Chujie Lu, Long Chen, Chilie Tan, Xiaolin Li, and Jun Xiao. DEBUG: A dense bottom-up grounding
approach for natural language video localization. In EMNLP, 2019.
[116] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. VilBERT: Pretraining task-agnostic visiolin-
guistic representations for vision-and-language tasks. In NeurIPS, 2019.
[117] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-
based neural machine translation. EMNLP, 2015.
[118] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
[119] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-
symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In
ICLR, 2019.
[120] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
[121] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with
context. In CVPR, 2017.
[122] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yam-
aguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. Midge: Generating image descriptions from
computer vision detections. In EACL, 2012.
[123] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. Learning
joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR, 2018.
[124] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan,
Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million
videos for event understanding. PAMI, 2019.
[125] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reason-
ing and matching. In CVPR, 2017.
[126] Cisco Visual Networking. Cisco global cloud index: Forecast and methodology, 2016–2021. White
paper. Cisco Public, San Jose, 2016.
[127] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Hierarchical multimodal lstm for
dense visual-semantic embedding. In ICCV, 2017.
[128] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical recurrent neural
encoder for video representation with application to captioning. In CVPR, 2016.
[129] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly modeling embedding and
translation to bridge video and language. In CVPR, 2016.
[130] Cesc C Park and Gunhee Kim. Expressing an image stream with a sequence of natural sentences. In
Neurips, 2015.
[131] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
networks. In ICML, 2013.
[132] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word
representation. In EMNLP, 2014.
[133] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual
reasoning with a general conditioning layer. In AAAI, 2018.
[134] Sandro Pezzelle and Raquel Fernández. Is the red square big? malevic: Modeling adjectives leverag-
ing visual contexts. In EMNLP, 2019.
[135] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and
Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-
to-sentence models. In ICCV, 2015.
[136] Linlu Qiu, Hexiang Hu, Bowen Zhang, Peter Shaw, and Fei Sha. Systematic generalization on gscan:
What is nearly solved and what is next? EMNLP, 2021.
[137] Zhaofan Qiu, Ting Yao, and Tao Mei. Deep quantization: Encoding convolutional activations with
deep generative model. In CVPR, 2017.
[138] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. Technical Report, 2019.
[139] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred
Pinkal. Grounding action descriptions in videos. T-ACL, 1, 2013.
[140] Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M Lake. A benchmark
for systematic generalization in grounded language understanding. In NeurIPS, 2020.
[141] Stuart Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson, 2002.
[142] Michael S Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Token-
learner: What can 8 learned tokens do for images and videos? NeurIPS, 2021.
[143] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face
recognition and clustering. In CVPR, 2015.
[144] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm ap-
proach. In ICPR, 2004.
[145] Dian Shao, Yu Xiong, Yue Zhao, Qingqiu Huang, Yu Qiao, and Dahua Lin. Find and focus: Retrieve
and localize video events with natural language queries. In ECCV, 2018.
[146] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[147] Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. Compositional gener-
alization and natural language variation: Can a semantic parsing approach handle both? In ACL,
2020.
[148] Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. Visually grounded neural syntax
acquisition. ACL, 2019.
[149] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image
captioning via personality. CVPR, 2019.
[150] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition
in videos. In Neurips, 2014.
[151] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[152] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. Grounded compositional semantics for finding and describing images with sentences. T-ACL, 2:207–218, 2014.
[153] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions
classes from videos in the wild. Arxiv, 2012.
[154] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-
training of generic visual-linguistic representations. ICLR, 2020.
[155] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reason-
ing. In ACL, 2017.
[156] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint
model for video and language representation learning. In ICCV, 2019.
[157] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
In Neurips, 2014.
[158] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from trans-
formers. In EMNLP, 2019.
[159] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural net-
works. In ICML, 2019.
[160] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[161] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spa-
tiotemporal features with 3d convolutional networks. In ICCV, 2015.
[162] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust visual-
semantic embeddings. In ICCV, 2017.
[163] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[164] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and
language. ICLR, 2015.
[165] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell,
and Kate Saenko. Sequence to sequence-video to text. In ICCV, 2015.
[166] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate
Saenko. Translating videos to natural language using deep recurrent neural networks. NAACL-HLT,
2015.
[167] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image
caption generator. In CVPR, 2015.
[168] Jing Wang, Jianlong Fu, Jinhui Tang, Zechao Li, and Tao Mei. Show, reward and tell: Automatic
generation of narrative paragraph from photo stream by adversarial training. In AAAI, 2018.
[169] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. Tdn: Temporal difference networks for efficient
action recognition. In CVPR, 2021.
[170] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool.
Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[171] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool.
Temporal segment networks for action recognition in videos. Arxiv, 2017.
[172] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang,
Luowei Zhou, and Lu Yuan. Bevt: Bert pretraining of video transformers. Arxiv, 2021.
[173] Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adver-
sarial reward learning for visual storytelling. ACL, 2018.
[174] Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, and Alan Yuille. Scene graph parsing as dependency
parsing. NAACL, 2018.
[175] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp
Krähenbühl. Compressed video action recognition. CVPR, 2018.
[176] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified
visual-semantic embeddings: Bridging vision and language with structured meaning representations.
In CVPR, 2019.
[177] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation
system: Bridging the gap between human and machine translation. Arxiv, 2016.
[178] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. CVPR, 2017.
[179] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative
message passing. In CVPR, 2017.
[180] Huijuan Xu, Kun He, Bryan Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. Multilevel
language and vision integration for text-to-clip retrieval. In AAAI, 2019.
[181] Huijuan Xu, Kun He, Leonid Sigal, Stan Sclaroff, and Kate Saenko. Text-to-clip video retrieval with
early fusion and re-captioning. Arxiv, 2018.
[182] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich
Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual
attention. In ICML, 2015.
[183] Ran Xu, Caiming Xiong, Wei Chen, and Jason J. Corso. Jointly modeling deep video and composi-
tional text to bridge vision and language in a unified framework. In AAAI, 2015.
[184] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph
generation. In ECCV, 2018.
[185] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural-
symbolic VQA: Disentangling reasoning from vision and language understanding. In NeurIPS, 2018.
[186] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions. T-ACL, 2:67–78,
2014.
[187] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning using
hierarchical recurrent neural networks. In CVPR, 2016.
[188] Licheng Yu, Mohit Bansal, and Tamara L Berg. Hierarchically-attentive rnn for album summarization
and storytelling. EMNLP, 2017.
[189] Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. Semantic conditioned dynamic
modulation for temporal sentence grounding in videos. In NeurIPS, 2019.
[190] Yitian Yuan, Tao Mei, and Wenwu Zhu. To find where you talk: Temporal sentence localization in
video with attention based location regression. In AAAI, 2019.
[191] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual
commonsense reasoning. In CVPR, 2019.
[192] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing
with global context. In CVPR, 2018.
[193] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers.
Arxiv, 2021.
[194] Bowen Zhang, Hao Chen, Meng Wang, and Yuanjun Xiong. Online action detection in streaming
videos with time buffers. Arxiv, 2020.
[195] Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, and Fei Sha. Learning to represent image and
text with denotation graph. In EMNLP, 2020.
[196] Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, and
Fei Sha. A hierarchical multi-modal encoder for moment localization in video corpus. Arxiv, 2020.
[197] Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, and Fei Sha. Visually grounded concept compo-
sition. Findings of EMNLP, 2021.
[198] Bowen Zhang, Hexiang Hu, and Fei Sha. Cross-modal and hierarchical modeling of video and text.
In ECCV, 2018.
[199] Bowen Zhang, Hexiang Hu, and Fei Sha. Visual storytelling via predicting anchor word embeddings
in the stories. ICCV Workshop, 2019.
[200] Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. Real-time action recognition with
enhanced motion vector cnns. In CVPR, 2016.
[201] Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M Dai, Ruoming Pang, and Fei Sha.
Co-training transformer with videos and images improves action recognition. Arxiv, 2021.
[202] Ji Zhang, Kevin J Shih, Ahmed Elgammal, Andrew Tao, and Bryan Catanzaro. Graphical contrastive
losses for scene graph parsing. In CVPR, 2019.
[203] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term
memory. In ECCV, 2016.
[204] Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic,
and Joseph Tighe. Vidtr: Video transformer without convolutions. In CVPR, 2021.
[205] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action
detection with structured segment networks. ICCV, 2017.
[206] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering
in images. In CVPR, 2016.
[207] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. ICLR, 2017.
Appendices
A Appendix for Cross-Modal and Hierarchical Video Modeling via Text
A.1 Implementation Details
A.1.1 Video and Text Features
C3D Features. Similar to [91], we follow the standard ActivityNet setting and use the C3D [161] features from [63] for retrieval and dense captioning [91]. In all our experiments under this setting, we extract frame-wise video features using a C3D model pre-trained on the Sports-1M dataset, with a temporal stride of 16. PCA is then applied to reduce the feature dimension to 500.
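A sketch of this PCA step with scikit-learn is shown below; the file paths and the assumption that the raw C3D activations are 4,096-dimensional are illustrative rather than taken from the text.

import numpy as np
from sklearn.decomposition import PCA

# c3d_features: frame-wise C3D features stacked over the corpus,
# assumed here to be of shape (num_clips, 4096); hypothetical path.
c3d_features = np.load("c3d_features.npy")

pca = PCA(n_components=500)
reduced = pca.fit_transform(c3d_features)   # (num_clips, 500)
np.save("c3d_features_pca500.npy", reduced)  # hypothetical output path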
TSN-Inception V3 Features. To leverage the state of the art in video modeling, we extract more recent deep features for retrieval on ActivityNet [91] and DiDeMo [8], using an Inception V3 model pre-trained on the Kinetics [80] dataset (provided by [170]). Following their settings, we resize video frames to a resolution of 299 × 299. We then feed the video frames into the Inception V3 model and extract the output activations of the penultimate layer. Unlike [170], we do not perform any test-time data augmentation (e.g., multiple crops, color jitter, etc.). Note that no fine-tuning is performed on either ActivityNet or DiDeMo.
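A sketch of the penultimate-layer feature extraction is shown below, using torchvision's ImageNet Inception V3 as a stand-in; in our experiments the weights come from the Kinetics-pretrained TSN release [170] instead, and the normalization constants and frame path here are assumptions.

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Stand-in model: ImageNet weights from torchvision; the actual weights are
# the Kinetics-pretrained ones provided by [170].
model = models.inception_v3(pretrained=True)
model.fc = nn.Identity()   # expose the 2048-d penultimate activations
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet stats (assumed)
                         std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame_000001.jpg").convert("RGB")  # hypothetical frame path
with torch.no_grad():
    feature = model(preprocess(frame).unsqueeze(0))    # (1, 2048)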
Word Features. In the retrieval-related experiments, we always initialize the word embeddings with GloVe vectors [132] and fine-tune them. Specifically, we use the GloVe vectors pre-trained on 840B tokens of common web-crawled data, with dimensionality 300.
A.1.2 Training Details
When the learning of hierarchical embedding is applicable, we feed the entire video/paragraph in its frame-
wise/word-wise representations through the low-level encoder, and then input the subsequent low-level em-
bedding to the high-level encoder as its initial hidden state. In all our experiments, we use GRU [32] with
its hidden dimension to be 1,024 as our sequence encoder and decoder. To obtain the embedding for a
sequence, we take the channel-wise max over all output vectors of the GRU as it empirically outperforms
other strategies such as [165].
During training, we use the Adam [84] optimizer with an initial learning rate of 0.001, decayed by a factor of 10 every 10 epochs. We use Xavier initialization [55] for each affine layer in our model, with zero mean and a variance of 0.01. We set all margins in the loss functions to 0.2. Each loss is normalized by its batch size. On both the ActivityNet and DiDeMo datasets, we train our embedding models for 15 epochs and collect the final results.
A.2 Additional Experiments
A.2.1 Ablation Study with Different Learning Objectives
We report ablation studies of the different losses on the ActivityNet video and paragraph retrieval task in Table A.1. We use the Inception-V3 features and follow the same setting used for training HSE. Each time we remove one loss and report the performance. The reconstruction loss and the low-match loss turn out to be the most useful.
Table A.1: Ablation study on the learning objectives.
                           Paragraph ⇒ Video      Video ⇒ Paragraph
Method                     R@1      R@5           R@1      R@5
HSE w/o high-cluster       44.6     76.4          44.2     76.1
HSE w/o low-match          40.9     73.6          39.8     73.6
HSE w/o low-cluster        44.6     76.6          43.9     76.4
HSE w/o reconstruction     43.9     75.8          43.3     75.3
HSE w/ all losses          44.4     76.7          44.2     76.7
A.2.2 Low-level Loss is Beneficial
As mentioned in the main text (see Tables 1 and 2 there), learning with low-level objectives is beneficial for our full model. To better understand this, we also plot the recall (in %) as a function of the rank of the ground-truth video/paragraph for each query as supporting evidence. The results are shown in Fig. A.1.
[Figure A.1 shows four Recall-vs-Rank curves (rank 0 to 15, recall 0 to 1): panel (a) compares FSE, HSE[τ=0] w/o L_low, and HSE[τ=0]; panel (b) compares FSE, HSE w/o L_low, and HSE; each panel contains a Paragraph ⇒ Video and a Video ⇒ Paragraph plot.]
Figure A.1: Recall vs. Rank curves of Video to Paragraph and Paragraph to Video retrieval for both (a) HSE[τ=0] and (b) HSE. All results are collected from models based on the Inception-V3 feature on ActivityNet validation set 1.
A.2.3 Ablation Study on Reconstruction Balance Term
Here we study the influence of the loss balance term by experimenting with multiple choices of τ in a controlled setting. We conduct this study on validation set 2 (val2) of ActivityNet with the Inception V3 visual features as input. Detailed results are shown in Table A.2. Retrieval performance (R@1 and R@5) peaks at τ = 0.0005. Therefore, as stated in the main text, we set τ to 0.0005 in all our experiments.
Table A.2: Ablation study of τ on ActivityNet (val2).
                         Paragraph ⇒ Video                Video ⇒ Paragraph
                         R@1    R@5    R@50   MR          R@1    R@5    R@50   MR
Inception-V3 pre-trained on Kinetics [171]
HSE[τ=0.05]              25.0   54.9   92.6   5.0         25.1   55.4   92.4   4.0
HSE[τ=0.005]             32.4   62.2   93.8   3.0         32.1   63.0   93.7   3.0
HSE[τ=0.0005]            33.2   62.9   93.6   3.0         32.6   62.8   93.5   3.0
HSE[τ=0.00005]           33.2   62.9   93.8   3.0         32.2   62.5   93.6   3.0
HSE[τ=0]                 32.2   61.5   93.6   3.0         31.5   62.0   93.3   3.0
A.2.4 Performance on ActivityNet Validation Set 2
As mentioned in the main paper, we report the val2 performance of FSE, HSE[τ=0], and HSE in Table A.3. The results again verify the paper's claim: HSE consistently improves over both FSE and HSE[τ=0], which shows the importance of hierarchical modeling and feature reconstruction.
Table A.3: Performance of video and paragraph retrieval on ActivityNet (val2). Standard deviations over 3 random seeds are also reported.
                      Paragraph ⇒ Video                                    Video ⇒ Paragraph
                      R@1        R@5        R@50       MR                  R@1        R@5        R@50       MR
C3D feature with dimensionality reduction [161]
FSE                   11.5±0.2   31.0±0.4   75.9±0.2   14.0                11.0±0.5   30.6±0.3   75.5±0.4   14.0
HSE[τ=0]              23.3±0.5   48.2±0.2   84.5±0.4   6.0                 23.0±0.3   47.9±0.2   84.6±0.2   6.0
HSE[τ=0.0005]         23.9±0.3   49.4±0.3   85.3±0.2   6.0                 23.4±0.5   49.4±0.4   85.5±0.3   6.0
Inception-V3 pre-trained on Kinetics [171]
FSE                   16.0±0.2   41.8±0.4   88.0±0.5   8.0                 15.1±0.7   41.0±0.4   87.7±0.5   8.0
HSE[τ=0]              32.3±0.2   62.2±0.7   93.5±0.1   3.0                 32.0±1.0   61.9±0.2   93.3±0.1   3.0
HSE[τ=0.0005]         32.9±0.4   62.7±0.2   93.9±0.4   3.0                 32.6±0.1   63.0±0.2   93.7±0.2   3.0
B Appendix for Hierarchical Video Modeling for Moment Localization
B.1 Additional Implementation Details
B.1.1 Visual Feature Representation
We use three different kinds of visual features throughout our experiments, i.e., ResNet-152 [62], I3D [21], and their combination. On the ActivityNet Captions dataset, we report model performance with ResNet-152 features for a fair comparison with prior methods, and additionally report results using the widely used I3D features. On the TVR dataset, we follow the setting in [98] and report results using the concatenation of ResNet-152 and I3D features. The details of how these features are extracted are given below.
ResNet-152 Feature. For all frames in a given video, we extract 2048-dimensional features from the penultimate layer of a ResNet-152 [62] model pre-trained on the ImageNet dataset. For the ActivityNet Captions dataset, frames are extracted at a rate of 6 FPS.
I3D Feature. We use an I3D model [21] to extract spatio-temporal visual features with a dimension of 1,024. The I3D model used for feature extraction is pre-trained on the Kinetics-400 [80] dataset. As with the ResNet-152 features, we take the features from the penultimate layer of the I3D model. For the ActivityNet Captions dataset, the I3D features are extracted at a frame rate of 1 FPS.
[Figure B.1 shows three qualitative moment-localization examples on TVR, comparing the segments predicted by HAMMER, its clip encoder, and its frame encoder against the ground truth, for the queries "Castle and Beckett sit talking to Stryker about alien existence.", "Robert was having a conversation with the patient lying on the hospital bed.", and "Izzie sobs as Alex sits next to her."]
Figure B.1: Illustration of temporal localization results using HAMMER and its individual hierarchies, the frame and clip encoders. The top two are successful examples and the bottom one is a failure case.
I3D+ResNet-152 Feature. For the TVR dataset, we use the I3D+ResNet-152 features provided by Lei et al. [98] to represent the visual information in the videos. The I3D and ResNet-152 models are pre-trained on the Kinetics-600 [19] and ImageNet datasets, respectively. For both models, the features from the penultimate layers are used. The ResNet-152 features are extracted at a rate of 3 FPS and max-pooled over each 1.5-second clip. The I3D features are extracted every 1.5 seconds as well. The two sets of features are then concatenated to form the combined 3072-dimensional feature.
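A hedged sketch of how the two feature streams could be combined as described (max-pooling the frame-level ResNet-152 features per clip and concatenating them with the I3D features) is given below; the exact temporal alignment used in [98] may differ.

```python
# Illustrative sketch of the feature combination described above.
import numpy as np

def combine_features(resnet_feats, i3d_feats):
    # resnet_feats: (num_frames, 2048) frame-level features at 3 FPS.
    # i3d_feats:    (num_clips, 1024) clip-level features, one per 1.5 s.
    num_clips = i3d_feats.shape[0]
    # Split the frames into num_clips chunks and max-pool within each chunk.
    chunks = np.array_split(resnet_feats, num_clips)
    pooled = np.stack([c.max(axis=0) for c in chunks])        # (num_clips, 2048)
    return np.concatenate([pooled, i3d_feats], axis=1)        # (num_clips, 3072)

# Toy usage with random features for a 40-clip video.
combined = combine_features(np.random.randn(120, 2048), np.random.randn(40, 1024))
print(combined.shape)   # (40, 3072)
```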
B.1.2 Subtitle (ASR) Feature Representation
Previous work [98] has demonstrated that subtitles (e.g., extracted from ASR) can complement the visual information in video-and-language tasks. For the TVR dataset, we follow the standard setting and use the pre-extracted ASR embeddings provided by Lei et al. [98] as an additional input to our models. Contextualized token-level subtitle embeddings are first generated using a 12-layer RoBERTa [110] model fine-tuned on the TVR train split. The token embeddings are then max-pooled every 1.5 seconds to obtain an aggregated 768-dimensional feature vector. A zero vector of the same dimensionality is used for frames without corresponding subtitles. The resulting subtitle embeddings are temporally aligned to the visual features (I3D+ResNet-152), allowing us to combine the two modalities later in the cross-modal encoders. We refer the reader to Lei et al. [98] for more details on the feature extraction process.
B.1.3 Model Architecture with ASR input
The general architecture of the HAMMER model is illustrated in Figure 1 of the main text. It consists of 2
hierarchical encoders (i.e., the frame and clip encoders) that have the same structure, and two input streams,
query and video. When only the visual features are present (e.g., ActivityNet Captions), the video encoder
contains only the visual encoder. Each hierarchical encoder contains 5 standard Transformer layers for the
query input and 1 Transformer layer for the visual input. There is an additional cross-modal Transformer
layer between the query and visual representations.
When ASR is provided as another input stream (e.g., in TVR), we add another branch to the video encoder in each of the hierarchical encoders to process the ASR input. The ASR and visual branches have a similar structure, each with 1 Transformer layer. The pre-extracted ASR embeddings and the visual features both attend to the query representation to form their contextualized representations. The query embeddings in turn attend to the ASR and visual modalities separately. The resulting ASR- and visual-grounded query representations are then added together along the feature dimension, followed by a normalization layer and a dropout layer. Finally, the query-grounded ASR and visual representations are concatenated to form the frame-level and clip-level representations for the two hierarchical encoders, respectively.
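The following is a hedged PyTorch sketch of our reading of this fusion step, using standard multi-head attention modules; it is illustrative and not the released HAMMER implementation.

```python
# Sketch of the ASR/visual fusion described above; names and sizes are assumptions.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, dim=768, heads=12, dropout=0.1):
        super().__init__()
        self.q_to_asr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.asr_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, query, asr, vis):
        # query/asr/vis: (batch, len, dim); asr and vis are temporally aligned.
        # ASR and visual features attend to the query (contextualization).
        asr_ctx, _ = self.asr_to_q(asr, query, query)
        vis_ctx, _ = self.vis_to_q(vis, query, query)
        # The query attends to each modality separately; the grounded query
        # representations are summed, normalized, and dropped out.
        q_asr, _ = self.q_to_asr(query, asr, asr)
        q_vis, _ = self.q_to_vis(query, vis, vis)
        q_out = self.drop(self.norm(q_asr + q_vis))
        # Query-grounded ASR and visual representations are concatenated to form
        # the video-side representation at this hierarchy level.
        video_out = torch.cat([asr_ctx, vis_ctx], dim=-1)
        return q_out, video_out

fusion = TwoStreamFusion()
q, a, v = torch.randn(2, 20, 768), torch.randn(2, 64, 768), torch.randn(2, 64, 768)
q_out, video_out = fusion(q, a, v)     # video_out: (2, 64, 1536)
```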
B.1.4 Model Optimization
For ActivityNet Captions, we train the models with a mini-batch size of 64 and optimize them using
Adam [84] with a maximum learning rate of 4e− 5. The learning rate increases linearly from 0 to the
max rate in the first 10% training epochs and then drops to 0.1× and0.01× of the max rate at 50% and 75%
of the training epochs, respectively. We set the maximum video sequence length to be 128, and experiment
with clip lengths varying from 16 to 64 (refer to Table 3.7 in the main text).
For TVR data, we train the model with a batch size of 128. We use the same learning rate schedule as
mentioned above but with a maximum learning rate of2e− 4. We set the maximum video sequence length
to 96, and experiment with clip lengths varying from 8 to 48.
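A small sketch of this learning-rate schedule (linear warm-up over the first 10% of training, then step drops to 0.1× and 0.01× of the peak at 50% and 75%) is shown below, assuming a step-based implementation.

```python
# Sketch of the schedule described above; expressed per optimization step.
def learning_rate(step, total_steps, max_lr=4e-5, warmup_frac=0.10):
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)   # linear warm-up
    progress = step / total_steps
    if progress < 0.50:
        return max_lr
    if progress < 0.75:
        return 0.1 * max_lr
    return 0.01 * max_lr

# Example: max_lr=4e-5 for ActivityNet Captions, or max_lr=2e-4 for TVR.
print([learning_rate(s, 1000) for s in (0, 50, 400, 600, 900)])
```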
B.1.5 Model Initialization
We do not pre-train HAMMER on any dataset. We randomly initialize the visual and ASR branches and
the cross-attention Transformer layers. For the text query branch, following prior work [116], we initialize
from the first 5 layers of a pre-trained BERT model [38], and use the uncased WordPiece tokenizer [177] to
tokenize the text input. The vocabulary size of the tokenizer is 30,522.
B.2 Illustration on the TVR Dataset
Figure B.1 illustrates temporal localization results on the TVR dataset using predictions from HAMMER and its frame and clip encoders. In the top two examples, HAMMER successfully localizes the video segments described by the respective queries with the help of the clip encoder, even though the frame encoder makes erroneous predictions. In the bottom example, the clip encoder picks the incorrect video clip, causing HAMMER to only partially capture the video segment described by the query. These examples show the important roles played by the two hierarchical encoders: the clip encoder is responsible for choosing the video clips that best match the query, while the frame encoder refines the predictions within the chosen clips.
C Appendix for Modeling Image Hierarchically With Denotation Graph
C.1 Implementation Details
C.1.1 Constructing Denotation Graphs
We summarize the procedures used to extract DG from vision+ language datasets. For details, please refer
to [186]. We used the publicly available tool
1
. The analysis consists of several steps: (1) spell-checking;
(2) tokenize the sentences into words; (3) tag the words with Part-of-Speech labels and chunk works into
phrases; (4) abstract semantics by using the WordNet [120] to construct a hypernym lexicon table to replace
the nouns with more generic terms; (5) apply 6 types of templated rules to create fine-to-coarse ( i.e., specific
to generic) semantic concepts and connect the concepts with edges.
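For illustration only, the hypernym-abstraction idea of step (4) could be sketched with NLTK's WordNet interface as below; the released DG tool builds a curated hypernym lexicon table, so this is not its actual implementation.

```python
# Rough illustration of replacing a noun with a more generic term via WordNet
# hypernyms; requires nltk.download("wordnet") beforehand.
from nltk.corpus import wordnet as wn

def generalize_noun(noun, levels=1):
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return noun
    synset = synsets[0]                      # most common sense (a crude choice)
    for _ in range(levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return synset.lemma_names()[0].replace("_", " ")

print(generalize_noun("dog"))                # e.g. a more generic animal term
print(generalize_noun("dog", levels=3))      # an even more abstract concept
```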
We set 3 as the maximum number of levels (counting from the sentence level) at which to extract abstract semantic concepts. This is due to the computation budget we can afford, as the final graphs can be huge in both the number of nodes and the number of edges.
Table C.1: Text-based image retrieval performance of ViLBERT trained with different numbers of DG levels.
# of DG levels    R@1    R@5    R@10   RSUM
3 levels          65.9   91.4   95.5   252.7
5 levels          62.5   86.4   92.3   241.2
7 levels          62.8   86.3   91.6   240.7
Specifically, without the maximum-level constraint, we have 2.83M concept nodes in total for the Flickr dataset. If training were run on all of these nodes, it would take 19 times more iterations than training on the original dataset, which has 145K sentences [186]; as a result, every experiment would require much more time. With the 3 layers of DG above the leaf concepts, we have 597K nodes, and the training time is cut down to 4.1 times that of the original dataset.
Nonetheless, we experimented with more than 3 levels, training ViLBERT + DG-FLICKR30K with maximum depths of 5 and 7 levels, respectively. The training hyper-parameters remain the same as for ViLBERT + DG-FLICKR30K with 3 maximum levels. The aim is to check how much gain we could obtain from the additional annotations. We report the results in Table C.1. The model trained with 3 levels of DG actually achieves the best performance. This might be because the high-level layers of the DG (counting from the sentences) contain very abstract text concepts, such as "entity" and "physical object", which are uninformative for learning visual grounding.
Once the graph is constructed, we attach the images to the proper nodes by taking the set union of the images of each node's children, starting from the sentence-level nodes.
C.1.2 Model architectures of ViLBERT and UNITER
A comparison of these models is schematically illustrated in Fig. C.1.
• ViLBERT. It has 6 basic Transformer layers for text and 8 layers for the image. For all the Transformer layers on the text side, we use 12 attention heads and 256 feature dimensions, then linearly project down to 1024 feature dimensions. For all the Transformers on the image side, we use 8 attention heads and 128 feature dimensions, then combine into 1024 feature dimensions as well.
• UNITER. All the Transformer layers have 12 heads and 256 feature dimensions.
The major difference between UNITER and ViLBERT is how attention is used. In ViLBERT, one modality is used as the query and the other as the key and value. In UNITER, however, both modalities are used as the query, key, and value.
[Figure C.1 schematically shows (a) ViLBERT: separate image and text encoders followed by interleaved image/text co-attention Transformers and single-modality Transformer layers, with MLP-and-pooling heads whose outputs are combined into the matching score; and (b) UNITER: image and text encoders feeding a single cross-modal Transformer whose [CLS] embedding produces the matching score.]
Figure C.1: Architecture of (a) ViLBERT and (b) UNITER. The circled-product symbol in the figure denotes the element-wise product; [CLS] denotes the embedding of the [CLS] token in the last UNITER layer.
Additionally, UNITER is similar to another model, Unicoder-VL [99]; however, the latter has not released publicly available code for experimentation.
For the ViLBERT model, each text and image co-attention Transformer layer contains 8 attention heads with 1024 dimensions in total. The text Transformer layers contain 12 attention heads with 3072 hidden dimensions in total, whereas the image Transformer layers have 8 attention heads with 1024 hidden dimensions in total. For the UNITER model, each cross-attention Transformer layer contains 12 heads with 3072 hidden dimensions in total.
The ViLBERT model contains 121 million parameters, while UNITER contains 111 million parameters.
C.1.3 Training Details
All models are optimized with the Adam optimizer [84]. The learning rate is initialized to 4e-5. Following ViLBERT [116], a warm-up training session is employed, during which we linearly increase the learning rate from 0 to 4e-5 over the first 1.5% of the training epochs. The learning rate is dropped to 4e-6 and 4e-7 at the 10th and 15th epochs, respectively. For ViLBERT (Reduced), we randomly initialize the model parameters in the image stream; the text stream is initialized from the first 3 layers of the pre-trained BERT model, and its co-attention Transformer layers are randomly initialized. For ViLBERT (Full) and UNITER [31], we load the models' weights pre-trained on the Conceptual Captions dataset to initialize them.
Training ViLBERT (Full) + DG with a minibatch size of 64 takes 2 to 3 days on a server with 8 Titan Xp GPUs, or 1 day on a Cloud TPU v2. The GPU server is equipped with an Intel Xeon Gold 6154 CPU and 256GB of RAM.
C.1.4 Text Pre-processing
We follow BERT [38] and use the WordPiece tokenizer [177] to tokenize the text. For ViLBERT (Reduced) and ViLBERT (Full), we use the uncased tokenizer with a vocabulary size of 30,522. For UNITER, we use the cased tokenizer with a vocabulary size of 28,996. After tokenization, the tokens are transformed into 768-dimensional features by a word embedding initialized from the pre-trained BERT model. 768-dimensional position features are also included in the input to represent the position of each token.
C.1.5 Visual Pre-processing
For both ViLBERT and UNITER, we use the image patch features generated by the Bottom-Up Attention model, as suggested by the original papers [5]. The image patch features contain up to 100 image patches per image, each with a dimension of 2048. In addition, a positional feature is used to represent the spatial location of the bounding boxes for both ViLBERT and UNITER. Specifically, ViLBERT uses a 5-dimensional position feature that encodes the normalized coordinates of the upper-left and lower-right corners of the bounding box, as well as one additional dimension encoding the normalized patch size. UNITER uses two additional spatial features that encode the normalized width and height of the object bounding box.
C.2 Full Experimental Results
In this section, we include additional experimental results referred to in the main text. Specifically, we include results from a variety of models (e.g., ViLBERT, ViLBERT + DG, UNITER, and UNITER + DG) on the COCO dataset 5K test split [78] in § C.2.1. We then provide a comprehensive ablation study on the impact of λ1 and λ2 in Eq. 7 of the main text in § C.2.3.
C.2.1 Complete Results on COCO Dataset
We report the full results on the COCO dataset (1K test split and 5K test split) in Table C.2 and Table C.3. Additionally, we contrast with other existing approaches on these tasks.
Table C.2: Results on cross-modal retrieval on the COCO dataset, 1K test split (higher is better).

Text-based Image Retrieval
Method                              R@1    R@5    R@10   RSUM
Models run or implemented by us
ViLBERT                             62.3   89.5   95.0   246.8
ViLBERT + DG                        65.9   91.4   95.5   252.7
UNITER                              60.7   88.0   93.8   242.5
UNITER + DG                         62.7   88.8   94.4   245.9
Known results from the literature
VSE++ [45]                          52.0   84.3   92.0   228.3
SCO [73]                            56.7   87.5   94.8   239.0
SCAN [97]                           58.8   88.4   94.8   242.0
VSRN [101]                          62.8   89.7   95.1   247.6

Image-based Text Retrieval
Method                              R@1    R@5    R@10   RSUM
Models run or implemented by us
ViLBERT                             77.0   94.1   97.2   268.3
ViLBERT + DG                        79.0   96.2   98.6   273.8
UNITER                              74.4   93.9   97.1   265.4
UNITER + DG                         77.7   95.0   97.5   270.2
Known results from the literature
VSE++ [45]                          64.6   90.0   95.7   250.3
SCO [73]                            69.9   92.9   97.5   260.3
SCAN [97]                           72.7   94.8   98.4   265.9
VSRN [101]                          76.2   94.8   98.2   269.2
ViLBERT + DG and UNITER + DG improve over their counterparts without DG by a significant margin on both the COCO 1K and 5K test splits; the only exception is that, on image-based text retrieval on the 5K split, UNITER performs better than UNITER + DG.
These results support our claim that training with DG helps the model learn better visual and linguistic features. Although ViLBERT and UNITER have different architectures, training with DG improves their performance consistently.
C.2.2 Complete Results on FLICKR30K Dataset
We compare with other existing approaches on the task of text-based image retrieval on the FLICKR30K dataset in Table C.4.
C.2.3 Ablation Study on Hyper-parameters
We conduct an ablation study on the impact of the two hyper-parameters λ1 and λ2 in Eq. 7 of the main text. We conduct the study with two ViLBERT variants: ViLBERT Reduced and ViLBERT. The results are reported in Table C.5 and Table C.6. As we have two hyper-parameters, we analyze their impact on the final results by fixing one λ to 1. Fixing λ2 = 1 and varying λ1, we observe that ViLBERT prefers a larger λ1, while ViLBERT Reduced achieves slightly worse performance when λ1 is either smaller or larger than 1.
Table C.3: Results on cross-modal retrieval on the COCO dataset, 5K test split (higher is better).

Text-based Image Retrieval
Method                              R@1    R@5    R@10   RSUM
Models run or implemented by us
ViLBERT                             38.6   68.2   79.0   185.7
ViLBERT + DG                        41.8   71.5   81.5   194.8
UNITER                              37.8   67.3   78.0   183.1
UNITER + DG                         39.1   68.0   78.3   185.4
Known results from the literature
VSE++ [45]                          30.3   59.4   72.4   162.1
SCO [73]                            33.1   62.9   75.5   171.5
SCAN [97]                           38.6   69.3   80.4   188.3
VSRN [101]                          40.5   70.6   81.1   192.2
UNITER [31]†                        48.4   76.7   85.9   211.0

Image-based Text Retrieval
Method                              R@1    R@5    R@10   RSUM
Models run or implemented by us
ViLBERT                             53.5   79.7   87.9   221.1
ViLBERT + DG                        57.5   84.0   90.1   232.2
UNITER                              52.8   79.7   87.8   220.3
UNITER + DG                         51.4   78.7   87.0   217.1
Known results from the literature
VSE++ [45]                          41.3   71.1   81.2   193.6
SCO [73]                            42.8   72.3   83.0   198.1
SCAN [97]                           50.4   82.2   90.0   222.6
VSRN [101]                          53.0   81.1   89.4   223.5
UNITER [31]†                        63.3   87.0   93.1   243.4

†: The UNITER [31] model performs additional online hard-negative mining (which we did not) during the training of image-text matching to improve its results, which is computationally very costly.
Fixing λ1 = 1 and varying λ2, we observe that the performance of both architectures drops slightly when λ2 = 0.5 and λ2 = 2.
C.2.4 Full Results on Zero/Few-Shot and Transfer Learning
Implementation Details for Zero-shot Referring Expression. Specifically, the learned ViLBERT and ViLBERT w/ DG models are first used to produce a base matching score s_BASE between the expression to be referred to and the whole image. We then compute the matching score s_MASKED between the expression and the image with each region feature replaced, in turn, by a random feature. The masked region that causes the largest drop (s_BASE − s_MASKED) is the model's prediction of the region the expression refers to. Table 4.5 shows that ViLBERT + DG-COCO outperforms ViLBERT on this task.
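A hedged sketch of this masking procedure is given below; model.score is a hypothetical matching function standing in for the ViLBERT alignment score.

```python
# Sketch: score the full image, re-score with each region replaced by a random
# feature, and pick the region whose replacement hurts the score the most.
import torch

def predict_region(model, expression, region_feats):
    # region_feats: (num_regions, feat_dim) bottom-up attention features.
    s_base = model.score(expression, region_feats)        # hypothetical API
    drops = []
    for i in range(region_feats.size(0)):
        masked = region_feats.clone()
        masked[i] = torch.randn_like(masked[i])           # replace region i
        s_masked = model.score(expression, masked)
        drops.append(float(s_base - s_masked))            # s_BASE - s_MASKED
    return max(range(len(drops)), key=drops.__getitem__)  # predicted region index
```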
Transfer Learning Results. Table C.7 reports the full set of evaluation metrics for transferring across datasets. Training with DG noticeably improves over training without DG.
Table C.4: Results on text-based image retrieval on the FLICKR30K test split (higher is better).
Method                              R@1    R@5    R@10   RSUM
Models run or implemented by us
ViLBERT                             59.1   85.7   92.0   236.7
ViLBERT + DG                        63.8   87.3   92.2   243.3
UNITER                              62.9   87.2   92.7   242.8
UNITER + DG                         66.4   88.2   92.2   246.8
Known results from the literature
VSE++ [45]                          39.6   70.1   79.5   189.2
SCO [73]                            41.1   70.5   80.1   191.7
SCAN [97]                           48.6   77.7   85.2   211.5
VSRN [101]                          54.7   81.8   88.2   224.7
ViLBERT [116]                       58.2   84.9   91.5   234.6
UNITER [31]                         71.5   91.2   95.2   257.9
Table C.5: Ablation studies on the impact of λ1 and λ2 of ViLBERT Reduced on text-based image retrieval on the FLICKR30K dataset (higher is better).
(a) Ablating λ1
λ1     λ2     R@1    R@5    R@10   RSUM
0.5    1.0    57.7   83.1   88.5   229.2
1.0    1.0    58.7   83.3   89.3   231.2
2      1.0    56.5   82.6   88.6   227.7
(b) Ablating λ2
λ1     λ2     R@1    R@5    R@10   RSUM
1.0    0.5    56.3   81.7   87.2   225.2
1.0    1.0    58.7   83.3   89.3   231.2
1.0    2      58.5   82.3   88.0   228.9
C.3 Visualization of Model’s Predictions on Denotation Graphs
We show several qualitative examples of both success and failure cases of ViLBERT + DG when retrieving images matched to a text query, in Fig. C.2 and Fig. C.3. The image-text correspondences are generated by the Denotation Graph, which is derived from the caption-image alignment. In Fig. C.2, ViLBERT + DG successfully recognizes the images aligned with the text "man wear reflective vest", while ViLBERT fails to retrieve the matched image. In the failure case in Fig. C.3, although ViLBERT + DG fails to retrieve the images exactly matched to the text, it still retrieves very relevant images given the query.
Table C.6: Ablation studies on the impact of λ1 and λ2 of ViLBERT on text-based image retrieval on the FLICKR30K dataset (higher is better).
(a) Ablating λ1
λ1     λ2     R@1    R@5    R@10   RSUM
0.5    1.0    63.1   86.7   91.7   241.4
1.0    1.0    63.8   87.3   92.2   243.3
2      1.0    64.1   87.6   92.5   244.2
(b) Ablating λ2
λ1     λ2     R@1    R@5    R@10   RSUM
1.0    0.5    63.7   87.0   92.4   243.2
1.0    1.0    63.8   87.3   92.2   243.3
1.0    2      63.1   86.6   91.9   241.6
Table C.7: Transferability of the learned representations
SOURCE→TARGET FLICKR30K→COCO COCO→FLICKR30K
Model R@1 R@5 R@10 RSUM R@1 R@5 R@10 RSUM
ViLBERT 43.5 72.5 83.4 199.4 49.0 76.0 83.9 209.0
ViLBERT + SOURCE DG 44.9 72.7 83.0 200.5 52.8 79.2 86.2 218.2
D Appendix for Hierarchical Visually Grounded Concept Composition
D.1 Extracting Primitives & Predicates from the Constituency Tree
As mentioned in the main paper, we parse each sentence and convert it into a tree of concepts and primitives. In particular, we first perform constituency parsing using the self-attention parser [87]. Table D.1 provides visualizations of the syntax sub-trees for two examples. Next, we perform a tree search (i.e., breadth-first search) on the constituency tree of the current input concept to extract its sub-concepts and predicate function. This step is applied recursively until we can no longer decompose a concept into any sub-concepts. In a single extraction step, we enumerate the nodes in the constituency tree of the current input text expression and examine whether each constituent satisfies the criterion that defines a visually grounded concept.
The concept criterion defined for the Flickr30K and COCO datasets contains the following principles: (1) if the constituent is a word, it is a primitive concept when its Part-of-Speech (POS) tag is one of {[NN], [NNS], [NNP], [NNPS]}; (2) if the constituent is a phrase (with two or more words), it is a concept when it contains a primitive word (i.e., one satisfying condition (1)) and its constituency tag is one of {[S], [SBAR], [SBARQ], [SQ], [SINV], [NP], [NX]}. After all the concepts are extracted, we take the remaining words of the current input text expression as the predicate that combines those concepts, using the constituency tag (e.g., [NP]) as a placeholder for each extracted sub-concept. Concrete examples can be found in Table D.1.
[Figure C.2 shows retrieval results for the query "a man wearing a reflective vest sits on the sidewalk and holds up pamphlets with bicycles on the cover" and the DG-generated query "man wear reflective vest", comparing ViLBERT + DG with ViLBERT.]
Figure C.2: FLICKR30K Denotation Graph, given text and retrieve image: a qualitative example where ViLBERT + DG successfully retrieves the text-matched images. We mark the correct sample in green and the incorrect one in red.
For instance, in the first example, we search the text "two dogs are running on the grass" and extract two noun constituents, "two dogs" and "the grass", as the sub-concepts. We use the remaining text "[NP] are running on [NP]" as the predicate, which indicates how these two sub-concepts compose into the original sentence.
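The extraction step can be sketched as follows on a hand-written bracketed parse; the actual pipeline uses the self-attention parser [87], and the helper names here are ours.

```python
# Sketch of extracting sub-concepts and the predicate from a constituency tree,
# applying the POS/constituency criteria stated above.
from nltk import Tree

PRIMITIVE_POS = {"NN", "NNS", "NNP", "NNPS"}
CONCEPT_TAGS = {"S", "SBAR", "SBARQ", "SQ", "SINV", "NP", "NX"}

def contains_primitive(t):
    return any(st.label() in PRIMITIVE_POS for st in t.subtrees())

def extract(tree):
    """Top-level decomposition: returns (predicate string, list of sub-concepts)."""
    predicate, sub_concepts = [], []

    def visit(node):
        if isinstance(node, Tree):
            is_concept = (node is not tree and node.label() in CONCEPT_TAGS
                          and contains_primitive(node))
            if is_concept:
                predicate.append("[%s]" % node.label())    # syntactic placeholder
                sub_concepts.append(" ".join(node.leaves()))
                return                                      # do not descend further
            for child in node:
                visit(child)
        else:
            predicate.append(node)                          # plain word token

    visit(tree)
    return " ".join(predicate), sub_concepts

parse = Tree.fromstring(
    "(S (NP (CD two) (NNS dogs)) (VP (VBP are) (VBG running)"
    " (PP (IN on) (NP (DT the) (NN grass)))))")
print(extract(parse))   # ('[NP] are running on [NP]', ['two dogs', 'the grass'])
```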
D.2 Details on Generation of Compositional Evaluation Splits
As mentioned in the main text, we generate compositional generalization (CG) splits with 1,000 images and 5,000 text queries, maximizing the compound divergence (MCD) as in [147] (we adopt the released code for computing compound divergence: https://github.com/google-research/language/tree/master/language/nqg/tasks), to assess models' capability to generalize to data with a different predicate distribution. Concretely, we select Flickr30K training data to generate the F30K-MCD split. First, we remove all F30K test data that contain primitive concepts unseen in the COCO training data. Next, we collect and count the predicates for each image among all the remaining data over the five associated captions.
[Figure C.3 shows retrieval results for the query "a black and white dog is running through the grass" and the DG-generated query "black and white dog run", comparing ViLBERT + DG with ViLBERT.]
Figure C.3: FLICKR30K Denotation Graph, given text and retrieve image: a qualitative example where ViLBERT + DG fails to retrieve the text-matched images. We mark the correct sample in green and the incorrect one in red.
These predicates correspond to the "compounds" defined in [82, 147], and the objective is to maximize the divergence between the compound distribution of the evaluation data and that of the training data. As a result of this step, we end up with a dataset formed of (image, predicate counts) pairs, which are then used to compute the overall compound divergence (CD_ALL) with respect to the training dataset. Afterwards, we enumerate over each pair and again compute the compound divergence to the training dataset, but with that specific example removed. We denote the change in compound divergence as ∆_i = CD_i − CD_ALL and use it as an additional score associated with every example. Finally, we sort all the data by ∆_i and use the top-ranking one thousand examples as the maximum compound divergence (MCD) split. The process for generating the COCO-MCD split is symmetrical to the above, except that the data is collected from the COCO val+test splits (as they are sufficiently large). Similarly, to generate splits with different CD values for Figure 4 of the main text, we again make use of the data sorted by ∆_i.
Example 1
  Syntax tree:   (S (NP two dogs) (VP (VBP are) (VBG running) (PP (IN on) (NP the grass))))
  Concept:       two dogs are running on the grass
  Predicate:     [NP] are running on [NP]
  Sub-concepts:  NP1 = "two dogs", NP2 = "the grass"

Example 2
  Syntax tree:   (S (NP a small pizza) (VP (VBN cut) (PP (IN in) (NN half)) (IN on) (NP a white plate)))
  Concept:       a small pizza cut in half on a white plate
  Predicate:     [NP] cut in half on [NP]
  Sub-concepts:  NP1 = "a small pizza", NP2 = "a white plate"

Table D.1: Explanatory examples of extracting predicates and sub-concepts from a concept.
Concretely, we slide a window of 1,000 examples over the sorted data to obtain a large set of candidate splits (a stride can be used to make this computation sparser). For each window of data, we measure the compound divergence and only keep the windows that satisfy our criteria; for Figure 4, we keep the windows whose CD values are closest to the desired X-axis values for plotting. We will release the code and data to support the reproducibility of our research.
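A simplified sketch of the ∆_i ranking is given below; the divergence function is a stand-in (total-variation distance) for the compound-divergence computation in the released code linked above, and the ranking direction follows our reading of the text.

```python
# Simplified sketch of the ∆_i scoring and selection described above.
from collections import Counter

def divergence(eval_counts, train_counts):
    # Placeholder divergence between normalized predicate distributions; the
    # released code computes compound divergence instead.
    keys = set(eval_counts) | set(train_counts)
    e_tot = sum(eval_counts.values()) or 1
    t_tot = sum(train_counts.values()) or 1
    return 0.5 * sum(abs(eval_counts[k] / e_tot - train_counts[k] / t_tot)
                     for k in keys)

def mcd_split(candidates, train_counts, k=1000):
    # candidates: list of (image_id, Counter of predicates); counts are Counters.
    pool = Counter()
    for _, counts in candidates:
        pool.update(counts)
    cd_all = divergence(pool, train_counts)
    scored = []
    for image_id, counts in candidates:
        cd_i = divergence(pool - counts, train_counts)   # with this example removed
        scored.append((cd_i - cd_all, image_id))         # ∆_i = CD_i − CD_ALL
    scored.sort(reverse=True)                            # "top ranking" per the text
    return [image_id for _, image_id in scored[:k]]

# Toy usage with hypothetical predicate counts.
train = Counter({"[NP] run on [NP]": 50, "[NP] sit on [NP]": 50})
cands = [("img1", Counter({"[NP] run on [NP]": 5})),
         ("img2", Counter({"[NP] jump over [NP]": 5}))]
print(mcd_split(cands, train, k=1))
```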
D.3 Implementation Details of COMPOSER and Baselines
Visual feature pre-processing. We follow ViLBERT [116] and extract patch-based ResNet features using the Bottom-Up Attention model. Each image patch feature has a dimension of 2048. A 5-dimensional position feature that describes the normalized coordinates of the upper-left and lower-right corners of the patch is extracted alongside the image patch feature. Each image region is therefore described by both the image patch feature and the position feature. We extract features from up to 100 patches per image.
Text pre-processing. Following BERT [38], we tokenize the text using the uncased WordPiece tokenizer. Specifically, we first lowercase the text and use the uncased tokenizer to extract tokens. The tokenizer has a vocabulary size of 30,522. The tokens are then transformed into 768-dimensional word embeddings. Besides the word embedding, a 768-dimensional position embedding is extracted; the position embedding and word embedding are added together to represent the embedding of each token.
Training details. We use the Adam optimizer [84] to optimize the parameters of our model. All the models are trained with a mini-batch size of 64. We employ a warm-up training strategy as suggested by ViLBERT [116]. Specifically, the learning rate increases linearly from 0 to 4e-5 over the first 2 epochs. The learning rate then decays to 4e-6 and 4e-7 after 10 and 15 epochs, respectively. Training stops at 20 epochs.
Details of baseline approaches. The text encoder for both models contains 12 Transformer layers and is initialized from the pre-trained BERT model. For ViLBERT, we use the [CLS] embedding from the last layer as the text representation y. For the VSE model, we use the average of the contextualized text embeddings from the last layer as y. The visual encoder of VSE is an MLP with a residual connection; it transforms the image patch features into a joint image-text space, and the output of the visual encoder is the mean of the transformed image patch features. Unlike VSE, ViLBERT contains 6 Transformer layers for the image encoder and 6 cross-modal Transformer layers to model the text and image features jointly. We use the embedding of the [V-CLS] token from the last layer of the image encoder as the image feature x.
Details of COMPOSER. COMPOSER contains four primary learning sub-modules: (1) the CrossAtt model for primitive encoding; (2) the Predicate Transformer (PT); (3) the modulator; and (4) the Composition Transformer (CT). The details of these sub-modules are as follows:
• Primitive encoding. We implement the CrossAtt model as a one-layer multi-head cross-modal Transformer with 768 dimensions and 12 attention heads. The query set Q is the sub-word token embeddings of the primitive word, and the key and value sets K and V are the union of the sub-word token embeddings and the object-centric visual features (which are linearly transformed to have the same dimensionality). We use the average of the contextualized sub-word token embeddings as the final primitive encoding.
• Predicate Transformer (PT). We use a 6-layer text Transformer with a hidden dimension of 768 and 12 attention heads to instantiate the Predicate Transformer. This network is initialized with the first 6 layers of a pre-trained BERT model.
• Modulator. We use FiLM [133] as the modulator. Specifically, it contains two MLPs with a hidden dimension of 768 that generate a scale vector a and a bias vector b from the syntactic placeholder. The scale a and bias b are then used to transform the input concept embedding c as a ⊙ c + b, where ⊙ denotes element-wise multiplication. This modulated concept embedding is then projected by another MLP with 768 hidden dimensions and reassembled with the predicate sequence (a minimal sketch of this step is given after this list).
• Composition Transformer (CT). We follow the architecture of ViLBERT [116] to design the Composition Transformer (shown in Figure D.1). Specifically, it interleaves SelfAtt Transformer and CrossAtt Transformer layers. For example, a three-layer Composition Transformer has a SelfAtt Transformer at the beginning for each modality, followed by a CrossAtt Transformer that exchanges information between the modalities, and then another SelfAtt Transformer that operates only on the text modality. The output embedding of this last text SelfAtt Transformer is then used to compute the visual-semantic alignment score via the linear regressor θ. Thus, to make the network shallower or deeper, we remove or add pairs of interleaved SelfAtt and CrossAtt Transformer layers. The SelfAtt Transformer has a hidden dimension of 768 with 12 attention heads; the CrossAtt Transformer has a hidden dimension of 1024 with 8 attention heads.
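A minimal sketch of this FiLM-style modulation, under the stated 768-dimensional configuration, is shown below; the MLP depths and activation choices are assumptions.

```python
# Sketch of the FiLM modulator: a ⊙ c + b followed by an MLP projection.
import torch
import torch.nn as nn

class FiLMModulator(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.scale_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.bias_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, concept, placeholder):
        # concept, placeholder: (batch, dim) embeddings.
        a = self.scale_mlp(placeholder)          # scale vector a
        b = self.bias_mlp(placeholder)           # bias vector b
        return self.proj(a * concept + b)        # a ⊙ c + b, then projection

modulator = FiLMModulator()
c = torch.randn(4, 768)    # sub-concept embedding
p = torch.randn(4, 768)    # syntactic placeholder embedding (e.g. [NP])
modulated = modulator(c, p)  # ready to be reassembled with the predicate sequence
```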
D.4 Additional Experiments on COMPOSER
We report additional ablation studies that were omitted from the main paper due to space limitations. In this section, we study COMPOSER performance under different MVSA objectives, namely the negative log-likelihood (NLL) and hinge losses. We then study COMPOSER performance under different margins for the MVSA and Order objectives.
MVSA Objective. The MVSA objective can be implemented using either the NLL loss or the hinge loss. We study the performance of COMPOSER under these two losses for MVSA in Table D.3. The models are trained with both the MVSA and Order objectives. We set the margin of the Order objective to β = 0.2 and, for the hinge loss, the margin to α = 0.8. COMPOSER trained with the hinge loss for MVSA outperforms the NLL-trained model in all metrics, across both the in-domain and cross-dataset generalization settings.
[Figure D.1 shows the Composition Transformer: interleaved text and image Transformer layers connected by cross-attention Transformers compose the sub-concept embeddings ("a man", "a soccer ball", "the field") with the predicate tokens ([CLS], "running", "on", "with") and the object-centric visual features into the output concept "a man running on the field with a soccer ball".]
Figure D.1: Details of the Composition Transformer model.
                                        F30K→F30K           F30K→COCO
                          α      β      R1       R5         R1       R5
COMPOSER w/ different α   0.4    0.2    53.54    80.51      33.73    61.67
                          0.6    0.2    53.44    80.21      33.89    61.05
                          0.8    0.2    54.02    80.27      33.81    63.19
COMPOSER w/ different β   0.8    0      53.66    80.39      33.33    61.15
                          0.8    0.2    54.02    80.27      33.81    63.19
                          0.8    0.4    53.50    80.55      33.87    61.15
Table D.2: Ablation study on COMPOSER with different margins for the MVSA and Order objectives.
Therefore, for all the experiments trained with MVSA, we use the hinge loss.
Ablation study on α and β. We study COMPOSER performance under different margins for the MVSA and Order objectives (Table D.2). First, we fix the margin of the Order objective β and tune the MVSA margin α; COMPOSER with a larger MVSA margin achieves better in-domain R1. Alternatively, fixing α and tuning β, COMPOSER achieves the best in-domain R1 and the best cross-dataset R5 with β = 0.2.
                      F30K→F30K           F30K→COCO
Loss Function         R1       R5         R1       R5
NLL                   52.42    79.47      33.41    62.17
Hinge loss            54.02    80.27      33.81    63.19
Table D.3: Ablation study of the MVSA objective: comparing NLL to hinge loss.
E Appendix for Modeling Video Representation via Co-training on both
Image and Video
E.1 Implementation Details of Co-train Video and Image Data
We provide the details of implementing co-training on multiple video and image datasets. Without modifying the architecture, we calculate the image and video classification losses by forwarding the images and the videos through the model separately.
We treat images as single-frame videos. To calculate the image classification loss, we take images as input and obtain the loss ℓ({y^j_image}, {c^j_image}). Since images are treated as single-frame videos, the key set K, the query set Q, and the value set V in the temporal multi-head attention layer are identical, and the temporal multi-head attention reduces to a feed-forward network.
After obtaining the image loss, we forward the videos through the model f and obtain the video classification loss ℓ({y^j_video}, {c^j_video}). The co-training loss for video and image data, ℓ_{image,video}, is the weighted average of the image and video classification losses. We calculate the gradients and update the weights based on ℓ_{image,video}.
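A hedged sketch of this co-training objective is given below; the model interface and the loss weighting are illustrative assumptions rather than the actual COVER implementation.

```python
# Sketch of the co-training loss: images wrapped as single-frame videos, and a
# weighted average of image and video classification losses.
import torch.nn.functional as F

def cotrain_step(model, video_batch, video_labels, image_batch, image_labels,
                 image_weight=0.5):
    # `model` is assumed to map (B, T, C, H, W) video inputs to class logits.
    image_as_video = image_batch.unsqueeze(1)                  # (B, 1, C, H, W)
    loss_image = F.cross_entropy(model(image_as_video), image_labels)
    loss_video = F.cross_entropy(model(video_batch), video_labels)
    # Weighted average of the two classification losses, as described above.
    loss = image_weight * loss_image + (1.0 - image_weight) * loss_video
    loss.backward()                                            # gradients for the update
    return loss
```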
E.2 Additional Experiments on COVER
In this section, we report ablation studies that were omitted from the main paper due to space limitations. Specifically, we study COVER performance under different model variants and different input configurations.
E.2.1 Model variants
We first compare the performance of different architecture sizes, considering the TimeSFormer-Base model for comparison. The TimeSFormer-Base architecture is similar to TimeSFormer-Large but contains 12 TimeSFormer blocks.
Figure E.1: Comparison of the base model and the large model using COVER pretrained on JFT-3B. The large model consistently improves over the base model on all datasets.
Each block includes one temporal multi-head attention layer and one spatial multi-head attention layer. Each multi-head attention layer has 12 attention heads with 768 hidden dimensions. We follow the same learning rate schedule and apply the same data augmentation to train the TimeSFormer-Base model as described in Sect. 4.1 of the main paper.
We summarize the comparison between the TimeSFormer-Large and TimeSFormer-Base models in Fig. E.1. We observe that with co-training, the large model improves the top-1 accuracy by 3.2%, 2.8%, 3.9%, 2.1%, and 5.2% on K400, K600, K700, SSv2, and MiT, respectively. Enlarging the model capacity did not lead to over-fitting. This empirically shows that COVER can unlock the model's performance by leveraging a large amount of training data: with co-training, the large model shows a solid improvement over the base model by a large margin across all datasets.
E.2.2 Input configuration
Next, we analyze the model performance under different input configurations. Following TimeSFormer [17], we consider two input settings, standard and high resolution. For the standard setting, we sample 8 frames from the video and the input patch resolution is 224×224. For the high-resolution setting, 16 frames are sampled from the video and the input patch resolution is 448×448.
Table E.1: Comparison of different input settings for co-training on the Kinetics, SSv2, and MiT datasets.
(a) Models trained on K400+SSv2+MiT+ImNet
Input setting (H × W × T)              K400   SSv2   MiT
Standard (224 × 224 × 8)               85.1   67.4   44.2
High resolution (448 × 448 × 16)       87.2   70.8   46.1
(b) Models trained on K600+SSv2+MiT+ImNet
Input setting (H × W × T)              K600   SSv2   MiT
Standard (224 × 224 × 8)               85.6   67.7   44.0
High resolution (448 × 448 × 16)       87.9   70.9   45.9
(c) Models trained on K700+SSv2+MiT+ImNet
Input setting (H × W × T)              K700   SSv2   MiT
Standard (224 × 224 × 8)               76.2   67.9   43.9
High resolution (448 × 448 × 16)       79.8   70.6   45.9
We summarize the results of the model pre-trained on JFT-3B in Table E.1. The high-resolution model consistently improves the performance by a large margin across all benchmarks.
Table E.2: Top-1 and Top-5 accuracy of COVER on K400, K600, K700, SSv2, and MiT.
K400 K600 K700 SSv2 MiT
Pretrain Top-1 Top-5 Top-1 Top-5 Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
I21k 83.1 96.2 84.5 96.6 74.9 92.3 64.7 89.3 41.5 70.7
JFT-300M 86.3 97.2 86.8 97.3 78.5 94.2 69.8 91.9 45.0 73.9
JFT-3B 87.2 97.5 87.9 97.8 79.8 94.9 70.9 92.5 46.1 75.4
E.2.3 Top-1 and Top-5 accuracy of COVER
Due to the space limitation of the main text, we summarize in Table E.2 the top-1 and top-5 accuracy on K400, K600, K700, SSv2, and MiT of COVER jointly trained on Kinetics, SSv2, Moments-in-Time, and ImageNet.
Abstract
Visual representation learning is crucial for building a robust and effective visual understanding system. The goal is to build general-purpose representations that benefit multiple downstream tasks (e.g., image/video classification, segmentation, and retrieval). With access to large-scale datasets and advances in learning methods, sophisticated neural architectures and novel training approaches have been proposed to improve visual representations. However, obtaining a versatile representation remains an open question. This thesis aims to leverage visual structure to obtain more general visual representations. The key observation is that visual data (i.e., images and videos) contain structure and can be decomposed into atomic components such as objects, attributes, and clips. For example, images can be decomposed into objects, which can be further described by attributes. Similarly, videos can describe complex scenes composed of multiple clips or shots, each depicting a semantically coherent event or action. As atomic components are shareable across modalities and tasks, we expect hierarchical visual representations compiled from atomic representations to achieve better generalization. In this thesis, we study two scenarios for obtaining visual structure: structure from parallel visual and text data, and structure from the pure visual domain. With the proposed hierarchical visual representations, we achieve state-of-the-art performance on video and text retrieval, moment localization in a video corpus, image and text retrieval, action recognition, and visual storytelling.