Event Detection and Recounting from Large-scale Consumer Videos
by
Chen Sun
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2016
Copyright 2016 Chen Sun
To Yuwen
Acknowledgments
I would like to express my sincere thanks to my advisor Prof. Ramakant Nevatia for his guidance and support. I have learned a lot from his deep understanding, broad knowledge and insightful vision about Artificial Intelligence and Computer Vision. During my study at the Computer Vision lab, I have been lucky to discuss research and life with Prof. Nevatia several times per week. It is truly an invaluable experience.
I would like to thank Prof. Keith Jenkins, Prof. Kevin Knight, Prof. Yan Liu and Prof. Suya You for taking their precious time to serve on my qualification and thesis defense committee; Prof. Jianmin Li and Jinhui Yuan for introducing the concept of Computer Vision to me back in the Tsinghua days; Bo Yang for all the kind help when I first arrived in the U.S.; Boqing Gong for advice and help on research; Sanketh Shetty and Rahul Sukthankar for collaboration at Google; Lubomir Bourdev, Manohar Paluri and Ronan Collobert for collaboration at Facebook; and Chuang Gan and Jiyang Gao for collaboration at USC. I also want to thank the current and previous members of our Computer Vision lab, and all my friends at USC.
I am grateful to my parents Zhidong and Chaiyun for their understanding and support over the past 27 years. My gratitude is beyond words. My life in the U.S. has been made colorful by my friends Tongchuan Gao, Kai Hong, Xue Chen, Song Cao, Arnav Agharwal, Remi Trichet, Pramod Sharma, Chenhao Tan, Rongqi Qiu, Jun Pang, Dong Wang, Kuan Liu, Zhiyun Lu and so many others.
Finally, a very special thanks to my fiancée Yuwen Ma, who brightens up the whole world for me.
Table of Contents
Acknowledgments iii
List of Tables vii
List of Figures ix
Abstract xii
1 Introduction 1
1.1 Event Detection and Recounting . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Work 8
2.1 Low-level Features for Event Detection . . . . . . . . . . . . . . . . . . . . 8
2.2 Deep Learning for Action and Event Recognition . . . . . . . . . . . . . . 9
2.3 Mid-level Video Representation with Visual Concepts . . . . . . . . . . . 10
2.4 Temporal Information for Event Modeling . . . . . . . . . . . . . . . . . . 12
3 Low-level Motion Features 13
3.1 Event Detection with Motion Features . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Local Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Fisher Vector Encoding . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2.1 Fisher Vector under Gaussian Mixture Model . . . . . . . 15
3.1.2.2 Non-probabilistic Fisher Vector . . . . . . . . . . . . . . . 16
3.1.2.3 Comparison with BoW . . . . . . . . . . . . . . . . . . . 16
3.1.3 Postprocessing and Classification . . . . . . . . . . . . . . . . . . . 17
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Temporal Transition Features of Mid-level Concepts 23
4.1 Event Detection with Mid-level Concept Features . . . . . . . . . . . . . . 23
4.2 Concept Based Representation . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 HMM Fisher Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3.1 Our Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3.2 HMMFV Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.3 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.3 Same Domain Concepts . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.4 Cross Domain Concepts . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.5 Comparison with State-of-the-Art . . . . . . . . . . . . . . . . . . 32
4.4.6 Classification with Limited Training Samples . . . . . . . . . . . . 34
5 Weakly-supervised Action Localization 35
5.1 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Action Localization by Domain Transfer from Web Images . . . . . . . . . 38
5.2.1 Shared CNN Representation . . . . . . . . . . . . . . . . . . . . . . 39
5.2.2 LAF Proposal with Web Images . . . . . . . . . . . . . . . . . . . 39
5.2.3 Long Short-Term Memory Network . . . . . . . . . . . . . . . . . . 42
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.3 Video-level Classification Results . . . . . . . . . . . . . . . . . . . 47
5.3.4 Localization Results . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.5 Localization Results on THUMOS 2014 . . . . . . . . . . . . . . . 53
6 Weakly-supervised Event Recounting 55
6.1 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Evidence Localization Model . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2.1 Video Representation . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2.2 Evidence Localization Model . . . . . . . . . . . . . . . . . . . . . 57
6.2.3 Compact Temporal Constraint . . . . . . . . . . . . . . . . . . . . 59
6.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.4.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.5 Event Recounting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.6.2 Classification Task . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.6.3 Recounting Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Zero-shot Event Detection and Recounting 69
7.1 Transferring Mid-level Knowledge from Web Images . . . . . . . . . . . . . 69
7.2 Zero-shot Event Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.1 Video Representation . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.2 Mid-level Knowledge Transfer . . . . . . . . . . . . . . . . . . . . . 72
7.2.3 Event Recounting with Diversity . . . . . . . . . . . . . . . . . . . 74
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.3.1 MED-recounting Dataset . . . . . . . . . . . . . . . . . . . . . . . 76
7.3.1.1 Video Data . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.3.1.2 Annotation Protocol and Evaluation Metric . . . . . . . . 77
7.3.2 Concept Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3.2.1 Image-based Concepts . . . . . . . . . . . . . . . . . . . . 79
7.3.2.2 Video-based Concepts . . . . . . . . . . . . . . . . . . . . 79
7.3.3 Zero-shot Event Detection . . . . . . . . . . . . . . . . . . . . . . . 79
7.3.4 Zero-shot Event Recounting . . . . . . . . . . . . . . . . . . . . . . 81
8 Video Transcription with SVO Triplets 83
8.1 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.2 Semantic Aware Transcription . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2.1 Continuous Word Representation . . . . . . . . . . . . . . . . . . . 86
8.2.2 Video and Annotation Preprocessing . . . . . . . . . . . . . . . . . 86
8.2.3 Random Forest Structure . . . . . . . . . . . . . . . . . . . . . . . 88
8.2.4 Learning Semantic Hierarchies . . . . . . . . . . . . . . . . . . . . 88
8.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 92
9 Automatic Visual Concept Discovery 96
9.1 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.2 Visual Concept Discovery Pipeline . . . . . . . . . . . . . . . . . . . . . . 98
9.2.1 Concept Mining From Sentences . . . . . . . . . . . . . . . . . . . 99
9.2.2 Concept Filtering and Clustering . . . . . . . . . . . . . . . . . . . 100
9.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.3 Concept Based Image and Sentence Retrieval . . . . . . . . . . . . . . . . 103
9.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.4.1 Bidirectional Sentence Image Retrieval . . . . . . . . . . . . . . . . 104
9.4.2 Human Evaluation of Image Tagging . . . . . . . . . . . . . . . . . 109
10 Conclusion and Future Work 112
10.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography 116
List of Tables
3.1 mAP performance comparison on the test set with different features and encodings . . . 22
4.1 Average precision comparison for event classification with same domain concepts and cross domain concepts; bold numbers correspond to the higher performance in their groups . . . 31
4.2 Mean average precisions when using same domain concepts, cross domain concepts and both for generating HMMFV . . . 32
4.3 Average precision comparison with the state-of-the-art. Joint refers to the joint modeling of activity concepts, as described in [39]; LL refers to the low level approach described in [77] . . . 33
4.4 Average precision comparison with 10 training samples for each event . . . 34
5.1 Video-level classification performance of several different systems on fine-grained actions . . . 48
5.2 Difference in average precision between LSTM with and without LAF proposal, sorted by top wins (left) and top losses (right) . . . 48
5.3 Classification performance when measured on high-level sports activities (e.g., basketball, soccer) . . . 51
5.4 Difference in average precision between LSTM with and without LAF proposal (top), and between LSTM with LAF proposal and CNN (bottom); the overlap ratio is fixed to 0.5. A positive number means LSTM with LAF proposal is better . . . 52
5.5 Temporal localization on the test partition of the THUMOS 2014 dataset. Ground truth uses temporal annotation of the training videos . . . 54
6.1 Average precision comparison among the global baseline, ELM without temporal constraint and the full ELM on MEDTest . . . 64
6.2 Performance gain by incorporating two state-of-the-art global video representations into our framework . . . 65
6.3 The ratio of our method's average snippet length over average video length, and the average accuracy of labels from evaluators . . . 66
6.4 Evaluators' comparison of ELM over the global strategy. Assignments of better, similar and worse were aggregated via average . . . 67
7.1 Selected ImageNet concepts for all 20 events . . . 73
7.2 Comparison of different knowledge transfer methods for ZeroMED on the MED14 dataset . . . 78
7.3 Comparisons with other state-of-the-art ZeroMED systems on MEDtest13 . . . 80
7.4 Comparison of mean recounting quality (mean RQ) on the zero-shot recounting task for different concept selection methods . . . 81
7.5 Event recounting results compared with baseline approaches. Higher scores indicate better performance . . . 82
7.6 Human comparison of the recounting results generated by ILP against ILP w/o diversity . . . 82
8.1 Top correlated verb and object pairs in SAT . . . 92
8.2 Accuracy comparison among our proposed SAT, a traditional RF and a linear SVM . . . 93
8.3 Accuracy and WUP comparisons between our proposed method and YouTube2text [34] . . . 93
9.1 Preserved and filtered terms from the Flickr 8k data set. A term might be filtered if it is abstract (first row), too detailed (second row) or not visually discriminative (third row). Sometimes our algorithm may filter out visual entities which are difficult to recognize (final row) . . . 99
9.2 Concepts discovered by our framework from the Flickr 8k data set . . . 102
9.3 How different settings affect the term groupings in the discovered concepts. The total concept number is fixed to 1,200 . . . 102
9.4 Retrieval evaluation compared with embedding based methods on Flickr 8k. Higher Recall@k and lower median rank are better . . . 105
9.5 Retrieval evaluation on Flickr 30k. Higher Recall@k and lower median rank are better . . . 106
9.6 Retrieval evaluation for different concept vocabularies on the COCO data set . . . 107
9.7 Percentage of images where tags generated by the discovered concepts are better, worse or the same compared with ImageNet 1k . . . 110
List of Figures
1.1 Given input videos, we first extract low-level features and mid-level action and object detections. These features are aggregated into video-level features and used for event detection . . . 4
1.2 Snapshot of the ISOMER multimedia event recounting system as proposed in [95] . . . 5
1.3 Some image to sentence retrieval and video transcription results generated by the visual concept discovery pipeline described in [96]. Retrieved sentences are ordered by their similarity to the images. Correct matches are marked in green . . . 6
3.1 Camera motion can make many sparse feature points fall into the background. Left is a frame from a bike trick video; right shows the feature points detected by STIP . . . 14
3.2 Spatial-Temporal Pyramids. For pyramid level i, the video is divided into 2^i slices along the timeline, and 2^i by 2^i blocks for each frame . . . 18
4.1 Illustration of our HMMFV generation process. An input video is separated into fixed length clips, each with a vector of activity concept responses. HMMFV is computed by taking the partial derivatives of the video's log-likelihood in a general HMM over the transition parameters. Each dimension in HMMFV corresponds to a concept transition . . . 24
4.2 Mean average precisions with different numbers of concepts. The red line shows a randomly selected subset of cross domain concepts from 20 to 101. The cyan line illustrates a randomly selected subset of same domain concepts from 20 to 60 . . . 32
5.1 Fine-grained actions are usually present as a tiny fraction within videos (top). Our framework uses cross-domain transfer from possibly noisy image search results (bottom) and identifies the action related images for both domains (marked in green) . . . 36
5.2 Illustration of the LAF proposal framework for basketball slam dunk videos. We use a cross-domain transfer algorithm to jointly filter out non-video like web images and non-action like video frames. We then learn a LAF proposal model using the filtered images, which assigns LAF scores to training video frames. Finally, we train LSTM based fine-grained action detectors, where the misclassification penalty of each time step is weighted by the LAF score . . . 38
5.3 Top retrieved images from the Google image search engine, using the keywords tennis serve and baseball pitch . . . 40
5.4 Illustration of the LSTM architecture with a single memory block. A recurrent projection layer is added to reduce the number of parameters. Reproduced from [86] with the authors' permission . . . 44
5.5 Some of the high-level sports activities and their corresponding fine-grained sports actions in the Fine Grained Actions 240 data set. Best viewed under magnification . . . 45
5.6 Example actions where LAF is not helpful. The web images retrieved to generate LAF proposals might be beautified (top) or taken from a different viewpoint (middle). Sometimes there is a mix of the two issues (bottom) . . . 49
5.7 Magnified view of confusion matrices for ice hockey, crossfit and basketball . . . 50
5.8 Classification output for a few videos. The labels under each video were generated by LSTM with LAF proposal, LSTM without LAF proposal and CNN, from top to bottom. Correct answers are marked in bold . . . 50
5.9 Temporal localization performance on the FGA-240 data set . . . 52
6.1 (Middle) A typical birthday party video from the dataset; some of the video clips are irrelevant to the event. (Bottom) Each clip has a vector of primitive action classifier responses, where the highest scoring ones are listed. Primitive actions in red have high weights in evidence templates. (Top) Two configurations of evidence locations. The green one scores higher than the red one, as a transition from eating cake to blowing candle is highly unlikely . . . 56
6.2 Event recounting results generated by ELM and the baseline approach. The video events are flash mob gathering, repairing an appliance and making a sandwich, respectively. ELM was labeled as better than the baseline by evaluators for the top two recounting results . . . 68
7.1 Given a query video and a set of candidate events, ZeroMED generates event predictions while ZeroMER provides diverse evidence for the event . . . 70
7.2 An illustration of our event recognition framework. Each video segment is represented by a bank of image-based and video-based concepts. Given an event query, we select the relevant image-based concepts by transferring event composition knowledge from web images. We then select the video-based concepts based on their semantic similarities with the selected image concepts. The classifier confidence scores of the selected concepts are used for ZeroMED. For ZeroMER, we solve an integer linear programming problem to select video segments with relevant evidence that are also diverse and compact . . . 71
8.1 Left: one example of the testing videos we used. Right: our algorithm utilizes sentence based annotations, and outputs subject, verb, object triplets . . . 84
8.2 Illustration of a single decision tree used in SAT. Detector responses for a video are used to traverse the tree nodes until reaching a leaf. Note that a horse detector may not be needed in the process . . . 87
8.3 Testing videos with SVO triplets from groundtruth (GT), SAT, RF and SVM. Exact matches are marked in blue; semantically related verbs and objects are marked in red . . . 95
9.1 Overview of the concept discovery framework. Given a parallel corpus of images and their descriptions, we first extract unigrams and dependency bigrams from the text data. These terms are filtered with the cross validation average precision (AP) trained on their associated images. The remaining terms are grouped into concept clusters based on both visual and semantic similarity . . . 97
9.2 Impact of the parameter setting when testing on the Flickr 8k data set (blue) and the COCO data set (red). Recall@5 for sentence retrieval is used . . . 108
9.3 Impact of the total number of concepts when testing on the Flickr 8k data set (blue) and the COCO data set (red). Recall@5 for sentence retrieval is used . . . 109
9.4 Tags generated using ImageNet 1k concepts (blue) and the discovered concepts (green). Tags preferred by evaluators are marked in red blocks . . . 111
Abstract
Multimedia event detection and recounting are important Computer Vision problems that are closely related to Machine Learning, Natural Language Processing and other research areas in Computer Science. Given a query consumer video, multimedia event detection (MED) generates a high-level event label (e.g. birthday party, cleaning an appliance) for the entire video, while multimedia event recounting (MER) aims at selecting supporting evidence for the detected event. Typical forms of evidence include short video snippets and text descriptions. Event detection and recounting are challenging problems due to the large variation in video quality, the complex temporal structures, missing or unreliable concept detectors, and the large problem scale.
This thesis describes my solutions to event detection and recounting from large scale consumer videos. The first part focuses on extracting robust features for event detection. The proposed pipeline utilizes both low-level motion features and mid-level semantic features. For low-level features, the pipeline extracts local motion descriptors from videos, and aggregates them into video-level representations by applying Fisher vector techniques. For mid-level features, the pipeline encodes temporal transition information from noisy object and action concept detection scores. The two feature types are suitable for training linear event classifiers to handle large amounts of query videos, and have complementary performance.
The second part of the thesis addresses the event recounting problem, which includes the evidence localization task and the description generation task. Evidence localization searches for video snippets with supporting evidence for an event. It is inherently weakly supervised, as most of the training videos have only video-level annotations rather than segment-level annotations. My proposed framework treats evidence locations as hidden variables, and exploits activity co-occurrences and temporal transitions to model events. Model parameters are learned with the latent SVM framework. For text description generation, my proposed pipelines aim at connecting vision and language by considering both semantic similarity from text and visual similarity from videos and images. The pipelines are able to generate video transcriptions as subject-verb-object triplets or visual concept tags.
This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video or image datasets.
Chapter 1
Introduction
Whether captured by smart phones, surveillance cameras or self-driving cars, videos stand out as one of the most important and informative media in the digital world. Videos bring temporal information to the visual recognition domain, and open doors to applications such as motion analysis, tracking and event detection.
The primary goal of my research is to understand human activities and high-level events from large amounts of unconstrained consumer videos. The ever-increasing popularity of video capturing devices and sharing websites has created a huge gap between the fast pace of video generation and our ability to index them. In response, I aim to build a semantic representation of videos with objects, actions, events and their interactions, and apply such semantic representations to video event detection and recounting.
1.1 Event Detection and Recounting
In this thesis, an action refers to a human-human or human-object interaction that happens within a short period of time (e.g. 10 seconds or shorter) and can be identified by its appearance or motion patterns. An event refers to a complex activity consisting of several actions over a relatively long period of time (e.g. 30 seconds or longer).
The goal of multimedia event detection (MED) is to assign video-level event labels to query videos. Once an event is detected, multimedia event recounting (MER) aims to provide key supporting evidence in the form of temporally localized snippets and text descriptions. The task of description generation is referred to as video transcription. The descriptions can be phrases, subject-verb-object triplets, or sentences.
MED and MER have recently gained much attention in the Computer Vision community. In contrast, researchers in the early days spent most of their efforts on the classification of simple actions in constrained environments. The KTH dataset [88], for example, contains six actions (e.g. walking, boxing) shot against clean backgrounds with a fixed camera. More realistic human action datasets have been proposed since [58], but they still cannot match the complexity of real world videos. Compared with earlier work, MED and MER are more challenging in the following aspects:
Large video variation. Unlike videos taken in a controlled environment, most consumer videos are shot and edited by amateur users. There are large variations in lighting conditions, video resolution and camera motion. Handling these variations requires more robust low-level visual and motion features. Besides, human editing, such as the insertion of logos and captions, is likely to introduce noise for video analysis.
Complex temporal structures. The temporal structure of video events is complex for two reasons: first, even with a large amount of training videos, the possibility of unobserved irrelevant video segments is still high for query videos. Second, event-related video segments can appear anywhere in the videos; they are difficult to locate by rigid rule-based temporal models. As an example, in a flash mob video, dancing, marching and people cheering can happen in different orders in an urban scene.
Missing and unreliable action and object detectors. Event recounting requires semantic analysis of video events, so it relies heavily on action and object detection results. Although significant improvement has recently been reported for object classification and detection [53, 28], the detectors are still far from reliable. Moreover, it is non-trivial to collect training data for the concept detectors. For example, the ImageNet dataset [15] offers 1,000 categories for object detection, but it does not have a birthday cake category, which serves as key evidence for birthday party.
Large problem scale. A video is a long sequence of images. In the commonly used TRECVID MED dataset [75], the average length of videos is 2 minutes, or 3,600 frames. There are over 5,000 training videos and over 200,000 test videos. This large problem scale requires the video feature extraction and analysis tools to be highly efficient. It also makes spatiotemporal annotations of the training videos expensive and time consuming.
1.2 Contribution Overview
To address the above challenges in event detection and recounting, my thesis makes
contributions to the following three aspects:
1. Utilize temporal information effectively. I believe that local (e.g. motion) and global (e.g. evolution of activities) temporal information offer key clues for successful video understanding, and I have designed robust feature encoding techniques to capture these clues.
2. Extract rich semantics with moderate video annotations. Spatiotemporal annotation of objects and actions in consumer videos is notoriously time-consuming. To mitigate such effort, I proposed several methods to transfer knowledge from existing image datasets and the Internet, and to learn with weak supervision.
3. Build connections between videos and language. Language is the most natural interface for us to interact with videos. Along this line, I focus on the automatic visual vocabulary construction problem to map human language and videos into the same semantic space. This allows our system to retrieve and describe videos with sentences rather than labels.
Robust motion and concept features
The first step of video understanding is to extract robust motion and visual concept signals. For the purpose of event recognition, it is desirable to aggregate local signals into video-level feature vectors, and use them to train classifiers. In [98], I demonstrated that applying Fisher Vector techniques to local motion features produces much more robust video-level features than the traditional bag-of-words approach, and improves the previous state-of-the-art event recognition performance by as much as 35%. My Fisher Vector implementation was a key component of our system, which won the TRECVID MED 2012 evaluation.
Figure 1.1: Given input videos, we first extract low-level features and mid-level action and object detections. These features are aggregated into video-level features and used for event detection.
Although building event detectors directly from low-level features is effective, it does not offer insights about the activities that take place within the events. I proposed the ACTIVE framework [97] to bridge this semantic gap: it represents videos with probability distributions of mid-level concepts (e.g. an action or an object), and encodes the temporal transitions between the concepts. The temporal coding is derived using the Fisher kernel over the transition parameters of a Hidden Markov Model (HMM). The ACTIVE framework captures temporal information in an efficient and effective way, and offers semantic video interpretations. Figure 1.1 illustrates the overall pipeline for motion and activity transition feature encoding.
Weakly- and webly-supervised learning
One of my key research interests is weakly supervised learning from videos. For example, I worked on generating event highlights (in terms of short video snippets and descriptions) for long videos with only video-level event labels. I developed the DISCOVER framework [99] for joint event classification and important video snippet localization. DISCOVER exploits activity co-occurrences and temporal transitions to model events. The important video snippet locations are treated as hidden variables during training, and learned with the latent SVM framework. We showed that DISCOVER not only achieves state-of-the-art event classification performance, but also generates video summarizations of higher quality than previous methods.
Figure 1.2: Snapshot of the ISOMER multimedia event recounting system as proposed in [95].
Together with researchers at Google, I worked on the problem of fine-grained action localization from temporally untrimmed web videos [101]. It is challenging as fine-grained actions take place within the same activities or events, share similar contexts and tend to be localized within videos. My main contribution is a domain transfer framework that uses only video-level labels and cheaply-available web images to identify temporal segments corresponding to the actions, and to learn models that generalize to unconstrained web videos. The pipeline was implemented with convolutional and recurrent neural networks. We collected a new FGA-240 dataset with 130,000 YouTube videos, and showed that learning temporally localized actions from videos becomes much easier if we combine weakly labeled video frames and noisily tagged web images.
Web images can also be used to assist zero-shot event detection and recounting. My colleagues and I developed a fully automatic algorithm to select representative and reliable concepts for event queries. This is achieved by transferring event composition knowledge discovered from available web images. To evaluate our proposed method, we used the standard zero-shot event detection protocol (ZeroMED), but also introduced a novel zero-shot event recounting (ZeroMER) problem to select supporting evidence of the events. We formulated ZeroMER as an integer linear programming problem and aimed to select video snippets that are relevant and diverse. Evaluation on the challenging TRECVID MED dataset showed that our proposed method achieves promising results on both tasks.
Figure 1.3: Some image to sentence retrieval and video transcription results generated by the visual concept discovery pipeline described in [96]. Retrieved sentences are ordered by their similarity to the images. Correct matches are marked in green.
Connecting videos with language
I started my journey to connect language and vision by working on video transcription with subject, verb and object (SVO) triplets. A standard classification formulation does not work well as the annotations are sparse, and the same objects can be described with different terms (e.g. bicycle and bike). I proposed a semantic aware transcription framework, SAT [100], using random forest classifiers. The core innovation of SAT is to learn semantic relationships of words from text data, and apply this knowledge when training random forest classifiers. The training algorithm optimizes the compactness of the labels' semantic meanings rather than classification error. SAT allows feature sharing among semantically similar words, and generates video transcriptions where the errors tend to be more semantically reasonable.
To allow general video description and retrieval with sentences, one possible approach is to map videos and sentences into a common semantic space spanned by visual concepts. I proposed to discover the visual concepts by joint use of parallel text and visual corpora [96]. The text data in parallel corpora offers a rich set of terms humans use to describe visual entities, while visual data has the potential to help computers organize the terms into visual concepts. My concept discovery algorithm aims at making the concept vocabulary visually discriminative and compact. Both automatic and human evaluation showed that the discovered concepts are better than several manually designed concept vocabularies (e.g. ImageNet) at describing and retrieving images with text. Some of the sentence retrieval and video transcription results are shown in Figure 1.3.
1.3 Thesis Outline
This thesis is outlined as follows: it begins in Chapter 2 with a brief overview of the literature on event detection and recounting. It is followed by three chapters on event detection. Chapter 3 presents my framework for aggregating low-level features with Fisher vectors. Chapter 4 describes a method for modeling activity transitions in video event classification. Chapter 5 introduces a weakly-supervised framework for action and event localization. Chapters 6 to 9 focus on event recounting. Chapter 6 presents our framework for discovering important video segments for event classification and recounting. Chapter 7 addresses the problem of zero-shot event detection and recounting. The idea of semantic aware video transcription is described in Chapter 8. Chapter 9 studies the problem of automatic concept discovery from parallel corpora. Finally, Chapter 10 concludes the thesis and discusses future directions.
Chapter 2
Related Work
This chapter briefly reviews the related work in video event detection and recounting.
2.1 Low-level Features for Event Detection
Low-level features have been widely applied in image and video classification tasks. The basic idea is to extract low-level local features such as color and edge information, and aggregate them to form global representations. One of the most popular feature aggregation techniques is bag-of-visual-words [12], which is analogous to the bag-of-words representation used in natural language processing. However, it is worth noting that unlike real words, which
are natural units with semantic meanings, for computer vision applications, the visual
words are usually quantized low-level local features, most of which do not necessarily
carry semantic meanings. For video analysis, one can directly extend the image pipeline
by using bag of 3D features such as motion patterns.
Low-level feature pipelines typically have four stages: feature extraction, feature encoding, classification and fusion of results. At the feature extraction stage, local image or video patches are selected, either densely or via some saliency selection process; descriptors are then computed for each patch. Feature detection and description techniques build on ideas developed for object recognition in still images but also incorporate the temporal dimension. Interestingly, although it may be expected that sparse salient feature points would be more robust, experiments show that dense features perform better for more complex videos [111]. The feature encoding stage turns sets of local features into fixed length vectors; this is usually accomplished by vector quantization of feature vectors and building histograms of visual codewords, commonly known as Bag-of-Words (BoW) encoding [12]. Variations include soft quantization, where the distance from a number of codewords is considered [106, 112]. Feature vectors are used to train classifiers (typically χ² kernel SVMs). Several late fusion strategies can be used to combine the classification results of different low-level features; such fusion typically shows consistent improvement [71].
Since a video is a sequence of image frames, low level features designed for images, such as SIFT [65], can all be applied to video classification. To take motion information into account, there are several innovations in motion related features. STIP [57], for example, extends 2D feature detectors to 3D space. Dense trajectory features [107] track local points to obtain short tracklets, and describe the volumes around each tracklet with histograms of gradient, optical flow, etc. Its improved version [109] preprocesses videos by compensating for camera motion, and achieves state-of-the-art performance on action recognition and event classification.
2.2 Deep Learning for Action and Event Recognition
Rather than using hand-designed low-level features, many recent approaches apply deep neural networks to jointly learn feature representations and classifiers. Karpathy et al. [50] proposed several variations of convolutional neural network (CNN) architectures that extended Krizhevsky et al.'s image classification model [53] and attempted to learn motion patterns from spatio-temporal video patches. Simonyan and Zisserman [89] obtained good results on action recognition using a two-stream CNN that takes pre-computed optical flow as
well as raw frame-level pixels as input. For event detection, Xu et al. [116] proposed to
extract frame-level CNN activations and aggregate them into video-level features with
Fisher vectors.
Recurrent neural networks have also been used to model the temporal information in
video sequences. For example, long short-term memory network (LSTM) was proposed
by Hochreiter and Schmidhuber [37] as an improvement over traditional recurrent neural
networks (RNN) for classification and prediction of time series data. Specifically, an
LSTM can remember and forget values from the past, unlike a regular RNN where error
gradients decay exponentially quickly with the time lag between events. It has recently
shown excellent performance in modeling sequential data such as speech recognition [86,
32], handwriting recognition [33] and machine translation [102]. More recently, LSTM
has also been applied to classify temporally trimmed actions [18, 94] and generate image
descriptions [51, 48].
2.3 Mid-level Video Representation with Visual Concepts
The idea of using object and action concepts has been adopted in image [104, 60, 22] and video classification [63] under different names. A set of object concept classifiers called classemes was used for image classification and novel class detection tasks in [104]. Meanwhile, Object Bank [60] applied object filters over different locations instead of using whole image based classemes. Both classemes and Object Bank used the object detector responses as features for image classification. One interesting observation is that, although the responses may not always be semantically meaningful, they are discriminative features for classification.
Visual concepts can be categorized [83] and organized as a hierarchy where the leaves are the most specific and the root is the most general. For example, ImageNet concepts [15] are organized following the rule-based WordNet [70] hierarchy. Similar structure also exists for actions [20]. Since concept classification is not always reliable, Deng et al. [16] proposed a method to allow an accuracy-specificity trade-off of object concepts on WordNet. As WordNet synsets do not always correspond to how people name the concepts, Ordonez et al. [74] studied the problem of entry-level category prediction by collecting natural categories from humans.
To expand the visual concept vocabulary, recent work exists on visual data collection from web images [114, 9, 17, 30] or weakly annotated videos [6]. Their goal is to collect training images from the Internet with minimum human supervision, but for pre-defined concepts. In particular, NEIL [9] started with a few exemplar images per concept, and iteratively refined its concept detectors using image search results. LEVAN [17] explored the sub-categories of a given concept by mining bigrams from a large text corpus and using the bigrams to retrieve training images from image search engines. Recently, Zhou et al. [124] used noisily tagged Flickr images to train concept detectors, but did not consider the semantic similarity among different tags.
Although visual concept responses can be used as features for classification regardless of their semantic meanings, for the task of event recounting those semantic meanings have to be used explicitly in order to obtain human perceivable outputs. In [39], the authors modeled the activity concepts as latent variables with pairwise correlations, and applied latent SVM for classification. [64] analyzed the contribution of each concept detection score to the overall event classification of the video. The event classifier used the concept detection scores as features, and the contributions were assessed from the terms summed to perform the classification. This method produced superior recounting results in the TRECVID 2012 recounting task.
A slightly different task, video transcription, requires describing video contents without using event labels. As an event prior is not available for re-ranking action and object concepts, language based models are used instead to measure the plausibility of concept combinations. Kulkarni et al. [55] proposed a method to detect candidate objects and their attributes from static images, and applied a CRF for sentence generation. [2] used object detection and tracking results to describe videos with simple actions and fixed cameras. In [52], the authors obtained SVO triplets using object and action detectors and reranked the triplet proposals with language models. Alternatively, [80] proposed to classify semantic representations (SR) with low level features, and used a CRF to model the co-occurrences of SR. It formulated the conversion from SR to sentences as a statistical machine translation problem and tested the idea on an indoor kitchen dataset. However, global low level features may not be discriminative enough to identify actions and objects for videos in the wild.
Visual concepts are also useful in zero-shot event detection. Zero-shot learning aims to generate classification predictions for categories with no training examples. A commonly practiced approach is to use attributes [56, 23], which are shared over all categories. The attributes can be manually selected [63, 76, 67] or data-driven [122, 11]. In the video domain, such attributes are usually named concepts [85], and used for applications like zero-shot event detection [115], event recounting with training samples [64] and video transcription [34]. To find relevant attributes for unseen categories, linguistic knowledge databases [82] and web search hit counts [82, 81] can be used. For action recognition, Jain et al. [41] used semantic embeddings to find relevant object concepts given action names.
2.4 Temporal Information for Event Modeling
There are several previous approaches attempting to utilize temporal information for action and event recognition. For example, Niebles et al. [72] used a tree structure with anchor positions to reward the presence of motion segments near the corresponding anchor positions. [103] used a variable-duration Hidden Markov Model to model a video event as a sequence of latent states with various durations. Video representations in the above approaches consist of low level features. [26] used actoms as a mid-level representation and encoded their temporal constraints. These approaches suffer from at least one of the following problems: (1) the temporal constraint is too rigid, (2) the assumption that all segments are informative, (3) difficulty in locating positive evidence.
To find the discriminative parts of videos, [105] learned kernelized latent SVMs with no temporal constraints. [87] used a simple algorithm to evaluate the quality of a possible cropping using other uncropped training data; the goal is to filter irrelevant segments from the training dataset. [61] employed a dynamic pooling procedure by selecting informative regions for pooling low level features. These approaches ignore temporal structures, which are important to event understanding.
For video description and event recounting, [35] used captions as weak labels to learn an AND-OR graph based storyline. However, captions are usually not available for unconstrained web videos. [2] applied object tracks and body-postures to sentence generation. Besides, [13] assumed event classification is done and mined the co-occurrence of objects and bags of low level features.
Chapter 3
Low-level Motion Features
This chapter mainly describes the low-level motion feature encoding framework with
Fisher Vectors. It serves as the basic building block for all the low-level video and clip
representations I used in later works.
The main contribution of the proposed framework is to apply the Fisher kernel technique to encode local motion features. The Fisher kernel [40] was proposed to utilize the advantages of generative models in a discriminative framework. The basic idea is to represent a set of data by the gradient of its log-likelihood with respect to the model parameters, and measure the distance between instances with the Fisher kernel. The fixed-length representation vector is also called a Fisher Vector. Fisher vectors have been applied to static image classification [77] and indexing [42], showing significant improvements over BoW methods. This chapter demonstrates the effectiveness of Fisher vectors with motion features for event detection.
3.1 Event Detection with Motion Features
This section describes the video event detection framework, which includes feature extraction, Fisher vector encoding, postprocessing and classification.
3.1.1 Local Feature Extraction
We perform experiments with both a sparse local feature and a dense local feature. We use Laptev's Space-Time Interest Points (STIP) [57] as the sparse feature; each interest point is described by histograms of gradients (HoG) and optical flow (HoF) of its surrounding volume. For the dense feature, we choose Wang et al.'s Dense Trajectory (DT) [107] features; the descriptor includes the shape of the trajectory, HoG, HoF and motion boundary histograms (MBH). These choices are made based on the good performance of each in its category.
Figure 3.1: Camera motion can make many sparse feature points fall into the background. Left is a frame from a bike trick video; right shows the feature points detected by STIP.
When the environment is constrained and the camera is fixed, sparse features are likely to select robust features that are highly correlated with events of interest. However, web videos are usually taken in the wild, and in most cases camera motion is unknown. It is likely that camera motion causes many feature points to originate from the static background (see Figure 3.1). Dense features do not suffer from feature point selection issues, but they treat foreground and background information equally, which can also be a source of distraction.
3.1.2 Fisher Vector Encoding
One key idea usually associated with local features is that of a visual words codebook, obtained by clustering feature points and quantizing the feature space. A set of feature points can then be represented by a fixed length histogram. However, some feature points may be far from any visual word; to compensate for this, van Gemert et al. propose a soft assignment of visual words [106], but each codeword is still only modeled by its mean.
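As a point of reference for the encodings below, the following is a minimal hard-assignment BoW sketch in Python/NumPy; the function name, array shapes and the K-Means codebook are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Hard-assignment BoW: count how many local descriptors fall on each codeword.

    descriptors: (T, D) array of local features from one video.
    codebook:    (K, D) array of codeword centers (e.g. from K-Means).
    Returns an L1-normalized histogram of length K.
    """
    # Squared distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                                   # hard assignment
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```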
3.1.2.1 Fisher Vector under Gaussian Mixture Model

Fisher Vector concepts have been introduced in previous work [40] and applied to image classification and retrieval by several authors [77][42]. We repeat their formulation below for ease of reading and to make this chapter self-contained.

According to [40], suppose we have a generative probability model $P(X \mid \theta)$, where $X = \{x_i \mid i = 1, 2, \ldots, N\}$ is a sample set, and $\theta$ is the set of model parameters. We can map $X$ into a vector by computing the gradient vector of its log-likelihood function at the current $\theta$:

$F_X = \nabla_{\theta} \log P(X \mid \theta)$   (3.1)

$F_X$ is a Fisher Vector; it can be seen as a measurement of the direction in which to move $\theta$ to make it fit $X$ better. Since $|\theta|$ is fixed, the dimensions of the Fisher Vector for different $X$ are the same. This makes $F_X$ a suitable alternative for representing a video with its local features.

A GMM has the form

$P(X \mid \theta) = \sum_{i=1}^{K} w_i \, \mathcal{N}(X; \mu_i, \Sigma_i)$   (3.2)

where $K$ is the number of clusters, $w_i$ is the weight of the $i$th cluster, and $\mu_i$, $\Sigma_i$ are the mean and covariance matrix of the $i$th cluster.

As the dimension of the Fisher Vector is the same as the number of parameters, diagonal covariance matrices are usually assumed to simplify the model and thus reduce the size of the Fisher Vector. Denote $\mathcal{L}(X \mid \theta)$ as the log-likelihood function, the $d$th dimension of $\mu_i$ as $\mu_i^d$, the $d$th diagonal element of $\Sigma_i$ as $(\sigma_i^d)^2$, the local feature dimension as $D$ and the total number of feature points as $T$. By assuming that each local feature is independent, the Fisher Vector $F_X$ of feature point set $X$ is

$F_X = \left[ \dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \mu} ; \; \dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \sigma} \right]$   (3.3)

$\dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \mu_i^d} = \sum_{t=1}^{T} \gamma_t(i) \, \dfrac{x_t^d - \mu_i^d}{(\sigma_i^d)^2}$   (3.4)

$\dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \sigma_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left[ \dfrac{(x_t^d - \mu_i^d)^2}{(\sigma_i^d)^3} - \dfrac{1}{\sigma_i^d} \right]$   (3.5)

Here, $\gamma_t(i)$ is the probability that feature point $x_t$ belongs to the $i$th cluster, given by

$\gamma_t(i) = \dfrac{w_i \, \mathcal{N}(x_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)}$   (3.6)

The dimension of $F_X$ is $2KD$. The first term, $\partial \mathcal{L}(X \mid \theta) / \partial \mu$, is composed of first order differences of feature points to cluster centers. The second term, $\partial \mathcal{L}(X \mid \theta) / \partial \sigma$, contains second order terms. Both of these are weighted by the covariances and the soft assignment terms.

The Fisher Vector with a GMM can be seen as an extension of BoW [43]. It accumulates the relative position to each cluster center, and models codeword assignment uncertainty, which has been shown to be beneficial for BoW encoding [106].
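To make Equations (3.4)–(3.6) concrete, here is a minimal NumPy sketch that computes the (unnormalized) Fisher Vector of one descriptor set given diagonal-covariance GMM parameters. The function name and array conventions are assumptions for illustration; some implementations additionally scale the gradients by an approximation of the Fisher information matrix, which is omitted here.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """Fisher Vector of Eqs. (3.4)-(3.6) for one video.

    X:      (T, D) local descriptors.
    w:      (K,)   GMM mixture weights.
    mu:     (K, D) GMM means.
    sigma2: (K, D) diagonal GMM variances.
    Returns a 2*K*D vector [d/d_mu ; d/d_sigma].
    """
    diff = X[:, None, :] - mu[None, :, :]                          # (T, K, D)
    # Log-density of each point under each diagonal Gaussian, plus mixture weight.
    log_p = -0.5 * (np.log(2 * np.pi * sigma2)[None] + diff ** 2 / sigma2[None]).sum(axis=2)
    log_p += np.log(w)[None]
    # Soft assignments gamma_t(i), Eq. (3.6), computed stably in log space.
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                      # (T, K)
    sigma = np.sqrt(sigma2)
    # First order term, Eq. (3.4).
    d_mu = (gamma[:, :, None] * diff / sigma2[None]).sum(axis=0)
    # Second order term, Eq. (3.5).
    d_sigma = (gamma[:, :, None] * (diff ** 2 / sigma2[None] ** 1.5 - 1.0 / sigma[None])).sum(axis=0)
    return np.concatenate([d_mu.ravel(), d_sigma.ravel()])
```

If the GMM is fitted with, e.g., scikit-learn's GaussianMixture(covariance_type='diag'), then w, mu and sigma2 correspond to its weights_, means_ and covariances_ attributes; the power and l2 normalizations of Section 3.1.3 would then be applied to the returned vector.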
3.1.2.2 Non-probabilistic Fisher Vector

In [42], the authors give a non-probabilistic approximation of the Fisher Vector, called the Vector of Locally Aggregated Descriptors (VLAD). It uses K-Means clustering to get a codebook; each value in VLAD is computed as

$v_i^d = \sum_{x_t : \, NN(x_t) = i} \left( x_t^d - \mu_i^d \right)$   (3.7)

Compared with the Fisher Vector, VLAD drops the second order terms, and assumes uniform covariance among all dimensions. It also assigns each feature point to its nearest neighbor in the codebook. The feature dimension is $KD$.
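A corresponding sketch of the VLAD encoding in Eq. (3.7), under the same assumed conventions as above (rows are descriptors, the codebook comes from K-Means):

```python
import numpy as np

def vlad(X, centers):
    """VLAD encoding, Eq. (3.7): sum of residuals to the nearest codeword.

    X:       (T, D) local descriptors.
    centers: (K, D) K-Means codebook.
    Returns a K*D vector.
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                                    # NN(x_t)
    v = np.zeros_like(centers)
    for i in range(len(centers)):
        assigned = X[nearest == i]
        if len(assigned):
            v[i] = (assigned - centers[i]).sum(axis=0)             # residual accumulation
    return v.ravel()
```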
3.1.2.3 Comparison with BoW

The basic idea of both BoW and the Fisher Vector is to map a feature point set $X$ into a fixed dimension vector, from which the distribution in the original feature space can be reconstructed approximately. However, there are also several key differences, discussed below.

First, BoW uses a hard quantization of the feature space by K-Means, where each cluster has the same importance and is described by its centroid only. Meanwhile, the Fisher Vector assumes a GMM is the underlying generative model for local features. Although modifications to BoW can help it capture more information, such as assigning different weights to codewords and soft assignment of codewords [106], the GMM incorporates them naturally.

Secondly, in the Fisher Vector, a local feature's contribution to a Gaussian mixture component depends on its relative position to the mixture center. Suppose we have a trained GMM for the Fisher Vector as well as a trained visual codebook for BoW, and $O$ happens to be both the mean of a mixture component in the GMM and the centroid of a codeword. Given two points $A$ and $B$ to be coded, if their distances to $O$ are the same but $\vec{AO}$ and $\vec{BO}$ are different, they contribute the same to the codeword in BoW but differently to the mixture component in the Fisher Vector.

Finally, let $X$ be separated into two sets $X_r$ and $X_b$, where $X_b$ contains the points that fit the GMM model well. We have

$\dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \theta} = \dfrac{\partial \mathcal{L}(X_r \mid \theta)}{\partial \theta} + \dfrac{\partial \mathcal{L}(X_b \mid \theta)}{\partial \theta}$   (3.8)

By definition, $\partial \mathcal{L}(X_b \mid \theta) / \partial \theta \approx 0$, so

$\dfrac{\partial \mathcal{L}(X \mid \theta)}{\partial \theta} \approx \dfrac{\partial \mathcal{L}(X_r \mid \theta)}{\partial \theta}$   (3.9)

Since $\theta$ is inferred by training on general data, which is likely to be dominated by background features, the above implies that the Fisher Vector can suppress the part of the data that fits the general model well.
3.1.3 Postprocessing and Classification

Both BoW and the Fisher Vector drop the spatial and temporal information of the feature points. However, sometimes spatial and temporal structures can be useful for classification. Lazebnik et al. proposed to build spatial pyramids to preserve approximate location information for the image classification task [59]. Here, we use a similar approach, but taking temporal information into account. At pyramid level $i$, the video is divided into $2^i$ slices along the timeline, and $2^i$ by $2^i$ blocks for each frame. Suppose there are $P_s$ spatial pyramid levels and $P_t$ temporal pyramid levels; the total number of sub-volumes is $(4^{P_s} - 1)(2^{P_t} - 1)/3$. Encoding is performed for the local features in each sub-volume, and the final representation is a concatenation of all vectors.

Figure 3.2: Spatial-Temporal Pyramids. For pyramid level $i$, the video is divided into $2^i$ slices along the timeline, and $2^i$ by $2^i$ blocks for each frame.
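The sub-volume bookkeeping can be sketched as follows. The helper assumes each local feature carries an (x, y, t) location, and that encode is one of the encoders sketched earlier (returning a fixed-length vector even for an empty descriptor set). This is an illustrative reading of the pyramid definition, consistent with the $(4^{P_s}-1)(2^{P_t}-1)/3$ count, not the thesis implementation itself.

```python
import numpy as np

def st_pyramid_encode(points, descriptors, video_len, width, height,
                      encode, P_s=2, P_t=1):
    """Concatenate one encoding per spatio-temporal sub-volume.

    points:      (T, 3) array of (x, y, t) coordinates of the local features.
    descriptors: (T, D) array of the corresponding descriptors.
    encode:      callable mapping an (n, D) array to a fixed-length vector.
    """
    blocks = []
    for i in range(P_s):                 # spatial level: 2^i x 2^i blocks
        for j in range(P_t):             # temporal level: 2^j slices
            ns, nt = 2 ** i, 2 ** j
            bx = np.minimum((points[:, 0] * ns / width).astype(int), ns - 1)
            by = np.minimum((points[:, 1] * ns / height).astype(int), ns - 1)
            bt = np.minimum((points[:, 2] * nt / video_len).astype(int), nt - 1)
            cell = (bt * ns + by) * ns + bx          # sub-volume index per point
            for c in range(ns * ns * nt):
                blocks.append(encode(descriptors[cell == c]))
    return np.concatenate(blocks)
```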
We then normalize each dimension of $F_X$ by a power normalization:

$f(x_i) = \mathrm{sign}(x_i) \, |x_i|^{\alpha}, \quad 0 \le \alpha \le 1$   (3.10)

The power normalization step is important when a few dimensions have large values and dominate the vector; the normalized vector becomes flatter as $\alpha$ decreases. It is suggested by [43] for the image retrieval task.
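Eq. (3.10) followed by the l2 normalization used later in Section 3.2.2 can be written in a few lines; the default α = 0.5 below is a common choice in the Fisher Vector literature, not a value stated in this chapter.

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5):
    """Power normalization, Eq. (3.10), followed by l2 normalization."""
    v = np.sign(v) * np.abs(v) ** alpha
    return v / max(np.linalg.norm(v), 1e-12)
```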
We train Support Vector Machine (SVM) classifiers. Though the similarity between Fisher
Vectors is usually measured by an inner product weighted by the inverse of the Fisher
information matrix, the Fisher Vector itself can be used with nonlinear kernels. Moreover, to
combine the first and second order terms in the Fisher Vector, we can either directly concatenate
them or build classifiers separately and do a late fusion. For the former approach, we use
K(F_{X_i}, F_{X_j}) = \exp\left\{ -\sum_{f \in \mathcal{F}} \frac{1}{A_f} D\left(F^f_{X_i}, F^f_{X_j}\right) \right\}    (3.11)

where

D\left(F^f_{X_i}, F^f_{X_j}\right) = \left\| F^f_{X_i} - F^f_{X_j} \right\|_2^2    (3.12)

$\mathcal{F}$ is the set of different feature vector types, $D(\cdot)$ is the distance function, and $A_f$ is the
average distance for feature type $f$ in the training data. This kernel function is a special
case of the RBF kernel, where features are concatenated with different weights and $\sigma$ is
set based on the average distances.
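A possible sketch of this kernel, assuming each video is represented as a dictionary mapping feature type to its encoded vector and that the average distances A_f have already been estimated on training data (the names are illustrative, not the thesis implementation):

import numpy as np

def multichannel_rbf_kernel(feats_a, feats_b, avg_dist):
    """Kernel of Eqs. 3.11/3.12 between two videos.

    feats_a, feats_b: dict mapping feature type -> 1-D numpy array
    avg_dist: dict mapping feature type -> average squared L2 distance A_f
    """
    total = 0.0
    for f in feats_a:
        d = np.sum((feats_a[f] - feats_b[f]) ** 2)   # squared L2 distance (Eq. 3.12)
        total += d / avg_dist[f]
    return np.exp(-total)                            # Eq. 3.11

The resulting kernel matrix can be passed to any SVM solver that accepts precomputed kernels, for example scikit-learn's SVC with kernel='precomputed'.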
Late fusion is a way to combine decision confidences from different classifiers; it has
shown superior performance to early fusion in some tasks. Here, we use a
geometric average of the individual scores.
3.2 Experiments
This section describes the dataset for evaluation, and provides experimental results.
3.2.1 Dataset
We use videos selected from the entire TRECVID MED11 video corpus and from MED12
Event Kit data [75] for evaluation. The datasets contain more than 40,000 diverse, user-generated
videos varying in length, quality and resolution. The total length of the videos is
more than 1,400 hours. There are 25 positive event classes defined in this data set, along
with a large collection of samples that belong to none of them. The event concepts can be
roughly categorized as:
• Distinct by objects: Board trick, bike trick, making a sandwich, etc.
• Distinct by motion patterns: Changing vehicle tire, getting vehicle unstuck, etc.
• Distinct by high-level motives: Birthday party, wedding ceremony, marriage proposal, etc.
A complete list of event definitions can be found in [75]. For our evaluation, we selected
13,274 videos and split them into two partitions, a training set (Train) and a test set (Test);
the goal is to evaluate the framework's performance on a large-scale dataset. We sampled
randomly from the full set to create these partitions.
3.2.2 Experiment Setup
In this section, we describe the proposed classification framework as well as the baseline
system.
Fisher Vector and VLAD Generation. We use Laptev's STIP implementation,
with default parameters and sparse feature detection mode. For Dense Trajectory, we
first resize the video's width to 320 and set the sampling stride to 10 pixels. Both descriptors
have several components, which we concatenate directly to form a 162-dimensional feature
vector for STIP and a 426-dimensional feature vector for DT. Since the
length of the Fisher Vector is linear in the dimension of the local features, PCA is used to project
the features onto a lower-dimensional space; we project STIP features to 64 dimensions
and DT features to 128 dimensions.
We randomly select about 1500000 feature descriptors from the Train set. These
descriptors are used to train the PCA projection matrix and get codebooks with GMM
and K-Means clustering.
With the spatio-temporal pyramid, each sub-volume has its own Fisher Vector or VLAD.
According to our experiments, increasing the number of spatial pyramid layers boosts
performance, while increasing the number of temporal pyramid layers has little influence or even
hampers performance. We set the number of spatial pyramid layers (#SP) to 2 and
temporal pyramid layers (#TP) to 1, to balance classification performance and speed.
Before concatenation, we normalize each vector in two steps. First, power normalization
is applied to each dimension. Then, all vectors are l2-normalized, concatenated
together and l2-normalized again. This is different from the traditional spatial
pyramid, where histograms from larger cells are penalized and normalization is performed after
concatenation [59]. We use l2-normalization since it is natural with a linear kernel, which
is evaluated later.
In the following experiments, we refer to the event classification framework with VLAD
as VLAD, with the first order components of the Fisher Vector as FV1, and with the second order
components of the Fisher Vector as FV2.
BoW Baseline. We use the same local features with no dimension reduction, and a
standard BoW approach with the following modifications:
First, instead of hard-assigning each local feature to its nearest neighbor, soft assignment
to the nearest #K neighbors is used [106].
Secondly, we use the spatio-temporal pyramid to encode spatial and temporal structures.
Based on experimental results, we set the codebook size to 1000, K = 4, #SP = 3 and
#TP = 1; the final representation has 21,000 dimensions and is l1-normalized to form a
histogram.
Classification Scheme. Classifiers are built with SVM in a one-vs-rest approach.
We use the probability output produced by the classifier as confidence values.
For parameter selection, we use 5-fold cross validation: training data are separated
into 5 parts randomly with the ratio of positive to negative samples approximately
preserved, and the parameter set with the highest average performance is selected. Because the
dataset is highly unbalanced, traditional accuracy-based parameter search is quite likely
to produce a trivial classifier that predicts all queries as negative. We choose to optimize
mAP instead.
3.2.3 Results
We compare the performance of BoW, VLAD and FV with STIP and DT features. Late
fusion is used for Fisher Vector (FV); $\alpha$ is set to 0.3 for VLAD and 0.5 for FV. The results
are shown in Table 3.1.
We can see that Fisher Vector gives the best mAP for both STIP and DT; it yields
about a 35% improvement for STIP and a 26% improvement for DT over the baseline. VLAD
improves mAP by 19% for STIP, less than Fisher Vector does. Considering that AP equals the
percentage of positive samples when labels are assigned randomly, our best framework is 47
times better than random performance (35.5/4434 ≈ 0.008).
It is difficult to account precisely for the reasons behind the superior results of Fisher Vector
coding. It captures more of the feature point distribution and hence is likely to be more
discriminative. We believe that our hypothesis, given in Section 3.1.2.3, that Fisher coding
can suppress the contribution of background features when they fit the model well, is also
part of the answer. From the table, we can also see that DT outperforms STIP in most
events, but the relative improvement from BoW to Fisher Vector is smaller. Since DT
features are dense, the impact of background may be less than for sparse features.

Event BoW+STIP VLAD+STIP FV+STIP BoW+DT FV+DT
E001 0.399 0.412 0.450 0.468 0.507
E002 0.053 0.084 0.141 0.059 0.159
E003 0.174 0.259 0.273 0.384 0.447
E004 0.517 0.583 0.581 0.624 0.610
E005 0.193 0.229 0.270 0.194 0.293
E006 0.217 0.217 0.192 0.225 0.309
E007 0.064 0.165 0.130 0.190 0.280
E008 0.535 0.579 0.576 0.564 0.576
E009 0.284 0.316 0.368 0.403 0.469
E010 0.093 0.116 0.154 0.216 0.295
E011 0.154 0.193 0.224 0.198 0.256
E012 0.260 0.364 0.459 0.446 0.517
E013 0.366 0.404 0.381 0.413 0.483
E014 0.357 0.370 0.403 0.417 0.457
E015 0.292 0.346 0.393 0.352 0.471
E021 0.104 0.234 0.240 0.245 0.491
E022 0.058 0.088 0.069 0.066 0.091
E023 0.361 0.489 0.550 0.600 0.674
E024 0.194 0.148 0.202 0.069 0.093
E025 0.040 0.107 0.175 0.059 0.140
E026 0.182 0.201 0.252 0.277 0.447
E027 0.156 0.326 0.364 0.470 0.566
E028 0.285 0.286 0.438 0.317 0.450
E029 0.187 0.174 0.271 0.179 0.275
E030 0.148 0.064 0.068 0.072 0.101
mAP 0.227 0.270 0.305 0.300 0.378
Table 3.1: mAP performance comparison on the test set with different features and encodings
Chapter 4
Temporal Transition Features of Mid-level Concepts
The low-level framework, though effective, has several limitations. First, most high-level
events consist of complex human-object interactions and varied scene backgrounds, which
pose difficulty for low-level frameworks. Moreover, these features are usually encoded
into video-level representations without preserving temporal information. To overcome
these limitations, I propose an event detection framework that encodes the temporal transition
information of mid-level visual concept features.
4.1 Event Detection with Mid-level Concept Features
Activity concept transitions are encoded with Fisher vectors. A Fisher vector represents
a set of data by the derivative of its log-likelihood function with respect to the model parameters,
which is used as input for a discriminative classifier like a Support Vector Machine (SVM).
Intuitively, it measures the difference of the incoming data from an underlying model.
Here we use a Hidden Markov Model (HMM) as the underlying generative model. In this
model, a video event is a sequence of activity concepts. A new concept is generated
with certain probabilities based on the previous concept. An observation is a low-level
feature vector from a sub-clip, generated based on the concepts. By using this model,
we bridge low-level features and high-level events via activity concepts, and utilize the
temporal relationships of activity concepts explicitly. We call the vector produced by
applying the Fisher kernel to an HMM an HMM Fisher Vector (HMMFV). Our
approach has the following features:
Figure 4.1: Illustration of our HMMFV generation process. An input video is separated
into fixed length clips, each of which has a vector of activity concept responses. HMMFV is
computed by taking the partial derivatives of the video's log-likelihood in a general HMM
over the transition parameters. Each dimension in HMMFV corresponds to a concept
transition.
No maximum a posteriori (MAP) inference needed. An HMM in the traditional
generative framework requires MAP inference to find the concept assignments over
time with the highest probability, and a separate model for each new event is needed. Instead,
HMMFV only uses the HMM for the purpose of vector generation, and can utilize a single
general model for all videos.
Efficient for large-scale usage. HMMFV is a compact and fixed-length feature
vector. It can be used directly in most classification and retrieval frameworks using
low-level features. HMMFV has a closed form representation and can be computed by
dynamic programming very efficiently.
Robust with limited training data. Our activity concept classifiers are pre-learned
and offer a good abstraction of low level features. This makes it possible to learn
robust event classifiers with very limited training data.
Our approach can utilize both mid-level and atomic activity concepts, even when they
do not occur in high level events directly, or are collected from a different domain. In
this case, activity concepts can be seen as groups of low level features such that each of
them can provide useful statistics. Besides, each dimension of HMMFV corresponds to a
concept transition; dimensions with high values correspond to highly probable transitions,
and can be used to describe the video.
4.2 Concept Based Representation
The activity concepts we use are predefined and trained under supervision. Activity
concept classifiers are built from low-level visual features. All techniques for event classification
with low level features can be used, resulting in a single fixed-length descriptor x
for each video clip.
We then train a 1-vs-rest classifier $\theta_c$ for each activity concept c. Since x is usually
high dimensional, we use a linear SVM to save training and prediction time. The output
$\theta_c(x)$ is defined as the probability returned by LIBLINEAR [21] for x under the
logistic regression model.
After the concept classifiers $[\theta_1\ \theta_2\ \dots\ \theta_K]$ are obtained, we scan the video with fixed-length
sliding windows and represent each video by a T by K matrix $M = [\theta_{t,k}]$, where
T is the number of sliding windows, K is the number of activity concepts, and $\theta_{t,k}$ is the
classifier response of the k-th activity concept for the t-th sliding window.
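A minimal sketch of building this T-by-K matrix; extract_clip_feature and the per-concept probability models are hypothetical placeholders for the feature encoding and the LIBLINEAR classifiers described above:

import numpy as np

def concept_response_matrix(video_frames, concept_models, win_len=100, stride=50):
    """Return M with M[t, k] = probability of concept k in the t-th sliding window.

    concept_models: list of K callables, each returning the probability of its
    concept given a clip-level feature vector (hypothetical interface).
    """
    rows = []
    for start in range(0, max(len(video_frames) - win_len, 0) + 1, stride):
        clip = video_frames[start:start + win_len]
        x = extract_clip_feature(clip)          # e.g. Fisher vector of DT features
        rows.append([m(x) for m in concept_models])
    return np.array(rows)                       # shape: T x K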
4.3 HMM Fisher Vector
In this section, we introduce how we model and encode the transitions of activity concepts
in videos. Figure 4.1 gives an illustration of the whole process.
4.3.1 Our Model
We use an HMM to model a video event with activity concept transitions over time. There
are K states, each corresponding to an activity concept. Every two concepts i, j have a
transition probability $P(C_j|C_i)$ from concept i to j. Each observation is a feature vector
x extracted from a sliding window.
Since we are working with a generative model, the emission probability of x given
concept $C_i$ is derived from

P(x|C_i) \propto \frac{\theta_i(x)}{P(C_i)}

where $\theta_i(x)$ is the activity concept classifier output and $P(C_i)$ is the prior probability of
concept i. Here we assume a uniform prior for all observations.
To make the derivation clearer, we define the following notation:
• $\pi_i$ is the prior probability of concept i.
• $a_{i|j}$ is the transition probability from concept j to concept i.
• $b_{x|i}$ is the emission probability of x given concept i.
4.3.2 HMMFV Formulation
The idea of the Fisher kernel was first proposed in [40]; the goal is to obtain the sufficient statistics
of a generative model, and use them as kernel functions in discriminative classifiers such
as SVM.
Denote by $P(X|\lambda)$ the probability of X given the parameters $\lambda$ of some generative model.
The Fisher kernel is defined as

U_X = \nabla_{\lambda} \log P(X|\lambda)    (4.1)

K(X_i, X_j) = U_{X_i}^{T} I^{-1} U_{X_j}    (4.2)

where I is the Fisher information matrix.
If we ignore I by setting it to the identity, $U_X$ can be seen as a feature vector in a linear
kernel. We use $U_X$ as a feature vector and call it the Fisher Vector of the generative
model.
As the emission probabilities are derived from activity concept classifiers, we only use
the partial derivatives over the transition probability parameters $a_{i|j}$ to derive the Fisher Vector
$U_X$. Besides, we decide not to include an event label variable, since it would make the dimension
of the Fisher Vector grow linearly with the number of events and would require recomputing
$U_X$ every time there is a new event.
The log-likelihood of the HMM is given by

\log P(X|a, \pi) = \log \sum_{s_1, \dots, s_T} \prod_{i=1}^{T} b_{x_i|s_i}\, a_{s_i|s_{i-1}}    (4.3)

where $s_1, \dots, s_T$ enumerate all the possible activity concepts. To simplify notation,
let $a_{s_1|s_0} = \pi_{s_1}$.
By taking the partial derivative of the log-likelihood function over $a_{i|j}$, we have

\frac{\partial}{\partial a_{i|j}} \log P(X|a, \pi) = \sum_{t} \left[ \frac{\xi_t(i,j)}{a_{i|j}} - \gamma_{t-1}(j) \right]    (4.4)

where

\xi_t(i,j) = P(s_t = i, s_{t-1} = j \mid X, a, \pi)
\gamma_{t-1}(j) = P(s_{t-1} = j \mid X, a, \pi)

Denote

\alpha_t(i) = P(x_1, \dots, x_t, s_t = i \mid a, \pi)
\beta_t(i) = P(x_{t+1}, \dots, x_T \mid s_t = i, a, \pi)
FV(i,j) = \frac{\partial}{\partial a_{i|j}} \log P(X|a, \pi)

We have

FV(i,j) \propto \sum_t \left[ \alpha_{t-1}(j)\, b_{x_t|i}\, \beta_t(i) - \alpha_{t-1}(j)\, \beta_{t-1}(j) \right]    (4.5)

The vector is then normalized to have unit $L_2$-norm.
4.3.3 Parameter Learning
We use only one general model for HMMFV. When there are new event classes, instead of
relearning the HMM parameters, we can still use the same model without changing the existing
HMMFVs and only update the discriminative classifiers.
Here we use a very simple method to learn the model. First, we randomly select activity
concept responses from neighboring sliding windows of all events. The model parameters
are computed as

a_{i|j} \propto \sum_k \theta_i(x_{k,1})\, \theta_j(x_{k,0})    (4.6)

\pi_i \propto \sum_k \theta_i(x_{k,0})    (4.7)

where $x_{k,0}$ and $x_{k,1}$ are neighboring observations, and k runs over all the samples.
Then we normalize the parameters to make them valid distributions.
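A small sketch of this estimation step (Eqs. 4.6 and 4.7), where each concept model is assumed to be a callable returning the probability of its concept for a feature vector:

import numpy as np

def estimate_hmm_parameters(pairs, concept_models):
    """Estimate transition matrix and prior from neighboring windows.

    pairs: list of (x_prev, x_next) low-level feature vectors from neighboring
           sliding windows, sampled across all training events.
    """
    K = len(concept_models)
    trans = np.zeros((K, K))        # trans[i, j] ~ a_{i|j}
    prior = np.zeros(K)
    for x_prev, x_next in pairs:
        s_prev = np.array([m(x_prev) for m in concept_models])
        s_next = np.array([m(x_next) for m in concept_models])
        trans += np.outer(s_next, s_prev)      # theta_i(x_{k,1}) * theta_j(x_{k,0})
        prior += s_prev
    trans /= trans.sum(axis=0, keepdims=True)  # each column becomes a distribution over i
    prior /= prior.sum()
    return trans, prior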
4.3.4 Discussions
HMMFV can be computed very efficiently. The $\alpha$'s and $\beta$'s required for all FV(i, j) can
be computed via standard dynamic programming [79]. The dimension of HMMFV is $K^2$,
where K is the number of activity concepts.
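The sketch below shows one way to realize this computation with the standard forward-backward recursions; it follows my reading of Eq. 4.5, works with unscaled probabilities (log-space or scaling would be needed for very long videos), and is not the thesis implementation:

import numpy as np

def hmm_fisher_vector(emissions, trans, prior):
    """Compute HMMFV for one video.

    emissions: T x K array, emissions[t, i] ~ P(x_t | C_i) (concept scores, uniform prior)
    trans:     K x K array, trans[i, j] = a_{i|j}, transition from concept j to i
    prior:     K array, prior[i] = pi_i
    """
    T, K = emissions.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))

    # forward pass: alpha[t, i] = P(x_1..x_t, s_t = i)
    alpha[0] = prior * emissions[0]
    for t in range(1, T):
        alpha[t] = emissions[t] * (trans @ alpha[t - 1])

    # backward pass: beta[t, i] = P(x_{t+1}..x_T | s_t = i)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans.T @ (emissions[t + 1] * beta[t + 1])

    # Eq. 4.5: FV(i,j) ~ sum_t alpha[t-1,j] * b_{x_t|i} * beta[t,i] - alpha[t-1,j] * beta[t-1,j]
    fv = np.zeros((K, K))
    likelihood = alpha[T - 1].sum()
    for t in range(1, T):
        gamma_prev = alpha[t - 1] * beta[t - 1] / likelihood             # posterior of s_{t-1}
        xi_over_a = np.outer(emissions[t] * beta[t], alpha[t - 1]) / likelihood
        fv += xi_over_a - gamma_prev[np.newaxis, :]
    fv = fv.flatten()
    return fv / (np.linalg.norm(fv) + 1e-12)                             # L2 normalize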
Intuitively, by looking at Equation 4.4, HMMFV accumulates the difference between
the actual expectation of a concept transition and the model's prediction based on the
previous concept. If the observations fit the model perfectly, the difference is zero. As
we are using a single general model, background information is thus suppressed. This is
especially useful for high level event classification since the videos often contain irrelevant
information.
By taking derivatives over the model parameters, HMMFV preserves the sufficient statistics
of the HMM. Consider a birthday party event, in which people singing and people dancing
often follow each other; FV(dancing, singing) and FV(singing, dancing) should both
have high positive energy, indicating their transition probabilities are underestimated in
the general model. Similarly, FV(washing, sewing) should have high negative energy,
indicating their transition probability is overestimated. Based on this property, we can
describe a video using the activity concept transitions with high positive values in HMMFV.
4.4 Experiments
In this section, we describe the dataset we used and the experiment settings. Then we
compare our approach with the baseline, and study the influence of activity concept selection.
Finally we compare performance with two state-of-the-art systems, with abundant
and with few training samples.
4.4.1 Dataset
We used the TRECVID MED 11 Event Kit data [1] (EventKit) for evaluation. This dataset
contains 2,062 diverse, user-generated videos varying in length, quality and resolution. There
are 15 event classes: 1. Attempting a board trick, 2. feeding an animal, 3. landing a fish, 4.
wedding ceremony, 5. working on a woodworking project, 6. birthday party, 7. changing
a vehicle tire, 8. flash mob gathering, 9. getting a vehicle unstuck, 10. grooming an
animal, 11. making a sandwich, 12. parade, 13. parkour, 14. repairing an appliance
and 15. working on a sewing project. To compare our framework with [39], we followed
their protocol and randomly selected 70% of the videos from EventKit for training and 30% for
testing. All the videos were resized to 320 pixels in width. We set the size of the sliding
windows to 100 frames, with 50-frame overlap.
For the purpose of training activity concept classifiers, we used two datasets. We obtained the
60 activity concept annotations used in [39] by communicating with the authors. The
concepts were annotated on the EventKit and are highly related to the high level events. We call
these concepts Same Domain Concepts.
Another dataset we used for training concepts is the UCF 101 [93] dataset. It has
13,320 videos from 101 categories. Most of the videos in UCF 101 are less than 10
seconds long. The categories range from playing musical instruments to doing sports,
most of which are not related to the events in EventKit directly. We call them Cross
Domain Concepts.
4.4.2 Experimental Setup
The Dense Trajectory (DT) feature [107] was used as the low level feature for activity concept
classification. DT tracks points densely and describes each tracklet with its shape and
the HoG, HoF and MBH features around the tracklet. We used the implementation provided
by the authors,^1 and set the sampling stride to 10.
Recent results in image and video classification [77] [98] show that encoding low-level
features with a Gaussian Mixture Model (GMM) and the Fisher kernel gives better performance
than with BoW histograms. Suppose the low-level feature has D dimensions and the
number of GMM cluster centers is K; the resulting vector has O(KD) dimensions.
We used this encoding scheme and projected the DT feature vectors to D = 128 with PCA;
the number of clusters is 64. Since the same domain concepts were annotated on all EventKit
videos, we used only the annotations in the training partition to train the activity concept
classifiers. For the different domain concepts, we used all videos in UCF 101 as training data.

^1 http://lear.inrialpes.fr/people/wang/dense_trajectories
Max pooling (Max) was selected as the baseline; it represents a video by the
maximum activity concept responses, and temporal information is dropped. Suppose the
concept responses from the t-th sliding window are $[\theta^t_1\ \theta^t_2\ \dots\ \theta^t_K]$; max pooling is defined
as $[v_1\ v_2\ \dots\ v_K]$, where

v_i = \max_t \theta^t_i    (4.8)

The vector is normalized to have unit $L_2$-norm, and used to build discriminative
classifiers.
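A minimal numpy sketch of this baseline:

import numpy as np

def max_pool_concepts(M):
    """Video-level descriptor from a T x K concept response matrix (the Max baseline)."""
    v = M.max(axis=0)                       # v_i = max_t theta_i^t  (Eq. 4.8)
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalize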
We used an SVM classifier with an RBF kernel; the parameters were selected by 5-fold
cross validation on the training partition. A weighted average [71] was used to fuse results from
different modalities. All results were evaluated using average precision (AP).
4.4.3 Same Domain Concepts
According to the second and third columns of Table 4.1, HMMFV achieves better performance
in 14 of the 15 events. This result validates that by encoding concept transitions,
HMMFV preserves more information and has more discriminative power.
Max pooling achieves better performance on the woodworking event; this may happen
when some concept classifiers have strong performance and are highly correlated to a
single event (e.g. person carving).
In general, HMMFV achieves 11.9% higher performance than the baseline.
Same Domain Cross Domain
Event ID Max HMMFV Max HMMFV
1 0.846 0.857 0.772 0.806
2 0.272 0.398 0.413 0.458
3 0.708 0.767 0.698 0.748
4 0.640 0.782 0.664 0.717
5 0.525 0.507 0.392 0.646
6 0.611 0.753 0.791 0.831
7 0.393 0.492 0.249 0.355
8 0.660 0.745 0.857 0.864
9 0.635 0.730 0.635 0.687
10 0.498 0.539 0.585 0.606
11 0.252 0.386 0.386 0.436
12 0.645 0.761 0.706 0.741
13 0.528 0.863 0.721 0.818
14 0.344 0.596 0.600 0.596
15 0.381 0.545 0.384 0.545
mean AP 0.529 0.648 0.590 0.657
Table 4.1: Average precision comparison for event classification with same domain concepts
and cross domain concepts; bold numbers correspond to the higher performance in
their groups
4.4.4 Cross Domain Concepts
Although same domain concepts can provide semantic meanings for the videos, the annotations
can be expensive and time consuming to obtain, and adding new event classes can
become cumbersome. Hence, we also studied the effect of using cross domain concepts.
Interestingly, the fourth and fifth columns of Table 4.1 show that even if the activity
concepts are not related to the events directly, our framework still achieves comparable performance.
It is quite likely that the concept classifiers capture some inherent appearance
and motion information, so that they can still provide discriminative information
for event classification. HMMFV still achieves better performance than max pooling,
which indicates that temporal information is also useful when the activity concepts are
from a different domain.
To study the influence of domain relevance, we randomly selected 20, 40 and 60
concepts from the same domain concepts, and 20, 40 and 80 concepts from the cross domain
concepts for HMMFV. The mean AP performance is plotted in Figure 4.2. According
to the figure, the performance increases with more concepts. Same domain concepts easily
outperform cross domain concepts given the same number of concepts, and reach the
same level of performance with only 60 concepts, compared with 101 cross domain
concepts. Besides, as shown in Table 4.2, if we combine the two sets of results by late
fusion, the mean AP can be further improved by 4%, which indicates that the HMMFVs obtained
from same and cross domain concepts are complementary.

Figure 4.2: Mean average precisions with different numbers of concepts. The red line
shows a randomly selected subset of cross domain concepts from 20 to 101. The cyan line
illustrates a randomly selected subset of same domain concepts from 20 to 60.

Concepts Same Domain Cross Domain Both
mean AP 0.648 0.657 0.693
Table 4.2: Mean average precisions when using same domain concepts, cross domain
concepts and both for generating HMMFV
4.4.5 Comparison with State-of-the-Art
In this section, we compare our framework with two state-of-the-art approaches. [39] is
an activity concept based system; it models the joint occurrence of concepts without
considering temporal information. We used their provided numbers and followed the
same data partitioning method. We also compare our framework with a low level based
approach: we implemented the Fisher kernel with visual vocabulary method in [77], but
used only the components corresponding to the means. We used the same low level features (DT)
for building our concept classifiers. We used both same domain and cross domain concepts
in our framework. An event-by-event comparison is shown in Table 4.3.

Event ID Joint [39] LL [77] HMMFV HMMFV+LL
1 0.757 0.813 0.885 0.882
2 0.565 0.363 0.468 0.461
3 0.722 0.743 0.796 0.789
4 0.675 0.829 0.771 0.811
5 0.653 0.496 0.604 0.623
6 0.782 0.736 0.811 0.814
7 0.477 0.541 0.482 0.518
8 0.919 0.868 0.873 0.877
9 0.691 0.769 0.756 0.772
10 0.510 0.579 0.617 0.634
11 0.419 0.515 0.476 0.524
12 0.724 0.720 0.770 0.770
13 0.664 0.792 0.886 0.890
14 0.782 0.661 0.619 0.634
15 0.575 0.608 0.575 0.621
mean AP 0.661 0.669 0.693 0.708
Table 4.3: Average precision comparison with the state-of-the-art. Joint refers to the joint
modeling of activity concepts described in [39]; LL refers to the low level approach
described in [77]
Our framework outperforms the joint modeling of activity concepts approach in 11 of
the 15 events. Moreover, we used only a single type of low level feature, and did not fuse
the event classification results obtained from low level features directly. Compared with
the low-level approach, our framework is better in 9 of the 15 events. Our framework
achieves the best performance when used alone.
Besides, if we fuse the low level results with our framework, the overall performance
further increases to 70.8%, a 4% improvement over the two previous systems.
These comparisons indicate that encoding concept transitions provides useful information
for event classification. Even though the joint modeling approach does consider
the pairwise relationships of activity concepts, temporal information is not preserved.
Method mean AP
LL 0.421
Max (Same Domain) 0.456
HMMFV (Same Domain) 0.554
Max (Cross Domain) 0.432
HMMFV (Cross Domain) 0.470
HMMFV (Both) 0.562
Table 4.4: Average precision comparison with 10 training samples for each event
4.4.6 Classication with Limited Training Samples
In real world retrieval problems, it is desirable to let users provide just a few video samples
of an event before the system builds a reasonably good event classifier.
In this section, we studied the case when only 10 positive samples per event are
provided for training. The training videos were randomly selected from the original
training partition, and we used the previous concept classifiers since they did not include
test information.
As shown in Table 4.4, when the number of training samples is limited, the performance
of the low level approach decreases more significantly than the activity concept based
HMMFV; their relative mean AP difference is 14.1%. One possible explanation is that
by using activity concepts, our framework has a better level of abstraction, which can be
captured by discriminative classifiers even with a few training samples. Another interesting
observation is that, when the number of training samples is limited, the performance
of same domain concepts is 8.4% higher than that of cross domain concepts.
This is understandable since some same domain concepts are highly correlated with high
level events (e.g. kissing for wedding ceremony); they can help preserve highly discriminative
information if their classifiers are strong.
Again, HMMFV outperforms the max pooling baseline.
Chapter 5
Weakly-supervised Action Localization
This chapter addresses the problem of fine-grained action localization from unconstrained
web videos.^1 A fine-grained action takes place in a higher-level activity or event (e.g.,
jump shot and slam dunk in basketball, blow candle in birthday party). Its instances
are usually temporally localized within the videos, and share similar context with other
fine-grained actions belonging to the same activity or event.
5.1 Approach Overview
Most existing work on action recognition focuses on action classification using pre-segmented
short video clips [93, 54, 88], which assumes implicitly that the actions of interest are temporally
segmented during both training and testing. The TRECVID Multimedia Event
Recounting evaluation [75] as well as the THUMOS 14 Challenge [47] both address action
localization in untrimmed video, but the typical approach involves training classifiers on
temporally segmented action clips and testing using a sliding window on untrimmed video.
This setting does not scale to large action vocabularies when data is collected from consumer
video websites. Videos there are unconstrained in length and format (home videos vs.
professional videos), and almost always only have video-level annotations of actions.
We assume that only video-level annotations are available for the fine-grained action
localization problem. The ability to localize fine-grained actions in videos has important
applications such as video highlighting, summarization, and automatic video transcription.
It is also a challenging problem for several reasons: first, fine-grained actions for any
high-level activity or event are inherently similar, since they take place in similar scene
contexts; second, occurrences of the fine-grained actions are usually short (a few seconds)
in training videos, making it difficult to associate the video-level labels with the occurrences.

^1 Work done during a summer internship at Google.

Figure 5.1: Fine-grained actions are usually present as a tiny fraction within videos
(top). Our framework uses cross-domain transfer from possibly noisy image search results
(bottom) and identifies the action-related images in both domains (marked in green).
Our key observation is that one can exploit web images to help localize fine-grained
actions in videos. As illustrated in Figure 5.1, by using action names (basketball slam
dunk) as queries, many of the image search results offer well localized actions, though
some of them are non-video like or irrelevant. Identifying action related frames from
weakly supervised videos and filtering irrelevant image tags is hard in either modality
by itself; however, it is easier to tackle these two problems together. This is due to our
observation that although most of the video frames and web images which correspond to
actions are visually similar, the distributions of non-action images from the video domain
and the web image domain are usually very different. For example, in a video with a
basketball slam dunk, non slam dunk frames in the video are mostly from a basketball
game. The irrelevant results returned by image search are more likely to be product
shots, or cartoons.
This motivates us to formulate a domain transfer problem between web images and
videos. To allow domain transfer, we first treat the videos as bags of frames, and use the
feature activations from deep convolutional neural networks (CNN) [53] as the common
representation for images and frames. Suppose we have selected a set of video frames
and a set of web images for every action; the domain transfer framework goes in two
directions: video frames to web images, and vice versa. For both directions, we use the
selected images from the source domain to train action classifiers by fine-tuning the top
layers of the CNN; we then apply the trained classifiers to the target domain. Each
image in the target domain is assigned a confidence score given by its associated action
classifier from the source domain. By gradually filtering out the images with low scores,
the bidirectional domain transfer can progress iteratively. In practice, we start from the
video frames to web images direction, and randomly select the video frames for training.
Since the non-action related frames are not likely to occur in web images, the tuned CNN
can be used to filter out the non-video like and irrelevant web images. The final domain
transfer from web images is used to localize action related frames in videos. We term
these action-related frames localized action frames (LAF).
Videos are more than an unordered collection of frames. We choose long short-term
memory (LSTM) [37] networks as the temporal model. Compared with traditional
recurrent neural networks (RNN), LSTM has built-in input gates and forget gates to
control its memory cells. These gates allow LSTM to either keep a long term memory
or forget its history. The ability to learn from long sequences with an unknown amount of
background is well-suited for fine-grained action localization from unconstrained web
videos. We treat every sampled video frame as a time step in LSTM. When we train LSTM
models, we label all video frames by their video-level annotation, but use the LAF scores
generated by bidirectional domain transfer as weights on the loss for misclassification.
By doing this, irrelevant frames are effectively down-weighted in the training stage. The
framework can be naturally extended to use video shots as time steps, from which spatio-temporal
features can be extracted to capture local motion information.
Figure 5.2: Illustration of the LAF proposal framework for basketball slam dunk videos.
We use a cross-domain transfer algorithm to jointly filter out non-video like web images
and non-action like video frames. We then learn a LAF proposal model using the filtered
images, which assigns LAF scores to training video frames. Finally, we train LSTM
based fine-grained action detectors, where the misclassification penalty of each time step
is weighted by the LAF score.
5.2 Action Localization by Domain Transfer from Web Images
Our proposed fine-grained action localization framework uses both weakly labeled videos
and noisily tagged web images. It employs the same CNN based representation for web
images and video frames, and uses a bidirectional domain transfer algorithm to filter
out irrelevant images in both domains. A localized action frame (LAF) proposal model is
trained from the remaining web images, and used to assign LAF scores to video frames.
Finally, we use long short-term memory networks to train fine-grained action detectors,
using the LAF scores as the weights of the loss for misclassification. The pipeline is illustrated
in Figure 5.2.
5.2.1 Shared CNN Representation
A shared feature space is required for domain transfer between images and videos. Here
we treat a video as a bag of frames, and extract activations from the intermediate layers
of a convolutional neural network (CNN) as features for both web images and video
frames. Although there is previous work on action recognition from still images using
other representations [119], we choose CNN activations for their simplicity and state-of-the-art
performance in several action recognition tasks [89, 50].
Training a CNN end-to-end from scratch is time consuming, and requires a large
amount of annotated data. It has been shown that CNN weights trained on large image
data sets like ImageNet [15] are generic, and can be applied to other image classification
tasks by fine-tuning. It is also possible to disable the error back-propagation for the first
several layers during fine-tuning. This is equivalent to training a shallower neural network
using the intermediate CNN activations as features.
We adopt the methodology of fine-tuning the top layers of the CNN, and experiment with
the AlexNet [53] CNN architecture. It contains five convolution layers and three fully
connected layers. Each convolution layer is followed by a ReLU non-linearity layer and a
maximum pooling layer. We pre-trained the network on the ImageNet data set using the data
partitions defined in [84]. We resized the images to 256 by 256, and used the raw pixels
as inputs. For the purpose of fine-tuning, we fixed the network weights before the first
fully connected layer and only updated the parameters of the top three layers. Feature
activations from fc6 serve as the shared representation for web images and video frames,
and allow cross-domain transfer between the two.
5.2.2 LAF Proposal with Web Images
Fine-grained actions tend to be more localized in videos than high-level activities. For
example, a basketball match video usually consists of jump shot, slam dunk, free throw etc.,
each of which may be as short as a few seconds. We address the problem of automatically
identifying them from minutes-long videos.
Fortunately, we observe that many of the fine-grained actions have image highlights
on the Internet (Figure 5.3). They are easily obtained by querying image search engines
with action names. However, these images are noisily labeled, and not useful for learning
LAF proposal models directly, as they contain:
• Irrelevant images due to image crawling errors; for example, a jogging image could
be retrieved with the keyword soccer dribbling.
• Items related to the actions, such as objects and logos.
• Images with the same action but from a different domain, such as advertisement
images with a clear background, or cartoons.

Figure 5.3: Top retrieved images from the Google image search engine, using the keywords
tennis serve and baseball pitch.

Algorithm 1: Domain transfer algorithm for localized action frame proposal.
Input: Images with noisy labels $(I_i, a_i)$, frames with video-level labels $(V_i, a_i)$
Output: LAF proposal model
Initialize $\mathcal{I}$ and $\mathcal{V}$ to include all image and frame inputs respectively.
while stopping criteria not met do
    1. Fine-tune $\mathrm{CNN}_v$ using data in frame set $\mathcal{V}$.
    2. Compute $\mathrm{CNN}_v(I)$ for all $I \in \mathcal{I}$.
    3. Update $\mathcal{I} = \{ I \mid I \in \mathcal{I},\ \mathrm{CNN}_v(I)_{a_I} > \tau_1 \}$.
    4. Fine-tune $\mathrm{CNN}_i$ using data in image set $\mathcal{I}$.
    5. Compute $\mathrm{CNN}_i(V)$ for all $V \in \mathcal{V}$.
    6. Update $\mathcal{V} = \{ V \mid V \in \mathcal{V},\ \mathrm{CNN}_i(V)_{a_V} > \tau_2 \}$.
end
return $\mathrm{CNN}_i$
Filtering the irrelevant web images is a challenging problem by itself. However, it
can be turned into an easier problem by using weakly-supervised videos. We hypothesize
that applying a classifier, learned on video frames, as a filter on the images removes many
irrelevant images and preserves most video-like image highlights. More formally, assume
we have video frames in $\mathcal{V}$ and web images in $\mathcal{I}$, and each of them is assigned a fine-grained
action label $a = 0, 1, \dots, N-1$. We first learn a multi-class classifier $\mathrm{CNN}_v(\cdot) \in \mathbb{R}^N$ by fine-tuning
the top layers of the CNN using video frames. $\mathrm{CNN}_v(\cdot)$ encodes action discriminative
information from the videos' perspective; we apply it to all $I \in \mathcal{I}$, and update

\mathcal{I} = \{ I \mid I \in \mathcal{I},\ \mathrm{CNN}_v(I)_{a_I} > \tau_1 \}    (5.1)

where $\tau_1 \in [0, 1]$ is the threshold on the minimum softmax output, and $\mathrm{CNN}_v(I)_{a_I}$ corresponds
to the $a_I$-th dimension of $\mathrm{CNN}_v(I)$.
We then use the filtered $\mathcal{I}$ to fine-tune $\mathrm{CNN}_i(\cdot) \in \mathbb{R}^N$, and update $\mathcal{V}$ in a similar
manner:

\mathcal{V} = \{ V \mid V \in \mathcal{V},\ \mathrm{CNN}_i(V)_{a_V} > \tau_2 \}    (5.2)

We iterate the process and update $\mathcal{V}$ and $\mathcal{I}$ until certain stopping criteria are met.
The LAF proposal model $\mathrm{CNN}_i(\cdot)$ is learned using the final web image set $\mathcal{I}$; the LAF
score for a video frame V with action label a is given by

\mathrm{LAF}(V) = \mathrm{CNN}_i(V)_a    (5.3)

The whole process is summarized in Algorithm 1.
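A compact sketch of the loop in Algorithm 1; finetune and score are placeholders for fine-tuning the top CNN layers and reading a softmax output, and the thresholds and iteration cap are illustrative assumptions rather than the thesis settings:

def bidirectional_domain_transfer(frames, images, finetune, score,
                                  tau1=0.5, tau2=0.5, max_iters=5):
    """Sketch of the bidirectional filtering loop.

    frames, images: lists of (feature, action_label) pairs.
    finetune(data) -> model and score(model, x, a) -> confidence for class a
    are hypothetical callables, not a real training API.
    """
    V, I = list(frames), list(images)
    model_i = None
    for _ in range(max_iters):
        model_v = finetune(V)                                       # video -> image direction
        I = [(x, a) for x, a in I if score(model_v, x, a) > tau1]   # keep video-like images
        model_i = finetune(I)                                       # image -> video direction
        V = [(x, a) for x, a in V if score(model_i, x, a) > tau2]   # keep action-like frames
    return model_i   # LAF proposal model; LAF(V) = score(model_i, x_V, a_V)

The validation-based stopping criterion described in the discussion below can be added by checking video-level classification accuracy after each iteration and breaking out of the loop when it starts to drop.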
Discussion: We initialize the frame set $\mathcal{V}$ by random sampling. Even though many
of the sampled frames do not correspond to the actions of interest, they can help filter out
the non-video like web images such as cartoons, object photos and logos. In practice, the
random sampling of video frames is adequate for this step since the mis-labeled frames
rarely appear in the web image collection.
We set the stopping criteria to be: (1) video-level classification accuracy on a validation
set starts to drop; or (2) a maximum number of iterations is reached. To be
more efficient, we train one-vs-rest linear SVMs using frames in $\mathcal{V}$ after each iteration,
and apply the classifiers to video frames in the validation set. We take the average of
the frame-level classifier responses to generate video-level responses, and use them to compute
classification accuracy.
5.2.3 Long Short-Term Memory Network
Long Short-term Memory (LSTM) [37] is a type of recurrent neural network (RNN) that
solves the vanishing and exploding gradients problem of previous RNN architectures when
trained using back-propagation. The standard LSTM architecture includes an input layer, a
recurrent LSTM layer and an output layer. The recurrent LSTM layer has a set of memory
cells, which are used to store real-valued state information from previous observations.
This recurrent information flow, from previous observations, is particularly useful for
capturing temporal evolution in videos, which we hypothesize is useful in distinguishing
between fine-grained sports activities. In addition, LSTM's memory cells are protected
by input gates and forget gates, which allow it to maintain a long-term memory and
reset its memory, respectively. We employ the modification to LSTMs proposed by Sak
et al. [86] to add a projection layer after the LSTM layer. This reduces the dimension of
the stored states in memory cells, and helps to make the training process faster.
Let us denote the input sequence X as $\{x_1, x_2, \dots, x_T\}$, where in our case each $x_t$ is a
feature vector of a video frame with time stamp t. LSTM maps the input sequence into
the output action responses $Y = \{y_1, y_2, \dots, y_T\}$ by:

i_t = \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i)    (5.4)
f_t = \sigma(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f)    (5.5)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c)    (5.6)
o_t = \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)    (5.7)
m_t = o_t \odot h(c_t)    (5.8)
r_t = W_{rm} m_t    (5.9)
y_t = W_{yr} r_t + b_y    (5.10)

Here the W's and b's are the weight matrices and biases, respectively, and $\odot$ denotes the
element-wise multiplication operation. c is the memory cell activation; i, f, o are the input
gate, forget gate and output gate, respectively. m and r are the recurrent activations before
and after projection. $\sigma$ is the sigmoid function; g and h are tanh. An illustration of the
LSTM architecture with a single memory block is shown in Figure 5.4.

Figure 5.4: Illustration of LSTM architecture with a single memory block. A recurrent
projection layer is added to reduce the number of parameters. Reproduced from [86] with
the authors' permission.
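A small numpy sketch of one step of Eqs. 5.4-5.10, treating the peephole weights on the cell state as diagonal (element-wise) terms, which is an assumption on my part; W and b are dictionaries of parameters keyed roughly as in the equations:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstmp_step(x_t, r_prev, c_prev, W, b):
    """One step of the LSTM with a recurrent projection layer (Eqs. 5.4-5.10)."""
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ r_prev + W['ic'] * c_prev + b['i'])   # input gate
    f_t = sigmoid(W['fx'] @ x_t + W['fr'] @ r_prev + W['fc'] * c_prev + b['f'])   # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cr'] @ r_prev + b['c']) # memory cell
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ r_prev + W['oc'] * c_t + b['o'])      # output gate
    m_t = o_t * np.tanh(c_t)
    r_t = W['rm'] @ m_t                                                            # projection
    y_t = W['yr'] @ r_t + b['y']
    return y_t, r_t, c_t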
Training LSTM with LAF scores. We sample video frames at 1 frame per second
and treat each frame as a basic LSTM step. Similar to speech recognition tasks, each
time step requires a label and a penalty weight for misclassification. The truncated
backpropagation through time (BPTT) learning algorithm [113] is used for training. We
limit the maximum number of unrolling time steps to k and only back-propagate the error for k time
steps. Incorporating the LAF scores into the LSTM framework is simple: we first run
the LAF proposal pipeline to score all sampled training video frames. Then we set the
frame-level labels based on the video-level annotation, but use the LAF scores as the penalty
weights. Using this method, the LSTM is forced to make the correct decision after watching a
LAF returned by the LAF proposal system, and it is not penalized as heavily when gathering
context information from earlier frames or misclassifying an unrelated frame.
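As a sketch of this weighting scheme (not the exact training code), the per-frame training objective can be written as LAF-weighted cross-entropy:

import numpy as np

def laf_weighted_loss(frame_logits, video_label, laf_scores):
    """Per-frame cross-entropy weighted by LAF scores.

    frame_logits: T x N array of per-step class scores.
    video_label:  the video-level action label, used for every frame.
    laf_scores:   T array in [0, 1]; low scores down-weight irrelevant frames.
    """
    z = frame_logits - frame_logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(probs)), video_label] + 1e-12)
    return float((laf_scores * nll).mean())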
Computing LAF scores for video shots. For some data sets, it might be desirable
to use video shots as the basic LSTM steps, as it allows the use of spatio-temporal motion
features for representation. We extend the frame-level LAF scores to shot-level by taking
the average of LAF scores from the sampled frames within a certain video shot.
5.3 Experiments
This section first describes the data set we collected for evaluation, and then presents
experimental results.
5.3.1 Data Set
There is no existing data set for fine-grained action localization using untrimmed web
videos. To evaluate our proposed method's performance, we collected a Fine Grained
Actions 240 (FGA-240) data set focusing on sports videos. It consists of over 130,000
YouTube videos in 240 categories. A subset of the categories is shown in Figure 5.5. We
selected 85 high-level sports activities from the Sports-1M data set [50], and manually
chose the fine-grained actions that take place in these activities. The action categories cover
aquatic sports, team sports, sports with animals and others.

Figure 5.5: Some of the high-level sports activities and their corresponding fine-grained
sports actions in the Fine Grained Actions 240 data set. Best viewed under magnification.
We decided the fine-grained categories for each high-level sports activity using the
following method: given YouTube videos and their associated text data such as titles and
descriptions, we ran an automatic text parser to recognize sports related entities. The
recognized entities which correlate with the high-level sports activities were stored in the
pool and then manually filtered to keep only fine-grained sports actions. As an example,
for basketball the initial entity pool contains not only fine-grained sports actions (e.g., slam
dunk, block), but also game events (e.g., NBA) and celebrities (e.g., Kobe Bryant). Once
the fine-grained categories were fixed, we applied the same text analyzer to automatically
assign video-level annotations, and only kept the videos with high annotation confidence.
We finally visualized the data set to filter out false annotations and removed the fine-grained
sports action categories with too few samples.
Our final data set contains 48,381 training videos and 87,454 evaluation videos. The
median number of training videos per category is 133. We used 20% of the evaluation
videos for validation and 80% for testing. For temporal localization evaluation, we manually
annotated 400 videos from 45 fine-grained actions. The average length of the videos
is 79 seconds.
5.3.2 Experiment Setup
LSTM implementation. We used the feature activations from pre-trained AlexNet
(first fully-connected layer with 4,096 dimensions) as the input features for each time
step. We followed the LSTM implementation by Sak et al. [86] which utilizes a multi-core
CPU on a single machine. For training, we used asynchronous stochastic gradient
descent and set the batch size to 12. We tuned the training parameters on the validation
videos and set the number of LSTM cells to 1024, the learning rate to 0.0024 and learning
rate decay with a factor of 0.1. We fixed the maximum unroll time step k to 20 to
forward-propagate the activations and backward-propagate the errors.
Video level classification. We evaluated fine-grained action classification results
at the video level. We sampled test video frames at 1 frame per second. Given T sampled
frames from a video, these frames are forward-propagated through time, and produce T
softmax activations. We used average fusion to aggregate the frame-level activations over
whole videos.
Temporal localization. We generated the frame-level softmax activations using
the same approach as for video level classification. We used a temporal sliding window of 10
time steps; the score of each sliding window was decided by taking the average of the softmax
activations. We then applied non-maximum suppression to remove the localized windows
which overlap with each other.
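A simple sketch of this scoring and suppression step for a single action class, assuming a stride of one frame and measuring overlap as temporal intersection-over-union:

import numpy as np

def localize(frame_scores, win=10, overlap_thresh=0.5):
    """Score sliding windows by average softmax and apply 1-D non-maximum suppression.

    frame_scores: T array of per-frame confidences for one action class.
    Returns a list of (start, end, score) detections.
    """
    T = len(frame_scores)
    wins = [(s, s + win, frame_scores[s:s + win].mean())
            for s in range(0, max(T - win, 0) + 1)]
    wins.sort(key=lambda w: w[2], reverse=True)
    kept = []
    for s, e, sc in wins:
        # temporal intersection-over-union with already kept windows
        if all((min(e, e2) - max(s, s2)) / (max(e, e2) - min(s, s2)) < overlap_thresh
               for s2, e2, _ in kept):
            kept.append((s, e, sc))
    return kept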
Evaluation metric. For classification, we report Hit @k, which is the percentage of
testing videos whose labels can be found in the top k results. For localization, we follow
the same evaluation protocol as THUMOS 2014 [47] and evaluate mean average precision.
A detection is considered to be a correct one if its overlap with the groundtruth is over some
ratio r.
CNN baseline. We deployed the single-frame architecture used by Karpathy et
al. [50] as the CNN baseline. It was shown to have comparable performance with multiple
variations of CNNs while being simpler. We sampled the video frames at 1 frame
per second, and used average fusion to aggregate softmax scores for the classification and
localization tasks. Instead of training a CNN from scratch, we used network parameters
from the pre-trained AlexNet, and fine-tuned the top two fully-connected layers and a
softmax layer. Training parameters were decided using the validation set.
Low-level feature baseline. We extracted the low-level features used by [50, 117] over
whole videos for the classification task; the feature set includes low-level visual and motion
features aggregated using bag-of-words. We used the same neural network architecture
as [50] with multiple Rectified Linear Units to build classifiers based on the low-level
features. Its structure (e.g., number of layers, number of cells per layer) as well as
training parameters were decided with the validation set.
5.3.3 Video-level Classification Results
We first report the fine-grained action classification performance at the video level.
Comparison with baselines. We compared several baseline systems' performance
against our proposed method on the FGA-240 data set; the results are shown in Table 5.1.
From the table we can see that systems based on CNN activations outperformed low-level
features by a large margin. There are two possible reasons for this: first, CNN learned
activations are more discriminative in classifying fine-grained sports actions, even without
capturing local motion patterns explicitly; second, low-level features were aggregated at
video level. These video-level features are more sensitive to background and irrelevant
video segments, which happens a lot in fine-grained sports action videos.
Method Video Hit @1 Video Hit @5
Random 0.4 2.1
Low-level features [117] 30.8 -
CNN [50] 37.3 68.5
LSTM w/o LAF 41.1 70.2
LSTM w/ LAF 43.4 74.9
Table 5.1: Video-level classification performance of several different systems on fine-grained
actions.
Fine-grained sports ΔAP ΔAP Fine-grained sports
Fencing:Parry 0.17 -0.09 Parkour:Free Running
Cricket:Run out 0.15 -0.08 Freestyle soccer:Crip Walk
CrossFit:Deadlift 0.10 -0.08 Paragliding:Towing
CrossFit:Handstand 0.09 -0.07 Freestyle BMX:Stunt
Calisthenics:Push-up 0.09 -0.07 Judo:Sweep
Rings:Pull-up 0.08 -0.06 Basketball:Point
Table 5.2: Difference in average precision between LSTM with and without LAF proposal.
Sorted by top wins (left) and top losses (right).
Among the systems relying on CNN activations, applying LSTM gave better performance
than fine-tuning the top layers of the CNN. While both LSTM and CNN used the
late fusion of frame-level softmax activations to generate video-level classification results,
LSTM took previous observations into consideration with the help of memory cells. This
shows that temporal information helps classify fine-grained sports actions, and it was
captured by the LSTM network.
Finally, using LAF proposals helped further improve the video hit @1 by 2.3% and
the video hit @5 by 4.7%. In Table 5.2, we show the relative difference in average precision for
LSTM with and without LAF proposal. We observe that LAF proposal helps the most
when the fine-grained sports actions are likely to be identified based on single frames, and
the image highlights on the Internet are visually very similar to the videos. Note that
there are still non-video-like and irrelevant images retrieved from the Internet for these
categories, but the LAF proposal system is an effective filter. Figure 5.8 gives the three
systems' output on a few example videos.
Figure 5.6: Example actions when LAF is not helpful. The web images retrieved to generate
LAF proposals might be beautified (top) or taken from a different viewpoint (middle).
Sometimes there is a mix of the two issues (bottom).
We also identify several cases when LAF proposals failed to work. The most common
case is when most of the retrieved images are non-video like but not filtered out. They
could be posed images or beautified images with logos, such as images retrieved for Parkour:Free
running, or have different viewpoints than videos, such as Paragliding:Towing.
Sample video snapshots and web images are shown in Figure 5.6.
Impact of action hierarchy. A fine-grained sports action could be misclassified
to either its sibling or non-sibling leaf nodes in the sports hierarchy. For example, a
basketball slam dunk can be confused with basketball alley-oop as well as street ball slam
dunk. To study the source of confusion, we decided to measure the classification accuracy
of high-level sports activities, and check how the numbers compare with the fine-grained
sports actions.

Figure 5.7: Magnified view of confusion matrices for ice hockey, CrossFit and basketball

Figure 5.8: Classification output for a few videos. The labels under each video were
generated by LSTM with LAF proposal, LSTM without LAF proposal and CNN, from
top to bottom. Correct answers are marked in bold.

Method Video Hit @1 Video Hit @5
Random 1.2 5.9
CNN [50] 69.2 75.9
LSTM w/o LAF 71.7 77.3
LSTM w/ LAF 73.6 79.5
Table 5.3: Classification performance when measured on high-level sports activities (e.g.,
basketball, soccer).
We obtain the confidence values for high-level sports activities by taking the average
of their child nodes' confidence scores. Table 5.3 shows the classification accuracy with
different methods. We can see that the overall trend is the same as for fine-grained sports
actions: LSTM with LAF proposal is still the best. However, the numbers are much
higher than when measured at the fine-grained level, which indicates that the major source
of confusion still comes from the fine-grained level. In Figure 5.7, we provide the zoomed-in
confusion matrices for ice hockey, CrossFit and basketball.
5.3.4 Localization Results
Comparison with baselines. We applied the frameworks to localize fine-grained actions,
and varied the overlap ratio r from 0.1 to 0.5 for evaluation. Figure 5.9 shows the
mean average precision over all 45 categories for the different systems. We did not include
the baseline using low-level features for this evaluation, as the features were computed over whole
videos. From the figure we can see that LSTM with LAF proposal outperformed both
CNN and LSTM without LAF proposal significantly, and the gap grows wider as we increase
the overlap ratio. This confirms that temporal information and LAF proposal are helpful
for the temporal localization task.
In Table 5.4, we show the largest differences in average precision at the action level. Some
actions clearly benefited from the introduction of LSTM as well as LAF proposal.
We also observed that some actions were completely missed by all three systems, such as
Baseball:Hit, Basketball:Three-point field goal and Basketball:Block, possibly because the
video frames corresponding to these actions were not well localized during training.

Figure 5.9: Temporal localization performance on FGA-240 data set.

Fine-grained sports ΔAP ΔAP Fine-grained sports
Soccer:Penalty kick 0.32 -0.06 Baseball:Run
Tennis:Serve 0.25 -0.01 Skateboarding:Kickflip
Basketball:Dribbling 0.21 -0.01 Volleyball:Spiking
Fine-grained sports ΔAP ΔAP Fine-grained sports
Baseball:Brawl 0.52 -0.11 Streetball:Crossovers
Ice hockey:Combat 0.48 -0.05 Ice hockey:Penalty shot
Soccer:Penalty kick 0.33 -0.04 Fencing:Parry
Table 5.4: Difference in average precision, compared between LSTM with and without
LAF proposal (top), and between LSTM with LAF proposal and CNN (bottom); the overlap ratio is fixed
to 0.5. A positive number means LSTM with LAF proposal is better.
5.3.5 Localization Results on THUMOS 2014
To verify the effectiveness of domain transfer from web images, we also conducted a
localization experiment on the THUMOS 2014 data set [47]. This data set consists
of over 13,000 temporally trimmed videos from 101 actions, 1,010 temporally untrimmed
videos for validation and 2,574 temporally untrimmed videos for testing. The localization
annotations cover 20 out of the 101 actions in the validation and test sets. All 20 actions
are sports related.
Experiment setup: As this chapter focuses on temporal localization of untrimmed
videos, we dropped the 13,000 trimmed videos, and used the untrimmed validation videos
as the only positive samples for training. We also used 2,500 background videos as the
shared negative training data.
To generate LAF scores, we downloaded web images from Flickr and Google using the
action names as queries. We also sampled training video frames at 1 frame per second.
We used the AlexNet features for the domain transfer experiment.
Recently, it has been shown that a combination of improved dense trajectory features
[110] and Fisher vector encoding [77] (iDT+FV) offers state-of-the-art performance
on this data set. This motivated us to switch the LSTM time steps from frames
to video segments, and represent segments with iDT+FV features for the final detector
training. We segmented all videos uniformly with a window width of 100 frames and a step
size of 50 frames. For iDT+FV feature extraction, we took only the MBH modality with
192 dimensions and reduced the dimensions to 96 with PCA. We used the full Fisher
vector formulation with the number of GMM cluster centers set to 128. The final video
segment representation has 24,576 dimensions.
Results: We compared the performance of LSTM weighted by LAF scores against
several baselines. LSTM w/o LAF randomly assigned the misclassification penalty for each
step of LSTM, where 30% of the steps were set to 1, and the others to 0. The Video baseline
used iDT+FV features aggregated over whole videos to train linear SVM classifiers, and
applied the classifiers to the testing video shots. It was used by [95, 78] and achieved
state-of-the-art performance in event recounting and video summarization tasks. None of
these systems requires temporal annotations. Finally, Ground truth employed manually
annotated temporal localizations to set the LSTM penalty weights. It is used to study the
performance difference between LAF and an oracle with perfectly localized actions.

Overlap ratio
Method 0.1 0.2 0.3 0.4 0.5
Ground truth 0.161 0.152 0.112 0.071 0.044
Video [95, 78] 0.098 0.089 0.071 0.041 0.024
LSTM w/o LAF 0.076 0.071 0.057 0.038 0.024
LSTM w/ LAF 0.124 0.110 0.085 0.052 0.044
Table 5.5: Temporal localization on the test partition of the THUMOS 2014 dataset. Ground
truth uses the temporal annotations of the training videos.
Table 5.5 shows the mean average precision for the four approaches. As expected,
using manually annotated ground truth for training provides the best localization per-
formance. Although LSTM with LAF scores has worse performance than using ground
truth, it outperforms LSTM without LAF scores, and the video-level baseline by large
margins. This further conrms that LAF proposal by domain transfer from web images
is eective in action localization tasks.
Chapter 6
Weakly-supervised Event Recounting
This chapter describes a joint framework for event detection and recounting. Given a
query video, the framework provides not only a high level event label (e.g. a wedding
ceremony), but also video segments which are important positive evidence and their
textual descriptions (e.g. people hugging).
The underlying observation is that the presence of a high-level event in videos is determined by the presence of positive evidence for that event. A piece of positive evidence is characterized by a template of primitive actions and objects in a relatively short period of time. For example, humans can easily identify a making a sandwich video by the presence of sandwich, pan and hand movement within tens of frames.
6.1 Approach Overview
The framework starts with a primitive action based video representation with timestamps. It is obtained by segmenting videos into short clips and applying pre-trained primitive action classifiers to each clip. The presence of an event is determined by: (1) a global video representation, generated by pooling visual features over the entire video, and (2) the presence of several pieces of different positive evidence which are consistent over time.
We introduce an evidence localization model (ELM) which has a global template and a set of local evidence templates. ELM uses a compact action transition representation motivated by our proposed HMMFV pipeline in Chapter 4 to impose temporal consistency
Figure 6.1: (Middle) A typical birthday party video from the dataset; some of the video clips are irrelevant to the event. (Bottom) Each clip has a vector of primitive action classifier responses, where the highest scoring ones are listed. Primitive actions in red have high weights in evidence templates. (Top) Two configurations of evidence locations. The green one scores higher than the red one, as a transition from eating cake to blowing candle is highly unlikely.
for adjacent pieces of evidence. The constraint is suitable for the usually diverse temporal structures of unconstrained web videos.
Inference in ELM is done by finding the best evidence locations which match the evidence templates well, and are consistent over time. It can be solved efficiently by dynamic programming. Once the pieces of evidence are located, we use the top weighted primitive actions as the descriptors for each piece of evidence. ELM's parameters are learned via a max-margin framework described in [118]. We treat the locations of positive evidence as latent variables. The only supervised information required is video-level event labels.
6.2 Evidence Localization Model
Our model consists of a global template, a set of local evidence templates and a temporal transition constraint over the evidence set. Given a video, we find the sequence of video segments which achieves the best overall score in matching the evidence templates and meeting the temporal constraints. An event label is assigned based on the global feature of a video, as well as features from the selected pieces of evidence. An illustration is given in Figure 6.1.
Our model is related to the Deformable Part-based Model (DPM) [24] in the sense that they both try to find discriminative components from query data. However, our model is motivated by locating pieces of positive evidence for event classification and recounting, thus the representation and constraints are different.
6.2.1 Video Representation
Videos are represented by a set of primitive action responses with timestamps.
Given an input video V, we first divide it into a sequence of short clips [S_1, S_2, ..., S_n]. This can be done by a sliding window with uniform size, or by shot boundary detection. We then apply a set of pre-trained action classifiers, where each action classifier maps a video clip to a confidence score on how likely the action appears in the clip, given by

$$c_{ji} = f_i(S_j) \quad (6.1)$$

This representation can be easily extended to objects, which are image or patch based, by pooling classifier outputs sampled from S_j.
Once this step is finished, video V is represented by a matrix C = [c_{j,i}]. The j-th row of the matrix gives a complete action based description for S_j.
It is reasonable to expect that to generate meaningful video descriptions, certain action types are necessary. Meanwhile, action classifiers can also be seen as nonlinear projections of the original feature space, which provide discriminative information for classification [104, 97]. We use action annotations from training videos as well as independent datasets. Some of the actions are not related to high-level events. A wide range of object and scene types could also be used, but is not addressed in this chapter.
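As a concrete illustration, here is a minimal sketch of building the clip-by-action response matrix C of Equation 6.1; the clip features and classifier callables are hypothetical stand-ins for the sliding-window clips and pre-trained action classifiers described above.

```python
import numpy as np

def action_response_matrix(clips, action_classifiers):
    """Build the n_clips x n_actions response matrix C (Equation 6.1).

    clips              -- list of per-clip feature vectors (any fixed-length
                          clip representation; hypothetical here)
    action_classifiers -- list of callables f_i mapping a clip feature to a
                          confidence score, one per primitive action
    """
    C = np.zeros((len(clips), len(action_classifiers)))
    for j, clip in enumerate(clips):              # rows index clips S_j
        for i, f_i in enumerate(action_classifiers):
            C[j, i] = f_i(clip)                   # c_{ji} = f_i(S_j)
    return C
```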
6.2.2 Evidence Localization Model
ELM's evidence templates are learned and used for evidence localization from query videos. For example, vehicle moving, people dancing and people marching are all related to the parade event and may be expected to appear in different videos. Meanwhile, if a video contains people dancing and people marching, it is highly likely to be a parade event.
These two primitive actions should have high weights in their corresponding evidence templates. However, in practice the locations of segments providing evidence are usually not available for training. We address this problem by treating the locations as latent variables, and define the scoring function

$$f_w(C) = \max_{z \in Z} \left[ f_r(C) + f_p(C, z) + f_t(C, z) \right] \quad (6.2)$$

where C is the action response matrix of a video, each row of C corresponds to a video segment, and Z is the set of all possible configurations of evidence locations. f_w(C) can be decomposed into the following terms:
Global score f_r(C) measures event similarity based on global video features. This can be done by extracting statistics from all clips of a video. For example, one can use the average pooling technique

$$h_r(C) = \frac{1}{N} \sum_{i=1}^{N} C_i$$

where N is the number of rows in C and C_i is the i-th row of C.
The global term can then be expanded to

$$f_r(C) = w_r^\top h_r(C) \quad (6.3)$$

parameterized by the global template w_r.
Local evidence score f_p(C, z) measures how well the located pieces of evidence match the evidence templates. Denote z_i as the clip index of the i-th evidence template, and T as the total number of evidence templates in the evidence set; we have

$$f_p(C, z) = \sum_{i=1}^{T} w_{p,i}^\top C_{z_i} \quad (6.4)$$

The vector w_{p,i} can be seen as an evidence template containing the desired action responses. If the action classifiers were perfect, w_{p,i} should be sparse as only a small subset of actions should appear in a piece of evidence. However, since the current state-of-the-art action classifiers are far from perfect, we decide not to impose a sparsity constraint on w_p.
Temporal consistency score f_t(C, z) evaluates the validity of the selected pieces of evidence over time. Considering an ideal scenario for the getting a vehicle unstuck event, it is obvious that pushing vehicle should happen before vehicle moving.
We use the Hidden Markov Model Fisher Vector (HMMFV) [97] to model temporal consistency. Given a trained HMM parameterized by θ, and a collection of action response data C from clips of a video, each dimension φ_{i,j}(C) corresponds to the partial derivative of the log-likelihood function log P(C | θ) over the transition parameter θ_{i,j}, given by

$$\varphi_{i,j}(C) = \sum_{t=1}^{N} \alpha_{t-1}(j) \left[ c_{t,i}\, \beta_t(i) - \beta_{t-1}(j) \right] \quad (6.5)$$

where N is the number of clips, c_{t,i} is the emission probability of action i at clip t, and i and j are action types. α_t(i) is the probability of observing the first t clips with the t-th clip belonging to the i-th primitive action, and β_t(i) is the probability of observing clips from the (t+1)-th to the end given that the t-th clip belongs to the i-th primitive action. The α's and β's can be computed efficiently via dynamic programming.
Assuming the positions of evidence z are sorted in temporal order [z_{t_1} ... z_{t_T}], and a uniform prior distribution for all actions, we have

$$f_t(C, z) = \sum_{i=1}^{T-1} w_t^\top \varphi([C_{z_{t_i}}; C_{z_{t_{i+1}}}]) \quad (6.6)$$

which measures the compatibility of all adjacent pairs of evidence.
6.2.3 Compact Temporal Constraint
One potential problem of the above temporal constraint is that the dimension of φ grows quadratically with the number of actions, which makes it computationally infeasible to support a large vocabulary. We use the following guideline to select a subset of actions: considering the evidence templates w_{p,i} (i = 1, ..., T), higher values in w_{p,i} reflect the dimensions that are important components for the evidence template. We therefore select a subset of actions by taking the average of w_{p,i} (i = 1, ..., T), and picking the actions corresponding to the D dimensions with the highest values.
Our temporal constraint is flexible and data-driven: by learning the parameter vector w_t from training data, it can be used both for evidence sets with more rigid structures and for those with no clear temporal order.
6.3 Inference
Inference involves solving Equation 6.2 by finding the assignment of latent variables z which maximizes f_w(C). We rewrite Equation 6.2 as

$$f_w(C) = \psi_1(C) + \max_{z \in Z} g_w(C, z) \quad (6.7)$$

where

$$g_w(C, z) = \sum_{i=1}^{T} \psi_2(C, z_i, t_i) + \sum_{i=1}^{T-1} \psi_3(C, z_i, z_{i+1})$$

Here T is the number of evidence templates, z_i is the location of the i-th evidence in temporal order, and t_i is the evidence template index.
The problem now becomes one of selecting T pieces of evidence sequentially from a set of N video clips, where the choice of the i-th evidence is only affected by the (i-1)-th evidence. Let G(i, z, t) be the maximum score of selecting the first i evidence locations given that the i-th location is z and its template index is t. We have

$$G(i, z, t) = \max_{z', t'} \left[ G(i-1, z', t') + \psi_3(C, z', z) \right] + \psi_2(C, z, t)$$

The target score is max_{z,t} G(T, z, t), which can be solved in O(T^3 N^2) by dynamic programming. Positive evidence locations can be obtained by backtracking.
In practice, we find that retaining the top M = 10 candidate evidence locations already provides good performance. This reduces the time complexity to O(TM^2).
6.4 Learning
Our model is parameterized by the global template w_r, evidence templates w_p, temporal constraint vector w_t and the transition parameters of the HMM. We first introduce the learning of the first three vectors w = [w_r w_p w_t], and describe the last one in Section 6.4.1.
Given a labeled training set {C_i, y_i}, y_i ∈ {-1, 1}, i = 1, 2, ..., N, we learn the parameters w by a max-margin criterion similar to [24] and [118]:

$$\min_{w, \xi} \;\; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad (6.8)$$
$$\text{s.t.} \;\; y_i f_w(C_i) \geq 1 - \xi_i, \quad \xi_i \geq 0, \;\; \forall i$$

ξ is a vector of slack variables; it measures the degree of misclassification. C is a cost variable which balances the two terms in the target function.
The optimization problem is semi-convex. We use the quadratic programming based solver proposed in [118].
6.4.1 Initialization
Since Equation 6.8 poses a semi-convex problem, initialization plays an important role in getting a good local optimum. We determine the number of evidence templates and the evidence locations in positive training videos in the following steps (a code sketch is given at the end of this subsection):
1. Select all action response vectors from positive training data to form a set D_p, and randomly select action response vectors from negative training data to form a set D_n.
2. Run K-Means clustering on D_p with a large K (e.g. 200).
3. For each cluster, use the entries of D_p from that cluster as the positive set and D_n as the negative set to train a linear SVM classifier. Apply the classifier to all action response vectors from the training set. A video-level score is produced by taking the maximum score over its action response vectors. Compute the average precision.
4. Pick a cluster with high average precision as an evidence template, given certain criteria are met. Stop if the maximum evidence template size is reached.
5. For each selected cluster, apply its classifier to each positive video, and select the clip with the highest response as the evidence location.
We use two criteria for picking evidence templates: minimum average precision (minAP) and maximum percentage of evidence overlap (maxEO). A cluster is picked if the average precision of its corresponding classifier is higher than minAP, and the overlapping percentage of its selected evidence locations with previously selected locations is lower than maxEO.
The intuition behind this initialization method is that if a classifier trained on positive action responses from a cluster performs well over the whole dataset, then the corresponding clips are likely to be representative for most of the positive videos.
After initialization, we use the evidence templates to select the top actions used for modeling the temporal constraint. A general HMM for the selected actions is learned by sampling training data.
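A minimal sketch of the cluster-based initialization (steps 1-4 above), assuming scikit-learn is available; the maxEO overlap test is omitted for brevity and the variable names are illustrative rather than the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def init_evidence_templates(pos_clips, neg_clips, clip_video_ids, video_labels,
                            K=200, min_ap=0.1, max_templates=5):
    """Propose evidence templates by clustering positive action responses.

    pos_clips      -- array [n_pos_clips, n_actions] from positive videos
    neg_clips      -- array [n_neg_clips, n_actions] sampled from negatives
    clip_video_ids -- video index for every row of vstack([pos_clips, neg_clips])
    video_labels   -- binary event label per video, ordered as np.unique(clip_video_ids)
    """
    clusters = KMeans(n_clusters=K).fit_predict(pos_clips)
    all_clips = np.vstack([pos_clips, neg_clips])
    templates = []
    for k in range(K):
        n_k = int((clusters == k).sum())
        if n_k == 0:
            continue
        X = np.vstack([pos_clips[clusters == k], neg_clips])
        y = np.r_[np.ones(n_k), np.zeros(len(neg_clips))]
        clf = LinearSVC().fit(X, y)
        scores = clf.decision_function(all_clips)
        # video-level score = max over that video's clips
        video_scores = [scores[clip_video_ids == v].max()
                        for v in np.unique(clip_video_ids)]
        ap = average_precision_score(video_labels, video_scores)
        if ap > min_ap:
            templates.append((ap, k, clf))
        if len(templates) >= max_templates:
            break
    return templates
```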
6.5 Event Recounting
Generating a video description for recounting is straightforward in our framework. After running the inference algorithm, we have all the evidence locations in a video. A summary of the video can then be generated by ordering the evidence temporally. To generate a textual description for an evidence clip, we use the actions with top responses weighted by the evidence template. Suppose w_{p,i} = [w_{i,1} ... w_{i,D}] is the i-th evidence template, and C_j = [c_{j,1} ... c_{j,D}] is the action response vector. We select the top action given by

$$\arg\max_k \; w_{i,k}\, c_{j,k} \quad (6.9)$$

Another possible strategy for video description is to learn linear event classifiers from average pooled action responses. The classifiers can be used as pseudo evidence templates, one for each event [95]. Compared with ELM, this global strategy lacks the diversity of having multiple evidence types per event. We compare these two strategies in Section 6.6.3.
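In code, the description for one evidence clip reduces to a single weighted argmax; a minimal numpy illustration, where action_names is a hypothetical list mapping response dimensions to action words.

```python
import numpy as np

def describe_evidence(w_pi, c_j, action_names):
    """Pick the top action for an evidence clip (Equation 6.9)."""
    k = int(np.argmax(w_pi * c_j))   # element-wise product of template weights and responses
    return action_names[k]
```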
6.6 Experiments
This section describes the dataset we used for evaluation, as well as evaluation results for
classication and recounting tasks.
6.6.1 Dataset
We used TRECVID 2013 Multimedia Event Detection dataset [75] for evaluation. The
dataset contains unconstrained web videos varying in length, quality and resolution. We
chose to evaluate the ten events listed in Table 6.1.
We used three different partitions: Background, which contains 4,992 background
videos not belonging to any of the target events; 100EX, which contains 100 positive
videos for each event; MEDTest, which contains 24,957 videos. To train a model for a
specic event, we used all background videos from Background, and positive videos from
100EX of that event only. Videos in MEDTest were used for testing.
We used two datasets, UCF 101 [93] and the MED action annotation set [39], to learn primitive action classifiers. UCF 101 has 13,320 videos from 101 categories; the videos are of similar quality to the TRECVID 2013 MED dataset, but many of the action types are not related to the 10 events. The MED action annotation set has 60 action types annotated directly on 100EX videos; these actions are highly related to the events.
6.6.2 Classication Task
The classification task is to assign a single event label to each query video. We report our results in average precision (AP).
To learn primitive action classifiers, we used Dense Trajectories (DT) features [108], and obtained video-level representations by Fisher Vector coding [98]. LIBLINEAR [21] was used for SVM classifier training. Action classifiers were applied to the video every 100 frames, with a 50 frame step size. We used a two-fold cross validation to select our framework's parameters, which include the cost C and the relative weight of positive and negative samples. For model initialization, we set the size of candidate clusters to 200,
Event name ID Global ELM NT ELM
Birthday party 6 13.6 17.4 17.1
Changing a vehicle tire 7 8.7 13.1 17.7
Flash mob gathering 8 31.2 42.7 57.3
Getting a vehicle unstuck 9 21.9 28.3 25.2
Grooming an animal 10 7.9 11.4 14.9
Making a sandwich 11 5.1 11.4 13.2
Parade 12 26.2 33.8 33.7
Parkour 13 19.4 39.9 43.6
Repairing an appliance 14 4.2 17.0 20.6
Sewing project 15 6.8 17.0 24.2
mean Average Precision (mAP) 14.5 23.2 26.8
Table 6.1: Average precision comparison among global baseline, ELM without temporal
constraint and the full ELM on MEDTest
minAP to 0.1, and maxEO to 20%. We selected the top 40 primitive actions for the temporal constraint.
Comparison with Baselines. As the overall performance is affected by the choice of the global video representation, we first trained our evidence localization model (ELM) without using the global term. We chose global average pooling of the action responses (Global) as the first baseline, defined by

$$h(X) = \frac{1}{N} \sum_{i=1}^{N} X_i$$

where X is an N by D matrix, N is the number of total action response vectors, D is the number of actions, and X_i is the i-th row of X. We used a linear kernel SVM, and 5-fold cross validation to select the classifier parameters.
To validate the effectiveness of the temporal constraint term, we also provide results of ELM with no temporal constraint (ELM NT). The results are shown in Table 6.1.
From the table, it is easy to see that ELM has much higher mean average precision compared with the global baseline. One possible explanation is that, by locating positive evidence instead of treating all clips as being equally important, ELM is more robust against the various irrelevant segments in testing videos. The results also indicate that the temporal constraint is effective for most of the events.
ID HMMFV ELM+HMMFV DTFV ELM+DTFV
6 24.2 22.7 19.4 22.0
7 14.7 19.4 17.1 25.2
8 52.9 59.7 55.7 61.5
9 29.6 34.0 35.6 38.1
10 8.9 11.4 12.7 15.3
11 17.1 18.2 15.4 17.1
12 32.6 37.3 33.3 37.8
13 53.5 54.9 55.4 57.1
14 25.7 28.1 37.1 36.2
15 15.0 25.0 19.1 27.9
mAP 27.4 31.1 30.1 33.8
Table 6.2: Performance gain by incorporating two state-of-the-art global video representations into our framework
Working with State-of-the-Art Global Methods. Any global event classification approach that provides a fixed-length video-level feature vector can be incorporated into our framework as the global term. It is interesting to see if our evidence based framework can improve these methods. Recently, [73] showed that using multiple types of low-level features can increase event classification performance. It is therefore important to fix the low-level features when comparing different frameworks.
We chose two different state-of-the-art approaches following this rule. The first one uses an action transition based representation called HMMFV [97]. It accumulates action transition statistics over the entire video. We used the same set of pre-trained action classifiers as those used in our approach. We refer to this method as HMMFV.
We also implemented a low-level feature based framework based on Dense Trajectories. The features were extracted at a step size of 10. We used Fisher Vectors with both first and second order terms, and followed the suggestions in [98] to apply power normalization and l2-normalization to the feature vectors. We used PCA to reduce the dimension of the DT features to 128, and used a codebook size of 64. We refer to this method as DTFV.
Both methods used a linear SVM classifier, with parameters selected by 5-fold cross validation. The two global approaches as well as our ELM method were all based on DT features alone.
Average video length 172.9 seconds
Average snippet length 7.1 seconds
Ratio 4.1%
Average Accuracy 86.3%
Table 6.3: The ratio of our method's average snippet length over average video length,
and the average accuracy of labels from evaluators
According to Table 6.2, by discovering evidence from videos, ELM achieves significant and consistent improvement over HMMFV and DTFV.
6.6.3 Recounting Task
In the TRECVID Multimedia Event Recounting (MER) task, a video description is defined as a video snippet with a starting frame, an ending frame and a textual description. We used the ELM framework to obtain such snippets, and compared it with the global strategy described in Section 6.5. The number of snippets selected per video was fixed to ELM's evidence set size T = 2 for both strategies.
The evaluation of video recounting results is difficult, as there is no ground truth information on which snippets are correct; to the best of our knowledge, there is also little previous work to compare with. We conducted an experiment based on human evaluation. Eight volunteers were asked to serve as evaluators. Before evaluation, each evaluator was shown the event category descriptions in text, as well as one or two positive examples from the training set. For each event, 10 positive videos were picked randomly from MEDTest. However, we informed evaluators that negative videos may also appear, to avoid biased prior information. Evaluators were first presented with snippets generated by ELM to assign event labels, and then presented with snippets generated by the global strategy from the same video to compare which snippets are more informative and whose descriptions of the snippets are more accurate. Two criteria were used: average accuracy, which measures the percentage of correctly labeled snippets; and relative performance, counting evaluators' preference between the video-level recounting results generated by the two approaches.
Event Better Similar Worse
Birthday party 4 4 2
Changing a vehicle tire 7 0 3
Flash mob gathering 5 1 4
Getting a vehicle unstuck 3 5 2
Grooming an animal 8 0 2
Making a sandwich 6 0 4
Parade 5 2 3
Parkour 4 0 6
Repairing an appliance 8 0 2
Sewing project 3 5 2
Total 53 33 14
Table 6.4: Evaluators' comparison of ELM over global strategy. Assignments of better,
similar and worse were aggregated via average
Table 6.3 shows the average length of videos and snippets, as well as the average accuracy. ELM achieves 86.3% average accuracy by selecting only about 4% of the frames in the original videos. This shows that our approach provides reasonably good snippets for users to rapidly and accurately grasp the basic idea of video events.
Table 6.4 summarizes the evaluators' preferences between ELM and the global strategy for each event. It can be seen that ELM is better for most of the events. Several recounting results are shown in Figure 6.2, where snippets generated by ELM are on the left. Among the three examples, our flash mob gathering snippets provide more diverse but still related information. The global strategy failed to assign a proper description to the repairing an appliance video, where hands are not present in the selected snippets. The bottom row is an example of a making a sandwich video where ELM's output is worse.
Most of the selected actions for description came from the MED action annotation set. This indicates the benefit of using event related actions to build the vocabulary.
[Figure 6.2 lists, for each of the three example videos, the descriptions of the two ELM snippets and the two baseline snippets: (flash mob gathering) people walking, people dancing vs. people dancing, people dancing; (repairing an appliance) people pointing, hands visible vs. hands visible, hands visible; (making a sandwich) people using knife, people spreading cream vs. hands visible, hands visible.]
Figure 6.2: Event recounting results generated by ELM and the baseline approach. The video events are flash mob gathering, repairing an appliance and making a sandwich respectively. ELM was labeled as better than the baseline by evaluators for the top two recounting results.
Chapter 7
Zero-shot Event Detection and Recounting
This chapter addresses the problem of recognizing high-level video events when there are no training examples available (i.e. zero-shot learning). Although previous work exists for zero-shot multimedia event detection (ZeroMED), to the best of our knowledge none has studied the equally important problem of zero-shot multimedia event recounting (ZeroMER). The goal of ZeroMER is to provide event-based supporting evidence, such that users can quickly focus on relevant snippets and also decide if the retrieval meets their needs. The evidence usually consists of a small subset of video snippets selected from the original videos, accompanied by text descriptions.
7.1 Transferring Mid-level Knowledge from Web Images
Zero-shot learning has been studied in Computer Vision for object and action recognition tasks. A common approach [56, 63, 67] is to represent the categories by banks of human-interpretable semantic concepts (e.g. attributes, actions, objects), where each concept is associated with a pre-trained classifier. It then detects an unseen category by selecting a subset of related concepts and combining their confidence scores. For ZeroMED, this approach requires the users to provide a list of related concepts for every event query, thus making it difficult to explore video collections on a large scale. Moreover, users are usually unfamiliar with the internal operations of the system, and unable to determine which concepts will work well with the system. To avoid manual specification of concepts, an alternative approach aims at constructing classifiers of unseen categories
[Figure 7.1 illustration: a testing video of the event Bee Keeping, with key evidence segments labeled bee hive, bee and honeycomb.]
Figure 7.1: Given a query video and a set of candidate events, ZeroMED generates event predictions while ZeroMER provides diverse evidence for the event.
directly from classifiers of observed categories [25, 81]. This is achieved by computing semantic similarities of category labels based on their continuous word representations. However, this approach implicitly assumes that unseen categories have semantically similar counterparts among the observed categories, which may not hold for all events.
In light of the above challenges, we adopt the concept-based zero-shot learning scheme, but design a fully automatic algorithm to select concepts for an event. We name the information used to decompose high-level events into relevant concepts mid-level knowledge, and propose to transfer such knowledge from web images. Assume that we have a pool of image-based concepts as well as concept classifiers using other modalities (e.g. video or audio). We pass the query text to an Internet image search engine (e.g. Google) and collect the highest ranking returned images. We then apply our bank of image-based concept classifiers to these images, and the concepts with the highest average responses are chosen as event-relevant. To select video or audio concepts, we compute the semantic similarities between the names of the selected image concepts and all other concepts, and pick those with the highest semantic similarities. We term this process mid-level knowledge transfer (MIKT); it is central to our approach of decomposing event queries into multimedia concepts.
[Figure 7.2 illustration: for the event query birthday party, web image search results are passed through CNN features and mid-level knowledge transfer to select ImageNet concepts such as (a) balloon, (b) candle and (c) dining table; semantic similarity then selects video concepts such as (d) blow candle, (e) dining room and (f) table. Per-segment relevance scores and a pairwise diversity matrix over the video segments are used for event detection and event recounting.]
Figure 7.2: An illustration of our event recognition framework. Each video segment is represented by a bank of image-based and video-based concepts. Given an event query, we select the relevant image-based concepts by transferring event composition knowledge from web images. We then select the video-based concepts based on their semantic similarities with the selected image concepts. The classifier confidence scores of the selected concepts are used for ZeroMED. For ZeroMER, we solve an integer linear programming problem to select video segments with relevant evidence that are also diverse and compact.
Once the relevant concepts have been selected for each event, ZeroMED proceeds by computing a weighted sum of the relevant concept detection scores at the video level. For ZeroMER, a naive approach is to directly extend ZeroMED to video snippets. However, users might have different preferences even for the same event query. It is important to present a diversified set of video segments covering different aspects of the query event. Take the renovating a home event as an example: some users are interested in laying the floor while others are more interested in tiling the roof. By providing diversified results, users can quickly locate the clips they are interested in. For this purpose, we formulate an integer linear programming (ILP) problem by taking the selection of each segment within a video as a binary variable. The goal is to maximize the confidence scores of the desired concepts as well as the diversity of the selected segments. We show that our ILP formulation can be approximately solved efficiently while maintaining good performance.
7.2 Zero-shot Event Recognition
Our system first uses the MIKT algorithm on web images to discover relevant image-based semantic concepts for each event, and expands them to video concepts by computing semantic similarities. It then breaks videos into shot segments, and represents each video segment using the detection scores from a bank of semantic concepts. Classifiers of the selected concepts are used to generate detection scores for videos (ZeroMED) and segments (ZeroMER). For ZeroMER, we introduce a diversity term and formulate an integer linear programming problem to generate recounting results that are both relevant and diverse. The overall approach is depicted in Figure 7.2.
7.2.1 Video Representation
We first segment long videos into short clips using off-the-shelf shot boundary detectors [123], and choose the middle frame in each segment as the representative key frame. To allow zero-shot recounting, we represent video segments by a bank of pre-defined semantic concepts, which range from objects and scenes to actions. For each semantic concept C_i, we learn a concept detector f_{C_i}(·) ∈ R using existing image and video datasets (e.g. ImageNet [15] and UCF 101 [93]). A video segment v_i is mapped into the concept space by

$$C(v_i) = [f_{C_1}(x_i),\; f_{C_2}(x_i),\; \ldots,\; f_{C_K}(x_i)] \quad (7.1)$$

where K is the total number of concepts. x_i is the visual feature for v_i; it is extracted from the key frame for image-based detectors and from the whole segment for video-based detectors.
7.2.2 Mid-level Knowledge Transfer
Concept selection directly determines the performance of zero-shot event recognition. Al-
though manual concept selection for each event is possible, it does not scale well when the
number of events is large. Besides, users may be unfamiliar with the internal operations
of the system. We propose to select concepts automatically by transferring mid-level
event composition knowledge from web images.
Event names MIKT word2vec
Attempting a bike trick mountain bike, bicycle-built-for-two, moped mountain bike, unicycle, motor scooter
Cleaning an appliance microwave, refrigerator, dishwasher dishwasher, washer, toaster
Dog show sheepdog, foxhound, komondor beagle, madagascar cat, bedlington terrier
Giving directions to a location trolleybus, passenger car, minibus pick plectrum, mailbox, menu
Marriage proposal sandbar, seashore, bathing trunks groom, church, streetcar
Renovating a home wardrobe, power drill, crate home theater, mobile home, birdhouse
Rock climbing cliff, cliff dwelling, alp rock crab, rock python, rock beauty
Town hall meeting restaurant, library, church library, church, lakeside
Winning race without vehicle run shoe, racket, racquet racer, pole, run shoe
Metal crafts project hammer, tool kit, screwdriver iron, steel, quilt
Bee keeping bee house, honeycomb, lumbermill bee, bee eater, bee house
Wedding shower groom, gown, bride, bakeshop groom, shower curtain, shower cap
Non-motorized vehicle repair disk brake, bicycle-built-for-two, go-kart tractor, motor scooter, recreational vehicle
Fixing musical instrument violin, tobacco shop, barbershop violin, acoustic guitar, cello
Horse riding competition sorrel, horse cart, camel horse cart, llama, barn
Felling a tree chainsaw, plow, hatchet tree frog, acorn, tusker
Parking a vehicle sports car, tow truck, minivan cab, recreational vehicle, passenger car
Playing fetch kelpie, soccer ball, rugby ball steal, pick, borzoi
Tailgating grocery store, go-kart, race car, hotdog, sunscreen, pizza
Tuning musical instrument violin, ddle, cello, electric guitar violin, acoustic guitar, cello
Table 7.1: Selected ImageNet concepts for all 20 events.
For each event, we query the Google image search engine with the event name, and download the top ranked images of type photo. For each image-based concept C_i, we apply its concept detector f_{C_i}(·) to the retrieved image set I. To suppress noise from individual images, we compute the event matching score h by:

$$h(C_i) = \frac{1}{|\mathcal{I}|} \sum_{I \in \mathcal{I}} f_{C_i}(I) \quad (7.2)$$
We select the T image-based concepts with the top h values. To go beyond image-based concepts and select video-based concepts, we measure the semantic similarity between an image-based concept name w_I and a video-based concept name w_V. For this purpose, we use a data-driven similarity measurement based on the skip-gram model as implemented in word2vec [69]. Its goal is to search for continuous word vectors which can be used to predict a word's context. The resulting word vectors have the property that semantically similar words are close to each other. We use the complete dump of English Wikipedia to train the word2vec embedding models.
To measure the semantic similarity between w_I and w_V, we compute the cosine similarity between the normalized averages of their corresponding word embeddings, given by

$$\text{sim}(w_I, w_V) = \psi(w_I)^\top \psi(w_V) \quad (7.3)$$

where $\psi(w) = \sum_{i \in w} e_i / \left\| \sum_{i \in w} e_i \right\|_2$, and e_i is the word embedding for word i.
We compute sim(w_I, w_V) between the selected image-based concepts I and all video-based concepts V. The T' video-based concepts with the highest similarities are selected as the relevant video-based concepts.
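A minimal sketch of the two selection steps, assuming the detector scores on the downloaded web images and a pre-trained word-embedding table are already available; the aggregation of similarities over the selected image concepts (a max here) is one reasonable choice rather than a detail specified above, and all names are illustrative.

```python
import numpy as np

def normalized_embedding(name, embeddings):
    """psi(w): L2-normalized average of the word vectors in a concept name."""
    vecs = np.array([embeddings[tok] for tok in name.split()])
    avg = vecs.sum(axis=0)
    return avg / np.linalg.norm(avg)

def mikt_select(image_scores, image_concepts, video_concepts, embeddings,
                T=3, T_video=3):
    """Mid-level knowledge transfer (Equations 7.2 and 7.3).

    image_scores   -- array [n_web_images, n_image_concepts] of detector
                      outputs on the retrieved web images for one event query
    image_concepts -- names of the image-based concepts (e.g. ImageNet classes)
    video_concepts -- names of the video-based concepts (e.g. UCF101 actions)
    embeddings     -- dict mapping a word to its word2vec vector
    """
    # Equation 7.2: average detector response over the retrieved images
    h = image_scores.mean(axis=0)
    top_image = [image_concepts[i] for i in np.argsort(-h)[:T]]

    # Equation 7.3: propagate to video concepts via embedding cosine similarity
    sims = {wv: max(normalized_embedding(wi, embeddings).dot(
                    normalized_embedding(wv, embeddings)) for wi in top_image)
            for wv in video_concepts}
    top_video = sorted(sims, key=sims.get, reverse=True)[:T_video]
    return top_image, top_video
```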
Discussion: In Table 7.1, we show the top image-based concepts selected from 1000 ImageNet [15] concepts for the 20 events in the TRECVID MED14 dataset, using the MIKT algorithm with web images. We also list the concepts selected by computing word embedding similarities between event names and concept names. We can see that using word2vec directly with event names results in poor concept selection for many events, while the concepts retrieved using web image knowledge transfer are more semantically reasonable. This is particularly true when the event names are more abstract (e.g. metal crafts project, renovating a home). Another interesting observation is that MIKT sometimes selects concepts that are visually relevant but not necessarily semantically close (e.g. restaurant for town hall meeting). In rare cases, MIKT selects irrelevant concepts due to domain differences (e.g. most wedding proposal photos are shot by the beach).
By applying the image-based concept detectors directly to web images and taking the average of the detector responses, concepts with less reliable detectors are filtered out implicitly. This selection process is also less sensitive to the naming of image-based concepts. Unlike previous work [8] which crawled web images to train concept detectors, we only use web images as a source to discover mid-level knowledge which decomposes events into concepts. As a result, only a small number (about 90) of web images is needed for each event query. The whole knowledge transfer process is very fast.
7.2.3 Event Recounting with Diversity
We first introduce how to select the segments which contain possible key evidence. We compute the relevance score Rel^E_i of video segment i for event E as the sum of the detection scores from the selected concepts:

$$\text{Rel}^E_i = \sum_{c \in \mathcal{C}_E} f_c(i) \quad (7.4)$$

where C_E is the set of selected concepts for event E, and f_c(i) is the concept detection score of segment i. The Rel^E_i terms from the same video are normalized to [0, 1].
One can directly use the relevance scores for recounting, by selecting the video segments with the highest relevance scores. However, it is intuitive that it may be beneficial for the system to display a diverse selection of video segments, while preserving the event relevance of the selected segments. To address this issue, we introduce a diversity term for segment selection. It is measured by the semantic distance between two video segments i and j. Let c_i = [f_{C_1}(i), f_{C_2}(i), ..., f_{C_K}(i)] be the concept representation of segment i; we define the diversity score between segments i, j as

$$\text{Diff}_{ij} = \| c_i - c_j \|_2 \quad (7.5)$$
Putting the two terms together, the objective of event recounting for event E is to select a subset of video segments I such that

$$\sum_{i \in \mathcal{I}} \text{Rel}^E_i + \sum_{i \in \mathcal{I},\, j \in \mathcal{I}} \text{Diff}_{ij} \quad (7.6)$$

is maximized.
Denote s_i ∈ {0, 1} as a binary indicator of whether segment i is selected or not; the equation can be transformed into:

$$\sum_{i=1}^{T} \text{Rel}^E_i\, s_i + \sum_{i,j} \text{Diff}_{ij}\, s_i s_j \quad (7.7)$$
To solve this objective function, we formulate an Integer Linear Programming (ILP) problem by introducing auxiliary variables s_{ij} ∈ {0, 1}, where s_{ij} takes the value 1 only if both segments i and j are selected. Define L as the maximum number of selected segments per video; the resulting ILP problem has the form:
$$\begin{aligned}
\text{Maximize:} \quad & \sum_{i=1}^{T} \text{Rel}^E_i\, s_i + \sum_{i,j} \text{Diff}_{ij}\, s_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{T} s_i \leq L, \\
& s_{ij} \leq s_i, \;\; s_{ij} \leq s_j \quad \forall i, j, \\
& s_i + s_j - s_{ij} \leq 1 \quad \forall i, j, \\
& s_i \in \{0, 1\} \quad \forall i, \\
& s_{ij} \in \{0, 1\} \quad \forall i, j.
\end{aligned} \quad (7.8)$$

where T is the total number of segments in a video.
After the problem is solved, we use the video segments with s_i > 0 to generate the final event recounting; each video segment is accompanied by a text description generated from the selected concept with the highest response.
Discussion: Although ILP is NP-hard in general, approximate solutions can be computed efficiently using off-the-shelf solvers. In practice, generating the event recounting for a video with around 40 shots takes only a few seconds.
By maximizing the objective function, the video recounting results are not only relevant to the events, but are also diverse.
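A minimal sketch of the ILP in Equation 7.8 using the open-source PuLP modeler as one example of an off-the-shelf solver (not necessarily the solver used in this work); rel and div are assumed to be the precomputed relevance scores and diversity matrix for one video.

```python
import pulp

def zeromer_ilp(rel, div, L=3):
    """Select up to L diverse, relevant segments (Equation 7.8)."""
    T = len(rel)
    prob = pulp.LpProblem("zeromer", pulp.LpMaximize)
    s = [pulp.LpVariable(f"s_{i}", cat="Binary") for i in range(T)]
    pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
    sij = {(i, j): pulp.LpVariable(f"s_{i}_{j}", cat="Binary") for i, j in pairs}

    # objective: relevance of selected segments plus pairwise diversity
    prob += (pulp.lpSum(rel[i] * s[i] for i in range(T))
             + pulp.lpSum(div[i][j] * sij[i, j] for i, j in pairs))
    prob += pulp.lpSum(s) <= L                      # at most L segments per video
    for i, j in pairs:                              # s_ij = 1 iff both i and j selected
        prob += sij[i, j] <= s[i]
        prob += sij[i, j] <= s[j]
        prob += s[i] + s[j] - sij[i, j] <= 1
    prob.solve()
    return [i for i in range(T) if s[i].value() > 0.5]
```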
7.3 Experiments
We present the dataset, experimental settings, evaluation criteria and experimental results
in this section.
7.3.1 MED-recounting Dataset
So far, no existing datasets are available for automatic quantitative evaluation of recounting results. Previous work on recounting relies on humans to watch the program outputs and rate their quality, without explicitly defining the recounting ground truth [95]. To fill this gap, we introduce a new video dataset, MED-Recounting, and provide temporal annotations of the evidence locations within the videos. We also design an evaluation metric for automatic recounting evaluation.
7.3.1.1 Video Data
We use the videos in the challenging NIST TRECVID Multimedia Event Detection 2014 dataset¹ (MED'14) to evaluate the recounting performance. It has 20 event categories, each of which has a text description in the form of name, definition, explication and related evidence types. Videos in MED'14 have large variations in length, quality and resolution. The average length of the videos is over 2 minutes.
For the purpose of zero-shot event recounting evaluation, we select 10 videos per event
from the testing partition of MED'14 (MEDTest). Most of the videos have a duration
from 1 to 5 minutes. The total number of videos used for recounting evaluation is 200.
For the zero-shot event detection evaluation, we used the full MEDTest dataset, which
has 23,954 videos. No videos in MED'14 are used as training data.
To divide the videos into segments, we start by detecting the shot boundaries by calculating the color histograms of all the frames. We then measure the difference between the color histograms of neighboring frames. If the absolute value is larger than a certain threshold, this frame is marked as a shot boundary [123]. Frames between two shot boundaries are considered a video shot. After detecting the shots, we use the middle frame as the key frame.
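A minimal sketch of this histogram-difference shot detector using OpenCV; the histogram binning and threshold are illustrative and would need tuning.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, threshold=0.3):
    """Mark a frame as a shot boundary when its color histogram differs
    from the previous frame's histogram by more than a threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        hist = hist / (hist.sum() + 1e-8)            # normalize to a distribution
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```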
7.3.1.2 Annotation Protocol and Evaluation Metric
Once the key frames are selected, the annotation process has two steps. For every shot in the same video, we first ask the users "Does the shot contain supporting evidence for event A?". The possible answers are "Yes" or "No". For those shots marked as "Yes", we ask the annotators to group together the shots that they believe offer the same type of
evidence. We use a majority vote rule to combine annotations from different annotators. The final annotation is in the form of integer labels for all shots in a video, where each positive number stands for a different evidence category, and -1 stands for no evidence.
¹ http://nist.gov/itl/iad/mig/med14.cfm
Event Name ID Image classifiers word2vec (I) MIKT (I) word2vec (I + V) MIKT (I + V)
Attempting a bike trick E021 3.9 4.4 4.8 11.2 11.8
Cleaning an appliance E022 1.4 2.1 7.3 3.3 7.8
Dog show E023 2.2 11.2 12.3 62.8 63.4
Giving directions to a location E024 0.3 0.9 0.8 1.1 1.0
Marriage proposal E025 1.0 0.4 0.5 0.6 0.4
Renovating a home E026 1.1 1.2 1.3 2.1 2.3
Rock climbing E027 7.2 3.6 10.2 18.1 18.9
Town hall meeting E028 1.1 2.1 2.4 9.8 11.2
Winning a race without a vehicle E029 2.8 2.3 3.7 7.2 8.9
Working on a metal crafts project E030 0.9 1.1 5.3 1.3 5.8
Bee keeping E031 57.0 61.4 62.3 64.2 64.6
Wedding shower E032 0.3 0.8 0.9 1.3 1.2
Non-motorized vehicle repair E033 18.9 3.2 21.2 4.7 28.9
Fixing musical instrument E034 0.7 3.7 2.1 4.3 3.8
Horse riding competition E035 4.2 4.1 9.8 26.2 27.1
Felling a tree E036 3.3 2.4 2.9 8.2 9.7
Parking a vehicle E037 10.8 4.6 12.6 7.8 17.7
Playing fetch E038 5.2 1.1 9.8 2.1 10.1
Tailgating E039 1.5 0.9 1.2 1.4 1.8
Tuning musical instrument E040 2.7 8.4 8.6 9.1 9.6
Average 6.4 5.4 9.1 12.1 16.3
Table 7.2: Comparison of different knowledge transfer methods for ZeroMED on the MED14 dataset.
Typically, each video contains about 40 video segments, and 3 key evidence categories are marked in each video.
To evaluate the recounting quality (RQ) for each video, assume that the total number of selected shots is up to L; we use the percentage of evidence categories that have been hit as the evaluation metric, defined as:

$$\text{RQ} = \frac{\#\text{evidence}_{\text{hit}}}{\#\text{evidence}_{\text{total}}} \quad (7.9)$$

where #evidence_hit is the number of key evidence categories that have been covered in the recounting result, and #evidence_total is the total number of key evidence categories within the test video. A higher RQ score indicates that more of the evidence categories are covered by the recounting result for the video.
As selecting video shots from the same evidence category does not increase #evidence_hit, our evaluation metric favors video recounting that is not only relevant to the event, but is also diverse and compact.
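The metric is straightforward to compute from the per-shot annotations described above; a small sketch, where shot_labels follows the integer convention just defined (positive evidence category ids, -1 for no evidence).

```python
def recounting_quality(shot_labels, selected_shots):
    """RQ = #evidence_hit / #evidence_total (Equation 7.9)."""
    total = {label for label in shot_labels if label > 0}   # evidence categories in the video
    hit = {shot_labels[i] for i in selected_shots if shot_labels[i] > 0}
    return len(hit) / len(total) if total else 0.0
```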
7.3.2 Concept Training
We use 1000 static image concepts, 101 action concepts, 487 sports activity concepts and
346 semantic video activity concepts for event recounting and detection experiments. We
introduce the details of concept training in this section.
7.3.2.1 Image-based Concepts
We obtain 1000 object concept detectors using a deep Convolutional Neural Network (CNN) trained on the ImageNet ILSVRC-2014 dataset [15]. It includes 1.2M training images categorized into 1000 classes. We use the very deep CNN architecture proposed by Simonyan et al. [90] as implemented in the caffe [44] toolbox. It has 19 layers in total. After the CNN model is trained, we take the key frame of each testing video segment as input, make a forward pass of the CNN and use the softmax outputs as the concept detection scores for the segment.
7.3.2.2 Video-based Concepts
We also obtain video-based concepts from three publicly available datasets: UCF101 [93],
TRECVID Semantic Indexing (SIN) and Google Sports1M [50]. They contain 101 action
categories, 346 semantic activity categories and 487 sports categories respectively. We
extract the improved dense trajectory [110] features from videos, and aggregate the local
features into video-level feature vectors by Fisher vectors [98]. We train linear SVM
classiers and employ 5-fold cross validation to select the parameters.
7.3.3 Zero-shot Event Detection
We first evaluate whether MIKT helps in concept selection for ZeroMED. We use all the 23,954 videos in MEDTest 2014 for testing. Mean Average Precision (mAP) is used as the evaluation metric.
Experiment setup: For knowledge transfer with web images, we queried an Internet image search engine and downloaded the top 90 images for each query. Only photo type images were kept. To comply with the image query format, we replaced all occurrences of without, non- and not with the minus sign. We set the number of selected image-based concepts T to 3, and the number of propagated video-based concepts T' also to 3.
Impact of concept selection methods: We compare concepts selected by MIKT with web images against those selected by semantic similarity as defined by word2vec embeddings. In particular, we evaluate the following four settings:
Method mAP (%)
Concept Discovery [8] 2.3
Bi-concept [36] 6.0
Composite-concept [36] 6.4
MMPRF [46] 10.1
SPaR [45] 12.9
EventNet [120] 8.9
Weak concept [115] 12.7
Singh et al. [91] 11.6
MIKT (Ours) 17.8
Table 7.3: Comparisons with other state-of-the-art ZeroMED systems on MEDTest13.
• word2vec (I): use event names to select image concepts with top word2vec simi-
larities.
• word2vec (I + V): use event names to select image and video concepts with top
word2vec similarities.
• MIKT (I): use event names for knowledge transfer from web images with MIKT.
• MIKT (I + V): use event names to select image concepts with MIKT; then use
selected concept names to pick video concepts with top word2vec similarities.
Table 7.2 lists the ZeroMED performance with different concept selection methods. We can see that MIKT has better mAP than word2vec. The difference in AP is larger when the event names are more abstract and the event composition knowledge is non-trivial to infer (e.g. playing fetch), or when word2vec fails to retrieve semantically similar concepts (e.g. rock crab for rock climbing).
Transfer knowledge or event classifiers? The web images collected for MIKT can also be used to train event classifiers directly. The Image classifiers column in Table 7.2 shows the APs for this baseline. The image-based event classifiers were trained with VGG-19 [90] CNN feature embeddings and SVM classifiers from 90 web images per event. We can see that its performance is still far behind our best MIKT system. This indicates that transferring event composition knowledge from web images requires less data than directly training event classifiers. For some events (e.g. bike trick), although the video events and web images share similar relevant concepts, their appearances differ a lot.
Dataset word2vec MIKT
MED14 recounting 0.534 0.648
Table 7.4: Comparison of mean recounting quality (mean RQ) on zero-shot recounting
task for dierent concept selection methods.
Comparison with state-of-the-art: Finally, we compare the MIKT approach with recent state-of-the-art methods. We report results on the MEDTest 2013 dataset which was used by the published methods. Among the systems, Bi-concept, Composite-concept and EventNet use web videos to train concept detectors; Singh et al. use web images to train concepts and re-train the event detectors with the top retrieved test videos; Weak concept, SPaR and MMPRF use similar pre-trained concept banks as ours, but select concepts from manually generated event descriptions with semantic similarity. From Table 7.3, we observe that systems that use pre-trained concepts generally have higher performance than those using webly-trained concepts. Our system outperforms all the other approaches significantly, which indicates that MIKT from web images is able to select better concepts automatically.
7.3.4 Zero-shot Event Recounting
We evaluate ZeroMER performance on our MED-Recounting dataset. The web images used for MIKT are the same as those used in ZeroMED. Table 7.4 shows the mean RQ for different concept selection methods. We can see that MIKT outperforms word2vec. In the following experiments, we use the concepts selected by MIKT.
Comparison with baseline: To demonstrate the effectiveness of our proposed ZeroMER framework, we compare our framework against four baseline systems:
• Random: randomly select L shots.
• Uniform: divide the video into L parts uniformly, and choose one shot from each
part randomly.
• Clustering: for each video, cluster the concept features of the video segments into
L clusters, and use the cluster centers.
• ILP w/o diversity: select the top L shots with highest relevance scores.
Method mean RQ
Random 0.251
Uniform 0.279
Clustering 0.362
ILP w/o diversity 0.535
ILP 0.642
Table 7.5: Event recounting results compared with baseline approaches. Higher scores indicate better performance.
Better Worse Similar
71.5% 18.5% 10.0%
Table 7.6: Human comparison of the recounting results generated by ILP against ILP
w/o diversity.
To compute relevance scores, we use MIKT to select image and video concepts. We set L to 3 and compare RQ at the event level; results are shown in Table 7.5. We can see that by choosing segments with relevant concepts, both ILP w/o diversity and ILP outperform the other systems significantly. By explicitly modeling diversity, ILP has the best overall performance.
Human evaluation: We asked 10 human evaluators to compare the recounting results generated by ILP and ILP w/o diversity. We used all 200 videos for evaluation and fixed L = 3. For each video, we provided the human evaluators with the ground truth event name as well as the description. We then showed the three key frames from the recountings generated by the two systems on each side of the screen respectively (in randomly selected order), and asked the evaluators to choose from the following: 1st is better, 2nd is better, equally good or bad. We aggregated the evaluation results using majority vote. On average, 78.5% of the evaluators agreed on their votes for specific videos.
Table 7.6 shows the human evaluation results. According to the evaluators, ILP generates better recounting results than ILP w/o diversity in 71.5% of all videos. It is worse in 18.5% of the videos, possibly due to irrelevant segments selected to achieve diversity. This indicates that the RQ evaluation metric agrees well with humans.
Chapter 8
Video Transcription with SVO Triplets
Humans can easily describe a video in terms of actors, actions and objects. It would be
desirable to make this process automatic, so that users can retrieve semantically related
videos using text queries, and capture the gist of a video before watching it. The goal
of this chapter is generating video transcriptions, where each transcription consists of a
subject, verb and object (SVO) triplet. We assume the videos to be unconstrained user
captured videos possibly with overlaid captions and camera motion, but that they are
pre-segmented to be short clips with a single activity, and a few objects of interest. One
example is shown in Figure 8.1 left.
Video transcription with SVO is an extremely challenging problem for several reasons: first, as annotating actions and objects with spatio-temporal bounding boxes is time-consuming and tedious, in most cases only video-level sentence annotations are available
for training (Figure 8.1 right). Second, although there are several action and object
datasets with a large number of categories [15, 93], a considerable amount of SVO terms
are still not present in these categories. Finally, even for the detectors with corresponding
SVO terms, many of them are still far from reliable when applied to videos in the wild
[52].
8.1 Approach Overview
[Figure 8.1 illustration: left, one example testing video frame with the human annotations "Three men are biking in the woods", "Two cyclist do tricks", "Guys are riding motorcycles", "People ride their bikes"; right, the output of SAT: "Person rides bike".]
Figure 8.1: Left: one example of the testing videos we used. Right: our algorithm utilizes sentence based annotations, and outputs subject, verb, object triplets.

We propose a semantic aware transcription framework (SAT) using Random Forest classifiers. Inputs for the Random Forest classifiers are detection responses from off-the-shelf action and object detectors. SAT's outputs are in the form of SVO triplets, which can
then be used for sentence generation. To obtain the SVO terms for training, we parse human annotated sentences and retrieve the subject, verb and object terms. The labels of a training video contain the top k most commonly used subject, verb and object terms for that video. For example, the set of labels for the video in Figure 8.1 may be (person, motorcyclist, ride, do, bicycle, trick).
The core innovation of SAT is to consider the semantic relationships of SVO labels during training. Semantic aware training is important when the labels are user provided, without a pre-defined small vocabulary set. On one hand, humans may use different words to describe objects that are visually close or essentially the same (bike and bicycle). On the other hand, for problems with a large number of classes, semantically reasonable errors (tomato to potato) are more desirable than unreasonable ones (tomato to guitar). SAT provides a framework for semantic aware training: during node split selection of decision trees, it favors the clustering of semantically similar words. Similarity is measured by continuous word vectors, learned with the skip-gram model [68]. The skip-gram model optimizes the context prediction power of words over a training corpus, and has been shown to produce word vector clusters with clear semantic similarities. Given the learned word vectors, SAT picks the best node split by computing the differential entropy of word clusters. Each tree in the resulting forest divides training samples hierarchically into semantically consistent clusters.
The detector responses used in this framework can be seen as candidate action and object proposals. They are more suitable for the transcription task than low-level features, as action and object locations are not provided in the annotations. Torresani et al. [104] showed that object detector responses provide competitive performance when used as features for the image classification task. SAT goes one step further and provides a mechanism to measure the semantic map from a detector type to output labels. The map measures the influence of a detector's response on the output probabilities of labels. For example, a bicycle detector may have high impact on objects like bike or motorcycle as well as on verbs like ride.
SAT has the following highlights:
Larger vocabulary support. A Random Forest classifier is naturally suited for multi-class classification. We can use a single Random Forest for an arbitrary vocabulary size. For SVM-based frameworks, the number of one-vs-rest classifiers required grows linearly with the vocabulary size.
Feature sharing for semantically similar words. By using a hierarchical structure, SAT allows sharing features for semantically similar words. For example, horse and bicycle may go through the same path in a decision tree until separated by a node at a large tree depth. This is particularly useful for training, as words with few occurrences can be trained together with similar words with more training samples.
Semantic reasonableness. SAT optimizes over semantic similarity instead of binary classification error. In our framework, piano is considered a better error for guitar than pasta. The resulting transcriptions are thus likely to be more semantically reasonable.
8.2 Semantic Aware Transcription
This section describes the Semantic Aware Transcription framework. We first briefly introduce a vector based word representation in semantic space [68]. The structure of the Random Forest classifiers and their inputs are then described. Next, we show how the semantic word vectors can be used to select the best node split in Random Forest classifier training, such that training samples after a split become more similar in the semantic space. Finally, a mechanism is provided to compute the semantic map for a concept detector.
8.2.1 Continuous Word Representation
Many existing Natural Language Processing techniques can be used to measure semantic distances among different words. For example, WordNet [70] provides a database of hierarchical word trees, on which semantic distances can be defined. To learn data-driven semantic structures, topic modeling techniques such as Latent Dirichlet Allocation have been found to be useful [4].
We adopt the continuous word representation learned by the skip-gram model [69]. Given
a sequence of training words $\{w_1, w_2, \ldots, w_T\}$, it searches for a vector representation for
each word $w_i$, denoted by $v_{w_i}$, such that
\[
\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log P(w_{t+j} \mid w_t) \tag{8.1}
\]
is maximized. $c$ controls the training context size, and the probability of $w_{t+j}$ given $w_t$
is defined by the softmax function
\[
P(w_i \mid w_j) = \frac{\exp(v_{w_i}^{\top} v_{w_j})}{\sum_{w} \exp(v_{w}^{\top} v_{w_j})} \tag{8.2}
\]
This objective function attempts to make the vector representations of semantically
close words behave similarly in predicting their contexts. In practice, a hierarchical
softmax function is used to make the training process computationally feasible. When
trained on a large text corpus, the Euclidean distances between vectors of semantically
similar words are small.
Compared with the rule-based WordNet and the topic modeling techniques, the continuous
word representation is both data-driven and flexible. Once word vectors are trained on
an independent corpus, one can measure the semantic similarity for an arbitrary set of
words.
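As an illustration of how the learned vectors can be used, the following is a minimal sketch (not the thesis code) that loads word vectors from a plain-text file and measures semantic distance; the file path and format are assumptions for illustration only.

# Minimal sketch (assumptions: a plain-text file with one word per line followed
# by its vector components; any word -> vector mapping would work equally well).
import numpy as np

def load_word_vectors(path):
    """Load word vectors into a dict mapping word -> numpy array."""
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def semantic_distance(vectors, w1, w2):
    """Euclidean distance between word vectors; small for semantically similar words."""
    return float(np.linalg.norm(vectors[w1] - vectors[w2]))

# vectors = load_word_vectors("word_vectors.txt")   # hypothetical file
# semantic_distance(vectors, "horse", "bicycle")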
8.2.2 Video and Annotation Preprocessing
Figure 8.2: Illustration of a single decision tree used in SAT. Detector responses for a
video are used to traverse the tree nodes until reaching a leaf. Note that a horse detector
may not be needed in the process.
We assume each training video has several one-sentence descriptions annotated via
crowdsourcing. These sentences are parsed by a dependency parser [14], and only subject, verb
and object components are kept. Denoting $D_s$, $D_v$ and $D_o$ as the dictionaries of subjects,
verbs and objects, we store their word vectors as $V_s = \{v_{w_s} \mid w_s \in D_s\}$,
$V_v = \{v_{w_v} \mid w_v \in D_v\}$ and $V_o = \{v_{w_o} \mid w_o \in D_o\}$.
After annotation preprocessing, every training video has a set of SVO words. For
subject and object words, although most of them correspond to concrete objects, we lack
the bounding boxes to locate them. Meanwhile, an annotated verb may correspond to
very different actions, like the verb play in play guitar and play soccer. It is hard to learn
verb detectors based on these annotations directly.
We use off-the-shelf action and object detectors to represent a video [29, 98]. Training
data for these detectors are obtained from independent datasets. The types of trained
detectors correspond to a very limited vocabulary and may not contain the words used
in video transcriptions. To apply object detectors, we sample video frames and take the
maximum response returned by a detector over all sampled frames; action detectors are
applied with sliding windows, and are also combined by maximum pooling. The final
video representation is a vector of action and object detector responses $S = [s_1\ s_2\ \ldots\ s_M]$.
Each dimension corresponds to a type of action or object detector.
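As a small illustration of this pooling step, the following sketch assumes per-frame object detector scores and per-window action detector scores have already been computed; the array names are illustrative, not the thesis implementation.

# Minimal sketch of building the detector-response vector S by max pooling.
import numpy as np

def video_representation(object_scores, action_scores):
    """
    object_scores: array of shape (num_frames, num_object_detectors)
    action_scores: array of shape (num_windows, num_action_detectors)
    Returns S = [s_1 ... s_M], one max-pooled response per detector type.
    """
    pooled_objects = object_scores.max(axis=0)   # max over sampled frames
    pooled_actions = action_scores.max(axis=0)   # max over sliding windows
    return np.concatenate([pooled_actions, pooled_objects])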
8.2.3 Random Forest Structure
As illustrated in Figure 8.2, we use a forest of randomly trained decision trees to map
detector responses into posterior word probabilities.
Starting from the root, every non-leaf node $k$ contains a simple classifier $\phi_k(S)$ for the
vector $S$ of detector responses, based on a single type of detector response. We have
\[
\phi_k(S) = s_i - \theta
\begin{cases}
> 0 & \text{go to the left child} \\
< 0 & \text{go to the right child}
\end{cases}
\tag{8.3}
\]
where $s_i$ is the $i$-th concept response in the vector, and $\theta$ is the threshold.
Leaf nodes store word count vectors; as in traditional decision trees, a word count
vector is obtained by accumulating the SVO words from all training samples belonging
to the leaf node. The final confidence score for word $w_i$ is obtained by
\[
f(w_i) = \frac{1}{T} \sum_{t=1}^{T} \frac{c_{t,w_i}}{\sum_{w} c_{t,w}} \tag{8.4}
\]
where $T$ is the forest size and $c_{t,w}$ is the count for word $w$ at the leaf node reached in
the $t$-th decision tree.
The subject, verb and object terms with the highest confidence scores, respectively,
are selected to generate a sentence description for a video.
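The following is a minimal sketch (not the thesis code) of the inference procedure described by Equations 8.3 and 8.4; the Node class and its field names are assumptions for illustration.

# Minimal sketch of SAT inference over a forest of decision trees.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 word_counts=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.word_counts = word_counts   # dict word -> count, only set at leaves

def traverse(node, S):
    """Follow Eq. 8.3: go left if s_i - theta > 0, right otherwise."""
    while node.word_counts is None:
        node = node.left if S[node.feature] - node.threshold > 0 else node.right
    return node.word_counts

def word_confidences(forest, S, vocabulary):
    """Eq. 8.4: average the normalized leaf word counts over all trees."""
    scores = {w: 0.0 for w in vocabulary}
    for tree in forest:
        counts = traverse(tree, S)
        total = sum(counts.values())
        for w, c in counts.items():
            scores[w] += c / total
    return {w: s / len(forest) for w, s in scores.items()}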
8.2.4 Learning Semantic Hierarchies
Ideally, we would like to learn a tree structure which encodes the semantic hierarchy
of SVO words. Towards this goal, we use the continuous word vectors to measure the
semantic compactness for a set of words.
Denote $W = \{w_1, w_2, \ldots, w_M\}$ as a group of words, and $V = \{v_{w_1}, v_{w_2}, \ldots, v_{w_M}\}$ the
corresponding word vectors. Assuming the underlying distribution of the word vectors is
Gaussian, we have
\[
g(v_w) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma|}} \exp\!\left( -\frac{1}{2} (v_w - \mu)^{\top} \Sigma^{-1} (v_w - \mu) \right) \tag{8.5}
\]
where $k$ is the dimension of the word vectors, $\mu = [\mu_1\ \mu_2\ \ldots\ \mu_k]$ is the mean vector, and
$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$ is the diagonal covariance matrix. They can be estimated from
$V$ by
\[
\mu_j = \frac{1}{M} \sum_{i=1}^{M} v_{w_i}^{j} \tag{8.6}
\]
\[
\sigma_j = \frac{1}{M} \sum_{i=1}^{M} \left( v_{w_i}^{j} - \mu_j \right)^2 \tag{8.7}
\]
In analogy to the entropy defined on discrete variables, we compute the differential entropy
$H(\mu, \Sigma)$ for the Gaussian distribution parametrized by $\mu$ and $\Sigma$ following
\[
H(\mu, \Sigma) = \frac{1}{2} \ln \left| (2\pi e)\,\Sigma \right| = \frac{1}{2} \sum_{j=1}^{k} \ln \sigma_j + C \tag{8.8}
\]
$H(\mu, \Sigma)$ measures the degree of uncertainty of the distribution: the lower the value,
the more certain the distribution is. For word vectors, since semantically similar words
lie close to each other, their estimated $\sigma$'s should be small and the differential entropy
low according to Equation 8.8. As a result, to achieve semantically compact node splits, we
minimize the weighted differential entropy
\[
\frac{|V_l|}{|V_l| + |V_r|} H(\mu_l, \Sigma_l) + \frac{|V_r|}{|V_l| + |V_r|} H(\mu_r, \Sigma_r) \tag{8.9}
\]
where $V_l$ and $V_r$ are the two groups of word vectors after the node split.
It has been shown that the generalization error of random forests is determined by
the strength of the individual trees and the correlation among the trees [5]. To reduce correlation, we
impose several types of randomness in training. First, only a subset of the training videos is
sampled to train each decision tree. Second, we randomly assign each node to consider
only subject words, verb words or object words, and use the selected word type to compute
the differential entropy as defined in Equation 8.8. Finally, we use a node split selection
criterion similar to extremely randomized trees [27]: after a feature dimension is sampled,
instead of finding the best threshold to minimize Equation 8.9, we only choose a small
subset of candidate thresholds and pick the best one among them.
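The following sketch illustrates the split selection of Equations 8.6-8.9 under the stated diagonal-Gaussian assumption; it is a simplified illustration, not the thesis implementation, and the constant term of Equation 8.8 is dropped since it does not affect the comparison between candidate splits.

# Minimal sketch of the semantic node-split criterion.
import numpy as np

def differential_entropy(word_vectors, eps=1e-8):
    """0.5 * sum_j log(sigma_j), dropping the additive constant of Eq. 8.8."""
    var = np.var(word_vectors, axis=0) + eps          # per-dimension variance (Eq. 8.7)
    return 0.5 * np.sum(np.log(var))

def split_cost(left_vectors, right_vectors):
    """Weighted differential entropy of the two children (Eq. 8.9)."""
    n_l, n_r = len(left_vectors), len(right_vectors)
    total = n_l + n_r
    return (n_l / total) * differential_entropy(left_vectors) + \
           (n_r / total) * differential_entropy(right_vectors)

def best_threshold(responses, word_vectors, candidate_thresholds):
    """Pick the sampled candidate threshold with the lowest split cost."""
    best, best_cost = None, np.inf
    for theta in candidate_thresholds:
        left = word_vectors[responses > theta]
        right = word_vectors[responses <= theta]
        if len(left) == 0 or len(right) == 0:
            continue
        cost = split_cost(left, right)
        if cost < best_cost:
            best, best_cost = theta, cost
    return best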
8.2.5 Discussion
One major difference of SAT from traditional Random Forest classifiers is the node split
criterion during training. SAT fits a group of semantic word vectors with a Gaussian
distribution with a diagonal covariance matrix, and computes differential entropy to measure
the semantic compactness. The penalties for grouping semantically similar words are smaller.
For example, the split of (drive, ride) and (cut, slice) should be better than (drive, slice)
and (cut, ride). Traditional Random Forest classifiers cannot distinguish the two, as their
discrete entropies are the same. This difference makes SAT produce more semantically
reasonable predictions.
Video transcription using SAT is fast (tens of comparisons for each decision tree,
and hundreds of trees in total). For training, it only evaluates the randomly sampled
thresholds instead of searching for the optimum, which can be done very efficiently. Since
there is no interaction between different trees, both training and testing of SAT can be
parallelized easily.
Our method to compute semantic maps is related to, but different from, variable
importance estimation: we measure only the change in output word probabilities; instead
of filling in randomly selected values, we select only the maximum and minimum possible
values for that dimension, so that all nodes using this dimension to make a decision are
toggled.
Computed semantic maps provide several indications: if the semantic meaning of a
detector's name and its top mapped words are identical or very similar, it is quite likely
that the detector outputs are reliable. Besides, if an object detector's top mapped words
contain verbs or an action detector's top mapped words contain objects, the combination
should appear frequently in training videos.
8.3 Experiments
In this section, we rst describe our experiment setup and the dataset used for evalua-
tions. Next, we compare the performance of SAT with several other video transcription
frameworks. Semantic maps learned by SAT are shown at the end of this section.
8.3.1 Dataset
We used the YouTube dataset collected by Chen and Dolan [7]. There are 1,970 short
video clips with 85,550 sentence-based video descriptions in English. Videos were annotated
by Amazon Mechanical Turk workers.
Object detector training data were provided by the PASCAL VOC challenge [19] and a
subset of ImageNet [15]. There are 243 categories in total. For action detector training,
we used the UCF 101 dataset [93] with 101 categories.
8.3.2 Experimental Setup
We followed the data partitioning used in [34]: there are 1,300 training videos and 670 testing
videos.
The Stanford parser [14] was used to extract the subject, verb and object components
from the sentences associated with the videos. Some of the extracted words are typos or
occur only a few times; we filtered these words out, which results in a dictionary of
517 words. As annotators tend to describe the same video with diverse words, each video
was described by the one most common subject, the two most common verbs and the
two most common objects. Unless otherwise specified, we used this set of words as the
groundtruth to train the classifiers and measure the accuracy of video transcription.
We used the continuous word vectors pre-trained on the Google News dataset, provided
by the authors of [69]. Each word vector has 300 dimensions.
Deformable part models (DPM) [24] were used to train object detectors. Part of the
detector models were downloaded from [3]. The object detectors work on static frames.
We uniformly sampled frames every second, and used maximum pooling to merge the
detector confidence scores over all sampled frames in the same video.
To learn action detectors, we first extracted motion-compensated dense trajectory
features with default parameters [110], and encoded the features with Fisher Vectors [98].
We set the number of clusters for Fisher Vectors to 512, and computed Fisher Vectors for
the HOG, HOF and MBH components separately. A linear SVM [21] was then trained
for each action category with a single feature type. We used average fusion to combine
the classifier outputs.
Verb      Top correlation        Object      Top correlation
come      go                     scooter     bicycle
run       walk                   finger      hand
spread    mix                    motorbike   car
fry       cook                   vegetable   onion
put       pour                   computer    camera
Table 8.1: Top correlated verb and object pairs in SAT
The parameter set for Random Forest classifiers includes the number of decision trees $T$,
the numbers of sampled feature dimensions $N_f$ and thresholds $N_t$, as well as the maximum
tree depth $D$. Parameters were selected by measuring the out-of-bag (OOB) error [5],
computed as the average of the prediction errors of each decision tree on the training data
not selected for that tree.
8.3.3 Performance Evaluation
We first qualitatively show how SAT uses semantic similarity to group training samples.
We computed the correlation of two words based on their number of co-occurrences in SAT's
leaf nodes. To avoid correlation introduced by multiple annotations for the same video,
we used only a single SVO triplet for each video. Table 8.1 shows several verb and
object words with their top correlated words. As we can see, most of the pairs are both
semantically close and visually related.
For quantitative evaluation, we compare our proposed SAT framework with the fol-
lowing two baselines:
Random Forest with no semantic grouping (RF). Every word under this setting
was treated as an independent class. Node splits are selected by computing the discrete
entropy.
Linear SVM (SVM). A linear SVM classifier was learned for every word, using the
detector responses as input features.
We fixed $T = 150$ and $D = 40$ for SAT and RF; $N_f$ and $N_t$ were selected by OOB.
For the SVM system, we fixed the soft-margin penalty ratio between positive and negative
samples as the inverse of their sample size ratio, and used cross-validation to select the
cost parameter.

Method   Subject accuracy   Verb accuracy   Object accuracy
SAT      0.816              0.344           0.244
RF       0.816              0.312           0.152
SVM      0.726              0.281           0.191
Table 8.2: Accuracy comparison among our proposed SAT, a traditional RF and a linear
SVM

Method              Subject accuracy   Verb accuracy   Object accuracy
SAT                 0.792              0.306           0.188
YouTube2Text [34]   0.809              0.291           0.170

Method              Subject WUP   Verb WUP   Object WUP
SAT                 0.927         0.625      0.590
YouTube2Text [34]   0.926         0.468      0.467
Table 8.3: Accuracy and WUP comparisons between our proposed method and
YouTube2Text [34]
Table 8.2 shows the accuracy comparison for the three methods. It is easy to see
that our proposed SAT provides better performance in both verb accuracy and object
accuracy, compared with the other two systems which do not use semantic relationships
during training. In Figure 8.3, we also show some of the transcription results. SAT
provided the correct SVO triplets for the top two examples, and related triplets for the
middle two examples. The bottom one is a case where SAT returned a wrong result.
We also compare the performance of SAT with the YouTube2Text system proposed
by [34]. It used semantic hierarchies to convert low-confidence SVO proposals to terms
higher in the semantic hierarchy. Their evaluations included a binary accuracy measurement
using only the most common SVO triplet per testing video; no semantic conversion was
used for this evaluation. To make our results comparable, we used the groundtruth labels
provided by the authors. The WUP metric was also used for evaluation. It is computed by
\[
s_{\mathrm{WUP}}(w_1, w_2) = \frac{2 D_{\mathrm{lcs}}}{D_{w_1} + D_{w_2}} \tag{8.10}
\]
where lcs is the least common ancestor of $w_1$ and $w_2$ in the semantic tree defined by
WordNet, and $D_w$ is the depth of $w$ in the semantic tree. It provides the semantic
similarity of $w_1$ and $w_2$ as defined by the rule-based WordNet. Since a word may have
multiple entries in WordNet, we used the set of entries provided by [34].
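For illustration, a minimal sketch of Equation 8.10 given WordNet depths; the depths themselves are assumed to be looked up elsewhere (e.g. with a WordNet toolkit), and the example numbers are made up.

# Minimal sketch of the WUP score from Eq. 8.10.
def wup_similarity(depth_w1, depth_w2, depth_lcs):
    """2 * D_lcs / (D_w1 + D_w2)."""
    return 2.0 * depth_lcs / (depth_w1 + depth_w2)

# Example: if two verbs share an ancestor at depth 7 and both sit at depth 8,
# WUP = 2 * 7 / (8 + 8) = 0.875 (illustrative numbers only).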
In Table 8.3, the binary accuracy of SAT is comparable to YouTube2Text in subject
terms, and better in verb and object terms. For the WUP measure where semantic
relatedness is being considered, SAT outperforms the YouTube2Text system by a large
margin.
GT: Person rides bicycle.   SAT: Person rides bicycle.   RF: Person tries ball.       SVM: Person rides bicycle.
GT: Person dances rain.     SAT: Person dances group.    RF: Person does hair.        SVM: Person kicks video.
GT: Person does exercise.   SAT: Person does exercise.   RF: Person does pistol.      SVM: Person gets pencil.
GT: Person runs ball.       SAT: Person plays ball.      RF: Person hits ball.        SVM: Person kicks garden.
GT: Person eats pizza.      SAT: Person makes food.      RF: Person goes something.   SVM: Person makes box.
GT: Person drives car.      SAT: Person rides car.       RF: Person moves bicycle.    SVM: Person does pool.
Figure 8.3: Testing videos with SVO triplets from groundtruth (GT), SAT, RF and SVM.
Exact matches are marked in blue, semantically related verbs and objects are marked in red.
Chapter 9
Automatic Visual Concept Discovery
This chapter studies the problem of automatic visual concept discovery. The proposed
method mines concepts from parallel text and visual corpora.
9.1 Approach Overview
Language and vision are both important for us to understand the world. Humans are
good at connecting the two modalities. Consider the sentence "A fluffy dog leaps to
catch a ball": we can all relate fluffy dog, dog leap and catch ball to the visual world and
describe them in our own words easily. However, to enable a computer to do something
similar, we need to first understand what to learn from the visual world, and how to
relate it to the text world.
Visual concepts are a natural choice to serve as the basic unit connecting language
and vision. A visual concept is a subset of the human vocabulary which specifies a group
of visual entities (e.g. fluffy dog, curly dog). We name the collection of visual concepts
a visual vocabulary. Computer vision researchers have long collected image examples
of manually selected visual concepts, and used them to train concept detectors. For
example, ImageNet [15] selects 21,841 synsets in WordNet as the visual concepts, and
has so far collected 14,197,122 images in total. One limitation of the manually selected
concepts is that their visual detectors often fail to capture the complexity of the visual
world, and cannot adapt to different domains.
Figure 9.1: Overview of the concept discovery framework. Given a parallel corpus of
images and their descriptions, we first extract unigrams and dependency bigrams from
the text data. These terms are filtered with the cross-validated average precision (AP) of
classifiers trained on their associated images. The remaining terms are grouped into
concept clusters based on both visual and semantic similarity. (The figure illustrates the
stages of concept mining, concept filtering and concept clustering over the parallel corpus,
and the use of the final concept list for image-to-sentence retrieval, sentence-to-image
retrieval and image tagging.)
For example, people may be interested in detecting birthday cakes when they try to
identify a birthday party, but this concept is not present in ImageNet.
To address this problem, we propose to discover the visual concepts automatically by
joint use of parallel text and visual corpora. The text data in parallel corpora offers a rich
set of terms humans use to describe visual entities, while the visual data has the potential to
help a computer organize the terms into visual concepts. To be useful, we argue that the
visual concepts should have the following properties:
Discriminative: a visual concept must refer to visually discriminative entities that
can be learned by available computer vision algorithms.
Compact: different terms describing the same set of visual entities should be merged
into a single concept.
Our proposed visual concept discovery (VCD) framework first extracts unigrams and
dependencies from the text data. It then computes the visual discriminative power of
these terms using their associated images and filters out the terms with low cross-validated
average precision. The remaining terms may be merged together if they correspond
to very similar visual entities. To achieve this, we use semantic similarity and visual
similarity scores, and cluster terms based on these similarities. The final output of VCD
is a concept vocabulary, where each concept consists of a set of terms and has a set of
associated images. The pipeline of our approach is illustrated in Figure 9.1.
We work with the Flickr 8k data set to discover visual concepts; it consists of 8,000
images downloaded from the Flickr website. Each image was annotated by 5 Amazon
Mechanical Turk (AMT) workers to describe its content. We design a concept-based
pipeline for the bidirectional image and sentence retrieval task [38] to automatically evaluate
the quality of the discovered concepts. We also conduct a human evaluation on a free-form
image tagging task using visual concepts. Evaluation results show that the discovered
concepts outperform manually selected concepts significantly.
Our key contributions include:
• We show that manually selected concepts often fail to capture the complexity of
and to evolve with the visual world;
• We propose the VCD framework, which automatically generates discriminative and
compact visual vocabularies from parallel corpora;
• We demonstrate qualitatively and quantitatively that the discovered concepts outperform
several large sets of manually selected concepts significantly. They also
perform competitively in the image-sentence retrieval task against state-of-the-art
embedding-based approaches.
9.2 Visual Concept Discovery Pipeline
This section describes the VCD pipeline. Given a parallel corpus with images and their
text descriptions, we first mine the text data to select candidate concepts. Due to the
diversity of both the visual world and human language, the pool of candidate concepts is
large. We use visual data to filter out the terms which are not visually discriminative, and
then group the remaining terms into compact concept clusters.

Preserved terms                    Filtered terms
play tennis, play basketball       play
bench, kayak                       red bench, blue kayak
sheer, tri-colored                 real, Mexican
biker, dog                         cigar, chess
Table 9.1: Preserved and filtered terms from the Flickr 8k data set. A term might be
filtered if it is abstract (first row), too detailed (second row) or not visually discriminative
(third row). Sometimes our algorithm may filter out visual entities which are difficult to
recognize (final row).
9.2.1 Concept Mining From Sentences
To collect the candidate concepts, we use unigrams as well as the grammatical relations
called dependencies [14]. Unlike the syntax-tree-based representation of sentences, dependencies
operate directly on pairs of words. Consider the simple sentence "a little boy is
riding a white horse": white horse and little boy belong to the adjective modifier (amod)
dependency, and ride horse belongs to the direct object (dobj) dependency. As the number
of dependency types is large, we manually select a subset of 9 types which are likely to
correspond to visual concepts. The selected dependency types are: acomp, agent, amod,
dobj, iobj, nsubj, nsubjpass, prt and vmod.
The concept mining process proceeds as follows: we first parse the sentences in the
parallel corpus with the Stanford CoreNLP parser [14], and collect the terms with the
dependency types of interest. We also select unigrams which are annotated as noun,
verb, adjective or adverb by a part-of-speech tagger. We use the lemmatized form of the
selected unigrams and phrases, such that nouns in singular and plural forms and verbs
in different tenses are grouped together. After parsing the whole corpus, we remove the
terms which occur fewer than k times.
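A minimal sketch of this mining step, using spaCy as a stand-in for the Stanford parser used here; the dependency label subset and model name are illustrative, since label sets differ slightly across parsers, and this is not the thesis implementation.

# Minimal sketch of mining lemmatized unigrams and dependency bigrams.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
DEPS = {"amod", "dobj", "nsubj", "acomp"}          # illustrative subset of the 9 types
POS = {"NOUN", "VERB", "ADJ", "ADV"}

def mine_terms(sentences, min_count=5):
    counts = Counter()
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.pos_ in POS:                     # lemmatized unigrams
                counts[tok.lemma_] += 1
            if tok.dep_ in DEPS:                    # dependency bigrams (head, modifier)
                counts[(tok.head.lemma_, tok.lemma_)] += 1
    return [term for term, c in counts.items() if c >= min_count]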
9.2.2 Concept Filtering and Clustering
The unigrams and dependencies selected from the text data contain terms which may not
have concrete visual patterns or may not be easy to learn with visual features. The images
in the parallel corpora are helpful to filter out these terms. We represent images using
feature activations from pre-trained deep convolutional neural networks (CNN); these are
image-level holistic features.
Since the number of terms mined from the text data is large, the concept filtering algorithm
needs to be efficient. For the images associated with a certain term, we do a 2-fold
cross-validation with a linear SVM, using randomly sampled negative training data. We
compute the average precision (AP) on the cross-validated results, and remove the terms with AP
lower than a threshold. Some of the preserved and filtered terms are listed in Table 9.1.
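A minimal sketch of this filtering step with scikit-learn, assuming the CNN features for the positive and randomly sampled negative images are precomputed; the 0.15 threshold follows the experiments reported later in this chapter, and all names are illustrative.

# Minimal sketch of 2-fold cross-validated AP filtering for one candidate term.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import average_precision_score
from sklearn.svm import LinearSVC

def term_average_precision(pos_feats, neg_feats, ap_threshold=0.15):
    """Return (AP, keep) where keep indicates whether the term survives filtering."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    scores = cross_val_predict(LinearSVC(), X, y, cv=2,
                               method="decision_function")
    ap = average_precision_score(y, scores)
    return ap, ap >= ap_threshold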
Many of the remaining terms are synonyms (e.g. ride bicycle and ride bike). These
terms are likely to confuse the concept classifier training algorithm. It is important to
merge them together to make the concept set more compact. In addition, although some
terms refer to different visual entities, they are similar visually and semantically (e.g. a
red jersey and an orange jersey); it is often beneficial to group them together to have more
image examples for training. This motivates us to cluster the concepts based on both
visual similarity and semantic similarity.
Visual similarity: We use the holistic image features to measure visual similarity
between different candidate concept terms. We learn two classifiers $f_{t_1}$ and $f_{t_2}$ for terms
$t_1$ and $t_2$ using their associated image sets $I_{t_1}$ and $I_{t_2}$; negative data is randomly sampled
from the images not associated with $t_1$ and $t_2$. To measure the similarity from $t_1$ to $t_2$, we
compute the median of classifier $f_{t_1}$'s responses on the positive samples of $t_2$:
\[
\hat{S}_v(t_1, t_2) = \operatorname*{median}_{I \in I_{t_2}} \left( f_{t_1}(I) \right) \tag{9.1}
\]
\[
S_v(t_1, t_2) = \min\left( \hat{S}_v(t_1, t_2),\ \hat{S}_v(t_2, t_1) \right) \tag{9.2}
\]
Here the outputs of $f_{t_1}$ are normalized to $[0, 1]$ by a sigmoid function. We take the
minimum of $\hat{S}_v(t_1, t_2)$ and $\hat{S}_v(t_2, t_1)$ to make it a symmetric similarity measurement.
The intuition behind this similarity measurement is that visual instances associated
with a term are more likely to get high scores from the classifiers of other visually similar
terms.
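A minimal sketch of Equations 9.1 and 9.2, assuming the two term classifiers are already trained (e.g. scikit-learn linear SVMs exposing decision_function) and that their positive image features are available; names are illustrative.

# Minimal sketch of the symmetric visual similarity between two candidate terms.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def directed_similarity(clf_t1, feats_t2):
    """Median sigmoid response of t1's classifier on t2's positive images (Eq. 9.1)."""
    return float(np.median(sigmoid(clf_t1.decision_function(feats_t2))))

def visual_similarity(clf_t1, feats_t1, clf_t2, feats_t2):
    """Symmetric score: minimum of the two directed similarities (Eq. 9.2)."""
    return min(directed_similarity(clf_t1, feats_t2),
               directed_similarity(clf_t2, feats_t1))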
Semantic similarity: We also measure the similarity of two terms in the semantic
space, which is computed with data-driven word embeddings. In particular, we train a
skip-gram model [69] using the English dump of Wikipedia. The basic idea of the skip-gram
model is to fit the word embeddings such that the words in the corpus can predict their
context with high probability. Semantically similar words lie close to each other in the
embedded space.
The word embedding algorithm assigns a $D$-dimensional vector to each word in the
vocabulary. For dependencies, we take the average of the word vectors of each word in the
dependency, and L2-normalize the averaged vector. The semantic similarity $S_w(t_1, t_2)$ of
two candidate concept terms $t_1$ and $t_2$ is defined as the cosine similarity of their word
embeddings.
Concept clustering: Denote the visual similarity matrix as $S_v$ and the semantic
similarity matrix as $S_w$; we compute the overall similarity matrix by
\[
S = S_v^{\alpha} \circ S_w^{1-\alpha} \tag{9.3}
\]
where $\circ$ is element-wise matrix multiplication and $\alpha \in [0, 1]$ is a parameter controlling the
weight assigned to visual similarity.
We then use spectral clustering to cluster the candidate concept terms into $K$ concept
groups. It is a natural choice when a similarity matrix is available. We use the algorithm
implemented in the Python SKLearn toolkit, fix the eigen-solver to arpack and assign
the labels with K-means.
After the clustering stage, each concept is represented as a set of terms, together with
their associated visual instances. One can use the associated visual instances to train
concept detectors with SVMs or neural networks.
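A minimal sketch of this clustering step with scikit-learn, assuming precomputed similarity matrices and the α-weighted combination as reconstructed in Equation 9.3; it is an illustration under those assumptions, not the thesis implementation.

# Minimal sketch of concept clustering on the combined similarity matrix.
from sklearn.cluster import SpectralClustering

def cluster_concepts(S_v, S_w, alpha=0.6, n_clusters=1200):
    # Element-wise combination of visual and semantic similarities (Eq. 9.3).
    S = (S_v ** alpha) * (S_w ** (1.0 - alpha))
    clustering = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed",
                                    eigen_solver="arpack",
                                    assign_labels="kmeans")
    return clustering.fit_predict(S)   # cluster id for each candidate term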
Type        Concept terms
Object      {jersey, red jersey, orange jersey}
Activity    {dribble, player dribble, dribble ball}
Attribute   {mountainous, hilly}
Scene       {blue water, clear water, green water}
Mixed       {swimming, diving, pool, blue pool}
Mixed       {ride bull, rodeo, buck, bull}
Table 9.2: Concepts discovered by our framework from the Flickr 8k data set.

α      Concept terms
0      {wedding, church}, {skyscraper, tall building}
1      {skyscraper, church}, {wedding, birthday}
0.3    {wedding, bridal party}, {church}, {skyscraper}
Table 9.3: Different α affects the term groupings in the discovered concepts. The total
concept number is fixed to 1,200.
9.2.3 Discussion
Table 9.2 shows some of the concepts discovered by our framework. It can automatically
generate concepts related to objects, attributes, scenes and activities, and identify the
different terms associated with each concept. We observe that sometimes a more general
term (jersey) is merged with a more specific term (red jersey) due to high visual similarity.
We also find that there are some mixed-type concepts of objects, activities and scenes.
For example, swimming and pool belong to the same concept, possibly due to their high
co-occurrence rate. One extreme case is that German and German Shepherd are grouped
together, as the two words always occur together in the training data. We believe the
problem can be mitigated by using a larger parallel corpus.
Table 9.3 shows different concept clusters when semantic similarity is ignored (α = 0),
dominant (α = 1) and combined with visual similarity. As expected, when α is small,
terms that look similar or often co-occur in images tend to be grouped together. As our
semantic similarity is based on word co-occurrence, ignoring visual similarity may lead
to sub-optimal concept clusters such as wedding and birthday.
9.3 Concept Based Image and Sentence Retrieval
Consider a set of images, each of which has a few ground truth sentence annotations;
the goal of bidirectional retrieval is to learn a ranking function from image to sentence
and vice versa, such that the ground truth entries rank at the top of the retrieved list.
Many previous methods approach the task by learning embeddings from the raw feature
space [38, 49, 31].
We propose an alternative approach to the embedding-based methods which uses the
concept space directly. Let us start with the sentence-to-image direction. With the discovered
concepts, this problem can be approached in two steps: first, identify the concepts
in the sentence; second, select the images with the highest responses for those concepts.
If we take the sum of the concept responses, this is equivalent to projecting the
sentence into the same concept-based space as images, and measuring the image-sentence
similarity by an inner product. This formulation allows us to use the same similarity
function for image-to-sentence and sentence-to-image retrieval.
Sentence mapping: Mapping a sentence to the concept space is straightforward.
We run the same parser as used in concept mining to collect terms. Remember that each
concept is represented as a set of terms: denote the term set for the incoming sentence
as $T = \{t_1, t_2, \ldots, t_N\}$, and the term set for concept $i$ as $C_i = \{c_{i1}, c_{i2}, \ldots, c_{iM}\}$; we have the
sentence's response for $C_i$ as
\[
\phi_i(T) = \max_{t \in T,\ c \in C_i} \delta(t, c) \tag{9.4}
\]
Here $\delta(t, c)$ is a function that measures the similarity between $t$ and $c$. We set $\delta(t, c) = 1$
if the cosine similarity of $t$ and $c$'s word embeddings is greater than a certain threshold,
and 0 otherwise. In practice we set the threshold to 0.95.
There are some common concepts which occur in most of the sentences (e.g. a person);
to down-weight these common concepts, we normalize the scores with term frequency-
inverse document frequency (tf-idf), learned from the training text corpus.
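A minimal sketch of this sentence mapping (Equation 9.4); the tf-idf normalization is simplified here to a precomputed per-concept weight, and all names are illustrative rather than the thesis implementation.

# Minimal sketch of projecting a sentence into the concept space.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def sentence_to_concepts(sentence_terms, concepts, embed, idf, thresh=0.95):
    """
    sentence_terms: terms parsed from the sentence
    concepts: list of term sets, one per concept
    embed: dict term -> embedding vector; idf: dict concept index -> tf-idf weight
    """
    responses = np.zeros(len(concepts))
    for i, concept_terms in enumerate(concepts):
        match = any(cosine(embed[t], embed[c]) > thresh
                    for t in sentence_terms for c in concept_terms
                    if t in embed and c in embed)
        responses[i] = idf.get(i, 1.0) if match else 0.0
    return responses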
Image mapping: To measure the response of an image to a certain concept, we
need to collect its positive and negative examples. For concepts discovered from parallel
corpora, we have their associated images. The set of training images can be augmented
with existing image data sets or by manual annotation.
Assuming that training images are ready and concept classifiers have been trained, we
then compute the continuous classifier scores for an image over all concepts, and normalize
each of them to the range $[-1, 1]$. The normalization step is important, as using non-negative
confidence scores biases the system towards longer sentences.
Since image and text data are mapped into a common concept space, the performance
of bidirectional retrieval depends on: (1) whether the concept vocabulary covers the terms
and visual entities used in query data; (2) whether concept detectors are powerful enough
to extract useful information from visual data. It is thus useful to evaluate the quality of
discovered concepts against existing concept vocabularies and their concept detectors.
9.4 Evaluation
In this section, we first evaluate our proposed concept discovery pipeline based on the
bidirectional sentence-image retrieval task. We then use the discovered concepts to generate
concept-based image descriptions, and report human evaluation results.
9.4.1 Bidirectional Sentence Image Retrieval
Data: We use 6,000 images from the Flickr 8k [38] data set for training, 1,000 images for
validation and another 1,000 for testing. We use all 5 sentences per image for both training
and testing. Flickr 30k [121] is an extended version of Flickr 8k. We select 29,000 images
(no overlap to the testing images) to study whether more training data yields better
concept detectors. We also report results when the visual concept discovery, concept
detector training and evaluation are all conducted on Flickr 30k. For this purpose, we
use the standard setting [48, 51] where 29,000 images are used for training, 1,000 images
for validation and 1,000 images for testing. Again, each image comes with 5 sentences.
Finally, we randomly select 1,000 images from the lately released Microsoft COCO [62]
data set to study if the discovered concept vocabulary and associated classiers generalize
to another data set.
                                   Image to sentence                Sentence to image
Method                             R@1   R@5   R@10  Median rank   R@1   R@5   R@10  Median rank
Karpathy et al. [48]               16.5  40.6  54.2  7.6           11.8  32.1  44.7  12.4
Mao et al. [66]                    14.5  37.2  48.5  11            11.5  31.0  42.4  14
Kiros et al. [51]                  18.0  40.9  55.0  8             12.5  37.0  51.5  10
Concepts (trained on Flickr 8k)    18.7  41.9  54.7  8             16.7  40.7  54.0  9
Concepts (trained on Flickr 30k)   21.1  45.9  59.0  7             17.9  42.8  55.8  8
Table 9.4: Retrieval evaluation compared with embedding-based methods on Flickr 8k.
Higher Recall@k and lower median rank are better.
Evaluation metric: Recall@k is used for evaluation. It computes the percentage of
ground truth entries ranked in the top k retrieved results, over all queries. We also report
the median rank of the first retrieved ground truth entry.
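A minimal sketch of these two metrics, assuming the rank of the first retrieved ground truth entry has been computed for each query; it is an illustration, not the evaluation code used in the thesis.

# Minimal sketch of Recall@k and median rank.
import numpy as np

def recall_at_k(ranks, k):
    """Percentage of queries whose first ground truth entry is within the top k."""
    ranks = np.asarray(ranks)
    return 100.0 * np.mean(ranks <= k)

def median_rank(ranks):
    return float(np.median(ranks))

# Example: ranks = [1, 3, 12, 2] -> recall_at_k(ranks, 5) == 75.0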
Image representation and classifier training: Similar to [31, 51], we extracted
CNN activations as image-level features; such features have shown state-of-the-art performance
in recent object recognition results [53, 28]. We adapted the CNN implementation
provided by Caffe [44], and used the 19-layer network architecture and parameters from
Oxford [90]. The feature activations from the network's first fully-connected layer fc6
were used as image representations, each of which has 4,096 dimensions.
To train concept classifiers, we normalized the feature activations with the L2-norm. We
randomly sampled 1,000 images as negative data. We used a linear SVM [21] in the
concept discovery stage for its faster running time, and a χ² kernel SVM to train the final
concept classifiers, as it is a natural choice for histogram-like features and provides higher
performance than a linear SVM.
Comparison against embedding-based approaches: We first compare the performance
of our concept-based pipeline against embedding-based approaches. We set the
parameters of our system using the validation set. For concept discovery, we kept all
terms with at least 5 occurrences in the training sentences; this gave us an initial list of
5,309 terms. We filtered out all terms with average precision lower than 0.15, which preserved
2,877 terms. We set α to 0.6 and the number of concepts to 1,200.
Several recent embedding-based approaches [49, 92, 48, 66, 10, 51] are included for
comparison. Most of these approaches use CNN-based image representations (in particular,
[51] uses the same Oxford architecture), and embed sentences with recurrent neural
networks (RNN) or their variations. We make sure that the experiment setup and data
partitioning for all systems are the same, and report numbers from the original papers if
available.

                                   Image to sentence                Sentence to image
Method                             R@1   R@5   R@10  Median rank   R@1   R@5   R@10  Median rank
Karpathy et al. [48]               22.2  48.2  61.4  4.8           15.2  37.7  50.5  9.2
Mao et al. [66]                    18.4  40.2  50.9  10            12.6  31.2  41.5  16
Kiros et al. [51]                  23.0  50.7  62.9  5             16.8  42.0  56.5  8
Concepts (trained on Flickr 30k)   26.6  52.0  63.7  5             18.3  42.2  56.0  8
Table 9.5: Retrieval evaluation on Flickr 30k. Higher Recall@k and lower median rank
are better.
Table 9.4 lists the evaluation performance for all systems. We can see that the concept-based
framework achieves similar or better performance compared with the state-of-the-art
embedding-based systems. This confirms that the framework is a valid pipeline for the
bidirectional image and sentence retrieval task.
Enhancing concept classifiers with more data: The concept classifiers we trained
for the previous experiment only used training images from the Flickr 8k data set. To check if
the discovered concepts can benefit from additional training data, we collect the images
associated with the discovered concepts from the Flickr 30k data set. Since Flickr 30k contains
images which overlap with the validation and testing partitions of the Flickr 8k data
set, we removed those images and used around 29,000 images for training.
In the last row of Table 9.4, we list the results of the concept-based approach using
Flickr 30k training data. We can see that there is a significant improvement in every
metric. Since the only difference is the use of additional training data, the results indicate
that the individual concept classifiers benefit from extra training data. It
is worth noting that while additional data may also be helpful for embedding-based
approaches, it has to be in the form of image and sentence pairs. Such annotation tends
to be more expensive and time-consuming to obtain than concept annotation.
Evaluation on Flickr 30k dataset: Evaluation on Flickr 30k follows the same
strategy as on Flickr 8k, where parameters were set using validation data. We kept 9,742
terms which have at least 5 occurrences in the training sentences. We then filtered out all
terms with average precision lower than 0.15, which preserved 4,158 terms. We set α to
0.4 and the number of concepts to 1,600. Table 9.5 shows that our method achieves
comparable or better performance than the other embedding-based approaches.

                                    Image to sentence                 Sentence to image
Vocabulary                          R@1   R@5   R@10  Median rank   R@1   R@5   R@10  Median rank
ImageNet 1k [84]                    2.5   6.7   9.7   714           1.6   5.0   8.5   315
LEVAN [17]                          0.0   0.4   1.2   1348          0.2   1.1   1.7   443
NEIL [9]                            0.1   0.7   1.1   1103          0.2   0.9   2.0   446
LEVAN [17] (trained on Flickr 8k)   1.2   5.7   9.5   360           2.6   9.1   14.7  113
NEIL [9] (trained on Flickr 8k)     1.4   5.7   8.9   278           3.7   11.3  18.3  92
Flickr 8k Concepts (ours)           10.4  29.3  40.0  17            9.8   27.5  39.0  17
Table 9.6: Retrieval evaluation for different concept vocabularies on the COCO data set.
Concept transfer to other data sets: It is important to investigate whether the
discovered concepts are generalizable. For this purpose, we randomly selected 1,000 im-
ages and their associated 5,000 text descriptions from the validation partition of Microsoft
COCO data set [62].
We used the concepts discovered and trained from Flickr 8k data set, and compared
with several existing concept vocabularies:
ImageNet 1k [84] is a subset of ImageNet data set, with 1,000 categories used
in ILSVRC 2014 evaluation. The classiers were trained using the same Oxford CNN
architecture used for feature extraction.
LEVAN [17] selected 305 concepts manually, and explored Google Ngram data to
collect 113,983 sub-concepts. They collected Internet images and trained detectors with
Deformable Part Model (DPM). We used the learned models provided by the authors.
NEIL [9] has 2,702 manually selected concepts, each of which was trained with DPM
using weakly supervised images from search engines. We also used the models released
by the authors.
Among the three baselines above, ImageNet 1k relies on the same set of CNN-based
features as our discovered concepts. To further investigate the effect of concept selection,
we took the concept lists provided by the authors of LEVAN and NEIL, and re-trained
their concept detectors using our proposed pipeline. To achieve this, we selected training
images associated with the concepts from the Flickr 8k dataset, and learned concept detectors
using the same CNN feature extractors and classifier training strategies as our proposed
pipeline.
Figure 9.2: Impact of α when testing on the Flickr 8k data set (blue) and the COCO data set
(red). Recall@5 for sentence retrieval is used.
Table 9.6 lists the performance of using different vocabularies. We can see that the
discovered concepts clearly outperform the manually selected vocabularies, but
the cross-dataset performance is lower than the same-dataset performance. We
found that COCO uses many visual concepts discovered in Flickr 8k, though some are
missing (e.g. giraffes). Compared with the concepts discovered from Flickr 8k, the three
manually selected vocabularies lack many terms used in the COCO data set to describe
the visual entities. This inevitably hurts their performance in the retrieval task. The
performance of NEIL and LEVAN is worse than ImageNet 1k, which might be explained
by the weakly supervised Internet images they used to train concept detectors. Although re-training
on Flickr 8k using deep features helps improve the retrieval performance of NEIL and
LEVAN, our system still outperforms the two by large margins.
Impact of concept discovery parameters: Figure 9.2 and Figure 9.3 show the
impact of the visual similarity weight α and the total number of concepts on the retrieval
performance. To save space, we only display Recall@5 results for the sentence retrieval
direction.

Figure 9.3: Impact of the total number of concepts when testing on the Flickr 8k data set (blue)
and the COCO data set (red). Recall@5 for sentence retrieval is used.

We can see from the figures that both visual and semantic similarities are important for
concept clustering; this is particularly true when the concepts trained from Flickr 8k were
applied to COCO. Increasing the number of concepts helps at the beginning, as many
visually discriminative concepts are grouped together when the number of concepts is
small. However, as the number increases, the improvement flattens out and can even hurt
the concepts' ability to generalize.
9.4.2 Human Evaluation of Image Tagging
We also evaluated the quality of the discovered concepts on the image tagging task whose
goal is to generate tags to describe the content of images. Compared with sentence
retrieval, the image tagging task has a higher degree of freedom as the combination of
tags is not limited by the existing sentences in the pool.
Evaluation setup: We used the concept classifiers to generate image tags. For each
image, we selected the top three concepts with the highest classifier scores. Since a concept
may have more than one text term, we picked up to two text terms randomly for display.
For evaluation, we asked 15 human evaluators to compare two sets of tags generated
by different concept vocabularies. The evaluators were asked to select which set of tags
better describes the image, based on the accuracy of the generated tags and the coverage
of visual entities in the image, or whether the two sets of tags are equally good or bad.
The final label per image was combined using majority vote. On average, 85% of the
evaluators agreed on their votes for specific images.

Better    Worse    Same
64.1%     22.9%    12.9%
Table 9.7: Percentage of images where tags generated by the discovered concepts are
better, worse or the same compared with ImageNet 1k.
We compared the concepts discovered from Flickr 8k and the manually selected ImageNet
1k concept vocabulary. The classifiers for the discovered concepts were trained
using the 6,000 images from Flickr 8k. We did not compare the discovered concepts
against NEIL and LEVAN as they performed very poorly in the retrieval task. To test
how the concepts generalize to a different data set, we used the same 1,000 images from
the COCO data set as used in the retrieval task for evaluation.
Result analysis: Table 9.7 shows the evaluators' preference on the image tags generated
by the discovered concepts and ImageNet 1k. We can see that the discovered
concepts generated better tags for 64.1% of the images. This agrees with the trend
observed in the bidirectional retrieval task.
As shown in Figure 9.4, tags generated by ImageNet 1k have the following problems,
which might cause evaluators to label them as worse: first, many of the visual entities
do not have corresponding concepts in the vocabulary; second, ImageNet 1k has many
fine-grained concepts (e.g. different species of dogs), while more general terms might be
preferred by evaluators. On the other hand, the discovered concepts are able to reflect
how humans name the visual entities, and have a higher concept coverage. However, because
the number of training examples is relatively limited, the responses of different
concept classifiers are sometimes correlated (e.g. bed and sit down).
Figure 9.4: Tags generated using ImageNet 1k concepts (blue) and the discovered concepts
(green). Tags preferred by evaluators are marked in red blocks.
Chapter 10
Conclusion and Future Work
In the previous chapters, I have described my efforts towards solving video understanding
problems, especially for event detection and recounting. In this chapter, I conclude the
thesis and discuss a few possible future directions.
10.1 Conclusion
Chapters 3 to 5 address the problem of action and event detection. Chapter 3
presents a technique for classification of unconstrained videos using Fisher Vector
coding of sparse and dense local features. Significant improvements (35% and 26%
for sparse STIP features and dense DT features respectively) over the standard
Bag-of-Words have been demonstrated on a rather large test set which contains highly
diverse videos of varying quality. The improvement is consistent across the event classes,
indicating the robustness of the process.
A mid-level framework is introduced in Chapter 4. It uses an HMM as the underlying
model and applies the Fisher kernel technique to obtain a fixed-length description of the
model. The method is fast to compute and easy to use in existing frameworks. It can also
be used to describe videos with activity concept transitions. Experimental results show
that our approach achieves better results compared with a state-of-the-art concept-based
framework and a low-level framework. Moreover, when the number of training samples is
limited, our approach can still work reasonably well.
Chapter 5 studied the problem of fine-grained action localization for temporally
untrimmed web videos. We proposed to use noisily tagged web images to discover localized
action frames (LAF) from videos, and to model temporal information with LSTM
networks. We conducted thorough evaluations on our collected FGA-240 data set and the
public THUMOS 2014 data set, and showed the effectiveness of the LAF proposal by domain
transfer from web images.
Chapters 6 to 9 address the event recounting problem. In Chapter 6, we propose
the DISCOVER framework for video event classification and recounting. It classifies
unconstrained web videos by discovering important segments characterized by primitive
actions and their transitions. DISCOVER allows efficient learning and inference, and is
generalizable to using objects and scenes. Experimental results show that it outperforms
current state-of-the-art classification methods on a challenging large-scale dataset. For
event recounting, DISCOVER locates important segments, which is seldom addressed in
previous work.
Chapter 7 introduces the novel problem of zero-shot multimedia event recounting (ZeroMER).
It aims at providing persuasive evidence for the events without using training
videos. We presented the MIKT algorithm to select relevant concepts for zero-shot recognition
fully automatically, and formulated an ILP problem to select video segments that
are relevant, diverse and compact. Experimental results based on automatic and human
evaluations show that the MIKT framework achieves promising results for both the event
recounting and event detection tasks.
The video transcription task is addressed in Chapter 8. We propose a Semantic
Aware Video Transcription (SAT) system using Random Forest classifiers. SAT builds a
hierarchical structure using the responses of action and object detectors. It favors grouping
of semantically similar words, and outputs the probabilities of subject, verb and object terms.
SAT supports a large vocabulary of output words, and is able to generate more semantically
reasonable results. Experimental results on a web video dataset of 1,970 videos and 85,550
sentences showed that SAT provides state-of-the-art transcription performance.
Finally, Chapter 9 studies the problem of automatic concept discovery from parallel
corpora. We propose a concept filtering and clustering algorithm using both text and
visual information. Automatic evaluation using bidirectional image and text retrieval
and human evaluation on the image tagging task show that the discovered concepts achieve
state-of-the-art performance, and outperform several large manually selected concept
vocabularies significantly.
10.2 Future Work
There are several remaining questions I'd love to pursue in the future.
Weakly-supervised learning from videos
Deep learning has pushed the performance of computer vision systems up to a new level.
However, the success of deep learning relies heavily on large amounts of well-annotated
data, which to date do not exist for videos. As demonstrated by my previous work [101,
99], weakly supervised learning from videos is possible by making reasonable assumptions
and exploiting hidden knowledge (e.g. temporal consistency). My next step is to
study whether the neurons of weakly trained deep networks are able to capture mid-level
information such as objects, human pose and actions. This line of research not only
has the potential to demystify how deep learning works, but also offers rich semantics
from the videos.
Knowledge transfer from different domains
Exploiting existing knowledge is essential in building complex computer vision systems. I
have transferred different types of knowledge for video understanding in my previous work,
including but not limited to deep neural network parameters, object detectors and
language models. My ongoing work explores a new direction for image-to-video transfer:
it attempts to find a common mid-level representation shared by both domains (e.g.
birthday cake for birthday party images and videos), and uses such information for zero-shot
learning. Preliminary results have shown that my proposed approach outperforms
several competitive baselines significantly in zero-shot video event recognition.
Tighter connection between vision and language
I have demonstrated how to construct a visual vocabulary automatically in my previous
work [96]. One limitation of the approach is that the concept classifiers are image-level and
not localized. I am now working on concept localization by formulating a multiple instance
learning problem under the deep learning framework, and have obtained promising results.
My next step is to build upon such localizations, study the spatiotemporal arrangement
of visual concepts, and extract useful common sense from visual data.
Bibliography
[1] http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.11.org.html.
[2] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. J. Dickinson, S. Fidler, A. Michaux,
S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind,
J. W. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video in sentences out.
In UAI, 2012.
[3] D. Batra, H. Agrawal, P. Banik, N. Chavali, and A. Alfadda. Cloudcv: Large-scale
distributed computer vision as a cloud service, 2013.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.
[5] L. Breiman. Random forests. Machine Learning, 2001.
[6] C.-Y. Chen and K. Grauman. Watching unlabeled video helps learn new human
actions from very few labeled snapshots. In CVPR, 2013.
[7] D. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation.
In ACL, 2011.
[8] J. Chen, Y. Cui, G. Ye, D. Liu, and S. Chang. Event-driven semantic concept
discovery by exploiting weakly tagged internet images. In ICMR, 2014.
[9] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from
web data. In ICCV, 2013.
[10] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image
caption generation. CoRR, abs/1411.5654, 2014.
[11] J. Choi, M. Rastegari, A. Farhadi, and L. S. Davis. Adding unlabeled samples to
categories by learned attributes. In CVPR, 2013.
[12] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization
with bags of keypoints. In ECCV Workshop, 2004.
[13] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words:
Lingual description of videos through latent topics and sparse object stitching. In
CVPR, 2013.
[14] M.-C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed dependency
parses from phrase structure parses. In LREC, 2006.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR, 2009.
[16] J. Deng, J. Krause, A. Berg, and L. Fei-Fei. Hedging your bets: Optimizing
accuracy-specificity trade-offs in large scale visual recognition. In CVPR, 2012.
[17] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything:
Webly-supervised visual concept learning. In CVPR, 2014.
[18] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual
recognition and description. In CVPR, 2015.
[19] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The
pascal visual object classes (voc) challenge. IJCV, 2010.
[20] B. G. Fabian Caba Heilbron, Victor Escorcia and J. C. Niebles. ActivityNet: A
large-scale video benchmark for human activity understanding. In CVPR, 2015.
[21] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A
library for large linear classification. Journal of Machine Learning Research, 2008.
[22] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their
attributes. In CVPR, 2009.
[23] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their
attributes. In CVPR, 2009.
[24] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection
with discriminatively trained part-based models. In PAMI, 2009.
[25] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
Devise: A deep visual-semantic embedding model. In NIPS, 2013.
[26] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient
action detection. In CVPR, 2011.
[27] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine
Learning, 2006.
[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In Computer Vision and
Pattern Recognition, 2014.
[29] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained
deformable part models, release 5.
[30] E. Golge and P. Duygulu. Conceptmap: Mining noisy web data for concept learning.
In ECCV, 2014.
[31] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-
sentence embeddings using large weakly annotated photo collections. In ECCV,
2014.
[32] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent
neural networks. In ICASSP, 2013.
[33] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional
recurrent neural networks. In NIPS, 2008.
[34] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, T. Darrell, and
K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using
semantic hierarchies and zero-shot recognition. In ICCV, 2013.
[35] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing
plots learning a visually grounded storyline model from annotated videos. In CVPR,
2009.
[36] A. Habibian, T. Mensink, and C. G. Snoek. Composite concept discovery for zero-
shot video event detection. In ICMR, 2014.
[37] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput.,
1997.
[38] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking
task: Data, models and evaluation metrics. JAIR, 2013.
[39] H. Izadinia and M. Shah. Recognizing complex events using large margin joint
low-level event model. In ECCV, 2012.
[40] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers.
In NIPS, 1999.
[41] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek. Objects2action: Clas-
sifying and localizing actions without any video example. CoRR, abs/1510.06939,
2015.
[42] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a
compact image representation. In CVPR, 2010.
[43] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating
local image descriptors into compact codes. PAMI, 2011.
[44] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In
ACM MM, 2014.
[45] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann. Easy samples first: Self-paced
reranking for zero-example multimedia search. In ACM MM, 2014.
[46] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-example event search
using multimodal pseudo relevance feedback. In ICMR, 2014.
[47] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Suk-
thankar. THUMOS challenge: Action recognition with a large number of classes.
http://crcv.ucf.edu/THUMOS14/, 2014.
[48] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image
descriptions. CVPR, 2015.
[49] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional
image sentence mapping. In NIPS, 2014.
[50] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei.
Large-scale video classication with convolutional neural networks. In CVPR, 2014.
[51] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings
with multimodal neural language models. TACL, 2015.
[52] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadar-
rama. Generating natural-language video descriptions using text-mined knowledge.
In AAAI, 2013.
118
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classication with deep
convolutional neural networks. In NIPS, 2012.
[54] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video
database for human motion recognition. In ICCV, 2011.
[55] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby
talk: Understanding and generating image descriptions. In CVPR, 2011.
[56] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object
classes by between-class attribute transfer. In CVPR, 2009.
[57] I. Laptev. On space-time interest points. IJCV, 2005.
[58] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human
actions from movies. In CVPR, 2008.
[59] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In CVPR, 2006.
[60] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li. Object bank: A high-level image repre-
sentation for scene classication & semantic feature sparsication. In NIPS, 2010.
[61] W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos. Dynamic pooling for complex
event recognition. In ICCV, 2013.
[62] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ar, and
C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
[63] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In
CVPR, 2011.
[64] J. Liu, Q. Yu, O. Javed, S. Ali, A. Tamrakar, A. Divakaran, H. Cheng, and H. S.
Sawhney. Video event recognition using concept attributes. In WACV, 2013.
[65] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[66] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multi-
modal recurrent neural networks. CoRR, abs/1410.1090, 2014.
[67] T. Mensink, E. Gavves, and C. G. M. Snoek. COSTA: Co-occurrence statistics for
zero-shot classication. In CVPR, June 2014.
[68] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Ecient estimation of word repre-
sentations in vector space. CoRR, 2013.
[69] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed repre-
sentations of words and phrases and their compositionality. In NIPS, 2013.
[70] G. A. Miller. Wordnet: A lexical database for english. CACM, 1995.
[71] P. Natarajan, S. Vitaladevuni, U. Park, S. Wu, V. Manohar, X. Zhuang, S. Tsaka-
lidis, R. Prasad, and P. Natarajan. Multimodel feature fusion for robust event
detection in web videos. In CVPR, 2012.
[72] J. C. Niebles, C.-W. Chen, , and L. Fei-Fei. Modeling temporal structure of de-
composable motion segments for activity classication. In ECCV, 2010.
[73] D. Oneata, J. Verbeek, and C. Schmid. Action and Event Recognition with Fisher
Vectors on a Compact Feature Set. In ICCV, 2013.
119
[74] V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. L. Berg. From large scale image
categorization to entry-level categories. In ICCV, 2013.
[75] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. F. Smeaton,
and G. Queenot. Trecvid 2013 { an overview of the goals, tasks, data, evaluation
mechanisms and metrics. In TRECVID, 2013.
[76] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[77] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image cate-
gorization. In CVPR, 2007.
[78] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specic video
summarization. In ECCV, 2014.
[79] L. R. Rabiner. A tutorial on hidden markov models and selected applications in
speech recognition. In Proceedings of the IEEE, 1989.
[80] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating
video content to natural language descriptions. In ICCV, 2013.
[81] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-
shot learning in a large-scale setting. In CVPR, 2011.
[82] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where
- and why? semantic relatedness for knowledge transfer. In CVPR, 2010.
[83] E. Rosch. Principles of categorization. 1978.
[84] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale
Visual Recognition Challenge, 2014.
[85] S. Sadanand and J. Corso. Action bank: A high-level representation of activity in
video. In CVPR, 2012.
[86] H. Sak, A. Senior, and F. Beaufays. Long short-term memory based recur-
rent neural network architectures for large vocabulary speech recognition. CoRR,
abs/1402.1128, 2014.
[87] S. Satkin and M. Hebert. Modeling the temporal extent of actions. In ECCV, 2010.
[88] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local svm
approach. In ICPR, 2004.
[89] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action
recognition in videos. In NIPS, 2014.
[90] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. NIPS, 2014.
[91] B. Singh, X. Han, Z. Wu, V. I. Morariu, and L. S. Davis. Selecting relevant web
trained concepts for automated event retrieval. ICCV, 2015.
[92] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded
compositional semantics for nding and describing images with sentences. TACL,
2014.
[93] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions
classes from videos in the wild. CRCV-TR-12-01.
120
[94] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video
representations using LSTMs. In ICML, 2015.
[95] C. Sun, B. Burns, R. Nevatia, C. Snoek, B. Bolles, G. Myers, W. Wang, and E. Yeh.
ISOMER: Informative segment observations for multimedia event recounting. In
ICMR, 2014.
[96] C. Sun, C. Gan, and R. Nevatia. Automatic concept discovery from parallel text
and visual corpora. In ICCV, 2015.
[97] C. Sun and R. Nevatia. ACTIVE: Activity concept transitions in video event clas-
sication. In ICCV, 2013.
[98] C. Sun and R. Nevatia. Large-scale web video event classication by use of sher
vectors. In WACV, 2013.
[99] C. Sun and R. Nevatia. DISCOVER: Discovering important segments for classi-
cation of video events and recounting. In CVPR, 2014.
[100] C. Sun and R. Nevatia. Semantic aware video transcription using random forest
classiers. In ECCV, 2014.
[101] C. Sun, S. Shetty, R. Sukthankar, and R. Nevatia. Temporal localization of ne-
grained actions in videos by domain transfer from web images. In ACM MM, 2015.
[102] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural
networks. In NIPS, 2014.
[103] K. Tang, F.-F. Li, and D. Koller. Learning latent temporal structure for complex
event detection. In CVPR, 2012.
[104] L. Torresani, M. Szummer, and A. W. Fitzgibbon. Ecient object category recog-
nition using classemes. In ECCV, 2010.
[105] A. Vahdat, K. Cannons, G. Mori, I. Kim, and S. Oh. Compositional models for
video event detection: A multiple kernel learning latent variable approach. In
ICCV, 2013.
[106] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders.
Kernel codebooks for scene categorization. In ECCV, 2008.
[107] H. Wang, A. Kl aser, C. Schmid, and C.-L. Liu. Action recognition by dense trajec-
tories. In CVPR, 2011.
[108] H. Wang, A. Kl aser, C. Schmid, and C.-L. Liu. Dense trajectories and motion
boundary descriptors for action recognition. IJCV, 2013.
[109] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV,
2013.
[110] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In ICCV,
2013.
[111] H. Wang, M. M. Ullah, A. Kl aser, I. Laptev, and C. Schmid. Evaluation of local
spatio-temporal features for action recognition. In BMVC, 2009.
[112] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained
linear coding for image classication. In CVPR, 2010.
121
[113] R. J. Williams and J. Peng. An ecient gradient-based algorithm for on-line train-
ing of recurrent network trajectories. Neural Computation, 1990.
[114] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image
classication and auto-annotation. CVPR, 2015.
[115] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event
detection using multi-modal fusion of weakly supervised concepts. In CVPR, 2014.
[116] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation
for event detection. In CVPR, 2015.
[117] W. Yang and G. Toderici. Discriminative tag learning on youtube videos with
latent sub-tags. In CVPR, 2011.
[118] Y. Yang and D. Ramanan. Articulated pose estimation with
exible mixtures-of-
parts. In CVPR, 2011.
[119] B. Yao and F. Li. Recognizing human-object interactions in still images by modeling
the mutual context of objects and human poses. PAMI, 2012.
[120] G. Ye, Y. Li, H. Xu, D. Liu, and S.-F. Chang. Eventnet: A large scale structured
concept library for complex event detection in video. In ACM MM, 2015.
[121] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to
visual denotations: New similarity metrics for semantic inference over event de-
scriptions. TACL, 2014.
[122] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S. Chang. Designing category-level
attributes for discriminative visual recognition. In CVPR, 2013.
[123] S.-I. Yu, L. Jiang, Z. Mao, X. Chang, X. Du, C. Gan, Z. Lan, Z. Xu, X. Li, Y. Cai,
et al. Informedia@ TRECVID 2014 MED and MER.
[124] B. Zhou, V. Jagadeesh, and R. Piramuthu. ConceptLearner: Discovering Visual
Concepts from Weakly Labeled Image Collections. CVPR, 2015.
122
Abstract
Multimedia event detection and recounting are important Computer Vision problems that are closely related to Machine Learning, Natural Language Processing and other research areas in Computer Science. Given a query consumer video, multimedia event detection (MED) generates a high-level event label (e.g. birthday party, cleaning an appliance) for the entire video, while multimedia event recounting (MER) aims at selecting supporting evidence for the detected event. Typical forms of evidence include short video snippets and text descriptions. Event detection and recounting are challenging because of the high variation in video quality, the complex temporal structures, missing or unreliable concept detectors, and the large scale of the problem.

This thesis describes my solutions to event detection and recounting from large-scale consumer videos. The first part focuses on extracting robust features for event detection. The proposed pipeline utilizes both low-level motion features and mid-level semantic features. For low-level features, the pipeline extracts local motion descriptors from videos and aggregates them into video-level representations with Fisher vector techniques. For mid-level features, the pipeline encodes temporal transition information from noisy object and action concept detection scores. Both feature types are suitable for training linear event classifiers that scale to large numbers of query videos, and their performance is complementary.

The second part of the thesis addresses the event recounting problem, which includes the evidence localization and description generation tasks. Evidence localization searches for video snippets that provide supporting evidence for an event. It is inherently weakly supervised, as most training videos have only video-level annotations rather than segment-level annotations. My proposed framework treats evidence locations as hidden variables, and exploits activity co-occurrences and temporal transitions to model events; model parameters are learned with the latent SVM framework. For text description generation, my proposed pipelines connect vision and language by considering both semantic similarity from text and visual similarity from videos and images. The pipelines generate video transcriptions as subject-verb-object triplets or visual concept tags.

This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video and image datasets.
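To make the Fisher vector step of the low-level feature pipeline concrete, the following is a minimal sketch of how a set of local motion descriptors can be aggregated into a single video-level vector. This is an illustrative example rather than the implementation used in the thesis: it assumes a diagonal-covariance GMM vocabulary fit with scikit-learn, and the function name fisher_vector, the choice of 256 mixture components, and the commented usage lines are assumptions made here for clarity.

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    # descriptors: T x D array of local motion descriptors from one video.
    # gmm: fitted GaussianMixture with covariance_type='diag' (the visual vocabulary).
    T, _ = descriptors.shape
    gamma = gmm.predict_proba(descriptors)              # T x K soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                                # K x D standard deviations

    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]         # normalized residuals, T x D
        # Gradients of the log-likelihood with respect to the k-th mean and variance.
        g_mu = (gamma[:, k, None] * diff).sum(0) / (T * np.sqrt(w[k]))
        g_var = (gamma[:, k, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
        parts.extend([g_mu, g_var])
    fv = np.concatenate(parts)                          # length 2 * K * D

    fv = np.sign(fv) * np.sqrt(np.abs(fv))              # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)            # L2 normalization

# Hypothetical usage: fit the vocabulary on descriptors sampled from training
# videos, then encode each video before training a linear event classifier.
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(train_descriptors)
# video_fv = fisher_vector(video_descriptors, gmm)

The resulting fixed-length vector can then be fed to a linear classifier, which is what makes such a representation practical for large numbers of query videos.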