TOWARDS MORE OCCLUSION-ROBUST DEEP VISUAL OBJECT TRACKING
by
Gozde Sahin
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2023 Gozde Sahin
Dedication
I dedicate my dissertation work to everyone who supported me throughout my journey. I’m very thankful
for my family, whose love and support have helped me stay resilient through hardships and kept me
going. I cannot be more proud of my achievements, and cannot wait to celebrate each and every one
with them. My friends (old and new) have been my greatest company, and I am so thankful for all the
fun games, concerts, and food I got to share with each of them. Even when they were away, they have
been indispensable to both my academic and personal life. I’m thankful to my advisor and colleagues who
have taught me something new every day and patiently allowed me to complain about academia. And
finally, I’m grateful for my partner, who entered my life out of nowhere and made it so much better with
his curiosity, creativity, and kindness. He has been my rock in these last two years, helping to keep me
motivated and absolutely entertained at all times. I can’t wait to spend the rest of my life with him. As
this long journey ends, I am infinitely excited for what the future will bring.
Acknowledgements
I could not have undertaken this journey without my advisor and chair of my committee Dr. Laurent Itti,
who provided invaluable feedback and support to a PhD student who had just switched labs in their 4th
year, working on a brand new topic. Without him, it would have been impossible to see my ideas come to
fruition. I am grateful to my graduate advisor Lizsl de Leon, for being the best ally a PhD student can ask
for. I also deeply appreciate my quals and defense committees, who generously provided their time and
expertise to my work, as well as SRC and DARPA for funding my research.
I have learned immensely from my cohort and my labmates, sometimes during discussions that were
about our research or other emerging trends, other times during our chats about our respective cultures,
politics, philosophy, or just entertainment. They helped me broaden my perspective and learn from others'
experiences, for which I am grateful. Thanks should also go to any student I either worked with or taught
over the years, starting with my great team of annotators and extending to every student I interacted with
during my many semesters as a Teaching Assistant.
I would also like to recognize my managers, mentors, and colleagues at my three internships, who
helped me hone what kind of colleague and employee I wanted to be and got me excited about all the
real-world problems I will be working on after my degree. Lastly, I’m extremely grateful to my family,
my partner, my friends, and everyone else for their encouragement and moral support, to whom I have
dedicated this thesis.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Visual Object Tracking & Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Visual Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Visual Object Trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Conventional Tracking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Deep Visual Trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Occlusions in VOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 3: Multi-Task Occlusion Learning for Real-Time Visual Object Tracking . . . . . . . . . . 20
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Multi-Task Occlusion Learning for Visual Object Tracking . . . . . . . . . . . . . . . . . . 23
3.2.1 Baseline SiamRPN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Multi-Task Occlusion Prediction Module . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Results on the VOT Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Results on GOT-10k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 4: HOOT: Heavy Occlusions in Object Tracking Benchmark . . . . . . . . . . . . . . . . . 34
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 The HOOT Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Comparison with Other VOT Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Benchmark Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.3 Data Collection & Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.4 Benchmark Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.4.1 Target & Motion Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.4.2 Occlusion Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.5 Evaluation Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.3 Evaluation on Occlusion Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.4 Evaluation on Other Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.5 Performance Comparisons with LaSOT . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.6 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 5: Template Update for Tracking Under Heavy Occlusion . . . . . . . . . . . . . . . . . . . 58
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Baseline STARK Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.1 Localization Branch: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.2 Template Update Branch: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Template Update Analysis with STARK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.2 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.3 Occlusion Oracle Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Training STARK with HOOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Chapter 6: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Tables
2.1 Visual object tracking (VOT) performance metrics and the list of VOT datasets that have
adopted them for evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 VOT evaluation benchmark has binary labels for fully visible and occluded, while GOT-10k
has labels ranging from 0 (fully-visible) to 8 (fully-occluded). Both benchmarks have
highly imbalanced occlusion distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 The overall tracking performance results on VOT datasets, compared with other well-known trackers. SiamRPN* represents our baseline SiamRPN (trained only on GOT-10k). . 29
3.3 Expected Average Overlap (EAO) results on VOT Benchmark computed for all vs. occluded
frames, comparing the performance of the baseline SiamRPN trained on GOT-10k only
(SiamRPN*) to ours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 VOT2017 results for occlusion branch architecture variants. Reported are Accuracy (A),
Robustness (R), and Expected Average Overlap (EAO). See Section 4.3.1 for details. . . . . . 31
3.5 Effect of occlusion prediction loss and occlusion weighted classification loss on VOT2017. 32
3.6 GOT-10k test set results with comparisons to baseline methods in the GOT-10k official
results, including SiamRPN* (SiamRPN trained only on GOT-10k). . . . . . . . . . . . . . . 32
4.1 An overview of recent and widely-used visual object tracking benchmarks compared
to HOOT. The first part of the table focuses on general statistics, while the second part
focuses on occlusion-specific information provided by these benchmarks. HOOT stands
out as the dataset that provides the most detailed occlusion data per-frame. . . . . . . . . . 36
4.2 General statistics for the HOOT Benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Overall performance results for 15 state-of-the-art trackers on HOOT protocols defined in
Section 4.2.5. Metrics are computed as described in Section 4.3.1. Green, red, and orange
numbers represent the top 3 performers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Comparison of the overall performance results between HOOT and LaSOT test sets. The
table below only includes trackers that have also been evaluated on LaSOT, and presents
normalized precision and success numbers for both datasets. It shows a steep decline
in performance for HOOT, which has a much higher occlusion representation. Green
numbers mark the trackers that suffered the least drops between LaSOT and HOOT, while
red numbers mark the trackers that suffered the most. . . . . . . . . . . . . . . . . . . . . . 54
5.1 Analysis of the baseline STARK tracker with varying template update frequencies,
comparing statistics for tracking performance, percentages for scheduled updates,
successful updates, successful updates with occlusions, missed updates without occlusions,
and successful updates with very low or zero overlap. . . . . . . . . . . . . . . . . . . . . . 67
5.2 Analysis of the baseline STARK tracker with varying template update frequencies using an
occlusion oracle, comparing statistics for tracking performance, percentages for scheduled
updates, successful updates, successful updates with occlusions, missed updates without
occlusions, and successful updates with very low or zero overlap. The oracle forces the
tracker to skip an update if there is any occlusion affecting the target using occlusion
annotations from HOOT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Tracking performance (success) results of naively training STARK with HOOT. We show
the difference between the published baseline STARK weights and the baseline STARK we
trained, as well as how adding HOOT naively to the training affects the performance. . . . 71
5.4 Tracking performance (success) results of training STARK with occlusion ratios in HOOT.
We change the occlusion threshold th_occ to consider lower and lower visibility ratios as
acceptable updates for the classification module. Best success performances for each th_occ
in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
List of Figures
2.1 (a) Sample tracking results across 3 frames from different visual trackers, with the image
adapted from the original published in [81]. (b) A reference model for the most important
components of a visual tracker, with the image taken from [116]. . . . . . . . . . . . . . . 5
2.2 Samples from some of the challenges visual object trackers face, using OTB [129] videos.
For each challenge, two frames from a video are presented to demonstrate the target
marked with a red bounding box with and without the demonstrated challenge. . . . . . . 7
2.3 (a) Evaluation with re-initialization, as opposed to OPE, where the tracker is restarted
after it has lost the target. Taken from [12]. (b) A sample tracking result showing the
definitions of overlap (IoU) and Center Location Error (CLE) that are used to compute
most VOT metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 A non-exhaustive timeline showing visual object tracking datasets and the years they
were published: OTB [130, 129], VOT2013-2022 [72, 73, 71, 69, 67, 66, 64, 70, 65, 68],
ALOV300++ [116], NUS-PRO [79], UAV-123 [96], NfS [61], TLP [95], GOT-10k [51],
TrackingNet [97], OxUvA [121], LaSOT [34], HOB [75] and TOTB [37]. Datasets that were
designed mainly for evaluation and do not publish any pre-determined development (or
training) set are marked with *. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Schematics for Lucas-Kanade and TLD trackers as examples of the conventional visual
trackers. Images directly taken from [5, 56]. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Schematics for ATOM and DiMP trackers as examples of visual trackers that are trained
online. Images directly taken from [9, 24]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 Schematics for GOTURN and SiamFC trackers as examples of visual trackers that are
trained offline. Images directly taken from [7, 48]. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.8 Schematics for Stark and ToMP trackers as examples of state-of-the-art visual trackers
that are transformed-based. Images directly taken from [90, 132]. . . . . . . . . . . . . . . 18
3.1 Sample frames from videos in the GOT-10k training set that contain a variety of occlusion
levels, given on the right for each video, along with the object class. . . . . . . . . . . . . . 22
3.2 VOT evaluation benchmark has binary labels for fully visible and occluded, while GOT-10k
has labels ranging from 0 (fully-visible) to 8 (fully-occluded). Both benchmarks have
highly imbalanced occlusion distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Architecture of the proposed method for learning occlusions end-to-end in a SiamRPN
tracker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Qualitative results from occluded frames of two different videos in the VOT dataset.
Green: ground truth, blue: baseline SiamRPN trained on GOT-10k only (SiamRPN*), red:
our proposed method with multi-task occlusion learning. . . . . . . . . . . . . . . . . . . . 30
4.1 Sample frames from the HOOT benchmark showing different classes with a variety of
occluder masks provided with the dataset, colored according to the defined occluder
taxonomy (solid: dark blue, sparse: purple, semi-transparent: yellow, transparent: red).
Images cropped to regions of interest to better view the target rotated bounding boxes
and occluder masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Target class distribution in HOOT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Video-level distribution for target, motion, and occlusion attributes in HOOT. . . . . . . . 43
4.4 Sample images of different types of occluders defined in the benchmark taxonomy. (a-b)
Hard occlusions caused by solid occluders or sparse occluders. (c-d) Soft occlusions caused
by visual distortions from semi-transparent or transparent occluders. . . . . . . . . . . . . 44
4.5 (a) Per-frame occlusion-related attributes in HOOT. (b) Target occlusion levels per
occluder type across all partially occluded frames in HOOT. Full solid occlusion means
the target is fully occluded, which is why solid occlusion does not have partially occluded
frames with an occlusion ratio of 1.0, while other types might. . . . . . . . . . . . . . . . . . . 46
4.6 Distributions of occlusion-related attributes in the HOOT test set. (a) Video-level attribute
distributions. (b) Frame-level attribute distributions. (c) Distributions of target occlusion
level per occluder type across partially occluded frames. . . . . . . . . . . . . . . . . . . . 47
4.7 Success curves for the state-of-the-art trackers evaluated on HOOT. The trackers are
ranked according to AUC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.8 Success (a-d) and normalized precision (e-h) curves for the different occluder types
annotated in HOOT, computed for Protocol I. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.9 Success curves for occlusion attributes that affected tracking performance the most in
HOOT, computed for Protocol I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.10 Success curves for the remaining occlusion attributes annotated in HOOT, computed for
Protocol I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Success curves for videos that contain different motion tags for HOOT Protocol I. We
annotate videos that contain dynamic targets and camera motion. Moreover, for static
targets, we annotate tags that signify occlusion due to parallax and occlusion due to
moving occluder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.12 Success curves for videos in HOOT annotated by different target attributes, computed
for Protocol I. (a)-(c) Videos where each target attribute is set to true. (d)-(f) Videos where each target
attribute is set to false. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.13 Sample frames from videos that on average scored 0.418 on the success metric. . . . . . . . 55
4.14 Sample frames from videos that on average scored 0.128 on the success metric. . . . . . . . 56
5.1 Schematic of the STARK tracker, as adapted from the original paper [132]. The sample
images are from a video in HOOT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Schematics for the box prediction and classification heads for the STARK tracker, with the
box prediction schematic directly taken from [132]. . . . . . . . . . . . . . . . . . . . . . . 62
5.3 A collage of sample images from successful dynamic template updates made by the STARK
tracker on HOOT Protocol I videos. Each dynamic template is cropped by STARK to
128 × 128 using the predicted bounding box, and targets can be seen undergoing a variety
of occlusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 A collage of 10 sample images from successful dynamic template updates made by the
STARK tracker on HOOT Protocol I videos. Each dynamic template is cropped by STARK
to 128 × 128 using the predicted bounding box, and targets are not in the cropped region
of interest due to tracker failure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Plots showing how tracker performance (success) changes as update frequency is increased
for different types of occlusion oracles. Results are separated into two plots for different
oracle types. Occlusion percentage oracles prevent any update from happening if the
occlusion ratio in the search frame is over a threshold th_occ, while occlusion type oracles
prevent updates where the target is occluded by a certain occluder type from happening. . 69
Abstract
Visual object tracking (VOT) is considered one of the principal challenges in computer vision, where a
target given in the first frame is tracked in the rest of the video. Major challenges in VOT include factors
such as rotations, deformations, illumination changes, and occlusions. With the widespread use of deep
learning models with strong representative power, trackers have evolved to better handle the changes in
the target’s appearance due to factors like rotations and deformations. Meanwhile, robustness to occlusions
has not been as widely studied for deep trackers, and occlusion representation in VOT datasets has stayed
low over the years.
In this work, we focus on occlusions in deep visual object tracking and examine whether realistic
occlusion data and annotations can help with the development and evaluation of more occlusion-robust
trackers. First, we propose a multi-task occlusion learning framework to show how much occlusion labels in current datasets can help improve tracker performance in occluded frames. We discover that the lack
of occlusion representation in VOT datasets creates a barrier to developing and evaluating trackers that focus on
occlusions. To address occlusions in visual tracking more directly, we create a large video benchmark for
visual object tracking: The Heavy Occlusions in Object Tracking (HOOT) Benchmark. HOOT is specifically tailored for the evaluation, analysis, and development of occlusion-robust trackers with its extensive
occlusion annotations. Finally, using the annotations in HOOT, we examine the effect of occlusions on
template update and propose an occlusion-aware template update framework that improves the tracker
performance under heavy occlusions.
Chapter 1
Introduction
1.1 Visual Object Tracking & Occlusions
Visual object tracking (VOT) is one of the fundamental tasks in computer vision, and like many other vision
problems (e.g. detection, classification, etc.), it has been rapidly evolving with the impressive progress
made in artificial intelligence and machine learning. It acts as a building block for a variety of real-world
applications such as surveillance [28, 41, 53, 106], video analytics [38, 52, 94], self-driving vehicles [4,
84, 145], assistive robotics [13, 18, 139], augmented and virtual reality [2, 3, 45] and human-computer
interaction [33, 107]. Therefore, VOT algorithms must be robust to external factors, especially if they are
deployed for safety-sensitive applications like assistive robotics or self-driving vehicles.
As the performance of VOT methods improves with the advancements in deep learning [25, 91, 80,
132], so do users’ expectations of them. Many of the challenges in tracking that used to be main
topics for research (e.g. rotations, scale variation) have been mitigated by the progress in representation
learning, feature extraction, and data augmentation. However, as we ask these systems to solve more
and more difficult problems, there are still challenges that remain unsolved. Occlusions are one of these
challenges, where the view of the target object is partially or fully covered by another object in the scene
or by the frame boundaries.
Occlusions have been extremely difficult to model, since they can be caused in a myriad of ways by
a myriad of objects, affecting the visual appearance of the target in a number of ways. Unlike rotations or other challenges like illumination changes, this makes occlusions extremely difficult to simulate and makes it hard to learn occlusion-robust representations for the given targets. There exists a domain gap between any simulated occlusion (e.g. blocking parts of the image, pasting other objects onto the image) and real-world occlusion scenarios, which makes them difficult to solve. In fact, as recently as 2021, Gupta et al. reiterated these
issues and claimed that more data cannot fully solve the problem of handling occlusions [44]. While there is
some truth to this claim, it is also difficult to verify due to the lack of data and annotations in the literature
that contain realistic occlusion scenarios. As a matter of fact, while [44] specifically tackles occlusions,
their method could only be evaluated on benchmarks that have 10% or fewer occluded frames.
In this thesis, we investigate whether realistic occlusion data and annotations can help the VOT community develop more occlusion-robust deep visual trackers. We start by examining whether existing annotations regarding occlusions can help supervise the training of more occlusion-robust trackers. Identifying
a gap in the VOT dataset literature, we create and publish the largest benchmark in VOT to date that focuses
on real-world heavy occlusion scenarios. After showing how the state-of-the-art in VOT is still vulnerable to
occlusions, we investigate how extensive occlusion data can be useful for the analysis, evaluation, and
development of more occlusion-robust deep trackers.
1.2 Thesis Outline
This thesis is organized in the following way: After this brief introduction to the problem and our approach to it, we first present a detailed background in Chapter 2 to familiarize the reader with visual object
tracking (VOT), the literature on visual object trackers and how occlusions have been addressed in deep
visual trackers. Next, we present our work on multi-task occlusion learning in Chapter 3, where we show
how existing occlusion data can be utilized to create more occlusion-robust deep trackers. Recognizing the
difficulty in developing and evaluating algorithms against occlusions due to the lack of occlusion representation in current VOT datasets, Chapter 4 introduces HOOT, our proposed benchmark that is specifically
designed for heavy occlusions in VOT. Lastly, we start dissecting in Chapter 5 how HOOT can be used to
analyze tracker response to occlusions and help train even more occlusion-robust deep trackers. The final
chapter discusses our conclusions and identifies future work in this research direction.
Chapter 2
Background
2.1 Visual Object Tracking
In this section, we further define the visual object tracking (VOT) problem and how it fits into the rest
of the literature in visual tracking. We outline the challenges new methods address in VOT, the history
of datasets published to evaluate these algorithms and give details on the performance measures used to
evaluate them.
2.1.1 Introduction
As discussed in Chapter 1, visual object tracking (VOT) is one of the fundamental tasks in computer vision,
where a given target is tracked through subsequent frames under many confounding factors, as presented
in Fig. 2.1a. It is commonly used as a stand-in for single-object tracking (SOT) and we will also use VOT
as such in this study. Research in VOT generally assumes that only a single object is tracked in each
video, that there is no independent detection module for re-detection, and that the visual data comes from
a single RGB camera (monocular). Different visual tracking setups like multi-object tracking, multi-view
surveillance, or long-term tracking have been widely addressed elsewhere in the literature.
Figure 2.1: (a) Sample tracking results across 3 frames from different visual trackers, with the image adapted from the original published in [81]. (b) A reference model for the most important components of a visual tracker, with the image taken from [116].
A visual object tracker is formed of 3 interacting mechanisms that help estimate a trajectory for the
given target. Research in VOT generally addresses one or more of these aspects, defined below:
• Representation, which is the description of the target and regions of interest as we process the
video. While motion and location are one part of the representation, for visual tracking they are
closely related to appearance features from the video. While previous work utilized hand-crafted
features, deep features have lately been the norm due to their representative power. Therefore,
many works have focused on how to efficiently use deep features in visual tracking.
• Model update, which deals with how to adapt the description of the target we are tracking through
subsequent frames, sometimes called the template or the exemplar. While some works have chosen
to keep the initial model given in the first frame, other research has focused solely on improved
target model update across frames. This can be crucial for difficult scenarios that include lots of
confounding factors like lighting changes, deformations, or rotating targets.
• Method, which deals with the formulation of the task and additional logic that facilitates improved
tracking performance. This can include posing the problem of tracking as a classification vs. regression task, as well as different supervision or training schemes for improved performance or
optimization of the output bounding box.
These aspects have also been visually presented in the survey by Smeulders et al. in [116], which is
replicated in Fig. 2.1b.
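To make this decomposition concrete, the toy sketch below wires the three mechanisms into a single class. The patch-based template, the running-average update, and the exhaustive SSD search are illustrative placeholders under stated assumptions, not the design of any tracker discussed in this chapter.

```python
import numpy as np

def crop(frame, box):
    """Crop a (H, W) grayscale frame to an integer box = (x, y, w, h)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

class TemplateTracker:
    """Toy single-object tracker illustrating the three mechanisms above:
    representation (raw grayscale patch), model update (running average),
    and method (exhaustive matching via sum of squared differences)."""

    def __init__(self, update_rate=0.1, search_radius=20):
        self.update_rate = update_rate
        self.search_radius = search_radius

    def initialize(self, frame, box):
        # Representation: the target model is the first-frame patch.
        self.box = box
        self.template = crop(frame, box).astype(np.float64)

    def track(self, frame):
        x, y, w, h = self.box
        best, best_xy = np.inf, (x, y)
        # Method: search a window around the previous location and match against the template.
        for dy in range(-self.search_radius, self.search_radius + 1):
            for dx in range(-self.search_radius, self.search_radius + 1):
                nx, ny = x + dx, y + dy
                if nx < 0 or ny < 0 or nx + w > frame.shape[1] or ny + h > frame.shape[0]:
                    continue
                cand = crop(frame, (nx, ny, w, h)).astype(np.float64)
                score = np.sum((cand - self.template) ** 2)  # SSD matching cost
                if score < best:
                    best, best_xy = score, (nx, ny)
        self.box = (best_xy[0], best_xy[1], w, h)
        # Model update: blend the newly predicted appearance into the template.
        new_patch = crop(frame, self.box).astype(np.float64)
        self.template = (1 - self.update_rate) * self.template + self.update_rate * new_patch
        return self.box
```

Every tracker covered in Section 2.2 can be read as replacing one or more of these pieces with a stronger component, for example deep features for the representation or a learned matching head for the method.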
2.1.2 Challenges
Visual object tracking shares many challenges with the wider tracking community. These challenges consist of changes to the target, the camera, or the environment that can affect the visual input used to create
a representation for the target. We identify and discuss some of these challenges below and give examples.
We also visualize some of these challenges in Fig. 2.2.
Illumination: In visual tracking applications, sudden changes in illumination or lack of illumination can
easily make it difficult to extract a good representation of the target. While some effects of illumination
changes have been remedied by the use of stronger visual features (e.g. deep features), extreme cases can
still result in tracker failure.
Fast Motion & Motion Blur: Motion can also affect tracking performance since most trackers employ
heuristics to estimate a region of interest (ROI) to search for the target instead of processing the entire image. Fast target or camera motion may cause the target to fall outside of this estimated ROI. Moreover, fast
motion can cause blur, which affects the quality of the visual features captured. Adaptive ROI estimation
and improved representative power have been used to aid with these challenges.
Figure 2.2: Samples from some of the challenges visual object trackers face, using OTB [129] videos. For each challenge, two frames from a video are presented to demonstrate the target, marked with a red bounding box, with and without the demonstrated challenge.
Scale Variation: Scale variation is another factor that can make it difficult to learn a good representation
of the target. Large changes in how many pixels the target occupies (for example, as it starts close to the
camera and moves away with time) can interfere with how descriptive the appearance model is, making
trackers less accurate. Over the years, scale variations have been handled through increased video quality,
increasingly descriptive features, and better algorithms.
Rotations: In-plane and out-of-plane rotations also influence the performance of VOT algorithms. With
increased technical progress in recent years, data-driven feature extraction (such as deep features) and data
augmentation techniques have greatly improved how trackers respond to these challenges.
Deformations: Deformable objects are those that change their shape as the video progresses and have
been one of the main challenges for visual trackers. Changes in object shape can make it difficult to match
the target template to the current frame. The effect of deformations can be alleviated with model-based
tracking or improved data-driven matching methods. While these techniques are not as effective for novel
objects, trackers can still handle deformations with smart model update mechanisms.
Distractors: In some cases, the tracked object might coexist or interact with other objects that look very
similar. While detection, re-identification, and association have helped with handling distractors
in multi-object tracking, they still pose a large threat for VOT algorithms and can cause identity switches
when the target interacts with similar objects.
Occlusions: Occlusions have been one of the most difficult of these challenges to address since they
can be caused in a myriad of ways and are difficult to model. They occur due to partial or full occlusion
by other objects in the scene or by the frame itself. Moreover, occluding objects themselves can vary
greatly and can affect how the tracker should handle the occlusion. They affect tracking accuracy, can
cause tracking failures, and can corrupt the target’s appearance model if a template update happens with
occlusions present. We cover how they have been addressed in the VOT literature in Section 2.3.
2.1.3 Performance Metrics
The performance metrics used to evaluate and rank visual object trackers are diverse. Most of the details
for different measures used can be found in the previous literature comparing these metrics [12] or those
that survey visual object tracking methods [15, 88, 116]. Below, we will outline some of the popular metrics
used to evaluate VOT algorithms.
In general, most trackers are evaluated using One Pass Evaluation (OPE), where the tracker is initialized
once and processes the entire video even if it loses the target. The counterpart to OPE is adopting a re-initialization mechanism during evaluation when the tracker fails, a visualization of which is shown in
Fig. 2.3a. Restart-based evaluation can help assess the failure rate or robustness of the trackers, detailed
below. After the video is processed, the rest of the metrics can be computed using the distance and overlap
between the predicted and ground truth boxes in each frame, as represented in Fig. 2.3b.
Figure 2.3: (a) Evaluation with re-initialization, as opposed to OPE, where the tracker is restarted after it has lost the target. Taken from [12]. (b) A sample tracking result showing the definitions of overlap (IoU) and Center Location Error (CLE) that are used to compute most VOT metrics.
Precision & Normalized Precision: These two metrics are computed over Center Location Error (CLE),
which is the Euclidean distance between the center of the ground truth and predicted bounding boxes in
each frame. Precision is one of the oldest performance metrics in tracking and is sensitive to frame size and
scale changes. Normalized precision has been proposed to remedy these shortcomings [97]. By varying
the CLE threshold, we can obtain precision/normalized precision plots that show the percentage of video
frames where the CLE is lower than a specific threshold.
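As a concrete reference, the sketch below computes CLE and the resulting precision from per-frame boxes. The (x, y, w, h) box format, the 20-pixel threshold, and the normalized variant with a 0.2 cut-off are common conventions assumed here for illustration rather than values fixed by any single benchmark.

```python
import numpy as np

def center_location_error(pred, gt):
    """CLE: Euclidean distance between box centers. Boxes are (N, 4) arrays in (x, y, w, h)."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def precision(pred, gt, threshold=20.0):
    """Fraction of frames whose CLE is below the threshold (in pixels)."""
    return float(np.mean(center_location_error(pred, gt) < threshold))

def normalized_precision(pred, gt, threshold=0.2):
    """Normalized variant: center offsets are scaled by the ground-truth box size before
    thresholding (in the spirit of normalized precision; exact conventions vary by benchmark)."""
    offset = (pred[:, :2] + pred[:, 2:] / 2.0) - (gt[:, :2] + gt[:, 2:] / 2.0)
    norm_cle = np.linalg.norm(offset / gt[:, 2:], axis=1)
    return float(np.mean(norm_cle < threshold))

# Example: two frames, only the first prediction stays within 20 pixels of the ground truth center.
pred = np.array([[10, 10, 50, 80], [12, 14, 50, 80]], dtype=float)
gt = np.array([[12, 12, 48, 82], [30, 40, 48, 82]], dtype=float)
print(precision(pred, gt, threshold=20.0))  # 0.5
```

Sweeping the threshold over a range of values produces the precision plot described above.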
Success & Success Rate: Unlike precision, success is based on the overlap - or intersection-over-union
(IoU) - of the predicted and ground truth bounding boxes. Success metrics can be computed by determining whether overlap is higher than a given threshold. For example, the success rate at 0.5 threshold is
the percentage of frames where the overlap is higher than 0.5. Other times when success is reported, it
represents the area-under-the-curve (AUC) as overlap threshold values are varied between 0 and 1.
Accuracy or Average Overlap (AO): As the name suggests, average overlap - sometimes called accuracy - computes the mean overlap across all frames. It was found to be highly correlated to success rate [12]; however, it does not need a threshold to be chosen.
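The overlap-based metrics can be written down just as compactly. The sketch below assumes the same (x, y, w, h) box format and approximates the success AUC by averaging the success rate over evenly spaced overlap thresholds; it is a minimal reference implementation, not any benchmark's official toolkit.

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union for (N, 4) arrays of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def success_rate(pred, gt, threshold=0.5):
    """Percentage of frames whose overlap exceeds the given threshold."""
    return float(np.mean(iou(pred, gt) > threshold))

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Success reported as the area under the curve over varying overlap thresholds."""
    return float(np.mean([success_rate(pred, gt, t) for t in thresholds]))

def average_overlap(pred, gt):
    """Average overlap (AO): mean IoU over all frames, with no threshold needed."""
    return float(np.mean(iou(pred, gt)))
```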
Performance Metric | Datasets
Precision | OTB2013 [130], OTB2015 [129], UAV123 [96], TLP [95], TrackingNet [97], LaSOT [34], HOB [75], TOTB [37]
Normalized Precision | TrackingNet [97], LaSOT [34], TOTB [37]
Success | OTB2013 [130], OTB2015 [129], NUS-PRO [79], UAV123 [96], TLP [95], NfS [61], TrackingNet [97], LaSOT [34], HOB [75], TOTB [37]
Success Rate | GOT-10k [51]
Average Overlap (AO) | GOT-10k [51]
Accuracy | VOT2013 [72] - present [68]
Robustness | VOT2013 [72] - present [68]
Expected Average Overlap (EAO) | VOT2013 [72] - present [68]
Table 2.1: Visual object tracking (VOT) performance metrics and the list of VOT datasets that have adopted them for evaluation.
Robustness: Robustness is also defined as the failure rate and represents how many times a tracker has
lost the target during a video. As such, it can only be computed if a re-initialization mechanism is used
during evaluation, and cannot be computed for datasets that use OPE.
Expected Average Overlap (EAO): This performance metric is interpreted as the combination of accuracy and robustness for trackers that are evaluated using a re-initialization strategy. It is calculated as
the average overlap in a sequence even if tracker failures cause zero overlaps for some frames.
Table 2.1 matches each metric to the datasets that have adopted them. These datasets can be found in
detail in Section 2.1.4. In addition to the metrics mentioned here, each tracker can also be evaluated on
speed or frames-per-second (FPS).
2.1.4 Datasets
There have been a variety of datasets published to evaluate, analyze, and develop visual object tracking
algorithms. As deep learning started to dominate the field, many of these datasets scaled up to meet the
demand to train stronger deep models. Fig. 2.4 presents a timeline for when these datasets were published.
Amongst them, some are more commonly used or more recent in the VOT dataset literature. We discuss a
selection of these below and give some basic statistics for each.
Figure 2.4: A non-exhaustive timeline showing visual object tracking datasets and the years they were published: OTB [130, 129], VOT2013-2022 [72, 73, 71, 69, 67, 66, 64, 70, 65, 68], ALOV300++ [116], NUS-PRO [79], UAV-123 [96], NfS [61], TLP [95], GOT-10k [51], TrackingNet [97], OxUvA [121], LaSOT [34], HOB [75] and TOTB [37]. Datasets that were designed mainly for evaluation and do not publish any pre-determined development (or training) set are marked with *.
OTB: OTB datasets are pioneering evaluation datasets for visual object trackers. First released as 50
videos in 2013 [130], the collection was later updated to 100 videos with OTB100 in 2015 [129]. OTB100
contains almost 60k frames of 16 classes of objects and marks every video with a set of 11 attributes that
tag some of the tracking challenges each video exhibits. These video-level attributes have been used to
evaluate trackers per challenge (e.g. against illumination variation or deformations). OTB uses One-Pass
Evaluation (OPE) and ranks trackers with precision and success.
VOT: One of the pioneering evaluation datasets for visual trackers, VOT continues to keep its popularity through its yearly challenge [72, 73, 71, 69, 67, 66, 64, 70, 65, 68]. The short-term tracking challenge generally consists of 60 videos that may get updated from year to year depending on the trackers' performance. The main evaluation protocol employs a tracker re-initialization mechanism, and the performance metrics used are accuracy, robustness, and EAO (as defined in Section 2.1.3). In addition to the short-term challenge, VOT has also included different problem setups in recent years, such as segmentation-based, long-term, or RGB-D object tracking.
TrackingNet: This large-scale video dataset [97] was formed of videos from YTBB [108], and contains
over 30k videos along with a training and test set. Since the videos in YTBB are not densely annotated
(only at 1fps), TrackingNet utilizes a tracker output to estimate the ground truth annotations for the rest
of the frames. It contains 27 object classes and has become one of the key datasets for training deep
learning-based visual trackers.
GOT-10k: Focused on generic object tracking, GOT-10k [51] is another large-scale video dataset with
10k videos and over 500 object classes. Not only does it provide training and validation sets, but also a
held-out test set that is only formed of object classes not given in the training/validation. This makes the
tracking test one-shot, testing how well trackers do against new objects. It uses OPE and utilizes average
overlap and success rate as its main performance metrics.
LaSOT: One of the most popular VOT datasets in recent years, LaSOT [34] boasts 1550 sequences and
almost 4 million frames. It has become a great resource for deep learning-based trackers and has aided in
both training and evaluation of trackers through its training and test sets. LaSOT has also added a one-shot
test set since it was published. This dataset uses OPE to rank trackers with precision, normalized precision,
and success.
HOB: The first evaluation dataset to specifically focus on heavy occlusions in VOT, HOB [75] is comprised of 20 videos, coarsely annotated by bounding boxes and some occlusion-specific video attributes. As
more tasks in computer vision also focus on heavy occlusions, HOB has been the first one to pay attention
to this problem in VOT. While it is too small to train trackers with, it can be used to evaluate trackers using
OPE and LaSOT metrics.
TOTB: As the number of VOT datasets in the literature increases with the introduction of deep learning
into the field, more and more specialized datasets emerge and TOTB [37] is one of these datasets. It focuses
on tracking transparent objects and presents a large-scale evaluation playground for these difficult target
objects. It contains 225 videos, almost 90k frames, and 15 different transparent target objects. It also utilizes
OPE and ranks trackers using success.
It is also worth mentioning some of the visual object tracking datasets represented in Fig. 2.4 but not
detailed above. These datasets are either not as popular in recent years or specialized in different problem
setups. We list them briefly below:
• ALOV300++ [116], a dataset with 315 videos and almost 9k frames, mainly formed for evaluation
but became one of the first datasets used for training deep models for tracking [48].
• UAV-123 [96], a 123-video dataset aimed at visual object tracking from an aerial viewpoint.
• NUS-PRO [79], a large-scale evaluation benchmark of 365 sequences of object classes spanning
pedestrians and other rigid objects.
• NfS [61], a dataset specialized in high frame-rate tracking, with 100 videos shot at 240 FPS.
• OxUvA [121] and TLP [95], two datasets that specifically focus on long-term tracking, and contain
366 and 50 videos respectively. OxUvA is actually a subset of the popular video dataset YTBB [108],
much like the popular dataset TrackingNet.
There have also been non-tracking datasets that have contributed to VOT algorithms, including
large video or object detection datasets that have aided in training stronger deep models. Common auxiliary datasets that many studies utilize have been the object detection dataset COCO [83] as well as video
datasets ImageNet-VID [112] and YouTube Bounding Boxes (YTBB) [108].
2.2 Visual Object Trackers
In this section, we give a summary of how visual trackers have evolved over the years and the rapid
development tracking algorithms have shown with new emerging technologies like deep learning. We
start with a discussion of conventional tracking algorithms before the wide use of deep learning in visual object tracking and then detail deep trackers and how they have evolved in recent years.
Figure 2.5: Schematics for Lucas-Kanade and TLD trackers as examples of the conventional visual trackers. Images directly taken from [5, 56].
2.2.1 Conventional Tracking Algorithms
While the current state-of-the-art in visual object tracking (VOT) is dominated by deep learning-based
methods, we use this section to give a short introduction to the visual tracking landscape before deep features started being used in tracking applications and took over. We present conventional tracking methods
in two classes, following the survey of Smeulders et al. [116]: Those that utilize matching and those that
use discriminative classification.
Trackers that utilize matching attempt to match the template of the target that was built across past
frames to the current image. The most well-known trackers that fall into this category are the Lucas-Kanade tracker and the Mean Shift tracker. The Lucas-Kanade tracker poses the problem as an image alignment problem based on the 1981 algorithm [5, 86, 115] and computes affine transformations to match the
candidates to the template. Another famous tracker, Mean Shift, utilizes RGB-color histograms to perform
matching [20]. Other trackers of this school include [50, 76, 92, 93, 100, 111].
As for conventional trackers that follow the route of discriminative classification, they try to differentiate foreground from background. The most famous example of these trackers by far is the Tracking-Learning-Detection (TLD) tracker, which tries to integrate a detector with an optical flow tracker [56], based
on P-N Learning [57], which inspired other trackers as well [30]. The tracker Struck learns a kernelised
structured output support vector machine (SVM) online [46] and has garnered popularity as well. Some
other well-known methods that treat the tracking problem this way include [42, 99]. Trackers based on
Discriminative Correlation Filters (DCFs) also fall into this category and got popularized after the work of
Bolme et al. [10] on MOSSE. Some successors to MOSSE in DCF-based visual tracking include the popular
trackers Staple, KCF, and SRDCF [8, 27, 49], among many others.
Example schematics from Lucas-Kanade and TLD trackers are given in Figure 2.5. For more information on traditional object tracking history, Yilmaz et al. and Smeulders et al. provide snapshots of the field
from different decades [116, 136], and the most recent study that surveys VOT is Chen et al. [15].
2.2.2 Deep Visual Trackers
Recent years have brought the deep learning revolution to the field of visual object tracking (VOT) as well
as many other applications across other disciplines. While a broader history of deep visual object tracking
can be found in [15, 88], we separate and introduce deep object trackers by their training procedures. In
general, deep trackers can be divided into online and offline-trained models in terms of whether they make
online updates to the trained model as they process the input or not. We present related works covering
both schools of deep trackers below, as well as methods that only use pre-trained deep features.
Using Pre-Trained Deep Features: Initial efforts of integrating advances in deep learning into tracking
included replacing hand-crafted features (e.g. HOG, color) used to compute the appearance model of the
target with deep features. DCF-based trackers like DeepSTRCF [82], DeepSRDCF [26], CCOT [23] and
ECO [22] adopted this method and acted as a bridge between conventional and deep trackers.
Figure 2.6: Schematics for ATOM and DiMP trackers as examples of visual trackers that are trained online. Images directly taken from [9, 24].
Online Learning Methods: These types of deep trackers generally update a classifier or some other
tracker component online as they see consecutive frames from the target. Most of these trackers utilize
offline training first to learn a very discriminative model and use their online learning mechanisms to adapt
to each video. MDNet and VITAL are well-known pioneering trackers that belong in this category [54, 98,
118], along with several others [19, 35, 131] that include a tracker that trains an LSTM online for better
object adaptability [32].
More recently, ATOM [24] has utilized a module that trains online to classify background vs. foreground and has been popular in the field. Furthermore, DiMP [9] has used meta-learning to initialize
a model optimizer module and has used online training to fine-tune that module. DiMP variants like
PrDiMP and KeepTrack [25, 91] have been some of the state-of-the-art in VOT in recent years. Fig. 2.6
shows schematics of ATOM and DiMP as samples of trackers that train online.
Trackers that employ online training generally do extremely well in terms of tracking performance
since they adapt the network as they process the video, which provides a lot of robustness to many confounding factors including deformations. However, these algorithms are almost always slower than those
that are exclusively trained offline, since model updates can be expensive. They can also be more susceptible to factors like occlusions since online training modules might be updated with corrupted data coming
from an occluded target.
Figure 2.7: Schematics for GOTURN and SiamFC trackers as examples of visual trackers that are trained offline. Images directly taken from [7, 48].
Offline Learning Methods: The counterpart to trackers that do some online learning are the visual
object trackers that achieve all representation learning from the training data. Since these models only
require a forward pass for prediction, they can usually work at speeds much better than real-time, which
makes them more desirable for real-world applications.
GOTURN and SiamFC were some of the first trackers to use this approach, with their schematics given
in Fig. 2.7. Held et al. pose the problem of visual tracking as a bounding box regression task in GOTURN
[48]. GOTURN learns how to estimate the motion of the bounding box instead of a similarity function for
objects, and can therefore be considered a generic object tracker. It has inspired other works that improved
on it using recurrent models [43].
Meanwhile, fully-convolutional Siamese models that were pioneered by SiamFC [7] learn a general
similarity metric from the objects that they see during offline training. SiamFC fuses the features extracted
from the template and search regions using a correlation operation inside the network. Many variants
of SiamFC have been proposed, such as: SiamRPN [81] which uses an RPN module [109] for matching,
DaSiamRPN [146] which is distractor-aware, SiamRPN++ [80] that uses deeper and more representative
models, and SiamMask [125] which also does target segmentation. Other works have improved on major
algorithmic details, like the addition of a fine-matching strategy to SiamRPN [124], utilizing cascaded RPN
modules [36], and employing recurrent architectures [134, 135]. SiamDW instead explored how to get the best out of these Siamese networks while using deeper and wider backbones for superior representative power [142].
Figure 2.8: Schematics for Stark and ToMP trackers as examples of state-of-the-art visual trackers that are transformer-based. Images directly taken from [90, 132].
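The correlation-based fusion at the heart of these Siamese trackers is simple to illustrate: the embedded template acts as a convolution kernel slid over the embedded search region, and the peak of the resulting response map indicates the target location. The sketch below uses PyTorch's conv2d for the sliding-window correlation; the tensor shapes and random features are placeholders, not the configuration of any specific published tracker.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    """SiamFC-style fusion: correlate template features with search features.

    template_feat: (C, Hz, Wz) embedding of the exemplar image.
    search_feat:   (C, Hx, Wx) embedding of the (larger) search region.
    Returns a (Hx - Hz + 1, Wx - Wz + 1) response map.
    """
    # conv2d with the template as the kernel computes a sliding-window correlation.
    response = F.conv2d(search_feat.unsqueeze(0),    # (1, C, Hx, Wx)
                        template_feat.unsqueeze(0))  # (1, C, Hz, Wz) -> a single output channel
    return response.squeeze(0).squeeze(0)

# Toy example with random embeddings standing in for backbone features.
z = torch.randn(256, 6, 6)            # template embedding
x = torch.randn(256, 22, 22)          # search-region embedding
score_map = cross_correlation(z, x)   # (17, 17) response map
peak = torch.argmax(score_map)        # flattened index of the highest similarity
```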
Attention has also been an inspiration for deep trackers [55, 63], but especially so in recent years with
the innovations in self-attention and transformer literature [122]. As transformers became more common
in computer vision tasks, visual object trackers have also started to widely utilize them. A variety of
transformer-based trackers have become the current state-of-the-art in VOT, amongst which are TransT
[16], DTT [137], Stark [132] and ToMP [90]. Sample schematics from Stark and ToMP are given in Fig. 2.8,
which shows how much more sophisticated trackers have become compared to the early offline-trained trackers in Fig. 2.7.
2.3 Occlusions in VOT
Occlusions have always been a difficult challenge for visual trackers, as they represent a lack of visual signal
coming from the target object. This makes occlusions difficult to model compared to other factors. Over
the years, many algorithms have focused on occlusions in tracking applications [31, 78, 119, 126]. These
include taking advantage of the tracking-by-detection framework [56] or spatio-temporal outlier detection
[101] in conventional trackers, to using occlusion reasoning to update classifier pools [31], part-based
18
methods [14, 40, 119] that handle occlusions by determining which parts of an object is visible, and training
sub-networks for different confounding factors including occlusions [105]. More recent work focuses on
simulating occlusions in the latent space using structured dropouts [44] or coupling deep trackers with
conventional tracking methods like Kalman Filters to combat occlusions [128].
Occlusions have also been a crucial part of multi-object tracking and datasets curated for this task
[29, 127], where well-modeled targets like pedestrians and vehicles are the focus. Unfortunately, while
great progress has been made by the state-of-the-art in visual object tracking (VOT), there has not been
a big effort to develop algorithms to address occlusions, nor enough evaluation of developed algorithms
on them. A lack of data and annotations has driven most of this issue, as it has become easier to improve
upon the deep trackers’ learned representations through more and more sophisticated architectures.
Meanwhile, in other vision domains like object detection [113, 123] and video object segmentation [59,
103] and even video object inpainting [60], methods and datasets specifically addressing occlusions have
started to appear. For example, in video instance segmentation, the OVIS benchmark, which focuses on heavy occlusions, now also hosts an organized challenge that others can participate in [104]. In this work, we hope
to do the same for occlusions in VOT.
Chapter 3
Multi-Task Occlusion Learning for Real-Time Visual Object Tracking
3.1 Motivation
As mentioned in Chapter 1, visual object tracking (VOT) has greatly benefited from the deep learning revolution in recent years, along with other fundamental computer vision problems (e.g. classification, object
detection). While impressive progress has been achieved, using deep neural networks requires extensive
data - video data in the case of VOT - and supervision from annotations tailored to every task. To make
use of deep neural networks with stronger representative powers, studies in visual tracking have also 1)
upgraded to deeper and more complicated network architectures, and 2) started training with increasingly
more data to better optimize these deeper and more complex models.
The current state-of-the-art in VOT generally uses 4 or more large-scale datasets to train deep trackers
[9, 16, 24, 80, 91, 132]. These datasets are either video datasets [34, 51, 97] or detection datasets [83, 112]
and boast millions of bounding boxes that can be used to teach the models how to localize a target in the
given search frame. Unfortunately, since annotations can be costly, only a few of them contain occlusion
information. In this chapter, we investigate whether existing occlusion annotations in visual tracking
benchmarks can be used as supervision to train more occlusion-robust trackers. More specifically, we
Tracking Dataset     Occlusion Labels    Label Type    Dataset Type
OTB [129]            ✗                   -             test
TrackingNet [97]     ✗                   -             train/test
LaSOT [34]           ✓                   0-1           train/test
NUS-PRO [79]         ✓                   0-2           test
VOT [66]             ✓                   0-1           test
GOT-10k [51]         ✓                   0-8           train/val/test
Table 3.1: VOT evaluation benchmark has binary labels for fully visible and occluded, while GOT-10k has labels ranging from 0 (fully-visible) to 8 (fully-occluded). Both benchmarks have highly imbalanced occlusion distributions.
examine whether occlusion prediction can be a helpful auxiliary task for visual trackers, by proposing a
multi-task occlusion learning framework.
As mentioned in Section 2.3, previous work in visual object tracking has tried to handle occlusions in
many different ways, including tracking-by-detection [56], part-based methods [14, 40, 119] and others
[31, 105]. However, with the increased use of deep learning in the field, none has attempted to predict
occlusions directly. Benefits of multi-task learning have been shown in applications like facial landmark
prediction [140], attribute prediction [1], pose estimation [144], and even object detection and tracking
[81, 109]. Since learning related tasks can improve performance on the main task, it is important to examine whether
occlusion prediction can act as an auxiliary task to localization for visual tracking models.
Because very few datasets in visual tracking annotate occlusions, occlusion prediction has been difficult to study for visual tracking with deep neural networks. At the time of our study,
most VOT datasets consisted of only very high-level occlusion labels per-frame. Most of these labels were
inconsistent: LaSOT [34] has binary labels for absence alone, NUS-PRO [79] has 3 labels for no, partial,
and full occlusion, and the popular VOT challenge datasets [70] have binary labels for fully visible and
occluded. Moreover, NUS-PRO and VOT are designed only for evaluation of trackers, and not training.
The only dataset that annotated occlusions in the most detail was the recently released GOT-10k [51],
which provided occlusion level labels ranging from 0 (fully-visible) to 8 (fully-occluded). Fig. 3.1 shows
Figure 3.1: Sample frames from videos in the GOT-10k training set that contain a variety of occlusion levels, given on the right for each video, along with the object class (skunk, occlusion level change 0→1→4; football, 0→5→2; person, 1→4→6).
sample frames from 3 different GOT-10k videos with their annotated target occlusion levels. In addition
to these occlusion labels, GOT-10k is a large-scale dataset with training, validation, and test splits. This
made GOT-10k a very good candidate for showing whether occlusion prediction is a good auxiliary task
for target tracking.
In addition to the lack of occlusion labels, occlusion prediction is also made more difficult by the imbalanced distribution of occlusions in the current visual tracking benchmarks. Fig. 3.2 shows the distribution
of occlusion labels across all frames in VOT challenge datasets and GOT-10k. Throughout the years, the
VOT challenge only had occlusions in around 10% of frames, with no details on how severe these occlusions are. While GOT-10k annotates occlusions according to severity, the distribution of the occlusion
labels shows that only a very small percentage of frames have moderate to heavy occlusions.
In this chapter, we take on the aforementioned challenges and answer whether it is possible to use
these limited occlusion labels to supervise a tracker during training and improve its tracking performance,
especially in the presence of occlusions. For our experiments, we use the popular SiamRPN tracker [81]
Figure 3.2: VOT evaluation benchmark has binary labels for fully visible and occluded, while GOT-10k
has labels ranging from 0 (fully-visible) to 8 (fully-occluded). Both benchmarks have highly imbalanced
occlusion distributions.
trained on the GOT-10k dataset [51] and provide experimental results on both GOT-10k and VOT to show
how useful learning occlusions in a multi-task framework can be for handling occluded frames.
3.2 Multi-Task Occlusion Learning for Visual Object Tracking
This section provides the background on the SiamRPN tracker we used as our baseline tracker and gives
the details of our proposed multi-task occlusion prediction framework.
3.2.1 Baseline SiamRPN Architecture
In this study, we formulate the main task of visual tracking as one-shot detection on a single frame, as
laid out by Li et al. in their paper [81]. While recent works like SiamRPN++ [80] have also explored how
to get the best out of these Siamese trackers while using deeper backbones to make use of their superior
representative power, this has also required them to utilize more and more data for training (as mentioned
in Section 3.1). Since we are limited to GOT-10k for its occlusion labels to show how occlusion prediction
can help visual tracking, we choose the less complex model SiamRPN as our baseline.
Figure 3.3: Architecture of the proposed method for learning occlusions end-to-end in a SiamRPN tracker.
SiamRPN utilizes a Region Proposal Network (RPN) [109] and tries to solve the online tracking problem by having two branches for 1) regression of the bounding box proposals and 2) classification of the
object/background classes for the search image. The input consists of the exemplar frame, which is the
template of the object to be tracked, and the search image, which is the current search frame cropped at
the region of interest, using the bounding box information predicted in the previous frame. We will denote
these images as z and x respectively in this chapter. Both images are first processed by the same backbone
feature extractor to obtain two feature maps φ(z) and φ(x).
These feature maps φ(z) and φ(x) are the input to the regression and classification branches of the
RPN. For each branch, feature maps are first processed by additional convolutional layers, after which the
correlation operation (denoted with ⋆ in Fig. 3.3) is computed between the exemplar and search feature
maps. We will denote the feature maps obtained after correlation as A^cls_{w×h×2k} and A^reg_{w×h×4k}, similar to [81], where we choose w = h = 17 for our experiments, and keep the number of anchors k at 5. The factor 2 in A^cls_{w×h×2k} represents the classification branch output dimension for binary foreground/background classification, while the factor 4 in A^reg_{w×h×4k} represents the regression branch output dimension for (x, y, w, h), the predicted center and shape of the anchor for that cell. Moving forward, we will eliminate w, h, and k from our notations for simplicity and call them A^cls and A^reg.
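To make the ⋆ operation concrete, the short sketch below shows one common way this correlation can be implemented, with the exemplar features acting as convolution kernels over the search features (the "up-channel" correlation of the original SiamRPN). The channel count C, the feature-map sizes, and the variable names are illustrative assumptions chosen only so that the output matches the 17×17 response used here; this is a minimal sketch, not the exact implementation used in our experiments.

    import torch
    import torch.nn.functional as F

    def upchannel_xcorr(z_feat: torch.Tensor, x_feat: torch.Tensor, out_ch: int) -> torch.Tensor:
        # z_feat: exemplar features after the branch-specific conv, shape (1, out_ch * C, hz, wz)
        # x_feat: search features after the branch-specific conv, shape (1, C, hx, wx)
        # The exemplar features act as out_ch convolution kernels of shape (C, hz, wz),
        # so the output is a response map of shape (1, out_ch, h, w).
        _, zc, hz, wz = z_feat.shape
        c = zc // out_ch
        kernels = z_feat.view(out_ch, c, hz, wz)
        return F.conv2d(x_feat, kernels)

    # Illustrative shapes: k = 5 anchors, C = 256 channels, 4x4 exemplar and 20x20 search
    # features, which yields the 17x17 response maps used in this chapter.
    k, C = 5, 256
    z_cls = torch.randn(1, 2 * k * C, 4, 4)
    x_cls = torch.randn(1, C, 20, 20)
    A_cls = upchannel_xcorr(z_cls, x_cls, out_ch=2 * k)   # -> torch.Size([1, 10, 17, 17])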
The predicted output (ŷ^cls_i, ŷ^reg_i) is computed by passing these feature maps through more convolutional layers, making the classification and regression branches fully-convolutional networks. The joint loss for training these branches employs the cross-entropy loss for the classification task and the smooth L1 loss for the regression task. The cross-entropy (CE) loss for the binary foreground/background classification is given as:

L_{cls} = -\left( y^{cls}_i \log(\hat{y}^{cls}_i) + (1 - y^{cls}_i) \log(1 - \hat{y}^{cls}_i) \right)    (3.1)

Meanwhile, the smooth L1 loss for regression gets computed on the normalized distances δ(ŷ^reg_i, y^reg_i) of the predicted anchors (check [81] for more details) to the ground truth boxes using:

L_{reg} = \sum_{l=0}^{3} \mathrm{smooth}_{L1}(\delta[l], \sigma)    (3.2)

\mathrm{smooth}_{L1}(x, \sigma) = \begin{cases} 0.5\,(\sigma x)^2, & |x| < \frac{1}{\sigma^2} \\ |x| - \frac{0.5}{\sigma^2}, & \text{otherwise} \end{cases}    (3.3)

The baseline model loss then becomes:

L_{baseline} = L_{cls}(\hat{y}^{cls}_i, y^{cls}_i) + \lambda_0 L_{reg}(\hat{y}^{reg}_i, y^{reg}_i)    (3.4)

where λ_0 represents the loss weight between classification and regression.
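As a concrete reference for Eq. 3.2-3.3, the snippet below sketches the σ-parameterized smooth L1 term applied to the normalized offsets δ; the σ value is a placeholder rather than a tuned hyper-parameter. PyTorch's built-in smooth_l1_loss with beta = 1/σ² computes the same quantity.

    import torch

    def smooth_l1(x: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
        # Elementwise smooth L1 of Eq. 3.3, applied to the normalized anchor offsets delta.
        quad = 0.5 * (sigma * x) ** 2
        lin = x.abs() - 0.5 / sigma ** 2
        return torch.where(x.abs() < 1.0 / sigma ** 2, quad, lin)

    # Eq. 3.2 for a batch of normalized offsets delta with shape (num_anchors, 4):
    delta = torch.randn(8, 4)
    l_reg = smooth_l1(delta).sum(dim=1).mean()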
During inference, once predictions are obtained from the model, the final predicted box has to be
computed using several post-processing steps. First, the top K boxes are selected from amongst cells and
anchors that scored high for the classification task. These top K proposals are then filtered to remove boxes
farther away from the center of the region of interest since motion is assumed to be small in subsequent
frames. Filtered proposals are further re-ranked using a cosine window and scale change penalty. The
final predicted box is then selected by applying Non-maximum-suppression (NMS) on these re-ranked
proposals.
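The sketch below illustrates the cosine-window and scale-change-penalty re-ranking step described above; the top-K selection, the center-distance filter, and NMS are omitted for brevity, and the hyper-parameter values (window_influence, penalty_k) and array layouts are illustrative placeholders rather than the tuned settings of [81].

    import numpy as np

    def rerank(scores, boxes, prev_wh, response_size=17, window_influence=0.40, penalty_k=0.05):
        # scores: (H*W*A,) foreground scores; boxes: (H*W*A, 4) proposals as (cx, cy, w, h);
        # prev_wh: (w, h) of the box predicted in the previous frame.
        # Cosine (Hanning) window centered on the search region, repeated for every anchor.
        hann = np.outer(np.hanning(response_size), np.hanning(response_size)).ravel()
        window = np.tile(hann, scores.size // hann.size)

        # Scale-change penalty: penalize proposals whose size or aspect ratio differ from the
        # previous box.
        def sz(w, h):
            pad = (w + h) * 0.5
            return np.sqrt((w + pad) * (h + pad))

        s_c = np.maximum(sz(boxes[:, 2], boxes[:, 3]) / sz(*prev_wh),
                         sz(*prev_wh) / sz(boxes[:, 2], boxes[:, 3]))
        r_c = np.maximum((prev_wh[0] / prev_wh[1]) / (boxes[:, 2] / boxes[:, 3]),
                         (boxes[:, 2] / boxes[:, 3]) / (prev_wh[0] / prev_wh[1]))
        penalty = np.exp(-(s_c * r_c - 1.0) * penalty_k)

        pscore = scores * penalty * (1 - window_influence) + window * window_influence
        return int(np.argmax(pscore))   # index of the best proposal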
3.2.2 Multi-Task Occlusion Prediction Module
Our method proposes a deep, real-time visual object tracking algorithm that predicts the target occlusion
level as an auxiliary task to target localization. This is done by adding an occlusion module to the baseline
SiamRPN; the schematic of the model can be seen in Fig. 3.3. For this extra supervision, we utilize the
occlusion level labels provided by the recent GOT-10k dataset [51], which has every frame annotated with
an integer occlusion level, varying from 0 (fully visible) to 8 (fully occluded). Using the provided labels,
we learn the level of occlusion from the data end-to-end in a multi-task learning framework. Therefore,
our method adds an occlusion branch to the baseline model as well as an additional occlusion weighted
classification loss, inspired by [102], where occlusion levels were used to supervise a pedestrian detection
task for improved performance. In this section, we give details of our occlusion branch and the occlusion
weighted classification loss we utilized for improved tracking performance.
The Occlusion Branch: Our occlusion prediction module, as seen in Fig. 3.3, utilizes features from
the classification branch of the SiamRPN tracker. The input features used for the occlusion module are
the output of the correlation operation between the exemplar and search feature maps, denoted above as
Acls. We present ablation studies on how performance was affected when using different features and
model architectures in Section 3.3. Using the existing labels in GOT-10k, we aim to force the network to
explicitly learn occlusion patterns in the data, improving the tracking performance and providing the user
with this information directly. The occlusion prediction module consists of two convolutional layers and
one 512-dimensional fully connected layer before the classification layer.
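A minimal PyTorch sketch of such a branch is given below. Only the overall structure (two convolutional layers, a 512-dimensional fully connected layer, and a classifier over the binned occlusion classes) follows the description above; the channel widths, kernel sizes, and global pooling are illustrative assumptions, not the exact layers used in our experiments.

    import torch
    import torch.nn as nn

    class OcclusionBranch(nn.Module):
        # Two convolutional layers, a 512-d fully connected layer, and a classifier over the
        # binned occlusion classes; channel widths, kernel sizes, and the global pooling are
        # illustrative assumptions.
        def __init__(self, in_channels: int = 10, num_classes: int = 3):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.fc = nn.Sequential(nn.Linear(128, 512), nn.ReLU(inplace=True))
            self.classifier = nn.Linear(512, num_classes)

        def forward(self, a_cls: torch.Tensor) -> torch.Tensor:
            # a_cls: correlation output of the classification branch, e.g. shape (B, 2k, 17, 17)
            return self.classifier(self.fc(self.convs(a_cls)))

    occ_logits = OcclusionBranch()(torch.randn(4, 10, 17, 17))   # -> shape (4, 3)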
Since occlusion labels in GOT-10k are more fine-grained than other datasets, but still very imbalanced, the task can be made easier by reducing the final number of classes using user-determined thresholds t = [t_0, ..., t_{m-1}] for a total of m classes. Therefore, for the i-th frame, the scalar occlusion label o_i is a function of the chosen thresholds t and the ground truth occlusion level o^gt_i, and given as follows:

o_i = l \quad \text{with} \quad t_l \le o^{gt}_i < t_{l+1}    (3.5)
During our experiments, we use two thresholds at levels 2 and 5, reducing from 9 to 3 occlusion labels.
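For reference, this binning can be done in one line, as sketched below (the helper name is hypothetical):

    import numpy as np

    def bin_occlusion_levels(levels, thresholds=(2, 5)):
        # Implements Eq. 3.5: with thresholds at 2 and 5, the raw 0-8 GOT-10k levels are
        # mapped to 3 classes: [0, 1] -> 0, [2, 4] -> 1, [5, 8] -> 2.
        return np.digitize(levels, bins=thresholds, right=False)

    bin_occlusion_levels(np.array([0, 1, 2, 4, 5, 8]))   # -> array([0, 0, 1, 1, 2, 2])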
The occlusion prediction is achieved by minimizing the cross-entropy loss for the computed occlusion classes. We weigh the loss by the inverse of the class frequency, denoted by f_{o_i} for sample i in class o_i, due to the imbalanced distribution of the labels in the data (Fig. 3.2). The occlusion loss becomes:

L_{occ} = \frac{1}{N} \sum_{i=0}^{N} \frac{1}{f_{o_i}} \, \mathrm{CELoss}(\hat{o}_i, o_i)    (3.6)
Occlusion Weighted Classification Loss: Inspired by the recent work on occluded pedestrian detection [102], we also utilize an additional loss in our work that weights the classification results with the occlusion level of the sample. While in [102] this is done using the pixel-level visibility information, we can directly use the integer labels given to us in our training set, since we are working with a framework where we do not have segmentation or even coarser pixel-level visibility information of our targets. This loss increases the penalty on high occlusion samples, forcing the network to pay more attention to them. We denote this extra loss term as L_{occ-cls} and provide it with the ground truth occlusion level normalized to 0-1, where M = 9 due to the total number of occlusion level labels in GOT-10k.

L_{occ-cls} = \frac{1}{N} \sum_{i=0}^{N} \frac{o^{gt}_i}{M} \, \mathrm{CELoss}(\hat{y}^{cls}_i, y^{cls}_i)    (3.7)
The total loss is given as:

L_{total} = L_{baseline} + \lambda_1 L_{occ} + \lambda_2 L_{occ-cls}    (3.8)
where the λ values (λ0, λ1, and λ2) are the weighting factors for the multi-task framework and can
be tuned during validation. On top of improving the invariance of the baseline tracker to occlusions, the
predicted occlusion level oi at test time can also provide the occlusion level of the target to the user with
a single forward pass, keeping tracker speeds at real-time.
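To make the multi-task objective concrete, the sketch below assembles Eq. 3.1-3.8 into a single loss function. The λ values, σ, the class frequencies, and all tensor shapes are placeholders for illustration, not the values tuned on the validation set.

    import torch
    import torch.nn.functional as F

    def multitask_loss(cls_logits, cls_targets, reg_pred, reg_targets,
                       occ_logits, occ_targets, occ_levels, class_freq,
                       lambdas=(1.2, 1.0, 1.0), M=9.0, sigma=3.0):
        # Baseline terms (Eq. 3.1-3.4); torch's smooth_l1_loss with beta = 1/sigma^2 matches Eq. 3.3.
        l_cls = F.cross_entropy(cls_logits, cls_targets)
        l_reg = F.smooth_l1_loss(reg_pred, reg_targets, beta=1.0 / sigma ** 2)
        l_baseline = l_cls + lambdas[0] * l_reg

        # Occlusion branch loss with inverse class-frequency weights (Eq. 3.6).
        occ_ce = F.cross_entropy(occ_logits, occ_targets, reduction="none")
        l_occ = ((1.0 / class_freq[occ_targets]) * occ_ce).mean()

        # Occlusion-weighted classification loss (Eq. 3.7).
        cls_ce = F.cross_entropy(cls_logits, cls_targets, reduction="none")
        l_occ_cls = ((occ_levels / M) * cls_ce).mean()

        # Eq. 3.8.
        return l_baseline + lambdas[1] * l_occ + lambdas[2] * l_occ_cls

    loss = multitask_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                          torch.randn(8, 4), torch.randn(8, 4),
                          torch.randn(8, 3), torch.randint(0, 3, (8,)),
                          torch.randint(0, 9, (8,)).float(), torch.tensor([0.7, 0.2, 0.1]))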
3.3 Experiments
3.3.1 Experimental Setup
For training our proposed method, we use the GOT-10k dataset [51] and denote our SiamRPN baseline
trained only on GOT-10k as SiamRPN*. Training and inference remain the same as the baseline model
[81]. We continue utilizing AlexNet [74] trained on ImageNet as the backbone for feature extraction since
we are limited to training with GOT-10k. We train using SGD with a 5 epoch warm-up and fine-tune
the final two layers of the AlexNet after epoch 10. To mimic different motion patterns, we add additional
augmentation by random shifts and scaling, as well as sampling video frames from a set interval. The task
weights and thresholds to bin occlusion levels are tuned on the GOT-10k validation set. Our method is
implemented in Python, using PyTorch, on Tesla V100 GPUs.
                          VOT2016                VOT2017                VOT2019
Tracker                   A ↑    R ↓    EAO ↑    A ↑    R ↓    EAO ↑    A ↑    R ↓    EAO ↑
ECO [22]                  0.55   0.20   0.375    0.48   0.27   0.280    -      -      -
C-COT/SSRCCOT [23, 64]    0.54   0.24   0.331    0.49   0.32   0.267    0.52   0.55   0.224
CSRDCF [87]               0.51   0.24   0.338    0.49   0.36   0.256    0.50   0.63   0.201
MDNet [98]                -      -      0.257    -      -      -        -      -      -
VITAL [118]               -      -      0.323    -      -      -        -      -      -
SiamFC [7]                0.53   0.46   0.235    0.50   0.59   0.188    0.51   0.96   0.189
MemDTC [133]              0.51   -      0.260    0.49   -      0.250    0.49   0.59   0.228
SiamRPN [81, 64]          0.56   0.26   0.340    0.49   0.46   0.240    0.52   0.55   0.224
GCT [40]                  -      -      -        -      -      0.274    -      -      -
SiamRPN*                  0.61   0.26   0.377    0.58   0.39   0.278    0.58   0.62   0.233
Ours                      0.61   0.26   0.389    0.58   0.40   0.293    0.59   0.63   0.243
Table 3.2: The overall tracking performance results on VOT datasets, compared with other well-known trackers. SiamRPN* represents our baseline SiamRPN (trained only on GOT-10k).
3.3.2 Results on the VOT Benchmark
We evaluate our method on the VOT benchmark datasets [64, 67, 69], each of which contains 60 videos
with different challenging aspects. The overall performance metric is the Expected Average Overlap (EAO),
which takes into account both Accuracy (A) and Robustness (R). These metrics are defined in Section 2.1.3.
We use VOT2016, VOT2017, and VOT2019 to evaluate our method since the evaluation videos (and distributions of occluded frames) do not vary between some years of the challenge. The occlusion distributions
for these challenge years can be viewed in Fig. 3.2.
Overall Performance: In addition to showing the advantages of our method on occluded frames
to validate our approach, we also compare it to other trackers that use the AlexNet backbone on overall
tracking performance. The results for accuracy, robustness, and EAO are in Table 3.2. We compare our
method against trackers that represent a variety of different approaches in visual tracking. These include
DCF-based trackers such as ECO [22], C-COT [23] and CSRDCF [87], deep learning methods that employ
online training like MDNet [98] and VITAL [118] and also offline-trained methods, like the original SiamFC
[7], recurrent tracker MemDTC [133], the original SiamRPN paper [81] and GCT tracker [40]. We find that
our method’s performance surpasses these well-known trackers, especially on the EAO. For VOT2019, we
Figure 3.4: Qualitative results from occluded frames of two different videos in the VOT dataset. Green:
ground truth, blue: baseline SiamRPN trained on GOT-10k only (SiamRPN*), red: our proposed method
with multi-task occlusion learning.
utilize numbers reported in [64] for methods that are either similar to or improve upon the ones
presented for the previous datasets.
Performance on Occluded Frames: To prove that learning occlusion end-to-end alongside the main
tracking task in this multi-task framework improves the network’s invariance to occlusions, we look at
its performance under occlusion. The results can be viewed in Table 3.3, where the EAO from all images
vs. only the occluded frames are compared. Looking at the comparison between the baseline SiamRPN*
and our SiamRPN with the occlusion branch, it can be seen that the proposed method can obtain up to 7%
improvement on the targeted occluded frames on VOT2017. Since these occluded frames are at most 11%
of the data, the impact on the overall performance is low. However, we can expect this improvement to be
more prominent for videos with higher occluded frame rates.
Qualitative results that show some occluded frames from VOT can be viewed in Fig. 3.4, where our
tracker with the occlusion branch (in red) has more overlap with the ground truth boxes (in green) compared to the baseline SiamRPN tracker (in blue). These results show that learning occlusions as an auxiliary task can be beneficial for tracking.
                      SiamRPN*    Ours
VOT2016   EAO (All)   0.377       0.389
          EAO (Occ)   0.231       0.265
VOT2017   EAO (All)   0.278       0.293
          EAO (Occ)   0.177       0.246
VOT2019   EAO (All)   0.233       0.243
          EAO (Occ)   0.187       0.245
Table 3.3: Expected Average Overlap (EAO) results on VOT Benchmark computed for all vs. occluded frames, comparing the performance of the baseline SiamRPN trained on GOT-10k only (SiamRPN*) to ours.
Occlusion Branch Type                     A ↑     R ↓     EAO ↑
Baseline (No occ. branch)                 0.581   0.389   0.278
Using φ(z) & φ(x), conv + xcorr + fc      0.58    0.431   0.278
Using A^cls & A^reg, conv + fc            0.567   0.431   0.285
Using A^cls & A^reg, only fc              0.567   0.44    0.289
Using A^cls, conv + fc                    0.578   0.398   0.293
Table 3.4: VOT2017 results for occlusion branch architecture variants. Reported are Accuracy (A), Robustness (R), and Expected Average Overlap (EAO). See Section 4.3.1 for details.
Ablation studies: We perform ablation studies on the occlusion branch architecture and the addition of the occlusion-weighted classification loss. We test four different variants of the architecture using
features from the backbone (φ(z) and φ(x)) or after the correlation operation in the regression and classification branches (Acls and Areg). We also test a fully-connected occlusion branch. The results of these
can be viewed in Table 3.4, where we found that classification branch features are the best features to
use for the occlusion branch since we find regression is better decoupled from the occlusion prediction.
Moreover, we find that a combination of convolutional and fully-connected layers is the best architecture
for the occlusion branch.
In addition to architecture, we also conduct ablation studies on the occlusion weighted classification
loss and find that it is complementary to the occlusion branch when they are used together. The results
can be viewed in Table 3.5. In addition to improved performance on occluded frames when trained in this
                                 Occlusion Prediction    A ↑     R ↓     EAO ↑
Baseline (L_cls + L_reg)         ✗                       0.581   0.389   0.278
Baseline + L_occ-cls             ✗                       0.568   0.388   0.287
Baseline + L_occ                 ✓                       0.571   0.417   0.285
Baseline + L_occ-cls + L_occ     ✓                       0.578   0.398   0.293
Table 3.5: Effect of occlusion prediction loss and occlusion weighted classification loss on VOT2017.
            Average Overlap    Success Rate    Speed (fps)
KCF         0.203              0.177           94.66
Staple      0.246              0.239           28.87
ECO         0.325              0.328           2.62
CCOT        0.325              0.328           0.68
MDNet       0.299              0.303           1.52
CFNet       0.374              0.404           25.81
GOTURN      0.347              0.375           108.55
SiamFC      0.348              0.353           44.15
SiamRPN*    0.407              0.463           148.19
Ours        0.413              0.466           158.82
Table 3.6: GOT-10k test set results with comparisons to baseline methods in the GOT-10k official results, including SiamRPN* (SiamRPN trained only on GOT-10k).
manner, the user also has access to the predicted occlusion label, which can be useful for higher-level logic
during inference in future trackers.
3.3.3 Results on GOT-10k
We also test our model on the GOT-10k test set, with 180 videos of 84 object classes, non-overlapping with
the training set. The performance metrics that are used in GOT-10k are the Average Overlap (AO) and
Success Rate (SR) [51]. As seen in Table 3.6, our method outperforms its baseline tracker SiamRPN*, as
well as the other baseline trackers in the official leaderboard. Some of these trackers are DCF-based (like
KCF [49], ECO [22] and C-COT [23]), while others are deep trackers trained online (like MDNet [98]) or
offline (like SiamFC [7], CFNet [120] and GOTURN [48]). Since the test set is held out for GOT-10k and
performance results are received through a private server, it is not possible to compute any occlusion-related metrics on the test set. However, we obtain an F1-score of 0.927 for occlusion prediction on the
GOT-10k validation set.
3.4 Discussion
In this chapter, we proposed a real-time visual object tracker that utilizes occlusion level annotations and
a multi-task learning framework to improve tracking performance in the presence of occlusions. This
new tracker consists of an occlusion branch that predicts occlusion severity and an occlusion-weighted
classification loss for tracking. This study is the first work that attempts to use occlusion level labels in
GOT-10k as extra supervision for deep visual tracking.
Experiments conducted on the SiamRPN tracker show our model provides a substantial boost for the
performance on occluded frames on the VOT benchmark, while also improving overall performance on
both VOT and GOT-10k. It constitutes a promising baseline for trackers that are more occlusion-robust.
However, as can be seen from the results, measuring the impact of algorithms that directly address occlusions is somewhat difficult when only a small percentage of videos contain occlusions prominently.
Therefore, while we have shown training tracking models can benefit from supervision with occlusion-related labels, the lack of data to analyze, develop, and evaluate trackers that address occlusions prevents
a deeper look into this topic.
Chapter 4
HOOT: Heavy Occlusions in Object Tracking Benchmark
4.1 Motivation
In this chapter, we address the lack of occlusion representation in current visual object tracking (VOT)
datasets. As discussed in Section 3.4, in our attempt to target occlusions exclusively, we discover that
there are not enough resources out there to analyze, evaluate, or develop deep learning-based trackers
that address occlusions explicitly. Since visual object tracking can be an important building block for large-scale applications such as surveillance, assistive robotics, smart home devices, and self-driving vehicles [4,
6, 53, 58], we argue that it is important to study occlusion-robust algorithms that can improve safety in
these real-world applications.
While major progress continues to be made in deep learning-based visual trackers [25, 80, 91, 132, 141,
143], many of the confounding factors mentioned in Section 2.1.2 can still have large effects on tracking
performance. Many of these factors have been provided as video and sometimes per-frame attributes in
pioneering benchmarks like OTB [129, 130] and VOT [68, 72], as well as in more recent ones like LaSOT [34] and GOT-10k [51]. In fact, in Chapter 3, we utilize some of these annotations to show whether
trackers trained to learn occlusion information can be more robust in the presence of occlusions. The
under-representation of occlusions in current benchmarks has also been addressed in the recent HOB
dataset [75], the first evaluation benchmark in VOT to focus on high occlusion scenarios. HOB consists
of 20 high occlusion sequences and is annotated with a selection of per-video occlusion-related attributes.
However, it is very limited in terms of dense annotations and dataset size and does not provide enough
variety for extensive evaluation and analysis or training. In this chapter, following these observations,
we aim to create a new benchmark devoted to training, evaluation, and analysis of visual trackers under
heavy occlusions.
We present HOOT, or the Heavy Occlusions in Object Tracking Benchmark, as a new dataset for training and evaluation of visual tracking algorithms under heavy occlusion scenarios. HOOT is densely annotated with detailed occlusion information and consists of 581 high-quality videos (totaling 436K frames)
with heavy occlusion representation - 67.7% of all frames have occlusions. A detailed comparison of HOOT
to the state-of-the-art VOT datasets is presented below (Section 4.2.1) and shows how it can encourage more
research addressing occlusions explicitly.
In addition to the creation of the benchmark, we also evaluate a variety of state-of-the-art visual trackers on its videos. Our experiments show that improving representative power with stronger and deeper models does not fully close the gap between tracking objects with and without occlusions. We believe
this is where HOOT can help create more occlusion-robust visual trackers.
4.2 The HOOT Benchmark
In this section, we present the Heavy Occlusions in Object Tracking (or HOOT in short) Benchmark in
detail. This main introduction to HOOT will include its comparison with other datasets in the VOT field,
the design choices made for the benchmark, a general overview of statistics, details about data collection
and annotation phases, in-depth statistics on occlusion-related attributes, and evaluation protocols. The
full dataset, along with evaluation results, is released at https://www.hootbenchmark.org.
                      OTB2015  VOT2021  UAV123  TrackingNet  GOT-10k  LaSOT  HOB   TOTB  HOOT
                      [129]    [65]     [96]    [97]         [51]     [34]   [75]  [37]

General Dataset Statistics
Num. of Videos        100      60       123     31K          10K      1.4K   20    225   581
Num. of Frames        59K      20K      113K    14M          1.5M     3.2M   55K   87K   436K
Num. of Classes       22       30       9       21           563      70     9     15    74
Frame Rate (fps)      30       30       30      30           10       30     -     30    30-60
Avg. Duration (sec)   20       11       31      16           15       84     -     12.7  22
Train/Test Set        ✗/✓      ✗/✓      ✗/✓     ✓/✓          ✓/✓      ✓/✓    ✗/✓   ✗/✓   ✓/✓

Occlusion Related Information
Video-Level Attr.     ✓        ✓        ✓       ✓            ✓        ✓      ✓     ✓     ✓
Frame Absence         ✗        ✗        ✗       ✗            ✓        ✓      ✓     ✓     ✓
Frame Full Occ.       ✗        ✓        ✗       ✗            ✓        ✗      ✗     ✓     ✓
Frame Partial Occ.    ✗        ✓        ✗       ✗            ✓        ✗      ✗     ✗     ✓
Frame Occ. Level      ✗        ✗        ✗       ✗            ✓        ✗      ✗     ✗     ✓
Occluder Types        ✗        ✗        ✗       ✗            ✗        ✗      ✓     ✗     ✓
Occlusion Masks       ✗        ✗        ✗       ✗            ✗        ✗      ✗     ✗     ✓

Table 4.1: An overview of recent and widely-used visual object tracking benchmarks compared to HOOT. The first part of the table focuses on general statistics, while the second part focuses on occlusion-specific information provided by these benchmarks. HOOT stands out as the dataset that provides the most detailed occlusion data per-frame.
4.2.1 Comparison with Other VOT Datasets
This section provides a comparison between HOOT and other widely used and recent visual object tracking
(VOT) datasets. Table 4.1 collects both general dataset statistics and occlusion-related information for these
datasets and shows how HOOT stands out amongst them.
We can mostly separate existing VOT datasets into those that only include video-level tags for occlusions and those that provide frame-level occlusion data. For those that provide video-level tags, these tags
can be used to filter out videos to evaluate a tracker against a confounding factor. ALOV300++ contains
a single video-level occlusion (OCC) attribute [116]. The more popular OTB has video-level occlusion
(OCC) and out-of-view (OV) attributes but does not provide per-frame information on occlusions [129,
130]. Other datasets like NfS [61], UAV123 [96] and TrackingNet [97] also follow a similar approach and
provide video-level annotations for varying attributes such as out-of-view (OV), full occlusion (FOC) and
partial occlusion (POC). Unfortunately, these video-level attributes do not provide essential information
like when and how much occlusions occur in the video or the severity of occlusions.
In addition to the video-level occlusion attributes mentioned above, some other datasets provide per-frame annotations related to occlusions to varying extents. VOT provides per-frame binary occlusion tags
that can indicate either partial or full occlusion, but no absence labels that indicate frames where the target
leaves the frame [65, 72]. Similarly to VOT, NUS-PRO [79] provides by-frame occlusion labels. However,
instead of one binary occlusion label, they provide separate labels for partial and full occlusion cases. These
occlusion tags are helpful to evaluate trackers on occlusions; however, occlusion representation in these
datasets is generally low (as outlined in Section 3.1).
More recently, the popular benchmark LaSOT [34] provides only absence labels per-frame for its 1.4K
videos, similar to OxUvA [121], an evaluation benchmark focused on long-term tracking. An evaluation
benchmark focused on transparent targets, TOTB [37], provides per-frame absence and full-occlusion labels. These labels alone are not suitable for training or analyzing trackers against per-frame partial or full
occlusions, which can degrade tracker performance significantly.
GOT-10k [51] and HOB [75] stand out as the benchmarks with the most detailed occlusion-related
information. Along with absence and cut-by-frame labels per-frame, GOT-10k also provides an occlusion
level for the target in each frame in the form of 9 labels ranging from fully-occluded to fully-visible. While
these were the most detailed occlusion labels yet, they did not provide the location of the occlusion or the
type of occluder. Moreover, the occlusion representation in GOT-10k is highly imbalanced. Meanwhile,
HOB applied further emphasis on occluder types by providing video tags for occlusion by a similar object or
a transparent object. Unfortunately, HOB is not annotated densely and does not provide partial occlusion
information or occluder types per-frame. With HOOT, we extend the occlusion level labels from GOT-10k
by providing occluder masks for each frame, and inspired by some occluder types introduced in HOB,
define a taxonomy for occluders, to help analyze tracker performance against each of them separately.
These details can be found in the following sections.
4.2.2 Benchmark Design
As we discussed in Section 4.1, HOOT is aimed to be the first occlusion-heavy benchmark in visual object
tracking that has dense annotations for occlusions and to provide a space for evaluation of new algorithms
against occlusions. State-of-the-art trackers still suffer huge performance drops when evaluated on high
occlusion scenarios (see Section 4.3.5), and HOOT can facilitate further development of occlusion-robust
trackers in the field. In the collection and annotation of the benchmark, we followed these design choices:
Heavily Occluded Targets: To encourage the development and extensive analysis of occlusion-invariant trackers, we designed the benchmark to be occlusion-heavy. 67.7% of all frames in HOOT have
occlusion while previous benchmarks with per-frame occlusion annotations like VOT and GOT-10k have
much lower occlusion representation (Fig. 3.2). The median percentage of occlusion in HOOT videos is
68%.
Dense Occlusion Attributes: Since HOOT highlights addressing occlusions in tracking, we designed
the benchmark to densely annotate types of occlusions that exist in each frame. Therefore, instead of
focusing on attributes like illumination variance or rotations, we curated HOOT to include 6 occlusion
attributes annotated per-frame: absent, full occlusion, cut-by-frame, partial occlusion, occluded-by-similar-object and occluded-by-multiple-occluder-types. Moreover, we designed a taxonomy for occluders which is
further detailed in Section 4.2.4.
Dense Occluder Masks: Instead of pixel-level target segmentation, HOOT provides a rotated bounding box for the target in each frame, as well as occluder masks for every bounding box. Occluder masks
were created using polygons, instead of pixel-wise labeling, due to the cost of pixel-level annotations.
These occluder masks (Fig. 4.1), coupled with the occluder taxonomy defined for the benchmark, provide
valuable information on the level of visual signal coming from the target in every frame. They can also
help train occlusion-aware visual trackers and help perform in-depth analyses for new tracking algorithms.
Figure 4.1: Sample frames from the HOOT benchmark showing different classes with a variety of occluder
masks provided with the dataset, colored according to the defined occluder taxonomy (solid: dark blue,
sparse: purple, semi-transparent: yellow, transparent: red). Images cropped to regions of interest to better
view the target rotated bounding boxes and occluder masks.
Class Distribution: As discussed in Section 2.1.4, outside of VOT benchmarks, much attention has
been paid to occlusions for targets like persons or vehicles (e.g. MOT). Therefore, we curated HOOT
such that it can be complementary to these other datasets. Thus, most of the videos in HOOT come from
everyday objects that appear in common detection or tracking datasets. The variety of classes in HOOT
makes it a benchmark that is more geared toward generic object tracking. The class distribution can be
found in Fig. 4.2.
Both Training and Evaluation: The benchmark is also designed to be large enough to provide options for both training and evaluation of trackers. The videos in HOOT can be used along with other
low-occlusion datasets to train more occlusion-invariant trackers. We expect HOOT to grow in the future
and continue being an important resource to address the difficult problem of occlusions in visual object
tracking.
Figure 4.2: Target class distribution in HOOT.
4.2.3 Data Collection & Annotation
In this section, we present further details on the data collection and annotation process for HOOT. These
include specific instructions received by recruited data collectors and annotators, the resources they had
access to, and other measures taken to create a high-quality benchmark.
Data collection: The videos in HOOT were collected by the authors and other recruited contributors
(including but not limited to graduate students), in a variety of environments (public and private) to increase variations in the dataset. The recruits were given a tutorial about the general aim of the dataset and
the taxonomy of occluders defined, as well as sample videos taken by the authors. They were instructed
to follow the instructions below when shooting videos:
• Set video quality to 1080p or higher (with 4k preferred) and shoot in landscape mode.
• Set frame rate to at least 30fps.
• Eliminate illumination variance by shooting in daylight or sufficiently lit indoor environments.
• Set the maximum distance of the object from the camera to a distance where the object can be clearly
identified as the correct class.
• Ensure the object is fully visible in the first frame. (Exceptions to this rule were shooting a target
through glass or water, e.g. fish.)
• Use different types of occluders and create heavy occlusion scenarios.
• Do not include any identifying information (like faces) without the consent of the subjects.
The recruits also had access to the object class list and frequently updated dataset statistics, so they
could tailor their videos to include more of the occluders or object classes that needed more representation.
The collected videos were later cut by the authors to ensure full object visibility in the first frame and heavy
occlusion.
The annotation of the collected videos was performed by a team of graduate students, using the Computer Vision Annotation Tool (CVAT) [114]. The annotation team was trained by the authors for consistent
annotation and was given continuous feedback during the process. The exact steps they followed to have
a video ready for validation are as follows:
• Watch the video to label per-video target and motion tags.
• Annotate the target object given in the first frame by fitting an appropriate rotated bounding box to
it in the rest of the video. The annotators were allowed to utilize the interpolation tool of CVAT but
were instructed to carefully check the interpolated frames and modify the annotations as needed.
• Label occlusion tags defined in the benchmark design for every frame.
• Create multi-vertex polygons for objects that occlude the target with as much detail as possible,
depending on the complexity of the occluder. Occluder masks were created out of these polygons.
• Do not create occluders for the limited occlusions caused by a hand moving the target.
• Do not create occluders for self-occlusion, since this falls under the orthogonal challenge of deformable target tracking.
• Estimate the full rotated bounding box position for frames where the target is partially or fully
occluded by an occluder.
Total # of Videos             581      Min. # of Frames      41        Min. Duration      0.98sec
Total # of Classes            74       Max. # of Frames      4596      Max. Duration      1min 38sec
Total # of Attributes         13       Avg. # of Frames      750       Avg. Duration      22.5sec
Total # of Occ. Attributes    6        Median # of Frames    708       Median Duration    21.6sec
Percentage of Occ. Frames     68%      Total # of Frames     435,790   Total Duration     3hr 38min
Table 4.2: General statistics for the HOOT Benchmark.
In addition to outlining clear actionable steps to follow, we take some further measures to ensure
annotation consistency and quality. These measures were as follows:
• Annotation tasks were distributed to annotators by object class, to have annotation consistency for
each object type.
• Annotators were given online feedback as they annotated, until they achieved the desired annotation
standards set by the authors.
• Two rounds of validation were performed by the authors after annotation to carefully check each
frame for correct occlusion-related attributes and high-quality polygons and fix any errors.
4.2.4 Benchmark Overview
This section gives further details on the benchmark, including general statistics and more importantly,
extensive statistics for its occlusion-related attributes.
HOOT comprises 581 high-quality videos (1080p or higher, with frame rates of 30-60fps), with 22.5
seconds of duration on average per video. The dataset has over 3 hours of footage and provides almost
436K frames. Further details can be found in Table 4.2. The distribution of the 74 object classes in the
benchmark can be found in Fig. 4.2.
The benchmark has 13 labeled attributes which include target (3), motion (4), and occlusion (6) related
attributes. Target and motion attributes are labeled only per-video, while occlusion attributes are labeled
both per-frame and per-video.
Figure 4.3: Video-level distributions for (a) target, (b) motion, and (c) occlusion attributes in HOOT.
4.2.4.1 Target & Motion Attributes
Target-related attributes define whether the target is deformable, self-propelled (not moved by a human or
device), or animate. The deformable attribute allows us to keep track of videos where bounding boxes may
be less accurate when a deformable object changes shape while being occluded. As expected for controlled
occlusion scenarios, many of the videos for everyday objects include targets moved by human subjects,
which is annotated using the self-propelled attribute. During our evaluations, we did not observe a pattern
where trackers locked onto the hand moving objects that are not self-propelled.
We also label 4 motion attributes per-video. The camera-motion attribute tags videos that might exhibit varying
amounts of camera motion. While most of the targets in HOOT are dynamic (represented by the attribute
dynamic-target), there are some static target scenarios, where occlusions are either caused by parallax
(camera moving to cause occlusions) or moving occluders (while the camera is static). These cases are
represented by the attributes parallax and moving-occluder. The distributions of target and motion-related
attributes in HOOT are given in Fig. 4.3a and Fig. 4.3b.
4.2.4.2 Occlusion Attributes
The main contribution of HOOT to the visual object tracking field is the dense occlusion-related annotations it provides. These are:
Figure 4.4: Sample images of the different occluder types defined in the benchmark taxonomy. (a) Solid occluders (targets: apple, bird, clock, remote). (b) Sparse occluders (targets: coin, potted plant, cup, book). (c) Semi-transparent occluders (targets: ball, shoe, Rubik’s cube, potted plant). (d) Transparent occluders (targets: rag, orange, plate, glass). (a-b) Hard occlusions caused by solid or sparse occluders. (c-d) Soft occlusions caused by visual distortions from semi-transparent or transparent occluders.
• 6 occlusion attributes labeled by-frame,
• A taxonomy of occluder types, and
• Occlusion masks for every target bounding box, labeled by the defined taxonomy.
As mentioned in Section 4.2.2, the 6 occlusion attributes annotated per-frame are: absent, full occlusion,
cut-by-frame, partial occlusion, occluded-by-similar-object and occluded-by-multiple-occluder-types. Video
level distributions for these attributes can be seen in Fig. 4.3c and frame-level distributions are given in
Fig. 4.5a, which shows the target is partially occluded in 59.9% of the frames. Full occlusion and absent
cases are represented, but occur less often than other scenarios since long-term tracking was not in the
scope of the project. The targets are only out of the frame for 0.8% of the frames, which is 3.6K frames.
Along with heavy partial occlusions in the frame, the benchmark also has a significant representation
for cut-by-frame (where the object moves partially out of the frame). Frames labeled multiple-occluder-types
can help analyze trackers against increasingly complex occluders. Moreover, the occluded-by-similar-object
labels can help assess performance when the target is occluded by objects that can be considered distractors
(e.g. an apple being occluded by another).
In addition to these 6 occlusion attributes labeled per-frame, HOOT is also densely annotated with 4
occluder types. These 4 types can be loosely grouped into hard and soft occlusions.
Hard occlusions generally block visual information from the target fully; we separate this type into solid
and sparse occluders. On the other hand, soft occlusions can let some visual information from the target
pass through; this type contains semi-transparent and transparent occluders. A more in-depth definition
for the 4 occluder types can be found below, with visual examples for each type given in Fig. 4.4:
- Solid - where the occluder completely blocks visual information from the target (e.g. tree trunk, wall).
- Sparse - where the occluder is formed of sparsely distributed solid occluders that allow varying levels
of visual information from the target (e.g. foliage, railings, blinds). This allows us to not label these
complex occluders with pixel-level segmentation.
- Semi-Transparent - where the target is covered fully by an occluder that allows some altered visual
information to pass through (e.g. frosted glass, colored plastic, tight meshes).
- Transparent - where the target is covered fully by an occluder that allows mostly unaltered visual
information to pass through (e.g. glass, clear plastic).
The occluder types defined above are especially important for the dense occlusion masks provided with
HOOT. As can be seen from the sample images in Fig. 4.1, every mask is labeled by the type of occluder
corresponding to it. This ensures rough pixel-level information about the occlusion level of the target
in the bounding box. For example, more visual information from the target can be obtained from areas
marked by a transparent occluder, compared to a solid occluder. With these masks, we can compute a
percentage level of occlusion for the target in each frame, using the intersection of occluder masks with
the target bounding box, and appropriately weight those areas with occluder types. Fig. 4.5b shows the
distribution of occlusion ratios for frames where the target is partially occluded. Overall, we find that
17.6%, 22%, 36% and 48%, of the partially occluded frames in HOOT contain transparent, semi-transparent,
sparse, and solid occluders respectively.
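One possible way to compute such a weighted occlusion ratio from the released annotations is sketched below, assuming the rotated boxes and occluder polygons have already been rasterized to boolean masks; the per-type weights and the rule for combining overlapping occluders are illustrative assumptions, not part of the benchmark definition.

    import numpy as np

    # Illustrative weights for how much visual signal each occluder type blocks; these are
    # assumptions for this sketch, not values defined by the benchmark.
    OCCLUDER_WEIGHTS = {"solid": 1.0, "sparse": 0.7, "semi_transparent": 0.5, "transparent": 0.2}

    def occlusion_ratio(target_mask, occluder_masks):
        # target_mask: boolean HxW mask rasterized from the rotated target box.
        # occluder_masks: dict mapping occluder type -> boolean HxW mask from its polygons.
        target_area = target_mask.sum()
        if target_area == 0:
            return 0.0
        occluded = np.zeros(target_mask.shape, dtype=float)
        for occ_type, mask in occluder_masks.items():
            overlap = np.logical_and(target_mask, mask)
            occluded = np.maximum(occluded, overlap * OCCLUDER_WEIGHTS[occ_type])
        return float(occluded.sum() / target_area)

    # Toy example: half of a 10x10 target covered by a solid occluder -> ratio 0.5.
    tgt = np.zeros((20, 20), dtype=bool); tgt[5:15, 5:15] = True
    occ = {"solid": np.zeros((20, 20), dtype=bool)}; occ["solid"][5:15, 5:10] = True
    occlusion_ratio(tgt, occ)   # 0.5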
Figure 4.5: (a) Per-frame occlusion-related attribute distributions in HOOT. (b) Target occlusion levels per occluder type across all partially occluded frames in HOOT. Full solid occlusion means the target is fully occluded, which is why solid occlusion does not have partially occluded frames with an occlusion ratio of 1.0 while other types might.
4.2.5 Evaluation Protocols
Inspired by LaSOT [34], we propose two protocols to evaluate trackers on the HOOT benchmark.
Protocol I This protocol includes all 581 videos in the benchmark and aims to provide a playground
to evaluate and analyze trackers against different kinds of occlusions. This protocol assumes that the
evaluated trackers have not utilized any of the HOOT videos during development.
Protocol II For protocol II, we provide a smaller test split for the evaluation of tracking algorithms on
heavy occlusion scenarios. The test split contains 130 videos. Two videos were selected randomly from
each object class that has at least 3 videos in the benchmark, creating a class-balanced test split.
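The class-balanced sampling described above can be reproduced with a few lines of code, as sketched below; the function and variable names are hypothetical and the random seed is arbitrary.

    import random
    from collections import defaultdict

    def make_test_split(video_classes, per_class=2, min_videos=3, seed=0):
        # video_classes: dict mapping video id -> object class (names are hypothetical).
        # Pick per_class random videos from every class with at least min_videos videos.
        by_class = defaultdict(list)
        for vid, cls in video_classes.items():
            by_class[cls].append(vid)
        rng = random.Random(seed)
        test = []
        for cls in sorted(by_class):
            if len(by_class[cls]) >= min_videos:
                test.extend(rng.sample(by_class[cls], per_class))
        train = [vid for vid in video_classes if vid not in set(test)]
        return train, test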
The total number of frames in the 130 test split videos is around 95K. Fig. 4.6 shows the video and
frame level occlusion distributions for the test set, as well as the percentage of target occlusion in the
partially occluded frames for the defined occluder types. We find that the distributions look very similar
to the overall dataset. For this protocol, the rest of the videos in HOOT are available to use for algorithm
development and training.
Figure 4.6: Distributions of occlusion-related attributes in the HOOT test set. (a) Video-level attribute distributions. (b) Frame-level attribute distributions. (c) Distributions of target occlusion level per occluder type across partially occluded frames.
4.3 Experiments
In this section, we benchmark various state-of-the-art tracking algorithms on HOOT protocols, give analyses for different occlusion attributes, show qualitative results, and compare tracker performance on HOOT
to one of the most popular current benchmarks, LaSOT [34].
4.3.1 Performance Metrics
HOOT uses One-Pass Evaluation (or OPE) for evaluations, like many datasets in the field [34, 97, 129].
The metrics used to compute performance are success, precision, and normalized precision. Success is
computed using the Intersection over Union (IoU) between the predicted and ground truth boxes, where
a success represents IoU (or overlap) being higher than some threshold. For success, tracking algorithms
are ranked using Area Under the Curve (AUC) between 0 and 1. We also adopt precision and normalized
precision, the latter of which was defined in [97]. Precision is calculated by looking at the percentage of
frames where the distance between the predicted and ground truth boxes are under a certain threshold
[130]. On the other hand, normalized precision takes into account resolution and object scale changes, by
normalizing this distance with the size of the ground truth bounding box [97]. Further details on these
metrics are given in Section 2.1.3.
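For reference, the snippet below sketches how these three metrics can be computed from per-frame predictions. The IoU-threshold grid, the 20-pixel precision threshold, and the 0.2 normalized-distance threshold are common choices but should be treated as assumptions here, not the exact settings of the HOOT evaluation toolkit.

    import numpy as np

    def success_auc(ious, num_thresholds=21):
        # Success AUC: mean success rate over IoU thresholds evenly spaced in [0, 1].
        thresholds = np.linspace(0.0, 1.0, num_thresholds)
        return float(np.mean([(ious > t).mean() for t in thresholds]))

    def precision(pred_centers, gt_centers, dist_threshold=20.0):
        # Fraction of frames whose center error is below dist_threshold pixels.
        dists = np.linalg.norm(pred_centers - gt_centers, axis=1)
        return float((dists < dist_threshold).mean())

    def norm_precision(pred_centers, gt_centers, gt_wh, dist_threshold=0.2):
        # Center error normalized by the ground-truth box size before thresholding [97].
        dists = np.linalg.norm((pred_centers - gt_centers) / gt_wh, axis=1)
        return float((dists < dist_threshold).mean())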
                                                 Protocol I (All videos)              Protocol II (Test Split)
Tracker              Backbone     Venue          Precision  Norm. Prec.  Success      Precision  Norm. Prec.  Success
SiamRPN [81]         AlexNet      CVPR’18        0.102      0.366        0.322        0.102      0.362        0.312
SiamMask [125]       ResNet-50    CVPR’19        0.126      0.413        0.354        0.137      0.443        0.371
ATOM [24]            ResNet-18    CVPR’19        0.121      0.415        0.356        0.121      0.420        0.352
SiamRPN++ [80]       ResNet-50    CVPR’19        0.140      0.447        0.392        0.142      0.448        0.389
SiamRPN++ (LT) [80]  ResNet-50    CVPR’19        0.135      0.417        0.382        0.148      0.440        0.394
SiamDW [142]         CIResNet-22  CVPR’19        0.092      0.348        0.305        0.106      0.361        0.316
DiMP [9]             ResNet-50    ICCV’19        0.143      0.470        0.407        0.137      0.462        0.399
PrDiMP [25]          ResNet-50    CVPR’20        0.142      0.467        0.404        0.142      0.486        0.420
Ocean [143]          ResNet-50    ECCV’20        0.142      0.475        0.399        0.134      0.467        0.389
SuperDiMP [9, 25]    ResNet-50    -              0.152      0.499        0.435        0.141      0.495        0.427
TransT [16]          ResNet-50    CVPR’21        0.230      0.597        0.499        0.235      0.589        0.492
KeepTrack [91]       ResNet-50    ICCV’21        0.177      0.578        0.492        0.169      0.570        0.484
AutoMatch [141]      ResNet-50    ICCV’21        0.158      0.480        0.399        0.160      0.478        0.394
Stark-ST50 [132]     ResNet-50    ICCV’21        0.202      0.557        0.488        0.209      0.563        0.491
Stark-ST101 [132]    ResNet-101   ICCV’21        0.212      0.564        0.489        0.216      0.571        0.495
Table 4.3: Overall performance results for 15 state-of-the-art trackers on HOOT protocols defined in Section 4.2.5. Metrics are computed as described in Section 4.3.1. Green, red, and orange numbers represent the top 3 performers.
All performance results were computed by converting the rotated bounding boxes in HOOT to axis-aligned boxes, which is the output format of most trackers. When computing performance for the absence attribute, absent frames were not considered since metrics cannot be computed for them; however, their effects are still visible in the metrics. On the other hand, full occlusion cases were included, since the HOOT
annotations include boxes estimated by annotators even though the object is fully occluded.
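This conversion amounts to taking the tightest axis-aligned box enclosing the rotated one; a minimal sketch, assuming the rotated box is given as its four corner points, is shown below.

    import numpy as np

    def rotated_to_axis_aligned(corners):
        # corners: (4, 2) array of the rotated box's (x, y) corner points.
        # Returns the tightest enclosing axis-aligned box as (x_min, y_min, w, h).
        x_min, y_min = corners.min(axis=0)
        x_max, y_max = corners.max(axis=0)
        return np.array([x_min, y_min, x_max - x_min, y_max - y_min])

    rotated_to_axis_aligned(np.array([[10, 20], [40, 10], [50, 40], [20, 50]]))   # [10, 10, 40, 40]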
4.3.2 Overall Performance
We evaluate 15 recent trackers on both protocols of HOOT and present the results on the metrics defined above in Table 4.3. We chose trackers that have publicly available code and released model weights
for our evaluations. The evaluated trackers represent a variety of visual tracker types. We evaluate 5
fully-convolutional Siamese trackers: SiamRPN [81], SiamMask [125], SiamRPN++ and its long-term configuration SiamRPN++ (LT) [80], as well as SiamDW [142]. We also evaluate recent works that train online.
These include ATOM [24], which trains an online classifier, and DiMP [9], which trains an online model
optimizer, as well as DiMP variants PrDiMP [25] and SuperDiMP. KeepTrack [91], a recent tracker based
on SuperDiMP, focuses on keeping track of distractors by utilizing a target candidate association network.
Figure 4.7: Success curves for the state-of-the-art trackers evaluated on HOOT, for (a) Protocol I and (b) Protocol II. The trackers are ranked according to AUC.
Ocean [143] is an anchor-free tracker, which is also the baseline to the recent AutoMatch [141]. Finally,
TransT [16] and Stark [132] use transformers for visual tracking.
Overall, trackers performed poorly on HOOT, with the best performers TransT, KeepTrack, and Stark suffering 15-17% drops compared to the performance they achieved on LaSOT (further details in Section 4.3.5).
Similar to the initial findings in [75], this shows that state-of-the-art trackers are still vulnerable to heavy
occlusion scenarios. Moreover, it justifies the addition of HOOT to the field as a resource for both extensive
evaluation and training (with its dense occlusion labels). Success curves for both protocols can be seen in
Fig. 4.7.
4.3.3 Evaluation on Occlusion Attributes
We also evaluate trackers on different per-video occluder attributes, using Protocol I. Success plots for
videos that contain solid, sparse, semi-transparent, and transparent occluders can be viewed in Fig. 4.8.
We notice a larger drop in performance for semi-transparent occluders, which means these will affect
trackers the most in the wild even though some visual information on the object is present.
Figure 4.8: Success (a-d) and normalized precision (e-h) curves for the different occluder types annotated in HOOT (solid, sparse, semi-transparent, transparent), computed for Protocol I.
We observed similar trends to success for both precision and normalized precision across all evaluations. However, minor changes in the rankings did occur. We illustrate this with the normalized precision
curves for varying location error thresholds presented for the different occluder types defined in the paper,
also given in Fig. 4.8. For example, KeepTrack performs better than Stark variants for sparse occluders in
the normalized precision metric compared to success.
Fig. 4.9 shows the success plots for the HOOT videos that include cases of objects being fully occluded
and occluded by a similar object. These are the top 2 occlusion attributes on which the trackers performed the worst. We find that top transformer trackers suffer larger drops compared to KeepTrack for full occlusion, while SiamRPN++ (LT) rose in ranking since it focuses on long-term tracking. As can
be seen in Fig. 4.9b, similar occluders affect the trackers the most. For this type of occlusion, the AUC
scores for all trackers suffer major drops, including KeepTrack, which specifically addresses keeping track
of distractors.
Figure 4.9: Success curves for the occlusion attributes that affected tracking performance the most in HOOT, (a) full occlusion and (b) occlusion by a similar object, computed for Protocol I.
Looking at the success curves for the occlusion attributes cut-by-frame, absent, and multiple occluders
(Fig. 4.10), we find that tracker rankings remain mostly similar to the overall success curves. We notice
that trackers suffer larger performance drops for videos that include absence, however, SiamRPN++ (LT)
does perform better than the original SiamRPN++ for this attribute as expected. In addition, we find that
cut-by-frame and multiple occluder type scenarios were not as difficult for trackers compared to similar
occluders, for which a much bigger drop in performance was observed.
4.3.4 Evaluation on Other Attributes
In addition to the above analyses, we also computed success curves for the rest of the attributes annotated
in HOOT, which are the motion and target attributes. Fig. 4.11 shows different success curves for motion
attributes, which shows that even with heavy occlusion, trackers perform better in videos that contain
static objects and worse for dynamic targets. This is fully expected since static targets also undergo fewer
changes in terms of rotations or deformations. A surprising observation is the major drop in performance
for videos with no camera motion, where the highest-scoring tracker’s AUC for success dropped from
Figure 4.10: Success curves for the remaining occlusion attributes annotated in HOOT, (a) cut-by-frame, (b) absent, and (c) multiple occluder types, computed for Protocol I.
Figure 4.11: Success curves for videos that contain different motion tags for HOOT Protocol I: (a) dynamic target, (b) static target with occlusion due to parallax, (c) static target with a moving occluder, and (d) no camera motion. We annotate videos that contain dynamic targets and camera motion; moreover, for static targets, we annotate tags that signify occlusion due to parallax and occlusion due to a moving occluder.
0.574 to 0.471. While trackers should intuitively perform better without any camera motion, this drop
might have been caused by the general ease of creating complex heavy occlusion conditions for static
camera shoots. Since controlling for occlusions is easier with a static camera, these videos are inherently
more difficult in terms of occlusions.
Lastly, we present analyses for target attributes annotated for HOOT in Fig. 4.12. We find that performance results for videos that either contain or do not contain these attributes were quite similar, which
shows the effect of occlusions does not change too much with these attributes. Notable changes were
trackers performing worse for non-animate target videos. This is likely because non-animate objects can
easily be put into heavier occlusion scenarios, which also explains the larger drop for videos that contain
non self-propelled objects. These show that by controlling for occlusions in this manner, we were indeed
(a) Animate (b) Deformable (c) Self-Propelled
(d) Non-Animate (e) Non-Deformable (f) Non Self-Propelled
Figure 4.12: Success curves for videos in HOOT annotated by different target attributes, computed for
Protocol I. (a)-(c) Videos where each target attribute is set to true. (d)-(f) Videos where each target attribute is set to false.
able to create difficult scenarios that affect tracker performance and that HOOT can be used to evaluate
trackers on these difficult scenarios.
4.3.5 Performance Comparisons with LaSOT
Finally, we compare the results of the state-of-the-art algorithms we evaluated on HOOT to their performance on the LaSOT [34] test set. LaSOT was chosen due to its increased popularity for training and
evaluating trackers since its release. Moreover, it also evaluates performance using success, precision, and
normalized precision. While making comparisons across datasets may not be fully definitive, these results
can demonstrate how challenging HOOT is compared to the current popular benchmarks that do not have
heavy occlusion distributions, answering whether HOOT can fill in the gap in the visual object tracking
benchmark literature defined in Section 4.1.
                    Norm. Precision              Success
Tracker             LaSOT   HOOT    ∆        LaSOT   HOOT    ∆
ATOM [24]           0.576   0.420   -0.156   0.515   0.352   -0.163
SiamRPN++ [80]      0.569   0.448   -0.121   0.496   0.389   -0.107
DiMP [9]            0.650   0.462   -0.188   0.569   0.399   -0.170
PrDiMP [25]         0.688   0.486   -0.202   0.598   0.420   -0.178
Ocean [143]         0.651   0.467   -0.184   0.560   0.389   -0.171
TransT [16]         0.738   0.589   -0.149   0.649   0.492   -0.157
KeepTrack [91]      0.772   0.570   -0.202   0.671   0.484   -0.187
AutoMatch [141]     -       0.478   -        0.583   0.394   -0.189
Stark-ST101 [132]   0.770   0.571   -0.199   0.671   0.495   -0.176
Table 4.4: Comparison of the overall performance results between the HOOT and LaSOT test sets. The table only includes trackers that have also been evaluated on LaSOT, and presents normalized precision and success numbers for both datasets. It shows a steep decline in performance for HOOT, which has a much higher occlusion representation. Green numbers mark the trackers that suffered the smallest drops between LaSOT and HOOT, while red numbers mark the trackers that suffered the largest.
Table 4.4 shows the overall performance results for success and normalized precision metrics in LaSOT
and HOOT test sets. The table only includes trackers that have previously been evaluated on LaSOT and have
published results. We use their performance metrics as given in their papers or code repositories to compile the
LaSOT results. Looking at the difference in overall performance, we observe that even strong state-of-the-art trackers can suffer major drops when evaluated on HOOT. For normalized precision, this drop was
found to be between 12% and 20%, while for success, it ranged from 10% to 19%.
The tracker that suffered the least was SiamRPN++ [80], which might be due to its fully offline training framework, with no template updates. On the other hand, trackers that perform online model updates
(like KeepTrack [91] and the DiMP [9] variants) suffered larger drops relative to their LaSOT performance.
Transformer-based trackers like TransT [16] and Stark-ST [132] also undergo drops in performance.
Among these, we find TransT is better able to handle heavy occlusions, perhaps due to its Feature Fusion
Network. Overall, the results show how HOOT emerges as a difficult dataset for dealing with occlusions
compared to the current benchmarks in the field of visual object tracking.
Figure 4.13: Sample frames from videos that on average scored 0.418 on the success metric.
4.3.6 Qualitative Results
In this section, we present qualitative results for a variety of trackers we evaluated on HOOT. The trackers
that we visualize in Fig. 4.13 and Fig. 4.14 range from fully-convolutional Siamese trackers like
SiamRPN++, to trackers that make online model updates (such as the DiMP variants), to transformer-based
trackers (TransT), as well as spatio-temporal trackers such as Stark-ST101.
We present these qualitative results in two parts. Fig. 4.13 contains 4 frames from 3 randomly selected
videos with an average success rate of 0.418. These are videos that trackers found to be of medium difficulty.
In the first row, the effect of occlusion by a similar object can be seen clearly, as most trackers lock onto
the distractor after the target is occluded by it. The other rows also show some trackers getting distracted
and starting to track occluders instead, while others are able to catch some part of the target. While this is
simply the similarity computation at work, it signals that trackers may need more specialized mechanisms
to understand how an object’s global representation can change under occlusion.
On the other hand, Fig. 4.14 shows the results of 3 randomly selected videos that on average scored
0.128 for success. Therefore, these videos and their selected frames show examples of trackers performing
very poorly on HOOT videos. In the first and third rows, trackers can be seen to lose the object as
it moves through a variety of occlusions and start tracking either the occluders or parts of the scene. The
Figure 4.14: Sample frames from videos that on average scored 0.128 on the success metric.
width/height ratio of the target object and how much background the template images would contain in
these two examples might explain why the trackers performed so poorly on these videos. Meanwhile, the
second row features a difficult transparent target, whose appearance can include noise coming
from the background. These types of targets are directly addressed by the TOTB dataset [37].
4.4 Discussion
In this chapter, we introduced HOOT, the Heavy Occlusions in Object Tracking Benchmark, and evaluated state-of-the-art trackers on the heavy occlusion scenarios presented in the dataset. The videos in
HOOT are curated such that 67.7% of all frames have occlusions to varying degrees, with detailed occlusion annotations for each frame. These dense annotations include occluder attributes and masks for every
box in the dataset, as well as an occluder taxonomy to analyze trackers using their performance against
different occluders. With two evaluation protocols, HOOT allows for the training and testing of trackers
against heavy occlusions, and it can facilitate the development of increasingly occlusion-robust tracking
algorithms in the future.
Evaluating state-of-the-art trackers on HOOT videos shows the effects of training with datasets that
are only lightly occluded. The drops from the trackers’ LaSOT performance to their HOOT
performance show that occlusions create a domain shift, and that domain-specific data should be utilized to address
them. HOOT not only provides a training set to close that domain gap, but also provides extra occlusion
annotations for future research to be as creative with occlusions as possible.
While we expect HOOT to grow over time to contain an increasing variety of heavy occlusion
scenarios, it is currently not as large as the visual object tracking datasets that are widely used
for training (Table 4.1). However, since most recent trackers train with four or more datasets covering
thousands of object classes and millions of frames, we believe HOOT can be a great addition to close the
gap between trackers that have not encountered many occluded targets and those that have.
Moreover, HOOT can also be used in conjunction with other datasets that have paid special attention
to heavy occlusions very recently. One of the most related works to HOOT in this aspect is the OVIS (or
Occluded Video Instance Segmentation) Benchmark [103]. OVIS is the first benchmark in video instance
segmentation to focus on heavy occlusions. It provides segmentation masks for objects of 25 different
classes comprising animals, vehicles, and people. These classes are highly complementary to HOOT -
which consists mainly of everyday objects - and can easily serve as a companion to HOOT during training.
When HOOT is used in conjunction with datasets like OVIS, as well as other datasets in domains
like person detection [138] and multi-object-tracking [29, 127] that present occlusion level annotations
or masks, algorithms can finally be evaluated and trained at scale to handle this difficult challenge. A
more data-driven way to analyze robustness to occlusions (instead of showing qualitative results on a
few samples in current benchmarks) can help improve our understanding of how trackers actually fail and
what part occlusions play. The next step is to explore one of the many different ways this occlusion-related
information can be utilized.
Chapter 5
Template Update for Tracking Under Heavy Occlusion
5.1 Motivation
After exploring how existing occlusion data and annotations can be used to train more occlusion-robust
trackers in Chapter 3, and identifying and mitigating the lack of occlusion representation and annotations
in Chapter 4, we will next focus on how our collected benchmark HOOT can be used to evaluate, analyze
and develop deep visual trackers that are more robust to occlusions.
When we consider occlusions and how they affect visual object tracking (VOT) performance, two main
issues stand out: 1) Occlusions can result in missed detections. 2) Occlusions can corrupt the appearance
feature of the template if it is updated as the video is processed. Different trackers may handle possible
missed detections differently. Examples from recent trackers include KeepTrack [91], which attempts to
keep track of the distractors in the scene to prevent identity switches caused by missed detections, and
SLT [62], which uses reinforcement learning and sequence-level training to teach a tracker to predict the
bounding box stochastically instead of the highest classification score. Other more conventional methods
may include using the motion model to estimate the next state for the box instead of the appearance model
or adaptively changing the search area until the object can be detected again, but these are not commonly
utilized to support state-of-the-art deep trackers.
The difficulty of template update during VOT has been identified by many early works [77, 89], and
numerous methods that perform template updates have been proposed over the years. More recently, DiMP [9]
and its variants use meta-learning to learn an online template optimizer, and many other algorithms have
been proposed using memory networks [39], a meta-updater [21] or spatio-temporal dynamic templates
[132]. These methods do not directly address what happens to the template update under heavy occlusions,
even though it is generally assumed that template updates with occlusions can corrupt the target template
and cause failure or drift.
To show how HOOT can help evaluate, analyze, and develop more occlusion-robust trackers, we choose
the STARK tracker [132] as our baseline in this chapter and examine it under heavy occlusions. STARK
is one of the top performers on HOOT (Section 4.3) and is a transformer-based tracker. It updates its
dynamic template when the classification score for whether the target is in the search region or not is
high. This only accounts for absence or full occlusions, and does not examine how updating a template with
partial occlusions can affect the performance. Throughout this chapter, we present details of the baseline
STARK tracker, show how template updates affect its performance on HOOT using the extensive occlusion
annotations in the dataset, and present baselines as to how HOOT can be used during the development or
training of more occlusion-robust deep trackers.
5.2 Baseline STARK Tracker
Published in 2021 by Yan et al., STARK stands for spatio-temporal transformer network for visual tracking.
Much like other transformer-based trackers published around the same time [16, 90, 137], it
capitalizes on the strong representation learning capabilities of these attention models to encode the search
region with the template images. Furthermore, it adapts its decoder structure from the transformer-based
object detector DETR [11] to perform target localization. STARK’s spatio-temporal nature comes from
a dynamic template that gets updated using a template update branch. Therefore, the tracking task has
Figure 5.1: Schematic of the STARK tracker, as adapted from the original paper [132]. The sample images
are from a video in HOOT.
access to not only the initial (first frame) template of the target T0, but also the dynamic template TD
which can allow the transformer to extract the temporal relationship between the current search region S
and the initial and dynamic templates. A schematic of the tracker adapted from the paper can be viewed
in Fig. 5.1. We explain below the details for the two main branches in STARK, the localization branch and
the template update (or classification) branch.
5.2.1 Localization Branch:
The first step for the STARK tracker is to process the input images using a backbone and extract features
for each image. They adopt the ResNet [47] convolutional network as the backbone for STARK, and experiment with both ResNet-50 and ResNet-101 models in their work. The outputs for the backbone are
feature maps of size C × (H/s) × (W/s), where s is the total stride of the backbone, i.e., the factor by which the convolutional layers reduce the spatial size of the input. While ResNet is a popular choice for feature extraction in many vision applications, the backbone can be any feature extractor. Throughout this chapter, we also utilize the ResNet-50 backbone.
After the feature maps for the search and template images are obtained from the backbone, they need
pre-processing to be fed into the transformer encoder. The pre-processing steps are as follows: 1) A bottleneck layer reduces the channel dimensions from C to d. 2) The feature maps for each image are flattened
to size d × (H/s · W/s). 3) The feature maps from the search and template images are then concatenated to form
a feature of length d × (HS/s · WS/s + HT0/s · WT0/s + HTD/s · WTD/s). Finally, sinusoidal positional embeddings for
each image patch are also added to the features. These features are then processed by the transformer
encoder, which is formed of N layers of multi-head self-attention modules with feed-forward networks.
The encoder processes this feature sequence coming from the target at different frames and learns spatio-temporally discriminative features for the localization of the target.
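To make the tensor bookkeeping above concrete, below is a minimal PyTorch-style sketch of the pre-processing step (bottleneck, flattening, and concatenation of the three feature maps). The tensor sizes, variable names, and the choice of C = 1024 and d = 256 are illustrative assumptions, not STARK's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not STARK's exact configuration).
C, d = 1024, 256          # backbone channels -> bottleneck channels
B = 1                     # batch size

bottleneck = nn.Conv2d(C, d, kernel_size=1)  # 1x1 conv reduces C -> d

def to_sequence(feat):
    """Flatten a (B, C, H/s, W/s) backbone map into a (H/s * W/s, B, d) token sequence."""
    x = bottleneck(feat)                      # (B, d, H/s, W/s)
    return x.flatten(2).permute(2, 0, 1)      # (H/s * W/s, B, d)

# Hypothetical backbone feature maps for the search region and the two templates.
f_search = torch.randn(B, C, 20, 20)   # e.g. 320x320 search crop at stride 16
f_t0     = torch.randn(B, C, 8, 8)     # e.g. 128x128 initial template crop
f_td     = torch.randn(B, C, 8, 8)     # dynamic template crop

seq = torch.cat([to_sequence(f_search), to_sequence(f_t0), to_sequence(f_td)], dim=0)
# seq has 20*20 + 8*8 + 8*8 = 528 tokens of dimension d; sinusoidal positional
# embeddings of the same shape would be added before the transformer encoder.
print(seq.shape)  # torch.Size([528, 1, 256])
```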
The encoder output is next processed by the transformer decoder module. While the object detector
DETR has up to 100 object queries, for a visual tracking task, we know that the target can only be in a single
location. Therefore, the decoder takes in a single target query, as well as the features from the encoder, and
processes them in M decoder layers. These decoder layers are formed of self-attention, encoder-decoder
attention, and feed-forward networks. By having access to the full encoder features, the target query can
pay attention to both search and template features to learn good features for localization.
The last module in the localization branch is the box prediction head. Instead of predicting box coordinates directly, STARK designs a new prediction head that predicts two probability maps for the top-left and
bottom-right corners of the target bounding box. This is done through similarity computation between
search features coming from the encoder and the decoder output. Using these similarities, discriminative
parts of the search region can be highlighted, reshaped into a d × (HS/s × WS/s) feature map, and processed by
a fully convolutional network to arrive at the two probability maps for the bounding box corners. The box
coordinates can then be computed as the expected coordinates from the probability maps. Fig. 5.2a shows
the schematic of the box prediction module as taken from [132]. The novelty of this approach is that it
(a) Box Prediction Module
(b) Classification Module
Figure 5.2: Schematics for the box prediction and classification heads for the STARK tracker, with the box prediction schematic directly taken from [132].
does not need any further post-processing to obtain the box, such as cosine windows or proposal ranking,
which was necessary in trackers such as SiamRPN [81].
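As a rough illustration of the corner-expectation idea, the sketch below computes the expected top-left and bottom-right coordinates from two hypothetical corner probability maps; the 20 × 20 map size and the softmax normalization are assumptions for the example and do not reproduce STARK's exact prediction head.

```python
import torch

def expected_corner(prob_map):
    """Expected (x, y) location under a (H, W) probability map that sums to 1."""
    H, W = prob_map.shape
    ys = torch.arange(H, dtype=prob_map.dtype).view(H, 1)
    xs = torch.arange(W, dtype=prob_map.dtype).view(1, W)
    x = (prob_map * xs).sum()   # expectation over the x (column) coordinate
    y = (prob_map * ys).sum()   # expectation over the y (row) coordinate
    return x, y

# Two hypothetical corner logit maps, normalized with a softmax over all cells.
logits_tl = torch.randn(20, 20)
logits_br = torch.randn(20, 20)
p_tl = logits_tl.flatten().softmax(dim=0).view(20, 20)
p_br = logits_br.flatten().softmax(dim=0).view(20, 20)

x1, y1 = expected_corner(p_tl)       # expected top-left corner (in feature-map cells)
x2, y2 = expected_corner(p_br)       # expected bottom-right corner
box = torch.stack([x1, y1, x2, y2])  # would be rescaled to image coordinates afterwards
```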
STARK is trained using 4 large-scale datasets: the video datasets LaSOT [34], TrackingNet [97], and GOT-10k [51], and the detection dataset COCO [83]. The localization branch detailed above is trained first, in
an end-to-end fashion. This choice is motivated by recent works [17, 117] that found that decoupling the training
for localization and classification can be more beneficial for finding optimal solutions for each task. The
localization branch is trained with a combination of the L1 loss and the generalized IoU loss [110], similar to
DETR:
L_{loc} = \lambda_{giou} L_{giou}(y_i^{bb}, \hat{y}_i^{bb}) + \lambda_{L1} L_{L1}(y_i^{bb}, \hat{y}_i^{bb})    (5.1)
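A minimal sketch of such a combined loss is given below; the GIoU implementation and the weights lambda_giou and lambda_l1 shown here are illustrative assumptions rather than the exact values or code used by STARK.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target, eps=1e-7):
    """Generalized IoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection area
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union area
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()

def localization_loss(pred_boxes, gt_boxes, lambda_giou=2.0, lambda_l1=5.0):
    """Weighted sum of GIoU and L1 box losses; the weights here are assumed values."""
    return lambda_giou * giou_loss(pred_boxes, gt_boxes) + lambda_l1 * F.l1_loss(pred_boxes, gt_boxes)
```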
5.2.2 Template Update Branch:
After the localization branch is trained, the parameters in that branch are fixed and the template update
branch is trained next. The template update branch in STARK consists of making a decision on whether the
dynamic template should be updated and cropping the search region according to the predicted bounding
box if the update is successful. The classification module (as given in Fig. 5.2b) is a simple Multi-Layer
Perceptron (MLP), with 3 layers and d hidden dimensions. It performs binary classification for whether
the target is in the search region or not, and uses binary cross-entropy loss to optimize the module:
L_{BCE} = -\left( y_i^{cls} \log(\hat{y}_i^{cls}) + (1 - y_i^{cls}) \log(1 - \hat{y}_i^{cls}) \right)    (5.2)

where y_i^{cls} and \hat{y}_i^{cls} are the ground truth and the predicted values for frame i.
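For concreteness, a minimal PyTorch sketch of a 3-layer MLP head optimized with binary cross-entropy is shown below; the hidden width, the activation choice, and the use of BCEWithLogitsLoss are assumptions for the example, not STARK's exact module.

```python
import torch
import torch.nn as nn

d = 256  # hidden width of the transformer features (illustrative assumption)

# A 3-layer MLP classification head of the kind described above (a sketch, not STARK's code).
cls_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 1),          # single logit: "is the target present in the search region?"
)

bce = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + binary cross-entropy

decoder_output = torch.randn(8, d)            # hypothetical decoder features for a mini-batch
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = target visible, 0 = absent / fully occluded

logits = cls_head(decoder_output)
loss = bce(logits, labels)
loss.backward()
```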
In order to train the classification module, samples that contain the target in the search region are
drawn with the same probability as samples that do not. This allows for better optimization of the template
update branch, since fully occluded or absent frames are extremely underrepresented in the training
datasets if sampled at random. This is where decoupling localization and classification can also help since
absent/fully occluded frames cannot be used for the localization task.
At test time, the dynamic template is initialized from the first frame. Updates are scheduled at set
intervals Tu and if the score from the classification module is above a threshold τ for frames with scheduled
updates, the dynamic template is updated. STARK uses different Tu values for different test sets, while
setting τ to 0.5 across all evaluations. Further details on the STARK tracker can be found in [132].
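The test-time update rule can be summarized by the small sketch below; it is a simplification under the assumptions stated in the comments, not STARK's actual code, and `crop_template` in the usage comment is a hypothetical helper.

```python
def maybe_update_template(frame_idx, cls_score, update_interval=200, tau=0.5):
    """Decide whether the dynamic template should be updated at this frame.

    A simplified sketch of the schedule described above: an update is only
    considered every `update_interval` frames and only accepted when the
    classification score exceeds the threshold `tau` (0.5 in the paper).
    """
    scheduled = (frame_idx % update_interval == 0) and frame_idx > 0
    return scheduled and cls_score > tau

# Usage inside a tracking loop (hypothetical helper `crop_template` crops the
# search frame around the predicted box to form the new dynamic template):
# if maybe_update_template(i, score):
#     dynamic_template = crop_template(frame, predicted_box)
```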
5.3 Template Update Analysis with STARK
As mentioned in Section 5.1, occlusions are considered a challenge factor for the tracker to 1) match the
target’s current appearance to the template, and 2) make template updates that contain proper appearance information from the target. As the representational power of deep learning-based appearance features has evolved, the matching ability of deep trackers under partial occlusions has also become more robust.
Therefore, deep trackers are frequently able to detect parts of the target even if it is partially occluded.
The matching success under occlusions may depend on discriminative features of the target (like its color)
Figure 5.3: A collage of sample images from successful dynamic template updates made by the STARK
tracker on HOOT Protocol I videos. Each dynamic template is cropped by STARK to 128 × 128 using the
predicted bounding box, and targets can be seen undergoing a variety of occlusions.
or the occluder’s properties. In this section, we explore how visual object tracking (VOT) performance is
affected when template updates are performed using occluded templates of the target.
5.3.1 Qualitative Analysis
To analyze the template update procedure of the state-of-the-art STARK tracker [132], we first evaluate it
on Protocol I of the HOOT Benchmark, which contains all videos in HOOT. During evaluation, we modify
the tracker to record not only the predicted bounding boxes, but also the template images when the tracker
decides to update the template through its classification module. We set the update frequency to every 200
frames, which is used by the authors to evaluate the tracker on LaSOT, GOT-10k, and VOT2020-LT, and
consider an update successful only if the scheduled update happens (classification score > 0.5).
Figure 5.4: A collage of 10 sample images from successful dynamic template updates made by the STARK
tracker on HOOT Protocol I videos. Each dynamic template is cropped by STARK to 128 × 128 using the
predicted bounding box, and targets are not in the cropped region of interest due to tracker failure.
Through qualitative examination of the recorded successful template updates, we identify that updates
frequently happen with images in which the target is occluded with varying severity. Fig. 5.3 shows a variety of samples from the HOOT benchmark, where the STARK tracker has partially matched a target
undergoing occlusions and has decided to update the dynamic template with the current appearance of
the target. This is expected since the binary classification module in STARK optimizes the template update
module such that it only skips updates when the target is fully occluded or out-of-view.
In addition to the updates made with occluded targets, we also observe some successful updates where
the classification module fails and updates are made with images that do not contain the target at all.
Fig. 5.4 shows samples of these successful updates where the dynamic template gets replaced with a crop
of the background. We observe these cases less frequently than updates made with partially
occluded targets, which shows that STARK’s classification module mostly eliminates them, as trained.
5.3.2 Quantitative Analysis
While qualitative visualizations confirm successful updates being made with occluded templates, thanks
to the extensive occlusion annotations in HOOT, we can also quantitatively analyze template updates. For
our quantitative analysis, we consider the following statistics:
• The tracking performance (Success)
• Percentage of scheduled updates across all frames (435,790) in HOOT
• Percentage of successful scheduled updates
• Percentage of successful updates that contain occlusions
• Percentage of missed updates without occlusions
• Percentage of successful updates made with IoU < 0.1
When we compute the above statistics for the baseline STARK with template updates scheduled every
200 frames, we find that scheduled updates are only 0.43% of all frames in HOOT, and amongst these
scheduled updates, 64.21% are successful, where the classification module decided that the target was in
the search region. Through our analysis, we can see that in almost 65% of these successful updates, the
target was occluded to some extent, whereas 15.55% were updates made with a very low or zero IoU. We
also find that out of the scheduled updates that are missed, only around 10% contained no occlusion. This
is overall 3.6% of all scheduled updates, which means that the tracker rarely missed an update if the target
was fully visible. These values confirmed our qualitative examination of the STARK results.
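As an illustration of how these statistics can be derived, the sketch below computes them from hypothetical per-frame records; the field names (`scheduled`, `accepted`, `occ_ratio`, `iou`) are assumptions for the example and are not the names used in HOOT or in our evaluation code.

```python
def update_statistics(records):
    """Compute the update statistics listed above from per-frame records (a sketch)."""
    scheduled = [r for r in records if r["scheduled"]]
    accepted = [r for r in scheduled if r["accepted"]]
    missed = [r for r in scheduled if not r["accepted"]]
    pct = lambda part, whole: 100.0 * len(part) / max(len(whole), 1)
    return {
        "% scheduled updates": pct(scheduled, records),
        "% successful updates": pct(accepted, scheduled),
        "% succ. updates w/ occlusion": pct([r for r in accepted if r["occ_ratio"] > 0], accepted),
        "% missed updates w/o occlusion": pct([r for r in missed if r["occ_ratio"] == 0], missed),
        "% succ. updates w/ IoU < 0.1": pct([r for r in accepted if r["iou"] < 0.1], accepted),
    }
```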
We also compared the results of STARK with no scheduled updates against the above baseline with template updates scheduled every 200 frames. This comparison can help us determine how much template
updates can improve performance when STARK is evaluated on heavily occluded videos. Success scores
show that there is no difference in tracking performance between making no template updates and making
updates every 200 frames (both scores at 0.507). Therefore, when trained only on data with mostly fully-visible targets, STARK’s spatio-temporal update mechanism essentially reduces to tracking with only the
first frame template.
Next, we examine how tracking performance and the rest of the template update statistics change with
varying template update intervals for STARK. We re-evaluate STARK with the following template update
Update Interval   Success Rate   % Scheduled Updates   % Successful Updates   % Succ. Updates w/ Occlusion   % Missed Updates w/o Occlusion   % Succ. Updates w/ IoU < 0.1
No Update         0.507          -                     -                      -                              -                                -
200               0.507          0.43%                 64.21%                 64.76%                         10.09%                           15.55%
100               0.503          0.93%                 67.84%                 64.52%                         8.29%                            17.30%
50                0.498          1.93%                 72.62%                 64.83%                         6.78%                            18.22%
25                0.495          3.93%                 76.27%                 64.72%                         6.12%                            19.51%
10                0.478          9.93%                 82.80%                 64.87%                         3.60%                            22.35%
5                 0.450          19.92%                87.32%                 65.59%                         2.40%                            25.56%
Table 5.1: Analysis of the baseline STARK tracker with varying template update frequencies, comparing
statistics for tracking performance, percentages for scheduled updates, successful updates, successful updates with occlusions, missed updates without occlusions, and successful updates with very low or zero
overlap.
intervals: [100, 50, 25, 10, 5], roughly doubling the update frequency at each step. Through this analysis,
we discover that the tracking success consistently drops as template update frequency increases, from
0.507 to 0.45 - a 5.7% drop.
Moreover, as we increase the number of scheduled updates from 0.43% to 19.92% for update interval 5,
the percentage of successful updates also increases to 87.32%. Meanwhile, the percentage of occluded targets in those successful updates remains stable. This means that given more chances to make updates, the
tracker continues to update with more and more occluded templates, reducing the tracking performance.
Table 5.1 summarizes the statistics for the quantitative analysis presented above.
In addition to showing tracking performance consistently dropping with increased template update
frequency, we also compute the percentage of successful template updates that contain each occluder
type defined in HOOT. Across all template update intervals, solid occluders appeared in 27-30% of
the successful template updates, while sparse, semi-transparent, and transparent occluders appeared in
21-23%, 9-12%, and 8-11%, respectively.
5.3.3 Occlusion Oracle Analysis
In addition to performing analyses for the template update module of STARK qualitatively and quantitatively by varying its update intervals, we also want to quantitatively show whether learning occlusions
Update Interval   Success Rate   % Scheduled Updates   % Successful Updates   % Succ. Updates w/ Occlusion   % Missed Updates w/o Occlusion   % Succ. Updates w/ IoU < 0.1
200               0.510          0.43%                 22.25%                 -                              30.79%                           8.59%
100               0.512          0.93%                 23.53%                 -                              26.21%                           7.34%
50                0.516          1.93%                 25.18%                 -                              21.03%                           7.50%
25                0.518          3.93%                 26.86%                 -                              17.57%                           7.68%
10                0.511          9.93%                 28.59%                 -                              12.31%                           9.76%
5                 0.508          19.92%                29.39%                 -                              9.45%                            8.24%
Table 5.2: Analysis of the baseline STARK tracker with varying template update frequencies using an occlusion oracle, comparing statistics for tracking performance, percentages for scheduled updates, successful
updates, successful updates with occlusions, missed updates without occlusions, and successful updates
with very low or zero overlap. The oracle forces the tracker to skip an update if there is any occlusion
affecting the target, using the occlusion annotations from HOOT.
can help the tracker skip some updates that can hurt the tracking performance. For this analysis, we assume an occlusion oracle and force the tracker to skip a template update if a certain occlusion condition is
satisfied. The oracle is informed by the extensive and dense occlusion annotations in HOOT, which further
demonstrates how the benchmark can be useful for the development of more occlusion-robust deep trackers.
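A minimal sketch of such an oracle gate is shown below, assuming per-frame occlusion ratios and occluder-type tags are available from the HOOT annotations; the function and argument names are illustrative, not part of the released benchmark tooling.

```python
def oracle_allows_update(occ_ratio, occluder_types, oracle="any", th_occ=0.0):
    """Occlusion-oracle gate for template updates (a sketch of the analysis above).

    occ_ratio      -- fraction of the target covered by occluders in this frame.
    occluder_types -- set of occluder types present, e.g. {"solid", "sparse"}.
    oracle         -- "any": skip the update if there is any occlusion;
                      "ratio": skip if occ_ratio exceeds th_occ;
                      otherwise: an occluder-type name, skip if that type is present.
    """
    if oracle == "any":
        return occ_ratio == 0.0
    if oracle == "ratio":
        return occ_ratio <= th_occ
    return oracle not in occluder_types

# The tracker then performs a scheduled, classifier-accepted update only when the
# oracle also allows it, e.g.:
# if scheduled and cls_score > 0.5 and oracle_allows_update(occ, types, "ratio", 0.4):
#     update_dynamic_template(...)
```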
Table 5.2 repeats the analysis in Table 5.1, this time with an occlusion oracle that prevents template
updates if the target is affected by any kind of occlusion. This initial occlusion oracle analysis shows
that by not making occluded template updates, the tracker not only stops suffering drops in tracking
performance with increased update frequency, but can also achieve higher performances compared to the
baseline in Table 5.1 while making updates more frequently. Empowered by knowing when the target is
occluded, STARK performs 0.3-5.8% better overall compared to the baseline without an occlusion oracle,
which means that its spatio-temporal template updates can happen more frequently and also improve
performance even under heavy occlusions.
Furthermore, we can see from the analysis that the updates made with very little or no overlap also
fall by 50% or more. This implies that a majority of the template updates that contained mostly background
were caused by occlusions. Therefore, predicting occlusions can also help the tracker skip these updates.
In addition to the any-occlusion oracle used for the analysis in Table 5.2, we evaluated oracles that represented different occlusion conditions. Some of these oracles analyzed STARK’s template updates against
Figure 5.5: Plots showing how tracker performance (success) changes as update frequency is increased
for different types of occlusion oracles. Results are separated into two plots for different oracle types.
Occlusion percentage oracles prevent any update from happening if the occlusion ratio in the search frame
is over a threshold thocc, while occlusion type oracles prevent updates where the target is occluded by a
certain occluder type from happening.
different occluder types in HOOT: solid, sparse, semi-transparent, and transparent occlusion oracles. Other
oracles we evaluated were based on the occlusion percentage of the target. Fig. 5.5 shows how the success
rate changes for varying template update frequencies for each of these different oracles. Results showed
that being able to predict whether a target is more than 20-40% occluded gets the performance
closest to the upper-bound case of knowing about all occlusions. Moreover, being able to predict solid
occluders can help the most for intelligent template updates, which is intuitive since solid occluders allow
no information from the target to pass through compared to other types. They also occur the most in
HOOT.
Overall, our analyses show that when deep trackers with template update capabilities are deployed in
occlusion-heavy videos, it is beneficial to be able to predict to an extent whether a template is corrupted
before making an update. Predicting occlusions accurately can be a valuable resource for this type of
mechanism.
5.4 Training STARK with HOOT
Having analyzed STARK’s dynamic template update module and shown through our oracle analyses that
being able to skip updates with occluded targets can help the tracker obtain higher performance with more
frequent updates, we finally start developing baselines for training STARK with HOOT.
By mixing the heavily occluded videos in HOOT into the STARK training set, we aim to answer the
following questions: 1) How much does training with HOOT help close the domain gap between tracking
targets that are mostly visible and heavily occluded? 2) Can we use occlusion annotations in HOOT to
see improved tracking performance with increased update frequency similar to our oracle analyses? And
finally, 3) Can we get the benefits of occluded data by training only the template update module with it or
do we need to adapt the entire feature space?
5.4.1 Implementation Details
As mentioned in Section 5.2, the STARK tracker has two main branches: The target localization branch and
the template update branch. The baseline tracker is trained in two stages and uses LaSOT [34], TrackingNet
[97], GOT-10k [51], and COCO [83]. The first stage trains the localization branch and takes 500 epochs,
with an epoch defined as 60,000 samples that contain triplets of images for search, initial template, and
dynamic template. On the other hand, the second stage freezes the weights of the localization branch after
500 epochs and trains only the classification module for 50 more epochs, using balanced sampling between
positive and negative samples (search images that contain the target vs. search images that do not). The
network is optimized using the AdamW optimizer [85] with a mini-batch size of 128 (16 samples on each of the 8 Tesla
V100 GPUs they use). Further details on learning rates and their schedule, as well as the loss weights, can
be found in the original STARK paper [132].
During our experiments, we keep most of the settings mentioned above the same, to be as comparable
to the original tracker as possible. The main difference in our experiments is that we use only 2 Tesla V100
                       HOOT Training           Update   HOOT Test Set - Success
Method                 Loc.         Cls.        thocc    200     100     50      25      10      5
STARK (Paper)          ✗            ✗           1.0      0.513   0.508   0.503   0.500   0.480   0.446
STARK (Our baseline)   ✗            ✗           1.0      0.483   0.479   0.459   0.443   0.439   0.416
HOOT-Trained STARK     ✗            ✓, p = 0.5  1.0      0.482   0.480   0.465   0.457   0.443   0.417
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  1.0      0.612   0.616   0.617   0.610   0.617   0.602
Table 5.3: Tracking performance (success) results of naively training STARK with HOOT. We show the
difference between the published baseline STARK weights and the baseline STARK we trained, as well as
how adding HOOT naively to the training affects the performance.
GPUs instead of 8, therefore reducing the mini-batch size from 128 to 32. We also mix HOOT videos
into the STARK training and modify the batch sampler to increase the probability of sampling from HOOT
compared to other datasets when training the classification module. While the original tracker sets the
template update interval to 200, we present results for varying update intervals, following our analyses
from the previous section. We do not modify the model itself for our naive HOOT-trained baselines and
binary occlusion prediction experiments.
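As a concrete illustration of the modified batch sampler, the sketch below draws training samples from HOOT with a higher probability than from the other datasets; the sampling interface is a simplification, and the weights shown correspond to the classification-stage setting described in the next subsection (1/2 for HOOT, 1/8 for each of the other datasets).

```python
import random

# Weighted dataset sampling sketch: HOOT is drawn with probability 1/2,
# the remaining training datasets with probability 1/8 each.
datasets = ["HOOT", "LaSOT", "TrackingNet", "GOT-10k", "COCO"]
weights  = [0.5, 0.125, 0.125, 0.125, 0.125]

def sample_training_dataset(rng=random):
    """Pick which dataset the next training triplet should come from."""
    return rng.choices(datasets, weights=weights, k=1)[0]

# Quick sanity check of the sampling proportions.
counts = {name: 0 for name in datasets}
for _ in range(10000):
    counts[sample_training_dataset()] += 1
print(counts)  # roughly half of the draws come from HOOT
```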
5.4.2 Experimental Results
Baseline HOOT-Trained STARK: We start our experiments by naively adding HOOT to the STARK
tracker training. As mentioned previously, STARK’s classification module performs a binary classification
for whether the target is visible in the search frame or not. For the localization branch, we sample each
dataset with equal probability, while for the second stage, we sample HOOT with probability 1/2 and each of
the other datasets with probability 1/8. Table 5.3 shows these results and compares them to the baseline STARK
we trained (without HOOT), as well as the performance of the published STARK weights we evaluated in
Section 5.3 and Chapter 4. Note that reducing the mini-batch size and possible other reproducibility factors
have reduced the performance of the tracker by around 3% across all update intervals.
Comparing the naive HOOT-trained STARK to the baseline STARK, we find that when we add HOOT
to the training, we immediately start closing the domain gap that we observed when evaluating STARK on
                       HOOT Training           Update   HOOT Test Set - Success
Method                 Loc.         Cls.        thocc    200     100     50      25      10      5
STARK (Our baseline)   ✗            ✗           1.0      0.483   0.479   0.459   0.443   0.439   0.416
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  1.0      0.612   0.616   0.617   0.610   0.617   0.602
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  0.8      0.613   0.617   0.621   0.612   0.620   0.604
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  0.6      0.613   0.617   0.616   0.614   0.622   0.595
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  0.4      0.613   0.617   0.615   0.612   0.618   0.599
HOOT-Trained STARK     ✓, p = 0.2   ✓, p = 0.5  0.2      0.613   0.617   0.615   0.614   0.616   0.598
Table 5.4: Tracking performance (success) results of training STARK with occlusion ratios in HOOT. We
change the occlusion threshold thocc to consider lower and lower visibility ratios as acceptable updates for
the classification module. Best success performances for each thocc in bold.
LaSOT vs. HOOT in Chapter 4. We find that training the entire model with HOOT yields a success score of 0.612
at update interval 200, which is a 13% jump in performance from our baseline and a 10% jump from
the original STARK weights. More importantly, we find that by training the entire model on HOOT, we
are able to stabilize the tracking performance as we vary template update intervals.
As we change the update frequency from every 200 frames to every 5 frames, HOOT-trained STARK
only suffers a 1% drop compared to up to 5% drops observed when analyzing the baseline model. Moreover,
we find that training only the classification branch on HOOT provides marginal improvements, which
implies that learning better features for heavy occlusions is more important for tracking performance and
can also help predict better whether the target is fully-occluded or not.
HOOT-Trained STARK with Binary Occlusion Prediction: Next, we begin utilizing the occlusion
percentages we compute from the occluder mask annotations in HOOT to modify the binary update prediction. In baseline STARK, this binary classification module predicts full occlusion vs. any visibility, or
whether occlusion ratio == 1. We start modifying the classification module in the template update branch
by varying this occlusion ratio parameter, which we denote as thocc. As we reduce thocc, we ask the template update branch to not consider targets occluded more than thocc as suitable for update. Therefore, the
lower the thocc is, the more we aim to make updates with only fully visible targets. Table 5.4 shows these
results and compares them to the baseline STARK we trained.
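A minimal sketch of how the occlusion ratio annotations can be turned into classification labels under a threshold thocc is shown below; this is a simplified illustration of the modification described above, not the exact training code.

```python
def update_label(occ_ratio, th_occ=1.0):
    """Binary 'target suitable for a template update' label from HOOT's occlusion ratio.

    A sketch of the modification described above. With th_occ = 1.0 this reproduces
    the baseline behaviour (only fully occluded or absent frames become negatives);
    lowering th_occ additionally marks frames whose occlusion ratio reaches th_occ
    as unsuitable, so the classifier is trained to accept only mostly visible targets.
    """
    return 1.0 if occ_ratio < th_occ else 0.0
```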
We find that as we modify the threshold to accept only mostly visible targets as suitable for updates, we
begin losing the improvement in performance for low update frequencies. However, for every thocc value,
the best-performing update interval is always shorter than the baseline of 200 frames, whereas a steady decline is
observed for the baseline STARK. The best success performance is obtained when we do not make updates
with targets that are more than 60% occluded, while still scheduling updates every 10 frames.
5.5 Discussion
In this chapter, we used the extensive occlusion annotations in HOOT to quantitatively measure, for the
first time in the literature, the effect of occlusions on dynamic template updates, and showed how the
heavily occluded videos in HOOT can help us update dynamic templates more intelligently
without sacrificing opportunities to update. To do this, we focused on the state-of-the-art STARK tracker
that employs a dynamic template update module which helps it extract spatio-temporal information for
improved tracking through transformers.
Our visualizations of the dynamic template updates showed updates frequently being made with heavily occluded targets. Following this evidence, we ran a full quantitative analysis using the occlusion annotations in HOOT. This analysis confirmed that around 65% of all updates made were with occluded targets,
which degrades tracking performance as more updates get scheduled. Moreover, our occlusion oracle analyses showed that if we could predict occlusions to automatically skip updates with corrupted templates,
we could obtain improved tracking performance while scheduling more frequent updates. More frequent
the advantages of frequent template updates while being more occlusion-robust.
In addition to our extensive analyses powered by the annotations in HOOT, we discovered that increasing the presence of occluded data in training deep trackers can help close some of the domain gap that
exists between current visual object tracking (VOT) benchmarks and benchmarks like HOOT that focus
on tracking objects through heavy occlusions. Furthermore, the simple use of occlusion ratio annotations
helped achieve better performance than the baseline STARK while scheduling 2-40x more updates, allowing the
tracker to keep track of the target appearance better while skipping corrupted templates.
Our results support that occlusion annotations and data can help create more occlusion-robust trackers,
even when used in the very straightforward manner presented in this chapter. More importantly, this data
unlocks the ability for researchers to properly evaluate trackers against this challenge, and come up with
creative ways to run a variety of analyses such as those presented in this chapter. Future work can expand
on the use of the annotations in HOOT, which can include occluder mask prediction and dynamic template
weighting through occlusion level or mask predictions, so appearance updates can adapt even better to
the targets.
Chapter 6
Conclusions
6.1 Thesis Summary
In this work, we examined the challenge occlusions can bring to visual object tracking (VOT) and whether
more data containing heavy occlusions and occlusion annotations can help better address trackers’ robustness to this challenge. After presenting an extensive background on the field of VOT, we start our
exploration into this domain by identifying existing occlusion annotations in VOT datasets and proposing
a multi-task occlusion learning framework that can predict the occlusion level of the target in a given
frame, supervised by occlusion level labels in GOT-10k. While we observe that learning occlusions as an
auxiliary task can help tracking for occluded frames, we find that imbalanced occluded data and the lack of
annotations in many VOT benchmarks make it difficult to properly evaluate a tracker against occlusions.
Next, we tackle how a VOT dataset that focuses on evaluating, analyzing, and developing VOT methods against occlusions can be designed. We propose HOOT, the Heavy Occlusions in Object Tracking
Benchmark, which is a large-scale video dataset to specifically address occlusions in VOT, with extensive
per-frame occlusion annotations, including occluder masks and a defined occluder type hierarchy. By
evaluating the state-of-the-art on HOOT, we find that a domain gap exists between tracking mostly visible
and heavily occluded objects even for very strong deep visual trackers. We show this by comparing the
performance of 15 trackers on HOOT to their performance on LaSOT, which has a low occlusion representation.
Lastly, we look into dynamic template update as an important aspect of VOT that can be affected the
most by target appearance being corrupted by occlusions. Using the extensive occlusion annotations
in HOOT, we are able to quantitatively analyze a state-of-the-art tracker’s template update mechanism
against occlusions and measure their effect on the performance. After running oracle experiments that
prove the positive impact of being able to predict occlusions, we finalize our work by starting to use HOOT
videos to train the tracker. We show how HOOT videos can help close the domain gap between other VOT
datasets and videos that contain heavy occlusions, and how occlusion annotations can be further used to
supervise template updates with a positive impact on the tracking performance. We discuss some future
directions that have not been explored in our work below.
6.2 Future Work
In this section, we address some of the main branches of research that emerge from the work presented
here. These branches are as follows:
New Generation of Methods Addressing Occlusions: We hope that, inspired by this research, more
studies will introduce novel methods to address occlusions. This is not limited to new deep models or other
structures that improve upon the baseline presented here (which might include more detailed occlusion
prediction with masks, or generative methods using the occlusion annotations in HOOT). We believe that
much of the impactful work that emerges from this will be the new and interesting ways to analyze trackers
against occlusions (such as the oracle experiments we conducted) and even more efficient performance
measures (such as measuring when a tracker re-detects a lost target by chance etc.).
Generalization & Bias: As previously mentioned in Chapter 1, it is important to acknowledge that
occlusions can be caused by a myriad of objects and in plenty of different scenarios. Therefore, concerns
for generalization should continue to be part of the research that addresses occlusions in VOT. Any bias that
may have been introduced in the collection of the dataset by repeated occluders or targets can be measured
by designing a one-shot future test set with new classes of targets, occluders, and environments. This can
help the community determine how much of the occlusion problem can be solved with data and let them
develop additional algorithms for the scenarios that the data in HOOT does not generalize to.
Increased Focus on Occlusion Handling: Lastly, more effort on handling occlusions in deep visual
object trackers is crucial for innovation in this area of VOT. To generate more ideas on how to deal with
occlusions, efforts can be made to popularize HOOT and tracking under heavy occlusions amongst the
tracking community. Inspired by the successful, long-running challenges (VOT13 [72] - VOT22 [68]) in
the community, HOOT can also be organized as an annual challenge which can help grow the visibility of
the dataset and the interest in tackling occlusions in visual object tracking.
Bibliography
[1] Abrar H Abdulnabi, Gang Wang, Jiwen Lu, and Kui Jia. “Multi-task CNN model for attribute
prediction”. In: IEEE Transactions on Multimedia 17.11 (2015), pp. 1949–1959.
[2] Christoph Anthes, Rubén Jesús García-Hernández, Markus Wiedemann, and Dieter Kranzlmüller.
“State of the art of virtual reality technology”. In: 2016 IEEE aerospace conference. IEEE. 2016,
pp. 1–19.
[3] Kittipat Apicharttrisorn, Xukan Ran, Jiasi Chen, Srikanth V Krishnamurthy, and
Amit K Roy-Chowdhury. “Frugal following: Power thrifty object detection and tracking for
mobile augmented reality”. In: Proceedings of the 17th Conference on Embedded Networked Sensor
Systems. 2019, pp. 96–109.
[4] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo,
Vinicius B Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz,
et al. “Self-driving cars: A survey”. In: Expert Systems with Applications 165 (2021), p. 113816.
[5] Simon Baker and Iain Matthews. “Lucas-kanade 20 years on: A unifying framework”. In:
International journal of computer vision 56.3 (2004), pp. 221–255.
[6] Frédéric Bergeron, Kevin Bouchard, Sébastien Gaboury, and Sylvain Giroux. “Tracking objects
within a smart home”. In: Expert Systems with Applications 113 (2018), pp. 428–442.
[7] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. HS. Torr. “Fully-convolutional
siamese networks for object tracking”. In: ECCV. Springer. 2016, pp. 850–865.
[8] Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip H. S. Torr. “Staple:
Complementary Learners for Real-Time Tracking”. In: IEEE CVPR. June 2016.
[9] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte. “Learning discriminative model prediction for
tracking”. In: IEEE ICCV. 2019, pp. 6182–6191.
[10] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. “Visual object tracking using
adaptive correlation filters”. In: IEEE CVPR. IEEE. 2010, pp. 2544–2550.
[11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. “End-to-end object detection with transformers”. In: Computer Vision–ECCV
2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer.
2020, pp. 213–229.
[12] Luka Čehovin, Aleš Leonardis, and Matej Kristan. “Visual Object Tracking Performance Measures
Revisited”. In: IEEE Transactions on Image Processing 25.3 (2016), pp. 1261–1274. doi:
10.1109/TIP.2016.2520370.
[13] Greg Chance, Aleksandar Jevtić, Praminda Caleb-Solly, Guillem Alenya, Carme Torras, and
Sanja Dogramadzi. ““elbows out”—predictive tracking of partially occluded pose for
robot-assisted dressing”. In: IEEE Robotics and Automation Letters 3.4 (2018), pp. 3598–3605.
[14] D. Chen, Z. Yuan, Y. Wu, G. Zhang, and N. Zheng. “Constructing adaptive complex cells for
robust visual tracking”. In: IEEE ICCV. 2013, pp. 1113–1120.
[15] Fei Chen, Xiaodong Wang, Yunxiang Zhao, Shaohe Lv, and Xin Niu. “Visual object tracking: A
survey”. In: Computer Vision and Image Understanding 222 (2022), p. 103508. issn: 1077-3142. doi:
https://doi.org/10.1016/j.cviu.2022.103508.
[16] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. “Transformer
tracking”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2021, pp. 8126–8135.
[17] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, and Thomas Huang.
“Revisiting rcnn: On awakening the classification power of faster rcnn”. In: Proceedings of the
European conference on computer vision (ECCV). 2018, pp. 453–468.
[18] Truman Cheng, Weibing Li, Wing Yin Ng, Yisen Huang, Jixiu Li, Calvin Sze Hang Ng,
Philip Wai Yan Chiu, and Zheng Li. “Deep learning assisted robotic magnetic anchored and
guided endoscope for real-time instrument tracking”. In: IEEE Robotics and Automation Letters 6.2
(2021), pp. 3979–3986.
[19] Jongwon Choi, Hyung Jin Chang, Tobias Fischer, Sangdoo Yun, Kyuewang Lee, Jiyeoup Jeong,
Yiannis Demiris, and Jin Young Choi. “Context-aware deep feature compression for high-speed
visual tracking”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2018, pp. 479–488.
[20] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. “Real-time tracking of non-rigid objects
using mean shift”. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition.
CVPR 2000 (Cat. No. PR00662). Vol. 2. IEEE. 2000, pp. 142–149.
[21] Kenan Dai, Yunhua Zhang, Dong Wang, Jianhua Li, Huchuan Lu, and Xiaoyun Yang.
“High-Performance Long-Term Tracking With Meta-Updater”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.
[22] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg. “Eco: Efficient convolution operators for
tracking”. In: IEEE CVPR. 2017, pp. 6638–6646.
[23] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. “Beyond correlation filters: Learning
continuous convolution operators for visual tracking”. In: ECCV. Springer. 2016, pp. 472–488.
[24] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. “Atom: Accurate
tracking by overlap maximization”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2019, pp. 4660–4669.
[25] Martin Danelljan, Luc Van Gool, and Radu Timofte. “Probabilistic regression for visual tracking”.
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020,
pp. 7183–7192.
[26] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. “Convolutional
Features for Correlation Filter Based Visual Tracking”. In: Proceedings of the IEEE International
Conference on Computer Vision (ICCV) Workshops. Dec. 2015.
[27] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. “Learning spatially
regularized correlation filters for visual tracking”. In: Proceedings of the IEEE international
conference on computer vision. 2015, pp. 4310–4318.
[28] E.R. Davies. “Chapter 22 - Surveillance”. In: Computer and Machine Vision (Fourth Edition). Ed. by
E.R. Davies. Fourth Edition. Boston: Academic Press, 2012, pp. 578–635. isbn: 978-0-12-386908-1.
doi: https://doi.org/10.1016/B978-0-12-386908-1.00022-7.
[29] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid,
Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. “Mot20: A benchmark for multi object
tracking in crowded scenes”. In: arXiv preprint arXiv:2003.09003 (2020).
[30] Thang Ba Dinh, Nam Vo, and Gérard Medioni. “Context tracker: Exploring supporters and
distracters in unconstrained environments”. In: CVPR 2011. IEEE. 2011, pp. 1177–1184.
[31] X. Dong, J. Shen, D. Yu, W. Wang, J. Liu, and H. Huang. “Occlusion-aware real-time object
tracking”. In: IEEE Transactions on Multimedia 19.4 (2016), pp. 763–771.
[32] Y. Du, Y. Yan, S. Chen, Y. Hua, and H. Wang. “Object-Adaptive LSTM Network for Visual
Tracking”. In: 2018 24th International Conference on Pattern Recognition (ICPR). 2018,
pp. 1719–1724.
[33] Christopher Fagiani, Margrit Betke, and James Gips. “Evaluation of Tracking Methods for
Human-Computer Interaction.” In: WACV. Vol. 2. 2002, pp. 121–126.
[34] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu,
Chunyuan Liao, and Haibin Ling. “Lasot: A high-quality benchmark for large-scale single object
tracking”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
2019, pp. 5374–5383.
[35] Heng Fan and Haibin Ling. “Sanet: Structure-aware network for visual tracking”. In: Proceedings
of the IEEE conference on computer vision and pattern recognition workshops. 2017, pp. 42–49.
[36] Heng Fan and Haibin Ling. “Siamese cascaded region proposal networks for real-time visual
tracking”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019,
pp. 7952–7961.
[37] Heng Fan, Halady Akhilesha Miththanthaya, Siranjiv Ramana Rajan, Xiaoqiong Liu, Zhilin Zou,
Yuewei Lin, Haibin Ling, et al. “Transparent object tracking benchmark”. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. 2021, pp. 10734–10743.
[38] Mahmoud Al-Faris, John Chiverton, David Ndzi, and Ahmed Isam Ahmed. “A review on computer
vision-based methods for human action recognition”. In: Journal of imaging 6.6 (2020), p. 46.
[39] Zhihong Fu, Qingjie Liu, Zehua Fu, and Yunhong Wang. “STMTrack: Template-Free Visual
Tracking With Space-Time Memory Networks”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). June 2021, pp. 13774–13783.
[40] J. Gao, T. Zhang, and C. Xu. “Graph convolutional tracking”. In: IEEE CVPR. 2019, pp. 4649–4659.
[41] Ujwalla Gawande, Kamal Hajari, and Yogesh Golhar. “Pedestrian detection and tracking in video
surveillance system: issues, comprehensive review, and challenges”. In: Recent Trends in
Computational Intelligence (2020), pp. 1–24.
[42] Martin Godec, Peter M Roth, and Horst Bischof. “Hough-based tracking of non-rigid objects”. In:
Computer Vision and Image Understanding 117.10 (2013), pp. 1245–1256.
[43] Daniel Gordon, Ali Farhadi, and Dieter Fox. “Re3: Real-Time Recurrent Regression Networks for
Visual Tracking of Generic Objects”. In: IEEE Robotics and Automation Letters 3.2 (2018),
pp. 788–795.
[44] Deepak K Gupta, Efstratios Gavves, and Arnold WM Smeulders. “Tackling occlusion in siamese
tracking with structured dropouts”. In: 2020 25th International Conference on Pattern Recognition
(ICPR). IEEE. 2021, pp. 5804–5811.
[45] Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D Twigg, Peizhao Zhang, Jeff Petkau,
Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al. “MEgATrack: monochrome
egocentric articulated hand-tracking for virtual reality”. In: ACM Transactions on Graphics (ToG)
39.4 (2020), pp. 87–1.
[46] Sam Hare, Stuart Golodetz, Amir Saffari, Vibhav Vineet, Ming-Ming Cheng, Stephen L Hicks, and
Philip HS Torr. “Struck: Structured output tracking with kernels”. In: IEEE transactions on pattern
analysis and machine intelligence 38.10 (2015), pp. 2096–2109.
[47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[48] D. Held, S. Thrun, and S. Savarese. “Learning to track at 100 fps with deep regression networks”.
In: ECCV. Springer. 2016, pp. 749–765.
[49] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. “High-speed tracking with kernelized
correlation filters”. In: IEEE TPAMI 37.3 (2014), pp. 583–596.
[50] AJH Hii, Christopher E Hann, J Geoffrey Chase, and Eli EW Van Houten. “Fast normalized cross
correlation for motion tracking using basis functions”. In: Computer methods and programs in
biomedicine 82.2 (2006), pp. 144–156.
[51] L. Huang, X. Zhao, and K. Huang. “GOT-10k: A Large High-Diversity Benchmark for Generic
Object Tracking in the Wild”. In: IEEE TPAMI (2019), pp. 1–1. issn: 1939-3539. doi:
10.1109/tpami.2019.2957464.
[52] Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu,
Paramvir Bahl, and Joseph Gonzalez. “Spatula: Efficient cross-camera video analytics on large
camera networks”. In: 2020 IEEE/ACM Symposium on Edge Computing (SEC). IEEE. 2020,
pp. 110–124.
[53] Omar Javed and Mubarak Shah. “Tracking and object classification for automated surveillance”.
In: European Conference on Computer Vision. Springer. 2002, pp. 343–357.
[54] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. “Real-time mdnet”. In: Proceedings of
the European Conference on Computer Vision (ECCV). 2018, pp. 83–98.
[55] Samira Ebrahimi Kahou, Vincent Michalski, Roland Memisevic, Christopher Pal, and
Pascal Vincent. “RATM: recurrent attentive tracking model”. In: IEEE CVPR Workshops. IEEE.
2017, pp. 1613–1622.
[56] Z. Kalal, K. Mikolajczyk, and J. Matas. “Tracking-learning-detection”. In: IEEE TPAMI 34.7 (2011),
pp. 1409–1422.
[57] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. “Pn learning: Bootstrapping binary classifiers
by structural constraints”. In: 2010 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition. IEEE. 2010, pp. 49–56.
[58] S Hamidreza Kasaei, Miguel Oliveira, Gi Hyun Lim, Luıs Seabra Lopes, and Ana Maria Tomé.
“Towards lifelong assistive robotics: A tight coupling between object perception and
manipulation”. In: Neurocomputing 291 (2018), pp. 151–166.
[59] Lei Ke, Yu-Wing Tai, and Chi-Keung Tang. “Occlusion-Aware Instance Segmentation via BiLayer
Network Architectures”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[60] Lei Ke, Yu-Wing Tai, and Chi-Keung Tang. “Occlusion-Aware Video Object Inpainting”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Oct. 2021,
pp. 14468–14478.
[61] Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. “Need for
speed: A benchmark for higher frame rate object tracking”. In: Proceedings of the IEEE
International Conference on Computer Vision. 2017, pp. 1125–1134.
82
[62] Minji Kim, Seungkwan Lee, Jungseul Ok, Bohyung Han, and Minsu Cho. “Towards
Sequence-Level Training for Visual Tracking”. In: Computer Vision–ECCV 2022: 17th European
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer. 2022, pp. 534–551.
[63] Adam Kosiorek, Alex Bewley, and Ingmar Posner. “Hierarchical attentive recurrent tracking”. In:
Advances in neural information processing systems. 2017, pp. 3053–3061.
[64] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kamarainen, L. Čehovin Zajc,
O. Drbohlav, A. Lukezic, A. Berg, A. Eldesokey, J. Kapyla, and G. Fernandez. The Seventh Visual
Object Tracking VOT2019 Challenge Results. 2019.
[65] Matej Kristan, JirıMatas, Aleš Leonardis, Michael Felsberg, Roman Pflugfelder,
Joni-Kristian Kamarainen, Hyung Jin Chang, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič,
Ondrej Drbohlav, Jani Kapyla, Gustav Hager, Song Yan, Jinyu Yang, Zhongqun Zhang,
Gustavo Fernandez, and et. al. The Ninth Visual Object Tracking VOT2021 Challenge Results. 2021.
[66] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder,
Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. “The
sixth visual object tracking vot2018 challenge results”. In: ECCV. 2018, pp. 0–0.
[67] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder,
Luka Cehovin Zajc, Tomas Vojir, Gustav Hager, Alan Lukezic, Abdelrahman Eldesokey, et al.
“The visual object tracking vot2017 challenge results”. In: IEEE ICCV. 2017, pp. 1949–1972.
[68] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder,
Joni-Kristian Kamarainen, Hyung Jin Chang, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič,
Ondrej Drbohlav, Johanna Bjorklund, Yushan Zhang, Zhongqun Zhang, Song Yan, Wenyan Yang,
Dingding Cai, Christoph Mayer, and Gustavo Fernandez. The Tenth Visual Object Tracking
VOT2022 Challenge Results. 2022.
[69] Matej Kristan, Aleš Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder,
Luka Čehovin Zajc, Tomas Vojir, Gustav Häger, Alan Lukežič, and Gustavo Fernandez. The Visual
Object Tracking VOT2016 challenge results. Springer. Oct. 2016. url:
http://www.springer.com/gp/book/9783319488806.
[70] Matej Kristan, Aleš Leonardis, Jiřı Matas, Michael Felsberg, Roman Pflugfelder,
Joni-Kristian Kämäräinen, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič, Ondrej Drbohlav,
et al. “The eighth visual object tracking VOT2020 challenge results”. In: European Conference on
Computer Vision. Springer. 2020, pp. 547–601.
[71] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez,
Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. “The visual object tracking
vot2015 challenge results”. In: Proceedings of the IEEE international conference on computer vision
workshops. 2015, pp. 1–23.
[72] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin,
Georg Nebehay, Gustavo Fernandez, Tomas Vojir, et al. “The vot2013 challenge: overview and
additional results”. In: (2014).
83
[73] Matej Kristan, Roman Pflugfelder, Aleš Leonardis, Jiři Matas, Luka Čehovin, Georg Nebehay,
Tomáš Vojíř, Gustavo Fernández, Alan Lukežič, Aleksandar Dimitriev, Alfredo Petrosino,
Amir Saffari, Bo Li, Bohyung Han, Cherkeng Heng, Christophe Garcia, Dominik Pangeršič,
Gustav Häger, Fahad Shahbaz Khan, Franci Oven, Horst Possegger, Horst Bischof,
Hyeonseob Nam, Jianke Zhu, JiJia Li, Jin Young Choi, Jin-Woo Choi, João F. Henriques, Joost van
de Weijer, Jorge Batista, Karel Lebeda, Kristoffer Öfjäll, Kwang Moo Yi, Lei Quin, Longyin Wen,
Mario Edoardo Maresca, Martin Danelljan, Michael Felsberg, Ming-Ming Cheng, Philip Torr,
Quingming Huang, Richard Bowden, Sam Hare, Samantha YueYing Lim, Seunghoon Hong,
Shengcai Liao, Simon Hadfield, Stan Z. Li, Stefan Duffner, Stuart Golodetz, Thomas Mauthner,
Vibhav Vineet, Weiyao Lin, Yang Li, Yuankai Qui, Zhen Lei, and Zhiheng Niu. “The Visual Object
Tracking VOT2014 challenge results”. English. In: Workshop on the Visual Object Tracking
Challenge (VOT, in conjunction with ECCV). ., 2014.
[74] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Communications of the ACM 60.6 (2017), pp. 84–90.
[75] Thijs P Kuipers, Devanshu Arya, and Deepak K Gupta. “Hard occlusions in visual object
tracking”. In: European Conference on Computer Vision. Springer. 2020, pp. 299–314.
[76] Junseok Kwon and Kyoung Mu Lee. “Tracking by sampling trackers”. In: 2011 International
Conference on Computer Vision. IEEE. 2011, pp. 1195–1202.
[77] L.J. Latecki and R. Miezianko. “Object Tracking with Dynamic Template Update and Occlusion
Detection”. In: 18th International Conference on Pattern Recognition (ICPR’06). Vol. 1. 2006,
pp. 556–560. doi: 10.1109/ICPR.2006.886.
[78] Beng Yong Lee, Lee Hung Liew, Wai Shiang Cheah, and Yin Chai Wang. “Occlusion handling in
videos object tracking: A survey”. In: IOP conference series: earth and environmental science.
Vol. 18. 1. IOP Publishing. 2014, p. 012020.
[79] Annan Li, Min Lin, Yi Wu, Ming-Hsuan Yang, and Shuicheng Yan. “Nus-pro: A new visual
tracking challenge”. In: IEEE TPAMI 38.2 (2015), pp. 335–349.
[80] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan. “SiamRPN++: Evolution of Siamese Visual
Tracking With Very Deep Networks”. In: IEEE CVPR. 2019, pp. 4282–4291.
[81] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. “High Performance Visual Tracking With Siamese Region
Proposal Network”. In: IEEE CVPR. June 2018.
[82] Feng Li, Cheng Tian, Wangmeng Zuo, Lei Zhang, and Ming-Hsuan Yang. “Learning
spatial-temporal regularized correlation filters for visual tracking”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018, pp. 4904–4913.
[83] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. “Microsoft coco: Common objects in context”. In: ECCV.
Springer. 2014, pp. 740–755.
84
[84] Ze Liu, Yingfeng Cai, Hai Wang, Long Chen, Hongbo Gao, Yunyi Jia, and Yicheng Li. “Robust
target recognition and tracking of self-driving cars with radar and camera information fusion
under severe weather conditions”. In: IEEE Transactions on Intelligent Transportation Systems 23.7
(2021), pp. 6640–6653.
[85] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint
arXiv:1711.05101 (2017).
[86] Bruce D Lucas, Takeo Kanade, et al. “An iterative image registration technique with an
application to stereo vision”. In: (1981).
[87] A. Lukezic, T. Vojir, L. ˇCehovin Zajc, J. Matas, and M. Kristan. “Discriminative correlation filter
with channel and spatial reliability”. In: IEEE CVPR. 2017, pp. 6309–6318.
[88] Seyed Mojtaba Marvasti-Zadeh, Li Cheng, Hossein Ghanei-Yakhdan, and Shohreh Kasaei. “Deep
learning for visual tracking: A comprehensive survey”. In: IEEE Transactions on Intelligent
Transportation Systems (2021).
[89] L. Matthews, T. Ishikawa, and S. Baker. “The template update problem”. In: IEEE Transactions on
Pattern Analysis and Machine Intelligence 26.6 (2004), pp. 810–815. doi: 10.1109/TPAMI.2004.16.
[90] Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu,
and Luc Van Gool. “Transforming model prediction for tracking”. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 2022, pp. 8731–8740.
[91] Christoph Mayer, Martin Danelljan, Danda Pani Paudel, and Luc Van Gool. “Learning target
candidate association to keep track of what not to track”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2021, pp. 13444–13454.
[92] Xue Mei and Haibin Ling. “Robust visual tracking using ℓ1 minimization”. In: 2009 IEEE 12th
International conference on computer vision. IEEE. 2009, pp. 1436–1443.
[93] Xue Mei, Haibin Ling, Yi Wu, Erik Blasch, and Li Bai. “Minimum error bounded efficient ℓ1
tracker with occlusion detection”. In: CVPR 2011. IEEE. 2011, pp. 1257–1264.
[94] Thomas B Moeslund, Graham Thomas, Adrian Hilton, et al. Computer vision in sports. Vol. 1.
Springer, 2014.
[95] Abhinav Moudgil and Vineet Gandhi. “Long-term Visual Object Tracking Benchmark”. In: Asian
Conference on Computer Vision. Springer. 2018, pp. 629–645.
[96] Matthias Mueller, Neil Smith, and Bernard Ghanem. “A Benchmark and Simulator for UAV
Tracking”. In: Computer Vision – ECCV 2016. Ed. by Bastian Leibe, Jiri Matas, Nicu Sebe, and
Max Welling. Cham: Springer International Publishing, 2016, pp. 445–461. isbn:
978-3-319-46448-0.
[97] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem.
“Trackingnet: A large-scale dataset and benchmark for object tracking in the wild”. In: ECCV.
2018, pp. 300–317.
85
[98] H. Nam and B. Han. “Learning multi-domain convolutional neural networks for visual tracking”.
In: IEEE CVPR. 2016, pp. 4293–4302.
[99] Hieu T Nguyen and Arnold WM Smeulders. “Robust tracking using foreground-background
texture discrimination”. In: International Journal of Computer Vision 69.3 (2006), pp. 277–293.
[100] Hieu Tat Nguyen and Arnold WM Smeulders. “Fast occluded object tracking by a robust
appearance filter”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 26.8 (2004),
pp. 1099–1104.
[101] Jiyan Pan and Bo Hu. “Robust occlusion handling in object tracking”. In: 2007 IEEE Conference on
Computer Vision and Pattern Recognition. IEEE. 2007, pp. 1–8.
[102] Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and
Ling Shao. “Mask-guided attention network for occluded pedestrian detection”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision. 2019, pp. 4967–4975.
[103] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille,
Philip HS Torr, and Song Bai. “Occluded video instance segmentation: A benchmark”. In:
International Journal of Computer Vision 130.8 (2022), pp. 2022–2039.
[104] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille,
Philip HS Torr, and Song Bai. “Occluded video instance segmentation: Dataset and ICCV 2021
challenge”. In: arXiv preprint arXiv:2111.07950 (2021).
[105] Y. Qi, S. Zhang, W. Zhang, L. Su, Q. Huang, and M. Yang. “Learning attribute-specific
representations for visual tracking”. In: AAAI. Vol. 33. 2019, pp. 8835–8842.
[106] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G Hauptmann. “Electricity: An efficient
multi-camera vehicle tracking system for intelligent city”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops. 2020, pp. 588–589.
[107] Siddharth S Rautaray and Anupam Agrawal. “Vision based hand gesture recognition for human
computer interaction: a survey”. In: Artificial intelligence review 43 (2015), pp. 1–54.
[108] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke.
“Youtube-boundingboxes: A large high-precision human-annotated data set for object detection
in video”. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017,
pp. 5296–5305.
[109] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object
detection with region proposal networks”. In: Advances in neural information processing systems.
2015, pp. 91–99.
[110] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese.
“Generalized intersection over union: A metric and a loss for bounding box regression”. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019,
pp. 658–666.
86
[111] David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. “Incremental learning for
robust visual tracking”. In: International journal of computer vision 77.1-3 (2008), pp. 125–141.
[112] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. “Imagenet large scale
visual recognition challenge”. In: IJCV 115.3 (2015), pp. 211–252.
[113] Kaziwa Saleh, Sándor Szénási, and Zoltán Vámossy. “Occlusion Handling in Generic Object
Detection: A Review”. In: 2021 IEEE 19th World Symposium on Applied Machine Intelligence and
Informatics (SAMI). 2021, pp. 000477–000484. doi: 10.1109/SAMI50585.2021.9378657.
[114] Boris Sekachev, Nikita Manovich, Maxim Zhiltsov, Andrey Zhavoronkov, Dmitry Kalinin,
Ben Hoff, TOsmanov, Dmitry Kruchinin, Artyom Zankevich, DmitriySidnev, Maksim Markelov,
Johannes222, Mathis Chenuet, a-andre, telenachos, Aleksandr Melnikov, Jijoong Kim, Liron Ilouz,
Nikita Glazov, Priya4607, Rush Tehrani, Seungwon Jeong, Vladimir Skubriev, Sebastian Yonekura,
vugia truong, zliang7, lizhming, and Tritin Truong. opencv/cvat: v1.1.0. Version v1.1.0. Aug. 2020.
doi: 10.5281/zenodo.4009388.
[115] Jianbo Shi et al. “Good features to track”. In: 1994 Proceedings of IEEE conference on computer
vision and pattern recognition. IEEE. 1994, pp. 593–600.
[116] Arnold WM Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and
Mubarak Shah. “Visual tracking: An experimental survey”. In: IEEE transactions on pattern
analysis and machine intelligence 36.7 (2013), pp. 1442–1468.
[117] Guanglu Song, Yu Liu, and Xiaogang Wang. “Revisiting the sibling head in object detector”. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020,
pp. 11563–11572.
[118] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. WH Lau, and M. Yang. “Vital: Visual
tracking via adversarial learning”. In: IEEE CVPR. 2018, pp. 8990–8999.
[119] C. Sun, D. Wang, and H. Lu. “Occlusion-aware fragment-based tracking with spatial-temporal
consistency”. In: IEEE TIP 25.8 (2016), pp. 3814–3825.
[120] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. HS. Torr. “End-to-end representation
learning for correlation filter based tracking”. In: IEEE CVPR. 2017, pp. 2805–2813.
[121] Jack Valmadre, Luca Bertinetto, Joao F Henriques, Ran Tao, Andrea Vedaldi,
Arnold WM Smeulders, Philip HS Torr, and Efstratios Gavves. “Long-term tracking in the wild: A
benchmark”. In: Proceedings of the European conference on computer vision (ECCV). 2018,
pp. 670–685.
[122] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
87
[123] Angtian Wang, Yihong Sun, Adam Kortylewski, and Alan L. Yuille. “Robust Object Detection
Under Occlusion With Context-Aware CompositionalNets”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.
[124] Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun Zeng. “Spm-tracker: Series-parallel
matching for real-time visual object tracking”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2019, pp. 3643–3652.
[125] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. HS Torr. “Fast online object tracking and
segmentation: A unifying approach”. In: IEEE CVPR. 2019, pp. 1328–1338.
[126] Xin Wang, Zhiqiang Hou, Wangsheng Yu, Lei Pu, Zefenfen Jin, and Xianxiang Qin. “Robust
occlusion-aware part-based visual tracking with object scale adaptation”. In: Pattern Recognition
81 (2018), pp. 456–470.
[127] Longyin Wen, Dawei Du, Zhaowei Cai, Zhen Lei, Ming-Ching Chang, Honggang Qi,
Jongwoo Lim, Ming-Hsuan Yang, and Siwei Lyu. “UA-DETRAC: A new benchmark and protocol
for multi-object detection and tracking”. In: Computer Vision and Image Understanding 193 (2020),
p. 102907.
[128] Fei Wu, Jianlin Zhang, and Zhiyong Xu. “Stably adaptive anti-occlusion Siamese region proposal
network for real-time object tracking”. In: IEEE Access 8 (2020), pp. 161349–161360.
[129] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. “Object Tracking Benchmark”. In: IEEE Transactions
on Pattern Analysis and Machine Intelligence 37.9 (2015), pp. 1834–1848. doi:
10.1109/TPAMI.2014.2388226.
[130] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. “Online object tracking: A benchmark”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2013, pp. 2411–2418.
[131] Yu Xiang, Alexandre Alahi, and Silvio Savarese. “Learning to track: Online multi-object tracking
by decision making”. In: IEEE ICCV. 2015, pp. 4705–4713.
[132] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. “Learning spatio-temporal
transformer for visual tracking”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2021, pp. 10448–10457.
[133] T. Yang and A. B. Chan. “Visual Tracking via Dynamic Memory Networks”. In: TPAMI (2019).
[134] Tianyu Yang and Antoni B Chan. “Learning dynamic memory networks for object tracking”. In:
ECCV. 2018, pp. 152–167.
[135] Tianyu Yang and Antoni B Chan. “Recurrent filter learning for visual tracking”. In: IEEE ICCV.
2017, pp. 2010–2019.
[136] Alper Yilmaz, Omar Javed, and Mubarak Shah. “Object tracking: A survey”. In: Acm computing
surveys (CSUR) 38.4 (2006), 13–es.
88
[137] Bin Yu, Ming Tang, Linyu Zheng, Guibo Zhu, Jinqiao Wang, Hao Feng, Xuetao Feng, and
Hanqing Lu. “High-performance discriminative tracking with transformers”. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. 2021, pp. 9856–9865.
[138] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. “CityPersons: A Diverse Dataset for
Pedestrian Detection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). July 2017.
[139] Xiaoli Zhang and Shahram Payandeh. “Application of visual tracking for robot-assisted
laparoscopic surgery”. In: Journal of Robotic systems 19.7 (2002), pp. 315–328.
[140] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. “Facial landmark detection by deep multi-task learning”.
In: ECCV. Springer. 2014, pp. 94–108.
[141] Zhipeng Zhang, Yihao Liu, Xiao Wang, Bing Li, and Weiming Hu. “Learn to match: Automatic
matching network design for visual tracking”. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2021, pp. 13339–13348.
[142] Zhipeng Zhang and Houwen Peng. “Deeper and wider siamese networks for real-time visual
tracking”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2019, pp. 4591–4600.
[143] Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. “Ocean: Object-aware
anchor-free tracking”. In: European Conference on Computer Vision. Springer. 2020, pp. 771–787.
[144] J. Zhao, C. K. Chang, and L. Itti. “Learning to Recognize Objects by Retaining other Factors of
Variation”. In: IEEE WACV. Mar. 2017, pp. 1–9.
[145] Lin Zhao, Meiling Wang, Sheng Su, Tong Liu, and Yi Yang. “Dynamic object tracking for
self-driving cars using monocular camera and lidar”. In: 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS). IEEE. 2020, pp. 10865–10872.
[146] Z. Zhu, Q. Wang, L. Bo, W. Wu, J. Yan, and W. Hu. “Distractor-aware Siamese Networks for Visual
Object Tracking”. In: ECCV. 2018.
89
Abstract
Visual object tracking (VOT) is one of the principal challenges in computer vision: a target specified in the first frame must be tracked through the rest of the video. Major challenges in VOT include rotations, deformations, illumination changes, and occlusions. With the widespread use of deep learning models with strong representational power, trackers have evolved to better handle changes in the target's appearance caused by factors such as rotations and deformations. Robustness to occlusions, however, has not been as widely studied for deep trackers, and occlusion representation in VOT datasets has remained low over the years. In this work, we focus on occlusions in deep visual object tracking and examine whether realistic occlusion data and annotations can help with the development and evaluation of more occlusion-robust trackers. First, we propose a multi-task occlusion learning framework to show how much the occlusion labels in current datasets can improve tracker performance in occluded frames. We find that the lack of occlusion representation in VOT datasets creates a barrier to developing and evaluating trackers that focus on occlusions. To address occlusions in visual tracking more directly, we create a large video benchmark for visual object tracking: The Heavy Occlusions in Object Tracking (HOOT) Benchmark. With its extensive occlusion annotations, HOOT is specifically tailored for the evaluation, analysis, and development of occlusion-robust trackers. Finally, using the annotations in HOOT, we examine the effect of occlusions on template update and propose an occlusion-aware template update framework that improves tracker performance under heavy occlusions.
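The dissertation's actual occlusion-aware template update framework is not reproduced here. As a rough, minimal sketch of the general idea stated above (suppressing or down-weighting template updates when the target is heavily occluded), the snippet below uses a hypothetical update_template helper with an assumed linear blending rule; the threshold and blend-rate values are illustrative placeholders, not parameters from this work.

    import numpy as np

    def update_template(template, candidate, occlusion_score,
                        occ_threshold=0.5, blend_rate=0.1):
        """Blend the stored template with the current target crop only when
        the frame is judged to be mostly occlusion-free.

        template        : np.ndarray, running appearance template
        candidate       : np.ndarray, target crop from the current frame (same shape)
        occlusion_score : float in [0, 1], predicted occlusion level (illustrative)
        occ_threshold   : skip the update entirely above this occlusion level
        blend_rate      : maximum weight given to the new crop when unoccluded
        """
        if occlusion_score >= occ_threshold:
            # Target is heavily occluded: keep the old template unchanged.
            return template
        # Down-weight the new observation as the estimated occlusion increases.
        alpha = blend_rate * (1.0 - occlusion_score)
        return (1.0 - alpha) * template + alpha * candidate

    # Toy usage: a 4x4 single-channel "template" updated over three frames.
    rng = np.random.default_rng(0)
    template = rng.random((4, 4))
    for occ in (0.1, 0.8, 0.3):          # per-frame occlusion estimates
        crop = rng.random((4, 4))        # stand-in for the cropped target region
        template = update_template(template, crop, occ)
        print(f"occlusion={occ:.1f} -> template mean {template.mean():.3f}")

The sketch combines two common design choices, hard gating (no update above a threshold) and soft down-weighting (smaller blend weight as occlusion rises); either could be used alone, and the dissertation's framework should be consulted for the approach actually evaluated on HOOT.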