Green Unsupervised Single Object Tracking: Technologies and
Performance Evaluation
by
Zhiruo Zhou
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Zhiruo Zhou
Dedication
To my beloved parents and husband, for their endless love, support and partnership throughout my
life. To myself, for being stubborn enough to see this through.
Acknowledgements
I would like to express my sincere gratitude to my supervisor, Professor C.-C. Jay Kuo, who supported me with invaluable advice and patience and set an exemplar of a researcher for me. I would also like to thank Professor Antonio Ortega, Professor Stefanos Nikolaidis and Dr. Suya You for their help and advice on my thesis. Many thanks to all Media Communications Lab members and alumni who offered their kind help and company along the journey.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Problem Definition and General Methodology . . . . . . . . . . . . . . . . 2
1.2.2 Evaluation of Tracking Performance . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Supervised Trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.6 Unsupervised Single Object Tracking . . . . . . . . . . . . . . . . . . . . 14
1.2.6.1 Unsupervised Deep Trackers . . . . . . . . . . . . . . . . . . . . 14
1.2.6.2 Unsupervised Lightweight Trackers . . . . . . . . . . . . . . . . 15
1.3 Contribution of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 Unsupervised High-Performance Single Object Tracker . . . . . . . . . . . 18
1.3.2 Green Unsupervised Tracking for Long Video Sequences . . . . . . . . . . 19
1.3.3 Unsupervised Green Object Tracker without Offline Pre-training . . . . . . 20
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 2: UHP-SOT: Unsupervised High-Performance Single Object Tracker . . . . . . . 22
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Background Motion Modeling . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Trajectory-based Box Prediction . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.3 Fusion Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 3: UHP-SOT++: An Unsupervised Lightweight Single Object Tracker . . . . . . . 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Proposal Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Occlusion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Rule-based Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Experimental Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2.1 Module Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2.2 Attribute-based Study . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Comparison with State-of-the-art Trackers . . . . . . . . . . . . . . . . . . 47
3.3.4 Exemplary Sequences and Qualitative Analysis . . . . . . . . . . . . . . . 53
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 4: GUSOT: Green and Unsupervised Single Object Tracking for Long Video
Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Lost Object Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2 Color-Saliency-Based Shape Proposal . . . . . . . . . . . . . . . . . . . . 62
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3.1 Contribution of Modules . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3.2 Impact of Baseline Trackers . . . . . . . . . . . . . . . . . . . . 68
4.3.3.3 Attribute-based Study . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 5: GOT: Unsupervised Green Object Tracker without Offline Pre-training . . . . . 71
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Global Object-based Correlator . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.2 Local Patch-based Correlator . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.2.1 Feature Extraction and Selection . . . . . . . . . . . . . . . . . . 77
5.2.2.2 Patch Classification . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.2.3 From Heat Map to Bounding Box . . . . . . . . . . . . . . . . . 79
5.2.2.4 Classifier Update . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.3 Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.4 Fusion of Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.4.1 Two Fusion Strategies . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.4.2 Tracking Quality Control . . . . . . . . . . . . . . . . . . . . . . 86
5.2.4.3 Object Re-identification . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1.2 Benchmarking Object Trackers . . . . . . . . . . . . . . . . . . 89
5.3.1.3 Tracking Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 90
5.3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.3 Comparison among Lightweight Trackers . . . . . . . . . . . . . . . . . . 92
5.3.4 Long-term Tracking Capability . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.5 Attribute-based Performance Evaluation . . . . . . . . . . . . . . . . . . . 94
5.3.6 Insights into New Ingredients in GOT . . . . . . . . . . . . . . . . . . . . 97
5.3.6.1 Local Correlator . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3.6.2 Fuser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Discussion on Limitations of GOT . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Model Size and Complexity Analysis of GOT . . . . . . . . . . . . . . . . . . . . 103
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Chapter 6: Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
List of Tables
2.1 Performance of UHP-SOT, UHP-SOT-I, UHP-SOT-II and STRCF on the TB-100
dataset, where AUC is used for success rate. . . . . . . . . . . . . . . . . . . . . . 35
3.1 All tracking scenarios are classified into 8 cases in terms of the overall quality of proposals from three modules. The fusion strategy is set up for each scenario. The update rate is related to the regularization coefficient, µ, which controls to what extent the appearance model should be updated. . . . . . . . . . . . . . . . . . . . 42
3.2 Comparison of state-of-the-art supervised and unsupervised trackers on four
datasets, where the performance is measured by the distance precision (DP) and
the area-under-curve (AUC) score in percentage. The model size is measured in
MB by the memory required to store needed data such as the model parameters of
pre-trained networks. The best unsupervised performance is highlighted. Also, S,
P, G and C indicate Supervised, Pre-trained, GPU and CPU, respectively. . . . . . 52
4.1 Comparison of unsupervised and supervised trackers on LaSOT, where S and P indicate Supervised and Pre-trained, respectively. Backbone denotes the pre-trained feature extraction network. The first group are unsupervised lightweight trackers, the second group are unsupervised deep trackers, and the third group are supervised deep trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Ablation study of module contribution of GUSOT on LaSOT. w. motion
refers to baseline + lost object recovery, and w. shape refers to baseline +
color-saliency-based shape proposal. . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Performance gain of two new modules on different baselines. . . . . . . . . . . . . 69
5.1 Comparison of tracking accuracy and model complexity of representative trackers
of four categories on four tracking datasets. Some numbers for model complexity
are rough estimations due to the lack of detailed description of network structures
and/or public domain codes. Furthermore, the complexity of some algorithms is
related to built-in implementation and hardware. OPT and UT are abbreviations
for offline pre-training and unsupervised trackers, respectively. The top 3 runners
among all unsupervised trackers (i.e., those in the last two categories) are
highlighted in red, green, and blue, respectively. . . . . . . . . . . . . . . . . . . . 91
5.2 Comparison of design methods and pre-training costs of GOT and three
lightweight DL trackers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Performance comparison of GOT and GOT∗
against three lightweight long-term
trackers, KCF (the long-term version), TLD and FuCoLoT, on the OxUvA dataset,
where the best performance is shown in red. KCF and TLD are implemented in
OpenCV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4 Performance comparison of GOT under different settings on the LaSOT dataset,
where the best performance is shown in red. The ablation study includes: 1) with
or without the local correlator branch; 2) with or without classifier update; 3) with
or without object re-identification. . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Ablation study of the classification system in the local correlator branch in GOT
on LaSOT under three settings, where the best performance is shown in red. . . . . 99
5.6 Flops of the Saab feature extraction for one spatial block of size 8×8. . . . . . . . 105
5.7 The model size and the computational complexity of the whole GOT system. . . . 105
5.8 The estimated flops for some special algorithmic modules. . . . . . . . . . . . . . 106
List of Figures
1.1 The tracking by detection diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Examples of challenges in video object tracking from [107]. . . . . . . . . . . . . 6
1.3 Summary of some popular single object tracking benchmark datasets. Figure is
from [41]. Datasets involved are OTB-2013 [133], OTB-2015 [134], TC-128
[76], NUS-PRO [71], UAV123 [92], UAV20L [92], CDTB [83], VOT-2017 [63],
GOT-10k [53], and LaSOT [41]. The circle diameter is proportional to the total
number of frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Examples of videos in OTB from [134]. . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Examples of videos in TC128 from [76]. The numbers listed along with the
sequence names are challenge factors assessed by the author of the dataset. . . . . 10
1.6 Examples of videos in UAV123 from [92]. . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Examples of videos in LaSOT from [41]. . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 The proposed Siamese region proposal network (RPN) in [73]. There are two
branches inside the RPN, the classification branch and the regression branch. . . . 14
1.9 The comparison between supervised and unsupervised deep training from [121]. . . 16
1.10 Generate object boxes via optical flow from [153]. . . . . . . . . . . . . . . . . . . 16
1.11 The inference structure for DCF. The red box is the object bounding box. φ is the
feature extraction module and * denotes the correlation operation. . . . . . . . . . 17
1.12 Example of pixel-wise colornames annotation from [131]. The colornames are
represented by the corresponding linguistic color label. . . . . . . . . . . . . . . . 18
2.1 The spatial weight w helps suppress learning from background information. w
assigns higher penalties on values outside of the object bounding box. Figure
modified from [33]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 The system diagram of the proposed UHP-SOT method. In the example, the
object was lost at time t −1 but gets retrieved at time t because the proposal from
background motion modeling is accepted. . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Comparison of the tracking performance of STRCF (red) and UHP-SOT (green),
where the results of UHP-SOT are closer to the ground truth and those of STRCF
drift away. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Example of object proposal from background motion modeling. . . . . . . . . . . 28
2.5 Illustration of shape change estimation based on background motion model and
the trajectory-based box prediction, where the ground truth and our proposal are
in green and blue, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 The precision plot and the success plot on TB-50 and TB-100. To rank different
methods, the distance precision is measured at 20-pixel threshold, and the overlap
precision is measured by the AUC score. We collect other trackers’ raw results
from the official websites to generate results. . . . . . . . . . . . . . . . . . . . . . 32
2.7 Qualitative evaluation of UHP-SOT, STRCF [74], ECO [30] and SiamRPN++
[72] on 10 challenging videos from TB-100. They are (from left to right and top
to down): Trans, Skiing, MotorRolling, Coupon, Girl2, Bird1, KiteSurf, Bird2,
Diving, Jump, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Area-under-curve (AUC) score for attribute-based evaluation on the TB-100
dataset, where the 11 attributes are background clutter (BC), deformation (DEF),
fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low
resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation
(OPR), out-of-view (OV), and scale variation (SV), respectively. . . . . . . . . . . 35
3.1 The system diagram of the proposed UHP-SOT++ method. It shows one example
where the object was lost at time t − 1 but gets retrieved at time t because the
proposal from background motion modeling is accepted. . . . . . . . . . . . . . . 38
3.2 Illustration of occlusion detection, where the green box shows the object location.
The color information and similarity score could change rapidly if occlusion occurs. 40
3.3 An example of quality assessment of proposals, where the green box is the ground
truth, and yellow, blue and magenta boxes are proposals from Bapp, Btrj and Bbgd,
respectively, and the bright yellow text on the top-left corner denotes the quality
of three proposals (isGoodapp, isGoodtrj, isGoodbgd). . . . . . . . . . . . . 42
3.4 The precision plot and the success plot of our UHP-SOT++ tracker with different
configurations on the TC128 dataset, where the numbers inside the parentheses
are the DP values and AUC scores, respectively. . . . . . . . . . . . . . . . . . . . 44
3.5 Failure cases of UHP-SOT++ (in green) as compared to UHP-SOT (in red) on
OTB2015. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 The area-under-curve (AUC) scores for two datasets, TC128 and LaSOT, under
the attribute-based evaluation, where attributes of concern include the aspect
ratio change (ARC), background clutter (BC), camera motion (CM), deformation
(DEF), fast motion (FM), full occlusion (FOC), in-plane rotation (IPR),
illumination variation (IV), low resolution (LR), motion blur (MB), occlusion
(OCC), out-of-plane rotation (OPR), out-of-view (OV), partial occlusion (POC),
scale variation (SV) and viewpoint change (VC), respectively. . . . . . . . . . . . 46
3.7 The success plot and the precision plot of ten unsupervised tracking methods
for the LaSOT dataset, where the numbers inside the parentheses are the overlap
precision and the distance precision values, respectively. . . . . . . . . . . . . . . 47
3.8 Qualitative evaluation of three leading unsupervised trackers, where UHP-SOT++
offers a robust and flexible box prediction. . . . . . . . . . . . . . . . . . . . . . . 48
3.9 The success plot comparison of UHP-SOT++ with several supervised and
unsupervised tracking methods on four datasets, where only trackers with raw
results published by authors are listed. For the LaSOT dataset, only supervised
trackers are included for performance benchmarking in the plot since the success
plot of unsupervised methods is already given in Figure 3.7. . . . . . . . . . . . . 49
3.10 Qualitative comparison of top runners against the LaSOT dataset, where tracking
boxes of SiamRPN++, UHP-SOT++, ECO and ECO-HC are shown in red,
green, blue and yellow, respectively. The first two rows show sequences in
which SiamRPN++ outperforms others significantly while the last row offers the
sequence in which SiamRPN++ performs poorly. . . . . . . . . . . . . . . . . . . 53
3.11 Illustration of three sequences in which UHP-SOT++ performs the best. The
tracking boxes of SiamRPN++, UHP-SOT++, ECO and ECO-HC are shown in
red, green, blue and yellow, respectively. . . . . . . . . . . . . . . . . . . . . . . . 54
4.1 An overview of the proposed GUSOT tracker, where the red and blue boxes
denote the baseline and the motion proposals, respectively. The one with higher
appearance similarity is chosen to be the location proposal. Then, the third
proposal, called the shape proposal, is used to adjust the shape of the location
proposal. The final predicted box is depicted by the yellow box. . . . . . . . . . . 58
4.2 Comparison of two sampling schemes and their segmentation results, where green
and red dots represent foreground and background initial points, respectively. The
red box is the reference box which indicates the object location. . . . . . . . . . . 61
4.3 Determination of salient color keys. The red box is the reference box which
indicates the object location. From left to right, the figures show the original object
patch, normalized histogram of color keys inside the bounding box, normalized
histogram of color keys outside of the bounding box, the weighted difference
between the two histograms, and visualization of the salient foreground color area,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Illustration of shape proposal derivation based on superpixel segmentation, where
the red, blue and yellow boxes correspond to the baseline, motion, and shape
proposals, respectively. The size of the motion proposal here is determined by
clipping integral curves horizontally and vertically on the motion residual map. . . 64
4.5 Qualitative evaluation of GUSOT, UHP-SOT++, USOT and SiamFC. From top to
bottom, the sequences presented are pool-12, bottle-14, umbrella-2, airplane-15,
person-10 and yoyo-17, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Attribute-based evaluation of GUSOT, UHP-SOT++ and USOT on LaSOT in
terms of DP and AUC, where attributes of interest include the aspect ratio change
(ARC), background clutter (BC), camera motion (CM), deformation (DEF), fast
motion (FM), full occlusion (FOC), illumination variation (IV), low resolution
(LR), motion blur (MB), occlusion (OCC), out-of-view (OV), partial occlusion
(POC), rotation (ROT), scale variation (SV) and viewpoint change (VC). . . . . . . 69
5.1 Comparison of object trackers in the number of model parameters (along the
x-axis), the AUC performance (along the y-axis) and inference complexity in
floating point operations (in circle sizes) with respect to the LaSOT dataset. . . . . 72
5.2 The system diagram of the proposed green object tracker (GOT). The global
object-based correlator generates a rigid proposal, while the local patch-based
correlator outputs a deformable box and an objectness score map which helps the
segmentator calculate additional deformable boxes. These proposals are fused
into one final prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Channel-wise Saab transformation on color residuals of a patch of size N ×N. . . . 78
5.4 (Top) Visualization of the evolution of templates over time and (bottom)
visualization of the noise suppression effect in the raw probability map based on
Eq. (5.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Proper updating helps maintain decent classification quality. . . . . . . . . . . . . 82
5.6 Management of different tools for tracking, where S.E. denotes the shape
estimation function provided by the second branch. . . . . . . . . . . . . . . . . . 86
5.7 The fusion strategy (given in the fourth row, where S.F. stands for simple fusion)
changes with tracking dynamics over time. The DCF proposal, the objectness
proposal, and the superpixel proposal are given in the first, second, and third rows,
respectively. The sequence is motorcycle-9 from LaSOT and the object is the
motorbike. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 The TPR, TNR and MaxGM values of GOT∗
at different present/absent threshold
values against the OxUva dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.9 Attribute-based evaluation of GOT, GUSOT and USOT on LaSOT in terms of
DP and AUC, where attributes of interest include the aspect ratio change (ARC),
background clutter (BC), camera motion (CM), deformation (DEF), fast motion
(FM), full occlusion (FOC), illumination variation (IV), low resolution (LR),
motion blur (MB), occlusion (OCC), out-of-view (OV), partial occlusion (POC),
rotation (ROT), scale variation (SV) and viewpoint change (VC). . . . . . . . . . . 96
5.10 Comparison of the tracked object boxes of GOT, GUSOT, USOT, and SiamFC for
four video sequences from LaSOT (from top to bottom: book, flag, zebra, and
cups). The initial appearances are given in the first (i.e., leftmost) column. The
tracking results for four representative frames are illustrated. . . . . . . . . . . . . 96
5.11 Mean IoU (the higher the better) and center error (the lower the better) on the
selected subset with different fusion thresholds. The subset from LaSOT includes
book-10, bus-2, cat-1, crocodile-14, flag-5, flag-9, gorilla-6, person-1, squirrel-19,
mouse-17. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.12 Performance comparison between GOT (orange), SiamRPN++ (red), SiamFC
(cyan), and USOT (blue) on LaSOT in terms of the success rate plot (left) and the
AUC plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.13 Deformation-related attribute study on LaSOT for the first 10% frames in videos. . 102
5.14 Two tracking examples from LaSOT: bear-4 (top) and bicycle-2 (bottom), with
boxes of the ground truth (green), SiamRPN++ (red), SiamFC (cyan), USOT
(blue), and GOT (orange), respectively. . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 While being initialized with the rectangular bounding box, SiamMask is able to
generate both segmentation mask and oriented bounding boxes for more accurate
representation of the object location. Red boxes are outputs from ECO [30] for
comparison. Figure is from [126]. . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Abstract
Video object tracking is one of the fundamental problems in computer vision and has diverse real-world applications such as video surveillance and robotic vision. We focus on online single object tracking, where the tracker is expected to track the object given in the first frame and cannot exploit any future information to infer the object location in the current frame.
Supervised tracking methods have been widely investigated, and deep-learning-based trackers achieve huge success in terms of tracking accuracy. However, the requirement of a large amount of labelled data is demanding, and the reliance on ground-truth supervision casts doubt on the reliability of the algorithm when it operates on unseen objects. Even though some endeavors have been made toward unsupervised tracking in recent years and those works demonstrate promising progress, the performance of unsupervised trackers is still far behind that of supervised trackers. In addition, recent pioneering unsupervised works rely on deep feature extraction networks and require large-scale pre-training on offline video datasets. The power consumption and large model size limit their application on resource-limited platforms such as autonomous drones and mobile phones. While there have been a few research works that compress the large network via neural architecture search and quantization, they focus on tuning from a well-trained large tracker in a supervised manner, which makes the training process of a lightweight tracker expensive. Therefore, there is still a lack of an explainable and efficient design methodology for lightweight high-performance trackers, and the necessity and role of supervision in offline training have not been well investigated yet.
To narrow the gap between supervised and unsupervised trackers and to encourage the application of real-time tracking on edge devices, we propose a series of green unsupervised trackers that are lightweight enough to run on CPUs at near real-time speed while still achieving high tracking performance. In the unsupervised high-performance single object tracker (UHP-SOT), we propose a background motion modeling module and a trajectory-based box prediction module, which are integrated with the baseline through a simple fusion strategy. It achieves performance on par with state-of-the-art deep supervised trackers on OTB2015 [134]. In the follow-up work UHP-SOT++, we further enhance UHP-SOT with an improved fusion strategy and a more thorough empirical study of its performance and behavior across different benchmark datasets, including the small-scale OTB2015 and TC128 [76] and the large-scale UAV123 [92] and LaSOT [41]. Clear improvement is observed on all datasets, and insights on the gap between deep supervised trackers and lightweight unsupervised trackers are provided. In the green unsupervised single object tracker (GUSOT), we extend these lightweight designs to tracking in long video sequences and introduce a lost object discovery module and an efficient segmentation-based box refinement model to further boost the tracking accuracy in the long run. The proposed model outperforms or achieves comparable performance with state-of-the-art deep unsupervised trackers that require large models and pre-training. Finally, in the green object tracker (GOT), we model the tracking process as an ensemble of three branches for robust tracking: the global object-based correlator, the local patch-based correlator, and the superpixel segmentator. The outputs from the three branches are then fused to generate the final object box with an innovative fusion strategy. The designed modules and mechanisms further exploit the spatial and temporal correlation of object appearances at different granularities, offering tracking accuracy competitive with state-of-the-art unsupervised trackers that demand heavy offline pre-training, at a lower computational cost. GOT has a tiny model size (<3k parameters) and low inference complexity (around 58M FLOPs per frame).
Chapter 1
Introduction
1.1 Significance of the Research
Video object tracking is a fundamental problem in computer vision and has a wide range of applications such as video surveillance [136, 47, 12, 127], autonomous navigation [54, 40, 69, 1], robotic vision [24, 118, 19, 148], etc. A popular branch of visual tracking is single object tracking, where the object marked in the first frame is tracked through the whole video. Given a bounding box on the target object at the first frame, a tracker has to predict object box locations and sizes for all remaining frames in online single object tracking (SOT). Deep-learning-based trackers, called deep trackers, have been popular in the last seven years. Supervised deep trackers have been intensively studied [30, 35, 87, 98, 7, 123, 112, 73, 72, 81, 95, 97, 108]. Their superior performance is achieved by exploiting a large amount of offline labeled data. While supervision is powerful in guiding the learning process, it casts doubt on the reliability of tracking unseen objects. In addition, a large number of annotated tracking video clips are needed for training, which is a laborious and costly task. Unsupervised deep trackers [120, 132, 153, 105, 157, 159] have been developed to address this concern in recent years.
Research on supervised and unsupervised deep trackers has primarily focused on tracking performance. The high performance of deep trackers is accompanied by high computational complexity and a huge memory cost: due to their large model sizes, they demand a large memory space to store the parameters of deep networks. Generally, they are difficult to deploy on resource-limited platforms such as mobile and edge devices, where there is no strong computational device such as a GPU. Specific examples include drones [17, 160, 38, 161], mobile phones [70, 39], etc. To lower the high computational resource requirement, research has been done to compress the model via neural architecture search [141], model distillation [104], or network pruning and quantization [9, 21, 11, 56, 3], yet with a focus on tuning from existing state-of-the-art trackers. Furthermore, current state-of-the-art trackers are mainly short-term trackers since they cannot recover from object tracking loss automatically, which occurs frequently in long-video tracking scenarios. Therefore, unsupervised long-term tracking in a resource-constrained environment has not been fully investigated and remains a challenging problem.
1.2 Background of the Research
1.2.1 Problem Definition and General Methodology
Tracking can be divided into different branches according to the application requirement. For example, single object tracking focuses on one object, while multi-object tracking needs to keep track of more than one object in the scene. Offline tracking is widely used in video understanding of recorded videos, while online tracking is expected in applications where the information needs to be processed on the fly. We focus on the online single object tracking problem, where the tracker is expected to track the object given in the first frame, marked by a rectangular bounding box x0. The object location and size at later frames are likewise represented by rectangular bounding boxes xt. Here, online means that the tracker cannot exploit any future information, such as video frames from future time steps, to infer the object location in the current frame.
Object tracking in recent decades follows the general methodology of tracking by detection, which conducts frame-by-frame tracking, as shown in Figure 1.1. An appearance model, also known as the template, is initialized from the first frame and then maintained through the tracking process for that video sequence to perform template matching. In each frame to be tracked, the tracker first detects possible locations of the object and then uses the appearance model to verify which candidate is the most similar to the true target. It then proceeds to the next frame with the same process. It is as if we were detecting the object in each new frame, but this is fundamentally different from object detection: we only track the specified object in the scene, and temporal information in the video can be exploited. In contrast, object detection aims to detect all objects in the scene and is applied to different images independently. The motion estimation step refers to the proposal of possible object locations. Given the predicted bounding box xt−1 in frame (t−1), the search region of the object location in frame t is determined based on xt−1. Some early works use particle filters [45, 44] to propose possible new locations of the object, where the relationship between the bounding box and previously seen appearances is modeled by a probability model and only box candidates with high probabilities are checked. The number of candidates to check should be handled carefully, because a small number may reduce tracking accuracy while a large number would rapidly slow down the tracking process. Recent works tend to simply crop out a rectangular region centered at xt−1 to search for the object. This region is called the search region, and its size is usually around five times the object area to control the computational complexity. The underlying assumption is that the movement of the object between adjacent frames is not drastic, so the object is still near the last seen location.
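To make the search-region step above concrete, the short sketch below crops a square region with roughly five times the object area, centered at the previous prediction. It is only an illustrative sketch; the box convention, zero-padding behavior, and function names are our own choices, not the thesis implementation.

```python
import numpy as np

def crop_search_region(frame, box, area_scale=5.0):
    """Crop a square search region centered at the previous box.

    frame: H x W x 3 image array.
    box:   (cx, cy, w, h) of the previous prediction x_{t-1} (center format).
    area_scale: search-region area relative to the object area (~5x here).
    Out-of-frame parts are zero-padded so the crop size is always the same.
    """
    cx, cy, w, h = box
    side = int(round(np.sqrt(area_scale * w * h)))       # square region with ~5x the object area
    x0, y0 = int(round(cx - side / 2)), int(round(cy - side / 2))

    region = np.zeros((side, side, frame.shape[2]), dtype=frame.dtype)
    xs, ys = max(0, x0), max(0, y0)                       # clipped source corner
    xe, ye = min(frame.shape[1], x0 + side), min(frame.shape[0], y0 + side)
    region[ys - y0:ye - y0, xs - x0:xe - x0] = frame[ys:ye, xs:xe]
    return region, (x0, y0)                               # offset maps responses back to frame coords

# Example: a 720p frame and a 40x80 object centered at (300, 200).
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
patch, offset = crop_search_region(frame, (300, 200, 40, 80))
print(patch.shape, offset)
```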
1.2.2 Evaluation of Tracking Performance
Performance evaluation of tracking accuracy is usually conducted using the "One-Pass Evaluation (OPE)" protocol as in [134], where the tracking process is initiated only once, starting from the first frame, and there is no reset using the ground truth when the tracker loses the object. Evaluating how good the predicted bounding box xt is against the ground truth xt,G mainly involves two aspects: center precision and overlap precision. The center precision is measured by the distance between the predicted target center ct and the ground-truth center ct,G:
$$\delta_t = \lVert c_t - c_{t,G} \rVert. \quad (1.1)$$
Figure 1.1: The tracking by detection diagram.
The overlap precision is measured as the intersection-over-union between xt and xt,G:
$$\phi_t = \frac{|x_t \cap x_{t,G}|}{|x_t \cup x_{t,G}|}. \quad (1.2)$$
δt is a non-negative number, and a smaller value indicates a more accurately predicted center. φt falls between 0 and 1, and a larger value indicates higher box accuracy.
Given a threshold α, the fraction of video frames that satisfy δt ≤ α or φt ≥ α can be computed for each test video and then averaged over all test videos in the dataset. Collecting these ratios at different thresholds, we can draw the precision plot (i.e., the distance between the predicted and ground-truth box centers) and the success plot (i.e., overlap ratios at various thresholds) [15]. The distance precision (DP) is measured at the 20-pixel threshold to rank different methods. The overlap precision is measured by the area-under-curve (AUC) score of the success plot. Some short-term tracking benchmarks [62, 65, 61, 64] reset the tracker if it is detected to fail for a period. In that case, another evaluation metric, the expected average overlap (EAO), is used; it measures the expected no-reset overlap precision of a tracker in short-term tracking. Besides the accuracy-related metrics mentioned above, other metrics such as frames per second (FPS) for speed evaluation and model size for complexity evaluation are also used.
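For concreteness, the sketch below computes the per-frame center error and IoU of Eqs. (1.1)-(1.2) and turns them into the DP-at-20-pixels and AUC numbers described above. It is a simplified illustration; the threshold grid and helper names are our own choices rather than the code of any benchmark toolkit.

```python
import numpy as np

def center_error(pred, gt):
    """pred, gt: (N, 4) boxes as (x, y, w, h); returns delta_t per frame (Eq. 1.1)."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def iou(pred, gt):
    """Intersection-over-union per frame (Eq. 1.2)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def ope_metrics(pred, gt):
    delta, phi = center_error(pred, gt), iou(pred, gt)
    dp20 = np.mean(delta <= 20)                                   # distance precision at 20 pixels
    overlap_thresholds = np.linspace(0, 1, 21)
    success = [np.mean(phi >= t) for t in overlap_thresholds]     # samples of the success plot
    auc = np.mean(success)                                        # area under the success curve
    return dp20, auc

# Toy example with two frames.
pred = np.array([[10, 10, 50, 80], [12, 14, 48, 76]], dtype=float)
gt   = np.array([[12, 11, 50, 80], [30, 40, 50, 80]], dtype=float)
print(ope_metrics(pred, gt))
```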
1.2.3 Challenges
Tracking could be easy if the appearance of the object stayed very similar across different video frames. However, in real-world applications, the appearance of the target object can vary a lot for various reasons: aspect ratio change (ARC), background clutter (BC), camera motion (CM), deformation (DEF), fast motion (FM), full occlusion (FOC), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV), partial occlusion (POC), scale variation (SV), viewpoint change (VC), etc. Some visualization examples of these challenges are provided in Figure 1.2. While challenges such as illumination variation and rotation can be harder to handle for handcrafted features than for deep features, occlusion and background clutter remain challenging even when a deep network is adopted. These challenges may bring drastic changes to the raw pixels or features of the object patch, making it difficult for the appearance template to match the new object appearance. If the interference is so severe that the predicted box drifts far away from the true object location and the tracker is not equipped with any re-detection module, the object will most likely be lost in all subsequent frames.
In addition, the size of the target object also matters. Small objects, whose area can be smaller than 100 pixels, are generally much harder to track than middle-size or large-size objects, because much of the information, such as the object's texture, is blurred at such a small size and cannot be recovered by interpolation.
Figure 1.2: Examples of challenges in video object tracking from [107].
1.2.4 Datasets
Visual tracking has a history of more than two decades, but early works at the beginning of this century were mainly tested on a small number of sequences. Over the last decade, starting with the release of the first benchmark dataset OTB-2013 [133] and its extension OTB-2015 [134] (also named TB100), more and more public datasets have been collected from diverse sources and released. Figure 1.3 summarizes some popular benchmark datasets according to their number of
videos, video lengths, and number of frames. VOT-2017 is a small-scale dataset and is mainly
targeted for short-term tracking. LaSOT is a large-scale dataset and is composed of long video
sequences. There are other large-scale datasets that are not included in the figure but have gained more
and more attention in recent years, such as the TrackingNet [94] dataset for short-term tracking
and the OxUva [116] dataset for long-term tracking in the wild.
Beyond differences in scale and size, these datasets also come from quite different collection sources and serve different application purposes. For example, while the OTB series and VOT series contain many sports or person-related videos where the target object tends to be of a proper size and seldom goes out of view, the videos in UAV123 are shot from cameras mounted on low-altitude unmanned aerial vehicles (UAVs). Hence, the objects in UAV123 are usually quite small compared to the size of the whole scene and are more likely to become fully occluded. Large-scale datasets such as LaSOT and GOT-10k are composed of natural real-world videos and tend to cover more diverse object classes, shape variations, and motion trajectories. Video sequences in OxUva tend to be long and contain many periods of target absence in order to test robust long-term tracking ability. In our work, we conduct experiments on seven datasets, including OTB-2015, TC128, UAV123, VOT, TrackingNet, OxUva and LaSOT, to test our model on both small-scale and large-scale data and to demonstrate the feasibility of application in different scenarios.
1.2.5 Supervised Trackers
Supervised deep trackers [129, 30, 35, 87, 98, 7, 123, 112, 81, 95, 97, 108] have progressed a lot in the last seven years and offer state-of-the-art tracking accuracy. Deep trackers often use a pre-trained network such as AlexNet [66], VGG [106, 16] or ResNet50 [50] as the feature extractor and do online tracking with the extracted deep features [30, 35, 87, 98, 7, 123, 112]. Others adopt an end-to-end optimized model trained on video datasets in an offline manner [95, 73, 72] that can be adapted to video frames in an online fashion [81, 95, 97, 108].
Figure 1.3: Summary of some popular single object tracking benchmark datasets. Figure is from [41]. Datasets involved are OTB-2013 [133], OTB-2015 [134], TC-128 [76], NUS-PRO [71], UAV123 [92], UAV20L [92], CDTB [83], VOT-2017 [63], GOT-10k [53], and LaSOT [41]. The circle diameter is proportional to the total number of frames.
Figure 1.4: Examples of videos in OTB from [134].
Figure 1.5: Examples of videos in TC128 from [76]. The numbers listed along with the sequence names are challenge factors assessed by the author of the dataset.
Figure 1.6: Examples of videos in UAV123 from [92].
Figure 1.7: Examples of videos in LaSOT from [41].
The tracking problem is formulated
as a template matching problem in Siamese trackers [6, 73, 72, 113, 162, 125, 49, 37, 52, 144, 150],
which is popular because of its simplicity and effectiveness. The two branches of the Siamese network conduct feature extraction for the template and the search region, respectively. Usually, the
object patch of the first frame serves as the template for search in all later frames. Then, the two
feature maps are passed to a convolutional layer for correlation calculation to locate the object.
The shape and size of the predicted object bounding box are determined by the region proposal network [101] inside the Siamese network. Other regression trackers include [86, 108]. Classification networks can also be plugged in to distinguish target objects from background clutter [57, 95, 97, 109]. SiamRPN [73] is one of the pioneering works which set a new milestone in
supervised tracking performance. It modifies the traditional region proposal network used in detection [27, 78, 80, 100] and proposes the Siamese region proposal network. As shown in Figure 1.8,
it contains both the classification branch and the box regression branch.
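The core Siamese matching operation, cross-correlating the template feature map with the search-region feature map to obtain a response map, can be sketched in a few lines. The snippet below is a schematic illustration with random features and a naive sliding-window implementation; it is not code from SiamFC, SiamRPN, or any other cited tracker.

```python
import numpy as np

def xcorr_response(template, search):
    """Dense cross-correlation of a template feature map over a search feature map.

    template: (C, th, tw) features of the exemplar patch.
    search:   (C, sh, sw) features of the search region (sh >= th, sw >= tw).
    Returns a (sh - th + 1, sw - tw + 1) response map whose peak gives the match.
    """
    C, th, tw = template.shape
    _, sh, sw = search.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search[:, i:i + th, j:j + tw]
            out[i, j] = np.sum(window * template)    # correlation summed over all channels
    return out

# Toy example: 8-channel features, a 6x6 template planted inside a 22x22 search map.
rng = np.random.default_rng(0)
search = rng.standard_normal((8, 22, 22))
template = search[:, 9:15, 9:15].copy()              # the "object" sits at offset (9, 9)
resp = xcorr_response(template, search)
print(np.unravel_index(resp.argmax(), resp.shape))   # expected peak at (9, 9)
```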
One recent trend is to apply the Vision Transformer in visual tracking [124, 22, 140, 111,
14, 89, 77, 152, 110, 55, 135]. The transformer network could either be trained on video tracking
datasets only or be finetuned based on pre-trained weights learned from more general tasks. Due to
the huge number of parameters, transformers tend to be more data-hungry and pre-trained weights
are usually needed to boost the performance.
The majority of deep learning trackers rely on powerful yet heavy backbones (e.g., CNNs or
transformers). To deploy them on resource-limited platforms, efforts have been made in compressing a network without degrading its tracking accuracy. Various approaches have been proposed
such as neural architecture search (NAS) [141], model distillation [104], network quantization
[56], feature sparsification, channel pruning [3], or other specific designs to reduce the complexity
of original network layers [21, 9, 11].
Figure 1.8: The proposed Siamese region proposal network (RPN) in [73]. There are two branches inside the RPN, the classification branch and the regression branch.
As a pioneering method, LightTrack [141] laid the foundation for later lightweight trackers. LightTrack adopted NAS to compress a large-size supervised tracker into its mobile counterpart. Its design process involved the following three steps: 1) train
a supernet, 2) search for its optimal subnet, and 3) re-train and tune the subnet on a large amount
of training data. The inference complexity can be reduced to around 600M flops per frame at the
end. Simply speaking, it begins with a well-designed supervised tracker and attempts to reduce
the model size and complexity with re-training. Another example was proposed in [104]. It also
conducted NAS to find a small network and used it to distill the knowledge of a large-size tracker
via teacher-student training.
Our work is completely different from deep lightweight trackers as the proposed method does
not have an end-to-end optimized neural network architecture. It adopts a modularized and interpretable system design. It is unsupervised without offline pre-training.
1.2.6 Unsupervised Single Object Tracking
1.2.6.1 Unsupervised Deep Trackers
There is an increasing interest in learning deep trackers from offline videos without annotations [120]. This type of work usually has network architectures similar to supervised deep trackers but needs to design self-supervised objectives for pre-training on large-scale unlabelled video data. Recent works adopt the forward-and-backward strategy, which is also widely applied in other video applications [91, 128, 156, 155, 154, 147] and was used by some early tracking works for detecting tracking failures or assessing tracking quality [58, 68, 114, 94, 99]. This strategy
is revisited in neural-network-based visual tracking for offline pre-training. For example, UDT+
[121] and LUDT [122] investigated cycle learning in video, in which networks are trained to track
forward and backward with consistent object proposals. As shown in Figure 1.9, compared with
supervised training where the ground truth bounding box is needed in every frame, cycle training
only needs the ground truth in the first frame and the network is expected to track forward and
then find its own way back to the same object proposal. ResPUL [132] mined positive and negative samples from unlabeled videos and leveraged them for supervised learning in building spatial
and temporal correspondence. These unsupervised deep trackers reveal a promising direction in
exploiting offline videos without annotations. Yet, they are limited in performance. Furthermore,
they need the pre-training effort. In contrast, no pre-training on offline datasets is needed in our
unsupervised tracker. Recently, an effective data sampling strategy, which samples moving objects
in offline training using optical flow and dynamic programming, was adopted by USOT in [153].
The motion cue was leveraged by USOT for object tracking, as shown in Figure 1.10. The difference between USOT and UHP-SOT/UHP-SOT++ is that USOT uses motion to mine samples
offline with dense optical flow while we focus on online object tracking with lightweight motion
processing. The recent state-of-the-art ULAST [105] improves the cycle training process by further exploiting intermediate training frames and selecting better features and pseudo labels. Yet, it
does not target lightweight applications and still needs heavy backbones and pre-training.
Figure 1.9: The comparison between supervised and unsupervised deep training from [121].
Figure 1.10: Generation of object boxes via optical flow from [153].
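The forward-and-backward (cycle) idea above can be summarized as: track forward through a clip, track backward to the first frame, and penalize the disagreement with the starting box. The sketch below is purely illustrative and assumes a hypothetical track() routine standing in for the matching network; it does not reproduce the loss of UDT+, LUDT, ULAST, or USOT.

```python
import numpy as np

rng = np.random.default_rng(0)

def track(feats_src, feats_dst, box):
    """Hypothetical one-step tracker: predicts the box in the destination frame.

    A real unsupervised deep tracker would use a differentiable matching network
    here; this stand-in ignores the features and just perturbs the box so that the
    example runs end to end.
    """
    return np.asarray(box, dtype=float) + rng.normal(scale=1.0, size=4)

def cycle_consistency_loss(clip_feats, init_box):
    """Track forward through the clip, then backward, and penalize the drift
    between the starting box and the box recovered after the round trip."""
    boxes = [np.asarray(init_box, dtype=float)]
    for src, dst in zip(clip_feats[:-1], clip_feats[1:]):               # forward pass
        boxes.append(track(src, dst, boxes[-1]))
    back = boxes[-1]
    for src, dst in zip(clip_feats[::-1][:-1], clip_feats[::-1][1:]):   # backward pass
        back = track(src, dst, back)
    return float(np.sum((back - boxes[0]) ** 2))

clip = [rng.standard_normal((8, 16, 16)) for _ in range(4)]             # 4 frames of dummy features
print(cycle_consistency_loss(clip, [50.0, 40.0, 30.0, 60.0]))
```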
1.2.6.2 Unsupervised Lightweight Trackers
Unsupervised lightweight SOT methods often use discriminative correlation filters (DCFs). They
were investigated between 2010 and 2020 [10, 51, 32, 31, 35, 8, 5, 93, 115, 74, 137, 75]. As
shown in Figure 1.11, DCF trackers conduct template matching within a search region to generate the response map for object localization. The matched template in the next frame is centered at the location that has the highest response. In DCF, the template size is the same as that of the search region so that the Fast Fourier Transform (FFT) can be used to speed up the correlation process.
To learn the template, a DCF uses the initial object patch to obtain a linear template via regression:
$$\arg\min_{f} \; \frac{1}{2} \Big\| \sum_{d=1}^{D} x^{d} \ast f^{d} - y \Big\|^{2}, \quad (1.3)$$
where f is the template to be determined, $x \in \mathbb{R}^{N_x \times N_y \times D}$ is the spatial map of D features extracted from the object patch, ∗ is the feature-wise spatial convolution, and $y \in \mathbb{R}^{N_x \times N_y}$ is a centered Gaussian-shaped map that serves as the regression label.
this optimization problem has a closed form solution which can be quickly computed. Templates in
DCFs tend to contain some background information. Furthermore, there exists boundary distortion
caused by the 2D Fourier transform. To alleviate these side effects, it is often to weigh the template
with a window function to suppress background and image discontinuity [60, 33, 26, 59].
Figure 1.11: The inference structure for DCF. The red box is the object bounding box. φ is the
feature extraction module and * denotes the correlation operation.
At each time step t, a new template fnew can be learned from the regression. To link the new appearance with previously learned appearances, a linear combination with a constant ratio α between fnew and the template ft−1 used at time step t−1 is adopted to deliver the updated template ft:
$$f_t = \alpha f_{t-1} + (1 - \alpha) f_{\mathrm{new}}. \quad (1.4)$$
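For intuition, the sketch below implements a single-channel toy version of this pipeline: the closed-form Fourier-domain solution of the regression (with a small ridge term added for numerical stability, as is standard in MOSSE/KCF-style filters) and the running template update of Eq. (1.4). It is a simplified illustration under those assumptions, not the multi-channel STRCF solver used later in the thesis.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Centered Gaussian-shaped regression label y (peaks at the patch center)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2.0 * sigma ** 2))

def learn_template(x, y, lam=1e-2):
    """Closed-form Fourier-domain solution of the ridge regression (single channel)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (X * np.conj(X) + lam)       # Fourier-domain template

def update_template(F_prev, F_new, alpha=0.95):
    """Linear update of Eq. (1.4); the FFT is linear, so updating in the
    Fourier domain is equivalent to updating in the spatial domain."""
    return alpha * F_prev + (1.0 - alpha) * F_new

def respond(F, z):
    """Correlation response of the template on a new patch z; the peak locates the object."""
    return np.real(np.fft.ifft2(np.fft.fft2(z) * F))

patch = np.random.rand(64, 64)                           # toy single-channel feature map
y = gaussian_label(patch.shape)
F = learn_template(patch, y)
F = update_template(F, learn_template(patch + 0.01 * np.random.rand(64, 64), y))
print(np.unravel_index(respond(F, patch).argmax(), patch.shape))   # -> (32, 32), the center
```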
Unsupervised lightweight trackers usually use low-level vision features, including raw pixel values, histogram of oriented gradients (HOG) [29] and colornames (CN) [117, 36]. While HOG describes the local gradient distribution, CN maps RGB colors to linguistic color labels and forms a probabilistic eleven-dimensional vector representation that sums to 1. The mapping is learned from images retrieved with Google image search in [117]. A visualization example is provided in
Figure 1.12.
Figure 1.12: Example of pixel-wise colornames annotation from [131]. The colornames are represented by the corresponding linguistic color label.
1.3 Contribution of the Research
1.3.1 Unsupervised High-Performance Single Object Tracker
There are three main challenges in SOT: 1) significant change of object appearance, 2) loss of tracking, and 3) rapid variation of the object's location and/or shape. Deep-learning-based (DL-based) trackers have superior performance but demand supervision. Unsupervised trackers are attractive since they do not need the annotated boxes required to train supervised trackers. We examined the design of UHP-SOT (Unsupervised High-Performance Single Object Tracker) and its extension UHP-SOT++ to address the above issues.
• We introduce two new modules – background motion modeling and trajectory-based object
box prediction, to handle tracking loss recovery and box size adaptation.
• A systematic fusion rule is adopted to integrate proposals from the baseline and two new
modules into the final one.
• The proposed method performs competitively against previous unsupervised SOT methods
on both small and large scale datasets including OTB2015, TC128, UAV123 and LaSOT,
achieves a performance comparable with supervised DL-based SOT methods, and operates
at a fast speed (22.7-32.0 FPS on a CPU).
• We provide insights on the gap between deep supervised trackers and lightweight unsupervised trackers, and discuss their strengths and weaknesses via both qualitative and quantitative analysis on large-scale datasets.
1.3.2 Green Unsupervised Tracking for Long Video Sequences
State-of-the-art trackers require deep neural networks, which demand a large memory size and offline training, and most of them are short-term trackers since they cannot recover from object tracking loss automatically, which occurs frequently in long-video tracking scenarios. To tackle the unsupervised long-term tracking problem in a resource-constrained environment, we propose a green and unsupervised single-object tracker (GUSOT) in this work.
• We propose the lost object recovery module which helps recover a lost object with a set of
candidates by leveraging motion in the scene and selecting the best one with local/global
features (e.g., color information).
• We propose the color-saliency-based shape proposal module, which facilitates accurate and long-term tracking by generating bounding-box proposals of flexible shape for the underlying object using low-cost yet effective segmentation.
• GUSOT offers a lightweight tracking solution whose performance is better than or comparable with that of deep trackers, without any need for offline pre-training on large-scale
data.
1.3.3 Unsupervised Green Object Tracker without Offline Pre-training
Our research goal is to develop unsupervised, high-performance, and lightweight trackers, where lightweightness is measured by model size and inference computational complexity. Toward this objective, we have developed new trackers by extending DCF trackers, including UHP-SOT, UHP-SOT++ and GUSOT. They improve the tracking accuracy of DCF trackers greatly while maintaining their lightweight advantage. However, all the above-mentioned trackers model the object appearance from the global view, i.e., using features of the whole object for matching. They provide robust tracking results when the underlying object is distinctive from background clutter and does not undergo much deformation or occlusion. To further enhance robust tracking of general objects, we propose the green object tracker (GOT), which achieves better tracking accuracy with a low carbon footprint.
• We propose an ensemble tracker of three prediction branches for robust object tracking:
1) a global object-based correlator to predict the object location roughly, 2) a local patch-based correlator to build temporal correlations of small spatial units, and 3) a superpixel-based segmentator to exploit the spatial information (e.g., color similarity and geometrical
constraints) of the target frame.
• We propose a fuser that monitors the tracking quality and fuses different box proposals according to the tracking dynamics to ensure robustness against challenges while maintaining
a reasonable complexity.
• We discuss the role played by supervision and offline pre-training to shed light on our design.
1.4 Organization of the Thesis
The rest of the thesis is organized as follows. In Chapter 2, we propose an unsupervised high-performance single object tracker (UHP-SOT) which can handle object tracking loss and box shape
adaptation. In Chapter 3, we propose the extension UHP-SOT++ which has a more systematic
and justified fusion strategy and is extensively tested on both small and large scale datasets. In
Chapter 4, we propose a green, lightweight and competitive single object tracker which can achieve
better performance on long video sequences with minimal computational cost. In Chapter 5, we
propose a green object tracker that handles general objects better at a low computation cost by
further exploiting spatial and temporal correlation. Finally, concluding remarks and future research
directions are given in Chapter 6.
Chapter 2
UHP-SOT: Unsupervised High-Performance Single Object
Tracker
2.1 Introduction
Discriminative-correlation-filter (DCF) based tracking provides an efficient solution for single object tracking by exploiting the correlation among adjacent and previous frames, but it may suffer from overfitting and tracking loss under large deformations. There is still a gap between the performance of DCF trackers and that of supervised trackers.
In this work, we examine the design of an unsupervised high-performance tracker and name it
UHP-SOT (Unsupervised High-Performance Single Object Tracker). UHP-SOT consists of three
modules: 1) appearance model update, 2) background motion modeling, and 3) trajectory-based
box prediction. Previous unsupervised trackers pay attention to efficient and effective appearance
model update. Built upon this foundation, an unsupervised discriminative-correlation-filters-based
(DCF-based) tracker is adopted by UHP-SOT as the baseline in the first module. Yet, the use of
the first module alone has shortcomings such as failure to recover from tracking loss and weak box size adaptation. We propose two ideas, background motion modeling and trajectory-based box
prediction. Both are novel in SOT. We test UHP-SOT on two popular object tracking benchmarks,
TB-50 and TB-100 [134], and show that it outperforms all previous unsupervised SOT methods,
achieves a performance comparable with the best supervised DL-based SOT methods, and operates
at a fast speed (22.7-32.0 FPS on a CPU).
2.2 Background
Previous unsupervised trackers pay attention to efficient and effective appearance model update.
Built upon this foundation, an unsupervised discriminative-correlation-filters-based (DCF-based)
tracker Spatial-temporal regularized correlation filters (STRCF) [74] is adopted by UHP-SOT as
the baseline in the first module. In STRCF, the object appearance at frame t is modeled by a template denoted by f_t. It is used for similarity matching at frame (t+1). The template is initialized at the first frame. Assume that the cropped patch centered at the object location has a size of N_x × N_y pixels at frame t. Then, the template gets updated at frame t by solving the following regression equation:
$$\arg\min_{f} \ \frac{1}{2}\Big\| \sum_{d=1}^{D} x_t^d * f^d - y \Big\|^2 + \frac{1}{2}\sum_{d=1}^{D} \big\| w \cdot f^d \big\|^2 + \frac{\mu}{2}\big\| f - f_{t-1} \big\|^2, \qquad (2.1)$$
where y ∈ R^{N_x×N_y} is a centered Gaussian-shaped map used as regression labels, x_t ∈ R^{N_x×N_y×D} is the spatial map of D features, * denotes the spatial convolution within the same feature channel, w is the spatial penalty weight applied on the template, f_{t-1} is the template obtained from time t−1, and µ is a constant regularization coefficient. We can interpret the three terms on the right-hand side
of the above equation as follows. The first term demands that the new template match the newly observed features according to the assigned labels. The second term is the spatial regularization term, which demands that regions outside of the box contribute less to the matching result. A visualization of its effect is provided in Figure 2.1. The third term corresponds to self-regularization that ensures smooth appearance change [25]. To search for the box in frame (t+1), STRCF correlates template f_t with the search region and determines the box by finding the location that gives the highest response. Although STRCF can model the appearance change for most sequences, it suffers from overfitting and is not able to adapt quickly to largely deformable objects. Furthermore, it cannot recover after tracking loss.
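To make the detection step above concrete, the following is a minimal sketch of a generic DCF-style correlation search: a learned multi-channel template is correlated with the search-region feature map in the Fourier domain, and the peak of the summed response gives the new object location. The feature extraction and the exact STRCF solver are outside the scope of this sketch, so it should be read as an illustration rather than the implementation used in this work.

```python
import numpy as np

def correlation_response(template, search_feat):
    # template, search_feat: H x W x D feature maps of the same size.
    # Circular cross-correlation per feature channel, computed via FFT,
    # then summed over channels; the peak indicates the object location.
    response = np.zeros(search_feat.shape[:2])
    for d in range(search_feat.shape[2]):
        F_x = np.fft.fft2(search_feat[:, :, d])
        F_f = np.fft.fft2(template[:, :, d])
        response += np.real(np.fft.ifft2(F_x * np.conj(F_f)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return response, (dy, dx)
```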
Figure 2.1: The spatial weight w helps suppress learning from background information. w assigns
higher penalties on values outside of the object bounding box. Figure modified from [33].
The template model f is updated at every frame with a fixed regularization coefficient µ in
STRCF. In our implementation, we skip updating f if no obvious motion is observed. In addition, a
smaller µ is used when all modules agree with each other in prediction so that f can quickly adapt
to the new appearance for largely deformable objects.
Figure 2.2: The system diagram of the proposed UHP-SOT method. In the example, the object
was lost at time t − 1 but gets retrieved at time t because the proposal from background motion
modeling is accepted.
2.3 Proposed Method
There are three main challenges in SOT:
1. significant change of object appearance,
2. loss of tracking,
3. rapid variation of object’s location and/or shape.
To address these challenges, we propose a new tracker, UHP-SOT, whose system diagram is shown
in Figure 2.2. As shown in the figure, it consists of three modules:
1. appearance model update,
2. background motion modeling,
3. trajectory-based box prediction.
UHP-SOT follows the classic tracking-by-detection paradigm where the object is detected
within a region centered at its last predicted location at each frame. The histogram of oriented
gradients (HOG) [29] and colornames (CN) [117, 36] features are extracted to yield the feature
map. We choose the STRCF tracker [74] as the baseline tracker. However, STRCF cannot handle
the second and the third challenges well, as shown in Figure 2.3. We propose the second and the
third modules in UHP-SOT to address them. They are the main contributions of this work. UHP-SOT operates in the following fashion. The baseline tracker is initialized at the first frame. For the following frames, UHP-SOT collects proposals from all three modules and chooses one of them as the final prediction based on a fusion strategy. These modules are elaborated below.
2.3.1 Background Motion Modeling
For SOT, we can decompose the pixel displacement between adjacent frames (also called optical
flow) into two types: object motion and background motion. Background motion is usually simpler
Figure 2.3: Comparison of the tracking performance of STRCF (red) and UHP-SOT (green), where
the results of UHP-SOT are closer to the ground truth and those of STRCF drift away.
so that it can be well modeled by a parametric model. Background motion estimation [48, 2] finds
applications in video stabilization, coding and visual tracking. Here, we propose a 6-parameter
model of the form
$$x_{t+1} = \alpha_1 x_t + \alpha_2 y_t + \alpha_0, \qquad y_{t+1} = \beta_1 x_t + \beta_2 y_t + \beta_0, \qquad (2.2)$$
where (x_{t+1}, y_{t+1}) and (x_t, y_t) are corresponding background points in frames (t+1) and t, and α_i and β_i, i = 0, 1, 2, are model parameters. Given more than three pairs of corresponding points, we can solve for the model parameters using the linear least-squares method. Usually, we choose a few salient points (e.g., corners) to build the correspondence and determine the parameters. We apply the background model to the grayscale image I_t(x, y) of frame t to find the estimated Î_{t+1}(x, y) of frame (t+1). Afterwards, we can compute the difference map ∆I:
$$\Delta I = \big| \hat{I}_{t+1}(x, y) - I_{t+1}(x, y) \big|, \qquad (2.3)$$
which is expected to have small values in the background region and large values in the foreground region. Thus, we can determine potential object locations by finding the covering box with the greatest motion, which is also named the motion proposal. An example is provided in Figure 2.4. Note that we apply some post-processing on ∆I, such as cleaning the frame border area and suppressing small values, to reduce noise and fluctuation of box locations. The ∆I shown in Figure 2.4 has already been processed.
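The following is a minimal sketch of this background motion compensation step. It assumes grayscale frames given as numpy arrays and uses OpenCV's corner detector and pyramidal Lucas-Kanade optical flow to build the point correspondences; these particular tools are illustrative choices, not the ones prescribed by the text. The 6-parameter model of Eq. (2.2) is then solved by linear least squares and used to form the residual map of Eq. (2.3).

```python
import cv2
import numpy as np

def motion_residual_map(prev_gray, curr_gray):
    # 1. Sample salient background points (corners) in the previous frame.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8)
    # 2. Track them into the current frame to build correspondences.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    p0 = pts[status.ravel() == 1].reshape(-1, 2)
    p1 = nxt[status.ravel() == 1].reshape(-1, 2)
    # 3. Solve the 6-parameter model (Eq. 2.2) by linear least squares:
    #    [x', y'] = A [x, y, 1], with A a 2x3 matrix of (alpha, beta) params.
    X = np.hstack([p0, np.ones((len(p0), 1))])          # N x 3
    A, _, _, _ = np.linalg.lstsq(X, p1, rcond=None)     # 3 x 2
    # 4. Warp the previous frame with the estimated background motion
    #    and take the absolute difference (Eq. 2.3).
    h, w = curr_gray.shape
    warped = cv2.warpAffine(prev_gray, A.T, (w, h))
    return cv2.absdiff(warped, curr_gray)
```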
Specifically, for the step of finding the motion proposal, it would be very slow to apply a sliding window across the whole image and compare the sum of residuals inside each window. A prefix sum is applied to optimize this step, so the complexity is reduced from O(mnpq) (brute force) to O(mn), which is independent of the proposal size. Here, m and n are the height and width of the frame, and p and q are those of the proposal. Furthermore, the prefix-sum-optimized maximum-score window only provides the best guess of the motion proposal. A measurement of its quality can be designed as follows. The number of windows can be increased to two, and the target is to find the two non-overlapping windows with the maximum total score.¹ If these two windows are adjacent to each other, they must be around the position of the best guess of the motion proposal in the single-box case, and the two boxes will split the original single-box proposal. In this situation, the sum of residuals does not vary much across different box locations, and the quality of the single-box proposal may be low. If these two windows are separated from each other, one window will be exactly the same as the single-box proposal. Under such circumstances, the motion proposal is of high quality. The time complexity of a trivial search algorithm is O(m²n²) even if the prefix sum is applied. We designed a dynamic programming algorithm that avoids the redundant calculation and brings the time complexity down to O(mn).
¹We also developed this problem into a competitive programming challenge [79] and contributed it to IEEEXtreme, an annual global programming competition held by IEEE. It is worth mentioning that this problem is among the top 10 hardest problems in IEEEXtreme 15.0 (2021).
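As an illustration of the prefix-sum idea, the sketch below finds the single best p × q window over a motion residual map with a summed-area table in O(mn); the two-window quality check described above is omitted.

```python
import numpy as np

def best_motion_window(residual, p, q):
    """Find the p x q window with the largest sum of motion residuals
    using a summed-area table (prefix sums), in O(mn) time."""
    m, n = residual.shape
    # S[i, j] holds the sum of residual[:i, :j].
    S = np.zeros((m + 1, n + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(residual, axis=0), axis=1)
    # Window sums for every valid top-left corner, via inclusion-exclusion.
    sums = S[p:, q:] - S[:-p, q:] - S[p:, :-q] + S[:-p, :-q]
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return top, left, sums[top, left]
```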
While the DCF-based baseline exploits foreground correlation to locate the object, background modeling uses background correlation to eliminate the background's influence on object tracking. The two complement each other in challenging tasks such as recovery from tracking loss. The DCF-based tracker cannot recover from tracking loss easily since it does not have a global view of the scene. In contrast, our background modeling module can still find potential locations of the object by removing the background region.
Figure 2.4: Example of object proposal from background motion modeling.
2.3.2 Trajectory-based Box Prediction
Given the box centers of the last N frames, {(x_{t−N}, y_{t−N}), ..., (x_{t−1}, y_{t−1})}, we calculate N−1 displacement vectors {(∆x_{t−N+1}, ∆y_{t−N+1}), ..., (∆x_{t−1}, ∆y_{t−1})} and apply principal component analysis (PCA) to them. To predict the displacement at frame t, we fit the first principal component with a line and set the second principal component to zero to remove noise. Then, the center location of the box at frame t can be written as
$$(\hat{x}_t, \hat{y}_t) = (x_{t-1}, y_{t-1}) + (\Delta\hat{x}_t, \Delta\hat{y}_t). \qquad (2.4)$$
Similarly, we can estimate the width and the height of the box at frame t, denoted by (ŵ_t, ĥ_t). Typically, the physical motion of an object has inertia in its trajectory and its size, and the box prediction process attempts to maintain this inertia. It contributes to better tracking performance
in two ways. First, it removes small fluctuations of the box in location and size. Second, when the target object deforms rapidly, the appearance model alone cannot capture the shape change effectively. In contrast, the combination of background motion modeling and trajectory-based box prediction offers a more satisfactory solution. An example is given in Figure 2.5, which shows a frame of the diving sequence in the upper-left subfigure, where the green and blue boxes are the ground truth and the result of UHP-SOT, respectively. Although a DCF-based tracker can detect size change by comparing correlation scores at five image resolutions, it cannot estimate the change of the aspect ratio. In contrast, the residual image after background removal in UHP-SOT, as shown in the lower-left subfigure, reveals the object shape. We sum up the absolute pixel values of the residual image horizontally and vertically, and use a threshold to determine the two ends of an interval along each direction. Then, we have
$$\hat{w} = x_{\max} - x_{\min}, \qquad \hat{h} = y_{\max} - y_{\min}. \qquad (2.5)$$
Note that the raw estimate may not be stable across frames. Estimates that deviate too much from the trajectory of (∆w_t, ∆h_t) are rejected. The result is a robust yet flexibly deformable box proposal.
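The following is a minimal numpy sketch of the trajectory predictor of Eq. (2.4): PCA is applied to the recent displacement vectors, the first principal component is extrapolated with a line fit, and the second component is zeroed out as noise. The exact smoothing and rejection thresholds are implementation details not specified here.

```python
import numpy as np

def predict_center(centers):
    """Predict the next box center from the last N centers (Eq. 2.4).
    centers: (N, 2) array of (x, y) positions, oldest first."""
    d = np.diff(centers, axis=0)                 # N-1 displacement vectors
    mean = d.mean(axis=0)
    # PCA of the displacements: principal axes from the covariance structure.
    _, _, Vt = np.linalg.svd(d - mean, full_matrices=False)
    proj = (d - mean) @ Vt.T                     # coordinates in PCA space
    # Fit a line to the first principal component over time and extrapolate;
    # the second component is treated as noise and set to zero.
    tgrid = np.arange(len(proj))
    slope, intercept = np.polyfit(tgrid, proj[:, 0], 1)
    next_pc = np.array([slope * len(proj) + intercept, 0.0])
    next_disp = next_pc @ Vt + mean              # back to (dx, dy)
    return centers[-1] + next_disp
```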
2.3.3 Fusion Strategy
We have three box proposals for the target object at frame t: 1) Bapp from the baseline tracker
to capture appearance change, 2) Btrj from the trajectory predictor to maintain the inertia of box
position/shape and 3) Bbgd from the background motion predictor to eliminate unlikely object
regions. We need a fusion strategy to yield the final box location/shape, which is described below.
We store two template models: the latest model f_{t−1} and an older model f_i, i ≤ t−1, where i is the last time step at which all three boxes had the same location. Model f_i is less likely to be contaminated since it needs agreement from all modules, while Bbgd can jump around. To check
Figure 2.5: Illustration of shape change estimation based on background motion model and the
trajectory-based box prediction, where the ground truth and our proposal are in green and blue,
respectively.
the reliability of the three proposals, we compute correlation scores for three pairs: (f_{t−1}, Bapp), (f_{t−1}, Btrj), and (f_i, Bbgd), and then apply a rule-based fusion strategy:
• General rule: Choose the proposal with the highest score.
• Special rule: if Bapp has the highest score but Btrj has a close score, agrees with Bbgd, and a sudden jump (say, larger than 30 pixels) is observed, then we choose Btrj instead.
2.4 Experimental Results
2.4.1 Implementation Details
We compare UHP-SOT with state-of-the-art unsupervised and supervised trackers on TB-50 and
TB-100 datasets [134]. The latter contains 100 videos and 59,040 frames. We use the same
hyperparameter settings as STRCF except for the regularization coefficient µ. STRCF sets µ = 15, while UHP-SOT selects µ ∈ {10, 5, 0} if the appearance box is not chosen; the smaller the correlation score, the smaller µ. The number of previous frames for trajectory prediction is set to N = 20. The cutting threshold along the horizontal/vertical directions is set to 0.1. UHP-SOT runs at 22.7 FPS
on a PC equipped with an Intel(R) Core(TM) i5-9400F CPU, maintaining a near real-time speed
(while STRCF operates at a speed of 24.3 FPS).
2.4.2 Performance Evaluation
Figure 2.6 compares UHP-SOT with state-of-the-art unsupervised trackers ECO-HC [30], STRCF
[74], SRDCFdecon [34], CSR-DCF [4], SRDCF [33], Staple [5], KCF [51], DSST [31] and supervised trackers SiamRPN++ [72], ECO [30], HDT [43], SiamFC 3s [6], LCT [88]. UHP-SOT
outperforms STRCF by 4% in precision and 3.03% in overlap on TB-100 and 4% in precision
and 2.7% in overlap on TB-50, respectively. As an unsupervised light-weight tracker, UHP-SOT
achieves performance comparable with state-of-the-art deep trackers such as SiamRPN++ with
Figure 2.6: The precision plot and the success plot on TB-50 and TB-100. To rank different
methods, the distance precision is measured at 20-pixel threshold, and the overlap precision is
measured by the AUC score. We collect other trackers’ raw results from the official websites to
generate results.
Figure 2.7: Qualitative evaluation of UHP-SOT, STRCF [74], ECO [30] and SiamRPN++ [72] on
10 challenging videos from TB-100. They are (from left to right and top to bottom): Trans, Skiing,
MotorRolling, Coupon, Girl2, Bird1, KiteSurf, Bird2, Diving, Jump, respectively.
ResNet-50[50] as the backbone. UHP-SOT outperforms another deep tracker, ECO, which uses a
Gaussian Mixture Model to store seen appearances and runs at around 10 FPS, in both accuracy
and speed.
We show results on 10 challenging sequences on TB-100 for the top-4 trackers in Figure 2.7.
Generally, UHP-SOT can tightly follow small moving objects or largely deformed objects such as the human body, even when they are partially occluded. These results are attributed to its quick recovery from tracking loss via background modeling and its stability via trajectory prediction. On the other hand, UHP-SOT does not perform well on Ironman and Matrix because of rapid changes in both the foreground object and the background. Such changes arise from editing in movie post-production and do not occur in real-world object tracking. Deep trackers perform well on Ironman and Matrix by leveraging supervision.
We further analyze the performance variation under different challenging tracking scenarios for
TB-100. We present the AUC score in Figure 2.8 and compare with other state-of-the-art unsupervised DCF trackers STRCF, ECO-HC, and SRDCFdecon. Our method outperforms other trackers
in all attributes, especially in deformation (DEF), in-plane rotation (IPR) and low resolution (LR).
Difficult sequences in those attributes include MotorRolling, Jump, Diving and Skiing, where the
target appearance changes fast due to large deformation. It is difficult to reach a high overlapping
ratio without supervision or without an adaptive box aspect ratio strategy.
Finally, we test two other variants of UHP-SOT: UHP-SOT-I (without trajectory prediction) and UHP-SOT-II (without background motion modeling) in Table 2.1. The gap between UHP-SOT and UHP-SOT-I reveals the importance of the inertia provided by trajectory prediction. UHP-SOT-I shows that background modeling does a good job in handling some difficult cases that STRCF cannot cope with, leading to a gain of 2% in precision and 1.95% in overlap. UHP-SOT-II rejects large trajectory deviations and uses a smaller regularization coefficient to strengthen this correction. The accuracy of UHP-SOT-II drops more due to naive correction without confirmation from background modeling. Yet, it operates at a much faster speed (32.02 FPS). Both background modeling and trajectory prediction are lightweight modules and run in real time.
Figure 2.8: Area-under-curve (AUC) score for attribute-based evaluation on the TB-100 dataset,
where the 11 attributes are background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion
(OCC), out-of-plane rotation (OPR), out-of-view (OV), and scale variation (SV), respectively.
Table 2.1: Performance of UHP-SOT, UHP-SOT-I, UHP-SOT-II and STRCF on the TB-100
dataset, where AUC is used for success rate.
              UHP-SOT   UHP-SOT-I   UHP-SOT-II   STRCF
Success (%)    68.95      67.87       65.64      65.92
Precision       0.91       0.89        0.87       0.87
Speed (FPS)    22.73      23.68       32.02      24.30
2.5 Conclusion
An unsupervised high-performance tracker called UHP-SOT, which uses STRCF as the baseline with two novel improvements, was proposed in this work. The improvements are background motion modeling and trajectory-based box prediction. Experimental results showed that UHP-SOT offers an effective real-time tracking solution for resource-limited platforms.
Chapter 3
UHP-SOT++: An Unsupervised Lightweight Single Object
Tracker
3.1 Introduction
This work is an extension of UHP-SOT with new contributions. First, the fusion strategies in UHP-SOT and UHP-SOT++ are different. The fusion strategy in UHP-SOT was simple and ad hoc, while UHP-SOT++ adopts a fusion strategy that is more systematic and better justified. It is applicable to both small- and large-scale datasets with more robust and accurate performance. Second, this work conducts more extensive experiments on four object tracking benchmarks (i.e., OTB2015, TC128, UAV123 and LaSOT), while only experimental results on OTB2015 were reported for UHP-SOT in [157]. New experimental evaluations demonstrate that UHP-SOT++ outperforms previous unsupervised SOT methods (including UHP-SOT) and achieves comparable results with deep trackers on large-scale datasets. Since UHP-SOT++ has an extremely small model size, high tracking performance, and low computational complexity (operating at a rate of 20 FPS on an i5 CPU even without code optimization), it is ideal for real-time object tracking on resource-limited platforms. Finally, we compare the pros and cons of the SiamRPN++ and UHP-SOT++ trackers, which serve as examples of supervised deep trackers and unsupervised lightweight trackers, respectively, and provide a new perspective on their performance gap.
Figure 3.1: The system diagram of the proposed UHP-SOT++ method. It shows one example
where the object was lost at time t − 1 but gets retrieved at time t because the proposal from
background motion modeling is accepted.
3.2 Proposed Method
The system diagram of the proposed UHP-SOT++ method is shown in Figure 3.1. As introduced
in Chapter 2, UHP-SOT consists of three modules:
1. appearance model update,
2. background motion modeling,
3. trajectory-based box prediction.
Thus, we have three box proposals for the target object at frame t: 1) Bapp from the baseline
STRCF tracker to capture appearance change, 2) Bbgd from the background motion predictor to
eliminate unlikely object regions, and 3) Btrj from the trajectory predictor to maintain the inertia of
the box position and size. A fusion strategy is needed to yield the final box location and size. We
consider a couple of factors for its design.
3.2.1 Proposal Quality
There are three box proposals. The quality of each box proposal can be measured by: 1) object
appearance similarity, and 2) robustness against the trajectory. We use a binary flag to indicate
whether the quality of a proposal is good or not. As shown in Table 3.1, the flag is set to one if a
proposal keeps proper appearance similarity and is robust against trajectory. Otherwise, it is set to
zero.
For the first measure, we store two appearance models: the latest model, f_{t−1}, and an older model, f_i, i ≤ t−1, where i is the last time instance at which all three boxes had the same location. Model f_i is less likely to be contaminated since it needs agreement from all modules. To check the reliability of the three proposals, we compute correlation scores for the following six pairs: (f_{t−1}, Bapp), (f_{t−1}, Btrj), (f_{t−1}, Bbgd), (f_i, Bapp), (f_i, Btrj), and (f_i, Bbgd). They provide appearance similarity measures of the two previous models against the current three proposals. A proposal has good similarity if one of its correlation scores is higher than a threshold.
For the second measure, if Bapp and Btrj have a small displacement (say, 30 pixels) from the
last prediction, the move is robust. As to Bbgd, it often jumps around and, thus, is less reliable.
However, if the standard deviations of its historical locations along the x-axis and y-axis are small
enough (e.g., 30 pixels over the past 10 frames), then it is considered reliable.
Figure 3.2: Illustration of occlusion detection, where the green box shows the object location. The
color information and similarity score could change rapidly if occlusion occurs.
3.2.2 Occlusion Detection
We propose an occlusion detection strategy for color images, which is illustrated in Figure 3.2. When occlusion occurs, we often observe a sudden drop in the similarity score and a rapid change in the averaged RGB color values inside the box. A drop is sudden if the mean over the past several frames is high while the current value is significantly lower. If this is detected, we keep the new prediction the same as the last predicted position since the new prediction is unreliable. We do not update the model for this frame either, to avoid drifting and/or contamination of the appearance model.
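A minimal sketch of this heuristic is given below; the window length and the two thresholds (score_drop, color_change) are illustrative assumptions rather than values specified in the text.

```python
import numpy as np

def occlusion_detected(score_history, color_history, curr_score, curr_color,
                       score_drop=0.5, color_change=30.0, window=10):
    """Flag occlusion when the similarity score suddenly drops and the mean
    RGB color inside the box changes rapidly (see Figure 3.2).
    score_history: list of past similarity scores.
    color_history: list of past mean-RGB vectors (length-3 arrays)."""
    mean_score = float(np.mean(score_history[-window:]))
    mean_color = np.mean(np.asarray(color_history[-window:]), axis=0)
    sudden_drop = mean_score > 0 and curr_score < score_drop * mean_score
    rapid_change = np.linalg.norm(np.asarray(curr_color) - mean_color) > color_change
    return sudden_drop and rapid_change
```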
3.2.3 Rule-based Fusion
Since each of the three proposals has a binary flag, all tracking scenarios can be categorized into 8
cases as shown in Figure 3.3. We propose a fusion scheme for each case below.
• When all three proposals are good, their boxes are merged together as a minimum covering
rectangle if they overlap with each other with IoU above a threshold. Otherwise, Bapp is
adopted.
• When two proposals are good, merge them if they overlap with each other with IoU above a
threshold. Otherwise, the one with better robustness is adopted.
• When one proposal is good, adopt that one if it is Bapp. Otherwise, that proposal is compared
with Bapp to verify its superiority by observing a higher similarity score or better robustness.
• When all proposals have poor quality, the occlusion detection process is conducted. The last
prediction is adopted in case of occlusion. Otherwise, Bapp is adopted.
• When other proposals outperform Bapp, the regularization coefficient µ is adjusted accordingly for a stronger update, since this may indicate that the appearance model needs to be updated more aggressively to capture the new appearance.
The fusion rule is summarized in Table 3.1. In most cases, Bapp is reliable and it will be chosen or
merged with other proposals because the change is smooth between adjacent frames in the great
majority of frames in a video clip.
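As a rough illustration, the sketch below follows the decision flow of Table 3.1 in simplified form; the robustness-based tie-breaking, the comparison against Bapp in the one-good case, and the update-rate adjustment are omitted or replaced by stand-ins, and iou() and union() (minimum covering box) are assumed helper functions rather than parts of the original implementation.

```python
def fuse_proposals(b_app, b_trj, b_bgd, good_app, good_trj, good_bgd,
                   occluded, last_box, iou, union, iou_thresh=0.5):
    """Simplified decision flow of Table 3.1."""
    flags = [(b_app, good_app), (b_trj, good_trj), (b_bgd, good_bgd)]
    picked = [b for b, ok in flags if ok]
    if len(picked) == 3:
        # Merge all three if they overlap sufficiently; otherwise trust Bapp.
        if iou(b_app, b_trj) > iou_thresh and iou(b_app, b_bgd) > iou_thresh:
            return union(picked)
        return b_app
    if len(picked) == 2:
        if iou(picked[0], picked[1]) > iou_thresh:
            return union(picked)
        return picked[0]          # stand-in for "the more robust of the two"
    if len(picked) == 1:
        return picked[0]          # stand-in for the comparison against Bapp
    # No good proposal: hold the last prediction under occlusion.
    return last_box if occluded else b_app
```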
3.3 Experimental Results
3.3.1 Experimental Set-up
To show the performance of UHP-SOT++, we compare it with several state-of-the-art unsupervised
and supervised trackers on four single object tracking datasets. They are OTB2015 [134], TC128
Figure 3.3: An example of quality assessment of proposals, where the green box is the ground
truth, and yellow, blue and magenta boxes are proposals from Bapp, Btrj and Bbgd, respectively, and the bright yellow text on the top-left corner denotes the quality of three proposals
(isGoodapp, isGoodtrj, isGoodbgd).
Table 3.1: All tracking scenarios are classified into 8 cases in terms of the overall quality of proposals from three modules. The fusion strategy is set up for each scenario. The update rate is
related to the regularization coefficient µ, which controls to what extent the appearance model should be updated.
isGoodapp   isGoodtrj   isGoodbgd   Proposal to take                                 Update rate
    1           1           1       Bapp or union of three                           normal
    1           1           0       Bapp or Btrj or union of two                     normal
    1           0           1       Bapp or Bbgd or union of two                     normal
    0           1           1       Btrj or Bbgd or union of two                     normal or stronger
    1           0           0       Bapp                                             normal
    0           1           0       Bapp or Btrj                                     normal or stronger
    0           0           1       Bapp or Bbgd                                     normal or stronger
    0           0           0       Bapp or last prediction in case of occlusion     normal or weaker
[76], UAV123 [92] and LaSOT [41]. OTB2015 (also named OTB in short) and TC128, which
contain 100 and 128 color or grayscale video sequences, respectively, are two widely used small-scale datasets. UAV123 is a larger one, which has 123 video sequences with more than 110K
frames in total. Videos in UAV123 are captured by low-altitude drones. They are useful in the
tracking test of small objects with a rapid change of viewpoints. LaSOT is the largest single object
tracking dataset, targeting diversified object classes and flexible motion trajectories in longer sequences. It has one training set with dense annotations for supervised trackers to learn from and another
test set for performance evaluation. The test set contains 280 videos of around 685K frames.
We use the same hyperparameters as those in STRCF except for the regularization coefficient µ. STRCF sets µ = 15, while UHP-SOT++ selects µ ∈ {15, 10, 5, 0} if the appearance box is not chosen. The smaller µ is, the stronger the update is. The number of previous frames for trajectory prediction is N = 20. The cutting threshold along the horizontal or vertical direction is set to 0.1. The threshold for a good similarity score is 0.08, and a threshold of 0.5 for IoU is adopted.
UHP-SOT++ runs at 20 frames per second (FPS) on a PC equipped with an Intel(R) Core(TM) i5-
9400F CPU. The speed data of other trackers are either from their original papers or benchmarks.
Since no code optimization is conducted, all reported speed data should be viewed as lower bounds
for the corresponding trackers.
3.3.2 Ablation Study
3.3.2.1 Module Analysis
We compare different configurations of UHP-SOT++ on the TC128 dataset to investigate contributions from each module in Figure 3.4. As compared with UHP-SOT, improvements on both DP
and AUC in UHP-SOT++ come from the new fusion strategy. Under this strategy, the background
motion modeling plays a more important role, and it achieves comparable performance even without
the trajectory prediction. Although the trajectory prediction module is simple, it contributes a lot
to higher tracking accuracy and robustness as revealed by the performance improvement over the
baseline STRCF.
More performance comparison between UHP-SOT++, UHP-SOT and STRCF is presented in
Table 3.2. As compared with STRCF, UHP-SOT++ achieves 1.8%, 6.2%, 6.7% and 6.8% gains in
the success rate on OTB, TC128, UAV123 and LaSOT, respectively. As to the mean precision, it
has an improvement of 1.2%, 6.9%, 7.2% and 10.4%, respectively. Except for OTB, UHP-SOT++
outperforms UHP-SOT in both the success rate and the precision. This is especially obvious for
large-scale datasets. Generally, UHP-SOT++ has better tracking capability than UHP-SOT. Its
performance drop in OTB is due to the tracking loss in three sequences; namely, Bird2, Coupon
and Freeman4. They have multiple complicated appearance changes such as severe rotation, background clutter and heavy occlusion. As shown in Figure 3.5, errors at some key frames lead to total
loss of the object, and the lost object cannot be easily recovered from motion. The trivial fusion
strategy based on appearance similarity in UHP-SOT seems to work well on their key frames while
the fusion strategy of UHP-SOT++ does not suppress wrong proposals properly since background
clutters have stable motion and trajectories as well.
Figure 3.4: The precision plot and the success plot of our UHP-SOT++ tracker with different
configurations on the TC128 dataset, where the numbers inside the parentheses are the DP values
and AUC scores, respectively.
Figure 3.5: Failure cases of UHP-SOT++ (in green) as compared to UHP-SOT (in red) on
OTB2015.
3.3.2.2 Attribute-based Study
To better understand the capability of different trackers, we analyze the performance variation under various challenging tracking conditions. These conditions can be classified into the following
attributes: aspect ratio change (ARC), background clutter (BC), camera motion (CM), deformation
(DEF), fast motion (FM), full occlusion (FOC), in-plane rotation (IPR), illumination variation (IV),
low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view
(OV), partial occlusion (POC), scale variation (SV) and viewpoint change (VC). We compare the
AUC scores of supervised trackers (e.g., SiamRPN++ and ECO) and unsupervised trackers (e.g.,
UHP-SOT++, ECO-HC, UDT+ and STRCF) under these attributes in Figure 3.6.
We have the following observations. First, among unsupervised trackers, UHP-SOT++ has
leading performance in all attributes, which reveals improved robustness from its basic modules
and fusion strategy. Second, although ECO utilizes deep features, it is weak in flexible box regression and, as a result, is outperformed by UHP-SOT++ in handling deformation and shape changes on LaSOT. Third, in contrast, SiamRPN++ is better than the other trackers, especially in DEF (deformation), ROT (rotation) and VC (viewpoint change). The superior performance of
SiamRPN++ demonstrates the power of its region proposal network (RPN) in generating tight
Figure 3.6: The area-under-curve (AUC) scores for two datasets, TC128 and LaSOT, under the
attribute-based evaluation, where attributes of concern include the aspect ratio change (ARC),
background clutter (BC), camera motion (CM), deformation (DEF), fast motion (FM), full occlusion (FOC), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur
(MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV), partial occlusion (POC),
scale variation (SV) and viewpoint change (VC), respectively.
boxes. The RPN inside SiamRPN++ not only improves the IoU score but also brings a long-term benefit by excluding noisy information. Fourth, supervised trackers perform better in IV (illumination
variation) and LR (low resolution) than unsupervised trackers in general. The gap between ECO
and its handcrafted version, ECO-HC, is more obvious under these attributes. This can be explained by the fact that unsupervised trackers adopt HOG, CN features or other shallow features
which do not work well under these attributes. They focus on local structures of the appearance
and tend to fail to capture the object when the local gradient or color information is not stable.
Finally, even with these feature limitations, UHP-SOT++ still ranks second in many attributes on
LaSOT because of the stability offered by trajectory prediction and its capability to recover from
tracking loss via background motion modeling.
Figure 3.7: The success plot and the precision plot of ten unsupervised tracking methods for the
LaSOT dataset, where the numbers inside the parentheses are the overlap precision and the distance
precision values, respectively.
3.3.3 Comparison with State-of-the-art Trackers
We compare the performance of UHP-SOT++ and several unsupervised trackers for the LaSOT
dataset in Figure 3.7. The list of benchmarking methods includes both lightweight trackers and
Figure 3.8: Qualitative evaluation of three leading unsupervised trackers, where UHP-SOT++ offers a robust and flexible box prediction.
Figure 3.9: The success plot comparison of UHP-SOT++ with several supervised and unsupervised
tracking methods on four datasets, where only trackers with raw results published by authors are
listed. For the LaSOT dataset, only supervised trackers are included for performance benchmarking in the plot since the success plot of unsupervised methods is already given in Figure 3.7.
deep trackers: USOT [153], ECO-HC [30], STRCF [74], CSR-DCF [4], SRDCF [33], Staple [5],
KCF [51], DSST [31].
UHP-SOT++ achieves performance comparable with that of the state-of-the-art deep unsupervised tracker USOT, which has a ResNet-50 [50] backbone network and large-scale offline training. UHP-SOT++ outperforms DCF-based unsupervised methods by a large margin, larger than 0.02 in the mean scores of the success rate and the precision. Besides, its running speed is 20 FPS,
which is comparable with that of the third runner STRCF (24 FPS) and the fourth runner ECO-HC
(42 FPS). With a small increase in computational and memory resources, UHP-SOT++ gains in
tracking performance by adding object box trajectory and background motion modeling modules.
Object boxes of three leading DCF-based unsupervised trackers are visualized in Figure 3.8 for
qualitative performance comparison. More comparison with other unsupervised methods including LADCF [137], UDT [121], UDT+ [121] over other benchmark datasets is shown in Figure 3.9.
As compared with other methods, UHP-SOT++ offers more robust and flexible box prediction.
Its boxes follow the object tightly in both location and shape even under challenging scenarios
such as motion blur and rapid shape change.
We compare the success rates of UHP-SOT++ and several supervised and unsupervised trackers against all four datasets in Figure 3.9. Note that there are more benchmarking methods for
OTB but fewer for TC128, UAV123 and LaSOT since OTB is an earlier dataset. The supervised
deep trackers under consideration include SiamRPN++ [72], ECO [30], C-COT [35], DeepSRDCF
[33], HDT [43], SiamFC 3s [6], CFNet [115] and LCT [88]. Other deep trackers that have leading performance but are not likely to be used on resource-limited devices due to their extremely
high complexity, such as transformer-based trackers [124, 22], are not included here. Although the
performance of a tracker may vary from one dataset to the other due to different video sequences
collected by each dataset, UHP-SOT++ is among the top runners in all four datasets. This demonstrates the generalization capability of UHP-SOT++. Its better performance than ECO on LaSOT
indicates a robust and effective update of the object model; otherwise, its performance would degrade quickly on the longer LaSOT sequences. Besides, its tracking speed of 20
FPS on CPU is faster than many deep trackers such as ECO (10 FPS), DeepSRDCF (0.2 FPS),
C-COT (0.8 FPS) and HDT (2.7 FPS).
In Table 3.2, we further compare UHP-SOT++ with state-of-the-art unsupervised deep trackers
ULAST [105], USOT [153], UDT+ [121], LUDT [122] and ResPUL [132] in their AUC and DP
values, running speeds and model sizes. Two leading supervised trackers SiamRPN++ and ECO
are also included in Figure 3.9. We see that UHP-SOT++ has outstanding overall performance
against recent unsupervised deep trackers. UHP-SOT++ achieves comparable performance with
USOT, which has a very deep feature extraction network, on LaSOT and much better accuracy on
OTB2015. It also outperforms other unsupervised deep trackers with shallow feature extraction
backbones by a large margin. ULAST achieves high performance on LaSOT with a deep region
proposal network as well as a carefully designed pre-training strategy. It demands a large amount
of data in the pre-training of the large backbone model and the region proposal network.
It is worthwhile to emphasize that deep trackers demand pre-training on offline datasets while
UHP-SOT++ does not. In addition, UHP-SOT++ is attractive because of its lower memory requirement and near real-time running speed on CPUs. Although ECO-HC also provides a lightweight
solution, there is a performance gap between UHP-SOT++ and ECO-HC. SiamRPN++ has the
best tracking performance among all trackers due to its end-to-end optimized network with auxiliary modules such as the classification head and the region proposal network. Yet, its large
model size and GPU hardware requirement limit its applicability in resource-limited devices such
as mobile phones or drones. In addition, as an end-to-end optimized deep tracker, SiamRPN++
has the interpretability issue to be discussed later.
Table 3.2: Comparison of state-of-the-art supervised and unsupervised trackers on four datasets, where the performance is measured by the distance precision (DP) and the area-under-curve (AUC) score in percentage. The model size is measured in MB by the memory required to store needed data such as the model parameters of pre-trained networks. The best unsupervised performance is highlighted. Also, S, P, G and C indicate Supervised, Pre-trained, GPU and CPU, respectively.

Trackers         Year   S   P   OTB2015      TC128        UAV123       LaSOT        FPS      Model size
                                DP    AUC    DP    AUC    DP    AUC    DP    AUC
SiamRPN++ [72]   2019   ✓   ✓   91.0  69.2   -     -      84.0  64.2   49.3  49.5   35 (G)   206
ECO [30]         2017   ✓   ✓   90.0  68.6   80.0  59.7   74.1  52.5   30.1  32.4   10 (G)   329
UDT+ [121]       2019   ×   ✓   83.1  63.2   71.7  54.1   -     -      -     -      55 (G)   < 1
LUDT [122]       2020   ×   ✓   76.9  60.2   67.1  51.5   -     -      -     26.2   70 (G)   < 1
ResPUL [132]     2021   ×   ✓   -     58.4   -     -      -     -      -     -      - (G)    > 6
USOT [153]       2021   ×   ✓   80.6  58.9   -     -      -     -      32.3  33.7   - (G)    113
ULAST [105]      2022   ×   ✓   81.1  61.0   -     -      -     -      40.7  43.3   80 (G)   -
ECO-HC [30]      2017   ×   ×   85.0  63.8   75.3  55.1   72.5  50.6   27.9  30.4   42 (C)   < 1
STRCF [74]       2018   ×   ×   86.6  65.8   73.5  54.8   67.8  47.8   29.8  30.8   24 (C)   < 1
UHP-SOT [157]    2021   ×   ×   90.9  68.9   77.4  57.4   71.0  50.1   31.1  32.0   23 (C)   < 1
UHP-SOT++        Ours   ×   ×   87.6  66.9   78.6  58.2   72.7  51.0   32.9  32.9   20 (C)   < 1
Figure 3.10: Qualitative comparison of top runners against the LaSOT dataset, where tracking
boxes of SiamRPN++, UHP-SOT++, ECO and ECO-HC are shown in red, green, blue and yellow, respectively. The first two rows show sequences in which SiamRPN++ outperforms others
significantly while the last row offers the sequence in which SiamRPN++ performs poorly.
3.3.4 Exemplary Sequences and Qualitative Analysis
We conduct error analysis on a couple of representative sequences to gain more insights in this section. Several exemplary sequences from LaSOT are shown in Figure 3.10, in which SiamRPN++
performs either very well or quite poorly. In the first two sequences, we see the power of accurate
box regression contributed by the RPN. In this type of sequence, good trackers can follow the object well, yet their loose bounding boxes lead to a low success score. Furthermore, the appearance
model would be contaminated by the background information as shown in the second cat example.
The appearance model of DCF-based methods gradually learns background texture rather than following the cat. When the box only covers part of the object, it might also miss some object features,
resulting in a degraded appearance model. In both scenarios, the long-term performance will drop
rapidly. Although UHP-SOT++ allows the aspect ratio change to some extent as seen in the first
flag example, its residual map obtained by background motion modeling is still not as effective as
Figure 3.11: Illustration of three sequences in which UHP-SOT++ performs the best. The tracking
boxes of SiamRPN++, UHP-SOT++, ECO and ECO-HC are shown in red, green, blue and yellow,
respectively.
the RPN due to lack of semantic meaning. Generally speaking, the performance of UHP-SOT++
relies on the quality of the appearance model and the residual map.
On the other hand, SiamRPN++ is not robust enough to handle a wide range of sequences well.
The third example sequence is from video games. SiamRPN++ somehow includes background
objects in its box proposals and drifts away from its targets in the presented frames. Actually,
these background objects are different from their corresponding target objects in either semantic
meaning or local information such as color or texture. The performance of the other three trackers
is not affected. We see that they follow the ground truth without any problem. One explanation
is that such video game sequences may be rare in the training set and, as a result, SiamRPN++
cannot offer a reliable tracking result for them.
Finally, several sequences in which UHP-SOT++ has the top performance are shown in Figure
3.11. In the first cup sequence, all other benchmarking methods lose the target while UHP-SOT++
could go back to the object once the object has obvious motion in the scene. In the second bottle
sequence, UHP-SOT++ successfully detects occlusion without making random guesses, and the object box trajectory prevents the box from drifting away. In contrast, other trackers make ambitious
moves without considering the inertia of motion. The third bus sequence is a complicated one that
involves several challenges such as full occlusion, scale change and aspect ratio change. UHP-SOT++ is the only one that can recover from tracking loss and provide flexible box predictions. These examples demonstrate the potential of UHP-SOT++, which exploits object and background
motion clues across frames effectively.
3.4 Conclusion
An unsupervised high-performance tracker, UHP-SOT++, was proposed in this work. It incorporated two new modules on top of the STRCF baseline: the background motion modeling module and the object box trajectory modeling module. Furthermore, a novel fusion strategy was
adopted to combine proposals from all three modules systematically. It was shown by extensive
experimental results on large-scale datasets that UHP-SOT++ can generate robust and flexible object bounding boxes and offer a real-time high-performance tracking solution on resource-limited
platforms.
The pros and cons of supervised and unsupervised trackers were discussed. Unsupervised
trackers such as UHP-SOT and UHP-SOT++ have the potential to deliver an explainable lightweight tracking solution while maintaining good accuracy. Supervised trackers such as SiamRPN++ benefit from offline end-to-end learning and perform well in general. However, they need to run on GPUs, which is too costly for mobile and edge devices. They may encounter problems with rare samples. Extensive supervision with annotated object boxes is costly. Lack of
interpretability could be a barrier for further performance boosting.
Although UHP-SOT++ offers a state-of-the-art unsupervised tracking solution, there is still a
performance gap between UHP-SOT++ and SiamRPN++. It is worthwhile to find innovative ways
to narrow down the performance gap while keeping its attractive features such as interpretability,
unsupervised real-time tracking capability on small devices, etc., as a future extension. One possible
direction is to investigate how to exploit offline unlabeled data and learn efficiently from few
annotated frames [102]. One main challenge in object tracking is the design of a robust tracker
that can generalize well to various situations. This is an open problem still not solved satisfactorily
by current unsupervised deep or lightweight trackers.
One of our research goals is to provide a "white-box" tracker. To achieve it, we attempt to understand the underlying tracking mechanisms of traditional trackers, identify the failure cases, and find solutions to overcome them. Furthermore, to illustrate the generalizability of our proposed solution, we have conducted extensive experiments, from the small-scale datasets of early days to recent large-scale datasets that cover various object classes and diverse motion trajectories, and have seen performance improvement. We hope that this endeavor will lead to interpretable, robust, and high-performance tracking solutions in the long run.
Chapter 4
GUSOT: Green and Unsupervised Single Object Tracking for
Long Video Sequences
4.1 Introduction
Supervised tracking methods have been dominant in recent years, with superior performance boosted by large amounts of offline labeled data [72]. However, while supervision is powerful at guiding the learning process, it also casts doubt on the reliability of tracking unseen objects. Therefore, unsupervised tracking without any training labels has drawn more and more attention from researchers. There has been promising development of unsupervised trackers in recent years [120, 132, 153, 105, 157, 159]. Nevertheless, many of them are still limited in tracking performance, or are deep-learning trackers with large feature extraction backbones that may have difficulty running on hardware-limited platforms such as drones or mobile phones. In addition, many existing state-of-the-art trackers are short-term trackers in which object loss is not well handled, while object loss is frequently observed in real-world applications. Thus, to extend the possibility of high-performance tracking with minimal human labor cost and resource consumption, we focus on boosting unsupervised tracking performance with lightweight structures that have a small model size, require little training, and offer potential benefits for long-term tracking scenarios.
To tackle the unsupervised long-term tracking problem in a resource-constrained environment,
we propose a green and unsupervised single-object tracker (GUSOT) in this work. Built upon
a short-term tracking baseline known as UHP-SOT++ [159], we introduce two additional new
modules to GUSOT: 1) lost object recovery, and 2) color-saliency-based shape proposal. The first
module helps recover a lost object with a set of candidates by leveraging motion in the scene and
selecting the best one with local/global features (e.g., color information). The second module
facilitates accurate and long-term tracking by proposing bounding-box proposals of flexible shape
for the underlying object using low-cost yet effective segmentation. Both modules are lightweight
and can be easily integrated with UHP-SOT++. They enable GUSOT to achieve higher tracking
accuracy in the long run. We conduct experiments on a large-scale benchmark dataset, LaSOT
[41], containing long video sequences and compare GUSOT with several state-of-the-art trackers.
Experimental results show that GUSOT offers a lightweight tracking solution whose performance
is comparable with that of deep trackers.
Figure 4.1: An overview of the proposed GUSOT tracker, where the red and blue boxes denote
the baseline and the motion proposals, respectively. The one with higher appearance similarity is
chosen to be the location proposal. Then, the third proposal, called the shape proposal, is used to
adjust the shape of the location proposal. The final predicted box is depicted by the yellow box.
4.2 Proposed Method
An overview of the proposed GUSOT method is given in Figure 4.1. GUSOT adds two new
modules to the baseline UHP-SOT++ tracker for higher performance in long video tracking: 1)
lost object recovery and 2) color-saliency-based shape proposal. While the baseline offers a box
proposal (the red box), the recovery module yields a motion residual map and provides the second
box proposal of the same size, called the motion proposal (the blue box). The two proposals
are compared and the one with higher appearance similarity with a trusted template is chosen as
the location proposal. Finally, another proposal, called the shape proposal, is used to adjust the
shape of the location proposal to yield the final prediction (the yellow box). For the UHP-SOT++
baseline, we refer to [159]. The operations of the two new modules are detailed below.
4.2.1 Lost Object Recovery
When the baseline tracker works properly, the lost object recovery module simply serves as a
backup. Yet, its role becomes critical when the baseline tracker fails. Recall that most trackers
rely on similarity matching between the learned template f at frame (t −1) and the search region in
frame t to locate the object. However, if the object gets lost due to tracking error or deformations,
it is difficult to capture it again as the search region often drifts away from the true object location.
Exhaustive search over the whole frame is infeasible in online tracking. Besides, the learned
template could be contaminated during the object-loss period and can no longer be used to search for the
object anymore. This module is used to find object locations of higher likelihood efficiently and
robustly.
This module is built upon background motion estimation and compensation [157]. Based on
the correspondence of sparsely sampled background salient points between frame (t −1) and frame
t, we can estimate the global motion field of the scene. Next, we apply the motion field to all pixels in frame (t−1) to obtain the motion-compensated frame, and compute the difference between frame t and the motion-compensated frame (t−1), leading to a motion residual map. Afterwards, a box proposal, which has the same size as the box in frame (t−1) and covers the largest amount of motion residual, is computed. It serves as a good proposal for the object since the
object can be revealed by residuals if 1) the object and its background take different motion paths
and 2) background motion is compensated.
Now, we have two bounding box proposals: 1) the baseline proposal from UHP-SOT++ and
2) the motion proposal as discussed above. We need to assess their quality and select a better one.
This is achieved by similarity measures. To avoid template degradation from tracking loss, we store a trusted template f*. It could be the initial template or a learned template from a high-confidence frame. Let x be the candidate proposal. We measure the similarity between f* and x from two aspects: local correspondence of structure and surface information, and global correspondence of color distribution. The former focuses on exact matching between two appearances and performs well even if the illumination or color changes, but it tends to fail under rotation and large deformation as they corrupt the original gradient pattern and boundaries. The latter, though less accurate for local matching, is more tolerant to deformation and rotation as long as the colors remain stable. We elaborate on the two measurements below.
The local correspondence on structure and surface information between f* and x can be defined as their correlation coefficient:
$$s_1(f^*, x) = \frac{\langle f^*, x \rangle}{\|f^*\| \, \|x\|}, \qquad (4.1)$$
where ⟨·,·⟩ is the vector inner product, and f* and x are feature representations (rather than pixel values) of a trusted template or a candidate. Commonly used features are the histogram of oriented gradients (HOG) and colornames.
To measure global correspondence on color distribution, we first calculate a normalized histogram v of color keys in colornames, and get v_{f*} and v_x, respectively. Then, the similarity of the two histograms is measured using the Chi-square distance:
$$s_2(f^*, x) = \sum_i \frac{(v_{f^*,i} - v_{x,i})^2}{v_{f^*,i} + v_{x,i}}, \qquad (4.2)$$
where v_{x,i} denotes the i-th element in the histogram. The two histograms are more similar if the Chi-square distance is smaller. We replace the baseline proposal, x_2, with the motion proposal, x_1, if
$$s_1(f^*, x_1) > s_1(f^*, x_2) \quad \text{and} \quad s_2(f^*, x_1) \le s_2(f^*, x_2).$$
Otherwise, we keep the baseline proposal. For a faster tracking speed, the lost object recovery
module is turned on only when the similarity score of the baseline proposal is low or the predicted
object centers show abnormal trajectories (e.g., getting stuck to a corner).
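A minimal sketch of the two measures and the replacement rule is given below; the small epsilon terms added to the denominators are an assumption to avoid division by zero and are not part of the original formulation.

```python
import numpy as np

def s1_local(f_star, x):
    # Eq. (4.1): correlation coefficient between two feature vectors.
    return float(np.dot(f_star, x) /
                 (np.linalg.norm(f_star) * np.linalg.norm(x) + 1e-12))

def s2_color(v_f, v_x):
    # Eq. (4.2): Chi-square distance between normalized color-key histograms.
    return float(np.sum((v_f - v_x) ** 2 / (v_f + v_x + 1e-12)))

def pick_location_proposal(f_star, v_f, motion, baseline):
    """motion/baseline are (feature_vector, color_histogram) pairs for the
    motion proposal x1 and the baseline proposal x2, respectively. The motion
    proposal replaces the baseline only when it wins under both measures."""
    (x1, v1), (x2, v2) = motion, baseline
    better_local = s1_local(f_star, x1) > s1_local(f_star, x2)
    better_color = s2_color(v_f, v1) <= s2_color(v_f, v2)
    return motion if (better_local and better_color) else baseline
```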
Figure 4.2: Comparison of two sampling schemes and their segmentation results, where green
and red dots represent foreground and background initial points, respectively. The red box is the
reference box which indicates the object location.
Figure 4.3: Determination of salient color keys. The red box is the reference box which indicates
the object location. From left to right, the figures show the original object patch, normalized
histogram of color keys inside the bounding box, normalized histogram of color keys outside of
the bounding box, the weighted difference between the two histograms, and a visualization of the salient
foreground color area, respectively.
4.2.2 Color-Saliency-Based Shape Proposal
Good shape estimation can benefit the tracking performance in the long run as it leads to tight
bounding boxes and thus reduces noise in template learning. Deep trackers use the region proposal
network to offer flexible boxes. Here, we exploit several low-cost segmentation techniques for box
shape estimation in online tracking.
Given an image of size W × H, foreground/background segmentation assigns binary labels l_p = l(x, y) ∈ {0, 1} to pixels p = (x, y). The binary mask, I, can be estimated using the Markov random field (MRF) optimization framework:
$$I^* = \arg\min_{I} \ \sum_{p} \rho(p, l_p) + \sum_{\{p,q\} \in \aleph} w_{pq} \| l_p - l_q \|, \qquad (4.3)$$
where ρ(p, l_p) is the cost of assigning l_p to pixel p, defined as the negative log-likelihood of p under the corresponding Gaussian mixture, ℵ is the four-connected neighborhood, and w_{pq} is the mismatch penalty weight, calculated as the Euclidean distance between p and q. Specifically, to calculate ρ(p, l_p), all foreground pixels are grouped together to build one Gaussian mixture, and all background pixels form another Gaussian mixture. ρ(p, l_p) is the negative log-likelihood of p in that specific group, and a color that contradicts the main color distribution of the mixture leads to a large ρ(p, l_p).
The first term of Eq. (4.3) ensures that similar pixels get the same label while its second term
forces label continuity among neighbors. The fast algorithm implemented in [18, 149] can work
on a 32 × 32 image in 50ms. Thus, we crop a small patch centered at the predicted location and
conduct coarse segmentation.
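As an illustration of this step, the sketch below runs a coarse foreground/background segmentation on a small crop. OpenCV's GrabCut (a Gaussian-mixture plus graph-cut solver) is used here as a stand-in for the fast MRF solver cited above, so this is an analogous implementation rather than the exact one; the seed points are assumed to come from the color-saliency-based sampling described next.

```python
import cv2
import numpy as np

def coarse_segmentation(patch_bgr, fg_pts, bg_pts, iters=3):
    """Coarse foreground/background segmentation on a small image crop.
    fg_pts/bg_pts are (x, y) seed points for foreground and background."""
    mask = np.full(patch_bgr.shape[:2], cv2.GC_PR_BGD, np.uint8)
    for x, y in fg_pts:
        mask[y, x] = cv2.GC_FGD
    for x, y in bg_pts:
        mask[y, x] = cv2.GC_BGD
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(patch_bgr, mask, None, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_MASK)
    # Pixels labeled as definite or probable foreground form the binary mask.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```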
The MRF optimization problem is solved by iteration, which is to be initialized by a set of
foreground/background points of good quality. Random sampling inside/outside the box results
in noisy initialization as shown in Figure 4.2(a). It tends to lead to undesired errors and/or longer
iterations. To address it, we exploit salient foreground and background colors. With the predicted
box and its associated image patch, all colors on the patch are quantized into N color keys. The
distributions of color keys in and out of the object box, denoted by pin and pout, are calculated.
Then, the color saliency score (CSS) of color key k_i, i = 1, ..., N, is the weighted difference between the two distributions:
$$\mathrm{CSS}(k_i) = \frac{\sum_{j \neq i} \exp\big(\|k_i - k_j\|^2\big)}{Z} \big(p_{\mathrm{in}}(k_i) - p_{\mathrm{out}}(k_i)\big), \qquad (4.4)$$
$$Z = \sum_{i} \sum_{j \neq i} \exp\big(\|k_i - k_j\|^2\big), \qquad (4.5)$$
where Z is the normalization factor for the sum of exponentials. The weights take the color difference into consideration and favor color keys that stand out among all colors. If CSS(k_i) is a positive (or negative) number of large magnitude, k_i is a foreground (or background) color key. A visualization example is provided in Figure 4.3. The histogram bin number, N, is adaptively determined based on the first frame. We first try the color keys used in the colornames features and take the corresponding number if there exist clear salient colors in this setting. Otherwise, clustering of all colors and
merging of clusters are conducted to determine N. Finally, we sample points according to their
color saliency score. As shown in Figure 4.2, color-saliency-based sampling provides better initial
points for iteration, thus leading to a better segmentation mask.
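A minimal numpy sketch of the color saliency score of Eqs. (4.4)-(4.5) is shown below; it assumes the color keys are given in a normalized range (e.g., RGB in [0, 1]) so that the exponential weights stay bounded, which is an implementation assumption rather than a requirement stated in the text.

```python
import numpy as np

def color_saliency_scores(keys, p_in, p_out):
    """Color saliency score (CSS) of Eqs. (4.4)-(4.5).
    keys:  (N, 3) array of quantized color keys (assumed normalized to [0, 1]).
    p_in:  (N,) normalized histogram of color keys inside the box.
    p_out: (N,) normalized histogram of color keys outside the box."""
    diff = keys[:, None, :] - keys[None, :, :]       # pairwise key differences
    w = np.exp(np.sum(diff ** 2, axis=-1))           # exp(||k_i - k_j||^2)
    np.fill_diagonal(w, 0.0)                         # exclude the j == i terms
    Z = w.sum()                                      # normalization, Eq. (4.5)
    return w.sum(axis=1) / Z * (p_in - p_out)        # Eq. (4.4)
```

Pixels whose quantized color key has a strongly positive (negative) score can then be sampled as foreground (background) seeds for the MRF iteration.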
Complicated objects may not have clear salient colors. If the MRF optimization generates
an abnormal output, we switch to a simple yet effective shape proposal by merging superpixels
with guidance from motion and baseline. As shown in Figure 4.4, after superpixel segmentation
[42], we assign binary labels to superpixels and find the enclosing box, Bs, for the foreground superpixels. With different label assignments, different Bs can be generated.
Let Bb, Bm and Bs denote the baseline proposal, the motion proposal and the shape proposal,
respectively. Note that here the motion proposal may not be the same as the one we have in lost
object recovery. Bm is determined by first calculating the integral curves of the motion residual map along the horizontal and vertical directions and then clipping these integral curves at a certain threshold, which captures the object shape via motion saliency. Usually, we stick to the baseline proposal Bb if its similarity score is high. We consider Bm, Bs, and B* only if the baseline gets a low similarity
score. When this happens, we compute the following proposal as the predicted tracking box at
frame t:
$$B^* = \arg\max_{B_s}\; \mathrm{IoU}(B_s, B_b) + \mathrm{IoU}(B_s, B_m), \qquad (4.6)$$
where IoU is the intersection-over-union between two boxes. A sanity check is applied to $B^*$ to reject sudden rapid shape changes, as we would like to have smooth transitions across frames.
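For illustration, Eq. (4.6) and the subsequent sanity check can be sketched as below. The (x, y, w, h) box format and the scale-change threshold are assumptions made for this example, not values taken from the tracker.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_shape_proposal(candidates_bs, b_baseline, b_motion):
    """Eq. (4.6): pick the candidate B_s maximizing IoU(B_s, B_b) + IoU(B_s, B_m)."""
    scores = [iou(bs, b_baseline) + iou(bs, b_motion) for bs in candidates_bs]
    return candidates_bs[int(np.argmax(scores))]

def sanity_check(b_new, b_prev, max_scale_change=1.8):
    """Reject sudden, rapid shape changes between consecutive frames (threshold is illustrative)."""
    ratio_w = b_new[2] / max(b_prev[2], 1e-6)
    ratio_h = b_new[3] / max(b_prev[3], 1e-6)
    return (1 / max_scale_change < ratio_w < max_scale_change and
            1 / max_scale_change < ratio_h < max_scale_change)
```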
Figure 4.4: Illustration of shape proposal derivation based on superpixel segmentation, where the
red, blue and yellow boxes correspond to the baseline, motion, and shape proposals, respectively.
The size of the motion proposal here is determined by clipping integral curves horizontally and
vertically on the motion residual map.
4.3 Experimental Results
4.3.1 Experimental Setup
We conduct experiments on the large-scale single object tracking dataset LaSOT [41]. It has 280
long test videos of around 685K frames. Evaluation metrics for tracking performance include: 1)
the distance precision (DP) measured at the 20-pixel threshold and 2) the area-under-curve (AUC)
score for the overlap precision. We use the same hyper-parameter settings for the baseline tracker.
The segmentation module is activated when the appearance score of the baseline tracker is less
than 0.2. Almost all template matching trackers can provide the appearance score for an object
proposal. It is simply the correlation score in DCF-trackers. The patch size used in segmentation
is 48×48. The superpixel segmentation exploits a Gaussian blur of $\sigma_{\mathrm{blur}} = 0.6$ and the minimal
superpixel size is set to 50.
Table 4.1: Comparison of unsupervised and supervised trackers on LaSOT, where S and P indicate Supervised and Pre-trained, respectively. Backbone denotes the pre-trained feature extraction
network. The first group are unsupervised lightweight trackers, the second group are unsupervised
deep trackers, and the third group are supervised deep trackers.
Tracker | S | P | DP | AUC | GPU | Backbone
ECO-HC[30] | × | × | 27.9 | 30.4 | × | N/A
STRCF[74] | × | × | 29.8 | 30.8 | × | N/A
UHP-SOT++[159] | × | × | 32.9 | 32.9 | × | N/A
LUDT[122] | × | ✓ | - | 26.2 | ✓ | VGG[16]
USOT[153] | × | ✓ | 32.3 | 33.7 | ✓ | ResNet50[50]
ULAST[105] | × | ✓ | 40.7 | 43.3 | ✓ | ResNet50[50]
SiamFC[6] | ✓ | ✓ | 33.9 | 33.6 | ✓ | AlexNet[66]
ECO[30] | ✓ | ✓ | 30.1 | 32.4 | ✓ | VGG[16]
SiamRPN[73] | ✓ | ✓ | 38.0 | 41.1 | ✓ | AlexNet[66]
GUSOT (Ours) | × | × | 36.1 | 36.8 | × | N/A
4.3.2 Performance Evaluation
We compare tracking accuracy and model complexity of GUSOT against state-of-the-art supervised and unsupervised trackers in Table 4.1. We include three groups of trackers for comparison.
The first group are unsupervised lightweight trackers, and they all have DCF structures and utilize handcrafted features including HOG, colornames and grayscale values. The second group
are unsupervised deep trackers including LUDT, USOT and ULAST. They are end-to-end neural
networks and usually have region proposal networks. The third group are supervised deep
Figure 4.5: Qualitative evaluation of GUSOT, UHP-SOT++, USOT and SiamFC. From top to
bottom, the sequences presented are pool-12, bottle-14, umbrella-2, airplane-15, person-10 and
yoyo-17, respectively.
trackers. While ECO is a DCF with deep features, SiamFC and SiamRPN are end-to-end neural networks which need to be trained on large-scale offline datasets such as ILSVRC dataset for
object detection in video [103] and Youtube-BB [99].
We have the following observations. First, the improvement over the UHP-SOT++ baseline is
around 10% in DP and 12% in AUC. It demonstrates the capability of GUSOT in handling tracking
loss and shape deformation in long video sequences. Second, GUSOT outperforms all previous
DCF-trackers (e.g., supervised ECO and unsupervised ECO-HC and STRCF) in the first group by
a large margin. Third, GUSOT outperforms two unsupervised deep trackers with pre-training (e.g.
LUDT and USOT) in the second group, which reveals that the proposed modules are effective
options to go with minimum offline preparation. Fourth, there is a significant performance gap
between GUSOT and the latest unsupervised deep tracker ULAST. Yet, the latter has to be pre-trained on a large amount of video data with pseudo labels generated from optical flows. Its large
feature backbone, ResNet-50, could be heavy for small devices. Finally, GUSOT surpasses two
supervised deep trackers (i.e., SiamFC and ECO) in the last group, narrowing the performance gap
to the supervised deep tracker SiamRPN.
Six examples are selected and shown in Figure 4.5 for qualitative comparison. These sequences
cover diverse challenging tracking scenarios including partial/full occlusion, 3D rotation, deformation, fast motion, small objects and so on. UHP-SOT++ does not perform well in any of these
sequences. In contrast, GUSOT offers the best performance in all six cases. It shows the power
of the two newly added modules. GUSOT offers flexible box shapes, yields robust tracking against
occlusion and fast motion, and works well on small objects (e.g., yoyo) and large objects (e.g., person). USOT, though having smooth shape variations due to its region proposal network, could have
wrongly inferred shapes (see plane) or be distracted by background clutters (see pool, umbrella,
yoyo and person). SiamFC also fails in the last three sequences (i.e., airplane, person and yoyo).
4.3.3 Ablation Study
4.3.3.1 Contribution of Modules
We conduct ablation study on the contributions of the two proposed modules in Table 4.2. Both
offer performance gains in AUC and DP when applied alone. The variant with segmentation only
can already achieve close performance to the full version, which reveals the significance of flexible
shape proposals in long term tracking.
4.3.3.2 Impact of Baseline Trackers
To examine the impact of the quality of the baseline tracker, we try different baseline settings
and show the performance gain in Table 4.3. Our framework can improve both moderate trackers
such as KCF [51] and advanced trackers including STRCF and UHP-SOT++, which further demonstrates the robustness of the proposed modules and shows that they can also be applied to other trackers that provide an appearance similarity measurement.
4.3.3.3 Attribute-based Study
In Figure 4.6, we compare the DP and AUC scores of several leading unsupervised trackers including GUSOT, UHP-SOT++ and USOT against different challenging tracking attributes. While
USOT performs best in deformation due to its box regression neural network, GUSOT has leading
performance in all other attributes. The improvement over the baseline UHP-SOT++ is more obvious in fast motion, out-of-view and viewpoint change, which aligns well with the contribution of
the lost object recovery and the color-saliency-based shape proposal modules. GUSOT can recover
the object and adapt the shape to different viewpoints.
Table 4.2: Ablation study of module contribution of GUSOT on LaSOT. w. motion refers to
baseline + lost object recovery, and w. shape refers to baseline + color-saliency-based shape
proposal.
baseline w. motion w. shape w. both
AUC (%) 32.9 35.1 36.1 36.8
DP (%) 32.9 34.6 35.1 36.1
Table 4.3: Performance gain of two new modules on different baselines.
KCF STRCF UHP-SOT++
AUC (%) 3.5 2.9 3.9
DP (%) 3.5 2.2 3.2
Figure 4.6: Attribute-based evaluation of GUSOT, UHP-SOT++ and USOT on LaSOT in terms
of DP and AUC, where attributes of interest include the aspect ratio change (ARC), background
clutter (BC), camera motion (CM), deformation (DEF), fast motion (FM), full occlusion (FOC),
illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-view
(OV), partial occlusion (POC), rotation (ROT), scale variation (SV) and viewpoint change (VC).
4.4 Conclusion
A green and unsupervised single-object tracker (GUSOT) for long video sequences was proposed
in this work. As shown by the results on LaSOT, the lost-object-recovery motion proposal and the color-saliency-based shape proposal contribute significantly to unsupervised lightweight trackers. It offers a promising high-performance tracking solution for mobile and edge computing platforms.
In the future, we would like to introduce self-supervision to GUSOT and further exploit tracked
frames in the same sequence to achieve even better tracking performance while preserving its green
and unsupervised characteristics.
Chapter 5
GOT: Unsupervised Green Object Tracker without Offline
Pre-training
5.1 Introduction
There are two major breakthroughs in SOT development in recent years. The first one lies in
the use of the discriminative correlation filter (DCF) [10] and its variants. Based on handcrafted
features (e.g., the histogram of oriented gradients (HOG) and colornames (CN) [33]) extracted
from the reference template, DCF trackers estimate the location and size of the target template
by examining the correlation (or similarity) between the reference template and the image content
in the target search region. The second one arises by exploiting deep neural networks (DNNs)
or deep learning (DL). Supervised and unsupervised DL trackers with pre-trained networks have
been dominating in their respective categories in recent years. They are trained with large-scale
offline pre-training data. The former uses human-labeled object boxes throughout all frames of all training sequences, while the latter does not.
There is a link between DCF and DL trackers. One representative branch of supervised DL
trackers is known as the Siamese network, which maintains the template matching idea. On the
other hand, DL trackers adopt the end-to-end optimization approach to derive powerful deep features for the matching purpose. Besides the backbone network, they incorporate several auxiliary
subnetworks called heads, e.g., the classification head and the box regression head.
Figure 5.1: Comparison of object trackers in the number of model parameters (along the x-axis),
the AUC performance (along the y-axis) and inference complexity in floating point operations (in
circle sizes) with respect to the LaSOT dataset.
The superior tracking accuracy of supervised DL trackers is attributed to a huge amount of
efforts in offline pre-training with densely labeled videos and images. In addition, the backbone
network gets larger and larger from the AlexNet to the Transformer. Generally speaking, DL
trackers demand a large model size and high computational complexity. The heavy computational
burden hinders their practical applications in edge devices. For example, SiamRPN++ [72] has a
model containing 54M parameters and takes 48.9G floating point operations (flops) to track one
frame. To lower the high computational resource requirement, research has been done to compress
the model via neural architecture search [141], model distillation [104], or network pruning and
quantization [9, 21, 11, 56, 3].
One recent research activity lies in reducing the labeling cost. Along this line, unsupervised
DL trackers have been proposed to enable intelligent learning, e.g., [120, 132, 153, 105]. In the
training process, they generate pseudo object boxes in initial frames, allow a tracker to track in
both forward and backward directions, and enforce the cycle consistency. Various techniques
have been proposed to adjust pseudo labels and improve training efficiency. Unsupervised DL
trackers contain complicated networks needed for large-scale offline pre-training, leading to large
model sizes. The state-of-the-art unsupervised DL tracker, ULAST [105], achieves comparable
performance to top supervised DL trackers. As a modification of SiamRPN++, ULAST has a large
model size and heavy computational complexity in inference.
Our research goal is to develop unsupervised, high-performance, and lightweight trackers,
where lightweightness is measured in model sizes and inference computational complexity. Toward this objective, we have developed new trackers by extending DCF trackers. Examples include
UHP-SOT [157], UHP-SOT++ [159] and GUSOT [158]. The extensions include an object recovery mechanism and flexible shape estimation in the face of occlusion and deformation, respectively.
They improved the tracking accuracy of DCF trackers greatly while maintaining their lightweight
advantage. These trackers were not only unsupervised but also demanded no offline pre-training.
Furthermore, these trackers adopted a modular design for algorithmic transparency.
Based on the above discussion, we can categorize object trackers into three types according to
their training strategies: A) supervised trackers, B) unsupervised trackers with offline pre-training,
and C) unsupervised trackers without offline pre-training. In terms of training complexity, Type
B has the highest training complexity while Type C has the lowest training complexity (which is
almost none). We consider their representative trackers in Fig. 5.1:
• Type A: SiamFC, ECO, SiamRPN, LightTrack, DSTfc, and FEAR-XS;
• Type B: USOT and ULAST;
• Type C: STRCF, UHP-SOT, UHP-SOT++, GUSOT, and GOT.
We compare their characteristics in three aspects in the figure: tracking performance (along the
y-axis), model sizes (along the x-axis), and inference complexity (in circle sizes).
The green object tracker (GOT) is a new tracker proposed in this work. It is called “green” due
to its low computational complexity in both training and inference stages, leading to a low carbon
footprint. There is an emerging research trend in artificial intelligence (AI) and machine learning
(ML) by taking the carbon footprint into account. It is called “green learning” [67]. Besides
sustainability, green learning emphasizes algorithmic transparency by adopting a modular design.
GOT has been developed based on the green learning principle.
GOT conducts an ensemble of three prediction branches for robust object tracking: 1) a global
object-based correlator to predict the object location roughly, 2) a local patch-based correlator to
build temporal correlations of small spatial units, and 3) a superpixel-based segmentator to exploit
the spatial information (e.g., color similarity and geometrical constraints) of the target frame. For
the first and the main branch, GOT adopts GUSOT as the baseline. The outputs from three branches
are then fused to generate the ultimate object box, where an innovative fusion strategy is developed.
GOT contains two novel ideas that have been neglected in the existing object tracking literature.
They are elaborated below.
• The performance of the global correlator in the first branch degrades when the tracked object
has severe deformation between two adjacent frames. The local patch-based correlator in the
second branch is used to provide more flexible shape estimation and object re-identification.
It is essential to implement the local correlator efficiently. It is formulated as a binary classification problem. It classifies a local patch into one of two classes - belonging to the object
or the background.
• The tracking process usually alternates between the easy steady period and the challenging
period as confronted with deformations and occlusion. The proposed fuser monitors the
tracking quality and fuses different box proposals according to the tracking dynamics to
ensure robustness against challenges while maintaining a reasonable complexity.
We evaluate GOT on five benchmarking datasets for thorough performance comparison. They
are OTB2015, VOT2016, TrackingNet, LaSOT, and OxUvA. It is demonstrated by extensive experiments and ablation studies that GOT offers competitive tracking accuracy with state-of-the-art
unsupervised trackers (i.e., USOT and ULAST), which demand heavy offline pre-training, at a
lower computation cost. GOT has a tiny model size (<3K parameters) and low inference complexity (around 58M FLOPs per frame), which is between 0.1% and 10% of that of DL trackers. Thus, it can be easily deployed on mobile and edge devices. Furthermore, we discuss the
role played by supervision and offline pre-training to shed light on our design.
5.2 Proposed Method
The system diagram of the proposed green object tracker (GOT) is depicted in Fig. 5.2. It contains
three bounding box prediction branches: 1) a global object-based correlator, 2) a local patch-based
correlator applied to small spatial units, and 3) a superpixel segmentator. Each branch will offer
one or multiple proposals from the input search region, and they will be fused to yield the final
prediction. We use GUSOT [158] in the first branch and the superpixel segmentation technique [42]
in the third branch. In the following, we provide a brief review of the first branch in Sec. 5.2.1,
and elaborate on the second branch in Sec. 5.2.2. We do not spend any space on the superpixel
segmentator since it is directly taken from [42]. Finally, we present the fusion strategy in Sec.
5.2.4.
5.2.1 Global Object-based Correlator
The GUSOT tracker is the evolved result of a series of efforts in enhancing the performance of
lightweight DCF-based trackers. They include STRCF [74], UHP-SOT [157] and UHP-SOT++
[159]. STRCF adds a temporal regularization term to the objective function used for the regression
of the feature map of an object template in a DCF tracker. STRCF can effectively capture the
appearance change while being robust against abrupt errors. However, it generates only rigid
Figure 5.2: The system diagram of the proposed green object tracker (GOT). The global object-based correlator generates a rigid proposal, while the local patch-based correlator outputs a deformable box and an objectness score map that helps the segmentator calculate additional deformable boxes. These proposals are fused into one final prediction.
predictions and cannot recover from the tracking loss. UHP-SOT enhances it with two modules:
background motion modeling and trajectory-based box prediction. The former models background
motion, conducts background motion compensation, and identifies the salient motion of a moving
object in a scene. It facilitates the re-identification of the missing target after tracking loss. The
latter estimates the new location and shape of a tracked object based on its past locations and shapes
via linear prediction. The two modules can collaborate together to estimate the box aspect ratio
change to some extent. UHP-SOT++ further improves the fusion strategy of different modules
and conducts more extensive experiments on the effectiveness of each module on several tracking
datasets.
Although STRCF, UHP-SOT, and UHP-SOT++ boost the performance of classic DCF trackers
by a significant margin, their capability in flexible shape estimation and object re-identification is
still limited. This is because they rely on the correlation between adjacent frames, while an object
template is vulnerable to shape deformation and cumulative tracking errors in the long run. To
improve the tracking performance in long videos, GUSOT examines the shape estimation problem
and the object recovery problem further. It exploits the spatial and temporal correlation by
considering foreground and background color distributions. That is, colors in a search window are
quantized into a set of primary color keys. They are extracted across multiple frames since they
are robust against appearance change. These salient color keys can identify object/background
locations with higher confidence. A low-cost graph-cut-based segmentation method can be used to
provide the object mask. GUSOT can accommodate flexible shape deformation to a certain degree.
All above-mentioned trackers model the object appearance from the global view, i.e., using
features of the whole object for the matching purpose. They provide robust tracking results when
the underlying object is distinctive from background clutters without much deformation or occlusion. For this reason, we adopt the global object-based correlator in the first branch. The advanced
version, GUSOT, is implemented in GOT.
5.2.2 Local Patch-based Correlator
The local patch-based correlator analyzes the temporal correlation existing in parts of the tracked
object. It is designed to handle object deformations more effectively. It is formulated as a binary
classification problem. Given a local patch of size 8×8, the binary classifier outputs its probability
of belonging to the object or to the background. This is a novel contribution of this work.
5.2.2.1 Feature Extraction and Selection
The channel-wise Saab transform is an unsupervised representation learning method proposed in
[23]. It is slightly modified and used to extract features of a patch here. We decompose a color
input image into overlapping patches with a certain stride and subtract the mean color of each
patch to obtain its color residuals. The mean color offers the average color of a patch. The color
residuals are analyzed using the processing pipeline shown in Fig. 5.3, where the input consists of
zero-mean RGB residual channels. We conduct a spectral principal component analysis (PCA) on the RGB residuals to get three decorrelated channels denoted by P, Q and R. For each of them, another spatial PCA is conducted to reduce the feature dimension to C. The final feature vector is formed by concatenating the features of each color channel at each pixel. Note that spectral
and spatial PCA kernels are learned at the initial frame only and shared among all remaining
frames. Given the two PCA kernels, the computation described above can be easily implemented
by convolutional layers of CNNs. Besides the Saab features, handcrafted features such as HOG
and CN are also included for richer representation. Then, a feature selection technique called
discriminant feature test (DFT) [146] is adopted to select a subset of discriminant features. The
feature selection process is only conducted in the initial frame. Once the features are selected, they
are kept and shared among later frames to reduce computational complexity.
Figure 5.3: Channel-wise Saab transformation on color residuals of a patch of size N ×N.
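The following sketch illustrates the spirit of the channel-wise Saab computation described above, using plain PCA on whole patches. It is a simplified stand-in for the transform of [23]: the per-patch spatial PCA (instead of small filters inside each block) and the feature dimension C are illustrative assumptions.

```python
import numpy as np

def fit_channelwise_saab(patches, C=4):
    """Learn kernels on the initial frame.

    patches: (n, h, w, 3) float array of local patches (e.g., 8x8x3).
    Returns the spectral PCA (3x3) and one spatial PCA (C x h*w) per decorrelated channel.
    """
    n, h, w, _ = patches.shape
    residuals = patches - patches.mean(axis=(1, 2), keepdims=True)      # remove the mean color

    # Spectral PCA over RGB -> decorrelated P, Q, R channels.
    rgb = residuals.reshape(-1, 3)
    spectral_kernel = np.linalg.eigh(np.cov(rgb.T))[1][:, ::-1].T        # (3, 3), rows = components
    pqr = residuals.reshape(n, -1, 3) @ spectral_kernel.T                # (n, h*w, 3)

    # One spatial PCA per channel, keeping the top-C components.
    spatial_kernels = []
    for c in range(3):
        cov = np.cov(pqr[:, :, c].T)
        eigvec = np.linalg.eigh(cov)[1][:, ::-1][:, :C]                  # (h*w, C)
        spatial_kernels.append(eigvec.T)
    return spectral_kernel, spatial_kernels

def channelwise_saab_features(patches, spectral_kernel, spatial_kernels):
    """Apply the learned kernels; the mean color is appended as described in the text."""
    n = patches.shape[0]
    mean_color = patches.mean(axis=(1, 2))                               # (n, 3)
    residuals = patches - mean_color[:, None, None, :]
    pqr = residuals.reshape(n, -1, 3) @ spectral_kernel.T                # (n, h*w, 3)
    feats = [pqr[:, :, c] @ spatial_kernels[c].T for c in range(3)]      # each (n, C)
    return np.concatenate([mean_color] + feats, axis=1)
```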
5.2.2.2 Patch Classification
If a patch is fully outside or fully inside the bounding box in the reference frame, it is assigned “0” or “1”, respectively. Patches lying on the box boundary are not used in training
to avoid confusion. Feature vectors and labels are used in an XGBoost classifier [20]. In the
experiment, we set the tree number and the maximum tree depth to 40 and 4, respectively. These
hyperparameters are determined via offline cross validation on a set of training videos. They can
also be determined using samples in the initial frame. The predicted soft probability scores of
patches in the search window of the target frame form a heat map which is called the objectness
score map. Note that some patches inside the bounding box may belong to the background rather
than the object, leading to noisy labels. To alleviate this problem, we adopt a two-stage training
strategy. The first-stage classifier is trained using labels based on the patch location inside/outside
of the bounding box in the reference frame. It is applied to patches in the target frame to produce
soft probabilities. Then, the soft probabilities are binarized again to provide finetuned patch labels.
Due to the feature similarity between true background patches and falsely labeled foreground patches, their predicted soft labels should be close to each other. As a result, the finetuned labels are more reliable than the initial labels. The second-stage classifier is trained using the finetuned labels.
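A minimal sketch of the two-stage training is given below. It assumes that patch features and positional labels have already been prepared, and the 0.5 binarization threshold is an assumption made for illustration.

```python
from xgboost import XGBClassifier

def train_two_stage_classifier(feats_ref, labels_ref, feats_target):
    """Two-stage patch classifier (a sketch under the assumptions stated above).

    feats_ref   : (n_ref, d) features of reference-frame patches (boundary patches excluded).
    labels_ref  : (n_ref,) 0/1 labels from patch position w.r.t. the reference box.
    feats_target: (n_tgt, d) features of target-frame patches.
    """
    # Stage 1: train on positional labels and score the target-frame patches.
    clf1 = XGBClassifier(n_estimators=40, max_depth=4)
    clf1.fit(feats_ref, labels_ref)
    soft = clf1.predict_proba(feats_target)[:, 1]

    # Finetuned labels: binarize the soft scores; false foreground patches whose
    # features resemble the background tend to flip to label 0 here.
    finetuned = (soft > 0.5).astype(int)

    # Stage 2: retrain on the finetuned labels.
    clf2 = XGBClassifier(n_estimators=40, max_depth=4)
    clf2.fit(feats_target, finetuned)
    return clf2

# The per-patch probabilities of the second-stage classifier over the search window,
# reshaped to a 2D grid, form the objectness score map used in later stages.
```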
5.2.2.3 From Heat Map to Bounding Box
To obtain a rectangular bounding box, we binarize the heat map and draw a tight enclosing box to
obtain an objectness proposal. Due to noise around the object boundary, direct usage of the heat
map does not yield stable box prediction. To overcome the problem, we smoothen the heat map
and use it to weigh the raw heat map for noise suppression. Let $P_t \in \mathbb{R}^{H\times W}$, $S_{t-1}$, and $S_t$ denote the raw probability map of frame $t$, the template of frame $t-1$, and the updated template of frame $t$, respectively. Note that $S_{t-1}$ has been registered to align with $P_t$ via circulant translation. The processed heat map is expressed as
$$P_t^* = P_t \odot S_{t-1}, \qquad (5.1)$$
where $\odot$ is the element-wise multiplication applied at locations where $S_{t-1}$ has an objectness score below 0.5. Then, $S_t$ is updated by minimizing a cost function as follows:
$$S_t = \arg\min_{X}\; \lVert X - P_t^*\rVert_F^2 + \mu\,\lVert X - S_{t-1}\rVert_F^2, \qquad (5.2)$$
where parameter µ controls the tradeoff between the updating rate and smoothness. Eq. (5.2) is a
regularized least-squares problem. It has the closed-form solution
$$\begin{aligned}
S_t &= [P_t^*,\; \mu S_{t-1}]\,[I_H,\; \mu I_H]^{\dagger} \\
    &= [P_t^*,\; \mu S_{t-1}]\,([1,\; \mu] \otimes I_H)^{\dagger} \\
    &= [P_t^*,\; \mu S_{t-1}]\,([1,\; \mu]^{\dagger} \otimes I_H^{\dagger}) \\
    &= [P_t^*,\; \mu S_{t-1}]\,([1,\; \mu]^{\dagger} \otimes I_H) \\
    &= [P_t^*,\; \mu S_{t-1}]\,\Bigl(\bigl[\tfrac{1}{1+\mu^2},\; \tfrac{\mu}{1+\mu^2}\bigr]^{T} \otimes I_H\Bigr) \\
    &= \tfrac{1}{1+\mu^2}\,P_t^* + \tfrac{\mu^2}{1+\mu^2}\,S_{t-1},
\end{aligned} \qquad (5.3)$$
where $\dagger$, $\otimes$, and $I_H$ are the Moore–Penrose pseudoinverse, the Kronecker product and the $H \times H$
identity matrix, respectively. We use several examples to visualize the evolution of templates over
time in Figure 5.4.
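The noise suppression of Eq. (5.1) and the closed-form update of Eq. (5.3) reduce to a few array operations, sketched below under the assumption that the previous template has already been registered to the current map; the default value of mu is only illustrative.

```python
import numpy as np

def update_objectness_template(P_t, S_prev, mu=5.0):
    """Eqs. (5.1)-(5.3): suppress noise in the raw map and update the template.

    P_t    : (H, W) raw objectness probability map of frame t.
    S_prev : (H, W) template of frame t-1, already aligned with P_t.
    mu     : regularization weight controlling the update rate vs. smoothness.
    """
    # Eq. (5.1): down-weight P_t where the previous template is unlikely foreground (< 0.5).
    P_star = np.where(S_prev < 0.5, P_t * S_prev, P_t)

    # Eq. (5.3): closed-form solution of the regularized least-squares update.
    S_t = P_star / (1.0 + mu ** 2) + (mu ** 2) / (1.0 + mu ** 2) * S_prev
    return P_star, S_t
```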
5.2.2.4 Classifier Update
Since the object appearance may change over time, the classifier needs to be updated to adapt to
a new environment. The necessity of classifier update can be observed based on the classification
performance. The heat map is expected to span the object template reasonably well. If it deviates
too much from the object template, an update is needed. As shown in Fig. 5.5, regions of higher
probability (marked by warm colors) tend to shrink when there are new object appearances (in the
top example) or they may go out of the box when new background appears (in the bottom example).
Once one of such phenomena is observed, the classifier should be retrained using samples from
an earlier frame of high confidence and those from the current frame. The retraining cost is low
because of the tiny size of the classifier.
Figure 5.4: (Top) Visualization of the evolution of templates over time and (bottom) visualization
of the noise suppression effect in the raw probability map based on Eq. (5.1).
Figure 5.5: Proper updating helps maintain decent classification quality.
5.2.3 Superpixel Segmentation
For the first and the second branches, we exploit temporal correlations of the object and background
across multiple frames. In the third branch, we consider spatial correlation in the target frame and
perform the unsupervised segmentation task. Superpixel segmentation has been widely studied for
years. It offers a mature technique to generate a rough segmentation mask. However, to group
superpixels into a connected group, an algorithm usually checks the appearance similarity and
geometric connectivity, which can be expensive. In our case, the heat map provides a natural
grouping guidance. When we overlay the heat map and superpixel segments, each segment gets
an averaged probability score. Then, we can group segments by considering various probability
thresholds and draw multiple box proposals, as shown in Fig. 5.2. They are called superpixel
proposals.
5.2.4 Fusion of Proposals
There are three types of proposals in GOT: 1) the DCF proposal $x_{dcf}$ from the global object-based correlator, 2) the objectness proposal $x_{obj}$ from the local patch-based correlator, and 3) the superpixel proposals $\chi_{spp} = \{x_{spp,i} \mid i = 1,\dots,N\}$ from the superpixel segmentator. While $\chi_{spp}$ contains multiple proposals obtained by grouping different segments, the most valuable one can be selected by evaluating the intersection-over-union (IoU) as
$$x^*_{spp} = \arg\max_{x\in\chi_{spp}}\; \mathrm{IoU}(x, x_{dcf}) + \lambda\,\mathrm{IoU}(x, x_{obj}), \qquad (5.4)$$
where $\lambda$ is an adaptive weight calculated from $\mathrm{IoU}(x_{dcf}, x_{obj})$. It lowers the contribution of poor
heat maps. In the following, we first present two fusion strategies that combine multiple proposals
into one final prediction and then elaborate on tracking quality control and object re-identification.
5.2.4.1 Two Fusion Strategies
According to the difficulty level of the tracking scenario, one final bounding box is generated from
these proposals with a simple or an advanced fusion strategy.
Simple Fusion. During an easy tracking period without obvious challenges, multiple proposals
tend to agree well with each other. Then, we adopt a simple strategy based on IoU and probability
values. The current tracking status is considered as easy if
$$\min_{x_i,\, x_j \in \{x_{dcf},\, x_{obj},\, x_{spp}\}} \mathrm{IoU}(x_i, x_j) \ge \alpha,$$
where $\alpha$ is the threshold to distinguish good and poor alignment. Under this condition, we first fuse the flexible proposals $x_{obj}$ and $x_{spp}$ and then choose from the flexible proposal and the rigid proposal $x_{dcf}$ via the following steps:
• Choose from $x_{obj}$ and $x_{spp}$ by finding $x^*_{def} = \arg\max_x \mathrm{IoU}(x, x_{dcf})$.
• Choose from $x^*_{def}$ and $x_{dcf}$. Stick to $x_{dcf}$ if it has a larger averaged probability score inside the box and the size of $x^*_{def}$ changes too rapidly when compared with the previous prediction.
Advanced Fusion. When multiple proposals differ a lot, it is nontrivial to select the best one
just using IoU or probability distribution. Instead, we fuse the information from different sources
with the following optimization process. A rough foreground mask $I^*$ is derived by searching for the optimal 0/1 label assignment to pixels in the image. Let $x$ and $l_x$ denote the pixel location and its label. The mask $I^*$ can be estimated using the Markov Random Field (MRF) optimization:
$$I^* = \arg\min_{I}\; \sum_x \rho(x, l_x) + \sum_{\{x,y\}\in\aleph} w_{xy}\,\lVert l_x - l_y\rVert, \qquad (5.5)$$
where $\aleph$ is the four-connected neighborhood, the second term assigns penalties to neighboring pixels that do not share the same label, the weight $w_{xy}$ is calculated from the color difference between $x$ and $y$, and
$$\rho(x, l_x) = -\log p_{\mathrm{color}}(l_x \mid x) - \log p_{\mathrm{obj}}(l_x \mid x) \qquad (5.6)$$
combines the negative log-likelihood of $x$ taking label $l_x$ in terms of color and that of $x$ taking label $l_x$ in terms of objectness. The former is modeled by the Gaussian mixtures while the
latter comes from the classification results. The rough mask in Eq. (5.5) takes color, objectness,
and connectivity into account to find the most likely label assignment. While the solution could be
improved iteratively, we only run one iteration since the result is good enough to serve as the rough
mask. Next, to fuse proposals, we select the one that has the highest IoU with the enclosing box
of the rough mask. If the advanced fusion fails, we go back to the simple fusion as a backup. The
overall fusion strategy that consists of both simple and advanced fusion schemes is summarized in
Algorithm 1.
Algorithm 1 Fusion of Multiple Proposals
Input: $x_{dcf}$, $x_{obj}$, $\chi_{spp}$, $\alpha$, $P^*_t$
Output: final prediction $x_t$
  $x_{spp} \leftarrow \arg\max_{x\in\chi_{spp}} \mathrm{IoU}(x, x_{dcf}) + \lambda\,\mathrm{IoU}(x, x_{obj})$
  $flag \leftarrow \{\min_{x_i, x_j \in \{x_{dcf}, x_{obj}, x_{spp}\}} \mathrm{IoU}(x_i, x_j)\} \ge \alpha$
  if $flag$ is false then
      generate box $x_{mrf}$ from the MRF mask
      if success then
          return $x_t \leftarrow \arg\max_{x \in \{x_{dcf}, x_{obj}\} \cup \chi_{spp}} \mathrm{IoU}(x, x_{mrf})$
      end if
  end if
  $x^*_{def} \leftarrow \arg\max_{x \in \{x_{obj}, x_{spp}\}} \mathrm{IoU}(x, x_{dcf})$
  $SP^*_t(x^*_{def}) \leftarrow$ averaged probability inside $x^*_{def}$
  $SP^*_t(x_{dcf}) \leftarrow$ averaged probability inside $x_{dcf}$
  if $x^*_{def}$ is stable or $SP^*_t(x^*_{def}) > SP^*_t(x_{dcf})$ then
      return $x_t \leftarrow x^*_{def}$
  end if
  return $x_t \leftarrow x_{dcf}$
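Algorithm 1 can be transcribed almost line by line into code. The sketch below passes the IoU computation, the in-box probability averaging, the MRF box generation, the stability check, and the adaptive weight λ in as arguments, since their exact criteria are implementation details not fixed by the pseudocode.

```python
def fuse_proposals(x_dcf, x_obj, chi_spp, alpha, P_star,
                   iou, avg_prob, mrf_box, is_stable, lam):
    """Fusion of multiple proposals (Algorithm 1), with helper routines passed in as callables."""
    # Select the best superpixel proposal (Eq. 5.4).
    x_spp = max(chi_spp, key=lambda x: iou(x, x_dcf) + lam * iou(x, x_obj))

    proposals = [x_dcf, x_obj, x_spp]
    aligned = min(iou(a, b) for a in proposals for b in proposals) >= alpha

    if not aligned:
        # Advanced fusion: derive a rough MRF mask and pick the proposal closest to it.
        x_mrf = mrf_box(P_star)                      # assumed to return None on failure
        if x_mrf is not None:
            candidates = [x_dcf, x_obj] + list(chi_spp)
            return max(candidates, key=lambda x: iou(x, x_mrf))

    # Simple fusion: fuse the flexible proposals, then compare against the rigid one.
    x_def = max([x_obj, x_spp], key=lambda x: iou(x, x_dcf))
    if is_stable(x_def) or avg_prob(x_def, P_star) > avg_prob(x_dcf, P_star):
        return x_def
    return x_dcf
```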
5.2.4.2 Tracking Quality Control
For rigid objects with rigid motion, the global object-based correlator can provide a fairly good
prediction. The template matching similarity score of DCF is usually high. The local patch-based correlator usually helps in the face of challenges such as background clutters and occlusions.
However, due to the complicated nature of local patch classification, its proposal may be noisy.
Detection and removal of noisy proposals in the second branch are important to good tracking
performance in general. To solve this issue, we monitor the quality of the heat map and may discard
noisy proposals until classification gets stable again. The flowchart of tracking quality control is
depicted in Fig. 5.6. After the heat map is obtained, we check whether the high probability region
is too small or too large and whether it contains several unconnected blobs. All of them indicate
that the local patch-based correlator is not stable. Thus, its proposal is discarded. Once the problem
is resolved, the heat map becomes stable, and the objectness proposal shall have small variations
in height and width. Then, we can turn on the shape estimation functionality (i.e., the local patch-based correlator in the second branch) and conduct the fusion of all proposals.
Figure 5.6: Management of different tools for tracking, where S.E. denotes the shape estimation
function provided by the second branch.
An exemplary sequence is illustrated in Fig. 5.7. In the beginning, the tracking process is
smooth and the simple fusion is sufficient. Then, when some challenges appear and multiple proposed boxes do not align, we turn to the advanced MRF fusion. When there is severe background
clutter or occlusion that confuses the classifier in the second branch, the DCF proposal is adopted
directly until the turbulence goes away. Then, the process repeats until the end of the video.
Figure 5.7: The fusion strategy (given in the fourth row, where S.F. stands for simple fusion)
changes with tracking dynamics over time. The DCF proposal, the objectness proposal, and the
superpixel proposal are given in the first, second, and third rows, respectively. The sequence is
motorcycle-9 from LaSOT and the object is the motorbike.
5.2.4.3 Object Re-identification
Besides shape estimation, the objectness proposal can be used for object re-identification after
tracking loss. Given the current DCF proposal and the motion proposal that covers the most motion
flow and possibly contains the lost foreground object, GUSOT selects one of the two via trajectory
stability and color/template similarity. However, GUSOT is not general enough to cover all cases.
The objectness score provides an extra view of the appearance similarity with timely updated object
information, and it helps recover the object quickly.
Given a candidate box proposal $x$ with center $x_{ct}$, its averaged objectness score $S_{obj}(x)$ inside the box, the feature representation $f(x)$ of the region, and the DCF template $f_{t-1}$, the scoring function for this candidate can be calculated as
$$S(x) = \beta_1 \langle f(x), f_{t-1}\rangle + \beta_2 S_{obj}(x) - \beta_3 \lVert x_{ct} - \hat{x}_{t,ct}\rVert^2, \qquad (5.7)$$
where $\hat{x}_{t,ct}$ is the linear prediction of the box center based on past predictions, $\beta_1$ and $\beta_3$ are positive constants that adjust the magnitudes of all terms to the same level, and $\beta_2$ is a positive adaptive weight that assigns lower contributions to poorer probability maps. An ideal candidate box is expected to have high template similarity, a high objectness score, and a small translation from the last
prediction. Then, we can choose from the DCF proposal and the motion proposal by selecting the
one with a higher evaluation score to re-identify the lost object.
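For clarity, the candidate scoring of Eq. (5.7) can be written as the short function below; the feature extraction and the adaptive choice of the weights β1, β2, and β3 are left outside the snippet, and the default weight values shown are placeholders rather than tracker settings.

```python
import numpy as np

def candidate_score(feat, template, obj_score, center, pred_center,
                    beta1=1.0, beta2=1.0, beta3=1.0):
    """Eq. (5.7): score a candidate box for object re-identification.

    feat        : feature vector f(x) of the candidate region.
    template    : DCF template f_{t-1}, flattened to match feat.
    obj_score   : averaged objectness score S_obj(x) inside the candidate box.
    center      : (x, y) center of the candidate box.
    pred_center : linearly predicted center from the past trajectory.
    """
    similarity = float(np.dot(np.ravel(feat), np.ravel(template)))
    drift = float(np.sum((np.asarray(center) - np.asarray(pred_center)) ** 2))
    return beta1 * similarity + beta2 * obj_score - beta3 * drift

# The DCF proposal and the motion proposal are then compared by this score, and the
# higher-scoring one is taken to re-identify the lost object.
```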
5.3 Experimental Results
5.3.1 Experimental Setup
5.3.1.1 Performance Metrics
The one-pass-evaluation (OPE) protocol is adopted for all trackers in performance benchmarking
unless specified otherwise. The metrics for tracking accuracy include: the distance precision (DP)
and the area-under-curve (AUC). DP measures the center precision at the 20-pixel threshold to rank
different trackers, and AUC is calculated using the overlap precision curve. For model complexity,
we consider two metrics: the model size and the computational complexity required to predict the
target object box from the reference one on average. The latter is also called the inference complexity per frame. The model size is the number of model parameters. The inference complexity
per frame is measured by the number of (multiplication or add) floating point operations (flops).
5.3.1.2 Benchmarking Object Trackers
We compare GOT with the following four categories of trackers.
• supervised lightweight DL trackers: LightTrack [141], DSTfc [104], and FEAR-XS [11].
• supervised DL trackers: SiamFC [6], ECO [30], and SiamRPN [73].
• unsupervised DL trackers: LUDT [122], ResPUL [132], USOT [153], and ULAST [105].
• unsupervised DCF trackers: KCF [51], SRDCF [33], and STRCF [74].
For unsupervised DL trackers, we use the models trained from scratch for comparison if they are
available.
5.3.1.3 Tracking Datasets
We conduct performance evaluations of various trackers on four datasets.
• OTB2015 [134]. It contains 100 videos with an average length of 598 frames. The dataset
was released in 2015. Many video sequences are of lower resolution.
• VOT2016 [46]. It contains 60 video sequences with an average length of 358 frames. It has a
significant overlap with the OTB2015 dataset. One of its purposes is to detect the frequency
of tracking failures. Different from the OPE protocol, once a failure is detected, the baseline
experiment re-initializes the tracker.
• TrackingNet [94]. It is a large-scale dataset for object tracking in the wild. Its testing set
consists of 511 videos with an average length of 442 frames.
• LaSOT [41]. It is the largest single object tracking dataset by far. It has 280 long testing
videos with 685K frames in total. The average video length is 2000+ frames. Thus, it serves
as an important benchmark for measuring long-term tracking performance.
5.3.1.4 Implementation Details
In the implementation, each region of interest is warped into a 60 × 60 patch with an object that
takes around 32 × 32 pixels. The XGBoost classifier in the local correlator has 40 trees with the
maximum depth set to 4. Parameters α = 0.7 and µ = 5 are used in the fuser. The first branch
(i.e., the global correlator), the combined second and third branches (i.e., the local correlator and
the superpixel segmentator), and the fuser run at 15 FPS, 5 FPS, and 15 FPS, respectively, on one Intel(R) Core(TM) i5-9400F CPU. The speed can be further improved via code optimization
and parallel programming.
Table 5.1: Comparison of tracking accuracy and model complexity of representative trackers of four categories on four tracking datasets.
Some numbers for model complexity are rough estimations due to the lack of detailed description of network structures and/or public
domain codes. Furthermore, the complexity of some algorithms is related to built-in implementation and hardware. OPT and UT are
abbreviations for offline pre-training and unsupervised trackers, respectively. The top 3 runners among all unsupervised trackers (i.e.,
those in the last two categories) are highlighted in red, green, and blue, respectively.
Tracker | OPT/UT | Flops (↓) | # Params. (↓) | OTB2015 DP (↑) | OTB2015 AUC (↑) | VOT16 EAO (↑) | TrackingNet DP (↑) | TrackingNet AUC (↑) | LaSOT DP (↑) | LaSOT AUC (↑)
LightTrack[141] | ✓/× | 530 M (9X) | 1.97 M (896X) | - | 66.2 | - | 69.5 | 72.5 | 53.7 | 53.8
DSTfc[104] | ✓/× | 1.23 G (21X) | 0.54 M (246X) | 76.1 | 57.3 | - | 51.2 | 56.2 | - | 34.0
FEAR-XS[11] | ✓/× | 478 M (8X) | 1.37 M (623X) | - | - | - | - | - | 54.5 | 53.5
SiamFC[6] | ✓/× | 2.7 G (47X) | 2.3 M (1046X) | 77.1 | 58.2 | 23.5 | 53.3 | 57.1 | 33.9 | 33.6
ECO[30] | ✓/× | 1.82 G (31X) | 95 M (43201X) | 90.0 | 68.6 | 37.5 | 48.9 | 56.1 | 30.1 | 32.4
SiamRPN[73] | ✓/× | 9.23 G (159X) | 22.63 M (10291X) | 85.1 | 63.7 | 34.4 | - | - | 38.0 | 41.1
LUDT[122] | ✓/✓ | - | - | 76.9 | 60.2 | 23.2 | 46.9 | 54.3 | - | 26.2
ResPUL[132] | ✓/✓ | 2.65 G (46X) | 1.445 M (657X) | - | 58.4 | 26.3 | 48.5 | 54.6 | - | -
USOT[153] | ✓/✓ | >14 G (241X) | 29.4 M (13370X) | 79.8 | 58.5 | 35.1 | 55.1 | 59.9 | 32.3 | 33.7
ULAST[105] | ✓/✓ | ⪆50 G (862X) | ⪆50 M (22738X) | 81.1 | 61.0 | - | - | - | 40.7 | 43.3
KCF[51] | ×/✓ | - | - | 69.6 | 48.5 | 19.2 | 41.9 | 44.7 | 16.6 | 17.8
SRDCF[33] | ×/✓ | - | - | 78.1 | 59.3 | 24.7 | 45.5 | 52.1 | 21.9 | 24.5
STRCF[74] | ×/✓ | - | - | 86.6 | 65.8 | 27.9 | - | - | 29.8 | 30.8
GOT (Ours) | ×/✓ | ≈58 M (1X) | 2199 (1X) | 87.6 | 65.4 | 26.8 | 52.6 | 56.3 | 38.8 | 38.5
5.3.2 Performance Evaluation
We compare the performance of GOT with four categories of trackers on four datasets in Table 5.1.
Trackers are grouped based on their categories. From top to bottom, they are supervised lightweight
DL trackers, supervised DL trackers, unsupervised DL trackers, and unsupervised DCF trackers,
respectively. Our proposed GOT belongs to the last category. We have the following observations.
OTB2015. GOT has the best performance in DP and the second best performance in AUC
among unsupervised trackers on this dataset. One explanation is that DL trackers are trained on
high-resolution videos and they do not generalize well to low-resolution videos. GOT is robust
against different resolutions since its HOG and CN features remain stable, which ensures a higher success rate for template matching.
VOT2016. It adopts the expected average overlap (EAO) metric to evaluate the overall performance of a tracker. EAO considers both accuracy and robustness. The observation on this dataset
reveals that the advantages of the local correlator branch are not obvious on very short videos
since the tracker gets corrected automatically if its IoU is lower than a threshold. Yet, GOT still
ranks third among unsupervised trackers (i.e., the last two categories) with a tiny model (of 2.2K
parameters) and an inference complexity that is lower by 3 to 5 orders of magnitude.
TrackingNet. The ground-truth box is provided for the first frame only. The performance
of GOT is evaluated by an online server. GOT ranks second among unsupervised trackers. Its
performance is also comparable with almost all supervised DL trackers (except LightTrack).
LaSOT. GOT ranks second among unsupervised trackers. It even has better performance than
some supervised trackers such as DSTfc while maintaining a much smaller model size and lower
computational complexity.
5.3.3 Comparison among Lightweight Trackers
We compare the design methods and training costs of GOT and three lightweight DL trackers in
Table 5.2. The lightweight DL trackers conduct the neural architecture search (NAS) or model
distillation/optimization to reduce the model size and inference complexity. As shown in Table
5.1, LightTrack achieves even higher tracking accuracy than large models. Besides NAS, FEAR-XS [11] adopts several special tools such as depth-wise separable convolutions and increases the
number of object templates to lower complexity while maintaining high accuracy. Although DSTfc
has the smallest model size among the three, its model size is still larger than that of GOT by
two orders of magnitude. Furthermore, the tracking performance of GOT is better than that of
DSTfc in three datasets. As to the training cost, all three lightweight DL trackers need pre-training
on millions of labeled frames while GOT does not require any as shown in the last column of
Table 5.1. The superiority of LightTrack in accuracy does have a cost, including long pre-training,
architecture search and fine-tuning. Finally, it is worth mentioning that GOT is more transparent
in its design. Thus, its source of tracking errors can be explained and addressed. In contrast, the
failure of lightweight DL trackers is difficult to analyze. It could be overcome by adding more
pre-training samples and repeating the whole design process one more time.
Table 5.2: Comparison of design methods and pre-training costs of GOT and three lightweight DL
trackers.
Trackers Design Methods # of Pre-Training Boxes
LightTrack[141] NAS ≈ 10M
DSTfc[104] NAS, model distillation ≈ 2M
FEAR-XS [11] NAS, network optimization ≈ 13M
GOT (Ours) fusion of 3 branches 0
5.3.4 Long-term Tracking Capability
To examine the capability of GOT in long-term tracking, we test it on the test set of the OxUvA dataset [116]. OxUvA contains 166 long test videos under the tracking-in-the-wild setting. The
object to be tracked disappears from the field of view in around one half of video frames. Trackers
need to report whether the object is present or absent and give the object box when it is present. The
ground-truth labels are hidden, and the tracking results are evaluated on a competition server. Since
the competition is not maintained any longer, we cannot submit our results for official evaluation.
For this reason, we use predictions from the leading tracker LTMU [28] as the pseudo labels
for performance evaluation below. There are three major evaluation metrics: the true positive
rate (TPR), the true negative rate (TNR) and the maximum geometric mean (MaxGM), which is
calculated as
$$\mathrm{MaxGM} = \max_{0\le p\le 1} \sqrt{\bigl((1-p)\cdot \mathrm{TPR}\bigr)\cdot\bigl((1-p)\cdot \mathrm{TNR} + p\bigr)}, \qquad (5.8)$$
where TPR stands for the fraction of present objects that are predicted as present and located
with a tight bounding box, and TNR represents the fraction of absent objects that are correctly
reported as absent.
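For reference, MaxGM in Eq. (5.8) can be evaluated with a simple grid search over p, as in the sketch below (the grid resolution is arbitrary).

```python
import numpy as np

def max_geometric_mean(tpr, tnr, num_steps=1001):
    """Eq. (5.8): MaxGM = max over p in [0, 1] of sqrt(((1-p)*TPR) * ((1-p)*TNR + p))."""
    p = np.linspace(0.0, 1.0, num_steps)
    gm = np.sqrt(((1.0 - p) * tpr) * ((1.0 - p) * tnr + p))
    return float(gm.max())

# Example with the GOT* numbers reported in Table 5.3:
# max_geometric_mean(0.351, 0.751) evaluates to approximately 0.51.
```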
We compare the above performance metrics of GOT against three lightweight long-term trackers, KCF [51] (the long-term version), TLD [58] and FuCoLoT [85], in Table 5.3. TLD and
FuCoLoT are equipped with a re-detection mechanism to find the object after loss. We see that
GOT achieves the highest TPR because of its accurate box predictions. FuCoLoT and GOT do not
provide present/absent predictions so their TNR values are zero. In GOT∗, the object is claimed to
be absent if the similarity score of template matching is lower than a threshold, which is set to 0.1
in the experiment. Then, KCF and GOT∗ have comparable performance in TNR. Finally, GOT∗
has the best performance in MaxGM. The above design indicates that the similarity score in GOT
is a simple yet effective indicator of the object status. The effect of various threshold values on GOT∗ is illustrated in Fig. 5.8. As the threshold grows from 0 to 0.15, its TPR decreases slowly
while its TNR increases quickly. The optimal threshold for the MaxGM metric is around 0.1.
5.3.5 Attribute-based Performance Evaluation
To shed light on the strengths and weaknesses of GOT, we conduct the attribute-based study among
GOT, the GUSOT baseline, and USOT, which is an unsupervised DL tracker, on the LaSOT dataset.
Table 5.3: Performance comparison of GOT and GOT∗ against three lightweight long-term trackers, KCF (the long-term version), TLD and FuCoLoT, on the OxUvA dataset, where the best performance is shown in red. KCF and TLD are implemented in OpenCV.
KCF TLD FuCoLoT GOT GOT∗
TPR (↑) 0.165 0.142 0.353 0.425 0.351
TNR (↑) 0.872 0.095 0 0 0.751
MaxGM (↑) 0.380 0.198 0.297 0.326 0.514
Figure 5.8: The TPR, TNR and MaxGM values of GOT∗ at different present/absent threshold values on the OxUvA dataset.
Figure 5.9: Attribute-based evaluation of GOT, GUSOT and USOT on LaSOT in terms of DP and
AUC, where attributes of interest include the aspect ratio change (ARC), background clutter (BC),
camera motion (CM), deformation (DEF), fast motion (FM), full occlusion (FOC), illumination
variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-view (OV), partial
occlusion (POC), rotation (ROT), scale variation (SV) and viewpoint change (VC).
Figure 5.10: Comparison of the tracked object boxes of GOT, GUSOT, USOT, and SiamFC for
four video sequences from LaSOT (from top to bottom: book, flag, zebra, and cups). The initial
appearances are given in the first (i.e., leftmost) column. The tracking results for four representative
frames are illustrated.
DPs and AUCs with respect to different challenging attributes are presented in Fig. 5.9. GUSOT outperforms USOT in all aspects except deformation (DEF) due to its limited segmentation-based shape adaptation capability. With the help of the local correlator branch and the powerful
fuser, GOT performs better than GUSOT and has comparable performance with USOT, which is
equipped with a box regression network, in DEF. For the same reason, GOT outperforms GUSOT in other DEF-related attributes such as viewpoint change (VC), rotation (ROT), and aspect
ratio change (ARC). Another improvement lies in camera motion (CM), where the local correlator
contributes to better object re-identification. GOT has the least improvement in low resolution
(LR). It appears that both the local correlator and the superpixel segmentation cannot help the GUSOT baseline much in this case. Lower video resolutions make the local features (say, around the
boundaries) less distinguishable from each other.
As discussed above, GOT adapts to the new appearance and shape well against the DEF challenge. Representative frames of four video sequences from LaSOT are illustrated in Fig. 5.10 as
supporting evidence. All tracked objects have severe deformations. For the first sequence of turning book pages, GOT covers the whole book correctly. For the second sequence of a flag, GOT
can track the flag accurately. For the third sequence of a zebra, object re-identification helps GOT
relocate the prediction to the correct place once the object is free from occlusion. For the fourth
sequence of two cups, GOT is robust against background clutters. In contrast, other trackers fail to
catch the new appearance or completely lose the object.
5.3.6 Insights into New Ingredients in GOT
GOT has two new ingredients: 1) the local correlator in the 2nd branch and 2) the fuser to combine
the outputs from all three branches. We provide further insights into them below.
5.3.6.1 Local Correlator
We compare the performance of GOT under different settings on the LaSOT dataset in Table 5.4.
The settings include:
• With or without the local correlator branch;
• With or without classifier update;
• With or without object re-identification.
We see from the table that a “plain” local correlator already achieves a substantial improvement in
DP. Classifier update and/or object re-identification improve more in AUC but less in DP. This is
because deformation tends to happen around object boundaries and the change of the object center
is relatively slow. On the other hand, the addition of classifier update and object re-identification
helps improve the quality of the objectness map for better shape estimation. In addition, the improvement from object re-identification indicates the frequent object loss in long videos and the
effective contribution of the objectness score. Both classifier update and object re-identification
are needed to achieve the best performance.
Table 5.4: Performance comparison of GOT under different settings on the LaSOT dataset, where
the best performance is shown in red. The ablation study includes: 1) with or without the local
correlator branch; 2) with or without classifier update; 3) with or without object re-identification.
L.C.B. Clf. update Re-idf. DP (↑) AUC (↑)
36.1 36.8
✓ 38.0 37.5
✓ ✓ 38.2 38.0
✓ ✓ 38.2 37.9
✓ ✓ ✓ 38.8 38.5
To study the necessity of quality checking and maintenance in the classification system of
the local correlator branch, we compare the tracking accuracy of three settings in Table 5.5, where
object re-identification is turned off in all settings. Without the shape estimation on-off scheme, the
tracker simply stops the shape estimation function after failures. The performance drops, which
reveals the frequent occurrence of challenges even in the early/middle stage of videos and the
importance of quality checking. Noise suppression helps boost the tracking accuracy furthermore
since it alleviates abrupt box changes due to noise around the object border.
Table 5.5: Ablation study of the classification system in the local correlator branch in GOT on
LaSOT under three settings, where the best performance is shown in red.
Shape Estimation On-Off Noise Suppression DP (↑) AUC (↑)
✓ 37.6 37.8
✓ 36.5 36.5
✓ ✓ 38.2 38.0
5.3.6.2 Fuser
The threshold parameter, α, in Algorithm 1 is used to choose between the simple fusion or the
MRF fusion. To study its sensitivity, we select a subset of 10 sequences from LaSOT and turn
on shape estimation in most frames. The mean IoU and the center error between the ground truth
and the prediction are plotted as functions of the threshold value in Fig. 5.11. The optimal
threshold range is between 0.7 and 0.9 since it has lower center errors and higher IoUs. Choosing
a lower threshold means that we conduct the simple fusion. This is consistent with the proposed
fusion strategy. That is, the simple fusion should be only used when different proposals are close to
each other. Pure simple and MRF fusion strategies have their own weaknesses such as the limited
selection ability in the simple fusion and errors around boundaries in the MRF fusion. Proper
collaboration between them can boost the performance.
5.4 Discussion on Limitations of GOT
To analyze the limitations of GOT and gain a deeper understanding of the contributions of supervision and offline pre-training, we compare the performance of GOT and three DL trackers on the
LaSOT dataset in Fig. 5.12. The three benchmarking trackers are SiamRPN++ (a supervised DL
tracker), USOT (an unsupervised DL tracker), and SiamFC (a supervised DL tracker that does not
have a regression network, unlike the previous two DL trackers). The regression network is offline pre-trained. The left subfigure depicts the success rate as a function of different overlap thresholds.
Figure 5.11: Mean IoU (the higher the better) and center error (the lower the better) on the selected
subset with different fusion thresholds. The subset from LaSOT includes book-10, bus-2, cat-1,
crocodile-14, flag-5, flag-9, gorilla-6, person-1, squirrel-19, mouse-17.
The right subfigure shows the AUC values as a function of different video lengths with the full
length normalized to one.
SiamRPN++ has the best performance among all. It is attributed to both supervision and offline
pre-training. GOT ranks second in most situations except for the following cases. GOT is slightly
worse than USOT when the overlap threshold is higher or at the beginning part of videos. It is
conjectured that GOT can achieve decent shape estimation but it may not be as effective as the
offline pre-trained regression network used by USOT in a tighter condition, i.e., a higher overlap
threshold or a shorter tracking memory. It is amazing to see that GOT outperforms SiamFC across
all thresholds and all video lengths. This shows the importance of the regression network in DL
trackers.
To verify the above conjecture, we conduct the deformation-related attribute study in a shorter
tracking memory setting in Fig. 5.13, where only the first 10% of all frames (i.e., the ratio of
frame numbers is 0.1) is examined. GOT is better than SiamFC in most attributes and worse than
Figure 5.12: Performance comparison between GOT (orange), SiamRPN++ (red), SiamFC (cyan),
and USOT (blue) on LaSOT in terms of the success rate plot (left) and the AUC plot (right).
SiamRPN++ and USOT. The superiority of SiamFC over GOT in DEF and ROT indicates the
power of supervision in locating the object.
We dive into two sequences where GOT’s attribute scores are poorer and show the tracking results
in Fig. 5.14. As shown in these examples, GOT has difficulty in handling the following cases: (1)
the object has similar local features with the background, such as the panda in the top sequence;
and (2) the object under tracking does not have a tight shape, such as the bike in the bottom
sequence. For the first case, the local patch-based correlator in GOT can only capture low-level
visual similarities. It cannot distinguish the black color of the panda and the background. For the
second case, the bounding box contains background patches inside, which become false positive
samples. Besides, the object has a few small individual parts whose representations are not stable
to reflect the appearance of the full object.
The motion may help if the object has obvious movement in the scene. However, it may not
help much if the object moves slowly or overlaps with another object. These difficulties can be
alleviated with supervision or pre-training. That is, if a learning model is trained with a rich set
Figure 5.13: Deformation-related attribute study on LaSOT for the first 10% frames in videos.
Figure 5.14: Two tracking examples from LaSOT: bear-4 (top) and bicycle-2 (bottom), with boxes
of the ground truth (green), SiamRPN++ (red), SiamFC (cyan), USOT (blue), and GOT (orange),
respectively.
of object boxes, it can avoid such mistakes more easily. As for the gap between supervised pre-training and unsupervised pre-training, unsupervised methods use pseudo boxes generated randomly or from optical flow, which usually contains noise. Thus, the offered supervision is not as
strong as ground truth labels. This explains why USOT cannot distinguish between the panda and
the bush and fails to exclude the human body from the bike while SiamRPN++ does a good job.
Supervised DL trackers usually do not distinguish different tracking scenarios but tune a model
to handle all cases to achieve high accuracy. Their high computational complexity, large model
sizes and heavy demand on training data are costly. In contrast, the proposed GOT system with no
offline pre-training can achieve decent tracking performance on general videos. Possible ways to
enhance the performance of lightweight trackers include the design of better classifiers that have
a higher level of semantic meaning and more powerful regressors for better fusion of predictions
from various branches.
5.5 Model Size and Complexity Analysis of GOT
The number of model parameters and the computational complexity of the proposed GOT system are analyzed in this section. For the latter, we compute the floating-point operations (flops) of the optimal implementation. Whenever applicable, we offer the time complexity and provide a rough estimation based on the running time on our local machine. The complexity analysis is conducted for tracking a new frame in the inference stage. Regardless of the original size of the frame, the region of interest is always warped into an L_B × L_B = 60 × 60 patch. The major components of GOT
include the global object-based correlator (i.e., the DCF tracker), the local patch-based correlator
(i.e., the classification pipeline), the superpixel-based segmentator, and the Markov Random Field
(MRF) optimizer in the fuser.
Global Object-based Correlator. The DCF tracker involves template matching via FFT and template updating via regression. The template (feature map) dimension used in GOT is (M, N, D) = (50, 50, 42). The complexity of the template update process is O(DMN log MN), where DMN log MN ≈ 1.19M. The complexity of template matching is at the same level. Furthermore, the background motion modeling module from GUSOT is used to capture salient moving objects in the scene. The location of a certain point (x_t, y_t) is estimated from its location in frame t−1 via the following affine transformation,
x_t = a_0 x_{t-1} + b_0 y_{t-1} + c_0,    (5.9)
y_t = a_1 x_{t-1} + b_1 y_{t-1} + c_1.    (5.10)
It is applied to every pixel in frame I_{t-1} to get an estimation Î_t of frame I_t. Then, the motion residual map is calculated as
ΔI = |Î_t − I_t|.    (5.11)
The maximum dimension of ΔI is (H, W) = (720, 480), as images of a larger size are downsampled. The flops for the affine transformation and the residual map calculation are 8 × H × W + H × W ≈ 3.11M.
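As a minimal sketch of how Eqs. (5.9)-(5.11) can be realized (an illustration that assumes the six affine parameters have already been estimated, e.g., from background keypoint correspondences; not the exact GUSOT/GOT implementation), the warp-and-difference step could look like:

```python
import cv2
import numpy as np

def motion_residual(frame_prev, frame_cur, affine_params):
    """Warp frame t-1 with the affine background model and return |I_hat_t - I_t| (Eq. 5.11)."""
    a0, b0, c0, a1, b1, c1 = affine_params
    M = np.float32([[a0, b0, c0],
                    [a1, b1, c1]])                      # Eqs. (5.9)-(5.10) in matrix form
    h, w = frame_cur.shape[:2]
    frame_est = cv2.warpAffine(frame_prev, M, (w, h))   # estimate of I_t from I_{t-1}
    return cv2.absdiff(frame_est, frame_cur)            # motion residual map

# Toy demo on random frames; in practice the six parameters would be estimated from
# background keypoint correspondences, e.g., cv2.estimateAffine2D(pts_prev, pts_cur).
prev = np.random.randint(0, 256, (480, 720), dtype=np.uint8)
cur = np.random.randint(0, 256, (480, 720), dtype=np.uint8)
residual = motion_residual(prev, cur, (1.0, 0.0, 2.0, 0.0, 1.0, -1.0))
```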
Local Patch-based Correlator. An input image of size 60×60 is decomposed into overlapping blocks of size 8×8 with a stride equal to 2, which generates ((60−8)/2+1)^2 = 729 blocks for feature extraction. For Saab feature extraction, we apply the one-layer Saab transform with filters of size 5×5 and stride equal to 1. We keep the top 4 AC kernels from each of the PQR channels. Thus, there are 3 DC color kernels of size 3×1×1 and 12 AC kernels of size 5×5×1. After feature extraction, we apply the DFT feature selection method to reduce the feature dimension to 50. Hence, the number of parameters is calculated as 3 × 3 (color kernels) + 12 × 25 (Saab kernels) + 50 (DFT feature selection index) = 359. Since the Saab feature extraction process can be implemented as 3D convolutions as in neural networks, we follow the flops calculation there to compute the model flops for the Saab transform. For a general 3D convolution with C_i input channels, C_o filters of spatial size K_h × K_w, and an output spatial size of H_o × W_o, the flops are calculated as
F = (2 × C_i × K_h × K_w) × H_o × W_o × C_o.    (5.12)
If the filter is a mean filter, the complexity is further reduced to
F = (C_i × K_h × K_w) × H_o × W_o × C_o.    (5.13)
As given in Table 5.6, the flops in computing the Saab features with filter size 5×5 at stride 1 for a block of size 8×8×3 is 11952. Then, the complexity for 729 blocks is around 8.713M. We run this feature extraction process at most two times at each frame.
Table 5.6: Flops of the Saab feature extraction for one spatial block of size 8×8.
Steps Ci Kh Kw Ho Wo Co Flops
Get mean color 1 5 5 4 4 3 1200
RGB2PQR 3 1 1 8 8 3 1152
Saab on P 1 5 5 4 4 4 3200
Saab on Q 1 5 5 4 4 4 3200
Saab on R 1 5 5 4 4 4 3200
Total 11952
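To make the bookkeeping concrete, the per-step flops in Table 5.6 can be reproduced directly from Eqs. (5.12) and (5.13); the short sketch below is purely illustrative.

```python
def conv3d_flops(ci, kh, kw, ho, wo, co, mean_filter=False):
    """Flops of a 3D convolution layer, Eq. (5.12); Eq. (5.13) for a mean filter."""
    mult = 1 if mean_filter else 2
    return mult * ci * kh * kw * ho * wo * co

steps = [
    ("Get mean color", conv3d_flops(1, 5, 5, 4, 4, 3, mean_filter=True)),  # 1200
    ("RGB2PQR",        conv3d_flops(3, 1, 1, 8, 8, 3)),                    # 1152
    ("Saab on P",      conv3d_flops(1, 5, 5, 4, 4, 4)),                    # 3200
    ("Saab on Q",      conv3d_flops(1, 5, 5, 4, 4, 4)),                    # 3200
    ("Saab on R",      conv3d_flops(1, 5, 5, 4, 4, 4)),                    # 3200
]
total = sum(f for _, f in steps)          # 11952 flops per 8x8 block
print(total, total * 729 / 1e6)           # ~8.713 MFlops for 729 blocks
```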
Table 5.7: The model size and the computational complexity of the whole GOT system.
Module Num. of Params. MFlops
Global Correlator 0 37.11
Local Correlator 2,199 18.12
Super-pixel segmentation 0 1.13
MRF 0 1.20
Total 2,199 57.56
Table 5.8: The estimated flops for some special algorithmic modules.
Algorithmic Modules Complexity MFlops
2D FFT & IFFT O(L_B^2 log L_B) 0.072
GMM - 1.634
DCF template related O(DMN log MN) 34
Super-pixel segmentation O(L_B^2 log L_B) 1.132
The XGBoost classifier has N_tree = 40 trees with the maximum depth d_M = 4 (i.e., there are at most four tree levels excluding the root). The maximum numbers of leaf nodes and parent nodes are N_l = 2^{d_M} and N_p = 2^{d_M} − 1, respectively. Hence, the number of parameters is bounded by N_tree × (2 × N_p + N_l) = 40 × (2 × 15 + 16) = 1840. The inference for 729 samples costs d_M × N_tree × 729 = 4 × 40 × 729 ≈ 0.117M flops. The complexity of spatial alignment via 2D FFT/IFFT is O(L_B^2 log L_B), where L_B^2 log_2 L_B ≈ 0.021M. The element-wise operation to get the suppressed map takes L_B × L_B = 3600 flops. The template update costs around 3 × L_B × L_B = 10800 flops.
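The parameter and inference-cost bounds derived above for the XGBoost classifier can also be expressed in a few lines of code; the sketch below is illustrative only and assumes one comparison per tree level during inference, two parameters per split node, and one per leaf, matching the counting used in the text.

```python
def xgboost_budget(n_trees=40, max_depth=4, n_samples=729):
    """Upper bounds on parameter count and inference flops of the boosted trees."""
    n_leaves = 2 ** max_depth            # at most 16 leaves per tree
    n_splits = 2 ** max_depth - 1        # at most 15 internal (split) nodes per tree
    # Each split stores a feature index and a threshold; each leaf stores one value.
    n_params = n_trees * (2 * n_splits + n_leaves)        # 40 * 46 = 1840
    # Inference walks one root-to-leaf path (<= max_depth comparisons) per tree.
    infer_flops = max_depth * n_trees * n_samples          # 4 * 40 * 729 = 116,640
    return n_params, infer_flops

print(xgboost_budget())   # (1840, 116640) -> ~0.117 MFlops
```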
Felzenszwalb Superpixel Segmentator. The complexity of the superpixel segmentation algorithm is O(L_B^2 log L_B), which roughly takes 1.13 MFlops.
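For reference, a Felzenszwalb-style superpixel segmentation of the 60×60 region of interest can be obtained with an off-the-shelf implementation such as scikit-image; the parameter values below are illustrative and not necessarily those used in GOT.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

# roi stands in for the (60, 60, 3) RGB patch warped from the region of interest;
# random data is used here only to keep the snippet self-contained.
roi = np.random.rand(60, 60, 3)
# scale / sigma / min_size control the granularity of the resulting superpixels.
labels = felzenszwalb(roi, scale=100, sigma=0.8, min_size=20)
# labels is a (60, 60) integer map; pixels sharing a label form one superpixel.
print(labels.max() + 1, "superpixels")
```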
MRF. The adopted Markov Random Field optimizer runs one iteration only. Given an input image of size (L_B, L_B, C) = (60, 60, 3), it first learns GMM models for the foreground and background colors, respectively, so that the foreground/background likelihood can be calculated at each pixel. Then, around 20 element-wise matrix operations are conducted to compute a rough assignment of pixel labels. The flops for these matrix operations are 20 × 60 × 60 = 0.072M.
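As an illustrative sketch of the first step described above (learning color GMMs from seed pixels and evaluating per-pixel foreground/background likelihoods), one could proceed as follows; the component count and the use of scikit-learn are assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def color_likelihoods(image, fg_mask, n_components=3):
    """Fit fg/bg color GMMs from seed pixels and return per-pixel log-likelihood maps."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    fg = pixels[fg_mask.reshape(-1)]
    bg = pixels[~fg_mask.reshape(-1)]
    gmm_fg = GaussianMixture(n_components=n_components).fit(fg)
    gmm_bg = GaussianMixture(n_components=n_components).fit(bg)
    # score_samples returns the per-sample log-likelihood under each model.
    return (gmm_fg.score_samples(pixels).reshape(h, w),
            gmm_bg.score_samples(pixels).reshape(h, w))

# Example on random data: a 60x60x3 patch with a central box as the rough foreground seed.
img = np.random.rand(60, 60, 3)
mask = np.zeros((60, 60), dtype=bool)
mask[20:40, 20:40] = True
fg_ll, bg_ll = color_likelihoods(img, mask)
```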
We summarize the model size (in the number of model parameters) and the overall complexity (in flops) in Table 5.7. Our tracker has 2,199 model parameters and requires roughly 57.56 MFlops per frame. It is worthwhile to point out that the actual complexity of some special modules, such as the FFT, depends on the hardware implementation and optimization. Estimates for several of them are given in Table 5.8.
5.6 Conclusion
A green object tracker (GOT) with a small model size, low inference complexity and high tracking
accuracy was proposed in this work. GOT contains a novel local patch-based correlator branch
to enable more flexible shape estimation and object re-identification. Furthermore, it has a fusion
tool that combines prediction outputs from the global object-based correlator, the local patch-based correlator, and the superpixel-based segmentator according to tracking dynamics over time. Extensive experiments were conducted to compare the tracking performance of GOT and state-of-the-art DL trackers. We hope that this work could shed light on the role played by supervision
and offline pre-training and provide new insights into the design of effective and efficient tracking
systems.
Several future extensions can be considered. First, it is desirable to develop ways to identify different tracking scenarios, since this information can be leveraged to design a better tracking system; for example, the tracker can adopt tools of different complexity to strike a balance between model complexity and tracking performance. Second, it can adopt different fusion strategies to combine outputs from multiple decision branches for more flexible and robust tracking performance.
Chapter 6
Conclusions and Future Work
6.1 Summary of the Research
In this thesis, we focus on the online single object tracking scenario and propose four trackers, UHP-SOT, UHP-SOT++, GUSOT and GOT, to push forward the limit of unsupervised lightweight tracking and demonstrate the possibility of high tracking performance without any need for offline pre-training on large-scale data or heavy feature extraction networks. Our models are green and well suited to resource-limited platforms.
In UHP-SOT and UHP-SOT++, we propose two novel modules, the background motion modeling module and the trajectory-based box prediction module, and incorporate them with the baseline DCF tracker through a systematic rule-based fusion strategy. The background motion modeling module highlights moving objects in the scene and provides hints on object location and shape variation, which could be used to alleviate tracking loss. The trajectory-based box prediction module considers the inertia of motion and generates stable estimates of the box location and size. The proposed fusion strategy divides tracking scenarios into eight main categories according to the quality of proposals from the three modules, including the baseline. The three proposals are then merged accordingly and together contribute to a more robust tracking result. Superior or comparable results against other
unsupervised trackers and supervised trackers on diverse tracking benchmarks demonstrate the
power of the proposed lightweight yet effective modules. In addition, we provide qualitative and
quantitative analysis of the gap between supervised deep trackers and unsupervised lightweight trackers. Deep trackers tend to benefit from semantically meaningful features and region proposal networks that output flexible box shapes, but they can be heavy in power consumption and may have issues when confronting unseen object types.
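As a toy sketch of what inertia-based box prediction can look like (a simple constant-velocity extrapolation over the recent trajectory; the window length and box format are assumptions, not the exact UHP-SOT design):

```python
import numpy as np

def predict_box(history, window=5):
    """Extrapolate the next (cx, cy, w, h) box by constant velocity over recent frames."""
    hist = np.asarray(history[-window:], dtype=float)     # each row: (cx, cy, w, h)
    if len(hist) < 2:
        return hist[-1]
    velocity = np.mean(np.diff(hist, axis=0), axis=0)      # average per-frame change
    return hist[-1] + velocity

# Example: a box drifting rightward by about 2 pixels per frame.
boxes = [(100, 50, 30, 60), (102, 50, 30, 60), (104, 51, 30, 60)]
print(predict_box(boxes))   # -> approximately (106, 51.5, 30, 60)
```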
In GUSOT, we further extend the lightweight design to tracking for long video sequences. Two
new modules are introduced, the lost object recovery module and the color-saliency-based shape
proposal module. The former identifies moments when the target object might have been lost and retrieves the object by checking the candidate with obvious motion and a higher similarity measurement compared with the baseline tracker. After this step, the rough location of the object has been determined. Then, the saliency-based shape proposal module refines the box shape via low-cost segmentation techniques, including Markov Random Field (MRF) optimization and superpixel segmentation. To facilitate the initialization of the MRF optimization for higher segmentation quality, we select confident foreground/background seeds in terms of their color saliency. The superpixel segmentation serves as a backup if the MRF behaves abnormally in some complicated cases. The obvious improvement over the baseline and the comparison against other state-of-the-art methods on the long-term tracking benchmark LaSOT demonstrate the effectiveness of the proposed modules. Our tracker outperforms or achieves comparable performance with recent unsupervised deep trackers while using a lightweight and pre-training-free model.
In GOT, we formulate tracking as an ensemble of three branches for more robust tracking of general objects: the global object-based correlator, the local patch-based correlator and the superpixel segmentator. The local patch-based correlator and the fuser that processes all proposals to generate the final prediction are novel contributions. While the global object-based correlator models the global object appearance and the superpixel segmentator exploits the spatial correlation inside static images, the local patch-based correlator learns the distribution of local visual features and carries the information through a longer time span, which expands the temporal memory of the tracker and provides appearance modeling at a finer granularity. After gathering the object box proposals from all branches, the fuser fuses them according to the tracking dynamics; i.e., a simple IoU- and objectness-score-based strategy is used to select the final box when the tracking scenario is simple, while the more complicated MRF optimization is adopted for more difficult cases. More importantly, the tracking quality monitor and the re-identification mechanism embedded in the fuser play a vital role in preventing large error propagation and helping with timely recovery
from object loss. The extensive examination of the proposed tracker over various short-term and
long-term tracking benchmarks validates its effectiveness for general object tracking. The tiny
model size and the offline-pretraining-free and explainable framework design make our tracker
attractive for low-cost tracking on resource-limited devices.
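As a toy illustration of the simple-scenario branch of this fusion rule (the agreement threshold, box format, and objectness interface are assumptions rather than the exact design), the selection step could look like:

```python
def iou(a, b):
    """Intersection over union of two boxes in (x, y, w, h) format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse_simple(proposals, objectness, agreement_thr=0.6):
    """If all branch proposals agree (high pairwise IoU), pick the most object-like one;
    otherwise signal that the harder MRF-based fusion should be invoked."""
    n = len(proposals)
    agree = all(iou(proposals[i], proposals[j]) >= agreement_thr
                for i in range(n) for j in range(i + 1, n))
    if agree:
        return max(zip(proposals, objectness), key=lambda p: p[1])[0]
    return None  # fall back to the MRF-based fusion for difficult cases

boxes = [(10, 10, 50, 40), (12, 11, 49, 41), (11, 9, 52, 42)]
print(fuse_simple(boxes, objectness=[0.7, 0.9, 0.8]))
```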
6.2 Future Research Directions
Generally speaking, many video learning tasks essentially perform content understanding on videos. The learned content is never confined to just one specific application; learning on one task may provide hints or by-products that benefit another task.
The video object tracking task is naturally tied to the video object segmentation problem [13, 119, 143, 90]. While the former outputs a rectangular bounding box, the latter requires a more fine-grained delineation of the object, i.e., a pixel-level binary segmentation mask. Recent studies also reveal that tracking and segmentation may share unified multi-knowledge representations [145].
It is beneficial to combine the two tasks into one unified tracking-and-segmentation learning framework. On the one hand, tracking helps locate a rough region of the target object and could be much faster or easier than trying to locate object pixels among all pixels. On the other hand, the segmentation result tends to contain less background noise and thus can serve as a good training source for tracking modules, further contributing to high tracking accuracy. One example is SiamMask [126], where the network is initialized with the usual rectangular bounding box in the first frame and performs tracking and segmentation simultaneously on later frames. As shown in Figure 6.1, due to the existence of the segmentation mask, tighter bounding boxes with orientations can be generated for a more accurate representation of the object location.
Recent works either exploit the conversion between boxes and segmentation masks [82, 151] or attach different prediction heads to a shared feature backbone [138]. [96] proposed a segmentation-centric network that directly learns a highly accurate mask to derive the box prediction, while an object localization branch is learned to help with long-term tracking issues such as occlusion or disappearance. [142] shows that an auxiliary mask prediction head can improve the representation learning for tracking and produce a segmentation mask when needed. [84] exploits two complementary models, one for localization that is robust to deformation and the other for predicting a rough object mask under a rigid-object assumption, and then fuses them via refinement layers guided by a segmentation loss. [130] and [139] make great progress toward learning unified embedding and appearance models that can be directly used for multiple video tasks, including single object tracking/segmentation and multiple object tracking/segmentation.
Nevertheless, little attention has been paid to the feasibility of a lightweight unified framework. Similar to deep trackers, recent state-of-the-art methods require sufficient offline pre-training on large-scale data. Note that the cost of acquiring ground-truth segmentation masks is far higher than that of object boxes, and the model tends to be more data-hungry as the training object gets more difficult. Based on the developed green lightweight trackers, we may also attempt an unsupervised lightweight solution for video object segmentation and a unified tracking and segmentation system in the future.
Figure 6.1: While being initialized with only a rectangular bounding box, SiamMask is able to generate both a segmentation mask and oriented bounding boxes for a more accurate representation of the object location. Red boxes are outputs from ECO [30] for comparison. Figure is from [126].
Bibliography
[1] Ankush Agarwal and Saurabh Suryavanshi. Real-time* multiple object tracking (mot) for
autonomous navigation. Tech. Rep., 2017.
[2] Ashwani Aggarwal, Susmit Biswas, Sandeep Singh, Shamik Sural, and Arun K Majumdar.
Object tracking using background subtraction and motion estimation in mpeg videos. In
Asian Conference on Computer Vision, pages 121–130. Springer, 2006.
[3] Saksham Aggarwal, Taneesh Gupta, Pawan K Sahu, Arnav Chavan, Rishabh Tiwari, Dilip K
Prasad, and Deepak K Gupta. On designing light-weight object trackers through network
pruning: Use cnns or transformers? In ICASSP 2023-2023 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[4] Alan Lukežič, Tomáš Vojíř, Luka Čehovin, Jiří Matas, and Matej Kristan. Discriminative correlation filter tracker with channel and spatial reliability. International Journal of Computer Vision, 126(7):671–688, 2018.
[5] Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip HS Torr. Staple:
Complementary learners for real-time tracking. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1401–1409, 2016.
[6] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr.
Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
[7] Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael
Felsberg. Unveiling the power of deep tracking. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 483–498, 2018.
[8] Adel Bibi, Matthias Mueller, and Bernard Ghanem. Target response adaptation for correlation filter tracking. In European conference on computer vision, pages 419–433. Springer,
2016.
[9] Philippe Blatter, Menelaos Kanakis, Martin Danelljan, and Luc Van Gool. Efficient visual
tracking with exemplar transformers. In Proceedings of the IEEE/CVF Winter Conference
on Applications of Computer Vision, pages 1571–1581, 2023.
[10] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking
using adaptive correlation filters. In 2010 IEEE computer society conference on computer
vision and pattern recognition, pages 2544–2550. IEEE, 2010.
[11] Vasyl Borsuk, Roman Vei, Orest Kupyn, Tetiana Martyniuk, Igor Krashenyi, and Jiˇri Matas.
Fear: Fast, efficient, accurate and robust visual tracker. In European Conference on Computer Vision, pages 644–663. Springer, 2022.
[12] Sebastian Brutzer, Benjamin Hoferlin, and Gunther Heidemann. Evaluation of background ¨
subtraction techniques for video surveillance. In CVPR 2011, pages 1937–1944. IEEE,
2011.
[13] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 221–230, 2017.
[14] Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li. Hift: Hierarchical feature
transformer for aerial tracking. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 15457–15466, 2021.
[15] Luka Čehovin, Aleš Leonardis, and Matej Kristan. Visual object tracking performance measures revisited. IEEE Transactions on Image Processing, 25(3):1261–1274, 2016.
[16] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the
devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531,
2014.
[17] Peng Chen, Yuanjie Dang, Ronghua Liang, Wei Zhu, and Xiaofei He. Real-time object
tracking on a drone with multi-inertial sensing data. IEEE Transactions on Intelligent Transportation Systems, 19(1):131–139, 2017.
[18] Qifeng Chen and Vladlen Koltun. Fast mrf optimization with application to depth reconstruction. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,
pages 3914–3921, 2014.
[19] SY Chen. Kalman filter for robot vision: a survey. IEEE Transactions on industrial electronics, 59(11):4409–4420, 2011.
[20] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings
of the 22nd acm sigkdd international conference on knowledge discovery and data mining,
pages 785–794, 2016.
[21] Xin Chen, Ben Kang, Dong Wang, Dongdong Li, and Huchuan Lu. Efficient visual tracking
via hierarchical cross-attention transformer. In European Conference on Computer Vision,
pages 461–477. Springer, 2022.
[22] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer
tracking. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages
8126–8135, 2021.
[23] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C.-C. Jay Kuo. Pixelhop++: A small successive-subspace-learning-based (ssl-based) model for image classification. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3294–
3298. IEEE, 2020.
[24] Andrew I Comport, Éric Marchand, and François Chaumette. Robust model-based tracking for robot vision. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), volume 1, pages 692–697. IEEE, 2004.
[25] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive aggressive algorithms. JMLR, 2006.
[26] Zhen Cui, Shengtao Xiao, Jiashi Feng, and Shuicheng Yan. Recurrently target-attending
tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1449–1458, 2016.
[27] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully
convolutional networks. Advances in neural information processing systems, 29, 2016.
[28] Kenan Dai, Yunhua Zhang, Dong Wang, Jianhua Li, Huchuan Lu, and Xiaoyun Yang. High-performance long-term tracking with meta-updater. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6298–6307, 2020.
[29] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In 2005 IEEE computer society conference on computer vision and pattern recognition
(CVPR’05), volume 1, pages 886–893. Ieee, 2005.
[30] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient
convolution operators for tracking. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 6638–6646, 2017.
[31] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking. IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016.
[32] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE international conference on computer vision workshops, pages 58–66, 2015.
[33] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Learning
spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision, pages 4310–4318, 2015.
[34] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Adaptive
decontamination of the training set: A unified formulation for discriminative visual tracking.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1430–1438, 2016.
[35] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In
European conference on computer vision, pages 472–488. Springer, 2016.
[36] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1090–1097, 2014.
[37] Xingping Dong, Jianbing Shen, Wenguan Wang, Yu Liu, Ling Shao, and Fatih Porikli.
Hyperparameter optimization for tracking with continuous deep q-learning. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 518–527, 2018.
[38] Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, Qinghua Hu, Jiayu Zheng,
Tao Peng, Xinyao Wang, Yue Zhang, et al. Visdrone-sot2019: The vision meets drone single
object tracking challenge results. In Proceedings of the IEEE/CVF International Conference
on Computer Vision Workshops, pages 0–0, 2019.
[39] Lamiaa A Elrefaei, Alaa Alharthi, Huda Alamoudi, Shatha Almutairi, and Fatima Alrammah. Real-time face detection and tracking on mobile phones for criminal detection.
In 2017 2nd International Conference on Anti-Cyber Crimes (ICACC), pages 75–80. IEEE,
2017.
[40] Andreas Ess, Konrad Schindler, Bastian Leibe, and Luc Van Gool. Object detection and
tracking for autonomous navigation in dynamic environments. The International Journal of
Robotics Research, 29(14):1707–1725, 2010.
[41] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object
tracking. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages
5374–5383, 2019.
[42] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International journal of computer vision, 59(2):167–181, 2004.
[43] Mustansar Fiaz, Arif Mahmood, Sajid Javed, and Soon Ki Jung. Handcrafted and deep
trackers: Recent visual object tracking approaches and trends. ACM Computing Surveys
(CSUR), 52(2):1–44, 2019.
[44] Neil Gordon, B Ristic, and S Arulampalam. Beyond the kalman filter: Particle filters for
tracking applications. Artech House, London, 830(5):1–4, 2004.
[45] Fredrik Gustafsson, Fredrik Gunnarsson, Niclas Bergman, Urban Forssell, Jonas Jansson,
Rickard Karlsson, and P-J Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on signal processing, 50(2):425–437, 2002.
[46] SJ Hadfield, R Bowden, and K Lebeda. The visual object tracking vot2016 challenge results.
Lecture Notes in Computer Science, 9914:777–823, 2016.
[47] Niels Haering, Peter L Venetianer, and Alan Lipton. The evolution of video surveillance: an overview. Machine Vision and Applications, 19(5):279–290, 2008.
[48] Karthik Hariharakrishnan and Dan Schonfeld. Fast object tracking using adaptive block
matching. IEEE transactions on multimedia, 7(5):853–859, 2005.
[49] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A twofold siamese network for
real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 4834–4843, 2018.
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
[51] Joao F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with ˜
kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence, 37(3):583–596, 2014.
[52] Chen Huang, Simon Lucey, and Deva Ramanan. Learning policies for adaptive tracking with
deep feature cascades. In Proceedings of the IEEE international conference on computer
vision, pages 105–114, 2017.
[53] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark
for generic object tracking in the wild. IEEE transactions on pattern analysis and machine
intelligence, 43(5):1562–1577, 2019.
[54] Joel Janai, Fatma Güney, Aseem Behl, Andreas Geiger, et al. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and Trends® in Computer Graphics and Vision, 12(1–3):1–308, 2020.
[55] Shan Jiayao, Sifan Zhou, Yubo Cui, and Zheng Fang. Real-time 3d single object tracking
with transformer. IEEE Transactions on Multimedia, 2022.
[56] Ilchae Jung, Minji Kim, Eunhyeok Park, and Bohyung Han. Online hybrid lightweight representations learning: Its application to visual tracking. arXiv preprint arXiv:2205.11179,
2022.
[57] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. Real-time mdnet. In Proceedings
of the European conference on computer vision (ECCV), pages 83–98, 2018.
[58] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. IEEE
transactions on pattern analysis and machine intelligence, 34(7):1409–1422, 2011.
[59] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey. Learning background-aware
correlation filters for visual tracking. In Proceedings of the IEEE international conference
on computer vision, pages 1135–1143, 2017.
[60] Hamed Kiani Galoogahi, Terence Sim, and Simon Lucey. Correlation filters with limited
boundaries. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4630–4638, 2015.
[61] Matej Kristan, Aleš Leonardis, Jiří Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič, Ondrej Drbohlav, et al. The eighth visual object tracking vot2020 challenge results. In European Conference on Computer Vision, pages 547–601. Springer, 2020.
[62] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Čehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
[63] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Čehovin Zajc, Tomas Vojir, Gustav Häger, Alan Lukežič, Abdelrahman Eldesokey, and Gustavo Fernandez. The visual object tracking vot2017 challenge results, 2017.
[64] Matej Kristan, Jiří Matas, Aleš Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Hyung Jin Chang, Martin Danelljan, Luka Čehovin, Alan Lukežič, et al. The ninth visual object tracking vot2021 challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2711–2738, 2021.
[65] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Čehovin Zajc, Ondrej Drbohlav, Alan Lukezic, Amanda Berg, et al. The seventh visual object tracking vot2019 challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[66] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Advances in neural information processing systems,
25:1097–1105, 2012.
[67] C.-C. Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook.
Journal of Visual Communication and Image Representation, page 103685, 2022.
[68] Dae-Youn Lee, Jae-Young Sim, and Chang-Su Kim. Multihypothesis trajectory analysis
for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5088–5096, 2015.
[69] Donghwa Lee, Gonyop Kim, Donghoon Kim, Hyun Myung, and Hyun-Taek Choi. Vision-based object detection and tracking for autonomous navigation of underwater robots. Ocean
Engineering, 48:59–68, 2012.
[70] Juan Lei, Youji Feng, Lixin Fan, and Yihong Wu. Real-time object tracking on mobile
phones. In The First Asian Conference on Pattern Recognition, pages 560–564. IEEE, 2011.
[71] Annan Li, Min Lin, Yi Wu, Ming-Hsuan Yang, and Shuicheng Yan. Nus-pro: A new visual tracking challenge. IEEE transactions on pattern analysis and machine intelligence,
38(2):335–349, 2015.
[72] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++:
Evolution of siamese visual tracking with very deep networks. In Proceedings of IEEE/CVF
Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
[73] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking
with siamese region proposal network. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 8971–8980, 2018.
[74] Feng Li, Cheng Tian, Wangmeng Zuo, Lei Zhang, and Ming-Hsuan Yang. Learning spatialtemporal regularized correlation filters for visual tracking. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 4904–4913, 2018.
[75] Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, and Geng Lu. Autotrack:
Towards high-performance visual tracking for uav with automatic spatio-temporal regularization. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages
11923–11932, 2020.
[76] Pengpeng Liang, Erik Blasch, and Haibin Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing, 24(12):5630–
5644, 2015.
[77] Liting Lin, Heng Fan, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline
for transformer tracking. arXiv preprint arXiv:2112.00995, 2021.
[78] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[79] Kuida Liu and Zhiruo Zhou. IEEEXtreme 15.0 challenge: Maximum exploitation. https://csacademy.com/ieeextreme-practice/task/exploitation/. IEEEXtreme 15.0 took place on 23rd October 2021.
[80] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang
Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on
computer vision, pages 21–37. Springer, 2016.
[81] Xiankai Lu, Chao Ma, Bingbing Ni, Xiaokang Yang, Ian Reid, and Ming-Hsuan Yang.
Deep regression tracking with shrinkage loss. In Proceedings of the European conference
on computer vision (ECCV), pages 353–369, 2018.
[82] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. Premvos: Proposal-generation,
refinement and merging for video object segmentation. In Asian Conference on Computer
Vision, pages 565–580. Springer, 2018.
[83] Alan Lukezic, Ugur Kart, Jani Kapyla, Ahmed Durmush, Joni-Kristian Kamarainen, Jiri
Matas, and Matej Kristan. Cdtb: A color and depth visual object tracking dataset and
benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 10013–10022, 2019.
[84] Alan Lukezic, Jiri Matas, and Matej Kristan. D3s-a discriminative single shot segmentation tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 7133–7142, 2020.
[85] Alan Lukežič, Luka Čehovin Zajc, Tomáš Vojíř, Jiří Matas, and Matej Kristan. Fucolot–a fully-correlational long-term tracker. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part II 14, pages 595–611. Springer, 2019.
[86] Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, and Yizhou Wang. End-toend active object tracking and its real-world deployment via reinforcement learning. IEEE
transactions on pattern analysis and machine intelligence, 42(6):1317–1332, 2019.
[87] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE international conference on
computer vision, pages 3074–3082, 2015.
[88] Chao Ma, Xiaokang Yang, Chongyang Zhang, and Ming-Hsuan Yang. Long-term correlation tracking. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 5388–5396, 2015.
[89] Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, and Zhicheng Yan.
Unified transformer tracker for object tracking. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 8781–8790, 2022.
[90] Tim Meinhardt and Laura Leal-Taixé. Make one-shot video object segmentation efficient again. Advances in neural information processing systems, 33:10607–10619, 2020.
[91] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical
flow with a bidirectional census loss. In Proceedings of the AAAI conference on artificial
intelligence, volume 32, 2018.
[92] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav
tracking. In European conference on computer vision, pages 445–461. Springer, 2016.
[93] Matthias Mueller, Neil Smith, and Bernard Ghanem. Context-aware correlation filter tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1396–1404, 2017.
[94] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem.
Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European conference on computer vision (ECCV), pages 300–317, 2018.
[95] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks
for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4293–4302, 2016.
[96] Matthieu Paul, Martin Danelljan, Christoph Mayer, and Luc Van Gool. Robust visual
tracking by segmentation. In European Conference on Computer Vision, pages 571–588.
Springer, 2022.
[97] Shi Pu, Yibing Song, Chao Ma, Honggang Zhang, and Ming-Hsuan Yang. Deep attentive
tracking via reciprocative learning. arXiv preprint arXiv:1810.03851, 2018.
[98] Yuankai Qi, Shengping Zhang, Lei Qin, Hongxun Yao, Qingming Huang, Jongwoo Lim,
and Ming-Hsuan Yang. Hedged deep tracking. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4303–4311, 2016.
[99] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke.
Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5296–5305, 2017.
[100] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[101] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. Advances in neural information processing
systems, 28, 2015.
[102] Idoia Ruiz, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Joan Serrat. Weakly
supervised multi-object tracking and segmentation. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, pages 125–133, 2021.
[103] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large
scale visual recognition challenge. International journal of computer vision, 115(3):211–
252, 2015.
[104] Jianbing Shen, Yuanpei Liu, Xingping Dong, Xiankai Lu, Fahad Shahbaz Khan, and Steven
Hoi. Distilled siamese networks for visual tracking. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 44(12):8896–8909, 2021.
[105] Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao
Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking.
In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages 8101–8110,
2022.
[106] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[107] Zahra Soleimanitaleb, Mohammad Ali Keyvanrad, and Ali Jafari. Object tracking methods:
A review. In 2019 9th International Conference on Computer and Knowledge Engineering
(ICCKE), pages 282–288. IEEE, 2019.
[108] Yibing Song, Chao Ma, Lijun Gong, Jiawei Zhang, Rynson WH Lau, and Ming-Hsuan
Yang. Crest: Convolutional residual learning for visual tracking. In Proceedings of the
IEEE international conference on computer vision, pages 2555–2564, 2017.
[109] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua
Shen, Rynson WH Lau, and Ming-Hsuan Yang. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 8990–8999, 2018.
[110] Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with
cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8791–8800, 2022.
[111] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang,
and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint
arXiv:2012.15460, 2020.
[112] Yuxuan Sun, Chong Sun, Dong Wang, You He, and Huchuan Lu. Roi pooled correlation
filters for visual tracking. In Proceedings of IEEE/CVF on Computer Vision and Pattern
Recognition, pages 5783–5791, 2019.
[113] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1420–1429, 2016.
[114] Carlo Tomasi and Takeo Kanade. Detection and tracking of point. Int J Comput Vis, 9:137–
154, 1991.
[115] Jack Valmadre, Luca Bertinetto, Joao Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 2805–2813, 2017.
[116] Jack Valmadre, Luca Bertinetto, Joao F Henriques, Ran Tao, Andrea Vedaldi, Arnold WM
Smeulders, Philip HS Torr, and Efstratios Gavves. Long-term tracking in the wild: A benchmark. In Proceedings of the European conference on computer vision (ECCV), pages 670–
685, 2018.
[117] Joost Van De Weijer, Cordelia Schmid, Jakob Verbeek, and Diane Larlus. Learning color
names for real-world applications. IEEE Transactions on Image Processing, 18(7):1512–
1523, 2009.
[118] Markus Vincze, Matthias Schlemmer, Peter Gemeiner, and Minu Ayromlou. Vision for
robotics: a tool for model-based object tracking. IEEE robotics & automation magazine,
12(4):53–64, 2005.
[119] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks
for video object segmentation. arXiv preprint arXiv:1706.09364, 2017.
[120] Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, and Zhiwei
Xiong. Unsupervised visual representation learning by tracking patches in video. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages 2563–2572, 2021.
[121] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition, pages 1308–1317, 2019.
[122] Ning Wang, Wengang Zhou, Yibing Song, Chao Ma, Wei Liu, and Houqiang Li. Unsupervised deep representation learning for real-time tracking. International Journal of Computer
Vision, 129(2):400–418, 2021.
[123] Ning Wang, Wengang Zhou, Qi Tian, Richang Hong, Meng Wang, and Houqiang Li. Multicue correlation filters for robust visual tracking. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4844–4853, 2018.
[124] Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of IEEE/CVF Computer
Vision and Pattern Recognition, pages 1571–1580, 2021.
[125] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu, and Stephen Maybank.
Learning attentions: residual attentional siamese network for high performance online visual
tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4854–4863, 2018.
[126] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online
object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF
conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.
[127] Xiaogang Wang. Intelligent multi-camera video surveillance: A review. Pattern recognition
letters, 34(1):3–19, 2013.
[128] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2566–2576, 2019.
[129] Xinyu Wang, Hanxi Li, Yi Li, Fumin Shen, and Fatih Porikli. Robust and real-time deep
tracking via multi-scale domain adaptation. In 2017 ieee international conference on multimedia and expo (icme), pages 1338–1343. IEEE, 2017.
[130] Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip Torr, and Luca
Bertinetto. Do different tracking tasks require different appearance models? Advances
in Neural Information Processing Systems, 34:726–738, 2021.
[131] Joost van de Weijer and Fahad Shahbaz Khan. An overview of color name applications
in computer vision. In International Workshop on Computational Color Imaging, pages
16–22. Springer, 2015.
[132] Qiangqiang Wu, Jia Wan, and Antoni B Chan. Progressive unsupervised learning for visual
object tracking. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition,
pages 2993–3002, 2021.
[133] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
2411–2418, 2013.
[134] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
[135] Daitao Xing, Nikolaos Evangeliou, Athanasios Tsoukalas, and Anthony Tzes. Siamese
transformer pyramid networks for real-time uav tracking. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision, pages 2139–2148, 2022.
[136] Junliang Xing, Haizhou Ai, and Shihong Lao. Multiple human tracking based on multi-view
upper-body detection and discriminative learning. In 2010 20th International Conference
on Pattern Recognition, pages 1698–1701. IEEE, 2010.
[137] Tianyang Xu, Zhen-Hua Feng, Xiao-Jun Wu, and Josef Kittler. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for
robust visual object tracking. IEEE Transactions on Image Processing, 28(11):5596–5609,
2019.
[138] Yuanyou Xu, Zongxin Yang, and Yi Yang. Integrating boxes and masks: A multi-object
framework for unified visual tracking and segmentation. arXiv preprint arXiv:2308.13266,
2023.
[139] Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu.
Towards grand unification of object tracking. In European Conference on Computer Vision,
pages 733–751. Springer, 2022.
[140] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatiotemporal transformer for visual tracking. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 10448–10457, 2021.
[141] Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, and Huchuan Lu. Lighttrack:
Finding lightweight neural networks for object tracking via one-shot architecture search.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 15180–15189, 2021.
[142] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-refine:
Boosting tracking performance by precise bounding box estimation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5289–5298,
2021.
[143] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K Katsaggelos.
Efficient video object segmentation via network modulation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 6499–6507, 2018.
[144] Tianyu Yang and Antoni B Chan. Learning dynamic memory networks for object tracking.
In Proceedings of the European conference on computer vision (ECCV), pages 152–167,
2018.
[145] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data
artificial intelligence: framework, applications, and case studies. Frontiers of Information
Technology & Electronic Engineering, 22(12):1551–1558, 2021.
[146] Yijing Yang, Wei Wang, Hongyu Fu, C.-C. Jay Kuo, et al. On supervised feature selection
from high dimensional feature spaces. APSIPA Transactions on Signal and Information
Processing, 11(1), 2022.
[147] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow
and camera pose. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1983–1992, 2018.
[148] Guangcong Zhang and Patricio A Vela. Good features to track for visual slam. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1373–1382,
2015.
[149] Zehua Zhang. A python implementation of block gradient descent for optimizing markov random fields. https://github.com/zehzhang/MRF_BCD, 2019.
[150] Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 4591–4600, 2019.
[151] Bin Zhao, Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Generating
masks from boxes by mining spatio-temporal consistencies in videos. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 13556–13566, 2021.
[152] Moju Zhao, Kei Okada, and Masayuki Inaba. Trtr: Visual tracking with transformer. arXiv
preprint arXiv:2105.03817, 2021.
[153] Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. Learning to track objects from
unlabeled videos. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 13546–13555, 2021.
[154] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning
of depth and ego-motion from video. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1851–1858, 2017.
[155] Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A Efros.
Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 117–126, 2016.
[156] Xiaowei Zhou, Menglong Zhu, and Kostas Daniilidis. Multi-image matching via fast alternating minimization. In Proceedings of the IEEE international conference on computer
vision, pages 4032–4040, 2015.
[157] Zhiruo Zhou, Hongyu Fu, Suya You, Christoph C Borel-Donohue, and C.-C. Jay Kuo. UHP-SOT: An unsupervised high-performance single object tracker. In 2021 International Conference on Visual Communications and Image Processing (VCIP), pages 1–5. IEEE, 2021.
[158] Zhiruo Zhou, Hongyu Fu, Suya You, and C.-C. Jay Kuo. Gusot: Green and unsupervised
single object tracking for long video sequences. In 2022 IEEE 24th International Workshop
on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2022.
[159] Zhiruo Zhou, Hongyu Fu, Suya You, C.-C. Jay Kuo, et al. UHP-SOT++: An unsupervised
lightweight single object tracker. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
[160] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones:
A challenge. arXiv preprint arXiv:1804.07437, 2018.
[161] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling.
Detection and tracking meet drones challenge. arXiv preprint arXiv:2001.06303, 2020.
[162] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware
siamese networks for visual object tracking. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 101–117, 2018.
Abstract
Video object tracking is one of the fundamental problems in computer vision and has diverse real-world applications such as video surveillance and robotic vision. We focus on online single object tracking, where the tracker is expected to track the object given in the first frame and cannot exploit any future information to infer the object location in the current frame. Supervised tracking methods have been widely investigated, and deep-learning-based trackers achieve great success in terms of tracking accuracy. However, the requirement of a large amount of labelled data is demanding, and the reliance on ground-truth supervision casts doubt on the reliability when the algorithm operates on unseen objects. Even though some endeavors have been made toward unsupervised tracking in recent years and those works demonstrate promising progress, the performance of unsupervised trackers still lags far behind that of supervised trackers. In addition, recent pioneering unsupervised works rely on deep feature extraction networks and require large-scale pre-training on offline video datasets. Their power consumption and large model sizes limit their application on resource-limited platforms such as autonomous drones and mobile phones. While there have been a few research works that compress a large network via neural architecture search and quantization, they focus on tuning from a well-trained large tracker in a supervised manner, which makes the training process of a lightweight tracker expensive. Therefore, an explainable and efficient design methodology for lightweight high-performance trackers is still lacking, and the necessity and role of supervision in offline training have not been well investigated yet.
To narrow the gap between supervised and unsupervised trackers and encourage the application of real-time tracking on edge devices, we propose a series of green unsupervised trackers that are lightweight enough to run on CPUs at near real-time speed while still achieving high tracking performance. In the unsupervised high-performance single object tracker (UHP-SOT), we propose the background motion modeling module and the trajectory-based box prediction module, which are integrated into the baseline with a simple fusion strategy. It achieves on-par performance with state-of-the-art deep supervised trackers on OTB2015. In the follow-up work UHP-SOT++, we further enhance UHP-SOT with an improved fusion strategy and a more thorough empirical study of its performance and behavior across different benchmark datasets, including small-scale OTB2015 and TC128 and large-scale ones such as UAV123 and LaSOT. Obvious improvement is observed on all datasets, and insights on the gap between deep supervised trackers and lightweight unsupervised trackers are provided. In the green unsupervised single object tracker (GUSOT), we extend those lightweight designs to tracking for long video sequences and introduce the lost object discovery module and the efficient segmentation-based box refinement module to further boost the tracking accuracy in the long run. Our proposed model outperforms or achieves comparable performance with state-of-the-art deep unsupervised trackers that require large models and pre-training. Finally, in the green object tracker (GOT), we model the tracking process as the ensemble of three branches for robust tracking: the global object-based correlator, the local patch-based correlator, and the superpixel segmentator. The outputs from the three branches are then fused to generate the final object box, for which an innovative fusion strategy is developed. The designed modules and mechanisms further exploit the spatial and temporal correlation of object appearances at different granularities, offering competitive tracking accuracy at a lower computation cost compared with state-of-the-art unsupervised trackers that demand heavy offline pre-training. GOT has a tiny model size (<3k parameters) and low inference complexity (around 58M FLOPs per frame).