DEEP LEARNING TECHNIQUES FOR SUPERVISED PEDESTRIAN
DETECTION AND CRITICALLY-SUPERVISED OBJECT DETECTION
by
Chi-Hao Wu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2017
Copyright 2017 Chi-Hao Wu
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Significance of the Research
1.2 Review of Related Work
1.2.1 Pedestrian Detection
1.2.2 Domain Adaptation
1.2.3 Critically Supervised Learning
1.3 Contributions of the Research
1.4 Organization of the Dissertation
Chapter 2: Research Background
2.1 Deformable Part Model (DPM) and Histogram of Gradient (HOG)
2.2 Convolutional Neural Network and Fast-RCNN
2.3 Caltech Pedestrian Dataset
2.4 PASCAL Visual Object Class (VOC) Challenges and Datasets
Chapter 3: Boosted Convolutional Neural Network (BCNN) for Pedestrian Detection
3.1 Introduction
3.2 Methodology
3.2.1 Weighted Loss Function
3.2.2 Overview of BCNN Training
3.2.3 Weighting of Informative Samples
3.2.4 CNN Boosting via Boosted Fusion (BF) Layer
3.3 Results
3.3.1 Implementation Details
3.3.2 Performance Evaluation
3.3.3 Complexity Analysis
3.3.4 Visualization of Performance Gain
3.4 Conclusion
Chapter 4: Pedestrian Detection via Domain Adaptation
4.1 Introduction
4.2 Related Previous Work
4.2.1 Transfer Learning and Domain Adaptation
4.2.2 Domain Adaptation for Pedestrian Detection
4.2.3 Domain Adaptation for Other Problems
4.3 Clustered Deep Representation Adaptation (CDRA) Method
4.3.1 Definition and Notations
4.3.2 Algorithm Overview
4.3.3 Initial Training with Labeled Subset
4.3.4 Clustering on Deep Representations
4.3.5 Confident Sample Selection with Clustered Deep Representation
4.3.6 Weighted Re-training with Confident Samples
4.4 Implementation Details and Discussion
4.4.1 Network Architecture and Implementation
4.4.2 Representations of Different Layers
4.4.3 Sample Balancing
4.5 Experimental Results
4.5.1 Evaluation Dataset
4.5.2 Experimental Setup
4.5.3 Overall Performance
4.5.4 Justification
4.6 Conclusion
Chapter 5: Critically Supervised Object Detection
5.1 Introduction
5.2 Related Previous Work
5.2.1 Object Detection
5.2.2 Weakly Supervised Learning
5.2.3 Active Learning and Related Topics
5.3 Introduction to Critically Supervised Learning
5.4 Taught-Observe-Ask (TOA) Framework
5.4.1 Machine-Guided Labeling via Question Answering
5.4.2 Negative Object Proposal
5.4.3 Critical Example Mining
5.4.4 Training Sample Composition
5.5 Evaluation Method
5.5.1 Labeling Time Model
5.5.2 Estimated Labeling Time Ratio (ELTR)
5.6 Experimental Results
5.6.1 Datasets
5.6.2 Experimental Setup
5.6.3 TOA for Multi-Class Object Detection
5.6.4 TOA for Single-Class Object Detection
5.6.5 TOA Training Options
5.6.6 Analysis
5.7 Conclusion
Chapter 6: Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Work
Bibliography
List of Tables
3.1 The sample mean and variance of the random variables W_LS and W_TI, and their correlation.
3.2 Convergence speed in the BCNN training process.
4.1 Notations of the proposed domain adaptation approach.
4.2 Comparison between RCNN, Fast-RCNN and Faster-RCNN.
4.3 Comparison of the number of training samples in different pedestrian datasets.
4.4 Comparison of the prediction correctness between CDRA and detection scores.
5.1 Summary of notations used in this chapter. Note that N_OP,i denotes the number of object proposals in image i; it differs from N_OP when only a subset of object proposals is used for training.
5.2 Labeling time models for different types of labeling. The HQ-profile is provided for high-quality labeling and the MQ-profile for moderate-quality annotations.
5.3 Detailed experimental setup for Fig. 5.7. The numbers in parentheses after FRCNN denote the percentage of data used for labeling; the parentheses after TOA denote the number of stages used in TOA. The number of fully-labeled images and QA-labeled samples are provided along with the final estimated labeling time ratio (ELTR).
5.4 The detailed detection results by object category with the experimental setup defined in Table 5.3 and visualized in Fig. 5.7.
5.5 Detailed experimental setup for the fully and critically supervised CNN models trained on the VOC07+12 trainval dataset and tested on the VOC07 dataset. The number of fully-labeled images and QA-labeled samples are provided along with the final estimated labeling time ratio (ELTR).
5.6 Detailed experimental setup for Fig. 5.8. Note that the TOA method is applied on a subset (1/4) of the Caltech dataset, thus the ELTR for FRCNN(1/4) is equal to 1.
5.7 Summary of the TOA training options, where "def" denotes the default values.
5.8 The valid QA rate with the CEM algorithm in each stage.
5.9 Differences in detection results using models with/without the class-balancing progress function.
List of Figures
1.1 Challenging examples in the Caltech pedestrian dataset: (a) low-light environment (b) occlusion (c) cluttered background (d) various pedestrian poses and sizes (e) extremely low resolution.
2.1 Illustration of the histogram of gradient feature: (a) original image (b) gradient image (c) the histogram is computed in each local block (upper); each block is divided into several bins of equal angle, and the gradients are voted into the bins by their directions (lower) (d) visualized HOG descriptor.
2.2 Illustration of the deformable part model: (a) global model (b) part models (c) spatial relationship between global and part models.
2.3 The concept of the image pyramid.
2.4 Mixture of models in the DPM method.
2.5 Examples of false detections from a HOG-based human detector. The falsely detected image patches are shown in the upper row, and the corresponding visualized HOG feature maps in the lower row.
2.6 Example of a CNN network architecture in [54].
2.7 Illustration of the Fast-RCNN network [35].
2.8 The VGG-16 [87] network structure.
2.9 Comparison between pedestrian detection datasets.
2.10 Examples of the Caltech dataset.
2.11 Statistics of the Caltech dataset, including probabilities against bounding box height (left), bounding box aspect ratio (middle), and occlusion rate (right).
2.12 Comparison between (a) per-window evaluation and (b) per-image evaluation.
2.13 Pedestrian detection methods evaluated on the Caltech dataset and its subsets under different conditions.
2.14 Comparison of the evaluation results between different datasets.
2.15 Object classes in the VOC2007 and VOC2012 datasets.
2.16 Exemplary images and ground truths in the VOC datasets.
2.17 Examples of evaluation results on the VOC datasets.
3.1 Overview of the BCNN architecture. The left part shows a traditional CNN training process. By analyzing the training results, the BCNN system learns to identify challenging samples and adjusts the weight of each input sample. On the right side, the BCNN model is further trained with these weighted samples using the proposed weighted loss function to yield better performance. No extra labeled training samples are required.
3.2 The detection score h_1(x) carries information about the input samples. When the scores are low, the samples are more challenging to the baseline detector; they typically suffer from low image quality and/or cluttered backgrounds. The lower the score, the more challenging the situation, which provides more valuable information for detector retraining.
3.3 (a) The histogram of the score distribution from h_1(x_i) using Caltech [23] training samples, and (b) the score-based weight function for training samples, where a parameter controls the exact shape of the curve; when the parameter is smaller, high-score samples are discounted more.
3.4 Two illustrative temporal inconsistency examples in the Caltech training dataset using Fast-RCNN as the baseline detector. (a) A low-score pedestrian bounding box appears in the third frame; however, several bounding boxes with high detection scores in the neighboring frames come from the same pedestrian. (b) A high-score background bounding box also appears in the third frame and gives a false positive detection result. However, this high-score bounding box occurs only once when examining its temporally adjacent frames.
3.5 (a) A temporal inconsistency pair is identified for the one-pedestrian scenario, where x_j has a low score at one time instance but x_k has a high score at a nearby one. Along the linear trajectory extended from the pair, we can find several pedestrian boxes with low distances to the trajectory and thus high trajectory matching scores S_jk. (b) If we start with a wrong pair where two pedestrians walk side by side, the trajectory matching score S_jk will be low and this pair will be dropped from the temporally inconsistent sample set.
3.6 The BCNN structure, where the upper branch is the model h_LS trained with the W_LS weighting scheme and the lower branch is the model h_TI trained with the W_TI weighting scheme. They are given the same image and object proposals as input. The boosted fusion (BF) layer concatenates both networks after the fully-connected layer. The final detection scores and bounding box regression results are obtained from the BF layer.
3.7 Performance comparison of Fast-RCNN, which serves as the baseline method, and BCNN-TI, BCNN-LS and BCNN-BF using miss rate versus false positives per image curves.
3.8 Performance benchmarking of top performers plus VJ [98] and HOG [16] on the Caltech pedestrian detection dataset. BCNN-BF ranks third with a log-average miss rate of 11.4% while RPN+BF ranks first.
3.9 Visual examples demonstrating (a) the performance gain of BCNN-TI over the Fast-RCNN baseline and (b) the performance gain of BCNN-LS over the Fast-RCNN baseline.
3.10 Visual performance comparison of three BCNN detectors: (a) an example showing the performance gain of the BF layer when h_TI outperforms h_LS, and (b) another example showing the performance gain of the BF layer when h_LS outperforms h_TI.
4.1 Two illustrative examples showing the need for domain adaptation; the two images are taken from the dataset in [106], which contains samples from both surveillance video (left) and movies (right). When pedestrian detectors trained on the Caltech dataset are applied directly to these two types of images, the detection results are very poor.
4.2 Illustration of the dataset bias problem, taken from [44]. It shows images randomly selected from four different pedestrian datasets: the INRIA dataset (generic) [16], the Caltech dataset (automobile application) [23], the CUHK square dataset (surveillance application) [101] and the PETS 2009 dataset (surveillance application) [24].
4.3 The concept of transfer learning.
4.4 The flow chart of a domain adaptation procedure for pedestrian detection [102].
4.5 A network structure that uses an auto-encoder for the image classification task with domain adaptation [34].
4.6 The definition of the semi-supervised domain adaptation problem.
4.7 Illustration of sample clustering in the deep feature space.
4.8 Illustration of the purity measurement approach.
4.9 The network architecture of the CDRA method. A batch controller is implemented to mimic the RCNN training process.
4.10 Examples of the CUHK-SYSU dataset.
4.11 Overall performance of the proposed CDRA algorithm.
4.12 Visualized examples of the CDRA detection results (lower row) compared with the baseline detection results (upper row).
4.13 Justification of the CDRA method: (a) domain adaptation with fine-tuning from a pre-trained model versus purely training from scratch; (b) confident sample selection using CDRA versus using detection scores.
4.14 Different weighting factors in the weighted re-training step.
5.1 Illustration of the TOA training process compared with fully supervised training. The upper row shows traditional fully supervised training, where all objects are labeled by drawing tight bounding boxes and training samples are obtained by calculating their IoU with object proposals. The lower row illustrates that QA labeling is adopted only for critical examples selected by the CEM algorithm, and the training samples are generated by combining them with NOP samples.
5.2 Flow chart of the taught-observe-ask (TOA) framework. The upper branch represents the "teach stage", where full labeling is done on a subset of images to train the stage-0 model S_0. The lower branch is the "observe-ask stage", where critical example mining (CEM) is used to retrieve critical examples from the unlabeled dataset and use them for QA labeling. The labeled examples are further combined with NOP samples to form the training set for the stage-n models S_n.
5.3 Examples used to illustrate two types of QA: (a) type-1 QA and (b) type-2 QA.
5.4 Examples illustrating the challenges in extracting good NOP samples. In these examples, only 200 of all boxes are plotted for better visualization: (a) EdgeBoxes results with extremely low scores that rank beyond 2000 in one image (b) EdgeBoxes with the modified formula.
5.5 Illustration of the concept of critically supervised learning. The colored dots are labeled samples while the others are unlabeled. The upper row shows that, under the constraint of using only 5 QAs, the selection of samples for labeling is critical to the trained detector (classifier). The lower row shows that the criteria for selecting the first 5 samples also differ from those for selecting 2 additional samples if better performance is targeted.
5.6 Illustration of a kNN graph where each node is connected to its K nearest neighbors (K=3 in this example). The Ldist of a node is defined as the number of edges on the shortest path to the nearest labeled node.
5.7 mAP vs. ELTR curves comparing fully, weakly, and critically supervised learning on the VOC07 dataset. Each dot is a model trained using a different training set with its estimated labeling time.
5.8 MR vs. ELTR curves comparing fully and critically supervised learning on the Caltech dataset. Each dot is a model trained with a different setup and its associated labeling time.
5.9 Detailed miss rate (MR) vs. false positives per image (FPPI) curves for the models defined in Table 5.6.
5.10 Experimental results with different training options. The results are zoomed into a certain ELTR range except for subplot (a). The training options include (a) amount of the initial set (b) bounding box regression (REG) (c) prediction with retrained models (RETR) (d) pretrained model selection (PRETR) (e) training sample generation (GENTR), and (f) weighting on image subsets (WTR).
5.11 Analysis results for the TOA framework: (a) effectiveness of the progress function for hard samples (b) results when errors are added in the simulation.
5.12 Effectiveness of NOP: (a) example of training sample composition (b) the IoU distribution of all negative samples compared with the ground truth.
5.13 Experimental results with different labeling time models: (a) HQ-profile (b) MQ-profile.
Abstract
With the emergence of autonomous driving and advanced driver assistance systems (ADAS), the importance of pedestrian detection has increased significantly. A great deal of research has been conducted to tackle this problem, aided by the availability of large-scale datasets. Methods based on convolutional neural network (CNN) technology have achieved great success in pedestrian detection in recent years, which offers a giant step toward the solution of this problem. Although the performance of CNN-based solutions reaches a significantly higher level than traditional methods, it is still far from perfect, and further advancement in this field is still in demand. In this dissertation, we conduct three research topics along this direction and then extend the work to a more general problem.

In the first topic, a boosted convolutional neural network (BCNN) system is proposed to enhance pedestrian detection performance. Inspired by the classic boosting idea, we develop a weighted loss function that emphasizes challenging samples in training a convolutional neural network (CNN). Two types of samples are considered challenging: 1) samples with detection scores falling near the decision boundary, and 2) temporally associated samples with inconsistent scores. A weighting scheme is designed for each of them. Finally, we train a boosted fusion layer to benefit from the integration of these two weighting schemes. We use Fast-RCNN as the baseline, test the corresponding BCNN on the Caltech pedestrian dataset, and observe a significant performance gain of the BCNN over its baseline.
Data-driven pedestrian detection methods demand a large amount of human-labeled data as training samples, and the performance of these detectors is highly dependent on the amount of labeled data. Since data labeling is time-consuming, labeled datasets are often insufficient to train a robust detector in real-world applications. On the other hand, it is relatively easy to collect unlabeled data. Thus, it is desirable to develop unsupervised or weakly-supervised learning methods that exploit unlabeled data for further performance improvement in the training of a detector. The domain adaptation technique is developed to reach this goal.

In the second topic, a semi-supervised learning method is proposed for pedestrian detection based on domain adaptation. It is observed that the deep representation, which is the response of an input passed through a CNN, is powerful in estimating the class of unlabeled data. Motivated by this observation, we propose a clustered deep representation adaptation (CDRA) method. It trains an initial detector using a small amount of labeled data, extracts the deep representation and, then, clusters samples in the space spanned by the deep representation. A purity measurement mechanism is applied to each cluster to provide a confidence score for the estimated class of unlabeled data. Finally, a weighted re-training process is adopted to fine-tune the model by balancing the numbers of labeled and estimated data. The CDRA method is shown to achieve state-of-the-art performance on a large-scale dataset.

Since semi-supervised learning does not provide sufficiently high precision due to the limited number of labeled data, it is desirable to design another training method that offers reasonable performance while keeping the labeling time as low as possible. To achieve this goal, we observe how humans learn to recognize objects in their childhood, which starts with being taught by their parents. The children then begin to observe the world and ask questions from time to time when they still do not recognize some objects well enough.

In the third topic, we propose a learning framework called critically-supervised learning that mimics this children's learning process and provides reasonable performance while saving up to 95% of the labeling time. Several novel components are proposed to fulfill this high-level concept, including the negative object proposal, critical example mining, and a machine-guided labeling process. A labeling time model is proposed to evaluate the final performance. Extensive experiments are conducted to shed light on several novel ideas, and the effectiveness of the proposed method is evaluated not only on the Caltech benchmark dataset but also on the PASCAL VOC datasets for the general object detection task.
Chapter 1
Introduction
1.1 Significance of the Research
In recent years, a large number of companies have invested resources in cutting-edge technologies such as driverless cars and advanced driver assistance systems (ADAS). These are expected to make a great impact on our daily lives in the foreseeable future. The chance of car accidents caused by drunk or careless driving will largely decrease once the technologies and markets mature. Pedestrian detection has been one of the key technologies required by self-driving and ADAS systems: the systems have to know the presence and locations of pedestrians in order to decide when to brake or give warnings to the driver. The accuracy of pedestrian detection therefore has a significant influence on human safety.

In addition to automobile applications, pedestrian detection can be widely applied in many other domains. For instance, most surveillance systems are now equipped with intelligent functions that allow automatic detection of abnormal events, and these intelligent systems require pedestrian detection algorithms. With more applications such as healthcare, robotics, entertainment, image retrieval, and so on, the pedestrian detection problem is gaining a lot of attention nowadays.

The goal of pedestrian detection is to find the presence and locations of pedestrians in still images. Today, it is still considered a very challenging topic. First of all, pedestrians have to be detected in difficult environments such as cluttered backgrounds or extremely low illumination. Secondly, pedestrians are sometimes occluded by objects or other pedestrians, which forms another challenging case. Thirdly, pedestrians can appear in different poses, and their sizes also vary with their distance from the camera. Some challenging examples from the Caltech dataset [23] are shown in Fig. 1.1. One of the most challenging cases is shown in Fig. 1.1(e), where the pedestrian is so far away from the camera that the image resolution and quality drop to an extent that even humans cannot easily recognize the pedestrian. Even with the rapid development of powerful models such as convolutional neural networks, some of these challenges remain unsolved, which leaves room for further improvement. In the first topic of this dissertation, we deal with these challenging cases in order to enhance the performance of pedestrian detection.
Figure 1.1: Challenging examples in the Caltech pedestrian dataset. (a) low-light environment (b) occlusion (c) cluttered background (d) various pedestrian poses and sizes (e) extremely low resolution.
Machine-learning-based pedestrian detection solutions require human-labeled data to train their models. To solve real-world applications, a huge amount of data is needed so that the models can be robust enough to cover a wide range of situations. However, the amount of labeled data is usually far from enough, and human labeling is too time-consuming to fulfill the need. Since unlabeled data are easy to collect thanks to the availability of video capture and storage devices, it is urgent to develop unsupervised or weakly supervised methods that can leverage the huge amount of unlabeled data. Domain adaptation methods are proposed to solve this problem.
The goal of domain adaptation is to use the knowledge learned from a source domain to train a new model in a target domain, especially when there are few labeled data in the target domain. For example, suppose we have a CNN model trained on the Caltech dataset. If domain adaptation techniques are applied with unlabeled data captured in the evening, the detector will become more robust against night-view images. Similarly, if the unlabeled data are obtained from surveillance cameras, then the adapted detector will work well for surveillance applications.
This problem, however, is one of the most challenging problems in the computer vision field today. Without labeled data, we sometimes have to estimate the labels of the unlabeled data. This requires good estimation algorithms as well as specially designed models that can achieve performance gains with uncertain labels. In addition, CNN models have not been proven to work on this problem, and the traditional approaches have not demonstrated good performance on large-scale datasets. We address this challenging problem using CNNs in the second topic.
As mentioned above, unsupervised and weakly supervised learning play extremely important roles in practical applications. In addition to domain adaptation, we are also interested in training a new model using a small amount of labeling time without the help of knowledge acquired in a source domain. This task removes the dependency on a source domain and thus enables a wider range of applications once the performance becomes reasonable. However, the problem also becomes much more challenging.
Weakly-supervised learning approaches have attracted much more attention in the general object detection problem than in the pedestrian detection problem. The challenges in object detection are, however, very different from those in pedestrian detection. For example, one difficulty of object detection is that the CNN must distinguish among many different object classes (e.g., 1000 classes in ImageNet [18]), many of which are similar in appearance and thus cause confusion.
In the third topic, we design a new framework that achieves better performance than domain adaptation and traditional weakly-supervised learning while keeping the labeling time as low as possible. We target both the general object detection and pedestrian detection problems. In computer vision, it is rare to have a generalized framework that performs well on different problems, which makes our framework even more impactful.
1.2 Review of Related Work
1.2.1 Pedestrian Detection
The pedestrian detection problem has been studied for more than a decade. Several survey papers [6, 23, 25, 114] have provided comprehensive reviews of the development of this field up to their publication dates. In this section, the most related works are reviewed, including a couple of the latest ones.
Before the introduction of CNN solutions, the pedestrian detection problem was primarily solved by two major categories of approaches: decision-forest-based approaches and deformable part model (DPM) based methods. In the decision forest family, VJ [98] is an early work that uses Haar-like features with a cascade of classifiers. The ACF [21] method then extracts channel features and adopts a boosting tree for classification, which achieved impressive performance at the time. Later on, a series of related works were published, including ICF [22], which integrated various channels, and LDCF [64], which de-correlated the feature channels. More methods [6, 100, 104, 115] were proposed with decision-forest-based classifiers using various features and frameworks.
The DPM method [29, 30] is composed of a few part models together with a global model. The histogram of oriented gradients (HOG) technique [16] is used for feature extraction in the DPM method. After HOG features are extracted, a latent support vector machine (SVM) is applied to train the model. DPM has shown success in the pedestrian detection problem as well as in other object detection tasks. Several variations of the DPM method have been proposed [72], and DPM-related frameworks were popular for several years. In recent years, CNN-based solutions have been proposed for the pedestrian detection problem; they outperform the traditional approaches by a significant margin and deliver the state-of-the-art performance on this problem today.
Since around 2013, many CNN-based solutions have been proposed for the pedestrian detection problem [13, 65-67, 82, 92, 93, 109]. Tian et al. [92] used a CNN to learn each of the body parts to enhance the performance, which inherited the spirit of the DPM method. Tian et al. also proposed another method in [93], which learned context information by including some scene-related datasets for training. Yang et al. [109] still applied traditional channel features with a boosting tree, but certain CNN layer features were treated as part of the channel features. Zhang et al. applied the RPN network [79, 107] followed by a boosted forest to achieve excellent performance; the outputs of the RPN network, including bounding boxes, scores, and CNN features, were fed into the boosted forest for classification. Several other CNN works [12, 13, 93, 111] also proved to provide significant performance gains over traditional non-CNN ones.
1.2.2 Domain Adaptation
Domain adaptation [17, 38, 74] is a special case of transfer learning [7, 69, 77]. It can be applied in several different applications, and many papers have been published in recent years, including unsupervised [2, 5, 63] and semi-supervised [14] schemes. CNNs have started to show some early successes on the domain adaptation problem for certain applications, such as image classification. For instance, Ghifary et al. [34] designed a two-branch network, where one branch has an auto-encoder structure used for image reconstruction on the unlabeled data, while the other branch takes care of the original image classification task. By jointly training the two branches, the shared layers provide good performance in the target domain.
Domain adaptation for detection problems has also drawn some attention in recent years. However, CNN networks have not yet been employed to tackle this problem. The development of this field is reviewed in [44]. The earliest work on domain adaptation for object detection can be traced back to 2004, when Bose et al. [10] proposed a two-step self-training algorithm and applied it to far-field videos. Kembhavi et al. proposed a multiple-kernel-learning-based self-training algorithm, which combines a global detector with a local detector to adapt vehicle detectors. In 2011, Wang et al. [102] started to focus on adapting a generic pedestrian detector to a specific scene. In this work, an iterative self-training technique is developed, where confident positive and negative samples are collected using a variety of cues such as motion or context in order to train a new detector. Sharma et al. [83] proposed to use Real Adaboost and Multiple Instance Learning methods to solve the same problem. Then, the self-training approach proposed by Shu et al. used super-pixels for clustering into visual dictionaries and then encoded the examples with a BoW representation for training. There are more publications regarding domain adaptation for the pedestrian detection problem [43, 45, 46, 62, 63, 82, 83, 85, 91, 101, 102, 116], mostly published from 2012 to 2015.
In general, these methods share some common practices: they estimate positive or negative labels of the unlabeled training set in the target domain with the help of detection scores or other cues, and the confident samples are then used to fine-tune the detectors with specially designed models. However, the performance achieved by this methodology is still limited even when tested on smaller datasets. It is therefore interesting to explore novel CNN solutions to this problem.
1.2.3 Critically Supervised Learning
Critically supervised learning is a whole new concept, defined in Chapter 5, which aims to minimize the labeling time needed for reasonably good performance. It is closely related to several existing topics such as active learning [48, 50, 53, 76], which interactively finds informative samples among new data points and asks humans to annotate them for re-training the model. From the labeling perspective, the concept of human-in-the-loop labeling [86, 97] is also related; it is a technique that involves human-machine collaboration in the annotation process. In Chapter 5, we are also interested in comparing our results with weakly-supervised learning [8, 9, 37, 59, 80, 88, 89, 108], which aims to reduce the labeling time needed for training a CNN model by providing only image-level annotations without details of individual samples. An extensive literature survey of these topics is given in Chapter 5.
1.3 Contributions of the Research
In the first topic, we propose a Boosted Convolutional Neural Network (BCNN) to solve the pedestrian detection problem. The contributions of BCNN include:
To the best of our knowledge, few previous works have implemented a boosting algorithm within a CNN to yield a BCNN solution. In the traditional CNN training process, all samples are treated equally. However, some of the informative samples that lie near the decision boundary can be exploited to achieve better performance with a boosting strategy. Our BCNN method allows the CNN to automatically adjust the weights of samples to train multiple weak classifiers and, then, combine them to form a stronger classifier, which is similar to what is done in the classic boosting algorithm.
We present a method to give higher weights to informative samples in the context of pedestrian detection, and propose a general framework to build the BCNN model based on the weights of samples. To handle the weighted samples, we propose a novel random selection scheme that simulates a weighted loss function (see the sketch after this list). This technique can be generalized to any application that requires weighted training of samples in a CNN.
In the selection of informative samples, we apply a weighting function to samples with low detection scores. Besides, we also introduce a weighting function that considers the temporal inconsistency property of the detection results. This is a novel approach to identifying temporally inconsistent samples, which is quite effective and also different from other works that involve temporal information.

The resulting BCNN detector achieves top performance on the Caltech dataset. In addition, BCNN is a general framework which can be easily generalized to other applications.
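As a rough illustration of the random selection scheme mentioned above, weighted training can be simulated by drawing each mini-batch with probability proportional to the sample weights. This is a minimal sketch under our own assumptions, not the exact BCNN implementation:

```python
import numpy as np

def sample_weighted_batch(weights, batch_size, rng=np.random.default_rng()):
    """Simulate a weighted loss function by weighted random selection:
    heavily weighted samples are drawn more often, so over many batches
    they contribute proportionally more to the accumulated gradient."""
    probs = np.asarray(weights, dtype=np.float64)
    probs /= probs.sum()
    # Indices of the samples chosen for this mini-batch.
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)
```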
In the second topic, a clustered deep representation adaptation (CDRA) method is proposed to solve the domain adaptation problem for pedestrian detection. The major contributions are summarized as follows.
While convolutional neural networks offer good results in many other computer vision problems, they have not previously been proven to work on this particular problem. To the best of our knowledge, the proposed CDRA is the first CNN-based domain adaptation solution for the pedestrian detection problem that works well in experiments.
The CDRA method exploits clustered deep representations and employs a weighted re-training process, both of which are novel in the domain adaptation field. They are also general techniques whose applications are not restricted to pedestrian detection.
While most domain adaptation methods were tested on smaller datasets, the proposed CDRA method is tested on a large-scale dataset for performance evaluation and shows a substantial performance gain.
In the third topic, a taught-observe-ask (TOA) method is proposed to greatly reduce the labeling time needed for robust object detection in various applications. We summarize our contributions as follows.
Recent weakly-supervised learning works have demonstrated considerable labeling-time savings by annotating only image-level labels. To the best of our knowledge, we are the first to provide a different angle, called critically-supervised learning, that saves even more labeling time by picking the most critical examples for question answering starting from the early stages of the training process.
The proposed taught-observe-ask (TOA) method is inspired by the learning process of children, and the overall framework is a novel presence in this field.
In the TOA method, a critical example mining (CEM) component is introduced to analyze the samples in the feature space and locate the most critical ones. The criticalness of a sample is designed to change dynamically along the training process for optimal performance, which is a novel design.
In the TOA method, a negative object proposal (NOP) module is applied to find samples that most likely do not belong to any object class. This brand-new concept is very effective in saving labeling time.
An evaluation method is proposed to compare labeling efficiency along with detection results among different approaches.
The TOA method generalizes to both the object detection problem and the pedestrian detection problem. The experimental results show superior performance on both the VOC and Caltech datasets. The mAP performance is much better than those of the weakly-supervised learning methods, while the labeling time is even lower.
1.4 Organization of the Dissertation
The rest of this dissertation is organized as follows. Chapter 2 reviews the traditional deformable part model (DPM), convolutional neural network (CNN) solutions, and the challenging Caltech dataset. The proposed Boosted Convolutional Neural Network (BCNN) framework designed for the pedestrian detection problem is introduced in Chapter 3. To tackle the domain adaptation problem for pedestrian detection, the clustered deep representation adaptation (CDRA) method is elaborated in Chapter 4. A brand-new framework called critically-supervised learning, for both general object detection and pedestrian detection, is presented in Chapter 5. Finally, Chapter 6 concludes the dissertation and presents future research directions.
Chapter 2
Research Background
2.1 Deformable Part Model (DPM) and Histogram of Gradient (HOG)
Pedestrian detection has been developed for more than a decade. Before the introduction of convolutional neural networks, one of the most important works was the deformable part model (DPM). It attracted much attention, not only for the pedestrian detection problem but also for other tasks such as object recognition. Therefore, a brief introduction to DPM is given in this section.
The DPM method uses the histogram of gradient (HOG) [16] as its feature. In brief, the HOG method first generates the gradient of the image and then calculates a histogram within each local block by separating the gradients into several bins according to their angles. Although the high-level concept is not complicated, the HOG feature worked quite well in the early days, especially in tasks like human detection, and was widely employed in related fields. Fig. 2.1(d) shows an example of a visualized HOG feature, obtained from the original image (Fig. 2.1(a)) and its gradient map (Fig. 2.1(b)). It is a compact feature that nevertheless captures important information about the image, such as the rough contours of objects.
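As an aside, the pipeline just described can be reproduced with off-the-shelf tools. The sketch below uses scikit-image's hog function; the image file name and the bin/cell sizes are illustrative assumptions, not the exact settings of [16]:

```python
from skimage import io, color
from skimage.feature import hog

# Load an image and convert it to grayscale (file name is hypothetical).
image = color.rgb2gray(io.imread("pedestrian.png"))

# Gradients are computed first, then histogrammed over 9 orientation
# bins within each local cell, and finally block-normalized.
descriptor, hog_image = hog(
    image,
    orientations=9,          # number of angular bins per histogram
    pixels_per_cell=(8, 8),  # local cell size
    cells_per_block=(2, 2),  # blocks used for normalization
    visualize=True,          # also return the visualized descriptor
)
print(descriptor.shape)      # a compact 1D feature vector
```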
Although HOG can preserve important information in the image patch, it cannot
always capture the variations in appearance of the same type of objects. Thus, DPM
is proposed to provide robust detection against the deformation of objects. The main
concept of DPM is depicted in Fig. 2.2. Instead of using only a global model, several
part models are also employed. By combining the global model, part models, and their
Figure 2.1: Illustration of the histogram of gradient feature. (a) original image (b) gradient image (c) the histogram is computed in each local block (upper); each block is divided into several bins of equal angle, and the gradients are voted into the bins by their directions (lower) (d) visualized HOG descriptor.
spatial relationships, a DPM model is formed which is stronger than a single model and
is capable of dealing with articulations of objects.
Figure 2.2: Illustration of deformable part model (a) global model (b) part models (c)
spatial relationship between global and part models
To train a DPM model, the HOG feature is first extracted. As illustrated in Fig. 2.2(a)(b), DPM uses a coarse feature resolution in the global model and a finer resolution in the part models. To deal with multiple scales of objects, DPM applies the image pyramid technique shown in Fig. 2.3. The idea of the image pyramid is to resize the image into different scales; at each scale, a fixed-size bounding box slides across the image in the x and y directions with a stride value. Through this sliding-window process, candidate bounding boxes are obtained at each scale. For each candidate bounding box used as a global model, its associated part models are selected from a finer scale in the image pyramid.
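A minimal sketch of this pyramid-plus-sliding-window procedure is given below; the window size, stride, and scale factor are assumed values for illustration, not the DPM implementation itself:

```python
import cv2

def sliding_window_pyramid(image, window=(64, 128), stride=8, scale=1.2):
    """Yield (level, x, y, patch) over an image pyramid.

    The image is repeatedly downscaled by `scale`; at each level a
    fixed-size window slides in the x and y directions with `stride`.
    """
    level = 0
    while image.shape[0] >= window[1] and image.shape[1] >= window[0]:
        for y in range(0, image.shape[0] - window[1] + 1, stride):
            for x in range(0, image.shape[1] - window[0] + 1, stride):
                yield level, x, y, image[y:y + window[1], x:x + window[0]]
        # Move one level up the pyramid (smaller image, larger objects).
        new_size = (int(image.shape[1] / scale), int(image.shape[0] / scale))
        image = cv2.resize(image, new_size)
        level += 1
```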
Figure 2.3: The concept of image pyramid
To model the relationship between the global and part filters, the spatial model in Fig. 2.2(c) is used to find the location of each part model relative to the global model. Then a cost function is defined as
score(p_0, \ldots, p_n) = \sum_{i=0}^{n} F'_i \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b \qquad (2.1)
where the first term denotes the sum of the responses of the i-th filter F'_i, including the global filter (i = 0) and all the part filters (i = 1 to n); the second term represents the sum of the displacement penalties from the root filter to all the part filters in the x and y directions; and the third term b is a bias. The parameters (F'_i, d_i, b) are learned through a latent support vector machine (latent-SVM) as formulated in [30]. Similar to the classic SVM, the model parameters are obtained by minimizing an objective function on the labeled data. The latent values are introduced to meet the needs of the DPM model.
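As a toy illustration of Eq. (2.1), the score of one placement of global and part filters can be evaluated as below. This is only a sketch assuming the filter responses and part displacements have already been computed; it is not the latent-SVM training procedure of [30]:

```python
import numpy as np

def dpm_score(filter_responses, deform_params, displacements, bias):
    """Evaluate Eq. (2.1) for a single placement hypothesis.

    filter_responses : list of F'_i . phi(H, p_i) values, i = 0..n
    deform_params    : list of d_i vectors (one per part), i = 1..n
    displacements    : list of (dx_i, dy_i) per part, i = 1..n
    bias             : the bias term b
    """
    appearance = float(np.sum(filter_responses))
    # Deformation penalty; real DPM uses quadratic deformation features
    # (dx, dy, dx^2, dy^2), a linear penalty is used here for brevity.
    deformation = sum(float(np.dot(d, np.asarray(dxdy)))
                      for d, dxdy in zip(deform_params, displacements))
    return appearance - deformation + bias
```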
Note that a single DPM model may not be enough for one type of object. As the examples in Fig. 2.4 show, the within-class variations may be so large that they cause a single model to fail. Hence, the DPM method employs mixture models that allow more than one component. In this example, a two-component bicycle model is obtained, where the first component captures the side view of bicycles while the second component captures the frontal view.
Figure 2.4: Mixture of models in the DPM method.
Even though DPM achieved good performance for several years, it was finally outperformed by CNN approaches. The study in [99] provided some reasons for the restricted performance of the DPM method. While the HOG feature seems to work well for many images, it is still not representative enough for all objects. The images shown in Fig. 2.5 do not belong to the human class; however, the visualized HOG features shown in the bottom row make them look like humans from the detector's point of view. Therefore, more powerful features are required for the pedestrian detection problem, and CNNs seem to be a better choice.
Figure 2.5: Examples of false detections from a HOG-based human detector. The falsely detected image patches are shown in the upper row, and the corresponding visualized HOG feature maps in the lower row.
2.2 Convolutional Neural Network and Fast-RCNN
The convolutional neural network (CNN) has shown tremendous success in many computer vision fields in recent years. Neural network theory has been studied for quite a long time, but not until the early 2000s did it start to work well in practical applications. Its success can be attributed to the availability of today's big data for training the models, as well as to increased computational power, which makes it feasible to develop very large and powerful network structures.

CNNs started to show their power on the pedestrian detection problem around 2013, when several works began to outperform the traditional approaches. One of the major differences between CNNs and the traditional approaches is that a CNN does not need the hand-crafted features designed in traditional methods. Instead, the intermediate layers of the neural network can be seen as features that are obtained automatically through the end-to-end training process.
It is worthwhile to give a brief introduction to the structure of convolutional neural networks. Fig. 2.6 shows an example of a classic CNN that achieved early success in the image classification problem. This example is used to showcase some of the basic units of a CNN.
Figure 2.6: Example of a CNN network architecture in [54]
As illustrated in the figure, a CNN consists of several different kinds of layers. The convolution (CONV) layer is one of the most basic building blocks of a CNN. Each node in a CONV layer contains several trainable 2D filters that are convolved with the 2D input data. In each forward pass, the filters slide across the x and y directions of the input data and apply convolution to generate a 2D output per filter, yielding a 3D cubic data structure over all the filters. The fully-connected (FC) layer is another basic component, which usually appears at the end of the CNN to take care of decision making. There may be a few FC layers in the network, and their input is usually the output of previous layers (e.g. CONV). In addition to CONV and FC layers, the max pooling layer is designed to output the maximum response within each rectangular local area of the previous layer; the data are therefore down-sized when passing through a max pooling layer. Complete introductions to CNNs can be found in [54, 57].
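These building blocks can be assembled as in the following PyTorch sketch of a small classification network. It is an illustrative toy model under our own assumptions, not the network of [54]:

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """A toy CNN with CONV, max pooling, and FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # CONV: 16 trainable 2D filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # max pooling halves H and W
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # FC layer for decisions

    def forward(self, x):             # x: (N, 3, 32, 32)
        x = self.features(x)          # -> (N, 32, 8, 8)
        return self.classifier(x.flatten(1))
```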
To train a CNN model, the training images and their associated ground-truth labels are given as input to the network. In a forward pass, each sample goes through the whole network to obtain a predicted label. A loss function calculates the error between the output of the network and the ground truth. In the backward pass, the errors are back-propagated in order to update the coefficients of the network. This is done with a stochastic gradient descent (SGD) method that iteratively updates the network with a batch of samples at a time.
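A minimal training loop corresponding to this description might look as follows; `train_loader`, the learning rate, and the momentum value are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = SmallCNN()                      # the toy model sketched above
criterion = nn.CrossEntropyLoss()       # loss between prediction and ground truth
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for images, labels in train_loader:     # `train_loader` is an assumed DataLoader
    outputs = model(images)             # forward pass through the whole network
    loss = criterion(outputs, labels)   # error w.r.t. ground-truth labels
    optimizer.zero_grad()
    loss.backward()                     # backward pass: back-propagate the errors
    optimizer.step()                    # SGD update with one batch of samples
```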
After the training process, the CNN model is powerful enough to predict the label of any input image, without the need for separate feature extraction steps. Typically, the CONV layers can be seen as a feature extraction process, whose output responses serve as high-dimensional features of the image; the FC layers can then be seen as the decision-making process operating on the CONV features. While the networks are huge and difficult to analyze, recent papers [110] have shown that the filters of different layers can be visualized, and these visualized responses demonstrate that, after the end-to-end training process, CNN features are extremely powerful compared to traditional hand-crafted features.
Figure 2.7: Illustration of the Fast-RCNN network [35].
For detection problems, several CNN networks have been proposed, such as [13, 65-67, 82, 92, 93, 109]. In this section we give a detailed introduction to the Fast-RCNN [35] method, which serves as our baseline. There are several reasons for adopting it as the baseline. First of all, while most pedestrian detection methods did not release their source code to the public, Fast-RCNN allows open access to its code, which is quite easy to reproduce. Secondly, Fast-RCNN accepts bounding boxes as input, which suits our proposed method well because our method has a procedure that selects a subset of bounding boxes for training. Finally, Fast-RCNN is not a heuristically designed network; instead, it is a framework that can be easily generalized to many applications.
The architecture of Fast-RCNN is presented in Fig. 2.7. In the Fast-RCNN method, an input image and multiple regions of interest (RoIs) are fed into the CNN. The input image is first passed through the CONV layers. Then, each RoI, projected onto the last CONV layer, is pooled into a fixed-size feature vector before the FC layers. After the FC layers, the network has two outputs for each RoI: the softmax probabilities of each class (i.e. pedestrian and non-pedestrian), and the bounding box regression offsets. The network is trained end-to-end with a multi-task loss.
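The RoI pooling step can be illustrated with torchvision's roi_pool operator. This is only a sketch of the idea; the feature map size, boxes, and stride are made-up values, and it is not the original Fast-RCNN code:

```python
import torch
from torchvision.ops import roi_pool

# Suppose the CONV layers produced this feature map for one image.
feature_map = torch.randn(1, 512, 38, 50)        # (N, C, H, W), made-up sizes

# Two RoIs in (batch_index, x1, y1, x2, y2) format, in input-image pixels.
rois = torch.tensor([[0.,  40.,  60., 120., 300.],
                     [0., 200.,  80., 280., 330.]])

# Each RoI, projected onto the feature map (spatial_scale = 1/16 for a
# VGG16-like stride), is pooled into a fixed-size 7x7 feature.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]), then fed into the FC layers
```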
The RoI bounding boxes can be seen as pedestrian candidates in each image. Traditionally, RoIs are chosen with a sliding window [30]. It is also widely accepted to use object proposals [1, 94, 118] as the RoI bounding boxes. In brief, object proposal is the task of determining whether a certain bounding box could possibly contain an object; hence object proposals are more suitable RoI candidates than sliding windows.
The benefit of Fast-RCNN over its predecessor, RCNN [36], is that the cropping of RoIs happens after the CONV layers. In RCNN, on the contrary, each image is cropped by the RoIs before being fed into the CONV layers, which causes huge computational redundancy since RoIs usually overlap densely. In Fast-RCNN, the computation time is greatly reduced by running the CONV layers only once per image. It is shown in [35] that the accuracy of Fast-RCNN is also slightly better than that of RCNN.
Fast-RCNN is a general framework that allows the use of different CNN network structures (i.e. numbers of layers, filters, and so on). In our work, we adopt the VGG16 network [87] as the underlying network. As shown in Fig. 2.8, VGG16 is a classic network with a very deep layer structure and has demonstrated excellent
Figure 2.8: The VGG-16 [87] network structure.
performance in many applications, including the pedestrian detection problem. More implementation details of the proposed method are elaborated in Chapter 3.
2.3 Caltech Pedestrian Dataset
The rapid development of pedestrian detection algorithms is partly due to the availability of large-scale public datasets. The most popular one is the Caltech dataset [23]. It provides an evaluation platform, and currently nearly all publications on pedestrian detection report performance evaluations on the Caltech dataset. It is worthwhile to give a detailed review of this dataset in this section.

Before the Caltech dataset was released, several pedestrian detection datasets already existed. However, the Caltech dataset was the first to provide a huge amount of data with sufficient diversity. Fig. 2.9, borrowed from [23], provides a comparison of the pedestrian datasets available at the time the Caltech dataset was released.
The Caltech dataset provides around 250,000 labeled frames selected from roughly 10 hours of video of driving scenes through regular traffic in an urban environment. The image resolution is 640x480 with a frame rate of 30 Hz. Around 2,300 unique pedestrians and a total of 350,000 bounding boxes were annotated.
Figure 2.9: Comparison between pedestrian detection datasets.
Some examples are provided in Fig. 2.10. Compared to the previous datasets, the volume of the Caltech dataset was incomparable. After the Caltech dataset was released, some pedestrian datasets with large volumes were also published, such as the KITTI dataset [33] and the Caltech-New dataset [114]. However, the Caltech dataset is still the major benchmark that people tend to use, because it is easier for evaluation and all previous methods have been evaluated on it and are available for comparison.
Figure 2.10: Examples of the Caltech dataset.
There are several reasons for the success of the Caltech dataset. First, it not only provides a large volume of data, it also offers a wide diversity of samples. For example, the Caltech dataset includes pedestrians of different sizes, ranging from tens of pixels to hundreds of pixels in bounding box height; differently-sized pedestrians may even appear in the same image. In addition, the Caltech dataset contains pedestrians that are partly occluded, and the occlusion rate varies from case to case. Some statistics of the Caltech dataset are shown in Fig. 2.11 as evidence of this diversity.
Figure 2.11: Statistics of the Caltech dataset, including probabilities against bounding
box height (left), bounding box aspect ratio (middle), and occlusion rate (right).
Another reason for the popularity of the Caltech dataset is that it provides evaluation code to allow easy and fair comparisons between different methods. The evaluation metric is adjusted to be more unbiased, realistic, and informative than the traditional metric defined in [20]. It is of the form

$$a_0 = \frac{\mathrm{area}(BB_{dt} \cap BB_{gt})}{\mathrm{area}(BB_{dt} \cup BB_{gt})} > 0.5, \qquad (2.2)$$

where $BB_{dt}$ and $BB_{gt}$ denote the bounding boxes of the detected and the ground truth
regions, respectively. One major change is that it adopts per-image evaluation rather
than per-window evaluation. This is similar to what is used in the PASCAL object
detection challenges [27].
Figure 2.12: Comparison between (a) per-window evaluation and (b) per-image evaluation.
In a per-image evaluation, each detected bounding box ($BB_{dt}$) is first examined for sufficient overlap with a ground-truth bounding box ($BB_{gt}$). To this end,
the Intersection-over-Union (IoU) is defined as in Eq. 2.2, where the numerator is the area of the intersection between the two bounding boxes and the denominator is the area of their union. If the IoU is greater than 0.5, the detected bounding box is counted as a match to a ground truth (i.e., a positive sample); otherwise it is a negative sample. Then, a miss rate (MR) versus false positives per image (FPPI) curve is plotted in log scale. This is preferred to a precision-recall curve in certain applications, such as automotive applications. Finally, a log-average miss rate is calculated to summarize detector performance. This is done by averaging the miss rates at several reference points on the MR-FPPI curve in the range from $10^{-2}$ to $10^{0}$ FPPI. It is shown in Fig. 2.12 that the per-image evaluation result differs substantially from the traditional per-window evaluation.
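As a concrete illustration of this metric, the following is a minimal Python sketch of the IoU test in Eq. 2.2 and of the log-average miss rate; the nine evenly log-spaced reference points follow the convention described above, and the input arrays are hypothetical.

import numpy as np

def iou(bb_dt, bb_gt):
    # IoU of two boxes given as (x1, y1, x2, y2), as in Eq. 2.2.
    x1, y1 = max(bb_dt[0], bb_gt[0]), max(bb_dt[1], bb_gt[1])
    x2, y2 = min(bb_dt[2], bb_gt[2]), min(bb_dt[3], bb_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(bb_dt) + area(bb_gt) - inter)

def log_average_miss_rate(fppi, miss_rate):
    # Average the miss rate at 9 log-spaced FPPI reference points in [1e-2, 1].
    refs = np.logspace(-2, 0, 9)
    mrs = [miss_rate[np.argmin(np.abs(fppi - r))] for r in refs]
    # The summary number is the average taken in log space.
    return np.exp(np.mean(np.log(np.maximum(mrs, 1e-10))))

# A detection matches a ground truth when the IoU exceeds 0.5.
print(iou((10, 10, 50, 90), (20, 15, 55, 95)) > 0.5)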
Figure 2.13: Pedestrian detection methods evaluated on the Caltech dataset and its subsets under different conditions.
To validate the Caltech dataset, a large number of methods were evaluated on it for performance comparison. Fig. 2.13(a) shows the MR-FPPI curves and overall performance of several well-known algorithms at that time. To give more insight, Figs. 2.13(b-c) compare the performance when the methods are tested on different pedestrian sizes. Naturally, closer pedestrians appear larger, which makes them easier to detect. Figs. 2.13(d-e) show that partially occluded samples are more difficult to detect. Fig. 2.13(f) compares performance on the reasonable set, where the pedestrians are at least 50 pixels tall with no or partial occlusion. This is the most commonly used setup for performance benchmarking.
Figure 2.14: Comparison of the evaluation results between different datasets.
In order to compare the Caltech dataset with previous datasets, the same algorithms were evaluated on all of them with the same metrics; the results are shown in Fig. 2.14. It can be seen that the Caltech dataset is much more challenging than the others, with the highest miss rate. The INRIA dataset has the lowest miss rate because it only contains high-resolution pedestrians. Other datasets such as Daimler-DB, ETH, and TUD-Brussels also have lower miss rates than the Caltech-Train dataset, and much lower than the Caltech-Test dataset shown in Fig. 2.13(a). Caltech-Japan is another dataset gathered in Japan (not released publicly due to legal issues); performance on it is slightly worse, partly because it has more challenging image conditions. It is also observed that the performance ranking is stable across datasets, indicating that the proposed performance metrics are reliable. In Sec. 3, we will use the Caltech dataset for performance evaluation.
2.4 PASCAL Visual Object Class (VOC) Challenges And
Datasets
Object detection has been one of the most important topics in the computer vision field, and the introduction of large datasets has further pushed the field forward. The PASCAL VOC challenge has been held every year since 2006 to drive progress on this problem. The datasets are released to the public domain and are now among the most commonly used benchmarks in this area; VOC2007 and VOC2012 are the most commonly used ones.
The challenge provides a wide diversity of data collected from the Flickr website, together with ground truth annotations. The annotations cover various tasks, including classification, detection, segmentation, action recognition, and person layout. In this section we give detailed information about object detection in the VOC07 and VOC12 datasets.
Figure 2.15: Object classes in the VOC2007 and VOC2012 datasets
In VOC07 and VOC12, there are a total of 20 classes, summarized in Fig. 2.15. These classes are not balanced in the collected images; for example, there are many more "person" and "car" instances than objects of other classes. The authors of the VOC dataset spent extra effort labeling rare classes such as bus and dining table, but a certain degree of imbalance still exists in the dataset.
The ground truths for object detection include precise bounding boxes of the objects and the object classes associated with the boxes, as visualized in Fig. 2.16. The labeling was done through crowd-sourcing with Mechanical Turk, and several techniques were applied to ensure high-quality annotation, as described in [26].
Figure 2.16: Exemplary images and ground-truths in the VOC datasets
The VOC07 and VOC12 datasets consist of different subsets, including training, validation, and testing sets. A common scenario for CNN training is to train the model on the trainval set (training and validation together) and then test the model on the testing set. For VOC07, there are a total of 2,501 training images, 2,510 validation images, and 4,952 testing images. VOC12 is more challenging, consisting of 11,540 trainval images; its testing set is not publicly available.
The PASCAL VOC dataset has a standard evaluation method. First, a precision-recall curve is drawn for each class, as in Fig. 2.17. Correctness of every detection is determined by the intersection-over-union (IoU) criterion against the ground truth, and overall performance is illustrated with
Figure 2.17: Examples of evaluation results on the VOC datasets
the precision-recall curve obtained by sweeping the detection score threshold. An average precision (AP) is then calculated for each class, and finally a mean average precision (mAP) is obtained by averaging the APs over all classes. This is the standard approach to evaluating performance on the object detection problem.
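To make the AP computation concrete, here is a minimal Python sketch that builds a precision-recall curve from scored detections and integrates precision over recall; the detection scores and match flags are hypothetical, and the all-point interpolation shown is a simplification of the exact VOC protocol.

import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    # AP from detection scores and match flags (IoU > 0.5 vs. ground truth).
    order = np.argsort(-np.asarray(scores))          # sort by descending score
    tp = np.asarray(is_true_positive, dtype=float)[order]
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Make precision monotonically decreasing before integrating.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    r = np.concatenate([[0.0], recall])
    return np.sum((r[1:] - r[:-1]) * precision)      # area under the PR curve

# Hypothetical detections for one class: 4 detections, 3 ground-truth boxes.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3)
print(ap)  # per-class AP; mAP is the mean of the APs over the 20 classes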
In summary, VOC07 and VOC12 provide good datasets for benchmarking different object detection algorithms and are widely used in this field. A detailed introduction to the datasets is provided in [26].
Chapter 3
Boosted Convolutional Neural
Network (BCNN) for Pedestrian
Detection
3.1 Introduction
With the emerging popularity of advanced driver assistance systems (ADAS) and autonomous cars, pedestrian detection has received a lot of attention [21,22,29,30,42,103,111,113,114] in the computer vision field. Pedestrian detection has been intensively studied in the last decade due to the availability of a large dataset [23]. In this work, we adopt a convolutional neural network (CNN) [35,36,41,79,81] approach to solve the pedestrian detection problem. Given a CNN with a fixed number of training samples, the detection performance reaches a plateau once the training iterations reach a certain level. Just as humans learn from mistakes, it is desirable for a CNN to analyze its own weaknesses and adjust its learning process for better performance without extra training samples. Along this line, we propose a methodology to identify challenging samples for a CNN and use them to refine the CNN model for performance improvement. This idea is similar to boosting [31,32] in the traditional machine learning literature. For this reason, the resulting solution is called the boosted convolutional neural network (BCNN). The boosting strategy can be applied to any CNN architecture. We use Fast-RCNN [35] as the baseline to demonstrate the boosting performance.
Before the emergence of CNN solutions, the pedestrian detection problem was primarily solved by methods based on the deformable part model (DPM) [29,30]. The DPM consists of a few part models and a global model that connects the part models with a latent support vector machine (SVM). The histogram of oriented gradients (HOG) technique [16] is often used in the DPM for feature extraction. Besides HOG, the ACF method [21] extracts channel features and adopts a boosting tree for pedestrian detection. Solutions based on the CNN were proposed to solve the pedestrian detection problem in recent years. They outperform traditional pedestrian detection methods by a significant margin and thus offer the state-of-the-art performance on this problem.
It is worthwhile to give a brief review of existing work on applying CNNs to pedestrian detection [13,65-67,82,92,93,109]. Yang et al. [109] explored certain CNN layer features and combined them with channel features to offer better performance using a boosting tree. Following the spirit of the DPM, Tian et al. [92] used a CNN to learn each of the body parts. Based on the Fast-RCNN [35] framework, Li et al. [58] achieved excellent performance by applying a two-branch CNN that handles large- and small-size pedestrians, respectively; a designed gate function allows joint training of both branches. Several other CNN papers [12,13,93,111] also demonstrated superior performance over traditional non-CNN ones. Since a CNN requires more training samples to achieve good performance, it is common to use every frame of the Caltech dataset, as in [92,93]. In this work, we identify a bottleneck of current CNN solutions and propose a boosting technique to enhance their detection performance.
Traditionally, all samples are treated equally in the training of a CNN. However, informative samples lie near the decision boundary, and they can be exploited with a boosting strategy to achieve better performance. It is well known that a boosting algorithm automatically adjusts the weights of samples to train multiple weak classifiers and then combines them into a stronger classifier. Although the high-level idea is easy to describe, we are not aware of any previous work that implements boosting in a CNN to yield a BCNN solution. To build the BCNN, we need to identify informative
Figure 3.1: Overview of the BCNN architecture. The left part shows a traditional CNN training process. By analyzing the training results, the BCNN system learns to identify challenging samples and adjusts the weight of each input sample. On the right side, the BCNN model is further trained with these weighted samples using the proposed weighted loss function to yield better performance. No extra labeled training samples are required.
samples for CNN fine-tuning and then incorporate boosting into the CNN training. The identification of informative samples is application dependent. We present a method to select informative samples in the context of pedestrian detection and propose a general framework to build the BCNN. These are our two major contributions. Experiments show that the proposed BCNN offers a significant gain over its baseline. It can be applied to any CNN solution without boosting to provide additional performance gain; thus, it is a truly powerful tool for applications beyond pedestrian detection.
The rest of this chapter is organized as follows. The BCNN methodology is described
in Sec. 3.2. Experimental results are shown in Sec. 3.3. Finally, concluding remarks are
given in Sec. 3.4.
3.2 Methodology
In this section, we first define a weighted loss function in Sec. 3.2.1 and explain the BCNN training procedure in Sec. 3.2.2. Then, we consider two types of informative samples and discuss their weighting schemes in Sec. 3.2.3. Finally, we discuss the design of a boosted fusion layer to fuse the two types of weighted samples in Sec. 3.2.4.
3.2.1 Weighted Loss Function
Back-propagation in CNN training is typically done in batch mode. To adjust the individual weight of each training sample, one can adaptively modify either the learning rate or the loss function. Here, we consider modifying the loss function by introducing a weighting term.
The loss function, $L(\theta)$, is usually defined as the average of individual losses in a training batch. For the BCNN, we define the following weighted loss function:

$$L(\theta) = \frac{\sum_i l(y_i, h(x_i, \theta))\, w_i}{\sum_i w_i}, \qquad (3.1)$$

where $x_i \in X_{train}$ denotes the $i$th image patch associated with an object proposal in the training set, $y_i$ is the label of $x_i$ (pedestrian or non-pedestrian), $\theta$ denotes all parameters in the CNN model, $h$ is the CNN prediction function, and $w_i$ is a weight that assigns importance to input sample $x_i$. If $w_i = 1$ for all $i$, all samples are equally weighted and the loss function in Eq. 3.1 reduces to the traditional loss function.
To simplify the implementation, we rewrite the weighted loss function of Eq. 3.1 in another form as

$$L(\theta) = \frac{\sum_i l(y_i, h(x_i, \theta))\, I_{w_i}(u)}{\sum_i I_{w_i}(u)}, \qquad (3.2)$$

where

$$I_{w_i}(u) = \begin{cases} 1, & \text{if } u(t_i) < w_i, \\ 0, & \text{if } u(t_i) \geq w_i, \end{cases} \qquad (3.3)$$

and where $t_i$ denotes the event of selecting $x_i$ as a training sample and $u(t_i)$ is a real-valued random variable uniformly distributed between 0 and 1.

By following Eq. 3.2, we decide whether to use $x_i$ as a training sample with probability $w_i$: the higher the weight $w_i$, the higher the probability. As a result, Eqs. 3.1 and 3.2 are equivalent when the number of samples is sufficiently large. We use Eq. 3.2 as the weighted loss function in the proposed BCNN. In this way, we do not have to alter other parts of the CNN training process, including back-propagation.
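A minimal sketch of this stochastic selection rule in PyTorch is shown below; the batch tensors and weight values are hypothetical, and the per-sample cross-entropy is a simplified stand-in for the multi-task Fast-RCNN loss.

import torch
import torch.nn.functional as F

def weighted_loss(logits, labels, weights):
    # Eqs. 3.2-3.3: keep sample i with probability w_i, then average the
    # losses of the kept samples only. Dropped samples contribute nothing,
    # so back-propagation and the rest of training remain unchanged.
    u = torch.rand_like(weights)        # u(t_i) ~ Uniform(0, 1)
    keep = (u < weights).float()        # indicator I_{w_i}(u) in Eq. 3.3
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (per_sample * keep).sum() / keep.sum().clamp(min=1.0)

# Hypothetical mini-batch: 4 proposals, 2 classes (pedestrian / background).
logits = torch.randn(4, 2)
labels = torch.tensor([1, 0, 1, 0])
weights = torch.tensor([0.9, 0.1, 0.5, 1.0])   # per-sample weights w_i
print(float(weighted_loss(logits, labels, weights)))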
3.2.2 Overview of BCNN Training
We provide an overview of the BCNN training in Fig. 3.1. As depicted in this figure, the training results obtained by the non-weighted CNN provide useful information for its re-training, namely, the challenging samples. By assigning these samples higher weights, we can train the CNN to be more sensitive to them.
The BCNN procedure consists of the following steps.
1. Train baseline detector $h_1$.
First, we train a baseline detector using all training samples with equal weight. Our baseline detector starts from a model pre-trained on ImageNet [18] and is fine-tuned with the Caltech pedestrian dataset. This is common practice in CNN training.
2. Collect prediction statistics of training samples based on $h_1$.
After training the baseline detector, we collect statistics of the training samples so as
Figure 3.2: The detection score $h_1(x)$ characterizes the input samples. When the scores are low, the samples are more challenging to the baseline detector; they typically suffer from low image quality and/or cluttered backgrounds. The lower the score, the more challenging the sample, and the more valuable the information it provides for detector retraining.
to find informative samples. We run detector $h_1$ on all training samples and collect their detection scores and bounding box regression results.
3. Train new detector $h_{LS}$ by emphasizing low score samples.
Samples with low detection scores are informative samples. We assign them a higher weight denoted by $w_{LS}$ and train a new detector $h_{LS}$ based on Eq. 3.2.
4. Train new detector $h_{TI}$ by emphasizing temporally inconsistent samples.
Samples whose detection scores are temporally inconsistent even though they are associated with one and the same object are also informative samples. We assign them a higher weight denoted by $w_{TI}$ and train a new detector $h_{TI}$ based on Eq. 3.2.
5. Construct a boosted fusion (BF) layer and obtain $h_{BF}$.
Both $h_{LS}$ and $h_{TI}$ are better detectors than the baseline detector $h_1$. We can further fuse them by introducing a boosted fusion (BF) layer and obtain a detector $h_{BF}$ that is better than all three.
3.2.3 Weighting of Informative Samples
We consider two types of informative samples and adopt different weighting schemes for them.
Samples with Low Detection Scores
The score distribution of $h_1(x_i)$ helps in spotting informative samples. We observe that, for a reasonably good detector, most scores for pedestrian boxes are close to 1. Thus, the challenging samples are those with smaller detection scores. Low detection scores are usually caused by low image quality. By selecting them as informative samples, we can train the CNN detector to detect low quality pedestrians more effectively. This idea is illustrated in Fig. 3.2.
We study the score distribution of training samples from the pedestrian class and then design a weighting function accordingly. To better understand the detection score distribution of the baseline detector, we show its histogram in Fig. 3.3(a). The score distribution of pedestrians is highly concentrated at 1, which is caused by the adoption of the softmax layer. We fit this fast-decaying score distribution with the beta probability density function, with fitted parameters 1.1091 and 0.1926.
Since informative samples have relatively low scores, we define the score-based weighting function as

$$w_{LS,i} = \exp(-p_i/\gamma), \qquad (3.4)$$

where $\gamma$ is a parameter used to adjust the weighting curve, and $p_i = \beta[h_1(x_i)]$ is the beta probability density function evaluated at the score of $x_i$. As shown in Fig. 3.3(b), the weight curve gives much lower weights to high score samples, and its exact shape can be controlled by the parameter $\gamma$. As a result, there is a higher probability that high score samples will be discarded in the training of the new detector $h_{LS}$.
Figure 3.3: (a) The histogram of the score distribution of $h_1(x_i)$ on the Caltech [23] training samples, and (b) the score-based weight function for the training samples, where the parameter $\gamma$ controls the exact shape of the curve. When $\gamma$ is smaller, high score samples are discounted more.
Samples with Temporal Inconsistency
We observe that a low score pedestrian bounding box is often accompanied by high score ones in its neighboring frames, as shown in Fig. 3.4. Temporal inconsistency is attributed to reasons other than low image quality, so this observation serves as another cue for spotting informative samples. By exploiting the relationship between temporally associated bounding boxes, we can develop another weighting scheme. Examples include challenges from the motion of the viewing camera, pedestrian movement, and interaction between a pedestrian and his/her environment (e.g., other pedestrians, occlusion by vehicles, etc.).
Two examples of finding temporally inconsistent samples are illustrated in Fig. 3.5. For the single pedestrian case, we first locate a low score bounding box denoted by $x_j^t$, where $j$ and $t$ are the spatial and temporal indices, respectively. We then attempt to find another bounding box that covers the same pedestrian yet has a high score in an adjacent frame. If it is found, we call them a temporally inconsistent pair denoted by $(x_j^t, x_k^{t \pm \Delta t})$, where the superscript time index indicates the frame number and $\Delta t$ denotes the distance between the selected frames.
Figure 3.4: Two illustrative temporal inconsistency examples in the Caltech training dataset using Fast-RCNN as the baseline detector. (a) A low score pedestrian bounding box appears in the third frame, while several bounding boxes with high detection scores appear in the neighboring frames and come from the same pedestrian. (b) A high score background bounding box also appears in the third frame and gives a false positive detection. However, this high score bounding box occurs only once when examining its temporally adjacent frames.
To check whether two bounding boxes cover the same pedestrian, we examine whether there is a trajectory formed by this pair with other high score bounding boxes along it. In our experiment, we use every 4th frame for training, so that $\Delta t = 4$ or $-4$. Besides, we only pay attention to pedestrian bounding boxes, namely, those boxes classified to the pedestrian class with an IoU greater than 0.5. We use $score_j^t$ and $location_j^t$ to denote the prediction score and the center location of the bounding box, respectively. For $(x_j^t, x_k^{t \pm \Delta t})$ to be a temporally inconsistent pair, they should satisfy the following three conditions:

$$D(location_j^t, location_k^{t \pm \Delta t}) < \epsilon, \qquad (3.5)$$

$$score_j^t < \tau, \qquad (3.6)$$

$$|score_k^{t \pm \Delta t} - score_j^t| > \delta. \qquad (3.7)$$

The second and third conditions are related to the prediction scores. Eq. 3.7 states that the pair should have scores that are sufficiently different. Eq. 3.6 indicates that one of the pair should have a small score. $D(x_1, x_2)$ in Eq. 3.5 is the 2D Euclidean distance between $x_1$ and $x_2$. The pair should satisfy this condition to ensure that they are the
Figure 3.5: (a) A temporally inconsistent pair is identified for the single pedestrian scenario, where $x_j^t$ has a low score but $x_k^{t \pm \Delta t}$ has a high score. Along the linear trajectory $\ell_{jk}$ extended from the pair, we can find several pedestrian boxes with small distances to the trajectory and hence high trajectory matching scores $S_{jk}$. (b) If we start with a wrong pair, where two pedestrians walk side-by-side, the trajectory matching score $S_{jk}$ will be low and this pair will be dropped from the temporally inconsistent sample set.
same pedestrian. However, we allow a margin, $\epsilon$, to accommodate the global camera movement and pedestrian movement.
If the bounding box pair truly covers the same pedestrian, we expect to find a series of high score pedestrian boxes along the trajectory extended from this pair, as shown in Fig. 3.5. This trajectory is denoted by $\ell_{jk}$, and it can be well approximated by a straight line since it should not change much within a couple of frames. Then, we define the trajectory matching score $S_{jk}$ as

$$S_{jk} = \sum_{t' = t-3\Delta t}^{t+3\Delta t} \sum_{i \in X} score_i \, \psi(D(\ell_{jk}, x_i)), \qquad (3.8)$$

where $\psi(d) = \max(0, 1 - d/\epsilon)$ is a function used to penalize a sample that has a large distance from the trajectory; the score of a candidate sample only counts when its distance is less than $\epsilon$. Then, we impose the fourth condition

$$S_{jk} > Th, \qquad (3.9)$$

where $Th$ is a pre-selected threshold for the trajectory matching score.
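A minimal Python sketch of the four conditions in Eqs. 3.5-3.9 is given below; the box records, the threshold values, and the point-to-line distance helper are illustrative assumptions rather than the exact implementation.

import numpy as np

EPS, TAU, DELTA, TH = 32.0, 0.95, 0.05, 2.5  # assumed parameter values

def point_line_dist(p, a, b):
    # Distance from point p to the line through a and b (trajectory l_jk).
    p, a, b = map(np.asarray, (p, a, b))
    n = (b - a) / np.linalg.norm(b - a)
    return np.linalg.norm((p - a) - np.dot(p - a, n) * n)

def is_temporal_inconsistent_pair(box_j, box_k, nearby_boxes):
    # box_* = (center_xy, score); nearby_boxes holds pedestrian boxes
    # within t - 3*dt .. t + 3*dt. True when Eqs. 3.5-3.9 all hold.
    loc_j, s_j = box_j
    loc_k, s_k = box_k
    if np.linalg.norm(np.subtract(loc_j, loc_k)) >= EPS:   # Eq. 3.5
        return False
    if s_j >= TAU:                                         # Eq. 3.6
        return False
    if abs(s_k - s_j) <= DELTA:                            # Eq. 3.7
        return False
    # Eq. 3.8: accumulate scores of boxes near the straight-line trajectory.
    psi = lambda d: max(0.0, 1.0 - d / EPS)
    S = sum(s * psi(point_line_dist(c, loc_j, loc_k)) for c, s in nearby_boxes)
    return S > TH                                          # Eq. 3.9

print(is_temporal_inconsistent_pair(
    ((100, 200), 0.20), ((110, 198), 0.98),
    [((105, 199), 0.90), ((103, 200), 0.95), ((108, 198), 0.97)]))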
If the above four conditions are all met, the pair $x(j,k,t,t\pm\Delta t) = (x_j^t, x_k^{t\pm\Delta t})$ is included in the set containing all temporally inconsistent pairs, denoted by $X_{TI}$. Note that the trajectory matching score does take human interaction into account. For example, the two black samples in Fig. 3.5(b) belong to two different pedestrians; their trajectory matching score $S_{jk}$ is not high enough for them to be included in the temporally inconsistent set.
Based on the above discussion, we define the weighting function for a temporally inconsistent pair $x(j,k,t,t\pm\Delta t)$ as

$$w_{TI}(x_{j,k,t,t\pm\Delta t}) = \begin{cases} 1, & \text{if } x(j,k,t,t\pm\Delta t) \in X_{TI}, \\ 0, & \text{otherwise.} \end{cases} \qquad (3.10)$$

The above-mentioned procedure is very effective in identifying temporally inconsistent pairs, which can be used to train a pedestrian detector that is more sensitive to these samples.
Our work differs from pedestrian detection papers [73,100] that used optical flow and other temporal features extracted from adjacent frames to achieve better performance, since we do not calculate any pixel-wise or block-wise temporal information. Our scheme also differs from the tracking-by-detection idea [3,4,49,117] that is popular in the tracking field, since no tracking is used in our proposed method; only a simple linear extrapolation is adopted.
3.2.4 CNN Boosting via Boosted Fusion (BF) Layer
Two weighting schemes were proposed in Sec. 3.2.3. Each can be used separately to refine the CNN model for better detection performance on its respective challenging samples. Intuitively, the two enhanced pedestrian detectors play complementary roles. Thus, it is desirable to design a boosted fusion (BF) layer that fuses their individual strengths into a stronger detector, which is essentially the boosting idea.
$E[W_{LS}] = 0.0912$
$E[W_{TI}] = 0.0052$
$Var[W_{LS}] = 0.0509$
$Var[W_{TI}] = 0.0052$
$Corr[W_{LS}, W_{TI}] = 0.1368$

Table 3.1: The sample means and variances of the random variables $W_{LS}$ and $W_{TI}$ and their correlation.
Weights $w_{LS}$ and $w_{TI}$ define two random variables denoted by $W_{LS}$ and $W_{TI}$, respectively. To better understand the differences between the two weighting schemes, the statistics of $W_{LS}$ and $W_{TI}$ on the Caltech training dataset [23] are provided in Table 3.1. The differences in their sample moments and their low correlation indicate that the weighted samples may lead to two detectors with different behaviors. This will be further verified in Sec. 3.3.
The boosted fusion framework is illustrated in Fig. 3.6, where $h_{TI}$ and $h_{LS}$ are two separate branches trained with their own weighting schemes. Both branches are given the same input image and object proposals. They also share the same pre-trained network weights from the baseline detector $h_1$ and the same Fast-RCNN structure [35]. Since their sample weighting schemes are different and they are trained separately, they end up with different CNN parameters.
Immediately after the individual fully-connected layers of both branches, we construct a boosted fusion (BF) layer that consists of two fully-connected layers. The BF layer accepts the output of both branches, and its function is to conduct a joint decision so as to obtain a final score for any given bounding box. To obtain the parameters of the new BF layer, the network is retrained using all training samples with equal weighting. In the BF training process, the parameters of all conv layers in both branches remain unchanged, so that only the parameters in the fully-connected layers, including the BF layers, are trained in this step. Finally, we obtain a new detector $h_{BF}$ based on $h_{TI}$ and $h_{LS}$. We will show in Sec. 3.3 that $h_{BF}$ outperforms both $h_{TI}$ and $h_{LS}$.
Figure 3.6: The BCNN structure, where the upper branch is the model $h_{LS}$ trained using the $W_{LS}$ weighting scheme and the lower branch is the model $h_{TI}$ trained using the $W_{TI}$ weighting scheme. They are given the same image and object proposals as input. The boosted fusion (BF) layer concatenates both networks after the fully-connected layers. The final detection scores and bounding box regression results are obtained from the BF layer.
The reason for adopting a boosted fusion layer is to ensure that the complementary features obtained by retraining on the two types of informative samples are fully utilized. Blending both weighting functions together at the beginning would average out the discriminative power of each weighting function. The proposed BF layer offers a better solution by combining the advantages of the two weighting schemes at the end of the network.
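As a rough sketch of such a fusion head, the following PyTorch module concatenates the fc7 outputs of the two branches and applies two fully-connected layers; the 64-unit hidden width and the 2-class/8-offset outputs follow Fig. 3.6, while the module and variable names are our own.

import torch
import torch.nn as nn

class BoostedFusion(nn.Module):
    # Fuse the fc7 outputs of the h_LS and h_TI branches with two
    # fully-connected layers, then emit a class score and a bounding box
    # offset per RoI. The conv layers of both branches stay frozen here.
    def __init__(self, feat_dim=4096, hidden=64, num_classes=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.cls_score = nn.Linear(hidden, num_classes)      # softmax logits
        self.bbox_pred = nn.Linear(hidden, 4 * num_classes)  # regression offsets

    def forward(self, feat_ls, feat_ti):
        z = self.fuse(torch.cat([feat_ls, feat_ti], dim=1))
        return self.cls_score(z), self.bbox_pred(z)

# Hypothetical fc7 features for 80 proposals from each branch.
bf = BoostedFusion()
scores, boxes = bf(torch.randn(80, 4096), torch.randn(80, 4096))
print(scores.shape, boxes.shape)  # torch.Size([80, 2]) torch.Size([80, 8])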
3.3 Results
The proposed three BCNN methods are evaluated on the Caltech pedestrian detection dataset [23], which is widely used for performance benchmarking of pedestrian detection methods. The Caltech pedestrian dataset consists of about 10 hours of video taken by a driving vehicle in an urban environment. Its frame rate is 30 fps and each frame has a resolution of 640x480. All pedestrians are annotated; there are around 350,000 bounding boxes covering 2,300 pedestrians.
3.3.1 Implementation Details
Our BCNN framework is built upon Fast-RCNN [35], where the input to the network is the whole image rather than a single region proposal. Compared to the traditional RCNN [36], Fast-RCNN greatly reduces the computational redundancy in the convolution layers. Region proposals are provided to the RoI-pooling layer to achieve the pooling effect for each region over the conv feature space. Also, instead of using generic object proposals [1,94,118], we use the LDCF method [64] as the proposal generator, which provides high quality proposals for the pedestrian problem. We also remove the pool4 layer in BCNN for better performance, as described in [58]. In the training of $h_1$, $h_{LS}$ and $h_{TI}$, we keep the parameters of the first four convolution layers unchanged during fine-tuning. We train the network using stochastic gradient descent (SGD). The softmax loss is used to train the class score while the smooth L1 loss is adopted for bounding box prediction.
Our baseline detector $h_1$ is initialized with the VGG-16 model [87] pre-trained on the ImageNet dataset [18]. The pooling size of the RoI-pooling layer is set to 7x7. In SGD, we use a base learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0005. Each mini-batch contains one image from the Caltech training dataset, which is sampled once every 4 frames and resized to 800 pixels in width while keeping the aspect ratio. Bounding boxes with a score higher than -15 from the LDCF detector [64] serve as our region proposals, and 80 region proposals are selected for each mini-batch. Among these 80 proposals, at most 20 have an intersection-over-union (IoU) ratio over 0.5. The other proposals
Figure 3.7: Performance comparison of Fast-RCNN, which serves as the baseline method, and BCNN-TI, BCNN-LS and BCNN-BF, using miss rate versus false positives per image (FPPI) curves.
have lower IoU values. For the two detectors $h_{LS}$ and $h_{TI}$, we lower the base learning rate to 0.0001 for fine-tuning from the baseline detector. The learning rate of $h_{BF}$ is set to 0.001. We train our networks on a single Nvidia K40 GPU with the Caffe software [47]. It takes about two days to train $h_1$ and an additional two days for all of $h_{LS}$, $h_{TI}$, and $h_{BF}$. The parameters used are $\gamma = 0.5$, $\epsilon = 32$, $\tau = 0.95$, $\delta = 0.05$, and $Th = 2.5$.
3.3.2 Performance Evaluation
In this section, we use BCNN-LS, BCNN-TI and BCNN-BF to denote the results from $h_{LS}$, $h_{TI}$ and $h_{BF}$, respectively. Since the three BCNN detectors are used to boost
the performance of the Fast-RCNN, we compare the performance curves of these four
methods in Fig. 3.7. All three BCNN methods outperform the Fast-RCNN. Among
Figure 3.8: Performance benchmarking of top performers plus VJ [98] and HOG [16] on the Caltech pedestrian detection dataset. BCNN-BF ranks third with a log-average miss rate of 11.4%, while RPN+BF ranks first.
Model       Iteration nos.
Fast-RCNN   80,000
BCNN-LS     22,000
BCNN-TI     24,000
BCNN-BF     20,000

Table 3.2: Convergence speed in the BCNN training process.
the three BCNN methods, BCNN-BF gives the best result, BCNN-LS the second best and BCNN-TI the third. The log-average miss rates of Fast-RCNN, BCNN-TI, BCNN-LS and BCNN-BF are 13.80%, 13.25%, 11.86% and 11.40%, respectively. Thus, we observe boosting performance gains in the range of 0.55-2.40%, which is substantial.
The proposed BCNN methods, with Fast-RCNN as the baseline, offer highly competitive performance among all top performers
Figure 3.9: Visual examples demonstrating (a) the performance gain of BCNN-TI over the Fast-RCNN baseline and (b) the performance gain of BCNN-LS over the Fast-RCNN baseline.
[6,12,13,36,58,64,68,92,93,109,111,115] on the Caltech website as of this writing. The benchmarking results are shown in Fig. 3.8. As shown in the figure, RPN+BF has the state-of-the-art performance of 9.6%. BCNN-BF ranks third and outperforms several recent detectors such as DeepParts [92] and CompACT [13]. In principle, the boosting idea can be applied to any other pedestrian detection algorithm as long as region proposals are used.
3.3.3 Complexity Analysis
In BCNN, we need to train several additional networks using the weighted loss function. Table 3.2 shows the number of iterations required for each training stage to converge. The training of any BCNN detector is much faster than that of Fast-RCNN; in fact, the total training complexity of BCNN-BF is less than twice the training cost of Fast-RCNN.
Figure 3.10: Visual performance comparison of the three BCNN detectors: (a) an example showing the performance gain of the BF layer when $h_{TI}$ outperforms $h_{LS}$, and (b) another example showing the performance gain of the BF layer when $h_{LS}$ outperforms $h_{TI}$.
3.3.4 Visualization of Performance Gain.
To provide insight into the performance gains achieved by the BCNN, some detected bounding boxes are shown in Fig. 3.9. First, we compare the Fast-RCNN method with the proposed BCNN-LS and BCNN-TI detectors in Figs. 3.9(a) and (b), respectively. As shown in Fig. 3.9(a), BCNN-LS can detect pedestrians of lower image quality. As for the BCNN-TI detector, it makes better decisions on certain challenging samples whose scores are not temporally consistent, as shown in Fig. 3.9(b). In Fig. 3.10, we show cases where BCNN-TI or BCNN-LS performs better than the other (in (a) and (b), respectively) and BCNN-BF benefits from both by fusing their results.
3.4 Conclusion
In this work, a boosting framework was proposed to retrain a baseline CNN using weighted training samples. It was implemented by introducing a novel weighted loss function into the CNN training. The weighting function emphasizes the importance of challenging (or informative) samples so that the fine-tuned CNN becomes more sensitive to their differences. For the pedestrian detection problem, we proposed two types of informative samples: those with low detection scores and temporally inconsistent bounding box pairs. We showed how to train two BCNN systems (namely, BCNN-LS and BCNN-TI) to handle them effectively. Finally, we added a boosted fusion layer to combine BCNN-LS and BCNN-TI into one detector, BCNN-BF. It was demonstrated that BCNN-LS, BCNN-TI and BCNN-BF achieve substantial performance gains over Fast-RCNN, which serves as the baseline for performance boosting. BCNN-BF ranks third on the Caltech pedestrian dataset. We plan to apply the boosting framework to CNN systems for other applications in the near future.
Chapter 4
Pedestrian Detection via Domain
Adaptation
4.1 Introduction
Nowadays, most pedestrian detection methods focus on fully-supervised learning schemes where the detectors are trained on publicly available datasets with human labels. However, since human labeling is time-consuming and labor-intensive, the amount of labeled data is much smaller than that of unlabeled data in real world applications. To narrow the gap, it is important to develop weakly-supervised or unsupervised learning techniques that take advantage of the huge amount of unlabeled data for performance enhancement. This problem is called domain adaptation. In this chapter, we develop a domain adaptation technique for pedestrian detection.
Often, a detector trained on one pedestrian dataset works poorly on another dataset without retraining on labels from the new dataset. Fig. 4.1 shows two examples of applying the developed BCNN model to images from a dataset [106] containing both surveillance video and movies. Although BCNN achieves an excellent miss rate of around 11% on the Caltech dataset, it fails to provide reasonable performance on the pedestrian dataset in [106], which targets different applications. To address this problem, domain adaptation techniques are needed to adapt a detector from the source domain (e.g., the Caltech dataset in our current context) to the target domain (e.g., the surveillance and movie dataset) with few extra labeled data in the target domain.
The above-mentioned problem is caused by differences between the source data distribution and the target data distribution, which is called dataset bias. Examples to
Figure 4.1: Two illustrative examples showing the need for domain adaptation; the two images are taken from the dataset in [106], which contains samples from both surveillance video (left) and movies (right). When we apply the pedestrian detector trained on the Caltech dataset directly to these two types of images, the detection results are very poor.
illustrate the dataset bias problem are given in Fig. 4.2. It is obvious that each pedestrian dataset has its own characteristics and looks significantly different from the others. The bias may come from differences in camera viewing angles, illumination, image resolution, image distortion, background, and even pedestrian appearance. Due to limited resources, it is unavoidable that a dataset can only target certain applications when it is built, while images and videos must be collected within a restricted time and area. As a result, no single dataset can cover all applications and situations, and the domain adaptation technique becomes essential since it compensates for the insufficient labeling of such diverse data.
Through domain adaptation, we attempt to use the knowledge learned from the source domain to train a new model in the target domain. This is often achieved by estimating labels for unlabeled data and using the more confident samples to train the new model [101,102]. Domain adaptation is a challenging task that demands confidence prediction and the design of a model that exploits uncertain input labels in an effective manner.
In this work, we propose a method called "clustered deep representation adaptation" (CDRA) to solve this problem. The CDRA method uses the convolutional neural
Figure 4.2: Illustration of the dataset bias problem, taken from [44]. It shows images randomly selected from four different pedestrian datasets: the INRIA dataset (generic) [16], the Caltech dataset (automotive application) [23], the CUHK square dataset (surveillance application) [101] and the PETS 2009 dataset (surveillance application) [24].
network (CNN) to achieve domain adaptation. It first extracts the deep representations of a labeled subset using an initial detector. Then, the representations are clustered for later use in label prediction. A purity measurement is proposed to find confident samples. Through weighted re-training of the CNN, we fine-tune the deep representation to adapt to the target domain. The CDRA method is tested on the large-scale CUHK-SYSU dataset [106], where a significant performance gain over the baseline is observed.
This work has several major contributions. First, while deep neural networks offer good results in many computer vision problems, they are still not mature in the area of domain adaptation; we develop a domain adaptation solution that works well for the pedestrian detection problem. Second, the methodology of using clustered deep representations and weighted re-training is novel. It is actually a general technique whose applications are not restricted to pedestrian detection. Third, while most domain adaptation methods were tested on smaller datasets, the proposed CDRA method is tested on a large-scale dataset for performance evaluation.
The rest of this chapter is organized as follows. Related previous work is reviewed in Sec. 4.2. The proposed clustered deep representation adaptation (CDRA) method is described in Sec. 4.3. Implementation details and a discussion of the CDRA method are provided in Sec. 4.4. Experimental results are shown in Sec. 4.5. Finally, concluding remarks are given in Sec. 4.6.
4.2 Related Previous Work
4.2.1 Transfer Learning and Domain Adaptation
Figure 4.3: The concept of transfer learning.
Domain adaptation is closely related to the concept of transfer learning. Fig. 4.3 [44] explains the idea of transfer learning. Consider two tasks, Task A and Task B, that are related to each other. The knowledge learned from Task A should benefit the model learning for Task B. Transfer learning is a technique that transfers knowledge from Task A to Task B, and it is needed when the amount of training data for Task B is small. For a more detailed discussion, we refer to [44]. Domain adaptation is a special
case of transfer learning. It arises when Task A and Task B are the same task yet have different domains. For example, both Task A and Task B may target pedestrian detection but under different applications, say, the advanced driver assistance system (ADAS) application and the video surveillance application.
Although transfer learning [7,69,77] and domain adaptation [17,38,74] have been developed for quite some time, they have attracted a lot of attention in recent years. The surge of research on these topics is due to the rapid development of CNN technology and the rapid growth of image/video data. On one hand, the CNN technology provides a powerful tool for big data analytics with labeled data. On the other hand, it is difficult to provide adequate labels for the large amount of collected data, so the importance of transfer learning and domain adaptation is higher than ever.
4.2.2 Domain Adaptation for Pedestrian Detection
Domain adaptation for pedestrian detection has drawn some attention in recent years; the development in this field was reviewed in [44]. The earliest work on domain adaptation for object detection can be traced back to 2004 [10]. The number of publications on domain adaptation for pedestrian or object detection [43,45,46,62,63,82,83,85,91,101,102,116] increased rapidly from 2012 to 2015.
Most previous work adopted the INRIA dataset [16] as the source domain dataset. The INRIA dataset is a generic pedestrian dataset, where each pedestrian image contains a large person right in the middle of the image. The target domain datasets considered were mostly the MIT traffic dataset [102] and the CUHK square dataset [101], both of which are surveillance datasets. The goal was to adapt a generic detector into an application-oriented detector.
Although each paper has its own model design, feature selection, etc., most of them share a similar high-level framework. Here, we review the work in [102] to provide a high-level description of this framework. The flow chart of the domain adaptation method proposed in [102] is shown in Fig. 4.4. First, samples in training
frames of the target domain are extracted together with their associated detection scores. Since the initial detector is weak, these detection scores are not reliable. Multiple cues are used to help decide confident positive and negative samples for model training; these cues include motion, size, and context information such as vehicle trajectories. This process is repeated iteratively to increase the performance of the scene-specific detector.
Figure 4.4: The flow chart of a domain adaptation procedure for pedestrian detection [102].
It is common practice to estimate positive or negative labels for the training set using detection scores as well as other cues in the target domain, and to use the confident ones to fine-tune the model. Sometimes, models need to be specifically designed to give confident samples higher weights. However, the performance achieved by this methodology is still limited, even when tested on small datasets consisting of only hundreds of training images. Furthermore, deep learning had not yet been proven to work for domain adaptation. As stated in [44], "it is unclear how well deep learning really performs for use in domain adaptation. It is an open and interesting research question that could be addressed in the future." This inspired us to look for a novel solution to this problem.
4.2.3 Domain Adaptation for Other Problems
Figure 4.5: A network structure that uses an auto-encoder for the image classification task with domain adaptation [34].
Besides object/pedestrian detection, domain adaptation has been applied to applications such as image classification, where better domain adaptation performance has been achieved with deep learning networks. This advancement is built upon the auto-encoder [19,56], which encodes images and then reconstructs them automatically from the encoded representation. The auto-encoder can learn the representation of a dataset without knowing the labels. One exemplary auto-encoder network is shown in Fig. 4.5. It consists of two branches: the upper branch is used for the image classification task in the source domain, while the lower branch is designed for image reconstruction in the target domain. The two branches are trained simultaneously, and the learned representations in the shared layers can then be used for image classification in the target domain.
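As a rough illustration of this two-branch idea, here is a minimal PyTorch sketch of a shared encoder with a source-domain classification head and a target-domain reconstruction decoder; the layer sizes, names, and joint loss weighting are illustrative assumptions, not the architecture of [34].

import torch
import torch.nn as nn

class TwoBranchDA(nn.Module):
    # Shared encoder; a classification head for labeled source images and
    # a reconstruction decoder for unlabeled target images.
    def __init__(self, dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)   # source branch
        self.decoder = nn.Linear(hidden, dim)              # target branch

    def forward(self, x_src, x_tgt):
        return (self.classifier(self.encoder(x_src)),
                self.decoder(self.encoder(x_tgt)))

model = TwoBranchDA()
x_src, y_src = torch.randn(8, 784), torch.randint(0, 10, (8,))
x_tgt = torch.randn(8, 784)
logits, recon = model(x_src, x_tgt)
# Both branches are trained simultaneously through the shared encoder.
loss = nn.CrossEntropyLoss()(logits, y_src) + nn.MSELoss()(recon, x_tgt)
loss.backward()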
More domain adaptation papers have been published in recent years. Some are fully unsupervised [2,5,63] while a few are semi-supervised [14], where a portion of the dataset is labeled. Although domain adaptation technology has progressed in some applications, its overall performance is still not good enough, and it has not yet been tested on large-scale datasets. Thus, it remains one of the open problems in computer vision.
Domain adaptation in object/pedestrian detection remains a challenging problem due to its unique characteristics. For example, most cropped samples in pedestrian detection are too small and of too low image quality to be suitable for image reconstruction. This is especially true when the deep network down-sizes input images significantly after several pooling layers. In addition, the boundary between positive and negative samples is vague; sometimes even humans cannot easily distinguish confusing samples. Thus, it is unclear whether the pedestrian detection problem can benefit from domain adaptation.
4.3 Clustered Deep Representation Adaptation (CDRA)
Method
Different from the approaches introduced in Sec. 4.2, the proposed CDRA method is based on the convolutional neural network (CNN). Given a CNN detector trained on the source dataset, the responses extracted from the network, also known as deep features or deep representations, carry high-dimensional information with source domain knowledge that is powerful enough to help achieve our goal. We found that deep representations alone can already provide a reasonable performance gain without involving additional cues such as motion or context. In fact, we intentionally avoid such ad-hoc cues in this work, because they require consecutive frames as well as good performance on other problems such as tracking and context extraction.
In this work, we adopt a semi-supervised scheme. In contrast to the totally unsupervised scenario, we assume a very small amount of the data is labeled, while most other samples remain unlabeled. This scheme is also used in the literature, e.g., [14].
We first fine-tune the detector with the labeled subset. Then we cluster the samples in the deep feature space to form groups, which help predict labels for the unlabeled data with higher confidence using the proposed purity measurement approach. A weighted CNN re-training method is then introduced to learn a model that balances between given and predicted labels. The proposed CDRA method is validated on a large-scale dataset.
4.3.1 Definition and Notations
To better introduce the proposed method, the definition of the problem is given in Fig. 4.6 and the detailed notations are provided in Table 4.1.
Figure 4.6: The definition of the semi-supervised domain adaptation problem.
Fig. 4.6 illustrates the scenario of the semi-supervised domain adaptation problem. Given a source dataset with training data $(X_s, Y_s)$, representing the images and their labels respectively, we are able to train a CNN model denoted by $\phi_{w_s, b_s}$. This model works well on the source domain testing set, as demonstrated in Sec. 3, but fails on the testing set of the target domain. In the target domain, we have training data denoted by $X_t$, which is mostly unlabeled. With the assistance of the originally pre-trained model $\phi_{w_s, b_s}$, we aim to obtain an updated model $\phi_{w_t, b_t}$ that provides better performance on the target domain testing set. Since this is a semi-supervised scenario, a very small portion of the target domain training data is labeled, represented as $(X_{tl}, Y_{tl})$; the rest of the unlabeled data in the target domain is represented as $X_{tu}$. In CDRA, we estimate the labels of the unlabeled subset and generate a confident subset, represented as $(X_{est}, Y_{est})$, to train $\phi_{w_t, b_t}$.
Table 4.1: Notations of the proposed domain adaptation approach

Generic
  Dataset                                        $X$
  Label                                          $Y$
  Number of samples                              $N$
  CNN coefficients (weights and biases)          $w$, $b$
  CNN detector                                   $\phi_{w,b}(X)$
  CNN loss function                              $L(\phi_{w,b}(X), Y)$
  CNN representations (features)                 $rep(\phi_{w,b}(X), layer)$

Source domain
  Data, label, and number of samples             $(X_s, Y_s, N_s)$
  Pre-trained CNN model from the source domain   $(w_s, b_s)$

Target domain
  Data and number of samples                     $(X_t, N_t)$
  Labeled subset                                 $(X_{tl}, Y_{tl}, N_{tl})$
  Unlabeled subset                               $(X_{tu}, N_{tu})$
  Confident subset with estimated labels         $(X_{est}, Y_{est}, N_{est})$
4.3.2 Algorithm Overview
The proposed CDRA algorithm consists of several steps; Alg. 1 gives an overview. In the first step, we use the labeled subset in the target domain to conduct an initial training and generate a detector $(w_{t0}, b_{t0})$. This is achieved by fine-tuning the CNN model on top of the pre-trained model $(w_s, b_s)$. The resulting initial detector is more tailored to the target domain but is still considered a weak detector because only a small amount of labeled data is used. Steps 2-4 are designed to find confident samples from the unlabeled subset. This is achieved by first conducting unsupervised clustering in the fc7 deep feature space in Step 2, followed by a purity measuring process in Step 3. With the confidence regions formed in Step 3, some of the unlabeled data are labeled in Step 4 to form the confident subset $(X_{est}, Y_{est})$. Finally, a weighted re-training of the CNN is
Algorithm 1 Clustered Deep Representation Adaptation (CDRA) Algorithm
 1: Step-1 Train an initial detector with labeled data in the target domain (Sec. 4.3.3)
 2:   (w_t0, b_t0) = argmin_{w,b} L(phi_{w_s,b_s}(X_tl), Y_tl)
 3:
 4: Step-2 Unsupervised clustering on deep representations (Sec. 4.3.4)
 5:   Feature = rep(phi_{w_t0,b_t0}(X_tl), fc7)
 6:   Centroid, Cluster = kmeans(Feature, N_k)
 7:
 8: Step-3 Purity measurement (Sec. 4.3.5)
 9:   for k = 1 : N_k do
10:     Distance_k = max_distance(Cluster_k, Centroid_k)
11:     Purity_k, Label_k = purity_measure(Centroid_k, Cluster_k, Distance_k)
12:     while Purity_k < Thrd_P do
13:       Distance_k = beta * Distance_k
14:       Purity_k, Label_k = purity_measure(Centroid_k, Cluster_k, Distance_k)
15:     end while
16:   end for
17:
18: Step-4 Estimate the labels for unlabeled data (Sec. 4.3.5)
19:   for i = 1 : N_tu do
20:     for k = 1 : N_k do
21:       if ||Feature_i - Centroid_k|| < Distance_k then
22:         add (x_i, Label_k) into (X_est, Y_est)
23:         break
24:       end if
25:     end for
26:   end for
27:
28: Step-5 Weighted re-training (Sec. 4.3.6)
29:   (X_w, Y_w) = compose(lambda, X_tl, Y_tl; 1 - lambda, X_est, Y_est)
30:   (w_t, b_t) = argmin_{w,b} L(phi_{w_t0,b_t0}(X_w), Y_w)
conducted to adapt the CNN detector using both the labeled and the confident subsets. Each step is elaborated in the following sections.
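As a concrete companion to Alg. 1, the following is a minimal Python sketch of Steps 2-4 built on scikit-learn's k-means; the purity threshold, shrink factor, cluster count, and random feature matrices are illustrative assumptions rather than the settings used in our experiments.

import numpy as np
from sklearn.cluster import KMeans

THRD_P, SHRINK, N_K = 0.9, 0.9, 8  # assumed threshold, shrink factor, #clusters

def purity_and_label(labels_in_ball):
    # Majority label (0/1) and its ratio among samples inside the boundary.
    pos = labels_in_ball.sum()
    neg = len(labels_in_ball) - pos
    return max(pos, neg) / len(labels_in_ball), int(pos >= neg)

def cdra_confident_set(feat_l, y_l, feat_u):
    # Steps 2-4 of Alg. 1 on fc7 features of labeled / unlabeled samples.
    km = KMeans(n_clusters=N_K, n_init=10).fit(feat_l)
    est_idx, est_lbl = [], []
    for k in range(N_K):
        members = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(feat_l[members] - km.cluster_centers_[k], axis=1)
        radius = d.max()
        purity, label = purity_and_label(y_l[members])
        # Step-3: shrink the boundary until the samples inside are pure enough.
        while purity < THRD_P:
            radius *= SHRINK
            inside = y_l[members[d < radius]]
            if len(inside) == 0:
                break
            purity, label = purity_and_label(inside)
        # Step-4: unlabeled samples inside the boundary inherit the cluster label.
        du = np.linalg.norm(feat_u - km.cluster_centers_[k], axis=1)
        for i in np.where(du < radius)[0]:
            est_idx.append(i)
            est_lbl.append(label)
    return np.array(est_idx), np.array(est_lbl)

# Hypothetical 4096-d fc7 features: 200 labeled and 500 unlabeled samples.
rng = np.random.default_rng(0)
idx, est = cdra_confident_set(rng.normal(size=(200, 4096)),
                              rng.integers(0, 2, 200),
                              rng.normal(size=(500, 4096)))
print(len(idx), "confident samples")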
4.3.3 Initial Training with Labeled Subset
The idea of domain adaptation is to employ the knowledge learned from the source domain to benefit the task in the target domain. Although the detector trained in the source domain, $\phi_{w_s,b_s}$, performs poorly on the target domain, the CNN carries abundant useful information in its coefficients, $w_s$ and $b_s$. The most straightforward way to use these coefficients is to treat them as a pre-trained model and fine-tune the network using labeled data in the target domain. The fine-tuning process preserves knowledge from the source domain while improving prediction in the target domain.
The main reason for the initial training is that the CNN needs to make at least reasonably good decisions so that the deep representations become useful for finding the confident set. Since we do not use ad-hoc cues such as motion or context, only a very small number of labeled samples is needed to conduct the initial training. Because the sample amount is small, the learning rate is lowered to avoid overfitting. The fine-tuned network $\phi_{w_{t0},b_{t0}}$ is more tailored to the target domain and thus provides better representations for the subsequent steps.
4.3.4 Clustering on Deep-Representations
Given the initial detector $\phi_{w_{t0},b_{t0}}$, our goal is to predict labels for the unlabeled subset. However, the detection score is not reliable since this is still a weak detector, and we do not want to use other cues such as motion or context. Thus, we focus on the deep representations, or the deep feature space.
Before introducing the clustering approach, we step back to explain the relationship between the detection score and the deep representations. A detection score can be seen as the projection of a sample onto an anchor vector in the deep feature space, as described in [55]. In the pedestrian detection problem, there are only two anchor vectors (positive and negative), and every sample's score is determined by how close it is to the anchor points. However, the deep feature space (the fc7 layer in the VGG16 [87] network) has a dimension of 4096. The samples are scattered in this high-dimensional space with high variation, and they sometimes exhibit a clustering effect, as illustrated in Fig. 4.7. Understanding how the samples are distributed in this space is more informative than the detection scores, which are simply a projection onto a 1-dimensional space.
Figure 4.7: Illustration of the sample clustering in the deep feature space.
Therefore, in our method we first conduct unsupervised clustering of the labeled samples in the fc7 feature space formed by the initial detector $\phi_{w_{t0},b_{t0}}(X_{tl})$. As shown in Fig. 4.7, each cluster may or may not contain similar image patterns. In principle, if a cluster contains many similar samples with the same label, it is considered a confident cluster: whenever an unlabeled sample falls into its region, the sample is likely to carry that label. On the other hand, there are confusing clusters, whose labels we are less confident about. In this work, we use k-means for unsupervised clustering, which yields the centroid of each cluster, $Centroid_k$, as well as the set of samples belonging to that cluster, denoted by $Cluster_k$.
4.3.5 Confident Sample Selection With Clustered Deep Representation
After the clusters are formed, we want to know which clusters are confident clusters and which are confusing clusters. This information helps determine the confidence and label of an unlabeled sample.
The proposed method, called purity measurement, is illustrated in Fig. 4.8. Its main purpose is to determine a boundary for each cluster, which can then be used to identify confident samples and their labels. If an unlabeled sample lies inside the boundary of
Figure 4.8: Illustration of the purity measurement approach.
a cluster, it is considered confident, with the corresponding label associated with that cluster.
To determine the boundary, we start from a large circle around the centroid and keep shrinking the circle by a factor $\beta$ until the samples inside the boundary are pure enough (i.e., until $Purity_k$ exceeds a threshold). $Label_k$ is determined by the majority of the samples inside the boundary, and $Purity_k$ is the ratio of the majority samples within the circle, i.e., $\#Majority / (\#Majority + \#Minority)$. If the distance from the centroid to the final boundary is very small, the cluster is likely to be a confusing cluster; that is, unlabeled samples around its centroid cannot easily be trusted unless they are very close to it.
After the purity of each cluster is measured, predicting the labels of unlabeled samples is straightforward. We simply check whether each unlabeled sample lies inside the boundary of any cluster. If it lies inside the boundary of cluster $k$, we add the pair $(x_i, Label_k)$ to the confident set $(X_{est}, Y_{est})$ for use in the training step.
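The purity measurement and confident-sample prediction can be sketched as follows. The names are illustrative rather than the actual implementation: purity_thr corresponds to $Thrd_P = 0.98$ and shrink to the 0.95 shrinking factor of Sec. 4.5.2, labels are assumed to be integers (0 for negative, 1 for pedestrian), and distances are Euclidean.

import numpy as np

def purity_boundary(centroid, member_feats, member_labels,
                    purity_thr=0.98, shrink=0.95):
    # Start from the largest circle around the centroid and keep
    # shrinking it until the enclosed samples are pure enough.
    dists = np.linalg.norm(member_feats - centroid, axis=1)
    radius = dists.max()
    min_radius = 1e-6 * max(radius, 1.0)
    while radius > min_radius:
        inside = member_labels[dists <= radius]
        if inside.size == 0:
            break
        counts = np.bincount(inside, minlength=2)
        label = int(counts.argmax())
        purity = counts[label] / counts.sum()   # #Maj / (#Maj + #Min)
        if purity >= purity_thr:
            return radius, label
        radius *= shrink                        # keep shrinking the boundary
    return 0.0, None                            # confusing cluster: no pure boundary

def predict_confident(x, centroids, radii, labels):
    # An unlabeled feature vector x inherits the label of the first
    # cluster whose boundary it falls inside; otherwise None.
    for c, r, lab in zip(centroids, radii, labels):
        if lab is not None and np.linalg.norm(x - c) <= r:
            return lab
    return None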
4.3.6 Weighted Re-training with Confident Samples
Although we have collected a confident set $(X_{est}, Y_{est})$, it is not 100% accurate. We will later verify that fine-tuning the model with this set alone does not guarantee good performance. We therefore design a weighted re-training scheme that balances the labeled data against the estimated data; mixing both in the training process yields better performance.
Similar to Sec. 4.3.3, we fine-tune the CNN model from a pre-trained model; here the pre-trained model is the initial detector $(w_{t0}, b_{t0})$. Traditionally, fine-tuning is achieved by minimizing a loss function $L(\phi_{w,b}(X), Y)$ using back-propagation. To balance the labeled samples against the estimated samples, we formulate our loss function as

$\alpha L(\phi_{w_{t0},b_{t0}}(X_{tl}), Y_{tl}) + (1 - \alpha) L(\phi_{w_{t0},b_{t0}}(X_{est}), Y_{est})$,   (4.1)

where $\alpha \in [0, 1]$ is a weighting factor that controls which set contributes more to the CNN training. From the implementation point of view, this is the same as composing each batch of ($\alpha \cdot$ batch size) labeled samples and ($(1 - \alpha) \cdot$ batch size) estimated samples in random order. This process is the function "compose()" in step 5 of Alg. 1, and it generates a new training set $(X_w, Y_w)$ used for training the new model $(w_t, b_t)$.
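For illustration, Eq. 4.1 can also be realized directly as a weighted sum of two loss terms. The following is a minimal PyTorch-style sketch, not our actual implementation (which composes batches instead, an equivalent view in expectation); model, the input tensors, and alpha = 0.2 (the value of Sec. 4.5.2) are assumed.

import torch.nn.functional as F

def weighted_loss(model, x_lab, y_lab, x_est, y_est, alpha=0.2):
    # Eq. (4.1): alpha * L(labeled) + (1 - alpha) * L(estimated)
    loss_lab = F.cross_entropy(model(x_lab), y_lab)
    loss_est = F.cross_entropy(model(x_est), y_est)
    return alpha * loss_lab + (1.0 - alpha) * loss_est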
4.4 Implementation Details and Discussion
In the design and implementation of the proposed CDRA method, several details and observations are worth mentioning; we present them in this section.
4.4.1 Network Architecture and Implementation
As in Chapter 3, we use Fast-RCNN [35] as our baseline CNN detector in this chapter, with VGG16 [87] as the underlying network. However, a few changes have to be made for domain adaptation.
We start by introducing the differences between three classic network implementations: RCNN [36], Fast-RCNN [35], and Faster-RCNN [79]. Among the three, RCNN was published first; its input to the network consists of image crops produced by object proposals. Fast-RCNN was later introduced to accelerate the process by feeding whole images through the conv layers and cropping the resulting feature maps with object proposals right after the conv layers. Faster-RCNN, in turn, does not need external object proposals, because the proposals are generated inside the network by the RPN. A detailed comparison of the three networks is summarized in Table 4.2.
Table 4.2: Comparison between RCNN, Fast-RCNN and Faster-RCNN

|                  | RCNN | Fast-RCNN | Faster-RCNN |
|------------------|------|-----------|-------------|
| Description | Each proposal is a unit passed through the CNN | Each image is a unit passed through the conv layers; proposals are applied after the conv layers | Object proposals are generated inside the network (RPN) |
| Object proposal | Given as input | Given as input | Automatically generated |
| Batch size | Fixed number of proposals | Fixed number of images | Fixed number of images |
| Performance (mAP on VOC2007) | 66.0 (good) | 66.9 (better) | 66.9 (better) |
| Speed (test time) | 1x (normal) | 25x (fast) | 250x (fastest) |
Note that in the Fast-RCNN and Faster-RCNN implementations, one batch corresponds to all proposals in one or more images. Even though Fast-RCNN and Faster-RCNN provide better performance, this property is not suitable for domain adaptation. The main reason is that CDRA needs to select confident samples, and the confident samples do not necessarily spread evenly across images. That is, some batches may contain far fewer samples than others, which causes the network performance to drop significantly. The RCNN implementation is therefore preferable for domain adaptation with CDRA, since its batch is simply a fixed number of samples.
Figure 4.9: The network architecture of the CDRA method. A batch controller is imple-
mented to mimic the RCNN training process.
To benefit from the advantages of Fast-RCNN without suffering from the abovementioned problem, we modify the network implementation as illustrated in Fig. 4.9. We still adopt the Fast-RCNN implementation, but when fine-tuning the network in CDRA we add a batch controller module that mimics the RCNN training process by collecting a fixed number of samples before declaring a batch. The batch controller is applied in the middle of the network: the coefficients of the layers before it are frozen during fine-tuning, and only the layers beyond it are updated.
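A minimal sketch of the batch-controller idea follows; the structure and names are illustrative rather than the exact implementation. It buffers the per-proposal features produced by the frozen lower layers until a fixed count is reached, then releases them as one RCNN-style batch for the trainable upper layers.

import torch

class BatchController:
    # Buffer per-proposal features from the frozen lower layers and
    # release them in fixed-size batches, mimicking RCNN training.
    def __init__(self, batch_size=128):
        self.batch_size = batch_size
        self.feats, self.labels = [], []

    def add(self, roi_feats, roi_labels):
        # roi_feats: (num_proposals_in_image, feat_dim)
        self.feats.append(roi_feats)
        self.labels.append(roi_labels)

    def next_batch(self):
        if not self.feats:
            return None
        feats = torch.cat(self.feats)
        labels = torch.cat(self.labels)
        if feats.shape[0] < self.batch_size:
            return None                  # keep accumulating across images
        self.feats = [feats[self.batch_size:]]
        self.labels = [labels[self.batch_size:]]
        return feats[:self.batch_size], labels[:self.batch_size]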
4.4.2 Representations of Different Layers
The concept of domain adaptation is that some of the knowledge is inherited from the pre-trained model in the source domain. It is therefore interesting to know which CNN layers carry more knowledge from the source domain, and which layers have to be adapted to the new domain.

We conducted a couple of experiments on this specific question. In the fine-tuning process, we can freeze some of the layers and allow only the others to be updated. For the initial training (i.e., the first step of CDRA), we found that the performance does not change if we freeze the first few convolution layers. This implies that the first few convolution layers carry the important knowledge transferred from the source domain.
In general, we can freeze a few convolution layers in the fine-tuning process. For the initial training in step 1 of CDRA, we froze three convolution layers. In the weighted re-training process the network is even more stable, so we can freeze all the convolution layers and update only the fully-connected layers; this choice was made empirically after a few experiments. It also brings practical benefits: for example, we can modify the network so that each sample passes through the conv layers only once, while the fc layers alone are updated iteratively throughout training. This modification accelerates the training process several-fold.
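As a sketch of this layer freezing, written against a torchvision VGG16 purely for illustration (our actual implementation is Caffe-based):

import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Freeze all conv layers; only the fully-connected layers are updated,
# as in the weighted re-training step of CDRA.
for p in model.features.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4, momentum=0.9)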
4.4.3 Sample Balancing
In Sec. 4.3.6 we discussed the balance between labeled data and estimated data. One more detail we need to take care of is the balance between positive and negative samples. When we select the confident samples, there is no guarantee that they contain the same ratio of positives to negatives; it depends on the clustering results. It is therefore desirable to keep the positive-to-negative ratio unchanged regardless of the clustering results.

This is done in the "compose()" function in step 5 of Alg. 1. While controlling the balance between labeled and estimated data, we also ensure that the ratio of positive to negative samples remains the same as the original ratio (of the labeled data). In this way the training process does not become biased toward a certain label.
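A sketch of the compose() step under these two balancing constraints is given below; the variable names are hypothetical and the per-image details are simplified. Inputs are lists of sample dicts with a "label" field (1 positive, 0 negative).

import random

def compose(labeled, estimated, alpha=0.2, batch_size=128):
    # One batch: alpha * batch_size labeled samples plus
    # (1 - alpha) * batch_size estimated samples, keeping the
    # positive/negative ratio of the labeled data in both parts.
    def split(samples):
        pos = [s for s in samples if s["label"] == 1]
        neg = [s for s in samples if s["label"] == 0]
        return pos, neg

    lab_pos, lab_neg = split(labeled)
    est_pos, est_neg = split(estimated)
    pos_ratio = len(lab_pos) / max(len(labeled), 1)  # original ratio

    def draw(pos, neg, n):
        n_pos = round(n * pos_ratio)
        return (random.sample(pos, min(n_pos, len(pos))) +
                random.sample(neg, min(n - n_pos, len(neg))))

    n_lab = round(alpha * batch_size)
    batch = draw(lab_pos, lab_neg, n_lab) + \
            draw(est_pos, est_neg, batch_size - n_lab)
    random.shuffle(batch)
    return batch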
4.5 Experimental Results
4.5.1 Evaluation Dataset
In this section, we test the CDRA method on a large-scale dataset, for two reasons. First, we want to verify that the algorithm is robust enough to handle large-scale data. Second, the CNN has a large architecture with a huge number of coefficients, so larger data is preferable for fine-tuning the model.
Nowadays, there are few large-scale pedestrian datasets available, especially across different applications. Fortunately, a dataset called CUHK-SYSU [106] was released recently; it contains pedestrians from surveillance applications and movie scenes. In our work, we use the Caltech dataset [23] as the source-domain dataset and the CUHK-SYSU dataset as the target-domain dataset. The numbers of training images of different datasets are summarized in Table 4.3.
Table 4.3: Comparison of the number of training images across different pedestrian datasets

| Dataset | Type | # of training images |
|---------|------|----------------------|
| INRIA | Generic | 614 (pedestrians) |
| CUHK Square | Surveillance | 352 |
| MIT Traffic | Surveillance | 420 |
| Caltech | Automobile | 32077 |
| CUHK-SYSU | Surveillance and movie | 11206 |
The CUHK-SYSU dataset is designed for two tasks: pedestrian detection and person re-identification [28,39,75], and its authors claim that their proposed method can achieve both tasks simultaneously. Fig. 4.10 gives examples from the CUHK-SYSU dataset, where the goal of person re-identification is to find the same person across different images. In our work, we use only the pedestrian detection portion of the dataset. The dataset has a total of 18184 images, with 8432 individual persons and 99809 annotated bounding boxes, covering different pedestrian sizes, occlusion rates, and a wide diversity of environments.
Figure 4.10: Examples of the CUHK-SYSU dataset
To evaluate the performance, we adopt the widely used FPPI-MR curve, where the x-axis represents the number of false positives per image (FPPI) and the y-axis the miss rate.
4.5.2 Experimental Setup
In this semi-supervised scenario, we randomly picked 1/30 of the samples from the CUHK-SYSU training set as the labeled subset, for a total of 374 labeled images. The rest of the CUHK-SYSU dataset is treated as the unlabeled subset with 10832 images (even though labels are provided by CUHK-SYSU, we choose not to use them). The CNN network used here is the same as the BCNN in Chapter 3. In the CDRA algorithm, we set the parameters to $N_k = 64$, $Thrd_P = 0.98$, $\alpha = 0.2$, and a shrinking factor of 0.95, and the distance measure in the deep feature space is the Euclidean distance.
4.5.3 Overall Performance
Fig. 4.11 shows the performance of the proposed CDRA algorithm. The yellow curve is the result of applying the source-domain detector $(w_s, b_s)$ directly to the testing set of the target domain. The red curve is the result of the detector from initial training, $(w_{t0}, b_{t0})$, and the blue curve is the result of the full CDRA algorithm, $(w_t, b_t)$. The numbers shown in the bottom-left box are the miss rates at FPPI = 0.5 as a reference point. CDRA provides a 50% gain over the source-domain detector and is 6% better than the initial detector.
Figure 4.11: Overall performance of the proposed CDRA algorithm.
Some visual examples of the CDRA results are shown in Fig. 4.12. In these examples, the CDRA detector performs much better in the target domain than the original detector: it can detect pedestrians under different camera angles or in extremely dark environments that do not appear in the source domain.
4.5.4 Justification
The CUHK-SYSU dataset is one of the latest datasets, and there are no other methods we can benchmark against. Therefore, further justification is needed to validate the proposed CDRA algorithm.
Figure 4.12: Visualized examples of the CDRA detection results (lower row) compared
with the baseline detection results (upper row).
Figure 4.13: Justification of the CDRA method. (a) Domain adaptation with fine-tuning from a pre-trained model versus training from scratch. (b) Confident sample selection using CDRA versus using detection scores.
First, we would like to demonstrate that the proposed initial training is reasonable for domain adaptation. We therefore conduct another experiment that trains the CNN network without the pre-trained model from the source domain: we still use the labeled subset, but the network is randomly initialized and carries no information from the source domain. Fig. 4.13(a) shows that the resulting performance is much worse than that of our initial detector, which proves that the proposed fine-tuning from the pre-trained model is needed and works well even when labeled data are few.
Second, we would like to validate the proposed confident sample selection method. As mentioned in Sec. 4.3.4, it should provide much better performance than relying on detection scores. We therefore conducted an experiment that uses detection scores to find confident samples: thresholds were set to treat high-score samples as positives and low-score samples as negatives, and were chosen to keep the number of selected samples similar to that of the CDRA method. When the model is trained with the samples selected by detection scores, the results are worse, as shown in Fig. 4.13(b).
Table 4.4: Comparison of the prediction correctness between CDRA and detection scores

| Method | Correctness |
|--------|-------------|
| Predict with detection scores | 91% |
| Predict with CDRA | 98% |
Figure 4.14: Different weighting factors $\alpha$ in the weighted re-training step
Table 4.4 also shows that the correctness of confident sample selection with CDRA is higher than that of using detection scores.

Finally, the weighted re-training process is examined in Fig. 4.14. If we do not involve the labeled subset and use only the estimated set for training, the weighting factor $\alpha$ equals 0, and the miss rate at FPPI = 0.5 is 79.92%, much worse than the proposed weighted re-training scheme with $\alpha = 0.2$. On the other hand, if $\alpha = 1$, the performance is the same as the initial detector, because the same set of data is used. It is therefore essential to strike a balance between labeled data and estimated data, as proposed in CDRA.
4.6 Conclusion
The clustered deep representation adaptation (CDRA) method was proposed in this chapter to tackle the domain adaptation problem for pedestrian detection. It was demonstrated that, with proper use of deep representations, the CDRA method achieves a significant performance gain even without additional cues such as motion or context. CDRA is a semi-supervised method that demands only a small amount of labeled data in the target domain. The labeled data were used to fine-tune the model in the target domain, followed by unsupervised clustering in the deep feature space. A purity measurement procedure was proposed to find confident samples and estimate their labels. Finally, we developed a novel weighted re-training scheme to balance labeled and estimated data. The proposed CDRA method was tested on a large-scale target-domain dataset, and a significant performance gain over the baseline method was achieved.
Chapter 5
Critically Supervised Object Detection
5.1 Introduction
Nowadays the superior performance of convolutional neural network (CNN) based approaches is attributed to the availability of large-scale datasets with human-labeled ground truth. While visual data are easy to acquire, labeling them is often time-consuming. Clearly, there exists a gap between labeled and unlabeled data in many real-world applications. To address this gap, it is essential to develop weakly supervised CNN solutions that take advantage of the huge amount of unlabeled data while significantly reducing the required labeling time.
In this work, we study this issue through the object detection task, one of the most fundamental problems in computer vision today. Although recent studies in weakly supervised object detection [8, 9, 37, 59, 80, 88, 89, 108] have shown great success, their performance is still much lower than that of fully supervised learning schemes. We therefore examine the problem from a brand-new angle: how can we achieve reasonable performance while keeping the labeling time as low as possible? We refer to this concept as "critically supervised learning", since only the most critical data are labeled along the training process. It is inspired by how humans learn in childhood. First, children are taught to recognize objects by parents or teachers from only a few examples. Then they keep observing the world and ask questions to confirm the answers, and the questions are usually formed from the most critical examples that help improve their recognition.
We would like to take a further step and examine the gap between CNN learning and human learning. In common CNN-based object detection frameworks, a big training dataset is unavoidable in order to fit the huge number of CNN coefficients. To obtain enough training samples, precise bounding boxes of all objects in an image are traditionally necessary. This "full-labeling" approach generates all positive (object) and negative (background) samples simultaneously, where the latter usually dominate in number. In human learning, however, only a little supervision is required for positive samples, while humans have an innate ability to obtain background samples through tracking or 3D vision, so no additional supervision is needed for negatives. To narrow the gap, we first introduce the negative object proposal (NOP) concept to collect a huge number of background samples nearly without supervision. Instead of drawing tight bounding boxes around objects, we argue that it is much faster and more natural to design a machine-guided labeling mechanism based on question answering (QA). According to [70], it takes around 42 seconds to draw a high-quality bounding box, while it takes only 1.6 seconds for a human to do a QA verification. To minimize the number of questions required for training a high-precision detector, a critical example mining (CEM) method is proposed to select the examples that are most critical for performance improvement through QA. A kNN graph is designed to assist the retrieval of critical samples.
These novel components are combined as follows: the CNN model is first trained with a very small amount of fully labeled data; extra examples are then collected via CEM by observing the unlabeled data, and a series of QAs is conducted for humans to verify the labels; finally, these examples are combined with NOP examples to retrain the model. We refer to this method as the taught-observe-ask (TOA) framework, since it mimics how children learn to recognize objects. Unlike traditional weakly supervised learning frameworks, which suffer low precision from incomplete labels, TOA achieves much better performance with even less labeling time. The concept of TOA is illustrated in Fig. 5.1.
Figure 5.1: Illustration of the TOA training process compared with fully supervised training. The upper row shows traditional fully supervised training, where all objects are labeled by drawing tight bounding boxes and training samples are obtained by computing their IoU with object proposals. The lower row illustrates that QA labeling is adopted only for critical examples selected by the CEM algorithm, and the training samples are generated by combining them with NOP samples.
Extensive experiments are conducted to shed light on the proposed TOA method in Sec. 5.6. Its effectiveness is demonstrated on the PASCAL VOC datasets [26] for multi-class object detection and on the Caltech pedestrian dataset [23] for single-class object detection as a special case. The TOA method is shown to largely reduce the labeling time needed to train a robust object detector compared with the fully supervised scenario on both datasets. Additional analyses provide more insights into the novel framework.
5.2 Related Previous Work
5.2.1 Object Detection
Object detection is one of the most studied problems in computer vision today, and it remains a challenging task under fast development. Deep learning has brought great success to this field in recent years, outperforming traditional approaches such as the deformable part model (DPM) [29, 30] by a significant margin. R-CNN [36], an early CNN work, achieves end-to-end training of a multi-task network for each region proposal given as input. Region proposals are usually obtained from methods such as [94, 118] or from approaches trained with CNN networks. Networks such as Fast-RCNN [35] and SPP [40] accept the whole image as input to the conv layers, greatly reducing redundant convolutional computation, while region proposals are pooled in later layers for classification and bounding box regression. Faster-RCNN [79] further extends these works by including a region proposal network, which removes the need to compute object proposals outside the network. More algorithms have been proposed [15, 51, 52, 61, 78] that either provide efficient approaches with competitive results or keep pushing up mAP performance through more advanced network designs.
5.2.2 Weakly Supervised Learning
In contrast to fully supervised object detection, researchers have also tried to learn object detectors in weakly supervised ways, which saves tremendous labeling time. Given only the object classes that appear in an image as ground truth, weakly supervised object localization (WSOL) techniques are designed to achieve localization and simultaneously enhance classification results. Recent works have shown great progress [8, 9, 37, 59, 80, 88, 89, 108] in this field using convolutional neural networks. Most of these methods are based on multiple instance learning (MIL), where an image is treated as a bag of instances: an image that contains no object instances of a certain class is labeled as a negative sample for that category, and vice versa. However, learning detectors without bounding box information is quite challenging, and the performance is still well below that of fully supervised detectors.
5.2.3 Active Learning and Related Topics
A few other research fields are also related to our work. Human-in-the-loop is a technique that considers human-machine collaborative annotation [11, 86, 97]; such methods are usually required when a pretrained model does not provide satisfactory results on certain challenging tasks, and human effort is involved in annotating those samples. Active learning refers to techniques that iteratively select samples from a large set of unlabeled data and request human labeling to retrain more powerful models. Previous active learning works mainly focus on tasks such as image classification [48, 50, 53, 76] or region labeling [86, 95, 96]; they usually start from a reasonably good pretrained model and aim to enhance performance with abundant extra unlabeled data. In addition, few active learning works have been applied to window-based object detection. Several other weakly supervised or unsupervised frameworks are also related to our work [60, 112] in certain aspects. The uniqueness of the proposed framework is presented in the following section.
5.3 Introduction to Critically Supervised Learning
While the huge amount of human labeling time is a bottleneck in many practical applications, most of the labeling effort is not critical to the final performance. In the context of object detection, for example, fully supervised learning requires accurate labeling of all bounding boxes in an image, which is counter-intuitive. Inspired by how humans learn to detect objects, CNN training can be approached from a different perspective in which only the most important samples are labeled. We define critically supervised learning as a way to train a model that targets good-enough performance with as little labeling time as possible. This differs from traditional weakly supervised or unsupervised learning, which only provides a limited number or limited types of labels but does not consider a strategy for analyzing which parts of the dataset are more critical to enhancing the performance; as a result, only limited performance is observed.
Critically supervised learning is also closely related to concepts such as active learning and human-in-the-loop labeling. However, it is more specific in that it tries to minimize the labeling time from the very beginning of the CNN training process. The concept resembles human behavior in that we need very little supervision from other people. To achieve this goal, the strategies have to be dynamically adjusted at different labeling stages to ensure that every labeling effort is critical to the performance. A mixed use of various labeling methods and mining criteria is key to the success of critically supervised learning.
In object detection, for instance, a CNN needs a large number of training samples to gain superior performance, and most of them are negative (background) samples. The acquisition of background samples, however, traditionally relies on accurate labeling of all bounding boxes to avoid false negatives; there must be alternative ways to relax this constraint. On the other hand, not all object samples are equally important: only a small number of them are representative and critical for the CNN to gain discriminative power in separating different classes. A tremendous reduction of labeling time can therefore be expected with critically supervised learning.
5.4 Taught-Observe-Ask (TOA) Framework
To achieve critically supervised learning, we propose a new framework called taught-observe-ask (TOA), inspired by how children learn. In the first stage, a small amount of fully labeled data is used to give the CNN basic recognition capability. This initial stage, referred to as the "teach stage", is unavoidable in order for the machine to understand the current problem definition and to know what questions to ask in the next stages.
The next stage is called the observe-ask stage. A critical example mining (CEM) method selects examples from unlabeled images for humans to verify their object classes, mimicking the behavior of children observing the environment and asking questions of their mentors. Machine-guided labeling (MGL) is the implementation that iteratively finds the most critical example and conducts question answering (QA) labeling to obtain labels from human annotators. Negative object proposals (NOP) are adopted to collect background samples without extra supervision. Details of the TOA framework are given in Fig. 5.2.
Figure 5.2: Flow chart of the taught-observe-ask (TOA) framework. The upper branch represents the "teach stage", where full labeling is done on a subset of images to train the stage-0 model $S_0$. The lower branch is the "observe-ask stage", where critical example mining (CEM) retrieves critical examples from the unlabeled dataset for QA labeling. The labeled examples are further combined with NOP samples to form the training set for the stage-n models $S_n$.
Detailed notation is given in the figure for better understanding. In Fig. 5.2, the huge set of unlabeled images $D$ is first separated into two subsets. The first subset, $D_{FL}$, consists of only a small number of images on which humans perform full labeling, generating the training samples $(X_{FL}, Y_{FL})$ used to train a stage-0 detector $S_0$. The rest of the unlabeled images are referred to as $D_{UL}$. In each observe-ask stage $n$, we first collect the necessary features $FT_{UL}$ and detection scores $DT_{UL}$ for each region proposal using the CNN model trained in the previous stage. The CEM module then uses the features to select the most critical example $x_t$ among all unlabeled samples and asks a human to verify its label $y_t$. After a fixed number $t_n$ of iterations, the labeled samples are gathered and combined with the NOP samples $X_{NOP}$ to form the training set $(X_{S_n}, Y_{S_n})$, which is used to obtain a new CNN model $S_n$ for stage $n$. The notation is summarized in Table 5.1.
Table 5.1: Summary of the notation used in this chapter. Note that $N^{OP}_i$ denotes the number of object proposals in image $i$; $N^{OP'}_i$ differs from $N^{OP}_i$ when only a subset of the object proposals is used for training.

Whole dataset: number of images $N_D$; dataset $D = \{d_i \in \mathbb{Z}^{w_i \times h_i} \mid i = 1 \ldots N_D\}$; object proposals $X_{OP} = \{x_{i,j} \in \mathbb{Z}^4 \mid i = 1 \ldots N_D,\ j = 1 \ldots N^{OP}_i\}$.

|                    | Fully labeled subset | Unlabeled subset |
|--------------------|----------------------|------------------|
| Number of images | $N_{FL}$ | $N_{UL}$ |
| Image index | $\Omega_{FL} = \{i \in \mathbb{Z} \mid i = 1 \ldots N_{FL}\}$ | $\Omega_{UL} = \{i \in \mathbb{Z} \mid i = 1 \ldots N_{UL}\}$ |
| Image subset | $D_{FL} = \{d_i \in \mathbb{Z}^{w_i \times h_i} \mid i \in \Omega_{FL}\}$ | $D_{UL} = \{d_i \in \mathbb{Z}^{w_i \times h_i} \mid i \in \Omega_{UL}\}$ |
| Samples | $X_{FL} = \{x_{i,j} \in \mathbb{Z}^4 \mid i \in \Omega_{FL},\ j = 1 \ldots N^{OP'}_i\}$ | $X_{UL} = \{x_{i,j} \in \mathbb{Z}^4 \mid i \in \Omega_{UL},\ j = 1 \ldots N^{OP'}_i\}$ |
| Labels* | $Y_{FL} = \{y_{i,j} \in L \mid i \in \Omega_{FL},\ j = 1 \ldots N^{OP'}_i\}$ | $Y_{UL} = \{y_{i,j} \in L \mid i \in \Omega_{UL},\ j = 1 \ldots N^{OP'}_i\}$ |
| Deep features | $FT_{FL} = \{ft_{i,j} \in \mathbb{R}^{\dim(ft)} \mid i \in \Omega_{FL},\ j = 1 \ldots N^{OP'}_i\}$ | $FT_{UL} = \{ft_{i,j} \in \mathbb{R}^{\dim(ft)} \mid i \in \Omega_{UL},\ j = 1 \ldots N^{OP'}_i\}$ |
| Detection scores* | $DT_{FL} = \{dt_{i,j} \in \mathbb{R}^{N+1} \mid i \in \Omega_{FL},\ j = 1 \ldots N^{OP'}_i\}$ | $DT_{UL} = \{dt_{i,j} \in \mathbb{R}^{N+1} \mid i \in \Omega_{UL},\ j = 1 \ldots N^{OP'}_i\}$ |

*$L = \{1, \ldots, N+1\}$, where $N+1$ is the number of object classes, with the background as an extra class.
5.4.1 Machine-Guided Labeling via Question Answering
Annotating tight bounding boxes (i.e., full labeling) is time-consuming and against human nature. In crowd-sourcing scenarios, it is relatively difficult to find many annotators willing to do such careful manipulation. On the other hand, asking humans to name the object classes appearing in an image (i.e., weak labeling) is much faster and friendlier. The drawback, however, is that the performance of weakly supervised learning without bounding box information is usually unsatisfactory.

Question-answering labeling (i.e., QA labeling) is the fastest and easiest of the three for human annotators. With the TOA method, the performance can be much higher than in the weakly supervised scenario if the number of QAs is large enough. In this work we adopt two simple QA schemes, shown in Fig. 5.3. The first asks the human a yes-or-no question to verify whether a sample belongs to a certain object class; the second queries the annotator for the exact class of a certain object. These two types of questions imitate the most common ways children ask questions of their parents, and are referred to as type-1 and type-2 QA, respectively. Consecutive type-1 QAs are as effective as one type-2 QA, but labeling time can be reduced with type-1 QA when we are more confident about the object class of the sample.
QA labeling has one more advantage over the other two labeling types. To make training effective in both the fully and weakly supervised scenarios, humans must usually label all objects in an image without missing any; failing to do so causes a huge performance drop, so extra effort is required for careful verification. In the proposed TOA method, however, the constraint of labeling all objects is eased by the adoption of NOP, so annotators only need to focus on one specific sample at a time.
It is, however, not always possible for the machine to select samples with tight bounding boxes for QA. In the QA labeling process, we therefore require humans to answer with tolerance to noisy samples (i.e., samples without tight bounding boxes). A guideline has to be provided to the annotators ahead of labeling; for example, we could ask them to accept an object when the sample overlaps a tight box with an intersection-over-union (IoU) greater than 0.6. A quick training session with a few examples is given to the annotators at the beginning of the annotation process. This task is simple for humans: although small variations may occur in judging the overlap ratio when it is close to 0.6, the performance is not much affected, as evidenced in Sec. 5.6.

Figure 5.3: Examples illustrating the two types of QA: (a) type-1 QA and (b) type-2 QA.
5.4.2 Negative Object Proposal
Object proposal [94, 118] is a common technique for extracting a set of bounding boxes that ideally includes all objects in an image, reducing the computational complexity of exhaustive search. We define a complementary concept called negative object proposal (NOP), which stands for a set of boxes that ideally contains no objects. With the introduction of NOP, the need to label all objects in an image in order to collect negative samples is relieved.
Telling objects from non-objects is easy for humans without any supervision, but unsupervised NOP extraction is a non-trivial task. To show the challenge, we examine two straightforward ways to extract NOPs. First, we collect boxes with extremely low scores from the Edgeboxes [118] object proposal algorithm. Second, we modify the scoring formula of the Edgeboxes algorithm to weigh more heavily boxes that have more edges crossing the box boundaries. The results are shown in Fig. 5.4.

Figure 5.4: Examples illustrating the challenges in extracting good NOP samples. Only 200 of all boxes are plotted for better visualization: (a) Edgeboxes results with extremely low scores, ranked beyond 2000 in one image; (b) Edgeboxes with the modified formula.
There are two major issues with these results. First, the straightforward approaches tend to produce only small bounding boxes; to serve training purposes, the NOP should contain informative samples of reasonable size with certain degrees of overlap with objects. Second, the precision should be extremely high to obtain reasonable training performance, since negative samples dominate the training set; the results of the above approaches still occasionally contain positive samples and fail to meet the required precision.
To solve these challenges, we use the detection score information $DT_{UL}$ from the stage-0 detector; in other words, no extra labeling effort is needed to extract NOPs. Although the stage-0 model is trained from a small percentage of the data and its detection scores are not robust, one can still fetch a candidate set of samples containing almost all objects by using a low enough threshold. NOP samples are then obtained by avoiding samples around this candidate set. We first define an inverse-set operation based on IoU as follows, where the function returns the samples in set $A$ that do not overlap heavily with any sample in set $B$:
$C = \mathrm{Inv\_IoU}(A, B) = \{ a_i \in A \mid \mathrm{IoU}(a_i, b_j) < \epsilon,\ \forall b_j \in B \}$   (5.1)
With Eq. 5.1, the NOP set can be calculated by finding the inverse set of a candidate object set extracted using the detection scores. The idea is formulated as:
$X^{NOP}_i = \mathrm{Inv\_IoU}(X^{OP}_i, X^{DT}_i), \quad \forall i \in \Omega_{UL}$   (5.2)
Here $X^{DT}$ is the candidate set fetched by applying a threshold to the detection scores, and $X^{OP}$ contains the samples generated by the object proposal algorithm. The resulting NOP samples $X^{NOP}$ not only provide high-quality negative samples of reasonable size, but also have an extremely low chance of including object samples. Supporting experiments are presented in Sec. 5.6.6.
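A sketch of the NOP computation of Eqs. 5.1 and 5.2 follows. Boxes are assumed to be (x1, y1, x2, y2) NumPy arrays; eps stands for the IoU threshold of Eq. 5.1 and score_thr for the detection-score threshold, both shown with illustrative values rather than the exact ones used in our experiments.

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box[None]) + area(boxes) - inter)

def inv_iou(A, B, eps=0.1):
    # Eq. 5.1: samples in A whose IoU with every sample in B is below eps.
    if len(B) == 0:
        return A
    keep = [i for i, a in enumerate(A) if iou(a, B).max() < eps]
    return A[keep]

def nop(proposals, scores, score_thr=0.05, eps=0.1):
    # Eq. 5.2: NOP = Inv_IoU(object proposals, likely-object candidates).
    candidates = proposals[scores >= score_thr]  # X_DT from stage-0 scores
    return inv_iou(proposals, candidates, eps)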
Negative object proposals can also be created without supervision if prior knowledge about the target objects is given. For example, in the pedestrian detection problem, almost all pedestrians are upright with a certain range of aspect ratios. We can simply make the candidate object set contain all possible boxes within the range of plausible pedestrian aspect ratios; the NOP calculated from Eq. 5.2 then consists of boxes whose aspect ratios cannot correspond to pedestrians. This is used in the experiments of Sec. 5.6, and good performance is observed even though the method is totally unsupervised.
5.4.3 Critical Example Mining
It is well known that some samples play more important roles than others in the training process, as elaborated in [84, 105]; typically, hard samples are more critical for improving CNN model performance. In critically supervised learning, however, the criteria for choosing important samples are very different, and they change dynamically along the labeling process. This is a new problem that has not yet been explored. Fig. 5.5 illustrates the differences.
Figure 5.5: Illustration of the concept of critically supervised learning. The colored dots are labeled samples, while the others are unlabeled. The upper row shows that, under the constraint of using only 5 QAs, the selection of samples for labeling is critical to the trained detector (classifier). The lower row shows that the criteria for selecting the first 5 samples differ from those for selecting 2 additional samples when better performance is targeted.
When given a limited labeling budget, examples should be selected cautiously so that no labeling effort is wasted. Initially, two criteria matter most: the selection should be balanced across different classes, and the samples should be representative, meaning redundant ones are avoided. As shown in the upper row of Fig. 5.5, careful selection of labeled samples according to these two criteria has a great impact on the trained model performance. As more samples are labeled and the model gains better performance on easier samples, it becomes more important to find harder examples that help achieve better separation around the decision boundaries, as shown in the lower row of Fig. 5.5.
Based on the intuition above, we design a function $c_{i,j}$ to represent the criticalness of each sample, taking three criteria into consideration: class balancing (BAL), sample representativeness (REP), and hardness (HARD). In addition, the importance of each term has to change along the labeling process. Each term is therefore composed of two parts: the former, $cs_{i,j}$, is the score of the criterion, while the latter, $cp_{i,j}$, is a progress function that controls the contribution of the term at labeling iteration $t$:

$c_{i,j} = c^{BAL}_{i,j} + c^{REP}_{i,j} + c^{HARD}_{i,j}$   (5.3)

$c_{i,j} = cs^{BAL}_{i,j} \cdot cp^{BAL}_{i,j}(t) + cs^{REP}_{i,j} \cdot cp^{REP}_{i,j}(t) + cs^{HARD}_{i,j} \cdot cp^{HARD}_{i,j}(t)$   (5.4)
This function is heuristic, since there is no ground truth for the importance of each sample along the labeling process. Before explaining each term of the criticalness function, we first provide details of the features used in it.
Detection Score and Deep Feature
The detection score $dt_{i,j}$ is the output of the CNN representing the probability of a sample belonging to a specific class. The scores are obtained in each stage $n$ by feeding all training samples into the $S_{n-1}$ model for prediction. Although the precision of $S_{n-1}$ may not be high, the detection score still serves as a weak indication of how likely a sample is an object sample. Each score $dt_{i,j}$ is a vector of dimension $(N+1)$, where $N$ denotes the number of classes and the background is treated as an additional class.
Besides detection scores, deep features are also extracted in the CNN prediction process, providing powerful information for mining critical examples. The deep features $FT$ are the output of one of the intermediate CNN layers, and the feature dimension depends on the chosen layer. Detailed experimental setups are given in Sec. 5.6.
KNN Graph and Geodesic Distance
To measure the representativeness of a sample, as illustrated in Fig. 5.5, we design our mining algorithm based on the k-nearest-neighbor (kNN) graph. To build the undirected kNN graph $G(V, E)$, each node is a sample in either $X_{UL}$ or $X_{FL}$. For each node $v_i$, we calculate its Euclidean distance in the deep feature space to all other samples, $Euc(v_i, V)$. A sample $v_2$ is connected to $v_1$ in the graph if it is among the $k$ nearest neighbors of $v_1$. The generation of the kNN graph is formulated in Eq. 5.5:
$e_{v_1,v_2} = \begin{cases} 1, & Euc(v_1, v_2) \le \mathrm{sort}(Euc(v_1, V), \text{ascend})[K] \\ 1, & Euc(v_1, v_2) \le \mathrm{sort}(Euc(v_2, V), \text{ascend})[K] \\ 0, & \text{otherwise} \end{cases}$   (5.5)
Given the kNN graph, the geodesic distance between two nodes is defined as the length of the shortest path between them in the graph [112]. Compared with the Euclidean distance, the geodesic distance better reflects the sample distribution over the deep feature space.
Finally, for each unlabeled sample we define a metric $ldist_{i,j} \in LDIST_{UL}$, the geodesic distance to the nearest labeled sample. When $ldist_{i,j}$ is small, the sample $x_{i,j}$ is less critical, since a labeled sample is already nearby. Fig. 5.6 illustrates the kNN graph, the geodesic distance, and the LDIST metric.

Figure 5.6: Illustration of the kNN graph, where each node is connected to its K nearest neighbors (K = 3 in this example). The ldist of a node is defined as the number of edges in the shortest path to the nearest labeled node.
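A sketch of the kNN-graph construction and the LDIST computation is given below, using brute-force distances and a multi-source breadth-first search; feats and labeled_idx are assumed inputs, and our actual implementation accelerates the distance computation with CUDA, as noted in Sec. 5.6.2.

import numpy as np
from collections import deque

def knn_graph(feats, k=4):
    # Undirected kNN graph: connect v1 and v2 if either is among
    # the k nearest neighbors of the other (Eq. 5.5).
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    adj = [set() for _ in range(len(feats))]
    for i in range(len(feats)):
        for j in nbrs[i]:
            adj[i].add(int(j)); adj[int(j)].add(i)
    return adj

def ldist(adj, labeled_idx):
    # Geodesic distance (number of edges) from every node to its
    # nearest labeled node, via multi-source BFS.
    dist = [np.inf] * len(adj)
    q = deque()
    for i in labeled_idx:
        dist[i] = 0; q.append(i)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1; q.append(v)
    return dist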
Progress Function
To control the contribution of each term in the criticalness function, we define a simple progress function $P(x; \mu, \sigma^2)$, formulated in Eq. 5.6. We assume that a term is important in the labeling process until a certain condition is met at a cutoff threshold $\mu$; beyond this point, we reduce its weight and focus more on the other criteria when selecting critical examples. To avoid abruptly setting the criticalness to 0, we apply the Gaussian probability function $f(x \mid \mu, \sigma^2)$, normalized to $[0, 1]$, to achieve a gradual decay of the term:

$P(x; \mu, \sigma^2) = \begin{cases} 1, & x \le \mu \\ f(x \mid \mu, \sigma^2) / f(\mu \mid \mu, \sigma^2), & x > \mu \end{cases}$   (5.6)
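The progress function of Eq. 5.6 amounts to a few lines of code, since the normalized Gaussian ratio simplifies to a single exponential:

import math

def progress(x, mu, sigma2):
    # P(x; mu, sigma^2): flat at 1 up to the cutoff mu, then a
    # Gaussian decay normalized so that P(mu) = 1 (Eq. 5.6).
    if x <= mu:
        return 1.0
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2))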
Criticalness Function
We now give details of the first term of the criticalness function, sample balancing (BAL). This term simultaneously considers the balance between different object classes and the balance between positive and negative samples. In the early labeling stages, there are many more negative samples, obtained through the NOP algorithm; selecting samples with a higher chance of being objects is thus important in order not to waste QA effort. Hence the maximum detection score of a sample, $\max(dt_{i,j})$, serves as an index for selecting samples for QA, as expressed in Eq. 5.7.

The equation contains two progress functions. The first lowers the importance of this term once more object samples have already been labeled in the image. The second ensures the balance between different classes: it sets a cutoff at the point where a specific class already has too many labeled samples in the current labeling stage. In the VOC dataset, for instance, this avoids picking too many examples from major classes such as "person" or "car" and distributes the QA quota to rarer classes, boosting the overall performance in earlier stages.
$c^{BAL}_{i,j} = \max(dt_{i,j}) \cdot P\big(\mathrm{count}(y_{i,j} > 0,\ \forall y_{i,j} \in Y^{UL}_i);\ (\mu, \sigma^2)_{BAL1}\big) \cdot P\big(\mathrm{count}(y_{i,j} = \arg\max(dt_{i,j}),\ \forall y_{i,j} \in Y^{UL});\ (\mu, \sigma^2)_{BAL2}\big)$   (5.7)
The second term of the criticalness function considers the representativeness of a sample, using the LDIST metric described above: when $ldist_{i,j}$ is small, the sample is relatively redundant for training, and vice versa. LDIST is used inside the progress function in Eq. 5.8 because its value keeps changing along the labeling process. In addition to the progress function, we introduce a variable $\gamma$ to adjust the importance of this term across datasets; some datasets contain many redundant samples while others have diversified content. The choice of $\gamma$ is related to the sample distribution on the kNN graph and is elaborated further in Sec. 5.6.6.
$c^{REP}_{i,j} = \gamma \cdot P\big(ldist_{i,j};\ (\mu, \sigma^2)_{REP}\big)$   (5.8)
The last term of the criticalness function gives higher priority to hard samples that appear around the decision boundary. This term is not critical until the more representative samples have been labeled in the early stages. In Eq. 5.9, the progress function is driven by the average LDIST value, which indicates how densely the deep feature space has been labeled. To determine the hardness of a sample, a straightforward way is to examine its neighbors in the kNN graph and check whether mixed classes are observed; however, this is only effective when labeled samples are dense in the space, which does not happen when we target little labeling time. We therefore provide another metric to highlight hard examples: as described in Eq. 5.9, a sample is spotted as hard if its detection scores are high not only for one class but also for other object categories.

$c^{HARD}_{i,j} = \big(\mathrm{sum}(dt_{i,j}) - \max(dt_{i,j})\big) \cdot P\big(\mathrm{mean}(LDIST);\ (\mu, \sigma^2)_{HARD}\big)$   (5.9)
The criticalness score determines how important a sample is. The CEM procedure first computes the criticalness scores of all unlabeled samples and then selects the sample with the highest score for QA labeling. In general, we apply type-1 QA by asking whether the sample belongs to the class with the highest detection score. For hard samples, which satisfy $cp^{HARD}_{i,j} > cp^{BAL}_{i,j}$ and $cp^{HARD}_{i,j} > cp^{REP}_{i,j}$, type-2 QA is applied, because we are less sure about the class. The newly labeled sample is used to update the $ldist_{i,j}$ and $c_{i,j}$ values of the affected unlabeled samples, and the iteration continues until the end of the stage.
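Putting Eqs. 5.4 and 5.6-5.9 together, the per-sample criticalness can be sketched as below, reusing the progress() helper above. The counters passed in are hypothetical names, gamma = 0.3 is shown purely for illustration, and the $(\mu, \sigma^2)$ pairs are the values reported in Sec. 5.6.2.

import numpy as np

def criticalness(dt, ldist_ij, n_obj_in_img, n_labeled_in_class,
                 mean_ldist, n_qa, gamma=0.3):
    # cs_BAL * cp_BAL: max score, faded once the image already has
    # labeled objects and once the predicted class has many labels
    c_bal = dt.max() * progress(n_obj_in_img, 1.0, 0.2) \
                     * progress(n_labeled_in_class, n_qa / 20.0, 1.0)
    # cs_REP * cp_REP: representativeness via the LDIST metric
    c_rep = gamma * progress(ldist_ij, 1.5, 0.5)
    # cs_HARD * cp_HARD: scores spread over several classes mark
    # boundary samples; active once labeling is dense (small mean LDIST)
    c_hard = (dt.sum() - dt.max()) * progress(mean_ldist, 2.0, 0.2)
    return c_bal + c_rep + c_hard   # Eq. (5.4)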
5.4.4 Training Sample Composition
After the number of QAs reaches the target amount in stage $n$, training samples have to be composed for training the $S_n$ CNN model. The fully labeled images $(X_{FL}, Y_{FL})$ are still involved in the training process, while an unlabeled image is used only when it contains QA-labeled object samples.
Each such unlabeled image contributes both QA-labeled samples and NOP samples. The positive samples are first chosen among the object proposals whose IoU with a QA-labeled sample exceeds a threshold $TH_{FG}$. Negative samples are then picked from the NOP set with IoU within the range $[TH_{LO}, TH_{HI}]$ with respect to the QA-labeled samples. The adoption of these three thresholds parallels the fully supervised case; the differences are that IoU is computed against QA-labeled samples rather than ground truth, and that NOP, instead of object proposals, supplies the negatives. Examples are shown in Sec. 5.6.6.
It is worth mentioning that QA-labeled samples are noisy (i.e., they do not necessarily have tight bounding boxes). Composing samples with the three thresholds in the TOA method may therefore cause positive samples to overlap the real ground truth less, or negative samples to overlap it more, than usual. We elaborate on the choice of these thresholds in Sec. 5.6 and show the actual IoU distributions in Sec. 5.6.6.
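A sketch of this composition step is given below, reusing the iou() helper from the NOP sketch; the threshold values are placeholders rather than the ones used in our experiments, and qa_boxes is assumed non-empty.

def compose_samples(proposals, nop_boxes, qa_boxes, qa_labels,
                    th_fg=0.5, th_lo=0.1, th_hi=0.4):
    # Positives: proposals with IoU > TH_FG to some QA-labeled box.
    # Negatives: NOP boxes with IoU in [TH_LO, TH_HI] to QA boxes.
    pos, neg = [], []
    for p in proposals:
        overlaps = iou(p, qa_boxes)
        k = int(overlaps.argmax())
        if overlaps[k] > th_fg:
            pos.append((p, qa_labels[k]))      # inherit the QA label
    for b in nop_boxes:
        o = iou(b, qa_boxes).max()
        if th_lo <= o <= th_hi:
            neg.append((b, 0))                 # background class
    return pos, neg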
5.5 Evaluation Method
Since we approach the object detection problem from the new angle of minimizing labeling time, a new evaluation approach is introduced in this section.
5.5.1 Labeling Time Model
Human labeling time has been discussed in several works. For example, it is mentioned in [70] that drawing a tight, high-quality bounding box takes around 42 seconds in a crowd-sourcing setup, and 26 seconds with faster annotation, while a simple yes-or-no question requires only 1.6 seconds. Similar numbers are reported in [71, 90]. There is no doubt that drawing a bounding box is far more time-consuming than answering simple questions.

To evaluate performance, we build a simple labeling time model. First, we take the labeling times for bounding box drawing and yes-or-no questions from the papers above. In addition, we include the labeling time for answering a type-2 question about which class an object belongs to: from our own trials, typing a specific object class name takes 2.4 seconds on average. Similarly, we evaluated the time for a human to scan the whole image and confirm that no objects are missing. These four numbers are summarized in Table 5.2.
Table 5.2: Labeling time model for different types of labeling. The HQ profile reflects high-quality labeling; the MQ profile reflects moderate-quality annotation.

| Labeling type | Variable | HQ profile | MQ profile |
|---------------|----------|------------|------------|
| Draw a tight bounding box | $t_{FL}$ | 42.0 sec | 26.0 sec |
| Answer type-1 question | $t_{QA1}$ | 1.6 sec | 1.6 sec |
| Answer type-2 question | $t_{QA2}$ | 2.4 sec | 2.4 sec |
| Verify missing objects | $t_{VER}$ | 2.6 sec | 0.0 sec |
Two profiles are considered in this table. The high-quality (HQ) profile assumes drawing a bounding box is as expensive as reported in the abovementioned papers. In favor of fully supervised learning, we also define a moderate-quality (MQ) profile that allows quicker drawing of bounding boxes and assumes image verification is done in parallel with drawing. Experimental results in Sec. 5.6.6 demonstrate that our method remains competitive under both time models.
Note that the labeling times are variables in the table; the numbers provided are for reference, and in practice one can plug in any numbers deemed reasonable. In fact, powerful tools can always be designed for more efficient annotation, for bounding box drawing and question answering alike. For example, showing a batch of samples together and asking the human to click on those that do not belong to a certain class would make yes-or-no questions much faster than 1.6 seconds each. In this work we do not discuss sophisticated annotation tools, but focus on the scenarios commonly seen in crowd-sourcing.
5.5.2 Estimated Labeling Time Ratio (ELTR)
Given the labeling time model, we define a metric called the estimated labeling time ratio (ELTR) to compare the overall cost against the fully or weakly supervised schemes. ELTR is the labeling time needed by a specific method, normalized by the time required to fully label the whole image set.

We now derive the ELTR for different training scenarios. First, consider using a subset of images $D_{FL}$ for fully supervised training. The ELTR in this scenario is simply the ratio between the number of selected images and the size of the whole dataset, as shown in Eq. 5.10:
$ELTR_{FS} = N_{FL} / N_D$   (5.10)
For weakly supervised scenarios, humans need to name the object classes present in an image, which is the same as answering type-2 questions, except that all object classes in the image must be answered. Suppose there are $N^{CLS}_i$ different classes in image $i$; weak supervision of this image then takes roughly $N^{CLS}_i \cdot t_{QA2}$ seconds. Moreover, whole-image verification has to be applied for weakly supervised learning. The ELTR, again normalized by the time needed for full labeling of the whole dataset, is given in Eq. 5.11:
$ELTR_{WS} = \Big( \sum_{i=1}^{N_D} \big( t_{VER} + t_{QA2} \cdot N^{CLS}_i \big) \Big) / (t_{FL} \cdot N_D)$   (5.11)
Finally, for critically supervised learning, our TOA method involves several types of labeling: full labeling, type-1, and type-2 QA labeling. All labeling times are added up, and the ELTR is obtained with the same normalization. Here $N_{QA1}$ and $N_{QA2}$ denote the total numbers of type-1 and type-2 questions, respectively, and the total number of QAs is denoted by $N_{QA}$.
$ELTR_{CS} = (t_{FL} \cdot N_{FL} + t_{QA1} \cdot N_{QA1} + t_{QA2} \cdot N_{QA2}) / (t_{FL} \cdot N_D)$   (5.12)
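The three ELTR variants of Eqs. 5.10-5.12 in code form, using the HQ-profile constants of Table 5.2:

T_FL, T_QA1, T_QA2, T_VER = 42.0, 1.6, 2.4, 2.6  # HQ profile (Table 5.2)

def eltr_fs(n_fl, n_d):                        # Eq. 5.10
    return n_fl / n_d

def eltr_ws(n_cls_per_image, n_d):             # Eq. 5.11
    return sum(T_VER + T_QA2 * c for c in n_cls_per_image) / (T_FL * n_d)

def eltr_cs(n_fl, n_qa1, n_qa2, n_d):          # Eq. 5.12
    return (T_FL * n_fl + T_QA1 * n_qa1 + T_QA2 * n_qa2) / (T_FL * n_d)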
5.6 Experimental Results
5.6.1 Datasets
To validate the proposed TOA method, we first test the algorithm on the general multi-class object detection problem using the widely used VOC 2007 and VOC 2012 datasets (abbreviated VOC07 and VOC12, respectively). To further demonstrate the generalization capability of TOA, experiments are conducted on the Caltech pedestrian dataset (abbreviated Caltech dataset), a special case of object detection with a single object class.
VOC dataset
For object detection, the VOC datasets are among the most widely used benchmarks, especially VOC07 and VOC12. These two datasets provide 20 object classes of the most commonly seen objects in daily life, split into three subsets: train, val, and test. In our work, we adopt the common approach of training our model on the trainval set and computing the mean average precision (mAP) on the test set. VOC07 has a total of 5011 trainval images, while VOC07+12 has a total of 16553 trainval images.
Caltech dataset
The Caltech dataset is the most widely used dataset in the pedestrian detection domain, and it provides a large number of training images captured from the street with wide diversity. Pedestrian detection is essential for various applications such as autonomous vehicles, advanced driver assistance systems (ADAS), security, and robotics. It is a special case of object detection in which only one class is of interest, yet the challenges are distinctive, involving cluttered backgrounds, occlusions, extremely small objects, and so on. The Caltech dataset provides a total of 128419 training images together with their bounding box ground truth.
5.6.2 Experimental Setup
Simulation
TOA is a method involving iterative human-machine collaborative labeling. For easier comparison with fully and weakly supervised learning methods, we still use existing datasets that are already annotated, but we do not use their labels directly. Instead, we simulate our results by applying QA labeling and letting the dataset ground truth answer our questions. For example, when we choose a sample and ask a type-1 question, we assume the human answers "yes" when the sample has a high enough IoU with the ground truth and the queried class matches the annotated class. For type-2 questions, we obtain the object class when the IoU with the ground truth is greater than a threshold. With this simulation, no actual human labeling is needed. In our work the IoU threshold is set to 0.6; Sec. 5.6.6 provides more analysis regarding variations in this threshold.
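A sketch of the simulated QA oracle, again reusing the iou() helper from the NOP sketch; gt_boxes and gt_labels are the withheld dataset annotations.

IOU_THR = 0.6  # QA acceptance threshold used in our simulation

def answer_type1(box, asked_class, gt_boxes, gt_labels):
    # Simulated yes/no: 'yes' iff some ground-truth object of the
    # asked class overlaps the box with IoU >= 0.6.
    o = iou(box, gt_boxes)
    return bool(any(ov >= IOU_THR and lab == asked_class
                    for ov, lab in zip(o, gt_labels)))

def answer_type2(box, gt_boxes, gt_labels):
    # Simulated class answer, or None if nothing overlaps enough.
    o = iou(box, gt_boxes)
    k = int(o.argmax())
    return gt_labels[k] if o[k] >= IOU_THR else None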
CNN Network
The evolution of CNN architectures is extremely fast nowadays, and so is the improvement of CNN performance on different problems. In this work, we adopt Fast-RCNN [35] (abbreviated FRCNN) as our baseline method for proof of concept. FRCNN does not provide state-of-the-art results in object detection, but two major reasons make it the natural choice here. First, weakly supervised learning methods are still mostly based on FRCNN, so using it allows a fair comparison. Second, FRCNN is a common framework with proven success in different computer vision problems; among the few CNN networks validated on both the Caltech and VOC datasets, FRCNN still performs best [35, 58]. We use VGG-16 [87] as the backbone architecture with an ImageNet pretrained model, a selection made for the same reasons.
Implementation Details
In stage 0, a subset of images is selected for full supervision. In this work, this subset is randomly selected from the whole training set; we tried different random sets, and the results did not change much. How to mine a better initial set for training $S_0$, for further performance improvement, is left as future work. For object proposals, we adopt the Edgeboxes [118] algorithm with its default settings. In the observe-ask stage, we select the fc7 layer of the CNN as our deep features $FT$; fc7 is the last feature stage of the VGG-16 network, and its feature space preserves more sample clustering, which better fits our proposed concepts.
our proposed concepts.
To determine detection scores and deep features for unlabeled training images, non-
maxima-suppression (nms) is applied to avoid too many redundant samples. In the
calculation of kNN graph, each sample is computed its Euclidean distance with all other
samples, and the computational overhead is extremely high. To accelerate the process,
we rst exclude samples with lowest detections scores under 0.01. According to our
CEM design, these samples are almost impossible to be selected in our algorithm even if
they are not excluded. Secondly, we implement the kNN calculation with CUDA to fully
exploiting the parallel computing capability of the GPUs, and accelerate the process by
around 200 times than the CPU implementation.
For the CEM parameters, we set $K = 4$ for the kNN graph. The value of $K$ is closely related to the geodesic distances among samples, and other parameters can be used to control the behaviors related to the geodesic distance. The remaining parameters are set empirically: two scalar weights are set to 0.3 and 0.01, and the progress-function parameters are $(\mu, \sigma^2)_{BAL1} = (1, 0.2)$, $(\mu, \sigma^2)_{BAL2} = (\#QA/20, 1)$, $(\mu, \sigma^2)_{REP} = (1.5, 0.5)$, and $(\mu, \sigma^2)_{HARD} = (2, 0.2)$. Other training options and parameters are discussed in later sections.
5.6.3 TOA For Multi-Class Object Detection
First, we validate our TOA framework on the VOC07 dataset. To compare with fully and weakly supervised learning, we plot mAP against ELTR in Fig. 5.7, with the HQ profile used for the ELTR calculation. The points on the blue curve are CNN models trained with fully supervised learning on image subsets of different sizes. The black triangular point is the state-of-the-art performance of weakly supervised learning; compared with fully supervised learning, its labeling time is smaller at the same mAP on the blue curve. The points on the red curve show the performance of our TOA method at different stages. Our method not only achieves much higher mAP than the weakly supervised method, but also requires much less labeling time than both other schemes.
Figure 5.7: mAP vs. ELTR curve comparing the performance of fully, weakly, and
critically supervised learning on the VOC07 dataset. Each dot is a model trained using
a different training set with its estimated labeling time.
In this experiment, our S_0 model uses only 1/64 of the images from the VOC07 trainval set,
which corresponds to 157 images. Then, different numbers of QA are conducted in each
stage, as detailed in Table 5.3. For example, in stage-1, 3600 QA are performed and the
mAP of S_1 boosts from 32.0% to 47.5% with only 0.0421 ELTR required. In stage-2, an
additional 3600 QA are performed, and the total of 7200 QA together contribute to a
mAP of 52.2% with 0.0559 ELTR. In stage-2, some of the QA are type-2 QA, which are
decided by the CEM algorithm.
Table 5.3: Detailed experimental setup for Fig. 5.7. The number in parentheses after
FRCNN indicates the fraction of data used for labeling; the number in parentheses after
TOA indicates the number of stages used in TOA. The numbers of fully-labeled images
and QA-labeled samples are provided along with the final estimated labeling time ratio
(ELTR).

method | category | mAP | N_Full | N_QA1 | N_QA2 | N_QA | ELTR(HQ)
FRCNN | fully supervised | 67.5 | 5011 | 0 | 0 | 0 | 1
FRCNN(1/4) | fully supervised | 60.1 | 1253 | 0 | 0 | 0 | 0.25
FRCNN(1/8) | fully supervised | 52.8 | 626 | 0 | 0 | 0 | 0.125
FRCNN(1/16) | fully supervised | 42.6 | 313 | 0 | 0 | 0 | 0.0625
FRCNN(1/32) | fully supervised | 32.0 | 157 | 0 | 0 | 0 | 0.0313
K-EM [108] | weakly supervised | 46.1 | 0 | 0 | 0 | 0 | 0.0619
TOA(S1) | critically supervised | 47.5 | 157 | 3600 | 0 | 3600 | 0.0421
TOA(S2) | critically supervised | 52.2 | 157 | 5276 | 1924 | 7200 | 0.0559
TOA(S3) | critically supervised | 55.5 | 157 | 5276 | 3124 | 8400 | 0.0613
We can examine the benefit of TOA from two aspects. First, for the same labeling
time at an ELTR around 0.062, TOA achieves a 55.5% mAP, which outperforms K-EM
(a weakly supervised method) by 9.4% and exceeds FRCNN by 12.9%. From another
perspective, if we target a mAP of 55%, the labeling time needed for TOA is only about
1/3 of FRCNN's labeling time, while weakly supervised learning has difficulty achieving
this level at all. It is also worth mentioning that the TOA method does not have a
performance cap like weakly supervised models. Ideally, when the QA number is as large as
the number of all training samples, TOA will achieve the same performance as fully
supervised learning. However, a more sophisticated mining algorithm has to be developed
to keep the same labeling-time-saving efficiency, which is left as our future work.
The detection performance on different classes is shown in Table 5.4. The TOA method
achieves a consistent improvement across all object categories.
In addition to the VOC07 dataset, we also applied the TOA method on the VOC07+12
trainval dataset and tested it on the VOC07 test set. Since the number of images
Table 5.4: The detailed detection results by object category with the experimental setup
defined in Table 5.3 and visualized in Fig. 5.7.
method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
FRCNN(1/32) 24.8 27.4 30.2 11.7 14.8 34.4 56.9 44.4 18.3 43.8 17.7 37.7 52.4 45.3 51.2 11.2 32.0 24.9 23.4 36.7 32.0
FRCNN(1/16) 39.9 58.2 42.6 28.0 15.8 50.3 66.4 59.1 29.2 46.4 30.4 44.8 54.3 62.6 54.5 14.3 24.0 32.9 43.9 53.9 42.6
FRCNN(1/8) 47.4 65.4 50.0 34.4 24.9 65.6 71.7 69.0 28.3 62.9 41.2 64.3 70.2 62.6 61.8 23.3 59.6 51.8 53.1 49.1 52.8
FRCNN(1/4) 58.8 70.4 58.1 42.9 43.6 70.7 75.9 76.2 34.4 66.3 52.4 69.5 76.4 73.6 67.5 32.6 59.3 56.6 63.7 52.6 60.1
FRCNN 68.6 77.8 65.0 57.5 44.8 79.9 78.6 83.8 41.6 74.5 62.0 82.3 80.8 75.3 70.8 37.7 66.2 68.0 75.4 60.0 67.5
K-EM 59.8 64.6 47.8 28.8 21.4 67.7 70.3 61.2 17.2 51.5 34.0 42.3 48.8 65.9 9.3 21.1 53.6 51.4 54.7 50.7 46.1
TOA(S1) 48.1 61.5 43.6 22.5 29.6 60.8 61.7 56.1 24.8 57.3 28.4 56.4 63.7 62.9 47.7 21.5 47.8 45.6 55.9 54.5 47.5
TOA(S2) 53.6 48.3 55.3 31.0 27.2 61.1 66.9 70.2 26.1 68.8 32.3 66.8 71.1 62.8 57.2 21.8 54.0 52.0 64.5 52.1 52.2
TOA(S3) 60.0 52.8 57.6 35.3 29.6 66.3 69.2 73.2 29.5 70.3 42.5 66.5 72.8 66.2 58.4 26.2 57.5 54.3 65.4 57.2 55.5
becomes larger, the setup of the TOA method is different and is detailed in Table 5.5.
Similar performance gains can still be observed, and a higher mAP of 58.3 is achieved
with an even lower ELTR.
Table 5.5: Detailed experimental setup for the fully and critically supervised CNN models
trained on the VOC07+12 trainval dataset and tested on the VOC07 dataset. The numbers
of fully-labeled images and QA-labeled samples are provided along with the final estimated
labeling time ratio (ELTR).

method | category | mAP | N_Full | N_QA1 | N_QA2 | N_QA | ELTR(HQ)
FRCNN | fully supervised | 70.7 | 16551 | 0 | 0 | 0 | 1
FRCNN(1/4) | fully supervised | 66.8 | 4138 | 0 | 0 | 0 | 0.25
FRCNN(1/8) | fully supervised | 61.6 | 2069 | 0 | 0 | 0 | 0.125
FRCNN(1/16) | fully supervised | 56.1 | 1034 | 0 | 0 | 0 | 0.0625
FRCNN(1/32) | fully supervised | 49.0 | 517 | 0 | 0 | 0 | 0.0313
FRCNN(1/64) | fully supervised | 38.3 | 259 | 0 | 0 | 0 | 0.0156
TOA(S1) | critically supervised | 54.6 | 259 | 10000 | 0 | 10000 | 0.0248
TOA(S2) | critically supervised | 58.3 | 259 | 16038 | 3962 | 20000 | 0.0357
5.6.4 TOA For Single-Class Object Detection
In addition to general object detection, we evaluate our method on the single-class
object detection problem using the Caltech dataset. The standard evaluation metric for
the Caltech dataset is different: a miss rate (MR) versus false-positives-per-image (FPPI)
curve is calculated for each trained model, and the overall performance of a model is
obtained by averaging the miss rate at several reference points on the curve in log scale.
We use MR to denote this log-average miss rate, and the performance comparison with
the fully supervised scheme is plotted in Fig. 5.8.
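For reference, the log-average MR is commonly obtained by sampling the MR-FPPI curve at nine reference points spaced evenly in log space over [10^-2, 10^0] and taking their geometric mean, following the standard Caltech protocol; a minimal sketch:

    import numpy as np

    def log_average_miss_rate(fppi, miss_rate):
        # fppi, miss_rate: arrays tracing the MR-FPPI curve, with fppi ascending.
        refs = np.logspace(-2.0, 0.0, num=9)          # 9 reference points in log scale
        # interpolate log(MR) against log(FPPI) at each reference point
        log_mr = np.interp(np.log(refs), np.log(fppi), np.log(miss_rate))
        return np.exp(log_mr.mean())                  # geometric mean of the sampled MRs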
On the Caltech dataset, the performance gain is even more pronounced. For example, at
MR = 17, TOA can save up to 95% of the labeling effort compared to the fully supervised
scenario. This can be attributed to the fact that the Caltech dataset contains more redundant
samples than the VOC datasets: it is captured frame-by-frame in street-view scenes, and
all the pedestrians have similar shapes and appearances. The difference in content
diversity across datasets will be elaborated in later sections. To sum up, TOA is more
advantageous when a dataset contains more similar samples.
Figure 5.8: MR vs. ELTR curve comparing the performance of fully and critically
supervised learning on the Caltech dataset. Each dot is a model trained with a different
setup and its associated labeling time.
The detailed setup of the above experiment is shown in Table 5.6, where each entry
corresponds to a dot on the MR-ELTR curve. The MR-FPPI curves of these
models are also provided in Fig. 5.9 for reference. Although the Caltech dataset provides
frame-by-frame images, we only use every 4th frame to form our baseline dataset D. The
reason is that the computational time is reduced dramatically during the whole
TOA process, while the performance with 1/4 of the training data is not much different from
that of using every frame, as evidenced in Table 5.6.
Table 5.6: Detailed experimental setup for Fig. 5.8. Note that the TOA method is applied
on the 1/4 subset of the Caltech dataset, so the ELTR for FRCNN(1/4) is equal to 1.

method | MR | N_Full | N_QA1 | N_QA2 | N_QA | ELTR(HQ)
FRCNN | 14.4 | 128419 | 0 | 0 | 0 | -
FRCNN(1/4) | 14.7 | 32105 | 0 | 0 | 0 | 1
FRCNN(1/8) | 19.0 | 16052 | 0 | 0 | 0 | 0.5
FRCNN(1/16) | 20.0 | 8026 | 0 | 0 | 0 | 0.25
FRCNN(1/32) | 21.7 | 4013 | 0 | 0 | 0 | 0.125
FRCNN(1/64) | 24.6 | 2007 | 0 | 0 | 0 | 0.0625
FRCNN(1/128) | 30.6 | 1003 | 0 | 0 | 0 | 0.0313
TOA(S1) | 20.2 | 1003 | 5000 | 0 | 5000 | 0.0329
TOA(S2) | 17.1 | 1003 | 10000 | 0 | 10000 | 0.0346
Figure 5.9: Detailed miss rate (MR) vs. false-positives-per-image (FPPI) curves for the
models defined in Table 5.6.
5.6.5 TOA Training Options
Different from traditional fully supervised FRCNN training, the TOA method has different
types of training samples and requires special adjustments when training the model. In
this section, we discuss how different training options affect the final results; the preferred
options may also vary across applications. We summarize the training options in Table 5.7,
and each of them is elaborated in the following paragraphs.
Table 5.7: Summary of the TOA training options, where (def) marks the default value.

option | value | description
REG | 0 | No bounding box regression
REG | 1 (def) | Bounding box regression with noisy targets
REG | 2 | Bounding box regression using the S_0 model
REG | 3 | Bounding box regression assuming a perfect regressor
RETR | 0 | Do not retrain the model (use only 1 OA-stage)
RETR | 1 (def) | Use the retrained model S_n for prediction in stage S_{n+1}
PRETR | 0 | Always use ImageNet as the pretrained model
PRETR | 1 (def) | Use model S_n as the pretrained model for stage S_{n+1}
GENTR | 0 (def) | Use the same thresholds as FRCNN to generate training samples
GENTR | 1 | Generate training samples with adapted thresholds
WTR | 0 (def) | Treat each image equally
WTR | 1 | Weigh fully supervised images more
Since the performance on the Caltech dataset is already strong, the results there are not
very sensitive to these options; thus, default values are applied. In the following paragraphs,
we analyze the effects of the different training options on the VOC07 dataset. The VOC07
results in Sec. 5.6.3 were evaluated with default values except for PRETR=0.
Labeling Amount Selection
The selection of the labeling amount, including the size of the initial fully labeled image
subset and the number of QA in each stage, is heuristic. It depends on the annotation
resources available in practical applications, and there is no single optimal choice. Generally
speaking, when a larger initial set is selected, the detection scores and deep features are
more reliable for retrieving better QA samples efficiently. However, full labeling is time
consuming, which causes fast growth in labeling time. In contrast, if the initial set is too
small, the poor S_0 detector will perform badly when selecting CEM samples.
In Fig. 5.10(a), we show the results for different initial sets, each followed by the same
amount of QA. In general, a smaller initial set is still favorable as long as the S_0 detector
is not too poor, because QA labeling still consumes much less labeling time than full
labeling and can provide a faster performance boost in the early stages.
Bounding box regression (REG)
Object detection usually involves multi-task training of the probability score and bounding
box regression simultaneously [35]. For bounding box regression, ground truth bounding
box locations are used as training targets. However, in the TOA framework, we do not
obtain ground truth bounding box locations via question answering. This lack of bounding
box locations is one limitation of the TOA framework.
In this section, we examine several ways to deal with the issue. The first option is to
train the model without bounding box regression at all (REG=0), which gives the worst
performance, as shown in Fig. 5.10(b). An alternative is to accept the locations
of the noisy bounding boxes as ground truth for training the bounding box regressor
(REG=1). The figure shows that training the regressor with noisy bounding boxes
still provides a reasonable performance improvement. The result is also much better than
using no regression or applying the weak regressor trained from the S_0 detector (REG=2).
In this figure, we also compare our results with a scenario that assumes a perfect
regressor (REG=3). Here, the optimal regressor is trained using the VOC2007
trainval set. In real-world applications, it is possible that these two tasks are trained
independently and then combined afterwards. Our work focuses on improving the
classifier performance, while other works focus more on better localization [70],
which can be further applied to improve the regression performance in the TOA framework.
To give a more accurate evaluation unaffected by noisy bounding box regression,
we do not apply bounding box regression in any other experiment in this section or
the next analysis section.
Prediction with retrained models (RETR)
In the TOA framework, we allow multiple stages of QA labeling, and the CNN
model is retrained in each stage. The retrained model then provides better performance,
which directly helps to improve the mining of more informative samples in the next stage.
However, the question of whether multiple stages should be applied is again a tradeoff
between computational resources and the final performance. For example, if we target
a total of 8400 QA labels, we can choose to apply TOA with one stage or split these
QA into three stages as in our previous examples, where the latter choice requires three
times extra computation from retraining the model, redoing the prediction, rebuilding
the kNN graph, and so on.

Figure 5.10: Experimental results with different training options. The results are zoomed
into a certain ELTR range except for subplot (a). The training options are: (a) amount
of the initial set, (b) bounding box regression (REG), (c) prediction with retrained
models (RETR), (d) pretrained model selection (PRETR), (e) training sample generation
(GENTR), and (f) weighing on image subsets (WTR).
We compare the results in Fig. 5.10(c), where RETR=1 means that we apply multi-stage
retraining and RETR=0 is the result with a single-stage TOA. It is observed
that, with the same amount of QA, multi-stage TOA indeed results in better mAP
at the cost of the extra computation.
Pretrained model selection (PRETR)
For each stage n, training samples are collected from both the fully labeled subset and the
QA-labeled subset. In other words, the training set in stage n is always a superset of
that in the previous stage. To retrain the CNN model S_n, we have two choices: either
use the ImageNet pretrained model (PRETR=0), as we do in stage S_0, or apply the
pretrained model from the previous stage S_{n-1} and fine-tune it (PRETR=1). In
the latter case, the learning rate can be reduced since we already have a better basis. The
result in Fig. 5.10(d) shows that always training from the ImageNet pretrained model
provides slightly better performance on the VOC07 dataset.
Training sample generation (GENTR)
In Sec. 5.4, we introduced how training samples are generated once all QAs are
done. The parameters (TH_FG, TH_HI, TH_LO) can be set to the default values used
in fully supervised learning (GENTR=0). For example, these parameters are (0.5, 0.5,
0.1) when training a traditional FRCNN on the VOC07 dataset. Since the QA-labeled
bounding boxes are noisy, it is speculated that adjusting the thresholds could potentially
give better performance (GENTR=1). Thus, we set the thresholds to (0.6, 0.4, 0.1) to
avoid treating positive samples that are too far away from QA-labeled samples as foreground,
and at the same time prevent negative samples from being too close to the QA-labeled
samples. Fig. 5.10(e) compares these two options and shows that applying the
default thresholds gives slightly better performance than the adjusted thresholds.
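The role of the three thresholds can be summarized by the small sketch below, where max_iou holds each proposal's best IoU against the (QA-labeled) boxes; the names are ours.

    import numpy as np

    def assign_roles(max_iou, th_fg=0.5, th_hi=0.5, th_lo=0.1):
        # Foreground if IoU >= TH_FG; background if TH_LO <= IoU < TH_HI;
        # ignored otherwise.
        fg = max_iou >= th_fg
        bg = (max_iou >= th_lo) & (max_iou < th_hi)
        ignored = ~(fg | bg)
        return fg, bg, ignored

    # GENTR=1 corresponds to assign_roles(max_iou, th_fg=0.6, th_hi=0.4, th_lo=0.1),
    # which leaves a margin around the noisy QA-labeled boxes.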
Weighing on image subsets (WTR)
The TOA training set consists of both fully labeled and QA-labeled subsets. Generally,
the fully labeled subset should contain higher-quality training samples than the
QA-labeled subsets. It is unknown whether better performance can be observed if we
give the fully labeled frames higher weights in the training process. Technically, this is
achieved by allowing the fully supervised frames to appear more times in each epoch
(WTR=1). In Fig. 5.10(f), we show the result of weighing fully labeled images by a factor
of two, which does not provide better performance; this implicitly indicates that the
training samples prepared by the TOA method remain of high quality.
Figure 5.11: Analysis results for the TOA framework: (a) effectiveness of the progress
function for hard samples, (b) results when random errors are added to the simulation.
5.6.6 Analysis
To give more insights into the TOA framework, the following analyses are conducted on
the VOC07 dataset.
Effectiveness of CEM
In this section, several analyses are first provided to shed light on the effectiveness of
the proposed CEM algorithm. First, we examine the reason for using detection scores in
the criticalness function. In Table 5.8, the valid QA rates are provided, which indicate
the percentage of QAs that result in labeled objects rather than backgrounds. As
mentioned in previous sections, only object samples can directly help improve the
precision, since we already have the NOP samples. For a type-1 QA, the sample not only
has to be an object, but the class also has to be correct for the QA to be valid, while a
valid type-2 QA only requires the sample to be an object.
Since the initial model S_0 is a weak detector, the valid QA1 rate is only around 37%
in stage-1. In other words, it is very important to use high-detection-score samples for
QA in the early stages; otherwise, more QAs are wasted, causing sub-optimal performance.
Table 5.8: The valid QA rate with the CEM algorithm in each stage.

method | QA1 | Valid QA1 | QA2 | Valid QA2
TOA(S1) | 3600 | 1320 | 0 | 0
TOA(S2) | 5276 | 1754 | 1924 | 904
TOA(S3) | 5276 | 1754 | 3124 | 1406
Secondly, we want to validate the class-balancing progress function in Eq. 5.7. We compare
the original detection result TOA(S1) with a model that does not have this progress
function. The class-wise detection results for both models are shown in Table 5.9. With
the same amount of QA, there is a large mAP difference of 5.3. The model with the
progress function is able to select balanced training samples among the different classes
and thus achieves a balanced performance boost across multiple categories. In contrast,
the model without the progress function tends to pick a lot of samples from major
classes such as "person" or "car". Although it indeed provides better detection scores
in both of these categories than the original model, it sacrifices the performance of many
other categories. Therefore, it does not find the most critical examples for overall
performance improvement.
Table 5.9: Differences in detection results using models with/without the class-balancing
progress function.
method train set aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
TOA(S1) 07 42.9 61.5 40.9 19.0 24.6 55.3 59.2 55.0 21.7 56.0 33.8 47.4 60.1 58.6 43.0 21.9 45.4 41.9 52.6 44.7 44.3
TOA(S1) w/o balance 07 41.5 25.2 41.7 6.2 18.8 54.0 62.7 52.0 19.5 49.9 20.9 48.9 58.4 54.2 54.8 15.9 30.7 41.3 44.4 38.9 39.0
Next, we examine the progress function in the hardness term in Eq. 5.9. If the
term is removed from the criticalness function, the result is as shown in Fig. 5.11(a):
compared to the original criticalness function, the one without the hardness progress
function yields worse mAP performance under the same experimental setup using TOA(S3).
The result hints that the progress functions play important roles in highlighting
different types of critical examples in different labeling stages.
Finally, we want to show the effectiveness of the representativeness term in Eq. 5.8.
We first examine the redundancy of samples in the VOC and Caltech datasets. By
building the kNN graphs for both datasets, we found that the average Euclidean distance
between two nearest samples is 249.3 in the Caltech dataset and 1194.2 in the VOC07
dataset. In other words, there is clear evidence that samples in the Caltech dataset are
much more redundant than those in the VOC07 dataset. Based on this observation, we
set the corresponding parameter to 0.2 for the VOC dataset and 1 for the Caltech dataset,
respectively, in adaptation to their different contents.
Effectiveness of NOP
Next, we want to check the quality of the NOP as well as that of the whole training sample
set. First of all, the quality of the NOP can be examined by calculating the precision of
retrieving true negative samples. On the VOC07 dataset, the NOP algorithm selects an
average of 1871 bounding boxes out of 1900 object proposals per image, meaning that a
large percentage of boxes are still kept under the Inv-IoU operation. For the whole NOP
set, the negative-sample precision is as high as 99.99914%, which is much better than the
precision of around 99.5% obtained by the alternative approaches described in Fig. 5.4.
As we have mentioned, the NOP precision needs to be extremely high because a large
number of negative samples is required for training.
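The negative-sample precision quoted above can be verified directly by checking every NOP box against all ground-truth boxes; a sketch, reusing the iou helper from the simulation sketch earlier (the background criterion bg_thr = 0.5 is our assumption):

    def nop_precision(nop_boxes, gt_boxes, bg_thr=0.5):
        # Fraction of NOP boxes that are true negatives, i.e., overlap
        # every ground-truth box with IoU below bg_thr.
        true_neg = sum(all(iou(b, g) < bg_thr for g in gt_boxes) for b in nop_boxes)
        return true_neg / max(len(nop_boxes), 1)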
In addition, we would also like to check the quality of the training samples composed of
QA-labeled samples and NOP samples. An example is shown in Fig. 5.12. In general, the
visual quality of these training samples is high, since the NOP samples contain
informative negative samples of different sizes. Fig. 5.12 further presents a histogram of
the IoU between all NOP samples and the ground truth on the VOC07 dataset. Ideally, in
fully supervised learning, the IoU of negative samples falls 100% within the range
[TH_LO, TH_HI). For the training samples in TOA, many of the negative samples still
have an IoU within this range. Although we also see some samples with IoU = 0, they
do not affect performance much. In general, the IoU curve still provides evidence of
good NOP quality.
Simulation Variation
In this work, we use simulation to replace real human question answering. However,
humans can make inaccurate decisions. Even supposing there are no intentional errors,
there are still fluctuations when humans judge objects against a given IoU threshold.
Thus, we add random errors to the simulation. Originally, the simulated human is asked to
select objects with an IoU greater than 0.6. Since humans cannot estimate overlaps
precisely, we use a random variable to simulate the human thresholding in the range
[0.6 - ε, 0.6 + ε]. We apply a uniform random variable, which is reasonable since humans
do not make huge mistakes, so a long-tailed distribution is unnecessary.
We set ε to 0.1 and observe the results in Fig. 5.11(b). With the additional random
errors, the performance is even slightly better than simulating with a fixed threshold. The
result indicates that the TOA framework is robust against small labeling variations.
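Realizing this in the simulated oracle is a one-line change: draw a per-question threshold uniformly from [0.6 - ε, 0.6 + ε] instead of using the fixed value; a sketch with ε = 0.1:

    import random

    def noisy_threshold(base=0.6, eps=0.1):
        # Per-question IoU threshold with uniform human fluctuation.
        return random.uniform(base - eps, base + eps)

    # e.g., plugged into the oracle sketched earlier:
    # answer_type1(box, query_class, gt_boxes, gt_classes, thr=noisy_threshold())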
Labeling Time Model
Labeling time can vary according to how the labeling tools are designed. We apply
the HQ-profile time model, which is designed using numbers from several papers, to
provide an accurate estimation in the crowd-sourcing setup. We would also like to verify
our performance by changing the time model to the MQ-profile, which favors fully
supervised learning more.
We re-evaluate our results on the VOC07 dataset with the experimental setup defined in
Sec. 5.5.1, and the results are shown in Fig. 5.13. Even though the gap between the two
types of labeling becomes smaller, we still obtain an obvious performance gain over fully
supervised learning.
Figure 5.13: Experimental results with different labeling time models: (a) HQ-profile,
(b) MQ-profile.
5.7 Conclusion
In conclusion, we propose a new CNN training framework that can achieve good
performance while keeping the labeling time as low as possible. The design of the proposed
teach-observe-ask (TOA) framework follows human nature by adopting a machine-guided
labeling scheme based on question answering. Several novel components are presented,
including the critical example mining (CEM) algorithm used to select samples for QA,
and the negative object proposal (NOP) designed to extract negative samples without
extra labeling. The TOA method is demonstrated on the VOC dataset as well as the
Caltech pedestrian dataset as a proof of concept. A huge labeling time reduction is
achieved on both datasets with reasonable mAP and MR performance. Extensive
experiments are conducted to give insights into the TOA design.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
The pedestrian detection problem is an important yet challenging problem in the computer
vision field. Three topics in pedestrian detection were studied in this thesis, and the
third one was generalized to the multi-class object detection problem.
For the first topic, a boosting framework was proposed to solve this problem by
retraining a convolutional neural network (CNN) with re-weighting of the training samples.
A novel weighted loss function was introduced to achieve this goal. Two weighting functions
were designed to emphasize informative samples so that the fine-tuned CNN model
would be more robust against challenging cases. The first type of informative samples
comprised those with low detection scores, while the second type comprised temporally
inconsistent bounding box pairs. Two corresponding BCNN models (i.e., BCNN-LS and
BCNN-TI) were trained with the proposed BCNN training procedure. Finally, a boosted
fusion layer was proposed to merge the BCNN-LS and the BCNN-TI into one model,
namely the BCNN-BF. The experimental results showed that the BCNN-LS, the BCNN-TI
and the BCNN-BF achieved a significant performance gain over the Fast-RCNN baseline,
and the BCNN-BF currently ranks third on the Caltech pedestrian dataset.
For the second topic, the domain adaptation technique was studied and explored
for pedestrian detection to meet the need of adapting a detector to other domains with
unlabeled data. A clustered deep representation adaptation (CDRA) method was
proposed to leverage the deep representations of the CNN. The CDRA method demanded
a small amount of labeled data in the target domain to train an initial model. This was
followed by unsupervised clustering in the deep feature space and a purity measurement
process that aims to find confident samples with estimated labels. Finally, a novel
weighted re-training procedure was designed to balance the labeled and estimated data.
When tested on a large-scale target-domain dataset, the CDRA method achieved a
significant performance gain even without additional features such as motion or context.
For the third topic, a critically-supervised training framework was introduced to largely
reduce the labeling time needed for good performance. Our framework is referred to as
the Taught-Observe-Ask (TOA) method, where a small number of fully-labeled training
images are used to train a low-precision CNN model. A machine-guided labeling scheme is
then applied to allow humans to label a set of selected critical samples by answering questions.
The selection of these critical samples is based on an analysis of the kNN graph generated
from the deep feature space. Three criticalness functions are designed to select the most
critical examples along the labeling process. Negative object proposals (NOP) are further
combined with the critical examples to form training samples, which achieves
reasonable precision while greatly reducing the labeling time. An evaluation method is
designed to benchmark our results against fully supervised and weakly supervised learning,
demonstrating our superior performance in saving labeling time.
6.2 Future Work
In this thesis, we have focused on training CNN models with good performance while
dealing with the practical problem of limited human labeling. One direction for future
work is to design a better mining algorithm for the TOA method in order to achieve
higher mAP performance.
In addition, the theoretical foundation of neural networks has been studied for almost
three decades. These theoretical results only explain the basic working principle of CNNs,
but they do not explain the recent success of deep networks well. To be more specific,
although CNN-based methods, including our proposed BCNN and CDRA solutions,
can achieve significantly better performance than traditional ones in solving the pedestrian
detection problem, their success has little theoretical support. Furthermore, the
parameters of CNNs, such as the numbers of layers and filters, are determined empirically
in all existing CNNs. In order to analyze the superior performance of CNNs, a
visualization tool [110] was proposed to provide insights into the function of each filter in the
CNN. Although the tool helps people understand the CNN partially, the question about
the CNN design remains. For the pedestrian detection problem, many researchers use
the classic VGG-16 architecture as the network and adjust its filter weights to provide
better performance based on slightly different training samples and procedures.
As a future extension, we would like to find a theoretical basis for choosing the network
architecture for a certain application. For example, would it be possible to shrink the
network size while preserving the detection performance? On the other hand, we would
like to know whether a network is too small for a certain application and, if so, whether
it is possible to grow the network size for better performance. This goal might be
achievable by analyzing the responses at different layers of the network. Inspired
by [55], we can study the mean and the variance of the responses at each filter of each
layer, where each sample is projected onto an anchor vector, i.e., the filter weight as
described in [55]. If the variance at a filter is large, we may split the filter into multiple
ones to enlarge the network. Otherwise, we may merge multiple filters that are close to
each other in terms of their orientation angles. This is an interesting yet challenging
problem worth further exploration.
Bibliography
[1] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 73–80. IEEE, 2010.
[2] Rahaf Aljundi, Rémi Emonet, Damien Muselet, and Marc Sebban. Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 56–63, 2015.
[3] Shai Avidan. Ensemble tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(2):261–271, 2007.
[4] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Visual tracking with online multiple instance learning. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 983–990. IEEE, 2009.
[5] Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
[6] Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Ten years of pedestrian detection, what have we learned? In Computer Vision–ECCV 2014 Workshops, pages 613–627. Springer, 2014.
[7] Yoshua Bengio et al. Deep learning of representations for unsupervised and transfer learning. ICML Unsupervised and Transfer Learning, 27:17–36, 2012.
[8] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with posterior regularization. In Proceedings BMVC 2014, pages 1–12, 2014.
[9] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1081–1089, 2015.
[10] Biswajit Bose and Eric Grimson. Improving object classification in far-field video. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.
[11] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. Computer Vision–ECCV 2010, pages 438–451, 2010.
[12] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370. Springer, 2016.
[13] Zhaowei Cai, Mohammad Saberian, and Nuno Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 3361–3369, 2015.
[14] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Risks of semi-supervised learning: How unlabeled data can degrade performance of generative classifiers.
[15] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[16] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[17] Hal Daumé III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[19] Jun Deng, Zixing Zhang, Erik Marchi, and Björn Schuller. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 511–516. IEEE, 2013.
[20] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, June 2009.
[21] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014.
[22] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral channel features. 2009.
[23] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.
[24] Anna Ellis, Ali Shahrokni, and James Michael Ferryman. PETS2009 and Winter-PETS 2009 results: A combined evaluation. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1–8. IEEE, 2009.
[25] Markus Enzweiler and Dariu M Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2179–2195, 2009.
[26] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
[27] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[28] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2360–2367. IEEE, 2010.
[29] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[30] Pedro F Felzenszwalb, Ross B Girshick, and David McAllester. Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2241–2248. IEEE, 2010.
[31] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[32] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.
[33] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
[34] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[35] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[36] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[37] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Multi-fold MIL training for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2409–2416, 2014.
[38] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
[39] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person re-identification, volume 1. Springer, 2014.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(9):1904–1916, 2015.
[42] Jan Hosang, Mohamed Omran, Rodrigo Benenson, and Bernt Schiele. Taking a deeper look at pedestrians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4073–4082, 2015.
[43] Kyaw Kyaw Htike and David Hogg. Weakly supervised pedestrian detector training by unsupervised prior learning and cue fusion in videos. In 2014 IEEE International Conference on Image Processing (ICIP), pages 2338–2342. IEEE, 2014.
[44] Kyaw Kyaw Htike and David Hogg. Adapting pedestrian detectors to new domains: A comprehensive review. Engineering Applications of Artificial Intelligence, 50:142–158, 2016.
[45] Kyaw Kyaw Htike and David C Hogg. Efficient non-iterative domain adaptation of pedestrian detectors to video scenes. In Proceedings – 22nd International Conference on Pattern Recognition, pages 654–659. IEEE, 2014.
[46] Vidit Jain and Sachin Sudhakar Farfade. Adapting classification cascades to new domains. In Proceedings of the IEEE International Conference on Computer Vision, pages 105–112, 2013.
[47] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[48] Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2372–2379. IEEE, 2009.
[49] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(7):1409–1422, 2012.
[50] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with gaussian processes for object categorization. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[51] Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. Ron: Reverse connection with objectness prior networks for object detection. arXiv preprint arXiv:1707.01691, 2017.
[52] Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
[53] Adriana Kovashka, Sudheendra Vijayanarasimhan, and Kristen Grauman. Actively selecting annotations among objects and attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1403–1410. IEEE, 2011.
[54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[55] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41:406–413, 2016.
[56] Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2010.
[57] Yann LeCun, Koray Kavukcuoglu, Clément Farabet, et al. Convolutional networks and applications in vision. In ISCAS, pages 253–256, 2010.
[58] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, and Shuicheng Yan. Scale-aware fast r-cnn for pedestrian detection. arXiv preprint arXiv:1510.08160, 2015.
[59] Siyang Li, Xiangxin Zhu, Qin Huang, Hao Hsu, and C.-C. Jay Kuo. Multiple instance curriculum learning for weakly supervised object detection. In BMVC, 2017.
[60] Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, and Shuicheng Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 999–1007, 2015.
[61] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[62] Yunxiang Mao and Zhaozheng Yin. Training a scene-specific pedestrian detector using tracklets. In 2015 IEEE Winter Conference on Applications of Computer Vision, pages 170–176. IEEE, 2015.
[63] Anna Margolis. A literature review of domain adaptation with unlabeled data. Tech. Report, pages 1–42, 2011.
[64] Woonhyun Nam, Piotr Dollár, and Joon Hee Han. Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems, pages 424–432, 2014.
[65] Wanli Ouyang and Xiaogang Wang. A discriminative deep model for pedestrian detection with occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3258–3265. IEEE, 2012.
[66] Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2056–2063, 2013.
[67] Wanli Ouyang, Xingyu Zeng, and Xiaogang Wang. Modeling mutual visibility relationship in pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3222–3229, 2013.
[68] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Strengthening the effectiveness of pedestrian detection with spatially pooled features. In Computer Vision–ECCV 2014, pages 546–561. Springer, 2014.
[69] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[70] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. We don't need no bounding-boxes: Training object class detectors using only human verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 854–863, 2016.
[71] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. arXiv preprint arXiv:1708.02750, 2017.
[72] Dennis Park, Deva Ramanan, and Charless Fowlkes. Multiresolution models for object detection. In European Conference on Computer Vision, pages 241–254. Springer, 2010.
[73] Dennis Park, C Zitnick, Deva Ramanan, and Piotr Dollár. Exploring weak stabilization for motion feature extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2882–2889, 2013.
[74] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.
[75] Bryan Prosser, Wei-Shi Zheng, Shaogang Gong, Tao Xiang, and Q Mary. Person re-identification by support vector ranking. In BMVC, volume 2, page 6, 2010.
[76] Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang, and Hong-Jiang Zhang. Two-dimensional active learning for image classification. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[77] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766. ACM, 2007.
[78] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[79] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[80] Olga Russakovsky, Yuanqing Lin, Kai Yu, and Li Fei-Fei. Object-centric spatial pooling for image classification. Computer Vision–ECCV 2012, pages 1–15, 2012.
[81] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[82] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3626–3633, 2013.
[83] Pramod Sharma and Ram Nevatia. Efficient detector adaptation for object detection in a video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3254–3261, 2013.
[84] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[85] Guang Shu, Afshin Dehghan, and Mubarak Shah. Improving an object detector and extracting regions using superpixels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3721–3727, 2013.
[86] Behjat Siddiquie and Abhinav Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2979–2986. IEEE, 2010.
[87] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[88] Hyun Oh Song, Ross Girshick, Stefanie Jegelka, Julien Mairal, Zaid Harchaoui, and Trevor Darrell. On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024, 2014.
[89] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems, pages 1637–1645, 2014.
[90] Hao Su, Jia Deng, and Li Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, volume 1, 2012.
[91] Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. Shifting weights: Adapting object detectors from image to video. In Advances in Neural Information Processing Systems, pages 638–646, 2012.
[92] Yonglong Tian, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning strong parts for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1904–1912, 2015.
[93] Yonglong Tian, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Pedestrian detection aided by deep learning semantic tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5079–5087, 2015.
[94] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[95] Sudheendra Vijayanarasimhan and Kristen Grauman. Multi-level active prediction of useful image annotations for recognition. In Advances in Neural Information Processing Systems, pages 1705–1712, 2009.
[96] Sudheendra Vijayanarasimhan and Kristen Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2262–2269. IEEE, 2009.
[97] Sudheendra Vijayanarasimhan and Kristen Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision, 108(1-2):97–114, 2014.
[98] Paul Viola and Michael J Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[99] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. Hoggles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
[100] Stefan Walk, Nikodem Majer, Konrad Schindler, and Bernt Schiele. New features and insights for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1030–1037. IEEE, 2010.
[101] Meng Wang, Wei Li, and Xiaogang Wang. Transferring a generic pedestrian detector towards specific scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3274–3281. IEEE, 2012.
[102] Meng Wang and Xiaogang Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3401–3408. IEEE, 2011.
[103] Xiaoyu Wang, Tony X Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International Conference on, pages 32–39. IEEE, 2009.
[104] Christian Wojek and Bernt Schiele. A performance evaluation of single and multi-feature people detection. In Joint Pattern Recognition Symposium, pages 82–91. Springer, 2008.
[105] Chi-Hao Wu, Weihao Gan, De Lan, and C.-C. Jay Kuo. Boosted convolutional neural networks (BCNN) for pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision, pages 540–549. IEEE, 2017.
[106] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
[107] Junjie Yan, Xucong Zhang, Zhen Lei, Shengcai Liao, and Stan Z Li. Robust multi-resolution pedestrian detection in traffic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3033–3040, 2013.
[108] Ziang Yan, Jian Liang, Weishen Pan, Jin Li, and Changshui Zhang. Weakly- and semi-supervised object detection with expectation-maximization algorithm. arXiv preprint arXiv:1702.08740, 2017.
[109] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z Li. Convolutional channel features. In Proceedings of the IEEE International Conference on Computer Vision, pages 82–90, 2015.
[110] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[111] Liliang Zhang, Liang Lin, Xiaodan Liang, and Kaiming He. Is faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457. Springer, 2016.
[112] Quanshi Zhang, Ruiming Cao, Ying Nian Wu, and Song-Chun Zhu. Mining object parts from CNNs via active question-answering. arXiv preprint arXiv:1704.03173, 2017.
[113] Shanshan Zhang, Christian Bauckhage, and Armin Cremers. Informed haar-like features improve pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 947–954, 2014.
[114] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. How far are we from solving pedestrian detection? arXiv preprint arXiv:1602.01237, 2016.
[115] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Filtered channel features for pedestrian detection. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1751–1760. IEEE, 2015.
[116] Xu Zhang, Fei He, Lu Tian, and Shengjin Wang. Cognitive pedestrian detector: Adapting detector to specific scene by transferring attributes. Neurocomputing, 149:800–810, 2015.
[117] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsity-based collaborative model. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1838–1845. IEEE, 2012.
[118] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014.
Abstract (if available)
Abstract
With the emergence of autonomous driving and the advanced driver assistance system (ADAS), the importance of pedestrian detection has increased significantly. A lot of research work has been conducted to tackle this problem with the availability of large-scale datasets. Methods based on the convolutional neural network (CNN) technology have achieved great success in pedestrian detection in recent years, which offers a giant step to the solution of this problem. Although the performance of CNN-based solutions reaches a significantly higher level than traditional methods, it is still far from perfection. Further advancement in this field is still demanded. In this dissertation, we conducted three research topics along this direction, and then further extended to a more general problem. ❧ In the first topic, a boosted convolutional neural network (BCNN) system is proposed to enhance the pedestrian detection performance. Being inspired by the classic boosting idea, we develop a weighted loss function that emphasizes challenging samples in training a convolutional neural network (CNN). Two types of samples are considered challenging: 1) samples with detection scores falling in the decision boundary, and 2) temporally associated samples with inconsistent scores. A weighting scheme is designed for each of them. Finally, we train a boosted fusion layer to benefit from the integration of these two weighting schemes. We use the Fast-RCNN as the baseline and test the corresponding BCNN on the Caltech pedestrian dataset in the experiment and observe a significant performance gain of the BCNN over its baseline. ❧ Data-driven pedestrian detection methods demand a large amount of human labeled data as the training samples. The performance of these detectors is highly dependent on the amount of labeled data. Since data labeling is time-consuming, labeled datasets are often insufficient to train a robust detector in real world applications. On the other hand, it is relatively easy to collect unlabeled data. Thus, it is desirable to develop unsupervised or weakly-supervised learning methods that exploit unlabeled data for further performance improvement in the training of a detector. The domain adaptation technique is developed to reach this goal. ❧ In the second topic, a semi-supervised learning method is proposed for pedestrian detection based on domain adaptation. It is observed that the deep representation, which is the response of an input through a CNN, is powerful in estimating the class of unlabeled data. Being motivated by this observation, we propose a clustered deep representation adaptation (CDRA) method. It trains an initial detector using a small number of labeled data, extracts the deep representation and, then, clusters samples based on the space spanned by the deep representation. A purity measurement mechanism is applied to each cluster to provide a confident score to the estimated class of unlabeled data. Finally, a weighted re-training process is adopted to fine-tune the model by balancing the numbers of labeled and estimated data. The CDRA method is shown to achieve the state-of-the- art performance against a large scale dataset. ❧ While semi-supervised learning does not provide high enough precision due to limited number of labeled data, it is desirable to design another training method with reasonable performance, and at the same time keeping the labeling time as less as possible. 
Because semi-supervised learning cannot deliver sufficiently high precision with a limited number of labeled samples, it is desirable to design another training method that offers reasonable performance while keeping the labeling time as low as possible. To achieve this goal, we observe how humans learn to recognize objects in childhood: children are first taught by their parents, and then begin to observe the world on their own, asking questions from time to time about objects they do not yet recognize well.

In the third topic, we propose a learning framework called critically-supervised learning that mimics this child-learning process and provides reasonable performance while saving up to 95% of the labeling time. Several novel components are proposed to realize this high-level concept, including negative object proposal, critical example mining, and a machine-guided labeling process. A labeling time model is proposed to evaluate the final performance (a simple illustrative version of such a model is sketched below). Extensive experiments are conducted to shed light on several novel ideas, and the effectiveness of the proposed method is evaluated not only on the Caltech benchmark dataset but also on the PASCAL VOC datasets for the general object detection task.
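As a concrete illustration of the first topic, the following minimal PyTorch sketch shows one plausible form of the weighted loss: samples scored near the decision boundary are emphasized, while confidently scored samples are down-weighted. The function name, the Gaussian weighting form, and the boundary and sigma parameters are illustrative assumptions, not the exact scheme developed in the dissertation.

import torch
import torch.nn.functional as F

def boundary_weighted_loss(logits, labels, scores, boundary=0.5, sigma=0.1):
    # Samples whose detection scores fall near the decision boundary
    # receive weights close to 1; confidently scored samples are
    # down-weighted, so challenging examples dominate the gradient.
    weights = torch.exp(-((scores - boundary) ** 2) / (2.0 * sigma ** 2))
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_sample).mean()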
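The clustering-and-purity step of the second topic can be sketched as follows. The use of k-means, the cluster count, and the purity definition (the fraction of labeled cluster members agreeing with the cluster's majority class) are plausible assumptions for illustration rather than the dissertation's exact design.

import numpy as np
from sklearn.cluster import KMeans

def estimate_labels_by_purity(features, labels, labeled_mask, n_clusters=50):
    # Cluster labeled and unlabeled samples together in the deep
    # representation space.
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    estimated = {}
    for c in range(n_clusters):
        in_cluster = assignments == c
        known = in_cluster & labeled_mask
        if not known.any():
            continue  # no labeled evidence for this cluster
        counts = np.bincount(labels[known])
        majority = int(counts.argmax())
        purity = counts[majority] / counts.sum()  # confidence score
        for i in np.where(in_cluster & ~labeled_mask)[0]:
            estimated[int(i)] = (majority, float(purity))
    return estimated

High-purity clusters contribute confidently estimated labels to the weighted re-training step, while low-purity clusters can simply be ignored.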
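Finally, a labeling time model of the kind mentioned in the third topic can be as simple as assigning a cost to each annotation operation. The per-operation times below (7.0 s to draw a bounding box, 1.6 s to answer a yes/no verification question) are illustrative assumptions, not the dissertation's measured constants.

def labeling_time(n_boxes_drawn, n_verify_clicks, t_draw=7.0, t_verify=1.6):
    # Drawing a bounding box from scratch costs far more than
    # answering a machine-posed yes/no verification question.
    return n_boxes_drawn * t_draw + n_verify_clicks * t_verify

# Example: machine-guided labeling replaces most box drawing with
# cheap verification clicks.
full = labeling_time(10000, 0)        # fully supervised labeling
guided = labeling_time(500, 9500)     # machine-guided labeling
print("estimated saving: {:.0%}".format(1.0 - guided / full))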
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Labeling cost reduction techniques for deep learning: methodologies and applications
Object localization with deep learning techniques
Efficient graph learning: theory and performance evaluation
Advanced techniques for object classification: methodologies and performance evaluation
Video object segmentation and tracking with deep learning techniques
Human appearance analysis and synthesis using deep learning
Object classification based on neural-network-inspired image transforms
Machine learning methods for 2D/3D shape retrieval and classification
Novel algorithms for large scale supervised and one class learning
Machine learning techniques for perceptual quality enhancement and semantic image segmentation
Deep learning for subsurface characterization and forecasting
Multimodal image retrieval and object classification using deep learning features
A data-driven approach to image splicing localization
Multiple pedestrians tracking by discriminative models
Visual knowledge transfer with deep learning techniques
Exploring complexity reduction in deep learning
Deep learning models for temporal data in health care
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Vision-based and data-driven analytical and experimental studies into condition assessment and change detection of evolving civil, mechanical and aerospace infrastructures
Hashcode representations of natural language for relation extraction
Asset Metadata
Creator
Wu, Chi-Hao (author)
Core Title
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
11/03/2017
Defense Date
09/07/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
active learning, boosted convolutional neural network, convolutional neural network, critically supervised learning, deep learning, domain adaptation, human-in-the-loop, OAI-PMH Harvest, object detection, pedestrian detection, unsupervised learning, weakly supervised learning
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Haldar, Justin (committee member), Nakano, Aiichiro (committee member)
Creator Email
chihaowu@usc.edu, nbadream@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-449314
Unique identifier
UC11265747
Identifier
etd-WuChiHao-5867.pdf (filename), usctheses-c40-449314 (legacy record id)
Legacy Identifier
etd-WuChiHao-5867.pdf
Dmrecord
449314
Document Type
Dissertation
Rights
Wu, Chi-Hao
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA