ADVANCED VISUAL PROCESSING TECHNIQUES FOR LATENT
FINGERPRINT DETECTION AND VIDEO RETARGETING
by
Jiangyang Zhang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
March 2014
Copyright 2014 Jiangyang Zhang
to my grandma and my parents.
Acknowledgments
On March 24, 2014, I am finally standing at the finish line of this long race. The
time is 5 years, 6 months and 27 days. For me, the finish line itself does not have
much meaning any more - it is just a symbol of completion. What really matters
is the entire journey of reaching this finish line.
In Fall 2008, I joined the PhD program without having a clear motivation about
why I needed this degree. My only reasons seemed to be quite simple: first, my
dad is a PhD so maybe I should be one as well, and second, I got the fellowship
offer from USC - not too bad a school. But what I was not sure about was whether
I would really like academic research. About one year into the program, I became
almost certain that the academic path was not for me - it was something I would
not hate but was definitely not passionate about. Afterwards, I had many opportunities
to quit this program but eventually decided not to, because I found that becoming a
software engineer at a big company in Silicon Valley is probably the most uninspiring
thing on earth.
It is very easy for us to say we love something, but finding the one thing that
truly obsesses us is not. Our entire life is a process of "dream,
explore, discover", and I think the ultimate goal of my life is to find the one
thing to which I was born to devote my life. This became the main theme of my
past six years, and I call it the process of soul searching.
Everyone is an artist of his/her own life. My piece of artwork for PhD cannot
be completed without the following people:
First, I would like to thank Prof. Jay Kuo for his great supervision and understanding
in the past six years. My favorite words from Prof. Kuo are "aim high,
accept low". It sounds simple yet contains great wisdom - I could not tell how
much I benefited from it. It is not easy to deal with such a non-typical PhD student
as me, and I really appreciate the freedom and understanding I received under
Prof. Kuo's supervision.
I would like to thank all my friends from the Melton Foundation. The nine-year
experience as a Melton Fellow totally shaped my values and made me become who
I am today. I would like to name a few great people here: Mr. William Melton,
Prof. Huilan Ying, Lili Yao, Li Li, Li Zhen, Zhimin Ren, Feng Ji, Wa Yuan,
Jiayuan Meng, Weichun Yao, Jing Yu, Wenrong Ji, Wei Dai, Yujing Zhang, Jing
Ji, Gu Ye, Xingyun Liao, Jiahui Huang, Yanlin Chen, Zhou Yu, Jiayu Liu, Fang
Cai, Qifei Zhu, etc.
I would like to thank my friends in Los Angeles, who made my six years of PhD
life much less boring. The best roommate ever, Yu Zhao, my labmates: Xingze He,
Hang Yuan, Shangwen Li, Xiang Fu, Jian Li, Jing Zhang, Sanjay Purushotham,
Martin Gawescki, Sachin Chachada, Jia He. And also, friends including: Lulu
Cao, Chupei Zhang, Lu Liu, Chenhong Fang, Xiaofan Niu, Shiyi Zhang, Fei Li,
Qianyu Liu, Dezhi Kang, Xiaoyang Yang, Jianqiao Huang, Cong Fang, Rui Luo,
Jun Deng, Xin Xu, etc.
I would like to thank everybody who accompanied me throughout the entire
journey of GymFlow and at the Viterbi Startup garage. The best co-founder ever,
Jimmy Liu, advisor Prof. Ashish Soni and Prof. Helena Yli-Renko, teammates Nhi
Duong, Aria Malboubi, David Kale, as well as Dr. Justine Gilman. Great friends
at the garage: Gregory Lou, Grace Owh, Ted Hadjisavas, Jonathan Sutherland
and Amir Raminfar, etc.
I would like to thank all my outdoor friends for the great time we had during
numerous adventurous trips: Weiyi Chen, Yuchi Che, Dongqing Li, Justin Zhang,
Xiaolan Xu, Zheng Wu, Zoe Zou, Han Yan, etc.
I would like to thank all my teammates at OneWay: Chenchen Zhang, Xueqiao
Ma, Yeming Chen, Chenni Xu, Junyu Cao, Weinan Wu. We are doing something
truly exceptional.
Lastly, I would like to thank my grandma, the most influential person in my life.
It was her education that always inspired me to take adventures and chase my
dream. I would like to thank my parents, my brother Claus, and my sister Wenyi
for their endless support throughout these years.
Jiangyang Zhang
March 31, 2014
Los Angeles, USA
Contents
Dedication ii
Acknowledgments iii
List of Figures ix
List of Tables xv
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Review of Previous Work . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Latent Fingerprint Segmentation . . . . . . . . . . . . . . . 4
1.2.2 Content-aware Image and Video Resizing . . . . . . . . . . . 5
1.2.3 Quality Assessment for Image Retargeting . . . . . . . . . . 8
1.3 Contribution of the Research . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background Review 12
2.1 Total-Variation Model for Latent Segmentation . . . . . . . . . . . 12
2.1.1 Structured Noise in Latent Fingerprint Images . . . . . . . . 13
2.1.2 The Total-Variation Model . . . . . . . . . . . . . . . . . . . 17
2.1.3 Multiscale Feature Selection Property of TV Models . . . . . 19
2.2 Content-Aware Image and Video Resizing . . . . . . . . . . . . . . 20
2.2.1 Image Retargeting . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Video Retargeting . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Quality Assessment for Image Retargeting . . . . . . . . . . . . . . 25
2.3.1 Global Structural Distortion . . . . . . . . . . . . . . . . . . 26
2.3.2 Local Detail Distortion . . . . . . . . . . . . . . . . . . . . . 27
2.3.3 Loss of Salient Information . . . . . . . . . . . . . . . . . . . 28
3 Adaptive Total Variation Model for Latent Fingerprint Detection
and Segmentation 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Segmentation with Adaptive TV-L1 Model . . . . . . . . . . . . . . 34
3.2.1 The TV-L1 Model . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Multiscale Feature Selection Property of TV-L1 . . . . . . . 36
3.2.3 The Adaptive TV-L1 Model . . . . . . . . . . . . . . . . . . 38
3.2.4 Scale Parameter Estimation . . . . . . . . . . . . . . . . . . 42
3.2.5 Region-of-Interest Segmentation . . . . . . . . . . . . . . . . 44
3.3 Segmentation with Directional Total-Variation Model (DTV) . . . . 45
3.3.1 The TV-L2 (ROF) Model . . . . . . . . . . . . . . . . . . . 46
3.3.2 The Directional Total-Variation Model . . . . . . . . . . . . 46
3.3.3 Orientation Field Estimation . . . . . . . . . . . . . . . . . . 50
3.3.4 Region-of-Interest Segmentation . . . . . . . . . . . . . . . . 51
3.4 Segmentation with Adaptive Directional Total-Variation Model (ADTV) 52
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.1 Results of Good, Bad and Ugly Fingerprint Processing . . . 55
3.5.2 Comparison with Other Segmentation Methods . . . . . . . 57
3.5.3 Feature Extraction Accuracy . . . . . . . . . . . . . . . . . . 58
3.5.4 Fingerprint Matching Results . . . . . . . . . . . . . . . . . 60
3.5.5 Comparison with Other TV-based Models . . . . . . . . . . 61
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Texture-Aware Image Resizing and Compressed Domain Video
Retargeting 66
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Region-Adaptive Texture-Aware Image Resizing . . . . . . . . . . . 69
4.2.1 Impact of Texture Regularity on Resizing . . . . . . . . . . . 69
4.2.2 Region-adaptive Image Resizing . . . . . . . . . . . . . . . . 71
4.3 A Compressed Domain Video Retargeting System . . . . . . . . . . 76
4.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Partial Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Compressed Domain Video Resizing . . . . . . . . . . . . . . . . . . 79
4.5.1 Scene Change Detection . . . . . . . . . . . . . . . . . . . . 80
4.5.2 Visual Importance Analysis . . . . . . . . . . . . . . . . . . 81
4.5.3 Optimum Cropping . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.4 Column Mesh Deformation . . . . . . . . . . . . . . . . . . . 88
4.5.5 Transform Domain Block Resizing . . . . . . . . . . . . . . . 95
4.6 Re-encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6.1 Macroblock Type Selection . . . . . . . . . . . . . . . . . . . 97
4.6.2 Motion Vector Refinement . . . . . . . . . . . . . . . . . . . . 98
4.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7.1 Texture-Aware Image Resizing . . . . . . . . . . . . . . . . . 99
4.7.2 Compressed-domain Video Retargeting . . . . . . . . . . . . 101
4.7.3 Computational Complexity Analysis . . . . . . . . . . . . . 107
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5 Objective Quality Assessment for Image Retargeting 111
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 GLS Quality Index for Retargeted Images . . . . . . . . . . . . . . 113
5.2.1 Overview of System Framework . . . . . . . . . . . . . . . . 114
5.2.2 Saliency-based Classification . . . . . . . . . . . . . . . . . . 114
5.2.3 SIFT Mapping and Mesh Formulation . . . . . . . . . . . . 116
5.2.4 Extraction of Features . . . . . . . . . . . . . . . . . . . . . 117
5.2.5 Feature Fusion and Model Selection . . . . . . . . . . . . . . 120
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3.2 Test Methodology . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3.3 Comparison of Feature Fusion Methods . . . . . . . . . . . . 123
5.3.4 Comparison of Objective Quality Indices . . . . . . . . . . . 125
5.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6 Conclusion and Future Work 133
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . 135
Reference List 137
List of Figures
2.1 Three types of fingerprints: rolled, plain and latent. . . . . . . . . . 13
2.2 Illustration of six types of structured noise in latent fingerprint images. 14
2.3 Comparison of distributions of three features (mean, variance and
coherence) in the foreground and background areas of plain and
latent fingerprints. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Feature selection based on the TV-L1 model for a latent fingerprint
image: input image f (leftmost) and its TV-L1 decomposed components
u and v, with the λ value shown in the subscript. As λ increases,
only features of smaller scales are extracted to the texture
output v, while features of larger scales are kept in the cartoon u. . . 20
2.5 The importance of image/video retargeting: media content needs to
be adapted to display devices of various sizes and aspect ratios. . . . 21
2.6 Seam carving [8] for content-aware image resizing. . . . . . . . . . . 22
2.7 Content-aware image resizing guided by mesh deformation [75]. . . 23
2.8 Seam carving [59] for content-aware video resizing using a graph-cut
approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.9 Content-aware video resizing technique proposed in [77]. It com-
bines cropping and warping by forcing all informative video content
inside the target video cube. . . . . . . . . . . . . . . . . . . . . . . 25
2.10 A scalable content-aware video resizing technique guided by opti-
mized motion path-lines [87]. . . . . . . . . . . . . . . . . . . . . . 26
2.11 Two examples of global structural distortion. Upper row: the origi-
nal image of face (left) and the retargeted result by [8] (right). Lower
row: the original image of lotus (left) and the retargeted result by
scaling (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
ix
2.12 Illustration of local region distortion. Left: the original image of
pencil. Right: the retargeted result using the seam carving method
in [8] with three zoomed-in local regions. . . . . . . . . . . . . . . . 28
2.13 Illustration of loss of salient information. (a): original image of
Marble and the cropping result. The yellow box shows the optimum
cropping result. Middle: a better retargeting result by [60] . . . . . 29
3.1 Illustration of the boundary signal problem in TV-L1 decomposition:
a small amount of structured-noise edge signal is still kept in the
texture v (left), and signals along the dashed line depicted in f, u
and v (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Plots of the response as a function of λ for several pixels in different
latent fingerprint images. It has a sharp peak located near λ = 2.0
in the fingerprint region, while it reaches the maximum at different
λ values in other regions. . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Distributions of the variance feature for the foreground and back-
ground region in f and v, respectively. . . . . . . . . . . . . . . . . 45
3.4 (a) Top: original image f. Bottom: texture output v after decomposition
by the TV-L2 model [61]. (b) Texture output v for the orientation
vector a in four different directions. Top: a = (0, 1) and a = (1, 0).
Bottom: a = (√2/2, √2/2) and a = (−√2/2, √2/2). . . . . . . . . . . 48
3.5 Essential steps for computing a(x). From left to right: original
image f, coarse orientation estimation o(x), orientation smoothing
O(x) and coherency evaluation c(x). . . . . . . . . . . . . . . . . . . 52
3.6 Experimental results of three latent fingerprints with good quality
(from left to right): original image f, scale parameter λ(x), texture
output v, orientation vector a(x) and the final segmentation result. . 56
3.7 Three input image types for latent matching. (a) Type-1: with-
out any segmentation, (b) Type-2: segmentation mask over original
image f, (c) Type-3: segmentation mask over texture layer v, (d)
the corresponding mated rolled fingerprint. . . . . . . . . . . . . . . 58
3.8 The cumulative matching curves (CMC) of all three latent fingerprint
input types for good-quality latent fingerprints. . . . . . . . . . 61
3.9 Performance comparison of the proposed ADTV model and two
other TV-based models. First row: original image f, texture out-
put v of TV-L2 [61], TV-L1 [12] and the proposed ADTV model
(from left to right). Second row: distribution of variance feature in
the foreground and background areas. Third row: the segmentation
result based on variance feature. . . . . . . . . . . . . . . . . . . . . 62
3.10 Comparison of the cumulative matching curves (CMC) of the pro-
posed ADTV model and two other TV-based models. . . . . . . . . 63
3.11 Experimental results of latent fingerprints with ugly quality. From
left to right: original image f, scale parameter λ(x), orientation
vector a(x), texture output v and the final segmentation result. . . 64
3.12 Experimental results of latent fingerprints with good quality. From
left to right: original image f, scale parameter λ(x), orientation
vector a(x), texture output v and the final segmentation result. . . 65
3.13 Experimental results of latent fingerprints with bad quality. From
left to right: original image f, scale parameter λ(x), orientation
vector a(x), texture output v and the final segmentation result. . . 65
4.1 Image resizing results for brick wall (regular texture) and sky (stochas-
tic texture) using three different schemes. . . . . . . . . . . . . . . . 70
4.2 Two types of control vertices for mesh warping: quad and contour. . 73
4.3 (a) Texture patch selection guided by the resized illumination map,
(b) Priority-queue-based texture placement following spiral order. . 76
4.4 The block diagram of the proposed compressed-domain video retar-
geting system that consists of three stages (or modules): 1) the par-
tial decoding stage, 2) the compressed domain video resizing stage,
and 3) the re-encoding stage. . . . . . . . . . . . . . . . . . . . . . . 78
4.5 The block-diagram of the compressed-domain video resizing stage.
This module takes reconstructed DCT coefficients as input and outputs
DCT coefficients of the retargeted video. . . . . . . . . . . . . . 80
4.6 Top: the manual scene segmentation result for a 1200-frame segment
of the big buck bunny sequence. Bottom: the percentage of block
change T_c of each frame. The sharp peaks in the T_c curve closely
match the manual segmentation result. . . . . . . . . . . . . . . . . 81
4.7 Illustration of motion map generation using motion vectors from
sequence coastguard. From left to right: one video frame, the orig-
inal motion vector map, the compensated object motion, and the
final motion map (after applying temporal filtering on the compen-
sated motion map). . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.8 Illustration of the visual importance analysis procedure: frames of
the entire scene are analyzed and three maps (saliency, texture and
motion) are generated. Being fused together, they form the final
visual importance map. . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 Left: the average column importance curve for different cropping
factors in the cropping range. Each point in this curve corresponds
to the best window of a given length. The optimum cropping fac-
tor and its corresponding window maximize the average column
importance within the cropping window. Right: the optimum crop-
ping window and the average visual importance for each column.
Columns marked in blue represent the region that falls inside
the cropping window, while columns marked in red would be
discarded after cropping. . . . . . . . . . . . . . . . . . . . . . . . . 89
4.10 The column mesh used for compressed-domain video resizing. The
mesh M = {V, C} includes a set of vertices V^t = {v_0^t, v_1^t, v_2^t, ..., v_n^t}
and a set of columns C = {c_1, c_2, ..., c_n}. . . . . . . . . . . . . . . 90
4.11 The impact of using motion preservation in the column mesh defor-
mation. Top: resizing results without considering motion preserva-
tion and the corresponding column vertex movement paths. The
tree is resized inconsistently at different frames. Bottom: resizing
results that consider motion preservation and the corresponding
column vertex movement path. The tree size undergoes more con-
sistent transformation throughout the entire video sequence. . . . . 93
4.12 Illustration of the supporting area: each macroblock of the
output video frame is resized from its corresponding supporting area
in the original frame. . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.13 Motion vector of a retargeted block, mv, is estimated using the
motion vectors in its supporting area: mv_1, mv_2 and mv_3. . . . . . 99
4.14 Visual comparison of re-sized images using the proposed algorithm
and [8,60,75]. From top to bottom: getty, blueman and boy. . . . . 100
4.15 Performance comparison of the proposed solution versus the seam
carving method [59] for sequence rat and roadski. From left to right:
the original video sequence, the result of seam carving [59] and our
result. The seam carving method is incapable of preserving the
shape of prominent edges, as can be observed from the distortions
on the curb-line in the rat sequence and the road-lines of the roadski
sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.16 Performance comparison of the proposed solution versus the pixel-
warp method [38] for sequence waterski. From left to right: the
original video sequence, the result of [38] and our result. The pixel-
warp method over-squeezes the water wave region of the waterski
sequence, leading to noticeable artifacts. In contrast, our method
incorporates cropping into the whole procedure and performs better
in preserving the original content. . . . . . . . . . . . . . . . . . . . 103
4.17 Performance comparison of the proposed solution versus the approach
by Wang et al. [87] for sequence big buck bunny, car and building.
From left to right: the original video sequence, the result of [87]
and our result. Our method achieves comparable results in terms
of visual quality, yet it has a lower computational cost and memory
consumption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.18 Pairwise comparison results of 56 user study participants, which
show that users have a preference for the visual quality of our solu-
tion over the other three benchmarking methods proposed in [38,
59,87]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.1 The system framework in computing the proposed GLS quality
index for retargeted images. . . . . . . . . . . . . . . . . . . . . . . 114
5.2 Classification based on the saliency map histogram analysis for two
representative images: the original image I (left), the saliency map
(middle) and its histogram (right). . . . . . . . . . . . . . . . . . . 116
5.3 SIFT feature mapping and graph formulation between the original
image and its retargeted result: (a) original image I (top), matched
SIFT features (middle) and its formulated graph G = (V, E) (bottom);
(b) retargeted result Î_i (top), matched SIFT features (middle)
and its formulated graph Ĝ_i = (V̂_i, Ê_i) (bottom). . . . . . . . . . 129
5.4 Log-polar spatial representation scheme [55]. This example shows
the 5-bit (32 regions) log-polar representation of a triangle: the
position and orientation codes of each triangle node. . . . . . . . . . 130
5.5 The degree of local region distortion is measured by using the graph
patch similarity, where the Euclidean distance of local patches of
matched graph nodes are computed and summed up over the entire
graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.6 Illustration of the reverse mapping from the matched SIFT nodes
in the retargeted image to the source image to compute the impact
area P̂_i in the source image. The white region in the bottom left
indicates a region where the underlying information can be found
in the retargeted image, while the black region means this part of
the information is lost after retargeting. . . . . . . . . . . . . . . . . 131
5.7 Illustration of the information completeness feature computation:
the original image (left top), the saliency map (left bottom), the
segmentation of the saliency map into the critical region (white),
the important region (gray) and the ordinary region (black), and
the impact area P̂_i encircled by the green dashed line. . . . . . . . . 131
5.8 An example where the proposed GLS index is strongly correlated
with the subjective rank (with Kendall rank coefficient τ = 0.857).
The original and eight retargeted images of the Obama image and
the corresponding subjective rank [58] and the objective rank computed
using the GLS index are shown. . . . . . . . . . . . . . . . . . 132
5.9 An example where the proposed GLS index is poorly correlated with
the subjective rank (with Kendall rank coefficient τ = 0.357). The
original and eight retargeted images of the Buddha image and the
corresponding subjective rank [58] and the objective rank computed
using the GLS index are shown. . . . . . . . . . . . . . . . . . . . . 132
List of Tables
3.1 Augmented Lagrangian method for our proposed adaptive TV-L1
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Augmented Lagrangian method for our proposed DTV model. . . . 50
3.3 Augmented Lagrangian method for our proposed ADTV model. . . 53
3.4 Computational time for the proposed ADTV-based latent finger-
print segmentation algorithm. . . . . . . . . . . . . . . . . . . . . . 56
3.5 Performance comparison of three segmentation algorithms. . . . . . 57
3.6 Feature extraction accuracy with and without ADTV-based seg-
mentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 Scene change detection results on two test sequences: big buck
bunny and elephants dream. . . . . . . . . . . . . . . . . . . . . . . 96
4.2 Computational time for each procedure. . . . . . . . . . . . . . . . 101
4.3 Performance comparison of re-encoding using the proposed motion
vector refinement approach versus full search. . . . . . . . . . . . . 106
4.4 Complexity analysis for retargeting the big buck bunny sequence for
DCT domain versus the spatial domain. . . . . . . . . . . . . . . . 108
4.5 Complexity analysis for retargeting the roadski and building sequence
for DCT domain versus the spatial domain. . . . . . . . . . . . . . 109
5.1 Kendall rank coefficient of different fusion models for RetargetMe
database [58] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.2 LCC, SROCC, RMSE and OR of different fusion models for CUHK
database [46] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3 Performance comparison of five objective image QoE indices for
RetargetMe database [58] . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Comparison with other objective image QoE metrics for CUHK
database [46] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Chapter 1
Introduction
1.1 Significance of the Research
Latent fingerprint identification plays a critical role for law enforcement agencies
in identifying and convicting criminals. An important step in an automated fingerprint
identification system (AFIS) is the process of fingerprint segmentation.
While a tremendous amount of effort has been made on plain and rolled fingerprint
segmentation, latent fingerprint segmentation remains a challenging
problem. Collected from crime scenes, latent fingerprints are often mixed with
other components such as structured noise or other fingerprints. Existing fingerprint
recognition algorithms fail to work properly on latent fingerprint images,
since they are mostly applicable under the assumption that the image is already
properly segmented and there is no overlap between the target fingerprint and
other components.
Based on the collection procedure, fingerprint images can generally be divided
into three categories, namely, rolled, plain and latent [35]. While significant effort
has been made on developing segmentation algorithms for rolled/plain fingerprints,
latent fingerprint segmentation remains a challenging problem. Although
automated identification has already achieved high accuracy for plain/rolled fingerprints,
manual intervention is still necessary for latent print processing [35].
The difficulty mainly lies in: 1) the poor quality of fingerprint patterns in terms
of the clarity of ridge information, and 2) the presence of various structured noise
in the background. Traditional segmentation methods fail to work properly on
latent fingerprints, as they are based on many assumptions that are only valid
for rolled/plain fingerprints. In recent works on latent fingerprints [14, 22, 86],
the region-of-interest (ROI) is still manually marked and assumed to be known.
Undoubtedly, accurate and robust latent segmentation is an essential step towards
achieving automatic latent identification, and it is the focus of the first part of
this thesis.
The second and third parts of this thesis focus on the topic of content-aware
image and video resizing. The demand to adapt images and videos to display
devices of various aspect ratios and resolutions calls for new solutions to image
resizing (also called image retargeting). Traditional image and video resizing techniques
are incapable of meeting this requirement, since they may either discard
important information (e.g., cropping) or produce distortions by over-squeezing
the content (e.g., non-uniform scaling). Recently, several techniques have been proposed
for content-aware image resizing [8,60,75] and video resizing [38,59,77,87].
While all these approaches demonstrate promising results, two issues were still
left unaddressed.
The first issue is related to texture redundancy. Repetitive textures can be
modeled as a primitive texture element that is replicated according to certain placement
rules [56]. An ideal resizing solution for these textures would reduce
the total number of replications while keeping the primitive element intact. However,
previous resizing schemes usually leave the replication number unchanged
and distort the shape of the primitive texture element. The second unaddressed issue
involves the handling of compressed-format video data. All existing video retargeting
techniques are spatial-domain techniques, which means they are designed
for raw video data. This is difficult for practical usage, as most real-world
digital videos are mainly available in compressed format. To apply a spatial-domain
retargeting technique to a compressed video, we need to first decompress
the video back to raw format, apply the retargeting algorithm, and eventually
recompress the retargeted video so that it remains interoperable with other applications.
However, this process has many disadvantages. First of all, it requires
both compression and decompression, which is computation-intensive and time-consuming,
especially when the video server needs to support quality of service
to heterogeneous clients. Secondly, it requires a large amount of storage for the
intermediate decompressed video sequence, especially when several clients are
performing the same task on different video files. To reduce the overhead involved
in the decompression and recompression steps, a more efficient technique is to
conduct video retargeting in the compressed domain by directly manipulating the
compressed data.
The third part of this thesis deals with designing objective quality assessment
metrics for image retargeting. Objective image quality assessment has been extensively
studied in the literature [78]. In general, these methods can be divided into
three categories: full-reference (FR), reduced-reference (RR) and no-reference
(NR). However, traditional image quality metrics are not applicable in the context
of image retargeting. For FR and RR methods, one underlying assumption is that
the size of the original image matches that of the distorted image. Clearly,
this is not valid for retargeted results, as the original and retargeted images differ
significantly in terms of size and aspect ratio. On the other hand, NR methods are
not feasible either, since the retargeted image should preserve as much important
information from the original image as possible; referring to the original image
is thus indispensable for evaluating a retargeting result.
In our work, we attempt to address this issue with a novel quality metric
that accounts for the three major distortion types that occur in image retargeting.
We first start with an in-depth analysis of all three distortions, which include
global structural distortion, local detail distortion and loss of salient information.
Based on this knowledge, we aim to select features pertinent to image
retargeting quality and then use machine learning techniques to model the complex
process of feature pooling.
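As a toy illustration of this pooling step (not the model developed in Chapter 5), the sketch below trains a regressor that maps the three distortion features to a subjective quality score; a support vector regressor from scikit-learn stands in for the learned pooling function, and all feature values and scores are hypothetical.

import numpy as np
from sklearn.svm import SVR

# Hypothetical training data: one row per retargeted image, columns are
# (global structural distortion, local detail distortion, loss of salient
# information); y holds the corresponding mean subjective quality scores.
X_train = np.array([[0.10, 0.20, 0.05],
                    [0.40, 0.30, 0.20],
                    [0.80, 0.70, 0.60],
                    [0.20, 0.10, 0.10]])
y_train = np.array([4.5, 3.2, 1.4, 4.1])

# The regressor learns the (possibly non-linear) pooling from features to quality.
pooling_model = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)
predicted_quality = pooling_model.predict(np.array([[0.30, 0.25, 0.15]]))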
1.2 Review of Previous Work
1.2.1 Latent Fingerprint Segmentation
Segmentation of rolled and plain fingerprint images has been well studied in the
literature. In the early work of [49], segmentation was achieved by partitioning
the fingerprint image into blocks, followed by block classification based on gradient
and variance information. This method was further extended to a composite
method [48] that takes advantage of both the directional and variance approaches.
Ratha et al. [57] considered the gray-scale variance along the direction orthogonal
to the ridge-flow orientation as the key feature for block classification. In [9],
fingerprints were segmented using three pixel-level features (coherence, mean and
variance). An optimal linear classifier was trained for pixel-based classification,
and morphological operators were applied to obtain compact segmentation clusters.
Most recently, several studies [36,68] have been conducted to address the problem
of latent fingerprint segmentation. Karimi-Ashtiani and Kuo [36] used a projection
method to estimate the orientation and frequency of local blocks. After
projection, the distance between center-of-transient points measures the amount
of data degradation and is used for segmentation. Short et al. [68] formulated a ridge
model template and used the cross-correlation between a local block and the generated
template to assign one of six quality scores. Blocks with a high quality score
are labeled as foreground while the rest are treated as background. While the proposed
methods demonstrated improved performance in handling latent fingerprint
images, experimental results show that their performance is still limited by the
presence of structured noise.
Total-Variation-based (TV-based) image models have been widely used in the
context of image decomposition [7,85]. Among several well-known TV-based models,
the model using total variation regularization with an L1 fidelity term, denoted
the TV-L1 model, is especially suited for multiscale image decomposition and
feature selection [12,15]. A modified TV-L1 model was adopted in [15] to
extract small-scale facial features for facial recognition under varying illumination.
More recently, the authors proposed an adaptive TV-L1 model for latent fingerprint
segmentation in [90], where the fidelity weight coefficient is adaptively adjusted to
the background noise level. Furthermore, the Directional Total Variation (DTV)
model was formulated in [88] by imposing directional information on the TV
term, which proved to be effective for latent fingerprint detection and segmentation.
It appears that the TV-based image model with proper adaptation offers
a suitable tool for latent fingerprint segmentation. However, the performance of
both models in [88,90] was evaluated only subjectively, as no objective evaluation
was performed to determine whether the proposed schemes improved the matching
accuracy, which is the ultimate goal of fingerprint segmentation.
1.2.2 Content-aware Image and Video Resizing
Solutions for content-aware image resizing can be generally classified into discrete
approaches and continuous approaches [65]. In the discrete approach, resizing of an
image is achieved by removing a set of seams, which are defined as vertical or horizontal
paths of pixels with the least amount of importance. Cropping-based techniques [42,62]
identify the most important components in an image through saliency-based measures
and cut out a rectangular region as the retargeting result. Seam carving [8]
resizes the image by iteratively removing paths of pixels with the least amount
of energy. Studies show that no single retargeting operator can perform well on
all images, and the multi-operator approach [60] was proposed to combine three
different operators (seam carving, cropping and scaling) while optimizing the resizing
through an image similarity measure. Though this method takes advantage
of each individual operator, the computational cost entailed by optimizing the similarity
measure is too high for practical usage. The continuous approach to
image retargeting is formulated as a global optimization problem for mesh deformation.
In [75], a grid mesh is placed onto the image, and its new geometry is
computed so that the deformed mesh preserves the shape of salient regions while
allowing the non-salient regions to be squeezed or stretched.
Video retargeting is more challenging than image retargeting because of
the additional temporal coherence requirement and the need to preserve object
motion. Most proposed video retargeting techniques extend per-frame image-based
techniques with some temporal considerations. Based on the original idea of
seam carving [8], the authors of [59] generalized seams to surfaces and formulated the
video retargeting problem as a graph-cut problem on a three-dimensional spatial-temporal
cube. In [28], the author further extended seam carving for video
retargeting by allowing seams to be unconnected. Experimental results
demonstrated that under certain scenarios discontinuous seams can outperform
continuous ones.
Compared with discrete approaches, continuous approaches usually produce
smoother results and have more flexibility. Continuous video retargeting methods
[38, 76, 77, 80] use a variational formulation to design warping functions, such that
the shape of prominent regions is preserved while the non-salient regions are
allowed to be squeezed or stretched. Due to the presence of camera and dynamic
motion, temporally adjacent pixels do not necessarily contain the corresponding
object, so the object may deform inconsistently between frames, leading
to noticeable artifacts. This problem is addressed in [76] by explicit detection
of camera and object motions. In some scenarios, when the video contains large
areas of unimportant objects, removing the entire object parts proves better than
distorting them. In [77], the resizing technique combines both cropping and mesh
deformation in one global optimization. One limitation of these video retargeting
techniques is the memory requirement, as the whole video cube is required for
retargeting. This problem is addressed in [87], where the author proposed an efficient
yet scalable retargeting scheme that conducts video retargeting in three separate
stages: per-frame resizing, path-line optimization and motion-guided resizing. In
this way, the proposed method never requires storage of the whole video sequence.
Applying traditional resizing techniques, such as homogeneous scaling and cropping,
in the compressed domain has been widely studied in the literature. Using the
distributive property of the unitary orthogonal transform, several resizing techniques
were proposed that achieve DCT-domain downsizing/upscaling by a factor
of two. In [69], the algorithms were further extended to allow resizing with an arbitrary
ratio. Meanwhile, several systems were also proposed that allow cropping in
the compressed domain. Most recently, content-aware image resizing techniques
for the compressed domain have also been studied. In [20], the author proposed
a resizing scheme that uses DCT coefficients for the computation of the saliency
map, followed by block-level seam carving. Though it utilizes compressed-domain
features for the retargeting, the seam carving procedure still operates in the
spatial domain and requires full decoding of the image.
1.2.3 Quality Assessment for Image Retargeting
The first comparison study on image retargeting approaches was presented in [58].
In this work, the authors conducted a large-scale user study to compare the performance
of eight representative state-of-the-art image retargeting methods. In
addition to performance evaluations based on user responses, six different computational
image distance metrics were also evaluated and compared with actual
human perception. The authors concluded that none of the six measures is in
alignment with human rankings and suggested that other image features be used
for possibly better agreement. A similar study was conducted by Ma et al. [46], in
which a different image retargeting database was built and evaluated by human
viewers. Several existing objective measures were evaluated, and the authors suggested
that a better metric could be obtained by combining information about shape
distortion and content information loss. Although both works conducted in-depth
performance analysis for metric design, neither of them focused on proposing an
actual metric that can effectively conduct quality assessment for image retargeting.
Most recently, Liu et al. [43] proposed a quality assessment method for image retargeting
using a top-down approach. Although the approach is capable of measuring both
local and geometric distortions using the SSIM metric, it does not account for the issue
of information completeness, which is a key aspect of evaluating image retargeting
results.
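Both studies judge an objective measure by how well its ordering of retargeted results agrees with the ordering obtained from viewers. A minimal sketch of that comparison, using SciPy's Kendall rank correlation on made-up numbers, is shown below; the scores are purely illustrative.

from scipy.stats import kendalltau

# Hypothetical data for eight retargeted versions of one source image.
subjective_rank = [1, 2, 3, 4, 5, 6, 7, 8]                           # from a user study
objective_score = [0.91, 0.85, 0.80, 0.83, 0.60, 0.55, 0.40, 0.35]   # from a candidate metric

# A higher objective score should correspond to a better (lower) subjective
# rank, so the scores are negated before correlating the two orderings.
tau, p_value = kendalltau(subjective_rank, [-s for s in objective_score])
print(f"Kendall tau = {tau:.3f} (p-value = {p_value:.3f})")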
1.3 Contribution of the Research
The goal of the first part of this research is to achieve accurate latent segmentation,
which is an essential step towards achieving automatic latent identification.
Existing fingerprint segmentation algorithms perform poorly on latent
prints, as they are mostly based on assumptions that are only applicable
to rolled/plain fingerprints. Our main contribution on this topic is proposing
three Total-Variation (TV)-based mathematical models for effective latent fingerprint
segmentation. The proposed schemes can be regarded as a preprocessing
technique for automatic latent fingerprint recognition. They also have strong potential
to be applied to other relevant applications, especially for processing images with
oriented textures.
We will first introduce two Total-Variation models (Adaptive TV-L1 and DTV)
as image decomposition schemes that facilitate effective latent fingerprint segmentation
and enhancement. Then we combine both models into one single model, the
Adaptive Directional Total-Variation model (ADTV). Based on the classical Total-Variation
model, the proposed ADTV model differentiates itself by integrating two unique
features of fingerprints, scale and orientation, into the model formulation. The
proposed model has the ability to decompose a single latent image into two layers
and locate the essential latent area for feature matching. The two spatially varying
parameters of the model, scale and orientation, are adaptively chosen according to
the background noise level and textural orientation, and effectively separate the
latent fingerprint from structured noise in the background. Experimental results
show that the proposed scheme provides effective segmentation and enhancement.
The improvements in feature detection accuracy and latent matching further justify
the effectiveness of the proposed scheme.
For the second part of this thesis, we provide solutions to the two unaddressed
issues of existing image and video retargeting techniques. To address the
issue of texture redundancy in image resizing, we first propose a region-adaptive
image resizing technique that considers texture properties and is capable of preserving
the underlying object structures. Different from existing methods, which
are guided mainly by a pixel-level saliency map, the proposed technique uses the
segmented region as the basic unit. Guided by region and contour information,
mesh warping is formulated as a non-linear least squares optimization problem,
which strives to preserve the local as well as the global object structures. Texture
redundancy is effectively reduced through pattern regularity detection and real-time
image synthesis. Experimental results demonstrate improved image quality over
state-of-the-art image resizing algorithms.
In addition, to address the issue of compressed-domain video data, we propose
a novel video retargeting system that operates directly on an intermediate
representation in the compressed domain, namely, the discrete cosine transform
(DCT) domain. In this way, we are able to avoid the computationally expensive
process of decompressing, processing, and recompressing. As the system uses the
DCT coefficients directly for processing, only minimal decoding of video streams is
necessary. Though there are overheads of entropy decoding and encoding of intermediate
symbols (e.g., DCT coefficients), the total overhead is significantly less than
that of decompression and recompression. The proposed system is targeted at the
latest H.264 coding standard yet can easily be applied to other video compression
standards as well. To the best of our knowledge, most existing approaches
related to this subject either are spatial-domain video retargeting techniques or
focus solely on image resizing in the compressed domain. Video retargeting in
the compressed domain still remains an open problem and will be addressed in our
research as well.
For the third part of this thesis, we start by analyzing the determining factors
of human visual perception of retargeted image quality and propose an effective
quality metric for content-aware image resizing. The metric extracts features
from the retargeted result and uses machine learning techniques to fuse them into
a single quality score. The feature design is based on three important factors:
global structural distortion, local detail distortion and loss of salient information.
Experimental results demonstrate that our metric is more effective and
general in evaluating the quality of image retargeting results than existing objective
metrics.
1.4 Organization of the Thesis
The rest of this thesis is organized as follows. The background of both topics
is briefly reviewed in Chapter 2. Our proposed Total-Variation-based
latent fingerprint segmentation solution is then introduced in Chapter 3. We
present a texture-aware image resizing technique and a compressed-domain video
resizing technique in Chapter 4. In Chapter 5, we propose a novel quality
assessment metric for the application of image retargeting. Finally, concluding
remarks and suggestions for future work are given in Chapter 6.
Chapter 2
Background Review
2.1 Total-Variation Model for Latent Segmentation
Fingerprint segmentation refers to the process of decomposing a fingerprint image
into two disjoint regions: foreground and background. The foreground, also called
the region of interest (ROI), consists of the desired fingerprints, while the background
contains noisy and irrelevant content that will be discarded in the following
processing steps. Accurate fingerprint segmentation is critical as it affects
the accurate extraction of minutiae and singular points, which are key features for
fingerprint matching. When feature extraction algorithms are applied to a fingerprint
image without segmentation, many false features may be extracted due
to the presence of the noisy background, eventually leading to matching errors in
the later stages. Therefore, the goal of fingerprint segmentation is to discard the
background, reduce the number of false features, and thus improve the matching
accuracy.
Based on the collection procedure, fingerprint images can generally be divided
into three categories (see Fig. 2.1), namely, rolled, plain and latent [35]. Rolled
fingerprints are obtained by rolling the finger from one side to the other in
order to capture all ridge details of the fingerprint. Plain fingerprint images are
acquired by pressing the fingertip onto a flat surface. Both rolled and plain prints
are obtained in an attended mode, so they are usually of good visual quality
and contain sufficient information for reliable matching. On the contrary, latent
fingerprints are usually collected from crime scenes, in which the print is lifted from
object surfaces that were inadvertently touched or handled. The matching between
latents and rolled/plain fingerprints plays a crucial role in identifying suspects by
law enforcement agencies.

Figure 2.1: Three types of fingerprints: rolled, plain and latent.
2.1.1 Structured Noise in Latent Fingerprint Images
The difficulty of latent fingerprint segmentation mainly lies in two aspects. On
the one hand, the fingerprint itself is usually of very poor quality, often with smudged
or blurred ridges. It is very common that the image contains only a partial area of
the finger and that large nonlinear distortions exist due to pressure variations. As a
result, while a typical rolled fingerprint has around 80 minutiae, a latent fingerprint
contains only about 15 usable minutiae of reasonable quality [35].
On the other hand, the presence of various types of structured noise further
hinders proper segmentation of latent prints. Compared with the oscillatory
ridge structures of fingerprints, structured noise is usually of a much larger scale
and can appear in various forms. Based on appearance, structured noise can be
classified into six categories: arch, line, character, speckle, stain and others. They
are shown in Fig. 2.2 and elaborated below.

Figure 2.2: Illustration of six types of structured noise in latent fingerprint images.
1. Arch. The big arch is manually marked by crime-scene investigators to indi-
cate the possible existence of latent fingerprints in the region encircled by
the arch. The arch noise is considered to be the simplest type of structured
noise.
2. Line. The line noise may appear in the form of a single line or multiple
parallel lines. A single line is usually detected and removed using methods
based on the Hough transform [68]. Multiple parallel lines are easily confused
with fingerprints since they share many common features.
3. Character. This is the most common type of structured noise that appears in latent
fingerprint images. The characters may appear in various font types, sizes and
brightness, and can be either handwritten or typed.
4. Stain. It is generated when the finger, instead of being properly pressed, was
inadvertently smeared on a wet or dirty surface. Stain noise often appears
in a spongy shape with inhomogeneous brightness.
5. Speckle. As compared with lines and characters, the speckle noise tends to
consist of tiny-scale structures, which can be either regular (e.g., clusters of
small dots) or random (e.g., ink and dust speckles).
6. Others. A latent fingerprint image may contain other structured noise,
such as arrows, signs, etc. Similar to arch and character noise, these usually
consist of smooth surfaces with sharp edges.
The line, character and speckle noise often appear when the latent fingerprint
is lifted from the surface of a text document (e.g., maps, newspapers, checks).
For latent fingerprint segmentation, the main challenge lies in how to effectively
separate latent fingerprints, the relatively weak signal, from all the structured
noise in the background, which is often the dominant image component. Additional
complexity arises when structured noise overlaps with the fingerprint signal.
Previous methods proposed for fingerprint segmentation are mostly feature-based,
and features commonly used for segmentation include the mean, variance, contrast
and coherence, as well as their variants [9,23,67]. However, these methods may fail
to work properly on latent fingerprints, as they are based on many assumptions that
are only valid for rolled/plain fingerprints. For instance, in [9], the mean feature
was used since the background was assumed to be bright, and the variance feature
was used since the variance of background noise was assumed to be much lower
than that of fingerprint regions. However, these assumptions are no longer valid
in the context of latent fingerprint images.
Figure 2.3: Comparison of distributions of three features (mean, variance and
coherence) in the foreground and background areas of plain and latent fingerprints.
To evaluate the effectiveness of traditional segmentation features, we manually
segment a plain and a latent fingerprint image, and plot the distributions of three
segmentation features, namely, the mean, variance and coherence, for both foreground
and background regions. As shown in Fig. 2.3, the distributions of
these features in foreground and background regions are well separated for plain
fingerprints, while those of latent fingerprints have significant overlaps. These
overlaps can be explained by two reasons. First, regions with structured noise
often have high contrast and coherent gradient orientations as well, so it is difficult
to differentiate them from fingerprints using these features. Second, the quality
of some latent fingerprints is so poor that they cannot be well characterized by
traditional fingerprint features. As a result, new features or models need to be
considered for more effective separation of latent fingerprints and structured noise.
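To make the feature comparison above concrete, the following sketch computes the three classical block-level segmentation features (mean, variance and gradient coherence) with NumPy. The block size and the structure-tensor-based coherence formula are common choices rather than the exact settings used later in this thesis.

import numpy as np

def block_features(img, block=16):
    """Compute mean, variance and gradient coherence for each image block.

    img: 2-D grayscale array. Returns three (H/block, W/block) feature maps.
    """
    h, w = img.shape
    gy, gx = np.gradient(img.astype(np.float64))
    nb_y, nb_x = h // block, w // block
    mean = np.zeros((nb_y, nb_x))
    var = np.zeros((nb_y, nb_x))
    coh = np.zeros((nb_y, nb_x))
    for i in range(nb_y):
        for j in range(nb_x):
            sl = np.s_[i * block:(i + 1) * block, j * block:(j + 1) * block]
            patch = img[sl].astype(np.float64)
            mean[i, j] = patch.mean()
            var[i, j] = patch.var()
            # Coherence from the local structure tensor: 1 for a perfectly
            # oriented (ridge-like) block, 0 for an isotropic block.
            gxx = (gx[sl] ** 2).sum()
            gyy = (gy[sl] ** 2).sum()
            gxy = (gx[sl] * gy[sl]).sum()
            coh[i, j] = np.hypot(gxx - gyy, 2.0 * gxy) / (gxx + gyy + 1e-12)
    return mean, var, coh

For plain prints, thresholding such feature maps already separates foreground from background; Fig. 2.3 illustrates why the same thresholds break down on latent images.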
2.1.2 The Total-Variation Model
TV-based image models have been widely studied for the task of image
decomposition. Among many existing TV models, the total variation regularization
model with an L1 fidelity term, denoted TV-L1, is suitable for multiscale
image decomposition and feature selection. In the context of facial recognition
under varying illumination, a modified TV-L1 model was proposed in [15] to separate
small-scale facial features from nonuniform illumination, which leads to
improved recognition results.
Similar to other TV-based image models (e.g., the ROF model [61]), the TV-L1
model decomposes an input image, f, into two signal layers: the cartoon u, which
consists of the piecewise-smooth component in f, and the texture v, which contains
the oscillatory or textured component in f. The decomposition

f = u + v

is obtained by solving the following variational problem:

min_u ∫ |∇u| + λ ∫ |u − f| dx,   (2.1)

where f, u and v are functions of image gray-scale intensity values in R², ∇u is
the gradient of u and λ is a constant weighting parameter. We call ∫ |∇u| and
∫ |u − f| the total variation of u and the fidelity term, respectively.
The TV-L1 model is difficult to compute due to the nonlinearity and non-differentiability
of the total variation term as well as the fidelity term. A gradient
descent approach was proposed in [12], which solves for u as a steady-state solution
of the Euler-Lagrange equation of (2.1):

∇ · (∇u / |∇u|) + λ (f − u) / |f − u| = 0.   (2.2)

Although (2.2) is easier to implement, the gradient descent approach is slow
due to a small time step imposed by the strict stability constraint. That is, the
term (f − u)/|f − u| is non-smooth at f = u, which forces the time step to be very
small when the solution is approaching the steady state. In addition, |∇u| in the
term ∇u/|∇u| might be zero, and a small positive constant needs to be added to
avoid division by zero, which results in an inexact solution.
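As a concrete illustration of this iteration, the following sketch implements the regularized gradient descent for (2.2) with NumPy. The step size, the smoothing constant eps, and the iteration count are illustrative assumptions, not the settings used in [12]; boundary handling is deliberately crude.

import numpy as np

def tv_l1_gradient_descent(f, lam, dt=0.1, eps=1e-3, n_iter=500):
    """Regularized gradient descent for the TV-L1 model, following Eq. (2.2).

    f: 2-D grayscale image (float array); lam: fidelity weight lambda.
    eps smooths the non-differentiable terms |grad u| and |f - u|.
    """
    u = f.astype(np.float64).copy()
    for _ in range(n_iter):
        # Forward differences for the gradient (last row/column replicated).
        ux = np.diff(u, axis=1, append=u[:, -1:])
        uy = np.diff(u, axis=0, append=u[-1:, :])
        mag = np.sqrt(ux ** 2 + uy ** 2 + eps ** 2)
        px, py = ux / mag, uy / mag
        # Backward differences approximate the divergence of the normalized gradient.
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        # Smoothed sign of the L1 fidelity term (f - u) / |f - u|.
        fidelity = (f - u) / np.sqrt((f - u) ** 2 + eps ** 2)
        u += dt * (div + lam * fidelity)
    return u  # cartoon layer u; the texture layer is v = f - u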
Many numerical methods have been proposed to improve this method. One
approach is the split Bregman iteration [26,27,51,81], which uses functional split-
ting and Bregman iteration for constrained optimization. The equivalence of
the split Bregman iterations with the alternating direction method of multipliers
(ADMM), the Douglas-Rachford splitting and the augmented Lagrangian method
can be found in [11,19,63,81].
The use of the TV model is motivated by the analogy between the problem of
TV decomposition and latent fingerprint segmentation. As discussed in Section
2.1.1, the key challenge for latent segmentation is to effectively separate the latent
fingerprint from various structured noise. Structured noise (e.g., arch, character),
with its smooth inner surface and crisp edges, shares many characteristics
with components in the cartoon layer u. On the other hand, the fingerprint pattern,
which consists of oscillatory ridge structures, matches the characteristics of the
texture components in v. This analogy suggests that the TV model could
be a viable solution to our problem.
2.1.3 Multiscale Feature Selection Property of TV Models
The TV-L1 model distinguishes itself from other TV-based models by its unique
capability of intensity-independent multiscale decomposition. It has been shown
both theoretically [12] and experimentally [85] that the fidelity weight coefficient,
λ, in (2.1) is closely related to the scale of features in the texture output v. This
relation is supported by the analytic example in [12]. If f is equal to a disk signal,
denoted by B_r, which has radius r and unit height, the solution of (2.1) is given
as:

u_λ(x) = 0                               if 0 < λ < 2/r,
u_λ(x) = f(x)                            if λ > 2/r,
u_λ(x) = c · f(x), for any c ∈ [0, 1],   if λ = 2/r.

In other words, depending on the λ value, the TV-L1 functional is minimized
by either 0 or the input f. This shows that the TV-L1 model has the ability to select
geometric features based on a given scale. Fig. 2.4 shows an example of feature
selection on a latent fingerprint image.

As shown in Fig. 2.4, the numerical results match the preceding analysis. The
fidelity weight coefficient λ controls feature selection by manipulating the scale
of content captured in each image layer. When λ is very small (e.g., λ = 0.10),
u captures the inhomogeneous illumination in the background while most fine
structures are kept in v. When λ = 0.30, large-scale objects (arch) are captured in
u and separated from structures of smaller scales (characters). As λ continues to
increase, only small-scale structures (fingerprint and noise) are left in v while the
major content of f is extracted to u.
Figure 2.4: Feature selection based on the TV-L1 model for a latent fingerprint image: input image f (leftmost) and its TV-L1 decomposed components u and v, with the λ value shown in the subscript (λ = 0.10, 0.30, 0.70). As λ increases, only features of smaller scales are extracted to the texture output v, while features of larger scales are kept in the cartoon u.
2.2 Content-Aware Image and Video Resizing
The demand to adapt images to display devices of various aspect ratios and resolutions calls for new solutions to image resizing (also called image retargeting). Traditional image resizing techniques are incapable of meeting this requirement since they may either discard important information (e.g., cropping) or produce distortions by over-squeezing the content (e.g., non-uniform scaling). Therefore, a better resizing approach should take into consideration the actual visual content rather than simply account for the geometric constraints of the output display. Recently, several works have presented content-aware retargeting methods. These works can be classified into two basic categories, discrete and continuous, which will be described in detail below.
Figure 2.5: The importance of image/video retargeting: media content needs to be adapted to display devices of various sizes and aspect ratios (e.g., 16:9, 4:3, 1.5:1, 1:1, 16:10, 21:9).
2.2.1 Image Retargeting
Discrete Approach
The most typical approach in this category is the seam carving approach [8]. This approach uses dynamic programming to compute the optimum seam for removal. A seam is defined as a path of pixels that follows two basic rules: 1) connectivity, and 2) monotonicity. The image is resized to the desired size by continuously removing seams from the image. The seam computation is based on an importance map generated from the original image. The importance map can be computed in various ways, from simple measures such as the gradient strength to more sophisticated ones such as the visual saliency map. However, as the authors state in [8], the optimum importance map depends on the image itself, and there is no single measure that outperforms all others for all images.
In Fig. 2.6, the procedure of seam carving resizing is shown. An energy map is computed based on gradient information, and vertical seams are computed and continuously removed until the image reaches the desired final size.

Figure 2.6: Seam carving [8] for content-aware image resizing: original image, energy map, removed seams, and result.
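To illustrate the dynamic-programming seam computation described above, the following is a minimal NumPy sketch for removing vertical seams from a grayscale image; the gradient-magnitude energy map and all function names are our own simplifications rather than the exact formulation of [8].

import numpy as np

def energy_map(img):
    # simple importance map: gradient magnitude of the grayscale image
    gy, gx = np.gradient(img.astype(float))
    return np.abs(gx) + np.abs(gy)

def find_vertical_seam(e):
    # dynamic programming: cumulative minimum energy, moving at most one column per row
    h, w = e.shape
    cost = e.copy()
    back = np.zeros((h, w), dtype=int)
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            k = lo + int(np.argmin(cost[i - 1, lo:hi + 1]))
            back[i, j] = k
            cost[i, j] += cost[i - 1, k]
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):      # backtrack the monotone, connected seam
        seam[i] = back[i + 1, seam[i + 1]]
    return seam

def remove_vertical_seam(img, seam):
    return np.array([np.delete(row, s) for row, s in zip(img, seam)])

def seam_carve(img, target_width):
    while img.shape[1] > target_width:
        img = remove_vertical_seam(img, find_vertical_seam(energy_map(img)))
    return img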
Continuous Approach
The warping-based approach is the most common continuous approach for image retargeting. It views the image as a continuous domain and computes a warping of the image. The warping is usually guided by mesh deformation [75]. Each point is assigned a local transformation, which enables important regions in the image to be preserved after mesh deformation. On the other hand, the less important content is allowed to be squeezed or stretched. The mesh deformation is computed through a global optimization process that minimizes a total energy term. Different from the discrete approach, which removes image content, the continuous approach does not remove any content but scales it non-uniformly (see Fig. 2.7).
Figure 2.7: Content-aware image resizing guided by mesh deformation [75].
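To convey the warping idea in a drastically simplified, axis-aligned setting (not the actual mesh formulation of [75]), the sketch below assigns every image column a scale factor: deviations from uniform scale are penalized in proportion to a per-column saliency weight, and the factors are constrained to sum to the target width, which gives a closed-form solution through a single Lagrange multiplier. All names and the saliency weighting are our own illustrative assumptions.

import numpy as np

def column_scale_factors(saliency, target_width):
    # minimize  sum_i s_i * (w_i - 1)^2   subject to   sum_i w_i = target_width
    s = np.asarray(saliency, dtype=float) + 1e-8      # keep weights strictly positive
    n = len(s)
    mu = 2.0 * (n - target_width) / np.sum(1.0 / s)   # Lagrange multiplier
    return 1.0 - mu / (2.0 * s)                       # salient columns stay close to scale 1

def retarget_width(img, saliency, target_width):
    # resample each row according to the accumulated (non-uniform) column widths;
    # assumes the computed factors remain positive, i.e. the shrink is not too aggressive
    w = column_scale_factors(saliency, target_width)
    x_src = np.cumsum(w) - w / 2.0                    # new positions of the original column centers
    x_out = np.arange(target_width) + 0.5
    return np.stack([np.interp(x_out, x_src, row) for row in img.astype(float)])

In this simplified form, columns with high saliency keep a scale factor close to one while homogeneous regions absorb most of the shrinking or stretching, which is the same behavior the mesh-based methods achieve in two dimensions.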
2.2.2 Video Retargeting
Discrete Approach
Video retargeting is more challenging than image retargeting because of the additional temporal coherence requirement and the need to preserve object motions. The discrete seam carving approach [8] can be extended to resize video. However, the dynamic programming formulation does not apply to videos, and a graph-cut formalism needs to be used for video retargeting. Graph partitioning and graph-based energy minimization techniques are widely used in various visual processing tasks, including image restoration, segmentation, object recognition and shape reconstruction. The graph representation of an image can be generated by connecting pixels based on their similarities. The seam carving operator can be formulated as a minimum cost graph cut problem (see Fig. 2.8). In this graph every node represents a pixel and connects to its neighboring pixels in a grid structure. The source and sink nodes are connected with infinite-weight arcs to the pixels at the leftmost and rightmost columns of the video frame. Similar to image retargeting using seam carving, the seam-carving-based video retargeting scheme needs to satisfy two constraints:
1. Monotonicity: the seam must include exactly one pixel in each row (or column).
2. Connectivity: the pixels of the seam must be connected.
It was reported in [59] that the graph cut algorithm runs in polynomial time, but in practice it has linear running time on average. To further save computational time, an approximate minimal cut is first computed on a coarse graph, while the final solution is refined iteratively at higher resolutions.
Figure 2.8: Seam carving [59] for content-aware video resizing using a graph-cut
approach.
Continuous Approach
For the continuous approach, the resizing method can be easily extended to video by constraining temporally-adjacent pixels to transform similarly. This can be formulated as an energy term which penalizes strong partial derivatives of the mesh vertex positions with respect to time. In [76], the authors addressed the temporal coherence problem by detecting camera and object motions. This approach was further extended in [77] by incorporating cropping into the whole procedure. It was shown that, in addition to warping the video, incorporating content-aware cropping into the optimization process does help improve the final result. Specifically, cropping is necessary when the video is crowded with multiple prominent objects.
Figure 2.9: Content-aware video resizing technique proposed in [77]. It combines
cropping and warping by forcing all informative video content inside the target
video cube.
One limitation of [77] is its memory requirement, as the whole video cube is needed for retargeting. This problem is addressed in [87], where the authors proposed an efficient yet scalable retargeting scheme (see Fig. 2.10) that conducts video retargeting in three separate stages: per-frame resizing, path-line optimization and motion-guided resizing. In this way, the proposed method never requires storage of the whole video sequence.
2.3 Quality Assessment for Image Retargeting
There are three determining distortion types for image retargeting: global structural distortion, local region distortion and loss of salient information. The first two distortion types introduce visually unpleasant artifacts to retargeted results, such as over-squeezing object shapes (global) or breaking prominent lines (local).
Figure 2.10: A scalable content-aware video resizing technique guided by optimized
motion path-lines [87].
The third type does not necessarily introduce visually noticeable artifacts, yet the
retargeted result fails to preserve all salient information in the original image.
Understanding the characteristics of these major distortion types would lay out a
basis for the GLS quality index design, which will be elaborated in Section 5.2.
2.3.1 Global Structural Distortion
Global structural distortion occurs when an image is over-squeezed or over-stretched after retargeting, leading to unpleasant shape deformation of prominent objects. This distortion is especially noticeable when the salient object is improperly deformed and/or different parts of a salient object are deformed disproportionately, leading to inconsistency as compared with the original image. This type of distortion produces artifacts at the global scale.
Both discrete and continuous image retargeting methods could potentially produce global structural distortion. For example, as shown in Fig. 2.11, the global structure of the prominent object in the face image is heavily distorted because the relative positions of the eyes, nose and mouth are misaligned after retargeting. However, the shape of each individual face component (eyes, nose, mouth, etc.) is kept intact. In other words, there is no distortion in local regions. For the image lotus, there is also heavy global structural distortion, as the prominent object (flower) is over-squeezed after retargeting.

Figure 2.11: Two examples of global structural distortion. Upper row: the original image of face (left) and the retargeted result by [8] (right). Lower row: the original image of lotus (left) and the retargeted result by scaling (right).
2.3.2 Local Detail Distortion
After retargeting, some local regions in the image may be heavily distorted, especially near regions with prominent edges. Discrete retargeting methods, such as seam carving [8], may introduce broken lines when the removed region overlaps with a prominent edge region. On the other hand, continuous retargeting methods may result in heavy edge bending when the underlying mesh behind the edge region undergoes significant warping. Local region distortions become less noticeable in regions with homogeneous textures (e.g., sky, surface, wall) and irregular textures (e.g., trees, grass, sand) [89]. We show an example of heavy local region distortion using the seam carving method in Fig. 2.12. Prominent edges are heavily bent after retargeting since many of the removed seams passed through the edges of the pencils.
Figure 2.12: Illustration of local region distortion. Left: the original image of
pencil. Right: the retargeted result using the seam carving method in [8] with
three zoomed-in local regions.
2.3.3 Loss of Salient Information
Besides reducing visual artifacts, a good retargeted image should preserve as much of the important content of the original image as possible. Loss of salient information is a distortion type commonly introduced by discrete operators such as cropping. When the salient object is too large and/or spans the whole image, cropping will inevitably discard some important information. For example, as shown in Fig. 2.13(a), there are four different buildings in the original image, each with similar visual importance. To retarget this image to half of its original width, a simple cropping method will inevitably discard some salient information, leaving only two of the buildings in the retargeted result. A better retargeting method for this image should be able to remove redundant information from each building but preserve all four buildings, as shown in Fig. 2.13(b). This type of distortion is less observable for continuous retargeting methods, since different regions of an image are scaled disproportionately without discarding pixels in them.
Figure 2.13: Illustration of loss of salient information. (a) The original image of Marble and the cropping result; the yellow box shows the optimum cropping window. (b) A better retargeting result by [60].
The objective quality indices examined in [43,58] primarily focus on measuring distortions (global and/or local). However, they do not pay much attention to the importance of information completeness. In contrast, in this work we attempt to find good features to characterize all three distortion types and combine them into one single score for quantitative evaluation of retargeted results.
Chapter 3
Adaptive Total Variation Model
for Latent Fingerprint Detection
and Segmentation
3.1 Introduction
Latent fingerprint identification plays a critical role for law enforcement agencies in identifying and convicting criminals. An important step in an automated fingerprint identification system (AFIS) is the process of fingerprint segmentation. While a tremendous amount of effort has been made on plain and rolled fingerprint segmentation, latent fingerprint segmentation remains a challenging problem. Collected from crime scenes, latent fingerprints are often mixed with other components such as structured noise or other fingerprints. Existing fingerprint recognition algorithms fail to work properly on latent fingerprint images, since they are mostly applicable under the assumption that the image is already properly segmented and there is no overlap between the target fingerprint and other components.
Fingerprint segmentation refers to the process of decomposing a fingerprint image into two disjoint regions: foreground and background. The foreground, also called the region of interest (ROI), consists of the desired fingerprints, while the background contains noisy and irrelevant content that will be discarded in the following processing steps. Accurate fingerprint segmentation is critical as it affects the accurate extraction of minutiae and singular points, which are key features for fingerprint matching. When feature extraction algorithms are applied to a fingerprint image without segmentation, many false features may be extracted due to the presence of a noisy background, eventually leading to matching errors in the later stages. Therefore, the goal of fingerprint segmentation is to discard the background, reduce the number of false features, and thus improve the matching accuracy.
Based on the collection procedure, fingerprint images can generally be divided into three categories, namely, rolled, plain and latent [35]. Rolled fingerprints are obtained by rolling the finger from one side to the other in order to capture all ridge details of the fingerprint. Plain fingerprints are acquired by pressing the fingertip onto a flat surface. Both rolled and plain prints are obtained in an attended mode, so they are usually of good visual quality and contain sufficient information for reliable matching. On the contrary, latent fingerprints are usually collected from crime scenes, in which the print is lifted from object surfaces that were inadvertently touched or handled. The matching between latents and rolled/plain fingerprints plays a crucial role in identifying suspects by law enforcement agencies.
Segmentation of rolled and plain fingerprint images has been well studied in the literature. In the early work of [49], segmentation was achieved by partitioning the fingerprint image into blocks, followed by block classification based on gradient and variance information. This method was further extended to a composite method [48] that takes advantage of both the directional and variance approaches. Ratha et al. [57] considered the gray-scale variance along the direction orthogonal to the ridge-flow orientation as the key feature for block classification. In [9], fingerprints were segmented using three pixel-level features (coherence, mean and variance). An optimal linear classifier was trained for pixel-based classification, and morphology operators were applied to obtain compact segmentation clusters.
While significant effort has been made on developing segmentation algorithms for rolled/plain fingerprints, latent fingerprint segmentation remains a challenging problem. Although automated identification has already achieved high accuracy for plain/rolled fingerprints, manual intervention is still necessary for latent print processing [35]. The difficulty mainly lies in: 1) the poor quality of fingerprint patterns in terms of the clarity of ridge information, and 2) the presence of various structured noise in the background. Traditional segmentation methods fail to work properly on latent fingerprints as they are based on many assumptions that are only valid for rolled/plain fingerprints. In recent works on latent fingerprints [14, 22, 86], the region of interest (ROI) is still manually marked and assumed to be known. Undoubtedly, accurate and robust latent segmentation is an essential step towards achieving automatic latent identification, and it is the main focus of our current research.
Most recently, several studies [36,68] have been conducted to address the problem of latent fingerprint segmentation. Karimi-Ashtiani and Kuo [36] used a projection method to estimate the orientation and frequency of local blocks. After projection, the distance between center-of-transient points measures the amount of data degradation and is used for segmentation. Short et al. [68] formulated a ridge model template and used the cross-correlation between a local block and the generated template to assign one of six quality scores. Blocks with a high quality score are labeled as foreground while the rest are treated as background. While the proposed methods demonstrated improved performance in handling latent fingerprint images, experimental results show that their performance is still limited by the presence of structured noise.
Total-Variation-based (TV-based) image models have been widely used in the context of image decomposition [7,85]. Among several well known TV-based models, the model using total variation regularization with an L1 fidelity term, denoted by the TV-L1 model, is especially suited for multiscale image decomposition and feature selection [12,15]. Besides, a modified TV-L1 model was adopted in [15] to extract small-scale facial features for facial recognition under varying illumination. More recently, the authors proposed an adaptive TV-L1 model for latent fingerprint segmentation in [90], where the fidelity weight coefficient is adaptively adjusted to the background noise level. Furthermore, the Directional Total Variation (DTV) model was formulated in [88] by imposing directional information on the TV term, which proved to be effective for latent fingerprint detection and segmentation. It appears that the TV-based image model with proper adaptation offers a suitable tool for latent fingerprint segmentation. However, the performance of both models in [88,90] was evaluated only subjectively, as no objective evaluation was performed to determine whether the proposed schemes improve the matching accuracy, which is the ultimate goal of fingerprint segmentation.
In this chapter, we present three TV-based models for effective latent fingerprint image segmentation. The proposed models decompose a latent fingerprint image into two layers: cartoon and texture. The cartoon layer contains the unwanted components (e.g., structured noise) while the texture layer mainly consists of the latent fingerprint. This cartoon-texture decomposition facilitates the process of segmentation, as the region of interest can be easily detected from the texture layer using traditional segmentation methods. In addition, the effectiveness of our proposed schemes is validated through experiments on feature detection and latent matching.
The rest of this chapter is organized as follows. In Section 3.2, we introduce the Adaptive TV-L1 model, which incorporates a spatially-adaptive fidelity weight, and show how it can be used to facilitate latent fingerprint segmentation. In Section 3.3, we propose the Directional Total-Variation (DTV) model, which consists of a novel anisotropic directional TV term. In Section 3.4, we take advantage of the two previously mentioned models and combine them into one single model, called the Adaptive Directional Total Variation (ADTV) model. In Section 3.5, we validate the effectiveness of our proposed schemes through a series of benchmarking experiments. Concluding remarks are given in Section 3.6.
3.2 Segmentation with Adaptive TV-L1 Model
In this section, we introduce the Adaptive TV-L1 model, i.e., the adaptive total variation model with an L1 fidelity term, and explain how it can be used to effectively separate the latent fingerprint from structured noise and thus facilitate the process of fingerprint segmentation. We begin by introducing the TV-L1 model, which serves as the basis for the proposed Adaptive TV-L1 model, and explain its capability for multiscale feature selection. Finally, we propose the Adaptive TV-L1 model and discuss the choice of its parameters.
3.2.1 The TV-L1 Model
TV-based image models have been widely studied to achieve the task of image decomposition. Among many existing TV models, the total variation regularization model with an L1 fidelity term, denoted by TV-L1, is suitable for multiscale image decomposition and feature selection. In the context of facial recognition under varying illumination, a modified TV-L1 model was proposed in [15] to separate small-scale facial features from nonuniform illumination, thus leading to improved recognition results.
Similar to other TV-based image models (e.g., the ROF model [61]), the TV-L1 model decomposes an input image, f, into two signal layers:
Cartoon u, which consists of the piecewise-smooth component in f, and
Texture v, which contains the oscillatory or textured component in f.
The decomposition

f = u + v

is obtained by solving the following variational problem:

\min_u \int |\nabla u| + \lambda \int |u - f| \, dx,   (3.1)

where f, u and v are functions of image gray-scale intensity values in R^2, ∇u is the gradient of u, and λ is a constant weighting parameter. We call \int |\nabla u| and \int |u - f| the total variation of u and the fidelity term, respectively.
The TV-L1 model is difficult to compute due to the nonlinearity and non-differentiability of both the total variation term and the fidelity term. A gradient descent approach was proposed in [12], which solves for u as a steady-state solution of the Euler-Lagrange equation of (3.1):

\nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) + \lambda \, \frac{f - u}{|f - u|} = 0.   (3.2)
Although (3.2) is easier to implement, the gradient descent approach is slow due to the small time step imposed by a strict stability constraint. That is, the term (f - u)/|f - u| is non-smooth at f = u, which forces the time step to be very small when the solution approaches the steady state. In addition, |∇u| in the term ∇u/|∇u| might be zero, so a small positive constant needs to be added to avoid division by zero, which results in an inexact solution.
Many numerical methods have been proposed to improve this method. One
approach is the split Bregman iteration [26,27,51,81], which uses functional split-
ting and Bregman iteration for constrained optimization. The equivalence of
the split Bregman iterations with the alternating direction method of multipliers
(ADMM), the Douglas-Rachford splitting and the augmented Lagrangian method
can be found in [11,19,63,81].
The use of the TV model is motivated by the analogy between the problem of TV decomposition and latent fingerprint segmentation. As discussed in Section 2.1.1, the key challenge for latent segmentation is to effectively separate the latent fingerprint from different types of structured noise. Structured noise (e.g., arch, character), with its smooth inner surface and crisp edges, shares many characteristics with components in the cartoon layer u. On the other hand, the fingerprint pattern, which consists of oscillatory ridge structures, matches the characteristics of the texture components in v. This interesting analogy suggests that the TV model could be a viable solution to our problem.
3.2.2 Multiscale Feature Selection Property of TV-L1
The TV-L1 model distinguishes itself from other TV-based models by its unique capability of intensity-independent multiscale decomposition. It has been shown both theoretically [12] and experimentally [85] that the fidelity weight coefficient, λ, in (3.1) is closely related to the scale of features in the texture output v. This relation is supported by the analytic example in [12]. If f is equal to a disk signal, denoted by B_r, which has radius r and unit height, the solution of (3.1) is given as:

u_\lambda(x) =
\begin{cases}
0 & \text{if } 0 < \lambda < 2/r, \\
f(x) & \text{if } \lambda > 2/r, \\
c\,f(x) & \text{if } \lambda = 2/r, \ \text{for any } c \in [0,1].
\end{cases}
In other words, depending on the λ value, the TV-L1 functional is minimized by either 0 or the input f. This shows that the TV-L1 model has the ability to select geometric features based on a given scale. Fig. 2.4 shows an example of feature selection on a latent fingerprint image.
As shown in Fig. 2.4, the numerical results match the preceding analysis. The fidelity weight coefficient λ controls the feature selection by manipulating the scale of content captured in each image layer. When λ is very small (e.g., λ = 0.10), u captures the inhomogeneous illumination in the background while most fine structures are kept in v. When λ = 0.30, large-scale objects (arch) are captured in u and separated from structures of smaller scales (characters). As λ continues to increase, only small-scale structures (fingerprint and noise) are left in v while the major content of f is extracted to u.
We observe that one of the differences between fingerprint patterns and structured noise is their relative scale. By applying the TV-L1 model with an appropriately chosen λ value, it seems possible to extract fingerprints to the texture layer v while leaving the unwanted structured noise in the cartoon layer u. However, two problems arise when the TV-L1 model is applied directly:
1. The value of λ forces structures that are smaller than or equal to a given scale to appear in v. As a result, structured noise of smaller scale than the fingerprints (e.g., speckle, stain) will also be captured in v along with the fingerprints.
2. A small amount of boundary signal near non-smooth edges will appear in v (see Fig. 3.1) due to the non-smoothness of the boundary and the use of finite differencing. This issue was also reported in [15].
Figure 3.1: Illustration of the boundary signal problem in TV-L1 decomposition: a small amount of structured-noise edge signal is still kept in the texture v (left), and signals along the dashed line depicted in f, u and v (right).
To overcome these limitations, we propose the Adaptive TV-L1 model, which
will be presented next.
3.2.3 The Adaptive TV-L1 Model
The TV-L1 model with a spatially invariant fidelity weight (3.1) does not generate the desired output throughout the whole latent fingerprint image. In the fingerprint region, when λ is well matched with the scale of the fingerprints, all essential content can be captured in the output texture v. However, in the noisy region, some unwanted signals will also be extracted to v with the same λ. This motivates us to consider an adaptive TV-L1 model with a spatially variant fidelity weight:

\min_u \int |\nabla u| + \int \lambda(x) |u - f| \, dx,   (3.3)

where λ(x) is a spatially varying parameter.
The spatially varying parameter, λ(x), can be understood in two ways. First, as analyzed in Section 3.2.2, λ(x) is a scalar that controls the scale of features appearing in v at pixel x. A large λ(x) value enforces most textures to be kept in u, leaving only tiny-scale structures in v. When λ(x) is sufficiently large, u(x) ≈ f(x) and the original content is almost totally blocked from v; thus, v(x) ≈ 0. Second, the parameter λ(x) can also be interpreted as a weighting coefficient that balances the importance between the fidelity and the smoothness of u. In the fingerprint region, the λ(x) value should be relatively small, since low fidelity ensures the smoothness of u and, thus, more texture can be extracted to v. In regions with structured noise, fidelity becomes important, since a large λ(x) value ensures that all noise components are filtered out from the texture v.
We use the augmented Lagrangian method [25,31,71,82] to solve the proposed adaptive TV-L1 model given in (3.3). The augmented Lagrangian method is both accurate and efficient, as it benefits from an FFT-based fast solver with closed-form solutions. It has been proven that the augmented Lagrangian method is equivalent to the split Bregman iteration, and its convergence is always guaranteed [81].
In the augmented Lagrangian method, two new variables are introduced and the model can be reformulated as the following constrained optimization problem:

\min_u \int |\vec{p}| + \int \lambda(x) |v - f| \, dx,
\quad \text{s.t.} \quad
\vec{p} = \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} \partial_x u \\ \partial_y u \end{pmatrix} = \nabla u, \qquad v = u.   (3.4)
To solve (3.4), the following augmented Lagrangian functional is defined:

L(u, \vec{p}, v; \lambda_{\vec{p}}, \lambda_v) = \int |\vec{p}| + \int \lambda(x) |v - f| \, dx
+ \frac{r_p}{2} \int (\vec{p} - \nabla u)^2 + \int \lambda_{\vec{p}} \cdot (\vec{p} - \nabla u)
+ \frac{r_v}{2} \int (v - u)^2 + \int \lambda_v (v - u),

where λ_p and λ_v are the Lagrange multipliers and r_p and r_v are positive constants. The augmented Lagrangian method uses an iterative procedure to solve (3.4), as shown in Algorithm 1. The iterative scheme runs until some stopping condition is satisfied.
Since the variables u, p and v in L(u, p, v; λ_p, λ_v) are coupled together in the minimization problem (3.5), it is difficult to solve for them simultaneously. Instead, the problem is decomposed into three sub-problems and an alternating minimization process is applied. All three sub-problems can be efficiently solved as they all have closed-form solutions. They are given as:

u^k = \mathcal{F}^{-1}\!\left( \frac{ r_p\,\mathcal{F}(\mathrm{div})\mathcal{F}(\vec{p}) - \mathcal{F}(\mathrm{div})\mathcal{F}(\lambda_{\vec{p}}^{k}) + r_v\,\mathcal{F}(v) + \mathcal{F}(\lambda_v^{k}) }{ r_v - r_p\,\mathcal{F}(\Delta) } \right),

\vec{p}^{\,k}(x) = \max\!\left\{ 0,\; 1 - \frac{1}{r_p |\vec{w}(x)|} \right\} \vec{w}(x),

v^k(x) = \max\!\left( 0,\; 1 - \frac{\lambda(x)}{r_v |q(x)|} \right) q(x) + f(x),

where \mathcal{F}(u) denotes the Fourier transform of u, \vec{w}(x) = \nabla u(x) - \lambda_{\vec{p}}^{k}(x)/r_p and q(x) = u(x) - f(x) - \lambda_v^{k}(x)/r_v.
Algorithm 1: Augmented Lagrangian method for our proposed adaptive TV-L1 model.
1. Initialization: u^0 = 0, \vec{p}^{\,0} = 0, v^0 = 0;
2. For k = 1, 2, \dots, compute:
   (u^k, \vec{p}^{\,k}, v^k) = \operatorname*{argmin}_{(u, \vec{p}, v)} L(u, \vec{p}, v; \lambda_{\vec{p}}^{k}, \lambda_v^{k})   (3.5)
3. Update:
   \lambda_{\vec{p}}^{k+1} = \lambda_{\vec{p}}^{k} + r_p (\vec{p}^{\,k} - \nabla u^k)
   \lambda_v^{k+1} = \lambda_v^{k} + r_v (v^k - u^k)

Table 3.1: Augmented Lagrangian method for our proposed adaptive TV-L1 model.
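The following NumPy sketch outlines one way to implement this alternating scheme under periodic boundary conditions. The u-update is solved in the Fourier domain from the optimality condition (r_v − r_p Δ)u = r_v v + λ_v − div(r_p p + λ_p), which we derive from the functional above; the penalty values, iteration count and function names are illustrative assumptions rather than the exact settings used in the thesis.

import numpy as np

def grad(u):
    # forward differences, periodic boundary
    return np.roll(u, -1, axis=1) - u, np.roll(u, -1, axis=0) - u

def div(px, py):
    # backward differences (negative adjoint of the forward gradient)
    return (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))

def adaptive_tv_l1(f, lam, rp=2.0, rv=2.0, iters=150):
    # alternating minimization for  min_u  int |grad u| + int lam(x)|u - f| dx;
    # lam may be a scalar or an array of the same shape as f (the adaptive case)
    f = f.astype(float)
    u, v = f.copy(), f.copy()
    px, py = np.zeros_like(f), np.zeros_like(f)
    lpx, lpy, lv = np.zeros_like(f), np.zeros_like(f), np.zeros_like(f)
    h, w = f.shape
    wx = 2 * np.pi * np.fft.fftfreq(w)[None, :]
    wy = 2 * np.pi * np.fft.fftfreq(h)[:, None]
    lap = (2 * np.cos(wx) - 2) + (2 * np.cos(wy) - 2)   # symbol of the periodic Laplacian
    for _ in range(iters):
        # u-subproblem (linear, solved exactly with the FFT)
        rhs = rv * v + lv - div(rp * px + lpx, rp * py + lpy)
        u = np.real(np.fft.ifft2(np.fft.fft2(rhs) / (rv - rp * lap)))
        # p-subproblem: vector shrinkage toward grad u - lambda_p / rp
        ux, uy = grad(u)
        wxp, wyp = ux - lpx / rp, uy - lpy / rp
        norm = np.maximum(np.sqrt(wxp**2 + wyp**2), 1e-12)
        scale = np.maximum(0.0, 1.0 - 1.0 / (rp * norm))
        px, py = scale * wxp, scale * wyp
        # v-subproblem: scalar shrinkage of (u - f - lv/rv) with spatial threshold lam/rv
        q = u - f - lv / rv
        v = f + np.maximum(0.0, 1.0 - lam / (rv * np.maximum(np.abs(q), 1e-12))) * q
        # multiplier updates
        lpx, lpy = lpx + rp * (px - ux), lpy + rp * (py - uy)
        lv = lv + rv * (v - u)
    return u, f - u   # cartoon layer u and texture layer v = f - u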
3.2.4 Scale Parameter Estimation
As discussed in Section 3.2.2, applying one uniform λ value over the entire fingerprint image does not generate satisfactory results. To improve the result, the value of λ should be spatially adaptive. That is, λ(x) ought to be adaptively chosen according to the background noise level. Ideally, the parameter λ(x) should take larger values in regions with much structured noise and be relatively small in fingerprint regions.
To differentiate these regions, we study their characteristics after local low-pass filtering. When an input image, f, is locally filtered by a low-pass filter denoted by

L_\gamma(\xi) = \frac{1}{1 + (2\pi\gamma|\xi|)^4},

its cartoon and texture components, though both blurred to some extent, change differently in terms of the local total variation (LTV), which is defined as

\mathrm{LTV}_\sigma(f) = G_\sigma * |\nabla f|,

where f is the image region and G_σ is a Gaussian kernel with standard deviation σ. In [10], the author used the relative LTV reduction ratio to differentiate cartoon from textural regions. It was observed that the LTV of textural regions decays much more rapidly than that of cartoon regions after low-pass filtering.
Though the LTV reduction ratio provides a good measure for separating edgy regions from textural regions, it has limited capability in differentiating textures of different scales (e.g., fingerprints versus speckles). To overcome this limitation, we further introduce the differential LTV reduction rate, denoted by ν_γ, as

\nu_\gamma = \frac{\mathrm{LTV}(L_{\gamma+1} * f) - \mathrm{LTV}(L_\gamma * f)}{\mathrm{LTV}(f)}.   (3.6)

Figure 3.2: Plots of ν_γ(x) for several pixels in different latent fingerprint images. It has a sharp peak located near γ = 2.0 in the fingerprint region, while it reaches its maximum at different γ values in other regions.
For a given local patch, the parameter ν_γ describes the sensitivity of its structural components to low-pass filtering of scale γ. It provides useful information about the underlying texture structure of a local region. Intuitively, it measures the texture's local oscillatory behavior at a certain spatial scale γ. In Fig. 3.2, we plot the ν_γ values of different textural patches extracted from latent fingerprint images. We observe that the ν_γ values of fingerprint regions all reach local maxima around γ = 2.0. With a fixed γ value, ν_γ has the largest response for textural components of scales around γ, while the response for textures of other scales is suppressed.
Based on this observation, we choose the spatially variant coefficient λ(x) in (3.3) as:

\lambda(x) = \frac{1}{\alpha\,\nu_c(x) + \beta},   (3.7)

where ν_c is the differential LTV reduction rate at γ = c, which is adjusted to the best response of fingerprint patterns, and α and β are small positive constants used for scaling and for avoiding division by zero, respectively. In our experiments, we observe that c = 2.0 gives the optimum value for latent fingerprint patterns, while the parameters α and β are empirically set to 0.5 and 0.01.
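A minimal NumPy/SciPy sketch of this scale-parameter estimation is given below. The Fourier-domain form of the low-pass filter, the Gaussian width used for the LTV, and the use of the magnitude of ν to keep λ(x) positive are our own illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def lowpass(f, gamma):
    # isotropic low-pass filter with Fourier response 1 / (1 + (2*pi*gamma*|xi|)^4)
    h, w = f.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    xi = np.sqrt(fx**2 + fy**2)
    response = 1.0 / (1.0 + (2 * np.pi * gamma * xi)**4)
    return np.real(np.fft.ifft2(np.fft.fft2(f) * response))

def local_tv(f, sigma=3.0):
    # LTV_sigma(f) = G_sigma * |grad f|
    gy, gx = np.gradient(f.astype(float))
    return gaussian_filter(np.sqrt(gx**2 + gy**2), sigma)

def diff_ltv_rate(f, gamma):
    # nu_gamma = (LTV(L_{gamma+1} f) - LTV(L_gamma f)) / LTV(f), computed per pixel
    return (local_tv(lowpass(f, gamma + 1)) - local_tv(lowpass(f, gamma))) / (local_tv(f) + 1e-8)

def fidelity_weight(f, c=2.0, alpha=0.5, beta=0.01):
    # lambda(x) = 1 / (alpha * nu_c(x) + beta); |nu| is used here to keep the weight positive
    return 1.0 / (alpha * np.abs(diff_ltv_rate(f, c)) + beta)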
3.2.5 Region-of-Interest Segmentation
After decomposing the latent fingerprint image using our proposed Adaptive TV-L1 model, we obtain two image layers: 1) cartoon u, which contains the majority of the unwanted content (e.g., structured noise, small-scale structures), and 2) texture v, which consists of the latent fingerprint and only a small amount of random noise. This decomposition facilitates two procedures: segmentation and enhancement.
The variance value acts as a key segmentation feature for rolled/plain fingerprints [9]. As discussed in Section 2.1.1, this feature cannot be directly applied to latent fingerprints due to the presence of structured noise. However, after the cartoon-texture layer decomposition, most high-variance noise components are kept away from the texture layer v, allowing us to use the variance feature for segmentation. We verify this point in Fig. 3.3, where we plot the probability distribution of the variance feature at foreground/background regions before and after the decomposition.

Figure 3.3: Distributions of the variance feature for the foreground and background regions in (a) the input f and (b) the texture v, respectively.
In addition, our proposed decomposition scheme is capable of enhancing the fingerprint quality. After decomposition, the texture layer v is free of the unwanted components that may overlap with the fingerprints. The extracted patterns are less degraded by structured noise and free of illumination effects, leading to enhanced fingerprint quality. In the next section, we will experimentally demonstrate that this enhancement, as well as the segmentation, results in better latent matching performance.
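The sketch below illustrates this variance-based ROI segmentation on the texture layer v; the block size, the variance threshold and the structuring elements for the morphological clean-up are illustrative choices, not the exact settings used in our experiments.

import numpy as np
from scipy.ndimage import uniform_filter, binary_closing, binary_opening

def variance_roi(v, block=16, thresh=0.005):
    # local variance of the texture layer v (intensities assumed scaled to [0, 1])
    v = v.astype(float)
    mean = uniform_filter(v, block)
    mean_sq = uniform_filter(v * v, block)
    var = mean_sq - mean**2
    mask = var > thresh
    # morphological clean-up to obtain a compact foreground (ROI) mask
    se = np.ones((block, block), dtype=bool)
    mask = binary_closing(mask, structure=se)
    mask = binary_opening(mask, structure=se)
    return mask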
3.3 Segmentation with Directional Total-Variation Model (DTV)
In this section, we introduce another TV-based model for latent fingerprint segmentation, namely, the Directional Total Variation (DTV) model. We first introduce the TV-L2 (ROF) model [61] and discuss the problems of applying it directly to latent fingerprint segmentation. Then, we propose the DTV model as well as our algorithm, and discuss the choice of the key parameter, a(x).
3.3.1 The TV-L2 (ROF) Model
The total variation regularization model with an L2 fidelity term, denoted by TV-L2, was originally proposed for image denoising. Recently, the TV-L2 model has also demonstrated good performance for image decomposition [7].
Similar to the TV-L1 model, the TV-L2 model decomposes an input image, f, into two signal layers: 1) cartoon u, which consists of the piecewise-smooth components in f, and 2) texture v, which contains the oscillatory or textured component in f. The decomposition f = u + v is obtained by solving the following variational problem:

\min_u \int |\nabla u| + \frac{\lambda}{2} \| f - u \|^2,   (3.8)

where f, u and v are functions of image gray-scale intensity values in R^2, ∇u is the gradient of u, and λ is a constant weighting parameter. We call \int |\nabla u| and \| f - u \|^2 the total variation (TV) of u and the fidelity term, respectively.
Applying TV-L2 directly to latent fingerprint images is incapable of completely separating structured noise from fingerprints, as it suffers from the same problems discussed in Section 3.2 for the Adaptive TV-L1 model.
3.3.2 The Directional Total-Variation Model
Besides the scale difference, the fingerprint also differs from structured noise by its coherently oriented texture pattern. We observe that fingerprint textures usually consist of parallel ridge-line structures within a relatively fixed frequency range. This motivates us to consider an adaptive TV-L2 model with a spatially variant directional total variation (DTV) term:

\min_u \int |\nabla u \cdot \vec{a}(x)| \, dx + \frac{\lambda}{2} \int (u - f)^2,   (3.9)

where a(x) is a spatially varying orientation vector adjusted to the local texture orientation.
The orientation vector a controls the signal captured in the texture output v. By tuning a to a specific direction, we are mainly interested in minimizing the total variation of u along that direction, while allowing total variation of u along other directions. As a result, textures along the corresponding direction will be fully captured by v, and textures along other directions will be weakened in v. In particular, textures along the direction orthogonal to a will be totally blocked from v. In Fig. 3.4, we illustrate the impact of a on the output v.
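For concreteness, the directional TV term in (3.9) can be evaluated numerically as in the short sketch below, where ax and ay denote the two components of the orientation field a(x); the finite-difference gradient is an illustrative choice.

import numpy as np

def directional_tv(u, ax, ay):
    # int |grad u(x) . a(x)| dx, evaluated with finite differences
    gy, gx = np.gradient(u.astype(float))
    return np.abs(gx * ax + gy * ay).sum()

Minimizing this quantity over u drives the variation of u along a(x) to zero, so oriented ridge texture aligned with a(x) is pushed into the texture output v.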
Again, we use the augmented Lagrangian method [25] to solve the proposed DTV model given in (3.9). In the augmented Lagrangian method, we introduce two new variables and reformulate (3.9) as the following constrained optimization problem:

\min_u \int |q| + \frac{\lambda}{2} \int (f - u)^2 \, dx,
\quad \text{s.t.} \quad
\vec{p} = \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} \partial_x u \\ \partial_y u \end{pmatrix} = \nabla u, \qquad q = \vec{p} \cdot \vec{a}.   (3.10)
Figure 3.4: (a) Top: original image f. Bottom: texture output v after decomposition by the TV-L2 model [61]. (b) Texture output v for the orientation vector a in four different directions. Top: a = (0, 1) and a = (1, 0). Bottom: the two diagonal directions, a = (\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}) and a = (\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}).
To solve (3.10), the following augmented Lagrangian functional is defined:

L(u, \vec{p}, q; \lambda_{\vec{p}}, \lambda_q) = \int |q| + \frac{\lambda}{2} \int (f - u)^2 \, dx
+ \frac{r_p}{2} \int (\vec{p} - \nabla u)^2 + \int \lambda_{\vec{p}} \cdot (\vec{p} - \nabla u)
+ \frac{r_q}{2} \int (q - \vec{p} \cdot \vec{a})^2 + \int \lambda_q (q - \vec{p} \cdot \vec{a}),

where λ_p and λ_q are the Lagrange multipliers and r_p and r_q are positive constants. The augmented Lagrangian method uses an iterative procedure to solve (3.10), as shown in Algorithm 2. The iterative scheme runs until some stopping condition is satisfied.
Since the variables u, p and q in L(u, p, q; λ_p, λ_q) are coupled together in the minimization problem (3.11), it is difficult to solve for them simultaneously. Instead, the problem is decomposed into three sub-problems and an alternating minimization process is applied. The three sub-problems are given as:
E_1(u) = \frac{\lambda}{2} \int (f - u)^2 + \frac{r_p}{2} \int (\vec{p} - \nabla u)^2 - \int \lambda_{\vec{p}} \cdot \nabla u,

E_2(\vec{p}) = \frac{r_p}{2} \int (\vec{p} - \nabla u)^2 + \int \lambda_{\vec{p}} \cdot \vec{p} + \frac{r_q}{2} \int (q - \vec{p} \cdot \vec{a})^2 - \int \lambda_q \, \vec{p} \cdot \vec{a},

E_3(q) = \int |q| + \frac{r_q}{2} \int (q - \vec{p} \cdot \vec{a})^2 + \int \lambda_q \, q.
All three sub-problems can be efficiently solved as they all have closed-form solutions:

u^k = \mathcal{F}^{-1}\!\left( \frac{ r_p\,\mathcal{F}(\mathrm{div})\mathcal{F}(\vec{p}) - \mathcal{F}(\mathrm{div})\mathcal{F}(\lambda_{\vec{p}}^{k}) + \lambda\,\mathcal{F}(f) }{ \lambda - r_p\,\mathcal{F}(\Delta) } \right),

\vec{p}^{\,k}(x) = \nabla u - \frac{1}{r_p}\Big[ \lambda_{\vec{p}}^{k} - \big( r_q v(x) - r_q q + \lambda_q^{k} \big)\, \vec{a}(x) \Big],

q^k(x) = \max\!\left( 0,\; 1 - \frac{1}{r_q |w(x)|} \right) w(x),

where

v(x) = \frac{ r_p (\nabla u \cdot \vec{a}(x)) - \lambda_{\vec{p}}^{k} \cdot \vec{a}(x) + (\lambda_q^{k} + r_q q)\, \| \vec{a}(x) \|^2 }{ r_p + r_q \| \vec{a}(x) \|^2 }
\quad \text{and} \quad
w(x) = \vec{p} \cdot \vec{a}(x) - \frac{\lambda_q^{k}(x)}{r_q}.

\mathcal{F}(u) and \mathcal{F}^{-1}(u) denote the Fourier transform and the inverse Fourier transform of u, respectively.
Algorithm 2: Augmented Lagrangian method for our proposed DTV model.
1. Initialization: u^0 = 0, \vec{p}^{\,0} = 0, q^0 = 0;
2. For k = 0, 1, 2, \dots, compute:
   (u^k, \vec{p}^{\,k}, q^k) = \operatorname*{argmin}_{(u, \vec{p}, q)} L(u, \vec{p}, q; \lambda_{\vec{p}}^{k}, \lambda_q^{k})   (3.11)
3. Update:
   \lambda_{\vec{p}}^{k+1} = \lambda_{\vec{p}}^{k} + r_p (\vec{p}^{\,k} - \nabla u^k)
   \lambda_q^{k+1} = \lambda_q^{k} + r_q (q^k - \vec{p}^{\,k} \cdot \vec{a})

Table 3.2: Augmented Lagrangian method for our proposed DTV model.
3.3.3 Orientation Field Estimation
In order to extract the fingerprint components to the texture output v, a(x) should be spatially varying and well aligned with the local fingerprint ridge orientation. We use the gradient-based approach [32,91] to compute the coarse orientation field at each pixel:

o(x) = \frac{1}{2} \tan^{-1}\!\left( \frac{ \sum_W 2 f_{x_1} f_{x_2} }{ \sum_W ( f_{x_1}^2 - f_{x_2}^2 ) } \right) + \frac{\pi}{2},

where W is a neighborhood window around x, (f_{x_1}, f_{x_2}) is the gradient vector at x = (x_1, x_2), and \tan^{-1} is a 4-quadrant arctangent function with output range (-\pi, \pi).
The estimation above is relatively accurate in fingerprint regions, while it becomes less reliable in noisy regions. We evaluate the reliability of the estimated orientation field by its local coherency:

c(x) = \frac{ \left( \sum_W ( f_{x_1}^2 - f_{x_2}^2 ) \right)^2 + 4 \left( \sum_W f_{x_1} f_{x_2} \right)^2 }{ \left( \sum_W ( f_{x_1}^2 + f_{x_2}^2 ) \right)^2 },

where c(x) ∈ [0, 1] (close to 1 for strongly oriented patterns, and 0 for isotropic regions). The value of c(x) provides a reliability measure of the estimated orientation field and is utilized to generate the final orientation vector a(x).
The coarse orientation field o(x) still contains inconsistencies caused by creases and ridge breaks of the fingerprint pattern. We further improve the estimation by orientation smoothing:

O(x) = \frac{1}{2} \tan^{-1}\!\left( \frac{ G_\sigma * \sin(2 o(x)) }{ G_\sigma * \cos(2 o(x)) } \right),

where G_σ is a Gaussian smoothing kernel with standard deviation σ. Finally, the orientation vector a(x) in (3.9) is computed as:

\vec{a}(x) = ( \cos O(x),\; \sin O(x) ) \cdot c(x).
In regions where the orientation estimation is reliable, the large c(x) value enforces textures along the direction a(x) to be fully captured by v, leaving textures of the remaining orientations in u. On the other hand, in regions where c(x) is small and the estimation is not trustworthy, the fidelity term (λ/2)‖f − u‖² becomes dominant and most of the image content is kept in u. In this way, we can efficiently filter out the structured noise from the texture output v. The process for computing the parameter a(x) is illustrated in Fig. 3.5.
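A compact NumPy/SciPy sketch of this estimation pipeline is given below; the window size, the Gaussian width and the use of a box filter for the local sums over W are our own illustrative choices.

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def orientation_vector(f, win=16, sigma=4.0):
    gy, gx = np.gradient(f.astype(float))       # image gradients (f_x2, f_x1)
    gxx = uniform_filter(gx * gx, win)           # local sums over the window W
    gyy = uniform_filter(gy * gy, win)
    gxy = uniform_filter(gx * gy, win)
    # coarse orientation o(x): 4-quadrant arctangent, offset by pi/2
    o = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy) + np.pi / 2.0
    # local coherency c(x) in [0, 1]
    c = ((gxx - gyy)**2 + 4.0 * gxy**2) / ((gxx + gyy)**2 + 1e-8)
    # orientation smoothing in the doubled-angle representation
    O = 0.5 * np.arctan2(gaussian_filter(np.sin(2.0 * o), sigma),
                         gaussian_filter(np.cos(2.0 * o), sigma))
    # final orientation vector a(x) = (cos O(x), sin O(x)) * c(x)
    return np.cos(O) * c, np.sin(O) * c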
3.3.4 Region-of-Interest Segmentation
Similar to Section 3.2, we use the variance feature for segmenting the processed texture map v. We have previously shown that this feature cannot be directly applied to latent fingerprints due to the presence of structured noise. However, after applying our proposed DTV model, the latent image is decomposed into two parts: 1) cartoon u, which mainly contains large-scale structured noise, and 2) texture v, which consists of the latent fingerprint and a small amount of noise. In the texture output v, most noise with high variance disappears, allowing us to use the variance feature for effective segmentation [9].

Figure 3.5: Essential steps for computing a(x). From left to right: original image f, coarse orientation estimation o(x), orientation smoothing O(x) and coherency evaluation c(x).
3.4 Segmentation with Adaptive Directional Total-Variation Model (ADTV)
In this section, we take advantage of the two models proposed in Section 3.2 and Section 3.3 and combine them into one single mathematical model. We call it the Adaptive Directional Total-Variation (ADTV) model:

u^* = \operatorname*{argmin}_u \int |\nabla u \cdot \vec{a}(x)| \, dx + \frac{1}{2} \int \lambda(x) |u - f| \, dx,   (3.12)

where a(x) is a spatially varying orientation vector adjusted to the local texture orientation, and λ(x) is a spatially varying parameter that controls the feature scale.
Algorithm 3: Augmented Lagrangian method for our proposed ADTV model.
1. Initialization: u^0 = 0, \vec{p}^{\,0} = 0, q^0 = 0, w^0 = 0;
2. For k = 0, 1, 2, \dots, compute:
   (u^{k+1}, \vec{p}^{\,k+1}, q^{k+1}, w^{k+1}) = \operatorname*{argmin}_{(u, \vec{p}, q, w)} L(u, \vec{p}, q, w; \lambda_{\vec{p}}^{k}, \lambda_q^{k}, \lambda_w^{k})   (3.13)
3. Update:
   \lambda_{\vec{p}}^{k+1} = \lambda_{\vec{p}}^{k} + r_p (\vec{p}^{\,k+1} - \nabla u^{k+1})
   \lambda_q^{k+1} = \lambda_q^{k} + r_q (q^{k+1} - \vec{p}^{\,k+1} \cdot \vec{a})
   \lambda_w^{k+1} = \lambda_w^{k} + r_w (w^{k+1} - u^{k+1})

Table 3.3: Augmented Lagrangian method for our proposed ADTV model.
Again, we use the augmented Lagrangian method [25,31,71,82] to solve the proposed ADTV model given in (3.12). Three new variables (p, q, w) are introduced to reformulate (3.12) into the following constrained optimization problem:

\min_u \int |q| + \frac{1}{2} \int \lambda(x) |w - f| \, dx,
\quad \text{s.t.} \quad
\vec{p} = \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} \partial_x u \\ \partial_y u \end{pmatrix} = \nabla u, \qquad q = \vec{p} \cdot \vec{a}, \qquad w = u.   (3.14)
To solve (3.14), the following augmented Lagrangian functional is defined:

L(u, \vec{p}, q, w; \lambda_{\vec{p}}, \lambda_q, \lambda_w) = \int |q| + \frac{1}{2} \int \lambda(x) |w - f| \, dx
+ \frac{r_p}{2} \int (\vec{p} - \nabla u)^2 + \int \lambda_{\vec{p}} \cdot (\vec{p} - \nabla u)
+ \frac{r_q}{2} \int (q - \vec{p} \cdot \vec{a})^2 + \int \lambda_q (q - \vec{p} \cdot \vec{a})
+ \frac{r_w}{2} \int (w - u)^2 + \int \lambda_w (w - u),

where λ_p, λ_q and λ_w are the Lagrange multipliers and r_p, r_q, r_w are positive constants. The augmented Lagrangian method uses an iterative procedure to solve (3.14), as shown in Algorithm 3. The iterative scheme runs until some stopping condition is satisfied. Since the variables u, p, q, w in L(u, p, q, w; λ_p, λ_q, λ_w) are coupled together, it is difficult to solve for them simultaneously. Instead, the problem is decomposed into four sub-problems and an alternating minimization process is applied. Instead of solving (3.13) exactly, we apply the alternating direction method of multipliers (ADMM) [19,81] and run one iteration for each sub-problem. It should be mentioned that this strategy is also used in the split Bregman iteration method [11,63]. This splitting approach is efficient since all sub-problems have closed-form solutions, which are given as:
u^k = \mathcal{F}^{-1}\!\left( \frac{ r_p\,\mathcal{F}(\mathrm{div}\,\vec{p}) - \mathcal{F}(\mathrm{div}\,\lambda_{\vec{p}}^{k}) + r_w\,\mathcal{F}(w) + \mathcal{F}(\lambda_w^{k}) }{ r_w - r_p\,\mathcal{F}(\Delta) } \right),

\vec{p}^{\,k}(x) = \nabla u - \frac{1}{r_p}\Big[ \lambda_{\vec{p}}^{k} - \big( r_q q - r_q \psi(x) + \lambda_q^{k} \big)\, \vec{a}(x) \Big],

q^k(x) = \max\!\left( 0,\; 1 - \frac{1}{r_q |\phi(x)|} \right) \phi(x),

w^k(x) = \max\!\left( 0,\; 1 - \frac{\lambda(x)}{r_w |\eta(x)|} \right) \eta(x) + f(x),

where

\psi(x) = \frac{ r_p (\nabla u \cdot \vec{a}(x)) - \lambda_{\vec{p}}^{k} \cdot \vec{a}(x) + (\lambda_q^{k} + r_q q)\, \| \vec{a}(x) \|^2 }{ r_p + r_q \| \vec{a}(x) \|^2 },
\qquad
\phi(x) = \vec{p} \cdot \vec{a}(x) - \frac{\lambda_q^{k}(x)}{r_q},
\qquad
\eta(x) = u(x) - f(x) - \frac{\lambda_w^{k}(x)}{r_w}.

\mathcal{F}(u) and \mathcal{F}^{-1}(u) denote the Fourier transform and the inverse Fourier transform of u, respectively.
3.5 Experimental Results
In this section, we evaluate the proposed ADTV model through a series of experiments. We first show the segmentation results for latent fingerprint images of different quality types. We compare our results with two segmentation approaches [9, 16]. The approach in [9] uses a linear classifier for conducting segmentation on rolled fingerprints, while [16] was designed specifically for latent segmentation. Then, we experimentally examine the impact of our proposed segmentation scheme on the accuracy of feature extraction. We also conduct latent matching experiments to verify whether the segmentation result can indeed lead to higher matching accuracy. Finally, we compare the performance of our proposed ADTV model with two other TV-based models: the TV-L1 [12] and TV-L2 [61] models. The algorithm was implemented on a MacBook Pro computer with a 2.3 GHz Intel Core i7. For a 1000 ppi latent fingerprint image of size 1131 x 1321, the computational time of each procedure is listed in Table 3.4.
3.5.1 Results of Good, Bad and Ugly Fingerprint Processing
All experiments were conducted on the public domain latent fingerprint database, NIST SD27, which contains 258 latent fingerprints and their corresponding rolled fingerprints. In this database, fingerprint experts have assigned to each fingerprint one of three quality levels: good, bad and ugly. The numbers of good, bad and ugly latent prints are 88, 85 and 85, respectively. For the fingerprint matching experiments (Sections 3.5.4 and 3.5.5), we included 27,000 rolled fingerprints from the NIST SD14 database [5] and extended the background database to 27,258 fingerprints, making the problem more realistic and challenging.
Table 3.4: Computational time for the proposed ADTV-based latent fingerprint segmentation algorithm.

Procedure                                Time (sec)
Generate λ(x) map                        2.35
Generate orientation map                 0.33
ADTV decomposition (150 iterations)      100.33
ROI segmentation                         4.01
Total                                    107.02
Figure 3.6: Experimental results for three latent fingerprints of good quality (from left to right): original image f, scale parameter λ(x), texture output v, orientation vector a(x), and the final segmentation result.
We ran the proposed algorithm over the entire NIST SD27 database [6] and put all results in the supplemental material that accompanies this work. Furthermore, we have also uploaded the same results online [3] so that other researchers can use them to make fair comparisons in the future. The processing results of using the proposed ADTV model for three good, two bad and two ugly representative latent fingerprints are shown in Fig. 3.12, Fig. 3.13 and Fig. 3.11, respectively.
Table 3.5: Performance comparison of three segmentation algorithms.

Segmentation algorithm    MDR (%)    FDR (%)
Bazen et al. [9]          5.04       79.31
Choi et al. [16]          14.78      47.99
Our approach              14.10      26.13
Visual inspection shows that the proposed ADTV scheme provides satisfactory results, as the most essential fingerprint regions lie within the segmented foreground. It should be noted that it is difficult for the proposed decomposition scheme to perfectly erase structured noise from the texture layer v. For instance, in Fig. 3.12, some parts of the characters can still be observed in v. The goal of the ADTV decomposition is to suppress the structured noise components in the texture layer v as much as possible so that the region of interest can be more easily identified using the variance feature.
3.5.2 Comparison with Other Segmentation Methods
We first compare the segmentation accuracy of our approach with two other fingerprint segmentation methods: the linear-classifier-based approach [9] and the latent segmentation approach [16]. The segmentation accuracy is evaluated using two measures: the Missed Detection Rate (MDR) and the False Detection Rate (FDR) [16]. We use the manual segmentation results provided by [6] as the ground truth. MDR is defined as the average percentage of foreground pixels misclassified as background, while FDR refers to the average percentage of background pixels misclassified as foreground. The average segmentation accuracy of the proposed approach and of the other two segmentation methods over the entire NIST SD27 database is shown in Table 3.5.
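The two accuracy measures can be computed directly from binary masks, as in the short sketch below; the mask convention (True marks foreground) and the function name are our own.

import numpy as np

def mdr_fdr(pred, truth):
    # pred, truth: boolean masks of equal shape; True marks foreground (fingerprint) pixels
    missed = np.logical_and(~pred, truth).sum()      # foreground classified as background
    false_det = np.logical_and(pred, ~truth).sum()   # background classified as foreground
    mdr = 100.0 * missed / max(truth.sum(), 1)
    fdr = 100.0 * false_det / max((~truth).sum(), 1)
    return mdr, fdr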
Figure 3.7: Three input image types for latent matching. (a) Type-1: without any segmentation, (b) Type-2: segmentation mask over the original image f, (c) Type-3: segmentation mask over the texture layer v, (d) the corresponding mated rolled fingerprint.
While the linear classifier approach [9] works well for segmenting rolled fingerprints, it performs poorly on latent images. Although its MDR is relatively low (around 5%), its FDR is much higher than that of [16] and of our approach, because much of the background structured-noise region is mistakenly classified as foreground. As compared with the approach by Choi et al. [16], while the two segmentation methods have about the same level of MDR, the FDR of our proposed method is about 20% lower.
3.5.3 Feature Extraction Accuracy
Without segmentation, the performance of latent matching is very poor due to the high number of unreliable features. There are two types of features that are essential to fingerprint matching: singular points (SPs) and minutiae. Traditional feature extraction algorithms perform poorly on latent fingerprints, especially in regions with much structured noise. Some areas of noise are often misidentified as useful fingerprint features, which can significantly affect the accuracy of the fingerprint matching stage. Thus, with the help of accurate segmentation, we can remove unwanted structured noise components and, therefore, decrease the number of erroneous features.
Table 3.6: Feature extraction accuracy with and without ADTV-based segmentation.

                    w/o segmentation    w/ segmentation
Missed Minutiae     0.0%                3.23%
Missed SPs          0.0%                3.32%
False Minutiae      73.69%              31.52%
False SPs           71.47%              30.73%
Again, we use the manual segmentation results of [6] as the ground truth, and use VeriFinger SDK 6.6 [4] for feature extraction. All extracted feature points that fall within the manual segmentation region are used as the ground truth. Then, we calculate the number of true feature points that were missed and the number of false feature points detected for two scenarios (with and without ADTV-based segmentation). For cases without segmentation, we conduct feature extraction directly on the original image (Type-1 input), while for cases with segmentation, feature extraction is conducted on our segmentation results (Type-2 input).
Experimental results are given in Table 3.6. For latent inputs without any segmentation, though none of the true feature points are missed, more than 70% of the detected features are erroneous, as many incorrect feature points were detected in the structured noise regions. On the other hand, for inputs with our ADTV segmentation, the false feature point ratio decreases to around 30% while the missed feature points increase only slightly, by about 3%. In the next subsection, we will show that this improvement in feature extraction leads to better matching performance.
3.5.4 Fingerprint Matching Results
The ultimate goal of segmentation is to successfully match the input latent fingerprint with the corresponding plain/rolled fingerprint in a large database. The previous subsection showed that segmentation improves the accuracy of feature extraction. In this subsection, we conduct a matching experiment to verify whether the segmentation result can indeed lead to improved matching accuracy.
The feature extraction and fingerprint matching processes are conducted using the commercial matcher Neurotechnology VeriFinger SDK 6.6 [4]. For each latent fingerprint, we compare the results for three input types:
Type-1: original latent fingerprint without segmentation,
Type-2: segmentation mask applied on the original image,
Type-3: segmentation mask applied on the texture layer v.
An example of the three input image types is given in Fig. 3.7. As mentioned earlier, the proposed ADTV model has two functionalities: segmentation and texture enhancement. We can evaluate the effectiveness of segmentation by comparing the corresponding results of Type-1 and Type-2 images, and the impact of texture enhancement by comparing the matching results of Type-2 and Type-3 images.
The Cumulative Match Characteristic (CMC) curves of the three input types to the fingerprint matcher are shown in Fig. 3.8. Each input is searched against the background database of 27,258 rolled fingerprints from both the NIST SD27 and NIST SD14 databases. The CMC curve plots the rank-k identification rate against k = 1, 2, 3, ..., 100. As shown in Fig. 3.8, automatic matching performance is significantly improved when Type-2 and Type-3 are used as the input to the matcher. The rank-20 identification rate increases from 2.48% (Type-1) to 4.13% (Type-2) and 10.74% (Type-3).
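To make the rank-k identification rate concrete, the following Python sketch (with hypothetical helper names, not the evaluation code actually used) computes a CMC curve from the retrieval rank of each latent's true mate.

import numpy as np

def cmc_curve(rank_of_mate, max_rank=100):
    """rank_of_mate: array with the retrieval rank of each latent's true mate
    (1 = best match). Returns rank-k identification rates for k = 1..max_rank."""
    ranks = np.asarray(rank_of_mate)
    n = len(ranks)
    # Rank-k identification rate: fraction of latents whose mate appears
    # within the top k candidates returned by the matcher.
    return np.array([(ranks <= k).sum() / n for k in range(1, max_rank + 1)])

# Example with toy retrieval ranks: print the rank-20 identification rate.
cmc = cmc_curve([3, 25, 7, 120, 1])
print(cmc[19])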
[CMC plot: x-axis Rank (m), y-axis Rank-m Identification Rate (%); curves for Type-1 (without segmentation), Type-2 (segmentation + original) and Type-3 (segmentation + texture).]
Figure 3.8: The cumulative matching curves (CMC) of all three latent fingerprint input types for good-quality latent fingerprints.
3.5.5 Comparison with Other TV-based Models
In Fig. 3.9, we compare the texture output v of the proposed ADTV model with two classical TV models: TV-L1 [12] and TV-L2 [61]. The variance distributions of the foreground/background regions as well as the corresponding segmentation results are shown in the bottom part of Fig. 3.9. In fingerprint regions, the proposed ADTV model is able to extract all essential fingerprint texture with clear ridge information, while the results of the other TV models still contain some background noise. In regions with structured noise, boundary and speckle noise can be clearly observed in the texture layer v obtained by the other TV-based models.
In contrast, the proposed ADTV model can filter out the background noise signals from the texture output v.
Figure 3.9: Performance comparison of the proposed ADTV model and two other TV-based models. First row: original image f, texture output v of TV-L2 [61], TV-L1 [12] and the proposed ADTV model (from left to right). Second row: distribution of the variance feature in the foreground and background areas. Third row: the segmentation result based on the variance feature.
The CMC curves of the two TV-based models and the proposed ADTV model are shown in Fig. 3.10, where we use the Type-3 result (segmentation + texture) of each TV model to match against the same background database. As shown in Fig. 3.10, the proposed ADTV model offers significantly better matching performance, and its rank-20 identification rate is about 5 times higher than those of the TV-L1 and TV-L2 models.
[CMC plot: x-axis Rank (m), y-axis Rank-m Identification Rate (%).]
Figure 3.10: Comparison of the cumulative matching curves (CMC) of the proposed
ADTV model and two other TV-based models.
3.6 Conclusion
While current automated fingerprint identification systems have achieved high accuracy in matching rolled/plain prints, latent matching remains a challenging problem and requires much human intervention. The goal of this work is to achieve accurate latent segmentation, which is an essential step towards automatic latent identification. Existing fingerprint segmentation algorithms perform poorly on latent prints, as they are mostly based on assumptions that are only applicable to rolled/plain fingerprints.
In this chapter, we first proposed two Total-Variation models (Adaptive TV-L1 and DTV) as image decomposition schemes that facilitate effective latent fingerprint segmentation and enhancement. Then we combined both models into one single model, the Adaptive Total-Variation model (ADTV). Built on the classical Total-Variation model, the proposed ADTV model differentiates itself by integrating two unique features of fingerprints, scale and orientation, into the model formulation. The proposed model has the ability to decompose a single latent image into two layers and locate the essential latent area for feature matching. The two spatially varying parameters of the model, scale and orientation, are adaptively chosen according to the background noise level and textural orientation, and effectively separate the latent fingerprint from structured noise in the background. Experimental results show that the proposed scheme provides effective segmentation and enhancement. The improvements in feature detection accuracy and latent matching further justify the effectiveness of the proposed scheme.
The proposed scheme can be regarded as a preprocessing technique for automatic latent fingerprint recognition. It also has strong potential to be applied to other relevant applications, especially the processing of images with oriented textures.
Figure 3.11: Experimental results of latent fingerprints with ugly quality. From left to right: original image f, scale parameter (x), orientation vector ~a(x), texture output v and the final segmentation result.
Figure 3.12: Experimental results of latent fingerprints with good quality. From left to right: original image f, scale parameter (x), orientation vector ~a(x), texture output v and the final segmentation result.
Figure 3.13: Experimental results of latent fingerprints with bad quality. From left to right: original image f, scale parameter (x), orientation vector ~a(x), texture output v and the final segmentation result.
Chapter 4
Texture-Aware Image Resizing
and Compressed Domain Video
Retargeting
4.1 Introduction
The demand to adapt images to display devices of various aspect ratios and resolutions calls for new solutions to image resizing (also called image retargeting). Traditional image resizing techniques are incapable of meeting this requirement, since they may either discard important information (e.g., cropping) or produce distortions by over-squeezing the content (e.g., non-uniform scaling).
Recently, several techniques have been proposed for content-aware image resizing. Avidan and Shamir [8] proposed a seam-carving algorithm, which resizes an image by incrementally removing or inserting seams. Another class of image resizing methods is based on image warping. In [75], a grid mesh is placed onto the image, and resizing is formulated as computing the new mesh geometry for a specified target size. Studies show that no single retargeting operator performs well on all images, and an algorithm combining multiple retargeting operators has recently been proposed [60]. Although this method demonstrates superior performance, its computational cost is relatively high for practical usage.
The above methods and their variants are mostly guided by pixel-wise significance, which is computed using either gradient or saliency information. The saliency map indicates the visual attractiveness of an area, while the gradient map indicates the presence of edge components. However, using an individual pixel as the basic unit may result in inaccurate characterization, since both maps are only weakly correlated with the underlying object structure. Specifically, with these measures, part of an object may be well preserved because it contains high-contrast fine structures, while the rest undergoes heavy distortion due to its homogeneous surface.
Texture redundancy is another issue that remains unaddressed in previous image resizing work. Repetitive textures can be modeled as a primitive texture element that is replicated according to certain placement rules [56]. An ideal resizing solution for these textures would be to reduce the total number of replications while keeping the primitive element intact. However, previous resizing schemes usually leave the replication number unchanged and distort the shape of the primitive texture element. More recently, Wu [83] proposed a novel image resizing technique that further explores the underlying image semantics. Symmetrical regions are resized by summarization while non-symmetrical regions are handled with a traditional warping method. This method effectively addresses texture redundancy, and opens a new direction for media retargeting.
Video retargeting is more challenging than image retargeting because of the additional temporal coherence requirement and the need to preserve object motions. Recently, several sophisticated techniques have been proposed for video retargeting, which conduct video resizing either through non-uniform warping [80] or by iteratively removing columns/rows of unimportant pixels [28, 59]. In [38, 76, 77, 87], preserving object motion was further considered, which leads to improved temporal coherence in the retargeted result. Most recently, the issue of scalability was addressed in [87] and a scalable and computationally efficient solution was proposed. Despite their very promising results, these techniques are all designed for raw video data, which makes them difficult to use in practice, as most real-world digital videos are available mainly in compressed format.
To apply spatial-domain retargeting techniques to a compressed video, we need to first decompress the video back to raw format, apply the retargeting algorithm, and eventually recompress the retargeted video so that it remains interoperable with other applications. For example, a retargeted video may still need to be represented in compressed format for display by a browser, storage in an archive, or transmission through a network. However, this process has many disadvantages. First of all, it requires both decompression and recompression, which is computation-intensive and time-consuming, especially when the video server needs to support quality of service for heterogeneous clients. Secondly, it requires large storage for the intermediate decompressed video sequence, especially when several clients are performing the same task on different video files. To reduce the overhead involved in the decompression and recompression steps, a more efficient technique is to conduct video retargeting in the compressed domain by directly manipulating the compressed data.
In this chapter, we will first present one solution for image resizing and another for video retargeting. Our proposed image resizing algorithm is aware of texture properties and capable of preserving underlying object structures. Then we will propose a solution for conducting video retargeting in the compressed domain. The proposed system takes a compressed video bitstream as input and operates directly in the compressed domain, thus saving the computation of decompression, spatial-domain processing and recompression.
The rest of this chapter is organized as follows. Our proposed region-adaptive texture-aware image resizing algorithm is introduced in Section 4.2. In Section 4.3, we describe our proposed compressed-domain video retargeting system. Experimental results are shown in Section 4.7. Finally, concluding remarks are given in Section 4.8.
4.2 Region-Adaptive Texture-Aware Image
Resizing
In this section, we start with studying the impact of texture regularity on image
resizing. Then we present the algorithm details for our proposed region-adaptive
texture-aware image resizing algorithm.
4.2.1 Impact of Texture Regularity on Resizing
Based on their degree of randomness, most real-world textures can be roughly classified as regular or stochastic. Regular textures usually appear as periodic patterns with repeating intensity, color and shape elements. Stochastic textures, with features opposite to repetitiveness, exhibit less noticeable structure and display rather random patterns. As revealed by the classical experiment in [56], regularity plays a significant role as a high-level feature in human texture perception.
Previous image resizing algorithms are mostly guided by the saliency map [34], which gives higher importance to regions that present distinct properties (color, intensity or orientation) with respect to their surroundings. Textured regions, both regular and stochastic, are usually perceived as unimportant due to their spatial homogeneity, and are largely deformed during resizing in order to keep the prominent object intact.
This treatment may not provide satisfactory resizing results. Distortion in a stochastic texture may be less noticeable because of its structural randomness. However, for regular textures, structural defects caused by arbitrary warping may result in serious visual artifacts. This is due to humans' innate ability to perceive symmetry; the human visual system has specialized receptive fields responding to disruption in regularity [64]. An example is given in Fig. 4.1, where we compare the resizing results of three different schemes [8, 39, 75] on regular and stochastic textures.
Figure 4.1: Image resizing results for brick wall (regular texture) and sky (stochastic texture) using three different schemes. (a) Original images, (b) seam carving, (c) scale-and-stretch, (d) patch-based texture synthesis.
As shown in Fig. 4.1, all three algorithms produce reasonable resizing results on sky, an example of a stochastic texture, but they perform differently on brick wall, an example of a regular texture. For the brick wall image, the patch-based texture synthesis approach [39] performs the best among the three algorithms. Seam carving [8] and scale-and-stretch [75] do not fully exploit pattern regularity in resizing, so they fail to reduce texture redundancy. The geometric defect on each brick induced by these two approaches leads to a disruption in texture regularity, which is highly visible to human eyes. On the other hand, for the sky image, the distortion caused by [8, 75] is relatively subtle, since its structure is less regular and the resultant deformation is less noticeable.
4.2.2 Region-adaptive Image Resizing
The proposed region-adaptive resizing method consists of three main steps. First, the original image is partitioned into multiple regions, each classified as salient, regular or irregular according to visual saliency and texture regularity; this produces the region map. Second, we resize the region map through mesh warping, which incorporates both the region and the contour information. Finally, based on region features, the final result is generated using either warping or texture synthesis. Each individual step is detailed below.
Region Map Generation
The image is segmented using the mean-shift segmentation method [17], which takes three parameters as input: the spatial bandwidth h_r, the color bandwidth h_s, and the minimum number of pixels per region N. To speed up the process, we segment a down-sampled version of the original image and then up-sample the result back to the original size. A median filter is applied to further smoothen region boundaries.
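A minimal sketch of this step, assuming OpenCV's pyrMeanShiftFiltering as the mean-shift implementation and using the bandwidths reported later in Section 4.7.1, is given below; the minimum-region-size constraint N and the subsequent connected-component labeling are not reproduced here.

import cv2

def region_map(image, h_r=5.0, h_s=4.0, scale=0.5):
    """Mean-shift based segmentation: filter a down-sampled copy, then
    up-sample the result back to the original size."""
    small = cv2.resize(image, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_LINEAR)
    # Mean-shift filtering with spatial bandwidth h_r and color bandwidth h_s.
    filtered = cv2.pyrMeanShiftFiltering(small, sp=h_r, sr=h_s)
    # Up-sample back to the original resolution.
    regions = cv2.resize(filtered, (image.shape[1], image.shape[0]),
                         interpolation=cv2.INTER_NEAREST)
    # Median filtering smooths the region boundaries.
    return cv2.medianBlur(regions, 5)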
Then, we compute individual pixel significance using the saliency measure proposed in [34]. Based on our observation, the high-saliency regions computed by this measure sometimes cover only part of the real prominent objects, while some areas surrounding prominent objects may be mistakenly considered salient. Therefore, we use region saliency, computed by averaging pixel saliency within each region, to ensure uniform resizing of the underlying object.
To measure pattern regularity, we apply a scoring system based on the Gabor-filtering-based texture descriptor [84]. Each region is filtered with a set of 24 Gabor filters (6 orientations and 4 scales). The filtered results are projected along the horizontal/vertical directions, and the normalized autocorrelation function (NAC) is computed. The periodicity of the NAC can be captured by multiple projections for highly structured textures (e.g., brick wall, fence, etc.), while this periodicity is either very weak or absent for stochastic textures (e.g., sky, grass, etc.).
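The following Python sketch illustrates the idea of this regularity score under simplifying assumptions (a single Gabor response and a hypothetical peak-based periodicity measure); the actual descriptor in [84] uses the full 24-filter bank.

import numpy as np

def normalized_autocorrelation(signal):
    """NAC of a 1-D projection profile, normalized so NAC[0] = 1."""
    s = signal - signal.mean()
    ac = np.correlate(s, s, mode="full")[len(s) - 1:]
    return ac / (ac[0] + 1e-12)

def regularity_score(gabor_response):
    """Project a filtered region along rows/columns and measure how strongly
    the NAC oscillates; a large secondary peak indicates a periodic (regular)
    texture, while a flat NAC indicates a stochastic one."""
    scores = []
    for axis in (0, 1):
        profile = gabor_response.sum(axis=axis)
        nac = normalized_autocorrelation(profile)
        # Height of the strongest non-zero-lag peak as a crude periodicity cue.
        scores.append(nac[1:].max() if len(nac) > 1 else 0.0)
    return max(scores)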
To generate the region map, regions with prominent saliency and texture regularity are classified as salient and regular, respectively, while the rest are labeled as irregular.
Mesh Warping
We cover the region map with a grid mesh denoted by G = (V_q, E, F), where V_q, E and F represent the sets of quad vertices, edges and quad faces, respectively. In addition to the quad vertices, we add a set of contour vertices, V_c, to better preserve the region geometry. The contour vertices are derived by tracing along the region contour and sampling contour pixels at every interval of T_s (see Fig. 4.2).
Our mesh warping algorithm takes as input the initial vertex positions and
solves for the new mesh geometry. To formulate this global optimization problem,
we consider the following three factors.
1) Saliency-weighted shape preservation
Figure 4.2: Two types of control vertices for mesh warping: quad vertices V_q and contour vertices V_c on the mesh grid.
Given a quad face f and its quad vertices V_q(f), we measure the shape deformation of f as its loss of squareness with the following energy:
D(f) = \sum_{v_i, v_j \in V_q(f)} \left\| (v'_i - v'_j) - s_f (v_i - v_j) \right\|^2,
where s_f is the optimum scaling factor of quad f. The shape-warping energy of all quad faces is given by
\sum_{f \in F} w_{f(R)} D(f),   (4.1)
where w_{f(R)} is the saliency of the region affiliated with quad f. With this weighting, quad faces from the same region are resized homogeneously, while the majority of the distortion tends to be diffused to non-salient regions.
2) Laplacian coordinates preservation
During the mesh warping, we preserve the Laplacian coordinates of the region contours by minimizing the following energy:
\sum_{v_i \in V_c} \left\| L(v_i) - T_i \delta_i \right\|^2,   (4.2)
where T_i is a 2x2 transform matrix of v_i and will be updated iteratively during the optimization.
3) Mean-value preservation
To prevent contour vertices from shifting into another quad during the mesh warping, we preserve each contour vertex's relative position with respect to its surrounding quad vertices by maintaining its mean value coordinates [24]. For every contour vertex v_i \in V_c, we try to minimize the following function:
\sum_{v_i \in V_c} \left\| v_i - \sum_{v_j \in F(v_i)} \lambda_{ij} v_j \right\|^2,   (4.3)
where \lambda_{ij} = m_{ij} / \sum_j m_{ij}, m_{ij} is the mean value coordinate of v_j with respect to v_i, and F(v_i) is the quad face in which v_i was originally located.
Let us use \|PV - Q\|^2 to denote the position constraint imposed by the new image size, h_new x w_new. By incorporating the energy terms in Eqs. (4.1)-(4.3), the total energy we want to minimize can be written in the following matrix format:
\|DV - s(V)\|^2 + \|LV - \delta(V)\|^2 + \|CV\|^2 + \|PV - Q\|^2,
where the four terms correspond to Eqs. (4.1), (4.2), (4.3) and the position constraint, respectively. Different weights may be used to balance the different objectives. Then, we can express the above objective function as
min_V \|AV - b(V)\|^2,   (4.4)
where
A = \begin{pmatrix} D \\ L \\ C \\ P \end{pmatrix}, \qquad b(V) = \begin{pmatrix} s(V) \\ \delta(V) \\ 0 \\ Q \end{pmatrix}.
The nonlinear least-squares optimization problem given in Eq. (4.4) can be solved using the iterative Gauss-Newton method. The vertex positions are initialized under the homogeneous resizing condition, and they are updated iteratively via
V^{(k)} = (A^T A)^{-1} A^T b(V^{(k-1)}) = H b(V^{(k-1)}),   (4.5)
where V^{(k)} is the vector of vertex positions after the k-th iteration. As H depends only on A, it can be precomputed and stays fixed during the iteration.
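A compact Python sketch of this fixed-point Gauss-Newton iteration is given below; A (the stacked constraint matrix) and b_of (the function assembling the right-hand side from the current vertices) are hypothetical stand-ins for the system described above.

import numpy as np

def solve_mesh(A, b_of, V0, n_iters=10):
    """Iteratively solve min_V ||A V - b(V)||^2 where only b depends on V.

    A     : (m, n) stacked constraint matrix (constant across iterations)
    b_of  : function mapping the current vertex vector V to b(V)
    V0    : initial vertex positions (homogeneous resizing)
    """
    # H = (A^T A)^{-1} A^T; the pseudo-inverse gives the same matrix
    # when A has full column rank, and is precomputed once.
    H = np.linalg.pinv(A)
    V = V0.copy()
    for _ in range(n_iters):
        V = H @ b_of(V)          # V^(k) = H b(V^(k-1))
    return V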
Texture Re-synthesis
Based on the generated region map, salient and irregular regions are resized with mesh warping as discussed in Sec. 4.2.2. In this subsection, we focus on image synthesis in regular regions.
Regular regions are usually perceived as non-salient and are largely squeezed during mesh warping. To preserve texture content, we re-synthesize textures on the resized regular regions using the original texture as an exemplar. For speed, we adopt the real-time texture synthesis method proposed in [39]. Since the salient and irregular regions are already known, synthesizing on regular regions is a constrained synthesis problem [39].
Directly applying texture synthesis may destroy the illumination of the original patterns. To enhance the authenticity of the re-synthesized result, we first use a nonlinear decoupling filter [50] to extract the illumination map, which is then used
to guide the placement of each texture patch (see Fig. 4.3(a)). During texture re-synthesis, a patch is selected as a qualified candidate only if it minimizes the blending error and also matches the illumination level at the desired location.
To further reduce the discontinuity artifacts at region boundaries, we use a priority queue to store the locations for patch placement. The highest priority is given to areas close to salient regions, followed by those within other region border zones. Patches are placed consecutively in spiral order (Fig. 4.3(b)).
Figure 4.3: (a) Texture patch selection guided by the resized illumination map, (b) priority-queue-based texture placement following a spiral order.
4.3 A Compressed Domain Video Retargeting
System
In this section, we propose a novel video retargeting system that operates directly on an intermediate representation in the compressed domain, namely, the discrete cosine transform (DCT) domain, thus avoiding the computationally expensive process of decompression, processing, and recompression. As the system uses the DCT coefficients directly for processing, only minimal decoding of the video stream is necessary. Although there is an overhead for entropy decoding and encoding of intermediate symbols (e.g., DCT coefficients), the total overhead is significantly less than that of full decompression and recompression. Though the proposed system targets the H.264/AVC coding standard, it can easily be applied to other video compression standards as well.
4.3.1 System Overview
The proposed system takes an H.264/AVC encoded video bitstream as input, conducts retargeting directly on partially decoded DCT coefficients, and outputs an H.264/AVC-compliant bitstream of the retargeted video. The proposed system consists of three separate stages (or modules), as shown in Fig. 4.4. They are:
1. the partial decoding stage;
2. the compressed domain video resizing stage; and
3. the re-encoding stage.
In the partial decoding stage, we partially decode the video bitstream to reconstruct the non-inverse-quantized DCT coefficients of each frame. In the resizing stage, we perform video analysis and content-aware resizing directly based on compressed-domain features and operations. The resized output is re-encoded into the output bitstream in the last stage. Details of each module are described in the following sections.
Figure 4.4: The block diagram of the proposed compressed-domain video retargeting system, which consists of three stages (or modules): 1) the partial decoding stage, 2) the compressed domain video resizing stage, and 3) the re-encoding stage.
4.4 Partial Decoding
In this stage, we partially decode the input bitstream into a form of compressed data that can be further utilized in the resizing stage. Although we try to limit the amount of work done in this stage, a certain amount of decoding is still required. The input bitstream is first entropy decoded into residual DCT blocks, which are then used to compute the reconstructed DCT blocks of the original video frames. Since our system operates directly in the DCT domain, no inverse DCT is required. To further reduce the overhead of the decoding stage, we also avoid the inverse quantization step. Therefore, the output of the partial decoding stage consists of non-inverse-quantized DCT block coefficients.
To compute the reconstructed DCT coefficients from the residual data, we make use of the transform-domain prediction techniques proposed in [13, 53]. As the H.264/AVC standard supports both intra and inter prediction modes, two types of prediction are conducted here. For the inter-prediction mode, we use the macroblock-wise inverse motion compensation (MBIMC) scheme proposed by Porwal et al. [53]. In this scheme, the predicted DCT block of the current frame is estimated using the DCT coefficients of nine spatially adjacent blocks in the previous frame. Although originally proposed for the MPEG standard, this method can easily be applied to the inter prediction mode of H.264/AVC as well. For the intra-prediction case, different situations have to be considered, as H.264/AVC supports nine intra-prediction modes for 4x4 sub-blocks and four intra-prediction modes for 16x16 macroblocks. Here, we use the method in [13] to compute the DCT coefficients of intra-predicted blocks. By combining the intra/inter-predicted DCT coefficients and the partially decoded residual DCT coefficients, we obtain the reconstructed DCT coefficients, which are utilized in the resizing stage.
Another important task to be performed in this stage is extracting motion
vectors from the input bitstream. The motion vectors are temporarily stored, and
they will be processed and utilized in both the resizing and the re-encoding stages.
4.5 Compressed Domain Video Resizing
In this stage, we perform content-aware resizing using compressed-domain features and operations in multiple steps. The input video is first segmented into different scenes, and each scene is processed separately. Then, we analyze the importance of each scene using three different measures: saliency, motion and texture. Guided by the importance map, the input video is partially resized through optimum cropping, followed by a column-mesh-based warping procedure to reach its desired size. Finally, we compute the DCT coefficients of the retargeted result. The block diagram of this procedure is depicted in Fig. 4.5. All steps are detailed below.
Figure 4.5: The block diagram of the compressed-domain video resizing stage. This module takes reconstructed DCT coefficients as input and outputs the DCT coefficients of the retargeted video.
4.5.1 Scene Change Detection
We use a pair-wise macroblock comparison method to detect scene changes between consecutive frames. For each video frame, we compare the DCT coefficients of each block with the average coefficient values of the same block in previous frames. The content difference for block k, denoted by \delta_k, is computed as
\delta_k = \frac{1}{N^2} \sum_{i=1}^{N \times N} \frac{\left| c_k(i) - \bar{c}^T_k(i) \right|}{\max\{ c_k(i), \bar{c}^T_k(i) \}},
where N is the block size, c_k(i) is the i-th DCT coefficient of block k, and \bar{c}^T_k(i) is the average of the corresponding DCT coefficients of block k over the previous T frames (in our case, T = 5). If this difference exceeds a preset threshold \delta_{thre}, we claim that there is a change in block k. We use D_k to denote this change:
D_k = 1 if \delta_k > \delta_{thre}, and D_k = 0 otherwise.
Let K be the total number of blocks in a frame. When the percentage of changed blocks in the current frame, T_c = \frac{1}{K} \sum_{k=1}^{K} D_k, exceeds a preset threshold, a scene change is detected, and the data of the entire scene is loaded for further processing. Fig. 4.6 illustrates the scene change detection results for a 1200-frame segment of the big buck bunny sequence. The sharp peaks in the T_c curve indicate the occurrences of scene changes.
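A small Python sketch of this block-wise comparison is given below, under the assumption that the non-inverse-quantized DCT coefficients of each frame are available as an array of shape (num_blocks, N*N); the threshold values here are illustrative only.

import numpy as np

def scene_change(curr_blocks, prev_blocks_avg, delta_thre=0.3, tc_thre=0.5):
    """curr_blocks, prev_blocks_avg: arrays of shape (K, N*N) holding the DCT
    coefficients of the current frame and the per-block average over the
    previous T frames. Returns (is_scene_change, T_c)."""
    num = np.abs(curr_blocks - prev_blocks_avg)
    den = np.maximum(np.abs(curr_blocks), np.abs(prev_blocks_avg)) + 1e-9
    # delta_k: mean normalized coefficient difference of each block.
    delta = (num / den).mean(axis=1)
    D = delta > delta_thre          # per-block change flags
    T_c = D.mean()                  # fraction of changed blocks in the frame
    return T_c > tc_thre, T_c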
Figure 4.6: Top: the manual scene segmentation result for a 1200-frame segment of the big buck bunny sequence. Bottom: the percentage of changed blocks T_c for each frame. The sharp peaks in the T_c curve closely match the manual segmentation result.
4.5.2 Visual Importance Analysis
The visual importance map is used to guide the retargeting process: in an ideal retargeted result, we would like to preserve the important content as much as possible while allowing the unimportant content to undergo more deformation. Once a scene change is detected, the data of the entire scene is processed together for visual importance analysis. The challenge of conducting this analysis in the compressed domain is that the original pixel values are unknown, so we need to rely on compressed-domain features (e.g., DC and AC coefficients, motion vectors) only. In this work, we perform content analysis based on three features: saliency, texture and motion. Each analysis generates one map, and the three maps are eventually combined to form the final visual importance map.
a) Saliency Map
Saliency is used to detect the region of interest in an image, and has been widely used to guide both image and video retargeting [38, 75-77, 87]. While most saliency detection methods operate in the pixel domain [33, 34], there is recent work on compressed-domain saliency detection for images [20] and videos [21].
In the proposed system, we adopt the spectral-residual visual attention model [33] for saliency map computation. In [33], the saliency map of an image is calculated using the spectral residual signal, derived from analyzing the log spectrum of the image. Although this saliency detection method operates in the pixel domain, we can modify it so that it can be used in the DCT domain as well. Instead of downsampling the input image to a smaller size as done in [33], we directly use the DC coefficient of each DCT block and apply the same saliency detection algorithm to this DC-based image. For the 4x4 DCT transform, the DC coefficients of the entire image yield an equivalent image obtained by downsampling the original image by a factor of 4 in each dimension. For improved temporal coherency, the visual saliency map is temporally filtered with its neighboring T frames (in our system, T = 5).
b) Texture Map
Fine structures, such as textures and edges, need special treatment in video retargeting. One of the limitations of seam carving [59] is that noticeable artifacts occur when the removed seams pass through edge regions. In addition, the effect of texture regularity on the retargeting result was studied in [89], and it was observed that stochastic textures are less susceptible to large deformation than regular textures.
Figure 4.7: Illustration of motion map generation using motion vectors from the sequence coastguard. From left to right: one video frame, the original motion vector map, the compensated object motion, and the final motion map (after applying temporal filtering to the compensated motion map).
The saliency map generated using [33] contains a limited amount of texture information, as only DC coefficients are used in its computation. Extracting textures and edges normally requires pixel-level processing, yet it is possible to obtain some level of texture and edge information through frequency analysis of the DCT coefficients, as textures and edges correspond to mid-to-high frequency components in the DCT domain.
In the proposed system, we use the different frequency components of the DCT coefficients to generate a feature vector for each block. To classify each block into one of three categories (texture, edge and smooth region), we compute the distance between the feature vector of the given block and a group of preset feature vectors, obtained by training on a set of test sequences from the public database [2]. The likelihood of a block belonging to a particular category (k = tex, edge, smooth) is given by
L_k = \frac{ e^{\frac{1}{1+d_k}} }{ \sum_{k} e^{\frac{1}{1+d_k}} },
where d_k is the Euclidean distance between the feature vector and the preset centroid vector for category k. In this work, we pay special attention to the texture map L_tex.
The texture degree of a block also provides a measure of the reliability of its motion vector. It is well known that low-textured regions tend to yield larger encoding matching errors [74]. For each macroblock, we can compute a confidence score for the corresponding motion vector using its texture degree, which will be elaborated in the next section.
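For illustration, the block classification step can be sketched in Python as below; the centroid vectors and the way block feature vectors are built from DCT frequency bands are assumed here rather than taken from the trained values used in the system.

import numpy as np

CATEGORIES = ("tex", "edge", "smooth")

def block_likelihoods(feature, centroids):
    """feature: DCT-band feature vector of one block.
    centroids: dict mapping category name -> preset centroid vector.
    Returns L_k = exp(1/(1+d_k)) / sum_k exp(1/(1+d_k))."""
    d = np.array([np.linalg.norm(feature - centroids[k]) for k in CATEGORIES])
    w = np.exp(1.0 / (1.0 + d))
    L = w / w.sum()
    return dict(zip(CATEGORIES, L))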
c) Motion Map
The saliency map generated by [33] mainly captures the visually attractive parts of an individual frame, but fails to consider motion information, which is another critical factor for detecting important content in video. For example, a fast-moving object might be non-salient in a single frame, yet be important content in the video sequence.
Most spatial-domain video retargeting methods [38, 76, 77, 87] incorporate motion detection techniques based on the SIFT feature [45] or optical flow [79]. However, these motion detection methods are not applicable in the compressed domain. Our goal here is to detect moving regions in the video sequence using the motion vectors embedded in the video bitstream. There are two challenges in using motion vectors directly for moving object detection:
1. It is common that the video includes various types of camera motion (e.g., zoom, pan, tilt), which needs to be excluded to reflect the true object motion.
2. Some motion vectors are unreliable, as they do not agree with the true motion.
In the proposed system, the camera motion is estimated using a four-parameter global motion model [73]. In this model, the relationship between the pixels of consecutive frames can be written as
\begin{pmatrix} \tilde{x} \\ \tilde{y} \end{pmatrix} = \begin{pmatrix} z & r \\ -r & z \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} p_r \\ p_d \end{pmatrix},   (4.6)
where x and y are coordinates in the current frame, \tilde{x} = x - mv_x and \tilde{y} = y - mv_y are the corresponding coordinates in the previous frame, and z, r, p_r and p_d are the four unknown camera parameters representing zoom, rotation, right pan and down pan, respectively.
To estimate the four camera parameters, we can re-write Eq. (4.6) as an over-determined linear system [73] and compute the least-squares estimator of the four camera parameters:
X_{LS} = \begin{pmatrix} z & r & p_r & p_d \end{pmatrix}^T = (H^T H)^{-1} H^T Y,   (4.7)
where Y is the observation column vector and H is the spatial location matrix. As done in [74], we weight the rows of Y and H using the confidence measure computed from the texture map, so that unreliable motion vectors have minimal impact on the camera motion estimation result.
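A sketch of this weighted least-squares fit is shown below in Python; the way H and Y are assembled from block coordinates and motion vectors follows the sign convention assumed in Eq. (4.6) above, and the confidence weights are assumed to come from the texture map.

import numpy as np

def estimate_camera_motion(coords, mvs, weights):
    """coords: (K, 2) block-center coordinates (x, y) in the current frame.
    mvs: (K, 2) motion vectors (mv_x, mv_y). weights: (K,) confidence scores.
    Returns the four-parameter estimate [z, r, p_r, p_d]."""
    rows_H, rows_Y, w = [], [], []
    for (x, y), (mvx, mvy), c in zip(coords, mvs, weights):
        # Previous-frame coordinates implied by the motion vector.
        rows_H.append([x,  y, 1.0, 0.0]); rows_Y.append(x - mvx); w.append(c)
        rows_H.append([y, -x, 0.0, 1.0]); rows_Y.append(y - mvy); w.append(c)
    sw = np.sqrt(np.asarray(w))
    H = np.asarray(rows_H) * sw[:, None]
    Y = np.asarray(rows_Y) * sw
    # Weighted least-squares solution X_LS = (H^T H)^{-1} H^T Y.
    X_LS, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return X_LS   # [z, r, p_r, p_d]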
The estimation process given above assumes that object motion does not fit the camera model in Eq. (4.6) and thus appears as outliers in the least-squares estimation. The estimated object motion is then computed as
MV_{obj} = \begin{pmatrix} \tilde{mv}_x(1) & \tilde{mv}_y(1) & \tilde{mv}_x(2) & \tilde{mv}_y(2) & \dots \end{pmatrix}^T = Y - H X_{LS},
where \tilde{mv}_x(i) and \tilde{mv}_y(i) are the compensated motion vector components of block i. The final motion map is computed from the magnitude of the compensated motion vectors after applying a temporal filter over the neighboring T frames (in our case, T = 5).
We show the motion vectors after motion compensation and the final motion map for the sequence coastguard in Fig. 4.7. The original motion vectors include both the object motion (ship) and the camera motion (right pan). After motion compensation, we eliminate the camera motion from the original motion vectors, leaving only the object motion. The two camera parameters in Eq. (4.7), p_r and p_d, will be used in the mesh deformation stage as described in Section 4.5.4.
d) Visual Importance Map
For each video frame, the final visual importance map, denoted by I, is computed by combining the three maps (see Fig. 4.8) generated from the above analysis:
I = I_s \cdot I_t \cdot I_m,   (4.8)
where I_s, I_t and I_m are the saliency, texture and motion maps, respectively. The values of all three maps are normalized to the range [0.10, 1.00]. Although there are other ways (e.g., a weighted sum) to fuse the three maps into the final importance map, we find that the multiplication-based fusion generates a more satisfying result.
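The fusion step is simple enough to state directly in code; the sketch below assumes the three maps are arrays of the same block resolution and only normalizes and multiplies them as in Eq. (4.8).

import numpy as np

def normalize(m, lo=0.10, hi=1.00):
    """Linearly rescale a map to the range [lo, hi]."""
    m = m.astype(np.float64)
    span = m.max() - m.min()
    unit = (m - m.min()) / span if span > 0 else np.zeros_like(m)
    return lo + (hi - lo) * unit

def importance_map(saliency, texture, motion):
    """I = I_s * I_t * I_m with each map normalized to [0.10, 1.00]."""
    return normalize(saliency) * normalize(texture) * normalize(motion)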
Figure 4.8: Illustration of the visual importance analysis procedure: the frames of an entire scene are analyzed and three maps (saliency, texture and motion) are generated. Fused together, they form the final visual importance map.
4.5.3 Optimum Cropping
The importance of incorporating cropping into video retargeting was extensively discussed in [77]. When the video sequence is densely populated with salient content, it is difficult to preserve all salient content while maintaining temporal coherence, because the retargeting result would be close to uniform scaling in this scenario. Instead of performing non-uniform warping on the entire frame, a better solution is to allow some of the non-salient regions to be discarded.
In the proposed system, we first partially resize the video through cropping, and then perform warp-based deformation to resize the video to its desired size. We define the cropping factor as the percentage of block columns to discard during cropping. The optimum cropping factor, which balances the amount of cropping and warping, needs to be determined first. Since the importance map is computed at the DCT block level, we perform cropping by discarding unimportant columns of DCT blocks (of size 4x4 in our case). In the following, we assume the resizing is conducted along the horizontal direction. The same process can be applied to resizing in the vertical direction.
To compute the optimum cropping factor, we first determine the minimum and maximum cropping factors, which are precomputed values based on the desired resizing factor. For example, if the resizing factor is 0.50 (resized to half width), the minimum cropping factor is set to 0.50 + (1.0 - 0.50) x 20% = 0.60, while the maximum cropping factor is 0.50 + (1.0 - 0.50) x 80% = 0.90. Within this cropping range, we compute the average visual importance value for all possible cropping options. Our goal is to find a rectangular region in the input video that contains the maximum average visual importance value. Since we are only dealing with a limited number of options, we use exhaustive search to compute the best cropping window of a given window length. After computing the best cropping window for each window length (see Fig. 4.9), we look for the optimum window length that contains the maximum average visual importance value. Once we have determined the optimum cropping window of the given video scene, we discard the columns of DCT blocks that fall outside this window. This cropping window is applied to all the frames of the same scene.
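The exhaustive search over cropping windows can be sketched as follows; column_importance is assumed to be the per-block-column average of the visual importance map accumulated over the scene, and the window lengths correspond to cropping factors in the precomputed range.

import numpy as np

def best_cropping_window(column_importance, min_keep, max_keep):
    """Exhaustively search for the window of block columns with the highest
    average importance. min_keep/max_keep bound the window length (in columns)
    derived from the minimum and maximum cropping factors."""
    col = np.asarray(column_importance, dtype=np.float64)
    prefix = np.concatenate(([0.0], np.cumsum(col)))
    best = (-np.inf, 0, 0)                 # (avg importance, start, length)
    for length in range(min_keep, max_keep + 1):
        for start in range(len(col) - length + 1):
            avg = (prefix[start + length] - prefix[start]) / length
            if avg > best[0]:
                best = (avg, start, length)
    _, start, length = best
    return start, length                   # columns kept: [start, start+length)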
4.5.4 Column Mesh Deformation
After partially resizing through cropping, the video is further resized to its desired size through non-uniform warping. For spatial-domain retargeting methods [38, 76, 77, 87], a quad-shaped mesh is often used to guide the warping operation. However, conducting a quad-to-quad deformation is challenging in the compressed domain, since it is difficult to compute the corresponding DCT coefficients of a block after it is warped to an arbitrarily shaped quadrilateral. Instead of adopting the conventional quad-shaped mesh, we use a column-shaped mesh to guide the warping process.
Figure 4.9: Left: the average column importance curve for different cropping factors in the cropping range. Each point in this curve corresponds to the best window of a given length. The optimum cropping factor and its corresponding window maximize the average column importance within the cropping window. Right: the optimum cropping window and the average visual importance for each column. Columns marked in blue represent the region that falls inside the cropping window, while columns marked in red are discarded after cropping.
For mesh warping, we extend the formulation in [89] and adjust it to our
proposed column-mesh structure. In addition, two new energy terms are added to
preserve motion and temporal coherency.
Consider a column mesh, represented by M^t = {V^t, C}, as shown in Fig. 4.10, where C = {c_1, c_2, ..., c_n} denotes the set of columns, and V^t = {v^t_0, v^t_1, v^t_2, ..., v^t_n} is the set of vertex positions, with v^t_i representing the horizontal coordinate of vertex i. The width between consecutive mesh vertices is set to 4, which is the same as the length of the transform block. We place such a mesh on each input video frame.
Figure 4.10: The column mesh used for compressed-domain video resizing. The mesh M^t = {V^t, C} includes a set of vertices V^t = {v^t_0, v^t_1, v^t_2, ..., v^t_n} and a set of columns C = {c_1, c_2, ..., c_n}.
We take the initial vertex positions of each frame, V^t, as the input and solve for their new positions V^t_{new} = \{\bar{v}^t_0, \bar{v}^t_1, \bar{v}^t_2, \dots, \bar{v}^t_n\} by minimizing an objective function as described below.
a) Shape Deformation
During the resizing process, we want to preserve the shape of columns with high saliency while allowing columns of lower saliency to be squeezed or stretched more. For each frame t, we measure the amount of shape deformation of column i as
D^t(c_i) = \left\| (\bar{v}^t_i - \bar{v}^t_{i-1}) - l_i (v^t_i - v^t_{i-1}) \right\|^2, \quad i = 1, 2, 3, \dots, n,
where l_i is the optimum scaling factor for c_i and is updated at each iteration k as
l^{(k)}_i = \frac{\tilde{l}_i}{\left| \bar{v}^{t(k)}_i - \bar{v}^{t(k)}_{i-1} \right|},
where \tilde{l}_i is the original width of column i before deformation and \bar{v}^{t(k)}_i is the vertex position of column i at iteration k.
The shape deformation energy of all columns is given by
E_d = \sum_t \sum_{c_i \in C} I^t(c_i) D^t(c_i),   (4.9)
where I^t(c_i) is the average visual importance of column c_i at frame t.
b) Vertex Order Preservation
It is possible that some vertices may flip over each other after mesh deformation, leading to unwanted artifacts. To avoid this, we preserve their relative positions with respect to their immediate neighboring vertices by maintaining their relative barycentric coordinates. For the vertices v_i, we minimize
E_v = \sum_t \sum_i \left\| v^t_i - \sum_{v_j \in N(v_i)} m_{ij} v^t_j \right\|^2,   (4.10)
where m_{ij} is the barycentric coordinate of v_j with respect to v_i, and N(v_i) represents the set of neighboring vertices of v_i.
c) Temporal Coherency
To preserve temporal coherency and avoid jittering artifacts, we enforce the temporal smoothness of vertex positions across neighboring frames. Specifically, we try to minimize
E_c = \sum_t \sum_i \left\| v^{t+1}_i - v^t_i \right\|^2,   (4.11)
where v^t_i are the vertex positions of the previous frame.
d) Motion Preservation
As noted in [77], simply enforcing per-pixel smoothing along the temporal dimension, which does not take object or camera motion into account, yields poor resizing results. Under this scenario, an object that moves from the left to the right of the frame may be resized differently throughout the whole scene. For example, as shown in Fig. 4.11 (top), the scene consists of a camera pan from right to left, and the tree size changes across the frames without motion preservation.
To account for object motion and camera motion, we exploit the camera parameters estimated in Section 4.5.2. Specifically, we utilize the camera right-panning parameter, p_r, since we resize the input video along the horizontal direction. Similarly, the down-panning parameter, p_d, would be used if we performed resizing along the vertical direction. To achieve motion-aware resizing, we minimize the following energy:
E_m = \sum_t \sum_i \left\| (\bar{v}^{t+1}_i - \bar{v}^{t+1}_{i-1}) - (\bar{u}^t_i - \bar{u}^t_{i-1}) \right\|^2,   (4.12)
where u^t_i = v^{t+1}_i - p^t_r, and p^t_r is the right-panning parameter of frame t. \bar{u}^t_i represents the corresponding position of u^t_i after mesh deformation. Since u^t_i may not align with any of the v^t_i, we represent it as a linear combination of the column mesh vertices in its immediate vicinity:
\bar{u}^t_i = \sum_{v_j \in N(u_i)} m_{ij} \bar{v}^t_j,
where m_{ij} is the barycentric coordinate of u^t_i w.r.t. the column vertices v^t_j in its immediate vicinity N(u_i). Note that we only consider columns whose corresponding positions in the previous frames are still within the frame boundary. For other columns, temporal coherency is preserved by the temporal coherency energy defined in Eq. (4.11). As shown in Fig. 4.11 (bottom), after incorporating the energy function for motion preservation, the tree size is preserved throughout all frames.
Figure 4.11: The impact of using motion preservation in the column mesh deformation. Top: resizing results without motion preservation and the corresponding column vertex movement paths; the tree is resized inconsistently in different frames. Bottom: resizing results with motion preservation and the corresponding column vertex movement paths; the tree size undergoes a more consistent transformation throughout the entire video sequence.
e) Joint Optimization for Column Mesh Deformation
Combining all the energy terms in Eqs. (4.9)-(4.12), we solve for the deformed column mesh by minimizing the following objective function:
E = E_d + \alpha E_v + \beta E_c + \gamma E_m,   (4.13)
subject to the boundary constraint, where the weighting coefficients are empirically set to \alpha = 1.0, \beta = 50.0 and \gamma = 10.0 in our experiments. We can represent the objective function in Eq. (4.13) and its constraint in matrix format as
\|DV - l(V)\|^2 + \|CV\|^2 + \|TV - S\|^2 + \|MV - N\|^2 + \|PV - Q\|^2,   (4.14)
where the first four terms correspond to Eqs. (4.9)-(4.12) and the last term, \|PV - Q\|^2, denotes the position constraint imposed by the target video size. The matrix expression given above can be rewritten as
min_V \|AV - b(V)\|^2,   (4.15)
where
A = \begin{pmatrix} D \\ C \\ T \\ M \\ P \end{pmatrix}, \qquad b(V) = \begin{pmatrix} l(V) \\ 0 \\ S \\ N \\ Q \end{pmatrix}.
The nonlinear least-squares optimization problem in Eq. (4.15) can be solved through an iterative Gauss-Newton method. The vertex positions are initialized with a homogeneous resizing condition and updated iteratively via
V^{(k)} = (A^T A)^{-1} A^T b(V^{(k-1)}) = H b(V^{(k-1)}),   (4.16)
where V^{(k)} is the vector of vertex positions after the k-th iteration. As H depends only on A, it can be precomputed and stays fixed during the iteration process.
4.5.5 Transform Domain Block Resizing
Once the deformed mesh is computed, we use this information to resize the video frame in the compressed domain. We employ the DCT-domain resizing method in [69], which supports resizing with arbitrary factors. In [69], each block in the resized frame is downsized from a rectangular region (called the supporting area) in the original frame, as illustrated in Fig. 4.12. The resizing is conducted in two separate steps: 1) extracting the supporting area from the original frame, and 2) downsizing the supporting area to a square output block.
In the proposed system, the block resizing task is conducted at the transform block level. With the information from the deformed mesh, we first compute the supporting area of each output macroblock through reverse mapping. Specifically, for every vertex v_i in the retargeted frame, we compute its corresponding coordinates in the original frame through interpolation. Then, every NxN block (N = 4 in our case) in the resized frame is reverse-mapped to an NxN' block in the original frame. The height of the supporting area equals that of the output block, as we only consider resizing along the horizontal direction. It should be noted that the
supporting area may cover multiple transform blocks in the original frame, and some blocks may be only partially covered by the supporting area.

Table 4.1: Scene change detection results on two test sequences: big buck bunny and elephants dream.

  Sequence          Total frames   Abrupt change (N_c / N_m / N_f)   Gradual change (N_c / N_m / N_f)   Total (N_c / N_m / N_f)   Recall    Precision
  big buck bunny    14,315         127 / 0 / 7                       1 / 4 / 0                          128 / 4 / 7               96.97%    94.81%
  elephants dream   15,691         99 / 14 / 22                      5 / 3 / 0                          104 / 17 / 22             85.95%    85.33%
By following [69], the supporting area is resized to the output block via
\bar{B} = \begin{bmatrix} I_N & 0 \end{bmatrix}_{N \times N'} M_L \sum_{B_i \in S(\bar{B})} B_i M_R \begin{bmatrix} I_N \\ 0 \end{bmatrix}_{N' \times N},   (4.17)
where I_N is an identity matrix of size NxN, B_i are the DCT coefficients of the blocks covered by the supporting area of \bar{B}, M_L and M_R are the DCT transforms of shifting matrices, and S(\bar{B}) denotes the supporting area of block \bar{B}.
Figure 4.12: Illustration of the supporting area: each macroblock of the output video frame is resized from its corresponding supporting area in the original frame.
4.6 Re-encoding
In the last stage, we re-encode the block DCT coefficients of the retargeted result into an H.264/AVC-compliant bitstream. The re-encoding step is essentially the reverse process of the partial decoding stage. It should be noted that quantization is not required here, since we did not perform inverse quantization in the partial decoding stage. The encoder forms a prediction of each macroblock based on previously coded data, either from the current frame using intra prediction [13] or from other frames that have already been coded [53]. The prediction is then subtracted from the current macroblock to form a residual, which is exported to the output bitstream. However, since each macroblock is modified during the retargeting process, there are two additional issues to be addressed: 1) macroblock type selection, and 2) motion vector re-estimation. We describe our solutions below.
4.6.1 Macroblock Type Selection
For selecting the macroblock types, we employ the MTSS scheme [30], which was originally proposed for a compressed-domain video downsizing system. This scheme can be modified to fit our retargeting framework. In the H.264/AVC video coding standard, a macroblock in a P frame can be intra-coded, forward predicted, or skipped. A predicted B-frame macroblock can be intra-coded, forward, backward, or bi-directionally predicted, or skipped. We use the area proportion of each original macroblock w.r.t. the supporting area to determine the prediction mode of the output macroblocks.
A retargeted macroblock is intra-coded if and only if more than 50% of its supporting area covers macroblocks that are originally intra-coded. For B frames, a retargeted macroblock is forward predicted if more than 70% of its supporting area covers original macroblocks that have forward prediction. Similarly, it is backward predicted if more than 70% of its supporting area covers original macroblocks that have backward prediction. In the case where the supporting area covers both forward and backward predicted macroblocks and neither exceeds 70%, the prediction type with the higher percentage determines the retargeted macroblock type. In case of a tie (50% backward, 50% forward), the macroblock is bi-directionally predicted.
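These decision rules can be written compactly; the Python sketch below is a simplified illustration for B-frame macroblocks only, assuming the per-type coverage fractions of the supporting area have already been computed.

def select_b_macroblock_type(coverage):
    """coverage: dict with the fractions of the supporting area covered by
    'intra', 'forward' and 'backward' predicted original macroblocks."""
    if coverage.get("intra", 0.0) > 0.5:
        return "intra"
    fwd = coverage.get("forward", 0.0)
    bwd = coverage.get("backward", 0.0)
    if fwd > 0.7:
        return "forward"
    if bwd > 0.7:
        return "backward"
    if fwd > bwd:
        return "forward"
    if bwd > fwd:
        return "backward"
    return "bidirectional"   # tie, e.g. 50% forward / 50% backward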
4.6.2 Motion Vector Refinement
The conventional way to generate an H.264/AVC bitstream of a resized video requires decompressing it and then applying a spatial-domain motion estimation technique to recompute the motion vectors in the pixel domain. However, recomputing motion vectors is a computationally intensive procedure; it typically takes 60% or more of the workload of a video encoder [66]. To remedy this, we propose a motion vector refinement technique that works directly in the compressed domain and re-estimates the new motion vector using the motion vectors of the original macroblocks. The proposed refinement technique is intended only for inter-frame coding, as intra frames are coded independently and do not contain any motion information.
Consider the case of resizing a supporting area of size NxN' (N' = L_1 + L_2 + L_3) to an output macroblock of size NxN, as shown in Fig. 4.13. The supporting area covers three macroblocks in the original frame with motion vectors mv_1, mv_2 and mv_3, respectively. The motion vector \bar{mv} of the resized macroblock is estimated as
\bar{mv} = \frac{N}{N'} \cdot \frac{\sum_i mv_i L_i}{\sum_i L_i},   (4.18)
where mv_i is the motion vector of original macroblock i and L_i is the length of macroblock i within the supporting area. It should be noted that intra macroblocks are treated as blocks with zero motion vectors. The effectiveness of our motion vector refinement scheme is validated through experiments in the next section.
Figure 4.13: The motion vector of a retargeted block, \bar{mv}, is estimated using the motion vectors in its supporting area: mv_1, mv_2 and mv_3.
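A direct transcription of Eq. (4.18) in Python is given below; motion vectors are assumed to be 2-D (x, y) pairs, and intra blocks are represented by zero vectors as stated above.

import numpy as np

def refine_motion_vector(mvs, lengths, N, N_prime):
    """mvs: list of (mv_x, mv_y) for the original macroblocks covered by the
    supporting area (use (0, 0) for intra blocks). lengths: their horizontal
    extents L_i inside the supporting area, with sum(lengths) == N_prime."""
    mvs = np.asarray(mvs, dtype=np.float64)
    L = np.asarray(lengths, dtype=np.float64)
    # mv_bar = (N / N') * sum_i(mv_i * L_i) / sum_i(L_i)
    return (N / N_prime) * (mvs * L[:, None]).sum(axis=0) / L.sum()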
4.7 Experimental Results
4.7.1 Texture-Aware Image Resizing
The proposed image resizing algorithm was implemented on a PC with a 2.4 GHz Duo CPU. For region-map generation, the values h_r = 5.0, h_s = 4.0, N = 800 and T_s = 10 were used for all test images. For mesh warping, although smaller quads would lead to a finer outcome, we found that a quad size of 20x20 pixels produces sufficiently good results. Furthermore, we used patches of size p x p = 90 x 90 and overlap width w_p = 15 in image synthesis.
Figure 4.14: Visual comparison of resized images using the proposed algorithm and [8, 60, 75]. From top to bottom: getty, blueman and boy. (a) Original images, (b) seam carving, (c) scale-and-stretch, (d) multi-operator, (e) our results.
We demonstrate the effectiveness of our proposed region-adaptive texture-aware resizing algorithm by comparing the results obtained by [8, 60, 75] with ours in Fig. 4.14. Due to the greedy nature of seam carving [8], it fails to prevent seams from passing through prominent objects when the non-salient region is relatively more structured than the prominent object (e.g., the boy image), and it produces noticeable distortion on structured objects (e.g., the getty and blueman images). In contrast, the warp-based approaches ([75] and ours) outperform the seam carving method in preserving structured objects. This is because warping is a continuous operator, which deforms structures through linear interpolation.
As compared with [75], the proposed region-adaptive method is better at maintaining the shape of prominent objects, since the mesh warping is guided by the region information. This ensures that the underlying object undergoes homogeneous scaling, as in the getty image, where the roof structure is non-uniformly warped by [75]. In particular, the proposed algorithm performs well on images containing large areas of regular texture, e.g., the boy image. As compared with [8, 75], both of which produce noticeable distortion in the background wall texture, the proposed method preserves the shape of each wall brick and efficiently reduces texture redundancy.
For most images, our method is able to achieve results comparable to [60]. However, the multi-operator approach may sometimes crop the prominent content, e.g., in the boy image. In addition, texture redundancy is not perfectly reduced under this scheme, since cropping is the only operator among the three that effectively addresses spatial redundancy.
Table 4.2: Computational time for each procedure.

                                   getty     blueman   boy
  Vertex number (|V_q| + |V_c|)    425+232   425+213   600+148
  Region map generation            0.28 s    0.27 s    0.31 s
  Matrix factorization             0.20 s    0.21 s    0.23 s
  Back substitution                0.01 s    0.01 s    0.01 s
  Texture synthesis                -         -         0.08 s
Our proposed method is computationally efficient since both region segmentation and matrix factorization can be pre-computed. For all test images that we have tried in our experiments, the iterative algorithm converges within 3-10 iterations. The computational time of our scheme is comparable to that of existing real-time image resizing algorithms. Table 4.2 shows the timing statistics for the three test images.
4.7.2 Compressed-domain Video Retargeting
We demonstrate the effectiveness of the proposed compressed-domain video retargeting solution by comparing its results with previous spatial-domain video retargeting methods [38, 59, 87]. We begin by evaluating the scene change detection method in our proposed system, followed by visually comparing our retargeting results with other spatial-domain techniques. The effectiveness of our motion vector refinement technique is then validated, followed by a computational complexity analysis of the proposed system versus spatial-domain retargeting methods. Finally, we present subjective quality evaluation results conducted on 56 subjects.
Figure 4.15: Performance comparison of the proposed solution versus the seam carving method [59] for the rat and roadski sequences. From left to right: the original video sequence, the result of seam carving [59] and our result. The seam carving method is incapable of preserving the shape of prominent edges, as can be observed from the distortions on the curb-line in the rat sequence and the road-lines in the roadski sequence.
Scene Change Detection Evaluation
In Table 4.1, we evaluate the performance of our scene change detection algorithm
proposed in Section 4.5.1. The two test sequences used in this experiment, big buck
bunny and elephants dream, contain various types of scene changes and camera
motions. In our experiment, abrupt and gradual scene changes are evaluated
separately.
Figure 4.16: Performance comparison of the proposed solution versus the pixel-warp method [38] for the waterski sequence. From left to right: the original video sequence, the result of [38] and our result. The pixel-warp method over-squeezes the water wave region of the waterski sequence, leading to noticeable artifacts. In contrast, our method incorporates cropping into the whole procedure and performs better in preserving the original content.
The performance of a scene-change-detection algorithm is measured in terms of the recall and precision rates, which are defined as
$$\mathrm{Recall} = \frac{N_c}{N_c + N_m} \times 100\%, \qquad \mathrm{Precision} = \frac{N_c}{N_c + N_f} \times 100\%,$$
where $N_c$, $N_m$ and $N_f$ represent the numbers of correct, missed and false detections, respectively.
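As a small illustration of these two rates (a minimal sketch; the detection counts in the example are hypothetical):

```python
def recall_precision(n_correct, n_missed, n_false):
    """Recall and precision (in percent) of a scene-change detector."""
    recall = n_correct / (n_correct + n_missed) * 100.0
    precision = n_correct / (n_correct + n_false) * 100.0
    return recall, precision

# Hypothetical counts: 32 correct detections, 1 missed change, 2 false alarms.
print(recall_precision(32, 1, 2))  # -> (96.97, 94.12) approximately
```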
Our proposed scene change detection algorithm performs relatively well on the big buck bunny sequence, with a recall rate of 96.97% and a precision rate of 94.81%. The algorithm has better performance in detecting abrupt changes than gradual changes. While none of the abrupt changes are missed by our method, the number of missed gradual changes is relatively high (4 out of 5 were missed). On the other hand, our method has relatively lower recall and precision rates on the elephants dream sequence. This is mainly due to the existence of a non-static background and fast camera motions in the video content. This implies that our detection method is not robust enough to perform well on all types of video sequences and there is still room for further improvement.
Figure 4.17: Performance comparison of the proposed solution versus the approach by Wang et al. [87] for the big buck bunny, car and building sequences. From left to right: the original video sequence, the result of [87] and our result. Our method achieves comparable results in terms of visual quality, yet it has a lower computational cost and memory consumption.
Visual Quality Comparison
We compare the retargeting results of the proposed solution with those of three state-of-the-art spatial-domain methods [38, 59, 87] in Figs. 4.15, 4.16 and 4.17.
The performance comparison with the seam carving method [59] is given in Fig. 4.15. The seam carving method resizes a video by continuously removing seams. In some cases, it is incapable of preserving prominent edges. As shown in Fig. 4.15, while seam carving introduces noticeable artifacts to the edge regions in both sequences (rat and roadski), the results of our solution contain fewer visual artifacts as the shape of prominent edges is better preserved.
We compare the performance of the proposed solution with that of the pixel-warp retargeting method [38] in Fig. 4.16. Our method differs from the pixel-warp retargeting method in that we have incorporated cropping into the resizing procedure, thereby avoiding over-squeezing the original video content. When the change in the aspect ratio is significant, as shown in the example of Fig. 4.16, the method entirely based on warping [38] over-squeezes the relatively non-important video content, leading to noticeable visual distortion.
Finally, we compare the performance of our solution with the method proposed by Wang et al. [87], which is a state-of-the-art method with optimized computational efficiency. Similar to our method, the method in [87] incorporates both cropping and warping. For most test sequences, our method achieves performance comparable to that of [87]. One example is shown in Fig. 4.17. On the other hand, since our method operates directly in the compressed domain, it has an advantage in terms of computational and memory cost savings. This will be analyzed in Sec. 4.7.3.
Effectiveness of Motion Vector Refinement
In the re-encoding stage, to avoid the computationally expensive procedure of motion search, we proposed a motion vector refinement scheme that computes the new motion vector of each retargeted macroblock from the original motion vectors. We evaluate the effectiveness of this approach by comparing it with full motion search in terms of encoding PSNR. In the experimental setup, the retargeted sequence is encoded into an H.264/AVC (baseline profile) bitstream using the JM reference software [1]. All test sequences are encoded at a bit rate of 2 Mbps and a frame rate of 15 fps. The results for six different test sequences are listed in Table 4.3.
Table 4.3: Performance comparison of re-encoding using the proposed motion vector refinement approach versus full search.

Sequence name     Full Motion Search   Our approach   Difference
car               41.62 dB             39.87 dB       1.75 dB
big buck bunny    43.83 dB             43.28 dB       0.55 dB
waterski          46.66 dB             45.01 dB       1.65 dB
rat               45.46 dB             44.81 dB       0.65 dB
roadski           45.91 dB             44.70 dB       1.21 dB
building          40.44 dB             36.91 dB       3.53 dB
As listed in Table 4.3, the motion vectors generated by our refinement scheme offer encoding PSNR values comparable to those of full search, which can be viewed as the upper bound. On average, the PSNR value of the proposed scheme is about 1 dB lower than that of full search. For sequences with less movement, such as big buck bunny and rat, the PSNR difference is only around 0.6 dB. However, for sequences with a significant moving background (such as building), the estimated motion vectors using our approach may become less reliable, leading to a relatively larger difference (3.53 dB for the building sequence). Overall, the proposed solution achieves fast and accurate re-estimation of the motion vectors for the output target video while significantly reducing the complexity relative to full search.
4.7.3 Computational Complexity Analysis
In Table 4.4, we further compare the computational complexity of the proposed solution with spatial-domain video retargeting algorithms. The experiments were conducted on a segment of the big buck bunny sequence (158 frames, size: 672 × 384) encoded using the H.264/AVC baseline profile. It should be noted that the computational complexity for each frame may be different, as different prediction modes are used for each frame. Table 4.4 shows the per-frame total operation cost averaged over all frames. In the encoding stage, the EPZS approach [72] is used for fast motion search.
In the decoding stage, spatial-domain methods require the full decoding process, including inverse DCT, inverse quantization, motion compensation and intra prediction. Our solution operates directly in the compressed domain, thereby avoiding both inverse DCT and inverse quantization, leading to 13.72% savings in the total operation cost (see Table 4.4). For the encoding stage, with the proposed motion vector re-estimation scheme, our proposed system results in 30.17% and 99.92% savings in the total operation costs as compared with the fast [72] and full motion search approaches, respectively.
In Table 4.5, we show the computational complexity analysis for two other test sequences: roadski (99 frames, size: 540 × 280) and building (104 frames, size: 720 × 376). Experimental results on these two sequences also demonstrate that the proposed system leads to significant cost savings in both the encoding and decoding stages.
Table 4.4: Complexity analysis for retargeting the big buck bunny sequence in the DCT domain versus the spatial domain.

Our partial decoding module
  Procedure                                              Add        Multiply   Shift
  Compressed-domain inverse motion compensation
    (MBIMC [53] / TDIP [13])                             1,706,693  287,533    -
  Total operations count (per frame): 2,569,353 (savings: 13.72%)

Full decoding
  Procedure                                              Add        Multiply   Shift
  IDCT                                                   1,032,192  -          258,048
  Inverse quantization                                   -          474,473    -
  Inverse motion compensation                            241,732    -          -
  Intra prediction                                       12,892     1,109      6,384
  Total operations count (per frame): 2,977,996

Our re-encoding module
  Procedure                                              Add        Multiply   Shift
  Compressed-domain motion compensation
    (MBIMC [53] / TDIP [13])                             853,374    143,777    -
  Motion vector refinement                               944        2,361      -
  Total operations count (per frame): 1,292,703
    (savings vs. full search: 99.92%; vs. fast search: 30.17%)

Full encoding
  Procedure                                              Add            Multiply   Shift
  DCT                                                    516,096        -          129,024
  Quantization                                           -              237,237    -
  Motion compensation                                    120,866        -          -
  Intra prediction                                       6,446          555        3,192
  Motion estimation (full search)                        1,529,982,109  -          -
  Motion estimation (fast search)                        362,126        -          -
  Total operations count (per frame): 1,531,471,108 (full search); 1,851,124 (fast search)
Subjective Visual Quality Test
Lastly, we report the subjective test results on the visual quality of the proposed retargeting solution via a user study conducted with 56 participants (27 female and 29 male, aged between 22 and 54). The experiments were conducted in a typical laboratory environment. We used 6 different videos in the experiment and retargeted each video to 50% of its original width using the methods in [38, 59, 87] and our method.
Figure 4.18: Pairwise comparison results of the 56 user study participants, which show that users have a preference for the visual quality of our solution over the other three benchmarking methods proposed in [38, 59, 87].
Table 4.5: Complexity analysis for retargeting the roadski and building sequences in the DCT domain versus the spatial domain.

roadski sequence
  Our partial decoding module:     707,750     Full decoding:                1,015,718     Savings: 30.32%
  Our partial re-encoding module:  355,970     Full encoding (full search):  399,743,392   Savings: 99.91%
                                               Full encoding (fast search):  602,353       Savings: 40.90%

building sequence
  Our partial decoding module:     1,310,533   Full decoding:                1,704,394     Savings: 23.11%
  Our partial re-encoding module:  659,493     Full encoding (full search):  500,202,682   Savings: 99.87%
                                               Full encoding (fast search):  1,042,871     Savings: 36.76%
In the subjective test, we presented the output video sequences obtained by two retargeting methods side-by-side to the observer, who was then asked to choose the better one of the two. In each pair, one was our own result while the other was the result of one of the state-of-the-art spatial-domain methods in [38, 59, 87]. The entire user study consisted of 6 × 4 = 24 video pairs and we received 24 × 56 = 1344 answers overall. It took on average 15-20 minutes for each participant to complete the user study. To minimize user bias, we randomized the order of the test pairs and hid all technical details from the participants.
We show the results of the user study in Fig. 4.18. Our results were favored in 61.1% (821 of 1344) of the comparisons with Rubinstein et al. [59], in 59.3% (797 of 1344) of the comparisons with Krahenbuhl et al. [38], and in 57.9% (778 of 1344) of the comparisons with Wang et al. [87]. The study results show that users have a stronger preference for the visual quality of our solution over the other three benchmarking methods.
4.8 Conclusion
In this chapter, we first introduced an efficient texture-aware image resizing algorithm using the segmented region as the basic unit. Guided by the region and contour information, mesh warping was formulated as a nonlinear least-squares optimization problem, which strives to preserve the local as well as the global object structures. Texture redundancy was effectively reduced through pattern regularity detection and real-time image synthesis. Experimental results demonstrated improved image quality over state-of-the-art image resizing algorithms.
Next, we proposed a practical video retargeting system that operates directly on DCT coefficients and motion vectors in the compressed domain. This solution avoids the computationally expensive process of decompression, processing, and recompression. As the system uses the DCT coefficients directly for processing, only partial decoding of video streams is needed. The proposed solution achieves visual quality comparable to (or slightly better than) that of several state-of-the-art spatial-domain video retargeting methods, yet it significantly reduces the computational and storage costs. Although the proposed system uses the latest H.264/AVC coding standard as an example, the general methodology is applicable to other video coding standards as well.
Chapter 5
Objective Quality Assessment for
Image Retargeting
5.1 Introduction
Content-aware image resizing (or image retargeting) is a technique that addresses the increasing demand to display image contents on devices of different resolutions and aspect ratios. Traditional resizing techniques do not meet this requirement since they either discard important information (e.g., cropping) or introduce visual artifacts by over-squeezing the content (e.g., homogeneous scaling). The goal of image retargeting is to change the aspect ratio and the resolution of images while preserving their visually important content and avoiding noticeable artifacts.
Several content-aware image retargeting solutions have been proposed in the last 7-8 years. They can be classified into two types: discrete and continuous approaches [65]. A discrete approach resizes an image by removing unimportant pixel regions iteratively [8, 54, 60] while a continuous approach conducts resizing through non-uniform image warping [38, 75, 80]. Most previous work has demonstrated novelty in problem formulation and algorithmic design. However, evaluation of the performance of different retargeting methods remains ad hoc, as most of them rely on simple visual comparison or small-scale user studies. Clearly, there is a need to develop a better methodology for evaluating all retargeting results in a systematic and quantitative way.
Recently, Rubinstein et al. [58] conducted a systematic study on eight state-of-the-art retargeting algorithms through a large-scale user study. Besides collecting and analyzing subjective evaluation results, they evaluated the performance of six distances as possible objective measures for retargeted images. However, there exists significant disagreement between their chosen measures and the subjective evaluation results. Thus, a better objective QoE assessment index for retargeted images is still needed.
Objective image QoE assessment indices have been extensively studied in the last decade [40, 78]. They can be divided into three categories: full-reference (FR), reduced-reference (RR) and no-reference (NR). However, traditional image QoE assessment indices are not applicable in the context of image retargeting for the following reasons. For FR and RR methods, one underlying assumption is that the size of the original image should match that of the distorted image. Since the original and retargeted images differ significantly in sizes and aspect ratios, this assumption does not hold. Furthermore, a retargeted image should preserve as much important information in its original image as possible. Thus, referring to the original image is an indispensable part of evaluating a retargeted result, which rules out NR methods.
In this chapter, we attempt to address the QoE assessment issue and propose a novel objective index that accounts for three major determining factors for human visual perception of retargeted images. These factors are: the global structural distortion (G), the local region distortion (L) and the loss of salient information (S). Various features are chosen to quantify their respective distortion degrees. Then, an objective quality assessment index, called GLS, is developed to predict viewers' QoE by fusing these features into one quality score. In developing the GLS quality index, we compare several regression models for fusing multiple features. It is shown by experimental results that the proposed GLS index has a stronger correlation with human QoE than other existing objective indices in retargeted image quality assessment. The effectiveness of the newly extracted features is the basis for the impressive performance gain of the proposed GLS index as compared with other existing QoE indices.
5.2 GLS Quality Index for Retargeted Images
In this section, we will introduce the proposed GLS quality index for retargeted images. Given the original image, $I$, and its retargeted results $\hat{I}_i$ ($i = 1, 2, 3, \ldots$), we would like to compute an objective quality score $S_i$ for each result to achieve the following two objectives:
1. the relative rank of retargeted results is consistent with the subjective ranking;
2. the predicted quality score matches the subjective quality scores.
We will first introduce the overall framework in Section 5.2.1 and then describe each stage in detail in Sections 5.2.2-5.2.5.
Figure 5.1: The system framework in computing the proposed GLS quality index
for retargeted images.
5.2.1 Overview of System Framework
The challenge of objective image QoE assessment lies in formulating effective features and fusing them into a single number to predict the quality score. In the proposed GLS scheme, we first conduct saliency analysis and SIFT feature mapping to determine whether $I$ is a salient or non-salient image and to build the mapping correspondence between $I$ and $\hat{I}_i$. For each retargeted result $\hat{I}_i$, we extract features to quantify the three types of distortions mentioned in Section 2.3. A pre-trained machine learning model is used to fuse all features into one single quality score as the final result. The machine learning model is trained using subjective evaluation results of existing image retargeting databases [46, 58]. Fig. 5.1 shows the overall system framework for computing the proposed GLS quality index.
5.2.2 Saliency-based Classification
The first step is to determine whether the source image, $I$, is a salient or non-salient image. If an image contains one salient object which does not cover the entire image, it is considered salient. Otherwise, if all contents in $I$ have equal visual importance or its salient object is too large and fills up the entire image, $I$ is viewed as non-salient. This classification step is commonly known as data grouping in handling large-scale databases. The main purpose of image grouping is to separate images of different characteristics into multiple disjoint groups so that we can train different prediction models for them separately. This grouping process allows us to design a more accurate prediction model since there is a stronger correlation between the training and test images.
There are many algorithms proposed for saliency computation and, without loss of generality, we simply choose one and adopt the GBVS method [29] here. Salient image classification is conducted by analyzing the histogram of the obtained saliency map, which takes values ranging from 0 to 255. The saliency value "0" means the lowest saliency level (i.e., no saliency) and "255" means the highest saliency level (i.e., strongest saliency). As shown in Fig. 5.2, the saliency histogram of a typical salient image usually consists of a steep peak followed by a quickly descending tail, as shown in Fig. 5.2(a). On the other hand, for a typical non-salient image, the histogram usually has a low-rising peak and a slowly decaying tail, as shown in Fig. 5.2(b). The car image in Fig. 5.2(b) is classified as non-salient since its salient region is too large.
We define the percentage of pixels below brightness level $x$ as
$$\rho_x = \frac{\sum_{j \le x} h(j)}{N}, \qquad x = 1, 2, \ldots, 256,$$
where $h(j)$ is the number of pixels in bin $j$ of the histogram and $N$ is the total number of pixels in image $I$. In our experiments, we adopt the following simple rule to decide whether an image is salient or not: if $\rho_{30} > T$ with threshold $T = 0.70$, it is a salient image; otherwise, it is a non-salient one.
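A minimal sketch of this decision rule is given below; the saliency map is assumed to be an 8-bit array (e.g., a GBVS output scaled to [0, 255]), and the function and parameter names are chosen for illustration.

```python
import numpy as np

def is_salient(saliency_map, level=30, threshold=0.70):
    """Classify an image as salient or non-salient from the histogram of its
    saliency map (a 2-D uint8 array with values in [0, 255]).

    Returns True when the fraction of pixels whose saliency value lies below
    `level` exceeds `threshold`, i.e., when the histogram shows the steep
    low-saliency peak typical of images with one compact salient object.
    """
    hist, _ = np.histogram(saliency_map, bins=256, range=(0, 256))
    fraction_low = hist[:level].sum() / saliency_map.size
    return fraction_low > threshold
```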
Figure 5.2: Classification based on saliency map histogram analysis for two representative images, (a) the salient eagle image and (b) the non-salient car image: the original image $I$ (left), the saliency map (middle) and its histogram (right).
5.2.3 SIFT Mapping and Mesh Formulation
The next step is to compute the mapping correspondence between the SIFT features [44] of the original image, $I$, and its retargeted results, $\hat{I}_i$. This correspondence will serve as the basis for feature extraction, as elaborated in Section 5.2.4.
We first extract the SIFT features from $I$ and match them with those of each retargeted image $\hat{I}_i$. We discard all SIFT features that are not successfully matched so that we have an equal number of SIFT features for each image pair, $I$ and $\hat{I}_i$, at the end of the matching process. Then, we formulate two graphs for each image pair $(I, \hat{I}_i)$. Each vertex in the graph represents one matched feature in the original image. The graph formulation is completed by connecting all neighboring vertices using Delaunay triangulation, as shown in Fig. 5.3.
As a result, we associate each image pair, $I$ and $\hat{I}_i$, with two graphs denoted by $G = (V, E)$ and $\hat{G}_i = (\hat{V}_i, \hat{E}_i)$, where $|V| = |\hat{V}_i|$ and, for each vertex $v \in V$, there is a unique mapping $m(v) = \hat{v}_i$, where $\hat{v}_i \in \hat{V}_i$. With such a mapping in place, we have converted the problem of measuring the distance between $I$ and $\hat{I}_i$ to the problem of computing the graph similarity between $G$ and $\hat{G}_i$.
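The following sketch illustrates this matching and graph-formulation step with OpenCV's SIFT implementation and SciPy's Delaunay triangulation; the ratio-test threshold and helper names are illustrative assumptions rather than the exact settings used in our experiments.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def build_matched_graphs(img, img_ret, ratio=0.75):
    """Match SIFT features between the original image and a retargeted result,
    keep only the matched keypoints, and triangulate them into two graphs.
    img, img_ret: 8-bit grayscale images.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img, None)
    kp2, des2 = sift.detectAndCompute(img_ret, None)

    # Lowe's ratio test to discard unreliable correspondences.
    matcher = cv2.BFMatcher()
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    pts = np.float32([kp1[m.queryIdx].pt for m in good])      # vertices V (x, y)
    pts_ret = np.float32([kp2[m.trainIdx].pt for m in good])  # vertices V_hat

    # Triangulate the vertices of the original image; the same vertex-index
    # structure is reused for the retargeted graph, which realizes the
    # one-to-one mapping m(v) = v_hat.
    tri = Delaunay(pts)
    return pts, pts_ret, tri.simplices  # simplices: (K, 3) vertex indices
```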
5.2.4 Extraction of Features
There are two key issues in objective image QoE assessment: 1) extraction and representation of appropriate features, and 2) pooling of the features into one single number to represent the quality score. We will address the first issue in this section and focus on the issue of feature fusion in Section 5.2.5.
Graph Structure Similarity
The graph structure similarity feature measures the amount of global structural distortion in the retargeted image. To compute this feature, we make use of the results in Section 5.2.3, where the problem of comparing the image pair, $I$ and $\hat{I}_i$, is reformulated as comparing the graph similarity between $G = (V, E)$ and $\hat{G}_i = (\hat{V}_i, \hat{E}_i)$.
If there is little global structural distortion during the retargeting process, the relative positions of $\hat{V}_i$ should be close to those of $V$ and the shape of each triangle in $G$ should be similar to that of the corresponding matched triangle in $\hat{G}_i$. As a result, the global structural distortion in $\hat{I}_i$ can be measured using the shape deformation of each mesh triangle.
To measure the shape deformation of each triangle, we make use of the log-polar spatial representation scheme [55], which is computationally more efficient than methods such as RANSAC. It encodes the relative positions and orientations between each pair of nodes in the graph. Fig. 5.4 shows an example of the 5-bit (32 regions) log-polar representation and its corresponding spatial and orientation codes. In our case, we use a spatial code with 8 bits, where the first 3 bits represent the relative orientation angle (quantized into 8 sectors) and the remaining 5 bits represent the relative distance (quantized into 32 levels).
The shape deformation between two triangles is measured using the modified inconsistency sum method [55]. That is, to compare the log-polar codes of the nodes in the triangle, we first compute the distance between two triangles $T_k$ and $\hat{T}_k$ as
$$d_k = \sum_{i=1}^{3} C_{i,k} \oplus \hat{C}_{i,k},$$
where $k$ and $i$ are the indices of the triangles and their three nodes, respectively, $C_{i,k}$ ($i = 1, 2, 3$) are the codes for node $i$ in triangle $T_k$, and $\oplus$ denotes the XOR operator. If a triangle in $G$ is perfectly matched with the corresponding triangle in $\hat{G}_i$, then $d_k = 0$. Then, we add up the distances over all triangle pairs and obtain the distance between the two graphs, $G$ and $\hat{G}_i$, as
$$f_1 = \sum_k d_k.$$
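A minimal sketch of this computation is shown below. It assumes that each node is coded relative to its triangle centroid and that the XOR result is accumulated as an integer; the encoding details of [55] may differ, so this should be read as an illustrative approximation rather than the exact implementation.

```python
import numpy as np

def logpolar_code(node, centroid, r_max, angle_bits=3, dist_bits=5):
    """8-bit log-polar code of a node relative to its triangle centroid:
    the top 3 bits quantize the orientation into 8 sectors, the lower
    5 bits quantize the log-distance into 32 levels."""
    dx, dy = node[0] - centroid[0], node[1] - centroid[1]
    angle = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)       # in (0, 1]
    radius = np.log1p(np.hypot(dx, dy)) / np.log1p(r_max)    # roughly in [0, 1]
    a = min(int(angle * 2 ** angle_bits), 2 ** angle_bits - 1)
    d = min(int(radius * 2 ** dist_bits), 2 ** dist_bits - 1)
    return (a << dist_bits) | d

def structure_feature(pts, pts_ret, triangles, r_max=500.0):
    """f1: sum over all matched triangle pairs of the XOR distance between
    the log-polar codes of corresponding nodes."""
    f1 = 0
    for tri in triangles:                      # tri: indices of 3 matched nodes
        c = pts[tri].mean(axis=0)              # centroid in the original graph
        c_ret = pts_ret[tri].mean(axis=0)      # centroid in the retargeted graph
        for i in tri:
            code = logpolar_code(pts[i], c, r_max)
            code_ret = logpolar_code(pts_ret[i], c_ret, r_max)
            f1 += code ^ code_ret              # contribution of one node to d_k
    return f1
```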
Graph Patch Similarity
To measure the degree of local region distortion in the retargeted result, we consider a feature called the graph patch similarity. If there is prominent local region distortion, such as broken edges or edge bending, the patch difference should be significant.
For each image pair $I$ and $\hat{I}_i$, we compare the similarity of local patches of size $N \times N$ centered around each node in graphs $G$ and $\hat{G}_i$, where $N$ is chosen to be 15 in our experiment. We use $p_{i,k}$ and $\hat{p}_{i,k}$ to denote the patches centered at the node with index $i$ of the triangle with index $k$ in these two graphs, respectively. Then, the graph patch similarity feature can be computed as
$$f_2 = \sum_k \sum_{i=1}^{3} d(p_{i,k}, \hat{p}_{i,k}),$$
where $d(p_{i,k}, \hat{p}_{i,k})$ represents the Euclidean distance between patches $p_{i,k}$ and $\hat{p}_{i,k}$. We show an example of three matched features and the corresponding local patches for comparison in Fig. 5.5.
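A minimal sketch of the patch similarity computation is given below; grayscale patches and the skipping of incomplete patches near the image border are simplifying assumptions made for the example.

```python
import numpy as np

def patch_feature(img, img_ret, pts, pts_ret, triangles, N=15):
    """f2: sum of Euclidean distances between N x N patches centered at
    matched graph nodes. pts / pts_ret hold (x, y) coordinates; patches
    that fall partly outside the image are skipped here."""
    h = N // 2
    f2 = 0.0
    for tri in triangles:
        for i in tri:
            x, y = np.round(pts[i]).astype(int)
            xr, yr = np.round(pts_ret[i]).astype(int)
            p = img[y - h:y + h + 1, x - h:x + h + 1].astype(float)
            pr = img_ret[yr - h:yr + h + 1, xr - h:xr + h + 1].astype(float)
            if p.shape == pr.shape == (N, N):      # ignore incomplete patches
                f2 += np.linalg.norm(p - pr)       # Euclidean patch distance
    return f2
```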
Information Completeness
The information completeness feature characterizes how well the retargeted image, $\hat{I}_i$, preserves the important content of the original image, $I$. When $I$ is a salient image and contains a prominent object, preserving this salient object and its surrounding region is important, even at the expense of regions of lower saliency. For non-salient images, however, all contents possess similar importance and should be preserved in a more uniform manner.
First, we need to determine a region in $I$ that should be present in the retargeted image $\hat{I}_i$. This region, denoted by $\hat{P}_i$, is called the impact area of $\hat{I}_i$. To determine the impact area $\hat{P}_i$, we reverse-map all matched SIFT nodes in $\hat{I}_i$ back to $I$, as shown in Fig. 5.6, and find a tight rectangular bounding box, which is denoted as $\hat{P}_i$. Based on the saliency map, we classify pixels in $I$ into three regions: critical, important and ordinary, as shown in Fig. 5.7. For a good retargeted result, its impact area $\hat{P}_i$ should contain as much of the critical and important regions as possible.
Then, to quantify the information completeness, we define a feature that measures the amount of saliency covered by the impact area. It is written as
$$f_3 = \alpha \frac{P_c}{N_c} + (1 - \alpha) \frac{P_i}{N_i},$$
where $N_i$ and $N_c$ are the total numbers of pixels in the important and critical regions, $P_i$ and $P_c$ are the total numbers of important and critical region pixels inside the impact area $\hat{P}_i$, respectively, and $\alpha$ is a weighting parameter. We choose $\alpha$ to be 0.70 in the experiment. The value of $f_3$ ranges from 0 to 1. If all the important and critical regions are encircled by the impact area, we have $f_3 = 1.0$.
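The sketch below illustrates the computation of $f_3$; the thresholds used to split the saliency map into critical, important and ordinary regions are hypothetical values chosen for the example, since the exact segmentation rule is not specified above.

```python
import numpy as np

def completeness_feature(saliency_map, impact_box,
                         t_critical=192, t_important=96, alpha=0.70):
    """f3: weighted fraction of critical and important pixels covered by the
    impact area.

    saliency_map : 2-D uint8 saliency map of the original image.
    impact_box   : (x0, y0, x1, y1) bounding box of the reverse-mapped SIFT nodes.
    t_critical, t_important : illustrative thresholds defining the
    critical / important / ordinary regions.
    """
    critical = saliency_map >= t_critical
    important = (saliency_map >= t_important) & ~critical

    mask = np.zeros_like(critical)
    x0, y0, x1, y1 = impact_box
    mask[y0:y1, x0:x1] = True                    # pixels inside the impact area

    Nc, Ni = critical.sum(), important.sum()
    Pc, Pi = (critical & mask).sum(), (important & mask).sum()
    return alpha * Pc / max(Nc, 1) + (1 - alpha) * Pi / max(Ni, 1)
```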
5.2.5 Feature Fusion and Model Selection
For effective image quality prediction, not only the feature selection but also the mechanism for fusing all features into one single quality score is important. There is no straightforward solution to feature fusion since the contribution of each feature to the final quality score may be different and is difficult to determine. A few basic pooling methods can be employed, including simple summation, multiplication, and linear combination of features. However, all these methods implicitly make assumptions on the relative importance of each feature, and there is a lack of convincing grounds for these assumptions.
In the proposed GLS quality index, we take advantage of the subjective human evaluation results and employ a machine learning technique to find a mapping function between the features discussed in Section 5.2.4 and the final quality score.
In addition to features $f_1$, $f_2$ and $f_3$, we consider three more auxiliary features. They are:
$f_4$: the total number of matched SIFT node pairs between $I$ and $\hat{I}_i$;
$f_5$: the average matching strength of all matched SIFT node pairs;
$f_6$: the average matching strength of the top 50 matched SIFT node pairs.
These three features measure how good the matching is between $I$ and $\hat{I}_i$. For each retargeted result, we extract the six features $\{f_1, f_2, \ldots, f_6\}$ from the given image and normalize them to the range of $[0, 1]$.
To determine the optimal fusion rule, we conduct experiments with the follow-
ing eight fusion methods and compare their performance in our current application
context:
1. Direct feature addition (add)
2. Direct feature multiplication (multi)
3. Linear regression (lin)
4. Logistic regression (log)
5. Logistic regression with $L_1$ penalty (log-L1)
6. Support vector regression with linear kernel (svr-lin)
7. Support vector regression with polynomial kernel (svr-pol)
8. Support vector regression with RBF kernel (svr-rbf )
Note that the last six of the above eight fusion methods are based on machine
learning.
In the training phase, each candidate model is presented with a training set $\{f_p, y_p\}$ and the model parameters are estimated. The training set is obtained from the subjective evaluation results of existing public datasets, where $f_p$ are the feature descriptors and $y_p$ corresponds to the subjective score. We utilize the cross-validation scheme for each candidate model and choose the optimal model whose objective scores have the highest correlation with the human subjective evaluation results. During the test phase, the trained optimal model is presented with the feature descriptors of the test image, and it predicts the estimated objective quality score.
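The following sketch illustrates the cross-validated model-selection protocol with scikit-learn; only three of the candidate models listed above are included, and the logistic-regression fusion is not reproduced since its exact parameterization is not given here, so the sketch should be read as an outline of the procedure rather than the full implementation.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def select_fusion_model(F, y, n_splits=10):
    """Cross-validate a few candidate fusion models and return the one whose
    predicted scores correlate best (Kendall's tau) with the subjective scores.

    F : (n_samples, 6) array of normalized features f1..f6.
    y : array of subjective scores for the corresponding retargeted results.
    """
    candidates = {
        "lin": LinearRegression(),
        "svr-lin": SVR(kernel="linear"),
        "svr-rbf": SVR(kernel="rbf"),
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_name, best_tau = None, -np.inf
    for name, model in candidates.items():
        taus = []
        for train_idx, test_idx in cv.split(F):
            model.fit(F[train_idx], y[train_idx])
            pred = model.predict(F[test_idx])
            taus.append(kendalltau(pred, y[test_idx])[0])
        avg_tau = np.nanmean(taus)
        if avg_tau > best_tau:
            best_name, best_tau = name, avg_tau
    return best_name, best_tau
```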
5.3 Experimental Results
5.3.1 Datasets
For the experiments, we make use of two public databases for image retargeting:
the RetargetMe database [58] and the CUHK database [46].
The RetargetMe database contains 80 images, each with eight retargeted results obtained by eight methods: Nonhomogeneous Warping (WARP) [80], Seam-Carving (SC) [8], Scale-and-Stretch (SNS) [75], Multi-Operator (MULTI) [60], Shift-Map (SM) [54], Streaming Video (SV) [38], Cropping (CR) and Homogeneous Scaling (SCL). The subjective evaluation results on 37 images for all eight retargeting methods are provided in this database (37 × 8 = 296 results). The evaluation was conducted with 210 human participants and the scores were computed using the pairwise comparison method, in which participants were shown two retargeted images side-by-side and were asked to choose their preferred ones.
The CUHK database contains 171 retargeted results from 57 image sources. In addition to the eight retargeting methods studied by [58], this database includes results from two more retargeting methods, namely, the optimized seam carving and scale method [18] and the energy-based deformation method [37]. Unlike the pairwise comparison scheme used in [58], the subjective evaluation in this study employed a 5-category discrete scale ("Bad", "Poor", "Fair", "Good" and "Excellent") to obtain the mean opinion scores (MOS) of viewers for each retargeted result.
5.3.2 Test Methodology
For both databases, we employ the 10-fold cross-validation method to evaluate the performance of the proposed GLS quality index. That is, the data is equally divided into ten parts: one chunk is used for testing and the remaining nine parts are used for training. The experiment is repeated with each of the ten chunks used for testing. The averaged accuracy of the test over all ten chunks is taken as the final performance measure.
5.3.3 Comparison of Feature Fusion Methods
For performance evaluation, we consider the following five metrics: 1) the Kendall rank coefficient $\tau$, 2) the Pearson linear correlation coefficient $r$, 3) the Spearman rank order correlation coefficient $\rho$, 4) the root mean square error (RMSE) between subjective and objective quality scores, and 5) the outliers ratio (OR). For a perfect match between the objective and subjective scores (or ranks), we have $\tau = 1.0$, $r = 1.0$, $\rho = 1.0$, RMSE $= 0$ and OR $= 0$.
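For reference, the sketch below computes these five metrics with SciPy; the outlier criterion used in the example is an assumption, since outlier-ratio definitions vary across studies.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def evaluate_index(objective, subjective, outlier_margin=2.0):
    """Agreement between objective and subjective quality scores."""
    objective = np.asarray(objective, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    tau = kendalltau(objective, subjective)[0]
    r = pearsonr(objective, subjective)[0]
    rho = spearmanr(objective, subjective)[0]
    rmse = np.sqrt(np.mean((objective - subjective) ** 2))
    # Outlier ratio: predictions deviating by more than `outlier_margin`
    # standard deviations of the subjective scores (an assumed criterion).
    outliers = np.abs(objective - subjective) > outlier_margin * subjective.std()
    return {"tau": tau, "r": r, "rho": rho, "RMSE": rmse, "OR": outliers.mean()}
```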
Table 5.1 shows the Kendall rank coefficient of the eight different fusion methods discussed in Section 5.2.5 for the RetargetMe database. Since the subjective evaluation for this database is based on pairwise comparisons, it is difficult to determine the mean opinion score (MOS), and we therefore evaluate the performance of these fusion methods with the Kendall rank coefficient only. We see from Table 5.1 that the machine-learning-based fusion methods perform significantly better than simple feature addition or multiplication. In particular, the logistic regression fusion method outperforms all others. Furthermore, adding the $L_1$ penalty further improves the Kendall rank coefficient from 0.355 to 0.382.
Table 5.1: Kendall rank coefficient $\tau$ of different fusion models for the RetargetMe database [58]

          add     multi   lin     log     log-L1   svr-lin   svr-pol   svr-rbf
$\tau$    0.058   0.095   0.301   0.355   0.382    0.307     0.308     0.306
Table 5.2 compares the performance of the different fusion methods for the CUHK database. Since the mean opinion scores (MOS) are provided in this database, we can conduct the performance comparison using multiple metrics. As shown in Table 5.2, logistic regression (without the $L_1$ penalty) outperforms all other fusion methods under almost all performance metrics except for SROCC, where linear regression is slightly better than logistic regression.
Table 5.2: LCC, SROCC, RMSE and OR of different fusion models for the CUHK database [46]

          lin      log      log-L1   svr-lin   svr-pol   svr-rbf
$r$       0.4402   0.4622   0.3961   0.3656    0.3711    0.3658
$\rho$    0.4939   0.4760   0.4002   0.4038    0.3821    0.3961
RMSE      12.204   10.932   14.026   12.894    13.259    13.212
OR        0.2046   0.1345   0.2163   0.2339    0.2022    0.2267
Table 5.3: Performance comparison of five objective image QoE indices for the RetargetMe database [58]

          BDS [70]   EH [47]   SIFT-Flow [41]   EMD [52]   GLS
$\tau$    0.083      0.004     0.145            0.251      0.382

Table 5.4: Comparison with other objective image QoE metrics for the CUHK database [46]

          BDS [70]   EH [47]   SIFT-Flow [41]   EMD [52]   GLS
$r$       0.2896     0.3422    0.3141           0.2760     0.4622
$\rho$    0.2887     0.3288    0.2899           0.2904     0.4760
RMSE      12.922     12.686    12.817           12.977     10.932
OR        0.2164     0.2047    0.1462           0.1696     0.1345
5.3.4 Comparison of Objective Quality Indices
In this section, we compare the proposed GLS quality index with four other objective QoE indices for retargeted images. They are:
Bidirectional Similarity (BDS) [70]
Edge Histogram (EH) [47]
SIFT-Flow [41]
Earth Mover's Distance (EMD) [52]
Table 5.3 and Table 5.4 compare the performance of all five QoE indices for the RetargetMe and the CUHK databases, respectively. We see from the experimental results that the proposed GLS index performs better than all the existing QoE indices by a significant margin in all four performance metrics.
5.3.5 Discussion
The proposed GLS index outperforms all other existing QoE indices for two main reasons. First, the GLS index design is based upon the three dominant distortion types for image retargeting discussed in Section 2.3. The other quality indices consider only one or two of these distortion types, but none of them considers all three together. For example, EH [47], SIFT-Flow [41] and EMD [52] do capture the global structural distortion and the local region distortion of retargeted images, but fail to consider the information completeness factor. The BDS index [70] measures information completeness in a bidirectional way, but it does not fully consider either the global or the local distortions introduced in the retargeted result.
The second reason for the good performance of the proposed GLS index is that a machine-learning technique is adopted to fuse the features effectively into one final quality score. Although our feature design takes all three distortion types into consideration, determining the relative weights of multiple features still remains a challenge. In the GLS index, we address this challenge by training a machine learning model that learns from existing subjective evaluation results and intelligently determines the optimal feature weights for each specific image. The more subjective evaluation results we have for training the fusion model, the better the predicted objective score for each retargeted result.
We offer further insights into the performance of the GLS quality index by examining two examples. We show the evaluation of eight retargeted results for the Obama image in Fig. 5.8. This image contains two salient objects: President Obama and the boy. As shown in the subjective results, cropping performs the best since the two salient objects fit perfectly into one cropping window and very little salient information is lost. On the other hand, seam carving [8] and warping [80] perform the worst as they introduce heavy global structural distortion (on President Obama) and local region distortion (the document in the President's hand).
As shown in the rank order table of Fig. 5.8, the objective rank computed with the proposed GLS index correlates well with the subjective rank in [58]. The Kendall rank coefficient is equal to $\tau = 0.857$. The best one and the poorest four image retargeting methods identified by the GLS index are identical to those of the subjective evaluation. The only slight difference lies in the methods ranked from 2 to 4: the GLS index favored the result of the streaming video method [38], while the subjective evaluation ranked the multi-operator method [60] as second.
However, there are individual cases where the proposed GLS index does not agree well with the subjective evaluation results. We show the evaluation results of the Buddha image in Fig. 5.9. For this case, there is a large discrepancy between the subjective and objective evaluation results, with a Kendall rank coefficient of $\tau = -0.357$. The GLS index gives the highest preference to cropping, seam carving [8] and warping [80]. However, these three are among the worst performing methods according to the subjective evaluation results, thereby leading to a negative Kendall rank coefficient value. The mismatch between the objective and subjective ranks can be explained by the shortage of training data. In the training data set, there is no image similar to the Buddha image, which contains the face of a human statue as opposed to an authentic human face. If similar cases were available in the training data, we would expect the proposed machine-learning-based GLS index to learn from these cases and provide a more accurate prediction.
5.4 Conclusion
In this chapter, we proposed a novel objective quality of experience (QoE) index, called the GLS index, to evaluate image retargeting results. We first identified three key factors related to human perception of the quality of retargeted images: global structural distortion, local region distortion and loss of salient information. Using this knowledge as guidance, we found effective features that capture these distortion types and utilized a machine learning mechanism to fuse all features into one single quality score. One major advantage of applying the machine learning tool is that the feature weights can be determined automatically. It was shown by experimental results that the proposed GLS index outperforms four other existing objective indices by a significant margin in all four performance metrics of consideration.
Figure 5.3: SIFT feature mapping and graph formulation between the original image and its retargeted result: (a) original image $I$ (top), matched SIFT features (middle) and its formulated graph $G = (V, E)$ (bottom); (b) retargeted result $\hat{I}_i$ (top), matched SIFT features (middle) and its formulated graph $\hat{G}_i = (\hat{V}_i, \hat{E}_i)$ (bottom).
Figure 5.4: Log-polar spatial representation scheme [55]. This example shows the 5-bit (32 regions) log-polar representation of a triangle and the position and orientation codes of each triangle node (n1: position 19, orientation π; n2: position 29, orientation π/4; n3: position 24, orientation π/2).
Figure 5.5: The degree of local region distortion is measured using the graph patch similarity, where the Euclidean distances between local patches of matched graph nodes are computed and summed over the entire graph.
Figure 5.6: Illustration of the reverse mapping from the matched SIFT nodes in the retargeted image (right) to the source image (left) to compute the impact area $\hat{P}_i$ in the source image. The white region in the bottom left indicates a region where the underlying information can be found in the retargeted image, while the black region means this part of the information is lost after retargeting.
Figure 5.7: Illustration of the information completeness feature computation: the original image (top left), the saliency map (bottom left), the segmentation of the saliency map into the critical region (white), the important region (gray) and the ordinary region (black), and the impact area $\hat{P}_i$ encircled by the green dashed line.
ORI  CR  MULTI  SC  SCL  SM  SNS  SV  WARP

Rank          1     2       3     4       5     6     7      8
Subjective    CR    MULTI   SV    SCL     SM    SNS   WARP   SC
Objective     CR    SV      SCL   MULTI   SM    SNS   WARP   SC

$\tau$ = 0.857
Figure 5.8: An example where the proposed GLS index is strongly correlated with the subjective rank (with Kendall rank coefficient $\tau = 0.857$). The original and eight retargeted images of the Obama image and the corresponding subjective rank [58] and the objective rank computed using the GLS index are shown.
ORI  CR  MULTI  SC  SCL  SM  SNS  SV  WARP

$\tau$ = -0.357
Rank          1       2     3      4     5       6     7      8
Subjective    MULTI   SCL   SV     SM    CR      SNS   WARP   SC
Objective     CR      SC    WARP   SV    MULTI   SM    SNS    SCL
Figure 5.9: An example where the proposed GLS index is poorly correlated with the subjective rank (with Kendall rank coefficient $\tau = -0.357$). The original and eight retargeted images of the Buddha image and the corresponding subjective rank [58] and the objective rank computed using the GLS index are shown.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
In the first part of this research, we proposed the Adaptive Directional Total Variation (ADTV) model as an image decomposition scheme that facilitates effective latent fingerprint segmentation and enhancement. Based on the classical Total Variation model, the proposed ADTV model differentiates itself by integrating two unique features of fingerprints, scale and orientation, into the model formulation. The proposed model is able to decompose a single latent image into two layers and locate the essential latent area for feature matching. The two spatially varying parameters of the model, scale and orientation, are adaptively chosen according to the background noise level and textural orientation, and effectively separate the latent fingerprint from structured noise in the background. Experimental results show that the proposed scheme provides effective segmentation and enhancement. The improvements in feature detection accuracy and latent matching further justify the effectiveness of the proposed scheme. The proposed scheme can be regarded as a preprocessing technique for automatic latent fingerprint recognition. It also has strong potential to be applied to other relevant applications, especially the processing of images with oriented textures.
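For reference, the sketch below implements the classical TV-L2 (ROF-type) cartoon-texture decomposition that the ADTV model builds on, using a plain gradient-descent iteration in NumPy. The per-pixel weight w is only a stand-in for the spatially varying scale parameter; the directional term that makes ADTV orientation-adaptive is omitted, so this is an illustrative baseline rather than the proposed model itself.

    import numpy as np

    def tv_l2_decompose(f, lam=0.1, w=1.0, n_iter=300, dt=0.2, eps=1e-6):
        """Split image f into a piecewise-smooth layer u and a texture layer v = f - u
        by gradient descent on  sum(w * |grad u|) + (lam / 2) * ||u - f||^2."""
        u = f.astype(np.float64).copy()
        for _ in range(n_iter):
            # Forward differences with periodic boundaries, kept simple for the sketch.
            ux = np.roll(u, -1, axis=1) - u
            uy = np.roll(u, -1, axis=0) - u
            mag = np.sqrt(ux ** 2 + uy ** 2 + eps)
            px, py = w * ux / mag, w * uy / mag
            # Backward-difference divergence of the (weighted) normalized gradient field.
            div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
            u += dt * (div - lam * (u - f))
        return u, f - u  # structure layer u, texture/noise layer v

    # Usage (illustrative): u, v = tv_l2_decompose(gray_image, lam=0.2), where the
    # oscillatory ridge-like patterns end up in the texture layer v.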
In the second part, we proposed two solutions for content-aware image/video resizing (also called image/video retargeting). The first solution is an efficient texture-aware resizing algorithm that addresses the texture redundancy issue through texture regularity analysis and real-time texture synthesis. This solution also exploits region features, including scale and shape information, to preserve both local and global structures. Experimental results show that the proposed technique achieves results comparable to state-of-the-art image retargeting techniques. We then proposed a novel video retargeting system that operates directly on an intermediate representation in the compressed domain, namely, the discrete cosine transform (DCT) domain. Under this system, we avoid the computationally expensive cycle of decompression, processing, and recompression when retargeting compressed-format video data. The proposed solution achieves results comparable to state-of-the-art spatial-domain video retargeting techniques while significantly reducing the overhead in computation and storage.
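To make the compressed-domain idea concrete, the toy example below shows the standard DCT-domain downscaling trick on a single 8 x 8 block: keep the 4 x 4 low-frequency corner of the block's 2-D DCT, rescale, and invert at the smaller size, so no full decode to the pixel domain is needed. This is only an illustration of why DCT-domain processing saves work; it is not the proposed retargeting system, and the SciPy-based helpers are an assumed tooling choice rather than the code used in this work.

    import numpy as np
    from scipy.fftpack import dct, idct

    def dct2(block):
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def idct2(coeffs):
        return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

    def downscale_block_dct(block8):
        """Halve an 8x8 block by truncating its 2-D DCT to the 4x4 low-frequency corner.
        With orthonormal DCTs the kept coefficients are scaled by sqrt((4*4)/(8*8)) = 1/2."""
        return idct2(dct2(block8)[:4, :4] * 0.5)

    # Sanity check on a smooth ramp block: the DCT-domain result stays close to
    # plain 2x2 pixel averaging of the same block.
    block = np.add.outer(np.arange(8.0), np.arange(8.0))
    avg = block.reshape(4, 2, 4, 2).mean(axis=(1, 3))
    print(np.abs(downscale_block_dct(block) - avg).max())  # small for smooth content

In an actual codec pipeline the block DCT coefficients come straight from the entropy-decoded bitstream, which is where the savings over the decompress-process-recompress cycle come from.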
In the third part, we proposed a novel objective quality-of-experience (QoE) index, called the GLS index, to evaluate image retargeting results. We first identified three key factors related to human perception of the quality of retargeted images: global structural distortion, local region distortion, and loss of salient information. Using this knowledge as guidance, we designed effective features that capture these distortion types and used a machine learning mechanism to fuse all features into a single quality score. One major advantage of applying the machine learning tool is that the feature weights can be determined automatically. Experimental results show that the proposed GLS index outperforms four other existing objective indices by a significant margin on all four performance metrics under consideration.
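The feature-fusion step can be illustrated with a generic learned regressor. In the sketch below, a support vector regressor maps a per-image feature vector (a global-structure score, a local-region score, and a saliency-loss score) to one quality value; the choice of scikit-learn's SVR, the feature names, and the toy numbers are assumptions made for illustration and are not the exact learner, features, or data used for the GLS index.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Toy training set: one row of [global_structure, local_region, saliency_loss]
    # features per retargeted image, with mean subjective scores as targets.
    X_train = np.array([
        [0.82, 0.75, 0.10],
        [0.55, 0.60, 0.35],
        [0.30, 0.40, 0.70],
        [0.90, 0.85, 0.05],
    ])
    y_train = np.array([4.2, 3.1, 1.8, 4.6])

    # The regressor fuses the features into a single score; the relative weighting
    # of the features is learned from data rather than hand-tuned.
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
    model.fit(X_train, y_train)

    print(model.predict(np.array([[0.70, 0.65, 0.20]])))  # fused quality score for a new image

One practical consequence of this design is that retraining on a larger subjective database directly updates the fusion, which motivates the data-collection direction discussed in Section 6.2.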
6.2 Future Research Directions
The study of TV-based latent fingerprint segmentation can be further extended along the following directions:
• The effectiveness of the proposed scheme is related to the accuracy of orientation estimation. When the estimated orientation is unreliable, fingerprint patterns may not be fully extracted into the texture layer v, leading to poor segmentation and enhancement results. In addition, the positions of singular points were not taken into consideration by our proposed model. Although only a few singular points appear in each latent image, additional detection and processing techniques need to be introduced to handle the regions surrounding them.
• Experimental results on latent matching show that our proposed scheme yields significant improvement only for latent images of good quality, while little improvement is observed for bad and ugly latent prints. These challenging images still require manual processing by fingerprint experts.
• For a latent image of 768 × 768 pixels, it takes about one minute for the proposed model to converge. Although this computational complexity is acceptable for processing latent fingerprints, speed-up solutions are desirable to enable more efficient processing.
For content-aware image and video resizing, we would like to extend our study
along the following directions:
• For our compressed-domain video retargeting system, we have so far only visually compared our results with state-of-the-art methods. To demonstrate the effectiveness of operating directly in the compressed domain, additional benchmarking is needed. For instance, the reduction in computation time should be estimated by measuring the savings from avoiding decompression and recompression.
• Memory saving is another merit of our proposed video retargeting system. The advantage of the proposed system would be more convincing if the memory savings induced by the compressed-domain technique could be quantified. This is another important benchmarking task.
For quality metrics of image retargeting, since the performance of the machine learning method improves with more training data, a larger database with more subjective evaluation results for image retargeting is desired. The prediction of objective scores would be greatly improved by a model trained on a more complete data set that covers all distortion types and more human scorers. In addition, this work has focused mainly on evaluating retargeting results for still images. Objective QoE assessment for retargeted video will be an important extension of the current work.
Reference List
[1] "H.264/AVC JM reference software," http://iphome.hhi.de/suehring/tml/, 2013.
[2] "Video trace library," http://trace.eas.asu.edu/yuv/, 2013.
[3] Latent Fingerprint Segmentation Result, Media Communications Lab, http://mrl.usc.edu/People/Jiangyang/TIFS-02732-2012 supplemental materials.zip.
[4] Neurotechnology Inc., VeriFinger, http://www.neurotechnology.com, 2012.
[5] NIST Special Database 14, NIST Mated Fingerprint Card Pairs 2 (MFCP2), http://www.nist.gov/srd/nistsd14.htm.
[6] NIST Special Database SD27, http://www.nist.gov/itl/iad/ig/sd27a.cfm.
[7] J.-F. Aujol, G. Gilboa, T. Chan, and S. Osher, "Structure-texture image decomposition - modeling, algorithms, and parameter selection," Int. J. Comput. Vision, vol. 67, no. 1, pp. 111-136, Apr. 2006.
[8] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," in ACM SIGGRAPH 2007, New York, 2007, p. 10.
[9] A. M. Bazen and S. H. Gerez, "Segmentation of fingerprint images," in ProRISC 2001 Workshop on Circuits, Systems and Signal Processing, 2001, pp. 276-280.
[10] A. Buades, T. M. Le, J.-M. Morel, and L. A. Vese, "Fast cartoon + texture image filters," IEEE Trans. Image Process., vol. 19, pp. 1978-1986, August 2010.
[11] J. Cai, S. Osher, and Z. Shen, "Split Bregman methods and frame based image restoration," Multiscale Model. Simul., vol. 8, 2009.
[12] T. Chan and S. Esedoglu, "Aspects of total variation regularized L1 function approximation," SIAM Journal on Applied Mathematics, vol. 65, 2004.
[13] C. Chen, P.-H. Wu, and H. Chen, "Transform-domain intra prediction for H.264," in Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, May 2005, pp. 1497-1500, vol. 2.
[14] F. Chen, J. Feng, A. K. Jain, J. Zhou, and J. Zhang, "Separating overlapped fingerprints," IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 346-359, 2011.
[15] T. Chen, W. Yin, X. S. Zhou, D. Comaniciu, and T. S. Huang, "Total variation models for variable lighting face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1519-1524, 2006.
[16] H. Choi, M. Boaventura, I. Boaventura, and A. Jain, "Automatic segmentation of latent fingerprints," in Biometrics: Theory, Applications and Systems (BTAS), 2012 IEEE Fifth International Conference on, Sept. 2012, pp. 303-310.
[17] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pp. 603-619, May 2002.
[18] W. Dong, N. Zhou, J.-C. Paul, and X. Zhang, "Optimized image resizing using seam carving and scaling," ACM Transactions on Graphics (TOG), vol. 28, no. 5, p. 125, 2009.
[19] E. Esser, "Applications of Lagrangian-based alternating direction methods and connections to split Bregman," UCLA CAM Report (09-31), 2009.
[20] Y. Fang, Z. Chen, W. Lin, and C.-W. Lin, "Saliency-based image retargeting in the compressed domain," in Proceedings of the 19th ACM International Conference on Multimedia, ser. MM '11, 2011, pp. 1049-1052.
[21] Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin, "Video saliency detection in the compressed domain," in Proceedings of the 20th ACM International Conference on Multimedia, ser. MM '12, 2012, pp. 697-700.
[22] J. Feng, S. Yoon, and A. K. Jain, "Latent fingerprint matching: Fusion of rolled and plain fingerprints," in Proceedings of the Third International Conference on Advances in Biometrics, ser. ICB '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 695-704.
[23] H. Fleyeh, D. Jomaa, and M. Dougherty, "Segmentation of low quality fingerprint images," in International Conference on Multimedia Computing and Information Technology (MCIT-2010), March 2-4, 2010.
[24] M. S. Floater, "Mean value coordinates," Computer Aided Geometric Design, vol. 20, 2003.
[25] R. Glowinski and P. Le Tallec, "Augmented Lagrangian and operator-splitting methods in nonlinear mechanics," SIAM, 1989.
[26] T. Goldstein, X. Bresson, and S. Osher, "Geometric applications of the split Bregman method: Segmentation and surface reconstruction," Journal of Scientific Computing, vol. 45, no. 1-3, 2009.
[27] T. Goldstein and S. Osher, "The split Bregman method for L1-regularized problems," SIAM J. Img. Sci., vol. 2, pp. 323-343, April 2009.
[28] M. Grundmann, V. Kwatra, M. Han, and I. Essa, "Discontinuous seam-carving for video retargeting," IEEE CVPR, 2010.
[29] J. Harel, C. Koch, P. Perona et al., "Graph-based visual saliency," Advances in Neural Information Processing Systems, vol. 19, p. 545, 2007.
[30] M. Hashemi, L. Winger, and S. Panchanathan, "Macroblock type selection for compressed domain down-sampling of MPEG video," in Electrical and Computer Engineering, 1999 IEEE Canadian Conference on, vol. 1, May 1999, pp. 35-38.
[31] M. Hestenes, "Multiplier and gradient methods," Journal of Optimization Theory and Applications, vol. 4, 1969.
[32] L. Hong, Y. Wan, and A. Jain, "Fingerprint image enhancement: Algorithm and performance evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 777-789, 1998.
[33] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, June 2007, pp. 1-8.
[34] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[35] A. K. Jain and J. Feng, "Latent fingerprint matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, pp. 88-100, January 2011.
[36] S. Karimi-Ashtiani and C.-C. J. Kuo, "A robust technique for latent fingerprint image segmentation and enhancement," in Proceedings of the International Conference on Image Processing. IEEE, 2008, pp. 1492-1495.
[37] Z. Karni, D. Freedman, and C. Gotsman, "Energy-based image deformation," in Computer Graphics Forum, vol. 28, no. 5. Wiley Online Library, 2009, pp. 1257-1268.
[38] P. Krähenbühl, M. Lang, A. Hornung, and M. Gross, "A system for retargeting of streaming video," ACM Trans. Graph., vol. 28, no. 5, pp. 126:1-126:10, Dec. 2009.
[39] L. Liang, C. Liu, Y.-Q. Xu, B. Guo, and H.-Y. Shum, "Real-time texture synthesis by patch-based sampling," ACM Trans. Graph., 2001.
[40] W. Lin and C.-C. Jay Kuo, "Perceptual visual quality metrics: A survey," Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297-312, 2011.
[41] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, "SIFT flow: Dense correspondence across different scenes," in Computer Vision - ECCV 2008. Springer, 2008, pp. 28-42.
[42] F. Liu and M. Gleicher, "Video retargeting: automating pan and scan," in Proceedings of the 14th Annual ACM International Conference on Multimedia, ser. MULTIMEDIA '06. New York, NY, USA: ACM, 2006, pp. 241-250. [Online]. Available: http://doi.acm.org.libproxy.usc.edu/10.1145/1180639.1180702
[43] Y.-J. Liu, X. Luo, Y.-M. Xuan, W.-F. Chen, and X.-L. Fu, "Image retargeting quality assessment," in Computer Graphics Forum, vol. 30, no. 2. Wiley Online Library, 2011, pp. 583-592.
[44] D. G. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150-1157.
[45] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.
[46] L. Ma, W. Lin, C. Deng, and K. N. Ngan, "Image retargeting quality assessment: a study of subjective scores and objective metrics," Selected Topics in Signal Processing, IEEE Journal of, vol. 6, no. 6, pp. 626-639, 2012.
[47] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada, "Color and texture descriptors," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 11, no. 6, pp. 703-715, 2001.
[48] B. Mehtre, "Segmentation of fingerprint images - a composite method," Pattern Recognition, vol. 22, no. 4, pp. 381-385, 1989.
[49] B. M. Mehtre, N. N. Murthy, S. Kapoor, and B. Chatterjee, "Segmentation of fingerprint images using the directional image," Pattern Recognition, vol. 20, no. 4, pp. 429-435, 1987.
[50] B. M. Oh, M. Chen, J. Dorsey, and F. Durand, "Image-based modeling and photo editing," ser. SIGGRAPH '01. New York, NY, USA: ACM. [Online]. Available: http://doi.acm.org/10.1145/383259.383310
[51] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, "An iterative regularization method for total variation-based image restoration," Multiscale Model. Simul., vol. 4, 2005.
[52] O. Pele and M. Werman, "Fast and robust earth mover's distances," in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 460-467.
[53] S. Porwal and J. Mukhopadhyay, "A fast DCT domain based video downscaling system," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 2, May 2006, p. II.
[54] Y. Pritch, E. Kav-Venaki, and S. Peleg, "Shift-map image editing," in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 151-158.
[55] S. Purushotham, Q. Tian, and C.-C. J. Kuo, "Picture-in-picture copy detection using spatial coding techniques," in Proceedings of the 2011 ACM International Workshop on Automated Media Analysis and Production for Novel TV Services. ACM, 2011, pp. 25-30.
[56] A. R. Rao and G. L. Lohse, "Identifying high level features of texture perception," CVGIP: Graph. Models Image Process., vol. 55, no. 3, pp. 218-233, 1993.
[57] N. K. Ratha, S. Chen, and A. K. Jain, "Adaptive flow orientation-based feature extraction in fingerprint images," Pattern Recognition, vol. 28, no. 11, pp. 1657-1672, 1995.
[58] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, "A comparative study of image retargeting," in ACM Transactions on Graphics (TOG), vol. 29, no. 6. ACM, 2010, p. 160.
[59] M. Rubinstein, A. Shamir, and S. Avidan, "Improved seam carving for video retargeting," ACM Transactions on Graphics (SIGGRAPH), vol. 27, no. 3, pp. 1-9, 2008.
[60] M. Rubinstein, A. Shamir, and S. Avidan, "Multi-operator media retargeting," ACM Trans. Graph., vol. 28, July 2009. [Online]. Available: http://doi.acm.org/10.1145/1531326.1531329
[61] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Phys. D, vol. 60, pp. 259-268, November 1992.
[62] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Cohen, "Gaze-based interaction for semi-automatic photo cropping," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI '06. New York, NY, USA: ACM, 2006, pp. 771-780. [Online]. Available: http://doi.acm.org.libproxy.usc.edu/10.1145/1124772.1124886
[63] S. Setzer, "Split Bregman algorithm, Douglas-Rachford splitting and frame shrinkage," in Proceedings of the Second International Conference on Scale Space and Variational Methods in Computer Vision, ser. SSVM '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 464-476.
[64] O. G. Sezer and A. Ercil, "Using perceptual relation of regularity and anisotropy in the texture with independent component model for defect detection," Pattern Recogn., vol. 40, no. 1, pp. 121-133, 2007.
[65] A. Shamir and O. Sorkine, "Visual media retargeting," in ACM SIGGRAPH ASIA 2009 Courses, ser. SIGGRAPH ASIA '09. New York, NY, USA: ACM, 2009, pp. 11:1-11:13. [Online]. Available: http://doi.acm.org.libproxy.usc.edu/10.1145/1665817.1665828
[66] B. Shen, I. Sethi, and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 9, no. 6, pp. 929-936, Sep. 1999.
[67] Z. Shi, Y. Wang, J. Qi, and K. Xu, "A new segmentation algorithm for low quality fingerprint image," in Proceedings of the Third International Conference on Image and Graphics, ser. ICIG '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 314-317.
[68] N. Short, M. Hsiao, A. Abbott, and E. Fox, "Latent fingerprint segmentation using ridge template correlation," in Proceedings of the International Conference on Imaging for Crime Detection and Prevention. London, UK: IEEE, 2011.
[69] H. Shu and L.-P. Chau, "An efficient arbitrary downsizing algorithm for video transcoding," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, no. 6, pp. 887-891, June 2004.
[70] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, "Summarizing visual data using bidirectional similarity," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1-8.
[71] X. Tai, J. Hahn, and G. Chung, "A fast algorithm for Euler's elastica model using augmented Lagrangian method," SIAM J. Imaging Science, vol. 4, 2011.
[72] A. M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation," in Electronic Imaging 2002, vol. 4671. International Society for Optics and Photonics, 2002, pp. 1069-1079.
[73] R. Wang and T. Huang, "Fast camera motion analysis in MPEG domain," in Image Processing, 1999. ICIP 99. Proceedings. 1999 International Conference on, vol. 3, 1999, pp. 691-694.
[74] R. Wang, H.-J. Zhang, and Y.-Q. Zhang, "A confidence measure based moving object extraction system built for compressed domain," in Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on, vol. 5, 2000, pp. 21-24.
[75] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee, "Optimized scale-and-stretch for image resizing," in SIGGRAPH Asia '08. New York, NY, USA: ACM, 2008, pp. 1-8.
[76] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel, "Motion-aware temporal coherence for video resizing," in ACM SIGGRAPH Asia 2009 Papers, ser. SIGGRAPH Asia '09, 2009, pp. 127:1-127:10.
[77] Y.-S. Wang, H.-C. Lin, O. Sorkine, and T.-Y. Lee, "Motion-based video retargeting with optimized crop-and-warp," ACM Trans. Graph., vol. 29, pp. 90:1-90:9, July 2010.
[78] Z. Wang and A. C. Bovik, "Modern image quality assessment," Synthesis Lectures on Image, Video, and Multimedia Processing, vol. 2, no. 1, pp. 1-156, 2006.
[79] M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof, "Anisotropic Huber-L1 optical flow," in Proceedings of the British Machine Vision Conference (BMVC), September 2009.
[80] L. Wolf, M. Guttmann, and D. Cohen-Or, "Non-homogeneous content-driven video-retargeting," in Proceedings of the Eleventh IEEE International Conference on Computer Vision (ICCV-07), 2007.
[81] C. Wu and X. Tai, "Augmented Lagrangian method, dual methods and split Bregman iterations for ROF model, vectorial TV and higher order models," SIAM J. Imaging Science, vol. 3, 2010.
[82] C. Wu, J. Zhang, and X.-C. Tai, "Augmented Lagrangian method for total variation restoration with non-quadratic fidelity," Inverse Problems and Imaging (IPI), vol. 4, 2011.
[83] H. Wu, Y.-S. Wang, T.-T. Wong, T.-Y. Lee, and P.-A. Heng, "Resizing by symmetry-summarization," ACM Trans. Graph., vol. 29, 2010. [Online]. Available: http://doi.acm.org/10.1145/1882261.1866185
[84] P. Wu, B. S. Manjunath, S. Newsam, and H. Shin, "A texture descriptor for image retrieval and browsing," 1999, p. 3.
[85] W. Yin, D. Goldfarb, and S. Osher, "The total variation regularized L1 model for multiscale decomposition," Multiscale Model. Simul., vol. 6, 2006.
[86] S. Yoon, J. Feng, and A. K. Jain, "On latent fingerprint enhancement," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 7667, Apr. 2010.
[87] Y.-S. Wang, J.-H. Hsiao, O. Sorkine, and T.-Y. Lee, "Scalable and coherent video resizing with per-frame optimization," ACM Trans. Graph., vol. 30, no. 4, 2011.
[88] J. Zhang, R. Lai, and C.-C. J. Kuo, "Latent fingerprint detection and segmentation with a directional total variation model," in Proceedings of the International Conference on Image Processing (ICIP), 2012 (to appear).
[89] J. Zhang and C.-C. Kuo, "Region-adaptive texture-aware image resizing," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012, pp. 837-840.
[90] J. Zhang, R. Lai, and C.-C. Kuo, "Latent fingerprint segmentation with adaptive total variation model," in Biometrics (ICB), 2012 5th IAPR International Conference on, March 29 - April 1, 2012, pp. 189-195.
[91] E. Zhu, J. Yin, C. Hu, and G. Zhang, "A systematic method for fingerprint ridge orientation estimation and image segmentation," Pattern Recognition, vol. 39, no. 8, pp. 1452-1472, 2006.
Abstract
An important step in an automated fingerprint identification system (AFIS) is the process of fingerprint segmentation. While a tremendous amount of effort has been made on plain and rolled fingerprint segmentation, latent fingerprint segmentation remains a challenging problem. Traditional segmentation methods fail to work properly on latent fingerprints, as they are based on many assumptions that are only valid for rolled/plain fingerprints. In this work, we propose a new image decomposition scheme, called the adaptive directional total variation (ADTV) model, to achieve effective segmentation and enhancement for latent fingerprint images. The proposed model is inspired by the classical total variation models, but it differentiates itself by integrating two unique features of fingerprints
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Advanced techniques for stereoscopic image rectification and quality assessment
A data-driven approach to compressed video quality assessment using just noticeable difference
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Machine learning based techniques for biomedical image/video analysis
Techniques for vanishing point detection
Object localization with deep learning techniques
Advanced visual segmentation techniques: algorithm design and performance analysis
Video object segmentation and tracking with deep learning techniques
Advanced techniques for high fidelity video coding
Dynamic latent structured data analytics
Explainable and green solutions to point cloud classification and segmentation
Advanced modulation, detection, and monitoring techniques for optical communication systems
Advanced techniques for green image coding via hierarchical vector quantization
Efficient machine learning techniques for low- and high-dimensional data sources
Green image generation and label transfer techniques
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Advanced techniques for human action classification and text localization
Depth inference and visual saliency detection from 2D images
Techniques for compressed visual data quality assessment and advanced video coding
Advanced technologies for learning-based image/video enhancement, image generation and attribute editing
Asset Metadata
Creator
Zhang, Jiangyang
(author)
Core Title
Advanced visual processing techniques for latent fingerprint detection and video retargeting
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
04/21/2014
Defense Date
03/24/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
image retargeting,latent fingerprint segmentation,OAI-PMH Harvest,video retargeting
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Nakano, Aiichiro (committee member), Sawchuk, Alexander A. (Sandy) (committee member)
Creator Email
jiangyaz@usc.edu,zhangjiangyang@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-382016
Unique identifier
UC11295784
Identifier
etd-ZhangJiang-2387.pdf (filename),usctheses-c3-382016 (legacy record id)
Legacy Identifier
etd-ZhangJiang-2387.pdf
Dmrecord
382016
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhang, Jiangyang
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA