AUTOMATIC IMAGE AND VIDEO ENHANCEMENT
WITH APPLICATION TO VISUALLY IMPAIRED PEOPLE
by
Anustup Kumar Choudhury
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2012
Copyright 2012 Anustup Kumar Choudhury
Dedication
To my grandparents
(Late) Shri Atul Chandra Saha Choudhury, (Late) Shri Prabir Kumar Goswami,
(Late) Smt. Nandita Choudhury and Smt. Bani Goswami
And my parents
Smt. Bidyasri Choudhury and Shri. Atanu Choudhury
And my sister
Ms. Ajanta Choudhury
Acknowledgments
Firstly, I would like to express my heartfelt thanks and deepest gratitude to my advisor,
Prof. Gérard Medioni. I would not have been able to complete my doctoral research
without his continuous guidance, enlightening ideas, endless patience, unbounded en-
thusiasm and warm support.
I would like to thank Prof. Bosco Tjan and Prof. Abhijeet Ghosh for their percep-
tive comments and for spending their valuable time on my thesis committee. I would
also like to thank Prof. Ram Nevatia and Prof. Cyrus Shahabi for being on my guid-
ance committee. I would like to thank Prof. Norberto Grzywacz, Prof. Eli Peli and
Prof. Bartlett Mel for their comments during the course of my research.
I had a wonderful experience during the time I spent in the Computer Vision Group
at USC and am grateful for the interactions I had with the past and present members of
the group, especially, Douglas Fidaleo, Prithviraj Banerjee, Vivek Singh, Chang Yuan,
Matheen Siddiqui, Pramod Sharma, Younghoon Lee, Dian Gong, Jan Prokaj, Yann
Dumortier, Eunyoung Kim, Thang Ba Dinh, Jongmoo Choi, Xuemei Zhao, Cheng-
Hao Kuo, Vivek Pradeep, Furqan Khan, Yuping Lin, Wei-Kai Liao and Li Zhang.
My stay here at USC was made memorable and I made some friends for life
and would like to thank them, especially, Adarsh Shekhar, Sushmita Allam, Krish-
nakali Dasgupta, Lubaina Kitabi, Sunaina Subhagan, Gaurav Agarwal, Abhijeet Kher,
Sushila Seshasayee and Sampada Chavan. I would like to express my sincerest thanks
to Vinay Narayanan, Harish Rajamani and Anand Gopalakrishnan who have always
been there for me. I would also like to thank Damayanthi Krishnan who believed in
me and encouraged me to pursue my doctoral research.
Last but not least, I would like to express my love and deepest gratitude to my
wife, Rumi, for her unconditional support and for being my inspiration. I am extremely
grateful to my family and to Rumi’s family for their love, encouragement, affection,
sacrifices and blessings.
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables viii
List of Figures ix
Abstract xvii
Chapter 1: Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Overview of proposal . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 2: Illumination Estimation and Color Constancy 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Overview of our Methods . . . . . . . . . . . . . . . . . . . 21
2.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Method 1: Standard Deviation of Color Channels . . . . . . . . . . . 24
2.2.1 Statistical Verification . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Illuminant Color Estimation . . . . . . . . . . . . . . . . . . 27
2.3 Method 2: Denoising Techniques . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Gaussian Filter . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Median Filter . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.3 Bilateral Filter . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.4 Non-Local means Filter . . . . . . . . . . . . . . . . . . . . 30
2.3.5 Illuminant Color Estimation . . . . . . . . . . . . . . . . . . 31
2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1 Controlled Indoor Environment Dataset . . . . . . . . . . . . 33
2.4.2 Effects of Parameter Modification . . . . . . . . . . . . . . . 36
2.4.3 Real World Environment Dataset . . . . . . . . . . . . . . . . 40
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.5 Computational Cost . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 3: Perceptually Motivated Automatic Color Contrast Enhancement
of Images 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.2 Overview of our Method . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Estimating illumination using Non-local Means technique . . 51
3.2.2 Automatic enhancement of illumination . . . . . . . . . . . . 53
3.2.3 Combining enhanced illumination with reflectance . . . . . . 56
3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Enhancement Results . . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Comparison with other methods . . . . . . . . . . . . . . . . 61
3.3.3 Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.4 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . 68
3.3.5 Computational Cost . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Human Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.2 Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.3 Simulation Glasses . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.5 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Chapter 4: Robust Sharpness Enhancement Using Hierarchy of Non-Local
Means 84
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.2 Overview of our Method . . . . . . . . . . . . . . . . . . . . 91
4.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.1 Noise removal to ensure robustness . . . . . . . . . . . . . . 91
4.2.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.3 Hierarchical Decomposition with NL-means . . . . . . . . . 96
4.2.4 Sharpness Enhancement . . . . . . . . . . . . . . . . . . . . 97
4.3 Comparison of decomposition with existing methods . . . . . . . . . 97
4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4.1 Sharpness Enhancement . . . . . . . . . . . . . . . . . . . . 100
4.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . 103
4.4.2.1 Experiment A: Correlation Experiment . . . . . . . 108
4.4.2.2 Experiment B: Sharpness Metric . . . . . . . . . . 110
4.4.2.3 Experiment C: Automatic Enhancement . . . . . . 112
4.4.3 HDR Tone Mapping and Robustness to Noise . . . . . . . . . 120
4.4.4 Enhancement of eroded artifacts . . . . . . . . . . . . . . . . 123
4.4.5 Computational Cost . . . . . . . . . . . . . . . . . . . . . . 125
4.4.6 Power Spectrum Analysis . . . . . . . . . . . . . . . . . . . 125
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Chapter 5: Framework for Robust Online Video Enhancement Using Mod-
ularity Optimization 128
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.1.2 Overview of our Method . . . . . . . . . . . . . . . . . . . . 132
5.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.2 Our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.2.1 Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . 134
5.2.2 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . 137
5.2.3 Key frame Detection . . . . . . . . . . . . . . . . . . . . . . 139
5.2.4 Information Transfer . . . . . . . . . . . . . . . . . . . . . . 140
5.2.5 Video Enhancement . . . . . . . . . . . . . . . . . . . . . . 141
5.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.3.1 Quantitative Evaluation of Cluster Detection . . . . . . . . . 142
5.3.2 Video Enhancement . . . . . . . . . . . . . . . . . . . . . . 153
5.3.2.1 Cluster Analysis and Key frame Extraction . . . . . 153
5.3.2.2 Removal of flashing artifacts . . . . . . . . . . . . 156
5.3.2.3 Significance of Clustering . . . . . . . . . . . . . . 160
5.3.2.4 Human Evaluation . . . . . . . . . . . . . . . . . . 161
5.3.2.5 Computational Cost . . . . . . . . . . . . . . . . . 165
5.3.3 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . 166
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Chapter 6: Conclusion 170
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 170
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
References 174
List of Tables
2.1 Error for the controlled indoor environment using various color con-
stancy methods. The parameters of our approach have been described
in Section 2.3 and in Section 2.2. The best result is reported in bold. . 34
2.2 Median angular error for the real world environment. The parameters
of our approach have been described in Section 2.3 and in Section 2.2.
The best result is reported in bold. . . . . . . . . . . . . . . . . . . . 43
2.3 Ranks of algorithm according to Wilcoxon Signed Rank Test. The best
results are reported in bold. . . . . . . . . . . . . . . . . . . . . . . . 44
2.4 Computational time for the different methods. . . . . . . . . . . . . . 45
5.1 Videos from TRECVid 2001 dataset used in our experiments . . . . . 144
5.2 Results for gradual transitions on videos from TRECVid 2001 dataset.
R is for recall and P is for precision. A ‘-’ indicates unreported result 146
5.3 Results for cuts on videos from TRECVid 2001 dataset. R is for recall
and P is for precision. A ‘-’ indicates unreported result . . . . . . . . 148
5.4 Average F_1 measure for videos shown in Figure 5.1 . . . . . . . . . . 149
List of Figures
1.1 Frame from a high quality Hollywood movie . . . . . . . . . . . . . 1
1.2 Frame from a poor quality YouTube video . . . . . . . . . . . . . . . 2
1.3 (a) is the image that can be seen by a person with normal vision and (b)
is the same image that is viewed by a patient with AMD. The images
are from NIH website . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Histogram Equalization of color images. We can see artifacts in the
sky and color shift along the edges of trees. . . . . . . . . . . . . . . 6
1.5 Halo effect caused during retinex enhancement can be observed around
the edge of the body and the background. . . . . . . . . . . . . . . . 7
1.6 High-level overview of our method . . . . . . . . . . . . . . . . . . . 10
2.1 Image from [98] showing the same cube under two different lighting
conditions (blue on the left and yellow on the right) . . . . . . . . . . 18
2.2 Summary of previous work in color constancy . . . . . . . . . . . . . 18
2.3 (a) Image from [8] with blue color cast. The intensity is tripled for
better clarity. (b) Zoomed-in RGB histogram of Figure 2.3(a) with
original intensity. The complete histogram is shown in the inset of
Figure 2.3(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 (a) Image from original Figure 2.3(a) without color cast. The intensity
is tripled for better clarity. (b) Zoomed-in RGB histogram of Fig-
ure 2.4(a) with original intensity. The complete histogram is shown in
the inset of Figure 2.4(b) . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Flowchart of our method . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Statistics for images with and without color cast. Image (a) is the his-
togram of average φ. Image (b) is the histogram of φ over all images.
Image (c) is the corresponding cumulative distribution . . . . . . . . 26
2.7 Images from the controlled indoor environment[8] . . . . . . . . . . . 33
2.8 Example of an image from controlled indoor environment corrected
after estimating the illumination by different methods and their angu-
lar errors. The intensity of the images are tripled for better clarity.
From left to right and top to bottom: Original Image, and the subse-
quent images are correction using Ground Truth values, White Patch
Assumption, Grey World Algorithm, Grey Edge Algorithm, Standard
Deviation of Color Channels, Gaussian Filter, Bilateral Filter, Median
Filter and Non-local Means Filter . . . . . . . . . . . . . . . . . . . . 35
2.9 Effect of window size W on median angular error for the controlled
indoor environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.10 Median angular error as a function of σ and kernel size for a Gaussian
filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.11 Median angular error as a function of window size for a Median filter 38
2.12 Median angular error as a function of filtering parameter for a Non-
Local means filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.13 Median angular error as a function of σ’s for a Bilateral filter . . . . . 39
2.14 Median angular error as a function of window size for a Bilateral filter 39
2.15 Images from the real world environment[33] . . . . . . . . . . . . . . 41
2.16 Example of images from real world environment corrected after esti-
mating the illumination by different methods and their angular errors.
The first row contains the Original Images and the subsequent rows are
correction using (2) Ground Truth values (3) White Patch Assumption
(4) Grey World Algorithm (5) Grey Edge Algorithm (6) Standard De-
viation of Color Channels (7) Gaussian Filter (8) Bilateral Filter (9)
Median Filter (10) Non-local Means Filter . . . . . . . . . . . . . . . 42
3.1 Flowchart of the enhancement module . . . . . . . . . . . . . . . . . 51
3.2 The mapping function for enhancement . . . . . . . . . . . . . . . . 53
3.3 The effects of changing the value of ρ . . . . . . . . . . . . . . . . . 56
3.4 Enhancement Results. From left to right and top to bottom: Origi-
nal Image, Our enhancement, Mapping on all 3 color channels of il-
lumination, Mapping on intensity channel of illumination, Histogram
equalization on all 3 color channels, Histogram equalization on inten-
sity channel, Kimmel’s retinex on all 3 color channels and Kimmel’s
retinex on intensity channel . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Enhancement Results. From left to right and top to bottom: Original
Image, Our enhancement, Mapping on all 3 color channels of illumina-
tion, Mapping on intensity channel of illumination, Histogram equal-
ization on all 3 color channels and Histogram equalization on intensity
channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Our enhancement maintains contrast of good contrast images. (a) is
the original image and (b) is the enhanced image . . . . . . . . . . . 60
3.7 Top row: The left image is the original image, the middle image is our
enhancement and the right image is obtained from [87]. Bottom row:
Enhancements by Rizzi et al. and Provenzi et al.. The left image is en-
hanced by RSR [96], the middle image is enhanced by ACE [102] and
the right image is enhanced by RACE [97]. The images are obtained
from [97] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 The top row is the original image. On the second row, the left image is
enhanced by McCann Retinex and the right image is enhanced by our
algorithm. More detail can be seen in the shadows and no halo effect
exists in our implementation. . . . . . . . . . . . . . . . . . . . . . . 63
3.9 The first row contains the original images. The second row has the
enhanced images by MSRCR. The third row has the images enhanced
by our algorithm. Using our algorithm, we can clearly see the red brick
wall within the pillars on the left image and some colored texture on
the white part of the shoe can be clearly seen. . . . . . . . . . . . . . 64
3.10 Top row: Left image is the original image and the right image is our en-
hancement. Bottom row: Left image is the enhanced output using the Auto
Contrast feature of Picasa™ and the right image is enhanced using
DXO Optics Pro® . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.11 The left image is the original image, the middle image is enhanced
using our algorithm and the right image is enhanced using DXO Optics
Pro® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.12 Statistical characteristics of the image before and after enhancement . 66
3.13 Evaluation of image quality after image enhancement . . . . . . . . . 67
3.14 Simulation glasses. Image from “Low Vision simulators” website . . 71
3.15 Image quality ratings for subjects with ’normal’ vision . . . . . . . . 73
3.16 Image quality ratings for subjects with simulated AMD . . . . . . . . 74
3.17 Histogram of mean image quality ratings for all subjects with simu-
lated AMD and normal vision . . . . . . . . . . . . . . . . . . . . . 74
3.18 Preference for enhancement ratings for subjects with ’normal’ vision 75
3.19 Preference for enhancement ratings for subjects with simulated AMD 76
3.20 Histogram of mean preference ratings for all subjects with simulated
AMD and normal vision . . . . . . . . . . . . . . . . . . . . . . . . 76
3.21 ROC plot for subjects with simulated AMD. The data being degenerate
does not produce operating points off the axes . . . . . . . . . . . . . 78
3.22 ROC plot for all subjects with normal vision. The data being degener-
ate does not produce operating points off the axes . . . . . . . . . . . 78
3.23 A_z’s for subjects with normal vision and simulated AMD . . . . . . . 79
3.24 (a) is the original image and (b) is the enhanced image . . . . . . . . 81
4.1 Sharpness Enhancement. Image (a) is from [40]. Note the enhance-
ment in Image (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Exaggerated Sharpness Enhancement of Figure 4.1(a) . . . . . . . . . 87
4.3 Effects of smoothing on noisy images. Column (a) is the noisy image.
For columns (b) to (f), the top row is the smooth image whereas the
bottom row is the method noise. Column (b) uses Gaussian filter with
σ_c = 5. Column (c) uses Bilateral Filter with σ_c = 5 and σ_s = 0.05.
Column (d) uses Bilateral Filter with σ_c = 5 and σ_s = 0.5. Column
(e) uses WLS Filter with α = 0.25 and λ = 1.2. Column (f) uses
NL-means Filter with h = 0.03. The method noise is normalized for
better visibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Flowchart of our method . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5 Applying NL-means Filter on noise-free synthetic image (a). Image
(b) is the smooth image and Image (c) is the method noise . . . . . . . . 92
4.6 Applying NL-means Filter on Figure 4.1(a) with h = 0.01. Top row:
On noise-free natural image. Bottom row: On noisy natural image . . 93
4.7 Filtering using edge information. The top row is the smooth image and
the bottom row is the method noise. The red rectangle is zoomed for
clarity. (a) uses NL-means filter with a high value of h = 0.5 and no
edge information. (b) uses edge information and h = 0.5. (c) uses bilateral filter
along with edge information. Though most regions are smooth and
there are no edges in the method noise due to low σ, noise can be seen
along the edges of smooth image . . . . . . . . . . . . . . . . . . . . 95
4.8 Decomposition using existing approaches. The left half of the image
is the smooth image and the right half is the corresponding fine level.
The smoothing increases from top to bottom. The images are taken
from [40] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.9 Decomposition using our technique. The left half of the image is the
smooth image and the right half is the corresponding fine level. The
smoothing increases from top to bottom. Note the presence of strong
halo effects in image (b), subtle halo effects in image (a) and absence
of halo effects in image (c). Note reduced ringing in image (a) as com-
pared to Figure 4.8(a). The segmentation information used in image
(a) and image (c) are the same. The images are normalized for better
visibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.10 Sharpness Enhancement. Image (a) is from Flickr®. Note the pro-
gressively increasing sharpness enhancement from left to right . . . . 101
4.11 Abstractions at different spatial scales by changing the filtering param-
eter h. (a) is original image. (b) h = 0.01 at level 1 and (c) h = 0.1 at
level 1. l_1 = 5 for both enhancements . . . . . . . . . . . . . . . . . 102
4.12 Comparison of our enhancement with Fattal et al. [42]. Details in the
right image are more clearly visible than in the left image . . . . . . . 104
4.13 Gradient reversal. A red arrow shows the gradient reversals along the
mountain edges in the top image[42] (zoomed in the inset) and the lack
of it in the bottom image . . . . . . . . . . . . . . . . . . . . . . . . 105
4.14 Enhancement with Photoshop’s unsharp mask. Note the halo effect
along the boundary of the flower . . . . . . . . . . . . . . . . . . . . 106
4.15 Enhanced flower from [110]. Note subtle halo effects (white color)
along the boundary of flower and leaves and the lack of it using our
method (Figure 4.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.16 Correlation between the ranks assigned using Tenengrad criterion and
the ranks assigned by human observers for (top) image with the best
average ρ (ρ = 0.9714) and (bottom) image with the worst average ρ
(ρ = 0.9524) across all observers. The identity line in black, y = x
depicts perfect agreement between the different ranks . . . . . . . . . 109
4.17 Tenengrad values on 9 images for our dataset. (a) shows responses
marked by circles (Preferred Images) and squares (Transition to “Too-
detailed” Images). The size of the circles and squares correspond to the
number of responses. (b) shows convergence of Tenengrad Criterion.
All 37 images are not included for better clarity . . . . . . . . . . . . 111
4.18 SPI values on our dataset. The “Too much detail” region extends till
SPI = 1. The blue horizontal line shows mean SPI for the train
preferred images. The red horizontal line shows mean SPI for train
images with too much detail . . . . . . . . . . . . . . . . . . . . . . 113
4.19 (Top) Preference for our selection (Bottom) Histogram of mean pref-
erence ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.20 A_z’s for all observers . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.21 Comparison of enhancements using SPI. The “Optimal” region is
from SPI ∈ [0.23, 0.36] . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.22 Automatic sharpness enhancement of images. Left are the original
images and right are the enhanced images. The box represents the
preferred images according to our metric. For the first image, the en-
hanced image (l_1 = 3) is preferred whereas for the 2nd image, the
original image (l_1 = 1) is preferred. The enhancements are better vis-
ible at the original high resolution . . . . . . . . . . . . . . . . . . . 119
4.23 (Left) Tone-mapped results and (Right) Close-up. Both methods use
the same tone mapping algorithm. Note the lack of halos around pic-
ture frames and light fixture and better color balance in the close up of
the bottom image . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.24 Exaggerated tone-mapped results using different techniques of an HDR
image. Halos are shown in the close-ups of the images in Figure 4.25 . 122
4.25 (Top row) Exaggerated tone-mapped results using different techniques
of an HDR image. (Bottom row) Close-up of a part of the image.
Halos are shown in the close-ups of the images. Note the presence of
halo effects in (a) and (c) and the lack of it in (b) and (d) . . . . . . . 122
4.26 (Left) Close-up of tone-mapped image by Fattal [41] and obtained
from Paris et al. [88] and (Right) Close-up of tone-mapped image by
our method. A red arrow shows the irregular edge generated in the left
image due to aliasing and the lack of it in the right image . . . . . . . 123
4.27 Tone-mapping of noisy image. Note the relatively less amplification
of noise using our method . . . . . . . . . . . . . . . . . . . . . . . . 124
4.28 Enhancement of archaeological artifacts. (a) is the original image from
[72]. (b) is the enhanced image from [72]. (c) is enhanced by our
method. The structure beneath some weathered regions is clearly
visible in (c) as compared to (b) . . . . . . . . . . . . . . . . . . . . 125
4.29 Power Spectrum Characteristics. The left image is the power spec-
trum of Fig. 4.1(a) and the right image is that of Fig. 4.1(b). Note the
magnification of high frequencies in the right image . . . . . . . . . . 126
5.1 Comparison of methods on the basis of F_1 measure for cuts transition 150
5.2 Failure cases using our clustering method with true cluster boundary
shown in red and false positive in magenta. It could not detect cut tran-
sition in (a) where the illumination and edge information is similar. It
also could not detect dissolve transition in (b) where it shows the same
surface of moon after zooming in. It detects extra cluster boundaries
in (c) and (d) as there is a significant difference in illumination. Such
failure cases in terms of shot detection are required for video enhance-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.3 Frame similarity graph along with cluster and key frame information.
The ground truth shot boundaries are shown with black lines on the
upper right of the diagonal and the detected shot boundaries (indicated
by change of cluster label index) are shown with pink lines on the
lower left of the diagonal. The key frame information that is being
used for enhancement for every cluster is shown with a green square
marker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 Frame similarity graph along with cluster and key frame information
for a sequence with 3 clusters . . . . . . . . . . . . . . . . . . . . . 156
5.5 Slowly changing scene where we estimated the shot boundary (black
rectangle). Images in sequence are from left to right and top to bottom 157
5.6 Similarity measure of key frame . . . . . . . . . . . . . . . . . . . . 157
5.7 Comparison of enhancement techniques for the tracked object. (Tem-
poral consistency and significance of key frame) . . . . . . . . . . . . 158
5.8 Significance of clustering. Failure to detect clusters using existing
method results in improper enhancement as shown by the green curve
that reduces contrast of underexposed frames (Frames 13-27). This
problem is not present using our method as shown by the red curve . . . . 161
5.9 Preference ratings for enhanced videos . . . . . . . . . . . . . . . . 163
5.10 Histogram of mean preference ratings for all observers . . . . . . . . 164
5.11 A_z’s for all observers . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.12 Segmentation Results. Top row: Original images. Middle row: Seg-
mentation results using [60]. Bottom row: Segmentation results [60]
using cluster information. The left column shows results at 90% of the
maximum hierarchical level and the right column shows results at 70%
of the maximum hierarchical level . . . . . . . . . . . . . . . . . . . 168
Abstract
Images and videos may have poor visual quality due to the relatively low dynamic range
of capture and display devices compared to the human visual system, poor lighting
conditions, or the inexperience of the people capturing them. We are exploring tech-
niques to perform automatic enhancement of images and videos. The goal is to produce
a better visual experience for all viewers. Another motivation is to improve perception
for visually impaired patients, in particular, people suffering from Age-related Macular
Degeneration. In order to address these problems, we have developed novel techniques
for contrast and sharpness enhancement of images and videos.
Our color contrast enhancement technique is inspired by the Retinex theory. We
use denoising techniques to estimate the illumination component of the image, while
preserving color and white balance. We then enhance only the illumination component
using mapping functions. These enhancement parameters are estimated automatically.
This enhanced illumination is then combined with the original reflectance to obtain
enhanced images with better contrast.
For sharpness enhancement, we use a novel approach based on a hierarchical
framework using the edge-preserving Non-local means filter. The hierarchical framework
is constructed from a single image using a modified version of the Laplacian pyramid.
We also introduce a new measure to quantify sharpness quality, which allows us to
automatically set parameters in order to achieve a preferred sharpness enhancement.
Finally, we propose a novel method based on modularity optimization to perform
temporally consistent and robust enhancement of videos. A key aspect of our process
is that it is fully automatic. Our method detects scene changes ‘on-the-fly’ in a video.
For every detected cluster, we find a key frame that is most representative of other
frames in the sequence and estimate enhancement parameters for only the key frame.
We then enhance all frames in that cluster using these enhancement parameters, thus
making our method robust.
We compare our enhancement results with existing state-of-the-art approaches and
commercial packages and show noticeable improvements. Our image enhancements
do not suffer from halo effects, color artifacts, and color shifts; our video enhancement
is temporally consistent and does not suffer from flash or flickering artifacts. Validation
is challenging, as visual experience is intrinsically subjective. We have conducted ex-
tensive tests on real viewers (both normally sighted and simulated visually impaired),
and provide a statistical measure of improvement in terms of preference.
Chapter 1
Introduction
The human sense of sight has a very large dynamic range. We can see objects both
in the night sky and in bright sunlight, although we cannot see those extremes si-
multaneously. Since the dynamic range of sensors used in digital photography is much
smaller, it is difficult to achieve the full dynamic range experienced by the human visual
system (HVS) using commodity capture/display devices. We therefore need to develop
techniques to fit an original scene with a wide dynamic range into devices with a much
lower dynamic range.
Figure 1.1: Frame from a high quality Hollywood movie
In films and movies, lighting is given very high priority because the perception
of the quality of the movie depends on how well the movie is lit. As a result, movie
production houses spend a lot of money and hire experienced professionals to get the
right lighting. An example of a movie frame from a Hollywood movie is shown in
Figure 1.1. Also, during the shooting of a movie, there is control over how and where to
add or remove light in a scene, thus resulting in perfect lighting for every scene.
On the other hand, badly lit images can make a film look amateurish. This is indeed
the case with millions of YouTube videos that are uploaded by common people. An
example of such a frame can be seen in Figure 1.2. This may happen due to a poor
understanding of, or lack of control over, the lighting conditions, or simply the inexpe-
rience of the people capturing the videos. As a result, the images and movies may have
poor visual quality. Therefore, we want to perform enhancement of images and videos
to give a better visual experience to all viewers.
Figure 1.2: Frame from a poor quality YouTube video
Another reason why an image/video may be perceived as having poor visual quality
is visual impairment. As the population ages, the number of people suffering
from eye diseases, and thus from vision impairment, is increasing. According to [1],
age-related macular degeneration (AMD) affects more than 1.75 million individuals in
the United States, a number predicted to increase to almost 3 million by 2020. These
visual impairments affect the day-to-day life of many older people. As can be seen
in Figure 1.3, patients suffering from AMD generally experience blurring or darkness
in the center of their visual field, resulting in a central field loss (CFL). This interferes
with daily activities including reading, mobility and face recognition.
(a) Normal perception (b) AMD perception
Figure 1.3: (a) is the image that can be seen by a person with normal vision and (b)
is the same image that is viewed by a patient with AMD. The images are from NIH
website
Therefore, in order to improve the visual experience of both normally sighted
people and people suffering from visual impairments such as AMD, we pro-
pose post-processing techniques to automatically and consistently enhance images and
videos without any artifacts.
1.1 Problem Statement
The objective of this proposal is to develop a system to enhance images and videos.
We are looking at two different enhancement mechanisms - 1) enhancing contrast
in color images and 2) enhancing sharpness in images. Enhancing contrast is use-
ful because it makes the digital content easier to see. While enhancing the
contrast of images, we mimic the human ability to achieve color constancy. This is
equivalent to the auto white balancing feature present in digital cameras. Also,
since patients suffering from AMD must rely on their blurred peripheral vision, emphasizing
the sharpness or the details of the image may aid their perception.
Apart from enhancing images, we have also constructed a framework to perform
video enhancement. While analyzing video sequences, we cluster frames in an
online manner and, using a technique called modularity optimization, we have devel-
oped a novel technique to perform scene change detection.
We try to ensure that the enhancement process is completely automatic - irrespec-
tive of the quality/nature of images and videos, the parameters should be set auto-
matically depending on the content. Also, the enhancement process should be robust
and should not distort the content or produce any undesirable artifacts. Finally, we
also need to validate our enhancement results by conducting experiments on human
observers.
1.2 Challenges
Enhancement of images and videos is quite challenging for various reasons. Some
of the challenges are as follows -
• Image Variations
The images that we deal with may be underexposed, overexposed, a combina-
tion of both, or may already have excellent contrast. The images may have been
taken under different lighting conditions (colored or white illumination) depend-
ing on the time of day or whether the images were captured indoors or outdoors.
• Automatic Process
Since images can have a multitude of variations, the enhancement parameters
needed to produce a good result in each of those cases will also vary
greatly across different images. It is impractical and cumbersome to require human
intervention to modify the enhancement parameters for each image. It becomes
harder with videos, especially movies, because they generally cover a
vast spectrum of individual frames. Therefore, the enhancement process should
be automatic and adaptive to a large range of images.
• Enhancement Artifacts
Some existing image enhancement techniques, such as histogram
equalization, work well for grayscale images. However, when histogram equal-
ization is used to enhance color images, it may cause a shift in the color scale,
resulting in artifacts and an imbalance in image color as shown in Figure 1.4 (a
short illustrative sketch of this effect follows this list). Such artifacts are not
desirable, as it is critical to maintain the color properties of the image while
enhancing it.
Other enhancement techniques may result in halo effects, as shown in Figure 1.5.
• Computational Time
This is an important aspect of enhancement techniques. It is undesirable, but
may be acceptable, for the user to wait a long time to view an enhanced image.
Figure 1.4: Histogram Equalization of color images: (a) original, (b) histogram equal-
ized. We can see artifacts in the sky and color shift along the edges of trees.
Figure 1.5: (a) original, (b) enhanced. The halo effect caused during retinex enhance-
ment can be observed around the edge of the body and the background.
However, computational time becomes a critical issue when dealing with videos.
Standard definition videos play at 30 fps and High Definition videos play at 25
fps. Hence, in those cases, it is desirable for the enhancement process to run at
several frames per second for a better visual experience for the user.
• Validation
This is an important step after the enhancement process in order to assess the
benefits of the enhancement techniques. A key problem during the validation
process is defining the task to be performed by people while viewing
images and videos. Other issues that come up when dealing with human subjects
are their individual preferences, their cognitive load, etc.
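The color shift described under Enhancement Artifacts above is easy to reproduce. The following sketch is for illustration only and is not part of the thesis; the input file name is hypothetical and the choice of HSV as the intensity representation is an assumption. It contrasts independent per-channel histogram equalization, which can alter the ratios between color channels, with equalization of an intensity channel only, which leaves chromaticity untouched.

# Illustrative sketch (not from the thesis): per-channel histogram equalization
# can shift colors, whereas equalizing only an intensity channel preserves hue
# and saturation. Requires OpenCV (cv2).
import cv2

def equalize_per_channel(bgr):
    # Equalize each color channel independently: prone to color shifts.
    return cv2.merge([cv2.equalizeHist(c) for c in cv2.split(bgr)])

def equalize_intensity_only(bgr):
    # Equalize only the V channel of HSV: chromaticity is left untouched.
    h, s, v = cv2.split(cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV))
    return cv2.cvtColor(cv2.merge([h, s, cv2.equalizeHist(v)]), cv2.COLOR_HSV2BGR)

if __name__ == "__main__":
    img = cv2.imread("input.jpg")  # hypothetical 8-bit color image
    cv2.imwrite("eq_per_channel.jpg", equalize_per_channel(img))
    cv2.imwrite("eq_intensity.jpg", equalize_intensity_only(img))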
1.3 Related Work
There has been a lot of work in the area of color constancy and color contrast enhance-
ment. We will discuss some of the relevant existing methods of color constancy in Sec-
tion 2.1.1 and that of contrast enhancement in Section 3.1.1. However, all these meth-
ods are used to improve the contrast of images for people with normal vision. Some
techniques have been proposed specifically to improve the visual quality of images for
subjects with low vision. The technique proposed by Peli et al. [91] called wideband
enhancement super-imposes high contrast outlines over images. This method detects
visually relevant features in an image (edges) and then marks the edge features with a
bright line on the bright side of the edge and a dark line on the dark side of the edge.
These features are then either added to the original image pixels or they replace the
original pixel values. However, this technique resulted in an improvement in the per-
ceived image quality for only 22% of the patients. Further, when viewed under normal
viewing conditions, these features introduce a ’halo’ effect that is not desirable.
Tang et al. [112] have also developed an enhancement method for JPEG images in
the discrete cosine transform (DCT) domain by introducing a scale factor for the quan-
tization table in the decoder. This method increases the contrast by applying a uniform
enhancement factor at all frequencies. Though experiments with low vision patients
have shown that this results in better perception of the enhanced images, the enhanced
images have a lot of artifacts that may not be desirable for people with normal vision.
Peli [92] used an adaptive enhancement algorithm to improve the visual quality
of images for low vision subjects. In this method, a tuned range of frequencies is
enhanced. In order to limit saturation, some low frequencies are reduced. This was
tested on low vision patients and was found to be beneficial. However, the enhanced
images appear to have some edge artifacts and may be undesirable for subjects with
normal vision.
We will discuss the relevant existing methods of sharpness enhancement in Sec-
tion 4.1.1.
1.4 Applications
Because of their relatively poor visual perception, low vision patients may get a better
understanding of their environment from such enhancements. Daily activities involve recogni-
tion tasks such as face recognition and object identification. Contrast
enhancement will help in improving the contrast of such objects and sharpness en-
hancement will help in better viewing of the details of objects. Consequently, these
enhancements can help low-vision patients with daily activities.
Achieving color constancy (estimating the color of illumination) helps in increas-
ing the robustness of color features and therefore, can find extensive applications in
computer vision tasks such as image and video retrieval, image classification, scene
and color object recognition and object tracking. Apart from giving users a better vi-
sual experience, contrast enhancement can also help in computer vision tasks such as
object identification or classification, especially under poor lighting conditions. The
hierarchical approach for sharpness enhancement can find various applications in HDR
tone mapping, image editing, image fusion etc. We discuss some of these applications
in Section 4.4.3 and Section 4.4.4.
1.5 Overview of proposal
Our proposal consists of two distinct enhancement mechanisms - one for enhancing
the contrast of the image and the other for enhancing the sharpness of the image; and a
framework for enhancing videos. A high-level overview of these methods is as follows -
• Contrast Enhancement
The flowchart of our contrast enhancement approach can be seen in Figure 1.6
(a simplified code sketch of this pipeline follows this overview).
Figure 1.6: High-level overview of our method
We assume any image to be a pixel-by-pixel product of the illumination (light
that falls on the scene) and the reflectance component of the scene. This can be
expressed as:
I(x) = L(x)R(x),   (1.1)
where L(x) is the illumination component and R(x) is the reflectance compo-
nent of the image I(x) with spatial coordinate x. In this paper, we deal with
color images. So, I(x), L(x) and R(x) have 3 components - one for each color
channel. For instance, for the illumination image, L(x), we denote the red color
channel by L_red(x), the green channel by L_green(x) and the blue color channel
by L_blue(x). Similarly, we can denote the color channels for other images. The
capital bold font denotes multiband. The presence of a bar above the capital bold
font denotes singleband. Similarly, the spatial coordinate x = (x, y) ∈ ℝ².
We first try to estimate the illumination component of the image, L(x) and
thereby separate illumination from reflectance. The reason we try to sepa-
rate L(x) and R(x) is that Land [74] showed experimentally that the Human
Visual System (HVS) can recognize the color in a scene irrespective of the il-
lumination. This property of the HVS is termed color constancy and the theory
is referred to as the Retinex theory. Our method also tries to mimic this human
ability by removing the effects of the color of illumination from the image after
estimating the illumination.
Trying to recover L(x) from an image and separate L(x) from R(x) is a mathe-
matically ill-posed problem, so recovering L(x) needs further assumptions. We
make the assumption that L(x) is smooth and contains the low frequency com-
ponent of the image. The reflectance component, R(x), mainly includes the high
frequency component of the image. This estimate of illumination is not the same
as defined in physics but is an approximation of the illumination and is a com-
bination of the illumination, the low frequency and the mid-tone components
of reflectance. On the other hand, the reflectance component, R(x), is a com-
bination of the high-frequency features and a small part of the illumination. In
order to estimate L(x), based on this assumption, we smooth the image using
denoising algorithms. Though this does not give us the true illumination of the
scene, it gives us a good estimate of the illumination color. However, as a result
of smoothing, strong halo effects can be seen along certain boundaries of the
image. Therefore, before smoothing the image to estimate L(x), we pre-process
the image and segment it. Smoothing can then be performed adaptively, and
reduced around boundaries.
Once the color cast has been removed from the image (this is equivalent to
achieving White Balance), we only process theL(x) component of the image.
The motivation behind modifying only the illumination of the scene is to pre-
serve the original properties of the scene - the reflectance of the scene. Also,
the dynamic range of the illumination can be very large and hence we compress
the dynamic range of the illumination. We modify the illumination of the image
depending on the distribution of the intensity pixels, using logarithmic functions
to estimate the enhanced illumination and then multiply it by the reflectance
component to produce the enhanced image. This also helps in improving the
local contrast of the image, which is another property of the human visual sys-
tem. Our results show that even in complex non-uniform lighting conditions, the
enhanced results look visually better.
Once we have obtained the enhanced images, we conduct subjective evaluation
experiments on visually impaired and normally sighted people. Since we do not
yet have access to low vision patients, we simulate the vision of AMD patients
with simulation glasses.
• Sharpness Enhancement
We present a robust sharpness enhancement technique using a single image,
based on the Non-Local means (NL-means) algorithm [13] within a hierarchical
framework. We use the Laplacian pyramid framework to decompose the image
into different levels - one smooth level and several fine levels. We modify the
Laplacian pyramid framework in the following 2 ways - 1) We do not sub-sample
the image and 2) While decomposing the image, instead of using the smooth im-
age from the previous level, we smooth the original image by a higher degree.
We found the NL-means filter to be well-suited for progressive smoothing of images,
resulting in better extraction of fine levels. The existing sharpness enhancement
techniques do not account for noise in the image, leading to degradation of im-
ages during enhancement. The robustness of our method lies in using an extra
level of decomposition by smoothing the image using a very low value of the
filtering parameter. We show that this extra level of decomposition does not result
in significant loss of image structure in the case of a good quality image. However,
if noise is present in the image, it removes noise from the image. Once the dif-
ferent levels of the image are obtained, we can enhance those levels individually
and combine them to get an enhanced image.
• Framework for Video Enhancement
Our method analyzes video streams and clusters frames that are similar to each
other. Our method does not have omniscient information about the entire video
sequence. It is an online process with a fixed delay. A sliding window mecha-
nism successfully detects scene changes “on-the-fly” in a video. A graph-based
technique called “modularity” performs automatic clustering of video frames
without a priori information about clusters. For every cluster in the video, we
extract a key frame using eigen analysis. This key frame
is the most representative of all frames in that cluster. We estimate enhance-
ment parameters for only the key frame, then use these parameters to enhance all
frames belonging to that cluster, thus making our method robust.
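The following simplified sketch illustrates the contrast enhancement pipeline outlined in the first item above. It is not the thesis implementation: a bilateral filter stands in for the segmentation-guided NL-means illumination estimate described in Chapter 3, and a fixed logarithmic mapping stands in for the automatically estimated mapping function; all parameter values and file names are illustrative assumptions.

# Simplified sketch (not the thesis implementation) of the Retinex-style
# pipeline: estimate a smooth illumination L, derive the reflectance R = I / L,
# enhance only L with a logarithmic mapping, then recombine.
import cv2
import numpy as np

def enhance_contrast(bgr, sigma_color=0.1, sigma_space=15, strength=100.0):
    img = bgr.astype(np.float32) / 255.0 + 1e-6     # avoid division by zero
    # Edge-aware smoothing as a stand-in illumination estimate L(x).
    L = cv2.bilateralFilter(img, -1, sigma_color, sigma_space)
    R = img / L                                     # reflectance estimate R(x)
    # Logarithmic mapping lifts dark regions and compresses the dynamic range
    # of the illumination only; the reflectance is left unchanged.
    L_enh = np.log1p(strength * L) / np.log1p(strength)
    return np.clip(L_enh * R * 255.0, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    out = enhance_contrast(cv2.imread("underexposed.jpg"))  # hypothetical input
    cv2.imwrite("enhanced.jpg", out)

Because only the illumination is modified, the reflectance and white balance of the scene are preserved, which mirrors the design choice described above.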
1.6 Outline
This proposal is organized as follows: Chapter 2 provides the details of our methods
to achieve color constancy (estimating the color of illumination) while trying to esti-
mate the illumination component of the image. In this chapter, we have proposed two
new methods to estimate the color of illumination - the first method is based on the
observation that images with color cast have standard deviation of one color channel
significantly different from at least one other color channel. The second method is
based on denoising techniques to smooth the image to estimate the illumination com-
ponent of image and then estimate the illuminant color from that component of the
image. We perform evaluation of our methods on two widely used datasets. Chapter 3
provides the details of our method to achieve color contrast enhancement. We describe
our method to automatically enhance the illumination component of the image result-
ing in image enhancement. We compare our enhancement results with that of existing
approaches and commercial packages. In this chapter, we also describe the details of
how effective our enhancements are for both people with normal vision and for
people with simulated AMD. Chapter 4 provides the details of our method to achieve
robust sharpness enhancement in the presence of noise using a hierarchical framework to
decompose the image into a smooth layer and several detail layers. We compare our
decomposition with existing approaches. We also describe the various applications of
that approach in different domains such as HDR tone mapping. Chapter 5 provides the
details of our framework to achieve robust video enhancement. We describe our online
process to detect scene changes and also show an application of how scene detection
benefits video temporal segmentation. Finally, Chapter 6 concludes the dissertation by
summarizing the contributions of our work and discussing the future directions of our
research.
The papers that serve as the basis for Chapters 2-5 are listed as follows:
• Chapter 2 (Color Constancy)
– Anustup Choudhury and Gérard Medioni, “Color constancy using denois-
ing methods and cepstral analysis” in ICIP 2009. [24]
– Anustup Choudhury and Gérard Medioni, “Color constancy using standard
deviation of color channels” in ICPR 2010. [26]
• Chapter 3 (Color Contrast Enhancement)
– Anustup Choudhury and Gérard Medioni, “Perceptually motivated auto-
matic color contrast enhancement” in Color and Reflectance in Computer
Vision Workshop, ICCV 2009. [25]
– Anustup Choudhury and Gérard Medioni, “Perceptually motivated automatic
color contrast enhancement based on novel color constancy estimation”
in EURASIP Journal on Image and Video Processing, Special Issue on
’Emerging Methods for Color Image and Video Quality Enhancement’
2010 [28]
– Anustup Choudhury and Gérard Medioni, “Color contrast enhancement for
visually impaired people” in Computer Vision Applications for Visually
Impaired Workshop, CVPR 2010. [27]
• Chapter 4 (Sharpness Enhancement)
– Anustup Choudhury and Gérard Medioni, “Perceptually Motivated Auto-
matic Sharpness Enhancement using Hierarchy of Non-Local Means” in
Color and Photometry in Computer Vision Workshop, ICCV 2011. [29]
– Anustup Choudhury and Gérard Medioni, “Hierarchy of Non-Local Means
for Preferred Automatic Sharpness Enhancement and Tone Mapping” sub-
mitted to IEEE Trans. on Visualization and Computer Graphics (TVCG)
[23]
• Chapter 5 (Video Enhancement)
– Anustup Choudhury and Gérard Medioni, “A framework for robust online
video contrast enhancement using modularity optimization” to appear in
IEEE Trans. on Circuits and Systems for Video Technology (TCSVT) [30]
Chapter 2
Illumination Estimation and Color Constancy
2.1 Introduction
Consider the cube shown in Figure 2.1, observed under two different illumination con-
ditions. If we observe the patches marked by green circles on the cube, we would
mark both patches as red although the reflected colors are quite different. Similarly, if we
observe the patches marked by red circles on the cube, we would mark that patch as
yellow in the left image and blue in the right image despite the reflected colors being
quite similar.
This human ability to estimate the color of a scene irrespective of the color of
illumination of that scene is described by a phenomenon called color constancy. This
ability is usually implemented in digital cameras as white balancing.
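For reference, the sketch below shows how an illuminant estimate is typically applied once it has been obtained: a diagonal, von Kries-style correction that divides each channel by the corresponding illuminant component. This is a generic illustration rather than code from the thesis, and the normalization of the estimate is an assumption.

# Generic diagonal (von Kries-style) white balancing sketch, assuming the
# illuminant estimate is given as an RGB triple.
import numpy as np

def white_balance(rgb, illuminant):
    """rgb: float array in [0, 1] of shape (H, W, 3); illuminant: length-3 estimate."""
    e = np.asarray(illuminant, dtype=np.float64)
    e = e / e.mean()                       # preserve overall brightness
    corrected = rgb / e.reshape(1, 1, 3)   # divide each channel by its illuminant
    return np.clip(corrected, 0.0, 1.0)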
2.1.1 Related Work
Most color constancy algorithms make the assumption that only one light source is
used to illuminate the scene. Given the image pixel values I(x), the goal of color
constancy is to estimate the color of the light source, assuming there is a single constant
source for the entire image.
Figure 2.1: Image from [98] showing the same cube under two different lighting con-
ditions (blue on the left and yellow on the right)
Many color constancy algorithms have been proposed to estimate the color of il-
lumination of a scene. These methods can be broadly classified into 4 categories -
1) Gamut-based/Learning methods, 2) Probabilistic methods, 3) Methods based on low-
level features and 4) Methods based on a combination of different methods. The litera-
ture on these methods is summarized in Figure 2.2.
Figure 2.2: Summary of previous work in color constancy
The gamut mapping method proposed by Forsyth [48] is based on the observation
that given an illuminant, the range of RGB values present in a scene is limited. Under
a known illuminant (typically, white), the set of all RGB colors is inscribed inside a
convex hull and is called the canonical gamut. This method tries to estimate the illu-
minant color by finding an appropriate mapping from an image gamut to the canonical
gamut. Since this method could result in infeasible solutions, Finlayson et al. [45] im-
prove the above algorithm by constraining the transformations so that the illuminant
estimation corresponds to a pre-defined set of illuminants. Finlayson et al. [46] use
the knowledge about appearance of colors under a certain illumination as a prior to
estimate the probability of an illuminant from a set of illuminations. The disadvantage
of this method is that the estimation of the illuminant depends on a good model of the
lights and surfaces, which is not easily available. Chakrabarti et al. [19] consider the
spatial dependencies between pixels to estimate the illuminant color. Another learning
based approach by Cardei et al. [17] use a neural network to learn the illumination of
a scene from a large number of training data. The disadvantage with neural networks
is that the choice of training dataset heavily influences the estimation of the illuminant
color. A nonparametric linear regression tool called kernel regression has also been
used to estimate the illuminant chromaticity [4].
Probabilistic methods include Bayesian approaches [12] that estimate the illumi-
nant color depending on the posterior distribution of the image data. These methods
first model the relation between the illuminants and surfaces in a scene. They cre-
ate a prior distribution depending on the probability of the existence of a particular
illuminant or surface in a scene, and then using Bayes’s rule, compute the posterior
distribution. Rosenberg et al. [103] combine the information about neighboring pixels
being correlated [46] within the Bayesian framework.
A disadvantage of the above mentioned algorithms is that they are quite complex
and all the methods require large datasets of images with known values of illumination
for training or as a prior. Also, the performance may be influenced by the choice of
training dataset.
Another set of methods uses low level features of the image. The White-Patch
Assumption [74] is a simple method that estimates the illuminant value by measuring
the maximum value in each of the color channels. The Grey-World algorithm [14]
assumes that the average pixel value of a scene is grey. The Grey-Edge algorithm [119]
measures the derivative of the image and assumes the average edge difference to be
achromatic. As shown in [119], all the above low level color constancy methods can
be expressed as:
µZ
¯
¯
¯
¯
@
n
I
¾
(x)
@x
n
¯
¯
¯
¯
p
dx
¶1
p
=kl
n;p;¾
; (2.1)
where n is the order of the derivative, p is the Minkowski norm and σ is the parameter for
smoothing the image I(x) with a Gaussian filter. The original formulation of Equation 2.1
for the White-Patch Assumption and the Grey-World algorithm can be found in [47]. As
expressed in Equation 2.1, the White-Patch Assumption can be expressed as l_{0,∞,0},
the Grey-World algorithm can be expressed as l_{0,1,0} and the n-th order Grey-Edge
algorithm can be expressed as l_{n,p,σ}. Weijer et al. [119] have shown results using
values of n = 1 and n = 2.
More recent techniques to achieve color constancy use higher level information
and also use combinations of different existing color constancy methods. [16] esti-
mates the illuminant by taking a weighted average of different methods. The weights
are pre-determined depending on the choice of dataset. Gijsenij and Gevers [54] use
Weibull parameterization to get the characteristics of the image and depending on those
values, divide the image space into clusters using the k-means algorithm and then use the
best color constancy algorithm corresponding to that cluster. The best algorithm for a
cluster is learnt from the training dataset. Weijer et al. [120] model an image as a com-
bination of various semantic classes such as sky, grass, road and buildings. Depending
on the likelihood of semantic content, the illuminant color is estimated corresponding
to the different classes. Similarly, information about images being indoor or outdoor
is also used to select a color constancy algorithm and consequently estimate the color
of illuminant [11]. More recently, 3D scene geometry is used to classify images and a
color constancy algorithm is chosen according to the classification results to estimate
the illuminant color [80].
2.1.2 Overview of our Methods
As stated earlier, color constancy is the human ability to perceive the color of a scene
irrespective of the illumination conditions. Therefore, to achieve color constancy, we
should estimate the color of illumination. In order to estimate the color of the illumi-
nation of an image, we have proposed 2 methods -
Our first method [26] is based on the observation that an image of a scene, taken
under colored illumination, has one color channel that has significantly different stan-
dard deviation from at least one other color channel. Figure 2.3(a) has a strong blue
color cast and the standard deviations of the RGB color channels are σ_R = 0.0184,
σ_G = 0.0267 and σ_B = 0.0941. We can see that the value of σ_B is 5 times more
than σ_R. If we remove the color cast from Figure 2.3(a) as shown in Figure 2.4(a), the
standard deviations of the RGB color channels are σ_R = 0.0591, σ_G = 0.0500 and
σ_B = 0.0582. We observe that the standard deviations of the color channels of an im-
age with no color cast are very similar to each other. We find the ratio of the maximum
and minimum standard deviation of the color channels of local patches of an image and
use that as a prior to estimate the color of illumination and achieve color constancy.
Figure 2.3: (a) Image from [8] with blue color cast. The intensity is tripled for better
clarity. (b) Zoomed-in RGB histogram of Figure 2.3(a) with original intensity. The
complete histogram is shown in the inset of Figure 2.3(b)
In the second method [24], we process the image to estimate the illumination com-
ponent, L(x). We make the assumption that L(x) is smooth. Based on this assump-
tion, we use denoising techniques to process the image and smooth it. Denoising
techniques are traditionally used in image processing to remove noise from the image.
This involves smoothing of the images and hence removal of the noise. Because of
our smoothness assumption of L(x), the smooth image that is obtained by denoising
is our estimate of the illumination image, L(x). Once we have estimated L(x), we
apply the White-Patch Assumption (maximum in each color channel), as described in
Section 2.3.5, on L(x) to estimate the color of the illumination.
In the most simple filtering example, the smoothing operation calculates the aver-
age of a region around every pixel of the image. As the size of the region increases
(limited by the size of the image), the estimate of the maximum value in each color
Figure 2.4: (a) Image from original Figure 2.3(a) without color cast. The intensity is
tripled for better clarity. (b) Zoomed-in RGB histogram of Figure 2.4(a) with original
intensity. The complete histogram is shown in the inset of Figure 2.4(b)
channel tends towards the average value in each color channel. Therefore, analogously
to the way the Minkowski norm works, as mentioned in Section 2.1.1, the smoothing operation unifies
the White Patch Assumption [74] and the Grey-World algorithm [14] in one approach.
Hence, similar to Equation 2.1, we also assume the true color of illumination to be
somewhere between the maximum and the mean of the color channels. These trends
in the estimate of the illumination color while filtering using different parameters are
explained in Section 2.4.2.
2.1.3 Outline
In Section 2.2, we describe our first method to achieve color constancy based on our
observation of the ratio of the standard deviation of color channels. In Section 2.3, we
describe our second method to achieve color constancy using denoising methods. In
Section 2.4, we perform the evaluation of our color constancy methods on widely used
large datasets of images. Finally, in Section 2.5, we summarize this chapter.
2.2 Method 1: Standard Deviation of Color Channels
As described in Section 2.1.2, for images with color cast, the standard deviation of
one color channel is significantly different from that of other color channels. This
can be characterized by the ratio between σ_max = max{σ_i, i ∈ {R,G,B}} and σ_min =
min{σ_i, i ∈ {R,G,B}}, where σ_i is the standard deviation of the color channel i. The
value of this ratio φ = σ_max/σ_min will be very high for images with color cast and low
for images without color cast. We find that, in most images under white illumination
(without color cast), local patches of an image have similar standard deviations in all 3
color channels, and it is not the case for images with color cast. This leads us to believe
that the change in standard deviation for those local patches is mainly contributed by
the colored illumination. Therefore, we use information from these patches to select
pixels to estimate the color of illumination.
Figure 2.5: Flowchart of our method
Our method, illustrated in Figure 2.5, consists of 2 key steps: 1) create a new image
I_φ, which is the φ value of a local window of the original image; 2) use the brightest pixels
from I_φ as a prior to select pixels from the original image as the illumination color. We create
a new image with the same resolution as the original image, where every pixel of the new
image is the φ value of a local window around the corresponding pixel in the original
image. This can be formulated as follows:

\[ I_\phi(x) = \frac{\max\left(\sigma_i(\hat{y}),\ i \in \{R,G,B\}\right)}{\min\left(\sigma_i(\hat{y}),\ i \in \{R,G,B\}\right)}, \quad \forall x,\ \hat{y} \in W(x), \tag{2.2} \]
where i denotes the color channel (R is the red channel, G is the green channel and B is
the blue channel), ŷ denotes the color values of the pixels in a window W centered at pixel x in the
original image I, and σ_i(ŷ) is the standard deviation of the values denoted by ŷ for
color channel i. A lucid explanation of Equation 2.2 is that for every pixel x in image
I, a window W is considered around that pixel. For all pixels inside W, the standard
deviation of each of the 3 color channels is calculated, and the ratio of the maximum to the
minimum standard deviation is used to create the image I_φ.
2.2.1 Statistical Verification
We use the controlled indoor environment [8] to verify how good I_φ is. This dataset
consists of a controlled indoor environment with 30 different scenes taken under 11
different illuminant conditions. All the images are illuminated by only one illuminant.
Some of these images were deemed unusable by the original authors and therefore, this
dataset has 321 images. All images have resolution of 637 X 468 pixels. Figure 2.3(a)
is an example from this dataset. The ground truth values of the illumination are already
provided. These values are normalized by the Euclidean norm of the illumination color
vector. The color cast is removed from the entire dataset and transformed to conditions
under canonical white light using White-Patch assumption. We compute the average
φ for every image for both the datasets (with and without color cast) and plot the
corresponding histogram in Figure 2.6(a). From Figure 2.6(a) we can see that the
average value of φ for images with color cast is much higher than the average value of
Figure 2.6: Statistics for images with and without color cast. Image (a) is the his-
togram of average φ. Image (b) is the histogram of φ over all images. Image (c) is the
corresponding cumulative distribution
φ for images without color cast. Figure 2.6(b) is the histogram of φ over all the images
for both images with and without color cast, and Figure 2.6(c) is the corresponding
cumulative histogram. We can see that about 80% of the images without color cast
have φ < 20. On the other hand, for images with color cast, less than 12% of images
have φ < 30 and around 90% of images have φ < 125. This gives us a very strong
statistical support to use the value of φ to distinguish between images with and without
color cast.
2.2.2 Illuminant Color Estimation
In the case of the White-Patch algorithm, the pixel with the highest intensity is assumed to
be the color of illumination. But in images from the real world environment, due to
noise or specular reflections, this assumption can be violated. In order to improve the
estimation of the illumination color, we use the reconstructed I_φ image, from which we
pick the top 1% brightest pixels. We use the pixels from the original input image I
having the same spatial coordinates as these selected pixels from the reconstructed I_φ,
and use the White Patch Assumption [74] on those pixels to estimate the illuminant
color. It should be noted that this pixel need not be the brightest pixel in the image.
Once we have estimated the color of illumination, the color-corrected images as
shown in Figure 2.16 can be obtained using Equation 2.10.
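The following Python/NumPy sketch summarizes this first method. The thesis code is in MATLAB; the window size, the top-1% fraction and the use of a running-moment filter for the local standard deviations are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def estimate_illuminant_sigma_ratio(img, win=11, top_frac=0.01, eps=1e-6):
    """img: float array (H, W, 3) in [0, 1]. Returns a unit-norm illuminant estimate."""
    # Local per-channel standard deviation over a win x win window via running moments.
    mean = uniform_filter(img, size=(win, win, 1))
    mean_sq = uniform_filter(img ** 2, size=(win, win, 1))
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    # Eq. 2.2: ratio of the max to the min channel standard deviation per pixel.
    i_phi = std.max(axis=2) / (std.min(axis=2) + eps)
    # The top 1% brightest pixels of I_phi act as a prior on where to look in the input.
    thresh = np.quantile(i_phi, 1.0 - top_frac)
    mask = i_phi >= thresh
    candidates = img[mask]                 # shape (K, 3)
    l = candidates.max(axis=0)             # White-Patch rule on the selected pixels
    return l / np.linalg.norm(l)
```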
2.3 Method 2: Denoising Techniques
We study 4 different existing denoising techniques to smooth the image: 1) Gaussian
filter, 2) Median filter, 3) Bilateral filter and 4) Non-Local Means filter. These filters have
different levels of complexity and vary in their smoothness mechanism. We consider
the input image I(x) to be defined on a bounded domain Ω ⊂ ℝ² and let x ∈ Ω. The
description of these filters is given below.
2.3.1 Gaussian Filter
Blurring the image removes noise. This filter can be expressed as:
\[ G(x) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{|x|^2}{2\sigma^2}}, \tag{2.3} \]
where σ is the smoothing parameter. This is a 2D radially symmetric Gaussian kernel,
where |x| = (x² + y²)^{1/2}.
Gijsenij and Gevers [55] have explored iterated local averaging, which can be con-
sidered similar to the Gaussian filter approach. Edge information is lost during
Gaussian filtering, and this introduces error while estimating the illuminant. We apply this
filter across all 3 color channels (Red, Green and Blue) of the image.
2.3.2 Median Filter
For every pixel in an image, the median filter [116] chooses the median color value
amongst the neighborhood pixels in a window W for that pixel - every pixel has the same
number of color values above and below it. Smoothing using median filter may result
in the loss of fine details of the image though boundaries may be preserved. We apply
this filter on all 3 color channels (Red, Green and Blue) of the image.
2.3.3 Bilateral Filter
In the case of the bilateral filter [115], every pixel of the image is replaced by a weighted sum
of its neighbors. The neighboring pixels are chosen from a window (W) around a
given pixel. The weights depend on two parameters - 1) Proximity of the neighboring
pixels to the current pixel (Closeness function) and 2) Similarity of the neighboring
pixels to the current pixel (Similarity function). The closer and the more similar pixels
are given higher weights. The two parameters can be combined to describe the bilateral
filter as follows:
\[ L(x) = k^{-1}(x) \int_{W(x)} c(y,x)\, s(I(y),I(x))\, I(y)\, dy, \tag{2.4} \]

where y is the neighboring pixel and the normalization term k is given by:

\[ k(x) = \int_{W(x)} c(y,x)\, s(I(y),I(x))\, dy. \tag{2.5} \]
The closeness function is:

\[ c(y,x) = e^{-\frac{1}{2}\left(\frac{d(y,x)}{\sigma_c}\right)^2}, \tag{2.6} \]

where d(y,x) = ||y − x|| is the Euclidean distance between a given pixel x and its
neighbor y. The similarity function is:

\[ s(I(y),I(x)) = e^{-\frac{1}{2}\left(\frac{\delta(I(y),I(x))}{\sigma_s}\right)^2}, \tag{2.7} \]
where δ(I(y),I(x)) is the pixel value difference between x and y. The closeness func-
tion from Equation 2.6, having standard deviation σ_c, and the similarity function from
Equation 2.7, having standard deviation σ_s, are Gaussian functions of the Euclidean dis-
tance between their parameters. In the original implementation of the bilateral filter,
Tomasi and Manduchi [115] have used the CIELAB color space. We present results by
applying this filter on all 3 channels of the CIELAB color space of the image. Applying
this filter on the RGB colorspace instead does not result in a significant difference.
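A brute-force, single-channel sketch of Equations 2.4-2.7 is given below for clarity; it is not the thesis implementation (which runs on the CIELAB channels), and the window and σ values are illustrative.

```python
import numpy as np

def bilateral_filter(chan, half_win=3, sigma_c=2.0, sigma_s=5.0):
    """Brute-force bilateral filter of one channel (Eqs. 2.4-2.7)."""
    H, W = chan.shape
    out = np.zeros_like(chan, dtype=float)
    # Closeness weights (Eq. 2.6) depend only on the spatial offsets, so precompute them.
    ys, xs = np.mgrid[-half_win:half_win + 1, -half_win:half_win + 1]
    closeness = np.exp(-0.5 * (ys ** 2 + xs ** 2) / sigma_c ** 2)
    padded = np.pad(chan, half_win, mode='reflect')
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 2 * half_win + 1, j:j + 2 * half_win + 1]
            # Similarity weights (Eq. 2.7) compare neighbor values to the center value.
            similarity = np.exp(-0.5 * ((patch - chan[i, j]) / sigma_s) ** 2)
            w = closeness * similarity
            out[i, j] = (w * patch).sum() / w.sum()   # Eqs. 2.4-2.5
    return out
```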
2.3.4 Non-Local means Filter
The hypothesis behind the non-local means (NL-means) technique [13] is that for any im-
age, the most similar pixels to a given pixel need not be close to it. They could lie
anywhere in the image. For comparing how similar the pixels are, instead of check-
ing the difference between the pixel values (which is used in bilateral filtering), the
neighborhood of the pixel is considered - that is, comparison of a window around the
pixel is done. This technique uses self-similarity in an image to reduce the noise. The
formulation of the NL-means filter is:
\[ NL(I(x)) = L(x) = \frac{1}{N(x)} \int_{\Omega} e^{-\frac{\left(G_\rho * |I(x+\cdot) - I(y+\cdot)|^2\right)(0)}{h^2}}\, I(y)\, dy, \tag{2.8} \]
where y ∈ Ω, I(x) is the observed intensity at pixel x, G_ρ is a Gaussian kernel with
standard deviation ρ, h is the filtering parameter that controls the amount of smooth-
ing, and N(x) is the normalizing factor. Equation 2.8 means that an image pixel value
I(x) is replaced by the weighted average of the other pixel values I(y) in the image. The
weights are significant only if a window around pixelx looks like the corresponding
window around pixel y. While comparing the windows, we consider the Euclidean
distance between the 2 windows. However, we weigh this distance by a Gaussian-like
kernel decaying from the center of the window to its boundaries. This is because closer
pixels are more related, and so pixels closer to the reference pixel are given more im-
portance. Ideally, we should search the entire image to find a similar neighborhood.
But for efficient computation, we consider a smaller local search area, S. The numer-
ator of the exponential accounts for the neighborhood of a pixel, which we denote by
W. Please refer to [13] for a more detailed explanation of the NL-means filter. For
discrete images, the integral over Ω can be replaced by a summation over all pixels of
the image. We apply this filter on all 3 color channels (Red, Green and Blue) of the
image.
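A compact, deliberately unoptimized sketch of the discrete NL-means update for one channel is shown below. The patch half-size, search half-size and h are illustrative, and the Gaussian patch weighting G_ρ of Equation 2.8 is replaced by a uniform patch weight for brevity.

```python
import numpy as np

def nl_means(chan, half_patch=2, half_search=3, h=0.2):
    """Discrete NL-means (Eq. 2.8) for a single channel, uniform patch weighting."""
    H, W = chan.shape
    pad = half_patch + half_search
    padded = np.pad(chan, pad, mode='reflect')
    out = np.zeros_like(chan, dtype=float)
    for i in range(H):
        for j in range(W):
            ci, cj = i + pad, j + pad
            ref = padded[ci - half_patch:ci + half_patch + 1,
                         cj - half_patch:cj + half_patch + 1]
            weights, values = [], []
            # Search window S around the pixel (instead of the whole image).
            for di in range(-half_search, half_search + 1):
                for dj in range(-half_search, half_search + 1):
                    ni, nj = ci + di, cj + dj
                    patch = padded[ni - half_patch:ni + half_patch + 1,
                                   nj - half_patch:nj + half_patch + 1]
                    d2 = np.mean((ref - patch) ** 2)        # squared patch distance
                    weights.append(np.exp(-d2 / h ** 2))
                    values.append(padded[ni, nj])
            weights = np.asarray(weights)
            out[i, j] = np.dot(weights, values) / weights.sum()  # 1/N(x) normalization
    return out
```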
2.3.5 Illuminant Color Estimation
We make the assumption that the lighting is smooth and diffuse, that there is a single light
source, and that all the materials have a Lambertian BRDF. We therefore obtain the illumi-
nation image L(x) by smoothing the original color image I(x). In order to estimate
the color of the illumination l, we use the White-Patch assumption [74]. This method
computes the maximum values of each of the 3 color channels - red, green and blue.
The proposed formulation for l is:
\[ l = \left[\, \max_x\big(L_{red}(x)\big),\ \max_x\big(L_{green}(x)\big),\ \max_x\big(L_{blue}(x)\big) \,\right], \tag{2.9} \]
where L_red(x), L_green(x) and L_blue(x) are the red, green and blue color channels of
the denoised image (illumination image) and the max operation is performed on the
separate color channels. The maximum value of these color channels need not lie at
the same pixel of the image. This estimation is based on the hypothesis that since a
white patch of the image reflects all the light that is incident on it, its position can be
found by searching for the maximum values of the red, green and blue channels. The
l vector thus estimated is normalized and is denoted by l̂, which has 3 components:
l̂(1), l̂(2) and l̂(3) for red, green and blue respectively.
Once the color of the illumination is estimated as described above, we remove the
effect of color cast from the image. This is equivalent to adding white light to the
scene. This can be represented as:
\[ L^{red}_{color\text{-}corrected}(x) = \frac{1}{\sqrt{3}}\, \frac{L_{red}(x)}{\hat{l}(1)}, \qquad
L^{green}_{color\text{-}corrected}(x) = \frac{1}{\sqrt{3}}\, \frac{L_{green}(x)}{\hat{l}(2)}, \qquad
L^{blue}_{color\text{-}corrected}(x) = \frac{1}{\sqrt{3}}\, \frac{L_{blue}(x)}{\hat{l}(3)}, \tag{2.10} \]
where the factor √3 is the normalization constant based on the diagonal model to
preserve the intensity of pixels. Combining the 3 color channels of Equation 2.10,
L^{red}_{color-corrected}(x), L^{green}_{color-corrected}(x) and L^{blue}_{color-corrected}(x),
gives us the color corrected image. This is how we obtain the images illustrated in Figure 2.16.
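Putting Equations 2.9 and 2.10 together, the sketch below estimates the illuminant from a smoothed (denoised) image and applies the diagonal correction; the smoother itself (Gaussian, median, bilateral or NL-means) is passed in as a function. The helper names are illustrative, and applying the same diagonal scaling to the input image is one way to produce the color-corrected output shown in Figure 2.16.

```python
import numpy as np

def white_patch_illuminant(L_img):
    """Eq. 2.9: per-channel maxima of the (smoothed) illumination image, normalized."""
    l = L_img.reshape(-1, 3).max(axis=0)
    return l / np.linalg.norm(l)

def color_correct(img, l_hat):
    """Eq. 2.10: diagonal correction with the 1/sqrt(3) intensity-preserving factor."""
    return img / (np.sqrt(3.0) * l_hat[None, None, :])

def denoise_and_correct(img, smoother):
    """img: (H, W, 3) float image; smoother: any of the filters of Section 2.3."""
    L_img = smoother(img)                     # estimate of the illumination image L(x)
    l_hat = white_patch_illuminant(L_img)
    return color_correct(img, l_hat), l_hat
```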
2.4 Results and Discussion
In order to evaluate our approach, we conduct experiments on two widely used datasets.
The ground-truth values of the illuminant of the scenes are provided for both datasets.
For quantitative evaluation of the approaches, the error difference between the esti-
mated illumination color l and the ground truth illumination color l_gt is computed. We
use the angular error as a measure, and it can be computed as:

\[ \text{angular error},\ \epsilon = \cos^{-1}\!\left(\hat{l} \cdot \hat{l}_{gt}\right), \tag{2.11} \]

where (ˆ·) stands for normalized values. In order to measure the overall performance
across each of the datasets, the median of the angular errors is considered as a suitable
measure [65].
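The angular error of Equation 2.11 reduces to a few lines; the clipping guards against floating-point values marginally outside [-1, 1].

```python
import numpy as np

def angular_error_deg(l_est, l_gt):
    """Eq. 2.11: angle (degrees) between estimated and ground-truth illuminants."""
    l_est = l_est / np.linalg.norm(l_est)
    l_gt = l_gt / np.linalg.norm(l_gt)
    return np.degrees(np.arccos(np.clip(np.dot(l_est, l_gt), -1.0, 1.0)))
```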
2.4.1 Controlled Indoor Environment Dataset
We have already described this dataset in Section 2.2.1. Some sample images from
this dataset can be seen in Figure 2.7.
Figure 2.7: Images from the controlled indoor environment[8]
The results of existing color constancy algorithms on this dataset are summarized
in Table 2.1. These results are available in [119], [45] and [65]. Some methods in
the literature [4] have used the root mean square (RMS) error between the estimated
illumination chromaticity, lc_e, and the actual illumination chromaticity, lc_gt, to evalu-
ate their results, although this is not the best metric [65]. Since we do not have access
to the individual error values, we compare the results as is. The RMS_rg error can be
calculated as:
\[ RMS_{rg} = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} \left( lc_i(j) - lc^{gt}_i(j) \right)^2 \right)^{1/2}, \tag{2.12} \]
where N is the total number of images, i is the index of images and M is the number
of color channels (M = 2 for chromaticity space). We calculate the error for the rg space.
The r and g chromaticities for the estimated illuminant can be computed as lc(1) =
l̂(1)/(l̂(1)+l̂(2)+l̂(3)) and lc(2) = l̂(2)/(l̂(1)+l̂(2)+l̂(3)). Similarly, we can compute
the chromaticity for the ground truth illuminant. The RMS_rg errors for White-Patch,
Neural Networks and Color by Correlation were presented in [7]. The performance of
our approach on this dataset is also presented in Table 2.1. The results are presented
in Figure 2.8.
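For completeness, Equation 2.12 over the rg chromaticity space can be computed as sketched below; est and gt are assumed to be N x 3 arrays of (possibly unnormalized) illuminant estimates and ground truths.

```python
import numpy as np

def rms_rg(est, gt):
    """Eq. 2.12 over rg chromaticities; est, gt are (N, 3) arrays of RGB illuminants."""
    rg_est = est[:, :2] / est.sum(axis=1, keepdims=True)
    rg_gt = gt[:, :2] / gt.sum(axis=1, keepdims=True)
    return np.sqrt(np.mean((rg_est - rg_gt) ** 2))
```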
Table 2.1: Error for the controlled indoor environment using various color constancy
methods. The parameters of our approach have been described in Section 2.3 and in
Section 2.2. The best result is reported in bold.
Method | Parameters | Median ε (°) | RMS_rg
White-Patch | - | 6.4 | 0.053
Grey-World | - | 6.9 | -
1st order Grey Edge | p=7 | 3.2 | -
2nd order Grey Edge | p=7 | 2.8 | -
Color by Correlation | - | 3.1 | 0.061
Gamut Mapping | - | 2.9 | -
Neural Networks | - | 7.7 | 0.071
GCIE Version 3, 11 lights | - | 1.3 | -
GCIE Version 3, 87 lights | - | 2.6 | -
Kernel Regression | - | - | 0.052
Support Vector Machines | - | - | 0.066
Standard Deviation of color channels | W=11 | 2.8 | 0.044
Gaussian Filter | σ=3, x=30 | 2.9 | 0.043
Median Filter | W=20 | 2.4 | 0.044
Bilateral Filter | σ_c=2, σ_s=5, W=5 | 2.8 | 0.044
NL-means Filter | h=0.2, S=3, W=2 | 2.5 | 0.043
From Table 2.1, we can see that our approach gives the best performance both
with respect to the angular error and the RMS_rg error. On comparing the error of
our approach with that of all the approaches in [7], we can see that our approach
has the least RMS_rg error. The GCIE algorithm using 11 lights performs very well
because this technique uses the 11 illuminants as prior knowledge. It constrains the
estimated value of the illuminant to lie in that set. However, the performance drops
if more illuminants are used for training. We also compared our results with a recent
denoising technique by Dabov et al. [35]. However, their technique produces a median
error of 5.67° on this dataset, which is worse than our results.
Figure 2.8: Example of an image from controlled indoor environment corrected after
estimating the illumination by different methods and their angular errors. The inten-
sity of the images is tripled for better clarity. From left to right and top to bottom:
Original Image, and the subsequent images are correction using Ground Truth values,
White Patch Assumption, Grey World Algorithm, Grey Edge Algorithm, Standard
Deviation of Color Channels, Gaussian Filter, Bilateral Filter, Median Filter and Non-
local Means Filter
2.4.2 Effects of Parameter Modification
In this section, we discuss the effects of parameter setting for the different methods of
our approach. We use the images from controlled indoor environment dataset and plot
the median angular error across the dataset against the parameters of the filter. The
axis for the median angular error is inverted for better visualization.
For our first method, the effects of W on the median angular error for the con-
trolled indoor environment dataset [8] can be seen in Figure 2.9 (Y-axis is inverted for
better visualization). Small values ofW have larger error because of insufficient in-
formation in the small local patches whereas asW increases, the median angular error
will eventually converge to the error of White-Patch algorithm whenW is the image
size.
Figure 2.9: Effect of window size W on median angular error for the controlled indoor
environment
For our second method, the error plot for Gaussian filter is shown in Figure 2.10.
Smoothing with σ = 0 will have no effect and therefore, for σ = 0, this method works
like the White-Patch assumption. However, as σ → ∞, the algorithm is equivalent to
Figure 2.10: Median angular error as a function of σ and kernel size for a Gaussian
filter
taking the average of the image, and is therefore equivalent to the Grey-
World algorithm. We obtain the best results for an intermediate value of σ. Also, from
the plot we can see that for lower values of σ, the median angular error remains fairly
consistent for higher values of the window size. This is because, in the case of a Gaussian dis-
tribution, 99.7% of pixel values lie within 3σ of the mean, and therefore higher values of the
window size do not affect the calculation of a pixel value. Also, for higher values of
σ, we can see that the error value increases with increasing window size. Irrespective
of the σ values, a window size of around 5-20 pixels typically gives us the best results.
The error plot for the Median filter is shown in Figure 2.11. As in the case
of the Gaussian filter, we obtain the best error values for a window size (W) of
around 5-20 pixels. For W = 0, this algorithm is once again the same as the White-Patch
algorithm. However, when W → ∞, the color estimated by this method
will be the median color value of all the pixels present in the image. Realistically, the
limiting factor for W will be the image size. For this dataset, the median angular error
when W is the image size is 12.99°.
Figure 2.11: Median angular error as a function of window size for a Median filter
Figure 2.12: Median angular error as a function of filtering parameter for a Non-Local
means filter
Figure 2.13: Median angular error as a function of the σ's for a Bilateral filter
Figure 2.14: Median angular error as a function of window size for a Bilateral filter
The error plot for Non-Local means filter is shown in Figure 2.12. In case of a
Non-Local means filter, the amount of smoothing of the image depends on the filtering
parameter, h. A higher value of h does more smoothing of the image. As we can see in
Figure 2.12, the error value converges for higher values of h. For h = 0, the algorithm
will not perform any smoothing and this method will be the same as the White-Patch
algorithm.
The error plots for Bilateral filter are shown in Figure 2.13 and Figure 2.14. The
left image of Figure 2.13 plots the median angular error for low values of the spatial-
domain σ, σ_c, and the intensity-domain σ, σ_s, whereas the right image plots the median
angular error for high values of σ_c and σ_s. As with the Non-Local means filter, we can
see that the error value converges for high smoothing (high values of σ_c and σ_s). For
σ's = 0, this algorithm will behave like the White Patch algorithm. The best
results are obtained from an intermediate value of smoothing parameter.
As shown in Figure 2.14, given the best values of σ_c and σ_s, we see the effects of
changing the window size on the median errors. For low window size, since there is not
enough information, the error values are high. However, since the individual closeness
and similarity functions are Gaussian distributions, having a very high window size
does not help much and the error value converges close to the best value.
2.4.3 Real World Environment Dataset
This dataset [33] consists of approximately 11000 images from 15 different scenes
taken in a real world environment. Some sample images from this dataset can be seen
in Figure 2.15. The images include both indoor and outdoor scenes. All images have
360 X 240 resolution as they are all extracted from a digital video. As a result, there
is a high correlation between the images from one scene in the database. Therefore,
from each scene we randomly select 10 images amounting to 150 images in total on
which we test our approach.
Figure 2.15: Images from the real world environment[33]
As can be seen in Figure 2.15, all the images from this dataset have a grey ball at
the bottom right corner of the image. This grey ball was mounted on the video camera
to estimate the ground truth value of the color of the illuminant and those values are
available with the dataset. However, while trying to estimate the illuminant color, we
exclude the bottom right quadrant as depicted by the white box in Figure 2.16.
The results of existing color constancy algorithms and our approaches on this
dataset can be found in Table 2.2. The parameters for the existing methods were cho-
sen such that they give the best results as mentioned by the respective authors. From
Table 2.2, we can see that our method has a 23% improvement over the state-of-the-art
Grey-Edge algorithm. We also compared our algorithm with a very recent technique
- the “Beyond Bags of Pixels” approach [19] - and found that our method gives us almost
26% improvement. On comparing our results with a recent denoising technique by
Dabov et al. [35], which produces a median error of 3.96°, we obtain an improvement of
almost 14%.
Figure 2.16: Example of images from real world environment corrected after esti-
mating the illumination by different methods and their angular errors. The first row
contains the Original Images and the subsequent rows are correction using (2) Ground
Truth values (3) White Patch Assumption (4) Grey World Algorithm (5) Grey Edge
Algorithm (6) Standard Deviation of Color Channels (7) Gaussian Filter (8) Bilateral
Filter (9) Median Filter (10) Non-local Means Filter
Table 2.2: Median angular error for the real world environment. The parameters of
our approach have been described in Section 2.3 and in Section 2.2. The best result is
reported in bold.
Method | Parameters | Median ε (°)
White-Patch | - | 4.85
Grey World | - | 7.36
1st order Grey Edge | p=6 | 4.41
Beyond Bags of Pixels | - | 4.58
Standard Deviation of color channels | W=11 | 3.73
Gaussian Filter | σ=5, x=15 | 3.63
Median Filter | W=9 | 3.42
Bilateral Filter | σ_c=5, σ_s=7, W=3 | 3.4
NL-means Filter | h=1.0, S=3, W=2 | 3.39
2.4.4 Discussion
From the experiments that we have conducted on the 2 widely used datasets and the
results presented in Table 2.1 and Table 2.2, we can see that our approach gives us
results that are better than state-of-the-art color constancy approaches. On the con-
trolled indoor environment dataset, [119] have reported a median error of 3.2° for the
1st order Grey-Edge algorithm with p = 7 and pre-processing with a Gaussian filter of
σ = 4. The median error for the same dataset for the 2nd order Grey-Edge algorithm
with p = 7 and pre-processing with a Gaussian filter of σ = 5 is 2.7°. Our experiments
have shown that using just a Gaussian filter with σ = 4 gives us a median error of 3.08°,
and with σ = 5 a median error of 3.12°. This makes us wonder if applying
the 1st order Grey-Edge algorithm hurts the illuminant color estimation. However, there
are benefits of using higher order Grey-Edge color constancy algorithms. It will be
are benefits of using higher order Grey-Edge color constancy algorithms. It will be
interesting to explore how Grey-Edge color constancy algorithms with order n > 2
affect the estimation of illuminant color. It will also be interesting to observe how
complex methods such as GCIE, neural networks and other learning methods perform
on images that are already pre-processed by our approach.
For the denoising algorithms, we observe that the window size required for the
controlled indoor image dataset is larger than the one required for the real world en-
vironment dataset. This could be because the controlled environment dataset does not
have too much variability - it consists of an object or two with/without a background.
However, for the real world dataset, there is a lot of variability and so more information
is available even in a small window.
Table 2.3: Ranks of algorithm according to Wilcoxon Signed Rank Test. The best
results are reported in bold.
Method | WSTs
White-Patch | 1
Grey World | 0
1st order Grey Edge | 2
Standard Deviation of color channels | 2
Gaussian Filter | 2
Median Filter | 2
Bilateral Filter | 3
NL-means Filter | 3
As shown in [13], amongst the filters described here, NL-means filter performs
the best denoising. It best preserves the structure of the image while denoising the
image. We observe a similar correspondence between the denoising capabilities of the
filter and its illumination color estimation. The best illumination estimation results are
obtained by using the best denoising filters.
In order to compare the performance of different color constancy algorithms, we
use the Wilcoxon signed-rank test [124]. For a given dataset (we choose the real
world environment dataset), let C_1 and C_2 denote the angular errors between the illu-
minant estimates of two different algorithms and the ground truth illuminant val-
ues. Let the medians of these 2 angular errors be m_{c1} and m_{c2}. The Wilcoxon
signed-rank test is used to test the null hypothesis H_0: m_{c1} = m_{c2}. In order to
test this hypothesis, for each of the N images the angular error difference is considered:
(e^1_{c1} − e^1_{c2}), (e^2_{c1} − e^2_{c2}), ..., (e^N_{c1} − e^N_{c2}). These error pairs are ranked according to
their absolute differences. If the hypothesis H_0 is correct, the sum of the ranks is 0.
However, if the sum of ranks is different from 0, we consider the alternate hypoth-
esis H_1: m_{c1} < m_{c2} to be true. We reject/accept the hypothesis if the probability
of observing the error differences is less than or equal to a given significance level
α. We compare every color constancy algorithm with every other color constancy al-
gorithm and generate a score that tells us the number of times a given algorithm has
been considered to be significantly better than the others. The results are presented in
Table 2.3.
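The pairwise test described above can be run with a standard implementation such as scipy.stats.wilcoxon applied to the per-image angular-error differences. The tallying of "wins" into a WST score is sketched below; the significance level α = 0.05 is an assumption, since the exact value used is not stated here.

```python
import numpy as np
from scipy.stats import wilcoxon

def wst_scores(errors, alpha=0.05):
    """errors: dict mapping algorithm name -> array of per-image angular errors.

    Returns, for each algorithm, the number of others it beats at level alpha.
    """
    names = list(errors)
    scores = {name: 0 for name in names}
    for a in names:
        for b in names:
            if a == b:
                continue
            # H1: the errors of algorithm `a` are significantly smaller than those of `b`.
            _, p = wilcoxon(errors[a], errors[b], alternative='less')
            if p <= alpha:
                scores[a] += 1
    return scores
```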
2.4.5 Computational Cost
We have implemented all the algorithms in MATLAB in Windows XP environment on
a PC with Xeon processor. The computational speed of the different methods can be
found in Table 2.4.
Table 2.4: Computational time for the different methods.
Method | Parameters | Image Size | Time (seconds)
Standard Deviation of color channels | W=11 | 637 X 468 | 9
Gaussian Filter | σ=5, x=15 | 360 X 240 | 0.05
Median Filter | W=9 | 360 X 240 | 0.39
Bilateral Filter | σ_c=5, σ_s=7, W=3 | 360 X 240 | 2.16
NL-means Filter | h=1.0, S=3, W=2 | 360 X 240 | 39.10
2.5 Summary
We have proposed two new methods to achieve color constancy. The first method to
achieve color constancy is based on the statistics of images with color cast. The second
method uses denoising techniques viz., Gaussian filter, Median filter, Bilateral filter
and NL-means filter to smooth the image. After initial processing using both methods,
we apply the White Patch algorithm on the processed image to estimate the color of
illumination. Experiments on two widely used datasets showed that our first method
is robust to the choice of dataset and gives results that are at least as good as existing state-
of-the-art color constancy methods. The second method gives an improvement over
existing state-of-the-art color constancy methods.
Our first method is not robust to noise and the illuminant color estimation may
be incorrect in the presence of noise as it may cause an abnormal change in the ra-
tio of standard deviations. While estimating illumination using our second method,
we observed that the best results were obtained using Non-Local Means filter. This
filter best preserves the structure of image whereas comparatively worse results were
obtained with the technique that least preserves the structure of image. Therefore, in-
telligent smoothing helps in better estimation of illuminant value. If we compare our
two methods, we observe that we still get the best performance using NL-means fil-
ter. Therefore, we use NL-means filter in the next chapter to estimate the illumination
component of the image.
Chapter 3
Perceptually Motivated Automatic Color Contrast
Enhancement of Images
3.1 Introduction
The human visual system (HVS) is a sophisticated mechanism capable of capturing a
scene with very precise representation of detail and color. In the HVS, while individ-
ual sensing elements can only distinguish limited quantized levels, the entire system
handles large dynamic range that is present in the real world scenes through various
biological actions. Expensive hardware devices like High Dynamic Range (HDR)
cameras and HDR displays can capture/display large dynamic range scenes, but this may
not always be feasible. Most current commodity capture and display devices do not
have large dynamic ranges and therefore cannot faithfully represent the entire dynamic
range of the scene. As a result, images taken from a camera or displayed on moni-
tors/display devices suffer from certain limitations. Thus, while trying to display the
large dynamic range of the real world onto the low dynamic range of the display de-
vices, the low intensity regions of the image appear underexposed and the high inten-
sity regions of the image appear overexposed and as a result cannot be seen properly.
In order to improve the visual quality of images, we need to enhance the contrast of
the images.
3.1.1 Related Work
The most common technique to enhance images is to equalize the global histogram
or to perform a global stretch [58]. Since this does not always gives us good results,
local approaches to histogram equalization have been proposed. One such technique
uses adaptive histogram equalization [58] that computes the intensity value for a pixel
based on the local histogram for a local window. Another technique to obtain image
enhancement is by using curvelet transformations [81]. A recent technique has been
proposed by Palma-Amestoy et al. [87] that uses a variational framework based on
human perception for achieving enhancement. This method also removes the color cast
from the images during image enhancement unlike most other methods and achieves
the same goal as our algorithm. Inspired by the Grey-World algorithm [14], Rizzi
et al. [102] introduced a technique called Automatic Color Equalization (ACE) that
uses an odd function of differences between pixel intensities to enhance an image.
This is a two step process - the first step computes the chromatic spatial adjustment
by considering the difference in the pixels and weighted by a distance function. The
second step maximizes the dynamic range of the image. A technique inspired from
the Retinex theory [74] is the Random Spray Retinex (RSR) by Provenzi et al. [96]
that uses local information within the Retinex theory framework by replacing paths
with a random 2-D pixel spray around a given pixel under consideration. The previous
two techniques were fused by Provenzi et al. [97] and called RACE that account for
the defects of those two algorithms (RSR has good saturation properties but cannot
recover detail in dark regions whereas ACE has good detail recovery in dark regions
but tends to wash out images). Another recent technique based on the Retinex theory is
the Kernel-based Retinex (KBR) [10] that is based on computing a kernel function that
represents the probability density of picking a pixely in the neighborhood of another
pixelx wherex is fixed andy could be any pixel in the image.
Other methods that can enhance images under difficult lighting conditions have
been inspired from the Retinex theory. One such popular technique was proposed by
Jobson et al. [67] called Multi Scale Retinex with Color Restoration (MSRCR) where
the color value of a pixel is computed by taking the ratio of the pixel to the weighted
average of the neighboring pixels. One disadvantage of this technique is that there
could be abnormal color shifts because three color channels are treated independently.
An inherent problem in most retinex implementation is the strong halo effect in regions
having strong contrast. The halo effects are shown in Figure 1.5.
The halo effects are reduced in another variation of the Retinex theory that is pro-
posed by Kimmel et al. [71] where the illumination component is removed from the
reflectance component of the image using a variational approach. While enhancing
the illumination channel, the technique uses only gamma-correction and the method
is not automated. The user has to manually modify the value of gamma depending
on the exposure conditions of the image - the value of gamma for enhancing under-
exposed images is very different from the value of gamma for enhancing overexposed
images. A disadvantage of the techniques that are purely based on the Retinex theory
is that these techniques cannot enhance over-exposed images because of the inherent
nature of the Retinex theory to always increase the pixel intensities as shown in [95].
However, this limitation is tackled in both [10] and [87].
A recent method to obtain automated image enhancement is the technique pro-
posed by Tao et al. [113] that uses an inverse sigmoid function. Due to lack of flexi-
bility of the inverse sigmoid curve, this technique does not always result in achieving
good dynamic range compression. There are other techniques that have been designed
in the field of computer graphics for high dynamic range images to obtain improved
color contrast enhancement. Pattanaik et al. [89] use a tone-mapping function that
tries to mimic the processing of the human visual system. It applies a local filtering
procedure and uses a multiscale mechanism. This technique is a local approach and
it may result in halo effects. Another technique proposed by Larson et al. [75] does
tone-mapping based on iterative histogram modification. Ashikhmin [5] has also pro-
posed a tone-mapping algorithm for enhancement that is based on a Gaussian pyramid
framework. Reinhard et al. [100] do tone-mapping by defining a term called the key
that indicates if a scene is subjectively light, normal or dark.
3.1.2 Overview of our Method
Our method for image enhancement is motivated by illuminance-reflectance model-
ing. It consists of 2 key steps:(1) Illumination estimation using Non-Local means
technique (2) Automatic enhancement of illumination. The flowchart of our enhance-
ment module is shown in Figure 3.1. For our illumination estimation, we use Non-local
means(NL-means) technique because as shown in Section 2.4.3, it performs amongst
the best while trying to estimate the color of illumination. Also as shown in [13], NL-
means filter does a very good job at preserving the original structure of image while
smoothing the image. Both these conditions are necessary for good enhancement of
images.
3.1.3 Outline
In Section 3.2 we describe our enhancement method in complete detail. In Section 3.3,
we show the results of our proposed enhancement method and compare our results with
Figure 3.1: Flowchart of the enhancement module
other enhancement techniques including some state-of-the-art approaches, and results
from commercial image processing packages. Section 3.4 describes the experiments
that we conducted on normal vision and simulated AMD patients and we discuss them.
Finally, in Section 3.5, we will summarize this chapter.
3.2 Our Method
3.2.1 Estimating illumination using Non-local Means technique
As shown in Section 2.3.4, we smooth the image using Non-local means filter to es-
timate the illumination component of the image, L(x). However, while smoothing an
image during illumination separation, a potential artifact that may occur is the halo
artifact. Existing enhancement techniques that use illumination separation suffer from
this drawback. This happens across the edge of regions having high contrast. In spite
of the edge preserving properties of the Non-local means filter, due to the high value
of the filtering parameter, h = 500, while smoothing the image, an 'overflowing' occurs
from the bright region to the dark region across the edge. Processing of such illumi-
nation and multiplying it back with the reflectance causes strong halo effects. In order
to remove the halo effect, we pre-segment the image using Mean-shift segmentation
algorithm proposed by Comaniciu and Meer [34]. Any segmentation algorithm can be
used. The boundaries of the segmented image are used as preliminary information for
the smoothing process. While trying to estimate the illumination, when we consider a
neighborhood for every pixel, we also consider the same spatial neighborhood of the
pre-segmented image and if an edge occurs in the pre-segmented image, less smooth-
ing is done (h = 0.01). This helps in preserving high contrast boundaries of the image
and thus removes the halo effect from the image.
Once the illumination component L(x) has been estimated, the reflectance compo-
nent of the image, R(x), for every pixel x can be calculated as the pixel-by-pixel ratio
of the image and the illumination component, and can be expressed as:

\[ R(x) = \frac{I(x)}{L(x)}, \tag{3.1} \]

where I(x) is the original image. Alternatively, in the logarithm domain, the difference
between I(x) and L(x) can be used to estimate R(x).
As shown in Section 2.3.5, once we estimate the color of illumination, we remove
the effect of color cast from L(x). We first estimate the color of illumination as shown
in Equation 2.9. Then, as shown in Equation 2.10, we remove the effect of color cast
from each of the 3 color channels (Red, Green and Blue) and then combine all 3 new
color channels to get the color-corrected illumination image, L_cc(x).
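The decomposition and color-cast removal of this subsection can be sketched as follows. For brevity, a Gaussian smoother stands in for the segmentation-guided NL-means filter used in the thesis, so this sketch will show the halo artifacts that the full method avoids; the function names and σ value are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decompose(img, sigma=15.0, eps=1e-6):
    """Split img into a smooth illumination estimate L(x) and reflectance R(x) = I/L.

    A Gaussian smoother replaces the segmentation-guided NL-means filter here.
    """
    L = np.stack([gaussian_filter(img[..., c], sigma) for c in range(3)], axis=-1)
    R = img / (L + eps)
    return L, R

def remove_cast(L):
    """Estimate the illuminant from L(x) (Eq. 2.9) and correct it (Eq. 2.10)."""
    l = L.reshape(-1, 3).max(axis=0)
    l_hat = l / np.linalg.norm(l)
    return L / (np.sqrt(3.0) * l_hat[None, None, :])   # color-corrected L_cc(x)
```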
3.2.2 Automatic enhancement of illumination
In order to preserve the color properties of the image, we modify only the brightness
component of the image. The illumination L_cc(x) obtained from Section 3.2.1 is con-
verted from RGB color space to HSV color space and only the luminance (V) channel,
L^cc_V(x), is modified. Alternatively, CIELAB color space could be used and modifica-
tion of the lightness channel can be done. The chrominance channels are not modified
to preserve the color properties of the image.
Our method can be used for images that are underexposed, overexposed or a com-
bination of both. Since we have to deal with different exposure conditions, our en-
hancement method deals with those cases separately and divides the intensity of the
illumination map into 2 different regions as shown in Figure 3.2.
Figure 3.2: The mapping function for enhancement
The division of the regions is determined by the positioning of the controlPt. The
value of controlPt is computed as follows:

\[ controlPt = \frac{\sum_{L^{cc}_V(x) \le 0.5} 1}{\sum_{L^{cc}_V(x) \le 1} 1}. \tag{3.2} \]
Intuitively, if an image is heavily underexposed, then a lot of pixels will have very low
intensity values. Using Equation 3.2 we can see that, for such images, the value of
controlPt → 1. Similarly, if an image is heavily overexposed, then a lot of pixels will
have very high intensity values, causing the value of controlPt → 0. Thus, we can see
that controlPt ∈ [0.5, 1] for images that are predominantly underexposed, whereas
controlPt ∈ [0, 0.5] for predominantly overexposed images.
To improve the visual quality of images, if the dynamic range of certain parts of
the image is increased, some other range must be compressed.
For underexposed regions,

\[ \begin{cases} \text{Choose blue curve} & \text{if } dark = \dfrac{\sum_{L^{cc}_V(x) \le 0.1} 1}{\sum_{L^{cc}_V(x) \le 1} 1} > T_1 \\[2ex] \text{Choose green curve} & \text{otherwise} \end{cases} \tag{3.3} \]
If a lot of pixels lie in the dark region, then we have to enhance the values of those
pixels by a larger value. As a result, the blue curve will be chosen. Otherwise, we will
choose the green curve to enhance the pixels that lie in the (0.1, controlPt] region.
For overexposed regions,

\[ \begin{cases} \text{Choose green curve} & \text{if } bright = \dfrac{\sum_{L^{cc}_V(x) \ge 0.9} 1}{\sum_{L^{cc}_V(x) \le 1} 1} > T_2 \\[2ex] \text{Choose blue curve} & \text{otherwise} \end{cases} \tag{3.4} \]
Similarly, if a lot of pixels lie in the bright region, then we have to reduce the values of
those pixels by a larger amount. Therefore, we will choose the green curve. Otherwise,
in order to enhance the pixels that lie in (controlPt, 0.9), we choose the blue curve.
The thresholds T_1 = 0.01 and T_2 = 0.01 are determined experimentally; these
values were chosen because they gave the best enhancement results over a wide range of images.
The curves are represented as logarithmic functions. The blue curve can be repre-
sented as

\[ L^{enh}_V(x) = v_1 + \frac{1}{K} \log\!\left( (L^{cc}_V(x) - v_1)\rho + 1 \right) (v_2 - v_1), \tag{3.5} \]

and the green curve can be represented as

\[ L^{enh}_V(x) = v_2 - \frac{1}{K} \log\!\left( (v_2 - L^{cc}_V(x))\rho + 1 \right) (v_2 - v_1), \tag{3.6} \]
where K = log(ρ(v_2 − v_1) + 1) is a constant and L^cc_V(x) ∈ [v_1, v_2]. For underex-
posed regions of the image, L^cc_V(x) ∈ [0, controlPt], and for overexposed regions,
L^cc_V(x) ∈ (controlPt, 1]. The formulation of the curves in Equations 3.5 and 3.6 is
inspired by the Weber-Fechner law, which states that the relationship between the phys-
ical magnitude of a stimulus and its perceived intensity is logarithmic.
This relationship was also explored by Stockham [109] and Drago et al. [38], who also
recommended a similar logarithmic relationship for tone mapping purposes. Our for-
mulation, inspired by the HVS, is different in flavor from the existing formulations. It
takes care of the different exposure conditions simultaneously and automatically es-
timates the parameters across a wide variety of images, thus enhancing those images
without additional manual intervention.
ρ represents the curvature of the curve, and the intensity of the illumination is enhanced
accordingly. For instance, in an underexposed region, due to a higher value of dark,
the intensity has to be increased more and so a higher value of ρ is used for the first
curve. Similarly, in an overexposed region, due to a higher value of bright, we decrease
the intensity of those regions by a larger amount and therefore a higher value of ρ is
used for the second curve. This effect is depicted in Figure 3.3.
Figure 3.3: The effects of changing the value of ρ
L^enh_V(x) is the enhanced luminance. This is combined with the original chromi-
nance to obtain the enhanced illumination, L_enh(x).
3.2.3 Combining enhanced illumination with reflectance
The enhanced illumination, L_enh(x), that was obtained in Section 3.2.2 is multiplied
with the reflectance component R(x) that was obtained in Section 3.2.1 to produce the
enhanced image I_enh(x) as follows:

\[ I_{enh}(x) = L_{enh}(x)\, R(x), \tag{3.7} \]
The entire process is automated and the enhancement occurs depending on the distri-
bution of the pixels in the image.
3.3 Results and Discussion
In this section, we consider the computational costs and discuss the results of applying
the enhancement algorithm described in Section 3.2 on a variety of images. We com-
pare our results with histogram equalization [58] and different Retinex techniques.
We also compare our results with commercial image processing software packages
- Picasa™, DXO Optics Pro® and PhotoFlair®. PhotoFlair® uses the multi-scale
Retinex with color restoration (MSRCR) algorithm proposed by Jobson et al. [67].
Although we have used many different methods for comparison, we could not use com-
mon benchmark image(s) to compare all the methods. This is because, for some meth-
ods such as [87], [96], [102] and [97], the source code is not available; for methods
such as [50], the available source code does not produce the desired results; and packages
such as PhotoFlair® need to be purchased. All these techniques use different images
for enhancement. So, we use results directly from the respective papers/websites to do
the comparison with our enhancements. However, as shown in Section 3.3.4, for a couple
of images we use multiple methods.
Finally, we perform statistical analysis and quantitative evaluation to demonstrate
the effectiveness of our enhancement algorithm.
3.3.1 Enhancement Results
In order to display the results of our enhancement, we consider two different images.
The original image in Figure 3.4 is obtained from the database provided by Barnard et
Figure 3.4: Enhancement Results. From left to right and top to bottom: Original
Image, Our enhancement, Mapping on all 3 color channels of illumination, Mapping
on intensity channel of illumination, Histogram equalization on all 3 color channels,
Histogram equalization on intensity channel, Kimmel’s retinex on all 3 color channels
and Kimmel’s retinex on intensity channel
Figure 3.5: Enhancement Results. From left to right and top to bottom: Original
Image, Our enhancement, Mapping on all 3 color channels of illumination, Mapping
on intensity channel of illumination, Histogram equalization on all 3 color channels
and Histogram equalization on intensity channel
al. [8], is taken under strong blue illumination while the original image in Figure 3.5
is obtained courtesy of P. Greenspun (http://philip.greenspun.com), and is underexposed.
The results of our algorithm are presented in Figure 3.4 and Figure 3.5. The con-
trast of the enhanced images is much better than the original images and the color cast
from the images has also been removed. For comparison, enhancement was done on
all 3 color channels of the original illumination L(x). It helped in removing the color
cast but also resulted in a loss of color from the original image. Enhancement was also
done on only the intensity channel of the original illumination L(x), and it did not help
in removing the color cast as shown in the top image. The results are presented in the
second row of Figure 3.4 and Figure 3.5. Therefore, our technique produces the best
of both worlds - it enhances the color of the scene as well as it helps in removing the
color cast from the image. Further, it also gives us a visually pleasant image.
Another advantage of our technique is that it does not degrade the quality of images
that already have a good contrast as shown in Figure 3.6.
Figure 3.6: Our enhancement maintains contrast of good contrast images. (a) is the
original image and (b) is the enhanced image
3.3.2 Comparison with other methods
We have also compared our results with existing methods such as histogram equaliza-
tion. Histogram equalization on all 3 color channels of the original image results in
color shift and loss of color. It also results in artifacts. The histogram equalization on
only the luminance channel does not result in color shift but it also does not improve
the visual quality of images with color cast. These results are presented in the third row
of Figure 3.4 and Figure 3.5.
We have compared with the variational approach in Retinex proposed by Kimmel
et al. [71] and have used our implementation of the proposed technique. This technique
uses gamma correction to modify the illumination. The method is not automated and
the value of gamma required for enhancing the images depends on the quality of the
original images (underexposed/overexposed) and it is cumbersome to manually adjust
it. The output from this technique is shown in the fourth row of Figure 3.4. When we
compare this to the output that we get as shown on the top right image in Figure 3.4,
we can see that our method produces better enhancement.
We have also compared our results with recent techniques proposed by Palma-
Amestoy et al. [87] and Provenzi et al. [102], [96], [97], and observe that our methods
give better contrast and the resulting images look visually better as shown in Figure
3.7.
We also compared with the Retinex implementation by McCann in [50] as shown
in Figure 3.8. A disadvantage is the possibility that a halo effect may still exist in the
scene as shown in Figure 3.8 in the title of the rightmost book. The halo effect does
not occur in our method because of the prior segmentation step and the use
of the NL-means filter, which preserves the edges in a scene.
Figure 3.7: Top row: The left image is the original image, the middle image is our
enhancement and the right image is obtained from [87]. Bottom row: Enhancements
by Rizzi et al. and Provenzi et al. The left image is enhanced by RSR [96], the middle
image is enhanced by ACE [102] and the right image is enhanced by RACE [97]. The
images are obtained from [97]
Figure 3.8: The top row is the original image. On the second row, the left image is
enhanced by McCann Retinex and the right image is enhanced by our algorithm. More
detail can be seen in the shadows and no halo effect exists in our implementation.
We also compared our results with the MSRCR algorithm and observed that our
results have a better contrast than the output of the PhotoFlair® software, as shown in
Figure 3.9. The original images and the output of MSRCR were obtained from the NASA
website (http://dragon.larc.nasa.gov/retinex/pao/news).
Finally, we compared our algorithm with the output given by Picasa™. We used
the Auto Contrast feature of Picasa™ and, as shown in Figure 3.10, we can see that we
obtain better results. Other parameters can be manually modified in Picasa™ to get
better results, but we are building an automated system with no manual tuning of pa-
rameters. Even in our approach, if necessary, we can also manually modify parameters
to obtain different looking results.
We also compared our algorithm with the output given by DXO Optics Pro® and
the results are shown in Figure 3.10 and Figure 3.11.
Figure 3.9: The first row contains the original images. The second row has the en-
hanced images by MSRCR. The third row has the images enhanced by our algorithm.
Using our algorithm, we can clearly see the red brick wall within the pillars on the left
image and some colored texture on the white part of the shoe can be clearly seen.
Figure 3.10: Top row: Left image is the original image and the right image is our
enhancement. Bottom row: Left image is the enhanced output using Auto Contrast
feature of Picasa™ and the right image is enhanced using DXO Optics Pro®
Figure 3.11: The left image is the original image, the middle image is enhanced
using our algorithm and the right image is enhanced using DXO Optics Pro®
3.3.3 Statistical Analysis
Figure 3.12: Statistical characteristics of the image before and after enhancement
A good way to compare an image enhancement algorithm is to measure the change
in image quality in terms of brightness and contrast [113]. For measuring the bright-
ness of the image, the mean brightness of the image is considered and for measuring
the contrast of the image, the local standard deviation of the image (the standard deviation of
image blocks) is considered. In Figure 3.12, the mean brightness and local standard
deviation of the image blocks for the original image of Figure 3.4 (the size of an image block
is 50 X 50 pixels) are plotted before and after enhancement. We can see that there is a
marked difference (increase) in both the brightness and the local contrast after image
enhancement.
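The brightness and contrast statistics used in Figure 3.12 (and in the visually-optimal-region analysis of Section 3.3.4) can be computed as below; the 50 x 50 block size follows the text, and the image is assumed to be a greyscale array.

```python
import numpy as np

def brightness_contrast_stats(gray, block=50):
    """Return (mean brightness, mean of per-block standard deviations)."""
    H, W = gray.shape
    stds = []
    for i in range(0, H - block + 1, block):
        for j in range(0, W - block + 1, block):
            stds.append(gray[i:i + block, j:j + block].std())
    return float(gray.mean()), float(np.mean(stds))
```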
Figure 3.13: Evaluation of image quality after image enhancement
3.3.4 Quantitative Evaluation
A quantitative method to evaluate enhancement algorithms depending on visual repre-
sentation was devised by Jobson et al. [68] where a region of 2D space of mean of the
image and local standard deviation of the image was defined as visually optimal, after
testing it on a large number of images. We have shown the effects of implementing
our algorithm in Figure 3.13. It can be seen that some of the enhanced images (over-
exposed and underexposed before enhancement) lie inside the visually optimal region
whereas other images though do not lie in the visually optimal region have a tendency
to move towards that region. This happens because those original images are either
too dark (very underexposed) or too bright (very overexposed).
We also visualize the effects of other enhancement algorithms in Figure 3.13. We observe that more images enhanced by our method lie in or close to the visually optimal region than the images enhanced by MSRCR. For images that are heavily overexposed, our enhancements lie closer to the visually optimal region than the ones enhanced by Picasa. Some images, for instance the image with μ_μ ≈ 42 and μ_σ ≈ 5, are enhanced by multiple methods, and our enhanced images lie closer to the visually optimal region than those of the other enhancement methods. On average, our approach results in 'better' enhancement than existing techniques. However, 'better' enhancement is very subjective, so we conduct human validation of our approach, as described in Section 3.4.
Some interesting observations are as follows. Histogram equalization (whether on the 3 color channels or on the intensity channel) places the enhanced image inside the visually optimal region, whereas our enhanced image lies very close to that region. This is because, as can be seen in the histogram equalization results in Figure 3.4, a lot of artifacts are generated. These artifacts increase the local standard deviation, thus increasing the value of μ_σ and causing the histogram equalized image to lie in the visually optimal region. Also, for an image that already lies in the visually optimal region, our enhancement does not result in a significant change. In short, our enhancement does not spoil the visual quality of good quality images. This has also been validated by experiments on human observers, as shown in Section 3.4.
3.3.5 Computational Cost
We have implemented the algorithm in MATLAB in Windows XP environment on a
PC with Xeon processor. For an image with size 360 X 240 pixels, segmentation takes
6 sec., illumination estimation takes 39 sec. and the enhancement of illumination takes
< 1 sec. The speed can be improved by implementing the code in C++ and using GPU.
The computational costs after those speed-ups are mentioned in Section 5.3.2.5. Also,
faster and more optimized versions of NL means filter can be used to increase the
computational efficiency [86].
3.4 Human Validation
The most obvious way to measure the effectiveness of contrast enhancement is to ask a subject (either visually impaired or normally sighted) to indicate their preference in a side-by-side comparison of the original and the enhanced images. The preferences can then be quantified to determine the effectiveness of the method. However, a large number of comparisons must be made and these must be repeated over multiple subjects. We conduct experiments to that effect, present the results from those experiments and discuss them [27].
3.4.1 Experimental Setup
We created a database of 40 images consisting of a variety of images under different lighting conditions (colored/white illumination) and different exposure conditions (overexposure/underexposure/good contrast images), taken either indoors or outdoors. The reason we included good contrast images is that normal visual activity, such as watching movies, includes the perception of good contrast images and, therefore, we want to study perception under those conditions.
The experiments were conducted independently on the subjects. Each subject was seated at a distance of roughly 20" from the computer monitor. The subjects were given the freedom to move closer to the screen or farther away from the screen according to their convenience. The screen is around 24" (diagonal) and the subjects were seated perpendicular to the center of the screen.
3.4.2 Subjects
All 12 subjects had "normal" vision (the subjects were asked to wear their prescribed corrective lenses during the entire course of experimentation). All subjects were able to understand the instructions given to them in English. The subjects viewed the images on the screen with both eyes.
3.4.3 Simulation Glasses
In order to simulate age-related macular degeneration (AMD), we purchased simulation glasses from the "Low Vision Simulators" website³. These glasses have a blind spot that is white and opaque, whereas the visual periphery is fogged (blurred). For our experiments, we have used simulation glasses that simulate a visual acuity of 20/400 (6/120) and have a large central scotoma. The glasses are shown in Figure 3.14. While conducting the experiments, the subjects were asked to wear the glasses on top of their corrective lenses.
³ http://www.lowvisionsimulators.com
Figure 3.14: Simulation glasses. Image from “Low Vision simulators” website
3.4.4 Procedure
All the images were processed off-line to produce the enhanced images. For every image, the experimental procedure has 2 steps -
1. The original and the enhanced images were shown simultaneously on the screen. The placement of the original and the enhanced versions was random (either the left or the right side of the screen) to remove any bias that may exist when selecting an image from only one side of the screen. The subject was then asked to rate the image on the "right" side as "Better", "Same" or "Worse" relative to the image on the "left" side of the screen, and their responses were recorded by the experimenter. Scores were assigned as 1 for Better, 0 for Same and -1 for Worse. These responses were used to evaluate the "preference" for the enhanced image.
2. The original image from Step 1 was then presented on the screen and the subject was asked to rate the "quality" of the image as "Poor", "Average" or "Good", which was duly recorded by the experimenter as 1, 2 or 3 respectively.
The subjects were unaware that the image presented to them in Step 2 was the original image from Step 1, to remove any bias that may occur due to knowledge of the original image. Rating "preference" prior to "quality" is important to remove the bias of prior information regarding the original image.
This was repeated for all 40 images in the database. Finally, the image that had the most perceptible difference after enhancement was presented to the subject again, but with the order flipped from the previous display, and the response was noted to check whether the subjects were consistent with their responses.
This entire procedure was repeated twice - first while wearing the simulation glasses and then without them. Since all the subjects have normal vision (with/without corrective lenses), conducting the experiment first while wearing the simulation glasses was important. This is because these glasses significantly degrade the visual quality and, therefore, the subjects have no prior information regarding their preference for a particular image, which would not necessarily be the case if the experiments were conducted first without the glasses.
3.4.5 Data Analysis
In our experiments, we found that the subjects were not biased towards selecting any one side of the screen. The responses of all the subjects were also consistent when the "preference" rating for the last comparison was compared with their earlier ratings.
Figure 3.15 shows the original image "quality" ratings for all the subjects with normal vision. Figure 3.16 shows the original image "quality" ratings for all the subjects with simulated AMD. A better idea of how the visual perception of people with normal vision differs from that of people with simulated AMD is given by the histogram in Figure 3.17. We can see that around 72.5% of the images were rated above Average (quality ≥ 2) by people with normal vision. On the other hand, 50% of the images were rated below Average (quality ≤ 2) by people with simulated AMD, and most of the images were considered to be very close to Average. This gives us a relative idea of how poorly patients with AMD perceive the environment.
Figure 3.15: Image quality ratings for subjects with ’normal’ vision
Similarly, Figure 3.18 shows the "preference" ratings for the enhancement method for all the subjects with normal vision. Figure 3.19 shows the "preference" ratings for the enhancement method for all the subjects with simulated AMD. The histogram in Figure 3.20 gives a better visualization of how the "preference" for enhancement varies between subjects with normal vision and subjects with simulated AMD. We
Figure 3.16: Image quality ratings for subjects with simulated AMD
Figure 3.17: Histogram of mean image quality ratings for all subjects with simulated
AMD and normal vision
can see that both normally sighted subjects and simulated AMD subjects prefer our
enhanced images and the preference is larger for subjects with simulated AMD than
for subjects with normal vision.
Figure 3.18: Preference for enhancement ratings for subjects with ’normal’ vision
We use the Wilcoxon signed-rank test [124] to check for statistical significance. In both cases, for subjects with normal vision and for subjects with simulated AMD, we infer that the preference for the enhanced images is statistically significantly different from that for the original images (the null hypothesis of zero median of the difference is rejected at the 5% level), and by comparing the ranks we conclude that the enhanced images have a higher rank than the original images. This implies that subjects (both visually impaired and normally sighted) have a significant preference for the enhancement method.
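A minimal sketch of this significance test, assuming the per-image preference scores (1, 0, -1) for one group of subjects are collected in an array; scipy.stats.wilcoxon tests the zero-median null hypothesis (the scores below are hypothetical, not the recorded data):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical preference scores: 1 = enhanced preferred, 0 = same, -1 = original preferred.
preference = np.array([1, 1, 0, 1, -1, 1, 1, 0, 1, 1, 1, -1, 1, 1, 0, 1])

# One-sample Wilcoxon signed-rank test against a zero median
# (zeros are discarded by the default zero_method).
stat, p_value = wilcoxon(preference)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05 and preference.sum() > 0:
    print("Preference for the enhanced images is significant at the 5% level.")
```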
Figure 3.19: Preference for enhancement ratings for subjects with simulated AMD
Figure 3.20: Histogram of mean preference ratings for all subjects with simulated
AMD and normal vision
The evaluation of the preference for the enhancement method can be analyzed using a variation of the ROC approach described by Peli et al. [91] to give an impression of perceived image quality with respect to the original images. Unlike in regular ROC analysis, where ground-truth information is provided, no such information is present in this case. The raw data consist of the preferences of the subjects regarding the perceived quality of the enhanced images with respect to the original images. In this case, the axes of the ROC plots have a different interpretation. The Y-axis, which originally corresponds to the true positive rate, can be considered equivalent to the proportion of enhanced images with higher perceived image quality. The X-axis, which originally corresponds to the false positive rate, can be considered equivalent to the proportion of original images with higher perceived image quality.
In order to generate the ROC plots, the program "jrocfit" [2] was used. The ROC plot for subjects with simulated AMD is shown in Figure 3.21, whereas the ROC plot for subjects with normal vision is shown in Figure 3.22. With a larger and more continuous rating scale, more operating points could be generated, resulting in a smoother ROC curve.
We consider the area under the ROC curve [61] (A_z) as a measure of the perceived quality of the enhancements. For the original images, the value of A_z = 0.5. If A_z > 0.5, then the perceived quality of the enhanced images can be considered to be better than that of the original images. On the other hand, if the perceived quality of the original images is better than that of the enhanced images, then A_z < 0.5. The "jrocfit" program computes the empirical value of A_z, and those values for the different subjects are shown in Figure 3.23. For all the subjects, on average, both with normal vision (A_z = 0.67 ± 0.058) and with simulated AMD (A_z = 0.721 ± 0.0533), the enhanced images are deemed to have better quality than the original images. Using the
Figure 3.21: ROC plot for subjects with simulated AMD. The data being degenerate
does not produce operating points off the axes
Figure 3.22: ROC plot for all subjects with normal vision. The data being degenerate
does not produce operating points off the axes
Wilcoxon signed-rank test [124], for both instances (simulated AMD and normal vision), our color contrast enhanced images were reported as having significantly better quality than the original images.
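The idea behind A_z can be illustrated with a simple empirical estimate computed directly from the paired ratings: the proportion of comparisons in which the enhanced image is judged better, with ties credited half. This is not the binormal fit performed by "jrocfit", and the counts below are hypothetical:

```python
# Hypothetical rating counts for one subject over 40 image pairs.
counts = {"Better": 24, "Same": 9, "Worse": 7}
n = sum(counts.values())

# Empirical area-under-the-curve estimate for paired preference data:
# A_z = 0.5 means no preference, A_z > 0.5 favors the enhanced images.
A_z = (counts["Better"] + 0.5 * counts["Same"]) / n
print(f"Empirical A_z estimate: {A_z:.3f}")
```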
Figure 3.23: A_z's for subjects with normal vision and simulated AMD
3.4.6 Discussion
As can be seen in Figure 3.20, subjects with simulated AMD show a preference for the enhanced images. We have also shown that the preference for the enhanced images is statistically significant. However, for certain images, as can be seen in Figure 3.19, the subjects prefer the original images (preference < 0). For those images, we found that certain regions of the image were overexposed. The enhancement method reduces the intensity of those regions of the image. As the simulation glasses degrade the visual quality, the subjects prefer the original images because the brighter regions of the image are better visible. The subjects with simulated AMD showed no preference (preference = 0) for 5 out of the 40 enhanced images when compared with the original images. When we checked the quality of those original images (Figure 3.16), we found that 4 of those images were rated above Average (quality > 2), implying a similar rating for the enhanced images. For all the images that were present in the lower quarter of the "quality" ratings (quality ∈ [1, 1.5]), the subjects consistently showed a preference for the enhanced images (preference > 0).
For people with normal vision, as can be seen in Figure 3.20, there is, on average, a preference for the enhanced images (preference > 0). However, as can be seen in Figure 3.18, for 5 images the subjects prefer the original images (preference < 0), though the preference is not significantly lower. If we check the image quality of those original images (Figure 3.15), we can see that 3 of those images were rated high (quality > 2.5). Also, for all 4 cases where the enhanced images were considered to be the same as the original images (preference = 0), the quality of the original images was rated high (quality > 2.5). For all the cases where the original images were rated Poor (quality = 1), the subjects consistently showed a preference for the enhanced images (preference > 0).
An interesting case is that of image label 29. The original image has a very high rating (quality > 2.5) and the subjects with normal vision almost always (10/12 subjects) prefer the original image (preference < 0). This is because the original image is that of a flower, as shown in Figure 3.24(a), and after increasing the contrast, the enhanced flower no longer looks 'natural', as shown in Figure 3.24(b). This causes the subjects to prefer the original image. However, when viewed with the simulation AMD glasses, the subjects preferred the enhanced image (preference > 0).
The parameters of the approach can be modified to account for such instances, thus improving the results of the method. We believe that modifying those parameters will not result in a significant change in the overall preference for the images.
Figure 3.24: (a) is the original image and (b) is the enhanced image
We also notice that, in general, the subjects with simulated AMD show more preference for the enhanced images than the subjects with normal vision. This is because subjects with normal vision have a better perception of the original quality of the images. A lot of images (22/40) are good contrast images (quality > 2.5), thus reducing the scope for preference for enhancement of those images amongst people with normal vision. However, when viewed through the simulation AMD glasses, due to degradation of the original image quality, the scope for enhancement increases, thus increasing the preference for the enhanced images.
The enhanced images improve the visual quality for subjects with normal vision and for subjects with simulated AMD. However, trade-offs can be made to obtain better enhancements for a particular category of subjects at the expense of the other. For the experiments conducted here, no training database has been used to set the parameters of the algorithm. There is a good chance of getting better results for either category of subjects by tuning the parameters of the enhancement method based on the results obtained from this approach.
3.5 Summary
We have proposed an automatic enhancement technique for color images that is motivated by human perception and works well under non-uniform lighting conditions. It provides a unifying framework in which any image can be enhanced automatically, irrespective of the inherent color cast in the image or the exposure conditions - both underexposure and overexposure. It thus helps in achieving both color constancy and local contrast enhancement, at which the human visual system is proficient. Our method estimates the illumination present in the scene and then modifies the illumination to achieve good enhancement based on an analysis of the distribution of image pixels, although users can be given control over certain parameters to tune the values according to their needs and obtain customized enhancements. Our method does not suffer from halo effects during enhancement, nor does it suffer from any color shifts or color artifacts. Experimentally, we have compared our enhancement results with results from other existing enhancement techniques and commercial image processing packages and shown that our results look 'visually' better than the ones produced by existing techniques. Statistically and quantitatively, we have shown that our technique indeed results in enhanced images.
Our technique is beneficial for normally sighted people as well as for people with
visual impairment such as AMD due to CFL. All the subjects (both normally sighted
and visually impaired) showed a preference for the enhanced images and we showed
that the improvement in perceived image quality was significant. Though the tests were
performed on static images, the framework can be easily extended to video sequences.
We are aware that the simulation glasses do not exactly replicate the real disorder as
the eyes can shift to avoid the artificial scotoma. Since we are trying to simulate low
vision, we believe that using these glasses is a good initial step to test our results.
Since we use information from the segmentation algorithm to adaptively modify smoothing in order to remove halo effects, the performance of our enhancement module is bounded by the performance of the segmentation algorithm. The segmentation method should be robust to high contrast edges and consistent across a sequence of similar frames. We have extended the enhancement module to videos, which we describe in Chapter 5.
Chapter 4
Robust Sharpness Enhancement Using Hierarchy of
Non-Local Means
4.1 Introduction
Sharpness enhancement results in the enhancement of details in an image or a video. Sharpness enhancement is a post-processing module that has applications in the consumer electronics video chain. The goal is to improve the quality of an image by enhancing its sharpness. One of the reasons why we need sharpness enhancement is as follows: during contrast enhancement, we increase the intensity of dark pixels and reduce the intensity of bright pixels. This causes a reduction of contrast in the mid-grey scale range of the image. In order to improve the visibility of those regions, we accentuate the details of images and perform sharpness enhancement. Some of the existing techniques to achieve sharpness enhancement use a hierarchical framework and decompose the image into a smooth image and several high frequency components (detail layers). To decompose the image, different types of filters are used. However, due to the inability of these filters to preserve edges, these techniques suffer from halo effects or produce a loss of structure in the image. Other techniques use multiple images taken under different lighting conditions, which may not always be practical. Examples of our sharpness enhancement can be seen in Figure 4.1.
4.1.1 Related Work
Sharpness enhancement has been studied for a long time and can be achieved by many methods. Peaking [63] does not change the bandwidth of the signal. It increases the amplitude of the high frequency component of the image by adding the 2nd derivative at the edge. Luminance/Chrominance transient improvement (LTI/CTI) improves the perceived sharpness [79] but it may result in a Moire effect on video sequences [62].
Other methods use a hierarchical framework to decompose an image into a smooth low-frequency image and various other high-frequency levels. The different levels can be enhanced and then combined to obtain an enhanced image. Care must be taken while smoothing an image in order to prevent the halo effects that might occur after the smooth level and the fine levels are enhanced and combined. In order to decompose the image, various edge-preserving filters can be used. Greenspan et al. [59] use a Gaussian filter. Recently, Farbman et al. [40] have proposed a weighted-least squares (WLS) approach to perform edge-preserving smoothing. The bilateral filter [115] and its variations can also be used for edge-preserving smoothing. The bilateral filter uses 2 different kernels - one to determine the proximity of neighbors to the current pixel, σ_c, and the other to determine the similarity between the neighbors and the current pixel, σ_s.
Farbman et al. [40] have shown that the WLS approach results in edge-preserving smoothing by demonstrating the absence of blur across edges and smoothing in flat regions of the smooth image. However, in order to show how well edges are preserved by smoothing techniques, it is important to show what the image loses as a result of smoothing. In the denoising domain, Buades et al. [13] call this difference the method noise, which is the
(a) Original Image
(b) Enhanced Image
Figure 4.1: Sharpness Enhancement. Image (a) is from [40]. Note the enhancement in
Image (b).
(a) Over-Enhancement
Figure 4.2: Exaggerated Sharpness Enhancement of Figure 4.1(a)
difference between the images before and after smoothing. The "best" method should produce method noise that is uncorrelated with the input image.
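A minimal sketch of this diagnostic, assuming a grayscale float image in [0, 1]; scikit-image's Gaussian and NL-means filters stand in for the filters being compared, and the parameter values are illustrative:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.restoration import denoise_nl_means

def method_noise(image, smooth):
    """Method noise: what the smoothing operation removed from the image."""
    return image - smooth

def compare_filters(img):
    """A well edge-preserving filter leaves little image structure in its method noise."""
    smooth_gauss = gaussian(img, sigma=2.0)
    smooth_nlm = denoise_nl_means(img, h=0.05, patch_size=7, patch_distance=9)
    return method_noise(img, smooth_gauss), method_noise(img, smooth_nlm)
```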
Figure 4.3 demonstrates the results of some smoothing techniques, where the method noise shows us what is lost as a result of smoothing. In the case of the Gaussian filter, we can see that although the noise is removed, the edges are blurred and can still be seen in the method noise, as shown in Figure 4.3(b). The bilateral filter with a low σ_c results in edge preservation (the method noise has less structure) but the smooth image still contains noise, as seen in Figure 4.3(c). Increasing σ_c does not help much. In order to increase smoothing, we have to increase the value of σ_s, but this results in a loss of structure, as seen in Figure 4.3(d). Therefore, using a Gaussian filter or using bilateral filters with higher values of σ_c results in halo effects. In the case of WLS, the noise is removed and it "seems" that the edges are preserved, as seen in Figure 4.3(e). There is no blurring and therefore no halo effects will be produced. However, if we visualize the
Figure 4.3: Effects of smoothing on noisy images. Column (a) is the noisy image. For columns (b) to (f), the top row is the smooth image whereas the bottom row is the method noise. Column (b) uses a Gaussian filter with σ_c = 5. Column (c) uses a bilateral filter with σ_c = 5 and σ_s = 0.05. Column (d) uses a bilateral filter with σ_c = 5 and σ_s = 0.5. Column (e) uses the WLS filter with α = 0.25 and λ = 1.2. Column (f) uses the NL-means filter with h = 0.03. The method noise is normalized for better visibility
method noise obtained after WLS smoothing, we can see that some edge information from the input image is still visible. As shown in Figure 4.3(f), the NL-means filter performs the best edge preservation. There is no structure present in the method noise and the smooth image has no blur across edges. The result for the NL-means filter uses a low value of the filtering parameter, h. Increasing h may create some blurring, and we deal with that in Section 4.2.2.
Tumblin and Turk [117] have used a variant of anisotropic diffusion [93] that works well for preserving edges in an image. However, this technique tends to over-sharpen edges and may result in artifacts. Durand and Dorsey [39] have used a variation of the bilateral filter, but the halo effects are not completely removed. Chen et al. [21] construct a bilateral pyramid using the bilateral filter, increasing the width of the Gaussian kernels to perform smoothing for progressive abstraction of videos. Fattal et al. [42] enhance details in an image by recursively applying the bilateral filter and combining the images and their high frequency components obtained from multiple light
sources. Bae et al. [6] also use a bilateral filter to separate low and high contrast features of an image and modify them to achieve enhancement. Zhang and Allebach [126] have modified the range filter of the bilateral filter to perform both sharpness enhancement and noise removal. However, the performance is constrained by the choice of training dataset, which is used to optimize the parameters of the modified bilateral filter. Subr et al. [110] acquire information about oscillations from the local extrema of an image at multiple scales, and use that to build a hierarchy and enhance details in an image. Though [110] shows examples of smoothing noisy images, it does not specifically handle enhancement of an image in the presence of noise. Recently, Fattal [41] considers edges in an image to define the basis functions and reduces the correlation between the levels of the pyramid. A multi-resolution analysis framework based on wavelets (constructed depending on the edge content of the image) was proposed to decompose an image into smooth and detail components. Paris et al. [88] differentiate large-scale edges from small-scale details and use that in a Laplacian pyramid [15] framework in order to obtain edge-aware smoothing of an image at different spatial scales.
The hierarchical approach has been used for multi-scale decomposition by Jobson et al. [118] and Pattanaik et al. [89] for tone mapping. However, these methods may produce halo effects. Li et al. [77] demonstrate a tone-mapping operator based on spatially-invariant wavelets. Fattal et al. [43] perform tone mapping while trying to avoid halos by manipulating the gradients of the image. Other tone-mapping techniques for high dynamic range (HDR) images have been proposed by Reinhard et al. [101].
Figure 4.4: Flowchart of our method
4.1.2 Overview of our Method
Our method [29] is summarized in the flowchart shown in Figure 4.4. It consists of 4 key steps: (1) noise removal, (2) segmentation, (3) decomposition using a hierarchical framework, and (4) enhancement of the decomposed levels. We use the NL-means filter proposed by Buades et al. [13] to remove noise and to decompose the image. The description of the NL-means filter can be found in Section 2.3.4. For steps (1) and (2), we highlight why we prefer the NL-means filter over existing approaches. The image is converted from the RGB color space to the CIELAB color space and only the lightness channel is modified, in order to preserve the color properties. Enhancement is achieved by modifying parameters to increase the intensity of each level and then adding back the enhanced levels.
4.1.3 Outline
In Section 4.2 we describe our method in complete detail. In Section 4.3, we compare
the decomposition of NL-means method with existing approaches. In Section 4.4, we
show the results of the proposed method and compare our results with other enhance-
ment techniques and applications in various domains. Finally, in Section 4.5, we will
summarize this chapter.
4.2 Our Method
4.2.1 Noise removal to ensure robustness
All existing enhancement techniques make the assumption that the input image is noise-free. This need not be the case, and most existing techniques fail in the presence of noise. This is because, after smoothing, noise is still present in the fine level and enhancement of the fine level will result in enhancement of the noise, thus spoiling the visual quality of images. The effects of noise removal during enhancement are shown in Section 4.4.3.
In order to remove noise, as an initialization, we filter the image using the NL-means filter with a low value of h. If noise is not present in the image, smoothing using low values of h does not cause much difference, as shown in Figure 4.5. As we can see in Figure 4.5(b), the original image is preserved with very little loss of structure, as seen in Figure 4.5(c).
Figure 4.5: Applying the NL-means filter on the noise-free synthetic image (a). Image (b) is the smooth image and Image (c) is the method noise
Applying the NL-means filter on non-noisy natural images results in minimal loss of structure, as shown in Figure 4.6. Removal of this detail from the image (Figure 4.6(b)) does not make much difference when visualizing the smooth image (Figure 4.6(a)). If the NL-means filter is applied to a noisy image, we can see that the noise is removed, as shown in Figure 4.6(c). The method noise, as seen in Figure 4.6(d), shows the noise that was removed from the image and contains very little structure, thus preserving the image structure while denoising. Since we are using low values of h, the smooth images in Figure 4.6(a) and Figure 4.6(c) look visually similar. We use visual evaluation to make judgements on the presence (or lack) of structure in the method noise.
(a) Smooth Image (b) Method Noise
(c) Smooth Image (d) Method Noise
Figure 4.6: Applying the NL-means filter on Figure 4.1(a) with h = 0.01. Top row: On a noise-free natural image. Bottom row: On a noisy natural image
We can represent our system as shown in Equation 4.1:

J = I + Residue,     (4.1)

where J is the input image and I = NL(J). If J is noisy, the Residue is noise; otherwise, the Residue is negligible. I can be decomposed as shown in Equation 4.2 and enhanced as shown in Equation 4.4, whereas the Residue is added back after enhancement to preserve the original properties of the image.
In Section 4.1.1, we compared the performance of the NL-means algorithm with existing techniques and found that NL-means is better at both edge preservation and noise removal than existing techniques.
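A minimal sketch of this initialization step, with scikit-image's NL-means standing in for the NL(·) operator; the value of h is illustrative:

```python
from skimage.restoration import denoise_nl_means

def split_residue(J, h=0.02):
    """Equation 4.1: J = I + Residue, with I = NL(J). The residue holds the
    noise (or almost nothing for a clean image) and is added back after the
    decomposed levels of I have been enhanced."""
    I = denoise_nl_means(J, h=h, patch_size=7, patch_distance=9)
    return I, J - I

# I, residue = split_residue(input_image)
# ... enhance the decomposition of I ...
# output = enhanced_I + residue
```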
4.2.2 Segmentation
To get abstractions at different levels of the hierarchy, the value of the filtering parameter h is progressively increased at every level. As we can see in the first row of Figure 4.7(a), a high value of h results in blurring of edges and, consequently, enhancements suffer from halo effects.
In order to remove the blurring of edges and hence the halo effects, the image is segmented by the Mean-shift segmentation algorithm proposed by Comaniciu and Meer [34]. While smoothing the image, when we consider a neighborhood for every pixel, we also consider the same spatial neighborhood of the pre-segmented image; if an edge occurs in the pre-segmented image, a lower value of h is chosen, otherwise a higher value of h is chosen. This preserves the high contrast boundaries of the image and thus removes the halo effect, as shown in Figure 4.7(b). Though the segmented image can be used at all levels of the hierarchy, it will have a noticeable effect only at the higher levels.
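A minimal sketch of how such an edge-aware choice of h could be set up from a pre-computed segmentation label map; the boundary dilation radius and the two h values are illustrative assumptions, not the exact settings of our implementation:

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.segmentation import find_boundaries

def edge_aware_h_map(labels, h_low=0.05, h_high=0.5, radius=3):
    """Per-pixel filtering parameter: a low h near segmentation boundaries
    (to keep high-contrast edges crisp) and a high h elsewhere (to allow
    strong smoothing inside homogeneous regions)."""
    near_edge = binary_dilation(find_boundaries(labels, mode="thick"),
                                iterations=radius)
    return np.where(near_edge, h_low, h_high)

# h_map = edge_aware_h_map(mean_shift_labels)
# The NL-means weights at pixel (y, x) would then use h_map[y, x].
```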
Figure 4.7: Filtering using edge information. The top row is the smooth image and the bottom row is the method noise. The red rectangle is zoomed for clarity. (a) uses the NL-means filter with a high value of h = 0.5 and no edge information. (b) uses edge information and h = 0.5. (c) uses the bilateral filter along with edge information. Though most regions are smooth and there are no edges in the method noise due to the low σ, noise can be seen along the edges of the smooth image
We tried using the same information from the pre-segmented image with a bilateral filter, using different smoothing parameters depending on the presence or absence of edges. The results are shown in Figure 4.7(c). We observe that even though the edges of the image are preserved, the smoothing of the image is not done effectively (noise is still present in the image). This only helps to reduce potential halo effects but does not remove them completely, due to improper smoothing of the image.
4.2.3 Hierarchical Decomposition with NL-means
The hierarchical framework is inspired by the Laplacian pyramid [15] and we build it using the NL-means filter with filtering parameters h_i, i ∈ [1, k], where k is the number of levels of the hierarchy. Also, h_i < h_{i+1}. However, unlike the Laplacian pyramid, we do not sub-sample the smooth images because the different levels are obtained by smoothing with an edge-preserving filter. As a result, the non-noisy input image is decomposed into several fine levels f_i and the smoothest level s. This can be expressed as shown in Equation 4.2:

I = s + \sum_{i=1}^{k} f_i,     (4.2)

where s = NL_{h_k}(n_0), NL_h(x) is the smooth image obtained by applying the NL-means filter with filtering parameter h on image x, and n_0 is the non-noisy input image I.
The smooth image is obtained by applying the NL-means filter on the original image with a certain value of the filtering parameter, h_k. The fine image at any level is calculated as the difference between the original image and its smooth image. This can be expressed as shown in Equation 4.3:

f_i = n_{i-1} - NL_{h_i}(n_0),     (4.3)

where n_i = NL_{h_i}(n_0).
4.2.4 Sharpness Enhancement
Once the different levels are obtained, sharpness enhancement can be achieved as shown in Equation 4.4:

EnhancedImage = l_0 \cdot s + \sum_{i=1}^{k} l_i \cdot f_i,     (4.4)

where k denotes the number of decomposition levels, l_i is the enhancement factor for every fine level and l_0 is the enhancement factor for the smooth level. The parameters l_i, i ∈ [0, k], can be user-defined to modulate sharpness. The default parameters in this paper are k = 2, l_0 = 1, l_1 = 5 and l_2 = 1. However, modifying l_1 creates the most noticeable effect and its value can be set automatically and appropriately to achieve proper sharpness enhancement depending on the value of the Sharpness Preference Index (SPI) defined in Section 4.4.2.
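A minimal sketch of the decomposition and recombination in Equations 4.2-4.4, using scikit-image's NL-means as the smoothing filter and omitting the segmentation-guided, per-pixel choice of h described above; the h_i values and enhancement factors are the defaults quoted in this section:

```python
import numpy as np
from skimage.restoration import denoise_nl_means

def decompose(n0, hs=(0.01, 0.1)):
    """Equations 4.2/4.3: fine levels f_i = n_{i-1} - NL_{h_i}(n_0) and the
    smoothest level s = NL_{h_k}(n_0), with h_1 < h_2 < ... < h_k."""
    levels, prev = [], n0
    for h in hs:
        smooth = denoise_nl_means(n0, h=h, patch_size=7, patch_distance=9)
        levels.append(prev - smooth)   # f_i
        prev = smooth                  # n_i
    return prev, levels                # s, [f_1, ..., f_k]

def enhance(s, fines, l0=1.0, ls=(5.0, 1.0)):
    """Equation 4.4: EnhancedImage = l_0 * s + sum_i l_i * f_i."""
    out = l0 * s
    for l, f in zip(ls, fines):
        out = out + l * f
    return np.clip(out, 0.0, 1.0)

# s, fines = decompose(lightness)     # lightness channel scaled to [0, 1]
# sharpened = enhance(s, fines)
```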
4.3 Comparison of decomposition with existing methods
In this section, we compare the decomposition at multiple scales using our technique (Figure 4.9) with existing approaches (Figure 4.8).
As shown in Figure 4.8(a), the bilateral pyramid by Chen et al. [21] uses increasing values of σ_c and σ_s to progressively smooth the images. However, strong edges of the image are not retained and visible ringing can be seen in the fine levels. The decomposition by LCIS [117] is shown in Figure 4.8(b) and the decomposition by the trilateral filter is shown in Figure 4.8(c). The trilateral filter, proposed by Choudhury and Tumblin [31], is an edge-preserving filter and is better than LCIS and the bilateral filter.
(a) Bilateral filter [21] (b) LCIS [117] (c) Trilateral filter [31]
(d) Fattal et al. [42] (e) WLS [40] (f) Iterative WLS [40]
Figure 4.8: Decomposition using existing approaches. The left half of the image is
the smooth image and the right half is the corresponding fine level. The smoothing
increases from top to bottom. The images are taken from [40]
However, the decompositions of both these approaches remove strong edges from the image, resulting in visible ringing in the fine levels. Enhancing the levels that suffer from ringing artifacts and combining all the levels back results in halo effects. As shown in [40], the decomposition technique used by Fattal et al. [42] results in gradient reversal. The decomposition using WLS is shown in Figure 4.8(e). The Laplacian pyramid framework using WLS is used to generate the decomposition shown in Figure 4.8(f). Unlike our technique, which extracts the small pebbles present at the bottom of the image, as shown in Figure 4.9(c), existing techniques also extract low contrast features, such as the clouds present at the top of the image, in the fine level.
(a) Bilateral (with segmentation) (b) NL-means (no segmentation) (c) Our technique
Figure 4.9: Decomposition using our technique. The left half of the image is the
smooth image and the right half is the corresponding fine level. The smoothing in-
creases from top to bottom. Note the presence of strong halo effects in image (b),
subtle halo effects in image (a) and absence of halo effects in image (c). Note reduced
ringing in image (a) as compared to Figure 4.8(a). The segmentation information used
in image (a) and image (c) are the same. The images are normalized for better visibility
Segmentation is also an important part of our method. We have shown the effects
of segmentation on a synthetic image in Figure 4.7. The effects of segmentation on
decomposition can be seen in Figure 4.9. As shown in Figure 4.9(b), ringing can be observed in the fine level when we use the NL-means filter without segmentation information. These ringing artifacts will result in halo effects when enhanced. Our technique does not suffer from such artifacts, as can be seen in Figure 4.9(c). If the decomposition is performed with the bilateral filter using the edge information from segmentation, then the ringing artifacts are reduced but not removed completely, as can be seen in Figure 4.9(a). The ringing artifacts can be further reduced at the cost of abstraction by lowering the values of the σ's, but that is undesirable.
4.4 Results and Discussion
In this section, we show results of sharpness enhancement and introduce a new method
to measure sharpness enhancement. We discuss how our method can be used in con-
trast enhancement, robust tone mapping and in the enhancement of eroded archaeo-
logical artifacts. We also discuss the computational costs of the algorithm and analyze
the power spectrum of the enhancements.
4.4.1 Sharpness Enhancement
The results of our enhancements are shown in Figures 4.1 and 4.10. From Figure 4.1(b) to Figure 4.2 and from Figure 4.10(b) to Figure 4.10(c), the increase in enhancement is only due to an increase in the value of l_1 from 5 to 10. l_1 corresponds to the fine level f_1, which contains the minor details of the image, and therefore increasing l_1 boosts the minor details of the image. Similarly, increasing only l_2 will enhance the stronger edges of the image while suppressing its other details. Increasing only l_0 increases the intensity of the smooth level s, resulting in a brighter enhanced image.
(a) Original
(b) Enhanced
(c) Over-Enhancement
Figure 4.10: Sharpness Enhancement. Image (a) is from Flickr®. Note the progressively increasing sharpness enhancement from left to right
Our experiments have shown that the enhancements are not very sensitive to the l_i's, where i > 0, and to obtain a perceivable difference, l_i should be incremented by 1. However, they are very sensitive to l_0, and l_0 should typically be incremented in steps on the order of 0.2.
Changing the value of h causes abstractions at different spatial scales. These effects can be seen in Fig. 4.11. As can be seen in Fig. 4.11(b), using very low values of h results in the abstraction of very fine details of the image (sand particles). Enhancement of details at such spatial levels is not preferred. On the other hand, using higher values of h (h = 0.1) results in the abstraction of preferred and more perceptible details such as the patterns in the sand and the rock formations. Enhancement of such details results in sharpness enhancement of images.
Figure 4.11: Abstractions at different spatial scales by changing the filtering parameter h. (a) is the original image. (b) h = 0.01 at level 1 and (c) h = 0.1 at level 1. l_1 = 5 for both enhancements
We have compared our results, as shown in Figure 4.12(b), with the results obtained by Fattal et al. [42], as shown in Figure 4.12(a). We can see that our technique results in better delineation of the veins of the leaf. Note that our method uses only one input image, whereas Fattal et al. use 3 input images taken under different light source directions.
In Fig. 4.13, we compare our method with Fattal et al. [42] and show that our output has neither the halos nor the gradient reversals that can be seen in the output of [42].
We also compare our results with the result of applying Photoshop's unsharp mask on Figure 4.1(a), as shown in Figure 4.14. Comparing this with Figure 4.2, we can see that although sharpness is enhanced using the unsharp mask, there is a very distinct halo effect along the perimeter of the flower, which is not desirable. This halo effect is not present in our method, as shown in Figure 4.2. Using Photoshop's unsharp mask can be considered analogous to using a Gaussian filter to sharpen images since both techniques do not consider edge information while sharpening, resulting in halo effects.
The recent technique by Subr et al. [110], which uses information about the oscillations present in an image, also suffers from subtle halo effects, as can be seen from the white colored regions present at certain parts along the boundary of the flower and the leaves in Figure 4.15. These are not present in the images enhanced by our technique, as shown in Figure 4.2.
4.4.2 Quantitative Results
To measure the sharpness of images, we use a modified version of a popular image sharpness measure, the Tenengrad criterion [73], [114], which sums the magnitude of the gradient at every pixel of the image and can be represented as shown in Equation 4.5:

Tenengrad Criterion = \frac{\sum_{\forall x,y} \sqrt{grad_x^2 + grad_y^2}}{n},     (4.5)
(a) Enhancement by Fattal et al. [42]
(b) Our enhancement
Figure 4.12: Comparison of our enhancement with Fattal et al. [42]. Details in the
right image are more clearly visible than in the left image
Enhancement by Fattal et al. [42]
Enhancement by our method
Figure 4.13: Gradient reversal. A red arrow shows the gradient reversals along the
mountain edges in the top image[42] (zoomed in the inset) and the lack of it in the
bottom image
Figure 4.14: Enhancement with Photoshop’s unsharp mask. Note the halo effect along
the boundary of the flower
Figure 4.15: Enhanced flower from [110]. Note subtle halo effects (white color) along
the boundary of flower and leaves and the lack of it using our method (Figure 4.2)
where n is the number of pixels in the image, grad_x and grad_y are the morphological gradients along the horizontal and vertical directions, and gradient(image) = dilate(image) - erode(image). Existing enhancement techniques [22, 104, 18] consider the sharper image to be the better one, which need not always be true. Existing sharpness enhancement metrics based on local edge kurtosis [18] and frequency analysis [104] also consider the sharper image to be the preferred image. However, this need not be true, as an extremely sharp image is not necessarily preferred. [125] attempts to find a sharpness metric based on human perception, without conclusive results. Since the perception of sharpness is subjective, we conduct experiments to analyze the responses of human observers regarding the sharpness of an image.
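A minimal sketch of the Tenengrad criterion of Equation 4.5, using grey-scale morphological dilation and erosion with 3-pixel line elements along each direction (the exact structuring elements are an assumption of this illustration):

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def tenengrad(img):
    """Equation 4.5: mean magnitude of the morphological gradient,
    gradient = dilate - erode, taken along the horizontal and vertical
    directions separately."""
    img = img.astype(np.float64)
    grad_x = grey_dilation(img, size=(1, 3)) - grey_erosion(img, size=(1, 3))
    grad_y = grey_dilation(img, size=(3, 1)) - grey_erosion(img, size=(3, 1))
    return np.sqrt(grad_x ** 2 + grad_y ** 2).mean()

# A sharper rendition of the same scene yields a larger value,
# e.g. tenengrad(enhanced) > tenengrad(original).
```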
Our experimental setup consists of 50 images that were obtained from [40] and Flickr. We split this dataset into a training set of 28 images and a test set of 22 images. These images represent a wide variety of scenes - natural, artificial, indoor, outdoor etc. All the images were enhanced offline by increasing the value of l_1; l_1 = 1 corresponds to the original image. The experiments were conducted independently for each subject. Each subject was seated at a distance of 20" from the screen and was given the freedom to move closer to the screen or farther away according to their convenience. The screen is around 24" (diagonal) and the observers were seated perpendicular to the center of the screen. Our psychophysical experiments can be divided into 3 parts - A) check if the Tenengrad criterion correlates with perceived sharpness, B) introduce a new metric for sharpness to aid automatic sharpness enhancement using the training set, and C) validate the new metric for sharpness using the test set.
We used different observers for each part, with some overlap. 10 observers were used for both the correlation experiment and the training procedure and 15 observers were used for the test procedure. All observers had "normal" vision (they were asked to wear their prescribed corrective lenses during the entire course of experimentation). The observers were either graduate students or post-docs from our lab and could understand the instructions given to them in English. The observers viewed the images on the screen with both eyes.
4.4.2.1 Experiment A: Correlation Experiment
In the first experiment, we find out how well the Tenengrad criterion correlates with perceived sharpness. From the training set, we select 4 distinct images and enhance each of them by increasing l_1 over l_1 ∈ [1 ... 8]. For each set of images, we randomly present those 8 images independently to 10 observers and ask them to rank the images in increasing order of sharpness, and note their responses. Based on their rankings, we once again present the images in the order that the observers have ranked them and ask them to verify their responses. Any change in their rankings is duly noted. We did not use higher values of l_1 because the enhancements become indistinguishable and the Tenengrad criterion also converges. We compare the ranks of sharpness assigned by the Tenengrad criterion with the ranks provided by the human observers.
We use Spearman's rank correlation coefficient (ρ) for each image to give a quantitative measure of the correlation between the ranks given by the metric and those given by the observers. The value of ρ ranges over [-1, +1]. A value of +1 implies that all pairs of ranks are equal. The value of ρ can be computed as shown in Equation 4.6:

ρ = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},     (4.6)

where n = 8 is the number of levels of the image and d_i is the difference between the i-th pair of ranks given by the metric and the observer. The median value of ρ that we obtain across all images for all observers is 0.9762. Since this value is positive
Figure 4.16: Correlation between the ranks assigned using the Tenengrad criterion and the ranks assigned by human observers for (top) the image with the best average ρ (ρ = 0.9714) and (bottom) the image with the worst average ρ (ρ = 0.9524) across all observers. The identity line in black, y = x, depicts perfect agreement between the ranks
and close to 1, it implies that the Tenengrad criterion correlates well with perceived sharpness. The correlation between the ranks is also statistically significant (p < 0.03, median p = 0.0003).
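A minimal sketch of this correlation check for a single image, with hypothetical observer ranks; scipy.stats.spearmanr implements Equation 4.6 (with tie handling) and also reports a p-value:

```python
import numpy as np
from scipy.stats import spearmanr

# Ranks of the 8 renditions (l_1 = 1..8) of one image.
rank_by_metric = np.array([1, 2, 3, 4, 5, 6, 7, 8])    # from the Tenengrad criterion
rank_by_observer = np.array([1, 2, 3, 5, 4, 6, 7, 8])  # hypothetical observer ranking

rho, p_value = spearmanr(rank_by_metric, rank_by_observer)

# Direct evaluation of Equation 4.6 (valid when there are no ties).
d = rank_by_metric - rank_by_observer
n = len(d)
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(f"rho = {rho:.4f} (formula: {rho_manual:.4f}), p = {p_value:.4f}")
```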
4.4.2.2 Experiment B: Sharpness Metric
As motivated earlier, existing sharpness metrics do not correspond to human perception. We therefore conduct psychophysical experiments to introduce a new metric for measuring sharpness. Using this metric, we can perform automatic sharpness enhancement of images. In this experiment, we use the training set of 28 images and increase the value of l_1 from 1 to 25 at equal intervals, thus increasing the sharpness of the images. Each set of original and enhanced images was kept in the same folder and presented to the human observers on the screen. The observers viewed each set of images using Windows Picture Viewer and were asked to use either the mouse or the arrow keys to move across the different levels of enhancement. For every set of images, we independently asked 10 human observers "Which image do you prefer?" and "When is the detail becoming too much?" and recorded their responses, as can be seen in Fig. 4.17.
As can be seen in Fig. 4.17, our experiments demonstrate that the preferred image need not be the sharpest image. This can be seen from the positions of the circles in Fig. 4.17, which do not occur at the highest value of l_1. In fact, after a certain value of l_1, as the image becomes sharper, the human observers report the detail becoming too much, which is demonstrated by the positions of the squares. We also observe that the Tenengrad criterion converges for large values of l_1, typically around 500.
Figure 4.17: Tenengrad values for 9 images from our dataset. (a) shows responses marked by circles (preferred images) and squares (transition to "too-detailed" images). The size of the circles and squares corresponds to the number of responses. (b) shows the convergence of the Tenengrad criterion. Not all 37 images are included, for better clarity
We introduce a new metric, the Sharpness Preference Index (SPI), to measure the preferred sharpness quality; it can be represented as shown in Equation 4.7:

SPI = \frac{Image Tenengrad Criterion}{Convergent Tenengrad Criterion},     (4.7)

where SPI ∈ [0, 1]. The "Image Tenengrad Criterion" is obtained using Equation 4.5 and the "Convergent Tenengrad Criterion" is obtained by decomposing the image, increasing l_1 to a very large value (l_1 = 2000) where the value converges, as shown in Fig. 4.17, and then using Equation 4.5 on the enhanced image.
Using the values of SPI for the training images, we find that the mean value of SPI for the preferred images is 0.23 and the mean value of SPI for the "Too-detailed" images is 0.36. The trends in SPI can be seen in Fig. 4.18. We consider images with SPI < 0.23 to be "Soft" images and images with SPI > 0.36 to be "Too-detailed". The intermediate region, SPI ∈ [0.23, 0.36], can be considered to be the "Optimal" region with regard to the amount of detail present in an image.
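A minimal sketch of the SPI computation and the resulting classification, reusing the tenengrad, decompose and enhance helpers sketched earlier in this chapter; the 0.23 and 0.36 thresholds are the values learned from the training set:

```python
def sharpness_preference_index(img, l1_large=2000.0):
    """Equation 4.7: ratio of the image's Tenengrad value to the value it
    converges to after an extreme boost of the first fine level (l_1 = 2000)."""
    s, fines = decompose(img)
    converged = enhance(s, fines, l0=1.0, ls=(l1_large, 1.0))
    return tenengrad(img) / tenengrad(converged)

def classify_detail(spi, soft=0.23, too_detailed=0.36):
    if spi < soft:
        return "Soft"
    if spi > too_detailed:
        return "Too-detailed"
    return "Optimal"

# spi = sharpness_preference_index(lightness)
# print(classify_detail(spi))
```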
4.4.2.3 Experiment C: Automatic Enhancement
Given an image, since all computations can be performed off-line, we can compute its SPI and predict the sharpness quality of the image. We use the test set of 22 images and enhance those images automatically with the lowest value of l_1 such that the SPI ∈ [0.23, 0.36]. For every image, we present the original and the preferred images simultaneously and randomly (on either the left or the right side) on the screen. When the original image is selected as the preferred image, we present the next higher level of enhanced image as the original image. Each of the 15 observers was then asked independently to rate the image on the "right" side as "Better", "Same" or "Worse" relative to the image on the "left" side of the screen, and their responses were recorded.
Figure 4.18: SPI values on our dataset. The "Too much detail" region extends to SPI = 1. The blue horizontal line shows the mean SPI for the preferred training images. The red horizontal line shows the mean SPI for training images with too much detail
Scores were thus assigned as 1 for Better, 0 for Same and -1 for Worse to the selected
image. This was repeated for all 22 test images. Finally, one image was presented to
the subject but with the order flipped from the previous display and the response was
noted to check if the observers were consistent with their responses. These responses
were used to evaluate the “preference” for the image selected by our automatic mech-
anism.
We found that our observers were not biased towards selecting one side of the screen. The responses were also consistent when the preference rating for the last comparison was compared with the earlier rating. Fig. 4.19 shows the preference ratings for the automatically selected images when compared with the original images. The histogram gives a better visualization, and we can see that the observers show a strong preference for the automatically selected images. We use the Wilcoxon signed-rank test [124] to check for statistical significance and infer that the preference for the automatically selected images is statistically significantly different from that for the original images (the null hypothesis of zero median of the difference is rejected at the 5% level); by comparing the ranks we conclude that the selected images have a higher rank than the original images (p = 0.0046). This implies that observers have a significant preference for the automatic selection of enhanced images.
The evaluation of the preference for the enhancement method can be analyzed using a variation of the ROC approach, as described in [91], to give an impression of perceived image quality with respect to the original images. Unlike in regular ROC analysis, where ground-truth information is provided, no such information is present in this case. The raw data consist of the preferences of observers regarding the perceived quality of the selected images with respect to the original images. In this case, the axes of the ROC plots have a different interpretation. The Y-axis, which originally corresponds to the true positive rate, can be considered equivalent to the proportion
Figure 4.19: (Top) Preference for our selection (Bottom) Histogram of mean prefer-
ence ratings
of the selected images with higher perceived image quality. The X-axis, which originally corresponds to the false positive rate, can be considered equivalent to the proportion of original images with higher perceived image quality.
We consider the area under the ROC curve [61] (A_z) as a measure of the perceived quality of the selected images. For the original images, the value of A_z = 0.5. If A_z > 0.5, then the perceived quality of the selected images can be considered to be better than that of the original images. On the other hand, if the perceived quality of the original images is better than that of the enhanced images, then A_z < 0.5. The empirical values of A_z for the different observers are shown in Fig. 4.20. Only 1 of the 15 observers considered the original image to have better quality (A_z < 0.5) and 1 observer considered the original and the selected images to have similar quality (A_z = 0.5). For all the observers, on average (A_z = 0.69 ± 0.19), the selected images are deemed to have better quality than the original images. Using the Wilcoxon signed-rank test [124], our automatically selected images have statistically significantly better quality than the original images (p = 0.0074).
Although the preference for our automatic selection is statistically significant, in some cases (5 images), as shown in Fig. 4.19, the observers prefer the original image. Apart from personal preferences, this is due to the characteristics of the image. The observers prefer less enhancement of good quality natural images. However, they show a preference for enhancement in images having an object in the foreground. Categorizing the images and tuning the parameters accordingly may help improve the efficiency of our implementation. Blur detection can also help to improve the estimate of the enhancement parameters of the image. Currently, we enhance the entire image globally. However, the observers prefer more enhancement in certain specific regions of the image (e.g., more in the foreground than in the background). We can therefore use prior information about the image, such as its saliency map, and adaptively modify the enhancement parameters
Figure 4.20: A_z's for all observers
depending on the salient regions of the image. The observers also prefer enhancement of the texture in images, and information from texture analysis could give better results.
As shown in Fig. 4.21, we compared the SPI values of images enhanced by existing methods. Since the source code of those techniques is not available, we used the images provided by the respective authors and, therefore, do not have results from all the techniques for all the images. We did not present results using our technique because our method automatically gives us an image preferred by human observers. Using the default parameters from the existing approaches, the SPI values do not lie in the "Optimal" region and, therefore, in comparison with these approaches, our method gives more pleasing results. However, the enhancement parameters of existing approaches can be modified such that the enhanced images lie in the "Optimal" region.
Examples of our automatic selection of enhanced images can be found in Fig. 4.22.
Figure 4.21: Comparison of enhancements using SPI. The "Optimal" region is SPI ∈ [0.23, 0.36]
Figure 4.22: Automatic sharpness enhancement of images. Left are the original images and right are the enhanced images. The box represents the preferred images according to our metric. For the first image, the enhanced image (l_1 = 3) is preferred, whereas for the 2nd image, the original image (l_1 = 1) is preferred. The enhancements are better visible at the original high resolution
4.4.3 HDR Tone Mapping and Robustness to Noise
Our decomposition framework can be used for high dynamic range (HDR) tone mapping by reducing the intensity range of the HDR image using detail-preserving compression. Since there is no ground truth for comparison, we use visual evaluation to compare the results.

We use the tone-mapping technique by Durand and Dorsey [39] and replace the bilateral filter with the NL-means filter to obtain the results shown in Fig. 4.23. All other parameters are kept constant. We can see that [39] sometimes suffers from subtle halo effects, whereas our method does not. As can be seen in Fig. 4.23 and in [110], our method also results in better color balance.
(Top) Tone-mapped image by Durand and Dorsey [39]; (Bottom) Tone-mapped image using our method
Figure 4.23: (Left) Tone-mapped results and (Right) close-up. Both methods use the same tone-mapping algorithm. Note the lack of halos around the picture frames and light fixture and the better color balance in the close-up of the bottom image
We also conducted experiments using a recent tone-mapping method proposed by Paris et al. [88]. Due to the large dynamic range of the input images, a large amount of compression is involved, resulting in minor inaccuracies becoming visible at high-contrast edges. We compare the exaggerated renditions of our method with some of the state-of-the-art decomposition techniques, such as [77], [41], [40] and [88]. These exaggerated renditions are generally images that may not be preferred by humans, but we perform this extreme testing to check for the failure cases of these methods. As can be seen in Fig. 4.24, our method produces consistent results without halos, whereas the existing methods fail. As can be seen in Fig. 4.24(c), the wavelet-based method by Li et al. [77] produces halos as the level of detail is increased. Generally, the technique by Farbman et al. [40] does not result in halo effects under normal conditions, but if the renditions are exaggerated, as seen in Fig. 4.24(a), subtle halo effects are observed. The technique by Paris et al. [88] and the exaggerated rendition of our method (using the same enhancement parameters as used for Fig. 4.24(a)) do not produce halo effects. As can be seen in Fig. 4.26, one of the latest techniques, by Fattal [41], suffers from aliasing effects. These artifacts are not present using our method.
In order to check for the robustness of our method to noise, we introduce zero-mean Gaussian noise in an HDR input and then perform tone mapping. We consider the WLS decomposition framework proposed by Farbman et al. [40] and compare the results with our method. As shown in Fig. 4.27, noise is amplified by [40]. Since the original image is an HDR image, the effects of noise are more pronounced in the darker regions of the image. Our method has a specific noise-removal step, and so the amplification of noise is much less in our method. The color balance using our method is not as good as in Fig. 4.23, as the noise is not completely removed by the noise-removal step of our method. Due to the lack of ground truth, we rely on visual evaluation and do not report quantitative results. Removing the noise before tone mapping and then adding the noise back again while using the WLS approach [40] will result in visually similar results with respect to noise but may potentially cause halo effects, as shown in Fig. 4.24(a).

(a) Farbman et al. [40] (b) Paris et al. [88] (c) Li et al. [77] (d) Our method
Figure 4.24: Exaggerated tone-mapped results of an HDR image using different techniques. Halos are shown in the close-ups of the images in Figure 4.25

(a) Farbman et al. [40] (b) Paris et al. [88] (c) Li et al. [77] (d) Our method
Figure 4.25: (Top row) Exaggerated tone-mapped results of an HDR image using different techniques. (Bottom row) Close-up of a part of the image. Halos are shown in the close-ups of the images. Note the presence of halo effects in (a) and (c) and the lack of it in (b) and (d)

Figure 4.26: (Left) Close-up of the tone-mapped image by Fattal [41], obtained from Paris et al. [88], and (Right) close-up of the tone-mapped image by our method. A red arrow shows the irregular edge generated in the left image due to aliasing and the lack of it in the right image
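As a concrete illustration of the noise-robustness experiment described above, a small Python sketch for corrupting an HDR input with zero-mean Gaussian noise is given below. The noise level, the clipping floor, and the tone_map function are placeholders and not part of our implementation.

# Minimal sketch: add zero-mean Gaussian noise to an HDR image before tone mapping.
# `tone_map` stands in for any tone-mapping operator and is only a placeholder.
import numpy as np

def add_gaussian_noise(hdr, sigma_fraction=0.02, seed=0):
    """Add zero-mean Gaussian noise whose std is a fraction of the dynamic range."""
    rng = np.random.default_rng(seed)
    sigma = sigma_fraction * (hdr.max() - hdr.min())
    noisy = hdr + rng.normal(0.0, sigma, size=hdr.shape)
    # Keep values strictly positive so a log-domain tone mapper remains valid.
    return np.clip(noisy, 1e-6, None)

# noisy_hdr = add_gaussian_noise(hdr_image)
# compare tone_map(noisy_hdr) against tone_map(hdr_image)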
4.4.4 Enhancement of eroded artifacts
One of the challenges that archaeologists face while studying old artifacts is weathering, which may cause patterns to not be clearly visible. Malzbender et al. [82] proposed a technique using polynomial texture maps to enhance artifacts. However, this technique requires the artifact to be physically present, which is not always feasible. Therefore, as shown in Figure 4.28, we can instead enhance the details in an image of the artifact using our method.
Enhancement by Farbman et al. [40]
Enhancement using our method
Figure 4.27: Tone-mapping of noisy image. Note the relatively less amplification of
noise using our method
(a) (b) (c)
Figure 4.28: Enhancement of archaeological artifacts. (a) is the original image from
[72]. (b) is the enhanced image from [72]. (c) is enhanced by our method. The
structure beneath some weathered regions is clearly visible in (c) as compared to (b)
4.4.5 Computational Cost
We have implemented the algorithm in MATLAB in a Windows XP environment on a PC with a 3 GHz Xeon processor. For an image of size 360 × 245 pixels, segmentation takes 6 sec., every level of the NL-means filter takes 39 sec., and the enhancements take < 1 sec. The speed can be improved by implementing the code in C++ and using a GPU. The computational costs after those speed-ups are mentioned in Section 5.3.2.5. Also, faster and more optimized versions of the NL-means filter can be used to increase the computational efficiency [86].
4.4.6 Power Spectrum Analysis
Fig. 4.29 shows the power spectrum characteristics before and after sharpness enhance-
ment. We can see that our enhancement technique results in the magnification of high
frequencies of the image.
Figure 4.29: Power Spectrum Characteristics. The left image is the power spectrum of Fig. 4.1(a) and the right image is that of Fig. 4.1(b). Note the magnification of high frequencies in the right image
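The power spectrum plots of Fig. 4.29 are straightforward to reproduce. The sketch below is a simplified Python stand-in for our analysis code: it computes the log-magnitude of the centered 2-D Fourier transform of a grayscale image.

# Minimal sketch of the power spectrum analysis: log power of the centered 2-D FFT.
# Stronger values away from the center indicate more high-frequency content.
import numpy as np

def log_power_spectrum(gray):
    """gray: 2-D float array. Returns the centered log power spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    power = np.abs(spectrum) ** 2
    return np.log1p(power)

# Comparing log_power_spectrum(original) with log_power_spectrum(enhanced)
# should show magnified high frequencies for the sharpness-enhanced image.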
4.5 Summary
We have proposed a sharpness enhancement technique for a single image based on a hierarchical framework using the Non-Local means filter, and have shown how this technique performs edge-preserving smoothing and does not suffer from the limitations of the Gaussian filter, the bilateral filter, or the WLS approach. Though the Non-Local means approach could potentially suffer from halo effects, we have ensured that it does not by pre-segmentation of the input image. We have also ensured the robustness of our system through a noise-removal mechanism.
Experimentally, we have compared our results with results from existing techniques and have shown that our technique results in better enhancement of the images and is also robust to noise. We have also shown how this technique can be used with contrast enhancement, for tone mapping, and to reveal hidden features present in eroded archaeological artifacts. We also introduced a new measure for sharpness enhancement.

Since we use segmentation information to prevent halo effects, the presence or absence of halo effects depends on the quality of the segmentation.
Chapter 5
Framework for Robust Online Video Enhancement
Using Modularity Optimization
5.1 Introduction
Consistently modifying video is a ubiquitous problem. For instance, people with low vision benefit from viewing contrast-enhanced videos [49]. Existing techniques for video contrast enhancement either trivially enhance each frame of the video, use a fixed number of frames to do temporal smoothing, or assume slowly changing light intensity while performing enhancement. We argue that enhancing individual frames of the video does not exploit temporal information, which results in temporal incoherence causing flash and flickering artifacts. Also, using a fixed number of frames is undesirable because the method does not account for illumination changes or scene changes, resulting in improper enhancement. Slowly changing light intensity may be an unreasonable assumption when dealing with videos with diverse content.
Our method analyzes a video stream and uses a recent image color contrast en-
hancement technique [28] to enhance videos. Our framework can be extended to any
other image contrast enhancement technique. Successive frames of a video generally
convey similar information, unless there is a shot change. We use this information
to maintain temporal consistency. In order to account for the challenges mentioned
earlier, we propose a novel way to achieve video contrast enhancement. A video se-
quence, particularly in movies, consists of many shots. In order to account for the
shots, we first cluster frames that are similar to each other. Next we select a key frame
that is most representative of all frames in that cluster and estimate the enhancement
parameters for only that key frame. Then, we enhance all frames in a cluster with the
estimated enhancement parameters of its key frame. Therefore, clustering is critical to
video contrast enhancement and helps in reducing flash and flickering artifacts. This idea is quite intuitive, yet, as the review of previous work on video enhancement in Section 5.1.1 shows, none of the existing methods use it. We also show the effectiveness of our enhancement method by conducting experiments with human observers.
A lot of work has been done in video clustering and a review of some existing
methods can be found in [53, 78, 106, 37]. In order to cluster frames of the video,
we perform graph-based partitioning. The graph G = (V, E) is a weighted undirected graph where the vertices V correspond to the frames and edges E are formed between the vertices. The weight of an edge, w(i, j), is the similarity measure between vertices i and j. Most of the previous graph-based clustering techniques, such as normalized
cut [105] and techniques based on spectral clustering [84] minimize the size of the cut
to produce two disjoint clusters. Recently, these techniques have been used in video
clustering [20, 51, 127]. However, these methods require prior information about the
number of clusters, which is not always available and varies largely across different
videos. Eigengap [108] can be used to predict the number of clusters [20, 36, 85].
However, this measure is not robust. It is critical for the clustering approach to have
a high recall rate because a missed detection of a cut or a gradual scene change may
cause improper enhancement.
In this chapter, we introduce a new approach for graph-based video clustering using
modularity maximization. This method was introduced by Newman [83] and exten-
sively used in community detection and network analysis [52, 83]. This method finds
clusters such that the number of edges between clusters is smaller than expected. To
the best of our knowledge, this modularity maximization algorithm has not been used
for shot detection and a key advantage is that it can automatically find clusters in a
graph without a priori information about the cluster count or size. Using eigen anal-
ysis, we can robustly find the most influential vertex in a cluster that is equivalent to
finding the key frame in a sequence. As spectral clustering techniques minimize the
cut, there is no intuitive way to extract the key frame using eigenvector analysis. Since this method has not previously been used for video clustering, we evaluate it by conducting experiments on video sequences from the TRECVid 2001 dataset [107] and compare it with existing approaches.
Most existing video clustering techniques have access to the entire video, making them offline processes. An online process for video clustering [53] detects shots using the change in curvature of the similarity measure and takes as the key frame the middle frame amongst the frames detected so far in a shot, which may not be the most representative frame. Another online clustering method [3] calculates the rank of the feature matrix of a window of frames using SVD and chooses a frame as the key frame if its rank is more than a threshold. Since this threshold is fixed, it will not scale to videos with diverse content such as movies, resulting in improper estimation of the key frame. Since we want an online system and a shot can be described by a consecutive window of frames, we use a sliding window mechanism to detect shots and transfer the clustering information of a fraction of the last frames from the previous window to the current window.
5.1.1 Related Work
The technique by Fullerton and Peli [49] enhances MPEG videos one frame at a time.
Hines et al. [64] use a DSP implementation of the retinex algorithm to enhance video
on a frame by frame basis. However, these methods do not ensure temporal consis-
tency. We ensure temporal consistency by enhancing similar frames with same en-
hancement parameters.
Goh et al. [56] use a weighted histogram in which the weights depend on the com-
parison of the current frame histogram with the previous frame histogram. Since only
2 frames are considered at a time and the histogram may change for every frame, it is
not temporally consistent. Wang and Ward [122] use a weighted histogram where the
weights depend on a fixed number of frames. This will ensure smoothing (less flicker)
if all frames belong to one shot of a video but will fail in case of scene changes. The
technique by Kang et al. [69] extends the work of Reinhard et al. [100] to videos. For
enhancement of images, Reinhard et al. have defined a term called key that indicates
if a scene is subjectively light, normal or dark. Kang et al. assumes the videos to have
a slowly changing scene intensity and so consider the value of key to be a constant.
They also assume a fixed number of frames to maintain temporal coherence. But this
is not necessarily the case in most videos because of scene changes and shots having
different lengths. Ramsey et al. [99] adapt the value of key depending on luminance
changes, but they too use a fixed but small number of frames to maintain temporal
consistency. Wang et al. [121] consider temporal information to avoid flash and flick-
ering effects. However, their technique uses a specially designed camera to capture
HDR video. Bennett and McMillan [9] use the bilateral ASTA (Adaptive Spatio-Temporal Accumulation) filter, which filters the image depending on the detection of motion and favors spatial filtering over temporal filtering when motion is detected. However, the parameters have to be set manually in case of a change in the lighting conditions. The advantage of our method is that, during video enhancement, it deals with scene changes, including illumination changes, using a clustering mechanism and can automatically detect and accordingly enhance arbitrary-sized shots without additional hardware.
Pattanaik et al. [90] mimic the temporal adaptation of the human visual system to illumination changes. This adaptation is gradual, and it takes longer if one moves from a brightly lit place to a dark place than if one moves from a dark place to a bright place. As a result, multiple frames are needed to adjust to the overall luminance changes, resulting in an undesirable information loss during the adaptation phase that is not present in our method.
5.1.2 Overview of our Method
Our method is described in Algorithm 1. We first measure the similarity between frames (Section 5.2.1). In order to cluster similar frames, we use the modularity measure (Section 5.2.2) and extract a key frame from each cluster (Section 5.2.3). We want an online system and therefore consider an incoming window of n frames at a particular time. Since successive windows can describe the same scene, in order to maintain temporal consistency we employ a sliding window mechanism to transfer information about the cluster index and the key frame K to subsequent windows (Section 5.2.4). We denote the overlap between windows by P and use a third of the last n frames for the overlap (p = n/3). Finally, we perform video enhancement (Section 5.2.5).
Algorithm 1 Algorithm (pseudo-code) of our method
Input: Streaming video; a window W having n frames (F_i) and corresponding cluster labels tempC; size of overlap window p; overlap window P ⊆ W with corresponding cluster labels CP
Initialize: C = ∅, W = ∅, t = 1, CP = ∅
Output: Enhanced frames and cluster labels C of all frames
1: while there are frames to be processed do
2:   if CP == ∅ then
3:     W = [F_t, ..., F_{t+n-1}]
4:     tempC = ∅
5:   else
6:     W = [P, F_{t+n}, ..., F_{t+2n-p-1}]
7:     tempC = [CP ∅]
8:     t = t + n - p
9:   end if
10:  Cluster frames in W depending on the similarity measure
11:  Assign cluster labels tempC(j) such that W(j)'s belonging to the same cluster have the same label and those belonging to different clusters have different labels {This ensures information transfer across different clusters using CP}
12:  tW(j) = W(j) and tC(j) = tempC(j), where j = [1, ..., n - p]
13:  for each unique cluster label c of tC do
14:    if c ∉ C then
15:      Find the key frame K amongst the frames in tW corresponding to cluster label c
16:      Estimate the enhancement parameters of K
17:      Save these enhancement parameters for c
18:    else
19:      Retrieve the enhancement parameters for c
20:    end if
21:    Enhance the frames in tW with label c with the corresponding enhancement parameters
22:  end for
23:  C = [C tC]
24:  P = [F_{t+n-p}, ..., F_{t+n-1}]
25:  CP(j - n + p) = tempC(j), where j = [n - p + 1, ..., n]
26: end while
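For illustration, a simplified Python rendering of this sliding-window loop is sketched below. It is only an outline under stated assumptions: cluster_frames, find_key_frame, estimate_parameters, and enhance are placeholders for the components described in Sections 5.2.1-5.2.5, the final partial window is ignored, and the bookkeeping is reduced to the essentials of Algorithm 1.

# Simplified sketch of the sliding-window enhancement loop of Algorithm 1.
# `cluster_frames`, `find_key_frame`, `estimate_parameters`, and `enhance`
# are placeholders for the components of Sections 5.2.1-5.2.5.
def enhance_stream(frames, cluster_frames, find_key_frame,
                   estimate_parameters, enhance, n=16, p=5):
    buffer, carry_labels = [], []      # current window and labels carried via P
    params = {}                        # global cluster label -> enhancement parameters
    next_label, enhanced = 0, []
    frames = iter(frames)
    while True:
        # Fill the window; the last p frames of the previous window form the overlap P.
        while len(buffer) < n:
            frame = next(frames, None)
            if frame is None:
                return enhanced        # a final partial window is ignored in this sketch
            buffer.append(frame)
        local = cluster_frames(buffer)                 # local cluster index per frame
        # Seed the local-to-global label mapping from the labels carried through P.
        mapping = {}
        for j in range(min(len(carry_labels), n)):
            mapping.setdefault(local[j], carry_labels[j])
        labels = []
        for loc in local:
            if loc not in mapping:
                mapping[loc] = next_label
                next_label += 1
            labels.append(mapping[loc])
        # Enhance only the first n - p frames; the last p are reprocessed next window.
        for j in range(n - p):
            c = labels[j]
            if c not in params:
                members = [buffer[k] for k in range(n - p) if labels[k] == c]
                params[c] = estimate_parameters(find_key_frame(members))
            enhanced.append(enhance(buffer[j], params[c]))
        carry_labels = labels[n - p:]
        buffer = buffer[n - p:]

Carrying the labels of the overlap frames P into the next window is what propagates cluster indices, and hence enhancement parameters, across windows in this sketch.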
5.1.3 Outline
In Section 5.2, we describe the details of the proposed method. We discuss details
regarding similarity measure, cluster analysis, key frame detection and information
transfer across sequences to maintain temporal consistency. In Section 5.3, we show
the results of cluster analysis and key frame detection on video sequences and compare
with existing methods. We also show how this analysis benefits video enhancement
and video segmentation. Finally, we summarize this chapter in Section 5.4.
5.2 Our Method
5.2.1 Similarity Measure
Given a window of n frames, we first compute the illumination (lighting conditions)
and the reflectance component (properties) of each frame as described in the enhance-
ment method proposed in [28]. We construct a weighted fully connected undirected
graph G = (V, E). The vertices, V, are the frames in the feature space, and the weight of the edges, w(i, j), connecting two vertices is the similarity measure between frames i and j. In order to account for lighting and scene variations, we use two different descriptors: a color histogram to account for illumination variations and an edge direction histogram to account for edge variations. We use two different descriptors to consider both the color properties as well as the structural properties of the frame. The color histogram is computed for the illumination image and is composed of 64 bins that are determined by sampling colors in the RGB color space. The edge direction histogram is computed for both the illumination and reflectance images. We use Sobel filters on the images to obtain vertical and horizontal gradients of the image. These gradients are used to compose the edge direction histogram with 72 bins, where each bin corresponds to an interval of 2.5°.
In order to compare the histograms, we compute the difference between the histograms. For the color histogram, we compute the histogram intersection [111], d_c(i, j), defined as

d_c(i, j) = Σ_{b=0}^{63} min(c_i(b), c_j(b)) / Σ_{b=0}^{63} c_i(b),    (5.1)

where i and j are the frames from window W, c_i and c_j are the respective color histograms, and b is the bin number. The intersection is normalized by the number of pixels to get a value between 0 and 1. If the frames are similar, then the value of d_c(i, j) will be high (close to 1). Otherwise, if the frames are dissimilar, then the value of d_c(i, j) will be low (close to 0). To compare the edge direction histograms, we compute the complement of the Euclidean distance between the histograms, d_e(i, j), defined as
d_e(i, j) = 1 − sqrt( Σ_{b=0}^{71} (e_i(b) − e_j(b))^2 ),    (5.2)

where i and j are the frames from window W, e_i and e_j are the respective normalized edge histograms, and b is the bin number. For similar frames, the Euclidean distance between the histograms will be low. Therefore, if the frames are similar, then the value of d_e(i, j) will be high (close to 1). Otherwise, if the frames are dissimilar, then it will be low (close to 0).
We tried different combinations of difference measures (such as Euclidean distance, histogram intersection, cosine distance etc.) to compare the histograms. We chose histogram intersection for the color histogram and Euclidean distance for the edge direction histogram because, experimentally, we found these measures to be the best interpretation of the similarity measure. This combination best accentuated dissimilarities to account for scene changes since, amongst the ones we tried, it gave the highest value for frames with similar content and the lowest value for frames with different content. We calculate the similarity measure w(i, j) between frames by combining the similarity measures of the histograms as shown:

w(i, j) = d_c(i, j) d_{e,illum}(i, j) + d_c(i, j) d_{e,refl}(i, j) + d_{e,illum}(i, j) d_{e,refl}(i, j),    (5.3)

where d_c(i, j) is the color histogram similarity measure between the illumination components of frames i and j, d_{e,illum}(i, j) is the edge similarity measure between the illumination components of frames i and j, and likewise d_{e,refl}(i, j) is for the reflectance components. All the distance measures are normalized to have a value between 0 and 1. Since most of the color properties of the frames are present in the illumination component, we do not compute the color histogram similarity measure for the reflectance components. However, edge information is present in both the illumination and the reflectance components of the image.
We used 2 different descriptors - a color histogram and an edge histogram. Using only one descriptor is not sufficient to describe the content completely. For example, if we use only a color histogram descriptor, then frames with similar color content but different in other attributes will be considered similar, which is undesirable. Therefore, we need to combine multiple feature attributes. One way to do this is to weigh the features. However, this technique is cumbersome and may involve sophisticated machine learning tricks to come up with the right combination of weights. We instead propose a simpler alternative, as shown in Equation 5.3: we weigh the similarity measure of each feature with that of every other feature. This gives high similarity values only if each of the individual similarity measures has a high value. Thus, we need a good similarity score from all 3 features to be confident that the frames are similar.
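A compact Python sketch of this similarity computation is given below. It assumes the illumination and reflectance components have already been separated (as in [28]) and that values lie in [0, 1]; the 4 × 4 × 4 color binning and the numpy gradient (used as a stand-in for the Sobel filters) are implementation assumptions for illustration only.

# Sketch of the frame similarity measure of Equations 5.1-5.3, assuming the
# illumination/reflectance decomposition of each frame is already available.
import numpy as np

def color_hist(illum_rgb, bins_per_channel=4):
    """64-bin RGB color histogram of the illumination image (4 x 4 x 4 bins assumed)."""
    hist, _ = np.histogramdd(illum_rgb.reshape(-1, 3),
                             bins=(bins_per_channel,) * 3, range=((0, 1),) * 3)
    return hist.ravel()

def edge_dir_hist(gray, bins=72):
    """72-bin edge-direction histogram (2.5 degree bins over 180 degrees)."""
    gy, gx = np.gradient(gray.astype(float))       # stand-in for Sobel gradients
    angles = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, 180.0))
    return hist / max(hist.sum(), 1)               # normalized edge histogram

def d_c(c_i, c_j):
    """Histogram intersection of color histograms (Eq. 5.1)."""
    return np.minimum(c_i, c_j).sum() / c_i.sum()

def d_e(e_i, e_j):
    """Complement of the Euclidean distance between edge histograms (Eq. 5.2)."""
    return max(0.0, 1.0 - float(np.sqrt(((e_i - e_j) ** 2).sum())))  # kept in [0, 1]

def similarity(frame_i, frame_j):
    """w(i, j) of Eq. 5.3; each frame is a dict with 'illum' (RGB) and 'refl' (gray)."""
    dc = d_c(color_hist(frame_i['illum']), color_hist(frame_j['illum']))
    de_illum = d_e(edge_dir_hist(frame_i['illum'].mean(axis=2)),
                   edge_dir_hist(frame_j['illum'].mean(axis=2)))
    de_refl = d_e(edge_dir_hist(frame_i['refl']), edge_dir_hist(frame_j['refl']))
    return dc * de_illum + dc * de_refl + de_illum * de_refl

Multiplying each pair of the three descriptor similarities, as in Equation 5.3, means a high w(i, j) is obtained only when all three measures agree that the frames are similar.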
5.2.2 Cluster Analysis
Once we have constructed the graph G as shown in Section 5.2.1, we want to parti-
tion the graph into clusters and compute the key frame for every cluster. We use the
technique introduced by Newman [83] in network analysis, which tries to maximize the modularity instead of minimizing the cut size [52]. One key advantage of this method is that it is recursive and can automatically determine the number of clusters in a graph.

We first take a look at bi-partitioning the graph. The modularity function [83], M, can be defined as

M = (number of edges within a cluster) − (expected number of such edges).    (5.4)
For all pairs of vertices i and j, the number of edges can be determined from the weighted adjacency matrix A_ij, where the weights are determined by the similarity measure as mentioned in Equation 5.3. The expected number of edges can be derived from a null model as k_i k_j / 2m, where k_i and k_j are the total weights of nodes i and j, k_i = Σ_j A_ij, and therefore 2m = Σ_i k_i is the total weight of all edges in the graph. Thus, the modularity function M can be represented as M = (1/2m) Σ_{(i,j) in same cluster} B_ij, where B_ij = A_ij − k_i k_j / 2m. Once the modularity function is obtained, a matrix-based approach is used, and M can thus be represented as

M = (1/4m) s^T B s,    (5.5)

where s is the index vector such that s_i = +1 if vertex i belongs to the first group and s_i = −1 if vertex i belongs to the second group. B is the modularity matrix with elements B_ij that is used to maximize the modularity M in order to bi-partition the graph (if possible).
The maximization problem of M is NP-hard [57], and so an approximation is obtained using eigenvector analysis [83]. The method is shown in Algorithm 2. Existing spectral clustering techniques must consider the trivial case of all vertices being placed in one group. This case does not arise in our method because the modularity is being maximized.
Algorithm 2 Graph bi-partitioning using modularity
Input: A graph with n vertices, B
1: Compute the eigenvectors u_i of the modularity matrix B. Subscript i denotes the i-th element of a vector
2: Index vector s = Σ_{i=1}^{n} α_i u_i such that α_i = u_i^T s
3: Compute the leading eigenvalue β_max and corresponding eigenvector u_max of B
4: if β_max ≥ 0 then
5:   Compute s such that elements s_i = +1 if (u_max)_i > 0 and s_i = −1 otherwise
6:   if s^T B s > 0 then
7:     return Division of the graph into 2 clusters according to the signs of s_i
8:   else
9:     return Graph cannot be partitioned
10:  end if
11: else
12:  return Graph cannot be partitioned
13: end if
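A small Python sketch of this leading-eigenvector bi-partitioning is shown below; it is a simplified illustration of Algorithm 2 (not our MATLAB implementation), operating on a symmetric weighted adjacency matrix A built from the similarity measure.

# Sketch of Algorithm 2: bi-partition a weighted graph using the leading
# eigenvector of its modularity matrix B, with B_ij = A_ij - k_i k_j / 2m.
import numpy as np

def modularity_matrix(A):
    k = A.sum(axis=1)
    two_m = k.sum()
    return A - np.outer(k, k) / two_m

def bipartition(B):
    """Return an index vector s (+1/-1) if a modularity-increasing split exists, else None."""
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    beta_max, u_max = eigvals[-1], eigvecs[:, -1]
    if beta_max < 0:
        return None                            # graph cannot be partitioned
    s = np.where(u_max > 0, 1.0, -1.0)
    if s @ B @ s <= 0:
        return None                            # split would not increase modularity
    return s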
It is quite likely that a window of frames may contain more than 2 shots. So,
we extend this method to find more than 2 clusters by using a recursive approach
where each cluster is sub-divided. However, unlike in normalized cut [105] where each
subgraph is treated independently, it is incorrect to delete edges that fall between two
sub-graphs and apply the algorithm to each subgraph. This is because the modularity
measure changes if edges are deleted and the wrong modularity measure is maximized.
We therefore consider the additional contribution to modularity, ΔM, upon dividing a cluster G with n_G nodes in two, defined as

ΔM = (1/4m) s^T B_G s,    (5.6)

where B_G is the n_G × n_G modularity matrix with elements indexed by the labels of vertices i and j within cluster G and having values (B_G)_ij = B_ij − δ_ij Σ_{l∈G} B_il, where δ_ij is the Kronecker δ function. This equation has the same form as Equation 5.5, and we can apply the method described above to maximize ΔM. Initially, we consider all vertices to be in one graph and therefore Σ_l B_il = 0. Consequently, while repeatedly subdividing a graph, we apply Equation 5.6 on each cluster. We stop when the partition does not increase the sub-modularity measure (ΔM ≤ 0). This happens if the generalized modularity matrix B_G has all eigenvalues ≤ 0, and therefore checking for β_max ≤ 0 can give an indication about the termination of the partitioning process for that cluster. Thus, using the modularity measure, we can automatically partition a graph into multiple clusters.
5.2.3 Key frame Detection
In order to find the vertex that is most representative of all the vertices in a cluster, we consider the magnitude of the elements of the leading eigenvector, u_max, of the modularity matrix B and choose the one with the largest value. This is equivalent to estimating the key frame in a cluster (video sub-sequence), which can be extracted as

K = argmax_i ((u_max)_i),    (5.7)

where K is the key frame and i are the vertices belonging to a particular cluster. This equation means that the vertices corresponding to the elements having the largest magnitude in the leading eigenvector make the most contribution in that cluster. For more details, please refer to [83].
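As a companion to the bi-partitioning sketch in Section 5.2.2, the snippet below illustrates Equation 5.7; cluster_indices is a hypothetical list of window positions belonging to one cluster, and the modularity matrix B is assumed to have been built as above.

# Sketch of Eq. 5.7: the key frame is the cluster member with the largest
# entry in the leading eigenvector of the modularity matrix.
import numpy as np

def key_frame_index(B, cluster_indices):
    """cluster_indices: indices (into the window) of the frames in one cluster."""
    _, eigvecs = np.linalg.eigh(B)
    u_max = eigvecs[:, -1]
    local = int(np.argmax(u_max[cluster_indices]))
    return cluster_indices[local]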
5.2.4 Information Transfer
Since we are dealing with online video streams, we have access to only a window of n frames at a time (and not the entire video sequence). We first cluster these n frames and find the key frame for each cluster. Next, we assign the same cluster index label to frames that belong to the same cluster. In order to maintain temporal consistency, we transfer this information to the next window by using the last p frames from the current window and merging them with the next n − p frames from the video. We transfer the cluster index and the key frame information about the overlapping P frames. We do not directly transfer information about the first n − p frames. In the event that the cluster information of those frames is not reflected in the P frames, temporal consistency is still not affected: we consider every new detected cluster in the subsequent windows to be different from the clusters of earlier windows. However, if the first n − p frames of a window and the overlapping P frames belong to the same cluster, then there are no issues.

We transfer the key frame information to subsequent windows using P. In order to maintain consistency in enhancement, we do not compute the key frame again for subsequent windows that belong to the same cluster as frames from an earlier window. The choice of n and p may affect the estimation of the key frame, resulting in a sub-optimal estimate of the key frame. However, as shown later in Section 5.3.2.2, key frame estimation is not critical for video enhancement.

The modularity algorithm that we have described in Section 5.2.2 has a failure case: it cannot detect clusters with just one vertex. It needs at least 2 vertices in a cluster. This is because the notion of edges within a cluster is not present when there is only one vertex. If a different frame is present in the middle of a sequence, then our algorithm will fail to detect it. This is an unlikely scenario in videos. However, it is quite possible that when we consider a window of n frames, there could be n − 1 frames belonging to one cluster and the last frame belonging to a different cluster. Since we use a sliding window mechanism, it is highly likely that the last frame from the previous window will be grouped in a cluster with frames from the next window. Therefore, to prevent such cluster index label switching, which may result in incorrect enhancement, we enhance only the first n − p frames of the current window and postpone processing of the last p frames to the next window.
5.2.5 Video Enhancement
Given a new cluster label, we estimate the enhancement parameters of only the key
frame of that cluster using the technique mentioned in Chapter 3. A brief overview
of that method is as follows: We first try to estimate the illumination component and
separate it from the reflectance component of the image. While doing so, we try to
achieve color constancy and enhance only the luminance of the illumination compo-
nent of the image. This is done in order to preserve the color properties of the image.
The enhancement parameters are computed depending on the exposure conditions of
the image (underexposed, overexposed or a combination of both). Then, logarithmic
curves inspired by the Weber-Fechner law are applied to enhance the luminance. This
enhanced illumination is multiplied with the reflectance to obtain the enhanced frames.
Next we enhance all frames in that cluster with the estimated enhancement parameters
of the key frame. To enhance frames that have the same cluster index label as already
seen in an earlier window, we use the enhancement parameters from the previously
estimated key frame of that cluster. This maintains temporal consistency during en-
hancement.
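To make the enhancement step concrete, the sketch below applies a logarithmic (Weber-Fechner-inspired) curve to the luminance of the illumination component and recombines it with the reflectance. The single scalar a is only a stand-in for the exposure-dependent parameters estimated from the key frame; the actual parameter estimation is the one described in Chapter 3, not this simplification.

# Hedged sketch of the per-frame enhancement: a logarithmic curve is applied to
# the luminance of the illumination component, which is then recombined with the
# reflectance. The scalar `a` (> 0) stands in for the key-frame-derived parameters.
import numpy as np

def log_curve(luminance, a):
    """Map luminance in [0, 1] through a logarithmic curve; larger `a` boosts dark regions."""
    return np.log1p(a * luminance) / np.log1p(a)

def enhance_frame(illum_luma, reflectance, a):
    """Enhance one frame given its illumination luminance and reflectance components."""
    enhanced_illum = log_curve(np.clip(illum_luma, 0.0, 1.0), a)
    return enhanced_illum * reflectance    # multiplicative recombination

# All frames of a cluster are enhanced with the same `a`, estimated once from the
# cluster's key frame, which is what keeps the enhancement temporally consistent.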
5.3 Results and Discussion
In this section, we evaluate the results of our online clustering process on the TRECVid
2001 dataset [107] and compare with existing methods. We considered the TRECVid
2001 dataset because the ground truth information regarding scene change is available.
For video enhancement, we enhance trailers of High Definition (HD) movies from
the Apple website (http://trailers.apple.com/). We compare our key frame extraction with an existing method.
We show that video enhancement using our technique reduces flash and flickering
artifacts. We also show the significance of proper clustering for video enhancement.
We perform experiments to conduct human validation to know the effectiveness of the
enhancement method. Finally, we show how scene change detection helps in robust
video segmentation.
5.3.1 Quantitative Evaluation of Cluster Detection
In order to evaluate our clustering process, we carry out experiments on video se-
quences from the TRECVid 2001 dataset [107] and compare our results with 6 exist-
ing methods [44, 53, 66, 76, 94, 105]. We consider 8 video sequences whose infor-
mation is summarized in Table 5.1. We chose these 8 videos to allow a fair comparison with existing methods that use a similar subset. We directly present the results
from [44, 66, 76, 94]. We implemented the methods described in [53] and [105] and
use the same similarity measure as our method. Since normalized cuts [105] requires
prior information about the number of clusters and since the number of clusters is not
fixed, we use eigengap [108] to automatically detect the number of clusters. Compar-
ing with [105] is thus equivalent to comparing with [20, 36, 51].
The ground truth information about the different transitions is provided. The tran-
sitions are divided into 2 broad categories - cuts and gradual transitions. The gradual
transitions are further sub-divided into other categories - dissolve, fade in/out and oth-
ers. As ground truth, the exact frame number is provided for cuts whereas for the other
gradual transitions, a range of frames are provided. For some of the videos (nad31,
nad33, nad53 and nad57), the reported ground truth values are at a fixed offset from
the actual frame number of sequences. We account for these discrepancies while eval-
uating different methods. To compare and evaluate our method we use 2 performance
measures - precision and recall. Recall can be defined as
R = N_c / (N_c + N_m),    (5.8)

where R is the recall, N_c is the number of correct detections and N_m is the number of missed detections. Precision can be defined as

P = N_c / (N_c + N_f),    (5.9)

where P is the precision and N_f is the number of false detections.
Unlike existing methods such as [44, 66, 76, 94], our clustering method does not
explicitly detect gradual transitions. For gradual transitions, we consider correct de-
tection as the case when clustering detects a scene change in the range of the ground truth gradual transition frames. However, if more than 1 scene change is detected in the range of the ground truth of gradual transition frames, then we consider those to be false detections. The number of clusters within a dissolve transition varies according to the rate at which the transition occurs. The precision and recall rates are shown in Table 5.2.

Table 5.1: Videos from the TRECVid 2001 dataset used in our experiments

Title                            | File    | Size (MB) | Run time (mm:ss) | Frame Count | Transition Count | Cuts | Gradual Transitions
NASA 25th Anniversary Show-Seg 5 | anni005 | 66.9      | 6:19             | 11364       | 65               | 38   | 27
NASA 25th Anniversary Show-Seg 9 | anni009 | 72.4      | 6:50             | 12307       | 103              | 38   | 65
Spaceworks - Episode 6           | nad31   | 260.1     | 29:08            | 52405       | 242              | 187  | 55
Spaceworks - Episode 8           | nad33   | 247.1     | 27:40            | 49768       | 215              | 189  | 26
A&S Reports Tape 4-Report 260    | nad53   | 128.0     | 14:20            | 25783       | 157              | 82   | 75
A&S Reports Tape 5-Report 264    | nad57   | 63.4      | 7:06             | 12781       | 67               | 44   | 23
Challenge at Glen Canyon         | bor03   | 240.5     | 26:56            | 48451       | 242              | 231  | 11
The Great Web of Water           | bor08   | 251.0     | 28:07            | 50569       | 531              | 380  | 151
Due to the reasons mentioned above, the comparison of precision and recall rates
for gradual transitions as shown in Table 5.2 does not convey significant information
about the comparison of methods. Since our method and [53] use the same technique,
we can conclude that our method is better. It is not clear how the existing methods will
behave when their respective criterion for cuts is applied to gradual transition frames
since neither the source code nor any confusion matrix is provided. As the recall rates
are quite high when using our clustering method, we can infer that our method is able to
detect scene changes across gradual transitions. An instance of the failure case occurs
when the dissolve transition gradually happens from the previous scene to a zoomed-in
part of the same scene. In these cases, the illumination remains almost same. Also, a
lot of edge information between the 2 scenes remains the same. This results in a high
similarity measure and therefore, our clustering method fails to detect such gradual
scene changes. Note that such failure cases that occur in shot boundary detection are
actually desirable from the video enhancement perspective for robust enhancement
since the frames effectively belong to the same scene and have similar contrast. Notice
that our precision rates are on the lower side. This is because our method generally
detects a scene change both at the start and at the end of a gradual scene transition.
[53] has a higher precision rate because along with not being able to detect gradual
transitions (low recall rate), it also does not detect many false scene changes amidst
gradual transitions.
Table 5.2: Results for gradual transitions on videos from the TRECVid 2001 dataset. R is recall and P is precision. A '-' indicates an unreported result

File    | Ours R | Ours P | [44] R | [44] P | [66] R | [66] P | [76] R | [76] P | [94] R | [94] P | [53] R | [53] P
anni005 | 0.888  | 0.828  | 0.666  | 0.782  | 0.759  | 1.000  | 0.786  | 0.880  | -      | -      | 0.000  | 0.000
anni009 | 0.907  | 0.881  | 0.507  | 0.733  | -      | -      | 0.848  | 0.926  | 0.523  | 0.669  | 0.046  | 0.750
nad31   | 0.818  | 0.789  | 0.535  | 0.428  | -      | -      | 0.708  | 0.687  | 0.418  | 0.436  | 0.000  | 0.000
nad33   | 0.923  | 0.800  | 0.692  | 0.382  | -      | -      | 0.943  | 0.805  | 0.500  | 0.389  | 0.231  | 0.857
nad53   | 0.970  | 0.913  | 0.805  | 0.696  | -      | -      | 0.826  | 0.947  | 0.631  | 0.575  | 0.040  | 0.750
nad57   | 0.956  | 0.846  | 0.826  | 0.593  | -      | -      | 0.852  | 0.885  | -      | -      | 0.087  | 1.000
bor03   | 1.000  | 0.688  | 0.818  | 0.281  | -      | -      | 0.929  | 0.382  | 0.545  | 0.283  | 0.182  | 1.000
bor08   | 0.960  | 0.913  | 0.758  | 0.816  | -      | -      | 0.716  | 0.930  | 0.741  | 0.794  | 0.007  | 0.500
We use our clustering method to also detect cuts. Since we assume the cluster-
ing method to detect cuts and the ground truth values are provided, we can use them
to compare different methods using both precision and recall values as shown in Ta-
ble 5.3. The false detections in the range of the ground truth of gradual transitions
(depicted by precision in Table 5.2) and the false detections detected otherwise (de-
picted by precision in Table 5.3) comprise all the false detections of scene changes
detected by our method for the given video sequence.
As can be seen in Table 5.2 and Table 5.3, our clustering method has a high recall
at the cost of precision. It is critical that our clustering method has a high recall rate
(i.e.) all the clusters/scene changes in a sequence should be correctly detected. This is
because during enhancement, a missed scene change due to either a cut or a gradual
transition may cause the enhancement parameters of a previous scene to be applied
to a different scene. These effects are undesirable, and an example illustrating such
an undesirable effect in terms of video enhancement can be found in Section 5.3.2.3.
One of the failure cases for cuts in the TRECVid dataset was when the movie had
a cut transition between 2 scenes - both scenes show different faces under similar
illumination conditions. In this case both the color and edge histograms were quite
similar.
Since a low precision rate implies a high false detection rate of clusters, the pre-
cision rate should also not be very low. This will result in frequent computation of
enhancement parameters for different clusters that belong to the same scene causing
temporal inconsistency. In the boundary condition, a very low precision rate will cause
the enhancement process to perform frame-by-frame enhancement, resulting in flash-
ing artifacts. An illustration of such an effect in video enhancement is described later
in Section 5.3.2.2. The low precision rate using normalized cut [105] is primarily
because of false cluster detections using eigengap [108]. This is because for some
sub-sequences of videos, eigengap considers every frame to belong to a different cluster, which increases the count of false detections by a large value. If the cluster count for each sliding window is calculated using our method and the count is given as an input to [105], then the results using [105] are similar to our results. Using a recursive normalized cut will not work well because the cut-size threshold will vary across diverse videos. Having a fixed constant cluster count is also not a good idea. Existing techniques [20, 36, 85] based on eigengap will show similarly poor performance. As shown in Table 5.2, [53] does not work well when the scene is gradually changing, although its performance is reasonable for cut transitions, as shown in Table 5.3.

Table 5.3: Results for cuts on videos from the TRECVid 2001 dataset. R is recall and P is precision. A '-' indicates an unreported result

File    | Ours R | Ours P | [44] R | [44] P | [66] R | [66] P | [76] R | [76] P | [94] R | [94] P | [53] R | [53] P | [105] R | [105] P
anni005 | 1.000  | 0.826  | 0.973  | 0.948  | 0.946  | 1.000  | 1.000  | 0.736  | -      | -      | 0.868  | 0.943  | 1.000   | 0.160
anni009 | 1.000  | 0.760  | 0.842  | 0.888  | -      | -      | 1.000  | 0.655  | 0.789  | 0.697  | 0.947  | 0.857  | 1.000   | 0.086
nad31   | 0.963  | 0.818  | 0.912  | 0.938  | -      | -      | 0.901  | 0.945  | 0.866  | 0.900  | 0.043  | 1.000  | 0.973   | 0.176
nad33   | 0.947  | 0.861  | 0.989  | 0.963  | -      | -      | 0.984  | 0.964  | 0.952  | 0.918  | 0.910  | 0.950  | 0.973   | 0.196
nad53   | 0.988  | 0.802  | 0.975  | 0.975  | -      | -      | 0.952  | 0.941  | 0.951  | 0.797  | 0.842  | 0.920  | 0.988   | 0.129
nad57   | 0.977  | 0.915  | 0.772  | 0.971  | -      | -      | 0.958  | 0.939  | -      | -      | 0.909  | 0.909  | 0.977   | 0.167
bor03   | 0.978  | 0.904  | 0.938  | 0.954  | -      | -      | 0.961  | 0.949  | 0.982  | 0.953  | 0.627  | 0.923  | 0.996   | 0.321
bor08   | 0.934  | 0.868  | 0.965  | 0.882  | -      | -      | 0.973  | 0.901  | 0.873  | 0.945  | 0.497  | 0.990  | 0.800   | 0.372
Since we use 8 videos and 7 methods for comparison, each having 2 measures (precision and recall), the visualization of all these measures on a graph may get incomprehensible. Therefore, to compare the performance of different methods with respect to the cuts transition, we calculate the F_1 measure defined as

F_1 = 2 P·R / (P + R),    (5.10)

where P is the precision and R is the recall. The F_1 measure depends on both precision and recall. We calculate the F_1 measure of each method on every video, and the comparison of F_1 measures is shown in Figure 5.1.
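The evaluation measures of Equations 5.8-5.10 are straightforward to compute; a small Python sketch follows, where the detection counts are illustrative placeholders rather than values from the tables above.

# Sketch of recall (Eq. 5.8), precision (Eq. 5.9) and the F1 measure (Eq. 5.10)
# from counts of correct, missed and false detections.
def recall(n_correct, n_missed):
    return n_correct / (n_correct + n_missed)

def precision(n_correct, n_false):
    return n_correct / (n_correct + n_false)

def f1(p, r):
    return 2.0 * p * r / (p + r)

# Example with illustrative counts (not taken from our experiments):
p, r = precision(90, 10), recall(90, 5)
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1(p, r):.3f}")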
Table 5.4: Average F_1 measure for the videos shown in Figure 5.1

Method | Average F_1 measure
[44]   | 0.928
[76]   | 0.915
[94]   | 0.883
[53]   | 0.752
[105]  | 0.322
Ours   | 0.903
Figure 5.1: Comparison of methods on the basis of the F_1 measure for the cuts transition (per-video bars for [44], [76], [94], [53], [105] and Ours; axes: video label vs. F_1 measure)
Figure 5.2: Failure cases (a)-(d) using our clustering method, with the true cluster boundary shown in red and false positives in magenta. It could not detect the cut transition in (a), where the illumination and edge information is similar. It also could not detect the dissolve transition in (b), which shows the same surface of the moon after zooming in. It detects extra cluster boundaries in (c) and (d), as there is a significant difference in illumination. Such failure cases in terms of shot detection are actually desirable for video enhancement
As can be seen in Figure 5.1 and in Table 5.4, our clustering method based on
modularity maximization performs comparably to state-of-the-art approaches despite
using very simple similarity measures. We performed statistical analysis on the different methods shown in Table 5.4 using the Wilcoxon signed-rank test and found that the F_1 measure of the best method [44] is not statistically significantly different from that of our method. Our method is not specifically designed for shot detection, and the failure to detect some shot boundaries (Figure 5.2(a) and (b)) or the extra detection of cluster boundaries (Figure 5.2(c) and (d)) may seem counter-intuitive but is in fact conducive to video enhancement. Illustrations of such cases can be found in Figure 5.2. Some of the cases in which our method detects a false cluster boundary are when there is a change in the illumination of the scene due to a flash, an explosion, or a light in the scene turning on/off, when the appearance of an object changes in the scene (due to camera motion or object motion), when there is a zoom-in or zoom-out of parts of the scene, or when text (movie credits) gradually appears and covers a substantial part of the scene. Although in such
cases the ground truth considers the events to belong to the same scene, the contrast
of the scene changes, and therefore it is reasonable to assume a scene change for the
purpose of video enhancement. Consequently, the enhancement parameters should
also change with change in contrast of the scene. Similarly for events as shown in
Figure 5.2(a) and (b), the ground truth considers them to be from different scenes.
However, our method considers them to be from the same scene, and this matches
with the intuition of video enhancement because we would want those scenes to be
enhanced using the same parameters. This is why our method uses simple features
such as color and edge histograms to compute similarity measures as these features fit
well in our contrast enhancement framework. Existing methods [53, 106] use wavelet
analysis, luminance values, texture analysis, shape features, flash detection, motion
estimation, adaptive thresholds etc. to compute similarity measures between frames.
Using machine learning techniques, we can find the best combination of features to
make the similarity measure more robust and consequently improve the performance
of the clustering method for the purpose of shot boundary detection at the potential
cost of stable video enhancement.
5.3.2 Video Enhancement
In this section, we show results of clustering on some HD movie trailers. We show
how clustering results in temporal consistency by removing flashing artifacts thus im-
proving the process of video enhancement. We also show why proper estimation of
clusters is important for the video enhancement process.
The most obvious way to measure the effectiveness of contrast enhancement is
to ask an observer to indicate their preference in a side-by-side comparison of the
original and the enhanced videos. The preferences can then be quantified to know
the effectiveness of the method. We have shown the effectiveness of the proposed
image contrast enhancement technique in Chapter 3. However, Weber et al. [123]
have demonstrated that static enhanced images are perceived as less enhanced or not
enhanced after a short period of adaptation. This adaptation to enhancement may
diminish the benefits of image enhancement (due to perceiving the enhanced images
as normal) and thus affect the subjective evaluation of enhanced videos. We therefore
conduct experiments, present the results, and discuss them.
5.3.2.1 Cluster Analysis and Key frame Extraction
We consider 3 movie trailers and plot the frame similarity graph for those trailers in Figure 5.3 and in Figure 5.4. We consider a sliding window mechanism with a window size of n = 16 frames and an overlapping set of p = 5 for all results in this chapter except Figure 5.4. For Figure 5.4, we use n = 19 and p = 3 to show the robustness of our method to n and p and to illustrate that the modularity algorithm automatically detects more than 2 clusters (3 clusters), as can be seen in the 2nd window. The frame similarity graph indicates the similarity between frames (red is the most similar and blue is the least). Since we transfer information across windows using an overlapping set, P, we are able to correctly detect clusters (shots) that are spread across consecutive windows in each of the videos shown in Figure 5.3 and in Figure 5.4. The ground truth values of the shot boundaries are obtained by manual inspection of the trailer sequences. There are 2 distinct groups that we can visualize. One is the sliding window of frames, W, with overlapping frames, P, and the other is the cluster that may or may not be spread across different windows. We have also shown the results of key frame extraction for every unique cluster. One important aspect to note is that this key frame is locally the most representative key frame of a cluster. It does not need to be globally the most similar frame to the other frames in that cluster (which may be spread across different windows).

A potential concern regarding the clustering mechanism is its behavior in a slowly changing scene. One such example is shown in Figure 5.3(b). Consider the 2nd and 3rd windows. That is an example of a dissolve transition where one shot gradually blends into another shot. The images from this transition period are shown in Figure 5.5. Even in such instances, the modularity clustering algorithm is able to detect shot boundaries.

Notice that for different values of n, the results of key frame detection may change. However, the shot boundary detection will be consistent for a cut transition, as shown in Figure 5.4. We discuss the effects of choosing the key frame in Section 5.3.2.2. In order to check for the correctness of the key frame, we consider the video whose similarity graph is shown in Figure 5.3(a). We consider the 1st, 4th and 5th clusters and the respective key frames for those clusters. We compute how similar each frame is to the other frames in that cluster, and then find the mean of that similarity measure.
Figure 5.3: Frame similarity graph along with cluster and key frame information for (a) Video 1 and (b) Video 2. The ground truth shot boundaries are shown with black lines on the upper right of the diagonal and the detected shot boundaries (indicated by a change of cluster label index) are shown with pink lines on the lower left of the diagonal. The key frame that is being used for enhancement of every cluster is shown with a green square marker
Figure 5.4: Frame similarity graph along with cluster and key frame information for a sequence (Video 3) with 3 clusters
As can be seen in Figure 5.6, for all 3 clusters, the key frame is locally the most similar to the other frames in its cluster. We also plotted the similarity measure of the frame that is the mid-point in the sequence of that cluster. This frame is considered to be the key frame by the online clustering process of [53]. However, as seen in Figure 5.6, the key frame detected by [53] need not be the most representative frame of the cluster. Apart from video summarization [32], key frame detection can also find application in video enhancement, as shown in Section 5.3.2.2.

5.3.2.2 Removal of flashing artifacts

In order to demonstrate video enhancement, we consider a part of the video whose similarity graph is shown in Figure 5.3(b). We consider the 3rd cluster.
Figure 5.5: Slowly changing scene where we estimated the shot boundary (black rectangle). Images in the sequence are ordered from left to right and top to bottom

Figure 5.6: Similarity measure of the key frame (mean similarity per frame; the key frame using our method, the key frame using the mid-point of the cluster, and the cluster boundaries within a window are marked)
We enhance frames from that video sub-sequence using 2 techniques: 1) on a frame-by-frame basis; and 2) using our online method (enhancing all frames using the parameters of the local key frame). For that video sub-sequence, we track an object and calculate the mean intensity of the tracked object through the length of the sequence.
For the tracked object whose plot can be seen in Figure 5.7, the original intensity
(shown in blue) is fairly consistent throughout the sequence. Enhancement on a frame-
by-frame basis (shown in green) results in intensity fluctuations due to differences in
the estimation of parameters. This results in undesirable flash and flickering artifacts
that can be seen as unnatural peaks (at frame numbers 2, 13 and 21). Similar effects
can also be observed when the clustering method has a very low precision rate. This
will cause multiple estimation of enhancement parameters for the same scene result-
ing in flashing artifacts. In the degenerate case this is analogous to frame-by-frame
enhancement.
Figure 5.7: Comparison of enhancement techniques for the tracked object (temporal consistency and significance of the key frame). The curves show the mean intensity of the tracked rectangle per frame for the original sequence, frame-by-frame enhancement, our enhancement, and enhancement with the global key frame
We can see that using our online method (enhance all frames using parameters of just the key frame) reduces fluctuations (shown in red). The standard deviation of the mean intensity of the tracked object across the length of the sequence, σ, is 0.46 for the original sequence and 2.81 for the frame-by-frame enhancement approach. The σ using our approach (1.88) is less than the σ of the frame-by-frame enhancement approach. This gives us an indication of the reduction of artifacts by our approach.
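This flicker measure is easy to reproduce: for a fixed tracked rectangle, compute the mean intensity per frame and take the standard deviation over the sequence. The sketch below assumes the rectangle coordinates come from an external tracker; it is an illustration, not our evaluation code.

# Sketch of the flicker measure: standard deviation of the mean intensity of a
# tracked rectangle across the frames of a sub-sequence.
import numpy as np

def flicker_score(frames, rect):
    """frames: list of 2-D grayscale arrays; rect: (x, y, w, h) of the tracked object."""
    x, y, w, h = rect
    means = [frame[y:y + h, x:x + w].mean() for frame in frames]
    return float(np.std(means)), means

# Lower scores indicate fewer flash/flicker artifacts; the original sequence,
# frame-by-frame enhancement, and our clustered enhancement can be compared directly.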
Since we want an online system, we use parameters of the local key frame to en-
hance the frames belonging to that cluster instead of delaying the enhancement process
to account for the global key frame. We also enhanced this sub-sequence using param-
eters of the global key frame of the cluster. The global key frame was obtained offline
by considering the entire shot and then estimating the key frame. The estimation of
global and local key frames was different. As shown in Figure 5.7, using parameters
of the global key frame also results in consistent enhancement (shown in black). Since
the global or local key frame does not result in a significant difference in estimation,
we can say that our enhancement process is not dependent on the estimation of the
key frame. Using any other frame from the same cluster instead of the key frame also
results in stable results. However given a cluster, using the key frame quantitatively
gives us lower standard deviation than using other frames.
We also computed the average of the estimated parameters from all frames in a
cluster and used that to enhance that cluster. However, there was not a significant dif-
ference in the visual quality of the enhanced video. Moreover, parameter estimation is
a costly process. Since we aim to have a real-time system, we want low computational
cost. The current piece-wise linear curve that is shown in Figure 5.7 is sufficiently
smooth for the human eye as there are no visible temporal artifacts, thus making our
enhancement perceptually temporally consistent. Achieving a smoother curve at the
cost of speed without a visible change in human perception is undesirable in a real-
time system. Thus key frame estimation makes our method robust and fast and helps
in achieving stable video enhancement.
5.3.2.3 Significance of Clustering
Here we discuss how proper cluster estimation is critical to the enhancement process.
We consider a sub-sequence consisting of 2 clusters. The first cluster is an overexposed
set of frames, and the second one is an underexposed set with a dissolve transition from
frames 9 to 12 as can be seen in Figure 5.8. Our method successfully detects 2 clusters,
estimates the key frame for each cluster, and then enhances the sequence accordingly.
As a result, our framework correctly enhances each cluster, resulting in consistent
enhancement. It correctly reduces the intensity of overexposed frames and increases
the intensity of underexposed frames. While doing so, an interesting phenomenon is observed: there is a sudden change in the mean intensity from frame 9 to frame 10, which happens for the intermediate transition frames. The transition occurs from an overexposed sub-sequence to an underexposed one, and the enhancement parameters calculated from the key frame of the underexposed sub-sequence are applied to the transition frames, which contain content from both the overexposed and the underexposed parts; this causes a jump in the mean intensity of the frame. This jump is not noticeable during normal viewing of the video, which we verified by playing the video at different frame rates (3 fps to 24 fps) and collecting responses from 5 different human observers. Moreover, the transition frames carry no coherent information, so such effects go unnoticed by human viewers.
However, applying normalized cut [105] with the eigengap criterion [108] results in undesirable enhancement. The eigengap criterion estimates only 1 cluster, and therefore normalized cut [105] incorrectly enhances the underexposed cluster using the key frame from the overexposed frames.
This results in a decrease in intensity for the already underexposed frames. These
effects can be observed in a clustering method having a low recall rate. Thus proper
clustering is extremely critical for appropriate video enhancement. We compared these
two enhancements (ours and using [105]) by conducting preference tests on the same
5 human observers as above and all of them preferred our enhancement.
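For reference, a common form of the eigengap heuristic [108] picks the number of clusters at the largest gap between consecutive eigenvalues of the normalized graph Laplacian. The sketch below is one generic variant, not necessarily the exact formulation used in this comparison.

import numpy as np

def estimate_num_clusters(W, max_k=10):
    """Eigengap heuristic: choose k at the largest gap between consecutive
    eigenvalues of the normalized graph Laplacian built from the frame
    affinity matrix W. One common variant among several."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:max_k + 1]
    gaps = np.diff(eigvals)           # gaps between consecutive eigenvalues
    return int(np.argmax(gaps)) + 1   # k = position of the largest gap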
[Figure 5.8: Significance of clustering. The plot shows the mean intensity of each frame for the original sequence, the enhancement using normalized cut, our enhancement, and the cluster boundary. Failure to detect clusters using the existing method results in improper enhancement, shown by the green curve, which reduces the contrast of the underexposed frames (frames 13–27); this problem is not present with our method, shown by the red curve.]
5.3.2.4 Human Evaluation
To establish the effectiveness of the enhancement method, we enhanced 24 High Definition (HD) videos using our framework. On average, each video is approximately 30 seconds long. All the videos were enhanced offline. We have uploaded YouTube versions of some of the original and enhanced videos at “http://iris.usc.edu/people/achoudhu/video.html”. As mentioned earlier, all these videos are movie trailers from the Apple website. Each video has multiple transitions, and the number and combination of transitions differ for every video.
Specifically, 16 videos have at least 1 dissolve transition and 9 videos have at least 1
other gradual transition. Other gradual transitions may include fade-in/fade-out/wipe
transitions.
Since the videos are in the ‘.mov’ format, we present the original and the enhanced version simultaneously using the QuickTime Pro player. The placement of the original and the enhanced version was random (either the top or the bottom of the screen) to remove any bias that may exist while selecting a video. We asked each of the 16 human observers independently to rate the video on the “top” as “Better”, “Same” or “Worse” relative to the video on the “bottom” of the screen and recorded their responses. Scores were assigned as 1 for Better, 0 for Same and -1 for Worse. These responses were used to evaluate the preference for the enhanced videos. Since each video
has very diverse content, the observers were asked to view the entire video trailer and
were also given the liberty to view the video multiple times (if required) before rating
the video. This was repeated for all 24 videos in the database.
The experiments were conducted with each subject independently. Each subject was seated at a distance of roughly 2 feet from the computer monitor and was free to move closer to or farther from the screen as convenient. The screen is around 24" (diagonal), and the subjects were seated perpendicular to the center of the screen.
Figure 5.9 shows the preference ratings for the enhanced videos when compared
with the original videos. The histogram in Figure 5.10 gives a better visualization
and we can see that the observers show a strong preference for the enhanced videos
compared with the original videos (preference > 0). We use the Wilcoxon signed-rank test [124] to check for statistical significance and find that the preference for the enhanced videos differs significantly from that for the original videos (the null hypothesis of a zero median difference is rejected at the 5% level). Comparing the ranks shows that the enhanced videos rank higher than the original videos, implying that observers have a significant preference for the enhanced videos.
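For illustration, the significance test described above can be reproduced with a standard statistics package. The preference values in this sketch are placeholders, not the actual experimental scores.

import numpy as np
from scipy.stats import wilcoxon

# Mean preference score per video (1 = Better, 0 = Same, -1 = Worse),
# averaged over the 16 observers; these values are placeholders only.
prefs = np.array([0.44, 0.12, 0.62, -0.19, 0.31, 0.25, 0.06, 0.50])

# One-sample Wilcoxon signed-rank test of the null hypothesis that the
# median of the preference scores is zero [124].
stat, p_value = wilcoxon(prefs)
print(f"W = {stat}, p = {p_value:.4f}")  # p < 0.05 -> preference is significant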
[Figure 5.9: Preference ratings for the enhanced videos (preference for our selection, from Worse to Better, plotted per video label).]
The preference for the enhanced videos can also be analyzed using a variation of the ROC approach described by Peli et al. [91], which gives an impression of the perceived quality of the enhanced videos with respect to the original videos. Unlike regular ROC analysis, where ground-truth information is provided, no such information is present in this case. The raw data consist of the observers' preferences regarding the perceived quality of the enhanced videos with respect to the original videos. The axes of the ROC plots therefore have a different interpretation: the Y-axis, which originally corresponded to the true positive rate, can be considered equivalent to the proportion of enhanced videos with higher perceived quality, and the X-axis, which originally corresponded to the false positive rate, can be considered equivalent to the proportion of original videos with higher perceived quality.

[Figure 5.10: Histogram of mean preference ratings for all observers (number of videos versus preference: Worse/Same/Better).]
We consider the area under the ROC curve [61] (A_z) as a measure of the perceived quality of the enhancements. For the original videos, A_z = 0.5. If A_z > 0.5, the perceived quality of the enhanced videos can be considered better than that of the original videos. On the other hand, if the perceived quality of the original videos is better than that of the enhanced videos, then A_z < 0.5. The empirical values of A_z for the different observers are shown in Figure 5.11. Since the average over all observers is A_z = 0.6108 ± 0.0968, the enhanced videos are deemed to have better quality than the original videos. Using the Wilcoxon signed-rank test [124], the enhanced videos were found to have significantly better quality than the original videos.
[Figure 5.11: A_z values for all observers, plotted per subject label.]
Although the preference for our enhancements is statistically significant, in some cases, as shown in Figure 5.9, the observers prefer the original video (preference ≤ 0). Apart from personal preferences, this is due to the characteristics of the video. Some of these videos already have good quality, and the observers prefer less or no enhancement of such videos. Also, since we enhance High Definition (HD) trailers of popular movies, in some cases the observers know the content of the original movie and use that information in judging their preference. The observers did not report flashing artifacts in our enhanced videos.
5.3.2.5 Computational Cost
Estimating the illumination component of a video frame, which is part of the image enhancement process, is computationally expensive. We therefore use a GPU (Graphics Processing Unit) implementation by Kharlamov and Podlozhnyuk [70] on a PC with an Intel i5-680 processor and an NVidia GeForce GTX 480 GPU. We build our framework in a Visual C++ environment and use the GPU only for illumination estimation. For a 1920 × 1080 image, illumination component estimation runs at around 10 fps. At the cost of potential minor artifacts, the speed can be further increased by using approximations where only one distance per patch is computed. Due to this bottleneck, for a 1080p video (each frame has 1920 × 1080 resolution) with typically 45 I-frames, our enhancement framework runs at approximately 3.5 fps, and for a 720p video it runs at approximately 6.4 fps. The other bottlenecks are the floating-point operations required to compute the reflectance (the ratio of the input image to the illumination), to achieve white balance (the normalized ratio of the pixel values to the illuminant color), and a few other operations such as the actual modification of pixel values. This time also includes the computation required to first decode the video into frames and then re-encode the frames into a video, for which we use the FFMPEG library. As future work, we would like to improve the computational speed of the framework.
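To make the per-pixel operations mentioned above concrete, the following sketch writes out the reflectance and white-balance computations. The array names and the unit-norm illuminant normalization are illustrative assumptions; the actual GPU implementation differs.

import numpy as np

def reflectance_and_white_balance(image, illumination, illuminant_color, eps=1e-6):
    """image: HxWx3 float array in [0, 1]; illumination: HxW estimate of the
    illumination component; illuminant_color: length-3 illuminant estimate."""
    # Reflectance: ratio of the input image to the estimated illumination.
    reflectance = image / (illumination[..., None] + eps)

    # White balance: divide each channel by the normalized illuminant color,
    # scaled so that a neutral (gray) illuminant leaves the image unchanged.
    illum = np.asarray(illuminant_color, dtype=np.float64)
    illum = illum / (np.linalg.norm(illum) + eps)
    balanced = image / (np.sqrt(3.0) * illum[None, None, :] + eps)

    return reflectance, np.clip(balanced, 0.0, 1.0)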
5.3.3 Video Segmentation
Clustering a video sequence into shots finds interesting applications in video segmenta-
tion. We consider the video segmentation method proposed by Grundmann et al. [60].
This method considers a video volume and tracks segments through a spatio-temporal
volume. It uses a hierarchical graph-based technique in which higher levels have a decreasing number of regions, obtained by merging regions from the previous level.
The authors have shown consistent results over long video sequences that also include
movies. In order to maintain temporal consistency, the authors partition the video into sub-sequences of n = 25 frames and use p = 8 overlap frames from the previous sub-sequence to the current sub-sequence [60]. The performance of the method is not critically dependent on the values of n and p.
Movies combine a diverse range of frame sequences. Using a fixed number of frames to partition the video decreases the robustness of the existing method because of information mismatch across different shots of the video. We propose to analyze the incoming video stream by dividing the sequence into clusters of similar frames and using this information to segment the video. When choosing n and p, the method should be modified such that frames from the same cluster are chosen. This ensures that segmentation labels remain consistent through the length of the shot; a change of segmentation label after the shot may be acceptable. The results of using cluster information are shown in Figure 5.12. As shown in the bottom-row left image of Figure 5.12, our results are more robust to illumination changes. At no level of the hierarchy is the original method (middle row) able to segment the face in the image. As shown in the right column of Figure 5.12, the existing method is not able to retrieve a structure (merged with the background water) that was segmented by our method; for lower levels of the hierarchy, it gave that structure a different label than other similar structures, which is not desirable. The maximum hierarchical level is set when the maximum number of regions falls below a threshold (10).

[Figure 5.12: Segmentation Results. Top row: original images. Middle row: segmentation results using [60]. Bottom row: segmentation results of [60] using cluster information. The left column shows results at 90% of the maximum hierarchical level and the right column shows results at 70% of the maximum hierarchical level.]
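The cluster-aware choice of partitions described above can be sketched as follows. This is a simplified illustration under our own assumptions: `cluster_ids` stands for the per-frame cluster labels produced by our clustering step, and the snapping logic is illustrative rather than the exact modification made inside the implementation of [60].

def cluster_aware_partitions(cluster_ids, n=25, p=8):
    """Partition frame indices into sub-sequences of roughly n frames with
    p overlap frames, snapping boundaries so that the overlap frames come
    from a single cluster. Simplified sketch, not the exact implementation."""
    partitions, start = [], 0
    total = len(cluster_ids)
    while start < total:
        end = min(start + n, total)
        # If a cluster change falls inside the would-be overlap region,
        # cut the partition at that cluster boundary instead.
        for i in range(max(start + 1, end - p), end):
            if cluster_ids[i] != cluster_ids[i - 1]:
                end = i
                break
        partitions.append((start, end))
        if end >= total:
            break
        start = max(end - p, start + 1)  # reuse p overlap frames from one cluster
    return partitions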
5.4 Summary

We have proposed a novel approach to perform video enhancement. Our system is online, automatically clusters video sequences, and extracts a key frame from each cluster. For the clustering, we have used a new graph partitioning technique called modularity that has so far been used only in social network analysis. We quantitatively evaluated this clustering method on the TRECVid 2001 dataset and compared it with existing methods. To extract key frames, we perform eigen analysis on the modularity matrix. We enhance video sequences using the parameters estimated from the key frame. We have shown that using clustering information results in robustness and a significant reduction of flash and flickering artifacts compared to frame-by-frame enhancement. Correct cluster estimation is critical to the enhancement process. Extraction of the key frame increases the efficiency of our method, though it is not critical to the enhancement process; enhancing a scene using any frame from that cluster also gives stable results. We have shown that our technique indeed results in enhanced videos and validated this by conducting experiments with human observers. We have also used our framework to produce stable video segmentation.
Chapter 6
Conclusion
6.1 Summary of Contributions
In this dissertation, we have presented enhancement techniques to improve the contrast
and sharpness of images. While enhancing the contrast of images, we have also devel-
oped novel methods to achieve color constancy. We have also developed a framework
to achieve video enhancement. Encouraging experimental results, both quantitative
and visual, along with validation by human observers, demonstrate the effectiveness and robustness of our approach.

My contributions at each step of the process are as follows:
• Chapter 2 (Color Constancy)
– Implementation of a novel technique based on denoising methods to esti-
mate illumination and illuminant color
– Implementation of a novel technique to achieve color constancy based on
our statistical observation of the ratio of the standard deviation of color
channels
• Chapter 3 (Color Contrast Enhancement)
– Implementation of novel mapping functions based on human perception to
enhance contrast of images
– Using segmentation information about images to remove halo effects
– Automatic estimation of enhancement parameters irrespective of image
quality and based on the preference of people
– Conducting psychophysical experiments to show preference by normally
sighted and simulated visually impaired people for our enhancements
• Chapter 4 (Sharpness Enhancement)
– Implementation of a hierarchical approach based on modified Laplacian
pyramid using Non-Local Means filter
– Using segmentation information about images to remove halo effects
– Introduction of a new metric to estimate sharpness quality, thus helping in
automatic enhancement of sharpness of images
• Chapter 5 (Video Enhancement)
– Introduction of a novel framework for video enhancement based on scene
change detection that ensures temporally consistent enhancement
– Using a novel graph-based clustering technique called modularity opti-
mization to detect scene changes
– Application of the clustering technique to improve temporal segmentation
of videos
6.2 Future Work
Here we take a look at future directions of our approach. The first direction of future
work could be personalization of image enhancement. Since visual experience is in-
trinsically subjective, our initial results have shown that there exists a large variation
in the estimation of optimal parameters of enhancement when tested on both normally
sighted and visually impaired people. This variation could be due to individual differ-
ences in the cognitive conditions of the people or due to the effect of AMD on people’s
retinas or other reasons. Therefore, one way to move forward is to personalize the en-
hancement process. We can use Machine Learning techniques to learn the kind of enhancements that people like, or the type of enhancements that are beneficial to them, and then efficiently find these optimal enhancement parameters in high-dimensional spaces.
Currently, we have tested our enhancements by displaying images and videos on television screens or monitor displays. However, we would like to test whether our enhancements can help people in their day-to-day life. The constraint we face is that the tasks must be tested in a controlled manner. This can be done by implementing the enhancement algorithms in augmented-reality devices such as the STAR 1200 eyewear by Vuzix (http://www.vuzix.com/ar/products_star1200.html). The idea is to collect the images/video captured from the camera on the eyewear and enhance them in the device itself. These enhanced images can then be sent to the eyes of the viewer for testing. This will help us understand the effectiveness of our enhancement while people perform tasks in the real world in real time. Examples of these tasks may include making a sandwich, pouring water into a glass, searching for a tool on a shelf, and building a prototype (e.g., http://www.ikea.com.kw/en/children-ikea.html?page=shop.product_details&flypage=flypage.tpl&product_id=475&category_id=13).
Future work on color constancy may include ways to combine different approaches in one framework to get better results. It may include Machine Learning techniques that determine whether a particular method/filter works better for a particular category of images, and then achieve white balance based on those classification results. It may also be worthwhile to explore how the filter parameters should be set automatically depending on image characteristics. Since we estimate only one illuminant color, our proposed methods will fail for scenes with multiple illumination conditions; they will give an estimate that is a combination of all the illuminants.
During our sharpness enhancement experiments, we found that the preferred level of enhancement depends on the category of images being enhanced. Also, people do not necessarily prefer global sharpness enhancement; they may want particular regions of the image to be enhanced more than others. As future work, prior information about the images, such as the image category or visual saliency information, can be exploited to enhance the sharpness of images.
In video enhancement, we would like to evaluate our clustering method with other similarity measures, such as wavelet-based features or motion estimation. We would also like to further improve the speed of our framework. In addition, it would be interesting to see whether clustering in video helps other problems such as video denoising, and whether video contrast enhancement further improves the performance of video segmentation.
References
[1] Archives of Ophthalmology, 122:564–572, 2004.
[2] J. Eng. ROC analysis: web-based calculator for ROC curves. Baltimore: Johns Hopkins University, 2006.
[3] W. Abd-Almageed. Online simultaneous shot boundary detection and key
frame extraction for sports videos using rank tracing. In IEEE International
Conference on Image Processing, pages 3200 –3203, Oct 2008.
[4] V . Agarwal, A.V . Gribok, A.F. Koschan, and M.A. Abidi. Estimating illumina-
tion chromaticity via kernel regression. In IEEE International Conference on
Image Processing(ICIP), pages 981–984, Atlanta, USA, 2006.
[5] Michael Ashikhmin. A tone mapping algorithm for high contrast images.
In Eurographics workshop on Rendering(ERGW), pages 145–156, Pisa, Italy,
2002.
[6] Soonmin Bae, Sylvain Paris, and Frédo Durand. Two-scale tone management for photographic look. ACM Transactions on Graphics, 25:637–645, 2006.
[7] K. Barnard, L. Martin, A. Coath, and B.V . Funt. A comparison of computa-
tional color constancy algorithms-part ii: Experiments with image data. IEEE
Transactions on Image Processing, 11(9):985–996, September 2002.
[8] Kobus Barnard, Lindsay Martin, Brian Funt, and Adam Coath. A data set for
color research. Color Research and Application, 27(3):147–151, 2002.
[9] Eric P. Bennett and Leonard McMillan. Video enhancement using per-pixel
virtual exposures. ACM Transactions on Graphics, 24(3):845–852, 2005.
[10] M. Bertalmio, V . Caselles, and E. Provenzi. Issues about retinex theory and
contrast enhancement. International Journal of Computer Vision, 83(1):101–
119, June 2009.
[11] S. Bianco, G. Ciocca, C. Cusano, and R. Schettini. Improving color constancy
using indoor-outdoor image classification. IEEE Transactions on Image Pro-
cessing, 17(12):2381–2392, December 2008.
[12] David H. Brainard and William T. Freeman. Bayesian color constancy. Journal
of the Optical Society of America A, 14(7):1393–1411, 1997.
[13] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Nonlocal image and
movie denoising. International Journal of Computer Vision, 76(2):123–139,
2008.
[14] G. Buchsbaum. A spatial processor model for object colour perception. Journal of the Franklin Institute, 310(1):1–26, 1980.
[15] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact im-
age code. IEEE Transactions on Communications, COM-31,4:532–540, 1983.
[16] Vlad C. Cardei and Brian Funt. Committee-based color constancy. In Color
Imaging Conference, pages 58–60, Scottsdale, USA, 1996.
[17] Vlad C. Cardei, Brian Funt, and Kobus Barnard. Estimating the scene illumina-
tion chromaticity by using a neural network. Journal of the Optical Society of
America, 19(12):2374–2386, 2002.
[18] J. Caviedes and S. Gurbuz. No-reference sharpness metric based on local edge
kurtosis. In IEEE International Conference on Image Processing, 2002.
[19] A. Chakrabarti, K. Hirakawa, and T.E. Zickler. Color constancy beyond bags
of pixels. In IEEE Conference on Computer Vision and Pattern Recogni-
tion(CVPR), pages 1–6, Anchorage, USA, 2008.
[20] V . Chasanis, A. Likas, and N. Galatsanos. Video rushes summarization using
spectral clustering and sequence alignment. In TRECVid Video Summarization
Workshop, pages 75–79, 2008.
[21] Jiawen Chen, Sylvain Paris, and Frédo Durand. Real-time edge-aware image
processing with the bilateral grid. ACM Transactions on Graphics, 26(3):103,
2007.
[22] Z. Chen, B.R. Abidi, D.L. Page, and M.A. Abidi. Gray-level grouping (glg): An
automatic method for optimized image contrast enhancement – part i: The basic
method. IEEE Transactions on Image Processing, 15(8):2290–2302, August
2006.
[23] Anustup Choudhury and Gerard Medioni. Hierarchy of non-local means for
preferred automatic sharpness enhancement and tone mapping. Submitted to
IEEE Transactions on Visualization and Computer Graphics.
[24] Anustup Choudhury and Gerard Medioni. Color constancy using denoising
methods and cepstral analysis. In IEEE International Conference on Image
Processing(ICIP), pages 1637–1640, Cairo, Egypt, 2009.
[25] Anustup Choudhury and Gerard Medioni. Perceptually motivated automatic
color contrast enhancement. In Color and Reflectance in Imaging and Com-
puter Vision Workshop, IEEE International Conference on Computer Vi-
sion(ICCV), pages 1893–1900, Kyoto, Japan, 2009.
[26] Anustup Choudhury and Gerard Medioni. Color constancy using standard de-
viation of color channels. In International Conference on Pattern Recogni-
tion(ICPR), Istanbul, Turkey, 2010.
[27] Anustup Choudhury and Gerard Medioni. Color contrast enhancement for vi-
sually impaired people. In Workshop on Computer Vision Applications for the
Visually Impaired, IEEE Conference on Computer Vision and Pattern Recogni-
tion(CVPR), pages 33–40, San Francisco, USA, 2010.
[28] Anustup Choudhury and Gerard Medioni. Perceptually motivated auto-
matic color contrast enhancement based on novel color constancy estimation.
EURASIP Journal on Image and Video Processing, Special Issue on ’Emerging
Methods for Color Image and Video Quality Enhancement’, 2010.
[29] Anustup Choudhury and Gerard Medioni. Perceptually motivated automatic
color contrast enhancement using hierarchy of non-local means. In Color and
Photometry in Computer Vision Workshop, IEEE International Conference on
Computer Vision(ICCV), Barcelona, Spain, 2011.
[30] Anustup Choudhury and Gerard Medioni. A framework for robust online video
contrast enhancement using modularity optimization. IEEE Transactions on
Circuits and Systems for Video Technology, 2012.
[31] Prasun Choudhury and Jack Tumblin. The trilateral filter for high contrast im-
ages and meshes. In ACM SIGGRAPH 2005, page 5, Los Angeles, California,
2005.
[32] G Ciocca and R Schettini. Dynamic key-frame extraction for video summariza-
tion. In Internet imaging VI, pages 137 – 142, 2005.
[33] Florian Ciurea and Brian V . Funt. A large image database for color constancy
research. In Color Imaging Conference, pages 160–164, Scottsdale, USA,
2003.
[34] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(5):603–619, 2002.
[35] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, Aug. 2007.
[36] Uros Damnjanovic, Tomas Piatrik, Divna Djordjevic, and Ebroul Izquierdo.
Video summarisation for surveillance and news domain. In Semantic and Digi-
tal Media Technologies, pages 99–112, 2007.
[37] H. Denman and A. Kokaram. A multiscale approach to shot change detection.
In Irish Machine Vision and Image Processing, 2004.
[38] F. Drago, K. Myszkowski, T. Annen, and N. Chiba. Adaptive logarithmic map-
ping for displaying high contrast scenes. Computer Graphics Forum, 22:419–
426, 2003.
[39] Frédo Durand and Julie Dorsey. Fast bilateral filtering for the display of high-
dynamic-range images. In ACM SIGGRAPH 2002, pages 257–266, San Anto-
nio, Texas, 2002.
[40] Zeev Farbman, Raanan Fattal, Dani Lischinski, and Richard Szeliski. Edge-
preserving decompositions for multi-scale tone and detail manipulation. In
ACM SIGGRAPH 2008, pages 1–10, Los Angeles, USA, 2008.
[41] Raanan Fattal. Edge-avoiding wavelets and their applications. ACM Trans.
Graph., 28:1–10, July 2009.
[42] Raanan Fattal, Maneesh Agrawala, and Szymon Rusinkiewicz. Multiscale
shape and detail enhancement from multi-light image collections. In ACM SIG-
GRAPH 2007, pages 51–59, San Diego, USA, 2007.
[43] Raanan Fattal, Dani Lischinski, and Michael Werman. Gradient domain high
dynamic range compression. In ACM SIGGRAPH 2002, pages 249–256, 2002.
[44] Yu fei Ma, Jia Sheng, Yuan Chen, and Hong jiang Zhang. Msr-asia at trec-10
video track: Shot boundary detection. In TREC 2001, 2001.
[45] G. D. Finlayson, S. D. Hordley, and I. Tastl. Gamut constrained illuminant
estimation. International Journal of Computer Vision, 67(1):93–109, 2006.
[46] G.D. Finlayson, S.D. Hordley, and P.M. Hubel. Color by correlation: A simple,
unifying framework for color constancy. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 23(11):1209–1221, November 2001.
[47] Graham D. Finlayson and Elisabetta Trezzi. Shades of gray and colour con-
stancy. In Color Imaging Conference, pages 37–41, Scottsdale, USA, 2004.
[48] D. A. Forsyth. A novel algorithm for color constancy. International Journal of
Computer Vision, 5(1):5–36, 1990.
[49] Matthew Fullerton and E. Peli. Post transmission digital video enhancement for
people with visual impairments. Journal of the Society for Information Display,
14(1):15–24, 2006.
[50] B. Funt, F. Ciurea, and J. McCann. Retinex in matlab. In Color Imaging Con-
ference, pages 112–121, Scottsdale, USA, 2000. IS and T.
[51] Y . Gao, J. Tang, and X. Xie. Key frame vector and its application to shot re-
trieval. In International workshop on Interactive multimedia for consumer elec-
tronics, pages 27–34, 2009.
[52] Rumi Ghosh and Kristina Lerman. Community detection using a measure of
global influence. In SNA Workshop, SIGKDD Conference on Knowledge Dis-
covery and Data Mining(KDD), Washington DC, USA, 2008.
[53] Ciocca Gianluigi and Schettini Raimondo. An innovative algorithm for key
frame extraction in video summarization. Journal of Real-Time Image Process-
ing, 1:69–88, 2006.
[54] A. Gijsenij and T. Gevers. Color constancy using natural image statistics. In
IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pages
1–8, Minneapolis, USA, 2007.
[55] A. Gijsenij and Theo Gevers. Color constancy by local averaging. In Compu-
tational Color Imaging Workshop, International Conference on Image Analysis
and Processing(ICIAP), pages 171–174, Washington DC, USA, 2007.
[56] K.H. Goh, Yong Huang, and L. Hui. Automatic video contrast enhancement. In
IEEE Int. Symposium on Consumer Electronics (ICCE), pages 359–364, Read-
ing, UK, sep. 2004.
[57] O. Goldschmidt and D. S. Hochbaum. Polynomial algorithm for the k-cut prob-
lem. In 29th Annual Symposium on Foundations of Computer Science (SFCS),
pages 444–451, 1988.
[58] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing (3rd Edi-
tion). Prentice-Hall, Inc., NJ, USA, 2006.
[59] H. Greenspan, C.H. Anderson, and S. Akber. Image enhancement by nonlin-
ear extrapolation in frequency space. IEEE Transactions on Image Processing,
9(6):1035–1048, Jun 2000.
[60] M. Grundmann, V. Kwatra, M. Han, and I. A. Essa. Efficient hierarchical graph-based video segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010.
[61] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
[62] C. Hentschel. Video moire cancellation filter for high-resolution crts. In
IEEE International Conference on Consumer Electronics(ICCE), pages 200–
201, Japan, 1999.
[63] C. Hentschel and D. La Hei. Effective peaking filter and its implementation on
a programmable architecture. In IEEE International Conference on Consumer
Electronics(ICCE), pages 330–331, Japan, 1999.
[64] G. Hines, Z.-U. Rahman, D. Jobson, and G. Woodell. Dsp implementation of
the retinex image enhancement algorithm. In Visual Information Processing
XIII, pages 13–24, 2004.
[65] S. D. Hordley and G. D. Finlayson. Re-evaluating colour constancy algorithms.
In International Conference on Pattern Recognition(ICPR), pages 76–79, Cam-
bridge, UK, 2004.
[66] Chun-Rong Huang, Huai-Ping Lee, and Chu-Song Chen. Shot change detection
via local keypoint matching. IEEE Transactions on Multimedia, 10(6):1097 –
1108, Oct. 2008.
[67] D. Jobson, Z. Rahman, and G. Woodell. A multiscale retinex for bridging the
gap between color images and the human observation of scenes. IEEE Trans-
actions on Image Processing, 6(7):965–976, 1997.
[68] Daniel Jobson, Zia ur Rahman, and Glenn A. Woodell. The statistics of visual
representation. In Visual Information Processing, Sydney, Australia, 2002.
[69] Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski.
High dynamic range video. In ACM SIGGRAPH 2003, pages 319–325, San
Diego, USA, 2003.
[70] Alexander Kharlamov and V. Podlozhnyuk. Image denoising. NVIDIA Technical Report, June 2007.
[71] Ron Kimmel, Michael Elad, Doron Shaked, Renato Keshet, and Irwin Sobel.
A variational framework for retinex. International Journal of Computer Vision,
52(1):7–23, 2003.
[72] M Kolomenkin, I. Shimshoni, and A. Tal. Prominent field for shape analysis of
archaeological artifacts. In Workshop on eHeritage & Digital Art Preservation,
IEEE International Conference on Computer Vision(ICCV), 2009.
[73] Eric Paul Krotkov. Active Computer Vision by Cooperative Focus and Stereo.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1989.
[74] E. H. Land. The retinex theory of color vision. Scientific American,
237(6):108–128, December 1977.
[75] Gregory Ward Larson, Holly Rushmeier, and Christine Piatko. A visibility
matching tone reproduction operator for high dynamic range scenes. IEEE
Transactions on Visualization and Computer Graphics, 3(4):291–306, 1997.
[76] Wei-Kuang Li and Shang-Hong Lai. A motion-aided video shot segmentation
algorithm. In Pacific Rim Conference on Multimedia, pages 336–343, 2002.
[77] Yuanzhen Li, Lavanya Sharan, and Edward H. Adelson. Compressing and com-
panding high dynamic range images with subband architectures. In ACM SIG-
GRAPH 2005 Papers, pages 836–844, 2005.
[78] Rainer Lienhart. Comparison of Automatic Shot Boundary Detection Algo-
rithms. In SPIE Storage and Retrieval for Image and Video Databases, pages
290–301, 1999.
[79] Peng Lin and Yeong-Taeg Kim. An adaptive color transient improvement al-
gorithm. IEEE Transactions on Consumer Electronics, 49(4):1326–1329, Nov.
2003.
[80] R. Lu, A. Gijsenij, T. Gevers, D. Xu, V . Nedovic, and J. M. Geusebroek. Color
constancy using 3d stage geometry. In IEEE International Conference on Com-
puter Vision(ICCV), pages 1749–1756, Kyoto, Japan, 2009.
[81] Jean-Luc Starck, Fionn Murtagh, Emmanuel J. Candès, and David L. Donoho. Gray and color image contrast enhancement by the curvelet transform. IEEE Transactions on Image Processing, 12:706–717, 2003.
[82] Tom Malzbender, Dan Gelb, and Hans Wolters. Polynomial texture maps. In
ACM SIGGRAPH 2001, pages 519–528, Los Angeles, USA, 2001.
[83] M. E. J. Newman. Finding community structure in networks using the eigen-
vectors of matrices. Phys. Rev. E, 74(3):036104, Sep 2006.
[84] Andrew Y . Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Anal-
ysis and an algorithm. In Advances in NIPS, pages 849–856. MIT Press, 2001.
[85] Jean-Marc Odobez, Daniel Gatica-Perez, and Mael Guillemot. Video shot clus-
tering using spectral methods. In Workshop on Content-Based Multimedia In-
dexing (CBMI), pages 94–102, 2003.
[86] J. Orchard, M. Ebrahimi, and A. Wong. Efficient nonlocal-means denoising
using the svd. In IEEE International Conference on Image Processing(ICIP),
pages 1732–1735, San Diego, USA, 2008.
[87] Rodrigo Palma-Amestoy, Edoardo Provenzi, Marcelo Bertalmío, and Vincent
Caselles. A perceptually inspired variational framework for color enhancement.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3):458–
474, 2009.
[88] Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. Local Laplacian Filters:
Edge-aware Image Processing with a Laplacian Pyramid. In ACM SIGGRAPH
2011, 2011.
[89] Sumanta N. Pattanaik, James A. Ferwerda, Mark D. Fairchild, and Donald P.
Greenberg. A multiscale model of adaptation and spatial vision for realistic
image display. In SIGGRAPH ’98, pages 287–298, Orlando, USA, 1998.
[90] Sumanta N. Pattanaik, Jack Tumblin, Hector Yee, and Donald P. Greenberg.
Time-dependent visual adaptation for fast realistic image display. In ACM SIG-
GRAPH 2000, pages 47–54, New Orleans, USA, 2000.
[91] E. Peli, J. Kim, Y. Yitzhaky, R. B. Goldstein, and R. L. Woods. Wideband
enhancement of television images for people with visual impairments. Journal
of the Optical Society of America, 21:937–950, June 2004.
[92] Eli Peli. Recognition performance and perceived quality of video enhanced for
the visually impaired. Opthalmic and Physiological Optics, 25:543–555, 2005.
[93] P. Perona and J. Malik. Scale-space and edge detection using anisotropic
diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12(7):629 –639, jul 1990.
[94] Marcus J. Pickering and Stefan M. Rüger. Multi-timescale video shot-change
detection. In TREC 2001, 2001.
[95] Edoardo Provenzi, Luca De Carli, Alessandro Rizzi, and Daniele Marini. Math-
ematical definition and analysis of the retinex algorithm. Journal of the Optical
Society of America, 22(12):2613–2621, 2005.
[96] Edoardo Provenzi, Massimo Fierro, Alessandro Rizzi, L. De Carli, D. Gadia,
and Daniele Marini. Random spray retinex: A new retinex implementation
to investigate the local properties of the model. IEEE Transactions on Image
Processing, 16(1):162–171, 2007.
[97] Edoardo Provenzi, Carlo Gatta, Massimo Fierro, and Alessandro Rizzi. A spa-
tially variant white-patch and gray-world method for color image enhancement
driven by local contrast. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30:1757–1770, 2007.
[98] D. Purves, R. G. Lotto, and S. Nundy. Why we see what we do. American
Scientist, 90(3):236–243, 2002.
[99] Shaun Ramsey, J. Thmas Johnson III, and Charles Hansen. Adaptive temporal
tone mapping. In Computer Graphics and Imaging (CGIM), Kauai, Hawaii,
2004.
[100] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda. Photo-
graphic tone reproduction for digital images. ACM Transactions on Graphics,
21(3):267–276, 2002.
[101] Erik Reinhard, Mike Stark, Peter Shirley, and James Ferwerda. Photo-
graphic tone reproduction for digital images. ACM Transactions on Graphics,
21(3):267–276, July 2002.
[102] Alessandro Rizzi, Carlo Gatta, and Daniele Marini. From retinex to automatic
color equalization: issues in developing a new algorithm for unsupervised color
equalization. Journal of Electronic Imaging, 13(1):75–84, 2004.
[103] Charles Rosenberg, Thomas Minka, and Alok Ladsariya. Bayesian color con-
stancy with non-gaussian models. Neural Information Processing System, 2003.
[104] D. Shaked and I. Tastl. Sharpness measure: towards automatic image enhance-
ment. In IEEE International Conference on Image Processing, 2005.
[105] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In
Computer Vision and Pattern Recognition (CVPR), page 731, 1997.
[106] Alan F. Smeaton, Paul Over, and Aiden R. Doherty. Video shot boundary detec-
tion: Seven years of trecvid activity. Computer Vision and Image Understand-
ing, 114(4):411 – 418, 2010.
[107] Alan F. Smeaton, Paul Over, and Wessel Kraaij. Evaluation campaigns and
trecvid. In MIR ’06: Proceedings of the 8th ACM International Workshop on
Multimedia Information Retrieval, pages 321–330, New York, NY , USA, 2006.
ACM Press.
[108] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press,
1990.
[109] T.G. Stockham. Image processing in the context of a visual model. Proceedings
of the IEEE, 60(7):828–842, July 1972.
[110] Kartic Subr, Cyril Soler, and Frédo Durand. Edge-preserving multiscale image
decomposition based on local extrema. In ACM SIGGRAPH Asia 2009, pages
1–9, Yokohama, Japan, 2009.
[111] Michael J. Swain and Dana H. Ballard. Color indexing. International Journal
on Computer Vision (IJCV), 7(1):11–32, 1991.
[112] Jinshan Tang, Jeonghoon Kim, and Eli Peli. Image enhancement in the jpeg
domain for people with vision impairment. IEEE Transactions on Biomedical
Engineering, 51:2013–2023, November 2004.
[113] Li Tao, Richard Tompkins, and Vijayan Asari. An illuminance-reflectance
model for nonlinear enhancement of color images. In Face Recognition Grand
Challenge Workshop, IEEE Conference on Computer Vision and Pattern Recog-
nition(CVPR), pages 159–166, San Diego, USA, 2005.
[114] Jay Martin Tenenbaum. Accommodation in computer vision. PhD thesis, Stan-
ford University, Stanford, CA, USA, 1971.
[115] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images.
In IEEE International Conference on Computer Vision(ICCV), pages 839–846,
Bombay, India, 1998.
[116] John W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[117] Jack Tumblin and Greg Turk. Lcis: a boundary hierarchy for detail-preserving
contrast reduction. In ACM SIGGRAPH ’99, pages 83–90, Los Angeles, USA,
1999.
[118] Zia ur Rahman and Glenn A. Woodell. A multiscale retinex for bridging the gap
between color images and the human observation of scenes. IEEE Transactions
of Image Processing, 6:965–976, 1997.
[119] J. van de Weijer, T. Gevers, and A. Gijsenij. Edge-based color constancy. IEEE
Transactions on Image Processing, 16(9):2207–2214, 2007.
[120] J. van de Weijer, C. Schmid, and J. Verbeek. Using high-level visual infor-
mation for color constancy. In IEEE International Conference on Computer
Vision(ICCV), pages 1–8, Rio de Janeiro, Brazil, 2007.
[121] Hongcheng Wang, Ramesh Raskar, and Narendra Ahuja. High dynamic range
video using split aperture camera. In OMNIVIS Workshop, IEEE International
Conference on Computer Vision (ICCV), Beijing, China, 2005.
[122] Qing Wang and R.K. Ward. Fast image/video contrast enhancement based on
weighted thresholded histogram equalization. IEEE Trans. on Consumer Elec-
tronics, 53(2):757 –764, May 2007.
[123] Michael A Webster, Mark A Georgeson, and Shernaaz M Webster. Neural ad-
justments to image blur. Nature Neuroscience, 5(9):839–840, 2002.
[124] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bul-
letin, 1(6):80–83, 1945.
[125] B. Zhang, J. P. Allebach, and Z. Pizlo. An investigation of perceived sharpness
and sharpness metrics. In SPIE, volume 5668, pages 98–110, 2005.
[126] Buyue Zhang and J.P. Allebach. Adaptive bilateral filter for sharpness enhance-
ment and noise removal. IEEE Transactions of Image Processing, 17(5):664
–678, May 2008.
[127] L. Zhong, C. Li, H. Li, and Z. Xiong. Unsupervised clustering algorithm for
video shots using spectral division. In International Symposium on Visual Com-
puting (ISVC), pages 782–792, 2008.
Abstract

Images/videos may have poor visual quality due to the relatively low dynamic range of capture/display devices compared to the human visual system, poor lighting conditions, or the lack of experience of the people capturing them. We are exploring techniques to perform automatic enhancement of images and videos. The goal is to produce a better visual experience for all viewers. Another motivation is to improve perception for visually impaired patients, in particular people suffering from Age-related Macular Degeneration. To address these problems, we have developed novel techniques for contrast and sharpness enhancement of images and videos.

Our color contrast enhancement technique is inspired by the Retinex theory. We use denoising techniques to estimate the illumination component of the image, while preserving color and white balance. We then enhance only the illumination component using mapping functions whose parameters are estimated automatically. This enhanced illumination is then combined with the original reflectance to obtain enhanced images with better contrast.

For sharpness enhancement, we use a novel approach based on a hierarchical framework using the edge-preserving Non-Local Means filter. The hierarchical framework is constructed from a single image using a modified version of the Laplacian pyramid. We also introduce a new measure to quantify sharpness quality, which allows us to automatically set parameters in order to achieve a preferred sharpness enhancement.

Finally, we propose a novel method based on modularity optimization to perform temporally consistent and robust enhancement of videos. A key aspect of our process is that it is fully automatic. Our method detects scene changes ‘on-the-fly’ in a video. For every detected cluster, we find a key frame that is most representative of the other frames in the sequence and estimate enhancement parameters for only the key frame. We then enhance all frames in that cluster using these enhancement parameters, making our method robust.

We compare our enhancement results with existing state-of-the-art approaches and commercial packages and show noticeable improvements. Our image enhancements do not suffer from halo effects, color artifacts, or color shifts.