A DATA-DRIVEN APPROACH TO IMAGE SPLICING LOCALIZATION
by
Ronald Salloum
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2019
Copyright 2019 Ronald Salloum
This dissertation is dedicated to my parents.
Acknowledgments
Pursuing my PhD degree was a very challenging but rewarding experience. I enjoyed
my time as a member of the Media Communications Laboratory and had the opportunity
to work on exciting research projects. I am very thankful to my advisor, Professor C.-C.
Jay Kuo, for his guidance and support throughout my PhD studies. I am very impressed
by his passion for research and his dedication to student success. I would like to thank
Professor Alexander Sawchuk and Professor Aiichiro Nakano for serving on my defense
committee, and for their valuable feedback about my research work. Also, I would like
to thank Professor Panayiotis Georgiou and Professor Keith Jenkins for serving on my
qualifying exam committee.
I would like to thank my parents for their love and support throughout my life. They
have been excellent role-models. In addition, I would like to thank my sister, Mariam,
and my brother-in-law, Zaid, for their support, advice, and encouragement throughout
my studies. Finally, I would like to thank my niece, Audrey, whose birth in 2018 brought
great joy to my life.
The work presented in this manuscript is based on research sponsored by DARPA
and Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-
0173. The U.S. Government is authorized to reproduce and distribute reprints for Gov-
ernmental purposes notwithstanding any copyright notation thereon. The views and con-
clusions contained herein are those of the authors and should not be interpreted as nec-
essarily representing the official policies or endorsements, either expressed or implied,
of DARPA and Air Force Research Laboratory (AFRL) or the U.S. Government.
Contents
Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures ix
Abstract xiii
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Fully Convolutional Network (FCN) Approach . . . . . . . . . 5
1.2.2 cPCA++ Approach . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 7
2 Research Background 9
2.1 Existing Splicing Localization Techniques . . . . . . . . . . . . . . . . 9
2.2 Convolutional Neural Networks (CNNs) . . . . . . . . . . . . . . . . . 11
2.2.1 Overview of CNNs . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Fully Convolutional Networks (FCNs) . . . . . . . . . . . . . . 14
2.3 Contrastive PCA (cPCA) . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . 17
2.3.2 Summary of Contrastive PCA (cPCA) . . . . . . . . . . . . . . 21
3 Image Splicing Localization Using A Multi-Task Fully Convolutional Net-
work (MFCN) 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Single-task Fully Convolutional Network (SFCN) . . . . . . . . 26
3.2.2 Multi-task Fully Convolutional Network (MFCN) . . . . . . . . 27
3.2.3 Edge-enhanced MFCN Inference . . . . . . . . . . . . . . . . 29
3.2.4 Training and Testing Procedure . . . . . . . . . . . . . . . . . 30
3.3 Performance Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . 32
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Performance on JPEG Compressed Images . . . . . . . . . . . 37
3.4.3 Performance on Gaussian Blurred Images . . . . . . . . . . . . 38
3.4.4 Performance on Images with Additive White Gaussian Noise . . 39
3.4.5 Performance on DARPA/NIST MediFor Annual Competitions . 40
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Efficient Image Splicing Localization via Contrastive Feature Extraction 44
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 cPCA++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 The cPCA++ Method . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Performance Comparison of Feature Extraction Methods . . . . 62
4.2.3 Computational Time Performance Comparison . . . . . . . . . 72
4.3 The cPCA++ Framework for Image Splicing Localization . . . . . . . . 74
4.4 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.1 Evaluation/Scoring Procedure . . . . . . . . . . . . . . . . . . 79
4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Appendices
4.A Derivation of Filters for Synthetic Example . . . . . . . . . . . . . . . 85
4.B The cPCA++ Approach For Matrix Factorization and Image Denoising . 88
5 Conclusion and Future Work 91
5.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 93
Bibliography 95
List of Tables
3.1 Training and Testing Images . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Average F1 Scores of Proposed and Existing Methods For Different
Datasets. For each dataset, we highlight in bold the top-performing
method. As noted in [56], ADQ2, ADQ3, and NADQ require JPEG
images as input because they exploit certain JPEG data directly extracted
from the compressed files. Therefore, these three algorithms could only
be evaluated on the CASIA v1.0 and Nimble 2016 SCI datasets, which
contain images in JPEG format. For the Columbia and Carvalho datasets
(which do not contain images in JPEG format), we put “NA” in the cor-
responding entries in the table to indicate that these three algorithms
could not be evaluated on these two datasets. . . . . . . . . . . . . . . . 34
3.3 Average MCC Scores of Proposed and Existing Methods For Various
Datasets. For each dataset, we highlight in bold the top-performing
method. As noted in [56], ADQ2, ADQ3, and NADQ require JPEG
images as input because they exploit certain JPEG data directly extracted
from the compressed files. Therefore, these three algorithms could only
be evaluated on the CASIA v1.0 and Nimble 2016 SCI datasets, which
contain images in JPEG format. For the Columbia and Carvalho datasets
(which do not contain images in JPEG format), we put “NA” in the cor-
responding entries in the table to indicate that these three algorithms
could not be evaluated on these two datasets. . . . . . . . . . . . . . . . 35
3.4 Average F1 Scores of Proposed and Existing Methods on Original and
JPEG Compressed Carvalho Images. For each column, we highlight in
bold the top-performing method. . . . . . . . . . . . . . . . . . . . . . 39
3.5 Average F1 Scores of Proposed and Existing Methods On Original and
Blurred Carvalho Images. For each column, we highlight in bold the
top-performing method. . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Average F1 Scores of Proposed and Existing Methods On Original and
Noisy Carvalho Images. For each column, we highlight in bold the top-
performing method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Overview of the datasets used for comparing the PCA, cPCA, t-SNE,
and cPCA++ dimensionality reduction methods. Note that $N_f$ denotes
the number of foreground samples, $N_b$ denotes the number of back-
ground samples, and $M$ denotes the original feature dimension. . . . . . 63
4.2 Time required for the different algorithms to perform the required dimen-
sionality reduction for the various datasets studied in Sec. 4.2.2. All
times listed in the table are in seconds. Boldface is used to indicate
shortest runtimes and average cPCA++ speedup. . . . . . . . . . . . . . 74
4.3 Edge-based MCC Scores on Columbia and Nimble WEB Datasets.
Boldface is used to emphasize best performance. . . . . . . . . . . . . 82
4.4 Edge-based F1 Scores on Columbia and Nimble WEB Datasets. Bold-
face is used to emphasize best performance. . . . . . . . . . . . . . . . 82
List of Figures
1.1 An image splicing example. The manipulated image (in the top row)
shows Bill Clinton shaking hands with Saddam Hussein. The three
authentic images used to create the manipulated image are shown in
the bottom row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 An image splicing example [47]: (a) the spliced image showing John
Kerry and Jane Fonda together at a rally, (b) an authentic image of Kerry,
and (c) an authentic image of Fonda. . . . . . . . . . . . . . . . . . . . 3
1.3 An image splicing example showing: (a) the manipulated or probe image,
(b) the host image, (c) the donor image, (d) the spliced surface (or
region) ground truth mask, and (e) the spliced boundary (or edge) ground
truth mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 This figure illustrates the convolutional operation. Given an input matrix
of size 4×4, a kernel of size 3×3, and a stride of 1 without zero padding,
the output will be a matrix of size 2×2. . . . . . . . . . . . . . . . . . 12
2.2 This figure illustrates the step-by-step process of moving a filter or ker-
nel across the input to generate the output matrix. This example assumes
a stride of 1 without zero padding. . . . . . . . . . . . . . . . . . . . . 13
2.3 This figure illustrates a typical CNN architecture. [49] . . . . . . . . . . 15
3.1 The MFCN Architecture for image splicing localization. Numbers in
the form x/y refer to the kernel size and number of filters in the con-
volutional layer (colored blue), respectively. For example, the Conv1
block consists of two convolutional layers, each with a kernel size of 3
and 64 filters (note that after each convolutional layer is a batch normal-
ization layer and a ReLU layer). Numbers in the form of ×2 and ×8
refer to an upsampling factor of 2 and 8 for the deconvolutional layers,
respectively. Also, please note the inclusion of skip connections at the
third and fourth max pooling layers. The grey-colored layers represent
element-wise addition. The max pooling layers have a kernel size of 2
and a stride of 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Illustration of MFCN inference with edge enhancement: (a) Edge prob-
ability map, (b) Hole-filled, thresholded edge mask, (c) Surface proba-
bility map, (d) Thresholded surface mask, (e) Ground truth mask, and
(f) Final system output mask. . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 System Output Mask Examples of SFCN, MFCN, and Edge-Enhanced
MFCN on the CASIA v1.0 and Carvalho Datasets. Please note that we
refer to the MFCN without edge-enhanced inference simply as MFCN.
Each row in the figure shows a manipulated or spliced image, the ground
truth mask, the SFCN output, the MFCN output, and the edge-enhanced
MFCN output. The number below each output example is the corre-
sponding F1 score. The first two rows are examples from the CASIA
v1.0 dataset, while the other two rows are examples from the Carvalho
dataset. It can be seen from these examples that the edge-enhanced
MFCN achieves finer localization than the SFCN and the MFCN with-
out edge-enhanced inference. . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 MFCN Surface Masks (Without Edge-Enhanced Inference) for Spliced
Images Using Different Threshold Values. Each row shows a spliced
image, the ground truth mask, the surface probability map, and the
corresponding thresholded surface masks for different threshold values
(0.7, 0.8, and 0.9). For each thresholded surface mask, pixels that are
classified as manipulated are marked as black and pixels that are classi-
fied as authentic are marked as white. . . . . . . . . . . . . . . . . . . 37
3.5 MFCN Surface Masks (Without Edge-Enhanced Inference) for Authen-
tic Images Using Different Threshold Values. Each row shows an authen-
tic image, and the corresponding thresholded surface masks for different
threshold values (0.9, 0.8, 0.7, and 0.6). For each thresholded surface
mask, pixels that are classified as manipulated are marked as black and
pixels that are classified as authentic are marked as white. . . . . . . . 38
3.6 Localization output examples from the 2017 MediFor Challenge dataset,
with the corresponding MCC values. In the ground truth mask, the
color black denotes a spliced pixel and the color white denotes an authen-
tic pixel. The colors pink and yellow in the ground truth mask denote
a pixel that is not scored (according to the MediFor scoring protocol).
One of the reasons that there are no-score regions is that there may also
be non-splicing manipulations present in the image (in addition to the
splicing manipulations). . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Performance on a synthetic dataset, where different colors are used for
the four different classes. The top row of plots shows the performance of
the cPCA algorithm for different positive values of the contrast parameter
α. Clearly, a contrast factor of α = 2.7 is ideal, but must be found by
sweeping. The bottom-left plot shows the performance of traditional
PCA (which, as expected, fails to separate the classes). The bottom-
center plot shows the performance of the t-SNE algorithm, which again
fails to discover the underlying structure in the high-dimensional data.
Finally, the bottom-right figure shows the output obtained by the cPCA++
method, which obtains the ideal clustering without a parameter sweep. . 66
4.2 Example of six target images. The MNIST images for digits 0 and 1 are
superimposed on top of grass images. . . . . . . . . . . . . . . . . . . 67
4.3 The performance of different dimensionality reduction techniques on
the “MNIST over Grass” dataset illustrated in Fig. 4.2. In all the plots,
the black markers represent the digit 0 while the red markers represent
the digit 1. The top row shows the result of executing the cPCA algo-
rithm for different values of α, the bottom-left plot shows the output of
the traditional PCA algorithm on the target dataset, the bottom-center
plot shows the output of the t-SNE algorithm, and the bottom-right plot
shows the output of the cPCA++ method. . . . . . . . . . . . . . . . . 68
4.4 The performance of different dimensionality reduction techniques on
the Mice Protein Expression dataset [21]. In all the plots, the black
markers represent the non-Down Syndrome mice while the red markers
represent the Down Syndrome mice. . . . . . . . . . . . . . . . . . . . 69
4.5 Clustering result of the MHealth Dataset for performing squats and cycling.
In all the plots, the red markers denote squatting activity while the black
markers denote cycling. In the top-left and bottom-left subplots, we see
that traditional PCA and t-SNE are incapable of separating the two activ-
ities, respectively. In the top-right and bottom-right subplots, we see that
cPCA and cPCA++ are capable of clustering the activities, respectively.
Note that we show the optimal cPCA result (after performing the param-
eter sweep). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Dimensionality reduction result for the Single Cell RNA-Seq of Leukemia
Patient example. In all plots, the black markers denote pre-transplant
samples while the red markers denote post-transplant samples. The top
row illustrates the output of the cPCA algorithm for varying values of
α, the bottom-left plot shows the output of traditional PCA, the bottom-
center plot shows the output of the t-SNE algorithm, while the bottom-
right plot shows the output of the cPCA++ method. We observe that
the cPCA (for some values of α) and cPCA++ methods yield the best
clustering for this dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 An example illustrating that there is an ambiguity in how one labels the
spliced and authentic regions in a given probe image. The image on the
right is a ground truth mask highlighting the spliced edge or boundary.
The color black is used to denote a pixel belonging to the spliced bound-
ary, while the color white is used to denote a pixel not belonging to the
spliced boundary. It is possible to label region A as the spliced region
and region B as the authentic region, or vice versa. . . . . . . . . . . . 75
4.8 Example of foreground and background patches. . . . . . . . . . . . . 80
4.9 Localization Output Examples from Columbia Dataset. Each row shows
(from left to right): the manipulated/probe image with the spliced edge
highlighted in pink, the structural edge detection output mask highlight-
ing both spliced and authentic edges, the cPCA++ raw probability out-
put, and the MFCN-based raw probability output. . . . . . . . . . . . . 83
4.10 Localization Output Examples from Nimble WEB Dataset. Each row
shows (from left to right): the manipulated/probe image with the spliced
edge highlighted in pink, the structural edge detection output mask high-
lighting both spliced and authentic edges, the cPCA++ raw probability
output, and the MFCN-based raw probability output. . . . . . . . . . . 84
4.11 The denoising of the digit 0 over grass background. Please note that
we use the color white to denote pixels corresponding to the digit. On
the top-left plot, we show the original noisy digit. On the top-right plot,
we show the denoising achieved by traditional PCA. On the bottom-left
plot, we show the denoising achieved by the cPCA algorithm. Finally,
on the bottom-right plot, we show the denoising performance of the
cPCA++ method, which is the low-rank approximation $W y_n$. We observe
that the output of cPCA++ is far less noisy than that of the other meth-
ods. The number of components was chosen to be K = 3. . . . . . . . 90
Abstract
Image splicing is a type of forgery or manipulation in which a portion of one image is
copied and pasted onto a different image. Image splicing attacks have become perva-
sive with the advent of easy-to-use digital manipulation tools and an increase in public
image distribution. Much of the previous research work on image splicing attacks has
focused on the problem of simply detecting whether an image is spliced or not, and did
not attempt to localize the spliced region. In this work, we present two novel approaches
for the image splicing localization problem, with the goal of generating a per-pixel mask
that localizes the spliced region. The first proposed approach is based on a multi-task
fully convolutional network (MFCN), which is a special type of convolutional neural
network (CNN). The MFCN is simultaneously trained on the surface label (which indi-
cates whether each pixel in an image belongs to the spliced surface/region) and the edge
label (which indicates whether each pixel belongs to the boundary of the spliced region).
The MFCN-based approach is shown to outperform existing splicing localization tech-
niques on several publicly available datasets.
Our second contribution is based on a new dimensionality-reduction technique that
we have developed. This technique, referred to as cPCA++ (where cPCA stands for con-
trastive Principal Component Analysis), utilizes the fact that the interesting features of
a target dataset may be obscured by high variance components during traditional PCA.
By analyzing what is referred to as a background dataset (i.e., one that exhibits the high
variance principal components but not the interesting structures), our technique is capa-
ble of efficiently highlighting the structure that is unique to the target dataset. Similar
to another recently proposed algorithm called contrastive PCA (cPCA), the proposed
cPCA++ method identifies important dataset-specific patterns that are not detected by
traditional PCA in a wide variety of settings. However, the proposed cPCA++ method is
significantly more efficient than cPCA, because it does not require the parameter sweep
in the latter approach. We applied the cPCA++ method to the problem of image splicing
localization. In this application, we utilize authentic edges as the background dataset and
the spliced edges as the target dataset. The proposed cPCA++ method is significantly
more efficient than state-of-the-art CNN-based methods, as the former does not require
iterative updates of filter weights via stochastic gradient descent and backpropagation.
Furthermore, the cPCA++ method is shown to provide performance scores comparable
to the MFCN-based approach.
Chapter 1
Introduction
1.1 Significance of the Research
The availability of low-cost and user-friendly editing software has made it significantly
easier to manipulate images. At the same time, the prevalence of social media applica-
tions has made it very easy to quickly circulate these manipulated images. Thus, there
has been an increasing interest in developing forensic techniques to detect and localize
manipulations (also referred to as forgeries or attacks) in images. One of the most com-
mon types of forgery is the image splicing attack. A splicing attack is a forgery in which
a region from one image (i.e., the donor image) is copied and pasted onto another image
(i.e., the host image). Forgers often use splicing to give a false impression that there is
an additional object present in the image, or to remove an object from the image. Image
splicing can be potentially used in generating false propaganda for political purposes.
For example, the altered or manipulated image in the top row of Figure 1.1 shows Bill
Clinton shaking hands with Saddam Hussein, although this event never occurred. The
authentic images used to create the fake image can be seen in the bottom row of Figure
1.1.
In another example, an image was circulated during the 2004 U.S. presidential elec-
tion campaign that showed presidential candidate John Kerry and the actress Jane Fonda
speaking together at a rally. It was discovered later that this was a spliced image, and
was meant to associate Kerry with Fonda, who had received negative publicity. Fig. 1.2
shows the spliced image, along with the two original authentic images that were used to
create the spliced image [47].

Figure 1.1: An image splicing example. The manipulated image (in the top row) shows
Bill Clinton shaking hands with Saddam Hussein. The three authentic images used to
create the manipulated image are shown in the bottom row.
An additional splicing example¹ is shown in Figure 1.3. The top row shows (from
left to right): the manipulated image (also referred to as the probe or spliced image),
the host image, and the donor image. In this example, the murky water from the donor
image was copied and pasted on top of the clear water in the host image. This gives
the false appearance that the water in the pool is not clean. The murky water in the
manipulated image is referred to as the spliced surface or region.

¹ https://www.nist.gov/itl/iad/mig/media-forensics-challenge

Figure 1.2: An image splicing example [47]: (a) the spliced image showing John Kerry
and Jane Fonda together at a rally, (b) an authentic image of Kerry, and (c) an authentic
image of Fonda.

The bottom row in Figure 1.3 shows (from left to right): the spliced surface (or region) ground truth mask
and the spliced boundary (or edge) ground truth mask. These two types of masks provide
different ways of highlighting the splicing manipulation. The surface ground truth mask
is a per-pixel binary mask which specifies whether each pixel in the given manipulated
image is part of the spliced surface (or region). We use the color black to denote a pixel
belonging to the spliced surface and the color white to denote a pixel not belonging
to the spliced surface. The edge ground truth mask is a per-pixel binary mask which
specifies whether each pixel in the given probe image is part of the boundary of the
spliced surface. In this case, we use the color black to denote a pixel belonging to
the spliced boundary and the color white to denote a pixel not belonging to the spliced
boundary.

Figure 1.3: An image splicing example showing: (a) the manipulated or probe image,
(b) the host image, (c) the donor image, (d) the spliced surface (or region) ground truth
mask, and (e) the spliced boundary (or edge) ground truth mask.
There are two main problems in the literature: detection and localization. The detec-
tion problem refers to the problem of classifying an image as either spliced or authentic,
without localizing the spliced region or boundary. Many of the current techniques only
address the detection problem, and do not address the localization problem. Thus, in this
dissertation, we focus on the more challenging and less studied problem of image splic-
ing localization, and we present two novel approaches for localizing splicing attacks.
1.2 Contributions of the Research
The first proposed technique [46] is based on a fully convolutional network (FCN), and
is shown to outperform existing techniques. The second proposed technique is based
on a new dimensionality-reduction technique that we have developed, referred to as
cPCA++ (where cPCA stands for contrastive PCA), which is able to perform discrim-
inative feature extraction when dealing with extremely similar classes. In this section,
we will summarize the contributions of the FCN-based and cPCA++ approaches.
1.2.1 Fully Convolutional Network (FCN) Approach
In the first proposed technique [46], we use a fully convolutional network (FCN), which
is a special type of convolutional neural network (CNN). The base network architecture
is the FCN VGG-16 architecture with skip connections, but we incorporate several mod-
ifications, including batch normalization layers and class weighting. We present three
variants of the FCN-based approach:
We first evaluated a single-task FCN (SFCN) trained only on the surface label or
ground truth mask, which indicates whether or not each pixel in an image belongs
to the spliced region/surface. Although the SFCN is shown to provide superior
performance over existing techniques, it still provides a coarse localization output
in certain cases.
So, we next propose the use of a multi-task FCN (MFCN) that utilizes two output
branches for multi-task learning. One branch is used to learn the surface label,
while the other branch is used to learn the edge or boundary of the spliced region.
It is shown that by simultaneously training on the surface and edge labels, we can
achieve finer localization of the spliced region, as compared to the SFCN. In this
approach, we utilize only the surface output probability map during the inference
stage.
The final approach, which is referred to as the edge-enhanced MFCN, utilizes
both the surface and edge output probability maps during the inference stage to
achieve finer localization of the spliced region.
We trained all proposed FCN-based approaches using the CASIA v2.0 dataset
[14] and tested the trained networks on the CASIA v1.0 [14]², Columbia Uncom-
pressed [23], Carvalho [11], and the Nimble Challenge 2016 Science (SCI) datasets³.
Experiments show that the SFCN and MFCN outperform existing splicing local-
ization algorithms, with the edge-enhanced MFCN achieving the best performance.
Furthermore, we show that after applying various post-processing operations such as
JPEG compression, blurring, and addition of noise to the spliced images, the SFCN
and MFCN methods still outperform the existing methods. Finally, we also participated
in the annual competitions of the MediFor (Media Forensics) Program sponsored by
the Defense Advanced Research Projects Agency (DARPA)⁴, and our MFCN-based
approach achieved the highest score in the splicing localization task in the 2017 compe-
tition, and the second highest score in the 2018 competition.
1.2.2 cPCA++ Approach
Although CNN-based approaches, such as our proposed MFCN-based approach, have
yielded promising results in the field of image forensics, they rely on careful selection
of hyperparameters, network architecture, and initial filter weights. Furthermore, CNNs
require a long training time since the filter weights need to be iteratively updated via
stochastic gradient descent and backpropagation.

² Credits for the use of the CASIA Image Tampering Detection Evaluation Database (CASIA TIDE)
V1.0 and V2.0 are given to the National Laboratory of Pattern Recognition, Institute of Automation,
Chinese Academy of Science, Corel Image Database and the photographers. http://forensics.idealtest.org
³ https://www.nist.gov/itl/iad/mig/nimble-challenge
⁴ https://www.darpa.mil/program/media-forensics

Our next proposed approach does not
require a significant amount of experimental tuning nor long training/testing times, but
is still able to achieve performance scores comparable to the MFCN-based approach.
The second splicing localization approach is based on a new version of Principal
Component Analysis (PCA) that we have developed. In the context of image splicing
localization, the two classes of interest (i.e., spliced and authentic boundaries) are very
similar in terms of their covariance matrices, and traditional PCA is not able to effec-
tively discriminate between the two classes. Instead, we propose a new version of PCA,
referred to as cPCA++ [45] (where cPCA stands for contrastive Principal Component
Analysis), which is able to perform discriminative feature extraction when dealing with
extremely similar classes. We then propose a new approach for image splicing local-
ization based on cPCA++. The proposed cPCA++ approach is mathematically tractable
and does not require a significant amount of experimental tuning. Unlike CNNs, the
proposed cPCA++ approach does not require iterative updates of filter weights via
stochastic gradient descent and backpropagation, and thus is much more efficient than
CNN-based approaches. In addition, we will see that the cPCA++ approach does not
require the training of a classifier (e.g., support vector machines or random forests),
which greatly speeds up the method. Also, the proposed cPCA++ approach can be
readily parallelized due to its lack of dependence on inherently serial methods such as
stochastic gradient descent.
1.3 Organization of the Dissertation
The rest of this manuscript is organized as follows. In Chapter 2, we present neces-
sary research background material. In Chapter 3, we present the three variants of the
FCN-based approach. In Chapter 4, we present the cPCA++ dimensionality-reduction
technique, and discuss how we apply it to the problem of image splicing localization.
Finally, in Chapter 5, we summarize our contribution and outline potential future work.
Chapter 2
Research Background
In this chapter, we will provide necessary background material. This chapter is orga-
nized as follows. In Section 2.1, we will summarize existing splicing localization tech-
niques. In Section 2.2, we will provide background material for the first proposed
approach, which is based on a fully convolutional network (FCN). Finally, in Section
2.3, we will provide background material for the second proposed approach, which is
based on a new dimensionality-reduction technique that we have developed.
2.1 Existing Splicing Localization Techniques
Zampoglou et al. [56] conducted a comprehensive review of non-deep-learning-based
techniques for image splicing localization, and provided a comparison of their perfor-
mance. We provide a brief summary as follows. These techniques can be roughly
grouped into the following three categories based on the type of feature or artifact
they exploit: noise patterns [9, 32, 37, 38], Color Filter Array (CFA) interpolation pat-
terns [12, 18], and JPEG-related traces [2, 5–7, 17, 28, 33, 34, 36, 54, 55]. The first class
of splicing localization algorithms exploits noise patterns under the assumption that dif-
ferent images have different noise patterns as a result of a combination of different
camera makes/models, the capture parameters of each image, and post-processing tech-
niques [9, 32, 37, 38]. Since the spliced region originated from a different image (i.e.,
the donor image) than the host image, the spliced region may have a noise pattern that
is different than the noise pattern in the remaining region of the host image. Thus, the
noise pattern can potentially be used to identify the spliced region.
The second class of algorithms exploits CFA interpolation patterns [12, 18]. Most
digital cameras acquire images using a single image sensor overlaid with a CFA that
produces one value per pixel. CFA interpolation (also called demosaicing) is a process
to reconstruct the full color image by transforming the captured output into three chan-
nels (RGB). Splicing can disrupt the CFA interpolation patterns in multiple ways. For
example, different cameras may use different CFA interpolation algorithms so that com-
bining two different images may cause discontinuities. Also, spliced regions are often
rescaled, which can also disrupt the CFA interpolation patterns. These artifacts can be
exploited when attempting to localize a spliced region.
The third class of algorithms exploits the traces left by JPEG compression. Most of
these methods use features from one of two subgroups: JPEG quantization artifacts and
JPEG compression grid discontinuities [2, 5–7, 33, 34, 36, 55]. In JPEG quantization-
based methods, it is assumed that the original image underwent consecutive JPEG com-
pressions, while the spliced portion may have lost its initial JPEG compression char-
acteristics due to smoothing or resampling of the spliced portion. These incongruous
features can help localize a spliced region. In JPEG grid-based methods, one may detect
spliced regions due to misalignment of the 8×8 block grids used in compression. Two
other approaches that exploit JPEG compression traces are JPEG Ghosts [17] and Error
Level Analysis [28, 54].
Recently, there has been an increasing interest in the application of deep-learning-
based techniques to general image forensics and splicing detection/localization [8, 10,
24, 41, 42, 48, 58]. Specifically, convolutional neural networks (CNNs) have attracted
a significant amount of attention in the forensics community, due to the promising
results they have yielded on a variety of image-based tasks such as object recogni-
tion and semantic segmentation [29, 35, 49]. One of the earliest CNN-based tech-
niques for image splicing localization is our Multi-task Fully Convolutional Network
(MFCN) [46], which is discussed in detail in Chapter 3. Note that we have cited several
CNN-based approaches which were published after our MFCN-based approach.
2.2 Convolutional Neural Networks (CNNs)
The first proposed technique is based on a fully convolutional network (FCN), which is
a special type of convolutional neural network (CNN). In this section, we will present
necessary background material on CNNs and FCNs.
2.2.1 Overview of CNNs
A non-convolutional neural network transforms an input through a series of hidden lay-
ers. Each hidden layer is composed of a set of filters (also referred to as neurons or
kernels). Each hidden layer is fully connected to the previous layer (i.e., each neuron in
a given layer is connected to all neurons in the previous layer). The final output layer per-
forms the classification and yields the class scores or probabilities. This fully-connected
structure will not scale well to images since the number of parameters required for each
layer will quickly add up to an unmanageable size.
A convolutional neural network (CNN) is a type of neural network that has been
shown to provide promising results in image-based tasks such as object detection or
recognition [29, 49]. While a CNN is similar to a non-convolutional neural network, it
employs a more sensible architecture to constrain the number of parameters or weights.
A CNN consists of an input layer, multiple hidden layers and an output layer. The
hidden layers of a CNN typically consist of convolutional layers, non-linear activation
layers, pooling layers, and fully connected layers. Next, we describe each type of
layer in more detail.

Figure 2.1: This figure illustrates the convolutional operation. Given an input matrix of
size 4×4, a kernel of size 3×3, and a stride of 1 without zero padding, the output will
be a matrix of size 2×2.
Convolutional layer. A convolutional layer consists of a set of learnable filters
or kernels connected to a small spatial region of the input volume (referred to as
the receptive field of the filter), with each filter having the same depth as that of
the input volume. For example, if the input to the convolutional layer is an image
of dimension 200×200×3 (i.e., 200 pixels for the width and height, and 3 for
the RGB channels of the image), then each filter will also have a depth of 3. In
the convolutional operation, we slide each filter across the image, computing the
dot product between the filter and the input volume to produce a two-dimensional
output map (commonly referred to as an activation map). The size of the two-
dimensional activation map is dependent on two parameters: 1) the stride and
2) the amount of zero padding. The stride specifies how many pixels we shift
the filter (in either the horizontal or vertical direction) as we slide it across the
image. Zero padding the input image can help control the spatial size of the out-
put.

Figure 2.2: This figure illustrates the step-by-step process of moving a filter or kernel
across the input to generate the output matrix. This example assumes a stride of 1
without zero padding.

An example of a convolutional operation applied to an input of spatial size
4×4 using a kernel or filter of spatial size 3×3 with a stride of 1 and without
zero padding is shown in Figure 2.1. Please note that we have left out the depth
dimension of the input and kernel, since it is understood that they must be equal
to each other. The step-by-step process of moving the kernel or filter by the given
stride across the input image is shown in Figure 2.2. This filter-parameter-sharing
scheme significantly reduces the number of parameters or weights when dealing
with image-based tasks. Stacking together the activation maps obtained from the
different filters yields the full output volume from the given convolutional layer.
Each filter has a set of weights or parameters which are learned via a process
called backpropagation. Backpropagation is typically used in conjunction with an
optimization algorithm (e.g., stochastic gradient descent) to iteratively adjust the
weights of a neural network. This process works by computing the gradient of the
cost function (also referred to as the loss function or error function), which is the
discrepancy between the actual network output (after one forward pass through
the network) and the expected output.
Non-linear activation layer. It is typical to apply an element-wise non-linear
activation function to the output of a convolutional layer. The most commonly
used activation function is the Rectified Linear Unit (ReLU), defined as $f(x) = \max(0, x)$,
which has the effect of thresholding any negative values to zero.
Pooling layer. A pooling layer is usually inserted in between successive convo-
lutional layers to reduce the spatial size of the volume. This reduces the number
of parameters and computation in the network, and hence can also reduce the
possibility of overfitting. The most common pooling operations are to take the
maximum or average of a local region, known as the maximum pooling layer and
the average pooling layer, respectively.
Fully connected (FC) layer. A fully connected (FC) layer consists of a set of
neurons fully connected to the activations in the previous layer. While convolu-
tional layers preserve spatial information, the output of an FC layer is a global
feature with no spatial information.
A typical CNN used in image-based classification tasks is a linear connection of the
previously described layers. An example of a CNN architecture is shown in Figure 2.3.
The neurons of the final FC layer correspond to the different classes or categories in the
given task. A final step is performed to convert the FC output to class probabilities by
applying the softmax operation, which is given by:
$$p(c) = \frac{\exp(f(c))}{\sum_{i} \exp(f(i))},$$
where $f(c)$ represents the FC output corresponding to class $c$.
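To make the preceding descriptions concrete, the following is a minimal NumPy sketch (not the actual network used in this work) of the building blocks discussed in this section: the 4×4/3×3 valid convolution with stride 1 from Figure 2.1, the ReLU activation, 2×2 max pooling, and the softmax operation. All array values and shapes are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(x, k, stride=1):
    """Slide the kernel k across the input x without zero padding (Fig. 2.1)."""
    out_h = (x.shape[0] - k.shape[0]) // stride + 1
    out_w = (x.shape[1] - k.shape[1]) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride + k.shape[0], j*stride:j*stride + k.shape[1]]
            y[i, j] = np.sum(patch * k)  # dot product of kernel and receptive field
    return y

def relu(x):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Max pooling over local regions to reduce the spatial size."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = x[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return y

def softmax(f):
    """Convert final-layer scores f into class probabilities p(c)."""
    e = np.exp(f - f.max())  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative 4x4 input and 3x3 kernel: the convolution output is 2x2, as in Fig. 2.1.
x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
a = relu(conv2d_valid(x, k))        # 2x2 activation map
p = max_pool(a, size=2, stride=2)   # 1x1 pooled output
print(a.shape, p.shape, softmax(np.array([2.0, 1.0, 0.1])))
```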
2.2.2 Fully Convolutional Networks (FCNs)
Traditional CNNs with fully connected layers are designed to solve image-level classi-
fication tasks, such as object classification, which only require a classification decision
for the entire image.

Figure 2.3: This figure illustrates a typical CNN architecture. [49]

However, these types of CNNs are not well suited for pixel-level
classification tasks, such as image splicing localization or semantic image segmenta-
tion, which require a classification decision for each pixel in the image. The reason for
this is that the fully connected layers of traditional CNNs do not maintain local spa-
tial information, so these types of networks yield image-level output. In order to tackle
pixel-level classification problems, researchers have developed so-called fully convolu-
tional networks (FCNs). FCNs [35] are a special type of convolutional neural network
with only convolutional layers. They are formed by converting all fully connected layers
to convolutional ones. In [35], the authors adapted common classification networks into
fully convolutional ones for the task of semantic segmentation. It was shown in [35]
that FCNs can efficiently learn to make dense predictions for per-pixel tasks such as
semantic segmentation. Three classification architectures that were converted to fully
convolutional form are AlexNet [29], GoogLeNet [50], and VGG-16 [49]. In [35], the
authors found that the FCN VGG-16 performed the best among the three. In our work,
we adopt the FCN VGG-16 as our base architecture, but we incorporate several modifi-
cations, as discussed in Chapter 3.
Pixel-level classification tasks require an output of sufficiently high resolution.
However, the pooling and convolutional layers significantly reduce the resolution. To
address this problem, the authors of [35] utilized so-called transposed convolutional
(also referred to as deconvolutional in the CNN literature) layers with learnable weights
to upsample the coarse-level output back to the resolution of the input image. The trans-
posed convolutional operation is similar to the convolutional operation. Conceptually, it
can be viewed as first upsampling the input image by inserting zeros between the values
in the input image, and then applying a convolutional operation on the upsampled input
image. Similar to the convolutional layer, the weights in the transposed convolutional
layers are learned.
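The zero-insertion view of the transposed convolution described above can be sketched as follows. This is purely conceptual: the 2×2 input, the upsampling factor, and the fixed smoothing kernel are assumptions for illustration, whereas in an actual FCN the deconvolutional kernel weights are learned.

```python
import numpy as np

def zero_insert_upsample(x, factor=2):
    """Insert (factor - 1) zeros between neighboring values of x."""
    H, W = x.shape
    up = np.zeros((factor * (H - 1) + 1, factor * (W - 1) + 1))
    up[::factor, ::factor] = x
    return up

def conv2d_same(x, k):
    """Convolution with enough zero padding to keep the input size."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    y = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            y[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return y

# A coarse 2x2 prediction map is upsampled by zero insertion and then smoothed by a
# fixed bilinear-style kernel (a learned kernel would be used in practice).
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
kernel = np.array([[0.25, 0.5, 0.25],
                   [0.50, 1.0, 0.50],
                   [0.25, 0.5, 0.25]])
fine = conv2d_same(zero_insert_upsample(coarse, factor=2), kernel)
print(fine.shape)  # (3, 3): a denser map recovered from the 2x2 input
```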
Also, in [35], the authors utilized so-called skip connections that combine high-level
information from the deep layers with low-level information from the shallow layers.
This allows the network to predict finer details, while retaining high-level information.
The authors of [35] found that this skip architecture yielded superior performance over
the original architecture without skip connections.
2.3 Contrastive PCA (cPCA)
The second proposed splicing localization approach is based on a new dimensionality-
reduction technique that we have developed. This technique, referred to as cPCA++ [45]
(where cPCA stands for contrastive Principal Component Analysis), is a modified ver-
sion of PCA. The proposed cPCA++ technique is inspired by another recently proposed
approach called contrastive PCA (cPCA) [1], which can perform discriminative feature
extraction when dealing with extremely similar classes. In this section, we provide nec-
essary background material on PCA and cPCA.
2.3.1 Principal Component Analysis (PCA)
To obtain the optimal subspace approximation to a set of input signals or vectors, one can
analyze the second-order statistics of the data and select the orthonormal eigenvectors of
the covariance matrix as transform kernels. This technique is the well-known Principal
Components Analysis (PCA) or Karhunen-Loève transform (KLT). It has been shown
to be the optimal transform in terms of energy compaction. In other words, to obtain
the optimal subspace approximation, we can retain only the eigenvectors or kernels
associated with the largest eigenvalues of the covariance matrix (also referred to as the
leading eigenvectors or principal directions). The truncated KLT provides the optimal
approximation to the input in terms of the mean-squared-error (MSE) criterion.
PCA can be explained mathematically as follows. Suppose we are presented with
a data matrix $Z \in \mathbb{R}^{M \times N}$, where $M$ denotes the original feature dimension of the data
and $N$ denotes the number of instances included in $Z$. PCA first computes the empirical
covariance matrix of $Z$:
$$R \triangleq \frac{1}{N} Z Z^{T} \qquad (2.1)$$
where we assumed that $Z$ has zero-mean. The total variance in the data is empirically
given by the trace of the covariance matrix $R$ when the data instances are i.i.d.:
$$\text{total variance} = \frac{1}{N} \sum_{n=1}^{N} \| Z_n \|_2^2 = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{N} Z_{m,n}^2 = \operatorname{Tr}(R) \qquad (2.2)$$
where $Z_n$ denotes the $n$-th column of the data matrix $Z$ (i.e., the $n$-th data instance),
$Z_{m,n}$ denotes the $(m,n)$-th element of the data matrix $Z$ (i.e., the $m$-th feature of the
$n$-th data instance), and $\operatorname{Tr}(R)$ denotes the trace of the matrix $R$. Now, consider the
output of PCA. PCA would compute the subspace spanned by the $K$ top or leading
eigenvectors of $R$ (i.e., those corresponding to the largest $K$ eigenvalues). The basis for
this space would constitute the filters used to process the input data:
$$F_{\text{PCA}} \triangleq \operatorname{evecs}_K(R) \qquad (2.3)$$
where $K$ denotes the number of leading eigenvectors to return and $F_{\text{PCA}} \in \mathbb{R}^{M \times K}$.
Now, a low-dimensional version of the input data $Z$ can be obtained as:
$$Y_{\text{PCA}} = F_{\text{PCA}}^{T} Z. \qquad (2.4)$$
It is known that the PCA filters $F_{\text{PCA}}$ preserve the most energy (or variance) in $Z$ after
the transformation. To see this, let us denote the $k$-th unit-norm leading eigenvector of
$R$ with corresponding eigenvalue $\lambda_k$ as $F_{\text{PCA},k} \in \mathbb{R}^{M \times 1}$ (where $F_{\text{PCA},k}$ denotes the
$k$-th column of $F_{\text{PCA}}$). Then, we can write:
$$R F_{\text{PCA},k} = \lambda_k F_{\text{PCA},k}. \qquad (2.5)$$
Now, consider the variance of the data filtered by $F_{\text{PCA},k}$ alone. Let $F_{\text{PCA},k}^{T} Z \in \mathbb{R}^{1 \times N}$
denote the filtered data. Then, given that the data is i.i.d.:
$$\begin{aligned}
\operatorname{Var}(F_{\text{PCA},k}^{T} Z) &= \frac{1}{N} \sum_{n=1}^{N} (F_{\text{PCA},k}^{T} Z_n)^2
= \frac{1}{N} \sum_{n=1}^{N} F_{\text{PCA},k}^{T} Z_n Z_n^{T} F_{\text{PCA},k} \\
&= F_{\text{PCA},k}^{T} \left( \frac{1}{N} \sum_{n=1}^{N} Z_n Z_n^{T} \right) F_{\text{PCA},k}
= F_{\text{PCA},k}^{T} \frac{1}{N} Z Z^{T} F_{\text{PCA},k} \\
&= F_{\text{PCA},k}^{T} R F_{\text{PCA},k}
= F_{\text{PCA},k}^{T} \lambda_k F_{\text{PCA},k}
= \lambda_k \| F_{\text{PCA},k} \|_2^2
= \lambda_k. \qquad (2.6)
\end{aligned}$$
Thus, the eigenvalue $\lambda_k$ is equal to the variance of the projection of $Z$ onto $F_{\text{PCA},k}$
(i.e., the variance of $Z$ that is in the direction of $F_{\text{PCA},k}$). Thus, the leading or top
eigenvector (i.e., the one corresponding to the largest eigenvalue) gives the direction
of maximal variance. The second leading eigenvector gives the direction of maximal
variance under the constraint that it should be orthogonal to the first eigenvector. In
general, the $k$-th leading eigenvector gives the direction of maximal variance under the
constraint that it should be orthogonal to the first $k-1$ eigenvectors. To see the total
explained variance of the filter bank $F_{\text{PCA}}$ in (2.3), consider:
$$\begin{aligned}
\operatorname{Var}(F_{\text{PCA}}^{T} Z) &= \frac{1}{N} \sum_{n=1}^{N} \| F_{\text{PCA}}^{T} Z_n \|_2^2
= \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} (F_{\text{PCA},k}^{T} Z_n)^2
= \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} F_{\text{PCA},k}^{T} Z_n Z_n^{T} F_{\text{PCA},k} \\
&= \sum_{k=1}^{K} F_{\text{PCA},k}^{T} \left( \frac{1}{N} \sum_{n=1}^{N} Z_n Z_n^{T} \right) F_{\text{PCA},k}
= \sum_{k=1}^{K} F_{\text{PCA},k}^{T} \frac{1}{N} Z Z^{T} F_{\text{PCA},k}
= \sum_{k=1}^{K} F_{\text{PCA},k}^{T} R F_{\text{PCA},k} \\
&= \operatorname{Tr}(F_{\text{PCA}}^{T} R F_{\text{PCA}}) \qquad (2.7) \\
&= \sum_{k=1}^{K} \lambda_k \qquad (2.8)
\end{aligned}$$
where $\lambda_k$ denotes the $k$-th largest eigenvalue of $R$, $F_{\text{PCA},k}$ again denotes the $k$-th column
of $F_{\text{PCA}}$ (i.e., the $k$-th leading eigenvector of $R$), and the last equality is due to
the fact that each column of $F_{\text{PCA}}$ is an eigenvector of $R$. Thus, the total explained
variance, as a ratio of the total variance in (2.2), is given by:
$$\text{explained variance} = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{M} \lambda_k} \leq 1 \qquad (2.9)$$
where we use the fact that $\operatorname{Tr}(R) = \sum_{k=1}^{M} \lambda_k$. Note that as $K \rightarrow M$, the explained
variance approaches unity (in fact, in the limiting case, no dimensionality reduction is
performed). Fortunately, however, it has been observed in practice that the explained
variance grows very quickly with $K$, and it is close to unity for $K \ll M$. This enables
PCA to perform a significant amount of dimensionality reduction by only utilizing a
relatively small number of filters, but still retaining most of the variance in the initial
data.
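As a summary of Eqs. (2.1)–(2.4) and (2.9), the following NumPy sketch computes the PCA filters, the low-dimensional representation, and the explained-variance ratio; the data matrix here is random and purely illustrative.

```python
import numpy as np

def pca_filters(Z, K):
    """Top-K eigenvectors of R = (1/N) Z Z^T (Eq. (2.3)), assuming the columns
    of Z (the data instances) have already been centered to zero mean."""
    M, N = Z.shape
    R = (Z @ Z.T) / N                                     # empirical covariance, Eq. (2.1)
    eigvals, eigvecs = np.linalg.eigh(R)                  # ascending order for symmetric R
    order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
    F = eigvecs[:, order[:K]]                             # F_PCA, an M x K filter bank
    explained = eigvals[order[:K]].sum() / eigvals.sum()  # Eq. (2.9)
    return F, explained

# Illustrative data: M = 50 features, N = 1000 instances.
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 1000))
Z = Z - Z.mean(axis=1, keepdims=True)                     # enforce the zero-mean assumption
F_pca, ratio = pca_filters(Z, K=5)
Y_pca = F_pca.T @ Z                                       # low-dimensional data, Eq. (2.4)
print(Y_pca.shape, round(ratio, 3))                       # (5, 1000) and the explained variance
```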
2.3.2 Summary of Contrastive PCA (cPCA)
Observe that traditional or standard PCA may not yield separability of classes in the
data matrixZ. That is, if the data matrixZ is composed of multiple classes, performing
traditional PCA will not necessarily allow us to find a representation in which we can
separate the classes.
In contrast, a recent approach called contrastive PCA (cPCA) [1] attempts to obtain
discriminative filters. In this approach, it is assumed that there are two datasets: a “tar-
get” or “foreground” dataset and a “background” dataset. Also, it is assumed that there
are some universal structures that are present in both the target and background dataset,
and that the target dataset contains some additional unique structures. The universal
structures present in both datasets are typically dominant high-variance components,
and the unique structures of the target dataset may be obscured by these high-variance
components. The goal is to apply dimension reduction in such a way that it yields
directions that highlight the unique and “interesting” structures of the target dataset,
but ignore the universal and “uninteresting” structures. Unfortunately, applying tradi-
tional PCA on either the target or background dataset will tend to yield directions that
focus on the universal and uninteresting structures (since they are typically dominant
high-variance components). To resolve this, cPCA focuses on finding principal compo-
nents or directions that yield large variations for the target dataset, while simultaneously
yielding small variations for the background dataset.
Consider the following illustrative example. Suppose we are provided with a tar-
get or foreground dataset that contains digits superimposed atop of grass background
images. Observe that the target dataset contains multiple classes (i.e., the two digits, 0
and 1). Examples of instances of this dataset are illustrated in Fig. 4.2 of Chapter 4.
We wish to learn filters that are tuned to the digits, as opposed to the relatively strong
grass background images. In order to accomplish this task, consider having access to
a background dataset that only contains instances of grass images (i.e., without digits).
The task is to discover the structures that are unique to the target dataset and not present
in the background dataset. If this is accomplished, the filters are hoped to be able to
differentiate between the two digits in the “target” dataset.
Now, using the notation from [1], if the empirical covariance matrix of the background
dataset is given by $R_b \in \mathbb{R}^{M \times M}$, the empirical covariance matrix of the target
or foreground dataset is given by $R_f \in \mathbb{R}^{M \times M}$, and the matrix $F_{\text{cPCA}} \in \mathbb{R}^{M \times K}$ con-
tains the cPCA filters as its $K$ columns, then the cPCA filters would have the following
explained variances for the two datasets (using the form derived in (2.7)):
$$\text{Foreground variance:} \quad \sigma_f(F_{\text{cPCA}}) = \operatorname{Tr}(F_{\text{cPCA}}^{T} R_f F_{\text{cPCA}}) \qquad (2.10)$$
$$\text{Background variance:} \quad \sigma_b(F_{\text{cPCA}}) = \operatorname{Tr}(F_{\text{cPCA}}^{T} R_b F_{\text{cPCA}}). \qquad (2.11)$$
The objective function chosen in [1] is to maximize:
$$J(F_{\text{cPCA}}) \triangleq \sigma_f(F_{\text{cPCA}}) - \alpha\, \sigma_b(F_{\text{cPCA}}) \qquad (2.12)$$
$$\hphantom{J(F_{\text{cPCA}})\;} = \operatorname{Tr}\!\left(F_{\text{cPCA}}^{T} (R_f - \alpha R_b) F_{\text{cPCA}}\right) \qquad (2.13)$$
over $F_{\text{cPCA}}$, where the matrix $F_{\text{cPCA}}$ satisfies $F_{\text{cPCA}}^{T} F_{\text{cPCA}} = I_{K \times K}$, and $\alpha \in [0, \infty)$
denotes a contrast parameter. When $\alpha = 0$, the cPCA method becomes the traditional
PCA algorithm operating on the target dataset, and when $\alpha \rightarrow \infty$, it is the traditional
PCA method operating on the background dataset. It was shown in [1] that certain
choices of $\alpha$ between those two extremes can yield effective projection vectors that are
able to separate the datasets effectively, or expose interesting structures in the target
dataset. It turns out that the solution of (2.12)–(2.13) is the leading eigenvectors of the
matrix $R_f - \alpha R_b$ for a given $\alpha$ [20, pp. 446–447]—i.e.:
$$F_{\text{cPCA}} \triangleq \operatorname{evecs}_K(R_f - \alpha R_b). \qquad (2.14)$$
Unfortunately, seeking this $\alpha$ is intensive as it requires multiple eigendecompositions
(one for each choice of $\alpha$ in a sweep). The cPCA algorithm for a given $\alpha$ is reproduced
in Alg. 1 for the reader’s convenience. In contrast, our proposed cPCA++ method, which
is discussed in detail in Chapter 4, does not require a parameter sweep, which makes our
proposed method significantly more efficient than the cPCA method.
Algorithm 1 cPCA Method
Inputs: background data matrix $\tilde{Z}_b \in \mathbb{R}^{M \times N_b}$; target/foreground data matrix $\tilde{Z}_f \in \mathbb{R}^{M \times N_f}$; $K$: dimension of the output subspace; $\alpha$: contrast parameter
1. Center the data $\tilde{Z}_b$, $\tilde{Z}_f$ to obtain $Z_b$ and $Z_f$
2. Compute:
$$R_b = \frac{1}{N_b} Z_b Z_b^{T} \qquad (2.15)$$
$$R_f = \frac{1}{N_f} Z_f Z_f^{T} \qquad (2.16)$$
3. Perform eigenvalue decomposition on
$$Q_{\text{cPCA}} = R_f - \alpha R_b \qquad (2.17)$$
4. Compute the top $K$ right-eigenvectors $F_{\text{cPCA}}$ of $Q_{\text{cPCA}}$
Return: the subspace $F_{\text{cPCA}} \in \mathbb{R}^{M \times K}$
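A minimal NumPy sketch of Algorithm 1 for a single contrast parameter α is given below; in practice cPCA repeats this for a sweep of α values, which is exactly the cost the cPCA++ method of Chapter 4 avoids. The data matrices here are random placeholders.

```python
import numpy as np

def cpca_filters(Zb_raw, Zf_raw, K, alpha):
    """cPCA (Algorithm 1) for one contrast value alpha:
    top-K eigenvectors of Q = R_f - alpha * R_b."""
    Zb = Zb_raw - Zb_raw.mean(axis=1, keepdims=True)   # Step 1: center background
    Zf = Zf_raw - Zf_raw.mean(axis=1, keepdims=True)   #         and foreground data
    Rb = (Zb @ Zb.T) / Zb.shape[1]                     # Step 2: Eq. (2.15)
    Rf = (Zf @ Zf.T) / Zf.shape[1]                     #         Eq. (2.16)
    Q = Rf - alpha * Rb                                # Step 3: Eq. (2.17)
    eigvals, eigvecs = np.linalg.eigh(Q)               # Q is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:K]]                       # Step 4: F_cPCA (M x K)

# Illustrative usage with M = 100 features.
rng = np.random.default_rng(1)
Zb = rng.standard_normal((100, 400))    # background dataset (N_b = 400)
Zf = rng.standard_normal((100, 300))    # target/foreground dataset (N_f = 300)
for alpha in (0.5, 1.0, 2.0, 4.0):      # the parameter sweep that cPCA requires
    F_cpca = cpca_filters(Zb, Zf, K=2, alpha=alpha)
    Y_f = F_cpca.T @ Zf                 # 2-D representation of the target data
    print(alpha, Y_f.shape)
```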
Chapter 3
Image Splicing Localization Using A
Multi-Task Fully Convolutional
Network (MFCN)
3.1 Introduction
In this section, we present an effective solution to the splicing localization problem
based on a fully convolutional network (FCN) [46]. The base network architecture is
the FCN VGG-16 architecture with skip connections, but we incorporate several mod-
ifications, including batch normalization layers and class weighting. We first evalu-
ated a single-task FCN (SFCN) trained only on the surface label or ground truth mask,
which classifies each pixel in an image as either belonging to the spliced region/sur-
face or not. Although the SFCN is shown to provide superior performance over exist-
ing techniques, it still provides a coarse localization output in certain cases. Thus, we
next propose the use of a multi-task FCN (MFCN) that utilizes two output branches
for multi-task learning. One branch is used to learn the surface label, while the other
branch is used to learn the edge or boundary of the spliced region. It is shown that
by simultaneously training on the surface and edge labels, we can achieve finer local-
ization of the spliced region, as compared to the SFCN. Once the MFCN was trained,
we evaluated two different inference approaches. The first approach utilizes only the
surface output probability map in the inference step. The second approach, which
is referred to as the edge-enhanced MFCN, utilizes both the surface and edge output
probability maps to achieve finer localization. We trained the SFCN and MFCN using
the CASIA v2.0 dataset [14] and tested the trained networks on the CASIA v1.0 [14]¹,
Columbia Uncompressed [23], Carvalho [11], and the Defense Advanced Research
Projects Agency (DARPA) and National Institute of Standards and Technology (NIST)
Nimble Challenge 2016 Science (SCI) datasets². Experiments show that the SFCN and
MFCN outperform existing splicing localization algorithms, with the edge-enhanced
MFCN achieving the best performance. Furthermore, we show that after applying var-
ious post-processing operations such as JPEG compression, blurring, and addition of
noise to the spliced images, the SFCN and MFCN methods still outperform the existing
methods. We also participated in the annual competitions of the MediFor (Media Foren-
sics) Program that is sponsored by DARPA, and our MFCN-based approach achieved
the top rank score among all teams/systems in the splicing localization task of the 2017
competition, and the second-highest score in the 2018 competition.
The rest of this chapter is organized as follows. The proposed methods are described
in Section 3.2. The performance evaluation metrics are discussed in Section 3.3. Exper-
imental results are presented in Section 3.4. Finally, concluding remarks are given in
Section 3.5.
¹ Credits for the use of the CASIA Image Tampering Detection Evaluation Database (CASIA TIDE)
V1.0 and V2.0 are given to the National Laboratory of Pattern Recognition, Institute of Automation,
Chinese Academy of Science, Corel Image Database and the photographers. http://forensics.idealtest.org
² https://www.nist.gov/itl/iad/mig/nimble-challenge
3.2 Proposed Methods
3.2.1 Single-task Fully Convolutional Network (SFCN)
In this work, we first adapted a single-task fully convolutional network (SFCN) based
on the FCN VGG-16 architecture to the splicing localization problem [46]. The SFCN
was trained on the surface label or ground truth mask, which is a per-pixel, binary mask
that classifies each pixel in an image as either belonging to the spliced region/surface
or not. We will refer to pixels belonging to the spliced region/surface as spliced pixels
and those not belonging to the spliced region/surface as authentic pixels. We utilized
the skip architecture proposed in [35]. In [35], the authors extended the FCN VGG-
16 to a three-stream network with eight pixel prediction stride (as opposed to the 32
pixel stride in the original network without skip connections), and found that this skip
architecture yielded superior performance over the original architecture. In addition, we
incorporated several modifications, including batch normalization and class weighting.
We utilized batch normalization to eliminate the bias and normalize the outputs of the
convolutional layers [25]. In [25], it was shown that batch normalization can potentially
speed up training and increase accuracy. Class weighting refers to the application of
different weights to the different classes in the loss function. In the context of splicing
localization, class weighting is beneficial since the amount of non-spliced (authentic)
pixels typically outnumbers the amount of spliced pixels by a significant margin. In
particular, we used median frequency class weighting [3, 16]. We apply a larger weight
to the spliced pixels (since there are fewer spliced pixels than non-spliced ones). That
is, the weight applied to the spliced pixels, denoted by w
s
, is inversely proportional
to the frequency of the spliced class, and the weight applied to the authentic pixels,
26
denoted by w
a
, is inversely proportional to the frequency of the authentic class. The
class frequenciesf
s
andf
a
are given by
8
>
>
<
>
>
:
f
s
=
Ns
Na+Ns
f
a
=
Na
Na+Ns
;
(3.1)
where $N_s$ represents the total number of spliced pixels across the entire training set and $N_a$ represents the total number of authentic pixels across the entire training set.
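As a concrete illustration, the following NumPy sketch computes the class frequencies of (3.1) and the corresponding inverse-frequency weights from a collection of binary ground truth masks; the median-frequency normalization mirrors the median frequency balancing idea of [3, 16], and the function name is ours.

```python
import numpy as np

def median_frequency_weights(gt_masks):
    """gt_masks: iterable of binary masks (1 = spliced, 0 = authentic).
    Returns (w_s, w_a), the spliced and authentic class weights."""
    N_s = sum(int(m.sum()) for m in gt_masks)           # total spliced pixels
    N_a = sum(int(m.size - m.sum()) for m in gt_masks)  # total authentic pixels
    f_s = N_s / (N_a + N_s)                              # Eq. (3.1)
    f_a = N_a / (N_a + N_s)                              # Eq. (3.1)
    median_freq = np.median([f_s, f_a])
    w_s = median_freq / f_s                              # larger weight for the rarer class
    w_a = median_freq / f_a
    return w_s, w_a
```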
Although the ground truth mask is binary, the raw output of the SFCN is a probability
map. That is, each pixel is assigned a value representing the probability that the pixel is
spliced. As described in Sec. 3.3, the performance metrics require a binary mask, thus
the probability map must be thresholded. We refer to the thresholded probability map
as the binary system output mask.
3.2.2 Multi-task Fully Convolutional Network (MFCN)
Although the single-task network is shown to provide superior performance over exist-
ing splicing localization methods, it can still provide a coarse localization output in
certain cases. Thus, we next propose the use of a multi-task fully convolutional net-
work (MFCN) [46], which utilizes two output branches for multi-task learning. One
branch is used to learn the surface label, while the other branch is used to learn the
edge or boundary of the spliced region. Similar to the SFCN, we incorporate several
modifications, including skip connections, batch normalization, and class weighting (as
discussed in Sec. 3.2.1). The architecture of the MFCN used in our paper is shown in
Fig. 3.1. In addition to the surface labels, the boundaries between inserted regions and
their host background can be an important indicator of a manipulated area. This is what
motivated us to use a multi-task learning network. The weights or parameters of the
Figure 3.1: The MFCN Architecture for image splicing localization. Numbers in the form x=y refer to the kernel size and number of filters in the convolutional layer (colored blue), respectively. For example, the Conv1 block consists of two convolutional layers, each with a kernel size of 3 and 64 filters (note that after each convolutional layer is a batch normalization layer and a ReLU layer). Numbers of the form ×2 and ×8 refer to an upsampling factor of 2 and 8 for the deconvolutional layers, respectively. Also, please note the inclusion of skip connections at the third and fourth max pooling layers. The grey-colored layers represent element-wise addition. The max pooling layers have a kernel size of 2 and a stride of 2.
network are influenced by both the surface and edge labels during the training process.
By simultaneously training on the surface and edge labels, we are able to obtain a finer
localization of the spliced region, as compared to training only on the surface labels.
Once the network was fully trained, we evaluated two different binary output mask gen-
eration approaches. In the first approach, we extract the surface output probability map,
and then threshold it to yield the binary system output mask. In this approach, the edge
output probability map is not utilized in the inference step. Please note that the edge
label still influenced the weights of the network during the training process.
3.2.3 Edge-enhanced MFCN Inference
The second inference strategy, which we refer to as the edge-enhanced MFCN [46],
utilizes both the surface and edge output probability maps, as described in the following
steps:
1. We threshold the surface probability map with a given threshold, to yield the
binary surface mask. This mask represents one estimate of the surface label.
2. We threshold the edge probability map with a given threshold, to yield the binary
edge mask.
3. Next, we apply hole-filling to the output of step (2), yielding a surface-like mask.
The hole-filled edge mask represents a second estimate of the surface label.
4. The outputs of step (1) and step (3) represent two different estimates of the surface
label: one is based on the surface probability map and the other one is based on
the edge probability map. We generate the final system output mask by computing
the intersection of the output of step (1) and output of step (3). This process can
be viewed as a voting mechanism where only two votes per pixel in favor of the
positive class will yield a positive final classification for that pixel. The intuition is
that using an ensemble of surface masks for the final classification should improve
performance compared to using each individual surface estimate independently.
It is shown in this paper that by utilizing both the edge and surface probability maps
in the inference step, we obtain finer localization of the spliced region. An example
illustrating inference with edge-enhancement is shown in Figure 3.2. It can be seen that
utilizing both the edge and surface probability maps leads to a finer localization of the
spliced region.
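A minimal sketch of this edge-enhanced inference step is given below, assuming the surface and edge probability maps are available as NumPy arrays; the hole-filling is done with scipy.ndimage.binary_fill_holes, and the two thresholds are placeholders that would be selected as described in Sec. 3.3.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def edge_enhanced_inference(surface_prob, edge_prob,
                            surface_thresh=0.5, edge_thresh=0.5):
    """Fuse the surface and edge probability maps into the final binary mask."""
    surface_mask = surface_prob > surface_thresh      # step 1: surface estimate
    edge_mask = edge_prob > edge_thresh               # step 2: edge estimate
    filled_edge_mask = binary_fill_holes(edge_mask)   # step 3: surface-like mask
    # step 4: intersection of the two surface estimates (both votes required)
    return np.logical_and(surface_mask, filled_edge_mask)
```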
Figure 3.2: Illustration of MFCN inference with edge enhancement: (a) Edge prob-
ability map, (b) Hole-filled, thresholded edge mask, (c) Surface probability map, (d)
Thresholded surface mask, (e) Ground truth mask, and (f) Final system output mask.
3.2.4 Training and Testing Procedure
For the MFCN, the total loss function, $L_t$, is the sum of the loss corresponding to the surface label and the loss corresponding to the edge label, denoted by $L_s$ and $L_e$, respectively. Thus, we have $L_t = L_s + L_e$, where $L_s$ and $L_e$ are cross-entropy loss functions. Note that for the SFCN, the total loss function is equal to the surface loss function $L_s$. The surface (or edge) cross-entropy loss for pixel $(i,j)$ in image $n$ of the mini-batch is given by:

$$L_{i,j,n} = -w_{i,j,n}\left( p_{i,j,n}\log(q_{i,j,n}) + (1 - p_{i,j,n})\log(1 - q_{i,j,n}) \right), \qquad (3.2)$$

where $p_{i,j,n}$ is the true label, $q_{i,j,n}$ is the predicted probability output, and the class weight $w_{i,j,n}$ is equal to $w_s$ if the pixel is spliced and equal to $w_a$ if it is authentic. The surface loss $L_s$ (or the edge loss $L_e$) can be computed by summing or averaging the individual losses across all the pixels in the mini-batch.

Table 3.1: Training and Testing Images
Dataset Type Number of Images
CASIA v2.0 Training 5123
CASIA v1.0 Testing 921
Columbia Testing 180
Carvalho Testing 100
Nimble 2016 SCI Testing 160
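For reference, a standalone NumPy sketch of the per-pixel weighted cross-entropy of (3.2), averaged over a mini-batch, is shown below; in our experiments the loss was realized within Caffe, so this function is only illustrative.

```python
import numpy as np

def weighted_cross_entropy(p, q, w_s, w_a, eps=1e-12):
    """p: ground truth labels (0 or 1), q: predicted probabilities,
    both of shape (batch, height, width). Implements Eq. (3.2) per pixel
    and returns the mean loss over the mini-batch."""
    q = np.clip(q, eps, 1.0 - eps)          # avoid log(0)
    w = np.where(p == 1, w_s, w_a)          # per-pixel class weight
    loss = -w * (p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
    return loss.mean()
```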
The training of the networks was implemented in Caffe [26] using the stochastic
gradient descent (SGD) algorithm with a fixed learning rate of 0.0001, a mini-batch size
of 3, a momentum of 0.9, and a weight decay of 0.0005. We initialize the weights or
parameters of the SFCN and MFCN by the weights of a VGG-16 model pre-trained on
the ImageNet dataset, which contains 1.2 million images for the task of object recogni-
tion and image classification [44, 49].
We trained the SFCN and MFCN using the CASIA v2.0 dataset, and then evaluated
the trained models on the CASIA v1.0, Columbia Uncompressed, Carvalho, and Nimble
Challenge 2016 Science (SCI) datasets. The numbers of training and testing images are
given in Table 3.1. Ground truth masks are provided for the Columbia Uncompressed,
Carvalho, and Nimble Challenge 2016 datasets. For the CASIA v1.0 and CASIA v2.0
datasets, we generated the ground truth masks using the provided reference information.
In particular, for a given spliced image in the CASIA datasets, the corresponding donor
and host images are provided, and we used this information to generate the ground truth
mask.
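As an illustration only (the exact mask-generation scripts are not reproduced here), one plausible way to obtain such a mask is to threshold the difference between the spliced image and its host image; the threshold value and the per-channel reduction below are assumptions of this sketch.

```python
import numpy as np

def difference_mask(spliced_img, host_img, diff_thresh=10):
    """Hypothetical mask generation: mark pixels where the spliced image
    differs noticeably from the host (authentic) image."""
    diff = np.abs(spliced_img.astype(np.int32) - host_img.astype(np.int32))
    if diff.ndim == 3:
        diff = diff.max(axis=2)                       # reduce color channels
    return (diff > diff_thresh).astype(np.uint8)      # 1 = spliced, 0 = authentic
```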
3.3 Performance Evaluation Metrics
Once the MFCN or SFCN is trained, we use the trained model to evaluate other images
not in the training set. We evaluated the performance of the proposed and existing
methods using the $F_1$ and Matthews Correlation Coefficient ($MCC$) metrics, which are per-pixel localization metrics.

Both the $F_1$ metric and the $MCC$ metric require as input a binary mask. We converted each output map to a binary mask based on a threshold. For each output map, we varied the threshold and picked the optimal threshold (this is done for each method, including the existing methods we compared against). This technique of varying the threshold was also utilized by Zampoglou et al. in [56] as well as in the 2017 and 2018 MediFor (Media Forensics) challenges hosted by DARPA/NIST^3. We then computed the $F_1$ and $MCC$ metrics by comparing the binary system output mask to the corresponding ground truth mask. For a given spliced image, the $F_1$ metric is defined as

$$F_1(M_{out}, M_{gt}) = \frac{2TP}{2TP + FN + FP},$$

where $M_{out}$ represents the binary system output mask, $M_{gt}$ represents the ground truth mask, $TP$ represents the number of pixels classified as true positive, $FN$ represents the number of pixels classified as false negative, and $FP$ represents the number of pixels classified as false positive. The $F_1$ metric ranges in value from 0 to 1, with a value of 1 being the best. A true positive means that a spliced pixel is correctly classified as spliced, a false negative means that a spliced pixel is incorrectly classified as authentic, and a false positive means that an authentic pixel is incorrectly classified as spliced. For a given spliced image, the $MCC$ metric is defined as

$$MCC(M_{out}, M_{gt}) = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

The $MCC$ metric ranges in value from $-1$ to $1$, with a value of $1$ being the best. For a given dataset and a given method, we report the average $F_1$ and average $MCC$ scores across the dataset.

^3 https://www.darpa.mil/program/media-forensics
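Both metrics follow directly from the confusion counts of the two binary masks; the sketch below assumes 0/1 NumPy masks and guards against a zero denominator in the $MCC$.

```python
import numpy as np

def f1_and_mcc(M_out, M_gt):
    """M_out, M_gt: binary masks (1 = spliced, 0 = authentic)."""
    tp = int(np.sum((M_out == 1) & (M_gt == 1)))
    tn = int(np.sum((M_out == 0) & (M_gt == 0)))
    fp = int(np.sum((M_out == 1) & (M_gt == 0)))
    fn = int(np.sum((M_out == 0) & (M_gt == 1)))
    f1 = 2 * tp / (2 * tp + fn + fp) if (2 * tp + fn + fp) > 0 else 0.0
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return f1, mcc
```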
3.4 Experimental Results
3.4.1 Performance Comparison
We compared the proposed SFCN and MFCN methods with a large number of existing
splicing localization algorithms. Following the acronyms in [56], these algorithms are:
ADQ1 [34], ADQ2 [5], ADQ3 [2], NADQ [7], BLK [33], CFA1 [18], CFA2 [12], DCT
[55], ELA [28], NOI1 [38], NOI2 [37] and NOI4. The implementation of these existing
algorithms is provided in a publicly available Matlab toolbox written by Zampoglou
et. al. [56]. As noted in [56], ADQ2, ADQ3, and NADQ require JPEG images as
input because they exploit certain JPEG data directly extracted from the compressed
files. Therefore, these three algorithms could only be evaluated on the CASIA v1.0
and Nimble 2016 SCI datasets, which contain images in JPEG format. These three
algorithms could not be evaluated on the Columbia and Carvalho datasets, which do not
contain images in JPEG format.
For each method, we computed the average $F_1$ and $MCC$ scores across each dataset, and the results are shown in Table 3.2 ($F_1$ scores) and Table 3.3 ($MCC$ scores). We see that the MFCN and SFCN methods outperform the benchmarking algorithms on all four datasets, in terms of both $F_1$ and $MCC$ score. Furthermore, we see that the edge-enhanced MFCN method performs the best among the three proposed methods (the numbers are highlighted in bold in Tables 3.2 and 3.3 to reflect this).

Table 3.2: Average $F_1$ Scores of Proposed and Existing Methods For Different Datasets. For each dataset, we highlight in bold the top-performing method. As noted in [56], ADQ2, ADQ3, and NADQ require JPEG images as input because they exploit certain JPEG data directly extracted from the compressed files. Therefore, these three algorithms could only be evaluated on the CASIA v1.0 and Nimble 2016 SCI datasets, which contain images in JPEG format. For the Columbia and Carvalho datasets (which do not contain images in JPEG format), we put “NA” in the corresponding entries in the table to indicate that these three algorithms could not be evaluated on these two datasets.
Method CASIA v1.0 Columbia Nimble 2016 SCI Carvalho
Edge-enhanced MFCN 0.5410 0.6117 0.5707 0.4795
MFCN 0.5182 0.6040 0.4222 0.4678
SFCN 0.4770 0.5820 0.4220 0.4411
NOI1 0.2633 0.5740 0.2850 0.3430
DCT 0.3005 0.5199 0.2756 0.3066
CFA2 0.2125 0.5031 0.1587 0.3124
NOI4 0.1761 0.4476 0.1635 0.2693
BLK 0.2312 0.5234 0.3019 0.3069
ELA 0.2136 0.4699 0.2358 0.2756
ADQ1 0.2053 0.4975 0.2202 0.2943
CFA1 0.2073 0.4667 0.1743 0.2932
NOI2 0.2302 0.5318 0.2320 0.3155
ADQ2 0.3359 NA 0.3433 NA
ADQ3 0.2192 NA 0.2622 NA
NADQ 0.1763 NA 0.2524 NA
Since the MFCN and SFCN methods were the top performing methods, we show
system output examples from these methods in Figure 3.3. Each row shows a manip-
ulated image, the ground truth mask, and the binary system output mask for SFCN,
MFCN (without edge-enhanced inference), and the edge-enhanced MFCN. As shown
by examples in this figure, the edge-enhanced MFCN yields finer localization of the
spliced region than the SFCN and the MFCN without edge-enhanced inference.
Table 3.3: Average MCC Scores of Proposed and Existing Methods For Various
Datasets. For each dataset, we highlight in bold the top-performing method. As noted
in [56], ADQ2, ADQ3, and NADQ require JPEG images as input because they exploit
certain JPEG data directly extracted from the compressed files. Therefore, these three
algorithms could only be evaluated on the CASIA v1.0 and Nimble 2016 SCI datasets,
which contain images in JPEG format. For the Columbia and Carvalho datasets (which
do not contain images in JPEG format), we put “NA” in the corresponding entries in the
table to indicate that these three algorithms could not be evaluated on these two datasets.
Method CASIA v1.0 Columbia Nimble 2016 SCI Carvalho
Edge-enhanced MFCN 0.5201 0.4792 0.5703 0.4074
MFCN 0.4935 0.4645 0.4204 0.3901
SFCN 0.4531 0.4201 0.4202 0.3676
NOI1 0.2322 0.4112 0.2808 0.2454
DCT 0.2516 0.3256 0.2600 0.1892
CFA2 0.1615 0.3278 0.1235 0.1976
NOI4 0.0891 0.2076 0.1014 0.1080
BLK 0.1769 0.3278 0.2657 0.1768
ELA 0.1337 0.2317 0.1983 0.1111
ADQ1 0.1262 0.2710 0.1880 0.1493
CFA1 0.1521 0.2281 0.1408 0.1614
NOI2 0.1715 0.3473 0.2066 0.1919
ADQ2 0.3000 NA 0.3210 NA
ADQ3 0.1732 NA 0.2512 NA
NADQ 0.0987 NA 0.2310 NA
In addition, we demonstrate that the MFCN-based method is not sensitive to the
threshold value used to binarize the output probability map. Figure 3.4 shows three dif-
ferent spliced images from the CASIA v1.0 dataset (distinct from the CASIA v2.0 train-
ing set), the corresponding MFCN surface probability maps (without edge-enhanced
inference), and the thresholded surface masks across different threshold values. In this
case, the threshold refers to the probability that a given pixel is manipulated (e.g., a
threshold of 0.9 means that only pixels whose probability value is greater than 0.9 will
be marked as manipulated with the color black). We can see from this figure that the
probability map closely resembles the ground truth mask for each manipulated image.
Figure 3.3: System Output Mask Examples of SFCN, MFCN, and Edge-Enhanced
MFCN on the CASIA v1.0 and Carvalho Datasets. Please note that we refer to the
MFCN without edge-enhanced inference simply as MFCN. Each row in the figure shows
a manipulated or spliced image, the ground truth mask, the SFCN output, the MFCN
output, and the edge-enhanced MFCN output. The number below each output example
is the correspondingF
1
score. The first two rows are examples from the CASIA v1.0
dataset, while the other two rows are examples from the Carvalho dataset. It can be seen
from these examples that the edge-enhanced MFCN achieves finer localization than the
SFCN and the MFCN without edge-enhanced inference.
In addition, we can see that the thresholded output masks closely resemble the ground
truth mask across the different threshold values (even for a threshold value of 0.9, which
is likely to yield false negatives). This figure shows that the MFCN is not sensitive to
the chosen threshold, and choosing a relatively high fixed threshold would still yield
satisfactory localization output.
Also, to show that the MFCN is capable of correctly performing pixel-wise classi-
fication of purely authentic images, we thresholded the MFCN surface probability map
(without edge-enhanced inference) for three authentic images from the CASIA v1.0
dataset (which is distinct from the CASIA v2.0 training set). These probability maps are
Figure 3.4: MFCN Surface Masks (Without Edge-Enhanced Inference) for Spliced
Images Using Different Threshold Values. Each row shows a spliced image, the ground
truth mask, the surface probability map, and the corresponding thresholded surface
masks for different threshold values (0.7, 0.8, and 0.9). For each thresholded surface
mask, pixels that are classified as manipulated are marked as black and pixels that are
classified as authentic are marked as white.
thresholded according to the probability of a pixel being manipulated with thresholds of
0.9, 0.8, 0.7, and 0.6. The results are illustrated in Figure 3.5. For each thresholded sur-
face mask, pixels that are classified as manipulated are marked as black and pixels that
are classified as authentic are marked as white. We observe that the thresholded surface
masks are either fully white or very close to fully white across the different threshold
values (even for low thresholds, which are more permissive and likely to yield false
positives).
3.4.2 Performance on JPEG Compressed Images
We also compared the performance of the proposed and existing methods before and
after JPEG compression of the spliced images. We used the Carvalho dataset for this
experiment. The images are originally in PNG format, and we compressed them using
two different quality factors: 50 and 70. Table 3.4 shows the average $F_1$ scores on the original dataset and the JPEG compressed images using the two different quality factors. We see small performance degradation due to JPEG compression. However, the performance of the SFCN and MFCN methods is still better than the performance of existing methods. Furthermore, we see that the performance of the SFCN and MFCN methods on the JPEG compressed dataset is better than the performance of existing methods on the original uncompressed dataset.

Figure 3.5: MFCN Surface Masks (Without Edge-Enhanced Inference) for Authentic Images Using Different Threshold Values. Each row shows an authentic image, and the corresponding thresholded surface masks for different threshold values (0.9, 0.8, 0.7, and 0.6). For each thresholded surface mask, pixels that are classified as manipulated are marked as black and pixels that are classified as authentic are marked as white.
3.4.3 Performance on Gaussian Blurred Images
We also evaluated the performance of the proposed and existing methods after applying
Gaussian blurring or smoothing to the spliced images of the Carvalho dataset. We fil-
tered a given spliced image using a Gaussian smoothing kernel with the following four
standard deviation values (in terms of pixels): $\sigma = 0.5$, $1.0$, $1.5$, and $2.0$. Table 3.5 shows the average $F_1$ scores of the proposed methods and existing methods applied to the original and Gaussian-filtered images. We see slight performance degradation when $\sigma = 2.0$. However, the performance of the SFCN and MFCN methods is still better than the performance of existing methods. Furthermore, we see that the performance of the SFCN and MFCN methods on the blurred images is better than the performance of existing methods on the original unblurred images.

Table 3.4: Average $F_1$ Scores of Proposed and Existing Methods on Original and JPEG Compressed Carvalho Images. For each column, we highlight in bold the top-performing method.
Method Original (No Compression) JPEG Quality = 70 JPEG Quality = 50
Edge-enhanced MFCN 0.4795 0.4496 0.4431
MFCN 0.4678 0.4434 0.4334
SFCN 0.4411 0.4350 0.4326
NOI1 0.3430 0.3284 0.3292
DCT 0.3066 0.3103 0.3121
CFA2 0.3124 0.2850 0.2832
NOI4 0.2693 0.2646 0.2636
BLK 0.3069 0.2946 0.3005
ELA 0.2756 0.2703 0.2728
ADQ1 0.2943 0.2677 0.2646
CFA1 0.2932 0.2901 0.2919
NOI2 0.3155 0.2930 0.2854
3.4.4 Performance on Images with Additive White Gaussian Noise
We also evaluated the performance of the proposed and existing methods on images with additive white Gaussian noise (AWGN). We added AWGN to the images in the Carvalho testing set and set the resulting SNR values to three levels: 25 dB, 20 dB, and 15 dB. Table 3.6 shows the $F_1$ scores of the proposed methods and existing methods on
the original and the noisy images. Again, we see small degradation in the performance of
the proposed methods due to additive noise. However, the performance of the proposed
SFCN and MFCN methods is still better than the performance of existing methods.
Furthermore, we see that the performance of the SFCN and MFCN methods on the
corrupted images is better than the performance of existing methods on the original
uncorrupted images.
Table 3.5: Average $F_1$ Scores of Proposed and Existing Methods On Original and Blurred Carvalho Images. For each column, we highlight in bold the top-performing method.
Method Original (No Blurring) $\sigma=0.5$ $\sigma=1.0$ $\sigma=1.5$ $\sigma=2.0$
Edge-enhanced MFCN 0.4795 0.4849 0.4798 0.4724 0.4482
MFCN 0.4678 0.4694 0.4659 0.4560 0.4376
SFCN 0.4411 0.4403 0.4475 0.4365 0.4219
NOI1 0.3430 0.3330 0.2978 0.2966 0.2984
DCT 0.3066 0.3040 0.3014 0.3013 0.2994
CFA2 0.3124 0.3055 0.2947 0.2907 0.2894
NOI4 0.2693 0.2629 0.2539 0.2500 0.2486
BLK 0.3069 0.3133 0.3177 0.3177 0.3168
ELA 0.2756 0.2740 0.2715 0.2688 0.2646
ADQ1 0.2943 0.2929 0.2922 0.2952 0.2960
CFA1 0.2932 0.2974 0.3085 0.3022 0.3056
NOI2 0.3155 0.3141 0.3074 0.3040 0.2982
3.4.5 Performance on DARPA/NIST MediFor Annual Competi-
tions
Finally, we also participated in the annual competitions of the MediFor (Media Forensics) Program sponsored by the Defense Advanced Research Projects Agency (DARPA)^4, and our MFCN-based approach achieved the highest score (based on Matthews Correlation Coefficient or $MCC$) in the splicing localization task of the 2017 competition, and the second highest score in the 2018 competition. Some examples of localization output are shown in Figure 3.6. We can see from these examples that the localization output closely resembles the ground truth.

^4 https://www.darpa.mil/program/media-forensics
Table 3.6: Average $F_1$ Scores of Proposed and Existing Methods On Original and Noisy Carvalho Images. For each column, we highlight in bold the top-performing method.
Method Original (No Noise) SNR = 25 dB SNR = 20 dB SNR = 15 dB
Edge-enhanced MFCN 0.4795 0.4786 0.4811 0.4719
MFCN 0.4678 0.4674 0.4677 0.4577
SFCN 0.4411 0.4400 0.4328 0.4307
NOI1 0.3430 0.3181 0.3038 0.2918
DCT 0.3066 0.2940 0.2681 0.2485
CFA2 0.3124 0.2863 0.2844 0.2805
NOI4 0.2693 0.2535 0.2496 0.2476
BLK 0.3069 0.2974 0.2813 0.2658
ELA 0.2756 0.2533 0.2473 0.2460
ADQ1 0.2943 0.2916 0.2874 0.2937
CFA1 0.2932 0.2909 0.2906 0.2859
NOI2 0.3155 0.3207 0.3163 0.3108
Figure 3.6: Localization output examples from the 2017 MediFor Challenge dataset,
with the corresponding $MCC$ values. In the ground truth mask, the color black denotes
a spliced pixel and the color white denotes an authentic pixel. The colors pink and
yellow in the ground truth mask denote a pixel that is not scored (according to the
MediFor scoring protocol). One of the reasons that there are no-score regions is that
there may also be non-splicing manipulations present in the image (in addition to the
splicing manipulations).
3.5 Conclusion
It was demonstrated in this chapter that the application of FCN to the splicing localiza-
tion problem yields large improvement over previous techniques. The FCN we utilized
is based on the FCN VGG-16 architecture with skip connections, and we incorporated
several modifications, such as batch normalization layers and class weighting. We first
evaluated a single-task FCN (SFCN) trained only on the surface ground truth mask
(which classifies each pixel in an image as either belonging to the spliced surface/region
or not). Although the single-task network is shown to outperform existing techniques, it
can still yield a coarse localization output in certain cases. Thus, we next proposed the
use of a multi-task FCN (MFCN) that is simultaneously trained on the surface ground
truth mask and the edge ground truth mask, which indicates whether each pixel belongs
to the boundary of the spliced region. For the MFCN-based method, we presented two
different inference approaches. In the first approach, we compute the binary system
output mask by thresholding the surface output probability map. In this approach, the
edge output probability map is not utilized in the inference step. This first MFCN-
based inference approach is shown to outperform the SFCN-based approach. In the
second MFCN-based inference approach, which we refer to as edge-enhanced MFCN,
we utilize both the surface and edge output probability map when generating the binary
system output mask. The edge-enhanced MFCN is shown to yield finer localization
of the spliced region, as compared to the SFCN-based approach and the MFCN without
edge-enhanced inference. The proposed methods were evaluated on manipulated images
from the Carvalho, CASIA v1.0, Columbia, and the DARPA/NIST Nimble Challenge
2016 SCI datasets. The experimental results showed that the proposed methods outper-
form existing splicing localization methods on these datasets, with the edge-enhanced
MFCN performing the best. Furthermore, we participated in the annual DARPA Med-
iFor competitions, and our MFCN-based method achieved the top-rank score in the
splicing localization task of the 2017 competition, and the second highest score in the
2018 competition.
Chapter 4
Efficient Image Splicing Localization
via Contrastive Feature Extraction
4.1 Introduction
Most of the recent splicing localization techniques utilize deep-learning-based method-
ology such as convolutional neural networks (CNNs). Although CNNs have yielded
improvement in the localization of splicing attacks, they rely on careful selection of
hyperparameters, network architecture, and initial filter weights. Furthermore, CNNs
require a very long training time because they utilize stochastic gradient descent and
backpropagation to iteratively update the filter or kernel weights. In this chapter, we
present an alternative approach, which does not require a significant amount of experi-
mental tuning, and does not require a long training time.
This alternative approach is based on Principal Component Analysis (PCA). As we
will see, traditional PCA is not adequate for the task of image splicing localization,
and so we will first present a new dimensionality-reduction technique for performing
discriminative feature extraction, referred to as cPCA++ [45] (where cPCA stands for
contrastive PCA). We then propose a new approach for image splicing localization based
on cPCA++, which is significantly more efficient than state-of-the-art techniques such
as the Multi-task Fully Convolutional Network (MFCN) [46], and still achieves com-
parable performance scores. Also, cPCA++ is derived via matrix factorization, which
allows us to identify underlying bases and corresponding weightings in the data that can
be used for denoising applications (see App. 4.B for such an example).
The rest of this chapter is organized as follows. Sec. 4.2 presents cPCA++ as a new
dimensionality-reduction technique, discusses how it differs from traditional PCA, and
presents simulation results. The general framework or pipeline of the proposed cPCA++
approach for splicing localization is presented in Sec. 4.3. Experimental results on
splicing datasets are presented in Sec. 4.4. Finally, concluding remarks are given in
Sec. 4.5.
4.2 cPCA++
4.2.1 The cPCA++ Method
As we will see in Sec. 4.3, one characteristic of the image splicing localization problem
is the strong similarity between authentic and spliced edges, especially when spliced
edges have been smoothed over with a low-pass filter. Thus, traditional Principal
Component Analysis (PCA) is not able to effectively discriminate between spliced and
authentic edges. In order to deal with this issue, we will first study the problem of dimen-
sionality reduction for extremely similar classes. Doing so will allow us to develop a
new and efficient dimensionality-reduction technique that we refer to as cPCA++, which
is inspired by a recently proposed algorithm called “contrastive PCA” (cPCA) [1]. In
order to derive the cPCA++ method, we shall first consider the traditional PCA method.
Suppose we are presented with a data matrix $Z \in \mathbb{R}^{M \times N}$, where $M$ denotes the original feature dimension of the data and $N$ denotes the number of instances included in $Z$. PCA first computes the empirical covariance matrix of $Z$:

$$R \triangleq \frac{1}{N} Z Z^T \qquad (4.1)$$

where we assumed that $Z$ has zero-mean. Next, PCA would compute the subspace spanned by the $K$ top or leading eigenvectors of $R$ (i.e., those corresponding to the largest $K$ eigenvalues). The basis for this space would constitute the filters used to process the input data:

$$F_{\text{PCA}} \triangleq \text{evecs}_K(R) \qquad (4.2)$$

where $K$ denotes the number of leading eigenvectors to return. Now, a low-dimensional version of the input data $Z$ can be obtained as:

$$Y_{\text{PCA}} = F_{\text{PCA}}^T Z. \qquad (4.3)$$

It is known that the PCA filters $F_{\text{PCA}}$ preserve the most energy in $Z$ after the transformation. Observe that this property may not yield separability of classes in $Z$. That is, if the data matrix $Z$ is composed of multiple classes (e.g., spliced and authentic edges in our case), performing traditional PCA will not necessarily allow us to find a representation in which we can separate the classes.
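For reference, a minimal NumPy sketch of (4.1)–(4.3), assuming a zero-mean data matrix whose columns are instances, is:

```python
import numpy as np

def pca_filters(Z, K):
    """Z: zero-mean data matrix of shape (M, N). Returns the PCA filters
    F_PCA (M x K) and the reduced representation Y_PCA (K x N)."""
    R = Z @ Z.T / Z.shape[1]                           # empirical covariance, Eq. (4.1)
    eigvals, eigvecs = np.linalg.eigh(R)               # R is symmetric
    F_pca = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # leading eigenvectors, Eq. (4.2)
    Y_pca = F_pca.T @ Z                                # Eq. (4.3)
    return F_pca, Y_pca
```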
In this work, we will take a different approach, inspired by a recently proposed
algorithm called the “contrastive PCA” (cPCA) method [1], which obtains discrimina-
tive filters. This approach focuses on finding directions that yield large variations for
one dataset, referred to as the “target” or “foreground” dataset, while simultaneously
yielding small variations for another dataset, referred to as the “background” dataset.
There are two problem setups that benefit from this scheme, and we examine them in
the following remark.
Remark 1. In [1], the problem is to discover structures that are unique to one dataset
relative to another. There are two ways of utilizing the solution to this problem. They
are described as follows.
1) Consider the example described more thoroughly in Sec. 4.2.2. We are provided a
“target” or “foreground” dataset with digits superimposed atop of grass back-
ground images. Observe that the target dataset contains multiple classes (i.e.,
the two digits, 0 and 1). Examples of instances of this dataset are illustrated in
Fig. 4.2. We wish to learn filters that are tuned to the digits, as opposed to the
relatively strong grass background images. In order to accomplish this task, con-
sider having access to a “background” dataset that only contains instances of
grass images (i.e., without digits). The task is to discover the structures that are
unique to the target dataset and not present in the background dataset. If this
is accomplished, the filters are hoped to be able to differentiate between the two
digits in the “target” dataset.
2) In this alternative setup, we would like to solve a binary classification problem with
the caveat that the two classes are very similar, and are dominated by the struc-
tures that are present in both classes. This is the case in the image splicing local-
ization problem, where one class is the spliced edges and the other class is the
authentic edges. In this scenario, we consider one class (i.e., the spliced edges)
to be the target dataset and the other class (i.e., the authentic edges) to be the
background dataset. If we are able to learn filters that are tuned to the structures
unique to the spliced edges, then these filters form a basis for an efficient classifier
that is able to differentiate between spliced and authentic edges. We will see the
form of this classifier in Sec. 4.3.
We will take a detection-based approach to dimensionality reduction. We will set up
a detector whose task is to identify whether or not a presented data matrix contains
the interesting or special structures present in the foreground dataset (note that the data
matrix may contain background instances as well). By following this route, we will
see that the detector will tune itself to specifically look for these special structures in
the presented data matrix. Fortunately, the detector will look for these structures by
transforming the incoming data matrix with a set of linear filters. This means that we
may simply utilize the output of these filters as a low-dimensionality representation of
the input data matrix, tuned specifically to the interesting structures.
To set up the problem mathematically, we assume that the data matrix $Z_b \in \mathbb{R}^{M \times N_b}$ is collected in an independently and identically distributed (i.i.d.) manner from the background dataset and has the following distribution:

$$Z_b \sim \mathcal{N}(0_{M \times N_b}, \Sigma_b), \quad \text{[$Z_b$ from background dataset]} \qquad (4.4)$$

where $\Sigma_b \in \mathbb{R}^{M \times M}$ denotes an unknown covariance matrix of the background dataset, and $Z \sim \mathcal{N}(A, \Sigma)$ with $Z, A \in \mathbb{R}^{M \times N}$ and positive-definite $\Sigma \in \mathbb{R}^{M \times M}$ means that $Z$ has the following probability density function (pdf):

$$p(Z; A, \Sigma) \triangleq \frac{\exp\left( -\frac{1}{2} \mathrm{Tr}\left[ \Sigma^{-1} (Z - A)(Z - A)^T \right] \right)}{\sqrt{(2\pi)^{MN} |\Sigma|^{N}}}. \qquad (4.5)$$

On the other hand, we assume that when the instances of $Z_f \in \mathbb{R}^{M \times N_f}$ are sampled independently from the foreground dataset, $Z_f$ has the following distribution:

$$Z_f \sim \mathcal{N}(W Y_f, \Sigma_f), \quad \text{[$Z_f$ from foreground dataset]} \qquad (4.6)$$

where the mean of $Z_f$ can be factored as the product of a basis matrix $W \in \mathbb{R}^{M \times K}$ and a low-dimensionality representation matrix $Y_f \in \mathbb{R}^{K \times N_f}$. The inner dimension $K$ represents the underlying rank of the data matrix $Z_f$, which is generally assumed to be much smaller than the initial dimension $M$.
Now, it is assumed that we are presented with a data matrix $Z \in \mathbb{R}^{M \times N}$ that could be drawn completely from the background dataset (we refer to this case as the null hypothesis $\mathcal{H}_0$) or contain instances from the background dataset and some instances from the foreground dataset (we refer to this case as hypothesis $\mathcal{H}_1$). That is, the data matrix $Z$ under $\mathcal{H}_1$ can be written as:

$$Z \triangleq \begin{bmatrix} Z_f & Z_b \end{bmatrix} \qquad (4.7)$$

where $Z_f \in \mathbb{R}^{M \times N_f}$ and $Z_b \in \mathbb{R}^{M \times N_b}$ denote data instances from the foreground and background datasets, respectively. When $\mathcal{H}_1$ is active, we assume that we know the value of $N_f$. Given (4.4) and (4.6) above, we have that the mean of $Z$ under $\mathcal{H}_1$ can be written as:

$$\mathbb{E}[Z] = \begin{bmatrix} \mathbb{E}[Z_f] & \mathbb{E}[Z_b] \end{bmatrix} = \begin{bmatrix} W Y_f & 0_{M \times N_b} \end{bmatrix} = W Y_f T \qquad (4.8)$$

where we have introduced the matrix $T \in \mathbb{R}^{N_f \times N}$, where $N \triangleq N_f + N_b$, with the following structure:

$$T \triangleq \begin{bmatrix} I_{N_f \times N_f} & 0_{N_f \times N_b} \end{bmatrix}. \qquad (4.9)$$
Therefore, the detection/classification problem we wish to examine is the determination of the active hypothesis (either $\mathcal{H}_0$ or $\mathcal{H}_1$), where the data matrix is described, under each hypothesis, by:

$$Z \sim \begin{cases} \mathcal{N}(0_{M \times N}, \Sigma_0), & \mathcal{H}_0 \\ \mathcal{N}(W Y_f T, \Sigma_1), & \mathcal{H}_1. \end{cases} \qquad (4.10)$$

Note that $\Sigma_0$ and $\Sigma_1$ denote the covariance matrix under $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively. The underlying assumption is that when the data matrix contains only instances from the background dataset, then the mean would be zero. We thus assume that all data instances of $Z$ have the mean of the respective partitions subtracted off. For example, if the raw (unprocessed) data matrix is given by $\widetilde{Z} \triangleq \begin{bmatrix} \widetilde{Z}_f & \widetilde{Z}_b \end{bmatrix}$, then the partitions of the data matrix $Z = \begin{bmatrix} Z_f & Z_b \end{bmatrix}$ are given by:

$$Z_b \triangleq \widetilde{Z}_b - \frac{1}{N_b} \widetilde{Z}_b 1_{N_b} 1_{N_b}^T \qquad (4.11)$$
$$Z_f \triangleq \widetilde{Z}_f - \frac{1}{N_f} \widetilde{Z}_f 1_{N_f} 1_{N_f}^T. \qquad (4.12)$$

In solving this detection problem, we will be able to obtain the matrix $Y_f \in \mathbb{R}^{K \times N_f}$, which is the desired low-dimensionality representation of the foreground dataset. While it may initially appear that the detection problem is relatively simple—especially when the mean under the foreground dataset is relatively large in magnitude compared to the power in the covariance matrices, the detection problem is difficult when the mean is small compared to the covariance power (i.e., it is masked by large variation that naturally occurs in both the background and foreground datasets), as outlined earlier and in [1]. This is true in the context of image splicing localization, so it is assumed that the mean under the foreground dataset is small compared to the covariance power.
Now, we would like to determine if a given data matrix $Z$ belongs to the distribution in the null hypothesis ($\mathcal{H}_0$) or the alternative hypothesis ($\mathcal{H}_1$). A classical approach to such problems is the likelihood ratio test, which is derived via the test:

$$p(\mathcal{H}_1 \,|\, Z) \gtrless p(\mathcal{H}_0 \,|\, Z) \;\;\Longleftrightarrow\;\; p(Z \,|\, \mathcal{H}_1)\, p(\mathcal{H}_1) \;\overset{(a)}{\gtrless}\; p(Z \,|\, \mathcal{H}_0)\, p(\mathcal{H}_0) \qquad (4.13)$$

where step $(a)$ is due to Bayes' rule, and $p(\mathcal{H}_0)$ and $p(\mathcal{H}_1)$ denote the prior probabilities of hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively. The above can readily be re-arranged as:

$$\frac{p(Z \,|\, \mathcal{H}_1)}{p(Z \,|\, \mathcal{H}_0)} \;\gtrless\; \eta \triangleq \frac{p(\mathcal{H}_0)}{p(\mathcal{H}_1)}. \qquad (4.14)$$

This is referred to as the “likelihood ratio test.” It evaluates the likelihood ratio of the data matrix $Z$ under each hypothesis and compares it to a threshold given by $\eta$. The likelihood ratio on the left-hand-side of (4.14) is desired to have a large value when the hypothesis $\mathcal{H}_1$ is active, but is desired to take on a small value when the hypothesis $\mathcal{H}_0$ is active. The threshold on the right-hand-side of (4.14) is generally swept to achieve different performance that trades the probability of detection to false-alarm. Observe that the likelihood ratio test is only useful when the probabilities $p(Z \,|\, \mathcal{H}_1)$ and $p(Z \,|\, \mathcal{H}_0)$ can be computed (i.e., the probability density function is completely known or all of its parameters can be estimated). In our case, however, we do not have knowledge of many parameters, including the mean components $W$ and $Y_f$, as well as the covariance matrices $\Sigma_0$ and $\Sigma_1$. In order to deal with this situation, a common approach is to utilize what is referred to as the generalized likelihood ratio test (GLRT), which optimizes over parameters that are unknown from within the ratio. In our case, the GLRT is given by [27, p. 200]:

$$\text{GLRT}(W) \triangleq \frac{\max_{\Sigma_1, Y_f} f_1(Z; \Sigma_1, Y_f)}{\max_{\Sigma_0} f_0(Z; \Sigma_0)} \;\gtrless\; \eta \qquad (4.15)$$

where $\eta$ denotes the threshold parameter, $f_1(Z; \Sigma_1, Y_f)$ denotes the Gaussian pdf of $Z$ under the hypothesis $\mathcal{H}_1$, and $f_0(Z; \Sigma_0)$ denotes the Gaussian pdf of $Z$ under the null-hypothesis $\mathcal{H}_0$. That is,
$$f_1(Z; \Sigma_1, Y_f) = \frac{\exp\left( -\frac{1}{2} \mathrm{Tr}\left[ \Sigma_1^{-1} (Z - W Y_f T)(Z - W Y_f T)^T \right] \right)}{\sqrt{(2\pi)^{MN} |\Sigma_1|^{N}}} \qquad (4.16)$$

$$f_0(Z; \Sigma_0) = \frac{1}{\sqrt{(2\pi)^{MN} |\Sigma_0|^{N}}} \exp\left( -\frac{1}{2} \mathrm{Tr}\left[ \Sigma_0^{-1} Z Z^T \right] \right). \qquad (4.17)$$

In (4.15), we clearly indicate that the resulting GLRT statistic will be a function of the matrix $W$, which is the basis for the feature mean under $\mathcal{H}_1$. We will return to this point later in the derivation. Now, we can optimize each pdf over the unknown covariance matrix, independently, as prescribed by the GLRT in (4.15) to obtain that the optimal covariance matrix for each case is given by the empirical covariance matrix (i.e., the maximum-likelihood covariance matrix estimator for a Gaussian distribution [15, p. 89]), so that:

$$\max_{\Sigma_0} f_0(Z; \Sigma_0) = f_0\!\left(Z; \frac{1}{N} Z Z^T\right) = \left( (2\pi e)^{M} \left| \frac{1}{N} Z Z^T \right| \right)^{-N/2} \qquad (4.18)$$

and the pdf for $\mathcal{H}_1$ is given by

$$\max_{\Sigma_1} f_1(Z; \Sigma_1, Y_f) = f_1\!\left(Z; \frac{1}{N} (Z - W Y_f T)(Z - W Y_f T)^T, Y_f\right) = \left( (2\pi e)^{M} \left| \frac{1}{N} (Z - W Y_f T)(Z - W Y_f T)^T \right| \right)^{-N/2}. \qquad (4.19)$$
Thus, the GLRT simplifies to:

$$\text{GLRT}(W) = \frac{\max_{Y_f} \left( (2\pi e)^{M} \left| \frac{1}{N} (Z - W Y_f T)(Z - W Y_f T)^T \right| \right)^{-\frac{N}{2}}}{\left( (2\pi e)^{M} \left| \frac{1}{N} Z Z^T \right| \right)^{-\frac{N}{2}}} \;\overset{(a)}{\Longleftrightarrow}\; \frac{\left| Z Z^T \right|}{\min_{Y_f} \left| (Z - W Y_f T)(Z - W Y_f T)^T \right|} \qquad (4.20)$$

where step $(a)$ states that the GLRT statistic is equivalent to (4.20). Now, using the derivation in [19], we have that the GLRT statistic can be written as (see equation (8) and related derivations in App. A from [19]):

$$\text{GLRT}(W) = \frac{\left| W^T (Z_b Z_b^T)^{-1} W \right|}{\left| W^T (Z Z^T)^{-1} W \right|}. \qquad (4.21)$$

We will now optimize over our free variable $W$. When the null-hypothesis is active, it can be shown that the GLRT statistic is independent of $W$. On the other hand, when $\mathcal{H}_1$ is active, we have that the GLRT is dependent on $W$ and is given by:

$$\text{GLRT}(W) \overset{(a)}{=} \frac{\left| W^T (Z_b Z_b^T)^{-1} W \right|}{\left| W^T (Z_f Z_f^T + Z_b Z_b^T)^{-1} W \right|} \;\; \text{[$\mathcal{H}_1$ is active]} \;=\; \frac{\left| W^T R_b^{-1} W \right|}{\left| W^T \left( \frac{N_f}{N_b} R_f + R_b \right)^{-1} W \right|} \qquad (4.22)$$

where $(a)$ is obtained via (4.7) and we have defined the following quantities:

$$R_b \triangleq \frac{1}{N_b} Z_b Z_b^T, \qquad R_f \triangleq \frac{1}{N_f} Z_f Z_f^T \qquad (4.23)$$

where $R_b$ denotes the second-order statistic associated with the background dataset, while $R_f$ denotes the second-order statistic associated with the foreground dataset.
Now, to ensure that the gap in the GLRT statistic between the two hypotheses is as large as possible, we must maximize the GLRT value over $W$ when the hypothesis $\mathcal{H}_1$ is active (since the GLRT value is independent of $W$ when the null-hypothesis is active). It turns out that the value of $W$ that maximizes the GLRT is obtained by solving a generalized eigenvalue problem [53] [20, pp. 454–455]. The optimal choice for $W$ turns out to be the leading eigenvectors of the matrix $I_{M \times M} + \frac{N_f}{N_b} R_f R_b^{-1}$, or, equivalently^1, the leading eigenvectors of the matrix $R_f R_b^{-1}$ when the eigenvalues of $R_f R_b^{-1}$ are non-negative (we will show later that this is indeed the case^2). While the solution to $W$ itself is not directly interesting for the problem of image splicing localization (it is relevant for applications related to matrix factorization, denoising, and dictionary learning [39]—which we explore in App. 4.B), the reduced-dimensionality matrix $Y_f$ is much more relevant for dimensionality reduction purposes.

The maximum-likelihood estimator for $Y_f$ (which was used to obtain (4.21)) is written as

$$Y_f = F^T Z_f \qquad (4.24)$$

where $F \in \mathbb{R}^{M \times K}$ is a bank of filters yet to be determined, but has the following dependence on $W$ [19]:

$$F = R_b^{-1} W (W^T R_b^{-1} W)^{-1}. \qquad (4.25)$$

^1 Assuming a square matrix $C = U D U^{-1}$ is diagonalizable with non-negative eigenvalues, the matrix $I + \alpha C$ for $\alpha > 0$ can be written as $I + \alpha C = U U^{-1} + \alpha U D U^{-1} = U(I + \alpha D) U^{-1}$, which modifies all eigenvalues of $C$ by adding one to a positive-scaled version of them. This does not change the order of the eigenvalues of $C$ since they are assumed to be non-negative, and thus $I + \alpha C$ and $C$ share the same leading eigenvectors when $\alpha > 0$ and the eigenvalues of $C$ are non-negative.
^2 We will later show that the eigenvalues of $R_b^{-1} R_f$ are non-negative. Observe, however, that $R_b^{-1} R_f$ and $R_f R_b^{-1}$ are similar since $R_b$ is non-singular [22, p. 53], which means that the eigenvalues of $R_f R_b^{-1}$ are also non-negative.
In fact, the filters in (4.25) can be shown to be the leading eigenvectors of the matrix:

$$Q \triangleq R_b^{-1} R_f. \qquad (4.26)$$

This can be seen by noting that the optimal $W$ (being the eigenvectors of $R_f R_b^{-1}$) must satisfy $R_f R_b^{-1} W = W D$ for some diagonal matrix $D$ that contains the eigenvalues. From this, we have that $R_b^{-1} R_f R_b^{-1} W = R_b^{-1} W D$. Denoting $P = R_b^{-1} W \Lambda$ for some diagonal invertible scaling matrix $\Lambda$ that we are free to choose, we have that $P$ denotes the eigenvectors of the matrix $Q = R_b^{-1} R_f$ (since $Q P = P D$). Substituting this into (4.25), we have that $F = P \Lambda^{-1} (W^T R_b^{-1} W)^{-1} = P$, with the choice of the matrix $\Lambda = (W^T R_b^{-1} W)^{-1}$. Now, we would like to show that the matrix $\Lambda$ is diagonal. First, observe that $\Lambda^{-1} = W^T R_b^{-1} W = \Lambda^{-1} P^T R_b P \Lambda^{-1}$, so that $\Lambda = P^T R_b P$. The matrix $P^T R_b P$ can be shown to be diagonal since the columns of $P$ are $R_b$-orthogonal for distinct eigenvalues and the columns of $P$ that correspond to the same eigenvalue can be chosen to be $R_b$-orthogonal [40, p. 345]. Thus, $\Lambda = P^T R_b P$ is a diagonal matrix. It can also be shown that $\Lambda$ is positive definite (and thus invertible) as long as $W$ has full column rank. Furthermore, the matrix $W$ has full column rank because the matrix $R_f R_b^{-1}$ can be shown to be diagonalizable. This is because $R_f R_b^{-1}$ can be shown to be similar to $R_b^{-1/2} R_f R_b^{-1/2}$ [30, p. 81], which is real and symmetric and thus diagonalizable [30, p. 97].

The obtained filters attempt to examine the specific structures that differentiate samples from $\mathcal{H}_0$ and $\mathcal{H}_1$. This leads to a relatively straightforward and efficient method for dimensionality reduction, which is listed in Alg. 2. Also, note that it is assumed that the matrix $R_b$ is positive-definite. If it is not positive-definite, it is customary to apply diagonal-loading (addition of $\epsilon I_{M \times M}$ to the covariance matrix for a small $\epsilon > 0$) to force the covariance matrix to be positive-definite when it is rank-deficient (although, if enough diverse data samples are available, it is rare that the covariance matrix is rank-deficient).
Algorithm 2 cPCA++ Method
Inputs: background data matrix $\widetilde{Z}_b \in \mathbb{R}^{M \times N_b}$; target/foreground data matrix $\widetilde{Z}_f \in \mathbb{R}^{M \times N_f}$; $K$: dimension of the output subspace
1. Center the data $\widetilde{Z}_b$, $\widetilde{Z}_f$ by obtaining $Z_b$ and $Z_f$ via (4.11)–(4.12)
2. Compute:
$$R_b = \frac{1}{N_b} Z_b Z_b^T \qquad (4.27)$$
$$R_f = \frac{1}{N_f} Z_f Z_f^T \qquad (4.28)$$
3. Perform eigenvalue decomposition on
$$Q = R_b^{-1} R_f \qquad (4.29)$$
4. Compute the top $K$ right-eigenvectors $F$ of $Q$
Return: the subspace $F \in \mathbb{R}^{M \times K}$
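A compact NumPy sketch of Alg. 2 is given below. An optional diagonal-loading term is included, as discussed above, and the eigenvectors are taken with numpy.linalg.eig since $Q$ is generally not symmetric; their real parts are kept, which is justified by the discussion that follows. The function and argument names are ours.

```python
import numpy as np

def cpca_pp(Zb_raw, Zf_raw, K, loading=0.0):
    """Sketch of Algorithm 2 (cPCA++). Zb_raw: M x Nb background data,
    Zf_raw: M x Nf target/foreground data, K: output dimension,
    loading: optional diagonal-loading constant for R_b."""
    # Step 1: center each partition, as in (4.11)-(4.12)
    Zb = Zb_raw - Zb_raw.mean(axis=1, keepdims=True)
    Zf = Zf_raw - Zf_raw.mean(axis=1, keepdims=True)
    # Step 2: second-order statistics (4.27)-(4.28)
    Rb = Zb @ Zb.T / Zb.shape[1] + loading * np.eye(Zb.shape[0])
    Rf = Zf @ Zf.T / Zf.shape[1]
    # Step 3: eigen-decomposition of Q = R_b^{-1} R_f (4.29);
    # solving R_b X = R_f avoids forming the explicit inverse
    Q = np.linalg.solve(Rb, Rf)
    eigvals, eigvecs = np.linalg.eig(Q)
    # Step 4: top-K eigenvectors (the eigenvalues are real and non-negative in theory)
    order = np.argsort(eigvals.real)[::-1][:K]
    return np.real(eigvecs[:, order])               # M x K filter matrix F
```

Given the learned filters, the reduced representation of a (centered) data matrix is simply $Y = F^T Z$, as in (4.24); note that only a single eigen-decomposition is needed, in contrast to the parameter sweep of cPCA.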
It is not immediately clear that $F$ is a real matrix, since $Q$ may not be symmetric. However, since the matrix $Q$ is similar to the real and symmetric matrix $H \triangleq R_b^{-1/2} R_f R_b^{-1/2}$ [30, p. 81] (this is because $Q$ is by definition similar to $R_b^{1/2} Q R_b^{-1/2} = R_b^{1/2} R_b^{-1} R_f R_b^{-1/2} = R_b^{-1/2} R_f R_b^{-1/2} = H$), where $R_b^{1/2}$ denotes the matrix square root of $R_b$, we have that $H$ and $Q$ share a spectrum, which is real due to $H$ being a real symmetric matrix [30, p. 78]. It is important to note that the eigenvalues of $Q$ are also non-negative due to it being similar to $H$, and $H$ being a positive-semi-definite matrix^3. That is, since $H$ is positive-semi-definite, its eigenvalues are non-negative, and since $Q$ is similar to $H$, its eigenvalues are also non-negative.

^3 The definition of a positive semi-definite matrix $H$ is that $H$ is symmetric and satisfies $v^T H v \geq 0$ for all $v \neq 0$. In our case, we have $v^T H v = v^T R_b^{-1/2} R_f R_b^{-1/2} v = y^T R_f y = \frac{1}{N_f} y^T Z_f Z_f^T y = x^T x = \|x\|^2 \geq 0$, where $y \triangleq R_b^{-1/2} v$ and we defined $x \triangleq \frac{1}{\sqrt{N_f}} Z_f^T y$ as a transformation of $y$.
This was a requirement for our earlier derivation to show that the leading eigenvectors of $I_{M \times M} + \frac{N_f}{N_b} R_f R_b^{-1}$ are the same as the leading eigenvectors of $R_f R_b^{-1}$. We also know that the eigenvectors of the matrix $H$ are real since $H$ is real and symmetric [30, p. 97]. Let us now denote one of the eigenvectors of $H$, associated with eigenvalue $\lambda \in \mathbb{R}$, as $x \in \mathbb{R}^{M \times 1}$. Then, we have that $H x = \lambda x$ and $Q v = \lambda v$ for some eigenvector $v \in \mathbb{C}^{M \times 1}$ of $Q$ (since $\lambda$ is an eigenvalue of $Q$ as we have shown above—we will see, however, that we can always find eigenvectors in $\mathbb{R}^{M \times 1}$, as opposed to $\mathbb{C}^{M \times 1}$). Multiplying both sides on the left by $R_b^{-1/2}$, we obtain:

$$R_b^{-1/2} H x = Q R_b^{-1/2} x = \lambda R_b^{-1/2} x. \qquad (4.30)$$

Then, we immediately obtain that:

$$Q v = \lambda v \qquad (4.31)$$

where

$$v \triangleq \frac{R_b^{-1/2} x}{\| R_b^{-1/2} x \|_2} \in \mathbb{R}^{M \times 1} \qquad (4.32)$$

denotes the eigenvector of $Q$, which is real due to $x \in \mathbb{R}^{M \times 1}$ and $R_b^{-1/2} \in \mathbb{R}^{M \times M}$. Thus, we see that $F$ is indeed a real matrix. Note that $F$ (which contains the leading $K$ eigenvectors of $Q$ as its columns) may not satisfy $F^T F = I_{K \times K}$ since the eigenvectors of $Q$ are not guaranteed to be orthonormal.
To see the advantage of the cPCA++ method, it is useful to compare it with the “contrastive PCA” (cPCA) algorithm from [1]. The cPCA algorithm from [1] performs an eigen-decomposition of the following matrix:

$$Q_{\text{cPCA}} \triangleq R_f - \alpha R_b \qquad (4.33)$$

where $\alpha$ is a contrast parameter that is swept to achieve different dimensionality reduction output. We observe that the cPCA algorithm is inherently sensitive to the relative scales of the covariance matrices $R_b$ and $R_f$. For this reason, a parameter sweep for $\alpha$ (which adjusts the scaling between the two covariance matrices) was used in [1]. In other words, multiple eigen-decompositions are required in the cPCA algorithm. On the other hand, the cPCA++ algorithm does not require this parameter sweep, as the relative scale between the two covariance matrices will be resolved when performing the eigen-decomposition of the matrix $Q$. To see this property analytically, consider Example 1.
Example 1 (A needle in a haystack). We consider the following $M$-feature example in order to better understand the traditional PCA, cPCA, and cPCA++ methods, and how they react to a small amount of information hidden in a significant amount of “noise.” Let the covariance matrices of the background and target datasets be respectively given by (where we note that $R_b$ is positive-definite when $\gamma, \epsilon > 0$):

$$R_b \triangleq \gamma\, a a^T + \epsilon I_{M \times M} \qquad (4.34)$$
$$R_f \triangleq \beta\, a a^T + \delta\, c c^T \qquad (4.35)$$

where $a, c \in \mathbb{R}^{M \times 1}$ and it is assumed that $\|a\|_2 = \|c\|_2 = 1$ and $a^T c = 0$, making the vectors $a$ and $c$ orthonormal. It is further assumed that $\beta \gg \delta$, which means that the vector that is not common to both background and target has very small power in comparison to what is common between them. It is also assumed that $\epsilon < \delta$ and $\epsilon \ll \gamma$, which means that the diagonal-loading term $\epsilon I_{M \times M}$ should not affect the result significantly. It is assumed that $\gamma, \epsilon, \beta, \delta > 0$. In this example, we would like to obtain a filter to reduce the dimension to $K = 1$ from our original $M$ features. Note that (4.35) describes a matrix with eigenvalues $\beta$ and $\delta$ and associated eigenvectors $a$ and $c$. The reason why we are interested in the above problem is that by examining the covariance structure above, we see that the data is inherently noise-dominated. This noise (i.e., the $a a^T$ term) appears in both the background dataset and the target dataset, but the target dataset may contain some class-specific information (i.e., the $c c^T$ term) that is not present in the background dataset. However, this “interesting” information is dominated by the noise due to the eigenvector $a$. We wish to perform dimensionality reduction that rejects variance due to the eigenvector $a$ but preserves information due to the eigenvector $c$ of the target dataset. The obvious solution is to choose the dimensionality reduction filter to be $c$. We will examine how each algorithm performs.

PCA When traditional PCA is executed on the target dataset, the leading eigenvector (and thus $F_{\text{PCA}}$) will be found to be $a$ since $\beta \gg \delta$. In addition, since $\gamma \gg \epsilon$, $a$ is also the leading eigenvector of the background dataset, which is not useful in extracting the interesting structure from the foreground dataset. This is because $F_{\text{PCA}}^T c = 0$, which means that this filter will in fact null-out the “interesting” eigenvector.

cPCA When cPCA is executed on the background and target datasets, we have that $Q_{\text{cPCA}} = R_f - \alpha R_b$ will be given by:

$$Q_{\text{cPCA}} = (\beta - \alpha\gamma)\, a a^T + \delta\, c c^T - \alpha\epsilon I_{M \times M}. \qquad (4.36)$$

Now, (4.36) allows us to see exactly the reasoning behind sweeping $\alpha$. When $\alpha = 0$, we obtain the PCA method operating on the target dataset (which would pick the filter defined by $a$). When $\alpha \to \infty$, we obtain the PCA method operating on the background dataset (which still would pick the filter defined by $a$). Instead, the desired value of $\alpha$ is one that nulls out the eigenvector $a$—e.g., $\alpha = \beta/\gamma$. Observe that when this choice of $\alpha$ is made (or chosen through a sweep), we have that the leading eigenvector (and thus $F_{\text{cPCA}}$) of $Q_{\text{cPCA}}$ will be $c$. This is precisely what we want the filter to be in order to differentiate between the two datasets (or extract the useful information from the target dataset relative to the background). Although the optimal $\alpha$ value is evident in this simple analytic example (due to the fact that we know the underlying structure indicated in (4.34) and (4.35)), this is not the case in most experiments and thus it is necessary to sweep over the unbounded range $\alpha \in [0, \infty)$.

cPCA++ For the analysis of the cPCA++ method, let us first consider the inverse of $R_b$, via the matrix inversion lemma:

$$R_b^{-1} = \frac{1}{\epsilon}\left[ I_{M \times M} - \frac{\gamma}{\gamma + \epsilon}\, a a^T \right] \;\overset{(a)}{\approx}\; \frac{1}{\epsilon}\left[ I_{M \times M} - a a^T \right] = \frac{1}{\epsilon} P_a^{\perp} \qquad (4.37)$$
where step $(a)$ is due to the assumption that $0 < \epsilon \ll \gamma$, and $P_a^{\perp} = I_{M \times M} - a a^T$ denotes the projection onto the left null-space of $a$ [30, p. 22] (since $P_a^{\perp} a = 0_M$). Now, multiplying $R_b^{-1}$ by $R_f$, we obtain:

$$Q = R_b^{-1} R_f \approx \frac{1}{\epsilon} P_a^{\perp} \left( \beta\, a a^T + \delta\, c c^T \right) = \frac{\delta}{\epsilon} P_a^{\perp} c c^T \;\overset{(a)}{=}\; \frac{\delta}{\epsilon}\, c c^T \qquad (4.38)$$

where step $(a)$ is due to the fact that $a$ and $c$ are orthonormal. This means that the $Q$ matrix in the case of the cPCA++ method is approximately a rank-1 matrix with leading eigenvector $c$. Clearly then, the filter chosen by cPCA++ will be $F = c$, which is the desired result. Observe that this result was obtained without the need for a hyper-parameter sweep (i.e., like the sweep over the parameter $\alpha$ in cPCA).
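The conclusion of Example 1 can also be checked numerically; the sketch below builds $R_b$ and $R_f$ for one arbitrary choice of the parameters (the numerical values are placeholders consistent with the stated assumptions) and verifies that the leading eigenvector of $R_b^{-1} R_f$ aligns with $c$ without any parameter sweep.

```python
import numpy as np

# Illustrative parameter values satisfying beta >> delta and eps < delta, eps << gamma
M, gamma, eps, beta, delta = 30, 10.0, 1e-3, 10.0, 1e-2
rng = np.random.default_rng(0)
a, c = np.linalg.qr(rng.standard_normal((M, 2)))[0].T   # orthonormal pair a, c
Rb = gamma * np.outer(a, a) + eps * np.eye(M)            # Eq. (4.34)
Rf = beta * np.outer(a, a) + delta * np.outer(c, c)      # Eq. (4.35)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Rb, Rf))
f = np.real(eigvecs[:, np.argmax(eigvals.real)])         # leading eigenvector of Q
print(abs(f @ c))   # close to 1: cPCA++ recovers the direction c in one step
```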
Note that another recent publication [52], also based on the general principle of [1],
was brought to our attention. However, careful review of this article shows that the
approach in [52] is different from ours. In [52], the algorithm was obtained in a similar
manner as the cPCA algorithm (i.e., it was formulated as a slightly modified maximiza-
tion problem). On the other hand, our proposed cPCA++ approach is fundamentally dif-
ferent because it directly addresses the matrix factorization and dimensionality reduction
nature of our problem, and therefore, gives rise to an intuitive per-instance classification
algorithm (see Alg. 3 further ahead) that can be applied to image splicing localization,
and also addresses image denoising problems (see App. 4.B).
In the next subsection, we will simulate both the cPCA method and the cPCA++
method for a collection of experiments from [1] (we will also simulate the popular t-SNE
dimensionality reduction method [51]). We will observe that the cPCA++ algorithm will
be able to discover features from within the target dataset without the need to perform a
parameter sweep, which will improve the running time of the method significantly (see
Sec. 4.2.3).
4.2.2 Performance Comparison of Feature Extraction Methods
In this subsection, we will examine many of the same experiments that the cPCA algorithm^4 was applied to in [1]. Table 4.1 lists the parameters of the various datasets that will be examined in this section. In all cases, the desired dimensions from the methods will be set to $K = 2$ (i.e., after feature reduction). We also incorporated the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm [51] in the results^5, even though t-SNE does not provide an explicit function for the discovered feature reduction. For example, after learning the filters $F$ through Alg. 2, it is possible to efficiently apply them to new data matrices. In the case of t-SNE, however, the dimensionality reduction is done in relationship to the other samples in the dataset, so when a new data matrix is to be reduced in dimensions, the t-SNE algorithm must be re-executed. However, we include it in this exposition since it is a common algorithm for modern feature reduction. The main take-away from these experiments is the fact that the cPCA++ algorithm can obtain an explicit filter matrix, $F$, and does not require a parameter sweep, which makes the algorithm much more efficient. At the same time, it is still able to discover the interesting structures in the target dataset.

^4 https://github.com/abidlabs/contrastive
^5 https://lvdmaaten.github.io/tsne/#implementations
Table 4.1: Overview of the datasets used for comparing the PCA, cPCA, t-SNE, and cPCA++ dimensionality reduction methods. Note that $N_f$ denotes the number of foreground samples, $N_b$ denotes the number of background samples, and $M$ denotes the original feature dimension.
Example $N_f$ $N_b$ $M$
Synthetic 400 400 30
MNIST over Grass 5000 5000 784
Mice Protein Expression 270 135 77
MHealth Measurements 6451 3072 23
Single Cell RNA-Seq of Leukemia Patient 7898 1985 500
Synthetic Data
We first consider a synthetic data example to help illustrate the differences between the traditional PCA, cPCA, and cPCA++ algorithms. Consider the following original feature structure (i.e., prior to feature reduction) for an arbitrary $i$-th sample from a foreground or target data matrix $\widetilde{Z}_f$:
$$\tilde{z}_{f,i}^{k} = \left[ (h_{i,1}^{k})^T \;\; (h_{i,2}^{k})^T \;\; h_{i,3}^T \right]^T \quad (4.39)$$
in which $k \in \{1,2,3,4\}$ denotes the class index, with sub-vectors $h_{i,1}^{k}, h_{i,2}^{k}, h_{i,3} \in \mathbb{R}^{10 \times 1}$, where $h_{i,1}^{k}$, $h_{i,2}^{k}$, $h_{i,3}$ are independent. Now, it is assumed that $h_{i,3} \sim \mathcal{N}(0_{10}, 10 I_{10 \times 10})$. On the other hand, the other two sub-vectors are more useful to differentiate across the different classes $k$:
$$h_{i,1}^{k} \sim \begin{cases} \mathcal{N}(0_{10}, I_{10 \times 10}), & k \in \{1,2\} \\ \mathcal{N}(6 \cdot 1_{10}, I_{10 \times 10}), & k \in \{3,4\} \end{cases} \quad (4.40)$$
while $h_{i,2}^{k}$ is distributed according to:
$$h_{i,2}^{k} \sim \begin{cases} \mathcal{N}(0_{10}, I_{10 \times 10}), & k \in \{1,3\} \\ \mathcal{N}(3 \cdot 1_{10}, I_{10 \times 10}), & k \in \{2,4\}. \end{cases} \quad (4.41)$$
Knowing the above structure, we can deduce that to differentiate between the different classes $1 \leq k \leq 4$, we need only use $h_{i,1}^{k}$ and $h_{i,2}^{k}$, while completely ignoring $h_{i,3}$ as it does not help differentiate between the different classes. On the other hand, traditional PCA executed on the target data matrix $\widetilde{Z}_f$ would readily pick features from $h_{i,3}$ since they have the largest variance, even though they do not help in distinguishing between the different classes.

Instead, it is assumed that there is a background dataset that, while not containing the differentiating features, follows the following distribution:
$$\tilde{z}_{b,i} \sim \begin{bmatrix} \mathcal{N}(0_{10}, 3 I_{10 \times 10}) \\ \mathcal{N}(0_{10}, I_{10 \times 10}) \\ \mathcal{N}(0_{10}, 10 I_{10 \times 10}) \end{bmatrix} \quad (4.42)$$
where the sub-vectors are independent. We observe that $\tilde{z}_{b,i}$ only informs us as to the relative variance across the dimensions, but not the explicit mean values (i.e., feature structure) within the dimensions. It can be shown that the filters obtained by the cPCA++ algorithm are given by (please see App. 4.A):
$$F = \begin{bmatrix} e_1 & e_2 \end{bmatrix} \otimes \frac{1}{\sqrt{10}} 1_{10} \quad (4.43)$$
where $e_n$ denotes the standard basis vector of dimension 3 (i.e., $e_n = [0_{n-1}^T, 1, 0_{3-n}^T]^T$). Note that these are the ideal filters to extract the signal structure from the target dataset described by (4.39). We ran the cPCA++ method on this synthetic dataset, and compared
the results with the other feature reduction methods in Fig. 4.1. Note that the t-SNE
method operated over the target or foreground dataset. For all methods examined, we
extract the dimensionality reduced dataset for K = 2 dimensions. All classes in the
figure are equally likely, and a total of 400 samples were generated. In the top row of
plots in Fig. 4.1, we ran the cPCA algorithm for three positive choices of the contrast parameter $\alpha$. We see that a contrast factor of $\alpha = 2.7$ yields a correct separation among the classes, while the other values of the contrast factor fail to do so. In the bottom-left plot, we have the performance of the traditional PCA algorithm on the target dataset. It is in fact a special case of the cPCA method outlined in [1] for a contrast parameter of $\alpha = 0$. We observe that traditional PCA fails to cluster the data. This is expected by our investigation above, since the traditional PCA method is designed to pick the directions that best explain the most variance of the dataset, which are directions that explain $h_{i,3}$. In
the bottom-center plot, we executed the t-SNE algorithm [51] on the foreground dataset
for 250 iterations (no significant improvement was obtained by running t-SNE longer).
We see that t-SNE also fails to select the correct dimensions with which to visualize
the data. Finally, in the bottom-right plot, we show the data plotted against the feature
directions obtained by the cPCA++ algorithm. We observe that our feature extractor was
able to perform the correct feature selection in one step, without the need for a parameter
sweep. We note that while this is a synthetic example that can be solved analytically,
it helps in illustrating how the method actually works. In the next few simulations,
we compare and contrast the performance of the cPCA++, cPCA, traditional PCA, and
t-SNE algorithms on more complex datasets.
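To make the synthetic setup concrete, the following sketch generates foreground and background samples according to (4.39)-(4.42) and computes the cPCA++ filters as the leading right-eigenvectors of $R_b^{-1} R_f$. It is a minimal illustration in Python/NumPy (the reported experiments used a MATLAB implementation), and helper names such as make_foreground and the small loading factor eps are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_foreground(n_per_class=100):
    """Draw samples from the 30-dimensional target model in (4.39)-(4.41)."""
    samples = []
    for k in range(4):                      # four equally likely classes
        mu1 = 0.0 if k in (0, 1) else 6.0   # mean of h_{i,1}
        mu2 = 0.0 if k in (0, 2) else 3.0   # mean of h_{i,2}
        h1 = rng.normal(mu1, 1.0, size=(n_per_class, 10))
        h2 = rng.normal(mu2, 1.0, size=(n_per_class, 10))
        h3 = rng.normal(0.0, np.sqrt(10.0), size=(n_per_class, 10))
        samples.append(np.hstack([h1, h2, h3]))
    return np.vstack(samples)               # shape (400, 30)

def make_background(n=400):
    """Draw samples from the background model in (4.42)."""
    return np.hstack([rng.normal(0, np.sqrt(3), (n, 10)),
                      rng.normal(0, 1, (n, 10)),
                      rng.normal(0, np.sqrt(10), (n, 10))])

def cpca_pp_filters(Zf, Zb, K=2, eps=1e-6):
    """Top-K right-eigenvectors of R_b^{-1} R_f (samples are stored as rows here)."""
    Zf = Zf - Zf.mean(axis=0)
    Zb = Zb - Zb.mean(axis=0)
    Rf = Zf.T @ Zf / Zf.shape[0]
    Rb = Zb.T @ Zb / Zb.shape[0] + eps * np.eye(Zb.shape[1])   # light diagonal loading
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Rb, Rf))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:K]].real                          # 30 x K filter matrix F

Zf, Zb = make_foreground(), make_background()
F = cpca_pp_filters(Zf, Zb, K=2)
Y = (Zf - Zf.mean(axis=0)) @ F                                 # 2-D embedding of the target data
```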
Figure 4.1: Performance on a synthetic dataset, where different colors are used for the four different classes. The top row of plots shows the performance of the cPCA algorithm for different positive values of the contrast parameter $\alpha$. Clearly, a contrast factor of $\alpha = 2.7$ is ideal, but must be found by sweeping $\alpha$. The bottom-left plot shows the performance of traditional PCA (which, as expected, fails to separate the classes). The bottom-center plot shows the performance of the t-SNE algorithm, which again fails to discover the underlying structure in the high-dimensional data. Finally, the bottom-right figure shows the output obtained by the cPCA++ method, which obtains the ideal clustering without a parameter sweep.
MNIST over High Variance Backdrop
In this experiment (also conducted in [1]), the MNIST dataset [31] is used and the $28 \times 28$ images corresponding to digits 0 and 1 are extracted. Utilizing traditional PCA processing, it is possible to separate these two particular digits. However, the authors of [1] (as we do here) super-imposed these digits atop high-variance grass backgrounds in order to mask the fine details that make the digit recognizable and distinct. Figure 4.2 shows six examples of these “target” images. We see that the digits themselves are difficult to see atop the high-variance background images. The task now is to apply the feature reduction techniques to still separate these digits.

Figure 4.2: Example of six target images. The MNIST images for digits 0 and 1 are superimposed on top of grass images.

In order to obtain a “background” dataset for the cPCA and cPCA++ methods, grass images (without the MNIST digits) randomly sampled from the ImageNet dataset [43] were used to estimate $R_b$. Note that the same grass images that were used to generate the target images need not be utilized, but grass images in general will do. In Fig. 4.3, we evaluate the performance of the dimensionality reduction techniques on the target images illustrated in Fig. 4.2. It can be seen that both traditional PCA and t-SNE have difficulty clustering the two digits, due to the high variance background masking the fine structure of the digits. On the other hand, the cPCA and cPCA++ methods are able to achieve some separation between the two digits, although in both cases there is some overlap. Again, note that cPCA++ does not require a hyperparameter sweep.
Mice Protein Expression
This next example utilizes the public dataset from [21]. This dataset was also used in [1]. In this dataset, we have access to protein expression measurements from mice that have received shock therapy and mice that have not. We would like to analyze the foreground dataset (mice with shock therapy applied) to attempt to cluster mice that have developed Down Syndrome from those that have not. We also have access to a background dataset, which consists of protein expression measurements from mice that did not experience shock therapy.

Figure 4.3: The performance of different dimensionality reduction techniques on the “MNIST over Grass” dataset illustrated in Fig. 4.2. In all the plots, the black markers represent the digit 0 while the red markers represent the digit 1. The top row shows the result of executing the cPCA algorithm for different values of $\alpha$, the bottom-left plot shows the output of the traditional PCA algorithm on the target dataset, the bottom-center plot shows the output of the t-SNE algorithm, and the bottom-right plot shows the output of the cPCA++ method.

In Fig. 4.4 we illustrate the results of the different feature reduction techniques. It can be seen that traditional PCA is unable to effectively cluster the data. On
the other hand, the t-SNE, cPCA, and cPCA++ methods are all able to achieve some
separation of the data. Although t-SNE was able to achieve good separation in this par-
ticular example, we re-iterate the fact that t-SNE is not an appropriate method to be
used to extract filters or function transformations from the high-dimensional space to
the lower-dimensional space as the algorithm must be re-executed with new samples
in order to map previously unmapped data. There is one other significant observation
about this example: there is a very small number of samples in the background dataset.
This caused the background covariance matrix, $R_b$, to be rank-deficient. For this reason, diagonal loading was employed to ensure that the background covariance matrix is invertible.
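As an aside, the diagonal loading mentioned above can be written in one line; the sketch below (Python/NumPy) scales the loading by the average eigenvalue of the covariance estimate, and the factor gamma is a hypothetical choice rather than a value taken from the reported experiments.

```python
import numpy as np

def diagonally_load(R_b, gamma=1e-3):
    """Regularize a (possibly rank-deficient) covariance estimate so it is invertible."""
    # Scale the loading term by the average eigenvalue (trace / dimension)
    # so that gamma acts as a relative, unit-free amount of regularization.
    return R_b + gamma * (np.trace(R_b) / R_b.shape[0]) * np.eye(R_b.shape[0])
```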
Figure 4.4: The performance of different dimensionality reduction techniques on the
Mice Protein Expression dataset [21]. In all the plots, the black markers represent the
non-Down Syndrome mice while the red markers represent the Down Syndrome mice.
MHealth Measurements
In the next experiment (also conducted in [1]), we use the IMU data from [4]. In this
dataset, a variety of sensors are used to monitor subjects that are performing squats,
jogging, cycling, or lying still. The sensors include gyroscopes, accelerometers, and
electrocardiograms (EKG). In the top-left subplot of Fig. 4.5, we see that performing
traditional PCA on a dataset of subjects that are performing squats or cycling does not
yield a reasonable clustering for those two activities. On the other hand, the optimal
contrast parameter was used for cPCA, and the optimal cPCA result is illustrated in the
top-right subplot of Fig. 4.5. For both the cPCA and cPCA++ methods, the background
covariance matrix $R_b$ was obtained from subjects lying still. We observe that the cPCA
algorithm was able to cluster the two activities effectively. The t-SNE output is then
shown in the bottom-left subplot of the figure. We see that t-SNE was unable to dis-
tinguish the two activities. Finally, the output of the cPCA++ method is shown in the
bottom-right subplot of the figure. We see that cPCA++ is able to effectively cluster the
two activities, without the need for a parameter sweep.
Single Cell RNA-Seq of Leukemia Patient
Our final dataset in this section was obtained from [57], and this analysis was also con-
ducted in [1]. The target or foreground dataset contains single-cell RNA expression lev-
els of a mixture of bone marrow mononuclear cells (BMMCs) from a leukemia patient
before and after stem cell transplant, and the goal is to separate the pre-transplant sam-
ples from the post-transplant samples. The original features were reduced to only 500
genes based on the value of the variance divided by the mean. Once these features are
obtained, we execute the cPCA algorithm for different values of $\alpha$ and plot the result in the top row of Fig. 4.6. In this example, the background covariance matrix $R_b$ was obtained from RNA-Seq measurements from a healthy individual's BMMCs, for both the cPCA and cPCA++ methods.

Figure 4.5: Clustering result of the MHealth Dataset for performing squats and cycling. In all the plots, the red markers denote squatting activity while the black markers denote cycling. In the top-left and bottom-left subplots, we see that traditional PCA and t-SNE are incapable of separating the two activities, respectively. In the top-right and bottom-right subplots, we see that cPCA and cPCA++ are capable of clustering the activities, respectively. Note that we show the optimal cPCA result (after performing the parameter sweep).

As noted in [1], this should allow the dimensional-
ity reduction methods to focus on the novel features of the target dataset as opposed
to being overwhelmed by the variance due to the heterogeneous population of cells as
well as variations in experimental conditions. We observe that the data is close to being
separable for the contrast value of $\alpha = 3.5$. Next, we plot the traditional PCA output
in the bottom-left subplot of Fig. 4.6, and observe that it is incapable of clustering the
data. Next, in the bottom-center plot, we illustrate the output of the t-SNE algorithm,
and see that while it does achieve some separation of the data, it is not on par with the
cPCA algorithm. Finally, the cPCA++ output is illustrated in the bottom-right subplot of
Fig. 4.6, and we see that it is able to obtain a similar clustering performance as compared
to the optimal cPCA clustering result. Note that cPCA++ does not require a parameter
sweep. The covariance matrix $R_b$ was rank deficient in this example, so we utilized
diagonal loading in order to make it invertible.
4.2.3 Computational Time Performance Comparison
In this section, we will examine the computational time requirements for the various
datasets studied in Sec. 4.2.2. We neglect measuring the computation time required by
the traditional PCA algorithm since it will be very similar to the cPCA++ algorithm
and since traditional PCA is not an effective algorithm for discriminative dimensional-
ity reduction, as we saw in the results of Sec. 4.2.2. As shown in Table 4.2, the pro-
posed cPCA++ method is significantly faster than both the cPCA method and the t-SNE
method. The reason why the cPCA++ method is on average (i.e., averaged over the
experiments outlined in Sec. 4.2.2) 51 times faster than the cPCA method is that while
both methods require the eigen-decomposition of similarly sized covariance matrices,
the cPCA algorithm requires a sweep over the contrast hyper-parameter (i.e., multiple
eigen-decompositions). On the other hand, the cPCA++ method does not require this
hyper-parameter tuning since the scale discrepancy is automatically resolved. It is also
notable to see that cPCA++ is on average 428,654 times faster than t-SNE. The t-SNE method has a very high running time due to the fact that it works by finding relationships between all the points in a dataset, and not through the determination of compact parameters. In addition, it is not very well suited for dimensionality reduction when future data samples will be obtained (since it has to be re-executed on the entire new dataset).

Figure 4.6: Dimensionality reduction result for the Single Cell RNA-Seq of Leukemia Patient example. In all plots, the black markers denote pre-transplant samples while the red markers denote post-transplant samples. The top row illustrates the output of the cPCA algorithm for varying values of $\alpha$, the bottom-left plot shows the output of traditional PCA, the bottom-center plot shows the output of the t-SNE algorithm, while the bottom-right plot shows the output of the cPCA++ method. We observe that the cPCA (for some values of $\alpha$) and cPCA++ methods yield the best clustering for this dataset.
Table 4.2: Time required for the different algorithms to perform the required dimensionality reduction for the various datasets studied in Sec. 4.2.2. All times listed in the table are in seconds. Boldface is used to indicate shortest runtimes and average cPCA++ speedup.

    Example                            cPCA      t-SNE       cPCA++
    Synthetic                          0.062     6.62        0.0017
    MNIST over Grass                   26        811         1.00
    Mice Protein Expression            0.13      2.23        0.0033
    MHealth Measurements               0.091     1388        0.00065
    RNA-Seq of Leukemia Patient        11.70     2155        0.86
    Average cPCA++ Speedup             51x       428,654x    1x
4.3 The cPCA++ Framework for Image Splicing Localization
In this section, we describe the general framework of the cPCA++ approach for image
splicing localization. In this approach, we focus on detecting the spliced boundary or
edge. We take this approach of determining the spliced boundary rather than the spliced
region or surface because there is an ambiguity in the determination of which image is
the donor and the host. For example, consider Figure 4.7, which shows a probe image
and the corresponding spliced boundary ground truth mask. Please note that we use the
color black to denote a pixel belonging to the spliced boundary, and the color white to
denote a pixel not belonging to the spliced boundary. It is possible to consider region A
as the spliced region and region B as the authentic region, or vice versa. Thus, there is
an ambiguity in how one labels the spliced and authentic regions in a given manipulated
image. However, this ambiguity is resolved if we instead consider the spliced boundary
(a) Probe Image (b) Spliced Edge
Figure 4.7: An example illustrating that there is an ambiguity in how one labels the
spliced and authentic regions in a given probe image. The image on the right is a ground
truth mask highlighting the spliced edge or boundary. The color black is used to denote
a pixel belonging to the spliced boundary, while the color white is used to denote a pixel
not belonging to the spliced boundary. It is possible to label region A as the spliced
region and region B as the authentic region, or vice versa.
(rather than the spliced surface/region). We are aware that, in some cases, comparison to
other methods may require a surface output instead of a boundary output. To that end,
we will explore techniques that will allow us to transform the boundary localization
result to a surface output, as discussed in Chapter 5.
The “training” phase consists of two main steps: 1) collect labeled or “training”
samples/patches, 2) build a feature extractor based on the training samples. As we will
see later, the cPCA++ approach does not require the training of a classifier, such as a
support vector machine or random forest. However, we will still refer to patches used
in this phase as “training” samples, since we are using their labeled information (i.e.,
spliced vs. authentic edges). In the testing phase, we are given a new image (i.e.,
not seen during the training phase), and the goal is to output a probability map that
indicates the probability that each pixel belongs to the spliced boundary. The testing
phase consists of the following main steps: 1) divide a given test image into overlapping
patches, 2) extract the feature vector for each test patch, 3) obtain the probability output
for each test patch, and 4) reconstruct the final output map for the given test image. The
training and testing phases are elaborated below.
In the training phase, the first step is to collect labeled or “training” samples/patches
from a given set of images. In order to do this, we need access to two masks for each
training image: 1) the spliced surface ground truth mask and 2) a mask which highlights
the superset of all edges/boundaries in the image (i.e., spliced boundaries as well as
natural/authentic boundaries), referred to as the edge detection output mask. Note that
although we utilize the surface ground truth mask in selecting the training patches, the
output of the cPCA++ method will be edge-based (i.e., the output will represent an
estimate of the true spliced edge/boundary). This will be explained in more detail in the
next paragraph. We utilized the CASIA v2.0 dataset [14]^6 for training purposes. For
the CASIA v2.0 dataset, the surface ground truth masks are not provided. We generated
the ground truth masks using the provided reference information. In particular, for a
given spliced image in the CASIA dataset, the corresponding donor and host images are
provided, and we used this information to generate the ground truth mask. The second
mask that is needed for collecting the training patches is the edge detection output mask.
We utilized structured edge detection for detecting spliced and natural edges [13].
Once we have these two masks, we can proceed to collect the training patches. Each
training image is divided into overlapping patches, and we then select target/foreground
and background samples in the following manner. We would like target/foreground
samples to be those that contain a boundary between a spliced region and an authentic
region (referred to as a spliced boundary). In order to achieve this, we utilize the surface
ground truth masks, and we select foreground samples to be those that have a splicing fraction that lies in a certain range (e.g., 30-70 percent of the patch is a spliced area/surface). On the other hand, background samples do not contain a boundary between a spliced and authentic region. We impose an additional constraint on the background samples so that they contain a minimum amount of authentic edges (as specified by the structured edge detector). Figure 4.8 shows an example of a foreground patch and a background patch. Note that although we are using the surface ground truth masks in selecting foreground and background samples, what is actually important is the spliced boundary/edge (or lack thereof).

^6 Credits for the use of the CASIA Image Tampering Detection Evaluation Database (CASIA TIDE) V1.0 and V2.0 are given to the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science, Corel Image Database and the photographers. http://forensics.idealtest.org
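A minimal sketch of this patch-selection rule is given below. The 30-70 percent splicing-fraction window comes from the description above, while the patch size, stride, and the minimum-edge threshold min_edge_frac are illustrative assumptions rather than values from the original training setup.

```python
import numpy as np

def collect_patches(image, gt_surface, edge_mask, n=64, stride=32,
                    lo=0.30, hi=0.70, min_edge_frac=0.05):
    """Split an image into overlapping n x n patches and label them as
    foreground (contains a spliced boundary) or background (authentic edges only)."""
    fg, bg = [], []
    H, W = gt_surface.shape
    for r in range(0, H - n + 1, stride):
        for c in range(0, W - n + 1, stride):
            patch = image[r:r + n, c:c + n]
            spliced_frac = gt_surface[r:r + n, c:c + n].mean()   # fraction of spliced pixels
            edge_frac = edge_mask[r:r + n, c:c + n].mean()       # fraction of detected edge pixels
            if lo <= spliced_frac <= hi:
                fg.append(patch)             # patch straddles a spliced boundary
            elif spliced_frac == 0.0 and edge_frac >= min_edge_frac:
                bg.append(patch)             # fully authentic patch with enough authentic edges
    return fg, bg
```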
The next step is to build a feature extractor based on the covariance matrices of the foreground and background training samples. Suppose we have collected $N_b$ background patches and $N_f$ foreground patches, and let the size of each patch be $n \times n \times c$, where $c$ is the number of channels. Since we are working with RGB images, $c$ will be equal to 3. Each training patch can be flattened so that it is an $M \times 1$ vector, where $M$ is equal to $cn^2$ (or $3n^2$ if $c = 3$). We can represent the flattened background and foreground patches as data matrices $\widetilde{Z}_b \in \mathbb{R}^{M \times N_b}$ and $\widetilde{Z}_f \in \mathbb{R}^{M \times N_f}$, respectively, such that different columns of $\widetilde{Z}_b$ (or $\widetilde{Z}_f$) correspond to different training samples.
To see why it is necessary to use the cPCA++ method as opposed to traditional PCA for the image splicing localization task, consider the top 50 eigenvectors of the covariance matrix of $\widetilde{Z}_b$ (i.e., authentic edges or background) and the top 50 eigenvectors of the covariance matrix of $\widetilde{Z}_f$ (i.e., spliced edges or target/foreground). We compute the power of the foreground eigenvectors ($U_f$) after projecting them onto the background subspace (spanned by $U_b$) to quantify the similarity in the subspaces spanned by $\widetilde{Z}_b$ and $\widetilde{Z}_f$: $\|P_{U_b} U_f\|_F^2 / \|U_f\|_F^2 = 93.25\%$. What this indicates is that the subspaces spanned by the top 50 principal components of each dataset ($U_f$ and $U_b$) are mostly overlapping, hinting at the fact that it is likely that any classifier will be overwhelmed by the similarity of the features of these two datasets. For this reason, we will utilize the cPCA++ method instead to alleviate this situation and perform discriminative feature extraction.
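Given the two flattened data matrices, the subspace-overlap figure quoted above can be computed with a few lines of linear algebra. The sketch below is an illustration of the computation (one sample per column, as in the text), not the original code.

```python
import numpy as np

def subspace_overlap(Z_b, Z_f, k=50):
    """Fraction of the energy of the top-k foreground eigenvectors U_f
    that lies inside the span of the top-k background eigenvectors U_b."""
    def top_eigvecs(Z, k):
        Zc = Z - Z.mean(axis=1, keepdims=True)   # center per feature
        R = Zc @ Zc.T / Z.shape[1]               # M x M covariance estimate
        w, V = np.linalg.eigh(R)                 # ascending eigenvalues
        return V[:, -k:]                         # top-k eigenvectors (M x k)
    U_b, U_f = top_eigvecs(Z_b, k), top_eigvecs(Z_f, k)
    P_Uf = U_b @ (U_b.T @ U_f)                   # projection of U_f onto span(U_b)
    return np.linalg.norm(P_Uf, 'fro')**2 / np.linalg.norm(U_f, 'fro')**2
```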
We thus utilize Alg. 2, described in Sec. 4.2, to build the feature extractor. We first center the data matrices $\widetilde{Z}_b$ and $\widetilde{Z}_f$ by subtracting their respective means. We then calculate the second order statistics $R_b = \frac{1}{N_b} Z_b Z_b^T$ and $R_f = \frac{1}{N_f} Z_f Z_f^T$, and then perform an eigenvalue decomposition on $Q = R_b^{-1} R_f$. If the matrix $R_b$ is rank-deficient, we utilize diagonal loading in order to make it invertible. We then compute the top $K$ right-eigenvectors $F$ of $Q$. The matrix $F \in \mathbb{R}^{M \times K}$ is used to extract features during the testing phase, and is referred to as the transform matrix.
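A compact sketch of this training step (centering, second-order statistics, diagonal loading, and extraction of the top-$K$ right-eigenvectors of $Q = R_b^{-1} R_f$) is given below. It mirrors the description above but is written in Python/NumPy rather than the MATLAB implementation used for the reported timings, and the loading factor gamma is an illustrative choice.

```python
import numpy as np

def train_cpca_pp(Z_f_tilde, Z_b_tilde, K, gamma=1e-3):
    """Learn the M x K transform matrix F from flattened training patches.
    Z_f_tilde: M x N_f foreground (spliced-edge) patches, one patch per column.
    Z_b_tilde: M x N_b background (authentic-edge) patches, one patch per column."""
    Z_f = Z_f_tilde - Z_f_tilde.mean(axis=1, keepdims=True)    # center each dataset
    Z_b = Z_b_tilde - Z_b_tilde.mean(axis=1, keepdims=True)
    R_f = Z_f @ Z_f.T / Z_f.shape[1]
    R_b = Z_b @ Z_b.T / Z_b.shape[1]
    # Diagonal loading keeps R_b invertible when it is rank-deficient.
    R_b += gamma * (np.trace(R_b) / R_b.shape[0]) * np.eye(R_b.shape[0])
    Q = np.linalg.solve(R_b, R_f)                              # Q = R_b^{-1} R_f
    eigvals, eigvecs = np.linalg.eig(Q)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:K]].real                          # transform matrix F
```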
In the testing phase, we divide each test image into overlapping patches. Similar to
what was done during the training phase, we can flatten the test patches and represent
them as a matrix $\widetilde{Z}_t \in \mathbb{R}^{M \times N_t}$. We then utilize Algorithm 3 to obtain a probability
output for each patch. The motivation behind Algorithm 3 is explained as follows. The
cPCA++ method will naturally find filters that will yield a small norm for the reduced
dimensionality features of background samples and a larger norm for foreground sam-
ples. This means that once the filters are obtained, they are expected to yield small val-
ued output when applied to background patches and larger valued output when applied
to foreground patches. This presents us with a very simple and efficient technique for
obtaining the output: simply measure the output power after dimensionality reduction
and then convert the raw value to a probability-based one. The full output map for a
given test image can then be reconstructed by averaging the contributions from overlap-
ping regions. After reconstruction, we perform an element-wise multiplication of the
output map and the structural edge detection output mask. The reason for doing this is
explained as follows. In this task, we are attempting to classify the spliced edge. There-
fore, if the structural edge detector labeled a given pixel as non-edge (i.e., neither an
authentic edge nor a spliced edge), then the corresponding pixel value in the cPCA++
output map should automatically be set to zero.
Algorithm 3 Algorithm for obtaining an output for each test patch
Inputs: Test data matrix $\widetilde{Z}_t \in \mathbb{R}^{M \times N_t}$; transform matrix $F \in \mathbb{R}^{M \times K}$, obtained during the training phase
1. Center the matrix $\widetilde{Z}_t$ to obtain $Z_t$
2. Compute $Y_t = F^T Z_t$, where each column of $Y_t$ contains the reduced-dimension feature vector for a given test patch
3. Compute the vector $v \in \mathbb{R}^{N_t \times 1}$ where each element $v_i \triangleq \|Y_{t,i}\|_2^2$ for $1 \leq i \leq N_t$ is the squared $L_2$-norm of a given column of $Y_t$
4. Convert $v$ to a probability vector $w$ by computing $w = \frac{v}{\max(v)}$
Return: the vector $w \in \mathbb{R}^{N_t \times 1}$
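The per-patch scoring of Alg. 3, followed by the reconstruction of the full output map described above (averaging overlapping contributions and multiplying by the structural edge mask), might be sketched as follows. The patch size, stride, and bookkeeping details are illustrative assumptions, not the configuration used for the reported results.

```python
import numpy as np

def score_patches(Z_t_tilde, F):
    """Alg. 3: squared L2-norm of each reduced-dimension feature, scaled to [0, 1]."""
    Z_t = Z_t_tilde - Z_t_tilde.mean(axis=1, keepdims=True)    # step 1: center
    Y_t = F.T @ Z_t                                            # step 2: K x N_t features
    v = np.sum(Y_t**2, axis=0)                                 # step 3: squared norms
    return v / v.max()                                         # step 4: probability vector w

def reconstruct_map(image, F, edge_mask, n=64, stride=8):
    """Average overlapping patch scores into a per-pixel map, then keep only edge pixels."""
    H, W = image.shape[:2]
    acc, cnt = np.zeros((H, W)), np.zeros((H, W))
    coords, patches = [], []
    for r in range(0, H - n + 1, stride):
        for c in range(0, W - n + 1, stride):
            coords.append((r, c))
            patches.append(image[r:r + n, c:c + n].reshape(-1))
    w = score_patches(np.array(patches, dtype=float).T, F)     # one score per patch
    for (r, c), wi in zip(coords, w):
        acc[r:r + n, c:c + n] += wi
        cnt[r:r + n, c:c + n] += 1.0
    prob = np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
    return prob * edge_mask                                    # zero out non-edge pixels
```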
4.4 Experimental Analysis
4.4.1 Evaluation/Scoring Procedure
In this subsection, we explain the procedure used to score the output maps from the
cPCA++ method, as well as from the Multi-task Fully Convolutional Network (MFCN)-
based method [46] we compared against. As discussed in Chapter 3, the MFCN-based
method also outputs an edge-based probability map, and thus can be directly compared
with the proposed cPCA++ method. We evaluated the performance of the methods
using the $F_1$ and Matthews Correlation Coefficient (MCC) metrics, which are popular
per-pixel localization metrics. As noted previously, the edge-based approach that we
propose avoids the ambiguity in the labeling of the spliced and authentic surfaces/re-
gions. Because of the ambiguity in the surface-based labeling, most surface-based scoring procedures score the original output map as well as the inverted version of the output map, and select the one that yields the best score. However, in edge-based scoring procedures, there is no need to invert the edge-based output map, and the output can be scored directly.

Figure 4.8: Example of foreground and background patches.

For a given spliced image, the $F_1$ metric is defined as
$$F_1 = \frac{2TP}{2TP + FN + FP},$$
where $TP$ represents the number of pixels classified as true positive, $FN$ represents the number of pixels classified as false negative, and $FP$ represents the number of pixels classified as false positive. The $F_1$ metric ranges in value from 0 to 1, with a value of 1 being the best. For a given spliced image, the $MCC$ metric is defined as
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
The $MCC$ metric ranges in value from $-1$ to $1$, with a value of $1$ being the best.
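For reference, a small sketch computing both metrics for a single image is shown below. It assumes the probability map is binarized at a fixed threshold, which is an illustrative choice since the text does not specify the thresholding procedure.

```python
import numpy as np

def f1_and_mcc(pred_prob, gt_edge, threshold=0.5):
    """Per-pixel F1 and MCC for one image, given a probability map and a binary ground truth."""
    pred = (pred_prob >= threshold).astype(np.int64).ravel()
    gt = gt_edge.astype(np.int64).ravel()
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    f1 = 2 * tp / (2 * tp + fn + fp) if (2 * tp + fn + fp) > 0 else 0.0
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return f1, mcc
```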
4.4.2 Experimental Results
We first compared the training times of the cPCA++ method and the MFCN-based
method, and found the cPCA++ method to be significantly more efficient. The MFCN-
based method (implemented in Caffe [26]) requires a training time of approximately
11.5 hours on a NVIDIA GeForce GTX Titan X GPU, while the cPCA++ method
(implemented in MATLAB) only requires a training time of approximately 1.5 hours
on CPU (Intel Xeon Gold 6126). Please note that the cPCA++ code has not been
optimized to make full utilization of the multi-core processor. Next, we evaluated the
cPCA++ method on the Columbia [23] and Nimble WEB^7 datasets and compared its performance with the MFCN-based method. Tables 4.3 and 4.4 show the Matthews Correlation Coefficient (MCC) and $F_1$ scores, respectively. It can be seen that the cPCA++ method yields higher scores (in terms of both $MCC$ and $F_1$), as compared to the MFCN-based method.
Figures 4.9 and 4.10 show multiple examples of localization output from the
Columbia and Nimble WEB datasets, respectively. Each row shows (from left to right)
a manipulated or probe image with the spliced edge highlighted in pink, the structural
edge detection output mask highlighting both spliced and authentic edges, the cPCA++
raw probability output map, and the MFCN-based raw probability output map. In these
figures, it can be seen that the cPCA++ method yields a finer localization output than
the MFCN-based method.
^7 https://www.nist.gov/itl/iad/mig/nimble-challenge
Table 4.3: Edge-based $MCC$ Scores on Columbia and Nimble WEB Datasets. Boldface is used to emphasize best performance.

    Dataset        cPCA++    MFCN
    Columbia       0.385     0.329
    Nimble WEB     0.388     0.297

Table 4.4: Edge-based $F_1$ Scores on Columbia and Nimble WEB Datasets. Boldface is used to emphasize best performance.

    Dataset        cPCA++    MFCN
    Columbia       0.359     0.312
    Nimble WEB     0.376     0.273
4.5 Conclusion
In conclusion, we proposed cPCA++, which is a new technique for discovering dis-
criminative features in high-dimensional data. The proposed approach is able to dis-
cover structures that are unique to a target dataset, while at the same time suppressing
“uninteresting” high-variance structures present in both the target dataset and a back-
ground dataset. The proposed cPCA++ approach is compared with a recently proposed
algorithm, called contrastive PCA (cPCA), and we show that cPCA++ achieves simi-
lar discriminative performance in a wide variety of settings, even though it eliminates
the need for the hyperparameter sweep required by cPCA. Following this discussion,
the cPCA++ approach was applied to the problem of image splicing localization, in
which the two classes of interest (i.e., spliced and authentic edges) are extremely similar
in nature. In the context of this problem, the target dataset contains the spliced edges
and the background dataset contains the authentic edges. We show that the cPCA++
approach is able to effectively discriminate between the spliced and authentic edges. The resulting method was evaluated on the Columbia and Nimble WEB splicing datasets. The cPCA++ method achieves scores comparable to the Multi-task Fully Convolutional Network (MFCN), and it does so very efficiently, without the need to iteratively update filter weights via stochastic gradient descent and backpropagation, and without the need to train a classifier.

Figure 4.9: Localization Output Examples from Columbia Dataset. Each row shows (from left to right): the manipulated/probe image with the spliced edge highlighted in pink, the structural edge detection output mask highlighting both spliced and authentic edges, the cPCA++ raw probability output, and the MFCN-based raw probability output.
Figure 4.10: Localization Output Examples from Nimble WEB Dataset. Each row
shows (from left to right): the manipulated/probe image with the spliced edge high-
lighted in pink, the structural edge detection output mask highlighting both spliced and
authentic edges, the cPCA++ raw probability output, and the MFCN-based raw proba-
bility output.
4.A Derivation of Filters for Synthetic Example
The sample covariance of the background dataset specified in (4.42) will be:
$$R_b \approx \begin{bmatrix} 3 I_{10 \times 10} & 0_{10 \times 10} & 0_{10 \times 10} \\ 0_{10 \times 10} & I_{10 \times 10} & 0_{10 \times 10} \\ 0_{10 \times 10} & 0_{10 \times 10} & 10 I_{10 \times 10} \end{bmatrix} \quad (4.44)$$
$$= \begin{bmatrix} 3 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 10 \end{bmatrix} \otimes I_{10 \times 10} \quad (4.45)$$
where the $A \otimes B$ operator denotes the Kronecker product of the matrices $A$ and $B$. On the other hand, the sample covariance of the target/foreground dataset is given by (when the classes are equally likely):
$$R_f \approx \begin{bmatrix} 9 & 0 & 0 \\ 0 & 2.25 & 0 \\ 0 & 0 & 0 \end{bmatrix} \otimes 1_{10} 1_{10}^T + \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 10 \end{bmatrix} \otimes I_{10 \times 10}. \quad (4.46)$$
Then, the matrix $Q$ for the cPCA++ algorithm becomes:
$$Q \triangleq R_b^{-1} R_f = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2.25 & 0 \\ 0 & 0 & 0 \end{bmatrix} \otimes 1_{10} 1_{10}^T + \begin{bmatrix} \frac{1}{3} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \otimes I_{10 \times 10}. \quad (4.47)$$
The matrix $Q$ turns out to be symmetric (and thus is diagonalizable) and has the following block-diagonal structure:
$$Q = \mathrm{blockdiag}(A, B, C) \quad (4.48)$$
where
$$A \triangleq 3 \cdot 1_{10} 1_{10}^T + \tfrac{1}{3} I_{10 \times 10} \quad (4.49)$$
$$B \triangleq 2.25 \cdot 1_{10} 1_{10}^T + I_{10 \times 10} \quad (4.50)$$
$$C \triangleq I_{10 \times 10}. \quad (4.51)$$
The eigenvalues of the block diagonal matrix $Q$ are given by the eigenvalues of its diagonal blocks^8, i.e., if the eigen-decomposition of $Q$ is given by $Q = U D U^T$, then
$$D \triangleq \mathrm{blockdiag}(D_A, D_B, D_C) \quad (4.52)$$
$$\overset{(a)}{=} \mathrm{diag}\left(30.33, \; 23.50, \; 1_{19}^T, \; \tfrac{1}{3} 1_{9}^T\right) \quad (4.53)$$
where the matrices $D_A$, $D_B$, and $D_C$ contain the eigenvalues of the matrices $A$, $B$, and $C$, respectively, along their diagonal and the eigenvalues are sorted in decreasing order in step $(a)$. It is then easy to verify that:
$$Q u_1 = 30.33 \, u_1 \quad (4.54)$$
$$Q u_2 = 23.50 \, u_2 \quad (4.55)$$
where
$$u_1 = \begin{bmatrix} \frac{1}{\sqrt{10}} 1_{10} \\ 0_{10} \\ 0_{10} \end{bmatrix}, \qquad u_2 = \begin{bmatrix} 0_{10} \\ \frac{1}{\sqrt{10}} 1_{10} \\ 0_{10} \end{bmatrix} \quad (4.56)$$
indicating that $u_1$ and $u_2$ are the leading eigenvectors of $Q$, yielding (4.43).
^8 This is because an eigenvalue $\lambda$ of $Q$ by definition satisfies $|Q - \lambda I_{30 \times 30}| = 0$. However, since $Q$ is block diagonal, we have that $|Q - \lambda I_{30 \times 30}| = |A - \lambda I_{10 \times 10}| \, |B - \lambda I_{10 \times 10}| \, |C - \lambda I_{10 \times 10}|$ [30, p. 5]. Clearly, if $\lambda$ is an eigenvalue of $A$, $B$, or $C$, it is also an eigenvalue of $Q$ since it will also satisfy $|Q - \lambda I_{30 \times 30}| = 0$. There are a total of 30 such values (10 per block), some with multiplicity greater than one.
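The closed-form result above can also be checked numerically: the short sketch below builds $Q$ from the block structure in (4.47) and confirms that its two leading eigenvectors match (4.43) up to sign. It is a verification aid only, under the same idealized covariances.

```python
import numpy as np

I10 = np.eye(10)
ones = np.ones((10, 1))

# Q = diag(3, 2.25, 0) (x) 1 1^T  +  diag(1/3, 1, 1) (x) I, as in (4.47)
Q = (np.kron(np.diag([3.0, 2.25, 0.0]), ones @ ones.T)
     + np.kron(np.diag([1.0 / 3.0, 1.0, 1.0]), I10))

eigvals, eigvecs = np.linalg.eigh(Q)           # Q is symmetric, eigenvalues ascending
print(eigvals[-2:])                            # approx [23.5, 30.333...]
u1, u2 = eigvecs[:, -1], eigvecs[:, -2]        # two leading eigenvectors

# Expected filters from (4.43): e_1 (x) (1/sqrt(10)) 1_10 and e_2 (x) (1/sqrt(10)) 1_10
f1 = np.kron(np.eye(3)[:, 0], np.ones(10) / np.sqrt(10))
f2 = np.kron(np.eye(3)[:, 1], np.ones(10) / np.sqrt(10))
print(abs(u1 @ f1), abs(u2 @ f2))              # both approx 1 (agreement up to sign)
```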
4.B The cPCA++ Approach For Matrix Factorization and Image Denoising
In this appendix, we explore the use of the cPCA++ method for matrix factorization and image denoising. This denoising example was performed in [1] for the MNIST over grass dataset examined in Sec. 4.2.2. In this exercise, we are given a single foreground image (i.e., an image containing a digit overlayed on top of grass imagery), flattened into a vector $z_n \in \mathbb{R}^{M \times 1}$, where $M = 784$ and the subscript $n$ is used to indicate that $z_n$ is noisy. Since $z_n$ is a foreground image, it follows the distribution from (4.6), and thus we may seek the factorization of the expected value $\mathbb{E}[z_n] = W y_n$ where $W \in \mathbb{R}^{M \times K}$ and $y_n \in \mathbb{R}^{K \times 1}$, and $K$ is the effective rank of the denoised image. This is a typical setup of matrix factorization methods. In our case, the matrix $W$ contains dictionary atoms along its columns. These contain the underlying bases that make up the labeled foreground dataset $Z_f$. On the other hand, the vector $y_n$ contains the weighting of each dictionary atom (i.e., how much each dictionary atom contributes to the final image) and is a function of the filters $F$ and the original foreground image $z_n$. Traditionally, it is assumed that the weighting vector is sparse. We do not directly impose this, but when $K$ is chosen to be small, this is implicitly imposed. That is, we are approximating the foreground image (which is corrupted by the grass background in addition to the actual digit) as:
$$\hat{z}_n \triangleq \sum_{k=1}^{K} y_{n,k} \, w_k \quad (4.57)$$
where the atom $w_k$ denotes the $k$-th column of the matrix $W$, the weighting $y_{n,k}$ denotes the $k$-th element of the vector $y_n$, and $\hat{z}_n$ denotes the denoised version of the image $z_n$. When $K \ll M$, we expect that $W y_n$ will yield a denoised signal as it will only utilize the few strong atoms forming $W$ (which best describe the bases present in the
the few strong atoms forming W (which best describe the bases present in the fore-
ground dataset). We also saw in Sec. 4.2 that the matricesF2R
MK
andW2R
MK
contain the leading eigenvectors of the matricesR
1
b
R
f
andR
f
R
1
b
, respectively. In
addition, the vectory
n
is obtained viay
n
=F
T
z
n
. This means that the denoised ver-
sion of the image is given by ^ z
n
= WF
T
z
n
. We now attempt to denoise an image
containing the digit 0 over grass background. This situation is shown in Fig. 4.11.
Please note that we use the color white to denote pixels corresponding to the digit0. In
the top-left plot, we show the original noisy digit. It is relatively difficult to see the digit
over the background, so a significant amount of denoising is necessary. In all of the
denoising methods, we set the number of componentsK = 3. In the top-right plot, we
show the denoising achieved by traditional PCA. In the bottom-left plot, we show the
denoising achieved by the cPCA algorithm. Finally, in the bottom-right plot, we show
the denoising performance of the cPCA++ method, which is achieved as the low-rank
approximationWy
n
. We observe that the output of cPCA++ is far less noisy than that
of the other methods.
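In code, the denoising step reduces to the rank-$K$ reconstruction $\hat{z}_n = W F^T z_n$. The sketch below assumes $W$ and $F$ have already been computed as the leading eigenvectors of $R_f R_b^{-1}$ and $R_b^{-1} R_f$, respectively; the function name is illustrative.

```python
import numpy as np

def denoise(z_n, W, F):
    """Rank-K reconstruction of a noisy foreground image: z_hat = W F^T z_n.
    z_n : (M,) flattened noisy image (M = 784 for the 28 x 28 MNIST-over-grass digits)
    W   : (M, K) dictionary atoms (leading eigenvectors of R_f R_b^{-1})
    F   : (M, K) cPCA++ filters   (leading eigenvectors of R_b^{-1} R_f)"""
    y_n = F.T @ z_n          # K weights, one per dictionary atom
    return W @ y_n           # denoised image; reshape to (28, 28) for display
```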
Figure 4.11: The denoising of the digit 0 over grass background. Please note that we
use the color white to denote pixels corresponding to the digit. On the top-left plot, we
show the original noisy digit. On the top-right plot, we show the denoising achieved
by traditional PCA. On the bottom-left plot, we show the denoising achieved by the
cPCA algorithm. Finally, on the bottom-right plot, we show the denoising performance
of the cPCA++ method, which is the low-rank approximation $W y_n$. We observe that the output of cPCA++ is far less noisy than that of the other methods. The number of components was chosen to be $K = 3$.
Chapter 5
Conclusion and Future Work
5.1 Summary of the Research
In this dissertation, we focused on the problem of image splicing localization and
presented two novel approaches. Both approaches provide per-pixel localization of a
spliced region. The first approach is based on the use of a fully convolutional network
(FCN), which is a special type of convolutional neural network (CNN) that is capable
of yielding per-pixel classification. The per-pixel classification ability allows our net-
work to mark the particular pixels that are classified as spliced. The FCN we utilized is
based on the VGG-16 architecture with skip connections, and we incorporated several
modifications, such as batch normalization layers and class weighting. We presented
three different variants of the FCN-based approach [46]: 1) single-task FCN (SFCN), 2)
multi-task FCN (MFCN) and 3) edge-enhanced MFCN. In contrast to the SFCN, which
is trained only on the surface label, the MFCN is simultaneously trained on both the sur-
face and edge labels. The proposed variants were evaluated on manipulated images from
the Carvalho, CASIA v1.0, Columbia, and the DARPA/NIST Nimble Challenge 2016
SCI datasets. The experimental results showed that the proposed variants outperform
existing splicing localization methods on these datasets, with the edge-enhanced MFCN
performing the best. In addition, we also note that the MFCN achieved the highest score
in the splicing localization task in the 2017 Nimble Challenge, and the second highest
score in the 2018 Media Forensics Challenge (which are part of the DARPA MediFor
program).
Although CNN-based approaches have yielded promising results in the field of
image forensics, they rely on careful selection of hyperparameters, network architec-
ture, and initial filter weights. Furthermore, CNNs require a long training time since the
filter weights need to be iteratively updated via stochastic gradient descent and back-
propagation. The second approach presented in this dissertation addresses these disad-
vantages of the CNN-based approach, while still achieving comparable performance.
This second approach is based on a new dimensionality reduction technique we have
developed, referred to as cPCA++ (where cPCA stands for contrastive Principal Com-
ponent Analysis). The cPCA++ technique is a modified version of Principal Component
Analysis (PCA) that is able to obtain discriminative filters when dealing with extremely
similar classes. It utilizes the fact that the interesting features of a target dataset may
be obscured by high variance components during traditional PCA. By analyzing what
is referred to as a background dataset (i.e., one that exhibits the high variance principal
components but not the interesting structures), our technique is capable of efficiently
highlighting the structure that is unique to the target dataset. Similar to another recently
proposed algorithm called contrastive PCA (cPCA), the proposed cPCA++ method iden-
tifies important dataset-specific patterns that are not detected by traditional PCA in
a wide variety of settings. However, the proposed method is significantly more effi-
cient than cPCA because the former does not require the hyperparameter sweep used in
cPCA. We applied the cPCA++ technique to the image splicing localization problem,
with the target and background samples corresponding to spliced and authentic edges,
respectively. The cPCA++ approach is significantly more efficient than state-of-the-art
CNN-based methods, such as the MFCN, because the former does not require iterative
updates of filter weights via stochastic gradient descent and backpropagation. Further-
more, it was shown that the proposed cPCA++ approach is able to achieve performance
scores comparable to the MFCN.
5.2 Future Research Directions
Currently, the cPCA++ approach does not require the training of a classifier (such as
support vector machines or random forests), and it was shown that it can still achieve
performance scores comparable to the MFCN. However, we would like to explore the
use of machine learning classifiers in the training stage as it can potentially improve the
localization performance.
We also note that our cPCA++ approach focuses on the localization of the splicing
boundary as opposed to the splicing region. We believe that the determination of the
splicing boundary is sufficient for most splicing localization applications as the labeling
of the host and donor images is arbitrary. However, we are aware that, in some cases,
comparison to other methods may require a surface output instead of a boundary out-
put. To that end, we will explore hole-filling techniques that allow us to transform the
boundary localization result to a surface output, similar to those produced by our earlier
CNN-based approach.
Also, in the CNN-based approach, it was necessary to resize the input probe images
prior to training and testing. Due to memory constraints, it is not possible to train and
test on very high-resolution images, so these images must first be downsampled. In
contrast, the cPCA++ approach does not require the probe images to be resized to a
specific size. In fact, the fundamental dimension of the method is the patch size, which
is a constant parameter independent of the input image size. This potentially allows the
cPCA++ approach to be used to perform the inference on the full resolution input image,
without the resizing required by the CNN-based approach. We believe that this fact will
help improve the performance of the cPCA++ approach by preserving information in the
input probe images (by not downsampling the resolution). We will conduct experiments
along these lines.
Finally, we hope to be able to explore the use of fusion methods that combine multi-
ple diverse methods to yield a fused localization output. For example, the fusion of our
two proposed approaches may have the potential to yield a robust localization algorithm
capable of outperforming each individual method.
Bibliography
[1] A. Abid, M. J. Zhang, V . K. Bagaria, and J. Zou. Exploring patterns enriched in
a dataset with contrastive principal component analysis. Nature Communications,
9(1):7, 2018.
[2] I. Amerini, R. Becarelli, R. Caldelli, and A. Del Mastio. Splicing forgeries local-
ization through the use of first digit features. In 2014 IEEE International Workshop
on Information Forensics and Security (WIFS), pages 143–148, 2014.
[3] V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolu-
tional encoder-decoder architecture for image segmentation. arXiv preprint
arXiv:1511.00561, 2015.
[4] O. Banos, C. Villalonga, R. Garcia, A. Saez, M. Damas, J. A. Holgado-Terriza,
S. Lee, H. Pomares, and I. Rojas. Design, implementation and validation of a novel
open framework for agile development of mobile health applications. BioMedical
Engineering OnLine, 14(Suppl 2):S6–S6, 2015.
[5] T. Bianchi, A. De Rosa, and A. Piva. Improved DCT coefficient analysis for
forgery localization in JPEG images. In 2011 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 2444–2447, 2011.
[6] T. Bianchi and A. Piva. Detection of nonaligned double JPEG compression based
on integer periodicity maps. IEEE transactions on Information Forensics and
Security, 7(2):842–848, 2012.
[7] T. Bianchi and A. Piva. Image forgery localization via block-grained analysis
of JPEG artifacts. IEEE Transactions on Information Forensics and Security,
7(3):1003–1017, 2012.
[8] B. Chen, X. Qi, Y . Wang, Y . Zheng, H. J. Shim, and Y .-Q. Shi. An improved
splicing localization method by fully convolutional networks. IEEE Access, 2018.
[9] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš. Determining image origin and
integrity using sensor noise. IEEE Transactions on Information Forensics and
Security, 3(1):74–90, 2008.
[10] D. Cozzolino and L. Verdoliva. Noiseprint: a cnn-based camera model fingerprint.
arXiv preprint arXiv:1808.08396, 2018.
[11] T. J. De Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and
A. de Rezende Rocha. Exposing digital image forgeries by illumination
color classification. IEEE Transactions on Information Forensics and Security,
8(7):1182–1194, 2013.
[12] A. E. Dirik and N. Memon. Image tamper detection based on demosaicing arti-
facts. In 2009 IEEE International Conference on Image Processing (ICIP), pages
1497–1500, 2009.
[13] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In Pro-
ceedings of the 2013 IEEE International Conference on Computer Vision, pages
1841–1848, Washington, DC, USA, 2013.
[14] J. Dong, W. Wang, and T. Tan. Casia image tampering detection evaluation
database. In 2013 IEEE China Summit and International Conference on Signal
and Information Processing, pages 422–426. IEEE, 2013.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, NY , 2
edition, 2001.
[16] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels
with a common multi-scale convolutional architecture. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2650–2658, 2015.
[17] H. Farid. Exposing digital forgeries from JPEG ghosts. IEEE Trans. Information
Forensics and Security, 4(1):154–160, 2009.
[18] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva. Image forgery localization via
fine-grained analysis of cfa artifacts. IEEE Transactions on Information Forensics
and Security, 7(5):1566–1577, 2012.
[19] K. W. Forsythe. Utilizing waveform features for adaptive beamforming and direc-
tion finding with narrowband signals. Lincoln Laboratory Journal, 10(2):99–126,
1997.
[20] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press
Professional, Inc., San Diego, CA, USA, 1990.
[21] C. Higuera, K. J. Gardiner, and K. J. Cios. Self-organizing feature maps identify
proteins critical to learning in a mouse model of down syndrome. PLOS ONE,
10(6):1–28, 06 2015.
[22] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, NY ,
2nd edition, 2012.
[23] Y .-F. Hsu and S.-F. Chang. Detecting image splicing using geometry invariants
and camera characteristics consistency. In 2006 IEEE International Conference
on Multimedia and Expo, pages 549–552. IEEE, 2006.
[24] M. Huh, A. Liu, A. Owens, and A. A. Efros. Fighting fake news: Image splice
detection via learned self-consistency. arXiv preprint arXiv:1805.04096, 2018.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. In International Conference on Machine
Learning, pages 448–456, 2015.
[26] Y . Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In
Proceedings of the 22nd ACM international conference on Multimedia, pages 675–
678. ACM, 2014.
[27] S. Kay. Fundamentals of Statistical Signal Processing: Detection theory, volume 2
of Fundamentals of Statistical Signal Processing. Prentice-Hall, NJ, 1993.
[28] N. Krawetz. A picture’s worth: digital image analysis and forensics. Black Hat
Briefings, pages 1–31, 2007.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing sys-
tems, pages 1097–1105, 2012.
[30] A. J. Laub. Matrix Analysis For Scientists And Engineers. Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, 2004.
[31] Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[32] C.-T. Li and Y . Li. Color-decoupled photo response non-uniformity for digital
image forensics. IEEE Transactions on Circuits and Systems for Video Technology,
22(2):260, 2012.
[33] W. Li, Y . Yuan, and N. Yu. Passive detection of doctored JPEG image via block
artifact grid extraction. Signal Processing, 89(9):1821–1829, 2009.
[34] Z. Lin, J. He, X. Tang, and C.-K. Tang. Fast, automatic and fine-grained tam-
pered JPEG image detection via DCT coefficient analysis. Pattern Recognition,
42(11):2492–2501, 2009.
[35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 3431–3440, 2015.
[36] W. Luo, Z. Qu, J. Huang, and G. Qiu. A novel method for detecting cropped and
recompressed image block. In 2007 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), volume 2, pages II–217, 2007.
[37] S. Lyu, X. Pan, and X. Zhang. Exposing region splicing forgeries with blind local
noise estimation. International journal of computer vision, 110(2):202–221, 2014.
[38] B. Mahdian and S. Saic. Using noise inconsistencies for blind image forensics.
Image and Vision Computing, 27(10):1497–1503, 2009.
[39] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse
coding. In Proceedings of the 26th Annual International Conference on Machine
Learning, pages 689–696, 2009.
[40] B. Parlett. The Symmetric Eigenvalue Problem. Society for Industrial and Applied
Mathematics, 1998.
[41] T. Pomari, G. Ruppert, E. Rezende, A. Rocha, and T. Carvalho. Image splicing
detection through illumination inconsistencies and deep learning. In 2018 IEEE
International Conference on Image Processing (ICIP), pages 3788–3792, 2018.
[42] Y . Rao and J. Ni. A deep learning approach to detection of splicing and copy-
move forgeries in images. In 2016 IEEE International Workshop on Information
Forensics and Security (WIFS), pages 1–6, 2016.
[43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large
scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, Dec.
2015.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recogni-
tion challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[45] R. Salloum and C.-C. J. Kuo. Efficient image splicing localization via contrastive
feature extraction. arXiv preprint arXiv:1901.07172, 2019.
[46] R. Salloum, Y . Ren, and C.-C. J. Kuo. Image splicing localization using a multi-
task fully convolutional network (mfcn). Journal of Visual Communication and
Image Representation, 51:201–209, 2018.
[47] Y . Q. Shi, C. Chen, and W. Chen. A natural image model approach to splicing
detection. In Proceedings of the 9th workshop on Multimedia & security, pages
51–62. ACM, 2007.
[48] Z. Shi, X. Shen, H. Kang, and Y . Lv. Image manipulation detection and localization
based on the dual-domain convolutional neural networks. IEEE Access, 2018.
[49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[50] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[51] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE.
Journal of Machine Learning Research, 9:2579–2605, Nov. 2008.
[52] G. Wang, J. Chen, and G. B. Giannakis. DPCA: Dimensionality reduction for
discriminative analytics of multiple large-scale datasets. In 2018 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2211–2215. IEEE, 2018.
[53] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang. Trace ratio vs. ratio trace for
dimensionality reduction. In 2007 IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1–8, June 2007.
[54] W. Wang, J. Dong, and T. Tan. Tampered region localization of digital color
images based on jpeg compression noise. In International Workshop on Digital
Watermarking, pages 120–133. Springer, 2010.
[55] S. Ye, Q. Sun, and E.-C. Chang. Detecting digital image forgeries by measuring
inconsistencies of blocking artifact. In 2007 IEEE International Conference on
Multimedia and Expo, pages 12–15, 2007.
[56] M. Zampoglou, S. Papadopoulos, and Y . Kompatsiaris. Large-scale evaluation of
splicing localization algorithms for web images. Multimedia Tools and Applica-
tions, 76(4):4801–4834, Feb 2017.
[57] G. X. Y . Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B.
Ziraldo, T. D. Wheeler, G. P. McDermott, J. Zhu, M. T. Gregory, J. Shuga, L. Mon-
tesclaros, J. G. Underwood, D. A. Masquelier, S. Y . Nishimura, M. Schnall-Levin,
P. W. Wyatt, C. M. Hindson, R. Bharadwaj, A. Wong, K. D. Ness, L. W. Beppu,
H. J. Deeg, C. McFarland, K. R. Loeb, W. J. Valente, N. G. Ericson, E. A. Stevens,
J. P. Radich, T. S. Mikkelsen, B. J. Hindson, and J. H. Bielas. Massively parallel
digital transcriptional profiling of single cells. Nature Communications, 8:1–12,
Jan. 2017.
[58] P. Zhou, X. Han, V . I. Morariu, and L. S. Davis. Learning rich features for image
manipulation detection. arXiv preprint arXiv:1805.04953, 2018.
Abstract
Image splicing is a type of forgery or manipulation in which a portion of one image is copied and pasted onto a different image. Image splicing attacks have become pervasive with the advent of easy-to-use digital manipulation tools and an increase in public image distribution. Much of the previous research work on image splicing attacks has focused on the problem of simply detecting whether an image is spliced or not, and did not attempt to localize the spliced region. In this work, we present two novel approaches for the image splicing localization problem, with the goal of generating a per-pixel mask that localizes the spliced region. The first proposed approach is based on a multi-task fully convolutional network (MFCN), which is a special type of convolutional neural network (CNN). The MFCN is simultaneously trained on the surface label (which indicates whether each pixel in an image belongs to the spliced surface/region) and the edge label (which indicates whether each pixel belongs to the boundary of the spliced region). The MFCN-based approach is shown to outperform existing splicing localization techniques on several publicly available datasets. ❧ Our second contribution is based on a new dimensionality-reduction technique that we have developed. This technique, referred to as cPCA++ (where cPCA stands for contrastive Principal Component Analysis), utilizes the fact that the interesting features of a target dataset may be obscured by high variance components during traditional PCA. By analyzing what is referred to as a background dataset (i.e., one that exhibits the high variance principal components but not the interesting structures), our technique is capable of efficiently highlighting the structure that is unique to the target dataset. Similar to another recently proposed algorithm called contrastive PCA (cPCA), the proposed cPCA++ method identifies important dataset-specific patterns that are not detected by traditional PCA in a wide variety of settings. However, the proposed cPCA++ method is significantly more efficient than cPCA, because it does not require the parameter sweep in the latter approach. We applied the cPCA++ method to the problem of image splicing localization. In this application, we utilize authentic edges as the background dataset and the spliced edges as the target dataset. The proposed cPCA++ method is significantly more efficient than state-of-the-art CNN-based methods, as the former does not require iterative updates of filter weights via stochastic gradient descent and backpropagation. Furthermore, the cPCA++ method is shown to provide performance scores comparable to the MFCN-based approach.