A Green Learning Approach to Deepfake Detection and Camouflage and Splicing Object Localization
by
Hong-Shuo Chen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Hong-Shuo Chen
Acknowledgements
I would like to express my deepest appreciation to Professor C.-C. Jay Kuo, whose expertise, understanding,
and patience, added considerably to my graduate experience. I am extremely grateful for the guidance and
support he has provided me throughout my studies and the process of researching. His willingness to give
his time so generously has been very much appreciated. I am particularly thankful for the opportunity to
collaborate with him during my PhD study, which was instrumental in shaping my critical thinking and
research skills. Over the past five years, we have embarked on research in Green Learning, achieving a
multitude of astonishing and unforgettable accomplishments. For this and so much more, I am forever
thankful to Professor C.-C. Jay Kuo. It has been an honor and a privilege to be his student.
My sincere thanks also go to Professor Suya Yu, whose invaluable feedback during our weekly meetings
greatly contributed to the progression of my work. Additionally, I would like to express my appreciation
to Professor Shri Narayanan and Professor Aiichiro Nakano for their role as members of my PhD defense
committee. Their insightful advice has greatly enriched my research. Furthermore, I am grateful to Professor Justin Haldar for serving on my Qualifying Examination Committee; his perspectives have been
crucial to the completion of this thesis.
To all my colleagues at the MCL laboratory, your support and camaraderie have made my PhD journey a remarkable experience. A special thanks to Kaitai Zhang, my first mentor in the lab, with whom
I co-authored my inaugural publication—a milestone that set the tone for my research career. My gratitude extends to Yun-Cheng (Joe) Wang and Xiou Ge, for the invigorating research discussions and the
shared personal milestones. I am also thankful to Professor Ronald Salloum, Yao Zhu, and Chee-An Yu
for their assistance in my image splicing work, and to Hamza Ghani and Mozhdeh Rouhsedaghat for their
collaboration on the DefakeHop project.
I also want to express my heartfelt appreciation to my partner, Jiaxuan Li, whose enduring support
was instrumental in the completion of my studies. Her willingness to listen to my ideas and her invaluable
input on the accessibility of my work have enriched my perspective, ensuring that my research can be
appreciated by both specialists and the wider audience. Jiaxuan’s companionship has made my life not
only more rewarding but also filled with joy. Her presence has been a constant source of comfort and
motivation throughout this challenging journey.
Finally, my heartfelt thanks to my family for their unwavering support throughout my PhD studies.
The counsel of my parents has been a guiding light through the challenges of research and life. To my
brother, Hong-En Chen, my go-to person for software quandaries—thank you for your readiness to assist
and for the innovative solutions we’ve developed together. Our discussions on coding, our exchange of
the latest papers and technologies, and our mutual support in research have been pillars of my academic
pursuit.
This thesis is not just a reflection of my efforts but a testament to the collective support and encouragement of all those mentioned above, and many others who have been part of this journey. To each of
you, I am profoundly grateful.
This work was supported by the Army Research Laboratory (ARL) under agreement W911NF2020157.
Computation for the work was supported by the University of Southern California’s Center for High Performance Computing (hpc.usc.edu).
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 DefakeHop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 DefakeHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Geo-DefakeHop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 GreenCOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 GIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 DefakeHop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Non-DL-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 DL-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 DefakeHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 GeoDefakeHop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Fake images generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Fake images detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.3 PixelHop and Saab transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 Differences between DefakeHop and Geo-DefakeHop . . . . . . . . . . . . . . . . 12
2.4 GreenCOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Camouflaged Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 Recent Advancements in COD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 GIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 3: DefakeHop: A Light-Weight High-Performance Deepfake Detector . . . . . . . . . . . . 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 DefakeHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Face Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 PixelHop++ Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2.1 Channel-wise (c/w) Saab Transform . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Feature Distillation Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.4 Ensemble Classification Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Detection Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.2 Model Size Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 4: DefakeHop++: An Enhanced Lightweight Deepfake Detector . . . . . . . . . . . . . . . 28
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Review of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 DefakeHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Pre-processing Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Module 1: One-Stage PixelHop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Module 2: Spatial PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.4 Module 3: Discriminant Feature Test . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.5 Module 4: Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.2 Detection Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.3 Model Size of DefakeHop++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 5: Geo-DefakeHop: High-Performance Geographic Fake Image Detection . . . . . . . . . 47
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.1 Fake Images Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Fake Images Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 PixelHop and Saab transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.4 Differences between DefakeHop and Geo-DefakeHop . . . . . . . . . . . . . . . . 51
5.3 Geo-DefakeHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Joint Spatial/Spectral Feature Extraction via PixelHop . . . . . . . . . . . . . . . . 54
5.3.3 Channel-wise Classification, Discriminant Channels Selection and Block-level
Decision Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.4 Image-level Decision Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.5 Visualization of Detection Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4.2 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4.3 Detection Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.4 Weak Supervision Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.5 Performance Benchmarking with Three GAN Models . . . . . . . . . . . . . . . . . 62
5.4.6 Model Size Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 6: GreenCOD: Green Camouflaged Object Detection . . . . . . . . . . . . . . . . . . . . . 75
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 GreenCOD Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2.2 Concatenation and Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.3 Multi-scale XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.4 Neighborhood Construction (NC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.3 Experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.4 Visualization analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 7: GIFT: Green Image Forgery Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.2.2 Multi-level Surface XGBoosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2.3 Multi-level Edge XGBoosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4 Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Chapter 8: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Tables
3.1 The AUC value for each facial region and the final ensemble result. . . . . . . . . . . . . 22
3.2 Comparison of the detection performance of benchmarking methods with the AUC value
at the frame level as the evaluation metric. The boldface and the underbar indicate the
best and the second-best results, respectively. Italics indicate that the method does not
specify frame- or video-level AUC. The AUC results of DefakeHop are reported at both the
frame level and the video level. The AUC results of benchmarking methods are taken from
[108] and [69]. a deep learning method, b non-deep-learning method. . . . . . . . . . . . . 26
3.3 Comparison of Deepfake algorithms and qualities. . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 The number of parameters for various parts. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Comparison of detection performance of several methods on the first-generation datasets
with AUC as the performance metric. The AUC results of DefakeHop++ at both the frame
level and the video level are given. The best and the second-best results are shown in
boldface and underlined, respectively. The AUC results of benchmarking methods are taken
from [69] and the numbers of parameters are from https://keras.io/api/applications. Also,
we use a to denote deep learning methods and b to denote non-deep-learning methods. . . . 44
4.2 Comparison of detection performance of several Deepfake detectors on the second-generation
datasets under cross-domain training and with AUC as the performance
metric. The AUC results of DefakeHop and DefakeHop++ at both the frame level and the
video level are given. The best and the second-best results are shown in boldface
and underlined, respectively. Furthermore, we include results of DefakeHop and
DefakeHop++ under same-domain training in the last 4 rows. The AUC results of
benchmarking methods are taken from [69] and the numbers of parameters are from
https://keras.io/api/applications. Also, we use a to denote deep learning methods and b
to denote non-deep-learning methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 The number of parameters for various parts. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 Visualization of original real images (the first column), partial real/partial fake (PRPF)
images (the second column), the ground truth (the third column, where dark blue and
yellow denote real and fake regions, respectively) and heat maps (the fourth column,
where cold and warm colors indicate a higher probability of being real and fake at the
corresponding location, respectively). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Visualization of absolute values of Saab filter responses and the detection heat maps
for four channels (DC, AC1, AC11 and AC26), where DC and AC1 are low-frequency
channels, AC11 is a mid-frequency channel, and AC26 is a high-frequency channel. Cold
and warm colors in heat maps indicate a higher probability of being real and fake at the
corresponding location, respectively. The ground truth is that the whole image is fake. . . 66
5.3 Detection performance comparison with raw images from the UW dataset for three
benchmarking methods. The boldface and the underbar indicate the best and the
second-best results, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Detection performance comparison for images resized from 256 × 256 to 128 × 128 and
64 × 64. The boldface and the underbar indicate the best and the second-best results,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 The statistics of three fake satellite image datasets, where C-GAN, S-GAN and L-GAN
denote CycleGAN, StyleGAN2 and Lightweight GAN, respectively. . . . . . . . . . . . . . 69
5.6 Comparison of FID scores of three fake satellite image datasets, where C-GAN, S-GAN
and L-GAN denote CycleGAN, StyleGAN2 and Lightweight GAN, respectively. Lower FID
scores indicate better generated images of higher fidelity and variability. . . . . . . . . . . 69
5.7 Detection performance comparison for images corrupted by additive white Gaussian
noise with standard deviation = 0.02, 0.06, 0.1. The boldface and the underbar indicate
the best and the second-best results, respectively. . . . . . . . . . . . . . . . . . . . . . . . 70
5.8 Detection performance comparison for images coded by the JPEG compression standard
of three quality factors (QF), i.e., QF = 95, 85 and 75. The boldface and the underbar
indicate the best and the second-best results, respectively. . . . . . . . . . . . . . . . . . . 71
5.9 Comparison of F1-scores of four detection methods under the weak supervision data
setting, where X-Y-Z means X% training, Y% validation and Z% test data samples. . . . . . 72
5.10 Comparison of F1-scores of four detection methods on fake images generated by CycleGAN,
StyleGAN2, and Lightweight GAN, where all datasets are split with 10% training, 10%
validation and 80% test data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.11 Model size computation of four Geo-DefakeHop designs for raw satellite input images. . . 73
5.12 Summary of model sizes of four Geo-DefakeHop designs with different input images. . . . 74
viii
6.1 Comparison of performance metrics between proposed and benchmark methods on
the COD10K dataset. For computational efficiency, only models with less than 50G
Multiply-Accumulate Operations (MACs) were considered. The top-performing method
for each metric on each dataset is highlighted in bold, while the second-best method is
underscored. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Comparison of performance metrics between proposed and benchmark methods on the
COD10K dataset. Only models with more than 50G Multiply-Accumulate Operations
(MACs) were considered. The top-performing method for each metric on each dataset is
highlighted in bold, while the second-best method is underscored. . . . . . . . . . . . . . . 81
6.3 Comparison of performance metrics between proposed and benchmark methods on the
NC4K dataset. For computational efficiency, only models with less than 50G Multiply-Accumulate Operations (MACs) were considered. The top-performing method for each
metric on each dataset is highlighted in bold, while the second-best method is underscored. 82
6.4 Comparison of performance metrics between proposed and benchmark methods on the
COD10K dataset. Only models with more than 50G Multiply-Accumulate Operations
(MACs) were considered. The top-performing method for each metric on each dataset is
highlighted in bold, while the second-best method is underscored. . . . . . . . . . . . . . . 83
7.1 Comparison of F1 Scores between proposed and benchmark methods Across Various
Datasets. Models were trained on the CASIA v2.0 dataset and tested on other datasets.
The top-performing method for each dataset is highlighted in bold and the second-best
method is underlined. For algorithms not evaluated on specific datasets,
entries are indicated as “NA”. We also provide a comparison of the number of parameters
and FLOPs/pixel for both the DL methods and our proposed method. . . . . . . . . . . . . 93
7.2 The number of parameters and FLOPs for various parts of GIFT and E-GIFT . . . . . . . . 98
7.3 F1 and AUC scores from different levels of the edge-enhanced XGBoost on the CASIA v1.0 dataset 99
7.4 Comparison of AUC Scores between proposed and benchmark methods on the CASIA
v1.0 dataset. Models were trained on the CASIA v2.0 dataset. The top-performing method
is highlighted in bold. We also provide a comparison of the number of parameters and
GFLOPs for both the DL methods and our proposed method. . . . . . . . . . . . . . . . . . 99
List of Figures
1.1 Visualization of real and fake faces extracted from the third generation DFDC dataset [27].
The left and right four columns depict real and fake faces, respectively. Eight Deepfake
techniques are used to generate fake videos. Furthermore, 19 perturbations are added to
real and fake videos. Exemplary perturbations include compression, additive noise, blur,
change of brightness, contrast and resolution, and overlay with flower and dog patterns,
random faces and images. A good Deepfake detector should be able to distinguish real
and fake videos with or without perturbation. . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Face image preprocessing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 An overview of the DefakeHop method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Illustration of the c/w Saab transform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 The ROC curve of DefakeHop for different datasets. . . . . . . . . . . . . . . . . . . . . . . 23
3.5 The plot of AUC values as a function of the training video number. . . . . . . . . . . . . . 25
4.1 An overview of DefakeHop++. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Illustration of the preprocessing step. It first extracts 68 landmarks (in blue) from the face.
Then, it crops out blocks of size 31 × 31 from three regions (in yellow) and blocks of size
13 × 13 from eight landmarks (in red). The step can extract consistent blocks despite
different head poses and perturbations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Illustration of the feature selection idea in the discriminant feature test (DFT). For the
feature dimension associated with the left subfigure, samples of class 0 and class 1 can
be easily separated by the blue partition line. Its cross entropy is lower. For the feature
dimension associated with the right subfigure, samples of class 0 and class 1 overlap
with each other significantly. It is more difficult to separate them and its cross entropy is
higher. Thus, the feature dimension in the left subfigure is preferred. . . . . . . . . . . . . 35
4.4 Illustration of the feature selection process for a single landmark, where the y-axis is the
cross entropy of each dimension and the x-axis is the channel index. The left and right
subfigures show unsorted and sorted feature dimensions. . . . . . . . . . . . . . . . . . . . 35
4.5 Analysis of landmark discriminability, where the x-axis is the landmark index and the
y-axis is the AUC score. Landmarks in the two eye regions are most discriminant in both
training and test datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.6 Detection performance comparison of DefakeHop++, MobileNet v3 with pre-training
by ImageNet and MobileNet v3 without pre-training as a function of training data
percentages of the DFDC dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Training time comparison of DefakeHop++, MobileNet v3 with pre-training by ImageNet
and MobileNet v3 without pre-training as a function of training data percentages of the
DFDC dataset, where the training time is in the unit of seconds. The training time does
not include that used in the pre-processing step. . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 An overview of the Geo-DefakeHop method, where the input is an image tile and the
output is a binary decision on whether the input is an authentic or a fake one. First, each
input tile is partitioned into non-overlapping blocks of dimension 16 × 16 × 3. Second,
each block goes through one PixelHop or multiple PixelHops, each of which yields 3D
tensor responses of dimension H × W × C. Third, for each PixelHop, an XGBoost
classifier is applied to spatial samples of each channel to generate channel-wise (c/w) soft
decision scores and a set of discriminant channels is selected accordingly. Last, all block
decision scores are ensembled to generate the final decision of the image tile. . . . . . . . . 50
5.2 The channel-wise performance of four settings: a) without perturbation, b) resizing, c)
adding Gaussian noise, and d) JPEG compression. Channel 0 is DC (direct current) and
channels 1 through 26 correspond to AC1 to AC26 (alternating current). The blue line is
the energy percentage of each channel and the red, magenta and green lines are the
F1-scores of the training, validation and testing datasets. We observe that high-frequency
channels without perturbation in 5.2a have higher performance. After applying resizing,
adding Gaussian noise and compression, the performance of high-frequency channels
degrades as shown in 5.2b, 5.2c, 5.2d. The test
score and validation score are closely related, indicating that the validation score can be
used to select the discriminant channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1 An overview of the GreenCOD method, where the input is an image of dimension
672 × 672 × 3 and the output is a probability mask of dimension 168 × 168 × 1. . . . . . .
6.2 Illustration of mask predictions using the proposed GreenCOD. Images are taken from
COD10K test dataset. From left to right: (a) tampered images, (b) ground-truth masks, (c)
prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1 System diagram of the Green Image Forgery Technique (GIFT). "E-XGBoost" represents
the edge-detection XGBoost, while "S-XGBoost" stands for the surface-detection XGBoost.
The architecture efficiently integrates multi-level feature extraction with specialized
XGBoosts to discern both surface and edge forgeries. . . . . . . . . . . . . . . . . . . . . . 89
7.2 Illustration of mask predictions using the proposed GIFT and edge-enhanced GIFT.
Images are taken from CASIA v1.0 dataset. From left to right: (a) tampered images,
(b) ground-truth masks, (c) GIFT mask predictions, and (d) edge-enhanced GIFT mask
predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Abstract
In the current technological era, the advancement of AI models has not only driven innovation but also
heightened concerns over environmental sustainability due to increased energy and water usage. For
context, roughly 10 to 50 responses from a model like GPT-3 consume the equivalent of a 500ml bottle
of water, and projections suggest that by 2027 AI could use an estimated 85 to 134 TWh of electricity
per year, with its water consumption potentially surpassing half of the United Kingdom's annual water
withdrawal. In light of these challenges, there is an urgent call for AI solutions that are environmentally
friendly, characterized by lower energy consumption through fewer floating-point operations (FLOPs),
more compact designs, and the ability to
run independently on mobile devices without depending on server-based infrastructures.
Addressing this need, our research presents "DefakeHop," an innovative Deepfake detection method
that leverages the Successive Subspace Learning (SSL) principle to extract features from different facial regions through a process known as channel-wise Saab transform. The method’s feature distillation module
then refines these features, resulting in a highly efficient yet small model with only 42,845 parameters that
demonstrates state-of-the-art performance on multiple Deepfake datasets.
Building on the foundation of DefakeHop, we introduce an enhanced version, "DefakeHop++," which
extends its capabilities by analyzing additional facial landmarks and employing a supervised method, the
Discriminant Feature Test (DFT), for feature selection. Despite being significantly smaller than lightweight
CNN models like MobileNet v3, with just 238,000 parameters, DefakeHop++ outperforms its counterparts
in detecting Deepfake images, especially in a weakly-supervised setting.
Moreover, our efforts extend beyond the realm of video to still images, with "Geo-DefakeHop" targeting the detection of Deepfake satellite imagery. This model uses Parallel Subspace Learning (PSL) to
differentiate real from manipulated images through response variances across multiple filter banks. With
a model size as compact as 800 to 62,000 parameters, Geo-DefakeHop still achieves impressive results, surpassing 95% F1-scores even when images have been altered through common manipulations like resizing
and compression.
In addition to these detection models, we propose "GreenCOD," a novel approach for Camouflaged
Object Detection that combines the efficiency of Extreme Gradient Boosting (XGBoost) with deep feature
extraction from DNNs. Unlike typical DNN approaches, GreenCOD eschews complex architectures and
the backpropagation training method, yet still delivers enhanced performance with a leaner computational
footprint.
Finally, we showcase the "Green Image Forgery Technique" (GIFT), an advanced gradient-boosting
technique specifically designed for identifying multiple image forgeries. GIFT utilizes the pyramid structure of Efficient-Net and engages XGBoost at each layer for precise forgery detection. This method benefits
from direct supervision using the ground truth forgery mask and edge supervision to refine the accuracy
of forgery boundary identification. Despite its conservative computational requirements, GIFT surpasses
many state-of-the-art methods, as confirmed by extensive testing across diverse datasets.
Through these developments, we aim to set a new standard for eco-friendly AI solutions that not only
address critical digital security challenges but also demonstrate a commitment to environmental responsibility.
Chapter 1
Introduction
1.1 Significance of the Research
In the current technological landscape, AI models have substantially increased in size, leading to heightened energy consumption. This significant energy use is concerning for our environment. For instance,
10 to 50 responses from GPT-3 are equivalent to the water needed to fill a 500ml bottle. Projections indicate that by 2027, AI servers might consume between 85 and 134 terawatt hours (TWh) annually.
Furthermore, AI’s water consumption could escalate to 4.2 to 6.6 billion cubic meters by 2027, surpassing
the total annual water withdrawal of half of the UK.
Given these alarming statistics, there is an urgent need to develop sustainable and efficient AI solutions
that are environmentally friendly. This involves reducing FLOPs (Floating Point Operations) to minimize
energy consumption, designing models with smaller sizes that require less memory, and creating models
capable of operating independently on mobile devices, thereby removing the need for server connectivity.
The significance of the research introducing GreenCOD and GIFT cannot be overstated, as it addresses a
critical bottleneck in the realm of Camouflaged Object Detection and Image Forgery Detection: the trade-off between computational efficiency and performance. By ingeniously combining XGBoost with deep
feature extraction from DNNs, GreenCOD paves the way for robust detection mechanisms that require
significantly fewer parameters and FLOPs, making it a game-changer for applications where computational
resources are limited. Moreover, the elimination of backpropagation in training further underscores the
efficiency and innovation of the approach. GIFT, on the other hand, emerges as a groundbreaking gradient-boosting technique for image forgery detection, leveraging the architecture of EfficientNet. It introduces a
novel strategy that integrates edge supervision with direct ground truth guidance, enhancing the precision
of forgery boundaries. The remarkable performance of GIFT, validated through rigorous experimentation,
demonstrates its potential to set a new standard in the detection of multiple image forgeries, reaffirming
the transformative impact of this research on the future of digital image analysis and security.
Artificial intelligence (AI) and deep learning (DL) techniques have made significant advances in recent years by leveraging more powerful computing resources and larger collected and labeled datasets in
computer vision fields. Despite countless advantages brought by AI, misinformation over the Internet,
ranging from fake news [91] to fake images and videos [27, 51, 69, 93, 138, 151], poses a serious threat to
our society. It is common to see Deepfake videos appearing in social media platforms nowadays due to
the popularity of Deepfake techniques. Several mobile apps can help people create fake content without
any special editing skill. Generally speaking, Deepfake programs can change the identity of one person
in real videos to another realistically and easily. The number of Deepfake videos has surged rapidly in recent
years. There were 7,964 Deepfake video clips online at the beginning of 2019. The number almost doubled
to 14,678 in nine months and continued to increase exponentially [24]. As the number of Deepfake video
contents grows rapidly, automatic Deepfake detection has received a lot of attention in the community of
digital forensics.
Besides the large quantity, another major threat is that fake video quality has improved a lot over a short
period of time. With the fast growing Generative Adversarial Network (GAN) technology, image forgery
techniques keep evolving in recent years. They are effective in reducing manipulation traces detectable by
human eyes. It becomes very challenging to distinguish Deepfake images from real ones with human eyes
against new generations of Deepfake technologies as shown in Figure 1.1. Furthermore, adding different
Figure 1.1: Visualization of real and fake faces extracted from the third generation DFDC dataset [27].
The left and right four columns depict real and fake faces, respectively. Eight Deepfake techniques are
used to generate fake videos. Furthermore, 19 perturbations are added to real and fake videos. Exemplary
perturbations include compression, additive noise, blur, change of brightness, contrast and resolution, and
overlay with flower and dog patterns, random faces and images. A good Deepfake detector should be able
to distinguish real and fake videos with or without perturbation.
kinds of perturbation (e.g., blur, noise and compression) can hurt the detection performance of Deepfake
detectors since manipulation traces are mixed with perturbations. A robust Deepfake detector should be
able to tell the differences between real images and fake images generated by GAN techniques although
both of them experience such modifications.
Deepfake videos can be potentially harmful to society, from non-consensual explicit content creation
to forged media by foreign adversaries used in disinformation campaigns. Fake videos may cause serious
damage to our society, since people can be fooled by fake content delivered over the Internet, and misinformation can make the public panicked and anxious. Beyond faces, the same techniques are used to generate
realistic-looking satellite images. Satellite images are utilized in various applications such as weather
prediction [134], agricultural crop prediction [64], and flood and fire control [66]. If one cannot determine
whether a satellite image is real or fake, it would be risky to use it for decision making. Fake satellite
images may have impacts on national security. For example, adversaries can create fake satellite images
to hide military infrastructure and/or create fake ones to deceive others. Though government analysts
could verify the authenticity of geospatial imagery leveraging other satellites or data sources, this would
be prohibitively time intensive. It would be extremely difficult for the public to verify the authenticity of
satellite images.
As a result, an automatic and effective Deepfake detection mechanism is urgently needed. It is also
desirable to have a software solution that runs easily on mobile devices to provide automatic warning messages to people when fake videos are played. To address this emerging threat, it is essential to develop
lightweight Deepfake detectors that can be deployed on mobile phones, which is the objective of our
current research.
1.2 Contributions of the Research
1.2.1 DefakeHop
We propose a light-weight, high-performance method for Deepfake detection in this work. The main contributions are summarized below.
1. The proposed method is mathematically transparent since its feature extraction and classification
modules are both explainable. The model size of the proposed method is significantly smaller, thus
making it an attractive choice for mobile/edge computing. It is also easy to train since no backpropagation is required for end-to-end system optimization.
2. Extensive experiments are conducted to demonstrate the effectiveness of the proposed DefakeHop
method. With a small model size of 42,845 parameters, DefakeHop achieves state-of-the-art performance on several first- and second-generation datasets.
1.2.2 DefakeHop++
On the basis of DefakeHop, an enhanced lightweight Deepfake detector called DefakeHop++ is proposed
in this work. The main contributions are summarized below.
1. We exploit the label information in the feature extraction part to select features from different
frequency bands, which contain discriminant information about Deepfake faces, by using the Discriminant Feature Test (DFT).
2. We include eight landmarks on the face to let the model learn detailed information inside the face
and concatenate the spatial and spectral features from different regions and different landmarks
together to train a classifier.
3. We conduct extensive experiments to demonstrate that our model outperforms
MobileNet v3 in a weakly-supervised setting.
1.2.3 Geo-DefakeHop
In this work, we propose a high-performance geographic fake image detector. The main contributions
are summarized below.
1. We propose a fake satellite image detection method, called Geo-DefakeHop, which exploits the PSL
methodology to extract discriminant features with the implementation of multiple filter banks.
2. We conduct extensive experiments to demonstrate the high performance of Geo-DefakeHop and its
robustness against various image manipulations.
3. We use the heat map to visualize prediction results and spatial responses of different frequency
channels for Geo-DefakeHop’s interpretability.
1.2.4 GreenCOD
We propose a light-weight, high-performance method for Camouflaged Object Detection in this work. The main
contributions are summarized below.
• We present GreenCOD, an innovative approach to Camouflaged Object Detection (COD) that utilizes
extreme gradient boosting (XGBoost) combined with deep neural network features.
• Our model is designed for efficiency and interpretability, reducing the computational resources typically required by conventional deep learning methods in COD.
• By forgoing backpropagation, GreenCOD proposes a new paradigm in neural architectures that is
environmentally friendly and suitable for real-world applications.
• We advance the COD field towards achieving models that retain detection effectiveness while being
more resource-efficient and sustainable.
1.2.5 GIFT
We propose a light-weight, high-performance method for image forgery detection in this work. The main
contributions are summarized below.
1. We introduce a novel approach by combining a pre-trained Efficient-Net with the XGBoost machine learning classifier to perform forgery detection. Our method eliminates the need for backpropagation and end-to-end training.
2. We utilize both surface supervision and edge supervision. This dual approach aids in achieving more
accurate predictions at forgery boundaries.
3. Our method outperforms state-of-the-art techniques while requiring significantly fewer floating-point operations (FLOPs) and trainable parameters, proving to be superior even to transformer-based
forgery detectors.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we review the research background. In
Chapter 3, we propose a light-weight high-performance deepfake detector, called DefakeHop, which extracts features automatically using the successive subspace learning (SSL) principle from various parts of
face images. In Chapter 4, we improve our previous work DefakeHop and propose an enhanced lightweight
Deepfake detector, called DefakeHop++, which achieves high performance on the third-generation
dataset under severe perturbations. In Chapter 5, we propose a high-performance geographic fake image
detection method called Geo-DefakeHop, which detects fake satellite images in an explainable
way. In Chapter 6, we propose a light-weight, high-performance method for Camouflaged Object Detection. In Chapter 7, we propose a light-weight, high-performance method for image forgery
detection. Finally, concluding remarks and future research directions are given in Chapter 8.
Chapter 2
Research Background
2.1 DefakeHop
2.1.1 Non-DL-based Methods
Yang et al. [117] exploited discrepancy between head poses and facial landmarks for Deepfake detection.
They first extracted features such as rotational and translational differences between real and fake videos
and then applied the SVM classifier. Agarwal et al. [2] focused on detecting Deepfake videos of high-profile
politicians and leveraged specific facial patterns of individuals when they talk. They used the one-class
SVM, which is only trained on real videos of high-profile individuals. Matern et al. [80] used landmarks to
find visual artifacts in fake videos, e.g., missing reflections in eyes, teeth replaced by a single white blob,
etc. They adopted the logistic regression and the multi-layer perceptron (MLP) classifiers.
2.1.2 DL-based Methods
CNN Solutions. Li et al. [67] used CNNs to detect the warping artifact that occurs when a source face is
warped into the target one. Several well known CNN architectures such as VGG16, ResNet50, ResNet101,
and ResNet152 were tried. VGG16 was trained from scratch while ResNet models were pretrained on
the ImageNet dataset and fine-tuned by image frames from Deepfake videos. Afchar et al. [1] proposed
a mesoscopic approach to Deepfake detection and designed a small network that contains only 27,977
trainable parameters. Tolosana et al. [107] examined different facial regions and landmarks such as eyes,
nose, mouth, the whole face, and the face without eyes, nose, and mouth. They applied the Xception
network to each region and classified whether it is fake or not.
Integrated CNN/RNN Solutions. The integrated CNN/RNN solutions exploit both spatial and temporal features. Sabir et al. [98] applied the DenseNet and the bidirectional RNN to aligned faces. Güera and
Delp [37] first extracted features from an InceptionV3 with its fully-connected layer removed, which was trained
on ImageNet. Then, they fed these features to an LSTM (long short-term memory) network for sequence processing. Afterward, they mapped the output of the LSTM, called the sequence descriptor, to a shallow
detection network to yield the probability of being a fake one.
2.2 DefakeHop++
Deepfake Detection. Most state-of-the-art Deepfake detection methods use DNNs to extract features
from faces. Their models are trained with heavy augmentation (e.g., deleting part of faces) to increase
the performance. Despite the high performance of these models, their model sizes are usually very large.
They have to be pre-trained on other datasets in order to converge. Several examples are given below.
The model of the winning team of the DFDC Kaggle challenges [101] has 432M parameters. Heo et al.
[43] improved this model by concatenating it with a Vision Transformer (ViT), which has 86M parameters.
Zhao et al. [133] proposed a model that exploits multiple spatial attention heads to learn various local
parts of a face and the textural enhancement block to learn subtle facial artifacts. To reduce the model size
and improve the efficiency, Sun et al. [104] proposed a robust method based on the change of landmark
positions in a video. These landmarks are calibrated by the neighbor frames. Afterwards, a two-stream
RNN is trained to learn the temporal information of the landmark positions. Since they only consider the
position information and not the image information, the model size is 0.18M parameters, which is relatively
small. Tran et al. [109] applied MobileNet to different facial regions and InceptionV3 to the entire faces.
Although this model is smaller, it still demands 26M parameters.
DefakeHop. To address the challenge of huge model sizes and training data requirements, a new machine learning paradigm called green learning has been developed in the last six years [17, 19, 59, 61]. Its
main goal is to reduce the model size and training time while keeping high performance. Green learning has
been applied to different applications, e.g., [52, 74, 83, 95, 97, 125, 127, 128, 129]. Based on green learning, DefakeHop was developed in [14] for the Deepfake detection task. It first extracts a large number of
features from three facial regions using PixelHop++ [17] and then refines them using feature distillation
modules. Finally, it feeds the distilled features to the XGBoost classifier [16]. The DefakeHop model has
only 42.8K parameters, yet it outperforms many DNN solutions in detection accuracy on the first- and
second-generation Deepfake datasets. Recently, DefakeHop has been used to detect fake satellite images
in [15].
2.3 GeoDefakeHop
2.3.1 Fake images generation
GANs provide powerful machine learning models for image-to-image translation. A GAN consists of two neural
networks in the training process: a generator and a discriminator. The generator attempts to generate fake
images to fool the discriminator while the discriminator tries to distinguish generated fake images from
real ones. They are jointly trained via end-to-end optimization with an adversarial loss. In the inference
stage, only the generator is needed. Many GANs have been proposed. One example is Cycle-consistent
GAN (CycleGAN) [145]. It has been applied to fake satellite image generation [132]. In this work, we
use StyleGAN2 [55], one of the most widely used GANs, and Lightweight GAN [71], a recent GAN
that can be trained efficiently within one day, to generate more fake satellite images (see
Sec. 6.3.1).
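As a reference for the adversarial training described above, the classical GAN minimax objective (the specific losses used by CycleGAN, StyleGAN2 and Lightweight GAN differ in their details) can be written as

min_G max_D V(D, G) = E_{x ~ p_data(x)} [log D(x)] + E_{z ~ p_z(z)} [log(1 − D(G(z)))],

where G is the generator, D is the discriminator, p_data is the distribution of real images and p_z is the latent prior.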
2.3.2 Fake images detection
Most fake image detection methods adopt convolutional neural networks (CNNs). For example, [113] used
the real and fake images generated by ProGAN [53] as the input of a ResNet-50 pretrained on ImageNet.
[131] generated fake images with their designed GAN, called AutoGAN, and claimed that a CNN trained by
their simulated images could learn artifacts of fake images. [84] borrowed the idea from image steganalysis and used the co-occurrence matrix as input to a customized CNN so that it can learn the differences
between real and fake images. By following this idea, [6] added the cross-band co-occurrence matrix to the
input so as to increase the stability of the model. [36] utilized the EM algorithm and the KNN classifier to
learn the convolution traces of artifacts generated by GANs. Little research has been done to date on fake
satellite image detection due to the lack of available datasets. [132] proposed the first fake satellite image
dataset with simulated satellite images from three cities (i.e., Tacoma, Seattle and Beijing). Furthermore, it
used 26 hand-crafted features to train an SVM classifier for fake satellite image detection. The features can
be categorized into three types: spatial, histogram and frequency. Features of different classes
are concatenated for performance evaluation. We will benchmark our proposed Geo-DefakeHop method
with the method in [132] and CNN models with images and spectrum as the input.
2.3.3 PixelHop and Saab transform
The PixelHop concept was introduced by Chen et al. in [18]. Each PixelHop has local patches of the same
size as its input. Suppose that local patches are of dimension L = s1 × s2 × c, where s1 × s2 is the
spatial dimension and c is the spectral dimension. A PixelHop defines a mapping from pixel values in a
patch to a set of spectral coefficients, which is called the Saab transform [61]. The Saab transform is a
variant of the principal component analysis (PCA). For standard PCA, we subtract the ensemble mean and
then conduct eigen-analysis on the covariance matrix of input vectors. The ensemble mean is difficult
to estimate if the sample size is small. The Saab transform decomposes the n-dimensional signal space
into a one-dimensional DC (direct current) subspace and an (n − 1)-dimensional AC (alternating current)
subspace. Signals in the AC subspace have an ensemble mean close to zero. Then, we can apply PCA to the
AC signal and decompose it into (n − 1) channels. Saab coefficients are unsupervised data-driven features
since Saab filters are derived from the local correlation structure of pixels.
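As an illustration only (not the implementation used in this work), the following Python sketch mimics the DC/AC decomposition described above for a matrix of flattened local patches; the bias term and multi-stage details of the full Saab transform are omitted, and the function name is hypothetical.

import numpy as np
from sklearn.decomposition import PCA

def saab_like_transform(patches, num_ac_channels):
    # patches: (N, n) array, each row a flattened local patch of dimension n
    n = patches.shape[1]
    # DC channel: projection onto the constant filter (1/sqrt(n)) * ones
    dc = patches.mean(axis=1, keepdims=True) * np.sqrt(n)
    # Remove the per-patch DC component; the remaining AC part has near-zero ensemble mean
    ac_part = patches - dc / np.sqrt(n)
    # PCA on the AC part yields up to (n - 1) data-driven AC channels
    pca = PCA(n_components=num_ac_channels)
    ac = pca.fit_transform(ac_part)
    # Saab-style coefficients: one DC channel followed by the AC channels
    return np.hstack([dc, ac]), pca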
2.3.4 Differences between DefakeHop and Geo-DefakeHop
The Saab transform can be implemented conveniently with filter banks. It has been successfully applied
to many application domains. Examples include [14, 74, 125, 129]. Among them, DefakeHop [14] is closest
to this work. There are substantial differences between DefakeHop and Geo-DefakeHop. DefakeHop was
initially proposed to detect deepfake face videos. DefakeHop extracted features from human eyes, nose and
mouth regions. It focused on low-frequency channels and discarded high-frequency channels. In contrast,
we show that high-frequency channels are more discriminant than low-frequency channels for fake image
detection. Furthermore, DefakeHop was designed using successive subspace learning (SSL) while Geo-DefakeHop is developed with parallel subspace learning (PSL). SSL and PSL are quite different. We tailor
DefakeHop to the context of satellite images and show that Geo-DefakeHop outperforms DefakeHop by a
significant margin due to a better design.
2.4 GreenCOD
2.4.1 Camouflaged Object Detection
The realm of Camouflaged Object Detection (COD) uniquely stands out in the domain of computer vision.
It aims at identifying objects that often seamlessly blend into their surroundings due to various factors
such as their minuscule size, obfuscation, inherent concealment, or the act of self-disguise. The intricacies
of COD, especially given the striking resemblance camouflaged objects tend to have with their immediate
environments, make it a field of great complexity.
2.4.2 Recent Advancements in COD
In recent years, various strategies have emerged to tackle the challenge of camouflaged object detection.
[31] laid the groundwork by introducing a foundational framework dedicated to identifying camouflaged
objects within images. Following this initiative, [112] unveiled the D2C-Net, employing a dual-branch,
dual-guidance, and cross-refine network to enhance detection performance. Similarly, [102] proposed a
context-aware cross-level fusion network to leverage contextual information across different levels for improved detection. The spotlight was turned to texture awareness by [144], who introduced a texture-aware
interactive guidance network for inferring camouflaged objects. Exploring the notion of uncertainty, [63]
presented an uncertainty-aware method for joint detection of salient and camouflaged objects. A distinct
approach was taken by [75], which localizes, segments, and ranks camouflaged objects simultaneously.
The methodology of mutual graph learning was the cornerstone of the detection enhancement method
proposed by [123]. Utilizing distraction mining, [82] achieved better segmentation of camouflaged objects.
Transformer reasoning guided by uncertainty was the focus of the method proposed by [116], targeting
enhanced detection. A boundary-aware segmentation network suitable for mobile and web applications
was introduced by [90]. The exploration of neighbor connection and hierarchical information transfer
for improved detection was discussed by [124]. Also, [12] sought to improve camouflaged object detection
through context-aware cross-level fusion. In a novel architectural approach, [150] introduced the CubeNet
with X-shape connections for enhanced detection performance. Edge-based reversible re-calibration for
faster detection was the focus of the method proposed by [49]. Lastly, the TPRNet proposed by [130]
utilized a transformer-induced progressive refinement network to address the challenge of detecting camouflaged objects.
2.5 GIFT
Image Forgery: Common image forgeries include splicing, copy-move and removal. Image splicing involves the idea of cutting a part of one image and pasting it into another, creating a composite that may
show events or scenes that never occurred. Similarly, in copy-move forgery, a part of the image is copied
and pasted within the same image. Removal forgery refers to the case where an object or region is removed from the source image and filled with similar textures from its neighborhood. In this work, we mainly
focus on the localization of splicing and copy-move forgeries.
Image Forgery Detection: Image forgery detection refers to detecting pixels that are manually manipulated within images, usually with the intent to mislead or deceive. Traditionally, many studies
use hand-crafted features to perform forgery detection, such as local noise analysis [22, 76, 87]; the color filter
array [32]; illumination variance analysis [23]; and double JPEG compression [5, 7]. These features usually
come from the inconsistency between the tampered region and the authentic region caused by manipulations.
For example, local noise analysis [22, 76, 87] makes use of the fact that noise variances from different
image sources are different. Authentic image regions that contain intrinsic noise have similar variances,
while a tampered region from another image preserves a different noise variance. Some works also use the
artifacts in the color filter array to perform forgery localization [32]. In spliced images, the spliced region may
have different CFA patterns or demosaicing artifacts. Also, when an image is spliced, there is often a need
to resize or rotate the inserted part to fit naturally. This resampling can disrupt the regular CFA pattern.
Some work explore the artifact from double JPEG compression [5, 7]). That is, when an image is spliced
from a JPEG-compressed source, and saved again as a JPEG, the spliced region has e#ectively undergone
14
compression twice. This means its 8x8 DCT blocks have been quantized twice, leading to speci"c and
detectable inconsistencies.
Different from traditional methods, Deep Neural Networks (DNNs) learn deep features and use end-to-end training to perform forgery detection. Salloum et al. [100] proposed a method to detect splicing based on a multi-task fully convolutional network trained on labeled data. [114] proposed a unified framework for multiple forgeries or their combinations. [45] used a Spatial Pyramid Attention Network to perform the same task.
With the emergence of transformers in the computer vision field, researchers started to bring transformers into the image forensics area. [39] proposed a new Transformer-based framework, introducing self-attention and dense attention into image splicing localization. [111] added additional high-frequency features to [39]. [73] further extended [39] to general low-level structure segmentation tasks, including camouflage detection, shadow detection, forgery detection and defocus blur detection.
Green Learning: Proposed by Kuo et al. [60], Green Learning aims to solve computer vision or media forensics tasks without using back-propagation. While Deep Neural Networks provide extraordinary performance in many tasks, they usually suffer from high computational cost, poor explainability and cumbersome end-to-end training. Contrary to DNNs, Green Learning provides a modular solution to computer vision and forensics tasks that does not need back-propagation. It has proven successful in many applications such as image classification [35, 118, 119], image generation [62, 146], unsupervised object tracking [139, 140, 141, 142], and fake image detection [13, 14, 15, 147, 148, 149]. In this work, we propose an efficient and green image forgery detection method. It is based on extreme gradient boosting combined with a pre-trained EfficientNet. It does not need back-propagation or end-to-end training. Our method achieves comparable performance with state-of-the-art methods, with only half of the parameters and 10% of the FLOPs.
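To make the pipeline concrete, the following is a minimal sketch of the boosting stage under the assumption that per-patch features have already been extracted by a frozen, pre-trained EfficientNet backbone and flattened into a feature matrix; the hyper-parameters and variable names below are illustrative rather than the exact configuration used in this work.

import numpy as np
import xgboost as xgb

def train_forgery_classifier(features, labels):
    # features: (num_samples, feature_dim) array from a frozen backbone (assumption)
    # labels: (num_samples,) binary array, 1 = tampered, 0 = authentic
    clf = xgb.XGBClassifier(
        n_estimators=300,        # illustrative hyper-parameters, not the thesis setting
        max_depth=6,
        learning_rate=0.1,
        eval_metric="logloss",
    )
    clf.fit(features, labels)    # no back-propagation is involved
    return clf

# Example usage with random placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256)).astype(np.float32)
y = rng.integers(0, 2, size=1000)
model = train_forgery_classifier(X, y)
print(model.predict_proba(X[:5])[:, 1])   # per-sample tampering probability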
Chapter 3
DefakeHop: A Light-Weight High-Performance Deepfake Detector
3.1 Introduction
Most state-of-the-art Deepfake detection methods are based upon deep learning (DL) techniques and can be mainly categorized into two types: methods based on convolutional neural networks (CNNs) [1, 85, 86, 92, 107, 136] and methods that integrate CNNs and recurrent neural networks (RNNs) [37, 98]. While the former focuses solely on images, the latter takes both spatial and temporal features into account. DL-based solutions have several shortcomings. First, their size is typically large, containing millions of model parameters. Second, training them is computationally expensive. For example, the winning team of the DFDC contest [101] used seven pre-trained EfficientNets that contain 432 million parameters. There are also non-DL-based Deepfake detection methods [2, 80, 117], where handcrafted features are extracted and fed into classifiers. The performance of non-DL-based methods is usually inferior to that of DL-based ones.
A new non-DL-based solution to Deepfake detection, called DefakeHop, is proposed in this work. DefakeHop consists of three main modules: 1) PixelHop++, 2) feature distillation and 3) ensemble classification. To derive a rich feature representation of faces, DefakeHop extracts features using PixelHop++ units [19] from various parts of face images. The theory of PixelHop++ has been developed by Kuo et al. using SSL [19, 59, 61]. PixelHop++ has been recently used for feature learning from low-resolution face images [94, 96] but, to the best of our knowledge, this is the first time that it is used for feature learning from patches extracted from high-resolution color face images. Since features extracted by PixelHop++ are still not concise enough for classification, we also propose an effective feature distillation module to further reduce the feature dimension and derive a more concise description of the face. Our feature distillation module uses spatial dimension reduction to remove spatial correlation in a face and a soft classifier to include semantic meaning for each channel. Using this module, the feature dimension is significantly reduced and only the most important information is kept. Finally, with the ensemble of different regions and frames, DefakeHop achieves state-of-the-art results on various benchmarks.
Figure 3.1: Face image preprocessing (sampling frames, landmark extraction, face alignment and region extraction).
3.2 DefakeHop Method
Besides the face image preprocessing step (see Fig. 4.2), the proposed DefakeHop method consists of three main modules: 1) PixelHop++, 2) feature distillation and 3) ensemble classification. The block diagram of the DefakeHop method is shown in Fig. 3.2. Each of the components is elaborated below.
3.2.1 Face Image Preprocessing
Face image preprocessing is the initial step in the DefakeHop system. As shown in Fig. 4.2, we crop out
face images from video frames and then align and normalize the cropped-out faces to ensure proper and
consistent inputs are fed to the following modules in the pipeline. Preprocessing allows DefakeHop to handle different resolutions, frame rates, and postures. Some details are given below. First, image frames are sampled from videos. Then, 68 facial landmarks are extracted from each frame by using the open-source toolbox OpenFace2 [4]. After extracting facial landmarks from a frame, faces are resized to 128 × 128 and rotated to specific coordinates to make all samples consistent without different head poses or face sizes. Finally, patches of size 32 × 32 are cropped from different parts of the face (e.g., the left eye, right eye and mouth) as the input data to the PixelHop++ module.
Figure 3.2: An overview of the DefakeHop method (PixelHop++ units with max-pooling and PCA, channel-wise soft classifiers, and multi-region/multi-frame ensemble classification).
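As an illustration of this step, the sketch below crops fixed-size patches around the eye and mouth landmarks, assuming the 68 landmarks have already been produced by OpenFace2 and the face has been aligned and resized to 128 × 128; the landmark index groups follow the common 68-point convention and are assumptions, not part of the original description.

import numpy as np

PATCH = 32
REGIONS = {                                  # 68-point landmark convention (assumed)
    "left_eye": list(range(36, 42)),
    "right_eye": list(range(42, 48)),
    "mouth": list(range(48, 68)),
}

def crop_region(face, landmarks, indices, size=PATCH):
    # face: (128, 128, 3) aligned face image; landmarks: (68, 2) array of (x, y)
    cx, cy = landmarks[indices].mean(axis=0).astype(int)
    half = size // 2
    x0 = int(np.clip(cx - half, 0, face.shape[1] - size))
    y0 = int(np.clip(cy - half, 0, face.shape[0] - size))
    return face[y0:y0 + size, x0:x0 + size]

def extract_patches(face, landmarks):
    # returns one 32x32x3 patch per facial region
    return {name: crop_region(face, landmarks, idx) for name, idx in REGIONS.items()}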
3.2.2 PixelHop++ Module
PixelHop++ extracts rich and discriminant features from local blocks. As shown in Fig. 3.2, the input is a color image of spatial resolution 32 × 32 that focuses on a particular part of a human face. The block size and stride are hyperparameters for users to decide. This process can be conducted in multiple stages to get a larger receptive field. The proposed DefakeHop system has three PixelHop++ units in cascade, each of which has a block size of 3 × 3 with the stride equal to one without padding. The block of a pixel of the first hop contains 3 × 3 × K0 = 9K0 variables as a flattened vector, where K0 = 3 for the RGB input.
Spectral filtering and dimension reduction. We exploit the statistical correlations between pixel-based neighborhoods and apply the c/w Saab transform to the flattened vector of dimension K1 = 9K0 = 27 to obtain a feature representation of dimension K11 (see discussion in Sec. 3.2.3).
Spatial max-pooling. Since the blocks of two adjacent pixels overlap with each other, there exists spatial redundancy between them. We conduct (2×2)-to-(1×1) maximum pooling to further reduce the spatial resolution of the output.
3.2.2.1 Channel-wise (c/w) Saab Transform
The Saab (subspace approximation via adjusted bias) transform [61] is a variant of PCA. It first decomposes a signal space into the local mean and the frequency components and then applies PCA to the frequency components to derive kernels. Each kernel represents a certain frequency-selective filter. A kernel of a larger eigenvalue extracts a lower frequency component while a kernel of a smaller eigenvalue extracts a higher frequency component. The high-frequency components with very small eigenvalues can be discarded for dimension reduction.
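Before turning to the channel-wise variant illustrated in Fig. 3.3, a minimal sketch of this Saab-style decomposition on flattened 3 × 3 × 3 patches is given below, assuming patches are arranged as rows of a matrix; the DC response is approximated by the per-patch mean and the AC kernels come from PCA of the residuals, with an energy threshold chosen for illustration.

import numpy as np

def fit_saab(patches, energy_threshold=0.999):
    # patches: (num_patches, 27) flattened 3x3x3 neighborhoods
    dc = patches.mean(axis=1, keepdims=True)          # per-patch mean (DC response)
    ac = patches - dc                                  # residual (AC) part
    cov = np.cov(ac, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                  # sort by descending energy
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    keep = int(np.searchsorted(cumulative, energy_threshold)) + 1
    return eigvecs[:, :keep]                           # kept AC filters (kernels)

def apply_saab(patches, ac_filters):
    dc = patches.mean(axis=1, keepdims=True)
    return np.hstack([dc, (patches - dc) @ ac_filters])  # [DC | AC coefficients]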
This scheme can be explained by the diagram in Fig. 3.3, where a three-stage c/w Saab transform is illustrated. By following the system in Fig. 3.2, the root of the tree is the color image of dimension 32 × 32 × 3. The local input vector to the first hop has a dimension of 3 × 3 × 3 = 27. Thus, we can get a local mean and 26 frequency components. We divide them into three groups: low-frequency channels (in blue), mid-frequency channels (in green), and high-frequency channels (in gray). Each channel can be represented as a node in the tree. Responses of high-frequency channels can be discarded; responses of mid-frequency channels are kept, yet no further transform is performed due to weak spatial correlations; and responses of low-frequency channels are fed into the next stage for another c/w Saab transform due to stronger spatial correlations. Responses in Hop-1, Hop-2 and Hop-3 are joint spatial-spectral representations. The first few hops contain more spatial detail but have a narrower view. As the hop goes deeper, it has less spatial detail but a broader view. The channel-wise (c/w) Saab transform exploits channel separability to reduce the model size of the Saab transform without performance degradation.
3.2.3 Feature Distillation Module
After feature extraction by PixelHop++, we derive a small and effective set of features of a face. However, the output dimension of PixelHop++ is still not concise enough to be fed into a classifier. For example, the output dimension of the first hop is 15 × 15 × K11 = 225K11. We use two methods to further distill the features to get a compact description of a face being fake or real.
Spatial dimension reduction Since input images are face patches from the same part of human faces, there exist strong correlations between the 15 × 15 spatial responses for a given channel. Thus, we apply PCA for further spatial dimension reduction. By keeping the top N1 PCA components, which contain about 90% of the energy of the input images, we get a compact representation of dimension N1 × K11.
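The spatial dimension reduction for one channel can be sketched as follows, assuming the 15 × 15 responses of that channel are collected over all training patches; scikit-learn's PCA with a fractional n_components keeps the smallest number of components explaining the requested energy.

import numpy as np
from sklearn.decomposition import PCA

def reduce_spatial(responses, energy=0.90):
    # responses: (num_samples, 15, 15) responses of a single channel
    flat = responses.reshape(len(responses), -1)       # (num_samples, 225)
    pca = PCA(n_components=energy)                     # keep ~90% of the energy
    reduced = pca.fit_transform(flat)                  # (num_samples, N1)
    return pca, reduced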
Channel-wise Soft Classification After removing spatial and spectral redundancies, we obtain K11 channels from each hop with a rather small spatial dimension N1. For each channel, we train a soft binary classifier to include the semantic meaning. The soft decision provides the probability of a specific channel being related to a fake video. Different classifiers could be used here in different situations. In our model, the extreme gradient boosting classifier (XGBoost) is selected because it has a reasonable model size and is efficient to train while reaching a high AUC. The max-depth of XGBoost is set to one for all soft classifiers to prevent overfitting.
For each face patch, we concatenate the probabilities of all channels to get a description representing the patch. The output dimension becomes K11, which is a large reduction compared with the output of PixelHop++. It is fed into a final classifier to determine the probability of being fake for this face patch.
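The channel-wise soft classification can be sketched as below, assuming the spatially reduced responses of one face patch are arranged as an array of shape (num_samples, N1, K11); one depth-one XGBoost classifier is trained per channel and its fake probability becomes that channel's one-dimensional descriptor. The hyper-parameters are illustrative.

import numpy as np
import xgboost as xgb

def channel_wise_soft_decisions(features, labels):
    # features: (num_samples, N1, K11); labels: (num_samples,) with 1 = fake, 0 = real
    classifiers, probs = [], []
    for k in range(features.shape[2]):
        clf = xgb.XGBClassifier(n_estimators=100, max_depth=1,   # depth one to limit overfitting
                                eval_metric="logloss")
        clf.fit(features[:, :, k], labels)
        classifiers.append(clf)
        probs.append(clf.predict_proba(features[:, :, k])[:, 1])
    return classifiers, np.stack(probs, axis=1)                  # (num_samples, K11)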
3.2.4 Ensemble Classification Module
We integrate soft decisions from all facial regions (different face patches) and selected frames to offer the final decision on whether a video clip is fake or real according to the following procedure.
Figure 3.3: Illustration of the c/w Saab transform (intermediate, leaf and discarded nodes across Hop-1, Hop-2 and Hop-3).
Multi-Region Ensemble. Since each facial region might have different strengths and weaknesses against different Deepfake manipulations [107], we concatenate the probabilities of different regions together. In the experiments reported in Sec. 6.3, we focus on three facial regions: the left eye, the right eye, and the mouth.
Multi-Frame Ensemble. Besides concatenating different regions, for each frame, we concatenate the current frame and its 6 adjacent frames (namely, 3 frames before and 3 frames after) so as to incorporate the temporal information.
Finally, for the whole video clip, we compute its probability of being fake by averaging the probabilities of all frames from the same video. Different approaches to aggregating frame-level probabilities can be used to determine the final decision.
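A minimal sketch of the region and frame ensembling is given below, assuming the per-frame channel-wise probabilities of each region are already computed and frames are ordered in time; the +/-3-frame window and the frame-averaging step follow the text, while the variable names are illustrative.

import numpy as np

def frame_feature(region_probs, t):
    # region_probs: dict mapping region name -> (num_frames, K) channel probabilities
    window = np.arange(t - 3, t + 4)                     # current frame and +/- 3 neighbors
    parts = []
    for probs in region_probs.values():
        idx = np.clip(window, 0, len(probs) - 1)
        parts.append(probs[idx].reshape(-1))             # concatenate the 7 frames
    return np.concatenate(parts)                          # frame-level feature vector

def video_score(frame_probs):
    # frame_probs: (num_frames,) fake probabilities from the final classifier
    return float(np.mean(frame_probs))                    # average over all frames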
3.3 Experiments
We evaluate the performance of DefakeHop on four Deepfake video datasets: UADFV, FaceForensics++
(FF++), Celeb-DF v1, and Celeb-DF v2, and compare the results with state-of-the-art Deepfake detection
methods. UADFV does not specify the train/test split. We randomly select 80% for training and 20% for
testing. FF++ and Celeb-DF provide the test set, and all other videos are used for training the model.
Benchmarking Datasets. Deepfake video datasets are categorized into two generations based on
the dataset size and Deepfake methods used. The first generation includes UADFV and FF++. The second
generation includes Celeb-DF version 1 and version 2. Fake videos of the second generation are more
realistic, which makes their detection more challenging.
Table 3.1: The AUC value for each facial region and the final ensemble result.
Left eye Right eye Mouth Ensemble
UADFV 100% 100% 100% 100%
FF++ / DF 94.37% 93.73% 94.25% 97.45%
Celeb-DF v1 89.69% 88.20% 92.66% 94.95%
Celeb-DF v2 85.17% 86.41% 89.66% 90.56%
3.3.1 Detection Performance
DefakeHop is benchmarked with several other methods using the area under the ROC curve (AUC) metric
in Table 3.2. We report both frame-level and video-level AUC values for DefakeHop. For the frame-level AUC, we follow the evaluation done in [69] that considers key frames only (rather than all frames). DefakeHop achieves the best performance on UADFV, Celeb-DF v1, and Celeb-DF v2 among all methods. On FF++/DF, its AUC is only 2.15% lower than the best result ever reported. DefakeHop outperforms other non-DL methods [80, 117] by a significant margin. It is also competitive against DL methods [85, 86, 92, 107]. The ROC curve of DefakeHop with respect to different datasets is shown in Fig. 3.4.
We compare the detection performance by considering different factors.
Facial Regions. The performance of DefakeHop with respect to different facial regions and their ensemble results is shown in Table 3.1. The ensemble of multiple facial regions can boost the AUC values by up to 5%. Each facial region has different strengths on various faces, and their ensemble gives the best result.
Video Quality. In the experiment, we focus on compressed videos since they are challenging for Deepfake detection algorithms. We evaluate the AUC performance of DefakeHop on videos with different qualities in Table 3.3. As shown in the table, the performance of DefakeHop degrades by about 5% as video quality becomes worse. Thus, DefakeHop can reasonably handle videos with different qualities.
Figure 3.4: The ROC curve of DefakeHop for different datasets.
3.3.2 Model Size Computation
PixelHop++ units Filters in all three hops have a size of 3 × 3. The maximum number of PixelHop++ units is limited to 10. For Hop-1, the input has three channels, leading to a 27D input vector. For Hop-2 and Hop-3, since we use the channel-wise method, each input has only 1 channel, leading to a 9D input vector. There are multiple c/w Saab transforms for Hop-2 and Hop-3, and channels are selected across them. Therefore, it is possible to get more than 9 channels for Hop-2 and Hop-3. At most 27 × 10, 9 × 10, and 9 × 10 parameters are used in the 3 PixelHop++ units, respectively.
Spatial PCA Hop-1 is reduced from 225 to 45, Hop-2 is reduced from 49 to 25, and Hop-3 is reduced from 9 to 5. The numbers of parameters for the spatial PCAs are 225 × 45, 49 × 25, and 9 × 5. The numbers of kept components, 45, 25 and 5, are selected so as to keep about 90% of the energy.
XGBoost The number of trees is set to 100. Each tree includes intermediate and leaf nodes. Intermediate nodes decide the dimension and boundary to split (i.e., 2 parameters per node) while the leaf nodes determine the predicted value (i.e., 1 parameter per node). When the max-depths are 1 and 6, the numbers of parameters are 400 and 19,000, respectively. The max-depths of the channel-wise and ensemble XGBoosts are set to 1 and 6, respectively. Thus, the total number of parameters for 30 channel-wise XGBoosts is 12,000 and the number of parameters of the ensemble XGBoost is 19,000.
The above calculations are summarized in Table 7.2. The final model size, 42,845, is actually an upper-bound estimate since the maximum depth of the XGBoost and the maximum channel number per hop are upper bounded.
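The parameter budget can be checked with a few lines of arithmetic, following the upper-bound numbers quoted above and in Table 3.4.

pixelhop = 27 * 10 + 9 * 10 + 9 * 10        # three PixelHop++ units: 450
spatial_pca = 225 * 45 + 49 * 25 + 9 * 5    # Hop-1, Hop-2, Hop-3 PCAs: 11,395
channel_xgb = 30 * 400                       # 30 depth-1 XGBoosts (100 trees each): 12,000
ensemble_xgb = 19_000                        # one depth-6 XGBoost (100 trees)
print(pixelhop + spatial_pca + channel_xgb + ensemble_xgb)   # 42845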
Train on a small data size We observe that DefakeHop demands fewer training videos. An example is given to illustrate this point. Celeb-DF v2 has 6011 training videos in total. We train DefakeHop on three facial regions with an increasing number of randomly selected videos and see that DefakeHop can achieve nearly 90% AUC with only 1424 videos (23.6% of the whole dataset). In Fig. 3.5, the AUC of test videos is plotted as a function of the number of training videos of Celeb-DF v2. As shown in the figure, DefakeHop can achieve about 85% AUC with less than 5% (250 videos) of the whole training data.
Figure 3.5: The plot of AUC values as a function of the training video number.
3.4 Conclusion
A light-weight high-performance method for Deepfake detection, called DefakeHop, was proposed. It has several advantages: a smaller model size, a fast training procedure, a high detection AUC, and the need for fewer training samples. Extensive experiments were conducted to demonstrate its high detection performance.
Table 3.2: Comparison of the detection performance of benchmarking methods with the AUC value at the frame level as the evaluation metric. Boldface and underbars indicate the best and the second-best results, respectively. Italics mean the reference does not specify frame- or video-level AUC. The AUC results of DefakeHop are reported at both the frame level and the video level. The AUC results of benchmarking methods are taken from [108] and [69]. a: deep learning method; b: non-deep-learning method.
1st Generation datasets 2nd Generation datasets
Method UADFV FF++ / DF Celeb-DF v1 Celeb-DF v2 Number of parameters
Zhou et al. (2017) [136] InceptionV3a 85.1% 70.1% 55.7% 53.8% 24M
Afchar et al. (2018) [1] Meso4a 84.3% 84.7% 53.6% 54.8% 27.9K
Li et al. (2018) [67] FWAa (ResNet-50) 97.4% 80.1% 53.8% 56.9% 23.8M
Yang et al. (2019) [117] HeadPoseb (SVM) 89% 47.3% 54.8% 54.6% -
Matern et al. (2019) [80] VA-MLPb 70.2% 66.4% 48.8% 55% -
Rossler et al. (2019) [92] Xception-rawa 80.4% 99.7% 38.7% 48.2% 22.8M
Nguyen et al. (2019) [85] Multi-taska 65.8% 76.3% 36.5% 54.3% -
Nguyen et al. (2019) [86] CapsuleNeta 61.3% 96.6% - 57.5% 3.9M
Sabir et al. (2019) [98] DenseNet+RNNa - 99.6% - - 25.6M
Li et al. (2020) [67] DSP-FWAa (SPPNet) 97.7% 93% - 64.6% -
Tolosana et al. (2020) [107] Xceptiona 100% 99.4% 83.6% - 22.8M
Ours DefakeHop (Frame) 100% 95.95% 93.12% 87.65% 42.8K
Ours DefakeHop (Video) 100% 97.45% 94.95% 90.56% 42.8K
Table 3.3: AUC of DefakeHop on FF++ for different Deepfake algorithms and video qualities.
FF++ with Deepfakes FF++ with FaceSwap
HQ (c23) LQ (c40) HQ (c23) LQ (c40)
Frame 95.95% 93.01% 97.87% 89.14%
Video 97.45% 95.80% 98.78% 93.22%
Table 3.4: The number of parameters for various parts.
Subsystem Number of Parameters
PixelHop++ Hop-1 270
PixelHop++ Hop-2 90
PixelHop++ Hop-3 90
PCA Hop-1 10,125
PCA Hop-2 1,225
PCA Hop-3 45
Channel-Wise XGBoost(s) 12,000
Final XGBoost 19,000
Total 42,845
Chapter 4
DefakeHop++: An Enhanced Lightweight Deepfake Detector
4.1 Introduction
With the fast-growing Generative Adversarial Network (GAN) technology, image forgery techniques have kept evolving in recent years. They are effective in reducing manipulation traces detectable by human eyes. It becomes very challenging to distinguish Deepfake images from real ones with human eyes against new generations of Deepfake technologies. Furthermore, adding different kinds of perturbation (e.g., blur, noise and compression) can hurt the detection performance of Deepfake detectors since manipulation traces are mixed with perturbations. A robust Deepfake detector should be able to tell the differences between real images and fake images generated by GAN techniques even though both of them experience such modifications.
There have been three generations of fake video datasets created for research and development purposes. They demonstrate the evolution of Deepfake techniques. The UADFV dataset [68] belongs to the first generation. It has only 50 video clips generated by one Deepfake method. Its real and fake videos can be easily distinguished by humans. FaceForensics++ [92] and Celeb-DF-v2 [69] are examples of the second generation. They contain more video clips with more identities. It is difficult for humans to distinguish real and fake faces in them. DFDC [27] is the third-generation dataset. It contains more than 100K fake videos which are generated by 8 Deepfake techniques and perturbed by 19 kinds of distortions. The size of the third-generation dataset is very large. It is designed to test the performance of various Deepfake detectors in an environment close to real-world applications.
A lightweight Deepfake detector called DefakeHop was developed in [14]. An enhanced version of DefakeHop, called DefakeHop++, is proposed in this work. The improvements lie in two areas. First, DefakeHop examines only three facial regions (i.e., two eyes and the mouth). DefakeHop++ includes eight more landmark regions to offer more information about human faces. Second, DefakeHop uses an unsupervised energy criterion to select discriminant features. It does not exploit the correlation between features and labels. DefakeHop++ adopts a supervised tool, called the Discriminant Feature Test (DFT), in feature selection. The latter is more effective than the former due to supervision. In DefakeHop++, spatial and spectral features from multiple facial regions and landmarks are generated automatically and, then, DFT is used to select a subset of discriminant features to train a classifier. As compared with MobileNet v3, which is a lightweight CNN model of 1.5M parameters targeting mobile applications, DefakeHop++ has an even smaller model size of 238K parameters (i.e., 16% of MobileNet v3). In terms of Deepfake image detection performance, DefakeHop++ outperforms MobileNet v3 when neither data augmentation nor pre-trained models are used.
4.2 Review of Related Work
Deepfake Detection. Most state-of-the-art Deepfake detection methods use DNNs to extract features from faces. Their models are trained with heavy augmentation (e.g., deleting part of faces) to increase the performance. Despite the high performance of these models, their model sizes are usually very large. They have to be pre-trained on other datasets in order to converge. Several examples are given below. The model of the winning team of the DFDC Kaggle challenge [101] has 432M parameters. Heo et al. [43] improved this model by concatenating it with a Vision Transformer (ViT), which has 86M parameters. Zhao et al. [133] proposed a model that exploits multiple spatial attention heads to learn various local parts of a face and a textural enhancement block to learn subtle facial artifacts. To reduce the model size and improve the efficiency, Sun et al. [104] proposed a robust method based on the change of landmark positions in a video. These landmarks are calibrated by neighboring frames. Afterwards, a two-stream RNN is trained to learn the temporal information of the landmark positions. Since they only consider the position information and not the image information, the model size is 0.18M, which is relatively small. Tran et al. [109] applied MobileNet to different facial regions and InceptionV3 to entire faces. Although this model is smaller, it still demands 26M parameters.
DefakeHop. To address the need for huge model sizes and training data, a new machine learning paradigm called green learning has been developed in the last six years [17, 19, 59, 61]. Its main goal is to reduce the model size and training time while keeping high performance. Green learning has been applied to different applications, e.g., [52, 74, 83, 95, 97, 125, 127, 128, 129]. Based on green learning, DefakeHop was developed in [14] for the Deepfake detection task. It first extracts a large number of features from three facial regions using PixelHop++ [17] and then refines them using feature distillation modules. Finally, it feeds the distilled features to an XGBoost classifier [16]. The DefakeHop model has only 42.8K parameters, yet it outperforms many DNN solutions in detection accuracy on the first- and second-generation Deepfake datasets. Recently, DefakeHop has been used to detect fake satellite images in [15].
4.3 DefakeHop++
DefakeHop++ is an improved version of DefakeHop. An overview of the DefakeHop++ system is shown in Fig. 4.1. Facial blocks of two sizes are first extracted from frames in a video sequence in the pre-processing step. These blocks are then passed to DefakeHop++ for processing. DefakeHop++ consists of four modules: 1) one-stage PixelHop, 2) spatial PCA, 3) discriminant feature test (DFT), and 4) classifier. The pre-processing step and the four modules are elaborated below.
Figure 4.1: An overview of DefakeHop++.
4.3.1 Pre-processing Step
Frames are extracted from video sequences. For training videos, we extract three frames per second. Since the video length is typically around 10 seconds, we obtain around 30 frames per video. On the other hand, the length and the number of frames per second (FPS) are not fixed in test videos. We uniformly sample 100 frames from each test video.
Facial landmarks are obtained by OpenFace2. With 68 facial landmarks, we crop faces with a 30% margin and resize them to 128 × 128 with zero padding (if needed). We do not align the face since face alignment may distort the original face image and side faces are difficult to align. We conduct experiments on the discriminant power of each landmark and find that the two eyes and the mouth are the most discriminant regions. Thus, we crop out 8 smaller blocks that cover 6 representative landmarks from the two eyes, one from the nose and one from the mouth. Furthermore, we crop out three larger blocks to cover the left eye, right eye and mouth. We make the block size a hyper-parameter for users to choose. For the experiments reported in Sec. 6.3, we adopt smaller blocks of size 13 × 13 centered at landmarks and larger blocks of size 31 × 31 for the three facial regions (i.e., two eyes and the mouth) since they give the best results. The extracted small and large blocks are illustrated in Fig. 4.2.
Figure 4.2: Illustration of the pre-processing step. It first extracts 68 landmarks (in blue) from the face. Then, it crops out blocks of size 31 × 31 from three regions (in yellow) and blocks of size 13 × 13 from eight landmarks (in red). The step can extract consistent blocks despite different head poses and perturbations.
4.3.2 Module 1: One-Stage PixelHop
A PixelHop unit [17, 19] contains a set of filters used to extract features from training blocks. While the filter weights of traditional filter banks (e.g., the Gabor or Laws filters) are fixed, those of the PixelHop are data dependent. Its filter banks are defined by the Saab transform, which decomposes a local neighborhood (i.e., a patch) into one DC (direct current) and multiple AC (alternating current) components. The DC component is the mean of each patch. The AC components are obtained through the principal component analysis (PCA) of patch residuals.
To give an example, we set the patch size to 3 × 3 in the experiment. For the color face image input, each patch contains (3 × 3) × 3 degrees of freedom, including nine spatial pixel locations and three spectral components. By collecting a large number of patches, we can conduct PCA to obtain AC filters. The eigenvectors and the eigenvalues of the covariance matrix correspond to the AC filters and their mean energy values, respectively. Since most natural images are smooth, the leading AC filters extract low-frequency features of higher energy. When the frequency index goes higher, the energy of higher-frequency components decays quickly. The Saab filter bank decouples the 27 correlated input dimensions into 27 decorrelated output dimensions. Besides decorrelating input components, the Saab transform allows discarding some high-frequency channels due to their very small variances (or energy values).
The horizontal output size of a block can be calculated as
(horizontal block size − horizontal filter size) / s + 1,    (4.1)
where s is the stride parameter along the horizontal direction. For example, if the horizontal block size is 13, the horizontal filter size is 3 and the horizontal stride is 2, then the horizontal output size is equal to 6.
The same computation applies to the vertical output size.
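For reference, Eq. (4.1) and the example above can be checked with a one-line helper:

def output_size(block, filt, stride):
    # Eq. (4.1): (block size - filter size) / stride + 1
    return (block - filt) // stride + 1

print(output_size(13, 3, 2))   # 6, matching the example above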
4.3.3 Module 2: Spatial PCA
The output of the same frequency component may still have correlations in the spatial domain. This is
especially true for the DC component and low AC components. The correlation can be removed by spatial
PCA. The idea of spatial PCA is inspired by the eigenface method in [110]. For each channel, we train a
spatial PCA and keep the leading components that have cumulative energy up to 80% of the total energy.
This helps reduce the output feature dimension.
Modules 1 and 2 describe the feature generation procedure for DefakeHop++. It is worthwhile to point out the differences in feature generation between DefakeHop and DefakeHop++.
• DefakeHop only focuses on three regions: the two eyes and the mouth. Besides these three regions, DefakeHop++ zooms into the neighborhoods of 8 landmarks, called blocks, to gain more detailed information. The justification of these 8 landmarks is given in the last paragraph of Sec. 4.4.1.
• DefakeHop conducts three-stage PixelHop units and applies spatial PCA to the response outputs of all three stages. The pipeline is simplified to a one-stage PixelHop in DefakeHop++. Yet, the simplification does not hurt the performance since the spatial PCA applied to each block (or region) still offers the global information of the corresponding block (or region). Note that such simplification is needed as DefakeHop++ covers more spatial patches and regions.
4.3.4 Module 3: Discriminant Feature Test
DefakeHop selects responses from channels of larger energy as features for a classifier, under the assumption that features of higher variance are more discriminant. This is an unsupervised feature selection method. Recently, a supervised feature selection method called the discriminant feature test (DFT) was proposed in [119]. DFT provides a powerful method for selecting discriminant features using training labels. The DFT process can be simply stated below. For each feature dimension, we define an interval using its minimum and maximum values across all training samples. Then, we partition the interval into two sub-intervals at candidate split positions which are equally spaced in the interval. For example, we may select 31 split positions uniformly distributed over the interval. For each partitioning, we use the maximum likelihood principle to assign a predicted label to each training sample and compute the cross entropy of all training samples accordingly. The split position that gives the lowest cross entropy is chosen to be the optimal split of the feature, and the associated cross entropy is used as the cost function of this feature. The lower the cross entropy, the more discriminant the feature dimension. We refer to [119] for more details.
Figure 4.3: Illustration of the feature selection idea in the discriminant feature test (DFT). For the feature dimension associated with the left subfigure, samples of class 0 and class 1 can be easily separated by the blue partition line. Its cross entropy is lower. For the feature dimension associated with the right subfigure, samples of class 0 and class 1 overlap with each other significantly. It is more difficult to separate them and its cross entropy is higher. Thus, the feature dimension in the left subfigure is preferred.
Figure 4.4: Illustration of the feature selection process for a single landmark, where the y-axis is the cross entropy of each dimension and the x-axis is the channel index. The left and right subfigures show unsorted and sorted feature dimensions.
We use Figs. 4.3 and 4.4 to explain the DFT idea. Two features are compared in Fig. 4.3. The feature in the left subfigure has a lower cross entropy value than the one in the right subfigure. The left one is more discriminant than the right one, which is intuitive. Fig. 4.4 is used to describe the feature selection process for a given landmark block. The feature dimension of one landmark block is 972. We perform DFT on each feature and obtain 972 cross entropy values. The y-axis in both subfigures is the cross entropy value while the x-axis is the channel index. In the left subfigure, a smaller channel index indicates a lower-frequency channel. We see that discriminant channels of lower cross entropy actually spread out in both low- and high-frequency channels. In the right plot, we sort channels based on their cross entropy values, which have an elbow point at 350. As a result, we can reduce the feature number from 972 to 350 (35%) by selecting the 350 features with the lowest cross entropy.
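A simplified sketch of DFT for binary labels is given below; for each feature it scans a set of uniformly spaced split points, measures the weighted binary entropy of the two sides (which is the cross entropy obtained when each side is assigned its maximum-likelihood label), and ranks features by the lowest cost. The number of split points and the helper names are illustrative.

import numpy as np

def binary_entropy(y):
    # entropy of the class distribution in one sub-interval
    if len(y) == 0:
        return 0.0
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def dft_cost(x, y, num_splits=31):
    # x: (num_samples,) values of one feature; y: (num_samples,) binary labels
    splits = np.linspace(x.min(), x.max(), num_splits + 2)[1:-1]   # interior split points
    best = np.inf
    for s in splits:
        left, right = y[x <= s], y[x > s]
        cost = (len(left) * binary_entropy(left) +
                len(right) * binary_entropy(right)) / len(y)
        best = min(best, cost)
    return best                                                     # lower = more discriminant

def select_features(X, y, keep):
    costs = np.array([dft_cost(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(costs)[:keep]                                  # indices of kept features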
4.3.5 Module 4: Classification
In DefakeHop, soft decisions from different regions are used to train several XGBoost classifiers. Then, another ensemble XGBoost classifier is trained to make the final decision. This is a two-stage decision process. We find that a lot of detailed information is lost in the first-stage soft decisions and, as a result, the ensemble classification idea is not effective. To address this issue, in DefakeHop++ we simply collect all feature vectors from different regions and landmarks, apply DFT for discriminant feature selection and train a LightGBM classifier in the last stage. LightGBM and XGBoost play a similar role. The main difference between DefakeHop and DefakeHop++ is that the former is a two-stage decision process while the latter is a one-stage decision process. The one-stage decision process can consider the complicated relations of features in landmarks and regions across all frequency bands. Once all frame-level predictions for a video are obtained, we simply use their mean as the final prediction for the video.
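A minimal sketch of this one-stage decision process is shown below, assuming the DFT-selected features of all landmarks and regions have already been concatenated per frame; the LightGBM settings follow the text and other names are illustrative.

import numpy as np
import lightgbm as lgb

def train_final_classifier(frame_features, frame_labels):
    # frame_features: (num_frames, num_selected_features); frame_labels: 0/1 per frame
    clf = lgb.LGBMClassifier(n_estimators=1000, num_leaves=64)   # settings from the text
    clf.fit(frame_features, frame_labels)
    return clf

def predict_video(clf, video_frame_features):
    probs = clf.predict_proba(video_frame_features)[:, 1]        # frame-level fake scores
    return float(probs.mean())                                   # video-level prediction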
4.4 Experiments
In this section, we compare the detection performance of DefakeHop++ with state-of-the-art deep learning and non-deep-learning methods on several datasets, as well as their model sizes and training time, to demonstrate the effectiveness of DefakeHop++.
4.4.1 Experimental Setup
Datasets. Deepfake video datasets can be categorized into three generations based on the dataset size and the Deepfake methods used for fake image generation. The first-generation datasets include UADFV and FF++. The second-generation datasets include Celeb-DF version 1 and version 2. The third-generation dataset is the DFDC dataset. The datasets of later generations have more identities of different races, utilize more Deepfake algorithms to generate fake videos, and add more perturbation types to test videos. Clearly, the later generations are more challenging than the earlier ones. The datasets used in our experiments are described below.
• UADFV [68]
UADFV is the first Deepfake detection dataset. It consists of 49 real videos and 49 fake videos. Real
videos are collected from YouTube while fake ones are generated by the FakeApp mobile App [29].
• FaceForensics++ (FF++) [92]
It contains 1000 real videos collected from YouTube. Two popular methods, FaceSwap [28] and
Deepfakes [25], are used in fake video generation. Each of them generated 1000 fake videos of
different quality levels, e.g., RAW, HQ (high quality) and LQ (low quality). In the experiment, we focus on HQ compressed videos since they are more challenging for Deepfake detection algorithms.
• Celeb-DF [69]
Celeb-DF has two versions. Celeb-DF v1 contains 408 real videos from YouTube and 795 fake videos.
Celeb-DF v2 consists of 890 real and 5639 fake videos. Fake videos are created by an advanced version
of DeepFake. These videos contain subjects of different ages, ethnicities and sexes. Celeb-DF v2 is a
superset of Celeb-DF v1. Celeb-DF v1 and v2 have been widely tested on many Deepfake detection
methods.
• DFDC [27]
DFDC is the third-generation dataset. It contains more than 100K videos generated by 8 different Deepfake algorithms. The test videos are perturbed by 19 distractors and augmenters such as change of brightness/contrast, logo overlay, dog filter, dots overlay, faces overlay, flower crown filter, grayscale, horizontal flip, noise, images overlay, shapes overlay, change of coding quality level, rotation, text overlay, etc. The dataset was generated to mimic the real-world application scenario.
It is the most challenging one among the four benchmark datasets.
Evaluation Metrics. Each Deepfake detector assigns to each test image a probability score of being fake. These scores can be used to plot the Receiver Operating Characteristic (ROC) curve, and the area under the curve (AUC) score can then be used to compare the performance of different detectors. We report the AUC scores at the frame level as well as the video level.
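A small sketch of the evaluation is given below, assuming per-frame scores, per-frame labels and a video identifier for each frame are available; the video-level score is simply the mean of a video's frame scores.

import numpy as np
from sklearn.metrics import roc_auc_score

def frame_and_video_auc(frame_scores, frame_labels, video_ids):
    # frame_scores, frame_labels, video_ids: one entry per sampled frame
    frame_auc = roc_auc_score(frame_labels, frame_scores)
    video_scores, video_labels = [], []
    for vid in np.unique(video_ids):
        mask = video_ids == vid
        video_scores.append(frame_scores[mask].mean())   # mean of a video's frame scores
        video_labels.append(frame_labels[mask].max())    # all frames of a video share its label
    video_auc = roc_auc_score(video_labels, video_scores)
    return frame_auc, video_auc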
Discriminability Analysis of Landmarks. As mentioned in Sec. 4.3.1, there are 68 landmarks. We
use the AUC scores to analyze the discriminability of the 68 landmark regions in Fig. 4.5. We see from
the "gure that the performance of di#erent landmarks varies a lot. Landmarks in two eye regions are
most discriminant. This is attributed to the fact that eyes have rich details and complex movement and,
as a result, they cannot be well synthesized by any Deepfake algorithms. The mouth region is the next
discriminant one because it is di$cult to synthesize lip motion and teeth. The cheek and nose regions are
less discriminant since they are relatively smooth in space and stable in time. This justi"es our choice of
38
Figure 4.5: Analysis of landmark discriminability, where the x-axis is the landmark index anad the y-axis is
the AUC score. Landmarks in the two eye regions are most discriminant in both training and test datasets.
six landmarks from eyes, one landmark from the nose and one landmark from the mouth as described in
Sec. 4.3.1.
4.4.2 Detection Performance Comparison
First Generation Datasets. The performance of a few Deepfake detectors on the two first-generation datasets, UADFV and FF++, is compared in Table 4.1. Both DefakeHop and DefakeHop++ achieve a perfect AUC score of 100% on UADFV. DSP-FWA and FWA are the two closest competitors with AUC scores of 97.7% and 97.4%, respectively. Actually, UADFV is an easy dataset whose visual artifacts are visible to human eyes. It is interesting to see that the extremely large models do not reach perfect detection results. This could be explained by the small size of UADFV (i.e., 49 real videos and 49 fake videos). The models of DefakeHop and DefakeHop++ can be sufficiently trained with such a small dataset. For FF++, Multi-attentional achieves the best AUC score (i.e., 99.8%) while Xception-raw and Xception-c23 achieve the second best (i.e., 99.7%). DefakeHop++ with its AUC computed at the video level has the next highest AUC score (i.e., 99.3%). The performance gap among them is small. Furthermore, the performance of Xception-raw and Xception-c23 is boosted by the larger dataset size of FF++, which has 1000 real videos of different quality levels.
Second Generation Datasets with Cross-Domain Training. The performance of several Deepfake detectors, which are trained on the FF++ dataset, on the two second-generation datasets is compared in Table 4.2. We see that video-level DefakeHop++ gives the best AUC score while frame-level DefakeHop++ gives the second-best AUC score for Celeb-DF v1. As for Celeb-DF v2, Multi-attentional yields the best AUC score while Xception-c40 and Xception-c23 offer the next-best scores. DefakeHop++ is slightly inferior to them. Furthermore, we show the performance of DefakeHop and DefakeHop++ under same-domain training in the last four rows of Table 4.2. Their performance improves significantly. Video-level DefakeHop++ outperforms video-level DefakeHop by 2.5% on Celeb-DF v1 and 6.1% on Celeb-DF v2.
Third Generation Dataset. The size of the third-generation dataset, DFDC, is huge. It demands a lot of computational resources (including training hardware, time and large models) to achieve high performance. Since our main interest is in lightweight detection algorithms, we focus on the comparison of DefakeHop++ and MobileNet v3, which has 1.5M parameters and targets mobile applications. We train both models with parts and/or all of the DFDC training data and report the detection performance on the test dataset in Fig. 4.6. We have the following three observations from the figure. First, pre-trained MobileNet v3 gives the best result, DefakeHop++ the second, and MobileNet v3 without pre-training the worst. It shows that, if there are sufficient training data, the detection performance of a larger model can be boosted. For the same reason, the detection performance of all three models decreases as the DFDC training data becomes smaller. Second, with 1-8% of the DFDC training data, the performance of DefakeHop++ and pre-trained MobileNet v3 is actually very close. The performance gap between DefakeHop++ and MobileNet v3 without pre-training is significant in all training data ranges. For example, with only 1% of the DFDC training data, the AUC score of DefakeHop++ reaches 68% while that of MobileNet v3 can only reach 54%. Third, with 100% of the DFDC training data but without any data augmentation, DefakeHop++ still achieves an AUC score of 86%, which is 5% lower than pre-trained MobileNet v3.
Figure 4.6: Detection performance comparison of DefakeHop++, MobileNet v3 with pre-training on ImageNet and MobileNet v3 without pre-training as a function of training data percentages of the DFDC dataset.
Furthermore, we compare the training time of the three models as a function of the training data percentage in Fig. 4.7. The models are trained on an AMD Ryzen 9 5950X CPU with an Nvidia 3090 GPU (24 GB). If a CNN is not pre-trained, it generally needs more time to converge. The training time of DefakeHop++ is the lowest for 64% and 100% of the total training data. Its training time is about the same as that of pre-trained MobileNet v3 for the cases of 1-32% training data.
4.4.3 Model Size of DefakeHop++
DefakeHop++ consists of the one-stage PixelHop, spatial PCA, DFT and classifier modules. The size of each component can be computed as follows.
PixelHop. The parameters are the filter weights. Each filter has a size of (3 × 3) × 3 = 27. Since there are 27 filters, the total number of parameters of PixelHop is 27 × 27 = 729.
Spatial PCA. For each channel, we flatten the 2D spatial responses to a 1D vector and train a PCA. We conduct spatial PCA on channels of higher energy (i.e., those with cumulative energy up to 80% of the total energy). Furthermore, we set an upper limit of 10 channels to avoid a large number of filters for a particular channel. Based on this design guideline, the average channel numbers for a landmark and a spatial region are 35 and 40, respectively.
Figure 4.7: Training time comparison of DefakeHop++, MobileNet v3 with pre-training on ImageNet and MobileNet v3 without pre-training as a function of training data percentages of the DFDC dataset, where the training time is measured in seconds. The training time does not include that of the pre-processing step.
DFT. DFT is used to select a subset of discriminant features. Its parameters are the indices of the selected channels. For the 8 landmark blocks, we keep the top 35% of the features. For the 3 spatial regions, we keep the top 15% of the features. Then, the number of parameters of DFT is (8 × 340) + (3 × 911) = 5,453 as shown in Table 7.2.
Classifier. We use LightGBM [57] as the classifier and set the maximum number of leaves to 64. As a result, the maximum intermediate node number of a tree is bounded by 63. We store two parameters (i.e., the selected dimension and its threshold) at each intermediate node and one parameter (i.e., the predicted soft decision score) at each leaf node. The number of parameters for one tree is bounded by 190. Furthermore, we set the maximum number of trees to 1000. Thus, the number of parameters for LightGBM is bounded by 190K.
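The two counts above can be checked with a few lines of arithmetic (upper bounds; numbers follow the text and Table 4.3):

dft_params = 8 * 340 + 3 * 911          # indices kept by DFT: 5,453
tree_params = 63 * 2 + 64               # 63 intermediate nodes (2 each) + 64 leaves: 190
lightgbm_bound = 1000 * tree_params     # at most 1000 trees: 190,000
print(dft_params, tree_params, lightgbm_bound)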
4.5 Conclusion and Future Work
A lightweight Deepfake detection method, called DefakeHop++, was proposed in this work. It is an enhanced version of our previous solution called DefakeHop. Its model size is significantly smaller than that of state-of-the-art DNN-based solutions, including MobileNet v3, while keeping reasonably high detection performance. It is most suitable for Deepfake detection on mobile/edge devices.
Fake image/video detection is an important topic. The faked content is not restricted to talking-head videos. There are many other application scenarios. Examples include faked satellite images, image splicing, image forgery in publications, etc. Heavyweight fake image detection solutions are not practical. Furthermore, fake images can appear in many forms. On one hand, it is unrealistic to include all possible perturbations in the training dataset under the setting of heavy supervision. On the other hand, the performance could be quite poor with little supervision. It is essential to find a middle ground and look for a lightweight, weakly-supervised solution with reasonable performance. This chapter shows our research effort along this direction. We will continue to explore and generalize the methodology to other challenging Deepfake problems.
Table 4.1: Comparison of the detection performance of several methods on the first-generation datasets with AUC as the performance metric. The AUC results of DefakeHop++ at both the frame level and the video level are given. The best and the second-best results are shown in boldface and with an underbar, respectively. The AUC results of benchmarking methods are taken from [69] and the numbers of parameters are from https://keras.io/api/applications. Also, we use a to denote deep learning methods and b to denote non-deep-learning methods.
1st Generation
Method Model UADFV FF++ #param
Two-stream [136] InceptionV3a[105] 85.1% 70.1% 23.9M
Meso4 [1] Designed CNNa 84.3% 84.7% 28.0K
MesoInception4 [1] Designed CNNa 82.1% 83.0% 28.6K
HeadPose [117] SVMb 89.0% 47.3% -
FWA [68] ResNet-50a[40] 97.4% 80.1% 25.6M
VA-MLP [80] Designed CNNa 70.2% 66.4% -
VA-LogReg [80] Logistic Regressionb 54.0% 78.0% -
Xception-raw [92] XceptionNeta[21] 80.4% 99.7% 22.9M
Xception-c23 [92] XceptionNeta[21] 91.2% 99.7% 22.9M
Xception-c40 [92] XceptionNeta[21] 83.6% 95.5% 22.9M
Multi-task [85] Designed CNNa 65.8% 76.3% -
Capsule [86] CapsuleNeta[99] 61.3% 96.6% 3.9M
DSP-FWA [67] SPPNeta[41] 97.7% 93.0% -
Multi-attentional [133] EfficientNet-B4a [106] - 99.8% 19.5M
DefakeHop [14] DefakeHopb 100% 96.0% 42.8K
Ours (Frame Level) DefakeHop++b 100% 98.4% 238K
Ours (Video Level) DefakeHop++b 100% 99.3% 238K
Table 4.2: Comparison of the detection performance of several Deepfake detectors on the second-generation datasets under cross-domain training and with AUC as the performance metric. The AUC results of DefakeHop and DefakeHop++ at both the frame level and the video level are given. The best and the second-best results are shown in boldface and with an underbar, respectively. Furthermore, we include results of DefakeHop and DefakeHop++ under same-domain training in the last 4 rows. The AUC results of benchmarking methods are taken from [69] and the numbers of parameters are from https://keras.io/api/applications. Also, we use a to denote deep learning methods and b to denote non-deep-learning methods.
2nd Generation
Method Model Celeb-DF v1 Celeb-DF v2 #param
Two-stream [136] InceptionV3a 55.7% 53.8% 23.9M
Meso4 [1] Designed CNNa 53.6% 54.8% 28.0K
MesoInception4 [1] Designed CNNa 49.6% 53.6% 28.6K
HeadPose [117] SVMb 54.8% 54.6% -
FWA [68] ResNet-50a 53.8% 56.9% 25.6M
VA-MLP [80] Designed CNNa 48.8% 55.0% -
VA-LogReg [80] Logistic Regressionb 46.9% 55.1% -
Xception-raw [92] XceptionNeta 38.7% 48.2% 22.9M
Xception-c23 [92] XceptionNeta - 65.3% 22.9M
Xception-c40 [92] XceptionNeta - 65.5% 22.9M
Multi-task [85] Designed CNNa 36.5% 54.3% -
Capsule [86] CapsuleNeta - 57.5% 3.9M
DSP-FWA [67] SPPNeta - 64.6% -
Multi-attentional [133] EfficientNet-B4a - 67.4% 19.5M
Ours (Frame Level) DefakeHop++b 56.30% 60.5% 238K
Ours (Video Level) DefakeHop++b 58.15% 62.4% 238K
Ours (Trained on Celeb-DF, Frame Level) DefakeHopb 93.1% 87.7% 42.8K
Ours (Trained on Celeb-DF, Video Level) DefakeHopb 95.0% 90.6% 42.8K
Ours (Trained on Celeb-DF, Frame Level) DefakeHop++b 95.4% 94.3% 238K
Ours (Trained on Celeb-DF, Video Level) DefakeHop++b 97.5% 96.7% 238K
Table 4.3: The number of parameters for various parts.
Subsystem Count Parameters (each) Total
PixelHop Landmarks 8 (3×3×3)×27=729 5,832
PixelHop Regions 3 (3×3×3)×27=729 2,187
Spatial PCA Landmarks 8 6×6×35=1,260 10,080
Spatial PCA Regions 3 15×15×40=9,000 27,000
DFT Landmarks 8 6×6×27×0.35≈340 2,720
DFT Regions 3 15×15×27×0.15≈911 2,733
LightGBM - 1 190,000 190,000
Total 237,832
Chapter 5
Geo-DefakeHop: High-Performance Geographic Fake Image Detection
5.1 Introduction
Arti"cial intelligence (AI) and deep learning (DL) techniques have made signi"cant advances in recent
years by leveraging more powerful computing resources and larger collected and labeled datasets. Geospatial science [47] and remote sensing [78] bene"t from this development, involving increased application
of AI to process data arising from cartography and geographic information science (GIS) more e#ectively.
Despite countless advantages brought by AI, misinformation over the Internet, ranging from fake news
[91] to fake images and videos [27, 51, 69, 93, 138, 151], poses a serious threat to our society.
Satellite images are utilized in various applications such as weather prediction [134], agriculture crops
prediction [64], !ood and "re control [66]. If one cannot determine whether a satellite image is real or
fake, it would be risky to use it for decision making. Fake satellite images may have impacts on national
security. For example, adversaries can create fake satellite images to hide military infrastructure and/or
create fake ones to deceive others. Though government analysts could verify the authenticity of geospatial
imagery leveraging other satellites or data sources, this would be prohibitively time intensive. It would be
extremely di$cult for the public to verify the authenticity of satellite images.
It has become easier to generate realistic-looking images due to the rapid growth of generative adversarial networks (GANs). There are two ways to generate fake satellite images. One is to first produce the base map of an input satellite image with one GAN; then, a fake satellite image can be generated by another GAN from the base map [20, 46, 145]. CycleGAN belongs to this family. The other way is to generate fake satellite images directly without a base map [10, 53, 54, 55, 89]. StyleGAN [54, 55] and Lightweight GAN [71] belong to this family. Since generated satellite images are difficult to discern by human eyes, there is an urgent need to develop an automatic detection system that can find fake satellite images accurately and efficiently.
Little research has been done on fake satellite image detection due to the lack of a proper fake satellite image dataset. The first fake satellite image dataset was recently released by Zhao et al. [132]. To determine whether a satellite image is real or fake, this work extracted handcrafted features (such as spatial, histogram, and frequency features) and adopted the support vector machine (SVM) classifier. It achieves an F1-score of 87% in detection performance. We are not aware of any existing DL solution to this dataset. Yet, there are DL-based fake image detection methods for other image types. They will be reviewed in Sec. 5.2.
A robust fake satellite image detection method, called Geo-DefakeHop, is proposed here. It is based on one observation and one assumption. The observation is that the human visual system (HVS) [38] has its limitations. That is, it behaves like a low-pass filter and, as a result, it has poor discriminant power for high-frequency responses. The assumption is that GANs can generate realistic images by reproducing the low-frequency responses of synthesized images well [34, 113]. Apparently, it is more challenging to synthesize both low- and high-frequency components well due to limited model complexity. If this assumption holds, we can focus on differences between higher-frequency components in differentiating true and fake images.
This high-level idea can be implemented by a set of filters operating at all pixel locations in parallel, known as a filter bank in signal processing. Each filter offers responses of a particular frequency channel in the spatial domain, and these responses can be used to check the discriminant power of a channel from the training data. To make the detection model more robust, we adopt multiple filter banks, find discriminant channels from each, and ensemble their responses to get the final binary decision. Since multiple filter banks are used simultaneously, the scheme is named parallel subspace learning (PSL). The proposed Geo-DefakeHop offers a lightweight, high-performance and robust solution to fake satellite image detection. Its model size ranges from 0.8K to 62K parameters. It achieves an F1-score higher than 95% under various common image manipulations such as resizing, compression and noise corruption.
5.2 Related Work
5.2.1 Fake Images Generation
GANs provide powerful machine learning models for image-to-image translation. A GAN consists of two neural networks in the training process: a generator and a discriminator. The generator attempts to generate fake images to fool the discriminator while the discriminator tries to distinguish generated fake images from real ones. They are jointly trained via end-to-end optimization with an adversarial loss. In the inference stage, only the generator is needed. Many GANs have been proposed. One example is the Cycle-consistent GAN (CycleGAN) [145]. It has been applied to fake satellite image generation [132]. In this work, we use StyleGAN2 [55], one of the most popular GANs on the Internet, and Lightweight GAN [71], a recent GAN that can be trained efficiently in one day, to generate more fake satellite images (see Sec. 6.3.1).
5.2.2 Fake Images Detection
Most fake image detection methods adopt convolutional neural networks (CNNs). For example, [113] used real and fake images generated by ProGAN [53] as the input of a ResNet-50 pre-trained on ImageNet. [131] generated fake images with their designed GAN, called AutoGAN, and claimed that a CNN trained on their simulated images could learn the artifacts of fake images. [84] borrowed an idea from image steganalysis and used the co-occurrence matrix as input to a customized CNN so that it can learn the differences between real and fake images. Following this idea, [6] added the cross-band co-occurrence matrix to the input so as to increase the stability of the model. [36] utilized the EM algorithm and the KNN classifier to learn the convolution traces of artifacts generated by GANs. Little research has been done to date on fake satellite image detection due to the lack of available datasets. [132] proposed the first fake satellite image dataset with simulated satellite images from three cities (i.e., Tacoma, Seattle and Beijing). Furthermore, it used 26 hand-crafted features to train an SVM classifier for fake satellite image detection. The features can be categorized into three types: spatial, histogram and frequency. Features of different classes are concatenated for performance evaluation. In Sec. 6.3, we will benchmark our proposed Geo-DefakeHop method against the method in [132] and CNN models with images and spectra as the input.
5.2.3 PixelHop and Saab transform
Figure 5.1: An overview of the Geo-DefakeHop method, where the input is an image tile and the output is a binary decision on whether the input is an authentic or a fake one. First, each input tile is partitioned into non-overlapping blocks of dimension 16 × 16 × 3. Second, each block goes through one PixelHop or multiple PixelHops, each of which yields 3D tensor responses of dimension H × W × C. Third, for each PixelHop, an XGBoost classifier is applied to spatial samples of each channel to generate channel-wise (c/w) soft decision scores, and a set of discriminant channels is selected accordingly. Last, all block decision scores are ensembled to generate the final decision of the image tile.
The PixelHop concept was introduced by Chen et al. in [18]. Each PixelHop takes local patches of the same size as its input. Suppose that local patches are of dimension L = s1 × s2 × c, where s1 × s2 is the spatial dimension and c is the spectral dimension. A PixelHop defines a mapping from pixel values in a patch to a set of spectral coefficients, which is called the Saab transform [61]. The Saab transform is a variant of principal component analysis (PCA). For standard PCA, we subtract the ensemble mean and then conduct eigen-analysis on the covariance matrix of the input vectors. The ensemble mean is difficult to estimate if the sample size is small. The Saab transform decomposes the n-dimensional signal space into a one-dimensional DC (direct current) subspace and an (n − 1)-dimensional AC (alternating current) subspace. Signals in the AC subspace have an ensemble mean close to zero. Then, we can apply PCA to the AC signal and decompose it into (n − 1) channels. Saab coefficients are unsupervised data-driven features since Saab filters are derived from the local correlation structure of pixels.
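To make the DC/AC decomposition concrete, the sketch below derives Saab filters from a set of flattened training patches with NumPy and scikit-learn. It is only a minimal illustration of the idea described above, not the released implementation; the bias term and the patch-collection step used in practice are omitted.

import numpy as np
from sklearn.decomposition import PCA

def fit_saab_filters(patches):
    """Derive Saab filters from flattened patches of shape (N, L).

    Row 0 of the returned (L, L) matrix is the DC filter; the remaining
    L - 1 rows are AC filters obtained by PCA on the AC components.
    """
    _, length = patches.shape
    # DC filter: unit-norm constant vector whose response is the (scaled) patch mean.
    dc = np.ones((1, length)) / np.sqrt(length)
    # AC part: remove each patch's mean so the residual has near-zero ensemble mean.
    ac_part = patches - patches.mean(axis=1, keepdims=True)
    # PCA on the AC subspace gives the L - 1 AC filters, ordered by energy.
    pca = PCA(n_components=length - 1).fit(ac_part)
    return np.vstack([dc, pca.components_])

def saab_transform(patches, filters):
    """Map each flattened patch to its L spectral (channel) coefficients."""
    return patches @ filters.T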
5.2.4 Differences between DefakeHop and Geo-DefakeHop
The Saab transform can be implemented conveniently with filter banks. It has been successfully applied to many application domains; examples include [14, 74, 125, 129]. Among them, DefakeHop [14] is the closest to this work. There are substantial differences between DefakeHop and Geo-DefakeHop. DefakeHop was initially proposed to detect deepfake face videos. It extracted features from the human eye, nose and mouth regions. It focused on low-frequency channels and discarded high-frequency channels. In contrast, we show that high-frequency channels are more discriminant than low-frequency channels for fake satellite image detection. Furthermore, DefakeHop was designed using successive subspace learning (SSL) while Geo-DefakeHop is developed with parallel subspace learning (PSL). SSL and PSL are quite different. We tailor DefakeHop to the context of satellite images and show in Sec. 5.4 that Geo-DefakeHop outperforms DefakeHop by a significant margin due to its better design.
5.3 Geo-DefakeHop Method
Our idea is motivated by the observation that GANs fail to generate high-frequency components, such as edges and complex textures, well. It is pointed out by [34] that GANs have inconsistencies between the spectra of real and fake images in high-frequency bands. Another piece of evidence is that images generated by simple GANs are blurred and unclear. Blurry artifacts are reduced and more details are added by advanced GANs to yield higher quality fake images. Although these high quality simulated images look real to human eyes because of the limitation of the HVS, it does not mean that the high-frequency fidelity loss is not detectable by machines. Another shortcoming of generated images is periodic patterns introduced by convolution and deconvolution operations in GAN models, as reported in [36]. GANs often use convolution and deconvolution filters of a certain size (e.g., 3 × 3 or 5 × 5). They leave traces on simulated images in the form of periodic patterns in some particular frequency bands. Sometimes, when GAN models do not perform well, these patterns can be observed by human eyes.
Motivated by the above two observations, we propose a new method for fake satellite image detection as shown in Fig. 5.1. It consists of four modules:
1. Preprocessing: Input image tiles are cropped into non-overlapping blocks of a fixed size.
2. Joint spatial/spectral feature extraction via PixelHop: The PixelHop has a local patch as its input and applies a set of Saab filters to the pixels of the patch to yield a set of joint spatial/spectral responses as features for each block.
3. Channel-wise classification, discriminant channel selection and block-level decision ensemble: We apply an XGBoost classifier to the spatial responses of each channel to yield a soft decision and select discriminant channels accordingly. Then, the soft decisions from discriminant channels of a single PixelHop or multiple PixelHops are ensembled to yield the block-level soft decision.
4. Image-level decision ensemble: Block-level soft decisions are ensembled to yield the image-level decision.
They are elaborated below.
(a) Without perturbation, (b) resizing to 64 × 64, (c) adding Gaussian noise with σ = 0.1, (d) JPEG compression with Q = 75.
Figure 5.2: The channel-wise performance under four settings: a) without perturbation, b) resizing, c) adding Gaussian noise, and d) JPEG compression. Channel 0 is the DC (direct current) channel and channels 1 to 26 correspond to AC1 to AC26 (alternating current). The blue line is the energy percentage of each channel, and the red, magenta and green lines are the F1-scores of the training, validation and test sets. We observe that high-frequency channels have higher performance without perturbation, as shown in Fig. 5.2a. After resizing, adding Gaussian noise or compression, the performance of high-frequency channels degrades, as shown in Figs. 5.2b, 5.2c and 5.2d. The test and validation scores are closely related, indicating that the validation score can be used to select the discriminant channels.
5.3.1 Preprocessing
A color satellite image tile of spatial size 256 × 256 covers an area of one square kilometer, as shown on the left of Figure 5.1. It is cropped into 256 non-overlapping blocks of dimension 16 × 16 × 3, where the last number 3 denotes the R, G, B color channels. Each block has homogeneous content such as trees, buildings, land or ocean.
5.3.2 Joint Spatial/Spectral Feature Extraction via PixelHop
As described in Sec. 5.2.3, a PixelHop has a local patch of dimension L = s1 × s2 × c as its input, where s1 and s2 are the spatial dimensions and c is the spectral dimension. For square patches, we have s1 = s2 = s. We set s to 2, 3, 4 in the experiments. Since the input has R, G, B three channels, c = 3.
The PixelHop applies L Saab filters to the pixels in the local patch, including one DC filter and (L − 1) AC filters, to generate L responses per patch. The AC filters are obtained via eigen-analysis of the AC components. The mapping from L pixel values to L filter responses defines the Saab transform. Since the AC filters are derived from the statistics of the input, the Saab transform is a data-driven transform.
We adopt overlapping patches with stride equal to one. Then, for a block of spatial size 16 × 16, we obtain W × H patches, where W = 17 − s1 and H = 17 − s2. As a result, the block output is a set of joint spatial/spectral responses of dimension W × H × L. To give an example, if the local patch size is 3 × 3 × 3 = 27, the block output is a 3D tensor of dimension 14 × 14 × 27. These responses are used as features to be fed to the classifier in the next stage.
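As an illustration of this step, the sketch below extracts overlapping 3 × 3 × 3 patches with stride one from a 16 × 16 × 3 block and maps them to a 14 × 14 × 27 response tensor. It reuses the fit_saab_filters/saab_transform idea sketched in Sec. 5.2.3 and is only a simplified illustration of the described procedure.

import numpy as np

def block_to_patches(block, s=3):
    """Slide an s x s window with stride 1 over a 16 x 16 x 3 block and
    flatten each patch into a vector of length s * s * 3."""
    h, w, _ = block.shape
    patches = [
        block[i:i + s, j:j + s, :].reshape(-1)
        for i in range(h - s + 1)
        for j in range(w - s + 1)
    ]
    return np.stack(patches)              # shape: ((17 - s)^2, s * s * 3)

def pixelhop_responses(block, filters, s=3):
    """Apply the Saab filters to every patch of a block, returning a
    (17 - s, 17 - s, s * s * 3) tensor, e.g. 14 x 14 x 27 for s = 3."""
    patches = block_to_patches(block, s)
    responses = patches @ filters.T       # one coefficient per Saab channel
    side = block.shape[0] - s + 1
    return responses.reshape(side, side, -1)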
5.3.3 Channel-wise Classification, Discriminant Channel Selection and Block-level Decision Ensemble
For each channel in a block, we have one response from each local patch so that there are W × H responses in total. These responses form a feature vector, and samples from the blocks of training real/fake images are used to train a classifier, leading to channel-wise classification. The classifier can be any one used in machine learning, such as Random Forest, SVM or XGBoost. In our experiments, the XGBoost classifier [16] is chosen for its high performance. XGBoost is a gradient-boosted decision tree algorithm that can learn a nonlinear data distribution efficiently.
To evaluate the discriminant power of a channel, we divide the training data into two disjoint groups: 1) data used to train the classifier, and 2) data used to validate the channel performance. The latter provides a soft decision score predicted by the channel-wise classifier. The channel-wise performance evaluation reflects the generation power of a GAN in various frequency bands. Some channels are more discriminant than others because of the poor generation power of the GAN in the corresponding frequency band. This finding matches other general deepfake detection methods, which report that fake images show discrepancies from real images in the frequency domain [34]. The selection of discriminant channels is based on the performance on the validation data.
We use an example to explain discriminant channel selection. Consider a PixelHop of dimension 3 × 3 × 3, which has 27 channels in total. The x-axis of Fig. 5.2 is the channel index and the y-axis is the energy percentage or the performance measured by the F1 score. A larger channel index means a higher frequency component. In these plots, the blue lines indicate the energy percentage of each channel while the red, magenta and green lines represent the F1 scores of the training, validation and test data. We consider the following four settings.
1. Raw images
A higher frequency channel usually has a higher performance score, as shown in Fig. 5.2a. Low-frequency channels are not as discriminant as high-frequency channels. This validates our assumption that GANs fail to generate high-frequency components with high fidelity.
2. Image resizing
The input image is resized from 256 × 256 to 64 × 64. As compared with the raw-image setting, the discriminant power of high-frequency channels degrades a little, as shown in Fig. 5.2b. This is attributed to the fact that the down-sampling operation uses a low-pass filter to alleviate aliasing. Despite the performance drop of each channel, the overall detection performance can be preserved by selecting more channels.
3. Additive Gaussian noise
Noisy satellite images are obtained by adding white Gaussian noise with σ = 0.1, where the dynamic range of the input pixel is [0, 1]. Thus, the relative noise level is high. We see from Fig. 5.2c that low-frequency channels perform better than high-frequency channels. This is because we need to take the signal-to-noise ratio (SNR) into account. Low-frequency channels have higher SNR values than high-frequency ones. As a result, low-frequency channels have higher discriminant power.
4. JPEG compression
The experimental results with JPEG compression of quality factor 75 are shown in Fig. 5.2d. We see from the figure that the performance of different channels fluctuates. Generally, the performance of low-frequency channels is better than that of high-frequency channels, since the responses of high-frequency channels degrade due to higher quantization errors in JPEG compression. However, we can still find discriminant channels based on the performance on the validation data.
Generally, if only one PixelHop is used, we select the several most discriminant channels for the ensemble. If multiple PixelHops are used simultaneously, we select the most discriminant channels from all PixelHops for the ensemble. The number of channels is also fine-tuned on the validation dataset to find the optimal number of channels. All selections are based on the F1 score on the validation dataset.
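A minimal sketch of channel-wise classification and validation-based channel selection is given below. It assumes per-channel feature matrices of shape (num_blocks, W*H) and uses XGBoost and scikit-learn; the helper name and the number of selected channels (top_k) are illustrative, while the classifier settings match the depth-1, 100-tree configuration described in Sec. 5.4.6.

import numpy as np
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def rank_channels(train_feats, train_y, val_feats, val_y, top_k=5):
    """train_feats / val_feats: lists indexed by channel, each entry of
    shape (num_blocks, W * H). Trains one shallow XGBoost per channel,
    ranks channels by validation F1 score, and returns the indices and
    classifiers of the top_k discriminant channels."""
    records = []
    for ch, (xtr, xva) in enumerate(zip(train_feats, val_feats)):
        clf = XGBClassifier(max_depth=1, n_estimators=100)
        clf.fit(xtr, train_y)
        val_prob = clf.predict_proba(xva)[:, 1]
        records.append((f1_score(val_y, val_prob > 0.5), ch, clf))
    records.sort(key=lambda r: r[0], reverse=True)
    return ([ch for _, ch, _ in records[:top_k]],
            [clf for _, _, clf in records[:top_k]])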
5.3.4 Image-level Decision Ensemble
In the last stage, we ensemble the predicted scores of all blocks in one image tile. Let Nch denote the total number of selected channels. Since each channel has one predicted score from the previous step, each block has a feature vector of dimension Nch. For each image, we concatenate the feature vectors of all blocks to form one feature vector of the image. Since there are 256 blocks in one tile, the dimension of the image-level feature vector is 256 × Nch. An XGBoost classifier is trained to determine the final prediction of each tile. Nch is a hyperparameter that is decided by the performance on the validation dataset.
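The image-level ensemble can be sketched as follows; each tile contributes a 256 × Nch score matrix that is flattened into one feature vector before the final XGBoost classifier is trained. The function names are illustrative only, and the classifier settings again follow the depth-1 configuration described in Sec. 5.4.6.

import numpy as np
from xgboost import XGBClassifier

def tile_features(block_scores):
    """block_scores: (num_tiles, 256, n_ch) channel-wise soft decisions.
    Flatten the 256 block scores of each tile into one feature vector."""
    return block_scores.reshape(block_scores.shape[0], -1)   # (num_tiles, 256 * n_ch)

def train_image_level_classifier(block_scores, labels, n_ch):
    """Train the ensemble XGBoost on the concatenated block-level scores."""
    clf = XGBClassifier(max_depth=1, n_estimators=100 * n_ch)
    clf.fit(tile_features(block_scores), labels)
    return clf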
5.3.5 Visualization of Detection Results
An attacker may stitch real and fake image blocks to form an image tile so as to confuse the ensemble classifier. This can be handled by a visualization tool that shows pixel-wise prediction scores. That is, we use a heat map to display the XGBoost prediction score for each pixel, which is the center of a block. Since we would like to have pixel-wise predictions, these blocks are overlapping ones with stride equal to one. Furthermore, we can plot a heat map for each channel using the prediction score of that channel. Several examples are given in Table 5.1. The table has four columns. The first column shows real satellite images. The second column shows edited images, parts of which are replaced by fake images. The third column gives the ground truth labels, where dark blue and yellow indicate the real and fake regions, respectively. Finally, the fourth column shows the prediction results. Cold and warm colors mean a higher probability of being real and fake, respectively. Our model can highlight the position of the fake area even for a small fake region, as shown in the last row of Table 5.1.
To gain more insight, we show the channel-wise Saab features and channel-wise heat maps for the DC, AC1, AC11 and AC26 frequencies of a PixelHop of dimension 3 × 3 × 3 in Table 5.2, where DC and AC1 are low-frequency channels, AC11 is a mid-frequency channel and AC26 is a high-frequency channel. By comparing the four heat maps, we see that AC26 has the strongest discriminant power, AC11 the second, while DC and AC1 have the least discriminant capability.
5.4 Experiments
5.4.1 Datasets
The UW Fake Satellite Image dataset released by the University of Washington [132] is the first publicly available dataset targeting authentic and fake satellite image detection. Its authentic satellite images are collected from Google Earth's satellite images, while its fake satellite images are generated by CycleGAN. The base maps used to generate the fake satellite images are from CartoDB [11]. There are 4032 authentic color satellite images of size 256 × 256 and their fake counterparts in the dataset. The images are captured at a zoom level of 16, which is equivalent to a scale of 1:8000. The dataset covers three cities: Tacoma, Seattle and Beijing.
To demonstrate the generalizability of our detection method, we use StyleGAN2 [56], which is the most popular GAN on the Internet, and Lightweight GAN [71], which is the newest GAN that can be trained efficiently in one day, to generate more fake satellite images. To reduce the training complexity, we crop images into non-overlapping sub-images of size 128 × 128. For each city, we train two GAN models. Thus, there are six GAN models in total: StyleGAN2-Beijing, StyleGAN2-Seattle, StyleGAN2-Tacoma, Lightweight GAN-Beijing, Lightweight GAN-Seattle and Lightweight GAN-Tacoma. The number of fake satellite images generated by each GAN model is the same as that of authentic images. The numbers of real and fake satellite images are summarized in Table 5.5. Each GAN model is trained on an Nvidia V100 GPU for 150,000 steps with random translation and cutout as the augmentation. Each StyleGAN2 model takes 48 hours to train while each Lightweight GAN demands 24 training hours. We call them the USC Fake Satellite Image datasets. The two datasets as well as the trained GAN models are released on GitHub (https://github.com/hongshuochen/Geo-DefakeHop).
The FID score [44] is a commonly used metric to evaluate the fidelity and variability of generated GAN images. Lower FID scores indicate better generated GAN images of higher fidelity and variability. We report the FID scores of UW/CycleGAN, USC/StyleGAN2 and USC/Lightweight GAN in Table 5.6. It is apparent that StyleGAN2 and Lightweight GAN have lower FID scores than CycleGAN. This implies that StyleGAN2 and Lightweight GAN are more difficult to detect than CycleGAN, which is consistent with the experimental results given in Sec. 5.4.5. Note that CycleGAN does not have an FID score for Tacoma, since the dataset only contains fake images from Beijing and Seattle.
5.4.2 Experiment Settings
We compare the performance of Geo-DefakeHop with two previous methods in this section.
• Method by Zhao et al. [132]: This method extracted hand-crafted features (e.g., spatial, histogram and frequency features) from satellite images and trained an SVM classifier with these hand-crafted features to classify real and fake images.
• DefakeHop [14]: This method was first proposed to detect deepfake face videos. We tailored it to the fake satellite image detection problem by removing the frame and region ensemble modules.
It is worthwhile to point out that DefakeHop is built upon successive subspace learning (SSL) while Geo-DefakeHop is based on parallel subspace learning (PSL). SSL and PSL are two different designs.
For Geo-DefakeHop, we consider four PixelHop designs:
• PixelHop A: Selected discriminant channels from 12 filters of dimension 2 × 2 × 3,
• PixelHop B: Selected discriminant channels from 27 filters of dimension 3 × 3 × 3,
• PixelHop C: Selected discriminant channels from 48 filters of dimension 4 × 4 × 3,
• PixelHop A&B&C: Selected discriminant channels from PixelHops A, B and C.
We compare the detection performance under six settings:
• Raw images obtained from the UW dataset;
• Image tiles resized from 256 × 256 to 128 × 128 and to 64 × 64;
• Image tiles corrupted by additive white Gaussian noise with standard deviation σ = 0.02, 0.06, 0.1;
• Image tiles coded by the JPEG compression standard;
• Dataset splits of 80-10-10, 40-10-50 and 10-10-80;
• Image tiles generated by CycleGAN, StyleGAN2 and Lightweight GAN.
We follow the same experimental setting as given in [132] and split the dataset as 80-10-10. The model is trained on the training set, fine-tuned on the validation set and evaluated on the test set. Training and test images go through the same image manipulation conditions. As to the performance metrics, we use the F1 score, precision and recall.
5.4.3 Detection Performance Comparison
We compare the performance of three detection methods under various conditions in this subsection.
Raw images. We conduct both training and testing on raw images from the UW dataset [132] and show the performance of the three methods in Table 5.3. As shown in the table, PixelHop B and PixelHop A&B&C of Geo-DefakeHop achieve perfect detection performance with 100% F1 score, 100% precision and 100% recall, while PixelHop A and PixelHop C achieve nearly perfect performance. Both Geo-DefakeHop and DefakeHop outperform Zhao et al.'s method in all performance metrics by significant margins. There is also a clear performance gap between Geo-DefakeHop and DefakeHop.
Image resizing. The results are shown in Table 5.4. For images resized to 128 × 128, both PixelHop A and PixelHop A&B&C achieve perfect performance with 100% F1 score, while PixelHop B and PixelHop C achieve nearly perfect performance. For images resized to 64 × 64, we see the power of ensembles. That is, the F1 score, precision and recall of PixelHop A&B&C are all above 99%, which is slightly better than an individual PixelHop. Again, all four Geo-DefakeHop settings outperform Zhao et al.'s method by significant margins. DefakeHop is slightly better than Zhao et al.'s method but significantly worse than Geo-DefakeHop.
Additive white Gaussian noise. We test the detection performance with three noise levels σ = 0.02, 0.06, 0.1 and show the results in Table 5.7. We see from the table that, if authentic or fake satellite images are corrupted by white Gaussian noise with σ = 0.02, 0.06 and 0.1, the F1 scores of Geo-DefakeHop decrease from 100% to 99.01%, 96.59% and 96.10%, respectively. In contrast, the F1 scores of DefakeHop are slightly above 90% and those of Zhao et al.'s method are around 80% or lower. Also, the ensemble gain of multiple PixelHops is more obvious as the noise level becomes higher. DefakeHop is affected much less than Geo-DefakeHop because it focuses on the low-frequency components only.
JPEG compression. Typically, the quality factor (QF) is chosen from the range [70, 100]. In this experiment, we encode satellite images by JPEG with QF = 95, 85 and 75 and investigate the robustness of the benchmarking methods against these QF values. The results are shown in Table 5.8. The F1 scores of Geo-DefakeHop are 98.28%, 97.91% and 97.92% for QF = 95, 85 and 75, respectively. It is interesting to note that the hand-crafted features perform better with a lower quality factor. A possible reason is that JPEG compression suppresses some noise, which yields a set of better hand-crafted features.
By comparing the three distortion types, additive white Gaussian noise has the most negative impact on the detection performance, JPEG compression the second, and image resizing the least. This is consistent with our intuition. Image resizing does not change the underlying information of images much, JPEG changes the information slightly because of the fidelity loss of high frequencies, and additive white Gaussian noise perturbs the information of all frequencies.
5.4.4 Weak Supervision Setting
We consider the weak supervision setting by reducing the number of training samples and increasing the number of test samples. That is, we split the CycleGAN dataset based on two settings: 40-10-50 and 10-10-80, where the first, second and third numbers indicate the percentages of training, validation and test data samples. Furthermore, we include two general fake image detectors based on convolutional neural networks in the performance benchmarking. Motivated by [113] and [131], we use a ResNet-18 pre-trained on ImageNet as the network structure. The optimizer is SGD with an initial learning rate of 0.001 and momentum 0.9. The batch size is 32. We update the model for 50 epochs and keep the model with the best validation score. We train two models: one with the original image as input and the other with the 2D FFT spectrum as input. They are denoted by ResNet18 and ResNet18-FFT in Table 5.9. ResNet18, ResNet18-FFT and Geo-DefakeHop all have excellent detection performance under the weak supervision setting against CycleGAN.
5.4.5 Performance Benchmarking with Three GAN Models
Table 5.10 compares the detection performance of the same four detection methods against the CycleGAN, StyleGAN2 and Lightweight GAN models under the weak supervision setting of 10-10-80. As mentioned earlier, fake images generated by StyleGAN2 and Lightweight GAN have lower FID scores, indicating that their generated images are more difficult to detect. We see significant performance degradation of the method of Zhao et al. ResNet18 still maintains high performance. The F1-scores of ResNet18-FFT drop by around 3% to 96% from CycleGAN to StyleGAN2 and Lightweight GAN. Geo-DefakeHop preserves high detection performance with over 99% F1-scores against all three GAN models.
5.4.6 Model Size Computation
For a PixelHop of filter size s1 × s2 × c, there are at most Pmax = s1 × s2 × c filters. For example, the size of PixelHop A is 2 × 2 × 3 and PA,max = 12. Similarly, we have PB,max = 27 and PC,max = 48. However, we choose only a subset of discriminant filters. Their numbers are denoted by PA, PB and PC, respectively. An XGBoost classifier consists of a sequence of binary decision trees, which are specified by two hyper-parameters: the max depth and the number of trees. Each XGBoost tree consists of both leaf nodes and non-leaf nodes. Non-leaf nodes have two parameters (i.e., the dimension and the value) to split the dataset, while leaf nodes have one parameter (i.e., the predicted value). We have two types of XGBoost classifiers: 1) the channel-wise classifier and 2) the ensemble classifier. For the former, the max depth and the number of trees are set to 1 and 100, respectively. Since each tree has one non-leaf node and two leaf nodes, its model size is 4 × 100 = 400 parameters. For the latter, the max depth and the number of trees are set to 1 and 100 × P, where P = PA + PB + PC is the total number of selected discriminant channels of all three PixelHops. The model size of the ensemble classifier is 4 × 100 × P = 400P parameters. As an example, we provide the model size computation detail in Table 5.11 for four Geo-DefakeHop designs with raw satellite images as the input. As shown in the table, PixelHops A, B, C and A&B&C have 812, 827, 848 and 2,487 parameters, respectively. Since the selected discriminant channel numbers of PixelHops A, B, C and A&B&C vary with raw, resized, noisy and compressed input satellite images, their model sizes are different. The model sizes are summarized in Table 5.12.
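The parameter bookkeeping described above can be reproduced with a few lines of code. This is only a sketch of the counting rule (depth-1 trees, 100 trees per channel-wise classifier, four parameters per tree); the helper name is illustrative.

def geo_defakehop_model_size(filter_sizes, selected_channels, trees=100, params_per_tree=4):
    """filter_sizes: Saab filter length of each PixelHop, e.g. [12] for PixelHop A
    or [12, 27, 48] for A&B&C.
    selected_channels: number of selected discriminant channels per PixelHop."""
    p_total = sum(selected_channels)
    filter_params = sum(filter_sizes)                    # Saab filter weights
    cw_params = params_per_tree * trees * p_total        # one c/w XGBoost per selected channel
    ensemble_params = params_per_tree * trees * p_total  # ensemble XGBoost with 100*P trees
    return filter_params + cw_params + ensemble_params

# PixelHop A with one selected channel: 12 + 400 + 400 = 812 parameters (Table 5.11).
assert geo_defakehop_model_size([12], [1]) == 812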
5.5 Conclusion and Future Work
A method called Geo-DefakeHop was proposed to distinguish between authentic and counterfeit satellite images. Its effectiveness in terms of F1 score, precision and recall was demonstrated by extensive experiments. Furthermore, its model size was thoroughly analyzed. It can be easily implemented in software on mobile or edge devices due to its small model size. As to future extensions, two topics are described below. First, the UW Fake Satellite Image dataset only contains three cities: Tacoma, Seattle and Beijing. A large-scale fake satellite image dataset with more cities can be constructed to make the dataset more challenging. More manipulations, such as blurring and contrast adjustment, can be added to test the limitations of the detection system. Second, several frequency-aware GANs such as StyleGAN3 [54] were recently proposed to enhance the high-frequency components in synthesized images. StyleGAN3 utilizes Fourier features as input to define a spatially infinite map. To solve the aliasing problem, which is highly detrimental to GANs, StyleGAN3 uses a smaller cutoff frequency to suppress high-frequency components. Although StyleGAN3 can control high-frequency components in generated images with improved capability, some high-frequency information is still removed in the generation process. It is interesting to see whether Geo-DefakeHop can exploit such small differences for effective fake satellite image detection.
Table 5.1: Visualization of original real images (the first column), partial real/partial fake (PRPF) images (the second column), the ground truth (the third column, where dark blue and yellow denote real and fake regions, respectively) and heat maps (the fourth column, where cold and warm colors indicate a higher probability of being real and fake in the corresponding location, respectively).
Original PRPF Groundtruth Heat map
Table 5.2: Visualization of the absolute values of Saab filter responses and the detection heat maps for the DC, AC1, AC11 and AC26 channels, where DC and AC1 are low-frequency channels, AC11 is a mid-frequency channel, and AC26 is a high-frequency channel. Cold and warm colors in the heat maps indicate a higher probability of being real and fake in the corresponding location, respectively. The ground truth is that the whole image is a fake one.
Index Name Input image Saab features Heat map
0 DC
1 AC1
11 AC11
26 AC26
Table 5.3: Detection performance comparison with raw images from the UW dataset for three benchmarking methods. The boldface and the underbar indicate the best and the second-best results, respectively.
Method Features or Designs F1 score Precision Recall
Zhao, et al. (2021)
Spatial 75.81% 78.15% 73.61%
Histogram 78.99% 72.93% 86.16%
Frequency 65.84% 49.07% 100%
Spatial + Histogram 86.77% 82.78% 91.17%
Spatial + Frequency 77.02% 78.75% 75.36%
Histogram + Frequency 83.90% 78.36% 90.29%
Spatial + Histogram + Frequency 87.08% 82.73% 91.92%
DefakeHop (2021) 96.89% 97.26% 96.53%
Geo-DefakeHop (Ours)
PixelHop A 99.88% 100% 99.75%
PixelHop B 100% 100% 100%
PixelHop C 99.88% 100% 99.75%
PixelHops A&B&C 100% 100% 100%
Table 5.4: Detection performance comparison for images resized from 256 × 256 to 128 × 128 and 64 × 64. The boldface and the underbar indicate the best and the second-best results, respectively.
Tile size Method Features or Designs F1 score Precision Recall
128 x 128
Zhao, et al. (2021)
Spatial 77.35% 76.61% 78.10%
Histogram 80.09% 75.93% 84.72%
Frequency 64.14% 47.21% 100%
Spatial + Histogram 88.28% 85.81% 90.89%
Spatial + Frequency 79.79% 81.38% 78.26%
Histogram + Frequency 81.92% 76.99% 87.53%
Spatial + Histogram + Frequency 88.09% 86.52% 89.71%
DefakeHop (2021) 92.63% 97.78% 88.00%
Geo-DefakeHop (Ours)
PixelHop A 100% 100% 100%
PixelHop B 99.88% 100% 99.75%
PixelHop C 99.75% 99.75% 99.75%
PixelHops A&B&C 100% 100% 100%
64 x 64
Zhao, et al. (2021)
Spatial 76.46% 78.85% 74.21%
Histogram 81.59% 76.60% 87.26%
Frequency 49.75% 79.89% 36.12%
Spatial + Histogram 88.22% 86.15% 90.39%
Spatial + Frequency 77.46% 77.83% 77.09%
Histogram + Frequency 83.16% 77.80% 89.32%
Spatial + Histogram + Frequency 87.91% 83.94% 92.29%
DefakeHop (2021) 86.60% 89.36% 84.00%
Geo-DefakeHop (Ours)
PixelHop A 98.27% 98.27% 98.27%
PixelHop B 97.39% 97.76% 97.03%
PixelHop C 96.36% 97.71% 95.05%
PixelHops A&B&C 99.01% 99.01% 99.01%
Table 5.5: The statistics of three fake satellite image datasets, where C-GAN, S-GAN and L-GAN denote
CycleGAN, StyleGAN2 and Lightweight GAN, respectively.
UW/C-GAN USC/S-GAN USC/L-GAN
No. of Real 8,046 32,184 32,184
No. of Fake 8,046 32,184 32,184
Image sizes 256x256 128x128 128x128
Table 5.6: Comparison of FID scores of three fake satellite image datasets, where C-GAN, S-GAN and L-GAN denote CycleGAN, StyleGAN2 and Lightweight GAN, respectively. Lower FID scores indicate better generated images of higher fidelity and variability.
UW/C-GAN USC/S-GAN USC/L-GAN
Beijing 134.88 49.31 55.72
Seattle 174.78 47.11 41.87
Tacoma - 60.18 28.76
Table 5.7: Detection performance comparison for images corrupted by additive white Gaussian noise with standard deviation σ = 0.02, 0.06, 0.1. The boldface and the underbar indicate the best and the second-best results, respectively.
Noise
Method Features or Designs F1 score Precision Recall
0.02
Zhao, et al. (2021)
Spatial 70.74% 72.58% 68.98%
Spatial + Histogram 83.04% 82.41% 83.67%
Spatial + Frequency 75.63% 78.42% 73.04%
Histogram + Frequency 81.47% 76.62% 86.98%
Spatial + Histogram + Frequency 83.25% 81.47% 85.11%
DefakeHop (2021) 91.84% 93.75% 90.00%
Geo-DefakeHop (Ours)
PixelHop A 97.56% 96.38% 98.77%
PixelHop B 98.90% 98.05% 99.75%
PixelHop C 99.01% 98.53% 99.50%
PixelHop A&B&C 98.65% 97.58% 99.75%
0.06
Zhao, et al. (2021)
Spatial 68.22% 74.77% 62.72%
Spatial + Histogram 80.74% 80.94% 80.54%
Spatial + Frequency 76.39% 78.47% 74.42%
Histogram + Frequency 80.28% 75.49% 85.71%
Spatial + Histogram + Frequency 81.42% 79.40% 83.55%
DefakeHop (2021) 92.78% 95.75% 90.00%
Geo-DefakeHop (Ours)
PixelHop A 95.24% 93.98% 96.53%
PixelHop B 96.59% 95.19% 98.02%
PixelHop C 95.07% 94.70% 97.28%
PixelHop A&B&C 96.59% 95.19% 98.02%
0.1
Zhao, et al. (2021)
Spatial 68.58% 69.76% 67.44%
Spatial + Histogram 81.74% 78.42% 85.35%
Spatial + Frequency 69.05% 70.35% 67.79%
Histogram + Frequency 79.44% 74.67% 84.86%
Spatial + Histogram + Frequency 80.05% 77.78% 82.46%
DefakeHop (2021) 92.63% 97.78% 88.00%
Geo-DefakeHop (Ours)
PixelHop A 94.43% 92.42% 96.53%
PixelHop B 94.88% 93.51% 96.29%
PixelHop C 95.37% 93.99% 96.78%
PixelHop A&B&C 96.10% 94.71% 97.52%
Table 5.8: Detection performance comparison for images coded by the JPEG compression standard of three quality factors (QF), i.e., QF = 95, 85
and 75. The boldface and the underbar indicate the best and the second-best results, respectively.
JPEG quality factor Method Features or Designs F1 score Precision Recall
95
Zhao, et al. (2021)
Spatial 74.88% 73.96% 75.82%
Spatial + Histogram 85.95% 82.49% 89.72%
Spatial + Frequency 78.00% 78.38% 77.62%
Histogram + Frequency 82.43% 74.95% 91.58%
Spatial + Histogram + Frequency 86.96% 85.06% 88.94%
DefakeHop (2021) 98.00% 98.00% 98.00%
Geo-DefakeHop (Ours)
PixelHop A 97.91% 97.31% 98.51%
PixelHop B 97.90% 97.54% 98.27%
PixelHop C 98.28% 97.56% 99.01%
PixelHop A&B&C 98.15% 97.55% 98.76%
85
Zhao, et al. (2021)
Spatial 76.46% 78.64% 74.38%
Spatial + Histogram 85.91% 82.67% 89.42%
Spatial + Frequency 82.53% 81.48% 83.61%
Histogram + Frequency 85.28% 81.66% 89.24%
Spatial + Histogram + Frequency 89.54% 85.82% 93.6%
DefakeHop (2021) 94.85% 97.87% 92.00%
Geo-DefakeHop (Ours)
PixelHop A 97.54% 96.83% 98.27%
PixelHop B 97.91% 97.08% 98.76%
PixelHop C 97.91% 97.08% 98.76%
PixelHop A&B&C 97.54% 97.06% 98.02%
75
Zhao, et al. (2021)
Spatial 73.88% 75.78% 72.03%
Spatial + Histogram 85.61% 81.70% 89.93%
Spatial + Frequency 87.09% 83.94% 90.49%
Histogram + Frequency 88.94% 87.41% 90.52%
Spatial + Histogram + Frequency 90.20% 88.46% 92.00%
DefakeHop (2021) 92.93% 93.88% 92.00%
Geo-DefakeHop (Ours)
PixelHop A 97.92% 96.63% 99.26%
PixelHop B 97.66% 97.07% 98.27%
PixelHop C 97.79% 96.84% 98.76%
PixelHop A&B&C 97.92% 96.63% 99.26%
Table 5.9: Comparison of F1-scores of four detection methods under the weak supervision data setting,
where X-Y-Z means that X% of training, Y% of validation and Z% of test data samples.
40-10-50 10-10-80
Zhao et al. 87.82% 86.62%
ResNet18 99.85% 98.87%
ResNet18-FFT 99.98% 99.88%
Geo-DefakeHop 99.93% 99.67%
Table 5.10: Comparison of F1-scores of four detection methods on fake images generated by CycleGAN, StyleGAN2, and Lightweight GAN, where all datasets are split with 10% training, 10% validation and 80% test data.
CycleGAN StyleGAN2 LightweightGAN
Zhao et al. 86.62% 69.50% 69.75%
ResNet18 98.87% 98.46% 98.89%
ResNet18-FFT 99.88% 96.33% 96.45%
Geo-DefakeHop 99.67% 99.47% 99.80%
Table 5.11: Model size computation of four Geo-DefakeHop designs for raw satellite input images.
System No. of Selected Channels No. of Filter Parameters No. of c/w XGBoost Parameters No. of Ensemble XGBoost Parameters Total Model Size
Pixelhop A 1 12 400 400 812
Pixelhop B 1 27 400 400 827
Pixelhop C 1 48 400 400 848
Pixelhop A&B&C 3 87 1,200 1,200 2,487
Table 5.12: Summary of model sizes of four Geo-DefakeHop designs with different input images.
Experiments PixelHop A PixelHop B PixelHop C PixelHop A&B&C
Raw Images 0.8K 0.8K 0.8K 2.5K
Resizing 9.7K 20K 37K 61.7K
Noise 8.1K 13K 33K 38.5K
Compression 7.3K 19K 33K 37.4K
Chapter 6
GreenCOD: Green Camouflaged Object Detection
6.1 Introduction
Camouflaged objects, often overlooked by the casual eye, offer a fascinating study of nature's strategies for survival and protection. They seamlessly merge with their environment, making use of adaptations in coloration, patterns, or even size. These camouflages range from chameleons blending with their surroundings and soldiers in camouflaged uniforms to lions concealed amidst the grasslands. The intricate designs and sophisticated methods these beings employ to remain concealed provide valuable lessons for technology, especially in the realm of computer vision.
Camouflaged Object Detection (COD) is one such area within computer vision that seeks to detect these well-hidden objects. It surpasses the challenges posed by traditional salient object detection, considering the visual subtleties and the multitude of scales and appearances in which camouflaged objects present themselves. The task is complex, given the fine lines separating the objects from their surrounding environments and the often indistinct and obscure characteristics that these objects possess.
Recent advances in deep learning have propelled the COD field forward, resulting in a flurry of methods and models that aim to accurately detect camouflaged objects. Deep neural networks, with their intricate architectures and comprehensive training regimens, have offered notable successes. However, they often require substantial computational resources and intricate designs. Moreover, the marginal improvements across various models have sometimes been met with increased computational cost, posing challenges in real-world applications.
In this context, we introduce GreenCOD, a fresh perspective on COD that leans on gradient boosting, specifically exploiting the power of extreme gradient boosting (XGBoost). Our methodology combines the robustness of XGBoost with deep features gleaned from Deep Neural Networks (DNNs) to establish a model that is efficient, requires fewer resources, and is inherently green. Notably, we side-step back-propagation, a staple in conventional deep learning, hinting at the possibility of exploring newer paradigms in neural architectures.
This work aims to address a primary concern: can we develop a model that retains efficacy in COD tasks but is more efficient, interpretable, and environmentally friendly? With GreenCOD, we believe we have taken a significant step in that direction.
The rest of this work is organized as follows. Related work is reviewed in Sec. 5.2. The GreenCOD method is presented in Sec. 6.2. Experiments are shown in Sec. 6.3. Finally, concluding remarks are given in Sec. 6.4.
6.2 GreenCOD Method
GreenCOD, an acronym for Green Camouflaged Object Detection, aims to pioneer a novel approach to object detection, emphasizing both efficiency and performance. The goal of our method is to achieve performance comparable to larger, more complex models while drastically reducing the number of FLOPs (floating point operations) and model parameters.
Drawing inspiration from the U-Net architecture, which excels in extracting features across multiple receptive fields and progressively refines the output segmentation from a coarse to a fine scale, we introduce significant enhancements. The right-hand side of the traditional U-Net architecture, primarily responsible for the expansive pathway, is substituted with the Extreme Gradient Boosting (XGBoost) mechanism in our model. This decision was motivated by XGBoost's capability to discern concealed objects within images.
One of the pivotal advantages of GreenCOD is that it avoids the end-to-end training typically required in deep learning models. Employing XGBoost not only aids in parameter reduction but also eliminates the necessity for backpropagation during model training. Furthermore, the absence of end-to-end training offers a more modular and flexible training approach, setting our method apart from conventional deep learning models. To the best of our knowledge, GreenCOD is the first to harness the power of XGBoost for detecting concealed objects within images, marking a groundbreaking advancement in the realm of object detection.
The proposed method integrates the power of deep learning with the robustness of gradient-boosted trees to achieve sophisticated image segmentation. It adopts a multi-resolution approach, utilizing both feature extraction and multi-scale XGBoost to effectively capture object hierarchies in images. Additionally, the method involves the concept of neighborhood construction to enhance context-awareness during segmentation.
6.2.1 Feature Extraction
The first step involves passing the input image through the EfficientNetB4 backbone. EfficientNetB4 is a state-of-the-art deep learning architecture designed for high-quality feature extraction. As the image progresses through the eight blocks (Block1 to Block8) of the backbone, it undergoes a series of convolution, pooling, and normalization operations. Each block captures features at a different level of granularity, ensuring that both low-level details and high-level semantics are preserved.
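As a rough illustration of this step, the sketch below uses a frozen, ImageNet-pretrained EfficientNet-B4 from torchvision and collects the intermediate feature maps of its stages; the exact grouping into the eight blocks used by GreenCOD is an assumption of this sketch, and no gradients are computed.

import torch
import torchvision

# Frozen, pretrained EfficientNet-B4 backbone; no fine-tuning or backpropagation is used.
backbone = torchvision.models.efficientnet_b4(weights="IMAGENET1K_V1").features.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def extract_multiscale_features(images):
    """images: (N, 3, 672, 672) tensor. Returns a list of intermediate
    feature maps, one per stage of the backbone (stem, MBConv stages, head)."""
    feats, x = [], images
    for stage in backbone:
        x = stage(x)
        feats.append(x)
    return feats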
Figure 6.1: An overview of the GreenCOD method, where the input is an image of dimension 672 × 672 × 3 and the output is a probability mask of dimension 168 × 168 × 1.
6.2.2 Concatenation and Resizing
After obtaining the feature maps from EfficientNetB4, they are resized to a standard dimension and concatenated. This provides a multi-resolution representation of the image, which ensures that features of different scales and complexities are effectively combined, enabling the model to capture objects and patterns of varying sizes.
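A minimal sketch of this resize-and-concatenate step is shown below; the 168 × 168 target grid matches the output mask size in Fig. 6.1, while the choice of interpolation mode and of which stages to fuse are assumptions.

import torch
import torch.nn.functional as F

def concat_multiscale(feats, size=(168, 168)):
    """Resize each backbone feature map to a common spatial grid and
    concatenate along the channel axis into one multi-resolution tensor."""
    resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feats]
    return torch.cat(resized, dim=1)      # (N, total_channels, 168, 168)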
6.2.3 Multi-scale XGBoost
With the concatenated feature representation in hand, the model leverages the gradient-boosting framework of XGBoost. XGBoost, known for its efficiency and performance, is typically used for structured data. Here, however, it is applied to image feature data. The multi-scale aspect implies that the feature data is processed at different resolutions or scales, each handled by a distinct XGBoost model.
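The sketch below shows how one such per-scale XGBoost could be trained on pixel-wise features; the depth and tree count mirror the D6-1000 configuration named in the result tables, but the exact feature layout and the use of the coarser-scale prediction as an extra input are assumptions of this illustration.

import numpy as np
from xgboost import XGBClassifier

def train_scale_xgboost(feature_map, mask, prev_pred=None, depth=6, trees=1000):
    """feature_map: (H, W, C) concatenated deep features at one scale.
    mask: (H, W) binary ground truth resized to the same scale.
    prev_pred: optional (H, W) probability map from the coarser scale."""
    h, w, c = feature_map.shape
    X = feature_map.reshape(-1, c)
    if prev_pred is not None:
        X = np.concatenate([X, prev_pred.reshape(-1, 1)], axis=1)
    clf = XGBClassifier(max_depth=depth, n_estimators=trees, tree_method="hist")
    clf.fit(X, mask.reshape(-1))
    return clf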
6.2.4 Neighborhood Construction (NC)
After each XGBoost stage, the method introduces the Neighborhood Construction (NC) step. This phase is crucial for context-aware segmentation. During NC, features from the local neighborhood of each pixel or region are gathered, allowing for a more informed decision during the segmentation process. This enhances the accuracy and precision of segment delineation, ensuring that objects and regions are defined clearly and correctly.
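A simple version of this neighborhood gathering can be sketched as follows; the window size k is an assumption chosen for illustration (the GIFT chapter uses a 19 × 19 window for a similar purpose).

import numpy as np

def neighborhood_construction(prob_map, k=7):
    """Collect the k x k neighborhood of each pixel's predicted probability
    so that the next XGBoost stage sees local context around every pixel."""
    r = k // 2
    padded = np.pad(prob_map, r, mode="edge")
    h, w = prob_map.shape
    out = np.empty((h, w, k * k), dtype=prob_map.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].reshape(-1)
    return out                       # (H, W, k*k) context features per pixel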
The proposed image segmentation method is an amalgamation of deep learning and gradient-boosted modeling. By leveraging the feature extraction prowess of EfficientNetB4, the multi-scale insights from XGBoost, and the contextual understanding from neighborhood construction, the model promises accurate, high-resolution segmentations. The method's unique combination of techniques ensures that it captures and delineates image features ranging from minute details to overarching patterns, making it versatile and robust for various image segmentation challenges.
6.3 Experiments
6.3.1 Datasets
In our experiments, we maintain consistency with the methodology of previous works. The training is performed on a dataset that combines the CAMO and COD10K training sets, totaling 4040 images. Testing is carried out on three separate datasets: CAMO, COD10K, and NC4K. The CAMO test set includes 250 images, the COD10K test set contains 2026 images, and NC4K is the largest test set, with 4121 images.
Model Pub/Year Input Sα ↑ Fβw ↑ M ↓ Emn ↑ Para. MACs
SINet [31] CVPR’20 3522 0.776 0.631 0.043 0.860 48.95M 19.42G
C2FNet [102] IJCAI’21 3522 0.813 0.686 0.036 0.890 28.41M 13.12G
TINet [144] AAAI’21 3522 0.793 0.635 0.042 0.861 28.56M 8.58G
JSCOD [63] CVPR’21 3522 0.809 0.684 0.035 0.884 121.63M 25.20G
LSR [75] CVPR’21 3522 0.804 0.673 0.037 0.880 57.90M 25.21G
PFNet [82] CVPR’21 4162 0.800 0.660 0.040 0.877 45.64M 26.54G
C2FNet-V2 [12] TCSVT’22 3522 0.811 0.691 0.036 0.887 44.94M 18.10G
ERRNet [49] PR’22 3522 0.786 0.630 0.043 0.867 69.76M 20.05G
TPRNet [130] TVCJ’22 3522 0.817 0.683 0.036 0.887 32.95M 12.98G
FAPNet [137] TIP’22 3522 0.822 0.694 0.036 0.888 29.52M 29.69G
BSANet [143] AAAI’22 3842 0.818 0.699 0.034 0.891 32.58M 29.70G
SegMaR [50] CVPR’22 3522 0.833 0.724 0.034 0.899 56.21M 33.63G
SINetV2 [30] TPAMI’22 3522 0.815 0.680 0.037 0.887 26.98M 12.28G
CRNet [42] AAAI’23 3202 0.733 0.576 0.049 0.832 32.65M 11.83G
DGNet-S [48] MIR’23 3522 0.810 0.672 0.036 0.888 7.02M 2.77G
DGNet [48] MIR’23 3522 0.822 0.693 0.033 0.896 19.22M 1.20G
GreenCOD-D3-1000 - 6722 0.797 0.701 0.033 0.881 16.83M 13.70G
GreenCOD-D3-10000 - 6722 0.807 0.715 0.032 0.893 17.62M 15.06G
GreenCOD-D6-1000 - 6722 0.804 0.709 0.032 0.891 17.50M 13.78G
GreenCOD-D6-10000 - 6722 0.813 0.724 0.031 0.895 24.34M 16.22G
Table 6.1: Comparison of performance metrics between proposed and benchmark methods on the COD10K dataset. For computational efficiency, only models with less than 50G Multiply-Accumulate Operations (MACs) were considered. The top-performing method for each metric on each dataset is highlighted in bold, while the second-best method is underscored.
Model Pub/Year Input Sα ↑ Fβw ↑ M ↓ Emn ↑ Para. MACs
D2CNet [112] TIE’21 3202 0.807 0.680 0.037 0.876 - -
R-MGL [123] CVPR’21 4732 0.814 0.666 0.035 0.852 67.64M 249.89G
S-MGL [123] CVPR’21 4732 0.811 0.655 0.037 0.845 63.60M 236.60G
UGTR [116] ICCV’21 4732 0.818 0.667 0.035 0.853 48.87M 127.12G
BAS [90] arXiv’21 2882 0.802 0.677 0.038 0.855 87.06M 161.19G
NCHIT [124] CVIU’22 2882 0.792 0.591 0.046 0.819 - -
CubeNet [150] PR’22 3522 0.795 0.643 0.041 0.865 - -
OCENet [72] WACV’22 4802 0.827 0.707 0.033 0.894 60.31M 59.70G
BGNet [103] IJCAI’22 4162 0.831 0.722 0.033 0.901 79.85M 58.45G
PreyNet [126] MM’22 4482 0.813 0.697 0.034 0.881 38.53M 58.10G
ZoomNet [88] CVPR’22 3842 0.838 0.729 0.029 0.919 32.38M 95.50G
FDNet [135] CVPR’22 4162 0.840 0.729 0.030 0.919 - -
CamoFormer-C [121] arXiv’23 3842 0.860 0.770 0.024 0.926 96.69M 50.77G
CamoFormer-R [121] arXiv’23 3842 0.838 0.724 0.029 0.916 54.25M 78.85G
PopNet [115] arXiv’23 5122 0.851 0.757 0.028 0.910 188.05M 154.88G
PFNet+ [81] SCIS’23 4802 0.806 0.677 0.037 0.884 - -
GreenCOD-D3-1000 - 6722 0.797 0.701 0.033 0.881 16.83M 13.70G
GreenCOD-D3-10000 - 6722 0.807 0.715 0.032 0.893 17.62M 15.06G
GreenCOD-D6-1000 - 6722 0.804 0.709 0.032 0.891 17.50M 13.78G
GreenCOD-D6-10000 - 6722 0.813 0.724 0.031 0.895 24.34M 16.22G
Table 6.2: Comparison of performance metrics between proposed and benchmark methods on the COD10K dataset. Only models with more than
50G Multiply-Accumulate Operations (MACs) were considered. The top-performing method for each metric on each dataset is highlighted in bold,
while the second-best method is underscored.
Model Pub/Year Input Sα ↑ Fβw ↑ M ↓ Emn ↑ Para. MACs
SINet [31] CVPR’20 3522 0.808 0.723 0.058 0.871 48.95M 19.42G
C2FNet [102] IJCAI’21 3522 0.838 0.762 0.049 0.897 28.41M 13.12G
TINet [144] AAAI’21 3522 0.829 0.734 0.055 0.879 28.56M 8.58G
JSCOD [63] CVPR’21 3522 0.842 0.771 0.047 0.898 121.63M 25.20G
LSR [75] CVPR’21 3522 0.840 0.766 0.048 0.895 57.90M 25.21G
PFNet [82] CVPR’21 4162 0.829 0.745 0.053 0.887 45.64M 26.54G
C2FNet-V2 [12] TCSVT’22 3522 0.840 0.770 0.048 0.896 44.94M 18.10G
ERRNet [49] PR’22 3522 0.827 0.737 0.054 0.887 69.76M 20.05G
TPRNet [130] TVCJ’22 3522 0.846 0.768 0.048 0.898 32.95M 12.98G
FAPNet [137] TIP’22 3522 0.851 0.775 0.047 0.899 29.52M 29.69G
BSANet [143] AAAI’22 3842 0.841 0.771 0.048 0.897 32.58M 29.70G
SegMaR [50] CVPR’22 3522 0.841 0.781 0.046 0.896 56.21M 33.63G
SINetV2 [30] TPAMI’22 3522 0.847 0.770 0.048 0.903 26.98M 12.28G
DGNet-S [48] MIR’23 3522 0.845 0.764 0.047 0.902 7.02M 1.20G
DGNet [48] MIR’23 3522 0.857 0.784 0.042 0.911 19.22M 2.77G
GreenCOD-D3-1000 - 6722 0.815 0.756 0.049 0.884 16.83M 13.70G
GreenCOD-D3-10000 - 6722 0.823 0.766 0.047 0.892 17.62M 15.06G
GreenCOD-D6-1000 - 6722 0.820 0.763 0.047 0.891 17.50M 13.78G
GreenCOD-D6-10000 - 6722 0.827 0.772 0.046 0.893 24.34M 16.22G
Table 6.3: Comparison of performance metrics between proposed and benchmark methods on the NC4K dataset. For computational efficiency, only models with less than 50G Multiply-Accumulate Operations (MACs) were considered. The top-performing method for each metric on each dataset is highlighted in bold, while the second-best method is underscored.
Model Pub/Year Input Sα ↑ Fβw ↑ M ↓ Emn ↑ Para. MACs
R-MGL [123] CVPR’21 4732 0.833 0.740 0.052 0.867 67.64M 249.89G
S-MGL [123] CVPR’21 4732 0.829 0.731 0.055 0.863 63.60M 236.60G
UGTR [116] ICCV’21 4732 0.839 0.747 0.052 0.874 48.87M 127.12G
BAS [90] arXiv’21 2882 0.817 0.732 0.058 0.859 87.06M 161.19G
NCHIT [124] CVIU’22 2882 0.830 0.710 0.058 0.851 - -
OCENet [72] WACV’22 4802 0.853 0.785 0.045 0.902 60.31M 59.70G
BGNet [103] IJCAI’22 4162 0.851 0.788 0.044 0.907 79.85M 58.45G
PreyNet [126] MM’22 4482 0.834 0.763 0.050 0.887 38.53M 58.10G
ZoomNet [88] CVPR’22 3842 0.853 0.784 0.043 0.896 32.38M 95.50G
FDNet [135] CVPR’22 4162 0.834 0.750 0.052 0.893 - -
CamoFormer-C [121] arXiv’23 3842 0.883 0.834 0.032 0.933 96.69M 50.77G
CamoFormer-R [121] arXiv’23 3842 0.855 0.788 0.042 0.900 54.25M 78.85G
PopNet [115] arXiv’23 5122 0.861 0.802 0.042 0.909 188.05M 154.88G
GreenCOD-D3-1000 - 6722 0.815 0.756 0.049 0.884 16.83M 13.70G
GreenCOD-D3-10000 - 6722 0.823 0.766 0.047 0.892 17.62M 15.06G
GreenCOD-D6-1000 - 6722 0.820 0.763 0.047 0.891 17.50M 13.78G
GreenCOD-D6-10000 - 6722 0.827 0.772 0.046 0.893 24.34M 16.22G
Table 6.4: Comparison of performance metrics between proposed and benchmark methods on the NC4K dataset. Only models with more than 50G Multiply-Accumulate Operations (MACs) were considered. The top-performing method for each metric on each dataset is highlighted in bold, while the second-best method is underscored.
Figure 6.2: Illustration of mask predictions using the proposed GreenCOD. Images are taken from the COD10K test dataset. From left to right: (a) input images, (b) ground-truth masks, (c) predictions.
(a) Input (b) Ground-truth (c) Prediction
6.3.2 Evaluation Metrics
In order to benchmark the performance of our proposed method, we conducted a comprehensive comparison with state-of-the-art methods employing identical evaluation metrics. The comparative analysis focused on several key aspects, including the Mean Absolute Error (MAE), the Structure measure, the Enhanced-alignment measure, and the F-measure. In the definitions below, W and H are the width and height of the images, P(x, y) represents the pixel value of the ground truth at coordinates (x, y), and G(x, y) represents the pixel value of the prediction at coordinates (x, y).
• The Mean Absolute Error (MAE) is computed as
M = \frac{1}{W \times H} \sum_{x} \sum_{y} |P(x, y) - G(x, y)|,   (6.1)
where |P(x, y) - G(x, y)| is the absolute difference between the corresponding pixel values of the two masks.
• The Structure measure is given by
S_\alpha = (1 - \alpha) S_o(P, G) + \alpha S_r(P, G),   (6.2)
where \alpha adjusts the balance between the object-aware similarity S_o and the region-aware similarity S_r. Following the convention established in the original publication, we set \alpha to its default value of 0.5.
• The Enhanced-alignment measure is computed as
E_\phi = \frac{1}{W \times H} \sum_{x} \sum_{y} \phi[P(x, y), G(x, y)],   (6.3)
where \phi is the enhanced-alignment matrix applied to the pixel values of masks P and G.
• The F-measure is given by
F_\beta = \frac{(1 + \beta^2)\, \mathrm{Precision} \times \mathrm{Recall}}{\beta^2\, \mathrm{Precision} + \mathrm{Recall}},   (6.4)
where \beta^2 = 0.3 gives more weight to precision than to recall, as suggested in previous work.
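As a quick reference, the two simplest of these metrics can be computed as below. This is only a sketch: the Structure measure and Enhanced-alignment measure require the object/region similarity and enhanced-alignment definitions from their original papers and are omitted, and the binarization threshold used here is an assumption.

import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted mask and the ground truth,
    both given as float arrays in [0, 1] with shape (H, W)."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure with beta^2 = 0.3, which weights precision more than recall."""
    p = pred >= thresh
    g = gt >= 0.5
    tp = np.logical_and(p, g).sum()
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)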
The results of the comparative analysis underscore the efficacy and robustness of our method, showcasing superior or comparable performance across the evaluated metrics.
6.3.3 Experiment results
In Table 6.1, we present a comparative analysis of our proposed GreenCOD method against other leading-edge methods from recent literature, utilizing the COD10K dataset. This comparison specifically includes models that operate under the computational threshold of 50G Multiply-Accumulate Operations (MACs) to ensure computational efficiency. Remarkably, our GreenCOD achieves the highest F-measure and the lowest Mean Absolute Error (MAE) with just 24.34 million parameters and 16.22G MACs. This performance is notably superior to that of SegMaR, which requires 56.21 million parameters and 33.63G MACs. The favorable balance between performance and efficiency that GreenCOD offers illustrates its potential as a robust architecture worthy of further investigation. While GreenCOD does not secure the top spot in E-measure, where it ranks third behind SegMaR and DGNet, it still demonstrates commendable overall efficacy.
In Table 6.2, our focus shifts from evaluating our proposed method against smaller models to benchmarking it alongside larger-scale models. This table is confined to models exceeding the computational complexity of 50G Multiply-Accumulate Operations (MACs). Although our model does not outperform the leading method, CamoFormer-C, it is important to note that CamoFormer-C demands fourfold more parameters and a threefold increase in MACs compared to our model. Upon examining the Mean Absolute Error (MAE) and F-measure metrics, our model outperforms 11 of the 16 methods considered, all of which have significantly larger model sizes than ours. In terms of E-measure, our model surpasses 10 out of the 16 methods. Notably, when compared with R-MGL, our method achieves a substantial reduction in MACs, plummeting from 249.89G to 16.22G. This reduction translates to a decrease in energy consumption by a factor of 15, emphasizing our model's enhanced efficiency.
In Table 6.3, we extend the evaluation of our model to the NC4K dataset, which is currently the largest testing set, to assess our model's ability to generalize across extensive conditions. Our model secures a second-place ranking in Mean Absolute Error (MAE), matching the performance of SegMaR while boasting a significantly smaller model size and fewer Multiply-Accumulate Operations (MACs). Introduced in 2023, DGNet leads the pack for models under 50G MACs, with 19.22 million parameters and 2.77G MACs, achieving the best results. Nonetheless, our model stands out by offering greater interpretability. Moreover, it eliminates the need for end-to-end training of the entire model, thereby forgoing any requirement for backpropagation, an advantage that DGNet does not provide.
In Table 6.4, pertaining to the NC4K dataset, we assess our model alongside larger models with computational complexities exceeding 50G Multiply-Accumulate Operations (MACs). Our model demonstrates its robustness by outscoring 7 of the 13 models in Mean Absolute Error (MAE), F-measure, and E-measure. This performance underscores the effectiveness of our model on the NC4K dataset, showcasing its capability to generalize successfully to larger datasets.
6.3.4 Visualization analysis
As illustrated in Fig. 6.2, our attention is drawn to the segmentation of large concealed objects. In the first row, our model demonstrates exceptional detail in segmenting the camouflaged object, precisely identifying the butterfly with remarkable accuracy. The second row showcases the model's capability to differentiate subtle details, such as the bird's tail. The third row presents a challenging scenario: a rabbit immersed in snow, representing the kind of complex conditions that could be encountered in everyday environments. Finally, in the fourth row, despite the fish being obscured by dust, our model successfully delineates its contours with high precision, highlighting the effectiveness of our approach in detecting concealed objects even with very fine boundaries.
6.4 Conclusion and Future Work
This study introduces a novel approach for Camouflaged Object Detection, termed GreenCOD. GreenCOD combines the power of Extreme Gradient Boosting (XGBoost) with deep feature extraction from Deep Neural Networks (DNNs). Contemporary research often focuses on devising intricate DNN architectures to enhance the performance of Camouflaged Object Detection. However, these methods are typically computationally intensive. Our GreenCOD model stands out by employing gradient boosting for the detection task. Its efficient design requires fewer parameters and FLOPs than leading-edge deep learning models, all while maintaining superior performance. Notably, our model is trained without the use of backpropagation.
Chapter 7
GIFT: Green Image Forgery Technique
Figure 7.1: System diagram of the Green Image Forgery Technique (GIFT). "E-XGBoost" represents the edge-detection XGBoost, while "S-XGBoost" stands for the surface-detection XGBoost. The architecture efficiently integrates multi-level feature extraction with specialized XGBoosts to discern both surface and edge forgeries.
7.1 Introduction
As digital imaging technology advance and become more widely accessible, the ability to manipulate and
forge images has expanded, image forgeries become a concern for many domains such as journalism,
legal proceedings, etc. Image forgery detection, or the ability to di#erentiate between genuine and forged
89
images is becoming increasingly crucial. Forgery detection is usually done based on the traces that are
left behind by manipulation. However, when there are multiple forgeries, or if exist post-processing after
the forgery manipulation, manipulation artifacts become untraceable. Thus, image forgery detection still
remains a challenge.
To tackle the detection of combined forgeries, researchers have continuously explored deep neural networks, including CNNs, RNNs, and long short-term memory (LSTM) networks,
aiming to find, in one shot, a unified inconsistency between the forged region and the authentic region. However, these models are often constrained by patch sequence orders and manipulation types. Inspired by the
success of transformers in computer vision, some researchers have also explored the use of transformers
or attention mechanisms in the forgery detection field. Despite the advanced performance of transformer-based models, they often demand substantial computational resources, leading to longer processing times
and less environmentally friendly solutions. Hence, there is a critical need for a 'green' image forgery
detector: a solution optimized not only for accuracy but also for computational and energy efficiency.
Our contribution in this work contains three main aspects. First, we combine a pre-trained EfficientNet with the machine learning classifier XGBoost to perform forgery detection; we need neither backpropagation
nor end-to-end training. Second, we make use of both surface supervision and edge supervision,
which yields more accurate predictions at forgery boundaries. Lastly, we outperform the state-of-the-art methods with much lower FLOPs and far fewer trainable parameters, and we even surpass transformer-based forgery detectors.
7.2 Proposed Method
In this section, we present our newly proposed Green Image Forgery Technique (GIFT). As illustrated in
Figure 7.1, GIFT is structured into three distinct components:
1. Feature Extraction: Utilizing a pretrained EfficientNet-B4 architecture without any fine-tuning, this
serves as the foundation for extracting discriminative features at various scales.
2. Multi-level Surface XGBoosts: Tailored to distinguish forged pixels from authentic pixels,
they learn from a coarse level, advancing to a more refined scale.
3. Multi-level Edge XGBoosts: These are trained with the guidance of edge maps. The predicted
edge maps derived from this component play a pivotal role in amplifying the performance of the
surface XGBoosts.
We will delve into the design intricacies and the comprehensive training methodology of these components in the subsequent subsections.
7.2.1 Feature Extraction
The Green Image Forgery Technique (GIFT) harnesses the EfficientNet-B4 architecture as
its backbone to extract discriminative features from input images. Importantly, we do not fine-tune the
EfficientNet-B4; it remains in its ImageNet-pretrained state, and no backpropagation is applied
in our use case. The input to the network is an image of dimensions 384 × 384 × 3. This EfficientNet-B4
backbone is divided into eight blocks, each sequentially reducing the spatial dimensions
of the image while enriching the depth of the feature maps. After traversing these blocks, features of varying
resolutions are derived. They are then resized and concatenated, resulting in a combined feature
dimension of 1152. It is worth noting that while we chose EfficientNet-B4, GIFT's design is flexible, and
other backbones such as ResNet or MobileNet could be seamlessly integrated as alternatives.
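To make this step concrete, the sketch below extracts the eight multi-scale feature maps and concatenates them on a common grid. It is a minimal illustration, assuming a torchvision EfficientNet-B4 whose stem and seven stages output 48, 24, 32, 56, 112, 160, 272, and 448 channels (which sum to 1152, matching the combined dimension above); the choice of library, the block indexing, and the 48 × 48 target grid are our assumptions rather than the thesis implementation.

```python
# Minimal sketch of frozen multi-scale feature extraction (assumed torchvision backbone).
import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b4, EfficientNet_B4_Weights

backbone = efficientnet_b4(weights=EfficientNet_B4_Weights.IMAGENET1K_V1).features[:8]
backbone.eval()  # pretrained and frozen: no fine-tuning, no backpropagation

@torch.no_grad()
def extract_features(image, out_size=48):
    """image: (N, 3, 384, 384) tensor; returns (N, 1152, out_size, out_size)."""
    feats, x = [], image
    for block in backbone:          # stem + 7 stages -> 8 feature maps
        x = block(x)
        feats.append(x)
    # Resize every map to a common grid and concatenate along the channel axis.
    feats = [F.interpolate(f, size=(out_size, out_size), mode="bilinear",
                           align_corners=False) for f in feats]
    return torch.cat(feats, dim=1)  # 48+24+32+56+112+160+272+448 = 1152 channels

x = torch.randn(1, 3, 384, 384)
print(extract_features(x).shape)    # torch.Size([1, 1152, 48, 48])
```

In practice, the same concatenated map can be resampled to each working resolution (24, 48, 96, 192) before it is handed to the corresponding XGBoost stage.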
7.2.2 Multi-level Surface XGBoosts
After feature extraction with EfficientNet-B4, the concatenated features with a dimensionality
of 1152 are passed to the multi-level surface XGBoosts. These XGBoosts, denoted as S-XGBoost in our diagram, are trained to distinguish between authentic and forged pixels. For each S-XGBoost, the ground
truth is resized according to its designated scale and subsequently flattened. For instance, S-XGBoost2
operates on features at a resolution of 48 × 48, so the ground truth is resized to 48 × 48.
Each S-XGBoost relies on two main input channels: image features and surface features. The image
features, derived from our feature extraction step, provide various details about the image. The surface features, originating from the predictions of the preceding S-XGBoost, offer a context-rich perspective on possible forgeries. Each S-XGBoost uses a 19 × 19 window to capture neighborhood
context, so the resulting feature dimensionality grows to 1152 + 19 × 19 = 1513. S-XGBoost1, being
the initial model in the series, lacks any preceding surface features; its sole input is thus the
image features.
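The sketch below illustrates how one surface stage can be trained: per-pixel image features are optionally concatenated with a 19 × 19 window of the previous stage's probability map and fed to an XGBoost classifier. It assumes the scikit-learn-style xgboost.XGBClassifier API; the unfold-based window extraction, the tree_method setting, and the helper names are illustrative choices rather than the thesis code (per Table 7.2, the first stage uses depth 8 and later stages depth 3, each with 2000 trees).

```python
# Minimal sketch of a single surface-XGBoost stage (assumed data layout and API).
import numpy as np
import torch
import torch.nn.functional as F
from xgboost import XGBClassifier

def window_features(prob_map, k=19):
    """prob_map: (N, 1, H, W) previous-level forgery probabilities.
    Returns (N*H*W, k*k) rows, one k x k neighborhood per pixel."""
    patches = F.unfold(prob_map, kernel_size=k, padding=k // 2)   # (N, k*k, H*W)
    return patches.permute(0, 2, 1).reshape(-1, k * k).numpy()

def pixel_rows(feat_map):
    """feat_map: (N, C, H, W) -> (N*H*W, C) per-pixel feature rows."""
    n, c, h, w = feat_map.shape
    return feat_map.permute(0, 2, 3, 1).reshape(-1, c).numpy()

def train_surface_stage(feat_map, prev_prob, mask, max_depth=3):
    """feat_map: (N, 1152, H, W); prev_prob: (N, 1, H, W) or None for the first stage;
    mask: (N, H, W) binary ground truth resized to the stage resolution."""
    X = pixel_rows(feat_map)
    if prev_prob is not None:                        # 1152 + 19*19 = 1513 dims
        X = np.hstack([X, window_features(prev_prob)])
    y = mask.reshape(-1).numpy()
    clf = XGBClassifier(n_estimators=2000, max_depth=max_depth, tree_method="hist")
    clf.fit(X, y)
    return clf
```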
7.2.3 Multi-level Edge XGBoosts
Along with the surface XGBoosts, we use multi-level edge XGBoosts, denoted E-XGBoost in our diagram.
These XGBoosts focus on the edges of images. Edges are important because they help locate where an image
may have been altered or faked, especially near region borders. Like the surface XGBoosts, the edge
XGBoosts operate at different scales. For example, E-XGBoost4 works on the higher-resolution 192 × 192
features, while E-XGBoost1 works on the lower-resolution 24 × 24 features. These edge predictions help
improve the surface predictions, making GIFT more powerful.
To construct the ground truth for the edges, we first downsample the surface ground truth to the target
size. We then apply erosion, which shrinks the binary mask slightly, and obtain the edge map as the
difference between the original mask and its eroded version. Feeding these edge predictions to the surface XGBoosts provides them with additional information. An
S-XGBoost then uses features from the preceding S-XGBoost, the current E-XGBoost, and the image itself, so it has
features from three sources and a dimensionality of 1152 + 19 × 19 + 19 × 19 = 1874 in total.
Edges are very informative. Sometimes the interior of a region does not reveal a forgery, but its edges
do. By examining the edges, our detector can better spot forged images. When edge information is not used,
we call our model GIFT; when edge information is added, we call it the edge-enhanced GIFT, or E-GIFT
for short.
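A minimal sketch of the edge ground-truth construction described above is given below, assuming OpenCV's binary erosion; the 3 × 3 structuring element is a placeholder choice, since the thesis does not specify the kernel size.

```python
# Minimal sketch of edge ground-truth generation via erosion (assumed kernel size).
import cv2
import numpy as np

def edge_ground_truth(mask, kernel_size=3):
    """mask: (H, W) binary forgery mask already resized to the stage resolution.
    The edge map is the ring of pixels removed by one erosion step."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(mask.astype(np.uint8), kernel, iterations=1)
    return mask.astype(np.uint8) - eroded   # 1 on the boundary, 0 elsewhere
```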
Table 7.1: Comparison of F1 scores between the proposed and benchmark methods across various datasets.
Models were trained on the CASIA v2.0 dataset and tested on other datasets. The top-performing method
for each dataset is highlighted in bold and the second-best method is underlined. For
algorithms not evaluated on specific datasets, entries are indicated as "NA". We also provide a comparison
of the number of parameters and FLOPs/pixel for both the DL methods and our proposed method.
Method CASIA v1.0 Columbia Parameters FLOPs/pixel
NOI3 0.1761 0.4476 - -
NADQ 0.1763 NA - -
NOI1 0.2633 0.5740 - -
DCT 0.3005 0.5199 - -
CFA2 0.2125 0.5031 - -
ELA 0.2136 0.4699 - -
ADQ1 0.2053 0.4975 - -
CFA1 0.2073 0.4667 - -
BLK 0.2312 0.5234 - -
NOI2 0.2302 0.5318 - -
ADQ2 0.3359 NA - -
ADQ3 0.2192 NA - -
SFCN 0.4770 0.5820 134.27M (7.22x) 423.99K (12.30x)
MFCN 0.5182 0.6040 134.28M (7.22x) 424.19K (12.31x)
Edge-enhanced MFCN 0.5410 0.6117 134.28M (7.22x) 424.19K (12.31x)
GIFT 0.5728 0.6497 18.4M (0.99x) 31.80K (0.92x)
Edge-enhanced GIFT 0.6210 0.6558 18.6M (1x) 34.46K (1x)
7.3 Experiments
Dataset. The CASIA v2.0, CASIA v1.0, and Columbia datasets are used to evaluate the proposed method.
• CASIA has two versions. CASIA v1.0 contains 912 tampered images. CASIA v2.0 consists of 5123
tampered images and 7491 authentic images.
• Columbia contains 180 tampered images.
We adopt the same experimental setting as [100]. We train on the CASIA v2.0 dataset and test on the CASIA
v1.0 and Columbia datasets.
Evaluation Metrics. To evaluate our model's performance, we follow the previous method [100] and
employ the F1 score. To derive a binary mask from our output map, the optimal threshold value is selected for
each image (this is done for each method). This thresholding technique aligns with the methodology
presented by [122]. For each image, an F1 score is calculated, and we subsequently report the mean F1
score computed over the entire dataset.
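The per-image optimal-threshold protocol can be summarized by the short sketch below. It assumes scikit-learn's f1_score and a fixed threshold grid for the sweep, which are illustrative choices rather than the exact evaluation code.

```python
# Minimal sketch of the per-image optimal-threshold F1 protocol (assumed threshold grid).
import numpy as np
from sklearn.metrics import f1_score

def best_f1(pred_map, gt_mask, thresholds=np.linspace(0.0, 1.0, 101)):
    """pred_map: (H, W) forgery probabilities; gt_mask: (H, W) binary mask.
    Returns the F1 score under the best threshold for this image."""
    gt = gt_mask.reshape(-1).astype(int)
    return max(f1_score(gt, (pred_map.reshape(-1) >= t).astype(int),
                        zero_division=0) for t in thresholds)

# Dataset-level score: the mean of the per-image best F1 values.
```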
Experimental results. We compared the presented GIFT and edge-enhanced GIFT methods to several existing splicing localization algorithms. These algorithms encompass ADQ1 [70], ADQ2 [8], ADQ3 [3],
NADQ [9], BLK [65], CFA1 [33], CFA2 [26], DCT [120], ELA [58], NOI1 [79], NOI2 [77], SFCN [100], MFCN [100],
and edge-enhanced MFCN [100]. We leverage the implementations of ADQ1, ADQ2, ADQ3, NADQ, BLK,
CFA1, CFA2, DCT, ELA, NOI1, and NOI2 from the publicly available Matlab toolbox written by Zampoglou et al. [122]. The F1 scores of SFCN, MFCN, and edge-enhanced MFCN are taken from [100].
In Table 7.1, we compare the F1 scores to those of the existing methods. The first group of methods is based on traditional image-processing techniques. These algorithms detect different kinds of noise to localize spliced
pixels and are therefore not robust when the noise varies. The CASIA v1.0 dataset contains
diverse noise and copy-move patterns, which leads to their low F1 scores. The second group of methods is deep-learning oriented. These methods are more robust to different kinds of noise but require more memory to store
parameters and specific hardware, such as a GPU, to localize spliced pixels. The third group consists of green learning
methods. Compared to the deep learning methods, the memory requirement is about 1/7 and the computational complexity is about 1/12. Furthermore, it is worth mentioning that the F1 score outperforms the
methods based on deep learning. The number of parameters is reduced significantly by XGBoost optimization, which trims redundant parameters of deep learning models via the tree architecture. On top of
GIFT, the edge-enhanced GIFT achieves state-of-the-art performance by leveraging the edge supervision from the splicing boundary. The number of parameters of the edge-enhanced GIFT increases
by only about 1 percent because of the reusable model architecture design.
In Table 7.4, we compare GIFT to current transformer-based architectures using the AUC metric. GIFT
demonstrates its lightweight and computation-friendly advantages. Compared to EVP, the number of parameters and
FLOPs/pixel of GIFT are only about 28.7 percent and 5 percent, respectively.
In Figure 7.2, we demonstrate visual results from the CASIA v1.0 dataset. We can see that the spliced
objects are localized precisely by the presented architecture. With the supervision of the edge features,
not only object-based spliced pixels but also copy-move patterns (the bottom row) can be localized
effectively. In addition, the localization quality at the boundary is boosted by the supervision of the edge
features.
Model Size and FLOPs Computation. In Table 7.2, we present the calculated number of parameters
and the number of FLOPs (floating-point operations) per pixel for both the GIFT and E-GIFT models. Each
employs EfficientNet-B4 as its backbone, omitting the fully connected layers and utilizing only the
first 8 blocks for feature extraction. The input resolution for EfficientNet-B4 is 384 × 384 pixels. The
EfficientNet-B4 accounts for 16.74 million parameters and operates at 4.29 GFLOPs.
After feature extraction, the GIFT model incorporates 4 S-XGBoost modules, while E-GIFT combines
4 S-XGBoost with 4 E-XGBoost modules. For S-XGBoost 1, the maximum tree depth is set to 8 with a total
of 2000 trees, which results in 1,532,000 parameters, calculated as 2000 × (2 × (2^8 − 1) + 2^8). The FLOPs
for S-XGBoost 1 are computed as 24 × 24 × 2000 × (8 + 1), totaling 10,368,000 FLOPs.
The remaining S-XGBoost and E-XGBoost modules are configured with a maximum depth of 3 and 2000
trees. This configuration yields 44,000 parameters each, following the formula 2000 × (2 × (2^3 − 1) + 2^3).
Figure 7.2: Illustration of mask predictions using the proposed GIFT and edge-enhanced GIFT. Images are
taken from the CASIA v1.0 dataset. From left to right: (a) tampered images, (b) ground-truth masks, (c) GIFT
mask predictions, and (d) edge-enhanced GIFT mask predictions.
The FLOPs for these modules are calculated as size × size × 2000 × (3 + 1), which varies according to
the resolution. Given that the input size is 384 × 384, the FLOPs for each submodule are normalized by 384 ×
384 pixels. Consequently, the GIFT model utilizes 18.41 million parameters and achieves a computational
cost of 31.80K FLOPs per pixel. The E-GIFT model uses slightly more resources, with 18.58 million
parameters and 34.46K FLOPs per pixel.
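The counts above follow directly from these tree formulas. The sketch below reproduces them, assuming complete binary trees of the stated depth; the backbone numbers are taken from Table 7.2 rather than recomputed here.

```python
# Minimal sketch reproducing the XGBoost parameter and FLOP accounting of Table 7.2.
def xgb_params(n_trees, depth):
    # Each complete binary tree of the given depth stores 2*(2**depth - 1)
    # split parameters (feature index + threshold) plus 2**depth leaf values.
    return n_trees * (2 * (2 ** depth - 1) + 2 ** depth)

def xgb_flops(size, n_trees, depth):
    # Every pixel of a size x size map traverses every tree:
    # `depth` comparisons plus one accumulation per tree.
    return size * size * n_trees * (depth + 1)

print(xgb_params(2000, 8), xgb_flops(24, 2000, 8))   # 1532000 and 10368000 (S-XGBoost 1)
print(xgb_params(2000, 3), xgb_flops(48, 2000, 3))   # 44000 and 18432000 (S-XGBoost 2)
```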
7.4 Conclusion and Future Work
In this work, we present the Green Image Forgery Technique (GIFT). GIFT is a pioneering gradient-boosting method tailored for detecting multiple image forgeries. It operates without the need for
backpropagation or end-to-end training. Utilizing the pyramid structure of EfficientNet, GIFT performs
forgery detection with an XGBoost at each level. Besides benefiting from the direct supervision of the
ground-truth forgery mask, our method incorporates edge supervision, which refines our decision-making
at forgery boundaries. Extensive experiments across multiple datasets validate GIFT's superior
performance over many state-of-the-art methods, even with its reduced computational demands. In the
future, we plan to conduct more experiments on images produced by the most recent generative AI methods.
Table 7.2: The number of parameters and FLOPs for various parts of GIFT and E-GIFT
Submodule | size | #trees | depth | Params (GIFT) | Params (E-GIFT) | FLOPs (GIFT) | FLOPs (E-GIFT) | FLOPs/pixel (GIFT) | FLOPs/pixel (E-GIFT)
EfficientNetB4 | - | - | - | 16,742,216 | 16,742,216 | 4,292,302,304 | 4,292,302,304 | 29,109.04 | 29,109.04
S-XGBoost 1 | 24 | 2000 | 8 | 1,532,000 | 1,532,000 | 10,368,000 | 10,368,000 | 70.31 | 70.31
S-XGBoost 2 | 48 | 2000 | 3 | 44,000 | 44,000 | 18,432,000 | 18,432,000 | 125.00 | 125.00
S-XGBoost 3 | 96 | 2000 | 3 | 44,000 | 44,000 | 73,728,000 | 73,728,000 | 500.00 | 500.00
S-XGBoost 4 | 192 | 2000 | 3 | 44,000 | 44,000 | 294,912,000 | 294,912,000 | 2,000.00 | 2,000.00
E-XGBoost 1 | 24 | 2000 | 3 | - | 44,000 | - | 4,608,000 | - | 31.25
E-XGBoost 2 | 48 | 2000 | 3 | - | 44,000 | - | 18,432,000 | - | 125.00
E-XGBoost 3 | 96 | 2000 | 3 | - | 44,000 | - | 73,728,000 | - | 500.00
E-XGBoost 4 | 192 | 2000 | 3 | - | 44,000 | - | 294,912,000 | - | 2,000.00
Total | - | - | - | 18,406,216 | 18,582,216 | 4,689,742,304 | 5,081,422,304 | 31,804.35 | 34,460.60
Table 7.3: F1 and AUC scores from different levels of the edge-enhanced XGBoost on the CASIA v1.0 dataset
Metric XGBoost1 XGBoost2 XGBoost3 XGBoost4
F1 0.54467 0.59203 0.61122 0.6199
AUC 0.84927 0.87402 0.87779 0.87876
Table 7.4: Comparison of AUC scores between the proposed and benchmark methods on the CASIA v1.0
dataset. Models were trained on the CASIA v2.0 dataset. The top-performing method is highlighted in
bold. We also provide a comparison of the number of parameters and FLOPs/pixel for both the DL methods
and our proposed method.
Method CASIA v1.0 Parameters FLOPs/pixel
VPT-Deep 0.847 - -
AdaptFormer 0.855 - -
EVP 0.862 64.1M 649.00K
GIFT 0.888 18.4M 32.49K
Edge-enhanced GIFT 0.883 18.6M 35.04K
Chapter 8
Conclusion and Future Work
8.1 Summary of the Research
In this thesis, we first focus on two problems: Deepfake detection and Deepfake satellite image detection. For the first task, we further improved our method to ensure that our model is robust to perturbations
and scalable to real-world datasets.
• DefakeHop A light-weight, high-performance method for Deepfake detection, called DefakeHop,
was proposed. It has several advantages: a smaller model size, a fast training procedure, a high detection
AUC, and the need for fewer training samples. Extensive experiments were conducted to demonstrate its
high detection performance.
• DefakeHop++ A lightweight Deepfake detection method, called DefakeHop++, was proposed in
this work. It is an enhanced version of our previous solution called DefakeHop. Its model size
is significantly smaller than that of state-of-the-art DNN-based solutions, including MobileNet v3,
while keeping reasonably high detection performance. It is most suitable for Deepfake detection in
mobile/edge devices.
Fake image/video detection is an important topic. The faked content is not restricted to talking-head
videos; there are many other application scenarios, including faked satellite images,
image splicing, image forgery in publications, etc. Heavyweight fake image detection solutions are
not practical. Furthermore, fake images can appear in many forms. On one hand, it is unrealistic
to include all possible perturbations in the training dataset under the setting of heavy supervision.
On the other hand, the performance could be quite poor with little supervision. It is essential to
find a middle ground and look for a lightweight, weakly-supervised solution with reasonable performance. This chapter shows our research effort along this direction. We will continue to explore and
generalize the methodology to other challenging Deepfake problems.
• Geo-DefakeHop A method called Geo-DefakeHop was proposed to distinguish between authentic
and counterfeit satellite images. Its effectiveness in terms of the F1 scores, precision and recall was
demonstrated by extensive experiments. Furthermore, its model size was thoroughly analyzed. It
can be easily implemented in software on mobile or edge devices due to its small model size.
• GreenCOD This study introduces a novel approach for camouflaged object detection, termed
"GreenCOD." GreenCOD combines the power of Extreme Gradient Boosting (XGBoost) with deep
feature extraction from Deep Neural Networks (DNN). Contemporary research often focuses on devising intricate DNN architectures to enhance the performance of camouflaged object detection.
However, these methods are typically computationally intensive. Our GreenCOD model stands out
by employing gradient boosting for detection tasks. Its efficient design requires fewer parameters
and FLOPs than leading-edge deep learning models, all while maintaining superior performance.
Notably, our model undergoes training without the use of backpropagation.
• GIFT In this work, we present the Green Image Forgery Technique (GIFT). GIFT is a pioneering
gradient-boosting method tailored for detecting multiple image forgeries. It operates without the
need for backpropagation or end-to-end training. Utilizing the pyramid structure of EfficientNet, GIFT performs forgery detection with an XGBoost at each level. Besides benefiting from the
direct supervision of the ground-truth forgery mask, our method incorporates edge supervision,
which refines our decision-making at forgery boundaries. Extensive experiments across
multiple datasets validate GIFT's superior performance over many state-of-the-art methods, even
with its reduced computational demands. In the future, we plan to conduct more experiments on
images produced by the most recent generative AI methods.
8.2 Future Research Directions
In the future, we will continue to prioritize Green Learning methods to enhance our models' efficiency and
reduce their size. We have already demonstrated success with camouflaged object detection in images.
Our aim is to extend this focus to video-based camouflaged object detection, which has broader applications
in real-world scenarios. Regarding the Green Image Forgery Technique, we will delve deeper into analyzing
images manipulated by contemporary generative AI technologies such as Stable Diffusion. Additionally,
we will strive to streamline our models further, enabling their integration into mobile devices.
Bibliography
[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. “Mesonet: a compact facial
video forgery detection network”. In: 2018 IEEE International Workshop on Information Forensics
and Security (WIFS). IEEE. 2018, pp. 1–7.
[2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. “Protecting
World Leaders Against Deep Fakes.” In: CVPR Workshops. 2019, pp. 38–45.
[3] Irene Amerini, Rudy Becarelli, Roberto Caldelli, and Andrea Del Mastio. “Splicing forgeries
localization through the use of first digit features”. In: 2014 IEEE International Workshop on
Information Forensics and Security (WIFS). IEEE. 2014, pp. 143–148.
[4] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. “Openface 2.0:
Facial behavior analysis toolkit”. In: 2018 13th IEEE International Conference on Automatic Face &
Gesture Recognition (FG 2018). IEEE. 2018, pp. 59–66.
[5] Mauro Barni, Luca Bondi, Nicolò Bonettini, Paolo Bestagini, Andrea Costanzo, Marco Maggini,
Benedetta Tondi, and Stefano Tubaro. “Aligned and non-aligned double JPEG detection using
convolutional neural networks”. In: Journal of Visual Communication and Image Representation 49
(2017), pp. 153–163.
[6] Mauro Barni, Kassem Kallas, Ehsan Nowroozi, and Benedetta Tondi. “CNN detection of
GAN-generated face images based on cross-band co-occurrences analysis”. In: 2020 IEEE
International Workshop on Information Forensics and Security (WIFS). IEEE. 2020, pp. 1–6.
[7] Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. “Improved DCT coefficient analysis for
forgery localization in JPEG images”. In: 2011 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE. 2011, pp. 2444–2447.
[8] Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. “Improved DCT coefficient analysis for
forgery localization in JPEG images”. In: 2011 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE. 2011, pp. 2444–2447.
[9] Tiziano Bianchi and Alessandro Piva. “Image forgery localization via block-grained analysis of
JPEG artifacts”. In: IEEE Transactions on Information Forensics and Security 7.3 (2012),
pp. 1003–1017.
[10] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large scale GAN training for high fidelity
natural image synthesis”. In: arXiv:1809.11096 (2018).
[11] CartoDB. https://carto.com. 2021.
[12] Geng Chen, Si-Jie Liu, Yu-Jia Sun, Ge-Peng Ji, Ya-Feng Wu, and Tao Zhou. “Camouflaged object
detection via context-aware cross-level fusion”. In: IEEE Transactions on Circuits and Systems for
Video Technology 32.10 (2022), pp. 6981–6993.
[13] Hong-Shuo Chen, Shuowen Hu, Suya You, C-C Jay Kuo, et al. “Defakehop++: An enhanced
lightweight deepfake detector”. In: APSIPA Transactions on Signal and Information Processing 11.2
(2022).
[14] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and
C-C Jay Kuo. “DefakeHop: A Light-Weight High-Performance Deepfake Detector”. In: 2021 IEEE
International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6.
[15] Hong-Shuo Chen, Kaitai Zhang, Shuowen Hu, Suya You, and C-C Jay Kuo. “Geo-defakehop:
High-performance geographic fake image detection”. In: arXiv preprint arXiv:2110.09795 (2021).
[16] Tianqi Chen and Carlos Guestrin. “Xgboost: A scalable tree boosting system”. In: Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016,
pp. 785–794.
[17] Yueru Chen and C-C Jay Kuo. “PixelHop: A Successive Subspace Learning (SSL) Method for
Object Classification”. In: arXiv preprint arXiv:1909.08190 (2019).
[18] Yueru Chen and C-C Jay Kuo. “Pixelhop: A successive subspace learning (ssl) method for object
recognition”. In: Journal of Visual Communication and Image Representation 70 (2020), p. 102749.
[19] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. “Pixelhop++:
A small successive-subspace-learning-based (ssl-based) model for image classification”. In: 2020
IEEE International Conference on Image Processing (ICIP). IEEE. 2020, pp. 3294–3298.
[20] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo.
“StarGAN: Unified generative adversarial networks for multi-domain image-to-image
translation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2018, pp. 8789–8797.
[21] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–1258.
[22] Davide Cozzolino, Diego Gragnaniello, and Luisa Verdoliva. “Image forgery localization through
the fusion of camera-based, feature-based and pixel-based techniques”. In: 2014 IEEE International
Conference on Image Processing (ICIP). IEEE. 2014, pp. 5302–5306.
[23] Tiago José De Carvalho, Christian Riess, Elli Angelopoulou, Helio Pedrini, and
Anderson de Rezende Rocha. “Exposing digital image forgeries by illumination color
classification”. In: IEEE Transactions on Information Forensics and Security 8.7 (2013),
pp. 1182–1194.
[24] Deepfakes Are Going To Wreak Havoc On Society. We Are Not Prepared.
https://www.forbes.com/sites/robtoews/2020/05/25/deepfakes-are-going-to-wreak-havoc-on-soci
ety-we-are-not-prepared/?sh=db57ba174940. posted on May 25 2020.
[25] Deepfakes github. https://github.com/deepfakes/faceswap. 2018.
[26] Ahmet Emir Dirik and Nasir Memon. “Image tamper detection based on demosaicing artifacts”.
In: 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE. 2009, pp. 1497–1500.
[27] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and
Cristian Canton Ferrer. “The deepfake detection challenge (dfdc) dataset”. In: arXiv preprint
arXiv:2006.07397 (2020).
[28] FaceSwap github. https://github.com/MarekKowalski/FaceSwap. 2018.
[29] Fakeapp. https://www.fakeapp.com. 2018.
[30] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. “Concealed object detection”. In:
IEEE transactions on pattern analysis and machine intelligence 44.10 (2021), pp. 6024–6042.
[31] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao.
“Camouflaged object detection”. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. 2020, pp. 2777–2787.
[32] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. “Image forgery
localization via fine-grained analysis of CFA artifacts”. In: IEEE Transactions on Information
Forensics and Security 7.5 (2012), pp. 1566–1577.
[33] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. “Image forgery
localization via fine-grained analysis of CFA artifacts”. In: IEEE Transactions on Information
Forensics and Security 7.5 (2012), pp. 1566–1577.
[34] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and
Thorsten Holz. “Leveraging frequency analysis for deep fake image recognition”. In: International
Conference on Machine Learning. PMLR. 2020, pp. 3247–3258.
[35] Hongyu Fu, Yijing Yang, Vinod K Mishra, and C-C Jay Kuo. “Classification via Subspace Learning
Machine (SLM): Methodology and Performance Evaluation”. In: ICASSP 2023-2023 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.
[36] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. “Deepfake detection by analyzing
convolutional traces”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops. 2020, pp. 666–667.
[37] David Güera and Edward J Delp. “Deepfake video detection using recurrent neural networks”. In:
2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).
IEEE. 2018, pp. 1–6.
[38] Charles F Hall and Ernest L Hall. “A nonlinear model for the spatial characteristics of the human
visual system”. In: IEEE Transactions on systems, man, and cybernetics 7.3 (1977), pp. 161–170.
[39] Jing Hao, Zhixin Zhang, Shicai Yang, Di Xie, and Shiliang Pu. “Transforensics: image forgery
localization with dense self-attention”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2021, pp. 15055–15064.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep
convolutional networks for visual recognition”. In: IEEE transactions on pattern analysis and
machine intelligence 37.9 (2015), pp. 1904–1916.
[42] Ruozhen He, Qihua Dong, Jiaying Lin, and Rynson WH Lau. “Weakly-supervised camouflaged
object detection with scribble annotations”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 37. 1. 2023, pp. 781–789.
[43] Young-Jin Heo, Young-Ju Choi, Young-Woon Lee, and Byung-Gyu Kim. “Deepfake detection
scheme based on vision transformer and distillation”. In: arXiv preprint arXiv:2104.01353 (2021).
[44] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
“Gans trained by a two time-scale update rule converge to a local nash equilibrium”. In: Advances
in neural information processing systems 30 (2017).
[45] Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, and
Ram Nevatia. “SPAN: Spatial pyramid attention network for image manipulation localization”. In:
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XXI 16. Springer. 2020, pp. 312–328.
[46] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. “Image-to-Image Translation with
Conditional Adversarial Networks”. In: Computer Vision and Pattern Recognition (CVPR), 2017
IEEE Conference on. 2017.
[47] Krzysztof Janowicz, Song Gao, Grant McKenzie, Yingjie Hu, and Budhendra Bhaduri. “GeoAI:
spatially explicit artificial intelligence techniques for geographic knowledge discovery and
beyond”. In: vol. 34. 4. Taylor & Francis, 2020, pp. 625–636.
[48] Ge-Peng Ji, Deng-Ping Fan, Yu-Cheng Chou, Dengxin Dai, Alexander Liniger, and Luc Van Gool.
“Deep gradient learning for efficient camouflaged object detection”. In: Machine Intelligence
Research 20.1 (2023), pp. 92–108.
[49] Ge-Peng Ji, Lei Zhu, Mingchen Zhuge, and Keren Fu. “Fast camouflaged object detection via
edge-based reversible re-calibration network”. In: Pattern Recognition 123 (2022), p. 108414.
[50] Qi Jia, Shuilian Yao, Yu Liu, Xin Fan, Risheng Liu, and Zhongxuan Luo. “Segment, magnify and
reiterate: Detecting camouflaged objects the hard way”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 4713–4722.
[51] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. “Deeperforensics-1.0: A
large-scale dataset for real-world face forgery detection”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2020, pp. 2889–2898.
[52] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “R-PointHop: A Green, Accurate and
Unsupervised Point Cloud Registration Method”. In: arXiv preprint arXiv:2103.08129 (2021).
[53] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive growing of GANs for
improved quality, stability, and variation”. In: arXiv preprint arXiv:1710.10196 (2017).
[54] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila. “Alias-free generative adversarial networks”. In: Advances in Neural Information
Processing Systems 34 (2021).
[55] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative
adversarial networks”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2019, pp. 4401–4410.
[56] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila.
“Analyzing and improving the image quality of stylegan”. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 2020, pp. 8110–8119.
[57] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and
Tie-Yan Liu. “Lightgbm: A highly efficient gradient boosting decision tree”. In: Advances in neural
information processing systems 30 (2017).
[58] N Krawetz. “A picture’s worth: digital image analysis and forensics”. In: Black Hat Briefings 131.1
(2007), p. 31.
[59] C-C Jay Kuo. “Understanding convolutional neural networks with a mathematical model”. In:
Journal of Visual Communication and Image Representation 41 (2016), pp. 406–413.
[60] C-C Jay Kuo and Azad M Madni. “Green Learning: Introduction, Examples and Outlook”. In:
arXiv preprint arXiv:2210.00965 (2022).
[61] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. “Interpretable convolutional
neural networks via feedforward design”. In: Journal of Visual Communication and Image
Representation 60 (2019), pp. 346–359.
[62] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. “TGHop: An Explainable, Efficient
and Lightweight Method for Texture Generation”. In: arXiv preprint arXiv:2107.04020 (2021).
[63] Aixuan Li, Jing Zhang, Yunqiu Lv, Bowen Liu, Tong Zhang, and Yuchao Dai. “Uncertainty-aware
joint salient object and camouflaged object detection”. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2021, pp. 10071–10081.
[64] Ting Li, Kasper Johansen, and Matthew F McCabe. “A machine learning approach for identifying
and delineating agricultural fields and their multi-temporal dynamics using three decades of
Landsat data”. In: ISPRS Journal of Photogrammetry and Remote Sensing 186 (2022), pp. 83–101.
[65] Weihai Li, Yuan Yuan, and Nenghai Yu. “Passive detection of doctored JPEG image via block
artifact grid extraction”. In: Signal Processing 89.9 (2009), pp. 1821–1829.
[66] Yu Li, Sandro Martinis, and Marc Wieland. “Urban flood mapping with an active self-learning
convolutional neural network based on TerraSAR-X intensity and interferometric coherence”. In:
ISPRS Journal of Photogrammetry and Remote Sensing 152 (2019), pp. 178–191.
[67] Yuezun Li and Siwei Lyu. “Exposing DeepFake Videos By Detecting Face Warping Artifacts”. In:
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2019.
[68] Yuezun Li and Siwei Lyu. “Exposing deepfake videos by detecting face warping artifacts”. In:
arXiv preprint arXiv:1811.00656 (2018).
[69] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. “Celeb-df: A large-scale challenging
dataset for deepfake forensics”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2020, pp. 3207–3216.
[70] Zhouchen Lin, Junfeng He, Xiaoou Tang, and Chi-Keung Tang. “Fast, automatic and fine-grained
tampered JPEG image detection via DCT coefficient analysis”. In: Pattern Recognition 42.11 (2009),
pp. 2492–2501.
[71] Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. “Towards faster and stabilized
gan training for high-fidelity few-shot image synthesis”. In: International Conference on Learning
Representations. 2020.
[72] Jiawei Liu, Jing Zhang, and Nick Barnes. “Modeling aleatoric uncertainty for camouflaged object
detection”. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
2022, pp. 1445–1454.
[73] Weihuang Liu, Xi Shen, Chi-Man Pun, and Xiaodong Cun. “Explicit visual prompting for
low-level structure segmentations”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2023, pp. 19434–19445.
[74] Xiaofeng Liu, Fangxu Xing, Chao Yang, C-C Jay Kuo, Suma Babu, Georges El Fakhri,
Thomas Jenkins, and Jonghye Woo. “VoxelHop: Successive Subspace Learning for ALS Disease
Classification Using Structural MRI”. In: arXiv preprint arXiv:2101.05131 (2021).
[75] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan.
“Simultaneously localize, segment and rank the camouflaged objects”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 11591–11601.
[76] Siwei Lyu, Xunyu Pan, and Xing Zhang. “Exposing region splicing forgeries with blind local noise
estimation”. In: International journal of computer vision 110 (2014), pp. 202–221.
[77] Siwei Lyu, Xunyu Pan, and Xing Zhang. “Exposing region splicing forgeries with blind local noise
estimation”. In: International journal of computer vision 110 (2014), pp. 202–221.
[78] Lei Ma, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. “Deep learning
in remote sensing applications: A meta-analysis and review”. In: ISPRS journal of photogrammetry
and remote sensing 152 (2019), pp. 166–177.
[79] Babak Mahdian and Stanislav Saic. “Using noise inconsistencies for blind image forensics”. In:
Image and vision computing 27.10 (2009), pp. 1497–1503.
[80] Falko Matern, Christian Riess, and Marc Stamminger. “Exploiting visual artifacts to expose
deepfakes and face manipulations”. In: 2019 IEEE Winter Applications of Computer Vision
Workshops (WACVW). IEEE. 2019, pp. 83–92.
[81] H Mei, X Yang, Y Zhou, GP Ji, X Wei, and DP Fan. “Distraction-aware camouflaged object
segmentation”. In: SCIENTIA SINICA Informationis (SSI) (2023).
[82] Haiyang Mei, Ge-Peng Ji, Ziqi Wei, Xin Yang, Xiaopeng Wei, and Deng-Ping Fan. “Camouflaged
object segmentation with distraction mining”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2021, pp. 8772–8781.
[83] Masoud Monajatipoor, Mozhdeh Rouhsedaghat, Liunian Harold Li, Aichi Chien, C-C Jay Kuo,
Fabien Scalzo, and Kai-Wei Chang. “BERTHop: An Effective Vision-and-Language Model for
Chest X-ray Disease Diagnosis”. In: arXiv preprint arXiv:2108.04938 (2021).
[84] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, BS Manjunath, Shivkumar Chandrasekaran,
Arjuna Flenner, Jawadul H Bappy, and Amit K Roy-Chowdhury. “Detecting GAN generated fake
images using co-occurrence matrices”. In: Electronic Imaging 2019.5 (2019), pp. 532–1.
[85] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. “Multi-task learning for
detecting and segmenting manipulated facial images and videos”. In: arXiv preprint
arXiv:1906.06876 (2019).
[86] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. “Use of a capsule network to detect fake
images and videos”. In: arXiv preprint arXiv:1910.12467 (2019).
[87] Xunyu Pan, Xing Zhang, and Siwei Lyu. “Exposing image splicing with inconsistent local noise
variances”. In: 2012 IEEE International conference on computational photography (ICCP). IEEE.
2012, pp. 1–10.
[88] Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. “Zoom in and out: A
mixed-scale triplet network for camouflaged object detection”. In: Proceedings of the IEEE/CVF
Conference on computer vision and pattern recognition. 2022, pp. 2160–2170.
[89] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. “Semantic image synthesis with
spatially-adaptive normalization”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2019, pp. 2337–2346.
[90] Xuebin Qin, Deng-Ping Fan, Chenyang Huang, Cyril Diagne, Zichen Zhang,
Adrià Cabeza Sant’Anna, Albert Suarez, Martin Jagersand, and Ling Shao. “Boundary-aware
segmentation network for mobile and web applications”. In: arXiv preprint arXiv:2101.04704
(2021).
[91] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
“Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[92] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and
Matthias Nießner. “Faceforensics++: Learning to detect manipulated facial images”. In:
Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 1–11.
[93] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and
Matthias Nießner. “Faceforensics: A large-scale video dataset for forgery detection in human
faces”. In: arXiv preprint arXiv:1803.09179 (2018).
[94] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo.
“FaceHop: A Light-Weight Low-Resolution Face Gender Classification Method”. In: arXiv preprint
arXiv:2007.09510 (2020).
[95] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo.
“FaceHop: A light-weight low-resolution face gender classification method”. In: International
Conference on Pattern Recognition. Springer. 2021, pp. 169–183.
[96] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. “Low-Resolution
Face Recognition In Resource-Constrained Environments”. In: arXiv preprint arXiv:2011.11674
(2020).
[97] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. “Low-resolution
face recognition in resource-constrained environments”. In: Pattern Recognition Letters (2021).
[98] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and
Prem Natarajan. “Recurrent convolutional strategies for face manipulation detection in videos”.
In: Interfaces (GUI) 3.1 (2019).
[99] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. “Dynamic routing between capsules”. In:
Advances in neural information processing systems 30 (2017).
[100] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. “Image splicing localization using a multi-task
fully convolutional network (MFCN)”. In: Journal of Visual Communication and Image
Representation 51 (2018), pp. 201–209.
[101] Selim Seferbekov. A prize winning solution for DFDC challenge.
https://github.com/selimsef/dfdc_deepfake_challenge. 2020.
[102] Yujia Sun, Geng Chen, Tao Zhou, Yi Zhang, and Nian Liu. “Context-aware cross-level fusion
network for camouflaged object detection”. In: arXiv preprint arXiv:2105.12555 (2021).
[103] Yujia Sun, Shuo Wang, Chenglizhao Chen, and Tian-Zhu Xiang. “Boundary-guided camouflaged
object detection”. In: arXiv preprint arXiv:2207.00794 (2022).
[104] Zekun Sun, Yujie Han, Zeyu Hua, Na Ruan, and Weijia Jia. “Improving the efficiency and
robustness of deepfakes detection through precise geometric features”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 3609–3618.
[105] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.
“Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. 2016, pp. 2818–2826.
[106] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural
networks”. In: International conference on machine learning. PMLR. 2019, pp. 6105–6114.
[107] Ruben Tolosana, Sergio Romero-Tapiador, Julian Fierrez, and Ruben Vera-Rodriguez. “DeepFakes
Evolution: Analysis of Facial Regions and Fake Detection Performance”. In: arXiv preprint
arXiv:2004.07532 (2020).
[108] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and
Javier Ortega-Garcia. “Deepfakes and beyond: A survey of face manipulation and fake detection”.
In: arXiv preprint arXiv:2001.00179 (2020).
[109] Van-Nhan Tran, Suk-Hwan Lee, Hoanh-Su Le, and Ki-Ryong Kwon. “High Performance deepfake
video detection on CNN-based with attention target-specific regions and manual distillation
extraction”. In: Applied Sciences 11.16 (2021), p. 7678.
[110] Matthew Turk and Alex Pentland. “Eigenfaces for recognition”. In: Journal of cognitive
neuroscience 3.1 (1991), pp. 71–86.
[111] Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, and
Yu-Gang Jiang. “Objectformer for image manipulation detection and localization”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 2364–2373.
[112] Kang Wang, Hongbo Bi, Yi Zhang, Cong Zhang, Ziqi Liu, and Shuang Zheng. “D^2C-Net: A
Dual-Branch, Dual-Guidance and Cross-Refine Network for Camouflaged Object Detection”. In:
IEEE Transactions on Industrial Electronics 69.5 (2021), pp. 5364–5374.
[113] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros.
“CNN-generated images are surprisingly easy to spot... for now”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2020, pp. 8695–8704.
[114] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. “Mantra-net: Manipulation tracing
network for detection and localization of image forgeries with anomalous features”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019,
pp. 9543–9552.
[115] Zongwei Wu, Danda Pani Paudel, Deng-Ping Fan, Jingjing Wang, Shuo Wang,
Cédric Demonceaux, Radu Timofte, and Luc Van Gool. “Source-free depth for object pop-out”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 1032–1042.
[116] Fan Yang, Qiang Zhai, Xin Li, Rui Huang, Ao Luo, Hong Cheng, and Deng-Ping Fan.
“Uncertainty-guided transformer reasoning for camouflaged object detection”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision. 2021, pp. 4146–4155.
[117] Xin Yang, Yuezun Li, and Siwei Lyu. “Exposing deep fakes using inconsistent head poses”. In:
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE. 2019, pp. 8261–8265.
[118] Yijing Yang, Hongyu Fu, and C-C Jay Kuo. “Design of supervision-scalable learning systems:
Methodology and performance benchmarking”. In: Journal of Visual Communication and Image
Representation 96 (2023), p. 103925.
[119] Yijing Yang, Wei Wang, Hongyu Fu, C-C Jay Kuo, et al. “On supervised feature selection from high
dimensional feature spaces”. In: APSIPA Transactions on Signal and Information Processing 11.1 ().
[120] Shuiming Ye, Qibin Sun, and Ee-Chien Chang. “Detecting digital image forgeries by measuring
inconsistencies of blocking artifact”. In: 2007 IEEE International Conference on Multimedia and
Expo. Ieee. 2007, pp. 12–15.
[121] Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, Deng-Ping Fan, and Luc Van Gool.
“Camoformer: Masked separable attention for camouflaged object detection”. In: arXiv preprint
arXiv:2212.06570 (2022).
[122] Markos Zampoglou, Symeon Papadopoulos, and Yiannis Kompatsiaris. “Large-scale evaluation of
splicing localization algorithms for web images”. In: Multimedia Tools and Applications 76.4
(2017), pp. 4801–4834.
[123] Qiang Zhai, Xin Li, Fan Yang, Chenglizhao Chen, Hong Cheng, and Deng-Ping Fan. “Mutual
graph learning for camouflaged object detection”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2021, pp. 12997–13007.
[124] Cong Zhang, Kang Wang, Hongbo Bi, Ziqi Liu, and Lina Yang. “Camouflaged object detection via
neighbor connection and hierarchical information transfer”. In: Computer Vision and Image
Understanding 221 (2022), p. 103450.
[125] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay Kuo.
“AnomalyHop: An SSL-based Image Anomaly Localization Method”. In: arXiv preprint
arXiv:2105.03797 (2021).
[126] Miao Zhang, Shuang Xu, Yongri Piao, Dongxiang Shi, Shusen Lin, and Huchuan Lu. “Preynet:
Preying on camouflaged objects”. In: Proceedings of the 30th ACM International Conference on
Multimedia. 2022, pp. 5323–5332.
[127] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Unsupervised feedforward feature (UFF)
learning for point cloud classification and segmentation”. In: 2020 IEEE International Conference
on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 144–147.
[128] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “PointHop++: A lightweight
learning model on point sets for 3d classification”. In: 2020 IEEE International Conference on Image
Processing (ICIP). IEEE. 2020, pp. 3319–3323.
[129] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “PointHop: An explainable
machine learning method for point cloud classification”. In: IEEE Transactions on Multimedia 22.7
(2020), pp. 1744–1755.
[130] Qiao Zhang, Yanliang Ge, Cong Zhang, and Hongbo Bi. “TPRNet: camouflaged object detection
via transformer-induced progressive refinement network”. In: The Visual Computer (2022),
pp. 1–15.
[131] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. “Detecting and simulating artifacts in GAN fake
images”. In: 2019 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE.
2019, pp. 1–6.
[132] Bo Zhao, Shaozeng Zhang, Chunxue Xu, Yifan Sun, and Chengbin Deng. “Deep fake geography?
When geospatial data encounter Artificial Intelligence”. In: Cartography and Geographic
Information Science 48.4 (2021), pp. 338–352.
[133] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu.
“Multi-attentional deepfake detection”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 2185–2194.
[134] Zhuo Zheng, Ailong Ma, Liangpei Zhang, and Yanfei Zhong. “Deep multisensor learning for
missing-modality all-weather mapping”. In: ISPRS Journal of Photogrammetry and Remote Sensing
174 (2021), pp. 254–264.
[135] Yijie Zhong, Bo Li, Lv Tang, Senyun Kuang, Shuang Wu, and Shouhong Ding. “Detecting
camouflaged object in frequency domain”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2022, pp. 4504–4513.
[136] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. “Two-stream neural networks for
tampered face detection”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW). IEEE. 2017, pp. 1831–1839.
[137] Tao Zhou, Yi Zhou, Chen Gong, Jian Yang, and Yu Zhang. “Feature aggregation and propagation
network for camouflaged object detection”. In: IEEE Transactions on Image Processing 31 (2022),
pp. 7036–7047.
[138] Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. “Face Forensics in the Wild”.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 5778–5788.
[139] Zhiruo Zhou, Hongyu Fu, Suya You, Christoph C Borel-Donohue, and C-C Jay Kuo. “Uhp-sot: An
unsupervised high-performance single object tracker”. In: 2021 International Conference on Visual
Communications and Image Processing (VCIP). IEEE. 2021, pp. 1–5.
[140] Zhiruo Zhou, Hongyu Fu, Suya You, and C-C Jay Kuo. “Gusot: Green and unsupervised single
object tracking for long video sequences”. In: 2022 IEEE 24th International Workshop on
Multimedia Signal Processing (MMSP). IEEE. 2022, pp. 1–6.
[141] Zhiruo Zhou, Hongyu Fu, Suya You, C-C Jay Kuo, et al. “Uhp-sot++: An unsupervised lightweight
single object tracker”. In: APSIPA Transactions on Signal and Information Processing 11.1 ().
[142] Zhiruo Zhou, Hongyu Fu, Suya You, and C-C Jay Kuo. “Unsupervised lightweight single object
tracking with uhp-sot++”. In: arXiv preprint arXiv:2111.07548 (2021).
[143] Hongwei Zhu, Peng Li, Haoran Xie, Xuefeng Yan, Dong Liang, Dapeng Chen, Mingqiang Wei,
and Jing Qin. “I can find you! boundary-guided separated attention network for camouflaged
object detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 3. 2022,
pp. 3608–3616.
[144] Jinchao Zhu, Xiaoyu Zhang, Shuo Zhang, and Junnan Liu. “Inferring camouflaged objects by
texture-aware interactive guidance network”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 35. 4. 2021, pp. 3599–3607.
[145] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. “Unpaired image-to-image
translation using cycle-consistent adversarial networks”. In: Proceedings of the IEEE international
conference on computer vision. 2017, pp. 2223–2232.
[146] Yao Zhu, Saksham Suri, Pranav Kulkarni, Yueru Chen, Jiali Duan, and C-C Jay Kuo. “An
interpretable generative model for handwritten digits synthesis”. In: 2019 IEEE International
Conference on Image Processing (ICIP). IEEE. 2019, pp. 1910–1914.
[147] Yao Zhu, Xinyu Wang, Hong-Shuo Chen, Ronald Salloum, and C-C Jay Kuo. “A-pixelhop: A
green, robust and explainable fake-image detector”. In: ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 8947–8951.
[148] Yao Zhu, Xinyu Wang, Hong-Shuo Chen, Ronald Salloum, and C-C Jay Kuo. “Green Steganalyzer:
A Green Learning Approach to Image Steganalysis”. In: arXiv preprint arXiv:2306.04008 (2023).
[149] Yao Zhu, Xinyu Wang, Ronald Salloum, Hong-Shuo Chen, C-C Jay Kuo, et al. “RGGID: A Robust
and Green GAN-Fake Image Detector”. In: APSIPA Transactions on Signal and Information
Processing 11.2 (2022).
[150] Mingchen Zhuge, Xiankai Lu, Yiyou Guo, Zhihua Cai, and Shuhan Chen. “CubeNet: X-shape
connection for camouflaged object detection”. In: Pattern Recognition 127 (2022), p. 108644.
[151] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. “Wilddeepfake: A
challenging real-world dataset for deepfake detection”. In: Proceedings of the 28th ACM
International Conference on Multimedia. 2020, pp. 2382–2390.
Abstract
In the current technological era, the advancement of AI models has not only driven innovation but also heightened concerns over environmental sustainability due to increased energy and water usage. For context, the water consumption equivalent to a 500ml bottle is tied to 10 to 50 responses from a model like GPT-3, and projections suggest that by 2027, AI could consume an estimated 85 to 134 TWh of electricity per year, with its water withdrawal potentially exceeding half of the United Kingdom's annual total. In light of these challenges, there is an urgent call for AI solutions that are environmentally friendly, characterized by lower energy consumption through fewer floating-point operations (FLOPs), more compact designs, and the ability to run independently on mobile devices without depending on server-based infrastructures.
In this work, I present "GreenCOD," an innovative method for Camouflaged Object Detection that couples XGBoost with DNN-based feature extraction, offering a computationally lighter alternative to complex DNN architectures without relying on backpropagation for training. Alongside, the "Green Image Forgery Technique" (GIFT) is introduced as a gradient-boosting approach for efficient image forgery detection, leveraging EfficientNet's features and enhanced by edge supervision for precise forgery boundary identification. GreenCOD and GIFT demonstrate exceptional performance on various datasets, surpassing many advanced methods while significantly reducing computational load.