Detecting Semantic Manipulations in
Natural and Biomedical Images
by
Ekraam Sabir
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2022
Copyright 2022 Ekraam Sabir
“It always seems impossible until it’s done.”
– Nelson Mandela
Dedicated to my parents,
for their untold sacrifices and unconditional love.
Acknowledgements
As a practicing Muslim, I owe all of my accomplishments and gratitude to Allah (SWT) for placing
me in this extraordinary position of honor and success. I sincerely believe that I have come this
far in life due to the kindness, generosity and forgiveness shown to me by people I have been
fortunate to meet. While there are several inspirational people who achieve success despite the
odds, the highlights of my life, including this PhD have been enabled by people around me and for
that they have my deepest gratitude. I’m grateful to everyone for their continued love and support.
I would like to thank my advisor Prof. Premkumar Natarajan for giving me the opportunity of
a lifetime by accepting me as his PhD student. He has been a kind, considerate and understanding
advisor in my academic journey. I have found him to share in the celebration of others’ accomplish-
ments, be forgiving of mistakes, demonstrate effortless eloquence and exhibit keen perception. He
has a disarmingly lighthearted attitude that takes the stress off a person even on the roughest of
days and makes you want to be a part of his team. During my PhD I have come across a multi-
tude of people with great intellect, to the point that intelligence may appear to be commonplace.
However, a genuinely caring and considerate person is still a rare find and by that measure, Prof.
Natarajan is a gem.
A PhD is a journey of apprenticeship, and in that sense, I have truly been an apprentice to Prof.
Wael AbdAlmageed. He has always encouraged and pushed me to achieve more. This thesis
would be nowhere near its current form if not for his high expectations of me. From weekly
meetings to paper reviews, I am deeply grateful for his mentorship throughout my PhD.
I owe my development as a researcher to the mentorship of Dr. Yue (Rex) Wu and Stephen
Rawls, especially during the early years of my PhD. Stephen taught me the engineering aspects of
research – setting up experiments, building repositories, downloading dependencies, the importance
of GitHub and, most importantly, how to debug! Rex was instrumental in developing my scientific
thinking. Working closely with him on a paper gave me confidence for the first time that I could
carry a research idea from its inception to publication.
I would like to sincerely thank Prof. Aiichiro Nakano, Prof. Cauligi Raghavendra, Prof. Emilio
Ferrara, Prof. Iacopo Masi, Prof. Ram Nevatia and Prof. Aram Galstyan for serving on my
committee at different times. Committee work is purely voluntary, and they took time out from
their busy schedules, sometimes on short notice, to provide valuable feedback.
My PhD has provided me with the opportunity to meet and work with some amazing students
– Ayush Jaiswal, Emily Sheng, Jiaxin Cheng, Soumyaroop Nandi, I-Hung Hsu, Zekun Li, Joe
Mathai, Xiao Guo, Hanchen Xie, Hengameh Mirzaalian, Kuan Liu and others. From working
together on papers to random lunches around ISI, I’ve done it all with this group of amazing
people. I would like to extend a special thanks to Ayush (my senpai!) for his invaluable feedback
and guidance on almost every paper that I have published.
I would like to extend my thanks to Karen Rawlins and Lizsl De Leon for helping me out with
so many things. I have pestered Karen for almost everything from reimbursements to meeting
appointments with Prof. Natarajan and she has always been patient and gracious in helping me
out. Throughout the years, Lizsl provided clarifications and helped me navigate the logistics of
getting through my PhD. She always makes life easier for PhD students.
I was extremely lucky to have the support and friendship of Muazzam Idris and Dhruva Kartik.
They have been the voices of reason and upliftment through the ups and downs of my life and PhD.
Whether I hit rock-bottom or just needed to vent, I have found sound advice and emotional support
from them. Coping with the stress of research would have been near impossible without them.
I would also like to thank several friends and roommates who I was fortunate enough to have
met during my stay at USC – Abdul Qadeer, Naveed Ahmed Abbasi, Rizwan Saeed, Aaqib Ismail,
Shahid Md. Shaikbepari, Md. Junaid Hundekar, Aman Maqbool, Nikhil Bhambri, Md. Saleh,
Daniel Rojas, Sufiyan Khan, Ahsen Javed, Abdul Quadeer and others.
The continued support of my friends from undergraduate years – Masterji, Bhalla, Gandhi, Jain
bandhu, Sukku boy, Aayushmaan, Rishav and others is deeply appreciated. Pursuing a PhD would
have been much more difficult if not for their constant motivation, group chats and belief in me.
I am sincerely grateful to my undergraduate professors – Prof. Savitha G. Kini, Prof. B.
K. Singh, Prof. Ciji Pearl Kurian, James Pinto, Vineeth Patil and others. Their encouragement,
teaching and belief in me were instrumental in helping me pursue my PhD.
I also want to extend my special thanks to Dr. Elisabeth M. Bik for providing me with raw
annotated datasets, answering my questions and improving my understanding of the domain of
biomedical images. Her input and feedback were critical in the development of this thesis.
Finally, any acknowledgement would be incomplete without gratitude towards my family. My
parents always believed in me and gave me the support and encouragement to achieve my dreams.
From buying me my first book to supporting my decision to pursue education outside India, they
have always selflessly prioritized my education and career. I owe many thanks to my sister as well,
for being a friend and tolerating my endless pranks.
Table of Contents
Epigraph
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Motivation
1.2 Outline
1.3 Contributions
1.4 Definitions and Notations
Chapter 2: Background
2.1 Natural Image Manipulation
2.1.1 Social Impact of Misinformation
2.1.2 Analyzing the Spread of Misinformation
2.1.3 Detection of Online Misinformation
2.1.4 Digital Image Forensics
2.2 Biomedical Image Manipulation
2.2.1 Scientific Misconduct
2.2.2 Misappropriated Images in Biomedical Literature
2.2.3 Recommendations to Preserve Research Integrity
2.2.4 Biomedical-Image Forensics
2.2.5 Computer Vision in Biomedical Domain
Chapter 3: Leveraging Semantic Inconsistency for Repurposing Detection
3.1 Introduction
3.2 Semantic Integrity Assessment
3.2.1 Deep Multimodal Representation Learning
3.2.1.1 Multimodal Autoencoder
3.2.1.2 Bidirectional (Symmetrical) Deep Neural Network
3.2.1.3 Unified Visual Semantic Neural Language Model
3.2.2 Outlier Detection
3.2.2.1 One-Class Support Vector Machine
3.2.2.2 Isolation Forest
3.3 Data
3.4 Analysis
3.5 Summary
Chapter 4: An Evidence Based Approach to Repurposing Detection
4.1 Introduction
4.2 Multimodal Entity Image Repurposing Dataset (MEIR)
4.3 Image Repurposing Detection
4.3.1 Modality Importance
4.3.2 Deep Manipulation Detection Model
4.4 Experimental Evaluation
4.5 Summary
Chapter 5: Multi-Evidence Graph Neural Network
5.1 Introduction
5.2 Architecture Design
5.2.1 Package Retrieval
5.2.2 Feature Extraction
5.2.3 Attention-based Evidence Matching
5.2.4 Modality Summary
5.2.5 Manipulation Detection
5.2.6 Implementation Details
5.3 Evaluation
5.3.1 Benchmark Datasets
5.3.2 Results and Analysis
5.4 Summary
Chapter 6: Biomedical Image Forensics
6.1 Introduction
6.2 BioFors Benchmark
6.2.1 Image Collection Procedure and Statistics
6.2.2 Dataset Description
6.2.3 Manipulation Detection Tasks in BioFors
6.3 Why is Biomedical Forensics Hard?
6.4 Evaluation and Benchmarking
6.4.1 Metrics
6.4.2 Baseline Models
6.4.3 Synthetic Data Generation
6.4.4 Results
6.5 Ethical Considerations
6.6 Summary
Chapter 7: Repurposing Detection in Biomedical Images
7.1 Introduction
7.2 Architecture Design
7.3 Experiments
7.3.1 Dataset and Metrics
7.3.2 Synthetic Data Generation
7.3.3 Results
7.4 Analysis
7.5 Summary
Chapter 8: Conclusion
8.1 Summary of Contributions
8.2 Future Work
8.3 Supporting Papers
8.4 Other Papers
References
Appendices
E Chapter 6 Appendix
E.1 Experiment Details
E.2 Results with F1 metric
List of Tables
2.1 Comparison of biomedical forensic methods
3.1 Joint Modelling of Image-Text: Evaluation results on Flickr30K
3.2 Joint Modelling of Image-Text: Evaluation results on MS COCO
3.3 Joint Modelling of Image-Text: Evaluation results on MAIM
4.1 Examples of location, person and organization manipulations from MEIR
4.2 DMM: Retrieval accuracy of related packages
4.3 DMM: Ablation results
4.4 DMM: Evaluation results on MEIR dataset
4.5 DMM: Predicted samples from MEIR
4.6 DMM: Evaluation results on MAIM dataset
4.7 DMM: Evaluation results on MEIR with missing modalities
5.1 MEG: Scalability ablation
5.2 MEG: Order invariance ablation
5.3 MEG: Ablation results
5.4 MEG: Evaluation results
5.5 MEG: Effect of package retrieval accuracy on performance
6.1 BioFors: Dataset split
6.2 BioFors: Image classification accuracy
6.3 BioFors: Test set distribution by task
6.4 BioFors: Image level evaluation on EDD task
6.5 BioFors: Pixel level evaluation on EDD task
6.6 BioFors: Evaluation results on IDD task
6.7 BioFors: Evaluation results on CSTD task
7.1 MONet: Reduction in patch comparisons
7.2 MONet: Image level evaluation results on EDD task
7.3 MONet: Pixel level evaluation results on EDD task
7.4 MONet: Evaluation results on IDD task
7.5 MONet: Ablation results
7.6 MONet: Overlap Detection Network Architecture
8.1 BioFors: Image level F1 scores for EDD task
8.2 BioFors: Pixel level F1 scores for EDD task
8.3 BioFors: F1 scores for IDD task
List of Figures
1.1 An example of semantic image repurposing
1.2 Examples of biomedical image manipulation
1.3 Multimedia package example
3.1 Package integrity assessment system
3.2 Joint Modelling of Image-Text: Multimodal autoencoder architecture
3.3 Joint Modelling of Image-Text: BiDNN architecture
3.4 Joint Modelling of Image-Text: VSM architecture
3.5 Examples from MAIM dataset
4.1 Examples of real-world manipulated images
4.2 DMM: Average feature importance of modalities
4.3 DMM: Model architecture
4.4 Baseline: SRS model
5.1 MEG: Model architecture
5.2 MEG: True and false positive predictions on Painters dataset
5.3 MEG: True and false positive predictions on Google Landmarks dataset
5.4 MEG: True positive predictions on MEIR dataset
5.5 MEG: False negative predictions on MEIR dataset
6.1 BioFors: Image extraction from figures
6.2 BioFors: Image distribution across documents
6.3 BioFors: Size variation of images
6.4 BioFors: Samples by image class
6.5 BioFors: EDD task samples
6.6 BioFors: Manipulation orientation distribution
6.7 BioFors: IDD task samples
6.8 BioFors: CSTD task samples
6.9 BioFors: Gamma correction of images
6.10 BioFors: Annotation artifacts in images
6.11 BioFors: Chemically stained images
6.12 BioFors: Zoomed images
6.13 BioFors: Keypoint detection
6.14 BioFors: Hard negative samples
6.15 BioFors: Baseline for CSTD task
6.16 BioFors: Synthetic data for EDD task
6.17 BioFors: Synthetic data for IDD task
6.18 BioFors: Synthetic data for CSTD task
6.19 BioFors: Predicted EDD samples
6.20 BioFors: Predicted IDD samples
6.21 BioFors: Predicted CSTD samples
7.1 MONet: Model architecture
7.2 MONet: Predicted samples
7.3 MONet: False positive samples
7.4 MONet: Mislabeled samples
Abstract
Malicious and falsified digital content has become a powerful conveyor of false information that
is not just a nuisance but a threat to open societies worldwide. Disinformation articles often rely
on manipulated images as “evidence”, making it important to develop methods for detecting the
misuse of images. Such inappropriate image use can be broadly classified into two categories:
(1) semantic forgery, i.e., reusing or repurposing an image by falsifying its context, and (2) digi-
tal forgery, i.e., modifying the image itself to achieve an end purpose. Compared to digital image
forensics, research in semantic forgery detection is relatively new and sparse. For semantic forgery
detection of natural images in a social media context, we introduced a dataset that simulates per-
son, location and organization repurposing of images. We also developed deep-learning methods
to detect image repurposing by leveraging a trusted knowledge base. Additionally, research into
image forensics has been limited to natural images. There are instances of digital and semantic
forgery beyond natural images such as in the biomedical domain, where images are manipulated
to misrepresent experimental results. In order to promote research beyond natural image foren-
sics, we introduce a dataset comprising biomedical image manipulations along with a taxonomy
of semantic and digital manipulation detection tasks. Through our extensive evaluation of state-of-
the-art digital image forensics models, we found that existing algorithms developed on common
computer vision datasets are not robust when applied to biomedical images. To address seman-
tic forgeries in the biomedical domain, we developed a multi-scale overlap detection model that
achieves state-of-the-art performance across multiple categories of biomedical images.
Chapter 1
Introduction
1.1 Motivation
The internet has made the world an open forum of ideas and information. While the accessibility and volume
of information have certainly increased, its authenticity is often lost in the fog. The spread of digital
misinformation (falsified information) has negatively impacted virtually all aspects of society. The
perils of misinformation extend to our politics (2016 US election interference (Bovet et al. 2019)),
healthcare (COVID-19 misinformation (Tasnim et al. 2020)), economy (effect of fake news on the
stock market (Kogan et al. 2020)) and education (fake research published by scientists (Bar-Ilan
et al. 2021)). As such, it is imperative to develop strategies to combat the spread of misinforma-
tion. However, there are major hurdles associated with combating misinformation, due to its (1)
diverse manifestation, (2) credibility and (3) scale. Misinformation is diverse and often multi-modal
in nature, comprising text, images and other metadata, and the falsified information may be all or
a part of it. Secondly, falsified information is often substantiated with evidence to provide a cloak
of plausibility. This requires focusing on details for authentication. Finally, the scale of misin-
formation is almost as vast as the internet itself, which necessitates the development of automated
solutions.
The scientific community has responded actively to the threat of misinformation with a special
emphasis on social media content. Current research tackles multiple modalities of information such
as text (analyzing the spread of misinformation (Vosoughi et al. 2018), bot detection (Kudugunta
Figure 1.1: A real-world example of semantic image repurposing. A decade-old photo was reused
to misattribute its location to the Amazon rainforest.
et al. 2018)), images (copy-move and splicing detection (Y. Wu et al. 2018, 2019)), videos (deep-
fake detection (Sabir et al. 2019; Masi et al. 2020)) and multi-modal articles (rumor detection
(Singh et al. 2021)). However, to the best of our knowledge, the problem of detecting pristine im-
ages accompanied by falsified metadata (text, location etc.) has been neglected. We refer to the
problem of detecting semantically manipulated images as image repurposing detection. To further
compound the problem, repurposed images may be accompanied by semantically consistent meta-
data. Semantic consistency lends credibility to otherwise falsified information. Figure 1.1 shows
an example [1] of semantic image repurposing. Existing research on rumor or fake news detection
relies on a mixed bag of features such as digital image manipulation (Z. Jin et al. 2017), visual-text
discrepancy (Zhiwei Jin et al. 2017) or other language cues (Ma et al. 2016). However, as shown in
Figure 1.1, real-world instances found on social media may lack such features. Hence, verification
of repurposed images on social media requires the development of appropriate solutions involving
corroboration with an external knowledge base.
While the scientific community has focused on understanding and mitigating misinformation
on social media, research itself is not free from it. A famous case is that of Dr. Hwang Woo Suk,
who falsely claimed to have successfully cloned human stem cells (Gottweis et al. 2006). His bold
assertion drew further scrutiny, which ended up exposing him. Sadly, this is not an isolated case.
[1] https://inews.co.uk/news/environment/amazon-fire-rainforest-burning-brazil-photos-decades-old-329965
Figure 1.2: Real-world examples of suspicious duplications in biomedical images. The top and
bottom rows show duplications between images within the same document (Esfandiari et al., SCD
2012) and across different documents (Meyfour et al., J. Proteome Res. 2017; Ghiasi et al., J. Cell
Physiol. 2018), respectively.
Falsified research can potentially damage the trust at the core of the symbiotic relationship between
society and science. Additionally, (Stern et al. 2014) estimate a financial loss of $392,582 for each
retracted article. Efforts to mitigate such occurrences are necessary. Of the various scientific
domains, the biomedical research community has encountered repeated paper retractions (Alberts
et al. 2015) due to manipulated images. A common manifestation of misconduct is the presence of
repurposed and forged biomedical images in a bid to present non-existent experimental findings.
Figure 1.2 shows duplicated images within and across publications. Bik et al. 2016 analyzed over
20,000 papers and found 3.8% of these to contain at least one manipulation. However, detection
of manipulated biomedical images can be more challenging for a human than natural images on
social media due to the presence of arbitrary and confusing patterns and lack of real-world semantic
context. Furthermore, manipulation detection tasks that comprehensively categorize repurposed or
forged biomedical images are not defined in the literature. Existing research efforts to tackle this
problem are stunted by the lack of structured problem definitions or publicly benchmarked datasets
(Sabir et al. 2021a). This thesis introduces structured task definitions and datasets to advance
the general field of biomedical image forensics. Additionally, a solution to detect repurposed
biomedical images is also introduced.
1.2 Outline
This thesis introduces the problem of detecting semantically misappropriated images, i.e., image
repurposing detection. We introduce datasets and models to tackle this problem. The first half of
the thesis focuses on the problem in the context of natural images generated on social media. The
second half of the thesis focuses on biomedical images from scientific documents.
A background on existing research and problems pertaining to image forensics is provided in
Chapter 2. The chapter is divided into two sections to discuss the literature and challenges around
natural and biomedical image forensics.
Chapter 3 introduces a simplified version of the problem comprising image-caption pairs with
semantic inconsistency. A test dataset derived from standard computer vision datasets is intro-
duced. The chapter explores suitable joint modelling of image-caption pairs followed by outlier
detection methods for image repurposing classification. This chapter is based on the work from
Jaiswal et al. 2017.
In Chapter 4 a more realistic version of the problem is introduced. Instead of semantically
inconsistent image-caption pairs, images are repurposed by manipulating named entities within
the metadata. The proposed problem also extends the metadata to include location information
in the form of GPS coordinates. A multi-task learning model is introduced that utilizes evidence
from an external knowledge base for verifying images. This chapter is based on work from Sabir
et al. 2018.
The model introduced in Chapter 4 utilizes a single piece of evidence from the external knowledge
base. However, there is potential for utilizing multiple pieces of evidence to improve image repurposing
detection performance. In Chapter 5, a graph neural network based approach is introduced for
incorporating multiple pieces of evidence in the verification step. This chapter is based on work from Sabir
et al. 2021b.
The second half of the thesis shifts to detecting manipulated biomedical images extracted from
research documents. Chapter 6 overcomes the shortcomings of problem definitions in existing litera-
ture by introducing manipulation detection tasks that are computationally friendly and cover most
manipulations. The tasks are backed up by a dataset of manipulated biomedical images. The chap-
ter also introduces appropriate baselines and metrics for benchmarking. This chapter is based on
work from Sabir et al. 2021a.
Chapter 7 introduces a multi-scale hierarchical model for detecting repurposed or partially
duplicated biomedical images. The model is evaluated on two of the three tasks introduced in
Chapter 6. Considerations around hierarchical model design are discussed in the chapter. The
contents of this chapter are based on work that is under review at ICIP 2022.
Finally, in Chapter 8 we provide a summary of contributions and discuss the limitations and
avenues for future research. The section also includes a list of supporting publications towards this
thesis and additional publications from my PhD. The appendix section includes additional details
as referenced by chapter.
1.3 Contributions
Research on manipulated content generated on social media is multifaceted with unique methods
to understand and tackle different aspects of the problem. Biomedical image forensics research, on
the other hand, has sparse literature with ill-defined or narrow problem statements. We define and
tackle a problem comprising pristine images and falsified metadata, both for natural images found
on social media and for biomedical images from research documents. The contributions of this thesis
are as follows:
1. A structured problem definition and framework for evaluating the image repurposing prob-
lem for natural images in the context of social media. The framework introduces the concept
of using an external knowledge base for verifying information.
2. Providing three structured tasks that comprehensively cover manipulations found and dis-
cussed in biomedical literature. The tasks are different from natural image forensics tasks, such
as splicing and copy-move detection, yet aligned to incorporate natural image forensics
methods for benchmarking and evaluation.
3. Datasets and models for benchmarking and evaluation of image repurposing detection in
natural and biomedical images.
1.4 Definitions and Notations
Abbreviations Some common abbreviations that reoccur in this thesis or are commonly used
in deep-learning literature are: Convolutional Neural Network (CNN) and Long Short-Term Memory
(LSTM).
Natural Image A natural image refers to an RGB image of common objects and scenes taken
with a regular camera. The contents of a natural image are usually consistent with our semantic
understanding of the world. For example, an image of a tree is expected to have its roots towards
the ground and its canopy towards the sky.
Biomedical Image In the context of this thesis, biomedical images represent the results of actual
experiments and are extracted from biomedical research documents. The term does not include
synthesized images in documents, such as plots, graphs, flowcharts and tables.
Multimedia Package Real-world multimedia data is often composed of multiple modalities such
as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such
Figure 1.3: Multimodal information manipulation example. This is a photograph of the Eiffel
Tower in Las Vegas, but the caption says France.
multimodal data packages are referred to as multimedia packages in this work. Figure 1.3 shows
an example of a manipulated multimedia package.
Image Repurposing Detection Multimedia packages may be prone to manipulations, where a
subset of the modalities associated with the image can be altered to repurpose it, with possible
malicious intent. We refer to the problem of detecting such packages as image repurposing detec-
tion. The term repurposing refers to the misrepresentation of an image by semantic manipulation
of associated data.
Reference Dataset A reference dataset (RD) is a knowledge base against which a query multimedia
package can be verified. It is expected to contain information related to the query package for
assessment.
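To make these definitions concrete, the following is a minimal sketch of how a multimedia package and a reference dataset could be represented in code. It is purely illustrative: the class and method names (MultimediaPackage, ReferenceDataset, retrieve_related) are hypothetical and not part of any system described in this thesis.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MultimediaPackage:
    """A bundle of modalities describing one entity or event."""
    image_path: str                              # pristine image; pixels are not manipulated
    caption: str                                 # free-text description; may be falsified
    gps: Optional[Tuple[float, float]] = None    # (latitude, longitude) metadata
    metadata: dict = field(default_factory=dict) # timestamps, user info, etc.

class ReferenceDataset:
    """Knowledge base (RD) of trusted packages used to verify query packages."""
    def __init__(self, packages: List[MultimediaPackage]):
        self.packages = packages

    def retrieve_related(self, query: MultimediaPackage, k: int = 1) -> List[MultimediaPackage]:
        # A real system would retrieve the k packages most related to the query
        # via image/text embeddings and nearest-neighbor search; this placeholder
        # simply returns the first k packages.
        return self.packages[:k]
```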
Chapter 2
Background
The literature around natural image forensics is rich and diverse due to its long history of research.
The development of proper datasets, early interest in the area and well-defined tasks with bench-
marks have matured the field. On the other hand, research in biomedical forensics is limited and
skewed towards qualitative assessments of problems. A lack of proper datasets or well-defined
tasks has limited the growth of more quantitative research. Considering the distinctly different as-
pects of literature between these topics, this chapter is divided into two sections for natural image
forensics (Sec. 2.1) and biomedical forensics (Sec. 2.2).
Under natural image forensics, Sec. 2.1.1 discusses the social impact of misinformation on
different aspects of our daily lives. Sec. 2.1.2 reviews literature analyzing the spread of misinfor-
mation in a social media context. In Sec. 2.1.3 existing methods for tackling fake news on social
media are discussed. Finally in Sec. 2.1.4 methods and datasets for detecting pixel-level image
manipulations are discussed.
For biomedical image forensics, Sec. 2.2.5 briefly describes the impact of traditional machine
learning and computer vision methods in the development of biomedical domain specific methods.
The qualitative literature around biomedical image forensics is discussed in Sec. 2.2.2. The sparse,
but important literature around quantitative biomedical image forensics is discussed in Sec. 2.2.4.
2.1 Natural Image Manipulation
2.1.1 Social Impact of Misinformation
Apart from being ethically inappropriate, misinformation or fake news has consequences that af-
fect our daily lives. It negatively influences decisions made in diverse areas such as the
economy, politics, healthcare and journalism. (Kogan et al. 2020) studied
financial news on social media and how it relates to market manipulation. They found that the pres-
ence of fraud leads participants to discount all news, including legitimate news, from their decision
making. (Rapoza 2018) reports the impact of fake news on stock markets and (Allcott et al. 2017)
analyze its influence on the 2016 US presidential election. An implication in (Rapoza 2018) and
(Allcott et al. 2017) is that people find fake news on social media or otherwise believable, which
contributes to their impact. (Bovet et al. 2019) studied the impact of fake or biased news on Twitter
for the 2016 US presidential election with a dataset comprising 30 million tweets from 2.2 million
users. They found a quarter of the tweets to contain either fake or biased news. A counter-intuitive
finding in their study was that it was the activity of right-leaning supporters that influenced the
dynamics of fake news, and not vice versa. The harmful effects of healthcare misinformation, which
lead to erroneous practices and virus spread, were reported by (Tasnim et al. 2020). A systematic study
on COVID-19 misinformation by (Rocha et al. 2021) found that such news can lead to panic, stress
and anxiety. The constant presence of misinformation places a persistent burden on journalists to
verify their sources. Since images are often used as visual evidence in support of journalistic arti-
cles, (Zampoglou et al. 2016) developed a web-based graphical user interface (GUI) to assist news
professionals in verifying images using existing methods. In the long run, the harmful effects of
misinformation far outweigh positive outcomes, if any. While it is unlikely that we can eliminate
misinformation from all aspects of our lives, it is still important to pursue methods for mitigation.
2.1.2 Analyzing the Spread of Misinformation
In addition to the impact of fake news, its diffusion has also been studied (Vosoughi et al. 2018;
Tambuscio et al. 2015). Vosoughi et al. 2018 study the diffusion of verified true and false news
on Twitter and use it to assess popular beliefs about fake news. They find that fake news elicits
negative responses, as commonly believed, but also show that, contrary to conventional wisdom,
bots contribute equally to the spread of false and true news. More importantly, they show that false news
spreads faster than true news and humans are responsible for it. Tambuscio et al. 2015 create a
mathematical model for analyzing the spread of fake news in a social graph, with users represented
as nodes. Their model shows that the existence of sufficient fact verification ensures the complete
removal of false facts from a network. A shortcoming of their model, however, is that each user is
treated with the same parameters, which is not representative of real-world scenarios. Regardless of
this shortcoming, their results show the potential of automated fake news detection
methods as a tool for combating the spread of fake news.
2.1.3 Detection of Online Misinformation
Text-based false news does not have consistent nomenclature and goes by fake news, hoax or rumor.
Since the term fake news colloquially refers to the broad category of misinformation, text-based
false news is referred to as rumor hereafter. A rumor usually consists of false claims made about entities.
There are two popular ways to approach rumor detection: classification of flat feature vectors (Ma
et al. 2016; Gupta et al. 2014; X. Liu et al. 2015) or as an epidemic on a social graph (K. Wu et al.
2015; F. Jin et al. 2013). Feature vector based approaches capture information about text content
such as the presence of swear words or URLs, or information about the user such as the number of followers.
Ma et al. 2016 use tf-idf features for classification. In (Gupta et al. 2014; X. Liu et al. 2015) tweets
are classified according to tweet content and metadata information for real-time analysis. A graph-
based approach attempts to identify the source of the rumor epidemic; however, this approach is
limited by the need for a graph which may not be available. F. Jin et al. 2013 investigate the use of
a diffusion model to classify rumors. K. Wu et al. 2015 model the message propagation structure as
a tree and use graph kernels to measure similarities for classifying tweets. Ruchansky et al. 2017
categorize rumor detection differently, based on three popular signals in the literature: text, response
and source. Text analysis involves checking the consistency and language used in the article.
Response refers to the reaction elicited from readers, which is usually negative and extreme in the
case of inflammatory articles. Source involves checking the origin of the article. Ruchansky et al. 2017
present a model which combines all three signals into a single model in contrast to previous work
(Ma et al. 2016; X. Liu et al. 2015; K. Wu et al. 2015; Zhao et al. 2015). Datasets for rumor
detection are usually collected from social microblogs (Twitter and Weibo) (Ma et al. 2016) or from
fact-checking websites [1] (W. Y. Wang 2017). It is noteworthy that most of the above-mentioned
methods capitalize on visible traits of rumors such as swear words, extreme responses, anomalous
user information etc., without verifying the misleading fact or claim in the rumor itself. Verifying
the claim itself is a much more challenging approach and requires natural language understanding
of the text in a rumor.
Ciampaglia et al. 2015 verify statements using a fact authentication approach. They use Wikipedia
to build a knowledge graph, which serves as their reference dataset. However, the statements
they verify have clean grammar, which makes it convenient to identify, and hence evaluate, the
assertion being made in the text.
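To make the flat-feature-vector family of approaches concrete, the sketch below trains a tf-idf text classifier in the spirit of Ma et al. 2016. It is a toy baseline under the assumption of a small labeled corpus, not a reproduction of any published system; the example texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: 1 = rumor, 0 = non-rumor (invented examples).
texts = [
    "BREAKING!!! celebrity secretly arrested, share before this is deleted",
    "City council approves new budget for road maintenance",
    "Miracle cure that doctors don't want you to know about!!!",
    "University publishes annual enrollment statistics",
]
labels = [1, 0, 1, 0]

# tf-idf features over word unigrams and bigrams, fed to a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["SHOCKING!!! you won't believe what happened"]))
```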
There is recent work in rumor detection which factors images into the problem statement (Z.
Jin et al. 2017; Zhiwei Jin et al. 2017). The datasets used in these works consist of forged images
and hence are different from ours, which consists of unmanipulated images. Their methods do
not explicitly factor in image forgery detection, which would be a good, if not sufficient, indicator
for fake news detection with respect to their dataset. There are additional differences from our
work. The method presented in (Zhiwei Jin et al. 2017) performs rumor and fake news detection
on image-caption pairs from Weibo and Twitter. Their approach involves checking for information
inconsistency within a package, while ours focuses on manipulations which are not obvious within
a package. Z. Jin et al. 2017 use images for rumor detection with the assumption that images
[1] http://www.politifact.com/
associated with rumors have low diversity compared to real news events. This assumption may not
hold beyond their dataset.
2.1.4 Digital Image Forensics
Digital image manipulation or image forgery is a component of fake news, considering it can
be used to misrepresent information. The connection between fake news and forged images is
supported in Zampoglou et al. 2016, where they identify the need for checking images in the area
of journalism. Digital image manipulation detection has been studied extensively (Farid 2009;
Y. Wu et al. 2017; Qureshi et al. 2015; Asghar et al. 2017; Verdoliva 2020). Pixel-level image
manipulations fall into copy-move, image splicing, resampling and retouching categories (Qureshi
et al. 2015). Copy-move involves duplicating objects within the same image (Y. Wu et al. 2018;
Cozzolino et al. 2015a; Ryu et al. 2010), while splicing inserts objects from a foreign image (Y.
Wu et al. 2017; Cozzolino et al. 2015b). Resampling involves resizing, rotation and stretching of
the original image (Farid 2009). Retouching refers to the enhancement of images while keeping close
to the original visual information (Qureshi et al. 2015). It is used harmlessly for better presentation
in media, but is considered forgery since it fundamentally involves digital image manipulation.
Deepfakes are a more recent form of manipulation in images and videos (Sabir et al. 2019; Masi
et al. 2020), where facial characteristics of people are swapped or modified. In Asghar et al. 2017,
the authors classify image forgery detection techniques into active and passive (blind) methods.
Active methods involve extracting and verifying a signature or watermark which is assumed to be
embedded in the image. Watermarking of images is not common, and hence the use of this approach
is limited. A blind or passive approach does not assume the presence of a watermark and instead
looks for inconsistencies and artifacts in the image being investigated, which might result from
digital manipulations. The pixel-level image manipulations discussed previously are detected using
this approach.
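To illustrate the passive approach in its simplest form, the sketch below performs naive block-matching copy-move detection on a single grayscale image: bit-identical blocks appearing at two grid locations are flagged. Published CMFD methods (e.g., Y. Wu et al. 2018) are far more robust to rotation, scaling and post-processing; this toy version only conveys the idea.

```python
import numpy as np

def naive_copy_move(img: np.ndarray, block: int = 16):
    """Flag pairs of bit-identical blocks within one grayscale image."""
    seen = {}      # block-content bytes -> first grid location
    matches = []   # (location_a, location_b) pairs of duplicated blocks
    h, w = img.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            key = img[y:y + block, x:x + block].tobytes()
            if key in seen:
                matches.append((seen[key], (y, x)))
            else:
                seen[key] = (y, x)
    return matches

# Synthetic forgery: copy one patch to another location in a random image.
rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)
img[64:80, 64:80] = img[0:16, 0:16]   # simulated copy-move
print(naive_copy_move(img))            # expected: [((0, 0), (64, 64))]
```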
For the manipulations mentioned earlier, forgery detection methods have been developed to flag
suspicious content with reasonable success. A critical step for the development of these algorithms
has been the curation and release of datasets that facilitated benchmarking. As an example, FF++
(Rossler et al. 2019), DeeperForensics (Jiang et al. 2020) and Celeb-DF (Li et al. 2020) helped
develop methods for deepfake detection. Similarly, CASIA (Dong et al. 2013), NIST16 (Nimble
Challenge 2017 Evaluation — NIST n.d.), COLUMBIA (Ng et al. 2009) and COVERAGE (Wen
et al. 2016) helped advance detection methods for a combination of forgeries such as copy-move,
splicing and removal.
2.2 Biomedical Image Manipulation
2.2.1 Scientific Misconduct
Scientific misconduct has existed for a long time, but definitions and legislation against it are
a more recent development (Gross 2016). Historically, fabrication, falsification and plagiarism
(FFP) have been the trinity of scientific research misconduct. Fabrication refers to making up
results, falsification involves manipulating results and plagiarism is the misappropriation of some-
one else’s research (Gross 2016). More recently, other forms of misconduct have come under scrutiny,
such as HARKing, p-hacking and spin. Hypothesizing after the results are known, a.k.a. HARKing,
refers to presenting post hoc hypotheses as if they were a priori hypotheses (Kerr 1998). Spin refers
to the misrepresentation of findings by beautification or misinterpretation (Boutron et al. 2018).
Another form of misconduct, called p-hacking, refers to the selective collection of data such that
non-significant results become significant (Head et al. 2015). While these activities apply to indi-
viduals and their misconduct, a disturbing development at a systemic level is the rise of paper mills.
Paper mills are industries offering research for purchase – the commodity being authorship
on papers, ready-to-publish datasets or manuscript writing (Calver 2021). To avoid these malprac-
tices, researchers often turn to prestigious journals for reliable scientific discovery. However, it has
been reported counter-intuitively that prestigious journals may in fact have a negative correlation
with reliability (Brembs 2018).
2.2.2 Misappropriated Images in Biomedical Literature
Misrepresentation of scientific research is a broad problem (Boutron et al. 2018) but scientific mis-
conduct in the biomedical domain has a higher incidence, since researchers can get away with an
excuse of “biologic variability” (Kumar 2008). Of the several forms of misconduct, manipulation
or duplication of biomedical images has been recognized as a serious problem by journals and
the community in general (Bik et al. 2018; Christopher 2018; Bik et al. 2016). The editors at
the Journal of Clinical Investigation (JCI) screened 200 papers that were on a clear path towards
acceptance and found 55 (27.5%) to have issues with images (Williams et al. 2019). While the
majority of documents had images with objectionable but minor issues, two of them contained
altered or fabricated images. Bik et al. (Bik et al. 2016) analyzed 20,621 papers and found
3.8% of these to contain at least one manipulation with at least half of them appearing to be de-
liberate. The authors categorized manipulations into one of three categories – simple duplication,
duplication with repositioning and duplication with alterations such as stamping or patching. A tem-
poral analysis of these duplications revealed an increase in recent years: there are more instances
of duplications since 2003 (∼4%) than from 1995 to 2002 (<2%). The authors also found that
researchers with problematic images were likely to repeat it. In continuing research (Bik et al.
2018), the authors were able to bring about 46 corrections or retractions from a pool of 960 papers span-
ning the years 2009–2016. By this proportion, they estimate that roughly 35,000 papers in the
literature are candidates for retraction due to the presence of image duplication. However, most
of this effort was performed manually, which is unlikely to scale given the high volume of publi-
cations. The biomedical literature indicates that researchers, editors and reviewers have noticed a
spike in the occurrence of manipulated or fraudulent images. While some of the publications with
problematic images are filtered during the review process, the process still appears to be manual
and not scalable.
2.2.3 Recommendations to Preserve Research Integrity
Machine-learning based methods are being developed for automated detection of biomedical im-
age manipulation. However, several publications have focused on analyzing the root cause of
misconduct, along with recommendations for mitigation, control and reform at a systemic level.
In one study the authors analyzed publications with image duplications to identify risk factors that
are more likely to lead to research misconduct (Fanelli et al. 2019). They evaluated commonly
hypothesized risk factors – pressure to publish, lack of social control, countries with poor policies
and gender. They found evidence that a lack of social control and countries with poor scien-
tific conduct policies induce a greater risk of misconduct. No evidence was found for publication
pressure or a gender bias in manipulated documents. A reform at the student education level is
recommended in one study to produce scientists who are capable of deep research and scientific
critical thinking to avoid misconduct (Bosch et al. 2017). The implication is that poor education
and lack of training leads to research misconduct. A prominent editor for a scientific journal has
proposed from experience that requiring raw data from authors leads to transparency and avoids
retractions (Miyakawa 2020). The editor recounts that the authors of 97% of 41 reviewed manuscripts either
declined or were unable to comply with a request to provide raw data. With some evidence to
back the claim that some prestigious journals have a negative correlation with reliable research,
one study (Brembs 2019) argues to replace legacy journals with a modern information infrastruc-
ture that is governed by scholars. The proposed framework would lead to massive cost savings
and allow for improved functionalities for maintaining scientific discoveries. Another framework
centered around the concept of scientific rigor has been proposed (Casadevall et al. 2016). The
framework has five actionable recommendations for redundant experimental design, sound statis-
tical analysis, recognition of error, avoidance of logical fallacies and intellectual honesty. While
the previously mentioned studies focus on understanding and reform, one study makes a case for
penalties (Dal-Ré et al. 2020). The argument is blunt: criminalize scientific misconduct – not only
flagrant violations such as fabrication, falsification or plagiarism, but also incidents of fake peer
review or image duplication. To summarize, a long list of recommendations has been proposed to
improve research integrity, not just for biomedical forensics but also for general research. However,
most of these recommendations not only require significant consensus but are also time intensive,
due to their elaborate nature. Systemic changes to how research is conducted may be of great help
in the struggle against scientific misconduct, but the urgency of the situation demands quicker
and easily deployed solutions. Machine-learning based detection of scientific fraud is one
promising avenue in this regard.

Attributes               Acuna       Koppers     Cardenuto   Xiang       Bucci
                         et al. 2018 et al. 2017 et al. 2019 et al. 2020 2018
Dataset Available        ✓           ✗           ✗           ✗           ✗
Benchmarking             ✗           ✗           ✓           ✓           ✗
Multiple Manipulations?  ✗           ✗           ✗           ✓           ✗
Diverse Images           ✓           ✓           ✓           ✗           ✓
Annotation               ✗           ✓           ✓           ✓           ✗
Real Manipulations       ✓           ✓           ✓           ✗           ✓

Table 2.1: Comparison of biomedical forensic methods
2.2.4 Biomedical-Image Forensics
Models and frameworks have been proposed for automated detection of biomedical image manip-
ulations (Acuna et al. 2018; Koppers et al. 2017; Cardenuto et al. 2019; Xiang et al. 2020; Bucci
2018). Koppers et al. 2017 developed a duplication screening tool evaluated on three images.
Bucci et al. (Bucci 2018) engineered a copy-move forgery detection (CMFD) framework from
open-source tools to evaluate 1,546 documents and found 8.6% of them to contain manipulations.
Acuna et al. 2018 used SIFT (Lowe 2004) image-matching to find potential duplication candidates
in 760k documents, followed by human review. In the absence of a robust evaluation, it is unknown
how many documents with forgeries went unnoticed in (Acuna et al. 2018; Bucci 2018). Carde-
nuto et al. 2019 curated a dataset of 100 images to evaluate an end-to-end framework for the CMFD
task. Xiang et al. 2020 test a heterogeneous feature extraction model to detect artificially created
manipulations in a dataset of 357 microscopy and 487 western blot images. It is unclear how the
images were collected in (Cardenuto et al. 2019; Xiang et al. 2020). Table 2.1 shows a comparison
of existing attempts to tackle biomedical image manipulations. A ✓ and ✗ indicate the presence or
absence of a feature, respectively. In summary, none of the proposed datasets unify the community
around a common biomedical image forensics dataset with standard benchmarking. Additionally,
all of the methods described in Table 2.1 have used diverse approaches on their own datasets. With-
out any common benchmarking, it is near impossible to fairly compare the effectiveness of these
methods.
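As an illustration of the SIFT image-matching step that Acuna et al. 2018 use to shortlist duplication candidates, the following is a minimal OpenCV sketch. The ratio-test threshold, match-count cutoff and file names are assumptions made for this example, not values from their paper.

```python
import cv2

def duplication_candidate(path_a: str, path_b: str, min_matches: int = 30) -> bool:
    """Flag an image pair as a duplication candidate via SIFT keypoint matching."""
    sift = cv2.SIFT_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return False
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) >= min_matches

# Candidate pairs would then be forwarded for human review, e.g.:
# duplication_candidate("figure1_panelA.png", "figure2_panelB.png")
```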
2.2.5 Computer Vision in Biomedical Domain
Machine learning and computer vision have made significant contributions to the biomedical do-
main involving problems such as image segmentation (Kulikov et al. 2020; Lee et al. 2020; D.
Wang et al. 2020), disease diagnostics (Perez et al. 2019), super-resolution (Peng et al. 2020) and
biomedical image denoising (Zhang et al. 2019). While native computer vision algorithms have
existed for these problems, ensuring robustness on biomedical data has always been a challenge.
This is partly due to domain shift and also due to the difficulty of training data-intensive deep
learning models on biomedical datasets, which are usually small. These challenges often lead to
the development of models that are uniquely tuned for the biomedical domain. Occasionally, mod-
els developed for the biomedical domain lead to suitable variations for natural images. As an
example, U-Net (Ronneberger et al. 2015) is a popular multi-scale deep neural network model,
that was developed for biomedical image segmentation, but has found use in natural image tasks
(Baheti et al. 2020; Kazerouni et al. 2021).
Chapter 3
Leveraging Semantic Inconsistency for Repurposing Detection
3.1 Introduction
In real life, data often presents itself with multiple modalities, where information about an entity or
an event is incompletely captured by each modality separately. For example, a caption associated
with the image of a person might provide information such as the name of the person and the lo-
cation where the picture was taken, while other metadata might provide the date and time at which
the image was taken. The independent existence of each modality makes multimedia data packages
vulnerable to tampering, where the data in a subset of modalities of a multimedia package can
be modified to misrepresent or repurpose the multimedia package. Such tampering, with possible
malicious intent, can be misleading, if not dangerous. The location information, for example, in
the aforementioned caption could be modified without an easy way to detect such tampering. Nev-
ertheless, if the image has visual cues, such as a landmark, a person familiar with the location can
easily detect such a manipulation. However, this is a challenging multimedia analysis task, espe-
cially with the subtlety of data manipulation, the absence of clear visual cues and the proliferation
of multimedia content from mobile devices and digital cameras.
Verification of the integrity of information contained in any kind of data requires the existence
of some form of prior knowledge. In the previous example, this knowledge is represented by a
person’s familiarity with the location. Human beings use their knowledge, learned over time, or
external sources such as encyclopedias, as knowledge bases (KBs). Motivated by this important
observation, image repurposing detection algorithms could also take advantage of a KB to auto-
matically assess the integrity of multimedia packages. A KB can be either implicit (such as a
trained scene understanding and/or recognition model) or explicit (such as a database of known
facts). In this chapter we explore the use of a reference dataset (RD) of multimedia packages
to assess the integrity of query packages. The RD is assumed to not include other copies of the
query image. Otherwise, existing image retrieval methods would suffice to verify the multimedia
package integrity.
While information manipulation detection is a broad problem, in this chapter we focus on
verifying the semantic integrity of multimedia packages. We define multimedia semantic integrity
as the semantic consistency of information across all the modalities of a multimedia package.
We present a novel framework to solve a limited version of the multimedia information in-
tegrity assessment problem, where we consider each multimedia package to contain only one im-
age and an accompanying caption. Data packages in the reference dataset are used to train deep
multimodal representation learning models (DMRLMs). The learned DMRLMs are then used to
assess the integrity of query packages by calculating image-caption consistency scores (ICCSs)
and employing outlier detection models (ODMs) to find their inlierness with respect to the RD.
We evaluate the proposed method on two publicly available datasets—Flickr30K (Young et al.
2014) and MS COCO (Lin et al. 2014), as well as on the MultimodAl Information Manipulation
(MAIM) dataset that we created from image and caption pairs downloaded from Flickr. This work
is significantly different from past work on robust hashing and watermarking (Ababneh et al. 2008;
R. Sun et al. 2014; X. Wang et al. 2015; Yan et al. 2016) as those methods focus on the prevention
of information manipulation while ours focuses on detection at a semantic level.
3.2 Semantic Integrity Assessment
One approach to information integrity assessment of a data object is to compare it against an
existing knowledge-base (KB), with the assumption that such a KB exists. This KB can be explicit
Figure 3.1: Package Integrity Assessment System
(such as a database of facts) or implicit (such as a learned statistical inference model). We use
the observation that human beings verify the information integrity of pieces of data using world
knowledge learned over time and external sources, such as an encyclopedia, to develop machine
learning models that mimic world knowledge, and then use these models to assess the integrity of
query data packages.
In order to verify the integrity of a query multimedia package that contains an image and an
associated caption, we assume the existence of a reference set of similar media packages. This set,
which we call the reference dataset (RD), serves as the KB to compare query packages against to
measure their integrity. More specifically, we train an outlier detection model (ODM) on image-
caption consistency scores (ICCSs) from packages in RD and use it to calculate the inlierness
of query packages. We employ deep multimodal representation learning models (DMRLMs) for
jointly encoding images and corresponding captions, inspired by their success as reflected in recent
literature, and use them to calculate ICCSs (depending on the DMRLM used). Fig. 3.1 explains
our complete integrity assessment system.
Figure 3.2: Our Multimodal Autoencoder Architecture. F_I and F_C are image and caption
features respectively, while F̂_I and F̂_C are their reconstructed versions.
Figure 3.3: Our BiDNN Architecture. F_I and F_C are image and caption features respectively,
while F̂_I and F̂_C are their reconstructed versions. Colored arrows with dotted connections
reflect weight tying.
Figure 3.4: The VSM Architecture of Kiros et al. (Kiros et al. 2014)
In this work we use a multimodal autoencoder (MAE) (Ngiam et al. 2011), a bidirectional
(symmetrical) deep neural network (BiDNN) (Vukotić et al. 2016) or the unified visual semantic
neural language model (VSM) (Kiros et al. 2014) as the embedding model. VGG19 (Simonyan
et al. 2014) image features are given as inputs to all these models, along with either
average word2vec (Mikolov et al. 2013) embeddings (MAE and BiDNN) or one-hot encodings of
words in captions (VSM). The ODMs that we work with are the one-class support vector machine
(OCSVM) (Schölkopf et al. 1999) and isolation forest (iForest) (F. T. Liu et al. 2008). We discuss
the aforementioned DMRLMs, with their associated ICCSs, and ODMs in detail in the following
subsections.
3.2.1 Deep Multimodal Representation Learning
3.2.1.1 Multimodal Autoencoder
An autoencoder is a neural network that learns to reconstruct its input (Hinton et al. 2006). Autoen-
coders are typically used to learn low-dimensional representations of data. The network architec-
ture is designed such that the input goes through a series of layers with decreasing dimensionality
to produce an encoding, which is then transformed through layers of increasing dimensionality
to finally reconstruct the input. Ngiam et al. (Ngiam et al. 2011) showed how an autoencoder
network can be used to learn representations over multiple modalities. We train an MAE on the
image-caption pairs in RD to learn their shared representation. Fig. 3.2 shows our MAE architec-
ture, inspired by the bimodal deep autoencoder of Ngiam et al. (Ngiam et al. 2011). The image
and caption features are passed through a series of unimodal layers before combining them in the
shared representation layer. The decoder module of the MAE is a mirror image of its encoder. For
MAE, we use reconstruction loss as the ICCS.
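For illustration, such an MAE can be sketched in Keras as below; the unimodal and shared layer widths (512-D and 256-D) are assumptions for illustration rather than the exact trained configuration, while the 4096-D VGG19 and 300-D averaged word2vec input sizes follow the feature choices described above.

```python
# A minimal MAE sketch, assuming 4096-D VGG19 image features and 300-D
# averaged word2vec caption features; layer sizes are illustrative.
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

img_in = Input(shape=(4096,))
cap_in = Input(shape=(300,))

# unimodal encoder layers before the shared representation
h_img = Dense(512, activation="relu")(img_in)
h_cap = Dense(512, activation="relu")(cap_in)
shared = Dense(256, activation="relu")(Concatenate()([h_img, h_cap]))

# decoder mirrors the encoder
d_img = Dense(512, activation="relu")(shared)
d_cap = Dense(512, activation="relu")(shared)
img_rec = Dense(4096)(d_img)
cap_rec = Dense(300)(d_cap)

mae = Model([img_in, cap_in], [img_rec, cap_rec])
mae.compile(optimizer="adam", loss="mse")  # reconstruction loss doubles as the ICCS
```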
3.2.1.2 Bidirectional (Symmetrical) Deep Neural Network
A BiDNN is composed of two unimodal autoencoders with tied weights for the middle representation
layers (Vukotić et al. 2016). The network is trained to simultaneously reconstruct each
modality from the other, learning cross-modal mappings as well as a joint representation. Fig. 3.3
shows our BiDNN architecture and illustrates the tied weights for a better understanding. Our
formulation of the joint representation is the same as Vukotić et al. (Vukotić et al. 2016), i.e., the
concatenation of the activations of the two representation layers. We used the BiDNN package
made available by Vukotić et al. (https://github.com/v-v/BiDNN) to implement our model. Reconstruction
loss also serves as the ICCS in the case of BiDNN.
Figure 3.5: Image-Caption Data package examples from MAIM dataset. The blue captions are the
original ones that came with the image while the red ones are their manipulated versions.
3.2.1.3 Unified Visual Semantic Neural Language Model
Kiros et al. 2014 introduced the unified visual semantic neural language model (VSM) that learns
representations of captions in the embedding space of images, where image embeddings are first
calculated using a deep neural network such as VGG19 (Simonyan et al. 2014). The VSM is
trained to optimize a contrastive loss, which aims to maximize the cosine similarity between the
representation of an image and the learned encoding of its caption while minimizing that between
the image and captions not related to it. Fig. 3.4 shows the structure of the VSM. The network uses
long short-term memory (LSTM) units (Hochreiter et al. 1997) to encode variable-length captions.
We used the VSM package made available by Kiros et al. (https://github.com/ryankiros/visual-semantic-embedding) and trained one
model on each RD. Cosine similarity becomes the natural choice of ICCS in the case of VSM.
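For intuition, the pairwise ranking objective behind VSM can be sketched as follows; this assumes L2-normalized image and caption embeddings and an illustrative margin, and is a simplified view of the contrastive loss rather than the exact code in the released package.

```python
# A numpy sketch of a VSM-style contrastive (pairwise ranking) loss,
# assuming rows of img_emb and cap_emb are L2-normalized matched pairs.
import numpy as np

def contrastive_loss(img_emb, cap_emb, margin=0.2):
    S = img_emb @ cap_emb.T        # cosine similarity matrix for a batch
    pos = np.diag(S)               # similarities of matched pairs
    # hinge costs against mismatched captions and mismatched images
    cost_cap = np.maximum(0.0, margin - pos[:, None] + S)
    cost_img = np.maximum(0.0, margin - pos[None, :] + S)
    np.fill_diagonal(cost_cap, 0.0)
    np.fill_diagonal(cost_img, 0.0)
    return cost_cap.sum() + cost_img.sum()
```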
3.2.2 Outlier Detection
3.2.2.1 One-Class Support Vector Machine
The OCSVM is an unsupervised outlier detection model trained only on positive examples (Schölkopf
et al. 1999). It learns a decision function based on the distribution of the training data in its original
or kernel space to classify the complete training dataset as the positive class, and everything else in
the high-dimensional space as the negative class. This model is then used to predict whether a new
data point is an inlier or an outlier with respect to the training data. This formulation of OCSVM
fits very well with our approach of assessing the semantic integrity of a data package with respect
to an RD (by using an OCSVM trained on the RD).
3.2.2.2 Isolation Forest
An isolation forest (iForest) is a collection of decision trees that isolate a data point through recur-
sive partitioning of random subsets of its features (F. T. Liu et al. 2008). It works by first randomly
selecting a feature of a data point and then finding a random split-value between its minimum and
maximum values. This is then repeated recursively on the new splits. The recursive partitioning
of a tree stops when a node contains only the provided data point. Under this setting, the average
number of splits required (across all trees in the forest) to isolate a point gives an indication of
its outlierness. The smaller the number, the higher the confidence that the point is an outlier; it is
easier to isolate outliers as they lie in relatively low-density regions with respect to inliers (RD).
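As a concrete sketch of the ODM stage, both models can be fit on reference-set ICCSs with scikit-learn as below; the score arrays and hyperparameter values here are placeholders rather than our tuned settings.

```python
# A minimal ODM sketch: fit on ICCSs from the RD, then flag query packages.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rd_scores = np.random.rand(1000, 1)    # placeholder ICCSs computed on the RD
query_scores = np.random.rand(10, 1)   # placeholder ICCSs for query packages

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(rd_scores)
iforest = IsolationForest(n_estimators=100).fit(rd_scores)

# predictions: +1 = inlier (consistent with the RD), -1 = outlier (tampered)
print(ocsvm.predict(query_scores))
print(iforest.predict(query_scores))
```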
3.3 Data
We provide a quantitative evaluation of the performance of our method on three datasets: Flickr30K
(Young et al. 2014), MS COCO (Lin et al. 2014) and a dataset that we created from images,
captions and other metadata downloaded from Flickr (MAIM). While Flickr30K and MS COCO
ODM              | MAE (F1-tampered / F1-clean) | BiDNN (F1-tampered / F1-clean) | VSM (F1-tampered / F1-clean)
One-class SVM    | 0.48 / 0.50                  | 0.47 / 0.67                    | 0.89 / 0.88
Isolation Forest | 0.49 / 0.50                  | 0.63 / 0.62                    | 0.81 / 0.70
Table 3.1: Evaluation Results on Flickr30K
datasets contain objective captions which describe the contents of images, MAIM contains subjec-
tive captions, which do not necessarily do so and sometimes contain related information that might
not be obvious from the image. Fig. 3.5 shows some examples from the MAIM dataset.
We use the training, validation and testing subsets of Flickr30K and MS COCO as made available
by Kiros et al. (https://github.com/ryankiros/visual-semantic-embedding), which makes sure that there is no overlap of images among
the subsets. This is necessary because each image in these datasets has five captions (giving five
image-caption pairs). There are 158,915 and 423,915 image-caption pairs in Flickr30K and MS
COCO respectively, in total. Our dataset (MAIM) has 239,968 image-caption pairs with exactly
one caption for each unique image. We randomly split MAIM into training, validation and test-
ing subsets and treat the training subset of each dataset as the RD in our framework. We replace
the captions of half of the validation and testing images with captions of other images to create
manipulated image-caption pairs for evaluation.
MAIM also has metadata for each package but we do not use them in our experiments in this
work. This metadata includes location where the image was taken, time and date when the image
was taken and information associated with the device used to capture the image.
3.4 Analysis
The inlier/outlier decisions of the ODMs in our system serve as the prediction of semantic information
manipulation in query packages. We use F1 scores as our evaluation metrics. Tables 3.1, 3.2
and 3.3 summarize the results of our experiments on all combinations of DMRLMs and ODMs
that we use in this work, on Flickr30K, MS COCO and MAIM respectively. We treat tampered
ODM              | MAE (F1-tampered / F1-clean) | BiDNN (F1-tampered / F1-clean) | VSM (F1-tampered / F1-clean)
One-class SVM    | 0.53 / 0.46                  | 0.68 / 0.55                    | 0.94 / 0.94
Isolation Forest | 0.50 / 0.48                  | 0.76 / 0.77                    | 0.94 / 0.94
Table 3.2: Evaluation Results on MS COCO
ODM              | MAE (F1-tampered / F1-clean) | BiDNN (F1-tampered / F1-clean) | VSM (F1-tampered / F1-clean)
One-class SVM    | 0.49 / 0.49                  | 0.46 / 0.50                    | 0.75 / 0.76
Isolation Forest | 0.56 / 0.42                  | 0.52 / 0.52                    | 0.75 / 0.77
Table 3.3: Evaluation Results on MAIM
packages as the positive class when calculating F1-tampered and as the negative class for F1-clean.
The F1-tampered and F1-clean scores are coupled, i.e., every pair is from the same trained model.
We see that VSM consistently performs better than MAE and BiDNN in all our experiments
on both metrics, with MAE consistently performing the worst. This gives us some key insights
into the working of these DMRLMs. Even though MAE can compress multimodal data with low
reconstruction error, it does not learn semantic associations between images and captions very
well. The BiDNN model is trained to learn cross-modal mappings between images and captions,
which forces it to learn those semantic associations. This explains why it works better than MAE
at this task. The VSM model is trained to map captions to the representation space of images. The
learning objective explicitly requires it to learn semantic relationships between the two modalities
so that it can map captions consistent with an image close to it while inconsistent ones are mapped
far from it. This makes VSM the strongest of the three models.
We also see that the F1 scores of VSM on MS COCO are significantly better than those on
the other datasets. This is expected and explained by the process through which the captions in
the dataset were collected. Chen et al. 2015 used Amazon's Mechanical Turk (https://www.mturk.com/mturk/welcome) to gather objective
captions with strong guidelines for their quality and content. The numbers are higher simply due
to the better quality of captions and their objective content. This indicates that our method is better
suited for objective captions.
3.5 Summary
Real-world multimedia data is often multimodal, consisting of images, videos, captions and other
metadata. While multiple modalities present additional sources of information, it also makes such
data packages vulnerable to tampering, where a subset of modalities might be manipulated, with
possible malicious intent. In this chapter we formally defined this problem and provided a method
to solve a limited version of it (where each package has an image and a caption) as a first step
towards the larger goal. Our method combines deep multimodal representation learning with out-
lier detection methods to assess whether a caption is consistent with the image in its package.
We introduced the MultimodAl Information Manipulation dataset (MAIM) that we created for the
larger problem, containing images, captions and various metadata, which we make available to the
research community. We presented a quantitative evaluation of our method on Flickr30K and MS
COCO datasets, containing objective captions, and on the MAIM dataset, containing subjective
captions. Our method was able to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K
and MS COCO, respectively, for detecting semantically incoherent media packages.
In our work we used the general formulation of MAE and BiDNN, providing these models
VGG19 image features and aggregated word2vec caption features as inputs. It is possible that an
end-to-end model with raw images and captions as inputs and a combination of convolution and
recurrent layers might perform better. Similarly, training the image encoder of VSM jointly with
the caption encoder might further boost its performance. These issues can be explored in future
work. Our future work will also incorporate metadata and assess the integrity of entire packages.
It is easy to see that our framework can be extended to include more modalities such as location,
audio and video. We leave this to future work.
Chapter 4
An Evidence Based Approach to Repurposing Detection
4.1 Introduction
In Chapter 3 the multimedia packages were limited to image-caption pairs with semantically incon-
sistent manipulations. The semantic inconsistency facilitated the development of joint representa-
tion learning models that could be used in conjunction with outlier detection for image repurposing
detection. However, semantically inconsistent manipulations are far removed from real-world examples,
as demonstrated in Figure 4.1. The left-most example is relatively easy for an educated audience
to identify as fake. The right three examples, however, highlight the difficulty of detecting manip-
ulated image data when the details being manipulated are subtle, subliminal and often interleaved
with the truth. The right three images each have one entity manipulated in the associated caption,
while the image remains untampered. In Figure 4.1(b) the deceased rapper Tupac Shakur is incor-
rectly referenced. Figure 4.1(c) has the Democratic National Convention referenced instead of the
Ku Klux Klan, while the image in Figure 4.1(d) is from a protest in Chile instead of North Dakota.
It is noteworthy that the manipulated details in the right three examples are semantically coherent
and leave no digital footprint in the image itself, making their detection difficult.
The manipulations presented in Figure 4.1 are characterized by semantic consistency and subtlety of
detail and therefore require explicit contradicting evidence for detection. Use of external knowl-
edge in an implicit manner – like the joint modeling approach presented in Chapter 3 – is unlikely
Figure 4.1: Examples of real-world manipulated information. Image (a) is an obvious hoax with
digital image manipulation and is less likely to fool people. Image (b) falsely claims that Tupac
Shakur is alive. In image (c), Democratic Party is incorrectly referenced. Image (d) misattributes
the photograph to a protest at the North Dakota Pipeline, when it was originally taken in Chile.
Images (b), (c) and (d) are not digitally manipulated, but the details accompanying them are, mak-
ing detection more challenging.
to be effective. In fact, real-world fact-checking platforms such as Snopes (https://www.snopes.com/)
use external knowledge and evidence to flag semantically consistent instances of image repurposing.
In order to address the limitations of implicit evidence usage present in the dataset and models of
the previous chapter, we introduce a new dataset and multimodal model.
The contributions of this chapter are:
1. A new and challenging dataset with relatively intelligent, labeled and semantically coherent
manipulations over what is currently available for research (https://github.com/Ekraam/MEIR).
2. A novel deep multimodal model that uses multitask learning for image repurposing detection
with the explicit use of external evidence.
4.2 Multimodal Entity Image Repurposing Dataset (MEIR)
In Chapter 3 we presented the Multimodal Information Manipulation (MAIM) dataset of multimedia
packages, in which each package contains an image-caption pair. In order to mimic image
repurposing, associated captions were swapped with randomly chosen captions—potentially leading
to semantically inconsistent repurposing, such as an image of a cat captioned as a car. The
MAIM dataset was the first attempt to create a dataset for image repurposing detection research.
However, MAIM had several limitations. First, the resulting manipulations are often easy to detect
and unlikely to fool a human. Second, the reference dataset provided in MAIM is unlabeled and
does not contain entities directly related to manipulated packages. Therefore, there is no explicit
use of the reference dataset when detecting manipulations. The MAIM dataset does not reflect difficult
manipulations that are created by carefully altering caption contents to produce semantically con-
sistent packages—for example, leaving the caption unaltered except for replacing an entity name.
In order to address these limitations, we introduce the Multimodal Entity Image Repurpos-
ing Dataset (MEIR), which compared to MAIM, contains semantically intelligent manipulations.
Multimedia packages in MEIR are manipulated by replacing named entities of corresponding type,
instead of entire captions. We consider three named entities: person, organization and location.
Unlike MAIM, the reference dataset in MEIR has directly related entities to manipulated pack-
ages. It is also labeled to identify this relationship. The process for realizing these improvements
in a new dataset is described below.
Creating MEIR consisted of three stages—data curation, clustering packages by relatedness,
and creating manipulations in a fraction of packages of each cluster. MEIR was created from Flickr
images and their associated metadata, collected in the form of packages. To ensure uniformity,
packages with missing modalities and non-English text were removed. The remaining packages
were preprocessed to remove noise in the form of html tags. Following data collection and prepro-
cessing, packages were clustered by relatedness. Clustering by relatedness helps allocate packages
to reference, training, development and test sets, such that related packages are labeled and present
in the reference dataset when detecting manipulations. Identifying relatedness before synthesizing
manipulations also helps to avoid replacing entities between two related packages. Two packages
are considered related when they are geographically located close to each other in terms of real-
world location, and the image and text modalities between them are similar. Clustering is done in
two stages: clustering by location followed by refining the clusters based on image and text simi-
larities between packages. Initial clusters of packages with neighboring global positioning system
(GPS) coordinates were created by measuring the proximity of GPS coordinates up to two decimal
places of accuracy. This limited the maximum distance between two related packages to approx-
imately 1.3 kilometers. A relaxed distance constraint between two packages was kept to ensure
better recall in the first stage of clustering. In the second stage, clusters were refined based on
image and text similarity between packages. Similarity was measured by scoring image and text
feature representations of two packages. Feature representations used were VGG19 (Simonyan
et al. 2014), features pretrained on imagenet and averaged word2vec (Mikolov et al. 2013) embed-
dings for image and text respectively. Cosine similarity was used to measure similarity. Selection
of a suitable threshold on the similarity score was based on a sample of 200 package pairs which
was manually annotated for relatedness.
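A minimal sketch of this two-stage clustering follows; the package dict layout and the similarity threshold are hypothetical stand-ins, with the real threshold chosen from the 200 manually annotated pairs.

```python
# A sketch of two-stage relatedness clustering: bucket by GPS truncated to
# two decimal places, then refine by image/text cosine similarity.
from collections import defaultdict
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cluster_packages(packages, sim_threshold=0.5):
    # stage 1: coarse clusters from GPS proximity (~1.3 km maximum distance)
    buckets = defaultdict(list)
    for p in packages:
        buckets[(round(p["lat"], 2), round(p["lon"], 2))].append(p)
    # stage 2: refine each cluster by image and text similarity to a seed
    clusters = []
    for group in buckets.values():
        seed = group[0]
        kept = [p for p in group
                if cosine(p["img"], seed["img"]) >= sim_threshold
                and cosine(p["txt"], seed["txt"]) >= sim_threshold]
        clusters.append(kept)
    return clusters
```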
In order to create manipulations, we applied named entity recognition for identifying and re-
placing entities between packages. We used StanfordNER’s (Finkel et al. 2005) three-class model
for identifying person name, location and organization entities in each package. Entities of cor-
responding type were replaced between two packages from unrelated clusters. All instances of
that entity type were manipulated to "cover up" potential inconsistencies within a manipulated
package. As an example, when making a location manipulation of London to Paris, any men-
tion of England should also be manipulated along with the GPS coordinates. Approximately a
quarter of each cluster is manipulated. Approximately half of the packages from each cluster are
allocated to the reference dataset, such that each allocated package is unmanipulated. The remaining half of
each cluster is an equal mix of manipulated and unmanipulated packages and is used to create the
train-test-validation split. The reference dataset is common to train, test and validation sets.
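For illustration, the entity swap can be sketched as below; this helper and its tag format are hypothetical stand-ins, while the tagging itself used StanfordNER's three-class model.

```python
# A sketch of entity replacement between unrelated packages; tags are
# assumed to be (entity_string, entity_type) pairs from an NER model.
def swap_entities(caption, tags, replacements):
    # replacements maps an entity type to the foreign entity from the
    # manipulation source package, so all mentions of that type are covered
    out = caption
    for entity, etype in tags:
        if etype in replacements:
            out = out.replace(entity, replacements[etype])
    return out

# e.g. swap_entities("Protest in London, England",
#                    [("London", "LOCATION"), ("England", "LOCATION")],
#                    {"LOCATION": "Paris"})
```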
The reference dataset contains 82,156 untampered related packages. There are 57,940 packages
in the manipulated dataset out of which 28,976 packages are split into 14,397 location, 7,337
person and 7,242 organization manipulations. The manipulated dataset is divided into 40,940,
7,000 and 10,000 packages for training, validation and test, respectively. Each package consists
of image, location and text modalities. Location comprises country, county, region and locality
names, in addition to GPS coordinates. Text is the user generated description related to the image.
Table 4.1: Examples of location, person and organization manipulations from the Multimodal
Entity Image Repurposing (MEIR) dataset. A manipulation source package provides a foreign
entity which is used to replace corresponding entities from the manipulated package. Manipulated
text is marked [manipulated]. Longer text descriptions have been truncated; images are omitted.

Example 1 (location manipulation):
  Related Package (Reference Dataset):
    Text: Ex Convento at Cuilapan De Guerrero completed in 1555
    Location: Mexico, Cuilapam de Guerrero, Oaxaca, Cuilapam de Guerrero. GPS: 16.992657, -96.779133
  Manipulation Source Package:
    Text: The Dorena Bridge has these neat windows, which kind makes you feel like you are in a house!
    Location: United States, Lane, Oregon, Row River. GPS: 43.737553, -122.883646
  Repurposed Query Package:
    Text: Ex Convento at Dorena Bridge [manipulated] completed in 1555
    Location: United States, Lane, Oregon, Row River. GPS: 43.737553, -122.883646 [manipulated]

Example 2 (person manipulation):
  Related Package (Reference Dataset):
    Text: The 2001 Space pod modelled in Lego by Dilip
    Location: United Kingdom, West Sussex, England, Billingshurst. GPS: 51.024538, -0.450281
  Manipulation Source Package:
    Text: "It was a dream, and in dreams you have no choices: ..." – Neil Gaiman, American Gods (2001)
    Location: United States, Riverside, California, Moreno Valley. GPS: 33.900035, -117.254384
  Repurposed Query Package:
    Text: Space pod modelled in Lego by Neil Gaiman [manipulated]
    Location: United Kingdom, West Sussex, England, Billingshurst. GPS: 51.024538, -0.450276

Example 3 (organization manipulation):
  Related Package (Reference Dataset):
    Text: The International Rally of Queensland Sunday competition images + podium
    Location: Australia, Gympie Regional, Queensland, Imbil. GPS: -26.460497, -152.679491
  Manipulation Source Package:
    Text: $10 , 000 , 000 00 is the estimated value for the first United States gold coin Numismatic Guaranty Corporation special display ...
    Location: United States, Philadelphia, Pennsylvania, Philadelphia. GPS: 39.953333, -75.156667
  Repurposed Query Package:
    Text: The Numismatic Guaranty Corporation [manipulated] Sunday competition images + podium
    Location: Australia, Gympie Regional, Queensland, Imbil. GPS: -26.461131, 152.678117
The dataset also contains an additional modality—time-stamp which contains the date and time
associated with each package. Packages are spread across 106 countries with English-speaking
countries contributing the majority.
Examples from MEIR are presented in Table 4.1 showing location, person and organization
manipulation. It is important to note that a human is unlikely to see through these manipulations.
This understanding reinforces the need for a reference dataset containing directly related informa-
tion.
4.3 Image Repurposing Detection
Our general approach for image repurposing detection is to retrieve a related multimedia package
from a reference dataset first, followed by comparing the query package to the retrieved one to
determine likelihood of manipulation. Since different modalities in a given package (e.g., image,
text, location, etc.) contribute differently to both retrieval and manipulation detection, we start
by assessing the importance of each modality to guide the design of the manipulation detection
approach.
4.3.1 Modality Importance
Image manipulators are assumed to leave most of the content of the package unchanged, and
only change a small number of package modalities (e.g., only location) to make the manipulation
subtle and hard to detect. Therefore, the query and retrieved packages are assumed to have largely
overlapping information.
We use similarity scoring to retrieve a package from reference dataset R using every modality
for every query package. Let f_{qm} and f_{rm} be the feature vectors for modality m in query q and
reference r packages, respectively, and let s(·,·) be the similarity metric. The top retrieved package
r* is identified as shown in Equation 4.1.

$r^* = \arg\max_{r \in R} \sum_{m} s(f_{qm}, f_{rm})$ (4.1)

Modality         | Manipulated | Unmanipulated | Overall
GPS coordinates  | 47.0%       | 95.0%         | 71.0%
Image            | 65.9%       | 65.6%         | 65.8%
Text             | 60.0%       | 77.2%         | 68.6%
Image+GPS+Text   | 66.1%       | 93.6%         | 79.9%
Table 4.2: Top-1 retrieval accuracy of related packages when a query package is manipulated or
unmanipulated. Using all modalities gives the best performance for retrieval.
VGG19 (Simonyan et al. 2014) pretrained on imagenet (Russakovsky et al. 2015) and averaged
word2vec (Mikolov et al. 2013) features are used for images and text respectively. We use cosine
distance to measure package similarities. Experimental results are presented in Table 4.2, which
shows the top-1 retrieval results for both manipulated and unmanipulated partitions of the query
subset of MEIR. The results illustrate that using all modalities for retrieving top-1 related package
provides the best retrieval results.
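A sketch of this retrieval step follows Equation 4.1 directly; the per-modality feature layout is an assumed convention for illustration.

```python
# A sketch of top-1 retrieval (Equation 4.1): sum per-modality cosine
# similarities and take the argmax over the reference dataset.
import numpy as np

def cosine_sims(q, R):
    # similarity of query vector q against every row of reference matrix R
    return (R @ q) / (np.linalg.norm(R, axis=1) * np.linalg.norm(q) + 1e-8)

def retrieve_top1(query_feats, ref_feats):
    # query_feats: {modality: (d_m,) vector}; ref_feats: {modality: (N, d_m)}
    total = sum(cosine_sims(query_feats[m], ref_feats[m]) for m in query_feats)
    return int(np.argmax(total))
```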
Further, in order to estimate the importance of each modality for manipulation detection,
we devise a classification experiment followed by feature importance measurement. The model
comprises Gaussian random projection for dimensionality reduction followed by random forest
(Breiman 2001) for classification. Random projections have been used previously for dimensional-
ity reduction and in classification problems (Bingham et al. 2001)(Dasgupta 2000). Each modality
in query and related packages is reduced to a common feature dimension L, and features from all
modalities are concatenated. A simple random forest classifier is trained on the resulting feature
vector for manipulation detection. Feature importance of each dimension is measured using Gini
impurity across all trees as described in (Breiman et al. 1984) and implemented in (Pedregosa et al.
2011). Averaged modality importance is shown in Figure 4.2. Each experiment configuration is
repeated for 30 trials and averaged results are presented. As shown in Figure 4.2, all modalities
contribute significantly to manipulation detection, at all feature dimensions.

Figure 4.2: Average feature importance of each modality with varying feature dimension. All
modalities are found to contribute towards manipulation detection.
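This probe can be sketched with scikit-learn as below; the feature-matrix inputs and the value of L are assumptions for illustration, not the exact experimental configuration.

```python
# A sketch of the modality-importance probe: project each modality to a
# common dimension L, concatenate, fit a random forest, then average the
# Gini-based importances within each modality block.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.ensemble import RandomForestClassifier

def modality_importance(X_img, X_txt, X_gps, y, L=16):
    blocks = [GaussianRandomProjection(n_components=L).fit_transform(X)
              for X in (X_img, X_txt, X_gps)]
    X_all = np.hstack(blocks)
    rf = RandomForestClassifier(n_estimators=100).fit(X_all, y)
    imp = rf.feature_importances_
    return [imp[i * L:(i + 1) * L].mean() for i in range(len(blocks))]
```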
4.3.2 Deep Manipulation Detection Model
Broadly speaking, there are two approaches for manipulation detection. The first approach de-
pends on assessing the coherency of the content of the query package, without using any reference
datasets, e.g., by matching caption against the image or vice versa (Jaiswal et al. 2017). The second
approach assumes the existence of a relatively large reference dataset, and assesses the integrity of
the query package by comparing it to one or more packages retrieved from the reference dataset.
The main advantage of the second approach is when the manipulations are semantically consistent.
The proposed method in this chapter belongs to the second approach, since the information in the
query package is potentially manipulated and requires external information for validation.
Figure 4.3: Model overview. A potentially related package is retrieved from the reference dataset.
Feature extraction and balancing modules create a concatenated feature vector. The package evaluation
module consists of related and single package modules. All NN_i layers represent a single,
dense, fully connected layer with ReLU activation.
We propose a deep multimodal, multi-task learning model for image repurposing detection, as
illustrated in Figure 4.3. The proposed model consists of four modules: (1) feature extraction, (2)
feature balancing, (3) package evaluation and (4) integrity assessment. The model takes a query
package and the top-1 related package, retrieved from a reference dataset, as discussed in Section
4.3.1.
Both query and reference dataset packages are assumed to contain image, text and GPS coor-
dinates. Images are encoded using VGG19 (Simonyan et al. 2014) pretrained on imagenet (Rus-
sakovsky et al. 2015). GPS coordinates are a compact numerical representation of location; there-
fore we normalize them without any further processing. Finally, text is represented using averaged
word2vec (Mikolov et al. 2013) features for all words in a sentence. We also explore performance
improvement using an attention model over words instead of averaging word features.
Features from different modalities have widely different dimensionalities (4096D for image,
300D for text and 2D for location). As shown before in Section 4.3.1, varying the dimensionality
of each modality does not significantly change its importance. In order to ensure features from all
modalities are balanced, all features are transformed to 300D feature vectors (similar to word2vec
Component / Combination | 1    | 2    | 3    | 4    | 5    | 6
Single Package          | ✓    | ✓    | ✓    | ✓    | ✓    | ✓
Related Package         |      | ✓    | ✓    | ✓    | ✓    | ✓
Multitask-loss          |      |      | ✓    | ✓    | ✓    | ✓
Feature Balancing       |      |      |      | ✓    | ✓    | ✓
Attention in Text       |      |      |      |      | ✓    | ✓
Learnable Forget Gate   |      |      |      |      |      | ✓
F1-tampered             | 0.58 | 0.59 | 0.61 | 0.76 | 0.80 | 0.83
F1-clean                | 0.60 | 0.60 | 0.65 | 0.78 | 0.82 | 0.84
AUC                     | 0.64 | 0.65 | 0.70 | 0.86 | 0.89 | 0.91
Table 4.3: This table justifies different design choices of the model. A ✓ indicates the presence of
a component. Model architecture has been optimized on the development set.
text features) and concatenated into a single feature vector. Neural layers are used to transform
feature dimensions.
The core of the proposed model is the package evaluation module, which consists of related
package and single package sub-modules. As shown in Figure 4.3, the related package sub-module
consists of two siamese networks. The first network is a relationship classifier that verifies whether
the query package and top-1 package are indeed related, while the second network is a manipu-
lation detection network that determines whether the query package is a manipulated version of
the top-1 retrieved package. Since manipulation detection is dependent on the relatedness of the
two packages, the relationship classifier network controls a forget gate which scales the feature
vector of the manipulation detection network down to zero if the two packages are unrelated. Two
designs are considered for a forget gate. The first is a dot product between the output of relation-
ship classifier and manipulation detection feature vector. This formulation of the forget gate is not
learnable and scales all dimensions of the manipulation detection feature vector indiscriminately.
An alternative is a learnable forget gate similar to LSTMs (Hochreiter et al. 1997). If x is the
relationship classifier output, y_inp the manipulation detection feature vector, w the forget gate
weight matrix and b the bias vector, then the output feature vector y_out is given by Equation 4.2,
where ∗ is the Hadamard product and · is matrix multiplication. For a learnable forget gate, the
feature vector of the relationship classifier is used as input. The choice of gate is justified below.

$y_{out} = y_{inp} \ast \sigma(w \cdot x + b)$ (4.2)
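A minimal Keras sketch of this learnable gate, assuming 100-D feature vectors on both branches, is:

```python
# A sketch of the learnable forget gate (Equation 4.2): a sigmoid-gated
# elementwise scaling of the manipulation detection features.
from tensorflow.keras.layers import Input, Dense, Multiply
from tensorflow.keras.models import Model

x = Input(shape=(100,))      # relationship classifier feature vector
y_inp = Input(shape=(100,))  # manipulation detection feature vector
gate = Dense(100, activation="sigmoid")(x)  # sigma(w . x + b)
y_out = Multiply()([y_inp, gate])           # Hadamard product
forget_gate = Model([x, y_inp], y_out)
```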
Figure 4.4: Our baseline semantic retrieval system (SRS) retrieves similar concepts via each
modality and uses the overlap between retrieved packages as an indicator of the integrity score.
The Jaccard index is used to measure overlap.
Modalities          | Method                     | F1-tampered | F1-clean | AUC
Image + Text        | VSM (Jaiswal et al. 2017)  | 0.56        | 0.63     | 0.60
Image + Text        | Our Method                 | 0.73        | 0.76     | 0.81
Image + Text + GPS  | VSM (Jaiswal et al. 2017)  | 0.63        | 0.59     | 0.65
Image + Text + GPS  | GCCA                       | 0.75        | 0.71     | 0.81
Image + Text + GPS  | SRS                        | 0.66        | 0.37     | 0.71
Image + Text + GPS  | Our Method                 | 0.80        | 0.80     | 0.88
Table 4.4: We present results on MEIR. Experiments are performed in two settings—with image
and text modalities, and image, location and text modalities. This is done for a fairer comparison
with VSM, which was originally designed for the image-text dataset.
In the meantime, a single package module verifies the coherency (i.e., integrity) of the query package
alone, similar to approaches that do not depend on a reference dataset for integrity assessment.
The single package module is a feedforward network that takes the balanced feature vector of the
query package as input and performs feature fusion to produce a 100-dimensional feature vector.
To ensure the overall network behaves as intended, we use multitask learning in the two sub-
modules. Relationship and manipulation classifiers are trained with corresponding labels. To avoid
conflict in training, the manipulation classifier in related package submodule has a third label un-
known apart from manipulated and unmanipulated, when the retrieved package is unrelated. All
neural layers in the package evaluation module are dense, fully connected 100 dimensional layers.
The integrity assessment module concatenates feature vectors from both related and single
package modules for manipulation classification. A neural network classifier is trained on the
concatenated feature vector with no hidden layers.
The model has multiple components that need to be validated before we evaluate it against
other methods and baselines. As discussed previously, there are also choices involved with some
components of the model—attention over averaged word2vec features for text feature extraction
and learnable over non-learnable forget gate. These choices are tuned on the development set with
a set of ablation experiments. The experiments add different components of the model sequentially
and observe a performance increase that justifies each addition. By default, averaged word2vec features
and non-learnable forget gates are used, unless otherwise mentioned. Experimental results are
shown in Table 4.3. A ✓ indicates the presence of a component. An improvement of 0.16 AUC
from the feature balancing layer can be further broken down into 0.04 AUC from increased depth
and 0.12 AUC from matching feature dimensions of different modalities. We hypothesize that
dimensionality matching prevents modalities with higher dimension (e.g., image) from dominating
over low dimension modalities (e.g., GPS). A model comprising all discussed components, with
attention over text features and a learnable forget gate, gives the best performance. We use this
configuration in all future evaluations.
The model is trained end to end. We use the Adam optimizer in all experiments with a learning rate
of 0.001. All nonlinear transformations use the ReLU activation function. We use Keras with the
TensorFlow backend. Unspecified parameters are set to their default values.
Table 4.5: Location, person and organization manipulation examples, in order. Our model identifies
each of these examples correctly from the test set. Manipulated text is marked [manipulated].
Longer text descriptions have been truncated; images are omitted.

Example 1 (location manipulation):
  Related Package (Reference Dataset):
    Text: Sommerpalast Suzhou Street
    Location: China, Beijing, Beijing, Beijing. GPS: 40.000826, 116.268825
  Manipulation Source Package:
    Text: Grand Beach , Lake Winnipeg , Manitoba , Canada
    Location: Canada, Manitoba, Manitoba, Grand Beach. GPS: 50.562316, -96.614059
  Repurposed Query Package:
    Text: Canada [manipulated]
    Location: Canada, Manitoba, Manitoba, Grand Beach. GPS: 50.562316, -96.614059 [manipulated]

Example 2 (person manipulation):
  Related Package (Reference Dataset):
    Text: Live Life Awards Image by Sean McGrath Photography
    Location: Canada, Saint John, New Brunswick, Saint John. GPS: 45.271428, -66.061981
  Manipulation Source Package:
    Text: Film Effects - The Vault CartLike many effects seen in the films ... Ollivanders wand shop , Flourish and Blotts , the ...
    Location: United Kingdom, Hertfordshire, England, Watford. GPS: 51.693761, -0.422329
  Repurposed Query Package:
    Text: Live Life Awards Image by Blotts [manipulated] Photography
    Location: Canada, Saint John, New Brunswick, Saint John. GPS: 45.271428, -66.061981

Example 3 (organization manipulation):
  Related Package (Reference Dataset):
    Text: 1st Combined Convention 28th Annual National DeSoto Club Convention & 44th Annual Walter P Chrysler Club ConventionJuly 17 - 21 , 2013Lake Elmo , MinnesotaClick link below for more car
    Location: United States, Chisago, Minnesota, Taylor Falls. GPS: 45.401366, -92.651065
  Manipulation Source Package:
    Text: Quick-Look Hill-shaded Colour Relief Image of 2014 2m LIDAR Composite Digital Terrain Model ( DTM ) Data supplied by Environment Agency under the ...
    Location: United Kingdom, East Sussex, England, Fairlight Cove. GPS: 50.881552, 0.664108
  Repurposed Query Package:
    Text: 1st Combined Convention 28th Environment Agency [manipulated] ConventionJuly 17 - 21 , 2013Lake Elmo , MinnesotaClick link below for more car pictures:
    Location: United States, Chisago, Minnesota, Taylor Falls. GPS: 45.401366, -92.651065
4.4 Experimental Evaluation
Detection of image repurposing is a relatively new research area without established methods for
evaluation. In this chapter we propose using the area under receiver operating characteristic curve
(AUC) and also F
1
scores for evaluating performance.
We compare the performance of our model against the state-of-the-art in Chapter 3. The visual
semantic model (VSM) is the best encoding model in the anomaly detection framework presented
in Chapter 3. Further, we also compare our performance against two baselines: the first is our
baseline semantic retrieval system (SRS) shown in Figure 4.4. SRS retrieves similar packages
corresponding to each modality using cosine distance. The output integrity score is the measured
overlap between retrieved packages. Intuitively, this overlap indirectly measures whether modal-
ities in a query package point to the same related packages in the reference dataset. Since each
modality will retrieve similar packages, a rogue modality will retrieve packages pertaining to a
different event from the reference dataset. If modalities in the query package are consistent with
information in the reference dataset, the overlap between retrieved packages will be significant.
We use the Jaccard index for measuring overlap between retrieved packages. The second baseline
approach uses generalized canonical correlation analysis (GCCA) (Kettenring 1971) for feature
embedding and random forest for classification. GCCA transforms multimodal features with vary-
ing feature dimensions into a common embedding space and has been used for classification with
multimodal features (Shen et al. 2014; M. Sun et al. 2013).
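The SRS scoring can be sketched as follows, assuming each modality's retrieval returns a set of package IDs; averaging pairwise Jaccard indices is one natural way to combine the overlaps, and the exact combination rule here is an illustrative assumption.

```python
# A sketch of the SRS integrity score: average pairwise Jaccard overlap
# between the package-ID sets retrieved by each modality.
def jaccard(a, b):
    return len(a & b) / (len(a | b) or 1)

def srs_integrity(retrieved_sets):
    # retrieved_sets: list of sets of package IDs, one per modality
    n = len(retrieved_sets)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(jaccard(retrieved_sets[i], retrieved_sets[j])
               for i, j in pairs) / len(pairs)
```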
We also explore the performance of our model without retrieving the top-1 package from a
reference dataset. In this setting we remove the related package module. We call this version of
our model the single package assessment (SPA) in all figures. This variation enables us to analyze
the importance of related package retrieval and compare the performance on MEIR and MAIM
datasets.
The results of our experiments on MEIR are presented in Table 4.4, which illustrates that our
model performs better than the baselines and VSM (Jaiswal et al. 2017). It should be noted that
VSM was originally designed for image and text modalities. We extend it with location modality.
Method                      | MAIM (F1-T / F1-C / AUC) | Flickr30K (F1-T / F1-C / AUC) | MS COCO (F1-T / F1-C / AUC)
MAE (Jaiswal et al. 2017)   | 0.49 / 0.49 / -          | 0.49 / 0.50 / -               | 0.50 / 0.48 / -
BiDNN (Jaiswal et al. 2017) | 0.52 / 0.52 / -          | 0.63 / 0.62 / -               | 0.76 / 0.77 / -
VSM (Jaiswal et al. 2017)   | 0.75 / 0.77 / -          | 0.89 / 0.88 / -               | 0.94 / 0.94 / -
Our Method - SPA            | 0.78 / 0.78 / 0.87       | 0.88 / 0.88 / 0.95            | 0.92 / 0.92 / 0.96
Table 4.6: We present results on the MultimodAl Information Manipulation (MAIM) dataset. MAIM
has subjective image caption pairs. We train our method without the inter-package task module since
there are no related packages to leverage in MAIM. We compare against methods presented in
(Jaiswal et al. 2017). Tampered and clean columns are indicated by T and C respectively.
We also evaluate our model with image and text modality for a complete comparison with VSM.
Under both circumstances our model performs better. The extended version of VSM with location
modality performs better than the original version. Table 4.5 shows examples of location, person
and organization manipulation from the test set which our model identifies correctly.
In Chapter 3 VSM and other methods are evaluated on three datasets—MAIM, MSCOCO
and Flickr30K. MAIM has subjective image caption pairs and is more challenging compared to
Flickr30K and MSCOCO, which have objective image caption pairs. However, these datasets do
not have related content in the reference dataset, which does not help the related package module
in our model. We therefore compare the single package assessment (SPA) version of our model,
which does not retrieve any related packages. Chapter 3 uses a representation learning and outlier
detection method for comparison. We compare against all encoding methods shown in the referenced
paper. They do not provide AUC scores for their methods. In Table 4.6, SPA gives superior
performance on MAIM and competitive performance on Flickr30K and MS COCO.
A reference dataset may not have all modalities present, which makes missing modalities from
retrieved packages an additional problem of interest. Without modifying model architecture, the
problem can be tackled by improving the training scheme. During training a modality is delib-
erately removed from some retrieved samples and represented as a zero vector. Table 4.7 shows
results of training with a 20% missing-modality rate. At test time, the corresponding modality is com-
pletely removed from all samples. The new training scheme improves performance on missing
                            | Missing Modality - Test
Missing Modality - Training | None | Image | Text | Location
None                        | 0.88 | 0.77  | 0.66 | 0.63
Image                       | 0.87 | 0.85  | -    | -
Text                        | 0.87 | -     | 0.75 | -
Location                    | 0.85 | -     | -    | 0.82
All                         | 0.88 | 0.86  | 0.76 | 0.79
Table 4.7: Scores with missing modalities in retrieved packages. None and All refer to no missing
and all missing modalities respectively. The training scheme improves performance for all missing
modality scenarios.
modalities when compared to simple model training, while maintaining competitive performance
on samples with no missing modalities.
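This scheme can be sketched as a form of modality dropout on the retrieved package during training; the 20% rate matches the setting above, while the data layout and independence of drops per modality are illustrative assumptions.

```python
# A sketch of the missing-modality training scheme: randomly zero out a
# retrieved modality for some samples so the model learns to cope.
import numpy as np

def drop_modality(retrieved_feats, rate=0.2, rng=np.random):
    # retrieved_feats: {modality: (batch, d_m) array}
    out = {}
    for m, X in retrieved_feats.items():
        keep = (rng.rand(X.shape[0], 1) > rate).astype(X.dtype)
        out[m] = X * keep  # a zero vector stands in for the missing modality
    return out
```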
4.5 Summary
With an increase in fake news, image repurposing detection has become an important problem. It
is also a relatively unexplored research area. In the previous chapter we explored joint-modeling
of images and captions and found that an implicit learning of evidences was suitable for detecting
semantically inconsistent manipulations but not for semantically consistent manipulations. In this
chapter, the scope of image repurposing was expanded to semantically consistent manipulations
which are more likely to fool people. We presented the MEIR dataset with intelligent and seman-
tically consistent manipulations of location, person and organization entities. We also introduced
a model that leverages evidences from the reference dataset explicitly instead of implicit learning.
Our end-to-end deep multimodal learning model gives state-of-the-art performance on MEIR and
MAIM. The model is also shown to be robust to missing modalities with a proper training scheme.
Our proposed model still has certain shortcomings as it utilizes a single evidence, when multiple
evidences are present in the reference dataset.
Chapter 5
Multi-Evidence Graph Neural Network
5.1 Introduction
The problem setup in Chapters 3 and 4 involves detection of semantically repurposed multimedia
packages with the help of an external reference dataset (RD). A reference dataset is a knowledge
base of packages that are assumed to contain unmanipulated information. Additionally, in Chapter
4 the formulation was improved to consider explicit evidences from the RD, such that each package
in the RD is a potential evidence for verifying a query package. However, a shortcoming of the
deep multimodal model (DMM) architecture is that it is designed to utilize one evidence from the
RD for repurposing detection. DMM does not scale to handle multiple retrieved packages and
hence, does not leverage additional information for performance improvement. This shortcoming
can have a limiting effect on image repurposing detection accuracy if there are multiple evidences
present in the RD.
In this chapter, we propose MEG – a multi-evidence graph neural network (GNN) model for
image repurposing detection. It is a scalable model for assimilating an arbitrary number of
evidences from the reference dataset for image repurposing detection. We show that it achieves
state-of-the-art performance on the MEIR dataset introduced in Chapter 4 and two other benchmark
datasets. Additionally, the GNN component of our model inherently provides performance invari-
ance w.r.t. the order of packages retrieved from the RD.
5.2 Architecture Design
As discussed in Section 5.1, reference datasets in image repurposing problems often have multiple
evidences for verification of the query package. However, previous methods for image repurposing
detection cannot handle a variable number of retrieved packages (Jaiswal et al. 2017; Sabir et al.
2018). This shortcoming prevents these models from leveraging potentially multiple instances of
related packages for performance improvement. As such, a driving motivation of our model design
is to make it scalable to multiple retrieved packages. Additionally, as discussed in Chapter 4, it is
possible for a modality to be missing from a package or the dataset itself. For example an image
may be accompanied by text or location information or both. Under such circumstances, it is
also important to keep the model flexible to handle an arbitrary number of modalities. In order
to ensure this, our model processes each modality in a different branch which is architecturally
the same. Effectively, each branch has the same architecture, but with different learned weights
for each modality. Each branch has three major components - (1) feature extraction, (2) evidence
matching and (3) modality summary. They are preceded by package retrieval for evidences and
followed by the manipulation detection layer. Figure 5.1 gives an overview of our model. We
describe the motivation and design of each component of our model below.
5.2.1 Package Retrieval
Verification of a query package requires additional information from a reliable reference dataset.
Since packages retrieved from the reference dataset form the basis for authentication, the package
retrieval system is an important component of the overall method. We use a package retrieval
system similar to Section 4.3 where each modality of the query package is scored against the
corresponding modality of all packages in the reference dataset. A reference package with the top-
1 combined score across all modalities is retrieved. We extend this method to retrieve k packages
with the highest scores.
Figure 5.1: Our model diagram. Modalities from each retrieved package are organized together
and processed through a dedicated branch for that modality. The evidence matching layer
matches query and retrieved features side-by-side and weights them to produce a node initialization.
A graph neural network is used to summarize each modality for final manipulation detection. The
crossmodal connections form a complete graph in implementation, but only a few connections are
shown for simplicity.
5.2.2 Feature Extraction
Learned feature extraction is an important component of deep learning models. Previous literature
in image repurposing detection has used pretrained models to extract features from all modalities
of a package. We follow a similar approach using convolutional neural network (CNN) based
models, word2vec and global positioning system (GPS) coordinates for image, text and location
feature extraction respectively. Specific details on models used for feature extraction are discussed
in Section 5.3.1.
5.2.3 Attention-based Evidence Matching
Semantic forensics can involve a subtle but specific change of detail which can be hard to detect at
a glance. Additionally, the information (entity, location, etc.) involving both the manipulation and
evidence in query and retrieved packages is unlikely to be previously seen. It is therefore prudent
to develop a method that can compare a previously unseen instance of manipulation and evidence
without memorizing it. This requirement is in contrast to classical computer vision models that
reward memorization of training examples such as associating the word dog with a corresponding
image in a standard classification task. We address the problem of dealing with previously unseen
manipulations and evidences with an attention-based evidence matching module, shown in Figure
5.1. Evidence matching compares concatenated query and retrieved features with a soft attention
mechanism for selecting important matches. For query and retrieved features q and r respectively, a
concatenated feature vector [q, r] is processed with a 1D convolutional neural network (CNN) for
matching. A soft attention model is applied on top of the concatenated representation, followed by
a dense layer, to compute a matched feature feat. This layer can be represented by Equation 5.1:

$feat = FC\left(\sigma\left(Conv([q,r])\right) \odot [q,r]\right)$ (5.1)

where Conv is a 1D CNN, σ is soft attention and FC is a dense layer for dimensionality matching
across modalities.
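A Keras sketch of this layer, treating the concatenated vector as a length-600 sequence, is given below; the 300-D feature sizes, kernel size and output width are illustrative assumptions.

```python
# A sketch of attention-based evidence matching (Equation 5.1).
from tensorflow.keras.layers import (Input, Concatenate, Reshape, Conv1D,
                                     Softmax, Multiply, Flatten, Dense)
from tensorflow.keras.models import Model

q = Input(shape=(300,))
r = Input(shape=(300,))
qr = Concatenate()([q, r])                      # [q, r]
seq = Reshape((600, 1))(qr)
scores = Conv1D(1, kernel_size=5, padding="same")(seq)  # Conv([q, r])
attn = Softmax(axis=1)(scores)                  # soft attention over positions
matched = Multiply()([seq, attn])               # attention-weighted [q, r]
feat = Dense(100, activation="relu")(Flatten()(matched))  # FC for matching
evidence_matching = Model([q, r], feat)
```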
5.2.4 Modality Summary
The retrieved packages represent a bag-of-packages without specific order. We use a graph neural
network (GNN) for each modality that considers all possible comparisons between the query and
retrieved packages. The graph network makes the overall system (1) flexible enough to scale to an
arbitrary number of retrievals and (2) invariant to the order of retrieved packages. Each node in the
graph network is updated with respect to its adjacent nodes allowing simultaneous updates. The
graph is then summarized into one modality-level representation. Each node contributes directly
to the final graph summary. This is different from recurrent networks where the latent embedding
is updated in sequence, making them order-dependent. A node v in a GNN is represented by the
hidden state h_v. A forward pass through a GNN is divided into propagation and output steps. The
propagation step updates nodes along edges in the graph for T timesteps. It can be thought of as
a gated recurrence along paths in the graph, similar to long short-term memory network (LSTM)
recurrence. The output step produces a graph level vector representation by combining hidden
states of nodes with an attention mechanism. The model is summarized by Equations 5.2-5.7:

$h_v^{1} = [x_v, 0]^T$ (5.2)

$a_v^{(t)} = A_v^T [h_1^{(t-1)} \dots h_N^{(t-1)}]^T + b$ (5.3)

$z_v^t = \sigma(W^z a_v^{(t)} + U^z h_v^{(t-1)})$ (5.4)

$r_v^t = \sigma(W^r a_v^{(t)} + U^r h_v^{(t-1)})$ (5.5)

$\tilde{h}_v^{(t)} = \tanh(W a_v^{(t)} + U(r_v^t \odot h_v^{(t-1)}))$ (5.6)

$h_v^{(t)} = (1 - z_v^t) \odot h_v^{(t-1)} + z_v^t \odot \tilde{h}_v^{(t)}$ (5.7)
The hidden state is initialized with an initial representation x_v according to the application and
padded with 0 to match dimensions if needed. A is the adjacency matrix and a_v is the summation
of adjacent node embeddings based on edge type. Equations 5.4-5.7 represent updates using a
GRU. The graph neural network effectively summarizes the potential for manipulation in a learned
representation for each modality. We use a complete graph (adjacency matrix of ones except along
the diagonal) with one timestep of propagation. This allows simultaneous update of all nodes
throughout the graph in an order-agnostic manner. The final graph output G_m for modality m of
our model is a weighted average of the activations of all N nodes, as shown in Equation 5.8:
$G_m = \sum_{v=1}^{N} h_v^{1} \odot Att(h_v^{1})$ (5.8)
The weights are estimated using a neural network Att. Since the model is set up for a variable
number of inputs, the scale of adjacent node embeddings a_v may fluctuate by an order of magnitude.
To control for the variation, we modify Equation 5.3 by scaling it down by the number of
adjacent nodes. For our fully connected graph setup, each node in a modality graph with N nodes
has N − 1 adjacent nodes, as shown in Equation 5.9.

$a_v^{(t)} = \frac{A_v^T [h_1^{(t-1)} \dots h_N^{(t-1)}]^T + b}{N - 1 + \varepsilon}$ (5.9)
This summarizes each modality into a single graph output, with nodes of the same modality.
However, it has been shown that cross-modal learning helps with multimodal tasks (Ngiam et al.
2011). To incorporate cross-modal learning into our model, we add cross-modal graph connections.
The adjacency matrix is expanded to include nodes from adjacent modalities. We validate the
performance of cross-modal connections later in Section 5.3.2. For m modalities, each with N_m
nodes, a general update to Equation 5.9 for arbitrary nodes in adjacent modalities is shown in
Equation 5.10.

$a_v^{(t)} = \frac{A_v^T [h_1^{(t-1)} \dots h_N^{(t-1)}]^T + b}{N_i - 1 + \sum_{j=1, j \neq i}^{m} N_j + \varepsilon}$ (5.10)
However, considering that each modality generates an equal number of nodes and fully-connected
cross-modal edges are used, Equation 5.10 simplifies to Equation 5.11.

$a_v^{(t)} = \frac{A_v^T [h_1^{(t-1)} \dots h_N^{(t-1)}]^T + b}{mN - 1 + \varepsilon}$ (5.11)
The ε term takes care of zero adjacency for each node in a graph. Finally, the output representation
for each modality is combined as described next.
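To make the propagation and readout concrete, the following numpy sketch implements one scaled propagation step (Equations 5.3-5.7 with the scaling of Equation 5.11) and the attention readout of Equation 5.8; the weight shapes in params and the form of the attention network are illustrative assumptions.

```python
# A numpy sketch of one gated propagation step over a complete graph and
# the attention-weighted readout; params holds illustrative weight matrices.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate(H, params, eps=1e-8):
    # H: (N, d) node states; complete graph without self-loops
    N = H.shape[0]
    A = np.ones((N, N)) - np.eye(N)
    a = (A @ H + params["b"]) / (N - 1 + eps)          # Eq. 5.3 with scaling
    z = sigmoid(a @ params["Wz"] + H @ params["Uz"])   # update gate, Eq. 5.4
    r = sigmoid(a @ params["Wr"] + H @ params["Ur"])   # reset gate, Eq. 5.5
    h_tilde = np.tanh(a @ params["W"] + (r * H) @ params["U"])  # Eq. 5.6
    return (1.0 - z) * H + z * h_tilde                 # Eq. 5.7

def readout(H, att_w):
    # Eq. 5.8: attention-weighted sum of node states into one summary G_m
    weights = sigmoid(H @ att_w)       # per-node attention from a small net
    return (H * weights).sum(axis=0)
```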
5.2.5 Manipulation Detection
A feed-forward network on top of concatenated modality summary outputs is used for the final
manipulation detection. This layer combines all branches of modalities into a single binary predic-
tion.
5.2.6 Implementation Details
Our model was implemented in Keras and trained with the Adam optimizer with an initial learning
rate of 0.001. All parameters had default values, unless otherwise mentioned. All edge layers and
feedforward layers use the ReLU activation function. We trained all our models with a batch size of
32 and sampled model checkpoints within each epoch to select the best model.
5.3 Evaluation
This section describes benchmarks in Section 5.3.1 and both quantitative and qualitative results in
Section 5.3.2.
5.3.1 Benchmark Datasets
We perform experimental evaluation on MEIR, introduced in Chapter 4, which is the most challenging dataset for image repurposing detection. We also evaluate on the Google Landmarks (Noh et al. 2017) and Painter by Numbers (Painter by Numbers 2019) datasets, which were originally released for different tasks but can be adapted for semantic forensics. The adapted splits for Google Landmarks and Painter by Numbers used in (Jaiswal et al. 2019) have repeated locations and painters across training and testing. In keeping with the premise of Chapter 4 that manipulations should be unseen between training and test, we create new splits with mutually exclusive manipulations in the training and test sets.
MEIR: It is a multimodal dataset comprising image, text and location modalities. Manipulations are present in the text and location modalities and comprise three types of entity manipulation — person, location and organization. Manipulations are also coherent within a package, i.e. a location manipulation in the text results in corroborating manipulations in the GPS coordinates. The dataset comprises 82,156 packages in the reference set and 57,940 packages split between training, test and validation sets. It should be noted that packages in training, test and validation belong to different events, which helps in evaluating the generalizability of models to unseen semantic manipulations.
Google Landmarks: We use Google Landmarks to further evaluate location manipulation, since locations can easily confuse people, especially if the landmark in the photo is not well recognized by the person. This is one of the manipulations present in MEIR, but there it is mixed with all other manipulations. The dataset is available as part of a Kaggle competition (https://www.kaggle.com/google/google-landmarks-dataset/home). The modified task for semantic forensics on this dataset is to identify whether the landmark associated with a query package is correct. The complete dataset is extremely large, with over 1.2 million images and 14,951 different landmarks. We prune the dataset, keeping landmarks with at least 3 and at most 50 images. This leaves us with 152,074 images, split into 78,573 reference images and 73,501 images for training, test and validation, which are further split in a 70-10-20 ratio. We create believable semantic manipulations by swapping similar images, where similarity is determined using a kd-tree search. We ensure that the landmarks in training, test and validation form disjoint sets, so that the model is evaluated on unseen landmarks. Image features are generated using NetVLAD (Arandjelovic et al. 2016), followed by principal component analysis (PCA) and l2 normalization as in (Jaiswal et al. 2019). Landmark IDs are encoded using 50-dimensional random embeddings. We also measure retrieval accuracy using mean average precision (MAP). With this scheme of manipulation and feature embedding, a cosine similarity based retriever as described in Chapter 4, generalized to retrieve 5 packages, achieves 0.81 mean average precision.
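For illustration, the kd-tree based swap used to create these manipulations might be sketched as follows; the function name, feature shapes and number of neighbors queried are assumptions for illustration.

import numpy as np
from scipy.spatial import cKDTree

def swap_similar_landmarks(features, landmark_ids):
    # features: (num_images, D) embeddings (e.g. NetVLAD + PCA, l2-normalized);
    # landmark_ids: (num_images,) integer labels.
    tree = cKDTree(features)
    manipulated = landmark_ids.copy()
    for i, feat in enumerate(features):
        # The nearest neighbor is the image itself, so query a few more.
        _, idx = tree.query(feat, k=5)
        for j in idx[1:]:
            if landmark_ids[j] != landmark_ids[i]:
                manipulated[i] = landmark_ids[j]  # believable but wrong landmark
                break
    return manipulated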
Painter by Numbers: We evaluate the proposed system on painting forgeries, an old problem in which counterfeits are created of paintings by famous artists. This is a high-stakes problem, with art experts being called in to validate paintings. We create a painting repurposing dataset from the Painter by Numbers dataset, which was also released as part of a Kaggle challenge (https://www.kaggle.com/c/painter-by-numbers). We restructure the dataset for semantic forensics, where the identity of the artist for a given painting is potentially manipulated. After ensuring that each artist has at least three images, the dataset is split into 36,669 reference images and 36,164 images for training, test and validation. There are 1,000 different artists in the dataset. To create manipulations, we use a kd-tree to find similar paintings and swap the artists. There is no overlap between artists in training, test and validation, to ensure generalization. Image features are extracted using the competition winner's model (https://github.com/inejc/painters), similar to (Jaiswal et al. 2019). For painter IDs, we generate 50-dimensional random embeddings. Again, using the retrieval scheme from Chapter 4 with top-5 packages, we achieve 0.72 mean average precision.
5.3.2 Results and Analysis
We use accuracy, area under the receiver operating characteristic curve (AUC), F1-clean and F1-tampered (F1 scores for the unmanipulated and manipulated classes respectively) as evaluation metrics, consistent with previous chapters. We perform ablation experiments to test the scalability and order invariance of our model and summarize the results. We also evaluate our model on the benchmark datasets and discuss the quantitative and qualitative results.
Scalability: A contribution of our model is the ability to handle a variable number of related packages; the modality summary module of our model is responsible for providing this scalability. Keeping this in mind, we perform two categories of ablation experiments, as shown in Table 5.1: (1) replacing the modality summary network with standard recurrent networks (GRU and LSTM) and the read-process-write (RPW) network from (Vinyals et al. 2015), and (2) removing the scaling modifications we made to the graph network in Section 5.2.4.
Ablation Model     Train on 2 Packages   Test on 5 Packages   Relative Drop
MEG (Ours)         0.91                  0.91                 0%
MEG - GNN + GRU    0.90                  0.88                 20%
MEG - GNN + LSTM   0.90                  0.85                 50%
MEG - GNN + RPW    0.89                  0.89                 0%
MEG - scaling      0.90                  0.85                 50%

Table 5.1: Ablation experiments for verifying scalability. We replace the modality summary (GNN) component of MEG with other models. All variants are trained on two and tested on five packages. AUC scores are reported.
Ablation Model     Before   After   Relative Drop
MEG (Ours)         0.92     0.92    0.0%
MEG - GNN + GRU    0.91     0.83    88.8%
MEG - GNN + LSTM   0.91     0.87    44.4%
MEG - GNN + RPW    0.89     0.89    0.0%

Table 5.2: Ablation experiments for verifying order invariance. We replace the modality summary (GNN) in MEG with other models. The Before and After columns show performance before and after reversing the retrieval order. AUC scores are reported.
For this set of experiments we train our model on up to two packages and test on five packages; a drop in performance at five packages indicates that the model does not scale. We train all models with a minimum of two packages, to avoid an empty adjacency matrix for the GNN. The results clearly support the scalability of the proposed model.
Order Invariance: A consequence of the GNN based modality summary module in our model is invariance to input ordering. We perform ablation experiments by replacing the modality summary module with standard recurrent networks (GRU and LSTM) and an existing order-agnostic model, the read-process-write (RPW) network from (Vinyals et al. 2015). It has been reported that recurrent networks suffer from order dependence, resulting in a performance drop when the input order changes between training and test (Vinyals et al. 2015). We train our model on 5 packages and test by reversing the training order. A drop in performance indicates that the model variation is not order invariant.
Ablation Model     Scalability   Order Invariance   Score
MEG (Ours)         ✓             ✓                  0.92
MEG - GNN + GRU                                     0.91
MEG - GNN + LSTM                                    0.91
MEG - GNN + RPW    ✓             ✓                  0.89
MEG - scaling                    ✓                  0.90

Table 5.3: Summary of our ablation experiments. MEG outperforms across all three factors. AUC scores are reported.
Metric        MEIR                Painter by Numbers    Google Landmarks
              SRS   DMM   MEG     SRS   DMM   MEG       SRS   DMM   MEG
F1-clean      0.51  0.80  0.84    0.70  0.59  0.83      0.82  0.87  0.87
F1-tampered   0.66  0.80  0.84    0.80  0.67  0.79      0.86  0.87  0.87
Accuracy      0.60  0.80  0.84    0.76  0.63  0.82      0.84  0.88  0.87
AUC           0.67  0.88  0.92    0.77  0.74  0.86      0.93  0.93  0.94

Table 5.4: Performance of our proposed model (MEG) against existing methods from Chapter 4 across all three benchmark datasets.
Results are presented in Table 5.2. It is evident that the LSTM and GRU based variations of our model are not order invariant. As expected, replacing the GNN with RPW (Vinyals et al. 2015) in the modality summary layer maintains order invariance, but leads to a performance drop.
Ablation Summary: We summarize the ablation results in Table 5.3. Three properties are considered: scalability, order invariance and detection performance. Our method satisfies all properties while maintaining the best performance across all comparisons.
Performance: We compare performance against previous methods from Chapter 4, namely the deep multimodal model (DMM) and the semantic retrieval system (SRS). DMM is a deep learning based model which verifies a query package using the top-1 retrieved package. SRS is a non-learning method which computes the Jaccard index on packages retrieved by individual modalities. Its performance is known to scale with the correctness of retrievals.
Figure 5.2: The top two rows contain true positive examples and the bottom two rows contain
false positive samples. Across both cases it is noticeable that the repurposing/manipulation is
believable. In the bottom row, the correct retrievals are visually different from the query, leading
to false alarms. Green and red borders indicate correct and incorrect retrievals respectively.
Dataset              TP+TN   FP+FN
MEIR                 3.05    2.60
Painter by Numbers   3.25    1.00
Google Landmarks     3.05    2.15

Table 5.5: Average number of correctly retrieved packages out of the top-5 retrievals for correctly classified (True Positive and True Negative) and misclassified (False Positive and False Negative) query packages. Package retrieval accuracy positively affects final model performance.
Our model improves upon state-of-the-art performance across all three datasets, as shown in Table 5.4.
Analysis: Examples from the Painter by Numbers and Google Landmarks datasets are shown in Figures 5.2 and 5.3 respectively. True positive and false negative examples from MEIR are shown in Figures 5.4 and 5.5 respectively.
Figure 5.3: The top two and bottom two rows contain true positive and false positive samples
respectively. In the bottom row, the correct retrievals look significantly different, leading to a false
alarm. Green and red borders indicate correct and incorrect retrievals respectively.
The results indicate that image repurposing performance is dependent on package retrieval performance. To further test this hypothesis, we compare the average number of correct packages retrieved between successful (true positive and true negative) and unsuccessful (false positive and false negative) classifications. The results in Table 5.5 show consistently better retrieval for correctly classified packages.
5.4 Summary
Semantic image repurposing detection is an important but emerging research area for multimodal forensics and fake news detection. In the previous chapter we introduced a deep multimodal model that relies on a single retrieved package to verify information. In this chapter we presented a multi-evidence GNN model (MEG) for semantic repurposing detection that improves upon the previous state-of-the-art across three benchmark datasets. Our scaling modifications over a standard GNN make the proposed model scalable to multiple retrieved packages, and our model is order invariant compared to standard recurrent architectures.

Figure 5.4: The two rows show true positive samples of our model. The first and second packages have location and organization manipulations respectively. Green and red borders indicate correct and incorrect retrievals respectively. Metadata highlighted in red in the query package is the manipulation.
Besides the contributions to image repurposing detection (problem definition, datasets, baselines, metrics and methods) in this and previous chapters, unexplored problems remain. The methods discussed so far do not localize the exact manipulation. While successful manipulation detection can alert users to semantic manipulations, successful localization can help users reason about manipulations. Another possible area to explore is real-time multimodal semantic repurposing detection, i.e. using the web instead of a reference dataset. Most practical scenarios where such a verification system is needed would not have a carefully curated reference dataset.
Figure 5.5: The two rows show false negative samples of our model. The first and second packages have location and organization manipulations respectively. Green and red borders indicate correct and incorrect retrievals respectively. Metadata highlighted in red in the query package is the manipulation.
Additionally, the assumption that all information in the reference dataset is reliable may not hold: a reference dataset collected at scale is likely to contain spurious information. These research directions, oriented towards solutions that can be deployed or used in a real-world setting, are left for future work.
Chapter 6
Biomedical Image Forensics
6.1 Introduction
Research misconduct can appear in several forms, such as plagiarism, fabrication and falsification. Scientific misconduct has consequences beyond ethics, leading to retractions (Bik et al. 2018) and, by one estimate, a financial loss of $392,582 for each retracted article (Stern et al. 2014). The general scope of scientific misconduct and unethical behavior is broad. An emerging research domain is biomedical image forensics, i.e. the detection of research misconduct in biomedical publications (Bik et al. 2018; Christopher 2018; Bik et al. 2016). In this chapter we focus on the detection of manipulation and inappropriate duplication of scientific images in biomedical literature.
Duplication and tampering of protein, cell, tissue and other experimental images has become a nuisance in the biomedical sciences community. Duplication involves reusing all or part of the images generated by one experiment to misrepresent the results of unrelated experiments. Tampering involves pixel- or patch-level forgery to hide unfavorable aspects of an image or to produce favorable results. Biomedical image forgeries can be more difficult for a human to detect than manipulated images on social media, due to the presence of arbitrary and confusing patterns and the lack of real-world semantic context. Detection is further complicated by manipulations involving images across different documents. The difficulty of noticing such manipulations, coupled with a high paper-per-reviewer ratio, often leads to these manipulations going unnoticed during the review process; they may come under scrutiny later, leading to possible retractions (Bik et al. 2018). While the problem has received the attention of the biomedical community, to the best of our knowledge there is no publicly available biomedical image forensics dataset, detection software or standardized task for benchmarking. We address these issues by releasing the first biomedical image forensics dataset (BioFors) and proposing benchmarking tasks.
The objective of our work is to advance biomedical forensic research to identify suspicious images with high confidence. We hope that BioFors will promote the development of algorithms and software that can help reviewers identify manipulated images in research documents. The final decision regarding malicious, mistaken or justified intent behind a suspicious image is left to the forensic analyst. This is important due to cases of duplication/tampering that are justified with citation, explanation, harmlessness or naive mistake, as detailed in (Bik et al. 2016). BioFors comprises 47,805 manually cropped images belonging to four major categories — (1) Microscopy, (2) Blot/Gel, (3) Macroscopy and (4) Flow-cytometry or Fluorescence-activated cell sorting (FACS). It covers popular biomedical image manipulations (including repurposed images) with three forgery detection tasks. The contributions of this chapter are:
• A large scale biomedical image forensics dataset with real-world forgeries
• A computation-friendly taxonomy of forgery detection tasks that can be matched with standard computer vision tasks for benchmarking and evaluation
• Extensive analysis explaining the challenges of biomedical forensics and the loss in performance of standard computer vision models when applied to biomedical images
6.2 BioFors Benchmark
As discussed in the previous section, a dataset with standardized benchmarking is essential to advance the field of biomedical image forensics. Additionally, we want BioFors to have image-level granularity in order to facilitate image- and pixel-level evaluation. Furthermore, it is desirable to use images with real-world manipulations. To this end, we used open-source or retracted research documents to curate BioFors. BioFors is a reasonably large dataset at the intersection of the biomedical and image-forensics domains, with 46,064 pristine and 1,741 manipulated images, compared to biomedical image datasets such as FMD (Zhang et al. 2019) (12,000 images before augmentation) and CVPPP (Scharr et al. 2014) (284 images), and to image forensics datasets including Columbia (Ng et al. 2009) (180 tampered images), COVERAGE (Wen et al. 2016) (100 tampered images), CASIA (Dong et al. 2013) (5,123 tampered images) and MFC (Guan et al. 2019) (100k tampered images). Section 6.2.1 details the image collection procedure, Section 6.2.2 describes image diversity and categorization, and Section 6.2.3 describes the proposed manipulation detection tasks.

Figure 6.1: We crop compound biomedical figures in two stages: 1) crop sub-figures and 2) crop images from sub-figures. Synthetic images such as charts and plots are filtered.
6.2.1 Image Collection Procedure and Statistics
Most research publications do not exhibit forgery; collecting documents with manipulations is therefore a difficult task. We received a set of documents from Bik et al. (Bik et al. 2016), along with raw annotations of suspicious scientific images, which are discussed in Section 6.2.3. From the list of documents from different journals provided to us, we selected documents from the PLOS ONE open-access journal, comprising 1,031 biomedical research documents published between January 2013 and August 2014.
Figure 6.2: Distribution of the number of images extracted per document (image-count interval vs. document frequency). The distribution peaks at 25 images for most documents; the rightmost entry has 219 images from a single document.
The collected documents were in Portable Document Format (PDF); however, direct extraction of biomedical images from PDF documents is not possible with available software. Furthermore, figures in biomedical documents are compound figures (Shi et al. 2019; Tsutsui et al. 2017), i.e. a figure comprises biomedical images, charts, tables and other artifacts, and state-of-the-art biomedical figure decomposition models (Shi et al. 2019; Tsutsui et al. 2017) produce imperfect and overlapping crop boundaries. We overcame these challenges in two steps: 1) automated extraction of figures from documents and 2) manual cropping of images from figures. For automated figure extraction we used deepfigures (Siegel et al. 2018); we experimented with other open-source figure extractors, but deepfigures had significantly better crop boundaries and worked well on all the documents. We obtained 6,543 figure images, of which 5,035 figures contained biomedical images. For the cropping step, to minimize human error in manual crop boundaries, we performed cropping in two stages: we cropped sub-figures with a loose bounding box, followed by a tight crop around the images of interest. We filtered out synthetic/computer-generated images such as tables, bar plots, histograms, graphs, flowcharts and diagrams, since verification of numerical results in synthetic images is beyond the scope of this work. Figure 6.1 shows a sample compound figure and its decomposition. The image collection process resulted in 47,805 images. There is significant variation in the number of images extracted from each document; Figure 6.2 shows the frequency of images extracted per document. We created the train/test split such that a document and its images belong to the test set if the document has at least one manipulation. Table 6.1 gives an overview of the dataset. Furthermore, images in BioFors have a wide range of dimensions; Figure 6.3 shows a scatter plot of BioFors image dimensions compared to two other natural-image forensic datasets, Columbia (Ng et al. 2009) and COVERAGE (Wen et al. 2016).

Modality            Train    Test     Total
Documents           696      335      1,031
Figures             3,377    1,658    5,035
All Images          30,536   17,269   47,805
Microscopy Images   10,458   7,652    18,110
Blot/Gel Images     19,105   8,335    27,440
Macroscopy Images   555      639      1,194
FACS Images         418      643      1,061

Table 6.1: The top rows give a high-level view of BioFors; the bottom rows provide statistics by image category. The training set comprises pristine images and documents only.
6.2.2 Dataset Description
We classify the images from the collection step into four categories — (1) Microscopy, (2) Blots/Gels, (3) Flow-cytometry or Fluorescence-activated cell sorting (FACS) and (4) Macroscopy. This taxonomy considers both the semantics and the visual similarity of the different image classes. Semantically, microscopy includes images from experiments that are captured using a microscope, such as images of tissues and cells. Variation in microscopy images can result from factors pertaining to origin (e.g. human, animal, organ) or from fluorescent chemical staining of cells and tissues, which produces images of diverse colors and structures. Western, northern and southern blots and gels are used for the analysis of proteins, RNA and DNA respectively; the images look similar, and the specific protein or blot types are visually indistinguishable. FACS images look similar to synthetic scatter plots, but the pattern is generated by a physical experiment and represents the scattering of cells or particles. Finally, macroscopy includes experimental images that are visible to the naked eye and do not fall into any of the first three categories; it is the most diverse image class, with images including rat specimens, tissues, ultrasound, leaves, etc. Table 6.1 shows the composition of BioFors by image class and Figure 6.4 shows the inter- and intra-class diversity of each class. The image categorization discussed here is easily learnable by popular image classification models, as shown in Table 6.2.

Figure 6.3: Scatter plot of image widths and heights. BioFors images have much higher variation in dimensions than two popular image forensic datasets (Columbia and COVERAGE).

Model                          Train    Test
VGG16 (Simonyan et al. 2014)   99.79%   97.11%
DenseNet (Huang et al. 2017)   99.25%   97.67%
ResNet (He et al. 2016)        98.93%   97.47%

Table 6.2: Accuracy of classifying BioFors images using popular image classification models is reliable.
Figure 6.4: Rows of image samples representative of the following image classes: (a) Microscopy, (b) Blot/Gel, (c) FACS and (d) Macroscopy.
6.2.3 Manipulation Detection Tasks in BioFors
The raw annotations provided by Bik et al. (Bik et al. 2016) contain freehand annotations of manipulated regions and notes explaining why the authors of (Bik et al. 2016) consider them manipulated. However, the annotation format was not directly useful for ground truth computation. We inspected all suspicious images and manually created binary ground truth masks for all manipulations. This process resulted in 297 documents containing at least one manipulation. We also checked the remaining documents for potentially overlooked manipulations and found another 38 documents with at least one manipulation. The document-level Cohen's kappa (κ) inter-rater agreement between the biomedical experts (raw annotations) and the computer vision experts (final annotations) is 0.91.

Unlike natural-image forensic datasets (Dong et al. 2013; Nimble Challenge 2017 Evaluation — NIST n.d.; Ng et al. 2009; Wen et al. 2016) that include synthetic manipulations, BioFors has real-world suspicious images, where the forgeries are diverse and the image creators do not share the origin of the images or manipulations. Therefore, we cannot create a one-to-one mapping from biomedical image manipulation detection tasks to the forgeries described in Section 2.1.4. Consequently, we propose three manipulation detection tasks in BioFors — (1) external duplication detection, (2) internal duplication detection and (3) cut/sharp-transition detection. These tasks comprehensively cover the manipulations presented in (Christopher 2018; Bik et al. 2016). Table 6.3 shows the distribution of documents and images in the test set across tasks. We describe the tasks and their annotation below.

Modality             EDD      IDD     CSTD
Documents            308      54      61
Pristine Images      14,675   2,307   1,534
Manipulated Images   1,547    102     181
All Images           16,222   2,409   1,715

Table 6.3: Distribution of pristine and tampered images in the test set by manipulation task.

Figure 6.5: Two image pairs exhibiting duplication in the EDD task. Duplicated regions are color coded to show correspondence; the bottom row shows the ground truth masks used for evaluation.
External Duplication Detection (EDD): This task involves the detection of near-identical regions between images. The duplicated region may span all or part of an image. Figure 6.5 shows two examples of external duplication. Duplicated regions may appear for two reasons: (1) cropping two images with an overlap from a larger original source image, or (2) splicing, i.e. copy-pasting a region from one image into another, as shown in Figure 6.5a and b respectively. Irrespective of the origin of the manipulation, the task requires detection of recurring regions between a pair of images. Another dimension of complexity for EDD stems from the orientation difference between duplicated regions: the duplicated regions in the second example of Figure 6.5 have been rotated by 180°, and we also found orientation differences of 0°, 90°, horizontal flip and vertical flip. Figure 6.6 shows the frequency of each orientation between duplicated regions. From an evaluation perspective, an image pair is considered one sample for the EDD task and ground truth masks also occur in pairs; the same image may have unique masks for different pairs, corresponding to the duplicated regions. Since it is computationally expensive to consider all image pairs in a document, we drastically reduce the number of pairs to be computed by considering only pairs of the same image class. This is a reasonable heuristic, since (1) we do not find duplications between images of different classes and (2) automated image classification has reliable accuracy, as shown in Table 6.2.

Figure 6.6: Frequency of differing orientations between duplicated regions in EDD and IDD tasks (0°: 1294, 90°: 17, 180°: 50, horizontal flip: 54, vertical flip: 24).
Internal Duplication Detection (IDD): IDD is our proposed image forensics task that involves the detection of internally repeated image regions (Y. Wu et al. 2018; Islam et al. 2020). Unlike a standard copy-move forgery detection (CMFD) task, where the source region is known and is from the same image, in IDD the source region may or may not be from the same image; the repeated regions may have been procured by the manipulator from a different image or document. Figure 6.7 shows examples of internal duplication. Notice that the regions highlighted in red in Figure 6.7c and d are the same, and it is unclear which, if any, of the patches is the source. Consequently, from an evaluation perspective we treat all duplicated regions within an image as forged. Ground truth annotation includes one mask per image. The orientation statistics shown in Figure 6.6 also hold for images in the IDD task.

Figure 6.7: Manipulated samples in the IDD task. The top row shows images and the bottom row the corresponding masks. Repeated regions within the same image are color coded.
Cut/Sharp-Transition Detection (CSTD): A cut or a sharp transition can occur at the boundary of spliced or tampered regions. Unlike spliced images on social media, blot/gel images do not show a clear distinction between the authentic background and the spliced foreground, making it difficult to identify the foreign patch. As an example, in Figure 6.8a and b it is not possible to identify whether the left or right section of the western blot is spliced. Sharp transitions in texture can also result from blurring of pixels or other manipulations of unknown provenance. In both cases, a discontinuity in image texture in the form of a cut or sharp transition is the sole clue for detecting the manipulation. Accordingly, we annotate anomalous boundaries as forged. From an annotation perspective, cuts or sharp transitions can be difficult to see; we therefore used gamma correction to make the images light or dark and highlight potentially manipulated regions. Figure 6.9 shows examples of gamma correction. Ground truth is a binary mask for each image.

Figure 6.8: Examples of cuts/transitions. The noticeable sharp transition in (c) has been annotated, but the complete boundary is unclear.

Figure 6.9: Light and dark gamma correction of images makes it easier to spot potential manipulations. The third arrow band in (a) appears to be spliced.
6.3 Why is Biomedical Forensics Hard?
Based on our insights from the data curation process and the analysis of experimental results in Section 6.4, we explain potential challenges for natural-image forensic methods when applied to the biomedical domain.
Figure 6.10: Examples of annotation artifacts in biomedical images: (a) dotted lines, (b) alphanumeric text, (c) arrows, (d) scale.
Figure 6.11: The left three columns show staining of microscopy images. The right column is an overlay of all stained images. Two or more images can be found tiled in this fashion.
Artifacts in Biomedical Images: Unlike natural image datasets, biomedical images are scientific
images presented in research documents. Accordingly, there are artifacts in the form of annotations
and legends that are added to an image. Figure 6.10 shows some common artifacts that we found,
including text and symbols such as arrows, scale and lines. The presence of these artifacts can
create false positive matches for EDD and IDD tasks.
Figure Semantics: Biomedical research documents contain images that are visually similar, but whose figure semantics indicate that they are not manipulated. Two such statistically significant semantics are staining-merging and zoom, and forgery detection algorithms may generate false positive matches for images belonging to these categories. Stained images originate from microscopy experiments that involve colorization of the same cell/tissue sample with different fluorescent chemicals. This is usually followed by a merged/overlaid image which combines the stained images, and the resulting images are tiled together in the same figure. Since the underlying cell/tissue sample is unchanged, the image structure is retained across images, but with color changes. Figure 6.11 shows samples of staining and merging. The second semantics involves repeated portions of images that are magnified to highlight experimental results: zoom semantics involves images that contain a zoomed portion of the image internally, or are themselves a zoomed portion of another image. The zoomed area is indicated by a rectangular bounding box and the images are adjacent. Figure 6.12 shows paired and single images with zoom semantics.

Figure 6.12: Images on the left show pairs of zoomed images. The right column has zoomed regions within the image. Rectangular bounding boxes are part of the original image.
Image Texture: As illustrated in Figure 6.4, biomedical images tend to have a plain or pattern-like texture, with the exception of macroscopy images. This phenomenon is particularly accentuated in blot/gel and microscopy images, which are the two largest image classes and also contain the most manipulations. The plain texture makes it difficult to identify keypoints and extract descriptors for image matching, making descriptor-based duplication detection difficult. We contrast this with the ease of identifying keypoints in two common computer vision datasets, Flickr30k (Plummer et al. 2015) and Holidays (Jegou et al. 2008). Figure 6.13 shows the median number of keypoints identified in each image class using three off-the-shelf descriptor extractors: SIFT (Lowe 2004), ORB (Rublee et al. 2011) and BRIEF (Calonder et al. 2010). We resized all images to 256x256 pixels to account for differing image sizes. With the exception of FACS, the other three image classes show a sharp decline in the number of extracted keypoints. We consider FACS an exception due to the large number of dots, where each dot is capable of producing a keypoint; however, these keypoints may be redundant and not necessarily useful for biomedical image forensics.

Figure 6.13: Median number of keypoints identified per image (SIFT/ORB/BRIEF): Flickr30K 458/450/55, Holidays 422/434/34, Microscopy 190/312/6, Blot/Gel 16/23/2, Macroscopy 158/258/8, FACS 533/430/31. Biomedical images have a relatively plain texture, with the exception of FACS images, leading to fewer keypoints.
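A minimal OpenCV sketch of this keypoint-count comparison is shown below, assuming the opencv-contrib package for SIFT and BRIEF; pairing BRIEF with the FAST detector is an assumption, since BRIEF itself is only a descriptor.

import cv2

def count_keypoints(image_path):
    # Count keypoints detected on a 256x256 grayscale version of the image.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (256, 256))

    sift = cv2.SIFT_create()
    orb = cv2.ORB_create()
    fast = cv2.FastFeatureDetector_create()
    brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()

    counts = {
        "SIFT": len(sift.detect(img, None)),
        "ORB": len(orb.detect(img, None)),
    }
    kp = fast.detect(img, None)
    kp, _ = brief.compute(img, kp)  # keep only keypoints with valid descriptors
    counts["BRIEF"] = len(kp)
    return counts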
Hard Negatives: Scientific experiments often involve tuning multiple parameters in a common experimental paradigm to produce comparative results. For biomedical experiments, this can produce very similar-looking images, which act as hard negatives when looking for duplicated regions. For blot and gel images this can be true irrespective of a common experimental framework, due to patterns of blobs on a monotonous background. Figure 6.14 shows hard negative samples for each image class.

Figure 6.14: Hard negative samples from the Blot/Gel, Macroscopy, FACS and Microscopy classes, in clockwise order.
6.4 Evaluation and Benchmarking
6.4.1 Metrics
For all the manipulation detection tasks discussed in Section 6.2.3, detection algorithms are expected to produce a binary prediction mask of the same dimensions as the input image. The predicted masks are compared against the ground truth annotation masks included in the dataset. Manipulated pixels denote the positive class. Following previous work in forgery detection (Y. Wu et al. 2018, 2019, 2017), we compute F1 scores between the predicted and ground truth masks for all tasks. We also compute the Matthews correlation coefficient (MCC) (Matthews 1975) between the masks, since it has been shown to present a balanced score when dealing with imbalanced data (Chicco et al. 2020; Boughorbel et al. 2017), as is our case with few manipulated images. MCC ranges from -1 to +1 and represents the correlation between prediction and ground truth. F1 score tabulation is done in Appendix E. Evaluation is done both at the image and pixel level, i.e. true/false positives and true/false negatives are determined for each image and each pixel. For image-level evaluation, following the protocol in (Y. Wu et al. 2018), we consider an image to be manipulated if any one pixel has a positive prediction. Pixel-level evaluation across multiple images is similar to protocol A in (Y. Wu et al. 2018), i.e. all pixels from the dataset are gathered for one final computation.
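A minimal sketch of this evaluation protocol, assuming predictions and ground truth are binary NumPy masks, might look as follows.

import numpy as np

def pixel_level_scores(pred_masks, gt_masks):
    # Protocol-A style pixel evaluation: pool every pixel from all
    # images into one confusion matrix, then compute F1 and MCC.
    pred = np.concatenate([m.ravel() for m in pred_masks]).astype(bool)
    gt = np.concatenate([m.ravel() for m in gt_masks]).astype(bool)
    tp = float(np.sum(pred & gt))
    tn = float(np.sum(~pred & ~gt))
    fp = float(np.sum(pred & ~gt))
    fn = float(np.sum(~pred & gt))
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    mcc = (tp * tn - fp * fn) / (
        np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + 1e-12)
    return f1, mcc

def image_level_labels(pred_masks):
    # An image is flagged as manipulated if any pixel is positive.
    return [bool(m.any()) for m in pred_masks]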
6.4.2 Baseline Models
We evaluate several deep learning and non-deep-learning models on the three tasks introduced in Section 6.2.3. Our baselines are selected from the forensics literature based on model/code availability and task suitability. Deep learning baselines require finetuning for weight adaptation. However, due to the small number of manipulated samples, the BioFors training set comprises pristine images only. Inspired by previous forgery detection methods (Y. Wu et al. 2018, 2017), we create synthetic manipulations on pristine training data to finetune models. Details of the synthetic data and baseline experiments are provided in the supplementary material. To promote reproducibility, our synthetic data generators and evaluation scripts will be released with the dataset.

External Duplication Detection (EDD): Baselines for EDD should identify repeated regions between images. We evaluate classic keypoint-descriptor based image matching algorithms such as SIFT (Lowe 2004), ORB (Rublee et al. 2011) and BRIEF (Calonder et al. 2010), following a classic object matching approach and using RANSAC (Fischler et al. 1981) to remove stray matches. CMFD algorithms can be used by concatenating two images to create a single input; we evaluated DenseField (DF) (Cozzolino et al. 2015a) with its best reported transform, zernike moments (ZM), on concatenated images. Additionally, we evaluate a splicing detection algorithm, DMVN (Y. Wu et al. 2017), to find repeated regions. DMVN implements a deep feature correlation layer which matches coarse image features at 16x16 resolution to find visually similar regions.
Figure 6.15: Our baseline CNN architecture, mapping an input image (HxWx3) to a binary mask (HxWx1) through five Conv2d layers: Ch=32, k=5, pad=2; Ch=64, k=3, pad=1; Ch=128, k=3, pad=1; Ch=256, k=3, pad=1; Ch=1, k=3, pad=1. Padding ensures the same height and width for the input image and output mask.
Internal Duplication Detection (IDD): Appropriate baselines for IDD should be suitable for identifying repeated regions within images. DenseField (DF) (Cozzolino et al. 2015a) proposes an efficient dense feature matching algorithm for CMFD; we evaluate it using the three circular harmonic transforms used in the paper: zernike moments (ZM), polar cosine transform (PCT) and fourier-mellin transform (FMT). We also evaluated the CMFD algorithm reported in (Christlein et al. 2012), using three block-based features: discrete cosine transform (DCT) (Fridrich et al. 2003), zernike moments (ZM) (Ryu et al. 2010) and discrete wavelet transform (DWT) (Bashar et al. 2010). BusterNet (Y. Wu et al. 2018) is a two-stream deep learning based CMFD model that leverages visual similarity and manipulation artifacts; visual similarity in BusterNet is identified using a self-correlation layer on coarse image features followed by percentile pooling.

Cut/Sharp-Transition Detection (CSTD): Unlike the previous two tasks, it is challenging to find forensics algorithms designed to detect cuts or transitions. We evaluate ManTraNet (Y. Wu et al. 2019), a state-of-the-art manipulation detection algorithm used to identify anomalous pixels and image regions. We also evaluated a baseline convolutional neural network (CNN) model for detecting cuts and transitions. The CNN was trained on synthetic manipulations in blot/gel images from the training set. Figure 6.15 shows the CNN architecture.
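Based on the layer specification in Figure 6.15, a minimal Keras sketch of this baseline could look as follows; the ReLU/sigmoid activations are assumptions, since the figure only specifies the convolutions.

import tensorflow as tf
from tensorflow.keras import layers

def build_cstd_baseline():
    # Fully convolutional network from Figure 6.15; 'same' padding keeps
    # the output mask at the input's height and width.
    return tf.keras.Sequential([
        layers.Conv2D(32, 5, padding="same", activation="relu",
                      input_shape=(None, None, 3)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.Conv2D(1, 3, padding="same", activation="sigmoid"),  # HxWx1 mask
    ])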
Figure 6.16: Synthetic manipulations in image pairs, with ground truth masks, created using (a) overlapping image regions and (b) spliced image patches.
6.4.3 Synthetic Data Generation
Image forensic datasets usually do not have sufficient samples to train deep learning models. Previous works (Y. Wu et al. 2018, 2017) created suitable synthetic manipulations in natural images for model pre-training; the synthetic manipulations were created by extracting objects from images and pasting them into the target image with limited data augmentation, such as rotation and scale change. Similarly, we created synthetic manipulations in biomedical images corresponding to each task for training and validation. However, biomedical images do not have well-defined objects and boundaries, so manipulated regions are created with rectangular patches. The manipulation process for each task is discussed below.

External Duplication Detection (EDD): Corresponding to the two possible sources of external duplication, we create manipulations by 1) cropping two images with overlapping regions from a source image and 2) splicing, i.e. copy-pasting rectangular patches from a source to a target image. Manipulations of both types are shown in Figure 6.16; a code sketch of the splicing variant follows below. We generate pristine, spliced and overlapped samples in a 1:1:1 ratio. Images extracted with overlapping regions are resized to 256x256 dimensions.
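The splicing variant can be sketched in NumPy as below; the patch size range and uniform placement are illustrative assumptions.

import numpy as np

def splice_patch(source, target, rng=None):
    # Copy a random rectangular patch from `source` into `target` and
    # return the manipulated image with binary masks for both images.
    rng = rng or np.random.default_rng()
    h, w = target.shape[:2]
    ph = rng.integers(h // 8, h // 2)               # patch height (assumed range)
    pw = rng.integers(w // 8, w // 2)               # patch width
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    ty, tx = rng.integers(0, h - ph), rng.integers(0, w - pw)

    manipulated = target.copy()
    manipulated[ty:ty + ph, tx:tx + pw] = source[sy:sy + ph, sx:sx + pw]

    src_mask = np.zeros((h, w), dtype=np.uint8)
    tgt_mask = np.zeros((h, w), dtype=np.uint8)
    src_mask[sy:sy + ph, sx:sx + pw] = 1            # duplicated region in source
    tgt_mask[ty:ty + ph, tx:tx + pw] = 1            # duplicated region in target
    return manipulated, src_mask, tgt_mask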
Internal Duplication Detection (IDD): Internal duplications are created with a copy-move operation within an image. Rectangular patches of random dimensions are copy-pasted within the same image. Figure 6.17 shows samples.

Figure 6.17: Internal duplications created with copy-move operations.

Cut/Sharp-Transition Detection (CSTD): We simulate synthetic manipulations by randomly splitting an image along two horizontal or vertical lines and rejoining it. The line of rejoining represents a synthetic cut or sharp transition and is used for training. Figure 6.18 shows synthetic CSTD manipulations.

Figure 6.18: Synthetic cuts/sharp transitions created in blot/gel images, with the corresponding ground truth masks.
6.4.4 Results
Tables 6.4 and 6.5 present image- and pixel-level baseline results for the EDD task, and Tables 6.6 and 6.7 present results for the IDD and CSTD tasks respectively. The corresponding F1 score evaluation for EDD and IDD can be found in Appendix E. We find that dense feature matching approaches (DF-ZM, PCT, FMT) are better than sparse (SIFT, ORB, BRIEF), block-based (DCT, DWT, Zernike) or coarse feature matching methods (DMVN and BusterNet) for identifying repeated regions in both the EDD and IDD tasks.
Method                             Microscopy   Blot/Gel   Macroscopy   FACS    Combined
SIFT (Lowe 2004)                   0.180        0.113      0.130        0.11    0.142
ORB (Rublee et al. 2011)           0.319        0.087      0.126        0.269   0.207
BRIEF (Calonder et al. 2010)       0.275        0.058      0.135        0.244   0.180
DF - ZM (Cozzolino et al. 2015a)   0.422        0.161      0.285        0.540   0.278
DMVN (Y. Wu et al. 2017)           0.242        0.261      0.185        0.164   0.244

Table 6.4: Image-level evaluation results for the external duplication detection (EDD) task by image class. All numbers are MCC scores.
Method                             Microscopy   Blot/Gel   Macroscopy   FACS    Combined
SIFT (Lowe 2004)                   0.146        0.148      0.194        0.073   0.132
ORB (Rublee et al. 2011)           0.342        0.127      0.226        0.187   0.252
BRIEF (Calonder et al. 2010)       0.277        0.102      0.169        0.188   0.202
DF - ZM (Cozzolino et al. 2015a)   0.425        0.192      0.256        0.504   0.324
DMVN (Y. Wu et al. 2017)           0.342        0.430      0.238        0.282   0.310

Table 6.5: Pixel-level evaluation results for the external duplication detection (EDD) task by image class. All numbers are MCC scores.
Method                              Microscopy      Blot/Gel        Macroscopy      Combined
                                    Image   Pixel   Image   Pixel   Image   Pixel   Image   Pixel
DF - ZM (Cozzolino et al. 2015a)    0.764   0.197   0.515   0.449   0.573   0.478   0.564   0.353
DF - PCT (Cozzolino et al. 2015a)   0.764   0.202   0.503   0.466   0.712   0.487   0.569   0.364
DF - FMT (Cozzolino et al. 2015a)   0.638   0.167   0.480   0.400   0.495   0.458   0.509   0.316
DCT (Fridrich et al. 2003)          0.187   0.022   0.250   0.168   0.158   0.143   0.196   0.095
DWT (Bashar et al. 2010)            0.299   0.067   0.384   0.295   0.591   0.268   0.341   0.171
Zernike (Ryu et al. 2010)           0.192   0.032   0.336   0.187   0.493   0.262   0.257   0.114
BusterNet (Y. Wu et al. 2018)       0.183   0.178   0.226   0.076   0.021   0.106   0.269   0.107

Table 6.6: Results for the internal duplication detection (IDD) task by image class, with a combined result. There are no IDD instances in FACS images. Image and Pixel columns denote image- and pixel-level evaluation respectively. All numbers are MCC scores.
Dense feature matching is computationally expensive, and most image forensics algorithms obtain a viable quality-computation trade-off on natural images.
Method                          F1              MCC
                                Image   Pixel   Image   Pixel
MantraNet (Y. Wu et al. 2019)   0.253   0.09    0.170   0.080
CNN Baseline                    0.212   0.08    0.098   0.070

Table 6.7: Results on the cut/sharp-transition detection (CSTD) task.
However, biomedical images have relatively plain texture and similar patterns, which may lead to indistinguishable features under coarse or sparse extraction; for the set of baselines evaluated, exchanging feature matching quality for computation is not successful on biomedical images. Furthermore, performance varies drastically across image classes for all methods, with different models peaking on different image classes. This variation is expected, since the semantic and visual characteristics vary by image category; as a direct consequence, image-category-specific models may need to be developed in future research. On CSTD, our simple baseline trained to detect sharp transitions produces false alarms on image borders and on the edges of blots. Both ManTraNet and our baseline have similar performance, indicating that a specialized model design might be required to detect cuts and anomalous transitions. Finally, performance is low across all tasks, which can be attributed to the challenges discussed in Section 6.3. In summary, it is safe to conclude that existing natural-image forensic methods are not robust when applied to biomedical images and show high variation in performance across image classes. The results emphasize the need for robust forgery detection algorithms that are applicable to the biomedical domain. Prediction samples for all three tasks are shown in Figures 6.19, 6.20 and 6.21. For EDD we show predictions from ORB (Rublee et al. 2011) and DMVN (Y. Wu et al. 2017). Samples for IDD include the DCT (Fridrich et al. 2003), DenseField (Cozzolino et al. 2015a), DWT (Bashar et al. 2010), Zernike (Ryu et al. 2010) and BusterNet (Y. Wu et al. 2018) baselines. Similarly, CSTD predictions are from ManTraNet (Y. Wu et al. 2019) and our CNN baseline.
Figure 6.19: Rows of image pairs with ground truth masks and the masks predicted by ORB and DMVN. The text in sample (d) misleads the predictions of both models.
Figure 6.20: Rows of images with ground truth masks and forgery detection predictions from DCT, Zernike, DF-ZM, DWT and BusterNet. There is significant variation in predictions across models; the rotated duplications in sample (a) are not identified by any model.
Figure 6.21: Predictions from ManTraNet and the baseline CNN, with ground truth masks. It is evident that current forensic models are not suitable for the CSTD task.
6.5 Ethical Considerations
We have used documents from PLOS to curate BioFors, since it is open access and can be used for further research, including modification and distribution. The purpose of BioFors is to foster the development of algorithms to flag potential manipulations in scientific images; BioFors is explicitly not intended to malign or allege scientific misconduct against authors whose documents have been used. To this end, we take two precautions: (1) we anonymize images by withholding information about the source publications; since scientific images have abstract patterns, matching BioFors images against documents on the web is significantly hindered, protecting the identity of source documents. (2) We include pristine documents and documents with extenuating circumstances, such as citation for duplication or justification. As a result, the inclusion of a document in BioFors does not imply scientific misconduct.
6.6 Summary
Manipulation of scientific images is an issue of serious concern for the biomedical community. While reviewers can attempt to screen for scientific misconduct, the complexity and volume of the task place an undue burden on them; automated and scalable biomedical forensic methods are necessary to assist reviewers. We presented BioFors, a large biomedical image forensics dataset covering a comprehensive range of images found in biomedical documents. We also framed three manipulation detection tasks based on common manipulations found in the literature. Our evaluations show that common computer vision algorithms are not robust when extended to the biomedical domain, and our analysis shows that attaining respectable performance will require well-designed models, as the problem poses multiple challenges. We expect that BioFors will advance biomedical image forensics research.
Chapter 7
Repurposing Detection in Biomedical Images
7.1 Introduction
There are three manipulation detection tasks described in Chapter 6: external duplication detection (EDD), internal duplication detection (IDD) and cut/sharp-transition detection (CSTD). Together, these tasks cover popular forms of semantic and digital manipulation found in biomedical literature. In this chapter we focus on the detection of semantically repurposed biomedical images, which are found entirely as a subset of the external duplication detection (EDD) task. However, the EDD task does not consist entirely of repurposed images; it also has instances of digitally spliced images. The unifying factor across manipulated images in the EDD task is the presence of duplicated image regions, either partial or whole. Unlike the repetition of natural images in a social media context, as described in Chapters 3, 4 and 5, reuse of images in biomedical literature raises a red flag. Therefore, identifying a repeated image region is a sufficient condition for image repurposing detection. However, identifying duplicated regions in biomedical images is not an easy task, as shown in Chapter 6.
This chapter proposes a multi-scale overlap detection network (MONet) that recursively finds overlap between patches to locate duplicated image regions. Recursive overlap detection is performed at multiple scales in a hierarchical manner, from large to small image patches. Our model increases the matching detail from coarse to refined feature maps in a top-down approach, while simultaneously reducing the computational burden by making fewer patch comparisons. The proposed method can also be modified for the internal duplication detection task, although its performance there does not surpass existing digital image forensics baselines.
7.2 Architecture Design
Motivation: The primary objective of the EDD task is to locate duplicated regions between biomedical image pairs, regardless of image provenance, i.e. irrespective of whether repurposing or digital forgery led to the repetition. Given two input images I_1, I_2 ∈ R^{H×W×3}, the objective of the EDD task is to predict two binary masks M_1, M_2 ∈ R^{H×W×1} highlighting the duplicated image regions. We also know from Chapter 6 that identifying repeated regions in biomedical images is a challenging task: keypoint based approaches (Lowe 2004; Rublee et al. 2011; Calonder et al. 2010) suffer from sparse keypoint detection, dense matching of features (Cozzolino et al. 2015a) is computationally inefficient and relies on heuristics to make sparse matches, and deep learning based methods (Y. Wu et al. 2018, 2017) utilize a low-resolution feature map which loses detail. To overcome these challenges, we design a U-Net style neural network (Ronneberger et al. 2015) that measures patch overlap at different scales in a hierarchical fashion to reduce computation and detect duplication. Figure 7.1 shows our model diagram.
Architecture Overview: The general structure of our model resembles a U-Net (Ronneberger et al. 2015), with a series of convolutional encoders at multiple scales s ∈ {1,2,3,4,5} that produce feature maps F ∈ R^{N_h×N_w×C_s}. For notational convenience we consider square images of dimension N×N×3; consequently, the spatial dimension of the encoder feature map at scale s is (N/2^s) × (N/2^s). The upsampling path involves a series of convolutional decoders that produce feature maps at scales corresponding to those of the encoder, and the final output of the decoder is a pair of binary prediction masks. To find duplicated regions between images, we measure overlap between patches of the two images at each scale within overlap-score modules (OSMs), in a top-down hierarchy. The maximum overlap score of a patch indicates the confidence with which all or part of it is considered to have been repeated in the other image: a high score indicates full or substantial repetition, while a low score represents negligible or no repetition. To minimize the number of patch comparisons, patch pairs in I_1 and I_2 with maximum overlap at the current scale s are used to guide the search among sub-patches at the lower scale s − 1, and so on.

Figure 7.1: Illustration of MONet. The top shows details of the overlap score module (OSM) and the bottom shows the overall architecture.
Overlap-Score Module (OSM): The purpose of the OSM is to predict two overlap score maps at each scale, corresponding to the two feature maps. The deconvolution layers upsample the overlap score maps sequentially to produce the binary output masks; score maps from the previous and current scales are concatenated for upsampling. Overlap scores are produced by an overlap detection network D, which takes as input two patch feature vectors (one from each image) and is trained on patch feature triplets (anchor, overlapping and non-overlapping patches) generated from synthetic data at each scale. We consider a feature map F at scale s to be composed of a grid of patch feature vectors f ∈ R^{1×1×C_s}, such that each feature vector represents a patch of dimension d_s × d_s in the input image, where d_s = 2^s (e.g. 32x32-pixel patches at scale 5, consistent with Table 7.1). While the convolutional receptive field of a feature vector f is larger than the patch dimension d_s at any given scale, we implicitly limit the scope of each feature vector to its patch dimensions when measuring overlap. The overlap score map is indexed like the feature map F; the score at each index represents the maximum overlap found for that patch feature vector when compared against the patch feature vectors of the other image.
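A minimal sketch of one OSM at a single scale is shown below. The exhaustive double loop is for clarity only, since the hierarchical search described next restricts the candidate pairs, and the cosine-based stand-in replaces the learned overlap detection network D.

import numpy as np

def overlap_score_maps(F1, F2, D):
    # F1, F2: (G, G, C) patch-feature grids at one scale. Each output
    # entry is the maximum overlap score found for that patch against
    # all patches of the other image.
    G, C = F1.shape[0], F1.shape[-1]
    f1 = F1.reshape(-1, C)
    f2 = F2.reshape(-1, C)
    O1 = np.zeros(G * G)
    O2 = np.zeros(G * G)
    for i in range(G * G):
        for j in range(G * G):
            s = D(f1[i], f2[j])
            O1[i] = max(O1[i], s)
            O2[j] = max(O2[j], s)
    return O1.reshape(G, G), O2.reshape(G, G)

def D(x, y):
    # Stand-in for the learned overlap detector: cosine similarity
    # squashed to [0, 1]. The real D is a trained feed-forward network.
    sim = float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)
    return (sim + 1.0) / 2.0

# Example at scale 5 for 256x256 inputs: an 8x8 grid of 256-dim features.
O1, O2 = overlap_score_maps(np.random.randn(8, 8, 256),
                            np.random.randn(8, 8, 256), D)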
Structured Hierarchical Search: The OSMs are structurally linked from higher to lower scales such that patch comparisons can be made hierarchically: sub-patches of a patch with maximum overlap at a higher scale are the candidate patches for overlap detection at the lower scale. Since the spatial dimension of each feature map is halved at each scale, a feature vector f at a higher scale overlaps with four feature vectors at the immediately lower scale. This observation limits the number of patch comparisons to be made at a lower scale: for two patches with maximum overlap at a higher scale, each of the four sub-patches of one is compared only with the four sub-patches of the other. At the largest scale (lowest resolution feature map), with no prior scoring, overlap is measured between all possible pairs to predict the initial overlap score maps. Table 7.1 shows the reduction in patch comparisons at each scale for 256x256 image pairs.

Scale   Patch Dimension   Naive Comparisons   Ours
1       2x2               ~268.43M            ~131K
2       4x4               ~16.77M             32,768
3       8x8               ~1.04M              8,192
4       16x16             65,536              2,048
5       32x32             4,096               4,096

Table 7.1: Number of patch comparisons at each scale.
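The counts in Table 7.1 can be reproduced with the short calculation below; the factor of two assumes that max-overlap pairs are followed from both images, which is our reading of the table.

# Patch-comparison counts for 256x256 images (reproduces Table 7.1).
N = 256
for s in range(5, 0, -1):
    patches = (N // 2**s) ** 2            # patches per image at scale s
    naive = patches ** 2                  # exhaustive pairwise comparisons
    if s == 5:
        ours = naive                      # no prior scoring at the largest scale
    else:
        parents = (N // 2**(s + 1)) ** 2  # max-overlap pairs tracked per image
        ours = 2 * parents * 16           # 4x4 sub-patch comparisons per pair
    print(s, 2**s, naive, ours)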
Loss: We pretrain the encoder and the overlap detection network jointly using the margin ranking loss L_o. The model is then trained end-to-end with the mask output using binary cross-entropy loss. For two overlap scores x_1 and x_2, the regular margin ranking loss is given by Equation (7.1), where m is the margin hyper-parameter. In our experiments, for an anchor, positive and negative patch triplet <a, a^+, a^−>, x_1 and x_2 represent the overlap scores of the patch pairs <a, a^+> and <a, a^−> respectively; the difference between x_1 and x_2 therefore represents the difference in overlap between the positive and negative patch pairs. As a result, we also experiment with a flexible margin measured as a function of the overlap difference. Specifically, if the true overlap in pixels for <a, a^+> and <a, a^−> is given by o^+ and o^−, the flexible margin m_flex is given by Equation (7.2), where d is the patch dimension. The flexible margin ranking loss L_flex is given by Equation (7.3).

L_o = max(0, (x_2 − x_1) + m)    (7.1)

m_flex = (o^+ − o^−) / d^2    (7.2)

L_flex = max(0, (x_2 − x_1) + m_flex)    (7.3)
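A minimal NumPy sketch of the two loss variants is shown below; the example margin and overlap values are illustrative assumptions.

import numpy as np

def margin_ranking_loss(x1, x2, m=0.5):
    # Regular margin ranking loss (Eq. 7.1); the margin value 0.5 is an
    # illustrative assumption, not the tuned hyper-parameter.
    return np.maximum(0.0, (x2 - x1) + m)

def flexible_margin_ranking_loss(x1, x2, o_pos, o_neg, d):
    # Flexible margin ranking loss (Eqs. 7.2-7.3): the margin grows with
    # the true overlap gap, normalized by the patch area d*d.
    m_flex = (o_pos - o_neg) / float(d * d)
    return np.maximum(0.0, (x2 - x1) + m_flex)

# Example: a 32x32 patch triplet where the positive pair overlaps by
# 600 pixels and the negative pair by 0 pixels.
loss = flexible_margin_ranking_loss(x1=0.8, x2=0.3, o_pos=600, o_neg=0, d=32)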
Implementation and Training Details: We resize all input images to 256x256x3. At the largest scale, the feature map has the lowest resolution, 8x8x256, with 32x32-pixel patches. The channel dimension is halved at each scale, from 256 at scale 5 to 16 at scale 1. The overlap detection layer is a two-layer feed-forward network. We pretrain our model for 25 epochs with the margin ranking loss on overlapping and non-overlapping patch triplets generated from synthetic data. The model is then trained end-to-end for 50 epochs with binary cross-entropy and margin ranking loss. We use the Adam optimizer with a learning rate of 1e-4.
7.3 Experiments
7.3.1 Dataset and Metrics
We use the BioFors dataset introduced in Sabir et al. 2021a. The EDD task in BioFors has 1,547 manipulated images. The train and test splits have 30,536 and 17,269 images respectively, divided into four image categories – Microscopy, Blot/Gel, Macroscopy and FACS images. Each image category has a different origin or semantic meaning in the biomedical domain, which leads to different image properties and challenges. We evaluate our model both at the image and pixel level, according to the protocol described in Sabir et al. 2021a. Image-level evaluation assigns a binary label to the image, where the presence of duplication represents the positive class. Pixel-level evaluation aggregates the statistics of positive- and negative-class pixels across images before score computation. Two metrics are reported in Sabir et al. 2021a – Matthews correlation coefficient (MCC) and F1 score. Due to space constraints, we report MCC results for both image- and pixel-level evaluation.
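As an illustration of the pixel-level protocol, the sketch below pools pixel statistics across the whole test set before computing a single MCC, rather than averaging per-image scores; binary NumPy masks are an assumption of the sketch.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def pixel_level_mcc(pred_masks, gt_masks):
    """Pool the pixels of all images, then compute one MCC score.

    pred_masks, gt_masks: lists of binary masks (0 = pristine,
    1 = duplicated), one pair per test image.
    """
    y_pred = np.concatenate([m.ravel() for m in pred_masks])
    y_true = np.concatenate([m.ravel() for m in gt_masks])
    return matthews_corrcoef(y_true, y_pred)
```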
7.3.2 Synthetic Data Generation
BioFors dataset does not provide any manipulated samples for training. Hence, we train our model
using synthetically generated samples similar to the process described in Sabir et al. 2021a. How-
ever, our model additionally requires joint pre-training of encoders and overlap scoring modules
(OSMs). This requires extensive hierarchical annotation of patch overlap at each pixel i.e. patch
pairs and their overlap scores at each scale. Generating such extensive annotation on the fly is
computationally expensive. As a workaround, we generate predefined annotation templates, which
can be used with random image-pairs on the fly to generate unique synthetic samples.
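A minimal sketch of the template idea follows, assuming square duplicated regions and per-scale overlap maps that store the fraction of each patch covered by the pasted region; the field names and overlap definition are illustrative, not the exact annotation format.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_template(img_dim=256, dup_dim=64, n_scales=5):
    """One reusable annotation template: source/target locations of a
    duplicated square region, plus per-scale patch-overlap maps."""
    sx, sy, tx, ty = rng.integers(0, img_dim - dup_dim, size=4)
    overlap = {}
    for s in range(1, n_scales + 1):
        d = 2 ** s                          # patch dimension at scale s
        g = img_dim // d
        ov = np.zeros((g, g), dtype=np.float32)
        for i in range(g):                  # i indexes rows (y), j columns (x)
            for j in range(g):
                w = max(0, min((j + 1) * d, tx + dup_dim) - max(j * d, tx))
                h = max(0, min((i + 1) * d, ty + dup_dim) - max(i * d, ty))
                ov[i, j] = w * h / d ** 2   # fraction of patch duplicated
        overlap[s] = ov
    return {"src": (sx, sy), "dst": (tx, ty), "size": dup_dim,
            "overlap_per_scale": overlap}

def apply_template(img_a, img_b, t):
    """Paste the template's source region of img_a into img_b, yielding a
    unique synthetic duplicated pair with precomputed annotation."""
    (sx, sy), (tx, ty), k = t["src"], t["dst"], t["size"]
    out = img_b.copy()
    out[ty:ty + k, tx:tx + k] = img_a[sy:sy + k, sx:sx + k]
    return out, t["overlap_per_scale"]
```

Because the expensive per-scale overlap maps are computed once per template, each training step only pays for the cheap paste operation on a fresh image pair.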
Method Microscopy Blot/Gel Macroscopy FACS Combined
SIFT (Lowe 2004) 0.180 0.113 0.130 0.11 0.142
ORB (Rublee et al. 2011) 0.319 0.087 0.126 0.269 0.207
BRIEF (Calonder et al. 2010) 0.275 0.058 0.135 0.244 0.180
DF - ZM (Cozzolino et al. 2015a) 0.422 0.161 0.285 0.540 0.278
DMVN (Y. Wu et al. 2017) 0.242 0.261 0.185 0.164 0.244
Ours - regular margin loss 0.398 0.507 0.221 0.313 0.410
Ours - flexible margin loss 0.346 0.520 0.309 0.256 0.398
Table 7.2: Image level results for external duplication detection (EDD) task by image class. All
numbers are MCC scores.
Method Microscopy Blot/Gel Macroscopy FACS Combined
SIFT (Lowe 2004) 0.146 0.148 0.194 0.073 0.132
ORB (Rublee et al. 2011) 0.342 0.127 0.226 0.187 0.252
BRIEF (Calonder et al. 2010) 0.277 0.102 0.169 0.188 0.202
DF - ZM (Cozzolino et al. 2015a) 0.425 0.192 0.256 0.504 0.324
DMVN (Y. Wu et al. 2017) 0.342 0.430 0.238 0.282 0.310
Ours - regular margin loss 0.435 0.520 0.262 0.356 0.438
Ours - flexible margin loss 0.386 0.520 0.281 0.336 0.410
Table 7.3: Pixel level results for external duplication detection (EDD) task by image class. All
numbers are MCC scores.
7.3.3 Results
Tables 7.2 and 7.3 show the performance of our model on the EDD task. The baseline results
are presented as reported in Sabir et al. 2021a. Image and pixel columns denote corresponding
evaluation protocol. We highlight two versions of our model – with regular margin ranking loss
and with a flexible margin ranking loss. Our model achieves a new state-of-the art on blot/gel,
microscopy, macroscopy image categories and also on the combined evaluation. Table 7.4 shows
the results of our model on the IDD task. The diagonals of score maps are ignored when searching
for patches with maximum overlap, to avoid cases of self-overlap in the IDD task.
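A minimal sketch of this diagonal masking, assuming a square patch-vs-patch score matrix of floats (the real score maps are indexed per feature-map location, so the actual bookkeeping differs):

```python
import numpy as np

def mask_self_matches(scores):
    """Suppress self-pairs on the diagonal of a patch-vs-patch score
    matrix before taking per-patch maxima (IDD setting)."""
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)
    return masked

# per-patch best match, ignoring self-overlap:
# best = mask_self_matches(scores).max(axis=1)
```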
Method
Microscopy Blot/Gel Macroscopy Combined
Image Pixel Image Pixel Image Pixel Image Pixel
DF - ZM (Cozzolino et al. 2015a) 0.764 0.197 0.515 0.449 0.573 0.478 0.564 0.353
DF - PCT (Cozzolino et al. 2015a) 0.764 0.202 0.503 0.466 0.712 0.487 0.569 0.364
DF - FMT (Cozzolino et al. 2015a) 0.638 0.167 0.480 0.400 0.495 0.458 0.509 0.316
DCT (Fridrich et al. 2003) 0.187 0.022 0.250 0.168 0.158 0.143 0.196 0.095
DWT (Bashar et al. 2010) 0.299 0.067 0.384 0.295 0.591 0.268 0.341 0.171
Zernike (Ryu et al. 2010) 0.192 0.032 0.336 0.187 0.493 0.262 0.257 0.114
BusterNet (Y. Wu et al. 2018) 0.183 0.178 0.226 0.076 0.021 0.106 0.269 0.107
Ours - regular margin loss 0.082 0.292 0.412 0.184 0.197 0.140 0.337 0.213
Ours - flexible margin loss 0.283 0.048 0.436 0.292 0.250 0.159 0.395 0.188
Table 7.4: Results for internal duplication detection (IDD) task by image class and a combined
result. There are no IDD instances in FACS images. Image and Pixel columns denote image and
pixel level evaluation respectively. All numbers are MCC scores.
Figure 7.2: Predicted samples of our model. Input images, ground truth masks, predicted masks
and intermediate score maps are shown.
7.4 Analysis
As shown in Tables 7.2 and 7.3, our model achieves state-of-the-art results across multiple categories. However, performance fluctuates across image categories, and no single model holds the top performance in every category. We believe that the unique characteristics of each category make it difficult to train a single model that outperforms all others. Figure 7.2 shows sample predictions from our model; the intermediate overlap score maps from each scale show the progression of patch overlap from 32x32 to 2x2 patches. Figure 7.3 shows that our method also generates false positives. As described in Chapter 6, these duplicated regions are not considered manipulations due to the semantics of the experiments that produced them, such as image overlay or chemical staining. Overcoming these false positives requires either additional semantic information from the source documents or a rethinking of the definition of manipulation. Additional false positives were traced to incorrect annotation, as shown in Figure 7.4. Despite best efforts and multiple rounds of annotation, some image pairs with duplication went unnoticed, resulting in spurious false positives. This discovery underscores the difficulty of detecting duplicated regions in biomedical images and shows that a human-centric approach to detection is likely to have failure cases.
Figure 7.3: False positive samples on Microscopy images.
Figure 7.4: Duplication in images that went unnoticed during the annotation process results in
some false positives.
Table 7.4 shows results on the IDD task, where the proposed model does not achieve state-of-the-art performance. We believe that the top-down (coarse-to-fine) patch comparison structure of our model is not ideal for the IDD task.
Method Image Pixel
Ours w/o gating 0.340 0.398
Ours w/ dot product overlap 0.076 0.052
Ours - flexible margin 0.398 0.410
Ours - regular margin 0.410 0.438
Table 7.5: Ablation analysis on our model.
Ablation: We perform an ablation analysis of our model in Table 7.5. The model performance degrades if we remove the gating operation when concatenating overlap score maps across scales. Additionally, performance drops drastically if we use the feature dot products from the literature (Y. Wu et al. 2018, 2017) instead of a one-hidden-layer overlap detection network. Table 7.6 shows in detail the effect that the choice of overlap detection network has on performance at each scale. We conducted a simple experiment for overlap detection: an overlap detection network placed on top
Overlap Detection Patch Dimension
Network 32x32 16x16 8x8 4x4 2x2
Dot product 67.85% 65.51% 63.90% 60.87% 56.13%
1 Hidden Layer 75.49% 73.84% 70.82% 71.80% 60.07%
2 Hidden Layer 76.02% 72.80% 73.82% 72.97% 64.74%
Table 7.6: Accuracy of classifying overlap in patches. The choice of overlap detection architecture
affects the performance at each scale.
of a common CNN feature extractor is used to score two input patches for overlap. Architectures with 1 and 2 hidden layers are found to be better than the dot-product comparison of features used in the literature. In the interest of saving computation, we implemented a 1-hidden-layer architecture (two linear layers) in this work, sketched below.
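A sketch of such an overlap detection head follows; the hidden width and the use of feature concatenation are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class OverlapScorer(nn.Module):
    """One-hidden-layer overlap detection head: scores a pair of patch
    feature vectors instead of taking their dot product."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, f1, f2):
        # f1, f2: (batch, feat_dim) patch feature vectors
        return self.net(torch.cat([f1, f2], dim=-1)).squeeze(-1)
```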
7.5 Summary
The proposed model performs well on multiple categories of biomedical images for detecting duplicated regions between images. The multi-scale architecture makes fewer patch comparisons and is computationally efficient. However, it does not overcome all the challenges of biomedical image forensics. The top-down approach is not suitable for detecting smaller duplicated regions within images. Additionally, the semantics of chemically stained microscopy images require external information to disambiguate legitimate repetition from fraudulent duplication. The proposed model is also not effective on FACS images. These findings reinforce the observation that biomedical forensics is not only a challenging problem, but also one that requires the development of image-category-specific methods. A single model is unlikely to function well across tasks or even image categories.
Chapter 8
Conclusion
Detecting manipulations in multimedia data is a diverse area of research. Forensic research has developed side by side with the emergence of new forms of manipulation in different data modalities. While classical computer-vision-based image forensics methods, focused on detecting digitally manipulated images, have existed for a while, manipulations in videos such as deepfakes or fake news on Twitter are more recent problems. In keeping with the evolving nature of manipulations, we identified a neglected but important case: semantically repurposed images. Additionally, most image forensics research has been concentrated in the natural image domain. However, the problem of image manipulation extends beyond social media to the biomedical domain. This thesis is focused on detecting semantically repurposed images in the natural and biomedical domains.
8.1 Summary of Contributions
We tackled the problem of semantic repurposing/misuse of images in natural and biomedical image
domains. We proposed problem definitions, datasets and tasks to advance research.
In Chapter 2 we discussed the literature surrounding natural and biomedical forensics. The chapter is divided into two parts, since the two areas have distinctly different challenges and literature surrounding them. For natural image forensics, we discussed how online misinformation affects our daily lives. We also reviewed the literature around the spread of misinformation and some counter-intuitive findings, and discussed existing text- and image-based manipulation detection methods that tackle different manifestations of online misinformation. For biomedical forensics, we reviewed the general understanding of scientific fraud, followed by a review of the qualitative literature surrounding image manipulations in biomedical documents. Finally, we discussed the shortcomings of existing attempts to develop automated methods for biomedical forensics.
A simplified version of the online misinformation problem with multimedia packages com-
prising images and captions is introduced in Chapter 3. The chapter introduced a synthetic dataset
of manipulated packages and a general framework for using external knowledge bases to verify
query packages. We evaluated joint modelling of images and text followed by outlier detection of
semantically inconsistent packages. The semantic inconsistency of packages made the detection
problem easier.
In Chapter 4 we introduced a dataset with semantically consistent and more subtle manipu-
lations about person, location and organization information in a multimedia package. We also
proposed and evaluated a multi-task learning model that makes use of explicit evidences from a
reference dataset for verification. The proposed method outperformed baselines using a single
evidence from the reference dataset.
The number of evidences used for corroborating a query package has a bearing on the final
performance of the model. In Chapter 5 we introduced a graph neural network model that leverages
an arbitrary number of evidences for package verification. The proposed model also deals well with
missing modalities.
To overcome the lack of proper datasets with well-defined tasks, we introduced a biomedical forensics dataset – BioFors – in Chapter 6. The dataset has biomedical images extracted from research documents, divided into four categories and associated with three manipulation detection tasks that cover the most popular forms of manipulation found in the literature. We introduced and evaluated baselines derived from the common computer-vision and image forensics literature. Baselines that were developed for natural images are not effective on biomedical images.
We proposed a multi-scale overlap detection model in Chapter 7 to find instances of semantic
repurposing in biomedical images. Specifically, our proposed model detects duplication in image
pairs and is evaluated on the external duplication detection (EDD) task. Our model achieves state-
of-the-art performance on multiple image categories. We also discovered that a single model is not
effective across multiple types of biomedical images.
8.2 Future Work
There are several directions for future research, both in the natural and biomedical image domains.
One of the major shortcomings of the natural image repurposing research presented in this thesis is
the lack of a real-world dataset and evaluation on it. Regardless of how cleverly manipulations are
simulated in datasets, evaluation on a real-world dataset is more likely to indicate reliable model
performance. Additionally, the scale of misinformation is vast, as is the size of any knowledge base used for verification; methods must therefore be scalable, and their computational efficiency should be quantified. Finally, for the sake of simplification, we used a trusted knowledge
base for verification. However, this assumption is flawed in a real-world setting and practical
methods would either have to account for other spurious sources of corroborating information or
rely only on trusted knowledge sources.
Research in biomedical image forensics is still in its nascent stages. There are several avenues
to improve upon the research presented in this thesis. In the data curation step, it was discovered
that current biomedical image extraction models are unable to extract reasonable quality figures
and images from research documents. Poor quality image extraction will negatively affect down-
stream performance of practical real-world software for screening documents. Additionally, in this thesis we focused on detecting duplications between images, as they comprise semantic manipulations. However, we did not propose novel methods for the two additional tasks of detecting duplications or sharp transitions within images. Each of these tasks requires a significant amount of effort, with dedicated models needed to improve upon the discussed baselines.
8.3 Supporting Papers
This thesis is supported by the following papers:
• Jaiswal, A., Sabir, E., AbdAlmageed, W., Natarajan, P. (2017, October). Multimedia se-
mantic integrity assessment using joint embedding of images and text. In Proceedings
of the 25th ACM International Conference on Multimedia (pp. 1465-1471).
• Sabir, E., AbdAlmageed, W., Wu, Y., Natarajan, P. (2018, October). Deep multimodal
image-repurposing detection. In Proceedings of the 26th ACM international conference
on Multimedia (pp. 1337-1345).
• Sabir, E., Jaiswal, A., AbdAlmageed, W., Natarajan, P. (2021, January). MEG: Multi-
Evidence GNN for Multimodal Semantic Forensics. In 2020 25th International Confer-
ence on Pattern Recognition (ICPR) (pp. 9804-9811). IEEE.
• Sabir, E., Nandi, S., Abd-Almageed, W., Natarajan, P. (2021). BioFors: A Large Biomed-
ical Image Forensics Dataset. In Proceedings of the IEEE/CVF International Conference
on Computer Vision (pp. 10963-10973).
8.4 Other Papers
Other papers that I published or contributed to during my PhD:
• Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I., Natarajan, P. (2019). Recur-
rent convolutional strategies for face manipulation detection in videos. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Work-
shops.
• Sabir, E., Rawls, S., Natarajan, P. (2017, November). Implicit language model in LSTM
for OCR. In 2017 14th IAPR international conference on document analysis and recognition
(ICDAR) (Vol. 7, pp. 27-31). IEEE.
• Rawls, S., Cao, H., Sabir, E., Natarajan, P. (2017, April). Combining deep learning and
language modeling for segmentation-free OCR from raw pixels. In 2017 1st international
workshop on Arabic script analysis and recognition (ASAR) (pp. 119-123). IEEE.
• Kartik, D., Sabir, E., Mitra, U., Natarajan, P. (2018, October). Policy design for active se-
quential hypothesis testing using deep learning. In 2018 56th Annual Allerton Conference
on Communication, Control, and Computing (Allerton) (pp. 741-748). IEEE.
References
Bovet, Alexandre et al. (2019). “Influence of fake news in Twitter during the 2016 US presidential
election”. In: Nature communications 10.1, pp. 1–14.
Tasnim, Samia et al. (2020). “Impact of rumors and misinformation on COVID-19 in social media”.
In: Journal of preventive medicine and public health 53.3, pp. 171–174.
Kogan, Shimon et al. (2020). “Fake News in Financial Markets”. In: Available at SSRN 3237763.
Bar-Ilan, Judit et al. (2021). “Retracted articles–the scientific version of fake news”. In: The psy-
chology of fake news: Accepting, sharing, and correcting misinformation, pp. 47–70.
Vosoughi, Soroush et al. (2018). “The spread of true and false news online”. In: Science 359.6380,
pp. 1146–1151.
Kudugunta, Sneha et al. (2018). “Deep neural networks for bot detection”. In: Information Sciences
467, pp. 312–322.
Wu, Yue et al. (2018). “BusterNet: Detecting Copy-Move Image Forgery with Source/Target Lo-
calization”. In: Proceedings of the European Conference on Computer Vision (ECCV).
Wu, Yue et al. (2019). “ManTra-Net: Manipulation Tracing Network for Detection and Localiza-
tion of Image Forgeries With Anomalous Features”. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR).
Sabir, Ekraam et al. (2019). “Recurrent Convolutional Strategies for Face Manipulation Detection
in Videos”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, pp. 80–87.
Masi, Iacopo et al. (2020). “Two-branch Recurrent Network for Isolating Deepfakes in Videos”.
In: Proceedings of the European Conference on Computer Vision (ECCV).
Singh, Vivek K et al. (2021). “Detecting fake news stories via multimodal analysis”. In: Journal of
the Association for Information Science and Technology 72.1, pp. 3–17.
Jin, Z. et al. (2017). “Novel Visual and Statistical Image Features for Microblogs News Verifica-
tion”. In: IEEE Transactions on Multimedia 19.3, pp. 598–608. DOI: 10.1109/TMM.2016.2617078.
Jin, Zhiwei et al. (2017). “Multimodal Fusion with Recurrent Neural Networks for Rumor Detec-
tion on Microblogs”. In: Proceedings of the 2017 ACM on Multimedia Conference. MM ’17.
New York, NY , USA: ACM, pp. 795–816. DOI:10.1145/3123266.3123454.
Ma, Jing et al. (2016). “Detecting Rumors from Microblogs with Recurrent Neural Networks.” In:
IJCAI, pp. 3818–3824.
Gottweis, Herbert et al. (2006). “South Korean policy failure and the Hwang debacle”. In: Nature
biotechnology 24.2, pp. 141–143.
Stern, Andrew M et al. (2014). “Financial costs and personal consequences of research misconduct
resulting in retracted publications”. In: Elife 3, e02956.
Alberts, Bruce et al. (2015). “Self-correction in science at work”. In: Science 348.6242, pp. 1420–
1422.
Bik, Elisabeth M et al. (2018). “Analysis and correction of inappropriate image duplication: the
Molecular and Cellular Biology Experience”. In: Molecular and Cellular Biology 38.20.
Sabir, Ekraam et al. (2021a). “BioFors: A Large Biomedical Image Forensics Dataset”. In: Pro-
ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10963–10973.
Jaiswal, Ayush et al. (2017). “Multimedia Semantic Integrity Assessment Using Joint Embedding
Of Images And Text”. In: Proceedings of the 2017 ACM on Multimedia Conference. MM ’17.
New York, NY , USA: ACM, pp. 1465–1471. DOI:10.1145/3123266.3123385.
Sabir, Ekraam et al. (2018). “Deep multimodal image-repurposing detection”. In: Proceedings of
the 26th ACM international conference on Multimedia, pp. 1337–1345.
Sabir, Ekraam et al. (2021b). “MEG: Multi-Evidence GNN for Multimodal Semantic Forensics”.
In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, pp. 9804–9811.
Rapoza, Kenneth (2018). Can ’Fake News’ Impact The Stock Market? en.
Allcott, Hunt et al. (2017). “Social Media and Fake News in the 2016 Election”. en. In: Journal of
Economic Perspectives 31.2, pp. 211–236. DOI:10.1257/jep.31.2.211.
Rocha, Yasmim Mendes et al. (2021). “The impact of fake news on social media and its influence
on health during the COVID-19 pandemic: A systematic review”. In: Journal of Public Health,
pp. 1–10.
Zampoglou, Markos et al. (2016). “Web and Social Media Image Forensics for News Profession-
als.” In: SMN@ ICWSM.
Tambuscio, Marcella et al. (2015). “Fact-checking Effect on Viral Hoaxes: A Model of Misin-
formation Spread in Social Networks”. In: Proceedings of the 24th International Conference
on World Wide Web. WWW ’15 Companion. New York, NY , USA: ACM, pp. 977–982. DOI:
10.1145/2740908.2742572.
Gupta, Aditi et al. (2014). “TweetCred: Real-Time Credibility Assessment of Content on Twitter”.
en. In: Social Informatics. Lecture Notes in Computer Science. Springer, Cham, pp. 228–243.
DOI:10.1007/978-3-319-13734-6_16.
Liu, Xiaomo et al. (2015). “Real-time Rumor Debunking on Twitter”. In: Proceedings of the 24th
ACM International on Conference on Information and Knowledge Management. CIKM ’15.
New York, NY , USA: ACM, pp. 1867–1870. DOI:10.1145/2806416.2806651.
Wu, K. et al. (2015). “False rumors detection on Sina Weibo by propagation structures”. In: 2015
IEEE 31st International Conference on Data Engineering, pp. 651–662. DOI: 10.1109/ICDE.2015.7113322.
Jin, Fang et al. (2013). “Epidemiological Modeling of News and Rumors on Twitter”. In: Proceed-
ings of the 7th Workshop on Social Network Mining and Analysis. SNAKDD ’13. New York,
NY , USA: ACM, 8:1–8:9. DOI:10.1145/2501025.2501027.
Ruchansky, Natali et al. (2017). “CSI: A Hybrid Deep Model for Fake News Detection”. In: Pro-
ceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM,
pp. 797–806.
Zhao, Zhe et al. (2015). “Enquiring Minds: Early Detection of Rumors in Social Media from En-
quiry Posts”. In: Proceedings of the 24th International Conference on World Wide Web. WWW
’15. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences
Steering Committee, pp. 1395–1405. DOI:10.1145/2736277.2741637.
Wang, William Yang (2017). “”Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake
News Detection”. In: arXiv:1705.00648 [cs]. arXiv: 1705.00648.
Ciampaglia, Giovanni Luca et al. (2015). “Computational Fact Checking from Knowledge Net-
works”. en. In: PLOS ONE 10.6, e0128193. DOI:10.1371/journal.pone.0128193.
Farid, H. (2009). “Image forgery detection”. In: IEEE Signal Processing Magazine 26.2, pp. 16–
25. DOI:10.1109/MSP.2008.931079.
Wu, Yue et al. (2017). “Deep matching and validation network: An end-to-end solution to con-
strained image splicing localization and detection”. In: Proceedings of the 25th ACM interna-
tional conference on Multimedia, pp. 1480–1502.
Qureshi, Muhammad Ali et al. (2015). “A bibliography of pixel-based blind image forgery detec-
tion techniques”. In: Signal Processing: Image Communication 39, pp. 46–74.
Asghar, Khurshid et al. (2017). “Copy-move and splicing image forgery detection and localization
techniques: a review”. In: Australian Journal of Forensic Sciences 49.3, pp. 281–307.
Verdoliva, Luisa (2020). “Media forensics and deepfakes: an overview”. In: arXiv preprint arXiv:2001.06564.
Cozzolino, Davide et al. (2015a). “Efficient dense-field copy–move forgery detection”. In: IEEE
Transactions on Information Forensics and Security 10.11, pp. 2284–2297.
Ryu, Seung-Jin et al. (2010). “Detection of copy-rotate-move forgery using Zernike moments”. In:
Proceedings of the 12th international conference on Information hiding, pp. 51–65.
Cozzolino, Davide et al. (2015b). “Splicebuster: A new blind image splicing detector”. In: 2015
IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, pp. 1–6.
Rossler, Andreas et al. (2019). “Faceforensics++: Learning to detect manipulated facial images”.
In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1–11.
Jiang, Liming et al. (2020). “Deeperforensics-1.0: A large-scale dataset for real-world face forgery
detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, pp. 2886–2895.
Li, Yuezun et al. (2020). “Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics”.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR).
Dong, Jing et al. (2013). “Casia image tampering detection evaluation database”. In: 2013 IEEE
China Summit and International Conference on Signal and Information Processing. IEEE,
pp. 422–426.
Nimble Challenge 2017 Evaluation — NIST (n.d.). https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation. (Accessed on 11/14/2020).
Ng, Tian-Tsong et al. (2009). “Columbia image splicing detection evaluation dataset”. In: DVMM
lab. Columbia Univ CalPhotos Digit Libr.
Wen, Bihan et al. (2016). “COVERAGE—A novel database for copy-move forgery detection”. In:
2016 IEEE International Conference on Image Processing (ICIP). IEEE, pp. 161–165.
Gross, Charles (2016). “Scientific misconduct”. In: Annual review of psychology 67.
Kerr, Norbert L (1998). “HARKing: Hypothesizing after the results are known”. In: Personality
and social psychology review 2.3, pp. 196–217.
Boutron, Isabelle et al. (2018). “Misrepresentation and distortion of research in biomedical litera-
ture”. In: Proceedings of the National Academy of Sciences 115.11, pp. 2613–2619.
Head, Megan L et al. (2015). “The extent and consequences of p-hacking in science”. In: PLoS
biology 13.3, e1002106.
Calver, Mike (2021). “Combatting the rise of paper mills”. In: Pacific Conservation Biology 27.1,
pp. 1–2.
Brembs, Björn (2018). “Prestigious science journals struggle to reach even average reliability”. In:
Frontiers in human neuroscience 12, p. 37.
Kumar, Malhar N (2008). “A review of the types of scientific misconduct in biomedical research”.
In: Journal of Academic Ethics 6.3, pp. 211–228.
Christopher, Jana (2018). “Systematic fabrication of scientific images revealed”. In: FEBS letters
592.18, pp. 3027–3029.
Bik, Elisabeth M et al. (2016). “The prevalence of inappropriate image duplication in biomedical
research publications”. In: MBio 7.3.
Williams, Corinne L et al. (2019). “Figure errors, sloppy science, and fraud: keeping eyes on your
data”. In: The Journal of Clinical Investigation 129.5, pp. 1805–1807.
Fanelli, Daniele et al. (2019). “Testing hypotheses on risk factors for scientific misconduct via
matched-control analysis of papers containing problematic image duplications”. In: Science
and engineering ethics 25.3, pp. 771–789.
Bosch, Gundula et al. (2017). “Graduate biomedical science education needs a new philosophy”.
In: MBio 8.6, e01539–17.
Miyakawa, Tsuyoshi (2020). No raw data, no science: another possible source of the reproducibil-
ity crisis.
Brembs, Björn (2019). “Reliable novelty: New should not trump true”. In: PLoS Biology 17.2,
e3000117.
Casadevall, Arturo et al. (2016). Rigorous science: a how-to guide.
Dal-Ré, Rafael et al. (2020). “Should research misconduct be criminalized?” In: Research Ethics
16.1-2, pp. 1–12.
Acuna, Daniel E et al. (2018). “Bioscience-scale automated detection of figure element reuse”. In:
bioRxiv, p. 269415.
Koppers, Lars et al. (2017). “Towards a systematic screening tool for quality assurance and semi-
automatic fraud detection for images in the life sciences”. In: Science and engineering ethics
23.4, pp. 1113–1128.
Cardenuto, JP et al. (2019). Scientific Integrity Analysis of Misconduct in Images of Scientific
Papers.
Xiang, Ziyue et al. (2020). “Scientific Image Tampering Detection Based On Noise Inconsisten-
cies: A Method And Datasets”. In: arXiv preprint arXiv:2001.07799.
Bucci, Enrico M (2018). “Automatic detection of image manipulations in the biomedical litera-
ture”. In: Cell death & disease 9.3, pp. 1–9.
Lowe, David G (2004). “Distinctive image features from scale-invariant keypoints”. In: Interna-
tional Journal of Computer Vision 60.2, pp. 91–110.
Kulikov, Victor et al. (2020). “Instance Segmentation of Biological Images Using Harmonic Em-
beddings”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR).
Lee, Hong Joo et al. (2020). “Structure Boundary Preserving Segmentation for Medical Image
With Ambiguous Boundary”. In: Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Wang, Dong et al. (2020). “FocalMix: Semi-Supervised Learning for 3D Medical Image Detec-
tion”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Perez, Fabio et al. (2019). “Solo or Ensemble? Choosing a CNN Architecture for Melanoma Clas-
sification”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops.
Peng, Cheng et al. (2020). “SAINT: Spatially Aware Interpolation NeTwork for Medical Slice
Synthesis”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR).
Zhang, Yide et al. (2019). “A poisson-gaussian denoising dataset with real fluorescence microscopy
images”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), pp. 11710–11718.
Ronneberger, Olaf et al. (2015). “U-net: Convolutional networks for biomedical image segmenta-
tion”. In: International Conference on Medical image computing and computer-assisted inter-
vention. Springer, pp. 234–241.
Baheti, Bhakti et al. (2020). “Eff-unet: A novel architecture for semantic segmentation in unstruc-
tured environment”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops, pp. 358–359.
Kazerouni, Iman Abaspur et al. (2021). “Ghost-UNet: An asymmetric encoder-decoder architec-
ture for semantic segmentation from scratch”. In: IEEE Access 9, pp. 97457–97465.
Young, Peter et al. (2014). “From image descriptions to visual denotations: New similarity met-
rics for semantic inference over event descriptions”. In: Transactions of the Association for
Computational Linguistics 2, pp. 67–78.
Lin, Tsung-Yi et al. (2014). “Microsoft coco: Common objects in context”. In: European Confer-
ence on Computer Vision. Springer, pp. 740–755.
Ababneh, Sufyan et al. (2008). “Scalable multimedia-content integrity verification with robust
hashing”. In: Electro/Information Technology, 2008. EIT 2008. IEEE International Conference
on. IEEE, pp. 263–266.
Sun, Rui et al. (2014). “Secure and robust image hashing via compressive sensing”. In: Multimedia
tools and applications 70.3, pp. 1651–1665.
Wang, Xiaofeng et al. (2015). “A visual model-based perceptual image hash for content authenti-
cation”. In: IEEE Transactions on Information Forensics and Security 10.7, pp. 1336–1349.
Yan, Cai-Ping et al. (2016). “Multi-scale image hashing using adaptive local feature extraction for
robust tampering detection”. In: Signal Processing 121, pp. 1–16.
Ngiam, Jiquan et al. (2011). “Multimodal deep learning”. In: Proceedings of the 28th international
conference on machine learning (ICML-11), pp. 689–696.
Vukotić, Vedran et al. (2016). “Bidirectional joint representation learning with symmetrical deep
neural networks for multimodal and crossmodal applications”. In: Proceedings of the 2016
ACM on International Conference on Multimedia Retrieval. ACM, pp. 343–346.
Kiros, Ryan et al. (2014). “Unifying visual-semantic embeddings with multimodal neural language
models”. In: arXiv preprint arXiv:1411.2539.
Simonyan, Karen et al. (2014). “Very deep convolutional networks for large-scale image recogni-
tion”. In: arXiv preprint arXiv:1409.1556.
Mikolov, Tomas et al. (2013). “Distributed representations of words and phrases and their compo-
sitionality”. In: Advances in neural information processing systems, pp. 3111–3119.
Schölkopf, Bernhard et al. (1999). “Support vector method for novelty detection.” In: NIPS. Vol. 12,
pp. 582–588.
Liu, Fei Tony et al. (2008). “Isolation forest”. In: Data Mining, 2008. ICDM’08. Eighth IEEE
International Conference on. IEEE, pp. 413–422.
Hinton, Geoffrey E. et al. (2006). “Reducing the dimensionality of data with neural networks”. In:
science 313.5786, pp. 504–507.
Hochreiter, Sepp et al. (1997). “Long short-term memory”. In: Neural computation 9.8, pp. 1735–
1780.
Chen, Xinlei et al. (2015). “Microsoft COCO captions: Data collection and evaluation server”. In:
arXiv preprint arXiv:1504.00325.
Finkel, Jenny Rose et al. (2005). “Incorporating non-local information into information extraction
systems by gibbs sampling”. In: Proceedings of the 43rd annual meeting on association for
computational linguistics. Association for Computational Linguistics, pp. 363–370.
Russakovsky, Olga et al. (2015). “ImageNet Large Scale Visual Recognition Challenge”. en. In:
International Journal of Computer Vision 115.3, pp. 211–252. DOI: 10.1007/s11263-015-0816-y.
Breiman, Leo (2001). “Random Forests”. en. In: Machine Learning 45.1, pp. 5–32. DOI: 10.1023/A:1010933404324.
Bingham, Ella et al. (2001). “Random Projection in Dimensionality Reduction: Applications to
Image and Text Data”. In: Proceedings of the Seventh ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. KDD ’01. New York, NY , USA: ACM, pp. 245–
250. DOI:10.1145/502512.502546.
Dasgupta, Sanjoy (2000). “Experiments with random projection”. In: Proceedings of the Six-
teenth conference on Uncertainty in artificial intelligence . Morgan Kaufmann Publishers Inc.,
pp. 143–151.
Breiman, Leo et al. (1984). Classification and regression trees . en.
Pedregosa, Fabian et al. (2011). “Scikit-learn: Machine Learning in Python”. In: Journal of Ma-
chine Learning Research 12, 2825-2830.
Kettenring, J. R. (1971). “Canonical analysis of several sets of variables”. en. In: Biometrika 58.3,
pp. 433–451. DOI:10.1093/biomet/58.3.433.
Shen, Cencheng et al. (2014). “Generalized canonical correlation analysis for classification”. In:
Journal of Multivariate Analysis 130.C, pp. 310–322.
Sun, Ming et al. (2013). “Generalized canonical correlation analysis for disparate data fusion”. In:
Pattern Recognition Letters 34.2, pp. 194–200.
Noh, Hyeonwoo et al. (2017). “Large-scale image retrieval with attentive deep local features”. In:
Proceedings of the IEEE International Conference on Computer Vision, pp. 3456–3465.
Painter by Numbers (2019).
Jaiswal, Ayush et al. (2019). “AIRD: Adversarial Learning Framework for Image Repurposing De-
tection”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 11330–11339.
Arandjelovic, Relja et al. (2016). “NetVLAD: CNN Architecture for Weakly Supervised Place
Recognition”. In: pp. 5297–5307.
Vinyals, Oriol et al. (2015). “Order Matters: Sequence to sequence for sets”. In: arXiv:1511.06391
[cs, stat]. arXiv: 1511.06391.
Scharr, Hanno et al. (2014). “Annotated image datasets of rosette plants”. In: Proceedings of the
European Conference on Computer Vision (ECCV), pp. 6–12.
Guan, Haiying et al. (2019). “MFC datasets: Large-scale benchmark datasets for media foren-
sic challenge evaluation”. In: 2019 IEEE Winter Applications of Computer Vision Workshops
(WACVW). IEEE, pp. 63–72.
Shi, Xiangyang et al. (2019). “Layout-aware Subfigure Decomposition for Complex Figures in the
Biomedical Literature”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, pp. 1343–1347.
Tsutsui, Satoshi et al. (2017). “A data driven approach for compound figure separation using convo-
lutional neural networks”. In: 2017 14th IAPR International Conference on Document Analysis
and Recognition (ICDAR). Vol. 1. IEEE, pp. 533–540.
Siegel, Noah et al. (2018). “Extracting scientific figures with distantly supervised neural networks”.
In: Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pp. 223–232.
Huang, Gao et al. (2017). “Densely connected convolutional networks”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
He, Kaiming et al. (2016). “Deep residual learning for image recognition”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Islam, Ashraful et al. (2020). “DOA-GAN: Dual-Order Attentive Generative Adversarial Network
for Image Copy-move Forgery Detection and Localization”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4676–4685.
Plummer, Bryan A et al. (2015). “Flickr30k entities: Collecting region-to-phrase correspondences
for richer image-to-sentence models”. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 2641–2649.
Jegou, Herve et al. (2008). “Hamming Embedding and Weak Geometric Consistency for Large
Scale Image Search”. In: Computer Vision – ECCV 2008. Ed. by David Forsyth et al. Berlin,
Heidelberg: Springer Berlin Heidelberg, pp. 304–317.
Rublee, Ethan et al. (2011). “ORB: An efficient alternative to SIFT or SURF”. In: Proceedings of
the IEEE International Conference on Computer Vision. Ieee, pp. 2564–2571.
Calonder, Michael et al. (2010). “Brief: Binary robust independent elementary features”. In: Pro-
ceedings of the European Conference on Computer Vision (ECCV). Springer, pp. 778–792.
Matthews, Brian W (1975). “Comparison of the predicted and observed secondary structure of T4
phage lysozyme”. In: Biochimica et Biophysica Acta (BBA)-Protein Structure 405.2, pp. 442–
451.
Chicco, Davide et al. (2020). “The advantages of the Matthews correlation coefficient (MCC) over
F1 score and accuracy in binary classification evaluation”. In: BMC genomics 21.1, p. 6.
Boughorbel, Sabri et al. (2017). “Optimal classifier for imbalanced data using Matthews Correla-
tion Coefficient metric”. In: PloS one 12.6, e0177678.
Fischler, Martin A et al. (1981). “Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography”. In: Communications of the ACM
24.6, pp. 381–395.
Christlein, Vincent et al. (2012). “An evaluation of popular copy-move forgery detection ap-
proaches”. In: IEEE Transactions on Information Forensics and Security 7.6, pp. 1841–1854.
Fridrich, A Jessica et al. (2003). “Detection of copy-move forgery in digital images”. In: in Pro-
ceedings of Digital Forensic Research Workshop. Citeseer.
Bashar, M. et al. (2010). “Exploring Duplicated Regions in Natural Images”. In: IEEE Transactions
on Image Processing, pp. 1–1. DOI:10.1109/TIP.2010.2046599.
109
Appendices
E Chapter 6 Appendix
E.1 Experiment Details
We list the hyper-parameter and finetuning details of baselines corresponding to each task.
Keypoint-Descriptor: We implemented a classic image matching pipeline using keypoint-descriptor methods such as SIFT, ORB and BRIEF. Keypoints are matched using a kd-tree, and a consistent homography is found using RANSAC to remove outlier matches. A rectangular bounding box is created around the furthest matched keypoints. We require a minimum of 10 matched keypoints to consider an image pair manipulated.
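An illustrative OpenCV sketch of this baseline follows; the FLANN parameters and Lowe's ratio test are our assumptions rather than the exact configuration used.

```python
import cv2
import numpy as np

def detect_duplication(img1, img2, min_matches=10):
    """SIFT keypoints, kd-tree (FLANN) matching, RANSAC homography to
    drop outliers, and a bounding box around the surviving matches."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False, None
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # kd-tree
                                  dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)
    # Lowe's ratio test to keep only distinctive matches (our assumption)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_matches:
        return False, None
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None or inlier_mask.sum() < min_matches:
        return False, None
    # bounding box around the furthest inlier keypoints in image 1
    pts = src[inlier_mask.ravel() == 1].reshape(-1, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return True, (float(x0), float(y0), float(x1), float(y1))
```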
DenseField: We evaluated DenseField (Cozzolino et al. 2015a) on the IDD task with the three reported transforms – Zernike moments (ZM), polar cosine transform (PCT) and Fourier-Mellin transform (FMT). ZM and PCT are evaluated with a polar sampling grid. The feature lengths for ZM, PCT and FMT are 12, 10 and 25 respectively. Since DenseField is a copy-move detection algorithm, it expects a single image input. For evaluation on the EDD task, we concatenated image pairs along the column axis to form a single input and used the best reported transform (ZM).
DMVN: The model is finetuned on synthetic data using the Adam optimizer with a learning rate of 1e-5, batch size 16 and binary cross-entropy loss. The model has two outputs: 1) binary mask prediction and 2) image-level forgery classification. We found fine-tuning to be unstable when training both outputs jointly, so we set the image classification loss weight to zero and tuned only the pixel loss. For image-level classification we used a protocol similar to BusterNet (Y. Wu et al. 2018). Post-processing that removes stray predicted regions covering less than 10% of the image area improved image classification performance.
BusterNet: We finetune BusterNet (Y. Wu et al. 2018) on synthetic data using the Adam optimizer with a learning rate of 1e-5, batch size of 32 and categorical cross-entropy loss. BusterNet predicts a 3-channel mask to identify source, target and pristine pixels. Since we do not need to discriminate between source and target pixels, we consider both classes as manipulated.
Block Feature Matching: Discrete cosine transform (DCT), discrete wavelet transform (DWT) and Zernike features are matched with a block size of 16 pixels and a minimum Euclidean distance of 50 pixels between two matched blocks, using the CMFD algorithm reported in Christlein et al. 2012.
ManTraNet: We finetuned the model using the Adam optimizer with a learning rate of 1e-3, a batch size of 32 with gradient accumulation, and binary cross-entropy loss. Since cuts and transitions involve thin pixel slices that can be distorted by resizing, we use images at their original dimensions.
Baseline CNN: We trained the CNN using the Adam optimizer with a learning rate of 1e-3, mean squared error loss and a batch size of 10.
E.2 Results with F1 metric
Method Microscopy Blot/Gel Macroscopy FACS Combined
SIFT Lowe 2004 8.48% 9.37% 6.98% 6.09% 8.18%
ORB Rublee et al. 2011 30.48% 5.97% 9.87% 22.53% 20.66%
BRIEF Calonder et al. 2010 27.42% 3.74% 13.07% 20.09% 18.22%
DF - ZM Cozzolino et al. 2015a 42.00% 15.42% 27.48% 54.17% 27.06%
DMVN Y. Wu et al. 2017 16.40% 18.61% 10.29% 8.94% 16.31%
Table 8.1: Image level F1 scores for external duplication detection (EDD) task.
Method Microscopy Blot/Gel Macroscopy FACS Combined
SIFT Lowe 2004 5.82% 11.74% 10.52% 2.32% 5.83%
ORB Rublee et al. 2011 28.56% 12.45% 19.34% 8.86% 21.47%
BRIEF Calonder et al. 2010 25.59% 9.74% 16.22% 9.33% 18.58%
DF - ZM Cozzolino et al. 2015a 42.08% 19.17% 25.82% 50.24% 32.46%
DMVN Y. Wu et al. 2017 31.54% 42.07% 21.78% 18.85% 27.55%
Table 8.2: Pixel level F1 scores for external duplication detection (EDD) task.
Method
Microscopy Blot/Gel Macroscopy Combined
Image Pixel Image Pixel Image Pixel Image Pixel
DF - ZM Cozzolino et al. 2015a 74.1% 12.0% 52.5% 44.0% 53.3% 38.9% 56.1% 30.5%
DF - PCT Cozzolino et al. 2015a 74.1% 12.4% 52.0% 46.0% 70.6% 40.0% 57.3% 31.9%
DF - FMT Cozzolino et al. 2015a 58.3% 9.9% 50.0% 38.5% 49.1% 39.8% 50.6% 27.7%
DCT Fridrich et al. 2003 16.3% 3.5% 28.6% 17.1% 23.5% 15.5% 23.4% 10.3%
DWT Bashar et al. 2010 26.0% 7.5% 40.0% 25.7% 63.2% 22.2% 37.0% 16.2%
Zernike Ryu et al. 2010 18.0% 3.9% 34.8% 18.4% 42.9% 13.5% 29.0% 11.1%
BusterNet Y. Wu et al. 2018 14.0% 13.4% 27.3% 4.6% 24.1% 14.9% 30.2% 6.0%
Table 8.3: F1 scores for internal duplication detection (IDD) task.
Abstract
Malicious and falsified digital content has become a powerful conveyor of false information that is not just a nuisance but a threat to open societies worldwide. Often disinformation articles rely on manipulated images as “evidence”, making it important to develop methods for the detection of misuse of images. Such inappropriate image use can be broadly classified into two categories: (1) semantic forgery i.e. reusing or repurposing an image by falsifying its context, and (2) digital forgery i.e. modifying the image itself to achieve an end purpose. Compared to digital image forensics, research in semantic forgery detection is relatively new and sparse. For semantic forgery detection of natural images in a social media context, we introduced a dataset that simulates person, location and organization repurposing of images. We also developed deep-learning methods to detect image repurposing by leveraging a trusted knowledge base. Additionally, research into image forensics has been limited to natural images. There are instances of digital and semantic forgery beyond natural images such as in the biomedical domain, where images are manipulated to misrepresent experimental results. In order to promote research beyond natural image forensics, we introduce a dataset comprising biomedical image manipulations along with a taxonomy of semantic and digital manipulation detection tasks. Through our extensive evaluation of state-of-the-art digital image forensics models, we found that existing algorithms developed on common computer vision datasets are not robust when applied to biomedical images. To address semantic forgeries in the biomedical domain, we developed a multi-scale overlap detection model that achieves state-of-the-art performance across multiple categories of biomedical images.