Invariant Representation Learning for Robust and Fair Predictions
by
Ayush Jaiswal
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
May 2020
Copyright 2020 Ayush Jaiswal
“We can only see a short distance ahead, but we can see plenty there that needs to be done.”
Alan Turing
Dedicated to my parents, Soni and Binod, and my brother, Piyush.
Acknowledgements
I truly believe that the completion of this degree and this thesis is owed in part to the people
involved in my life over the course of and beyond my time in the PhD program. Hence, I would
like to sincerely thank individuals who have made this journey memorable, joyful, and successful.
I express my deepest gratitude to my advisor, Prof. Premkumar Natarajan, whose mentorship,
support, and feedback kept me on the right track with my research direction. I truly appreciate
his ability to quickly identify flaws and gaps in ideas, which often opened my eyes to things
that I had previously overlooked. This helped improve my own scientific thinking and saved me
from a lot of wasted effort. Prof. Natarajan has always inspired me to think about larger research
problems and to carefully set long-term goals. He has also shared my happiness whenever my
experiments delivered successful results, which boosted my motivation during stressful times. I
thoroughly cherish our conversations beyond research as well, spanning a wide diversity of topics
including politics, psychology, and life experiences. I look up to him for his amazing leadership
skills, clear thought process, realistic outlook, and an empathetic and kind-hearted spirit.
I would like to thank Prof. Wael AbdAlmageed for being a supportive research leader under
whose supervision I worked on almost all my projects over the course of my program. Our weekly
meetings helped develop and refine ideas while also keeping me in check so that I did not fall
behind on progress. I appreciate his never-ending hard work in the day-to-day management of projects and
research infrastructure that made the research in this thesis possible. I am also grateful for his
feedback on my papers, which significantly improved my technical writing and presentation skills.
I have been fortunate to work with him on grant proposals, an arena that was completely new to
me. I thank him for offering me endless opportunities and expanding my horizons.
I am grateful to Dr. Yue (Rex) Wu, who truly helped develop my technical skills and research
ideology. His constant guidance was key in developing some of the work presented in this thesis.
His cheerful spirit and kind friendship pushed me through my early failures. I thank him for his
generosity and mentorship.
I would also like to thank Prof. Greg Ver Steeg for taking interest in my work and providing
guidance on some of the work presented in this thesis. His analytical thinking and theoretical
foundation are truly inspiring. He also graciously served on my thesis proposal committee.
Before Prof. Natarajan, I had the fortune of working under Prof. Cauligi S. Raghavendra. I
am grateful to him for believing in my potential to be a successful PhD candidate and providing
constant support during my time under him. I appreciate his kind, friendly, and caring nature. I
also thank him for serving on my thesis proposal and defense committees.
I am thankful to Prof. Ram Nevatia for serving on my thesis proposal and defense committees
and to Prof. Aram Galstyan for serving on my proposal committee. Their valuable feedback is
deeply appreciated.
Over the course of my PhD, I have also had the fortune to work with and alongside some
amazing people: Ekraam Sabir (who I know from high school!), Dr. Daniel (Dan) Moyer, Rob
Brekelmans, Joe Mathai, Dr. Iacopo Masi, I-Hung Hsu, Jiaxin Cheng, Stephen Rawls, Sachin
Malhotra, Dong Guo, Umang Gupta, Gözde Şahin, Emily Sheng, Mona Sharifi Sarabi, Dr. Kuan
Liu, Soumyaroop Nandi, and Zekun Li. I thank them for their invaluable collaboration and
fascinating conversations.
I would like to thank Karen Rawlins for providing exceptional administrative assistance during
my years at ISI. I have also thoroughly enjoyed our conversations on a wide array of topics. I
am grateful to Lida Dimitropoulou and Lizsl De Leon for their constant support during my PhD
journey as well.
I am grateful to my mentors during my undergraduate years: Prof. Vineeth Paleri, Ayush
Sengupta, and Arnab Bhattacharjee, who inspired me to think big and work towards academic
excellence. Their invaluable guidance and encouragement pushed me to apply for the PhD program.
I also thank Profs. Craig A. Knoblock, Yao-Yi Chiang, and Pedro Szekely for mentoring me during
my summer internship at ISI, which was also my first research experience.
A key factor that helped me survive the PhD program was the emotional support that I received
from the friends I made in Los Angeles. I am fortunate to have them as my new family away from
my family in India. I am especially grateful for these amazing people: Iris Sylph, Gözde Şahin,
Ekraam Sabir, Daniel Moyer, Palash Goyal, Umang Gupta, Ganesh Sreeram, Nitin Kamra, and
Cheung Chung Ming.
Finally, I owe everything, not just this thesis, to my parents, Soni and Binod, and my brother,
Piyush. Their never-ending love, sacrifices, encouragement, and support have made it possible for
me to fulfill my dreams.
Table of Contents
Epigraph
Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Background
2.1 Unsupervised Methods of Invariance
2.1.1 The Information Bottleneck Method
2.1.2 Information Dropout
2.1.3 Deep Variational Information Bottleneck
2.2 Supervised Methods of Invariance
2.2.1 Training for Statistical Parity
2.2.2 Regularization with Maximum Mean Discrepancy
2.2.3 Variational Fair Autoencoder
2.2.4 HSIC-constrained Variational Autoencoder
2.2.5 Controllable Adversarial Invariance
2.2.6 Conditional Variational Information Bottleneck
I METHODS
Chapter 3: Unified Adversarial Invariance
3.1 Introduction
3.2 Unified Adversarial Invariance (UnifAI)
3.2.1 Unified Invariance Induction Framework
3.2.2 Adversarial Model Design and Optimization
3.2.3 Invariant Predictions with the Trained Model
3.3 Analysis
3.4 Empirical Evaluation
3.4.1 Invariance to inherent nuisance factors
3.4.2 Effective use of synthetic data augmentation
3.4.3 Invariance to arbitrary nuisances by leveraging GANs
3.4.4 Domain Adaptation
3.4.5 Fair Representation Learning
3.4.6 Competition between prediction and reconstruction
3.5 Summary
Chapter 4: Discovery and Separation of Features for Invariant Representations
4.1 Introduction
4.2 Separation of Predictive and Nuisance Factors of Data
4.2.1 Invariance through Information Discovery and Separation
4.2.2 Embedding Compression
4.2.3 Independence between the z_p and z_n Embeddings
4.2.3.1 Independence through Compression
4.2.3.2 Hilbert-Schmidt Independence Criterion
4.2.3.3 Adversarial Independence
4.2.4 Model Implementation and Training
4.3 Analysis
4.4 Experimental Evaluation
4.5 Summary
Chapter 5: Invariant Representations through Adversarial Forgetting
5.1 Introduction
5.2 Invariance through Adversarial Forgetting
5.3 Characterizing Forgetting with Forget-gate
5.3.1 Forget-gate
5.3.2 Practicalities of ε-noise and bottleneck-mask intuition
5.4 Experimental Evaluation
5.4.1 Robustness through invariance to nuisance factors
5.4.2 Fairness through invariance to biasing factors
5.4.3 Invariance in multi-task learning
5.5 Summary
II APPLICATIONS
Chapter 6: Robust Presentation Attack Detection
6.1 Introduction
6.2 Related Work
6.3 Robust Presentation Attack Detection
6.3.1 A Data Factorization View of PAD
6.3.2 Base CNN Model of RoPAD
6.3.3 Unsupervised Adversarial Invariance
6.3.4 RoPAD using UAI
6.4 Experimental Evaluation
6.4.1 Datasets And Metrics
6.4.2 Evaluation Results
6.5 Summary
Chapter 7: Nuisance Invariant End-to-end Speech Recognition
7.1 Introduction
7.2 Method
7.2.1 Base Sequence-to-sequence Model
7.2.2 Unsupervised Adversarial Invariance
7.2.3 NIESR Model Design and Optimization
7.3 Experiments
7.3.1 Datasets
7.3.2 Experiment Setup
7.3.3 ASR Performance on Benchmark Datasets
7.3.4 Invariance to Nuisance Factors
7.3.5 Additional Robustness through Data Augmentation
7.4 Summary
Chapter 8: Nuisance Invariance for Learning with Less Labels
8.1 Introduction
8.2 Robust Low-data and Semi-supervised Learning
8.2.1 Nuisance Invariance via Discovery and Separation of Features
8.2.2 Nuisance Invariance in Low-data Settings
8.2.3 Nuisance Invariance in Semi-supervised Settings
8.3 Experimental Evaluation
8.3.1 Datasets
8.3.2 Experiment Setup
8.3.3 Results
8.4 Summary
Chapter 9: Fair Face Analytics
9.1 Introduction
9.2 Related Work
9.3 Debiasing Face Analytics
9.3.1 Base Model
9.3.2 Adversarial Forgetting for Invariant Predictions
9.3.3 Unbiased Attribute Prediction with Adversarial Forgetting
9.3.4 Training Details
9.4 Experimental Evaluation
9.4.1 Dataset
9.4.2 Experiment Setup
9.4.3 Results
9.5 Summary
Chapter 10: Conclusion
10.1 Summary of Contributions
10.2 Supporting Papers
10.3 Other Papers
Bibliography
Appendix A: Proofs
A.1 Variance Inequality
Appendix B: Samples from Benchmarking Datasets Created in this Work
B.1 MNIST-ROT
B.2 MNIST-DIL
B.3 Chairs
B.4 Multi-PIE
Appendix C: Other Research
C.1 Adversarial Learning Framework for Image Repurposing Detection
C.2 Bidirectional Conditional Generative Adversarial Networks
C.3 CapsuleGAN: Generative Adversarial Capsule Network
List Of Tables
3.1 UnifAI – Key Concepts and Framework Components
3.2 UnifAI – Results on Extended Yale-B dataset
3.3 UnifAI – Results on Chairs dataset
3.4 UnifAI – Results on MNIST-ROT dataset
3.5 UnifAI – Results on MNIST-DIL dataset
3.6 UnifAI – Results on Fashion-MNIST dataset
3.7 UnifAI – Results on Omniglot dataset
3.8 UnifAI – Results on Amazon Reviews dataset
3.9 UnifAI – Results on Adult dataset
3.10 UnifAI – Results on German dataset
4.1 DSF – Results on MNIST-ROT dataset
4.2 DSF – Results on MNIST-DIL dataset
4.3 DSF – Results on Extended Yale-B dataset
4.4 DSF – Results on MultiPIE dataset
4.5 DSF – Results on Chairs dataset
5.1 Adversarial Forgetting – Results on Chairs dataset
5.2 Adversarial Forgetting – Results on Extended Yale-B dataset
5.3 Adversarial Forgetting – Results on MNIST-ROT dataset
5.4 Adversarial Forgetting – Results on Adult dataset
5.5 Adversarial Forgetting – Results on German dataset
5.6 Adversarial Forgetting – Results on dSprites dataset
6.1 Summary of benchmark datasets for PAD
6.2 GCT1 – summary of real images and attacks
6.3 RoPAD – Results on Idiap Replay-Attack
6.4 RoPAD – Results on Idiap Replay-Mobile
6.5 RoPAD – Results on MSU MSFD
6.6 RoPAD – Results on 3DMAD
6.7 RoPAD – Results on GCT1
7.1 NIESR – Hyperparameters for the base model
7.2 NIESR – Hyperparameters for the NIESR model
7.3 NIESR – Results on speech recognition
7.4 NIESR – Results on nuisance invariance
7.5 NIESR – Results on augmented dataset
8.1 NILLL – Results on CIFAR10
8.2 NILLL – Results on CIFAR100
8.3 NILLL – Results on Mini-ImageNet
9.1 FFA – Results on IMDB Faces for predicting age invariant to gender
9.2 FFA – Results on IMDB Faces for predicting gender invariant to age
List Of Figures
3.1 Unsupervised Invariance Induction Framework
3.2 The Unified Adversarial Invariance (UnifAI) model
3.3 UnifAI – t-SNE visualization of Extended Yale-B embeddings
3.4 UnifAI – Extended Yale-B reconstructions
3.5 UnifAI – t-SNE visualization of Chairs embeddings
3.6 UnifAI – Chairs reconstructions
3.7 UnifAI – MNIST-ROT reconstructions
3.8 UnifAI – t-SNE visualization of MNIST-ROT embeddings - I
3.9 UnifAI – t-SNE visualization of MNIST-ROT embeddings - II
3.10 Fashion-MNIST samples generated using BiCoGAN
3.11 UnifAI – t-SNE visualization of Fashion-MNIST embeddings
3.12 Omniglot samples generated using DAGAN
3.13 UnifAI – t-SNE visualization of Adult embeddings
3.14 UnifAI – t-SNE visualization of German embeddings
3.15 UnifAI – Competition between prediction and reconstruction
4.1 DSF – t-SNE visualization of MNIST-ROT embeddings
4.2 DSF – t-SNE visualization of Extended Yale-B embeddings
4.3 DSF – t-SNE visualization of MultiPIE embeddings
4.4 DSF – t-SNE visualization of Chairs embeddings
5.1 Adversarial forgetting framework for invariant representation learning
5.2 Adversarial Forgetting – t-SNE visualization of Chairs
5.3 Adversarial Forgetting – Chairs reconstructions
5.4 Adversarial Forgetting – t-SNE visualization of Extended Yale-B
5.5 Adversarial Forgetting – Extended Yale-B reconstructions
5.6 Adversarial Forgetting – t-SNE visualization of MNIST-ROT
5.7 Adversarial Forgetting – MNIST-ROT reconstructions
6.1 Presentation Attack Examples
6.2 The RoPAD Model Architecture
6.3 GCT1 Presentation Attacks
6.4 RoPAD – Receiver Operating Curves for MSU MSFD
6.5 RoPAD – Receiver Operating Curves for GCT1
7.1 The NIESR Model Architecture
8.1 NILLL – t-SNE visualization of CIFAR-10 (1%)
8.2 NILLL – t-SNE visualization of CIFAR-10 (2%)
8.3 NILLL – t-SNE visualization of CIFAR-10 (4%)
8.4 NILLL – t-SNE visualization of CIFAR-10 (8%)
B.1 Samples from MNIST-ROT dataset
B.2 Samples from MNIST-DIL dataset
B.3 Samples from Chairs dataset
B.4 Samples from Multi-PIE dataset
Abstract
Learning representations that are invariant to nuisance factors of data improves robustness of
machine learning models, and promotes fairness for factors that represent biasing information.
This view of invariance has been adopted for deep neural networks (DNNs) recently as they learn
latent representations of data by design. Numerous methods for invariant representation learning
for DNNs have emerged in recent literature, but the research problem remains challenging:
existing methods achieve only partial invariance or sacrifice performance on the prediction
tasks for which the DNNs are trained.
This thesis presents novel approaches for inducing invariant representations in DNNs by
effectively separating predictive factors of data from undesired nuisances and biases. The presented
methods improve the predictive performance and the fairness of DNNs through increased invariance
to undesired factors. Empirical evaluation on a diverse collection of benchmark datasets shows
that the presented methods achieve state-of-the-art performance.
Application of the invariance methods to real-world problems is also presented, demonstrating
their practical utility. Specifically, the presented methods improve nuisance-robustness in
presentation attack detection and automated speech recognition, fairness in face-based analytics, and
generalization in low-data and semi-supervised learning settings.
Chapter 1
Introduction
A common formulation of supervised machine learning is the estimation of the conditional
probability p(y|x) from data, where x and y denote data samples and target variables, respectively. This
involves the decomposition of x into its underlying factors of variation, such that associations
can be learned between y and the said factors to approximate a mapping from x to y. However,
trained models often learn to incorrectly associate y with nuisance factors of data, which are truly
irrelevant to the prediction of y, leading to overfitting and poor generalization on test cases that
contain unseen variations of such factors (Domingos, 2012). For example, a nuisance variable in
the case of face recognition is the lighting condition in which the photograph was captured. A
face recognition model that associates lighting with subject identity is expected to perform poorly,
especially in previously unseen lighting.
Developing machine learning methods that are invariant to nuisance factors has been a
long-standing problem, studied under various names such as feature selection (Miao and Niu, 2016a),
robustness through data augmentation (Ko et al., 2015; Krizhevsky, Sutskever, and Hinton, 2012;
Saito et al., 2017), and invariance induction (Jaiswal et al., 2018d; Achille and Soatto, 2018a;
Alemi et al., 2016). An architectural solution to this problem for deep neural networks (DNNs)
is the creation of neural network units that capture specific forms of information, and thus are
inherently invariant to certain nuisance factors (Bengio, Courville, and Vincent, 2013). For example,
convolutional operations coupled with pooling strategies capture shift-invariant spatial information
while recurrent operations robustly capture high-level trends in sequential data. However, this
approach requires significant effort for engineering custom modules and layers to achieve invariance
to specific nuisance factors, making it inflexible. A different but popularly adopted solution to the
problem of nuisance factors is the use of data augmentation where synthetic versions of real data
samples are generated, during training, with specic forms of variation (Bengio, Courville, and
Vincent, 2013; Masi et al., 2019). For example, rotation and translation are typical methods of
augmentation used in computer vision, especially for classication and detection tasks. However,
models trained naïvely on the augmented dataset become robust to limited forms of nuisance by
learning to associate every seen variation of such factors to the target. Consequently, such models
perform poorly when applied to data exhibiting unseen variations of those nuisance variables,
e.g., images of objects at previously unseen orientations or colors in the case of object detection.
Thus, naïvely training with data augmentation makes models partially invariant to the variables
accounted for in the augmentation process.
Furthermore, training datasets often contain factors of variation that are correlated with
the prediction target but should not be incorporated in the prediction process to avoid skewed
decisions that are unfair to under-represented categories of these biasing factors. This can also
be viewed as a “class-imbalance” problem with respect to the biasing factor instead of the target
variable. For example, gender and race are biasing factors in many human-centric tasks like
face recognition (Merler et al., 2019), sentiment analysis (Kiritchenko and Mohammad, 2018),
socio-economic assessments (Courtland, 2018), etc. Models that do not account for such bias make
incorrect predictions and can sometimes be unethical to use. It is, therefore, necessary to develop
mechanisms that train models to be invariant to not only nuisance but also biasing factors of data.
Within the framework of DNNs, predictions can be made invariant to undesired (nuisance or
biasing) factors s if the latent representation of data learned by a DNN at any given layer does
not contain any information about those factors. This view has been adopted by recent works as
the task of invariant representation learning (Achille and Soatto, 2018a; Alemi et al., 2016; Zemel
et al., 2013; Li, Swersky, and Zemel, 2014; Louizos et al., 2016; Xie et al., 2017; Moyer et al.,
2018; Lopez et al., 2018) through specialized training mechanisms that encourage the exclusion of
undesired variables from the latent embedding. Models trained in this fashion to be invariant to
nuisance variables s, as opposed to training simply with data augmentation, become robust by
exclusion rather than inclusion. Therefore, such models are expected to perform well even on data
containing variations of specic nuisance factors that were not seen during training. For example,
a face recognition model that learns to not associate lighting conditions with the identity of a
person is expected to be more robust to lighting conditions than a similar model trained naïvely on
images of subjects under certain different lighting conditions (Xie et al., 2017). Similarly, the use
of such mechanisms to train models to be invariant to biasing s provides better guarantees that
the sensitive information is not incorporated in the prediction process (Li, Swersky, and Zemel,
2014; Louizos et al., 2016).
Several approaches for invariant representation learning within neural networks have been
proposed in recent works. These can be generally divided into two groups: those that do not
require annotations (Achille and Soatto, 2018a; Alemi et al., 2016; Jaiswal et al., 2018d) for the
undesired data factors, and those that do (Li, Swersky, and Zemel, 2014; Lopez et al., 2018;
Louizos et al., 2016; Moyer et al., 2018; Xie et al., 2017; Zemel et al., 2013). Annotation-free
invariance methods cannot be used to remove biased information from the latent embedding
because there is no way to tell whether a predictive factor is biased or not. However, this class
of methods is well suited for learning representations invariant to nuisances, largely due to two
reasons: (1) these approaches require no additional annotation besides the prediction target,
making them more widely applicable in practice, and (2) it is known (Achille and Soatto, 2018b)
that nuisance annotations are not necessary for learning minimal yet sufficient representations
of data for a prediction task. Training a supervised model with the Information Bottleneck (IB)
objective (Tishby, Pereira, and Bialek, 1999) can compress out all nuisance factors from the latent
embedding under optimality (Achille and Soatto, 2018b). However, in practice, IB is very difficult
to optimize (Achille and Soatto, 2018a; Alemi et al., 2016) and recent works have approximated
IB variationally (Alemi et al., 2016) or in the form of information dropout (Achille and Soatto,
2018a). Achieving such theoretically perfect invariance to nuisances in an annotation-free manner
has, hence, remained a challenging task.
Methods that require annotations of undesired factors are suitable for targeted removal of
specic kinds of information from the latent space. These methods could be used to explicitly
induce invariance to nuisances when nuisance annotations are readily available or easy to collect.
Furthermore, these methods are directly applicable for removing factors of data that are correlated
with the prediction target but are undesired due to external reasons, e.g., biased variables
corresponding to race and gender. In contrast, annotation-free invariance methods cannot be
used to remove biased information from latent representations, as discussed above. Hence, it is
necessary to employ this class of methods in fairness settings. Numerous invariance frameworks
have been introduced recently that use annotations of undesired factors during training. Prominent
approaches use statistical tools like Maximum Mean Discrepancy (MMD) (Gretton et al., 2007) and
Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) as regularizers (Li, Swersky,
and Zemel, 2014; Louizos et al., 2016; Lopez et al., 2018), the gradient-reversal trick (Ganin et al.,
2016) as an auxiliary penalty (Xie et al., 2017), or formulate invariance with annotations as a
conditional form (Moyer et al., 2018) of the IB objective. Despite the significant amount of research
interest and effort towards the development of such invariance frameworks, previous methods do
not function perfectly. Models trained in these manners often retain some information about the
undesired factors or inadvertently lose information about desired predictive factors of data.
The purpose of this thesis is to provide novel approaches for effectively achieving invariance
to nuisances and biases in data in order to make robust and fair predictions with deep neural
networks. The presented works include both aforementioned classes of invariance methods and
advance the state-of-the-art in this field, as validated on a diverse collection of datasets and tasks.
The usability and effectiveness of these works are further demonstrated in various application domains
that can benefit from the adoption of invariance frameworks for developing DNN-based solutions.
We begin by describing prior work on invariance for DNNs in Chapter 2. This sets the stage
for the rest of the thesis and positions the presented works with respect to previous
art. The remainder of the thesis is divided into two parts. Part I presents novel methods for invariance
to nuisances and biases through learning to discover and separate underlying factors of data. The
applications of these frameworks in various domains are then presented in Part II.
In Part I, Chapter 3 describes a new unified framework for invariance (Jaiswal et al., 2018d; 2019d)
to both nuisance and biasing factors of data. It achieves invariance to nuisances in an annotation-free
fashion through competitive optimization of target-prediction and data-reconstruction objectives in
the presence of adversarially enforced information constraints. It additionally achieves invariance to
biasing factors by using their annotations to train the main predictive DNN against an adversarial
discriminator that aims to predict the biasing factors from the DNN's latent embedding. Chapter 4
expands on the intuition in Chapter 3 and presents a novel information theoretic framework (Jaiswal
et al., 2019b) for nuisance invariance through learning to simultaneously discover and separate
predictive and nuisance factors of data. It also provides an information theoretic analysis of
the method in Chapter 3, showing that the newer framework is strictly superior. In Chapter 5
we present yet another novel approach (Jaiswal et al., 2020) to invariance through learning to
“forget” known undesired factors through elementwise multiplication of the latent embeddings with
adversarially learned forget-masks.
Part II begins with Chapter 6 where we present a robust model (Jaiswal et al., 2019c) for
combatting the problem of presentation attacks in face-based biometrics. Chapter 7 then describes
a nuisance-invariant end-to-end automated speech recognition (ASR) model (Hsu, Jaiswal, and
Natarajan, 2019). The presentation attack detection and ASR models use the method discussed in
Chapter 3 for invariance to nuisance factors without the incorporation of domain knowledge and
nuisance annotations. The benets of adopting nuisance invariance in low-data and semi-supervised
learning settings are presented in Chapter 8 with the incorporation of the method described in
Chapter 4. In Chapter 9 we show that the forgetting method for invariance (Chapter 5) reduces
biases in facial attribute prediction.
Chapter 10 provides concluding remarks and summarizes the contributions made by this
thesis. Proofs are presented in Appendix A. Samples from datasets created to further research in
invariance are shown in Appendix B. Finally, a few related works finished during the course of
this thesis are presented in Appendix C.
Chapter 2
Background
Methods for preventing supervised models from learning false associations between target variables
and nuisance factors have been studied from various perspectives including feature selection (Miao
and Niu, 2016b), robustness through data augmentation (Ko et al., 2015; Krizhevsky, Sutskever,
and Hinton, 2012; Saito et al., 2017) and invariance induction (Achille and Soatto, 2018a; Alemi
et al., 2016; Zemel et al., 2013; Li, Swersky, and Zemel, 2014; Louizos et al., 2016; Xie et al.,
2017; Moyer et al., 2018; Lopez et al., 2018). Feature selection has typically been employed when
data is available as a set of conceptual features, some of which are irrelevant to the prediction
tasks. Popular feature selection methods (Miao and Niu, 2016b) incorporate information-theoretic
measures or use supervised methods to score features with their importance for the prediction
task and prune the low-scoring features. These approaches are not directly applicable to neural
networks (NNs) that use complex raw data as inputs, e.g., images, speech signals, text, etc.
Deep neural networks (DNNs) have outperformed traditional methods at several supervised
learning tasks. However, they have a large number of parameters that need to be estimated from
data, which makes them especially vulnerable to learning relationships between target variables and
nuisance factors and, thus, overfitting. Motivated by the fact that NNs learn latent representations
of data in the form of activations of their hidden layers, recent works (Achille and Soatto, 2018a;
Achille and Soatto, 2018b; Alemi et al., 2016; Jaiswal et al., 2018d) have framed robustness to
nuisance for NNs as the task of nuisance-invariant representation learning. In this formulation,
latent representations of NNs are made invariant through (1) naïvely training models with large
variations of nuisance factors (e.g., through data augmentation) (Ko et al., 2015; Krizhevsky,
Sutskever, and Hinton, 2012), or (2) training mechanisms (Achille and Soatto, 2018a; Alemi et al.,
2016; Louizos et al., 2016; Moyer et al., 2018; Xie et al., 2017) that lead to the exclusion of nuisance
factors from the latent representation.
Data augmentation is a popular approach to achieve invariance to nuisance in DNNs. The
functional idea in this class of methods is to expand the data size, where multiple copies of
data samples are created by altering variations of certain known nuisance factors. DNNs trained
with data augmentation have been shown to generalize better and be more robust compared to
those trained without augmentation in many domains including vision (Masi et al., 2019; Jaiswal
et al., 2018c; Krizhevsky, Sutskever, and Hinton, 2012), speech (Ko et al., 2015) and natural
language (Saito et al., 2017). This approach works on the principle of inclusion, in which the
model learns to associate multiple seen variations of those nuisance factors to each target value.
In contrast, a method that encourages exclusion of information about nuisance factors from
latent features used for predicting the target is expected to learn more robust representations.
Furthermore, combining such a method with data augmentation additionally helps models remove
information about nuisance factors used to synthesize data, without the need to explicitly quantify
or annotate the generated variations. This is especially helpful in cases where augmentation is
performed using sophisticated analytical or composite techniques (Masi et al., 2019).
Information bottleneck (Tishby, Pereira, and Bialek, 1999) has been used to model unsupervised
methods of invariance to nuisance variables within supervised DNNs in recent works (Alemi et al.,
2016; Moyer et al., 2018; Achille and Soatto, 2018b). The working mechanism of these methods is
to minimize the mutual information between the latent embedding z and the data x, i.e., I(x : z), while
maximizing I(z : y) to ensure that z is maximally predictive of y but a minimal representation of x
in that regard. Hence, these methods compress data into a compact representation and indirectly
minimize I(z : s) for nuisances s ⊥ y. An optimal compression of this form would get rid of all
such nuisance factors with respect to the prediction target (Achille and Soatto, 2018b). However,
the bottleneck objective is difficult to optimize (Achille and Soatto, 2018a; Alemi et al., 2016) and
has consequently been approximated using variational inference in prior work (Alemi et al., 2016).
Information Dropout (Achille and Soatto, 2018a), which is a data-dependent generalization of
dropout (Srivastava et al., 2014), also optimizes the bottleneck objective indirectly.
Several supervised methods for invariance induction have also been developed recently (Xie et
al., 2017; Zemel et al., 2013; Louizos et al., 2016; Li, Swersky, and Zemel, 2014; Moyer et al., 2018).
These methods use annotations of unwanted factors of data within specialized training mechanisms
that force the removal of these variables from the latent representation. Zemel et al. (2013) learn fair
representations by optimizing an objective that maximizes the performance of y-prediction while
enforcing group fairness through statistical parity. Maximum Mean Discrepancy (MMD) (Gretton
et al., 2007) has been used directly as a regularizer for neural networks in the NN+MMD model
of (Li, Swersky, and Zemel, 2014). The Variational Fair Autoencoder (VFAE) (Louizos et al., 2016)
optimizes the information bottleneck objective indirectly in the form of a Variational Autoencoder
(VAE) (Kingma and Welling, 2014) and uses MMD to boost the removal of unwanted factors
from the latent representation. The Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al.,
2005) has been used similarly in the HSIC-constrained VAE (HCV) (Lopez et al., 2018) to enforce
independence between the intermediate hidden embedding and the undesired variables. Moyer
et al. (2018) achieve invariance to s by augmenting the information bottleneck objective with
the mutual information between the latent representation and s, and optimizing its variational
bound (Conditional Variational Information Bottleneck or CVIB). Such methods are expected to
more explicitly remove certain specic nuisance factors of data from the latent representation as
compared to the aforementioned unsupervised methods. A shortcoming of this approach is the
requirement of domain knowledge of possible nuisance factors and their variations, which is often
hard to find (Bengio, Courville, and Vincent, 2013). Additionally, this solution applies only to
cases where annotated data is available for each nuisance factor, such as labeled information about
the lighting condition of each image in the face recognition example, which is often not the case.
Methods that use annotations of undesired factors during training are well-suited for inducing
invariance to biasing factors of data, which are correlated with the prediction target y but are unfair
to under-represented groups within the training set, e.g., age, gender, race, etc. in historical income
data. This is because the correlation of biasing factors with the prediction target makes it impossible
for unsupervised invariance methods to automatically remove them from the latent representation,
and external information about these variables is, hence, required. The aforementioned supervised
methods of invariance have been shown to be effective at eliminating biasing information from latent
representations of DNNs, making these models more unbiased and fair.
In the following sections, we describe prominent methods for invariant representation learning,
classied as those that do not require annotations of undesired factors versus those that do. We
refer to the former category of frameworks as unsupervised methods of invariance and the latter
as supervised methods of the same.
2.1 Unsupervised Methods of Invariance
2.1.1 The Information Bottleneck Method
The Information Bottleneck (IB) method (Tishby, Pereira, and Bialek, 1999) aims to generate
compact embeddings z from data that are sufficient for predicting y from x (Achille and Soatto,
2018a). The optimization objective of IB can be written as:

\max \; I(z : y) \quad \text{s.t.} \quad I(x : z) \leq I_c \qquad (2.1)
where the goal is to maximize the predictive capability of z while limiting the amount of information
it can encode through compression. It is intuitive to see that optimization of this objective defines
a trade-off between (1) the performance of the trained model at predicting y and (2) the amount of
information it encodes about x, which is also commonly referred to as the channel capacity or rate.
Optimization of this objective is, in theory, sufficient (Tishby, Pereira, and Bialek, 1999; Achille and
Soatto, 2018b) for eliminating nuisance factors of data from z because nuisances do not contribute
anything to the prediction of y and can be compressed away completely without degrading the
y-prediction performance. However, the IB objective is extremely difficult to optimize because
computing the mutual information terms accurately is computationally challenging (Alemi et al.,
2016) in general without imposing priors or analytical structure on the distributions of the involved
variables. Furthermore, despite the fact that IB is optimal in theory for invariance to nuisance, it
relies heavily on the existence of a powerful encoder that can perfectly disentangle factors of data
at an atomic level (Jaiswal et al., 2018d). This is required to make sure that predictive factors
are not lost from the latent representation because the encoder could not separate them from
nuisances in trying to achieve complete nuisance-invariance.
2.1.2 Information Dropout
Achille and Soatto (2018a) define an approximate form of the IB Lagrangian as:

\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{z \sim p_\vartheta(z|x^{(n)})}\!\left[ -\log p(y^{(n)}|z) \right] + \beta\, \mathrm{KL}\!\left( p_\vartheta(z|x^{(n)}) \,\big\|\, p_\vartheta(z) \right) \qquad (2.2)
where the first term is the standard cross-entropy loss used for training NNs and KL stands for
the Kullback-Leibler divergence. The KL term in this Lagrangian seeks to push the distribution
p_\vartheta(z|x^{(n)}) close to p_\vartheta(z), which entails loss of information about x from z, or in other words, the
compression of z.
Invariance to nuisance can be induced by training DNNs with the objective defined in
Equation 2.2 as long as there is a way to compute the KL term. In order to achieve this, Achille and
Soatto (2018a) design networks to generate representations in the form of

\varepsilon \sim p_{\alpha(x)}(\varepsilon) \qquad (2.3)

z = f(x) \odot \varepsilon \qquad (2.4)

where \varepsilon is input-dependent noise sampled from the parametric distribution p_{\alpha(x)} with an analytical
form, e.g., log-uniform, log-normal, etc. The embedding z is calculated by multiplying elementwise a
deterministic mapping f(x), computed through an encoder network, with the \varepsilon noise. The method functionally
injects noise into the latent embedding and is intuitively a data-dependent generalization of
dropout (Srivastava et al., 2014), where standard dropout translates to injecting independent
Bernoulli noise.
The parameters of the noise-distribution are learned with the rest of the network during training.
The authors explore multiple noise distributions and derive corresponding KL regularization terms
in order to optimize the training objective defined in Equation 2.2.
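To make the mechanism concrete, the following is a minimal PyTorch sketch of an Information Dropout layer, assuming log-normal noise and a simplified form of the KL penalty (for a log-uniform prior, the KL term reduces, up to constants, to -log α(x)); the layer sizes and the sigmoid parameterization of α(x) are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    """Sketch of an Information Dropout layer: z = f(x) * eps, eps ~ log-normal."""
    def __init__(self, dim_in, dim_z, max_alpha=0.7):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim_in, dim_z), nn.ReLU())  # deterministic mapping f(x)
        self.alpha_net = nn.Linear(dim_in, dim_z)                    # input-dependent noise scale alpha(x)
        self.max_alpha = max_alpha

    def forward(self, x):
        fx = self.f(x)
        # alpha(x) parameterizes log-normal noise: eps = exp(alpha * n), n ~ N(0, I)
        alpha = self.max_alpha * torch.sigmoid(self.alpha_net(x))
        eps = torch.exp(alpha * torch.randn_like(fx))
        z = fx * eps  # Eq. 2.4: elementwise product of f(x) with the noise
        # Approximate KL penalty: minimizing -log(alpha) encourages more noise,
        # i.e., more compression of z (the second term of Eq. 2.2)
        kl = -torch.log(alpha + 1e-8).sum(dim=1).mean()
        return z, kl
```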
2.1.3 Deep Variational Information Bottleneck
Alemi et al. (2016) derive a variational approximation to the IB objective defined in Equation 2.1
that does not face the same computational challenges as the vanilla IB. Training DNNs with this
variational approximation (abbreviated as VIB) has a similar effect on the models as IB such that
the latent representation is compressed and nuisance information is lost. The authors seek to
maximize the IB Lagrangian:

J = I(z : y) - \beta\, I(x : z) \qquad (2.5)
where \beta defines the trade-off between performance on the y-prediction task and the compression of z.
The joint distribution p(x, y, z) in the standard IB setup factorizes as p(x, y, z) = p(z|x)\, p(y|x)\, p(x).
The authors use this alongside definitions and identities from information theory to derive
variational bounds on the mutual information terms in Equation 2.5 as:
I(z : y) \geq \int dx\, dy\, dz\; p(x)\, p(y|x)\, p(z|x) \log q(y|z) \qquad (2.6)

I(x : z) \leq \int dx\, dz\; p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)} \qquad (2.7)
where q(y|z) and r(z) denote variational approximations to p(y|z) and p(z), respectively. The
combined bound is then found as:
I(z : y) - \beta\, I(x : z) \geq \int dx\, dy\, dz\; p(x)\, p(y|x)\, p(z|x) \log q(y|z) - \beta \int dx\, dz\; p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)} \qquad (2.8)

= L \qquad (2.9)
where L denotes the variational lower bound that should be maximized for the VIB framework.
In practice, the authors suggest approximating the p(x) and p(y|x) terms empirically from the
distribution of the training data as:
p(x, y) \approx \frac{1}{N} \sum_{n=1}^{N} \delta_{x^{(n)}}(x)\, \delta_{y^{(n)}}(y) \qquad (2.10)
which allows approximating L in practice as:
L \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \int dz\; p(z|x^{(n)}) \log q(y^{(n)}|z) \;-\; \beta \int dz\; p(z|x^{(n)}) \log \frac{p(z|x^{(n)})}{r(z)} \right] \qquad (2.11)
where they choose p(z|x) as the Gaussian \mathcal{N}(z \,|\, f_\mu(x), f_\Sigma(x)), with the mean and covariance
parameters calculated using NNs (f_\mu, f_\Sigma) from x. Using the reparameterization trick of VAEs (Kingma
and Welling, 2014), the authors reformulate z = f(x, \varepsilon) with deterministic f and Gaussian \varepsilon to
derive the final training objective as:
J_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{\varepsilon \sim p(\varepsilon)}\!\left[ -\log q\!\left( y^{(n)} \,\middle|\, f(x^{(n)}, \varepsilon) \right) \right] + \beta\, \mathrm{KL}\!\left( p(z|x^{(n)}) \,\big\|\, r(z) \right) \qquad (2.12)
which is optimized through backpropagation.
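As a concrete illustration, here is a minimal PyTorch sketch of the VIB training loss of Equation 2.12, assuming a diagonal-Gaussian encoder and a standard-normal variational marginal r(z); the single-layer networks and the value of β are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIB(nn.Module):
    """Sketch of Deep VIB: Gaussian encoder p(z|x), classifier q(y|z), r(z) = N(0, I)."""
    def __init__(self, dim_in, dim_z, n_classes, beta=1e-3):
        super().__init__()
        self.enc = nn.Linear(dim_in, 2 * dim_z)  # outputs (mu, log-variance) of p(z|x)
        self.cls = nn.Linear(dim_z, n_classes)   # variational classifier q(y|z)
        self.beta = beta

    def forward(self, x, y):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization z = f(x, eps)
        ce = F.cross_entropy(self.cls(z), y)                     # -E[log q(y | f(x, eps))]
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), averaged over the batch
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
        return ce + self.beta * kl
```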
2.2 Supervised Methods of Invariance
2.2.1 Training for Statistical Parity
Zemel et al. (2013) learn invariant representations of data in the form of probabilistic mappings
of data points to a set of prototypes, which are randomly sampled data points. They solve an
optimization problem that aims to achieve statistical parity in these mappings. The idea here
is that the probability that a random sample from any of the s-classes maps to a particular
prototype should be the same. This brings about the loss of all s-related information from the latent
representation. Denoting D-dimensional data points with x^{(n)} and a set of K prototypes as
\{v^{(k)}\}_{k=1}^{K}, the mapping of x^{(n)} to v^{(k)} is defined through distances d between x^{(n)} and v^{(k)} as:
d(x^{(n)}, v^{(k)}; \alpha) = \sum_{i=1}^{D} \alpha_i \big( x_i^{(n)} - v_i^{(k)} \big)^2 \qquad (2.13)

M_{n,k} = P(Z = k \,|\, x^{(n)}) = \frac{\exp\big( -d(x^{(n)}, v^{(k)}) \big)}{\sum_{j=1}^{K} \exp\big( -d(x^{(n)}, v^{(j)}) \big)} \qquad (2.14)
where the parameter \alpha_i denotes the importance of the i-th feature and is also learned through
optimization. Statistical parity is then sought to achieve:

M_k^{s=1} = M_k^{s=0}, \;\; \forall k \qquad (2.15)
where

M_k^{s} = \mathbb{E}_{x \in X^{s}}\big[ P(Z = k \,|\, x) \big] = \frac{1}{|X_{\mathrm{train}}^{s}|} \sum_{x^{(n)} \in X_{\mathrm{train}}^{s}} M_{n,k} \qquad (2.16)
where s = 1 and s = 0 denote s-classes in a binary setting. Invariant representation learning is
brought about by optimizing the following weighted loss function:
L = A_z L_z + A_x L_x + A_y L_y \qquad (2.17)
where the A terms denote the corresponding loss weights, which are hyperparameters. Each of the
loss terms is defined as:
L_z = \sum_{k=1}^{K} \big| M_k^{s=1} - M_k^{s=0} \big| \qquad (2.18)

L_x = \sum_{n=1}^{N} \big\| x^{(n)} - \hat{x}^{(n)} \big\|_2^2 \qquad (2.19)

L_y = \sum_{n=1}^{N} \Big[ -y^{(n)} \log \hat{y}^{(n)} - \big(1 - y^{(n)}\big) \log\big(1 - \hat{y}^{(n)}\big) \Big] \qquad (2.20)
where L_z denotes the loss for statistical parity, L_x denotes a reconstruction loss between the real x^{(n)}
and \hat{x}^{(n)} reconstructed from z, and L_y denotes the binary cross-entropy loss of predicting y. Here
the reconstructions \hat{x}^{(n)} and predictions \hat{y}^{(n)} are defined as:
\hat{x}^{(n)} = \sum_{k=1}^{K} M_{n,k}\, v^{(k)} \qquad (2.21)

\hat{y}^{(n)} = \sum_{k=1}^{K} M_{n,k}\, w^{(k)} \qquad (2.22)
where v^{(k)} denotes the same prototypes as above but w^{(k)} denotes parameters for learning to predict
y. In sum, the parameters v^{(k)}, w^{(k)}, and \alpha_i are learned from data by solving the optimization
problem defined in Equation 2.17.
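The pieces above fit together compactly; the following is a minimal PyTorch sketch of the prototype mapping (Equations 2.13-2.14) and the statistical parity loss L_z (Equations 2.16 and 2.18) for a binary s. Tensor shapes and names are illustrative assumptions, not the authors' implementation.

```python
import torch

def prototype_mapping(x, v, alpha):
    # Eq. 2.13: feature-weighted squared distances d(x, v; alpha), shape (N, K)
    d = (alpha * (x.unsqueeze(1) - v.unsqueeze(0)) ** 2).sum(dim=2)
    # Eq. 2.14: M[n, k] = P(Z = k | x_n) via a softmax over negative distances
    return torch.softmax(-d, dim=1)

def parity_loss(M, s):
    # Eq. 2.16: mean mapping probability per prototype within each s-group
    M1 = M[s == 1].mean(dim=0)
    M0 = M[s == 0].mean(dim=0)
    return (M1 - M0).abs().sum()  # Eq. 2.18: L_z

# Usage sketch: x is (N, D) data, v is (K, D) prototypes, alpha is a (D,) weight
# vector, s is an (N,) binary tensor:
#   L_z = parity_loss(prototype_mapping(x, v, alpha), s)
```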
2.2.2 Regularization with Maximum Mean Discrepancy
Maximum Mean Discrepancy (MMD) (Gretton et al., 2007) is a distance measure between two
distributions X_p \sim P and X_q \sim Q, which is defined as:

\mathrm{MMD}(X_p, X_q) = \Big\| \frac{1}{N} \sum_{n=1}^{N} \phi\big(X_p^{(n)}\big) - \frac{1}{M} \sum_{m=1}^{M} \phi\big(X_q^{(m)}\big) \Big\|^2 \qquad (2.23)

= \frac{1}{N^2} \sum_{n=1}^{N} \sum_{n'=1}^{N} \phi\big(X_p^{(n)}\big)^{T} \phi\big(X_p^{(n')}\big) + \frac{1}{M^2} \sum_{m=1}^{M} \sum_{m'=1}^{M} \phi\big(X_q^{(m)}\big)^{T} \phi\big(X_q^{(m')}\big) - \frac{2}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} \phi\big(X_p^{(n)}\big)^{T} \phi\big(X_q^{(m)}\big) \qquad (2.24)
where \phi(\cdot) denotes a feature expansion function. The kernel trick is applied to each product term
in Equation 2.24 to use an implicit feature space. Li et al. (2014) use MMD as a regularizer for
training neural networks to be invariant to undesired s. In order to achieve this, they define the
following regularization with respect to s:
L_{\mathrm{MMD}} = \sum_{s=1}^{S} \Big\| \frac{1}{N_s} \sum_{i:\, s_i = s} \phi\big(z^{(i)}\big) - \frac{1}{N} \sum_{n=1}^{N} \phi\big(z^{(n)}\big) \Big\|^2 \qquad (2.25)
for undesired s with S distinct classes. Thus, the L_{\mathrm{MMD}} regularizer is the sum of MMD between
the distribution of each s-class and the overall data distribution. Li et al. (2014) use the Gaussian kernel
in their experiments for calculating the L_{\mathrm{MMD}} loss.
2.2.3 Variational Fair Autoencoder
Louizos et al. (2016) present a variant of the VAE (Kingma and Welling, 2014) that incorporates
undesired factors of data as priors in the generative process to learn representations invariant to
them. They term this model the Variational Fair Autoencoder (VFAE). The authors start with
defining the generative process:

s \sim p(s); \quad z \sim p(z); \quad x \sim p_\vartheta(x \,|\, z, s) \qquad (2.26)
where \vartheta denotes the parameters of the generator model. The representations z in this graphical
model are marginally independent of s, which introduces the idea of inferring s-invariant z
representations through the posterior p(z \,|\, x, s). Following the VAE training algorithm (Kingma
and Welling, 2014), the generative model can be trained in the form of a decoder network along
with a variational Gaussian posterior q_\varphi(z \,|\, x, s) as an encoder network, with an isotropic Gaussian
prior.
The graphical model in Equation 2.26 is modified to include the prediction target y as a
categorical variable in order to develop the VFAE graphical model as:

s \sim p(s); \quad y \sim \mathrm{Cat}(y); \quad z_2 \sim p(z_2); \quad z_1 \sim p_\vartheta(z_1 \,|\, z_2, y); \quad x \sim p_\vartheta(x \,|\, z_1, s) \qquad (2.27)
which translates to two VAEs: (1) the encoder-decoder pair \big( p_\vartheta(x \,|\, z_1, s),\; q_\varphi(z_1 \,|\, x, s) \big) for x,
and (2) \big( p_\vartheta(z_1 \,|\, z_2, y),\; q_\varphi(z_2 \,|\, z_1, y) \big) for z_1. The encoders and decoders are instantiated as neural
networks. An additional NN module is added to the architecture for predicting y from z_1. The
parameters of the complete architecture are learned jointly through a modified version of the VAE
training algorithm (Kingma and Welling, 2014) with some relaxations (Louizos et al., 2016). The
training objective is explicitly augmented with a loss-term for predicting y from z_1.
The z_1 embeddings learned in this fashion do not lose all information about the undesired s in
practice (Louizos et al., 2016). Hence, the authors add the L_{\mathrm{MMD}} regularizer from Equation 2.25
to the training objective of VFAE to boost the removal of s-related information from the z_1
representation.
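The wiring of the two stacked VAEs can be sketched as below in PyTorch, assuming s and y are provided as one-hot float tensors; the single-linear-layer modules are illustrative stand-ins for the authors' networks, and the ELBO terms and the L_MMD regularizer are omitted for brevity.

```python
import torch
import torch.nn as nn

def reparam(mu, logvar):
    # Standard VAE reparameterization: z = mu + sigma * eps
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

class VFAE(nn.Module):
    """Sketch of the VFAE graphical model (Eq. 2.27): two stacked VAEs plus a y-predictor."""
    def __init__(self, dx, ds, dy, dz=64):
        super().__init__()
        self.enc1 = nn.Linear(dx + ds, 2 * dz)  # q(z1 | x, s)
        self.enc2 = nn.Linear(dz + dy, 2 * dz)  # q(z2 | z1, y)
        self.dec1 = nn.Linear(dz + dy, 2 * dz)  # p(z1 | z2, y)
        self.dec2 = nn.Linear(dz + ds, dx)      # p(x | z1, s)
        self.pred = nn.Linear(dz, dy)           # auxiliary predictor of y from z1

    def forward(self, x, s, y):
        z1 = reparam(*self.enc1(torch.cat([x, s], dim=1)).chunk(2, dim=1))
        z2 = reparam(*self.enc2(torch.cat([z1, y], dim=1)).chunk(2, dim=1))
        z1_prior = self.dec1(torch.cat([z2, y], dim=1))  # parameters of p(z1 | z2, y)
        x_hat = self.dec2(torch.cat([z1, s], dim=1))     # reconstruction of x
        return x_hat, self.pred(z1), z1_prior
```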
2.2.4 HSIC-constrained Variational Autoencoder
The Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) is a penalty constructed out
of a kernel generalization of covariance between two variables. Kernel covariance is used to measure
the degree of “dependent-ness” between two variables, and training with a penalty constructed out
of the Hilbert-Schmidt operator norm of the kernel covariance pushes variables to be mutually
independent (Lopez et al., 2018).
The HSIC estimator (Lopez et al., 2018) is defined for variables u and v with kernels k and
h, respectively. Assuming that the kernels are both universal, the following is an estimator of a
two-component HSIC:
\mathrm{HSIC}\!\left( \{(u_n, v_n)\}_{n=1}^{N} \right) = \frac{1}{N^2} \sum_{i,j} k(u_i, u_j)\, h(v_i, v_j) + \frac{1}{N^4} \sum_{i,j,k,l} k(u_i, u_j)\, h(v_k, v_l) - \frac{2}{N^3} \sum_{i,j,k} k(u_i, u_j)\, h(v_i, v_k) \qquad (2.28)
This is differentiable and can be used directly as a penalty on the dependence between variables.
Lopez et al. (2018) use the HSIC penalty in place of L_{\mathrm{MMD}} in the VFAE framework described in
Section 2.2.3. They show that HSIC can be used effectively instead of MMD to induce invariance
to continuous s variables.
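A direct transcription of the estimator in Equation 2.28 in its Gram-matrix form is sketched below, assuming Gaussian kernels for both variables; the bandwidth is an illustrative choice.

```python
import torch

def hsic(u, v, sigma=1.0):
    # Gram matrices k(u_i, u_j) and h(v_i, v_j) with Gaussian kernels
    K = torch.exp(-torch.cdist(u, u) ** 2 / (2 * sigma ** 2))
    H = torch.exp(-torch.cdist(v, v) ** 2 / (2 * sigma ** 2))
    N = u.shape[0]
    # The three terms of Eq. 2.28; (K @ H).sum() computes the cross term
    return ((K * H).sum() / N ** 2
            + K.sum() * H.sum() / N ** 4
            - 2 * (K @ H).sum() / N ** 3)
```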
2.2.5 Controllable Adversarial Invariance
Xie et al. (2017) adopt the Domain Adversarial Neural Network (DANN) (Ganin et al., 2016) framework for learning representations of data invariant to undesired $s$. They start by decomposing a supervised neural network into an encoder network that maps data samples $x$ to embeddings $z$ and a predictor network that infers the target $y$ from $z$. They then augment the architecture with a discriminator network that tries to predict the undesired $s$ from $z$, but precede the augmentation with a gradient reversal layer (Ganin et al., 2016) following the DANN architecture. The gradient reversal layer performs an identity operation during the forward pass, i.e., the output of the layer is the same as its input, but flips the sign of the gradient during the backward pass. Thus, the goal of this layer is to push the optimization of the layers before it in the direction opposite to the gradient of the objective of the layers after it. Hence, optimization of the complete architecture seeks to generate $z$ such that the discriminator is unable to predict $s$ from them.
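A gradient reversal layer is typically implemented as a custom autograd function; the sketch below is a common PyTorch formulation (the names and the optional scaling factor lambd are illustrative, not tied to a particular codebase).

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and optionally scale) the gradient on the backward pass.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: s_logits = discriminator(grad_reverse(encoder(x)))
```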
2.2.6 Conditional Variational Information Bottleneck
The Information Bottleneck method described in Section 2.1.1 provides an unsupervised framework
for eliminating nuisance factors of data from the latent embedding z through bringing about the
compression of the latent space. In light of the problems associated with vanilla IB (as discussed
in Section 2.1.1) and its limitation to nuisance factors of data, Moyer et al. (2018) augment the
IB Lagrangian in Equation 2.5 with the mutual information I(z : s) for known undesired s to
formulate an s-conditional variant of IB as:
J =I(z :y)I(x :z)I(z :s) (2.29)
I(z :y)I(x :z)E[logp(xjz;s)] (2.30)
19
where the bound in Equation 2.30 was derived by Moyer et al. (2018). The authors build upon the
VIB method described in Section 2.1.3 to derive a variational bound on Equation 2.30 as:
I(z :y)I(x :z)E[logp(xjz;s)]
E
(x;s)
h
E
(z;y)
logp(yjz) ( +)KL(q(zjx)kq(z))
+E
z
logp(xjz;s)
i
(2.31)
which is optimized to train s-invariant neural networks. The complete model, as indicated in
Equation 2.31, consists of an encoder network q(zjx), a conditional decoder network p(xjz;s), and
a predictor network p(yjz).
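The corresponding training loss is straightforward to assemble from standard VAE components. Below is a minimal sketch of (the negative of) the bound in Equation 2.31 for a Gaussian encoder with a standard normal marginal; the coefficient names beta and lam and the Gaussian likelihood are assumptions made for illustration.

```python
import torch.nn.functional as F

def cvib_loss(y_logits, y, x_recon, x, z_mu, z_logvar, beta=1e-2, lam=1e-3):
    # -E[log p(y|z)]: prediction term.
    pred = F.cross_entropy(y_logits, y)
    # KL(q(z|x) || q(z)) with q(z) approximated by a standard normal.
    kl = -0.5 * (1 + z_logvar - z_mu ** 2 - z_logvar.exp()).sum(dim=1).mean()
    # -E[log p(x|z, s)]: conditional reconstruction term (Gaussian likelihood
    # up to a constant), where x_recon is decoded from z and s.
    recon = F.mse_loss(x_recon, x)
    return pred + (beta + lam) * kl + lam * recon
```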
Part I
METHODS
Chapter 3
Unified Adversarial Invariance
3.1 Introduction
We present a unified framework for invariance induction that can be used for robustness to nuisances without their labels and, additionally with annotations, for independence from biasing factors (Jaiswal et al., 2018d; 2019d). The framework promotes invariance to nuisances by separating the underlying factors of data $x$ into two latent embeddings: $z_1$, which contains all the information required for predicting the target $y$, and $z_2$, which contains information irrelevant to the prediction task. While $z_1$ is used for predicting $y$, a noisy version of $z_1$, denoted as $\tilde{z}_1$, and $z_2$ are used to reconstruct $x$. This creates a competitive scenario where the reconstruction module tries to pull information into $z_2$ (because $\tilde{z}_1$ is unreliable) while the prediction module tries to pull information into $z_1$. The training objective is augmented with a disentanglement loss that penalizes the model for overlapping information between $z_1$ and $z_2$, further boosting the competition between the prediction and reconstruction tasks. In order to remove known biasing factors of data, a proxy loss term for their mutual information with $z_1$ is added to the training objective, creating a framework that learns invariance to both nuisance and biasing factors $s$. We present an adversarial instantiation of this generalized formulation of the framework, where disentanglement is achieved between $z_1$ and $z_2$ in a novel way through two adversarial disentanglers (one that aims to predict $z_2$ from $z_1$ and another that does the inverse), and invariance to biasing $s$ is achieved through an adversarial $s$-discriminator that aims to predict $s$ from $z_1$. The parameters of the model are learned through adversarial training between (a) the encoder, the predictor, and the decoder, and (b) the disentanglers (for both nuisances and biases) and the $s$-discriminator (for biases).

The framework makes no assumptions about the data, so it can be applied to any prediction task without loss of generality, be it binary/multi-class classification or regression. We provide results on five tasks involving a diverse collection of datasets: (1) invariance to inherent nuisance factors, (2) effective use of synthetic data augmentation for learning invariance, (3) learning invariance to arbitrary nuisance factors by leveraging Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), (4) domain adaptation, and (5) invariance to biasing factors for fair representation learning. Our framework outperforms existing approaches on all of these tasks. This is especially notable for invariance to nuisances in tasks (1) and (2), where previous state-of-the-art works incorporate $s$-labels whereas our model is trained without these annotations.
3.2 Unified Adversarial Invariance (UnifAI)
We present a generalized framework for the induction of invariance to undesired (both nuisance and biasing) factors $s$ of data, where $s$ information is not necessary for the exclusion of nuisances but is employed for making $y$-predictions independent of biasing factors. The framework brings about invariance to nuisances by disentangling information required for predicting $y$ from other unrelated information contained in $x$ through the incorporation of data reconstruction as a competing task for the primary prediction task. This is achieved by learning a split representation of data as $z = [z_1\; z_2]$, such that information essential for the prediction task is pulled into $z_1$ while all other information about $x$ migrates to $z_2$. In order to further learn invariance to known biasing $s$, the training objective of the framework penalizes the model if the encoding $z_1$ contains any information about these $s$. We present adversarial instantiations of this framework: Unsupervised Adversarial
Table 3.1: Key Concepts and Framework Components

Term             Meaning
$x$              Data sample
$y$              Prediction target
$z_1$            Encoding of information desired for predicting $y$
$z_2$            Encoding of information not desired for predicting $y$
$\tilde{z}_1$    Noisy version of $z_1$ used with $z_2$ for reconstructing $x$
$s$              Undesired information not to be used for predicting $y$
$f_i$            An atomic factor of data
$F$              Set of underlying atomic factors of data, $F = \{f_i\}$
$F_y$            Subset of $F$ that is informative of $y$
$F_{\bar{y}}$    Subset of $F$ that is not informative of $y$
$F_b$            Subset of $F_y$ that is biased
$Enc$            Encoder that embeds $x$ into $z = [z_1\; z_2]$
$Pred$           Predictor that infers $y$ from $z_1$
$\psi$           Noisy transformer that converts $z_1$ to $\tilde{z}_1$, e.g., Dropout
$Dec$            Decoder that reconstructs $x$ from $[\tilde{z}_1\; z_2]$
$Dis_1$          Adversarial disentangler that tries to predict $z_2$ from $z_1$
$Dis_2$          Adversarial disentangler that tries to predict $z_1$ from $z_2$
$D_s$            Adversarial $s$-discriminator that tries to predict $s$ from $z_1$
Invariance (UAI), which treats disentanglement of $z_1$ and $z_2$ as an adversarial objective with respect to the competitive prediction and reconstruction tasks, and Unified Adversarial Invariance (UnifAI), which additionally removes biasing $s$ from $z_1$ through another adversarial objective.
3.2.1 Unified Invariance Induction Framework
Data samples ($x$) can be abstractly decomposed into a set of underlying atomic factors of variation $F = \{f_i\}$. This set can be as simple as a collection of numbers denoting the position of a point in space or as complicated as information pertaining to various facial attributes that combine non-trivially to form the image of someone's face. Modeling the interactions between factors of data is an open problem. However, supervised learning of the mapping of $x$ to the target ($y$) involves a relatively narrower (yet challenging) problem of finding those factors of variation ($F_y$) that contain all the information required for predicting $y$ and discarding all the others ($F_{\bar{y}}$). Thus, $F_y$ and $F_{\bar{y}}$ form a partition of $F$, where we are more interested in the former than the latter. Since $y$ is independent of $F_{\bar{y}}$, i.e., $y \perp F_{\bar{y}}$, we get $p(y \mid x) = p(y \mid F_y)$. Estimating $p(y \mid x)$ as $q(y \mid F_y)$ from data is beneficial because the nuisance factors (i.e., $f_i \perp y$), which comprise $F_{\bar{y}}$, are never presented to the estimator, thus avoiding inaccurate learning of associations between nuisance factors and $y$.
We incorporate the idea of splitting $F$ into $F_y$ and $F_{\bar{y}}$ in our framework in a more relaxed sense as learning a split latent representation of $x$ in the form of $z = [z_1\; z_2]$. While $z_1$ aims to capture all the information relevant for predicting the target ($F_y$), $z_2$ contains nuisance factors ($F_{\bar{y}}$). Once trained, the model can be used to infer $z_1$ from $x$ followed by $y$ from $z_1$. Learning such a representation of data requires careful separation of the information of $x$ into two independent latent embeddings. We bring about this information separation in our framework through competition between the task of predicting $y$ and that of reconstructing $x$, coupled with enforced disentanglement between the two representations. This competition is induced by requiring the model to predict $y$ from $z_1$ while being able to reconstruct $x$ from $z_2$ along with a noisy version of $z_1$. Thus, the prediction task is favored if $z_1$ encodes everything in $x$ that is informative of $y$, while reconstruction benefits from embedding all information of $x$ into $z_2$; the disentanglement constraint, however, forces $z_1$ and $z_2$ to contain independent information.
More formally, our general framework for invariance to nuisances consists of four core modules: (1) an encoder $Enc$ that embeds $x$ into $z = [z_1\; z_2]$, (2) a predictor $Pred$ that infers $y$ from $z_1$, (3) a noisy-transformer $\psi$ that converts $z_1$ into its noisy version $\tilde{z}_1$, and (4) a decoder $Dec$ that reconstructs $x$ from $\tilde{z}_1$ and $z_2$. Additionally, the training objective is equipped with a loss that enforces disentanglement between $Enc(x)_1 = z_1$ and $Enc(x)_2 = z_2$. Figure 3.1 shows our generalized framework. The training objective for this system can be written as Equation 3.1:
$$L_n = \alpha\, L_{pred}(y, Pred(z_1)) + \beta\, L_{dec}(x, Dec(\psi(z_1), z_2)) + \gamma\, L_{dis}(z_1, z_2)$$
$$= \alpha\, L_{pred}(y, Pred(Enc(x)_1)) + \beta\, L_{dec}(x, Dec(\psi(Enc(x)_1), Enc(x)_2)) + \gamma\, L_{dis}(Enc(x)) \qquad (3.1)$$
where $\alpha$, $\beta$, and $\gamma$ are the importance-weights for the corresponding losses. As evident from the formal objective, the predictor and the decoder are designed to enter into a competition, where $Pred$ tries to pull information relevant to $y$ into $z_1$ while $Dec$ tries to extract all the information about $x$ into $z_2$. This is made possible by $\psi$, which makes $\tilde{z}_1$ an unreliable source of information for reconstructing $x$. Moreover, a version of this framework without $\psi$ can converge to a degenerate solution where $z_1$ contains all the information about $x$ and $z_2$ contains nothing (noise), because the absence of $\psi$ allows $z_1$ to be readily available to $Dec$. The competitive pulling of information into $z_1$ and $z_2$ induces information separation: $z_1$ tends to contain more information relevant for predicting $y$ and $z_2$ more information irrelevant to the prediction task. However, this competition is not sufficient to completely partition the information of $x$ into $z_1$ and $z_2$. Without the disentanglement term ($L_{dis}$) in the objective, $z_1$ and $z_2$ can contain redundant information such that $z_2$ has information relevant to $y$ and, more importantly, $z_1$ contains nuisance factors. The disentanglement term in the training objective encourages the desired clean partition. Thus, essential factors required for predicting $y$ concentrate in $z_1$ and all other factors migrate to $z_2$.
While nuisance factors $F_{\bar{y}}$ can be separated from those essential for $y$-prediction using the $L_n$ objective in Equation 3.1, biasing factors cannot. This is because biasing factors are correlated with $y$ and, hence, form a subset $F_b$ of $F_y$, i.e., $F_b \subset F_y$. The $L_n$ objective has no way to determine whether an essential factor is biased. In general, this is true for fairness settings. External information about biasing $s$ (encompassing $F_b$) and training mechanisms that use this information to eliminate $s$ from the latent representation are necessary for making fair $y$-predictions, even if
Figure 3.1: Unsupervised Invariance Induction Framework
Figure 3.2: The Unified Adversarial Invariance (UnifAI) model. $Enc$ encodes $x$ into $z_1$ and $z_2$. $Pred$ uses $z_1$ to predict $y$. $Dec$ uses $\psi(z_1)$ and $z_2$ to reconstruct $x$. $\psi$ is implemented as dropout. Disentanglement is enforced through the adversarial modules $Dis_1$ and $Dis_2$. Biasing factors are eliminated from $z_1$ through $D_s$.
it entails relatively poor performance at the task of predicting $y$. In order to achieve this, we augment $L_n$ with a loss term $L_s$ that penalizes $z_1$ for containing $s$ information. The $L_s$ loss can be abstractly viewed as a proxy for the mutual information $I(z_1 : s)$. The final training objective is as shown in Equation 3.2:

$$L = L_n + \delta\, L_s(z_1)$$
$$= \alpha\, L_{pred}(y, Pred(Enc(x)_1)) + \beta\, L_{dec}(x, Dec(\psi(Enc(x)_1), Enc(x)_2)) + \gamma\, L_{dis}(Enc(x)) + \delta\, L_s(Enc(x)_1) \qquad (3.2)$$
where $\delta$ denotes the relative importance of $L_s$. The effect of $L_s$ on the training objective is very intuitive. It forces unwanted $s$ out of $z_1$, such that $z_1$ encodes $F_y \setminus F_b$. While $L_s$ is in direct conflict with $L_{pred}$ for biasing $s$, $L_{dec}$ and $L_{dis}$ are not. The decoder can still receive the $s$ forced out of $z_1$ through $z_2$, which encodes $F_{\bar{y}} \cup F_b$, and use them for reconstructing $x$. The $L_{dis}$ loss is unaffected because it only enforces disentanglement between $z_1$ and $z_2$, and removing $s$ from $z_1$ does not violate that.
3.2.2 Adversarial Model Design and Optimization
While there are numerous ways to implement the proposed invariance induction framework, e.g., using mutual information and variational tools similar to (Moyer et al., 2018), we adopt an adversarial design, introducing a novel approach to disentanglement in the process. $Enc$, $Pred$, and $Dec$ are modeled as neural networks. $\psi$ can be modeled as a parametric noisy-channel, where the parameters of $\psi$ can also be learned during training. However, we model $\psi$ as multiplicative Bernoulli noise using dropout (Srivastava et al., 2014) since it provides a straightforward method for the noisy transformation of $z_1$ into $\tilde{z}_1$ without complicating the training process.
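As a point of reference, the following is a minimal PyTorch sketch of the core modules, assuming fully-connected networks and a fixed dropout rate; the dimensions, depths, and names are illustrative. The tanh output of the encoder anticipates the bounded-activation constraint discussed later in this section.

```python
import torch
import torch.nn as nn

class SplitEncoder(nn.Module):
    # Enc: maps x to the split embedding z = [z1, z2]; tanh bounds the components.
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim), nn.Tanh())
        self.z_dim = z_dim

    def forward(self, x):
        z = self.net(x)
        return z[:, :self.z_dim], z[:, self.z_dim:]  # z1, z2

x_dim, z_dim, n_classes = 784, 64, 10
enc = SplitEncoder(x_dim, z_dim)
pred = nn.Linear(z_dim, n_classes)            # Pred: infers y from z1
psi = nn.Dropout(p=0.5)                       # psi: multiplicative Bernoulli noise
dec = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                    nn.Linear(256, x_dim))    # Dec: x from [psi(z1), z2]

z1, z2 = enc(torch.randn(8, x_dim))
x_recon = dec(torch.cat([psi(z1), z2], dim=1))
```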
We augment these core modules with two adversarial disentanglers, $Dis_1$ and $Dis_2$. While $Dis_1$ aims to predict $z_2$ from $z_1$, $Dis_2$ aims to do the inverse. It would be impossible to predict either embedding from the other if they were truly independent. Hence, the objectives of the two disentanglers are in direct opposition to the desired disentanglement, forming the basis for adversarial minimax optimization. In comparison to the use of information theoretic measures like the mutual information $I(z_1 : z_2)$ (or a suitable proxy) for the loss $L_{dis}$, this approach to disentanglement does not require $z_1$ and $z_2$ to be stochastic, and does not assume prior distributions for the two embeddings.
Thus, $Enc$, $Pred$, and $Dec$ can be thought of as a composite model ($M_1$) that is pitted against another composite model ($M_2$) containing $Dis_1$ and $Dis_2$. This results in an adversarial instantiation of the framework for invariance to nuisance factors. In order to complete the adversarial model so that it allows the removal of known biasing $s$ from $z_1$, an $s$-discriminator $D_s$ is added to the model that aims to predict $s$ from $z_1$. Thus, the objective of $D_s$ is the opposite of the desired invariance to $s$, making it a natural component of $M_2$ for fairness settings.
Figure 3.2 shows our Unified Adversarial Invariance (UnifAI) model for invariance to nuisance as well as biasing factors. The composite model $M_1$ is represented by the color blue and $M_2$ by orange. The model is trained through backpropagation by playing a minimax game. The objective for invariance to nuisance factors is shown in Equation 3.3.
$$\min_{Enc, Pred, Dec}\;\; \max_{Dis_1, Dis_2}\; J_n, \;\text{where:}$$
$$J_n(Enc, Pred, Dec, Dis_1, Dis_2) = \alpha\, L_{pred}\big(y, Pred(z_1)\big) + \beta\, L_{dec}\big(x, Dec(\psi(z_1), z_2)\big) + \gamma\, \tilde{L}_{dis}(z_1, z_2)$$
$$= \alpha\, L_{pred}\big(y, Pred(Enc(x)_1)\big) + \beta\, L_{dec}\big(x, Dec(\psi(Enc(x)_1), Enc(x)_2)\big)$$
$$\quad + \gamma\, \tilde{L}_{dis_1}\big(Enc(x)_2, Dis_1(Enc(x)_1)\big) + \gamma\, \tilde{L}_{dis_2}\big(Enc(x)_1, Dis_2(Enc(x)_2)\big) \qquad (3.3)$$
Equation 3.4 describes the complete minimax objective for invariance to both nuisance and biasing factors of data:

$$\min_{Enc, Pred, Dec}\;\; \max_{Dis_1, Dis_2, D_s}\; J, \;\text{where:}$$
$$J(Enc, Pred, Dec, Dis_1, Dis_2, D_s) = J_n(Enc, Pred, Dec, Dis_1, Dis_2) + \delta\, \tilde{L}_s\big(s, D_s(z_1)\big)$$
$$= \alpha\, L_{pred}\big(y, Pred(Enc(x)_1)\big) + \beta\, L_{dec}\big(x, Dec(\psi(Enc(x)_1), Enc(x)_2)\big)$$
$$\quad + \gamma\, \tilde{L}_{dis_1}\big(Enc(x)_2, Dis_1(Enc(x)_1)\big) + \gamma\, \tilde{L}_{dis_2}\big(Enc(x)_1, Dis_2(Enc(x)_2)\big) + \delta\, \tilde{L}_s\big(s, D_s(Enc(x)_1)\big) \qquad (3.4)$$
We optimize the proposed adversarial model using a scheduled update scheme where we freeze the weights of one composite player model ($M_1$ or $M_2$) when we update the weights of the other. $M_2$ should ideally be trained to convergence before updating $M_1$ in each training epoch in order to backpropagate accurate and stable disentanglement-inducing and $s$-eliminating gradients to $Enc$. However, this is not scalable in practice. We instead update $M_1$ and $M_2$ with a frequency of $1 : k$. We found $k = 5$ to perform well in our experiments, but a larger $k$ might be required depending on the complexity of the prediction task, the unwanted variables, and the dataset in general. We use mean squared error for the disentanglement losses $\tilde{L}_{dis_1}$ and $\tilde{L}_{dis_2}$. The discriminative loss $\tilde{L}_s$ depends on the nature of the $s$-annotations, e.g., cross-entropy loss for categorical $s$.
Adversarial training with $Dis_1$, $Dis_2$, and $D_s$ necessitates the choice of appropriate adversarial targets, i.e., the targets that are used to calculate losses and gradients from the adversaries to update the encoder. More specifically, in the $M_2$ phase of the scheduled training, at a given iteration, the targets for calculating $\tilde{L}_{dis_1}$ and $\tilde{L}_{dis_2}$ are the true values of the vectors $z_2$ and $z_1$, respectively, calculated from $x$ at that iteration. On the other hand, in the $M_1$ phase, the targets for $\tilde{L}_{dis_1}$ and $\tilde{L}_{dis_2}$ are randomly sampled vectors. The intuition behind this choice of targets is straightforward: for truly disentangled $z_1$ and $z_2$, the best an adversary predicting one from the other can do is predict random noise, because $z_1 \perp z_2$ and their mutual information is zero. Hence, the encoder should be updated in a way that the best these disentanglers can do is predict random noise. We implement this by constraining the encoder to use the hyperbolic tangent activation in its final layer, thus limiting the components of $z_1$ and $z_2$ to $[-1, 1]$ (any other bounded activation function could be used), and sampling random vectors from a uniform distribution on $[-1, 1]$ as targets for the $M_1$ phase. Similarly, for biasing factors, the ground-truth $s$ is used as the target in the $M_2$ phase for $\tilde{L}_s$, while random $s$ are used as targets in the $M_1$ phase. For categorical $s$, this is implemented as a straightforward sampling of $s$ from the empirical categorical distribution of $s$ calculated from the training dataset.
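Putting the schedule and the adversarial targets together, the following sketch illustrates one training iteration. It reuses the module names from the earlier sketch and assumes disentanglers dis1 and dis2, a discriminator d_s, optimizers opt_m1 and opt_m2 covering the parameters of $M_1$ and $M_2$, respectively, and loss weights alpha, beta, gamma, delta; these names and the simplified control flow are assumptions, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def train_step(batch, step, k=5):
    # Scheduled minimax updates: k M2 steps for every one M1 step.
    x, y, s = batch
    z1, z2 = enc(x)
    if step % (k + 1) < k:
        # M2 phase: train the adversaries against frozen encoder outputs;
        # targets are the true z2, z1, and s computed at this iteration.
        loss = (F.mse_loss(dis1(z1.detach()), z2.detach())    # predict z2 from z1
                + F.mse_loss(dis2(z2.detach()), z1.detach())  # predict z1 from z2
                + F.cross_entropy(d_s(z1.detach()), s))       # predict s from z1
        opt_m2.zero_grad(); loss.backward(); opt_m2.step()
    else:
        # M1 phase: adversarial targets are random, since disentangled and
        # s-invariant embeddings should make the adversaries predict noise.
        u1 = torch.rand_like(z2) * 2 - 1            # uniform on [-1, 1]
        u2 = torch.rand_like(z1) * 2 - 1
        s_rand = s[torch.randperm(s.shape[0])]      # resample s empirically
        x_recon = dec(torch.cat([psi(z1), z2], dim=1))
        loss = (alpha * F.cross_entropy(pred(z1), y)
                + beta * F.mse_loss(x_recon, x)
                + gamma * (F.mse_loss(dis1(z1), u1) + F.mse_loss(dis2(z2), u2))
                + delta * F.cross_entropy(d_s(z1), s_rand))
        # Only opt_m1 steps here, so the adversaries' weights stay fixed.
        opt_m1.zero_grad(); loss.backward(); opt_m1.step()
```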
3.2.3 Invariant Predictions with the Trained Model
The only components of the proposed framework that are required for making predictions at test time are the encoder $Enc$ and the predictor $Pred$. Prediction is a simple forward pass of the graph $x \rightarrow z_1 \rightarrow y$. Thus, making predictions with a model trained in the proposed framework does not incur any computational overhead.
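At test time this reduces to two module calls; a minimal sketch reusing the assumed module names from the earlier snippets:

```python
import torch

@torch.no_grad()
def predict(x):
    # Inference uses only Enc and Pred: x -> z1 -> y.
    z1, _ = enc(x)
    return pred(z1).argmax(dim=1)
```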
3.3 Analysis
We analyze the relationship between the loss weights $\alpha$ and $\beta$, corresponding to the competing tasks of predicting $y$ and reconstructing $x$, respectively, in our generalized invariance induction framework. We then discuss the equilibrium of the minimax game in our adversarial instantiation for both nuisance and biasing factors. Finally, we use the results of these two analyses to provide a systematic way of tuning the loss weights $\alpha$, $\beta$, $\gamma$, and $\delta$. The following analyses are conducted assuming a model with infinite capacity, i.e., in a non-parametric limit.
Competition between prediction and reconstruction. The prediction and reconstruction tasks in our framework are designed to compete with each other for invariance to nuisance factors. Thus, the ratio $r = \alpha/\beta$ influences which task has higher priority in the objective shown in Equation 3.1. We analyze the effect of $r$ on the behavior of our framework at optimality, considering perfect disentanglement of $z_1$ and $z_2$. There are two asymptotic scenarios with respect to $r$: (1) $r \rightarrow \infty$ and (2) $r \rightarrow 0$. In case (1), our framework for invariance to nuisances (i.e., without $D_s$) reduces to a predictor model, where the reconstruction task is completely disregarded ($\alpha \gg \beta$). Only the branch $x \rightarrow z_1 \rightarrow y$ remains functional. Consequently, $z_1$ contains all $f \in F'$ at optimality, where $F_y \subseteq F' \subseteq F$. In contrast, case (2) reduces the framework to an autoencoder, where the prediction task is completely disregarded ($\beta \gg \alpha$), and only the branch $x \rightarrow z_2 \rightarrow x'$ remains functional because the other input to $Dec$, $\psi(z_1)$, is noisy. Thus, $z_2$ contains all $f \in F$ and $z_1$ contains nothing at optimality, under perfect disentanglement. In the transition from case (1) to case (2), by keeping $\alpha$ fixed and increasing $\beta$, the reconstruction loss starts contributing more to the overall objective, thus inducing more competition between the two tasks. As $\beta$ is gradually increased, $f \in (F' \setminus F_y) \subseteq F_{\bar{y}}$ migrate from $z_1$ to $z_2$ because $f \in F_{\bar{y}}$ are irrelevant to the prediction task but can improve reconstruction by being more readily available to $Dec$ through $z_2$ instead of $\psi(z_1)$. Beyond a point, however, further increasing $\beta$ is detrimental to the prediction task as the reconstruction task starts dominating the overall objective and pulling $f \in F_y$ from $z_1$ to $z_2$. Results in Section 3.4.6 show that this intuitive analysis is consistent with the observed behavior. In the case of known undesired $s$, the presence of $D_s$ in the unified framework pushes known $s$ out of $z_1$, thus favoring the reconstruction objective by forcing known $s$ to migrate to $z_2$. Thus, the analysis of the competition still holds intuitively for nuisance factors besides $s$.
Equilibrium analysis of the adversarial instantiation. The disentanglement and prediction objectives in our adversarial model design can simultaneously reach an optimum where $z_1$ contains $F_y$ and $z_2$ contains $F_{\bar{y}}$. Hence, the minimax objective in our method has a win-win equilibrium for invariance to nuisance factors. However, the training objective of $D_s$ for biasing $s$ is in direct opposition to the prediction task because such $s$ are correlated with $y$. This leads to a win-lose equilibrium for biasing factors, which is true for all methods of fair representation learning.
Selecting loss weights. Following the above analyses, any $\gamma$ that successfully disentangles $z_1$ and $z_2$ should be sufficient. We found $\gamma = 1$ to work well for the datasets on which we evaluated our model. On the other hand, if $\gamma$ is fixed, $\alpha$ and $\beta$ can be selected by starting with $\alpha \gg \beta$ and gradually increasing $\beta$ as long as the performance of the prediction task improves. The removal of biasing $s$, however, is controlled by the loss weight $\delta$ in Equation 3.4, which must be carefully tuned depending on the complexity of the dataset, the prediction task, and the biasing factors.
3.4 Empirical Evaluation
We provide empirical results on five tasks relevant to invariant feature learning for robustness to nuisances and fair predictions: (1) invariance to inherent nuisance factors, (2) effective use of synthetic data augmentation for learning invariance to specific nuisance factors, (3) learning invariance to arbitrary nuisance factors by leveraging Generative Adversarial Networks, (4) domain adaptation through learning invariance to "domain" information, and (5) fair representation learning. For experiments (1)-(4), we do not use nuisance annotations for learning invariance, i.e., we train the model without $D_s$. In contrast, the state-of-the-art methods use $s$-labels. We evaluate the performance of our model and prior works on two metrics: the accuracy of predicting $y$ from $z_1$ ($A_y$) and the accuracy of predicting $s$ from $z_1$ ($A_s$). While $A_y$ is calculated directly from the predictions of the trained models, $A_s$ is calculated using a two-layer neural network trained post hoc to predict $s$ from the latent embedding. The goal of the model is to achieve high $A_y$ in all cases but $A_s$ close to random chance for nuisance factors, and $A_s$ the same as the population share of the majority $s$-class for biasing factors in fairness settings. We train two baseline versions of our model for our ablation experiments: $B_0$, composed of $Enc$ and $Pred$, i.e., a single feed-forward network $x \rightarrow z \rightarrow y$, and $B_1$, which is the same as the composite model $M_1$, i.e., the proposed model trained without the adversarial components. $B_0$ is used to validate the phenomenon that invariance to nuisance by exclusion is a better approach than robustness through inclusion, whereas $B_1$ helps
Table 3.2: Results on Extended Yale-B dataset. High $A_y$ and low $A_s$ are desired.

Model                                        Accuracy of y    Accuracy of s
NN+MMD (Li, Swersky, and Zemel, 2014)        0.82             -
VFAE (Louizos et al., 2016)                  0.85             0.57
CAI (Xie et al., 2017)                       0.89             0.57
CVIB (Moyer et al., 2018)                    0.82             0.45
Ablation baseline B_0                        0.90             0.60
Ablation baseline B_1                        0.94             0.28
UAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.95             0.24
evaluate the importance of disentanglement. Hence, results of $B_0$ and $B_1$ are presented for tasks (1) and (2). Besides the results on the aforementioned tasks, we provide empirical insight into the competition between the prediction and reconstruction tasks in our framework, as discussed in Section 3.3, through the influence of the ratio $\alpha/\beta$ on $A_y$.
3.4.1 Invariance to inherent nuisance factors
We provide results of our framework at the task of learning invariance to inherent nuisance factors on two datasets: Extended Yale-B (Georghiades, Belhumeur, and Kriegman, 2001), which has been used by previous works (Li, Swersky, and Zemel, 2014; Louizos et al., 2016; Xie et al., 2017), and Chairs (Aubry et al., 2014), which we propose as a new dataset for this task. We compare our framework to existing state-of-the-art invariance methods: CAI (Xie et al., 2017), VFAE (Louizos et al., 2016), NN+MMD (Li, Swersky, and Zemel, 2014), and CVIB (Moyer et al., 2018).

Extended Yale-B This dataset contains face-images of 38 subjects under various lighting conditions. The target $y$ is the subject identity, whereas the inherent nuisance factor $s$ is the lighting condition. We use the prior works' version of the dataset, which has lighting conditions classified into five groups (front, upper-left, upper-right, lower-left, and lower-right), with the same split of $38 \times 5 = 190$ samples used for training and the rest used for testing (Li, Swersky, and Zemel, 2014; Louizos et al., 2016; Xie et al., 2017). We use the same architecture for the predictor and the encoder as CAI (as presented in (Xie et al., 2017)), i.e., single-layer neural networks, except that our encoder produces two encodings instead of one. We also model the decoder and the disentanglers as single-layer neural networks.
Table 3.2 summarizes the results. The proposed unsupervised method (trained without $D_s$) outperforms ablation versions of our model and existing state-of-the-art (supervised) invariance induction methods on both $A_y$ and $A_s$, providing a significant boost in $A_y$ and nearly complete removal of lighting information from $z_1$, as reflected by $A_s$. Furthermore, the accuracy of predicting $s$
Figure 3.3: Extended Yale-B: t-SNE visualization of (a) raw data, (b) $z_2$ labeled by lighting condition, (c) $z_1$ labeled by lighting condition, and (d) $z_1$ labeled by subject-ID (numerical markers, not colors). Raw images cluster by lighting. $z_1$ clusters by identity but not lighting, as desired, while $z_2$ clusters by lighting.
from $z_2$ is 0.89, which validates its automatic migration to $z_2$. Figure 3.3 shows t-SNE (Maaten and Hinton, 2008) visualizations of the raw data and the embeddings $z_1$ and $z_2$ for our model. While the raw data is clustered by the lighting condition $s$, $z_1$ exhibits clustering by $y$ with no grouping based on $s$, and $z_2$ exhibits near-perfect clustering by $s$. Figure 3.4 shows reconstructions from $z_1$ and $z_2$. Dedicated decoder networks were trained (with the weights of $Enc$ frozen) to generate these visualizations. As evident, $z_1$ captures identity-related information but not lighting, while $z_2$ captures the inverse.
Figure 3.4: Extended Yale-B: reconstruction results. Each block shows results for a single subject. The columns in each block are (left to right): real image, reconstruction from $z_1$, and reconstruction from $z_2$. Reconstructions from $z_1$ show that it captures subject identity but has little lighting information, thus achieving the invariance goal. Reconstructions from $z_2$ show that it captures lighting but not identity. Viewing along rows across blocks, it is easy to see that reconstructions from $z_2$ look similar.
Table 3.3: Results on Chairs. High $A_y$ and low $A_s$ are desired.

Model                                        Accuracy of y    Accuracy of s
NN+MMD (Li, Swersky, and Zemel, 2014)        0.73             0.46
VFAE (Louizos et al., 2016)                  0.72             0.37
CAI (Xie et al., 2017)                       0.68             0.69
CVIB (Moyer et al., 2018)                    0.67             0.52
Ablation baseline B_0                        0.67             0.70
Ablation baseline B_1                        0.69             0.54
UAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.74             0.34
Chairs This dataset consists of 1,393 different chair types rendered at 31 yaw angles and two pitch angles using a computer-aided design model. We treat the chair identity as the target $y$ and the yaw angle as the nuisance factor $s$ by grouping it into four categories: front, left, right, and back. This $s$ information is used for training the previous works, but our model is trained without $D_s$ and, hence, without any $s$-information. We split the data into training and testing sets by picking alternate yaw angles. Therefore, there is no overlap of yaw angles between the two sets. We model the encoder and the predictor as two-layer neural networks for the previous works and our model. We also model the decoder as a two-layer network and the disentanglers as single-layer networks.
Figure 3.5: Chairs dataset: t-SNE visualization of (a) raw data, (b) the $z_1$ embedding, and (c) the $z_2$ embedding. Labels indicate the nuisance factor: orientation. Raw images cluster by orientation. $z_1$ clusters by chair-class but not orientation, as desired, while $z_2$ clusters by orientation.
Table 3.3 summarizes the results, showing that our model outperforms both ablation baselines and previous state-of-the-art methods on both $A_y$ and $A_s$. Moreover, the accuracy of predicting $s$ from $z_2$ is 0.73, which shows that this information migrates to $z_2$. Figure 3.5 shows t-SNE visualizations of the raw data and the embeddings $z_1$ and $z_2$ for our model. While the raw data and $z_2$ are clustered by the orientation direction $s$, $z_1$ exhibits no grouping based on $s$. Figure 3.6 shows the results of reconstructing $x$ from $z_1$ and $z_2$, generated in the same way as for Extended Yale-B above.
Figure 3.6: Chairs: reconstruction results. Each block shows results for a single chair-class. The columns in each block are (left to right): real image, reconstruction from $z_1$, and reconstruction from $z_2$. Reconstructions from $z_1$ show that it captures the chair-class but has little orientation information, as desired. Reconstructions from $z_2$ show that it captures orientation but not much about identity.
Table 3.4: Results on MNIST-ROT. $\Theta = \{0^\circ, \pm 22.5^\circ, \pm 45^\circ\}$ was used for training. High $A_y$ and low $A_s$ are desired. VFAE does not allow for out-of-domain $s$ because the VFAE encoder requires $s$ as input, and $s$ is categorical here.

                                             Accuracy of y                          Accuracy of s
Model                                        θ ∈ Θ    θ = ±55°    θ = ±65°
NN+MMD (Li, Swersky, and Zemel, 2014)        0.970    0.831       0.665             0.380
VFAE (Louizos et al., 2016)                  0.953    -           -                 0.389
CAI (Xie et al., 2017)                       0.958    0.829       0.663             0.384
CVIB (Moyer et al., 2018)                    0.960    0.819       0.674             0.382
Ablation baseline B_0                        0.974    0.826       0.674             0.586
Ablation baseline B_1                        0.972    0.829       0.682             0.409
UAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.977    0.856       0.696             0.338
Figure 3.7: MNIST-ROT: reconstruction results. Each block shows a digit-class. The columns in each block are (left to right): real images, reconstruction from $z_1$, and reconstruction from $z_2$. Reconstructions from $z_1$ show that it captures the digit-class but has little rotation information, as desired for invariance. Reconstructions from $z_2$ show that it captures rotation as well as other inherent nuisance factors, which are hard to interpret visually. The figure shows that $z_1$ contains identity information but nothing about $\theta$, while $z_2$ contains $\theta$ with limited identity information.
Figure 3.8: MNIST-ROT: t-SNE visualization of (a) raw data and (b) the $z_1$ embedding. While the raw data is clustered by the rotation angle $\theta$, $z_1$ is grouped by digit-class.
3.4.2 Effective use of synthetic data augmentation for learning invariance
Data is often not available for all possible variations of nuisance factors. A popular approach to learning models robust to such expected yet unobserved (or infrequently seen during training) variations is data augmentation through synthetic generation, using methods ranging from simple operations (Ko et al., 2015), like rotation and translation, to complex transformations (Masi et al., 2019) for the synthesis of more sophisticated variations. The prediction model is then trained on the expanded dataset. The resulting model thus becomes robust to the specific forms of variation of certain nuisance factors that it has seen during training. Invariance induction, on the other hand, aims to completely prevent prediction models from using information about nuisance factors. Data augmentation methods can be used more effectively for improving the prediction of $y$ by using the expanded dataset for inducing invariance by exclusion rather than inclusion. We use two variants of the MNIST (LeCun et al., 1998) dataset of handwritten digits for experiments on this task. We use the same two-layer architectures for the encoder and the predictor in our model as well as the previous works, except that our encoder generates two encodings instead of one. We model the decoder as a three-layer neural network and the disentanglers as single-layer neural networks.
Figure 3.9: t-SNE visualization of the MNIST-ROT $z_1$ embedding for UnifAI ((a) and (c)) and the baseline model $B_0$ ((b) and (d)). Models were trained on $\theta \in \{0^\circ, \pm 22.5^\circ, \pm 45^\circ\}$. The visualization is presented for $\theta = \pm 55^\circ$. The $B_0$ embeddings show sub-clusters of $\theta$ within each digit cluster, such that $\theta$ information is easily separable (d). The UnifAI embedding $z_1$ does not show any grouping by $\theta$.
MNIST-ROT We create this variant of the MNIST dataset by rotating each image by angles $\theta \in \{-45^\circ, -22.5^\circ, 0^\circ, 22.5^\circ, 45^\circ\}$ about the Y-axis. We denote this set of angles as $\Theta$. The angle information is used as a one-hot encoding while training the previous works, whereas our model is trained without $s$-labels (i.e., without $D_s$). We evaluate all the models on the same metrics $A_y$ and $A_s$ we previously used. We additionally test all the models on $\theta \notin \Theta$ to gauge the performance of these models on unseen variations of the rotation nuisance factor.
Table 3.5: MNIST-DIL: accuracy of predicting $y$ ($A_y$). $\kappa = -2$ represents erosion with a kernel-size of 2.

                                             Accuracy of y
Model                                        κ = -2    κ = 2    κ = 3    κ = 4
NN+MMD (Li, Swersky, and Zemel, 2014)        0.870     0.944    0.855    0.574
VFAE (Louizos et al., 2016)                  0.807     0.916    0.818    0.548
CAI (Xie et al., 2017)                       0.816     0.933    0.795    0.519
CVIB (Moyer et al., 2018)                    0.844     0.933    0.846    0.586
Ablation baseline B_0                        0.872     0.942    0.847    0.534
Ablation baseline B_1                        0.870     0.940    0.853    0.550
UAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.880     0.958    0.874    0.606
Table 3.4 summarizes the results, showing that our adversarial model, which is trained without any $s$ information, not only performs better than the baseline ablation versions but also outperforms state-of-the-art methods, which use supervised information about the rotation angle. The difference in $A_y$ is especially notable for the cases where $\theta \notin \Theta$. Results on $A_s$ show that our model discards more information about $\theta$ than previous works, even though prior art uses $\theta$ information during training. The information about $\theta$ migrates to $z_2$, as indicated by the accuracy of predicting it from $z_2$ being 0.77. Figure 3.7 shows the results of reconstructing $x$ from $z_1$ and $z_2$, generated in the same way as for Extended Yale-B above. The figures show that reconstructions from $z_1$ reflect the digit class but contain no information about $\theta$, while those from $z_2$ exhibit the inverse. Figure 3.8 shows t-SNE visualizations of raw MNIST-ROT images and the $z_1$ embedding learned by our model. While the raw data tends to cluster by $\theta$, $z_1$ shows near-perfect grouping based on the digit-class. We further visualize the $z_1$ embedding learned by the proposed model and the baseline $B_0$, which models the classifier $x \rightarrow h \rightarrow y$, to investigate the effectiveness of invariance induction by exclusion versus inclusion, respectively. Both models were trained on digits rotated by $\theta \in \Theta$, and t-SNE visualizations were generated for $\theta \in \{\pm 55^\circ\}$. Figure 3.9 shows the results. As evident, the $z_1$ learned by the proposed model shows no clustering by the rotation angle, while that learned by $B_0$ does, with the encodings of some digit classes forming multiple clusters corresponding to rotation angles.
MNIST-DIL We create this variant of MNIST by eroding or dilating MNIST digits using various kernel-sizes ($\kappa$). We use models trained on MNIST-ROT to report evaluation results on this dataset, in order to show the advantage of unsupervised invariance induction in cases where certain $s$ are not annotated in the training data and, thus, cannot be used to train supervised invariance models.

Table 3.5 summarizes the results of this experiment. The results show significantly better performance of our model compared to the baselines. More notably, prior works perform significantly worse than our baseline models, indicating that those approaches to invariance induction can worsen performance with respect to nuisance factors not accounted for during training.
3.4.3 Learning invariance to arbitrary nuisance factors by leveraging
Generative Adversarial Networks
Algorithmic generation of synthetic data for augmentation allows the generation of the specific forms of variation of data that the methods are designed for. However, it is very difficult to account for all possible variations of data using such approaches. In light of this, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have recently been employed for data augmentation and have provided significant gains in the final prediction performance of the supervised task (Antoniou, Storkey, and Edwards, 2017). The ability to generate massive amounts of arbitrary variations of data using GANs, combined with the proposed invariance induction framework, provides a novel approach for the development of robust features that are, in theory, invariant to all forms of nuisance factors in the data (with respect to the supervised task) that define the underlying generative model parameterized by the GAN. We evaluate this experimental setting on two datasets: Fashion-MNIST (Xiao, Rasul, and Vollgraf, 2017) and Omniglot (Lake, Salakhutdinov, and Tenenbaum, 2015). We report results for three configurations: (1) $B_0$ (the baseline model composed of $Enc$ and $Pred$) trained on real training data, (2) $B_0$ trained on real and generated data (the augmented dataset),
Figure 3.10: Random samples generated using the BiCoGAN trained on Fashion-MNIST. Rows indicate classes. The latent embedding (style) is fixed for each column.
Table 3.6: Fashion-MNIST: accuracy of predicting $y$ ($A_y$)

Model                             Real Test    Extreme Test
B_0 trained on Real Data          0.918        0.640
B_0 trained on Augmented Data     0.922        0.876
UAI trained on Augmented Data     0.934        0.889
and (3) the proposed model trained on the augmented dataset. Since the data is generated with arbitrary variations of the latent nuisance factors, $s$ is not easily quantifiable for this experiment. We present results for these configurations on real testing data as well as on extreme samples, which are difficult examples sampled far from the modes of the latent distribution of the GAN models.
Figure 3.11: t-SNE visualization of the Fashion-MNIST $z_1$ embedding (panels, left to right: $B_0$ trained on real data, $B_0$ trained on augmented data, UAI trained on augmented data). The first row shows the visualization for real test data and the second shows that for extreme test samples. The embedding of the extreme test samples from the $B_0$ model trained on real data is scattered, with vague clustering by clothing-class. Training $B_0$ with the augmented dataset makes the clustering of extreme samples cleaner. This clustering improves further in the case of the UnifAI model trained with augmented data.
Fashion-MNIST This dataset contains grayscale images of 10 kinds of clothing. It was designed as a more challenging replacement for the MNIST dataset for benchmarking machine learning models. The target $y$ in the supervised task is the type of clothing (e.g., trouser, coat, etc.), whereas the nuisance factors include all elements of style that are not particular to (discriminative of) specific clothing classes. We trained a Bidirectional Conditional GAN (BiCoGAN) (Jaiswal et al., 2018a) on the training set. Figure 3.10 qualitatively shows the performance of the BiCoGAN through randomly sampled images. We sampled the generated instances for training the $B_0$ and UnifAI models two standard deviations away from the mean of the latent distribution. This was done to avoid generating samples that are very similar to the real examples and thus have very little variation with respect to the real training dataset. Extreme examples for testing were sampled three standard deviations away from the distribution mean. We model $Enc$ as a neural network composed of two convolution layers followed by two fully-connected layers, $Pred$ as two fully-connected layers, $Dec$ as three convolution layers, and the disentanglers as two dense layers.
Table 3.6 summarizes the results of our experiments. As evident, training $B_0$ with augmented data generated using the GAN model improves the prediction accuracy on the real test data as well as on extreme examples, compared to training $B_0$ with only the real training data. However, the configuration with the proposed invariance induction framework trained on the augmented dataset achieves the best performance. Figure 3.11 shows the t-SNE visualization of the embedding used for classification for real and extreme test samples. The figure shows that the embedding of real data does not change much across the three configurations. However, that of the extreme samples improves progressively in the order: $B_0$ trained on real data, $B_0$ trained on augmented data, and the proposed framework trained on augmented data. This correlates with the quantitative results in Table 3.6.
Omniglot This is a dataset of 1,623 different handwritten characters from 50 different alphabets. The target $y$ is the character-type, whereas elements of handwriting style that are not discriminative of $y$ are considered nuisance factors. We trained the Data Augmentation GAN (DAGAN) (Antoniou, Storkey, and Edwards, 2017) using the official code¹ available for this dataset. Figure 3.12 qualitatively shows the performance of the DAGAN through randomly sampled images. The generated dataset for training the models was sampled one standard deviation away from the mean of the latent distribution, whereas extreme examples for testing were sampled two standard deviations away from the mean. We used a neural network composed of three convolution layers followed by two fully-connected layers for $Enc$, two fully-connected layers for $Pred$, three convolution layers for $Dec$, and two fully-connected layers for the disentanglers.

¹ https://www.github.com/AntreasAntoniou/DAGAN
Figure 3.12: Random samples generated using the DAGAN trained on Omniglot. Rows indicate 10 randomly sampled classes. The latent embedding (style) is fixed for each column.
Table 3.7: Omniglot: accuracy of predicting $y$ ($A_y$)

Model                             Real Test    Extreme Test
B_0 trained on Real Data          0.674        0.414
B_0 trained on Augmented Data     0.725        0.535
UAI trained on Augmented Data     0.740        0.558
Table 3.7 shows the results of our experiments. As in the case of Fashion-MNIST above, training $B_0$ with the augmented dataset leads to better classification accuracy not only on the real test dataset but also on the extreme samples, compared to $B_0$ trained with only the real training dataset. The proposed invariance framework trained with the augmented dataset, however,
Table 3.8: Amazon Reviews dataset: accuracy of predicting $y$ from $z_1$ ($A_y$). CAI (Xie et al., 2017) is the same as DANN (Ganin et al., 2016).

Source - Target          DANN (Ganin et al., 2016)    VFAE (Louizos et al., 2016)    UAI (Jaiswal et al., 2018d; 2019d) (Ours)
books - dvd              0.784                        0.799                          0.820
books - electronics      0.733                        0.792                          0.764
books - kitchen          0.779                        0.816                          0.791
dvd - books              0.723                        0.755                          0.798
dvd - electronics        0.754                        0.786                          0.790
dvd - kitchen            0.783                        0.822                          0.826
electronics - books      0.713                        0.727                          0.734
electronics - dvd        0.738                        0.765                          0.740
electronics - kitchen    0.854                        0.850                          0.890
kitchen - books          0.709                        0.720                          0.724
kitchen - dvd            0.740                        0.733                          0.745
kitchen - electronics    0.843                        0.838                          0.859
achieves the best performance, further supporting the effectiveness of the proposed framework in leveraging GANs for learning invariance to arbitrary nuisance factors.
3.4.4 Domain Adaptation
Domain adaptation has recently been treated as an invariance induction task (Ganin et al., 2016; Louizos et al., 2016), where the goal is to make the prediction task invariant to the "domain" information. We evaluate the performance of our model at domain adaptation on the Amazon Reviews dataset (Chen et al., 2012), using the same preprocessing as (Louizos et al., 2016). The dataset contains text reviews of products in four domains: "books", "dvd", "electronics", and "kitchen". Each review is represented as a feature vector of unigram and bigram counts. The target $y$ is the sentiment of the review, either positive or negative. We use the same experimental setup as (Ganin et al., 2016; Louizos et al., 2016), where the model is trained on one domain and tested on another, thus creating 12 source-target combinations. We design the architectures of the encoder and the decoder in our model to be similar to those of the VFAE, as presented in (Louizos et al.,
Table 3.9: Results on Adult dataset

Model                                           Accuracy of y    Accuracy of s
NN+MMD (Li, Swersky, and Zemel, 2014)           0.75             0.67
VFAE (Louizos et al., 2016)                     0.76             0.67
CAI (Xie et al., 2017)                          0.83             0.89
CVIB (Moyer et al., 2018)                       0.69             0.68
UnifAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.84             0.67
2016). Table 3.8 shows the results of our model trained without $D_s$ and of the supervised state-of-the-art methods VFAE and Domain Adversarial Neural Network (DANN) (Ganin et al., 2016), which use $s$ labels during training. The results of the prior works are quoted directly from (Louizos et al., 2016). The results show that our model outperforms both VFAE and DANN on nine of the twelve tasks. Thus, our model can also be used effectively for domain adaptation.
3.4.5 Fair Representation Learning
Learning invariance to biasing factors requires information about these factors in order to discard them from the prediction process, because they are correlated with the prediction target and cannot be removed in an unsupervised way. Hence, we use the full unified framework, which includes the $s$-discriminator $D_s$, for this task. We provide results of our model and prior state-of-the-art methods (NN+MMD, VFAE, CAI, and CVIB) on the Adult (Kohavi, 1996; Dheeru and Karra Taniskidou, 2017) and German (Kohavi, 1996; Dheeru and Karra Taniskidou, 2017) datasets, which are popularly used in evaluating fair representation frameworks (Louizos et al., 2016; Xie et al., 2017; Moyer et al., 2018). We used the same preprocessed versions of these datasets as (Louizos et al., 2016).
Adult This is an income dataset of 45,222 individuals with various socio-economic attributes. The prediction task is to infer whether a person has more than $50,000 in savings. The biasing factor $s$ for this dataset is age, which is binarized, and it is required to make age-invariant savings
Figure 3.13: Adult dataset: t-SNE visualization of (a) raw data, (b) the $z_1$ embedding, and (c) the $z_2$ embedding. Labels indicate the biasing factor: age. The raw data clusters by age, showing the bias. $z_1$ does not cluster by age, as desired for fairness, while $z_2$ does, showing the migration of the bias to $z_2$.
predictions. We model the encoder and the $s$-discriminator as two-layer neural networks, and the predictor, the decoder, and the disentanglers as one-layer neural networks.

The results of this experiment are presented in Table 3.9. Our model achieves state-of-the-art performance on the accuracy of predicting $y$, while being completely invariant to $s$, as reflected by $A_s$ being the same as the population share of the majority $s$-class (0.67). Figure 3.13 shows the t-SNE visualization of the raw data and the $z_1$ and $z_2$ embeddings. Both the raw data and the $z_2$ embedding show clustering by age, while the invariant embedding $z_1$ does not.
German This dataset contains information about 1,000 people, with the target being to predict whether a person has a good credit rating. The biasing factor here is gender, and it is required to make gender-invariant credit assessments. For evaluating UnifAI on this dataset, the $s$-discriminator is modeled as a two-layer neural network, whereas one-layer neural networks are used to instantiate the encoder, the predictor, the decoder, and the disentanglers.

Table 3.10 summarizes the results of this experiment, showing that the proposed model outperforms previous methods on $A_y$ while retaining $A_s$ at the population share of the majority gender class (0.80). Thus, the proposed model achieves perfect invariance to gender while retaining more information relevant for making credit assessments. Figure 3.14 shows the t-SNE visualization
Table 3.10: Results on German dataset

Model                                           Accuracy of y    Accuracy of s
NN+MMD (Li, Swersky, and Zemel, 2014)           0.74             0.80
VFAE (Louizos et al., 2016)                     0.70             0.80
CAI (Xie et al., 2017)                          0.70             0.81
CVIB (Moyer et al., 2018)                       0.74             0.80
UnifAI (Jaiswal et al., 2018d; 2019d) (Ours)    0.78             0.80
Figure 3.14: German dataset: t-SNE visualization of (a) raw data, (b) the $z_1$ embedding, and (c) the $z_2$ embedding. Labels indicate the biasing factor: gender. The raw data clusters by gender, showing the bias. $z_1$ does not cluster by gender, as desired for fairness, while $z_2$ does, showing the migration of the bias to $z_2$.
of the raw data and the $z_1$ and $z_2$ embeddings. While the raw data and the $z_2$ embedding are clustered by gender, the fair embedding $z_1$ is not.

As evident from the results on both datasets, our invariance framework works effectively at the task of fair representation learning, exhibiting state-of-the-art results.
3.4.6 Competition between prediction and reconstruction
Figure 3.15 shows the effect of the ratio $\alpha/\beta$ on the prediction performance ($A_y$) for the Extended Yale-B and MNIST-ROT datasets. The results were generated by keeping the loss weights $\alpha$ and $\gamma$ fixed at 100 and 1, respectively, and increasing $\beta$ from $10^{-6}$ to 1. Thus, the plots show the effect of gradually increasing the competition between the prediction and reconstruction tasks by
Figure 3.15: Effect of the competition between prediction and reconstruction on $A_y$ for Extended Yale-B and MNIST-ROT (x-axis: decoder loss weight $\beta$; y-axis: test accuracy $A_y$). Plots were generated by keeping $\alpha$ and $\gamma$ fixed at 100 and 1, respectively, and increasing $\beta$.
giving the latter more say in the overall training objective. As evident, increasing $\beta$ improves $A_y$ by pulling nuisance factors into $z_2$ up to a point, beyond which $A_y$ drops because information essential for predicting $y$ also gets pushed from $z_1$ to $z_2$. Hence, the observed behavior of the said competition is consistent with the intuitive analysis provided in Section 3.3.
3.5 Summary
We have presented a unified framework for invariance induction in neural networks for both nuisance and biasing factors of data. Our method models invariance to nuisances as an information separation task, which is achieved by competitive training between a predictor and a decoder coupled with disentanglement, and explicitly penalizes the network if it encodes known biasing factors in order to achieve independence from such information. We described an adversarial instantiation of this framework and provided an analysis of its working. Experimental evaluation shows that our invariance induction model outperforms previous state-of-the-art methods, which incorporate $s$-labels in their training, at learning invariance to nuisance factors without using any $s$-annotations. The proposed model also exhibits state-of-the-art performance on fairness tasks, where it makes the latent embedding and the predictions independent of known biasing $s$. Our model does not make any assumptions about the data and can be applied to any supervised learning task, e.g., binary/multi-class classification or regression, without loss of generality.
Chapter 4
Discovery and Separation of Features for
Invariant Representation Learning
4.1 Introduction
We present a method for inducing nuisance-invariant representations by learning to discover and separate the predictive and nuisance factors of data (Jaiswal et al., 2019b). The presented framework generates a complete representation of data through two independent embeddings: one for encoding predictive factors and another for nuisances. This is achieved by augmenting the target-prediction objective, during training, with a reconstruction loss for decoding data from the said complete representation while simultaneously enforcing information constraints on the two constituent embeddings.

The presented framework builds upon the Information Bottleneck (IB) method (Tishby, Pereira, and Bialek, 1999) but learns a complete representation of data, similar in intuition to the Unsupervised Adversarial Invariance (UAI) model (Chapter 3). We present an information theoretic formulation of this approach and derive three equivalent training objectives. We show that the UAI model is a relaxation of the presented framework, grounding the superiority of this method over UAI in theory. Furthermore, results in Section 4.4 show that the presented model outperforms both an exact IB method and UAI. The presented approach does not require annotations of nuisance factors for inducing invariance to them, which is desired both in practice and in theory (Achille and Soatto, 2018b). Extensive experimental evaluation shows that this framework outperforms previous state-of-the-art methods that do not employ nuisance annotations, as well as those that do.
4.2 Separation of Predictive and Nuisance Factors of Data
The working mechanism of neural networks can be interpreted as the mapping of data samples $x$ to latent codes $z$ (activations of an intermediate hidden layer) followed by the inference of the target $y$ from $z$, i.e., the sequence $x \rightarrow z \rightarrow y$. The goal of this work is to learn $z$ that are maximally informative for predicting $y$ but are invariant to all nuisance factors $s$. Our approach for generating nuisance-free embeddings involves learning a complete representation of data in the form of two independent embeddings, $z_p$ and $z_n$, where $z_p$ encodes only those factors that are predictive of $y$ and $z_n$ encodes the nuisance information.
In order to learn $z_p$ and $z_n$, we augment the prediction objective $\mathbb{E} \log p(y \mid z_p)$ with a reconstruction objective $\mathbb{E} \log p(x \mid \{z_p, z_n\})$ for decoding $x$ from the complete representation $\{z_p, z_n\}$ while imposing information constraints on $z_p$ and $z_n$ in the form of mutual information measures: $I(z_p : x)$, $I(z_n : x)$, and $I(z_p : z_n)$. The reconstruction objective and the information constraints encourage the learning of information-rich embeddings (Sabour, Frosst, and Hinton, 2017a; Zhang, Lee, and Lee, 2016) and promote the separation of predictive factors from nuisances into the two embeddings. In the following sections, we present an information theoretic formulation of this approach and derive several equivalent training objectives.
4.2.1 Nuisance Invariance via Information Discovery and Separation
The Information Bottleneck (IB) method (Tishby, Pereira, and Bialek, 1999) aims to learn minimal representations of data that are sufficient for predicting $y$ from $x$ (Achille and Soatto, 2018a). The optimization objective of IB can be written as:

$$\max\; I(z : y) \quad \text{s.t.}\;\; I(z : x) \leq I_c \qquad (4.1)$$

where the goal is to maximize the predictive capability of $z$ while constraining how much information $z$ can encode. The method intuitively brings about a trade-off between the prediction capability of $z$ and its information theoretic "size", also known as the channel capacity or rate. This is exactly the rate-distortion trade-off (Tishby, Pereira, and Bialek, 1999). While optimizing this objective is, in theory, sufficient (Achille and Soatto, 2018b) for getting rid of nuisance factors with respect to $y$, the optimization is difficult and relies on a powerful encoder $z = \mathrm{Encoder}(x)$ that is capable of disentangling the factors of data effectively such that only nuisance information is compressed away. In practice, this is hard to achieve directly, and methods that help the encoder better separate predictive factors from nuisances (ideally, at an atomic level (Jaiswal et al., 2018d)) are expected to perform better by retaining more predictive information while being invariant to nuisances.
Our approach for improving this separation of predictive and nuisance factors is to more
explicitly learn a complete representation of data that comprises two independent embeddings: z_p
for encoding predictive factors and z_n for nuisances. The proposed optimization objective can be
written as:

    max I(z_p : y) + λ I(x : {z_p, z_n})        (4.2)
    s.t.  I(z_p : x) ≤ I_c
          I(z_p : z_n) = 0
where λ determines the relative importance of the two mutual information terms. Like IB, our
objective maximizes information about some target, in this case both y and x, with z_p used for
predicting y but z = {z_p, z_n} used for decoding x. We also keep a rate constraint to limit the
information in z_p but add a new constraint for independence between z_p and z_n. The optimization
objective in Equation 4.2 can be relaxed such that the objective J is:

    J = I(z_p : y) + λ I(x : {z_p, z_n}) − β I(z_p : x) − γ I(z_p : z_n)        (4.3)
where β and γ denote multipliers for the I(z_p : x) and I(z_p : z_n) constraints, respectively.
The optimization of I(z_p : y) and I(x : {z_p, z_n}) is straightforward through their variational
bounds (Alemi et al., 2016; Moyer et al., 2018): E log p(y|z_p) and E log p(x|{z_p, z_n}), respectively.
We discuss next the optimization of the I(z_p : x) and I(z_p : z_n) terms for inducing the desired
information separation.
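For concreteness, the following is a minimal sketch of the two-embedding architecture described above, written in the Keras/TensorFlow style used for our implementation (Section 4.2.4). The layer widths, activations, and names here are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_two_embedding_model(input_dim=784, z_dim=32, num_classes=10):
    """Complete-representation model: z_p feeds the predictor, {z_p, z_n} the decoder."""
    x = layers.Input(shape=(input_dim,), name="x")
    # Two encoders produce the complete representation {z_p, z_n}.
    z_p = layers.Dense(z_dim, activation="tanh", name="z_p")(
        layers.Dense(256, activation="relu")(x))
    z_n = layers.Dense(z_dim, activation="tanh", name="z_n")(
        layers.Dense(256, activation="relu")(x))
    # The predictor sees only z_p, so nuisances must be excluded from it.
    y_hat = layers.Dense(num_classes, activation="softmax", name="y_hat")(z_p)
    # The decoder reconstructs x from both embeddings, keeping them jointly
    # informative about everything in the data.
    h = layers.Dense(256, activation="relu")(layers.Concatenate()([z_p, z_n]))
    x_hat = layers.Dense(input_dim, name="x_hat")(h)
    return Model(inputs=x, outputs=[y_hat, x_hat])
```

The information constraints of Equation 4.3 are then imposed on the z_p and z_n layers as described in the following two sections.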
4.2.2 Embedding Compression
We directly optimize the mutual information I(z_p : x) in Equation 4.3 for restricting the flow of
information from x to z_p. In IB terminology, this is also referred to as the compression of the z_p
embedding. We compute a simple exact expression for this mutual information using the recently
developed method of Echo noise (Brekelmans et al., 2019), which takes the same shift-and-scale
form as a Variational Autoencoder (VAE) (Kingma and Welling, 2014), but replaces the standard
Gaussian noise with an implicit sampling procedure. The encoding z_p is calculated as:

    z_p = f_p(x) + S_p(x) ε_p        (4.4)

where f_p and S_p are parameterized by neural networks with bounded output activations, and the
noise ε_p is calculated recursively using Equation 4.4 on independent and identically distributed
(iid) samples x^(ℓ) from the data distribution q_data(x) as:

    ε_p = f_p(x^(0)) + S_p(x^(0)) ( f_p(x^(1)) + S_p(x^(1)) ( f_p(x^(2)) + S_p(x^(2)) ( … ) ) )
Thus, the noise corresponds to an infinite sum that repeatedly applies Equation 4.4 to additional
input samples. The key observation here is that the original training sample x is also an iid sample
from the input. By simply relabeling the sample indices ℓ, we can see that the distributions of z_p
and ε_p match in the limit. This yields (Brekelmans et al., 2019) an exact, analytic form for the
mutual information:

    I(z_p : x) = −E_x log |det S_p(x)|        (4.5)
Intuitively, given that (1) I(z_p : x) = H(z_p) − H(z_p|x) and that (2) the ε_p and z_p distributions
match, the entropies of a conditional and an unconditional draw from z_p differ only by the scaling
factor S_p(x). We use Equation 4.5 to calculate the I(z_p : x) term in our objective.
4.2.3 Independence between the z_p and z_n Embeddings
The exact form of I(z_p : z_n) is much more difficult to minimize than the I(z_p : x) term described
in Section 4.2.2. We explore three approaches for enforcing independence between z_p and z_n:
(1) independence through compression, (2) the Hilbert-Schmidt Independence Criterion, and (3)
adversarial disentanglement. We also present the corresponding complete training objectives.
4.2.3.1 Independence through Compression
The objective in Equation 4.3 can be re-arranged to contain only terms limiting the information in
each embedding. We first state an identity based on the chain rule of mutual information:

    I(z_p : z_n) = I(z_p : x) − I(z_p : x|z_n) + I(z_p : z_n|x)        (4.6)
We next inspect the I(z_p : z_n|x) term in this identity, following (Moyer et al., 2018):

    I(z_p : z_n|x) = H(z_p|x) − H(z_p|x, z_n)
                   = H(z_p|x) − H(z_p|x)
                   = 0        (4.7)

which is intuitively true because z_p and z_n are computed only from x. Thus, we get:
    I(z_p : z_n) = I(z_p : x) − I(z_p : x|z_n)        (4.8)
This gives us an alternate way of computing I(z_p : z_n), but the I(z_p : x|z_n) term is still
difficult to calculate in this expression. In order to simplify this further, we use another key
identity:

    I(z_p : x|z_n) = I(x : {z_p, z_n}) − I(z_n : x)        (4.9)
Substituting for I(z_p : x|z_n) in Equation 4.8, we get:

    I(z_p : z_n) = I(z_p : x) + I(z_n : x) − I(x : {z_p, z_n})        (4.10)
The expression for I(z_p : z_n) in Equation 4.10 allows for the enforcement of independence between
z_p and z_n by optimizing their joint and individual mutual information with x. Using this identity,
Equation 4.3 simplifies into two "relevant information" terms and two compression terms (taking
λ = 1, as in our implementation; see Section 4.2.4) as follows:
    J = I(z_p : y) + I(x : {z_p, z_n}) − β I(z_p : x) − γ I(z_p : z_n)
      = I(z_p : y) + I(x : {z_p, z_n}) − β I(z_p : x) − γ [ I(z_p : x) + I(z_n : x) − I(x : {z_p, z_n}) ]
      = I(z_p : y) + (1 + γ) I(x : {z_p, z_n}) − (β + γ) I(z_p : x) − γ I(z_n : x)        (4.11)
Intuitively, this corresponds to maximizing I(z_p : y) and I(x : {z_p, z_n}) while compressing z_p
more than z_n. This rewriting of the proposed objective emphasizes its interpretation as an augmented
version of the Information Bottleneck. The multipliers β and γ can be separately tuned to weight the
compression of z_p and z_n. We calculate each of the compression terms using the method described
in Section 4.2.2. The final training objective after substituting for the compression losses and the
variational bounds on I(z_p : y) and I(x : {z_p, z_n}) is termed DSF-C and can be written as:
    Ĵ_DSF-C = E log p(y|z_p) + (1 + γ) E log p(x|{z_p, z_n})
              + (β + γ) E log |det S_p(x)| + γ E log |det S_n(x)|        (4.12)
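Assembled as a minimization loss (the negation of Ĵ_DSF-C), the objective of Equation 4.12 could be sketched as below. The squared-error reconstruction term standing in for −E log p(x|{z_p, z_n}) is an illustrative assumption (it corresponds to a Gaussian decoder).

```python
import tensorflow as tf

def dsf_c_loss(y_true, y_logits, x_true, x_recon, s_p, s_n, beta, gamma):
    """Negated DSF-C objective (Equation 4.12) using the Echo rate terms.

    s_p, s_n are the diagonal Echo scales S_p(x), S_n(x); the reconstruction
    weight lambda is fixed at 1 as in Section 4.2.4.
    """
    nll_y = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            y_true, y_logits, from_logits=True))
    nll_x = tf.reduce_mean(tf.reduce_sum(tf.square(x_true - x_recon), axis=-1))
    rate_p = -tf.reduce_mean(tf.reduce_sum(tf.math.log(tf.abs(s_p) + 1e-8), -1))
    rate_n = -tf.reduce_mean(tf.reduce_sum(tf.math.log(tf.abs(s_n) + 1e-8), -1))
    # Minimizing this loss is equivalent to maximizing J-hat_DSF-C.
    return nll_y + (1.0 + gamma) * nll_x + (beta + gamma) * rate_p + gamma * rate_n
```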
4.2.3.2 Hilbert-Schmidt Independence Criterion
Independence between z_p and z_n can also be achieved through the optimization of the Hilbert-Schmidt
Independence Criterion (HSIC) (Gretton et al., 2005) between the two embeddings. HSIC is a kernel
generalization of covariance, and constructing a penalty out of the Hilbert-Schmidt operator norm of
a kernel covariance pushes variables to be mutually independent (Lopez et al., 2018). This provides
an intuitive and more "direct" option for enforcing the independence constraint on z_p and z_n.

The HSIC estimator (Lopez et al., 2018) is defined for variables u and v with kernels k and h,
respectively. Assuming that the kernels are both universal, the following is an estimator of a
two-component HSIC:
    HSIC({(u, v)}_N) = (1/N²) Σ_{i,j} k(u_i, u_j) h(v_i, v_j)
                     + (1/N⁴) Σ_{i,j,k,l} k(u_i, u_j) h(v_k, v_l)
                     − (2/N³) Σ_{i,j,k} k(u_i, u_j) h(v_i, v_k)        (4.13)
This estimator is differentiable and can be used directly as a penalty on the "dependent-ness" of the
variables. The final training objective, termed DSF-H (for short), can be written as:

    Ĵ_DSF-H = E log p(y|z_p) + E log p(x|{z_p, z_n})
              + β E log |det S_p(x)| − γ HSIC(z_p, z_n)        (4.14)
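The estimator in Equation 4.13 is straightforward to implement. The following sketch uses a Gaussian kernel (which is universal) for both variables, with an illustrative bandwidth.

```python
import tensorflow as tf

def gaussian_kernel(a, sigma=1.0):
    """Pairwise Gaussian kernel matrix K_ij = exp(-||a_i - a_j||^2 / (2 sigma^2))."""
    sq = tf.reduce_sum(tf.square(a), axis=-1, keepdims=True)
    d2 = sq - 2.0 * tf.matmul(a, a, transpose_b=True) + tf.transpose(sq)
    return tf.exp(-d2 / (2.0 * sigma ** 2))

def hsic(u, v, sigma=1.0):
    """Biased two-component HSIC estimator of Equation 4.13."""
    n = tf.cast(tf.shape(u)[0], u.dtype)
    K = gaussian_kernel(u, sigma)
    H = gaussian_kernel(v, sigma)
    term1 = tf.reduce_sum(K * H) / n ** 2                 # (1/N^2) sum_ij K_ij H_ij
    term2 = tf.reduce_sum(K) * tf.reduce_sum(H) / n ** 4  # (1/N^4) sum_ij K_ij sum_kl H_kl
    term3 = tf.reduce_sum(tf.matmul(K, H)) / n ** 3       # (1/N^3) sum_ijk K_ij H_ik (K, H symmetric)
    return term1 + term2 - 2.0 * term3
```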
4.2.3.3 Adversarial Independence
The z_p and z_n embeddings can also be made independent using the adversarial training
scheme (Jaiswal et al., 2018d; Jaiswal et al., 2019d) described in Chapter 3. The model is augmented
with two adversarial discriminators: D_1, which tries to infer z_n from z_p, and D_2, which does the
inverse. The key idea here is that it is impossible to predict one variable from another if their
mutual information is zero. The adversarial training pushes the model to generate z_p and z_n such
that the discriminators fail to infer one from the other, thus achieving the said independence. The
complete training objective can be written as:
    Ĵ_DSF-A = E log p(y|z_p) + E log p(x|{z_p, z_n}) + β E log |det S_p(x)|
              − γ ( ‖D_1(z_p) − z_n‖²₂ + ‖D_2(z_n) − z_p‖²₂ )        (4.15)
The model is trained using a scheduled scheme (Chapter 3), alternating between updating the
parameters of D_1 and D_2 in phase (1) and the rest of the model in phase (2), where phase (1)
is executed five times before each phase (2) run. The regression targets of D_1(z_p) and D_2(z_n)
are set to z_n and z_p, respectively, during phase (1) and to random vectors during phase (2).
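A minimal sketch of the DSF-A adversarial phases follows; the single-layer discriminators mirror Section 4.2.4, while the optimizer settings and the shuffled-Gaussian stand-in for the random targets are illustrative assumptions.

```python
import tensorflow as tf

d1 = tf.keras.Sequential([tf.keras.layers.Dense(32)])  # predicts z_n from z_p
d2 = tf.keras.Sequential([tf.keras.layers.Dense(32)])  # predicts z_p from z_n
disc_opt = tf.keras.optimizers.Adam(1e-4)

def discriminator_step(z_p, z_n):
    """Phase (1): fit D_1 and D_2 to the true cross-embedding targets.

    This step is run five times before each phase-(2) update of the rest of
    the model, whose weights are frozen here.
    """
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(d1(z_p) - z_n)) \
             + tf.reduce_mean(tf.square(d2(z_n) - z_p))
    variables = d1.trainable_variables + d2.trainable_variables
    disc_opt.apply_gradients(zip(tape.gradient(loss, variables), variables))

def phase2_targets(z_p, z_n):
    """Phase (2): random vectors replace the true targets so the encoders are
    pushed to make z_p and z_n mutually unpredictable."""
    return tf.random.normal(tf.shape(z_n)), tf.random.normal(tf.shape(z_p))
```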
4.2.4 Model Implementation and Training
We implemented the models in Keras with the TensorFlow backend. We used the Adam optimizer with
a 10⁻⁴ learning rate and 10⁻⁴ weight decay. The multiplier λ was fixed at 1; β and γ were tuned
through grid-search over {10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. We used diagonal S(x) (Brekelmans et al., 2019)
with the number of samples limited to the batch size instead of the infinite sum described in
Section 4.2.2. The discriminators in DSF-A were instantiated as single-layer NNs.
4.3 Analysis
In this section, we derive a relationship between the proposed framework and the UAI model (Jaiswal
et al., 2018d) that we presented in Chapter 3. This analysis is useful in understanding both
the proposed model and UAI. We show that the UAI objective is a relaxation of the objective
we propose in Equation 4.3, which additionally demonstrates the superiority of the proposed
framework for learning nuisance-invariant representations.

We start by rearranging the proposed objective in Equation 4.3 using the identity from
Equation 4.8, i.e., I(z_p : x) = I(z_p : z_n) + I(z_p : x|z_n). This gives us (again taking λ = 1,
as in our implementation):
    J = I(z_p : y) + I(x : {z_p, z_n}) − β I(z_p : x) − γ I(z_p : z_n)
      = I(z_p : y) + I(x : {z_p, z_n}) − β [ I(z_p : z_n) + I(z_p : x|z_n) ] − γ I(z_p : z_n)
      = I(z_p : y) + { I(x : {z_p, z_n}) − β I(z_p : x|z_n) } − (β + γ) I(z_p : z_n)        (4.16)
Recall that the chain rule for mutual information in Equation 4.9 implies that
I(x : {z_p, z_n}) = I(z_n : x) + I(z_p : x|z_n). In the expression in braces above,
I(x : {z_p, z_n}) − β I(z_p : x|z_n), our objective simply downweights the I(z_p : x|z_n) component
of I(x : {z_p, z_n}) as:
    I(x : {z_p, z_n}) − β I(z_p : x|z_n) = I(z_n : x) + (1 − β) I(z_p : x|z_n)        (4.17)
This is equivalent to calculating I(x : {z̃_p, z_n}) for a noisified z̃_p = ψ(z_p) such that:

    I(x : {z̃_p, z_n}) = I(z_n : x) + I(z̃_p : x|z_n)        (4.18)
    I(z̃_p : x|z_n) = (1 − β) I(z_p : x|z_n)        (4.19)
where the free parameter ψ can be chosen to enforce this relationship. In particular, ψ depends
on I(z_p : x|{z̃_p, z_n}), which measures the information about x that is destroyed by adding noise
through ψ. We derive this result using the chain rule for mutual information in two different ways:
    I(x : {z̃_p, z_p, z_n}) = I(x : {z̃_p, z_n}) + I(z_p : x|{z̃_p, z_n})        (4.20)
                            = I(x : {z_p, z_n}) + I(z̃_p : x|{z_p, z_n})        (4.21)
    ⟹  I(x : {z_p, z_n}) = I(x : {z̃_p, z_n}) + I(z_p : x|{z̃_p, z_n})        (4.22)
In Equation 4.22, we used the fact that I(z̃_p : x|{z_p, z_n}) = 0 by the data processing
inequality (Cover and Thomas, 2012). Using the chain rule again, we know that both I(x : {z_p, z_n})
and I(x : {z̃_p, z_n}) contain an I(z_n : x) term. Canceling out I(z_n : x) and rearranging to match
the form of Equation 4.19, we obtain:
    I(z̃_p : x|z_n) = I(z_p : x|z_n) − I(z_p : x|{z̃_p, z_n})        (4.23)
Table 4.1: MNIST-ROT results presented as accuracy for angles θ ∈ Θ = {0°, ±22.5°, ±45°}
that the models were trained on and unseen ±55° and ±65° angles. VFAE could not
be evaluated for ±55° and ±65° as it uses s as input for encoding x and cannot be
used for previously unseen s. The y-accuracy should be high but the s-accuracy should be
close to random chance (0.20). RI indicates relative improvement in error-rate between
DSF-H (ours) and UAI (previous best).

    Model         Acc. y (θ ∈ Θ)   Acc. y (±55°)   Acc. y (±65°)   Acc. s
    VFAE          0.953            --              --              0.389
    CAI           0.958            0.826           0.662           0.384
    CVIB          0.960            0.819           0.674           0.382
    UAI           0.977            0.856           0.696           0.338
    DSF-E         0.980            0.865           0.707           0.200
    DSF-C         0.981 ± 0.001    0.869 ± 0.001   0.724 ± 0.002   0.200 ± 0.001
    DSF-A         0.981 ± 0.002    0.873 ± 0.002   0.730 ± 0.002   0.200 ± 0.000
    DSF-H         0.981 ± 0.001    0.873 ± 0.002   0.732 ± 0.001   0.200 ± 0.000
    RI over UAI   17%              12%             12%             100%
Thus, the information when using a noisy z̃_p instead of z_p differs by a term of
I(z_p : x|{z̃_p, z_n}). Since ψ is a free parameter and does not appear elsewhere in the objective,
it can be set to satisfy Equation 4.19. The objective function in Equation 4.16 can be rewritten
with this noisy z̃_p as:
    J = I(z_p : y) + I(x : {z̃_p, z_n}) − (β + γ) I(z_p : z_n)        (4.24)
Equation 4.24 describes an information theoretic formulation of UAI. The UAI model uses
independent multiplicative Bernoulli noise to create a noisy z̃_p = ψ(z_p), which is then used
alongside z_n in a variational decoder maximizing I(x : {z̃_p, z_n}). This has the indirect effect of
regularizing I(z_p : x), since nuisance information cannot be reliably passed through z̃_p for the
reconstruction task. In contrast, the proposed objective in Equation 4.3 directly constrains the
information channel between x and z_p, which guarantees invariance to nuisances (Achille and
Soatto, 2018b). This establishes that the proposed framework is strictly superior to UAI.
Figure 4.1: t-SNE visualization of the z_p (left) and z_n (right) embeddings of MNIST-ROT images
labeled by rotation angle. As desired, the z_p embedding does not encode rotation information,
which migrates to z_n.
4.4 Experimental Evaluation
Empirical results are presented on five datasets: MNIST-ROT (Jaiswal et al., 2018d), MNIST-DIL
(Jaiswal et al., 2018d), Extended Yale-B (Georghiades, Belhumeur, and Kriegman, 2001),
Multi-PIE (Gross et al., 2008), and Chairs (Aubry et al., 2014), following previous works (Jaiswal
et al., 2018d; Li, Swersky, and Zemel, 2014; Louizos et al., 2016; Moyer et al., 2018; Xie et al.,
2017). The proposed model is compared with the previous state of the art: VFAE, CAI, CVIB, and
UAI. Results are also reported for an ablation version of our framework, DSF-E, which models
the IB objective in Equation 4.1. We optimize an exact expression for I(z_p : x), as presented in
Section 4.2.2, for DSF-E. Hence, we do not evaluate methods that are similar to the ablation
model but indirectly optimize I(z_p : x) (Achille and Soatto, 2018a; Alemi et al., 2016).
The accuracy of predicting y from z_p is reported for the trained models. Additionally, the
accuracy of predicting s is reported as a measure of invariance, using two-layer neural networks
trained post hoc to predict known s from z_p, following previous works (Jaiswal et al., 2018d;
Moyer et al., 2018). While the y-accuracy is desired to be high, the s-accuracy should be close to
Table 4.2: MNIST-DIL results presented as accuracy for various kernel sizes k (positive
for dilation and negative for erosion). Models were trained on MNIST-ROT but tested
on MNIST-DIL. RI indicates relative improvement in error-rate between DSF-H (ours)
and UAI (previous best).

    Model         k = −2           k = 2            k = 3            k = 4
    VFAE          0.807            0.916            0.818            0.548
    CAI           0.816            0.933            0.795            0.519
    CVIB          0.844            0.933            0.846            0.586
    UAI           0.880            0.958            0.874            0.606
    DSF-E         0.891            0.964            0.887            0.608
    DSF-C         0.899 ± 0.002    0.966 ± 0.001    0.889 ± 0.002    0.608 ± 0.002
    DSF-A         0.907 ± 0.002    0.969 ± 0.002    0.892 ± 0.003    0.609 ± 0.003
    DSF-H         0.907 ± 0.001    0.970 ± 0.001    0.892 ± 0.002    0.610 ± 0.002
    RI over UAI   22%              28%              14%              1%
random chance of s for true invariance. Results of the full version of our model are reported as
mean and standard deviation over five runs. We also report relative improvements (%) in
error-rate over the previous best models. The error-rate for s is calculated as the difference
between the s-accuracy and random chance. Furthermore, t-SNE (Maaten and Hinton, 2008)
visualizations of the z_p and z_n embeddings are presented for the DSF-H version of the proposed
model to visualize the separation of nuisance factors.
MNIST-ROT: This dataset was introduced in Chapter 3 as an augmented version of the
MNIST (LeCun et al., 1998) dataset that contains digits rotated at angles θ ∈ Θ = {0°, ±22.5°, ±45°}
for training. Evaluation is performed on digits rotated at θ ∈ Θ as well as at ±55° and ±65°. The
NN instantiation of the proposed model follows the setup for UAI (Chapter 3), with two layers for
encoding x into z_p and z_n, two layers for inferring y from z_p, and three layers for reconstructing
x from {z_p, z_n}. The digit class is treated as y and the rotation angle as a categorical s.
Table 4.1 presents the results, showing that all versions of the proposed framework outperform
previous state-of-the-art models, with DSF-H achieving the best y-accuracies. Furthermore, without
using s-labels, all versions of the proposed framework achieve random-chance s-accuracy (0.20),
which indicates
Figure 4.2: t-SNE visualization of the z_p (left) and z_n (right) embeddings of Extended Yale-B
images labeled by lighting direction. As desired, the z_p embedding does not encode lighting
information, which migrates to z_n.
perfect invariance to rotation angle. Figure 4.1 shows the t-SNE visualization of z_p and z_n. As
evident, z_p does not cluster by rotation angle but z_n does, which validates that this nuisance
factor is separated out and encoded in z_n instead of z_p.
MNIST-DIL: This variant of MNIST contains digits eroded or dilated with various kernel sizes
k ∈ {−2, 2, 3, 4}, as introduced in Chapter 3. MNIST-DIL is used for further evaluating models
trained on MNIST-ROT under varying stroke-widths, a factor that is not explicitly controlled in
MNIST-ROT but is implicitly present. Results in Table 4.2 show that all versions of the proposed
framework outperform previous works by retaining more predictive information in z_p while being
invariant to inherent nuisances pertaining to stroke-width.

Extended Yale-B: This dataset contains face images captured under various lighting conditions,
which are binned into five directions corresponding to the four corners and frontal. We use the same
version of this dataset as previous works (Jaiswal et al., 2018d; Louizos et al., 2016; Xie et al.,
2017), where only one image from each lighting direction is used for training and all other images
Table 4.3: Extended Yale-B results. The y-accuracy should be high but the s-accuracy should be
random chance (0.20). RI indicates relative improvement in error-rate over the previous best (UAI).

    Model         Accuracy of y   Accuracy of s
    VFAE          0.85            0.57
    CAI           0.89            0.57
    CVIB          0.82            0.45
    UAI           0.95            0.24
    DSF-E         0.96            0.23
    DSF-C         0.96 ± 0.00     0.23 ± 0.00
    DSF-A         0.96 ± 0.00     0.23 ± 0.00
    DSF-H         0.96 ± 0.00     0.23 ± 0.00
    RI over UAI   20%             25%
are used for evaluation. The NN instantiation of the proposed model follows Chapter 3, with
one layer each for encoding z_p and z_n and for predicting y from z_p, while two layers are used for
reconstructing x from {z_p, z_n}. Table 4.3 presents the results of this experiment, showing that all
versions of the proposed model outperform previous works on both y-accuracy and s-accuracy.
The t-SNE visualization of z_p and z_n in Figure 4.2 shows that lighting information is separated
out and encoded in z_n instead of z_p, resulting in a more invariant z_p embedding.
Multi-PIE: This is a dataset of face images of 337 subjects captured at 15 poses and 19
illumination conditions with various facial expressions. A subset of the data is prepared for this
experiment that contains 264 subjects with images captured at five pose angles {0°, ±15°, ±30°} and
four illumination conditions: neutral, frontal, left, and right. The subject identity is treated as y
while pose and illumination are treated as nuisances. The NN instantiation of the proposed model
uses one layer each for encoding z_p and z_n and for predicting y from z_p, while two layers are used
for reconstructing x from {z_p, z_n}. Table 4.4 presents the results of this experiment, showing that
all versions of the proposed model outperform previous works on y-accuracy as well as on the
s-accuracy of both illumination and pose. The t-SNE visualization of z_p and z_n in Figure 4.3 shows
that both illumination and pose information are separated out and encoded in z_n instead of z_p,
resulting in an invariant z_p embedding.
Figure 4.3: t-SNE visualization of the z_p and z_n embeddings of Multi-PIE images labeled
by illumination (top row) and pose (bottom row). As desired, z_p does not encode
illumination and pose, both of which migrate to z_n.
Chairs: This dataset contains images of chairs at various yaw angles, which are binned into
four orientations: front, back, left, and right. We use the same version of this dataset as described
in Chapter 3, where the yaw angles do not overlap between the train and test sets. The NN
instantiation follows the setup of UAI (Chapter 3), with two layers each for encoding z_p and z_n
from x, predicting y from z_p, and reconstructing x from {z_p, z_n}. Results are summarized in
Table 4.5, showing that all versions of the proposed framework outperform previous methods
by a large margin on both y-accuracy and s-accuracy. All versions of the proposed framework
Table 4.4: Multi-PIE results. The y-accuracy should be high but the s-accuracy should be random
chance (illumination (i): 0.25, pose (p): 0.20). Separate models were trained for
illumination and pose for the previous supervised invariance methods: VFAE, CAI, and
CVIB. RI indicates relative improvement in error-rate over the previous best (UAI).

    Model         Training s   Accuracy of y   Acc. of s (i)   Acc. of s (p)
    VFAE          i            0.67            0.41            0.65
    VFAE          p            0.62            0.80            0.29
    CAI           i            0.76            0.99            0.98
    CAI           p            0.77            1.00            0.98
    CVIB          i            0.51            0.44            0.45
    CVIB          p            0.46            0.63            0.28
    UAI           --           0.82            0.61            0.32
    DSF-E         --           0.83            0.25            0.20
    DSF-C         --           0.85 ± 0.00     0.25 ± 0.02     0.20 ± 0.01
    DSF-A         --           0.87 ± 0.01     0.25 ± 0.00     0.20 ± 0.00
    DSF-H         --           0.87 ± 0.01     0.25 ± 0.00     0.20 ± 0.00
    RI over UAI   --           28%             100%            100%
Figure 4.4: t-SNE visualization of the z_p (left) and z_n (right) embeddings of Chairs images
labeled by yaw orientation. As desired, the z_p embedding does not encode orientation information,
which migrates to z_n.
also achieve random-chance s-accuracy (0.25), which indicates perfect invariance to orientation,
without using s-labels during training. Figure 4.4 shows the t-SNE visualization of z_p and z_n,
further validating that the orientation information is separated out of z_p and encoded in z_n.
Table 4.5: Chairs results. The y-accuracy should be high but the s-accuracy should be random
chance (0.25). RI indicates relative improvement in error-rate between DSF-H (ours)
and UAI (previous best).

    Model         Accuracy of y   Accuracy of s
    VFAE          0.72            0.37
    CAI           0.68            0.69
    CVIB          0.67            0.52
    UAI           0.74            0.34
    DSF-E         0.84            0.25
    DSF-C         0.86 ± 0.01     0.25 ± 0.00
    DSF-A         0.88 ± 0.02     0.25 ± 0.00
    DSF-H         0.88 ± 0.01     0.25 ± 0.00
    RI over UAI   54%             100%
4.5 Summary
We have presented a framework for inducing nuisance-invariant representations in supervised NNs
through learning to encode all information about the data while separating out predictive and
nuisance factors into independent embeddings. We provided an information theoretic formulation
of the approach and derived several equivalent training objectives from it. Furthermore, we
provided a theoretical analysis of the proposed model and derived a connection with the UAI
model, showing that the proposed framework is strictly superior to UAI. Empirical results on
benchmark datasets show that the proposed framework outperforms previous works with large
relative improvements.
Chapter 5
Invariant Representations through Adversarial Forgetting
5.1 Introduction
We present a novel framework for invariance in DNNs that promotes the removal of information
about undesired factors s from latent representations z̃ through an adversarial forgetting
mechanism (Jaiswal et al., 2020). Inspired by (1) the discovery of richer features in DNN classifiers
upon augmentation with reconstruction objectives (Sabour, Frosst, and Hinton, 2017a; Zhang, Lee, and
Lee, 2016), and (2) the forgetting operation in Long Short-Term Memory (LSTM) (Hochreiter and
Schmidhuber, 1997) cells, the presented framework adopts the idea of "discovery and separation of
information" for invariance to specific s. The working principle of the model is to map data x
to an intermediate embedding z that encodes everything about x and then use a forget-mask to
filter out s-related information from z while retaining information about the prediction target y
to produce the invariant z̃. More specifically, an encoder network generates a latent code z from
x, which is used to reconstruct x through a decoder. At the same time, a forget-gate network
generates a mask m from x, which is multiplied elementwise with z to produce z̃. The encoding z̃
is then used by a predictor to infer the target y. These components of the framework are trained
adversarially with a discriminator that aims to predict s from z̃. However, during training,
gradients from the discriminator are allowed to affect only the training of the forget-gate. Finally,
the framework is augmented with a regularizer that pushes the components m_i of the forget-mask
to be close to either 0 or 1, inducing disentanglement within the components of z and effective
masking of s to produce an invariant representation z̃.
We show that the forgetting mechanism is equivalent to a bound on the mutual information
I(z̃ : z) and that, coupled with the y-prediction task, it can be interpreted as an information
bottleneck. Furthermore, by the data processing inequality, the forgetting mechanism bounds
I(z̃ : s). The generated mask can be manipulated via adversarial training to remove information
about s from z. Empirical results show that the presented framework exhibits state-of-the-art
performance for invariance to both nuisance and biasing factors across a diverse collection of
datasets and tasks. Furthermore, unlike previous methods for invariance, the presented framework
can be extended to the multi-task learning setting in a straightforward manner, where each task has
its own set of associated undesired factors. Experiments in the multi-task setting show that the
presented framework works effectively there as well.
5.2 Invariance through Adversarial Forgetting
Invariant representation learning aims to produce a mapping of data (x) to a code (z̃) that is
minimally informative of undesired factors of data (s) but maximally discriminative for the
prediction task (y). Two cases arise from this formulation: (1) s is a nuisance, i.e., there is
little or no information shared between s and y asymptotically (e.g., pose in face recognition),
and (2) s contains biasing factors, i.e., there is correlation between s and y, but for other
outside reasons it is necessary to exclude these biases from the prediction process (e.g., age,
gender, race, etc. in socio-economic prediction tasks trained on historical data). Invariance to s
leads to robust models that generalize better on test data in case (1), and produces fair models
that do not incorporate s while making predictions in case (2).
Figure 5.1: Adversarial forgetting framework for invariant representation learning (encoder E,
forget-gate F, decoder R, predictor P, and discriminator D).
We induce invariance to s within a DNN using a novel approach of adversarial forgetting,
which is inspired by forget-gates in Long Short-Term Memory (LSTM) cells in recurrent neural
networks (Hochreiter and Schmidhuber, 1997). Within the proposed framework, the model
learns to embed everything about a data sample into an intermediate representation, which is
transformed into an invariant representation through multiplication with a mask generated by an
adversarial forgetting mechanism. Figure 5.1 shows the complete framework design. Data samples
x are encoded into an intermediate representation z using an encoder E, while a forget-mask m
(m_i ∈ (0, 1)) is simultaneously produced from x through a forget-gate network F. The invariant
representation z̃ is then computed as the element-wise multiplication z̃ = z ⊙ m. This is similar to
how forget gates are used in LSTMs to "forget" certain information learned from past data. A
decoder R is used to reconstruct x from z, such that E learns to encode everything about x into
z. A predictor P infers y from z̃, while an adversarial discriminator D tries to predict s from z̃.
Hence, the combined objectives of P and D aim to allow only factors of x that are predictive of y
but not of s to pass from z to z̃.

The complete framework is trained with an adversarial objective such that the discriminator
is pitted against all the other modules, as depicted with colors in Figure 5.1. The discriminator
is allowed a more active role in the development of the forget-masks m by allowing adversarial
gradients from the discriminator to flow only to the forget-gate network and not to the encoder
during training. This is illustrated in the figure with the break on the arrow between z and
the multiplication operation. In order to further encourage the development of masks that truly
filter out some information from z but retain everything else, a mask regularizer of the form
m^T(1 − m) is added to push the components of m toward either 0 or 1. We found the mask regularizer
to always improve results in our experiments. The complete training objective can be written as:

    min_{E,F,P,R} max_D J(E, F, P, R, D),  where:
    J(E, F, P, R, D) = L_y(y, P(z̃)) + λ L_x(x, R(z)) + α L_s(s, D(z̃)) + ρ m^T(1 − m)
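A minimal sketch of a single forward pass under this objective is given below; the sigmoid forget-gate output and the use of tf.stop_gradient to realize the broken arrow of Figure 5.1 are the essential points, while the module architectures are illustrative assumptions.

```python
import tensorflow as tf

def adversarial_forgetting_forward(encoder, forget_gate, predictor, decoder, x):
    """One forward pass of the framework in Figure 5.1.

    The mask multiplies a gradient-stopped copy of z, so gradients arriving
    through z-tilde (including the discriminator's) reach only the
    forget-gate F and never the encoder E.
    """
    z = encoder(x)                       # intermediate code: everything about x
    m = forget_gate(x)                   # forget-mask, m_i in (0, 1) via sigmoid
    z_tilde = tf.stop_gradient(z) * m    # invariant representation z-tilde = z o m
    x_hat = decoder(z)                   # reconstruction keeps z information-rich
    y_hat = predictor(z_tilde)           # prediction uses only the masked code
    # Mask regularizer m^T (1 - m): pushes each m_i toward 0 or 1.
    mask_reg = tf.reduce_mean(tf.reduce_sum(m * (1.0 - m), axis=-1))
    return z, m, z_tilde, x_hat, y_hat, mask_reg
```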
The proposed model is trained using a scheduled update scheme similar to the training mechanism
of the UAI model (Chapter 3). Hence, the weights of the discriminator are frozen when updating
the rest of the model and vice versa. The adversarial training of the proposed model would benefit
from training the discriminator to convergence before any update to the other modules, following
the intuition presented in Chapter 3. However, in practice this is infeasible, and training the
discriminator much more frequently than the rest of the model (depending on the nature of the
prediction task and the dataset) is sufficient to achieve good performance. This is especially true
because the training of the discriminator is resumed from its previous state rather than restarting
from scratch after every update to the other modules. Therefore, the weights of the discriminator
and the rest of the model are updated in a frequency ratio of k : 1. We found k = 10 to work
well in our experiments.
The model training does not incorporate the popular approach of gradient reversal (Ganin
et al., 2016). The targets of the discriminator D are set to the ground-truth s labels while updating
D, but to random s values (sampled from the empirical s-distribution) when the parameters of the
rest of the model are updated. Hence, D tries to predict the correct s during its training phase, but
the rest of the model is updated to elicit random-chance performance at s-prediction, which leads
to the desired invariance to s. The model was implemented in Keras with the TensorFlow backend.
The Adam optimizer was used with a 10⁻⁴ learning rate and 10⁻⁴ decay. The hyperparameters were
tuned through grid search: λ, ρ ∈ {10⁻², 10⁻¹, 10⁰}, with α ∈ {10⁻², 10⁻¹, 10⁰} for nuisance s and
α ∈ {10², 10³} for biasing s.
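The scheduled updates and the random-target trick could be wired together as in the sketch below; shuffling the batch's s-labels as a stand-in for sampling the empirical s-distribution is an assumption made for brevity.

```python
import tensorflow as tf

def train(model_step, disc_step, batches, k=10):
    """Alternate updates in a k:1 ratio (k = 10 in our experiments).

    disc_step fits the discriminator D to the ground-truth s with the rest of
    the model frozen; model_step updates E, F, P, and R with D frozen, using
    resampled s-targets so that D's output on z-tilde is driven toward
    random-chance performance (no gradient reversal is used).
    """
    for step, (x, y, s) in enumerate(batches):
        disc_step(x, s)  # D resumes from its previous state each time
        if (step + 1) % k == 0:
            s_random = tf.random.shuffle(s)  # draw from the empirical s-distribution
            model_step(x, y, s_random)
```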
The proposed framework can also be extended to multi-task settings, with z treated as the
common encoding for tasks involving the prediction of targets {y^(1), y^(2), …, y^(n)} with
corresponding undesired factors {s^(1), s^(2), …, s^(n)}. Forget-gates F^(j) and predictors P^(j)
are added to the framework, one for each prediction task y^(j), to generate associated masks m^(j)
and invariant representations z̃^(j) from z through adversarial training with discriminators D^(j),
each of which tries to predict s^(j) from z̃^(j). Thus, the multi-task extension of the proposed
model is intuitive and straightforward.
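A sketch of these per-task heads over a shared encoder is given below; the layer types, sizes, and names are illustrative assumptions rather than the configuration used in the experiments.

```python
import tensorflow as tf

def build_task_heads(num_tasks, z_dim, y_classes, s_classes):
    """One forget-gate F^(j), predictor P^(j), and discriminator D^(j) per task."""
    gates = [tf.keras.layers.Dense(z_dim, activation="sigmoid", name=f"F{j}")
             for j in range(num_tasks)]
    predictors = [tf.keras.layers.Dense(y_classes[j], name=f"P{j}")
                  for j in range(num_tasks)]
    discriminators = [tf.keras.layers.Dense(s_classes[j], name=f"D{j}")
                      for j in range(num_tasks)]
    return gates, predictors, discriminators

def multitask_forward(encoder, gates, predictors, x):
    """Each task masks the shared z with its own m^(j) to get z-tilde^(j)."""
    z = encoder(x)
    z_detached = tf.stop_gradient(z)   # discriminator gradients reach only F^(j)
    outputs = []
    for F, P in zip(gates, predictors):
        z_tilde_j = z_detached * F(x)  # task-specific invariant representation
        outputs.append(P(z_tilde_j))
    return z, outputs
```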
5.3 Characterizing Forgetting with Forget-gate
Forget-gates were introduced as components of LSTMs, where they cause "forgetting" of information
from the past in the recurrence formulation, conditioned on the input at a given step as well as
the existing state. A forget-gate typically produces a mask m with m_i ∈ (0, 1) that is multiplied
elementwise with a latent encoding within an LSTM cell. Thus, the forget-gate can scale or
remove information in each dimension of the encoding but cannot add information to it. Inspired
by this formulation of "forgetting" information, we employ forget-gates to induce invariance to
undesired s, i.e., to "forget" s-related information. In this section, we characterize the erasure
properties of the forget-gate in the proposed framework. Intuitively, if a mask element m_i = 0, the
information passed from z_i to z̃_i is also zero; likewise, if m_i = 1, the information passed is
complete. We characterize here the behavior for m_i ∈ (0, 1), showing that under reasonable
assumptions there is a non-trivial forgetting regime besides zero. We also show that the forget-gate
acts as an information bottleneck, which can be manipulated to induce invariance, with similar
intuition to other bottleneck models (Achille and Soatto, 2018a; Achille and Soatto, 2018b).
5.3.1 Forget-gate
The proposed model generates a d-dimensional encoding z of x as well as a forgetting mask m
with components m_i ∈ (0, 1). These are multiplied element-wise to produce z̃ = z ⊙ m. We
consider the multiplication as a noisy operation z̃ = z ⊙ m + ε with a small ε in order to facilitate
a theoretical analysis of "forgetting". We discuss the practicalities of ε later in Section 5.3.2.
Assuming ε ∼ N(0, σ_ε I), we get P(z̃|z) ∼ N(z ⊙ m, σ_ε I). In each dimension, we get:

    I(z̃_i : z_i) = H(z̃_i) − H(z̃_i|z_i)        (5.1)
                 = H(z̃_i) − H(ε_i)        (5.2)
                 = H(z̃_i) − (1/2) log Var(ε_i) − (1/2) log(2πe)        (5.3)
where H(ε_i) is constant with respect to z_i and m_i. Thus, the information passed from z_i to z̃_i
is proportional to H(z̃_i). Assuming that Var(z_i) is defined, the max-entropy Gaussian upper bound
gives us the following in each dimension of the embedding (Cover and Thomas, 2012):

    H(z̃_i) = H(m_i z_i + ε_i)        (5.4)
            ≤ (1/2) log( Var(m_i z_i) + Var(ε_i) ) + (1/2) log(2πe)        (5.5)
If m is non-random, the mutual information is:

    I(z̃_i : z_i) = H(z̃_i) − H(ε_i)        (5.6)
                 ≤ (1/2) log( m_i² Var(z_i) + Var(ε_i) ) − (1/2) log Var(ε_i)        (5.7)
When m_i → 1, assuming that Var(ε_i) ≪ Var(z_i), H(z̃_i) ≈ H(z_i). As m_i → 0, z̃_i = ε_i, so
H(z̃_i) → H(ε_i) and I(z̃_i : z_i) → 0. Importantly, when m_i < √Var(ε_i) yet away from zero, the
information loss is still non-trivial. The result in Equation 5.7 was derived assuming that m is
fixed. We can
extend this bound to random m, including masks dependent on x as described in Section 5.2, by
using the following identity (shown in Appendix A):

    Var(m_i z_i) ≤ 2 Var( (m_i − E[m_i]) z_i ) + 2 E[m_i]² Var(z_i)        (5.8)
This makes the bound on I(z̃_i : z_i):

    I(z̃_i : z_i) ≤ (1/2) log( 2 Var( (m_i − E[m_i]) z_i ) + Var(ε_i) + 2 E[m_i]² Var(z_i) )
                  − (1/2) log Var(ε_i)        (5.9)
Though more difficult to interpret, this has approximately the same characteristics as the
fixed-m_i case. In order to extend this to the multivariate case (from a bound on I(z̃_i : z_i) to
one on I(z̃ : z)), we first note that the max-entropy bound still holds for multivariate Gaussians,
and by Hadamard's inequality we can bound that distribution by its diagonal as:

    H(z̃) ≤ log det( Σ_z̃ + σ_ε I ) + (d/2) log(2πe)
         ≤ log det( Σ_z̃^diag + σ_ε I ) + (d/2) log(2πe)        (5.10)
where Σ_z^diag has only the diagonal elements of Σ_z and zeros elsewhere. This gives us the bound:

    I(z̃ : z) ≤ Σ_i log( Var(m_i z_i) + Var(ε_i) )        (5.11)

The max-entropy bound fails if there are degenerate elements of z, i.e., completely duplicate
channels, but it still holds on subsets of channels without duplicates. We have somewhat abused
notation in this section; strictly, our bound is on I(z̃ : (z, m)). While this is less intuitive, the
distinction is necessary for the data processing inequalities showing, e.g., that I(z̃ : s) is
controlled by the forget-gate.
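As a quick numeric illustration of the single-dimension bound in Equation 5.7 (with arbitrary illustrative variances), the information ceiling shrinks smoothly as m_i approaches zero:

```python
import numpy as np

var_z, var_eps = 1.0, 1e-4  # illustrative variances for z_i and the eps-noise
for m in [1.0, 0.5, 0.1, 0.01, 0.0]:
    bound = 0.5 * np.log(m ** 2 * var_z + var_eps) - 0.5 * np.log(var_eps)
    print(f"m_i = {m:5.2f}  ->  I(z~_i : z_i) <= {bound:6.3f} nats")
# m_i = 1.0 allows ~4.6 nats of capacity here; m_i = 0.0 allows exactly 0.
```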
5.3.2 Practicalities of ε-noise and bottleneck-mask intuition
With exact computation, information is only lost when m_i = 0, because scaling by m_i ∈ (0, 1) is an
isomorphic map. In real computation, however, multiplication operations are not isomorphic due
to imprecision in floating-point arithmetic. These imprecisions, alongside commonly undertaken
computational procedures (e.g., clipping), induce a non-trivial forgetting region under which z_i
with "reasonable" variance may be forgotten. Thus, practically speaking, we do not need to add
ε-noise artificially to erase information even when m_i ≠ 0, and we do not need m_i to be
exactly zero to lose information. The background noise of computation is sufficient for I(z̃ : z)
to be smoothly controlled outside of zero m_i. The proposed framework implements exactly this
control on I(z̃ : z), and thereby on I(z̃ : x). Further, since it also optimizes a distortion measure
between z̃ and y, this forms an information bottleneck (Alemi et al., 2016). For categorical y
with cross-entropy loss, the minimization of m_i coupled with the prediction task is equivalent to
the bottleneck objective from (Tishby, Pereira, and Bialek, 1999), parameterized by neural networks
in (Achille and Soatto, 2018a; Alemi et al., 2016).

The goal of this work, however, is to generate invariant representations and not necessarily
optimal bottleneck representations. In order to induce invariance to specific factors s, we learn
the parameters of the bottleneck (forget-gate) so that it filters this information out of z using
the mechanism described in Section 5.2. This encourages the encoder and the forget-gate to
generate z̃ with minimal I(z̃ : s). The adversary operates on only m and can thus be thought of
as optimizing element-wise the channel between z and z̃, minimizing I(z̃ : s) (or the equivalent
general co-dependence term).
The masks generated by the forget-gate have an intuitive interpretation: for each component,
the mask either allows information to pass from z_i to z̃_i or does not. The overall design of the
framework causes a separation of the factors of x that are correlated with s from those that are
not, so that they occur in different components of z, allowing a component-wise mask to effectively
include or exclude
Table 5.1: Chairs results (random chance of s = 0.25).

    Model                                            Accuracy of y   Accuracy of s
    NN+MMD (Li, Swersky, and Zemel, 2014)            0.73 ± 0.02     0.46 ± 0.04
    VFAE (Louizos et al., 2016)                      0.72 ± 0.04     0.37 ± 0.02
    CAI (Xie et al., 2017)                           0.68            0.69
    CVIB (Moyer et al., 2018)                        0.67 ± 0.01     0.52 ± 0.01
    UAI (Jaiswal et al., 2018d; 2019d)               0.74            0.34
    Adversarial Forgetting (Jaiswal et al., 2020)
    (Ours)                                           0.84 ± 0.01     0.25 ± 0.00
    Improvement over UAI                             38.5%           100%
factors from the final representation z̃ without cross-factor considerations. Removal of a
nuisance s does not penalize the training objective (the objective J in Section 5.2). However, for
biasing s correlated with y, the forget-gate will choose whether to allow their inclusion in z̃
based on the loss-weights in the objective. Finally, it is interesting to note that calculating
(1 − m) ⊙ z from a trained model provides an embedding that contains everything that m does not
allow to pass from z to z̃.
5.4 Experimental Evaluation
The proposed framework is compared with NN+MMD, VFAE, CAI, CVIB, and UAI. Performance
is evaluated on two metrics: the accuracy of predicting y from z̃ (A_y) using the jointly trained
predictor, and that of predicting s from z̃ (A_s) using a two-layer neural network trained post hoc.
While a high A_y is desired, for true invariance A_s should be random chance for nuisance s and the
share of the majority s-class for biasing s. Mean and standard deviation are reported based on five
runs, except when results are quoted from previous works. Relative improvements in error-rate are
also reported, with the error-rate for A_s defined as the gap between the observed A_s and its
optimal value. We further report evaluation results of the framework in a multi-task setting, as
described in Section 5.2.
Figure 5.2: Chairs dataset: t-SNE visualization of z and z̃ labeled with orientation
class (s). Visualizations with chair-type (y) annotations are not shown because there
are 1,393 y classes. The invariant encoding z̃ shows no clustering by orientation as s is
masked out of z, which exhibits s-grouping.
5.4.1 Robustness through invariance to nuisance factors
Invariance to nuisance factors is evaluated on the Chairs (Jaiswal et al., 2018d), Extended
Yale-B (Georghiades, Belhumeur, and Kriegman, 2001), and MNIST-ROT (Jaiswal et al., 2018d)
datasets. The network architectures for the forget-gate and the encoder are kept the same in all
experiments. Besides the quantitative results, we show t-SNE plots of z and z̃ to visualize the
transformation of the latent space brought about by the elementwise multiplication z̃ = z ⊙ m. The
goal of these plots is to show that invariance is indeed brought about by the adversarial forgetting
operation. We also show reconstructions generated from z and z̃ using separate post hoc networks
trained for decoding x from z and z̃. The goal of these results is not to show perfect
reconstruction, but to visually substantiate that s is encoded in z but not in z̃.
Chairs. This is a dataset of images of 1,393 types of chairs at 31 yaw and two pitch angles. We
use the same version of this dataset as described in Chapter 3, which is split into training and
testing sets by picking alternate yaw angles, such that there is no overlap of angles between the
Figure 5.3: Chairs reconstructions. Columns in each block (left to right): original
image, reconstruction from z̃, and reconstruction from z. Reconstructions from z̃ do not contain
orientation (yaw) information.
Table 5.2: Extended Yale-B results (random chance of s = 0.2).

    Model                                            Accuracy of y   Accuracy of s
    NN+MMD (Li, Swersky, and Zemel, 2014)            0.82            --
    VFAE (Louizos et al., 2016)                      0.85            0.57
    CAI (Xie et al., 2017)                           0.89            0.57
    CVIB (Moyer et al., 2018)                        0.82 ± 0.01     0.45 ± 0.03
    UAI (Jaiswal et al., 2018d; 2019d)               0.95            0.24
    Adversarial Forgetting (Jaiswal et al., 2020)
    (Ours)                                           0.95 ± 0.01     0.20 ± 0.01
    Improvement over UAI                             --              100%
two sets. The chair type is treated as y, and the yaw angle, binned into four classes (front, left,
right, and back), as s. We use the same architectures for the encoder, the predictor, and the
decoder as UAI (Chapter 3), i.e., two-layer networks. The discriminator is modeled as a two-layer
network. Results of our experiments are summarized in Table 5.1. The proposed model achieves large
improvements over UAI on both A_y and A_s, with A_s exactly at random chance. Figure 5.2
shows the t-SNE visualization of z and z̃, which exhibits that z clusters by orientation but z̃ does
not. Reconstructions from z̃ and z are presented in Figure 5.3, showing that z contains both y
and s information while z̃ encodes y but not s.
Figure 5.4: Extended Yale-B: t-SNE visualization of z and z̃ labeled with lighting
direction (s) and subject identity (y). Note that subject identities are marked
with numbers and not colors. The invariant encoding z̃ shows no clustering by lighting
direction but groups by subject identity.
Extended Yale-B. This is a dataset of face images of 38 subjects captured under various
lighting conditions. The prediction target is the subject ID, while the nuisance is the lighting
condition binned into five classes (four corners and frontal). For each subject, one image from
each s-category is used for training and the rest of the dataset is used for testing (Jaiswal et al.,
2018d; Louizos et al., 2016; Xie et al., 2017). We use the same architecture for the encoder
and the predictor as previous works (Jaiswal et al., 2018d; Xie et al., 2017), i.e., one layer for
each of these modules. The decoder and the discriminator are modeled with two layers each.
Figure 5.5: Extended Yale-B reconstructions. Columns in each block (left to right):
original image, reconstruction from z̃, and reconstruction from z. Reconstructions from z̃ do not
contain lighting information.
Results of our experiments are shown in Table 5.2. The proposed model exhibits state-of-the-art
performance on both the A_y and A_s metrics, with A_s exactly at random chance. This shows that
the proposed model is able to completely remove information about the lighting direction from
the latent embedding while retaining a high A_y. Figure 5.4 shows the t-SNE visualizations of z
and z̃ and validates this claim, as the clustering of the latent embedding completely changes from
grouping by s for z to grouping by y for z̃. This is shown qualitatively through the reconstruction
of images from z and z̃ in Figure 5.5. Reconstructions from z̃ have no indication of
the lighting direction but retain the identity information. On the other hand, reconstructions
from z look very similar to the original images with the lighting direction intact, showing that
the forget-gate successfully removes s.
MNIST-ROT. This is a variant of the MNIST (LeCun et al., 1998) dataset, which is augmented
with digits rotated at ±22.5° and ±45°. The digit class is treated as y and the rotation angle as
a categorical s. We use the same architectures for the encoder, the predictor, and the decoder as
UAI (Chapter 3), i.e., two layers each for the encoder and the predictor and three layers for the
decoder. The discriminator is modeled with two layers. Table 5.3 summarizes the results of our
experiments on test images with rotation angles both seen (Θ) and unseen ({±55°, ±65°}) during
training. The proposed model achieves not only the best A_y but also an A_s that is exactly random
chance,
Table 5.3: MNIST-ROT results (random chance of s = 0.2). Θ represents the angles
seen during training, i.e., {0°, ±22.5°, ±45°}.

    Model                         Acc. y (Θ)       Acc. y (±55°)    Acc. y (±65°)    Acc. s
    NN+MMD (Li et al., 2014)      0.970 ± 0.001    0.831 ± 0.001    0.665 ± 0.002    0.380 ± 0.011
    VFAE (Louizos et al., 2016)   0.953 ± 0.004    --               --               0.389 ± 0.076
    CAI (Xie et al., 2017)        0.958            0.829            0.663            0.384
    CVIB (Moyer et al., 2018)     0.960 ± 0.008    0.819 ± 0.007    0.674 ± 0.009    0.382 ± 0.005
    UAI (Jaiswal et al., 2018d)   0.977            0.856            0.696            0.338
    Adversarial Forgetting
    (Jaiswal et al., 2020) (Ours) 0.991 ± 0.001    0.863 ± 0.001    0.730 ± 0.001    0.201 ± 0.001
    Improvement over UAI          60.9%            4.86%            11.18%           99.3%
showing that it is able to successfully filter out s while retaining more information about y,
leading to more accurate y-predictions. The t-SNE visualizations of z and z̃ shown in Figure 5.6
further validate this, as z̃ is clustered by y with uniformly distributed s, while z shows distinct
groups of s within each digit cluster. Figure 5.7 shows reconstructions from z̃ and z. While z
contains both y and s information, the forget-gate successfully masks out rotation information
while allowing factors relevant for y to pass on to z̃.
5.4.2 Fairness through invariance to biasing factors
Preprocessed versions (Louizos et al., 2016; Xie et al., 2017) of the Adult (Dheeru and Karra
Taniskidou, 2017) and German (Dheeru and Karra Taniskidou, 2017) datasets are popularly employed
for evaluating models in fairness settings. We conduct experiments on these datasets using
architectures similar to VFAE (Louizos et al., 2016) and UnifAI (Chapter 3) for the encoders (two
layers), predictors (one layer), and decoders (two layers), along with a two-layer discriminator.
Figure 5.6: MNIST-ROT: t-SNE visualization of z and z̃ labeled with rotation angle
(s) and digit class (y). The invariant z̃ shows no clustering by s while z shows clear
s-subgroups within each y-cluster.

Figure 5.7: MNIST-ROT reconstructions. Columns in each block (left to right): original
image, reconstruction from z̃, and reconstruction from z. Reconstructions from z̃ do not contain
rotation information.
Adult. This is a dataset of 45,222 individuals, where y reflects whether a person has more than
$50,000 in savings and the biasing s is age. Results of experiments on this dataset are shown in
Table 5.4. The proposed model completely removes information about s, as reflected by A_s being
Table 5.4: Adult results (majority class of s = 0.67).

    Model                                            Accuracy of y   Accuracy of s
    NN+MMD (Li, Swersky, and Zemel, 2014)            0.75 ± 0.00     0.67 ± 0.01
    VFAE (Louizos et al., 2016)                      0.76 ± 0.01     0.67 ± 0.01
    CAI (Xie et al., 2017)                           0.83            0.89
    CVIB (Moyer et al., 2018)                        0.69 ± 0.01     0.68 ± 0.01
    Adversarial Forgetting (Jaiswal et al., 2020)
    (Ours)                                           0.85 ± 0.00     0.67 ± 0.00
    Improvement over VFAE                            37.5%           --
the same as the population share of the majority s-class (0.67). Previous works have also achieved
a perfect score on A_s, but the proposed model outperforms them on A_y, showing that it is able to
retain more information relevant for predicting y while being fair by successfully filtering out
only age-related factors.
German. This dataset contains attributes of 1,000 people for classifying whether a person has a
good credit rating, with gender as the biasing s. Table 5.5 shows the results of our experiments.
As evident from A_s being the same as the population share of the majority s-class (0.80), the
proposed model generates latent embeddings that are completely invariant to s. While previous
works have also achieved a perfect A_s score, the proposed model outperforms them on A_y, showing
that it is able to retain more information about y in the invariant encoding while being
unbiased with respect to gender.
5.4.3 Invariance in multi-task learning
The proposed framework is evaluated on the dSprites (Matthey et al., 2017) dataset of shapes with
independent factors: color, shape, scale, orientation, and position. The dataset was preprocessed
following Higgins et al. (2018), resulting in two classes for scale and four each for position and
orientation. Shape (y^(1)) and scale (y^(2)) are treated as the prediction tasks, where shape is
desired
Table 5.5: German results (majority class of s = 0.8).

    Model                                            Accuracy of y   Accuracy of s
    NN+MMD (Li, Swersky, and Zemel, 2014)            0.74 ± 0.01     0.80 ± 0.00
    VFAE (Louizos et al., 2016)                      0.70 ± 0.00     0.80 ± 0.00
    CAI (Xie et al., 2017)                           0.70            0.81
    CVIB (Moyer et al., 2018)                        0.74 ± 0.00     0.80 ± 0.00
    Adversarial Forgetting (Jaiswal et al., 2020)
    (Ours)                                           0.76 ± 0.00     0.80 ± 0.00
    Improvement over CVIB                            7.7%            --
Table 5.6: Results on dSprites in the multi-task setting. Random chance of s for task #1 is
0.5 and for task #2 is 0.25. Previous state-of-the-art methods (Jaiswal et al., 2018d; Li, Swersky,
and Zemel, 2014; Louizos et al., 2016; Moyer et al., 2018; Xie et al., 2017) cannot be
applied to multi-task settings.

    Model                    Task #1: y^(1)   Task #1: s^(1)   Task #2: y^(2)   Task #2: s^(2)
    Baseline                 0.99             0.94             0.99             0.40
    Adversarial Forgetting   0.99             0.50             0.99             0.25
to be invariant to position (s^(1)) and scale to orientation (s^(2)). We use the same component
networks that we used for Extended Yale-B. We compare results with a version of our model
without the decoder, the maskers, and the discriminators, i.e., one in which both y^(1) and y^(2)
are predicted directly from z. Evaluation could not be conducted with NN+MMD, VFAE, CAI, or CVIB
because they have a single z̃ and all tasks would have to be invariant to the same s, nor with UAI
because it works only for a single y. Table 5.6 presents the results. The proposed framework
achieves the same accuracies for y^(1) and y^(2) as the baseline, while maintaining random-chance
accuracies for s^(1) (0.5) and s^(2) (0.25) as opposed to the significantly higher corresponding
scores of the baseline. Hence, the proposed framework works effectively in multi-task settings.
5.5 Summary
We have presented a novel framework for invariance induction in supervised neural networks through
the "forgetting" of information related to unwanted factors from the latent space. We showed that
the forget-gate used in the proposed framework acts as an information bottleneck and that
adversarial training encourages the generation of forget-masks that remove unwanted factors.
Results of extensive experimental evaluation show that the proposed model exhibits
state-of-the-art performance in both nuisance and bias settings.
Part II
APPLICATIONS
Chapter 6
Robust Presentation Attack Detection through Unsupervised Adversarial Invariance
6.1 Introduction
Biometric identity authentication technologies based on computer vision, such as face and iris
recognition, have become ubiquitous in recent times. However, biometric authentication methods
are still prone to presentation attacks, where spoof samples (e.g., printed pictures or videos of a
person) are presented to the biometric sensor in an attempt to gain unauthorized access.
Furthermore, other factors, such as the rising accessibility of 3D printing technology and the ease
of capturing very realistic high-resolution images and videos of people's faces due to advancements
in camera technologies as well as generative adversarial networks (Goodfellow et al., 2014), make
creating these presentation attacks much easier. As illustrated in Figure 6.1, which shows samples
of genuine faces and presentation attacks, learning subtle features to differentiate the two is very
challenging even for humans. Therefore, it is crucial to augment face recognition systems with
presentation attack detection (PAD) methods in order to improve the security of face
authentication systems.
Presentation attack detection methods can be broadly categorized into two classes. The first
class of methods depends on augmenting the biometric authentication hardware with an additional
sensor that provides auxiliary data that can be used (with or without the original biometric data)
Figure 6.1: Which are genuine images and which are presentation attacks? (Panels (a)-(d);
(b) and (c) are genuine faces for authentication.)
by a presentation attack detection algorithm. For example, light field cameras (LFCs) have been
used to capture multiple depth images of faces, which are then analyzed through a rule-based
scheme for PAD (Raghavendra, Raja, and Busch, 2015). This class of methods is limited by large
cost and legacy-system compatibility constraints. The second class of methods directly uses the
regular data captured by the authentication system for presentation attack detection using, for
example, signal processing and/or machine learning algorithms. These methods extract features, such
as Local Binary Patterns (LBP) (Ramachandra and Busch, 2017), and classify them as bona fide
or attack using a downstream classifier, such as a support vector machine, or use a convolutional
neural network (CNN) for both learned representation extraction and classification (Jourabloo,
Liu, and Liu, 2018). This second class of approaches for PAD is, however, inherently challenging
and has garnered tremendous research interest in recent times. While the use of deep neural
networks (DNNs) in PAD has led to major boosts in performance (Jourabloo, Liu, and Liu, 2018),
their inherent limitations, such as vulnerability to overfitting and the need for vast amounts of
training data, prevent DNN-based systems from reaching their full potential.
One such limitation of DNNs, like most machine learning models, is that they can learn
incorrect associations between nuisance factors in the raw data and the final prediction target
(e.g., pose, gender, or skin-tone nuisance factors in face recognition), leading to poor
generalization. Existing DNN-based PAD methods do not address this inherent problem and can,
hence, be made more robust by incorporating learning techniques that induce robustness through
invariance to nuisance factors.
We present RoPAD (Jaiswal et al., 2019c), a novel end-to-end deep neural network model for
presentation attack detection that robustly classifies face images as "live" (i.e., real) or
"spoof" (i.e., fake) by being invariant to visual distractors inherent in images. The invariance is
achieved by adopting the unsupervised adversarial invariance (UAI) framework (Jaiswal et al., 2018d)
presented in Chapter 3, which induces implicit feature selection and invariance to nuisance factors
within neural networks without requiring nuisance annotations. Most of the visual content in face
images is not relevant for PAD. For example, given a face image, the identity of the person,
variations in the pose of the face, fine-grained facial attributes, and elements of the background
of the image are irrelevant to PAD. Therefore, employing UAI as a core component of the presented
model makes it largely invariant to all such distractors in an inexpensive yet effective way.
The presented RoPAD model exhibits state-of-the-art performance on 3DMAD (Erdogmus and
Marcel, 2014), Idiap Replay-Mobile (Costa-Pazo et al., 2016), Idiap Replay-Attack (Chingovska,
Anjos, and Marcel, 2012), MSU-MFSD (Wen, Han, and Jain, 2015), and GCT1, a new self-collected
dataset (described in Section 6.4.1), which together cover the common forms of presentation attacks
studied in recent literature, viz., printed face-images, video replays, and 3D masks (Erdogmus and
Marcel, 2014; Ramachandra and Busch, 2017). An ablation study with a base model (BM), which does
not include UAI, shows that UAI provides a significant boost in performance. This essentially
proves that invariance to visual distractors makes PAD significantly more robust and effective.
6.2 Related Work
PAD methods for image-based biometric authentication have traditionally involved the extraction of discriminative features, such as specular reflection, blurriness, chromatic moment, and color diversity, and their analysis to distinguish live (genuine) images from spoof (fake) ones (Ramachandra and Busch, 2017; Wen, Han, and Jain, 2015). Previous works have incorporated deep learning based latent features computed offline in conjunction with linear classifiers for PAD, as well as learned representations within a neural network trained end-to-end for the PAD task (Peng, Qin, and Long, 2017).
Hand-crafted features and statistical machine learning algorithms have been extensively used in the past for the detection of print and replay attacks. For example, texture analysis through the extraction of low-level texture features has been widely utilized for spoofing detection (Määttä, Hadid, and Pietikäinen, 2011). Feature descriptors such as Local Binary Patterns (LBP) (Gragnaniello et al., 2015), Scale-Invariant Feature Transform (SIFT) (Gragnaniello et al., 2015) and Speeded Up Robust Features (SURF) (Boulkenafet, Komulainen, and Hadid, 2017) have been popularly employed in prior works to embed faces into low-dimensional encodings (Liu, Jourabloo, and Liu, 2018). In order to make such feature descriptors more discriminative for PAD, researchers have utilized different color spaces such as RGB, HSV and YCbCr (Boulkenafet, Komulainen, and Hadid, 2015). Hand-crafted feature based methods play an important role in the detection of spoofing given their simplicity and effectiveness in PAD for the domains for which they are specifically designed. For instance, several texture analysis methods (Peng, Qin, and Long, 2018) achieved good performance on the MSU MFSD dataset.
Deep learning has provided powerful approaches for the development of effective data-driven models for a plethora of computer vision tasks. Convolutional Neural Networks (CNNs) have recently been employed successfully for PAD (Nogueira, Alencar Lotufo, and Machado, 2016). Yang, Lei, and Li (2014) were early adopters of DNNs for PAD, achieving significant improvements in detection performance with simple CNN architectures. More recently, architectures such as 3D-CNNs (Gan et al., 2017) and patch-based and depth-based architectures have been used for PAD (Atoum et al., 2017). Furthermore, instead of binary supervision, some works use spatial and temporal auxiliary information to train PAD models (Liu, Jourabloo, and Liu, 2018). The proposed RoPAD is a DNN-based model for PAD from raw RGB images that uses a simple CNN architecture coupled with effective unsupervised invariance induction through UAI, without incorporating any of the aforementioned specialized architecture designs or training regimens.
6.3 Robust Presentation Attack Detection
The proposed RoPAD is a DNN model for robust presentation attack detection, which learns to
distinguish real face images from fake ones directly from RGB images in an end-to-end framework.
Robust PAD is achieved by combining invariance induction and the DNN's ability to learn highly
discriminative representations.
6.3.1 A Data Factorization View of PAD
The face image formation process $\mathcal{S}$ is a complex interaction of multiple entangled signals. For presentation attack detection, we are ultimately interested in the genuineness of a presented sample, and the entangled signals can therefore be split into two main categories: (1) signals useful for solving the anti-spoofing problem and (2) nuisances for the PAD task. Thus, a face image can be expressed as the result of different factors interacting together as $I := \mathcal{S}(\varphi, \eta)$, where $\eta$ represents the nuisances, defined as all the signals present in the input media that should not contribute to the assessment of genuineness, whereas $\varphi$ indicates all the signals that are helpful for solving the PAD task. The most common nuisances for the PAD task are the subject's identity, facial attributes, and elements of the background. In contrast, signals useful for PAD include subtle differences of specific patterns and characteristic noise affecting non-bona fide images. Given the aforementioned variables, presentation attack detection can be improved by reverse-engineering the image formation process, $\varphi^* = r_{\mathcal{S}}(\varphi, \eta)$, where $r_{\mathcal{S}}$ denotes the reverse of $\mathcal{S}$, to disentangle the important information from the irrelevant.

Specifically, given an image $I$, a PAD system needs to analyze $\varphi$ without being distracted by the other confounding factors ($\eta$) present implicitly in the media (e.g., identity, pose, background, etc.). Note that, at test time, a PAD system has access only to $I$ and no access whatsoever to the individual variables contributing to $\mathcal{S}$.
6.3.2 Base CNN Model of RoPAD
The core neural network that is responsible for learning to distinguish between real and fake images is a deep CNN composed of three convolutional blocks and a final prediction block interspersed with max-pooling operations. Each convolutional block contains three convolutional layers alternating with batch normalization. The kernel shape of each of these convolutional layers is (3×3), and max-pooling is performed over windows of (3×3). The channel size in each block is kept fixed at 256, 128, and 64 in the first, second, and third blocks, respectively. The Exponential Linear Unit (ELU) (Clevert, Unterthiner, and Hochreiter, 2015) is used as the activation function in the convolutional layers of these blocks. The prediction block contains one convolutional layer with a kernel shape of (2×2) and the hyperbolic tangent activation, followed by a reshape operation to produce a 64-dimensional embedding, which is followed by two fully connected layers of output sizes 32 and 1 to perform the final binary classification task (bona fide versus spoof). Figure 6.2a illustrates the complete architecture. The model design is based on VGG16 (Simonyan and Zisserman, 2014), with channel sizes, activation shapes, activation functions, and batch normalization being
Figure 6.2: (a) Base CNN Model of RoPAD: the base model (BM) is inspired by VGG16 (Simonyan and Zisserman, 2014), with channel sizes, activation shapes, activation functions, and batch normalization being notable modifications. The model comprises three convolutional blocks (each with three Conv 3×3 + Batch Norm + ELU layers, with channel sizes 256, 128, and 64, separated by 3×3 max-pooling) and a prediction block (Conv 2×2 with tanh, followed by Dense 32 with ReLU and Dense 1 with sigmoid). (b) RoPAD training architecture: the base model is split into an encoder (Enc) and a predictor (Pred); a decoder (Dec; Deconv 6×6, 3×3, 3×3, and 5×5 layers with 3×3 upsampling, fed through dropout) and two disentanglers (D1 and D2; Dense 64 with tanh) are attached for unsupervised adversarial invariance.
notable modifications, among others. We empirically found the proposed architecture to perform significantly better than the standard VGG16.
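To make the architecture description concrete, the following is a minimal Keras sketch of the base model. It is a sketch under stated assumptions, not the exact published implementation: the 54×54 RGB input resolution is an illustrative choice, picked so that three 3×3 max-pooling stages followed by the 2×2 convolution yield a 1×1 map and hence a 64-dimensional embedding.

```python
# Minimal sketch of the base CNN (BM); layer counts and kernel shapes follow
# Figure 6.2a, while the 54x54 input resolution is an illustrative assumption.
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, channels):
    # Three Conv(3x3) + BatchNorm + ELU triplets with a fixed channel size.
    for _ in range(3):
        x = layers.Conv2D(channels, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("elu")(x)
    return x

def build_base_model(input_shape=(54, 54, 3)):
    inp = keras.Input(shape=input_shape)
    x = inp
    for channels in (256, 128, 64):                      # Conv. Blocks 1-3
        x = conv_block(x, channels)
        x = layers.MaxPooling2D((3, 3))(x)
    x = layers.Conv2D(64, (2, 2), activation="tanh")(x)  # prediction block conv
    x = layers.Flatten()(x)                              # 64-dim embedding
    x = layers.Dense(32, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)       # bona fide vs. spoof
    return keras.Model(inp, out)
```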
6.3.3 Unsupervised Adversarial Invariance
Deep neural networks, like machine learning models in general, often learn incorrect associations between the prediction target and nuisance factors of data, leading to poor generalization (Jaiswal et al., 2018d). This is especially problematic for PAD because it is a relatively new area of research and suffers from both a lack of expert knowledge about nuisance factors and a lack of nuisance annotations. For example, in the case of face images for PAD, the identity of the person, their facial attributes, elements of the background of the image, etc. are nuisance factors for the PAD task. While some of these factors can be annotated with large investments of time and money, others like "elements of the background" are difficult to quantify concretely.
Chapter 3 introduced an unsupervised approach for learning invariance to all, including potentially unknown, nuisance factors with respect to a given supervised task. The unsupervised adversarial invariance (UAI) framework learns a split representation of data into relevant and nuisance factors with respect to the prediction task without needing annotations for the nuisance factors. The underlying mechanism of UAI is formulated as a competition between the prediction and a reconstruction objective, coupled with disentanglement between the two representations. This forces the prediction model to utilize only those factors of data that are truly essential for the supervised task at hand (here, classification of genuine versus fake samples), disregarding everything else.

The UAI framework splits a feedforward neural network into an encoder and a predictor. The encoder is modified such that it produces two representations (i.e., embedding vectors) instead of one, $z_1$ and $z_2$, where only $z_1$ is used for the prediction task. Besides the encoder and the predictor, the UAI framework consists of a decoder that reconstructs data from a noisy version of $z_1$ concatenated with $z_2$, and a pair of disentanglers that aim to predict one embedding from the other. The disentanglers are trained adversarially against the rest of the model, leading to disentanglement between the two embeddings. The aforementioned competition between the prediction and reconstruction tasks is induced by the noisy channel that connects $z_1$ to the decoder and the enforced disentanglement between $z_1$ and $z_2$, which leads to information separation such that factors of data truly relevant for the prediction task are encoded in $z_1$ and all other (nuisance) factors of data migrate to $z_2$. UAI has been shown (Jaiswal et al., 2018d) to work effectively across a diverse collection of datasets and nuisance factors. Hence, we employ UAI as an integral component of the proposed model.
6.3.4 RoPAD using UAI
As mentioned in Section 6.3.3, UAI splits the base feedforward model (illustrated in Figure 6.2a) into an encoder and a predictor. We split the base CNN model of RoPAD at the prediction block, such that all convolutional blocks as well as the convolutional layer of the prediction block are collectively treated as the encoder, while the two fully connected layers are treated as the predictor. In order to produce two embedding vectors from the encoder instead of one, as required by UAI, we duplicate the final convolutional layer of the encoder as a parallel branch emerging from the final convolutional block. The decoder is designed as a deconvolutional network with four deconvolutional layers interspersed with upsampling layers. The disentanglers are designed as single fully-connected layers, and the noisy transformer is implemented as multiplicative Bernoulli noise. The complete RoPAD architecture is shown in Figure 6.2b. RoPAD is implemented in Keras (https://keras.io/) and trained with the TensorFlow (https://www.tensorflow.org/) backend. We follow the same adversarial training strategy as described in Chapter 3, alternating between training the disentanglers and the rest of the model with 5:1 frequency.
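As a rough illustration of how these pieces fit together at training time, the sketch below wires the UAI components around the encoder in Keras. It is a sketch under assumptions, not the exact published code: `conv_blocks` stands for the three convolutional blocks above, and the dropout rate, decoder layout, and the 54×54 resolution (chosen so the decoder output matches the input) are illustrative choices.

```python
# Minimal sketch (not the exact RoPAD code) of the UAI training architecture.
# `conv_blocks` is assumed to implement the three convolutional blocks of the
# base model; embedding sizes, dropout rate, and decoder shapes are
# illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_ropad_training_model(conv_blocks, input_shape=(54, 54, 3)):
    inp = keras.Input(shape=input_shape)
    feats = conv_blocks(inp)                               # shared trunk
    # Encoder: two parallel branches produce z1 (predictive) and z2 (nuisance).
    z1 = layers.Flatten()(layers.Conv2D(64, (2, 2), activation="tanh")(feats))
    z2 = layers.Flatten()(layers.Conv2D(64, (2, 2), activation="tanh")(feats))
    # Predictor: binary real/fake output from z1 only.
    y = layers.Dense(1, activation="sigmoid", name="y")(
        layers.Dense(32, activation="relu")(z1))
    # Decoder: reconstruct the image from a noisy z1 (multiplicative Bernoulli
    # noise via dropout) concatenated with z2.
    z1_noisy = layers.Dropout(0.4)(z1)
    h = layers.Dense(2 * 2 * 64, activation="relu")(
        layers.Concatenate()([z1_noisy, z2]))
    h = layers.Reshape((2, 2, 64))(h)
    for ch in (64, 32, 16):
        h = layers.UpSampling2D((3, 3))(h)                 # 2 -> 6 -> 18 -> 54
        h = layers.Conv2DTranspose(ch, (3, 3), padding="same",
                                   activation="relu")(h)
    recon = layers.Conv2DTranspose(3, (5, 5), padding="same",
                                   activation="relu", name="recon")(h)
    # Disentanglers: each predicts one embedding from the other and is trained
    # adversarially against the rest of the model.
    z2_hat = layers.Dense(64, activation="tanh", name="d1")(z1)
    z1_hat = layers.Dense(64, activation="tanh", name="d2")(z2)
    return keras.Model(inp, [y, recon, z2_hat, z1_hat])
```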
Performing presentation attack detection with RoPAD at test time is as efficient as with the base model. Although Figure 6.2 shows a complex architecture used for training, RoPAD is actually very light at prediction time. At test time, RoPAD is reshaped so that the decoder and the disentangler components of UAI are discarded. Thus, in terms of efficiency and model complexity, the RoPAD testing model has the same structure as the base model, yet provides significantly more effective and robust predictions. In light of this, prediction with RoPAD remains as easy as a single forward pass without any additional computational cost.
Figure 6.3: From left to right, corresponding examples of image types: (1) genuine, (2) glasses with doll eye, (3) analog photo, (4) makeup, (5) paper glasses, (6) glasses with Van Dyke eye, (7) silicone mask, and (8) transparent mask.
Table 6.1: Summary of benchmark datasets for PAD
Database Institute Real/Fake Attack Types
3DMAD Idiap 170 / 85 3D masks
Replay-Attack Idiap 200 / 1000 printed & replay
Replay-Mobile Idiap 390 / 640 printed & replay
MSU MFSD MSU 110 / 330 printed & replay
Table 6.2: GCT1 – summary of real images and attacks
Type Train Validation Test
Genuine 266 70 167
Glasses with Doll Eye 26 6 16
Analog Photo 30 7 18
Makeup 14 4 9
Paper Glasses 27 7 17
Glasses with Van Dyke Eye 26 7 17
Silicone Mask 3 3 4
Transparent Mask 20 5 13
6.4 Experimental Evaluation
6.4.1 Datasets And Metrics
The proposed RoPAD model is evaluated on the following publicly available benchmark datasets for presentation attack detection: 3DMAD (Erdogmus and Marcel, 2014), Idiap Replay-Mobile (Costa-Pazo et al., 2016), Idiap Replay-Attack (Chingovska, Anjos, and Marcel, 2012), and MSU MFSD (Wen, Han, and Jain, 2015). Details of these datasets are summarized in Table 6.1. While 3DMAD exclusively contains presentation attacks involving 3D masks, the other datasets contain attacks through printed faces and video replays.

RoPAD is also evaluated on the Government Controlled Testing-1 (GCT1) dataset. GCT1 was collected by the Johns Hopkins Applied Physics Laboratory during Government testing of the IARPA Odin project; public release of GCT1 is planned by the National Institute of Standards and Technology (NIST). GCT1 contains images of about 400 subjects, including various forms of presentation attacks. The subjects were split into training, validation and testing sets, such that all images of a given subject belong to only one of the three sets. Table 6.2 summarizes the attack types and the distribution of the attacks in the training, validation and testing sets, which contain 215, 57 and 137 subjects, respectively. We will make the splits publicly available as soon as GCT1 is released by NIST.
Evaluations are performed following the protocol used in prior works for each dataset. Further, results for each dataset are reported using the same metrics that previous works used to report their performance, for fair comparison. The Half Total Error Rate (HTER) (2016) is reported for Idiap Replay-Attack and 3DMAD, the Equal Error Rate (EER) (Wen, Han, and Jain, 2015) for MSU MFSD, and the Attack Presentation Classification Error Rate (APCER) (2016), Bona Fide Presentation Classification Error Rate (BPCER) (2016) and ACER = (APCER + BPCER)/2 (Peng, Qin, and Long, 2017) for the Idiap Replay-Mobile dataset. Results of PAD performance on GCT1 are reported using APCER, BPCER, ACER, EER, and the Area Under the Receiver Operating Curve (AUC). An ablation study was performed by training and evaluating the base CNN model of RoPAD (BM) without the UAI components, and the results of these experiments are also reported.
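For reference, the ACER combination above is simply the arithmetic mean of the two error rates; a trivial helper (inputs assumed to be given as percentages) illustrates it:

```python
# Small helper for the combined PAD metric used below; APCER and BPCER are
# assumed to be given as percentages.
def acer(apcer: float, bpcer: float) -> float:
    """ACER is the arithmetic mean of APCER and BPCER."""
    return (apcer + bpcer) / 2.0
```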
6.4.2 Evaluation Results
Idiap Replay-Attack: Table 6.3 summarizes the experimental results on the Idiap Replay-
Attack dataset. As shown, the proposed model achieves a perfect HTER of 0 on this dataset,
outperforming the state of the art (Phan et al., 2016).
Table 6.3: Test HTER (%) on Idiap Replay-Attack
Method HTER
LBP+CCoLBP (Peng, Qin, and Long, 2018) 5.38
CCoLBP (Peng, Qin, and Long, 2018) 5.25
LDP+TOP (Phan et al., 2016) 1.75
BM 0.38
RoPAD (Jaiswal et al., 2019c) 0
Table 6.4: Test results (%) on Idiap Replay-Mobile.
Method ACER APCER BPCER
IQM (Costa-Pazo et al., 2016) 13.64 19.87 7.40
Gabor (Costa-Pazo et al., 2016) 9.53 7.91 11.15
LBP+GS-LBP (Peng, Qin, and Long, 2017) 1.74 2.09 1.38
LGBP (Peng, Qin, and Long, 2017) 1.50 2.08 0.91
LGBP (video) (Peng, Qin, and Long, 2017) 1.25 1.40 1.10
BM 0.90 0 1.80
RoPAD (Jaiswal et al., 2019c) 0 0 0
Idiap Replay-Mobile: In Table 6.4 we summarize the results of our experiments on the Idiap
Replay-Mobile dataset. While BM outperforms the previous state-of-the-art on the ACER and
APCER scores, the proposed RoPAD model performs the best on all metrics, achieving a perfect
score of 0 on each.
MSU MFSD: The proposed RoPAD outperforms the previous state-of-the-art models on the MSU MFSD dataset, as shown in Table 6.5. In contrast, BM performs significantly worse than the previous best model. This large performance boost is, hence, credited to the UAI component of RoPAD, and is further highlighted by the Receiver Operating Curve (ROC) shown in Figure 6.4.
3DMAD: Table 6.6 summarizes the results of our experiments on the 3DMAD dataset. While the ablation version BM performs worse than the previous state of the art, the proposed RoPAD achieves a perfect HTER of 0 on this dataset as well.
GCT1: The proposed model achieves a near-perfect score at PAD on the GCT1 dataset, as shown in Table 6.7. In comparison to the base model, the ACER and EER of RoPAD are 12.1 and 5.6
Table 6.5: Test EER (%) on MSU MFSD
Method EER
DoG-LBP+SVM (Wen, Han, and Jain, 2015) 23.10
LBP + SVM (Wen, Han, and Jain, 2015) 14.70
IDA + SVM (Wen, Han, and Jain, 2015) 8.85
LDP + TOP (Phan et al., 2016) 6.54
CCoLBP (Peng, Qin, and Long, 2018) 5.83
LBP + CCoLBP (Peng, Qin, and Long, 2018) 5.00
BM 7.15
RoPAD (Jaiswal et al., 2019c) 1.70
Figure 6.4: Receiver Operating Curves for MSU MFSD
Table 6.6: Test HTER (%) on 3DMAD
Method HTER
LBP + LDA (Erdogmus and Marcel, 2014) 0.95
BM 1.00
RoPAD (Jaiswal et al., 2019c) 0
percentage points lower, respectively, and its test AUC is 1.5 points higher, which highlights a clear improvement in performance due to the incorporation of UAI in RoPAD, as further reflected in Figure 6.5.
Table 6.7: Results on GCT1 – all metrics except AUC are reported as percentages (%)
Method APCER BPCER ACER EER Val. AUC Test AUC
BM 28.7 0.5 14.6 7.4 0.957 0.983
RoPAD (Jaiswal et al., 2019c) 4.2 1.0 2.5 1.8 1.000 0.998
Figure 6.5: Receiver Operating Curves for GCT1
In summary, taking into consideration the results of the proposed model on the aforementioned
benchmark datasets, it is evident that the proposed RoPAD outperforms previous state-of-the-art
models across the board. Results of the ablation version of the proposed model, on the other hand,
show that the UAI component of RoPAD is crucial to achieving this outstanding performance.
6.5 Summary
The increasing use of face-based biometric authentication calls for robust technologies that are protected against attacks that can fool such systems. In this chapter, we presented a novel deep neural network model, RoPAD, for detecting presentation attacks in such systems. RoPAD is designed as a deep convolutional neural network and trained to make robust predictions by employing unsupervised adversarial invariance, which makes RoPAD invariant to factors in face images that are irrelevant to presentation attack detection. Results of extensive experimental evaluation show that RoPAD achieves state-of-the-art performance at presentation attack detection.
Chapter 7
Nuisance Invariant End-to-end Speech Recognition
7.1 Introduction
With the aid of recent advances in neural networks, end-to-end deep learning systems for automatic
speech recognition (ASR) have gained popularity and achieved extraordinary performance on
a variety of benchmarks (Prabhavalkar et al., 2017; Chiu et al., 2018; Zhou et al., 2018; Jaitly
et al., 2016). End-to-end ASR models typically consist of Recurrent Neural Networks (RNNs)
with Sequence-to-Sequence (Seq2Seq) architectures and attention mechanisms (Chan et al., 2015),
RNN transducers (Rao, Sak, and Prabhavalkar, 2017), or transformer networks (Zhou et al.,
2018). These systems learn a direct mapping from an audio signal sequence to a sequence of
text transcriptions. However, the input audio sequence often contains nuisance factors that are
irrelevant to the recognition task and the trained model can incorrectly learn to associate some
of these factors with the target variables, which leads to overfitting. For example, besides linguistic
content, speech data contains nuisance information about speaker identities, background noise, etc.,
which can hurt the recognition performance if the distributions of these attributes are mismatched
between training and testing.
A common method for combatting the vulnerability of deep neural networks to nuisance factors
is the incorporation of invariance induction during model training. For example, invariant deep
models have achieved considerable success in computer vision (Jaiswal et al., 2018d; Jaiswal
et al., 2019c; Jaiswal et al., 2019d) and speech recognition (Serdyuk et al., 2016; Meng et al.,
2018; Hsu and Glass, 2018; Liang, Huang, and Lipton, 2018). Serdyuk et al. (2016) obtain
noise-invariant representations by employing noise-condition annotations and the gradient reversal
layer (Ganin and Lempitsky, 2014) for acoustic modeling. Similarly, Meng et al. (2018) utilize
speaker information to train a speaker-invariant model for senone prediction. Hsu and Glass (2018)
extract domain-invariant features using a factorized hierarchical variational autoencoder. Liang,
Huang, and Lipton (2018) force their end-to-end ASR model to learn similar representations for
clean input instances and their synthetically generated noisy counterparts.
While these methods work well at handling discrepancies between training and testing datasets for ASR systems, they require domain knowledge (Hsu and Glass, 2018), supplementary nuisance information during training (e.g., speaker identities (Meng et al., 2018), recording environments (Serdyuk et al., 2016), etc.), or pairwise data (Liang, Huang, and Lipton, 2018). These requirements are difficult and expensive to fulfill in the real world; e.g., it is hard to enumerate all possible nuisance factors and collect the corresponding annotations.
In this work, we present a new training scheme (Hsu, Jaiswal, and Natarajan, 2019), namely NIESR, which adopts the Unsupervised Adversarial Invariance (UAI) learning framework (Jaiswal et al., 2018d) presented in Chapter 3 for end-to-end speech recognition. Without incorporating supervised information about nuisances in the input signal features, the proposed method is capable of separating the underlying elements of speech data into two series of latent embeddings: one containing all the information that is essential for ASR, and the other containing information that is irrelevant to the recognition task (e.g., accents, background noises, etc.). Experimental results show that the proposed training method boosts ASR performance on the WSJ0, CHiME3, and TIMIT datasets. We also show the effectiveness of combining NIESR with data augmentation.
7.2 Method
In this section, we present the proposed NIESR model for nuisance-invariant end-to-end speech
recognition, where the invariance is achieved by adopting the UAI framework (Jaiswal et al.,
2018d). We begin by describing the base Seq2Seq ASR model. Subsequently, we introduce the UAI
framework for Unsupervised Adversarial Invariance induction. Finally, we present the complete
design of the proposed NIESR model.
7.2.1 Base Sequence-to-sequence Model
We are interested in learning a mapping from a sequence of acoustic spectra features $x = (x_1, x_2, \ldots, x_T)$ to a series of textual characters $y = (y_1, y_2, \ldots, y_S)$, given a dataset $\mathcal{D} = \{(x, y)_i\}_{i=1}^{N}$, following the formulation of Chan et al. (2015). We employ a Seq2Seq model for this task, which estimates the probability of each character output $y_i$ by conditioning on the previous characters $y_{1:(i-1)}$ and the input sequence $x$. Thus, the conditional probability of the entire output $y$ is:

$$p(y \mid x) = \prod_i p(y_i \mid x, y_{1:(i-1)}) \qquad (7.1)$$
A Seq2Seq model is composed of two modules: an encoder $Enc$ and a decoder $Dec$. $Enc$ transforms the input features $x$ into a high-level representation $z = (z_1, z_2, \ldots, z_T)$, i.e., $z = Enc(x)$, and $Dec$ infers the output sequence $y$ from $z$. We model $Enc$ as a stack of Bidirectional Long Short-Term Memory (BLSTM) layers with interspersed projected-subsampling layers (Zhang, Chan, and Jaitly, 2017). The subsampling layer projects a pair of consecutive input frames $(u_{2i-1}, u_{2i})$ to a single lower-dimensional frame $v_i$. We model $Dec$ as an attention-based LSTM transducer (Bahdanau, Cho, and Bengio, 2014), which employs $z$ to produce the output character sequence. At every time step, $Dec$ generates a probability distribution of $y_i$ over character sequences, which is a function
of a transducer state $r_i$ and an attention context $c_i$. We denote this function as CharDist, which is implemented as a single-layer perceptron with softmax activation:

$$r_i = \mathrm{LSTM}([y_{i-1}; c_{i-1}], r_{i-1}) \qquad (7.2)$$

$$p(y_i \mid x, y_{1:(i-1)}) = \mathrm{CharDist}(r_i, c_i) \qquad (7.3)$$
In order to calculate the attention context $c_i$, we employ the hybrid location-aware content-based attention mechanism proposed by Chorowski et al. (2015). Specifically, the attention energy $e_{i,j}$ for frame $j$ at time step $i$ takes the previous attention alignment $\alpha_{i-1}$ into account through the convolution operation:

$$e_{i,j} = w^{\top} \tanh(W r_i + V z_j + U(F * \alpha_{i-1}) + b) \qquad (7.4)$$

where $w$, $b$, $W$, $V$, $U$, and $F$ are learned parameters and $*$ denotes the convolution operation. The attention alignment $\alpha_{i,j}$ and the attention context $c_i$ are then calculated as:

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{L} \exp(e_{i,k})}, \qquad c_i = \sum_{j=1}^{L} \alpha_{i,j} z_j \qquad (7.5)$$
The base model is trained by minimizing the cross-entropy loss:

$$\mathcal{L}_y = -\sum_i \log p(y_i \mid x, y_{1:(i-1)}) \qquad (7.6)$$
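To ground Equations 7.4 and 7.5, the following is a minimal NumPy sketch of the location-aware attention step. It is not the authors' implementation; tensor names and dimensionalities are illustrative assumptions.

```python
# Minimal NumPy sketch of Eqs. 7.4-7.5; shapes are illustrative assumptions:
# r_i: (d_r,), z: (L, d_z), alpha_prev: (L,), F: (C, K) convolution filters.
import numpy as np

def attention_step(r_i, z, alpha_prev, w, b, W, V, U, F):
    # Location features: convolve the previous alignment with each filter.
    loc = np.stack([np.convolve(alpha_prev, f, mode="same") for f in F],
                   axis=1)                       # (L, C)
    # e_{i,j} = w^T tanh(W r_i + V z_j + U (F * alpha_{i-1})_j + b)  (Eq. 7.4)
    e = np.tanh(r_i @ W.T + z @ V.T + loc @ U.T + b) @ w
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax alignment, Eq. 7.5
    c = alpha @ z                                # attention context c_i
    return alpha, c
```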
7.2.2 Unsupervised Adversarial Invariance
Deep neural networks (DNNs) often learn incorrect associations between nuisance factors in the raw data and the final target, leading to poor generalization (Jaiswal et al., 2018d). In the case of ASR, the network can link accents, speaker-specific information, or background noise with the transcriptions, resulting in overfitting. In order to cope with this issue, we adopt the Unsupervised Adversarial Invariance (UAI) framework (Chapter 3) for learning invariant representations that eliminate factors irrelevant to the recognition task, without requiring any knowledge of nuisances.

The working principle of UAI is to learn a split representation of data as $z_1$ and $z_2$, where $z_1$ contains information relevant to the prediction task (here, ASR) and $z_2$ holds all other information about the input data. The underlying mechanism for learning such a split representation is to induce competition between the main prediction task and an auxiliary task of data reconstruction. In order to achieve this, the framework uses $z_1$ for the prediction task and a noisy version $\tilde{z}_1$ of $z_1$ along with $z_2$ for reconstruction. In addition, a disentanglement constraint enforces that $z_1$ and $z_2$ contain independent information. The prediction task tries to pull relevant factors into $z_1$, while the reconstruction task drives $z_2$ to store all the information about the input data because $\tilde{z}_1$ is unreliable. However, the disentanglement constraint forces the two embeddings to not contain overlapping information, thus leading to competition. At convergence, this results in a nuisance-free $z_1$ that contains only those factors that are essential for the prediction task.
7.2.3 NIESR Model Design and Optimization
The NIESR model comprises five types of modules: (1) encoders $Enc_1$ and $Enc_2$ that map the input data to the encodings $z_1$ and $z_2$, respectively, (2) a decoder $Dec$ that infers the target $y$ from $z_1$, (3) a dropout layer that converts $z_1$ into its noisy version $\tilde{z}_1$, (4) a reconstructor $Recon$ that reconstructs the input data from $[\tilde{z}_1; z_2]$, and (5) two adversarial disentanglers $Dis_1$ and $Dis_2$ that try to infer each embedding ($z_1$ or $z_2$) from the other. Figure 7.1 shows the complete NIESR model.

The encoder $Enc_1$ and decoder $Dec$ follow the base model design as described in Section 7.2.1, i.e., an attention-based Seq2Seq model for the speech recognition task. $Enc_2$ is designed to have exactly the same structure as $Enc_1$. The dropout layer is introduced to make $\tilde{z}_1$ an unreliable source of information for reconstruction, which influences the reconstruction task to extract all information about $x$ into $z_2$ (Jaiswal et al., 2018d). $Recon$ is modeled as a stack of BLSTM layers
Figure 7.1: NIESR: The two encoders $Enc_1$ and $Enc_2$ are BLSTM-based feature extractors that encode the input sequence $x$ into the representations $z_1$ and $z_2$. The two encodings are disentangled by adversarially training the two disentanglers, $Dis_1$ and $Dis_2$, which aim to predict one embedding from the other. $Dec$ is an attention-based decoder that generates the target characters $y$ from $z_1$. $Recon$ is a BLSTM-based reconstructor that decodes $z_2$ and the noisy $\tilde{z}_1$ back to the input sequence $x$.
interspersed with novel upsampling layers, which perform decompression by splitting the information in each time frame into two frames. This is the inverse of the subsampling layers (Zhang, Chan, and Jaitly, 2017) used in $Enc_1$ and $Enc_2$. The upsampling operation is formulated as:

$$[u_{2i-1}; u_{2i}] = \mathrm{BLSTM}([\tilde{z}_{1,i}; z_{2,i}], s_{i-1}) \qquad (7.7)$$

$$o_{2i} = P u_{2i}, \qquad o_{2i-1} = P u_{2i-1} \qquad (7.8)$$

where $[\cdot\,;\cdot]$ represents concatenation, $o$ is the output, and $P$ is a learned projection matrix.
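A minimal Keras-style sketch of this upsampling operation follows. The dimensionalities are illustrative assumptions; the BLSTM output at each step is simply treated as the frame pair of Equation 7.7 before the shared projection of Equation 7.8.

```python
# Minimal sketch of the upsampling layer in Eqs. 7.7-7.8; dimensionalities
# are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def upsampling_block(x, lstm_dim=200, proj_dim=200):
    # x: (batch, T, d) sequence of concatenated [z1_i; z2_i] frames.
    h = layers.Bidirectional(layers.LSTM(lstm_dim, return_sequences=True))(x)
    # Treat each BLSTM output (of size 2 * lstm_dim) as the frame pair
    # [u_{2i-1}; u_{2i}] from Eq. 7.7, i.e., two frames per input step.
    u = layers.Lambda(lambda t: tf.reshape(
        t, (tf.shape(t)[0], -1, lstm_dim)))(h)       # (batch, 2T, lstm_dim)
    # Shared learned projection P applied to every frame (Eq. 7.8).
    return layers.Dense(proj_dim, use_bias=False)(u)
```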
The adversarial disentanglers $Dis_1$ and $Dis_2$ model the UAI disentanglement constraint discussed in Section 7.2.2, following previous works (Jaiswal et al., 2018d; Jaiswal et al., 2019c; Jaiswal et al., 2019d). $Dis_1$ tries to predict $z_2$ from $z_1$, and $Dis_2$ tries to do the inverse. This is directly opposite to the desired independence between $z_1$ and $z_2$. Thus, training $Dis_1$ and $Dis_2$ adversarially against the rest of the model helps achieve the independence goal. Unlike previous works (Jaiswal et al., 2018d; Jaiswal et al., 2019c; Jaiswal et al., 2019d), the encodings $z_1$ and $z_2$ in this work are vector sequences instead of single vectors: $z_1 = (z_{1,1}, z_{1,2}, \ldots, z_{1,L})$ and $z_2 = (z_{2,1}, z_{2,2}, \ldots, z_{2,L})$. Naive instantiations of the disentanglers would perform frame-specific predictions of $z_{2,i}$ from $z_{1,i}$ and vice versa. However, each pair of $z_{1,i}$ and $z_{2,i}$ generated at time step $i$ contains information not only from frame $i$ but also from other frames across the time span, because $Enc_1$ and $Enc_2$ are modeled as RNNs. Therefore, a better method to perform disentanglement for sequential representations is to use the whole series of $z_1$ or $z_2$ to estimate every element of the other. Hence, we model $Dis_1$ and $Dis_2$ as BLSTMs.
The proposed NIESR model is optimized by adopting the UAI training strategy (Jaiswal et al., 2018d; Jaiswal et al., 2019d), i.e., playing a game where we treat $Enc_1$, $Enc_2$, $Dec$, and $Recon$ as one player, $P_1$, and $Dis_1$ and $Dis_2$ as the other player, $P_2$. The model is trained using a scheduled update scheme where we freeze the weights of one player model when we update the weights of the other. The training objective comprises three tasks: (1) predicting transcriptions from the input signal, (2) reconstruction of the input, and (3) adversarial prediction of each of $z_1$ and $z_2$ from the other. The objective of the first task is written as Equation 7.6. The goal of the reconstruction task is to minimize the mean squared error (MSE) between $x$ and the reconstructed $x'$:

$$\mathcal{L}_x = \mathrm{MSE}(Recon([\psi(Enc_1(x)); Enc_2(x)]), x) \qquad (7.9)$$
where $\psi$ denotes dropout. The training objective for the disentanglers is to minimize the MSE between the embeddings predicted by the disentanglers and the embeddings generated by the encoders. However, that of the encoders is to generate $z_1$ and $z_2$ that are not predictive of each other. Hence, in the scheduled update scheme, the targets $t_1$ and $t_2$ for the disentanglers are different when updating $P_1$ versus $P_2$, following (Jaiswal et al., 2019d). The loss can be written as:

$$\mathcal{L}_d = \mathrm{MSE}(Dis_1(Enc_1(x)), t_1) + \mathrm{MSE}(Dis_2(Enc_2(x)), t_2) \qquad (7.10)$$

where $t_1$ and $t_2$ are set to $z_2$ and $z_1$, respectively, when updating $P_2$, but are set to random vectors when updating $P_1$.
Overall, the model is trained through backpropagation by optimizing the objective described in Equation 7.11, where the loss weights $\alpha$, $\beta$, and $\gamma$ are hyperparameters decided based on performance on the development set.

$$\mathcal{L} = \alpha \mathcal{L}_y + \beta \mathcal{L}_x + \gamma \mathcal{L}_d \qquad (7.11)$$
Inference with NIESR involves a forward pass of data through $Enc_1$ followed by $Dec$. Hence, the usage and computational cost of NIESR at inference are the same as those of the base model.
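As a rough sketch of the scheduled update scheme, the loop below alternates between the two players in the 1:5 frequency ratio used in our experiments (Section 7.3.2). `p1_train_step` and `p2_train_step` are hypothetical helpers, not part of the published code, that apply gradients to players $P_1$ and $P_2$ with the targets $t_1$ and $t_2$ set as described above.

```python
# Minimal sketch of the scheduled adversarial updates; `p1_train_step` and
# `p2_train_step` are hypothetical helpers that update player P1 (Enc1, Enc2,
# Dec, Recon; disentangler targets set to random vectors) and player P2
# (Dis1, Dis2; targets t1 = z2, t2 = z1), respectively.
P2_UPDATES_PER_P1 = 5   # players updated in a 1:5 frequency ratio

for step, (x, y) in enumerate(train_batches):
    if step % (P2_UPDATES_PER_P1 + 1) == 0:
        p1_train_step(x, y)   # the other player's weights stay frozen
    else:
        p2_train_step(x)
```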
7.3 Experiments
The effectiveness of NIESR is quantified through the performance improvement achieved by adopting the invariant learning framework. We provide experimental results on speech recognition on three benchmark datasets: the Wall Street Journal Corpus (WSJ0) (Paul and Baker, 1992), CHiME3 (Barker et al., 2015), and TIMIT (Garofolo et al., 1993). We additionally provide results on the combined WSJ0+CHiME3 dataset.
7.3.1 Datasets
WSJ0: This dataset is a collection of readings of the Wall Street Journal. It contains 7,138 utterances in the training set, 410 in the development set, and 330 in the test set. We use 40-dimensional log Mel filterbank features as the model input, and normalize the transcriptions to capitalized character sequences.
CHiME3: The CHiME3 dataset contains: (1) WSJ0 sentences spoken in challenging noisy environments (real data) and (2) WSJ0 readings mixed with four different background noises (simulated data). The real speech data was recorded in five noisy environments using a six-channel tablet-based microphone array. The training data consists of 1,999 real noisy utterances from four speakers, and 7,138 simulated noisy utterances from the 83 speakers in the WSJ0 training set. In total, there are 3,280 utterances in the development set and 2,640 utterances in the test set, containing both real and simulated data. The speakers in the training, development, and test sets are mutually different. In our experiments, we follow (Meng et al., 2018) in using far-field speech from the fifth microphone channel for all sets. We adopt the same input-output setting for CHiME3 as for WSJ0.
TIMIT: This corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers drawn from 8 different dialect groups. Among them, utterances from 168 speakers are held out as the test set. We further select sentences from 4 speakers of each dialect group, i.e., 32 speakers
Table 7.1: Hyperparameters for the base model.
Item Setting
Enc and Dec LSTM Dimensionality 200
Subsampling Projected Dimensionality 200
Attention Dimensionality 200
Attention Convolution Channel 10
Attention Convolution Kernel Size 100
Optimizer Adam
Learning Rate 5e-4
in total, from the remaining data to form the development set. Thus, all speakers in the training, development, and test sets are different. Models were trained on 80-dimensional log Mel filterbank features, and capitalized character sequences were treated as targets.
7.3.2 Experiment Setup
We train the base model without using invariance induction, i.e., the model consisting of Enc
and Dec (Section 7.2.1), as a baseline. We feed the whole sequence of spectra features to Enc
and get the predicted character sequence from Dec. We use a stack of two BLSTMs with a
subsampling layer (as described in Section 7.2.1) in between for Enc. Dec is implemented as a
single layer LSTM combined with attention modules introduced in Section 7.2.1. All the models
were trained with early stopping with 30 epochs of patience and the best model is selected based
on the performance on the development set. Other model and training hyperparameters are listed
in Table 7.1.
We augment the base model with $Enc_2$, $Recon$, $Dis_1$, and $Dis_2$, while treating $Enc$ as $Enc_1$, to form the NIESR model. $Enc_2$ has the same hyperparameter settings and structure as $Enc_1$. $Recon$ is modeled as a cascade of a BLSTM layer, an upsampling layer, and another BLSTM layer. $Dis_1$ and $Dis_2$ are implemented as BLSTMs followed by two fully-connected layers. We update the player models $P_1$ and $P_2$ in the frequency ratio of 1:5 in our experiments. Hyperparameters
Table 7.2: Hyperparameters for the NIESR model.
Item Setting
Recon LSTM Dimensionality 300
Upsampling Projected Dimensionality 200
Dis1, Dis2 Dimensionality 200
Dropout layer rate 0.4
Optimizer Adam
Learning Rate for P1 5e-4
Learning Rate for P2 1e-3
α, β, γ for WSJ0 100, 10, 1
α, β, γ for CHiME3 100, 1, 0.5
α, β, γ for TIMIT 100, 50, 1
Table 7.3: Speech recognition performance as CER (%). RI indicates relative improvement (%) over the Base model.
Model WSJ0 CHiME3 TIMIT
CER RI CER RI CER RI
Base 12.95 – 44.61 – 28.76 –
Spk-Inv 12.31 4.94 43.93 1.52 28.45 1.08
Env-Inv – – 42.61 4.48 – –
Dial-Inv – – – – 28.29 1.63
NIESR 12.24 5.48 41.86 6.16 26.86 6.61
for $Enc_1$ and $Dec$ are the same as for the base model. Additional hyperparameters for NIESR are summarized in Table 7.2.
We further provide results for a stronger baseline model that utilizes labeled nuisances (speakers for WSJ0, speakers and noise environment condition for CHiME3, speakers and dialect groups for TIMIT) with the gradient reversal layer (GRL) (Ganin and Lempitsky, 2014) to learn invariant representations. Specifically, the model consists of $Enc$, $Dec$, and a classifier, with a GRL between the embedding learned from $Enc$ and the classifier, following the standard setup of (Ganin and Lempitsky, 2014). The target of the classifier is to predict the nuisance $s$ from the embedding, while the direction of the training gradient to $Enc$ is flipped. We denote this model as Spk-Inv for speaker-invariance, Env-Inv for environment-invariance in CHiME3, and Dial-Inv for dialect-invariance in TIMIT.
Table 7.4: Results of predicting nuisance factors from learned representations, as accuracy. Env stands for environment. As desired, the accuracy is lowest for the $z_1$ embedding of NIESR. The accuracy for the $z_2$ embedding of NIESR shows that the nuisance information is encoded largely in $z_2$, as it should be.
Model Embedding WSJ0 CHiME3
s: Speaker s: Speaker s: Env
Base Model z 67.91 38.52 69.24
Spk-Inv z 65.60 37.91 69.11
Env-Inv z – 38.84 66.44
NIESR z1 63.35 35.87 63.45
NIESR z2 97.92 92.28 97.05
7.3.3 ASR Performance on Benchmark Datasets
Table 7.3 summarizes the ASR results on the WSJ0, CHiME3, and TIMIT datasets. The results show that NIESR achieves 5.48%, 6.16%, and 6.61% relative improvements over the base model on WSJ0, CHiME3, and TIMIT, respectively, and demonstrates the best CER among all methods.
7.3.4 Invariance to Nuisance Factors
In order to examine whether a latent embedding is invariant to nuisance factors $s$, we calculate the accuracy of predicting the factor $s$ from the encoding. Specifically, this is calculated by training classification networks (a BLSTM followed by two fully-connected layers) to predict $s$ from the generated embeddings. Table 7.4 presents the results of this experiment, showing that the $z_1$ embedding of the NIESR model, which is used for ASR, contains less nuisance information than the $z$ encoding of the base, Spk-Inv, and Env-Inv models. In contrast, the $z_2$ embedding of NIESR contains most of the nuisance information, showing that nuisance factors migrate to this embedding, as expected.
Table 7.5: Test results of models trained on the WSJ0+CHiME3 augmented dataset as CER (%). RI indicates the relative improvement (%) over the Base model.
Model WSJ0 CHiME3
CER RI CER RI
Base 9.35 – 41.55 –
Spk-Inv 8.62 7.81 40.77 1.88
Env-Inv 9.17 1.93 40.27 3.08
NIESR 8.00 14.44 38.35 7.7
7.3.5 Additional Robustness through Data Augmentation
Training with additional data that reflects multiple variations of nuisance factors helps models generalize better. In this experiment, we treat the CHiME3 dataset, which contains WSJ0 recordings with four different types of noise, as a noisy augmentation of WSJ0. We train the base model and NIESR on the augmented dataset, i.e., WSJ0+CHiME3, and test on the original CHiME3 and WSJ0 test sets separately. Table 7.5 summarizes the results of this experiment, showing that training with data augmentation provides improvements on both the CHiME3 and WSJ0 datasets compared to the results in Table 7.3. It is important to note that the NIESR model trained on the augmented dataset achieves a 14.44% relative improvement on WSJ0 as compared to the base model trained on the same data. This is because data augmentation provides additional information about potential nuisance factors to the NIESR model and, consequently, helps it ignore these factors for the ASR task, even though pairwise data is not provided to the model as in (Liang, Huang, and Lipton, 2018). Hence, the results show that NIESR can easily be combined with data augmentation to further enhance the robustness and nuisance invariance of the learned features.
7.4 Summary
We presented NIESR, an end-to-end speech recognition model that adopts the unsupervised adversarial invariance framework to achieve invariance to nuisances without requiring any knowledge of potential nuisance factors. The model works by learning a split representation of data through competition between the recognition task and an auxiliary data reconstruction task. Results of experimental evaluation demonstrate that the proposed model achieves significant boosts in ASR performance.
Chapter 8
Nuisance Invariance for Learning with Less Labels
8.1 Introduction
The performance of machine learning models depends on how well they are trained, which is directly affected by the data available to estimate their parameters through empirical optimization. It is a common belief that more data leads to better training and model generalization. Availability of data also encourages the adoption of models with larger numbers of parameters, which have better learning capacity but are more prone to overfitting (Domingos, 2012). The success of deep neural networks (DNNs) can be largely attributed to the collection and curation of large public datasets that made it possible to train deeper networks with reduced overfitting. However, generalization also depends on the representativeness of datasets (a large dataset could still lack diversity, causing overfitting) as well as on choices in model design and optimization procedures.

While data representativeness is a difficult problem of its own and can be addressed by carefully designing and vetting data collection programs, model-specific causes of overfitting can be overcome by systematically studying them and developing solutions thereof. This would also lead to performance boosts when sophisticated models like DNNs are employed in scenarios where only a small amount of data is available (low-data) or where data is partially labeled (semi-supervised), i.e., regimes where data size and representativeness are beyond immediate control.
One such model-specific factor with respect to supervised DNNs pertains to the ability of DNNs to effectively learn and disentangle factors of variation of data such that the prediction target is associated with relevant factors but not with nuisances (e.g., subject background in face recognition and stroke width in optical character recognition are nuisances). As studied in Part I of this thesis, this aspect of DNNs can be improved by incorporating invariance frameworks into the training procedure.

In this work, we employ the overfitting-reduction effect of nuisance invariance to improve model performance in low-data and semi-supervised settings. We incorporate the approach of "discovery and separation of features" (DSF), presented in Chapter 4, as the nuisance invariance method. We train VGG-13 networks on various fractions of CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) and ResNet-18 on subsets of Mini-ImageNet (Vinyals et al., 2016) to evaluate the benefits of nuisance invariance in low-data settings. We then repeat these experiments with a recently developed label propagation method (Iscen et al., 2019) to study the improvements in a semi-supervised regime by treating the aforementioned fractions of data as labeled and the rest as unlabeled.

Results of experiments in the low-data setting show up to 29%, 36%, and 99% reductions in error-rate for Mini-ImageNet, CIFAR-100, and CIFAR-10, respectively. When combined with label propagation, up to 47%, 41%, and 99% error-rate reductions are achieved on Mini-ImageNet, CIFAR-100, and CIFAR-10, respectively. Results also show that it is possible to achieve similar or better performance in both settings while using only a fraction of the dataset, as compared to conventional training on the full training dataset without explicit nuisance invariance. Hence, it is evident that the incorporation of nuisance invariance provides large benefits in both low-data and semi-supervised regimes.
8.2 Robust Low-data and Semi-supervised Learning
8.2.1 Nuisance Invariance through Discovery and Separation of Features
Supervised machine learning models often learn incorrect associations between the prediction target and underlying factors of variation of data that are irrelevant to the prediction task. This is especially true for DNNs because they contain a large number of parameters that need to be estimated from data, which makes them vulnerable to this kind of overfitting. The problem of overfitting is amplified in low-data and semi-supervised settings. Hence, it becomes imperative to incorporate measures like nuisance invariance when training models in these scenarios in order to achieve reasonable performance. In order to improve generalization and model performance in these regimes, we adopt the DSF method, presented in Chapter 4, to induce invariance to nuisances.
The DSF method of training DNNs involves learning a split representation of data $x$ into two embeddings, $z_p$ and $z_n$, where $z_p$ is used for inferring the prediction target $y$ and $z = \{z_p, z_n\}$ is used for decoding $x$. This is brought about by augmenting the conventional $y$-prediction training objective with $x$-reconstruction and information constraints on $z_p$ and $z_n$. The core idea is that $z$ should contain all the information about $x$ but split it into $y$-related (or predictive) information, which is encoded in $z_p$, and information about nuisances, which is encoded in $z_n$. The complete objective of DSF can be written as:

$$\max \; I(z_p : y) + \lambda I(x : \{z_p, z_n\}) \qquad (8.1)$$
$$\text{s.t.} \quad I(z_p : x) \leq I_c, \qquad I(z_p : z_n) = 0$$

where $\lambda$ determines the relative importance of the two prediction tasks. The optimization objective in Equation 8.1 can be relaxed such that the objective $J$ is:

$$J = I(z_p : y) + \lambda I(x : \{z_p, z_n\}) - \beta I(z_p : x) - \gamma I(z_p : z_n) \qquad (8.2)$$

where $\beta$ and $\gamma$ denote multipliers for the $I(z_p : x)$ and $I(z_p : z_n)$ constraints, respectively. In this work, we use the Echo method (Brekelmans et al., 2019) to exactly compute $I(z_p : x)$ for minimization, and we employ the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005), a kernel measure of the dependence between variables, to minimize $I(z_p : z_n)$. This corresponds to the DSF-H version of DSF (as in Chapter 4), but we refer to it simply as DSF in the rest of this chapter.
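To make the independence penalty concrete, the following is a minimal NumPy sketch of a biased empirical HSIC estimate with Gaussian kernels. The kernel bandwidth and estimator variant are illustrative assumptions rather than the exact choices of Chapter 4.

```python
# Minimal sketch of a biased empirical HSIC estimate with Gaussian kernels;
# the bandwidth and estimator variant are illustrative assumptions.
import numpy as np

def _gram(z, sigma):
    # Gaussian kernel Gram matrix from pairwise squared distances.
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * z @ z.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(zp, zn, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = zp.shape[0]
    K, L = _gram(zp, sigma), _gram(zn, sigma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```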
8.2.2 Nuisance Invariance in Low-data Settings
Incorporating nuisance invariance while training DNNs in low-data settings does not require any additional modification to the training procedure. Therefore, the model can be trained directly with the DSF objective. This allows for easy adoption of nuisance invariance in low-data regimes, where DNNs usually fare poorly.
8.2.3 Nuisance Invariance in Semi-supervised Settings
Semi-supervised learning is a machine learning paradigm where even though substantial amounts
of data are available, only a small fraction of it is labeled for the supervised task, i.e., y-prediction.
While there are several families of semi-supervised learning methods, in this work, we focus on the
method of label propagation for DNNs and study the benefits of the incorporation of nuisance
invariance into it.
Label propagation is a semi-supervised learning technique in which pseudo-labels are generated, with the help of a small amount of labeled training data, for a (usually large) set of input samples whose ground truth is unknown. This technique involves three stages, which can be repeated in a loop under certain conditions: (1) training a base model with the small labeled dataset, (2) "propagating" labels to the unlabeled dataset, and (3) training the model on the combined dataset with real and pseudo-labels.
Iscen et al. (2019) recently proposed a label propagation approach (LP) in which a DNN is treated as the base model. In their approach, labels are propagated to unlabeled samples by building a nearest-neighbor graph using the embeddings of the same DNN and diffusing the labels through it. Finally, the DNN is trained with both the real data and the data with the propagated labels. This method achieves state-of-the-art semi-supervised learning performance. However, it has a key problem: successfully training the DNN on the small labeled set in the first stage without overfitting, which often results in learning spurious relationships between the target and factors of the data that are unrelated to the target (nuisances). Failure to address this problem has a cascading effect on stages (2) and (3), hurting the performance of the model. Specifically, it leads to inaccurate calculations of similarities between the embeddings in stage (2), caused by the encoding of nuisances. As shown earlier in this thesis (Jaiswal et al., 2019d), invariant representations cluster better by y. Hence, nuisance invariance can improve all stages of label propagation.

We incorporate the DSF method into the LP framework such that DSF is used in both training stages (1) and (3). While its employment in stage (1) is crucial, that in stage (3) is expected to provide an additional performance boost by continuing to maintain nuisance invariance in the final trained model.
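A compact sketch of how DSF slots into this pipeline is shown below. All helper functions (`train_with_dsf`, `embed`, `build_knn_graph`, `diffuse_labels`) are hypothetical stand-ins for DSF training, embedding extraction, nearest-neighbor graph construction, and the graph-based label diffusion of Iscen et al. (2019); they are not the published implementation.

```python
# Minimal sketch of the three LP stages with DSF; every helper below is a
# hypothetical stand-in, not the published code.
def label_propagation_with_dsf(model, labeled, unlabeled):
    model = train_with_dsf(model, labeled)             # stage (1): DSF training
    graph = build_knn_graph(embed(model, labeled + unlabeled))
    pseudo = diffuse_labels(graph, labeled)            # stage (2): propagate
    return train_with_dsf(model, labeled + pseudo)     # stage (3): DSF training
```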
8.3 Experimental Evaluation
In this section, we provide an empirical evaluation of the benefits of nuisance invariance, and more specifically DSF, in low-data and semi-supervised settings. In the following sections, we describe the datasets, the experiment setup, and the results of our experiments.
Figure 8.1: CIFAR-10 (1%) – t-SNE visualization of latent embeddings labeled with the prediction class (y) when 1% of the training data is used for training (columns: without and with DSF; rows: low-data and semi-supervised settings). In both low-data and semi-supervised settings, training with nuisance invariance (DSF) leads to much cleaner clustering by y than training without it.
8.3.1 Datasets
Here we provide details of the datasets used in these experiments. We also describe the preprocessing methods used, following Iscen et al. (2019).
CIFAR-10: This dataset (Krizhevsky, 2009) contains 60,000 color images of size 32×32. Each image belongs to one of ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset has a standard split of 50,000 images for training and 10,000 for
Figure 8.2: CIFAR-10 (2%) – t-SNE visualization of latent embeddings labeled with the prediction class (y) when 2% of the training data is used for training (columns: without and with DSF; rows: low-data and semi-supervised settings). In both low-data and semi-supervised settings, training with nuisance invariance (DSF) leads to much cleaner clustering by y than training without it.
testing. Each class has an equal number of images that were randomly selected (Krizhevsky, 2009). All the images were preprocessed using the pipeline established by Tarvainen and Valpola (2017) and used by the label propagation work (Iscen et al., 2019) that we employ in this work. This involved normalizing the images with the mean and standard deviation calculated on the complete training set. The training set was also augmented with random translations by four pixels combined with reflective padding and randomized horizontal flips. Four versions of the training set were
Figure 8.3: CIFAR-10 (4%) – t-SNE visualization of latent embeddings labeled with the prediction class (y) when 4% of the training data is used for training (columns: without and with DSF; rows: low-data and semi-supervised settings). In both low-data and semi-supervised settings, training with nuisance invariance (DSF) leads to much cleaner clustering by y than training without it.
created with 1%, 2%, 4%, or 8% considered "available" for low-data and "labeled" for semi-supervised experiments, which corresponded to 500, 1,000, 2,000, and 4,000 samples per subset, respectively. Ten different variations of these 1–8% versions were created for calculating average statistics. The division of the training set into subsets and the variations of the 1–8% versions were statically conducted using the exact setup of Iscen et al. (2019).
CIFAR-100: This dataset (Krizhevsky, 2009) contains the same images and train-test splits as CIFAR-10 but comprises 100 classes with an equal number of samples in each class. Essentially, each of the ten classes of CIFAR-10 is further split into ten more classes. This makes the classification problem significantly more challenging, requiring a more fine-grained understanding of the images to differentiate between low-level classes. The images were preprocessed in the same way as CIFAR-10, and the same data augmentation techniques were used for the training set (Tarvainen and Valpola, 2017; Iscen et al., 2019). Two versions of the training set were created with 8% or 20% considered "available" for low-data and "labeled" for semi-supervised experiments, which corresponded to 4,000 and 10,000 samples per subset, respectively. Following Iscen et al. (2019), three variations of each of these versions were developed for gathering mean statistics, using statically defined divisions.
Mini-ImageNet: This is a subset (Vinyals et al., 2016) of the ImageNet dataset (Russakovsky et al., 2015) containing 60,000 color images of size 84×84. The dataset contains 100 classes with 600 images per class, and is more challenging than CIFAR-100. It has a standard split of 50,000 training and 10,000 testing images. Image preprocessing involves normalization with the mean and standard deviation calculated on the entire training set (Tarvainen and Valpola, 2017; Iscen et al., 2019). The training set was also augmented with random rotations of 10°, random crops with a padding of eight pixels, and randomized horizontal flips, following Iscen et al. (2019). Two versions of the training set were developed with 8% or 20% considered "available" for low-data and "labeled" for semi-supervised experiments, which corresponded to 4,000 and 10,000 samples, respectively. Three variations of each of these versions were created for reporting mean statistics of model performance, using statically defined divisions following Iscen et al. (2019).
8.3.2 Experiment Setup
We use the same base models (BMs) as Iscen et al. (2019): VGG-13 for CIFAR-10 and CIFAR-100, and ResNet-18 for Mini-ImageNet. We integrated DSF-H into their publicly
Table 8.1: CIFAR-10 – Results for low-data (BM and BM+DSF) and semi-supervised (BM+LP and BM+LP+DSF) settings. BM and BM+DSF indicate the base model trained without and with invariance (DSF-H), respectively. BM+LP is trained with label propagation, while BM+LP+DSF is additionally trained with DSF-H. "Full Data" is the model trained on the full training set, while the others are trained on subsets as indicated. Scores are reported as mean error-rates on the full test set, with relative reductions in parentheses.
Configuration Mean Error-rate (%)
Full Data 4.74
1% Labeled 2% Labeled 4% Labeled 8% Labeled
BM 48.54 39.66 31.00 23.10
BM+DSF 6.09 (↓ 87%) 2.33 (↓ 94%) 0.62 (↓ 98%) 0.17 (↓ 99%)
BM+LP 32.40 22.02 15.66 12.69
BM+LP+DSF 0.36 (↓ 99%) 0.14 (↓ 99%) 0.14 (↓ 99%) 0.08 (↓ 99%)
available code (https://github.com/ahmetius/LP-DeepSSL) to evaluate the performance both (1) without and (2) with label propagation, where case (1) reflects a low-data regime while (2) is a semi-supervised setting. In each of these scenarios, we report performance both without and with nuisance invariance, to quantify its benefits. While reporting results, we use short forms of model configuration, where "+LP" indicates semi-supervised learning with label propagation and its absence refers to low-data experiments. Similarly, "+DSF" in the configuration indicates the use of DSF for nuisance invariance, while its absence corresponds to conventional training without invariance.

Experiments were conducted with the aforementioned fractions of the datasets considered labeled, where the unlabeled samples were not used in (1) but were used for label propagation in (2). For each fraction, models were trained on ten variations of the set for CIFAR-10 and three for CIFAR-100 and Mini-ImageNet (as described in Section 8.3.1). All the models were evaluated on the complete test set of the corresponding datasets. We report the mean error-rate (across the three or ten variations) for both low-data and semi-supervised experiments, along with the relative reduction in error-rate from the incorporation of DSF. We also show t-SNE (Maaten and Hinton,
Figure 8.4: CIFAR-10 (8%) – t-SNE visualization of latent embeddings labeled with the prediction class (y) when 8% of the training data is used for training (columns: without and with DSF; rows: low-data and semi-supervised settings). In both low-data and semi-supervised settings, training with nuisance invariance (DSF) leads to much cleaner clustering by y than training without it.
2008) visualization of embeddings for CIFAR-10 for qualitative understanding of the eects of
nuisance invariance in low-data and semi-supervised settings.
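For reference, the relative reduction in error-rate reported alongside the means can be computed as in the following sketch (the function name is ours):

def relative_reduction(baseline_error, new_error):
    """Relative reduction (%) in mean error-rate after incorporating DSF."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Example with the CIFAR-10 low-data numbers from Table 8.1 (1% labeled):
# 48.54% (BM) -> 6.09% (BM+DSF) gives roughly an 87% relative reduction.
print(round(relative_reduction(48.54, 6.09)))  # -> 87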
8.3.3 Results
CIFAR-10: Table 8.1 summarizes the results of our experiments on CIFAR-10. The test error-rate of BM when trained on the full training set of 50,000 images is 4.74%. In the low-data experiments, while BM achieves a 48.54% mean error-rate when 1% of the data is used for training, the incorporation of DSF brings that number down to 6.09%, an 87% reduction. At the other end, where 8% of the data is used for training, DSF reduces the mean error-rate by 99% from 23.10% to 0.17%. In all cases except the 1% setting, the incorporation of DSF even brings the mean error-rate below the full-data baseline of 4.74%. In the semi-supervised experiments, DSF reduces the error-rate by 99% from 32.40% to 0.36% in the 1% case and from 12.69% to 0.08% in the 8% case, with the final error-rate falling below the full-data baseline in all of the 1–8% cases. Thus, nuisance invariance provides massive boosts in performance on CIFAR-10 in both low-data and semi-supervised settings. Figures 8.1–8.4 present t-SNE visualizations of the latent embeddings used for predicting y in both low-data and semi-supervised settings when models are trained with and without nuisance invariance with DSF. As evident, DSF makes the y-clusters significantly tighter and cleaner in all cases, which corroborates the quantitative results.
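Such t-SNE plots can be reproduced with scikit-learn; the following is a minimal sketch in which the embedding and label arrays (and their file names) are hypothetical stand-ins for the latent codes fed to the predictor and the corresponding values of y.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings: (N, D) array of latent codes used for predicting y
# labels: (N,) array of class labels y
embeddings = np.load("embeddings.npy")  # hypothetical file names
labels = np.load("labels.npy")

# Project the embeddings to 2-D with t-SNE (Maaten and Hinton, 2008).
projected = TSNE(n_components=2).fit_transform(embeddings)

# Color each point by its prediction class to inspect cluster quality.
plt.scatter(projected[:, 0], projected[:, 1], c=labels, s=2, cmap="tab10")
plt.savefig("tsne_cifar10.png")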
CIFAR-100: Results of experiments on CIFAR-100 are summarized in Table 8.2. The test
error-rate of BM when trained on the full training set is 22.55%. In low-data experiments, BM
achieves 55.43% mean error-rate when 8% of the data is used for training without DSF whereas
that with DSF drops by 36% to 35.70%. Similarly, when 20% of the data is used for training,
the number drops by 30% from 40.67% to 28.27%. In semi-supervised settings, when treating 8%
of the data as labeled, DSF reduces the mean error-rate by 41% from 46.20% to 27.19%. When
20% of the data is labeled, that number drops by 33% from 38.43% to 25.89%, coming very close
to the full-data results. Thus, nuisance invariance provides signicant boosts in performance on
CIFAR-100 in both low-data and semi-supervised experiments.
Mini-ImageNet: Table 8.3 presents the results of experiments on Mini-ImageNet. The test error-rate of BM when trained on the full training set is, at 47.74%, higher than on the CIFAR datasets, which speaks to how challenging the dataset is. In low-data settings, BM achieves a high mean error-rate of 74.78% when 8% of the training set is used and 53.07% when 20% of it is used. The incorporation of DSF brings those numbers down by 25% to 55.96% and by 29% to 37.43% (below the full-data error-rate), respectively. In semi-supervised settings, BM achieves a high error-rate of 70.29% when 8% of the data is labeled and 38.28% when 20% is labeled. DSF brings those numbers down by 47% to 36.92% and by 21% to 30.26%, respectively. In both the 8% and 20% cases, the error-rates after incorporating DSF are below the full-data score. Thus, nuisance invariance substantially improves low-data and semi-supervised performance on Mini-ImageNet.

Table 8.2: CIFAR-100 – Results for low-data (BM and BM+DSF) and semi-supervised (BM+LP and BM+LP+DSF) settings. BM and BM+DSF indicate the base model trained without and with invariance (DSF-H), respectively. BM+LP is trained with label propagation while BM+LP+DSF is additionally trained with DSF-H. "Full Data" is the model trained on the full training set while others are trained on subsets as indicated. Scores are reported as mean error-rates on the full test set with relative reduction in parentheses.

Configuration   Mean Error-rate (%)
Full Data       22.55
                8% Labeled      20% Labeled
BM              55.43           40.67
BM+DSF          35.70 (↓ 36%)   28.27 (↓ 30%)
BM+LP           46.20           38.43
BM+LP+DSF       27.19 (↓ 41%)   25.89 (↓ 33%)

Table 8.3: Mini-ImageNet – Results for low-data (BM and BM+DSF) and semi-supervised (BM+LP and BM+LP+DSF) settings. BM and BM+DSF indicate the base model trained without and with invariance (DSF-H), respectively. BM+LP is trained with label propagation while BM+LP+DSF is additionally trained with DSF-H. "Full Data" is the model trained on the full training set while others are trained on subsets as indicated. Scores are reported as mean error-rates on the full test set with relative reduction in parentheses.

Configuration   Mean Error-rate (%)
Full Data       47.74
                8% Labeled      20% Labeled
BM              74.78           53.07
BM+DSF          55.96 (↓ 25%)   37.43 (↓ 29%)
BM+LP           70.29           38.28
BM+LP+DSF       36.92 (↓ 47%)   30.26 (↓ 21%)
8.4 Summary
Deep neural networks are known to be vulnerable to overfitting, which becomes even more problematic in low-data and semi-supervised settings. We have presented nuisance invariance as a possible solution to improve the performance of DNNs in these regimes. We provided a brief description of the DSF nuisance invariance method and a recently developed label propagation method (Iscen et al., 2019) that we used in this study. We conducted experiments on CIFAR-10, CIFAR-100, and Mini-ImageNet by treating various subsets of the training sets as "available" and "labeled" in low-data and semi-supervised settings, respectively. Results show that the incorporation of nuisance invariance can lead to massive improvements in performance in both settings.
Chapter 9
Fair Face Analytics
9.1 Introduction
The advent of deep learning along with the collection and public release of face datasets has led to
massive advancements in a variety of face-based analytics including face recognition (Zhao, Xu,
and Cheng, 2019; Duan, Lu, and Zhou, 2019), detection (Li et al., 2019; Chaudhuri, Vesdapunt,
and Wang, 2019), and facial attribute prediction (Gnanasekar and Yanushkevich, 2019; Huang
et al., 2019) (e.g., age, gender, race, hair color, etc.). Successes in these domains have led to their widespread adoption in commercial applications such as interactive filters on social media platforms like Instagram, face-based computer/phone security, etc. However, this has also brought to light the fact that many face-based systems are largely biased, making skewed and incorrect predictions on certain demographic subgroups of people. For example, major face recognition systems (Microsoft, IBM, Face++) (Kärkkäinen and Joo, 2019; Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019) tend to perform well on Caucasian and male faces but poorly on people of color and on female faces.
Biased predictions made by face-based systems can be largely attributed to the misrepresentation of these subgroups of individuals in the training data used to train the machine learning models at their core (Kärkkäinen and Joo, 2019). Consequently, several works have emerged recently that aim to make face-based analytics more fair by (1) creating new balanced datasets (Kärkkäinen and Joo, 2019; Wang et al., 2019), (2) debiasing existing datasets (Kortylewski et al., 2018a; Kortylewski et al., 2019), or (3) changing the model training process such that it ignores factors of data associated with the biases while making predictions (Alvi, Zisserman, and Nellåker, 2018; Kim et al., 2019; Li and Vasconcelos, 2019).
While carefully collecting new datasets or debiasing existing ones seem like intuitive and direct measures for mitigating biases, these approaches require significant manual effort, infrastructural resources, and time. This has led to an increase in research interest in the development and adoption of training schemes that algorithmically learn to ignore known biases and produce fair models (Alvi, Zisserman, and Nellåker, 2018; Kim et al., 2019; Li and Vasconcelos, 2019; Moyer et al., 2018; Jaiswal et al., 2019d; Jaiswal et al., 2020).

In this work, we adopt the Adversarial Forgetting (AdvForget) mechanism for training unbiased models, as presented in Chapter 5, for predicting age invariant to gender and gender invariant to age. Results of these experiments show significant boosts in generalization when AdvForget training is used.
9.2 Related Work
The study of fairness in face-based analytics has emerged as an important research area in the past few years due to the discovery of biases learned by large-scale face recognition systems (Kärkkäinen and Joo, 2019; Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019). While some works have focused on analyzing existing datasets to educate the community about the biases in them (Kortylewski et al., 2018b; Nagpal et al., 2019; Srinivas et al., 2019; Vangara et al., 2019; Muthukumar, 2019), others have proposed approaches to learn fair models (Kärkkäinen and Joo, 2019; Wang et al., 2019; Kortylewski et al., 2018a; Kortylewski et al., 2019; Li and Vasconcelos, 2019; Alvi, Zisserman, and Nellåker, 2018; Kim et al., 2019).
Kortylewski et al. (2018b) present a study of pose- and illumination-related biases in training data and their effects on the generalization of face recognition systems. The work of Nagpal et al. (2019) discusses age and race biases across several models and datasets. Srinivas et al. (2019) study age bias in face recognition through the performance difference between adults and children. The adverse effects of race bias on gender classification are studied in (Muthukumar, 2019).
Approaches for removing biases from face analytics systems focus either on improving training datasets or on modifying the training scheme such that the resulting model makes fair predictions. Kärkkäinen and Joo (2019) created a new race-balanced dataset and showed that training on it leads to better cross-dataset test performance compared to conventional biased datasets. A race-balanced test set was proposed by Wang et al. (2019) for evaluating face recognition models trained on other datasets, with or without debiasing methods, to gauge how biased they are. Besides creating new datasets, some works (Kortylewski et al., 2019; Kortylewski et al., 2018a) have also proposed careful synthetic data augmentation to debias existing datasets.

While removing biases in data by either creating new balanced datasets or balancing existing ones is a direct solution for promoting fairness in models, both of these approaches require significant resources, effort, and time. Hence, there has been a growing interest in the development of new training schemes that can create fair face models from biased data. Li and Vasconcelos (2019) frame data resampling as a minimax game with the target prediction objective, such that under-represented groups are seen more frequently during the training process. The work of Alvi, Zisserman, and Nellåker (2018) adds a confusion loss to the classification network along with adversarial training with a discriminator to remove biases from the latent representations learned by the Neural Network (NN) model. Kim et al. (2019) adopt a similar approach, but instead of using a confusion loss, they use a gradient reversal layer (Ganin et al., 2016).
9.3 Debiasing Face Analytics
In this section, we present the complete model used in this work for unbiased facial attribute prediction. We begin by describing the base model and then explain how the Adversarial Forgetting training scheme was adopted by modifying it. Finally, we present additional details pertaining to model training and hyperparameter tuning.
9.3.1 Base Model
The base NN architecture used in this work follows the ResNet-18 model (He et al., 2016). It starts with a convolutional layer, followed by max-pooling and eight residual blocks (containing convolutional and batch-normalization layers), and ends with average pooling and a final fully-connected classification layer. The model is pretrained on ImageNet (Russakovsky et al., 2015). The last fully-connected layer is then replaced with a new layer that has a one-dimensional output for binary prediction of specific facial attributes. The resulting model is fine-tuned on the attribute prediction dataset to create the final base model.
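A minimal sketch of this base-model construction is shown below, assuming the API of the publicly available Keras implementation referenced in Section 9.4.2 (qubvel/classification_models); the input size and compilation settings shown are illustrative rather than the exact ones used.

from classification_models.keras import Classifiers
from keras import layers, models

# Obtain an ImageNet-pretrained ResNet-18 backbone without its
# original 1000-way classification head.
ResNet18, _ = Classifiers.get("resnet18")
backbone = ResNet18(input_shape=(224, 224, 3), weights="imagenet",
                    include_top=False)

# Replace the head with average pooling and a new fully-connected
# layer with a one-dimensional sigmoid output for binary attribute
# prediction (e.g., age group or gender).
features = layers.GlobalAveragePooling2D()(backbone.output)
output = layers.Dense(1, activation="sigmoid")(features)
base_model = models.Model(backbone.input, output)

# The whole network is then fine-tuned on the attribute dataset.
base_model.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])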
9.3.2 Adversarial Forgetting for Invariant Predictions
The Adversarial Forgetting (AdvForget) framework, discussed in Chapter 5, induces invariance to biases within the latent encodings generated by neural networks. It learns forget-masks that remove bias-related information from the latent code through elementwise multiplication of the mask with the code. The working principle of this framework is that the masked embedding should be predictive of the target but not of the biases, which is brought about by adversarial training with a discriminator whose gradients update the mask-generating module.
In order to incorporate the AdvForget training mechanism, the base NN is split into an
encoder and a predictor. The network is then augmented with a forget-gate that generates
the masks, a decoder to promote better feature learning by learning to reconstruct the data
sample from the unmasked embedding, and a bias-discriminator. Furthermore, gradients from
the bias-discriminator are not allowed to flow to the encoder, which restricts the discriminator to affecting only the masking process. Adversarial training with the discriminator leads to bias-invariant masked embeddings that result in fair predictions, and the constraint of bias-removal through elementwise mask multiplication causes the bias-related information in the unmasked embedding to disentangle from the other factors such that they occur on different dimensions of the encoding. As shown in Chapter 5, the AdvForget mechanism effectively removes unwanted factors from the decision-making process and outperforms previous works in this domain.
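The core wiring of this mechanism – elementwise masking plus a gradient block between the discriminator and the encoder – can be sketched as follows. This is a schematic with tiny stand-in modules, not the exact implementation; the real sub-networks are the ones described in Section 9.3.3.

import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in modules; in the actual model the encoder/predictor are
# halves of the ResNet-18 base model and the forget-gate, decoder, and
# discriminator are the CNNs described in Section 9.3.3.
encoder = layers.Dense(16, activation="relu")
forget_gate = layers.Dense(16, activation="sigmoid")  # mask in [0, 1]
predictor = layers.Dense(1, activation="sigmoid")
decoder = layers.Dense(8)
discriminator = layers.Dense(1, activation="sigmoid")

x = tf.random.normal((4, 8))  # a dummy batch

z = encoder(x)                # unmasked embedding (all factors)
m = forget_gate(x)            # learned forget-mask
z_masked = m * z              # elementwise masking drops bias factors

y_hat = predictor(z_masked)   # target prediction from the masked code
x_hat = decoder(z)            # reconstruction from the unmasked code

# Gradients from the bias-discriminator are blocked at the encoder: the
# discriminator sees a stop-gradient copy of z, so its adversarial
# signal can update only the forget-gate (the masking process).
b_hat = discriminator(m * tf.stop_gradient(z))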
9.3.3 Unbiased Attribute Prediction with Adversarial Forgetting
As mentioned in Section 9.3.2, the AdvForget framework splits the base NN into an encoder and a predictor. We split the base model discussed in Section 9.3.1 at the fourth residual block such that the encoder and the predictor each have four residual blocks. We design the forget-gate module as a four-layer convolutional NN (CNN) with ReLU activations, interspersed with max-pooling (MP) and batch-normalization (BN) layers. The decoder is designed as a four-layer CNN with ReLU activations, interspersed with upsampling and BN layers. We design the bias-discriminator as a convolutional block involving an MP and a BN layer, followed by two fully-connected layers, all with ReLU activations. The model is implemented in Keras and trained with the TensorFlow backend.

Performing attribute prediction with the resulting model at test time requires generating the unmasked embedding and the forget-mask using the encoder and the forget-gate, respectively, followed by multiplying the two elementwise and feeding the resulting embedding through the predictor. This does not slow down the inference process significantly when compared to using the base model, yet it leads to unbiased predictions.
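A sketch of this test-time path, reusing schematic stand-ins for the three sub-networks (the function name is ours), is given below; note that the decoder and bias-discriminator are training-time components and are not exercised at inference.

def predict_attribute(x, encoder, forget_gate, predictor):
    """Schematic test-time forward pass of the adversarial-forgetting
    model: mask the unmasked embedding and classify the result."""
    z = encoder(x)           # unmasked embedding
    m = forget_gate(x)       # forget-mask
    return predictor(m * z)  # unbiased attribute prediction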
9.3.4 Training Details
We follow the adversarial training strategy described in Chapter 5 of alternating between training
the discriminator versus the rest of the model with a 10:1 frequency. We use the Adam optimizer with a learning rate of 10^-4 and a weight decay of 10^-4. The hyperparameter values searched for the reconstruction and the discriminator loss-weights were {0.01, 0.1, 1} and {0.01, 0.1, 1, 10, 100}, respectively.
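The schedule and settings above can be sketched as follows; the chosen loss weights are hypothetical picks from the stated grids, and the comment about realizing weight decay through L2 regularization is our assumption about the mechanism.

import tensorflow as tf

# Adam with a 1e-4 learning rate; the 1e-4 weight decay can be realized,
# e.g., through L2 kernel regularization on the layers (an assumption on
# our part about the mechanism).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# Hypothetical loss-weight choices drawn from the searched grids:
recon_weight = 0.1   # reconstruction weight, from {0.01, 0.1, 1}
disc_weight = 1.0    # discriminator weight, from {0.01, 0.1, 1, 10, 100}

DISC_STEPS = 10  # discriminator updates per main-model update

def is_discriminator_step(step):
    """True on 10 of every 11 minibatches, giving the 10:1 alternation
    between discriminator updates and updates to the rest of the model."""
    return step % (DISC_STEPS + 1) < DISC_STEPS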
9.4 Experimental Evaluation
9.4.1 Dataset
In this work, we use the IMDB Face dataset (Rothe, Timofte, and Van Gool, 2015) for age and
gender prediction. The original dataset comprises 460,723 cropped face images of 20,284 subjects
with age and gender annotations. As noted in (Rothe, Timofte, and Van Gool, 2015; Kim et al.,
2019), the labels contain significant noise. Hence, we use a cleaned version of the dataset (Kim et al., 2019) in which multiple pretrained networks were used to estimate gender and age labels and only samples with consistent label predictions were retained. Thus, the final version of the dataset used in this work comprises 112,340 images.
Following Kim et al. (2019), the dataset is split into three disjoint subsets – two biased sets for training and cross-testing, and one common balanced test set. These subsets are listed as follows:
• Extreme bias 1 (EB1): women aged 0–29 years and men aged 40 years or more, i.e., younger females and older males
• Extreme bias 2 (EB2): women aged 40 years or more and men aged 0–29 years, i.e., older females and younger males
• Test set: 20% of the images with individuals aged 0–29 years and 40 years or more
The resulting EB1 and EB2 subsets consist of 36,004 and 16,800 images, respectively, while the test set contains 13,129 images. The age labels are binarized as 0–29 years or 40 years and above.
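A sketch of how such splits can be constructed from a metadata table is given below; the column names ('age', 'gender') and the exact balancing of the held-out test set are our assumptions for illustration.

import pandas as pd

def make_bias_splits(df, seed=0):
    """Sketch of the EB1/EB2/test construction on a metadata frame with
    hypothetical 'age' (years) and 'gender' ('male'/'female') columns."""
    young = df["age"] <= 29
    old = df["age"] >= 40
    df = df[young | old].copy()
    df["age_label"] = old.loc[df.index].astype(int)  # 0: 0-29, 1: 40+

    # Hold out 20% of the eligible images as the common test set.
    test = df.sample(frac=0.2, random_state=seed)
    rest = df.drop(test.index)

    female, male = rest["gender"] == "female", rest["gender"] == "male"
    young_r, old_r = rest["age"] <= 29, rest["age"] >= 40
    eb1 = rest[(female & young_r) | (male & old_r)]  # young F + old M
    eb2 = rest[(female & old_r) | (male & young_r)]  # old F + young M
    return eb1, eb2, test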
9.4.2 Experiment Setup
We conduct two sets of experiments – (1) predicting age invariant to gender and (2) predicting gender invariant to age. For each case, two experiments are run: one where the model is trained on the EB1 set and the other where it is trained on EB2. Results are reported on the sets that were not used for training. Results of the previous state-of-the-art BlindEye (Alvi, Zisserman, and Nellåker, 2018) and Learning Not To Learn (LNTL) (Kim et al., 2019) models are reported for comparison. Performance is calculated in terms of accuracy and relative improvement (%) in error-rate (1 − accuracy) over the previous best score reported in the literature. A publicly available Keras implementation of ResNet-18 (https://github.com/qubvel/classification_models), pretrained on ImageNet, was used as the base model in all the experiments.
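As a concrete example of this metric, using the first column of Table 9.1 below (age prediction, trained on EB1 and evaluated on EB2), the previous best accuracy is BlindEye's 0.6680 and Adversarial Forgetting achieves 0.6984, so

RI = [(1 − 0.6680) − (1 − 0.6984)] / (1 − 0.6680) = (0.3320 − 0.3016) / 0.3320 ≈ 9%,

which matches the corresponding RI entry in the table.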
9.4.3 Results
Results of predicting age invariant to gender are presented in Table 9.1. While the previous best scores are split between BlindEye and LNTL across configurations, incorporating AdvForget leads to state-of-the-art results across the board, with 9–36% relative improvements in error-rate.

Table 9.2 summarizes the results of learning to predict gender while being invariant to age. As evident, training with the AdvForget mechanism leads to the best results in all cases, outperforming the previous best LNTL method by 42–55% relative improvement in error-rate.

Hence, based on the results of the above sets of experiments, it is evident that the adoption of the AdvForget training mechanism leads to fairer models that generalize better even when trained on datasets that are extremely biased.
Table 9.1: IMDB Faces – Results for predicting age invariant to gender. Two experiments were conducted where the model was trained on either the EB1 (old males and young females) or the EB2 (young males and old females) set. Base Model indicates the setting where no explicit fairness regularization is used. Performance is reported in terms of accuracy on the set not used for training as well as on a balanced test set. Relative improvement (RI) in error-rate is also reported over the previous best (underlined).

Model                                             Trained on EB1     Trained on EB2
                                                  EB2      Test      EB1      Test
Base Model                                        0.5430   0.7717    0.4891   0.6197
BlindEye (Alvi, Zisserman, and Nellåker, 2018)    0.6680   0.7513    0.6416   0.6240
LNTL (Kim et al., 2019)                           0.6527   0.7743    0.6218   0.6304
Adversarial Forgetting                            0.6984   0.8541    0.6946   0.7503
RI                                                9%       36%       15%      32%
Table 9.2: IMDB Faces – Results for predicting gender invariant to age. Two experiments were conducted where the model was trained on either the EB1 (old males and young females) or the EB2 (young males and old females) set. Base Model indicates the setting where no explicit fairness regularization is used. Performance is reported in terms of accuracy on the set not used for training as well as on a balanced test set. Relative improvement (RI) in error-rate is also reported over the previous best (underlined).

Model                                             Trained on EB1     Trained on EB2
                                                  EB2      Test      EB1      Test
Base Model                                        0.5986   0.8442    0.5784   0.6975
BlindEye (Alvi, Zisserman, and Nellåker, 2018)    0.6374   0.8556    0.5733   0.6990
LNTL (Kim et al., 2019)                           0.6800   0.8666    0.6418   0.7450
Adversarial Forgetting                            0.8364   0.9226    0.8373   0.8748
RI                                                49%      42%       55%      51%
9.5 Summary
Face-based analytics models are known to make unfair predictions as they adopt the inherent
biases in the datasets used to train them. Recent work has provided extensive evidence of these
biases. In this work, we have presented the use of the Adversarial Forgetting training mechanism
to train neural networks for facial attribute prediction such that they do not incorporate biases
into the decision-making process. Results of experiments show that the trained models indeed
generalize better even when trained on extremely biased data.
Chapter 10
Conclusion
Predictive models that incorporate irrelevant nuisance factors in the decision-making process are vulnerable to poor generalization, especially when predictions need to be made on data with previously unseen configurations of nuisances. This dependence on spurious connections between the prediction target and extraneous factors also makes models less reliable for practical use. Supervised machine learning models often learn such false associations due to the nature of the training objective and the optimization procedure. Deep Neural Networks are especially vulnerable to learning such false associations due to their large number of parameters, given that relatively small amounts of training data are typically available in practice. Models also often learn associations between targets and biasing factors of data, which are inherently correlated with prediction targets but are undesired for external reasons, e.g., ethical or legal ones. Hence, it is crucial to develop training frameworks for creating models that are invariant to both nuisance and biasing factors of data.
10.1 Summary of Contributions
We have introduced the concept of invariant representation learning for neural networks and
discussed prior art in this research field. We have presented our new frameworks for inducing
invariance in deep neural networks that bring about exclusion of undesired factors of data from
the latent representations learned by neural network models. We have shown how training models
with these methods makes them more robust in nuisance settings and more fair in bias settings. The presented methods have advanced the state-of-the-art in the field of invariant representation learning and provided new insights about how invariance can be achieved more effectively through learning to disentangle desired and undesired factors of data.
We began by describing our Unsupervised Adversarial Invariance (UAI) and Unified Adversarial Invariance (UnifAI) frameworks in Chapter 3, which provide an intuitive mechanism for learning a split representation of data into desired and undesired factors. This was achieved by augmenting the optimization objective with data reconstruction and inducing competition between it and the original target-prediction loss through a noisy transformation of the embedding that should contain the desired factors.
In Chapter 4 we built upon the intuition of Chapter 3 to develop an information-theoretic formulation of the notion of nuisance invariance through discovery and separation of features (DSF). We also presented an information-theoretic interpretation of UAI and showed that the UAI objective is a relaxation of the DSF objective, indicating that DSF is strictly superior to UAI. This was further validated by empirical results.
We also presented a new supervised framework for learning nuisance- and bias-invariant features in the form of adversarial forgetting in Chapter 5. Here we achieved discovery and separation of features through data reconstruction and elementwise masking of an intermediate embedding, respectively. The masking operation effectively splits the embedding into two sub-embeddings – one that contains the desired factors and another that contains the undesired factors.
Besides presenting new methods for invariance, we also demonstrated the applicability of these
methods in diverse scenarios for building DNN models with increased robustness and fairness.
In Chapter 6 we presented a robust DNN model for presentation attack detection (PAD) that
largely outperformed previous works in PAD. This was made possible by the adoption of the UAI
framework.
A robust end-to-end model for automated speech recognition was presented in Chapter 7 that
also incorporated UAI training. Results showed larger reductions in error-rate when compared to
those from the adoption of previous invariance works.
Chapter 8 presents the benefits of incorporating the DSF framework when training DNNs in low-data and semi-supervised regimes. Experimental results showed that using DSF led to massive drops in error-rate, sometimes even outperforming models that were trained without nuisance invariance on full training datasets while using only a small fraction of the data with DSF.
Lastly, in Chapter 9 we presented results of using the adversarial forgetting mechanism to learn
unbiased models for predicting attributes from face images. Results showed that adopting our
mechanism led to much less bias in the trained models compared to previous fairness methods.
10.2 Supporting Papers
This thesis is supported by the following papers:
• Jaiswal, A., Wu, R. Y., Abd-Almageed, W., and Natarajan, P. (2018). Unsupervised adversarial invariance. In Advances in Neural Information Processing Systems (pp. 5092–5102).
• Jaiswal, A., Moyer, D., Steeg, G. V., Abd-Almageed, W., and Natarajan, P. (2020). Invariant Representations through Adversarial Forgetting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34.
• Jaiswal, A., Xia, S., Masi, I., and AbdAlmageed, W. (2019). RoPAD: Robust Presentation Attack Detection through Unsupervised Adversarial Invariance. In 12th IAPR International Conference on Biometrics.
• Hsu, I. H., Jaiswal, A., and Natarajan, P. (2019). NIESR: Nuisance Invariant End-to-End Speech Recognition. In Proceedings of Interspeech 2019, pp. 456–460.
• Jaiswal, A., Wu, Y., AbdAlmageed, W., and Natarajan, P. (2019). Unified adversarial invariance. In submission to IEEE Transactions on Neural Networks and Learning Systems.
• Jaiswal, A., Brekelmans, R., Moyer, D., Steeg, G. V., AbdAlmageed, W., and Natarajan, P. (2019). Discovery and Separation of Features for Invariant Representation Learning. In submission to International Conference on Machine Learning.
10.3 Other Papers
Other papers written during the PhD journey:
• Jaiswal, A., Wu, Y., AbdAlmageed, W., Masi, I., and Natarajan, P. (2019). AIRD: Adversarial Learning Framework for Image Repurposing Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 11330-11339).
• Jaiswal, A., AbdAlmageed, W., Wu, Y., and Natarajan, P. (2018). Bidirectional Conditional Generative Adversarial Networks. In Asian Conference on Computer Vision (pp. 216-232). Springer, Cham.
• Jaiswal, A., Sabir, E., AbdAlmageed, W., and Natarajan, P. (2017). Multimedia semantic integrity assessment using joint embedding of images and text. In Proceedings of the 25th ACM International Conference on Multimedia (pp. 1465-1471).
• Jaiswal, A., AbdAlmageed, W., Wu, Y., and Natarajan, P. (2018). CapsuleGAN: Generative adversarial capsule network. In Workshop Proceedings of the European Conference on Computer Vision (ECCV).
• Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I., and Natarajan, P. (2019). Recurrent convolutional strategies for face manipulation detection in videos. In Workshop Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
• Liu, J., Jaiswal, A., Yao, K. T., and Raghavendra, C. S. (2015). Autoencoder-derived features as inputs to classification algorithms for predicting well failures. In SPE Western Regional Meeting. Society of Petroleum Engineers.
• Rao, V., Sarabi, M. S., and Jaiswal, A. (2015). Brain tumor segmentation with deep learning. MICCAI Multimodal Brain Tumor Segmentation Challenge (BraTS), 56-59.
• Jaiswal, A., Guo, D., Raghavendra, C. S., and Thompson, P. (2018). Large-scale unsupervised deep representation learning for brain structure. arXiv preprint arXiv:1805.01049.
Bibliography
Achille, A. and S. Soatto (2018a). "Information Dropout: Learning Optimal Representations Through Noisy Computation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 40.12, pp. 2897–2905. issn: 0162-8828. doi: 10.1109/TPAMI.2017.2784440.
Achille, Alessandro and Stefano Soatto (2018b). "Emergence of Invariance and Disentanglement in Deep Representations". In: Journal of Machine Learning Research 19.50, pp. 1–34.
Alemi, Alexander A et al. (2016). "Deep variational information bottleneck". In: arXiv preprint arXiv:1612.00410.
Alvi, Mohsan, Andrew Zisserman, and Christoffer Nellåker (2018). "Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings". In: Proceedings of the European Conference on Computer Vision (ECCV).
Antoniou, Antreas, Amos Storkey, and Harrison Edwards (2017). "Data Augmentation Generative Adversarial Networks". In: arXiv preprint arXiv:1711.04340.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou (2017). "Wasserstein GAN". In: arXiv preprint arXiv:1701.07875.
Asghar, Khurshid, Zulfiqar Habib, and Muhammad Hussain (2017). "Copy-move and splicing image forgery detection and localization techniques: a review". In: Australian Journal of Forensic Sciences 49.3, pp. 281–307. url: http://www.tandfonline.com/doi/abs/10.1080/00450618.2016.1153711.
Atoum, Yousef et al. (2017). "Face anti-spoofing using patch and depth-based CNNs". In: Biometrics (IJCB), 2017 IEEE International Joint Conference on. IEEE, pp. 319–328.
Aubry, Mathieu et al. (2014). "Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models". In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473.
Barker, Jon et al. (2015). "The third CHiME speech separation and recognition challenge: Dataset, task and baselines". In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, pp. 504–511.
Bengio, Yoshua, Aaron Courville, and Pascal Vincent (2013). "Representation learning: A review and new perspectives". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8, pp. 1798–1828.
Berthelot, David, Tom Schumm, and Luke Metz (2017). "BEGAN: Boundary equilibrium generative adversarial networks". In: arXiv preprint arXiv:1703.10717.
Boulkenafet, Zinelabidine, Jukka Komulainen, and Abdenour Hadid (2015). "Face anti-spoofing based on color texture analysis". In: Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, pp. 2636–2640.
— (2017). "Face antispoofing using speeded-up robust features and Fisher vector encoding". In: IEEE Signal Processing Letters 24.2, pp. 141–145.
Brekelmans, Rob et al. (2019). "Exact Rate-Distortion in Autoencoders via Echo Noise". In: arXiv preprint arXiv:1904.07199.
Buolamwini, Joy and Timnit Gebru (2018). "Gender shades: Intersectional accuracy disparities in commercial gender classification". In: Conference on Fairness, Accountability and Transparency, pp. 77–91.
Chan, William et al. (2015). "Listen, attend and spell". In: arXiv preprint arXiv:1508.01211.
Chaudhuri, Bindita, Noranart Vesdapunt, and Baoyuan Wang (2019). "Joint Face Detection and Facial Motion Retargeting for Multiple Faces". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen, Minmin et al. (2012). "Marginalized Denoising Autoencoders for Domain Adaptation". In: Proceedings of the 29th International Conference on Machine Learning. ICML'12. Edinburgh, Scotland: Omnipress, pp. 1627–1634. isbn: 978-1-4503-1285-1.
Chingovska, Ivana, André Anjos, and Sébastien Marcel (2012). "On the effectiveness of local binary patterns in face anti-spoofing". In: Proceedings of the 11th International Conference of the Biometrics Special Interest Group. EPFL-CONF-192369.
Chiu, Chung-Cheng et al. (2018). "State-of-the-art speech recognition with sequence-to-sequence models". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 4774–4778.
Chorowski, Jan K et al. (2015). "Attention-based models for speech recognition". In: Advances in Neural Information Processing Systems, pp. 577–585.
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter (2015). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)". In: CoRR abs/1511.07289. arXiv: 1511.07289. url: http://arxiv.org/abs/1511.07289.
Costa-Pazo, Artur et al. (2016). "The replay-mobile face presentation-attack database". In: Biometrics Special Interest Group (BIOSIG), 2016 International Conference of the. IEEE, pp. 1–7.
Courtland, R. (2018). "Bias detectives: the researchers striving to make algorithms fair". In: Nature 558, pp. 357–360. doi: 10.1038/d41586-018-05469-3.
Cover, Thomas M and Joy A Thomas (2012). Elements of Information Theory. John Wiley & Sons.
Dheeru, Dua and E Karra Taniskidou (2017). UCI Machine Learning Repository. url: http://archive.ics.uci.edu/ml.
Domingos, Pedro (2012). "A Few Useful Things to Know About Machine Learning". In: Commun. ACM 55.10, pp. 78–87. issn: 0001-0782. doi: 10.1145/2347736.2347755.
Donahue, Jeff, Philipp Krähenbühl, and Trevor Darrell (2017). "Adversarial Feature Learning". In: International Conference on Learning Representations.
Duan, Yueqi, Jiwen Lu, and Jie Zhou (2019). "UniformFace: Learning Deep Equidistributed Representation for Face Recognition". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Dumoulin, Vincent et al. (2017). "Adversarially Learned Inference". In: International Conference on Learning Representations.
Durugkar, Ishan, Ian Gemp, and Sridhar Mahadevan (2017). "Generative Multi-Adversarial Networks". In: International Conference on Learning Representations.
Erdogmus, Nesli and Sebastien Marcel (2014). "Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect". In: IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems, pp. 1–6.
Gan, Junying et al. (2017). "3D convolutional neural network based on face anti-spoofing". In: Multimedia and Image Processing (ICMIP), 2017 2nd International Conference on. IEEE, pp. 1–5.
Ganin, Yaroslav and Victor Lempitsky (2014). "Unsupervised domain adaptation by backpropagation". In: arXiv preprint arXiv:1409.7495.
Ganin, Yaroslav et al. (2016). "Domain-adversarial training of neural networks". In: The Journal of Machine Learning Research 17.1, pp. 2096–2030.
Garofolo, J. S. et al. (1993). DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM.
Georghiades, A. S., P. N. Belhumeur, and D. J. Kriegman (2001). "From few to many: illumination cone models for face recognition under variable lighting and pose". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 23.6, pp. 643–660. issn: 0162-8828. doi: 10.1109/34.927464.
Gnanasekar, Sudarsini Tekkam and Svetlana Yanushkevich (2019). "Face Attribute Prediction in Live Video using Fusion of Features and Deep Neural Networks". In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–8.
Goodfellow, Ian et al. (2014). "Generative Adversarial Nets". In: Advances in Neural Information Processing Systems, pp. 2672–2680.
Gragnaniello, Diego et al. (2015). "An Investigation of Local Descriptors for Biometric Spoofing Detection". In: IEEE Transactions on Information Forensics and Security 10, pp. 849–863.
Gretton, Arthur et al. (2005). "Measuring Statistical Dependence with Hilbert-Schmidt Norms". In: Algorithmic Learning Theory. Ed. by Sanjay Jain, Hans Ulrich Simon, and Etsuji Tomita. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 63–77. isbn: 978-3-540-31696-1.
Gretton, Arthur et al. (2007). "A Kernel Method for the Two-Sample-Problem". In: Advances in Neural Information Processing Systems 19. Ed. by B. Schölkopf, J. C. Platt, and T. Hoffman. MIT Press, pp. 513–520.
Gross, R. et al. (2008). "Multi-PIE". In: 2008 8th IEEE International Conference on Automatic Face Gesture Recognition, pp. 1–8. doi: 10.1109/AFGR.2008.4813399.
Gulrajani, Ishaan et al. (2017). "Improved training of Wasserstein GANs". In: Advances in Neural Information Processing Systems, pp. 5769–5779.
Gupta, M., P. Zhao, and J. Han (2012). "Evaluating Event Credibility on Twitter". In: Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 153–164.
He, Kaiming et al. (2016). "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Higgins, Irina et al. (2018). "SCAN: Learning Hierarchical Compositional Visual Concepts". In: International Conference on Learning Representations.
Hinton, Geoffrey E, Alex Krizhevsky, and Sida D Wang (2011). "Transforming auto-encoders". In: International Conference on Artificial Neural Networks. Springer, pp. 44–51.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long Short-Term Memory". In: Neural Computation 9.8, pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
Hsu, I-Hung, Ayush Jaiswal, and Premkumar Natarajan (2019). "NIESR: Nuisance Invariant End-to-End Speech Recognition". In: Proceedings of Interspeech 2019, pp. 456–460.
Hsu, Wei-Ning and James Glass (2018). "Extracting domain invariant features by unsupervised learning for robust automatic speech recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5614–5618.
Huang, Chen et al. (2019). "Deep imbalanced learning for face recognition and attribute prediction". In: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Im, Daniel Jiwoong et al. (2016). "Generative Adversarial Metric". In:
Iscen, Ahmet et al. (2019). "Label propagation for deep semi-supervised learning". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5070–5079.
ISO/IEC JTC 1/SC 37 - Biometrics. Information Technology Biometric presentation attack detection part 1: Framework. (2016). Standard. https://www.iso.org/obp/ui/iso. International Organization for Standardization.
Isola, Phillip et al. (2017). "Image-To-Image Translation With Conditional Adversarial Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jaiswal, Ayush et al. (2017). "Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text". In: pp. 1465–1471.
Jaiswal, Ayush et al. (2018a). "Bidirectional Conditional Generative Adversarial Networks". In: Computer Vision – ACCV 2018. Springer International Publishing.
— (2018b). "CapsuleGAN: Generative Adversarial Capsule Network". In: The European Conference on Computer Vision (ECCV) Workshops.
Jaiswal, Ayush et al. (2018c). "Large-Scale Unsupervised Deep Representation Learning for Brain Structure". In: arXiv preprint arXiv:1805.01049.
Jaiswal, Ayush et al. (2018d). "Unsupervised Adversarial Invariance". In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., pp. 5097–5107.
Jaiswal, Ayush et al. (2019a). "AIRD: Adversarial Learning Framework for Image Repurposing Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jaiswal, Ayush et al. (2019b). "Discovery and Separation of Features for Invariant Representation Learning". In: arXiv preprint arXiv:1912.00646.
Jaiswal, Ayush et al. (2019c). "RoPAD: Robust Presentation Attack Detection through Unsupervised Adversarial Invariance". In: 12th IAPR International Conference on Biometrics, ICB 2019, Crete, Greece.
Jaiswal, Ayush et al. (2019d). "Unified Adversarial Invariance". In: arXiv preprint arXiv:1905.03629.
Jaiswal, Ayush et al. (2020). "Invariant Representations through Adversarial Forgetting". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34.
Jaitly, Navdeep et al. (2016). "An online sequence-to-sequence model using partial conditioning". In: Advances in Neural Information Processing Systems, pp. 5067–5075.
Jin, Fang et al. (2013). "Epidemiological Modeling of News and Rumors on Twitter". In: Proceedings of the 7th Workshop on Social Network Mining and Analysis.
Jourabloo, Amin, Yaojie Liu, and Xiaoming Liu (2018). "Face de-spoofing: Anti-spoofing via noise modeling". In: arXiv preprint arXiv:1807.09968 1.2, p. 3.
Kaggle, Painter by Numbers. Available at https://www.kaggle.com/c/painter-by-numbers.
Kärkkäinen, Kimmo and Jungseock Joo (2019). "FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age". In: arXiv preprint arXiv:1908.04913.
Kim, Byungju et al. (2019). "Learning not to learn: Training deep neural networks with biased data". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9012–9020.
Kingma, Diederik P. and Max Welling (2014). "Auto-encoding Variational Bayes". In: International Conference on Learning Representations.
Kiritchenko, Svetlana and Saif M Mohammad (2018). "Examining gender and race bias in two hundred sentiment analysis systems". In: arXiv preprint arXiv:1805.04508.
Ko, Tom et al. (2015). "Audio augmentation for speech recognition". In: Sixteenth Annual Conference of the International Speech Communication Association.
Kohavi, Ron (1996). "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-tree Hybrid". In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. KDD'96. Portland, Oregon: AAAI Press, pp. 202–207.
Kortylewski, Adam et al. (2018a). "Can Synthetic Faces Undo the Damage of Dataset Bias To Face Recognition and Facial Landmark Detection?" In: arXiv preprint arXiv:1811.08565.
Kortylewski, Adam et al. (2018b). "Empirically analyzing the effect of dataset biases on deep face recognition systems". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2093–2102.
— (2019). "Analyzing and reducing the damage of dataset bias to face recognition with synthetic data". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Krizhevsky, Alex (2009). "Learning Multiple Layers of Features from Tiny Images". Technical report, University of Toronto.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum (2015). "Human-level concept learning through probabilistic program induction". In: Science 350.6266, pp. 1332–1338. issn: 0036-8075. doi: 10.1126/science.aab3050. eprint: http://science.sciencemag.org/content/350/6266/1332.full.pdf.
LeCun, Yann et al. (1998). "Gradient-based Learning Applied to Document Recognition". In: Proceedings of the IEEE 86.11, pp. 2278–2324.
Ledig, Christian et al. (2017). "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Li, Jian et al. (2019). "DSFD: Dual Shot Face Detector". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Li, Yi and Nuno Vasconcelos (2019). "REPAIR: Removing representation bias by dataset resampling". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9572–9581.
Li, Yujia, Kevin Swersky, and Richard Zemel (2014). "Learning unbiased features". In: arXiv preprint arXiv:1412.5244.
Liang, Davis, Zhiheng Huang, and Zachary C Lipton (2018). "Learning noise-invariant representations for robust speech recognition". In: arXiv preprint arXiv:1807.06610.
Liu, Xiaomo et al. (2015a). "Real-time Rumor Debunking on Twitter". In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1867–1870.
Liu, Yaojie, Amin Jourabloo, and Xiaoming Liu (2018). "Learning deep models for face anti-spoofing: Binary or auxiliary supervision". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 389–398.
Liu, Ziwei et al. (2015b). "Deep Learning Face Attributes in the Wild". In: Proceedings of International Conference on Computer Vision (ICCV).
Lopez, Romain et al. (2018). "Information Constraints on Auto-Encoding Variational Bayes". In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., pp. 6117–6128.
Louizos, Christos et al. (2016). "The variational fair autoencoder". In: Proceedings of International Conference on Learning Representations.
Ma, Jing et al. (2016). "Detecting Rumors from Microblogs with Recurrent Neural Networks." In: pp. 3818–3824.
Maaten, Laurens van der and Geoffrey Hinton (2008). "Visualizing data using t-SNE". In: Journal of Machine Learning Research 9.Nov, pp. 2579–2605.
Määttä, Jukka, Abdenour Hadid, and Matti Pietikäinen (2011). "Face spoofing detection from single images using micro-texture analysis". In: Biometrics (IJCB), 2011 International Joint Conference on. IEEE, pp. 1–7.
Mahasseni, Behrooz, Michael Lam, and Sinisa Todorovic (2017). "Unsupervised Video Summarization With Adversarial LSTM Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Makhzani, Alireza et al. (2016). "Adversarial Autoencoders". In: International Conference on Learning Representations. url: http://arxiv.org/abs/1511.05644.
Marchi, Regina (2012). "With Facebook, Blogs, and Fake News, Teens Reject Journalistic "Objectivity"". In: Journal of Communication Inquiry 36.3, pp. 246–262. doi: 10.1177/0196859912458700.
Masi, I. et al. (2019). "Learning Pose-Aware Models for Pose-Invariant Face Recognition in the Wild". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2, pp. 379–393. issn: 0162-8828. doi: 10.1109/TPAMI.2018.2792452.
Mathieu, Michael F. et al. (2016). "Disentangling Factors of Variation in Deep Representation using Adversarial Training". In: Advances in Neural Information Processing Systems, pp. 5040–5048.
Matthey, Loic et al. (2017). dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/.
Maze, Brianna et al. (2018). "IARPA Janus Benchmark–C: Face Dataset and Protocol". In: 2018 International Conference on Biometrics (ICB).
Meng, Zhong et al. (2018). "Speaker-invariant training via adversarial learning". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5969–5973.
Merler, Michele et al. (2019). "Diversity in Faces". In: arXiv preprint arXiv:1901.10436.
Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger (2017). "Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks". In: International Conference on Machine Learning (ICML).
Miao, Jianyu and Lingfeng Niu (2016a). "A Survey on Feature Selection". In: Procedia Computer Science 91 (ITQM 2016), pp. 919–926. issn: 1877-0509. doi: 10.1016/j.procs.2016.07.111. url: http://www.sciencedirect.com/science/article/pii/S1877050916313047.
— (2016b). "A Survey on Feature Selection". In: Procedia Computer Science 91 (ITQM 2016), pp. 919–926. issn: 1877-0509. doi: 10.1016/j.procs.2016.07.111.
Mirza, Mehdi and Simon Osindero (2014). "Conditional Generative Adversarial Nets". In: arXiv preprint arXiv:1411.1784.
Moyer, Daniel et al. (2018). "Invariant Representations without Adversarial Training". In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., pp. 9102–9111.
Muthukumar, Vidya (2019). "Color-theoretic experiments to understand unequal gender classification accuracy from face images". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Nagpal, Shruti et al. (2019). "Deep Learning for Face Recognition: Pride or Prejudiced?" In: arXiv preprint arXiv:1904.01219.
Nogueira, Rodrigo Frassetto, Roberto de Alencar Lotufo, and Rubens Campos Machado (2016). "Fingerprint Liveness Detection Using Convolutional Neural Networks." In: IEEE Transactions on Information Forensics and Security 11.6, pp. 1206–1213.
Noh, Hyeonwoo et al. (2017). "Large-scale image retrieval with attentive deep local features". In: pp. 3456–3465.
Odena, Augustus, Christopher Olah, and Jonathon Shlens (2017). "Conditional Image Synthesis with Auxiliary Classifier GANs". In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR, pp. 2642–2651. url: http://proceedings.mlr.press/v70/odena17a.html.
Paul, Douglas B and Janet M Baker (1992). "The design for the Wall Street Journal-based CSR corpus". In: Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, pp. 357–362.
Peng, Fei, Le Qin, and Min Long (2017). "Face presentation attack detection using guided scale texture". In: Multimedia Tools and Applications, pp. 1–27.
— (2018). "CCoLBP: Chromatic Co-Occurrence of Local Binary Pattern for Face Presentation Attack Detection". In: 2018 27th International Conference on Computer Communication and Networks (ICCCN). IEEE, pp. 1–9.
Perarnau, Guim et al. (2016). "Invertible Conditional GANs for image editing". In: NIPS Workshop on Adversarial Training.
Phan, Quoc-Tin et al. (2016). "FACE spoofing detection using LDP-TOP". In: Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, pp. 404–408.
Prabhavalkar, Rohit et al. (2017). "A Comparison of Sequence-to-Sequence Models for Speech Recognition." In: Proceedings of Interspeech 2017.
Radford, Alec, Luke Metz, and Soumith Chintala (2016). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks". In: International Conference on Learning Representations.
Raghavendra, Ramachandra, Kiran B Raja, and Christoph Busch (2015). "Presentation attack detection for face recognition using light field camera". In: IEEE Transactions on Image Processing 24.3, pp. 1060–1075.
Raji, Inioluwa Deborah and Joy Buolamwini (2019). "Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products". In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 429–435.
Ramachandra, Raghavendra and Christoph Busch (2017). "Presentation attack detection methods for face recognition systems: a comprehensive survey". In: ACM Computing Surveys (CSUR) 50.1, p. 8.
Rao, Kanishka, Haşim Sak, and Rohit Prabhavalkar (2017). "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer". In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, pp. 193–199.
Reed, Scott et al. (2016). "Generative Adversarial Text-to-Image Synthesis". In: Proceedings of The 33rd International Conference on Machine Learning.
Rothe, Rasmus, Radu Timofte, and Luc Van Gool (2015). "DEX: Deep expectation of apparent age from a single image". In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–15.
Ruchansky, Natali, Sungyong Seo, and Yan Liu (2017). "CSI: A Hybrid Deep Model for Fake News Detection". In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, pp. 797–806.
Russakovsky, Olga et al. (2015). "ImageNet large scale visual recognition challenge". In: International Journal of Computer Vision 115.3, pp. 211–252.
Sabir, Ekraam et al. (2018). "Deep Multimodal Image-Repurposing Detection". In: pp. 1337–1345.
Sabour, Sara, Nicholas Frosst, and Geoffrey E Hinton (2017a). "Dynamic routing between capsules". In: Advances in Neural Information Processing Systems, pp. 3859–3869.
Sabour, Sara, Nicholas Frosst, and Geoffrey E Hinton (2017b). "Dynamic routing between capsules". In: Advances in Neural Information Processing Systems, pp. 3859–3869.
Saito, Itsumi et al. (2017). "Improving Neural Text Normalization with Data Augmentation at Character- and Morphological Levels". In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Taipei, Taiwan: Asian Federation of Natural Language Processing, pp. 257–262.
Salimans, Tim et al. (2016). "Improved Techniques for Training GANs". In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., pp. 2234–2242. url: http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf.
Serdyuk, Dmitriy et al. (2016). "Invariant representations for noisy speech recognition". In: arXiv preprint arXiv:1612.01928.
Simonyan, Karen and Andrew Zisserman (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: CoRR abs/1409.1556. arXiv: 1409.1556. url: http://arxiv.org/abs/1409.1556.
Srinivas, Nisha et al. (2019). "Face Recognition Algorithm Bias: Performance Differences on Images of Children and Adults". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Srivastava, Nitish et al. (2014). "Dropout: A simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1, pp. 1929–1958.
Tarvainen, Antti and Harri Valpola (2017). "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., pp. 1195–1204.
Tishby, Naftali, Fernando C. Pereira, and William Bialek (1999). "The information bottleneck method". In: 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377.
Vangara, Kushal et al. (2019). "Characterizing the Variability in Face Recognition Accuracy Relative to Race". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Vinyals, Oriol et al. (2016). "Matching Networks for One Shot Learning". In: Advances in Neural Information Processing Systems 29, pp. 3630–3638.
Wang, Mei et al. (2019). "Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network". In: Proceedings of the IEEE International Conference on Computer Vision, pp. 692–702.
Wen, Di, Hu Han, and Anil K Jain (2015). "Face spoof detection with image distortion analysis". In: IEEE Transactions on Information Forensics and Security 10.4, pp. 746–761.
Wu, K., S. Yang, and K. Q. Zhu (2015). "False rumors detection on Sina Weibo by propagation structures". In: 2015 IEEE 31st International Conference on Data Engineering, pp. 651–662. doi: 10.1109/ICDE.2015.7113322.
Wu, Yue, Wael Abd-Almageed, and Prem Natarajan (2017). "Deep matching and validation network: An end-to-end solution to constrained image splicing localization and detection". In: pp. 1480–1502.
— (2018a). "BusterNet: Detecting copy-move image forgery with source/target localization". In: pp. 168–184.
— (2018b). "Image Copy-Move Forgery Detection via an End-to-End Deep Neural Network". In: pp. 1907–1915.
Xiao, Han, Kashif Rasul, and Roland Vollgraf (2017). Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv: cs.LG/1708.07747 [cs.LG].
Xie, Qizhe et al. (2017). "Controllable Invariance through Adversarial Feature Learning". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., pp. 585–596.
Yang, Jianwei, Zhen Lei, and Stan Z Li (2014). "Learn convolutional neural network for face anti-spoofing". In: arXiv preprint arXiv:1408.5601.
Yang, Jianwei et al. (2017). "LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation". In: International Conference on Learning Representations.
Zemel, Rich et al. (2013). "Learning Fair Representations". In: Proceedings of the 30th International Conference on Machine Learning. Ed. by Sanjoy Dasgupta and David McAllester. Vol. 28. Proceedings of Machine Learning Research 3. Atlanta, Georgia, USA: PMLR, pp. 325–333.
Zhang, Yu, William Chan, and Navdeep Jaitly (2017). "Very deep convolutional networks for end-to-end speech recognition". In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 4845–4849.
Zhang, Yuting, Kibok Lee, and Honglak Lee (2016). "Augmenting supervised neural networks with unsupervised objectives for large-scale image classification". In: International Conference on Machine Learning, pp. 612–621.
Zhao, Kai, Jingyi Xu, and Ming-Ming Cheng (2019). "RegularFace: Deep Face Recognition via Exclusive Regularization". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhou, Shiyu et al. (2018). "Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese". In: arXiv preprint arXiv:1804.10752.
Appendix A
Proofs
A.1 Variance Inequality
For two random variables $A$ and $B$, possibly dependent,

\begin{align}
\operatorname{Cov}(A,B) &\leq \sqrt{\operatorname{Var}(A)\operatorname{Var}(B)} \tag{A.1} \\
&\leq \tfrac{1}{2}\left(\operatorname{Var}(A) + \operatorname{Var}(B)\right) \tag{A.2}
\end{align}

Hence,

\begin{align}
\operatorname{Var}(A+B) &= \operatorname{Var}(A) + \operatorname{Var}(B) + 2\operatorname{Cov}(A,B) \tag{A.3} \\
&\leq 2\operatorname{Var}(A) + 2\operatorname{Var}(B) \tag{A.4}
\end{align}

Equation A.2 holds due to the geometric mean–arithmetic mean inequality. Let $A = (m_i - \mathbb{E}[m_i])z_i$ and $B = \mathbb{E}[m_i]z_i$. Then

\begin{align}
\operatorname{Var}(m_i z_i) &= \operatorname{Var}(A+B) \tag{A.5} \\
&\leq 2\operatorname{Var}\!\left((m_i - \mathbb{E}[m_i])z_i\right) + 2\,\mathbb{E}[m_i]^2\operatorname{Var}(z_i) \tag{A.6}
\end{align}
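A quick Monte Carlo sanity check of inequality (A.6) can be written in a few lines. The dependence between m_i and z_i below is an arbitrary choice for illustration, not one taken from the experiments in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary, dependent m and z for illustration only.
m = rng.binomial(1, 0.3, size=1_000_000).astype(float)  # a gate-like variable
z = rng.normal(1.5, 2.0, size=1_000_000) + 0.5 * m      # z depends on m

# Left-hand side of (A.6) and the bound on the right-hand side.
lhs = np.var(m * z)
rhs = 2 * np.var((m - m.mean()) * z) + 2 * (m.mean() ** 2) * np.var(z)
print(f"Var(m*z) = {lhs:.4f} <= bound = {rhs:.4f}: {bool(lhs <= rhs)}")
```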
Appendix B
Samples from Benchmarking Datasets Created in this Work
B.1 MNIST-ROT
Figure B.1: Samples from MNIST-ROT dataset. Column titles indicate rotation angles.
Column titles in green indicate rotation angles used for training and in-domain testing,
while others indicate out-of-domain testing angles.
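A dataset in the spirit of MNIST-ROT can be generated with standard tooling. The sketch below is a minimal illustration, assuming torchvision for loading MNIST and scipy for rotation; the angle sets are assumptions based on common MNIST-ROT setups, and the values actually used in the thesis are those given by the (green) column titles of the figure.

```python
import numpy as np
from scipy.ndimage import rotate
from torchvision import datasets

# Assumed angle sets; confirm against the figure's column titles.
TRAIN_ANGLES = [0, 22.5, -22.5, 45, -45]   # training / in-domain testing
OOD_TEST_ANGLES = [55, -55, 65, -65]       # out-of-domain testing

mnist = datasets.MNIST(root="./data", train=True, download=True)
images = mnist.data.numpy()    # (60000, 28, 28) uint8
labels = mnist.targets.numpy()

def rotated_copies(imgs, angles):
    """Return one rotated copy of every image per angle, on the 28x28 canvas."""
    return np.concatenate(
        [rotate(imgs, a, axes=(1, 2), reshape=False, order=1) for a in angles]
    )

train_images = rotated_copies(images, TRAIN_ANGLES)
train_labels = np.tile(labels, len(TRAIN_ANGLES))
```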
B.2 MNIST-DIL
Figure B.2: Samples from MNIST-DIL dataset. Column titles indicate erosion/dilation kernel sizes. Column title "0" indicates the original digit.
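Similarly, MNIST-DIL-style images can be produced with standard morphological operations. The helper below is a minimal sketch using OpenCV; the convention that negative kernel sizes denote erosion is an assumption made for this illustration.

```python
import cv2
import numpy as np

def erode_or_dilate(digit: np.ndarray, kernel_size: int) -> np.ndarray:
    """Apply erosion (negative sizes) or dilation (positive sizes) to a digit.

    kernel_size == 0 returns the original digit, matching the "0" column.
    The sign convention is an assumption for this illustration.
    """
    if kernel_size == 0:
        return digit
    kernel = np.ones((abs(kernel_size), abs(kernel_size)), dtype=np.uint8)
    if kernel_size < 0:
        return cv2.erode(digit, kernel)
    return cv2.dilate(digit, kernel)

# Example: thin and thicken a random 28x28 "digit".
digit = (np.random.rand(28, 28) > 0.7).astype(np.uint8) * 255
thin, thick = erode_or_dilate(digit, -2), erode_or_dilate(digit, 3)
```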
B.3 Chairs
Figure B.3: Samples from Chairs dataset. Row headers indicate orientation category (Back, Left, Front, Right).
B.4 Multi-PIE
Figure B.4: Samples from Multi-PIE dataset. Column titles indicate pose angles.
Illumination conditions are (top to bottom): neutral, right, frontal, and left.
Appendix C
Other Research
C.1 AIRD: Adversarial Learning Framework for Image Repurposing Detection
The internet-driven information age in which we are currently living has seen rapid advances in technology that have made large-scale creation and transmission of information increasingly easy. Consequently, the manner in which people consume information has evolved from printed media and cable television to digital sources. Simultaneously, social networking platforms have evolved to make it easier for people to disseminate information quickly, within communities and publicly. This provides an excellent way for people to share news quickly, making social media a popular news source, especially among the youth (Marchi, 2012). However, this ease of information propagation has also made social networks a popular mode of transmission of fake news.
Given the potency of falsified information propagating on the internet, several activist groups have launched crowd-sourced efforts (such as Snopes, https://www.snopes.com/) towards debunking fake news. However, the sheer volume and rate at which information is being created and disseminated necessitate developing automated ways of validating information. Several methods have, hence, been developed recently for detecting rumors on online forums (Gupta, Zhao, and Han, 2012; Jin et al., 2013; Liu
et al., 2015a; Ma et al., 2016; Wu, Yang, and Zhu, 2015; Ruchansky, Seo, and Liu, 2017), digital
manipulations of images (Asghar, Habib, and Hussain, 2017; Wu, Abd-Almageed, and Natarajan,
2017; Wu, Abd-Almageed, and Natarajan, 2018a; Wu, Abd-Almageed, and Natarajan, 2018b),
and semantic inconsistencies in multimedia data (Jaiswal et al., 2017; Sabir et al., 2018). While
the detection of digital manipulation has gained most of the research attention over the years,
rumor detection and semantic integrity verification are much newer areas of research.
We focus here on detecting image repurposing, a form of semantic manipulation of multimedia data where untampered images are reused in combination with falsified metadata to spread misinformation. Jaiswal et al. (2017) define the broader problem of multimedia semantic integrity assessment as the verification of consistency between the media asset (e.g., image) and different components of the associated metadata (e.g., text caption, geo-location, etc.), since the asset and the metadata are expected to be a coherent package. They also introduce the concept of using a reference dataset (RD) of untampered packages to assist the validation of query packages. Image repurposing detection falls under this umbrella and has been explored in packages of images and captions (Jaiswal et al., 2017) as well as those that additionally contain Global Positioning System (GPS) information (Sabir et al., 2018).
The method proposed in (Jaiswal et al., 2017) for integrity assessment detects inconsistencies in packages whose entire captions are potentially copied from other packages at random. Sabir et al. (2018), on the other hand, present a method for the detection of manipulations of named entities within captions. However, the MEIR dataset proposed and evaluated on in (Sabir et al., 2018) falls short on the deceptive potential of entity manipulations because they are implemented by randomly swapping the entity in a given caption with the same class of entity (person, organization, or location) from a caption in an unrelated package.
One of the main challenges for developing image repurposing detection methods is the lack of training and evaluation data. While crowd-sourcing is a potential alternative, it is expensive and time-consuming. In light of this, we propose a novel framework (Jaiswal et al., 2019a) for image repurposing detection, which can be trained in the absence of training data containing manipulated metadata. The proposed framework, Adversarial Image Repurposing Detection (AIRD), is modeled to simulate the real-world adversarial interplay between a bad actor who repurposes images with counterfeit metadata and a watchdog who verifies the semantic consistency between images and their accompanying metadata. More specifically, AIRD consists of two models: a counterfeiter and a detector, which are trained adversarially.
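To make the adversarial interplay concrete, the following self-contained PyTorch sketch alternates updates between a toy detector and a toy counterfeiter operating on placeholder feature vectors. It is a hypothetical simplification, not the AIRD implementation: in particular, conditioning on evidence retrieved from the reference dataset is omitted, and the architectures and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
IMG_DIM, META_DIM = 64, 16  # placeholder feature sizes (assumptions)

# Toy stand-ins for the two adversaries.
detector = nn.Sequential(nn.Linear(IMG_DIM + META_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
counterfeiter = nn.Sequential(nn.Linear(IMG_DIM, 32), nn.ReLU(), nn.Linear(32, META_DIM))

d_opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
c_opt = torch.optim.Adam(counterfeiter.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    img = torch.randn(32, IMG_DIM)    # placeholder image features
    meta = torch.randn(32, META_DIM)  # placeholder genuine metadata features
    real = torch.ones(32, 1)
    fake = torch.zeros(32, 1)

    # Detector update: tell genuine packages apart from counterfeit ones.
    fake_meta = counterfeiter(img)
    d_loss = (bce(detector(torch.cat([img, meta], dim=1)), real)
              + bce(detector(torch.cat([img, fake_meta.detach()], dim=1)), fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Counterfeiter update: produce metadata the detector accepts as genuine.
    c_loss = bce(detector(torch.cat([img, counterfeiter(img)], dim=1)), real)
    c_opt.zero_grad(); c_loss.backward(); c_opt.step()
```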
Following the approach of previous works, the proposed framework employs a reference dataset of unmanipulated packages as a source of world knowledge. While the detector gathers evidence from the reference set, the counterfeiter exploits it to conjure convincingly deceptive fake metadata for a given query package. The proposed framework can be applied to all forms of metadata. However, since generating natural language text is an open research problem, the experimental evaluation is performed on structured metadata. Furthermore, previous methods on image repurposing detection focus only on entity manipulations within captions. Hence, AIRD could be employed in such cases by first extracting entities using named entity recognition. The proposed framework exhibits state-of-the-art performance on the Google Landmarks dataset (Noh et al., 2017) for location-identity verification, a variant of the IJB-C dataset (Maze et al., 2018), called IJBC-IRD, which we created for subject-identity verification, and the Painter by Numbers dataset (Kaggle, Painter by Numbers) for painting-artist verification. Results on this diverse collection of datasets, which we make publicly available (https://github.com/isi-vista/AIRD-Datasets), illustrate the generalization capability of the proposed model.
C.2 Bidirectional Conditional Generative Adversarial Networks
Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) have recently gained immense
popularity in generative modeling of data from complex distributions for a variety of applications
including image editing (Perarnau et al., 2016), image synthesis from text descriptions (Reed et al., 2016), image super-resolution (Ledig et al., 2017), and video summarization (Mahasseni, Lam, and Todorovic, 2017). GANs essentially learn a mapping from a latent distribution to a higher dimensional, more complex data distribution. Many variants of the GAN framework have recently been developed to augment GANs with more functionality and to improve their performance in both data modeling and target applications (Perarnau et al., 2016; Donahue, Krähenbühl, and Darrell, 2017; Dumoulin et al., 2017; Makhzani et al., 2016; Mathieu et al., 2016; Mescheder, Nowozin, and Geiger, 2017; Mirza and Osindero, 2014; Yang et al., 2017). Conditional GAN (cGAN) (Mirza and Osindero, 2014) is a variant of standard GANs that was introduced to augment GANs with the capability of conditional generation of data samples based on both latent variables (or intrinsic factors) and known auxiliary information (or extrinsic factors) such as class information or associated data from other modalities. Desired properties of cGANs include the ability to disentangle the intrinsic and extrinsic factors, and also to disentangle the components of extrinsic factors from each other, in the generation process, such that the incorporation of one factor minimally influences that of the others. Inversion of such a cGAN provides a disentangled, information-rich representation of data, which can be used for downstream tasks (such as classification) instead of raw data. Therefore, an optimal framework would be one that ensures that the generation process uses factors in a disentangled manner and provides an encoder to invert the generation process, giving us a disentangled encoding. The existing equivalent of such a framework is the Invertible cGAN (IcGAN) (Perarnau et al., 2016), which learns inverse mappings to intrinsic and extrinsic factors for pretrained cGANs. The limitations of post-hoc training of encoders in IcGANs are that it prevents them from (1) influencing the disentanglement of factors during generation, and (2) learning the inverse mapping to intrinsic factors effectively, as noted for GANs in (Dumoulin et al., 2017). Other encoder-based cGAN models either do not encode extrinsic factors (Makhzani et al., 2016) or encode them in fixed-length continuous vectors that do not have an explicit form (Mathieu et al., 2016), which prevents the generation of data with arbitrary combinations of extrinsic attributes.
We propose the Bidirectional Conditional GAN (BiCoGAN) (Jaiswal et al., 2018a), which overcomes the deficiencies of the aforementioned encoder-based cGANs. The encoder in the proposed BiCoGAN is trained simultaneously with the generator and the discriminator, and learns inverse mappings of data samples to both intrinsic and extrinsic factors. Hence, our model exhibits implicit regularization, mode coverage, and robustness against mode collapse similar to Bidirectional GANs (BiGANs) (Donahue, Krähenbühl, and Darrell, 2017; Dumoulin et al., 2017). However, training BiCoGANs naïvely does not produce good results in practice, because the encoder fails to model the inverse mapping to extrinsic attributes and the generator fails to incorporate the extrinsic factors while producing data samples. We present crucial techniques for training BiCoGANs, which address both of these problems. BiCoGANs outperform IcGANs on both encoding and generation tasks, and have the added advantages of end-to-end training, robustness to mode collapse, and fewer model parameters. Additionally, the BiCoGAN encoder outperforms IcGAN and state-of-the-art methods on facial attribute prediction on cropped and aligned CelebA (Liu et al., 2015b) images. Furthermore, state-of-the-art performance can be achieved at predicting previously unseen facial attributes using features learned by our model instead of raw images.
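The central structural idea, a discriminator that scores joint tuples of data, intrinsic factors z, and extrinsic factors c, so that the encoder and generator are trained against it simultaneously, can be sketched as follows. Architectures and dimensions here are placeholder assumptions, not the published BiCoGAN configuration, and the crucial training techniques mentioned above are omitted.

```python
import torch
import torch.nn as nn

Z_DIM, C_DIM, X_DIM = 32, 10, 784  # intrinsic, extrinsic, data dims (assumed)

generator = nn.Sequential(nn.Linear(Z_DIM + C_DIM, 128), nn.ReLU(),
                          nn.Linear(128, X_DIM), nn.Tanh())
encoder = nn.Sequential(nn.Linear(X_DIM, 128), nn.ReLU(),
                        nn.Linear(128, Z_DIM + C_DIM))
# Unlike a plain cGAN discriminator, this one scores joint (x, z, c) tuples.
discriminator = nn.Sequential(nn.Linear(X_DIM + Z_DIM + C_DIM, 128), nn.ReLU(),
                              nn.Linear(128, 1))

x_real = torch.rand(16, X_DIM) * 2 - 1
z = torch.randn(16, Z_DIM)
c = torch.eye(C_DIM)[torch.randint(0, C_DIM, (16,))]  # one-hot extrinsic factors

x_fake = generator(torch.cat([z, c], dim=1))  # generation direction
zc_hat = encoder(x_real)                      # inference direction

bce = nn.BCEWithLogitsLoss()
d_loss = (bce(discriminator(torch.cat([x_real, zc_hat], dim=1)), torch.ones(16, 1))
          + bce(discriminator(torch.cat([x_fake, z, c], dim=1)), torch.zeros(16, 1)))
# The generator and encoder are optimized jointly to fool this discriminator,
# which ties the inverse mapping to both intrinsic and extrinsic factors.
```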
C.3 CapsuleGAN: Generative Adversarial Capsule Network
Generative modeling of data is a challenging machine learning problem that has garnered tremendous interest recently, partly due to the invention of generative adversarial networks (GANs) (Goodfellow et al., 2014) and their several sophisticated variants (see https://github.com/hindupuravinash/the-gan-zoo). A GAN model is typically composed of two neural networks: (1) a generator that attempts to transform samples drawn from a prior distribution into samples from a complex data distribution with much higher dimensionality, and (2) a discriminator that decides whether the given sample is real or from the generator's distribution. The two components are trained by playing an adversarial game. GANs have shown great promise in modeling highly complex distributions underlying real-world data, especially images. However, they are notorious for being difficult to train and have problems with stability, vanishing gradients, mode collapse, and inadequate mode coverage (Radford, Metz, and Chintala, 2016; Salimans et al., 2016; Durugkar, Gemp, and Mahadevan, 2017). Consequently, there has been a large amount of work towards improving GANs by using better objective functions (Arjovsky, Chintala, and Bottou, 2017; Gulrajani et al., 2017; Berthelot, Schumm, and Metz, 2017), sophisticated training strategies (Salimans et al., 2016), structural hyperparameters (Radford, Metz, and Chintala, 2016; Odena, Olah, and Shlens, 2017), and empirically successful tricks (see https://github.com/soumith/ganhacks).
Radford et al. (2016) provide a set of architectural guidelines, formulating a class of convolutional neural networks (CNNs) that have since been extensively used to create GANs (referred to as Deep Convolutional GANs or DCGANs) for modeling image data and other related applications (Reed et al., 2016; Isola et al., 2017). More recently, however, Sabour et al. (2017b) introduced capsule networks (CapsNets) as a powerful alternative to CNNs, which learn a more equivariant representation of images that is more robust to changes in pose and spatial relationships of parts of objects in images (Hinton, Krizhevsky, and Wang, 2011), information that CNNs lose during training, by design. Inspired by the working mechanism of optic neurons in the human visual system, capsules were first introduced by Hinton et al. (2011) as locally invariant groups of neurons that learn to recognize visual entities and output activation vectors that represent both the presence of those entities and their properties relevant to the visual task (such as object classification). The training algorithm of CapsNets involves a routing mechanism between capsules in successive layers of the network that imitates the hierarchical communication of information across neurons in the regions of the human brain responsible for visual perception and understanding.
The initial intuition behind the design of deep neural networks was to imitate the human brain in modeling hierarchical recognition of features, starting from low-level attributes and progressing towards complex entities. CapsNets capture this intuition more effectively than CNNs because they have the aforementioned in-built explicit mechanism that models it. CapsNets have been shown to outperform CNNs on MNIST digit classification and segmentation of overlapping digits (Sabour, Frosst, and Hinton, 2017b). This motivates the question of whether GANs can be designed using CapsNets (instead of CNNs) to improve their performance.
We propose the Generative Adversarial Capsule Network (CapsuleGAN) (Jaiswal et al., 2018b) as a framework that incorporates capsules within the GAN framework. In particular, CapsNets are used as discriminators in our framework, as opposed to the conventionally used CNNs. We show that CapsuleGANs perform better than CNN-based GANs at modeling the underlying distribution of the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets, both qualitatively and quantitatively using the generative adversarial metric (GAM) (Im et al., 2016), and at semi-supervised classification using unlabeled GAN-generated images with a small number of labeled real images.
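For concreteness, the snippet below implements the CapsNet margin loss of Sabour, Frosst, and Hinton (2017b), which can replace the usual binary cross-entropy when a capsule network serves as the discriminator. The capsule network itself is abstracted away here, and the toy capsule lengths are placeholders; only the margins m+ = 0.9, m− = 0.1 and the weight λ = 0.5 follow the cited paper.

```python
import torch

def margin_loss(lengths: torch.Tensor, labels: torch.Tensor,
                m_pos: float = 0.9, m_neg: float = 0.1,
                lam: float = 0.5) -> torch.Tensor:
    """CapsNet margin loss over capsule output lengths.

    lengths: (batch, n_classes) capsule activation norms in [0, 1].
    labels:  (batch, n_classes) one-hot targets, e.g. real vs. generated.
    """
    pos = labels * torch.clamp(m_pos - lengths, min=0) ** 2
    neg = lam * (1 - labels) * torch.clamp(lengths - m_neg, min=0) ** 2
    return (pos + neg).sum(dim=1).mean()

# Toy usage: two capsules ("real", "fake"); lengths from a hypothetical CapsNet.
lengths = torch.tensor([[0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([[1.0, 0.0], [1.0, 0.0]])  # both samples labeled "real"
print(margin_loss(lengths, labels))
```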
Abstract
Learning representations that are invariant to nuisance factors of data improves the robustness of machine learning models, and promotes fairness for factors that represent biasing information. This view of invariance has recently been adopted for deep neural networks (DNNs), as they learn latent representations of data by design. Numerous methods for invariant representation learning for DNNs have emerged in recent literature, but the research problem remains challenging to solve: existing methods achieve partial invariance or fall short of optimal performance on the prediction tasks that the DNNs need to be trained for.

This thesis presents novel approaches for inducing invariant representations in DNNs by effectively separating predictive factors of data from undesired nuisances and biases. The presented methods improve the predictive performance and the fairness of DNNs through increased invariance to undesired factors. Empirical evaluation on a diverse collection of benchmark datasets shows that the presented methods achieve state-of-the-art performance.

Application of the invariance methods to real-world problems is also presented, demonstrating their practical utility. Specifically, the presented methods improve nuisance-robustness in presentation attack detection and automated speech recognition, fairness in face-based analytics, and generalization in low-data and semi-supervised learning settings.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Graph embedding algorithms for attributed and temporal graphs
Neural representation learning for robust and fair speaker recognition
Deep learning models for temporal data in health care
Understanding sources of variability in learning robust deep audio representations
Learning distributed representations from network data and human navigation
Hashcode representations of natural language for relation extraction
Shift-invariant autoregressive reconstruction for MRI
Deep representations for shapes, structures and motion
Multimodal representation learning of affective behavior
Leveraging structure for learning robot control and reactive planning
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Representation problems in brain imaging
Effective graph representation and vertex classification with machine learning techniques
Robust representation and recognition of actions in video
Scalable optimization for trustworthy AI: robust and fair machine learning
Learning invariant features in modulatory neural networks through conflict and ambiguity
3D deep learning for perception and modeling
Building straggler-resilient and private machine learning systems in the cloud
Human appearance analysis and synthesis using deep learning
Learning fair models with biased heterogeneous data
Asset Metadata
Creator: Jaiswal, Ayush (author)
Core Title: Invariant representation learning for robust and fair predictions
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 04/25/2020
Defense Date: 02/28/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bias, deep learning, deep neural networks, fair representation learning, Fairness, invariance, invariant representation learning, nuisance, OAI-PMH Harvest, representation learning, robustness
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Natarajan, Premkumar (committee chair), Nevatia, Ram (committee member), Raghavendra, Cauligi S. (committee member)
Creator Email: ajaiswal@usc.edu, mail.ayush.jaiswal@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-290645
Unique identifier: UC11663599
Identifier: etd-JaiswalAyu-8336.pdf (filename), usctheses-c89-290645 (legacy record id)
Legacy Identifier: etd-JaiswalAyu-8336.pdf
Dmrecord: 290645
Document Type: Dissertation
Rights: Jaiswal, Ayush
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA