INTERPRETABLE MACHINE LEARNING MODELS VIA FEATURE INTERACTION DISCOVERY

by

Michael Tsang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2020

Copyright 2020 Michael Tsang

Dedication

To my beloved mother and grandmothers. Your love and sacrifice enabled me to excel.

Acknowledgments

I would like to express my sincerest gratitude to the many supporters of my PhD journey. It has been a truly challenging five years and could not have been successful without your help.

I would like to start by thanking my thesis advisor, Prof. Yan Liu. It has been an incredible PhD experience under your guidance, and I am very grateful that we found a strong compatibility between your advisement style and my research personality. I particularly enjoyed the flexibility of choosing what problems and application domains to work on. At the same time, it was very helpful to know what kind of research questions could lead to publications. It has been unforgettable learning from you in both professional and personal settings.

To my guidance committee members, Prof. Emily Putnam-Hornstein, Prof. Xiang Ren, Prof. Joseph Lim, Prof. Maja Matarić, Prof. Meisam Razaviyayn, Prof. Stefan Scherer, and Prof. Milind Tambe: your feedback has been invaluable. I want to extend special thanks to my dissertation committee, Prof. Putnam-Hornstein and Prof. Ren, for your continued guidance until the end.

My PhD could not have been possible without the initial efforts of Prof. Rehan Kapadia and Prof. Maja Matarić to provide me the opportunity to discover my research interests. Prof. Kapadia was my dearest postdoc mentor in undergrad and initially offered me the opportunity to pursue a PhD at USC. My first-year PhD studies in Prof. Matarić's lab exposed me to the wonders of social research before I was given the blessing to pursue machine learning research.
To my dearest labmates, I will not forget our time together. Thank you, co-authors Dr. Dehua Cheng, Hanpeng Liu, Dr. Sirisha Rambhatla, Prof. Sanjay Purushotham, Loc Trinh, and Dr. Yaguang Li. Thank you, colleagues and friends Sungyong Seo, Nitin Kamra, Tanachat Nilanon, Dr. Zhengping Che, Prof. Rose Yu, Dr. Xinran He, Guangyu Li, Karishma Sharma, Chuizheng Meng, Aastha Dua, Nan Xu, Yizhou Zhang, and Dr. Natali Ruchansky.

My PhD internships have also allowed me to meet great advisors to whom I express my gratitude. Thank you Dr. Pavankumar Murali and Dr. Joe Zhou (IBM); Dr. Yinyin Liu, ChinniKrishna Kothapalli, and Sharath Sridhar (Intel); and Dr. Eric Zhou, Dr. Xue Feng, Artem Volkhin, Dr. Levent Ertoz, and Dr. Ellie Wen (Facebook). I also thank my countless other mentors from industry.

The USC administrative and IT staff have been instrumental in ensuring the success of my PhD. For this and your generosity, I cannot express enough gratitude to Lizsl De Leon, Tracy Charles, Jennifer Gerson, Jack Li, and Chris Badua.

Finally, to my family, thank you so much for your unconditional love and support.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Background
1.1.1 Interpretability
1.1.2 Feature Interaction
1.2 Thesis Statement
1.3 Contributions of Research

2 Related Works
2.1 Why Interpretable Machine Learning
2.1.1 Techniques
2.1.2 Accommodation of Human Understanding
2.2 Why Feature Interactions for Interpretable Machine Learning
2.2.1 Feature Interaction Detection
2.2.2 Feature Interaction Interpretation
2.2.3 Feature Interaction Attribution
2.3 Applications
2.3.1 Text Analyzers
2.3.2 Image Classifiers
2.3.3 Recommender Systems

3 Preliminaries
3.1 Notations
3.2 Statistical Feature Interaction
3.3 Feedforward Neural Network

4 Detecting Feature Interactions Learned by Feedforward Neural Networks
4.1 Methodology
4.1.1 Feature Interactions at Hidden Units
4.1.2 Interaction Detection Algorithm
4.2 Experiments
4.2.1 Experiment Setup
4.2.2 Pairwise Interaction Detection
4.2.3 Higher-Order Interaction Detection
4.2.4 Limitations
4.3 Broader Impact

5 Interpreting Feature Interactions through Intermediate Hidden Layers
5.1 Methodology
5.1.1 Interaction Entangling in Neural Networks
5.1.2 Architecture Choice
5.1.3 Disentangling Regularization
5.1.4 Advantages of the NIT Framework
5.2 Experiments
5.2.1 Experiment Setup
5.2.2 Training the NIT Framework
5.2.3 Disentangling Experiment
5.2.4 Real-World Dataset Experiments
5.2.5 Disentangling Interactions Through All Layers
5.2.6 Limitations
5.3 Broader Impact

6 Detecting Feature Interactions in Black-Box Models and Applications
6.1 Methodology
6.1.1 MADEX: Model-Agnostic Dependency Explainer
6.1.2 GLIDER: Global Interaction Detection and Encoding for Recommendation
6.2 Experiments
6.2.1 Experiment Setup
6.2.2 MADEX Experiments
6.2.3 GLIDER Experiments
6.3 Broader Impact

7 Explaining Feature Interactions in Black-Box Models via Interaction Attribution
7.1 Methodology
7.1.1 Archipelago Interaction Attribution
7.1.2 Archipelago Interaction Detection
7.2 Experiments
7.2.1 Experiment Setup
7.2.2 ArchDetect
7.2.3 ArchAttribute & Archipelago
7.2.4 Interactive Visualization
7.2.5 Runtime
7.3 Broader Impact

8 Summary, Discussion and Future Work

References

Appendices
A Appendix of Chapter 3
B Appendix of Chapter 4
C Appendix of Chapter 5
D Appendix of Chapter 6

List of Tables

2.1 Timeline of feature interaction research

4.1 Test suite of data-generating functions

4.2 AUC of pairwise interaction strengths proposed by NID and baselines on a test suite of synthetic functions (Table 4.1). ANOVA, HierLasso, and RuleFit are deterministic.

4.3 Test performance improvement when adding top-K interactions from MLP-M to MLP-Cutoff for real-world datasets and select synthetic datasets. Here, the median K̄ excludes subset interactions, and |I| denotes average interaction cardinality. RMSE values are standard scaled.

5.1 A comparison of the number of models needed to construct various forms of GAMs. "GA^K M all interactions" refers to constructing a GAM for every interaction of order ≤ K. MLP-Cutoff is an additive model of multilayer perceptrons, and η is the top number of interactions based on a learned cutoff.

5.2 Real-world datasets

5.3 Predictive performance of NIT. RMSE is calculated from standard-scaled outcome variables. *GA2M took several days to train and did not converge. Lower RMSE and higher AUC mean better model performance.

6.1 Prediction performance (mean squared error; lower is better) with (k > 0) and without (k = 0) interactions for random data instances in the test sets of respective black-box models. k = L corresponds to the interaction at a rank threshold. 2 ≤ k < L are excluded because not all instances have 2 or more interactions. Only results with detected interactions are shown. At least 94% (≥ 188) of the data instances had interactions across 5 trials for each model and score statistic.

6.2 Global explanation of a sentiment analyzer

6.3 CTR dataset statistics

6.4 Detected global interactions from the AutoInt baseline on Avazu data. "C14" is an anonymous feature.

6.5 Test prediction performance by encoding top-K global interactions in baseline recommender systems on the Criteo and Avazu datasets (5 trials). K is 40 and 10 for Criteo and Avazu, respectively. "+ GLIDER" means the inclusion of detected global interactions in corresponding baselines. The "Setting" column is labeled relative to the source of detected interactions: AutoInt. *Scores by Song et al. (2018).

6.6 # parameters of the models in Table 6.5. M denotes million.

7.1 Comparison of interaction detectors (b) on synthetic ground truth in (a).

7.2 Comparison of attribution methods on BERT for sentiment analysis and ResNet152 for image classification. Performance is measured by the correlation (ρ) or AUC of the top and bottom 10% of attributions for each method with respect to reference scores defined in §7.2.3.

B.1 Hyperparameters of NIT omitted from the main paper, corresponding to experiments in Table 5.3

B.2 Architectures of the MLP baselines

B.3 Interaction order statistics when all repeated interactions and sparsified blocks are ignored. An order of 1 is a main effect.

B.4 The sensitivity of predictive performance to shuffling of the learned gates Z within each network block.

C.1 Comparison of # model parameters between baseline models with enlarged embeddings and original baselines + GLIDER (from Tables 6.5 and 6.6). The models with enlarged embeddings are denoted by the asterisk (*). The embedding dimension of sparse features is denoted by "emb. size". Percent differences are relative to baseline* models. M denotes million, and the ditto mark (") means no change from the above line.

C.2 Test prediction performance corresponding to the models shown in Table C.1

C.3 Top-ranked word interactions I_i from Sentiment-LSTM and BERT on randomly selected sentences in the SST test set.

C.4 Data-generating functions with interactions

C.5 Detection performance in R-precision (higher is better). σ = 0.6 (max: 3.2). "Tree" is XGBoost. *Does not detect higher-order interactions. †Requires an exhaustive search of all feature combinations.

List of Figures

4.1 An illustration of an interaction within a multilayer perceptron with fully-connected layers, where the box contains later layers in the network. The first hidden unit takes inputs from x1 and x3 with large weights and creates an interaction between them. The strength of the interaction is determined by both incoming weights and the outgoing paths between a hidden unit and the final output, y.
4.2 Neural network architecture for interaction detection, with optional univariate networks

4.3 A comparison of averaging functions by the total number of correct interactions ranked before any false positives, evaluated on the test suite (Table 4.1). x-axis labels are maximum, root mean square, arithmetic mean, geometric mean, harmonic mean, and minimum.

4.4 Heat maps of pairwise interaction strengths proposed by the NID framework on MLP-M for datasets generated by functions F1-F10 (Table 4.1). Red cross-marks indicate ground truth interactions.

4.5 Heat maps of pairwise interaction strengths proposed by the NID framework on MLP-M for real-world datasets.

4.6 MLP-Cutoff error with added top-ranked interactions (along x-axis) of F1-F10 (Table 4.1), where the interaction rankings were generated by the NID framework applied to MLP-M. Red cross-marks indicate ground truth interactions, and Ø denotes MLP-Cutoff without any interactions. Subset interactions become redundant when their true superset interactions are found.

4.7 MLP-Cutoff error with added top-ranked interactions (along x-axis) of real-world datasets (Table 4.1), where the interaction rankings were generated by the NID framework on MLP-M. Ø denotes MLP-Cutoff without any interactions.

4.8 Comparisons between AG and NID in higher-order interaction detection. (a) Comparison of top-ranked recall at different noise levels on the synthetic test suite (Table 4.1); (b) comparison of runtimes, where NID runtime with and without cutoff are both measured. NID detects interactions with top-rank recall close to the state-of-the-art AG while running orders of magnitude faster.

5.1 A demonstration of interaction entangling in a feedforward neural network. When trained on a simple dataset generated by x1 x2 + x3 x4 under common forms of regularization, the neural network tends to keep the interactions separated in the first hidden layer but entangles them in the second layer. In contrast, NIT (black and cyan) is able to fully disentangle the interactions. The meaning of entangling is detailed in §5.1.1.

5.2 An illustrative comparison between two simple feedforward networks trained on data with interactions {1, 2} and {3, 4}. (a) A standard feedforward neural network, and (b) a desirable network architecture that separates the two interactions. All hidden units have ReLU activation, and y is a linear unit which can precede a sigmoid link function if classification is desired.

5.3 A version of the NIT model architecture. Here, NIT consists of B multi-layer network blocks above a common input dense layer. Appropriate regularization on the dense layer forces each block to model a single interaction or main effect. This model can equivalently be seen as a standard feedforward neural network with block-diagonal weight matrices at intermediate layers.

5.4 Visualizations that provide a global (Ribeiro et al., 2016) and transparent (Tan et al., 2017) interpretation of NIT trained on the MIMIC-III dataset at K = 2. Outcome scores are interpreted as contribution to 30-day hospital readmission in the same way described by Caruana et al. (2015). The output bias of NIT is 0.21.

6.1 A simplified overview. (1) MADEX uses interaction detection and LIME (Ribeiro et al., 2016) together to interpret feature interactions learned by a source black-box model at a data instance, denoted by the large green plus sign. (2) GLIDER identifies interactions that consistently appear over multiple data samples, then explicitly encodes these interactions in a target black-box recommender model f_rec.

6.2 Qualitative examples (more in Appendix C.4 & C.5)

6.3 Occurrence counts (total: 1000) vs. rank of detected interactions from AutoInt on Criteo and Avazu datasets. * indicates a higher-order interaction (details in Appendix C.7).

6.4 Test logloss vs. K of DeepFM on the Criteo dataset (5 trials).

7.1 Our explanation for the sentiment analysis example by Janizek et al. (2020). Colors indicate sentiment, and arrows indicate interactions. Compared to other axiomatic interaction explainers (Janizek et al., 2020; Dhamdhere et al., 2019), only our work corroborates our intuition by showing negative attribution among top-ranked interactions.

7.2 Non-additive interaction for p = 2 features: the corner points are used to determine if x1 and x2 interact based on their non-additivity on f, i.e., they interact if δ ∝ (f(a) − f(b)) − (f(c) − f(d)) ≠ 0 (§7.1.2.1). In (c), the attribution of (bad, awful) should be negative via φ (7.2), but the Shapley Taylor Interaction Index uses the positive δ. Note that φ depends on a and d, whereas δ depends on a, b, c, and d. Also, Integrated Hessians is not relevant here since it does not apply to ReLU functions.

7.3 Interaction detection overlap (redundancy) with added contexts to (7.11). "fixed" at n = 2 (ArchDetect) already shows good stability.

7.4 Our BERT visualizations on random test sentences from SST under BERT tokenization. Arrows indicate interactions, and colors indicate attribution strength. f_cls is the sentiment classification. The interactions point to salient and sometimes long-range sets of words, and the colors are sensible.

7.5 Our explanations of a COVID-19 classifier (COVID-Net) (Wang & Wong, 2020) on randomly selected test X-rays (Chowdhury et al., 2020; Cohen et al., 2020) classified as COVID positive. COVID-Net accurately distinguishes COVID from pneumonia and normal X-rays. Colored outlines indicate detected feature sets with positive attribution. The interactions consistently focus on the "great vessel" region outlined in green.

7.6 Online ad-targeting: "banner_pos" is used to target ads to a user per their "device_id".

7.7 Interactive visualization of Archipelago. When moving the slider to the right, the initial negativity of "predictable" and "but" turns positive after interacting with the positive phrase "jumps with style".

7.8 Average runtime comparison for explaining (a) BERT on sentiment analysis and (b) ResNet152 on image classification.

A.1 Response plots of an MLP-M's univariate networks corresponding to variables x8, x9, and x10. The MLP-M was trained on data generated from synthetic function F6 (Table 4.2). Note that the plots are subject to different levels of bias from the MLP-M's main multivariate network.

C.1 The effects of varying the number of buckets on (a) the average embedding size of cross features involving dense features and (b) the individual embedding sizes of the same cross features.

C.2 Additional qualitative results, following Figure 6.2a, on random test images in ImageNet. Interactions are denoted by I_i and are unordered. Overlapping interactions with overlap coefficient ≥ 0.5 are merged to reduce {I_i} per test image.

C.3 Occurrence counts (total: 1000) vs. rank of interactions detected from AutoInt on (a) Criteo and (b) Avazu datasets. Each higher-order interaction is annotated with its interaction cardinality.

C.4 Histograms of interaction sizes for interactions detected in (a) ResNet152 and (b) Sentiment-LSTM across 1000 random samples in respective test sets.

D.1 Text explanation metrics ((a) Word ρ, (b) Any Phrase ρ, (c) Multi-Word Phrase ρ) versus top and bottom % of attributions retained for different attribution methods on BERT over the SST test set. These plots expand the analysis of Table 7.2.

D.2 Image explanation metric (segment AUC) versus top and bottom % of attributions retained for different attribution methods on ResNet152 over the MS COCO test set. These plots expand the analysis of Table 7.2.

D.3 Our ResNet152 visualizations on random test images from ImageNet. Colored outlines indicate interactions with positive attribution. f_c is the image classification result. To the best of our knowledge, only this work shows interactions that support the image classification via interaction attribution.

D.4 Text Viz. Comparison A. In the first text example, "regret, not extreme enough" is a meaningful and strongly negative interaction. In the second example, "when you begin to" interacts to diminish its overall attribution magnitude.

D.5 Text Viz. Comparison B. In the first text example, "thought provoking" is a meaningful and strongly positive interaction. In the second example, the "lousy, un" interaction factors in a large context to make a negative text classification.

D.6 Text Viz. Comparison C. In the first text example, "refined, to a crystalline" is a meaningful and strongly positive interaction. In the second example, "is aptly named" is also a meaningful and strongly positive interaction.

D.7 Text Viz. Comparison D. In the first text example, "the ending, out" is a meaningful and negative interaction. In the second example, "a feel good, best" is a meaningful and strongly positive interaction.

D.8 Text Viz. Comparison E. In the first text example, "film should be, buried" is a meaningful and strongly negative interaction. In the second example, "-oherent" belongs to a negative word "incoherent".

D.9 Image Viz. Comparison A. In the first image example, the dog's eyes are a meaningful interaction supporting the classification. In the second example, the monkey's head is also a positive interaction.

D.10 Image Viz. Comparison B. In the first image example, the obelisk tip is a meaningful interaction supporting the classification. In the second example, the leopard's face is also a positive interaction.

D.11 Image Viz. Comparison C. In the first image example, different patches of the apron are interactions supporting the classification. In the second example, the stork's body is an interaction that strongly supports the classification.

D.12 Image Viz. Comparison D. In the first image example, certain small patches of the waffle iron interact, one of which supports the classification. In the second example, the leopard's face is the primary positive interaction.

D.13 Image Viz. Comparison E. In the first image example, different parts of the polaroid camera are interactions that positively support the classification. In the second example, the dogs' heads and bodies are also positive interactions.

D.14 Text Viz. with ArchDetect Ablation. The interactions tend to use more salient words when including the baseline context, which is proposed in ArchDetect.

D.15 Image Viz. with ArchDetect Ablation A. The interactions tend to focus more on salient patches of the images when including the baseline context, which is proposed in ArchDetect.

D.16 Image Viz. with ArchDetect Ablation B. The interactions tend to focus on salient patches of the images when including the baseline context.

Abstract

The interpretability of machine learning prediction systems is important for reasons such as transparency, ethics, accountability, scientific discovery, and model debugging. This thesis aims to expand the interpretability of high-performance prediction models by developing new analysis tools. We are interested in explaining the reason why machine learning models have high performance: feature interactions. To this end, we study how to interpret the high-performance models of neural networks and black-box models more generally.

Two qualities of model interpretations are taken into account in this thesis: 1) fundamental understanding and 2) practical utility. In terms of fundamental understanding, this work develops interpretations of feature interactions learned by neural network parameters as well as a principled approach to attribute feature interactions to black-box predictions. In terms of practical utility, this work emphasizes accuracy and efficiency of explanations, qualitative and quantitative interpretability, and new perspectives such as improving prediction performance via model interpretations. Feature interactions offer insightful views into the current complexity of prediction models.

Chapter 1

Introduction

1.1 Background

The need for machine learning transparency calls for explanations of how inputs relate to machine learning predictions, especially for high-performance yet opaque prediction models. The problem of explaining input-output relations belongs to the field of interpretable machine learning. This thesis aims to advance interpretable machine learning by creating new analysis tools. Prior to this thesis, model interpretation via input or feature importance has become a mature topic for state-of-the-art prediction models such as neural networks.
This thesis advances the field by laying groundwork for explaining the feature interactions in these models.

1.1.1 Interpretability

If black-box predictions affect our future, should we trust them? In reality, they already affect us through the recommendations we receive every day for entertainment, food, products, friends, relationships, services, etc. The desire to understand why machine learning models make certain predictions is the core essence of model interpretability. Model interpretability is the "degree to which an observer can understand the causes of a prediction" (Biran & Cotton, 2017; Miller, 2019). There are different components to this definition.

• The Observer: The observer may belong to different audiences. For example, the observer may have an area of expertise that allows them to understand model explanations better than an average person, or we may not want to assume the observer has background knowledge.

• Understanding: Given a target audience, a model interpretation method may be designed to provide a more or less human-understandable explanation depending on the explanation's objective and content. The objective may focus on either informing us what happens in a model or providing explanations that match our intuition. There may or may not be a middle ground. We discuss an explanation's content next.

• Causes of a Prediction: An explanation's content can take on a variety of forms. For example, the content could tell us how input features contribute to a prediction, what concepts are important for a prediction, what feature groups are important, the process by which the prediction model formed, input-output trends, etc. Each content type has a different way of characterizing the "causes of a prediction". Moreover, each content type can provide different degrees of faithfulness to this characterization as well as different degrees of comprehensibility.
In this thesis, we are generally interested in researching new model interpretation methods that can provide explanations to an audience with little background knowledge. We require our explanations to be informative of what happens in a model and discuss how they are also comprehensible. The content of our explanations focuses on what and how feature interactions are important for predictions.

1.1.2 Feature Interaction

A feature interaction describes a situation in which the effect of one feature on an outcome depends on the state of a second feature. For example, consider a linear regression model f(x) = w1 x1 + w2 x2 + w3 x1 x2, where x1 and x2 are features and {wi} are coefficients. The multiplication x1 x2 forms a feature interaction, and the individual terms x1 and x2 are main effects. The coefficients provide information about each term's importance. A feature interaction is not restricted to a multiplicative effect between features, nor does it have to be between two features. In general, a feature interaction is a non-additive effect between multiple features on an outcome.

To put the physical meaning of an interaction in perspective, let us consider another example. Say one feature represents "student age" and another feature represents "student alertness". The outcome variable is the student's score on a test. We may expect an interaction between age and alertness on test scores because student performance varies not only with age, but also with alertness within each age group.

Feature interaction phenomena exist in many real-world settings where an outcome is modeled as a function of features. Because machine learning is precisely designed to model such functions, high-performance prediction models, in particular neural networks, are well suited for learning feature interactions.
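The non-additive definition above can be checked numerically with a mixed difference: a pair of features interacts under f exactly when changing both together does not equal the sum of changing each alone. The sketch below is only an illustration of that definition, not a method from this thesis; the toy models, probe values, and tolerance are assumptions chosen for clarity.

```python
# Illustrative non-additivity test for a pair of features under a model f.
# The probe points (a, b) and baselines (base1, base2) are arbitrary choices;
# a nonzero mixed difference at any point certifies an interaction.

def interacts(f, a, b, base1=0.0, base2=0.0, tol=1e-9):
    """Return True if f is non-additive in its two arguments at (a, b)."""
    # For an additive f(x1, x2) = g(x1) + h(x2), this difference is exactly 0.
    delta = f(a, b) - f(a, base2) - f(base1, b) + f(base1, base2)
    return abs(delta) > tol

# The linear regression example from the text, with an explicit x1*x2 term:
f_inter = lambda x1, x2: 1.0 * x1 + 2.0 * x2 + 3.0 * x1 * x2
# A purely additive model with only main effects:
f_add = lambda x1, x2: 1.0 * x1 + 2.0 * x2

print(interacts(f_inter, 1.0, 1.0))  # True: the x1*x2 term is non-additive
print(interacts(f_add, 1.0, 1.0))    # False: main effects cancel in delta
```

The same mixed-difference quantity reappears later in the thesis as the δ test illustrated in Figure 7.2.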
Feature interactions manifest beyond tabular data settings too, such as word interactions in text-based sentiment analysis and image-patch interactions in image classification. Examples of word interactions are "not, good" and "not, bad" in movie reviews. An example of an image interaction is a combination of image patches corresponding to the ears, nose, and eyes of a dog, which supports a dog classification. We see that interaction phenomena can be interpretable in diverse domains.

1.2 Thesis Statement

Feature interactions are interpretable in complex prediction models.

1.3 Contributions of Research

This thesis aims to address a gap in research on interpreting feature interactions in high-performance (complex) prediction models. Specifically, it is unknown how to interpret feature interactions in modern piecewise linear neural networks (Glorot et al., 2011; He et al., 2016) and black-box models more generally. We are also interested in extending our studies to examine the potential impact of interaction interpretations. We present our contributions as follows.

Neural Network Interpretation:

• Detecting Interactions from Neural Network Weights: We develop a technique to detect the statistical interactions learned in feedforward neural networks by interpreting their learned weights. This method is unique in that it is the first to either detect interactions in neural networks from their weights or detect interactions in neural networks with piecewise linear activation functions such as ReLU. This method also shows practical benefits in outperforming state-of-the-art methods at detecting non-additive interactions, and it shows signs of being effective at detecting higher-order interactions. (Tsang et al., 2018a)

• Learning Transparent Neural Networks: We develop an architecture and learning approach to automatically separate feature interactions in neural networks.
This approach allows the specification of a maximum interaction set size, which has the practical advantage of learning neural networks with low interaction orders that are transparent or globally visualizable. The main contributions here are 1) identifying that feature interactions entangle at intermediate hidden layers in feedforward neural networks and 2) learning generalized additive models with interactions in O(1) models. (Tsang et al., 2018b)

Black-Box Interpretation:

• Detecting Interactions and Improving Predictions: We propose a method to interpret black-box models by interaction detection on feature perturbations and their model inferences. This method enables interaction interpretations of any prediction model, e.g. image and text classifiers, for the first time. We then focus on explaining a type of model that has large societal impact: black-box recommender systems. By accumulating interactions from this model over a data batch and encoding them into the same or a separate model, our method uniquely utilizes model interpretations to automatically improve predictions, in this case improve recommendations. (Tsang et al., 2020a)

• Assigning Interpretable Attribution to Interactions: While there are now ways to detect interactions in black-box models, a question remains in how the interactions contribute to predictions via attribution. We propose a principled interaction attribution method and a complementary interaction detector for any prediction model. Currently, this attribution method is the only one which is simultaneously interpretable, model-agnostic, and principled (aka axiomatic). Both the attribution method and interaction detector are also scalable for arbitrary-order feature interactions. Combined, these methods produce explanations that outperform state-of-the-art methods at interpretability, interaction detection, and runtime efficiency. (Tsang et al., 2020b)

This thesis is a series of works on improving the interpretability of high-performance prediction models.
Through this work, we draw attention to the importance of interpreting feature interactions.

Chapter 2
Related Works

2.1 Why Interpretable Machine Learning

Interpretable machine learning is important for leveraging automatic prediction models in high-stakes decisions, scientific discoveries, and model debugging. As machine learning models advanced over time, they tended to become more accurate yet less interpretable. We first provide an overview of interpretable machine learning techniques. We then discuss how to accommodate social science perspectives of human understanding for interpretable machine learning.

2.1.1 Techniques

The various techniques of interpretable machine learning are summarized as follows.

Early Methods The most famous interpretable model is linear regression (Galton, 1886), which treats inputs independently, and the coefficient of each input corresponds to its slope in linear plots. More recent developments in training machine learning models saw the invention of sparsity regularization, e.g. lasso (Tibshirani, 1996). When coupled with linear regression, lasso attempts to select only the most important inputs, which can be useful for interpretability.

Decision trees (Breiman et al., 1984) are important for interpretability because they represent a function by a hierarchy of rules. When the number of rules is small, their logic is simple enough to understand. In some settings, simple decision trees can be more accurate than linear regression because they can automatically capture feature interactions (Nelder, 1977; Kass, 1975). For larger decision trees such as random forests, feature importance can be estimated by the normalized total reduction of gini-impurity or entropy due to each feature (Breiman, 2001).

Generalized Additive Models Generalized additive models are typically viewed as powerful interpretable models in machine learning and statistics (Hastie, 2017).
The standard form of this model constructs univariate functions of features, namely g(E[y]) = Σ_i f_i(x_i), where g is a link function and f_i can be any arbitrary nonlinear function. Similar to linear regression, this model can easily be visualized because each f_i can be independently plotted by varying each function's input and observing the response on the output.

Neural Networks The interpretability of neural networks has been more challenging. We overview the different forms of interpretations for neural networks.

• Weight Interpretation: The meaning of the learned weights in neural networks has always been a mystery even in basic neural networks such as multilayer perceptrons (MLP). Existing efforts studied how to interpret MLP weights to compute global feature importance. Garson's algorithm (Garson, 1991; Goh, 1995) is regarded as the first attempt to interpret the weights, which essentially multiplies weight matrices from early layers to later layers and applies absolute values. Another method, called Connection Weights, is similar to Garson's algorithm but without the absolute values and an output normalization step (Olden & Jackson, 2002). Despite these efforts, their empirical validation has remained limited, and no method has yet interpreted feature interactions from the weights.

• Feature Attribution: In contrast to global feature importance methods, there has been a significant effort on interpreting local feature importance, i.e. feature importance specific to data instances. This local feature importance is called feature attribution. A variety of feature attribution methods have been proposed based on neural network gradients (Simonyan et al., 2013; Ancona et al., 2018; Selvaraju et al., 2017); however, one method emerged as industry-strength called Integrated Gradients (IG) (Sundararajan et al., 2017). IG leverages the fundamental theorem of calculus to enable input gradients to assign an additive importance to each feature with respect to a baseline.
IG satisfies desirable attribution axioms, or principles that other attribution methods should follow. Although feature attribution applies to individual data instances, it is possible to run attribution methods across a batch of data instances and obtain a surrogate for global feature importance.

• Hidden Layer Interpretation: Another problem of interest to interpretable machine learning is how to explain the intermediate hidden representations of neural networks. One approach has been extracting learned concepts. Neural networks for image classification are particularly suitable for this analysis because images naturally give rise to visual concepts. A method called Quantitative Testing with Concept Activation Vectors (TCAV) fit linear probes to the hidden representations of images for known concepts and measured model sensitivities to those concepts. Another approach on interpreting hidden layers is clustering the hidden representations of input data in a reduced two-dimensional space. This approach makes use of dimensionality reduction methods (Pearson, 1901; Jolliffe, 1986; Maaten & Hinton, 2008). Although these methods are not specific to neural networks, they have still been popular for visualizing the learned representations of neural networks.

• Disentangled Representations: So far, we have discussed methods to interpret generic neural network models after they are trained. The topic of disentanglement in neural networks is interested in how to train or leverage neural networks that have interpretable representations. In this space, emphasis has been on disentangling factors of variation. Various methods studied how to modify the training of deep neural networks to identify interpretable latent codes (Chen et al., 2016; Kulkarni et al., 2015; Hsu et al., 2017; Higgins et al., 2017), whereas other methods interpret intrinsically disentangled representations after training (Radford et al., 2015; Bau et al., 2017).
While works have focused on disentangling certain properties of neural networks, the concept of disentangling can extend to other forms of neural network interpretability.

Black-Box Models Interpretation methods have also been developed to treat the model under investigation as a black-box, such as permutation feature importance, which shuffles the values of a feature in a data batch and checks the resulting change in model loss. Although this method was originally proposed for random forests (Breiman, 2001), it can be used on any tabular prediction model. To accommodate any prediction model, a method called Local Interpretable Model-Agnostic Explanations (LIME) was developed to compute feature attribution via linear regression on feature perturbations and their model inferences (Ribeiro et al., 2016). Another method called Shapley Additive Explanations (SHAP) enforced stricter additive constraints on attribution scores by modifying LIME's feature perturbation scheme with guidance from Shapley values (Lundberg & Lee, 2017).

2.1.2 Accommodation of Human Understanding

In order to develop interpretability methods that are explainable to humans, we should be aware of human preferences of explanations and how to evaluate explanations. Here, we summarize the excellent survey by Miller (2019) and discuss how social science concepts of explanation relate to interpretable machine learning.

Human Preferences of Explanations The fields of philosophy, cognitive science, and social psychology have long studied how people define, select, and present explanations. These perspectives can inform the design of future interpretable machine learning methods. The following are human preferences of explanations.

• Contrastive Explanations: Recall that interpretability is the degree to which an observer can understand the causes of a prediction.
One of the most important findings in philosophical and cognitive science literature is that people like to use contrastive explanations, to explain the cause of an event P relative to some other event Q that did not occur (Hilton, 1990). Many researchers argue that all why-questions ask for contrastive explanations (Lipton, 1990; Hilton & Slugoski, 1986; Hilton, 1990; Hesslow, 1988; Mackie, 1974; Lombrozo, 2012), even if Q needs to be inferred (Hilton & Slugoski, 1986; Hilton, 1996; Hesslow, 1988). A contrastive explanation essentially asks "Why P rather than Q?", where "rather than Q" may not be explicitly stated. In interpretable machine learning, we commonly see contrastive explanations in comparing predictions associated with a real data instance P and a baseline data instance Q, e.g. to obtain feature attribution scores (Ribeiro et al., 2016; Lundberg & Lee, 2017; Sundararajan et al., 2017).

• Multi-Mode Explanations: There should be multiple modes of explaining an event in order to cater to individuals' questions. Researchers have attempted to capture these multiple modes via models which categorize explanations (Hankinson, 2001; Dennett, 1989; Marr, 1982; Marr & Poggio, 1976; Kass & Leake, 1987). Although there is no consensus model, Aristotle's Four Causes model has been adapted to categorize four types of explanations (Hankinson, 2001), which are summarized as: composition, properties, immediate factors, and end-goal. In interpretable machine learning, we also see different modes of explanations, such as decision rules, concepts, feature importance, and the topic of this thesis: feature interactions.

• Distillation: "People rarely, if ever, expect an explanation that consists of an actual and complete cause of an event" (Miller, 2019). Instead, people expect explanations to be distilled according to principles that have been investigated in literature.
One important principle is that contrastive explanations should show the greatest number of differences between P and Q (Hesslow, 1988; Lipton, 1990; Rehder, 2003, 2006), and highlight the differences that aren't obvious (Hilton & Slugoski, 1986; McClure & Hilton, 1998; McClure et al., 2003; Denis et al., 2005; Samland & Waldmann, 2014). Another principle is that explanations should prioritize necessary causes of an event, followed by sufficient causes (Lipton, 1990; Lombrozo, 2010; Woodward, 2006). Explanation distillation is often done through summarization, which tries to provide a concise yet complete description of something. In interpretable machine learning, all methods that cannot fully explain model predictions in a compact space, e.g. a paper figure, use some form of distillation, similar to the ones discussed.

• Conversation and Familiarity: Explanations should be presented relative to a person's beliefs. Hilton (1990) argued that an explanation is a conversation and should follow Grice's maxims of conversation (Grice, 1975), which are summarized as (a) say what you believe in and have enough evidence, (b) only say as much as necessary, (c) be relevant, and (d) have manners, i.e. avoid ambiguity and obscurity, and be brief and orderly. These maxims not only emphasize conversational elements, but also imply that explanations are understandable to people via a foundation of shared knowledge (Miller, 2019). From the perspective of interpretable machine learning, the challenge here is determining what audience a model interpretation is for and what is the best way to present that interpretation.

Explanation Evaluation Researchers in social science have developed principles to evaluate explanations. We first discuss these principles, then examine their relationship to evaluation metrics in interpretable machine learning.
• Coherence: Thagard (1989) proposed the Theory of Explanatory Coherence, which says that people will be more likely to accept explanations "if they are consistent with their prior beliefs" (Miller, 2019). Coherence is very important in interpretable machine learning literature, because model interpretations that do not cohere to people's beliefs can mislead them about the quality of the model. A common way to evaluate coherence for interpretable machine learning methods is through user studies (Kim et al., 2016; Ribeiro et al., 2016; Singh et al., 2019). There has also been interest in automatic coherence evaluation using human annotation labels that already exist in datasets (Jin et al., 2019).

• Simplicity and Generalizability: Thagard's Theory of Explanatory Coherence also discussed the value of simple and generalizable explanations as another important explanation metric. Simple refers to fewer causes shown in explanations, and generalizable refers to explanations that apply to more events. These principles have been validated in simulation (Ranney & Thagard, 1988) and user-study (Read & Marcus-Newhall, 1993). In interpretable machine learning, many works utilize simplicity as an explanation objective or visualization condition (Tibshirani, 1996; Lakkaraju et al., 2016; Singla et al., 2019; Gupta et al., 2019). Generally speaking, simplicity has been experimentally tested more so qualitatively than quantitatively. On the other hand, generalizability has also been a desirable property (Galton, 1886; Breiman et al., 1984; Lakkaraju et al., 2016; Ribeiro et al., 2016; Lou et al., 2013), which has typically been evaluated by surrogate metrics like prediction performance under generalizability constraints (Lou et al., 2013; Lakkaraju et al., 2016).

• Truthfulness: The truthfulness of an explanation is an important evaluation metric.
Miller (2019) asserted that truthfulness is a necessary but insufficient criterion for explanations, in response to the perspective of Hilton (1996) that the truth may not always be the best explanation, especially if the truth is obvious. One interesting study asked participants to evaluate explanations of a music recommender system. This study found that participants preferred explanations with the characteristics of truthfulness and completeness, without being overwhelming (Kulesza et al., 2015, 2013). In interpretable machine learning, truthfulness tends to be an important yet elusive evaluation metric because real-world datasets generally do not contain ground truth explanations. One way interpretability methods have circumvented this problem is by using synthetic datasets that are generated by known functions with ground truth (Ai & Norton, 2003; Hooker, 2004).

• Goal-Orientation: Vasilyeva et al. (2015) found via a user study that the goal of an explanation is important for how explanations are evaluated, which is summarized as follows. Participants were given tasks with different goals and were provided explanations associated with the tasks. On average, the explanations that were aligned with the task goals were rated better than other explanations. In interpretable machine learning, all explanation evaluations are expected to align with the goals of respective explanations.

2.2 Why Feature Interactions for Interpretable Machine Learning

Feature interactions have a long history in data analysis and model interpretation. In many cases, data analysis is done by interpreting models of feature interactions. Early feature interaction studies were meant to experiment with optimal conditions for agricultural yield and medical treatment effects on different groups of patients.

2.2.1 Feature Interaction Detection

A common thread among historical research on feature interactions is that most methods were developed to detect feature interactions.
Timeline The notion of a feature interaction has been studied at least since the 19th century, when John Lawes and Joseph Gilbert used factorial designs in agricultural research at the Rothamsted Experimental Station (Dean et al., 2015). A factorial design is an experiment that includes observations at all combinations of categories of each factor or feature. However, the "advantages [of factorial design] had never been clearly recognised, and many research workers believed that the best course was the conceptually simple one of investigating one question at a time" (Yates, 1964). In the early 20th century, Fisher et al. (1926) emphasized the importance of factorial designs as being the only way to obtain information about feature interactions. Near the same time, Fisher (1921) also developed one of the foundations of statistical analysis called Analysis of Variance (ANOVA), including two-way ANOVA (Fisher, 1925), which is a factorial method to detect pairwise feature interactions based on differences among group means in a dataset.

Tukey (1949) extended two-way ANOVA to test if two categorical features are non-additively related to the expected value of an outcome variable. This work set a precedent for later research on detecting feature interactions based on their non-additive definition. Soon after, experimental designs were generalized to study feature interactions, in particular the generalized randomized block design (Wilk, 1955), which assigns test subjects to different categories (or blocks) between features in a way where cross-categories between features serve as interaction terms in linear regression.

There was a surge of interest in improving the analysis of feature interactions after the mid 20th century. Belson (1959) and Morgan & Sonquist (1963) proposed Automatic Interaction Detection (AID), originally under a different name.
AID detects interactions by subdividing data into disjoint exhaustive subsets to model an outcome based on categorical features. Based on AID, Kass (1980) developed Chi-square Automatic Interaction Detection (CHAID), which determines how categorical features best combine in decision trees via a chi-square test. AID and CHAID were precursors to modern decision tree prediction models. Concurrently, Nelder (1977) introduced the "Principle of Marginality", arguing that a feature interaction and its marginal variables should not be considered separately, for example in linear regression. Hamada & Wu (1992) provided a contrasting view that an interaction is only important if one or both of its marginal variables are important. At the start of the 21st century, efforts began to focus on interpreting interactions in accurate prediction models. Ai & Norton (2003) proposed extracting interactions from logit and probit models via mixed partial derivatives. Gevrey et al. (2006) followed up by proposing mixed partial derivatives to extract interactions from multilayer perceptrons with sigmoid activations when, at the time, only shallow neural networks were studied. Friedman & Popescu (2008) proposed using hybrid models to capture interactions with decision trees and univariate effects with linear regression. Sorokina et al. (2008) proposed to use high-performance additive trees to detect feature interactions based on their non-additive definition. At the turn of the decade, we saw Bien et al. (2013) capture interactions with different heredity conditions using a hierarchical lasso on linear regression models. Then, Hao & Zhang (2014) drew attention towards interaction screening in high-dimensional data. We provide a timeline for this research history in Table 2.1.

Individual Testing The most common form of interaction detection tests each combination of features separately.
This individual testing approach started with ANOVA (Fisher, 1925) and has continued with modern interaction detectors. ANOVA conducts hypothesis tests for each interaction candidate by checking each hypothesis with F-statistics (Wonnacott & Wonnacott, 1972). Multi-way ANOVA exists to detect interactions of higher-order combinations, not just between pairs of features. The Additive Groves (AG) (Sorokina et al., 2008) method is notable because it detects interactions using high-performance models while accounting for the non-additive definition of feature interaction. This essentially allows the detected feature interaction to be unrestricted to a functional form, e.g. multiplication. AG tests each of these interactions by comparing two regression trees, one that fits all interactions, and the other that has the interaction of interest forcibly removed. A significant problem with individual testing methods is that they require an exponentially growing number of tests as the desired interaction order increases. Not only is this approach intractable, but it also has a high chance of generating false positives, or a high false discovery rate (Benjamini & Hochberg, 1995), that arises from multiple testing.

Neural Network Interpretation A prominent approach to detect feature interactions is based on mixed partial derivatives (Friedman & Popescu, 2008), which also uses individual testing but in a more pronounced way. Namely, the mixed derivative definition of feature interactions is the following: a function f(·) exhibits statistical interaction I among all features x_i indexed by i_1, i_2, ..., i_|I| ∈ I if

E_x[ (∂^|I| f(x) / (∂x_{i_1} ∂x_{i_2} ··· ∂x_{i_|I|}))² ] > 0.    (2.1)

This definition has been popular to use with logit (Ai & Norton, 2003), nonlinear (Karaca-Mandic et al., 2012), and traditional feedforward neural network models (Gevrey et al., 2006).
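For a pairwise interaction, the quantity inside (2.1) can be probed numerically with finite differences. The sketch below is a minimal illustration of the mixed derivative idea, not any of the cited methods; the two test functions are hypothetical.

```python
import numpy as np

# Numerical sketch of the mixed partial derivative test (Eq. 2.1) for a
# pairwise interaction, using central finite differences.
def mixed_partial(f, x, i, j, h=1e-4):
    # d^2 f / (dx_i dx_j) via a 4-point central difference stencil.
    def shift(di, dj):
        z = x.copy()
        z[i] += di * h
        z[j] += dj * h
        return f(z)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * h * h)

f_inter = lambda x: x[0] * x[1] + x[2]          # contains interaction {1, 2}
f_add = lambda x: np.sin(x[0]) + np.exp(x[1])   # purely additive in x1, x2

x = np.array([0.7, -1.3, 2.0])
print(mixed_partial(f_inter, x, 0, 1))  # ≈ 1 (nonzero => interaction)
print(mixed_partial(f_add, x, 0, 1))    # ≈ 0 (no interaction at this point)
```

In (2.1) this derivative is squared and averaged over x, which is exactly what makes the test expensive: one expectation per candidate interaction.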
The advantage of these mixed derivative approaches is that they are exact at interaction detection up to the performance of the model. In particular, neural networks excel at interaction detection because they are high-performance models as universal function approximators. However, (2.1) computes an expectation that can be computationally expensive by requiring many individual tests for just one tested interaction. A theoretical problem with (2.1) is that it does not apply to neural networks with piecewise linear activation functions. For example, the most common neural network with ReLU activations (Glorot et al., 2011) always returns a zero mixed partial derivative despite learning interactions.

Lasso Selection An alternative approach to interaction detection is based on lasso selection, which is fast and may not require individual testing. One can construct an additive model with many different interaction terms and let lasso shrink the coefficients of unimportant terms to zero (Tibshirani, 1996; Bien et al., 2013; Min et al., 2014; Purushotham et al., 2014). While lasso methods are fast, they require specifying all interaction terms of interest. For pairwise interaction detection, this requires O(p^2) terms (where p is the number of features), and O(2^p) terms for higher-order interaction detection. Even still, the form of interactions that lasso-based methods capture is limited by which are pre-specified.
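The lasso selection recipe above (enumerate candidate interaction terms, then shrink unimportant coefficients to zero) can be sketched as follows, assuming a toy data-generating function and a minimal proximal gradient (ISTA) lasso solver; the feature count, regularization strength, and thresholds are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Enumerate all O(p^2) pairwise product terms, then let lasso shrink
# unimportant coefficients to zero.
p, n = 4, 2000
X = rng.normal(size=(n, p))
y = X[:, 0] * X[:, 1] + X[:, 2]  # true interaction: features (0, 1)

pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
Z = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs])
Z = Z / np.linalg.norm(Z, axis=0)  # unit-norm columns

def lasso_ista(Z, y, lam=5.0, steps=2000):
    # Proximal gradient descent on 0.5*||Zw - y||^2 + lam*||w||_1.
    w = np.zeros(Z.shape[1])
    step = 1.0 / np.linalg.norm(Z.T @ Z, 2)
    for _ in range(steps):
        w = w - step * Z.T @ (Z @ w - y)                        # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0)  # soft-threshold
    return w

w = lasso_ista(Z, y)
# Columns p..p+len(pairs)-1 hold the interaction coefficients.
selected = [pairs[k] for k in range(len(pairs)) if abs(w[p + k]) > 1e-2]
print(selected)  # surviving interaction term(s)
```

Note that the candidate set had to be written down explicitly, which is precisely the scalability limitation discussed above.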
Table 2.1: Timeline of feature interaction research

1843 • Lawes & Gilbert - factorial design in agricultural research at the Rothamsted Experimental Station
1925 • Fisher - two-way Analysis of Variance (ANOVA)
1949 • Tukey - Tukey's test of additivity
1955 • Wilk - generalized randomized block design
1959 • Belson - Automatic Interaction Detection by subdividing data
1977 • Nelder - Principle of Marginality
1980 • Kass - Chi-square Automatic Interaction Detection by combining features in decision trees via chi-square tests
1991 • Aiken & West - book on interpreting interaction effects
1992 • Hamada & Wu - heredity conditions
2003 • Ai & Norton - interactions in logit and probit models
2006 • Gevrey et al. - interactions in sigmoid neural networks
2008 • Friedman & Popescu - RuleFit to detect interactions by mixing linear regression and trees
2008 • Sorokina et al. - Additive Groves to detect non-additive interactions
2013 • Bien et al. - Hierarchical Lasso
2014 • Hao & Zhang - interaction screening in high-dimensional data

2.2.2 Feature Interaction Interpretation

Various methods exist on how to interpret a given feature interaction. A common approach is in a multiple regression setting, where a linear regression model is used to capture multiplicative interactions. For two features, let such a model be defined as

f(x) = w_1 x + w_2 z + w_3 xz + b,    (2.2)

where x, z are the features, {w_i} are their coefficients, and b is a bias term. Jaccard et al. (2003) suggest two ways to interpret the xz interaction. One is by directly examining the w_3 coefficient as a slope varying with x when z increases by one unit. Another way is by rearranging (2.2) as

f(x) = (w_1 + w_3 z) x + (w_2 z + b),    (2.3)

and interpreting (w_1 + w_3 z) as a slope for fixed values of z. Aiken et al. (1991) recommend using this alternative slope to plot interaction effects for representative values of z. The differences in these slopes are also a measure of the significance of the interaction.
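The simple-slopes interpretation in (2.3) can be sketched directly; the coefficient values and the representative values of z below are hypothetical.

```python
# Simple-slopes sketch for interpreting the interaction in Eq. (2.2):
# f(x) = w1*x + w2*z + w3*x*z + b.
w1, w2, w3, b = 1.0, 0.5, 2.0, 0.1

def slope_of_x(z):
    # Rearranged form (2.3): the slope of x is (w1 + w3*z) at a fixed z.
    return w1 + w3 * z

# Evaluate the slope at representative values of z (e.g. mean and +/- 1 sd
# of a standardized z), in the spirit of Aiken et al.'s plotting advice.
for z in [-1.0, 0.0, 1.0]:
    print(f"z = {z:+.1f}: slope of x = {slope_of_x(z):+.1f}")
```

Because the slopes differ markedly across the chosen values of z (from -1.0 to +3.0 here), the xz interaction is substantial in this toy model.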
A more recent work developed an extension to generalized additive models to account for feature interactions. Lou et al. (2013) introduced Generalized Additive Models with Pairwise Interactions (GA²M) to add models of pairwise interactions to the GAM in the form of g(E[y]) = Σ_i f_i(x_i) + Σ_{i,j} f_ij(x_i, x_j). Lou et al. (2013) and Caruana et al. (2015) interpreted the interaction effects by plotting each f_ij fully as a heatmap.

2.2.3 Feature Interaction Attribution

With the onset of deep learning and feature attribution research, several methods were developed to compute attribution scores for feature interactions. These attribution scores apply to individual data instances and estimate the impact of interactions on a prediction. Existing interaction attribution methods are Shapley Taylor Interaction Index (STI) (Dhamdhere et al., 2019), which uses random feature orderings to identify contexts for a discrete mixed partial derivative; Integrated Hessians (IH) (Janizek et al., 2020), which extends Integrated Gradients (IG) (Sundararajan et al., 2017) with path integration for hessian computations; and Model-Agnostic Hierarchical Explanations (MAHE) (Tsang et al., 2018c), which trains surrogate explainer models for interaction detection and attribution. Like IG, STI and IH are axiomatic and enforce attributions to sum up to the prediction, but the attributions are not necessarily interpretable. MAHE's attributions are unidentifiable by training additive attribution models on overlapping feature sets.

2.3 Applications

We overview several applications relevant to feature interactions for interpretable machine learning. Since deep learning has become mainstream, there have been efforts on interpreting or leveraging feature interactions that are captured by deep neural networks. Here, we discuss research related to three types of models: text analyzers, image classifiers, and recommender systems.
2.3.1 Text Analyzers

A prominent interest of the text analysis community is explaining word interactions in applications like sentiment analysis. The explanations can indicate how words modify each other's sentiment when considered together rather than separately. On this topic, Murdoch et al. (2018) proposed Contextual Decomposition (CD) to extract word interactions from Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) in the form of word-phrase attributions decomposed throughout the network. Jin et al. (2019) advanced CD with a method called Sampling Contextual Decomposition (SCD), which applies word sampling to the neural activation decomposition of CD. In addition, Jin et al. (2019) proposed Sampling Occlusion (SOC), which samples words around a target phrase in occlusion-based attribution scoring. SCD and SOC have been applied to both LSTM and state-of-the-art BERT (Devlin et al., 2019) models. While CD, SCD, and SOC compute attribution scores, they do not detect feature interactions nor apply to general sets of words.

Shapley Taylor Interaction Index (Dhamdhere et al., 2019) and Integrated Hessians (Janizek et al., 2020) were also applied to text analysis, but their attributions lack interpretability as mentioned earlier.

2.3.2 Image Classifiers

Very limited works have studied how to interpret feature interactions in image classifiers. In Singh et al. (2019), the Contextual Decomposition (CD) method was expanded to image classification, but CD still does not detect interactions, nor are its attributions axiomatic in the sense of Sundararajan et al. (2017). Some methods attempt to interpret feature groups in image classifiers, such as Anchors (Ribeiro et al., 2018) and Context-Aware methods (Singla et al., 2019); however, these methods face the same drawbacks as CD.

2.3.3 Recommender Systems

The related works for recommender systems follow a different focus.
These works leverage representations of interactions to maximize prediction performance rather than explain the interactions. Cheng et al. (2016), Guo et al. (2017), Wang et al. (2017), and Lian et al. (2018) directly incorporate multiplicative cross terms in neural network architectures, and Song et al. (2018) use attention as an interaction module, all of which are intended to improve the neural network's function approximation. The attention module in Song et al. (2018) was motivated in part by interpretability, but it remains to be seen how interpretable attention modules are. Luo et al. (2019) followed up by detecting interactions in data and then explicitly encoding them via feature crossing. This general line of work found that predictive performance can improve with dedicated interaction modeling. Despite the significant efforts on leveraging feature interactions for prediction performance, to the best of our knowledge no works have explained the feature interactions learned by recommender systems.

Chapter 3
Preliminaries

We first introduce preliminaries that serve as a basis for this thesis.

3.1 Notations

Vectors are represented by boldface lowercase letters, such as x, w; matrices are represented by boldface capital letters, such as W. The i-th entry of a vector w is denoted by w_i, and element (i, j) of a matrix W is denoted by W_{i,j}. The i-th row and j-th column of W are denoted by W_{i,:} and W_{:,j}, respectively. For a vector w ∈ R^n, let diag(w) be a diagonal matrix of size n × n, where {diag(w)}_{i,i} = w_i. For a matrix W, let |W| be its elementwise absolute value, i.e. |W|_{i,j} = |W_{i,j}|, and for a set S, let |S| be its cardinality.

Let [p] denote the set of integers from 1 to p. I ⊆ [p] is a subset of all input features, where |I| ≥ 2 is an interaction and |I| ≥ 3 is a higher-order interaction. For a vector x ∈ R^p, let x_I ∈ R^p be defined element-wise as

(x_I)_i = x_i if i ∈ I, and (x_I)_i = 0 otherwise.
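The masked vector x_I defined above can be sketched in a few lines; note the code uses 0-based indices rather than the 1-based index set [p] of the notation.

```python
import numpy as np

# Sketch of the masked vector x_I: keep the entries indexed by the
# interaction set I and zero out the rest.
def mask(x, I):
    idx = list(I)
    x_I = np.zeros_like(x)
    x_I[idx] = x[idx]
    return x_I

x = np.array([5.0, 6.0, 7.0, 8.0])
print(mask(x, {0, 2}))  # [5. 0. 7. 0.]
```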
3.2 Statistical Feature Interaction

A statistical interaction describes a situation in which the joint influence of multiple variables on an output variable is not additive (Dodge, 2006; Friedman & Popescu, 2008; Sorokina et al., 2008). Let x_i, i ∈ [p] be the features and y be the response variable. A statistical interaction I ⊆ [p] exists if and only if E[y | x], which is a function of x = (x_1, x_2, ..., x_p), contains a non-additive interaction between the variables x_I:

Definition 1 (Non-additive Interaction). Consider a function f(·) with input variables x_i, i ∈ [p], and an interaction I ⊆ [p]. Then I is a non-additive interaction of f(·) if and only if there does not exist a set of functions f_i(·), ∀i ∈ I, where f_i(·) is not a function of x_i, such that

    f(x) = \sum_{i ∈ I} f_i(x_{[p] \ {i}}).

For example, a multiplication between two features, x_1 and x_2, is a feature interaction because it cannot be represented as an addition of univariate functions, i.e., x_1 x_2 ≠ f_1(x_2) + f_2(x_1). As another example, the function x_1 x_2 + sin(x_2 + x_3 + x_4) contains both a pairwise interaction {1, 2} and a higher-order 3-way interaction {2, 3, 4}. Note that from the definition of statistical interaction, a d-way interaction can only exist if all of its corresponding (d − 1)-way interactions exist (Sorokina et al., 2008). For example, the interaction {1, 2, 3} can only exist if the interactions {1, 2}, {1, 3}, and {2, 3} also exist.

3.3 Feedforward Neural Network

Consider a feedforward neural network¹ with L hidden layers and the following parameters: L + 1 weight matrices W^{(ℓ)} ∈ R^{p_ℓ × p_{ℓ−1}} and L + 1 bias vectors b^{(ℓ)} ∈ R^{p_ℓ}, ℓ = 1, 2, ..., L + 1. Let p_ℓ be the number of hidden units in the ℓ-th layer. The input features are the 0-th layer, where p_0 = p is the number of input features, and the output is the (L + 1)-th layer, with p_{L+1} = 1. Let the activation function (non-linearity) be ϕ(·).
In all experiments, ϕ is the ReLU activation function, defined as ϕ(z) = max(z, 0), applied elementwise. The feedforward neural network can now be defined, with input x ∈ R^p, output y, and hidden units h^{(ℓ)}, as:

    h^{(0)} = x,
    h^{(ℓ)} = ϕ(W^{(ℓ)} h^{(ℓ−1)} + b^{(ℓ)}),  ∀ℓ = 1, 2, ..., L,
    y = (w^{(L+1)})^⊤ h^{(L)} + b^{(L+1)}.

¹For feedforward networks, we primarily focus on the multilayer perceptron architecture with ReLU activation functions. Some of the results can be generalized to a broader class of feedforward neural networks.

Chapter 4
Detecting Feature Interactions Learned by Feedforward Neural Networks

It is not well understood how to interpret feature interactions from neural network weights, even for piecewise-linear neural networks. Because neural networks are universal function approximators (Hornik et al., 1989), interpreting their learned interactions would offer accurate insights into not only model behavior but also the real-world training data. We propose an accurate and efficient method, called Neural Interaction Detection (NID), which detects non-additive, arbitrary-order statistical interactions captured by a feedforward neural network via its learned weights. Depending on the desired interactions, our method can achieve significantly better or similar interaction detection performance compared to the state of the art without searching an exponential solution space of possible interactions. We obtain this accuracy and efficiency by observing that interactions between input features are created by the non-additive effect of nonlinear activation functions, and that interacting paths are encoded in the network's weight matrices.

Figure 4.1: An illustration of an interaction within a multilayer perceptron with fully-connected layers, where the box contains later layers in the network. The first hidden unit takes inputs from x_1 and x_3 with large weights and creates an interaction between them.
The strength of the interaction is determined by both the incoming weights and the outgoing paths between a hidden unit and the final output, y.

Such interacting paths are encoded in the weight matrices (Figure 4.1). We demonstrate the performance of our method and the importance of discovered interactions via experimental results on both synthetic datasets and real-world application datasets. We also discuss the physical meaning of our detected feature interactions.

4.1 Methodology

We begin by discussing how multilayer perceptrons (MLPs) capture feature interactions in §4.1.1, followed by our interaction detection algorithm in §4.1.2.

4.1.1 Feature Interactions at Hidden Units

The two core components of understanding feature interactions in MLPs are identifying feature interactions at individual hidden units and measuring the influence of each of those hidden units.

Interactions at Individual Hidden Units

In MLPs with nonlinear activations, interacting features must follow strongly weighted connections to a common hidden unit before reaching the final output. Consider an MLP as a directed graph where hidden units are vertices and nonzero weights are directed edges pointing to the next layers; then Proposition 2 is given as follows:

Proposition 2 (Interactions at Common Hidden Units). Let f(·) be a multilayer perceptron. If f(·) contains an interaction I, then there exists a hidden unit h such that ∀i ∈ I, there is a path from i through h to the output in the associated graph. The converse of this statement is true with probability 1 when the nonzero weights are generated i.i.d. from any continuous distribution.

Proof. The proof of the first statement is as follows. Without loss of generality, only vertices connected to the final output are considered. Suppose for the purpose of contradiction that h, as defined in Proposition 2, does not exist. Then for any hidden unit h′ in the last hidden layer before the output, I ⊈ S_{h′}, where S_{h′} is the set of all ancestors of h′ in the graph.
Note that any hidden unit h′ can be viewed as a function f_{h′}(S_{h′}) of its ancestors. Then the output f(·) can be rewritten as

    f = \sum_{h′ in the last hidden layer} w^{(L+1)}_{h′} · f_{h′}(S_{h′}) + b^{(L+1)},

where w^{(L+1)}_{h′} is the weight for h′, and b^{(L+1)} is the bias term for the final output. This is a function without the interaction I, which is a contradiction.

The converse of this statement holds true in most cases. Counterexamples arise when early hidden layers capture an interaction that is negated in later layers. For example, the effects of two interactions may be directly removed in the next layer, as in the expression max{w_1 x_1 + w_2 x_2, 0} − max{−w_1 x_1 − w_2 x_2, 0} = w_1 x_1 + w_2 x_2. Such a counterexample is legitimate; however, due to random fluctuations, it is highly unlikely in practice that the w_1's and the w_2's on the left-hand side are exactly equal.

In general, the weights in a neural network are nonzero, in which case Proposition 2 blindly infers that all features are interacting. For example, in a neural network with just a single hidden layer, any hidden unit in the network can imply up to 2^{‖W_{j,:}‖_0} potential interactions, where ‖W_{j,:}‖_0 is the number of nonzero values in the weight vector W_{j,:} of the j-th hidden unit. Narrowing this large interaction space based on nonzero weights requires characterizing the relative importance of interactions, so the concept of interaction strength must be mathematically defined. In this chapter, the search complexity of interaction detection is significantly reduced by only quantifying interactions created at the first hidden layer, which is important for efficiency and sufficient for high detection accuracy based on empirical evaluation (§4.2.2 and Table 4.2).

Consider a hidden unit in the first layer, ϕ(w^⊤ x + b), where w is the associated weight vector and x is the input vector.
For an interaction I ⊆ [p], we propose to use an average of the relevant feature weights w_I as a surrogate for the interaction strength: μ(|w_I|), where μ(·) is the averaging function for an interaction that represents the interaction strength due to feature weights.

We provide guidance on how μ should be defined by first considering representative averaging functions from the generalized mean family: maximum value, root mean square, arithmetic mean, geometric mean, harmonic mean, and minimum value (Bullen et al., 1988). These options can be narrowed down by accounting for intuitive properties of interaction strength:

1. interaction strength is evaluated as zero whenever an interaction does not exist (one of the features has zero weight);
2. interaction strength does not decrease with any increase in magnitude of feature weights;
3. interaction strength is less sensitive to changes in large feature weights.

While the first two properties place natural constraints on interaction strength behavior, the third property is subtle in its intuition. Consider the scaling between the magnitudes of multiple feature weights, where one weight has much higher magnitude than the others. In the worst case, one weight is large in magnitude while the rest are near zero. If the large weight grows in magnitude, then interaction strength should not change significantly, but if instead the smaller weights grow at the same rate, then interaction strength should strictly increase. As a result, maximum value, root mean square, and arithmetic mean are ruled out because they do not satisfy property 1 or property 3.

We provide an interaction strength analysis on a bivariate ReLU function, max{α_1 x_1 + α_2 x_2, 0}, where x_1, x_2 are two variables and α_1, α_2 are the weights of this simple network. The strength of the interaction between x_1 and x_2 is quantified via the cross-term coefficient of the best quadratic approximation.
That is,

    β_0, ..., β_5 = argmin_{β_i, i=0,...,5} ∫_{−1}^{1} ∫_{−1}^{1} ( β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1² + β_4 x_2² + β_5 x_1 x_2 − max{α_1 x_1 + α_2 x_2, 0} )² dx_1 dx_2.

Then β_5, the coefficient of the {x_1, x_2} interaction, is given by

    |β_5| = (3/4) ( 1 − min{α_1², α_2²} / (5 max{α_1², α_2²}) ) · min{|α_1|, |α_2|}.    (4.1)

Note that the choice of the region (−1, 1) × (−1, 1) is arbitrary: for a larger region (−c, c) × (−c, c) with c > 1, |β_5| is found to scale with c^{−1}. Also note that the factor before min{|α_1|, |α_2|} in (4.1) is almost a constant, with less than 20% fluctuation. This analysis suggests that the interaction strength of a bivariate ReLU function can be well-modeled by the minimum of |α_1| and |α_2|.

Measuring the Influence of Hidden Units

Our definition of interaction strength at individual hidden units is not complete without considering their outgoing paths, because an outgoing path of zero weight cannot contribute an interaction to the final output. To quantify the influence of an outgoing path on the final output, we draw inspiration from Garson's algorithm (Garson, 1991; Goh, 1995), which, instead of computing the influence of a hidden unit, computes the influence of features on the output. This is achieved by cumulative matrix multiplications of the absolute values of weight matrices. In the following, we propose our definition of hidden unit influence, then prove that this definition upper bounds the gradient magnitude of the hidden unit with its activation function. To represent the influence of hidden unit i at the ℓ-th hidden layer, we define the aggregated weight z^{(ℓ)}_i, where

    z^{(ℓ)} = |w^{(L+1)}|^⊤ |W^{(L)}| ⋯ |W^{(ℓ+1)}|.

This definition upper bounds the gradient magnitudes of hidden units because it computes Lipschitz constants for the corresponding units.
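The cumulative product of absolute weight matrices can be computed with one backward pass over the layers. Below is a minimal NumPy sketch under our own conventions: `W_list` holds the hidden-to-hidden matrices W^{(2)}, ..., W^{(L)} and `w_out` is the final-layer weight vector w^{(L+1)}; none of these names come from the thesis.

```python
import numpy as np

def aggregated_weights(W_list, w_out):
    """Compute z^(l) = |w^(L+1)|^T |W^(L)| ... |W^(l+1)| for every hidden layer l.

    W_list: hidden-layer weight matrices [W^(2), ..., W^(L)], where W^(l) has
            shape (p_l, p_{l-1});
    w_out:  final-layer weight vector w^(L+1), shape (p_L,).
    Returns a dict mapping the hidden-layer index l to the vector z^(l).
    """
    L = len(W_list) + 1           # number of hidden layers
    z = {L: np.abs(w_out)}        # base case: z^(L) = |w^(L+1)|
    for l in range(L - 1, 0, -1):
        # W_list[l - 1] is W^(l+1); recursion z^(l) = z^(l+1) |W^(l+1)|
        z[l] = z[l + 1] @ np.abs(W_list[l - 1])
    return z

# Toy network: 2 units in layer 1, 1 unit in layer 2.
W2 = np.array([[1.0, -2.0]])
w_out = np.array([3.0])
print(aggregated_weights([W2], w_out))
```

Each z^{(1)}_i then serves as the outgoing-path influence of first-layer unit i in Eq. (4.2).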
Gradients have been commonly used as variable importance measures in neural networks, especially input gradients, which compute directions normal to decision boundaries (Ross et al., 2017; Goodfellow et al., 2015; Simonyan et al., 2013). Thus, an upper bound on the gradient magnitude approximates how important a variable can be.

Lemma 3 (Neural Network Lipschitz Estimation). Let the activation function ϕ(·) be a 1-Lipschitz function. Then the output y is z^{(ℓ)}_i-Lipschitz with respect to h^{(ℓ)}_i.

Proof. A non-differentiable ϕ(·), such as the ReLU function, can be replaced with a series of differentiable 1-Lipschitz functions that converges to ϕ(·) in the limit. Therefore, without loss of generality, ϕ(·) is assumed to be differentiable with |∂_x ϕ(x)| ≤ 1. The partial derivative of the final output is now taken with respect to h^{(ℓ)}_i, the i-th unit at the ℓ-th hidden layer:

    ∂y / ∂h^{(ℓ)}_i = \sum_{j_{ℓ+1},...,j_L} (∂y / ∂h^{(L)}_{j_L}) (∂h^{(L)}_{j_L} / ∂h^{(L−1)}_{j_{L−1}}) ⋯ (∂h^{(ℓ+1)}_{j_{ℓ+1}} / ∂h^{(ℓ)}_i),

or, in matrix form,

    ∂y / ∂h^{(ℓ)} = (w^{(L+1)})^⊤ diag(φ̇^{(L)}) W^{(L)} ⋯ diag(φ̇^{(ℓ+1)}) W^{(ℓ+1)},

where φ̇^{(ℓ)} ∈ R^{p_ℓ} is a vector such that φ̇^{(ℓ)}_k = ∂_x ϕ(W^{(ℓ)}_{k,:} h^{(ℓ−1)} + b^{(ℓ)}_k). To conclude the lemma, the following inequality must be proved:

    |∂y / ∂h^{(ℓ)}_i| ≤ ( |w^{(L+1)}|^⊤ |W^{(L)}| ⋯ |W^{(ℓ+1)}| )_{:,i} = z^{(ℓ)}_i.

The left-hand side can be rewritten as

    | \sum_{j_{ℓ+1},...,j_L} w^{(L+1)}_{j_L} φ̇^{(L)}_{j_L} W^{(L)}_{j_L, j_{L−1}} φ̇^{(L−1)}_{j_{L−1}} ⋯ φ̇^{(ℓ+1)}_{j_{ℓ+1}} W^{(ℓ+1)}_{j_{ℓ+1}, i} |.

The right-hand side can be rewritten as

    \sum_{j_{ℓ+1},...,j_L} |w^{(L+1)}_{j_L}| |W^{(L)}_{j_L, j_{L−1}}| ⋯ |W^{(ℓ+1)}_{j_{ℓ+1}, i}|.

Noting that |∂_x ϕ(x)| ≤ 1 concludes the proof.

Quantifying Interaction Strength

We now provide a unified definition of the interaction strength ω_i(I) for an interaction candidate I at the i-th unit in the first hidden layer, h^{(1)}_i:

    ω_i(I) = z^{(1)}_i · μ( |W^{(1)}_{i,I}| ).    (4.2)

Note that ω_i(I) is defined on a single hidden unit, and it is agnostic to scaling ambiguity within a ReLU-based neural network.
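Eq. (4.2) is computable directly from the first-layer weight matrix and the aggregated weights. A sketch with NumPy (variable names are ours; min is used for μ, matching the analysis above):

```python
import numpy as np

def interaction_strength(W1, z1, I, mu=np.min):
    """omega_i(I) = z1_i * mu(|W1[i, I]|) for every first-layer hidden unit i.

    W1: first-layer weight matrix, shape (p_1, p);
    z1: aggregated weights z^(1), shape (p_1,);
    I:  interaction candidate, an iterable of (zero-based) feature indices.
    Returns the vector of per-unit strengths omega_i(I).
    """
    idx = np.asarray(list(I))
    return z1 * mu(np.abs(W1[:, idx]), axis=1)

# Toy example: 2 hidden units over 3 features.
W1 = np.array([[2.0, 0.5, 0.0],
               [1.0, -1.0, 3.0]])
z1 = np.array([1.0, 2.0])
print(interaction_strength(W1, z1, [0, 1]))  # per-unit strengths of {x_1, x_2}
```

Unit 1 contributes 1.0 · min(2.0, 0.5) and unit 2 contributes 2.0 · min(1.0, 1.0), so the candidate {x_1, x_2} gets most of its support from the second unit.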
In the next section, we discuss our scheme for aggregating strengths across hidden units so that we can compare interactions of different orders.

4.1.2 Interaction Detection Algorithm

In this section, we propose our feature interaction detection algorithm, NID, which can extract interactions of all orders without individually testing each combination of features. Our methodology for interaction detection comprises three main steps: 1) define the model architecture, 2) interpret network weights to obtain a ranking of interaction candidates, and 3) optionally identify a top-K cutoff on the interaction ranking.

Model Architecture

Figure 4.2: Neural network architecture for interaction detection, with optional univariate networks.

We study two architectures: MLP and MLP-M. MLP is a standard multilayer perceptron, and MLP-M is an MLP with additional univariate networks summed at the output (Figure 4.2). The univariate networks are intended to divert the modeling of main effects away from the standard MLP, which can create spurious interactions using the main effects. When training the neural networks, we apply L1 regularization on the MLP portions of the architectures to suppress unimportant interacting paths. For MLP-M, the L1 penalty also has the effect of pushing the modeling of main effects into the univariate networks.

Ranking Interactions of Arbitrary Order

We design a greedy algorithm to generate a ranking of interaction candidates by only considering, at each hidden unit, the top-ranked interactions of every order, where 2 ≤ |I| ≤ p. The greedy algorithm thereby drastically reduces the search space of potential interactions while still considering all orders. This algorithm is shown in Alg. 1. Once an MLP or MLP-M is trained, the greedy algorithm is used to traverse the input weight matrix W^{(1)}.
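This greedy traversal can be sketched in a few lines. Assuming a trained first-layer weight matrix `W1` and aggregated weights `z1` (variable names are ours), each hidden unit proposes one candidate per order by taking its top-j weighted features:

```python
import numpy as np
from collections import defaultdict

def nid_greedy_ranking(W1, z1):
    """Greedy ranking of interaction candidates (a sketch of the NID traversal).

    Each first-layer hidden unit r proposes, for every order j >= 2, the
    candidate formed by its j largest-magnitude input weights; the candidate's
    strength z1_r * min(|weights|) is accumulated across hidden units.
    """
    p = W1.shape[1]
    strengths = defaultdict(float)
    for r in range(W1.shape[0]):
        w = np.abs(W1[r])
        order = np.argsort(-w)  # feature indices by decreasing |weight|
        for j in range(2, p + 1):
            I = tuple(sorted(int(k) for k in order[:j]))
            strengths[I] += float(z1[r] * w[order[:j]].min())
    return sorted(strengths.items(), key=lambda kv: -kv[1])

# Toy example: 2 hidden units over 3 features.
W1 = np.array([[3.0, 2.0, 0.1],
               [0.2, 1.0, 1.5]])
z1 = np.array([1.0, 1.0])
for I, s in nid_greedy_ranking(W1, z1):
    print(I, round(s, 3))
```

In this toy case, {x_1, x_2} ranks first because the first unit weights both features heavily, while the 3-way candidate accumulates only small minimum weights.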
Specifically, the greedy algorithm selects only top-ranked interaction candidates per hidden unit based on their interaction strengths. By selecting the top-ranked interactions of every order and summing their respective strengths across hidden units, we obtain final interaction strengths, allowing arbitrary-order interaction candidates to be ranked relative to each other.

Algorithm 1: NID Greedy Ranking Algorithm
Input: input-to-first-hidden-layer weights W^{(1)}, aggregated weights z^{(1)}
Output: ranked list of interaction candidates {I_i}_{i=1}^m
1: d ← an empty dictionary mapping interaction candidates to interaction strengths
2: for each row w′ of W^{(1)}, indexed by r, do
3:   for j = 2 to p do
4:     I ← sorted indices of the top-j weights in w′
5:     d[I] ← d[I] + z^{(1)}_r · μ(|w′_I|)
6: {I_i}_{i=1}^m ← interaction candidates in d sorted by their strengths in descending order

For this algorithm, we set the averaging function μ(·) = min(·) based on its performance in experimental evaluation (§4.2.1.1). Since a higher-order interaction only exists if its subset interactions also exist (§3.2), the subset interactions are redundant in the presence of the higher-order interaction. Thus, Algorithm 1 assumes that there are at least as many first-layer hidden units as there are true higher-order interactions and non-redundant pairwise interactions.¹ ²

In addition to efficiency, a benefit of Algorithm 1's greedy strategy is that it automatically improves the ranking of a higher-order interaction over its redundant subsets. This allows the higher-order interaction to have a better chance of ranking above any false positives and being captured in the cutoff stage. We justify this improvement via Theorem 4 under a mild assumption.

Theorem 4 (Improving the ranking of higher-order interactions).
Let R be the set of interactions proposed by Algorithm 1 with μ(·) = min(·), let I ∈ R be a d-way interaction where d ≥ 3, and let S be the set of subset (d − 1)-way interactions of I, where |S| = d. Assume that for any hidden unit j which proposed s ∈ S ∩ R, I will also be proposed at the same hidden unit, and ω_j(I) > (1/d) ω_j(s). Then one of the following must be true: a) ∃s ∈ S ∩ R ranked lower than I, i.e., ω(I) > ω(s), or b) ∃s ∈ S where s ∉ R.

¹In practice, true interactions are initially unknown, so arbitrarily large numbers of first-layer hidden units are used.
²Redundant subset interactions can be pruned from the interaction ranking when corresponding superset interactions are ranked higher.

Proof. Suppose for the purpose of contradiction that S ⊆ R and ∀s ∈ S, ω(s) ≥ ω(I). Because ω_j(I) > (1/d) ω_j(s),

    ω(I) = \sum_{s ∈ S∩R} \sum_{j proposes s} z_j ω_j(I) > (1/d) \sum_{s ∈ S∩R} \sum_{j proposes s} z_j ω_j(s) = (1/d) \sum_{s ∈ S∩R} ω(s).

Since ∀s ∈ S, ω(s) ≥ ω(I),

    (1/d) \sum_{s ∈ S∩R} ω(s) ≥ (1/d) \sum_{s ∈ S∩R} ω(I).

Since S ⊆ R, |S ∩ R| = d. Therefore,

    (1/d) \sum_{s ∈ S∩R} ω(I) = (1/d) · d · ω(I) = ω(I),

which yields ω(I) > ω(I), a contradiction.

Under the noted assumption, part a) of the theorem shows that a d-way interaction will improve over one of its (d − 1)-way subsets in the rankings as long as there is no sudden drop from the weight of the (d − 1)-way interaction to that of the d-way interaction at the same hidden units. The improvement extends to b) as well, when d = |S ∩ R| > 1.

Cutoff on Interaction Ranking

In order to predict the true top-K interactions {I_i}_{i=1}^K, there must be a cutoff point on the interaction ranking from §4.1.2. We obtain this cutoff by constructing a Generalized Additive Model (GAM) with interactions:

    f̃_K(x) = \sum_{i=1}^{p} g_i(x_i) + \sum_{i=1}^{K} g′_i(x_{I_i}),

where g_i(·) captures the main effects, g′_i(·) captures the interactions, and both g_i and g′_i are small feedforward networks trained jointly via backpropagation. We refer to this model as MLP-Cutoff. We gradually add top-ranked interactions to the GAM, increasing K, until GAM performance on a validation set plateaus.
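This incremental procedure can be sketched as follows; `fit_and_score` stands in for training MLP-Cutoff with the given interactions and evaluating it on the validation set, and the `patience`/`tol` plateau rule is our own illustrative choice, not the thesis's:

```python
def find_cutoff(ranked_interactions, fit_and_score, patience=2, tol=1e-3):
    """Grow K until validation performance plateaus (a sketch of the cutoff step).

    ranked_interactions: candidates from the ranking algorithm, best first;
    fit_and_score(interactions) -> validation score (higher is better) of a
        model built with the given interactions (placeholder callable);
    patience / tol: heuristic plateau rule (assumed, for illustration).
    """
    best_score, best_K, stale = float("-inf"), 0, 0
    for K in range(1, len(ranked_interactions) + 1):
        score = fit_and_score(ranked_interactions[:K])
        if score > best_score + tol:
            best_score, best_K, stale = score, K, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return ranked_interactions[:best_K]

# Toy example: the score improves for the first two interactions, then flattens.
scores = {1: 0.70, 2: 0.80, 3: 0.801, 4: 0.800}
top = find_cutoff([(0, 1), (1, 2), (0, 2), (2, 3)], lambda ints: scores[len(ints)])
print(top)  # → [(0, 1), (1, 2)]
```

The returned prefix {I_i}_{i=1}^K is the set of interactions reported to the user.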
The exact plateau point can be found by early stopping or other heuristic means, and we report {I_i}_{i=1}^K as the identified feature interactions.

Pairwise Interaction Detection

A variant of our interaction ranking algorithm tests for all pairwise interactions. Pairwise interaction detection has been a standard problem in the interaction detection literature (Lou et al., 2013; Fan et al., 2016) due to its simplicity. Modeling pairwise interactions is also the de facto objective of many successful machine learning algorithms, such as factorization machines (Rendle, 2010) and hierarchical lasso (Bien et al., 2013). We rank all pairs of features {i, j} according to their interaction strengths ω({i, j}) calculated on the first hidden layer, where again the averaging function is min(·), and ω({i, j}) = \sum_{s=1}^{p_1} ω_s({i, j}). The higher the rank, the more likely the interaction exists.

4.2 Experiments

In this section, we discuss our experiments on both simulated and real-world datasets to study the performance of our approach to interaction detection.

4.2.1 Experiment Setup

4.2.1.1 Averaging Function

Figure 4.3: A comparison of averaging functions by the total number of correct interactions ranked before any false positives, evaluated on the test suite (Table 4.1). x-axis labels are maximum, root mean square, arithmetic mean, geometric mean, harmonic mean, and minimum.

Our proposed NID framework relies on the selection of an averaging function (§4.1.1 and §4.1.2). We experimentally determined the averaging function by comparing representative functions from the generalized mean family (Bullen et al., 1988): maximum, root mean square, arithmetic mean, geometric mean, harmonic mean, and minimum, the intuitions behind which were discussed in §4.1.1.
To make the comparison, we used a test suite of 10 synthetic functions, which consist of a variety of interactions of varying order and overlap, as shown in Table 4.1. We trained 10 trials of MLP and MLP-M on each of the synthetic functions, obtained interaction rankings with our proposed greedy ranking algorithm (Algorithm 1), and counted the total number of correct interactions ranked before any false positive. In this evaluation, we ignore predicted interactions that are subsets of true higher-order interactions because the subset interactions are redundant (§2). As seen in Figure 4.3, the number of true top interactions we recover is highest with the minimum averaging function, which we will use in all of our experiments. This function is consistent with the analytical study of the bivariate hidden unit in §4.1.1.

4.2.1.2 Neural Network Configuration

We trained feedforward networks of the MLP and MLP-M architectures to obtain interaction rankings, and we trained MLP-Cutoff to find cutoffs on the rankings. In our experiments, all networks that model feature interactions consisted of four hidden layers with first-to-last layer sizes of 140, 100, 60, and 20 units. In contrast, all individual univariate networks had three hidden layers with sizes of 10, 10, and 10 units. All networks used ReLU activation and were trained using backpropagation. In the cases of MLP-M and MLP-Cutoff, the summed networks were trained jointly. The objective functions were mean squared error for regression and cross-entropy for classification tasks. On the synthetic test suite, MLP and MLP-M were trained with L1 constants in the range of 5e-6 to 5e-4, based on parameter tuning on a validation set. On real-world datasets, the L1 constant was fixed at 5e-5. MLP-Cutoff used a fixed L2 constant of 1e-4 in all experiments involving cutoff. Early stopping was used to prevent overfitting.

4.2.1.3 Datasets

We study our interaction detection framework on both simulated and real-world experiments.
For simulated experiments, we used a test suite of synthetic functions, as shown in Table 4.1. The test functions were designed to have a mixture of pairwise and higher-order interactions, with varying order, strength, nonlinearity, and overlap. F_1 is a commonly used function in the interaction detection literature (Hooker, 2004; Sorokina et al., 2008; Lou et al., 2013). All features were uniformly distributed between −1 and 1, except in F_1, where we used the same variable ranges as reported in the literature (Hooker, 2004). In all synthetic experiments, we used random train/valid/test splits of 1/3 each on 30k data points.

Table 4.1: Test suite of data-generating functions

F_1(x) = π^{x_1 x_2} √(2 x_3) − sin⁻¹(x_4) + log(x_3 + x_5) − (x_9 / x_10) √(x_7 / x_8) − x_2 x_7
F_2(x) = π^{x_1 x_2} √(2 |x_3|) − sin⁻¹(0.5 x_4) + log(|x_3 + x_5| + 1) + (x_9 / (1 + |x_10|)) √(x_7 / (1 + |x_8|)) − x_2 x_7
F_3(x) = exp|x_1 − x_2| + |x_2 x_3| − x_3^{2|x_4|} + log(x_4² + x_5² + x_7² + x_8²) + x_9 + 1 / (1 + x_10²)
F_4(x) = exp|x_1 − x_2| + |x_2 x_3| − x_3^{2|x_4|} + (x_1 x_4)² + log(x_4² + x_5² + x_7² + x_8²) + x_9 + 1 / (1 + x_10²)
F_5(x) = 1 / (1 + x_1² + x_2² + x_3²) + √(exp(x_4 + x_5)) + |x_6 + x_7| + x_8 x_9 x_10
F_6(x) = exp(|x_1 x_2| + 1) − exp(|x_3 + x_4| + 1) + cos(x_5 + x_6 − x_8) + √(x_8² + x_9² + x_10²)
F_7(x) = (arctan(x_1) + arctan(x_2))² + max(x_3 x_4 + x_6, 0) − 1 / (1 + (x_4 x_5 x_6 x_7 x_8)²) + (|x_7| / (1 + |x_9|))^5 + \sum_{i=1}^{10} x_i
F_8(x) = x_1 x_2 + 2^{x_3 + x_5 + x_6} + 2^{x_3 + x_4 + x_5 + x_7} + sin(x_7 sin(x_8 + x_9)) + arccos(0.9 x_10)
F_9(x) = tanh(x_1 x_2 + x_3 x_4) √|x_5| + exp(x_5 + x_6) + log((x_6 x_7 x_8)² + 1) + x_9 x_10 + 1 / (1 + |x_10|)
F_10(x) = sinh(x_1 + x_2) + arccos(tanh(x_3 + x_5 + x_7)) + cos(x_4 + x_5) + sec(x_7 x_9)

We use four real-world datasets, of which two are regression datasets and the other two are binary classification datasets.
The datasets are a mixture of common prediction tasks (the cal housing and bike sharing datasets), a scientific discovery task (the higgs boson dataset), and an example of very-high-order interaction detection (the letter dataset). Specifically, the cal housing dataset is a regression dataset with 21k data points for predicting California housing prices (Pace & Barry, 1997). The bike sharing dataset contains 17k data points of weather and seasonal information to predict the hourly count of rental bikes in a bikeshare system (Fanaee-T & Gama, 2014). The higgs boson dataset has 800k data points for classifying whether a particle environment originates from the decay of a Higgs boson (Adam-Bourdarios et al., 2014). Lastly, the letter recognition dataset contains 20k data points of transformed features for binary classification of letters on a pixel display (Frey & Slate, 1991). For all real-world data, we used random train/valid/test splits of 80/10/10.

4.2.1.4 Baselines

We compare the performance of NID to that of four baseline interaction detection methods. Two-Way Analysis of Variance (ANOVA) (Wonnacott & Wonnacott, 1972) utilizes linear models to conduct significance tests on the existence of interaction terms. Hierarchical lasso (HierLasso) (Bien et al., 2013) applies lasso feature selection to extract pairwise interactions. RuleFit (Friedman & Popescu, 2008) contains a statistic to measure pairwise interaction strength using partial dependence functions. Additive Groves (AG) (Sorokina et al., 2008) is a nonparametric means of testing for interactions by placing structural constraints on an additive model of regression trees. AG is a reference method for interaction detection because it directly detects interactions based on their non-additive definition.

4.2.2 Pairwise Interaction Detection

As discussed in §4.1.2, our framework NID can be used for pairwise interaction detection.
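To make the pairwise variant and its AUC-based scoring concrete, here is a small self-contained sketch. The ranking follows §4.1.2 with μ = min; the AUC helper and all variable names are our own illustration, not the thesis's evaluation code:

```python
import numpy as np
from itertools import combinations

def pairwise_strengths(W1, z1):
    """omega({i,j}) = sum over first-layer units r of z1_r * min(|W1[r,i]|, |W1[r,j]|)."""
    absW = np.abs(W1)
    p = W1.shape[1]
    return {(i, j): float(np.sum(z1 * np.minimum(absW[:, i], absW[:, j])))
            for i, j in combinations(range(p), 2)}

def auc(strengths, true_pairs):
    """AUC: probability that a true interacting pair outranks a non-interacting one."""
    pos = [s for pair, s in strengths.items() if pair in true_pairs]
    neg = [s for pair, s in strengths.items() if pair not in true_pairs]
    wins = sum((ps > ns) + 0.5 * (ps == ns) for ps in pos for ns in neg)
    return wins / (len(pos) * len(neg))

# Toy example: both units strongly connect features (0, 1), so {0, 1} ranks first.
W1 = np.array([[2.0, 2.0, 0.1],
               [1.5, 1.0, 0.0]])
z1 = np.array([1.0, 1.0])
s = pairwise_strengths(W1, z1)
print(auc(s, {(0, 1)}))  # → 1.0
```

An AUC of 1.0 means every ground-truth pair is ranked above every non-interacting pair, which is the criterion reported in Table 4.2.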
To evaluate this approach, we used datasets generated by the synthetic functions F_1–F_10 (Table 4.1), which contain a mixture of pairwise and higher-order interactions; in the case of higher-order interactions, we tested for their pairwise subsets, as in Sorokina et al. (2008) and Lou et al. (2013). AUC scores of the interaction strengths proposed by the baseline methods and by NID for both MLP and MLP-M are shown in Table 4.2. We ran ten trials of AG and NID on each dataset and removed the two trials with the highest and lowest AUC scores.

Table 4.2: AUC of pairwise interaction strengths proposed by NID and baselines on a test suite of synthetic functions (Table 4.1). ANOVA, HierLasso, and RuleFit are deterministic.

            ANOVA   HierLasso   RuleFit   AG               NID, MLP          NID, MLP-M
F_1(x)      0.992   1.00        0.754     1 ± 0.0          0.970 ± 9.2e−3    0.995 ± 4.4e−3
F_2(x)      0.468   0.636       0.698     0.88 ± 1.4e−2    0.79 ± 3.1e−2     0.85 ± 3.9e−2
F_3(x)      0.657   0.556       0.815     1 ± 0.0          0.999 ± 2.0e−3    1 ± 0.0
F_4(x)      0.563   0.634       0.689     0.999 ± 1.4e−3   0.85 ± 6.7e−2     0.996 ± 4.7e−3
F_5(x)      0.544   0.625       0.797     0.67 ± 5.7e−2    1 ± 0.0           1 ± 0.0
F_6(x)      0.780   0.730       0.811     0.64 ± 1.4e−2    0.98 ± 6.7e−2     0.70 ± 4.8e−2
F_7(x)      0.726   0.571       0.666     0.81 ± 4.9e−2    0.84 ± 1.7e−2     0.82 ± 2.2e−2
F_8(x)      0.929   0.958       0.946     0.937 ± 1.4e−3   0.989 ± 4.4e−3    0.989 ± 4.5e−3
F_9(x)      0.783   0.681       0.584     0.808 ± 5.7e−3   0.83 ± 5.3e−2     0.83 ± 3.7e−2
F_10(x)     0.765   0.583       0.876     1 ± 0.0          0.995 ± 9.5e−3    0.99 ± 2.1e−2
average     0.721   0.698       0.764     0.87 ± 1.4e−2    0.92* ± 2.3e−2    0.92 ± 1.8e−2

*Note: The high average AUC of NID, MLP is heavily influenced by F_6.

When comparing the AUCs of NID applied to MLP and MLP-M, we observe that the scores of MLP-M tend to be comparable or better, except for the AUC on F_6. On one hand, MLP-M performed better on F_2 and F_4 because these functions contain main effects that MLP would model as spurious interactions with other variables. On the other hand, MLP-M performed worse on F_6 because it modeled spurious main effects in the {8, 9, 10} interaction.
Specifically, {8, 9, 10} can be approximated as independent parabolas for each variable (shown in Appendix A.1). In our analyses of NID, we mostly focus on MLP-M because handling main effects is widely considered an important problem in interaction detection (Bien et al., 2013; Lim & Hastie, 2015; Kong et al., 2017). Comparing the AUCs of AG and NID for MLP-M, the scores tend to be close, except on F_5, F_6, and F_8, where NID performs significantly better than AG. This performance difference may be due to limitations in the model capacity of AG, which is tree-based. In comparison to ANOVA, HierLasso, and RuleFit, NID on MLP-M generally performs on par or better. This is expected for ANOVA and HierLasso because they are based on quadratic models, which can have difficulty approximating the interaction nonlinearities present in the test suite.

Figure 4.4: Heat maps of pairwise interaction strengths proposed by the NID framework on MLP-M for datasets generated by functions F_1–F_10 (Table 4.1). Red cross-marks indicate ground truth interactions.
Figure 4.5: Heat maps of pairwise interaction strengths proposed by the NID framework on MLP-M for real-world datasets (cal housing, bike sharing, higgs boson, and letter).

In Figure 4.4, heat maps of the synthetic functions show the relative strengths of all possible pairwise interactions as interpreted from MLP-M, with ground truth indicated by red cross-marks. The interaction strengths shown are normally high at the cross-marks. An exception is F_6, where NID proposes weak or negligible interaction strengths at the cross-marks corresponding to the {8, 9, 10} interaction, which is consistent with previous remarks about this interaction. Besides F_6, F_7 also shows erroneous interaction strengths; however, comparative detection performance by the baselines is similarly poor. Interaction strengths are also visualized on real-world datasets via heat maps (Figure 4.5). For example, in the cal housing dataset, there is a high-strength interaction between x_1 and x_2. These variables represent longitude and latitude, respectively, and it is clear that the outcome variable, California housing price, should indeed strongly depend on geographical location. We further observe high-strength interactions appearing in the heat maps of the bike sharing, higgs boson, and letter datasets. For example, all feature pairs appear to be interacting in the letter dataset. The binary classification task of the letter dataset is to distinguish letters A–M from N–Z using 16 pixel display features.
Since the decision boundary between A-M and N-Z is not obvious, it would make sense that a neural network learns a highly interacting function to make the distinction.

4.2.3 Higher-Order Interaction Detection

We use our greedy interaction ranking algorithm (Algorithm 1) to perform higher-order interaction detection without an exponential search of interaction candidates. We first visualize our higher-order interaction detection algorithm on synthetic and real-world datasets, then we show how the predictive capability of detected interactions closes the performance gap between MLP-Cutoff and MLP-M. Next, we discuss our experiments comparing NID and AG with added noise, and lastly we verify that our algorithm obtains significant improvements in runtime.

We visualize higher-order interaction detection on synthetic and real-world datasets in Figures 4.6 and 4.7 respectively. The plots correspond to higher-order
Figure 4.6: MLP-Cutoff error with added top-ranked interactions (along x-axis) of F1-F10 (Table 4.1), where the interaction rankings were generated by the NID framework applied to MLP-M. Red cross-marks indicate ground truth interactions, and Ø denotes MLP-Cutoff without any interactions. Subset interactions become redundant when their true superset interactions are found.

Figure 4.7: MLP-Cutoff error with added top-ranked interactions (along x-axis) of real-world datasets (Table 4.1), where the interaction rankings were generated by the NID framework on MLP-M. Ø denotes MLP-Cutoff without any interactions.

interaction detection as the ranking cutoff is applied (§4.1.2). The interaction rankings generated by NID for MLP-M are shown on the x-axes, and the blue bars correspond to the validation performance of MLP-Cutoff as interactions are added. For example, the plot for cal housing shows that adding the first interaction significantly reduces RMSE.
We keep adding interactions into the model until reaching a cutoff point. In our experiments, we use a cutoff heuristic where interactions are no longer added after MLP-Cutoff's validation performance reaches or surpasses MLP-M's validation performance (represented by horizontal dashed lines). As seen with the red cross-marks, our method finds true interactions in the synthetic data of F1-F10 before the cutoff point. Challenges with detecting interactions are again mainly associated with F6 and F7, which have also been difficult for baselines in the pairwise detection setting (Table 4.2). For the cal housing dataset, we obtain the top interaction {1, 2} just as in our pairwise test (Figure 4.5, cal housing), where now the {1, 2} interaction contributes a significant improvement in MLP-Cutoff performance. Similarly, from the letter dataset we obtain a 16-way interaction, which is consistent with its highly interacting pairwise heat map (Figure 4.5, letter). For the bike sharing and higgs boson datasets, we note that even when considering many interactions, MLP-Cutoff eventually reaches the cutoff point with a relatively small number of superset interactions. This is because many subset interactions become redundant when their corresponding supersets are found.

In our evaluation of interaction detection on real-world data, we study detected interactions via their predictive performance. By comparing the test performance of MLP-Cutoff and MLP-M with respect to MLP-Cutoff without any interactions (MLP-Cutoff_Ø), we can compute the relative test performance improvement obtained by including detected interactions. These relative performance improvements are shown in Table 4.3 for the real-world datasets as well as four selected synthetic datasets, where performance is averaged over ten trials per dataset.
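The cutoff heuristic described above can be sketched as follows. This is a minimal illustration, not the thesis code; the function names and the training callable are hypothetical placeholders.

```python
def apply_ranking_cutoff(ranked_interactions, train_cutoff_model, mlp_m_val_error):
    """Greedily add top-ranked interactions to MLP-Cutoff until its
    validation error reaches (or surpasses) that of MLP-M.

    ranked_interactions: list of feature-index tuples, strongest first.
    train_cutoff_model: callable mapping a list of interactions to the
        validation error of the resulting MLP-Cutoff model (hypothetical).
    mlp_m_val_error: validation error of the fully-connected MLP-M.
    """
    kept = []
    for interaction in ranked_interactions:
        kept = kept + [interaction]
        val_error = train_cutoff_model(kept)
        if val_error <= mlp_m_val_error:  # cutoff point reached
            break
    return kept
```

For instance, with a stand-in trainer whose validation error drops by 0.1 per added interaction, the loop stops as soon as MLP-M's validation error is matched.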
The results of this study show that a relatively small number of interactions of variable order are highly predictive of their corresponding datasets, as true interactions should be.

Table 4.3: Test performance improvement when adding top-K interactions from MLP-M to MLP-Cutoff for real-world datasets and select synthetic datasets. Here, the median K excludes subset interactions, and |I| denotes average interaction cardinality. RMSE values are standard scaled.

Dataset       p    Relative          Absolute                K    |I|
cal housing    8    99%   ± 4.0%     0.09   ± 1.3e-2 RMSE    2    2.0
bike sharing  12    98.8% ± 0.89%    0.331  ± 4.6e-3 RMSE   12    4.7
higgs boson   30    98%   ± 1.4%     0.0188 ± 5.9e-4 AUC    11    4.0
letter        16   101.1% ± 0.58%    0.103  ± 5.8e-3 AUC     1   16
F3(x)         10   104.1% ± 0.21%    0.672  ± 2.2e-3 RMSE    4    2.5
F5(x)         10   102.0% ± 0.30%    0.875  ± 2.2e-3 RMSE    6    2.2
F7(x)         10   105.2% ± 0.30%    0.2491 ± 6.4e-4 RMSE    3    3.7
F10(x)        10   105.5% ± 0.50%    0.234  ± 1.5e-3 RMSE    4    2.3

Figure 4.8: Comparisons between AG and NID in higher-order interaction detection. (a) Comparison of top-rank recall at different noise levels on the synthetic test suite (Table 4.1); (b) comparison of runtimes, where NID runtime with and without cutoff are both measured. NID detects interactions with top-rank recall close to the state-of-the-art AG while running orders of magnitude faster.

We further study higher-order interaction detection of our NID framework by comparing it to AG in both interaction ranking quality and runtime. To assess ranking quality, we design a metric, top-rank recall, which computes a recall of proposed interaction rankings by only considering those interactions that are correctly ranked before any false positive.
The number of top correctly-ranked interactions is then divided by the true number of interactions. Because subset interactions are redundant in the presence of corresponding superset interactions, only such superset interactions can count as true interactions, and our metric ignores any subset interactions in the ranking. We compute the top-rank recall of NID on MLP and MLP-M, the scores of which are averaged across all tests in the test suite of synthetic functions (Table 4.1) with 10 trials per test function. For each test, we remove the two trials with max and min recall. We conduct the same tests using the state-of-the-art interaction detection method AG, except with only one trial per test because AG is very computationally expensive to run. In Figure 4.8a, we show top-rank recall of NID and AG at different Gaussian noise levels³, and in Figure 4.8b, we show runtime comparisons on real-world and synthetic datasets. As shown, NID can obtain similar top-rank recall as AG while running orders of magnitude faster.

4.2.4 Limitations

In higher-order interaction detection, our NID framework can have difficulty detecting interactions from functions with interlinked interacting variables. For example, a clique x1x2 + x1x3 + x2x3 only contains pairwise interactions. When detecting pairwise interactions (§4.2.2), NID often obtains an AUC of 1. However, in higher-order interaction detection, the interlinked pairwise interactions are often confused for single higher-order interactions. This issue could mean that our higher-order interaction detection algorithm fails to separate interlinked pairwise interactions encoded in a neural network, or that the network approximates interlinked low-order interactions as higher-order interactions.
Another limitation of our framework is that it sometimes detects spurious interactions or misses interactions as a result of correlations between features; however, correlations are known to cause such problems for any interaction detection method (Sorokina et al., 2008; Lou et al., 2013).

³Gaussian noise was applied to both features and the outcome variable after standard scaling all variables.

4.3 Broader Impact

Neural Interaction Detection (NID) offers a practical way to detect non-additive higher-order feature interactions in a large enough dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. NID can also be useful for detecting pairwise feature interactions, especially if efficiency is a main concern. The opportunity to discover higher-order feature interactions is particularly exciting for scientific discovery applications, such as identifying new atomic or molecular interactions. Later in Chapter 6, we will see how the scope of NID can be expanded even beyond tabular data settings, to detect interactions in any prediction model.

Chapter 5
Interpreting Feature Interactions through Intermediate Hidden Layers

In the previous chapter (Chapter 4), feature interactions were found to separate at the first hidden layer under sparsity regularization. This was useful for identifying lower-order, more interpretable feature combinations rather than considering all features as interacting. In this chapter, we demonstrate that this interaction separation does not hold for intermediate hidden layers. We propose a framework, Neural Interaction Transparency (NIT), that disentangles feature interactions across all hidden layers to obtain a low-order and interpretable structure for each interaction. This is done through a novel regularizer that directly penalizes interaction order.
We show that disentangling interactions reduces a feedforward neural network to a generalized additive model with interactions, which can lead to transparent models that perform comparably to state-of-the-art models. NIT is also flexible and efficient; it can learn generalized additive models with maximum K-order interactions by training only O(1) models.

Figure 5.1: A demonstration of interaction entangling in a feedforward neural network. When trained on a simple dataset generated by x1x2 + x3x4 under common forms of regularization, the neural network tends to keep the interactions separated in the first hidden layer but entangles them in the second layer. In contrast, NIT (black and cyan) is able to fully disentangle the interactions. The meaning of entangling is detailed in §5.1.1.

5.1 Methodology

In order to motivate our proposed NIT framework, we first demonstrate that feedforward neural networks entangle interactions under sparsity regularization. Then, we propose the architecture and regularization framework of NIT to disentangle the interactions to be order K at maximum.

5.1.1 Interaction Entangling in Neural Networks

We define a neural network entangling interactions in the following way:

Definition 5 (Entangled Interactions). Let S be the set of all true feature interactions {I_i}_{i=1}^{|S|} in a dataset, and let h be any hidden unit in a feedforward neural network. Let h capture a feature interaction Î if and only if there exist nonzero weighted paths between h and each interacting feature and between h and the output y.
If h captures Î such that for two different interactions I_i, I_j ∈ S we have I_i ⊊ Î and I_j ⊊ Î, then the hidden unit entangles interactions, and correspondingly the neural network entangles interactions.

For example, suppose we have a dataset D that contains N samples and 4 features. Each label is generated by the function y = x1x2 + x3x4, where each feature x_i, i = 1, ..., 4, is i.i.d. with uniform distribution between −1 and 1. As a result, D contains two pairwise interactions: {1, 2} and {3, 4}. Consider a neural network f(·), which is trained on the dataset D to predict the label. If any hidden unit learns the interaction {1, 2, 3, 4}, then it has entangled interactions.

We desire a feedforward neural network that does not entangle interactions at any hidden layer so that each interaction is separated in additive form. Specifically, we aim to learn a function of the form:

f̃(x) = Σ_{i=1}^{R} g′_i(x_{I_i}) + Σ_{i=1}^{S} g_i(x_{u_i}),     (5.1)

where {I_i}_{i=1}^{R} is a set of interactions, {u_i}_{i=1}^{S} is a set of main effects, and g′(·) and g(·) can be any arbitrary functions of their respective inputs. For example, we would like our previous f(·) to be decomposed into a sum of two functions, e.g. g′_1(x_{1,2}) + g′_2(x_{3,4}), where both g′_1 and g′_2 perform multiplication. A model that learns the additive function in (5.1) is a generalized additive model with interactions.

Chapter 4 has shown that the weights to the first hidden layer are exceptional at detecting interactions in data using common weight regularization techniques like L_1 or L_2. Even when assuming that each hidden unit in the first layer modeled one interaction, interaction detection was still very accurate. These results lead to

Figure 5.2: An illustrative comparison between two simple feedforward networks trained on data with interactions {1, 2} and {3, 4}.
(a) A standard feedforward neural network, and (b) a desirable network architecture that separates the two interactions. All hidden units have ReLU activation, and y is a linear unit which can precede a sigmoid link function if classification is desired.

the question of whether neural networks automatically separate out interactions at all hidden layers, as in Figure 5.2b, when common regularization techniques are applied.

To test this hypothesis, we train 10 trials of ReLU-based feedforward networks of size 4-100-100-100-100-1 on dataset D with N = 3e4 at equal train/validation/test splits and different regularizations, which were tuned on the validation set. We then calculate the percentage of hidden units entangling interactions in the first and second hidden layers at different lower thresholds on the magnitudes of all weights (i.e., weight magnitudes below the threshold are zeroed); see Figure 5.1. We note that when calculating percentages, if a hidden unit does not learn any of the true interactions, even a superset of one, then that hidden unit is ignored. Consistent with the performance of NID (§4), the first hidden layer with L_1 or L_2 regularization is capable of keeping the interactions separated, but at the second hidden layer, nearly all hidden units entangle the interactions at every threshold when at least one of the interactions is modeled. Therefore, the regularization does very little to prevent the pairwise interactions from entangling within the neural network.

Figure 5.3: A version of the NIT model architecture. Here, NIT consists of B multi-layer network blocks above a common input dense layer. Appropriate regularization on the dense layer forces each block to model a single interaction or main effect. This model can equivalently be seen as a standard feedforward neural network with block diagonal weight matrices at intermediate layers.
Our understanding that feedforward networks entangle interactions even in the simple setting with two multiplicative interactions motivates the need to disentangle them.

5.1.2 Architecture Choice

In order to disentangle interactions, we first propose a feedforward network modification that has a dense input weight matrix followed by multiple network blocks, as depicted in Figure 5.3. Our choice of a dense input weight matrix follows the results of Chapter 4 that a sparsely regularized input weight matrix tends to automatically separate out different true feature interactions to different first-layer hidden units (Tsang et al., 2018a). Separate block networks at upper layers are used to force the representation learning of separate interactions to be disentangled from each other. The remaining challenge is to ensure that each block only learns one interaction or main effect to generate the desired GAM structure (5.1)¹. Note that in this model, the number of blocks, B, must be pre-specified. Two approaches to selecting B are either choosing it to be large and letting sparse regularization cancel unused blocks, or setting B to be small in case a small number of blocks is desired for human interpretation. In the experiments section (§5.2.5), we additionally discuss our results on an approach that does not require any pre-specification of network blocks. Instead, this approach attempts to separate interactions through every layer during training rather than rely on the forced separation of blocks.

5.1.3 Disentangling Regularization

As mentioned in the previous section (§5.1.2), each network block must learn one interaction or main effect. Formally, each input hidden unit h ∈ {h_i}_{i=1}^{p_1/B} to block b ∈ {b_i}_{i=1}^{B} must learn the same interaction I ∈ {I_i}_{i=1}^{R} or main effect u ∈ {u_i}_{i=1}^{S}, where R + S ≤ B. We propose to learn such a model by way of regularization that explicitly defines maximum allowable interaction orders.
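As the Figure 5.3 caption notes, the block architecture can equivalently be seen as a standard feedforward network whose intermediate weight matrices are block diagonal. A minimal numpy sketch of such a constraint follows; shapes, names, and the example sizes are illustrative, not the released implementation.

```python
import numpy as np

def block_diagonal_mask(n_out, n_in, num_blocks):
    """Binary mask that zeroes all weights crossing block boundaries,
    so that each of the num_blocks network blocks stays disentangled
    from the others. Assumes n_out and n_in divide by num_blocks."""
    mask = np.zeros((n_out, n_in))
    r, c = n_out // num_blocks, n_in // num_blocks
    for b in range(num_blocks):
        mask[b * r:(b + 1) * r, b * c:(b + 1) * c] = 1.0
    return mask

# Intermediate layer of a NIT-style network with B = 2 blocks:
# multiplying elementwise by the mask confines each hidden unit's
# incoming weights to its own block.
W = np.random.randn(4, 4)
W_blocked = W * block_diagonal_mask(4, 4, num_blocks=2)
```

Under this view, only the first (dense) weight matrix connects features to blocks, and the regularizer below decides which features each block may use.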
Fixing the maximum order to be 2 has been the standard in the construction of GAMs with pairwise interactions (Lou et al., 2013). The existence of interactions or main effects being modeled by first-layer hidden units is determined by the nonzero weights entering those units (Chapter 4). Therefore, we would like to penalize the number of nonzero elements in rows of the input weight matrix, as a group belonging to a block. The number of nonzero elements can be obtained by using L_0 regularization; however, it is known that L_0 is non-differentiable and cannot be used in gradient-based regularization.

¹It is possible that multiple blocks learn the same interaction or main effect, and some blocks can learn a subset of what other blocks learn. In such cases, the duplicate or redundant blocks can be combined. We leave the prevention of this type of learning for future work.

Recently, Louizos et al. (2018) developed a differentiable surrogate to the L_0 norm by smoothing the expected L_0 and using approximately binary gates z ∈ R^n to determine which parameters θ ∈ R^n to set to zero. The parameters θ of a hypothesis ĥ are then re-parameterized as separate parameters θ̃ and φ, such that for a dataset of N samples {(x_1, y_1), ..., (x_N, y_N)}, Empirical Risk Minimization becomes:

R(θ̃, φ) = (1/N) Σ_{i=1}^{N} L(ĥ(x_i; θ̃ ⊙ z(φ)), y_i) + λ Σ_{j=1}^{n} z_j(φ_j),     (5.2)

which follows the ideal reparameterization of θ into θ̃ and z as:

θ_j = θ̃_j z_j,   z_j ∈ {0, 1},   θ̃_j ≠ 0,   ||θ||_0 = Σ_{j=1}^{n} z_j.     (5.3)

We note that (5.2) is not in its final form for clarity and in fact uses distributions for the gates to enable differentiability and exact zeros in parameters; we refer interested readers to Louizos et al. (2018). We propose a disentangled group regularizer, denoted L_K, to disentangle feature interactions in the NIT framework. L_K is designed to be a group version of the smoothed L_0 regularization. Let Z, Φ ∈ R^{B×p} be matrix versions of the vectors z and φ from (5.2).
Let T: R^{B×p} → R^{p_1×p} assign the same gate to all first-layer hidden units in a block corresponding to a single feature, for all such groups of hidden units in every block. Just as θ̃_j ≠ 0 in (5.3), W̃^(1)_ij ≠ 0 for all i = 1, ..., p_1 and j = 1, ..., p. Then, the cost function for our NIT model f(·) has the form:

R_NIT = (1/N) Σ_{i=1}^{N} L(f(x_i; W̃^(1) ⊙ T(Z(Φ)), {W^(ℓ)}_{ℓ=2}^{L+1}, {b^(ℓ)}_{ℓ=1}^{L+1}), y_i) + L_K,     (5.4)

where L_K is our proposed disentangling regularizer.

Table 5.1: A comparison of the number of models needed to construct various forms of GAMs. GA^K M "all interactions" refers to constructing a GAM for every interaction of order ≤ K. MLP-Cutoff is an additive model of Multilayer Perceptrons, and η is the top number of interactions based on a learned cutoff.

Framework   GAM, original (Hastie, 2017)   GA^K M, all interactions   GA²M (Lou et al., 2013)   MLP-Cutoff (§4)   NIT (proposed)
# models    O(p)                           O(p^K)                     O(p²)                     O(η)              O(1)

Since Z is ≈1 when a feature is active in a block and ≈0 otherwise, the estimated interaction order of a block is defined as k̂_i = Σ_{j=1}^{p} Z_ij(Φ_ij) for all i = 1, 2, ..., B. Let B̃ = Σ_{i=1}^{B} 1(k̂_i ≠ 0). Then, we can conveniently learn generalized additive models with desirable properties by including two terms in our regularizer:

L_K = max(max_i k̂_i − K, 0) + (λ/B̃) Σ_{i=1}^{B} k̂_i,     (5.5)

where the first term limits the maximum interaction order to be K, and the second term encourages smaller interaction orders and block sparsity. The first term is responsible for penalizing the maximum interaction order during training to be a pre-specified positive integer K. A threshold at K is enforced by a rectifier (Glorot et al., 2011), which is used for its differentiability and sharp on/off switching behavior at K. The second term both penalizes the average non-zero interaction order over all blocks and sparsifies unused blocks.
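The value of the L_K regularizer in (5.5) can be sketched with numpy as below, given a gate matrix Z; in actual training the gates are differentiable hard-concrete samples (Louizos et al., 2018), which this simplified sketch omits.

```python
import numpy as np

def disentangling_regularizer(Z, K, lam):
    """L_K from (5.5). Z is a (B, p) matrix of approximately binary
    gate values; Z[i, j] ~ 1 iff feature j is active in block i.

    First term: rectified penalty on the maximum interaction order
    exceeding K. Second term: average nonzero interaction order,
    which also sparsifies unused blocks.
    """
    k_hat = Z.sum(axis=1)                 # estimated order per block
    B_tilde = np.count_nonzero(k_hat)     # number of active blocks
    order_penalty = max(k_hat.max() - K, 0.0)
    sparsity_penalty = lam / max(B_tilde, 1) * k_hat.sum()
    return order_penalty + sparsity_penalty
```

For example, with blocks of estimated orders 2, 3, and 0, setting K = 2 yields a rectified penalty of 1 from the order-3 block plus the averaged-order term, while K = 3 leaves only the averaged-order term.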
5.1.4 Advantages of the NIT Framework

Constructing generalized additive models by disentangling interactions has important advantages, the main one being that our NIT framework can learn the GAM in O(1) models, whereas previous methods either needed to learn a model for each interaction or, in the case of traditional univariate GAMs (Hastie, 2017), a model for each feature (Table 5.1). Our approach can construct the GAM quickly because it leverages gradient-based optimization to determine interaction separations.

In addition to efficiency, our approach is the first to investigate setting hard maximum thresholds on interaction order, for any order K. This is a straightforward result of our regularizer formulation.

Previous methods (Lou et al., 2013; Caruana et al., 2015) have focused on advocating tree-based GAMs to interpret and visualize what the model learns, and there exist few works which have explored neural network based GAMs with interactions for interpretability. While tree-based GAMs can provide interpretability, their visualizations can appear jagged since decision trees divide their feature space into axis-parallel rectangles, which may not be user-friendly (Lou et al., 2013). This kind of jagged visualization is not a problem for neural network-based GAMs, which as a result can produce smoother and more intuitive visualizations.

5.2 Experiments

5.2.1 Experiment Setup

We validate the efficacy of our method first on synthetic data and then on four real-world datasets. Our experiment on synthetic data is the baseline disentangling experiment discussed in §5.1.1. We use real-world datasets to evaluate the predictive performance of NIT (under restriction of maximum interaction order K) as compared to standard and relevant machine learning models: Linear/Logistic Regression (LR), GAM (Lou et al., 2012), GA²M (Lou et al., 2013), Random Forests (RF), and the Multilayer Perceptron (MLP).
We consider standard LR and GAM as models that do not learn interactions; GA²M learns up to pairwise interactions; and RF and MLP are full-complexity models (Lou et al., 2013) that can learn all interactions. After the performance evaluation, we visualize and examine interactions learned by our NIT from a medical dataset.

Table 5.2: Real-world datasets

Dataset          N      p     %Pos
Cal Housing      20640     8  -
Bike Sharing     17379    15  -
MIMIC-III        20922    40  14.02%
CIFAR-10 binary  12000  3072  50.0%

The real-world datasets (Table 5.2) include two regression datasets previously studied in statistical interaction research, Cal Housing (Pace & Barry, 1997) and Bike Sharing (Fanaee-T & Gama, 2014), and two binary classification datasets, MIMIC-III (Johnson et al., 2016) and CIFAR-10 binary. CIFAR-10 binary is a binary classification dataset (derived from CIFAR-10 (Krizhevsky & Hinton, 2009)) with two randomly selected classes, which are "cat" and "deer" in our experiments. We study binary classification as opposed to multi-class classification to minimize learned interaction orders and accord with previous research on GAMs (Lou et al., 2013). Root-mean squared error (RMSE) and area under the ROC curve (AUC) are used as the evaluation metrics for the regression and classification tasks.

In all of our NIT models, we set B = 20 and use an equal number of hidden units per network block for any given hidden layer. The hyperparameter λ in our disentangling regularizer (5.5) was found by running a grid search for validation performance on each fold of a 5-fold cross-validation². In our experiments, we do not assume K, so we report the performances of NIT when varying K. The learning rate was fixed at 5e-2 while the disentangling regularization was applied. For the hyperparameters of baselines and those not specific to NIT, tuning was done on the validation set. For all experiments with neural nets, we use the ADAM optimizer (Kingma & Ba, 2014) and early stopping on validation sets.

²For all real-world datasets except CIFAR-10 binary, 5-fold cross-validation was used, where model training was done on 3 folds, validation on the 4th fold, and testing on the 5th. For the CIFAR dataset, the standard test set was only used for testing, and an inner 5-fold cross-validation was done on an 80%-20% train-validation split.

5.2.2 Training the NIT Framework

Model training in NIT was conducted in two phases. The first was a disentangling phase where each block learned one interaction or main effect. The second phase kept L_K = 0 and Z fixed, so that Z acted as a mask that deactivated multiple features in each block. The second phase of training starts when the maximum interaction order across all blocks³ was ≤ K and the maximum interaction order of the disentangling phase stabilizes. We also reinitialized parameters between training phases in case optimization was stuck at local minima.

³With this criterion, K ≠ p, but the criterion can be changed to allow K = p.

5.2.3 Disentangling Experiment

We revisit the same function x1x2 + x3x4 that the MLP failed at disentangling (§5.1.1) and evaluate NIT instead. We train 10 trials of NIT with a 4-100-100-100-100-1 architecture like before (§5.1.1) and a grid search over K. In Figure 5.1 we show that NIT disentangles the x1x2 and x3x4 pairwise interactions at all possible lower weight thresholds while maintaining a performance (RMSE = 1.3e-3) similar to that of MLP. Note that the architecture choice of NIT (Figure 5.3) automatically disentangles interactions in the entire model when the first two hidden layers are disentangled.

Table 5.3: Predictive performance of NIT. RMSE is calculated from standard scaled outcome variables. *GA²M took several days to train and did not converge. Lower RMSE and higher AUC means better model performance.
Model   Cal Housing          Bike Sharing          MIMIC-III            CIFAR-10 binary
        K   RMSE             K   RMSE              K   AUC              K    AUC
LR      -   0.60 ± 0.016     -   0.78 ± 0.021      -   0.70 ± 0.013     -    0.676 ± 0.0072
GAM     -   0.506 ± 0.0078   -   0.55 ± 0.016      -   0.75 ± 0.015     -    0.829 ± 0.0014
GA²M    -   0.435 ± 0.0077   -   0.307 ± 0.0080    -   0.73 ± 0.012     -    *
NIT     2   0.448 ± 0.0080   2   0.31 ± 0.013      2   0.76 ± 0.011     10   0.849 ± 0.0049
        3   0.437 ± 0.0077   3   0.26 ± 0.015      4   0.76 ± 0.013     15   0.858 ± 0.0020
        4   0.43 ± 0.013     4   0.240 ± 0.0097    6   0.77 ± 0.011     20   0.860 ± 0.0034
RF      -   0.435 ± 0.0095   -   0.243 ± 0.0053    -   0.685 ± 0.0087   -    0.793 ± 0.0034
MLP     -   0.445 ± 0.0081   -   0.22 ± 0.012      -   0.771 ± 0.0096   -    0.860 ± 0.0046

Figure 5.4: Visualizations that provide a global (Ribeiro et al., 2016) and transparent (Tan et al., 2017) interpretation of NIT trained on the MIMIC-III dataset at K = 2. Outcome scores are interpreted as contribution to 30-day hospital readmission in the same way described by Caruana et al. (2015). The output bias of NIT is 0.21.

5.2.4 Real-World Dataset Experiments

On real-world datasets (Table 5.2), we evaluate the predictive performance of NIT at different levels of K, as shown in Table 5.3. For the Cal Housing, Bike Sharing, and MIMIC-III datasets, we choose K to be 2 first and increase it until NIT's predictive performance is similar to that of RF or MLP. For CIFAR-10 binary, we set K = 10, 15, 20 to demonstrate the capability of NIT to learn high-order interactions.
The exact statistics of learned interaction orders for all datasets are shown in Appendix B.2 in the supplementary materials. For all the datasets, the predictive performance of NIT is either comparable to MLP at low K, or comparable to GA²M at K = 2 and RF/MLP at higher values of K, as expected.

In Figure 5.4, we provide all visualizations of what NIT learns at K = 2 on one fold of the MIMIC-III dataset. MIMIC-III is currently the largest public health records dataset (Johnson et al., 2016), and our prediction task is classifying whether a patient will be re-admitted into an intensive care unit within 30 days. Since K = 2, all learned interactions are plotted as heatmaps as shown, and the remaining main effects are shown in the left six plots. We notice interesting patterns; for example, when a patient's minimum temperature rises to ≈40°C, the chance of readmission drops sharply. Another interesting pattern is in an interaction plot showing that as sofa score (which estimates mortality risk) increases, the chance of readmission decreases. We checked that these potentially unintuitive patterns are indeed consistent with those in the actual dataset by examining the frequency of readmission labels relative to temperature or sofa score. This insight may warrant further investigation by medical experts.

5.2.5 Disentangling Interactions Through All Layers

In addition to disentangling at the first weight matrix, we discuss results on modifying NIT to disentangle interactions through all layers' weight matrices. By doing this, we no longer require B network blocks nor the group L_0. Instead, L_0 is applied to each individual weight as done normally in Louizos et al. (2018). We now define layer-wise gate matrices Z^(1), Z^(2), ..., Z^(L) of gates for each weight in the corresponding matrices W^(1), W^(2), ..., W^(L).
The estimated interaction order $\hat{k}_i$ from (5.5) is now for each neuron $i$ in the last hidden layer, and $\hat{k}_i = \sum_{j=1}^{p} \big[\sigma\big(Z^{(L)} Z^{(L-1)} \cdots Z^{(1)}\big)\big]_{ij}$, where normalized matrix multiplications between the $Z^{(\ell)}$'s are taken. Here, $\sigma$ is a sigmoid-type function, $\sigma(Z') = \frac{Z'}{c + |Z'|}$, which approximates a function that sends all elements of $Z'$ greater than or equal to 1 to 1, and otherwise to 0 ($c$ is a hyperparameter satisfying $0 < c \ll 1$). Theoretical justification for the formulation of $\hat{k}_i$ is provided as follows.

Lemma 6 (Paths From Multiplying Gate Matrices). Let $f(\cdot)$ be a feedforward neural network. Assume that in general, weights in $W^{(\ell)}$ are nonzero $\forall \ell = 1, \ldots, L+1$. Let $Z^{(1)}, Z^{(2)}, \ldots, Z^{(L)}$ be masks for corresponding weight matrices $W^{(1)}, W^{(2)}, \ldots, W^{(L)}$, where the elements of each mask are binary $\{0, 1\}$. Let $\tilde{Z} \in \mathbb{R}^{p_L \times p}$ be given by the matrix multiplications $Z^{(L)} Z^{(L-1)} \cdots Z^{(1)}$. Then a nonzero value of $\tilde{Z}_{ij}$ indicates that there is a nonzero weighted path from feature $j$ to neuron $i$ in the $L$-th hidden layer, and a zero value of $\tilde{Z}_{ij}$ indicates there is no such path.

Proof. In the case that $f(\cdot)$ has a single hidden layer ($L = 1$), $\tilde{Z} = Z^{(1)}$ directly gives the zero and nonzero paths from features to the $L$-th hidden layer. In cases where $f(\cdot)$ has more than one hidden layer, first consider the weight connectivity between input features and the second hidden layer.
Since a feedforward neural network is a directed acyclic graph where a hop transitions from one layer to the next, the connectivity from input features to the second hidden layer can be viewed as two hops, or two applications of an adjacency matrix $A$ comprising $Z^{(1)}$ and $Z^{(2)}$:

$A = \begin{pmatrix} 0 & Z^{(1)\top} & 0 \\ 0 & 0 & Z^{(2)\top} \\ 0 & 0 & 0 \end{pmatrix}.$

Therefore, the adjacency matrix for two hops is:

$A^2 = \begin{pmatrix} 0 & 0 & \big(Z^{(2)} Z^{(1)}\big)^{\top} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$

Since the elements of $A^2$ are the numbers of paths between graph vertices in two hops, the nonzero elements of $Z^{(2)} Z^{(1)}$ represent the existence of paths from features to the second hidden layer, and the zero elements represent the lack of such paths. Therefore, hops can be repeatedly added up to the $L$-th hidden layer, yielding $Z^{(L)} Z^{(L-1)} \cdots Z^{(1)}$ to represent the zero and nonzero paths from features to the neurons in the $L$-th layer.

Disentangling interactions through all layers can perform well in regression tasks. When we let this approach discover the max interaction order by setting $K = 0$ for $L_K$ in (5.5) and $c = 10^{-2}$, NIT is able to reach 0.43 RMSE at max order 3 for Cal Housing, and 0.26 RMSE at max order 6 for Bike Sharing, without re-initializing models. Now, without network blocks, NIT architectures are smaller than before (Appendix B.1), i.e. 8-200-200-1 for Cal Housing and 15-300-200-100-1 for Bike Sharing.

5.2.6 Limitations

Although NIT can learn a GAM with interactions in $O(1)$ models, the disentangling phase in our training optimization can take longer with increasing $p$ or $B$. In addition, if $K$ is not pre-specified, a search for an optimal $K$ for peak predictive performance can be a slow process when testing each $K$. Finally, since the optimization is non-convex, there is no guarantee that correct interactions are learned.
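Before moving on, the path-counting argument of Lemma 6 and the estimator $\hat{k}_i$ from §5.2.5 can be illustrated numerically. The sketch below multiplies toy binary gate masks and applies the saturating $\sigma$; the function names and mask values are our own toy choices, not part of the NIT implementation.

```python
def matmul(A, B):
    """Plain matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def saturate(z, c=0.01):
    """sigma(z) = z / (c + |z|): ~1 for path counts >= 1, exactly 0 for 0."""
    return z / (c + abs(z))

# Toy binary masks: 3 input features -> 2 hidden units -> 2 hidden units.
Z1 = [[1, 1, 0],   # first-layer unit 0 connects to features 0 and 1
      [0, 0, 1]]   # first-layer unit 1 connects to feature 2
Z2 = [[1, 1],      # last-layer unit 0 reaches both first-layer units
      [0, 1]]      # last-layer unit 1 reaches only unit 1

Z_tilde = matmul(Z2, Z1)   # entry (i, j): number of paths feature j -> neuron i
k_hat = [sum(saturate(v) for v in row) for row in Z_tilde]  # ~ interaction order
print(Z_tilde)  # -> [[1, 1, 1], [0, 0, 1]]
print(k_hat)    # approximately [3, 1]
```

As Lemma 6 predicts, last-layer unit 0 has paths to all three features (estimated order ≈ 3), while unit 1 reaches only feature 2 (order ≈ 1).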
5.3 Broader Impact

The practical value of Neural Interaction Transparency (NIT) is that it can efficiently learn generalized additive models with interactions in the form of transparent neural networks for tabular data, thereby improving the accessibility of generalized additive models. Neural networks are effective at interpolating between data points, so NIT can reveal interaction patterns that are otherwise difficult to understand when generalized additive models are based on decision trees. The transparency of neural networks via NIT is useful for validating neural network behavior. While NIT may present an attractive option for explaining patterns in data, future work is still needed to accurately recover such data patterns with respect to ground truth.

Chapter 6
Detecting Feature Interactions in Black-Box Models and Applications

In previous chapters, we proposed methods to interpret feature interactions learned in multilayer perceptron architectures. This chapter focuses on interpreting feature interactions learned by general black-box prediction models, which to the best of our knowledge has not been studied previously. The ability to explain black-box predictions has strong merits in providing new insights into any prediction model. We will be leveraging our Neural Interaction Detection (NID) method from Chapter 4 for this task. In order to explain general black-box models, we detect feature interactions local to individual data instances. Through experiments, we showcase new insights brought by feature interaction interpretability across diverse application domains like image, text, and DNA modeling.

[Figure 6.1 diagram: an interaction detector and a linear model identify which features are interacting.]

Figure 6.1: A simplified overview. (1) MADEX uses interaction detection and LIME (Ribeiro et al., 2016) together to interpret feature interactions learned by a source black-box model at a data instance, denoted by the large green plus sign.
(2) GLIDER identifies interactions that consistently appear over multiple data samples, then explicitly encodes these interactions in a target black-box recommender model $f_{\text{rec}}$.

We then focus on a specific application of increasing real-world importance: transparency in ad-targeting. Here, we can detect feature interactions across data instances, also known as global feature interactions. We discuss how these global feature interactions not only explain online ad targeting behavior, but also have high commercial utility in automatic feature engineering. Our experiments on explaining real-world recommender systems show the efficacy of both our interpretation and feature engineering methods.

Notations. Let $f(\cdot): \mathbb{R}^d \to \mathbb{R}$ be a black-box model, where $d$ is the number of input features. We use $f$ to represent $f(\cdot)$. For classification tasks, $f$ is assumed to be a class logit.

6.1 Methodology

In this section, we discuss our black-box interaction explanation approach, Model-Agnostic Dependency Explainer (MADEX), in §6.1.1. We then develop MADEX to be useful for real-world recommendation problems. Specifically, we propose Global Interaction Detection and Encoding for Recommendation (GLIDER) in §6.1.2, which detects and explicitly encodes feature interaction interpretations in black-box recommender models. Figure 6.1 overviews both MADEX and GLIDER.

6.1.1 MADEX: Model-Agnostic Dependency Explainer

We start by explaining how to obtain a data-instance-level (local) interpretation of feature interactions by utilizing interaction detection on feature perturbations.

6.1.1.1 Feature Perturbation and Inference

Given a data instance $x \in \mathbb{R}^p$, LIME proposed to perturb the data instance by sampling a separate binary representation $\tilde{x} \in \{0, 1\}^d$ of the same data instance. Let $\xi: \{0, 1\}^d \to \mathbb{R}^p$ be the map from the binary representation to the perturbed data instance.
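As a toy illustration (our own helper names; a sketch for the simple case $d = p$), one such mapping $\xi$ can be written as a small closure over the instance and its per-feature "off" substitutes:

```python
def make_xi(x, off_values):
    """A sketch of xi: {0,1}^d -> R^p for one instance x (here d = p).

    off_values[j] stands in for feature j's 'off' state, e.g. a zero
    embedding for a categorical feature or a batch mean for a numerical one.
    """
    def xi(x_tilde):
        return [xj if b == 1 else off
                for b, xj, off in zip(x_tilde, x, off_values)]
    return xi

xi = make_xi([5.0, 3.0, 8.0], off_values=[0.0, 0.0, 0.0])
print(xi([1, 0, 1]))  # -> [5.0, 0.0, 8.0]
```

The all-ones binary vector recovers the original instance, and each zeroed bit swaps in that feature's "off" value.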
Starting from a binary vector of all ones, which maps to the original feature values in the data instance, LIME uniformly samples the number of random features to switch to 0, the "off" state. In the data instance, "off" could correspond to a zero embedding vector for categorical features or the mean value over a batch for numerical features. It is possible for $d < p$ by grouping features in the data instance to correspond to single binary features in $\tilde{x}$. An important step is getting black-box predictions of the perturbed data instances to create a dataset with binary inputs and prediction targets: $\mathcal{D} = \{(\tilde{x}_i, y_i) \mid y_i = f(\xi(\tilde{x}_i)),\ \tilde{x}_i \in \{0, 1\}^d\}$. Though we use LIME's approach, the next section is agnostic to the instance perturbation method.

6.1.1.2 Feature Interaction Detection

Feature interaction detection is concerned with identifying feature interactions in a dataset (Bien et al., 2013; Purushotham et al., 2014; Lou et al., 2013; Friedman & Popescu, 2008). Typically, proper interaction detection requires a preprocessing step to remove correlated features that adversely affect detection performance (Sorokina et al., 2008). As long as features in dataset $\mathcal{D}$ are generated in an uncorrelated fashion, e.g. through random sampling, we can directly use $\mathcal{D}$ to detect feature interactions from black-box model $f$ at data instance $x$.

Neural Interaction Detection. $f$ can be an arbitrary function and can generate highly nonlinear targets in $\mathcal{D}$, so we focus on detecting interactions that could have generic forms. In light of this, we leverage our accurate and efficient interaction detection method, Neural Interaction Detection (NID), from Chapter 4. As a recap, NID detects interactions by training a lasso-regularized multilayer perceptron (MLP) on a dataset, then identifying the features that have high-magnitude weights to common hidden units.
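A simplified pairwise sketch of this weight-based scoring idea is shown below; the full NID criterion from Chapter 4 handles arbitrary orders and aggregates outgoing weights through later layers, whereas here `w_agg` is a toy stand-in for each hidden unit's outgoing-weight magnitude (function and variable names are our own).

```python
def nid_pairwise_strengths(W1, w_agg):
    """Toy pairwise NID-style scores: two features interact if they both
    send large-magnitude weights into a common hidden unit.

    W1[h][j]: first-layer weight from feature j to hidden unit h.
    w_agg[h]: aggregate outgoing-weight magnitude of unit h.
    """
    d = len(W1[0])
    scores = {}
    for i in range(d):
        for j in range(i + 1, d):
            scores[(i, j)] = sum(
                w_agg[h] * min(abs(row[i]), abs(row[j]))  # shared-unit strength
                for h, row in enumerate(W1))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy first-layer weights: unit 0 mixes features 0 and 1; unit 1 sees only 2.
W1 = [[2.0, 2.0, 0.0],
      [0.0, 0.0, 1.0]]
ranking = nid_pairwise_strengths(W1, w_agg=[1.0, 1.0])
print(ranking[0][0])  # -> (0, 1)
```

The `min` of absolute weights captures that an interaction needs every participating feature to reach the shared unit with non-negligible weight.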
NID is efficient by greedily testing the top interaction candidates of every order at each of $h$ first-layer hidden units, enabling arbitrary-order interaction detection in $O(hd)$ tests within one MLP.

Gradient-based Neural Interaction Detection. As stated in Chapter 2 (§2.2.1), a gradient definition of statistical interaction exists based on mixed partial derivatives (Friedman & Popescu, 2008), i.e. a function $F(\cdot)$ exhibits interaction $\mathcal{I}$ of all features $z_i$ indexed by $i_1, i_2, \ldots, i_{|\mathcal{I}|} \in \mathcal{I}$ if

$\mathbb{E}_z \left[ \left( \frac{\partial^{|\mathcal{I}|} F(z)}{\partial z_{i_1} \partial z_{i_2} \cdots \partial z_{i_{|\mathcal{I}|}}} \right)^2 \right] > 0.$

The advantage of this definition is that it allows exact interaction detection from model gradients (Ai & Norton, 2003); however, this definition contains a computationally expensive expectation, and typical neural networks with ReLU activation functions do not permit mixed partial derivatives. For the task of local interpretation, we only examine a single data instance $x$, which avoids the expectation. We turn $F$ into an MLP $g(\cdot)$ with smooth, infinitely differentiable activation functions such as softplus, which closely follows ReLU (Glorot et al., 2011). We then train the MLP with the same purpose as §6.1.1.2 to faithfully capture interactions in perturbation dataset $\mathcal{D}$. Given these conditions, we define an alternate gradient-based neural interaction detector (GradientNID) as:

$\omega(\mathcal{I}) = \left( \frac{\partial^{|\mathcal{I}|} g(\tilde{x})}{\partial \tilde{x}_{i_1} \partial \tilde{x}_{i_2} \cdots \partial \tilde{x}_{i_{|\mathcal{I}|}}} \right)^2,$

where $\omega$ is the strength of the interaction $\mathcal{I}$, $\tilde{x}$ is the representation of $x$, and the MLP $g$ is trained on $\mathcal{D}$. While GradientNID exactly detects interactions from the explainer MLP, it needs to compute interaction strengths $\omega$ for feature combinations that grow exponentially in number as $|\mathcal{I}|$ increases. We recommend restricting GradientNID to low-order interactions.

6.1.1.3 Scope

Based on the preceding discussions, we define a function, MADEX($f$, $x$), that takes as inputs black-box $f$ and data instance $x$, and outputs $\mathcal{S} = \{\mathcal{I}_i\}_{i=1}^{k}$, a set of top-$k$ detected feature interactions.
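For intuition, the GradientNID score $\omega(\mathcal{I})$ above can be approximated numerically without autodiff machinery. Below is a toy sketch for pairwise interactions using central finite differences on a smooth stand-in for the trained softplus explainer MLP $g$ (the toy $g$ and all names are our own, not the actual trained model).

```python
def mixed_partial(g, x, i, j, h=1e-4):
    """Central finite-difference estimate of d^2 g / dx_i dx_j at x."""
    def shift(di, dj):
        z = list(x)
        z[i] += di * h
        z[j] += dj * h
        return g(z)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * h * h)

def gradient_nid_pairwise(g, x):
    """omega({i,j}) = (mixed partial)^2 for every feature pair, ranked."""
    d = len(x)
    scores = {(i, j): mixed_partial(g, x, i, j) ** 2
              for i in range(d) for j in range(i + 1, d)}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy smooth surrogate standing in for the softplus explainer MLP g:
g = lambda z: z[0] * z[1] + 0.5 * z[2] ** 2 + z[2]
ranking = gradient_nid_pairwise(g, [1.0, 1.0, 1.0])
print(ranking[0][0])  # -> (0, 1): the multiplicative pair dominates
```

The exponential blow-up mentioned above is visible even here: extending the dictionary comprehension to order $|\mathcal{I}|$ requires iterating over all $\binom{d}{|\mathcal{I}|}$ index sets.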
MADEX stands for "Model-Agnostic Dependency Explainer". In some cases, it is necessary to identify a $k$ threshold. Because of the importance of speed for local interpretations, we simply use a linear regression with additional multiplicative terms to approximate the gains given by interactions in $\mathcal{S}$, where $k$ starts at 0 and is incremented until the linear model's predictions stop improving.

6.1.2 GLIDER: Global Interaction Detection and Encoding for Recommendation

This section discusses global interaction detection and encoding for one domain of MADEX: models for tabular data. Without loss of generality, GLIDER focuses on recommender systems, which are interesting because they have pervasive application in real-world systems, and their features are often very sparse. A sparse feature is one with many categories, e.g. millions of user IDs. The sparsity makes interaction detection challenging, especially when applied directly on raw data, because the one-hot encoding of sparse features creates an extremely large space of potential feature combinations (Fan et al., 2015).

A recommender system, $f_{\text{rec}}(\cdot)$, is a model of two feature types: dense numerical features and sparse categorical features. Since the one-hot encoding of a categorical feature $x_c$ can be high-dimensional, it is commonly represented in a low-dimensional embedding $e_c = \text{one\_hot}(x_c)\, v_c$ via embedding matrix $v_c$.

With recommendation system research becoming a mature field, it has become increasingly difficult to obtain prediction performance gains. Here we discuss how MADEX can be leveraged to improve feature interaction representations for recommendations in addition to providing global interaction interpretations.

6.1.2.1 Global Interaction Detection

In this section, we explain the first step of GLIDER. As defined in §6.1.1.3, MADEX takes as input a black-box model $f$ and data instance $x$.
In the context of this section, MADEX inputs a source recommender system $f_{\text{rec}}$ and data instance $x = [x_1, x_2, \ldots, x_p]$. Here, $x_i$ is the $i$-th feature field and is either a dense or sparse feature, and $p$ is both the total number of feature fields and the number of perturbation variables ($p = d$). We define global interaction detection as repeatedly running MADEX over a batch of data instances, then counting the occurrences of the same detected interactions, shown in Algorithm 2. The occurrence counts are not only a useful way to rank global interaction detections, but also a sanity check to rule out the chance that the detected feature combinations are random selections.

Algorithm 2: Global Interaction Detection in GLIDER
Input: dataset $\mathcal{B}$, recommender model $f_{\text{rec}}$
Output: $\mathcal{G} = \{(\mathcal{I}_i, c_i)\}$: global interactions $\mathcal{I}_i$ and their counts $c_i$ over the dataset
1: $\mathcal{G} \leftarrow$ initialize occurrence dictionary for global interactions
2: for each data sample $x$ within dataset $\mathcal{B}$ do
3:   $\mathcal{S} \leftarrow$ MADEX($f_{\text{rec}}$, $x$)
4:   $\mathcal{G} \leftarrow$ increment the occurrence count of $\mathcal{I}_j \in \mathcal{S}$, $\forall j = 1, 2, \ldots, |\mathcal{S}|$
5: sort $\mathcal{G}$ by most frequently occurring interactions
6: [optional] prune subset interactions in $\mathcal{G}$ within a target number of interactions $K$

One potential concern with Algorithm 2 is that it could be slow depending on the speed of MADEX. In our experiments, the entire process took less than one hour when run in parallel over a batch of 1000 samples with ∼40 features on a 32-CPU server with 2 GPUs. This algorithm only needs to be run once to obtain the summary of global interactions.

6.1.2.2 Truncated Feature Crosses

The global interaction $\mathcal{I}_i$, outputted by Algorithm 2, is used to create a synthetic feature $x_{\mathcal{I}_i}$ for a target recommender system. The synthetic feature $x_{\mathcal{I}_i}$ is created by explicitly crossing sparse features indexed in $\mathcal{I}_i$. If interaction $\mathcal{I}_i$ involves dense features, we bucketize the dense features before crossing them.
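A minimal sketch of this bucketize-and-cross construction is shown below (toy data and our own helper names; the frequency threshold corresponds to $T$ introduced shortly, and infrequent combinations collapse to a default ID):

```python
from collections import Counter

def bucketize(value, boundaries):
    """Map a dense value to a bucket ID in 0..len(boundaries)."""
    return sum(value > b for b in boundaries)

def truncated_cross(rows, interaction, min_count, default_id=0):
    """Sparse IDs for a cross feature, keeping only frequent combinations.

    rows: list of per-sample feature tuples; interaction: indices to cross.
    Combinations seen at most min_count times collapse to default_id.
    """
    combos = [tuple(r[i] for i in interaction) for r in rows]
    counts = Counter(combos)
    vocab = {}
    for combo, n in counts.items():
        if n > min_count:
            vocab[combo] = len(vocab) + 1   # IDs 1..V; 0 is the default ID
    return [vocab.get(c, default_id) for c in combos]

rows = [("u1", "appA"), ("u1", "appA"), ("u1", "appA"), ("u2", "appB")]
print(truncated_cross(rows, interaction=(0, 1), min_count=2))  # -> [1, 1, 1, 0]
```

Keeping only seen, frequent combinations is what keeps the cross feature's embedding table small relative to the full Cartesian product.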
The synthetic feature is sometimes called a cross feature (Wang et al., 2017; Luo et al., 2019) or conjunction feature (Rosales et al., 2012; Chapelle et al., 2015). In this context, a cross feature is an $n$-ary Cartesian product among $n$ sparse features. If we denote $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n$ as the sets of IDs for the respective features $x_1, x_2, \ldots, x_n$, then their cross feature $x_{\{1,\ldots,n\}}$ takes on all possible values in

$\mathcal{X}_1 \times \cdots \times \mathcal{X}_n = \{(x_1, \ldots, x_n) \mid x_i \in \mathcal{X}_i,\ \forall i = 1, \ldots, n\}.$

Accordingly, the cardinality of this cross feature is $|\mathcal{X}_1| \times \cdots \times |\mathcal{X}_n|$ and can be extremely large, yet many combinations of values in the cross feature are likely unseen in the training data. Therefore, we generate a truncated form of the cross feature with only seen combinations of values, $x_{\mathcal{I}}^{(j)}$, where $j$ is a sample index in the training data, and $x_{\mathcal{I}}^{(j)}$ is represented as a sparse ID in the cross feature $x_{\mathcal{I}}$. We further reduce the cardinality by requiring the same cross feature ID to occur more than $T$ times in a batch of samples, or setting it to a default ID otherwise. These truncation steps significantly reduce the embedding sizes of each cross feature while maintaining their representation power. Once cross features $\{x_{\mathcal{I}_i}\}_i$ are included in a target recommender system, it can be trained as per usual.

6.1.2.3 Model Distillation vs. Enhancement

There are dual perspectives of GLIDER: as a method for model distillation or model enhancement. If a strong source model is used to detect global interactions which are then encoded in more resource-constrained target models, then GLIDER adopts a teacher-student type distillation process. If interaction encoding augments the same model where the interactions were detected from, then GLIDER tries to enhance the model's ability to represent the interactions.

6.2 Experiments

6.2.1 Experiment Setup

In our experiments, we study interaction interpretation and encoding on real-world data. The hyperparameters in MADEX are as follows.
For all experiments, our perturbation datasets $\mathcal{D}$ contain 5000 training samples and 500 samples each for validation and testing. Our usage of NID or GradientNID as the interaction detector (§6.1.1.2) depends on the experimental setting. For all experiments that only examine single data instances, we use GradientNID for its exactness and pairwise interaction detection; otherwise, we use NID for its higher-order interaction detection. The MLPs for NID and GradientNID have architectures of 256-128-64 first-to-last hidden layer sizes, and they are trained with a learning rate of 1e−2, a batch size of 100, and the Adam optimizer. NID uses ReLU activations and an $\ell_1$ regularization of $\lambda_1 = 1e{-}4$, whereas GradientNID uses softplus activations and a structural regularizer in the form of MLP + linear regression, which we found offers strong test performance. In general, models are trained with early stopping on validation sets.

6.2.2 MADEX Experiments

This section provides quantitative and qualitative results for MADEX interpretations of black-box models. Four different models and domains are investigated: ResNet152 – an image classifier pretrained on ImageNet '14 (Russakovsky et al., 2015; He et al., 2016), Sentiment-LSTM – a bi-directional multilayer sentiment analyzer trained on movie reviews from the Stanford Sentiment Treebank (SST) (Socher et al., 2013; Tai et al., 2015), DNA-CNN – a 2-layer 1D convolutional neural network (CNN) trained on DNA-protein binding data (Mordelet et al., 2013; Yang et al., 2013; Alipanahi et al., 2015; Zeng et al., 2016; Wang et al., 2018), and GCN – a 3-layer Graph Convolutional Network trained on citation network data from the Cora dataset (Kipf & Welling, 2016; Sen et al., 2008).

Table 6.1: Prediction performance (mean-squared error; lower is better) with (k > 0) and without (k = 0) interactions for random data instances in the test sets of respective black-box models. k = L corresponds to the interactions at a rank threshold. 2 ≤ k < L are excluded because not all instances have 2 or more interactions. Only results with detected interactions are shown. At least 94% (≥ 188) of the data instances had interactions across 5 trials for each model and score statistic.

                 k   DNA-CNN           Sentiment-LSTM    ResNet152     GCN
linear LIME      0   10e−3  ± 1e−3     8.0e−2 ± 6e−3     1.9 ± 0.1     7.1e3 ± 7e2
MADEX (ours)     1   8e−3   ± 2e−3     3.8e−2 ± 6e−3     1.7 ± 0.1     5.7e3 ± 7e2
MADEX (ours)     L   5.0e−3 ± 8e−4     0.4e−2 ± 3e−3     0.9 ± 0.2     2e3   ± 1e3

In order to make informative comparisons to the linear LIME baseline, we use LIME's sample weighting strategy and kernel size (0.25) in this section. We first provide quantitative validation for the detected interactions of all four models in §6.2.2.1, followed by qualitative results for ResNet152, Sentiment-LSTM, and DNA-CNN in §6.2.2.2.

6.2.2.1 Quantitative

To quantitatively validate our interaction interpretations of general black-box models, we measure the local explanation fidelity of the interactions via prediction performance. As suggested in §6.1.1.3 and §6.1.2.2, encoding feature interactions is a way to increase a model's function representation, but this also means that prediction performance gains over simpler first-order models (e.g. linear regression) are a way to test the significance of the detected interactions. In this section, we use neural network function approximators for each top interaction from the ranking $\{\mathcal{I}_i\}$ given by MADEX's interaction detector (in this case, NID).
Similar to the k-thresholding description in §6.1.1.3, we start at $k = 0$, which is a linear regression, then increment $k$ with added MLPs for each $\mathcal{I}_i$ among $\{\mathcal{I}_i\}_{i=1}^{k}$ until validation performance stops improving, denoted at $k = L$. The MLPs all have architectures of 64-32-16 first-to-last hidden layer sizes and use the binary perturbation dataset $\mathcal{D}$ (from §6.1.1.1). Test prediction performances are shown in Table 6.1 for $k \in \{0, 1, L\}$. The average number of features of $\mathcal{D}$ among the black-box models ranges from 18 to 112. Our quantitative validation shows that adding feature interactions for DNA-CNN, Sentiment-LSTM, and ResNet152, and adding node interactions for GCN, results in significant performance gains when averaged over 40 randomly selected data instances in the test set.

6.2.2.2 Qualitative

For our qualitative analysis, we provide interaction interpretations via MADEX(·) of ResNet152, Sentiment-LSTM, and DNA-CNN on test samples. The interpretations are given by $\mathcal{S} = \{\mathcal{I}_i\}_{i=1}^{k}$, a set of $k$ detected interactions, which are shown in Figure 6.2 for ResNet152 and Sentiment-LSTM. For reference, we also show the top "main effects" by LIME's original linear regression, which select the top-5 features that attribute towards the predicted class.¹

¹ Based on official code: https://github.com/marcotcr/lime

[Figure 6.2a: ResNet152 interpretations. For each image (top predictions: "hammerhead, hammerhead shark", "viaduct", "Brittany spaniel", "trolleybus, trolley coach, trackless trolley"), columns show the original image, LIME's main effects, and MADEX's interactions I1 and I2.]

(b) Sentiment-LSTM interpretations:

Original sentence                       prediction   LIME selection   MADEX I1         MADEX I2
It never fails to engage us.            pos.         never, us        never, fails
The movie makes absolutely no sense.    neg.         no, sense        absolutely, no   no, sense
The central story lacks punch.          neg.         lacks            story, lacks     lacks, punch

Figure 6.2: Qualitative examples (more in Appendix C.4 & C.5).

In Figure 6.2a, the "interaction" columns show selected features from MADEX's interactions between Quickshift superpixels (Vedaldi & Soatto, 2008; Ribeiro et al., 2016). To reduce the number of interactions per image, we merged interactions that have overlap coefficient ≥ 0.5 (Vijaymeena & Kavitha, 2016). From the figure, we see that the interactions form a single region or multiple regions of the image. They also tend to be complementary to LIME's main effects and are sometimes more informative. For example, the interpretations of the "shark" classification show that interaction detection finds the shark fin whereas main effects do not. Interpretations of Sentiment-LSTM are shown in Figure 6.2b, excluding common stop words (Appendix C.3). We again see the value of MADEX's interactions, which show salient combinations of words, such as "never, fails", "absolutely, no", and "lacks, punch".

Table 6.2: Global explanation of a sentiment analyzer

Count (Total: 40)   Interaction (ordered)
36                  never, fails
30                  suspend, disbelief
30                  too, bad
29                  very, funny
29                  neither, nor
28                  not, miss
27                  recent, memory
27                  not, good
26                  no, denying
25                  not, bad

In our experiments on DNA-CNN, we consistently detected the interaction between "CACGTG" nucleotides, which form a canonical DNA sequence (Staiger et al., 1989). The interaction was detected in 97.3% of the 187 CACGTG appearances in the test set.

In order to run consistency experiments on Sentiment-LSTM, word interactions need to be detected consistently across different sentences, which naïvely would require an exorbitant number of sentences. Instead, we initially collect interaction candidates by running MADEX over all sentences in the SST test set, then select the word interactions that appear multiple times.
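Counting an interaction candidate in a sentence then reduces to an ordered (subsequence) containment test; a minimal sketch (our own helper, using the standard consume-an-iterator idiom):

```python
def contains_ordered(sentence_words, interaction):
    """True if the interaction's words all appear in order (gaps allowed)."""
    remaining = iter(sentence_words)
    # 'word in remaining' consumes the iterator up to the first match,
    # so each subsequent word must occur after the previous one.
    return all(word in remaining for word in interaction)

print(contains_ordered("this movie is not very good".split(), ("not", "good")))  # -> True
print(contains_ordered("good , not bad".split(), ("not", "good")))               # -> False
```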
We assume that word interactions are ordered but not necessarily adjacent or positionally bound, e.g. (not, good) ≠ (good, not), but their exact positions don't matter. We use the larger IMDB dataset (Maas et al., 2011) to collect different sets of sentences that contain the same ordered words as each interaction candidate (but the sentences are otherwise random). The ranked detection counts of the target interactions on their individual sets of sentences are shown in Table 6.2. The average sentence length is 33 words, and interaction occurrences are separated by 2 words on average.

6.2.3 GLIDER Experiments

This section provides experiments with GLIDER on models trained for click-through-rate (CTR) prediction. The recommender models under investigation include commonly reported baselines, which all use neural networks: Wide&Deep (Cheng et al., 2016), DeepFM (Guo et al., 2017), Deep&Cross (Wang et al., 2017), xDeepFM (Lian et al., 2018), and AutoInt (Song et al., 2018). AutoInt is the reported state-of-the-art in academic literature, so we use the model settings and data splits provided by AutoInt's official public repository.² For all other recommender models, we use public implementations³ with the same original architectures reported in the literature, set all embedding sizes to 16, and tune the learning rate and optimizer to reach or surpass the test logloss reported by the AutoInt paper (on AutoInt's data splits). From tuning, we use the Adagrad optimizer (Duchi et al., 2011) with a learning rate of 0.01. All models use early stopping on validation sets.

Table 6.3: CTR dataset statistics

Dataset   # Samples    # Features   Total # Sparse IDs
Criteo    45,840,617   39           998,960
Avazu     40,428,967   23           1,544,428

The datasets used are benchmark CTR datasets with the largest number of features: Criteo⁴ and Avazu⁵, whose data statistics are shown in Table 6.3.
Criteo and Avazu both contain 40+ million user records on clicking ads, with Criteo being the primary benchmark in CTR research (Cheng et al., 2016; Guo et al., 2017; Wang et al., 2017; Lian et al., 2018; Song et al., 2018; Luo et al., 2019).

² https://github.com/shichence/AutoInt
³ https://github.com/shenweichen/DeepCTR
⁴ https://www.kaggle.com/c/criteo-display-ad-challenge
⁵ https://www.kaggle.com/c/avazu-ctr-prediction

Figure 6.3: Occurrence counts (Total: 1000) vs. rank of detected interactions from AutoInt on Criteo and Avazu datasets. * indicates a higher-order interaction (details in Appendix C.7).

Table 6.4: Detected global interactions from the AutoInt baseline on Avazu data. "C14" is an anonymous feature.

Count (Total: 1000)   Interaction
525                   {device_ip, hour}
235                   {device_id, device_ip, hour}
217                   {device_id, app_id}
203                   {device_ip, device_model, hour}
194                   {site_id, site_domain}
190                   {site_id, hour}
187                   {device_ip, site_id, hour}
183                   {site_id, site_domain, hour}
179                   {device_id, hour}
179                   {device_id, device_ip, device_model, hour}

Global Interaction Detection. For each dataset, we train a source AutoInt model, $f_{\text{rec}}$, then run global interaction detection via Algorithm 2 on a batch of 1000 samples from the validation set. A full global detection experiment finishes in less than one hour when run in parallel on either the Criteo or Avazu dataset on a 32-CPU Intel Xeon E5-2640 v2 @ 2.00GHz server with 2 Nvidia 1080 Ti GPUs. The detection results across datasets are shown in Figure 6.3 as plots of detection counts versus rank. Because the Avazu dataset contains non-anonymized features, we directly show its top-10 detected global interactions in Table 6.4. From Figure 6.3, we see that the same interactions are detected very frequently across data instances, and many of the interactions are higher-order interactions.
The interaction counts are very significant. For example, any top-1 occurrence count > 25 is significant for the Criteo dataset (p < 0.05), and likewise > 71 for the Avazu dataset, assuming a conservative search space of only up to 3-way interactions ($|\mathcal{I}| \leq 3$). Our top-1 occurrence counts are 691 (≫ 25) for Criteo and 525 (≫ 71) for Avazu.

In Table 6.4, the top interactions are explainable. For example, the interaction between "device_ip" and "hour" (in UTC time) makes sense because users - here identified by IP addresses - have ad-click behaviors dependent on their time zones. This is a general theme with many of the top interactions.⁶ As another example, the interaction between "device_id" and "app_id" makes sense because ads are targeted to users based on the app they're in.

Table 6.5: Test prediction performance by encoding top-K global interactions in baseline recommender systems on the Criteo and Avazu datasets (5 trials). K are 40 and 10 for Criteo and Avazu, respectively. "+ GLIDER" means the inclusion of detected global interactions in corresponding baselines. The "Setting" column is labeled relative to the source of detected interactions: AutoInt. * scores by Song et al. (2018).

Setting        Model         Criteo AUC       Criteo logloss   Avazu AUC        Avazu logloss
Distillation   Wide&Deep     0.8069 ± 5e−4    0.4446 ± 4e−4    0.7794 ± 3e−4    0.3804 ± 2e−4
               + GLIDER      0.8080 ± 3e−4    0.4436 ± 3e−4    0.7795 ± 1e−4    0.3802 ± 9e−5
               DeepFM        0.8079 ± 3e−4    0.4436 ± 2e−4    0.7792 ± 3e−4    0.3804 ± 9e−5
               + GLIDER      0.8097 ± 2e−4    0.4420 ± 2e−4    0.7795 ± 2e−4    0.3802 ± 2e−4
               Deep&Cross    0.8076 ± 2e−4    0.4438 ± 2e−4    0.7791 ± 2e−4    0.3805 ± 1e−4
               + GLIDER      0.8086 ± 3e−4    0.4428 ± 2e−4    0.7792 ± 2e−4    0.3803 ± 9e−5
               xDeepFM       0.8084 ± 2e−4    0.4433 ± 2e−4    0.7785 ± 3e−4    0.3808 ± 2e−4
               + GLIDER      0.8097 ± 3e−4    0.4421 ± 3e−4    0.7787 ± 4e−4    0.3806 ± 1e−4
Enhancement    AutoInt*      0.8083           0.4434           0.7774           0.3811
               + GLIDER      0.8090 ± 2e−4    0.4426 ± 2e−4    0.7773 ± 1e−4    0.3811 ± 5e−5
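To give a feel for this kind of significance argument, the sketch below computes a Bonferroni-corrected binomial tail under an illustrative null where each of the batch samples detects one candidate interaction uniformly at random. This is our own simplified null for intuition only; it is not necessarily the exact null model used to derive the thresholds quoted above.

```python
from math import comb

def binom_tail(n, p, t):
    """P(X >= t) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def min_significant_count(num_samples, num_candidates, alpha=0.05):
    """Smallest occurrence count with Bonferroni-corrected p-value < alpha,
    under a null where each sample detects one candidate uniformly at random.
    """
    p = 1.0 / num_candidates
    for t in range(1, num_samples + 1):
        if num_candidates * binom_tail(num_samples, p, t) < alpha:
            return t
    return num_samples

# Criteo-style search space: up to 3-way interactions over d = 39 features.
num_candidates = comb(39, 2) + comb(39, 3)
print(num_candidates)  # -> 9880
```

Under such a uniform null, even modest occurrence counts over 1000 samples are extremely unlikely, which is why the observed top-1 counts of 691 and 525 are far beyond any plausible significance threshold.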
Interaction Encoding. Based on our results from the previous section (§6.2.3), we turn our attention to explicitly encoding the detected global interactions in target baseline models via truncated feature crosses (detailed in §6.1.2.2). In order to generate valid cross feature IDs, we bucketize dense features into a maximum of 100 bins before crossing them and require that final cross feature IDs occur more than T = 100 times over a training batch of one million samples. We take AutoInt's top-K global interactions on each dataset from §6.2.3, with subset interactions excluded (Algorithm 2, line 6), and encode the interactions in each baseline model, including AutoInt itself. K is tuned on validation sets, and model hyperparameters are the same between a baseline and one with encoded interactions. We set K = 40 for Criteo and K = 10 for Avazu.

⁶ "device_ip" and "device_id" identify different sets of users (https://www.csie.ntu.edu.tw/~r01922136/slides/kaggle-avazu.pdf)

Table 6.6: # parameters of the models in Table 6.5. M denotes million.

Model         Criteo            Avazu
Wide&Deep     18.1M             27.3M
+ GLIDER      19.3M (+6.8%)     27.6M (+1.0%)
DeepFM        17.5M             26.7M
+ GLIDER      18.3M (+4.8%)     26.9M (+0.6%)
Deep&Cross    17.5M             26.1M
+ GLIDER      18.7M (+6.9%)     26.4M (+1.0%)
xDeepFM       18.5M             27.6M
+ GLIDER      21.7M (+17.2%)    28.3M (+2.5%)
AutoInt       16.4M             25.1M
+ GLIDER      17.3M (+5.1%)     25.2M (+0.6%)

Figure 6.4: Test logloss vs. K of DeepFM on the Criteo dataset (5 trials).

In Table 6.5, we found that GLIDER often obtains significant gains in performance based on standard deviation, and GLIDER often reaches or exceeds a desired 0.001 improvement for the Criteo dataset (Cheng et al., 2016; Guo et al., 2017; Wang et al., 2017; Song et al., 2018). The improvements are especially visible with DeepFM on Criteo. We show how this model's test performance varies with different K in Figure 6.4.
All performance gains are obtained at a limited cost of extra model parameters (Table 6.6) thanks to the truncations applied to our cross features. To avoid extra parameters entirely, we recommend feature selection on the new and existing features. On one hand, the evidence that AutoInt's detected interactions can improve other baselines' performance suggests the viability of interaction distillation. On the other hand, the evidence that AutoInt's performance on Criteo can improve using its own detected interactions suggests that AutoInt may benefit from learning interactions more explicitly. In either model distillation or enhancement settings, we found that GLIDER performs especially well on industry production models trained on large private datasets with thousands of features.

6.3 Broader Impact

This chapter initiates a discussion on interpreting feature interactions in black-box prediction models, especially deep neural networks. Our interaction visualizations essentially provided another dimension of insight into model behaviors beyond individual feature importance. These interaction explanations were demonstrated on sentiment analysis, image classification, DNA modeling, graph prediction, and ad-recommendation tasks. In addition, by focusing on ad-recommendation, we showed that there is now great commercial incentive to interpret feature interactions through simultaneous gains in prediction performance. This incentive may also encourage corporations to explain to users how recommender services personalize to them.

Chapter 7
Explaining Feature Interactions in Black-Box Models via Interaction Attribution

When state-of-the-art machine learning models are used to make predictions for users, we may want to know how they personalize to (or target) us.
Understanding such model behaviors requires not only interaction detection, but also interaction attribution to explain if features influence each other and how these interactions contribute to predictions, respectively. Interaction explanations are useful for applications such as sentiment analysis, image classification, and recommendation tasks, as we saw in Chapter 6.

Relevant methods for attributing predictions to feature interactions are black-box explanation methods based on axioms, but these methods lack interpretability. One of the core issues is that an interaction's importance is not the same as its attribution. Techniques like Shapley Taylor Interaction Index (STI) (Dhamdhere et al., 2019) and Integrated Hessians (IH) (Janizek et al., 2020) combine these concepts in order to be axiomatic.

Figure 7.1: Our explanation for the sentiment analysis example by Janizek et al. (2020) (input: "a bad, terrible, awful, horrible movie", classified Negative; explainers compared: Integrated Hessians, Shapley Taylor Interaction Index, and our method). Colors indicate sentiment, and arrows indicate interactions. Compared to other axiomatic interaction explainers (Janizek et al., 2020; Dhamdhere et al., 2019), only our work corroborates our intuition by showing negative attribution among top-ranked interactions.

Specifically, they base an interaction's attribution on non-additivity, i.e. the degree that features non-additively affect an outcome. While non-additivity can be used for interaction detection, it is not interpretable as an attribution measure, as we see in Figure 7.1. In addition, neither STI nor IH is tractable for higher-order feature interactions.
Hence, there is a need for interpretable, axiomatic, and scalable methods for interaction attribution and corresponding interaction detection.

To this end, we propose a novel framework called Archipelago, which consists of an interaction attribution method, ArchAttribute, and a corresponding interaction detector, ArchDetect, to address the challenges of being interpretable, axiomatic, and scalable. Archipelago is named after its ability to provide explanations by isolating feature interactions, or feature "islands". Our framework has the unique advantage of being simultaneously interpretable, model-agnostic, axiomatic, efficient, and easy-to-implement without requiring gradients (Sundararajan et al., 2017; Janizek et al., 2020) or random sampling (Ribeiro et al., 2016; Tsang et al., 2020a).

In our experiments, we show that Archipelago effectively detects relevant interactions and is more interpretable than state-of-the-art methods (Dhamdhere et al., 2019; Jin et al., 2019; Sundararajan et al., 2017; Janizek et al., 2020; Tsang et al., 2018c; Grabisch & Roubens, 1999) when evaluated on annotation labels in sentiment analysis and image classification. We visualize our explanations on sentiment analysis, coronavirus prediction on chest X-rays, and ad-recommendation, and we demonstrate an interactive visualization of Archipelago.

Interpretability: One of the motivations of this work is to provide coherent interpretability, meaning that it needs to be consistent with human prior belief (Miller, 2019). Coherence is important so that users can agree with model interpretations and don't inadvertently think a model fails when it actually works, as we motivate in Figure 7.1. Coherent interpretability leads to more trustworthy explanations (Jin et al., 2019). In our experiments, we leverage standard annotation labels to measure coherence.
Besides coherence, Archipelago satisfies other desirable qualities of interpretability (Miller, 2019): simplicity, generality, and truthfulness. We achieve simplicity by merging feature sets, generality by isolating attributions from contexts, and truthfulness by trying to explain the true cause of predictions via interactions.

Problem Setup: Let f denote a black-box model with scalar output. For multi-class classification, f is assumed to be a class logit. We use an input vector x* ∈ R^p to denote the data instance where we wish to explain f, and x^0 ∈ R^p to denote a neutral baseline. Here, the baseline is a reference vector for x* and conveys an "absence of signal" as per Sundararajan et al. (2017). These vectors form the space X ⊂ R^p, where each element comes from either x*_i or x^0_i, i.e.

    X = {(x_1, ..., x_p) | x_i ∈ {x*_i, x^0_i}, ∀i = 1, ..., p}.    (7.1)

Figure 7.2: Non-additive interaction for p = 2 features, shown for (a) an additive (linear) function, (b) a non-additive (ReLU) function, and (c) δ vs. φ on a text example (Figure 7.1). The corner points are used to determine if x_1 and x_2 interact based on their non-additivity on f, i.e. they interact if δ ∝ (f(a) − f(b)) − (f(c) − f(d)) ≠ 0 (§7.1.2.1). In (c), the attribution of (bad, awful) should be negative via φ (7.2), but Shapley Taylor Interaction Index uses the positive δ. Note that φ depends on a and d whereas δ depends on a, b, c, and d. Also, Integrated Hessians is not relevant here since it does not apply to ReLU functions.

7.1 Methodology

We begin by presenting our feature attribution measure. Our feature attribution analyzes and assigns scores to detected feature interactions. Our corresponding interaction detector is presented in §7.1.2.

7.1.1 Archipelago Interaction Attribution

We begin by presenting our feature attribution measure, which analyzes and assigns scores to detected feature interactions.
Our corresponding interaction detector is presented in §7.1.2.

7.1.1.1 ArchAttribute

Let I be the set of feature indices that correspond to a desired attribution score. Our proposed attribution measure, called ArchAttribute, is given by

    φ(I) = f(x*_I + x^0_{\I}) − f(x^0).    (7.2)

ArchAttribute essentially isolates the attribution of x*_I from the surrounding baseline context while also satisfying axioms (§7.1.1.2). We call this isolation an "island effect", where the input features {x*_i}_{i∈I} do not specifically interact with the baseline features {x^0_j}_{j∈\I}. For example, consider sentiment analysis on a phrase x* = "not very bad" with a baseline x^0 = "_ _ _". Suppose that we want to examine the attribution of an interaction I that corresponds to {very, bad} in isolation. In this case, the contextual word "not" also interacts with I, which becomes apparent when small perturbations to the word "not" cause large changes to prediction probabilities. However, as we move further away from the word "not" towards the empty-word "_" in the word-embedding space, small perturbations no longer result in large prediction changes, meaning that the "_" context does not specifically interact with {very, bad}. This intuition motivates our use of the baseline context x^0_{\I} in (7.2). Note that ArchAttribute is independent of input context and thus carries generality over input data instances.

7.1.1.2 Axioms

We now show how ArchAttribute obeys standard feature attribution axioms (Sundararajan et al., 2017). Since ArchAttribute operates on feature sets, we generalize the notion of standard axioms to feature sets. To this end, we also propose a new axiom, Set Attribution, which allows us to work with feature sets.

Let S = {I_i}_{i=1}^k be all k feature interactions and main effects of f in the space X (7.1), where we take the union of overlapping sets in S. Later in §7.1.2, we explain how to obtain S.
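Equation (7.2) is a single pair of forward passes per feature set. A minimal sketch (the wrapper and variable names are ours, not the thesis code) with a toy model f:

```python
def arch_attribute(f, x_star, x_base, I):
    """ArchAttribute (7.2): phi(I) = f(x*_I + x0 on the complement of I) - f(x0).
    I is a set of feature indices; features outside I stay at the baseline."""
    x = [x_star[i] if i in I else x_base[i] for i in range(len(x_base))]
    return f(x) - f(x_base)

# toy model with one interaction (x1*x2) and one main effect (x3)
f = lambda x: x[0] * x[1] + x[2]
x_star, x_base = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(arch_attribute(f, x_star, x_base, {0, 1}))   # interaction attribution
print(arch_attribute(f, x_star, x_base, {2}))      # main-effect attribution
```

Only the features in I are moved to their input values; everything else remains at the baseline, which is what produces the "island effect" described above.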
Completeness: We consider a generalization of the completeness axiom for which the sum of all attributions equals f(x*) − f(x^0). The axiom tells us how much feature(s) impact a prediction.

Lemma 7 (Completeness on S). The sum of all attributions by ArchAttribute for the disjoint sets in S equals the difference of f between x* and the baseline x^0: f(x*) − f(x^0).

Proof. Based on the definition of non-additive statistical interaction, a function f can be represented as a generalized additive function (Chapters 4, 5), here on the domain of X:

    f(x) = Σ_{i=1}^{η} q_i(x_{I^u_i}) + Σ_{j=1}^{p} q'_j(x_j) + b,    (7.3)

where q_i(x_{I^u_i}) is a function of each interaction I^u_i on X ∀i = 1, ..., η interactions, q'_j(x_j) is a function for each feature ∀j = 1, ..., p, and b is a bias. The u in I^u stands for "unmerged".

The disjoint sets of S = {I_i}_{i=1}^k are the result of merging overlapping interaction sets and main effect sets, so we can merge the subfunctions q(·) and q'(·) of (7.3) whose input sets overlap to write f(x) as a sum of new functions g_i(x_{I_i}) ∀i = 1, ..., k:

    f(x) = Σ_{i=1}^{k} g_i(x_{I_i}) + b.    (7.4)

Using the form of (7.4), we rewrite (7.2) by separating out the effect of index i:

    φ(I_i) = f(x*_{I_i} + x^0_{\I_i}) − f(x^0)    ∀i = 1, ..., k
           = (g_i(x*_{I_i}) + Σ_{j=1, j≠i}^{k} g_j(x^0_{I_j}) + b) − (g_i(x^0_{I_i}) + Σ_{j=1, j≠i}^{k} g_j(x^0_{I_j}) + b)    (7.5)
           = g_i(x*_{I_i}) − g_i(x^0_{I_i}).    (7.6)

Since all I ∈ S are disjoint, g_j(x^0_{I_j}) can be canceled in (7.5) ∀j, leading to (7.6). The result at (7.6) can also be obtained with an alternative attribution approach, as shown in Corollary 11. Next, we compute the sum of attributions:

    Σ_{i=1}^{k} φ(I_i) = Σ_{i=1}^{k} (g_i(x*_{I_i}) − g_i(x^0_{I_i}))    (7.7)
                       = Σ_{i=1}^{k} g_i(x*_{I_i}) − Σ_{i=1}^{k} g_i(x^0_{I_i})    (7.8)
                       = f(x*) − f(x^0).

We can easily see that ArchAttribute satisfies this axiom in the limiting case where k = 1, I_1 = {i}_{i=1}^p, because (7.2) directly becomes f(x*) − f(x^0).
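Lemma 7 can be checked numerically on a toy model that is additive over disjoint sets, as in (7.4). This is an illustrative sketch (our own toy function, not thesis code):

```python
# toy model additive over the sets S = {{0,1}, {2}}, plus a bias b = 5
f = lambda x: x[0] * x[1] + 3 * x[2] + 5

x_star = [2.0, 4.0, 1.0]
x_base = [0.0, 0.0, 0.0]
S = [{0, 1}, {2}]

def phi(I):
    """ArchAttribute (7.2) for one feature set I."""
    x = [x_star[i] if i in I else x_base[i] for i in range(3)]
    return f(x) - f(x_base)

total = sum(phi(I) for I in S)
# completeness: attributions over disjoint S sum to f(x*) - f(x0)
assert abs(total - (f(x_star) - f(x_base))) < 1e-9
```

Note that the bias b cancels inside each φ(I_i), mirroring how b drops out between (7.5) and (7.6).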
Existing interaction or group attribution methods, Sampling Contextual Decomposition (SCD) (Jin et al., 2019), its variant (CD) (Murdoch et al., 2018; Singh et al., 2019), Sampling Occlusion (SOC) (Jin et al., 2019), and Shapley Interaction Index (SI) (Grabisch & Roubens, 1999), do not satisfy completeness, whereas Integrated Hessians (IH) (Janizek et al., 2020) and Shapley Taylor Interaction Index (STI) (Dhamdhere et al., 2019) do.

Set Attribution: We propose an axiom for interaction attribution called Set Attribution to work with feature sets as opposed to individual features. Set Attribution follows the natural additive structure of a function.

Axiom 8 (Set Attribution). If f : R^p → R is a function in the form of f(x) = Σ_{i=1}^{k} ψ_i(x_{I_i}), where {I_i}_{i=1}^k are disjoint and the functions {ψ_i(·)}_{i=1}^k have roots, then an interaction attribution method admits an attribution for feature set I_i of ψ_i(x_{I_i}) ∀i = 1, ..., k.

For example, if we consider a function y = x_1 x_2 + x_3, it makes sense for the attribution of the x_1 x_2 interaction to be the value of x_1 x_2 and the attribution for the x_3 main effect to be the value of x_3.

Lemma 9 (Set Attribution on S). For x = x* and a baseline x^0 such that ψ_i(x^0_{I_i}) = 0 ∀i = 1, ..., k, ArchAttribute satisfies the Set Attribution axiom and provides attribution ψ_i(x_{I_i}) for set I_i ∀i.

Proof. From (7.6) in Lemma 7, ArchAttribute can be written as

    φ(I_i) = g_i(x*_{I_i}) − g_i(x^0_{I_i})    ∀i = 1, ..., k,

where f(x) = Σ_{i=1}^{k} g_i(x_{I_i}) + b. Since S = {I_i}_{i=1}^k are disjoint feature sets for the same function f in Axiom 8, g_i(·) and ψ_i(·) are related by a constant bias b_i:

    ψ_i(x) = g_i(x) + b_i.

Each ψ_i(·) has roots, so g_i(x) + b_i has roots. x^0 is set such that ψ_i(x^0_{I_i}) = g_i(x^0_{I_i}) + b_i = 0. Rearranging, −g_i(x^0_{I_i}) = b_i. Adding g_i(x*_{I_i}) to both sides, g_i(x*_{I_i}) − g_i(x^0_{I_i}) = g_i(x*_{I_i}) + b_i, which becomes

    φ(I_i) = ψ_i(x*_{I_i})    ∀i = 1, ..., k.
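The y = x_1 x_2 + x_3 example above can be checked directly. With an all-zero baseline (so each ψ_i vanishes at x^0, as Lemma 9 requires), ArchAttribute returns each ψ_i evaluated at the input. A brief sketch (our own example code):

```python
f = lambda x: x[0] * x[1] + x[2]      # psi_1 = x1*x2, psi_2 = x3

x_star = [3.0, 2.0, 4.0]
x_base = [0.0, 0.0, 0.0]              # psi_i(x0) = 0 for both sets

def phi(I):
    """ArchAttribute (7.2) for one feature set I."""
    x = [x_star[i] if i in I else x_base[i] for i in range(3)]
    return f(x) - f(x_base)

assert phi({0, 1}) == 3.0 * 2.0       # the value of the x1*x2 interaction
assert phi({2}) == 4.0                # the value of the x3 main effect
```

If a baseline with ψ_i(x^0) ≠ 0 were chosen instead, the attributions would shift by those constants, which is why the root condition on the baseline matters.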
None of SCD, CD, SOC, SI, IH, or STI satisfies Set Attribution (shown in Appendix D.2). We can enable Integrated Gradients (IG) (Sundararajan et al., 2017) to satisfy our axiom by summing its attributions within each feature set of S. ArchAttribute differs from IG by its "island effect" (§7.1.1.1) and model-agnostic properties.

Other Axioms: ArchAttribute also satisfies the remaining axioms: Sensitivity, Implementation Invariance, Linearity, and Symmetry-Preserving, which we show via Lemmas 12-16 in Appendix D.3.

Discussion: Several axioms required disjoint interaction and main effect sets in S. Though interactions are not necessarily disjoint by definition (Definition 1), it is reasonable to merge overlapping interactions to obtain compact and simpler visualizations, as shown in Figure 7.1 and our experiments later in §7.2.3. The disjoint sets also allow ArchAttribute to yield identifiable non-additive attributions in the sense that it can identify the attribution given a feature set in S. This contrasts with Model-Agnostic Hierarchical Explanations (MAHE) (Tsang et al., 2018c), which yields unidentifiable attributions (Wood, 2006).

7.1.2 Archipelago Interaction Detection

Our axiomatic analysis of ArchAttribute relied on S, which contains interaction sets of f on the space X (7.1). To develop an interaction detection method that works in tandem with ArchAttribute, we draw inspiration from the discrete interpretation of mixed partial derivatives.

7.1.2.1 Discrete Interpretation of Mixed Partial Derivatives

Consider the plots in Figure 7.2, which consist of points a, b, c, and d that each contain two feature values. From a top-down view of each plot, the points form the corners of a rectangle, whose side lengths are h_1 = |a_1 − b_1| = |c_1 − d_1| and h_2 = |a_2 − c_2| = |b_2 − d_2|. When h_1 and h_2 are small, the mixed partial derivative w.r.t. variables x_1 and x_2 is computed as follows. First,

    ∂f(a)/∂x_1 ≈ (1/h_1)(f(a) − f(b))  and  ∂f(c)/∂x_1 ≈ (1/h_1)(f(c) − f(d)).
Similarly, the mixed partial derivative is approximated as:

    ∂²f/∂x_1∂x_2 ≈ (1/h_2)(∂f(a)/∂x_1 − ∂f(c)/∂x_1) ≈ (1/(h_1 h_2))((f(a) − f(b)) − (f(c) − f(d))).    (7.9)

When h_1 and h_2 become large, (7.9) tells us if a plane can fit through all four points a, b, c, d (Figure 7.2a), which occurs when (7.9) is zero. Here, a plane in the linear form f(x) = w_1 x_1 + w_2 x_2 + b is functionally equivalent to all functions of the form f(x) = f_1(x_1) + f_2(x_2) + b since x_1 and x_2 only take two possible values each, so any deviation from the plane (e.g. Figure 7.2b) becomes non-additive. Consequently, a non-zero value of (7.9) identifies a non-additive interaction by the definition of statistical interaction (Definition 1). What's more, the magnitude of (7.9) tells us the degree of deviation from the plane, or the degree of non-additivity. (Additional details in Appendix D.4)

7.1.2.2 ArchDetect

Leveraging these insights about mixed partial derivatives, we now discuss the two components of our proposed interaction detection technique, ArchDetect.

Handling Context: As defined in §7.1.1.2, our problem is how to identify interactions of p features in X for our input data instance x* and baseline x^0. If p = 2, then we can almost directly use (7.9), where a = (x*_1, x*_2), b = (x^0_1, x*_2), c = (x*_1, x^0_2), and d = (x^0_1, x^0_2). However, if p > 2, all possible combinations of features in X would need to be examined to thoroughly identify just one pairwise interaction. To see this, we first rewrite (7.9) to accommodate p features, and square the result to measure interaction strength and be consistent with previous interaction detectors (Friedman & Popescu, 2008; Gevrey et al., 2006). The interaction strength between features i and j for a context x_{\{i,j}} is then defined as

    ω_{i,j}(x) = ((1/(h_i h_j)) (f(x*_{{i,j}} + x_{\{i,j}}) − f(x^0_{{i}} + x*_{{j}} + x_{\{i,j}}) − f(x*_{{i}} + x^0_{{j}} + x_{\{i,j}}) + f(x^0_{{i,j}} + x_{\{i,j}})))²,    (7.10)

where h_i = |x*_i − x^0_i| and h_j = |x*_j − x^0_j|.
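The discrete mixed partial (7.9) can be demonstrated in a few lines, echoing panels (a) and (b) of Figure 7.2: it is exactly zero for an additive (linear) function and non-zero for a ReLU-style one. This is an illustrative sketch (our own helper, not thesis code):

```python
def mixed_partial(f, a, b, c, d, h1, h2):
    """Discrete mixed partial derivative (7.9) over four rectangle corners."""
    return ((f(a) - f(b)) - (f(c) - f(d))) / (h1 * h2)

linear = lambda x: 2 * x[0] + 3 * x[1] + 1     # additive: a plane fits
relu = lambda x: max(x[0] + x[1], 0.0)         # non-additive

# corners of the rectangle at (+-1, +-1), so h1 = h2 = 2
a, b, c, d = (1, 1), (-1, 1), (1, -1), (-1, -1)
print(mixed_partial(linear, a, b, c, d, 2, 2))  # 0: no interaction
print(mixed_partial(relu, a, b, c, d, 2, 2))    # nonzero: interaction
```

The magnitude of the non-zero result quantifies the deviation from the best-fitting plane, i.e. the degree of non-additivity discussed above.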
The thorough way to identify the {i,j} feature interaction is given by ω̄_{i,j} = E_{x∈X}[ω_{i,j}(x)], where each element of x_{\{i,j}} is Bernoulli(0.5). This expectation is intractable because X has an exponential search space, so we propose the first component of ArchDetect for efficient pairwise interaction detection:

    ω̄_{i,j} = (1/2)(ω_{i,j}(x*) + ω_{i,j}(x^0)).    (7.11)

Here, we estimate the expectation by leveraging the physical meaning of the interactions and ArchAttribute's axioms via the different contexts of x in (7.11) as follows:

• Context of x*: An important interaction is one due to multiple x* features. As a concrete example, consider an image representation of a cat which acts as our input data instance. The following higher-order interaction, "if x_ear = x*_ear and x_nose = x*_nose and x_fur = x*_fur then f(x) = high cat probability", is responsible for classifying "cat". We can detect any pairwise subset {i,j} of this interaction by setting the context as x*_{\{i,j}} using ω_{i,j}(x*).

• Context of x^0: Next, we consider x^0_{\{i,j}} to detect interactions via ω_{i,j}(x^0), which helps us establish ArchAttribute's completeness (Lemma 7). This also separates out the effects of any higher-order baseline interactions from f(x^0) in (7.5) and recombines their effects in (7.8). From an interpretability standpoint, the x^0_{\{i,j}} context ranks pairwise interactions w.r.t. a standard baseline. This context is also used by ArchAttribute (7.2).

• Other Contexts: The first two contexts accounted for any-order interactions created by either input or baseline features and a few interactions created by a mix of input and baseline features. The remaining interactions specifically require a mix of > 3 input and baseline features. This case is unlikely and is excluded, as we discuss next.

The following assumption formalizes our intuition for the Other Contexts setting where there is a mix of higher-order (> 3) input and baseline feature interactions.
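Putting (7.10) and (7.11) together, pairwise detection reduces to four function evaluations per context. A minimal sketch (hypothetical function names, not thesis code):

```python
def omega(f, x_star, x_base, i, j, context):
    """Pairwise interaction strength (7.10) within a given context vector."""
    def val(xi, xj):
        x = list(context)
        x[i], x[j] = xi, xj
        return f(x)
    hi = abs(x_star[i] - x_base[i])
    hj = abs(x_star[j] - x_base[j])
    delta = (val(x_star[i], x_star[j]) - val(x_base[i], x_star[j])
             - val(x_star[i], x_base[j]) + val(x_base[i], x_base[j]))
    return (delta / (hi * hj)) ** 2

def arch_detect_pair(f, x_star, x_base, i, j):
    """ArchDetect estimate (7.11): average over the two key contexts."""
    return 0.5 * (omega(f, x_star, x_base, i, j, x_star)
                  + omega(f, x_star, x_base, i, j, x_base))
```

For example, on f(x) = x_1 x_2 + x_3 with x* = [1, 1, 1] and x^0 = [−1, −1, −1], the pair {1, 2} scores 1 while the additive pair {1, 3} scores 0.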
Assumption 10 (Higher-Order Mixed-Interaction). For any feature set I where |I| > 3 and any pair of non-empty disjoint sets A and B where A ∪ B = I, the instances x ∈ X such that x_i = x*_i ∀i ∈ A and x_j = x^0_j ∀j ∈ B do not cause a higher-order interaction of all features in {x_k}_{k∈I} via f.

Assumption 10 has a similar intuition as ArchAttribute in §7.1.1.1 that input feature values do not specifically interact with baseline feature values. To understand this assumption, consider the original sentiment analysis example in Figure 7.1 simplified as x* = "bad terrible awful horrible movie" where x^0 = "_ _ _ _ _". It is reasonable to assume that there is no special interaction created by token sets such as {bad, terrible, _, horrible} or {_, _, _, horrible} due to the meaningless nature of the "_" token. In practice, it makes more sense to detect interactions out of sets like {bad, terrible, horrible} so we can avoid spurious interactions with the "_" token.

Efficiency: In (7.11), ArchDetect attains interaction detection over all pairs {i,j} in O(p²) calls of f. Note that in (7.10), most function calls are reusable during pairwise interaction detection.

Detecting Disjoint Interaction Sets: The aim here is to recover arbitrary-size, disjoint, non-additive feature sets S = {I_i} (not just pairs). ArchDetect takes the union of overlapping pairwise interactions to obtain disjoint feature sets. Merging these pairwise interactions captures any existing higher-order interactions automatically, since the existence of a higher-order interaction automatically means all its subset interactions exist (§3.2). In addition, ArchDetect merges these overlapping pairwise interactions with all individual feature effects to account for all features. The time complexity of the full merging process is also O(p²).
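The merging step amounts to taking connected components over the graph of detected pairs, with untouched features kept as singleton main effects. A sketch via union-find (our own implementation, not thesis code):

```python
def merge_interactions(pairs, p):
    """Merge overlapping pairwise interactions (plus singleton main effects)
    into disjoint feature sets via union-find connected components."""
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), set()).add(i)
    return sorted(map(sorted, groups.values()))

# pairs {0,1} and {1,2} overlap, so they merge into one higher-order set
print(merge_interactions([(0, 1), (1, 2), (4, 5)], p=7))
```

With p features and at most O(p²) detected pairs, union-find keeps the merging within the O(p²) complexity stated above.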
Note that to visualize ArchDetect, an initial threshold on the pairwise interaction ranking of (7.11) may be needed prior to merging, to show smaller-size interactions for explanation simplicity.

7.1.2.3 Input Dimensionality Reduction

For a black-box model f : R^{p'} → R which takes as input a vector with p' dimensions (e.g. an image, input embedding, etc.) and maps it to a scalar output (e.g. a class logit), we can make ArchDetect more efficient by operating on a lower-dimensional input encoding x ∈ R^p with p dimensions. To match the dimensionality p' of the input argument of f, we define a transformation function ξ : R^p → R^{p'} which takes the input encoding x in the lower-dimensional space p and brings it back to the input space of f with dimensionality p'. In other words, (7.10) becomes

    ω_{i,j}(x) = ((1/(h_i h_j)) (f'(x*_{{i,j}} + x_{\{i,j}}) − f'(x^0_{{i}} + x*_{{j}} + x_{\{i,j}}) − f'(x*_{{i}} + x^0_{{j}} + x_{\{i,j}}) + f'(x^0_{{i,j}} + x_{\{i,j}})))²,

where f' = f ∘ ξ. Correspondingly, ArchAttribute (7.2) becomes

    φ(I) = f'(x*_I + x^0_{\I}) − f'(x^0).

A similar notion of input encoding was used in Chapter 6.

7.2 Experiments

7.2.1 Experiment Setup

We conduct experiments first on ArchDetect in §7.2.2 and then on ArchAttribute in §7.2.3. We then visualize their combined form as Archipelago in §7.2.3. Throughout our experiments, we commonly study BERT (Devlin et al., 2019; Wolf et al., 2019) on text-based sentiment analysis and ResNet152 (He et al., 2016) on image classification. BERT was fine-tuned on the SST dataset (Socher et al., 2013), and ResNet152 was pretrained on ImageNet (Deng et al., 2009). For sentiment analysis, we set the baseline vector x^0 to be the tokens "_", used in place of each word from x*. For image classification, we set x^0 to be an all-zero image. Several methods we compare to are common across experiments, in particular IG, IH, (disjoint) MAHE, SI, STI, and Difference, which is defined as φ_d(I) = f(x*) − f(x^0_I + x*_{\I}).
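For text, the transformation ξ of §7.1.2.3 can be sketched as a selector over token embeddings: a binary vector x ∈ {0, 1}^p picks, per token position, either the input word's embedding or the baseline "_" embedding, and the model is queried through f' = f ∘ ξ. The helper below is an illustrative sketch (hypothetical names, not thesis code):

```python
def make_xi(star_embs, base_embs):
    """Build xi: {0,1}^p -> a list of d-dim embeddings. Position i gets the
    input token's embedding if mask[i] == 1, else the baseline embedding."""
    def xi(mask):
        return [star_embs[i] if m == 1 else base_embs[i]
                for i, m in enumerate(mask)]
    return xi

# two token positions with 2-dim embeddings (toy values)
star = [[1.0, 1.0], [2.0, 2.0]]    # embeddings of the input words
base = [[0.0, 0.0], [0.0, 0.0]]    # embeddings of the "_" baseline token
xi = make_xi(star, base)
# f_prime = lambda mask: f(xi(mask))   # i.e. f' = f o xi
```

For images, ξ would instead expand each mask entry to a superpixel region (zeroed pixels for mask entries of 0), so the same p-dimensional detection logic applies to both modalities.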
We use the following input encodings as per the need for input dimensionality reduction with ArchDetect (§7.1.2.3):

• For image inputs, we use the Quickshift superpixel segmenter (Vedaldi & Soatto, 2008), which selects regions on the image. The selection is covered by the vector x ∈ {0, 1}^p, which encodes which image segments have been selected. Note that wherever x is 0 corresponds to a baseline feature value (e.g. zeroed image pixels).

• For text inputs, we use the natural correspondence between an input embedding and a word token. The selection of input embedding vectors is also covered by the vector x ∈ {0, 1}^p.

Table 7.1: Comparison of interaction detectors (b) on synthetic ground truth in (a).

(a) Functions with Ground Truth Interactions

F_1(x) = Σ_{i=1}^{10} Σ_{j=1}^{10} x_i x_j + Σ_{i=11}^{20} Σ_{j=21}^{30} x_i x_j + Σ_{k=1}^{40} x_k
F_2(x) = V(x; {x*_i}_{i=1}^{20}) + V(x; {x*_i}_{i=11}^{30}) + Σ_{j=1}^{40} x_j
F_3(x) = V(x; {x^0_i}_{i=1}^{20}) + V(x; {x*_i}_{i=11}^{30}) + Σ_{j=1}^{40} x_j
F_4(x) = V(x; {x*_1, x*_2} ∪ {x^0_3}) + V(x; {x*_i}_{i=11}^{30}) + Σ_{j=1}^{40} x_j

(b) Pairwise Interaction Ranking AUC. The baseline methods often fail to detect interactions suited for the desired contexts in §7.1.2.2.

Method                           | F_1  | F_2  | F_3  | F_4
Two-way ANOVA                    | 1.0  | 0.51 | 0.51 | 0.55
Integrated Hessians              | 1.0  | N/A  | N/A  | N/A
Neural Interaction Detection     | 0.94 | 0.54 | 0.54 | 0.56
Shapley Interaction Index        | 1.0  | 0.50 | 0.50 | 0.51
Shapley Taylor Interaction Index | 1.0  | 0.55 | 0.78 | 0.55
ArchDetect (this work)           | 1.0  | 1.0  | 1.0  | 1.0

Figure 7.3: Interaction detection overlap (redundancy) with added contexts to (7.11), plotted against the number of contexts considered (n) for BERT on SST and ResNet152 on ImageNet ("fixed" vs. "random" context sequences; k = 5 and k = 10). "fixed" at n = 2 (ArchDetect) already shows good stability.
• For recommendation data, we use the same type of correspondence between an input embedding and a feature field.

7.2.2 ArchDetect

We validate ArchDetect's interaction detection via synthetic ground truth and redundancy experiments.

Synthetic Validation: We set x* = [1, 1, ..., 1] ∈ R^40 and x^0 = [−1, −1, ..., −1] ∈ R^40. Let z[·] be a key-value pair function such that z[i] = x_i for key i ∈ z.keys and value x_i, so we can define

    V(x; z) := 1 if x_i = z[i] ∀i ∈ z.keys, and −1 for all other cases.

Table 7.1a shows functions with ground truth interactions suited for the desired contexts in §7.1.2.2. Table 7.1b shows the interaction detection AUC on these functions by ArchDetect, IH, SI, STI, Two-way ANOVA (Fisher, 1925), and the state-of-the-art Neural Interaction Detection (§4). On F_2, F_3, and F_4, the baseline methods often fail because they are not designed to detect the interactions of our desired contexts (§7.1.2.2).

Interaction Redundancy: The purpose of the interaction redundancy experiments is to see if ArchDetect can omit certain higher-order interactions. We study the form of (7.11) by examining the redundancy of interactions as new contexts are added to (7.11), which we now write as ω̄_{i,j}(C) = (1/C) Σ_{c=1}^{C} ω_{i,j}(x^c). Let n be the number of contexts considered, and k be the number of top pairwise interactions selected after running pairwise interaction detection via ω̄_{i,j} for all {i,j} pairs. Interaction redundancy is the overlap ratio of two sets of top-k pairwise interactions, one generated via ω̄_{i,j}(n) and the other via ω̄_{i,j}(n − 1) for some integer n ≥ 2. We generally expect the redundancy to increase as n increases, which we initially observe in Figure 7.3. Here, "fixed" and "random" correspond to different context sequences x^1, x^2, ..., x^N. The "random" sequence uses random samples from X for all {x^i}_{i=1}^N, whereas the "fixed" sequence is fixed in the sense that x^1 = x*, x^2 = x^0, and the remaining {x^i}_{i=3}^N are random samples.
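The redundancy measure just described can be sketched as follows (hypothetical helper names): given pairwise strength estimates from n and n − 1 contexts, it is the overlap ratio of their top-k pair sets:

```python
def top_k_pairs(strengths, k):
    """strengths: {(i, j): score}. Return the set of k highest-scoring pairs."""
    ranked = sorted(strengths, key=strengths.get, reverse=True)
    return set(ranked[:k])

def redundancy(strengths_n, strengths_n_minus_1, k):
    """Overlap ratio of top-k pairwise interactions between the estimates
    using n contexts and n - 1 contexts."""
    a = top_k_pairs(strengths_n, k)
    b = top_k_pairs(strengths_n_minus_1, k)
    return len(a & b) / k
```

A high ratio means the extra context barely changed the top-ranked pairs, which is the sense in which additional contexts beyond n = 2 are redundant.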
Experiments are done on the SST test set for BERT and 100 random test images in ImageNet for ResNet152. Notably, the "fixed" setting has very low redundancy at n = 2 (ArchDetect) compared to "random". As soon as n = 3, the redundancy jumps and stabilizes quickly. These experiments support Assumption 10 and (7.11) to omit specified higher-order interactions.

Table 7.2: Comparison of attribution methods on BERT for sentiment analysis and ResNet152 for image classification. Performance is measured by the correlation (ρ) or AUC of the top and bottom 10% of attributions for each method with respect to reference scores defined in §7.2.3.

                                                | BERT Sentiment Analysis                        | ResNet152 Image Classification
Method                                          | Word ρ | Any Phrase ρ† | Multi-Word Phrase ρ†  | Segment AUC†
Difference                                      | 0.333  | 0.639         | 0.735                 | 0.705
Integrated Gradients (IG)                       | 0.473  | 0.737         | 0.823                 | 0.786
Integrated Hessians (IH)                        | N/A    | 0.128         | 0.128                 | N/A
Model-Agnostic Hierarchical Explanations (MAHE) | 0.570  | 0.702         | 0.759                 | 0.712
Shapley Interaction Index (SI)                  | 0.160  | −0.018        | −0.018                | 0.530
Shapley Taylor Interaction Index (STI)          | 0.657  | 0.286         | 0.286                 | 0.626
Sampling Contextual Decomposition (SCD)         | 0.622  | 0.742         | 0.813                 | N/A
Sampling Occlusion (SOC)                        | 0.670  | 0.794         | 0.861                 | N/A
ArchAttribute (this work)                       | 0.745  | 0.836         | 0.871                 | 0.919

† Methods that cannot tractably run for arbitrary feature set sizes are only run for pairwise feature sets.

7.2.3 ArchAttribute & Archipelago

We study the coherent interpretability of ArchAttribute by comparing its attribution scores to ground truth annotation labels on subsets of features. For fair comparison, we look at extreme attributions (top and bottom 10%) for each baseline method. We then visualize the combined Archipelago framework. Additional comparisons on attributions and visualizations are shown in Appendices D.5 and D.6, respectively.

Sentiment Analysis: For this task, we compare ArchAttribute to other explanation methods on two metrics: phrase correlation (Phrase ρ) and word correlation (Word ρ) on the SST test set.
On one hand, Phrase ρ is the Pearson correlation between estimated phrase attributions and SST phrase labels (excluding prediction labels) on a 5-point sentiment scale (Jin et al., 2019). We study two types of phrases: "Any Phrase", which is a phrase with any number of words, and "Multi-Word Phrase", which is a phrase with more than one word. Word ρ is the same as Phrase ρ but only for single words that consist of a single token.

Figure 7.4: Our BERT visualizations on random test sentences from SST under BERT tokenization. Arrows indicate interactions, and colors indicate attribution strength. f_cls is the sentiment classification. The interactions point to salient and sometimes long-range sets of words, and the colors are sensible.

In addition to the aforementioned baseline methods in §7.2.1, we include the state-of-the-art SCD and SOC methods for sequence models (Jin et al., 2019) in our performance comparisons. As shown in Table 7.2, ArchAttribute compares favorably to all methods when we consider the top and bottom 10% of the attribution scores for each method. We obtain similar performance across all other percentiles in Appendix D.5.

We visualize Archipelago explanations on S generated by top-3 pairwise interactions (§7.1.2.2) in Figure 7.4. The sentence examples are randomly selected from the SST test set. The visualizations show interactions and individual feature effects which all have reasonable polarity and intensity. Interestingly, some of the interactions, e.g. between "lou-sy" and "un", are long range.
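The extreme-attribution evaluation used throughout Table 7.2 can be sketched as follows (a simplified stand-in for the actual evaluation pipeline): restrict to the top and bottom 10% of attribution scores, then compute the Pearson correlation against the reference labels:

```python
import math

def extreme_correlation(attributions, labels, frac=0.10):
    """Pearson correlation restricted to the top and bottom `frac` of
    attribution scores, in the spirit of the Word/Phrase rho metrics."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i])
    k = max(1, int(frac * len(order)))
    idx = order[:k] + order[-k:]          # bottom-k and top-k attributions
    xs = [attributions[i] for i in idx]
    ys = [labels[i] for i in idx]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```

Focusing on the extremes tests whether a method's most confident attributions, the ones a user would actually act on, agree with the annotation labels.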
Figure 7.5: Our explanations of a COVID-19 classifier (COVID-Net) (Wang & Wong, 2020) on randomly selected test X-rays (Chowdhury et al., 2020; Cohen et al., 2020) classified as COVID positive. COVID-Net accurately distinguishes COVID from pneumonia and normal X-rays. Colored outlines indicate detected feature sets with positive attribution. The interactions consistently focus on the "great vessel" region outlined in green.

Image Classification: On image classification, we compare ArchAttribute to relevant baseline methods on a "Segment AUC" metric, which computes the agreement between the estimated attribution of an image segment and that segment's label. We obtain segment labels from the MS COCO dataset (Lin et al., 2014) and match them to the label space of ImageNet. All explanation attributions are computed relative to ResNet152's top classification in the joint label space. The segment label thus becomes whether or not the segment belongs to the same class as the top classification. Evaluation is conducted on all segments with valid labels in the MS COCO dev set. ArchAttribute performs especially well on extreme attributions in Table 7.2, as well as all attributions (in Appendix D.5).

Figure 7.5 visualizes Archipelago on an accurate coronavirus (COVID-19) classifier for chest X-rays (Wang & Wong, 2020), where S is generated by top-5 pairwise interactions (§7.1.2.2). Shown is a random selection of test X-rays (Chowdhury et al., 2020; Cohen et al., 2020) that are classified COVID-positive. The explanations tend to detect the "great vessels" near the heart.

Figure 7.6: Online ad-targeting: "banner_pos" is used to target ads to a user per their "device_id".

Recommendation Task: Figure 7.6 shows Archipelago's result for this task using a state-of-the-art AutoInt model (Song et al., 2018) for ad-recommendation.
Here, our approach finds a positive interaction between "device_id" and "banner_pos" in the Avazu dataset¹, meaning that the online advertisement model decides the banner position based on a user's device_id. Note that for this task, there are no ground truth annotations.

1 https://www.kaggle.com/c/avazu-ctr-prediction

7.2.4 Interactive Visualization

Figure 7.7: Interactive visualization of Archipelago on the sentence "it's predictable, but it jumps through the expected hoop-s with style and even some depth." (classified Positive by the model). When moving the slider to the right, the initial negativity of "predictable" and "but" turns positive after interacting with the positive phrase "jumps with style".

While our visualizations have used a fixed threshold on pairwise interaction strength, we can interactively visualize Archipelago explanations by varying the threshold. For example, a slider user interface offers this interactivity and allows users to perform in-depth analysis. Figure 7.7 illustrates the interface, where moving the slider tells us when interactions appear and allows us to better judge model quality. Note that our interactive visualization is fast since interaction detection only runs once, and the additional ArchAttribute and interaction-merge steps are fast.

7.2.5 Runtime

Figure 7.8: Average runtime comparison for explaining (a) BERT on sentiment analysis on SST and (b) ResNet152 on image classification on ImageNet. Methods compared: Difference + ArchDetect, IG + ArchDetect, IH, MAHE, SI, STI, SCD, SOC, and Archipelago.
Figure 7.8 shows a serial runtime comparison of explainer methods for (a) BERT sentiment analysis on SST and (b) ResNet152 image classification on ImageNet. Runtimes correspond to static explanations and are averaged across 100 random data samples from the respective test sets. Archipelago outperforms the state-of-the-art. These experiments were done on a server with 32 Intel Xeon E5-2640 v2 CPUs @ 2.00GHz and 2 Nvidia 1080 Ti GPUs.

7.3 Broader Impact

The purpose of Archipelago is to improve the analysis of feature interactions in black-box models via interpretable attribution. By leveraging the union of overlapping feature sets, we also visualize explanations that are simultaneously simpler and novel. These properties allow interaction explanations to appeal to larger audiences (§2.1.2) and facilitate model validation and scientific insights based on model interpretations. We showcased Archipelago on three relevant applications: sentiment analysis, coronavirus classification on chest X-rays, and ad-recommendation.

We envision that the sentiment analysis explanations can be used as an automatic way to help teach students the nuances of different languages. The X-ray explanations can give insights into which regions of the chest are commonly indicative of the coronavirus. Lastly, the interaction attributions for ad-recommendation can tell users how recommendations are personalized to them. These are just several examples of impactful applications of interaction attributions; the opportunities in domains like healthcare, social justice, education, and business are extensive.

Chapter 8

Summary, Discussion and Future Work

In this thesis, we proposed methods to advance the interpretation of feature interactions in complex prediction models. We covered new explanations of neural networks and black-box prediction models.
These explanations came in the form of:

• Extracting interactions from the learned weights of feedforward neural networks
• Learning globally visualizable feedforward neural networks
• Detecting feature interactions in any prediction model
• Explaining the impact of feature interactions on black-box predictions

These approaches addressed different elements of interpretability, including the extraction of fundamental model behaviors, the human understanding of interactions, practical considerations of efficient interpretation, and a unique application of automatic feature engineering.

The vision of these works is to expand our understanding of feature interactions. Perhaps the most valuable part of this vision is enabling audiences without background knowledge to understand and appreciate feature interactions as natural phenomena. Based on this thesis work, we conclude that feature interactions explain intuitive behaviors; i.e., an interaction between two features can influence an outcome in the following ways: 1) exaggeration, 2) saturation, 3) attenuation, and 4) negation. To understand how interactions relate to these concepts, please refer to Figure 7.2 in Chapter 7, which shows an example of "saturation". An important direction for future work is to investigate which type of interaction is most naturally understandable to lay audiences; this may inform future methods for explaining feature interactions.

In addition to this vision, we advocate future work on improving the reliability and efficiency of interaction explanation methods. On one hand, reliability is important for making credible judgments of model behavior and the underlying data-generating process. On the other hand, efficiency is important for improving the accessibility of accurate explanations. This thesis proposed interpretable, accurate, and accessible explanations of feature interactions to motivate future explorations in this area.
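As a toy numeric illustration of these four influence types (the functions below are invented for this sketch and are not from the thesis), we can compare a model's joint effect of two features against the sum of their individual effects:

```python
# Hypothetical toy functions whose pairwise interactions exaggerate,
# saturate, attenuate, or negate the sum of the individual feature effects.
def joint_effect(f, a, b):
    return f(a, b) - f(0, 0)

def sum_of_parts(f, a, b):
    return (f(a, 0) - f(0, 0)) + (f(0, b) - f(0, 0))

exaggerate = lambda a, b: a + b + a * b        # joint effect exceeds the parts
saturate   = lambda a, b: min(a + b, 1.0)      # joint effect plateaus
attenuate  = lambda a, b: a + b - 0.5 * a * b  # joint effect shrinks, same sign
negate     = lambda a, b: a + b - 3.0 * a * b  # joint effect flips sign

a = b = 1.0
assert joint_effect(exaggerate, a, b) > sum_of_parts(exaggerate, a, b)
assert 0 < joint_effect(saturate, a, b) < sum_of_parts(saturate, a, b)
assert 0 < joint_effect(attenuate, a, b) < sum_of_parts(attenuate, a, b)
assert joint_effect(negate, a, b) < 0 < sum_of_parts(negate, a, b)
```

The boundary between "saturation" and "attenuation" here is only schematic; the point is that each type is detectable by comparing the joint effect to the additive baseline.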
Beyond explanations of feature interactions alone lies another interesting and relatively unexplored research direction: the intersection of interaction explanations and causal inference, which raises the question of how a treatment variable interacts with existing features in the prediction of a causal effect. The features involved in the interaction could reveal how treatment effects personalize to subgroups of users by leveraging explanations of modern prediction models.

References

Claire Adam-Bourdarios, Glen Cowan, Cecile Germain, Isabelle Guyon, Balazs Kegl, and David Rousseau. Learning to discover: the Higgs boson machine learning challenge. URL https://higgsml.lal.in2p3.fr/documentation/, 2014.

Chunrong Ai and Edward C Norton. Interaction terms in logit and probit models. Economics Letters, 80(1):123-129, 2003.

Leona S Aiken, Stephen G West, and Raymond R Reno. Multiple Regression: Testing and Interpreting Interactions. Sage, 1991.

Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831, 2015.

Marco Ancona, Enea Ceolini, Cengiz Oztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations, 2018.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3319-3327. IEEE, 2017.

William A Belson. Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society: Series C (Applied Statistics), 8(2):65-75, 1959.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society, Series B (Methodological), pp. 289-300, 1995.

Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of Statistics, 41(3):1111, 2013.

Or Biran and Courtenay Cotton. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8, pp. 8-13, 2017.

Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

PS Bullen, DS Mitrinović, and PM Vasić. Means and Their Inequalities, Mathematics and Its Applications, 1988.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721-1730. ACM, 2015.

Olivier Chapelle, Eren Manavoglu, and Romer Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794. ACM, 2016.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172-2180, 2016.

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7-10. ACM, 2016.
Muhammad EH Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar R Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al-Emadi, et al. Can AI help in screening viral and COVID-19 pneumonia? arXiv preprint arXiv:2003.13145, 2020.

Joseph Paul Cohen, Paul Morrison, and Lan Dao. COVID-19 image data collection. arXiv preprint arXiv:2003.11597, 2020. URL https://github.com/ieee8023/covid-chestxray-dataset.

Angela Dean, Max Morris, John Stufken, and Derek Bingham. Handbook of Design and Analysis of Experiments, volume 7. CRC Press, 2015.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Denis J. Hilton, John L. McClure, and Ben R. Slugoski. The course of events: Counterfactuals, causal sequences and explanation. In David R. Mandel, Denis J. Hilton, and Patrizia Catellani (eds.), The Psychology of Counterfactual Thinking. Routledge, 2005.

Daniel Clement Dennett. The Intentional Stance. MIT Press, 1989.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, 2019.

Kedar Dhamdhere, Ashish Agarwal, and Mukund Sundararajan. The Shapley Taylor interaction index. arXiv preprint arXiv:1902.05622, 2019.

Yadolah Dodge. The Oxford Dictionary of Statistical Terms. Oxford University Press on Demand, 2006.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Yingying Fan, Yinfei Kong, Daoji Li, Zemin Zheng, et al. Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243-1272, 2015.
Yingying Fan, Yinfei Kong, Daoji Li, and Jinchi Lv. Interaction pursuit with feature screening and selection. arXiv preprint arXiv:1605.08933, 2016.

Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113-127, 2014.

Ronald A Fisher. On the 'probable error' of a coefficient of correlation deduced from a small sample. Metron, 1:1-32, 1921.

Ronald Aylmer Fisher. Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd, 1925.

Ronald Aylmer Fisher et al. 048: The arrangement of field experiments. 1926.

Peter W Frey and David J Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161-182, 1991.

Jerome H Friedman and Bogdan E Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, pp. 916-954, 2008.

Francis Galton. Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15:246-263, 1886.

G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46-51, 1991.

Muriel Gevrey, Ioannis Dimopoulos, and Sovan Lek. Two-way interaction of input variables in the sensitivity analysis of neural network models. Ecological Modelling, 195(1-2):43-50, 2006.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315-323, 2011.

ATC Goh. Back-propagation neural networks for modeling complex systems. Artificial Intelligence in Engineering, 9(3):143-151, 1995.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28(4):547-565, 1999.
Herbert P Grice. Logic and conversation. In Speech Acts, pp. 41-58. Brill, 1975.

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1725-1731. AAAI Press, 2017.

Piyush Gupta, Nikaash Puri, Sukriti Verma, Sameer Singh, Dhruv Kayastha, Shripad Deshmukh, and Balaji Krishnamurthy. Explain your move: Understanding agent actions using focused feature saliency. arXiv preprint arXiv:1912.12191, 2019.

Michael Hamada and CF Jeff Wu. Analysis of designed experiments with complex aliasing. Journal of Quality Technology, 24(3):130-137, 1992.

Robert James Hankinson. Cause and Explanation in Ancient Greek Thought. Oxford University Press, 2001.

Ning Hao and Hao Helen Zhang. Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 109(507):1285-1301, 2014.

Trevor J Hastie. Generalized additive models. In Statistical Models in S, pp. 249-307. Routledge, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Germund Hesslow. The problem of causal selection. Contemporary Science and Natural Explanation: Commonsense Conceptions of Causality, pp. 11-32, 1988.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Denis J Hilton. Conversational processes and causal explanation. Psychological Bulletin, 107(1):65, 1990.

Denis J Hilton. Mental models and causal explanation: Judgements of probable cause and explanatory relevance. Thinking & Reasoning, 2(4):273-308, 1996.
Denis J Hilton and Ben R Slugoski. Knowledge-based causal attribution: The abnormal conditions focus model. Psychological Review, 93(1):75, 1986.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Giles Hooker. Discovering additive structure in black box functions. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 575-580. ACM, 2004.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pp. 1876-1887, 2017.

James Jaccard, Robert Turrisi, and Jim Jaccard. Interaction Effects in Multiple Regression. Number 72. Sage, 2003.

Joseph D Janizek, Pascal Sturmfels, and Su-In Lee. Explaining explanations: Axiomatic feature interactions for deep networks. arXiv preprint arXiv:2002.04138, 2020.

Xisen Jin, Junyi Du, Zhongyu Wei, Xiangyang Xue, and Xiang Ren. Towards hierarchical importance attribution: Explaining compositional semantics for neural sequence models. arXiv preprint arXiv:1911.06194, 2019.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Ian T Jolliffe. Principal components in regression analysis. In Principal Component Analysis, pp. 129-155. Springer, 1986.

Pinar Karaca-Mandic, Edward C Norton, and Bryan Dowd. Interaction terms in nonlinear models. Health Services Research, 47(1pt1):255-274, 2012.

Alex Kass and David Leake. Types of explanations. Technical report, Yale Univ New Haven CT Dept of Computer Science, 1987.

Gordon V Kass.
Significance testing in automatic interaction detection (AID). Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):178-189, 1975.

Gordon V Kass. An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(2):119-127, 1980.

Been Kim, Oluwasanmi O Koyejo, and Rajiv Khanna. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pp. 2280-2288, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Yinfei Kong, Daoji Li, Yingying Fan, Jinchi Lv, et al. Interaction pursuit in high-dimensional multi-response regression via distance correlation. The Annals of Statistics, 45(2):897-922, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Todd Kulesza, Simone Stumpf, Margaret Burnett, Sherry Yang, Irwin Kwan, and Weng-Keen Wong. Too much, too little, or just right? Ways explanations impact end users' mental models. In 2013 IEEE Symposium on Visual Languages and Human Centric Computing, pp. 3-10. IEEE, 2013.

Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126-137, 2015.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539-2547, 2015.

Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1675-1684, 2016.

Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.

Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1754-1763. ACM, 2018.

Michael Lim and Trevor Hastie. Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3):627-654, 2015.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Peter Lipton. Contrastive explanation. Royal Institute of Philosophy Supplement, 27:247-266, 1990.

Tania Lombrozo. Causal-explanatory pluralism: How intentions, functions, and mechanisms influence causal ascriptions. Cognitive Psychology, 61(4):303-332, 2010.

Tania Lombrozo. Explanation and abductive inference. 2012.

Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150-158. ACM, 2012.

Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623-631. ACM, 2013.

Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. International Conference on Learning Representations, 2018.

Scott M Lundberg and Su-In Lee.
A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765-4774, 2017.

Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.

Yuanfei Luo, Mengshuo Wang, Hao Zhou, Quanming Yao, Wei-Wei Tu, Yuqiang Chen, Wenyuan Dai, and Qiang Yang. AutoCross: Automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142-150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.

John Leslie Mackie. The Cement of the Universe: A Study of Causation. Oxford: Clarendon Press, 1974.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY, 2(4.2), 1982.

David Marr and Tomaso Poggio. From understanding computation to understanding neural circuitry. 1976.

John McClure and Denis J Hilton. Are goals or preconditions better explanations? It depends on the question. European Journal of Social Psychology, 28(6):897-911, 1998.

John L McClure, Robbie M Sutton, and Denis J Hilton.
Implicit and explicit processes in social judgments and decisions: The role of goal-based explanations. 2003.

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1-38, 2019.

Martin Renqiang Min, Xia Ning, Chao Cheng, and Mark Gerstein. Interpretable sparse high-order Boltzmann machines. In Artificial Intelligence and Statistics, pp. 614-622, 2014.

Fantine Mordelet, John Horton, Alexander J Hartemink, Barbara E Engelhardt, and Raluca Gordân. Stability selection for regression-based models of transcription factor-DNA binding specificity. Bioinformatics, 29(13):i117-i125, 2013.

James N Morgan and John A Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302):415-434, 1963.

W James Murdoch, Peter J Liu, and Bin Yu. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. International Conference on Learning Representations, 2018.

JA Nelder. A reformulation of linear models. Journal of the Royal Statistical Society: Series A (General), 140(1):48-63, 1977.

Julian D Olden and Donald A Jackson. Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154(1-2):135-150, 2002.

R Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291-297, 1997.

Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.

Sanjay Purushotham, Martin Renqiang Min, C-C Jay Kuo, and Rachel Ostroff. Factorized sparse learning models with interpretable high order feature interactions. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 552-561. ACM, 2014.

Alec Radford, Luke Metz, and Soumith Chintala.
Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Michael Ranney and Paul Thagard. Explanatory coherence and belief revision in naive physics. Technical report, Pittsburgh Univ PA Learning Research and Development Center, 1988.

Stephen J Read and Amy Marcus-Newhall. Explanatory coherence in social explanations: A parallel distributed processing account. Journal of Personality and Social Psychology, 65(3):429, 1993.

Bob Rehder. A causal-model theory of conceptual representation and categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(6):1141, 2003.

Bob Rehder. When similarity and causality compete in category-based property generalization. Memory & Cognition, 34(1):3-16, 2006.

Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995-1000. IEEE, 2010.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. ACM, 2016.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.

Rómer Rosales, Haibin Cheng, and Eren Manavoglu. Post-click conversion modeling and analysis for non-guaranteed delivery display advertising. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 293-302. ACM, 2012.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2662-2670, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Jana Samland and Michael R Waldmann. Do social norms influence causal inferences? In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36, 2014.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618-626, 2017.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93-93, 2008.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Chandan Singh, W James Murdoch, and Bin Yu. Hierarchical interpretations for neural network predictions. International Conference on Learning Representations, 2019.

Sahil Singla, Eric Wallace, Shi Feng, and Soheil Feizi. Understanding impacts of high-order loss approximations and features in deep learning interpretation. arXiv preprint arXiv:1902.00407, 2019.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.
Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. arXiv preprint arXiv:1810.11921, 2018.

Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th International Conference on Machine Learning, pp. 1000-1007. ACM, 2008.

Dorothee Staiger, Hildegard Kaulen, and Jeff Schell. A CACGTG motif of the Antirrhinum majus chalcone synthase promoter is recognized by an evolutionarily conserved nuclear protein. Proceedings of the National Academy of Sciences, 86(18):6930-6934, 1989.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3319-3328. JMLR.org, 2017.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Detecting bias in black-box models using transparent model distillation. AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2017.

Paul Thagard. Explanatory coherence. Behavioral and Brain Sciences, 12(3):435-502, 1989.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288, 1996.

Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. International Conference on Learning Representations, 2018a.

Michael Tsang, Hanpeng Liu, Sanjay Purushotham, Pavankumar Murali, and Yan Liu. Neural interaction transparency (NIT): Disentangling learned interactions for improved interpretability. In Advances in Neural Information Processing Systems, pp. 5804-5813, 2018b.
Michael Tsang, Youbang Sun, Dongxu Ren, and Yan Liu. Can I trust you more? Model-agnostic hierarchical explanations. arXiv preprint arXiv:1812.04801, 2018c.

Michael Tsang, Dehua Cheng, Hanpeng Liu, Xue Feng, Eric Zhou, and Yan Liu. Feature interaction interpretability: A case for explaining ad-recommendation systems via neural interaction detection. In International Conference on Learning Representations, 2020a.

Michael Tsang, Sirisha Rambhatla, and Yan Liu. How does this interaction affect me? Interpretable attribution for feature interactions. arXiv preprint arXiv:2006.10965, 2020b.

John W Tukey. One degree of freedom for non-additivity. Biometrics, 5(3):232-242, 1949.

Nadya Vasilyeva, Daniel A Wilkenfeld, and Tania Lombrozo. Goals affect the perceived quality of explanations. In CogSci, 2015.

Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pp. 705-718. Springer, 2008.

MK Vijaymeena and K Kavitha. A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal, 3(2):19-28, 2016.

Linda Wang and Alexander Wong. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images. arXiv preprint arXiv:2003.09871, 2020.

Meng Wang, Cheng Tai, Weinan E, and Liping Wei. DEFINE: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Research, 46(11):e69-e69, 2018.

Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17, pp. 12. ACM, 2017.

Martin B Wilk. The randomization analysis of a generalized randomized block design. Biometrika, 42(1/2):70-79, 1955.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.

Thomas H Wonnacott and Ronald J Wonnacott. Introductory Statistics, volume 19690. Wiley New York, 1972.

Simon Wood. Generalized Additive Models: An Introduction with R. CRC Press, 2006.

James Woodward. Sensitive and insensitive causation. The Philosophical Review, 115(1):1-50, 2006.

Lin Yang, Tianyin Zhou, Iris Dror, Anthony Mathelier, Wyeth W Wasserman, Raluca Gordân, and Remo Rohs. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Research, 42(D1):D148-D155, 2013.

Frank Yates. Sir Ronald Fisher and the design of experiments. Biometrics, 20(2):307-321, 1964.

Haoyang Zeng, Matthew D Edwards, Ge Liu, and David K Gifford. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics, 32(12):i121-i127, 2016.

Appendix A

Appendix of Chapter 3

A.1 Spurious Main Effect Approximation

In the synthetic function $F_6$ (Table 4.2), the $\{8, 9, 10\}$ interaction, $\sqrt{x_8^2 + x_9^2 + x_{10}^2}$, can be approximated as main effects for each of the variables $x_8$, $x_9$, and $x_{10}$ when at least one of the three variables is close to $-1$ or $1$. Note that in experiments, these variables were uniformly distributed between $-1$ and $1$. For example, let $x_{10} = 1$ and $z^2 = x_8^2 + x_9^2$; then, by Taylor series expansion at $z = 0$,
$\sqrt{z^2 + 1} \approx 1 + \tfrac{1}{2} z^2 = 1 + \tfrac{1}{2} x_8^2 + \tfrac{1}{2} x_9^2.$
By symmetry under the assumed conditions,
$\sqrt{x_8^2 + x_9^2 + x_{10}^2} \approx c + \tfrac{1}{2} x_8^2 + \tfrac{1}{2} x_9^2 + \tfrac{1}{2} x_{10}^2,$
where $c$ is a constant.

Figure A.1 visualizes the $x_8$, $x_9$, $x_{10}$ univariate networks of an MLP-M (Figure 4.2) that is trained on $F_6$. The plots confirm the hypothesis that the MLP-M models the $\{8, 9, 10\}$ interaction as spurious main effects with parabolas scaled by $\tfrac{1}{2}$.
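The Taylor approximation above can be checked numerically. The following is a small sketch (not from the thesis) that fixes $x_{10} = 1$ and compares the exact interaction against its quadratic approximation at a few points well inside the unit cube:

```python
import math

# Numeric spot-check of the argument in A.1: with x10 fixed at 1 and
# z^2 = x8^2 + x9^2, we expect sqrt(z^2 + 1) ≈ 1 + z^2 / 2 near z = 0.
def exact(x8, x9):
    return math.sqrt(x8**2 + x9**2 + 1.0**2)  # x10 = 1

def taylor(x8, x9):
    return 1.0 + 0.5 * x8**2 + 0.5 * x9**2

for x8, x9 in [(0.1, 0.2), (0.3, -0.2), (-0.25, 0.15)]:
    assert abs(exact(x8, x9) - taylor(x8, x9)) < 5e-3
```

The error is on the order of $z^4$, which is why the parabolic main-effect fits in Figure A.1 are so close when any one variable sits near the boundary of its range.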
Figure A.1: Response plots of an MLP-M's univariate networks corresponding to variables x_8, x_9, and x_10. The MLP-M was trained on data generated from synthetic function F_6 (Table 4.2). Note that the plots are subject to different levels of bias from the MLP-M's main multivariate network.

Appendix B

Appendix of Chapter 4

B.1 NIT Hyperparameters

Hyperparameters of NIT not mentioned in the paper body but used in the main experiments (Table 5.3) are shown in Table B.1. Architecture sizes are listed from the input through each hidden layer to the output. λ_2 is the L_2 regularization constant used in the second phase of training NIT after disentangling (§5.2.2).

Table B.1: Hyperparameters of NIT omitted from the main paper, corresponding to the experiments in Table 5.3

hyperparameter   Cal Housing           Bike Sharing           MIMIC-III      CIFAR-10 binary
architecture     8-400-300-200-100-1   15-800-600-400-200-1   40-200-100-1   3072-400-300-200-100-1
λ_2              1e−5                  1e−5                   1e−4           1e−5

The architectures of the MLP baselines vary based on hyperparameter tuning and are shown in Table B.2. As before, the architecture sizes are listed from the input through each hidden layer to the output. All baseline MLPs are tuned with L_2 regularization.

Table B.2: Architectures of the MLP baselines

Cal Housing         Bike Sharing       MIMIC-III          CIFAR-10 binary
8-140-100-60-20-1   15-100-100-100-1   40-300-200-100-1   3072-200-200-1

Finally, in the approach of disentangling interactions through all layers (§5.2.5), λ = 5e−3 for both Cal Housing and Bike Sharing. A regularization constant is also used in front of the max term in (5.5), which is 0.5 for Cal Housing and 0.05 for Bike Sharing. As before, the learning rate is set to 5e−2. All other hyperparameters were already mentioned in §5.2.5 and §4.2.1.
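The dash-separated architecture strings above map directly to fully connected layer shapes, so the parameter count they imply can be derived mechanically. A minimal sketch (the helper name and the weights-plus-biases counting convention are assumptions for illustration; it ignores any extra parameters such as NIT's gates):

```python
def mlp_param_count(spec: str) -> int:
    """Count weights + biases of a plain fully connected network whose
    layer sizes are given as a dash-separated string, e.g. '8-140-100-60-20-1'."""
    sizes = [int(s) for s in spec.split("-")]
    # Each layer contributes in*out weights plus out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# e.g. the Cal Housing baseline MLP from Table B.2:
n_params = mlp_param_count("8-140-100-60-20-1")  # 22661
```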
B.2 Learned Interaction Orders

Shown in Table B.3 are statistics of the interaction orders learned by NIT, corresponding to the experiments in Table 5.3. Interaction orders at different K are averaged over 5 folds of cross-validation. The interaction orders of the baselines are the maximum interaction orders they can learn.

Table B.3: Interaction order statistics when all repeated interactions and sparsified blocks are ignored. An order of 1 is a main effect.

Model          Cal Housing           Bike Sharing          MIMIC-III            CIFAR-10 binary
               K    order            K    order            K    order           K    order
LR/GAM         -    1                -    1                -    1               -    1
GA2M           -    2                -    2                -    2               -    2
NIT    max     2    2.0 ± 0.0        2    2.0 ± 0.0        2    2.0 ± 0.0       10   10 ± 0.0
       mean         1.70 ± 0.057          1.83 ± 0.067          1.5 ± 0.12           5.8 ± 0.20
       min          1.0 ± 0.0             1.0 ± 0.0             1.0 ± 0.0            1.8 ± 0.75
       max     3    3.0 ± 0.0        3    3.0 ± 0.0        4    4.0 ± 0.0       15   13.6 ± 0.49
       mean         2.2 ± 0.27            2.69 ± 0.054          3.0 ± 0.24           9 ± 1.1
       min          1.2 ± 0.40            1.6 ± 0.49            1.2 ± 0.40           4 ± 1.9
       max     4    3.8 ± 0.40       4    4.0 ± 0.0        6    5.6 ± 0.80      20   19 ± 1.2
       mean         2.8 ± 0.37            3.6 ± 0.11            4.1 ± 0.80           12.7 ± 0.95
       min          1.2 ± 0.40            1.8 ± 0.75            1.8 ± 0.75           4 ± 2.4
RF/MLP         -    8                -    15               -    40              -    3072

B.3 Importance of Gate Matrix Z

Shown in Table B.4 is a comparison between the original prediction performance of NIT and its performance when the learned gates Z within each network block are shuffled before the second phase of training (§5.2.2). The lowered predictive performance due to shuffling is more pronounced at lower interaction orders and varies across datasets. For example, large performance differences are observed for the Bike Sharing dataset. The smaller performance differences on the MIMIC-III and CIFAR-10 binary datasets may be due to high feature correlations.

Table B.4: The sensitivity of predictive performance to shuffling of the learned gates Z within each network block.
Model      Cal Housing          Bike Sharing          MIMIC-III            CIFAR-10 binary
           K   RMSE             K   RMSE              K   AUC              K    AUC
original   2   0.448 ± 0.0080   2   0.31 ± 0.013      2   0.76 ± 0.011     10   0.849 ± 0.0049
shuffled       0.51 ± 0.089         0.6 ± 0.13            0.73 ± 0.019          0.841 ± 0.0047
original   3   0.437 ± 0.0077   3   0.26 ± 0.015      4   0.76 ± 0.013     15   0.858 ± 0.0020
shuffled       0.47 ± 0.013         0.5 ± 0.15            0.74 ± 0.022          0.854 ± 0.0036
original   4   0.43 ± 0.013     4   0.240 ± 0.0097    6   0.77 ± 0.011     20   0.860 ± 0.0034
shuffled       0.440 ± 0.0082       0.33 ± 0.042          0.75 ± 0.029          0.860 ± 0.0027

Appendix C

Appendix of Chapter 5

C.1 Effect of Extra Parameters by Interaction Encodings vs. Enlarged Embeddings

In this section, we study whether increasing embedding size can obtain similar prediction performance gains as explicitly encoding interactions via GLIDER. We increase the embedding dimension sizes of every sparse feature in the baseline recommender models to match the total number of model parameters of baseline + GLIDER as closely as possible. The embedding sizes we used to obtain similar parameter counts are shown in Table C.1. For the Avazu dataset, all of the embedding sizes remain unchanged because they were already the target size. The corresponding prediction performances of all models are shown in Table C.2. We observed that directly increasing embedding size / parameter counts generally did not give the same level of performance gains that GLIDER provided.

Table C.1: Comparison of # model parameters between baseline models with enlarged embeddings and original baselines + GLIDER (from Tables 6.5 and 6.6). The models with enlarged embeddings are denoted by an asterisk (*). The embedding dimension of sparse features is denoted by "emb. size". Percent differences are relative to baseline* models. M denotes million, and the ditto mark (") means no change from the line above.

Model         Criteo                     Avazu
              emb. size   # params       emb. size   # params
Wide&Deep*    17          19.1M          16          27.3M
Wide&Deep     16          18.1M          16          "
+ GLIDER      16          19.3M (+1.1%)  16          27.6M (+1.0%)
DeepFM*       17          18.5M          16          26.7M
DeepFM        16          17.5M          16          "
+ GLIDER      16          18.3M (−0.9%)  16          26.9M (+0.6%)
Deep&Cross*   17          18.5M          16          26.1M
Deep&Cross    16          17.5M          16          "
+ GLIDER      16          18.7M (+1.0%)  16          26.4M (+1.0%)
xDeepFM*      19          21.5M          16          27.6M
xDeepFM       16          18.5M          16          "
+ GLIDER      16          21.7M (+0.7%)  16          28.3M (+2.5%)
AutoInt*      17          17.4M          16          25.1M
AutoInt       16          16.4M          16          "
+ GLIDER      16          17.3M (−1.0%)  16          25.2M (+0.6%)

C.2 Effect of Dense Feature Bucketization

We examine the effect of dense feature bucketization on cross feature parameter efficiency for the Criteo dataset, which contains 13 dense features. Figure C.1 shows the effects of varying the number of dense buckets on the embedding sizes of the cross features involving dense features. Both the effects on the average and on the individual embedding sizes are shown. 14 out of the 40 cross features involved a dense feature. Different cross features show different parameter patterns as the number of buckets increases (Figure C.1b). On one hand, the parameter count sometimes increases and then asymptotes; our requirement that a valid cross feature ID occur more than T times (§6.1.2.2) restricts the growth in parameters. On the other hand, the parameter count sometimes decreases, which happens when the dense bucket size becomes too small to satisfy the T occurrence restriction.
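The interplay between bucket granularity and the T-occurrence restriction can be sketched as follows (the equal-width bucketing rule and the function names are illustrative assumptions; the thesis does not fix these implementation details here):

```python
from collections import Counter

def bucketize(values, n_buckets):
    """Map each dense value to an equal-width bucket id."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

def valid_cross_ids(bucket_ids, sparse_ids, T):
    """Keep only cross feature IDs (bucket, sparse) occurring more than T times."""
    counts = Counter(zip(bucket_ids, sparse_ids))
    return {cross_id for cross_id, c in counts.items() if c > T}

dense = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sparse = [1, 1, 1, 2, 2, 2]
coarse = valid_cross_ids(bucketize(dense, 1), sparse, T=2)  # 2 valid IDs
fine = valid_cross_ids(bucketize(dense, 6), sparse, T=2)    # 0 valid IDs
```

Making the buckets finer spreads occurrences thinly, so fewer cross feature IDs survive the threshold, matching the decreasing parameter-count pattern discussed above.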
Table C.2: Test prediction performance corresponding to the models shown in Table C.1

Model         Criteo                              Avazu
              AUC             logloss             AUC             logloss
Wide&Deep*    0.8072 ± 3e−4   0.4443 ± 2e−4       0.7794 ± 3e−4   0.3804 ± 2e−4
Wide&Deep     0.8069 ± 5e−4   0.4446 ± 4e−4       "               "
+ GLIDER      0.8080 ± 3e−4   0.4436 ± 3e−4       0.7795 ± 1e−4   0.3802 ± 9e−5
DeepFM*       0.8080 ± 4e−4   0.4435 ± 4e−4       0.7792 ± 3e−4   0.3804 ± 9e−5
DeepFM        0.8079 ± 3e−4   0.4436 ± 2e−4       "               "
+ GLIDER      0.8097 ± 2e−4   0.4420 ± 2e−4       0.7795 ± 2e−4   0.3802 ± 2e−4
Deep&Cross*   0.8081 ± 2e−4   0.4434 ± 2e−4       0.7791 ± 2e−4   0.3805 ± 1e−4
Deep&Cross    0.8076 ± 2e−4   0.4438 ± 2e−4       "               "
+ GLIDER      0.8086 ± 3e−4   0.4428 ± 2e−4       0.7792 ± 2e−4   0.3803 ± 9e−5
xDeepFM*      0.8088 ± 1e−4   0.4429 ± 1e−4       0.7785 ± 3e−4   0.3808 ± 2e−4
xDeepFM       0.8084 ± 2e−4   0.4433 ± 2e−4       "               "
+ GLIDER      0.8097 ± 3e−4   0.4421 ± 3e−4       0.7787 ± 4e−4   0.3806 ± 1e−4
AutoInt*      0.8087 ± 2e−4   0.4431 ± 1e−4       0.7774 ± 1e−4   0.3811 ± 8e−5
AutoInt       0.8083          0.4434              "               "
+ GLIDER      0.8090 ± 2e−4   0.4426 ± 2e−4       0.7773 ± 1e−4   0.3811 ± 5e−5

In all cases, the parameter counts are kept limited, which is important for overall parameter efficiency.

Figure C.1: The effects of varying the number of buckets on (a) the average embedding size of cross features involving dense features and (b) the individual embedding sizes of the same cross features.

C.3 Stop Words

For all qualitative interpretations on text (in §6.2.2.2 and Appendix C.4), we preprocessed sentences to remove stop words. We use the same stop words suggested by Manning et al. (2008), i.e. {a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with}.

C.4 Qualitative Results on Sentiment-LSTM vs. BERT

In this section, we compare the word interactions discovered by MADEX on Sentiment-LSTM versus BERT.
These models perform with accuracies of 87% and 92% respectively on the SST test set. We use a public pre-trained BERT, i.e. DistilBERT (Sanh et al., 2019), which is available online.¹ The interaction detector we use is GradientNID (§6.1.1.2), and sample weighting is disabled for this comparison. The top-2 interactions for each model are shown in Table C.3 on random sentences from the SST test set.

¹ https://github.com/huggingface/transformers

Table C.3: Top-ranked word interactions I_i from Sentiment-LSTM and BERT on randomly selected sentences in the SST test set.

"An intelligent, earnest, intimate film that drops the ball only when it pauses for blunt exposition to make sure you're getting its metaphysical point."
    Sentiment-LSTM: I_1 = {intelligent, metaphysical}, I_2 = {metaphysical, point}    BERT: I_1 = {intelligent, earnest}, I_2 = {drops, ball}

"It's not so much enjoyable to watch as it is enlightening to listen to new sides of a previous reality, and to visit with some of the people who were able to make an impact in the theater world."
    Sentiment-LSTM: I_1 = {not, enjoyable}, I_2 = {not, so}    BERT: I_1 = {not, much}, I_2 = {not, enlightening}

"Uneasy mishmash of styles and genres."
    Sentiment-LSTM: I_1 = {uneasy, mishmash}, I_2 = {mishmash, genres}    BERT: I_1 = {uneasy, mishmash}, I_2 = {uneasy, styles}

"You're better off staying home and watching the X-Files."
    Sentiment-LSTM: I_1 = {x, files}, I_2 = {off, x}    BERT: I_1 = {better, off}, I_2 = {you, off}

"If this is the Danish idea of a good time, prospective tourists might want to consider a different destination – some jolly country embroiled in a bloody civil war, perhaps."
    Sentiment-LSTM: I_1 = {if, this}, I_2 = {if, good}    BERT: I_1 = {if, jolly}, I_2 = {jolly, country}

"We can see the wheels turning, and we might resent it sometimes, but this is still a nice little picture, made by bright and friendly souls with a lot of good cheer."
    Sentiment-LSTM: I_1 = {resent, nice}, I_2 = {we, resent}    BERT: I_1 = {nice, good}, I_2 = {nice, made}

"One of the greatest family-oriented, fantasy-adventure movies ever."
    Sentiment-LSTM: I_1 = {family, oriented}, I_2 = {greatest, family}    BERT: I_1 = {greatest, family}, I_2 = {adventure, movies}

"It's so full of wrong choices that all you can do is shake your head in disbelief – and worry about what classic Oliver Parker intends to mangle next time."
    Sentiment-LSTM: I_1 = {so, wrong}, I_2 = {full, wrong}    BERT: I_1 = {so, wrong}, I_2 = {so, full}

"Its mysteries are transparently obvious, and it's too slowly paced to be a thriller."
    Sentiment-LSTM: I_1 = {mysteries, transparently}, I_2 = {paced, thriller}    BERT: I_1 = {too, thriller}, I_2 = {too, paced}

"This miserable excuse of a movie runs on empty, believing flatbush machismo will get it through."
    Sentiment-LSTM: I_1 = {miserable, runs}, I_2 = {excuse, get}    BERT: I_1 = {runs, empty}, I_2 = {miserable, runs}

C.5 More Qualitative Results for ResNet152

Figure C.2: Additional qualitative results, following Figure 6.2a, on random test images in ImageNet (top predictions include bolo tie, wooden spoon, rhinoceros beetle, jellyfish, potpie, tick, jackfruit, cauliflower, menu, yurt, pill bottle, dome, fur coat, soccer ball, and bluetick). Interactions are denoted by I_i and are unordered. Overlapping interactions with overlap coefficient ≥ 0.5 are merged to reduce {I_i} per test image.

C.6 Detection Performance of MADEX vs. Baselines

We compare the detection performance of MADEX with that of baselines on identifying feature interactions learned by complex models, i.e. XGBoost (Chen & Guestrin, 2016), the Multilayer Perceptron (MLP), and the Long Short-Term Memory network (LSTM) (Hochreiter & Schmidhuber, 1997). The baselines are Tree-Shap, a method to identify interactions in tree-based models like XGBoost (Lundberg et al., 2018); MLP-ACD+, a modified version of ACD (Singh et al., 2019; Murdoch et al., 2018) that searches all pairs of features in an MLP to find the best interaction candidate; and LSTM-ACD+, the same as MLP-ACD+ but for LSTMs.
All baselines are local interpretation methods. For MADEX, we sample continuous features from a truncated normal distribution N(x, σ²I) centered at a specified data instance x and truncated at σ. Our MADEX experiments consist of two methods, NID and GradNID (shorthand for GradientNID).

We evaluate interaction detection performance by using synthetic data where ground-truth interactions are known (Hooker, 2004; Sorokina et al., 2008). We generate 10,000 samples of synthetic data using functions F_1–F_4 (Table C.4) with continuous features uniformly distributed between −1 and 1. Next, we train the complex models (XGBoost, MLP, and LSTM) on this data. Lastly, we run MADEX and the baselines on 10 trials of 20 data instances at randomly sampled locations on the synthetic function domain. Between trials, the complex models are trained with different random initializations to test the stability of each interpretation method. Interaction detection performance is computed by the average R-precision² of interaction rankings across the sampled data instances.

Table C.4: Data generating functions with interactions

F_1(x) = 10 x_1 x_2 + Σ_{i=3}^{10} x_i
F_2(x) = x_1 x_2 + Σ_{i=3}^{10} x_i
F_3(x) = exp(|x_1 + x_2|) + Σ_{i=3}^{10} x_i
F_4(x) = 10 x_1 x_2 x_3 + Σ_{i=4}^{10} x_i

Results are shown in Table C.5. MADEX (NID and GradNID) performs well compared to the baselines. On the tree-based model, MADEX competes with the tree-specific baseline Tree-Shap, which only detects pairwise interactions. On MLP and LSTM, MADEX performs significantly better than ACD+. The performance gain is especially large in the LSTM setting. Comparing NID and GradNID, NID tends to perform better in this experiment because it takes its entire sampling region into account, whereas GradNID examines a single data instance.

² R-precision (Manning et al., 2008) is the percentage of the top-R items in a ranking that are correct, where R is the number of correct items. R = 1 in these experiments.

Table C.5: Detection performance in R-precision (higher is better). σ = 0.6 (max: 3.2). "Tree" is XGBoost. *Does not detect higher-order interactions. †Requires an exhaustive search of all feature combinations.

         Tree                                MLP                                    LSTM
         Tree-Shap   NID         GradNID     MLP-ACD+      NID     GradNID          LSTM-ACD+     NID           GradNID
F_1(x)   1 ± 0       1 ± 0       0.96 ± 0.04 0.63 ± 0.08   1 ± 0   1 ± 0            0.3 ± 0.2     1 ± 0         1 ± 0
F_2(x)   1 ± 0       0.3 ± 0.4   0.6 ± 0.4   0.41 ± 0.06   1 ± 0   0.95 ± 0.04      0.01 ± 0.02   0.99 ± 0.02   0.95 ± 0.04
F_3(x)   1 ± 0       1 ± 0       1 ± 0       0.3 ± 0.2     1 ± 0   1 ± 0            0.05 ± 0.08   1 ± 0         1 ± 0
F_4(x)   *           1 ± 0       †           †             1 ± 0   †                †             1 ± 0         †

C.7 Higher-Order Interactions

This section shows how often different orders of higher-order interactions are identified by GLIDER/MADEX. Figure C.3 plots the occurrence counts of global interactions detected in AutoInt for the Criteo and Avazu datasets, which correspond to the results in Figure 6.3. Here we show the occurrence counts of higher-order interactions, where the exact interaction cardinality is annotated beside each data point. 3-way interactions are the most common type, followed by 4-way, then 5-way interactions.

Figure C.4 plots histograms of interaction cardinalities for all interactions detected from ResNet152 and Sentiment-LSTM across 1000 random samples in their test sets. The average numbers of features are 66 and 18 for ResNet152 and Sentiment-LSTM respectively. Higher-order interactions are common in both models.

Figure C.3: Occurrence counts (total: 1000) vs. rank of interactions detected from AutoInt on (a) Criteo and (b) Avazu datasets. Each higher-order interaction is annotated with its interaction cardinality.
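R-precision, the metric used for the rankings in Table C.5, can be computed directly; a minimal sketch (the function name is illustrative):

```python
def r_precision(ranking, relevant):
    """Fraction of the top-R ranked items that are truly relevant,
    where R is the number of relevant items (ground-truth interactions)."""
    R = len(relevant)
    return len(set(ranking[:R]) & set(relevant)) / R if R else 0.0

# With a single ground-truth interaction (R = 1, as in these experiments),
# the score is 1 if the top-ranked candidate is the true interaction, else 0.
hit = r_precision([(1, 2), (3, 4)], [(1, 2)])   # 1.0
miss = r_precision([(3, 4), (1, 2)], [(1, 2)])  # 0.0
```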
Figure C.4: Histograms of interaction sizes for interactions detected in (a) ResNet152 and (b) Sentiment-LSTM across 1000 random samples in the respective test sets.

Appendix D

Appendix of Chapter 6

D.1 Completeness of a Complementary Attribution Method

Corollary 11 (Completeness of a Complement). An attribution approach φ(I) = f(x⋆) − f(x⁰_I + x⋆_{\I}), similar to what is mentioned in Li et al. (2016) and Jin et al. (2019), also satisfies the completeness axiom.

Proof. Based on Eqs. 7.4–7.6 of Lemma 7:

φ(I_i) = f(x⋆) − f(x⁰_{I_i} + x⋆_{\I_i})
       = ( g_i(x⋆_{I_i}) + Σ_{j=1, j≠i}^{k} g_j(x⋆_{I_j}) + b ) − ( g_i(x⁰_{I_i}) + Σ_{j=1, j≠i}^{k} g_j(x⋆_{I_j}) + b )
       = g_i(x⋆_{I_i}) − g_i(x⁰_{I_i}).

We can then resume with (7.7) of Lemma 7.

D.2 Set Attribution Counterexamples

We now provide counterexamples to identify situations in which the related methods do not satisfy the Set Attribution axiom. Let f(x) = ReLU(x_1 + x_3 + 1) + ReLU(x_2) + 1. f(x) can be written as f(x) = ψ_1(x_{1,3}) + ψ_2(x_{2}), where ψ_1(x) = ReLU(x_1 + x_3 + 1) and ψ_2(x) = ReLU(x_2) + 1. According to the Set Attribution axiom, an interaction attribution method admits attributions as

• ReLU(x_1 + x_3 + 1) for features I_1 = {1, 3}
• ReLU(x_2) + 1 for feature I_2 = {2}.

The above setting serves as a counterexample to the related methods as follows:

• CD always assigns α + α/(α + β) to I_1 and β + β/(α + β) to I_2, where α = ReLU(x_1 + x_3 + 1) and β = ReLU(x_2).
• SCD uses an expectation over an activation decomposition, which does not guarantee admission of ReLU(x_1 + x_3 + 1) for I_1 and ReLU(x_2) for I_2 through their respective decompositions. In the ideal case, SCD becomes CD, which still does not satisfy Set Attribution, as shown above.
• IH always assigns a zero attribution to I_2 from Hessian computations.
IH also does not assign attributions to general sets of features.
• SOC does not assign attributions to general feature sets, only to contiguous feature sequences.
• Both SI and STI assign the following attribution score to I_1:

ReLU(x_1 + x_3 + 1) − ReLU(x_1 + x⁰_3 + 1) − ReLU(x⁰_1 + x_3 + 1) + ReLU(x⁰_1 + x⁰_3 + 1).   (D.1)

There does not exist a selection of x⁰_1 and x⁰_3 such that this attribution becomes ReLU(x_1 + x_3 + 1) for all values of x_1 and x_3.

Proof. We prove via case-by-case contradiction. Only the ReLU(x_1 + x_3 + 1) term can create an interaction between x_1 and x_3, and this term is also the target result, so any nonzero deviation from this term via independent x_1 or x_3 effects in (D.1) must be countered. These independent effects manifest as the ReLU(x_1 + x⁰_3 + 1) and ReLU(x⁰_1 + x_3 + 1) terms respectively. Since ReLU is always non-negative, the only way either of these terms is nonzero is if it is positive, which implies that ReLU(x_1 + x⁰_3 + 1) = x_1 + x⁰_3 + 1 or ReLU(x⁰_1 + x_3 + 1) = x⁰_1 + x_3 + 1. If both terms are positive, their substitution into (D.1) yields ReLU(x_1 + x_3 + 1) − x_1 − x⁰_3 − 1 − x⁰_1 − x_3 − 1 + ReLU(x⁰_1 + x⁰_3 + 1). Even if ReLU(x⁰_1 + x⁰_3 + 1) is positive, we obtain ReLU(x_1 + x_3 + 1) − x_1 − x⁰_3 − 1 − x⁰_1 − x_3 − 1 + x⁰_1 + x⁰_3 + 1 = ReLU(x_1 + x_3 + 1) − x_1 − x_3 − 1. Asserting −x_1 − x_3 − 1 = 0 is a contradiction. If only one of the independent effects is positive, we also cannot assert 0 through similar simplifications.

Now consider the remaining case, where ReLU(x_1 + x⁰_3 + 1) = ReLU(x⁰_1 + x_3 + 1) = ReLU(x⁰_1 + x⁰_3 + 1) = 0. For any real-valued x⁰_1 or x⁰_3, there can also be a negative real-valued x_3 or x_1 respectively. From either of the terms ReLU(x_1 + x⁰_3 + 1) or ReLU(x⁰_1 + x_3 + 1), we obtain ReLU(1) = 0, which is a contradiction.

D.3 Other Axioms

D.3.1 Sensitivity

Lemma 12 (Sensitivity (a)). If x⋆ and x⁰ only differ at features indexed in I and f(x⋆) ≠ f(x⁰), then φ(I) (7.2) yields a nonzero attribution.

Proof.
Since x⋆ and x⁰ only differ at I, the following is true: x⋆_{\I} = x⁰_{\I}. We can therefore write x⋆ as

x⋆ = x⋆_I + x⋆_{\I} = x⋆_I + x⁰_{\I}.

Substituting this equivalence into (7.2), we have

φ(I) = f(x⋆_I + x⁰_{\I}) − f(x⁰) = f(x⋆) − f(x⁰).

Since f(x⋆) − f(x⁰) ≠ 0, we directly obtain φ(I) ≠ 0.

Lemma 13 (Sensitivity (b)). If f does not functionally depend on I, then φ(I) is always zero.

Proof. Since f does not functionally depend on I,

f(x⋆_I + x⁰_{\I}) = f(x⁰_I + x⁰_{\I}) = f(x⁰).

Therefore,

φ(I) = f(x⋆_I + x⁰_{\I}) − f(x⁰) = 0.

D.3.2 Implementation Invariance

Lemma 14 (Implementation Invariance). For functionally equivalent models (with the same input-output mapping), φ(·) are the same.

The definition of (7.2) only relies on function calls to f, which implies Implementation Invariance.

D.3.3 Linearity

Lemma 15 (Linearity on S). If two models f_1, f_2 have the same disjoint feature sets S and f = c_1 f_1 + c_2 f_2, where c_1, c_2 are constants, then φ(I) = c_1 φ_1(I) + c_2 φ_2(I) for all I ∈ S.

Proof. Since f_1 and f_2 have the same S = {I_i}_{i=1}^{k}, we can write f_1 and f_2 as follows via (7.4) in Lemma 7:

f_1(x) = Σ_{i=1}^{k} g_i^{(1)}(x_{I_i}) + b^{(1)},   (D.2)
f_2(x) = Σ_{i=1}^{k} g_i^{(2)}(x_{I_i}) + b^{(2)}.   (D.3)

Since f = c_1 f_1 + c_2 f_2,

f(x) = c_1 f_1(x) + c_2 f_2(x)   (D.4)
     = Σ_{i=1}^{k} c_1 g_i^{(1)}(x_{I_i}) + c_1 b^{(1)} + Σ_{i=1}^{k} c_2 g_i^{(2)}(x_{I_i}) + c_2 b^{(2)}   (D.5)
     = Σ_{i=1}^{k} ( c_1 g_i^{(1)}(x_{I_i}) + c_2 g_i^{(2)}(x_{I_i}) ) + c_1 b^{(1)} + c_2 b^{(2)}.   (D.6)

By grouping terms as g_i(x_{I_i}) = c_1 g_i^{(1)}(x_{I_i}) + c_2 g_i^{(2)}(x_{I_i}) and b = c_1 b^{(1)} + c_2 b^{(2)}, we write (D.6) as

f(x) = Σ_{i=1}^{k} g_i(x_{I_i}) + b.   (D.7)

From the form of (D.7), we can invoke (7.6), φ(I_i) = g_i(x⋆_{I_i}) − g_i(x⁰_{I_i}), via Lemma 7. This equation is rewritten as

φ(I_i) = g_i(x⋆_{I_i}) − g_i(x⁰_{I_i})   (D.8)
       = ( c_1 g_i^{(1)}(x⋆_{I_i}) + c_2 g_i^{(2)}(x⋆_{I_i}) ) − ( c_1 g_i^{(1)}(x⁰_{I_i}) + c_2 g_i^{(2)}(x⁰_{I_i}) )   (D.9)
       = c_1 ( g_i^{(1)}(x⋆_{I_i}) − g_i^{(1)}(x⁰_{I_i}) ) + c_2 ( g_i^{(2)}(x⋆_{I_i}) − g_i^{(2)}(x⁰_{I_i}) )   (D.10)
       = c_1 φ_1(I_i) + c_2 φ_2(I_i).   (D.11)

By noting that S = {I_i}_{i=1}^{k}, this concludes the proof.

D.3.4 Symmetry-Preserving

We first define symmetric feature sets as a generalization of the "symmetric variables" of Sundararajan et al. (2017). Feature index sets I_1 and I_2 are symmetric with respect to a function f if swapping the features in I_1 with the features in I_2 does not change the function. This implies that for symmetric I_1 and I_2, their cardinalities are the same, |I_1| = |I_2|, and they are disjoint sets, in order to swap the features at any valid set index.

Lemma 16 (Symmetry-Preserving). For x⋆ and x⁰ that each have identical feature values between symmetric feature sets with respect to f, the symmetric feature sets receive identical attributions φ(·).

Proof. Since x⋆ and x⁰ each have identical feature values between the symmetric feature sets,

{x⋆_i}_{i∈I_1} = {x⋆_j}_{j∈I_2},   {x⁰_i}_{i∈I_1} = {x⁰_j}_{j∈I_2}.

Therefore, the symmetry implies the following for any x in the domain of f:

f(x⋆_{I_1} + x⁰_{I_2} + x_{\(I_1∪I_2)}) = f(x⁰_{I_1} + x⋆_{I_2} + x_{\(I_1∪I_2)}).   (D.12)

Setting x = x⁰, we rewrite (D.12) as

0 = f(x⋆_{I_1} + x⁰_{I_2} + x⁰_{\(I_1∪I_2)}) − f(x⁰_{I_1} + x⋆_{I_2} + x⁰_{\(I_1∪I_2)})
  = f(x⋆_{I_1} + x⁰_{\I_1}) − f(x⋆_{I_2} + x⁰_{\I_2})
  = ( f(x⋆_{I_1} + x⁰_{\I_1}) − f(x⁰) ) − ( f(x⋆_{I_2} + x⁰_{\I_2}) − f(x⁰) )
  = φ(I_1) − φ(I_2).

Therefore, φ(I_1) = φ(I_2).

D.4 Discrete Mixed Partial Derivatives Detect Non-Additive Statistical Interactions

A generalized additive model f_g is given by

f_g(x) = Σ_{i=1}^{p} g_i(x_i) + b,   (D.13)

where g_i(·) can be any function of the individual feature x_i and b is a bias. Since each x_i of x ∈ X only takes on two values, a line can connect all valid points in each feature. Therefore, (D.13) is equivalent to

f_ℓ(x) = Σ_{i=1}^{p} w_i x_i + b,   (D.14)

for weights w_i ∈ R, with the function domain being X.

For the case where p = 2, the discrete mixed partial derivative is given by (7.9), or

∂²f / ∂x_1∂x_2 = (1 / (h_1 h_2)) ( f([x⋆_1, x⋆_2]) − f([x⋆_1, x⁰_2]) − f([x⁰_1, x⋆_2]) + f([x⁰_1, x⁰_2]) ),

where h_1 = |x⋆_1 − x⁰_1| and h_2 = |x⋆_2 − x⁰_2|. Since any three points (not on the same line) define a plane of the form (D.14) (p = 2), we can write the fourth point as having a function value with deviation δ from the plane:

∂²f / ∂x_1∂x_2 = (1 / (h_1 h_2)) ( f([x⋆_1, x⋆_2]) − f([x⋆_1, x⁰_2]) − f([x⁰_1, x⋆_2]) + f([x⁰_1, x⁰_2]) )   (D.15)
              = (1 / (h_1 h_2)) ( (w_1 x⋆_1 + w_2 x⋆_2 + b + δ) − (w_1 x⋆_1 + w_2 x⁰_2 + b) − (w_1 x⁰_1 + w_2 x⋆_2 + b) + (w_1 x⁰_1 + w_2 x⁰_2 + b) )   (D.16)
              = δ / (h_1 h_2).   (D.17)

If (D.17) is 0, then δ = 0, which implies that f can be written as (D.14). δ ≠ 0 implies the opposite: that f cannot be written in linear form (by definition). Since (D.14) is equivalent to (D.13) in the domain X, this implies that δ ≠ 0 if and only if f(x) ≠ g_1(x_1) + g_2(x_2) + b. Based on Def. 1, we can conclude that a nonzero discrete mixed partial derivative w.r.t. x_1 and x_2 in the space X at p = 2 detects a non-additive statistical interaction between the two features.

For the case where p > 2, Def. 1 states that a pairwise interaction {i, j} exists in f if and only if f(x) ≠ f_i(x_{\{i}}) + f_j(x_{\{j}}) for functions f_i(·) and f_j(·). This means that {i, j} is declared to be an interaction if a local {i, j} interaction occurs at any x_{\{i,j}}, x ∈ X. Therefore, we can detect non-additive statistical interactions {i, j} for general p ≥ 2 via

E_x [ (∂²f / ∂x_i∂x_j)² ] > 0,

which mirrors the definition of pairwise interaction for real-valued x in Friedman & Popescu (2008).
D.5 Attributions Compared to Annotation Labels

Figure D.1: Text explanation metrics ((a) Word ρ, (b) Any Phrase ρ, (c) Multi-Word Phrase ρ) versus the top and bottom % of attributions retained, for different attribution methods (Difference, IG, IH, MAHE, SI, STI, SCD, SOC, ArchAttribute) on BERT over the SST test set. These plots expand the analysis of Table 7.2.

Figure D.2: Image explanation metric (segment AUC) versus the top and bottom % of attributions retained, for different attribution methods (Difference, IG, MAHE, SI, STI, ArchAttribute) on ResNet152 over the MS COCO test set. These plots expand the analysis of Table 7.2.

D.6 Visualization Comparisons

D.6.1 Sentiment Analysis

Visualization comparisons of different attribution methods on BERT are shown in Figures D.4–D.8 for random test sentences from SST. The visualization format is the same as Figure 7.4. Note that all individual feature attributions that correspond to stop words (Manning et al., 2008) are omitted in these comparisons and in Figures 7.1 and 7.4.

D.6.2 Image Classification

Figure D.3: Our ResNet152 visualizations on random test images from ImageNet (classified as, e.g., great dane, spider monkey, obelisk, snow leopard, apron, black stork, waffle iron, polaroid camera, and greater swiss mountain dog). Colored outlines indicate interactions with positive attribution. f_c is the image classification result.
To the best of our knowledge, only this work shows interactions that support the image classification via interaction attribution.

In Figure D.3, we visualize our explanations on S via the top-5 pairwise interactions (§7.1.2.2), where interactions with positive attribution are shown for clarity. The images are randomly selected from the ImageNet test set. It is interesting to see which image parts interact, such as the eyes in the "great dane" image. Visualization comparisons of different attribution methods on ResNet152 are shown in Figures D.9–D.13 for the same random test images from ImageNet.

D.7 ArchDetect Ablation Visualizations

We run an ablation study removing the x⁰_{\{i,j}} baseline context from (7.11) for disjoint interaction detection and examine its effect on visualizations. The visualizations are shown in Figure D.14 for sentiment analysis and in Figures D.15 and D.16 for image classification. Top-3 and top-5 pairwise interactions are used in sentiment analysis and image classification respectively, before merging the interactions.

Figure D.4: Text Viz. Comparison A, for the inputs "I regret to report that these ops are just not extreme enough." (classification: neg) and "It's a worse sign when you begin to envy her condition." (classification: neg) across Archipelago, Difference + ArchDetect, IG, IG + ArchDetect, IH, LIME, MAHE, SI, and STI. In the first text example, "regret, not extreme enough" is a meaningful and strongly negative interaction. In the second example, "when you begin to" interacts to diminish its overall attribution magnitude.

Figure D.5: Text Viz. Comparison B, for the inputs "It's solid and affecting and exactly as thought-provoking as it should be." (classification: pos) and "A lousy movie that's not merely unwatchable, but also unlistenable." (classification: neg). In the first text example, "thought provoking" is a meaningful and strongly positive interaction. In the second example, the "lousy, un" interaction factors in a large context to make a negative text classification.

Figure D.6: Text Viz. Comparison C, for the inputs "Tsai Ming-liang has taken his trademark style and refined it to a crystalline point." (classification: pos) and "As an actor, The Rock is aptly named." (classification: pos). In the first text example, "refined, to a crystalline" is a meaningful and strongly positive interaction. In the second example, "is aptly named" is also a meaningful and strongly positive interaction.

Text inputs for the next comparison figure: "The ending is a cop-out." (classification: neg) and "A feel-good picture in the best sense of the term." (classification: pos).
IH 0.25 0.00 0.25 0.50 0.75 1.00 Attribution (normalized) the, best best, of best, sense best, the in, best interaction LIME a feel - good picture in the best sense of the term . MAHE 1.00 0.75 0.50 0.25 0.00 0.25 0.50 Attribution (normalized) feel, good, best good, in, best interaction SI a feel - good picture in the best sense of the term . 1.0 0.8 0.6 0.4 0.2 0.0 0.2 Attribution (normalized) in, best a, feel good, best feel, good good, in interaction STI a feel - good picture in the best sense of the term . 0.6 0.4 0.2 0.0 0.2 0.4 Attribution (normalized) good, best a, feel feel, good in, best best, sense interaction Figure D.7: Text Viz. Comparison D. In the first text example, “the ending, out” is a meaningful and negative interaction. In the second example, “a feel good, best” is a meaningful and strongly positive interaction. 156 Text input: "All prints of this film should be sent to and buried on Pluto ." Classification: neg Archipelago all prints of this film should be sent to and buried on pluto . neg pos Difference + ArchDetect all prints of this film should be sent to and buried on pluto . IG all prints of this film should be sent to and buried on pluto . IG + ArchDetect all prints of this film should be sent to and buried on pluto . IH 1.0 0.5 0.0 0.5 Attribution (normalized) should, be all, this and, on of, and to, buried interaction LIME all prints of this film should be sent to and buried on pluto . MAHE 0.25 0.00 0.25 0.50 0.75 1.00 Attribution (normalized) all, prints to, buried sent, and, buried buried, pluto interaction SI all prints of this film should be sent to and buried on pluto . 1.00 0.75 0.50 0.25 0.00 0.25 Attribution (normalized) film, sent film, on to, buried all, buried be, sent interaction STI all prints of this film should be sent to and buried on pluto . 
0.75 0.50 0.25 0.00 0.25 0.50 Attribution (normalized) be, buried sent, buried sent, to all, should be, to interaction Text input: "Arguably the year 's silliest and most incoherent movie ." Classification: neg Archipelago arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . Difference + ArchDetect arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . IG arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . IG + ArchDetect arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . IH 1.0 0.5 0.0 0.5 Attribution (normalized) year, s arguably, . inc, -oh year, movie arguably, year interaction LIME arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . MAHE 0.5 0.0 0.5 1.0 Attribution (normalized) the, year arguably, most -oh, -ere, -nt arguably, the inc, -oh interaction SI arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . 0.2 0.0 0.2 0.4 0.6 0.8 1.0 Attribution (normalized) si, most the, inc ', -nt -st, most arguably, the interaction STI arguably the year ' s si -llie -st and most inc -oh -ere -nt movie . 0.6 0.4 0.2 0.0 0.2 0.4 0.6 Attribution (normalized) inc, -oh arguably, the year, movie -ere, movie si, -llie interaction Figure D.8: Text Viz. Comparison E. In the first text example, “film should be, buried” is a meaningful and strongly negative interaction. In the second example, “-oherent” belongs to a negative word “incohorent”. 
[Figures D.9–D.16 are image attribution visualizations whose panels do not survive text extraction. Figures D.9–D.13 compare interaction attributions from Archipelago, Difference + ArchDetect, IG + ArchDetect, LIME, MAHE, SI, and STI on pairs of ImageNet-classified images; Figures D.14–D.16 ablate the baseline context proposed in ArchDetect. Only the captions are recoverable:]

Figure D.9: Image Viz. Comparison A. In the first image example, the dog's eyes are a meaningful interaction supporting the classification. In the second example, the monkey's head is also a positive interaction.

Figure D.10: Image Viz. Comparison B. In the first image example, the obelisk tip is a meaningful interaction supporting the classification. In the second example, the leopard's face is also a positive interaction.

Figure D.11: Image Viz. Comparison C. In the first image example, different patches of the apron are interactions supporting the classification. In the second example, the stork's body is an interaction that strongly supports the classification.

Figure D.12: Image Viz. Comparison D. In the first image example, certain small patches of the waffle iron interact, one of which supports the classification. In the second example, the leopard's face is the primary positive interaction.

Figure D.13: Image Viz. Comparison E. In the first image example, different parts of the Polaroid camera are interactions that positively support the classification. In the second example, the dogs' heads and bodies are also positive interactions.

Figure D.14: Text Viz. with ArchDetect Ablation. The interactions tend to use more salient words when including the baseline context, which is proposed in ArchDetect.

Figure D.15: Image Viz. with ArchDetect Ablation A. The interactions tend to focus more on salient patches of the images when including the baseline context, which is proposed in ArchDetect.

Figure D.16: Image Viz. with ArchDetect Ablation B. The interactions tend to focus on salient patches of the images when including the baseline context.
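The figures above attribute predictions to feature interactions, that is, non-additive effects between features of a single input. As a minimal illustration of the idea behind such tests, the sketch below checks a pair of features for interaction in a generic black-box model using a discrete mixed difference between input and baseline values, in the spirit of ArchDetect's pairwise test. The helper name and the toy model are hypothetical, not the dissertation's exact implementation.

```python
def pairwise_interaction_strength(f, x, baseline, i, j):
    """Discrete mixed-difference test for whether features i and j
    interact (non-additively) in the black-box function f.
    Illustrative sketch only, not an exact reimplementation."""
    def eval_with(use_i, use_j):
        # Start from the baseline and selectively insert the input's
        # values at positions i and j.
        z = list(baseline)
        if use_i:
            z[i] = x[i]
        if use_j:
            z[j] = x[j]
        return f(z)

    # If f is additive in features i and j, this quantity is exactly 0.
    return abs(eval_with(True, True) - eval_with(True, False)
               - eval_with(False, True) + eval_with(False, False))


# Toy black-box: features 0 and 1 interact multiplicatively; feature 2 is additive.
f = lambda z: z[0] * z[1] + z[2]
x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
print(pairwise_interaction_strength(f, x, baseline, 0, 1))  # 6.0 -> interaction
print(pairwise_interaction_strength(f, x, baseline, 0, 2))  # 0.0 -> additive
```

The choice of baseline matters: the ablation figures above show that including a baseline context (as ArchDetect proposes) steers detected interactions toward more salient words and image patches.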
Abstract
The interpretability of machine learning prediction systems is important for reasons such as transparency, ethics, accountability, scientific discovery, and model debugging. This thesis aims to expand the interpretability of high-performance prediction models by developing new analysis tools. We are interested in explaining a key source of these models' high performance: feature interactions. To this end, we study how to interpret high-performance neural networks and, more generally, black-box models.

This thesis considers two qualities of model interpretations: 1) fundamental understanding and 2) practical utility. In terms of fundamental understanding, this work develops interpretations of the feature interactions learned by neural network parameters, as well as a principled approach to attributing feature interactions to black-box predictions. In terms of practical utility, this work emphasizes the accuracy and efficiency of explanations, qualitative and quantitative interpretability, and new perspectives such as improving prediction performance via model interpretations. Feature interactions offer insightful views into the complexity of current prediction models.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Learning controllable data generation for scalable model training
Machine learning in interacting multi-agent systems
Advanced machine learning techniques for video, social and biomedical data analytics
Generating and utilizing machine explanations for trustworthy NLP
Deep learning models for temporal data in health care
Neural sequence models: Interpretation and augmentation
Bridging the visual reasoning gaps in multi-modal models
Identifying and mitigating safety risks in language models
Improving machine learning algorithms via efficient data relevance discovery
Multimodal representation learning of affective behavior
Fair Machine Learning for Human Behavior Understanding
Learning to optimize the geometry and appearance from images
Computational modeling of behavioral attributes in conversational dyadic interactions
Simulation and machine learning at exascale
Physics-aware graph networks for spatiotemporal physical systems
User modeling for human-machine spoken interaction and mediation systems
Externalized reasoning in language models for scalable and trustworthy AI
Dynamic topology reconfiguration of Boltzmann machines on quantum annealers
Fairness in machine learning applied to child welfare
Photoplethysmogram-based biomarker for assessing risk of vaso-occlusive crisis in sickle cell disease: machine learning approaches
Asset Metadata
Creator: Tsang, Michael Yunn-Horng (author)
Core Title: Interpretable machine learning models via feature interaction discovery
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 11/29/2020
Defense Date: 06/11/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: explainable AI, feature interaction, interpretable machine learning, model interpretability, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Liu, Yan (committee chair), Putnam-Hornstein, Emily (committee member), Ren, Xiang (committee member)
Creator Email: themichaeltsang@gmail.com, tsangm@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-402556
Unique identifier: UC11668079
Identifier: etd-TsangMicha-9161.pdf (filename), usctheses-c89-402556 (legacy record id)
Legacy Identifier: etd-TsangMicha-9161.pdf
Dmrecord: 402556
Document Type: Dissertation
Rights: Tsang, Michael Yunn-Horng
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA