A RIGOROUS STUDY OF GAME-THEORETIC ATTRIBUTION AND INTERACTION METHODS FOR MACHINE LEARNING EXPLAINABILITY

by

Daniel Lundstrom

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)

August 2023

Copyright 2023 Daniel Lundstrom

Dedication

To my wife, Abby, and my children.

Acknowledgments

I would like to acknowledge my advisor, Meisam Razaviyayn, for taking me on as an advisee. He was always kind, wise, and generous with his time and advice. He was a model of mentorship and excellence. I would like to thank Stanislav Minsker, my advisor in the Math Department. He was a helpful resource and supported me all along the way to my defense. I would also like to acknowledge Tianjian Huang and Ali Ghafelebashi for their help on the experiments sections.

Table of Contents

Dedication
Acknowledgments
Abstract

Chapter 1: Introduction
  1 Background Review

Chapter 2: A Rigorous Study of Integrated Gradients and an Extension to Internal Neuron Attribution
  1 Introduction to Attributions and the Integrated Gradients Method
  2 Remarks on Original IG Paper and Other Uniqueness Claims
  3 Establishing Ensemble Uniqueness Claims with Non-Decreasing Positivity
  4 Lipschitz Continuity of Integrated Gradients
  5 Axioms for a Distribution Baseline
  6 Internal Neuron Attributions
  7 Empirical Evaluations

Chapter 3: Analyzing Interactions with Synergy Functions
  1 Introduction to Interactions
  2 Möbius Transforms as a Complete Account of Interactions
  3 Synergy Distribution for Binary Feature Methods
  4 Synergy Distribution for Gradient-Based Methods: The Monomial
  5 Table of Methods
  6 Empirical Evaluations

Chapter 4: Four Characterizations of IG
  1 Characterizing IG among Path Methods with Symmetry-Preserving and ASI
  2 Characterizing IG with ASI and Proportionality
  3 Characterizing IG with Symmetric Monotonicity
  4 Characterizing IG with Attribution to Monomials
  5 Table of Characterizations

Chapter 5: Conclusion
  1 Results of the Study and Characterization of Integrated Gradients
  2 Results of the Analysis of Interactions using Synergy Functions

Bibliography

Chapter A: Supplementary Material on Previous IG and Path Method Papers
  1 Computing Path Methods
  2 Figure 1 from Sundararajan et al. [2017]
  3 Counterexample to Claim 1
  4 Proof of Lemma 1
  5 Counterexamples to Other Uniqueness Claims and Proof with NDP
Chapter B: On Characterizing Ensembles of Monotone Path Methods
  1 Comment on Conjecture 1
  2 Proof of Theorem 2
  3 Proof of Lemma 2
  4 Proof of Theorem 3

Chapter C: Supplementary Material on IG Lipschitzness and Extensions
  1 Proof of Theorem 4
  2 Distributional IG Satisfies Distributional Attribution Axioms
  3 Additional Experiments from Section 7
  4 Model Architecture and Training Parameters

Chapter D: Proofs of Synergy Function Theorems
  1 Proof of Theorem 5
  2 Proof of Corollary 1
  3 Proof of Corollary 2

Chapter E: Supplementary Material and Proofs for Various kth-Order Interaction Methods
  1 Statement of Symmetry Axiom
  2 Proof of Theorem 7
  3 Interaction Methods
  4 Experimental Details and Additional Results

Chapter F: Supplementary Materials on Four Characterizations of IG
  1 Symmetry-Preserving Alone is Insufficient to Characterize IG Among Path Methods
  2 Proof of Theorem 8
  3 Proof of Theorem 9
  4 Proof of Theorem 10
  5 Proof of Theorem 11
  6 Proof of Theorem 12
  7 Softplus Approximations Converge Uniformly
  8 Proof of Theorem 13
  9 Proof of Corollary 5

Abstract

Deep learning has revolutionized many areas of machine learning, from computer vision to natural language processing, but these high-performance models are generally "black box." Explaining such models would improve transparency and trust in AI-powered decision making and is necessary for understanding other practical needs such as robustness and fairness. A popular means of enhancing model transparency is to quantify how individual inputs contribute to model outputs (called attributions) and the magnitude of interactions between groups of inputs. A growing number of these methods import concepts and results from game theory to produce attributions and interactions. This work studies these methods. We analyze the popular integrated gradients method (IG), outlining issues with multiple claims that it uniquely satisfies certain sets of desirable properties. We recover these results with the addition of a desideratum, non-decreasing positivity.
In all, we provide four different sets of properties that IG uniquely satisfies. We also study aspects of IG such as sensitivity to input perturbations, a formulation of IG where the reference baseline is a distribution of inputs, and a method of scoring internal neuron contributions based on IG. Beyond IG, we study the extension of attribution methods to interaction methods, which quantify the unique effects different groups of inputs have on a model's output. In particular, we study methods that quantify interactions between any subset of inputs, and kth-order interaction methods, which report interactions for groups up to size k. We show that, given modest assumptions, a unique full account of interactions between features, called synergies, is possible in the continuous input setting. This unique full account of interactions is based on the Möbius transform, and induces a unique decomposition of any real-valued function into a sum of synergy functions. We go on to detail existing and novel interaction methods, showing that they are defined by their action on synergy functions, and, for gradient-based methods, by their action on monomials. We experimentally validate our method of attributing to internal neurons on a ResNet-152 model trained on ImageNet and a custom model trained on Fashion-MNIST. We experimentally validate various interaction methods on a custom model trained on a protein tertiary structure dataset.

Chapter 1
Introduction

Explainability is a topic of ever-increasing interest in the machine learning (ML) community. Various ML methods, including deep neural networks, have unprecedented accuracy and functionality, but their models are generally considered "black box" and unexplained. Without "explaining" a model's workings, it can be difficult to troubleshoot issues, improve performance, guarantee accuracy, or ensure other performance criteria such as fairness. This can lead to a lack of user trust in the model. Various governmental bodies have identified the importance of safe and responsible AI development. These government bodies have discussed potential future regulations that would require ML models to be transparent in certain scenarios [House, 2022], [Commission, 2021], [Dorries, 2022].

Various reasons have been cited for the lack of ML transparency, particularly in deep neural networks (DNNs). DNNs can contain thousands of layers with billions or trillions of parameters, and can be trained on billions of data points via stochastic gradient descent. Because of this complexity, the role of any one parameter or layer is hard to determine. The goal of many architecture and training choices is to maximize accuracy without concern for explainability, producing models which are hard to understand.

[Figure 1.1: Linardatos et al. [2020]'s taxonomy of explainability methods. Attribution methods are local, post-hoc methods for specific models and a variety of input types.]

A variety of approaches have been employed to address the explainability issue of neural networks. Taking the taxonomy of Linardatos et al. [2020], some methods are universal in application (called model agnostic) [Ribeiro et al., 2016], while others are limited to specific types of models (model specific) [Binder et al., 2016]. Some model-specific methods are limited to a certain data type, such as image [Selvaraju et al., 2017] or tabular data [Ustun and Rudin, 2016].
Some methods are global, i.e., they seek to explain a model's workings as a whole [Ibrahim et al., 2019], while others are local, explaining how a model works for a specific input [Zeiler and Fergus, 2014]. Finally, some methods seek to make models that are intrinsically explainable [Letham et al., 2015], while others, called post hoc, are designed to be applied to a black box model without explaining it [Springenberg et al., 2014]. These post hoc methods may seek to ensure fairness, test model sensitivity, or indicate which features are important to a model's prediction.

This thesis focuses on the concepts of attributions and interactions. Attributions are local, post hoc explainability methods that indicate which features of an input contributed to a model's output [Lundberg and Lee, 2017], [Sundararajan et al., 2017], [Sundararajan and Najmi, 2020], [Binder et al., 2016], [Shrikumar et al., 2017]. Interactions, on the other hand, are methods that indicate which groups of features may have interacted, producing effects beyond the sum of their parts [Masoomi et al., 2021], [Chen and Ye, 2022], [Sundararajan et al., 2020], [Janizek et al., 2021], [Tsai et al., 2022], [Blücher et al., 2022], [Zhang et al., 2021], [Liu et al., 2020], [Tsang et al., 2020a], [Hamilton et al., 2021], [Tsang et al., 2020b], [Hao et al., 2021], [Tsang et al., 2017], [Tsang et al., 2018].

In this thesis, we focus on game-theoretic attribution and interaction methods, a common and fruitful approach to attributions and interactions. These methods are called such because they apply methods from the game-theoretic cost-sharing literature [Shapley and Shubik, 1971], [Aumann and Shapley, 1974], and have the advantages of 1) already having a well-developed theory and 2) producing methods that uniquely satisfy certain desirable qualities.

In our study of attributions, we focus on the popular integrated gradients method, inspired by the Aumann-Shapley cost-sharing method [Aumann and Shapley, 1974]. Introduced in the paper "Axiomatic Attribution for Deep Networks" [Sundararajan et al., 2017], the integrated gradients method produces input attributions using a baseline input as a comparative reference. The original paper included claims that the method uniquely satisfies a set of desirable axioms. This method has inspired numerous modifications and extensions, including an attribution method using a distribution of baselines [Erion et al., 2021], a method for attributing to internal neurons [Dhamdhere et al., 2018], and a method to identify pairwise interactions [Janizek et al., 2021].

In our study of interactions, we focus on a particular type of interaction called a kth-order interaction. A kth-order interaction identifies interactions between groups of inputs only in the case where that group of inputs is no larger than size k.

The contributions of this thesis are as follows:

• We outline issues with, and provide a counterexample to, a central claim in the original integrated gradients paper [Sundararajan et al., 2017] which states that IG uniquely satisfies a given set of axioms. We also criticize two other claims in the related literature about the integrated gradients method and a parent family of attribution methods, the path methods.

• We recover the results with the addition of a proposed axiom: non-decreasing positivity.

• We provide four different sets of properties that characterize IG. These characterizations establish that no method but integrated gradients satisfies the stipulated set of properties.
• We analyze the sensitivity of integrated gradients to the explained input by establishing Lipschitz bounds for IG for a certain class of practical models, and by outlining a common situation where IG may fail to be Lipschitz.

• We consider the axioms used in the original integrated gradients paper [Sundararajan et al., 2017] and formulate parallel axioms for the extension of IG where the reference baseline is a distribution of inputs rather than a singleton.

• In the context of image processing, integrated gradients produces an image mask which highlights features of the explained input image. We introduce a method of identifying which internal neurons contributed to a particular feature being highlighted in the image mask. We experimentally validate this approach on a ResNet-152 model trained on ImageNet and a custom model trained on Fashion-MNIST.

• In our study of interactions, we show that, given modest assumptions, a unique full account of interactions between features, called synergies, is possible in the continuous input setting.

• We leverage the idea of synergies to analyze existing and novel kth-order interaction methods. We show that these methods are defined by their action on synergy functions, and, for gradient-based methods, by their action on a particular type of synergy function, the monomial.

• We experimentally compare pre-existing and novel interaction methods on a custom model trained on a protein tertiary structure dataset.

1 Background Review

Attributions are an effort to explain the behavior of models on specific inputs. Suppose, as an example, that F : ℝ^n → ℝ is an object classification model, so that given an image x ∈ ℝ^n, F(x) gives a confidence score, or a probability that the image belongs to a specific category. For a specific input to be explained, x̄, the purpose of an attribution is to indicate which pixels, or components of x̄, contributed to the reported confidence score F(x̄). An early approach practitioners used to gauge the contribution of x̄_i was to inspect the gradient of F with respect to that input component, ∂F/∂x_i(x̄) [Baehrens et al., 2010], [Simonyan et al., 2013]. The reasoning was that the magnitude of the derivative indicates the sensitivity of F to the input.

Various attribution methods have been developed to address the explainability problem. Deconvolutional networks [Zeiler and Fergus, 2014] employ deep networks to produce attributions, while guided back-propagation [Springenberg et al., 2014] gives attributions to internal neuron activations. Methods such as DeepLIFT [Shrikumar et al., 2017] and Layer-wise Relevance Propagation [Binder et al., 2016] employ a baseline to use as a comparison to the input (called baseline attributions). Further methods include Zhou et al. [2016], Zintgraf et al. [2016].

There are two well-known and widely used game-theoretic attributions: the Integrated Gradients method (IG) [Sundararajan et al., 2017] and the Shapley value [Lundberg and Lee, 2017]. Both are baseline methods, meaning that they attribute to an input x̄ relative to some comparative baseline value x′. IG is an application of the well-known Aumann-Shapley cost-share solution to the ML attribution problem. It is defined as follows: for a given input x̄, comparative baseline x′, and model F, the IG attribution to x̄_i is given by:

\[
IG_i(\bar{x}, x', F) = (\bar{x}_i - x'_i) \int_0^1 \frac{\partial F}{\partial x_i}\big(x' + t(\bar{x} - x')\big)\, dt \tag{1.1}
\]

[Footnote 1: Technical details that ensure IG exists are discussed in Chapter 2.]

IG is an example of a gradient-based method, meaning that it is computed using the gradient of the model.
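To make the computation in Eq. (1.1) concrete, the following is a minimal sketch (ours, not part of the thesis) of the standard Riemann-sum approximation of IG. It assumes a PyTorch callable `model` that maps a 1-D input tensor to a scalar score; all names are illustrative.

```python
import torch

def integrated_gradients(model, x, baseline, steps=64):
    """Riemann-sum approximation of Eq. (1.1).

    model    : callable mapping a 1-D tensor to a scalar tensor, F
    x        : explained input, x-bar
    baseline : reference input, x'
    """
    # Midpoints t = (k + 0.5)/steps along the straight line x' + t(x - x').
    ts = (torch.arange(steps, dtype=x.dtype) + 0.5) / steps
    grad_sum = torch.zeros_like(x)
    for t in ts:
        point = (baseline + t * (x - baseline)).detach().requires_grad_(True)
        output = model(point)                       # F(gamma(t)), a scalar
        grad, = torch.autograd.grad(output, point)  # dF/dx evaluated on the path
        grad_sum += grad
    # Average gradient along the path, scaled componentwise by (x - x').
    return (x - baseline) * grad_sum / steps
```

Increasing `steps` tightens the approximation of the path integral; the completeness property discussed below gives a practical check, since the returned attributions should sum to roughly F(x̄) − F(x′).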
The Shapley value in ML, on the other hand, is an adaptation of the famous Shapley value cost-share solution to the ML attribution problem. To define the Shapley value, let N = {1, ..., n} and suppose x′ ∈ ℝ^n is some baseline reference input. Then for S ⊆ N and x ∈ ℝ^n, define x_S by

\[
(x_S)_i =
\begin{cases}
x_i & \text{if } i \in S\\
x'_i & \text{if } i \notin S,
\end{cases}
\tag{1.2}
\]

where x_i is the i-th element of x and x′_i is the i-th element of x′. Then for a given input x̄, baseline x′, and model F, the Shapley value is defined as:

\[
Shap_i(\bar{x}, F) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} \big(F(\bar{x}_{S \cup \{i\}}) - F(\bar{x}_S)\big), \tag{1.3}
\]

where \binom{n-1}{|S|} ≜ (n−1)! / ((n−1−|S|)! |S|!) denotes the number of subsets of size |S| of n−1 features. The Shapley value is an example of a binary-features method, meaning that it treats each input feature x_i as a binary variable taking the values x̄_i or x′_i, and assigns attributions based solely on F evaluated at the points x̄_S, S ⊆ N.

Methods in the game-theoretic approach seek to conform to axioms, or desirable properties. While some properties may not be as essential as the word "axiom" typically denotes, the term "axioms" is widely used in the literature, so we adopt its use here. Examples of axioms include requiring that x̄_i receive an attribution of 0 if F does not vary in the input component x_i (called Dummy or Null Feature), or that attributions be linear in F (called Linearity). The axioms usually have direct analogues in the field of cost sharing in game theory.

To further understand these methods and axioms, we detail the cost-sharing problem. The cost-sharing problem can be described as follows: suppose a group of n agents have various demands for services or goods, with x̄_i representing the demand of agent i, 1 ≤ i ≤ n. Let F : ℝ^n → ℝ denote a cost function, so that F(x̄) denotes the cost of satisfying the demands of all agents. The question of cost sharing asks how the cost of satisfying the demands ought to be billed, or shared, amongst the agents. A rule for how to distribute cost shares given a variety of possible cost functions and demands is called a cost-sharing mechanism. If we denote a cost-sharing mechanism by A, then A_i(x̄, F) denotes agent i's associated share of the cost when x̄ is demanded and F is the cost function.

Various cost-sharing works outline sets of axioms that characterize the Shapley value or the Aumann-Shapley method. Previous works then made use of these results to produce theoretical claims that IG or the Shapley value can be characterized among attribution methods. We address some previous works on IG in Chapter 2 and fully outline several characterizations of IG in Chapter 4.

Interactions are an extension of the concept of attributions to groups of inputs. In an intuitive sense, a model such as F(x₁, x₂) = 3x₁ + 2x₂ does not have any interactions between the inputs because the change in the value of F due to x̄₁'s value is independent of the value of x̄₂. In comparison, a model F(x₁, x₂) = x₁x₂ does have interactions between inputs. How to identify and quantify the interactions for a given input x̄ is the purpose of an interaction method. We discuss interaction methods in Chapter 3.

When speaking of interactions among a group of features, there are multiple possible meanings: marginal interactions between members of a group, total interactions among members of the group, and average interactions among members of the group.
Loosely speaking, if we let G_S be the interactions among the features of S that are not accounted for by the interactions of sub-groups, then G_S represents the marginal interactions of the features in S, ∑_{T⊆S} G_T represents the total interactions of the features in S, and ∑_{T⊆S} µ_T G_T represents the average interactions of the features in S, where µ_T is some weight function. This thesis focuses on marginal interactions. Specifically, we address kth-order interactions, which describe interactions between inputs for groups of inputs of size k or less.

Chapter 2
A Rigorous Study of Integrated Gradients and an Extension to Internal Neuron Attribution

This chapter is based on the paper "A Rigorous Study of the Integrated Gradients Method and Extensions to Internal Neuron Attributions" [Lundstrom et al., 2022a] and part of another paper, "Four Axiomatic Characterizations of the Integrated Gradients Attributions Method" [Lundstrom and Razaviyayn, 2023a].

The baseline attribution method Integrated Gradients was first introduced in Sundararajan et al. [2017]. The paper identified a set of desirable axioms for attributions, demonstrated that previous methods fail to satisfy them, and introduced the IG method, which satisfies the axioms. Included was the claim that any method satisfying a subset of the axioms must be a more general form of IG (called path methods).

Here we address multiple aspects of the IG method: its foundational claims, its mathematical behavior, and its extensions. The IG paper of Sundararajan et al. [2017] applies results from Friedman [2004] (given here as Theorem 1) to claim that path methods (defined below) are the only methods that satisfy a set of desirable axioms. Upon inspection, we observe that there are key assumptions on the function spaces of Friedman [2004], such as functions being non-decreasing, which are not true in the DL context. These differences in function spaces were unaddressed in Sundararajan et al. [2017]. We show that because the function spaces differ, Theorem 1 does not apply and the uniqueness claim is false. This observation also invalidates other uniqueness claims found in Xu et al. [2020] and Sundararajan and Najmi [2020]. With the introduction of an additional axiom, non-decreasing positivity (NDP), we show that Theorem 1 can apply, and we rigorously extend it to a broad-ranging DL function space.

We also address the mathematical behavior of IG and an extension. We identify a common class of functions where IG may be hypersensitive to the input image by failing to be Lipschitz continuous, as well as a function class where IG is guaranteed to be Lipschitz continuous. We also note that the axioms in Sundararajan et al. [2017] apply to single-baseline attribution methods, but no such axioms have been stated for methods that employ a distribution of baselines. We identify/extend axioms for the distribution-of-baselines methods that parallel those in the single-baseline case. Lastly, we introduce a computationally efficient method of attributing to an internal neuron based on its contribution to a region of the IG map. If an IG map indicates certain regions or sub-features are important to an output, this method provides a means of inspecting which individual neurons are responsible for that region or sub-feature.

1 Introduction to Attributions and the Integrated Gradients Method

In this section, we cover preliminaries needed for our analysis of IG.

1.1 Baseline Attribution Notations

We begin by establishing preliminary notions.
For a, b ∈ ℝ^n, let [a, b] denote the hyper-rectangle with opposite vertices a and b. Here [a, b] represents the input domain of an ML model, such as a colored image. We denote the set of ML models of interest by F(a, b), with F ∈ F(a, b) being some function F : [a, b] → ℝ. We assume that a, b are such that a_i ≠ b_i for all i, and we may drop a, b and write F if it is not important to consider a, b for the discussion at hand. Also, we only consider one output of a model, so that if a model reports a probability vector of scores from a softmax layer, for instance, we only consider one entry of the probability vector.

Throughout the thesis, x represents a general function input, x̄ represents a particular input that is being attributed, and x′ denotes a reference baseline. A baseline attribution method (BAM) explains a model by assigning scores to the components of an input indicating their contribution to the output F(x̄). We define a BAM as:

Definition 1 (Baseline Attribution Method). Given an input x̄ ∈ [a, b], baseline x′ ∈ [a, b], and F ∈ F(a, b), a baseline attribution method is any function of the form A : D_A → ℝ^n, where D_A ⊆ [a, b] × [a, b] × F.

A BAM reports a vector, so that A_i(x̄, x′, F) reports the contribution of the i-th component of x̄ to the output F(x̄), given the reference baseline input x′. BAMs are a type of attribution with a baseline input used for comparison to the input x̄, usually representing an absence of features. Often a baseline x′ is implicit for the model F, and we may drop writing x′ if it is unnecessary. It is not guaranteed that a BAM is defined for every input, as we will see in Section 1.2. We denote the domain where an attribution is defined by D_A.

There are two particular classes of BAMs, defined on different function classes, that we will discuss. Define F₁(a, b) to be the set of real analytic functions on [a, b], and define A₁ to be the set of BAMs defined on [a, b] × [a, b] × F₁(a, b). As before, we may write F₁ if a, b are apparent. The class of real analytic functions is well understood, but it does not include many practical deep NNs, such as those which use the ReLU and max functions. To address these networks, define F₂(a, b), or F₂ if a, b are apparent, to be the set of feed-forward neural networks with a finite number of nodes on [a, b] composed of real-analytic layers and ReLU layers. This includes fully connected, skip, residual, max, and softmax layers, as well as activation functions like sigmoid, mish, swish, softplus, and leaky ReLU.

Formally, let n₀, ..., n_m, m ∈ ℕ, and for 1 ≤ k ≤ m, let F^k : ℝ^{n_{k−1}} → ℝ^{n_k} denote a real-analytic function. Let S^k : ℝ^{n_k} → ℝ^{n_k} be any function of the form S^k(x) = (f^k_1(x₁), ..., f^k_{n_k}(x_{n_k})), where each f^k_i is either the identity mapping or the ReLU function. That is, S^k performs either a pass-through or a ReLU on each component, and may perform different operations on different components. Each function in F₂ takes the form

\[
F(x) = S^m \circ F^m \circ S^{m-1} \circ F^{m-1} \circ \cdots \circ S^2 \circ F^2 \circ S^1 \circ F^1(x),
\]

where ∘ denotes function composition. Note that a multi-input max function can be formulated by a series of two-input max functions, and max(x, y) = ReLU(x − y) + y. Thus neural networks with the max function can be reformulated using only the ReLU function, and F₂ includes neural networks with the max function. Define A₂(D) (or A₂) to be the set of BAMs defined on D ⊆ [a, b] × [a, b] × (F₁ ∪ F₂).
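For instance, a one-hidden-layer ReLU classifier restricted to a single class score c, F(x) = softmax(W₂ ReLU(W₁x + b₁) + b₂)_c, belongs to F₂: take F¹(x) = W₁x + b₁ and F²(h) = softmax(W₂h + b₂)_c, both real-analytic, with S¹ the componentwise ReLU and S² the identity. (This worked example is our own illustration of the definition, not taken from the original text.)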
1.2 The Integrated Gradients

There is a particular form of baseline attribution method which satisfies axioms 1-4, called a path method. Define a path function as follows:

Definition 2 (Path Function). A function γ(x̄, x′, t) : [a, b] × [a, b] × [0, 1] → [a, b] is a path function if, for fixed x̄, x′, γ(t) := γ(x̄, x′, t) is a continuous, piecewise smooth curve from x′ to x̄.

We may drop both x̄ and x′ when they are fixed, and write γ(t). If we further suppose that ∂F/∂x_i(γ(t)) exists almost everywhere [Footnote 1: ∂F/∂x_i(γ(t)) exists almost everywhere iff the Lebesgue measure of {t ∈ [0, 1] : ∂F/∂x_i(γ(t)) does not exist} is 0.], then the path method associated with γ can be defined as:

Definition 3 (Path Method). Given the path function γ(·, ·, ·), the corresponding path method is defined as

\[
A^\gamma_i(\bar{x}, x', F) = \int_0^1 \frac{\partial F}{\partial x_i}\big(\gamma(\bar{x}, x', t)\big) \times \frac{\partial \gamma_i}{\partial t}(\bar{x}, x', t)\, dt, \tag{1.1}
\]

where γ_i denotes the i-th entry of γ.

Path methods are well defined when ∇F exists and is continuous on [a, b]; however, this is not necessarily the case for common ML models, such as neural networks that use ReLU and max functions. For example, let x̄ = (1, 1), x′ = (0, 0), and use the straight-line path γ(t) = t(1, 1). If F(x) = max(x₁, x₂), then A^γ(x̄, x′, F) does not exist because the partial derivatives are undefined at every point on the path γ.

The Integrated Gradients method [Sundararajan et al., 2017] is the path method defined by the straight path from x′ to x̄, given as γ(x̄, x′, t) = x′ + t(x̄ − x′), and takes the form:

Definition 4 (Integrated Gradients Method). Given a function F and baseline x′, the Integrated Gradients attribution of the i-th component of x̄ is defined as

\[
IG_i(\bar{x}, x', F) = (\bar{x}_i - x'_i) \int_0^1 \frac{\partial F}{\partial x_i}\big(x' + t(\bar{x} - x')\big)\, dt \tag{1.2}
\]

For commentary on accurate computation of IG and path methods, see Appendix A.1.

1.3 The Axiomatic Approach of Sundararajan et al. [2017]

The theoretical allure of IG stems from three key claims: 1) IG satisfies stipulated axioms (desirable properties), 2) other methods fail at least one of the axioms, and 3) only methods like it (path methods) are able to satisfy these axioms. Here we review the axioms stated in Sundararajan et al. [2017]. Our first axiom, Sensitivity(a), can be stated as follows:

1. Sensitivity(a): Suppose x̄ and x′ vary in one component, so that x̄_i ≠ x′_i and x̄_j = x′_j for all j ≠ i. Further suppose F(x̄) ≠ F(x′). Then A_i(x̄, x′, F) ≠ 0.

This axiom asserts that if exactly one input is not at its reference baseline value, and that difference caused F to change value, then the input contributed to a change. Thus the attribution should indicate that the input contributed to the function change, i.e., A_i(x̄, x′, F) ≠ 0.

The next axiom, implementation invariance [Sundararajan et al., 2017], can be stated as follows:

2. Implementation Invariance: A is not a function of model implementation, but solely a function of the mathematical mapping of the model's domain to the range.

This axiom stipulates that an attribution method be independent of the model's implementation. Otherwise, the values of the attribution may carry information about implementation aspects such as architecture. This axiom requires that attributions ignore all aspects of the specific implementation. Many methods, such as SmoothGrad [Smilkov et al., 2017] and SHAP [Lundberg and Lee, 2017], satisfy implementation invariance, while Sundararajan et al. [2017] showed that DeepLIFT [Shrikumar et al., 2017] and Layer-wise Relevance Propagation [Binder et al., 2016] do not satisfy it.

Another axiom, linearity [Sundararajan et al., 2017], [Sundararajan and Najmi, 2020], [Janizek et al., 2021], is given as:
3. Linearity: If (x̄, x′, F), (x̄, x′, G) ∈ D_A and α, β ∈ ℝ, then (x̄, x′, αF + βG) ∈ D_A and A(x̄, x′, αF + βG) = αA(x̄, x′, F) + βA(x̄, x′, G).

Linearity ensures that if F is a linear combination of other models, a weighted average of model outputs for example, then the attributions of F equal the corresponding combination of the attributions to the sub-models. This imposes structure on the attribution outputs, so that if a model's outputs are scaled to give outputs twice as large, for example, then the attributions are scaled as well.

We say that a function F does not vary in an input x_i if for every x in the domain of F, G(t) := F(x₁, ..., x_{i−1}, t, x_{i+1}, ..., x_n) is a constant function. We denote that F does not vary in x_i by writing ∂_iF ≡ 0. With this definition we may state another axiom, dummy [Footnote 2: The dummy axiom here is called Sensitivity(b) in Sundararajan et al. [2017].]:

4. Dummy: If (x̄, x′, F) ∈ D_A and ∂_iF ≡ 0, then A_i(x̄, x′, F) = 0.

Dummy ensures that whenever an input has no effect on the function, the attribution score is zero. Another axiom, completeness [Sundararajan et al., 2017], [Sundararajan and Najmi, 2020], [Tsai et al., 2022], is given as:

5. Completeness: If (x̄, x′, F) ∈ D_A, then ∑_{i=1}^n A_i(x̄, x′, F) = F(x̄) − F(x′).

Completeness grounds the meaning of the magnitude and sign of attributions. The magnitude of A_i(x̄, x′, F) indicates that x̄_i contributed that quantity to the change in function value from F(x′) to F(x̄). The sign of A_i(x̄, x′, F) indicates whether x̄_i contributed to function increase or function decrease. Thus the attributions to each input give a complete account of the function change, F(x̄) − F(x′).

Lastly, we have symmetry-preserving, which is given as:

6. Symmetry-Preserving: For a vector x and two indices i, j such that 1 ≤ i, j ≤ n, define x* by swapping the values of x_i and x_j. Now suppose that for all x ∈ [a, b], F(x) = F(x*). Then if (x̄, x′, F) ∈ D_A, x̄_i = x̄_j, and x′_i = x′_j, we have A_i(x̄, x′, F) = A_j(x̄, x′, F).

Symmetry-preserving requires "swappable" features with identical values to receive identical attributions.

The Integrated Gradients method satisfies the axioms stated above. We give a brief explanation of how it satisfies each. Assume here that γ(t) is the straight-line IG path.

1. Sensitivity(a): Suppose only x̄_i ≠ x′_i. For any j ≠ i, we have x̄_j − x′_j = 0, and thus IG_j(x̄, x′, F) = 0. Because IG satisfies completeness (see below), and F(x̄) ≠ F(x′), we have IG_i(x̄, x′, F) = F(x̄) − F(x′) ≠ 0.

2. Implementation Invariance: IG depends only on ∇F, which is independent of the implementation of F.

3. Linearity: For any index i we have:

\[
\begin{aligned}
IG_i(\bar{x}, x', aF + bG) &= (\bar{x}_i - x'_i) \int_0^1 \frac{\partial (aF + bG)}{\partial x_i}(\gamma(t))\, dt\\
&= a(\bar{x}_i - x'_i) \int_0^1 \frac{\partial F}{\partial x_i}(\gamma(t))\, dt + b(\bar{x}_i - x'_i) \int_0^1 \frac{\partial G}{\partial x_i}(\gamma(t))\, dt\\
&= a\, IG_i(\bar{x}, x', F) + b\, IG_i(\bar{x}, x', G)
\end{aligned}
\]

4. Dummy/Sensitivity(b): If ∂_iF ≡ 0, then IG_i(x̄, x′, F) integrates the zero function, and equals 0.

5. Completeness: Letting "·" denote the inner product, we employ the fundamental theorem of line integrals to obtain:

\[
\sum_{i=1}^n IG_i(\bar{x}, x', F) = \sum_{i=1}^n (\bar{x}_i - x'_i) \int_0^1 \frac{\partial F}{\partial x_i}(\gamma(t))\, dt = \int_0^1 \nabla F(\gamma(t)) \cdot \gamma'(t)\, dt = F(\bar{x}) - F(x')
\]

6. Symmetry-Preserving: Suppose F(x) = F(x*) for all x ∈ [a, b]. Then ∂F/∂x_i(x) = ∂F/∂x_j(x*) for all x ∈ [a, b]; in particular, ∂F/∂x_i(x) = ∂F/∂x_j(x) whenever x_i = x_j. Since x̄_i = x̄_j and x′_i = x′_j, every point on the straight-line path satisfies γ_i(t) = γ_j(t), and the result applies immediately.

The argument for IG uniqueness in Sundararajan et al. [2017] is roughly as follows: other established methods fail to satisfy sensitivity(a) or implementation invariance.
IG satisfies completeness (a stronger claim that includes sensitivity(a)), implementation invariance, linearity, and sensitivity(b). It can be shown that path methods are the unique methods that satisfy implementation invariance, sensitivity(b), linearity, and completeness. IG is the unique path method that satisfies symmetry. Thus, IG uniquely satisfies axioms 1-6. It was admitted that the Shapley value [Shapley and Shubik, 1971] also satisfies these conditions, but it is not a path method, being formulated over multiple paths, and is, moreover, computationally infeasible because it is comprised of a large number of paths. It should be noted that Lerma and Lucas [2021] pointed out that other computationally feasible path methods (single path methods) satisfying all the axioms exist and are easy to produce, although they are not as simple as IG.

It should also be noted that other axiomatic treatments of IG exist. Sundararajan and Najmi [2020] introduced an alternative set of axioms and claimed that IG uniquely satisfies them. Xu et al. [2020] claimed that path methods uniquely satisfy linearity, dummy, completeness, and an additional axiom. These treatments will be discussed later.

1.4 Modifications and Extensions

One issue with IG is the noisiness of the attribution. Sharp fluctuations in the gradient, sometimes called the shattered gradient problem [Balduzzi et al., 2017], are generally blamed. Another issue with integrated gradients is baseline choice. If the baseline is a black image, then the (x̄_i − x′_i) term will be zero for any black pixel in the input image, causing those attributions to be zero. This is an issue if the black input pixels do contribute to image recognition, such as a model identifying an image of a blackbird.

One category of fixes to these issues relies on modifying the choice of input and baseline. Smilkov et al. [2017] addresses the noisiness issue by introducing noise into the input and taking the average IG. A tutorial on Tensorflow.com [2022] addresses baseline choice by averaging the results when using a white and a black baseline. Erion et al. [2021] claims that synthetic baselines such as black and white images are out-of-distribution data points, and suggests using training images as baselines and taking the average. Pascal Sturmfels [2020] investigates various fixed and random baselines using blurring, Gaussian noise, and uniform noise.

Another category of fixes modifies the IG path. Kapishnikov et al. [2021] identifies accumulated noise along the IG path as the cause of noisy attributions, and employs a guided path approach to reduce attribution noise. Xu et al. [2020] is concerned with the introduction of information by the baseline, and opts to use a path that progressively removes Gaussian blur from the attributed image.

Some IG extensions employ it for tasks beyond input attributions. Dhamdhere et al. [2018] calculates accumulated gradient flow through neurons to produce an internal neuron attribution method called conductance. Shrikumar et al. [2018] later identifies conductance as an augmented path method and provides a computational speedup. Lundstrom et al. [2022b] compares a neuron's average attribution over different image classes to characterize neurons as class discriminators. Erion et al. [2021] incorporates IG attributions in a regularization term during training to improve the quality of attributions and model robustness.
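To illustrate the baseline-averaging idea of Erion et al. [2021] (formalized as distributional IG in Section 5), here is a minimal sketch under stated assumptions: it reuses the hypothetical `integrated_gradients` helper sketched in the background review of Chapter 1 and takes a batch of baselines sampled from the training data. The names are ours, not those of any released implementation.

```python
import torch

def expected_gradients(model, x, baseline_batch, steps=64):
    """Monte Carlo estimate of E_{X' ~ D'} IG(x, X', F).

    baseline_batch : (m, n) tensor of reference inputs drawn from a
                     baseline distribution D', e.g. training images
    """
    total = torch.zeros_like(x)
    for baseline in baseline_batch:
        # Average the single-baseline IG attribution over the sampled baselines.
        total += integrated_gradients(model, x, baseline, steps=steps)
    return total / baseline_batch.shape[0]
```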
2 Remarks on Original IG Paper and Other Uniqueness Claims

2.1 Remarks on Completeness and the Path Definition

We first address a few claims of the original IG paper [Sundararajan et al., 2017] to add mathematical clarifications. Sundararajan et al. [2017, Remark 2] states:

"Integrated gradients satisfies Sensitivity(a) because Completeness implies Sensitivity(a) and is thus a strengthening of the Sensitivity(a) axiom. This is because Sensitivity(a) refers to a case where the baseline and the input differ only in one variable, for which Completeness asserts that the difference in the two output values is equal to the attribution to this variable."

To clarify, completeness implies sensitivity(a) for IG, and for monotone path methods in general. The form of IG guarantees that any input that does not differ from the baseline will have zero attribution, due to the (x̄_i − x′_i) term in (1.2). If only one input differs from the baseline, and F(x̄) ≠ F(x′), then the value F(x̄) − F(x′) ≠ 0 will be attributed to that input by completeness. However, completeness does not imply sensitivity(a) for general attribution methods, or for non-monotone path methods specifically.

In Sundararajan et al. [2017], monotone path methods (what they simply term path methods) are introduced as a generalization of the IG method. The section reads:

"Integrated gradients aggregate the gradients along the inputs that fall on the straight line between the baseline and the input. There are many other (non-straightline) paths that monotonically interpolate between the two points, and each such path will yield a different attribution method. For instance, consider the simple case when the input is two dimensional. Figure 1 [Footnote 3: See Figure A.1, Appendix A.2.] has examples of three paths, each of which corresponds to a different attribution method. Formally, let γ = (γ₁, ..., γ_n) : [0, 1] → ℝ^n be a smooth function specifying a path in ℝ^n from the baseline x′ to the input x̄, i.e., γ(0) = x′ and γ(1) = x̄."

In the referenced figure, P1 is identified as a path, but it is not smooth. It is simple enough to interpret smooth here to mean piecewise smooth. Note that monotonicity is mentioned, and all examples in Figure 1 are monotone, but monotonicity is not explicitly included in the formal definition. The cited source on path methods, Friedman [2004], only considers monotone paths. Thus, we assume that Sundararajan et al. [2017] only considers monotone paths. The alternative is addressed in the discussion of Conjecture 1.

2.2 On Sundararajan et al. [2017]'s Uniqueness Claim

In the original IG paper, an important uniqueness claim is given as follows [Sundararajan et al., 2017, Prop 2]:

"[Friedman, 2004] Path methods are the only attribution methods that always satisfy Implementation Invariance, Sensitivity(b), Linearity, and Completeness."

The claim that every method satisfying certain axioms must be a path method is important for two reasons: 1) it categorically excludes every method that is not a path method from satisfying the axioms, and 2) it characterizes the form of methods satisfying the axioms. However, no proof of the statement is given, only the following remark (Remark 4):

"Integrated gradients correspond to a cost-sharing method called Aumann-Shapley [Aumann and Shapley, 1974].
Proposition 2 holds for our attribution problem because mathematically the cost-sharing problem corresponds to the attribution problem with the benchmark fixed at the zero vector."

The cost-sharing problem does correspond to the attribution problem with the benchmark fixed at zero, but with some key differences. To understand the differences, we review the cost-share problem and the results in Friedman [2004], then rigorously state Sundararajan et al. [2017, Proposition 2]. We then point out discrepancies between the function spaces that make the application of the results in Friedman [2004] neither automatic nor, in one case, appropriate.

In Friedman [2004], attributions are discussed within the context of the cost-sharing problem. Suppose F gives the cost of satisfying the demands of various agents, given by x̄. Each input x̄_i represents an agent's demand, F(x̄) represents the cost of satisfying all demands, and the attribution to x̄_i represents that agent's share of the total cost. It is assumed that F(0) = 0, naturally, and because increased demands cannot result in a lower total cost, F(x̄) is non-decreasing in each component of x̄. Furthermore, only C¹ cost functions are considered. To denote these restrictions on F formally, we write that for a positive vector a ∈ ℝ^n_+, the set of attributed functions for a cost-sharing problem is denoted by

\[
\mathcal{F}_0 = \{F \in \mathcal{F}(a, 0) \mid F(0) = 0,\ F \in C^1,\ F \text{ non-decreasing in each component}\}.
\]

There are also restrictions on attribution functions. The comparative baseline in this context is no demands, so x′ is fixed at 0. Because an agent's demands can only increase the cost, an agent's demands should only yield a positive cost-share. Thus cost-shares are non-negative. Formally, we denote the set of baseline attributions in Friedman [2004] by

\[
\mathcal{A}_0 = \{A : [a, 0] \times \mathcal{F}_0 \to \mathbb{R}^n_+\}.
\]

Before we continue, we must define an ensemble of path methods. Let Γ(x̄, x′) denote the set of all path functions projected onto their third component, so that x̄, x′ are fixed and γ ∈ Γ(x̄, x′) is a function solely of t. We may write Γ(x̄) when x′ is fixed or apparent. Define the set of monotone path functions as Γ_m(x̄, x′) := {γ ∈ Γ(x̄, x′) | γ is monotone in each component}. We can then define an ensemble of path methods:

Definition 5. A BAM A is an ensemble of path methods if there exists a family of probability measures indexed by x̄, x′ ∈ [a, b], µ_{x̄,x′}, each on Γ(x̄, x′), such that:

\[
A(\bar{x}, F) = \int_{\gamma \in \Gamma(\bar{x})} A^\gamma(\bar{x}, F)\, d\mu_{\bar{x}}(\gamma) \tag{2.1}
\]

An ensemble of path methods is an attribution method where, for a given x′, the attribution to x̄ is equivalent to an average over a distribution of path methods, regardless of F. This distribution depends on the fixed x′ and the choice of x̄. If we only consider monotone paths, then we say that the BAM A is an ensemble of monotone path methods, and swap Γ(x̄) for Γ_m(x̄). [Footnote 4: The construction of a measure on monotone paths is detailed in the proof of Friedman [2004, Theorem 1], and is based on measures of monotone paths on a grid. Our results concern ensembles of monotone path methods, and we do not comment on the construction of a measure on non-monotone path methods.]

We now present Friedman's characterization theorem:

Theorem 1 (Friedman [2004, Thm 1]). The following are equivalent:
1. A ∈ A₀ satisfies completeness, linearity [Footnote 5: Friedman [2004] uses a weaker form of linearity: A(x̄, F + G) = A(x̄, F) + A(x̄, G).], and sensitivity(b).
2. A ∈ A₀ is an ensemble of monotone path methods.

To rigorously state Sundararajan et al. [2017, Prop 2], we must interpret the claim: "path methods are the only attribution methods that always satisfy implementation invariance, sensitivity(b), linearity, and completeness." By "path methods", Sundararajan et al. [2017] cannot mean to exclude ensembles of path methods. Simply stated: if some path methods satisfy the axioms, then some ensembles of path methods, such as finite averages, satisfy the axioms also.
Neither can it mean non-monotone path methods, since Theorem 1 only addresses monotone path methods, and, supposedly, the theorem applies immediately. Thus we will interpret "path methods" as in Theorem 1: as an ensemble of monotone path methods. Finally, the function classes are unspecified, so we will loosely define F_DL to be the set of DL models where one output is considered, and define A_DL to be the set of attribution methods defined on F_DL. We now state the characterization theorem in Sundararajan et al. [2017]:

Claim 1 (Sundararajan et al. [2017, Prop 2]). Suppose A ∈ A_DL satisfies completeness, linearity, sensitivity(b), and implementation invariance. Then for any fixed x′, A(x̄, x′, F) is an ensemble of monotone path methods.

As stated previously, there are several discrepancies between the function classes of Theorem 1 and Claim 1. F ∈ F_DL need not be non-decreasing nor C¹, x′ need not be 0, and F(x′) has no restrictions. Additionally, attributions in A_DL can take on negative values while those in A₀ cannot. The differences between F₀ and F_DL, and between A₀ and A_DL, make the application of Theorem 1 problematic in the DL context.

In fact, Claim 1 is actually false. Note that monotone and non-monotone path methods satisfy completeness [Footnote 6: Path methods satisfy completeness because ∑_i A^γ_i(x̄, x′, F) = ∫₀¹ ∇F(γ(t)) · γ′(t) dt = F(x̄) − F(x′) by the fundamental theorem for line integrals.], linearity, sensitivity(b), and implementation invariance. Fixing the baseline to zero and [a, b] = [0, 1]^n, there exists a non-monotone path ω(t) and a non-decreasing F such that A^ω(x̄, x′, F) has negative components. However, if a path γ(t) = γ(x̄, x′, t) is monotone and F is non-decreasing, then ∂γ_i/∂t ≥ 0 and ∂F/∂x_i ≥ 0 for all i. By Eq. (1.1), A^γ(x̄, x′, F) ≥ 0 for monotone γ and non-decreasing F, implying that any ensemble of monotone path methods would be non-negative. Thus, A^ω is not an ensemble of monotone path methods. For a full proof, see Appendix A.3.
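The following toy computation (our own illustration here; the formal counterexample is the one given in Appendix A.3) shows how a non-monotone path can produce a negative attribution for a non-decreasing function. Take F(x₁, x₂) = x₁x₂ on [0, 1]², baseline x′ = (0, 0), input x̄ = (1/2, 1/2), and the non-monotone piecewise-linear path ω that visits (0, 0) → (1, 0) → (1, 1/2) → (1/2, 1/2). Summing the contributions of the three segments in Eq. (1.1) gives

\[
A^{\omega}_1 = \int_0^1 \omega_2(t)\,\omega_1'(t)\, dt = 0 + 0 + \tfrac{1}{2}\cdot\big(-\tfrac{1}{2}\big) = -\tfrac{1}{4},
\qquad
A^{\omega}_2 = \int_0^1 \omega_1(t)\,\omega_2'(t)\, dt = 0 + 1\cdot\tfrac{1}{2} + 0 = \tfrac{1}{2},
\]

so the attributions sum to 1/4 = F(x̄) − F(x′), as completeness requires, yet x₁ receives a negative score even though F is non-decreasing on [0, 1]².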
Why did this happen? Note that in the context of Theorem 1, this counterexample is disallowed: A₀ only includes attributions that give non-negative values. Non-monotone path methods can give negative values for functions in F₀, so they are disallowed. However, what is excluded in the game-theoretic context is allowed in the DL context: F_DL functions can increase or decrease from their baseline, so by completeness, negative and positive attributions must be included. Thus, non-monotone path methods are not prohibited; they are fair game. Without additional constraints, non-monotone path methods are allowed.

The above example shows that the set of BAMs satisfying axioms 3-5 cannot be characterized as an ensemble of path methods over Γ_m. Since the counterexample was a non-monotone path method, perhaps the set of BAMs can be characterized as an ensemble of path methods over Γ.

Conjecture 1. The following are equivalent:
• A ∈ A_DL satisfies completeness, linearity, sensitivity(b), and implementation invariance.
• For a fixed x′, A ∈ A_DL is equivalent to an ensemble of path methods where the maximal path length of the support of µ_{x̄} is bounded. [Footnote 7: For the necessity of bounding the maximal path length, see Appendix B.1. Friedman [2004] only details the construction of such a measure for monotone paths. We do not address how a measure on non-monotone paths could be constructed.]

If Conjecture 1 were true, it would somewhat preserve the intention of Claim 1: that BAMs satisfying axioms 3-5 are path methods. However, it is not clear how Theorem 1 can be used to support Conjecture 1, since it proves characterizations exclusively with monotone path ensembles. On the other hand, it is an open question whether Conjecture 1 is false; that is, perhaps there is a BAM satisfying axioms 3-5 that is not an ensemble of path methods. Even if we do not have a path characterization for BAMs satisfying axioms 3-5, we submit an insight into BAMs satisfying axioms 4 and 5.

Lemma 1. Suppose a BAM A satisfies linearity and sensitivity(b), and ∇F is defined on [a, b]. Then A(x̄, x′, F) is a function solely of x̄, x′, and the gradient of F. Furthermore, A_i(x̄, x′, F) is a function solely of x̄, x′, and ∂F/∂x_i.

The proof of Lemma 1 is relegated to Appendix A.4.

2.3 On Other Uniqueness Claims

There are other attempts to establish the uniqueness of IG or path methods by referencing the cost-sharing literature, each of which succumbs to the same issue as Claim 1. The claims make use of an additional axiom, Affine Scale Invariance (ASI). [Footnote 8: Xu et al. [2020] and Sundararajan et al. [2017] give an incorrect definition of ASI, saying A(x̄, x′, F) = A(T(x̄), T(x′), F ∘ T). The source definition is from Friedman and Moulin [1999].] We denote the function composition operator by "∘". The ASI axiom, seventh in our list, is as follows:

7. Affine Scale Invariance (ASI): For a given index i and constants c ≠ 0, d, define the affine transformation T(x) := (x₁, ..., cx_i + d, ..., x_n). Then whenever (x̄, x′, F), (T(x̄), T(x′), F ∘ T⁻¹) ∈ D_A, we have A(x̄, x′, F) = A(T(x̄), T(x′), F ∘ T⁻¹).

This axiom can be justified by considering unit conversion. Suppose F is some machine learning model where input x̄_i is given in degrees Fahrenheit. T could be an affine transformation that converts the i-th input from Fahrenheit to Celsius, so that F ∘ T⁻¹ is an adjusted model where x̄_i would be given in Celsius, converted to Fahrenheit, then input into the original model. Affine scale invariance requires that an attribution method A give the same attributions whether given the Fahrenheit inputs, (x̄, x′, F), or the Celsius inputs, (T(x̄), T(x′), F ∘ T⁻¹).

It is interesting to note that ASI effectively means that the shape of the path for a path method stays the same regardless of the input or baseline values. Explicitly, suppose A^γ is a path method satisfying ASI. For any x̄, x′ there exists a unique affine transformation T such that T(x′) = 0 and T(x̄) = 1, where by 0 and 1 we mean the vectors whose entries are all zero or all one, respectively. Thus A^γ(x̄, x′, F) = A^γ(T(x̄), T(x′), F ∘ T⁻¹) = A^γ(1, 0, F ∘ T⁻¹). The final expression uses the path γ(1, 0, t), and ignores the form of the path γ(x̄, x′, t). This causes the path to keep the same shape, so that all paths are an affine stretching of the base path from x′ = 0 to x̄ = 1.

Xu et al. [2020, Prop 1] claims that path methods are the unique methods that satisfy dummy, linearity, completeness, and ASI. Here the situation is similar to Sundararajan et al. [2017]: they import game-theoretic results from Friedman [2004], which assume functions are non-decreasing and attributions are non-negative. As mentioned in our discussion of Claim 1, the referenced result cannot be correctly applied to a context where attributions can be negative and no additional constraints are imposed. For a fuller treatment and a counterexample, see Appendix A.5.
In another paper, Sundararajan and Najmi [2020, Cor 4.4] claim that IG uniquely satisfies a handful of axioms: linearity, dummy, symmetry, ASI, and proportionality. This argument is a corollary of another claim: any attribution method satisfying ASI and linearity is the difference of two cost-share solutions [Sundararajan and Najmi, 2020, Thm 4.1]. By breaking up an attribution into two cost-share solutions, the aim is to apply cost-share results. The argument is roughly as follows: for any attribution, input, baseline, and function, they use ASI to formulate the attribution as A(x̄, 0, F), with x̄ > 0. They write F = F⁺ − F⁻, where F⁺ and F⁻ are non-decreasing. Then by linearity, A(x̄, 0, F) = A(x̄, 0, F⁺ − F⁻) = A(x̄, 0, F⁺) − A(x̄, 0, F⁻), which, the claim states, is the difference of two cost-share solutions. However, there are methods that satisfy ASI and linearity but generally give negative values for cost-sharing problems. Thus neither A(x̄, 0, F⁺) nor A(x̄, 0, F⁻) is necessarily a cost-share solution to a cost-share problem. See Appendix A.5 for a counterexample.

3 Establishing Ensemble Uniqueness Claims with Non-Decreasing Positivity

We now seek to salvage the uniqueness claims identified in the previous section for a robust set of functions. To this end, we introduce the axiom of non-decreasing positivity (NDP). We say that F is non-decreasing from x′ to x̄ if F(γ(t)) is non-decreasing for every monotone path γ(t) ∈ Γ(x̄, x′) from x′ to x̄. We can then define NDP as follows:

Definition 6. A BAM A satisfies NDP if A(x̄, x′, F) ≥ 0 whenever F is non-decreasing from x′ to x̄.

F being non-decreasing from x′ to x̄ is analogous to a cost function being non-decreasing in the cost-sharing context. NDP is then analogous to requiring cost-shares to be non-negative. Put another way, NDP states that if F(x) does not decrease when any input component moves from x′_i closer to x̄_i, then A(x̄, x′, F) should not give negative values to any input. The addition of NDP enables Theorem 1 to extend closer to the DL context.

Theorem 2 (Characterization Theorem with NDP). For A ∈ A₁, the following are equivalent:
i. A satisfies completeness, linearity, dummy, and NDP.
ii. There exists a family of probability measures µ_{·,·} indexed on (x̄, x′) ∈ [a, b] × [a, b], where µ_{x̄,x′} is a measure on Γ_m(x̄, x′), such that

\[
A(\bar{x}, x', F) = \int_{\Gamma_m(\bar{x}, x')} A^\gamma(\bar{x}, x', F)\, d\mu_{\bar{x}, x'}(\gamma)
\]

Theorem 2 states that if A ∈ A₁ is constrained according to the four axioms, then A is an expected value of path methods with monotone paths. We call this expected value of path methods an ensemble of monotone path methods, or, more generally, an ensemble of path methods if the expectation is not constrained to monotone paths.

A sketch of the proof is as follows. Let x̄ be fixed, and let F ∈ F₁. It can be shown that the behavior of F outside of [x̄, x′] is irrelevant to A(x̄, x′, F). Using this, apply a coordinate transform T that maps [x̄, x′] onto [|x̄ − x′|, 0], so that A(x̄, x′, F) = A₀(|x̄ − x′|, 0, F₀), where A₀, F₀ have proper domains for applying Theorem 1. F₀ is C¹ and defined on a compact domain, so its derivative is bounded, and there exists c ∈ ℝ^n such that F₀(x) + cᵀx is non-decreasing in x.
Apply Theorem 1 to A₀ with the non-decreasing function F₀(x) + cᵀx and simplify to show that A(x̄, x′, F) is an ensemble of path methods for the function F₀ and paths in Γ(|x̄ − x′|, 0). Reverse the transform to obtain the ensemble in terms of F and Γ(x̄, x′).

To present results for F₂, and thus expand Theorem 2 to non-C¹ functions, we begin with a lemma regarding the topology of the domain of F ∈ F₂.

Lemma 2. Suppose F ∈ F₂. Then the domain [x̄, x′] can be partitioned into a nonempty region D and its boundary ∂D, where F is real-analytic on D, D is open with respect to the topology of the dimension of [x̄, x′], and ∂D has measure 0.

We now present a claim extending Theorem 2 to functions in F₂. Let D denote the set described above, and denote the set of points on the path γ by P_γ.

Theorem 3 (Extension to F ∈ F₂). Suppose A ∈ A₂ is defined on [a, b] × [a, b] × F₁ and some subset of [a, b] × [a, b] × F₂, and satisfies completeness, linearity, dummy, and NDP. Let µ_{·,·} be the family of measures on monotone paths that defines A on [a, b] × [a, b] × F₁ from Theorem 2, and let (x̄, x′, F) ∈ [a, b] × [a, b] × F₂. If A(x̄, x′, F) is defined, and for almost every path γ ∈ Γ_m(x̄, x′) (according to µ_{x̄,x′}) the set {t ∈ [0, 1] : γ(t) ∈ ∂D} is a null set with respect to the Lebesgue measure on ℝ, then A(x̄, x′, F) is equivalent to an ensemble of monotone path methods. Furthermore, this ensemble is defined with the same µ_{·,·} as in Theorem 2.

The above result answers two questions: 1) is A an ensemble of path methods when evaluating models in F₂, and 2) is that ensemble the same ensemble that A uses to evaluate models in F₁? The above theorem guarantees that when considering models in F₂, which may not be differentiable on [a, b], A is still an ensemble of path methods and, in fact, is the same ensemble that defines A's action on models in F₁. Thus Theorem 3 establishes that while ensembles of path methods uniquely satisfy a set of axioms for attributions in A₁, they also satisfy these axioms for models in F₂, whenever it makes sense to do so. [Footnote 9: Theorem 3 obtains a characterization of monotone path methods for ML models with rectangular domains. A future area of research could investigate the concept of monotone path methods and the straight path method in other geometries such as the unit sphere.]

Note that Theorem 3 does not settle the question of uniqueness for IG, only for ensembles of path methods. We continue to explore a uniqueness proof for IG using symmetry-preserving in Chapter 4, Section 1. With the addition of NDP, we also establish the other uniqueness claims of Section 2.3. For details on Xu et al. [2020, Prop 1] and a recovery of Sundararajan and Najmi [2020, Thm 4.1], see Appendix A.5. Details on a full characterization of IG like that in Sundararajan and Najmi [2020], with proportionality and NDP, are given in Chapter 4, Section 2.

4 Lipschitz Continuity of Integrated Gradients

DL models can be extremely sensitive to slight changes in the input image [Goodfellow et al., 2014]. It stands to reason that IG should have increased sensitivity in the output for more sensitive models, and less sensitivity in the output for less sensitive models. The question of whether IG is locally Lipschitz, and what its local Lipschitz constant is, has been studied previously by experimental means. Previous works searched for the Lipschitz constant when the domain is restricted to some ball around the input, either by Monte Carlo sampling [Yeh et al., 2019] or by exhaustive search of nearby input data [Alvarez-Melis and Jaakkola, 2018].
In contrast to these, we provide theoretical results on the global sensitivity of IG for two extremes: a model with a discontinuous gradient (as with a neural network with a max or ReLU function), and a model with a well-behaved gradient:

Theorem 4. Let F be defined on [a,b] and let x′ be fixed. If F has the usual discontinuities due to ReLU or max functions, then IG(x̄, F) may fail to be Lipschitz continuous in x̄. If ∇F is Lipschitz continuous with constant L and |∂F/∂x_i| attains maximum M, then IG_i(x̄, F) is Lipschitz continuous in x̄ with Lipschitz constant at most M + (|a_i − b_i|/2) L.

5 Axioms For a Distribution Baseline

As mentioned in 1.4, some extensions of IG use a distribution of baseline points as a baseline. Here we give a formal definition of the distributional IG and comment on some axioms it satisfies. We denote the set of distributions on the input space by D. The set of distributional attributions, E, is then defined as the set containing all functions of the form E : [a,b] × D × F → ℝⁿ. Given a distribution of baseline images D′ ∈ D, we suppose the baseline random variable X′ ∼ D′. Then the distributional IG is given by
\[
EG(\bar{x}, X', F) := \mathbb{E}_{X' \sim D'}\, IG(\bar{x}, X', F).
\]
Particular axioms, namely implementation invariance, sensitivity(b), linearity, and ASI, can be carried over directly to the distributional attribution context. Distributional IG satisfies these axioms. The axioms of sensitivity(a), completeness, symmetry-preserving, and NDP do not have direct analogues. Below we identify distributional attribution axioms that extend the sensitivity(a), completeness, symmetry-preserving, and NDP axioms to the distributional attribution case. Distributional IG satisfies these axioms as well. 10 See Appendix C.2 for details. Let E ∈ E, D′ ∈ D, X′ ∼ D′, and F, G ∈ F:

1. Sensitivity(a): Suppose X′ varies in exactly one input, X′_i, so that X′_j = x̄_j for all j ≠ i, and E F(X′) ≠ F(x̄). Then E_i(x̄, X′, F) ≠ 0.

2. Completeness: Σ_{i=1}^{n} E_i(x̄, X′, F) = F(x̄) − E F(X′).

3. Symmetry Preserving: For a given i, j, define x* by swapping the values of x_i and x_j. Now suppose that for all x, F(x) = F(x*). Then whenever X′_i and X′_j are exchangeable 11 and x̄_i = x̄_j, we have E_i(x̄, X′, F) = E_j(x̄, X′, F).

4. NDP: If F is non-decreasing from every point on the support of D′ to x̄, then E(x̄, X′, F) ≥ 0.

10 Completeness has been observed by Erion et al. [2021].
11 X′_i and X′_j are exchangeable if X′ and the vector obtained from X′ by swapping its i-th and j-th components are identically distributed.

6 Internal Neuron Attributions

6.1 Previous Methods

Previous works apply IG to internal neuron layers to obtain internal neuron attributions. We review their results before discussing extensions. Suppose F is a single output of a feedforward neural network, with F : [a,b] → ℝ. We can separate F at an internal layer such that F(x) = G(H(x)). Here H : [a,b] → ℝᵐ is the first half of the network, outputting the values of an internal layer of neurons, and G : ℝᵐ → ℝ is the second half of the network, which takes the internal neuron values as input. We assume the straight-line path γ, although other paths can be used. Following Dhamdhere et al. [2018], the flow of the gradient in IG_i through neuron j, labeled IG_{i,j}, is given by:
\[
IG_{i,j} = (\bar{x}_i - x'_i) \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big)\, \frac{\partial H_j}{\partial x_i}\big(\gamma(t)\big)\, dt \tag{6.1}
\]
By fixing the input and summing the gradient flow through each internal neuron, we recover IG_i, or, what we equivalently denote in this context, IG_{i,*}. This is what we should expect, and it is accomplished by moving the sum into the integral and invoking the chain rule.
\[
\sum_j IG_{i,j} = (\bar{x}_i - x'_i) \int_0^1 \frac{\partial (G \circ H)}{\partial x_i}\big(\gamma(t)\big)\, dt = IG_{i,*} \tag{6.2}
\]
If we fix an internal neuron and calculate the total gradient flow through it for each input, we get an internal neuron attribution, or what Dhamdhere et al. [2018] calls a neuron's conductance:
\[
\begin{aligned}
IG_{*,j} = \sum_i IG_{i,j} &= \sum_i (\bar{x}_i - x'_i) \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big)\, \frac{\partial H_j}{\partial x_i}\big(\gamma(t)\big)\, dt \\
&= \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big) \sum_i \Big[ \frac{\partial H_j}{\partial x_i}\big(\gamma(t)\big) \times (\bar{x}_i - x'_i) \Big]\, dt \\
&= \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big)\, \frac{d(H_j \circ \gamma)}{dt}(t)\, dt
\end{aligned} \tag{6.3}
\]
Shrikumar et al. [2018] recognized the last line above, which, since H_j(γ(t)) traces a path, formulates conductance as a path method. Note that this path may not be monotone, implying the usefulness of non-monotone path methods.

The above formulations can be extended to calculating the gradient flow through a group of neurons in a layer, or through a sequence of neurons in multiple layers. But here we run into a computational issue. Calculating Eqs. 6.1 or 6.3 for each neuron in a layer could be expensive using standard programs. We hypothesize this is because such programs are designed primarily for efficient back-propagation, which finds the gradient of a single output with respect to many inputs, not the Jacobian for a large number of outputs.

6.2 Neuron Attributions for an Input Patch

An IG attribution map usually highlights regions or features that contributed to a model's output, e.g., highlighting a face in a picture of a person. A pertinent question is: are there internal neurons that are responsible for attributing that feature? In our example, are there neurons causing IG to highlight the face? We propose an answer by attributing to a layer of internal neurons for an input patch. If we index each input feature, then we can denote a patch of input features by S. The gradient flow through a neuron j for the patch S is given by:
\[
IG_{S,j} = \sum_{i \in S} IG_{i,j} = \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big) \sum_{i \in S} \frac{\partial H_j}{\partial x_i}\big(\gamma(t)\big)\,(\bar{x}_i - x'_i)\, dt \tag{6.4}
\]
As noted in Section 6.1, computing Eq. 6.4 for a full layer of neurons can be expensive. We introduce a speedup inspired by Shrikumar et al. [2018]. Let d be a vector with d_i = x̄_i − x′_i if i ∈ S, and d_i = 0 if i ∉ S. Denote the unit vector d/‖d‖ by d̂. Formulating a directional derivative and then taking a Riemann sum with N terms, we write:
\[
\begin{aligned}
IG_{S,j} &= \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big) \sum_{i \in S} \frac{\partial H_j}{\partial x_i}\big(\gamma(t)\big)\,(\bar{x}_i - x'_i)\, dt \\
&= \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big)\, D_{\hat{d}} H_j(\gamma(t))\, \|d\|\, dt \\
&\approx \|d\| \int_0^1 \frac{\partial G}{\partial H_j}\big(H(\gamma(t))\big)\, \frac{H_j(\gamma(t) + \hat{d}/N) - H_j(\gamma(t))}{1/N}\, dt \\
&\approx \|d\| \sum_{k=1}^{N} \frac{\partial G}{\partial H_j}\Big(H\big(\gamma(\tfrac{k}{N})\big)\Big) \times \Big[ H_j\big(\gamma(\tfrac{k}{N}) + \tfrac{\hat{d}}{N}\big) - H_j\big(\gamma(\tfrac{k}{N})\big) \Big]
\end{aligned}
\]
With this speedup, we bypass computing the Jacobian to find ∂H_j/∂x_i for each input and internal neuron. For an accurate calculation, choose N such that IG_{S,j} + IG_{S^c,j} ≈ IG_{*,j}.

7 Empirical Evaluations

Here, we present experiments validating the methods in Section 6. 12 We experiment on two models/data sets: ResNet-152 [He et al., 2016] trained on ImageNet [Deng et al., 2009], and a custom model trained on Fashion MNIST [Xiao et al., 2017]. Some results for ImageNet appear here, while further results appear in Appendix C.3. The general outline of each experiment is: 1) calculate a performance metric for neurons in an internal layer using IG attributions, 2) rank neurons based on the performance metric, and 3) prune neurons according to rank and observe corresponding changes in the model. The goal is to validate the claim that the methods of Section 6 identify neurons that contribute to a particular task. The code used in our experiments is available at: https://github.com/optimization-for-data-driven-science/XAI.
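For illustration only, the following is a minimal NumPy sketch of the directional-derivative speedup above; it is not the repository code. The toy functions H and G stand in for the two halves of a network, patch_conductance is a hypothetical helper name, and ∂G/∂H_j is taken by finite differences for simplicity.

```python
import numpy as np

def H(x):
    # Toy "first half" of a network: 2 internal neurons from 3 inputs.
    return np.array([x[0] + 2.0 * x[1], x[1] * x[2]])

def G(h):
    # Toy "second half": scalar output from the internal layer.
    return h[0] * h[1] + 0.5 * h[0]

def grad_G(h, eps=1e-6):
    # Finite-difference gradient of G w.r.t. the internal neurons.
    g = np.zeros_like(h)
    for j in range(len(h)):
        e = np.zeros_like(h); e[j] = eps
        g[j] = (G(h + e) - G(h - e)) / (2 * eps)
    return g

def patch_conductance(x_bar, x_base, S, N=300):
    """Approximate IG_{S,j} for every internal neuron j via the
    directional-derivative speedup (Eq. 6.4), avoiding the Jacobian of H."""
    d = np.zeros_like(x_bar)
    d[S] = x_bar[S] - x_base[S]
    d_norm = np.linalg.norm(d)
    if d_norm == 0:
        return np.zeros_like(H(x_bar))
    d_hat = d / d_norm
    total = np.zeros_like(H(x_bar))
    for k in range(1, N + 1):
        point = x_base + (k / N) * (x_bar - x_base)   # straight-line path
        hk = H(point)
        total += grad_G(hk) * (H(point + d_hat / N) - hk)
    return d_norm * total

x_bar, x_base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
S, S_c = [0, 1], [2]
ig_S = patch_conductance(x_bar, x_base, S)
ig_Sc = patch_conductance(x_bar, x_base, S_c)
print(ig_S, ig_Sc, ig_S + ig_Sc)   # sanity check: the sum should approximate IG_{*,j}
```

The final print mirrors the accuracy check suggested above: for a large enough N, the patch and complement conductances should sum to approximately the full conductance IG_{*,j}.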
12 Experiments in this section were coded by Tianjian Huang (tianjian@usc.edu). Analysis of experiments was produced by a collaboration of Daniel Lundstrom, Tianjian Huang, and Meisam Razaviyayn.

7.1 Preliminaries: Pruning Based on Whole Input Internal Neuron Attributions

The first experiment (Figure 2.1) calculates a general performance metric for each internal neuron in a particular layer. We calculate the average neuron conductance over each input in the training set, where the output of F is the confidence in the correct label. We use a black image as a baseline. Following the method of "deletion and insertion" [Petsiuk et al., 2018], we progressively prune (zero out) a portion of the neurons according to their rank and observe changes in model accuracy on the testing set. 13 We zero out internal neurons because we wish to mask the indication of feature presence, and ResNet uses the ReLU activation function, which encodes no feature presence with a neuron value of zero. These experiments are performed twice: once on a dense layer (2nd to last), and once on a convolutional layer (the output of the conv2x block).

Figure 2.1: Pruning neurons by conductance values versus random pruning. IG ↓ means pruning neurons by conductance values in descending order. IG ↑ means pruning neurons by conductance values in ascending order.

13 While similar, our experiment differs from others by zeroing the filter, not ablating it [Dhamdhere et al., 2018] or fixing it to a reference input [Shrikumar et al., 2018].

When pruning the dense layer, we see that the order of pruning makes little difference in performance. We attribute this effect to the dense layer having an evenly distributed neuron importance, something likely in a 1000-category classifier. In the convolutional layer, we see that pruning in descending order rapidly kills the model's accuracy, while pruning in ascending order generally maintains model accuracy better than random pruning. This shows that average conductance can help identify neuron importance.

The second experiment (Figure 2.2) calculates a performance metric indicating a neuron's contribution to identifying a particular image category. We follow Lundstrom et al. [2022b] and calculate the same performance metric as previously, but average over a particular category of images (e.g., Lemon). We then rank and prune neurons, observing changes in the model's test accuracy identifying the particular category.

In both layers, pruning to kill performance quickly reduces the model's accuracy identifying Lemon, compared to the median category's performance and the random pruning baseline. When we prune to keep performance in the dense layer, we see that Lemon performs well below the median with random pruning, but swaps to above the median with IG pruning. Pruning in the convolutional layer quickly causes Lemon to become very accurate while the median accuracy dips below the random baseline.

7.2 Pruning Based on Image Patch Internal Neuron Attributions

Here we show results of an experiment using image-patch-based internal neuron attributions. In a picture of two traffic lights (Figure 2.3, top-left), we identify an image patch around one traffic light as a region of interest. We then find the attributions of each internal neuron in a convolutional layer for this image patch and rank them.
Using this ranking, we progressively prune the neurons (top-ranked first), periodically reassessing the total IG attributions inside 37 Figure 2.2: Testing accuracy when neurons are pruned according to their IG values corre- sponding to the class Lemon. Top: neurons pruned in dense layer. Bottom: neurons pruned in a convolutional layer. Left: Neurons pruned by IG values in descending order. Right: Neurons pruned by IG values in ascending order. “Random, Median”, “IG, Median” report median accuracy of all classes for random/ranked pruning. “Random, Lemon”, “IG, Lemon” report accuracy of class Lemon for random/ranked pruning. and outside the specified region. This procedure is repeated, instead ranking neurons by their conductance for the image. From Figure 2.4, we see that using global conductance rankings causes the sum of IG inside and outside the bounding box to briefly fluctuate, then converge to zero. In comparison, pruning by region-targeted rankings consistently causes a positive IG sum outside the box and negative IG sum inside the box. This reinforces the claim that image- patch based rankings give high ranks to neurons causing positive IG values in the bounding 38 Figure 2.3: Top Left: The original image and bounding box indicating specified image patch. Top Right: IG attributes visualized. Green dots show positive IG, red dots show negative IG. We see most IG attributes are within or around the bounding box. Bottom Left: IG attributes visualized after top 1% of neurons pruned based on image-patch attributions. We see IG attributes moved from the right light to the left light. Bottom Right: IG attributes visualized after top 1% neurons pruned based on the global ranking. We see IG attributes are scattered. box. Interestingly, we also see that ( P IG, all) quickly drops for the global pruning but stays elevated for the regional pruning. By completeness, this indicates the model quickly looses confidence in the former case, but keeps a high confidence for up to 50% pruning when pruned using region-targeted rankings. In Figure 2.3, we prune the top-1% of neurons in a convolutional layer according to both conductance and image-patch rankings, then re-visualize the IG. The model gives an initial confidence score of 0.9809. When pruning according to conductance, the confidence 39 changes to 0.9391, but the model’s attention loses focus, and a broad region receives a cloudy mixture of positive and negative attributions. When pruning according to the image-patch rankings, the confidence score is 0.9958, but the model’s attention shifts from the right traffic light to the left one. This validates that the image-patch method indeed highly ranked internal neurons associated with the right traffic light, and ranked neurons is a region-targeted way compared to general neuron conductance. Further experiments can be found in Appendix C.3. Figure 2.4: Sum of IG attributes inside and outside the bounding box when neurons are pruned according to certain rankings. Left: Neurons are pruned based on IG global ranking. Right: Neurons are pruned based on the IG ranking inside the bounding box. Acknowledgement. The work in this section supported in part by a gift from the USC-Meta Center for Research and Education in AI and Learning. 40 Chapter 3 Analyzing Interactions with Synergy Functions This chapter is based on the paper “Distributing Synergy Functions: Unifying Game- Theoretic Interaction Methods for Machine-Learning Explainability” [Lundstrom and Raza- viyayn, 2023b]. 
This chapter offers a method of analysis for attribution and k-th order interaction methods of continuous-input models through the concept of synergy functions. We show that, given natural and modest assumptions, synergy functions give a unique accounting of all interactions between features. We also show that any continuous-input function has a unique synergy decomposition. Because of this decomposition, various (existing) methods are governed by rules of synergy distribution, and common axioms constrain the distribution of synergies. With this in mind, we highlight the particular strengths and weaknesses of established methods.

Furthermore, we show that under natural continuity criteria, gradient-based attribution/interaction methods on analytic functions are uniquely characterized by their actions on monomials. This collapses the question "how should we define interactions on analytic functions?" to "how should we define interactions of a monomial?" We then give two methods that serve as potential answers to this question.

1 Introduction to Interactions

1.1 Notation and Terminology

Attribution methods give a score to the contribution of each input feature. Interactions extend attributions and give a score to a group of features based on the group's contribution to F(x̄) beyond the contributions of each feature [Grabisch and Roubens, 1999]. For ease of reference, we may speak of a nonempty set S ⊆ N as being a group of features, by which we mean the group of features with indices in S. Let P_k = {S ⊆ N : |S| ≤ k}. Then we can define a k-th order baseline interaction method by:

Definition 7 (k-th Order Baseline Interaction Method). A k-th order baseline interaction method is any function of the form I^k(x̄, x′, F) : D → ℝ^{|P_k|}, where D ⊆ [a,b] × [a,b] × F.

k-th order interaction methods are a sort of expansion of attributions, giving a contribution for each group of features in P_k. For some S ∈ P_k, the term I^k_S(x̄, x′, F) indicates the component of I^k(x̄, x′, F) that gives interactions among the group of features S. When speaking of interactions among a group of features, there are multiple possible meanings: marginal interactions between members of a group, total interactions among members of the group, and average interactions among members of the group. Loosely speaking, if we let G_S be the interactions among the features of S that are not accounted for by the interactions of sub-groups, then G_S represents marginal interactions of features in S, Σ_{T⊆S} G_T represents the total interactions of features in S, and Σ_{T⊆S} μ_T G_T represents average interactions of features in S, where μ_T is some weight function. This thesis focuses on marginal interactions.

Using quadratic regression as an example, suppose F(x₁, x₂, x₃) = 2x₁ − 3x₂ + x₁x₃ − 15, x̄ = (1,1,1), and x′ = (0,0,0). Then a 2nd-order baseline interaction method may report something like: I_∅(x̄, x′, F) = −15, I_{1}(x̄, x′, F) = 2, I_{2}(x̄, x′, F) = −3, and I_{1,3}(x̄, x′, F) = 1, with the other interactions equal to zero.

It should be noted that 1st-order interactions with I¹_∅ disregarded and baseline attributions have equivalent definitions. As with attributions, interactions may not be defined for all (x̄, x′, F); we denote the set of inputs where a given I^k is defined by D_{I^k}. As with attributions, all interactions are baseline k-th order interactions for the purpose of this thesis. We may drop x′ if the baseline is fixed, and also drop x̄, implying that some appropriate value is considered.
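As a concrete illustration of the example report above, here is a minimal sketch (not a published method; it anticipates the synergy decomposition of Section 2) that produces such a report by inclusion-exclusion over feature subsets of size at most 2. The helper name interaction_report is hypothetical, and indices are 0-based in the code, so the key (0, 2) corresponds to the group {1, 3} in the text.

```python
from itertools import combinations

def interaction_report(F, x_bar, x_base, k=2):
    """Inclusion-exclusion 'pure' contribution for each subset S with |S| <= k."""
    n = len(x_bar)
    def F_masked(S):
        # Evaluate F with only the features in S moved off the baseline.
        x = [x_bar[i] if i in S else x_base[i] for i in range(n)]
        return F(*x)
    report = {}
    for size in range(k + 1):
        for S in combinations(range(n), size):
            report[S] = sum((-1) ** (len(S) - len(T)) * F_masked(T)
                            for r in range(len(S) + 1) for T in combinations(S, r))
    return report

F = lambda x1, x2, x3: 2 * x1 - 3 * x2 + x1 * x3 - 15
report = interaction_report(F, (1, 1, 1), (0, 0, 0))
print(report)
# {(): -15, (0,): 2, (1,): -3, (2,): 0, (0, 1): 0, (0, 2): 1, (1, 2): 0}
# Completeness over nonempty subsets: 2 - 3 + 1 = 0 = F(1,1,1) - F(0,0,0).
```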
1.2 Interaction Axioms Here we review some common interaction axioms found in the literature [Grabisch and Roubens, 1999] [Sundararajan et al., 2020], [Sundararajan and Najmi, 2020], [Tsai et al., 2022], [Janizek et al., 2021], [Marichal and Roubens, 1999], [Zhang et al., 2020]. These are generally used to guarantee a method has desirable properties and constrains the possible forms a method can take. The reader should note that these axioms are reformulations or extensions of axioms for attributions. 1. Completeness: P S∈P k ,|S|>0 I k S (¯x,x ′ ,F)=F(¯x)− F(x ′ ) for all (¯x,x ′ ,F)∈D I k. 2. Linearity: If (¯x,x ′ ,F), (¯x,x ′ ,G)∈ D I k, a,b∈R, then (¯x,x ′ ,aF +bG)∈ D I k, and I k (¯x,x ′ ,aF +bG)=aI k (¯x,x ′ ,F)+bI k (¯x,x ′ ,G). 3. NullFeature: If(¯x,x ′ ,F)∈D I k,F doesnotvaryinx i , andi∈S, thenI k S (¯x,x ′ ,F)= 0. 43 While these three are very common to the literature, there are many other axioms for attributions and interactions offered that generally serve one of two purposes: either they dis- tinguishamethodasunique, ortheyshowthatamethodsatisfiesdesirablequalities. Among them are symmetry [Sundararajan et al., 2020], symmetry-preservation [Sundararajan et al., 2017], [Janizek et al., 2021], [Sundararajan and Najmi, 2020], interaction symmetry [Janizek et al., 2021], [Tsai et al., 2022], interaction distribution [Sundararajan et al., 2020], binary dummy [Grabisch and Roubens, 1999],[Sundararajan et al., 2020], sensitivity (sometimes called sensitivity (a))[Sundararajan et al., 2017], [Sikdar et al., 2021], implementation invari- ance [Sundararajan et al., 2017], [Sundararajan et al., 2020], [Janizek et al., 2021], [Sikdar et al., 2021], set attribution [Tsang et al., 2020b], non-decreasing positivity [Lundstrom et al., 2022a], recursive axioms [Grabisch and Roubens, 1999], 2-Efficiency [Grabisch and Roubens, 1999], [Tsai et al., 2022], faithfulness 1 [Tsai et al., 2022], affine scale invariance [Friedman, 2004], [Sundararajan and Najmi, 2020], [Xu et al., 2020], demand monotonicity [Sundararajan and Najmi, 2020], proportionality [Sundararajan and Najmi, 2020], and causality [Xu et al., 2020]. Some of the above axioms, such as linearity or implementation invariance, are satisfied by many methods, but no one method satisfies all axioms. For example, Faith-Shap [Tsai et al., 2022] agrees with the Shapley-Taylor’s [Sundararajan et al., 2020] axioms up to a point, but while Shapley-Taylor posits interaction distribution to gain a unique method, Faith-Shap instead posits a formulation of faithfulness to gain a unique method. There are natural limitation to this setup, as some attributions in the literature do not satisfy these definitions and axioms. For example, GradCAM [Selvaraju et al., 2017] does not use a baseline input, nor does Smoothgrad [Smilkov et al., 2017]. Many methods, such 1 While not stated as an axiom, “faithfulness” was given as a desirable property and used to constrain the form of an interaction. 44 as Layer-Wise Relevance Propagation [Zeiler and Fergus, 2014] or Deconvolutional networks [Springenberg et al., 2014], do not attempt to satisfy completeness, so that the magnitude of the attributions is governed by some other principle. It may be possible for some methods to be adjusted to have a baseline by taking the difference in attributions between an input and a baseline, or to satisfy completeness by scaling all attributions by some proportion. 
While considering adjusted methods could conceivably lead to interesting results, the methods in question are not designed to fit into a game-theoretic context, and we omit analysis of methods that need adjustment to fit in the game-theoretic paradigm.

1.3 Two Interaction Methods

Several k-th order interactions that extend Shapley values have been proposed, all of which are binary feature methods [Grabisch and Roubens, 1999], [Tsai et al., 2022]. For our purposes we review one in particular. First, recall that for given features S ⊆ N and assumed baseline x′, we define x̄_S ∈ [a,b] by:
\[
(\bar{x}_S)_i = \begin{cases} \bar{x}_i & \text{if } i \in S \\ x'_i & \text{if } i \notin S, \end{cases} \tag{1.1}
\]
where x̄_i is the i-th element of x̄ and x′_i is the i-th element of x′. Next, define
\[
\delta_{S|T} F(x) = \sum_{W \subseteq S} (-1)^{|S| - |W|}\, F(x_{W \cup T}),
\]
which intuitively measures the marginal impact of including the features in S when the features in T are already present, based on the inclusion-exclusion principle. The Shapley-Taylor Interaction Index of order k [Sundararajan et al., 2020] is then given by:
\[
ST^k_S(\bar{x}, F) = \begin{cases} \dfrac{k}{n} \displaystyle\sum_{T \subseteq N \setminus S} \dfrac{\delta_{S|T} F(\bar{x})}{\binom{n-1}{|T|}} & \text{if } |S| = k \\[3mm] \delta_{S|\emptyset} F(\bar{x}) & \text{if } |S| < k. \end{cases} \tag{1.2}
\]
Shapley-Taylor prioritizes interactions of order k, and its unique contribution is to satisfy the interaction distribution axiom, which is discussed in 2.4.

Currently, no k-th order interaction extension of IG has been proposed. However, a 2nd-order interaction, the Integrated Hessian (IH), has been proposed in Janizek et al. [2021]. This interaction method computes the pairwise interaction between x̄_i and x̄_j as:
\[
IH_{\{i,j\}}(\bar{x}, F) = 2(\bar{x}_i - x'_i)(\bar{x}_j - x'_j) \int_0^1\!\!\int_0^1 st\, \frac{\partial^2 F}{\partial x_i \partial x_j}\big(x' + st(\bar{x} - x')\big)\, ds\, dt
\]
The "main effect" of x̄_i, or lone interaction (a misnomer), is defined as:
\[
IH_{\{i\}}(\bar{x}, F) = (\bar{x}_i - x'_i) \int_0^1\!\!\int_0^1 \frac{\partial F}{\partial x_i}\big(x' + st(\bar{x} - x')\big)\, ds\, dt + (\bar{x}_i - x'_i)^2 \int_0^1\!\!\int_0^1 st\, \frac{\partial^2 F}{\partial x_i^2}\big(x' + st(\bar{x} - x')\big)\, ds\, dt
\]
IH is what we label a recursive method, since it uses an attribution method recursively. Specifically, IH_{i,j}(x̄, F) = IG_i(x̄, IG_j(·, F)) + IG_j(x̄, IG_i(·, F)). Similarly, IH_{i}(x̄, F) = IG_i(x̄, IG_i(·, F)) [Janizek et al., 2021]. Note that the domain where IH is well defined is restricted to functions where components of the Hessian can be integrated along the path x′ + t(x̄ − x′). We discuss the expansion of IH to a k-th order interaction and its properties in Section 4.2 and Appendix E.3.2.

1.4 The Möbius Transform

Lastly, we review the Möbius transform, which will be useful for our definition of the notion of "pure interactions" in Section 2. Let v be a real-valued function on |N| binary variables, so that v : {0,1}^N → ℝ. For S ⊆ N, we write v(S) to denote v((𝟙_{1∈S}, ..., 𝟙_{n∈S})), where 𝟙 is the indicator function. Recall that the Möbius transform of v is a function a(v) : {0,1}^N → ℝ given by Rota [1964]:
\[
a(v)(S) = \sum_{T \subseteq S} (-1)^{|S| - |T|}\, v(T). \tag{1.3}
\]
The Möbius transform satisfies the following relation to v:
\[
v(S) = \sum_{T \subseteq N} a(v)(T)\, \mathbb{1}_{T \subseteq S} = \sum_{T \subseteq S} a(v)(T). \tag{1.4}
\]
The Möbius transform can be conceptualized as a decomposition of v into the marginal effects on v for each subset of N. Each subset of S has its own marginal effect on the change in function value of v, so that v(S) is a sum of the individual effects, represented by a(v)(T) in Eq. (1.4).
For example, if N = {1,2}, then for
\[
v(S) = \begin{cases} \alpha & \text{if } S = \emptyset \\ \beta & \text{if } S = \{1\} \\ \gamma & \text{if } S = \{2\} \\ \delta & \text{if } S = \{1,2\} \end{cases}
\quad \text{we have} \quad
a(v)(S) = \begin{cases} \alpha & \text{if } S = \emptyset \\ \beta - \alpha & \text{if } S = \{1\} \\ \gamma - \alpha & \text{if } S = \{2\} \\ \delta - \beta - \gamma + \alpha & \text{if } S = \{1,2\}. \end{cases}
\]

2 Möbius Transforms as a Complete Account of Interactions

2.1 Motivation: Pure Interactions

In order to identify desirable qualities of an interaction method, it would be fruitful to answer the question: what sort of function is a "pure interaction" of the features in S? Specifically, is F(x₁, x₂, x₃) = x₁x₂ a function of pure interaction between x₁ and x₂? This question is useful because if F is a pure interaction of x₁ and x₂ (i.e., the only effect in F is an interaction between x₁ and x₂), then naturally it ought to be that I²_S(x̄, F) = 0 for S ≠ {1,2}. Indeed, to continue the example, suppose F is a general function and we can decompose F as follows:
\[
F(x) = f_\emptyset + \sum_{1 \le i \le 3} f_{\{i\}}(x_i) + \sum_{1 \le i < j \le 3} f_{\{i,j\}}(x_i, x_j) + f_{\{1,2,3\}}(x),
\]
where f_∅ is some constant, f_{i} is the pure main effect of x_i, f_{i,j} gives pure pairwise interactions, and f_{1,2,3} is the pure interaction between x₁, x₂, and x₃. Assuming I² conforms to linearity, we would obtain:
\[
I^2_S(\bar{x}, F) = \sum_{|T| \le 3} I^2_S(\bar{x}, f_T) = I^2_S(\bar{x}, f_S) + I^2_S(\bar{x}, f_{\{1,2,3\}}),
\]
by applying the above principle, namely I²_S(x̄, f_T) = 0 if S ≠ T, |T| ≤ 2. That is, the 2nd-order interaction of F for S would be a sum of I²_S acting on the pure interaction function for group S, written f_S, and I²_S acting on a pure interaction of size 3. This would generalize to higher-order interactions, so that:
\[
I^k_S(\bar{x}, F) = I^k_S(\bar{x}, f_S) + \sum_{T \subseteq N,\ |T| > k} I^k_S(\bar{x}, f_T).
\]
We would then have to determine what rules should govern I^k_S(x̄, f_S) and I^k_S(x̄, f_T), |T| > k.

2.2 Unique Full-Order Interactions

In the previous section we spoke intuitively regarding the notion of pure interaction; we now present a formal treatment. Let I^n be an n-th order interaction function, i.e., I^n gives the interaction for all possible subsets of features. In addition to the axioms of completeness and null feature above, we propose two modest axioms for such a function. First, we propose a milder form of linearity, which requires linearity only for functions that I^n_S assigns no interaction to. We weaken linearity in the interest of establishing the notion of pure interactions with minimal assumptions.

4. Linearity of Zero-Valued Functions: If (x̄, x′, G), (x̄, x′, F) ∈ D_{I^n} and S ⊆ N are such that I^n_S(x̄, x′, G) = 0, then I^n_S(x̄, x′, F + G) = I^n_S(x̄, x′, F).

Before introducing the next axiom, we consider the meaning of the baseline, x′. In cost sharing, the baseline is the state where all agents make no demands [Shapley and Shubik, 1971]. If an agent makes no demands, there are no attributions, nor are there interactions with other players. Likewise, the original IG paper notes [Sundararajan et al., 2017]:

"Let us briefly examine the need for the baseline in the definition of the attribution problem. A common way for humans to perform attribution relies on counterfactual intuition. When we assign blame to a certain cause we implicitly consider the absence of the cause as a baseline for comparing outcomes. In a deep network, we model the absence using a single baseline input."

As with the cost-sharing literature and Sundararajan et al. [2017], we interpret the condition x̄_i = x′_i to indicate that the feature represented by x_i is not present. Recalling that x_S denotes a vector where the components in S are not fixed at the baseline values in x′, we present the next axiom:
5. Baseline Test for Interactions (k = n): For baseline x′, if F(x_S) is constant for all x, then I^n_S(x̄, x′, F) = 0.

This axiom states that if every variable not in S is held at its baseline value, and the variables in S are allowed to vary, but the function is constant, then there is no interaction between the features of S. Why is this sensible? The critical observation is that a feature being at its baseline value indicates the feature is not present. If the features of S have no effect when the other features are absent, then the features of S do not interact in and of themselves, and their interaction measurement should be zero.

Our definition of interactions allows F and x′ to be chosen separately. However, it is generally the case that a model will be trained on data which will inform the appropriate choice of baseline. It is possible that a model does not admit a baseline representing the absence of features, in which case game-theoretic baseline attributions and interactions may be ill-suited as explanation tools. We proceed to discuss the case when F has a baseline, and assume implicitly that x′ is chosen as the fitting baseline for F.

Theorem 5. There is a unique n-th order interaction method with domain [a,b] × [a,b] × F that satisfies completeness, null feature, linearity of zero-valued functions, and baseline test for interactions (k = n).

Proof of Theorem 5 is deferred to Appendix D.1. We turn to explicitly defining the unique interaction function satisfying the conditions in Theorem 5. For a fixed x and implicit x′, F(x_S) is a function of S. This implies it can be formulated as a function of binary variables indicating whether each input component of F takes value x_i or x′_i. Thus we can take the Möbius transform of F(x_{(·)}), written as a(F(x_{(·)})). Now, if we evaluate the Möbius transform of F(x_{(·)}) for some S, given as a(F(x_{(·)}))(S), and allow x to vary, then this is a function of x. Recall that P_k = {S ⊆ N : |S| ≤ k}. Given a baseline x′, define the synergy function:

Definition 8 (Synergy Function). For F ∈ F, S ∈ P_n, and implicit baseline x′ ∈ [a,b], the synergy function ϕ : P_n × F → F is defined by the relation ϕ_S(F)(x) = a(F(x_{(·)}))(S).

We present the following example to help illustrate the synergy function: let F(x₁, x₂) = a + b x₁² + c sin x₂ + d x₁ x₂², and suppose x′ = (0,0) are the baseline values for x₁ and x₂ that indicate the features are not present. The synergy for the empty set is the constant F(x′) = a, indicating the baseline value of the function when no features are present. To obtain ϕ_{1}(F), we allow x₁ to vary but keep x₂ at the baseline, and subtract the value of F(x′). This gives us ϕ_{1}(F)(x) = a + b x₁² − a = b x₁². If instead we allow only x₂ to vary, we get ϕ_{2}(F)(x) = a + c sin(x₂) − a = c sin(x₂). Finally, if we allow both to vary and subtract off all the lower synergies, we get ϕ_{1,2}(F)(x) = d x₁ x₂².

Note that if we fix S ⊆ N, the synergy function ϕ_S is a linear functional from F to F. However, ϕ can act as an n-th order interaction method. Specifically, if ϕ is taken with respect to the baseline x′, then ϕ_S(F)(x̄) gives an interaction score for the input (x̄, x′, F) and set of inputs S ⊆ N. With this clarification, we turn to the following corollary:

Corollary 1. The synergy function is the unique n-th order interaction method that satisfies completeness, null feature, linearity of zero-valued functions, and baseline test for interactions (k = n).

Proof of Corollary 1 is relegated to Appendix D.2.
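A small numerical sketch of Definition 8 follows, checking the worked example above; the constants a, b, c, d are chosen arbitrarily and the helper name synergy is illustrative.

```python
import math
from itertools import combinations

def synergy(F, S, x, x_base):
    """phi_S(F)(x): Mobius transform of T |-> F(x_T), evaluated at S (Def. 8)."""
    n = len(x)
    def F_masked(T):
        return F([x[i] if i in T else x_base[i] for i in range(n)])
    return sum((-1) ** (len(S) - r) * F_masked(T)
               for r in range(len(S) + 1) for T in combinations(S, r))

# Worked example from the text: F(x1, x2) = a + b*x1^2 + c*sin(x2) + d*x1*x2^2,
# with baseline x' = (0, 0); constants picked arbitrarily for the check.
a, b, c, d = 1.0, 2.0, -3.0, 0.5
F = lambda x: a + b * x[0] ** 2 + c * math.sin(x[1]) + d * x[0] * x[1] ** 2

x = (1.5, 2.0)
print(synergy(F, (), x, (0, 0)))       # a
print(synergy(F, (0,), x, (0, 0)))     # b * x1^2
print(synergy(F, (1,), x, (0, 0)))     # c * sin(x2)
print(synergy(F, (0, 1), x, (0, 0)))   # d * x1 * x2^2
```

Each printed value matches the corresponding closed-form synergy from the example, which is the point of Corollary 1: the inclusion-exclusion rule pins down the interaction of every subset.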
The properties of the synergy function stem from properties of the Möbius transform. Specifically, because the synergy function is defined by the Möbius transform, it inherits many of its properties, including completeness, null feature, linearity of zero-valued functions, and baseline test for interactions (k = n). The primary precursor to the synergy function is the Harsanyi dividend [Harsanyi, 1963], which is like the Möbius transform and is formulated for discrete-input settings. More recently, the Shapley-Taylor Interaction Index [Sundararajan et al., 2020] takes the form of the Möbius transform when k = n, where Shapley-Taylor imposes symmetry and interaction distribution axioms. Likewise, Faith-Shap [Tsai et al., 2022] takes the form of the Möbius transform when k = n, where Faith-Shap primarily imposes a best-fit property dubbed faithfulness. The novelty of the synergy function is that, while previous works assumed F to be a set function (as in Section 1.4), the synergy function is a linear functional between continuous-input functions. Consequently, Corollary 1 is novel, not only because of the inclusion of baseline test for interactions (k = n), but also because none of the axioms assumes F is a set function.

2.3 Properties of the Synergy Function

Given a function F, the synergy of a single feature x̄_i is given by ϕ_{i}(F)(x̄) = F(x̄_{i}) − F(x′), and the pairwise synergy for features x̄_i and x̄_j is
\[
\phi_{\{i,j\}}(F)(\bar{x}) = F(\bar{x}_{\{i,j\}}) - \phi_{\{i\}}(F)(\bar{x}) - \phi_{\{j\}}(F)(\bar{x}) - F(x') = F(\bar{x}_{\{i,j\}}) - F(\bar{x}_{\{i\}}) - F(\bar{x}_{\{j\}}) + F(x').
\]
In general, the synergy function for a group of features S is
\[
\phi_S(F)(\bar{x}) = F(\bar{x}_S) - \sum_{T \subsetneq S,\ T \ne \emptyset} \phi_T(F)(\bar{x}) - F(x') = \sum_{T \subseteq S} (-1)^{|S| - |T|} \times F(\bar{x}_T).
\]
With this we can define the notion of a pure interaction. A pure interaction function of the features S is a function that 1) takes a value of 0 if any feature in S takes its baseline value, and 2) varies, and only varies, in the features in S. 2 This is exactly what the synergy function accomplishes: either ϕ_S(F)(x) = 0, or ϕ_S(F)(x) varies in exactly the features in S and is 0 whenever x_i = x′_i for any i ∈ S. More technically, define C_S = {F ∈ F | F is a pure interaction function of S} to be the set of pure interactions of the features S. Then we have the following corollary:

Corollary 2. Suppose an implicit baseline x′ ∈ [a,b] and let F ∈ F, and S, T ∈ P_n. Then the following hold:

1. Pure interaction sets are disjoint, meaning C_S ∩ C_T = ∅ whenever S ≠ T.

2. ϕ_S projects F onto C_S ∪ {0}. That is, ϕ_S(F) ∈ C_S ∪ {0} and ϕ_S(ϕ_S(F)) = ϕ_S(F).

3. For Φ_T ∈ C_T, we have ϕ_S(Φ_T) = 0 whenever S ≠ T.

4. ϕ uniquely decomposes F ∈ F into a set of pure interaction functions on distinct groups of features. That is, there exists P ⊆ P_n such that F = Σ_{S∈P} Φ_S where each Φ_S ∈ C_S, only one such representation exists, and Φ_S = ϕ_S(F) for each S ∈ P while ϕ_S(F) = 0 for each S ∈ P_n \ P.

2 For the degenerate case where S = ∅, a pure interaction of the features of S would be a constant function.

Proof of Corollary 2 is relegated to Appendix D.3. For ease of notation, we move forward assuming that if x′ is not stated, the implicit baseline value is x′ = 0 and is appropriate to F. We also assume that the synergy function ϕ_S is applied using the proper implicit baseline choice. Lastly, we denote by Φ_S ∈ C_S a pure interaction in S as defined above, or what we may also call a "synergy function" in S.
2.4 Axioms and the Distribution of Synergies Nowthatwehavethenotionofpureinteractionsbywayofthesynergyfunction,wecomment on the interplay between axioms and synergy functions. First, we present a version of the baseline test for interactions which applies for k≤ n. The idea is a generalization of the (k =n) case; that if I k is a k th -order interaction and Φ S is some pure interaction function with|S|≤ k, then I k (Φ S ) should not report interactions for any set but S. We give this as an axiom: 6. Baseline Test for Interactions (k≤ n): For baseline x ′ and any synergy function Φ S with|S|≤ k, if T ⊊ S, then I k T (Φ S )=0. This is a weaker version of the defining axiom of Shapley-Taylor [Sundararajan et al., 2020], which states: 7. Interaction Distribution: For baseline x ′ and any synergy function Φ S , if T ⊊ S and|T|<k, then I k T (Φ S )=0. 54 The baseline test of interactions asserts that if a synergy function is for a group of at least size k, I k should not report interactions for any other group. The interaction distribution asserts the same, and adds the caveat that if the synergy function is for a group of size larger than k, it must be distributed only to groups of size k. We now detail how some of these axioms can be formulated as constraints on the distribution of synergies. 1. Completeness: enforces that any method distributes a synergy among sets of inputs. Formally, for a synergy function Φ S , we may say that I k T (¯x,Φ S ) = w T (¯x,Φ S )× Φ S (¯x), where w T is some function satisfying P T⊆P k w T (¯x,Φ S )=1. 2. Linearity: enforces that I k (F) is the sum of I k applied to the synergies of F. Formally, I k (F)= P T⊂P k I k (ϕ T (F)). 3. Null Feature: enforces that I k only distributed Φ S to groups T ⊆ S. 4. Baseline Test for Interaction(k≤ n): enforces that Φ S is not distributed to groups T ⊊ S when|S|≤ k. 5. Interaction Distribution: enforces that Φ S is not distributed to groups T ⊊ S when |S|≤ k, and is distributed only to groups of size k when|S|>k. 6. Symmetry 3 : enforces that a synergy Φ S be distributed equally among groups in the binary features case. 3 See Appendix E.1 for a statement of symmetry axiom. 55 3 Synergy Distribution for Binary Feature Methods We now discuss the role of the synergy function in axiomatic attributions/interactions. Harsanyi [1963] 4 noticed that for a synergy function Φ S , the Shapley value is Shap i (¯x,Φ S )= Φ S (¯x) |S| if i∈S 0 if i / ∈S (3.1) This means the Shapley value distributes the function gain from Φ S equally among all i∈S. Using the synergy representation of F and linearity of Shapley values, we get Shap i (¯x,F)= X S⊆ Ns.t.i∈S Φ S (¯x) |S| (3.2) Thus, the Shapley value can be conceptualized as distributing each synergy Φ {i} to ¯x i and distributing all higher synergies, Φ S with |S| ≥ 2, equally among all features in S, e.g., Shap(¯x,Φ {1,2,3} ) = ( Φ {1,2,3} (¯x) 3 , Φ {1,2,3} (¯x) 3 , Φ {1,2,3} (¯x) 3 ,0,...,0). Indeed the Shapley value is characterized by its rule of distributing the synergy function. Proposition 1. [Grabisch, 1997, Thm 1] The Shapley value is the unique attribution that satisfies linearity and acts on synergy functions as in (3.1). For a synergy function Φ S , the Shapley-Taylor interaction index of order k for a group of features T ∈P k is given by: 4 Harsanyi [1963] observed Eq. (3.1) and (3.2) in the binary feature setting with M¨ obius transforms. Here we state the continuous input form with synergy functions. 
\[
ST^k_T(\bar{x}, \Phi_S) = \begin{cases} \Phi_S(\bar{x}) & \text{if } T = S \\[1mm] \Phi_S(\bar{x}) \big/ \binom{|S|}{k} & \text{if } T \subsetneq S,\ |T| = k \\[1mm] 0 & \text{else} \end{cases} \tag{3.3}
\]
The Shapley-Taylor distributes each synergy function of S to its group, unless it is too large (|S| > k), in which case it distributes the synergy equally among all subsets of S of size k. We denote this type of k-th order interaction top-distributing, as it projects all synergies larger than the largest available size, k, to the largest groups available. This results in Shapley-Taylor emphasizing interactions between features of size k, which may be an advantage or a disadvantage, depending on the goal of the interaction. As with the Shapley value, the Shapley-Taylor is characterized by this action on synergy functions:

Proposition 2. [Sundararajan et al., 2020, Prop 4] The Shapley-Taylor Interaction Index of order k is the unique k-th order interaction index that satisfies linearity and acts on synergy functions as in Eq. (3.3).

There is another binary feature k-th order interaction method similar to Shapley-Taylor, briefly mentioned in Sundararajan et al. [2020], with the distinction that it is not top-distributing. Here we detail and augment the method. Similarly to the Integrated Hessian, we may take the Shapley value recursively to gain the pairwise interaction between x̄_i and x̄_j, given by RS_{i,j}(x̄, F) = Shap_i(x̄, Shap_j(·, F)) + Shap_j(x̄, Shap_i(·, F)) = 2 Shap_i(x̄, Shap_j(·, F)). Main effects for x̄_i would be Shap_i(x̄, Shap_i(·, F)).

More generally, consider expanding the expression ‖x‖₁^k, and let N^k_T denote the sum of coefficients associated exactly with the variables with indices in T. Then the Recursive Shapley of order k distributes synergy functions as such:
\[
RS^k_T(\bar{x}, \Phi_S) = \begin{cases} \dfrac{N^k_T}{|S|^k}\, \Phi_S(\bar{x}) & \text{if } T \subseteq S \\[2mm] 0 & \text{else}, \end{cases} \tag{3.4}
\]
where in the case T = S = ∅ we set N^k_T / |S|^k := 1. This formulation, however, has the disadvantage of distributing a portion of synergy functions for groups sized ≤ k to subgroups. For example, the Recursive Shapley reports that a synergy function Φ_{1,2,3}(x̄) also has interactions for the subgroup {1,2}. This violates the baseline test for interactions (k ≤ n). We can modify the method to avoid this issue, causing Recursive Shapley to satisfy the baseline test for interactions (k ≤ n) axiom. We explicitly detail the Recursive Shapley and its modification in Appendix E.3.1. We also give the following theorem (proof in Appendix E.3.1.2):

Theorem 6. The Recursive Shapley of order k is the unique k-th order interaction index that satisfies linearity and acts on synergy functions as in Eq. (3.4).

4 Synergy Distribution for Gradient-Based Methods: The Monomial

A critical aspect of the above binary feature methods is that they treat all features in a synergy function as equal contributors to the function output. For example, consider the synergy function of S = {1,2} given by F(x₁, x₂) = (x₁ − x′₁)¹⁰⁰(x₂ − x′₂). F evaluated at x̄ = (x′₁ + 2, x′₂ + 2) yields F(x̄) = 2¹⁰⁰ · 2¹ = 2¹⁰¹. The Shapley method applied to F treats both inputs as equal contributors, and would indicate that x̄₁ and x̄₂ each contributed 2¹⁰¹/2 to the function increase from the baseline. This assertion seems unsophisticated, not to mention intuitively incorrect, given we know the mechanism of the interaction function.

The IG exhibits the potential advantages of gradient-based attribution methods by providing a more sophisticated attribution. For m ∈ ℕⁿ, define [x − x′]^m = (x₁ − x′₁)^{m₁} ··· (x_n − x′_n)^{m_n}, taking the convention that if m_i = 0 and x̄_i = x′_i, then (x_i − x′_i)^{m_i} = 1.
Define [m]! = m₁!···m_n!, and define D^m F = ∂^{‖m‖₁} F / (∂x₁^{m₁} ··· ∂x_n^{m_n}). We notate the non-constant features of [x]^m by S_m = {i | m_i > 0}. We call a function of the form F(x) = [x − x′]^m a monomial centered at x′, and note that any monomial centered at an assumed baseline x′ is a synergy function of S_m. Assuming m_i > 0 and taking x′ = 0, the IG attribution to [x]^m, a synergy function of S_m, is:
\[
\begin{aligned}
IG_{\{i\}}(\bar{x}, [x]^m) &= \bar{x}_i \int_0^1 m_i\, [t\bar{x}]^{(m_1, \ldots, m_i - 1, \ldots, m_n)}\, dt \\
&= \bar{x}_i \int_0^1 m_i\, t^{\sum_j m_j - 1}\, [\bar{x}]^{(m_1, \ldots, m_i - 1, \ldots, m_n)}\, dt \\
&= m_i\, [\bar{x}]^m\, \frac{t^{\sum_j m_j}}{\sum_j m_j} \bigg|_0^1 \;=\; \frac{m_i}{\|m\|_1}\, [\bar{x}]^m
\end{aligned}
\]
This means that IG distributes the function change of F(x̄) = [x̄]^m to x̄_i in proportion to m_i. For example, IG's attribution for our previous problem is IG((2,2), x₁¹⁰⁰x₂) = ((100/101)·2¹⁰¹, (1/101)·2¹⁰¹), a solution that seems much more equitable than the Shapley value. Thus the IG can distinguish between features based on the form of the synergy, unlike the Shapley value, which treats all features in a synergy function as equal contributors.

4.1 Continuity Condition

We now move to more rigorously develop the connection between gradient-based methods and monomials. To connect the action of attributions and interactions on monomials to broader functions, we now move towards defining the notion of an interaction being continuous in F. Let C^ω denote the set of functions that are real-analytic on [a,b]. It is well known that any F ∈ C^ω admits a convergent multivariate Taylor expansion centered at x′:
\[
F(x) = \sum_{m \in \mathbb{N}^n} \frac{D^m F(x')}{[m]!}\, [x - x']^m \tag{4.1}
\]
Functions in C^ω have continuous derivatives of all orders, and those derivatives are bounded in [a,b]. Thus, C^ω is a well-behaved class that gradient-based interactions ought to be able to assess. Recall that the Taylor approximation of order l centered at x′, denoted T_l, is given by:
\[
T_l(x) = \sum_{m \in \mathbb{N}^n,\ \|m\|_1 \le l} \frac{D^m F(x')}{[m]!}\, [x - x']^m \tag{4.2}
\]
The Taylor approximation for analytic functions has the property that D^m T_l converges uniformly to D^m F for any m ∈ ℕⁿ and x ∈ [a,b]. Given this fact, it would be natural to require that for a given k-th order interaction I^k defined for C^ω functions, lim_{l→∞} I^k(T_l) = I^k(F).

This notion is further justified by the fact that many ML models are analytic. In particular, NNs composed of fully connected and residual layers, analytic activation functions such as sigmoid, mish, and swish, as well as softmax layers are real-analytic. While models using max or ReLU functions are not analytic, they can be approximated to arbitrary precision by analytic functions simply by replacing ReLU and max with the parameterized softplus and smoothmax functions, respectively. With this, we propose a continuity axiom requiring interactions for a sequence of Taylor approximations of F to converge to the interactions at F.

7. Continuity of Taylor Approximation for Analytic Functions: If I^k is defined for all (x̄, x′, F) ∈ [a,b] × [a,b] × C^ω, then for any F ∈ C^ω, lim_{l→∞} I^k(x̄, x′, T_l) = I^k(x̄, x′, F), where T_l is the l-th order Taylor approximation of F centered at x′.

From this we have the following result, whose proof can be found in Appendix E.2:

Theorem 7. Let I^k be an interaction method defined on [a,b] × [a,b] × C^ω which satisfies linearity and continuity of Taylor approximation for analytic functions. Then I^k(x̄, x′, F) is uniquely determined by the values I^k takes for the inputs in the set {(x̄, x′, F) : F(x̄) = [x̄ − x′]^m, m ∈ ℕⁿ}.
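As a quick numerical check of the closed form IG_{i}(x̄, [x]^m) = (m_i/‖m‖₁)[x̄]^m, the following minimal sketch compares a Riemann-sum approximation of IG against that formula for one arbitrary monomial; the helper names ig_riemann and monomial_grad are illustrative, not part of the thesis's code.

```python
import numpy as np

def ig_riemann(F_grad_i, x_bar, i, steps=2000):
    """Riemann-sum IG_i with baseline 0 along the straight-line path."""
    ts = (np.arange(steps) + 0.5) / steps          # midpoints of [0, 1]
    grads = np.array([F_grad_i(t * x_bar, i) for t in ts])
    return x_bar[i] * grads.mean()

# Monomial [x]^m with exponent vector m = (3, 1, 2).
m = np.array([3, 1, 2])
def monomial_grad(x, i):
    e = m.copy(); e[i] -= 1
    return m[i] * np.prod(x ** e)

x_bar = np.array([1.5, 2.0, 0.7])
closed_form = m / m.sum() * np.prod(x_bar ** m)     # (m_i / ||m||_1) * [x_bar]^m
numeric = np.array([ig_riemann(monomial_grad, x_bar, i) for i in range(3)])
print(closed_form)
print(numeric)   # agrees with the closed form to several decimal places
```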
In Section 3 we saw that binary feature methods distribute synergy functions according to a rule, and that rule characterized the method as a whole. Gradient-based methods satisfying linearity and the continuity condition are characterized by their actions on a specific set of elementary synergy functions, the monomials. Thus, given the continuity condition and linearity, we have collapsed the question of continuous interactions to the question of interactions of monomials centered at x′. Specifically, if linearity and continuity are deemed desirable, and a means of distributing monomials can be chosen, then the entire method is determined for analytic functions.

4.2 Integrated Hessians

Next, we present two gradient-based interaction methods corresponding to Shapley-Taylor and Recursive Shapley. For m ∈ ℕⁿ, the Integrated Hessian of F(x) = [x]^m at x̄ is:
\[
IH_{\{i,j\}}(\bar{x}, [x]^m) = \frac{2 m_i m_j}{\|m\|_1^2}\, [\bar{x}]^m, \qquad IH_{\{i\}}(\bar{x}, [x]^m) = \frac{m_i^2}{\|m\|_1^2}\, [\bar{x}]^m
\]
As with Recursive Shapley, IH distributes a portion of any pure interaction monomial to all nonempty subsets of features in S_m, breaking the baseline test for interactions (k ≤ n). For example, although F(x₁, x₂, x₃) = x₁x₂ is a synergy function of S = {1,2}, IH distributes some of F to main effects. This can be remedied by directly distributing single and pairwise synergies, then using IH to distribute monomials involving 3 or more variables. This augmented IH is given below:
\[
IH^*_T(\bar{x}, F) = \phi_T(F)(\bar{x}) + IH_T\Big(\bar{x},\ F - \sum_{|S| \le 2} \phi_S(F)\Big)
\]
Both IH and augmented IH can be extended to k-th order interactions to produce a monomial distribution scheme. Consider the expansion of ‖m‖₁^k, and let M^k_T(m) denote the sum of the terms of the expansion involving exactly the m_i where i ∈ T. Explicitly,
\[
M^k_T(m) = \sum_{l \in \mathbb{N}^n,\ \|l\|_1 = k,\ S_l = T} \binom{k}{l}\, m^l \tag{4.3}
\]
The augmented IH of order k acts on monomial functions as follows:
\[
IH^{k*}_T(\bar{x}, [x]^m) = \begin{cases} [\bar{x}]^m & \text{if } T = S_m \\[1mm] \dfrac{M^k_T(m)}{\|m\|_1^k}\, [\bar{x}]^m & \text{if } T \subsetneq S_m,\ |S_m| > k \\[2mm] 0 & \text{else} \end{cases} \tag{4.4}
\]
To explain, IH^{k*} distributes all monomial synergies of size ≤ k to their groups, and distributes monomial synergies of size > k to subgroups of S_m in proportion to M^k_T(m). A full treatment of both is given in Appendix E.3.2.

Corollary 3. IH^{k*} is the unique k-th order interaction method on analytic functions that satisfies linearity, the continuity condition, and distributes monomials as in Eq. (4.4).

4.3 Sum of Powers: A Top-Distributing Gradient-Based Method

Previously we outlined a k-th order interaction that was not top-distributing. We now present the distribution scheme for a gradient-based top-distributing k-th order interaction we call Sum of Powers. 5 We present only its action on monomials here, and detail the method in Appendix E.3.3. Sum of Powers distributes a monomial as such:
\[
SP^k_T(\bar{x}, [x]^m) = \begin{cases} [\bar{x}]^m & \text{if } T = S_m \\[1mm] \dfrac{\sum_{i \in T} m_i}{\binom{|S_m| - 1}{k - 1}\, \|m\|_1}\, [\bar{x}]^m & \text{if } T \subsetneq S_m,\ |T| = k \\[2mm] 0 & \text{else} \end{cases} \tag{4.5}
\]
The highlight is that Sum of Powers satisfies completeness, null feature, linearity, the continuity condition, and the baseline test for interactions, and it is a top-distributing method. In particular, for [x]^m with |S_m| > k, Sum of Powers distributes [x]^m only to top subgroups where |T| = k, and in proportion to Σ_{i∈T} m_i. We present a corollary below; for full details of the Sum of Powers method, see Appendix E.3.3.

Corollary 4. Sum of Powers is the unique k-th order interaction method on analytic functions that satisfies linearity, the continuity condition, and distributes monomials as in Eq. (4.5).
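To make the two distribution rules concrete, here is a small sketch (assumed order k = 2, with an arbitrary monomial exponent vector) that computes the unaugmented IH weights from the closed forms above and the SP weights from Eq. (4.5), and checks that each set of weights sums to 1, i.e., completeness holds for the monomial.

```python
from itertools import combinations
from math import comb

m = {1: 3, 2: 1, 3: 2}            # exponents of the monomial [x]^m, so S_m = {1, 2, 3}
S_m, norm = set(m), sum(m.values())

# Unaugmented Integrated Hessian (k = 2): weights m_i^2/||m||^2 and 2*m_i*m_j/||m||^2.
ih = {(i,): m[i] ** 2 / norm ** 2 for i in S_m}
ih.update({(i, j): 2 * m[i] * m[j] / norm ** 2 for i, j in combinations(sorted(S_m), 2)})

# Sum of Powers (k = 2): since |S_m| > k, only size-2 proper subsets of S_m receive weight.
denom = comb(len(S_m) - 1, 1) * norm
sp = {(i, j): (m[i] + m[j]) / denom for i, j in combinations(sorted(S_m), 2)}

print(sum(ih.values()))  # 1.0: IH^2 spreads the whole synergy across all nonempty subsets
print(sum(sp.values()))  # 1.0: SP^2 spreads it only across the top (size-2) groups
```

The contrast in how the unit of weight is split is exactly the difference discussed above: IH leaks part of the synergy to main effects, while Sum of Powers keeps it in the largest available groups.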
5 Table of Methods Below we give a table of methods, noteworthy properties, and the mechanism by which they distribute synergy functions or, in the case of gradient based methods, monomials. All listed 63 methods satisfy completeness, linearity, null feature, and symmetry. All gradient-based methods satisfy the continuity condition. All interaction methods also satisfy baseline test for interactions (k ≤ n) unless otherwise noted. We do not list interaction distribution, which is a combination of baseline test for interactions (k≤ n) and being top-distributing in the binary features scheme. Name Properties Distribution Rule Synergy Function unique n th -order interaction ϕ T (Φ S )(¯x)= Φ S (¯x) if S =T 0 if S̸=T Shapley Value attribution method binary features Shap i (¯x,Φ S )= Φ S (¯x) |S| if i∈S 0 if i / ∈S Integrated Gradients attribution method gradient-based IG i (¯x,[x] m )= mi ∥m∥1 [¯x] m if i∈S m 0 if i / ∈S m Shapley-Taylor binary features top-distributing ST k T (¯x,Φ S )= Φ S (¯x) if T =S Φ S (¯x) ( |S| k ) if T ⊊ S,|T|=k 0 else Sum of Powers gradient-based top-distributing SP k T (¯x,[x] m )= [¯x] m if T =S m P i∈T mi ( |Sm|− 1 k− 1 )∥m∥1 [¯x] m if T ⊊ S m ,|T|=k 0 else Recursive Shapley binary features iterative breaks baseline test RS k T (¯x,Φ S )= N k T |S| k Φ S (¯x) if T ⊆ S 0 else Augmented Recursive Shapley binary features iterative RS k∗ T (¯x,Φ S )= Φ S (¯x) if T =S N k T |S| k Φ S (¯x) if T ⊊ S,|S|>k 0 else Integrated Hessian gradient-based iterative breaks baseline test IH k T (¯x,[x] m )= M k T (m) ∥m∥ k 1 [¯x] m if T ⊆ S m 0 else Augmented Integrated Hessian gradient-based iterative IH k∗ T (¯x,[x] m )= [¯x] m if T =S m M k T (m) ∥m∥ k 1 [¯x] m if T ⊊ S m ,|S m |>k 0 else 64 6 Empirical Evaluations In this section, we highlight how our analysis can help explain the differences between methods in practice. We compare the performance of the 2 nd -order Sum of Powers and the unaltered Integrated Hessian methods on a protein tertiary structure dataset. 6 Particularly, we use the Physicochemical Properties of Protein Tertiary Structure dataset from the UCI machine learning repository Rana [2013]. This dataset consists of 45,730 samples with 9 input features describing the molecular structure of proteins, and the target variable is the size of the residue. For this regression task, we utilize a 2-layer neural network with SoftPlus activation. We run each method on 200 samples. More details about the experiments and additional results are provided in Appendix E.4. Figures3.1and3.2reportaveragevaluesforIHandSP,withmaineffectsonthediagonal. We see that both methods report a strong negative interaction between features 1 and 6, with SP reporting a more negative interaction by 8 points. In the main effects, we see that SP gives more largely positive values for features 1 and 6, while IH is more diminished. Why is this? Understanding the theory of distributing synergies helps us understand these differences. Theoretically SP reports pure main effects as they are, and all other interactions are projected down to the pairwise interactions. Sum of powers indicates that the pure main effects of features 1 and 6 are positive. IH intermixes main effects and higher order interactions. Since IH’s main effects are lower, this means that the pure positive main effects of 6.1 and 9.3 (as seen in SP) are being lowered by generally negative higher-order interactions when IH reports them. 
A consequence of this is that IH also has a smaller report of the interactions between features 1 and 6: the negative interactions involving features 1 and 6 are being broken up and some are being distributed to main effects, diminishing the report. This strengthening of pairwise interactions is further confirmed by a box-and-whiskers plot (Fig. 3.3), which shows that SP gives more largely negative values at Q1, Q2, and Q3.

6 Experiments in this section were coded by Ali Ghafelebashi (ali.ghafelebashi@gmail.com). Analysis of experiments was produced by a collaboration of Ali Ghafelebashi, Daniel Lundstrom, and Meisam Razaviyayn.

Figure 3.1: Mean of the Integrated Hessian interaction values.

Interestingly, Figure 3.4 also indicates a more pure relationship between features 1 and 6. It is theorized that IH can have wide ranges of coefficients when distributing a monomial (the M^k_T(m) term), while Sum of Powers is relatively more stable.

Figure 3.2: Mean of the Sum of Powers interaction values.

Figure 3.3: Box plot of interaction values of feature 1 and feature 6. Several values with extreme positive and negative interaction values are removed for a cleaner plot.

Figure 3.4: Interaction of feature 1 and feature 6. Left: driven by Integrated Hessian. Right: driven by Sum of Powers. X-axis: Feature 1. Y-axis: Interaction value. Colorbar: Feature 6.

Chapter 4 Four Characterizations of IG

This chapter is based on the paper "Four Axiomatic Characterizations of the Integrated Gradients Attribution Method" [Lundstrom and Razaviyayn, 2023a]. Here we pick up where Chapter 2 left off. In Chapter 2 we detailed some issues with previous claims characterizing IG and gave characterizations of ensembles of path methods. In this work, we show that IG uniqueness claims can be established rigorously via different axioms. First, we establish a characterization in the vein of Sundararajan et al. [2017] based on the symmetry-preserving axiom. We also give a characterization in the vein of Sundararajan and Najmi [2020] based on the axiom of proportionality. We then provide a characterization of IG based on symmetric monotonicity. Finally, we use methods from Chapter 3 to characterize IG by its action on monomials. Furthermore, we show that IG attributions to neural networks with ReLU and max functions coincide with IG attributions to softplus approximations of such models. This establishes a sort of continuity of IG among softplus approximations.

0.1 Related Works

The Integrated Gradients method was first introduced in Sundararajan et al. [2017], and a characterization was provided for it as well (Chapter 2, Section 2.2). This claim did not cite any characterizations of the Aumann-Shapley, but used game-theoretic results about ensembles of monotone path methods. However, Lerma and Lucas [2021] and Lundstrom et al. [2022a] identified issues with the uniqueness claim. The issues identified in Lerma and Lucas [2021] amount to the existence of path methods other than IG that satisfy symmetry-preserving. The issue cited in Lundstrom et al. [2022a] is that the ML context is significantly different from the cost-sharing context, causing unforeseen difficulties in applying results from one to the other. Another characterization of IG was provided in Sundararajan and Najmi [2020], this time based on a cost-sharing result relying on the principle of proportionality. Lundstrom et al. [2022a] outlined similar issues with the proof of this claim.
Regarding characterizations of the Aumann-Shapley, Billera and Heath [1982] gave a characterization based on the idea of proportionality, while Mirman and Tauman [1982] and Samet and Tauman [1982] gave further characterizations in a similar vein. In other efforts, McLean et al. [2004] showed a characterization based on the ideas of potential and consistency, while Calvo and Santos [2000] characterized the Aumann-Shapley method based on the idea of balanced contributions. Sprumont [2005] developed constraints around the merging or splitting of agents to provide a characterization. Young [1985] provided a characterization using the principle of symmetric monotonicity, Monderer and Neyman [1988] developed another characterization based on potential, and Albizuri et al. [2014] developed a characterization based on both merging/splitting and monotonicity.

1 Characterising IG among Path Methods with Symmetry-Preserving and ASI

In Chapter 2 we discussed Sundararajan et al. [2017]'s attempt to characterize IG using symmetry-preserving and showed that the original proof was invalid. Theorems 2 and 3 showed that in the context of BAMs, NDP, completeness, dummy, and linearity characterize ensembles of monotone path methods. We now aim to follow and extend the reasoning of Sundararajan et al. [2017] to characterize IG.

First we note that the Shapley value is another popular method that is defined as an ensemble of path methods. The Shapley value is obtained by considering the average change in function value when a component's value is changed from x′_i to x̄_i. Specifically, consider all possible ways that x′ can transition to x̄ by sequentially toggling each component from x′_i to x̄_i. The Shapley value for x̄_i is the average change in function value over all possible transitions via toggling. This method can be formulated as an ensemble of n! path methods. Even with speedups, calculating the Shapley value precisely is exponential in the number of inputs, and significant effort has been put into faster calculation via approximation [Chen et al., 2023]. The Shapley value was identified as potentially problematic compared to IG in the original IG paper [Sundararajan et al., 2017, Remark 5]. As an alternative to this approach, the most computationally efficient ensemble would be an ensemble composed of a single path method. Seeing some sense in this observation, and noting the potential downside of multiple-path methods in the original IG paper, we adopt this constraint and consider only ensembles consisting of a single monotone path.

The argument of Sundararajan et al. [2017] asserted that IG is the unique path method that satisfies symmetry-preserving. However, as Lerma and Lucas [2021] noted, for a monotone path method A^γ to satisfy symmetry-preserving, it is sufficient for γ to take the straight-line path when x̄_i = x̄_j and x′_i = x′_j for any i, j. Since there are many such methods, further constraints are necessary. We consider a strengthening of symmetry-preserving, but find this insufficiently constrains the form of A^γ, leaving multiple methods that satisfy the axioms. See Appendix 1 for details. Considering monotone path methods that satisfy symmetry-preserving, only one other constraint is required: ASI. With symmetry-preserving and ASI, IG can be characterized among monotone path methods:

Theorem 8. (Symmetry-Preserving Path Method Characterization on A₂) If A ∈ A₂(D_IG) is a monotone path method satisfying ASI and symmetry-preserving, then it is the Integrated Gradients method.
2 Characterizing IG with ASI and Proportionality

A second attempt at characterizing the Integrated Gradients method was presented in the paper "The Many Shapley Values for Model Explanation" [Sundararajan and Najmi, 2020], in which Lundstrom et al. [2022a] later identified issues. Here we present the characterization. The axiom of proportionality states:

8. Proportionality: If there exists G : [a,b] → R such that for all x ∈ [a,b], F(x) = G(∑_i x_i), then there exists c ∈ R such that A_i(x̄, 0, F) = c·x̄_i for 1 ≤ i ≤ n.

This axiom states that if F can be expressed as a function of the cumulative quantity ∑_i x_i, then each attribution is proportional to its contribution to ∑_i x̄_i, namely x̄_i. The axiom originates from the context of cost-sharing [Friedman and Moulin, 1999], where each x̄_i may represent an investment. As an example, if the return on investment, F(x̄), is a function of the cumulative dollars invested, ∑_i x̄_i, then proportionality asserts that the payout to each investor should be proportional to the amount invested. This principle does not always apply in cost-sharing problems, as when different investors make different kinds of contributions to an investment. It is fitting, however, when all investments are of the same kind, so that the payout is simply a function of the total investment. Admittedly, this axiom appears at first glance to be more sensible in the cost-sharing context than in the ML attribution context, and its suitability depends on the application of interest. With proportionality and ASI, we can characterize IG:

Theorem 9. (Proportionality Characterization on A_2) Suppose that A ∈ A_2(D_IG). Then the following are equivalent:
i. A satisfies linearity, ASI, completeness, NDP, and proportionality.
ii. A is the Integrated Gradients method.

The proof of Theorem 9 is deferred to Appendix F.3. Note that unlike Theorem 8, which characterized IG among monotone path methods, this is a much broader characterization, establishing that among all BAMs in A_2, only IG satisfies the given axioms.
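As a quick illustration, IG does satisfy proportionality on such functions. The following is a minimal numerical sketch (not from the thesis code; the quadratic G and the Riemann-sum helper are chosen only for demonstration), with baseline 0 as the axiom requires:

```python
import numpy as np

def ig(f_grad, x_bar, x_base, steps=1000):
    """Riemann-sum approximation of IG along the straight line from x_base to x_bar."""
    t = (np.arange(steps) + 0.5) / steps
    path = x_base + t[:, None] * (x_bar - x_base)
    return (x_bar - x_base) * np.stack([f_grad(p) for p in path]).mean(axis=0)

# F(x) = G(sum_i x_i) with G(s) = s^2, so proportionality should apply.
f_grad = lambda x: np.full(len(x), 2.0 * x.sum())    # dF/dx_i = G'(sum_j x_j) = 2*sum_j x_j
x_bar, x_base = np.array([1.0, 2.0, 5.0]), np.zeros(3)

attr = ig(f_grad, x_bar, x_base)
print(attr)           # ~[ 8., 16., 40.] = c * x_bar, where here c = sum(x_bar) = 8
print(attr / x_bar)   # a constant vector, as proportionality requires
```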
3 Characterizing IG with Symmetric Monotonicity

We next present a characterization of IG employing the concept of monotonicity. The axiom of monotonicity can be stated as:

8a. Monotonicity: Suppose F, G ∈ F_1. Then:
i. If x̄_i ≠ x'_i, then ∂F/∂x_i(x) ≤ ∂G/∂x_i(x) for all x ∈ [x̄, x'] implies A_i(x̄, x', F)/(x̄_i − x'_i) ≤ A_i(x̄, x', G)/(x̄_i − x'_i).
ii. If x̄_i = x'_i, then A_i(x̄, x', F) = 0.

To explain i., the term A_i(x̄, x', F)/(x̄_i − x'_i) is the per-unit attribution of x̄_i: we take the contribution of x̄_i to the change in F, namely A_i(x̄, x', F), and divide it by the total change in x̄_i. If x̄_i contributed to F increasing, but x̄_i decreased from the baseline, then the per-unit attribution of x̄_i would be negative. As an example, suppose both derivatives are positive and x̄_i > x'_i. If increasing x̄_i causes at least as great an increase for G as it does for F, then according to monotonicity, the per-unit attribution of x̄_i should be at least as great for G as for F.

Requirement ii. is the continuous extension of i. to the x̄_i = x'_i case under the assumption of completeness and dummy. To demonstrate this extension, let F ∈ F_1, so that c ≤ ∂F/∂x_i ≤ d on the bounded domain [a,b]. Then, by completeness and dummy, A_i(x̄, x', c·x_i) = c(x̄_i − x'_i) and A_i(x̄, x', d·x_i) = d(x̄_i − x'_i). Thus

c = A_i(x̄, x', c·x_i)/(x̄_i − x'_i) ≤ A_i(x̄, x', F)/(x̄_i − x'_i) ≤ A_i(x̄, x', d·x_i)/(x̄_i − x'_i) = d.

Now, as x̄_i → x'_i, we have A_i(x̄, x', F) → 0.

With the idea of monotonicity in hand, one can assert a similar principle for comparisons between different inputs via the following axiom, symmetric monotonicity:

8b. Symmetric Monotonicity: Suppose A ∈ A_1 and F, G ∈ F_1. Then:
i. If x̄_i ≠ x'_i and x̄_j ≠ x'_j, then ∂F/∂x_i(x) ≤ ∂G/∂x_j(x) for all x ∈ [x̄, x'] implies A_i(x̄, x', F)/(x̄_i − x'_i) ≤ A_j(x̄, x', G)/(x̄_j − x'_j).
ii. If x̄_i = x'_i, then A_i(x̄, x', F) = 0.

Symmetric monotonicity enforces that the principle of monotonicity can be applied between different inputs. With symmetric monotonicity, we give the following characterization of IG among methods in A_1:

Theorem 10. (Symmetric Monotonicity Characterization on A_1) Suppose that A ∈ A_1. Then the following are equivalent:
i. A satisfies completeness, dummy, linearity, and symmetric monotonicity.
ii. A is the Integrated Gradients method.

The proof of Theorem 10 is located in Appendix F.4.

To extend the results to A_2, we consider two options. The first is to include NDP, and the second is to include a version of symmetric monotonicity that is formulated for functions that may not be differentiable. To do this, we replace the condition "∂F/∂x_i(x) ≤ ∂G/∂x_j(x) for all x ∈ [x̄, x']" with a condition applicable to non-differentiable functions. Supposing F, G ∈ F_2, we define the statement "∂F/∂x_i(x) ≤ ∂G/∂x_j(x) locally approximately" to mean: there exists ε > 0 such that |z| < ε implies

[F(x_1, ..., x_i + z, ..., x_n) − F(x)]/z ≤ [G(x_1, ..., x_j + z, ..., x_n) − G(x)]/z

whenever both terms exist. The above statement indicates that we have something akin to ∂F/∂x_i(x) ≤ ∂G/∂x_j(x), using local secant approximations of the derivatives. We now state C^0-symmetric monotonicity, an adjustment of symmetric monotonicity for BAMs in A_2:

8c. C^0-Symmetric Monotonicity: Suppose A ∈ A_2(D_IG) and (x̄, x', F), (x̄, x', G) ∈ D_IG. Then:
i. If x̄_i ≠ x'_i and x̄_j ≠ x'_j, then ∂F/∂x_i(x) ≤ ∂G/∂x_j(x) locally approximately for all x ∈ [x̄, x'] implies A_i(x̄, x', F)/(x̄_i − x'_i) ≤ A_j(x̄, x', G)/(x̄_j − x'_j).
ii. If x̄_i = x'_i, then A_i(x̄, x', F) = 0.

We now extend the characterization of Theorem 10 to attributions in A_2.

Theorem 11. (Symmetric Monotonicity Characterization on A_2) Suppose that A ∈ A_2(D_IG). Then the following are equivalent:
i. A satisfies completeness, dummy, linearity, symmetric monotonicity, and NDP.
ii. A satisfies completeness, dummy, linearity, and C^0-symmetric monotonicity.
iii. A is the Integrated Gradients method.

The proof of Theorem 11 is located in Appendix F.5.

4 Characterizing IG with Attribution to Monomials

Another means of characterizing attribution methods is to begin with a principle of attributing to simple functions.^1 First, for m ∈ N_0^n, define [x]^m := x_1^{m_1} ··· x_n^{m_n}. Given a fixed baseline x' and m ∈ N_0^n, we employ a slight abuse of terminology and define a monomial to be any function of the form F(x) = [x − x']^m.

^1 This concept was explored in Sundararajan et al. [2020] for an interaction method, and later by Lundstrom and Razaviyayn [2023b] for gradient-based interaction methods built on IG. Here we state results from that paper for A_1, and give a result on continuity into A_2.

Now, consider a simple example function on which we would like to perform attribution: F(x_1, x_2) = (x_1 − x'_1)^100 (x_2 − x'_2). The function F evaluated at x̄ = (x'_1 + 2, x'_2 + 2) yields F(x̄) = 2^100 · 2^1 = 2^101. Considering methods that satisfy completeness, the attribution question is how to distribute F(x̄) − F(x') = 2^101 between x_1 and x_2. One possibility is to consider x_1 and x_2 equal contributors, so that A(x̄, x', F) = (2^101/2, 2^101/2).
This is, in fact, the attribution given by the Shapley value: for any monomial F(x) = [x − x']^m, the Shapley value attributes equally to each input i such that m_i ≠ 0. Given the structure of F, this seems a naive attribution. Another means of attributing to the inputs of F is to consider the magnitude of m_i, the power of x̄_i. In particular, we could attribute to x̄_i proportionally to the number of times it is multiplied when evaluating F(x̄). An attribution following this guideline would yield

A((x'_1 + 2, x'_2 + 2), x', (x_1 − x'_1)^100 (x_2 − x'_2)) = ((100/101)·2^101, (1/101)·2^101),

a result that appears equitable. In fact, this attribution coincides with the attribution of IG. For m ∈ N_0^n such that m_i ≠ 0, we have

IG_i(x̄, x', [x − x']^m)
= (x̄_i − x'_i) ∫_0^1 ∂([x − x']^m)/∂x_i (x' + t(x̄ − x')) dt
= (x̄_i − x'_i) ∫_0^1 m_i (t(x̄_1 − x'_1))^{m_1} ··· (t(x̄_i − x'_i))^{m_i − 1} ··· (t(x̄_n − x'_n))^{m_n} dt
= (x̄_i − x'_i) ∫_0^1 m_i t^{‖m‖_1 − 1} (x̄_1 − x'_1)^{m_1} ··· (x̄_i − x'_i)^{m_i − 1} ··· (x̄_n − x'_n)^{m_n} dt
= (m_i / ‖m‖_1) [x̄ − x']^m.    (4.1)

We may proceed from attributions on monomials to attributions on F_1 by requiring a continuity criterion. For m ∈ N_0^n, define [m]! := m_1! ··· m_n!, and define D^m F = ∂^{‖m‖_1} F / (∂x_1^{m_1} ··· ∂x_n^{m_n}). Recall that for F ∈ F_1, the Taylor approximation of order l centered at x', denoted T_l, is given by:

T_l(x) = ∑_{m ∈ N_0^n, ‖m‖_1 ≤ l} ( D^m F(x') / [m]! ) [x − x']^m    (4.2)

For analytic functions, the Taylor approximation has the property that D^m T_l converges uniformly to D^m F for any m ∈ N_0^n and x ∈ [a,b]. Thus, it seems natural to require that any attribution A ∈ A_1 satisfy lim_{l→∞} A(x̄, x', T_l) = A(x̄, x', F). This is the principle behind the axiom Continuity of Taylor Approximation for Analytic Functions, which we may equivalently call the continuity condition, given below:

9. Continuity of Taylor Approximation for Analytic Functions: If A ∈ A_1 and (x̄, x', F) ∈ [a,b] × [a,b] × F_1, then lim_{l→∞} A(x̄, x', T_l) = A(x̄, x', F), where T_l is the l-th order Taylor approximation of F centered at x'.

We now give the characterization of IG according to its action on monomials:

Theorem 12. (Distribution of Monomials Characterization on A_1) Suppose A ∈ A_1. Then the following are equivalent:
i. A satisfies continuity of Taylor approximation for analytic functions and acts on monomials as A(x̄, x', [x − x']^m) = (m / ‖m‖_1) [x̄ − x']^m.
ii. A is the Integrated Gradients method.
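The closed form (4.1) is easy to check numerically. The following is a minimal sketch (not from the thesis code; the Riemann-sum helper and the particular exponent vector are illustrative) comparing a straight-line Riemann-sum approximation of IG with the right-hand side of Eq. (4.1) on the monomial example above:

```python
import numpy as np

def ig_riemann(f_grad, x_bar, x_base, steps=2000):
    """Approximate IG along the straight line from x_base to x_bar
    with a midpoint Riemann sum."""
    t = (np.arange(steps) + 0.5) / steps
    path = x_base + t[:, None] * (x_bar - x_base)
    grads = np.stack([f_grad(p) for p in path])
    return (x_bar - x_base) * grads.mean(axis=0)

# Monomial F(x) = [x - x']^m with m = (100, 1), baseline x' = 0, input x_bar = (2, 2).
m = np.array([100.0, 1.0])
x_base, x_bar = np.zeros(2), np.array([2.0, 2.0])

def grad_monomial(x):
    # dF/dx_i = m_i (x_i - x'_i)^(m_i - 1) * prod_{j != i} (x_j - x'_j)^{m_j},
    # written as m_i * F(x) / (x_i - x'_i); valid off the baseline, where the path lies.
    d = x - x_base
    return m * np.prod(d ** m) / d

approx = ig_riemann(grad_monomial, x_bar, x_base)
closed = m / m.sum() * np.prod((x_bar - x_base) ** m)      # Eq. (4.1)
print(approx)   # ~[2.51e30, 2.51e28], i.e. ((100/101)*2^101, (1/101)*2^101)
print(closed)
```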
We may proceed from F_1 to F_2 by considering a means of approximating a feed-forward neural network by an analytic function. Suppose F ∈ F_2 is a feed-forward neural network with ReLU and max functions. Note that the multi-input max function can be formulated as a series of two-input max functions, and the two-input max function can be formulated as max(a, b) = ReLU(a − b) + b. Thus we may formulate F using only the ReLU function. We may then define F_α to be the analytic approximation of F given by replacing all instances of ReLU in F with the parameterized softplus, s_α(z) = ln(1 + exp(αz))/α. We show in Appendix F.7 that this softplus approximation converges uniformly to the function F.

Before we give our result, we first give a technical theorem on the topology of [a,b] with respect to softplus approximations of functions in F_2. Let ∇F denote the gradient of F, and let λ denote the Lebesgue measure. Then our result is as follows:

Theorem 13. For any F ∈ F_2, there exists an open set U ⊆ [a,b] such that λ(U) = λ([a,b]) and for each x ∈ U, the following hold:
• There exists an open set B_x containing x and a real analytic function H_x on [a,b] such that F ≡ H_x on B_x.
• ∇F(x) exists.
• ∇F_α(x) → ∇F(x) as α → ∞.

With this theorem, we give a result on IG's ability to extend uniquely to models in F_2.

Corollary 5. Let (x̄, x', F) be an attribution problem with F ∈ F_2, and let U be the set from Theorem 13. Let γ(t) = x' + t(x̄ − x') and suppose λ({t ∈ [0,1] : γ(t) ∈ U}) = 1. Then:

lim_{α→∞} IG(x̄, x', F_α) = IG(x̄, x', F).

Proofs of Theorem 13 and Corollary 5 are located in Appendices F.8 and F.9, respectively.
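A small numerical illustration of this convergence is below. It is a sketch only (not from the thesis code): the two-layer "network" F(x) = max(w1·x, w2·x), its rewriting via ReLU, the central-difference gradients, and the step counts are all chosen purely for demonstration.

```python
import numpy as np

def softplus(z, alpha):
    # s_alpha(z) = log(1 + exp(alpha*z)) / alpha, a smooth approximation of ReLU
    return np.logaddexp(0.0, alpha * z) / alpha

# Toy model F(x) = max(w1.x, w2.x) = ReLU(w1.x - w2.x) + w2.x, and its softplus version F_alpha.
w1, w2 = np.array([1.0, -2.0]), np.array([0.5, 1.0])
relu_F = lambda x: np.maximum(w1 @ x, w2 @ x)
soft_F = lambda x, a: softplus((w1 - w2) @ x, a) + w2 @ x

def ig(f, x_bar, x_base, steps=1000, eps=1e-5):
    """Straight-line IG with central-difference gradients, adequate for this illustration."""
    t = (np.arange(steps) + 0.5) / steps
    path = x_base + t[:, None] * (x_bar - x_base)
    grads = np.array([[(f(p + eps * e) - f(p - eps * e)) / (2 * eps)
                       for e in np.eye(len(p))] for p in path])
    return (x_bar - x_base) * grads.mean(axis=0)

x_base, x_bar = np.zeros(2), np.array([1.0, 1.0])
for alpha in [1, 10, 100, 1000]:
    print(alpha, ig(lambda x: soft_F(x, alpha), x_bar, x_base))
print("ReLU", ig(relu_F, x_bar, x_base))   # the softplus attributions approach this vector
```

As α grows, the attributions for F_α settle onto the attributions for the ReLU model, which is the behavior Corollary 5 guarantees whenever the straight-line path spends almost all of its time in the set U.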
5 Table of Characterizations

We present a table of results, summarizing the various characterizations of the Integrated Gradients method found in this chapter.

Acknowledgement. The work in this section was supported in part by a gift from the USC-Meta Center for Research and Education in AI and Learning.

Assumptions                 | Theorems 3 + 8 | Theorem 9 | Theorem 11 i. | Theorem 11 ii. | Theorem 12
Linearity                   |       x        |     x     |       x       |       x        |     x
Dummy                       |       x        |     -     |       x       |       x        |     -
Completeness                |       x        |     x     |       x       |       x        |     -
NDP                         |       x        |     x     |       x       |       -        |     -
Path Method                 |       x        |     -     |       -       |       -        |     -
Symmetry-Preserving         |       x        |     -     |       -       |       -        |     -
ASI                         |       x        |     x     |       -       |       -        |     -
Proportionality             |       -        |     x     |       -       |       -        |     -
Symmetric Monotonicity      |       -        |     -     |       x       |       x        |     -
Distribution of Monomials   |       -        |     -     |       -       |       -        |     x
Continuity Condition        |       -        |     -     |       -       |       -        |     x

Table 4.1: Each axiom is listed under the "Assumptions" column. Under each column with a "Theorem" heading, the set of axioms that characterize IG is marked. All results characterize IG among attributions in A_2 except for Theorem 12, which characterizes IG among attributions in A_1. Also, note that Theorem 11 i. assumes symmetric monotonicity for functions in F_1, while Theorem 11 ii. assumes C^0-symmetric monotonicity.

Chapter 5

Conclusion

1 Results of the Study and Characterization of Integrated Gradients

In this thesis we offered a varied and in-depth study of the Integrated Gradients method. We revisited previous treatments, offering criticisms and recovering results. We characterized IG in a variety of ways, using symmetry-preserving, proportionality, symmetric monotonicity, and monomial distribution as key distinguishing features of the method. Finally, we touched on aspects of Lipschitzness, a distribution of baselines, continuity of IG for Taylor and softplus approximations, and an adaptation to attributions of internal neurons.

The axiomatic approach of IG offers major advantages in the task of attribution. For DNNs, no ground-truth explanation is available; we start from a position of not understanding the model. The axiomatic approach begins by stipulating what we would want in an explanation, and can arrive at one specific attribution. For this process to be successful, careful and rigorous thinking is necessary. In our critique of previous works, we found that they failed to account for the difference in function space between the cost-sharing and attribution problems. While the problems appear similar, the stricter requirement that cost-shares be non-negative was critical, causing various results to be invalid. The occurrence of these issues and their subsequent correction is a testament to the utility of rigorous mathematical analysis for machine learning.

Regarding the provided characterizations, the authors believe it is unlikely that there is a single best attribution method. In cost-sharing, it has been established that no one method can satisfy all desirable properties,^1 and it is possible that this is the case for attribution methods as well. Theorems 2 & 3 demonstrate that ensembles of path methods uniquely satisfy common and broadly used axioms. However, the community should consider the less common axioms that help characterize IG: Theorem 8 (being a symmetry-preserving single-path method), Theorem 9 (proportionality), Theorems 10 & 11 (symmetric monotonicity), and Theorem 12 (IG's distribution of monomials). In the authors' opinion, these more defining axioms indicate contexts where the IG attribution method is preferable. We expect that there are other properties, suitable in other contexts, which would exclude IG and recommend another method.

^1 See Friedman and Moulin [1999, Lemma 4].

2 Results of the Analysis of Interactions using Synergy Functions

The paradigm of synergy distribution is a useful concept for the analysis and development of attribution and interaction methods. First, it can point out weaknesses in existing methods such as the Integrated Hessian; second, it can lead to new methods such as the Sum of Powers method; and last, it allows new characterization results based on synergy or monomial distribution. As seen in the comparison of the Shapley value and Integrated Gradients, synergy distribution can play an important role implicitly even when not explicitly discussed in the literature.

However, this analysis tool does not settle the question, "which method is best?" There exist conflicting groups of axioms, and various combinations of them produce unique methods. The choice of whether to use a top-distributing or recursively defined method, a binary-feature or gradient-based method, or some other method may vary with the goal. For example, top-distributing methods may be preferable when explicitly searching for strong interactions of size k, while an iterative approach may be preferable when seeking to emphasize all interactions up to size k. For problems with continuous inputs, gradient-based methods seem to offer a more sophisticated means of distributing synergies, as they distinguish between features when they distribute a synergy function.

Here again, it is not clear if any given method represents a clear "winner" for distributing monomials. We have presented two top-distributing and two recursive methods, but it is unclear if these methods are best in class. For instance, perhaps a top-distributing method that distributes monomials by some softmax-weighted scheme is preferable to Sum of Powers. In order to find such methods, one may try to find a linear operator L_S : C^ω → C^ω to which the continuity criteria apply and such that L_S(x^m) = c_S(x, m) x^m for some desirable weighting function c_S, for instance

L_{{i}}(x^m) = ( e^{x_i^{m_i}} / ∑_{j ∈ S_m} e^{x_j^{m_j}} ) x^m  if i ∈ S_m,  and 0 otherwise.    (2.1)

Finding such linear operators could produce a variety of attribution and interaction methods.

In the authors' opinion, the existence of one "best" method is improbable, as various combinations of different axioms lead to the development of unique methods. Thus, choosing methods based on the context of the application seems the more logical approach. Indeed, the existence of unique methods with individual strengths is already studied in the game-theoretic cost-sharing literature.^2

^2 See the Shapley value vs. Aumann-Shapley value vs. serial cost method for cost-sharing [Friedman and Moulin, 1999], or the Shapley vs. Banzhaf interaction indices [Grabisch and Roubens, 1999].

Bibliography

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328. PMLR, 2017.
The White House. Blueprint for an ai bill of rights, 2022. Accessed May 22, 2023, Section: Notice and Explanation. European Commission. Proposal for an artificial intelligence act, 2021. 2021/0106(COD), Article 13, Accessed May 22, 2023. Nadine Dorries. Establishing a pro-innovation approach to regulating ai, 2022. Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ” why should i trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016. Alexander Binder, Gr´ egoire Montavon, Sebastian Lapuschkin, Klaus-Robert M¨ uller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renor- malization layers. In International Conference on Artificial Neural Networks , pages 63–71. Springer, 2016. Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh,andDhruvBatra. Grad-cam: Visualexplanationsfromdeepnetworksviagradient- based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017. Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102:349–391, 2016. Mark Ibrahim, Melissa Louie, Ceena Modarres, and John Paisley. Global explanations of neural networks: Mapping the landscape of predictions. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 279–287, 2019. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014. Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350 – 1371, 2015. doi: 10.1214/15-AOAS848. URL https://doi.org/10.1214/15-AOAS848. 82 JostTobiasSpringenberg,AlexeyDosovitskiy,ThomasBrox,andMartinRiedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014. Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020. Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145–3153. PMLR, 2017. Aria Masoomi, Davin Hill, Zhonghui Xu, Craig P Hersh, Edwin K Silverman, Peter J Castaldi, Stratis Ioannidis, and Jennifer Dy. Explanations of black-box models based on directional feature interactions. In International Conference on Learning Representations, 2021. Dangxing Chen and Weicheng Ye. Generalized gloves of neural additive models: Pur- suing transparent and accurate machine learning models in finance. arXiv preprint arXiv:2209.10082, 2022. Mukund Sundararajan, Kedar Dhamdhere, and Ashish Agarwal. The shapley taylor interac- tion index. In International conference on machine learning, pages 9259–9268. PMLR, 2020. Joseph D Janizek, Pascal Sturmfels, and Su-In Lee. 
Explaining explanations: Axiomatic feature interactions for deep networks. J. Mach. Learn. Res., 22:104–1, 2021. Che-Ping Tsai, Chih-Kuan Yeh, and Pradeep Ravikumar. Faith-shap: The faithful shapley interaction index. arXiv preprint arXiv:2203.00870, 2022. Stefan Bl¨ ucher, Johanna Vielhaben, and Nils Strodthoff. Preddiff: Explanations and interactions from conditional expectations. Artificial Intelligence , 312:103774, 2022. Hao Zhang, Yichen Xie, Longjie Zheng, Die Zhang, and Quanshi Zhang. Interpreting multivariate shapley interactions in dnns. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 35, pages 10877–10886, 2021. Zirui Liu, Qingquan Song, Kaixiong Zhou, Ting-Hsiang Wang, Ying Shan, and Xia Hu. Detecting interactions from neural networks via topological analysis. Advances in Neural Information Processing Systems, 33:6390–6401, 2020. Michael Tsang, Dehua Cheng, Hanpeng Liu, Xue Feng, Eric Zhou, and Yan Liu. Feature interaction interpretability: A case for explaining ad-recommendation systems via neural interaction detection. arXiv preprint arXiv:2006.10966, 2020a. Mark Hamilton, Scott Lundberg, Lei Zhang, Stephanie Fu, and William T Freeman. Axiomatic explanations for visual search, retrieval, and similarity learning. arXiv preprint arXiv:2103.00370, 2021. Michael Tsang, Sirisha Rambhatla, and Yan Liu. How does this interaction affect me? interpretableattributionforfeatureinteractions. Advancesinneuralinformationprocessing systems, 33:6147–6159, 2020b. 83 YaruHao,LiDong,FuruWei,andKeXu.Self-attentionattribution: Interpretinginformation interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12963–12971, 2021. Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977, 2017. Michael Tsang, Hanpeng Liu, Sanjay Purushotham, Pavankumar Murali, and Yan Liu. Neural interaction transparency (nit): Disentangling learned interactions for improved interpretability. Advances in Neural Information Processing Systems, 31, 2018. Lloyd S Shapley and Martin Shubik. The assignment game i: The core. International Journal of game theory, 1(1):111–130, 1971. RobertJ.AumannandLloydS.Shapley. Values of Non-Atomic Games. PrincetonUniversity Press, Princeton, NJ, 1974. Gabriel Erion, Joseph D Janizek, Pascal Sturmfels, Scott M Lundberg, and Su-In Lee. Improving performance of deep learning models with axiomatic attribution priors and expected gradients. Nature Machine Intelligence, pages 1–12, 2021. Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron? arXiv preprint arXiv:1805.12233, 2018. David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert M¨ uller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11:1803–1831, 2010. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016. Luisa M Zintgraf, Taco S Cohen, and Max Welling. A new method to visualize deep neural networks. arXiv preprint arXiv:1603.02518, 2016. Daniel D Lundstrom, Tianjian Huang, and Meisam Razaviyayn. 
A rigorous study of inte- grated gradients method and extensions to internal neuron attributions. In International Conference on Machine Learning, pages 14485–14508. PMLR, 2022a. Daniel Lundstrom and Meisam Razaviyayn. Four axiomatic characterizations of the inte- grated gradients attribution method. arXiv preprint arXiv:2306.13753, 2023a. Eric J Friedman. Paths and consistency in additive cost sharing. International Journal of Game Theory, 32(4):501–518, 2004. Shawn Xu, Subhashini Venugopalan, and Mukund Sundararajan. Attribution in scale and space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9680–9689, 2020. 84 Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Vi´ egas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017. Miguel Lerma and Mirtha Lucas. Symmetry-preserving paths in integrated gradients. arXiv preprint arXiv:2103.13533, 2021. David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, pages 342–350. PMLR, 2017. Tensorflow.com. Integrated gradients, 2022. URL https://www.tensorflow.org/ tutorials/interpretability/integrated_gradients. Su-In Lee Pascal Sturmfels, Scott Lundberg. Visualizing the impact of feature attribution baselines, 2020. URL https://distill.pub/2020/attribution-baselines/. Andrei Kapishnikov, Subhashini Venugopalan, Besim Avci, Ben Wedin, Michael Terry, and Tolga Bolukbasi. Guided integrated gradients: An adaptive path method for removing noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5050–5058, 2021. Avanti Shrikumar, Jocelin Su, and Anshul Kundaje. Computationally efficient measures of internal neuron importance. arXiv preprint arXiv:1807.09946, 2018. DanielLundstrom,AlexanderHuyen,AryaMevada,KyongsikYun,andThomasLu. Explain- abilitytoolsenablingdeeplearninginfuturein-situreal-timeplanetaryexplorations. arXiv preprint arXiv:2201.05775, 2022b. Eric Friedman and Herve Moulin. Three methods to share joint costs or surplus. Journal of economic Theory, 87(2):275–312, 1999. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in) fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems, 32, 2019. David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. 85 VitaliPetsiuk,AbirDas,andKateSaenko. Rise: Randomizedinputsamplingforexplanation of black-box models. arXiv preprint arXiv:1806.07421, 2018. Daniel Lundstrom and Meisam Razaviyayn. 
Distributing synergy functions: Unifying game-theoretic interaction methods for machine-learning explainability. arXiv preprint arXiv:2305.03100, 2023b. Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of game theory, 28(4):547–565, 1999. Jean-Luc Marichal and Marc Roubens. The chaining interaction index among players in cooperative games. In Advances in Decision Analysis, pages 69–85. Springer, 1999. Hao Zhang, Xu Cheng, Yiting Chen, and Quanshi Zhang. Game-theoretic interactions of different orders. arXiv preprint arXiv:2010.14978, 2020. Sandipan Sikdar, Parantapa Bhattacharya, and Kieran Heese. Integrated directional gradi- ents: Feature interaction attribution for neural nlp models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Interna- tional Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 865–878, 2021. Gian-Carlo Rota. On the foundations of combinatorial theory i. theory of m¨ obius functions. Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und verwandte Gebiete , 2(4):340–368, 1964. John C Harsanyi. A simplified bargaining model for the n-person cooperative game. Inter- national Economic Review, 4(2):194–220, 1963. Michel Grabisch. K-order additive discrete fuzzy measures and their representation. Fuzzy sets and systems, 92(2):167–189, 1997. Prashant S Rana. Physicochemical properties of protein tertiary structure data set. UCI Machine Learning Repository, 2013. Louis J Billera and David C Heath. Allocation of shared costs: A set of axioms yielding a unique procedure. Mathematics of Operations Research, 7(1):32–39, 1982. Leonard J Mirman and Yair Tauman. Demand compatible equitable cost sharing prices. Mathematics of Operations Research, 7(1):40–56, 1982. Dov Samet and Yair Tauman. The determination of marginal cost prices under a set of axioms. Econometrica: Journal of the Econometric Society, pages 895–909, 1982. Richard P McLean, Amit Pazgal, and William W Sharkey. Potential, consistency, and cost allocation prices. Mathematics of Operations Research, 29(3):602–623, 2004. Emilio Calvo and Juan Carlos Santos. A value for multichoice games. Mathematical Social Sciences, 40(3):341–354, 2000. Yves Sprumont. On the discrete version of the aumann–shapley cost-sharing method. Econometrica, 73(5):1693–1712, 2005. 86 H Peyton Young. Producer incentives in cost allocation. Econometrica: Journal of the Econometric Society, pages 757–765, 1985. Dov Monderer and Abraham Neyman. Values of smooth nonatomic games: the method of multilinear approximation. Cambridge University Press, Cambridge, 1988. M Josune Albizuri, H D´ ıez, and A Sarachu. Monotonicity and the aumann–shapley cost- sharing method in the discrete case. European Journal of Operational Research, 238(2): 560–565, 2014. HughChen, IanCCovert, ScottMLundberg, andSu-InLee. Algorithmstoestimateshapley value feature attributions. Nature Machine Intelligence, pages 1–12, 2023. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon- del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 
Appendix A

Supplementary Material on Previous IG and Path Method Papers

Here we discuss current understandings of how to compute IG, give directions for further research on computing path methods, provide a figure used in Sundararajan et al. [2017], detail counterexamples to claims about IG/path method uniqueness, and give a proof of Lemma 1.

1 Computing Path Methods

Sundararajan et al. [2017] indicate that IG can be computed approximately on some modern ML models using numerical integration methods. In particular, 20-300 gradient calls with the rectangle rule can attain accuracy within 5% error, and a check confirming that completeness approximately holds is recommended. Future areas of research include analyzing the computational complexity of calculating IG and developing more efficient integration methods for IG, perhaps by leveraging properties of particular ML models. Also, other path methods could potentially lead to desirable properties or computational advantages, as in Kapishnikov et al. [2021] and Xu et al. [2020]. These may be tailored to particular models.
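As an illustration of this recipe, the following is a minimal sketch (not the thesis code; the example function, step count, and tolerance are only for demonstration) of the rectangle-rule approximation together with the recommended completeness check:

```python
import numpy as np

def ig_with_completeness_check(f, f_grad, x_bar, x_base, steps=100, tol=0.05):
    """Left-endpoint (rectangle rule) approximation of IG along the straight
    line, followed by the completeness sanity check recommended above."""
    t = np.arange(steps) / steps                        # rectangle rule nodes
    path = x_base + t[:, None] * (x_bar - x_base)
    grads = np.stack([f_grad(p) for p in path])
    attr = (x_bar - x_base) * grads.mean(axis=0)
    gap = f(x_bar) - f(x_base)                          # completeness target
    rel_err = abs(attr.sum() - gap) / (abs(gap) + 1e-12)
    if rel_err > tol:
        print(f"completeness check failed ({rel_err:.1%}); increase `steps`")
    return attr

# Example: F(x) = x1 * x2^2 on [0,1]^2, baseline 0, input (1, 1).
f = lambda x: x[0] * x[1] ** 2
f_grad = lambda x: np.array([x[1] ** 2, 2 * x[0] * x[1]])
print(ig_with_completeness_check(f, f_grad, np.array([1.0, 1.0]), np.zeros(2)))
# -> roughly [1/3, 2/3], whose sum is close to F(1,1) - F(0,0) = 1.
```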
2 Figure 1 from Sundararajan et al. [2017]

[Figure A.1 reproduces Figure 1 of Sundararajan et al. [2017]: a two-dimensional input space with three monotone paths, P1, P2, and P3, from the baseline (r1, r2) to the input (s1, s2).]

Figure A.1: Three paths between a baseline (r1, r2) and an input (s1, s2). Each path corresponds to a different attribution method. The path P2 corresponds to the path used by integrated gradients.

3 Counterexample to Claim 1

Let F(x1, x2) = x1·x2 be defined on [0,1]^2, with x' = (0,0) and x̄ = (1,0). Suppose that A_γ is defined by a monotone path. Note that ∂F/∂x_i ≥ 0 and ∂γ_i/∂t ≥ 0 for all i. Thus A_γ(x̄, x', F) ≥ 0 by Eq. 1.1, and so any ensemble of monotone path methods has a non-negative output for the input (x̄, x', F). Let γ' be the path that travels via straight lines from (0,0) to (0,1), then to (1,1), and ends at (1,0). A_γ' satisfies completeness, linearity, sensitivity(b), and implementation invariance, yet A_γ'(x̄, x', F) = (1, −1) ≱ 0. Thus not every baseline attribution method that satisfies completeness, linearity, sensitivity(b), and implementation invariance is a probabilistic ensemble of monotone path methods.

4 Proof of Lemma 1

Proof. Suppose A satisfies linearity and sensitivity(b), and F ∈ F has ∇F defined on [a,b]. Let x̄, x' ∈ [a,b], and let G ∈ F be such that ∂F/∂x_i = ∂G/∂x_i for some i. Then ∂(F − G)/∂x_i = 0, and by sensitivity(b), A_i(x̄, x', F) − A_i(x̄, x', G) = A_i(x̄, x', F − G) = 0.
Thus 89 A i (¯x,x ′ ,F) = A i (¯x,x ′ ,G), and A i is a function solely of ¯x,x ′ and ∂F ∂x i . By extension, A(¯x,x ′ ,F) is a function solely of x,x’, and∇F. 5 Counterexamples to Other Uniqueness Claims and Proof with NDP 5.1 Counterexample to Xu et al. [2020, Proposition 1] and Proof with NDP In this section we present a counterexample to Xu et al. [2020, Proposition 1] and establish the claim with the addition of NDP. The original statement of Xu et al. [2020, Proposition 1] is as follows: “Path methods are the only attribution methods that always satisfy Dummy, Linearity, Affine Scale Invariance and Completeness.” AswiththestatementofSundararajanetal.[2017, Proposition2], thedefinitionof“path methods” here is informed by the work of Friedman [2004], which is a referent given as proof of the theorem and specifies the statement. We first rigorously re-state the claim, filling in gaps. The work of Friedman [2004, Theorem 1], which was given above as Theorem 1, is given in the context of monotone path methods and their ensembles. Since this theorem is referenced without further proof, it is assumed they do not mean to include non-monotone path methods, since otherwise the theorem would not be justifiably applied. Furthermore, it is known that if a path method satisfies the axioms, then an ensemble of path methods satisfies the axioms. Thus we assume they mean to include ensembles of path methods. Thus, we can interprete the statement as: Claim2. (Xuetal.[2020, Proposition1])Ifanattributionmethodsatisfiesdummy, linearity, ASI, and completeness, then that method is an ensemble of monotone path methods. Here we present a counterexample to this claim in the form of a non-monotone path method that satisfies the axioms. Let n=2, [a,b]=[0,1]. Define γ (¯x,x ′ ,t) as follows. Set T to be the affine transformation T(¯x)=x ′ +(¯x− x ′ )⊙ y. Inspired by Friedman [2004]’s 90 treatment of ASI, we define a non-monotone path method γ (¯x,x ′ ,t) as such. We set γ (1,0,t) as the constant velocity path which travels in straight lines as such: (0,0) → (1,0) → (1,1)→ (0,1)→ (0,0)→ (1,1). Define γ (¯x,x ′ ,t) = T(γ (0,1,t)). Thus γ (¯x,x ′ ,t) is affine transformation of the reference path γ (0,1,t). Note that γ (0,1,t)∈[0,1], so if ¯x,x ′ ∈[0,1], then the path γ (¯x,x ′ ,t)∈[x ′ ,¯x], and will not exit out of the box [0,1]. This ensures that A γ (¯x,x ′ ,F) is well defined for any ¯ x,x ′ . Note that A γ is a non-monotone path method satisfying completeness, dummy, and linearity. To complete the counterexample, it remains to show that A γ satisfies ASI. Let T ′ be any affine transformation as in the definition of ASI. All that remains is to show that if ¯x,x ′ ,T ′ (¯x),T ′ (x ′ )∈ [a,b], then A γ (¯x,x ′ ,F) = A γ (T ′ (¯x),T ′ (x ′ ),F ◦ T ′− 1 ). First we note that T ′ (T(γ (1,0,t)))=γ (T ′ (T(1)),T ′ (T(0)),t)=γ (T ′ (¯x),T ′ (x ′ ),t). 
Applying this, we show that for any index i: A γ i (¯x,x ′ ,F)= Z 1 0 ∂F ∂x i (γ (t)) dγ i dt dt = Z 1 0 ∂F ∂x i (T(γ (1,0,t))) d(T(γ (1,0,t))) i dt dt = Z 1 0 ∂F ∂x i (T ′− 1 (T ′ (T(γ (1,0,t))))) d(T ′− 1 (T ′ (T(γ (1,0,t))))) i dt dt = Z 1 0 ∂F ∂x i (T ′− 1 (γ (T ′ (¯x),T ′ (x ′ ),t))) d(T ′− 1 (γ (T ′ (¯x),T ′ (x ′ ),t))) i dt dt = Z 1 0 ∂F ∂x i (T ′− 1 (γ (T ′ (¯x),T ′ (x ′ ),t))) dT ′− 1 i dx i (γ (T ′ (¯x),T ′ (x ′ ),t))) × d(γ (T ′ (¯x),T ′ (x ′ ),t)) i dt dt = Z 1 0 ∂(F ◦ T ′− 1 ) ∂x i (γ (T ′ (¯x),T ′ (x ′ ),t)) d(γ (T ′ (¯x),T ′ (x ′ ),t)) i dt dt =A γ i (T ′ (¯x),T ′ (x ′ ),F ◦ T ′− 1 ) (5.1) Thus A γ is an attribution method that satisfies dummy, linearity, ASI, and completeness, but is not in the form of an ensemble of monotone path methods. We now prove that A γ is not equivalent to a ensemble of monotone path methods. We do this by introducing a context where A γ gives negative attributions, and note that monotone path methods (and thus ensembles of monotone path methods) cannot give negative attributions in this context. 91 Let F(x 1 ,x 2 )=x 1 x 2 2 . We calculate A γ (1,0,F) by calculating the five straight paths that comprise it. We denote the path P 1 to be the path from (0,0)→(1,0), P 2 to be the path from(1,0)→(1,1),andsoon. Bythisdecomposition,wehaveA γ (1,0,F)= P 5 i=1 IG(P i ,F), whereP i indicatestheinputandbaselineofIGintheobviousway. Calculating IG(P 1 ,F),..., IG(P 4 ,F) is simple if we observe that ∂F ∂x i =0 for one variable, which causes that component to be zero, and then apply completeness. This yields: IG(P 1 ,F)=(0,0), IG(P 2 ,F)=(0,1), IG(P 3 ,F)=(− 1,0), and IG(P 4 ,F)=(0,0). Because parameterization will not affect the integral, we parameterize P 5 (t) = (t,t). Calculating A P 5 (1,0,F), IG 1 (P 5 ,F)= Z 1 0 ∂F ∂x 1 dP 5 1 dt dt = Z 1 0 t 2 dt = 1 3 By completeness we get IG(P 5 ,F) = ( 1 3 , 2 3 ). Thus, A γ (1,0,F) = (− 2 3 , 5 3 ), and A γ can give negative values to a non-decreasing C 1 function with baseline 0 and input 1. Because monotone path methods cannot give negative values in this case, and by extension, ensemble of monotone path methods cannot either, A γ cannot be represented as an ensemble of monotone path methods. We note that many non-monotone, piece-wise smooth paths could suffice for the coun- terexample. We also note that since non-monotone path methods satisfy the above axioms, it is an open questions whether other methods that are not an ensemble of path methods also satisfy the axioms. We now establish Claim 2 with the additional assumption of NDP for a particular class of functions. Corollary 6. (Claim 2 with NDP forF 1 ∪F 2 Functions) Let x ′ be fixed, ¯x∈[a,b]. Suppose A∈A 2 satisfies dummy, linearity, completeness, ASI, and NDP. 1) If F ∈F 1 , A(¯x,x ′ ,F) is equivalent to the usual ensemble of path methods. 2) If F ∈F 2 , let µ x be the measure on 92 Γ m (¯x,x ′ ) from Theorem 2. If A(¯x,F) is defined, and for almost every path γ ∈ Γ m (¯x,x ′ ) (according to µ x ), ∂D∩P γ is a null set, then A(¯x,F) is equivalent to the usual ensemble of path methods. Proof. These are two specific cases of Theorems 2 and 3. 5.2 Counterexample to Sundararajan and Najmi [2020, Thm 4.1] and Proof with NDP The original statement of Sundararajan and Najmi [2020, Thm 4.1] is as follows: “(Reducing Model Explanation to Cost-Sharing). Suppose there is an attribution method that satisfies Linearity and ASI. 
Then for every attribution problem with explicand ¯x, baseline x ′ and function f (satisfying the minor technical condition that the derivatives are bounded), then there exist two costsharing problems such that the resulting attributions for the attribution problem are the difference between cost-shares for the cost-sharing problems.” Sundararajan and Najmi [2020] defines “cost-sharing problems” to be attributions where x ′ =0,x≥ 0, and F is non-decreasing in each component. Interprating “cost-shares”, we look to the referenced work, Friedman and Moulin [1999], which restricts cost-share solutions to non-negative solutions to cost-sharing problems. A restatement of the theorem is then: Claim 3. (Sundararajan and Najmi [2020, Proposition 1]) Suppose A is an attribution method that satisfies linearity and ASI. Then for every attribution problem ¯x , x ′ , and F with bounded first derivative: 1. There exists ¯y,¯z≥ 0, G,H non-decreasing (There are 2 cost-share problems) 2. A(¯x,x ′ ,F)=A(¯y,0,G)− A(¯z,0,H) (The original attribution equals the difference between attributions for the cost-share problems) 3. A(¯y,0,G), A(¯z,0,H)≥ 0 (A gives cost-share solutions to the cost-share problems) We now provide a counterexample to claim 3. Let n = 1, and define the attribution method A by A(¯x,x ′ ,F):=F(x ′ )− F(¯x). Note that A satisfies linearity since: 93 A(¯x,x ′ ,F 1 +F 2 )=F 1 (x ′ )+F 2 (x ′ )− F 1 (¯x)− F 2 (¯x)=A(¯x,x ′ ,F 1 )+A(¯x,x ′ ,F 2 ) (5.2) A also satisfies ASI since for any linear transformation T we have: A(T(¯x),T(x ′ ),F ◦ T − 1 )=F ◦ T − 1 (T(x ′ ))− F ◦ T − 1 (T(¯x)) =F(x ′ )− F(¯x) =A(¯x,x ′ ,F) (5.3) Nowlet ¯x=1,x ′ =0,F(x):=x. Weproceedbycontradiction. Supposethereexists ¯y,¯z≥ 0, G,H non-decreasing such that A(¯y,0,G), A(¯z,0,H) ≥ 0 and A(1,0,F) = A(¯y,0,G)− A(¯z,0,H). Now observe that A(1,0,F)=F(0)− F(1)=− 1, which implies A(¯z,0,H)>0. However, A(¯z,0,H)=H(0)− H(¯z)≤ 0, a contradiction. Thus the theorem does not hold for A with the stipulated ¯x,x ′ ,F, and is false. We now establish Claim 3 with the addition of NDP. Theorem 14. (Claim 3 with NDP) Suppose A is an attribution method that satisfies linearity, ASI, and NDP. Then for every ¯x,x ′ , and F with bounded first derivative: 1. There exists ¯y,¯z≥ 0, G,H non-decreasing (There are 2 cost-share problems) 2. A(¯x,x ′ ,F)=A(¯y,0,G)− A(¯z,0,H) (The original attribution equals the difference between attributions for the cost-share problems) 3. A(¯y,0,G), A(¯z,0,H)≥ 0 (A gives cost-share solutions to the cost-share problems) Proof. We follow the proof from Sundararajan and Najmi [2020, Proposition 1], but employ NDP. Since A satisfies ASI, there exists an affine transformation T such that A(¯x,x ′ ,F)= A(T(¯x),0,F◦ T − 1 ), where T(¯x)≥ 0 and F◦ T − 1 has bounded derivative. Since F◦ T − 1 has a bounded derivative, there exists a c∈R n such that F◦ T − 1 +c ⊺ x, c ⊺ x are non-decreasing. By linearity, A(T(¯x),0,F ◦ T − 1 )=A(T(¯x),0,F ◦ T − 1 +c ⊺ x)− A(T(¯x),0,c ⊺ x). Because A satisfies NDP, we have A(T(¯x),0,F ◦ T − 1 +c ⊺ x), A(T(¯x),0,c ⊺ x)≥ 0. 94 Appendix B On Characterizing Ensembles of Monotone Path Methods Here we provide a comment on Conjecture 1 and proofs of the theorems for characterizing ensembles of monotone path methods. 1 Comment on Conjecture 1 If no qualifications are put on the set of paths that µ x is supported on, then A may take on infinite values, contradicting completeness, or may simply be undefined. Consider the following example. Let n=2, [a,b]=[0,1]. Let F(x)=x 1 x 2 , ¯x=(1,1), x ′ =(0,0). 
Define the path γ n (¯x,x ′ ,t) to be the path obtained by traveling completely around the boundary of the domain clockwise n times, then following the straight line from (0,0) to (1,1). We define γ − n (¯x,x ′ ,t) similarly to γ n , but with counterclockwise paths. A γ 0 (¯x,x ′ ,F) = (0.5,0.5). A γ n (¯x,x ′ ,F) = (0.5+n,0.5− n), n∈Z. Now define the support of µ x (γ ) to be {γ (− 2) k : k∈N}. We then define µ x on it’s support to be µ x (γ (− 2) k )= 1 2 k . A(¯x,x ′ ,F) = Z γ ∈Γ(¯ x,x ′ ) A γ (¯x,x ′ ,F)dµ x (γ ) = ∞ X k=1 (0.5+(− 2) k ,0.5− (− 2) k ) 1 2 k = ∞ X k=1 ( 0.5 2 k +(− 1) k , 0.5 2 k − (− 1) k ) The above sum is not convergent in either component, so A(¯x,x ′ ,F) is not defined. 95 A similar construction only allowing clockwise paths may yield A(¯x,x ′ ,F)=(∞,−∞ ), contradicting completeness. 2 Proof of Theorem 2 Proof. We begin by supposing the assumptions. Let x ′ be fixed, F 1 andA 1 be as stipulated, and A∈A 1 . We introduce the notationA 1 (c,d), c,d∈R n , to be defined as the set A 1 , but with specified region [ c,d] instead of [a,b]. The setF 1 (c,d) is defined likewise. 2) → 1): Suppose A is an ensemble of monotone path methods as in the theorem statement. It is trivial to show that A satisfies linearity, completeness, and sensitivity(b). Suppose F is non-decreasing from x ′ to some ¯x. Then for any monotone path γ from x ′ to ¯x, A γ (¯x,x ′ ,F)≥ 0. Thus A(¯x,x ′ ,F)≥ 0, and A satisfies NDP. 1)→2): Let A satisfy completeness, linearity, sensitivity(b), and NDP. Let F ∈F 1 (a,b) and ¯x ∈ [a,b]. WLOG, we may assume that F(x ′ ) = 0, since if not, consider G(x) := F(x)− F(x ′ ) and apply Lemma 1. Our strategy will be to first define a transform such that A can be represented as a baseline attribution with baseline 0. Define T :R n →R n as T i (x)=(x i − x ′ i )× (− 1) 1 x ′ i >¯x i . One can think of T as a transform from the baseline x ′ space to the baseline 0 space. T transforms [a,b], by shifting and reflections about axes, into some other rectangular prism [c,d], for some c,d∈R n . More importantly, T transforms x ′ to 0 and ¯x to|¯x− x ′ |. Specifically, we get T([x ′ ,¯x])=[0,|¯x− x ′ |], with T(x ′ )=0 and T(¯x)=|¯x− x ′ |. Note further that T transforms the set of monotone paths from x ′ to ¯x into the set of monotone paths from 0 to|¯x− x ′ |, or T(Γ m (¯x,x ′ ))=Γ m (|¯x− x ′ |,0). T is one-to-one and has a well defined inverse overR n . So one can think of T − 1 as a transform from the baseline 0 space to the baseline x ′ space. For ¯y,y ′ ∈ [c,d], G ∈ F 1 (c,d), define A ′ ∈ A 1 (c,d) by A ′ (¯y,y ′ ,G) := A(T − 1 (¯y),T − 1 (y ′ ),G◦ T). Essentially A ′ is a reformulation of A in the baseline 0 space. By definition, A(¯x,x ′ ,F) = A ′ (|¯x− x ′ |,0,F ◦ T − 1 ). A ′ satisfies completeness, linearity, sensitivity(b), and NDP. 96 Note that to apply Theorem 1, we must restrict the domain of A ′ to not include inputs withnegativecomponents. Ifwerestrictthedomainof A ′ , itisnotclearthatthisattribution will behave the same. It seems possible that the attribution A ′ (¯x,x ′ ,F) depends on the behavior of F in the domain we want to remove. If this were the case, issues could arise, such as the restricted A ′ not being equivalent to the unrestricted A ′ . To address this issue, we turn to the development of an important lemma. Lemma 3. If A ∈ A 1 satisfies completeness, linearity, sensitivity(b), and NDP, then A(¯x,x ′ ,F) is determined by ¯x,x ′ and the behavior of F inside [x ′ ,¯x]. Proof. Suppose G,H ∈ F 1 have the same behavior in [x ′ ,¯x]. 
So for x ∈ [a,b], G(x)− H(x) = 0 = H(x)− G(x). Thus both are non-decreasing from x ′ to ¯x. Because A satisfies NDP, A(¯x,x ′ ,H− G) ≥ 0, and A(¯x,x ′ ,G− H) = − A(¯x,x ′ ,H− G) ≥ 0. Thus 0=A(¯x,x ′ ,G)− A(¯x,x ′ ,H), and A(¯x,x ′ ,G)=A(¯x,x ′ ,H). Now we define a BAM to apply Theorem 1 on. Define A ′′ :[0,T(¯x)]×F 0 (T(¯x),0)→R as such: A ′′ (¯y,G) := A ′ (¯y,0,H), where H ∈ F 1 [c,d] is any function such that H = G when restricted to [0,T(¯x)]. A ′′ is a properly defined BAM by Lemma 3. Note that for G∈F 1 with G(0)=0, G non decreasing, and ¯y∈[0,T(¯x)], we may go backwards and say A ′ (¯y,0,G)=A ′′ (¯y,G). Furthermore, A ′′ satisfies completeness, linearity, sensitivity(b), and NDP. Write F 0 = F ◦ T − 1 . F 0 is a C 1 function defined on a compact domain, so ∇F 0 is bounded. So there exists c ∈ R n such that ∇(F 0 (x)+ c ⊺ x) = ∇F 0 + c ≥ 0 on the compact domain. This implies that F 0 (x)+c ⊺ x is non-decreasing, C 1 , with F 0 (0)=0. So F 0 (x)+c ⊺ x∈F 0 . Employing Theorem 1, there exists a measure µ such that: A i (¯x,x ′ ,F(x)+c ⊺ T(x)) =A ′ i (T(¯x),0,F 0 (x)+c ⊺ x) =A ′′ i (T(¯x),F 0 (x)+c ⊺ x) = Z γ ∈Γ m (T(¯x),0) A γ i (T(¯x),0,F 0 (x)+c ⊺ x)× dµ (γ ) 97 Inspecting the interior term, we find that for γ a monotone path from 0 to T(¯x), A γ i (T(¯x),0,F 0 (x)+c ⊺ x) = Z 1 0 [ ∂F 0 ∂γ i +c i ] dγ i dt dt = Z 1 0 ∂F ◦ T − 1 ∂γ i × dγ i dt dt+ Z 1 0 c i dγ i dt dt = Z 1 0 ∂F ∂(T − 1 ◦ γ ) i (T − 1 (γ (t))× ∂T − 1 i ∂γ i (γ (t))× dγ i dt dt+c i (T i (¯x)− T i (x ′ )) = Z 1 0 ∂F ∂(T − 1 ◦ γ ) i (T − 1 (γ (t)))× ∂(T − 1 ◦ γ ) i ∂t dt+c i T i (¯x) =A (T − 1 ◦ γ ) i (¯x,x ′ ,F)+c i T i (¯x) Set µ ′ (γ ) := µ (T(γ )) so that µ ′ is a measure on the monotone paths from x ′ to ¯x. Combining previous results, we have, A i (¯x,x ′ ,F(x))+A i (¯x,x ′ ,c ⊺ T(x))) =A i (¯x,x ′ ,F(x)+c ⊺ T(x)) = Z γ ∈Γ m (T(¯x),0) A γ i (T(¯x),0,F 0 (x)+c ⊺ x)× dµ (γ ) = Z γ ∈Γ m (T(¯x),0) [A (T − 1 ◦ γ ) i (¯x,x ′ ,F)+c i T i (¯x)]× dµ (γ ) = Z γ ∈Γ m (T(¯x),0) A (T − 1 ◦ γ ) i (¯x,x ′ ,F)× dµ (γ )+c i T i (¯x) = Z γ ∈Γ m (¯x,x ′ ) A γ i (¯x,x ′ ,F)× dµ (T(γ ))+c i T i (¯x) = Z γ ∈Γ m (¯x,x ′ ) A γ i (¯x,x ′ ,F)× dµ ′ (γ )+c i T i (¯x) From a previous result, A i (¯x,x ′ ,G) is a function only of ¯x,x ′ and ∂G ∂x i . So A i (¯x,x ′ ,c ⊺ T(x)) = A i (¯x,x ′ ,c i T i (x)). By sensitivity(b), A j (¯x,x ′ ,c i T i (x)) = 0 for j ̸= i. So by completeness, A i (¯x,x ′ ,c i T i (x))=c i T i (¯x)− c i T i (x ′ )=c i T i (¯x). Subtracting the term from both sides of the above equation yields: 98 A i (¯x,x ′ ,F(x))= Z γ ∈Γ m (¯x,x ′ ) A γ i (¯x,x ′ ,F)× dµ ′ (γ ) Note that A ′′ is determined by A and choice of ¯x and x ′ , since T is determined by ¯x,x ′ . Note further that A ′′ and T determines µ ′ . So for any F ∈F 1 and fixed x ′ , we can index on ¯x to get µ ′x , a probability measure on Γ(¯ x,x ′ ). Thus for a fixed x ′ , F ∈F 1 , we have: A i (¯x,x ′ ,F)= Z γ ∈Γ m (¯x,x ′ ) A γ i (¯x,x ′ ,F)dµ ′x (γ ) 3 Proof of Lemma 2 Proof. Proceed by induction. Let F :R n →R be a one-layer feed forward neural network with F ∈F 2 . If the layer is an analytic function, then F is analytic since the composition of analytic functions is analytic. Precisely, F is analytic in the interior of [x ′ ,¯x], which has a boundary of measure 0. If the layer is a max function, then F is analytic except on the boundary of [x ′ ,¯x] and potentially some hyper-plane, which is a null set. In either case, the result is obtained. Now suppose that F :R n →R m is a k-layer feed forward neural network with F i ∈F 2 for each i. 
Further suppose [x ′ ,¯x] can be partitioned into and open set D i and ∂D i , where each output F i is analytic on D i and ∂D i a null set. If H :R m →R is an analytic function, then H◦ F is analytic on D =∩ n i=1 D i , ∂D =∪ n i=1 ∂D i is a null set, and D∪∂D is a null set. NowsupposeinsteadthatH isamaxfunction. Sincethemaxofmorethantwofunctions is the composition of the two-input max function, we will only consider the two-input max. Let H be the max function of the i th and j th components, so that H◦ F = max(F i ,F j ). First, we inspect points in D i ∩D j . Let x∈D i ∩D j , and consider three disjoint cases: 1) if F i (x)− F j (x)̸=0, then by continuity of F, F i − F j ̸=0 in some ball centered around x. This implies that either H◦ F ≡ F i or H◦ F ≡ F j in some ball around x. Thus H◦ F is analytic at x. Note the set of case 1 points form an open set. Denote the set of case one points D 1 . 99 2) If F i (x)=F j (x), and F i =F j for some ball centered around x, then H =F i =F j in that ball, and H◦ F is analytic at x. Note the set of case 2 points form an open set. Denote the set of case 2 points D 2 . 3) We denote the set of all other points in D i ∩D j by D 3 . For x∈ D 3 , F i (x) = F j (x), but F i ̸= F j for some point in every open ball centered at x. Set D =D 1 ∪D 2 , and note that D is open, H◦ F is analytic in D. Since D i ∩D j is an open set, there exists a countable sequence of open balls,{B k }, such that D i ∩D j =∪ ∞ k=1 B k . For any B k , set B 1 k :=B k ∩D 1 , B 2 k :=B k ∩D 2 , and B 3 k :=B k ∩D 3 . Let G(x)=F i (x)− F j (x), and note that since F i , F j are analytic on B k , G is analytic on B k also. Note that for any x∈B 3 k , G(x)=0. If m(B 3 k )>0, then, m({x∈B k |G(x)=0})>0, and since G is analytic, G≡ 0 on B k . This is a contradiction, since this implies B k =B 2 k and m(B 3 k )=0. Thus m(B 3 k )=0, and D 3 =∪B 3 k is a null set. D 1 ,D 2 ,D 3 , and ∂D i ∪∂D j partition [x ′ ,¯x], a closed and bounded set. D =D 1 ∪D 2 is an open set in the interior of [x ′ ,¯x] and ∂D i ∪∂D j ∪D 3 is null and thus has no interior. Thus ∂D =∂D i ∪∂D j ∪D 3 , a null set. Since D 3 points are boundary points of D 1 , we have that ∂D i , ∂D j , and D 3 are all boundary points of D. Since D 1 , D 2 , D 3 , and ∂D i ∪∂D j partition [x ′ ,¯x], and D is an open set in the interior of [x ′ ,¯x], we have ∂D =D 3 ∪∂D i ∪∂D j , a null set. 4 Proof of Theorem 3 Proof. Suppose the suppositions of the theorem. We may assume that x ′ = 0, ¯x≥ 0, for otherwise we may use the transformation technique applied in theorem 2. Further suppose ¯x̸=x ′ , for otherwise the result is trivial. Denote the open region where F is C 1 by D. Note that since each layer of F is a Lipschitz function, F is Lipschitz. We now turn to a useful lemma, but before we do, we give the following definitions. For a given i, x∈[x ′ ,¯x], we define a function that travels from one side of the rectangle [ x ′ ,¯x], through x, and to the other side, while varying only in the i th component. Formally, define ℓ (x,i) (t) with 0≤ t≤| ¯x i − x ′ i | as such: ℓ (x,i) i (t) = x ′ i +sign(¯x i − x ′ i )t, and ℓ (x,i) j (t) = x j for j̸=i. We say that F is non-decreasing from x ′ to ¯x in it’s i th component if, for all x∈[x ′ ,¯x], F ◦ ℓ (x,i) is non-decreasing in t. 100 Lemma 4. Let A satisfy linearity, completeness, sensitivity(b), and NDP. Suppose F ∈ F is Lipschitz continuous and non-decreasing from x ′ to ¯x in its i th component. Then A i (¯x,x ′ ,F)≥ 0. Proof. 
Since F is Lipschitz, there exists c with c i =0 such that for each j, F +c ⊺ x is non- decreasing from x ′ to ¯x in the j th component. Set G(x)=F(x)+c ⊺ x. For any monotone path γ from x ′ to ¯x, if t ≤ t ′ then G◦ γ (t) ≤ G◦ γ (t ′ ), implying G is non-decreasing from x ′ to ¯x. Note that ∂ i (F − G) = − ∂ i (c ⊺ x) = 0. Thus, by Dummy, A i (¯x,x ′ ,F) = A i (¯x,x ′ ,F − G)+A i (¯x,x ′ ,G)=A i (¯x,x ′ ,G)≥ 0. Our goal now is to construct a sequence of C 1 functions {F m } such that lim m→∞ A i (¯x,x ′ ,F m ) = A i (¯x,x ′ ,F). Fix i and define f(x) = ∂F ∂x i (x) for x ∈ D. For ¯x∈ ∂D, define f(x) =− L, where L is the Lipschitz constant of F. f is continuous in D and minimized on ∂D, thus f is lower semi-continuous. By Baire’s Theorem, there exists a monotone increasing sequence of continuous functions, {g m }, such that g m →f point-wise. Because f is bounded, it is possible to construct this sequence as being bounded below. By the Stone-Weierstrass Theorem, for each g m there exists ξ m such that ξ m is a polynomial in R n and|g m − ξ m |< 1 m . Define a sequence {f m } with f m =ξ m − 1 m . Thus for the sequence {f m } we have: • f m is C 1 (C ∞ in fact). • f m =ξ m − 1 m f m ◦ ℓ (x,i) = d(F m ◦ ℓ (x,i) ) dt where the inequality is gained because ℓ (x,i) is a strictly increasing function in this case, and f >f m by construction. Note F ◦ ℓ (x,i) is Lipschitz with Lipschitz constant L. If d(F◦ ℓ (x,i) ) dt exists on ∂D, then when ℓ (x,i) is on the region ∂D we have d(F ◦ ℓ (x,i) ) dt ≥− L=f◦ ℓ (x,i) >f m ◦ ℓ (x,i) = d(F m ◦ ℓ (x,i) ) dt This implies d(F◦ ℓ (x,i) ) dt − d(Fm◦ ℓ (x,i) ) dt is non-negative where it exists. Finally, F ◦ ℓ (x,i) is Lipschitz, so it’s derivative exists almost everywhere, and R α 0 d(F◦ ℓ (x,i) ) dt =F ◦ ℓ (x,i) (α )− F ◦ ℓ (x,i) (0). From this, we gain Z α 0 d(F ◦ ℓ (x,i) ) dt − d(F m ◦ ℓ (x,i) ) dt dt=F◦ ℓ (x,i) (α )− F m ◦ ℓ (x,i) (α )+F◦ ℓ (x,i) (0)− F m ◦ ℓ (x,i) (0) The above is the integral of a non-negative function, and is thus non-decreasing in α . This implies that F − F m is non-decreasing from 0 to ¯x in the i th component. So A i (¯x,x ′ ,F − F m )≥ 0 and A i (¯x,x ′ ,F)≥ A i (¯x,x ′ ,F m ). Employing Theorem 2, we have 102 A i (¯x,x ′ ,F) ≥ lim m→∞ A i (¯x,F m ) = lim m→∞ Z γ ∈Γ m (¯x) A γ i (¯x,F m )dµ x (γ ) = lim m→∞ Z γ ∈Γ m (¯x) Z 1 0 f m ∂γ i ∂t dtdµ x (γ ) = Z γ ∈Γ m (¯x) Z 1 0 lim m→∞ f m ∂γ i ∂t dtdµ x (γ ) = Z γ ∈Γ m (¯x) Z 1 0 ∂F ∂γ i ∂γ i ∂t dtdµ x (γ ) We move the limit inside the integral by the dominated convergence theorem. We can move the limit inside the interior integral because f m is bounded, ∂γ i ∂t is bounded using the constant velocity path parameterization, and the interior terms have a point- wise limit of ∂F ∂γ i ∂γ i ∂t almost everywhere for almost every γ . To move the limit into the first integral, note that for particular values of c i we can employ Lemma 3 to bound A γ i (¯x,F m +c i x i ) = A γ i (¯x,F m )+c i ¯x i above or below zero. Thus A γ i (¯x,F m ) has an upper and a lower bound. Using an over-approximating sequence for {f m } instead of an under- approximating sequence yields the same inequality in reverse, gaining our result. 103 Appendix C Supplementary Material on IG Lipshitzness and Extensions Here we provide a fuller treatment on IG Lipshitzness and the distribution of baseline case for IG, and additional experimental details and results for the method of internal neuron attributions for image patches. 1 Proof of Theorem 4 First, we begin the case were IG may fail to be Lipschitz. 
Consider F(x 1 ,x 2 )=max(x 2 − x 1 ,x 1 − x 2 ). Let ϵ> 0. Set x ′ =(0,0) and consider ¯x=(1,1+ ϵ 2 ), ¯y =(1,1− ϵ 2 ). First, note that ||¯x− ¯y|| = ϵ . We find that ∂F ∂x 1 = 1 if x 1 > x 2 , and ∂F ∂x 1 = − 1 if x 1 < x 2 . So IG 1 (¯x,x ′ ,F) = (¯x 1 − x ′ 1 ) R 1 0 (− 1)dt = − 1, while IG 1 (¯y,x ′ ,F) = 1. So |IG 1 (¯x,x ′ ,F)− IG 1 (¯y,x ′ ,F)|=2 for ϵ> 0. Thus IG(¯x,x ′ ,F) is not Lipschitz continuous in ¯x. Now we present the proof of the second claim: Proof. Fix x ′ and let F be such that∇F is Lipschitz continuous with constant L. Since∇F is continuous on a bounded domain,| ∂F ∂x i | attains a maximum on [a,b], call it M. Choose any ¯x,¯y ∈ [a,b]. We will denote the uniform-velocity paths for ¯x,¯y by γ,γ ′ , respectively. Then, 104 |IG i (¯x,x ′ ,F)− IG i (¯y,x ′ ,F)| =|(¯x i − x ′ i ) Z 1 0 ∂F ∂x i (γ (t))dt− (¯y i − x ′ i ) Z 1 0 ∂F ∂x i (γ ′ (t))dt| =|(¯x i − ¯y i ) Z 1 0 ∂F ∂x i (γ (t))dt− (¯y i − x ′ i ) Z 1 0 [ ∂F ∂x i (γ ′ (t))− ∂F ∂x i (γ (t))]dt| ≤| ¯x i − ¯y i | Z 1 0 | ∂F ∂x i (γ (t))|dt+|¯y i − x ′ i | Z 1 0 | ∂F ∂x i (γ ′ (t))− ∂F ∂x i (γ (t))|dt ≤|| ¯x− ¯y||M +|b i − a i | Z 1 0 L||γ (t)− γ ′ (t)||dt =||¯x− ¯y||M +|b i − a i |L Z 1 0 ||(¯x− ¯y)t||dt =(M + |b i − a i | 2 L)||¯x− ¯y|| Thus IG i (¯x,x ′ ,F) is Lipschitz continuous with Lipschitz constant at most M + |b i − a i | 2 L. 2 Distributional IG Satisfies Distributional Attribution Axioms Here we provide proofs that distributional IG satisfies given axioms. Sensitivity(a). Suppose X ′ varies in exactly one input, X ′ i , so that X ′ j = ¯x j for all j ̸= i, andEF(X ′ )̸=F(x). Then EG i (¯x,X ′ ,F)=EIG i (¯x,X ′ ,F) =E(F(¯x)− F(X ′ )) =F(¯x)− EF(X ′ )̸=0 The second line is gained because IG satisfies completeness and ¯x j = X j causes IG j (¯x,X ′ ,F)=0 for j̸=i. 105 Completeness. n X i=1 EG(¯x,X ′ ,F)= n X i=1 EIG(¯x,X ′ ,F) =E n X i=1 IG(¯x,X ′ ,F) =EF(¯x)− F(X ′ ) (IG satisfies completeness.) =F(¯x)− EF(X ′ ) Symmetry Preserving. Suppose that for all ¯x, F(x)=F(x ∗ ), X ′ i and X ′ j are exchangeable, and ¯x i = ¯x j . Let 1 k represent a vector with every component 0 except the k th component, which is 1. First observe: ∂F ∂x i (¯x)= lim t→0 F(¯x+1 i t)− F(¯x) t = lim t→0 F(¯x ∗ +1 j t)− F(¯x ∗ ) t = ∂F ∂x j (¯x ∗ ) From this, we have EG i (¯x,X ′ ,F)=E X ′ ∼ D ′(¯x i − X ′ i ) Z 1 0 ∂F ∂x i (X ′ +(¯x− X ′ )t)dt =E X ′ ∼ D ′(¯x i − X ′ i ) Z 1 0 ∂F ∂x j ((X ′ +(¯x− X ′ )t) ∗ )dt =E X ′ ∼ D ′(¯x ∗ j − X ′∗ j ) Z 1 0 ∂F ∂x j (X ′∗ +(¯x ∗ − X ′∗ )t)dt (¯x i =x ∗ j ) =E X ′∗ ∼ D ′(¯x ∗ j − X ′∗ j ) Z 1 0 ∂F ∂x j (X ′∗ +(¯x ∗ − X ′∗ )t)dt (X ′∗ ∼ D ′ ⇐⇒ X ′ ∼ D ′ ) =EG j (¯x ∗ ,X ′∗ ,F) =EG j (¯x,X ′ ,F) (¯x=x ∗ ;X ′ ,X ′∗ ∼ D ′ ) 106 NDP. Suppose F is non-decreasing from every point on the support of D ′ to ¯x. Then for any ¯x ′ on the support of X ′ , IG(¯x,x ′ ,F)≥ 0. Thus, for any i EG i (¯x,X ′ ,F)=EIG i (¯x,X ′ ,F)≥ 0 3 Additional Experiments from Section 7 3.1 Further ImageNet Results To further explore the pruning based on internal neuron attributions for image patches. We pick an often referenced image for IG, a fireboat, and repeat the experiments in Section 7.2. The results are shown in Figures C.1 and C.2. Figure C.1: From left to right are images A,B,C,and D. A: The original image and bounding box indicating specified image patch. B: IG attributes visualized. Green dots show positive IG, red dots show negative IG. C: IG attributes visualized after top 1% of neurons pruned based on image-patch attributions. D: IG attributes visualized after top 1% neurons pruned based on the global ranking. 
Figure C.1 shows that pruning 1% of the neurons based on the targeted (image-patch) ranking results in some scattering of activity, but the IG's focus on the leftmost water jets is still present. By contrast, when pruned by the global ranking, the IG is broadly scattered and the focus on the leftmost water jets is diminished.
Figure C.2: Sum of IG attributes inside and outside the bounding box when neurons are pruned according to certain rankings. Left: Neurons are pruned based on the IG global ranking. Right: Neurons are pruned based on the IG ranking inside the bounding box.
Figure C.2 reinforces the observations from Figure 2.4. We see that pruning by IG rankings inside the bounding box makes the IG sum inside the box more negative and the sum outside the box more positive, compared to pruning by the global ranking. This observation again supports the claim that the image-patch-based ranking gives higher ranks to the neurons responsible for positive IG inside the box.
3.2 Fashion MNIST Results
Here we present experiments on internal neuron attributions with a custom model trained on the Fashion MNIST data set. Information about the model can be found in Section 4 of this appendix. In this experiment we identify a sub-feature common to each image in the category Sneaker: the heel. We stipulate a bounding box for the heel, seen in Figure C.3. For each Sneaker image we calculate image-patch attributions for neurons in the second-to-last dense layer. To calculate each neuron's rank, we average its attributions over all Sneaker images. We then progressively prune based on rankings while noting the IG sums inside and outside the bounding box.
Figure C.3: Left: IG attributes with respect to the sneaker images. Green dots show positive IG attributes and red dots show negative IG attributes. Bounding boxes are shown in yellow. Right: Recomputed IG attributes with respect to the sneaker images after internal neurons are pruned.
Figure C.4: Summation of the recomputed IG attributes inside the bounding box, outside the bounding box, and both. Summations are averaged over 64 samples chosen from the testing set. Left: Pruning by the IG ranking in descending order. Middle: Pruning by the IG ranking in ascending order. Right: Random pruning.
When we prune in descending order, the average IG sum inside the box initially drops while the sum outside the box increases. A gap between the IG sums widens and is sustained throughout the pruning process. This shows that the pruning targeted neurons that contributed to positive IG values in the box. Thus the regional IG accurately identified neurons positively associated with the heel region.
4 Model Architecture and Training Parameters
Table C.1 presents the architecture of the model used in the Fashion MNIST experiments.
Layer Type | Shape
Convolution + tanh | 5 × 5 × 5
Max Pooling | 2 × 2
Convolution + tanh | 5 × 5 × 10
Max Pooling | 2 × 2
Fully Connected + tanh | 160
Fully Connected + tanh | 64
Softmax | 10
Table C.1: Model architecture for the Fashion MNIST dataset
Appendix D
Proofs of Synergy Function Theorems
Here we provide proofs for the theorems on the unique n-th-order interaction and the synergy function.
1 Proof of Theorem 5
Proof. Let I be any n-th-order interaction that satisfies the given axioms, and let (x̄, x′) ∈ [a,b] × [a,b] be arbitrarily chosen. We assume that all interactions are taken with respect to input x̄ and baseline x′. For ease of notation, we define F_S(x) := F(x_S). For any nonempty S ∈ P_n, note that I_S(F) = I_S(F − F_S + F_S). Note that (F − F_S)(x_S) is constant.
Thus, I S (F − F S ) = 0 for any S ∈ P k by the baseline test for interaction. Thus, by linearity of zero-valued functions, we have established that I S (F)=I S (F S ) for any S∈P k . We now proceed by strong induction: |S| = 1 case: Let i ∈ N and choose F ∈ F. Note that F {i} does not vary with any feature but x i . This implies that for S̸={i}, I S (F {i} )=0 by null feature. By completeness, I {i} (F {i} )=F {i} (¯x)− F {i} (x ′ ), and I {i} (F) is uniquely determined. Thus I S (F) is uniquely determined for|S|=1. |S|≤ k ⇒|S| = k+1 case: Suppose that for any G∈F[a,b] and any S ⊆{ 1,...,n} such that |S|≤ k, I S (G) is uniquely determined. Let T ∈P n ,|T| = k+1, F ∈F. It has been established that I T (F)=I T (F T ). Note that for all S⊊ T, we have|S|≤ k, so I S (F T ) is uniquely determined by the induction hypotheses. Since F T does not vary in each x i such that i / ∈T, we have I S (F T )=0 for S⊈T by null feature. By completeness, F T (¯x)− F T (x ′ ) = P S⊆P k I S (F T ) = P S⊆ T I S (F T ). Thus I T (F T ) = F T (¯x)− F T (x ′ )− P S⊊ T I S (F T ). 111 Since I T (F) = I T (F T ) equals the sum of uniquely determined terms, I T (F) is uniquely determined. 2 Proof of Corollary 1 We proceed to show the synergy function satisfies completeness, linearity, null feature, and baseline test for interactions (k≤ n). Proof. Completeness: For any v :{0,1} n →R, Sundararajan et al. [2020, Appendix 7.1] shows that the M¨ obius transform has the property that, v(T)= X S⊆ T a(v)(S). (2.1) Using this, observe, F(x ′ )+ X S∈Pn ϕ S (F)(¯x)= X S⊆ N a(F(¯x (·) ))(S) =F(¯x N ) =F(¯x), (2.2) which established completeness. Linearity of Zero-Valued Functions: We simply establish ϕ is linear. ϕ S (cF +dG)(¯x)=a(cF(¯x (·) )+dG(¯x (·) ))(S) = X T⊆ S (− 1) |S|−| T| (cF(¯x (·) )+dG(¯x (·) ))(T) =c X T⊆ S (− 1) |S|−| T| F(¯x (·) )(T)+d X T⊆ S (− 1) |S|−| T| G(¯x (·) )(T) =cϕ S (F)(¯x)+dϕ S (G)(¯x) (2.3) 112 Baseline Test for Interactions: Suppose F(¯x S ) is constant. ϕ S (F)(¯x)=a(F(¯x (·) ))(S) = X T⊆ S (− 1) |S|−| T| F(¯x T ) = X T⊆ S (− 1) |S|−| T| F(x ′ ) =F(x ′ ) X 0≤ i≤| S| |S| i (− 1) |S|− i =0 (2.4) Null Feature: Suppose F does not vary in some x i and i∈S. Then, ϕ S (F)(¯x)=a(F(¯x (·) ))(S) = X T⊆ S (− 1) |S|−| T| F(¯x T ) = X T⊆ S,i∈T (− 1) |S|−| T| F(¯x T )+ X T⊆ S,i/ ∈T (− 1) |S|−| T| F(¯x T ) = X T⊆ S\{i} (− 1) |S|− (|T|+1) F(¯x T∪{i} )+ X T⊆ S\{i} (− 1) |S|−| T| F(¯x T ) =− X T⊆ S\{i} (− 1) |S|−| T|) F(¯x T )+ X T⊆ S\{i} (− 1) |S|−| T| F(¯x T ) =0 (2.5) 113 3 Proof of Corollary 2 Proof. We proceed in the order given in Corollary 2. 1. Pure interaction sets are disjoint, meaning C S ∩ C T = ∅ whenever S̸=T. Suppose S, T ∈P n with T ̸=S. We proceed by contradiction and suppose F ∈C S ∪C T . WLOG∃i∈S\T, implying that F varies in feature i since F is a synergy function of S, and F does not vary in feature i, since F is a synergy function of T. This is a contradiction. Thus C S ∩C T =∅. 2. ϕ S projects F onto C S ∪{0}. That is, ϕ S (F)∈C S ∪{0} and ϕ S (ϕ S (F))=ϕ S (F) Let F ∈F. First, for the degenerate case, ϕ ∅ (F)=F(x ′ ), which is a constant function. For any constant c, ϕ ∅ (c) = c, implying ϕ ∅ is a projection and surjective for the range C ∅ ∪{0}. Thus ϕ ∅ projectsF onto C ∅ ∪{0}. Nowwewillshowthatϕ S (F)eitherisapureinteractionofS oris0inthenon-degenerate case. Suppose x i =x ′ i for some i∈S. 
Then, ϕ S (F)(x)= X T⊆ S (− 1) |S|−| T| F(x T ) = X T⊆ S,i∈T (− 1) |S|−| T| F(x T )+ X T⊆ S,i/ ∈T (− 1) |S|−| T| F(x T ) = X T⊆ S\{i} (− 1) |S|− (|T|+1) F(x T∪{i} )+ X T⊆ S\{i} (− 1) |S|−| T| F(x T ) =− X T⊆ S\{i} (− 1) |S|−| T|) F(x T )+ X T⊆ S\{i} (− 1) |S|−| T| F(x T ) =0 Thus ϕ S (F)=0 whenever x i =x ′ i for some i∈S, and ϕ S (F) satisfies condition 1 for being a pure interaction of S. 114 Now, inspecting the definition, ϕ S (F)(x)= P T⊆ S (− 1) |S|−| T| F(x T ), so ϕ S (F) does not vary in x i , i / ∈S. Lastly, suppose that F does not vary in some x i , i∈S. Since ϕ satisfies null feature, ϕ S (F) = 0. So either ϕ S (F) varies in all x i such that i ∈ S, or ϕ S (F) = 0. If the former, ϕ S (F) satisfies condition 2 for being a pure interaction of S; if the latter, ϕ S (F) = 0. Thus ϕ S (F) = 0 or ϕ S (F) is a pure interaction function of S, implying the range of ϕ S is C S ∪{0}. Now let Φ S ∈C S . Note ϕ S (Φ S )(x)= X T⊆ S (− 1) |S|−| T| Φ S (x T ) = X T=S (− 1) |S|−| T| Φ S (x T ) =Φ S (x S ) =Φ S (x) It is plain by the definition that ϕ S (0) = 0. Thus ϕ S is surjective for the range C S ∪{0}. Since the range of ϕ S is C S ∪{0}, ϕ maps elements of C S to themselves, and maps 0 to 0, so ϕ S is a projection. 3. For Φ T ∈C T , we have ϕ S (Φ T )=0 whenever S̸=T. Let Φ T ∈ C T and T ̸= S. If ∃i∈ S\T, then ϕ S (Φ T ) = 0 by null feature. Otherwise S⊊ T, and ϕ S (Φ T )=0 be baseline test for interactions (k =n). 4. ϕ uniquely decomposes F ∈ F into a set of pure interaction functions on distinct groups of features. That is, there exists P ⊂ P n such that F = P S∈P Φ S , where each Φ S ∈C S . Further more, only one such representation exists, Φ S =ϕ S (F) for each S∈P, and ϕ S (F)=0 for each S∈P n \P. 115 F = P S∈Pn ϕ S (F), and each ϕ S (F) ∈ C S ∪{0}. Since 0+ϕ ∅ (F) ∈ C ∅ and we may gather all the ϕ S (F) terms that are zero into the C ∅ term, we have shown a decomposition exists. Let it be that F(x) = P S∈P Φ S (x) for someP ∈P n , where each Φ S is an interaction function in S. By the results already established, we have for any T ∈P ϕ S (F)=ϕ S ( X T∈P Φ T ) = X T∈P ϕ S (Φ T ) =ϕ S (Φ S ) =Φ S If S / ∈P, then ϕ S (F)=ϕ S ( X T∈P Φ T ) = X T∈P ϕ S (Φ T ) =0 Now suppose that there are two decompositions, P S∈P 1Φ 1 S =F = P S∈P 2Φ 2 S . WLOG suppose S ∈ P 1 \P 2 . Then ϕ S (F) = 0 since S / ∈ P 2 and ϕ S (F) = Φ 1 S since S ∈ P 1 . Thus Φ 1 S = 0 and S = ∅. Thus P 1 △P 2 equals either ∅ or {∅}, and in the case that P 1 △P 2 ={∅} the extra term corresponding to∅ in one of the sums is 0, and does not effect thedecomposition. Now,ifP 1 △P 2 =∅,thenforanyS∈P 1 ,P 2 ,wehaveΦ 1 S =ϕ S (F)=Φ 2 S . Thus, the decomposition is unique. 116 Appendix E Supplementary Material and Proofs for Various k th -order Interaction Methods HereweprovidevariousdefinitionsandproofsforintegratedHessians, recursiveShapely, and sumofpowersmethods. Wealsogivetheproofthatcontinuous-inputinteractionmethodsare characterizedbysatisfyingthecontinuityconditionandarulefordistributingmonomials. We give a statement of the symmetry axiom used to characterize the Shapley-Taylor Interaction Index. Finally, we give further details on comparative experiments between the methods. 1 Statement of Symmetry Axiom Let π be an ordering of the features in N. We loosely quote the definition of symmetry from Sundararajan et al. [2020], altering the binary feature setting to a continuous feature setting: 1. 
Symmetry Axiom: for all F ∈F, for all permutations π on N: I k S (¯x,x ′ ,F)=I k πS (π ¯x,πx ′ ,F ◦ π − 1 ), (1.1) where ◦ denotes function composition, πS :={π (i):i∈S}, and (πx ) π (i) =x i . This axioms implies that if we relabel the features, then interactions for the relabeled features will concur with interactions before relabeling. It requires that the domain, [a,b], is closed under permutations of inputs, meaning it is of the form [a 1 ,b 1 ] n . 117 2 Proof of Theorem 7 Proof. Let I k be a k th -order interaction method defined for all ( ¯x,x ′ ,F)∈[a,b]× [a,b]×C ω . Fix x ′ and ¯x. Let T l be the l th order Taylor approximation of F at x ′ . Then I k (¯x,x ′ ,F)= lim l→∞ I k (¯x,x ′ ,T l ) = X m∈N n ,∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! lim l→∞ I k (¯x,x ′ ,[x− x ′ ] m ) The last line is determined by the action of I k on elements of the set {(¯x,x ′ ,F) : F(x) = [x− x ′ ] m ,m∈N n }, concluding the proof. 3 Interaction Methods Here we give an in depth treatment of the Recursive Shapley, Integrated Hessian, and Sum of Powers methods, as well as the augmentations to the recursive methods. We define the methods and show that each method is the unique method that satisfies linearity, their distribution policy, and in the case of gradient methods, the continuity condition. We also prove that each method satisfies desirable properties such as completeness, null feature, symmetry, and, if applicable, baseline test for interactions (k≤ n). 3.1 Recursive Shapley and Augmented Recursive Shapley 3.1.1 Defining Recursive Shapley Here we detail the properties of Recursive Shapley and Augmented Recursive Shap- ley. Let σ k T be the set of sequences of length k such that the sequence is made of the elements of T ̸= ∅ and each element appears at least once. For example, σ 3 {1,2} = {(1,1,2),(1,2,1),(1,2,2),(2,1,1),(2,1,2), (2,2,1)}. Calculating the size of σ k T , |σ k T | = P ∥l∥ 1 =k s.t. S l =T k l = N k T . For a given sequence s, define IG t (¯x,F) be a recursive implementation of the Shapley method according to the sequence s, i.e., 118 Shap (1,2,3) (¯x,F)=Shap 3 (¯x,Shap 2 (·,Shap 1 (·,F))). We can then define the k th -order Recur- sive Shapley for T ̸=∅ as: RS k T (¯x,F)= X s∈σ k T Shap s (¯x,F) (3.1) and define RS k ∅ (¯x,x ′ ,F):=F(x ′ ). We now move to inspect this equation and establish some properties. Eq. (3.1) states that for a synergy function Φ S , S̸=∅, Shap i (¯x,Φ S )= Φ S (¯x) |S| if i∈S 0 if i / ∈S (3.2) Then for a given sequence s∈σ k T and synergy function Φ S , if T ⊆ S then, Shap s (¯x,Φ S )=Shap s k (¯x,Shap s k− 1 (...Shap s 1 (·,Φ S )....) =Shap s k (¯x,Shap s k− 1 (...Shap s 2 (·, Φ S |S| )....) =Shap s k (¯x,Shap s k− 1 (...Shap s 3 (·, Φ S |S| 2 )....) =... =Shap s k (¯x, Φ S |S| k− 1 )) = Φ S (¯x) |S| k (3.3) However, if T ⊊ S then there exists an element of s that is not in S, and: Shap s (¯x,Φ S )=0, (3.4) due to some s j / ∈S in the sequence. 119 3.1.2 Recursive Shapley’s Distribution Policy Now, to show how Recursive Shapley distributes synergies, apply the definition of recursive Shapely for S̸=∅ to get: RS k T (¯x,Φ S )= X s∈σ k T Shap s (¯x,Φ S ) = P s∈σ k T Φ S (¯x) |S| k if T ⊆ S P s∈σ k T 0 if T ⊈S = N k T |S| k Φ S (¯x) if T ⊆ S 0 if T ⊈S (3.5) We also gain the above for S = ∅ by setting N k T |S| k = 1 when T = ∅. This establishes the distribution scheme in Eq. (3.4). Recursive Shapley is also linear because it it the sum of function compositions of composition of linear functions. This establishes Theorem 6. 
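To make this construction concrete, the following is a minimal numerical sketch (not the implementation used for the experiments in this dissertation): exact baseline-replacement Shapley values are computed by subset enumeration, Shap^s composes them recursively along a sequence s, and RS^k_T sums over the sequences in σ^k_T as in Eq. (3.1). The baseline is passed explicitly here, and the toy synergy function, input, and baseline are illustrative assumptions.

```python
# Minimal sketch of k-th order Recursive Shapley; the toy function, input,
# and baseline below are illustrative assumptions.
from itertools import combinations, product
from math import factorial

def masked(x, baseline, S):
    # Replace every coordinate outside S with its baseline value.
    return tuple(x[i] if i in S else baseline[i] for i in range(len(x)))

def shapley(F, i, xbar, baseline):
    # Exact Shapley value of feature i for the set function S -> F(xbar_S).
    n = len(xbar)
    others = [j for j in range(n) if j != i]
    val = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            w = factorial(r) * factorial(n - r - 1) / factorial(n)
            val += w * (F(masked(xbar, baseline, set(S) | {i}))
                        - F(masked(xbar, baseline, set(S))))
    return val

def shap_seq(F, seq, xbar, baseline):
    # Shap^s: apply the Shapley operator recursively along the sequence s.
    G = F
    for idx in seq:
        G = (lambda x, G=G, idx=idx: shapley(G, idx, x, baseline))
    return G(xbar)

def recursive_shapley(F, T, k, xbar, baseline):
    # RS^k_T: sum Shap^s over all length-k sequences built from T that use
    # every element of T at least once (the set sigma^k_T).
    seqs = [s for s in product(sorted(T), repeat=k) if set(s) == set(T)]
    return sum(shap_seq(F, s, xbar, baseline) for s in seqs)

# Toy check on the synergy function F(x) = x0 * x1 with three features.
F = lambda x: x[0] * x[1]
xbar, base = (1.0, 2.0, 3.0), (0.0, 0.0, 0.0)
print(recursive_shapley(F, {0}, 2, xbar, base))     # 0.5  = (1/4) * F(xbar)
print(recursive_shapley(F, {0, 1}, 2, xbar, base))  # 1.0  = (2/4) * F(xbar)
print(recursive_shapley(F, {2}, 2, xbar, base))     # 0.0  (null feature)
```

On this synergy function the printed values match the distribution policy in Eq. (3.5), namely N^k_T / |S|^k times Φ_S(x̄) for T ⊆ S and zero otherwise.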
3.1.3 Properties of Recursive Shapley To show Recursive Shapley satisfies completeness, observe for S̸=∅: X T∈P k ,|T|>0 RS k T (¯x,Φ S )= X T⊆ S N k T Φ S (¯x) |S| k = Φ S (¯x) |S| k X T⊆ S N k T = Φ S (¯x) |S| k |S| k =Φ S (¯x) (3.6) The case when S =∅ is easily verified by inspecting the synergy distribution policy of RS. 120 To show Recursive Shapley satisfies null feature, suppose that F does not vary in ¯x i . Then for any S∈P k such that i∈S, ϕ S (F)=0 since the synergy function is an interaction satisfying null feature. Then if i∈T, RS k T (¯x,F)= X S∈P k RS k T (¯x,ϕ S (F)) = X S∈P k s.t. i∈S RS k T (¯x,ϕ S (F))+ X S∈P k s.t. i/ ∈S RS k T (¯x,ϕ S (F)) = X S∈P k s.t. i∈S RS k T (¯x,0)+ X S∈P k s.t. i/ ∈S 0 =0 (3.7) Where the terms in the second sum are zero by Eq. (3.4). To show Recursive Shapley satisfies symmetry, let π be a permutation on N. Note that for Φ S ∈C S , we have Φ S ◦ π − 1 is a pure interaction function in πS with baseline πx ′ . Then RS k πT (π ¯x,πx ′ ,Φ S ◦ π − 1 )= N k πT |πS | k Φ S ◦ π − 1 (π ¯x) if πT ⊆ πS 0 if πT ⊈πS = N k T |S| k Φ S (¯x) if T ⊆ S 0 if T ⊈S =RS k T (¯x,x ′ ,Φ S ) So RS is symmetric on synergy functions. Now use the synergy decomposition of F ∈F to show RS is generally symmetric. 3.1.4 Augmented Recursive Shapley and Properties The synergy function ϕ is taken implicitly with respect to a baseline appropriate to F. To make the baseline choice explicit, we write ϕ (F)=ϕ (x ′ ,F). Augmented Recursive Shapley is then defined as: 121 RS k∗ T (¯x,x ′ ,F)=ϕ T (x ′ ,F)(¯x)+RS k T (¯x,x ′ ,F − X S∈P k ϕ S (x ′ ,F)) (3.8) With the above augmentation, IH k∗ explicitly distributes synergies ϕ T (F) to group T whenever|T|≤ k, and distributes higher synergies as IH k . The above is a linear function of F. Plugging in Φ S to the above gains the following distribution policy: RS k∗ T (Φ S )= Φ S (¯x) if T =S N k T |S| k Φ S (¯x) if T ⊊ S,|S|>k 0 else (3.9) Because each F ha a unique synergy decomposition, we have Corollary 7. Augmented Recursive Shapley of order k is the unique k th -order interaction index that satisfies linearity and acts on synergy functions as in Eq. (3.9). To show that Augmented Recursive Shapley satisfies null feature, let F not vary in some feature ¯x i and let i∈T. Then RS k∗ T (¯x,F)= X S∈Pn RS k∗ T (¯x,ϕ S (F)) =RS k∗ T (¯x,ϕ T (F))+ X T⊊ S,|S|>k RS k∗ T (¯x,ϕ S (F)) =RS k∗ T (¯x,0)+ X T⊊ S,|S|>k N k T |S| k ϕ S (F)(¯x) =0+ X T⊊ S,|S|>k 0 =0 Thus Augmented Recursive Shapley satisfies null feature. To show Augmented Recursive Shapley satisfies baseline test for interactions (k≤ n), let T ⊊ S,|S|≤ k, and Φ S ∈C S . Then RS k∗ T (¯x,Φ S )=0 by Eq.(3.9). 122 To show Augmented Recursive Shapley satisfies completeness, consider the synergy function Φ S . If|S|≤ k, Eq. (3.9) shows completeness. If|S|>k, then follow the proof of completeness for Recursive Shapley. To show Augmented Recursive Shapley satisfies symmetry, consider a synergy function Φ S ∈C S and permutation π . Note that for Φ S ∈C S , we have Φ S ◦ π − 1 is a pure interaction function in πS with baseline πx ′ . Then RS k∗ πT (π ¯x,πx ′ ,Φ S ◦ π − 1 )= N k πT |πS | k Φ S ◦ π − 1 (π ¯x) if πT =πS N k πT |πS | k Φ S ◦ π − 1 (π ¯x) if πT ⊊ πS, |πS |>k 0 else = N k T |S| k Φ S (¯x) if T ⊆ S N k T |S| k Φ S (¯x) if T ⊊ S,|S|>k 0 else =RS k∗ T (¯x,x ′ ,Φ S ) 3.2 Integrated Hessian and Augmented Integrated Hessian 3.2.1 Definition of Integrated Hessian Here we give a complete definition of IH and detail how IH distributes monomials. 
We also detail IH ∗ and show it satisfies Corollary 3. We then show both satisfy completeness, linearity, nullfeature, andsymmetry, andaugmentedIHsatisfiesbaselinetestforinteractions (k≤ n). Let σ k T be the set of sequences of length k such that the sequence is made of the elements of T ̸= ∅ and each element appears at least once. For example, σ 3 {1,2} = {(1,1,2),(1,2,1),(1,2,2),(2,1,1),(2,1,2), (2,2,1)}. For a given sequence s, define IG s (¯x,F) to be a recursive implementation of IG according to the sequence s, i.e., IG (1,2,3) (¯x,F)= IG 3 (¯x,IG 2 (·,IG 1 (·,F))). We can then define the k th -order Integrated Hessian for T ̸=∅ by: 123 IH k T (¯x,F)= X s∈σ k T IG s (¯x,F), (3.10) and for T =∅, we define IH k ∅ (¯x,x ′ ,F)=F(x ′ ). 3.2.2 IH Policy Distributing Monomials and Continuity Condition We now move to inspect this equation and establish some properties. First, IG is linear, establishing that IH is also linear by its form. Next, we establish its policy distributing monomials centred at x ′ . Eq. (4.1) states that for a monomial F(x)=[x− x ′ ] m , m̸=0, IG i (¯x,x ′ ,[x− x ′ ] m )= m i ∥m∥ 1 [¯x− x ′ ] m if i∈S m 0 if i / ∈S m (3.11) Then for a given sequence s∈σ k T and synergy function [x− x ′ ] m , T ⊆ S m , IG s (¯x,[x− x ′ ] m )=IG s k (¯x,IG s k− 1 (...IG s 1 (·,[x− x ′ ] m )....) =IG s k (¯x,IG s k− 1 (...IG s 2 (·, m s 1 [x− x ′ ] m ∥m∥ 1 )....) =IG s k (¯x,IG s k− 1 (...IG s 3 (·, m s 1 m s 2 [x− x ′ ] m ∥m∥ 2 1 )....) =... =IG s k (¯x, Π 1≤ i≤ k− 1 m s i [x− x ′ ] m ∥m∥ k− 1 1 ) = Π 1≤ i≤ k m s i ∥m∥ k 1 [¯x− x ′ ] m (3.12) However, if there exists any elements of s that is not in S m , then: IG s (¯x,x ′ ,[x− x ′ ] m )=0, (3.13) due to some s j / ∈S m in the sequence. Now, applying the definition of IH when m̸=0, we get: 124 IH k T (¯x,[x− x ′ ] m )= X s∈σ k T IG s (¯x,[x− x ′ ] m ) = P s∈σ k T Π 1≤ i≤ k ms i ∥m∥ k 1 [¯x− x ′ ] m if T ⊆ S m P s∈σ k T 0 if T ⊈S m = M k T (m) ∥m∥ k 1 [¯x− x ′ ] m if T ⊆ S m 0 if T ⊈S m , (3.14) wherewedefine M k T (m)= P ∥l∥ 1 =k s.t. S l =T k l [m] l ,with k l = k! [l]! themultinomialcoefficient. In the case T =S m =∅, we set M k T (m) ∥m∥ k 1 =1. Now let us turn to the question of the continuity of Taylor approximation for analytic functions. Let T l be the Taylor approximation of some F ∈C ω . Using Theorem 12, we have lim l→∞ IG i (¯x,T l )=IG i (¯x,F). This implies: IG i (¯x,F)= lim l→∞ IG i (¯x,T l ) = X m∈N n D m (F)(x ′ ) [m]! IG i (¯x,[x− x ′ ] m ) = X m∈N n D m (F)(x ′ ) [m]! m i ∥m∥ 1 [¯x− x ′ ] m (3.15) That is, the above sum is convergent for all ¯x∈ [a,b], implying that IG i (·,F)∈C ω . Also note: IG i (¯x,T l )= X m∈N n ,∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! m i ∥m∥ 1 [¯x− x ′ ] m (3.16) 125 This shows that IG(¯x,T l ) is a Taylor approximation of IG i (¯x,F). Thus, for F ∈C ω and a sequence s, we can pull the limit out consecutively since we are simply dealing with a series of Taylor approximations. IG s (¯x,F)=IG s k (¯x,IG s k− 1 (...IG s 1 (·,F)...)) =IG s k (¯x,IG s k− 1 (... lim l→∞ IG s 1 (·,T l )...)) =IG s k (¯x,IG s k− 1 (... lim l→∞ IG s 2 (·,IG s 1 (·,T l ))...)) = lim l→∞ IG s k (¯x,IG s k− 1 (...IG s 1 (·,T l )...)) = lim l→∞ IG s (¯x,T l ), (3.17) which establishes that IH k satisfies the continuity property. This implies the following corollary: Corollary 8. Integrated Hessian of order k is the unique k th -order method to satisfy linearity, the continuity condition, and distributes monomials as in Eq. (3.14). 
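As a concrete illustration of the nested-IG construction in Eq. (3.10), the following is a minimal numerical sketch of the k = 2 case. IG is approximated by a midpoint Riemann sum with central-difference partial derivatives, IG^s composes it along a sequence, and IH^2_T sums over σ^2_T. The step count, toy monomial, input, and baseline are illustrative assumptions, so the outputs carry a small discretization error.

```python
# Minimal numerical sketch of second-order Integrated Hessian as nested IG;
# the toy monomial and evaluation points are illustrative assumptions.
import numpy as np

def partial(F, x, i, h=1e-5):
    # Central-difference approximation of dF/dx_i at x.
    e = np.zeros_like(x); e[i] = h
    return (F(x + e) - F(x - e)) / (2 * h)

def ig(F, i, xbar, baseline, steps=200):
    # Midpoint Riemann-sum approximation of IG_i along the straight path.
    ts = (np.arange(steps) + 0.5) / steps
    grads = [partial(F, baseline + t * (xbar - baseline), i) for t in ts]
    return (xbar[i] - baseline[i]) * float(np.mean(grads))

def ig_seq(F, seq, xbar, baseline):
    # IG^s: apply the IG operator recursively along the sequence s.
    G = F
    for idx in seq:
        G = (lambda x, G=G, idx=idx: ig(G, idx, x, baseline))
    return G(xbar)

def integrated_hessian(F, T, xbar, baseline):
    # Second-order IH (Eq. (3.10)): sum IG^s over the sequences in sigma^2_T.
    T = sorted(T)
    seqs = [(T[0], T[0])] if len(T) == 1 else [(T[0], T[1]), (T[1], T[0])]
    return sum(ig_seq(F, s, xbar, baseline) for s in seqs)

# Toy check on the monomial F(x) = x0 * x1 with baseline 0.
F = lambda x: x[0] * x[1]
xbar, base = np.array([1.0, 2.0]), np.zeros(2)
print(integrated_hessian(F, {0, 1}, xbar, base))  # approx 1.0
print(integrated_hessian(F, {0}, xbar, base))     # approx 0.5
```

The printed values agree with the monomial policy in Eq. (3.14): for F(x) = x0 x1 we have ∥m∥_1 = 2, so IH^2_{0,1} = (2/4) x̄_0 x̄_1 and IH^2_{0} = (1/4) x̄_0 x̄_1.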
3.2.3 Establishing Further Properties of IH To show IH is complete, observe for a monomial F(x)=[x− x ′ ] m , m̸=0, X S∈P k ,|S|>0 IH k S (¯x,x ′ ,F)= X S⊆ Sm,|S|>0 M k T (m) ∥m∥ k 1 [¯x− x ′ ] m = X S⊆ Sm,|S|>0 P ∥l∥ 1 =k s.t. S l =S k l [m] l ∥m∥ k 1 [¯x− x ′ ] m = ∥m∥ k 1 ∥m∥ k 1 [¯x− x ′ ] m =[¯x− x ′ ] m When m = 0, we get IH k S (¯x,x ′ ,F) = 0 except when S = ∅, in which case we get IH k S (¯x,x ′ ,F)=1. Applying the Taylor decomposition of F and continuity property to a general F ∈C ω , we get: 126 X S∈P k ,|S|>0 IH k S (¯x,x ′ ,F)= X S∈P k ,|S|>0 lim l→∞ IH k S (¯x,x ′ ,T l ) = lim l→∞ X S∈P k ,|S|>0 X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! IH k S (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! X S∈P k ,|S|>0 IH k S (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! [¯x− x ′ ] m = lim l→∞ X m∈N n ,∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! [¯x− x ′ ] m − F(x ′ ) =F(¯x)− F(x ′ ) To show IH satisfies null feature, we proceed as in the proof for Recursive Shapley and suppose that F does not vary in ¯x i . Then for any S∈P k such that i∈S, ϕ S (F)=0 since the synergy function is an interaction satisfying null feature. Then if i∈T, IH k T (¯x,F)= X S∈P k IH k T (¯x,ϕ S (F)) = X S∈P k s.t. i∈S IH k T (¯x,ϕ S (F))+ X S∈P k s.t. i/ ∈S IH k T (¯x,ϕ S (F)) = X S∈P k s.t. i∈S IH k T (¯x,0)+ X S∈P k s.t. i/ ∈S 0 =0 (3.18) To show symmetry, let π be a permutation. Note that since (πx ) π (i) =x i , we also have (π − 1 x) i =(π − 1 x) π − 1 (π (i)) =x π (i) . Then, if F(x)=[x− x ′ ] m , we get 127 F · π − 1 (x)=[x π (1) − x ′ 1 ] m 1 ··· (x π (n) − x ′ n ) mn =[x 1 − x ′ π − 1 (1) ] m π − 1 (1) ··· (x n − x ′ π − 1 (n) ) m π − 1 (n) =[x− πx ′ ] πm Also note that, S πm ={i:(πm ) i >0} ={i:m π − 1 (i) >0} ={π (i):m π − 1 (π (i)) >0} ={π (i):m i >0} ={π (i):i∈S m } =πS m Then, IH k πT (π ¯x,πx ′ ,F ◦ π − 1 )= M k πT (πm ) ∥πm ∥ k 1 [π ¯x− πx ′ ] πm if πT ⊆ S πm 0 if πT ⊈S πm = M k T (m) ∥m∥ k 1 [¯x− x ′ ] m if T ⊆ S 0 if T ⊈S =IH k T (¯x,x ′ ,F) 128 Now,ifwetakeπ ∈C ω anddenoteπ − 1 j tobethej th outputofπ − 1 ,then ∂π − 1 j ∂x i = 1 j=π − 1 (i) . Then we have ∂(F ◦ π − 1 ) ∂x i (x)= n X j=1 ∂F ∂x j (π − 1 (x)) ∂π − 1 j ∂x i (x) = ∂F ∂x π − 1 (i) (π − 1 (x)), which yields D πm (F ◦ π − 1 )(πx ′ )= ∂ ∥πm ∥ 1 (F ◦ π − 1 ) ∂x (πm ) 1 1 ··· ∂x (πm )n n (πx ′ ) = ∂ ∥πm ∥ 1 F ∂x m π − 1 (1) π − 1 (1) ··· ∂x m π − 1 (n) π − 1 (n) (π − 1 πx ′ ) = ∂ ∥m∥ 1 F ∂x m 1 1 ··· ∂x mn n (x ′ ) =D m F(x ′ ) From the above we have for general F, IH k πS (π ¯x,πx ′ ,F ◦ π − 1 )= lim l→∞ IH k πS (π ¯x,πx ′ , X m∈N n ,0<∥m∥ 1 ≤ l D m (F ◦ π − 1 )(πx ′ ) [m]! [x− πx ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F ◦ π − 1 )(πx ′ ) [m]! IH k πS (π ¯x,πx ′ ,[x− πx ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D πm (F ◦ π − 1 )(πx ′ ) [πm ]! IH k πS (π ¯x,πx ′ ,[x− πx ′ ] πm ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! IH k S (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ IH S (¯x,x ′ ,T l ) =IH S (¯x,x ′ ,F) 129 3.2.4 Augmented Integrated Hessian and its Properties The synergy function ϕ is taken implicitly with respect to a baseline appropriate to F. To make the baseline choice explicit, we write ϕ (F)=ϕ (x ′ ,F). Augmented Integrated Hessian is then defined as: IH k∗ T (¯x,x ′ ,F)=ϕ T (x ′ ,F)(¯x)+IH k T (¯x,x ′ ,F − X S∈P k ϕ S (x ′ ,F)) (3.19) As in Augmented Recursive Shapley, Augmented Integrated Hessian explicitly distributes ϕ T (F) to group T when|T|≤ k, and distributes ϕ T (F) as IH when|T|>k. To establish the monomial distribution policy we inspect the action of IH k∗ T in different cases. 
Plugging in [x− x ′ ] m to the above, if|S m |≤ k, the right term is zero and Eq. (4.4) holds, while if|S m |>k, the left term is zero and the right term is IH k T (¯x,[x− x ′ ] m ). It is also easy to see that the above is linear. Regarding the continuity condition, observe that: ϕ S (F)= X m∈N n ,Sm=S D m (F)(x ′ ) [m]! [¯x− x ′ ] m = lim l→∞ X m∈N n ,∥m∥ 1 ≤ l,Sm=S D m (F)(x ′ ) [m]! [¯x− x ′ ] m = lim l→∞ ϕ S (T l ), which gains, lim l→∞ IH k∗ S (¯x,T l )= lim l→∞ ϕ S (T l )(¯x)+IH k S (¯x,T l − X R∈P k ϕ R (T l )) =ϕ S (F)(¯x)+IH k S (¯x, lim l→∞ T l − X S∈P k ϕ R (T l )) =IH k S (¯x,F − X R∈P k ϕ R (F)) =IH k∗ S (¯x,F), 130 which establishes Corollary 3. Toshowcompleteness,considerthedecompositionF = P S∈Pn ϕ S (F). NowIH k∗ satisfies completeness for the subset of functions Φ S ∈C S ,|S|≤ k from the completeness of ϕ and Eq. (3.19). Also, IH k∗ satisfies completeness for the subset of functions Φ S ∈ C S ,|S| > k because IH k satisfies completeness. From this we have: X T∈P k ,|T|̸=0 IH k∗ T (¯x,x ′ ,F)= X T∈P k ,|T|̸=0 IH k∗ T (¯x,x ′ , X S∈Pn ϕ S (F)) = X S∈Pn X T∈P k ,|T|̸=0 IH k∗ T (¯x,x ′ ,ϕ S (F)) = X S∈Pn,|S|̸=0 [ϕ S (F)(¯x)− ϕ S (F)(x ′ )] = X S∈Pn,|S|̸=0 [ϕ S (F)(¯x)]+F(x ′ )− F(x ′ ) = X S∈Pn [ϕ S (F)(¯x)]− F(x ′ ) =F(¯x)− F(x ′ ) Baseline test for interactions applies immediately from the definition of Augmented Integrated Hessian in Eq. (3.19). Concerning null feature, suppose F does not vary in some ¯x i and i∈T. First, we have ϕ T (F)=0. Also, F− P R∈P k ϕ R (F) does not vary in x i either, so, since IH k satisfies null feature. Thus we have IH k∗ (¯x,F)=0 by Eq. (3.19). Lastly, concerning symmetry, let π be a permutation. Note that ϕ is symmetric, as it is the k =n case for Shapley-Taylor, which is symmetric. Then, 131 IH k∗ πT (π ¯x,πx ′ ,F ◦ π − 1 )=ϕ πT (πx ′ ,F ◦ π − 1 )(π ¯x) +IH k πT (π ¯x,πx ′ ,F ◦ π − 1 − X R∈P k ϕ πR (πx ′ ,F ◦ π − 1 )) =ϕ T (x ′ ,F)(¯x)+IH k T (π ¯x,πx ′ ,ϕ πR (πx ′ , X R⊂ N,|R|>k F ◦ π − 1 )) =ϕ T (x ′ ,F)(¯x)+ X R⊂ N,|R|>k IH k T (π ¯x,πx ′ ,ϕ πR (πx ′ ,F ◦ π − 1 )) =ϕ T (x ′ ,F)(¯x)+ X R⊂ N,|R|>k IH k T (¯x,x ′ ,ϕ R (x ′ ,F)) =ϕ T (x ′ ,F)(¯x)+IH k T (¯x,x ′ , X R⊂ N,|R|>k ϕ R (x ′ ,F)) =IH k∗ T (¯x,x ′ ,F) 3.3 Sum of Powers 3.3.1 Defining Sum of Powers To define Sum of Powers, we first turn to defining a slight alteration of the Shapley-Taylor method. Suppose we performed Shapley-Taylor on a function F, but we treated F as a function of every variable except for x i , which we held at the input value. Specifically, for a given index i and coalition S with i∈S, we perform the (|S|− 1) th -order Shapley-Taylor method for the coalition S\{i}. We perform this on an alteration of F, so that F is a function of n− 1 variables because the x i value is fixed at ¯x i . We denote this function ST − i S , which has formula: ST − i S (¯x,x ′ ,F)= |S|− 1 n− 1 X T⊆ N\S δ S\{i}|T∪{i} F(¯x) n− 2 |T| (3.20) With this, we define Sum of Powers for k≥ 2 as: SP k S (¯x,x ′ ,F)= P i∈S ST − i S (¯x,x ′ ,IG i (·,x ′ ,F)) if|S|=k ϕ S (F) if|S|<k (3.21) 132 WedefinetheSumofPowersfor k =1astheIG,withtheadditionthatSP 1 ∅ (¯x,x ′ ,F)=F(x ′ ). Similar to the alteration of the Shapley-Taylor, we can alter the Shapley method, giving us: Shap − i j (¯x,x ′ ,F)= X S⊂ N\{i,j} |S|!(n−| S|− 2)! (n− 1)! 
F(¯x S∪{i,j} )− F(¯x S∪{i} ) (3.22) For the Sum of Powers k =2 case, the altered Shapley-Taylor is a 1 st -order Shapley-Taylor method, and conforms to the Shapley method: SP 2 i,j (¯x,x ′ ,F)= Shap − i j (¯x,x ′ ,IG i (·,x ′ ,F))+Shap − j i (¯x,x ′ ,IG j (·,x ′ ,F)) if|S|=2 ϕ S (F) if|S|≤ 1 (3.23) 3.3.2 Proof of Corollary 4 For the k = 1 case, Sum of Powers is the IG, which satisfies linearity, distributes as in Eq. 4.5, and satisfies the continuity condition. We now assume k≥ 2 for the rest of the section. First, SP k S satisfies linearity because IG is linear in F and ST − i S is linear in F. We now proceed by cases to establish how SP k distributes monomials. We consider first the action of ST − i S on F(x)=[x− x ′ ] m . ST − i S acts as the (|S|− 1) st -order Shapley-Taylor on an augmented function F − i (x 1 ,...,x i− 1 ,x i+1 ,...,x i ):=[x i − x ′ i ] m i Π j̸=i [x j − x ′ j ] m j . Now, Π j̸=i [x j − x ′ j ] m j is a synergy function of S m \{i}. Thus we can use the distribution rule of Shapley-Taylor, gaining 133 ST − i S (¯x,x ′ ,F)=ST |S|− 1 S\{i} (¯x − i ,x ′ − i ,F − i ) = [¯x i − x ′ i ] m if S =S m [¯x i − x ′ i ] m ( |S|− 1 k− 1 ) if S⊊ S m ,|S|=k 0 else , (3.24) where the notations ¯x − i , x ′ {− i} denote the vectors ¯x, x ′ with the i th component removed. With this established, we now show the action of the Sum of Powers method for an exhaustive set of cases: 1. (|S|<k, S =S m ): SP k S (¯x,[x− x ′ ] m )=ϕ S ([x− x ′ ] m )=[x− x ′ ] m . 2. (|S|<k, S̸=S m ): SP k S (¯x,[x− x ′ ] m )=ϕ S ([x− x ′ ] m )=0. 3. (|S|=k, S⊆ S m ): SP k S (¯x,x ′ ,[x− x ′ ] m )= X i∈S ST − i S (¯x,x ′ ,IG i (·,x ′ ,[x− x ′ ] m ) = X i∈S ST − i S (¯x,x ′ , m i ∥m∥ 1 [x− x ′ ] m ) = X i∈S 1 |Sm|− 1 |S|− 1 m i ∥m∥ 1 [¯x− x ′ ] m = 1 |Sm|− 1 |S|− 1 P i∈S m i ∥m∥ 1 [¯x− x ′ ] m 4. (|S| = k, S ⊈ S m ): Let i∈ S. If i∈ S\S m , then ST − i S (¯x,x ′ ,IG i (·,x ′ ,[x− x ′ ] m )) = ST − i S (¯x,x ′ ,0))=0. If, on the other hand, i ∈ S m , then ST − i S (¯x,x ′ ,IG i (·,x ′ ,[x − x ′ ] m )) = ST − i S (¯x,x ′ , m i ∥m∥ 1 [x− x ′ ] m ). Now, the altered Shapley-Taylor takes the value of zero for synergy functions of sets that are not super-sets of the attributed group, S\{i}. Also, [x− x ′ ] m is a synergy function of S m , and S m is not a super-set of S\{i}. Thus ST − i S (¯x,x ′ , m i ∥m∥ 1 [x− x ′ ] m )=0. 134 This established that each term in the sum P i∈S ST − i S (¯x,x ′ ,IG i (·,x ′ ,[x− x ′ ] m )) is zero, gaining SP k S (¯x,x ′ ,[x− x ′ ] m =0. Thus Sum of Powers has a distribution scheme that agrees with Eq. (4.5). To restate: SP k T (¯x,[x− x] m )= [¯x− x ′ ] m if T =S m 1 ( |Sm|− 1 k− 1 ) P i∈T m i ∥m∥ 1 [¯x− x ′ ] m if T ⊊ S m ,|T|=k 0 else (3.25) Finally, IG satisfies the continuity condition by Theorem12, and it is easy to see that that ST − 1 S satisfies the continuity condition. Thus Sum of Powers obeys the continuity condition. 3.3.3 Establishing Further Properties for Sum of Powers To establish null feature, let F not vary in x i and let i∈ S. Sum of Powers satisfies the continuity condition, so SP k S (¯x,x ′ ,F)= lim l→∞ X m∈N n ,∥m∥ 1 ≤ l D m F(x ′ ) [m]! SP k S (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ X m∈N n ,∥m∥ 1 ≤ l,m i =0 D m F(x ′ ) [m]! SP k S (¯x,x ′ ,[x− x ′ ] m ) =0, where the second line is because D m F(x ′ )=0 if m i >0 because F does not vary in x i , and the third line is because SP k S (¯x,x ′ ,[x− x ′ ] m )=0 if m i =0. To establish baseline test for interaction (k≤ n), let Φ S ∈C ω be a synergy function of S and let T ⊊ S,|T|<k. Then SP k T (¯x,Φ S )=ϕ T (Φ S )(¯x)=0. 
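Before establishing the remaining properties, the k = 2 construction in Eq. (3.23) can be checked numerically with a short sketch that combines a Riemann-sum IG with the altered Shapley value of Eq. (3.22). The toy monomial, evaluation points, and step counts are illustrative assumptions rather than the experimental setup used later.

```python
# Minimal numerical sketch of second-order Sum of Powers (Eq. (3.23));
# the toy monomial and points are illustrative assumptions.
import numpy as np
from itertools import combinations
from math import factorial

def masked(x, baseline, S):
    return np.array([x[k] if k in S else baseline[k] for k in range(len(x))])

def ig(F, i, xbar, baseline, steps=200, h=1e-5):
    # Midpoint Riemann-sum IG_i with a central-difference partial derivative.
    ts = (np.arange(steps) + 0.5) / steps
    e = np.zeros(len(xbar)); e[i] = h
    vals = [(F(baseline + t * (xbar - baseline) + e)
             - F(baseline + t * (xbar - baseline) - e)) / (2 * h) for t in ts]
    return (xbar[i] - baseline[i]) * float(np.mean(vals))

def shap_minus(G, j, i, xbar, baseline):
    # Shap^{-i}_j of Eq. (3.22): Shapley value of j with feature i held at xbar_i.
    n = len(xbar)
    rest = [k for k in range(n) if k not in (i, j)]
    val = 0.0
    for r in range(len(rest) + 1):
        for S in combinations(rest, r):
            w = factorial(r) * factorial(n - r - 2) / factorial(n - 1)
            val += w * (G(masked(xbar, baseline, set(S) | {i, j}))
                        - G(masked(xbar, baseline, set(S) | {i})))
    return val

def sum_of_powers_2(F, i, j, xbar, baseline):
    # SP^2_{i,j} of Eq. (3.23): altered Shapley applied to IG components.
    Gi = lambda x: ig(F, i, x, baseline)
    Gj = lambda x: ig(F, j, x, baseline)
    return shap_minus(Gi, j, i, xbar, baseline) + shap_minus(Gj, i, j, xbar, baseline)

F = lambda x: x[0] * x[1] ** 2           # monomial with m = (1, 2, 0)
xbar, base = np.array([1.0, 2.0, 0.5]), np.zeros(3)
print(sum_of_powers_2(F, 0, 1, xbar, base))  # approx 4.0 = xbar0 * xbar1**2
```

The printed value is approximately x̄_0 x̄_1^2 = 4, which matches the T = S_m case of the distribution scheme in Eq. (3.25).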
To establish completeness, consider F(x)=[x− x ′ ] m , with|S m |>k. Then, 135 X S∈P k ,|S|>0 SP k S (¯x,x ′ ,F)= X S⊊ Sm,|S|=k SP k S (¯x,x ′ ,[x− x ′ ] m ) = X S⊊ Sm,|S|=k 1 |Sm|− 1 k− 1 P i∈S m i ∥m∥ 1 [¯x− x ′ ] m = [¯x− x ′ ] m |Sm|− 1 k− 1 ∥m∥ 1 X S⊊ Sm,|S|=k X i∈S m i = [¯x− x ′ ] m |Sm|− 1 k− 1 ∥m∥ 1 |S m |− 1 k− 1 ∥m∥ 1 =F(¯x)− F(x ′ ) Now treating a general F ∈C ω , the proof is identical to the proof for Integrated Hessian, X S∈P k ,|S|>0 SP k T (¯x,x ′ ,F)= X S∈P k ,|S|>0 lim l→∞ SP k T (¯x,x ′ ,T l ) = lim l→∞ X S∈P k ,|S|>0 X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! SP k T (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! X S∈P k ,|S|>0 SP k T (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! [¯x− x ′ ] m = lim l→∞ X m∈N n ,∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! [¯x− x ′ ] m − F(x ′ ) =F(¯x)− F(x ′ ) To show symmetry, the proof parallels the proof for Integrated Hessian in section 3.2.3. Letπ beapermutation. IfweletF(x)=[x− x ′ ] m andfollowwhatwaspreviouslyestablished in section 3.2.3, then 136 SP k πT (π ¯x,πx ′ ,F ◦ π − 1 )= [π ¯x− πx ′ ] πm if πT =S πm 1 ( |Sπm |− 1 k− 1 ) P i∈πT (πm ) i ∥πm ∥ 1 [π ¯x− πx ′ ] πm if πT ⊊ S πm ,|πT |=k 0 else = [¯x− x ′ ] m if T =S m 1 ( |Sm|− 1 k− 1 ) P i∈T m i ∥m∥ 1 [¯x− x ′ ] m if T ⊊ S m ,|T|=k 0 else =SP k T (¯x,x ′ ,F) From the above we have for general F, SP k πS (π ¯x,πx ′ ,F ◦ π − 1 )= lim l→∞ SP k πS (π ¯x,πx ′ , X m∈N n ,0<∥m∥ 1 ≤ l D m (F ◦ π − 1 )(πx ′ ) [m]! [x− πx ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F ◦ π − 1 )(πx ′ ) [m]! SP k πS (π ¯x,πx ′ ,[x− πx ′ ] m ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D πm (F ◦ π − 1 )(πx ′ ) [πm ]! SP k πS (π ¯x,πx ′ ,[x− πx ′ ] πm ) = lim l→∞ X m∈N n ,0<∥m∥ 1 ≤ l D m (F)(x ′ ) [m]! SP k S (¯x,x ′ ,[x− x ′ ] m ) = lim l→∞ SP(¯x,x ′ ,T l ) =SP(¯x,x ′ ,F) 4 Experimental Details and Additional Results All experiments are conducted on a device with a 6-core Intel Core i7-8700. 137 4.1 Model Description and Experimental Details 4.1.1 2-Layer Perceptron We use a 2-layer perceptron with 64 neurons in the first layer and 32 neurons in the second layer. For activation, we use SoftPlus SoftPlus(x)= 1 β log(1+exp(βx )) with β = 5 after each layer. We optimize using the Adam algorithm with the default hyper-parameters Kingma and Ba [2014] and the learning rate of 0.1054. We train the model for 1000 epochs with the whole training data, and the network achieves a test Mean-Absolute-Error (MAE) of 3.10 and a test Root-Mean-Squared-Error (MRSE) of 4.14. Hyperparametertuning: Thenumberofneuronsineachlayerincludesvalues8,16,32,64, and 128 such that the size of the first hidden layer should be larger than or equal to the size of the second layer. For each dimension of the neural network, we swept through a range of stepsizes and values of β to find the (approximately) optimal stepsize and β . The stepsize grid consists of 5 evenly spaced points between e − 6 and e − 1 . The β parameter of the SoftPlus activation includes values of 1 and 5. 4.1.2 Second-Degree Polynomial Regression We use the LinearRegression function from scikit-learn Pedregosa et al. [2011] with default values to train the polynomial regression model. 4.2 Description of the Dataset The Physicochemical Properties of Protein Tertiary Structure data is avail- able at https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+ of+Protein+Tertiary+Structure. After preprocessing, there were a total of 9 input features from this dataset and it contained around 45,730 entries in total. 
The regression task is to predict the size of the residue. The list of features:
1. Total surface area (mean: 9871.60 ± standard deviation: 4058.14)
2. Non polar exposed area (3017.37 ± 1464.32)
3. Fractional area of exposed non polar residue (0.30 ± 0.06)
4. Fractional area of exposed non polar part of residue (103.49 ± 55.42)
5. Molecular mass weighted exposed area (1.37e+06 ± 5.64e+05)
6. Average deviation from standard exposed area of residue (145.64 ± 70.00)
7. Euclidian distance (3989.76 ± 1993.57)
8. Secondary structure penalty (69.98 ± 56.49)
9. Spacial Distribution constraints (N, K Value) (34.52 ± 5.98)
Preprocessing: We standardize the numerical data to have mean zero and unit variance. We utilize a 70/15/15 train/validation/test split for the data.
4.3 More Details on Generating Attribution and Interaction Values
To generate the attributions using Integrated Gradients and to compute the interactions using Integrated Hessian and Sum of Powers, we use 200 samples from the dataset. We use numerical integration with 500 samples to approximate the integral in Integrated Gradients and Integrated Hessian.
4.4 Standard Deviation of the Interaction Values
Figure E.1 shows the standard deviation of the interaction values from Integrated Hessian and Sum of Powers. We notice that the standard deviations of feature 1 and feature 6 are much higher in Sum of Powers than in Integrated Hessian. Furthermore, we see that small mean interaction values (see Figure 3.1 and Figure 3.2) do not imply low interaction between features, as they can have large standard deviation values (e.g., feature 1 and feature 4).
Figure E.1: Standard deviation of interaction values. Left: Integrated Hessian. Right: Sum of Powers.
4.5 Attribution Values
The attribution values of each feature based on Integrated Gradients are displayed in Figure E.2. The features are ordered by their importance in predicting the target. The attribution values indicate the direction and magnitude of the feature's influence on the size of the residue (positive values imply an increase, negative values imply a decrease). The positive trend observed for total surface area suggests that a larger total surface area is associated with a larger residue, which is consistent with intuition.
Figure E.2: Attributions by Integrated Gradients.
Appendix F
Supplementary Materials on Four Characterizations of IG
1 Symmetry-Preserving Alone is Insufficient to Characterize IG Among Path Methods
Here we provide a counterexample to the claim that IG is the unique path method that satisfies symmetry-preserving. We also give an axiom that is stronger than symmetry-preserving, and show that this axiom is likewise insufficient to characterize IG.
1.1 Another Path Method that Satisfies Symmetry-Preserving
Here we show that the symmetry-preserving axiom, and a particular strengthening of it, is not enough, by itself, to characterize IG among path methods. We also give proofs of the various results on characterizing IG. Let D = [0,1]^2 and define a path function γ(x̄, x′, t) element-wise as follows:
γ_i(x̄, x′, t) = x′_i + (x̄_i − x′_i) t^{(x̄_i − x′_i)^2}
Note that γ is monotonic. When x̄_1 = x̄_2 and x′_1 = x′_2, then x̄_1 − x′_1 = x̄_2 − x′_2, and γ_1(x̄, x′, t) = γ_2(x̄, x′, t). In this case γ traverses the straight-line path (only the time parameterization differs), and A^γ acts as IG. Thus A^γ satisfies symmetry-preserving. However, when x̄_1 − x′_1 ≠ x̄_2 − x′_2, the path γ differs from the straight-line path, causing A^γ ≠ IG.
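A short numerical sketch makes the comparison explicit. It approximates a generic path attribution A^γ_i by a midpoint Riemann sum with finite-difference derivatives (mirroring the numerical integration used for IG in the experiments above), and instantiates both the straight-line path and the power path written above. The toy function, evaluation points, and step counts are illustrative assumptions.

```python
# Minimal sketch of a generic path-method attribution A^gamma_i; the toy
# function, points, and step counts are illustrative assumptions.
import numpy as np

def path_attribution(F, i, xbar, baseline, gamma, steps=500, h=1e-5):
    # A^gamma_i = integral over t of dF/dx_i(gamma(t)) * dgamma_i/dt,
    # approximated by a midpoint Riemann sum with finite differences.
    ts = (np.arange(steps) + 0.5) / steps
    e = np.zeros(len(xbar)); e[i] = h
    total = 0.0
    for t in ts:
        p = gamma(t)
        dF = (F(p + e) - F(p - e)) / (2 * h)
        dgamma = (gamma(t + 1e-6)[i] - gamma(t - 1e-6)[i]) / 2e-6
        total += dF * dgamma
    return total / steps

def straight_path(xbar, baseline):
    # The IG path.
    return lambda t: baseline + t * (xbar - baseline)

def power_path(xbar, baseline):
    # The monotone path of this subsection:
    # gamma_i(t) = x'_i + (xbar_i - x'_i) * t ** ((xbar_i - x'_i) ** 2).
    return lambda t: baseline + (xbar - baseline) * t ** ((xbar - baseline) ** 2)

F = lambda x: x[0] * x[1]                      # symmetric in x0 and x1
xbar, base = np.array([1.0, 0.5]), np.zeros(2)
for make_path in (straight_path, power_path):
    g = make_path(xbar, base)
    print([round(float(path_attribution(F, i, xbar, base, g)), 4) for i in (0, 1)])
# Straight path: approximately (0.25, 0.25), i.e. IG.
# Power path: approximately (0.4, 0.1); both sum to F(xbar) - F(baseline) = 0.5.
```

On a symmetric function with unequal differences x̄_i − x′_i, the two paths give different per-feature attributions while both satisfy completeness, which is exactly why symmetry-preserving alone cannot single out IG among path methods.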
141 1.2 Strong Symmetry-Preserving Here we present an attempt to strengthen symmetry and show that multiple path methods satisfy the strengthened axiom. The axiom, Strong Symmetry-Preserving, extends the symmetry-preserving axiom to cases when ¯x i ̸= ¯x j , x ′ i ̸=x ′ j : 10. (Strong Symmetry-Preserving): For a vector x and indices 1≤ i, j≤ n, let x ∗ denote thevectorxbutwiththei th andj th componentsswapped. SupposeF issymmetricini and j, meaning that F(x)=F(x ∗ ) for all x∈[a,b]. Then A i (¯x,x ′ ,F)=A j (¯x ∗ ,x ′∗ ,F). Here we provide a counterexample to the claim that IG is the unique path method that satisfies strong symmetry-preserving. LetD =[0,2] 2 ,andletF besymmetricincomponents1and2. Wedefineapathfunction γ (t) that is equivalent to the IG path except when ¯x=(2,1), x ′ =(1,0) or ¯y = ¯x ∗ =(1,2), y ′ =x ′∗ =(0,1). For ¯x, x ′ , let γ (t) be the path that travels in straight lines along the course: (1,0)→ (2,0)→ (2,1). Now for baseline y ′ = (0,1) and input ¯y = (1,2), let γ (t) be the path that travels in straight lines along the course: (0,1)→(0,2)→(1,2). We then have A γ 1 (¯x,x ′ ,F)=F(2,0)− F(1,0)=F(0,2)− F(0,1)=A γ 2 (¯x ∗ ,x ′∗ ,F), and likewise A γ 2 (¯x,x ′ ,F) = A γ 1 (¯x ∗ ,x ′∗ ,F). Thus we have another strong symmetry-preserving path method that is not the IG path. 2 Proof of Theorem 8 Proof. We present an adjusted version of the proof found in [Sundararajan et al., 2017, Theorem 1]. Suppose that A is a monotone path method that satisfies symmetry-preserving andaffinescaleinvariance. Thenforany i,A i =A γ i = R 1 0 ∂F ∂x i (γ (t)) dγ dt (t)dtforsomemonotone path function γ . Let 1 and 0 denote the vectors of all ones and all zeros, respectively. We proceed by contradiction and suppose that γ (¯x,x ′ ,t)̸= (¯x− x ′ )t+x ′ when ¯x = 1, x ′ = 0. Particularly, wesupposethat γ (1,0,t)̸=1× t(usingthen-dimensionalonesvector). WLOG, suppose that there exists a t such that γ 1 (1,0,t)>γ 2 (1,0,t). Let (t a ,t b ) be the maximal open set such that if t∈(t a ,t b ) then γ 1 (1,0,t)>γ 2 (1,0,t). 142 We now move to define a ( ¯x,x ′ ,F)∈ D IG where F is symmetric in x 1 and x 2 , but A does not give equal attributions to ¯x 1 and ¯x 2 . Let ¯x=1, and x ′ =0, and define F ∈F 2 by F(x)=ReLU − ReLU(x 1 x 2 − t 2 a )+t 2 b − t 2 a Now, F can be written in a case-format as follows: F(x)= t 2 b − t 2 a if x 1 x 2 ≤ t 2 a t 2 b − x 1 x 2 if t 2 a ≤ x 1 x 2 ≤ t 2 b 0 if x 1 x 2 ≥ t 2 b (2.1) It is easy to verify that (¯x,x ′ ,F)∈D IG . Calculating A γ i (1,0,F), and using the short hand γ (t)=γ (1,0,t), we gain: A γ 1 (1,0,F)= Z 1 0 ∂F ∂x 1 (γ (t)) dγ dt (t)dt = Z t b ta ∂(t 2 b − x 1 x 2 ) ∂x 1 (γ (t))× 1dt = Z t b ta − γ 2 (t)dt (2.2) and A γ 2 (1,0,F)= Z t b ta − γ 1 (t)dt (2.3) By assumption, γ 1 (t)>γ 2 (t) for t∈(a,b), yielding: A γ 1 (1,0,F)= Z t b ta − γ 2 (t)dt > Z t b ta − γ 1 (t)dt =A γ 2 (1,0,F) (2.4) 143 This is a contradiction. Thus, there is no t where γ 1 (1,0,t) > g 2 (1,0,t), and more generally, there is not t were γ i (1,0,t)>g j (1,0,t) for any i, j. So, γ i (1,0,t)=g j (1,0,t) for any pair, and we have the IG path. Thus γ (1,0,t)=1× t. Now, consider any (¯x,x ′ ,F)∈A 2 (D IG ), and use the shorthand γ (t)=γ (1,0,t). Let T be the affine mapping such that T(1) = ¯x, T(x ′ ) = 0. 
We employ the assumption that A satisfies affine scale invariance to gain: A i (¯x,x ′ ,F)=A i (T(1),T(0),F) =A i (¯x,x ′ ,T(F)) = Z 1 0 ∂T(F) ∂x i (γ (t)) dγ i dt dt = Z 1 0 ∂(F ◦ T) ∂x i (1× t)dt = Z 1 0 ∂F ∂x i (T(1× t)) ∂(T) i ∂x i (1× t) = Z 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))(¯x i − x ′ i )dt =(¯x i − x ′ i ) Z 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))dt =IG i (¯x,x ′ ,F) (2.5) 3 Proof of Theorem 9 We set out to establish the results forA 1 , then move to establish the results forA 2 . 3.1 Proof for A∈A 1 Proof. Here we present a proof along the lines of that found in Sundararajan and Najmi [2020]. Suppose A∈A 1 . 144 (ii. ⇒ i) IG satisfies linearity and completeness and proportionality because it is a path method. Suppose F is non-decreasing from x ′ to ¯x, then (¯x i − x ′ i ) and ∂F ∂x i (x ′ (t(¯x− x ′ )) do not have opposite signs, so IG i (¯x,x ′ ,F)=(¯x i − x ′ i ) Z 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))dt≥ 0, which shows IG satisfies NDP. Finally, let F(x) = G( P j x j ) and x ′ = 0. Then ∂F ∂x i (¯x) = G ′ ( P j ¯x j ), and IG i (¯x,x ′ ,F)= ¯x i Z 1 0 G ′ ( X j ¯x j )dt Note that the integral is equivalent for any i, so we take c = R 1 0 G ′ ( P j x j )dt to gain IG i (¯x,x ′ ,F)=cx i . Thus IG satisfies proportionality, and ii. ⇒ i. (i. ⇒ ii.) Now suppose that A satisfies linearity, ASI, completeness, NDP, and propor- tionality. Let A 0 denote the set of all BAMs such that 1) they are defined on analytic, non-decreasing functions, 2) they are only defined for x ′ = 0, ¯x ≥ 0, 3) the BAMs give non-negative attributions and 4) the BAMs satisfy completeness. By [Friedman and Moulin, 1999, Theorem 3], the only BAM inA 0 to satisfy proportionality and ASI is the Integrated Gradients method. Note that if x ′ =0, ¯x≥ 0, F non-decreasing, then A(1,0,F)≥ 0 by NDP.A also satisfies completeness by assumption. Thus if we let A ′ denote A with the requisite restriction of domains, then A ′ ∈ A 0 . Because A ′ satisfies ASI and proportionality, A ′ = IG on this restricted domain. Let x ′ =0, and ¯x=1, the vector of all ones. For any F ∈F 1 , F is Lipschitz on bounded domain, and there exists c∈R n such that c≥ 0, F(x)+c ⊺ x is non-decreasing. Thus A(1,0,F(x))=A(1,0,F(x)+c ⊺ x− c ⊺ x) =A(1,0,F(x)+c ⊺ x)− A(1,0,c ⊺ x) =IG(1,0,F(x)+c ⊺ x)− IG(1,0,c ⊺ x) =IG(1,0,F(x)) 145 We can then harness ASI as in Eq. 2.5 to get that A(¯x,x ′ ,F)=IG(¯x,x ′ ,F) for any ¯x, x ′ . 3.2 Proof for A∈A 2 (D IG ) Proof. Let A∈A 2 (D IG ). It is easy to show ii. ⇒ i.. We turn to show i. ⇒ ii.. Suppose A satisfies linearity, ASI, completeness, NDP, and proportionality. Let (¯x,x ′ ,F) ∈ D IG , and choose a component i. By methods found in the proof of [Lund- strom et al., 2022a, Theorem 2], there exists a sequence of functions F m such that: • F m is analytic for all m. • ∂Fm ∂x i ≤ ∂F ∂x i where ∂F ∂x i exists. • lim m→∞ ∂Fm ∂x i = ∂F ∂x i where ∂F ∂x i exists. • | ∂Fm ∂x i |≤ k for all m. • F − F m is non-decreasing from x ′ to ¯x in i. F − F m is Lipshitz because F, F m are Lipshitz. Thus, for each m, there exists c∈R n such that c i =0 and F(x)− F λ (x)+c ⊺ x is non-decreasing from x ′ to ¯x. Since c ⊺ x∈F 1 , we apply previous results to gain A i (¯x,x ′ ,c ⊺ x)=IG i (¯x,x ′ ,c ⊺ x)=0. Thus, A i (¯x,x ′ ,F(x))− A i (¯x,x ′ ,F m (x))=A i (¯x,x ′ ,F(x))− A i (¯x,x ′ ,F m (x))+A i (¯x,x ′ ,c ⊺ x) =A i (¯x,x ′ ,F(x)− F m (x)+c ⊺ x) ≥ 0 Thus we have A i (¯x,x ′ ,F)≥ A i (¯x,x ′ ,F m ). Now, because (¯x,x ′ ,F) ∈ D IG , R 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))dt exists and ∂F ∂x i exists almost everythere on the path x ′ +t(¯x− x ′ ). 
Employing DCT, we have: 146 A i (¯x,x ′ ,F)≥ lim m→∞ A i (¯x,x ′ ,F m ) = lim m→∞ IG i (¯x,x ′ ,F m ) = lim m→∞ Z 1 0 (¯x i − x ′ i ) ∂F m ∂x i (γ (t))dt = Z 1 0 (¯x i − x ′ i ) ∂F ∂x i (γ (t))dt =IG i (¯x,x ′ ,F) We may also gain the reverse, A i (¯x,x ′ ,F)≤ IG i (¯x,x ′ ,F), using a similar method. Thus A i (¯x,x ′ ,F)=IG i (¯x,x ′ ,F), concluding the proof. 4 Proof of Theorem 10 Proof. ii. ⇒i.) LetA∈A 1 betheIGmethod,andlet(¯x,x ′ ,F),(¯x,x ′ ,G)∈[a,b]× [a,b]×F 1 . If ¯x i =x ′ i , then it is easy to confirm that IG(¯x,x ′ ,F)=0. Suppose ¯x i ̸=x ′ i , ¯x j ̸=x ′ j . Then, supposing ∂F ∂x i ≤ ∂F ∂x j , we have: IG i (¯x,x ′ ,F) ¯x i − x ′ i = Z 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))dt ≤ Z 1 0 ∂F ∂x j (x ′ +t(¯x− x ′ ))dt = IG j (¯x,x ′ ,F) ¯x j − x ′ j and IG satisfies symmetric monotonicity. i. ⇒ ii.) The following proof is inspired by [Young, 1985, Theorem 1]. We begin with an important lemma: Lemma 5. Let A∈A 1 satisfy completeness, dummy, linearity, and symmetric monotonicity. Then A(¯x,x ′ ,[x− x ′ ] m )=IG(¯x,x ′ ,[x− x ′ ] m ), where m∈N n 0 . Proof. Let A ∈ A 1 satisfy completeness, dummy, linearity, and symmetric monotonicity. Fix ¯x, x ′ . It is useful to note that IG i (¯x,x ′ ,[x− x ′ ] m ) = m i ∥m∥ 1 [¯x− x ′ ] m . We proceed by 147 lexicographic induction on m ∈ N n 0 . What we mean by m ′ < lex m is that m ′ i = m i for 1≤ i<k, but m ′ k <m k . LetM ⊆ N n 0 bethesetofvaluesofmforwhichA(¯x,x ′ ,[x− x ′ ] m )=IG(¯x,x ′ ,[x− x ′ ] m )= 1 ∥m∥ 1 (m 1 ,...,m n )[¯x− x ′ ] m . Now, A(¯x,x ′ ,(x− x ′ ) 0 ) = 0 = IG(¯x,x ′ ,(x− x ′ ) 0 ) by dummy, so (0,...,0) ∈ M. Suppose instead that ∥m∥ 0 = 1, so that only m i ̸= 0. By dummy, A j (¯x,x ′ ,[x− x ′ ] m )=0 for j̸=i, and by completeness, A i (¯x,x ′ ,[x− x ′ ] m )=[¯x− x ′ ] m . Thus A(¯x,x ′ ,[x− x ′ ] m )=IG(¯x,x ′ ,[x− x ′ ] m ), and∥m∥ 0 =1 implies m∈M. Suppose there exists some element inN n 0 that is not an element in M. Let m ∗ be the smallest such element. Define S ={1≤ i≤ n:A i (¯x,x ′ ,[x− x ′ ] m ∗ )̸=IG i (¯x,x ′ ,[x− x ′ ] m ∗ )}. By the above, we have that∥m ∗ ∥ 0 ≥ 2. Note that if i∈S then it must be that 1) ¯x i ̸=x ′ i , for otherwise A i =0=IG i , and 2) m ∗ i >0. ChooseitobetheleastelementinS. AandIGmustdisagreeintwoormorecomponents, foriftheydisagreedinexactlyonecomponent, thentheycouldnotbothsatisfycompleteness. Thus i<n. Define F(x)=[x− x ′ ] m ∗ and define, G(x)= m ∗ i m ∗ n +1 (x 1 − x ′ 1 ) m ∗ 1 ··· (x i − x ′ i ) m ∗ i − 1 ··· (x n − x ′ n ) m ∗ n +1 Note ∂F ∂x i = ∂G ∂xn . Thus, we have by symmetric monotonicity: A i (¯x,x ′ ,F) ¯x i − x ′ i = A n (¯x,x ′ ,G) ¯x n − x ′ n Also note that m ∗∗ = (m 1 ,...,m i − 1,...,m n +1) < m ∗ . Thus m ∗∗ / ∈ M, A(¯x,x ′ ,G) = IG(¯x,x ′ ,G). We then have, 148 A i (¯x,x ′ ,F) ¯x i − x ′ i = A n (¯x,x ′ ,G) ¯x n − x ′ n = IG n (¯x,x ′ ,G) ¯x n − x ′ n = m ∗ i ∥m∥ 0 (x 1 − x ′ 1 ) m ∗ 1 ··· (x i − x ′ i ) m ∗ i − 1 ··· (x n − x ′ n ) m ∗ n = m ∗ i ∥m∥ 0 (x 1 − x ′ 1 ) m ∗ 1 ··· (x i − x ′ i ) m ∗ i ··· (x n − x ′ n ) m ∗ n ¯x i − x ′ i = IG i (¯x,x ′ ,F) ¯x i − x ′ i This shows that A i (¯x,x ′ ,F) = IG i (¯x,x ′ ,F) for i < n. By completeness, we have A n (¯x,x ′ ,F) = IG n (¯x,x ′ ,F). Thus m ∗ ∈ M, a contradiction. Thus there is no element of N n 0 that is not an element of M, and M =N n 0 concluding the proof. We now move to the main proof: Let A∈A 1 satisfy completeness, dummy, linearity, and symmetric monotonicity and let F ∈F 1 . 
For any i such that 1≤ i≤ n, ∂F ∂x i is analytic and by the Stone Weierstrass theorem, for any ϵ> 0, there exists a polynomial, p, such that|p(x)− ∂F ∂x i (x)|<ϵ on [a,b]. Let p m be a polynomial such that|p m (x)− ∂F ∂x i (x)|< 1 2m , and let P m be any polynomial so that ∂Pm ∂x i =p m . Note that ∂(Pm− x i m ) ∂x i =p m − 1 m < ∂F ∂x i . Now assume that ¯x i − x ′ i ≥ 0. By symmetric monotonicity we have A i (¯x,x ′ ,P m − x i m )≤ A i (¯x,x ′ ,F). Employing the dominated convergence theorem, we have: 149 A i (¯x,x ′ ,F)≥ lim m→∞ A i (¯x,x ′ ,P m − x i m ) = lim m→∞ IG i (¯x,x ′ ,P m − x i m ) = lim m→∞ (¯x i − x ′ i ) Z 1 0 ∂(P m − x i m ) ∂x i (γ (t))dt = lim m→∞ (¯x i − x ′ i ) Z 1 0 p m (γ (t))dt− (¯x i − x ′ i ) m =(¯x i − x ′ i ) Z 1 0 ∂F ∂x i (γ (t))dt =IG i (¯x,x ′ ,F) By considering P m + x i m , we gain the opposite inequality, namely, A i (¯x,x ′ ,F)≤ IG i (¯x,x ′ ,F). This establishes that A i (¯x,x ′ ,F)=IG i (¯x,x ′ ,F). The case where ¯x i − x ′ i ≤ 0 follows a parallel proof. 5 Proof of Theorem 11 Proof. (iii. ⇒ ii.) Suppose A∈A 2 (D IG ) is the IG method and (¯x,x ′ ,F)∈D IG . It is well known that IG satisfies completeness, dummy, and linearity. If ¯x i =x ′ i , then it is easy to see that IG i (¯x,x ′ ,F)=0. Suppose that (¯x,x ′ ,G) ∈ D IG as well, and that ¯x i ̸= x ′ i , ¯x j ̸= x ′ j . Furthermore, suppose that ∂F ∂x i ≤ ∂F ∂x j locally approximately. Because (¯x,x ′ ,F), (¯x,x ′ ,F) ∈ D IG , ∂F ∂x i and ∂F ∂x j can be integrated along the path γ (t) = x ′ +t(¯x− x ′ ), implying that the mea- sure of points on the path where ∂F ∂x i and ∂F ∂x j exist has full measure with respect to the Lebesgue measure onR. Suppose x is one such point. Then lim z→∞ F(x 1 ,...,x i +z,...,xn)− F(x) z , lim z→∞ G(x 1 ,...,x j +z,...,xn)− G(x) z both exist and, because ∂F ∂x i ≤ ∂F ∂x j locally approximately, ∂F ∂x i (x)=lim z→∞ F(x 1 ,...,x i +z,...,xn)− F(x) z ≤ lim z→∞ G(x 1 ,...,x j +z,...,xn)− G(x) z = ∂F ∂x j (x). Thus, 150 IG i (¯x,x ′ ,F) ¯x i − x ′ i = Z 1 0 ∂F ∂x i (x ′ +t(¯x− x ′ ))dt ≤ Z 1 0 ∂F ∂x j (x ′ +t(¯x− x ′ ))dt = IG j (¯x,x ′ ,F) ¯x j − x ′ j and IG satisfies C 0 -symmetric monotonicity. ii. ⇒ i.) Suppose A ∈ A 2 (D IG ) satisfies completeness, dummy, linearity, and C 0 - symmetric monotonicity. A satisfies symmetric monotonicity for F ∈F 1 immediately by the definition of partial derivatives. Suppose that F is non-decreasing from x ′ to ¯x and let (¯x,x ′ ,F) ∈ D IG . If ¯x i = x ′ i , then A i (¯x,x ′ ,F) = 0. Suppose ¯x i > x ′ i . As previously observed, ∂F ∂x i exists almost everywhere on the straight path γ (t). Setting G ≡ 0, then 0 = ∂G ∂x i ≤ ∂F ∂x i almost approximately since F is non-decreasing from x ′ to ¯x and ¯x i > x ′ i . Thus0= A i (¯x,x ′ ,G) ¯x i − x ′ i ≤ A i (¯x,x ′ ,F) ¯x i − x ′ i ,and0≤ A i (¯x,x ′ ,F). Ifinsteadweassumethat ¯x i <x ′ i ,then 0= ∂G ∂x i ≥ ∂F ∂x i , and 0= ∂G ∂x i ≤− ∂F ∂x i . Thus 0= A i (¯x,x ′ ,G) ¯x i − x ′ i ≤ A i (¯x,x ′ ,− F) ¯x i − x ′ i , and 0≤ A i (¯x,x ′ ,F). Thus, in any case, A i (¯x,x ′ ,F)≥ 0 for all i, and A satisfies NDP. i. ⇒ iii.) Suppose A∈A 2 (D IG ) satisfies completeness, dummy, linearity, and NDP. Let (¯x,x ′ ,F)∈D IG and choose a component i. By methods found in the proof of Lundstrom et al. [2022a, Theorem 2], there exists a sequence of functions F m such that: • F m is analytic for all m. • ∂Fm ∂x i ≤ ∂F ∂x i where ∂F ∂x i exists. • lim m→∞ ∂Fm ∂x i = ∂F ∂x i where ∂F ∂x i exists. • | ∂Fm ∂x i |≤ k for all m. • F − F m is non-decreasing from x ′ to ¯x in i. 
5 Proof of Theorem 11

Proof. (iii. $\Rightarrow$ ii.) Suppose $A \in \mathcal{A}_2(\mathcal{D}_{IG})$ is the IG method and $(\bar{x},x',F) \in \mathcal{D}_{IG}$. It is well known that IG satisfies completeness, dummy, and linearity. If $\bar{x}_i = x'_i$, then it is easy to see that $IG_i(\bar{x},x',F) = 0$. Suppose that $(\bar{x},x',G) \in \mathcal{D}_{IG}$ as well, and that $\bar{x}_i \ne x'_i$, $\bar{x}_j \ne x'_j$. Furthermore, suppose that $\frac{\partial F}{\partial x_i} \le \frac{\partial G}{\partial x_j}$ locally approximately. Because $(\bar{x},x',F), (\bar{x},x',G) \in \mathcal{D}_{IG}$, both $\frac{\partial F}{\partial x_i}$ and $\frac{\partial G}{\partial x_j}$ can be integrated along the path $\gamma(t) = x' + t(\bar{x}-x')$, implying that the set of points on the path where $\frac{\partial F}{\partial x_i}$ and $\frac{\partial G}{\partial x_j}$ exist has full measure with respect to the Lebesgue measure on $\mathbb{R}$. Suppose $x$ is one such point. Then
\[
\lim_{z\to 0} \frac{F(x_1,\dots,x_i + z,\dots,x_n) - F(x)}{z}
\quad\text{and}\quad
\lim_{z\to 0} \frac{G(x_1,\dots,x_j + z,\dots,x_n) - G(x)}{z}
\]
both exist and, because $\frac{\partial F}{\partial x_i} \le \frac{\partial G}{\partial x_j}$ locally approximately,
\[
\frac{\partial F}{\partial x_i}(x) = \lim_{z\to 0} \frac{F(x_1,\dots,x_i + z,\dots,x_n) - F(x)}{z}
\le \lim_{z\to 0} \frac{G(x_1,\dots,x_j + z,\dots,x_n) - G(x)}{z} = \frac{\partial G}{\partial x_j}(x).
\]
Thus,
\[
\frac{IG_i(\bar{x},x',F)}{\bar{x}_i - x'_i} = \int_0^1 \frac{\partial F}{\partial x_i}\big(x' + t(\bar{x}-x')\big)\,dt
\le \int_0^1 \frac{\partial G}{\partial x_j}\big(x' + t(\bar{x}-x')\big)\,dt = \frac{IG_j(\bar{x},x',G)}{\bar{x}_j - x'_j},
\]
and IG satisfies $C^0$-symmetric monotonicity.

ii. $\Rightarrow$ i.) Suppose $A \in \mathcal{A}_2(\mathcal{D}_{IG})$ satisfies completeness, dummy, linearity, and $C^0$-symmetric monotonicity. $A$ satisfies symmetric monotonicity for $F \in \mathcal{F}_1$ immediately by the definition of partial derivatives. Suppose that $F$ is non-decreasing from $x'$ to $\bar{x}$ and let $(\bar{x},x',F) \in \mathcal{D}_{IG}$. If $\bar{x}_i = x'_i$, then $A_i(\bar{x},x',F) = 0$. Suppose $\bar{x}_i > x'_i$. As previously observed, $\frac{\partial F}{\partial x_i}$ exists almost everywhere on the straight path $\gamma(t)$. Setting $G \equiv 0$, we have $0 = \frac{\partial G}{\partial x_i} \le \frac{\partial F}{\partial x_i}$ almost approximately, since $F$ is non-decreasing from $x'$ to $\bar{x}$ and $\bar{x}_i > x'_i$. Thus $0 = \frac{A_i(\bar{x},x',G)}{\bar{x}_i - x'_i} \le \frac{A_i(\bar{x},x',F)}{\bar{x}_i - x'_i}$, and $0 \le A_i(\bar{x},x',F)$. If instead we assume that $\bar{x}_i < x'_i$, then $0 = \frac{\partial G}{\partial x_i} \ge \frac{\partial F}{\partial x_i}$ and $0 = \frac{\partial G}{\partial x_i} \le -\frac{\partial F}{\partial x_i}$. Thus $0 = \frac{A_i(\bar{x},x',G)}{\bar{x}_i - x'_i} \le \frac{A_i(\bar{x},x',-F)}{\bar{x}_i - x'_i}$, and again $0 \le A_i(\bar{x},x',F)$. Thus, in any case, $A_i(\bar{x},x',F) \ge 0$ for all $i$, and $A$ satisfies NDP.

i. $\Rightarrow$ iii.) Suppose $A \in \mathcal{A}_2(\mathcal{D}_{IG})$ satisfies completeness, dummy, linearity, and NDP. Let $(\bar{x},x',F) \in \mathcal{D}_{IG}$ and choose a component $i$. By methods found in the proof of Lundstrom et al. [2022a, Theorem 2], there exists a sequence of functions $F_m$ such that:
• $F_m$ is analytic for all $m$;
• $\frac{\partial F_m}{\partial x_i} \le \frac{\partial F}{\partial x_i}$ wherever $\frac{\partial F}{\partial x_i}$ exists;
• $\lim_{m\to\infty} \frac{\partial F_m}{\partial x_i} = \frac{\partial F}{\partial x_i}$ wherever $\frac{\partial F}{\partial x_i}$ exists;
• $|\frac{\partial F_m}{\partial x_i}| \le k$ for all $m$;
• $F - F_m$ is non-decreasing from $x'$ to $\bar{x}$ in $i$.

By NDP we have $A_i(\bar{x},x',F - F_m) \ge 0$, and hence $A_i(\bar{x},x',F) \ge A_i(\bar{x},x',F_m)$. Since $F_m \in \mathcal{F}_1$, we have $A_i(\bar{x},x',F_m) = IG_i(\bar{x},x',F_m)$ by Theorem 10. Recalling that $\frac{\partial F}{\partial x_i}$ exists almost everywhere on IG's path, we employ the dominated convergence theorem to obtain:
\[
\begin{aligned}
A_i(\bar{x},x',F) &\ge \lim_{m\to\infty} A_i(\bar{x},x',F_m) = \lim_{m\to\infty} IG_i(\bar{x},x',F_m) \\
&= \lim_{m\to\infty} (\bar{x}_i - x'_i)\int_0^1 \frac{\partial F_m}{\partial x_i}(\gamma(t))\,dt \\
&= (\bar{x}_i - x'_i)\int_0^1 \frac{\partial F}{\partial x_i}(\gamma(t))\,dt = IG_i(\bar{x},x',F).
\end{aligned}
\]
By a parallel method we obtain $A_i(\bar{x},x',F) \le IG_i(\bar{x},x',F)$.

6 Proof of Theorem 12

Sundararajan et al. [2017] have shown that IG is linear, and Eq. (4.1) shows the action of IG on polynomials. Let $F \in C^\omega$ and let $T_l$ be the Taylor approximation of $F$ of order $l$ centered at $x'$. It is known that $\frac{\partial T_l}{\partial x_i} \to \frac{\partial F}{\partial x_i}$ uniformly on a compact domain such as $[a,b]$. Thus,
\[
\lim_{l\to\infty} IG_i(\bar{x},x',T_l) = \lim_{l\to\infty} (\bar{x}_i - x'_i)\int_0^1 \frac{\partial T_l}{\partial x_i}\big(x' + t(\bar{x}-x')\big)\,dt
= (\bar{x}_i - x'_i)\int_0^1 \frac{\partial F}{\partial x_i}\big(x' + t(\bar{x}-x')\big)\,dt = IG_i(\bar{x},x',F). \tag{6.1}
\]
Thus IG satisfies the continuity criterion. Apply Theorem 7 for the result.

7 Softplus Approximations Converge Uniformly

Define $S^k_\alpha$ to be as $S^k$, but with each ReLU function $s$ in $S^k$ replaced by the parameterized softplus $s_\alpha$. Then the softplus approximation of $F$ is given by:
\[
F_\alpha(x) = S^m_\alpha \circ F^m \circ S^{m-1}_\alpha \circ F^{m-1} \circ \dots \circ S^2_\alpha \circ F^2 \circ S^1_\alpha \circ F^1(x).
\]

Lemma 6. $F_\alpha \to F$ uniformly on $U$.

Proof. We proceed by induction. For $k = 1$, it is easy to show that $s_\alpha \to s$ uniformly on $\mathbb{R}$, and thus $S^1_\alpha \to S^1$ uniformly on $\mathbb{R}^n$. Hence, for any $\epsilon > 0$, an $A > 0$ may be chosen such that for any $y \in \mathbb{R}^n$, $\alpha > A$ implies $\|S^1_\alpha(y) - S^1(y)\| < \epsilon$. Replacing $y$ with $F^1(x)$ gives $S^1_\alpha \circ F^1 \to S^1 \circ F^1$ uniformly.

Write $G^k := S^k \circ F^k \circ \dots \circ S^1 \circ F^1$ and $G^k_\alpha := S^k_\alpha \circ F^k \circ \dots \circ S^1_\alpha \circ F^1$, and suppose $G^k_\alpha \to G^k$ uniformly. It remains to be shown that $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha \to S^{k+1} \circ F^{k+1} \circ G^k$ uniformly. We have
\[
\begin{aligned}
\|S^{k+1}_\alpha(F^{k+1}(G^k_\alpha(x))) - S^{k+1}(F^{k+1}(G^k(x)))\|
&\le \|S^{k+1}_\alpha(F^{k+1}(G^k_\alpha(x))) - S^{k+1}_\alpha(F^{k+1}(G^k(x)))\| + \|S^{k+1}_\alpha(F^{k+1}(G^k(x))) - S^{k+1}(F^{k+1}(G^k(x)))\| \\
&\le \|F^{k+1}(G^k_\alpha(x)) - F^{k+1}(G^k(x))\| + \|S^{k+1}_\alpha(F^{k+1}(G^k(x))) - S^{k+1}(F^{k+1}(G^k(x)))\|,
\end{aligned}
\]
where the second inequality holds because $S^{k+1}_\alpha$ is Lipschitz with Lipschitz constant at most 1. Since $G^k$ is continuous, it is bounded on $U$. Since $G^k_\alpha$ converges uniformly to $G^k$, it is bounded for large enough $\alpha$. Let $\alpha_0$ produce this bound; that is, if $\alpha > \alpha_0$, then $\max(\|G^k_\alpha(x)\|, \|G^k(x)\|) \le C_1$ for any $x \in U$. Since $F^{k+1}$ is analytic, it is Lipschitz on bounded domains. Thus, if $\alpha > \alpha_0$, then
\[
\|F^{k+1}(G^k_\alpha(x)) - F^{k+1}(G^k(x))\| \le C_2\,\|G^k_\alpha(x) - G^k(x)\|.
\]
Now, by uniform convergence of $G^k_\alpha$ and $S^{k+1}_\alpha$, choose $\alpha_1$ so that $\alpha > \alpha_1$ guarantees $\|G^k_\alpha(x) - G^k(x)\| < \epsilon/(2C_2)$, and choose $\alpha_2$ so that $\alpha > \alpha_2$ guarantees $\|S^{k+1}_\alpha(F^{k+1}(G^k(x))) - S^{k+1}(F^{k+1}(G^k(x)))\| < \epsilon/2$. Then $\alpha > \max(\alpha_0, \alpha_1, \alpha_2)$ guarantees that
\[
\begin{aligned}
\|S^{k+1}_\alpha(F^{k+1}(G^k_\alpha(x))) - S^{k+1}(F^{k+1}(G^k(x)))\|
&\le \|F^{k+1}(G^k_\alpha(x)) - F^{k+1}(G^k(x))\| + \|S^{k+1}_\alpha(F^{k+1}(G^k(x))) - S^{k+1}(F^{k+1}(G^k(x)))\| \\
&\le C_2\,\|G^k_\alpha(x) - G^k(x)\| + \|S^{k+1}_\alpha(F^{k+1}(G^k(x))) - S^{k+1}(F^{k+1}(G^k(x)))\| \\
&< \epsilon/2 + \epsilon/2 = \epsilon,
\end{aligned}
\]
showing that $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha \to S^{k+1} \circ F^{k+1} \circ G^k$ uniformly.
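As a numerical companion to Lemma 6, the sketch below checks the first step of the induction: the parameterized softplus converges to ReLU uniformly. It assumes the standard parameterization $s_\alpha(x) = \frac{1}{\alpha}\log(1 + e^{\alpha x})$, which is our assumption; the dissertation's exact parameterization may differ in constants. Under that form the worst-case gap is $\log(2)/\alpha$, attained at $x = 0$.

```python
# Uniform convergence of the parameterized softplus to ReLU (illustrative sketch).
import numpy as np

def softplus(x, alpha):
    # log(1 + exp(alpha * x)) / alpha, written stably for large |alpha * x|
    return np.logaddexp(0.0, alpha * x) / alpha

def relu(x):
    return np.maximum(x, 0.0)

xs = np.linspace(-5, 5, 100_001)
for alpha in [1, 10, 100, 1000]:
    gap = np.max(np.abs(softplus(xs, alpha) - relu(xs)))
    print(alpha, gap, np.log(2) / alpha)   # gap matches log(2)/alpha and shrinks to 0
```

The printed worst-case gap should track $\log(2)/\alpha$ and vanish as $\alpha$ grows, which is exactly the uniform convergence the lemma builds on.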
8 Proof of Theorem 13

8.1 Setup

Define $S^k_\alpha$ to be as $S^k$, but with each ReLU function $s$ in $S^k$ replaced by the parameterized softplus $s_\alpha$. Then the softplus approximation of $F$ is given by:
\[
F_\alpha(x) = S^m_\alpha \circ F^m \circ S^{m-1}_\alpha \circ F^{m-1} \circ \dots \circ S^2_\alpha \circ F^2 \circ S^1_\alpha \circ F^1(x).
\]
Also, for a function $G : \mathbb{R}^n \to \mathbb{R}^m$, define $DG$ to be the Jacobian, so that if $G_i$ is the $i$th output of $G$, then $(DG)_{i,j} = \frac{\partial G_i}{\partial x_j}$.

8.2 Main Proof

First, we state an outline of the proof. We proceed by induction. In the non-trivial case with one-dimensional output, $F^1$ is not the zero function and $S^1$ is ReLU. In this case, $\{y \in U : F^1(y) \ne 0\}$ is open and has full measure. For any $x$ in this set, composing $F^1$ with ReLU shows that $S^1 \circ F^1$ behaves locally like either $F^1$ or the zero function, and $D(S^1_\alpha \circ F^1)$ converges locally to either $DF^1$ or $0$. In the multivariate case, each output $(S^1 \circ F^1)_i$ has a set with the desired behaviors, so for any $x$ in the intersection of such sets, $S^1 \circ F^1$ has the desired behaviors; that intersection is open and has full measure.

For the induction step, we assume that $G^k$ has the desired properties and want to show that $S^{k+1} \circ F^{k+1} \circ G^k$ does as well. If $G^k$ is equivalent to an analytic function in some neighborhood, so is $F^{k+1} \circ G^k$. An argument similar to the $k = 1$ step shows that for almost every $x$ in that neighborhood, $S^{k+1} \circ F^{k+1} \circ G^k$ is equivalent to an analytic function in some new open neighborhood containing $x$, and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$ converges. We then consider the collection of points $x \in U$ with the desirable properties and a collection of open sets $N_x$ containing them, on which $S^{k+1} \circ F^{k+1} \circ G^k$ is locally equivalent to an analytic function, and show that $\cup_x N_x$ is open and has full measure.

Proof. Let $F \in \mathcal{F}_2$. As before, write $G^k := S^k \circ F^k \circ \dots \circ S^1 \circ F^1$ and $G^k_\alpha := S^k_\alpha \circ F^k \circ \dots \circ S^1_\alpha \circ F^1$. Assume that there exists $U^* \subseteq U$ with the same measure as $U$ such that every $x \in U^*$ has an open region $B_x$ containing it with: 1) $G^k \equiv H_x$ on $B_x$, where $H_x$ is a real-analytic function on $U$; 2) $DG^k(x)$ exists; and 3) $DG^k_\alpha(x) \to DG^k(x)$ as $\alpha \to \infty$. We want to show that there is a set analogous to $U^*$ for $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$. With this established, we will have a proof by induction: the above is the $k \to k+1$ step, and by setting $k = 1$ with $F^1$, $S^1$ the identity mappings, the same argument proves the $k = 1$ step, concluding the proof.

First, let us consider the case where $F^{k+1}$ and $S^{k+1}$ output in one dimension. Let $x \in U^*$, and suppose $G^k \equiv H_x$ on $B_x$. Then $F^{k+1} \circ G^k$ is analytic on $B_x$, since compositions of real-analytic functions are real analytic.

Case 1: Consider the case where $\lambda(\{y \in B_x : G^k(y) = 0\}) > 0$. Then $G^k \equiv 0$ and $S^{k+1} \circ F^{k+1} \circ G^k$ is constant on $B_x$. In this case, the derivative of $S^{k+1} \circ F^{k+1} \circ G^k$ exists everywhere on $B_x$ and equals zero. Now, for $y \in B_x$, we have
\[
\lim_{\alpha\to\infty} \nabla(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)(y)
= \lim_{\alpha\to\infty} \sum_{j=1}^{n_k} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big(F^{k+1}(G^k_\alpha(y))\big)\,\frac{\partial F^{k+1}}{\partial G^k_{\alpha,j}}\big(G^k_\alpha(y)\big)\,\nabla G^k_{\alpha,j}(y)
= 0 = \nabla(S^{k+1} \circ F^{k+1} \circ G^k)(y),
\]
where the zero follows because $\big|\frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big| \le 1$, because $\frac{\partial F^{k+1}}{\partial G^k_{\alpha,j}}$ is bounded on a bounded domain (which this is), and because $\nabla G^k_{\alpha,j}(y) \to 0$ for each $j$. Thus $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$ have properties 1-3 of the theorem on the set $B_x$.

Case 2: Consider instead the case where $G^k$ is not the zero function, but $S^{k+1}$ is the identity mapping. Then $F^{k+1} \circ G^k$ is analytic on $B_x$, and so is $S^{k+1} \circ F^{k+1} \circ G^k$, and the derivative exists on $B_x$. Now, for $y \in B_x$, we have
\[
\begin{aligned}
\lim_{\alpha\to\infty} \nabla(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)(y)
&= \lim_{\alpha\to\infty} \nabla(F^{k+1} \circ G^k_\alpha)(y)
= \lim_{\alpha\to\infty} \sum_{j=1}^{n_k} \frac{\partial F^{k+1}}{\partial G^k_{\alpha,j}}\big(G^k_\alpha(y)\big)\,\nabla G^k_{\alpha,j}(y) \\
&= \lim_{\alpha\to\infty} \sum_{j=1}^{n_k} \frac{\partial F^{k+1}}{\partial G^k_j}\big(G^k_\alpha(y)\big)\,\nabla G^k_{\alpha,j}(y)
= \sum_{j=1}^{n_k} \frac{\partial F^{k+1}}{\partial G^k_j}\big(G^k(y)\big)\,\nabla G^k_j(y)
= \nabla(F^{k+1} \circ G^k)(y). \tag{8.1}
\end{aligned}
\]
To justify passing to the limit term by term: $\nabla G^k_{\alpha,j}(y)$ converges pointwise by assumption.
Also, $\frac{\partial F^{k+1}}{\partial G^k_j}$ is Lipschitz continuous on a bounded domain and $G^k_\alpha(y)$ converges uniformly, so each term converges pointwise. Thus $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$ have properties 1-3 of the theorem on the set $B_x$.

Case 3: Consider the case where $G^k$ is not the zero function and $S^{k+1}$ is the ReLU function. Then $F^{k+1} \circ G^k$ is analytic on $B_x$, and it is either the zero function on $B_x$ or not.

Case 3.1: Consider the subcase where $F^{k+1} \circ G^k \equiv 0$ on $B_x$. Then $S^{k+1} \circ F^{k+1} \circ G^k \equiv 0$ on $B_x$, is differentiable on $B_x$, and its derivative is the zero function. Then for $y \in B_x$, we have
\[
\begin{aligned}
\lim_{\alpha\to\infty} \nabla(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)(y)
&= \lim_{\alpha\to\infty} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k_\alpha)}\big(F^{k+1}(G^k_\alpha(y))\big)\,\nabla(F^{k+1} \circ G^k_\alpha)(y) \\
&= \lim_{\alpha\to\infty} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big(F^{k+1}(G^k_\alpha(y))\big)\,\nabla(F^{k+1} \circ G^k)(y) \\
&= \lim_{\alpha\to\infty} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big(F^{k+1}(G^k_\alpha(y))\big)\times 0
= \nabla(S^{k+1} \circ F^{k+1} \circ G^k)(y),
\end{aligned}
\]
where the second equality holds because $\frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}$ is bounded and $\nabla(F^{k+1} \circ G^k_\alpha) \to \nabla(F^{k+1} \circ G^k)$ on $B_x$ by Eq. (8.1). Thus in this subcase $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$ have properties 1-3 of the theorem on the set $B_x$.

Case 3.2: Instead consider the subcase where $F^{k+1} \circ G^k$ is a non-constant function on $B_x$. We have $\lambda(\{z \in B_x : F^{k+1} \circ G^k(z) = 0\}) = 0$.

Case 3.2.1: Suppose $F^{k+1} \circ G^k(x) > 0$. Because $F^{k+1} \circ G^k$ is continuous, there exists an open set $B'_x$ containing $x$ on which $F^{k+1} \circ G^k > 0$, and on such a set $S^{k+1} \circ F^{k+1} \circ G^k \equiv F^{k+1} \circ G^k$. Then,
\[
\lim_{\alpha\to\infty} \nabla(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)(y)
= \lim_{\alpha\to\infty} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big(F^{k+1}(G^k_\alpha(y))\big)\,\nabla(F^{k+1} \circ G^k_\alpha)(y)
= 1 \times \nabla(F^{k+1} \circ G^k)(y)
= \nabla(S^{k+1} \circ F^{k+1} \circ G^k)(y).
\]

Case 3.2.2: Suppose $F^{k+1} \circ G^k(x) < 0$. Because $F^{k+1} \circ G^k$ is continuous, there exists an open set $B'_x$ containing $x$ on which $F^{k+1} \circ G^k < 0$, and on such a set $S^{k+1} \circ F^{k+1} \circ G^k \equiv 0$. Then,
\[
\lim_{\alpha\to\infty} \nabla(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)(y)
= \lim_{\alpha\to\infty} \frac{dS^{k+1}_\alpha}{d(F^{k+1}\circ G^k)}\big(F^{k+1}(G^k_\alpha(y))\big)\,\nabla(F^{k+1} \circ G^k_\alpha)(y)
= 0 \times \nabla(F^{k+1} \circ G^k)(y)
= \nabla(S^{k+1} \circ F^{k+1} \circ G^k)(y).
\]

Case 3.2.3: Suppose $F^{k+1} \circ G^k(x) = 0$. In this case, we do not define a $B'_x$ set. We remind the reader that if $x \in U^*$ is a case 3.2.3 point, then $\lambda(\{z \in B_x : F^{k+1} \circ G^k(z) = 0\}) = 0$.

Thus we have established, in the one-dimensional output case, that for each $x \in U^*$ that is not a case 3.2.3 point, there exists an open neighborhood containing $x$ where properties 1-3 hold.

Now consider the multivariate case. Define $K \subseteq U^*$ as the set of points in $U^*$ that are case 3.2.3 points for at least one output of $S^{k+1} \circ F^{k+1} \circ G^k$. Let $x \in U^* \setminus K$, and let $B'_{x,i}$ be the open set containing $x$ on which properties 1-3 hold when we consider only the output $(S^{k+1} \circ F^{k+1} \circ G^k)_i$. Then properties 1-3 hold on $\cap_i B'_{x,i}$ for each output $(S^{k+1} \circ F^{k+1} \circ G^k)_i$ and $(S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha)_i$, and hence for $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$ on $\cap_i B'_{x,i}$. Thus we have established the following in the multivariate case: for each $x \in U^* \setminus K$, there exists an open neighborhood $B'_x$ containing $x$ where properties 1-3 hold for $S^{k+1} \circ F^{k+1} \circ G^k$ and $S^{k+1}_\alpha \circ F^{k+1} \circ G^k_\alpha$.

We now show that $\lambda(K) = 0$, which will conclude the proof. Let $K_i$ denote the set of case 3.2.3 points for the $i$th output of $S^{k+1} \circ F^{k+1} \circ G^k$. Since $K = \cup_i K_i$, it suffices to show $\lambda(K_i) = 0$. Let $x \in K_i$ for some $i$; then $x$ is a case 3.2.3 point for the output $(S^{k+1} \circ F^{k+1} \circ G^k)_i$. According to our assumption, there exists a $B_x$ containing $x$ on which properties 1-3 hold for $G^k$, $G^k_\alpha$. Note that $\cup_{x\in K_i} B_x$ is an open cover and has a countable subcover $\cup_{j\in\mathbb{N}} B_{x_j}$, where each $x_j$ is a case 3.2.3 point for $(S^{k+1} \circ F^{k+1} \circ G^k)_i$. Because $K_i \subseteq \cup_{x\in K_i} B_x$, we also have $K_i \subseteq \cup_{j\in\mathbb{N}} B_{x_j}$. Now,
\[
K_i = K_i \cap \big(\cup_{j\in\mathbb{N}} B_{x_j}\big) = \cup_{j\in\mathbb{N}} \big(B_{x_j} \cap K_i\big).
\]
Now, if $x \in K_i$, then $(F^{k+1} \circ G^k)_i(x) = 0$ by virtue of being a case 3.2.3 point. Also, it has been established that for case 3.2.3 points, $\lambda(\{z \in B_x : (F^{k+1} \circ G^k)_i(z) = 0\}) = 0$. Thus $\lambda(B_{x_j} \cap K_i) \le \lambda(\{z \in B_{x_j} : (F^{k+1} \circ G^k)_i(z) = 0\}) = 0$. Thus $K_i$ is a countable union of sets of measure zero, and is therefore measure zero.
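As a numerical companion to Theorem 13, the sketch below checks the conclusion on a tiny one-hidden-layer ReLU network: at a point where no hidden pre-activation is exactly zero (such points have full measure), the gradient of the softplus surrogate converges to the gradient of the ReLU network. The architecture, random weights, and the softplus derivative $\sigma(\alpha z)$ reflect our own illustrative assumptions, not the dissertation's code.

```python
# Pointwise convergence of the softplus surrogate's gradient to the ReLU network's gradient.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer
w2 = rng.normal(size=4)                                 # linear output layer

def grad_relu_net(x):
    z = W1 @ x + b1
    return W1.T @ (w2 * (z > 0))          # chain rule with ReLU'(z) = 1[z > 0]

def grad_softplus_net(x, alpha):
    z = W1 @ x + b1
    sigma = 1.0 / (1.0 + np.exp(-alpha * z))
    return W1.T @ (w2 * sigma)            # chain rule with s_alpha'(z) = sigmoid(alpha * z)

x = rng.normal(size=3)                     # generic point: pre-activations nonzero almost surely
g = grad_relu_net(x)
for alpha in [1, 10, 100, 1000]:
    print(alpha, np.max(np.abs(grad_softplus_net(x, alpha) - g)))   # error should shrink to 0
```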
9 Proof of Corollary 5

Proof. Let $F \in \mathcal{F}_2$, and let $U$ be the set as in Theorem 13. Let $\gamma(t)$ be the uniform-speed path from $x'$ to $\bar{x}$, and suppose $\lambda(\{t \in [0,1] : \gamma(t) \in U\}) = 1$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}$. By Theorem 13, we have $\nabla F_\alpha(\gamma(t)) \to \nabla F(\gamma(t))$ for almost every $t$ in $[0,1]$. Suppose for now that $\nabla F_\alpha$ is bounded on $U$ for large enough $\alpha$. Let $a_n$ be any sequence such that $a_n \to \infty$. Choose any index $i$; employing the dominated convergence theorem, we obtain:
\[
\lim_{n\to\infty} IG_i(\bar{x},x',F_{a_n}) = \lim_{n\to\infty} (\bar{x}_i - x'_i)\int_0^1 \frac{\partial F_{a_n}}{\partial x_i}(\gamma(t))\,dt
= (\bar{x}_i - x'_i)\int_0^1 \frac{\partial F}{\partial x_i}(\gamma(t))\,dt = IG_i(\bar{x},x',F).
\]
Since $\lim_{n\to\infty} IG_i(\bar{x},x',F_{a_n}) = IG_i(\bar{x},x',F)$ for any such sequence $a_n$, we have $\lim_{\alpha\to\infty} IG_i(\bar{x},x',F_\alpha) = IG_i(\bar{x},x',F)$.

We now show that $\nabla F_\alpha$ is bounded for large enough $\alpha$. Using the notation introduced in Theorem 13, note that
\[
\nabla F_\alpha = DS^m_\alpha\,DF^m\,DS^{m-1}_\alpha\,DF^{m-1}\cdots DS^2_\alpha\,DF^2\,DS^1_\alpha\,DF^1.
\]
Thus,
\[
\|\nabla F_\alpha\|_\infty \le \prod_{k=1}^m \|DS^k_\alpha(F^k \circ \dots \circ F^1)\|_\infty \times \|DF^k(S^{k-1}_\alpha \circ \dots \circ F^1)\|_\infty.
\]
Now $\|DS^k_\alpha(F^k \circ \dots \circ F^1)\|_\infty \le 1$, since $S^k_\alpha$ is either softplus or the identity mapping in each input. Also, $F^k$ is Lipschitz on a bounded domain, and $S^{k-1}_\alpha \circ \dots \circ F^1$ converges uniformly on $U$ to a function with bounded range. Thus $\|DF^k(S^{k-1}_\alpha \circ \dots \circ F^1)\|_\infty$ is bounded on $U$, and $\|\nabla F_\alpha\|_\infty$ is bounded.
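Closing the appendix, here is a numerical companion to Corollary 5: the IG attributions of the softplus surrogates $F_\alpha$ should converge to the IG attributions of the ReLU network $F$. Each IG is computed by the same midpoint Riemann sum along the straight path. The small random network, the softplus derivative, and the helper names are illustrative assumptions of ours, not the dissertation's implementation.

```python
# Convergence of IG(F_alpha) to IG(F) for a toy ReLU network (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
w2 = rng.normal(size=5)

def grad_net(x, alpha=None):
    """Gradient of w2 . act(W1 x + b1); act = ReLU if alpha is None, else softplus_alpha."""
    z = W1 @ x + b1
    dact = (z > 0).astype(float) if alpha is None else 1.0 / (1.0 + np.exp(-alpha * z))
    return W1.T @ (w2 * dact)

def ig(xbar, xprime, alpha=None, n_steps=5_000):
    ts = (np.arange(n_steps) + 0.5) / n_steps
    pts = xprime + ts[:, None] * (xbar - xprime)
    return (xbar - xprime) * np.array([grad_net(p, alpha) for p in pts]).mean(axis=0)

xprime, xbar = np.zeros(3), np.array([1.0, -2.0, 0.5])
ig_relu = ig(xbar, xprime)
for alpha in [1, 10, 100, 1000]:
    print(alpha, np.max(np.abs(ig(xbar, xprime, alpha) - ig_relu)))   # error should shrink to 0
```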
Abstract
Deep learning has revolutionized many areas of machine learning, from computer vision to natural language processing, but these high-performance models are generally "black box." Explaining such models improves transparency and trust in AI-powered decision making and is necessary for addressing other practical needs such as robustness and fairness. A popular means of enhancing model transparency is to quantify how individual inputs contribute to model outputs (called attributions) and the magnitude of interactions between groups of inputs. A growing number of these methods import concepts and results from game theory to produce attributions and interactions. This work studies these methods. We analyze the popular integrated gradients method (IG), outlining issues with multiple claims that it uniquely satisfies certain sets of desirable properties. We recover these results with the addition of a further desideratum, non-decreasing positivity. In all, we provide four different sets of properties that IG uniquely satisfies. We also study aspects of IG such as sensitivity to input perturbations, a formulation of IG where the reference baseline is a distribution of inputs, and a method of scoring internal neuron contributions based on IG.
Beyond IG, we study the extension of attribution methods to interaction methods, which quantify the distinct effects different groups of inputs have on a model's output. In particular, we study methods that quantify interactions between any subset of inputs, as well as kth-order interaction methods, which report interactions for groups of up to size k. We show that, under modest assumptions, a unique full account of interactions between features, called synergies, is possible in the continuous-input setting. This unique full account of interactions is based on the Möbius transform, and it induces a unique decomposition of any real-valued function into a sum of synergy functions. We go on to detail existing and novel interaction methods, showing that they are defined by their action on synergy functions and, for gradient-based methods, by their action on monomials.
We experimentally validate our method of attributing to internal neurons on a ResNet-152 model trained on ImageNet and on a custom model trained on Fashion-MNIST. We also experimentally validate various interaction methods on a custom model trained on a protein tertiary structure dataset.