MUTUAL INFORMATION ESTIMATION AND ITS APPLICATIONS TO MACHINE LEARNING

by Shuyang Gao

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2018

Copyright 2018 Shuyang Gao

Acknowledgements

First and foremost, I am deeply thankful to my advisor, Professor Aram Galstyan, for all his support and encouragement during my six-year Ph.D. journey. Aram has always been understanding, generous, and full of inspiration and helpful suggestions.

I want to give my special thanks to Professor Greg Ver Steeg. We had countless discussions on different research topics, and Greg's suggestions and insights are invaluable. I learned a lot from Greg about always being enthusiastic and optimistic in research and about how to approach and solve problems from different perspectives with an open mind.

I want to thank my friends and colleagues. Especially, I want to thank my officemate Sahil Garg; I always enjoyed research discussions with him. I would also like to thank Palash Goyal, Rob Brekelmans, Neal Lawton, Dan Moyer, Dave Kale, Kyle Reing, Tozammel Hossain, and Linghong Zhu, with whom I had many enlightening discussions.

Finally, I want to thank my parents for always being a constant source of support and encouragement during my life.

Abstract

Mutual information (MI) has been successfully applied to a wide variety of domains due to its remarkable ability to measure dependencies between random variables. Despite its popularity and widespread usage, a common unavoidable problem of mutual information is its estimation. In this thesis, we demonstrate that a popular class of nonparametric MI estimators based on k-nearest-neighbor graphs requires a number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between strongly dependent variables is possible only for prohibitively large sample sizes. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. My thesis therefore proposes two new estimation strategies to address this issue. The new estimators are robust to local non-uniformity, work well with limited data, and are able to capture relationship strengths over many more orders of magnitude than the existing k-nearest-neighbor methods.

Modern data mining and machine learning present us with problems that may contain thousands of variables, among which we need to identify only the most promising strong relationships. Therefore, caution must be taken when applying mutual information to such real-world scenarios. Taking these concerns into account, my thesis then demonstrates the practical applicability of mutual information on several tasks. In the first task, my thesis suggests an information-theoretic approach for measuring stylistic coordination in dialogues. The proposed measure has a simple predictive interpretation and can account for various confounding factors through proper conditioning. My thesis proposes an MI-based shuffling test to distinguish correlations in length due to contextual factors (topic of conversation, user verbosity, etc.) from turn-by-turn coordination. We also suggest a test to identify whether stylistic coordination persists even after accounting for length coordination and contextual factors.
In the second task, my thesis focuses on feature selec- tion, which is one of the fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. How- ever, practical methods are forced to rely on approximations due to the diculty of estimating mutual information. My thesis demonstrates that approximations made by existing methods are based on unrealistic assumptions. We formulate a more exible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds dene a novel information-theoretic framework for feature selection, which is proved to be optimal under tree graphical models with proper choice of vari- ational distributions. In another machine learning task, my thesis addresses the representation deep learning problem. Advances in unsupervised learning enable reconstruction and generation of samples from complex distributions, but this suc- cess is marred by the inscrutability of the representations learned. My thesis proposes an information-theoretic approach to characterizing disentanglement and dependence in representation learning using multivariate mutual information, also called total correlation. The principle of total Cor-relation Ex-planation (CorEx) has motivated successful unsupervised learning applications across a variety of domains, but under some restrictive assumptions. Here we relax those restrictions iv by introducing a exible variational lower bound to CorEx. Surprisingly, we nd that this lower bound is equivalent to the one in variational autoencoders (VAE) under certain conditions. This information-theoretic view of VAE deepens our understanding of hierarchical VAE and motivates a new algorithm, AnchorVAE, that makes latent codes more interpretable through information maximization and enables generation of richer and more realistic samples. v Contents Acknowledgements ii Abstract iii List of Figures ix List of Tables xiii 1 Introduction 1 1.1 Mutual Information Estimation . . . . . . . . . . . . . . . . . . . . 1 1.2 Mutual Information Applications . . . . . . . . . . . . . . . . . . . 2 1.2.1 Mutual Information-based Linguistic Style Accommodation . 3 1.2.2 Mutual Information-based Feature Selection . . . . . . . . . 4 1.2.3 Mutual Information-based Representation Learning . . . . . 5 1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Background and Related Works 8 2.1 Basic Concepts in Information Theory . . . . . . . . . . . . . . . . 8 2.2 Estimating Mutual Information in Continuous Settings . . . . . . . 9 2.3 Estimating Entropic Measures and Testing of Conditional Indepen- dence in Discrete Settings . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Mutual Information and Equitability . . . . . . . . . . . . . . . . . 11 2.5 Information-theoretic Feature Selection . . . . . . . . . . . . . . . . 12 2.6 Disentangled Representation Learning . . . . . . . . . . . . . . . . . 12 3 Estimating Mutual Information for Strongly Correlated Variables 14 3.1 kNN-based Estimation of Entropic Meatures . . . . . . . . . . . . . 14 3.1.1 Naive kNN Estimator . . . . . . . . . . . . . . . . . . . . . . 15 3.1.2 KSG Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Limitations of kNN-based MI Estimators . . . . . . . . . . . . . . . 18 3.3 Improved kNN-based Estimators . . . . . . . . . . . . 
. . . . . . . 22 3.3.1 Local Nonuniformity Correction (LNC) . . . . . . . . . . . . 22 3.3.2 Estimating Nonuniformity by Local PCA . . . . . . . . . . . 23 vi 3.3.3 Testing for Local Nonuniformity . . . . . . . . . . . . . . . . 23 3.3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Estimation Mutual Information by Local Gaussian Approximation . 31 3.4.1 Local Gaussian Density Estimation . . . . . . . . . . . . . . 31 3.4.2 LGDE-based Estimators for Entropy and Mutual Information 34 3.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4 Modeling Linguistic Style Coordination by Mutual Information 41 4.1 Measuring Stylistic Coordination . . . . . . . . . . . . . . . . . . . 41 4.1.1 Representing Stylistic Features . . . . . . . . . . . . . . . . 41 4.1.2 Information-theoretic measure of coordination . . . . . . . . 42 4.1.3 Estimating mutual information from data . . . . . . . . . . 43 4.2 Length as a confounding factor . . . . . . . . . . . . . . . . . . . . 44 4.3 Understanding Length Coordination . . . . . . . . . . . . . . . . . . 48 4.3.1 Information-theoretic characterization of length coordination 49 4.3.2 Turn-by-Turn Length Coordination Test . . . . . . . . . . . 50 4.4 Revisiting Stylistic Coordination . . . . . . . . . . . . . . . . . . . . 53 4.4.1 Information-theoretic characterization of stylistic coordination 54 4.4.2 Turn-by-Turn Stylistic Coordination Test . . . . . . . . . . . 55 4.5 Stylistic Coordination and Power Relationship . . . . . . . . . . . . 58 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5 Variational Information Maximization for Feature Selection 64 5.1 Previous Mutual Information-based Feature Selection Algorithms . 64 5.2 Limitations of Previous Mutual Information-based Feature Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.3.1 Variational Mutual Information Lower Bound . . . . . . . . 68 5.3.2 Choice of Variational Distribution . . . . . . . . . . . . . . . 70 5.3.3 Estimating Lower Bound From Data . . . . . . . . . . . . . 73 5.3.4 Variational Forward Feature Selection Under Auto-Regressive Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6 Disentangled Representation Learning by Mutual Information 79 6.1 Total Correlation and Informativeness . . . . . . . . . . . . . . . . 79 6.2 Total Correlation Explanation Representation Learning . . . . . . . 80 6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.3.1 Variational Lower Bound for I (x i : z) . . . . . . . . . . . . 84 6.3.2 Variational Upper Bound for I (z i : x) . . . . . . . . . . . . 84 vii 6.4 Connection to Variational Autoencoders . . . . . . . . . . . . . . . 85 6.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.5.1 Disentangling Latent Codes via Hierarchical VAE / Stacking CorEx on MNIST . . . . . . . . . . . . . . . . . . . . . . . . 90 6.5.2 Learning Interpretable Representations through Information Maximizing VAE / CorEx on CelebA . . . . . . . . . . . . 91 6.5.3 Generating Richer and More Realistic Images via CorEx . . 94 6.6 Conclusion . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 95 7 Conclusion and Future Directions 102 7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 A Appendix for Chapter 3 106 A.1 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 A.2 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 A.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 A.4 Derivation of Eq. 3.15 . . . . . . . . . . . . . . . . . . . . . . . . . 112 A.5 Empirical Evaluation for k;d . . . . . . . . . . . . . . . . . . . . . 113 A.6 More Functional Relationship Tests in Two Dimensions for LNC estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 B Appendix for Chapter 5 117 B.1 Detailed Algorithm for Variational Forward Feature Selection . . . . 117 B.2 Optimality Under Tree Graphical Models . . . . . . . . . . . . . . . 119 B.3 Datasets and Results . . . . . . . . . . . . . . . . . . . . . . . . . . 123 B.4 Generating Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 124 Bibliography 125 viii List of Figures 3.1 Centered at a given sample point, x (i) , we show the max-norm rect- angle containingk nearest neighbors (a) for points drawn from a uni- form distribution, k = 3, and (b) for points drawn from a strongly correlated distribution, k = 4. . . . . . . . . . . . . . . . . . . . . . 19 3.2 A semi-logarithmic plot ofN s (number of required samples to achieve an error at most ") for KSG estimator for dierent values of I(X : Y ). We set " = 0:1, k = 1. . . . . . . . . . . . . . . . . . . . . . . 21 3.3 For all the functional relationships above, we used a sample size N = 5000 for each noise level and the nearest neighbor parameter k = 5 for LNC, KSG and GNN estimators. . . . . . . . . . . . . . . 26 3.4 Estimated MI using both KSG and LNC estimators in the number of samples (k = 5 and k;d = 0:37 for 2D examples; k = 8 and k;d = 0:12 for 5D examples) . . . . . . . . . . . . . . . . . . . . . . 27 3.5 Spearman correlation coecient between the original MI rank and the rank after hiding some percentage of data by KSG and LNC estimator respectively. The 95% condence bars are obtained by repeating the experiment for 200 times. . . . . . . . . . . . . . . . . 29 3.6 Two examples of synergistic triplets: ^ I KSG (X : Y : Z) = 0:14 and ^ I LNC (X : Y : Z) = 0:95 for the rst example; ^ I KSG (X : Y : Z) = 0:05 and ^ I LNC (X :Y :Z) = 0:7 for the second example . . . . . . . 30 3.7 Functional relationship test for mutual information estimators. The horizontal axis is the value of which controls the noise level; the vertical axis is the mutual information in nats. For the Kraskov and GNN estimators we used nearest neighbor parameter k = 5. For the local Gaussian estimator, we choose the bandwidth to be the distance between a point and its 5rd nearest neighbor. . . . . . . . 38 ix 4.1 Coordination measures for the Supreme Court data. The red (blue) dots give the true CMI (MI). The green dots represent CMI under the null hypothesis that there is no coordination after conditioning. (a) Lawyers coordinating to Judges. (b) Judges coordinating to Lawyers. In both gures, the conditional mutual information is sig- nicantly smaller than the mutual information for all eight stylistic features, indicating length is a confounding factor. . . . . . . . . . . 45 4.2 Coordination measures for the Wikipedia data. 
(a) Non-admins coordinating to Admins. (b) Admins coordinating to Non-admins. Symbols have the same interpretation as in the previous plot. . . . 46 4.3 A Bayesian network model for length coordination. The network containing contextual factors, C, the length of an utterance, L (t) O , and the length of the response, L (t) R . (a) The lengths are correlated only due to contextual factors. (b) The lengths are correlated due to both contextual factors and potential eect of turn-by-turn level coordination (represented with the dotted line). . . . . . . . . . . . 48 4.4 Turn-by-turn length coordination test. (a) Supreme Court dataset. (b) Wikipedia dataset. In both two subgures,OLC 1 is signicantly smaller than OLC 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.5 A Bayesian network for linguistic style coordination. L O andL R rep- resent length of the respondent and length of the originator respec- tively. F m O andF m R represent a specic style feature variable for the respondent and originator. . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Turn-by-turn stylistic coordination test for Supreme Court data. (a) Lawyers coordinating to Judges. (b) Judges coordinating to Lawyers. (Blue bars indicate the overall stylistic coordination(OSC) before the test). One can see that after shuing, values of OSC 1 are within the zero-information condence intervals. . . . . . . . . . 57 4.7 Turn-by-turn stylistic coordination test for Wikipedia data. (a) Non-admins coordinating to Admins. (b) Admins coordinating to Non-admins. (Blue bars indicate the overall stylistic coordination(OSC) before the test). One cannot rule out the null hypothesis that the remnant stylistic coordination is due to the contextual factors. . . . 57 4.8 SVM Prediction Accuracy for both stylistic coordination features and length coordination features . . . . . . . . . . . . . . . . . . . 59 5.1 Graphical models assumptions for mutual information approxima- tions. The rst two graphical models show the assumptions of tra- ditional MI-based feature selection methods. The third graphical model shows a scenario when both Assumption 1 and Assumption 2 are true. Dashed line indicates there may or may not be a corre- lation between two variables. . . . . . . . . . . . . . . . . . . . . . . 67 x 5.2 Auto-regressive decomposition for q(x S jy) . . . . . . . . . . . . . . 71 5.3 (Left) This is the generative model used for synthetic experiments. Edge thickness represents the relationship strength. (Right) Opti- mizing the lower bound byVMI naive . Variables under the blue line denote the features selected at each step. Dotted blue line shows the decreasing lower bound if adding more features. Ground-truth mutual information is obtained using N = 100; 000 samples. . . . . 76 5.4 Number of selected features versus average cross-validation error in datasets Semeion and Gisette. . . . . . . . . . . . . . . . . . . . . 77 6.1 The graphical model for p (x; z) assuming p (zjx) achieves the global maximum in Eq. 6.4. In this model, all x i are factorized conditioned on z, and all z i are independent. . . . . . . . . . . . . 81 6.2 Encoder and decoder models for MNIST, where z (1) is 64 dimen- sional continuous variable and z (2) is a discrete variable (one hot vector with length ten). . . . . . . . . . . . . . . . . . . . . . . . . 90 6.3 Varying the latent codes of z (1) on MNIST. In both gures, each row corresponds to a xed discrete number in layer z (2) . 
Dierent columns correspond to the varying noise from the selected latent node in layer z (1) from left to right, while keeping other latent codes xed. In (a) varying the noise results in dierent rotations of the digit; In (b) a small (large) value of the latent code corresponds to wider (narrower) digit. . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.4 Mutual information between input data x and each latent variable z i in CelebA with AnchorVAE. It is clear that the anchored rst ve dimensions have the highest mutual information with x. . . . . 97 6.5 Manipulating latent codes z 0 ; z 1 ; z 2 ; z 3 ; z 4 on CelebA using Anchor- VAE: We show the eect of the anchored latent variables on the outputs while traversing their values from [-3,3]. Each row repre- sents a dierent seed image to encode latent codes. Each anchored latent code represents a dierent factor on interpretablility. (a) Skin Color (b) Azimuth (c) Emotion (Smile) (d) Hair (less or more) (e) Lighting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.6 Manipulating top two latent codes with the most mutual informa- tion on CelebA using original VAE. We observe that both latent codes learned entangled representations. (a) z 130 entangles skin color with hair; (b) z 610 entangles emotion with azimuth. . . . . . . 99 6.7 Variance statistics forp (z) on celebA after training a standard VAE with 128 latent codes. . . . . . . . . . . . . . . . . . . . . . . . . . 100 xi 6.8 Dierent sampling strategies of latent codes for CelebA dataset on VAE / CorEx. Sampling latent codes from Q m i=1 p (z i ) in (b) yields better quality images than sampling from a standard normal distri- bution in (a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 A.5.1b k;d as a function of k. k ranges over [d; 20] for each dimension d. . 115 A.6.1Mutual Information tests of LNC,KSG,GNN,MST ,EXP esti- mators. Twenty-one functional relationships with dierent noise intensities are tested. Noise has the form U[=2;=2]where varies(as shown in X axis of the plots). For KSG, GNN and LNC estimators, nearest neighbor parameter k = 5. We are using N = 5; 000 data points for each noisy functional relationship. . . . . . . . 116 B.2.1Demonstration of tree graphical model, label y is the root node. . . 120 xii List of Tables 5.1 Time complexity in number of features D, selected number of fea- tures d, and number of samples N. . . . . . . . . . . . . . . . . . . 74 5.2 Mutual information between label y and each feature x i for Fig. 5.3. I(x i : y) is estimated using N=100,000 samples. Top three variables with highest mutual information are highlighted in bold. . . . . . . 75 5.3 Average cross-validation error rate comparison of VMI against other methods. The last two lines indicate win(W)/tie (T)/ loss(L) forVMI naive andVMI pairwise respectively. . . . . . . . . . . . . . 77 B.1 Dataset summary. N: # samples, d: # features, L: # classes. . . . 124 xiii To my parents Qiaoyun Shan, Jian Gao To my grandparents Fuyin Ju, Erbao Gao, Xuehong Huan, Jincai Shan xiv Chapter 1 Introduction 1.1 Mutual Information Estimation As a measure of statistical dependence, mutual information (MI) is remarkably general and has several intuitive interpretations [Cover and Thomas, 1991]. The concept of mutual information was introduced by [Shannon, 1948]. Since then there have been extensive studies and developments around mutual information. One common unavoidable problem of mutual information is its estimation. 
In a typical scenario, one needs to estimate mutual information using independently and iden- tically distributed (i.i.d.) samples while the underlying distribution of the random variables is unknown. A naive approach to this problem is to rst estimate the underlying probability mass/density, and then calculate mutual information using its denitions. Such approaches are usually called plug-in methods [Antos and Kon- toyiannis, 2001], of which the theoretical properties is well studied in [Paninski, 2003]. Unfortunately, estimating joint densities from a limited number of samples is often infeasible in many practical settings. Another slightly dierent approach is to estimate mutual information directly from samples in a non-parametric way without calculating the density. The main intuition behind such direct estima- tors is that evaluating mutual information can in principle be a more tractable problem than estimating the density over the whole state space [P erez-Cruz, 2008] The most popular class of the estimators taking this approach is the k-nearest- neighbor(kNN) based estimators. One example is due to Kraskov, St ogbauer, and 1 Grassberger (referred to as the KSG estimator from here on) [Kraskov et al., 2004], which has been extended to generalized nearest neighbor graphs [P al et al., 2010]. Despite the wide studies of mutual information estimators, this thesis will demonstrate that existing kNN-based estimators suer from a critical yet over- looked aw. Such aw implies stronger relationship are actually more dicult to measure. This counter-intuitive property re ects the fact that most work on mutual information estimation has focused on estimators that are good at detect- ing independence of variables rather than precisely measuring strong dependence. After all, in the era of big data, it is often the case that many strong relationships are present in the data and we are interested in picking out only the strongest ones. Existing mutual information estimation approaches will perform poorly for this task if their accuracy is low in the regime of strong dependence. We show that the undesired behavior of kNN-based mutual information estimators can be attributed to the assumption of local uniformity utilized by those estimators, which can be violated for suciently strong (almost deterministic) dependencies. There- fore, I proposes two new mutual information estimators to address this issue based on local uniformity correction and local Gaussian approximation. 1.2 Mutual Information Applications It is insucient to only care about mutual information estimation. Another ques- tion would be how to properly apply mutual information to dierent domains. After all, mutual information has become an important tool in a variety of applications, like machine learning, biology, neuroscience, etc. In this thesis, I demonstrate the wide applicability of mutual information in three dierent tasks, with dierent mutual information estimators. 2 1.2.1 Mutual Information-based Linguistic Style Accom- modation Communication Accommodation Theory [Giles et al., 1991] states that people tend to adapt their communication style (voice, gestures, word choice, etc.) in response to the person with whom they interact. Originally, experiments on linguistic accommodation were conned to small scale laboratory settings with a handful of participants. The recent proliferation of digital (or digitized) communication data oers an opportunity to study nuances of human communication behavior on much larger scales. 
A number of recent studies have indicated presence of stylistic coordination in communication [Ireland et al., 2011, Gonzales et al., 2010, Danescu-Niculescu-Mizil et al., 2011, 2012], where one person's use of a linguistic feature (e.g. prepositions) increases the probability that a response will include the same feature. Linguistic style coordination (or matching) has been used to predict relationship stability [Ireland et al., 2011] and negotiation outcomes [Tay- lor and Thomas, 2008], understand group cohesiveness [Gonzales et al., 2010], and infer relative social status and power relationships among individuals [Danescu- Niculescu-Mizil et al., 2012]. Most reports of linguistic style coordination have been based on correlational analysis. Thus, such claims are susceptible to various confounding eects. For instance, it is known that there is signicant length coordination in dialogues, in the sense that a longer utterance from user Y tends to solicit a longer response from user X Niederhoer and Pennebaker [2002]. Thus, if the probability of an utterance containing a feature, e.g. prepositions or words whose second letter is \r", depends only on length, this will create the illusion of stylistic coordination on the given feature. In this thesis, I will propose an information-theoretic framework 3 for characterizing stylistic coordination in dialogues using mutual information and detecting those confounding eects. 1.2.2 Mutual Information-based Feature Selection Feature selection is one of the fundamental problems in machine learning research [Dash and Liu, 1997, Liu and Motoda, 2012]. Many problems include a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of fea- tures to avoid over-tting, to speed up computation, or simply to improve the interpretability of the results. Feature selection approaches are usually categorized into three groups: wrap- per, embedded and lter [Kohavi and John, 1997, Guyon and Elissee, 2003, Brown et al., 2012]. The rst two methods, wrapper and embedded, are considered classier-dependent, i.e., the selection of features somehow depends on the classi- er being used. Filter methods, on the other hand, are classier-independent and dene a scoring function between features and labels in the selection process. Because lter methods may be employed in conjunction with a wide variety of classiers, it is important that the scoring function of these methods is as general as possible. Since mutual information (MI) is a general measure of dependence with several unique properties [Cover and Thomas, 1991], many MI-based scoring functions have been proposed as lter methods [Battiti, 1994, Yang and Moody, 1999, Fleuret, 2004, Peng et al., 2005, Rodriguez-Lujan et al., 2010, Nguyen et al., 2014]; see [Brown et al., 2012] for an exhaustive list. Owing to the diculty of estimating mutual information in high dimensions, most existing MI-based feature selection methods are based on various low-order approximations for mutual information. Such methods are inherently heuristic in 4 nature and lack theoretical guarantees. More scalable feature selection method has been developed inspired by group testing [Zhou et al., 2014], but this method also requires the calculation of high-dimensional mutual information as a basic scoring function. 
In this thesis, we will introduce a novel feature selection method based on a tractable variational lower bound on mutual information to address these issues. 1.2.3 Mutual Information-based Representation Learning Learning representations from data without labels has become increasingly impor- tant to solving some of the most crucial problems in machine learning|including tasks in image, language, speech, etc. Bengio et al. [2013]. Complex models, such as deep neural networks, have been successfully applied to generative modeling with high-dimensional data. From these methods we can either infer hidden represen- tations with variational autoencoders (VAE) Kingma and Welling [2013], Rezende et al. [2014] or generate new samples with VAE or generative adversarial networks (GAN) Goodfellow et al. [2014]. Building on these successes, an explosive amount of recent eort has focused on interpreting learned representations, which could have signicant implications for subsequent tasks. Methods like InfoGAN Chen et al. [2016] and -VAE Higgins et al. [2017] are able to learn disentangled and interpretable representations in a completely unsupervised fashion. Information theory provides a natural frame- work for understanding representation learning and continues to generate new insights Alemi et al. [2017], Shwartz-Ziv and Tishby [2017], Achille and Soatto [2018], Saxe et al. [2018]. 5 In this thesis, we discuss the problem of learning disentangled and interpretable representations in a purely information-theoretic way. Instead of making assump- tions about the data generating process at the beginning, we consider the question of how informative the underlying latent variable z is about the original data variable x. We would like z to be as informative as possible about the relation- ships in x while remaining as disentangled as possible in the sense of statistical independence. This principle has been previously proposed as Cor-relation Ex- planation (CorEx) Ver Steeg and Galstyan [2014], Ver Steeg [2017]. By optimizing appropriate information-theoretic measures, CorEx denes not only an informative representation but also a disentangled one, thus eliciting a natural comparison to the recent literature on interpretable machine learning. However, computing the CorEx objective can be challenging, and previous studies have been restricted to cases where random variables are either discrete Ver Steeg and Galstyan [2014], or Gaussian Ver Steeg and Galstyan [2017]. My thesis formulates a variational lower bound for CorEx and extends this information-theoretic method to deep neural networks. 1.3 Overview The outline of the dissertation is as follows: In Chapter 2, we describe background and related works on mutual information. Specically, we discuss several previous mutual information estimators. we also review the previous work on linguistic style coordination , mutual information-based feature selection and representation learning. In Chapter 3, we point out the limitations of kNN-based mutual informa- tion estimators and propose two new estimators. In Chapter 4, we propose mutual 6 information as a measure of linguistic coordination and use it to detect confound- ing eects. In Chapter 5, we focus on mutual information-based feature selection and propose a new feature selection method based on a variational mutual infor- mation lower bound. 
In Chapter 6, we propose an information-theoretic objective for disentangled representation learning and derive a variational lower bound to optimize it with deep neural networks. In Chapter 7, we summarize the thesis and discuss future directions.

Chapter 2 Background and Related Works

2.1 Basic Concepts in Information Theory

Let $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ denote a $d$-dimensional absolutely continuous random variable whose probability density function is defined as $f_X : \mathbb{R}^d \to \mathbb{R}$ and whose marginal densities for each $x_j$ are defined as $f_j : \mathbb{R} \to \mathbb{R}$, $j = 1, \ldots, d$. Shannon differential entropy and mutual information are defined in the usual way:

$$H(\mathbf{x}) = -\int_{\mathbb{R}^d} f_X(\mathbf{x}) \log f_X(\mathbf{x})\, d\mathbf{x} \qquad (2.1)$$

$$I(\mathbf{x}) = \int_{\mathbb{R}^d} f_X(\mathbf{x}) \log \frac{f_X(\mathbf{x})}{\prod_{j=1}^{d} f_j(x_j)}\, d\mathbf{x} \qquad (2.2)$$

We use natural logarithms, so that information is measured in nats. For $d > 2$, this generalized mutual information is also called total correlation [Watanabe, 1960b] or multi-information [Studený and Vejnarová, 1998]. Further define $\mathbf{y}$ as a $b$-dimensional absolutely continuous random variable whose probability density function is $f_Y : \mathbb{R}^b \to \mathbb{R}$, and let $f_{XY}$ denote the joint probability density function of $\mathbf{x}$ and $\mathbf{y}$. Then the mutual information between $\mathbf{x}$ and $\mathbf{y}$ can be written as:

$$I(\mathbf{x}; \mathbf{y}) = \int_{\mathbf{y} \in \mathbb{R}^b} \int_{\mathbf{x} \in \mathbb{R}^d} f_{XY}(\mathbf{x}, \mathbf{y}) \log \frac{f_{XY}(\mathbf{x}, \mathbf{y})}{f_X(\mathbf{x})\, f_Y(\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \qquad (2.3)$$

The difference between Eq. 2.2 and Eq. 2.3 is that the first definition captures the dependencies across all the dimensions of the random variable $\mathbf{x}$, while the latter definition considers the dependence only between two multi-dimensional variables. We always have the following equality, where $I(\mathbf{x}, \mathbf{y})$ denotes Eq. 2.2 applied to the concatenation of $\mathbf{x}$ and $\mathbf{y}$:

$$I(\mathbf{x}; \mathbf{y}) = I(\mathbf{x}, \mathbf{y}) - I(\mathbf{x}) - I(\mathbf{y}) \qquad (2.4)$$

2.2 Estimating Mutual Information in Continuous Settings

There has been a significant amount of work on estimating entropic measures such as divergences and mutual information from samples (see the survey [Walters-Williams and Li, 2009] for an exhaustive list). [Khan et al., 2007] compared different mutual information estimators for varying sample sizes and noise intensities, and reported that for small samples the KSG estimator [Kraskov et al., 2004] was the best choice overall at relatively low noise intensities, while kernel density estimation (KDE) performed better at higher noise intensities. Other approaches include estimators based on generalized nearest-neighbor graphs [Pál et al., 2010], minimum spanning trees [Müller et al., 2012], maximum likelihood density ratio estimation [Suzuki et al., 2008], and ensemble methods [Sricharan et al., 2013, Moon and Hero, 2014]. In particular, the latter approach works by taking a weighted average of simple density plug-in estimates such as kNN or KDE.

It has been recognized that kNN-based entropic estimators underestimate the probability density at sample points that are close to the support boundary [Liitiäinen et al., 2010]. [Sricharan et al., 2012] proposed a bipartite plug-in estimator for non-linear density functionals that extrapolates the density estimates at interior points that are close to the boundary in order to compensate for the boundary bias. However, this method requires identifying boundary and interior points, which is a difficult problem when mutual information is large, so that almost all the points are close to the boundary. [Singh and Póczos, 2014] used a "mirror image" kernel density estimator to escape the boundary effect, but their estimator relies on knowledge of the support of the densities.
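As a concrete reference point for the definitions above, the short sketch below (not from the thesis; the correlation, sample size, and bin count are arbitrary illustrative choices) compares the closed-form mutual information of a bivariate Gaussian with a naive histogram plug-in estimate of Eq. 2.3, the kind of density-based approach whose limitations motivate the direct estimators discussed in this chapter.

```python
# Minimal sketch (not from the thesis): closed-form MI of a bivariate Gaussian
# versus a naive histogram plug-in estimate of Eq. 2.3 on binned data.
import numpy as np

def gaussian_mi(rho):
    # Ground-truth MI of a bivariate Gaussian with correlation rho, in nats.
    return -0.5 * np.log(1.0 - rho ** 2)

def histogram_plugin_mi(x, y, bins=20):
    # Plug-in estimate: discretize, estimate the joint pmf, apply the MI definition.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(0)
rho, n = 0.9, 5000
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
print("true MI:", gaussian_mi(rho))               # ~0.830 nats
print("plug-in MI:", histogram_plugin_mi(x, y))   # binning and finite-sample effects make this only rough
```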
2.3 Estimating Entropic Measures and Testing of Conditional Independence in Discrete Set- tings There has been solid theoretical studies of estimating entropic measures on discrete variables. For the plug-in estimator of entropy, the asymptotic variance is obtained in [Basharin, 1959], and [Antos and Kontoyiannis, 2001] has shown the plug-in estimator is always consistent. For sample complexity developments, [Paninski, 2004] rst proved the existence of consistent entropy estimators using sublinear sample size n = o(k), where k is the is cardinality. [Valiant and Valiant, 2010, 2011] further extended the minimal sample size for consistent estimation to be k logk . [Wu and Yang, 2016] renes the results by showing the minimax mean-square error via best polynomial approximation. See [Paninski, 2003, Acharya et al., 2014, Jiao et al., 2015, Wu and Yang, 2015, Han et al., 2015a,b] for recent developments on estimating entropy, distribution functionals on discrete variables. As a statistical measure, conditional mutual information is also an important metric for testing conditional independence [Dobrushin, 1959, Wyner, 1978]. The eld of testing conditional independence has been extensively studied in the last 10 century [Fisher, 1924, Cochran, 1954, Mantel and Haenszel, 1959, Agresti, 1992] and has signicant implications in the eld of structure learning, Bayesian network testing and fairness machine learning [Neapolitan et al., 2004, Tsamardinos et al., 2006, Hardt et al., 2016, Canonne et al., 2017]. Recently, [Canonne et al., 2017] studied the property of testing conditional independence under discrete settings in the framework of distribution testing [Batu et al., 2000, Canonne, 2015, Goldreich, 2017]. They use the attening technique introduced in [Diakonikolas and Kane, 2016] to obtain the rst sublinear sample complexity with respect to total variance distance metric. It is worth-noting that [Canonne et al., 2017] also generalized their analysis for conditional mutual information and proved a sample-ecient algorithm exists in binary settings. 2.4 Mutual Information and Equitability [Reshef et al., 2011] introduced a property they called \suitability" for a measure of correlation. If two variables are related by a functional form with some noise, equitable measures should re ect the magnitude of the noise while being insensi- tive to the form of the functional relationship. They used this notion to justify a new correlation measure called MIC. Based on comparisons with mutual informa- tion using the kNN estimator, they concluded that MIC is \more equitable" for comparing relationship strengths. While several problems [Simon and Tibshirani, 2014, Gorne et al.] and alternatives [Heller et al., 2013, Sz ekely et al., 2009] were pointed out, [Kinney and Atwal, 2014] (KA) showed that MIC's apparent superiority to mutual information was actually due to aws in estimation. A more careful denition of equitability led KA to the conclusion that mutual informa- tion is actually more equitable than MIC. KA suggest that mutual information 11 estimation could be improved by using more samples for estimation. However, in this proposal, we will show that the number of samples required for KSG is pro- hibitively large, but that this diculty can be overcome by using improved mutual information estimators. 2.5 Information-theoretic Feature Selection There has been a signicant amount of work on information-theoretic feature selec- tion in the past twenty years Brown et al. 
[2012], Battiti [1994], Yang and Moody [1999], Fleuret [2004], Peng et al. [2005], Lewis [1992], Rodriguez-Lujan et al. [2010], Nguyen et al. [2014], Cheng et al. [2011], to name a few. Most of these methods are based on combinations of so-called relevant, redundant, and complementary information. Such combinations represent low-order approximations of mutual information and are derived from two assumptions that are unrealistic to hold simultaneously. A more scalable feature selection method inspired by group testing has been developed Zhou et al. [2014], but this method also requires the calculation of high-dimensional mutual information as a basic scoring function.

2.6 Disentangled Representation Learning

The notion of disentanglement in representation learning lacks a unique characterization, but it generally refers to latent factors which are individually interpretable, amenable to simple downstream modeling or transfer learning, and invariant to nuisance variation in the data Bengio et al. [2013]. The common definition based on statistical independence Achille and Soatto [2017], Dinh et al. [2014] amounts to minimizing total correlation, an idea with a rich history Barlow [1989], Comon [1994], Schmidhuber [1992]. However, there are numerous alternatives not rooted in independence. Higgins et al. [2017] measure disentanglement by the identifiability of changes in a single latent dimension. More concretely, they vary only one latent variable with the others fixed, apply the learned decoder and encoder to reconstruct the latent space, and propose that a classifier should be able to predict the varied dimension for a disentangled representation. The work of Thomas et al. [2017], Bengio et al. [2017] is similar in spirit, identifying disentangled factors as changes in a latent embedding that can be controlled via reinforcement learning. Alternatively, models such as InfoGAN Chen et al. [2016] assume prior knowledge of the number of desired factors of variation.

Chapter 3 Estimating Mutual Information for Strongly Correlated Variables

In this chapter, we discuss two new approaches to estimating mutual information. We first introduce kNN-based non-parametric entropy and mutual information estimators. After that, we demonstrate the limitations of kNN-based mutual information estimators. We then suggest a correction term to overcome these limitations by proposing a local nonuniformity correction (LNC) estimator, and we also construct another mutual information estimator based on local Gaussian approximation. We show empirically, using synthetic and real-world data, that our methods outperform existing techniques.

3.1 kNN-based Estimation of Entropic Measures

In this section, we focus on the mutual information estimator $\widehat{I}(\mathbf{x})$ for Eq. 2.2 given $N$ i.i.d. samples $\mathcal{X} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$ drawn from $f_X(\mathbf{x})$. We first introduce the naive kNN estimator for mutual information, which is based on an entropy estimator due to [Singh et al., 2003], and show its theoretical properties. Next, we focus on a popular variant of kNN estimators, the KSG estimator [Kraskov et al., 2004].

3.1.1 Naive kNN Estimator

Entropy Estimation. The naive kNN entropy estimator is as follows:

$$\widehat{H}^{0}_{kNN,k}(\mathbf{x}) = -\frac{1}{N}\sum_{i=1}^{N} \log \widehat{f}_k\big(\mathbf{x}^{(i)}\big) \qquad (3.1)$$

where

$$\widehat{f}_k\big(\mathbf{x}^{(i)}\big) = \frac{k}{N-1} \cdot \frac{\Gamma(d/2+1)}{\pi^{d/2}} \cdot \frac{1}{r_k\big(\mathbf{x}^{(i)}\big)^{d}} \qquad (3.2)$$

and $r_k\big(\mathbf{x}^{(i)}\big)$ in Eq. 3.2 is the Euclidean distance from $\mathbf{x}^{(i)}$ to its $k$th nearest neighbor in $\mathcal{X}$. By introducing a correction term, an asymptotically unbiased estimator is obtained:

$$\widehat{H}_{kNN,k}(\mathbf{x}) = -\frac{1}{N}\sum_{i=1}^{N} \log \widehat{f}_k\big(\mathbf{x}^{(i)}\big) - \zeta_k \qquad (3.3)$$

where

$$\zeta_k = \frac{k^k}{(k-1)!} \int_{0}^{\infty} \log(x)\, x^{k-1} e^{-kx}\, dx = \psi(k) - \log(k) \qquad (3.4)$$

and $\psi(\cdot)$ represents the digamma function. The following theorem shows the asymptotic unbiasedness of $\widehat{H}_{kNN,k}(\mathbf{x})$ according to [Singh et al., 2003].

Theorem 1 (kNN entropy estimator, asymptotic unbiasedness, [Singh et al., 2003]). Assume that $\mathbf{x}$ is absolutely continuous and $k$ is a positive integer; then

$$\lim_{N\to\infty} \mathbb{E}\big[\widehat{H}_{kNN,k}(\mathbf{x})\big] = H(\mathbf{x}) \qquad (3.5)$$

i.e., this entropy estimator is asymptotically unbiased.

From Entropy to Mutual Information. Constructing a mutual information estimator from an entropy estimator is straightforward: we combine entropy estimators using the identity [Cover and Thomas, 1991]:

$$I(\mathbf{x}) = \sum_{i=1}^{d} H(x_i) - H(\mathbf{x}) \qquad (3.6)$$

Combining Eqs. 3.1 and 3.6, we have

$$\widehat{I}^{0}_{kNN,k}(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} \log \frac{\widehat{f}_k\big(\mathbf{x}^{(i)}\big)}{\widehat{f}_k\big(x^{(i)}_1\big)\,\widehat{f}_k\big(x^{(i)}_2\big)\cdots\widehat{f}_k\big(x^{(i)}_d\big)} \qquad (3.7)$$

where $x^{(i)}_j$ denotes the projection of $\mathbf{x}^{(i)}$ onto the $j$th dimension and $\widehat{f}_k\big(x^{(i)}_j\big)$ represents the corresponding marginal kNN density estimate. Similarly to Eq. 3.3, we can also construct an asymptotically unbiased mutual information estimator based on Theorem 1:

$$\widehat{I}_{kNN,k} = \widehat{I}^{0}_{kNN,k} - (d-1)\,\zeta_k \qquad (3.8)$$

Corollary 1 (kNN MI estimator, asymptotic unbiasedness). Assume that $\mathbf{x}$ is absolutely continuous and $k$ is a positive integer; then:

$$\lim_{N\to\infty} \mathbb{E}\big[\widehat{I}_{kNN,k}(\mathbf{x})\big] = I(\mathbf{x}) \qquad (3.9)$$

3.1.2 KSG Estimator

The KSG mutual information estimator [Kraskov et al., 2004] is a popular variant of the naive kNN estimator. The general principle of KSG is that, for the density estimators in the different (marginal and joint) spaces, we would like to use similar length scales for the k-nearest-neighbor distance as in the joint space, so that the biases approximately cancel. Although the theoretical properties of this estimator are unknown, it has relatively good performance in practice; see [Khan et al., 2007] for a comparison of different estimation methods.

Unlike the naive kNN estimator, the KSG estimator uses the max-norm distance instead of the L2-norm. In particular, if $\epsilon_{i,k}$ is twice the (max-norm) distance to the $k$-th nearest neighbor of $\mathbf{x}^{(i)}$, it can be shown that the expectation value (over all ways of drawing the surrounding $N-1$ points) of the log probability mass within the box centered at $\mathbf{x}^{(i)}$ is given by this expression:

$$\mathbb{E}_{i,k}\left[ \log \int_{|\mathbf{x}-\mathbf{x}^{(i)}|_{\infty} \le \epsilon_{i,k}/2} f(\mathbf{x})\, d\mathbf{x} \right] = \psi(k) - \psi(N) \qquad (3.10)$$

If we assume that the density inside the box (with sides of length $\epsilon_{i,k}$) is constant, then the integral becomes trivial and we find that

$$\log\!\big(f(\mathbf{x}^{(i)})\, \epsilon_{i,k}^{d}\big) = \psi(k) - \psi(N). \qquad (3.11)$$

Rearranging and taking the mean over $\log f(\mathbf{x}^{(i)})$ leads us to the following entropy estimator:

$$\widehat{H}_{KSG,k}(\mathbf{x}) \approx \psi(N) - \psi(k) + \frac{d}{N}\sum_{i=1}^{N} \log \epsilon_{i,k} \qquad (3.12)$$

Note that $k$, defining the size of the neighborhood to use in local density estimation, is a free parameter. Using smaller $k$ should be more accurate, but larger $k$ reduces the variance of the estimate [Khan et al., 2007]. While consistent density estimation requires $k$ to grow with $N$ [Von Luxburg and Alamgir, 2013], entropy estimates converge almost surely for any fixed $k$ [Wang et al., 2009a, Pérez-Cruz, 2008] under some weak conditions. (This assumes the probability density is absolutely continuous, but see [Pál et al., 2010] for some technical concerns.)

To estimate the mutual information, in the joint $\mathbf{x}$ space we set $k$, the size of the neighborhood, which determines $\epsilon_{i,k}$ for each point $\mathbf{x}^{(i)}$. Next, we consider the smallest rectilinear hyper-rectangle that contains these $k$ points, which has sides of length $\epsilon^{x_j}_{i,k}$ for each marginal direction $x_j$. We refer to this as the "max-norm rectangle" (as shown in Fig. 3.1(a)). Let $n_{x_j}(i)$ be the number of points at a distance less than or equal to $\epsilon^{x_j}_{i,k}/2$ in the $x_j$-subspace. For each marginal entropy estimate, we use $n_{x_j}(i)$ instead of $k$ to set the neighborhood size at each point. Finally, in the joint space, using a rectangle instead of a box in Eq. 3.12 leads to a correction term of size $(d-1)/k$ (details are given in [Kraskov et al., 2004]). Adding the entropy estimators together with these choices yields the following:

$$\widehat{I}_{KSG,k}(\mathbf{x}) \approx (d-1)\,\psi(N) + \psi(k) - (d-1)/k - \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d} \psi\big(n_{x_j}(i)\big) \qquad (3.13)$$

3.2 Limitations of kNN-based MI Estimators

In this section, we demonstrate a significant flaw in kNN-based MI estimators, which is summarized in the following theorems. We then explain why these estimators fail to accurately estimate mutual information unless the correlations are relatively weak or the number of samples is large.

Figure 3.1: Centered at a given sample point, $\mathbf{x}^{(i)}$, we show the max-norm rectangle containing $k$ nearest neighbors (a) for points drawn from a uniform distribution, $k = 3$, and (b) for points drawn from a strongly correlated distribution, $k = 4$.

Theorem 2. For any $d$-dimensional absolutely continuous probability density function $p(\mathbf{x})$ and any $k \ge 1$, for the estimated mutual information to be close to the true mutual information, $|\widehat{I}_{kNN,k}(\mathbf{x}) - I(\mathbf{x})| \le \varepsilon$, the number of samples $N$ must be at least $N \ge C\,\exp\!\big(\tfrac{I(\mathbf{x}) - \varepsilon}{d-1}\big) + 1$, where $C$ is a constant which scales like $O(1/d)$.

The proof of Theorem 2 is given in Appendix A.1.

Theorem 3. For any $d$-dimensional absolutely continuous probability density function $p(\mathbf{x})$ and any $k \ge 1$, for the estimated mutual information to be close to the true mutual information, $|\widehat{I}_{KSG,k}(\mathbf{x}) - I(\mathbf{x})| \le \varepsilon$, the number of samples $N$ must be at least $N \ge C\,\exp\!\big(\tfrac{I(\mathbf{x}) - \varepsilon}{d-1}\big) + 1$, where $C = e^{-\frac{k-1}{k}}$.

Proof. Note that $\psi(n) = H_{n-1} - \gamma$, where $H_n$ is the $n$-th harmonic number and $\gamma \approx 0.577$ is the Euler-Mascheroni constant.

$$\begin{aligned}
\widehat{I}_{KSG,k}(\mathbf{x}) &\le (d-1)\,\psi(N) + \psi(k) - (d-1)/k - \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d}\psi(k) \\
&= (d-1)\big(\psi(N) - \psi(k) - 1/k\big) \\
&= (d-1)\big(H_{N-1} - H_{k-1} - 1/k\big) \\
&\le (d-1)\big(\log(N-1) + (k-1)/k\big)
\end{aligned} \qquad (3.14)$$

The first inequality is obtained from Eq. 3.13 by observing that $n_{x_j}(i) \ge k$ for any $i, j$ and that $\psi(\cdot)$ is a monotonically increasing function. The last inequality is obtained by dropping the term $H_{k-1} \ge 0$ and using the well-known upper bound $H_N \le \log N + 1$. Requiring that $|\widehat{I}_{KSG,k}(\mathbf{x}) - I(\mathbf{x})| < \varepsilon$, we obtain $N \ge C\,\exp\!\big(\tfrac{I(\mathbf{x}) - \varepsilon}{d-1}\big) + 1$, where $C = e^{-\frac{k-1}{k}}$.

The above theorems state that for any fixed dimensionality, the number of samples needed for estimating the mutual information $I(\mathbf{x})$ increases exponentially with the magnitude of $I(\mathbf{x})$. From the point of view of determining independence, i.e., distinguishing $I(\mathbf{x}) = 0$ from $I(\mathbf{x}) \neq 0$, this restriction is not particularly troubling. However, for finding strong signals in data it presents a major barrier. Indeed, consider two random variables $X$ and $Y$, where $X \sim U(0, 1)$ and $Y = X + \sigma U(0, 1)$. When $\sigma \to 0$, the relationship between $X$ and $Y$ becomes nearly functional, and the mutual information diverges as $I(X : Y) \to \log \frac{1}{\sigma}$. As a consequence, the number of samples needed for accurately estimating $I(X : Y)$ diverges as well. This is depicted in Fig. 3.2, where we compare the empirical lower bound to our theoretical bound given by Theorem 3. It can be seen that the theoretical bounds are rather conservative, but they grow at the same exponential rate as the empirical ones.

What is the origin of this undesired behavior? An intuitive and general argument comes from looking at the assumption of local uniformity in kNN-based estimators. In particular, both the naive kNN and KSG estimators approximate the probability density in the kNN ball or max-norm rectangle containing the $k$ nearest neighbors with a uniform density. If there are strong relationships (in the joint $\mathbf{x}$ space, the density becomes more singular), then we can see in Fig. 3.1(b) that the uniformity assumption becomes problematic.

Figure 3.2: A semi-logarithmic plot of $N_s$ (the number of samples required to achieve an error of at most $\varepsilon$) for the KSG estimator, for different values of $I(X : Y)$. We set $\varepsilon = 0.1$, $k = 1$.

3.3 Improved kNN-based Estimators

In this section, we suggest a class of kNN-based estimators that relaxes the local uniformity assumption mentioned above.

3.3.1 Local Nonuniformity Correction (LNC)

Consider the ball (in the naive kNN estimator) or max-norm hyper-rectangle (in the KSG estimator) around the point $\mathbf{x}^{(i)}$ which contains its $k$ nearest neighbors; let us denote this region of the space by $\mathcal{V}(i) \subseteq \mathbb{R}^d$, whose volume is $V(i)$. Instead of assuming that the density is uniform inside $\mathcal{V}(i)$ around the point $\mathbf{x}^{(i)}$, we assume that there is some subset $\bar{\mathcal{V}}(i) \subseteq \mathcal{V}(i)$ with volume $\bar{V}(i) \le V(i)$ on which the density is constant, i.e., $\hat{p}(\mathbf{x}) = \mathbb{1}\big[\mathbf{x} \in \bar{\mathcal{V}}(i)\big] / \bar{V}(i)$. This is illustrated with a shaded region in Fig. 3.1(b). We now repeat the derivation above using this altered assumption about the local density around each point for $\widehat{H}(\mathbf{x})$. We make no changes to the entropy estimates in the marginal subspaces. Based on this idea, we get a general correction term for kNN-based MI estimators (see Appendix A.4 for the details of the derivation):

$$\widehat{I}_{LNC}(\mathbf{x}) = \widehat{I}(\mathbf{x}) - \frac{1}{N}\sum_{i=1}^{N} \log \frac{\bar{V}(i)}{V(i)} \qquad (3.15)$$

where $\widehat{I}(\mathbf{x})$ can be either $\widehat{I}_{kNN}(\mathbf{x})$ or $\widehat{I}_{KSG}(\mathbf{x})$.

If the local density in the k-nearest-neighbor region $\mathcal{V}(i)$ is highly non-uniform, as is the case for strongly related variables like those in Fig. 3.1(b), then the proposed correction term will improve the estimate. For instance, if we assume that relationships in the data are smooth functional relationships plus some noise, the correction term will yield significant improvement, as demonstrated empirically in the evaluation. We note that this correction term is not bounded by $N$, but rather by our method of estimating $\bar{V}$. Next, we give one concrete implementation of this idea by focusing on a modification of the KSG estimator.

3.3.2 Estimating Nonuniformity by Local PCA

With the correction term in Eq. 3.15, we have transformed the problem into that of finding a local volume on which we believe the density is positive. For the KSG estimator, instead of a uniform distribution within the max-norm rectangle in the neighborhood around the point $\mathbf{x}^{(i)}$, we look for a small, rotated (hyper)rectangle that covers the neighborhood of $\mathbf{x}^{(i)}$. The volume of the rotated rectangle is obtained by performing a localized principal component analysis (PCA) on $\mathbf{x}^{(i)}$'s $k$ nearest neighbors, and then multiplying together the maximal axis values in each principal component after the $k$ points are transformed to the new coordinate system. (Note that we manually set the mean of these $k$ points to be $\mathbf{x}^{(i)}$ when doing PCA, in order to put $\mathbf{x}^{(i)}$ at the center of the rotated rectangle.) The key advantage of our proposed estimator is as follows: while KSG assumes local uniformity of the density over a region containing the $k$ nearest neighbors of a particular point, our estimator relies on a much weaker assumption of local linearity over the same region. Note that the local linearity assumption has also been widely adopted in manifold learning, for example in locally linear embedding (LLE) [Roweis and Saul, 2000] and local tangent space alignment (LTSA) [Zhang and Zha, 2002].

3.3.3 Testing for Local Nonuniformity

One problem with this procedure is that we may find that, locally, points occupy a small sub-volume even when the local neighborhood is actually drawn from a uniform distribution (as shown in Fig. 3.1(a)): the volume of the PCA-aligned rectangle will with high probability be smaller than the volume of the max-norm rectangle, leading to an artificially large non-uniformity correction. To avoid this artifact, we consider a trade-off between the two possibilities: for a fixed dimension $d$ and nearest-neighbor parameter $k$, we find a constant $\alpha_{k,d}$ such that if $\bar{V}(i)/V(i) < \alpha_{k,d}$, we assume local uniformity is violated and use the correction $\bar{V}(i)$; otherwise the correction is discarded for point $\mathbf{x}^{(i)}$. Note that if $\alpha_{k,d}$ is sufficiently small, then the correction term is always discarded, so that our estimator reduces to the KSG estimator. Good choices of $\alpha_{k,d}$ are set using arguments described in Appendix A.5. Furthermore, we believe that as long as the expected value $\mathbb{E}[\bar{V}(i)/V(i)] \ge \alpha_{k,d}$ in the large-$N$ limit, for some properly selected $\alpha_{k,d}$, the consistency properties of the proposed estimator will be identical to the consistency properties of the KSG estimator. The full algorithm for our estimator is given in Algorithm 1.

3.3.4 Evaluation

We evaluate the proposed estimator on both synthetically generated and real-world data. For the former, we considered various functional relationships and thoroughly examined the performance of the estimator over a range of noise intensities. For the latter, we applied our estimator to the WHO dataset used previously in [Reshef et al., 2011]. Below we report the results.

Synthetic data

Functional relationships in two dimensions. In the first set of experiments, we generate samples from various functional relationships of the form $Y = f(X) + \eta$ that were previously studied in [Reshef et al., 2011, Kinney and Atwal, 2014].
The 24 Algorithm 1 Mutual Information Estimation with Local Nonuniform Correction Input: points x (1) ; x (2) ;:::; x (N) , parameterd (dimension),k (nearest neighbor), k;d Output: ^ I LNC (x) Calculate ^ I KSG (x) by KSG estimator, using the same nearest neighbor parameter k for each point x (i) do Find k nearest neighbors of x (i) : kNN (i) 1 , kNN (i) 2 ,..., kNN (i) k Do PCA on thesek neighbors, calculate the volume corrected rectangle V (i) Calculate the volume of max-norm rectangle V (i) if V (i)=V (i)< k;d then LNC i = log V (i) V (i) else LNC i = 0:0 end if end for Calculate LNC : average value of LNC 1 ;LNC 2 ;:::;LNC N ^ I LNC = ^ I KSG LNC noise term is distributed uniformly over the interval [=2;=2], where is used to control the noise intensity. We also compare the results to several base- line estimators: KSG [Kraskov et al., 2004], generalized nearest neighbor graph (GNN) [P al et al., 2010] 3 , minimum spanning trees (MST) [M uller et al., 2012, Yukich and Yukich, 1998], and exponential family with maximum likelihood esti- mation (EXP) [Nielsen and Nock, 2010] 4 . Figure 3.3 demonstrates that the proposed estimator, LNC, consistently out- performs the other estimators. Its superiority is most signicant for the low noise regime. In that case, both KSG and GNN estimators are bounded by the sample size while LNC keeps growing. MST tends to overestimate MI for large noise but 3 We use the online code http://www.cs.cmu.edu/ ~ bapoczos/codes/REGO_with_kNN.zip for the GNN estimator. 4 We use the Information Theoretical Estimators Toolbox (ITE) [Szab o, 2014] for MST and EXP estimators. 25 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ 0 4 8 12 I ( X:Y ) Y=X+η LNC KSG GNN MST EXP Ground Truth 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ I ( X:Y ) Y=4∗X 2 +η LNC KSG GNN MST EXP Ground Truth 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ I ( X:Y ) Y=4X 3 +X 2 −4X+η LNC KSG GNN MST EXP Ground Truth 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ 0 4 8 12 I ( X:Y ) Y=2 X +η LNC KSG GNN MST EXP Ground Truth 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ I ( X:Y ) Y=sin(8πX)+η LNC KSG GNN MST EXP Ground Truth 3 −11 3 −9 3 −7 3 −5 3 −3 3 −1 3 1 σ I ( X:Y ) Y=0.2sin(8X−4)+1.1(2X−1)+η LNC KSG GNN MST EXP Ground Truth Figure 3.3: For all the functional relationships above, we used a sample size N = 5000 for each noise level and the nearest neighbor parameter k = 5 for LNC, KSG and GNN estimators. still stops growing after the noise fall below a certain intensity 5 . Surprisingly, EXP is the only other estimator that performs comparably with LNC for the linear rela- tionship (the left most plot in Figure 3.3). However, it fails dramatically for all the other relationship types. See Appendix A.6 for more functional relationship tests. Convergence rate Figure 3.4 shows the convergence of the two estimators ^ I KSG and ^ I LNC , at a xed (small) noise intensity, as we vary the sample size. We test 5 Even if k = 2 or 3, KSG and GNN estimators still stop growing after the noise fall below a certain intensity. 26 the estimators on both two and ve dimensional data for linear and quadratic relationships. We observe that LNC is doing better overall. In particular, for linear relationships in 2D and 5D, as well as quadratic relationships in 2D, the required sample size for LNC is several orders of magnitude less that for ^ I KSG . For instance, for the 5D linear relationship KSG does not converge even for the sample size (10 5 ) while LNC converges to the true value with only 100 samples. 
Finally, it is worthwhile to remark on the relatively slow convergence of LNC for the 5D quadratic example. This is because LNC, while relaxing the local uniformity assumptions, still assumes local linearity. The stronger the nonlinearity, the more samples are required to nd neighborhoods that are locally approximately linear. We note, however, that LNC still converges faster than KSG. 10 2 10 3 10 4 10 5 2D Linear 10 0 10 1 I ( X:Y ) Y=X+U(−3 −8 /2,3 −8 /2) Ground Truth LNC KSG 10 2 10 3 10 4 10 5 2D Quadratic 10 0 10 1 I ( X:Y ) Y=X 2 +U(−3 −8 /2,3 −8 /2) Ground Truth LNC KSG 10 2 10 3 10 4 10 5 5D Linear 10 -1 10 0 10 1 10 2 I(X 1 :X 2 :X 3 :X 4 :Y) Y=X 1 +X 2 +X 3 +X 4 +U(−3 −8 /2,3 −8 /2) Ground Truth LNC KSG 10 2 10 3 10 4 10 5 5D Quadratic 10 -1 10 0 10 1 I(X 1 :X 2 :X 3 :X 4 :Y) Y=X 2 1 +X 2 2 +X 2 3 +X 2 4 +U(−3 −8 /2,3 −8 /2) Ground Truth LNC KSG Figure 3.4: Estimated MI using both KSG and LNC estimators in the number of samples (k = 5 and k;d = 0:37 for 2D examples; k = 8 and k;d = 0:12 for 5D examples) 27 Real-world data Ranking Relationship Strength We evaluate the proposed estimator on the WHO dataset which has 357 variables describing various socio-economic, political, and health indicators for dierent countries 6 . We calculate the mutual information between pairs of variables which have at least 150 samples. Next, we rank the pairs based on their estimated mutual information and choose the top 150 pairs with highest mutual information. For these top 150 pairs, We randomly select a fraction of samples for each pair, hide the rest samples and then recalculate the mutual information. We want to see how mutual information-based rank changes by giving dierent amount of less data, i.e., varying. A good mutual information estimator should give a similar rank using less data as using the full data. We compare our LNC estimator to KSG estimator. Rank similarities are calculated using the standard Spearman's rank correlation coecient described in [Spearman, 1904]. Fig 3.5 shows the results. We can see that LNC estimator outperforms KSG estimator, especially when the missing data approaches 90%, Spearman correlation drops to 0.4 for KSG estimator, while our LNC estimator still has a relatively high score of 0.7. Finding interesting triplets We also use our estimator to nd strong multi- variate relationships in the WHO data set. Specically, we search for synergistic triplets (X;Y;Z), where one of the variables, say Z, can be predicted by knowing bothX andY simultaneously, but not by using either variable separately. In other words, we search for triplets (X;Y;Z) such that the pair-wise mutual information 6 WHO dataset is publicly available at http://www.exploredata.net/Downloads 28 50 60 70 80 90 % of missing data(ρ) 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Spearman Correlation KSG LNC Figure 3.5: Spearman correlation coecient between the original MI rank and the rank after hiding some percentage of data by KSG and LNC estimator respectively. The 95% condence bars are obtained by repeating the experiment for 200 times. between the pairs I(X : Y ), I(X : Z) and I(Y : Z) are low, but the multi- information I(X : Y : Z) is relatively high. We rank relationships using the fol- lowing synergy score: 7 SS =I (X :Y :Z)=maxfI (X :Y );I (Y :Z);I (Z :X)g. We select the triplets that have synergy score above a certain threshold. Fig- ure 3.6 shows two synergistic relationships detected byLNC but not byKSG. 
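Before turning to these examples, note that the ranking step itself is easy to express in code. Below is a minimal sketch; estimate_mi is a placeholder for any (multi-)mutual-information estimator such as LNC, and its list-of-arrays interface, the helper names, the user-supplied threshold, and the small guard against near-zero pairwise estimates are our own choices.

import numpy as np
from itertools import combinations

def synergy_score(x, y, z, estimate_mi):
    """SS = I(X:Y:Z) / max{I(X:Y), I(Y:Z), I(Z:X)} for 1-D sample arrays."""
    multi = estimate_mi([x, y, z])                  # multi-information I(X:Y:Z)
    pairwise = max(estimate_mi([x, y]),
                   estimate_mi([y, z]),
                   estimate_mi([z, x]))
    # Guard against near-zero or slightly negative pairwise estimates.
    return multi / max(pairwise, 1e-6)

def top_synergistic_triplets(variables, estimate_mi, threshold):
    """Return all triplets whose synergy score exceeds a user-chosen threshold.
    `variables` maps variable names to sample arrays (assumed aligned)."""
    hits = []
    for a, b, c in combinations(variables, 3):
        ss = synergy_score(variables[a], variables[b], variables[c], estimate_mi)
        if ss > threshold:
            hits.append((ss, (a, b, c)))
    return sorted(hits, reverse=True)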
In these examples, bothKSG andLNC estimators yield low mutual information for the pairs (X;Y ), (Y;Z) and (X;Z). However, in contrast to KSG, our estimator yields a relatively high score for multi-information among the three variables. For the rst relationship, the synergistic behavior can be explained by noting that the ratio of the Total Energy Generation (Y ) to Electricity Generation per Person (Z) essentially yields the size of the population, which is highly predictive of the Number of Female Cervical Cancer cases (X). While this example might 7 Another measure of synergy is given by the so called \interaction information": I (X :Y ) + I (Y :Z) +I (Z :X)I (X :Y :Z). 29 10 0 10 1 10 2 10 3 10 4 10 5 10 6 X 10 0 10 1 10 2 10 3 10 4 Y LNC = 0.10, KSG = 0.07 10 0 10 1 10 2 10 3 10 4 Y 10 2 10 3 10 4 10 5 Z LNC = -0.04, KSG = -0.05 10 0 10 1 10 2 10 3 10 4 10 5 10 6 X 10 2 10 3 10 4 10 5 Z LNC = 0.03, KSG = 0.03 10 0 10 1 10 2 10 3 10 4 10 5 10 6 X 10 -4 10 -3 10 -2 10 -1 10 0 10 1 Y Z LNC = 0.96, KSG = 0.83 (a) 10 0 10 1 10 2 10 3 X 10 3 10 4 10 5 10 6 10 7 10 8 10 9 Y LNC = 0.03, KSG = 0.02 10 3 10 4 10 5 10 6 10 7 10 8 10 9 Y 10 0 10 1 10 2 10 3 10 4 10 5 Z LNC = 0.07, KSG = 0.06 10 0 10 1 10 2 10 3 X 10 0 10 1 10 2 10 3 10 4 10 5 Z LNC = -0.02, KSG = -0.05 10 0 10 1 10 2 10 3 X 10 1 10 2 10 3 10 4 10 5 10 6 Y Z LNC = 0.45, KSG = 0.44 (b) Figure 3.6: Two examples of synergistic triplets: ^ I KSG (X : Y : Z) = 0:14 and ^ I LNC (X : Y : Z) = 0:95 for the rst example; ^ I KSG (X : Y : Z) = 0:05 and ^ I LNC (X :Y :Z) = 0:7 for the second example seem somewhat trivial, it illustrates the ability of our method to extract syner- gistic relationships automatically without any additional assumptions and/or data preprocessing. In the second example, LNC predicts a strong synergistic interaction between Total Cell Phones (Y ), Number of Female Cervical Cancer cases (Z), and rate of Tuberculosis deaths (X). Since the variable Z (the number of female cervical cancer cases) grows with the total population, Y Z is proportional to the average number of cell phones per person. The last plot indicates that a higher number of cell phones per person are predictive of lower tuberculosis death rate. One possible explanation for this correlation is some common underlying cause (e.g., overall economic development). Another intriguing possibility is that this nding re ects recent eorts to use mobile technology in TB control. 8 8 See Stop TB Partnership, http://www.stoptb.org. 30 3.4 Estimation Mutual Information by Local Gaussian Approximation To relax the local uniformity assumption in kNN-based estimators, Sec. 3.2 pro- posed LNC estimator to replace the axis-aligned rectangle with a PCA-aligned rectangle locally, and use the volume of this rectangle for estimating the unknown density at a given point. Mathematically, this revision was implemented by intro- ducing a novel term that accounted for local non-uniformity. Nevertheless, LNC estimator relied on a heuristic for determining when to use the correction term, and did not have any theoretical guarantees. Alternatively, in this section, we suggest an estimator based on local Gaussian density estimation, as more gen- eral approach to overcome the above limitations. The main idea is that, instead of assuming a uniform distribution around the local kNN-ball or a PCA-aligned rectangle, we approximate the unknown density at each sample point by a local Gaussian distribution, which is estimated using the k-nearest neighborhood of that point. 
Notice that we now shift our focus to estimating the mutual information defined in Eq. 2.3. Assume we are given N i.i.d. samples (X, Y) = {(x, y)^(i)}_{i=1}^N from the unknown joint distribution f_XY; our goal is then to construct a mutual information estimator Î(x : y) based on those samples.

3.4.1 Local Gaussian Density Estimation

In this subsection, we introduce a density estimation method called local Gaussian density estimation, or LGDE [Hjort and Jones, 1996], which serves as the basic building block for the proposed mutual information estimator.

Consider N i.i.d. samples x_1, x_2, ..., x_N drawn from an unknown density f(x), where x is a d-dimensional continuous random variable. The central idea behind LGDE is to locally approximate the unknown probability density at point x using a Gaussian parametric family N_d(μ(x), Σ(x)), where μ(x) and Σ(x) are the (x-dependent) mean and covariance matrix of each local approximation. This intuition is formalized in the following definition:

Definition 1 (Local Gaussian Density Estimator). Let x denote a d-dimensional absolutely continuous random variable with probability density function f(x), and let {x_1, x_2, ..., x_N} be N i.i.d. samples drawn from f(x). Furthermore, let K_H(x) be a product kernel with diagonal bandwidth matrix H = diag(h_1, h_2, ..., h_d), so that K_H(x) = h_1^{-1} K(h_1^{-1} x_1) · h_2^{-1} K(h_2^{-1} x_2) ⋯ h_d^{-1} K(h_d^{-1} x_d), where K(·) can be any one-dimensional kernel function. Then the Local Gaussian Density Estimator, or LGDE, of f(x) is given by

  f̂(x) = N_d(x; μ(x), Σ(x)),   (3.16)

where μ(x) and Σ(x) are different for each point x and are obtained by solving the following optimization problem,

  (μ(x), Σ(x)) = arg max_{μ,Σ} L(x; μ, Σ),   (3.17)

where L(x; μ, Σ) is the local likelihood function defined as follows:

  L(x; μ, Σ) = (1/N) Σ_{i=1}^N K_H(x_i − x) log N_d(x_i; μ, Σ) − ∫ K_H(t − x) N_d(t; μ, Σ) dt   (3.18)

The first term on the right-hand side of Eq. 3.18 is the localized version of the Gaussian log-likelihood. One can see that without the kernel function, Eq. 3.18 becomes similar to the global log-likelihood function of the Gaussian parametric family. However, since we do not have sufficient information to specify a global distribution, we make a local smoothness assumption by adding this kernel function. The second term on the right-hand side of Eq. 3.18 is a penalty term that ensures the consistency of the density estimator.

The key difference between the kNN density estimator and LGDE is that the former assumes that the density is locally uniform over the neighborhood of each sample point, whereas the latter relaxes local uniformity to local linearity⁹, which compensates for the boundary bias. In fact, any non-uniform parametric probability distribution is suitable for fitting a local distribution under the local likelihood, and the Gaussian distribution used here is simply one realization. Theorem 4 below establishes the consistency property of this local Gaussian estimator; for a detailed proof see [Hjort and Jones, 1996].

⁹ To elaborate on the local linearity, we note that the Gaussian distribution is essentially a special case of the elliptical distribution f(x) = k·g((x − μ)^T Σ^{-1} (x − μ)). Therefore, the local Gaussian approximation actually assumes a rotated hyper-ellipsoid locally at each point.

Theorem 4 ([Hjort and Jones, 1996]). Let x denote a d-dimensional absolutely continuous random variable with probability density function f(x), and let {x_1, x_2, ..., x_N} be N i.i.d. samples drawn from f(x).
Let f̂(x) be the Local Gaussian Density Estimator with diagonal bandwidth matrix H = diag(h_1, h_2, ..., h_d), where the diagonal elements h_i satisfy the following conditions:

  lim_{N→∞} h_i = 0,  lim_{N→∞} N·h_i = ∞,  i = 1, 2, ..., d.   (3.19)

Then the following holds:

  lim_{N→∞} E| f̂(x) − f(x) | = 0   (3.20)
  lim_{N→∞} E| f̂(x) − f(x) |² = 0   (3.21)

The above theorem states that LGDE is asymptotically unbiased and L2-consistent.

3.4.2 LGDE-based Estimators for Entropy and Mutual Information

We now introduce our estimators for entropy and mutual information that are inspired by the local density estimation approach defined in Sec. 3.4.1. Let us again consider N i.i.d. samples (X, Y) = {(x, y)^(i)}_{i=1}^N drawn from an unknown joint distribution f_XY, where x and y are random vectors of dimensionality d and b, respectively. Let us construct the following estimator for entropy,

  Ĥ(x) = −(1/N) Σ_{i=1}^N log f̂(x_i),   (3.22)

and for mutual information,

  Î(x : y) = (1/N) Σ_{i=1}^N log [ f̂(x_i, y_i) / ( f̂(x_i) f̂(y_i) ) ]   (3.23)

where f̂(x), f̂(y), f̂(x, y) are the local Gaussian density estimators for f_X(x), f_Y(y), f_XY(x, y), respectively, defined in the previous section. Recall that the entropy and mutual information can be written as appropriately defined expectations; the proposed estimators simply replace those expectations by sample averages and plug in the density estimators from Sec. 3.4.1. The next two theorems state that the proposed estimators are asymptotically unbiased.

Theorem 5 (Asymptotic Unbiasedness of Entropy Estimator). If the conditions in Eq. 3.19 hold, then the entropy estimator given by Eq. 3.22 is asymptotically unbiased, i.e.,

  lim_{N→∞} E[Ĥ(x)] = H(x)   (3.24)

Theorem 6 (Asymptotic Unbiasedness of MI Estimator). If the conditions in Eq. 3.19 hold, then the mutual information estimator given by Eq. 3.23 is asymptotically unbiased:

  lim_{N→∞} E[Î(x : y)] = I(x : y)   (3.25)

We provide the proofs of the above theorems in the Appendix.

3.4.3 Evaluation

Implementation Details. Our main computational task is to maximize the local likelihood function in Eq. 3.18. Since computing the second term on the right-hand side of Eq. 3.18 requires an integration that can be time-consuming, we choose the kernel function K(·) to be a Gaussian kernel, K_H(t − x) = N_d(t; x, H), so that the integral can be performed analytically, yielding

  ∫ K_H(t − x) N_d(t; μ, Σ) dt = N_d(x; μ, H + Σ)   (3.26)

Thus, Eq. 3.18 reduces to

  L(x; μ, Σ) = (1/N) Σ_{i=1}^N N_d(x_i; x, H) log N_d(x_i; μ, Σ) − N_d(x; μ, H + Σ)   (3.27)

Maximizing Eq. 3.27 is a constrained non-convex optimization problem, with the constraint that the covariance matrix Σ is positive semi-definite. We use the Cholesky parameterization Σ = LL^T to enforce the positive semi-definiteness of Σ, which allows us to reduce the constrained optimization problem to an unconstrained one. Also, since we would like to preserve the local structure of the data, we select the bandwidth to be close to the distance between a point and its k-th nearest neighbor, averaged over all the points. We use the Newton-Raphson method to perform the maximization, although the function itself is not exactly concave. The full algorithm for our estimator is given in Algorithm 2, which takes Algorithm 3 as a subroutine.
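For concreteness, a minimal sketch of evaluating the simplified local likelihood in Eq. 3.27, the quantity that the inner loop of Algorithm 3 maximizes, might look as follows; the function and variable names are ours, and Σ is passed through its Cholesky factor L as described above.

import numpy as np
from scipy.stats import multivariate_normal

def local_likelihood(x, samples, mu, L, H):
    """Eq. 3.27: localized Gaussian log-likelihood at a query point x.
    x: (d,) query point; samples: (N, d) data; mu: (d,) mean parameter;
    L: (d, d) lower-triangular Cholesky factor, Sigma = L @ L.T;
    H: (d, d) diagonal bandwidth matrix of the Gaussian kernel."""
    sigma = L @ L.T
    # Gaussian kernel weights K_H(x_i - x) = N_d(x_i; x, H)
    weights = multivariate_normal.pdf(samples, mean=x, cov=H)
    # First term of Eq. 3.27: kernel-weighted Gaussian log-likelihood
    loglik = multivariate_normal.logpdf(samples, mean=mu, cov=sigma)
    first = np.mean(weights * loglik)
    # Second term of Eq. 3.27: analytic penalty N_d(x; mu, H + Sigma)
    penalty = multivariate_normal.pdf(x, mean=mu, cov=H + sigma)
    return first - penalty

In practice the sum over samples can be restricted to the k nearest neighbors of x, since the kernel weights of distant points are negligible; this is the speed-up used in the experiments below.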
Note that in Algo- rithm 3, the Wolfe condition is a set of inequalities in performing quasi-Newton methods [Wolfe, 1969]. Algorithm 2 Mutual Information Estimation with Local Gaussian Approximation Input: points (x; y) (1) ; (x; y) (2) ;:::; (x; y) (N) Output: b I(x; y) Calculate entropy b H(x) using samples x (1) , x (2) ...,x (N) Calculate entropy b H(y) using samples y (1) , y (2) ...,y (N) Calculate joint entropy b H(x; y) using input samples (x; y) (1) ; (x; y) (2) ;:::; (x; y) (N) Return estimated mutual information ^ I = b H(x) + b H(y) b H(x; y) In a single step, evaluating the gradient and Hessian in Algorithm 3 would takeO(N) time because Eq. 3.18 is a summation over all the points. However, for 36 Algorithm 3 Entropy Estimation with Local Gaussian Approximation Input: points u (1) ; u (2) ;:::; u (N) Output: b H(u) Initialize b H(u) = 0 for each point x (i) do initialize = 0 ,L =L 0 while notL(x (i) ;; =LL T ) converge do CalculateL(x (i) ;; =LL T ) Calculate gradient vector G ofL(x (i) ;; = LL T ), with respect to,L Calculate Hessian matrix of H ofL(x (i) ;; = LL T ), with respect to,L Do Hessian modication to ensure the positive semi-deniteness of H Calculate descent direction D =H 1 G, where we compute to satisfy Wolfe condition Update;L with (;L) + D end while b f(x (i) ) =N (x;; =LL T ) b H(u) = b H(u) logf(x (i) ) N end for points that are far from the current point x (i) , the kernel weight function is very close to zero and we can ignore those point and do the summation only over a local neighborhood of x (i) . Experiments with synthetic data Functional relationships We test our MI estimator for near-functional rela- tionships of form Y = f(X) +U(0;), whereU(0;) is the uniform distribution over the interval (0;), and X is drawn randomly uniformly from [0; 1]. Similar relationships were studied in [Reshef et al., 2011], [Kinney and Atwal, 2014] and [Gao et al., 2015]. 37 Noise Level 0 4 8 12 I ( X:Y ) Y=X+U Local Gaussian Kraskov MST GNN Ground Truth Noise Level I ( X:Y ) Y=X 2 +U Local Gaussian Kraskov MST GNN Ground Truth Noise Level 0 4 8 12 I ( X:Y ) Y=X 3 +U Local Gaussian Kraskov MST GNN Ground Truth Noise Level I ( X:Y ) Y=2 X +U Local Gaussian Kraskov MST GNN Ground Truth 3 − 8 3 − 6 3 − 4 3 − 2 3 0 3 2 3 4 Noise Level 0 4 8 12 I ( X:Y ) Y=sin(4πX )+U Local Gaussian Kraskov MST GNN Ground Truth 3 − 8 3 − 6 3 − 4 3 − 2 3 0 3 2 3 4 Noise Level I ( X:Y ) Y=cos(5πX (1− X))+U Local Gaussian Kraskov MST GNN Ground Truth Figure 3.7: Functional relationship test for mutual information estimators. The horizontal axis is the value of which controls the noise level; the vertical axis is the mutual information in nats. For the Kraskov and GNN estimators we used nearest neighbor parameter k = 5. For the local Gaussian estimator, we choose the bandwidth to be the distance between a point and its 5rd nearest neighbor. We compare our estimator to several baselines that include the kNN estimator proposed by [Kraskov et al., 2004], an estimator based on generalized nearest- neighbor graphs (GNN) [P al et al., 2010], and minimum spanning tree method (MST) [Yukich and Yukich, 1998]. We evaluate those estimators for six dierent functional relationships as indicated in Figure 3.7. We useN = 2500 sample points 38 for each relationship. To speed up the optimization, we limited the summation in Eq. 3.27 to only k nearest neighbors, thus reducing the computational complexity from O(N) to O(k) in every iteration step of Algorithm 3. One can see from Fig. 
3.7 that when is relatively large, all methods except MST produce accurate estimates of MI. However, as one decreases, all three base- line estimators start to signicantly underestimate mutual information. In this low- noise regime, our proposed estimator outperforms the baselines, at times by a sig- nicant margin. Note also that all the estimators, including ours, perform relatively poorly for highly non-linear relationships (the last row in Figure 3.7). According to our intuition, this happens when the scale of the non-linearity becomes su- ciently small, so that the linear approximation of the relationship around the local neighborhood of each sample point does not hold. Under this scenario, accuracy can be recovered by adding more samples. 3.5 Conclusion The problem of deciding whether or not two variables are independent is a his- torically signicant endeavor. Statistics has largely focused on the classic problem of determining whether it is possible to exclude the null hypothesis that variables are independent. In that context, research on mutual information estimation has been geared towards distinguishing weak dependence from independence. How- ever, modern data mining presents us with problems requiring a totally dierent perspective. It is not unusual to have thousands of variables which could have millions of potential relationships. We have insucient resources to examine each potential relationship so we need an assumption-free way to pick out only the 39 most promising relationships for further study. Many applications have this a- vor including the health indicator data considered above as well as gene expression microarray data, human behavior data, to name a few. How can we select the most interesting relationships? Mutual information gives a clear and general basis for comparing the strength of otherwise dissimilar variables and relationships. While non-parametric mutual information estimators exist, we showed that strong rela- tionships require at least exponentially many samples to accurately measure using some of these techniques. We introduced a non-parametric mutual information estimator based on local uniformity correction that can measure the strength of nonlinear relationships even with small sample sizes. We have also introduced a semi-parametric method called local Gaussian approximation to estimate the entropy and mutual information. The estimators were shown to be asymptotically unbiased. We also show empirically that our estimator can detect the strength of the relationships even if the variables are very correlated and the sample size is small. As the amount and variety of available data grows, general methods for identifying strong relationships will become increasingly necessary. We hope that the developments suggested here will help to address this need. 40 Chapter 4 Modeling Linguistic Style Coordination by Mutual Information In this chapter, we propose an information-theoretic framework for characterizing stylistic coordination in dialogues. Namely, given a temporally ordered sequence of utterances (verbal or electronic statements depending on the context) by two individuals, we characterize their stylistic coordination with time-shifted mutual information. The proposed coordination measure characterizes the dependence between the stylistic features of the original post and the response. In addition, we provide a computational framework to account for confounding factors when measuring stylistic coordination. 
4.1 Measuring Stylistic Coordination

4.1.1 Representing Stylistic Features

To represent stylistic features in utterances, we use Linguistic Inquiry Word Count (LIWC) [Pennebaker and Francis, 2007], which is a dictionary-based encoding scheme that has been used extensively for evaluating emotional and psychological dimensions in various text corpora. The latest version of the LIWC dictionary contains around 4500 words and word stems. Each word or word stem belongs to one or more word categories or subcategories. Various LIWC categories include positive and negative emotion, function words, pronouns, articles, and so on. Here we focus on eight LIWC categories that have been used in previous studies [Danescu-Niculescu-Mizil et al., 2012]: articles, auxiliary verbs, conjunctions, high-frequency adverbs, impersonal pronouns, personal pronouns, prepositions, and quantifiers. Utterances are represented as eight-component binary vectors indicating the presence or absence of each linguistic marker [Danescu-Niculescu-Mizil et al., 2012].

4.1.2 Information-theoretic measure of coordination

Each dialogue is a sequence of utterance exchanges between two participants. Following [Danescu-Niculescu-Mizil et al., 2011, 2012] we binarize the stylistic features of utterances, so that a dialogue is represented as {o_k^m, r_k^m}_{k=1}^K, where o_k^m, r_k^m ∈ {0, 1} indicate the absence or presence of the stylistic marker m, and K is the total number of exchanges in the dialogue. Since we focus on coordination between the same stylistic markers, we will drop the superscript m from now on. We use the convention O to represent the originator, the person who produces the original utterance in a single exchange, and R to represent the respondent, the person who replies to the originator. Let p(o, r) be the joint distribution of the random variables O and R. We characterize the amount of stylistic coordination using mutual information,

  I(O : R) = H(O) − H(O|R)   (4.1)

where H(O) = −Σ_o p(o) log p(o) is the entropy of O, and H(O|R) is the entropy of O conditioned on R. Note that in our case the arguments are temporally ordered: O is always the initial utterance, and R is the response, so that Eq. 4.1 in fact defines time-shifted mutual information. Thus, even though mutual information is symmetric with respect to its arguments, the coordination between two users may be asymmetric.

Recall that the mutual information between two variables measures the average reduction in the uncertainty of one variable if we know the other variable. Thus, in essence, the proposed measure of stylistic coordination quantifies how the use of a marker m in an utterance of O's can help to predict R's usage of m in the immediate response. In contrast to linear correlation measures, mutual information is well suited for handling strongly non-linear dependencies.

We measure the correlation between two variables after conditioning on a third variable, Z, via conditional mutual information (CMI), defined as

  I(O : R|Z) = H(O|Z) − H(O|R, Z).   (4.2)

Below we will use CMI to account for the confounding effect of utterance length by conditioning on it. Namely, the actual stylistic accommodation, after accounting for length coordination, is given by I(O : R|L_R), where L_R is the length of the utterance by user R.

4.1.3 Estimating mutual information from data

Given a set of samples {o_k, r_k}_{k=1}^K, our goal is to estimate the mutual information between O and R. We could do this by first calculating the empirical distribution p(o, r) and then using Eq. 4.1.
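For a single binary marker, such a naive plug-in estimate of Eq. 4.1 could look as follows (a minimal sketch; the function name is ours):

import numpy as np

def plugin_mutual_information(o, r):
    """Naive plug-in estimate of I(O:R) in nats for binary arrays o, r,
    where o[k], r[k] indicate presence/absence of the marker in exchange k."""
    o, r = np.asarray(o), np.asarray(r)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((o == a) & (r == b))        # joint empirical probability
            p_a, p_b = np.mean(o == a), np.mean(r == b)
            if p_ab > 0:                               # convention: 0 * log 0 = 0
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi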
However, it is known that this naive plug-in esti- mator tends to underestimate the entropy of a system. Instead, here we use the statistical bootstrap method introduced by [DeDeo et al., 2013], which attempts to reduce the bias of the naive estimator by estimating a bootstrap correction term. 43 The estimate of bias comes from comparing the entropy of the empirical distri- bution to estimates of entropy from several bootstrap datasets drawn randomly according to the empirical distribution. See [DeDeo et al., 2013] for more details. While the above discrete estimator works well for evaluating mutual informa- tion between discrete stylistic variables, it is not very useful for evaluating mutual information between two length variables, due to limited number of samples we have. Instead, we will use a continuous estimator introduced by [Kraskov et al., 2004]. This non-parametric estimator searches the k-nearest neighbors to each point, and then average the mutual information estimated from the neighborhood of each point. It has been shown that this estimator is asymptotically unbiased and consistent. Discussion of dierent entropy estimators can be found in [Wang et al., 2009b] and references therein. 4.2 Length as a confounding factor We applied our coordination measures to two datasets previously studied in [Danescu-Niculescu-Mizil et al., 2012]: oral transcripts from the Supreme Court hearings, and discussion among Wikipedia editors. In the Supreme Court Data, there are 11 Judges and 311 Lawyers conversing with each other. We obtain 51,498 utterances from all the dialogues among 204 cases. In the Wikipedia dataset, users are classied into two categories, Administrators, or Admins, and non-Admins. All of the users interact with each other on Wikipedia talk pages, where they discuss issues about specic Wikipedia pages. We focus on dialogues where each partici- pant make at least two exchanges within a dialogue, which results in over 30,000 utterances. 44 Ideally, we would like to calculate linguistic accommodation between any pair of individuals O and R who have participated in a dialogue. Unfortunately, most pair-wise exchanges are rather short and do not produce sucient samples for evaluating mutual information or conditional mutual information. Instead, we group the individuals according to their roles, and then use aggregated samples to calculate stylistic coordination between the groups. The groups correspond to Judges and Lawyers for the Supreme Court data, and Admins and non-Admins for the Wikipedia data. [ (a) ] Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.005 0.000 0.005 0.010 0.015 0.020 0.025 COORDINATION LAWYERS COORDINATING TO JUDGES Conditional Mutual Information Mutual Information Zero Information [ (b) ] Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.005 0.000 0.005 0.010 0.015 0.020 0.025 COORDINATION JUDGES COORDINATING TO LAWYERS Conditional Mutual Information Mutual Information Zero Information Figure 4.1: Coordination measures for the Supreme Court data. The red (blue) dots give the true CMI (MI). The green dots represent CMI under the null hypoth- esis that there is no coordination after conditioning. (a) Lawyers coordinating to Judges. (b) Judges coordinating to Lawyers. In both gures, the conditional mutual information is signicantly smaller than the mutual information for all eight stylistic features, indicating length is a confounding factor. Fig. 
4.1 describes stylistic coordination for the Supreme Court data as measured by I(O : R) and I(O : RjL R ). The bias in estimators for conditional mutual information and mutual information are generally dierent. Therefore, rather than estimating mutual information directly, we use a conditional mutual information estimator where we condition on randomly permuted values forL R . We repeat this procedure for four hundred times to produce 99% condence intervals forI(O :R) 45 Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.005 0.000 0.005 0.010 0.015 COORDINATION NON-ADMINS COORDINATING TO ADMINS Conditional Mutual Information Mutual Information Zero Information (a) Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.005 0.000 0.005 0.010 0.015 COORDINATION ADMINS COORDINATING TO NON-ADMINS Conditional Mutual Information Mutual Information Zero Information (b) Figure 4.2: Coordination measures for the Wikipedia data. (a) Non-admins coor- dinating to Admins. (b) Admins coordinating to Non-admins. Symbols have the same interpretation as in the previous plot. (blue bars). The green bars give the 99% condence intervals in case there is no stylistic coordination by estimating CMI with R's utterances permuted (erasing any stylistic coordination). The blue dots show the mutual information between the corresponding stylistic features, and suggest strong linguistic correlations between the groups. This eect, however, is strongly diminished after conditioning on the length of utterances (red dots). For instance, the coordination scores on features Impersonal Pronoun, Arti- cle, and Auxiliary Verb are reduced by factors of 6:7, 4:8, and 5:3, respectively, after conditioning on length. For the feature Conjunction, the 99% condence interval of coordination score is above the condence interval of zero information before conditioning, and falls into the condence interval of zero information after conditioning. Similarly, in Fig. 4.1(b), the coordination scores for ve out of eight markers (Impersonal Pronoun, Article, Adverb, Preposition, Quantier) become practically zero after conditioning, suggesting that the observed coordination in those stylistic features are due to length correlations. 46 A similar picture holds for the Wikipedia dataset shown in Fig. 4.2. Again, we observe non-zero mutual information in all the features. However, this cor- relation is signicantly diminished after conditioning on length. In fact, both non-admins coordinating to admins (Fig. 4.2(a)) and admins coordinating to non- admins (Fig. 4.2(b)) have an extremely weak signal after conditioning on length (all below 0.005). In particular, for non-admins coordinating to admins(Fig. 4.2(a)), the red dots of ve out of eight features lie in the zero conditional mutual informa- tion condence interval. For these ve features in Fig. 4.2(b), we cannot rule out the null hypothesis that all stylistic coordination is due to phenomenon of length coordination. Another interesting observation is that there is signicant asymmetry, or directionality, in stylistic coordination. For instance, by comparing Figs. 4.1(a) and 4.1(b) we see that the mutual information is signicantly higher from lawyers to judges than vice versa. A similar (albeit less pronounced) asymmetry is present for the Wikipedia data as well. 
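One way to implement the permutation procedure behind the confidence intervals in these figures is sketched below; estimate_cmi stands for any conditional mutual information estimator (for instance the bootstrap-corrected discrete estimator of Sec. 4.1.3), and the function name and data layout are our own choices.

import numpy as np

def permutation_intervals(o, r, l_r, estimate_cmi, n_rep=400, level=99):
    """Confidence intervals of the kind reported in Figs. 4.1-4.2.
    - MI interval: condition on a permuted copy of l_r, so the estimator's bias
      matches the CMI estimate while the conditioning variable is uninformative.
    - Zero-information interval: permute r itself, erasing any stylistic coordination."""
    rng = np.random.default_rng(0)
    mi_samples, null_samples = [], []
    for _ in range(n_rep):
        mi_samples.append(estimate_cmi(o, r, rng.permutation(l_r)))
        null_samples.append(estimate_cmi(o, rng.permutation(r), l_r))
    lo, hi = (100 - level) / 2, 100 - (100 - level) / 2
    return (np.percentile(mi_samples, [lo, hi]),
            np.percentile(null_samples, [lo, hi]))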
This type of asymmetry has been used to suggest that the relative strength of stylistic accommodation re ects social status [Danescu- Niculescu-Mizil et al., 2012]. However, Figs. 4.1 and 4.2 illustrate that the asym- metry is drastically weakened after conditioning on length (red dots), suggesting that the phenomenon of higher stylistic coordination from lawyers to judges (and from non-admins to admins for the Wikipedia dataset) is due to the confounding eect of length. Unfortunately, a direct assessment of this eect in a single conver- sation is not feasible due to the insucient number of utterances for calculating conditional mutual information. To conclude this section, we note that some of the correlations in stylistic fea- tures persist even after conditioning on length. One can ask whether this remnant 47 correlation is due to turn-by-turn level linguistic coordination, or can be attributed to other confounding factors. We address this question in detail later in the text. 4.3 Understanding Length Coordination As discussed in the previous section, the observed correlations in linguistic features can be attributed to coordination in the length of utterances. Here we analyze this phenomenon in more detail. In particular, we are interested whether the observed length correlations are due to turn-by-turn coordination, or can be attributed to other contextual factors. For instance, consider a scenario that in one conversa- tion, Alice and Bob are always conversing using short statements, while in another conversation they exclusively use long statements, perhaps due to dierent topics of conversation. Length coordination is found if data from these two conversation is aggregated, however, this coordination only re ects Alice's and Bob's response to the topic of conversation. More generally, aggregating data might lead to eects similar to Simpson's paradox Simpson [1951]. Figure 4.3: A Bayesian network model for length coordination. The network containing contextual factors, C, the length of an utterance, L (t) O , and the length of the response,L (t) R . (a) The lengths are correlated only due to contextual factors. (b) The lengths are correlated due to both contextual factors and potential eect of turn-by-turn level coordination (represented with the dotted line). 48 To understand the possible extent of various confounding factors (we call them contextual factors), consider the Bayesian network model that incorporates both contextual factors and length coordination, as shown in Fig. 4.3. Here L O and L R are random variables representing the length of an utterance by the originator O and the respondent R, respectively. In the model with both solid and dashed lines in Fig. 4.3(b), L O explicitly in uences L R . While if we only have the soild lines in Fig. 4.3(a), L R is independent of L O after conditioning on the context C. Thus, the model in Fig. 4.3(a) assumes that there is only contextual coordination while Fig. 4.3(b) implies turn-by-turn coordination. Note that in principle, the contextual factor C can vary within a single conversation, for example, the theme of a conversation may change as time goes by. But for simplicity, we will assume that the contextual factor C does not change within the dialogue or conversation. 
4.3.1 Information-theoretic characterization of length coordination A direct measure of Turn-by-turn Length Coordination (TLC) is given by the following conditional mutual information: TLC =I(L O :L R jC) (4.3) Additionally, we dene the Overall Length Coordination (OLC) as OLC =I(L O :L R ) (4.4) 49 Thus,OLC captures not only the length coordination in a turn-by-turn level, but also the confounding behaviors between L O , L R and C. In fact, OLC can be decomposed into two items: OLC =TLC +I(L O :L R :C) (4.5) The second item of right hand side in Eq. 4.5 indicates the multivariate mutual information(MMI) (also known as interaction information [McGill, 1954] or co- information [Bell, 2003]), and characterizes the amount of shared information between L O , L R and C. A straightforward method to test for turn-by-turn coordination is to evaluate TLC described in Eq. 4.3. Indeed, L O andL R are conditionally independent ofC if and only if TLC = 0. However, direct evaluation of TLC is not possible due to the lack of sucient number of samples, e.g., the number of exchanges within a specic dialogue. Nevertheless, it is possible to test the turn-by-turn length coordination by a non-parametric statistical test as shown below. 4.3.2 Turn-by-Turn Length Coordination Test Our null hypothesis is that there is no turn-by-turn coordination, so that all observed correlations are due to contextual factors. We now describe a procedure for testing this hypothesis. We denote the pairwise set of exchanges in a specic dialoguec from originator o and respondent r as: S c o r = o k c ;r k c Kc k=1 (4.6) 50 where o k c ;r k c indicate the kth exchange (two utterances) by the originator o and respondent r in dialogue c, and K c represents the total number of exchanges in c. We also dene the aggregated set of exchanges of user o2O and user r2R as: S O R = [ o2O;r2R [ c2Co;r S c o r (4.7) where C o;r represents all the dialogues that involved user o andr. We can rewrite S O R element-wise as S O R =fO k ;R k ;C k g N k=1 (4.8) whereN =jS O R j representing number of samples. For each triplet of right hand side in Eq. 4.8, R k is the reply utterance to O k in the dialogue C k . Finally, from S O R we obtain the set L(S O R ) =flen (O k );len (R k )g N k=1 (4.9) where len () is a function representing the length of an utterance. Consider now another sample, which is obtained by randomly permuting the respondent r's utterances in the set S o r;c : b S c o r = o k c ;b r k c Kc k=1 (4.10) wherefb r k c g Kc k=1 is a random permutation of r k c Kc k=1 . By aggregation, we have, b S O R = [ o2A;r2B [ c2Co;r b S c o r = n O k ; b R k ;C k o N k=1 and L( b S O R ) =flen(O k );len( b R k )g N k=1 (4.11) 51 Let us assume there is no turn-by-turn coordination, so that L O and L R are con- ditionally independent from each other givenC. Then, it is easy to see that under this null model, the samples L(S O R ) and L( b S O R ) have the same likelihood, e.g., they are statistically equivalent. In other words, L( b S O R ) can be viewed as a new sample from the same distribution p(l o ;l r ). This observation suggests the following test: We rst estimate OLC from the sample L(S O R ) (denoted as OLC 0 ) and then using the within-dialogue shued samples L( b S O R ) (denoted as OLC 1 ). Under the null hypothesis, these two estimates should coincide. Con- versely, if OLC 0 6= OLC 1 , then the null hypothesis is rejected, suggesting that there is turn-by-turn length coordination. 
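A minimal sketch of this test is given below, using scikit-learn's kNN-based estimator as a stand-in for the Kraskov et al. estimator; the per-dialogue data layout, the helper names, and the number of shuffles are our own simplifications.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def estimate_olc(len_o, len_r, k=5):
    """kNN-based estimate of OLC = I(L_O : L_R) between two length arrays."""
    return mutual_info_regression(np.asarray(len_o).reshape(-1, 1),
                                  np.asarray(len_r), n_neighbors=k)[0]

def length_coordination_test(dialogues, n_shuffles=200, k=5, seed=0):
    """dialogues: list of (len_o, len_r) arrays, one pair per dialogue.
    Returns OLC_0 on the original aggregated data and the distribution of OLC_1
    obtained by permuting the respondent's utterance lengths within each dialogue."""
    rng = np.random.default_rng(seed)
    len_o = np.concatenate([o for o, _ in dialogues])
    len_r = np.concatenate([r for _, r in dialogues])
    olc_0 = estimate_olc(len_o, len_r, k)
    olc_1 = []
    for _ in range(n_shuffles):
        shuffled = np.concatenate([rng.permutation(r) for _, r in dialogues])
        olc_1.append(estimate_olc(len_o, shuffled, k))
    return olc_0, np.array(olc_1)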
The above procedure, which we call Turn-by-Turn Length Coordination Test, is a conditional Monte Carlo test [Pesarin and Salmaso, 2010]. The main advantage of this non-parametric test is that it requires a smaller sample size and does not need to make particular distribution assumptions. The test is non-parametric in two ways: the permutation procedure is non-parametric as well as the estimation of mutual information. We also note that in the context of stylistic coordination, a similar test was used in [Danescu-Niculescu-Mizil and Lee, 2011]. The results of this test are shown in Fig. 4.4. For the Supreme Court data, Fig. 4.4(a) shows that both Lawyers to Judges and Judges to Lawyers have non- zero mutual information (OLC 0 ) before permutation. The Turn-by-Turn Length Coordination test shows that the mutual information decreases signicantly after permutation(green condence intervals, OLC 1 ), rejecting the null hypothesis that L O andL R are independent after conditioning on the contextual factorC. In other words, the contagion of length exists from the original utterance to the reply on a turn-by-turn level. 52 Lawyers to Judges Judges to Lawyers −0.02 −0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 COORDINATION 0.055 0.048 0.012 0.008 SUPREME COURT TTLCT OLC 0 OLC 1 (a) Non-admins to Admins Admins to Non-admins 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 COORDINATION 0.145 0.136 0.082 0.080 WIKIPEDIA TTLCT OLC 0 OLC 1 (b) Figure 4.4: Turn-by-turn length coordination test. (a) Supreme Court dataset. (b) Wikipedia dataset. In both two subgures, OLC 1 is signicantly smaller than OLC 0 . For the results on the Wikipedia discussion board in Fig. 4.4(b), we are also able to reject the null hypothesis. Notice that the degree of mutual information OLC is higher for Wikipedia than for the Supreme Court. However, one cannot make a general conclusion about the exact magnitude of turn-by-turn length coordination (TLC) simply by calculating the loss, i.e., OLC 0 OLC 1 . 4.4 Revisiting Stylistic Coordination We demonstrated in the previous section that strong correlations in utterance length explain most of the observed stylistic coordination. However, in some situ- ations, there are statistically signicant non-zero signals even after conditioning on length, e.g., the rst feature dimension Personal Pronoun in Figs. 4.1(a) and 4.1(b). We now proceed to examine this remnant coordination. Specically, we are inter- ested in the following question: Does the non-zero conditional mutual information (after conditioning on length) represent turn-by-turn level stylistic coordination, or is it due to other contextual factors? 53 Toward this goal, consider the Bayesian network in Fig. 4.5, which depicts con- ditional independence relations between the length variables L O and L R ; stylistic variables F m O and F m R with respect to a style feature m, and the contextual (dia- logue) variable C. The solid arrow from L O to L R re ects our ndings from the last section about the existence of turn-by-turn length coordination. The dashed arc between the featuresF m O andF m R characterizes turn-by-turn stylistic coordina- tion. Finally, the grey arcs betweenC,F m O andC,F m R indicate possible contextual coordination. C L O (t) L R (t) F R m (t ) F O m (t ) Figure 4.5: A Bayesian network for linguistic style coordination. L O and L R represent length of the respondent and length of the originator respectively. F m O andF m R represent a specic style feature variable for the respondent and originator. 
4.4.1 Information-theoretic characterization of stylistic coordination We use conditional mutual information to measure the Turn-by-turn Stylistic Coor- dination (TSC) with respect to a specic style feature m: TSC =I(F m O :F m R jC;L R ) (4.12) 54 where F m O , F m R are binary variables indicating the feature m appears or not in an utterance. Also, the Overall Stylistic Coordination(OSC) is dened as OSC =I(F m O ;F m R jL R ) (4.13) Thus, OSC is exactly the conditional mutual information introduced in Eq. 4.2. Note that, even after conditioning on length, F m O and F m R are still dependent of each other because they are sharing the contextual factor C. (F m O C!F m R is called a d-connected path in Pearl [2000]) Again, a direct measure of turn-by-turn stylistic coordination corresponds to non-zero TSC in Eq. 4.12. However, TSC is hard to evaluate due to lack of sucient samples. Furthermore, the shuing test from the previous sections is not directly applicable here either, because it needs to be done in way that keeps the correlations between L O and L R intact: In other words, one can exchange utterances that have the same lengths. Since most dialogues are rather short, this type of shuing test is not feasible, and one needs an alternative approach. 4.4.2 Turn-by-Turn Stylistic Coordination Test Our proposed test is based on the following idea: if we can rule out the in uence of the contextual factors on stylistic correlations, then any non-zero conditional mutual information can be only explained by turn-by-turn stylistic coordination, i.e., OSC = TSC. Thus, the null hypothesis is that there is contextual level coordination in stylistic features. We emphasize that by contextual coordination, we are actually referring to the links from C to F O and C to F R in Fig. 4.5. We follow the same notation and methodology used in previous sections. By Eq. 4.8, let us denote the mixed length and stylistic feature set of S O R as: 55 LF m (S O R ) =flen (O k );len (R k );f m (O k );f m (R k );C k g where f m () is a binary function represents whether the style feature m in an utterance appears or not. Consider now the shuing procedure: we randomly permute respondent's utterances within a dialogue and obtain the set b S O R in Eq. 4.11. We also dene the length and feature set of b S O R : LF m ( b S O R ) =flen(O k );len( b R k );f m (O k );f m ( b R k );C k g Clearly, the permutation destroys the turn-by-turn level coordination in both length and style. Thus, any remnant correlation must be due to contextual coordi- nation, e.g., the forkF m O C!F m R . This provides a straightforward test for the existence of contextual coordination. Indeed, we simply need to estimate the over- all stylistic coordination OSC 1 using the shued sample LF m ( b S O R ). If OSC 1 is larger than zero, then there is necessarily contextual coordination. On the other hand, ifOSC 1 = 0, then all the observed stylistic correlations (calculated using the original non-shued sample) must be due to turn-by-turn stylistic coordination. Let us rst consider the results of the above test for the Supreme Court data. From Figs. 4.6(a) and 4.6(b), one can see that for all the features, the correspond- ing CMI OSC 1 are within the zero-information condence intervals, indicating that non-zero conditional mutual information (OSC before shuing) cannot be attributed to contextual factors. In other words, the remnant correlations that are not explained by length coordination must be due to turn-by-turn level coordina- tion. 
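For completeness, one simple way to compute the OSC quantities used in this test is sketched below; here the conditioning on length is handled by quantile binning, which is our own simplification rather than the estimator used in the thesis. Applying the same function to the within-dialogue shuffled sample yields OSC_1.

import numpy as np

def _plugin_mi(a, b):
    """Plug-in MI (nats) between two binary arrays."""
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (np.mean(a == u) * np.mean(b == v)))
    return mi

def estimate_osc(f_o, f_r, len_r, n_bins=10):
    """Plug-in estimate of OSC = I(F_O : F_R | L_R): average the within-bin MI
    between the binary marker indicators, weighting each length bin by its mass."""
    f_o, f_r, len_r = map(np.asarray, (f_o, f_r, len_r))
    edges = np.quantile(len_r, np.linspace(0, 1, n_bins + 1))
    bins = np.searchsorted(edges[1:-1], len_r, side='right')   # bin index in [0, n_bins-1]
    return sum(np.mean(bins == b) * _plugin_mi(f_o[bins == b], f_r[bins == b])
               for b in np.unique(bins))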
56 Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 COORDINATION LAWYERS COORDINATING TO JUDGES OSC Before Shuffling OSC 1 Zero Information (a) Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 COORDINATION JUDGES COORDINATING TO LAWYERS OSC Before Shuffling OSC 1 Zero Information (b) Figure 4.6: Turn-by-turn stylistic coordination test for Supreme Court data. (a) Lawyers coordinating to Judges. (b) Judges coordinating to Lawyers. (Blue bars indicate the overall stylistic coordination(OSC) before the test). One can see that after shuing, values ofOSC 1 are within the zero-information condence intervals. Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.001 0.000 0.001 0.002 0.003 0.004 COORDINATION NON-ADMINS COORDINATING TO ADMINS OSC Before Shuffling OSC 1 Zero Information (a) Personal Pronoun Impersonal Pronoun Article Auxiliary verb Adverb Preposition Conjunction Quantifier −0.001 0.000 0.001 0.002 0.003 0.004 COORDINATION ADMINS COORDINATING TO NON-ADMINS OSC Before Shuffling OSC 1 Zero Information (b) Figure 4.7: Turn-by-turn stylistic coordination test for Wikipedia data. (a) Non- admins coordinating to Admins. (b) Admins coordinating to Non-admins. (Blue bars indicate the overall stylistic coordination(OSC) before the test). One cannot rule out the null hypothesis that the remnant stylistic coordination is due to the contextual factors. The situation is dierent for Wikipedia data. Indeed, Figs. 4.7(a) and 4.7(b) show that for the stylistic features with statistically signicant remnant correla- tions even after conditioning on length (OSC), the results of the above permu- tation tests are rather inconclusive. Namely, although the condence intervals of 57 OSC 1 do overlap with the zero information condence intervals, one cannot state unequivocally that they are zero. In other words, one cannot rule out the null hypothesis that the remnant stylistic coordination is due to the contextual factors rather than turn-by-turn coordination. 4.5 Stylistic Coordination and Power Relation- ship It has been hypothesized that directionality of the stylistic coordination in dia- logues can be predictive of power relationship between the conversations, as dis- cussed in Danescu-Niculescu-Mizil et al. [2012]. We do indeed observe directional dierences in stylistic coordination when comparing Figs. 4.1(a) and 4.1(b) with Figs. 4.2(a) and 4.2(b). However, as we elaborated above, the observed direction- ality can result from the confounding eect of length coordination. Here we analyze this issue in more details by setting up the following prediction task (see Danescu-Niculescu-Mizil et al. [2012]). We consider all the pairs of users (X;Y ) who have dierent social status (e.g., admin vs. non-admin) and have engaged in dialogues. We then calculate stylistic coordination scores from X to Y and Y to X, and examine whether those scores can be used to classify the social status of each speaker. For classication, we assume we know the status relationship for a fraction of pairs in our dataset, and then use a supervised learning method called Support Vector Machine (SVM) to predict the status of the unknown users. 
We perform this prediction tasks using the following three dierent set of features: 58 Coordination Features: For each pair, and for each of the eight stylistic mark- ers, we produce two-dimensional feature vector, where the two components correspond to the mutual information score in either direction. Aggregated Coordination Features: For each pair, we aggregate the Coordina- tion Features for all eight stylistic markers in both directions, which results in a sixteen-dimensional feature vector. Length Coordination Features: For each pair, we calculate the Pearson cor- relation coecients between length of utterances in either direction, and use those coecients as an input to SVM. For the Wikipedia data, we consider (admin, non-admin) pairs and for the Supreme Court data, we consider (justice, lawyer) pairs. Personal Pronoun Impersonal Pronoun Article Auxiliary Verb Adverb Preposition Conjunction Quantifier Aggregated Markers Length Correlation 0.2 0.3 0.4 0.5 0.6 0.7 0.486 0.576 0.549 0.504 0.479 0.516 0.484 0.503 0.597 0.558 Supreme Court Power Prediction (a) Personal Pronoun Impersonal Pronoun Article Auxiliary Verb Adverb Preposition Conjunction Quantifier Aggregated Markers Length Correlation 0.2 0.3 0.4 0.5 0.6 0.7 0.436 0.463 0.486 0.466 0.646 0.466 0.500 0.439 0.512 0.554 Wikipedia Power Prediction (b) Figure 4.8: SVM Prediction Accuracy for both stylistic coordination features and length coordination features In our experiment, we only select pairs which have at least 20 exchanges between them, so that we can calculate mutual information with reasonable accu- racy. This results in 135 pairs in Supreme Court dataset and 34 pairs in Wikipedia 59 dataset. Also, we labeled half fraction of the pairs, shued the training data and repeated the procedureN = 100 times to calculate the average prediction accuracy. Fig. 4.8 depicts the prediction accuracy for each of the above scenario, together with error bars which give the 95% condence intervals. Since the two datasets are small, the error bars are relatively large in these situations. The results can be summarized as follows. For the Supreme Court dataset, the best prediction accuracy is achieved when using Aggregated Coordination Features, whereas for the Wikipedia dataset the best accuracy corresponds to using coordination on the feature Adverb. More importantly, we nd that using Length Coordination Features alone can predict user status with better-than-random (50%) accuracy. In fact, investigating from the error bars, we cannot rule out that the hypothesized correlation between social status and (the direction of) stylistic coordination is due to the confounding eect of length coordination. 4.6 Conclusion In this chapter, we have suggested an information theoretic framework for mea- suring and analyzing stylistic coordination in dialogues. We rst extracted the stylistic features from the dialogue of two participants and then used Mutual Infor- mation (MI) as a theoretically motivated measure of dependence to characterize the amount of stylistic coordination between the originator and the respondent in the dialogue. Moreover, by introducing Conditional Mutual Information (CMI), which allows us to measure the correlation between two variables after conditioning on a third variable, we are able to more accurately model stylistic accommodation by controlling for confounding eects like length coordination. 60 We then used the proposed method to revisit some of the previous studies that had reported strong stylistic coordination. 
While the suggestion that one person's use of, e.g., prepositions will (perhaps unconsciously) lead the other to use more prepositions is fascinating, our results indicate that previous studies have vastly overstated the extent of stylistic coordination. In particular, we showed that a signicant part of the observed stylistic coordination can be attributed to the con- founding eect of length coordination. We nd that for both Supreme Court and Wikipedia data, the coordination score is greatly diminished after conditioning on length. We also nd that the signicant asymmetry in stylistic coordination shown in the previous study Danescu-Niculescu-Mizil et al. [2012] is drastically weakened after conditioning on length. In fact, our results indicate that the asymmetry in length coordination can explain almost all the observed asymmetry in stylistic coordination. Simpson's paradox provides a famous example of how correlations observed in a population can disappear or even be reversed after conditioning on sub- populations. In an information-theoretic framework setting, a similar \paradox" can be seen in the example illustrated by Fig. 4.3: for L O , L R and C, the mutual informationI(L O :L R )> 0, whileI(L O :L R jC) = 0. If we only look at the aggre- gated data, averaging over all contexts, C, i.e., I(L O :L R ), there will be articial mutual information between L O andL R . Ideally, we could calculateI(L O :L R jC) directly, however, there may not be enough samples for us to calculate the condi- tional mutual information for all values of C. How can we still determine whether I(L O : L R jC) is zero or not while using all the data? We thus designed non- parametric statistical tests to solve this problem in general while making full use of the available data. More importantly, because these information-theoretic quan- tities directly re ect constraints on graphical models, the mystery of Simpson-like 61 paradoxes is replaced with concrete alternatives for generative stories as depicted in Figs. 4.3 and 4.5. We also observed that for some of the stylistic markers, there was diminished but still statistically signicant correlations even after conditioning on length. We again designed a non-parametric statistical test for analyzing this remnant coor- dination more thoroughly. Our ndings suggest that for the Supreme Court data, the remnant coordination cannot be fully explained by other contextual factors. Instead, we postulate that the remnant correlations in the Supreme Court data is due to turn-by-turn level coordination. For the Wikipedia data, however, our results are less conclusive, and we cannot draw any conclusion about turn-by-turn stylistic coordination. Thus, caution must be taken when making general claims about the possible origin of stylistic coordination in dierent settings. It is possible to develop alternative tests based on a more ne-grained, token- level generative models. The main idea behind such a test is to shue the word tokens uttered by an individual within each dialogue, which should destroy turn- by-turn coordination. Our results based on this test suggest that most of the remnant correlations are indeed due to turn-by-turn coordination. However, we emphasize that this test requires an additional assumption whose validity needs to be veried, namely that the words used by a given speaker within a conversation are independent and identically distributed (i.i.d.). 
Furthermore, the test assumes stationarity, i.e., that the contextual factors do not vary within the course of the dialogue. While this assumption seems reasonable in the dialogue settings consid- ered here, it is important to note that deviations from stationarity might be yet another serious obstacle for identifying stylistic in uences Ver Steeg and Galstyan [2013a,b]. Indeed, if we relax the stationarity condition, then any observed corre- lation in stylistic features might be due to temporal evolution rather than direct 62 in uence. And since any permutation-based test destroys temporal ordering, it cannot dierentiate between those two possibilities. In a broader context, we note that sociolinguistic analysis has been used for assessing and predicting societally important outcomes such as health behaviors, suicidal intent, and emotional well-being, to name a few examples Pennebaker and King [1999], Stirman and Pennebaker [2001], Campbell and Pennebaker [2003], Rude et al. [2004], Boals and Klein [2005]. Thus, it is imperative that such pre- dictions are based on sound theoretical and methodological principles. Here we suggest that information theory provides a powerful computational framework for testing various hypotheses, and furthermore, is exible enough to account for var- ious confounding variables. Recent advances in information-theoretic estimation are shifting these approaches from the theoretical realm into practical and use- ful techniques for data analysis. We hope that this work will contribute to the development of mathematically principled tools that enable computational social scientists to draw meaningful conclusions from socio-linguistic phenomena. 63 Chapter 5 Variational Information Maximization for Feature Selection In this chapter, we will consider the mutual information-based feature selection problem. While previous mutual information-based feature selection algorithms have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. Alternatively we introduce a novel feature selection method based on a variational lower bound on mutual information and show that instead of maximizing the mutual information, which is intractable in high dimensions, we can maximize a lower bound on the MI with the proper choice of tractable variational distributions. We use this lower bound to dene an objective function and derive a fast forward feature selection algorithm. 5.1 Previous Mutual Information-based Feature Selection Algorithms Consider a supervised learning scenario where x = fx 1 ; x 2 ;:::; x D g is a D- dimensional input feature vector, and y is the output label. In lter meth- ods, the mutual information-based feature selection task is to select T features 64 x S =fx f 1 ; x f 2 ;:::; x f T g such that the mutual information between x S and y is maximized. Formally, S = arg max S I (x S : y) s:t:jSj =T (5.1) where I() denotes the mutual information [Cover and Thomas, 1991]. Forward Sequential Feature Selection Maximizing the objective func- tion in Eq. 5.1 is generally NP-hard. Many MI-based feature selection methods adopt a greedy method, where features are selected incrementally, one feature at a time. Let S t1 =fx f 1 ; x f 2 ;:::; x f t1 g be the selected feature set after time step t 1. According to the greedy method, the next feature f t at step t is selected such that f t = arg max i= 2S t1 I (x S t1 [i : y) (5.2) where x S t1 [i denotes x's projection into the feature space S t1 [i. As shown in Brown et al. 
[2012], the mutual information term in Eq. 5.2 can be decomposed as:

I(x_{S_{t-1} \cup i} : y) = I(x_{S_{t-1}} : y) + I(x_i : y \mid x_{S_{t-1}})
    = I(x_{S_{t-1}} : y) + I(x_i : y) - I(x_i : x_{S_{t-1}}) + I(x_i : x_{S_{t-1}} \mid y)
    = I(x_{S_{t-1}} : y) + I(x_i : y) - (H(x_{S_{t-1}}) - H(x_{S_{t-1}} \mid x_i)) + (H(x_{S_{t-1}} \mid y) - H(x_{S_{t-1}} \mid x_i, y))    (5.3)

where H(\cdot) denotes the entropy [Cover and Thomas, 1991]. Omitting the terms that do not depend on x_i in Eq. 5.3, we can rewrite Eq. 5.2 as follows:

f_t = \arg\max_{i \notin S_{t-1}} \; I(x_i : y) + H(x_{S_{t-1}} \mid x_i) - H(x_{S_{t-1}} \mid x_i, y)    (5.4)

The greedy learning algorithm has been analyzed in [Das and Kempe, 2011].

5.2 Limitations of Previous Mutual Information-based Feature Selection Methods

Estimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods use low-order approximations to H(x_{S_{t-1}} \mid x_i) and H(x_{S_{t-1}} \mid x_i, y) in Eq. 5.4. A general family of methods relies on the following approximations [Brown et al., 2012]:

H(x_{S_{t-1}} \mid x_i) \approx \sum_{k=1}^{t-1} H(x_{f_k} \mid x_i)
H(x_{S_{t-1}} \mid x_i, y) \approx \sum_{k=1}^{t-1} H(x_{f_k} \mid x_i, y)    (5.5)

The approximations in Eq. 5.5 become exact under the following two assumptions [Brown et al., 2012]:

Assumption 1. (Feature Independence Assumption) p(x_{S_{t-1}} \mid x_i) = \prod_{k=1}^{t-1} p(x_{f_k} \mid x_i)

Assumption 2. (Class-Conditioned Independence Assumption) p(x_{S_{t-1}} \mid x_i, y) = \prod_{k=1}^{t-1} p(x_{f_k} \mid x_i, y)

Assumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature x_i under consideration.

Figure 5.1: Graphical model assumptions for mutual information approximations. (a) Assumption 1; (b) Assumption 2; (c) a model satisfying both Assumption 1 and Assumption 2. The first two graphical models show the assumptions of traditional MI-based feature selection methods. The third graphical model shows a scenario in which both Assumption 1 and Assumption 2 are true. A dashed line indicates that there may or may not be a correlation between two variables.

We now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 5.1. If Assumption 1 holds, then x_i is the only common cause of the previously selected features S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}, so that those features become independent when conditioned on x_i. On the other hand, if Assumption 2 holds, then the features depend both on x_i and the class label y; therefore, generally speaking, the distribution over those features does not factorize by conditioning on x_i alone: there will be remnant dependencies due to y. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown in the rightmost panel of Fig. 5.1. Note, however, that in this case x_i becomes the most important feature, because I(x_i : y) > I(x_{S_{t-1}} : y); then we should have selected x_i at the very first step, contradicting the feature selection process.
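To make the approximated greedy criterion concrete, here is a short Python sketch (for discrete features, using plug-in entropy estimates) of one selection step based on Eq. 5.4 together with the low-order approximations of Eq. 5.5. It is an illustrative sketch of the general family of criteria discussed above, not the exact implementation of any particular published method, and the helper names are our own.

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Plug-in entropy (in nats) of the joint distribution of the given
    discrete columns, estimated from empirical frequencies."""
    joint = list(zip(*columns))
    counts = np.array(list(Counter(joint).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def cond_entropy(target, *given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(target, *given) - entropy(*given)

def greedy_step(X, y, selected):
    """One greedy step: score(i) = I(x_i : y)
    + sum_k H(x_{f_k} | x_i) - sum_k H(x_{f_k} | x_i, y),
    i.e., Eq. 5.4 with the approximations of Eq. 5.5.
    `selected` is the set of already-chosen feature indices."""
    n_features = X.shape[1]
    best_i, best_score = None, -np.inf
    for i in range(n_features):
        if i in selected:
            continue
        mi_iy = entropy(X[:, i]) - cond_entropy(X[:, i], y)  # I(x_i : y)
        score = mi_iy
        for k in selected:
            score += cond_entropy(X[:, k], X[:, i])
            score -= cond_entropy(X[:, k], X[:, i], y)
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

The score reduces to sum_k I(x_{f_k} : y | x_i) plus the relevance term I(x_i : y), which makes explicit how the approximation replaces the joint conditional entropies by sums over pairwise terms.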
67 As we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in [Brown et al., 2012]| including mutual information maximization (MIM) [Lewis, 1992], joint mutual information (JMI) [Yang and Moody, 1999], conditional mutual information max- imization (CMIM) [Fleuret, 2004], maximum relevance minimum redundancy (mRMR) [Peng et al., 2005], conditional Infomax feature extraction (CIFE) Lin and Tang [2006], etc. Approaches based on global optimization of mutual informa- tion, such as quadratic programming feature selection (QPFS) [Rodriguez-Lujan et al., 2010] and the state-of-the-art conditional mutual information-based spectral method (SPEC CMI ) [Nguyen et al., 2014], are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions. In the next section we address these issues by introducing a novel information- theoretic framework for feature selection. Instead of estimating mutual informa- tion and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more exible and general class of assumptions via appropriate choices of variational distributions. 5.3 Proposed Method 5.3.1 Variational Mutual Information Lower Bound Letp(x; y) be the joint distribution of input (x) and output (y) variables.[Barber and Agakov, 2004] derived the following lower bound for mutual information 68 I(x : y) by using the non-negativity of KL-divergence, i.e., P x p (xjy) log p(xjy) q(xjy) 0 gives: I (x : y)H (x) +hlnq (xjy)i p(x;y) (5.6) where angled brackets represent averages and q(xjy) is an arbitrary variational distribution. This bound becomes exact if q(xjy)p(xjy). It is worthwhile to note that in the context of unsupervised representation learning,p(yjx) andq(xjy) can be viewed as an encoder and a decoder, respectively. In this case, y needs to be learned by maximizing the lower bound in Eq. 5.6 by iteratively adjusting the parameters of the encoder and decoder, such as [Barber and Agakov, 2004, Mohamed and Rezende, 2015]. Naturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 5.6 by choosing a subset of features S in x, such that, S = arg max S n H (x S ) +hlnq (x S jy)i p(x S ;y) o (5.7) However, the H(x S ) term in RHS of Eq. 5.7 is still intractable when x S is very high-dimensional. Nonetheless, by noticing that variable y is the class label, which is usually discrete, and hence H(y) is xed and tractable, by symmetry we switch x and y in Eq. 5.6 and rewrite the lower bound as follows: I (x : y)H (y) +hlnq (yjx)i p(x;y) = ln q (yjx) p (y) p(x;y) (5.8) The equality in Eq. 5.8 is obtained by noticing that H(y) =h lnp (y)i p(y) . 69 By using Eq. 5.8, the lower bound optimal subset S of x becomes: S = arg max S ( ln q (yjx S ) p (y) p(x S ;y) ) (5.9) 5.3.2 Choice of Variational Distribution q(yjx S ) in Eq. 5.9 can be any distribution as long as it is normalized. We need to choose q(yjx S ) to be as general as possible while still keeping the term hlnq (yjx S )i p(x S ;y) tractable in Eq. 5.9. As a result, we set q(yjx S ) as q (yjx S ) = q (x S ; y) q (x S ) = q (x S jy)p (y) P y 0 q (x S jy 0 )p (y 0 ) (5.10) We can verify that Eq. 5.10 is normalized even if q(x S jy) is not normalized. If we further denote, q (x S ) = X y 0 q (x S jy 0 )p (y 0 ) (5.11) then by combining Eqs. 
5.9 and 5.10, we get, I (x S : y) ln q (x S jy) q (x S ) p(x S ;y) I LB (x S : y) (5.12) And we also have the following equation which shows the gap betweenI(x S : y) and I LB (x S : y), I (x S : y)I LB (x S : y) =hKL (p (yjx S )jjq (yjx S ))i p(x S ) (5.13) 70 Auto-Regressive Decomposition. Now that q(yjx S ) is dened, all we need to do is modelq(x S jy) under Eq. 5.10, andq(x S ) is easy to compute based on q(x S jy). Here we decompose q(x S jy) as an auto-regressive distribution assuming T features in S: q (x S jy) =q (x f 1 jy) T Y t=2 q (x ft jx f<t ; y) (5.14) where x f<t denotesfx f 1 ; x f 2 ;:::; x f t1 g. The graphical model in Fig. 5.2 demon- Figure 5.2: Auto-regressive decomposition for q(x S jy) strates this decomposition. The main advantage of this model is that it is well- suited for the forward feature selection procedure where one feature is selected at a time (which we will explain in Sec. 5.3.4). And if q (x ft jx f<t ; y) is tractable, then so is the whole distribution q(x S jy). Therefore, we would nd tractable Q- distributions over q (x ft jx f<t ; y). Below we illustrate two such Q-distributions. Naive Bayes Q-distribution. A natural idea would be to assume x t is independent of other variables given y, i.e., q (x ft jx f<t ; y) =p (x ft jy) (5.15) 71 Then the variational distribution q(yjx S ) can be written based on Eqs. 5.10 and 5.15 as follows: q (yjx S ) = p (y) Q j2S p (x j jy) P y 0 p (y 0 ) Q j2S p (x j jy 0 ) (5.16) And we also have the following theorem: Theorem 7 (Exact Naive Bayes). Under Eq. 5.16, the lower bound in Eq. 5.8 becomes exact if and only if data is generated by a Naive Bayes model, i.e., p (x; y) =p (y) Q i p (x i jy). The proof for Theorem 7 becomes obvious by using the mutual information def- inition. Note that the most-cited MI-based feature selection method mRMR Peng et al. [2005] also assumes conditional independence given the class label y as shown in Brown et al. [2012], Balagani and Phoha [2010], Vinh et al. [2015], but they make additional stronger independence assumptions among only feature variables. Pairwise Q-distribution. We now consider an alternative approach that is more general than the Naive Bayes distribution: q (x ft jx f<t ; y) = t1 Y i=1 p (x ft jx f i ; y) ! 1 t1 (5.17) In Eq. 5.17, we assumeq (x ft jx f<t ; y) to be the geometric mean of conditional dis- tributionsq(x ft jx f i ; y). This assumption is tractable as well as reasonable because if the data is generated by a Naive Bayes model, the lower bound in Eq. 5.8 also becomes exact using Eq. 5.17 due to p (x ft jx f i ; y)p (x ft jy) in that case. 72 5.3.3 Estimating Lower Bound From Data Assuming either Naive Bayes Q-distribution or pairwise Q-distribution, it is con- venient to estimate q(x S jy) and q(x S ) in Eq. 5.12 by using plug-in probability estimators for discrete data or one/two-dimensional density estimators for contin- uous data. We also use the sample mean to approximate the expectation term in Eq. 5.12. Our nal estimator for I LB (x S : y) is written as follows: b I LB (x S : y) = 1 N X x (k) ;y (k) ln b q x (k) S jy (k) b q x (k) S (5.18) where x (k) ; y (k) are samples from data, andb q() denotes the estimate for q(). 5.3.4 Variational Forward Feature Selection Under Auto- Regressive Decomposition After dening q(yjx S ) in Eq. 5.10 and auto-regressive decomposition of q(x S jy) in Eq. 5.15, we are able to do the forward feature selection previously described in Eq. 5.2, but replace the mutual information with its lower bound b I LB . 
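As a minimal sketch of how the Naive Bayes lower bound of Eqs. 5.16 and 5.18 could be estimated from discrete data with plug-in (frequency) estimates, consider the following Python code. The Laplace smoothing, helper names, and data layout are illustrative assumptions; the recursive implementation actually used in our experiments is summarized in Appendix B.1.

```python
import numpy as np

def fit_conditionals(X, y):
    """Plug-in estimates of p(y) and p(x_j | y) for discrete data,
    with simple Laplace smoothing to avoid zero probabilities."""
    classes = np.unique(y)
    p_y = np.array([(y == c).mean() for c in classes])
    cond = []  # cond[j][ci] maps value v -> p(x_j = v | y = classes[ci])
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        table = {}
        for ci, c in enumerate(classes):
            col = X[y == c, j]
            table[ci] = {v: ((col == v).sum() + 1.0) / (len(col) + len(values))
                         for v in values}
        cond.append(table)
    return classes, p_y, cond

def lower_bound(X, y, S, classes, p_y, cond):
    """Estimate of I_LB(x_S : y) under the Naive Bayes Q-distribution
    (Eqs. 5.16 and 5.18): the sample average of ln q(x_S | y) - ln q(x_S)."""
    y_idx = np.searchsorted(classes, y)
    total = 0.0
    for k in range(X.shape[0]):
        # log q(x_S | y = c) for every class c
        log_q_given = np.array([
            sum(np.log(cond[j][ci].get(X[k, j], 1e-12)) for j in S)
            for ci in range(len(classes))])
        log_q_joint = np.log(np.sum(p_y * np.exp(log_q_given)))  # log q(x_S)
        total += log_q_given[y_idx[k]] - log_q_joint
    return total / X.shape[0]
```

In a forward pass, one would call `lower_bound` with S extended by each candidate feature and keep the feature that maximizes the estimate, as described next.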
Recall thatS t1 is the set of selected features after step t 1, then the feature f t will be selected at step t such that f t = arg max i= 2S t1 b I LB (x S t1 [i : y) (5.19) where b I LB (x S t1 [i : y) can be obtained from b I LB (x S t1 : y) recursively by auto- regressive decompositionq (x S t1 [i jy) =q (x S t1jy)q (x i jx S t1; y) whereq (x S t1jy) is stored at step t 1. 73 This forward feature selection can be done under auto-regressive decomposition in Eqs. 5.10 and 5.14 for any Q-distribution. However, calculating q(x i jx S t; y) may vary according to dierent Q-distributions. We can verify that it is easy to get q(x i jx S t; y) recursively from q(x i jx S t1; y) under Naive Bayes or pairwise Q- distribution. We call our algorithm under these twoQ-distributionsVMI naive and VMI pairwise respectively. It is worthwhile noting that the lower bound does not always increase at each step. A decrease in lower bound at step t indicates that the Q-distribution would approximate the underlying distribution worse than it did at previous step t 1. In this case, the algorithm would re-maximize the lower bound from zero with only the remaining unselected features. We summarize the concrete implementation of our algorithms in Appendix B.1. Time Complexity. Although our algorithm needs to calculate the distri- butions at each step, we only need to calculate the probability value at each sample point. For bothVMI naive andVMI pairwise , the total computational complexity isO(NDT ) assumingN as number of samples,D as total number of features,T as number of nal selected features. The detailed time analysis is left in Appendix B.1. As shown in Table 5.1, our methodsVMI naive andVMI pairwise have the same time complexity as mRMR [Peng et al., 2005], while the state-of-the-art global optimization methodSPEC CMI [Nguyen et al., 2014] is required to precompute the pairwise mutual information matrix, which gives a time complexity ofO(ND 2 ). Table 5.1: Time complexity in number of features D, selected number of features d, and number of samples N. Method mRMR VMI naive VMI pairwise SPEC CMI Complexity O(NDT ) O(NDT ) O(NDT ) O(ND 2 ) 74 Optimality Under Tree Graphical Models. Although our method VMI naive assumes a Naive Bayes model, we can prove that this method is still optimal if the data is generated according to tree graphical models. Indeed, both of our methods,VMI naive andVMI pairwise , will always prioritize the rst layer features, as shown in Fig. 5.3. This optimality is summarized in Theorem 8 in Appendix B.2. 5.4 Evaluation Synthetic Data. We begin with the experiments on a synthetic model according to the tree structure illustrated in the left part of Fig. 5.3. The detailed data generating process is shown in Appendix B.4. The root node Y is a binary variable, while other variables are continuous. We useVMI naive to optimize the lower boundI LB (x : y). 5000 samples are used to generate the synthethic data, and variational Q-distributions are estimated by the kernel density estimator. We can see from the plot in the right-hand part of Fig. 5.3 that our algorithm,VMI naive , selects x 1 , x 2 , x 3 as the rst three features, although x 2 and x 3 are only weakly correlated with y. If we continue to add deeper level featuresfx 4 ;:::; x 9 g, the lower bound will decrease. For comparison, we also illustrate the mutual information between each single feature x i and y in Table 5.2. 
We can see from Table 5.2 that the maximum relevance criterion [Lewis, 1992] would choose x_1, x_4, and x_5 as the top three features.

Table 5.2: Mutual information between the label y and each feature x_i for the model in Fig. 5.3. I(x_i : y) is estimated using N = 100,000 samples. The three variables with the highest mutual information are highlighted in bold.

feature i     x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8    x_9
I(x_i : y)    0.111  0.052  0.022  0.058  0.058  0.025  0.029  0.012  0.013

Figure 5.3: (Left) The generative model used for the synthetic experiments; edge thickness represents the relationship strength. (Right) Optimizing the lower bound with VMI_naive: ground-truth mutual information I(x_{S_t} : y) versus the lower bound \hat{I}_{LB}(x_{S_t} : y) at each step t. Variables under the blue line denote the features selected at each step. The dotted blue line shows the decreasing lower bound when adding more features. Ground-truth mutual information is obtained using N = 100,000 samples.

Real-World Data. We compare our algorithms VMI_naive and VMI_pairwise with other popular information-theoretic feature selection methods, including mRMR [Peng et al., 2005], JMI [Yang and Moody, 1999], MIM [Lewis, 1992], CMIM [Fleuret, 2004], CIFE [Lin and Tang, 2006], and SPEC_CMI [Nguyen et al., 2014]. We use 17 well-known datasets from previous feature selection studies [Brown et al., 2012, Nguyen et al., 2014] (all data are discretized). The dataset summaries are given in Appendix B.3. We use the average cross-validation error rate over the range of 10 to 100 selected features to compare the different algorithms, under the same setting as Nguyen et al. [2014]. Tenfold cross-validation is employed for datasets with N >= 100 samples and leave-one-out cross-validation otherwise. The 3-nearest-neighbor classifier is used for Gisette and Madelon, following [Brown et al., 2012]. For the remaining datasets, the chosen classifier is a linear SVM, following [Rodriguez-Lujan et al., 2010, Nguyen et al., 2014].

The experimental results are shown in Table 5.3.(1) The best and the second-best performance for each dataset (in terms of average error rate) are marked. We also use the paired t-test at the 5% significance level to test the hypothesis that VMI_naive or VMI_pairwise performs significantly better than the other methods, or vice versa. Overall, we find that both of our methods, VMI_naive and VMI_pairwise, strongly outperform the other methods. This indicates that our variational feature selection framework is a promising addition to the current literature on information-theoretic feature selection.

(1) We omit the results for MIM and CIFE due to space limitations.

Figure 5.4: Number of selected features versus average cross-validation error on the Semeion and Gisette datasets.

Table 5.3: Average cross-validation error rate comparison of VMI against other methods. The last two lines indicate the win (W) / tie (T) / loss (L) counts for VMI_naive and VMI_pairwise, respectively.
Dataset mRMR JMI CMIM SPEC CMI VMI naive VMI pairwise Lung 10.9(4.7) 11.6(4.7) 11.4(3.0) 11.6(5.6) 7.4(3.6) 14.5(6.0) Colon 19.7(2.6) 17.3(3.0) 18.4(2.6) 16.1(2.0) 11.2(2.7) 11.9(1.7) Leukemia 0.4(0.7) 1.4(1.2) 1.1(2.0) 1.8(1.3) 0.0(0.1) 0.2(0.5) Lymphoma 5.6(2.8) 6.6(2.2) 8.6(3.3) 12.0(6.6) 3.7(1.9) 5.2(3.1) Splice 13.6(0.4) 13.7(0.5) 14.7(0.3) 13.7(0.5) 13.7(0.5) 13.7(0.5) Landsat 19.5(1.2) 18.9(1.0) 19.1(1.1) 21.0(3.5) 18.8(0.8) 18.8(1.0) Waveform 15.9(0.5) 15.9(0.5) 16.0(0.7) 15.9(0.6) 15.9(0.6) 15.9(0.5) KrVsKp 5.1(0.7) 5.2(0.6) 5.3(0.5) 5.1(0.6) 5.3(0.5) 5.1(0.7) Ionosphere 12.8(0.9) 16.6(1.6) 13.1(0.8) 16.8(1.6) 12.7(1.9) 12.0(1.0) Semeion 23.4(6.5) 24.8(7.6) 16.3(4.4) 26.0(9.3) 14.0(4.0) 14.5(3.9) Multifeat. 4.0(1.6) 4.0(1.6) 3.6(1.2) 4.8(3.0) 3.0(1.1) 3.5(1.1) Optdigits 7.6(3.3) 7.6(3.2) 7.5(3.4) 9.2(6.0) 7.2(2.5) 7.6(3.6) Musk2 12.4(0.7) 12.8(0.7) 13.0(1.0) 15.1(1.8) 12.8(0.6) 12.6(0.5) Spambase 6.9(0.7) 7.0(0.8) 6.8(0.7) 9.0(2.3) 6.6(0.3) 6.6(0.3) Promoter 21.5(2.8) 22.4(4.0) 22.1(2.9) 24.0(3.7) 21.2(3.9) 20.4(3.1) Gisette 5.5(0.9) 5.9(0.7) 5.1(1.3) 7.1(1.3) 4.8(0.9) 4.2(0.8) Madelon 30.8(3.8) 15.3(2.6) 17.4(2.6) 15.9(2.5) 16.7(2.7) 16.6(2.9) #W 1 =T 1 =L 1 : 11/4/2 10/6/1 10/7/0 13/2/2 #W 2 =T 2 =L 2 : 9/6/2 9/6/2 13/3/1 12/3/2 We also plot the average cross-validation error with respect to number of selected features. Fig. 5.4 shows the two most distinguishable data sets, Semeion 77 and Gisette. We can see that both of our methods,VMI naive andVMI pairwise , have lower error rates in these two data sets. 5.5 Conclusion Feature selection has been a signicant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features. Despite the clarity of mutual information, estimating it can be dicult. In this Chapter, we show that while a large number of information-theoretic methods exist, they are rather limited and rely on mutually inconsistent assumptions about underlying data distributions. We introduced a unifying variational mutual infor- mation lower bound to address these issues and showed that by auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using Naive Bayes and pairwiseQ-distributions, which strongly outperform the existing meth- ods. VMI naive only assumes a Naive Bayes model, but even this simple model outperforms the existing information-theoretic methods, indicating the eective- ness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information- theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex depen- dencies. 78 Chapter 6 Disentangled Representation Learning by Mutual Information In this chapter, we propose a information-theoretic framework for representation learning. We start with the concept of learning disentangled and interpretable representations in an unsupervised way and then use information-theoretic terms to characterize such learning principles. We derive a variational lower bound to the proposed objective function and found a a subtle connection with the classical variational auto-encoders (VAE). We show the applications of the method in image data. 
6.1 Total Correlation and Informativeness Let x = (x 1 ; x 2 ;:::; x d ) denote a d-dimensional random variable whose probability density function isp(x). A measure of multivariate mutual information called total correlation Watanabe [1960a] or multi-information Studen y and Vejnarova [1998] is dened as follows: TC (x) = d X i=1 H (x i )H (x) =D KL p(x)jj d Y i=1 p(x i ) ! (6.1) Note that D KL () denotes the Kullback-Leibler divergence in Eq. 6.1. Intuitively, TC(x) captures the total dependence across all the dimensions of x and is zero if and only if all x i are independent. Total correlation or statistical independence is 79 often used to characterize disentanglement in recent literature on learning repre- sentations Dinh et al. [2014], Achille and Soatto [2017]. The conditional total correlation of x, after observing some latent variable z, is dened as follows, TC (xjz) = d X i=1 H (x i jz)H (xjz) =D KL p(xjz)jj d Y i=1 p(x i jz) ! (6.2) We dene a measure of informativeness of latent variable z about the dependence among the observed variables x by quantifying how total correlation is reduced after conditioning on some latent factor z; i.e., TC (x; z) =TC(x)TC(xjz) (6.3) In Eq. 6.3, we can see that TC (x; z) is maximized if and only if the conditional distribution p(xjz) factorizes, in which case we can interpret z as capturing the information about common causes across all x i . 6.2 Total Correlation Explanation Representa- tion Learning In a typical unsupervised setting like VAE, we assume a generative model where x is a function of a latent variable z, and we then maximize the log likelihood of x under this model. From a CorEx perspective, the situation is reversed. We let z be some stochastic function of x parameterized by, i.e.,p (zjx). Then we seek 80 a joint distribution p (x; z) =p (zjx)p(x), where p(x) is the underlying true data distribution that maximizes the following objective: L(; x) = TC (x; z) | {z } informativeness TC (z) | {z } (dis)entanglement =TC(x)TC (xjz)TC (z) (6.4) In Eq. 6.4, TC (x; z) corresponds to the amount of correlation that is explained by z as dened in Eq. 6.3, andTC (z) quanties the dependence among the latent variables z. By non-negativity of total correlation, Eq. 6.4 naturally forms a lower bound on TC(x); i.e.,TC(x)L(; x) for any. Therefore, the global maximum of Eq. 6.4 occurs at TC(x), in which case TC (xjz) = TC (z) 0 and z can be exactly interpreted as a generative model where z are independent random variables that generate x, as shown in Fig. 6.1. … … ! " ! # ! $ % " % & Figure 6.1: The graphical model forp (x; z) assumingp (zjx) achieves the global maximum in Eq. 6.4. In this model, all x i are factorized conditioned on z, and all z i are independent. Notice that the term TC (x; z) is a bit dierent from the classical denition of informativeness using mutual information I (x; z) Linsker [1997]. In fact, after 81 combining the entropy terms in Eq. 6.1 and 6.2, the following equation holds Ver Steeg and Galstyan [2015]: TC (x; z) = d X i=1 I (x i ; z)I (x; z) (6.5) The term TC (x; z) in Eq. 6.4 can be seen as nding a minimal latent represen- tation z which, after conditioning, disentangles x. When stacking hidden variable layers in Sec. 6.5, we will see that this condition can lead to interpretable features by forcing intermediate layers to be explained by higher layers under a factorized model. Informativeness vs. 
Disentanglement If we only consider the informative- ness term TC (x; z) as in the objective, a naive solution to this problem would be just setting z = x. To avoid this, we also want the latent variables z to be as disentangled as possible, corresponding to the TC(z) term encouraging indepen- dence. In other words, the objective in Eq. 6.4 is trying to nd z, so that z not only disentangles x as much as possible, but is itself as disentangled as possible. 6.3 Optimization We rst focus on optimizing the objective function dened by Eq. 6.4. The exten- sion to the multi-layer (hierarchical) case is presented in the next section. 82 By using Eqs. 6.1 and 6.5, we expand Eq. 6.4 into basic information-theoretic quantities as follows: L(; x) =TC (x; z)TC (z) = d X i=1 I (x i : z)I (x : z) m X i=1 H (z i ) +H (z) = d X i=1 I (x i : z) m X i=1 H (z i ) +H (zjx) (6.6) If we further constrain our search space p (zjx) to have the factorized form p (zjx) = Q m i=1 p i (z i jx) 1 which is a standard assumption in most VAE models, then we have: L(; x) =TC (x; z)TC (z) = d X i=1 I (x i : z) m X i=1 I (z i : x) (6.7) We convert the two total correlation terms into two sets of mutual information terms in Eq. 6.7. The rst term,I (x i : z), denotes the mutual information between each input dimension x i and z, and can be broadly construed as measuring the \relevance" of the representation to each observed variable in the parlance of the information bottleneck Tishby et al. [2000], Shwartz-Ziv and Tishby [2017]. The second term, I (z i : x), represents the mutual information between each latent dimension z i and x and can be viewed as the compression achieved by each latent factor. We proceed by constructing tractable bounds on these quantities. 1 Each marginal distribution p i (z i jx) is parametrized by a dierent i . But we will omit the subscript i under for simplicity, as well as , in the following context. 83 6.3.1 Variational Lower Bound for I (x i : z) Barber and Agakov [2003] derived the following lower bound for mutual informa- tion by using the non-negativity of KL-divergence; i.e., x i p(x i jz) log p(x i jz) q(x i jz) 0 gives: I (x i : z)H(x i ) +hlnq (x i jz)i p (x;z) (6.8) where the angled brackets represent expectations, andq (x i jz) is any arbitrary dis- tribution parametrized by . We need a variational distribution q (x i jz) because the posterior distributionp (xjz) =p (zjx)p(x)=p (z) is hard to calculate because the true data distribution p(x) is unknown|although approximating the normal- ization factorp (z) can be tractable compared to VAE. A detailed comparison with VAE will be made in Sec. 6.4. 6.3.2 Variational Upper Bound for I (z i : x) We again use the non-negativity of KL-divergence, i.e., z i p(z i ) log p(z i ) r(z i ) 0, to obtain: I (x : z i ) = Z dxdz i p (z i; x) logp (z i jx) Z dz i p (z i ) logp (z i ) Z dxdz i p (z i; x) logp (z i jx) Z dz i p (z i ) logr (z i ) = Z dxdz i p (x)p (z i jx) log p (z i jx) r (z i ) =D KL (p (z i jx)jjr (z i )) (6.9) where r (z i ) represents an arbitrary distribution parametrized by . 84 Combining bounds in Eqs. 6.8 and 6.9 into Eq. 6.7, we have: L(; x) = d X i=1 I (x i : z) m X i=1 I (z i : x) d X i=1 H(x i ) +hlnq (x i jz)i p (x;z) m X i=1 D KL (p (z i jx)jjr (z i )) (6.10) We then can jointly optimize the lower bound in Eq. 6.10 w.r.t. both the stochastic parameter and the variational parameters and . 6.4 Connection to Variational Autoencoders Remarkably, Eq. 
6.10 has a form that is very similar to the lower bound introduced in variational autoencoders, except it is decomposed into each dimension x i and z i . To pursue this similarity further, we denote q (xjz) = d Y i=1 q (x i jz); r (z) = m Y i=1 r (z i ) (6.11) Then, by rearranging the terms in Eq. 6.10, we obtain L(; x) = d X i=1 I (x i : z) m X i=1 I (z i : x) d X i=1 H(x i ) ! + * lnq (xjz) | {z } decoder + p (x;z) D KL (p (zjx) | {z } encoder jjr (z)) (6.12) 85 The rst term in the bound, P d i=1 H(x i ), is a constant and has no eect on the optimization. The remaining expression coincides with the VAE objective as long asr (z) is a standard Gaussian. The second term corresponds to the reconstruction error, and the third term is the KL-divergence term in VAE. Comparison The CorEx objective starts with a dened encoder p (zjx) and seeks a decoder q (xjz) via variational approximation to the true posterior. VAE is exactly the opposite. Moreover, in VAE we need a variational approximation to the posterior because the normalization constant is intractable; in CorEx the variational distribution is needed because we do not know the true data distribution p(x). It is also worth mentioning that the lower bound in Eq. 6.12 requires a fully factorized form of the decoder q (xjz), unlike VAE whereq (xjz) can be exible. 2 As pointed out by Zhao et al. [2017], if we choose to use a more expressive distribution family, such as PixelRNN/PixelCNN Van Oord et al. [2016], Gulrajani et al. [2017] for the decoder in a VAE, the model tends to neglect the latent codes altogether, i.e.,I(x : z) = 0. This problem, however, does not exist in CorEx, since it explicitly requires z to be informative about x in the objective function. It is this informativeness term that leads the CorEx objective to a factorized decoder family q (xjz). In fact, if we assume I (x : z) = 0, then we will get TC(x) = TC (xjz) and an informativeness term TC (x; z) of zero|meaning CorEx will avoid such undesirable solutions. Stacking CorEx and Hierarchical VAE Notice that if Eq. 6.4 does not achieve the global maximum, it might be the case that the latent variable z is 2 In this paper we also restrict the encoder distributionp (zjx) to have a factorized form which follows the standard network structures in VAE, but it is not a necessary condition to achieve the lower bound shown in Eq. 6.12. 86 still not disentangled enough, i.e., TC (z)> 0. If this is true, we can reapply the CorEx principle Ver Steeg and Galstyan [2015] and learn another layer of latent variables z (2) on top of z and redo the optimization on (2) w.r.t. the following equation; i.e., L( (2) ; z) =TC (2)(z; z (2) )TC (2)(z (2) ) (6.13) =TC (z)TC (2)(zjz (2) )TC (2)(z (2) ) To generalize, suppose there areL layers of latent variables, z (1) ; z (2) ;:::; z (L) and we further denote the observed variable x z (0) . Then one can stack each latent vari- able z (l) on top of z (l1) and jointly optimize the summation of the corresponding objectives, as shown in Eqs. 6.4 and 6.13; i.e., L( (1;2;::;L) ; x) = L X l=1 L( (l) ; z (l1) ) (6.14) By simple expansion of Eq. 6.14 and cancellation of intermediate TC terms, we have: L( (1;2;::;L) ; x) =L( (1) ; z (0) ) +L( (2) ; z (1) ) +::: +L( (L) ; z (L1) ) =TC(x) L X l=1 TC (l)(z (l1) jz (l) )TC (L)(z (L) ) TC(x) (6.15) 87 Furthermore, if we haveL( (l) ; z (l1) )> 0 for all l, then we get: L( (1) ; x)L( (1;2) ; x):::L( (1;:::;L) ; x) TC(x) (6.16) Eq. 
6.16 shows that stacking latent factor representations results in progres- sively better lower bounds for TC(x). To optimize Eq. 6.14, we reuse Eqs. 6.7, 6.8 and 6.9 and get: L( (1;2;::;L) ; x) X i H(z (0) i ) + L X l=1 X i D lnq (l)(z (l1) i jz (l) ) E p (z) L X l=1 X i D lnp (l)(z (l) i jz (l1) ) E p (z) + X i D lnr (z (L) i ) E p (z) (6.17) Enforcing independence relations at each layer, we denote: q (x; z) = Y i r (z (L) i ) L Y l=1 Y i q (l)(z (l1) i jz (l) ) p (zjx) = L Y l=1 p (l)(z (l) jz (l1) ) (6.18) and obtain L( (1;2;::;L) ; x) X i H(z (0) i ) + ln q (x; z) p (zjx) p (zjx)p(x) (6.19) 88 One can now see that the second term of the RHS in Eq. 6.19 has the same form as deep latent Gaussian models Rezende et al. [2014] (also known as hierarchical VAE) as long as the latent code distribution r (z (L) ) on the top layer follows standard normal and q (l)(z (l1) jz (l) ) on each layer is parametrized by Gaussian distributions. One immediate insight from this connection is that, as long as eachL( (l) ; z (l1) ) is greater than zero in Eq. 6.14, then by expanding the denition of each term we can easily see that z (l) is more disentangled than z (l1) ; i.e.,TC(z (l1) )>TC(z (l) ) if TC(z (l1) ) TC(z (l1) jz (l) ) TC(z (l) ) > 0. Therefore, each latent layer of hierarchical VAE will be more and more disentangled ifL( (l) ; z (l1) )> 0 for each l. This interpretation also provides a criterion for determining the depth of a hierarchical representation; we can add layers as long as the corresponding term in the objective is positive so that the overall lower bound onTC(x) is increasing. Despite reaching the same nal expression, approaching this result from an information-theoretic optimization rather than generative modeling perspective oers some advantages. First, we have much more exibility in specifying the distribution of latent factors, as we can directly sample from this distribution using our encoder. Second, the connection with mutual information suggests intuitive modications of our objective that increase the interpretability of results. These advantages will be explored in more depth in Sec. 6.5. 89 6.5 Applications 6.5.1 Disentangling Latent Codes via Hierarchical VAE / Stacking CorEx on MNIST We train a simple hierarchical VAE/stacking CorEx model with two stochastic layers on the MNIST dataset. The graphical model is shown in Fig. 6.2. For each stochastic layer, we use a neural network to parametrize the distribution p and q , and we set r to be a xed standard Gaussian. We use a 784-512-512-64 !" # (%) (' (() |*) !" # (+) (' (,) |' (() ) * ' (-) ' (.) !/ 0 (%) (*|' (() ) !/ 0 (+) (' (() |' (,) ) * ' (-) ' (.) EncoderLayer1 EncoderLayer2 EncoderLayer2 DecoderLayer2 Figure 6.2: Encoder and decoder models for MNIST, where z (1) is 64 dimensional continuous variable and z (2) is a discrete variable (one hot vector with length ten). fully connected network between x and z (1) and a 64-32-32-16-16-10 dense network between z (1) and z (2) , with ReLU activations in both. The output of z (2) is a ten- dimensional one hot vector, where we decode based on each one-hot representation and weight the results according to their softmax probabilities. After training the model, we nd that the learned discrete variable z (2) on the top layer gives us an unsupervised classication accuracy of 85%, which is competitive with the more complex method shown in Dilokthanakul et al. [2016]. 
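As a rough PyTorch sketch of the two-layer encoder just described (layer sizes follow the text: a 784-512-512-64 network for the continuous code z^(1) and a 64-32-32-16-16-10 network for the discrete code z^(2)), one might write the following. The split of the last layer into a Gaussian mean/log-variance pair, the reparameterized sampling, and all class and variable names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer1(nn.Module):
    """x (784) -> 64-dimensional Gaussian code z1, via a 784-512-512-64
    fully connected network (mean and log-variance heads assumed here)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU())
        self.mu = nn.Linear(512, 64)
        self.logvar = nn.Linear(512, 64)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z1 = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z1, mu, logvar

class EncoderLayer2(nn.Module):
    """z1 (64) -> softmax probabilities over the 10-way discrete code z2,
    via a 64-32-32-16-16-10 dense network with ReLU activations."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 10))

    def forward(self, z1):
        # Soft class probabilities; decoding weights each one-hot code by these.
        return F.softmax(self.body(z1), dim=-1)
```

The decoder mirrors this structure in reverse, and the two stacked layers are trained jointly by maximizing the summed lower bound of Eq. 6.14.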
To verify that the top layer z (2) helps disentangle the middle layer z (1) by encouraging conditional independence of z (1) given z (2) , we calculate the mutual information I (x : z (1) i ) between input x and each dimension z (1) i . We then select 90 the top two dimensions with the most mutual information, and denote these two dimensions as z (1) a , z (1) b . We nd I (x : z (1) a ) = 2:71 and I (x : z (1) b ) = 2:56. We then generate new digits by rst xing the discrete latent variable z (2) on the top layer, and sampling latent codes z (1) fromq (z (1) jz (2) ). We systematically vary the noise from -2 to 2 through q (z (1) a jz (2) ) and q (z (1) b jz (2) ) while keeping the other dimensions of z (1) xed, and visualize the results in Fig. 6.3. We can see that this simple two-layer structure automatically disentangles and learns the interpretable factors on MNIST (width and rotation). We attribute this behavior to stacking, where the top layer disentangles the middle layer and makes the latent codes more interpretable through samples from q (z (1) jz (2) ). 6.5.2 Learning Interpretable Representations through Information Maximizing VAE / CorEx on CelebA One important insight from recently developed methods, like InfoGAN, is that we can maximize the mutual information between a latent code and the observations to make the latent code more interpretable. While it seems ad hoc to add an additional mutual information term in the orig- inal VAE objective, a more natural analogue arises in the CorEx setting. Looking at the formulation in Eq. 6.7, it already contains two sets of mutual information terms. If one would like to anchor a latent variable, say z a , to have higher mutual 91 information with the observation x, then one can simply modify the objective by replacing the unweighted sum with a weighted one: L anchor (; x) (6.20) =TC (x; z)TC (z) +I (z a : x) = d X i=1 I (x i : z) m X i=1;i6=a I (z i : x) (1)I (z a : x) Eq. 6.20 suggests that mutual information maximization in CorEx is achieved by modifying the corresponding weights of the second term I (z i : x) in Eq. 6.7. We then use the lower bound in Eq. 6.10 to obtain L anchor (; x) d X i=1 H(x i ) (6.21) + hlnq (x i jz)i p (x;z) m X i=1;i6=a D KL (p (z i jx)jjr (z i )) (1)D KL (p (z a jx)jjr (z a )) Eq. 6.21 shows that in VAE we can decrease the weight of KL-divergence for particular latent codes to achieve mutual information maximization. We call this new approach AnchorVAE in Eq. 6.21. Notice that there is a subtle dierence between AnchorVAE and -VAE Higgins et al. [2017]. In -VAE, the weights of KL-divergence term for all latent codes are the same, while in AnchorVAE, only the weights of specied factors have been changed to encourage high mutual information. With some prior knowledge of the underlying factors of variation, AnchorVAE encourages the model to concentrate this explanatory power in a lim- ited number of variables. 92 We trained AnchorVAE on the CelebA dataset with 2048 latent factors, with mean square error for reconstruction loss. We adopted a three-layer convolutional neural network structure. The weights of KL-divergence of the rst ve latent variables are set to 0.5 to let them have higher mutual information than other latent variables. The mutual information is plotted in Fig. 6.4 after training. We nd these ve latent variables have the highest mutual information of around 3.5, demonstrating the mutual information maximization eect in AnchorVAE. 
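To illustrate how Eq. 6.21 translates into a training loss, here is a small PyTorch sketch of the weighted objective (negated, since optimizers minimize). It assumes a diagonal Gaussian encoder and a standard normal r(z_i); the anchored-dimension weight of 0.5 matches the CelebA experiment described above, and the function signature and names are our own rather than the exact implementation.

```python
import torch

def anchor_vae_loss(recon_log_lik, mu, logvar, anchored_dims, anchor_weight=0.5):
    """Negative of the AnchorVAE lower bound in Eq. 6.21 (up to constant terms).
    recon_log_lik: per-sample log q(x | z); mu, logvar: parameters of the
    diagonal Gaussian encoder p(z | x), each of shape (batch, m)."""
    # Per-dimension KL( N(mu, sigma^2) || N(0, 1) ).
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)
    weights = torch.ones(kl_per_dim.shape[-1], device=kl_per_dim.device)
    weights[anchored_dims] = anchor_weight  # down-weight KL for anchored codes
    weighted_kl = (kl_per_dim * weights).sum(dim=-1).mean()
    return -recon_log_lik.mean() + weighted_kl
```

Down-weighting the KL penalty on the anchored dimensions corresponds to the reduced coefficient on their divergence term in Eq. 6.21, which is what encourages those codes to retain higher mutual information with x.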
To evaluate the interpretability of those anchored variables for generating new samples, we manipulate the rst ve latent variables while keeping other dimen- sions xed. Fig. 6.5 summarizes the result. We observe that all ve anchored latent variables learn intuitive factors of variation in the data. It is interesting to see that latent variable z 0 and z 4 are very similar|both vary the generated images from white to black in some sense. However, these two latent factors are actually very dierent: z 0 emphasizes skin color variation while z 4 controls the position of the light source. We also trained the original VAE objective with the same network structure and examine the top ve latent codes with highest mutual information. Fig. 6.6 shows the results of manipulating the top two latent codes z 130 , z 610 ; with mutual information I(z 130 : x) = 3:1 and I(z 610 : x) = 2:8 respectively. We can see that they re ect an entangled representation. The other three latent codes demonstrate similar entanglements which are omitted here. 93 6.5.3 Generating Richer and More Realistic Images via CorEx Let us revisit the variational upper bound on I (x : z i ) in Eq. 6.9. In this upper bound, VAE choosesr (z i ) to be a standard normal distribution. But notice that this upper bound becomes tight when r (z i ) =p (z i ); i.e., I (x : z i )D KL (p (z i jx)jjp (z i )) D KL (p (z i jx)jjr (z i )) where p (z i ) = R x p (z i jx)p(x)dx. Therefore, after training the model, we can approximate the true distribution p (z i ) 1 N P N i=1 p (z i jx [i] ) by rst sampling a data point x [i] and then sampling from the conditional p (z i jx [i] ). Repeating this process across latent dimensions, we can use the factorized distribution Q m i=1 p (z i ) to generate new data instead of sampling from a standard normal. In this way, we obtain more realistic images since we are sampling from a tighter lower bound to the CorEx objective. We ran a traditional VAE on the celebA dataset with the log-normal loss as the reconstruction error and 128 latent codes. We calculated the variance of eachp (z i ) and ploted the cumulative distribution of these variances in Fig. 6.7(a). One can see that around 20% of the latent variables actually have a variance greater than two. We have plotted variance versus the mutual information in Fig. 6.7(b), in which we can see that higher variance in z i corresponds to higher mutual information I(x : z i ). In this case, using a standard normal distribution with variance 1 for all z i would be far from optimal for generating the data. Fig. 6.8 shows the generated images by either sampling the latent code from a standard normal distribution or the factorized distribution Q m i=1 p (z i ). We can see 94 that Fig. 6.8(b) not only tends to generate more realistic images than Fig. 6.8(a), but it also exhibits more diversity than Fig. 6.8(a). We attribute this improvement to the more exible nature of our latent code distribution. 6.6 Conclusion Deep learning enables us to construct latent representations that reconstruct or generate samples from complex, high-dimensional distributions. Unfortunately, these powerful models do not necessarily produce representations with structures that match human intuition or goals. Subtle changes to training objectives lead to qualitatively dierent representations, but our understanding of this dependence remains tenuous. 
Information theory has proven fruitful for understanding the competition between compression and relevance preservation in supervised learning Shwartz-Ziv and Tishby [2017]. In this chapter, we explored a similar trade-off in unsupervised learning, between multivariate information maximization and disentanglement of the learned factors. Writing this objective in terms of mutual information led to two surprising connections. First, we arrived at an unsupervised information bottleneck formulation that trades off compression and reconstruction relevance. Second, we found that by making appropriate variational approximations, we could reproduce the venerable VAE objective. This new perspective on VAE enabled more flexible distributions for latent codes and motivated new generalizations of the objective that localize interpretable information in latent codes. Ultimately, this led us to a novel learning objective that generated latent factors capturing intuitive structures in image data. We hope this alternative formulation of unsupervised learning continues to provide useful insights into this challenging problem.

Figure 6.3: Varying the latent codes of z^(1) on MNIST. (a) Manipulating z^(1)_a (azimuth); (b) manipulating z^(1)_b (width). In both figures, each row corresponds to a fixed discrete number in layer z^(2). Different columns correspond to varying the noise of the selected latent node in layer z^(1) from left to right, while keeping the other latent codes fixed. In (a), varying the noise results in different rotations of the digit; in (b), a small (large) value of the latent code corresponds to a wider (narrower) digit.

Figure 6.4: Mutual information I_θ(x : z_i) between the input data x and each latent variable z_i on CelebA with AnchorVAE. It is clear that the anchored first five dimensions have the highest mutual information with x.

Figure 6.5: Manipulating latent codes z_0, z_1, z_2, z_3, z_4 on CelebA using AnchorVAE. We show the effect of the anchored latent variables on the outputs while traversing their values over [-3, 3]. Each row represents a different seed image used to encode the latent codes. Each anchored latent code represents a different interpretable factor: (a) skin color; (b) azimuth; (c) emotion (smile); (d) hair (less or more); (e) lighting.

Figure 6.6: Manipulating the top two latent codes with the most mutual information on CelebA using the original VAE. We observe that both latent codes learned entangled representations: (a) z_130 entangles skin color with hair; (b) z_610 entangles emotion with azimuth.

Figure 6.7: Variance statistics for p_θ(z) on CelebA after training a standard VAE with 128 latent codes. (a) Cumulative distribution of the variance of each p_θ(z_i); (b) variance of p_θ(z_i) versus the mutual information I_θ(x : z_i).

Figure 6.8: Different sampling strategies for the latent codes on the CelebA dataset with VAE / CorEx: (a) latent codes sampled from a standard normal; (b) latent codes sampled from \prod_{i=1}^m p_θ(z_i).
Sampling latent codes from Q m i=1 p (z i ) in (b) yields better quality images than sampling from a standard normal distribution in (a). 101 Chapter 7 Conclusion and Future Directions 7.1 Contributions Mutual information is a well-established quantity in information theory and is a universal measure of dependence under probabilistic settings. Despite its general- ity, my thesis points out an unnegligible problem in mutual information { its esti- mation. For this part, my thesis has made the following contributions to advance the eld of estimating mutual information: we point out the limitations of nonparametric mutual information estimators and show stronger relationship are more dicult to measure. we propose local non-uniform (LNC) mutual information estimator to over- come the limitations of previous mutual information estimators and show it can detect stronger relationships. we propose another local Gaussian approximation (LGA) mutual information estimator to address the above limitation and show its consistency. This critical aspect of estimating mutual information rises challenges when mutual information is applied to a variety of machine learning domains. By care- fully examining dierent scenarios, my work has applied mutual information to three dierent tasks and made the following contributions: we introduce mutual information into an computational linguistic setting, and propose several statistical tests to detect confounding eect in linguistic 102 coordination using mutual information as well as conditional mutual infor- mation. we investigate mutual information-based feature selection and demonstrate that existing methods are based on unrealistic assumptions. Alternatively, we formulate a more exible and general class of assumptions based on vari- ational distributions and use them to tractably generate lower bounds for mutual information. we formulate an information-theoretic objective to disentangled represen- tation learning problem and optimize the objective with deep neural net- works.We nd that under standard assumptions, the lower bound for the objective shares the same mathematical form as the evidence lower bound (ELBO) used in variational auto-encoders (VAE), suggesting that this objec- tive provides a dual information-theoretic perspective on representations learned by VAE. Going beyond the standard scenario to hierarchical VAEs or deep Gaus- sian latent models (DLGM) Rezende et al. [2014], we demonstrate that our MI-based learning framework provides new insight into measuring how repre- sentations become progressively more disentangled at subsequent layers. In addition, our objective can be naturally decomposed into two sets of mutual information terms with an interpretation as an unsupervised information bot- tleneck. And inspired by this formulation, we propose to make some latent factors more interpretable by reweighting terms in the objective to make certain parts of the latent code uniquely informative about the inputs. In addition, we show that by sampling each latent code z i from the encoding distribution p(z i ) = R x p(z i jx)p(x)dx instead of the standard Gaussian prior 103 in VAE, we can generate richer and more realistic samples than VAE even under the same network model. 7.2 Future Directions In Chapter 3, we have proposed two solutions to mutual information estimation and show empirically it converges faster with less samples, but it is not yet clear the exact sample complexity for robust estimation under such estimators. 
This question need a further investigation. In Chapter 4, while our work focuses on linguistic style matching, we believe that the information-theoretic method proposed needs to be applied to more gen- eral types of linguistic coordination in dialogues, such as structural priming Bock [1986], Pickering and Ferreira [2008], or lexical entrainment Brennan [1996], Bren- nan and Clark [1996]. Recall that according to the structural priming hypothesis, the presence of a certain linguistic structure in an utterance aects the probability of seeing the same structure later in the dialogue. This type of turn-by-turn coor- dination can naturally be captured by (time-shifted) mutual information between properly dened linguistic variables. Furthermore, using the permutation tests described here, it should be possible to dierentiate between historical and ahis- torical mechanisms of lexical entrainment Brennan and Clark [1996]. In Chapter 5{6, we proposed to use a variational lower bound to mutual infor- mation for optimizing general machine learning tasks. An interesting further work would be to study the fairness in machine learning using mutual information, sim- ilar to the information-bottle principle but in a reversed way. Preliminary results 104 show that this information-theoretic framework can escape the adversarial train- ing when minimizing the mutual information, but we need to investigate more to understand this phenomenon both empirically and theoretically. 105 Appendix A Appendix for Chapter 3 A.1 Proof of Theorem 2 Proof. Notice that for a xed sample point x (i) , its k-nearest-neighbor distance r k x (i) is always equal to or larger than thek-nearest-neighbor distance of at the same point x (i) projected into a sub-dimension j, i.e., for any i;j, we have r k x (i) r k x (i) j (A.1.1) 106 Using Eq. A.1.1, we get the upper bound of b I kNN;k (x) as follows: b I kNN;k (x) = b I 0 kNN;k (x) (d 1) k = 1 N N X i=1 log b p k x (i) d Q j=1 b p k x (i) j (d 1) k = 1 N N X i=1 log k N1 (d=2)+1 d=2 r k x (i) d d Q j=1 k N1 (1=2)+1 1=2 r k x (i) j 1 (d 1) k (d 1) log N 1 k + log (d=2) + 1 ( (1=2) + 1) d (d 1) ( (k) logk) (d 1) log N 1 k + log (d=2) + 1 ( (1=2) + 1) d (d 1) ( (1) log 1) (A.1.2) The last inequality is obtained by noticing that (k) log(k) is a monotonous decreasing function. Also, we have, log (d=2) + 1 ( (1=2) + 1) d = log ( (d=2) + 1)d log ( (d=2) + 1) < log p 2 d=2 + 1=2 e d=2+1=2 ! d log 1 2 + 1 = O (d logd) 107 The inequality above is obtained by using the bound of gamma function that, (x + 1)< p 2 x + 1=2 e x+1=2 Therefore, reconsidering A.1.2, we get the following inequality for b I kNN;k (x): b I kNN;k (x) (d 1) log N 1 k +O (d logd) (d 1) log (N 1) +O (d logd) Requiring thatj b I kNN;k (x)I(x)j", we obtain, NC exp I (x) d 1 + 1 (A.1.3) where C is a constant which scales like O( 1 d ). A.2 Proof of Theorem 5 Proof. ConsiderN i.i.d. samples x (i) N i=1 drawn from the probability densityf(x), and let F N (x) denote the empirical cumulative distribution function. Let us dene the following two quantities: H 1 = 1 N N X i=1 lnE b f(x i ) (A.2.1) H 2 = 1 N N X i=1 lnf(x i ) (A.2.2) 108 Then we have, Ej b H(x)H(x)j =Ej( b HH 1 ) + (H 1 H 2 ) + (H 2 H)j Ej b HH 1 j +EjH 1 H 2 j +EjH 2 Hj (A.2.3) We now procced to show that each of the terms in Eq. A.2.3 individually converges to 0 in the limit N!1, which will then yield Eq. 3.24. 
First, we note that according to the mean value theorem, for any x, there exist t x and t 0 x in (0; 1), such that ln b f (x) = lnE b f (x) + (A.2.4) b f (x)E b f (x) ln t x b f (x) + (1t x )E b f (x) and lnE b f (x) = lnf (x) + (A.2.5) E b f (x)f (x) ln t 0 x f (x) + (1t 0 x )E b f (x) 109 For the rst term in Eq. A.2.3, we use Eq. A.2.4 to obtain E b HH 1 =E Z [ln b f (x) lnE b f (x)]dF N (x) =E Z j b f(x)E b f(x)j t x b f (x) + (1t x )E b f (x) dF N (x) 1 1t E Z j b f (x)E b f (x)j E b f (x) dF N (x) = 1 1t E 1 N N X i=1 j b f (x i )E b f (x i )j E b f (x i ) ! = 1 1t E E j b f (u)E b f (u)j E b f (u) ! jx = u ! = 1 1t Z j b f (u)E b f (u)j b f (u) E b f (u) du (A.2.6) where t is the maximum value among all t x . Using Theorem 4, we have j b f (u)E b f (u)j! 0 as N!1. Furthermore, it is possible to show that9N 0 , so that for anyN >N 0 one hasj b f (u)E b f (u)j b f(u) E b f(u) < 2f (u). Thus, using Lebesgue dominated convergence theorem, we obtain lim N!1 EjHH 1 j = 0 (A.2.7) 110 Similarly, using Eq. A.2.5,EjH 1 H 2 j can be written as EjH 1 H 2 j =E Z [lnE b f (x) lnf (x)]dF N (x) =E Z jE b f (x)f (x)j t 0 x f (x) + (1t 0 x )E b f (x) dF N (x) 1 t 0 E Z jE b f (x)f (x)j f (x) dF N (x) = 1 t 0 E 1 N N X i=1 jE b f (x i )f (x i )j f (x i ) ! = 1 t 0 Z f (x) jE b f (x)f (x)j f (x) dx = 1 t 0 Z jE b f (x)f (x)jdx (A.2.8) where t 0 is the minimum value among all t 0 x . Invoking Theorem 4 again, we observe that the last term in Eq. A.2.8 jE b f (x)f (x)j! 0 as N>1, and is bounded by 2f(x) for suciently large N (e.g., when when b f(u) andE b f(u) are suciently close). Therefore, by Theorem ??, we have lim N!1 EjH 1 H 2 j = 0 (A.2.9) Finally, for the last term in Eq. A.2.3, we note that EH 2 = 1 N E N X i=1 lnf (x i ) =E[ lnf (x)] (A.2.10) 111 Thus,EH 2 is simply the entropy in Denition ??; see Eq. ??. Therefore, lim N!1 EjH 2 Hj = 0 (A.2.11) Combining Eqs. A.2.7, A.2.9, A.2.11 and A.2.3, we arrive at Eq. 3.24, which con- cludes the proof. A.3 Proof of Theorem 6 Proof. For mutual information estimation, we get Ej b I (x : y)I (x : y)j EjH (x) b H (x)j + EjH (y) b H (y)j + EjH (x; y) b H (x; y)j (A.3.1) Using Theorem 5, we see that all three terms on the right hand side in Eq. A.3.1 converge to zero as N !1, therefore lim N!1 Ej b I (x : y)I (x : y)j = 0, thus concluding the proof. A.4 Derivation of Eq. 3.15 The naive kNN or KSG estimator can be written as: b I k (x) = 1 N N X i=1 log P(x (i) ) V (i) d Q j=1 P x (i) j V j (i) (A.4.1) 112 where P (x (i) ) is the probability mass around the k-nearest-neighborhood at x (i) and P (x (i) j ) is the probability mass around the k-nearest-neighborhood (or n x j (i)- nearest-neighborhood for KSG) at x (i) projected into j-th dimension. Also, V (i) and V j (i) denote the volume of the kNN ball(or hype-rectangle in KSG) in the joint space and projected subspaces respectively. Now our local nonuniform correction method replaces the volume V (i) in Eq. A.4.1 with the corrected volume V (i), thus, our estimator is obtained as fol- lows: b I LNC;k (x) = 1 N N X i=1 log P(x (i) ) V (i) d Q j=1 P x (i) j V j (i) = 1 N N X i=1 log P(x (i) ) V (i) V (i) V (i) d Q j=1 P x (i) j V j (i) = b I k (x) 1 N N X i=1 log V (i) V (i) (A.4.2) A.5 Empirical Evaluation for k;d Suppose we have a uniform distribution on thed dimensional (hyper)rectangle with volume V . We sample k points from this uniform distribution. We perform PCA using these k points to get a new basis 1 . 
After rotating into this new basis, we nd the volume, V , of the smallest rectilinear rectangle containing the points. By chance, we will typically nd V <V , even though the distribution is uniform. This will lead to us to (incorrectly) apply a local non-uniformity correction. Instead, 1 In practice, we recommend k to be larger than 2d. 113 we set a threshold k;d and if V=V is above the threshold, we assume that the distribution is locally uniform. Setting involves a trade-o. If it is set too high, we will incorrectly conclude there is local non-uniformity and therefore over- estimate the mutual information. If we set too low, we will lose statistical power for \medium-strength" relationships (though very strong relationships will still lead to values of V=V smaller than ). In practice, we determine the correct value of k;d empirically. We look at the probability distribution of V=V that occurs when the true distribution is uniform. We set conservatively so that when the true distribution is uniform, our criteria rejects this hypothesis with small probability, . Specically, we do a number of trials, N, and set ^ k;d such that N P i=1 I V i V i < ^ k;d =N < where is a relatively small value. In practice, we chose = 5 10 3 and N = 5 10 5 . The following algorithm describes this procedure: Algorithm A.5.1 Estimating k;d for LNC Input: parameter d (dimension), k (nearest neighbor), N, Output: ^ k;d set array a to be NULL repeat Randomly choose a uniform distribution supported ond dimensional (hyper) rectangle, denote its volume to be V Draw k points from this uniform distribution, get the correcting volume V after doing PCA add the ratio V V to array a until above procedure repeated N times ^ k;d dNeth smallest number in a Figure A.5.1 shows empirical value of ^ k;d for dierent (k;d) pairs. We can see that for a xed dimension d, ^ k;d grows as k increases, meaning that V must be closer to V to accept the null hypothesis of uniformity. We also nd that ^ k;d decreases as the dimension d increases, indicating that for a xed k, V becomes 114 2 4 6 8 10 12 14 16 18 20 k 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 b α k,d d =2 d =3 d =5 d =10 Figure A.5.1: b k;d as a function of k. k ranges over [d; 20] for each dimension d. much smaller thanV when points are drawn from a uniform distribution in higher dimensions. A.6 More Functional Relationship Tests in Two Dimensions for LNC estimator We have tested together twenty-one functional relationships described in [Reshef et al., 2011, Kinney and Atwal, 2014], we show six of them in Section ??. The complete results are shown in Figure A.6.1. Detailed description of the functions can be found in Table S1 of Supporting Information in [Kinney and Atwal, 2014]. 
A.6 More Functional Relationship Tests in Two Dimensions for the LNC Estimator

We tested twenty-one functional relationships described in [Reshef et al., 2011, Kinney and Atwal, 2014], six of which are shown in the main text. The complete results are shown in Figure A.6.1. Detailed descriptions of the functions can be found in Table S1 of the Supporting Information of [Kinney and Atwal, 2014].

[Figure A.6.1: twenty-one panels, one per functional relationship, each plotting $I(X:Y)$ (0 to 12) against the noise amplitude $\sigma$ (from $3^{-11}$ to $3^{3}$) for the LNC, KSG, GNN, MST and EXP estimators together with the ground truth. Panels: Line; LP, low frequency; LP, medium frequency; LP, high frequency; LP, high frequency 2; Spike; Cubic; Cubic, Y stretched; VF [med] sin; VF [med] cos; Cos, high frequency; Sin, low frequency; Sin, high frequency; Exponential ($2^x$); Exponential ($10^x$); Sigmoid; Parabola; L shaped; Lopsided L shaped; NFF [low] sin; NFF [low] cos.]

Figure A.6.1: Mutual information tests of the LNC, KSG, GNN, MST and EXP estimators. Twenty-one functional relationships with different noise intensities are tested. The noise has the form $U[-\sigma/2, \sigma/2]$, where $\sigma$ varies (as shown on the x-axis of the plots). For the KSG, GNN and LNC estimators, the nearest-neighbor parameter is $k = 5$. We use $N = 5{,}000$ data points for each noisy functional relationship.

Appendix B
Appendix for Chapter 5

B.1 Detailed Algorithm for Variational Forward Feature Selection

We describe the detailed algorithm for our approach. Concretely, suppose the class label $y$ is discrete and takes $L$ different values $\{y_1, y_2, \ldots, y_L\}$. For the distribution $q(x_{S_t} \mid y)$, we define a vector $Q_t^{(k)}$ of size $L$ for each sample $(\mathbf{x}^{(k)}, y^{(k)})$ at step $t$:

Q_t^{(k)} = \left[\hat q\left(x_{S_t}^{(k)} \mid y = y_1\right), \ldots, \hat q\left(x_{S_t}^{(k)} \mid y = y_L\right)\right]^{T}    (B.1.1)

where $x_{S_t}^{(k)}$ denotes the projection of sample $\mathbf{x}^{(k)}$ onto the feature subspace $x_{S_t}$. We further denote by $Y$ the $L \times 1$ distribution vector of $y$:

Y = \left[\hat p(y = y_1), \hat p(y = y_2), \ldots, \hat p(y = y_L)\right]^{T}    (B.1.2)

We can then rewrite $q(x_{S_{t-1}})$ and $q(x_{S_{t-1}} \mid y)$ in terms of $Q_{t-1}^{(k)}$ and $Y$ and substitute them into $\hat I_{LB}(x_{S_{t-1}} : y)$.
To illustrate, at step $t-1$ we have

\hat I_{LB}(x_{S_{t-1}} : y)
= \frac{1}{N}\sum_{(\mathbf{x}^{(k)}, y^{(k)})} \log p\left(x_{S_{t-1}}^{(k)} \mid y = y^{(k)}\right)
- \frac{1}{N}\sum_{k} \log\left(Y^{T} Q_{t-1}^{(k)}\right)    (B.1.3)

To select a feature $i$ at step $t$, let us define the conditional distribution vector $C_{i,t-1}^{(k)}$ for each feature $i \notin S_{t-1}$ and each sample $(\mathbf{x}^{(k)}, y^{(k)})$, i.e.,

C_{i,t-1}^{(k)} = \left[q\left(x_i^{(k)} \mid x_{S_{t-1}}^{(k)}, y = y_1\right), \ldots, q\left(x_i^{(k)} \mid x_{S_{t-1}}^{(k)}, y = y_L\right)\right]^{T}    (B.1.4)

At step $t$, we use the previously stored $C_{i,t-1}^{(k)}$ and $Q_{t-1}^{(k)}$ to get

\hat I_{LB}(x_{S_{t-1} \cup i} : y)
= \frac{1}{N}\sum_{(\mathbf{x}^{(k)}, y^{(k)})} \log\left[p\left(x_{S_{t-1}}^{(k)} \mid y = y^{(k)}\right) p\left(x_i^{(k)} \mid x_{S_{t-1}}^{(k)}, y = y^{(k)}\right)\right]
- \frac{1}{N}\sum_{k} \log\left(Y^{T} \operatorname{diag}\left(Q_{t-1}^{(k)}\right) C_{i,t-1}^{(k)}\right)    (B.1.5)

We summarize the detailed implementation in Algorithm B.1.1.

Algorithm B.1.1 Variational Forward Feature Selection (VMI)
Data: $(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})$
Input: $T$ {number of features to select}
Output: $F$ {final selected feature set}
  $F \leftarrow \varnothing$; $S_0 \leftarrow \varnothing$; $t \leftarrow 1$
  initialize $Q_0^{(k)}$ and $C_{i,0}^{(k)}$ for every feature $i$; calculate $Y$
  while $|F| < T$ do
    compute $\hat I_{LB}(x_{S_{t-1} \cup i} : y)$ {Eq. B.1.5, for each $i$ not in $F$}
    $f_t \leftarrow \arg\max_{i \notin S_{t-1}} \hat I_{LB}(x_{i \cup S_{t-1}} : y)$
    if $\hat I_{LB}(x_{S_{t-1} \cup f_t} : y) \le \hat I_{LB}(x_{S_{t-1}} : y)$ then
      clear $S$; set $t \leftarrow 1$
    else
      $F \leftarrow F \cup \{f_t\}$
      $S_t \leftarrow S_{t-1} \cup \{f_t\}$
      update $Q_t^{(k)}$ and $C_{i,t}^{(k)}$
      $t \leftarrow t + 1$
    end if
  end while

Updating $Q_t^{(k)}$ and $C_{i,t}^{(k)}$ in Algorithm B.1.1 may vary according to the choice of Q-distribution. However, one can verify that under the Naive Bayes Q-distribution or the pairwise Q-distribution, $Q_t^{(k)}$ and $C_{i,t}^{(k)}$ can be obtained recursively from $Q_{t-1}^{(k)}$ and $C_{i,t-1}^{(k)}$, by noticing that $q(x_i \mid x_{S_t}, y) = p(x_i \mid y)$ for the Naive Bayes Q-distribution and $q(x_i \mid x_{S_t}, y) = p(x_i \mid x_{f_t}, y)^{1/t}\, q(x_i \mid x_{S_{t-1}}, y)^{(t-1)/t}$ for the pairwise Q-distribution.

Let $N$ denote the number of samples, $D$ the total number of features, $T$ the number of selected features, and $L$ the number of distinct values of the class variable $y$. The computational complexity of Algorithm B.1.1 involves calculating the lower bound for each feature $i$ at every step, which is $O(NDL)$; updating $C_{i,t}^{(k)}$ costs $O(NDL)$ for the pairwise Q-distribution and $O(1)$ for the Naive Bayes Q-distribution; and updating $Q_t^{(k)}$ costs $O(NDL)$. We need to select $T$ features, so the overall time complexity is $O(NDT)$ (we drop the factor $L$ here because the number of classes is usually much smaller).
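To make Algorithm B.1.1 concrete, the following is a compact Python/NumPy sketch of the Naive Bayes variant (VMI_naive) for discrete, integer-coded features. All names are ours, the smoothing constant is an illustrative choice, and the restart step ("clear $S$; set $t \leftarrow 1$") is simplified to an early stop when no candidate increases the bound; it is a sketch of the idea rather than the thesis implementation.

import numpy as np

def vmi_naive_forward(X, y, T, smooth=1e-3):
    # Greedy VMI feature selection with the Naive Bayes Q-distribution:
    # q(x_S | y) = prod_{i in S} p(x_i | y), so adding feature i just adds
    # log p(x_i | y) to every entry of log Q^(k).
    N, D = X.shape
    classes = np.unique(y)
    L = classes.size
    y_idx = np.searchsorted(classes, y)
    p_y = np.bincount(y_idx, minlength=L) / N          # the vector Y

    # log p(x_i^(k) | y = c) for each feature i, class c and sample k
    log_cond = np.empty((D, L, N))
    for i in range(D):
        values = np.unique(X[:, i])
        for c in range(L):
            col = X[y_idx == c, i]
            counts = np.array([(col == v).sum() for v in values], dtype=float)
            probs = (counts + smooth) / (counts.sum() + smooth * values.size)
            table = dict(zip(values, probs))
            log_cond[i, c] = np.log([table[v] for v in X[:, i]])

    def lower_bound(log_q):
        # (1/N) sum_k [ log q(x_S^(k) | y^(k)) - log sum_c p(c) q(x_S^(k) | c) ]
        joint = log_q[np.arange(N), y_idx]
        m = log_q.max(axis=1, keepdims=True)           # stabilized log-sum-exp
        marginal = m[:, 0] + np.log(np.exp(log_q - m) @ p_y)
        return (joint - marginal).mean()

    selected = []
    log_Q = np.zeros((N, L))                           # empty S: q = 1, log q = 0
    current = 0.0
    while len(selected) < T:
        best_i, best_val = None, current
        for i in range(D):
            if i in selected:
                continue
            val = lower_bound(log_Q + log_cond[i].T)   # Naive Bayes update
            if val > best_val:
                best_i, best_val = i, val
        if best_i is None:                             # no improvement: stop early
            break
        selected.append(best_i)
        log_Q += log_cond[best_i].T
        current = best_val
    return selected

The pairwise variant would differ only in how the per-feature conditionals are blended in, using the $1/t$ and $(t-1)/t$ weights from the recursion above instead of a plain sum of log-terms.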
B.2 Optimality Under Tree Graphical Models

Theorem 8 (Optimal Feature Selection). Suppose the data are generated according to a tree graphical model in which the class label $y$ is the root node, and denote the set of child nodes in the first layer by $\mathcal{L}_1 = \{x_1, x_2, \ldots, x_{L_1}\}$, as shown in Fig. B.2.1. Then there must exist a step $T > 0$ such that the following three conditions hold when using $VMI_{naive}$ or $VMI_{pairwise}$:

Condition I: the selected feature set $S_T \subseteq \mathcal{L}_1$.
Condition II: $I_{LB}(x_{S_t} : y) = I(x_{S_t} : y)$ for $1 \le t \le T$.
Condition III: $I_{LB}(x_{S_T} : y) = I(x : y)$.

Figure B.2.1: Demonstration of the tree graphical model; the label $y$ is the root node.

Proof. We prove this theorem by induction. For a tree graphical model, $VMI_{naive}$ and $VMI_{pairwise}$ are mathematically equal when selecting first-layer features; therefore we prove only the $VMI_{naive}$ case, and $VMI_{pairwise}$ follows the same proof.

1) At step $t = 1$, for each feature $i$ we have

I_{LB}(x_i : y)
= \left\langle \ln \frac{q(x_i \mid y)}{q(x_i)} \right\rangle_{p(\mathbf{x}, y)}
= \left\langle \ln \frac{p(x_i \mid y)}{\sum_{y'} p(y')\, p(x_i \mid y')} \right\rangle_{p(\mathbf{x}, y)}
= \left\langle \ln \frac{p(x_i \mid y)}{p(x_i)} \right\rangle_{p(\mathbf{x}, y)}
= I(x_i : y)    (B.2.1)

Thus, at the very first step we choose the feature that has the maximum mutual information with $y$. Based on the data processing inequality, we have $I(x_i : y) \ge I(desc(x_i) : y)$ for any $x_i$ in layer 1, where $desc(x_i)$ denotes any descendant of $x_i$. Thus, at step $t = 1$ we always select a feature among the nodes of the first layer, without loss of generality: if a node $x_j$ that is not in the first layer is selected at step $t = 1$, denote by $ances(x_j)$ its ancestor in layer 1; then $I(x_j : y) = I(ances(x_j) : y)$, which means that no information about $y$ is lost from $ances(x_j)$ to $x_j$. In this case, one can always switch $ances(x_j)$ with $x_j$ and treat $x_j$ as belonging to the first layer, which does not conflict with the model assumption. Therefore, conditions I and II are satisfied at step $t = 1$.

2) Assuming conditions I and II are satisfied at step $t$, we have the following argument at step $t + 1$. We discuss the candidate nodes in three classes and argue that nodes in the Remaining-Layer1 class are always selected.

Redundant class. For any descendant $desc(S_t)$ of the selected feature set $S_t$, we have

I(x_{S_t \cup desc(S_t)} : y) = I(x_{S_t} : y) = I_{LB}(x_{S_t} : y)    (B.2.2)

The first equality in Eq. B.2.2 comes from the fact that $desc(S_t)$ carries no additional information about $y$ beyond $S_t$; the second equality is by the induction hypothesis. Based on Eq. 5.12 and Eq. B.2.2, we have

I_{LB}(x_{S_t \cup desc(S_t)} : y) < I(x_{S_t \cup desc(S_t)} : y) = I(x_{S_t} : y)    (B.2.3)

We assume here, without loss of generality, that the left-hand side is strictly less than the right-hand side in Eq. B.2.3. Indeed, if equality holds, then $p(x_{S_t} \mid y)\, p(desc(S_t) \mid y) = p(x_{S_t}, desc(S_t) \mid y)$ by Theorem 7, and in this case we can always rearrange $desc(S_t)$ into the first layer, which does not conflict with the model assumption. Note that by combining Eqs. B.2.2 and B.2.3 we also get

I_{LB}(x_{S_t \cup desc(S_t)} : y) < I_{LB}(x_{S_t} : y)    (B.2.4)

Eq. B.2.4 means that adding a feature from the Redundant class actually decreases the value of the lower bound $I_{LB}$.

Remaining-Layer1 class. For any other unselected node $j$ of the first layer, i.e., $j \in \mathcal{L}_1 \setminus S_t$, we have

I(x_{S_t} : y) \le I(x_{S_t \cup j} : y) = I_{LB}(x_{S_t \cup j} : y)    (B.2.5)

The inequality in Eq. B.2.5 follows from the data processing inequality [Cover and Thomas, 1991], and the equality comes directly from Theorem 7.

Descendants-of-Remaining-Layer1 class. For any node $desc(j)$ that is a descendant of some $j \in \mathcal{L}_1 \setminus S_t$, we have

I_{LB}(x_{S_t \cup desc(j)} : y) \le I(x_{S_t \cup desc(j)} : y) \le I(x_{S_t \cup j} : y)    (B.2.6)

The second inequality in Eq. B.2.6 again follows from the data processing inequality. Combining Eqs. B.2.3 and B.2.5, we get

I_{LB}(x_{S_t \cup desc(S_t)} : y) < I_{LB}(x_{S_t \cup j} : y)    (B.2.7)

Combining Eqs. B.2.5 and B.2.6, we get

I_{LB}(x_{S_t \cup desc(j)} : y) \le I_{LB}(x_{S_t \cup j} : y)    (B.2.8)

Ineq. B.2.7 tells us that the forward selection will always choose the Remaining-Layer1 class over the Redundant class. Ineq. B.2.8 says that, without loss of generality, we choose the Remaining-Layer1 class over the Descendants-of-Remaining-Layer1 class (for the equality case, we can apply the same argument as at step $t = 1$). Considering Ineqs. B.2.7 and B.2.8, at step $t + 1$ the algorithm chooses a node $j$ in the Remaining-Layer1 class, i.e., $j \in \mathcal{L}_1 \setminus S_t$. Therefore, conditions I and II hold at step $t + 1$.

At step $t + 1$, if $I_{LB}(x_{S_t \cup j} : y) = I_{LB}(x_{S_t} : y)$ for any $j \in \mathcal{L}_1 \setminus S_t$, this means that $I(x_{S_t \cup j} : y) = I(x_{S_t} : y)$. Then we have

I(x_{S_t} : y) = I(x_{\mathcal{L}_1} : y) = I(x : y)    (B.2.9)

The first equality in Eq. B.2.9 holds because adding any $j$ in $\mathcal{L}_1 \setminus S_t$ does not increase the mutual information.
The second equality in Eq. B.2.9 is due to the data processing inequality under the tree graphical model assumption. Therefore, if $I_{LB}(x_{S_t \cup j} : y) = I_{LB}(x_{S_t} : y)$ for every $j \in \mathcal{L}_1 \setminus S_t$, we set $T = t$. Thus, by combining condition II and Eq. B.2.9, we have

I_{LB}(x_{S_T} : y) = I(x_{S_T} : y) = I(x : y)    (B.2.10)

and condition III holds.

B.3 Datasets and Results

Table B.1 summarizes the datasets used in the experiments. Table 5 shows the complete results.

Table B.1: Dataset summary. N: # samples, d: # features, L: # classes.

Data         N      d      L    Source
Lung         73     325    20   [Ding and Peng, 2005]
Colon        62     2000   2    [Ding and Peng, 2005]
Leukemia     72     7070   2    [Ding and Peng, 2005]
Lymphoma     96     4026   9    [Ding and Peng, 2005]
Splice       3175   60     3    [Bache and Lichman, 2013]
Landsat      6435   36     6    [Bache and Lichman, 2013]
Waveform     5000   40     3    [Bache and Lichman, 2013]
KrVsKp       3196   36     2    [Bache and Lichman, 2013]
Ionosphere   351    34     2    [Bache and Lichman, 2013]
Semeion      1593   256    10   [Bache and Lichman, 2013]
Multifeat.   2000   649    10   [Bache and Lichman, 2013]
Optdigits    3823   64     10   [Bache and Lichman, 2013]
Musk2        6598   166    2    [Bache and Lichman, 2013]
Spambase     4601   57     2    [Bache and Lichman, 2013]
Promoter     106    57     2    [Bache and Lichman, 2013]
Gisette      6000   5000   2    [Guyon and Elisseeff, 2003]
Madelon      2000   500    2    [Guyon and Elisseeff, 2003]

B.4 Generating Synthetic Data

The detailed generating process for the synthetic tree-graphical-model data used in the experiment is as follows:

Draw $y \sim \mathrm{Bernoulli}(0.5)$
Draw $x_1 \sim \mathrm{Gaussian}(\mu = y, \sigma = 1.0)$
Draw $x_2 \sim \mathrm{Gaussian}(\mu = y/1.5, \sigma = 1.0)$
Draw $x_3 \sim \mathrm{Gaussian}(\mu = y/2.25, \sigma = 1.0)$
Draw $x_4 \sim \mathrm{Gaussian}(\mu = x_1, \sigma = 1.0)$
Draw $x_5 \sim \mathrm{Gaussian}(\mu = x_1, \sigma = 1.0)$
Draw $x_6 \sim \mathrm{Gaussian}(\mu = x_2, \sigma = 1.0)$
Draw $x_7 \sim \mathrm{Gaussian}(\mu = x_2, \sigma = 1.0)$
Draw $x_8 \sim \mathrm{Gaussian}(\mu = x_3, \sigma = 1.0)$
Draw $x_9 \sim \mathrm{Gaussian}(\mu = x_3, \sigma = 1.0)$

Bibliography

Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1855–1869. SIAM, 2014.

Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.

Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Alan Agresti. A survey of exact inference for contingency tables. Statistical Science, pages 131–153, 1992.

Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2017.

András Antos and Ioannis Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.

Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.

Kiran S Balagani and Vir V Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):1342–1343, 2010.

David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press, 2003.

David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, page 201. MIT Press, 2004.

Horace Barlow.
Unsupervised learning. Neural computation, 1(3):295{311, 1989. Georgij P Basharin. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability & Its Applications, 4(3): 333{336, 1959. Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537{550, 1994. Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 259{269. IEEE, 2000. Anthony J. Bell. The co-information lattice. In Proceedings of the 4th international symposium on Independent Component Analysis and Blind Source Separation, pages 921{926, 2003. Emmanuel Bengio, Valentin Thomas, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1703.07718, 2017. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798{1828, 2013. Adriel Boals and Kitty Klein. Word use in emotional narratives about failed romantic relationships and subsequent mental health. Journal of Language and Social Psychology, 24(3):252{268, 2005. doi: 10.1177/0261927X05278386. URL http://jls.sagepub.com/content/24/3/252.abstract. J. Kathryn Bock. Syntactic persistence in language production. Cognitive Psy- chology, 18:355{387, 1986. Susan E. Brennan. Lexical entrainment in spontaneous dialog. In International Symposium on Spoken Dialog, pages 41{44, 1996. URL citeseer.ist.psu. edu/brennan96lexical.html. Susan E Brennan and Herbert H Clark. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cog- nition, 22(6):1482, 1996. 126 Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luj an. Conditional like- lihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27{66, 2012. R. Sherlock Campbell and James W. Pennebaker. The secret life of pronouns. Psychological Science, 14(1):60{65, 2003. Cl ement L Canonne. A survey on distribution testing: Your data is big. but is it blue? In Electronic Colloquium on Computational Complexity (ECCC), volume 22, pages 1{1, 2015. Clement L Canonne, Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Testing bayesian networks. In Conference on Learning Theory, pages 370{448, 2017. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximiz- ing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172{2180, 2016. Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy. ETRI Journal, 33(2):210{218, 2011. William G Cochran. Some methods for strengthening the common 2 tests. Biometrics, 10(4):417{451, 1954. Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287{314, 1994. Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley- Interscience, New York, NY, USA, 1991. ISBN 0-471-06259-6. Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined con- versations: A new approach to understanding coordination of linguistic style in dialogs. 
In Proceedings of the 2Nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 76{87, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 978-1-932432-95-4. URL http://dl.acm.org/citation.cfm?id=2021096.2021105. Cristian Danescu-Niculescu-Mizil, Michael Gamon, and Susan Dumais. Mark my words!: linguistic style accommodation in social media. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 745{754, 2011. ISBN 978-1-4503-0632-4. doi: 10.1145/1963405.1963509. URL http: //doi.acm.org/10.1145/1963405.1963509. 127 Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. Echoes of power: language eects and power dierences in social interaction. In Pro- ceedings of the 21th international conference on World wide web, WWW '12, pages 699{708, 2012. ISBN 978-1-4503-1229-5. Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algo- rithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML- 11), pages 1057{1064, 2011. Manoranjan Dash and Huan Liu. Feature selection for classication. Intelligent data analysis, 1(3):131{156, 1997. Simon DeDeo, Robert X. D. Hawkins, Sara Klingenstein, and Tim Hitchcock. Bootstrap methods for the empirical study of decision-making and information ows in social systems. Entropy, 15(6):2246{2276, 2013. ISSN 1099-4300. doi: 10.3390/e15062246. Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 685{694. IEEE, 2016. Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsuper- vised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016. Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185{205, 2005. Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014. Roland L'vovich Dobrushin. A general formulation of the fundamental theorem of shannon in the theory of information. Uspekhi Matematicheskikh Nauk, 14(6): 3{104, 1959. Ronald Aylmer Fisher. The distribution of the partial correlation coecient. Metron, 3:329{332, 1924. Fran cois Fleuret. Fast binary feature selection with conditional mutual informa- tion. The Journal of Machine Learning Research, 5:1531{1555, 2004. Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Ecient estimation of mutual information for strongly dependent variables. In AISTATS'15, 2015. 128 H. Giles, N. Coupland, and J. Coupland. Accommodation theory: Communication, context, and consequence, pages 1{68. Cambridge University Press, Cambridge, 1991. Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017. Amy L. Gonzales, Jerey T. Hancock, and James W. Pennebaker. Language style matching as a predictor of social dynamics in small groups. Communication Research, 37(1):3{19, 2010. doi: 10.1177/0093650209351468. URL http:// crx.sagepub.com/content/37/1/3.abstract. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. 
In Advances in neural information processing systems, pages 2672{2680, 2014. Malka Gorne, Ruth Heller, and Yair Heller. Comment on detecting novel associations in large data sets. (available at http://emotion. technion. ac. il/ gornm/les/science6. pdf on 11 Nov. 2012). Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. International Conference on Learning Representations, 2017. Isabelle Guyon and Andr e Elissee. An introduction to variable and feature selec- tion. The Journal of Machine Learning Research, 3:1157{1182, 2003. Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Adaptive estimation of shannon entropy. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1372{1376. IEEE, 2015a. Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Does dirichlet prior smoothing solve the shannon entropy estimation problem? In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1367{1371. IEEE, 2015b. Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315{ 3323, 2016. Ruth Heller, Yair Heller, and Malka Gorne. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503{510, 2013. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. 129 NL Hjort and MC Jones. Locally parametric nonparametric density estimation. The Annals of Statistics, pages 1619{1647, 1996. Molly E Ireland, Richard B Slatcher, Paul W Eastwick, Lauren E Scissors, Eli J Finkel, and James W Pennebaker. Language style matching predicts relationship initiation and stability. Psychological Science, 22(1):39{44, 2011. Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estima- tion of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835{2885, 2015. Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, Vladimir Protopopescu, and George Ostrouchov. Relative perfor- mance of mutual information estimation methods for quantifying the depen- dence among short and noisy data. Phys. Rev. E, 76:026209, Aug 2007. doi: 10.1103/PhysRevE.76.026209. URL http://link.aps.org/doi/10.1103/ PhysRevE.76.026209. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. J. Kinney and G. Atwal. Equitability, mutual information, and the maximal infor- mation coecient. Proceedings of the National Academy of Sciences, 111(9): 3354{3359, 2014. Ron Kohavi and George H John. Wrappers for feature subset selection. Articial intelligence, 97(1):273{324, 1997. Alexander Kraskov, Harald St ogbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004. David D Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the workshop on Speech and Natural Language, pages 212{217. Association for Computational Linguistics, 1992. Elia Liiti ainen, Amaury Lendasse, and Francesco Corona. A boundary corrected expansion of the moments of nearest neighbor distributions. Random Struct. Algorithms, 37(2):223{247, September 2010. ISSN 1042-9832. doi: 10.1002/rsa. v37:2. 
URL http://dx.doi.org/10.1002/rsa.v37:2. Dahua Lin and Xiaoou Tang. Conditional infomax learning: an integrated frame- work for feature extraction and fusion. In Computer Vision{ECCV 2006, pages 68{82. Springer, 2006. Ralph Linsker. A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9(8):1661{1665, 1997. 130 Huan Liu and Hiroshi Motoda. Feature selection for knowledge discovery and data mining, volume 454. Springer Science & Business Media, 2012. Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the national cancer institute, 22(4):719{748, 1959. W. McGill. Multivariate information transmission. Information Theory, Trans- actions of the IRE Professional Group on, 4(4):93{111, 1954. ISSN 2168-2690. doi: 10.1109/TIT.1954.1057469. Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximi- sation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116{2124, 2015. K.R. Moon and A.O. Hero. Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356{360, June 2014. doi: 10.1109/ISIT.2014.6874854. Andreas M uller, Sebastian Nowozin, and Christoph Lampert. Information the- oretic clustering using minimum spanning trees. Pattern Recognition, pages 205{215, 2012. Richard E Neapolitan et al. Learning bayesian networks, volume 38. Pearson Prentice Hall Upper Saddle River, NJ, 2004. Xuan Vinh Nguyen, Jerey Chan, Simone Romano, and James Bailey. Eective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 512{521. ACM, 2014. Kate G. Niederhoer and James W. Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337{360, 2002. doi: 10.1177/026192702237953. F. Nielsen and R. Nock. Entropies and cross-entropies of exponential families. In 17th IEEE International Conference on Image Processing (ICIP), pages 3621{ 3624. IEEE, 2010. D avid P al, Barnab as P oczos, and Csaba Szepesv ari. Estimation of r enyi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems 23, pages 1849{1857. Cur- ran Associates, Inc., 2010. Liam Paninski. Estimation of entropy and mutual information. Neural computa- tion, 15(6):1191{1253, 2003. 131 Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200{2203, 2004. Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8. Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226{ 1238, 2005. James W. Pennebaker and Martha E. Francis. Linguistic inquiry and word count: A computerized text analysis program, 2007. URL http://www.liwc.net/. James W. Pennebaker and Laura A. King. Linguistic styles: Language use as an individual dierence. Journal of Personality and Social Psychology, 77(6): 1296{1312, 1999. ISSN 0022-3514. doi: 10.1037/0022-3514.77.6.1296. URL http://dx.doi.org/10.1037/0022-3514.77.6.1296. Fernando P erez-Cruz. 
Estimation of information theoretic measures for continuous random variables. In Proceedings of NIPS-08, pages 1257{1264, 2008. Fortunato Pesarin and Luigi Salmaso. Permutation Tests for Complex Data: The- ory, Applications and Software. John Wiley & Sons, 2010. Martin J Pickering and Victor S Ferreira. Structural priming: a critical review. Psychological bulletin, 134(3):427, 2008. David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Par- dis C Sabeti. Detecting novel associations in large data sets. science, 334(6062): 1518{1524, 2011. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back- propagation and approximate inference in deep generative models. In Interna- tional Conference on Machine Learning, pages 1278{1286, 2014. Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, and Carlos Santa Cruz. Quadratic programming feature selection. The Journal of Machine Learning Research, 11:1491{1516, 2010. Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323{2326, 2000. 132 Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. Language use of depressed and depression-vulnerable college students. Cognition and Emotion, 18(8):1121+, 2004. ISSN 0269-9931. doi: 10.1080/02699930441000030. URL http://dx.doi.org/10.1080/02699930441000030. Michael A Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan T Daniel, and David D Cox. On the information bottleneck theory of deep learning. International Conference on Learning Representations, 2018. J urgen Schmidhuber. Learning factorial codes by predictability minimization. Neu- ral Computation, 4(6):863{879, 1992. C.E. Shannon. A mathematical theory of communication. The Bell System Tech- nical Journal, 27:379423, 1948. Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. Noah Simon and Robert Tibshirani. Comment on" detecting novel associa- tions in large data sets" by reshef et al, science dec 16, 2011. arXiv preprint arXiv:1401.7645, 2014. E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 13(2):238{241, 1951. ISSN 00359246. URL http://www.jstor.org/stable/2984065. Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American Journal of Math- ematical and Management Sciences, 23(3-4):301{321, 2003. doi: 10.1080/ 01966324.2003.10737616. URL http://dx.doi.org/10.1080/01966324.2003. 10737616. Shashank Singh and Barnabas Poczos. Generalized exponential concentra- tion inequality for renyi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 333{ 341, 2014. URL http://machinelearning.wustl.edu/mlpapers/papers/ icml2014c1_singh14. C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):pp. 72{101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159. K. Sricharan, R. Raich, and A.O. Hero. Estimation of nonlinear functionals of densities with condence. Information Theory, IEEE Transactions on, 58(7): 4135{4159, July 2012. ISSN 0018-9448. doi: 10.1109/TIT.2012.2195549. 133 K. Sricharan, D. Wei, and A.O. Hero. Ensemble estimators for multivariate entropy estimation. 
Information Theory, IEEE Transactions on, 59(7):4374{4388, July 2013. ISSN 0018-9448. doi: 10.1109/TIT.2013.2251456. Shannon Wiltsey Stirman and James W. Pennebaker. Word use in the poetry of suicidal and nonsuicidal poets. Psychosomatic Medicine, 63(4):517{522, 2001. URL http://www.psychosomaticmedicine.org/content/63/4/517. abstract. M Studen y and J Vejnarova. The multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models, pages 261{297. Springer, 1998. Milan Studen y and Jirina Vejnarov a. The multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models, pages 261{297. Springer, 1998. Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori. Approx- imating mutual information by maximum likelihood density ratio estimation. In Yvan Saeys, Huan Liu, Iaki Inza, Louis Wehenkel, and Yves Van de Peer, editors, FSDM, volume 4 of JMLR Proceedings, pages 5{20. JMLR.org, 2008. Zolt an Szab o. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283{287, 2014. (https://bitbucket.org/szzoli/ite/). G abor J Sz ekely, Maria L Rizzo, et al. Brownian distance covariance. The annals of applied statistics, 3(4):1236{1265, 2009. Paul J. Taylor and Sally Thomas. Linguistic style matching and negotiation out- come. Negotiation and Con ict Management Research, 1(3):263{281, 2008. ISSN 1750-4716. doi: 10.1111/j.1750-4716.2008.00016.x. URL http://dx.doi.org/ 10.1111/j.1750-4716.2008.00016.x. Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beau- doin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Inde- pendently controllable features. arXiv preprint arXiv:1708.01289, 2017. Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottle- neck method. arXiv preprint physics/0004057, 2000. Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65(1):31{78, 2006. 134 Gregory Valiant and Paul Valiant. A clt and tight lower bounds for estimating entropy. In Electronic Colloquium on Computational Complexity (ECCC), vol- ume 17, page 9, 2010. Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceed- ings of the forty-third annual ACM symposium on Theory of computing, pages 685{694. ACM, 2011. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747{ 1756, 2016. Greg Ver Steeg. Unsupervised learning via total correlation explanation. IJCAI, 2017. Greg Ver Steeg and Aram Galstyan. Statistical tests for contagion in observational social network studies. In AISTATS'13, 2013a. Greg Ver Steeg and Aram Galstyan. Minimal assumption tests for contagion in observational social network studies. In Workshop on Information in Net- works(WIN), 2013b. Greg Ver Steeg and Aram Galstyan. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems, pages 577{585, 2014. Greg Ver Steeg and Aram Galstyan. Maximally informative hierarchical represen- tations of high-dimensional data. In Articial Intelligence and Statistics, pages 1004{1012, 2015. Greg Ver Steeg and Aram Galstyan. Low complexity gaussian latent factor models and a blessing of dimensionality. arXiv preprint arXiv:1706.03353, 2017. 
Nguyen Xuan Vinh, Shuo Zhou, Jerey Chan, and James Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015. Ulrike Von Luxburg and Morteza Alamgir. Density estimation from unweighted k-nearest neighbor graphs: a roadmap. In Advances in Neural Information Processing Systems, 2013. Janett Walters-Williams and Yan Li. Estimation of mutual information: A sur- vey. In Rough Sets and Knowledge Technology, volume 5589 of Lecture Notes in Computer Science, pages 389{396. Springer Berlin Heidelberg, 2009. ISBN 135 978-3-642-02961-5. doi: 10.1007/978-3-642-02962-2 49. URL http://dx.doi. org/10.1007/978-3-642-02962-2_49. Q. Wang, S.R. Kulkarni, and S. Verd u. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theor., 55:2392{ 2405, May 2009a. ISSN 0018-9448. doi: 10.1109/TIT.2009.2016060. URL http: //dl.acm.org/citation.cfm?id=1669487.1669521. Qing Wang, Sanjeev R. Kulkarni, and Sergio Verd u. Universal estimation of infor- mation measures for analog sources. Found. Trends Commun. Inf. Theory, 5 (3):265{353, March 2009b. ISSN 1567-2190. doi: 10.1561/0100000021. URL http://dx.doi.org/10.1561/0100000021. Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66{82, 1960a. Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66{82, 1960b. Philip Wolfe. Convergence conditions for ascent methods. SIAM review, 11(2): 226{235, 1969. Yihong Wu and Pengkun Yang. Optimal entropy estimation on large alphabets via best polynomial approximation. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 824{828. IEEE, 2015. Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Informa- tion Theory, 62(6):3702{3720, 2016. Aaron D Wyner. A denition of conditional mutual information for arbitrary ensembles. Information and Control, 38(1):51{59, 1978. Howard Hua Yang and John E Moody. Data visualization and feature selection: New algorithms for nongaussian data. In NIPS, volume 99, pages 687{693. Citeseer, 1999. Joseph E Yukich and Joseph Yukich. Probability theory of classical Euclidean optimization problems. Springer Berlin, 1998. Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimen- sion reduction via local tangent space alignment. SIAM Journal of Scientic Computing, 26:313{338, 2002. Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximiz- ing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017. 136 Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Long Nguyen, Christopher R e, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554{3562, 2014. 137