Learning Fair Models with Biased Heterogeneous Data

by

Yuzi He

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PHYSICS)

May 2022

Copyright 2022 Yuzi He

This dissertation is dedicated to the workers of the world.

Acknowledgments

First, I wish to express my deepest gratitude to my thesis advisor, Prof. Kristina Lerman. Prof. Lerman is extremely passionate about research, and she taught me countless lessons on how to conduct high-quality, original scientific research. She is also supportive of her students and was very patient with the instruction I needed throughout the program. I greatly enjoyed working with her.

I also want to thank all faculty members of the Department of Physics and Astronomy, especially Prof. Stephan Haas and Prof. Krzysztof Pilch. As the chair and the academic advisor of the program, respectively, Prof. Haas and Prof. Pilch helped me with many logistical issues of the program; without their support, this dissertation would have faced far more difficulties.

I want to thank my collaborators and colleagues, including Dr. Keith Burghardt, Ashwin Rao, Siyi Guo, Dr. Nazanin Alipourfard, Zihao Jiang, Julie Jiang, Nathan Bartley, Dr. Nazgol Tavabi, and Negar Mokhberian, for valuable ideas and suggestions on my research. I would also like to thank my classmates for their encouragement and assistance in courses and research. I also thank my family and friends for their emotional support. Love from family is the ultimate source of energy, and friends help when you are away from home and facing COVID!

Last but not least, the research projects in this dissertation were supported by the Defense Advanced Research Projects Agency (DARPA) (under contracts W911NF-17-C-0094, W911NF-18-C-0011, and HR00111990114) and by the Air Force Office of Scientific Research (AFOSR) (under contract FA9550-20-1-0224). This work is also based in part upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 201717071900005.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Statement of Research Questions
  1.3 Challenges and Main Contributions

Chapter 2: Fair Representations via Linear Orthogonalization
  2.1 Chapter Introduction
  2.2 Related Works
  2.3 Methods
    2.3.1 Fair Interpretable Representations
    2.3.2 Fair Models
    2.3.3 Measuring Fairness
      2.3.3.1 Fairness of Outcomes
      2.3.3.2 Fairness of Representations
  2.4 Results
    2.4.1 Synthetic Data
    2.4.2 Real-World Data
    2.4.3 Comparison Against State-of-the-Art
      2.4.3.1 Fairness of Representations
      2.4.3.2 Balance Versus Calibration
  2.5 Conclusion
  2.6 Supplementary Materials
    2.6.1 Additional Tables
    2.6.2 Proof of the Invariance of Parameters in Linear Regression Using Debiased Features

Chapter 3: Fair Predictions and Invariant Representations with Kernels
  3.1 Chapter Introduction
  3.2 Related Works
  3.3 Methodology
    3.3.1 Independence Condition using Kernel Functions
    3.3.2 Relationship with the Hilbert-Schmidt Independence Criterion
    3.3.3 Applications
      3.3.3.1 Fair Prediction (Supervised Learning)
      3.3.3.2 Invariant Representation and Style Transformation (Unsupervised Learning)
  3.4 Experimental Results
    3.4.1 Fair Predictions
    3.4.2 Invariant Representation and Style Transformation
  3.5 Discussion
  3.6 Supplementary Materials

Chapter 4: Change Detection via Confusion
  4.1 Chapter Introduction
  4.2 Related Works
  4.3 Methodology
    4.3.1 Problem statement
    4.3.2 Confusion-based training
    4.3.3 Inside the black box: modeling accuracy
    4.3.4 Quantifying error and complexity
    4.3.5 State-of-the-art
  4.4 Results
    4.4.1 Synthetic Data
      4.4.1.1 Synthetic "Chessboard" Data
      4.4.1.2 Synthetic Images
    4.4.2 Identifying Change in Real-world Data
      4.4.2.1 COVID-19 Air Quality
      4.4.2.2 Performance on Khan Academy
      4.4.2.3 Student Test Scores
      4.4.2.4 College GPA
  4.5 Discussion
  4.6 Supplementary Materials
    4.6.1 Detail of confusion-based training

Chapter 5: Identifying Multiple Changes: Application to Social Media
  5.1 Methodology
  5.2 Data
  5.3 Experiments and Results
  5.4 Conclusions

Chapter 6: Heterogeneous Effects of Software Patches in a Multiplayer Online Battle Arena Game
  6.1 Chapter Introduction
  6.2 Related works
  6.3 Background and Game Data
    6.3.1 Features
    6.3.2 Game Patches
  6.4 Methods
    6.4.1 Heterogeneous Treatment Effects
    6.4.2 Causal Trees for HTEs
  6.5 Results
    6.5.1 Effect on Team Performance
      6.5.1.1 Champion win rate
      6.5.1.2 Heterogeneous effect on win rate
    6.5.2 Individual Player Performance
      6.5.2.1 Average effect of patches
      6.5.2.2 Heterogeneous effect of patches
      6.5.2.3 Effect of player features
  6.6 Conclusion

Chapter 7: Counterfactual Learning for the Fair Allocation of Treatments
  7.1 Chapter Introduction
  7.2 Related Works
  7.3 Methodology
    7.3.1 Heterogeneous Treatment Effect
    7.3.2 Inequalities in Treatment
      7.3.2.1 Measuring Inequality of Treatment Opportunity
      7.3.2.2 Measuring Inequality of Treatment Outcomes
    7.3.3 Learning Optimal Interventions using Causal Tree
    7.3.4 Learning Optimal Interventions for Arbitrary Causal Estimation Methods
    7.3.5 Computational Complexity
  7.4 Results
    7.4.1 Synthetic data
    7.4.2 EdGap Data
  7.5 Discussion
Chapter 8: Conclusion and Ongoing Work

References

List of Tables

2.1 Accuracy of predicted outcomes (Acc_Y) and protected features (Acc_P) for the German and Adult datasets. The proposed fair methods (bottom four rows) use α = 0.0. Higher Acc_Y indicates better predictions, while Acc_P closer to the majority-class baseline indicates fairer predictions. Results marked * were reported by [65]. Best performance is shown in bold.

3.1 Comparison of our method with previous works on the German and Adult datasets. The majority-class ratio is used as the baseline. Adv. Loss stands for adversarial loss, namely the accuracy of predicting z from u. Pred. Acc. stands for the accuracy of predicting y from x via the representation u. Higher Pred. Acc. is better, and the desirable Adv. Loss is close to the majority-class ratio.

3.2 Summary of results for the unsupervised learning experiments. n_p is the number of protected classes and λ is the weight of the penalty term d_u^2. L_rec stands for the reconstruction loss (MSE) of images. Acc_LG is the accuracy of predicting z from the representation using logistic regression, and Acc_MLP is the accuracy using an MLP.

3.3 Results of using a fraction of the data for the kernel matrix calculation. The batch size equals 512. The results suggest that sampling 10% to 20% of the data should be optimal.

3.4 Summary of hyperparameters.

4.1 A comprehensive comparison of the performance of the proposed method against two types of state-of-the-art methods, optimal segmentation (in the ruptures package) and Bayesian changepoint detection, on synthetic data. MtChD (RF) is our method with a random forest classifier; MtChD (MLP) is our method with a multilayer perceptron classifier. DP+Normal (Normal GLR eq.) is the DP segmentation method used with a normal loss function, which is equivalent to a GLR test assuming a multivariate normal distribution. Six combinations of optimal segmentation methods are listed: DP is the dynamic programming segmentation algorithm, BinSeg is binary segmentation, Window is window-based changepoint detection, and BottomUp is bottom-up segmentation. The cost functions used are RBF (RBF kernel), L1 (L_1 loss), and L2 (L_2 loss). The last four rows are for Bayesian changepoint detection with a uniform prior or a Geo (geometric) prior; Gaussian stands for a Gaussian likelihood function, IFM is the individual feature model [151], and FullCov is the full covariance model [151]. The columns report the mean value and standard deviation of the inferred changepoint t_0 and the corresponding statistics for the other inferred model parameter. Bold values indicate changepoints that are closest to the correct value.

4.2 Changepoints inferred for the synthetic image dataset. The true value of the changepoint is t_0 = 0.50, where solid circles change into hollow circles.

4.3 A comprehensive comparison of our method with previous methods on the real-world datasets, COVID-19 Air and Khan Academy. We use the same abbreviations as in Table 4.1. For COVID-19, t_0 is measured in days since 01/01/2020. For Khan Academy, t_0 is measured as a Unix timestamp, namely the number of seconds since midnight 01/01/1970.
Correct values are roughly 80 days for the COVID-19 air data and 1.365 × 10^9 seconds for the Khan Academy data. Bold values indicate changepoints that are closest to the correct value.

4.4 Results of our method on the regression discontinuity task, on two real-world datasets, Student Test Score and College GPA. We use the same abbreviations as in Table 4.1. For the Student Test data, pretest stands for the pre-test score. For the College GPA data, HS grade pt stands for high school grade points and credits yr 1 stands for credits earned during the first year. Bold values indicate changepoints that are closest to the correct value; underlined values mark the features with the highest inferred parameter value.

5.1 Change points automatically identified in COVID-19 tweets and important events occurring on those dates.

5.2 Comparison with baseline change detection. (Left) Tweets and (right) r/nosleep.

7.1 Definitions used in the design of fair treatment policies.

List of Figures

2.1 Fair synthetic data. (a) Raw data (α = 1.0); (b) plot for fairness level α = 0.0. The two features in the data are x_f1 and x_f2, and the two classes we want to protect are shown in red and blue. The two outcome classes are represented by the symbols o and x.

2.2 Fairness versus accuracy. Plots show Pearson correlation versus accuracy of predictions (Acc_Y) for the German, COMPAS and Adult datasets. In each plot, Zafar2015 stands for [155], Zafar2016 for [154], and jaiswal2018unsupervised for [64]. Fair NuSVM, Fair RF, Fair AdaBoost, and Fair MLP results are produced using the fair representations constructed by our proposed method with NuSVM [25], random forest [22], AdaBoost [45], and multilayer perceptron [120] models, respectively. The results of UAI are not shown for the Adult dataset, since its best accuracy (0.83) lies outside the boundary of the plot. (The same holds for Figures 2.3 and 2.4.)

2.3 Discrimination versus accuracy plots for the three datasets.

2.4 Accuracy of inferring the protected variable from the model's predictions (Acc_P) versus the accuracy of predicting the outcome (Acc_Y) for the three datasets.

2.5 Balance vs. negative log-likelihood (calibration error) for the German, COMPAS and Adult datasets. In each plot there are two sets of curves for every model, labeled y = 0 and y = 1. The y = 0 curves show the difference in mean ŷ (between different protected classes) for individuals with a negative outcome y = 0, and the y = 1 curves show the same for individuals with a positive outcome y = 1. (These differences are called the balance of the negative or positive class by [77].) Fairer models are those in the lower-left corner of each plot.

3.1 Diagrams for the supervised and unsupervised learning settings.

3.2 Plot of statistical parity SP vs. accuracy Acc. Our results are shown as red crosses. In the plots, Adv. Forget is [63]; CVIB is [99]; FCRL is [50]; MIFR is [133]; and MaxEnt-ARL is [121].
A more efficient method achieves lower statistical parity (SP) for the same value of accuracy (Acc), or higher accuracy for the same value of SP.

3.3 Images generated by style transformation. The left figure shows results for the MNIST dataset and the right figure shows results for the Chairs dataset. For both datasets, the leftmost column shows the input images and the remaining columns show the generated images: a different digit in the same style (left) or the same chair from a different angle of view (right).

3.4 t-SNE visualization of the latent representation u. Different sensitive groups (values of z) are shown in different colors. We set λ = 0 to ignore the invariance constraints. For λ = 0, we see clusters of different colors (values of z), but when λ is properly tuned, points with different colors are mixed and indistinguishable.

4.1 Illustrations of synthetic data, where observations have two features x_1 and x_2. Blue dots represent data points with t ≤ t_0 and orange dots those with t > t_0. (a) n_c = 2; (b) n_c = 6; (c) n_c = 10. For fixed data size N, as n_c increases, the number of data points in each cell decreases and the spatial frequency of the data increases. These factors make it more difficult for a classifier to find the decision boundary.

4.2 Example synthetic images that change at t_0 = 0.5. From top to bottom, rows show images with different noise levels σ = 0.2, 0.4, 0.6, 0.8 and 1.0. At t_0, solid circles change into hollow circles, as shown in the images.

4.3 Accuracy deviation curve for the COVID-19 Air data. (a) Using a random forest classifier; (b) using a multilayer perceptron classifier. The scatter points are the accuracy deviation measured on the testing set and the solid lines are fitted using the proposed accuracy deviation model.

4.4 Accuracy deviation curve for the Khan Academy data. (a) Using a random forest classifier; (b) using a multilayer perceptron classifier.

5.1 Word clouds for COVID-19 tweets in the periods 01/21-01/30, 01/30-02/04 and 02/04-02/11.

6.1 Win rate (%) of the top three most-played champions for each patch.

6.2 Trimmed causal trees for patches 4.12 (left) and 6.4 (right), which contain Lucian buffs and nerfs, respectively. In the figures, samples indicates the number of observations in that node; for non-leaf nodes, it is the total number of observations before splitting.

6.3 Impact of software patches on player performance. The heatmap shows average player performance, as measured by the number of kills per match, for different versions of the game. The upper plot shows the performance of players with different rest time (timeSinceLastMatch). In the lower plot, players are binned based on the feature meanKillsAtStart, a proxy for player skill. The abrupt color change for versions 4.20 and 6.9 indicates a large difference in performance after the version change.

6.4 The mean effect of software patches on kills. We see sharp peaks at patches 4.20 and 6.9.

6.5 Causal impact of game patches on player performance. The heatmap shows the effect of the software patch on performance for the same groups of players as in Figure 6.3.
The top figure shows the mean effect for players with different rest time (timeSinceLastMatch), and the bottom plot shows the effect of patches for bins with different values of meanKillsAtStart.

6.6 (Left) Causal tree learned for patch 4.21 for matches where a player selected the champion Vayne. The purple nodes show heterogeneous effects that are significant at the 5% level. (Right) Average treatment effect calculated for players at different levels (meanKillsAtStart) at two major patch changes. Excluding the first bin, which contains a significant portion of new players with meanKillsAtStart close to zero, we see a trend that higher-level players benefit more from the patch changes, indicating that the performance gap is being widened.

6.7 Average effect gap calculated for the 10 most important features. The error bars show 95% confidence intervals. We can see that the most important features are timeSinceLastMatch and the historical performance of players.

7.1 Plot of outcome y vs. feature x_0 for synthetic data. Note that the other feature, x_1, is independent of y. The two classes of the protected attribute z are shown in different colors. Treatment and control data have "o" and "x" plot markers, respectively.

7.2 Mean outcome improvement, Δy, versus maximum allowed outcome inequality, m_y. (a) Δy vs. m_y when equal treatment opportunity is assumed. Different curves show the efficient boundary of policies under constraints on the fraction treated, r_max. (b) Δy vs. m_y when affirmative action is allowed. Here r_max = 0.8 and different curves show different degrees of affirmative action, measured by m_r. Affirmative action greatly improves Δy when m_y is low.

7.3 Heat map visualizations of mean outcome improvement, Δy, for synthetic data. Maximum fraction treated, r_max = (a) 0.2, (b) 0.4, (c) 0.6, and (d) 0.8. Lighter yellow colors correspond to a larger change in the outcome. Solid black lines are the contour lines of Δy.

7.4 Heat map visualizations of performance improvement, Δy, for EdGap data. Maximum fraction treated, r_max = (a) 0.2, (b) 0.4, (c) 0.6, and (d) 0.8. Lighter yellow colors correspond to greater overall benefits of treatment, while the infeasible region is shown in grey.

7.5 Maps of (a) z-score normalized mean test scores from EdGap, (b) Black household ratio, (c) learned optimal treatment assignment when equal treatment opportunity is assumed (m_y = 0.25, r_max = 0.40, m_r = 0), and (d) learned optimal treatment assignment when affirmative action is allowed (m_y = 0.25, r_max = 0.40, m_r = 1.0).

Abstract

This thesis discusses the challenges of, and novel methods for, mining heterogeneous and biased data. Learning the structure of data enables a deeper understanding of system behavior, allowing for better predictions and decisions. However, in applications in the natural and social sciences, data are often heterogeneous and biased by hidden relationships.

We first address fair predictions and invariant representations learned from biased data. Previous methods range from simple constrained linear regression to deep learning models.
Constrained linear regressions are too restrictive, while deep learning models are not interpretable and often rely on expensive adversarial training. To address these challenges, we first present a linear method that creates interpretable features that are also fair, i.e., independent of sensitive features which may bias predictions. The method pre-processes the data by projecting it onto a hyperplane orthogonal to the sensitive features. We then introduce a kernel-based nonlinear method that can be applied to both supervised and unsupervised learning tasks without adversarial training.

Next, we show how to identify changes in heterogeneous data. Previous models rely on likelihood functions or kernels, which restricts their use on large and high-dimensional data. Our method is self-supervised, inspired by phase transitions in physical systems. It extends a previous method by proposing a mathematical model which can robustly and accurately identify changes from the accuracy variation of any supervised learning model of choice. We show how this method can be applied to mining text data from social media. We also include empirical studies on League of Legends performance data examining the effect of changes on individuals.

Finally, we show how to design interventions which actively improve fairness and maximize overall benefit using limited resources. Previous methods were restricted to idealized cases and cannot be applied to real-world observational data. We introduce fairness metrics into causal inference and propose an algorithm which gives the optimal policy under constraints on fairness and resources.

Chapter 1

Introduction

1.1 Motivation

Data collected in the natural and social sciences often have complex structures. A better understanding of these structures enables us to describe the patterns in data more accurately and efficiently. It also serves as a crucial preliminary step toward discovering new principles from observational data. In the scope of AI and machine learning, we study these structures by fitting the data with various kinds of models. Generally speaking, models can be parameterized or non-parameterized; for simplicity, we use parameterized models as an example here. In some cases, all the observations in the dataset can be described by a simple model, for example the linear regression shown below:

    y_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij} + \varepsilon_i.    (1.1)

Here the residuals ε_i are i.i.d. and normally distributed, ε_i ~ N(0, σ²). In practice, the structure of data is much more complicated. For example, there may be several subgroups in the dataset, and if we fit separate linear regression models for each of the groups, we may find that certain parameters β_j, j ≠ 0, in Eq. (1.1) differ across groups or even have different signs, which is an example of Simpson's paradox [2]. In this case, the overall trend β_j could be meaningless. This classical example shows how heterogeneous structures in data can make regression analysis difficult.

Generally speaking, let the data X ∈ R^n and the sensitive attribute z ∈ R^m be random variables drawn i.i.d. from some joint distribution, and let F : R^{m+n} → R be an expectation function. We say that the data X is invariant with respect to z under F if the value

    f = F(X, z)    (1.2)

is independent of z. When z is discrete, z ∈ {1, ..., n_z} (e.g., a group indicator, as in the Simpson's paradox example), Eq. (1.2) can be interpreted as

    F(X, z) = f_0,  z ∈ {1, ..., n_z},    (1.3)

and if z is continuous, we have

    \partial F(X, z) / \partial z_i = 0,  i = 1, ..., m.    (1.4)

When Eq. (1.2) does not hold, we say there is a heterogeneous structure in the data under F with respect to z.
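To make the Simpson's-paradox scenario above concrete, the following small snippet (an illustrative toy of our own, not taken from the dissertation) generates two subgroups whose within-group regression slopes are negative while the pooled slope is positive:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two subgroups with negative within-group slope but shifted means,
# so the pooled regression picks up a positive trend (Simpson's paradox).
x_a = rng.normal(0.0, 1.0, 500)
y_a = -1.0 * x_a + rng.normal(0.0, 0.5, 500)
x_b = rng.normal(4.0, 1.0, 500)
y_b = -1.0 * x_b + 8.0 + rng.normal(0.0, 0.5, 500)

x_all = np.concatenate([x_a, x_b]).reshape(-1, 1)
y_all = np.concatenate([y_a, y_b])

pooled = LinearRegression().fit(x_all, y_all)
group_a = LinearRegression().fit(x_a.reshape(-1, 1), y_a)
group_b = LinearRegression().fit(x_b.reshape(-1, 1), y_b)

print("pooled slope:  %+.2f" % pooled.coef_[0])   # positive overall trend
print("group A slope: %+.2f" % group_a.coef_[0])  # negative within group
print("group B slope: %+.2f" % group_b.coef_[0])  # negative within group
```

Fitting one model to the pooled data hides the subgroup structure; fitting per-group models recovers it, which is exactly the tension discussed in the next paragraph.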
Such heterogeneous structures can be caused by many factors. First, there may be intrinsic differences among subgroups of the data; in this case, we need different models to describe different subgroups. In other cases, the differences among groups are caused by imperfect observations and human error. To deal with heterogeneous data, if we can identify the subgroups, we can fit a different model for each group, which helps us produce a better overall description of the data. If we know the subgroup of every observation in advance, and predictive performance (such as accuracy, L_2 loss, or F1-score) is the only thing we care about, taking such a path is ideal.

But in many cases, predictive performance is not the only thing we care about, and we often have no information about subgroups. In computational social science, treating subgroups (such as individuals of different races or from different areas) differently, or even explicitly using subgroup information, can be unethical or even a serious violation of the law. At the time this dissertation is being written, ethics has become a crucial topic in AI and machine learning, since an increasing number of decisions are made automatically by AI systems. For example, COMPAS is a system used by U.S. courts to predict the risk of recidivism of defendants [6]. Amazon also used an AI tool to screen the resumes of job applicants [52]. It has been shown that both systems have serious biases: COMPAS is biased against Black defendants, and the system Amazon developed discriminated against female applicants [6, 52].

Generally speaking, models are trained on past observed data that can contain ethical biases, such as those against protected subgroups, or more general statistical biases, such as Simpson's paradox [96]. Models trained on biased data will perpetuate or even magnify such biases. To understand how biased data affects machine learning models, consider a case where we want to predict the credit score of individuals given features such as age, income, gender, race, and job. Assume that, due to poor sampling, there is a high correlation between gender and credit score, with males having higher credit scores (such sampling-induced correlations are known as Berkson's paradox [17]). The data will be noisy, so no perfect prediction is possible. If we use a decision tree as the model, it is possible that the first split will put all females into one branch and assign them bad credit scores, even if we observe some females with good credit scores. The model thus not only learns the biases in the data, it also magnifies them, such that being female leads to a bad score in a deterministic way. This suggests that, in order to make fair predictions, in other words to remove or alleviate the bias, we must have an active way of controlling the bias in our model. This inevitably leads us to a dilemma, as we show below.

Let us assume we have an attribute indicating the sensitive subgroups, such as race, gender, or age. We refer to this variable as the sensitive attribute or sensitive feature (it is also referred to as the protected attribute or feature in some literature [155]). Roughly speaking, we can control the bias in a model in three different ways. First, we can pre-process the data, hoping to remove the bias and feed our model a fair version of the data.
Second, we can constrain the model and force its output to be fair. Finally, we can post-process the output of the model to make it fair; this last option is sometimes called massaging [72]. When the sensitive attribute is not available at prediction time (the main situation we discuss), post-processing or massaging is simply impossible. Pre-processing the data and hiding the sensitive feature from the model makes prediction more difficult, since the sensitive feature is also correlated with the outcome. From the perspective of optimization, any machine learning method optimizes some objective function, and fairness conditions can be regarded as constraints. This suggests that under fairness conditions a model will likely suffer a performance decrease, since in general the solution to a constrained optimization problem is no better than the solution to the corresponding unconstrained problem. To summarize, when we have no access to the sensitive attribute at prediction time, we must sacrifice some performance for fairness. A crucial way of assessing a model is therefore to look at how efficiently it trades off performance (accuracy) against fairness: at the same level of fairness, better accuracy is preferred; at the same level of accuracy, smaller bias is preferred.

Besides making fair predictions, a practical problem is how to correctly identify heterogeneous subgroups in a dataset. An intuitive approach is to bin continuous-valued features or to use obvious features such as gender, age, or zip code. As the number of features grows, identifying latent subgroups becomes nontrivial, since the number of possible subgroups grows exponentially with the number of features [58]. In this dissertation, we consider a special case of the problem: identifying subgroups in a one-dimensional space. This is referred to as change detection, and it is useful for mining temporal data. For example, in this dissertation we find that during the early stages of the COVID-19 outbreak there were constant shifts in discussion topics on Twitter. As a supplement, we also present an empirical study of heterogeneous behaviors in an online game, League of Legends. In this case, the exact time of change is known, and we are interested in how individuals react to the changes.

Finally, if we know that real-world data is biased, how can we learn treatment policies or interventions that both improve the overall benefit and alleviate the biases? To give an example, at the time this dissertation is written, CDC data show that Black Americans' vaccination rate is still lower than the average [54]. This makes the racial disparity even worse, since the Black population has historically lacked proper health care. As discussed above, we still have to make a trade-off here: we can choose to maximize the overall benefit, but doing so will make the world more biased; if we only care about fairness, we may not make the most of the limited resources we have.

1.2 Statement of Research Questions

In this thesis, we are mainly concerned with creating fair models from biased heterogeneous data, discovering heterogeneity in data, and using tools such as causal inference to understand changes or discontinuities in data. More specifically, we are interested in the following questions:

• Q1. How can we create fair and robust ML models from biased heterogeneous data? And, closely related, how should we represent the biased data in a fair or invariant way?
• Q2. Given heterogeneous data, how can we locate discontinuities or changes? Furthermore, how can we discover natural experiments using these discontinuities or changes and use them to analyze the principles of human behavior?

• Q3. Knowing there is bias in the observed data, how can we design interventions that not only alleviate the existing bias but also improve the overall benefit?

1.3 Challenges and Main Contributions

Fair Representations via Linear Orthogonalization. We propose a method which learns fair and interpretable features via linear projections. Previous methods range from constrained logistic regressions [155] to deep learning methods [93, 99], but are either too restrictive or not interpretable. We consider features and sensitive attributes as vectors in an n-dimensional space, where n is the number of data points. We project the features onto the subspace orthogonal to the sensitive attributes, and the resulting fair versions of the features are guaranteed to have zero covariance with the sensitive attributes. Using the pre-processed features in linear models yields fair predictions. See Chapter 2.

Fair Predictions and Invariant Representations with Kernels. This is an improvement on the work mentioned immediately above. As discussed, that method only removes linear correlations (covariance) between the features and the sensitive attributes. In Chapter 3, we propose a general-purpose deep learning framework that can generate invariant representations of data. The heart of the method is a kernel-based loss term that measures the statistical dependence between the learned representation and the sensitive attributes. Our method performs well on both supervised and unsupervised learning tasks. See Chapter 3.

Change Detection via Confusion. This part presents a new method that can detect changes in temporal data. Change detection is a well-studied topic, but previous methods have limitations. A popular family of methods [136] uses optimal splitting algorithms and cost functions, but this restricts the data to certain known distributions and makes it difficult to scale to high-dimensional data. There are also Bayesian change detection [1] and change detection based on Markov models [15, 116]. How to take advantage of recent state-of-the-art supervised learning methods remains an open question. Suppose we observe temporal data (X_i, t_i). The goal is to find a point t_0 such that the data follow different distributions before and after t_0. In the proposed method, we label the data before and after an assumed change point as 0 and 1, train an arbitrary classifier to predict the labels we created, and record the accuracy of the classifier as a function of the assumed change point. Our key contribution is a model of this accuracy vs. trial change point curve. We applied the method to discussions on Twitter and identified topic shifts during the early stages of the COVID-19 outbreak. See Chapters 4 and 5; a minimal illustration of the scanning procedure is sketched below.
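The following is a rough sketch of the confusion-based scan described above (our own illustrative code, not the dissertation's implementation; the function names are ours): for each trial change point we create before/after labels, train a classifier, and record how far its held-out accuracy rises above the majority-class baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def accuracy_deviation_curve(X, t, trial_points):
    """For each trial change point, label samples as before/after and
    measure how much better than chance a classifier separates them."""
    curve = []
    for t0 in trial_points:
        y = (t > t0).astype(int)              # confusion labels
        base = max(y.mean(), 1 - y.mean())    # majority-class accuracy
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        acc = cross_val_score(clf, X, y, cv=5).mean()
        curve.append(acc - base)              # accuracy deviation above chance
    return np.array(curve)

# Toy example: the feature distribution shifts at t = 0.6.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 600))
X = rng.normal(0, 1, (600, 2)) + np.where(t[:, None] > 0.6, 1.5, 0.0)

trials = np.linspace(0.1, 0.9, 9)
dev = accuracy_deviation_curve(X, t, trials)
print("estimated change point:", trials[dev.argmax()])   # expected near 0.6
```

The dissertation's contribution is a mathematical model fitted to this accuracy-deviation curve, which the toy above only evaluates pointwise.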
Heterogeneous Effects of Software Patches in a Multiplayer Online Battle Arena Game. In this part (Chapter 6), we consider the case where we are given a certain event and want to study the effect of the change; specifically, we want to study how individuals react to a change in their environment. We use performance data from players of a popular online game, League of Legends, and we are given the change points at which software patches were released and the game parameters changed. We demonstrate how causal inference, especially heterogeneous treatment effect (HTE) estimation, can be used to understand the effect of a change on different types of individuals. We identified several significant changes and observed that players' reactions are heterogeneous; in some cases, the gaps between strong and weak players were widened by the updates. See Chapter 6.

Counterfactual Learning for the Fair Allocation of Treatments. This part of the work discusses how policymakers can use observational data to design a policy (a treatment plan) that both improves the overall benefit and ensures fairness. In practice, policymakers need to achieve the most positive effect using the limited resources available. There are previous works on fair resource allocation [38] and fair decision making [35], but it is difficult to apply these methods to real-world cases using directly observed data. To close this gap, we propose a method which combines causal inference and algorithmic fairness. We introduce two new metrics for the fairness of a policy, fairness of treatment opportunity and fairness of treatment outcome, and propose an algorithm for finding the optimal policy that maximizes the overall benefit under fairness and resource constraints. Such a tool will be useful in real-world settings such as the distribution of COVID-19 vaccines. A toy version of the underlying allocation problem is sketched below. See Chapter 7.
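As a rough illustration of the kind of constrained allocation problem Chapter 7 studies, here is a greedy stand-in (our own simplification, not the dissertation's causal-tree algorithm; the estimated effects tau_hat and the parity cap m_r are hypothetical inputs): treat the units with the largest estimated treatment effects, subject to a budget and a cap on how unevenly treatment is spread across groups.

```python
import numpy as np

def greedy_fair_allocation(tau_hat, group, r_max, m_r):
    """Greedily treat units with the largest estimated effect tau_hat,
    subject to a budget r_max (max fraction treated) and a cap m_r on
    the gap in treatment rates between the two groups (0 and 1)."""
    n = len(tau_hat)
    budget = int(r_max * n)
    treated = np.zeros(n, dtype=bool)
    for i in np.argsort(-tau_hat):                  # largest effect first
        if treated.sum() >= budget or tau_hat[i] <= 0:
            break
        treated[i] = True
        rates = [treated[group == g].mean() for g in (0, 1)]
        if abs(rates[0] - rates[1]) > m_r:          # would violate the parity cap
            treated[i] = False
    return treated

# Toy example with two groups and hypothetical effect estimates.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 1000)
tau_hat = rng.gamma(2.0, 1.0, 1000) + 0.5 * group   # group 1 benefits more
plan = greedy_fair_allocation(tau_hat, group, r_max=0.3, m_r=0.05)
print("fraction treated:", plan.mean())
```

The actual method optimizes the policy over causal-tree leaves rather than individual units, but the trade-off between overall benefit, the budget, and the fairness cap is the same.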
Chapter 2

Fair Representations via Linear Orthogonalization

2.1 Chapter Introduction

Machine learning (ML) models sift through mountains of data to make decisions on matters big and small: for example, who should be shown a product, hired for a job, or given a home loan. Machine inference can systematize decision processes to take into account orders of magnitude more information, produce accurate decisions, and avoid the common pitfalls of human judgment, such as belief in a just world or selective attention [78]. Moreover, unlike people, machines will never make poor decisions when tired [37], pressed for time, or distracted by other matters [129, 94].

Recent research suggests, however, that discrimination remains pervasive [6, 30, 39, 108]: for example, a model used to evaluate criminal defendants for recidivism assigned systematically higher risk scores to African Americans than to Caucasians [6]. As a result, reformed African American defendants, who would never commit another crime, were deemed by the model to present a higher risk to society (as much as twice as high [6, 39]) than reformed white defendants, with potentially grave consequences for how they were treated by the justice system.

The emerging field of AI fairness has suggested ways to mitigate harmful model biases [41, 30, 32], e.g., penalizing unfair inferences [41, 16] or creating representations that do not strongly depend on protected features [64, 99, 92]. These methods, however, fall short in one or more critical dimensions: interpretability, prediction quality, and generalizability. We define interpretability as the ability to understand how features affect, or bias, a model's predicted outcome; interpretability is needed to improve the transparency and accountability of AI systems. While models must sacrifice prediction quality (as measured by accuracy, mean squared error, or another metric) to improve fairness [115], the trade-off does not need to be as drastic as what current methods achieve. Finally, we define generalizability as the ability to easily apply fairness algorithms across multiple models and datasets.

In contrast, state-of-the-art fairness methods are specialized to linear regressions or random forests [155, 72, 16]. Similarly, methods that create fair latent features for neural networks (NNs) [64, 99] cannot be easily applied to improve fairness in non-NN models. These fair AI algorithms were not meant to be generalizable, because there do not seem to be adequate meta-algorithms that debias a whole host of ML models. One might naively expect that we could just create a single fair model and apply it to all datasets. The problem is that model performance varies greatly across datasets. While NNs are critical for, e.g., image recognition [33], other methods perform better for small data [107], especially when the number of dimensions is high and the sample size low [91]. There is no one-size-fits-all model, and there is no one-size-fits-all model-debiasing method. Is there an easier way to create fairer predictions than specialized methods for specialized ML models?

Chen et al. offer some clues to addressing this fundamental issue in fair AI [27]: by addressing data biases, we can potentially improve fair AI across the spectrum of models and achieve fairness without greatly sacrificing prediction quality. Inspired by these ideas, we describe a geometric method for debiasing features. Depending on the hyperparameter we choose, these features are mathematically guaranteed to be uncorrelated with specified sensitive, or protected, features. The method is exceedingly fast, and the debiased features are highly correlated with the original features (average Pearson correlations are between 0.993 and 0.994 across the three datasets studied in this paper). These debiased features are as interpretable as the original features when applied to any model. When applied to linear regression, for example, the coefficients are the same as or similar to the coefficients of the original features when controlling for protected variables (see Methods). The debiased features serve as a fair representation of the data that can be used with a number of NN and non-NN ML models, such as linear regression, random forest, support vector machines (SVMs), and multilayer perceptrons (MLPs). While previous methods have created fair representations [106, 124, 64, 99], those representations are either not very interpretable, like PCA components, or their relationship to the original features has not been established. We evaluate the proposed approach on several benchmark datasets and show that models using the debiased features are more accurate for almost any level of fairness we desire.

In the rest of the paper, we first review recent advances in fair AI to highlight the novelty of our method. Next, in the Methods section, we describe our methodology for improving data fairness and the definitions of fairness we use in the paper. In Results, we describe how our method improves fairness on both synthetic data and empirical benchmark data; we compare against several competing methods and demonstrate the advantages of our method. Finally, we summarize our results and discuss future work in the Conclusion section.

2.2 Related Works

There are dozens of ways to define fairness in supervised learning tasks [140]. For example, [77] proposed three definitions of fairness: calibration within groups, balance for the negative class, and balance for the positive class (the latter two are also referred to as equal opportunity [55]).
Alternatively, fairness can be defined as prediction distributions being similar across different groups of individuals; this is also referred to as statistical parity. As mentioned in the introduction, there are three ways of performing fair classification and regression: pre-processing, in-processing, and post-processing.

We begin with in-processing methods, the most well-studied of the three approaches. Early works include a linear method proposed by Zafar et al., which achieves fair classification by controlling the covariance between the decision function and the binary sensitive attribute [155]. One obvious problem is that, in order to lower the correlation between the decision function and the sensitive attribute, the parameters of the linear decision boundary have to take special combinations of values, which substantially hurts classification accuracy. Geometrically, this can be viewed as the hyperplane corresponding to the decision function being tilted in feature space. More generally, a family of different regularizers is proposed in [16]. These regularizers (penalty functions) cover both individual and group fairness, and they have the nice property of being convex. There are also numerous deep learning methods [64, 93, 99, 157, 149]; generally speaking, most deep learning methods are of the in-processing type. We return to deep learning methods later in this dissertation.

As for the pre-processing approach, [68] proposed a method to remove the correlation between the sensitive attribute and a given feature by manipulating the distribution of the feature values conditioned on the sensitive attribute. This is done by matching the quantile values of the conditional distribution to the marginal distribution. The sensitive features can only be discrete, and as the number of sensitive features grows, the number of subgroups grows exponentially, which makes the estimation of subgroup quantiles unreliable. Fair PCA [71] can also be used in pre-processing to improve robustness. We do not discuss post-processing methods such as [72], since we do not explicitly use the sensitive attribute at prediction time.

Our method [56], on the other hand, relies on linear projections and can handle both continuous and discrete sensitive features. It also runs efficiently: it is based on the Gram-Schmidt process or singular value decomposition (SVD) and scales as O(n_p^2), where n_p is the number of sensitive/protected attributes. Experiments show that our method achieves better accuracy at the same level of fairness. Since our method is a linear pre-processing step, the processed features are also interpretable. A parallel work [112] has applied a similar method to mining complex networks.

2.3 Methods

We describe a geometric method for constructing fair, interpretable representations. These representations can be used with a variety of ML methods to create fairer yet accurate models of the data.

2.3.1 Fair Interpretable Representations

We consider tabular data with n entries and m features. The features are vectors in the n-dimensional space, denoted x_i, where i = 1, 2, ..., m, and one of the columns corresponds to the outcome, or target variable, y. Among the features there are also n_p protected features, p_i, i = 1, ..., n_p. As a pre-processing step, all features are centered around the mean: ⟨x_i⟩ = 0. We describe a procedure to debias the data so as to create linearly fair features.
We aim to construct a representation r_j of a feature x_j that is uncorrelated with the n_p protected columns p_i, i = 1, ..., n_p, but highly correlated with the feature x_j. We recall that the Pearson correlation between the representation r_j and any feature x_k is defined as

    \mathrm{Corr}(r_j, x_k) = \frac{\mathbb{E}[r_j x_k] - \mathbb{E}[r_j]\mathbb{E}[x_k]}{\sigma_{r_j} \sigma_{x_k}},

where E[·] is the expectation, σ_{r_j} = sqrt(E[r_j²] − E[r_j]²) and σ_{x_k} = sqrt(E[x_k²] − E[x_k]²). Because all the features are centered (and we also assume that r_j is centered), E[r_j] = E[x_k] = 0, so

    \sigma_{r_j} = \sqrt{\mathbb{E}[r_j^2]} = \|r_j\| / \sqrt{n}, \qquad \sigma_{x_k} = \sqrt{\mathbb{E}[x_k^2]} = \|x_k\| / \sqrt{n},

and E[r_j x_k] = (r_j · x_k)/n. Therefore

    \mathrm{Corr}(r_j, p_i) = \frac{r_j \cdot p_i}{\|r_j\| \, \|p_i\|} \quad \text{and} \quad \mathrm{Corr}(r_j, x_j) = \frac{r_j \cdot x_j}{\|r_j\| \, \|x_j\|}.

Zero correlation between r_j and the n_p protected columns requires that r_j lie in the solution space of r_j · p_i = 0, i = 1, ..., n_p. Maximizing the correlation between r_j and x_j under this constraint is equivalent to projecting x_j onto that solution space. To calculate r_j, we first create an orthonormal basis of the vectors p_i, which we label \bar{p}_i, and construct the projector P_f = \sum_{i=1}^{n_p} \bar{p}_i \bar{p}_i^T. The representation r_j is then

    r_j = x_j - P_f x_j = (I - P_f) x_j.    (2.1)

Using the Gram-Schmidt process, the orthonormal basis can be constructed in O(n n_p²) time, and each fair feature representation requires an O(n n_p) projection. Given n_f features, the total time of the algorithm is O(n n_f n_p²); therefore our method scales linearly with the size of the data and the number of features. In practice this is exceedingly fast: the algorithm takes less than 200 milliseconds on the Adult dataset described below, which has 45K rows, 103 unprotected features, and 1 protected feature.

While the previous discussion was about how to create linearly fair features, one can make a linearly fair outcome variable, r_y, through the same process. In prediction tasks, however, we do not have access to the outcome data. While our method does not guarantee that every model's estimate of the outcome variable, ŷ, is fair, we find that it significantly improves fairness compared to competing methods. Moreover, in the special case of linear regression, it can be shown that the resulting estimate ŷ is uncorrelated with the protected variables.

Inevitably, the prediction quality of a model using such linearly fair features will drop compared to using the original features, because the solution is more constrained. To address this issue, we introduce a parameter α ∈ [0, 1], which indicates the fairness level. We define the parameterized latent variable as

    r'_j(\alpha) = r_j + \alpha (x_j - r_j).    (2.2)

Here, α = 0 corresponds to r'_j(α) = r_j, which is strictly orthogonal to the protected features p_i, while α = 1 gives r'_j(α) = x_j.

The protected features can be real-valued or cardinal. The fair representation method can also handle categorical protected features by introducing dummy variables. Specifically, if a variable X has k categories x_1, x_2, ..., x_k, we convert it to k − 1 binary variables, where the i-th variable is 1 if the variable takes category x_i and 0 otherwise; if all k − 1 variables are 0, the category is x_k. As a simple example, if a feature X has 3 categories, x_1, x_2, and x_3, the dummy variables are x̃_1 and x̃_2: if x̃_1 = 1 the category is x_1, if x̃_2 = 1 the category is x_2, and otherwise it is x_3. For categorical protected features, the fairness condition is interpreted as the latent variables having the same mean value in the different categorical groups.
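To make the construction above concrete, here is a minimal sketch of the projection and the α-interpolation (our own illustration, assuming NumPy column matrices for the features and protected attributes; it is not the dissertation's released code):

```python
import numpy as np

def debias_features(X, P, alpha=0.0):
    """Project each (centered) feature column of X onto the subspace
    orthogonal to the (centered) protected columns P, then blend the
    projection with the original feature using the fairness level alpha.

    alpha = 0 returns the strictly orthogonal (fair) features r_j,
    alpha = 1 returns the original features x_j.
    """
    X = X - X.mean(axis=0)
    P = P - P.mean(axis=0)
    # Orthonormal basis of the protected columns (Gram-Schmidt via QR).
    Q, _ = np.linalg.qr(P)
    R = X - Q @ (Q.T @ X)          # r_j = (I - P_f) x_j, Eq. (2.1)
    return R + alpha * (X - R)     # r'_j(alpha) = r_j + alpha (x_j - r_j), Eq. (2.2)

# Example: one protected column, three correlated features.
rng = np.random.default_rng(0)
P = rng.normal(size=(500, 1))
X = P @ rng.normal(size=(1, 3)) + rng.normal(size=(500, 3))
R = debias_features(X, P, alpha=0.0)
print(np.abs(R.T @ (P - P.mean(axis=0))).max())   # ~0: zero covariance with P
```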
2.3.2 Fair Models

Using the procedure described above, we can construct a fair representation of every feature and use the fair features to model the outcome variable. Consider a linear regression model that includes all features, the n_p protected features p_i, i = 1, ..., n_p, and the n_f = m − n_p non-protected features x_i, i = 1, ..., n_f:

    \hat{y} = \beta_0 + \sum_{i=1}^{n_f} \beta_i x_i + \sum_{i=1}^{n_p} \gamma_i p_i.    (2.3)

After transforming the features into fair features, the fair regression model reduces to

    \hat{y}' = \beta'_0 + \sum_{i=1}^{n_f} \beta'_i r_i.    (2.4)

Here, r_i corresponds to the fair version of x_i. We can prove that β_i = β'_i, i = 1, ..., n_f, while the predicted value ŷ' is uncorrelated with the protected features p_i, i = 1, ..., n_p. For generalized linear regression, such as logistic regression, this proof does not hold, but we numerically find that the coefficients are similar.

We should take a step back at this point. The fair latent features are close approximations of the original features; therefore we expect, and in certain cases can prove, that the regression coefficients of the fair features are approximately the coefficients of the original features. The fair features can, by this definition, be considered almost as interpretable as the original features. In addition to regression, the fair representations can be used with other ML models, such as AdaBoost [45], NuSVM [25], random forest [22], and multilayer perceptrons [120].

2.3.3 Measuring Fairness

While there exists no consensus on how to measure fairness, researchers have proposed a variety of metrics, some focusing on representations and some on the predicted outcomes [140, 61]. We therefore compare our method to competing methods using the following metrics: Pearson correlation, mutual information, discrimination, calibration, balance of classes, and accuracy of the inferred protected features. Due to space limitations, we leave mutual information out of our analysis in this paper and do not compare calibration and balance of classes to model accuracy; results in all cases are similar.

2.3.3.1 Fairness of Outcomes

One can argue that outcomes are fair if they do not depend on the protected features. If this is the case, a malicious adversary will not be able to guess the protected features from the model's predictions. One way to quantify the dependence is through the Pearson correlation between (real-valued or cardinal) predictions and the protected features. For models making binary predictions, fairness can be measured using the mutual information between the predictions and the protected features, given that the protected features are discrete. We find that mutual information and Pearson correlation lead to qualitatively similar findings, despite mutual information being a non-linear metric, so we focus on Pearson correlation in this paper.

Previous work [157] has also defined a discrimination metric for binary predictions, as follows. Consider a protected variable p_1 and a binary prediction ŷ of an outcome y. The metric measures the bias of ŷ with respect to a single binary protected feature p_1 using the difference of positive rates between the two groups:

    y_{\mathrm{Discrim}} = \frac{\sum_{n: p_1[n]=0} \hat{y}[n]}{\sum_{n: p_1[n]=0} 1} - \frac{\sum_{n: p_1[n]=1} \hat{y}[n]}{\sum_{n: p_1[n]=1} 1}.    (2.5)
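For reference, the two outcome-fairness measures just described can be computed in a few lines (a sketch with our own variable names, assuming NumPy arrays; it is not the dissertation's code):

```python
import numpy as np

def discrimination(y_hat_binary, p1):
    """Eq. (2.5): difference in positive prediction rates between the
    two groups defined by the binary protected feature p1."""
    y_hat_binary, p1 = np.asarray(y_hat_binary), np.asarray(p1)
    return y_hat_binary[p1 == 0].mean() - y_hat_binary[p1 == 1].mean()

def correlation_fairness(y_hat, p1):
    """Pearson correlation between (real-valued) predictions and the
    protected feature; values near 0 indicate fairer predictions."""
    return np.corrcoef(np.asarray(y_hat), np.asarray(p1))[0, 1]
```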
For real-valued predictions (ŷ ∈ [0, 1]), Kleinberg et al. [77] suggested a more nuanced way to measure fairness:

• Calibration within groups: Individuals assigned a predicted probability ŷ ∈ [r_0 − ε, r_0 + ε] (ε > 0 and ε ≪ 1) should have an approximate positive rate of r_0. This should hold for both protected groups (p_1 = 0 and p_1 = 1).

• Balance for the negative class: The mean ŷ of the group p_1 = 0, y = 0 and the group p_1 = 1, y = 0 should be the same.

• Balance for the positive class: The mean ŷ of the group p_1 = 0, y = 1 and the group p_1 = 1, y = 1 should be the same.

In some cases, calibration error is difficult to calculate, as it depends on how the predictions are binned. In these cases, we can measure calibration error using the log-likelihood of the labels given the real-valued predictions as a proxy. By definition, logistic regression maximizes the (log-)likelihood function, assuming the observations are sampled from independent Bernoulli distributions with P(y[n] | X[n]) = ŷ[n]. A better log-likelihood implies that the individuals assigned probabilities ŷ ∈ [r_0 − ε, r_0 + ε] are more likely to have a positive rate of r_0, which is better calibrated according to [77].

2.3.3.2 Fairness of Representations

Several past studies examined the fairness of representations, arguing that models using fair representations will also make fair predictions. Learned representations are considered fair if they do not reveal any information about the protected features [64, 99, 93, 149, 140]. These studies trained a discriminator to predict the protected features from the learned representations, using its accuracy as a measure of fairness.

Following this approach, we treat the predicted probabilities as a one-dimensional representation of the data and use the accuracy of the inferred protected features as a measure of fairness. However, this method is not effective in situations where the protected classes are unbalanced. Let us assume the fair representation is R and the protected feature is p_1; for simplicity, we only consider the case of a single binary protected feature. The discriminator infers the protected feature in a Bayesian way, namely

    P(p_1 = c \mid R) = \frac{P(R \mid p_1 = c) \, P(p_1 = c)}{P(R)}, \quad c \in \{0, 1\}.    (2.6)

When there is a large difference between P(p_1 = 0) and P(p_1 = 1), even if there is useful information in the distribution P(R | p_1 = c), the discriminator will not perform significantly better than the baseline model, the majority-class classifier.

2.4 Results

2.4.1 Synthetic Data

We create synthetic biased data using the procedure described in [155]. We generate data with one binary protected variable s, one binary outcome y, and two continuous features, x_1 and x_2, which are bivariate Gaussian distributed within each value of s. In Fig. 2.1, we use color to represent the protected feature values (red, blue) and symbols (o, x) to represent the outcome. The first observation is that there is an imbalance in the joint distribution of the protected feature and the outcome variable: for the blue markers, the two outcome classes are not equally represented. We expect that a logistic classifier trained on this data will show similarly unbalanced behavior. To demonstrate our method, we choose two different fairness levels, α ∈ {0.0, 1.0}. We first transform the two features into their corresponding fair representations and then train logistic classifiers using these fair representations. In Fig. 2.1, we plot the data using the fair representations and show the classification boundary as a green dashed line.
We can observe that for = 0, the blue markers and red markers are mixed (less discrimination and bias), but for = 1:0 (equivalent to raw data), the blue and red markers tend to separate from each other. We can estimate this imbalance by comparing the ratio of blue in individuals predicted as and the ratio of blue in individuals predicted as. The larger the dierence, the more the imbalance. Quantitatively, for = 0:0, there are 62.7% blue in o-predictions and 52.9% in x-predictions. For = 1:0, those ratios are 76.2% and 36.5%. The accuracy of outcome predictions are 0.811 and 0.870 for the fair and original features, respectively, thus demonstrating that, while increasing fairness does indeed sacrice 19 in accuracy, the loss can be relatively small. Overall, the results suggest that biased data creates biased models, but our method can make fairer models. (a) (b) Figure 2.1: Fair synthetic data. (a) raw data ( = 1:0), (b) plot for fairness level = 0:0. The two features in the data are x f 1 and x f 2 , and the two classes we want to protect are in red and blue. The two outcome classes are represented as two symbols: and. We demonstrate how our method can achieve fair classication using synthetic data (see Appendix), and also compare our prediction quality and fairness to other fair AI algorithms using benchmark datasets. 2.4.2 Real-World Data German dataset has 61 features about 1,000 individuals, with a binary outcome variable de- noting whether an individual has a good credit score or not. The protected feature is gender. (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) COMPAS dataset contains data about 6,172 defendants. The binary outcome variable denotes whether the defendant will recidivate (commit a crime) within two years. The protected feature is race (whether the race is African American or not), and there are nine features in total. (https://github.com/propublica/compas-analysis) 20 Adult dataset contains data about 45,222 individuals. The outcome variable is binary, denoting whether an individual has more than $50,000. The protected feature is age, and there are 104 features in total. (https://archive.ics.uci.edu/ml/datasets/Adult) Debiased features had mean correlations of 0.993, 0.994, and 0.994, for the German, COMPAS, and Adult data, respectively. We reserved 20% of the data in the Adult and COMPAS datasets for testing and used the remaining data to perform 5-fold cross validation. This ensured no leakage of information from the training set to the testing set. The German dataset is much smaller than the rest, so it was randomly divided into ve folds of training, validation and testing sets. Each set had 50%, 20% and 30% of all the data. We measured the performance metrics on the test data. We varied the fairness parameter between 0 and 1 and applied the debiased features to logistic regression, AdaBoost, NuSVM, random forest, and multilayer perceptrons. In practice, one could use a host of commercial ML models and pick the most accurate one given their fairness tolerance. (a) (b) (c) Figure 2.2: Fairness versus accuracy. Plots show Pearson correlation versus accuracy of predictions (Acc Y ) for the German, COMPAS and Adult datasets. For each plot, Zafar2015 stands for [155], Zafar2016 for [154] and jaiswal2018unsupervised for [64]. 
Fair NuSVM, Fair RF, Fair AdaBoost, and Fair MLP results are produced using the fair representations constructed by our proposed method with NuSVM [25], random forest [22], AdaBoost [45], and multilayer perceptrons [120] models, respectively. The results of UAI are not shown for the Adult dataset, since its best accuracy (0.83) lies outside of the boundary of the plot. (Same for Figure 2.3 and 2.4.) 21 2.4.3 Comparison Against State-of-the-Art We compared our method to several previous fair AI algorithms. For the models proposed by [155, 154], we vary the fairness constraints from perfect fairness to unconstrained. For the \Unied Adversarial Invariance" (UAI) model proposed by [64], we vary the term in the loss function from 0 (no fairness) to very large value, e.g., 9:0 10 19 for COMPAS dataset, (large value corresponds to perfect fairness). The predictions of the UAI model for the German and Adult datasets are provided by the authors. We are interested in (1) how dierent models tradeo between accuracy and fairness and (2) how dierent metrics of fairness compare to each other. Fairness Versus Accuracy We rst investigate the trade-os between prediction accu- racy (Acc Y ) and fairness, which we measure three dierent ways: (1) Pearson correlation between the protected feature and model predictions, (2) discrimination between the binary protected feature and the binarized predictions (predicted probabilities above 1/2 are given a value of 1, and are otherwise 0) and (3) the accuracy of predicting protected features from the predictions (Acc P ). To robustly predict the protected features from the model predictions, we used both a NN with three hidden layers, which is used by former works [64, 99, 93, 149, 157] and a random forest model. We report the better accuracy of those two models. Figure 2.2,2.3 and 2.4 shows the resulting comparisons. (a) (b) (c) Figure 2.3: Discrimination versus accuracy plots for the three datasets. The gures show that models using the proposed fair features achieve signicantly higher accuracy|for the same degree of fairness|compared to competing methods. Equivalently, 22 we achieve greater fairness with equivalent accuracy. In Fig. 2.4, we nd Acc P shows little dierence from the baseline majority class classier for the German and Adult datasets. The reason is explained in Eq.(2.6). On the other hand, Acc P of COMPAS dataset shows a clear trend because the majority baseline is around 0.51, which is consistent with the Eq.(2.6). For the Adult dataset, the fair logistic regression cannot achieve perfect fairness but the situation is improved by AdaBoost. We discover, in other words, that there is no single ML model that achieves greater accuracy for a given value of fairness, but our method allows us to choose suitable models to achieve greater accuracy. 2.4.3.1 Fairness of Representations (a) (c) (b) Figure 2.4: Accuracy of inferring the protected variable from the model's predictions (AccP ) versus the accuracy of predicting the outcome (Acc Y ) for the three datasets. We compared our method to earlier works using fair representations. Previous works used NNs to encode the features into a high dimensional embedding space and then separately trained discriminators to infer the protected feature and the outcome variables. The accuracy of inferring protected feature and outcome are reported. Ideally, the accuracy for the outcome should be high and the accuracy of inferring the protected features should be close to the majority class baseline. 
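A minimal sketch of how these outcome-level fairness measures can be computed from a model's predicted probabilities is given below; the function names and the choice of a random-forest discriminator for Acc_P are assumptions made for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def discrimination(y_hat, p):
        # Eq. (2.5): difference of positive-prediction rates between the two protected groups.
        return abs(y_hat[p == 0].mean() - y_hat[p == 1].mean())

    def outcome_fairness(y_prob, p):
        y_hat = (y_prob > 0.5).astype(int)
        corr = abs(np.corrcoef(y_prob, p)[0, 1])      # Pearson correlation with the protected feature
        disc = discrimination(y_hat, p)
        # Acc_P: treat the predicted probability as a 1-d representation and try to recover p from it.
        acc_p = cross_val_score(RandomForestClassifier(n_estimators=100),
                                y_prob.reshape(-1, 1), p, cv=5).mean()
        baseline = max(p.mean(), 1 - p.mean())        # majority-class baseline for Acc_P
        return {"correlation": corr, "discrimination": disc, "Acc_P": acc_p, "majority": baseline}

The majority-class baseline is returned alongside Acc_P because, as argued above, Acc_P is only informative when compared against that baseline.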
We set the fairness level to 0 (perfect fairness). We show Acc_P and Acc_Y for the various methods in Table 2.1 (Appendix) and Fig. 2.4. Our method applied to a logistic model has fairness similar to the best existing methods but is very fast, easy to understand, and creates more interpretable features.

2.4.3.2 Balance Versus Calibration

Figure 2.5: Balance vs. negative log-likelihood (calibration error) for the German, COMPAS and Adult datasets. In each plot there are two sets of curves for every model, labeled by y = 0 and y = 1: the first is the difference of mean predicted probability (between the protected classes) assigned to individuals with negative outcomes y = 0, and the second is the corresponding difference for individuals with positive outcomes y = 1. (These differences are called the balance of the negative or positive class by [77].) Fairer models are those in the lower left corner of each plot.

Finally, we use another measure of fairness that captures the degree to which each model makes mistakes. Figure 2.5 shows the delta score (i.e., balance) versus negative log-likelihood (i.e., calibration error). Fairer predictions are located in the lower left corner of each figure, meaning that there are smaller differences in outcomes between the protected classes. We only compare the logistic model with fair features to the models proposed by Zafar et al. [155, 154], because these models maximize the log-likelihood function (i.e., minimize calibration error) when selecting parameters. For all datasets, our method generally achieves greater fairness.

2.5 Conclusion

We show that our algorithm simultaneously achieves three advances over many previous fair AI algorithms. First, it is interpretable: the features we construct are minimally affected by our fair transform. While this does not mean the models trained on these features are interpretable (they could be a black box), it does mean that any method used to interpret features can easily be used for these fairer features as well. Second, the features better preserve model prediction quality: models using these features were more accurate than competing methods when the value of the fairness metric was held fixed. This is in part due to the third advance: our method can be applied to any number of commercial models, because it merely acts as a pre-processing step. Different models have different strengths and weaknesses; while some are more accurate, others are fairer. We can pick and choose particular models that achieve both high fairness and accuracy, whether a linear model like logistic regression or a non-linear model like a multilayer perceptron, as shown in Figs. 2.2, 2.3, and 2.4.

We propose some ideas for future work. First, while making linearly fair features works very well in practice, fairness could be improved further by removing non-linear correlations. Second, we can extend our method to more easily address categorical protected variables. In the present method, a categorical variable with alphabet size n becomes a set of n-1 binary indicator variables. It would be ideal, however, if a method reduced the mutual information with the categorical variable directly, rather than first creating n-1 variables and removing correlations.

2.6 Supplementary Materials

2.6.1 Additional Tables

We show the comparison of our method with former work on invariant representations in Table 2.1. Following that work, we use the accuracy of predicted outcomes (Acc_Y) and the accuracy of inferred protected features (Acc_P) as performance metrics.

                     German              Adult
    Method           Acc_Y   Acc_P      Acc_Y   Acc_P
    Maj. Class       0.71    0.80       0.75    0.67
    Li [88] *        0.74    0.80       0.76    0.67
    VFAE [93] *      0.73    0.70       0.81    0.67
    Xie [149] *      0.74    0.80       0.84    0.67
    Moyer [99] *     0.74    0.60       0.79    0.69
    Jaiswal [64] *   0.78    0.80       0.84    0.67
    Fair Logistic    0.74    0.80       0.84    0.67
    Fair NuSVM       0.75    0.80       0.85    0.73
    Fair AdaBoost    0.75    0.80       0.84    0.67
    Fair RF          0.75    0.80       0.85    0.72
    Fair MLP         0.75    0.80       0.85    0.67

Table 2.1: Accuracy of predicted outcomes (Acc_Y) and protected features (Acc_P) for the German and Adult datasets. The proposed fair methods (bottom rows) use a fairness level of 0.0. Higher Acc_Y indicates better predictions, while Acc_P closer to the majority-class baseline indicates fairer predictions. Results marked * were reported by [65]. Best performance is shown in bold.

2.6.2 Proof of the Invariance of Parameters in Linear Regression Using Debiased Features

Consider a linear regression using all n_f non-protected features x_i, i = 1, ..., n_f and all n_p protected features p_i, i = 1, ..., n_p:

    \hat{y} = \beta_0 + \sum_{i=1}^{n_f} \beta_i x_i + \sum_{i=1}^{n_p} \gamma_i p_i.    (2.7)

Assume we have also created a model using the debiased features x'_i, i = 1, ..., n_f:

    \hat{y}' = \beta'_0 + \sum_{i=1}^{n_f} \beta'_i x'_i.    (2.8)

We now give a mathematical proof that

    \beta_i = \beta'_i  for all i = 1, ..., n_f.    (2.9)

For simplicity, we assume that all features and the outcome have mean 0 and standard deviation 1. In this case the Pearson correlation between two variables can be written as an inner product,

    Corr(x, y) = x \cdot y,    (2.10)

where x and y may be non-protected features, protected features, or the outcome, and bold font denotes an N-dimensional vector, with N the number of data points. Without loss of generality, assume that the protected features are orthogonal to one another. (In general, protected features can be correlated, but we can always find an orthogonal basis for them.) The debiased features x' are calculated as

    x'_i = x_i - \sum_{j=1}^{n_p} c_{ij} p_j,    (2.11)

where c_{ij} = Corr(x_i, p_j) = x_i \cdot p_j. Since all features and the outcome have mean 0, \beta_0 = \beta'_0 = 0. The remaining parameters are obtained by solving the inversion problems

    \tilde{X}^T \tilde{X} \tilde{\beta} = \tilde{X}^T y,    X'^T X' \beta' = X'^T y.    (2.12)

Let X = [x_1, ..., x_{n_f}], P = [p_1, ..., p_{n_p}], and C = [c_{ij}]. Here X' = [x'_1, ..., x'_{n_f}], \tilde{X} = [x_1, ..., x_{n_f}, p_1, ..., p_{n_p}] = [X  P], \beta' = [\beta'_1, ..., \beta'_{n_f}]^T, and \tilde{\beta} = [\beta_1, ..., \beta_{n_f}, \gamma_1, ..., \gamma_{n_p}]^T. Then we have

    \tilde{X}^T \tilde{X} = \begin{bmatrix} X^T \\ P^T \end{bmatrix} \begin{bmatrix} X & P \end{bmatrix}
                          = \begin{bmatrix} X^T X & X^T P \\ P^T X & P^T P \end{bmatrix}.    (2.13)

Using the assumption that the protected features are orthogonal (and standardized) together with the definition of c_{ij},

    \tilde{X}^T \tilde{X} = \begin{bmatrix} X^T X & C \\ C^T & I \end{bmatrix}.    (2.14)

We also have

    \tilde{X}^T y = \begin{bmatrix} X^T y \\ P^T y \end{bmatrix}.    (2.15)

Now consider the inversion problem for the regression using debiased features:

    (X'^T X')_{ij} = x'_i \cdot x'_j    (2.16-2.17)
                   = (x_i - \sum_l c_{il} p_l) \cdot (x_j - \sum_k c_{jk} p_k)    (2.18)
                   = x_i \cdot x_j - x_i \cdot \sum_k c_{jk} p_k - x_j \cdot \sum_l c_{il} p_l + \sum_l c_{il} c_{jl}    (2.19-2.20)
                   = x_i \cdot x_j - \sum_l c_{il} c_{jl},    (2.21)

    (X'^T y)_i = (x_i - \sum_l c_{il} p_l) \cdot y = x_i \cdot y - \sum_l c_{il} (p_l \cdot y).    (2.22-2.23)

We can see that the rows of the matrix X'^T X' and of the vector X'^T y can be obtained by applying the same elementary row-reduction steps to \tilde{X}^T \tilde{X} and \tilde{X}^T y.
To obtain theith row, we perform Row i = Row i np X l=1 c il Row (i+n f ) : (2.24) 28 After applying the elementary row reduction steps above to the rst n f rows of the matrix form of ~ X T ~ X 0 ~ = ~ X T y, we will have 2 6 4 X 0T X 0 0 C T I 3 7 5 ~ = 2 6 4 X 0T y P T y 3 7 5 : (2.25) Thus inversion problem X 0T X 0 0 =X 0T y is a sub-problem of ~ X T ~ X ~ = ~ X T y where the rst n f elements of ~ gives 0 . Which means that we have proved Eq.(2.9). It is worth mention that for fairness level 6= 0, the statement of Eq.(2.9) does not hold. 29 Chapter 3 Fair Predictions and Invariant Representations with Kernels 3.1 Chapter Introduction Machine algorithms are increasingly used to automate high-stakes decisions in healthcare, personal nance and hiring, raising questions about whether these algorithms unfairly dis- criminate against protected groups [6, 30, 39, 108]. For example, COMPAS, a recidivism risk evaluation tool used by judges to make bail decisions, was found to exhibit racial bi- ases, assigning higher risk scores to peaceful black defendants (who did not recidivate) than similar whites [6, 77]. Given its social relevance and impact, the eld of algorithmic fairness has attracted signicant attention from the research community [96]. The goal of fair algorithms is to make predictions that are invariant to the protected classes, such as race. Some popular algorithmic fairness methods are based on adversarial training [96], but they suer from a number of limitations. First, these methods are often ad hoc, without strong theoretical guarantees on their fairness. For example, [159] achieves fairer predictions in some scenarios but lacks a theoretical guarantee of fairness in general. Second, adversarial methods can be slow and dicult to train. As mentioned in [50], to achieve satisfactory results even for tabular data, up to 200 epochs of training are required. Many alternative methods therefore 30 exist, such as fair data representations [99, 76, 56]. Fair representations are often based on non-deterministic variational autoencoders [76], however, which make them dicult to be t into other deep learning frameworks, or only reduce linear correlations with protected features [56]. To address these challenges, we propose a method that learns fair representations using maximum mean discrepancy (MMD) [48], which is expressed using kernel functions. Our method relies on a dierentiable penalty function, which can be directly calculated from the latent space representations. In contrast to many previous approaches, the method implies a model can have a statistical independence-based loss function between transformed data and discrete protected classes. We show that this loss function is a simplication of the Hilbert Schmidt Independence Criterion, a general and nonparametric measure of statistical correlation between two random variables. This mathematical connection gives the method a strong theoretical grounding. Additionally, in contrast to previous fairness methods, our method is fast and can be applied to both supervised and unsupervised learning tasks. We evaluate the proposed method on both structured and unstructured data, which demonstrates its wide applicability. We show that our method excels over state-of-the- art fairness methods on benchmark datasets, achieving better prediction accuracy without sacricing fairness. To show our method's utility in an unsupervised learning setting, we use it to learn invariant representations of images and then perform style transformations. 
This allows us to tune images to a given style, without, for example, mode collapse [161]. Compared to state-of-the-art approaches [50], our method requires less than 1/10 of the training epochs, while achieving similar or better fairness. The rest of the paper is organized as follows. First we review closely related works. Then we derive the kernel-based penalty function we used (namely, MMD) in an intuitive way and show its connections with Hilbert Schmidt Independence Criterion (HSIC). We then present our experiment design and results on both supervised learning and unsupervised learning tasks. 31 3.2 Related Works As mentioned above, although our work [56] improves over previous work by allowing better eciency and better accuracy, our work is still limited to only consider linear correlations, namely, the Pearson correlation. From an adversarial point of view, the \fair" features we produced are only independent to the sensitive attribute in the linear sense. One can train a nonlinear model, such as kernel SVM or neural networks to predict the sensitive information from the features we created. In this work, we proposed regularizing an arbitrary neural network with a penalty term closely related to the Hilbert{Schmidt independent criterion (HSIC) [48]. Previous linear methods for fair predictions has been discussed above. Aside form linear methods, there are also deep learning methods. Some pioneer deep learning methods lacks a theoretical grounding [157]. Many previous methods are based on variational autoencoders (VAE) [76], such as the variational fair autoencoder (VFAE) [93], or VAE with a mutual information bound between sensitive attribute and the latent representation [99, 50, 133]. This limited the application of such methods since rst they have to work with VAE and second, the representation produced is not deterministic. Given the diculties of explicitly calculating mutual information between sensitive at- tribute and the learned representation, one can instead take advantage of direct adversarial approaches. The methods are typically based on training a discriminator to predict the sensitive attribute from the learned representation [149]. Other authors [64] take a slightly dierent approach by disentangling data representations into two parts, one containing in- formation about the outcome and the other containing all the nuisance factors. Similarly, [63] proposed a method to achieve invariance via adversarial training of a feature mask. In- formation theory-based methods also actively use adversarial training: [121] proposed using adversarial training to maximize the predicting entropy of sensitive attribute. There are also some drawbacks of the adversarial methods. 32 To use adversarial methods in AI fairness is nonetheless natural, because they can often separate important and spurious features from data. For example, GAN-based methods are extremely good at separating style surrounding images (the realism of images), from labels (cars, stop signs, etc.) [60]. Similarly, they can separate labels from domain-specic styles [137]. Understanding the way in which they can achieve these remarkable feats, how- ever, are still in their infancy [90]. They are also dicult to train. As mentioned in [50], even for tabular data, 200 epochs of training is required if the model is trained from scratch. We proposed directly using maximum mean discrepancy (MMD) [48] as a regularizer. MMD can be calculated directly from the latent space representation and it is also dieren- tiable. 
We showed that it is actually a simplied and degenerate case of the HSIC, meaning it can directly measure the statistical dependence between the sensitive attribute and the latent space representation we created. Experiments show that our method needs much less training epochs comparing to [50] but can produce results of similar or better quality. Our method can also perform style transformation similar to [99]. 3.3 Methodology In this section, we introduce independence criteria via kernel functions, and the relation between our method and Hilbert Schmidt Independence Criterion (HSIC). We also illustrate with supervised and unsupervised applications of our method. 3.3.1 Independence Condition using Kernel Functions Assume a data representation, u2R n and a discrete variable z2 [1;n p ], where sensitive attributez represents protected classes in the data. Consider a universal kernel function (e.g., Gaussian RBF kernel function)k(;) :R n R n !R. We denote the feature transformation of kernel k as () :U !H. HereU R n is the latent representation space andH is the Reproducing Kernel Hilbert Space (RKHS) associated with kernel k. By denition, for 33 embedded vectors u i and u i 0, the inner product is dened ash(u i ); (u i 0)i =k(u i ;u i 0) and the norm is dened ask(u i )k 2 H = k(u i ;u i ). Dene a sensitive group as S j =fijz i = jg. Intuitively, if we have u?? z the centroids of two groups, S j 1 and S j 2 , inH should be identical. The centroid of a sensitive group is given as c = 1 jS j j X i2S j (u i ) = 1 n j X i2S j (u i ): (3.1) The distance between the two centroids is dened as d 2 j 1 ;j 2 ju =kc j 1 c j 2 k 2 H : (3.2) In case of Gaussian RBF kernel function, since the kernel function is continuous and the feature transformation has innite dimension, d 2 j 1 ;j 2 ju = 0 implies that the expected values of any polynomial p(u (k) ) of distribution P (ujz = j 1 ) and P (ujz = j 2 ) should be identical. (Here u (k) 's are the components of multidimensional latent representation u.) Distance is dened in Hilbert space as d 2 j 1 ;j 2 ju =kc j 1 c j 2 k 2 H =hc j 1 c j 2 ;c j 1 c j 2 i = 1 n 2 j 1 X i;i 0 2S j 1 k(u i ;u i 0) + 1 n 2 j 2 X i;i 0 2S j 2 k(u i ;u i 0) 2 n j 1 n j 2 X i2S j 1 i 0 2S j 2 k(u i ;u i 0): (3.3) Surprisingly, as shown above, the distance in the RKHS,d 2 j 1 ;j 2 ju , can be calculated explicitly using kernel functionk(;). Equation 3.3 is referred as maximum mean discrepancy (MMD). In case where the number of protected groups n p is larger than two, we dene d u 2 = max j 1 ;j 2 kc j 1 c j 2 k 2 H = max j 1 ;j 2 d 2 j 1 ;j 2 ju ; (3.4) 34 the maximum distance between centroids, as the metric of overall independence between latent space representationu and discrete sensitive attributez. Smaller values ofd u 2 indicates u and z are less dependent. The metric d u 2 can be used as a penalty term for training any machine learning model. The complexity of calculating d u 2 is O(n 2 ), for the number of datapoints,n. This is a substantial simplication over HSIC, which has a complexityO(n 4 ). For deep learning models trained using SGD, n equals to the size of a mini-batch. One can further reduce the complexity by sampling only a subset of the data from a mini-batch. (Details shown in Tab. 3.3.) We will show in the next section the relationship between this loss and HSIC. 
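The penalty d_u^2 defined in Eqs. (3.3)-(3.4) can be computed directly from a mini-batch of latent representations. Below is a minimal PyTorch sketch assuming a Gaussian RBF kernel with a bandwidth parameter here called gamma; the function names are illustrative.

    import torch

    def rbf_kernel(a, b, gamma=1.0):
        # k(u, u') = exp(-gamma * ||u - u'||^2)
        return torch.exp(-gamma * torch.cdist(a, b) ** 2)

    def mmd_penalty(u, z, gamma=1.0):
        # d_u^2 of Eqs. (3.3)-(3.4): the largest squared RKHS distance between the
        # kernel mean embeddings (centroids) of any pair of sensitive groups.
        groups = [u[z == g] for g in torch.unique(z)]
        d_max = u.new_zeros(())
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                d2 = (rbf_kernel(a, a, gamma).mean()      # (1/n_i^2) sum of k(u, u') within group i
                      + rbf_kernel(b, b, gamma).mean()
                      - 2 * rbf_kernel(a, b, gamma).mean())
                d_max = torch.maximum(d_max, d2)
        return d_max                                      # differentiable, O(n^2) per mini-batch

Because only kernel evaluations over the batch are needed, the penalty is differentiable and can be attached to any latent layer; subsampling the batch, as discussed later, reduces the quadratic cost.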
3.3.2 Relationship with Hilbert Schmidt Independence Criterion Hilbert Schmidt Independence Criterion is used to measure the dependence of two sets of data u and v [48]: HSIC((u i ;v i ) n i=1 ;F;G) = 1 n 2 Tr (KHLH): (3.5) HereK i;j =k(u i ;u j ) andL i;j =l(v i ;v j ) are kernel matrices for data representationsu andv. F andG are RKHS associated withk andl. And H is a constant value matrix,H i;j = i;j 1 n . Plug in the values of matrix elements H i;j and perform one fold of summation, we have HSIC((u i ;v i ) n i=1 ;F;G) = 1 n 2 n X i;j k i;j l i;j + 1 n 4 n X i;j;q;r k i;j l q;r 2 n 3 X i;j;q k i;j l i;q (3.6) Assume we have only two sensitive groups, z2f0; 1g. Among the n observations, the rst n 0 have z = 0 and the rest n 1 = nn 0 observations have z = 1. Note that since z is discrete, the kernel function for z, L i;j = l(z i ;z j ), reduces to Kronecker delta function i;j . Summation over the elements L i;j yields a result identical to d 2 0;1ju mentioned in Eq. 3.3 up 35 to dierence of a constant factor. This means our penalty term is, up to a constant factor, a measure of independence between features and protected groups. 3.3.3 Applications 3.3.3.1 Fair Prediction (Supervised Learning) Suppose we are given a set of features x, labelsy and sensitive attribute z2N. The goal of fair classiers is to predict the labels ^ y using featuresx such that ^ y has high accuracy but is independent of the sensitive attribute z. We can embed structured or unstructured data x, with function f into u, and generate predictions ^ y from u, u =f(x); ^ y =g(u): (3.7) To ensure fairness, we wantu??z. To achieve this, we add a penalty term to a loss function, ` ` = loss(^ y;y) +d u 2 : (3.8) where loss(^ y;y) can be, for example,L 1 ,L 2 , or cross entropy loss. Herez is only required in the training phase and the predictions does not depend on z explicitly. 3.3.3.2 Invariant Representation and Style Transformation (Unsupervised Learning) In this task, we want to learn a representation u of data x such that u is invariant of the sensitive attribute z. A strict constraint, u??z ignores important information about z and makes reconstruction ofx infeasible. To account for this, we embed a sensitive attribute using a separate representation v =h(z), where h is the embedding model. We can reconstructed data ^ x from the concatenation of u of v, ^ x =g(u;v) =g(f(x);h(z)): (3.9) 36 To learn the embedding method via an auto-encoder, the loss function can be dened as ` = loss(^ x;x) +d u 2 : (3.10) After trainingf,g andh, latent representationu =f(x) will be an invariant represent of data x with respect to z. Instead of interpreting z as a protected feature, we could instead interpret it as a \style" feature, alike to conditional GANs, such as CycleGANS learning the style of a Van Gough painting [162, 60]. We therefore have a representation that separates style and other information to reconstruct x in to two parts, v and u. Reconstructed data ^ x = g(f(x);h(z)) will be close to the input x but on the other hand, we can articially change the input value z to z t 6= z. The reconstructed data x 0 t = g(f(x);h(z t )) will then be transformed into another style z t . The diagram for both supervised and unsupervised learning is shown in Fig. 3.1. Supervised Unsupervised Figure 3.1: The diagrams for supervised and unsupervised learning. 
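For the supervised case, one possible training step implementing the penalized loss of Eq. (3.8) is sketched below, reusing the mmd_penalty sketch above. The layer sizes mirror the encoder and predictor described in the experiments, while the input dimension and the penalty weight lam are illustrative assumptions.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(103, 64), nn.ReLU(), nn.Linear(64, 64))   # encoder f: x -> u
    g = nn.Sequential(nn.Linear(64, 50), nn.ReLU(), nn.Linear(50, 2))     # predictor g: u -> y_hat
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()))
    lam = 0.1  # weight of the fairness penalty (illustrative value)

    def train_step(x, y, z):
        u = f(x)                                # latent representation u = f(x)
        loss = nn.functional.cross_entropy(g(u), y) + lam * mmd_penalty(u, z)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

Note that the sensitive attribute z enters only through the penalty term during training; predictions at test time depend on x alone.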
3.4 Experimental Results 3.4.1 Fair Predictions We tested the performance of fair predictions of our proposed method on three widely used benchmarks: German [40], Adult [40], and Health 1 . The German dataset contains 1,000 1 https://www.kaggle.com/c/hhp 37 individuals and 60 features. The outcome to be predicted is whether the individual has good credit or not, and the sensitive attribute is age. Both credit and age are binary variables (good or bad score; old or young, respectively). The Adult dataset contains 45,222 individuals and 103 features. The outcome is a binary label denoting whether the individual has an income of at least $50,000, and the sensitive attribute is gender. As with the German dataset, the outcome and sensitive attributes of Adult are binary. The Health dataset contain data for approximately 50K patients. The goal is to predict whether the patient survives for more than 10 years. The sensitive attribute is age of the patient and there are 71 features after preprocessing. For all three datasets, we used the same data preprocessing and splits as other works [50, 99]. Figure 3.2: Plot of statistical parity SP vs accuracy Acc. Our results are shown as red crosses. In the plots, Adv. Forget is for [63]; CVIB is for [99]; FCRL is for [50]; MIFR is for [133] and MaxEnt-ARL is for [121]. A more ecient method achieves lower statistical parity (SP ) for the same value of accuracy (Acc), or higher accuracy for the same value of SP . In the experiments, an encoderf is built using a full connected neural network with one hidden layer and 64 neurons. The predictor g has one hidden layer and 50 neurons. The latent representationu is set to 64 dimensions. In this paper, all experiments are performed with RBF kernel functions. We make sure every batch has balanced observations of dierent 38 sensitive subgroups by using a balanced sampler 2 . To compare with previous works, we trained another neural net (three hidden layers, 64 neurons in each) to predict the sensitive attributez from the latent representationu. The accuracy of predictingz fromu is referred as adversarial loss. In the ideal case, the adversarial loss should be equal to the majority class ratio of the binary sensitive attribute z. Methods producing adversarial loss closer to the majority class therefore reveal less information about the sensitive attribute. The results are shown in Table 3.1. Our method achieves state-of-the-art predictive accuracy with minimal change in adversarial loss compared to the majority class. While two methods show lower adversarial loss, their values are further from the majority class baseline, suggesting that the predictions made by these methods leak more information about the sensitive feature than our method. We do not show the Health dataset in this table because the other methods do not compare against it. We further compare our method to several other fair AI methods using the a measure of fairness known as statistical parity. (Please note that this measure is referred to as DP by Umang et al. [50].) For binary labels and a binary sensitive attribute, statistical parity is dened as the dierence of the positive class ratio among the two sensitive groups. Namely, SP =jP (^ y = 1jz = 1)P (^ y = 1jz = 0)j: (3.11) We change the hyperparameter, used in Eq. 3.8, and train the model to record its accuracy and statistical parity in order to understand how eciently it can trade o between fairness and accuracy. 
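Statistical parity as defined in Eq. (3.11) is straightforward to compute from binarized predictions; a small helper (names illustrative) is sketched below, and sweeping the penalty weight while recording (accuracy, SP) pairs traces the trade-off curves of Fig. 3.2.

    import numpy as np

    def statistical_parity(y_hat, z):
        # Eq. (3.11): |P(y_hat = 1 | z = 1) - P(y_hat = 1 | z = 0)|
        return abs(y_hat[z == 1].mean() - y_hat[z == 0].mean())

    # Illustrative sweep over the penalty weight:
    # for lam in (0.0, 0.01, 0.1, 0.5, 1.0):
    #     model = train_with_penalty(lam)          # hypothetical training routine
    #     record(accuracy(model), statistical_parity(model.predict(X_test), z_test))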
Similar to [50], we trained a separate predictor to predict the label y from the latent representations we created. The separate predictor is also identical to the one used by [50]: a multi-layer perceptron (MLP) with one hidden layer of 50 neurons. We compare the performance to that achieved by previous methods on two datasets, Adult and Health. Since other methods did not compare against the German dataset, we do not show results for it. We plot statistical parity versus the accuracy of the predicted label y; the results are shown in Fig. 3.2. (The balanced batch sampler we use is available at https://github.com/galatolofederico/pytorch-balanced-batch.) For both datasets, our method achieves similar or better performance than state-of-the-art methods. Our method is also faster, requiring only 4 epochs of training, while FCRL, the best performing baseline, needs 200 epochs of training without pre-training.

                      German                  Adult
    Method        Adv. Loss  Pred Acc.   Adv. Loss  Pred Acc.
    Maj. Class      0.799      0.701       0.674      0.754
    VFAE [93]       0.717      0.720       0.882      0.842
    CIAFL [149]     0.811      0.695       0.888      0.831
    CVIB [99]       0.698      0.710       0.776      0.842
    Ours            0.804      0.747       0.767      0.848

Table 3.1: Comparison of our method with previous works on the German and Adult datasets. The majority class ratio is used as the baseline. Adv. Loss stands for adversarial loss, namely the accuracy of predicting z from u; Pred Acc. stands for the accuracy of predicting y from x via the representation u. Higher Pred Acc. is better, and a desirable Adv. Loss is close to the majority class ratio.

Figure 3.3: Images generated by style transformation. The left figure shows results for the MNIST dataset and the right figure shows results for the Chairs dataset. For both datasets, the leftmost column shows the input images and the remaining columns show the generated images: a different digit in the same writing style (left) or the same chair from a different angle of view (right).

3.4.2 Invariant Representation and Style Transformation

We also tested the proposed method on unsupervised learning tasks using two benchmark image datasets, MNIST [84] and Chairs [11]. The MNIST dataset contains 70K grayscale images of handwritten digits; each image is 28 by 28 pixels. We use the digit's label (0-9) as the sensitive attribute or "style" z. The goal of the learning task is to learn the writing style, or font, of a digit, regardless of its label. In other words, the task is to learn a representation that is invariant to the digit label, similar to the setup used by [99] and GAN-based methods [162]. We use 60K images for training and 10K for testing. For the Chairs dataset, we selected around 5K images of chairs seen from four different viewing angles (45, 135, 225 and 315 degrees) from the full dataset. There are 4,457 images in the training set and 1,115 images in the testing set. The sensitive attribute z represents the four viewing angles. For the MNIST dataset, we use a fully connected neural network with three hidden layers for the encoder f and a fully connected neural network with two hidden layers for the decoder g; we set the dimensions of both u and v to 64. For the Chairs dataset, we use three CNN layers followed by one fully connected layer for the encoder f and, correspondingly, one fully connected layer followed by three CNN layers for the decoder g. The latent representation u has 1,536 dimensions, and the embedding of z, v = h(z), has 512 dimensions. More detail on our experimental setup is available in the Supplementary Materials.
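A minimal PyTorch sketch of the conditional autoencoder used for this task is given below, following Eqs. (3.9)-(3.10) and the MNIST layer sizes described above. The variable names, the use of nn.Embedding for h, and the penalty weight value are assumptions, and mmd_penalty refers to the earlier sketch.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(784, 400), nn.ReLU(), nn.Linear(400, 200), nn.ReLU(),
                      nn.Linear(200, 100), nn.ReLU(), nn.Linear(100, 64))   # encoder: x -> u
    h = nn.Embedding(10, 64)                                                 # style embedding: z -> v
    g = nn.Sequential(nn.Linear(128, 200), nn.ReLU(), nn.Linear(200, 400), nn.ReLU(),
                      nn.Linear(400, 784), nn.Sigmoid())                     # decoder: (u, v) -> x_hat
    lam = 0.04                                                               # penalty weight (illustrative)

    def autoencoder_loss(x, z):
        u = f(x)
        x_hat = g(torch.cat([u, h(z)], dim=1))
        return nn.functional.mse_loss(x_hat, x) + lam * mmd_penalty(u, z)    # Eq. (3.10)

    def style_transfer(x, z_target):
        # Eq. (3.9) with z replaced by a target style z_t: keep the content code u = f(x),
        # swap in the embedding of the target label, and decode.
        return g(torch.cat([f(x), h(z_target)], dim=1))

Because the penalty pushes u toward independence from z, the decoder is forced to recover all label-specific information from v, which is what makes the style swap possible.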
We rst train the unsupervised version of our model on the training dataset and tested the performance on the testing dataset. To generate new images, we feed the model with a test image x and change the corresponding input z to dierent values z t . The output will then be transformed into other \styles". To be specic, since z is digit label for MNIST dataset, the output image will be a digit with same wrting features as input but of another digit. For Chairs dataset, the input image of a certain chair will be transformed into the image of the same chair, in dierent view angles. 41 The generated images are shown in Fig. 3.3. We can see that for the MNIST dataset, the model learns the writing style of the input digit image (shown in the leftmost column) and generates images in the same writing style for the other digits (remaining columns). The results are comparable to [99]. For the Chairs dataset, the model learns the shape of a chair and generates the image of the same chair from other view angles. We see the trained model is able to correctly generate images in the corresponding styles. This implies the encoder encodes data unrelated to the style z, just as we expected. We also trained separate classiers (both logistic regression and MLP) to predict the sensitive variable z from the representation u. The detailed results are summarized in Ta- ble 3.2. When = 0, the prediction accuracy is roughly 0.90, but when is properly tuned, the accuracy is signicantly lower. As we expected, we also observed a dramatic drop of penalty term d u 2 . Despite this, when is properly tuned, we can predict z with accuracy that is higher than majority class ratio, suggesting further work is needed to further improve fairness. We also notice that when6= 0, MLP gives higher accuracy. This may be because the RBF kernel function we use puts higher weights on low degree polynomial terms in the feature transformation. To further validate the results, we show the t-SNE plots of u for models with = 0 (no invariance) and set to properly tuned values, in Fig. 3.4. We see that for = 0, dierent sensitive groups form distinct clusters but when properly tuned, all sensitive groups are uniformly mixed, showing that the learned invariant representation lacks the information about the sensitive feature. As mentioned in the methodology section, calculating the full kernel matrix is of O(n 2 ) time complexity. To accelerate the process, we tested calculating kernel matrix and d u 2 for only a subset of data in a mini-batch. With the chairs dataset, we x the size of a mini- batch at 512 and vary the size of data sampled for kernel matrix calculation. The results are shown in Table 3.3. We see that sampling only 10% to 20% of the data for kernel matrix calculation results in similar reconstruction loss to full kernel matrix but even lower fairness 42 MNIST, = 0 MNIST, = 0:04 Chairs, = 0 Chairs, = 0:2 Figure 3.4: t-SNE visualization of latent representationu. Dierent sensitive groups (values of z) are shown using dierent colors. We set = 0 to ignore the invariant constraints. For = 0, we see clusters of dierent colors (values of z) but when is properly tuned, points with dierent colors are mixed and indistinguishable. Dataset n p L rec d u 2 Acc LG Acc MLP MINST 10 0 4:77 10 3 6:22 10 1 0.891 0.952 MNIST 10 0.04 9:65 10 3 8:62 10 4 0.250 0.652 Chairs 4 0 2:24 10 2 6:47 10 2 0.930 0.903 Chairs 4 0.2 3:27 10 2 7:02 10 3 0.455 0.647 Table 3.2: Summary of results for unsupervised learning experiments. 
n p is the number of protected classes. is the weight of penalty termd u 2 . L rec stands for the reconstruction loss (MSE) of images. Acc LG is the accuracy of predicting z from representation using logistic regression and Acc MLP is the accuracy using MLP. lossd u 2 . We didn't experiment on very large batch sizes and we reported our results for the chairs dataset with the optimal batch size of 128, namely 32 observations for each of the four sensitive subgroups. Sampled Fraction L rec d u 2 32 0.063 3:82 10 2 4:60 10 3 64 0.125 3:73 10 2 3:75 10 3 128 0.250 3:62 10 2 5:02 10 3 256 0.500 3:76 10 2 6:87 10 3 512 1.000 3:55 10 2 1:19 10 2 Table 3.3: Results of using a fraction of data for kernel matrix calculation. The batch size equals to 512. The results suggest sampling 10% to 20% of data should be optimal. 43 3.5 Discussion We proposed a new method which creates invariant representations by minimizing distance between sensitive groups in RKHS. We nd that the proposed penalty term is a simplied, faster, modication of HSIC. The method does not rely on any approximation of mutual information nor adversarial training, allowing for faster computing times and no mode col- lapse in style transformation. In supervised learning experiments, our method can achieve similar or better performance compared to previous methods. When applied to unsupervised representation learning, our method correctly casts images to the given styles, alike to [60]. We further showed how the method can be accelerated in practice by using only a subset of data to calculate the kernel matrix. Despite these advantages, our ndings show more work is needed in the future. First, other state-of-the-art methods can outperform our method for some range of our hyper- parameter . We will therefore explore more kernel functions, which could better address the trade-o between prediction accuracy and fairness. Furthermore, it will be important to study how MMD can be combined with adversarial methods to accelerate training and improve performance. 3.6 Supplementary Materials Here we provide technical details to help reproduce our results. All the code, raw and pre-processed data will be released upon acceptance. We have implemented our model from scratch using PyTorch. For all the experiments, we used Adam optimizer with default setting. For both supervised and unsupervised learn- ing experiments, we have included L 2 regularization terms in our loss function. The L 2 regularization term includes all the model weights for supervised learning models. But for unsupervised learning models, the weights of embedder h() are excluded. In the following discussion, we use as the strength of the L 2 regularization term. We use to refer the 44 strength of fairness penalty term d u 2 . In our actual experiment, the full loss function for a supervised learning model is ` = (1)H(^ y;y) +d u 2 + (1)kwk 2 : (3.12) HereH is the cross entropy loss function. For unsupervised learning models, the actual loss function used is ` = (1)L 2 (^ x;x) +d u 2 +kwk 2 : (3.13) As mentioned in the result section, we use RBF kernels in the loss term d u 2 . The RBF kernel function is dened as k(u; u 0 ) = exp ku u 0 k 2 : (3.14) The hyper parameter is referred as the bandwidth parameter. For unsupervised learning experiments, to accelerate the training, we used a simple adap- tive setting of hyper parameter . (i+1) = k (d u 2 ) (i) L 2 (^ x;x) (i) (3.15) We rst set the initial value of. 
Then the value of this hyperparameter is automatically adjusted at the end of every epoch using the formula above; the index i denotes the epoch and k is a constant. For the supervised learning experiments on the Adult and Health datasets, the encoder is a fully connected neural network with 1 hidden layer of 64 neurons, and the predictor is a fully connected neural network with 1 hidden layer of 50 neurons. The latent representation has 64 dimensions. The discriminator used to test the embedding has the same structure as the predictor, identical to [50]. For the German dataset, we used the same encoder but the predictor is simply a logistic regression. To compare with [99], the discriminator used to predict z is a fully connected neural network with 3 hidden layers of 64 neurons each. We always use ReLU as the activation function for hidden layers. For the unsupervised learning experiments on the MNIST dataset, the encoder is a three-layer fully connected neural network with 400, 200 and 100 neurons in its layers, respectively. The decoder is a two-layer fully connected neural network with 200 and 400 neurons in its layers, and the output layer uses a sigmoid activation function. For the Chairs dataset, the images are first downscaled using three convolutional layers; each layer has 32 filters of size 3x3, and after each layer we perform max pooling over 2x2 squares as well as batch normalization. The processed images are then sent to a single (no hidden layer) fully connected layer to produce the 1,536-dimensional latent representation. In the decoding phase, the 1,536-dimensional latent representation is first joined with the 512-dimensional embedding of z, and the resulting 2,048-dimensional representation passes through one fully connected layer with output dimension 2,048. Then there are 3 upscaling convolutional layers, each with 32 filters of size 3x3; every layer upsamples the images by a factor of 2, and we also perform batch normalization in the decoding phase. As for running time, the CNN model used for the Chairs dataset (which has the most parameters among the models we built) took approximately 90 seconds to train per epoch. We used only one core for all of our experiments. The full hyperparameter table is shown below in Tab. 3.4.

               L2 weight   fairness weight    k     kernel bandwidth   batch size   epochs
    German       10^-4         0.003         N/A         10.0             2^8          6
    Adult        10^-2         vary          N/A          3.0             2^8          4
    Health       10^-4         vary          N/A          0.1             2^8          4
    MNIST        10^-4         0.04          6.0          0.01           10 x 32      50
    Chairs       10^-4         0.20          6.0          0.001           4 x 32      60

Table 3.4: Summary of hyperparameters

Chapter 4

Change Detection via Confusion

4.1 Chapter Introduction

Change is ubiquitous in the natural world. A sudden change between states of matter as temperature increases is a hallmark of phase transitions [138]. In living systems, a growing cell changes via transitions between stages of development. Social systems can also change sharply, for example when the legal drinking age [128] or minimum wage [23] is raised, or when a website's user interface changes [105]. This allows social scientists to infer the effect of these changes through causal methods [139, 19]. Detecting changes also has enormous practical benefits. For example, a robot may need to automatically detect when its environment changes [103] and take appropriate actions, and information providers need to be alert to novel events in social media, such as spam or a trending topic [87]. Changepoint detection is challenging because, in high-dimensional data, the forms of change can be very diverse.
Generally, before and after the change, the observations follow dierent distributions and these distributions can be dicult to parameterized. Additionally observations can be noisy, a changepoint detection method with practical value should be noise robust. A growing body of research has proposed methods for automated changepoint detection, from simple cumulative sum [109, 111] to more sophisticated Markov model approaches [15, 116], and more recently Bayesian approaches [148]. Many of the existing methods, 47 however, are specialized to a type of data, e.g., time series, or tuned to a problem domain. State-of-the-art Bayesian approaches usually assume a particular set of distributions for data, which can be a limitation in general settings. Moreover, while these methods will identify where the change occurs, many are not able to quantify error of the estimate or their condence in the change. Despite the strengths and successes of existing changepoint detection methods, there is a critical need for an accurate, general purpose changepoint detection method that is noise robust and can be applied to various forms of high dimensional data (e.g. video, audio and EKG sensor signals). Our contribution We introduce Meta Changepoint Detection(MtChD), a method to detect when a change occurs in high dimensional data. The method attempts to confuse a trained model by labeling the same state (before or after the change) with two labels [138]. Specically, for a set of trial changepoints, we label the data before and after each trial changepoint as belonging to class 0 and 1, respectively, and train a classier to predict the labels using features of data. Next, we t a mathematical model of classication accuracy as a function of trial changepoints to infer the actual changepoint. The method extends a previous confusion- based training framework [138] via a novel classier independent mathematical model of classication accuracy to more precisely infer both the changepoint and the fraction of data aected by change. This theoretic extension provides condence in the changepoint inference: we trust the changepoint more if a large fraction of data is aected. We apply MtChD to a range of data, both synthetic and real-world, to demonstrate that it has low bias under a wide range of conditions and accurately detects changes in noisy high dimensional data, including images. The method outperforms state-of-the-art changepoint detection methods, even on sparse, noisy real-world data with many missing values. We show that our method accurately infers the time of COVID-19 lockdowns from air pollution data, robustly identies website policy changes in an online learning platform 48 from student performance data, and nds the salient features for student grade change. Due to its exibility, accuracy and robustness, the proposed method signicantly advances the state-of-the-art in changepoint detection, thereby opening new opportunities for data-driven discovery. The rest of the paper is organized as follows. First, we review some of the relevant previous work on changepoint detection. Next, we present details of our confusion-based training method and derive the mathematical model of accuracy. Finally, we present results for several datasets and discuss implications. 4.2 Related Works Change detection is a classical research topic. In this work, we consider a one-dimensional indicating variable t. As mentioned in the introduction, data observed are (X i ;t i ). 
We want to nd a point t 0 such that before and after t 0 , data follows dierent distributions. In statistical physics, this is referred as a phase transition. For example, if we observe the neutron scattering signals for water as X i and temperature as t i , since the melting point of pure water is around T = 273K, we will see that the properties of X i changes dramatically near t 0 = 273K. (This is just a simple example, actually, the properties of water is much more complicated, especially at low temperature. [57]). Look back on the previous works, early trials of change detection in time series can be traced back to 1950. A gold standard for online change detection in univariate time series is CUSUM [110]. CUSUM assumes the data follows a normal distribution with known parameters and only detects shift changes (Namely, change in the mean). A major improvment over CUSUM is the development of a bunch of methods [131, 146, 12, 147] based on general likelihood ratio (GLR) test. GLR test seeks to reject a null assumption which states that observations before and after a proposed changepointt 0 follow the same distribution. An advantage of GLR is that it can be applied to multivariate series. On the other hand, the form of likelihood function must be specied 49 (usually multivariate normal distribution is used) but the parameters can be inferred from the data. The central idea of GLR, parametric likelihood functions, can be reformulated as cost functions. The change detection is then expressed as to detect whether the tting loss for the entire series is dierent from sum of piece-wise tting losses for series before and after the proposed change point. As an example, the multivariate normal likelihood function, which is often used in the GLR test, can be formulated as the gaussian process likelihood function. From the perspective of cost functions, kernels can be introduced to describe distributions which are dicult to parameterize. Furthermore, with the help of advanced search algorithms [136, 119, 46, 74, 75] the change detection based on cost functions can be generalized to detect multiple change points rather than binary change case. Recently, the collections of cost functions and search algorithms are available as a python library called ruptures [136]. Alternatively, change detection can be formulated as a state transition in hidden Markov model (HMM) [116]. There are also Bayesian change detection methods [1, 148, 103, 151]. Apart from the cost function based change detection mentioned above, there are also method using penalized quasi-likelihood [13] and kernel method which detects multiple change points [7]. Unsupervised Change Analysis is a method most closely aligned to ours [59] in that is uses a similar labeling method. But the paper focuses on explaining changes and not on accurately nding the change point. Dierent from previous methods, our method is based on [138]. We rst propose several trial change points, then for every change point we label the data before and after it as 0 and 1. We then train an arbitrary classier to predict the labels. The intuition is the same as [138] { when the trial change point is close to the actual change point, we will have a high accuracy in predictions. But for real world data collected in experiments, we do not have guarantee that there are similar amount of data before and after the change. Furthermore, the density of data of indicatort is not uniform. A naive peak picking as [138] will then lead 50 to bias of predictions. 
There is also another problem. In real world, a change can happen gradually and may only aect a portion of the population. To improve over [138], we proposed a model for the accuracy of prediction mentioned above. The model only depends on the empirical distribution oft i , the true change point and the fraction of data aected by the change. We then t the model to the accuracy curve and our results show that we can accurately locate the change point in real world data despite the problems mentioned above. Our proposed end to end method, Meta Change Detection (MtChD) can identify both changes with complicated structures and subtle changes only aect a small fraction of data. In this dissertation, we include a comprehensive comparison with the previous methods and show that our method is accurate, robust and ecient. Our method is a meta method and it can work with arbitrary classiers. After validation of our proposed method, MtChD, we rst expand it to detect multiple changes using binary segmentation. And then we applied to to study the change of dis- cussions on a popular social media, Twitter. Changes in these patterns suggest important events, which have been detected with generative models, such as dynamic topic models [18]. It has also been shown that there is a political polarization in the Twitter discussions [66]. We studied the tweets in the early state of COVID-19 outbreak [26]. There has been previous work on emerging rumor detection [3]. Our work, instead, reveals topic changes using subtle variations in the tweet text data. Our method is unsupervised and is able to handle various types of data. We compared the changes we found in tweets and it matches well with new events. 51 4.3 Methodology 4.3.1 Problem statement Assume we have a set of data of the form (X i ;t);i = 1;:::;n. Here X i is an arbitrarily high dimensional vector and t is an external control parameter such as time. We refer to t as the indicator and look for a changepoint as t is varied. Assume there is a change at point t 0 such that the data before the change, S 0 , and after the change, S 1 , have dierent distributions. Mathematically, we can say that t t 0 ;X i f 0 and for t > t 0 ;X i f 1 , wheref 0 andf 1 are two probability distributions (or patterns). We can make an analogy to phase transitions in physical systems where often temperature varies and the properties of a physical system, X, changes after a critical point [138]. In many datasets, however, only a fraction of the observations, X, may show observable changes, and the change may not be a sharp transition (this is known as a \fuzzy" regression discontinuity in causal literature [20]). To account for this, we introduce a third probability distribution f u to describe the fraction of observations, 1, that do not show observable change after a discontinuity. Our goal is to infer the changepoint, t 0 , and the fraction of data that undergoes a change, , given the observations (X i ;t). To achieve this goal, we create a meta method with two steps. First, we train a model and record its accuracy as a function of the trial changepoint in indicator t. We then compare this accuracy to a base accuracy in which no changepoint occurs. The dierence between these two accuracies is t to a mathematical model with parameters t 0 and . 4.3.2 Confusion-based training Similar to [138], we create a confusion-based training setup which can be used to infer the changepoint. First, we choose a featuret as an indicator variable. 
Though usually t is time, it can be any feature with ordered values. We assume a trial value t_a is a changepoint and label the observed data before t_a as belonging to class 0 (no change) and the data after t_a as class 1 (change):

    \tilde{y}_i = \begin{cases} 0, & t \le t_a \\ 1, & t > t_a \end{cases}    (4.1)

We then train a supervised classification model to predict the labels \tilde{y}_i from the features X_i. We calculate the accuracy of the trained model as a function of the trial changepoint t_a, for t_a in the observed range [t_min, t_max] of t. If a true changepoint exists in the observed range of t, the accuracy vs. t_a curve will deviate significantly from the base accuracy, which is the majority ratio of the labels \tilde{y}_i. The shape of the curve is affected both by the actual changepoint t_0 and by the fraction of data points affected by the change. Importantly, our method is a meta method: any classifier, from random forests to neural networks, can be used for classification. In this paper, we use a multilayer perceptron classifier (fully connected neural network), a convolutional neural network (CNN), and a random forest classifier to demonstrate that our method works with arbitrary classifiers suitable for the data; details are available in the appendix. Model training uses 50% of the data, while 30% is used for validation and 20% for testing. Models are trained multiple times using random splits of training and validation data; the testing set is held out and only used to measure the accuracy of the learned models. We loop over n_a trial changepoints and label the data according to Eq. 4.1. For every t_a, we train the model and calculate its accuracy as a function of t_a. We repeat this for every random training/validation split to obtain confidence intervals. We show the algorithm below.

Function ConfusionTraining
Input: TrainData, TestData, t_a, n_trial
    AccValidAll <- []
    AccTestAll <- []
    for i = 1 to n_trial do
        TrainSet, ValidSet <- RandomSplit(TrainData)
        TestSet <- TestData
        AccValid <- []
        AccTest <- []
        for j = 1 to n_a do
            TrainX, TrainT <- TrainSet
            ValidX, ValidT <- ValidSet
            TestX, TestT <- TestSet
            TrainLabel <- GetLabel(TrainT, t_a[j])
            ValidLabel <- GetLabel(ValidT, t_a[j])
            TestLabel <- GetLabel(TestT, t_a[j])
            Model <- Train(TrainX, TrainLabel)
            AccValid[j] <- Evaluate(Model, ValidX, ValidLabel)
            AccTest[j] <- Evaluate(Model, TestX, TestLabel)
        end
        AccValidAll[i] <- AccValid
        AccTestAll[i] <- AccTest
    end
    return AccValidAll, AccTestAll

4.3.3 Inside the black box: modeling accuracy

Our goal is to infer the changepoint t_0 and the fraction of observations affected by the change, \alpha, from the accuracy vs. t_a curve of a "black box" model. We derive a theory to describe the behavior of a "black box" model trained on the labeled dataset (X_i, \tilde{y}_i). For simplicity, we assume that the model we train is able to assign X_i to one of the sets S_0, S_1, and S_u. To maximize accuracy, the model should then use the majority class of \tilde{y}_i in the corresponding set. We assume the following conditional probabilities hold:

    P(X_i \in S_u | t) = 1 - \alpha
    P(X_i \in S_0 | t) = \alpha (1 - \sigma(t - t_0))
    P(X_i \in S_1 | t) = \alpha \sigma(t - t_0)    (4.2)

Here \sigma(\cdot) should satisfy \sigma(t - t_0) > 0.5 for t > t_0 and \sigma(t - t_0) < 0.5 for t < t_0. For simplicity, we use step functions in our study, but we have also explored sigmoid functions. The advantage of sigmoids is that they allow changes to happen slowly, so we can also model fuzzy regression discontinuities where a change is not sudden [20]. We denote the marginal distribution of t as f_t and its cumulative distribution function as F_t.
4.3.3 Inside the black box: modeling accuracy

Our goal is to infer the changepoint t_0 and the fraction of observations affected by the change, α, from the accuracy vs. t_a curve of a "black box" model. We derive a theory to describe the behavior of a "black box" model trained on the labeled dataset (X_i, ỹ_i). For simplicity, we assume that the model we train is able to assign X_i to one of the sets S_0, S_1, and S_u. To maximize accuracy, the model should use the majority class of ỹ_i in the corresponding set. We assume the following conditional probabilities hold:

    P(X_i ∈ S_u | t) = 1 - α
    P(X_i ∈ S_0 | t) = α (1 - Φ(t - t_0))
    P(X_i ∈ S_1 | t) = α Φ(t - t_0)    (4.2)

Here Φ(·) should have values Φ(t - t_0) > 0.5 for t > t_0 and Φ(t - t_0) < 0.5 for t < t_0. For simplicity, we use step functions in our study, but we have also explored sigmoid functions. The advantage of sigmoids is that they allow changes to happen slowly, so we can also model fuzzy regression discontinuities where a change is not sudden [20]. We denote the marginal distribution function of t as f_t and the cumulative distribution function of t as F_t. We can then calculate the following probabilities:

    P(X ∈ S_u ∩ ỹ = 0) = P(X ∈ S_u ∩ t <= t_a)
                       = ∫_{t_min}^{t_a} P(X ∈ S_u | t') f_t(t') dt'
                       = ∫_{t_min}^{t_a} (1 - α) f_t(t') dt' = (1 - α) F_t(t_a)    (4.3)

    P(X ∈ S_1 ∩ ỹ = 0) = P(X ∈ S_1 ∩ t <= t_a)
                       = ∫_{t_min}^{t_a} P(X_i ∈ S_1 | t') f_t(t') dt'
                       = ∫_{t_min}^{t_a} α Φ(t' - t_0) f_t(t') dt'    (4.4)

    P(X ∈ S_0 ∩ ỹ = 0) = P(X ∈ S_0 ∩ t <= t_a) = α F_t(t_a) - P(X ∈ S_1 ∩ t <= t_a)    (4.5)

And we have

    P(X ∈ S_u) = 1 - α    (4.6)
    P(X ∈ S_1) = P(X ∈ S_1 ∩ t <= t_max) = ∫_{t_min}^{t_max} α Φ(t' - t_0) f_t(t') dt'    (4.7)
    P(X ∈ S_0) = α - P(X ∈ S_1)    (4.8)

Here, for simplicity, we introduce the following symbols:

    P_{u,0} = P(X ∈ S_u ∩ ỹ = 0),   P_{0,0} = P(X ∈ S_0 ∩ ỹ = 0),   P_{1,0} = P(X ∈ S_1 ∩ ỹ = 0),
    P_u = P(X ∈ S_u),   P_0 = P(X ∈ S_0),   P_1 = P(X ∈ S_1).    (4.9)

We can calculate the majority-class ratios of the sets S_u, S_0 and S_1 as

    Maj_u = max{ P_{u,0} / P_u , 1 - P_{u,0} / P_u }    (4.10)
    Maj_0 = max{ P_{0,0} / P_0 , 1 - P_{0,0} / P_0 }    (4.11)
    Maj_1 = max{ P_{1,0} / P_1 , 1 - P_{1,0} / P_1 }    (4.12)

Then the accuracy when using a trial changepoint t_a is the weighted average of Maj_u, Maj_0 and Maj_1 with weights P_u, P_0 and P_1:

    Acc(t_a) = P_u Maj_u + P_0 Maj_0 + P_1 Maj_1
             = max{P_{u,0}, P_u - P_{u,0}} + max{P_{0,0}, P_0 - P_{0,0}} + max{P_{1,0}, P_1 - P_{1,0}}    (4.13)

The accuracy vs. t_a curve can therefore be determined from the coefficients α and t_0 and the marginal distribution of t, f_t. To estimate the marginal distribution of t, we first calculate the empirical cumulative distribution function and then fit it using a third-order spline; the density function is then calculated as the first derivative of the spline. After the training, we fit the accuracy model in Eq. 4.13 to the accuracy measured on the testing dataset using numerical optimization of the following loss function:

    l = (1/n_a) Σ_i [ Acc_train(t_a^(i)) - Acc_model(t_a^(i)) ]²    (4.14)

Here t_a^(i) are the trial changepoints and n_a is the number of trial changepoints. For an illustration of a fitted accuracy profile, see Fig. 4.3 and Fig. 4.4. A minimal numerical sketch of this accuracy model and the fitting step is given below.
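The following is a minimal numerical sketch, assuming a step-function Φ, of the accuracy model in Eqs. 4.3-4.13 and the least-squares fit of (t_0, α) in Eq. 4.14. The function names (model_accuracy, fit_changepoint) and details such as the optimizer are illustrative choices on our part, not the released implementation.

import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import minimize

def model_accuracy(t_a, t0, alpha, F_t):
    # Theoretical accuracy at trial changepoint t_a, with a step-function Phi.
    Fa, F0 = float(F_t(t_a)), float(F_t(t0))
    P_u0 = (1 - alpha) * Fa                      # Eq. 4.3
    P_10 = alpha * max(Fa - F0, 0.0)             # Eq. 4.4 (step function)
    P_00 = alpha * Fa - P_10                     # Eq. 4.5
    P_u = 1 - alpha                              # Eq. 4.6
    P_1 = alpha * max(1.0 - F0, 0.0)             # Eq. 4.7, using F_t(t_max) = 1
    P_0 = alpha - P_1                            # Eq. 4.8
    return (max(P_u0, P_u - P_u0)
            + max(P_00, P_0 - P_00)
            + max(P_10, P_1 - P_10))             # Eq. 4.13

def fit_changepoint(trial_points, acc_measured, t_values):
    # Empirical CDF of the indicator, smoothed with a cubic spline as in the text.
    ts, counts = np.unique(t_values, return_counts=True)
    ecdf = np.cumsum(counts) / counts.sum()
    F_t = UnivariateSpline(ts, ecdf, k=3)
    def loss(params):                            # Eq. 4.14
        t0, alpha = params[0], float(np.clip(params[1], 0.0, 1.0))
        pred = np.array([model_accuracy(ta, t0, alpha, F_t) for ta in trial_points])
        return np.mean((np.asarray(acc_measured) - pred) ** 2)
    res = minimize(loss, x0=[np.median(t_values), 0.5], method="Nelder-Mead")
    t0_hat = res.x[0]
    alpha_hat = float(np.clip(res.x[1], 0.0, 1.0))
    return t0_hat, alpha_hat

In practice, acc_measured would be the mean test-accuracy curve returned by the scan above; repeating the fit over the per-trial curves gives bootstrap error bounds on t_0 and α.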
4.3.4 Quantifying error and complexity

Fitting the model to the results of repeated training runs allows us to derive error bounds for t_0 and α via bootstrapping. When used with supervised learners that support mini-batches (such as neural networks and gradient boosting), the computational complexity of training scales linearly with data size. The fitting step, which infers the parameters α and t_0, runs in constant time.

4.3.5 State-of-the-art

We compare the proposed method to state-of-the-art methods, which can be divided into two groups: optimal segmentation algorithms and Bayesian changepoint detection. The optimal segmentation algorithms we compare to include dynamic programming (DP) [119], binary segmentation [46], bottom-up methods [74], and window-based methods [136], using L_1, L_2, normal-distribution and RBF-kernel loss functions as implemented in the Python package ruptures [136]. Note that, as mentioned earlier, the optimal segmentation methods are equivalent to the GLR test; to compare to the GLR test, we therefore used DP segmentation with the normal loss function, which is equivalent to a GLR test that assumes a multivariate normal distribution of the data. For Bayesian changepoint detection, a conditional prior on the changepoint and a likelihood function need to be defined. We used uniform and geometric distributions as priors, and applied the Gaussian, individual feature model [151], and full covariance model [151] likelihood functions. We used the Python implementation of Bayesian changepoint detection available at https://github.com/hildensia/bayesian_changepoint_detection.

4.4 Results

We demonstrate the accuracy and utility of the proposed method on data from a variety of domains. We first apply the method to synthetic data to test its performance and robustness with respect to noise, and then apply it to real-world data to discover real changepoints.

4.4.1 Synthetic Data

4.4.1.1 Synthetic "Chessboard" Data

In this experiment, we generate two-dimensional data in a chessboard pattern, with two features x_1 and x_2, each in the range [0, 1], as shown in Fig. 4.1. Assume the change takes place at t_0. Before the change (indicator t <= t_0), data points with values (x_1, x_2) lie in the blue cells of an n_c × n_c chessboard; after the change (t > t_0), they move to the orange cells. Mathematically, for an n_c × n_c chessboard, the generated data satisfy

    (⌊n_c x_1⌋ + ⌊n_c x_2⌋) mod 2 = 1(t > t_0).    (4.15)

In Fig. 4.1, for t <= t_0 a data point (t, x_1, x_2) has values (x_1, x_2) in a blue cell, and for t > t_0 in an orange cell. For the population distribution, we set t ~ U(0, 1). A short generation sketch is given at the end of this subsection.

For the first part of this experiment, we set t_0 = 0.5 and the data size to N = 8K, and use different arrangements of the chessboard, n_c ∈ {2, 4, 6, 8, 10}. For higher n_c, the data contain information at higher spatial frequency and there are fewer data points in each cell, making it more difficult to infer the true changepoint. For the second part of this experiment, we fix n_c = 6 and vary t_0 over {0.2, 0.4, 0.6, 0.8}. Since we keep the population distribution t ~ U(0, 1), if t_0 is away from 0.5 the populations before and after the change are unbalanced, which also makes the task of inferring t_0 more challenging.

Figure 4.1: Illustrations of synthetic data, where observations have two features x_1 and x_2. Blue dots represent data points with t <= t_0 and orange dots those with t > t_0. (a) n_c = 2; (b) n_c = 6; (c) n_c = 10. For fixed data size N, as n_c increases, the number of data points in each cell decreases and the spatial frequency of the data increases; these factors make it more difficult for a classifier to find the decision boundary.

We ran six trials for our method and the competing algorithms. For our proposed method, 50% of the data is used for training, 30% for validation, and 20% for testing, where training and validation data are randomly resampled in each trial. For the optimal segmentation methods, we randomly sample 70% of the data in each trial. Due to computational limitations, we sample only 18.8% of the data (around 1.5K points) for Bayesian changepoint detection. The results are shown in Table 4.1. We see that for n_c = 2, 4 the optimal segmentation methods (DP+RBF and BinSeg+RBF) perform as well as ours, but for n_c >= 6 our method outperforms the competing methods. Of the two classifiers used with our method, the random forest performs best; our method therefore works well even on data with complicated structure.
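Below is an illustrative generator for the chessboard data of Eq. 4.15. The function name make_chessboard and the rejection-sampling strategy are our own choices for illustration; the commented usage lines refer to the sketches given earlier, not to the released code.

import numpy as np

def make_chessboard(n=8000, n_c=6, t0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=n)           # indicator t ~ U(0, 1)
    X = np.empty((n, 2))
    for i in range(n):
        target = int(t[i] > t0)                  # 1 -> orange cells, 0 -> blue cells
        while True:                              # rejection-sample until cell parity matches Eq. 4.15
            x1, x2 = rng.uniform(0.0, 1.0, size=2)
            if (int(n_c * x1) + int(n_c * x2)) % 2 == target:
                X[i] = (x1, x2)
                break
    return X, t

X, t = make_chessboard()
trial_points = np.quantile(t, np.linspace(0.05, 0.95, 19))
# acc_valid, acc_test = confusion_scan(X, t, trial_points)                    # earlier sketch
# t0_hat, alpha_hat = fit_changepoint(trial_points, acc_test.mean(axis=0), t) # earlier sketch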
Table 4.1 columns (left to right): 2×2 t_0=0.5; 4×4 t_0=0.5; 6×6 t_0=0.5; 8×8 t_0=0.5; 10×10 t_0=0.5; 6×6 t_0=0.2; 6×6 t_0=0.4; 6×6 t_0=0.6; 6×6 t_0=0.8. Each entry is μ(t_0) ± σ(t_0); for the two MtChD rows the inferred α, μ(α) ± σ(α), is given in parentheses.

MtChD (RF): 0.5002±0.0025 (0.9494±0.0077) | 0.4983±0.0017 (0.9137±0.0041) | 0.4976±0.0033 (0.8562±0.0119) | 0.5000±0.0005 (0.7604±0.0220) | 0.4959±0.0049 (0.6573±0.0156) | 0.1950±0.0047 (0.6503±0.0346) | 0.3937±0.0052 (0.8429±0.0076) | 0.6014±0.0023 (0.8316±0.0133) | 0.8020±0.0022 (0.6580±0.0276)
MtChD (MLP): 0.5027±0.0027 (0.9589±0.0095) | 0.5003±0.0039 (0.8289±0.0366) | 0.5262±0.0173 (0.6249±0.0710) | 0.5084±0.0962 (0.0048±0.0068) | 0.5772±0.0569 (0.0086±0.0080) | 0.5649±0.0450 (0.0045±0.0035) | 0.4095±0.0258 (0.4906±0.0534) | 0.5962±0.0668 (0.3950±0.1112) | 0.5372±0.1315 (0.0171±0.0202)
Naive Confusion (RF): 0.4965±0.0018 | 0.5017±0.0019 | 0.4974±0.0004 | 0.4975±0.0001 | 0.4973±0.0001 | 0.2271±0.0382 | 0.4255±0.0312 | 0.5235±0.0229 | 0.5436±0.0900
DP+Normal (GLR eq.): 0.5003±0.0004 | 0.5006±0.0005 | 0.5212±0.0204 | 0.7238±0.2762 | 0.5971±0.3374 | 0.2441±0.0377 | 0.4578±0.0447 | 0.5885±0.0266 | 0.8108±0.0288
DP+RBF: 0.5002±0.0004 | 0.5001±0.0019 | 0.5673±0.0684 | 0.9495±0.0679 | 0.3071±0.2392 | 0.3740±0.2840 | 0.4234±0.1893 | 0.5827±0.0246 | 0.8355±0.0654
DP+L2: 0.9510±0.0099 | 0.9875±0.0062 | 0.3515±0.2399 | 0.8584±0.2734 | 0.5143±0.4006 | 0.4451±0.3481 | 0.3183±0.4417 | 0.3104±0.4252 | 0.2917±0.3778
DP+L1: 0.9569±0.0070 | 0.5313±0.2660 | 0.5809±0.1677 | 0.6053±0.4027 | 0.4015±0.3308 | 0.5526±0.4467 | 0.1277±0.1873 | 0.4916±0.3832 | 0.2114±0.3312
BinSeg+RBF: 0.5002±0.0002 | 0.4995±0.0011 | 0.5701±0.0502 | 0.7663±0.3205 | 0.5635±0.2190 | 0.3133±0.3285 | 0.3850±0.3702 | 0.6049±0.1506 | 0.7258±0.2715
Window+RBF: 0.4391±0.1364 | 0.5653±0.2210 | 0.2960±0.2139 | 0.5699±0.1738 | 0.2444±0.1012 | 0.4746±0.2436 | 0.5654±0.2459 | 0.7964±0.2223 | 0.3987±0.3159
BottomUp+RBF: 0.5002±0.0008 | 0.4581±0.1477 | 0.4500±0.3655 | 0.6821±0.2879 | 0.4947±0.3144 | 0.4271±0.3059 | 0.5213±0.2149 | 0.4602±0.2885 | 0.5861±0.2953
Uniform+Gaussian: 0.5474±0.2299 | 0.5429±0.3010 | 0.3915±0.1567 | 0.4717±0.2265 | 0.5429±0.2159 | 0.6171±0.2842 | 0.7546±0.2203 | 0.5210±0.1549 | 0.5196±0.3386
Uniform+IFM: 0.9969±0.0031 | 0.9942±0.0030 | 0.9973±0.0020 | 0.9975±0.0015 | 0.9975±0.0030 | 0.9986±0.0015 | 0.9958±0.0049 | 0.9973±0.0026 | 0.9985±0.0012
Uniform+FullCov: 0.4985±0.0002 | 0.5089±0.0163 | 0.9986±0.0006 | 0.9976±0.0010 | 0.9989±0.0009 | 0.9930±0.0098 | 0.9280±0.1593 | 0.9982±0.0020 | 0.9974±0.0038
Geo+Gaussian: 0.0282±0.0044 | 0.0271±0.0018 | 0.0286±0.0044 | 0.0323±0.0054 | 0.0278±0.0037 | 0.0326±0.0063 | 0.0340±0.0034 | 0.0312±0.0051 | 0.0254±0.0037

Table 4.1: A comprehensive comparison of the performance of the proposed method against two types of state-of-the-art methods, optimal segmentation (in the ruptures package) and Bayesian changepoint detection, on synthetic data. MtChD (RF) is our method with a random forest classifier; MtChD (MLP) is our method with a multilayer perceptron classifier. DP+Normal (GLR eq.) is the DP segmentation method used with the normal loss function, which is equivalent to a GLR test assuming a multivariate normal distribution. Six combinations of optimal segmentation methods are listed: DP is the dynamic programming segmentation algorithm, BinSeg is binary segmentation, Window is window-based changepoint detection, and BottomUp is bottom-up segmentation. The cost functions used are RBF (RBF kernel), L1 (L_1 loss function), and L2 (L_2 loss function). The last four rows are Bayesian changepoint detection with a uniform or Geo (geometric) prior; Gaussian stands for the Gaussian likelihood function, IFM is the individual feature model [151], and FullCov is the full covariance model [151]. μ(t_0) and σ(t_0) are the mean value and standard deviation of the inferred changepoint, and μ(α) and σ(α) are the mean value and standard deviation of the inferred α. Bold values indicate changepoints that are closest to the correct value.

4.4.1.2 Synthetic Images

We illustrate our method's ability to identify changes in diverse high-dimensional data by applying it to video. For this task, we generate a series of synthetic gray-scale video frames of 64 by 64 pixels whose content changes qualitatively at t_0 = 0.5. Before the change, the images are light solid circles on a dark background; after the change they are hollow circles (Fig. 4.2). These images can represent, for example, organisms that were originally alive and then died; our task would then be to determine when these organisms "died". The gray scale of the solid and hollow circles is 0.8 and the gray scale of the background is 0.2. To create more realistic data, we position the circles randomly within the image and inject different levels of Gaussian noise to model poor-quality data. After mixing with noise, pixel gray-scale values are truncated to the range [0.0, 1.0]. We also generate the distribution of the indicator t by drawing values at random from a uniform distribution U(0, 1).

Figure 4.2: Example synthetic images that change at t_0 = 0.5. From top to bottom, rows show images with different noise levels (0.2, 0.4, 0.6, 0.8 and 1.0). At t_0, solid circles change into hollow circles, as shown in the images.

We check the robustness of the estimated changepoint against noise. Table 4.2 shows the inferred changepoint and the estimated value of α as a function of the noise level for the synthetic image data. Due to the spatial correlation of image data and the superior predictive power of the CNN classifier, the inferred changepoint is close to the true changepoint (often not statistically significantly different from it) and α is close to 1.0, even for very noisy image frames. Alternative methods were infeasible because of the high dimension and large size of the data.

Noise level | μ(t_0) | μ(t_0) - t_0 | σ(t_0) | μ(α) | σ(α)
0.2 | 0.5048 | 0.0047 | 0.0028 | 0.9612 | 0.0278
0.4 | 0.5087 | 0.0086 | 0.0043 | 0.9787 | 0.0139
0.6 | 0.5253 | 0.0237 | 0.0027 | 0.9298 | 0.0083
0.8 | 0.5155 | 0.0191 | 0.0111 | 0.9609 | 0.0361
1.0 | 0.5380 | 0.0398 | 0.0246 | 0.8781 | 0.0717

Table 4.2: Changepoint inferred for the synthetic image dataset. The true value of the changepoint is t_0 = 0.50, where solid circles change into hollow circles.

4.4.2 Identifying Change in Real-world Data

We now demonstrate the ability of MtChD to identify changes in noisy, sparse real-world data, potentially with many missing values.

4.4.2.1 COVID-19 Air Quality

We first apply our method to air pollution data to see whether we can identify when changes in human behavior due to the COVID-19 pandemic occurred. We collected daily air quality data for 2020 (through May 26, 2020) for major U.S. cities from AQICN (aqicn.org). This data includes daily concentrations of nitrogen dioxide, carbon monoxide, and fine particulates less than 2.5 microns across, totalling 4.3K observations for 37 cities across the U.S. once missing data are removed. We also include the population within 50 km of each city as a feature, because people within this area may have contributed to the concentration of pollutants. We can use our model to determine when the change started and compare these results to the gold standard: the dates stay-at-home orders were issued by states. These orders limited business and commercial activity, which likely led to the dramatic decline in pollution. The earliest such order was announced in California on March 19, 2020 and the latest in South Carolina on April 7. We compare the results of our method to the state-of-the-art in Table 4.3. Our method is the only one that inferred a reasonable changepoint for the data, March 21, 2020 ± three days, roughly in the middle of all the state stay-at-home orders. We show the accuracy deviation for MtChD in Figure 4.3.
Possibly due the the lack of data, a random forest classier gives better accuracy than MLP, and the mathematical model ts accuracy deviation well. 4.4.2.2 Performance on Khan Academy As a second example, we apply our method to the online learning platform Khan Academy (khanacademy.org), which oers courses on a variety of subjects where students watch videos and test their knowledge by answering questions. The Khan Academy platform had undergone substantial changes to its user interface around April 1, 2013 (or 1:365 10 9 in Unix epoch time) [24], which aected user performance. This change acts as a ground truth we want to detect. Data was collected by Khan Academy over the period from June 2012 to February 2014 and contains 16K questions answered by 13K students totalling 681K datapoints. Despite the large number of students, the data is very sparse: the vast majority of students were typically active for less than 20 minutes and never returned to the site. The performance data records whether the student solved the problem correctly on their rst attempt and 64 (a) (b) Figure 4.3: Accuracy deviation curve for COVID-19 Air data. (a). Using random forest classier; (b). Using multilayer perceptron classier. The scatter points are accuracy de- viation measured on testing set and the solid lines are tted using the proposed accuracy deviation model. without a hint. When the user failed, they were able to attempt the problem again, and the number of attempts the user made was recorded. Additional features recorded include the time since the previous problem, the number of problems in a student session, and the number of sessions. Since segmentation methods implemented in ruptures are not memory ecient, we only sample 0.5% of the data (about 3.5K entries) uniformly in every trial. For Bayesian changepoint detection, we sampled around 1.6K data points uniformly for each trial. Both our method and optimal segmentation algorithms can identify the change from user performance data (Table 4.3), although optimal segmentation algorithms places it slightly later and with larger error. Bayesian changepoint detection does not give any reasonable changepoints for this data. The accuracy deviation curve is shown in Figure 4.4. For our method, results of random forest classier and multilayer perceptron classier are comparable and both t well with the accuracy deviation model. 4.4.2.3 Student Test Scores One application of changepoint detection is nding discontinuities in social data. In such cases, a quasi-experimental framework known as regression discontinuity design (RDD) can 65 COVID-19 Air Khan Academy MtChD(RF) (t 0 ) (t 0 ) () () 80.0829(d) 2.9713(d) 0.4164 0.0392 1.3701e+09(s) 4.0835e+05(s) 0.2830 0.0079 MtChD(MLP) (t 0 ) (t 0 ) () () 99.5820(d) 57.5959(d) 0.4843 0.3264 1.3694e+09(s) 8.5539e+05(s) 0.1491 0.0173 DP+Normal (GLR eq.) 
(t 0 ) (t 0 ) 71.8333(d) 0.3727(d) 1.3577e+09(s) 2.2059e+07(s) DP+RBF (t 0 ) (t 0 ) 37.1667(d) 25.5761(d) 1.3763e+09(s) 9.4481e+06(s) DP+L2 (t 0 ) (t 0 ) 70.1667(d) 33.9137(d) 1.3679e+09(s) 1.3556e+07(s) DP+L1 (t 0 ) (t 0 ) 25.5000(d) 53.8911(d) 1.3679e+09(s) 1.0014e+07(s) BinSeg+RBF (t 0 ) (t 0 ) 1.0000(d) 0.0000(d) 1.3741e+09(s) 8.9074e+06(s) Window+RBF (t 0 ) (t 0 ) 55.0000(d) 0.0000(d) 1.3587e+09(s) 1.2031e+07(s) BottomUp+RBF (t 0 ) (t 0 ) 54.0000(d) 0.8165(d) 1.3528e+09(s) 1.2960e+06(s) Uniform+Gaussian (t 0 ) (t 0 ) 96.9167(d) 37.5859(d) 1.3439e+09(s) 4.2047e+06(s) Uniform+IFM (t 0 ) (t 0 ) -0.5833(d) 0.8858(d) 1.3564e+09(s) 1.5300e+07(s) Uniform+FullCov (t 0 ) (t 0 ) 0.0000(d) 0.6455(d) 1.3591e+09(s) 1.6176e+07(s) Geo+Gaussian (t 0 ) (t 0 ) 8.1667(d) 8.9334(d) 1.3396e+09(s) 2.9504e+05(s) Table 4.3: A comprehensive comparison of our method with previous methods on real world datasets, COVID-19 Air and Khan Academy. We use the same abbreviations as in Table 4.1. For COVID-19, the measure of t 0 is number of days since 01/01/2020. For Khan Academy, the measure oft 0 is Unix timestamp, namely, number of seconds since midnight 01/01/1970. Correct values are roughly 80 days for COVID-19 air data, and 1:36510 9 seconds for Khan Academy data. Bold values indicate changepoints that are closest to the correct value. 66 (a) (b) Figure 4.4: Accuracy deviation curve for Khan Academy data. (a). Using random forest classier; (b). Using multilayer perceptron classier. be used to estimate the causal eects of the change, e.g., created by a new policy. We tested our method's ability to identify such changed on two datasets used by Herlands el al. [58]. In these two datasets, the boundary is given by a cuto value along a single feature. Thus we can identify the discontinuity by using dierent features as indicator t and nd the one with the best separability, namely, with greatest inferred . The cuto is then given by the inferred t 0 . The Student Test Scores dataset contains the academic performance of 2.6K students of dierent age, race and pre-test scores [62]. Intervention is applied to individuals with pre-test scores lower than 215. The intervention has a positive eect on the outcome, post-test score. Namely, we expect to see a changepoint detectedt 0 = 215 when feature pre-test score is used as indicator. We use age as a \dummy changepoint" as we do not expect to nd a change in this variable. If a method nds a changepoint with high condence with age, it would be a false positive. We tried both pre-test scores and age as indicators, and the results are shown in Table 4.4, which also compares performance to alternative changepoint detection methods. Alternative methods incorrectly estimate the change in pre-test scores to occur at 225 or below 190, with the exception of Uniform+Gaussian Bayesian changepoint detection, whose accuracy is comparable to ours. Moreover, MtChD nds a changepoint in age, but with very low condence ( = 0:190:13), indicating that little if anything changes. pre-test 67 scores is therefore the most salient feature. In contrast, all other methods infer a change in age with low error and unknown condence. 
Student Test (pretest) Student Test (age) College GPA (GPA to cut- o) College GPA (HS grade pt) College GPA (credits yr1) College GPA (age at en- try) MtChD(RF) (t 0 ) (t 0 ) () () 221.6483 2.7090 0.2074 0.0350 16.0739 0.0000 0.1943 0.1326 1.1444 0.0684 0.4366 0.0329 51.6910 2.6805 0.2465 0.0249 11.0000 0.0000 0.2147 0.1204 21.0000 0.0000 0.1645 0.2091 MtChD(MLP) (t 0 ) (t 0 ) () () 221.8466 5.0202 0.2903 0.0601 14.3159 1.8019 0.1204 0.0729 1.1132 0.0029 0.5609 0.0176 49.4054 0.7424 0.4084 0.0125 8.8895 3.5007 0.3797 0.2722 20.9750 0.0391 0.1164 0.1617 DP+Normal (GLR eq.) (t 0 ) (t 0 ) 181.6667 0.2357 11.7043 0.0031 -1.4325 0.0379 1.4167 0.4488 1.5625 0.0955 17.0000 0.0000 DP+RBF (t 0 ) (t 0 ) 225.0000 1.8257 13.0468 0.0530 1.0717 0.0855 49.8333 6.5426 2.3125 0.0955 19.8333 0.3727 DP+L2 (t 0 ) (t 0 ) 227.7500 0.3819 13.1554 0.0014 1.0125 0.0956 52.5000 6.2383 2.4583 0.0932 19.5000 0.5000 DP+L1 (t 0 ) (t 0 ) 226.0833 1.4837 13.0157 0.0075 1.0733 0.0582 37.3333 1.3744 2.4375 0.0955 18.8333 0.6872 BinSeg+RBF (t 0 ) (t 0 ) 224.0833 1.7892 13.0151 0.0102 1.0733 0.0962 46.8333 3.1314 2.3542 0.1122 19.3333 0.4714 Window+RBF (t 0 ) (t 0 ) 228.3333 13.3375 12.9482 0.7787 0.4700 1.2308 47.1667 39.4606 2.3750 0.3461 18.6667 1.2472 BottomUp +RBF (t 0 ) (t 0 ) 222.8333 2.0344 13.0999 0.0938 1.2367 0.3305 42.6667 3.3993 2.4583 0.0932 19.7500 0.3819 Uniform +Gaussian (t 0 ) (t 0 ) 208.3333 7.2265 12.2428 0.1891 -0.3483 0.4935 10.6667 10.2415 2.0000 0.0000 17.8333 0.3727 Uniform+IFM (t 0 ) (t 0 ) 171.9167 0.8375 11.5590 0.0893 -1.2808 0.2469 1.6667 1.4907 1.5625 0.0955 17.0000 0.0000 Uniform +FullCov (t 0 ) (t 0 ) 172.8333 0.6872 11.4150 0.0304 -0.7892 0.3242 3.0000 4.0415 1.7500 0.2500 17.0000 0.0000 Geo+Gaussian (t 0 ) (t 0 ) 184.5833 0.6067 11.7652 0.0252 -1.3283 0.1323 1.5000 0.5000 1.6875 0.1731 17.0000 0.0000 Table 4.4: Results of our method for regression discontinuity task, on two real world datasets, Student Test Score and College GPA. We used the same abbreviations as Table 4.1. For Student Test data, pretest stands for pre-test score. For College GPA data, HS grade pt stands for high school grade points; credits yr 1 stands for credits earned during the rst year. Bold values indicate changepoints that are closest to the correct value; underlined values demonstrate the features with the highest . 4.4.2.4 College GPA College GPA data reports the eect of academic probation for college students [89]. A student will go through academic probation if his or her rst year GPA is lower than a 68 specied cuto. The cleaned data contains contains entries for 16K students. There are several outcome variables: (a) whether student leaves in the rst year, (b) GPA in the second year and, (c){(e) whether they graduate in 4, 5 and 6 years. After centering data, the changepoint is at GPA to cuto = 0.0. Similar to [58], we used four real valued features as candidates of the indicator, (1) GPA to cuto, (2) high school grade (as a percentile), (3) year one credits, and (4) age. Table 4.4 shows results. We nd, as expected, that a changepoint occurs at GPA to cuto of approximately 0 although our method overestimates this cuto compared to uniform + Gaussian Bayesian changepoint detection. Notably, however, our method correctly infers that college GPA is the most salient feature (highest ) while other features have an that is either not statistically signicant from zero, or signicantly lower. 4.5 Discussion We introduced Meta Changepoint Detection, a novel method to detect changes in high di- mensional data. 
The method identifies changes in a wide range of data, from student test scores to images. Moreover, it gives us the fraction of data changed, which we find can act as a confidence metric. Our comprehensive experiments validate the method on synthetic and real-world data that are difficult for other methods, and show that it can robustly identify changes in sparse and noisy data. We also demonstrate that our method has low bias and higher accuracy than competing state-of-the-art methods, and efficiently handles large datasets.

MtChD has the potential to significantly advance the social sciences through detecting relevant changepoints for quasi-experiments. Social scientists often leverage quasi-experiments to identify causal mechanisms of human behavior. One such quasi-experiment, regression discontinuity design, can use the changepoints we detect to infer the effect of a policy change. Applying MtChD to regression discontinuity designs is an important future step.

There are also several avenues to extend MtChD. First, MtChD should be extended to detect multiple changepoints, t_0, t_1, t_2, ..., in data. In many practical cases, such as regular policy changes in websites or online games, we want a method that can detect multiple events. Moreover, it is important to apply an extension of MtChD to large data streams in order to detect new events as they appear, such as trending topics or a sudden wave of spam. Changepoint detection on streaming data has only recently been explored, mainly in the Bayesian setting, cf. [1, 148, 103].

4.6 Supplementary Materials

4.6.1 Detail of confusion-based training

For the multilayer perceptron classifier, we use four hidden layers, each with 64 neurons. We chose the ReLU activation function, and the maximum number of training epochs is 100. The random forest classifier uses 100 decision trees with a maximum depth of 32; entropy is used as the splitting criterion. To detect changes in video data, a slightly more sophisticated convolutional neural network (CNN) is used, with six convolutional layers. The kernels in each layer are 3 by 3, and the numbers of filters in the layers are 32, 32, 64, 64, 128, and 128. After the second, fourth and sixth convolutional layers, max pooling and dropout are performed; the kernel size for max pooling is two with stride two, and the dropout ratio is 0.20. The output of the convolutional layers is fed into a fully connected neural network with one hidden layer of 64 neurons. A ReLU activation function was also used for this network, and the model was trained for 30 epochs.

Chapter 5

Identifying Multiple Changes: Application to Social Media

5.1 Methodology

The change detection method we proposed, MtChD, was discussed in Chapter 4. Here we discuss how to extend it to detect multiple changes. To identify multiple changes, we use recursive binary splitting. We first use the change detection method to find a changepoint and split the data at this point. This, in turn, creates two subsets of data in which we can find additional changes and split again, recursively. We stop splitting a node when we hit the minimum length of the indicator range, t_c, or the maximum depth of the binary tree, D. The time complexity is O(TD) for binary segmentation depth D and number of data points T; because D is fixed to a small value, such as 3, the splitting process is almost linear in time. The space complexity depends only on the classifier used, so the method can be efficient even on high-dimensional datasets. A minimal sketch of this recursive splitting is given below.
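The sketch below illustrates the recursive binary splitting described above. It assumes a detect_change(X, t) routine wrapping the single-changepoint procedure of Chapter 4 (the confusion scan plus model fit) that returns an estimated (t_0, α); the function name multi_change and the boundary check are our own illustrative choices, not the released implementation.

import numpy as np

def multi_change(X, t, detect_change, min_len=4.0, max_depth=3, depth=0):
    # Return a sorted list of estimated changepoints within the range of t.
    if depth >= max_depth or (t.max() - t.min()) < 2 * min_len:
        return []
    t0, alpha = detect_change(X, t)
    # Reject splits that fall too close to the boundary of the current segment.
    if t0 - t.min() < min_len or t.max() - t0 < min_len:
        return []
    left, right = t <= t0, t > t0
    return sorted(
        multi_change(X[left], t[left], detect_change, min_len, max_depth, depth + 1)
        + [t0]
        + multi_change(X[right], t[right], detect_change, min_len, max_depth, depth + 1)
    )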
Relevant code pertaining to our analysis has been made publicly available on GitHub at https://github.com/yuziheusc/confusion_multi_change.

5.2 Data

Online discussions about Covid-19. We apply our method to a large dataset of Covid-19 tweets [26]. This dataset consists of 115M tweets from users across the globe, collected since January 21, 2020. These tweets contain at least one of a predetermined set of Covid-19-related keywords (e.g., coronavirus, pandemic, Wuhan, etc.). Since this dataset provides geolocation data for only 1% of the users, we leverage a fuzzy matching approach [66] to geolocate users within the US. We want to understand the significant shifts in attention during the earliest era of Covid-19, from January 21 until March 31, 2020; 7.6 million of these tweets are geolocated to within the US using the methods of Chen et al. [26]. We then subsample 200K tweets at random from each month, for a total of 600K tweets, to simplify our analysis. Text is pre-processed by removing stopwords, links, account names, and special characters (e.g., !?%#), and only English-language tweets are considered. We then use the tf-idf vectorizer (with 2.2K terms) from Python's scikit-learn library [114] to generate the tf-idf vectors; a short preprocessing sketch is given at the end of this section.

Reddit stories. We extract Reddit posts from a popular horror-story-writing subreddit called nosleep using the Python Reddit API Wrapper (PRAW). We focus on posts created between January 1, 2019 and June 2020 to understand both seasonal changes in stories (e.g., Halloween and Christmas) and changes in stories since the Covid-19 pandemic, yielding 35.4K stories. Data pre-processing includes removing posts labeled "[removed]" and "[deleted]". Text cleaning and tf-idf vectorization (with 25K terms) follow the same methodology as for the Twitter dataset.
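The following is a minimal sketch of the text cleaning and tf-idf vectorization described above. The exact cleaning rules and vocabulary sizes follow the text; the function names (clean_tweet, vectorize) are ours for illustration.

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_tweet(text):
    text = re.sub(r"http\S+", " ", text)       # remove links
    text = re.sub(r"@\w+", " ", text)          # remove account names
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove special characters
    return text.lower()

def vectorize(docs, max_terms=2200):
    # English stopwords are removed; the vocabulary is capped at max_terms
    # (2.2K terms for tweets, 25K for Reddit stories in the text).
    vec = TfidfVectorizer(stop_words="english", max_features=max_terms)
    X = vec.fit_transform(clean_tweet(d) for d in docs)
    return X, vec   # X is a sparse (n_docs, max_terms) tf-idf matrix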
5.3 Experiments and Results

Online discussions about Covid-19. We start by identifying shifts in tweets about Covid-19 (embedded into tf-idf vectors); word clouds of hashtags for the first three periods are shown in Fig. 5.1. We ran binary segmentation using MtChD with a random forest classifier, a maximum segmentation depth of three, and a minimum time length between changes of four days. The dates of the change points identified by the method are listed in Table 5.2. The time intervals between changes match the period of a typical news cycle [86], which is between 5 and 9 days. The results are robust in the sense that when the minimum length is increased to 5 days, a subset of the changes (01/30, 02/11, 02/16, 02/21 and 02/28) is found. Next, we analyze the discovered change points and interpret the findings by highlighting topics that shift the collective attention. To validate the results, we compare the change points found with news events, as shown in Tab. 5.1.

Figure 5.1: Word clouds for Covid-19 tweets in the periods 01/21-01/30, 01/30-02/04 and 02/04-02/11.

Change point | α | Events
01-30 | 0.355 | First confirmed case of person-to-person transmission of the "Wuhan Virus" in the US
02-04 | 0.341 | Diamond Princess cruise ship quarantined; ten people on a cruise ship near Tokyo have the virus
02-11 | 0.327 | WHO announced the official name "COVID-19"
02-16 | 0.243 | More than 300 passengers from the Diamond Princess are traveling in US-chartered planes
02-21 | 0.441 | First Covid-19 death in Italy (02-22)
02-28 | 0.366 | First Covid-19 death in the US (02-29)
03-04 | 0.447 | California declares state of emergency; South Korea confirms 3 new deaths and 438 additional cases of novel coronavirus
03-09 | 0.269 | Italy lockdown; Grand Princess cruise ship docks in Oakland
03-15 | 0.303 | First lockdown orders in parts of California; national emergency declared (3/13)
03-24 | 0.146 | US sees deadliest day with 160 deaths

Table 5.1: Change points automatically identified in Covid-19 tweets and important events occurring on those dates.

Reddit stories. We also applied our method to horror stories posted on reddit.com in the subreddit r/nosleep, with stories embedded using tf-idf. We find variations in the topics of stories: from July 17 to September 25, 2019, "camping" and "summer" appear, reflecting recreation activities in the US; the next change, September 25 to November 17 ("halloween"), signals the topic of Halloween; and November 17 to January 2 ("santa" and "christmas") corresponds to the holidays. Potentially inspired by Covid-19 restrictions, there were stories about "quarantine" from March 29 to May 4, 2020. Finally, quarantining became old news again, and discussions in the final months until June 2020 shifted back to stories about "rules".

As a baseline, we used GLR and DP with an RBF kernel. Due to memory limitations, we first perform truncated SVD [51] to transform the tf-idf vectors into 64-dimensional vectors, and then down-sample to 8K observations from the full dataset, since dynamic programming runs in O(T²). We find that not only is our method able to process the full dataset, it also finds more physically meaningful change points.

Covid-19 tweets (Our Result | GLR | DP+RBF) and Reddit stories (Our Result | GLR | DP+RBF):
01-30-20 | 02-07-20 | 01-27-20 || 03-10-19 | 03-26-19 | 04-10-19
02-04-20 | 02-08-20 | 01-28-20 || 06-05-19 | 06-03-19 | 04-12-19
02-11-20 | 02-08-20 | 01-31-20 || 07-17-19 | 08-11-19 | 11-06-19
02-16-20 | 02-08-20 | 02-13-20 || 09-25-19 | 11-05-19 | 01-13-20
02-21-20 | 02-09-20 | 02-15-20 || 11-17-19 | 12-20-19 | 01-29-20
02-28-20 | 02-09-20 | 02-26-20 || 01-02-20 | 01-30-20 | 02-19-20
03-04-20 | 02-17-20 | 02-29-20 || 02-21-20 | 03-02-20 | 03-10-20
03-09-20 | 02-17-20 | 03-02-20 || 03-29-20 | 04-03-20 | 03-31-20
03-15-20 | 02-17-20 | 03-07-20 || 05-24-20 | 04-09-20 | 04-07-20
03-24-20 | 02-27-20 | 03-13-20 ||

Table 5.2: Comparison with baseline change detection methods. (Left) Covid-19 tweets and (right) r/nosleep stories.

5.4 Conclusions

In this paper, we aim to identify and understand the shifts of conversation on social media. In contrast to emergent topic detection, which detects new topics of interest, our method
A more detailed analysis involving tweets from dierent languages and from across the globe would be a promising candidate for future research. These limitations, however, point to promising future work. For example, it will be important to explore advancing on the binary segmentation approach in order to sacrice some potential speed for greater accuracy or precision. Next, we should compare against realistic data with a xed number of known change points to determine the overall accuracy of this method. Finally, these results should be extended to other high-dimensional datasets, including video. 75 Chapter 6 Heterogeneous Eects of Software Patches in a Multiplayer Online Battle Arena Game 6.1 Chapter Introduction Interest in online gaming has grown explosively, driven in part by the burgeoning e-sports industry and live streaming technologies that enable people to watch others play in real time. League of Legends (LoL) is one of the most popular multiplayer online battle arena (MOBA) games with tens of millions of monthly active players. They compete in teams (typically of ve players) to capture the opposing team's base. In a LoL match, each player controls a character|known as champion|and uses it to plan attacks or mount defenses against opponents. Champions vary in their power levels and abilities (type of spells cast, rate of attack, armor strength, and the like), which can be further enhanced by the player during a match by means of skill points gained or items earned. Champions also dier in play styles, some being devoted to bu and cure teammates, others to eectively defeat opponents, etc., leading to dierent classes of champions. Likewise, individual players dier in their skill level, game style, and mastery of specic champions or champion classes. MOBA games typically match players together into teams to balance the teams' skills, so that neither side will have a built-in advantage in winning the match. Game balance makes game play more fun and engaging for the players [5]. 76 Riot Games, the developer of LoL, regularly updates the game by releasing new versions of the game's software. These versions, or software patches, not only x bugs and make technology improvements, but also introduce new functionality and content. One study of the rst six seasons of LoL identied over 7,000 changes made in 164 software patches [34]. Aside from cosmetic changes to the look and feel of the game (e.g., changing the champions' graphical appearance, a.k.a. skins), many of the patches aect gameplay by modifying the abilities of champions. These types of changes can be classied as bus that increase a champion's strength (e.g., bulking up their armor), nerfs that decrease strength (e.g., reducing the distance of their spells), or neutral changes that do not substantially impact champion abilities. Patches are often created following the introduction of new features that gave too much power to some champions or skills. When this happens, the game balance is altered, making the game too dicult|or too easy|for players and, therefore, less fun and engaging. To restore game balance, MOBA games are regularly patched to nerf to overpowered champions and bu underpowered ones. Nerfs are the most common changes. Measuring the impact of software patches on player performance and game balance is dicult due the to complex interplay between the choices of players and game outcomes. A nerfed champion, when played in combination with other champions, may improve a team's chances of winning. 
Player characteristics and play styles may also aect its eectiveness. These considerations dramatically complicate the maintenance of game balance, especially as new champions and skills are regularly released to expand game features and keep it interesting for the players. Surprisingly, there has been relatively little work done on this problem. Existing research examined the impact software patches on player's choices, show- ing that they increase player preference for bued champions and reduce preference for nerfed champions [144]. Additionally, bus typically improve champion's win rate, especially for underperforming champions [34] When measuring the impact of an intervention (e.g., a patch) on a system, it is important to model the intervention's causal eect on target outcomes rather than the correlation 77 between them [113]. This is especially true in observational studies, since we never observe all potential outcomes (this is the fundamental problem of causal inference). Additionally, the specication of a treatment (software patch) is typically biased, either through confounding or selection bias [14]. Not all champions are selected to be bued or nerfed in a single patch (confounding), and not all players play the same champions (selection bias). Since the distribution of samples may dier between treated and control populations, a supervised model naively trained to minimize factual errors would overt to properties of one group and not generalize well to the population. We address this problem and extend the state of the art by treating the matches that take place before and immediately after a patch is released as a control and treated populations, respectively (users are as-if randomly selected to each group). We then use causal inference methods to measure the eect of the patches on player performance, specically, the number of kills, and team performance (probability of winning). We nd a large variation in the average eect in the population. The impact of patches, however, depends on individual player features or champions played, and varies between players. To account for this, we estimate the heterogeneous treatment eect (HTE) of the patch. HTEs measure the eect for dierent subgroups of a population for which eects dier [9, 135, 8]. For example, good players may be less aected by a patch than bad players, or a nerf on a champion may not aect the team's win probability if a synergistic champion is still strong. We discover that some software patches, for example patch 4.20 and 6.9, can substan- tially aect player performance. After applying a causal tree HTE model to each patch, we nd signicant heterogeneity in team performance changes despite LoL's game balancing mechanism. Moreover, the eect of patches on player performance varies signicantly with the champion type they play and their initial performance. Despite the heterogeneity in players and patches, we nd some results that generalize across patches: players who take signicant breaks between matches perform especially well after a patch. In addition, several performance metrics show the signicant advantages that patches bring to high-performing 78 players over low-performing ones. Therefore, surprisingly, these patches caused a widening in the gap between the high and low-performance players, which is contrary to the spirit of patching aimed at game balancing. 
Overall, our results underscore the importance of player heterogeneity in policy changes, and limitations of attempts to balance player performance, possibly because player heterogeneity is not taken suciently into account. 6.2 Related works The rise in popularity of MOBA games in recent years has given researches a wealth of large- scale user-centric datasets. The team-based nature of such games led to many research in optimal team compositions, including identifying and predicting the in uence of teammates using co-play networks [126] and building recommender systems for the line-up of heroes (the equivalent of champions in LoL) in DOTA 2 [53, 28], another popular MOBA game. Further, MOBA games boast metagaming strategies, which are collectively decided by players (the crowd) as the most optimal strategy for the team or for each champion. For example, [85] nds that the mostly widely successful team composition in LoL consists of one player in each of the ve positions, although some non-meta teams have signicant advantages. One of the most signicant factors in uencing gameplay are patches. Patches are regular updates to the game that x bugs, introduce new game contents, and most importantly alter the skills and abilities of champions to balance the game [34]. Wang et al. show that the eect of champion balancing patches also aect player's champion preference [144]. Other game related work focus on mining and understanding human behavior. To be specic, Sapienza et al. nds that prolonged game playing sessions lead to performance decline, a phenomenon that can be attributed to cognitive depletion [127]. Sapienza et al., on the other hand, used unsupervised tensor factorization methods to cluster dierent types of users based on their in-game performance metrics [125]. 79 The estimation of HTE is an important problem in many elds, even if has not often been applied to games. HTE estimation refers to nding subsets of the population for which causal eects of a treatment dier from the population and other distinct subsets. Medical professionals, for example, may be interested in how a drug treatment may benet one group, but potentially have adverse reactions to another group [130]. Marketers, alternatively, may be interested in how an advertisement in uences dierent users to buy a product [21]. Many supervised techniques have been developed for HTE estimation [8, 49, 130, 150, 134, 135] and the related problem of nding individualized treatment regimes [4, 82, 69]. Many methods build upon interpretable tree-based methods, such as decision lists [83], classication and regression trees (CART) [8, 135, 156], and random forests [9, 143]. Others follow more in line with the supervised machine learning paradigm, such as using supervised base learners, called meta learners, which decompose HTE estimation into multiple regression or classication problems [79, 102]. Representation learning using deep neural networks have also been proposed for estimating HTEs [130, 67]. Our paper diers from these previous research due to our focus on how patches aect player and team performance. A good player may be more robust to changes, so that their performance would be unaected. On the other hand, a bad player may have outcomes that vary substantially with each patch. Additionally, a bued champion may have an overall increase in winning rate, but if their team includes a nerfed champion, it may actually perform worse in those games. Therefore, our goal is to estimate these heterogeneous eects of patches on game play. 
The rest of the paper is organized as the following. We rst discuss the details of LoL and the data used for analysis. Then we present the problem of HTE estimation and how it relates to the problem of estimating the eect of software patches. Then we present the method we will use for HTE estimation, causal trees. Finally, we present results and conclusions. 80 6.3 Background and Game Data LoL is a popular MOBA game where two teams of players compete in a match to destroy the opposite team's home base, or Nexus. A match lasts about 30 minutes, and is composed of two teams of ve players or two teams of three players, depending on the game map selected. At the beginning of every match, each player selects a champion to play as and assume one of the ve positions available on the map: Top, Middle, Jungle, Attack Damage Carry, and Support. The dataset tracks individual player's performance in matches, measured by metrics such as the number of kills and assists the player makes in the match, as well as whether the team for which the player was playing for won the match. At the end of the match, one and only one team will emerge as the winner. There are over 130 dierent champions, each with dierent powers and abilities. These champions belong to seven disjoint champion types: controllers, ghters, mages, marksmen, slayers, tanks and unique playstyles, sorted in alphabetic order. Not every champion is equally popular, and many champions are rarely picked. The most popular champion is Thresh, who is chosen by about 3% of all players, followed by Lucian and Vayne. In our analysis, we consider the top 25 most popular champions, and these appear in most matches within our dataset. An LoL season typically starts the beginning of each calendar year, and concludes in late November or early December. Players are ranked in tiers at the end of each season. In between seasons are pre-seasons, in which developers typically introduce large overhauls to the games in preparation for the new season. Games during pre-seasons are not counted towards players' seasonal rankings. 6.3.1 Features The LoL dataset was collected from mid 2014 to the end of 2016 [127]. It consists of 1.2 million unique players in 437,000 matches. The dataset contains information about players 81 and matches at dierent levels of granularity and for dierent versions of the game. Features granularity in the data is at the match level, user level and season level. At a match level, the data features encode basic information about the match, including the match duration, start time of the match, map ID, queue type, patch ID, season ID and the outcome the match (which team loses and which team wins). The queue type determines the type of the game. For example, a game can be in ranked queue or unranked queue, where a ranked queue indicates that the game outcome contributes to each players nal seasonal ranking. The queue type also indicates whether it is a solo queue game, in which the teams are formed by individual players who likely did not know each other beforehand. At the user level, the dataset records the champion, the role and the lane selected by each user for each match. Every champion belongs to the seven established champion types. Individual in-game performance metrics include kills, deaths, assists, gold earned and gold spent, and the champion level achieved by the player in this game (champLevel). To track the prior experience of players, we compile a set of user features. 
For every user-match pair, we track the number of matches the user played thus far and time interval since previous match (timeSinceLastMatch). Motivated by research showing evidence that continuous LoL game play leads to performance deterioration [127], we additionally compute per-session statistics, where a session is dened as a series of matches without a break of at least 15 minutes between consecutive matches. For each user and match, we record the user's session number and the index of the match within this session. Past user behavior includes cumulative and average performance metrics from user's rst match up until (but not including) the current match for kills, deaths, assists, KDA, gold earned, and gold spent, (mean*AtStart and cum*AtStart). Per-session aggregated performance metrics calculated in the same fashion (sessionMean*AtStart and sessionCum*AtStart) Finally, at the season level, the features track the highest tier achieved by the player in the previous season (highestAchievedSeasonTier), which is only available for players who have played competitively in the previous season. 82 6.3.2 Game Patches Developers release software patches to x bugs and security vulnerabilities, as well as to release new content and adjust game balance. The game balancing patches typically involve increasing the abilities of weaker champions (called \bung") or decreasing the abilities of more dominating champions (called \nerng"). There are 62 software patches in our data set, corresponding to dierent versions of the game, ranging from version 4.6 to 6.22. We study how bus and nerfs introduced by a new version of game software impact performance of teams and individual players, and how the impact diers depending on team composition, player characteristics, and game settings. 6.4 Methods 6.4.1 Heterogeneous Treatment Eects The goal of our study is to nd how the eect of software patches diers between dierent subgroups of players and champions chosen by the team. To discover these eects from data, we consider the problem of heterogeneous treatment eect (HTE) estimation, where the treatment is a new software patch. We consider our unit of interest as a match of LoL and the outcome is some result of the match, such as the number of kills at the individual player level or win or loss at the team level. Formally, we frame our problem using the Rubin framework of potential outcomes [122]. Let the treatment for a match i be dened as W i , such there exists potential outcomes Y i (W i =w), which is the outcome of a match i on software patch w. Each match has a set of characteristics (features), X i , such as the champions played or player statistics before the match. For any match, we can only observe the outcome from one of the treatments (the patch it was played on) and not at any point in time, dened as Y i (W i = w i ). The goal of our work is to estimate the HTE of the treatment, which we take to be a software patch, which the game company uses to update the game to a new version. Let any two consecutive 83 software patches be versions w t and w t+1 . The goal of HTE estimation is to estimate the conditional average treatment eect (CATE): (x i ) =E[Y i (w t+1 )Y i (w t )jX i =x i ]; (6.1) where X i = x i represents a subset of the population. In this case, we estimate how an outcome of interest changes between two versions of the game. For example, we may want to know how kills (outcome) of a player on a specic champion (features) changes from one version (treatment) to another. 
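The sketch below illustrates the conditional average treatment effect of Eq. 6.1 for a subgroup of matches as the difference in mean outcome between matches played on patch w_{t+1} (treated) and on patch w_t (control), with a two-sample t-test for significance, which is how node-level effects are reported later in this chapter. The data frame column names (outcome, patch_col) are our own assumptions about the match-level data, not the dataset's exact schema.

import numpy as np
from scipy import stats

def cate(df, mask, outcome="win", patch_col="patch", control="4.11", treated="4.12"):
    # df is assumed to be a pandas DataFrame with one row per (match, team);
    # mask is a boolean selector defining the subgroup X_i = x_i.
    sub = df[mask]
    y0 = sub.loc[sub[patch_col] == control, outcome].to_numpy(dtype=float)
    y1 = sub.loc[sub[patch_col] == treated, outcome].to_numpy(dtype=float)
    effect = y1.mean() - y0.mean()                      # estimated CATE for this subgroup
    _, p_value = stats.ttest_ind(y1, y0, equal_var=False)
    return effect, p_value, len(y0) + len(y1)

# Example (hypothetical columns): effect of patch 4.12 on the win rate of teams
# that include Lucian but not Rengar.
# mask = (df["has_lucian"] == 1) & (df["has_rengar"] == 0)
# effect, p, n = cate(df, mask)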
Figure 6.1: Win rate (%) of the top-three most played champions for each patch. 6.4.2 Causal Trees for HTEs Tree-based methods have been popularized recently for heterogeneous treatment eect (HTE) estimation [8, 135, 143]. Causal trees work similarly to CARTs, in that they both greedily partition the feature space based on a certain criterion. The crucial dierence is that a causal tree splits the feature space in order to reduce the expected variance of the estimated HTE while a CART does so in order to minimize the classication or regression loss. The splitting continues till the predened minimum leaf node size is reached or there is no more statistically signicant splitting can be made. We use a variant of causal trees developed by Tran and Zheleva [135], which introduce a type of validation set for generalizing causal 84 eects to unseen data. For our experiments, since the validation set is randomly selected, the causal tree built for each training/validation split will be slightly dierent. But the overall structures of the causal trees are similar and we did not observe any contradiction results. To generalize estimated eects, the estimated eect in a node is compared to the sample eect found in a separate validation set. After a causal tree is built, HTE will be estimated for every leaf node of the tree. Given feature x i , the HTE (x i ) is inferred as the HTE for the leaf node corresponds to x i . Figure 6.5.1.1 shows an example causal tree built using the algorithm developed in [135], where the treatment is patch 4.12, and the outcome of interest is whether a team wins or loses. In each node, we have an estimated causal eect, the p-value of that eect based on an independent t-test, and how many samples are used to estimate that eect. The estimated eect at any node is the dierence in means when treated and not treated (e.g. percentage of wins on patch 4.12 compared to patch 4.11). At any parent node, there is a splitting feature. At the root node, the split is based on the binary feature whether the champion Lucian is on the team or not, if yes then we traverse left, otherwise we traverse right. Statistically signicant nodes are highlighted by a purple box. We discuss these ndings in Section 6.5. 6.5 Results Software patches can aect how champions perform on a global level, and they can also aect how players perform using those champions. We rst explore how team's probabilities of winning vary by champions chosen by team members, independent of the players playing them. We then explore how individual players are aected by software patches. 6.5.1 Eect on Team Performance In our analysis we focus on the 5 versus 5 play mode and ranked (competitive) matches, as opposed to normal (casual) queues. The outcome of interest is whether a team wins or 85 not, and the treatment is the patch of interest (e.g., patch 4.20). Since in the LoL match one team wins and the other team loses, we cannot measure any average eect based on the patch. However, as some champions are stronger than others, a team's probability of winning is aected by the champions the team chooses to play. Since the likelihood a particular champion wins the match (its win rate) varies by patch, some combinations of champions are more likely to result in a team win compared to others in dierent patches. 6.5.1.1 Champion win rate Figure 6.1 shows the overall win rate for the three most played champions: Thresh, Lucian, and Vayne. The win rate is dened as the fractions of matches won by a team playing that champion. 
The win rate varies by both patch and by champion. For example, in patch 4.6, Lucian has a win rate over 50%, while Vayne and Thresh have a win rate below 50%. Over time, the win rates of Lucian and Vayne shift above and below 50% depending on the patch, while Thresh has a more stable win rate slightly below 50%. For example, Lucian is bued in patch 4.12, and leads to a higher win rate in matches played after the patch was introduced. In patch 4.21, however, Lucian receives a nerf, which results in a drop in win rate in the next two patches. Another interesting observation is that the win rates of Lucian and Vayne are negatively correlated (r =0:46;p = 0:0002). This is because Lucian and Vayne are typically played in the same lane and role. If one champion is stronger then the other champion is more likely to lose in a match up. Therefore, even a change in a patch not related to Lucian can aect the nal performance, which motivates the study of heterogeneity in win rates. As an example, Vayne is bued in 4.13, which increases her win rate slightly, but lowers Lucian's win rate slightly as well. This applies to other champions that are not in the same lane or role. In 5.19, many other champions are bued which lowers Lucian's win rate, and increases Vayne's win rate. This could be because Vayne is good against the bued champions, while Lucian is bad against them. The opposite could be true in patch 6.9, where there is a major mage update. 86 Figure 6.2: Trimmed causal trees for patch 4.12 (left) and 6.4 (right) which contain Lucian bus and nerfs, respectively. In the gures, samples indicates the number of observations in that node. For non-leaf nodes, it is the total number of observations before splitting. 6.5.1.2 Heterogeneous eect on win rate Next we zoom in on specic patches to see how other features, including champions and types, moderate the eect of the patch on win rate. We focus on potential changes on Lucian, since he is one of the three most popular champions, and his win rate varies more than Vayne's and Thresh's. We choose two patches that contain bus and nerfs to Lucian, patch 4.12 and 6.4, respectively [34]. In addition, we identify two other patches with signicant changes to other champions other than Lucian. The, we build causal trees for these four patches: • 4.12: Signicant changes to Lucian, resulting in a strong bu. • 6.4: Small change to Lucian, resulting in a small nerf. • 4.20: Small changes to many champions and largest average increase total kills, which is explored in Section 6.5.2 and shown in Figure 6.4. This is also a \pre-season" patch. • 6.9: No changes to Lucian, but major updates to mage champions. This is also a \mid-season" patch. We rst look at two patches that directly aect Lucian: 4.12 and 6.4. Figure 6.2 shows two trimmed causal trees on this patch, where the omitted features are denoted by a dotted line. The tag \Other features..." means we skip to a lower part of the tree. Nodes with feature splits that have children have been trimmed to save space (e.g. \Number of controllers 87 on team" in Figure 6.5.1.1). This happens in the right side of the trees. We consider the treated population as all matches played after the patch, and the control population as all matches played before the patch. The outcome is team win or loss. Patch 4.12 In Figure 6.5.1.1, we see that Lucian's presence on the team is the rst splitting feature. This makes sense, as Lucian arguably received the most changes, largely bus. 
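As an illustration of the win-rate definition used above (the fraction of matches won by a team playing a given champion, per patch), the following sketch computes the quantity plotted in Fig. 6.1; column names are assumptions about the match-level data, not the exact schema.

import pandas as pd

def champion_win_rates(matches, champions=("Thresh", "Lucian", "Vayne")):
    # matches: pandas DataFrame with one row per (match, team, champion) and
    # columns ["patch", "champion", "team_won"], where team_won is 0/1.
    sub = matches[matches["champion"].isin(champions)]
    return (sub.groupby(["patch", "champion"])["team_won"]
               .mean()                      # fraction of matches won with that champion
               .unstack("champion") * 100)  # percent, one column per champion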
This patch increases Lucian's win rate by 5% shown in the rst left node. For the games where the team included Lucian, if Rengar was also present on the same team, the win rate actually decreases by 12.6%, although this eect is not signicant with an independent t-test (p = 0:11). In patch 4.12, Rengar received a bug x that actually results in a nerf, which is acknowledged in the patch notes as a potential nerf: \This bug x might be a signicant hit to Rengar's jungling eectiveness (he was proccing Madred's Razors twice), but we'll track his performance over time." As we traverse further left, if Nami is also on the team, the win rate signicantly increases by 18%. Although Nami was not changed, likely the combination of Lucian and Nami is signicantly improved from the bus to Lucian. Further down the right subtree we see that if a team has Kassadin, then it is more likely to lose on patch 4.12. Importantly, results from causal analysis are not the same as computing the change in win rates individually for each champion. For example, Nami's win rate increases by 4.2% in patch 4.12 without conditioning on other champions and is not signicant (p = 0:10). With Lucian and without Rengar, her win rate increases by 23%, a signicant dierence. Kassadin has a noticeable nerf in this patch, resulting in an individual win rate decrease of 4.5% (p = 0:23), but after removing potential champions in the team, the win rate drops to 10%. Patch 6.4 Figure 6.5.1.1 shows the causal tree for the treatment of patch 6.4. Here, the rst split considers whether Jhin is on the team. This is because Jhin receives signicant bus in patch 6.4, which increase his win rate signicantly. Traversing the left subtree, if the team has Fiora in addition to Jhin, the win rate goes up signicantly by 23%. This 88 observation is interesting because Fiora receives a nerf in patch 6.4. There could be several reasons, such as the nerf not being enough, Jhin and Fiora being a strong combination, or other champions were changed enough so that Fiora still wins more often than not. On the right subtree of Figure 6.5.1.1, Viktor and Lucian both have a decrease in win rate, but Lucian's change is not signicant. Both Viktor and Lucian receive small nerfs. Patch 4.20 Patch 4.20 is a \pre-season" (season 5) patch. Generally, pre-season patches contain many more changes than patches released during the season. An interesting obser- vation from the tree is that many features selected are champions not directly changed. At the root node, the rst feature selected is whether Jinx is on the team, but Jinx was not changed in patch 4.20. Additionally, several features are based on the champion types: the number of ghters, marksmen, and mages aects the win chance of a team. This eect is heterogeneous: having at least one ghter with Jinx increases the win rate by 6.4%, while not having Jinx and having more than 2 marksmen decreases the win rate by 15.4%. Further down the tree, we see that our focus champion, Lucian, also has a positive change in win rate if there are at least 2 mages on the team. Mean performance of players disaggregated by feature timeSinceLastMatch Mean performance of players disaggregated by feature meanKillsAtStart Figure 6.3: Impact of software patches on player performance. Heatmap shows aver- age player performance, as measured by the number of kills per match, for dierent ver- sions of the game. The upper plot shows the performance of players with dierent rest time(timeSinceLastMatch). 
In the lower plot, players are binned based on the feature meanKillsAtStart, a proxy of player skill. The abrupt color change for versions 4.20 and 6.9 indicates a large dierence in performance after the version change. 89 Patch 6.9 This patch is interesting since it changes a signicant number of champions type mages, which we refer to as the \major mage update" in Figure 6.1. The causal tree learned on patch 6.9 data does not identify many mages as features in the top of the tree, except for Brand. On the left subtree there is a split on the number of controllers, another champion type similar to mages, which were changed and categorized under the umbrella term of mages, which increases the win rate. Further down the right subtree, Lucian appears as another positive increase in win rate, of 17% if there is a Riven and no Brand and Wukong. This explains the overall increase in win rate for Lucian individually shown on patch 6.9 in Figure 6.1. A potential explanation is the \major mage update" increases the amount of mages played and Lucian is one marksman that may perform better against mages. Summary Our results demonstrate the nuances of game balance: changes to champions aect the win rate of completely dierent champions played by the team. The computational framework of heterogeneous eect estimation described here can quantify how changes in the abilities of champions reverberate through other champions. Big changes, like major bus to our focal champion Lucian, are identied early in the tree in patch 4.12 (Figure 6.5.1.1), but small nerfs are identied near the bottom of the tree in patch 6.4 (Figure 6.5.1.1). Other changes to champions can also aect win rates dierently, and even champions who do not change can be aected, such as Jinx and Lucian in patch 4.20. Additionally, bued champions may be played at a higher rate, aecting the game balance. In the next section, we explore how patches can aect players dierently, rather than focus on the team performance with dierent champions. 6.5.2 Individual Player Performance By altering champion abilities and game settings, software patches can potentially aect player performance. Figure 6.3 shows average player performance per match, measured by the number of kills the player makes, for games played after each new patch was introduced. 90 The gure shows two views of the same data. The top heatmap disaggregates player matches by the value of the feature timeSinceLastMatch, which measures time elapsed since the last game they played. The top line reports the average performance of players who move on to the next match without taking a break (timeSinceLastMatch=0). These players generally perform poorly, as evidenced by the darker colors in the top line. This is consistent with the nding by Sapienza et al. 2018b that player performance deteriorates over the course of a game playing session due to cognitive depletion. This suggests that players become fatigued and their performance declines. The following lines report the corresponding percentiles of the remaining values of the timeSinceLastMatch feature. Players in second line (26%) taking a short break (< 3 minutes) generally outperform other players, including those who take longer breaks, from days (46%) to years (82%). This could potentially mean that players taking short breaks are more dedicated and able to play more frequently, thereby improving in skill. Also interesting is the dark band across all bins that is seen before patch 6.9. 
This means that all groups of players|those taking short and long breaks|performed worse on average than in the earlier versions. The heatmap at the bottom of Figure 6.3 shows the same data but disaggregated by meanKillsAtStart. The top row (29%) shows players with meanKillsAtStart equal to zero, and the remaining players split by quartiles of the feature value. This feature is a proxy of player skill: new players are grouped in the rst bin, with remaining players grouped into bins from weakest players with few kills per match (29%) to the strongest players with many kills per match (82%). Unsurprisingly, better players have better performance (last two lines are brightest) and outperform weaker players in all versions of the game. New players outperform the weakest players, which makes sense, as new players have a range of skill. We see abrupt color shifts in the heatmaps in Figure 6.3, especially between versions 4.19 and 4.20 and versions 6.8 and 6.9. These color shifts within a row indicate signicant changes in player performance that come with new versions. Comparing performance of 91 players before and after the version change allows for measuring the causal eect of the software patch on players. Figure 6.4: The mean eect of software patches for kills. We see there are sharp peaks at patch 4.20 and 6.9. 6.5.2.1 Average eect of patches First we estimate the overall eect averaged for all players. For every game version, we measure its average eect on performance by calculating the dierence in the mean number of kills before and after the version change. Figure 6.4 shows mean eect on performance introduced over 62 software patches in our data. The patch version 4.20 introduces game changes that increase the average number of kills made by a player by 0.5 per match. This is a pre-season patch, and our results show that it has major impact on performance. In patches immediately following version 4.20, the eect is slightly negative|as if to compensate for the changes made in version 4.20. Version 6.9 is another major patch, which increases the average number of kills by over 0.4. The remaining patches have a far smaller eect on average. 6.5.2.2 Heterogeneous eect of patches The mean eect hides much of the complexity of the impact of software changes on dierent players. We can see some of this heterogeneity in Figure 6.5, which compares the mean eect for players disaggregated into the same groups as Figure 6.3. Again, we see a strong eect of 92 Mean eect of patches on performance disaggregated by feature timeSinceLastMatch Mean eect of patches on performance disaggregated by feature meanKillsAtStart Figure 6.5: Causal impact of game patches on player performance. Heatmap shows the eect of the software patch on performance for the same groups of players as in Figure 6.3. The top gure shows mean eect for players with dierent rest time (timeSinceLastMatch), and the bottom plot shows the eect of patches for bins with dierent values of meanKillsAtStart. patches 4.20 and 6.9 on all groups of players, although some groups are better able to leverage changes made in the game than other players. This analysis, however, does not account for the impact of the champion (or champion type) the player chooses on performance. To address this point, we learn a causal tree for each champion. To learn the causal trees from data, we rst combine data from two consecutive versions of the game and then disaggregate the combined data by champion. 
We use the combined data to learn the causal tree for the champion and repeat for all pairs of consecutive game versions. Due to limited data and resources, we only select the top 25 most popular champions. We set minimum leaf node size as 5% of total samples and maximum depth of causal tree as 10. Our causal modeling framework learns 1,550 causal trees for 25 champions over 62 soft- ware patches. Figure 6.6 shows a pruned tree learned for Vayne for versions 4.20-4.21. The causal tree identies two leaf nodes to be statistically signicant at 5% level, which are highlighted in purple. The leaf node at the second level demonstrates that for a less expe- rienced player (cumulative match duration is small), the eect is lower than average eect (at the root node). Leaf nodes at the lowest level show that the treatment eect is smaller than average for experienced and good players (high value of cumMatchdurationAtStart and meanKillsAtStart). 93 Figure 6.6: (Left). Causal tree learned for patch 4.21 for matches where a player select champion Vayne. The purple nodes show heterogeneous eect that is signicant at 5% level. (Right). Average treatment eect calculated for players with dierent levels (meanKillsAt- Start) at two major patch changes. Excluding the rst bin which contains signicant portion of new players with meanKillsAtStart close to zero, we see a trend that high level players benet more from the patch changes, indication the performance gap being widen. Figure 6.7: Average eect gap calculated for the 10 most important features. The error bar shows 95% condential intervals. We can see that the most important features are timeSinceLaseMatch and history performance of players. 6.5.2.3 Eect of player features To explore the causal eects of player characteristics on their performance, we quantify the relationship between player features and the heterogeneous treatment eect of software patches. We perform statistical analysis of the features used to split nodes. We weigh 94 each feature in tree by the sample size in the split and use the total weight (among all champions and versions) to compare the relative importance of the features. The important features are those that occur in many trees and explain many data points. The ten most important features related to changes in champions are (in descending order): timeSince- LastMatch, meanKdaAtStart, meanMatchDurationAtStart, meanDeathsAtStart, meanKill- sAtStart, , meanWinsAtStart, meanAssistsAtStart, meanGoldspentAtStart, meanGoldearne- dAtStart, and meanChamplevelAtStart. Except for timeSinceLastMatch and meanMatchDu- ration, many of the important features are performance related. Interestingly, mean values of features such as kills, deaths, gold earned and spent, are judged to be more important than their cumulative values. The cumulative values are larger for players who play more games, but mean are calculated per match, and therefore, better re ect player's skill. However, more important than skill is the length of the break between games. To quantify how player features, such as timeSinceLastMatch, allow players to leverage patches to improve their performance, we nd all splits on timeSinceLastMatch and calculate the dierence of the heterogeneous treatment eect between the left child node (feature larger than or equal to cuto) and right child node (feature smaller than cuto) for each split. We use this causal eect dierence|eect gap|to describe the overall impact of the feature on causal eects. 
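To make the feature weighting and the effect gap concrete, the sketch below shows how splits collected from the fitted causal trees could be pooled, weighted by sample size, and turned into per-split gaps. The split-summary field names are illustrative placeholders, not the exact data structures of our implementation.

```python
from collections import defaultdict

# Each fitted causal tree is assumed to be summarized as a list of splits:
#   {"feature": name of the splitting feature,
#    "n": number of samples at the split,
#    "tau_left": HTE in the child with feature >= cutoff,
#    "tau_right": HTE in the child with feature < cutoff}

def feature_importance(splits):
    """Total sample weight explained by each splitting feature."""
    weight = defaultdict(float)
    total = sum(s["n"] for s in splits) or 1.0
    for s in splits:
        weight[s["feature"]] += s["n"] / total
    return dict(weight)

def effect_gaps(splits, feature):
    """Per-split effect gap: HTE(feature >= cutoff) - HTE(feature < cutoff)."""
    return [s["tau_left"] - s["tau_right"]
            for s in splits if s["feature"] == feature]

# Example: pool splits from all trees (25 champions x 62 patches), then
# inspect the gap distribution for one feature.
all_splits = []  # filled by walking every fitted tree
gaps = effect_gaps(all_splits, "timeSinceLastMatch")
```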
If the dierence is larger than 0, then the feature improves player performance after a patch. The eect gaps of important features are shown in Figure 6.7. Players with high meanKillsAtStart have a positive eect gap, meaning that their performance tends to improve following a software patch, while the performance of players with high meanDeathsAtStart tends to decrease. However, the p-values for these eects are slightly higher than 0.05, which suggests that skilled players (with high kills or low deaths) can only take weak advantage of changes in champions made by the software patches. This may be a consequence of game balance: when one group of players has advantage in one game version, designers reduce their advantage in the next version. For example, in the bottom heatmap in Figure 6.5, the positive impact on performance of highly skilled players in patch version 4.20 (bright 95 colors in bin 5) is oset by relatively stronger decreases (darker colors) in their performance in patch 4.21. The only feature with a signicant eect gap (at 5% signicance level) is timeSinceLast- Match. The positive eect gap suggests that players taking longer breaks between games (or at least those who do not play without interruptions) are consistently able to improve their performance. These results highlight the importance of short-term changes in player behavior: cognitive fatigue following uninterrupted game play diminishes players of ability to leverage changes in champions. 6.6 Conclusion Game play is rarely explored from the perspective of individualized performance. When we study the heterogeneous eect of patches, we discover signicant changes in team and player performance depending on champions, player's track record, and length of breaks between matches. On a team level, we found that changes to champions have an impact on all champions selected by the team, and not just the ones that were changed. Signicant changes are generally identied as important splits in the causal tree, while small changes are ignored or shown in lower-level nodes. At a player level, several performance metrics demon- strate signicant benets of patches for high-performing players rather than low-performing ones. These patches therefore counter-intuitively widen the gap between the high and low- performance players. When we analysed individual player performance, we also found that there are two major version changes, 4.20 and 6.9 that have an outsized impact on per- formance. But these changes hide the heterogeneity in the impact. Using causal trees, we studied the heterogeneous eect of patches on players with the same champions and found that timeSinceLastMatch and performance proxies are the most important factors. On the other hand, we nd that there is little or no correlation between these factors on conditional average treatment eects. 96 Because these results are based on observational data, however, they suer from the limi- tations of all causal models suer. Specically, we cannot guarantee that players are selected at random into the control (before a patch) and treatment (after the patch) conditions. On the contrary, previous work shows that some players behavior changes after a patch to re- ect the known champion bus and nerfs [144]. Not only is there a potential selection bias, however, there might be potential confoundings from features we do not know, and therefore cannot control for. 
While these results help us understand the impact patches have on player performance \in the wild", future work should create controlled experiments to address these potential confounders. 97 Chapter 7 Counterfactual Learning for the Fair Allocation of Treatments 7.1 Chapter Introduction Equitable assignment of treatments is a fundamental problem of fairness. The problem is especially acute in cases where treatments are not available to everyone and when some individuals stand to benet more from them than others. This problem arises in multi- ple contexts, including allocating costly medical care to sick patients [117, 43], vaccination strategies, materials following a disaster [145], college spots and nancial aid in college ad- missions, extending credit to consumers, and many others. Despite growing interest from the research community [42, 100, 38] and the rise of automated decision support systems in healthcare and college admissions that help make such decisions [104], fair allocation of scarce resources and treatments remains an important open problem. To motivate the problem, consider an infectious disease, like the COVID-19 pandemic, spreading through the population. The toll of the pandemic varies among dierent ethnic and racial groups (i.e., protected groups) and also via several comorbidities, such as age, weight, and underlying medical conditions. When the COVID-19 vaccine rst became available, its supplies were extremely limited, motivating the question: who should receive it rst? To minimize loss of life, we could reserve the vaccine for high-risk groups with comorbidities, 98 but this does not guarantee protected groups will be treated equitably. Some groups will get preferential treatment, unless|and this is highly unlikely|all groups reside in high- risk categories at equal rates. For example, initial evidence shows that minorities, such as Blacks, are receiving COVID-19 vaccines at lower rates than Whites, potentially because of real or perceived biases in medical care [118]. However, providing more vaccines to protected groups may result in more lives lost overall in cases where protected groups have lower mortality. This demonstrates the dicult trade-os policy-makers must consider regardless of the policies they choose. Similar trade-os between unbiased and optimal outcomes often appear in automated decisions. This issue received much attention since an investigation by ProPublica found that software used by judges in sentencing decisions was systematically biased [6]. The software's algorithm deemed Black defendants to be a higher risk for committing a crime in the future than White defendants with similar proles. Subsequent studies showed that making the algorithm less biased also decreased its accuracy [35, 97]. As a result, a fairer algorithm is more likely to incorrectly label violent oenders as low-risk and vice versa. This can jeopardize public safety if high-risk oenders are released, or needlessly keep low-risk individuals in jail. We show that, alike to these automated methods, there are multiple ways to dene fair treatment policies, which do not fully overlap. Going back to the vaccine example, selecting individuals from the population at random to receive the limited doses of the vaccine may be considered equitable, but there might be grave dierences in mortality rates across protected classes, which this treatment would not overcome. 
In contrast, preferentially giving vaccines to protected groups may reduce disparities in mortality between classes, but implies unequal allocation of resources and may not benet the population the most. Kleinberg et al. made a similar nding for automated decisions [77]. Except for rare trivial cases, a fair algorithm cannot simultaneously be balanced (conditioned on outcome, predictions are similar across groups) [6] and well-calibrated (conditioned on predictions, the outcomes will 99 be similar across groups) [77]. Decreasing one type of bias necessarily increases the other type. Empirical analysis conrmed these trends in benchmark data sets [56]. More worrisome still, there are dozens of denitions of AI fairness [140], and we hypothesize that there is also no shortage of fair policy denitions, making an unambiguous denition of \fair" a challenge. In this paper, we combine causal inference with fairness to learn optimal treatment poli- cies from data that improve an overall outcome for the population. First, we dene novel metrics for fairness in causal models that account for the heterogeneous eect of treatment on dierent subgroups within population. These metrics measure inequality of treatment op- portunity (who is selected for treatment) and inequality of treatment outcomes (who benets from treatment). This compliments previous research to maximize utilization of resources, i.e., ensuring that they do not sit idle, while also maximizing fairness. [42]. The results also show how armative action-like policies that preferentially select in- dividuals from protected subgroups for treatment can improve the overall benet of the treatment to the population, for a given level of fairness. To achieve the same overall im- provement (eect), we show that requiring better fairness of treatment opportunity leads to larger inequality of treatment outcomes, and vice versa. We therefore nd a necessary trade-o between policies that are fair overall with policies that would be fair within sub- groups. These results demonstrate novel ways to improve fairness of treatments, as well as the important trade-os due to distinct denitions of fairness. The rest of the paper is organized as follows. We begin by reviewing related work, then we describe the causal inference framework we use to estimate the heterogeneous eect of treatments, dene treatment biases, and optimization algorithm that learns fair intervention policies from data. Our methods are tested in synthetic and real-world data on high school student test scores. Namely, we devise fair school funding policies that raise test scores more fairly than alternative funding policies. 100 7.2 Related Works As in the section above, we discussed various ways of estimating HTE. The value of HTE is a function of the individual features. When the treatment is given, under the unconfound- edness condition [123], HTE will be an intrinsic quantity, not aected by any factors except for the individual features. On the other hand, by varying the treatment assignment, we can eectively change the ATE. From a policy maker's view, this means that known the treatment eect is dierent and may cause additional disparity in the population, without introducing new treatment/intervention, the only way to promote fairness is to select the correct subgroup to be treated. In this nal part of the dissertation, we will describe how to design a fair treatment policy using observational data collected for a treatment experiment. 
The policy should be fair and also benet the population the most. As an simple example, consider we are trying to distribute limited does of COVID-19 vaccine. Random distribution will not make the most of the limited resource but preferencial selection of high risk group may imply the vaccine policy is biased. Recently, there is a growing literature of fairness in causal inference, decision making and resource allocation. There are case studies on social work and health care policy such as [31, 117, 43]. The problem of vaccine distribution has been discussed in [73, 152, 10]. Corbett-Davies et al. formulate the fair decision making as optimization problem under the constraints of fairness [35]. This can be regarded as an easy adaption from fair prediction task such as [155]. Kusner et al. proposed a new perspective of fairness based on causal inference, counterfactual fairness [81]. The counterfactual fairness requires the outcome be independent of the sensitive feature, or in other words, conditional on confounders, which further diers from equal opportunity [55], or still other metrics, such as the 80% rule, statistical parity, equalized odds, or dierential fairness [44]. Donahue and Kleinberg studied the problem of fair resource allocation [38]. The goal was maximizing utility under the constrain of fairness, from which theoretical bound for the gap between fairness and unconstrained optimal utility is derived. Elzayn et al. considered a 101 similar case, with potentially more realistic assumptions that the actual demand is unknown and should be inferred [42]. The problem is formulated as constrained optimization, with the help of censored feedback. Zhang et al. modeled direct and indirect discrimination using path specic eect (PSE) and proposed a constrained optimization algorithm to eliminate both direct and indirect discrimination [160]. Also based on the concept of PSE, Nabi and Shpitser considered per- forming fair inference of outcome from the joint distribution of outcomes and features [101]. Chiappa [29] also proposed PSE based fair decision making by simply correcting the decision at test time. A few dierent methods aim to improve fair policies with Bayesian networks [141, 100, 80]. Each method aims to adjust the Bayesian network with the goal of improving future decisions while optimizing the outcome variable. This is in contrast to our work that uses CATE estimators to measure the fairness of treatment policies and change the binary treatments within specied subgroups. Other methods aim to use causal modeling to im- prove fairness of predictions, such as risk assessments [153, 36]. Kallus and Zhou developed a method to assess disparities in interventions with binary outcomes [70]. Our work, however, diers from this previous work because we create (a) policy-based denitions of fairness, (b) optimize on who to treat while accounting for fairness trade-os, and (c) address an under-explored trade-o between equality of opportunity and equality of outcome in fair policy design. 7.3 Methodology We brie y review heterogeneous treatment eect estimation, which we use to learn fair treatment policies. We then discuss how we measure biases in treatments, and create optimal intervention strategies. 102 7.3.1 Heterogeneous Treatment Eect Suppose we are given N observations indexed with i = 1;:::;N, consisting of tuples of data of the form of (X i ;y obs i ;t i ). 
Here X denotes the features of the observation, y^{obs} is the observed outcome, and the binary variable t indicates whether the observation came from the treated group (t = 1) or the control (t = 0). We assume that each observation i has two potential outcomes: the control outcome y_i^{(0)} and the treatment outcome y_i^{(1)}, but we only observe one outcome, y_i^{obs} = y_i^{(t_i)}. In addition, we assume that given features X, both potential outcomes y^{(0)}, y^{(1)} are independent of the treatment assignment t (the unconfoundedness assumption):

(y^{(0)}, y^{(1)}) \perp t \mid X.

The heterogeneous treatment effect is defined as

\tau(X) = \mathbb{E}[\, y^{(1)} - y^{(0)} \mid X \,].    (7.1)

The task of heterogeneous treatment effect (HTE) estimation is to construct an optimal estimator \hat{\tau}(X) from the observations. A standard model of HTE is a causal tree [8]. Causal trees are similar to classification and regression trees (CART), as they both rely on recursive splitting of the feature space X, but causal trees are designed to give the best estimate of the treatment effect, rather than the outcome. To avoid overfitting, we employ an honest splitting scheme [8], in which half of the data is reserved to estimate the treatment effect on the leaf nodes. The objective function to be maximized for honest splitting is the negative expected mean squared error of the treatment effect \tau, defined as

-\widehat{\mathrm{EMSE}}_{\tau}(S^{tr}, N^{est}, \Pi) = \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat{\tau}^{2}(X_i; S^{tr}, \Pi)    (7.2)
    \; - \; \Big( \frac{1}{N^{tr}} + \frac{1}{N^{est}} \Big) \sum_{l \in \Pi} \Big( \frac{\mathrm{Var}^{tr}(l \mid t = 1)}{p} + \frac{\mathrm{Var}^{tr}(l \mid t = 0)}{1 - p} \Big).    (7.3)

Here S^{tr} is the training set, N^{tr} and N^{est} are the sizes of the training and estimation sets, \Pi is a given splitting, l is a given leaf node, and p is the fraction of the data that is treated. The terms Var^{tr}(l | t = 0) and Var^{tr}(l | t = 1) are the within-leaf variances calculated for the control and treated data on the training set. Note that we only use the size of the estimation data during splitting. In cross-validation, we use the same objective function and plug in the validation set S^{val} instead of S^{tr}. After a causal tree is learned from data, the observations in each leaf node correspond to groups of similar individuals who experience the same effect, in the same way that a CART produces leaf nodes grouping similar individuals with similar predicted outcomes. There are other methods for HTE estimation, such as metalearners [79], causal forests [143], and a more general ensemble method [49]. In contrast to causal trees, these methods are often not interpretable, but they can similarly be adapted to estimate treatment inequality and policies, as we will show later.

7.3.2 Inequalities in Treatment

In many situations of interest, data comes from a heterogeneous population that includes protected subgroups, for example, racial, gender, age, or income groups. We categorize these subgroups into one of k groups and use z \in \{1, \dots, k\} to denote the group. Even though we do not use z as a feature in HTE estimation, the biases present in the data may distort learning and lead us to infer policies that unfairly discriminate against protected subgroups. An additional challenge in causal inference is that a treatment can affect the subgroups differently. To give an intuitive example, consider a hypothetical scenario where a high school runs a supplemental instruction program (which we consider as a treatment) to help students who are struggling academically. Students are described by features X, such as age, sex, historical performance, average time spent on homework and computer games, and so on.
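Returning to the honest scheme above, the estimation half of the data only enters after the tree structure is fixed: each estimation-set observation is routed to a leaf, and the leaf's effect is the difference in mean outcomes between its treated and control members. A minimal sketch, with `leaf_of` standing in for the fitted tree's routing function (an assumption of this sketch, not our exact implementation):

```python
import numpy as np
import pandas as pd

def honest_leaf_effects(X_est, y_est, t_est, leaf_of):
    """Estimate the effect in every leaf from the held-out estimation split."""
    df = pd.DataFrame({
        "leaf": [leaf_of(x) for x in X_est],
        "y": y_est,
        "t": t_est,
    })
    effects = {}
    for leaf, grp in df.groupby("leaf"):
        treated, control = grp[grp.t == 1], grp[grp.t == 0]
        if len(treated) and len(control):
            effects[leaf] = treated.y.mean() - control.y.mean()
        else:
            effects[leaf] = np.nan  # leaf has no treated or no control units
    return effects

def predict_tau(x, leaf_of, effects):
    """The HTE for a new observation is the effect of the leaf it falls into."""
    return effects[leaf_of(x)]
```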
We want our intervention to be fair with respect to students of different races (in this case, the sensitive feature z is race). That means we may want to both reduce the performance gap between different races and make sure that the minority race gets ample opportunity to participate in the intervention program. However, we assume that the school district has limited resources for supplemental instruction, which means that not every struggling student can be assigned to the intervention program. To best improve overall performance, it therefore makes sense to leave more spots in the program for students who are more sensitive to the intervention (they have a large treatment effect \tau(X)). But if previous pilot programs show that the effect of the intervention differs among subgroups (e.g., races), with one subgroup more sensitive to the intervention and also having a better average outcome, we have a dilemma between optimal performance and fairness. If we only care about the optimal outcome, the intervention will lead not only to a larger performance gap between races but also to a lack of treatment opportunity for the minority race. If we assign the intervention randomly, we will not make full use of the limited resource to benefit the population.

Below we discuss our approach to measuring bias in a treatment or intervention. We learn the effect of the interventions using causal trees, but show later that this methodology can be extended to any causal method. A causal tree learned on some data partitions individual observations among the leaf nodes. A group of n_i observations associated with a leaf node i of the causal tree is composed of n_i^{(1)} observations of treated individuals and n_i^{(0)} controls. We can further disaggregate the observations in group i based on the values of the sensitive attribute z. This gives us n_{i,z=j} as the size of subgroup z = j, which has n_{i,z=j}^{(1)} treated individuals and n_{i,z=j}^{(0)} controls. Similarly, if y_i^{(0)} and y_i^{(1)} are the estimated outcomes for the control and treated individuals in group i, then y_{i,z=j}^{(0)} and y_{i,z=j}^{(1)} are the estimated outcomes for the control and treated subgroup z = j in group i. Table 7.1 lists these definitions.

number of individuals in leaf node i — n_i
number of control/treated individuals in leaf node i — n_i^{(0)}, n_i^{(1)}
number of individuals for subgroup z = j in leaf node i — n_{i,z=j}
number of control/treated individuals for subgroup z = j in leaf node i — n_{i,z=j}^{(0)}, n_{i,z=j}^{(1)}
outcome for the control/treated group in leaf node i — y_i^{(0)}, y_i^{(1)}
outcome for the control/treated subgroup with z = j in leaf node i — y_{i,z=j}^{(0)}, y_{i,z=j}^{(1)}
mean outcome — \bar{y}
mean improvement of the outcome — \Delta y
treatment ratio for leaf node i — r_i
treatment ratio for subgroup z = j in leaf node i — r_{i,z=j}
maximum number of treated individuals — N_{max}^{(1)}
maximum ratio of treated individuals — r_{max}
inequality of treatment opportunity — Bias_r
inequality of treatment outcomes — Bias_y
maximum allowed inequality of treatment opportunity — m_r
maximum allowed inequality of treatment outcome — m_y

Table 7.1: Definitions used in designing fair treatment policies.

7.3.2.1 Measuring Inequality of Treatment Opportunity

To quantify the inequalities of treatment, we first look at the inequality of treatment opportunity, i.e., the disparity in the assignment of individuals from the protected subgroup in leaf node i to the treatment condition.
To measure this bias, we introduce the treatment ratio r_{i,z=j} as the fraction of treated individuals from subgroup j among the group in leaf node i:

r_{i,z=j} = \frac{n_{i,z=j}^{(1)}}{n_{i,z=j}^{(0)} + n_{i,z=j}^{(1)}}.

We define the inequality of treatment opportunity as the maximum difference of the within-leaf treatment ratios taken over all leaf nodes i and pairs of subgroups j, j':

\mathrm{Bias}_r = \max_{i,j,j'} | r_{i,z=j} - r_{i,z=j'} |.    (7.4)

In other words, two individuals with similar features X but different protected attribute z should be treated equally. This definition is similar to individual fairness [41]. In practice, we control the minimum number of individuals for all subgroups in leaf nodes to ensure Bias_r is not calculated from very small leaf nodes.

Figure 7.1: Plot of outcome y vs. feature x_0 for synthetic data. Note that the other feature, x_1, is independent of y. The two classes of the protected attribute z are shown in different colors. Treatment and control data have "o" and "x" plot markers, respectively.

Figure 7.2: Mean outcome improvement, \Delta y, versus maximum allowed outcome inequality, m_y. (a) \Delta y vs. m_y when equal treatment opportunity is assumed. Different curves show the efficient boundary of policies under the constraint of a given limit on the percent treated, r_max. (b) \Delta y vs. m_y when affirmative action is allowed. Here r_max = 0.8 and different curves show different degrees of affirmative action, measured by m_r. Affirmative action greatly improves \Delta y when m_y is low.

7.3.2.2 Measuring Inequality of Treatment Outcomes

The second type of bias we measure is the inequality of treatment outcomes. This bias arises because subgroups may differ in their response to treatment and in their control outcomes. We quantify this disparity as

\bar{y}_{z=j} = \frac{1}{\sum_i n_{i,z=j}} \sum_i n_{i,z=j} \big[ r_{i,z=j}\, y_{i,z=j}^{(1)} + (1 - r_{i,z=j})\, y_{i,z=j}^{(0)} \big],    (7.5)

where the index i runs over the leaf nodes of the causal tree. We define the inequality of outcomes as the largest difference of expected outcomes over all pairs of protected subgroups:

\mathrm{Bias}_y = \max_{j,j'} | \bar{y}_{z=j} - \bar{y}_{z=j'} |.    (7.6)

This metric is a causal variant of statistical parity [55]. Recall that statistical parity ensures that positive rates (or mean logistic scores) are similar for different protected groups. Here, a small value of Bias_y indicates that after the intervention, different protected groups will have similar mean outcomes, i.e., a small gap between protected groups.

7.3.3 Learning Optimal Interventions using Causal Trees

A crucial problem in the design of interventions is how to balance optimal performance against bias. Below we describe learning optimal interventions that maximize the overall benefit of treatment while properly controlling the bias of treatment opportunity and the bias of outcomes among different subgroups. We can achieve optimality by choosing which individuals to treat. Specifically, given the features X, the potential outcomes y^{(0)} and y^{(1)} are independent of the treatment assignment t. Therefore, we can vary r_{i,z=j}, while keeping y_{i,z=j}^{(0)} and y_{i,z=j}^{(1)} constant, as part of the optimal policy. In a more general setting, we can assign different subgroups different ratios of treatment. We refer to this as affirmative action. For example, in the context of the school intervention program, affirmative action means that groups that benefit most from the treatment (have the largest effect) should be preferentially assigned to the intervention.
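Before turning to the optimization itself, note that both bias measures (Eqs. 7.4–7.6) can be evaluated for any candidate assignment of treatment ratios. The sketch below computes them from a leaf-by-subgroup table holding the Table 7.1 quantities; the column names are placeholders for illustration, not the exact implementation.

```python
import itertools
import pandas as pd

# `leaves`: one row per (leaf, subgroup) with columns
#   "leaf", "z", "n", "y0" (control outcome), "y1" (treated outcome),
#   and "r" (proposed treatment ratio for that cell).

def bias_r(leaves):
    """Eq. 7.4: largest within-leaf gap in treatment ratios across subgroups."""
    worst = 0.0
    for _, leaf in leaves.groupby("leaf"):
        for a, b in itertools.combinations(leaf["r"], 2):
            worst = max(worst, abs(a - b))
    return worst

def group_outcome(leaves, j):
    """Eq. 7.5: expected outcome of subgroup j under the candidate assignment."""
    g = leaves[leaves["z"] == j]
    expected = g["r"] * g["y1"] + (1 - g["r"]) * g["y0"]
    return (g["n"] * expected).sum() / g["n"].sum()

def bias_y(leaves):
    """Eq. 7.6: largest gap in expected outcomes between subgroups."""
    outcomes = [group_outcome(leaves, j) for j in leaves["z"].unique()]
    return max(outcomes) - min(outcomes)
```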
As another example, affirmative action in COVID-19 vaccination means that minorities who are at high risk for COVID-19 complications should get priority access to early vaccines. To learn such interventions, we vary the treatment ratios r_{i,z=j} to maximize the overall outcome

\bar{y} = \frac{1}{\sum_i n_i} \sum_i \sum_j n_{i,z=j} \big[ r_{i,z=j}\, y_{i,z=j}^{(1)} + (1 - r_{i,z=j})\, y_{i,z=j}^{(0)} \big]    (7.7)

under the following constraints:

I. We set an upper bound on the inequality of treatment opportunity we will tolerate:

\mathrm{Bias}_r \le m_r    (7.8)

II. We set an upper bound on the inequality of treatment outcomes:

\mathrm{Bias}_y \le m_y    (7.9)

III. We limit the number of individuals that can be treated due to resource limitations:

\sum_i \sum_j n_{i,z=j}\, r_{i,z=j} \le N_{max}^{(1)}    (7.10)

IV. Finally, all treatment ratios have to satisfy

0 \le r_{i,z=j} \le 1    (7.11)

In the case where affirmative action is not allowed, or in other words, equal opportunity among subgroups is assumed, we set m_r = 0 and the treatment assignment r_{i,z=j} reduces to r_i, since all subgroups are assigned the identical treatment ratio.

Figure 7.3: Heat map visualizations of the mean outcome improvement, \Delta y, for synthetic data. Maximum fraction treated, r_max = (a) 0.2, (b) 0.4, (c) 0.6, and (d) 0.8. Lighter yellow colors correspond to a larger change in the outcome. Solid black lines are the contour lines of \Delta y.

Given the constraint parameters, (m_y, N_{max}^{(1)}) or (m_y, m_r, N_{max}^{(1)}) for rejecting or allowing affirmative action, respectively, we can use linear programming to solve for the optimal \Delta y and the corresponding treatment assignment plan r_i or r_{i,z=j}. The policy solved for in this way is optimal under the constraints, and it can be regarded as the efficient policy.

7.3.4 Learning Optimal Interventions For Arbitrary Causal Estimation Methods

So far, our discussion has been limited to causal trees. Using a general HTE estimation method such as causal forests [143] or metalearners [79], we only have access to black-box estimators

\hat{y}^{(0)}(X, z), \quad \hat{y}^{(1)}(X, z) \quad \text{and} \quad \hat{\tau}(X, z).    (7.12)

We need to find an optimal treatment assignment function r(X, z) \in [0, 1] which has a controlled level of both inequality of treatment opportunity and inequality of treatment outcomes. For a total population of N individuals, assuming there are N_j observations for each protected subgroup, we can estimate the subgroup outcome and the overall outcome as

\bar{y}_{z=j} = \frac{1}{N_j} \sum_i \mathbb{1}(z_i = j) \big[ r(X_i, z_i)\, y^{(1)}(X_i, z_i) + (1 - r(X_i, z_i))\, y^{(0)}(X_i, z_i) \big],    (7.13)

\bar{y} = \frac{1}{N} \sum_i \big[ r(X_i, z_i)\, y^{(1)}(X_i, z_i) + (1 - r(X_i, z_i))\, y^{(0)}(X_i, z_i) \big].    (7.14)

Then, Bias_y can be defined in the same way as Eq. 7.6. We can further define

\mathrm{Bias}_r = \max_{i,j,j'} | r(X_i, j) - r(X_i, j') |.    (7.15)

The total number of assigned treatments can be expressed as N^{(1)} = \sum_i r(X_i, z_i). Using the definitions above, we can construct an objective function

J = \bar{y} - \lambda_r\, \mathrm{Bias}_r - \lambda_y\, \mathrm{Bias}_y - \lambda_n\, N^{(1)}.    (7.16)

Here \lambda_r, \lambda_y and \lambda_n are non-negative hyperparameters used to control the level of inequality of treatment opportunity, the level of inequality of treatment outcomes, and the resource limitations, respectively. The treatment assignment function r(X, z) can be modeled using regressions with nonlinear feature transformations or neural networks. Maximizing the objective J results in the optimal treatment assignment. Unlike in the case of causal trees, there is no closed-form solution, and constraints on Bias_r, Bias_y and N^{(1)} cannot be imposed directly (only indirectly, by changing the hyperparameters \lambda_r, \lambda_y and \lambda_n).
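Concretely, the leaf-level problem of Section 7.3.3 can be posed as a linear program because the objective (Eq. 7.7) and all constraints (Eqs. 7.8–7.11) are affine in the ratios r_{i,z=j}. The sketch below uses scipy.optimize.linprog; the array layout and function name are assumptions of this sketch rather than the exact solver code used in the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def fair_policy(n, y0, y1, m_r, m_y, n1_max):
    """n, y0, y1: (leaves x groups) arrays of counts and estimated outcomes.
    Returns the optimal treatment ratios r[i, j]."""
    L, G = n.shape
    # Maximizing Eq. 7.7 is equivalent to maximizing sum_ij n_ij*(y1-y0)_ij*r_ij,
    # since the remaining terms do not depend on r; linprog minimizes, so negate.
    c = -(n * (y1 - y0)).ravel()
    A_ub, b_ub = [], []

    # III: resource limit  sum_ij n_ij * r_ij <= N^(1)_max
    A_ub.append(n.ravel()); b_ub.append(n1_max)

    # I: |r_ij - r_ik| <= m_r within every leaf (treatment-opportunity bias)
    for i in range(L):
        for j in range(G):
            for k in range(j + 1, G):
                row = np.zeros(L * G)
                row[i * G + j], row[i * G + k] = 1.0, -1.0
                A_ub += [row, -row]; b_ub += [m_r, m_r]

    # II: |ybar_j - ybar_k| <= m_y, with ybar_j affine in r (Eq. 7.5)
    N_g = n.sum(axis=0)
    base = (n * y0).sum(axis=0) / N_g          # group outcome if nobody is treated
    coef = np.zeros((G, L * G))
    for j in range(G):
        coef[j, :] = (np.eye(G)[j] * n * (y1 - y0) / N_g[j]).ravel()
    for j in range(G):
        for k in range(j + 1, G):
            row = coef[j] - coef[k]
            A_ub += [row, -row]
            b_ub += [m_y - (base[j] - base[k]), m_y + (base[j] - base[k])]

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * (L * G), method="highs")
    # res.status != 0 flags an infeasible (m_r, m_y, N^(1)_max) combination.
    return res.x.reshape(L, G)
```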
For clarity of the results and simplicity of the numerical optimizations, we only performed experiments using causal trees, but the framework we propose can be applied to arbitrary HTE estimation methods as described above.

7.3.5 Computational Complexity

The computational complexity of our proposed method depends on two factors: the choice of HTE estimation method and the optimization method used. The time complexity of the causal tree is O(m n log(n)) for n data points and m features. The complexity of the linear programming step depends on the number of leaf nodes in the causal tree. For general HTE estimators, the complexity of the optimization is linear in the size of the input n, and SGD can also be applied to accelerate the optimization.

7.4 Results

7.4.1 Synthetic data

As a proof of concept, we demonstrate our approach on synthetic data representing observations from a hypothetical experiment. The individual observations have features X = [x_0, x_1], where x_0 and x_1 are drawn independently at random from a uniform distribution on the range [0, 1]. The treatment assignment t and sensitive feature z are generated independently at random using Bernoulli distributions: z, t ~ Bernoulli(0.5). Finally, the observed outcomes y depend on the features and treatment as follows:

y = t x_0 + 0.4\, t z x_0.    (7.17)

Note that the feature x_1 is designed not to correlate with y. Figure 7.1 shows the outcomes y as a function of feature x_0. The two subgroups have the same outcome in the control case, but individuals from the protected subgroup (z = 1) benefit more from the treatment, since their treated outcomes are higher than for individuals from the other group (z = 0). Note that the larger the feature x_0, the larger the impact of treatment on the protected subgroup z = 1. This disparate response to treatment creates a dilemma for decision makers: if both subgroups receive the same treatment (Bias_r = 0), then a higher population-wide outcome will be associated with a larger discrepancy in the outcomes for the two subgroups, hence a larger bias (Bias_y).

We train a causal tree to estimate the heterogeneous treatment effect using X = [x_0, x_1]. Given 6,000 total observations, we use a third of the data for training the causal tree, a third for validation, and a third for estimation using honest trees [8]. We estimate biases for the sensitive attribute z and learn optimal interventions using the data reserved in the estimation set.

Equal Treatment Policy. First we consider the equal treatment policy, where individuals from either subgroup are equally likely to be treated. As described in the preceding section, in this case Bias_r = 0. To model limited resources, such as limited doses of a vaccine or a limited number of spots in the academic intervention program, we assume that we can treat at most N_{max}^{(1)} individuals. For simplicity, we introduce r_max = N_{max}^{(1)} / N, the maximum treatment ratio, as a measure of the resource limit. We also use

\Delta y = \bar{y} - \frac{\sum_i n_i\, y_i^{(0)}}{\sum_i n_i}

as a measure of the improvement of the outcome after treatment. We vary the treatment ratio r_max between 0.2 and 1.0 in steps of 0.2, and for each value of r_max we plot \Delta y as a function of m_y, the upper limit on the bias in outcomes (Bias_y). Figure 7.2(a) shows that as we treat more individuals (larger r_max), \Delta y is greater. Additionally, as we tolerate more bias (larger m_y), \Delta y also increases. However, for large enough m_y, \Delta y does not improve significantly.
In this case, we have assigned all the necessary treatment and allowing more bias will not further improve the outcome. Armative Action Policy To see how armative action could improve the average overall outcome, we x r max = 0:8 and vary m r between 0:0 and 0:25 in steps of 0.05. This allows us to prioritize protected subgroup z = 1 for treatment. Figure 7.2(b) shows the improvement y as a function of m y for dierent values of m r . We see that for large m y , curves of dierent values of m r reach the same upper bound of y, which is constrained by r max = 0:8. For lower values of m y , armative action dramatically increases y, i.e., preferentially selecting individuals from subgroups z = 1 for treatment increases the overall 113 benet of treatment. The heat map and contour lines in Fig.7.3 demonstrate the trade- os between the two biases. In order to maintain the same level of improvement from the intervention (moving along the contour lines), reducing maximum allowed treatment opportunity biasm r requires us to tolerate larger treatment outcome biasm y and vice versa. (a) (b) (c) (d) Figure 7.4: Heat map visualizations of performance improvement, y, for EdGap data. Maximum fraction treated, r max = (a) 0.2, (b) 0.4, (c) 0.6, and 0.8. Lighter yellow colors correspond to greater overall benets of treatment, while the infeasible region is shown as grey. (a) (b) (c) (d) Figure 7.5: Maps of (a) z-score normalized mean test scores from EdGap, (b) Black house- hold ratio, (c) learned optimal treatment assignment when equal treatment opportunity is assumed (m y = 0:25;r max = 0:40;m r = 0), and (d) learned optimal treatment assignment when armative action is allowed (m y = 0:25;r max = 0:40;m r = 1:0). The trade-o between fairness and optimal prediction is intuitively unavoidable. We can regard fairness as a constraint to the optimization and the optimal solution which satises 114 the constraint will be a sub-optimal. In our case, this means that when designing the intervention policy, we have to sacrice overall benet in order to make our policy fair. 7.4.2 EdGap Data The EdGap data (https://www.edgap.org) contains education performance of dierent counties of United States. The data we used contains around 2,000 counties and 19 features. The features include funding, normalized mean test score, average school size, number of magnet schools, number of charter schools, percent of students who took standardized tests and average number of students in schools receiving discounted or free lunches (a proxy of poverty in schools). Besides these features, we have US census features for each county (https://data.census.gov/), including household mean income, household marriage ra- tio, Black household ratio, percent of people who nished high school, percent of people with a bachelor's degree, employment ratio, and Gini coecient of income. We use z-score normalized mean test score as the outcome. We binarize school funding and the county ratio of Black households to be above and below the median values as treatment indicator and sensitive feature, respectively. We are interested in the heterogeneous eect of funding increase on dierent counties and we want to design a fair intervention which reduces the education performance dierence between Black and non-Black populations. We use one third of data as training, validation and testing respectively and the results reported are averaged on 40 random splits. 
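The preprocessing just described can be sketched as follows; the file and column names ("funding", "black_household_ratio", "mean_test_score") are placeholders for the merged EdGap and census table, not the exact field names in the released data.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("edgap_with_census.csv")   # hypothetical merged county-level table

# Outcome: z-score normalized mean test score
y = (df["mean_test_score"] - df["mean_test_score"].mean()) / df["mean_test_score"].std()

# Treatment: funding above the median; sensitive feature: Black household ratio above the median
t = (df["funding"] > df["funding"].median()).astype(int)
z = (df["black_household_ratio"] > df["black_household_ratio"].median()).astype(int)

# Remaining columns serve as features X for the causal tree
X = df.drop(columns=["mean_test_score", "funding", "black_household_ratio"])

# One random split into equal train / validation / estimation thirds;
# the reported results average this procedure over 40 random seeds.
def three_way_split(n, seed):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    third = n // 3
    return idx[:third], idx[third:2 * third], idx[2 * third:]

train_idx, val_idx, est_idx = three_way_split(len(df), seed=0)
```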
We plot the overall performance improvement, y, versus maximum treatment bias,m y , and maximum opportunity bias,m r , as heat maps in Fig.7.4. Unlike the synthetic data (Fig.7.3), we nd infeasible regions, shown as grey in Fig. 7.4. The lower the maximum allowed treatment ratio r max , the larger the infeasible region. The existence of infeasibility regions is because, even without any treatment, there is a dierence in the mean of average test scores for county with a larger or smaller percentage of Blacks. If the constrain m y , score dierence between two groups of county, is set to be too low and the maximum allowed treatment ratio r max is also low, the constraints cannot 115 be satised. On the other hand, we also notice that if armative action is allowed, we can assign more counties with high Black ratio to treatment and dramatically improve the mean outcome and also reduce the infeasible region. To further understand the bias in the data and the fair intervention we learned, we visualize the geographical distribution of data and the learned treatment assignment in Fig. 7.5. We rst plot the mean test score of counties and the ratio of Black household in Fig. 7.5(a){(b), respectively. We see that in the southern states, from Louisiana to North Carolina, there are counties with high ratio of Black households. Correspondingly, we also see that the mean test scores of those counties are lower than national average, probably due to chronic under-funding and racism. To illustrate the eect of armative action, we plot the learned optimal treatment assignments of two sets of parameters. First, we consider the case where we assume equal treatment opportunity. We use parameters m y = 0:25, r max = 0:4;m r = 0. Then for the case where armative action is allowed, we use parameters m y = 0:25, r max = 0:4;m r = 1:0. For both plots, we can see that counties in California, Illinois, Texas, and Georgia have a high probability of being assigned to treatment. This is because the causal tree model predicts that counties in those state have higher treatment eect (X). Importantly, comparing Fig.7.5(c) and (d), we see when armative action is allowed, the counties in southern states have a high probability of being assigned to treatment. The armative action treatment will not only improve the overall performance, but will also reduce performance dierence between counties with high and low Black households. 7.5 Discussion In this paper, we develop intervention policies that both improve desired outcomes and increase equality in treatment across protected subgroups. To do so, we rst proposed novel metrics, bias in treatment opportunity and bias in treatment outcome, to quantify the bias of 116 a treatment policy. Then we provide an algorithm that oers the best policies, conditional on the trade-o between policies that maximize outcomes and fairness. Furthermore, we show that there is a trade-o between the two proposed fairness metrics. Allowing armative action can improve both overall outcome and equality of treatment outcome, but it requires preferential treatment assignment for certain subgroups. Consistent behaviors have been observed for both synthetic and real-world data. While this methodology oers substantial benets to policy-makers, our work still has limitations. First, these methods need to be explicitly applied to dierent HTE estimation algorithms [143, 9, 79] in order to understand the best overall trade-o that may be invisible with a non-optimal model. 
Furthermore, there is an open question of how Bayesian networks [113], which model the pathways of causality, relate to algorithms that model heterogeneous treatment eects. Future work must therefore explore how fair policies created via causal models relate to potentially fair policies created by Bayesian networks. 117 Chapter 8 Conclusion and Ongoing Work In this dissertation, we have discussed several topics. First, we talked about linear and deep learning methods for making fair predictions and creating invariant representations. The linear method preprocesses data by projecting features to a hyperplane orthogonal to sensitive features. The method is highly ecient and requires no gradient descent training. One crucial advantage is that the preprocessed features are interpretable. Such an inter- pretable method can open up new avenues for empirical studies and visualization in social science, epidemiology and public health. We are also trying to combine it with distributed machine learning such as federated learning [95]. On the other hand, the nonlinear method is based on a well-known concept, kernels or MMD. Our results show that even such sim- ple method can produce competitive results using signicantly less computational resources. Our proposed method can be a good candidate for the pre-training step of more sophisticated deep learning models. Both of the methods work well to remove bias, but in general, there are more denitions of fairness [81, 142, 98] in social science which need to be addressed. Furthermore, on the theory side, as suggested in Eq. 1.2, any invariant condition can be expressed as a conservation condition under certain expectations (or transformation). With the help of group representation theory, some general properties of invariant representations can be obtained. This may complement the widely used information theory based methods or adversarial approaches. There have been related works combining symmetry groups and neural networks [47, 132]. 118 The confusion-based change detection method proposed here is quite promising according to the experimental results. The method can handle various types of data and remains sensitive and robust when the signal is relatively weak (subtle changes or change only aect a small portion of data). Currently, we are applying this method to social media data in a wider time range, hoping to identify more hidden events and have a better understanding of the general principles governing the dynamics of social media. The main weakness is that this method is by design an oine change detection method, which means that in order to have reliable results, one must obtain a certain amount of data after the change point. There are several ways to improve. The rst and the most obvious one is to redesign it such that online change detection [1] is supported. Another way to expand it is to use it as a tool to identify multiple modes (phase) of a dynamical system and automatically construct predictors for each of the modes. This will enable self-supervised continual learning [158], where a model can learn from continuous data ow and automatically discover patterns. As for the fair intervention/policy design, our proposed method is the rst of its kind at the time this thesis is being written. We hope our method can be used by government agencies to help them design better policies. With better tools to analyze observational data, government agencies can ne tune the policies to make best use of limited public re- sources. 
This is a far better approach than design policies using empirical evidence and macroscopic statistic results. In the meantime, we should also try to improve our method to include ecient optimization algorithms that can work with general back-box HTE estima- tion methods. Furthermore, from a social science perspective, more denitions of fairness should be explored such that our method can satisfy the demands of real-world policies. Handling biased data has become a critical research topic, as companies and government agencies have grown more reliant on real-world data and as automated decision systems with AI that use the data have become more prevalent. No data is perfect, since there is always noise and errors in real-world observations. Without robust ways for controlling bias, AI 119 can lead to misleading results or even catastrophic errors. We hope the contributions in this thesis can inspire more studies in this area and help free AI from inheriting human biases. 120 Bibliography [1] Ryan Prescott Adams and David J.C. MacKay. 2007. Bayesian Online Changepoint Detection. arXiv preprint: 0710.3742 (2007). [2] Nazanin Alipourfard, Peter Fennell, and Kristina Lerman. 2018. Using Simpson's paradox to discover interesting patterns in behavioral data. In Proceedings of the In- ternational AAAI Conference on Web and Social Media, Vol. 12. [3] Sarah A Alkhodair, Steven HH Ding, Benjamin CM Fung, and Junqiang Liu. 2020. Detecting breaking news rumors of emerging topics in social media. Information Pro- cessing & Management 57, 2 (2020), 102018. [4] Mamoun Almardini, Ayman Hajja, Zbigniew W Ra s, Lina Clover, David Olaleye, Youngjin Park, Jay Paulson, and Yang Xiao. 2015. Reduction of readmissions to hospitals based on actionable knowledge discovery and personalization. In Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. Springer, 39{55. [5] David Altimira, Jenny Clarke, Gun Lee, Mark Billinghurst, Christoph Bartneck, et al. 2017. Enhancing player engagement through game balancing in digitally augmented physical games. International Journal of Human-Computer Studies 103 (2017), 35{47. [6] Julia Angwin, Je Larson, Surya Mattu, and Lauren Kirchner. 2016. Ma- chine Bias { There's software used across the country to predict future crim- inals. And it's biased against blacks. https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing. [7] Sylvain Arlot, Alain Celisse, and Zaid Harchaoui. 2019. A Kernel Multiple Change- point Algorithm via Model Selection. JMRL 20, 162 (2019), 1{56. [8] Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal eects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353{7360. [9] Susan Athey, Julie Tibshirani, and Stefan Wager. 2019. Generalized random forests. The Annals of Statistics 47, 2 (2019), 1148{1178. [10] James Atwood, Hansa Srinivasan, Yoni Halpern, and David Sculley. 2019. Fair treat- ment allocations in social networks. arXiv preprint arXiv:1911.05489 (2019). 121 [11] Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic. 2014. Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models. In 2014 IEEE Conference on Computer Vision and Pattern Recogni- tion. IEEE, 3762{3769. [12] Jarred Barber. 2015. A generalized likelihood ratio test for coherent change detection in polarimetric SAR. IEEE Geoscience and Remote Sensing Letters 12, 9 (2015), 1873{1877. 
[13] Jean-Marc Bardet, William Chakry Kengne, and Olivier Wintenberger. 2010. Detecting multiple change-points in general causal time series using penalized quasi-likelihood. arXiv preprint arXiv:1008.0054 (2010).

[14] Elias Bareinboim and Judea Pearl. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113, 27 (2016), 7345–7352.

[15] Leonard E. Baum and Ted Petrie. 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Ann. Math. Statist. 37, 6 (12 1966), 1554–1563. https://doi.org/10.1214/aoms/1177699147

[16] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. 2017. A convex framework for fair regression. arXiv preprint arXiv:1706.02409 (2017).

[17] Joseph Berkson. [n.d.]. Limitations of the Application of Fourfold Table Analysis to Hospital Data. Biometrics Bulletin ([n.d.]).

[18] David M Blei and John D Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning. 113–120.

[19] Jacob Bor, Ellen Moscoe, Portia Mutevedzi, Marie-Louise Newell, and Till Bärnighausen. 2014. Regression Discontinuity Designs in Epidemiology: Causal Inference Without Randomized Trials. Epidemiology 5 (2014), 729–737.

[20] Jacob Bor, Ellen Moscoe, Portia Mutevedzi, Marie-Louise Newell, and Till Bärnighausen. 2014. Regression discontinuity designs in epidemiology: causal inference without randomized trials. Epidemiology (Cambridge, Mass.) 25, 5 (2014), 729.

[21] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research 14, 1 (2013), 3207–3260.

[22] Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.

[23] David Card and Alan B. Krueger. 1993. Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania. NBER Working Paper No. 4509 (1993).

[24] M. Chan, T. O'Connor, and S. Peat. 2016. Using Khan Academy in Community College Developmental Math Courses. Technical Report. New England Board of Higher Education. s3.amazonaws.com/KA-share/impact/Results_and_Lessons_from_DMDP_Sept_2016.pdf

[25] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM TIST 2, 3 (2011), 27.

[26] Emily Chen, Kristina Lerman, Emilio Ferrara, et al. 2020. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health and Surveillance 6, 2 (2020), e19273.

[27] Irene Chen, Fredrik D Johansson, and David Sontag. 2018. Why is my classifier discriminatory? arXiv preprint arXiv:1805.12002 (2018).

[28] Zhengxing Chen, Truong-Huy D Nguyen, Yuyu Xu, Christopher Amato, Seth Cooper, Yizhou Sun, and Magy Seif El-Nasr. 2018. The art of drafting: a team-oriented hero recommendation system for multiplayer online battle arena games. In Proceedings of the 12th ACM Conference on Recommender Systems. 200–208.

[29] Silvia Chiappa. 2019. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7801–7808.

[30] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163.
[31] Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency. 134–148.

[32] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).

[33] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. 2012. Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 3642–3649.

[34] Mark Claypool, Artian Kica, Andrew Manna, Lindsay O'Donnell, and Tom Paolillo. 2017. On the Impact of Software Patching on Gameplay for the League of Legends Computer Game. The Computer Games Journal 1, 6 (2017), 33–61.

[35] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 797–806.

[36] Amanda Coston, Alan Mishler, Edward H Kennedy, and Alexandra Chouldechova. 2020. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 582–593.

[37] Shai Danziger, Jonathan Levav, and Liora Avnaim-Pesso. 2011. Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences 108, 17 (2011), 6889–6892.

[38] Kate Donahue and Jon Kleinberg. 2020. Fairness and utilization in allocating resources with uncertain demand. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 658–668.

[39] Julia Dressel and Hany Farid. 2018. The accuracy, fairness, and limits of predicting recidivism. Science advances 4, 1 (2018), eaao5580.

[40] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

[41] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In ITCS. ACM, 214–226.

[42] Hadi Elzayn, Shahin Jabbari, Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, and Zachary Schutzman. 2019. Fair algorithms for learning in allocation problems. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 170–179.

[43] Ezekiel J Emanuel, Govind Persad, Ross Upshur, Beatriz Thome, Michael Parker, Aaron Glickman, Cathy Zhang, Connor Boyle, Maxwell Smith, and James P Phillips. 2020. Fair allocation of scarce medical resources in the time of Covid-19. N Engl J Med (2020), 2049–2055. https://doi.org/10.1056/NEJMsb2005114

[44] James R Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. 2020. An intersectional definition of fairness. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1918–1921.

[45] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 1 (1997), 119–139.

[46] Piotr Fryzlewicz et al. 2014. Wild binary segmentation for multiple change-point detection. The Annals of Statistics 42, 6 (2014), 2243–2281.

[47] Robert Gens and Pedro M Domingos. 2014. Deep symmetry networks. Advances in neural information processing systems 27 (2014), 2537–2545.

[48] Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola, et al. 2007. A kernel statistical test of independence. In NIPS, Vol. 20.
Citeseer, 585–592.

[49] Justin Grimmer, Solomon Messing, and Sean J Westwood. 2017. Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods. Political Analysis 25, 4 (2017), 413–434.

[50] Umang Gupta, Aaron Ferber, Bistra Dilkina, and Greg Ver Steeg. 2021. Controllable Guarantees for Fair Outcomes via Contrastive Information Estimation. arXiv preprint arXiv:2101.04108 (2021).

[51] Halko et al. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv preprint arXiv:0909.4061 (2009).

[52] Isobel Asher Hamilton. 2018. Amazon built an AI tool to hire people but had to shut it down because it was discriminating against women. https://www.businessinsider.com/amazon-built-ai-to-hire-people-discriminated-against-women-2018-10.

[53] Lucas Hanke and Luiz Chaimowicz. 2017. A recommender system for hero line-ups in MOBA games. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

[54] Hannah Recht, Lauren Weber, and Rachana Pradhan. 2021. Stark Racial Disparities Persist in Vaccinations, State-Level CDC Data Shows. https://www.webmd.com/vaccines/covid-19-vaccine/news/20210520/racial-disparities-persist-in-vaccinations-cdc-data-shows.

[55] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413 (2016).

[56] Yuzi He, Keith Burghardt, and Kristina Lerman. 2020. A geometric solution to fair representations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 279–285.

[57] Yuzi He, Ken-ichi Nomura, Rajiv K Kalia, Aiichiro Nakano, and Priya Vashishta. 2018. Structure and dynamics of water confined in nanoporous carbon. Physical Review Materials 2, 11 (2018), 115605.

[58] William Herlands, Edward McFowland III, Andrew Gordon Wilson, and Daniel B Neill. 2018. Automated local regression discontinuity design discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1512–1520.

[59] Shohei Hido, Tsuyoshi Idé, Hisashi Kashima, Harunobu Kubo, and Hirofumi Matsuzawa. 2008. Unsupervised change analysis using supervised learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 148–159.

[60] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning. PMLR, 1989–1998.

[61] Ben Hutchinson and Margaret Mitchell. 2019. 50 years of test (un)fairness: Lessons for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 49–58.

[62] Robin Jacob, Pei Zhu, Marie-Andrée Somers, and Howard Bloom. 2012. A Practical Guide to Regression Discontinuity. MDRC.

[63] Ayush Jaiswal, Daniel Moyer, Greg Ver Steeg, Wael AbdAlmageed, and Premkumar Natarajan. 2020. Invariant representations through adversarial forgetting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4272–4279.

[64] Ayush Jaiswal, Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. 2018. Unsupervised adversarial invariance. arXiv preprint arXiv:1809.10083 (2018).

[65] Ayush Jaiswal, Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. 2019. Unified adversarial invariance. arXiv preprint arXiv:1905.03629 (2019).

[66] Julie Jiang, Emily Chen, Shen Yan, Kristina Lerman, and Emilio Ferrara. 2020.
Political polarization drives online conversations about COVID-19 in the United States. Human Behavior and Emerging Technologies 2, 3 (2020), 200–211.

[67] Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning. PMLR, 3020–3029.

[68] James E Johndrow, Kristian Lum, et al. 2019. An algorithm for removing sensitive information: application to race-independent recidivism prediction. The Annals of Applied Statistics 13, 1 (2019), 189–220.

[69] Nathan Kallus. 2017. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning. 1789–1798.

[70] Nathan Kallus and Angela Zhou. 2019. Assessing disparate impact of personalized interventions: identifiability and bounds. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 3426–3437.

[71] Mohammad Mahdi Kamani, Farzin Haddadpour, Rana Forsati, and Mehrdad Mahdavi. 2019. Efficient fair principal component analysis. arXiv preprint arXiv:1911.04931 (2019).

[72] Faisal Kamiran and Toon Calders. 2009. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication. IEEE, 1–6.

[73] Matt J Keeling and Andrew Shattock. 2012. Optimal but unequitable prophylactic distribution of vaccine. Epidemics 4, 2 (2012), 78–85.

[74] Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani. 2001. An online algorithm for segmenting time series. In Proceedings 2001 IEEE international conference on data mining. IEEE, 289–296.

[75] Rebecca Killick, Paul Fearnhead, and Idris A Eckley. 2012. Optimal detection of changepoints with a linear computational cost. J. Amer. Statist. Assoc. 107, 500 (2012), 1590–1598.

[76] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).

[77] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).

[78] Arie W Kruglanski and Icek Ajzen. 1983. Bias and error in human judgment. European Journal of Social Psychology 13, 1 (1983), 1–44.

[79] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences 116, 10 (2019), 4156–4165.

[80] Matt Kusner, Chris Russell, Joshua Loftus, and Ricardo Silva. 2019. Making decisions that reduce discriminatory impacts. In International Conference on Machine Learning. PMLR, 3591–3600.

[81] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in neural information processing systems. 4066–4076.

[82] EB Laber and YQ Zhao. 2015. Tree-based methods for individualized treatment regimes. Biometrika 102, 3 (2015), 501–514.

[83] Himabindu Lakkaraju and Cynthia Rudin. 2017. Learning cost-effective and interpretable treatment regimes. In Artificial Intelligence and Statistics. 166–175.

[84] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.

[85] Choong-Soo Lee and Ivan Ramler. 2017. Identifying and evaluating successful non-meta strategies in league of legends. In Proceedings of the 12th International Conference on the Foundations of Digital Games. 1–6.

[86] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009.
Meme-tracking and the dynamics of the news cycle. In KDD. 497–506.

[87] Quanzhi Li, Armineh Nourbakhsh, Sameena Shah, and Xiaomo Liu. 2017. Real-time novel event detection from social media. In 2017 IEEE 33rd international conference on data engineering (ICDE). IEEE, 1129–1139.

[88] Yujia Li, Kevin Swersky, and Richard Zemel. 2014. Learning unbiased features. arXiv preprint arXiv:1412.5244 (2014).

[89] Jason M Lindo, Nicholas J Sanders, and Philip Oreopoulos. 2010. Ability, gender, and performance standards: Evidence from academic probation. American Economic Journal: Applied Economics 2, 2 (2010), 95–117.

[90] Zachary C Lipton. 2018. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16, 3 (2018), 31–57.

[91] Bo Liu, Ying Wei, Yu Zhang, and Qiang Yang. 2017. Deep Neural Networks for High Dimension, Low Sample Size Data. In IJCAI. 2287–2293.

[92] Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. 2019. On the Fairness of Disentangled Representations. arXiv preprint arXiv:1905.13662 (2019).

[93] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard S Zemel. 2016. The Variational Fair Autoencoder. In ICLR.

[94] Anandi Mani, Sendhil Mullainathan, Eldar Shafir, and Jiaying Zhao. 2013. Poverty impedes cognitive function. Science 341, 6149 (2013), 976–980.

[95] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282.

[96] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019).

[97] Aditya Krishna Menon and Robert C Williamson. 2018. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency. 107–118.

[98] Vishwali Mhasawade and Rumi Chunara. 2020. Causal Multi-Level Fairness. arXiv preprint arXiv:2010.07343 (2020).

[99] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. 2018. Invariant representations without adversarial training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 9102–9111.

[100] Razieh Nabi, Daniel Malinsky, and Ilya Shpitser. 2019. Learning optimal fair policies. Proceedings of machine learning research 97 (2019), 4674.

[101] Razieh Nabi and Ilya Shpitser. 2018. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, Vol. 2018. NIH Public Access, 1931.

[102] Xinkun Nie and Stefan Wager. 2021. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108, 2 (2021), 299–319.

[103] Scott Niekum, Sarah Osentoski, Christopher G Atkeson, and Andrew G Barto. 2015. Online bayesian changepoint detection for articulated motion models. In 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1468–1475.

[104] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 6464 (2019), 447–453.

[105] Hüseyin Oktay, Brian J. Taylor, and David D. Jensen. 2010. Causal Discovery in Social Media Using Quasi-Experimental Designs.
In Proceedings of the First Workshop on Social Media Analytics (Washington D.C., District of Columbia) (SOMA '10). Association for Computing Machinery, New York, NY, USA, 1–9. https://doi.org/10.1145/1964858.1964859

[106] Matt Olfat and Anil Aswani. 2019. Convex formulations for fair principal component analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 663–670.

[107] Matthew Olson, Abraham J Wyner, and Richard Berk. 2018. Modern neural networks generalize on small data sets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 3623–3632.

[108] Cathy O'Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.

[109] E. S. Page. 1954. Continuous Inspection Schemes. Biometrika 41, 1-2 (06 1954), 100–115. https://doi.org/10.1093/biomet/41.1-2.100

[110] Ewan S Page. 1954. Continuous inspection schemes. Biometrika 41, 1/2 (1954), 100–115.

[111] E. S. Page. 1957. On problems in which a change in a parameter occurs at an unknown point. Biometrika 44, 1-2 (06 1957), 248–252. https://doi.org/10.1093/biomet/44.1-2.248

[112] John Palowitch and Bryan Perozzi. 2020. Debiasing Graph Representations via Metadata-Orthogonal Training. In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 435–442.

[113] Judea Pearl. 2009. Causality. Cambridge university press.

[114] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[115] Emma Pierson, Camelia Simoiu, Jan Overgoor, Sam Corbett-Davies, Daniel Jenson, Amy Shoemaker, Vignesh Ramachandran, Phoebe Barghouty, Cheryl Phillips, Ravi Shroff, et al. 2020. A large-scale analysis of racial disparities in police stops across the United States. Nature human behaviour 4, 7 (2020), 736–745.

[116] Vasanthan Raghavan, Aram Galstyan, and Alexander G Tartakovsky. 2013. Hidden Markov models for the activity profile of terrorist groups. The Annals of Applied Statistics (2013), 2402–2430.

[117] Alvin Rajkomar, Michaela Hardt, Michael D Howell, Greg Corrado, and Marshall H Chin. 2018. Ensuring fairness in machine learning to advance health equity. Annals of internal medicine 169, 12 (2018), 866–872.

[118] Hannah Recht and Lauren Weber. 2021. Black Americans are getting vaccinated at lower rates than White Americans. Kaiser Health News (2021).

[119] Guillem Rigaill. 2015. A pruned dynamic programming algorithm to recover the best segmentations with 1 to K max change-points. Journal de la Société Française de Statistique 156, 4 (2015), 180–205.

[120] Frank Rosenblatt. 1961. Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Technical Report. Cornell Aeronautical Lab Inc Buffalo NY.

[121] Proteek Chandan Roy and Vishnu Naresh Boddeti. 2019. Mitigating information leakage in image representations: A maximum entropy approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2586–2594.

[122] Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of Educational Psychology 66, 5 (1974), 688.

[123] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.

[124] Samira Samadi, Uthaipon Tantipongpipat, Jamie Morgenstern, Mohit Singh, and Santosh Vempala. 2018. The price of fair pca: One extra dimension. arXiv preprint arXiv:1811.00103 (2018).

[125] Anna Sapienza, Alessandro Bessi, and Emilio Ferrara. 2018. Non-negative tensor factorization for human behavioral pattern mining in online games. Information 9, 3 (2018), 66.

[126] Anna Sapienza, Palash Goyal, and Emilio Ferrara. 2019. Deep neural networks for optimal team composition. Frontiers in big Data 2 (2019), 14.

[127] Anna Sapienza, Yilei Zeng, Alessandro Bessi, Kristina Lerman, and Emilio Ferrara. 2018. Individual performance in team-based online games. Royal Society open science 5, 6 (2018), 180329.

[128] Mary K Serdula, Robert D Brewer, Cathleen Gillespie, Clark H Denny, and Ali Mokdad. 2004. Trends in alcohol use and binge drinking, 1985–1999: results of a multi-state survey. American journal of preventive medicine 26, 4 (2004), 294–298.

[129] Anuj K Shah, Sendhil Mullainathan, and Eldar Shafir. 2012. Some consequences of having too little. Science 338, 6107 (2012), 682–685.

[130] Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning. PMLR, 3076–3085.

[131] David Siegmund and ES Venkatraman. 1995. Using the generalized likelihood ratio statistic for sequential detection of a change-point. The Annals of Statistics (1995), 255–271.

[132] Tess E Smidt, Mario Geiger, and Benjamin Kurt Miller. 2021. Finding symmetry breaking order parameters with Euclidean neural networks. Physical Review Research 3, 1 (2021), L012002.

[133] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. 2019. Learning controllable fair representations. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2164–2173.

[134] Lu Tian, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. 2014. A simple method for estimating interactions between a treatment and a large number of covariates. J. Amer. Statist. Assoc. 109, 508 (2014), 1517–1532.

[135] Christopher Tran and Elena Zheleva. 2019. Learning triggers for heterogeneous treatment effects. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5183–5190.

[136] Charles Truong, Laurent Oudre, and Nicolas Vayatis. 2020. Selective review of offline change point detection methods. Signal Processing 167 (2020), 107299.

[137] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7167–7176.

[138] Evert PL Van Nieuwenburg, Ye-Hua Liu, and Sebastian D Huber. 2017. Learning phase transitions by confusion. Nature Physics 13, 5 (2017), 435–439.

[139] Hal R. Varian. 2016. Causal inference in economics and marketing. Proceedings of the National Academy of Sciences 113, 27 (2016), 7310–7315. https://doi.org/10.1073/pnas.1510479113

[140] Sahil Verma and Julia Rubin. 2018. Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness. 1–7.

[141] Davide Viviano and Jelena Bradic. 2020. Fair policy targeting. arXiv preprint arXiv:2005.12395 (2020).
[142] Julius von Kügelgen, Amir-Hossein Karimi, Umang Bhatt, Isabel Valera, Adrian Weller, and Bernhard Schölkopf. 2020. On the fairness of causal algorithmic recourse. arXiv preprint arXiv:2010.06529 (2020).

[143] Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.

[144] Qi Wang, Yi Yang, Zhengren Li, Na Liu, and Xiaohang Zhang. 2020. Research on the influence of balance patch on players' character preference. Internet Research (2020).

[145] Yanyan Wang, Vicki M Bier, and Baiqing Sun. 2019. Measuring and achieving equity in multiperiod emergency material allocation. Risk Analysis 39, 11 (2019), 2408–2426.

[146] Alan Willsky and H Jones. 1976. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic control 21, 1 (1976), 108–112.

[147] Alan S Willsky and Harold L Jones. 1974. A generalized likelihood ratio approach to state estimation in linear systems subjects to abrupt changes. In 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes. IEEE, 846–853.

[148] R. C. Wilson, M. R. Nassar, and J. I. Gold. 2010. Bayesian online learning of the hazard rate in change-point problems. Neural computation 22, 9 (2010), 2452–2476. https://doi.org/10.1162/NECO_a_00007

[149] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. 2017. Controllable invariance through adversarial feature learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 585–596.

[150] Yuxiang Xie, Nanyu Chen, and Xiaolin Shi. 2018. False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 876–886.

[151] Xiang Xuan and Kevin Murphy. 2007. Modeling changing dependency structure in multivariate time series. In Proceedings of the 24th international conference on Machine learning. 1055–1062.

[152] Ming Yi and Achla Marathe. 2015. Fairness versus efficiency of vaccine allocation strategies. Value in Health 18, 2 (2015), 278–283.

[153] Yixin Wang, Dhanya Sridhar, and David M. Blei. 2019. Equal Opportunity and Affirmative Action via Counterfactual Predictions. arXiv preprint arXiv:1905.10870 (2019).

[154] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2017. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web. 1171–1180.

[155] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2017. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics. PMLR, 962–970.

[156] Achim Zeileis, Torsten Hothorn, and Kurt Hornik. 2008. Model-based recursive partitioning. Journal of Computational and Graphical Statistics 17, 2 (2008), 492–514.

[157] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In International conference on machine learning. PMLR, 325–333.

[158] Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic intelligence. In International Conference on Machine Learning. PMLR, 3987–3995.

[159] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018.
Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 335–340.

[160] Lu Zhang, Yongkai Wu, and Xintao Wu. 2016. A causal framework for discovering and removing direct and indirect discrimination. arXiv preprint arXiv:1611.07509 (2016).

[161] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV).

[162] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.
Abstract
This thesis discusses the challenges of mining heterogeneous and biased data and presents novel methods for doing so. Learning the structure of data enables a deeper understanding of system behavior, allowing for better predictions and decisions. However, in applications in the natural and social sciences, the data are often heterogeneous and biased by hidden relationships.

We first address fair predictions and invariant representations learned from biased data. Previous methods range from simple constrained linear regression to deep learning models: constrained linear regressions are too restrictive, while deep learning models are not interpretable and often rely on expensive adversarial training. To address these challenges, we first present a linear method that creates interpretable features that are also fair, i.e., independent of sensitive features that may bias predictions. The method preprocesses data by projecting it onto a hyperplane orthogonal to the sensitive features. We then introduce a kernel-based nonlinear method that can be applied to both supervised and unsupervised learning tasks without adversarial training.

Next, we show how to identify changes in heterogeneous data. Previous models rely on likelihood functions or kernels, which restricts their use on large, high-dimensional data. Our method is self-supervised and inspired by phase transitions in physical systems. It extends a previous approach with a mathematical model that robustly and accurately identifies changes from the accuracy variation of any supervised learning model of choice. We show how this method can be applied to mining text data in social media, and we include empirical studies of League of Legends performance data on how changes affect individual players.

Finally, we show how to design interventions that actively improve fairness and maximize overall benefit using limited resources. Previous methods were restricted to idealized cases and cannot be applied to real-world observational data. We introduce fairness metrics into causal inference and propose an algorithm that gives the optimal policy under constraints on fairness and resources.
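To make the projection step described above concrete, here is a minimal sketch, assuming the sensitive attributes are available as numeric columns; the function name and the plain least-squares residualization are illustrative choices, not the exact implementation used in the thesis.

```python
import numpy as np

def project_out_sensitive(X, Z):
    """Return the feature matrix X with the components lying in the span of the
    (centered) sensitive features Z removed, i.e. X projected onto the hyperplane
    orthogonal to Z. No gradient-based training is involved."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    # Least-squares fit of each feature on the sensitive features,
    # then subtract the explained part (the projection onto span(Zc)).
    B, *_ = np.linalg.lstsq(Zc, Xc, rcond=None)
    return Xc - Zc @ B

# Toy usage: after the projection, each fair feature is uncorrelated with Z.
rng = np.random.default_rng(1)
Z = rng.integers(0, 2, size=(500, 1)).astype(float)    # e.g., a sensitive group indicator
X = np.hstack([Z * 2.0 + rng.normal(size=(500, 1)),    # feature correlated with Z
               rng.normal(size=(500, 1))])             # feature independent of Z
X_fair = project_out_sensitive(X, Z)
print(np.corrcoef(X_fair[:, 0], Z[:, 0])[0, 1])        # approximately 0
```

In this sketch, each projected feature is uncorrelated with the sensitive columns, so a linear predictor cannot recover the sensitive attribute from the projected features alone, while the features themselves remain directly interpretable.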
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Probabilistic framework for mining knowledge from georeferenced social annotation
Modeling and predicting with spatial‐temporal social networks
Emergence and mitigation of bias in heterogeneous data
Robust causal inference with machine learning on observational data
Designing data-effective machine learning pipeline in application to physics and material science
Learning controllable data generation for scalable model training
Invariant representation learning for robust and fair predictions
Learning to diagnose from electronic health records data
Information geometry of annealing paths for inference and estimation
Alleviating the noisy data problem using restricted Boltzmann machines
Fair Machine Learning for Human Behavior Understanding
Imposing classical symmetries on quantum operators with applications to optimization
Efficient graph learning: theory and performance evaluation
Exploiting structure in the Boolean weighted constraint satisfaction problem: a constraint composite graph-based approach
Responsible artificial intelligence for a complex world
Understanding diffusion process: inference and theory
Learning distributed representations from network data and human navigation
Global consequences of local information biases in complex networks
Representation problems in brain imaging
Efficient machine learning techniques for low- and high-dimensional data sources
Asset Metadata
Creator
He, Yuzi (author)
Core Title
Learning fair models with biased heterogeneous data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Physics
Degree Conferral Date
2022-05
Publication Date
02/10/2022
Defense Date
11/17/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
AI fairness, causal inference, change detection, fair treatment, interpretable, invariant representation, linear projection, machine learning fairness, natural experiment, OAI-PMH Harvest, policy design, social media, topic shift
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Haas, Stephan (committee chair), Hen, Itay (committee member), Lerman, Kristina (committee member), Nakano, Aiichiro (committee member), ver Steeg, Greg (committee member)
Creator Email
yuzihe@usc.edu,yuzihe12@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110702250
Unique identifier
UC110702250
Legacy Identifier
etd-HeYuzi-10389
Document Type
Dissertation
Rights
He, Yuzi
Type
texts
Source
20220214-usctheses-batch-912 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu