ROBUST CAUSAL INFERENCE WITH MACHINE LEARNING ON OBSERVATIONAL DATA

by

Myrl Galinos Marmarelis

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2024

Copyright ©2024 Myrl Galinos Marmarelis

Even in reasoning upon some subjects, it is a mistake to aim at an unattainable precision. It is better to be vaguely right than exactly wrong.
—Carveth Read [274]

Acknowledgements

I am indebted to a number of individuals who guided me throughout the doctoral program. First I would like to thank my advisor Greg Ver Steeg, whose remarkable creativity and intellect allowed the ideas presented here to flourish. I would also like to thank my advisor Aram Galstyan, whose unremitting commitment, support, and trust allowed me to discover my passion. I thank my informal advisor and committee member Fred Morstatter for involving me on impactful projects, providing a stimulating environment, making exciting opportunities available, and ensuring forward progression in my journey. I must thank my previous collaborator and current external committee member Roger Ghanem for believing in me early on and introducing me to many ideas that have become foundational to my thinking. I also thank the accomplished Shang-Hua Teng, who currently serves on my committee and has graciously made the time to brainstorm with me in the past. His undergraduate class in cryptography theory also helped spark my interest in academia.

I thank my early colleagues and current friends Rob Brekelmans, Hrayr Harutyunyan, Umang Gupta, Sami Abu-El-Haija, Kyle Reing, Elan Markowitz, and Neal Lawton for the intellectually fiery discussions and mentorship in the beginning of my program. I am likewise extremely grateful to my peers and collaborators Negar Mokhberian, Siyi Guo, Elizabeth Haddad, Alexander Spangher, Basel Shbita, Ashwin Rao, Zhivar Sourati, Abel Salinas, Carolyn Steinle, Parth Shah, and others for their intellect and encouragement. My doctoral program was greatly enriched by the generous mentorship and collaboration opportunities offered by Neda Jahanshad, Heinz-Josef Lenz, Abigail Horn, Jeffrey Brantingham, and other faculty members.

Many friends and family members contributed to my personal (and professional!) growth during this period. I am grateful for my friendship with Racquel Fygenson and Nina Singh, their thoughtful feedback throughout the doctoral program, and all our conversations that were especially vital during the COVID-19 pandemic. My gratitude also extends to Robert Bacon and Taehyun Kim for their longstanding friendship. It is impossible to overstate the role that my parents Vasilis Marmarelis and Melissa Orme played in my successes. The same holds for my twin brother and lifelong friend Zissis Marmarelis. Finally, I thank my partner in life Holly Huber, who without a doubt made any and all my achievements possible.

Table of Contents

Epigraph
Acknowledgements
Abstract
Chapter 1: Introduction
  1.1 Perspective
  1.2 Grand Motivating Example
    1.2.1 Scenario
    1.2.2 Takeaways
  1.3 Observational Causal Inference
    1.3.1 Potential Outcomes
    1.3.2 Notation
    1.3.3 Assumptions
    1.3.4 Identifiability
  1.4 Robustness to Hidden Confounding
    1.4.1 Contributions
  1.5 Robustness to Observed Confounding
    1.5.1 Contributions
  1.6 Organization
Chapter 2: Representation Learning in High-dimensional Observational Data: A Study on Gene Coexpression in Single-cell Transcriptomics
  2.1 Introduction
  2.2 Methods
    2.2.1 The Inner Bandwidth
    2.2.2 Using the Kernel
    2.2.3 Community Detection
    2.2.4 Gene Expression Programs (GEPs)
    2.2.5 Local Distributional Segmentation (LDS)
    2.2.6 Dimensionality Reduction
    2.2.7 Differential Effects of Latent Variables
    2.2.8 Differentially Expressed Genes (DEGs)
  2.3 Results
    2.3.1 Case Study on Treatment of Colon Cancer
    2.3.2 Case Study on Clustering Phenotypes
    2.3.3 Case Study on Brain Structure
  2.4 Discussion
  2.5 Appendix A: Supplementary Methods
    2.5.1 Details Behind the Synthetic Example (Figure 2.2)
    2.5.2 More Details on q-Diffusion
    2.5.3 Implementation Details
    2.5.4 Theoretical Motivation
  2.6 Appendix B: Supplementary Results
    2.6.1 Details for Case Study on Treatment of Colon Cancer
    2.6.2 Details for Case Study on Clustering Phenotypes
    2.6.3 Details for Case Study on Brain Structure
  2.7 Appendix C: Proofs
Chapter 3: Partial Identification of Dose Responses with Hidden Confounders
  3.1 Introduction
    3.1.1 Related works
    3.1.2 Contributions
    3.1.3 Problem Statement
  3.2 Continuous Sensitivity Model
    3.2.1 The Complete Framework
    3.2.2 A Partial Approximation
    3.2.3 Propensity-trust Combinations
  3.3 Estimating the Intervals
  3.4 A Semi-synthetic Benchmark
  3.5 A Real-world Exemplar
  3.6 Discussion
  3.7 Appendix A: Completing the Derivations
  3.8 Appendix B: How to Calibrate the Weighing Scheme
  3.9 Appendix C: Correctness of Algorithm 1
  3.10 Appendix D: Details on the Benchmark
Chapter 4: Ensembled Prediction Intervals for Causal Outcomes Under Hidden Confounding
  4.1 Introduction
    4.1.1 Related Work
    4.1.2 Motivation
  4.2 Approach
    4.2.1 Sensitivity Analysis on Quantiles via Ensemble
  4.3 Estimation Properties
  4.4 Empirical Evaluations
    4.4.1 Classical Benchmark (IHDP)
    4.4.2 Novel Semi-synthetic Benchmark (PBMC)
    4.4.3 Novel Benchmark via GPT-4 (AITA)
  4.5 Discussion
  4.6 Appendix A: Algorithm
  4.7 Appendix B: Proofs
  4.8 Appendix C: Experimental Details
Chapter 5: Policy Learning for Localized Interventions from Observational Data
  5.1 Introduction
  5.2 Estimating Nudge Effects
    5.2.1 Problem Setup
    5.2.2 Our Approach
    5.2.3 Estimating the Propensity Ratio
  5.3 Learning Nudge Policies
    5.3.1 IB — Information Bottleneck Policy Learner
    5.3.2 OT — Optimal Transport Policy Learner
    5.3.3 IB+OT — Sinkhorn Policy Learner
    5.3.4 Generalizing Policy Prescriptions
  5.4 A Scalable Algorithm
  5.5 Estimation Properties
  5.6 Empirical Evaluations
    5.6.1 Semi-synthetic Policy Learning
    5.6.2 Experiment with Yelp & IRS Data
  5.7 Discussion
  5.8 Appendix A: Proofs
    5.8.1 Proof of Proposition 8
    5.8.2 Proof of Lemma 9
    5.8.3 Proof of Lemma 10
    5.8.4 Proof of Lemma 11
    5.8.5 Proof of Theorem 12
    5.8.6 Proof of Corollary 12.1
    5.8.7 Proof of Corollary 12.2
  5.9 Appendix B: Experimental Setup
    5.9.1 TCGA
    5.9.2 Yelp & IRS
Chapter 6: Causal Encouragement Effects
  6.1 Introduction
    6.1.1 Encouragements via Logits
    6.1.2 Related Work
    6.1.3 Potential Outcomes
  6.2 Robust Estimation
    6.2.1 Estimation Bias
  6.3 Encouragement Response Curves
  6.4 Online Encouragement Targeting
    6.4.1 Bandit Problem
    6.4.2 Case Study on Political Campaign Microtargeting
  6.5 Empirical Evaluations
    6.5.1 Prioritizing Interventions
    6.5.2 Pseudo-outcome Estimation
    6.5.3 The Encouraging Bandit
  6.6 Discussion
  6.7 Appendix A: Derivations and Proofs
    6.7.1 Proof of Proposition 13
    6.7.2 Proof of Proposition 14
    6.7.3 Derivation of Equation 6.4
    6.7.4 Derivation of Equation 6.6
  6.8 Appendix B: Experimental Details
  6.9 Appendix C: Extended Discussion on Campaign Microtargeting
Chapter 7: Conclusions
Bibliography

Abstract

The rise of artificial intelligence and deep learning has led to unprecedented capabilities in prediction. As these black-box algorithms are deployed in different parts of society, it is becoming increasingly clear that predictions alone do not always translate to enabling effective decisions, policies, or reliable forecasts in a changing world. What is often needed is a stronger understanding of a system than a predictive model of observations can offer. This deficit arises when attempting to predict the system's behavior in novel situations. Causal inference refers to a set of theoretical frameworks and practical methods for identifying cause-and-effect structures from data. Knowledge of this structure can help anticipate what would happen in a novel situation, like subjecting the system to intervention. Much work in causal inference is concerned with finding the minimal assumptions required to answer specific causal questions, like estimating the effect of a certain treatment. The more reasonable and relaxed the assumptions of a causal-inference method, the more applicable it is to diverse datasets and machine learning.

There are many methodological aspects to performing causal inference on observational data—that is, without the ability to perform experiments. Of fundamental significance is having workable representations of the system that can be learned from data.
Closely related to the quality of the representations is the ability to make downstream causal estimates robust to confounding. Confounders are components of a system that might confuse apparent relations between cause and effect, or treatment and outcome. In this dissertation, I propose methods for addressing these problems in challenging machine-learning contexts. I introduce an improved representation of single-cell RNA sequencing data for inference tasks in medicine and biology (Chapter 2). Looking for high-dimensional interactions in biological processes leads to better resolution of phenotypes. More broadly, I make numerous contributions towards increased robustness of machine learning to hidden or observed confounding. I address sensitivity of dose-response curves to hidden confounding (Chapter 3), prediction of interventional outcomes under hidden confounding (Chapter 4), robust effect estimation for continuous-valued and multivariate interventions (Chapter 5), and estimation for interventions that might only encourage treatment as a function of susceptibility (Chapter 6).

Chapter 1
Introduction

Machine learning has blossomed as a discipline over the past decades [192], growing out of earlier ideas of cybernetics [139], [346] from the previous century. What separates machine learning from other forms of engineering is its focus on harnessing computational resources and large datasets to build increasingly complex predictive models. As of the time of writing this manuscript, the scientific community is coming to terms with the broader implications of generative artificial intelligence [149], encapsulating recent advances in deep learning like large language models [e.g. 2] seemingly capable of reasoning with text. These generative models with billions of parameters are enabled by massive amounts of largely unstructured data collected from the internet and digital archives. By training to predict patterns in human-generated media, they can produce such realistic outputs that there is considerable debate around their surprising emergent abilities [297] to reason about the world [372]. These models are pushing the limit of what can be learned from mere observations of an exceedingly complex system, like the corpus of all recent text produced by humans. Can thought processes and circumstances behind the text be inferred? Questions remain—many statistical in nature—about the extrapolative inferences that can be made from observational data, when it is large or even infinite [258]. Can cause and effect be inferred? Can predictions be made about novel situations?

This dissertation is concerned with the paradigm of using machine learning to predict the outcomes of new interventions on a system of interest, having only passively observed it. The methods developed in this scope generally fall under causal inference [256], which is a multidisciplinary field that has attracted much attention within machine learning [47]. The causal characterization of an inference method signifies special care for the causal mechanisms underlying a system. In order to successfully predict the outcome of an intervention, a model must distinguish spurious correlations from directly causal relationships between variables. Causal inference has become increasingly relevant in machine learning as these predictive models are being used to make decisions, like crafting new policies or targeting interventions.
The works compiled in this manuscript aim towards robust machine learning for the causal effects of a wide class of interventions, not restricting them to be binary or discrete, while contending with identificational or statistical challenges inherent to real-world observational data.

1.1 Perspective

The rapid advances in machine learning have opened room for confusion in its relation to the scientific enterprise as a whole. It is important to clarify the philosophy pertaining to any methodological discussions around causal inference in this dissertation. The role of the intervention is central to causal inference [84], and we are primarily concerned with predicting outcomes of interventions [106]. Despite the "causal" terminology, in the scope of this dissertation we do not seek or claim to build models that uncover true causal structure, or that align with the actual governing rules of the system.

To do science is to build models [141] or theories for the physical world, evaluate them on their capacity to make novel but testable predictions, expose them to experimental validation [264], [265], and iterate or generate new hypotheses as part of a scientific community [189]. A scientific model must make novel predictions: it has to generalize outside of the observational conditions. This notion is also fundamental to machine learning in terms of out-of-sample performance. The latter, however, largely focuses on new observations of the same data-generating process, under identical conditions. Subfields of machine learning target performance in settings different from the training data, like transfer learning [280], domain generalization [210], and domain adaptation [209]. These are all intimately related to causal inference in machine learning, which can be further specified to the task of making predictions about novel interventions to the data-generating process. This goal makes the scientific merit of a model more explicit. Additionally, by evaluating performance on the basis of a specific class of interventions, the utility of the model is tied to a particular decision-making task.

1.2 Grand Motivating Example

An overarching motivation could help ground the scope of this dissertation. In this section, we introduce a rather specific illustrative example that shall be referenced throughout the remainder of the chapter. The inference problems associated with this example are emblematic of a broader theme of desiderata that guide the contributions presented in the subsequent chapters.

1.2.1 Scenario

We seek to design a drug that can treat a rare but lethal human disease. As is the case with most physiological phenomena, this disease interacts with multiple biological processes, which makes it difficult to simulate. It happens that past attempts to use animal models, isolated cells, or mechanistic computer models have led to unsuccessful drug candidates. The prevailing view of the medical community is essentially that one needs (to simulate) the whole human body to properly understand how a putative drug would affect disease progression. Moreover, the human body is far from being understood well enough to build such a simulation. The remaining option is to study the disease in actual humans. Experimentation for the sake of exploration is highly unethical, of course. To identify drug candidates for treatment, one is left with purely observational data from medical patients.
The challenge of simulating putative drug effects on the basis of observations from patients, without experimental interventions, can hardly be overstated. The sliver of hope in any such endeavor is that some of the variability in disease state across observations might be informative of what an intervention might do. The task becomes one of perusing the natural heterogeneity across individuals, environment, and the disease itself, in order to spot parts of the data that look like experiments. For instance, if the sample includes two similar patients for whom the disease progression mostly differs only in the expression of one gene, due to extrinsic factors and not some underlying confounder (to be elaborated later), then the difference in these patients' outcomes could be statistically informative of an intervention that affects that one gene.

What does it mean to find similar patients? For this analysis to succeed, one needs a way to characterize disease state that captures relevant factors for determining outcomes. One also needs a way to describe intervention-like conditions. In other words, there needs to be a delineation between the hypothetically pre-treatment disease state and the component that can vary by the intervention or other extrinsic factors. The interconnected dynamics of the disease process, including feedback loops, complicate this issue. However, a first step in addressing it is to measure as many details of each patient's disease as possible.

Recent breakthroughs in sequencing technologies allow single-cell measurements of extremely high granularity, some of which are discussed in Chapter 2. Single-cell RNA sequencing (scRNAseq) aims to measure each cell's expression of tens of thousands of genes. A sufficiently sophisticated statistical method for these high-dimensional data would promise to characterize a patient's tumor more comprehensively than ever before. The dimensionality of the statistical problem, combined with our incomplete mechanistic understanding of human physiology, calls for the use of machine learning to infer relations in the sample space. A machine-learning model that is tuned for predicting interventional outcomes should implicitly find relevant natural experiments in the data. The model could simulate interventions on different drug targets, like individual genes, if the pre-treatment disease state and the treatment are characterized properly. To do so properly is defined more formally in a causal-inference sense in the next section on potential outcomes. Intuitively, we ask, for our observation space, what a patient would look like with and without the novel intervention, with enough specificity to distinguish these cases from other sources of variability, keeping in mind the ultimate goal of attaining trustworthy outcome predictions.

1.2.2 Takeaways

The need to infer relevant components of the disease process parallels an emerging topic in causal inference: the quest to transform high-dimensional, noisy, and entangled observation vectors into latent macrovariables that facilitate interventional reasoning. This idea is generally promoted as causal representation learning by Schölkopf, Locatello, Bauer, et al. [300]. Chapter 2 proposes latent variable models, principally for scRNAseq data, with potential utility for causal inference. Chapters 3–6 address major challenges associated with inference on the kinds of interventions that would be encountered in this motivating example.
Abstractly, if an intervention were to act on gene expressions or some latent representation thereof, it would likely be untenable to describe the with-intervention and without-intervention observations by a simple binary (or categorical) switch in one variable. The intervention would likely take on a more nuanced structure, like a perturbation in a gene's expression or the turning of a knob in a biological process. It might also be multivariate. Chapters 3–6 present methods for making outcome predictions more robust when using machine learning in observational settings that are similar, in essence, to the example outlined here.

1.3 Observational Causal Inference

As revealed by the motivating example, there are many fundamental and often challenging aspects to valid causal inference from observational data. One aspect is the construction of a representation from data that clarifies the causal setting, by describing the variables that might affect each other, and understanding what needs to be estimated: which effect, and from which cause. Another aspect of the problem is the robustness of the estimation. The presence of confounding variables, for instance, can severely hinder valid effect estimation by obscuring a true causal effect with many other apparent effects. A simple statistical language is needed to formalize these core issues with minimal structure imposed on the data. For this purpose, we operate within the framework of potential outcomes.

1.3.1 Potential Outcomes

The concept of a potential outcome is first attributed to the renowned mathematician Jerzy Neyman [239], [289] in the form of "potential yields" for plots of land. It was further developed and heavily popularized by Rubin [286] and later works [135] that showed the utility of this framework for pragmatic causal inference. There are a number of paradigms under which one could study causation. A major distinction tends to be the level of expressivity one uses to describe a causal system. Many seminal works begin their analyses with structural causal models [255], [258] that fully specify the functional relationships between variables. Structural causal models make explicit all the relevant interventional and counterfactual distributions that are entailed by the system. Judea Pearl's foundational do-calculus [254] helps to concisely reason about these probability distributions. At a level of abstraction higher than structural causal models are causal graphs, which detail the dependence structure between variables and are sufficient for many theoretical results in causal inference [307]. The framework of potential outcomes condenses a causal graph into a simple structure that relates statistical estimands—quantities that can be estimated from the data—to causal quantities. Potential outcomes are suitable for many common causal settings, and are conducive to succinct discussions on general obstacles to inference like the presence of hidden confounders.

1.3.2 Notation

For the remainder of this introductory chapter, we rely on a simplified notation to highlight the basic ideas surrounding potential outcomes. Each subsequent chapter exposes the particular notation for its scope. The outcome is denoted as Y and tends to be univariate and real-valued for the purpose of simplicity. The treatment variable is T, on which the intervention takes place. When T is multivariate, resides in a continuum, or otherwise carries more information, statistical inference becomes more difficult.
This is especially the case when T interacts with other variables. The (possibly conditional) probability distribution of the treatment assignment is termed the treatment propensity. Potential outcomes are denoted as $Y_t$, indexed by $t \in \mathcal{T}$ and living in the same space as $T \in \mathcal{T}$. The observational data are a joint sample of Y, T, and other variables. The $Y_t$ are special latent variables that represent what the outcome Y would have been, had we intervened on the assigned treatment T to make it t. Hence, they can be described as interventional outcomes. We generally also consider a covariate X with few requirements on its structure. It can include confounders, or other variables that help predict outcomes or treatment assignments.

1.3.3 Assumptions

A basic motif in causal systems is the confounder, which refers to a variable (or group of variables) that affects both the treatment assignment and the outcome. Experiments are meant to control the assignment of treatments and break their dependency on confounders, thus helping to disentangle the effect of the treatment variable from the effect of other variables. Confounders must be handled by statistical means when working with mere observations of a confounded system. Other causal motifs include instruments, colliders, and mediators, but they are not the subject of this dissertation. Here, we list the basic formal assumptions that are crucial to the potential-outcomes framework [288]. Similar conditions are discussed in earlier works on causal inference like Cox [62] and Rubin [287].

Stable unit treatment value assumption (SUTVA). The role of potential outcomes in relation to the data is largely encoded by the SUTVA. It states that for every observational unit in the sample, there is exactly one potential outcome for every value that the treatment variable can take. The observed outcome must equal the potential outcome for the unit's assigned treatment. We note that treatment interference refers to a SUTVA violation when a unit is affected by the treatment of other units, leading to spillover effects between units [126]. In those cases, there is one potential outcome for every combination of the unit's own treatment and the interfering units' treatments.

Overlap or positivity. The overlap assumption simply states that for every unit, the probability of assignment for every treatment value must be nonzero (bounded above zero). The propensity must have support on $\mathcal{T}$. If not, there could be parts of the potential-outcome surface that are systematically absent from the observational data and cannot be inferred.

(Conditional) ignorability. The ignorability assumption states that there are no hidden confounders. To put it in terms of potential outcomes, it must hold that the treatment assignment is independent of the potential outcomes, possibly conditioning on observed confounders. Mathematically,

\[
Y_t \perp\!\!\!\perp T \;\text{(unconditional)}, \qquad \{Y_t \perp\!\!\!\perp T\} \mid X \;\text{(conditional)}, \qquad \forall\, t \in \mathcal{T}.
\]

The intuition behind this requirement is explained at length in Chapter 3. The idea is that if the outcome for a specific treatment depends on the currently assigned treatment, then there is some other factor linking them together in the causal system. A potential outcome should be inherent to the observational unit before treatment is even assigned, as it describes all the possibilities of what could happen after treatment. This assumption is arguably the most difficult to defend in practical applications of causal inference. With ignorability, a statistical model can adequately control for all relevant confounders.
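To make the preceding assumptions concrete, below is a minimal, self-contained simulation (our own illustration, not taken from the dissertation) in which ignorability holds only conditionally on an observed covariate: the naive contrast of outcomes is confounded, while adjusting for the covariate recovers the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A single observed confounder X drives both treatment assignment and outcome.
x = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-2.0 * x))       # P(T = 1 | X) depends on X
t = rng.binomial(1, propensity)
y = 1.0 * t + 3.0 * x + rng.normal(size=n)        # true causal effect of T is 1.0

# Naive contrast of observed outcomes ignores the confounding and is badly biased.
naive = y[t == 1].mean() - y[t == 0].mean()

# Conditioning on X (here with a correctly specified linear adjustment)
# restores ignorability and recovers the true effect.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coef[1]

print(f"naive: {naive:.2f}  adjusted: {adjusted:.2f}")  # naive lands far above 1.0
```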
1.3.4 Identifiability

Causal inference is fundamentally a missing data problem. Only one outcome out of all treatment possibilities for each unit is ever observed. The three core assumptions listed above describe sufficient conditions for the identification of a potential outcome using just observational data. The joint distribution of potential outcomes $\{Y_t\}_{t \in \mathcal{T}}$ is still unidentifiable, but as long as the causal estimand can be framed in terms of potential-outcome marginals, identification is feasible.

Consider the widely celebrated conditional average treatment effect (CATE), which is defined for binary treatments $\mathcal{T} = \{0, 1\}$. The CATE is defined as $\tau(x) \triangleq \mathbb{E}[Y_1 - Y_0 \mid X = x]$. By linearity of expectation, we can decompose $\tau(x)$ into the identifiable quantities $\mathbb{E}[Y_1 \mid X = x] - \mathbb{E}[Y_0 \mid X = x]$. By contrast, a conditional treatment-effect quantile would not be identifiable under typical potential-outcomes assumptions. This causal estimand refers to a conditional quantile of $Y_1 - Y_0$, which depends on the joint distribution of potential outcomes. It is not to be confused with conditional quantile treatment effects [159], which take the quantiles of individual potential outcomes.

The combination of SUTVA, positivity, and conditional ignorability leads to a characterization of a potential outcome in terms of the observables, and hence to identifiability of causal estimands:

\[
P(Y_t \mid X) = P(Y \mid T = t, X).
\]

Many recent papers on the subject of causal inference focus on violations of one or more of these assumptions, usually replacing them with relaxed versions, and attempting identification or partial identification. The latter refers to inferring a set of possible values for a causal estimand that are admitted by a causal setting too unwieldy to isolate a specific value. Partial identification is performed when point identification is impossible. Chapters 3 & 4 explore partial identification with hidden confounding—when there is a bounded violation to the ignorability assumption.
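As a concrete illustration of the decomposition $\mathbb{E}[Y_1 \mid X = x] - \mathbb{E}[Y_0 \mid X = x]$, here is a minimal plug-in sketch of CATE estimation: two outcome regressions, one per treatment arm, whose predictions are differenced. The model class and function names are our own choices for illustration, and validity as a causal estimate still rests on SUTVA, overlap, and conditional ignorability.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def plug_in_cate(x, t, y):
    """Estimate tau(x) = E[Y1 | X=x] - E[Y0 | X=x] by differencing two regressions."""
    mu1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1])  # E[Y | T=1, X]
    mu0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])  # E[Y | T=0, X]
    return lambda x_new: mu1.predict(x_new) - mu0.predict(x_new)

# Synthetic check: the true effect is heterogeneous, tau(x) = x[:, 0].
rng = np.random.default_rng(1)
x = rng.normal(size=(5_000, 3))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x[:, 0])))
y = x[:, 0] * t + x.sum(axis=1) + rng.normal(size=5_000)
tau_hat = plug_in_cate(x, t, y)
print(tau_hat(x[:5]), x[:5, 0])  # the estimates should roughly track x[:, 0]
```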
1.4 Robustness to Hidden Confounding

Potential outcomes allow a causal-inference problem to be formalized and then solved with statistics on observables. When analyzing gene expressions and health outcomes to estimate the hypothetical effect of a drug intervention, the listed assumptions like SUTVA, overlap, and ignorability clarify the criteria for valid inference. Conditional ignorability requires that the recorded covariates account for all confounders relevant to the chosen treatment variable. Inconveniences like an incomplete representation of the disease state could impact this criterion. It is only prudent to study the possible impact of some amount of hidden confounding on an effect estimate. Moreover, it is valuable to form decisions that are robust to some amount of hidden confounding. A straightforward approach could be to consider the worst-case hidden confounders for every unit. Another approach could be to prioritize units based on the level of possible impact from hidden confounders. Partial identification can help achieve these goals.

An unrestricted hidden confounder usually implies that identification is completely impossible. Some reasonable structural assumption placed on the hidden confounder could help narrow down the possibilities. This could take the form of a constraint on the strength of confounding. In a practice initiated by Bross [31] and Cornfield, Haenszel, Hammond, et al. [61], one could discuss the necessary strength of hidden confounding before an estimated causal effect completely disappears. This kind of analysis is called a causal sensitivity analysis because it attempts to measure the sensitivity of a causal estimate to possible violations of ignorability. To replace the ignorability criterion with a looser form that allows hidden confounding of a controlled strength, there needs to be a succinct description of confounding strength. A sensitivity model does this, and gives a looser ignorability criterion [284] with a parameter usually denoted as Γ ≥ 1 to control the strength of hidden confounding [317], with Γ = 1 being equivalent to ignorability.
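One widely used formalization along these lines, stated here only as an illustration from the sensitivity-analysis literature rather than as this dissertation's own model, is the marginal sensitivity model for a binary treatment. Writing $e(x) = P(T = 1 \mid X = x)$ for the nominal propensity and $e(x, y_1, y_0) = P(T = 1 \mid X = x, Y_1 = y_1, Y_0 = y_0)$ for the propensity that additionally conditions on the potential outcomes, the model with parameter $\Gamma \ge 1$ requires

\[
\Gamma^{-1} \;\le\; \frac{e(x, y_1, y_0)\,/\,\bigl(1 - e(x, y_1, y_0)\bigr)}{e(x)\,/\,\bigl(1 - e(x)\bigr)} \;\le\; \Gamma ,
\]

so that $\Gamma = 1$ recovers conditional ignorability and larger values of $\Gamma$ admit progressively stronger hidden confounding.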
1.4.1 Contributions

A sensitivity model is appropriate when it aligns with the general behavior of hidden confounders in the causal setting and can be used for partial identification that is tight enough to be useful. Chapter 3 proposes a sensitivity model for continuous-valued treatments. It discusses partial identification of causal dose-response curves using machine learning and the novel sensitivity model. Chapter 4 proposes a reframing of a wide class of sensitivity models that helps with predicting individual causal outcomes.

1.5 Robustness to Observed Confounding

Satisfying the ignorability assumption that enables identification does not mean that estimation will be simple or even feasible. A strong degree of observed confounding—or unobserved, but with observed proxies, so not truly hidden—could hinder machine learning for outcome prediction. Using observational data to infer interventional outcomes entails a domain-shift problem. The distribution of the training set is different from the distribution of the test set, where the treatment variable is disentangled from confounders. For instance, if a certain treatment value is rarely observed with a certain combination of confounding variables, a model trained on these data might place little emphasis on that combination, whereas for the purpose of evaluating predictions, that combination is just as important as other combinations. Reweighting the training sample could help, but it tends to increase the estimator's variance.

Regardless of how a model is trained, there is persistent concern for confounding bias in the predictions. Causal machine learning has recently adopted doubly robust estimation [48] for this reason. Stemming from semiparametric estimation theory [121], [169], a doubly robust method leverages two predictive models (termed "nuisance parameters" in frequentist statistics) to debias each other and arrive at a reliable causal estimate. To bridge machine learning with doubly robust estimation, the two required predictive models should be easy to learn given the data. Typically, one of the models predicts expected outcomes conditional on the treatment and covariates. The other is a propensity model, and takes different forms depending on the structure of the treatment and intervention. The propensity model has a simple and well-studied objective for binary treatments [170], but for estimating dose-response curves over a treatment continuum, the propensity model needs to learn the entire conditional density function of the treatment [57]. This artifact is a consequence of the increased difficulty in learning causal dose responses with confounding.
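For the simplest case of a binary treatment, the doubly robust recipe can be written compactly as the AIPW (augmented inverse-propensity-weighted) estimator sketched below. This is a standard construction shown for illustration; the particular models and names are our assumptions, and in practice the nuisance models would be cross-fitted. The continuous-treatment difficulty noted above enters exactly where the propensity appears: there, the probability e(x) would have to be replaced by a conditional density.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def aipw_ate(x, t, y):
    """Doubly robust (AIPW) estimate of the average treatment effect, binary T.

    The outcome models mu1, mu0 and the propensity model e debias one another:
    the estimate stays consistent if either nuisance model is well specified
    (under SUTVA, overlap, and conditional ignorability).
    """
    mu1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1]).predict(x)
    mu0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0]).predict(x)
    e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
    e = np.clip(e, 1e-3, 1 - 1e-3)  # guard against near-violations of overlap

    # Outcome-model prediction plus an inverse-propensity-weighted residual correction.
    psi = mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
    return psi.mean()
```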
1.5.1 Contributions

Chapter 5 proposes a notion of "localized interventions" that materialize as nudges or perturbations in multivariate continuous-valued treatment variables. The doubly robust estimation problem for localized interventions is more conducive to machine learning. Further, the policy learning problem [11] of assigning localized interventions to units for reliably maximizing the causal effect can be approximately solved with optimal transport [338]. Chapter 6 explores estimating the causal effect of simply encouraging the assignment of a binary treatment, even when not much is known about the form of the encouragements. A simple and relatively robust estimator is proposed.

1.6 Organization

The core of this dissertation in Chapters 2–6 is a compilation of studies that, with the exception of Chapter 6 [216], have been published individually in journals or conference proceedings. They have been reproduced in this manuscript, in whole or in part, with minor revisions for presentation and correctness. Chapter 2 was published in Communications Biology as Marmarelis, Littman, Battaglin, et al. [215] and focuses on the question of learning useful representations from scRNAseq data (§1.2) rather than causal inference per se. An earlier publication, Marmarelis, Ver Steeg, and Galstyan [218], falls under the theme of latent variables for challenging data modalities, but is outside the scope and not reproduced in this dissertation. Chapter 3 was published as Marmarelis, Haddad, Jesson, et al. [214]; Chapter 4 was published as Marmarelis, Ver Steeg, Galstyan, et al. [219]; Chapter 5 was published as Marmarelis, Morstatter, Galstyan, et al. [217]. These three papers on causal inference were selected for single-track oral presentation at their respective conferences: UAI 2023, CLeaR 2024, and AISTATS 2024.

Chapter 2
Representation Learning in High-dimensional Observational Data: A Study on Gene Coexpression in Single-cell Transcriptomics

Machine learning is often applied to data of extremely high dimensionality, like images, text, and audio. Many modalities of medical data are also high-dimensional. For instance, single-cell RNA sequencing (scRNAseq) datasets are becoming larger, cleaner, and more prevalent, inviting widespread adoption of machine learning for novel discoveries in human physiology. These developments carry the potential to aid in causal inference for previously untenable medical or biological problems, if the causal settings and queries can be properly specified. The introductory motivating example (§1.2) discusses challenges associated with this line of inquiry. This chapter explores one aspect of fundamental importance in bringing causal inference to scRNAseq: the identification of macro-variables that describe biological processes at a level abstract enough for discussing causal relations. The method of q-diffusion proposed here does not necessarily identify causal structure, but it does appear to produce representations of cells that are helpful for a variety of downstream prediction tasks. This study could inspire future methods for causal representation learning on scRNAseq, and is instructive on how to conceptualize useful representations.

2.1 Introduction

A cell's phenotype is determined largely by the proteins that it expresses. Though progress has been made on directly measuring proteins in single cells [167] (via proteomics), the full proteome is still an unwieldy proposition [123] due to the vast diversity in protein shapes and chemical properties. Thankfully, RNA transcripts correspond to proteins currently under production and offer important insights into cellular phenotypes. Modern advancements in single-cell RNA sequencing (scRNAseq) have led to consistent decreases in cost, enabling the characterization of complex biological structures—even with spatial resolution [212].

However, few computational methods exist to study the large combinatorial interactions between genes that form biological processes, which new scRNAseq datasets promise to capture [188]. The discrepancy between acquisition and analysis is palpable in the algorithms themselves: toolkits in the state of the art rely on combinations of aggressive feature selection [309], [310], dimensionality reduction [34], [268], [350], or marker-gene identification [238]. Each of these stages in the analysis pipeline discards a majority of the potential gene interactions available in the original scRNAseq data. Those steps are not without good reason: scRNAseq analysis suffers from the curse of dimensionality, where the number of genes is too great to study all of them together, even in the relatively large samples that are now feasible. Compounding this issue is the noise, both technical and physiological, exacerbated by the granularity of scRNAseq.

The q-diffusion method presented in this chapter enables scRNAseq analysis to extract higher-order structures from the data that other methods cannot. The general method hinges on a core novelty: a geometry of cells in transcriptomic observational space. Concretely, a q-diffused kernel function characterizes the transcriptional proximity of any two cells. This kernel supports arbitrary dimensionality, thus overcoming the "curse of dimensionality." Notably it exhibits a multiscale nature that reveals biological processes (noisily) diffused across many genes. Internally, the kernel accounts for interactions of high order by incorporating not only pairwise (bilinear) terms, but all possible combinations (trilinear, quadrilinear, ...) of variables in the data. These additional terms effectively shift the focus of the kernel to large-scale, possibly low-magnitude interactions of gene activities, as opposed to considering each gene's activity on its own. The additional benefits conferred by this q-diffused geometry include robustness to noise and sample efficiency.

This chapter ventures into three separate applications. The first (§2.3.1) demonstrates the propensity of q-diffusion for revealing biologically informative structure. We base that judgment on medical relevance, deemed through predictive capacity on downstream clinical outcomes in a phase III clinical trial. These results suggest precision guidance for the treatment of metastatic colorectal cancer (mCRC). The second case study (§2.3.2) benchmarks q-diffusion against the most popular scRNAseq clustering methods, where it discriminates IFN-γ stimulation in eight peripheral blood mononuclear cell (PBMC) subtypes more accurately. An additional benchmark assesses the unsupervised clustering of non-PBMC cells in small tissue samples from four human organs. Finally (§2.3.3), the q-diffused framework is harnessed in developing an unsupervised local distributional segmentation (LDS) technique (§2.2.5) to segment structural regions of the human cerebral cortex.

2.2 Methods

Common tasks in scRNAseq analysis include clustering, factorizing, and classifying the cells.
Respectively, these entail grouping cells based on a notion of similarity, identifying common components among cells such as gene expression programs (GEPs), and assigning phenotypic labels to cells. All such tasks benefit from, or even require, a way to quantify the relation of one cell to another with regards to their gene expressions. For all q-diffused tasks described in this chapter, the q-diffused kernel was evaluated between all pairs of cells, and then the adjacency matrix was symmetrified to produce a weighted, undirected graph of the cells. This graph supplemented downstream analysis in the manners summarized by Figure 2.1. Namely, performing community detection directly produced cell clusters. Indirectly, the graph was also used to further constrain the factorization of cells into expression programs, lowering the risk of underdetermination. More applications are described later.

The core of our approach is a coexpression geometry for the transcriptome that overcomes the curse of dimensionality. Its name, q-diffused, points to the heavy inspiration from Tsallis statistics, which build on smoothly deformed q-analogues to many classical functions [330]. The q-diffused geometry can be framed as a deformation of Euclidean geometry, stemming from a q-deformed Gaussian function. In effect, a q-diffused norm differs from the Euclidean norm by introducing several interaction terms of increasing order, up to the entire dimensionality of the vector space. The consequence of these additional terms is that interactions spanning many variables are weighed heavily. Borrowing on historical notation, we parametrize the deformations by a single parameter q, in the range 1 < q ≤ 2 for our context, where the limit q → 1 reconstructs the original function for each q-analogue. With $\alpha := q - 1 \in (0, 1]$, the q-diffused version of a Euclidean distance for a vector v becomes

\[
{}_q\lVert v \rVert^2 \;=\; \underbrace{v_1^2 + v_2^2 + v_3^2 + \cdots}_{\text{first order (Euclidean)}}
\;+\; \alpha \underbrace{\left(v_1^2 v_2^2 + v_1^2 v_3^2 + v_2^2 v_3^2 + \cdots\right)}_{\text{second order (pairs)}}
\;+\; \alpha^2 \underbrace{\left(v_1^2 v_2^2 v_3^2 + \cdots\right)}_{\text{third order}}
\;+\; \underbrace{\alpha^3(\cdots) + \alpha^4(\cdots) + \cdots}_{\text{higher order}}
\tag{2.1}
\]

From this perspective, α can be viewed as a discount factor on interactions of increasing order. Figure 2.1 illustrates these interaction terms. Realizations of the method are described below, and details are in §2.5.2.

Figure 2.1. Schematic of the mechanisms behind q-diffusion. When comparing two cells, the kernel fundamentally values expression differences that occur in many genes concurrently. It can enter and augment several common analyses: a) q-Diffusion facilitates nuanced phenotype resolution via community detection, as with the experiment in §2.3.2. b) q-Diffusion can regularize gene expression program (GEP) estimators like nonnegative matrix factorization (NMF), to promote statistical enrichment of gene ontologies (§2.3.1–2.3.2). c) Recent spatial scRNAseq modalities present a new opportunity for macro-segmentation based on cellular transcriptomics, like in the brain (§2.3.3). We present a local distributional segmentation (LDS) algorithm that relies on q-diffusion applied to maximum mean discrepancy (MMD), an established kernel-based statistic.

A plethora of nonlinear analytical methods rely on a kernel: a function that quantifies the proximity between two points in an observational space. The most common such kernel is the Gaussian kernel, sometimes called the radial basis function. The major benefit of drawing inspiration from the Tsallis framework is that it serves as a heuristic for constructing our q-diffused kernel.
Fundamentally, all q-analogues are based on the q-exponential function, a polynomial that approximates the exponential and grows or decays slower for q > 1:

\[
\exp_q(x) := \bigl[1 + (1 - q)x\bigr]_+^{\frac{1}{1-q}}, \quad x \in \mathbb{R},
\qquad\therefore\qquad
\log_q(y) := \frac{y^{1-q} - 1}{1 - q}, \quad y \in (0, \infty).
\tag{2.2}
\]

A kernel that decays by power law rather than exponentially makes up for the overall increase in distances after incorporating the interaction terms in Equation 2.1. Generally the Gaussian kernel takes on the form of $f(x) = \exp(-\beta x^2)$. In a multivariate setting, x is a vector norm. We shall outline the q-exponential's link with the interaction terms in x discussed above. First, observe the following identity for isotropic Gaussian functions in two dimensions, x and y: $\exp(-\beta(x^2 + y^2)) = \exp(-\beta x^2) \cdot \exp(-\beta y^2)$. This does not hold in the q-analogue. Specifically, we have instead

\[
\exp_q(-\beta x^2) \cdot \exp_q(-\beta y^2) = \exp_q\bigl((-\beta x^2) \oplus_q (-\beta y^2)\bigr),
\qquad \text{where } a \oplus_q b := a + b + (1 - q)ab,
\]

from which the interaction term, $(1 - q)ab = -\beta^2 (q - 1) x^2 y^2$, emerges within the q-exponential. The binary $\oplus_q$ operator is termed a q-sum. Applying the q-sum recursively generates all orders of interaction. The following perfectly recreates the q-diffused norm ${}_q\lVert \cdot \rVert$ of Equation 2.1, with $\alpha \mapsto (q - 1)$:

\[
\prod_{i=1}^{m} \exp_q(-v_i^2)
= \exp_q\bigl((-v_1^2) \oplus_q (-v_2^2) \oplus_q (-v_3^2) \oplus_q \cdots \oplus_q (-v_m^2)\bigr)
= \exp_q\bigl(-{}_q\lVert v \rVert^2\bigr).
\tag{2.3}
\]

The kernel above can be expressed as a product of univariate q-exponentials, or a single q-exponential with the q-sum quadratic form. The full q-diffused kernel incorporates two scaling terms, the inner and outer bandwidths, to control its behavior. The outer bandwidth breaks this duality of the multivariate kernel with the univariate-kernel product. Figure 2.2 as well as Supplementary Figures 2.10 & 2.11 highlight the desirable properties of this construction.

Definition 1. The q-diffused kernel on vector $v \in \mathbb{R}^m$, with outer bandwidth $\rho > 0$ and inner bandwidth $\varphi > 0$, is given by

\[
f(v) := \exp_q\!\left(-\frac{{}_q\lVert \varphi^{-1} v \rVert^2}{\rho^2}\right).
\]

We always employ the q-diffused kernel at an adaptive resolution, where the outer bandwidth ρ(k) is set to the kth nearest q-diffused norm in a cell's neighborhood. This kNN parameter is set from the scope of the problem; for instance, when performing community detection, this k takes on the same role as in the kNN graph construction that is common in other methods. Notably, our adaptive approach follows that of PHATE [234]. The kernel is truly anisotropic, in that ρ(k) depends on the origin point. So f(a, b) := f(b − a) with ρ(k) scaled by the neighborhood of a.
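The q-sum identity above suggests a direct way to evaluate the kernel numerically. Below is a minimal sketch of Definition 1 built from Equations 2.1–2.3; it is our own illustration with assumed names (not the released GPU implementation), and it omits the log-space arithmetic and the adaptive, per-cell choice of ρ described in the surrounding text.

```python
import numpy as np

def q_sum(a, b, q):
    # q-deformed addition: a (+)_q b = a + b + (1 - q) a b
    return a + b + (1.0 - q) * a * b

def q_exp(x, q):
    # q-exponential of Equation 2.2: [1 + (1 - q) x]_+^{1 / (1 - q)}
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_norm_sq(v, q):
    # q-diffused squared norm of Equation 2.1, accumulated via q-sums (Equation 2.3)
    acc = -v[0] ** 2
    for vi in v[1:]:
        acc = q_sum(acc, -vi ** 2, q)
    return -acc

def q_kernel(a, b, q=1.2, phi=1.0, rho=1.0):
    # Definition 1 with inner bandwidth phi and outer bandwidth rho;
    # in practice rho would be set adaptively from the kth-nearest q-diffused norm.
    v = (np.asarray(b) - np.asarray(a)) / phi
    return q_exp(-q_norm_sq(v, q) / rho ** 2, q)
```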
Figure 2.2 (q-Diffusion Improves Phenotype Resolution in Simulations). Simulated scRNAseq experiments for low-to-high technical/background noise and low-to-high diffusion across multiple genes of the phenotype separation. The deformed q-diffused geometry better recovers two different phenotypes, compared to a typical undeformed geometry. Performance under various settings is assessed via the adjusted Rand index (ARI) [129]. Details are in Supplementary §2.5.1, and even more illustrative figures in Supplementary §2.5.2.

2.2.1 The Inner Bandwidth

The inner bandwidth φ acts as a soft threshold for the magnitude of interactions. Gene differences exceeding φ in magnitude tend to cascade more strongly up the higher-order terms of the q-diffused norm (Equation 2.1). It is widely recognized that the variance (post-normalization) of expressions across a gene library is massively imbalanced: an instance of Pareto's principle at work in nature. This is the reason that common analysis pipelines perform aggressive feature selection based on variance or dispersion, and sometimes rescale the remaining genes to equalize their importance. One of the main motivations behind our work is the belief that those preprocessing steps are too crude for complex gene processes. First, most genes are discarded; second, the rescaling of those remaining inevitably distorts their processes. It is often seen as necessary in order to analyze low-expression but important genes like transcription factors [99]. Our approach avoids the aforementioned pitfalls via this inner-bandwidth mechanism, a knob that allows us to navigate the spectrum of multiscale processes. Roughly, expression magnitudes far above this knob are amplified and those below are discounted.

We devised a simple heuristic for choosing the inner bandwidth, and adhered to it for all the results presented. For each gene, we computed the average pairwise squared distance across cells. Then we picked a quantile of these gene scatters for the inner bandwidth. Since the variance is concentrated in just a few genes, ideally one could choose a relatively high quantile like 90% and still pay attention to most genes. However, for simplicity, we opted for the 50% (median) quantile. This endows us with a setting for the inner bandwidth prior to analysis.
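A minimal sketch of the heuristic just described, under the assumption of a dense cells-by-genes array and with our own variable names. The average pairwise squared difference of a gene across cells is, up to an n/(n−1) factor, twice that gene's variance, so the per-gene scatters can be computed without enumerating cell pairs.

```python
import numpy as np

def inner_bandwidth(expr, quantile=0.5):
    """Pick the inner bandwidth phi as a quantile of per-gene scatter.

    expr: array of shape (n_cells, n_genes), already filtered and normalized.
    """
    per_gene_scatter = 2.0 * expr.var(axis=0)  # ~ average pairwise squared difference per gene
    return np.quantile(per_gene_scatter, quantile)

# phi = inner_bandwidth(expr_matrix)           # median gene scatter, as in the text
```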
All the augmentations to existing methodologies that we explored involved some form of an affinity matrix $A \in \mathbb{R}^{n \times n}$, filled with kernel evaluations between pairs of data points in X. As with PHATE [234], we symmetrified the anisotropic matrix arithmetically:
$$\tilde{A}_{i,j} := f\big(x^{(i)}, x^{(j)}\big), \qquad A := (\tilde{A} + \tilde{A}^{\mathsf{T}})/2. \tag{2.4}$$
The recursive structure of Equation 2.1 revealed a divide-and-conquer algorithm, which was implemented and released as open-source software to compute these matrices while taking advantage of massively parallel GPUs using the CUDA platform. Numerical stability is maintained by performing the computations in a logarithmically transformed space.

2.2.3 Community Detection
Considering the affinity matrix as a weighted, undirected graph adjacency matrix (by subtracting the diagonal), we performed community detection on the basis of the state-of-the-art Leiden algorithm optimizing for modularity [79], [328]. This was benchmarked in §2.3.2.

2.2.4 Gene Expression Programs (GEPs)
We also adopted nonnegative matrix factorization (NMF), of demonstrated efficacy in scRNAseq studies [181]. The estimated basis vectors of such a method are often called gene expression programs (GEPs) or metagenes because they capture sparse sets of genes that express together and are likely coregulated. Each cell is deconstructed into a set of combination weights on the GEPs. This factorization often serves as the first step to data-driven cell phenotyping [208], [257]. For the sake of biological coherence, it is valuable to require a cell's GEP weights to be similar to those of nearby cells in the transcriptomic space. We consider, in particular, the affinity matrix that is induced by our q-diffused kernel. The structure encoded in this matrix is highly nonlinear, and even though the GEPs themselves are linear, we may attempt to guide them by the nonlinear structure [36] in order to improve downstream results. The resultant programs between qNMF and NMF GEPs were qualitatively similar, partly because they were always initialized with the same random seed, but mostly because NMF picked up the strong coexpression patterns. The most heavily weighted genes in each GEP were almost equivalent between their two versions. What differed were the precise weightings, which trickled into subsequent quantitative findings.

2.2.5 Local Distributional Segmentation (LDS)
We developed an approach to spatial segmentation of a tissue informed by spatially resolved transcriptomics for §2.3.3. Undoubtedly, an unsupervised technique could massively aid in the discovery of structures [13], [90] in tissue samples. Functional segmentation of a tissue must be performed at a scale coarser than that of the single cell, yet still informed by cellular heterogeneity. Still, the scale must be granular enough to preserve the intricacies revealed by MERFISH. We chose to operate on small neighborhood groupings of cells, termed rexels for region-level pixels. Concretely, each tissue sample was divided into an approximate Voronoi parcellization by repeated k-means, as sketched below. We hypothesized that categorizing rexels through their distributions of heterogeneous cells would provide stronger functional information than any neighborhood-level aggregation. A rexel is expected to be a sample of neurons, immune, and auxiliary cells, which together make up the functionality of that part of the brain.
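One plausible way to form rexels from a spatially resolved sample is k-means on the cells' spatial coordinates, whose centroids induce an approximate Voronoi parcellization. This is a hedged sketch only, not necessarily the exact "repeated k-means" scheme implemented in QDiffusion.jl; the function and parameter names are ours.

import numpy as np
from sklearn.cluster import KMeans

def assign_rexels(xy, n_rexels=64, seed=0):
    """Partition cells into rexels via k-means on the (n_cells, 2) spatial coordinates.
    Returns one integer rexel label per cell; cells sharing a label form one rexel."""
    km = KMeans(n_clusters=n_rexels, n_init=10, random_state=seed).fit(xy)
    return km.labels_

# Usage: group each rexel's expression profiles for downstream distributional comparisons.
# labels = assign_rexels(coords, n_rexels=96)
# rexels = [expressions[labels == r] for r in range(labels.max() + 1)]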
Our novel LDS approach entailed the computation of distances between all rexel pairs through the maximum mean discrepancy (MMD) [107], a recently popular kernel-based measure for comparing two multivariate samples [314]. Intuitively, to compare two sets of high-dimensional points, MMD averages the pairwise kernel values within each set and contrasts them with the averaged pairwise kernel values between the sets. We experimented with our q-diffused kernel and a more typical Gaussian baseline for MMD. Moreover, we compared with a simpler segmentation method on the basis of neighborhood aggregates, where rexels were represented by the gene expressions averaged over their constituent cells. 18 2.2.6 Dimensionality Reduction Researchers are rather concerned over the amount of unseen distortion of global-structure biological patterns in the most popular embedding methods for scRNAseq [42]. For this reason, t-SNE and UMAP are often relegated to mere visualization. A lesser known, albeit well founded and characterized [179] alternative is PHATE [234]. We augmented PHATE with the q-diffused kernel of Definition 1, symmetrified by Equation 2.4. We harnessed the q-diffused PHATE embedding for a task beyond mere visualization: to screen important genes in mCRC, and estimate a biomarker for clinical-trial patients. 2.2.7 Differential Effects of Latent Variables Groups of latent variables were tested in the case study on metastatic CRC (§2.3.1) for identifying transcriptomic interactions with treatments onto patient outcomes. The effects of the latent varibles were modeled as Cox proportional hazards [63]. Two sets of latent variables were identified with the help of q-diffusion. For each set, their regression coefficients on outcomes were contrasted between the treatment cohorts. Each pair of coefficients for the same latent variable, corresponding to its effects under the two treatments, was tested for a nonzero difference using the asymptotic normal approximation [5]. The z-tests were performed on the basis of variance estimates through observed Fisher informations, which is established practice [86], and then corrected for multiple testing [275]. Statistically significant differences, especially with opposing signs, signified differential effects from these latent variables. The dichotomy in an effect’s value would suggest a biomarker for a possibly causal interaction with a treatment, since treatment-cohort assignments were fully randomized. For qNMF, we limited our investigation a priori to 16 potentially novel latent variables. This amount was chosen as a round power of two, and probably the maximal supported by the sample size of 557 in the clinical trial. We avoided further explorations of different amounts as they would risk a loss of statistical power [87]. As some GEP weights transferred to patients were entirely zero or entirely nonzero, we also screened for GEP weights with sparsity no less than 5% and no greater than 95% for our tests of significance. The latent variables were inferred from an scRNAseq “atlas” reference dataset [257]. The stage-4 portion of the atlas was selected, matching the conditions of the patients in a clinical trial, so that insights from the atlas could be translated directly to the patients. This clinical trial recorded patient outcomes for two different treatments, alongside bulk RNA profiles from tumors. As presented in Figure 2.3a, the inferred latent variables consisted of two sets of 16 gene expression programs and branching gene 19 components. 
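The coefficient contrast described in §2.2.7 amounts to a two-sample z-test on per-cohort Cox coefficients, with variances taken from each fit's observed Fisher information and Benjamini-Hochberg correction across the latent variables. Below is a minimal, hedged sketch; the helper name and the use of statsmodels are our own choices, not the chapter's code.

import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

def differential_effects(beta_cet, se_cet, beta_bev, se_bev, alpha=0.05):
    """z-tests for a nonzero difference between a latent variable's regression coefficients
    estimated separately in the two treatment cohorts (standard errors would come from
    the observed Fisher information of each Cox fit)."""
    beta_cet, beta_bev = np.asarray(beta_cet, float), np.asarray(beta_bev, float)
    z = (beta_cet - beta_bev) / np.sqrt(np.asarray(se_cet, float)**2 + np.asarray(se_bev, float)**2)
    p = 2.0 * norm.sf(np.abs(z))                      # two-sided asymptotic normal approximation
    reject, p_adj, _, _ = multipletests(p, alpha=alpha, method="fdr_bh")
    return z, p_adj, reject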
Translation of atlas-inferred latent variables onto patients’ RNA profiles was performed by linear deconvolutions with the estimated single-cell programs or components. 2.2.8 Differentially Expressed Genes (DEGs) A multitude of techniques exist for identifying differentially expressed genes (DEGs). These approaches can transfer to the meta-genes [382] revealed by whole GEPs, yielding differentially expressed meta-genes (DEMGs). A recent benchmark on single-cell DEG identification suggested that the earth-mover’s distance (EMD) [237] offers the best tradeoff between precision and recall [341]. The Mann-Whitney U test (or Wilcoxon rank-sum test) is a simpler statistic often used for DEGs [296]. A study on bulk RNA sequencing provided some evidence that more complex techniques tend to exaggerate the false positives [198]. Hence, the EMD permutation tests and U-tests for differential expression facilitated assessments of GEP estimators for isolating DEMGs. The Benjamini-Hochberg [275] procedure corrected for multiple testing by controlling the false discovery rate (FDR). 2.3 Results Three diverse case studies are showcased on q-diffusion applied to human scRNAseq data. Each case enhances the findings of the original analyses by exploiting the full transcriptome. 2.3.1 Case Study on Treatment of Colon Cancer We investigated whether the full dimensionality of the transcriptome from the tumor microenvironment of metastatic colorectal cancer (mCRC) could reveal novel treatment opportunities. mCRC is extremely heterogenous not only from patient to patient but also between metastatic sites or even within a single location [33], [320]. In order to convincingly validate the biological and medical utility of q-diffusion, this section reports on whether the q-diffused structure discovered in an scRNAseq mCRC atlas could produce novel insights on existing records from a large clinical trial. We took two distinct approaches for discovering the q-diffused structure in a transferable representation. Both were unsupervised statistical estimators of latent variables in the transcriptome. We developed them as q-diffused counterparts to well-established techniques. These two approaches differed in objective. The first sought to represent cells in a small linear (nonnegative) basis that can be interpreted as gene expression programs (GEPs), via a q-diffused form of nonnegative matrix factorization (NMF) that we term qNMF. These GEPs, which are supposed to capture common biological processes or groups thereof, could easily be 20 transferred to other datasets by projecting new expression profiles onto them. The second approach aimed to map the cells in a low-dimensional Euclidean space to make phenotypic relations apparent. A number of embedding methods exist to construct this mapping, of which PHATE [234] is celebrated for its reliability in describing complex biological structures [179]. PHATE with a q-diffused kernel yielded a new kind of full-transcriptome embedding that we term qPHATE. In both avenues of investigation, we compared the q-diffused results to their analogues produced by standard methods without q-diffused augmentation. The qNMF and qPHATE representations of discovered q-diffused structure were validated for biological and medical utility by testing them out of sample, since that is the gold standard in machine learning. The procedure focused on downstream clinical relevance. 
We hypothesized, teleologically, that the transfer of discovered structures onto new patients in a clinical trial would facilitate strong statistical predictions of clinical outcomes only if those structures were useful and biologically coherent. We emphasize that the q-diffused structures were discovered without supervision, prior to the incorporation of any outcome or treatment information from the clinical trial.

Latent Variable Analysis on Metastatic Colorectal Cancer (mCRC)
(a) Schematic of the methodology for estimating latent variables in the scRNAseq atlas and then deconvolving them into the clinical-trial patient sample in order to assess their potential as biomarkers that inform clinical outcomes. The two outcomes investigated were progression-free survival (PFS) and overall survival (OS) in accordance with the clinical trial's protocol. Differential effects were measured by heterogeneity of multivariate Cox proportional hazards [63].

Marker    Resp.   Marker Effects          Drug Effects (bev:cet)
                  cet        bev          marker−   marker+   all
qNMF      PFS     1.97*      0.81*        1.00      0.58      0.94
          OS      1.97*      0.81*        1.29      0.71      1.14
qPHATE    PFS     1.26**     0.68**       1.26*     0.73*     0.94
          OS      1.05       0.72         1.36      0.99      1.14

(b) Hazard-ratio point estimates for patient biomarkers. Under "Marker Effects," we compare hazard ratios of biomarkers between cet and bev groups. Under "Drug Effects," we compare hazard ratios of bev to cet between biomarker groups. In both settings we test for significant differential effects between groups. In contrast with the q-diffusion results listed here, the structures uncovered by classical NMF and PHATE failed to produce biomarkers with any significant differential effects. Bold: FDR < 0.1; Bold*: FDR < 0.05; Bold**: FDR < 0.01.

(c) Kaplan-Meier estimates [164] of survivals with 95% confidence illustrating the identified differential marker effects under the two treatments (panels: OS under cet and bev stratified by qNMF−/qNMF+, and PFS under cet and bev stratified by qPHATE−/qPHATE+).

(d) The qNMF biomarker appears to help bev overall survival (OS) and hurt cet according to Figure 2.3b. A number of member genes in the GEP were individually associated with these differential outcomes, as determined by U-tests with FDR < 0.01. We plot the survivals (90% confidence intervals) stratified by upper and lower quartiles of expression (panel titles: "38 GEP Genes Helping bev" and "28 GEP Genes Hurting cet").

Figure 2.3. Schematic of our analysis pipeline for mCRC (a), accompanied by statistical results (b)–(d).

The Data
The Human Colon Cancer Atlas (c295) [257], which includes malignant and infiltrating immune cells, served as a reference scRNAseq dataset with 26,980 genes across the 17,362 cells matching in disease condition (stage 4) to the bulk RNA from the clinical trial. The latter was the Cancer and Leukemia Group B (CALGB) / Southwest Oncology Group (SWOG) 80405 randomized phase III trial in first-line mCRC patients treated with bevacizumab, cetuximab, or both, plus chemotherapy [136], [337]. The first two treatment arms are considered standard of care for newly diagnosed mCRC.
To discern drug-specific effects we sought differential outcomes between the treatment arms. The trial had bulk RNA profiles from 557 patients with 56,674 genes. The allocation of bevacizumab/cetuximab/both treatments was 227/207/123, with the third arm having been discontinued early. Bevacizumab and cetuximab are abbreviated as bev and cet, respectively. CALGB is now part of the Alliance for Clinical Trials in Oncology. The Statistical Evaluation First we assessed whether the drug (cet or bev) acted as an effect modifier [21], [69] on the biomarker (qNMF or qPHATE) for clinical outcomes. Conversely, we looked at the biomarker as an effect modifier on the drug. In either case we tested for differential effects of one binary variable between strata of the other binary variable. As mentioned in Figure 2.3a, we estimated multivariate Cox proportional hazards for progression-free survival (PFS) and overall survival (OS). The regressions included as covariates the type of chemotherapy, tumor side [235], sex, age, Eastern Cooperative Oncology Group (ECOG) performance score [1], and common tumor mutations. We excluded the minority of patients with high microsatellite instability (MSI-H) [263], who generally require different protocols altogether. In prior work, GEPs were estimated in the atlas by means of NMF [257], as is established practice [116], [181]. In this work we estimated qNMF and NMF GEPs and contrasted their affinity for differential effects. Cox regressions were performed on the nonnegative GEP weights to first identify the GEP with a significant differential effect at false discovery rate (FDR) below 0.05. We observed that qNMF produced one such GEP and NMF produced none. We then binarized the patients’ weights for that GEP based on sparsity: 0 for zero and 1 for nonzero, obtaining the qNMF biomarker. Analogously for qPHATE, the patients were projected to the scRNAseq latent space and then binarized, as detailed in Figure 2.4. These biomarkers allowed the production of the result table in Figure 2.3b. The strongly identified effects were also plotted as survivals in Figure 2.3c and the GEP genes individually associated with outcomes were further characterized in Figure 2.3d. Most notably for our proposal of q-diffusion, classical NMF or PHATE biomarkers were not 23 informative enough on clinical outcomes to produce a comparable Figure 2.3b with statistically significant differential effects in either setting—drug effects or marker effects. The qPHATE embedding coordinates transferred onto patients were thresholded along their principal component, shown in Figure 2.4c, for a simple binary biomarker that could be tested for hazards. We remark that clinical outcome-related findings were robust to this discretization procedure. Bootstrap resampling of the patients revealed that the strong differential hazard on PFS shown in Figure 2.3b remains strong (p < 0.05) for 95% of the simulated (projected, then thresholded) samples. On interpreting the major scRNAseq branches of Figure 2.4a, we note that they differ in 197/204 (97%) of the original GEPs [257], with U-test FDR < 10−3 , suggesting modulation of the whole tumor microenvironment. 24 Branching Phenotypes in Colon Cancer Atlas (a) q-Diffused and original PHATE [234] embeddings of the atlas cells. Plots are sized to their true aspect ratios. Cells are colored by their estimated local intrinsic dimensionality (LID), which highlights possible branching points. Branches are annotated in red. 
They are less clear in the original PHATE embedding. See Supplementary §2.6.1 for a rigorous investigation. Major branch #1 had about ∼19% of the cells, major branch #2 about ∼9% of the cells, and the minor branch accounted for ∼4% of the cells. (b) Major branches #1 & #2 were contrasted against each other to screen for genes that appear to drive the branching. Expressions of the top screened genes are displayed in this heatmap. (c) The scRNAseq embedding in Figure 2.4 was translated to the clinical-trial patients by linear projection. Then the principal component of that embedding was discretized around its mean for downstream analysis as a putative biomarker. Figure 2.4. Exploration of qPHATE and its branched genes revealed in the mCRC scRNAseq atlas. 25 2.3.2 Case Study on Clustering Phenotypes Distinguishing Complex Cell Conditions We sought to study the ability of the q-diffused kernel to discern phenotypes that are spread across many genes. PBMCs are common in scRNAseq benchmarks [34], [79], [238] due to their well-understood subpopulations. They are also studied often because they are involved in circulation, and are entangled with many diseases. We obtained an existing PBMC dataset [162] where a single batch contained cells from the same lineage, under two different but known conditions. The single-batch multiplexing [376] avoids the problem of disentangling significant batch-related noise from actual differences in cell conditions. The two conditions studied were stimulation and non-stimulation (control) by interferom gamma (IFN-γ), a cytokine known to induce complex changes in PBMCs through signaling pathways [34]. IFN-γ is involved in many distinct immune-related processes [22], and would be expected to modulate many groups of genes. For this reason, we postulated that q-diffusion would help describe IFN-γ stimulation across the different PBMCs, in contrast with other methods that rely on reduced dimensionality. The dataset contained 14,039 cells with a gene library of size 17,796. The cell population contained 7,466 control and 6,573 stimulated cells. Comparisons in Figure 2.5 assessed each method’s ability to distinguish stimulated and control PBMCs. They were scored on the basis of the adjusted Rand index (ARI) [129], [281] between detected communities and the true conditions. The ARI is a standard metric for assessing the quality of unsupervised clustering methods [79], by measuring coherence/alignment with ground-truth labels. Performance for megakaryocytes was not displayed because all scores were essentially zero. In most cases, the q-diffused graph structure outperformed the others. Notable exceptions include CD8 T cells, for which the existing methods performed better, and natural killer (NK) cells, for which performance was similar under a few methods. Sample-efficient Organ Tissue Classification The Tabula Sapiens Consortium [60] recently sequenced the single-cell transcriptomes of multiple human organs, and manually annotated the individual phenotypes. This atlas promises to facilitate understanding of intercellular dynamics across the human body [202]. Methods to cluster cells by phenotype are a crucial tool in the annotation pipeline. As the Tabula Sapiens annotations were verified manually by domain experts, they presented a benchmark for clustering that was fair, in that the annotations were putatively less biased towards the established clustering methods. 
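Both benchmarks score clusterings by their adjusted Rand index against held-out labels (stimulation condition for the PBMCs, expert annotations for Tabula Sapiens). Below is a minimal sketch of that scoring step, assuming communities have already been detected within each cell type; the variable names are hypothetical.

import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_by_cell_type(cell_types, communities, labels):
    """ARI between detected communities and ground-truth labels, computed per cell type."""
    cell_types = np.asarray(cell_types)
    scores = {}
    for ct in np.unique(cell_types):
        mask = cell_types == ct
        scores[ct] = adjusted_rand_score(np.asarray(labels)[mask],
                                         np.asarray(communities)[mask])
    return scores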
To highlight the sample efficiency of q-diffusion, we selected those organ tissues for which fewer than 10,000 cells were sampled. We also screened out the PBMCs in order to focus on organ-specific phenotypes—also because PBMCs are already well characterized by existing tools.

PBMC Clustering Benchmark
Figure 2.5. Adjusted Rand indices (ARIs) [281] of Leiden community detection [328] in the PBMC dataset (panels: B cells, CD14+ Monocytes, CD4 T cells, CD8 T cells, Dendritic cells, FCGR3A+ Monocytes, NK cells). Bars represent the alignment of clusters with IFN-γ stimulation and control conditions for each of the cell types. Error bars are computed from bootstrapped estimator standard deviations. Our q-diffused kernel-induced graph structure is compared to the neighborhood graph methods in Seurat [34], Monocle [268], Scanpy [350], and optimal transport [e.g. 131]. Clusters computed by SC3 [176] are included as well.

The Tabula Sapiens datasets were packaged with state-of-the-art dimensionality reductions, including scVI [205]. We elected to compare q-diffusion directly head-to-head against those embeddings, which were assumed to be optimized for their respective datasets. In Figure 2.6, q-diffusion shows improved clustering alignment with the manual annotations, in contrast with the other embeddings, for three of the four datasets. These organs were the skin (4,918 cells kept out of 9,424), trachea (6,894 cells kept out of 9,522), and uterus (6,154 cells kept out of 7,124); the exception was the liver (2,506 cells kept out of 5,007), which exhibited degraded clusters from q-diffusion.

Tabula Sapiens Clustering Benchmark
(a) ARIs of Leiden community detection with q-diffusion versus the state of the art in dimensionality reduction (PCA, scVI, UMAP) for Tabula Sapiens. Clusters were scored against the expert-assisted annotations in the four sans-PBMC small-sample organs (skin, trachea, uterus, liver).
(b) UMAP embeddings of the organ tissues colored by the unsupervised q-diffusion clusters.
Figure 2.6. The benchmark involving the Tabula Sapiens [60] human atlas.

Effect of q-Diffusion on Differentially Expressed Meta-Genes (DEMGs) for PBMCs, FDR < 0.05
(a) DEMGs between control and stimulated conditions are more unique to individual PBMC subtypes after q-diffused regularization is imposed on NMF. This effect is evidenced by the greater number of decreases (blue) in Jaccard similarity than increases (red) between the cell-type pairs of DEMGs, shown as heatmaps over the eight PBMC subtypes for the EMD and U-test estimators. Decreases account for 75% and 89% of the off-diagonal entries, respectively. In the diagonals, we list the actual numbers of q-diffused DEMGs. The EMD and U-test statistical estimators for DEMGs are defined in §2.2.8.
Gene Expression Programs (GEPs) with Enriched Biological Processes, Total FDR < 0.1
(b) Total number of GEPs with at least one statistically enriched biological pathway, for 64, 128, and 256 total GEPs. Competing NMF methods are shown for different amounts of GEPs, in powers of two. We analyzed qNMF, regularized NMF with a Seurat-style neighborhood graph (kNMF), typical NMF, and the recent Projective NMF (PNMF) [306].
Figure 2.7. Exploring the PBMC expression programs, derived by qNMF and competing algorithms.

We also contrasted the gene expression programs (GEPs) identified from q-diffused nonnegative matrix factorization (qNMF) versus NMF. There were numerous differentially expressed meta-genes (DEMGs, see §2.2.8) between stimulated and control cells for each subtype, under both methods. DEMGs are like differentially expressed genes (DEGs), but for GEPs that carry statistically different weight between the two conditions. In Figure 2.7a, we observe how DEMG commonality in cell-type pairs mostly decreased after q-diffused regularization. Overlap in DEMGs was measured through Jaccard similarity, which is normalized to the sizes of both sets. Figure 2.7a shows that the q-diffused DEMGs are more specific to particular cell types. It is common to study the statistical enrichment of gene ontologies [231] in GEPs as a way to validate their biological coherence [93]. Figure 2.7b tallies the number of enriched GEPs under competing methods and various settings. qNMF consistently outperforms the other three. The false discovery rate (FDR) was held below the critical threshold of 0.1. The threshold was higher than the more traditional 0.05 as the FDR spanned the combination of all ontologies and all GEPs.

2.3.3 Case Study on Brain Structure
Spatial transcriptomics are gaining immense traction in biological and translational research [186], [222]. For the purpose of this case study, we found ten samples of human cortical tissue extracted by MERFISH [89]. The immense spatial fidelity of this particular modality comes with a compromise in gene library size, limiting it to 4,000 genes in the human samples. MERFISH is enabled by robust error-correcting barcodes to multiplex these gene readings [374]. Other spatial scRNAseq modalities like Visium [20] have lower spatial resolutions for the trade-off of more genes. A rather challenging aspect of analyzing brain tissue is the spatial nonlocality at the cellular level. In concrete terms, adjacent cells in the tissue may serve vastly different roles, like those of neurons, immune cells, and astrocytes. Therefore, classifying individual cells provides little information on larger-scale structure in the tissue. It is well known that the cortex has distinct functional layers. In the study that introduced this dataset [89], cells could not be segmented by the known layers L1–L6 vis-à-vis their spatial transcriptomes. The original t-SNE visualization and clustering identified cell types, which were labeled by hand, but these only partly associated with specific layers.

In our experiments, we computed pairwise distance matrices between the small tissue regions termed rexels (§2.2.5). These distances either came from the novel method termed local distributional segmentation (LDS), explained in §2.2.5 (either q-diffused or Gaussian), or from k-Nearest Neighbors (kNN) on principal components of the rexels' average expressions.
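The rexel-to-rexel distance underlying LDS is the MMD introduced in §2.2.5. A hedged sketch of the (biased) estimator follows, usable with the q-diffused kernel sketched after §2.2.2 or with a Gaussian baseline; the function name and the illustrative usage are ours.

import numpy as np

def mmd_squared(X, Y, kernel):
    """Biased MMD^2 between two cell samples (rows = cells, columns = genes):
    mean within-X kernel + mean within-Y kernel - 2 * mean cross kernel."""
    k_xx = np.mean([kernel(a - b) for a in X for b in X])
    k_yy = np.mean([kernel(a - b) for a in Y for b in Y])
    k_xy = np.mean([kernel(a - b) for a in X for b in Y])
    return k_xx + k_yy - 2.0 * k_xy

# Usage: a rexel-by-rexel distance matrix for downstream clustering (rexels and R are hypothetical).
# D = np.array([[mmd_squared(rexels[i], rexels[j], q_kernel) for j in range(R)] for i in range(R)])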
Briefly, LDS takes into account the entire heterogeneous sample of cells in a rexel, in order to compare rexels as distributions of transcriptomes. In comparison to the other two case studies, the smaller samples in each rexel-rexel pairwise comparison justified a k = 16 neighborhood size for LDS. Finally, the segmentation was performed by two popular algorithms [328]: Leiden community detection [79] or hierarchical clustering with Ward's linkage [153].

Figure 2.8. Side-by-side comparisons of local distributional segmentation with two alternatives, for ten human cortical tissue samples. Prior knowledge of the cerebral cortex compels the identification of parallel stripes to represent functional layers. Colors represent unsupervised clusters produced by the different methods, unaligned between competing versions.

Evaluations
A comprehensive visual comparison of rexel clusterings is shown in Figure 2.8. Clustering parameters were optimized by a grid search over the silhouette scores [285] for each case. The objective in such a visual evaluation is to seek alignment with prior knowledge. The human cerebral cortex is organized into parallel functional layers [89], [233]. It is apparent that the clusterings, which operate entirely on transcriptomic distributions and not pixel-wise spatial arrangements, ultimately tend to spatial contiguity. In more than half of the samples, notably Samples 1, 2, 3, 7, 8, and 9, parallel stripes appear to emerge clearly under q-diffused LDS.

Quantitative Evaluation of LDS on the Human Cerebral Cortex
Figure 2.9. Silhouette scores [285] as a means to quantify the spatial contiguity of the rexel clusterings, for different numbers of rexels (32, 64, 96), via either (a) community detection, or (b) hierarchical clustering, comparing the q-diffused, Gaussian, and kNN variants. Bars are equipped with standard errors. Those marked with 'x' indicate a statistically significant improvement with q-diffusion over the Gaussian or kNN alternative as indicated by a t-test with p < 0.05. The opposite never occurs.

For a more quantitative assessment, we also contrasted the mean silhouette scores for various resolutions of rexel-lation (in multiples of 32 rexels). The bars with significance markings in Figure 2.9 suggest that q-diffused LDS was never significantly worse than the alternatives, and in many cases performed significantly better.

2.4 Discussion
q-Diffusion led to demonstrable improvements in the multivariate analyses of differential effects between treatments in a clinical trial for mCRC (§2.3.1). We identified groups of genes that could inform future treatment assignment through their prognostic implications. Additionally, it helped with classification and identification of GEPs for PBMCs with control and stimulated conditions (§2.3.2). It also proved to be a necessary ingredient of LDS for recovering biologically relevant structures in human cortical tissue (§2.3.3). The granular nature of scRNAseq introduces sparsity and increases vulnerability to technical or biological noise [147]. Novel statistical methods like q-diffusion are required to intentionally handle the curse of dimensionality and its myriad related effects [188]. Our results focused on findings that directly improved upon the state of the art for enhancing the utility of scRNAseq samples for foundational or clinical research.
32 The breadth of applications proposed for q-diffusion naturally intermingles with innumerable other approaches. For instance, the field is seeing renewed interest in framing scRNAseq problems in terms of optimal transport (OT) [298], [327]. Multiple competing formulations exist for OT even in the generic problem of clustering cells [18], [131]. The main point of disagreement is how to define distances between genes, which shape the OT distances between cells. Gene-to-gene distances could be defined through a corpus of ontologies, external reference datasets, or through their coexpressions in the same cells that are to be analyzed. A more fundamental limitation of OT is that cell-to-cell comparisons will always take quadratic time in the number of genes to evaluate, notwithstanding the Sinkhorn relaxation [261]. q-Diffusion takes linear time in the number of genes. The two orthogonal methodologies are both posed as improved geometries for transcriptomics. We compared them once in §2.3.2, but anticipate complementary use cases in the future. The vast field of deep learning offers techniques [196], [205], [327], [366] that are complementary to, and perhaps synergistic with q-diffusion. We hope that q-diffusion could be harnessed upstream of a generative model’s objective function to inform its target geometry [81]. Alternative cell-similarity metrics like the Spearman correlation appear more effective than a Euclidean distance in clustering [174]. Our benchmarks (§2.3.2) were conducted against the standard analysis tools because to ad hoc insert correlation distances into an established pipeline would probably require adjusting the other hyperparameters. SC3 employs correlations, and was included in the benchmark. All of the benchmarked standard tools, which are close to the state of the art, are listed in Supplementary Table 2.1 alongside their key differences from q-diffusion. Concretely, the benchmark on determining IFN-γ stimulation in PBMCs was remarkable because it represented a task that was more difficult than annotations of coarse cell types. Translational research often considers such workflows, in which certain broad phenotypes are sampled from patients and controls. The disease condition within a phenotype can be rather subtle. This is clearly the case for T cells preseroconversion of celiac disease in genetically predisposed individuals [273]. A similar challenge exists in beta or gamma cells from the pancreas of type-2 diabetic patients [38], [351]. The Tabula Sapiens benchmark evaluated q-diffusion against the dimensionality reductions supplied by the authors of the atlas, which were carefully selected for their data. Out of the four small-sample organ tissues tested, the q-diffused clusters vastly outperformed the baselines in alignment with ground-truth labels for three organs. The fourth organ, the liver, had 62% of its non-PBMC cells labeled as hepatocytes. The q-diffused clusters completely missed this separation, perhaps gravitating to other phenotypic differences in 33 the tissue. Even though q-diffusion offers a novel and valuable perspective on scRNAseq datasets, it should be used in conjunction with more traditional analyses for a more complete picture of the phenotypes. As for LDS (§2.2.5) with spatial scRNAseq, we address other approaches to spatial segmentation. A recent hidden Markov random field model for seqFISH [382] needs a matching scRNAseq reference to disentangle cell-type variation from spatial variation, whereas the proposed LDS with MERFISH does not. 
DestVI [204], a method based on variational autoencoders, requires external cell-type annotations. More comparable is an unsupervised method termed SSAM [252], which detects cell-type signatures and then identifies spatial domains by comparing cell-type counts in sliding windows. We cluster on the basis of small tissue regions without discretizing to cell types. Also, we compare distributions of cellular transcriptomes rather than cell-type counts, by means of MMD. One of the baselines in §2.3.3 was based on regional aggregates of principal components, reminiscent of SSAM sans the discretization. The capability to reliably segment macro-scale structures from MERFISH should be contextualized in the broader field of brain research. Not only are the cells immensely heterogeneous even in small neighborhoods [89], but signaling networks are highly reliant on spatial organization [44], [352]. As spatiotemporal dimensions in scRNAseq samples become more accessible, there is immense promise in studying the process of memory formation [294], among other phenomena in the brain. The first case study (§2.3.1) on q-diffusion was aimed at demonstrating outcomes-based biological and medical relevance by identifying potential mCRC biomarkers in a phase III clinical trial. It was important to isolate the findings to differential effects (§2.2.7) of treatments, or between treatments, in order to ensure that the tumor biomarkers possibly interacted specifically with the drug mechanism. Such findings could offer precision guidance for mCRC treatment and prognostication. Tumor angiogenesis—the growth of new blood vessels to supply oxygen and nutrients to cancer cells—is a critical process in the development and progression of CRC. Bevacizumab (bev), an anti-angiogenetic drug, is considered a standard treatment in combination with chemotherapy in first- or second-line. However, no predictive marker for bev efficacy is currently available for patient selection in the clinical setting. Through our analytical approach we were able to identify several genes associated with differential effects on bev treatment outcomes in mCRC. The majority of these genes are known to play a role in cancer, including CRC, and several are involved in angiogenesis-related pathways. However, we are the first to report a connection with bev efficacy in patients (except EREG, which had been previously identified [308].) Supplementary §2.6.1 discusses the identified genes in detail. This study has a few important limitations. First, q-diffusion arguably makes parametric assumptions on the data that might be restrictive in some use cases. Deep learning methods, on the other hand, offer 34 more general avenues, but they notoriously demand large datasets and require extensive hyperparameter optimization. We also believe that q-diffusion is more interpretable than most solutions involving neural networks because despite its nonlinearities, q-diffused geometry is straightforward and supplements linear analyses like NMF. Second, given the scope of this study, we did not consider batch-effect correction [105] or alignment of disparate datasets. Presumably, various pre-existing tools can be coopted upstream of the analysis for this purpose. It is also of note that removal of batch effects also sacrifices some truly biological variation [196], so it is important to conceive methodologies that can function without the corrections. Another limitation of q-diffusion is that nonlinearities are less directly interpretable, generally. 
We ameliorated this concern in the particular case of GEPs by formulating a q-diffused NMF to produce linear programs while guided by nonlinear dynamics. Finally, we note that there is room for improving the runtime efficiency of the algorithm evaluating the q-diffused kernel across all pairs of cells. With the Tabula Sapiens experiments serving as an example, which had an expansive library size of 58,870 genes, one server with four NVIDIA GeForce RTX 2080 Ti graphics cards took between 20 minutes (for the liver) to 3 hours (for the trachea) for the complete affinity matrix. A simple heuristic based on approximate nearest neighbors is likely to drastically improve runtime with minimal cost in accuracy. Moving forward, we seek to study precisely when q-diffusion would be beneficial to a particular scRNAseq problem. Such an exploration would invite a careful selection algorithm for non-Euclidean deformation q and inner bandwidth φ, the two vital parameters to q-diffusion (§2.2). At present, our custom software QDiffusion.jl has been released to the public with sufficient documentation for others to use. Data Availability All datasets considered in this study have been previously reported in the literature. Accession codes or links are provided for all datasets besides the clinical trial: the Human Colon Cancer Atlas can be accessed at GEO: GSE178341; the PBMC benchmark at GEO: GSE96583; the Tabula Sapiens benchmark at https://doi.org/10.6084/m9.figshare.14267219.v5; the human MERFISH sample at https://doi.org/10. 5061/dryad.x3ffbg7mw. For the CALGB/SWOG 80405 clinical trial, a summary of clinical and genomic data will be made available upon reasonable request. Code Availability Please visit https://github.com/marmarelis/QDiffusion.jl for access to the Julia package. 35 2.5 Appendix A: Supplementary Methods 2.5.1 Details Behind the Synthetic Example (Figure 2.2) We simulated a hundred distinct scenarios and aggregated their results for Figure 2.2. In each case, two different pseudo-phenotypes were drawn independently from a thousand-dimensional Dirichlet distribution with one uniform parameter, αD. The background or noise expression level was controlled by another parameter, αN , which balanced an arithmetic mixture between the Dirichlet-drawn vector and ones. In other words, one phenotype was internally constructed as X = max (1 − αN )Y + αN 1, 0.9 , where Y ∼ Dirichlet [αD] 1:1000 . The random vector X represents the gene activities of a phenotype. Higher noise levels entail a greater overall expression of all the genes, indiscriminately. This conflicts with phenotype resolution. Observations were modeled as univariate negative binomial distributions with dispersion parameter r = 103 . The negative binomial fits scRNAseq empirically well [115]. Before analysis, observations were log-normalized as with real data. We benchmarked community detection on either the q-diffused kernel, or the Gaussian kernel with eight principal components. Communities were compared against the binary ground-truth phenotypes by the Rand index. In Figure 2.2, the three charts ranging from “few strong genes” to “many weak genes” had log10 αD = −4, −3, −2, respectively. In each chart, the range from “low noise” to “high noise” consisted of the settings log10 αN = −5, −4, −3, −2. We note that the curious increase in accuracy for q-diffusion at the highest noise level is most likely an artifact of the normalization method. We also consider it a testament to the promise that this method holds. 
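For reference, here is a hedged sketch of the synthetic generator in §2.5.1: Dirichlet-drawn gene activities mixed with uniform background noise and observed through negative binomials, then log-normalized. The scaling of activities to expected counts and the reading of the 0.9 term as an element-wise cap are our assumptions, and the names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def simulate_phenotype(n_cells, alpha_d=1e-3, alpha_n=1e-4, n_genes=1000, r=1000):
    """Draw one pseudo-phenotype and its noisy scRNAseq-like observations."""
    y = rng.dirichlet(np.full(n_genes, alpha_d))          # sparse gene activities
    x = np.minimum((1.0 - alpha_n) * y + alpha_n, 0.9)    # background mixture, capped at 0.9 (assumed reading)
    mu = x * 1e4                                          # assumed scaling of activities to expected counts
    p = r / (r + mu)                                      # negative binomial with mean mu and dispersion r
    counts = rng.negative_binomial(r, p, size=(n_cells, n_genes))
    lib = counts.sum(axis=1, keepdims=True)
    return np.log1p(1e4 * counts / np.maximum(lib, 1))    # log-normalization, as with real data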
2.5.2 More Details on q-Diffusion
The q-diffused kernel is a deformation of the classic Gaussian radial basis function, placing a power-law decay on each gene term. Our approach draws inspiration from nonextensive entropies [344] and nonlinear Fokker-Planck equations [26].

Bivariate q-Diffused Kernels with Increasing Deformation
Figure 2.10. Contours of bivariate q-diffused kernels (for q → 1 and q = 1.2, 1.4, 1.6), demonstrating their concavity, which incidentally violates the positive semidefinite property. The tick marks indicate the inner bandwidth that is set to 1, with the outer bandwidth set to 4 for the purpose of illustration. Throughout the chapter, q-diffusion refers to q = 1.2 specifically.

How the q-Diffused Kernel Handles Dimensionality
Figure 2.11. Kernel values when different numbers of variables have large displacements, at various dimensionalities. Colors are scaled linearly in [0.25, 1.00]. Bandwidth is set to the distance equivalent of 75 variables with large differences (marked by vertical dashed line). The q-diffused kernel's behavior depends on relative scaling with the inner bandwidth, as showcased with the small and large versions plotted.

2.5.3 Implementation Details
Preprocessing. All scRNAseq datasets considered in this study were log-normalized with an assumed transcript throughput of $10^4$ per cell, as is typical and often considered standard [34].

Regularization hyperparameter for qNMF. With the qNMF introduced in §2.2.4, there came along one additional hyperparameter. Namely, one must select how to weigh the reconstruction error against the q-diffused regularization. It is unreasonable to optimize this weighting on the basis of in-sample reconstruction because the regularizer always worsens it. A feasible route for in-sample model selection could be through information criteria like the AIC or BIC [83] on the effective sizes or degrees of freedom in the GEPs. For our purpose in §2.2.7, model fitness depends on downstream utility. Validating GEPs on the smaller sample of clinical-trial patients would ruin the statistical significance. Therefore, we opted for a scale-invariant weighting heuristic that eliminated the need for hyperparameter selection. Our heuristic, available in the source code, weighs one unit of reconstruction error equally against one unit of discrepancy in meta-gene expression between immediate neighbors in the q-diffused geometry.

Finding active genes in a GEP. Active genes in a GEP were considered those with effectively nonzero ($> 10^{-3}$) weight. NMF and qNMF produce rather sparse GEPs [181], and downstream processing is relatively insensitive to the precise threshold.

Gene ontology enrichment. The PANTHER system [322] was employed to investigate the overrepresentation [231] of gene ontologies from specific biological processes in each GEP.

Deconvolution of GEPs in bulk RNA. Similarly to the approach taken with CIBERSORTx [238], we perform nonnegative least squares for deconvolution of GEP estimates onto bulk RNA samples.

2.5.4 Theoretical Motivation
The terminology of q-diffusion relates to the geometry's effect, in practice, of identifying processes that diffuse across many variables. Separately, its theoretical underpinnings relate to heat diffusion [177]. Briefly, we contextualize this work in the realm of Riemannian manifolds [56], [113]. A Riemannian metric operates on locally linear tangent spaces on a manifold.
It manifests as a quadratic form, hence the local behavior of any Riemannian manifold can exhibit curvatures described by bilinear terms at most. Intriguingly, the q-diffused geometry breaks out of that framework and incorporates terms of much greater order.

Physical Motivation
The Fokker-Planck equation describes the evolution of an ensemble of infinitesimal particles. Individually, a particle is governed by a stochastic Langevin equation. The Fokker-Planck equation leads to a probability density function of these particles, spreading wider with time. Reasoning about a particular microscopic behavior due to some Langevin equation sheds light onto the mechanics of its corresponding macroscopic law [151]. A general Itô-Langevin equation with scalar x is written as
$$\frac{dx}{dt} = \mu(x, t) + g(x, t)\,\eta(t) \tag{2.5}$$
with stochastic elements contained in the noise η(t), with $\mathbb{E}\,\eta(t) = 0$. A particle following such a kinematic obeys, on average, a probability density f(x, t) over a spatial dimension x, changing with time t, possibly conforming to a corresponding Fokker-Planck equation. Should η(t) be a Wiener process (Gaussian white noise), the linear case is recovered. The heat equation that arises in kernel methods is closely tied to the driftless (with µ = 0) version thereof. This case allows for the solution to an arbitrary initial-value problem to take the form of a convolution by the so-called heat kernel, which looks like a Gaussian function with scale $\propto \sqrt{t}$. A remarkable nonlinear framing of the Fokker-Planck equation is given by [26]
$$\partial_t f = -\partial_x(\mu f) + Q\,\partial_x^2\!\left(f^{\nu}\right) \tag{2.6}$$
with constants Q and ν ≠ 1. The only way for this form to be compatible with Equation 2.5 is for the microscopic diffusion function g to be coupled with the macroscopic by $g(x, t) = Q f^{\nu-1}(x, t)$. This fact implies that there is an interaction among the particles, such that motion of the ensemble influences individual jump sizes. That phenomenon is reflected in the general solution (with linear drift), which is a time-dependent q-Gaussian with ν = 2 − q:
$$f(x, t) = Z(t)^{-1}\left[1 - \beta(t)(1 - q)\big(x - x_M(t)\big)^2\right]_+^{\frac{1}{1-q}}$$
and certain functions Z(t), $x_M(t)$, β(t). When ν < 1, we have q > 1, and g(x, t) depends on f(x, t) raised to a negative power. In other words, jump sizes increase as a particle reaches the extremities of the distribution. The q < 1 case has been studied extensively in the context of the porous-medium equation [e.g. 245], which yields narrow tails. These univariate equations (in x) are a possible framing for how q-diffused dynamics could arise in nature.

Alternative Motivations
Proposition 1. The q-diffused kernel around $q \to 1^+$ behaves as
$$\prod_{i=1}^m \exp_q(-x_i^2) = \exp(-\lVert x\rVert^2) \cdot \left[1 + (q-1)\,\frac{1}{2}\sum_{i=1}^m x_i^4 + O(q-1)^2\right].$$
Proposition 1 suggests that there exists a sort of compensatory mechanism in the q-diffused kernel, propping up the kernel's value when it would otherwise be low due to a few high entries (hence with strong fourth powers). It may also be helpful to think of the q-diffused kernel in terms of its associated distance-like function, termed q-Euclidean.

Definition 2. The q-Euclidean pseudonorm of a vector $x \in \mathbb{R}^m$ is ${}_q\lVert x\rVert := \left(-\bigoplus_{i=1}^m (-x_i^2)\right)^{1/2}$.

It is properly considered a pseudonorm because it does not obey the triangle inequality.

Lemma 2 (local asymptotic). The squared q-Euclidean pseudonorm, within a ball of radius $O(\sqrt{\varepsilon})$ in $\mathbb{R}^m$ for sufficiently small ε, may be approximated by the squared Euclidean norm up to the order $\varepsilon^2$. In fact, locally, $\bigoplus_{i=1}^m (\pm x_i^2) = \pm\lVert x\rVert_2^2 + O(\varepsilon^2)$.

Lemma 3 (global bound).
Anywhere in $\mathbb{R}^m$, the q-Euclidean pseudonorm is bounded below by the Euclidean norm: ${}_q\lVert x\rVert^2 \ge \lVert x\rVert^2$, for q > 1.

Note that the q-diffused kernel is not positive (semi)definite, posing challenges with incorporating Reproducing Kernel Hilbert Space (RKHS) theory. We plan to investigate the theoretical implications in future work.

2.6 Appendix B: Supplementary Results
2.6.1 Details for Case Study on Treatment of Colon Cancer
Local Intrinsic Dimensionality (LID). The LID for Figure 2.4 was estimated by a maximum-likelihood method [40], computed over a point's $2^9 = 512$ nearest neighbors. The heatmaps were smoothed for the purpose of interpretation by taking a median "vote" over $2^{11} = 2{,}048$ neighbors for the value of each point. There were 17,362 points in total.

Justification for Branching Genes. Figure 2.12 provides details and further evidence for our approach on defining branches in the embedding.

Solution          | Preprocessing                                      | Dimensionality              | Orders of Interaction | Cell Distances              | Neighborhoods
q-Diffusion       | normalization                                      | all genes                   | all (≫ 2)             | power law, non-Euclidean    | q-diffused
Seurat            | normalization, feature selection, standardization  | a few principal components  | 2                     | Euclidean                   | Shared Nearest Neighbors (SNN)
Scanpy            | normalization, feature selection, standardization  | a few principal components  | 2                     | Euclidean                   | UMAP or DM
SC3               | normalization, feature selection                   | a few principal components  | 2                     | Euclidean/Pearson/Spearman  | k-means
Monocle           | normalization                                      | a few principal components  | 2                     | Euclidean embedding         | SNN (e.g. in UMAP)
Optimal Transport | normalization                                      | all genes                   | N/A                   | through gene-gene distances | SNN

Table 2.1. Our method contrasted against the state of the art in cell clustering and dimensionality reduction. We list the default setting for packages that expose a variety of options. Blue: clear improvements; red: drawbacks.

Expanded Discussion
Among the individual genes associated with improved outcomes under bevacizumab, VAV3 plays an important role in tumor angiogenesis and is associated with CRC growth and metastatic spread through PI3K-AKT signaling [334]. SLC2A8 is part of the intracellular glucose transporters solute carrier 2A family, several members of which have been recently connected to angiogenesis, as higher glucose uptake and glycolysis have been linked to survival in bevacizumab-resistant tumors [185], [303], [331]. SPATA13 (also known as Asef2) interacts directly with APC and promotes colorectal tumorigenesis, progression and invasion [165], [166], [361]. NHLRC3 has been recently included in a prognostic seven-gene signature model to predict CRC survival [43]. This gene encodes a protein involved in the process of ubiquitination, which modulates many cellular processes and signaling pathways. TRPC4AP and RNF114 are also involved in the ubiquitination of various substrates. TRPC4AP specifically mediates the degradation of MYC and participates in the activation of NFKB1 and JNK. MYC is an extremely important oncogene which plays a major role in cancer and contributes to tumor cell invasion, migration and angiogenesis [230]. On the other hand, NFKB activation in CRC also plays a central role in promoting angiogenesis and tumor growth [292]. RNF114, an E3 ubiquitin ligase, has been shown to negatively regulate NFKB signaling and to modulate T-cell activation and apoptosis, which could influence the balance of the immune tumor microenvironment (TME) [279]. COMMD7 also regulates NFKB signaling [379].
SETDB1, a histone methyltransferase, promotes CRC progression via epigenetically silencing of p21 expression [39]. Studies have reported significantly higher SETDB2 frameshift mutations in MSI-H CRC. CC2D1A has been recently reported as part of a model predicting sensitivity to anti-EGFR treatment in KRAS-wildtype CRC [17]. This gene is also involved in several pathways, including NFKB, Akt, PKA and Notch signaling. Notch signaling and its crosstalk with other signaling pathways is pivotal in cancer progression, including tumor angiogenesis and cancer stem cell expansion [180]. Another of our significant genes, CBFA2T2 (or MTGR1) is a Notch signaling inhibitor [251]. Conversely, ASPH which is associated to reduced benefit from bevacizumab in our analysis, can promote Notch signaling and has been related to tumor invasion and reduced patient survival in CRC [19]. PLCG1 has also been identified as a novel mechanism of anti-EGFR resistance in RAS wild-type CRC [66]. PIGU, MANBAL, and AAR2 have been found to be associated with KRAS Mutation in CRC, where PIGU low expression conferred a poor prognosis [197], [375]. TTI1 encodes for a protein involved in DNA damage response and part of the mTOR complex, a signaling mediator which plays a key role in angiogenesis modulation [59], [206], and has been reported to correlate with CRC proliferation and metastasis [112], [354]. Oxidative stress is another crucial pathway involved in cancer dynamics, including mechanisms of drug resistance and angiogenesis modulation [175]. Among our significant genes GSS [173], [329], OSER1 [70], and PRDX5 [138] are all involved in cellular response to oxidative stress. Additionally, PRDX5 promotes epithelial-mesenchymal transition (EMT) in CRC, which in turn stimulates tumor angiogenesis [3], [91]. ACOT8 has a role in β-oxidation during the metabolism of fatty acids and has been reported as a part of prognostic signatures based on lipid metabolism in CRC [101], [312]. DPM1 has been found to be overexpressed in CRC [4] and has been related to cell proliferation and enhanced angiogenesis [15], [377]. EREG (epiregulin) is the only gene that have been previously reported as a predictive biomarker for bevacizumab treatment in mCRC [308]. Consistently with our results, higher EREG expression was associated with longer survival in patients receiving bevacizumab plus chemotherapy. Among the genes whose high expression is associated with worse outcome under bevacizumab treatment we found two members of the S100 family of proteins, S100A14 and S100A13. This protein family is well characterized for its role in cancer. S100A14 is involved in CRC cell growth and motility, and its expression has been reported to be inversely correlated with CRC progression [117]. S100A13 has pro-angiogenic functions reported in melanoma and astrocytic gliomas [223]. HMGA1 is a target of the oncogene MYC and promotes CRC development through Wnt signaling, another central signaling pathway in CRC biology [343]. 42 It also has a pro-angiogenic activity [276], [371]. HMGA1 has been found to be overexpressed in CRC and has also been connected to metabolic pathways which promote cancer progression [348]. AGR3 promotes CRC stemness via Wnt/β-catenin signalling [54], which has been linked to treatment resistance. On the other hand, increased CTNNBL1 (β-catenin like protein 1) was associated with better bevacizumab treatment outcomes. Finally, CD163 encodes for a receptor expressed in monocytes and macrophages. 
CD163 positive macrophages in the TME have an M2 polarization and promote inflammation, vascular permeability and angiogenesis [305]. The infiltration of CD163 positive macrophages is a negative prognostic biomarker in CRC patients [183], [357]. Importantly, M2 macrophage polarization has been connected to bevacizumab resistance.

2.6.2 Details for Case Study on Clustering Phenotypes
This case study presented a more nuanced task than the well-characterized PBMC classification problem. We believe it is representative of real studies into the more granular phenotypes that could inform the outcome of a disease. A specific example is how the effect of IFN-γ on PBMCs is a risk factor for COVID-19 [96], [127]. q-Diffusion was applied with k = 64 for the neighborhood sizes. Standard (or analogous, when there is no "standard") neighborhood settings were used on all the competing methods. The benchmark included the most popular and current software packages for single-cell clustering [369], namely Seurat [34], Monocle [268], Scanpy [350], SC3 [176], as well as optimal transport [131]. For all but SC3, as explained below, we evaluated the variety of ways that neighborhood graphs were constructed across all PBMCs. Then, we subsetted the graphs by cell type and executed the Leiden community detection algorithm [79]. SC3 could only output clusters directly for each subtype. Optimal transport, which has yet to attain a standard operating procedure [e.g. 18] and is rather computationally expensive [261], was attempted with reasonable settings. Previously proposed methods like SNN-Cliq [353] and PhenoGraph [194] were not included because they inspired the latest versions of packages like Seurat.

The Optimal Transport (OT) Baseline. As mentioned in the main text §2.3.2, there is no established toolkit for OT-assisted clustering as of the time of this writing, and we had to construct a simple baseline. At a high level we estimated gene-to-gene distances internally, through Euclidean distances between their log-normalized expressions across the cells [131]. Then we renormalized expressions to the probability simplex and ran the Sinkhorn algorithm with regularization ε = 1. For computational reasons, we selected 4,000 features using the Seurat dispersion metric, where Seurat typically selects 2,000 features. As we sought the k = 64 nearest neighbors due to OT for the purpose of constructing a graph and performing community detection, we restricted the search to the $2^9 = 512$ Seurat neighbors—in other words, Euclidean in the principal components—of each point. These Seurat neighbors were filtered down to the nearest 12.5% in terms of OT.

2.6.3 Details for Case Study on Brain Structure
Grid search for LDS clustering. Clustering parameters were optimized by silhouette-score sample means over a simple grid. The search for community detection was conducted over k = 5, 6, 7, 8, 9, 10 nearest neighbors for the directed graph construction. For hierarchical clustering, we tried k = 2, 3, 4, 5 partitions.

2.7 Appendix C: Proofs
Proof of Proposition 1. First we shall prove that $\exp_q(-x_i^2) = \exp(-x_i^2)\left[1 + \frac{q-1}{2}x_i^4 + O(q-1)^2\right]$. Let $\alpha = 1 - q$ and $u = -x_i^2$. Then $\exp_q(u) = (1 + \alpha u)^{1/\alpha}$, for α < 0. We proceed by constructing a Taylor expansion at the limit α → 0: by definition of the exponential, $\lim_{\alpha\to 0}\exp_q(u) = \exp(u)$. Then we take the first derivative:
$$\frac{\partial}{\partial \alpha}\exp_q(u) = \frac{\partial}{\partial \alpha}\exp\!\left(\alpha^{-1}\log(1+\alpha u)\right) = \left[\frac{1}{\alpha}\,\frac{u}{1+\alpha u} - \frac{1}{\alpha^2}\log(1+\alpha u)\right](1+\alpha u)^{1/\alpha}.$$
$$\lim_{\alpha\to 0}\frac{\partial}{\partial\alpha}\exp_q(u) = \lim_{\alpha\to 0}\frac{\frac{\alpha u}{1+\alpha u} - \log(1+\alpha u)}{\alpha^2}\,\exp(u) = \lim_{\alpha\to 0}\frac{\frac{u}{1+\alpha u} - \frac{\alpha u^2}{(1+\alpha u)^2} - \frac{u}{1+\alpha u}}{2\alpha}\,\exp(u) = \lim_{\alpha\to 0}\left[-\frac{u^2}{2(1+\alpha u)^2}\right]\exp(u) = -\tfrac{1}{2}u^2\exp(u) = \tfrac{1}{2}x_i^4\exp(-x_i^2),$$
by L'Hôpital's rule. Hence the univariate case is proven (for $m = 1$). The multivariate cases follow by induction: chaining multiple $\exp_q(-x_i^2)$ terms yields a Gaussian function with unit and $x_i^4$ factors, and everything else of order $(q-1)^2$ and above.

Proof of Lemma 2. Borrowing notation from Equation 2.3,
$$\bigoplus_{i=1}^{m}(\pm x_i^2) = \sum_{l=1}^{m}\sum_{I_l}(1-q)^{l-1}(\pm x_{i_1}^2)(\pm x_{i_2}^2)\cdots(\pm x_{i_l}^2) = \sum_{l=1}^{m}(\pm 1)^l (1-q)^{l-1}\sum_{I_l} x_{i_1}^2 x_{i_2}^2\cdots x_{i_l}^2$$
$$\le \sum_{l=1}^{m}(\pm 1)^l (1-q)^{l-1}\sum_{I_l}\exp\!\left\{l\cdot\log\!\left(\frac{x_{i_1}^2 + x_{i_2}^2 + \ldots + x_{i_l}^2}{l}\right)\right\}$$
due to Jensen's inequality on
$$\log\big(x_{i_1}^2 x_{i_2}^2\cdots x_{i_l}^2\big) = \log x_{i_1}^2 + \log x_{i_2}^2 + \ldots + \log x_{i_l}^2 \le l\cdot\log\!\left(\frac{x_{i_1}^2 + x_{i_2}^2 + \ldots + x_{i_l}^2}{l}\right).$$
Since $x_{i_1}^2 + x_{i_2}^2 + \ldots + x_{i_l}^2 \le \varepsilon$ for any $1 \le i_1 < i_2 < \cdots \le m$, it holds that
$$\sum_{I_l}\exp\!\left\{l\cdot\log\!\left(\frac{x_{i_1}^2 + x_{i_2}^2 + \ldots + x_{i_l}^2}{l}\right)\right\} \le \sum_{I_l}\exp\big\{l\cdot\log(\varepsilon\, l^{-1})\big\} = \sum_{I_l} O(\varepsilon^l)\, l^{-l}$$
for any $l$-term in the sum. Each successive summand is of growing order in $\varepsilon$, so we may truncate the entire expression at $l = 1$:
$$\bigoplus_{i=1}^{m}(\pm x_i^2) = \pm\|x\|_2^2 + O(\varepsilon^2).$$

Proof of Lemma 3. Begin the proof as per above. Borrowing notation from Equation 2.3,
$${}_{q}\|x\|^2 = \bigoplus_{i=1}^{m}(-x_i^2) = \sum_{l=1}^{m}\sum_{I_l}(1-q)^{l-1}(-x_{i_1}^2)(-x_{i_2}^2)\cdots(-x_{i_l}^2) = \sum_{l=1}^{m}(-1)^l(1-q)^{l-1}\sum_{I_l} x_{i_1}^2 x_{i_2}^2\cdots x_{i_l}^2$$
$$= \sum_{l=1}^{m}(-1)^{2l-1}|1-q|^{l-1}\sum_{I_l} x_{i_1}^2 x_{i_2}^2\cdots x_{i_l}^2, \quad\text{since } q > 1,$$
$$= -\sum_{l=1}^{m}|1-q|^{l-1}\sum_{I_l} x_{i_1}^2 x_{i_2}^2\cdots x_{i_l}^2 \le -\sum_{i=1}^{m} x_i^2 \quad \{\text{keep the } (l=1)\text{-terms only}\} = -\|x\|_2^2.$$

[Figure 2.12, three panels: (a) "Clustering with q-Diffused PHATE": the q-diffused embedding colored by k-means clusterings [8] for k ∈ {2, 3, 4}; the three versions serve to validate the stability of the two major branches (blue, red) and one minor branch (yellow). (b) "Bounding Boxes for Branches": the q-diffused embedding colored by the same LID estimates as in Figure 2.4, here annotated with the precise bounding boxes that delineated the cells of the two major branches for downstream analysis. (c) "Clustering with Original PHATE": the original embedding colored by its k-means clusterings, as with Figure 2.12a; evidently, the existence of major branches is less stable: observe how red and blue sections continuously shift.]
Figure 2.12. Justification for the branching classifications in Figure 2.4 for mCRC. We synthesized the LID and the stable k-means clusterings to draw bounding boxes on the q-diffused PHATE embedding.

Chapter 3
Partial Identification of Dose Responses with Hidden Confounders

Inferring causal effects of continuous-valued treatments from observational data is a crucial task promising to better inform policy- and decision-makers. A critical assumption needed to identify these effects is that all confounding variables—causal parents of both the treatment and the outcome—are included as covariates. Unfortunately, given observational data alone, we cannot know with certainty that this criterion is satisfied. Sensitivity analyses provide principled ways to give bounds on causal estimates when confounding variables are hidden. While much attention is focused on sensitivity analyses for discrete-valued treatments, much less is paid to continuous-valued treatments. We present novel methodology to bound both average and conditional average continuous-valued treatment-effect estimates when they cannot be point identified due to hidden confounding.
A semi-synthetic benchmark on multiple datasets shows our method giving tighter coverage of the true dose-response curve than a recently proposed continuous sensitivity model and baselines. Finally, we apply our method to a real-world observational case study to demonstrate the value of identifying dose-dependent causal effects. 3.1 Introduction Causal inference on observational studies [10], [119] attempts to predict conclusions of alternate versions of those studies, as if they were actually properly randomized experiments. The causal aspect is unique among inference tasks in that the goal is not prediction per se, as causal inference deals with counterfactuals, the problem of predicting unobservables: for example, what would have been a particular patient’s health outcome had she taken some medication, versus not, while keeping all else equal (ceteris paribus)? There is quite often no way to validate the results without bringing in additional domain knowledge. A set of putative treatments T , often binary with a treated/untreated dichotomy, induces potential outcomes Yt∈T . These can depend on covariates X as with heterogeneous treatment effects E[Y1 − Y0 | X] in the binary 48 case. Only one outcome is ever observed: that at the assigned treatment T. Potential biases arise from the incomplete observation. This problem is exacerbated with more than two treatment values, especially when there are infinite possibilities, like in a continuum, e.g. T = [0, 1]. Unfortunately, many consequential decisions in life involve this kind of treatment: What dose of drug should I take? How much of should I eat/drink? How much exercise do I really need? In an observational study, the direct causal link between assigned treatment T and observed outcome Y (also denoted as YT ) can be influenced by indirect links modulated by confounding variables. For instance, wealth is often a confounder in an individual’s health outcome from diet, medication, or exercise. Wealth affects access to each of these “treatments,” and it also affects health through numerous other paths. Including the confounders as covariates in X allows estimators to condition on them and disentangle the influences [363]. It can be challenging to collect sufficient data, in terms of quality and quantity, on confounders in order to adjust a causal estimation to them. Case in point, noisy observations of e.g. lifestyle confounders lead researchers to vacillate on the health implications of coffee [12], alcohol [367], and cheese [104]. For consequential real-world causal inference, it is only prudent to allow margins for some amount of hidden confounding. A major impediment to such analysis is that it is impossible to know how a hidden confounder would bias the causal effect. The role of any causal sensitivity model [61], [284] is to make reasonable structural assumptions [213] about different levels of hidden confounding. Most sensitivity analyses to hidden confounding require the treatment categories to be binary or at least discrete. This weakens empirical studies that are better specified by dose-response curves [25], [37] from a continuous treatment variable. Estimated dose-response functions are indeed vulnerable in the presence of hidden confounders. Figure 3.1 highlights the danger of skewed observational studies that lead to biased estimates of personal toxic thresholds of treatment dosages. 
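To make the danger concrete, below is a purely illustrative Python simulation in the spirit of Figure 3.1; it is not the simulation of §D, and the subgroup sizes, dose distributions, and tent-shaped response are assumptions chosen only to exhibit the bias. A hidden attribute both lowers an individual's toxic threshold and leads them to take smaller doses, so the naively estimated curve keeps rising well past the dose that is best for the population as a whole.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical hidden confounder U: a vulnerable subgroup with lower toxicity
# thresholds that also tends to take smaller doses.
u = rng.binomial(1, 0.5, size=n)
dose = rng.beta(2.0 + 3.0 * (1 - u), 3.0, n)      # vulnerable people take smaller doses
threshold = 0.8 - 0.4 * u                          # their personal toxic threshold is lower
response = np.where(dose < threshold, dose, 2 * threshold - dose)  # benefit, then toxicity

# Naive population-level estimate: average observed response per dose bin.
bins = np.linspace(0, 1, 21)
centers = 0.5 * (bins[1:] + bins[:-1])
naive = [response[(dose >= lo) & (dose < hi)].mean() for lo, hi in zip(bins[:-1], bins[1:])]

# True APO: average response if everyone were assigned each dose.
true_apo = [np.where(d < threshold, d, 2 * threshold - d).mean() for d in centers]

print("dose that naively looks best:", centers[np.nanargmax(naive)])
print("dose that is actually best  :", centers[np.argmax(true_apo)])
```

Because high observed doses are mostly taken by the robust subgroup, the naive curve overshoots the population-level optimum, exactly the failure mode that the sensitivity analysis developed in this chapter is meant to flag.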
3.1.1 Related works

There is growing interest in causal methodology for continuous treatments (or exposures, interventions), especially in the fields of econometrics [e.g. 128], [332], health sciences [335], and machine learning [49], [57], [102], [160]. So far, most scrutiny on partial identification of potential outcomes has focused on the case of discrete treatments [e.g. 200], [207], [284]. A number of creative approaches recently made strides in the discrete setting. Most rely on a sensitivity model for assessing the susceptibility of causal estimands to hidden-confounding bias. A sensitivity model allows hidden confounders but restricts their possible influence on the data, with an adjustable parameter that controls the overall tightness of that restriction.

[Figure 3.1: "Biased Dose Responses from an Observational Study," plotting response against dose; the estimated curve overshoots the true population-level curve into the toxic ("danger") region. In-figure annotations: a hidden attribute in the population (a) lowers personal thresholds for toxicity and (b) leads people to take smaller dosages on average; the hidden attribute affects both the response (outcomes) and the observed dosage (treatment assignments), so it is a hidden confounder, one that dangerously biases the apparent dose response.]
Figure 3.1. Dose-response curves in medicine [e.g. 315] can be viewed as expected potential outcomes from continuous treatments. In this simulation (with details in §D), there is one unobserved confounder. The empirical estimate of the population-level dose responses massively overshoots the maximum effective dosage, and would suggest treatments that were actually toxic to the population. This phenomenon persists even when the vulnerable hidden subgroup occurs more often in the population.

The common discrete-treatment sensitivity models are incompatible with continuous treatments, which are needed for estimating dose-response curves. Still, some recent attempts have been made to handle hidden confounding under more general treatment domains [49]. Hu, Wu, Zhang, et al. [125] and Padh, Zeitler, Watson, et al. [248] optimize generative models to reflect bounds on the treatment effect due to ignorance, inducing an implicit sensitivity model through functional constraints. Instrumental variables are also helpful when they are available [172]. The CMSM [142] was developed in parallel to this work, and now serves as a baseline.

For binary treatments, the Marginal Sensitivity Model (MSM) due to Tan [317] has found widespread usage [143], [158], [336], [365], [378]. Variations thereof include Rosenbaum's earlier sensitivity model [2002] that enjoys ties to regression coefficients [359]. Alternatives to sensitivity models leverage generative modeling [229] and robust optimization [111]. Other perspectives require additional structure to the data-generating (observed outcome, treatment, covariates) process. Proximal causal learning [224], [319] requires observation of proxy variables. Chen, Minorics, and Janzing [46] rely on a large number of background variables to help filter out hidden confounding from apparent causal influences.

3.1.2 Contributions

We propose a novel sensitivity model for continuous treatments in §3.2. Next, we derive general formulas (§3.2.1) and solve closed forms for three versions (§3.2.3) of partially identified dose responses—for Beta, Gamma, and Gaussian treatment variables. We devise an efficient sampling algorithm (§3.3), and validate our results empirically using a semi-synthetic benchmark (§3.4) and realistic case study (§3.5).
3.1.3 Problem Statement Our goal is the partial identification of causal dose responses under a bounded level of possible hidden confounding. We consider any setup that grants access to two predictors [48] that can be learned empirically and are assumed to output correct conditional distributions. These are (1) a predictor of outcomes conditioned on covariates and the assigned treatment, and (2) a predictor of the propensity of treatment assignments, taking the form of a probability density, conditioned on the covariates. The latter measures (non-)uniformity in treatment assignment for different parts of the population. The observed data come from a joint distribution of outcome, continuous treatment, and covariates that include any observed confounders. Potential outcomes. Causal inference is often cast in the nomenclature of potential outcomes, due to Rubin [286]. Our first assumption, common to Rubin’s framework, is that observation tuples of outcome, assigned treatment, and covariates, {(y (i) , t(i) , x(i) )} n i=1, are i.i.d draws from a single joint distribution. This subsumes the Stable Unit Treatment Value Assumption (SUTVA), where units/individuals cannot depend on one another, since they are i.i.d. The second assumption is overlap/positivity, that all treatments have a chance of assignment for every individual in the data: pT|X(t | x) > 0 for every (t, x) ∈ T × X . The third and most challenging fundamental assumption is that of ignorability: {Yt ⊥⊥ T} | X, ∀t ∈ T . Clearly the outcome should depend on the assigned treatment, but potential outcomes ought not to be affected by the assignment, after blocking out paths through covariates. Our study focuses on dealing with limited violations to ignorability. The situation is expressed formally as {Yt ̸⊥⊥T} | X, but more specifically, we shall introduce a sensitivity model that governs the shape and extent of that violation. Let p(yt |x) denote the probability density function of potential outcome Yt = yt from a treatment t ∈ T , given covariates X = x. This is what we seek to infer, while observing realized outcomes that allow us to learn the density p(yt | x, T = t). If the ignorability condition held, then p(yt | x, T = t) = p(yt |x) 51 due to the conditional independence. However, without ignorability, one has to marginalize over treatment assignment, requiring p(yt | x, T ̸= t) because p(yt |x) = Z T p(yt | τ, x)p(τ |x) dτ, (3.1) where p(yt |τ, x) is the distribution of potential outcomes conditioned on actual treatment T = τ ∈ T that may differ from the potential outcome’s index t. The density p(τ |x) is termed the nominal propensity, defining the distribution of treatment assignments for different covariate values. On notation. Throughout this study, yt will indicate the value of the potential outcome at treatment t, and to disambiguate with assigned treatment τ will be used for events where T = τ . For instance, we may care about the counterfactual of a smoker’s (τ = 1) health outcome had they not smoked (yt=0), where T = 0 signifies no smoking and T = 1 is “full” smoking. We will use the shorthand p(· · ·) with lowercase variables whenever working with probability densities of the corresponding variables in uppercase: p(yt |τ, x) means ∂ ∂uP[ Yt ≤ u | T = τ, X = x ] u=yt. Quantities of interest. We attempt to impart intuition on the conditional probability densities that may be confusing. • p(yt | x) [conditional potential outcome]. 
A person's outcome from a treatment, disentangled from the selection bias of treatment assignment in the population. We seek to characterize this in order to (partially) identify the Conditional Average Potential Outcome (CAPO) and the Average Potential Outcome (APO):
$$\mathrm{CAPO}(t, x) = \mathbb{E}[Y_t \mid X = x]; \qquad \mathrm{APO}(t) = \mathbb{E}[Y_t].$$
• $p(y_t \mid \tau, x)$ [counterfactual]. What is the potential outcome of a person in the population characterized by $x$ and assigned treatment $\tau$? The answer changes with $\tau$ only when $x$ is inadequate to block all backdoor paths through confounders. We can estimate this for $t = \tau$.
• $p(\tau \mid y_t, x)$ [complete propensity] is related to the above by Bayes' rule. We distinguish it from the nominal propensity $p(\tau \mid x)$ because the unobservable $y_t$ possibly confers more information about the individual, again if $x$ is inadequate. The complete propensity cannot be estimated, even for $t = \tau$; hence, this is the target of our sensitivity model.

[Figure 3.2: a causal graph over $Y_t$, $Z$, $X$, and $T$.]
Figure 3.2. In this example, $Z$ encompasses all hidden confounders. Counterfactual $p(y_t \mid \tau, x)$ diverges from $p(y_t \mid x)$ because of the red path from $T$ to $Y_t$ through $Z$.

A backdoor path between potential outcomes and treatment can manifest in several ways. Figure 3.2 shows the barebones setting for hidden confounding to take place. Simply noisy observations of the confounders could leak a backdoor path. It is important to understand the ontology [295] of the problem in order to ascribe hidden confounding to the stochasticity inherent to a potential outcome.

Sensitivity. Explored by Tan [317] followed by Kallus, Mao, and Zhou [158], Jesson, Mindermann, Gal, et al. [143], among many others, the Marginal Sensitivity Model (MSM) serves to bound the extent of (putative) hidden confounding in the regime of binary treatments $T' \in \{0, 1\}$. The MSM limits the discrepancy between the odds of treatment under the nominal propensity and the odds of treatment under the complete propensity.

Definition 3 (The Marginal Sensitivity Model). For binary treatment $t' \in \{0, 1\}$ and violation factor $\Gamma \ge 1$, the following ratio is bounded:
$$\Gamma^{-1} \;\le\; \left(\frac{p(t' \mid x)}{1 - p(t' \mid x)}\right)^{-1} \frac{p(t' \mid y_{t'}, x)}{1 - p(t' \mid y_{t'}, x)} \;\le\; \Gamma.$$

The confines of a binary treatment afford a number of conveniences. For instance, one probability value is sufficient to describe the whole propensity landscape on a set of conditions, $p(1 - t' \mid \cdots) = 1 - p(t' \mid \cdots)$. As we transfer to the separate context of treatment continua, we must contend with infinite treatments and infinite potential outcomes.

3.2 Continuous Sensitivity Model

The counterfactuals required for Equation 3.1 are almost entirely unobservable. We look to the Radon-Nikodym derivative $\omega_\delta$ of a counterfactual with respect to another [317], quantifying their divergence between nearby treatment assignments: (assuming mutual continuity)
$$\omega_\delta(y_t \mid \tau, x) := \frac{p(y_t \mid \tau + \delta, x)}{p(y_t \mid \tau, x)} \overset{\text{(Bayes' rule)}}{=} \frac{p(\tau + \delta \mid y_t, x)\, p(\tau \mid x)}{p(\tau \mid y_t, x)\, p(\tau + \delta \mid x)} = \left(\frac{p(\tau + \delta \mid x)}{p(\tau \mid x)}\right)^{-1} \frac{p(\tau + \delta \mid y_t, x)}{p(\tau \mid y_t, x)}.$$
As with the MSM, we encounter a ratio of odds, here contrasting $\tau$ versus $\tau + \delta$ in the assigned-treatment continuum. Assuming the densities are at least once differentiable,
$$\lim_{\delta \to 0} \delta^{-1} \log \omega_\delta(y_t \mid \tau, x) = \partial_\tau\big[\log p(\tau \mid y_t, x) - \log p(\tau \mid x)\big].$$
By constraining $\omega_\delta$ to be close to unit, via bounds above and below, we tie the logarithmic derivatives of the nominal- and complete-propensity densities.

Definition 4 (The Infinitesimal Marginal Sensitivity Model).
For treatments $t \in \mathcal{T} \subseteq \mathbb{R}$, where $\mathcal{T}$ is connected, and violation-of-ignorability factor $\Gamma \ge 1$, the δMSM requires
$$\left|\frac{\partial}{\partial\tau}\log\frac{p(\tau \mid y_t, x)}{p(\tau \mid x)}\right| \;\le\; \log\Gamma$$
everywhere, for all $\tau$, $t$, and $x$ combinations.

This differs from the CMSM due to Jesson, Douglas, Manshausen, et al. [142] that considers only $t = \tau$, and which bounds the density ratios directly.

3.2.1 The Complete Framework

Assumption 1 (Bounded Hidden Confounding). Invoking Definition 4, the violation of ignorability is constrained by a δMSM with some $\Gamma \ge 1$.

Assumption 2 (Anchor Point). A special treatment value designated as zero is not informed by potential outcomes: $p(\tau = 0 \mid y_t, x) = p(\tau = 0 \mid x)$ for all $x$, $t$, and $y_t$.

At this point we state the core sensitivity assumptions. In addition to the δMSM, we require an anchor point at $T = 0$, which may be considered a lack of treatment. Strictly, we assume that hidden confounding does not affect the propensity density precisely at the anchor point. A broader interpretation is that the strength of causal effect, hence vulnerability to hidden confounders, roughly increases with $|T|$. Assumption 2 is necessary to make closed-form solutions feasible. We discuss ramifications and a relaxation in §3.2.3.

The unobservability of almost all counterfactuals is unique to the case of continuous treatments, since the discrete analogy would be a discrete sum with an observable term. Figure 3.3 explains our approach to solving Equation 3.1.

[Figure 3.3, two panels: (a) "Confounded Outcomes for Binary Treatments," depicting $P[Y_t] = P[Y \mid T = t]\,P[T = t] + P[Y_t \mid T = 1 - t]\,P[T = 1 - t]$; (b) "Confounded Outcomes for Continuous Treatments," where over the possible treatment assignments $T$, only $P[Y_t \mid T = t]$ is observable (infinitesimal support), counterfactuals are partially approximable for $T \approx t$, and completely unknown for $T \neq t$.]
Figure 3.3. In the binary case, the red part is unobservable, but the MSM condition helps to bound that quantity. In the continuous case the integrand (Equation 3.1) is unobservable almost everywhere in the space of assigned treatments, except for the infinitesimal point $T = t$. In order to divide the integral into two parts (observable and unobservable) like with the binary sum, we must draw an approximation where assigned treatment and potential-outcome index are close enough. We use a soft window (yellow) to mark the validity of the approximation. Our continuous version of the MSM, the δMSM, allows us to bound the red part as well as reason about the yellow part. Covariates $X$ are omitted for brevity.

3.2.2 A Partial Approximation

We expand $p(y_t \mid \tau, x)$ around $\tau = t$, where $p(y_t \mid t, x) = p(y \mid t, x)$ is learnable from data. Suppose that $p(y_t \mid \tau, x)$ is twice differentiable in $\tau$. Construct a Taylor expansion
$$p(y_t \mid \tau, x) = p(y_t \mid t, x) + (\tau - t)\,\partial_\tau p(y_t \mid \tau, x)\big|_{\tau=t} + \frac{(\tau - t)^2}{2}\,\partial^2_\tau p(y_t \mid \tau, x)\big|_{\tau=t} + O(\tau - t)^3. \tag{3.2}$$
Denote with $\tilde p(y_t \mid \tau, x)$ an approximation of second order as laid out above. One could have stopped at lower orders but the difference in complexity is not that large. The intractable derivatives like $\partial_\tau p(y_t \mid \tau, x)|_{\tau=t}$ will be bounded using the δMSM machinery. Let us quantify the reliability of this approximation by a trust-weighing scheme $0 \le w_t(\tau) \le 1$, where typically $w_t(t) = 1$. This corresponds to the yellow part in Figure 3.3. We argue that $w_t(\tau)$ should be narrower with lower-entropy (narrower) propensities (§B). The possible forms of $w_t(\tau)$ are elaborated in §3.2.3.
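As an aside, the δMSM constraint of Definition 4 is easy to probe numerically for any hypothesized pair of nominal and complete propensities. The sketch below is illustrative only: the Beta nominal propensity, the exponential-tilting form of confounding, and the tolerance are assumptions, not part of the method.

```python
import numpy as np
from scipy import stats

def delta_msm_satisfied(log_p_complete, log_p_nominal, taus, log_gamma):
    """Check |d/dτ log[p(τ|y_t,x) / p(τ|x)]| <= log Γ on a grid, via finite differences."""
    ratio = log_p_complete(taus) - log_p_nominal(taus)
    slope = np.gradient(ratio, taus)              # finite-difference derivative in τ
    return np.all(np.abs(slope) <= log_gamma + 1e-9), float(np.max(np.abs(slope)))

# Hypothetical example: nominal propensity Beta(2, 3); a confounded "complete"
# propensity obtained by exponential tilting, p(τ|y_t,x) ∝ p(τ|x) exp(γ τ).
taus = np.linspace(1e-3, 1 - 1e-3, 500)
nominal = lambda t: stats.beta.logpdf(t, 2, 3)
gamma_true = 0.3                                   # hidden-confounding tilt (assumed)
norm_const = np.log(np.trapz(np.exp(nominal(taus) + gamma_true * taus), taus))
complete = lambda t: nominal(t) + gamma_true * t - norm_const

ok, worst = delta_msm_satisfied(complete, nominal, taus, log_gamma=np.log(1.5))
print(f"satisfies the δMSM with Γ = 1.5: {ok} (max |∂τ log ratio| = {worst:.3f})")
```

For the exponential tilt, the log-density ratio has constant slope γ, so any Γ with log Γ ≥ γ admits it; less regular forms of confounding would be screened out by the same check.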
Splitting Equation 3.1 along the trusted regime marked by $w_t(\tau)$, and then applying the approximation of Equation 3.2,
$$p(y_t \mid x) = \underbrace{\int_\mathcal{T} w_t(\tau)\, p(y_t \mid \tau, x)\, p(\tau \mid x)\, d\tau}_{\text{“observable” (Fig. 3.3)}} + \underbrace{\int_\mathcal{T} [1 - w_t(\tau)]\, p(y_t \mid \tau, x)\, p(\tau \mid x)\, d\tau}_{\text{“unobservable” (Fig. 3.3)}}$$
$$\approx \underbrace{\int_\mathcal{T} w_t(\tau)\, \tilde p(y_t \mid \tau, x)\, p(\tau \mid x)\, d\tau}_{\text{(A) the approximated quantity}} + \underbrace{\int_\mathcal{T} [1 - w_t(\tau)]\, p(\tau \mid y_t, x)\, p(y_t \mid x)\, d\tau}_{\text{(B) by Bayes' rule}}. \tag{3.3}$$
The intuition behind separating the integral into two parts is the following. By choosing the weights $w_t(\tau)$ so that they are close to one in the range where approximation Equation 3.2 is valid (yellow region in Figure 3.3) and zero outside of this range, we can evaluate the first integral through the approximated counterfactuals. The second integral, which is effectively over the red region in Figure 3.3 and cannot be evaluated due to unobserved counterfactuals, will be bounded using the δMSM. Simplifying the second integral first,
$$\int_\mathcal{T} [1 - w_t(\tau)]\, p(\tau \mid y_t, x)\, p(y_t \mid x)\, d\tau = p(y_t \mid x)\left[1 - \int_\mathcal{T} w_t(\tau)\, p(\tau \mid y_t, x)\, d\tau\right].$$
By algebraic manipulation, we witness already that $p(y_t \mid x)$ shall take the form of
$$p(y_t \mid x) \approx \frac{\int_\mathcal{T} w_t(\tau)\, \tilde p(y_t \mid \tau, x)\, p(\tau \mid x)\, d\tau}{\int_\mathcal{T} w_t(\tau)\, p(\tau \mid y_t, x)\, d\tau}. \tag{3.4}$$
Reflecting on Assumptions 1 & 2, the divergence between $p(\tau \mid y_t, x)$ and $p(\tau \mid x)$ is bounded, allowing characterization of the denominator in terms of the learnable $p(\tau \mid x)$. Similarly the derivatives in Equation 3.2 can be bounded. These results would be sufficient to partially identify the numerator. Without loss of generality, consider the unknown quantity $\gamma$ that can be a function of $\tau$, $y_t$, and $x$, such that
$$\partial_\tau \log p(\tau \mid y_t, x) = \partial_\tau \log p(\tau \mid x) + \gamma(\tau \mid y_t, x), \quad \text{where } |\gamma(\tau \mid y_t, x)| \le \log\Gamma \text{ using the δMSM}. \tag{3.5}$$
We may attempt to integrate both sides;
$$\int_0^s \partial_\tau \log p(\tau \mid y_t, x)\, d\tau = \int_0^s \partial_\tau \log p(\tau \mid x)\, d\tau + \underbrace{\int_0^s \gamma(\tau \mid y_t, x)\, d\tau}_{:=\,\lambda(s \mid y_t, x)}.$$
$$\therefore\quad \log p(\tau = s \mid y_t, x) - \log p(\tau = 0 \mid y_t, x) = \log p(\tau = s \mid x) - \log p(\tau = 0 \mid x) + \lambda(s \mid y_t, x),$$
$$\log p(\tau \mid y_t, x) = \log p(\tau \mid x) + \lambda(\tau \mid y_t, x) \quad \text{(by Assumption 2)},$$
$$p(\tau \mid y_t, x) = p(\tau \mid x)\,\Lambda(\tau \mid y_t, x), \qquad \Lambda := \exp\{\lambda\}. \tag{3.6}$$
One finds that $|\lambda(\tau \mid y_t, x)| \le |\tau|\log\Gamma$ because $\lambda$ integrates $\gamma$, bounded by $\pm\log\Gamma$, over a support with length $\tau$. Subsequently, $\Lambda$ is bounded by $\Gamma^{\pm|\tau|}$. These are the requisite tools for bounding $p(y_t \mid x)$—or an approximation thereof, erring on ignorance via the trusted regime marked by $w_t(\tau)$. The derivation is completed in §A by framing the unknown quantities in terms of $\gamma$ and $\Lambda$, culminating in Equation 3.7.

Predicting potential outcomes. The recovery of a fully normalized probability density $\tilde p(y_t \mid x)$ via Equation 3.4 is laid out below. It may be approximated with Monte Carlo or solved in closed form with specific formulations for the weights and propensity. Concretely, it takes on the form
$$\tilde p(y_t \mid x) = d(t \mid y_t, x)^{-1}\, p(y_t \mid t, x), \quad \text{where}$$
$$d(t \mid y_t, x) := \mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)] - [\gamma\Lambda](t \mid y_t, x)\, \mathbb{E}_\tau[\tau - t] - \tfrac{1}{2}\big[(\dot\gamma + \gamma^2)\Lambda\big](t \mid y_t, x)\, \mathbb{E}_\tau[(\tau - t)^2], \tag{3.7}$$
and said expectations, $\mathbb{E}_\tau[\cdot]$, are with respect to the implicit distribution $q(\tau \mid t, x) \propto w_t(\tau)\, p(\tau \mid x)$. The notation $\dot\gamma$ denotes a derivative in the first argument of $\gamma(t \mid y_t, x)$.

Assumption 3 (Second-order Simplification). The quantity $\dot\gamma(\tau \mid y_t, x)$ cannot be characterized as-is. We grant that $\gamma^2$ dominates over the former, and consequently $(\dot\gamma + \gamma^2)\Lambda \le \gamma^2\Lambda + \varepsilon$ for small $\varepsilon \ge 0$.
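Before turning to the closed forms of Table 3.1 below, a small Monte Carlo sketch shows how Equation 3.7's denominator can be bracketed under the δMSM by replacing the unknown $\Lambda$ and $\gamma$ with their extremes. The Beta propensity, the trust precision, and the use of absolute first moments (a conservative choice) are assumptions of this sketch; the exact recipe and parametric solutions are given in Equation 3.8 and Table 3.1.

```python
import numpy as np
from scipy import stats

def d_bounds_monte_carlo(t, log_gamma, propensity, trust_weight, n_draws=100_000, seed=0):
    """Conservative worst-case bounds on d(t|y_t,x) of Eq. 3.7 under the δMSM.

    Draws τ from the nominal propensity and reweighs by w_t(τ), so expectations are
    under q(τ|t,x) ∝ w_t(τ) p(τ|x); then Λ and γ are set to their δMSM extremes
    Γ^{±|τ|} and ±log Γ."""
    rng = np.random.default_rng(seed)
    tau = propensity.rvs(size=n_draws, random_state=rng)
    w = trust_weight(tau, t)
    w = w / w.sum()                                   # self-normalized importance weights
    e_lambda_hi = np.sum(w * np.exp(+log_gamma * np.abs(tau)))   # E_τ[Γ^{+|τ|}]
    e_lambda_lo = np.sum(w * np.exp(-log_gamma * np.abs(tau)))   # E_τ[Γ^{-|τ|}]
    m1 = np.sum(w * (tau - t))                        # E_τ[τ - t]
    m2 = np.sum(w * (tau - t) ** 2)                   # E_τ[(τ - t)^2]
    gamma_t = np.exp(log_gamma * abs(t))              # worst-case Λ at τ = t
    d_hi = e_lambda_hi + log_gamma * gamma_t * abs(m1) + 0.5 * log_gamma**2 * gamma_t * m2
    d_lo = e_lambda_lo - log_gamma * gamma_t * abs(m1)
    return d_lo, d_hi

# Hypothetical Beta propensity on [0, 1] and a Beta-shaped trust weight of precision r.
r = 16.0
propensity = stats.beta(3.0, 4.0)
trust = lambda tau, t: tau ** (r * t) * (1 - tau) ** (r * (1 - t))   # ∝ w_t(τ), unnormalized
print(d_bounds_monte_carlo(t=0.25, log_gamma=np.log(1.5), propensity=propensity, trust_weight=trust))
```

At Γ = 1 both bounds collapse to 1, consistent with the text's observation that the learned outcome density remains admissible in the absence of assumed hidden confounding.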
    Parametrization | Support (T)          | Precision (r)        | Bounds for $\mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)]$
    Beta            | $[0, 1]$             | $\alpha + \beta - 2$ | ${}_1F_1(\boldsymbol\alpha + 1;\ \boldsymbol\alpha + \boldsymbol\beta + 2;\ \pm\log\Gamma)$, where $\boldsymbol\alpha := \bar\alpha + \alpha - 2$, $\boldsymbol\beta := \bar\beta + \beta - 2$
    Balanced Beta   | $[0, 1]$             | $\alpha + \beta - 2$ | $t\cdot\langle$the Beta above$\rangle + (1 - t)\cdot\langle$Beta, mirrored$\rangle$
    Gamma           | $[0, +\infty)$       | $\alpha/\beta^2$     | $[1 - (\pm\log\Gamma)/\boldsymbol\beta]^{-\boldsymbol\alpha}$, where $\boldsymbol\alpha := \bar\alpha + \alpha - 1$, $\boldsymbol\beta := \bar\beta + \beta$
    Gaussian        | $(-\infty, +\infty)$ | $1/\sigma$           | $\tfrac{1}{2}\exp\{\boldsymbol\sigma^2(\log\Gamma)^2/2\}\big\{\Gamma^{\pm\boldsymbol\mu}\big[1 + \operatorname{erf}\big(\tfrac{\boldsymbol\mu \pm \boldsymbol\sigma^2\log\Gamma}{\sqrt{2}\,\boldsymbol\sigma}\big)\big] + \Gamma^{\mp\boldsymbol\mu}\big[1 - \operatorname{erf}\big(\tfrac{\boldsymbol\mu \mp \boldsymbol\sigma^2\log\Gamma}{\sqrt{2}\,\boldsymbol\sigma}\big)\big]\big\}$, where $\boldsymbol\mu := \tfrac{\mu\bar\sigma^2 + \bar\mu\sigma^2}{\bar\sigma^2 + \sigma^2}$, $\boldsymbol\sigma^2 := \tfrac{\bar\sigma^2\sigma^2}{\bar\sigma^2 + \sigma^2}$

Table 3.1. Candidates for propensity and trust-weighing combinations. Each row specifies the distribution—beta, beta, gamma, and Gaussian respectively—of the propensity model $p(\tau \mid x)$. The last column lists solutions for the first term of Equation 3.7 / 3.8. This is a convolution of the propensity and weighing scheme, which have similar forms (see Bromiley [30] for the Gaussian case). We distinguish the replicated parameters between propensity and weight by placing a bar over the propensity parameters. So if the propensity is $x \mapsto (\bar\alpha, \bar\beta)$, then the weighing scheme has $t \mapsto (\alpha, \beta)$. The bold parameters are of the compound density, with respect to which the first and second moments are computed in Equation 3.7 / 3.8.

To make use of the formula in Equation 3.7, one first obtains the set of admissible $d(t \mid y_t, x) \in \big[\underline d(t \mid y_t, x),\ \overline d(t \mid y_t, x)\big]$ that violate ignorability up to a factor $\Gamma$ according to the δMSM. With the negative side of the $\pm$ corresponding to $\underline d$ and the positive side to $\overline d$, the bounds are expressible as
$$\underline d,\ \overline d \;=\; \underbrace{\int_\mathcal{T} \Gamma^{\pm|\tau|}\, q(\tau \mid t, x)\, d\tau}_{\longrightarrow\ \mathbb{E}_\tau[\Lambda(\tau \mid y_t, x)]} \;+\; (\pm\log\Gamma)\,\Gamma^{|t|}\int_\mathcal{T} (\tau - t)\, q(\tau \mid t, x)\, d\tau \;+\; \tfrac{1}{2}\,\big\{0,\ \log^2\Gamma\big\}\,\Gamma^{|t|}\int_\mathcal{T} (\tau - t)^2\, q(\tau \mid t, x)\, d\tau. \tag{3.8}$$
The $\Gamma^{\pm|\tau|}$ in the first integral, as well as the alternating sign of the other two terms taken together, reveal that $\underline d \le 1 \le \overline d$ with equality at $\Gamma = 1$. This is noteworthy because it implies that $p(y \mid t, x)$ is admissible for the partially identified $\tilde p(y_t \mid x)$. We cannot describe $p(y_t \mid x)$ once $\underline d$ crosses zero.

Ensembles. To quantify empirical uncertainties [144] alongside our sensitivity, the predictors could be learned as ensembles, with $\tilde p(y_t \mid x)$ computed as (bootstrap resampled [203]) expectations over them.

3.2.3 Propensity-trust Combinations

In addition to developing the general framework above, we derive analytical forms for a myriad of parametrizations that span the relevant supports $\mathcal{T}$ for continuous treatments: the unit interval $[0, 1]$, the nonnegative reals $[0, +\infty)$, and the real number line $(-\infty, +\infty)$. For some nominal propensity distributions $p(\tau \mid x)$, we propose trust-weighing schemes $w_t(\tau)$ with shared form so that the expectations in Equation 3.8 are solvable. For instance, consider the parametrization $(T \mid X = x) \sim \mathrm{Beta}(\alpha(x), \beta(x))$. We select a Beta-like weighing scheme, rescaled and translated,
$$w^{\mathrm{beta}}_t(\tau) = c_t\, \tau^{a_t - 1}(1 - \tau)^{b_t - 1}.$$
Two constraints are imposed on every $w_t(\tau)$ studied herein:
• (the mode) that $w_t(\tau)$ peaks at $\tau = t$, and $w_t(t) = 1$.
• (the precision) that some $r > 0$ defines a narrowness of the form, and can be set a priori. For the beta version we chose $a_t + b_t = r + 2$.
These constraints imply that $a_t := rt + 1$, $b_t := r(1 - t) + 1$, and $c_t^{-1} := t^{rt}(1 - t)^{r(1 - t)}$.

[Figure 3.4: three panels for r = 4, r = 16, and r = 64.]
Figure 3.4. Beta parametrizations for $w_t(\tau)$ in the unit square, plotted for $t = 0.125, 0.25, 0.5$.

We present solutions for propensity-trust combinations in Table 3.1.
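As a numerical sanity check of the Beta row of Table 3.1, the confluent hypergeometric closed form can be compared against direct integration over the compound density. The parameter values below are hypothetical, and the code is a sketch rather than part of the method.

```python
import numpy as np
from scipy import special, stats

# With a Beta(ᾱ, β̄) propensity and Beta trust weights of precision r, the compound
# density q(τ|t,x) has exponents α = ᾱ + a_t − 2 and β = β̄ + b_t − 2, and the first
# term of Eq. 3.8 is E_τ[Γ^{±τ}] = 1F1(α + 1; α + β + 2; ±log Γ).
alpha_bar, beta_bar = 3.0, 4.0           # nominal propensity Beta(ᾱ, β̄), assumed
r, t = 16.0, 0.25                        # trust precision and treatment of interest
a_t, b_t = r * t + 1.0, r * (1.0 - t) + 1.0
alpha_bold = alpha_bar + a_t - 2.0
beta_bold = beta_bar + b_t - 2.0

log_gamma = np.log(1.5)
closed_hi = special.hyp1f1(alpha_bold + 1.0, alpha_bold + beta_bold + 2.0, +log_gamma)
closed_lo = special.hyp1f1(alpha_bold + 1.0, alpha_bold + beta_bold + 2.0, -log_gamma)

# Direct numerical integration over the compound Beta density for comparison.
taus = np.linspace(1e-6, 1 - 1e-6, 200_001)
q = stats.beta.pdf(taus, alpha_bold + 1.0, beta_bold + 1.0)
numeric_hi = np.trapz(np.exp(+log_gamma * taus) * q, taus)
numeric_lo = np.trapz(np.exp(-log_gamma * taus) * q, taus)
print(closed_hi, numeric_hi)   # these should agree to several decimal places
print(closed_lo, numeric_lo)
```

The agreement confirms that the tabulated closed form is simply the moment-generating function of the compound Beta distribution evaluated at ±log Γ.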
Balanced Beta stands out by not strictly obeying Assumption 2. Rather, it adheres to a symmetrified mixture that is more versatile to realistic situations. Formally, for all $t$, $y_t$, and $x$, Balanced Beta replaces Assumption 2 with
$$p(\tau = 0 \mid y_t, x) = p(\tau = 0 \mid x) \ \text{ w.p. } t, \qquad p(\tau = 1 \mid y_t, x) = p(\tau = 1 \mid x) \ \text{ w.p. } 1 - t.$$
This special parametrization deserves further justification. The premise is that distant treatments are decoupled; treatment assignment $\tau$ shares less information with a distal potential outcome $y_t$ than a proximal one. If that were the case, then the above linear interpolation favors the less informative anchor points for a given $t$. This is helpful because the sensitivity analysis is vulnerable to the anchor points. Stratifying the anchor points eventually leads to an arithmetic mixture of $d(t \mid y_t, x)$ in Equation 3.7 with its mirrored version about $t \mapsto 1 - t$, and $(\alpha, \beta) \mapsto (\beta, \alpha)$.

Controlling trust. The absolute error of the approximation in Equation 3.3.A is bounded above by a form that could grow with narrower propensities (see §B), in the Beta parametrization. Intuitively the error also depends on the smoothness of the complete propensity (Taylor residual). For that reason we used the heuristic of setting the trust-weighing precision $r$ to the nominal propensity precision.

3.3 Estimating the Intervals

We seek to bound partially identified expectations with respect to the true potential-outcome densities, which are constrained according to Equation 3.7 / 3.8. The quantities of interest are the Average Potential Outcome (APO), $\mathbb{E}[f(Y_t)]$, and Conditional Average Potential Outcome (CAPO), $\mathbb{E}[f(Y_t) \mid X = x]$, for any task-specific $f(y)$. We use Monte Carlo over $m$ realizations $y_i$ drawn from proposal density $g(y)$, and covariates from a subsample of instances:
$$\tilde{\mathbb{E}}\big[f(Y_t) \mid X \in \{x^{(j)}\}_{j \in J}\big] = \frac{\sum_{i=1}^m \sum_{j \in J} f(y_i)\, \tilde p(y_t = y_i \mid x^{(j)}) / g(y_i)}{\sum_{i=1}^m \sum_{j \in J} \tilde p(y_t = y_i \mid x^{(j)}) / g(y_i)}, \tag{3.9}$$
where $J \subseteq \{1 \ldots n\}$ indexes a subset of the finite instances. $|J| = 1$ recovers the formula for the CAPO, and $|J| = n$ for the APO. The partially identified $\tilde p(y_t \mid x)$ really encompasses a set of probability densities that includes $p(y \mid t, x)$ and smooth deviations from it. Our importance sampler ensures normalization [326], but is overly conservative [75]. For current purposes, a greedy algorithm may be deployed to maximize (or minimize) Equation 3.9 by optimizing the weights $w_i$ attached to each $f(y_i)$, within the range
$$\underline w_i := \frac{p(y_i \mid t, x)}{\overline d(t \mid y_i, x)\, g(y_i)}, \qquad \overline w_i := \frac{p(y_i \mid t, x)}{\underline d(t \mid y_i, x)\, g(y_i)}.$$
Our Algorithm 1 adapts the method of Jesson, Mindermann, Gal, et al. [143] and Kallus, Mao, and Zhou [158] to heterogeneous weight bounds $[\underline w_i, \overline w_i]$ per draw $i$. View a proof of correctness in §C. Others have framed the APO as the averaged CAPOs, and left the min/max optimizations on the CAPO level [142]. We optimize the APO directly, but have not studied the impact of one choice versus the other.

3.4 A Semi-synthetic Benchmark

It is common practice to test causal methods, especially under novel settings, with real datasets but synthetic outcomes [64], [67]. We adopted four exceedingly diverse datasets spanning health, bioinformatics, and social-science sources. Our variable-generating process preserved the statistical idiosyncrasies of each dataset. Confounders and treatment were random projections of the data, which were quantile-normalized for uniform marginals in the unit interval. Half the confounders were observed as covariates and the other half were hidden. The outcome was Bernoulli with random linear or quadratic forms mixing the variables before passing through a normal CDF activation function. Outcome and propensity models were linear and estimated by maximum likelihood. See §3.10.

Algorithm 1: The expectation maximizer, with O(n) runtime if intermediate $\Delta_j$ results are memoized.
    Input: $\{(\underline w_i, \overline w_i, f_i)\}_{i=1}^n$ ordered by ascending $f_i$.
    Output: $\max_w \tilde{\mathbb{E}}[f(X)]$ estimated by importance sampling with $n$ draws.
    1  Initialize $w_i \leftarrow \overline w_i$ for all $i = 1, 2, \ldots, n$;
    2  for $j = 1, 2, \ldots, n$ do
    3      Compute $\Delta_j := \sum_{i=1}^n w_i (f_j - f_i)$;
    4      if $\Delta_j < 0$ then
    5          $w_j \leftarrow \underline w_j$;
    6      else
    7          break;
    8  Return $\sum_i w_i f_i \,/\, \sum_i w_i$
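A minimal Python rendering of Algorithm 1 follows, maintaining the partial sums incrementally so that each $\Delta_j$ costs O(1). The usage example at the bottom is hypothetical: its weights and weight bounds are synthetic stand-ins for the importance weights of Equation 3.9.

```python
import numpy as np

def maximize_expectation(w_lo, w_hi, f):
    """Algorithm 1: maximize the self-normalized estimate Σ w_i f_i / Σ w_i over
    weights w_i in [w_lo_i, w_hi_i]. Inputs must be ordered by ascending f."""
    w = np.array(w_hi, dtype=float)              # step 1: initialize at the upper bounds
    f = np.asarray(f, dtype=float)
    s_w, s_wf = w.sum(), (w * f).sum()           # memoized sums behind Δ_j
    for j in range(len(f)):                      # step 2: sweep the cursor left to right
        delta_j = f[j] * s_w - s_wf              # ∝ Δ_j = Σ_i w_i (f_j − f_i)
        if delta_j < 0:                          # lowering w_j increases the estimate
            s_w += w_lo[j] - w[j]
            s_wf += (w_lo[j] - w[j]) * f[j]
            w[j] = w_lo[j]
        else:                                    # the pseudo-derivative is no longer negative
            break
    return s_wf / s_w, w

# Hypothetical usage: synthetic draws and nominal weights with a ±50% ignorance band.
rng = np.random.default_rng(1)
f = np.sort(rng.normal(size=1000))
w_nominal = np.exp(rng.normal(scale=0.2, size=1000))
upper, _ = maximize_expectation(w_nominal / 1.5, w_nominal * 1.5, f)
# Minimizing is symmetric: maximize E[-f] with the ordering (and bounds) reversed.
lower = -maximize_expectation((w_nominal / 1.5)[::-1], (w_nominal * 1.5)[::-1], -f[::-1])[0]
print(f"ignorance interval for E[f]: [{lower:.3f}, {upper:.3f}]")
```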
    Benchmarks       brain         blood         pbmc          mftc          % best   ratio to best
                     mean (std.)   mean (std.)   mean (std.)   mean (std.)
    δMSM (ours)      188 (70)      186 (62)      204 (75)      172 (65)      68.3     1.09 (0.22)
    CMSM             377 (155)     382 (153)     419 (156)     319 (158)     10.9     2.54 (2.33)
    uniform          229 (96)      225 (88)      247 (100)     206 (95)       6.5     1.30 (0.32)
    binary MSM       217 (81)      216 (76)      235 (87)      195 (81)      14.3     1.24 (0.31)

Table 3.2. Semi-synthetic benchmark: divergence costs of 90% coverage of the Average Potential Outcome (APO), multiplied by 1000. The four datasets are listed on top. We report averages over 500 trials per experiment. A paired t-test and sign test, roughly corresponding to the mean and median, showed significant improvement by the δMSM over the others with all $p < 10^{-18}$. “% best” counts the proportion of trials that each method outperformed the rest, and “ratio to best” is the average cost ratio to the best method's in each trial—closer to one is better.

Selecting the baselines. The δMSM with Balanced Beta was benchmarked against three relevant baselines.
• (CMSM) Use the recent model by Jesson, Douglas, Manshausen, et al. [142], where $\underline d := \Gamma^{-1} p(\tau \mid x)$, $\overline d := \Gamma^{+1} p(\tau \mid x)$.
• (uniform) Suppose $\underline d := \Gamma^{-1}$, $\overline d := \Gamma^{+1}$, as if the propensity were uniform and constant.
• (binary MSM) Shoehorn the propensity into the classic MSM [317] by considering the treatment as binary with indicator $\mathbb{I}[T > 0.5]$.
Note that the CMSM becomes equivalent to the “uniform” baseline above when CAPOs are concerned (Equation 3.9 with $m = 1$), which are not studied in this benchmark.

[Figure 3.5: "Divergence Costs Visualized," plotting the APO Bernoulli parameter against treatment $t$, with KL divergences (low to high) shaded between bounds A and B.]
Figure 3.5. Divergence cost measures the size of the ignorance intervals (blue), weighted by the badness of each estimate (red). The black line is the true APO. Coverage is the portion of treatments contained between A and B in this example.

Scoring the coverages. A reasonable goal would be to achieve a certain amount of coverage [225] of the true APOs, like having 90% of the curve be contained in the ignorance intervals. Since violation factor $\Gamma$ is not entirely interpretable, nor commensurable across sensitivity models, we measure the size of an ignorance interval via a cost incurred in terms of actionable inference. For each point $t$ of the dose-response curve, we integrated the KL divergence of the actual APO (which defines the $Y_t$ Bernoulli parameter) against the predicted APO uniformly between the bounds. This way, each additional unit of ignorance interval is weighed by its information-theoretic approximation cost. This score is a divergence cost of a target coverage.
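A brief sketch of this divergence-cost score is given below. The averaging convention over treatment points and the grid approximation of the uniform integral between the bounds are assumptions of the sketch, not a prescription of the exact benchmark code.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)), elementwise."""
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def divergence_cost(true_apo, lower, upper, n_grid=200):
    """Average over treatment points of the KL divergence of the true APO against
    predictions spread uniformly between the ignorance bounds."""
    costs = []
    for p, lo, hi in zip(true_apo, lower, upper):
        grid = np.linspace(lo, hi, n_grid)           # uniform over the interval [lo, hi]
        costs.append(bernoulli_kl(p, grid).mean())   # ≈ (1/(hi-lo)) ∫ KL(p‖q) dq
    return float(np.mean(costs))

# Hypothetical curves on a grid of treatments: wide intervals that still miss the
# truth incur a larger cost than tight, well-centered ones.
t = np.linspace(0, 1, 100)
true_apo = 0.3 + 0.4 * t
print(divergence_cost(true_apo, true_apo - 0.05, true_apo + 0.05))
print(divergence_cost(true_apo, true_apo - 0.20, true_apo + 0.02))
```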
Analysis. The main results are displayed in Table 3.2. There were ten confounders and the true dose-response curve was a random quadratic form in the treatment and confounders. Other settings are shown in §3.10, Table 3.4. Each trial exhibited completely new projections and outcome functions. Hence there were different levels and types of confounding each time, as well as varying goodness of fit.

[Figure 3.6: "δMSM for High Coverage," plotting % best against % coverage from 60 to 95.]
Figure 3.6. Performance for different coverages. Black line: rate of δMSM achieving lowest divergence cost compared to baselines. Dashed line: expected rate if the chance of any one method outperforming another were identical.

3.5 A Real-world Exemplar

The UK Biobank [35] is a large, densely phenotyped epidemiological study with brain imaging. We preprocessed 40 attributes, eight of which were continuous diet quality scores (DQSs) [291], [383] valued 0–10 and serving as treatments, on 42,032 people. The outcome was thicknesses of 34 cortical brain regions. A poor DQS could translate to noticeable atrophy in the brain of some older individuals, depending on their attributes [109], [228]. Continuous treatments enable the (Conditional) Average Causal Derivative, $(\mathrm{C})\mathrm{ACD} := \partial\,\mathbb{E}[Y_t \mid X]\,/\,\partial t$. The CACD informs investigators on the incremental change in outcome due to a small change in an individual's given treatment. For instance, it may be useful to identify the individuals who would benefit the most from an incremental improvement in diet.

We plotted the age distributions of the top 1% individuals by CACD (diet → cortical thickness) in Figure 3.7. We also compared the δMSM to an equivalent binary MSM where CACDs are computed in the latter case by thresholding the binary propensity at $t$. Each model's violation factor $\Gamma$ was set for an equivalent amount (∼30%) of nonzero CACDs. Under the δMSM, the DQSs with strongest average marginal benefit ranked as vegetables, whole grains, and then meat, for both females and males. They differed under the binary MSM, with meat, then whole grains as the top for females and dairy, then refined grains as the top for males.

3.6 Discussion

Sensitivity analyses for hidden confounders can help to guard against erroneous conclusions from observational studies. We generalized the practice to causal dose-response curves, thereby increasing its practical
how older people would be more impacted by (retroactive) changes to their reported diets. Additionally, the top effectual DQSs appeared more consistent with the δMSM rather than the binary MSM. Contrasting the CMSM. Another recently proposed sensitivity model for continuous-valued treatments is the CMSM [142], which was included in our benchmark, §3.4. Unlike the δMSM, the CMSM does not always guarantee d ≤ 1 ≤ d and therefore p(y|t, x) need not be admissible for p˜(yt |x). For partial identification of the CAPO with importance sampling, the propensity density factors out and does not affect outcome sensitivity under the CMSM. For that implementation it happens that p(y|t, x) is indeed admissible. However, we believe that the nominal propensity should play a role in the CAPO’s sensitivity 64 to hidden confounders, as both the CMSM and the δMSM couple the hidden confounding (via the complete propensity) to the nominal propensity. Equations 3.7 & 3.8 make it clear that the propensity plays a key role in outcome sensitivity under the δMSM for both CAPO and APO. We remind the reader of the original MSM that bounds a ratio of complete and nominal propensity odds. The δMSM takes that structure to the infinitesimal limit and maintains the original desirable property of p(y|t, x) admissibility for p˜(yt |x). Looking ahead. Alternatives to sampling-based Algorithm 1 deserve further investigation for computing ignorance intervals on expectations—but not only. Our analytical solutions bound the density function p(yt |x) of conditional potential outcomes, which can generate other quantities of interest [156] or play a role in larger pipelines. Further, an open challenge with the δMSM would be to find a pragmatic solution to sharp partial identification. Recent works have introduced sharpness to binary-treatment sensitivity analysis [246]. Conclusion We recommend the novel δMSM for causal sensitivity analyses with continuous-valued treatments. The simple and practical Monte Carlo estimator for the APO and CAPO (Algorithm 1) gives tighter ignorance intervals with the δMSM than alternatives. We believe that the partial identification of the potentialoutcome density shown in Equation 3.8, in conjunction with the parametric formulas of Table 3.1, is of general applicability for causal inference in real-world problems. The variety of settings presented in that table allow a domain-informed selection of realistic sensitivity assumptions. For instance, when estimating the effect of a real-valued variable’s deviations from some base value, like a region’s current temperature compared to its historical average, the Gaussian scheme could be used. Gamma is ideal for one-sided or unidirectional deviations. Finally, Balanced Beta is recommended for measurements in an interval where neither of the endpoints is special. 65 3.7 Appendix A: Completing the Derivations Consider Equation 3.3.A: Z 1 0 wt(τ )˜p(yt |τ, x)p(τ |x) dτ = p(yt |t, x) Z 1 0 wt(τ )p(τ |x) dτ | {z } (A.0) + g1(yt |t, x) Z 1 0 wt(τ )(τ − t)p(τ |x) dτ | {z } (A.1) + g2(yt |t, x) Z 1 0 wt(τ ) (τ − t) 2 2 p(τ |x) dτ | {z } (A.2) , where gk(yt |t, x) := ∂ k τ p(yt |τ, x)|τ=t . (3.10) Lightening the notation with a shorthand for the weighted expectations, ⟨·⟩τ := R 1 0 wt(τ )(·)p(τ |x) dτ, it becomes apparent that we must grapple with the pseudo-moments ⟨1⟩τ , ⟨τ − t⟩τ , and ⟨(τ − t) 2 ⟩τ . Note that t should not be mistaken for a “mean” value. Furthermore, we have yet to fully characterize gk(yt |t, x). 
Observe that p(yt |τ, x) = p(τ |yt , x)p(yt |x) p(τ |x) ⇐⇒ ∂τ p(yt |τ, x) = p(yt |x) · ∂ ∂τ p(τ |yt , x) p(τ |x) . The p(yt |x) will be moved to the other side of the equation as needed; by Equation 3.6, ∂ ∂τ p(τ |yt , x) p(τ |x) = ∂ ∂τ Λ(τ |yt , x). Expanding, = ∂ ∂τ expZ τ 0 γ(τ |yt , x) dτ = γ(τ |yt , x) expZ τ 0 γ(τ |yt , x) dτ = (γΛ)(τ |yt , x). Appropriate bounds will be calculated for g2(yt |t, x) next, utilizing the finding above as their main ingredient. Let g˜k(yt |t, x) := p(yt |x) −1 gk(yt |t, x) = ∂ ∂τ k p(τ |yt , x) p(τ |x) τ=t. 66 The second derivative may be calculated in terms of the ignorance quantities γ,Λ: g˜2(yt |t, x) =∂τ γ(τ |yt , x)Λ(τ |yt , x) =γ(τ |yt , x) 2Λ(τ |yt , x) + ˙γ(τ |yt , x)Λ(τ |yt , x) =(γ 2 + ˙γ)Λ(τ |yt , x). And finally we address p˜(yt |x). Carrying over the components of Equation 3.10 into Equation 3.3, p˜(yt |x) = p(yt |t, x)⟨1⟩τ ⟨Λ(τ |yt , x)⟩τ − g˜1(yt |t, x)⟨τ − t⟩τ − g˜2(yt |t, x)⟨(τ − t) 2⟩τ = p(yt |t, x) Eτ [Λ(τ |yt , x)] − (γΛ)(t|yt , x) Eτ [τ − t] − 1 2 (( ˙γ + γ 2)Λ)(t|yt , x) Eτ [(τ − t) 2] , (3.11) where these expectations Eτ [·] are with respect to the implicit distribution q(τ |t, x) ∝ wt(τ )p(τ |x). The first term in the denominator, Eτ [Λ(τ |yt , x)], may be approximately bounded by the same Algorithm 1. 3.8 Appendix B: How to Calibrate the Weighing Scheme We present an argument based on the absolute error of the approximation in Equation 3.2, specifically for Beta propensities. The following applies to both Beta and Balanced Beta, 0 < t < 1. Suppose that the the second derivative employed in the Taylor expansion is Q-Lipschitz, so that ∂ 3 τ p(yt |τ, x) ≤ Q. Denote the remainder as ρ(yt |τ, x). By Taylor’s theorem, |ρ(yt |τ, x)| ≤ |τ − t| 3 6 Q. The approximated quantity (part A) in Equation 3.3 is the following integral, which ends up becoming the numerator in Equation 3.4: Z 1 0 wt(τ )˜p(yt |τ, x)p(τ |x) dτ = Z 1 0 wt(τ ) p(yt |τ, x) + ρ(yt |τ, x) p(τ |x) dτ. The absolute error of this integral is therefore Z 1 0 wt(τ )ρ(yt |τ, x)p(τ |x) dτ ≤ 1 6 Q Z 1 0 wt(τ )p(τ |x)|τ − t| 3 dτ | {z } := J, which upper-bounds the error. by the remainder theorem. 67 Let A = α − 1 + rt and B = β − 1 + r(1 − t), where (α, β) parametrize the nominal propensity and r is the precision of the Beta trust-weighing scheme. The trust-propensity combination is wt(τ )p(τ |x) = τ A(1 − τ ) B ct B(α, β) , where ct = t rt(1 − t) r(1−t) . Hence, the error bound reduces to J = [ct B(α, β)]−1 Z 1 0 τ A(1 − τ ) B|τ − t| 3 dτ = [ct B(α, β)]−1 Γ(A + 1)Γ(B + 1) Γ(A + B + 5) U3(A, B, t) | {z } first term + Γ(A + 1) Γ(A + 5)12t A+4(1 − t) B+4 2F1(4, A + B + 5, A + 5; t) | {z } second term , where U3(A, B, t) is a cubic polynomial in A, B, and t. Notice that even though the quantity is symmetric about (A, B, t) 7→ (B, A, 1 − t), the form does not appear so. We shall focus on the relation of the error bound entirely with A and α, then justify the analogous conclusion for B and β by the underlying symmetry of the expression. The Gaussian hypergeometric function in the second term can be expressed as X∞ i=0 (4)i(A + B + 5)i (A + 5)i t i i! = X∞ i=0 (4)i A + B + 5 A + 5 A + B + 6 A + 6 · · · | {z } length i t i i! = X∞ i=0 (4)i i! 1 + B A + 5 1 + B A + 6 · · ·t i , where (4)i i! = (i + 2)(i + 3)(i + 4) 3! . by using the definition of the Pochhammer symbol (x)i = x(x + 1). . .(x + i − 1). In terms of A → ∞, the whole second term in J is O(A−4 ) due to the fraction of Γ functions. 
The first term in J is O(A −(B+4)B −(A+4)) · U3(A, B, t) = O(A −B−1B −A−1 ) by Stirling’s approximation of Γ(x) = O(x x− 1 2 ). Clearly, a small B > 0 might cause the first term in J to explode with large A due to the O(B−A−1 ) part. This could occur with high α, low β, and low r—it is an instance of a high-precision propensity and low-precision weighing scheme destroying the upper error bound. Hence follows an argument for having r match the propensity’s precision, to avoid these cases. As mentioned earlier, the same argument flows for large B and small A, while swapping t 7→ (1 − t). 68 3.9 Appendix C: Correctness of Algorithm 1 The algorithm functions by incrementally reallocating mass (relative, in the weights) to the righthand side, from a cursor beginning on the lefthand side of the “tape”. Proof. Firstly we characterize the indicator quantity ∆j . Differentiate the quantity to be maximized with respect to wj ; ∂ ∂wj P P i wifi i wi = P fj i wi − P i wifi ( P i wi) 2 = fj P i wi − P i wifi ( P i wi) 2 ∝ X i wi(fj − fi) | {z } :=∆j up to some positive factor. Hence, ∆j captures the sign of the derivative. We shall proceed with induction. Begin with the first iteration, j = 1. No weights have been altered since initialization yet. Therefore we have ∆1 = X i wi(f1 − fi). Since ∀i, f1 ≤ fi due to the prior sorting, ∆1 is either negative or zero. If zero, trivially terminate the procedure as all function values are identical. Now assume that by the time the algorithm reaches some j > 1, all wk = wk for 1 ≤ k < j. In other words, ∆j = X i<j wi (fj − fi) | {z } (+) + X i>j wi (fj − fi) | {z } (−) . Per the algorithm, we would flip the weight wj ← wj only if ∆j < 0. In that case, X i<j wi (fj − fi) < X i>j wi(fi − fj ), where both sides are non-negative. Notice that the above is not affected by the current value of wj . This update can only increase the current estimate because the derivative remains negative and the weight at j is non-increasing. We must verify that the derivatives for the previous weights, indexed at k < j, remain negative. Otherwise, the procedure would need to backtrack to possibly flip some weights back up. 69 More generally, with every decision for weight assignment, we seek to ensure that the condition detailed above is not violated for any weights that have been finalized. That includes the weights before j, and those after j at the point of termination. Returning from this digression, at k < j after updating wj , ∆k = X i≤j wi (fk − fi) +X i>j wi(fk − fi). To glean the sign of this, we refer to a quantity that we know. X i<j wi (fj − fi) < X i>j wi(fi − fj ) ⇐⇒ X i≤j wi (fk − fi) < X i>j wi(fi − fj ) +X i≤j wi (fk − fj ) ⇐⇒ X i≤j wi (fk − fi) +X i>j wi(fk − fi) | {z } ∆k < X i>j wi(fk − fj ) +X i≤j wi (fk − fj ) | {z } negative. The remaining fact to be demonstrated is that upon termination, when ∆j ≥ 0, no other pseudo-derivatives ∆j ′, j′ > j are negative. This must be the case simply because fj ′ ≥ fj . 3.10 Appendix D: Details on the Benchmark During each trial, 750 train and 250 test instances of (observed/hidden) confounders, treatment, and outcome were generated. The APO was computed on the test instances. Coverage of the dose-response curve was assessed on a treatment grid of 100 evenly spaced points in [0, 1]. The different violation factors Γ that were tested were also from a 100-sized grid in [0, 2.5]. The data-generating process constructed vectors V := ⟨visible conf. . . , treatment, hidden conf. . 
.⟩ $\in [0, 1]^k$ where $k$ is the number of confounders plus one, for the treatment. Each of these variables is a projection of the original data with i.i.d. normal coefficients that was then transformed by its empirical cumulative distribution to be marginally unit-uniform. Then, we experimented with two functional forms of confounded dose-response curves. In each, we upscaled the coefficient of the middle (i.e. treatment) entry by $(k - 1)$ to keep the treatment's contribution strong.
• (linear) mixing vector $\{M_i\}_{i=1}^k \sim$ i.i.d. $\mathrm{Normal}(0, 1)$. Pre-activation outcome is $u := M \cdot v$.
• (quadratic) matrix $\{M_{ij}\} \sim$ i.i.d. $\mathrm{Normal}(0, 1)$. Pre-activation outcome is $u := v^\top M v$. Unlike a covariance, $M$ is not positive (semi-)definite. The fact that all entries are i.i.d. Gaussian implies that there are cases where the off-diagonal entries are much larger in magnitude than the on-diagonal entries, in such a way that cannot occur in a covariance matrix. This induces more confounding and strengthens our benchmark.
The actual outcome is Bernoulli with probability $u^\star := \phi\big((u - m)/s\big)$, wherein $\phi$ is the standard normal CDF, location parameter $m$ is the sample median, and scale $s$ is the sample mean absolute deviation from the median. If $u$ were normal, $s$ would be expected to be a bit smaller than $\sigma$, by a factor of $\sqrt{2/\pi}$. Generally $u^\star$ is no longer uniformly distributed (on margin) because we use $s$, and instead it gravitates towards zero or one. Since the estimated outcome models use logistic sigmoid activations, there is already an intentional measure of model mismatch present in this setup. See Table 3.4 for results under all the settings considered.

The linear outcome and propensity predictors were estimated by maximum likelihood using the ADAM gradient-descent optimizer, with learning rate $10^{1}$, 4 batches, and 50 epochs throughout. For the outcome, we used a sigmoid activation stretched horizontally by $10^2$ for smooth training. For the propensity, similarly, we stretched a sigmoid horizontally and vertically, gating the output in order to yield Beta parameters within $(0, 10^2)$.

Data sources. The datasets brain and blood both came from the UK Biobank, which is described in the case study of §3.5. The two datasets are taken from disjoint subsets of all the available fields, one pertaining to parcelized brain volumes (via MRI) and the other to blood tests. The pbmc dataset came from single-cell RNA sequencing, a modality that is exploding in popularity for bioinformatics. PBMC data are a commonly used benchmark in the field [162]. Finally, the mftc dataset consisted of BERT embeddings for morally loaded tweets [122], [232].

    Dataset   Sample Size   Dimension
    brain     43,069        148
    blood     31,811        42
    pbmc      14,039        16
    mftc      17,930        768

Table 3.3. Characteristics of the various datasets employed in our experiments.
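The variable-generating process described above can be sketched as follows. The function name, the choice of how to upweight the treatment in the quadratic case, and the placement of the treatment column are assumptions of this sketch; `data` stands in for any of the four source matrices.

```python
import numpy as np
from scipy.stats import norm, rankdata

def generate_trial(data, n_confounders=10, form="quadratic", seed=0):
    """Sketch of the Appendix-D generating process: random projections of a real
    dataset become confounders and a treatment, quantile-normalized to [0, 1];
    half the confounders stay hidden; the outcome is Bernoulli through a
    normal-CDF activation of a random (linear or quadratic) form."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    k = n_confounders + 1                            # confounders plus one, for the treatment
    mid = k // 2                                     # the treatment sits in the middle of v
    proj = data @ rng.normal(size=(p, k))
    v = (rankdata(proj, axis=0) - 0.5) / n           # empirical-CDF transform: uniform marginals

    if form == "linear":
        m = rng.normal(size=k)
        m[mid] *= k - 1                              # keep the treatment's contribution strong
        u = v @ m
    else:                                            # quadratic form u = v^T M v, M not PSD
        m = rng.normal(size=(k, k))
        m[mid, :] *= k - 1                           # (the exact upscaling here is a guess)
        u = np.einsum("ni,ij,nj->n", v, m, v)

    med = np.median(u)
    u_star = norm.cdf((u - med) / np.mean(np.abs(u - med)))
    y = rng.binomial(1, u_star)                      # Bernoulli outcomes
    treatment = v[:, mid]
    visible = np.delete(v, mid, axis=1)[:, : n_confounders // 2]   # other half stays hidden
    return visible, treatment, y
```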
7 Benchmarks \ Scores brain blood pbmc mftc mean median mean median mean median mean median linear 2 confounders δMSM 107 83 87 71 108 88 68 61 CMSM 312 291 259 221 292 269 225 191 uniform 138 106 111 88 140 114 84 76 binary MSM 133 105 108 86 136 112 81 75 6 confounders δMSM 88 75 83 70 107 87 68 62 CMSM 282 231 276 235 286 249 223 170 uniform 111 94 106 89 137 111 84 75 binary MSM 109 91 103 86 133 109 82 73 10 confounders δMSM 91 79 89 79 117 99 70 64 CMSM 270 224 259 220 321 294 191 150 uniform 116 102 113 102 151 127 88 81 binary MSM 113 100 111 100 147 125 86 78 quadratic 2 confounders δMSM 198 188 176 165 198 189 163 150 CMSM 357 351 323 313 341 333 284 276 uniform 243 236 215 203 245 235 194 174 binary MSM 225 236 202 200 225 235 181 173 6 confounders δMSM 181 179 185 178 205 200 169 165 CMSM 390 383 388 388 402 393 331 303 uniform 224 213 225 213 252 247 204 192 binary MSM 212 210 213 212 235 244 194 191 10 confounders δMSM 188 186 186 180 204 197 172 170 CMSM 377 377 382 374 419 408 319 292 uniform 229 223 225 215 247 242 206 191 binary MSM 217 219 216 216 235 239 195 186 Table 3.4. The full array of experiments. Underlined settings are those shown in Table 3.2. Model mismatch varied with how approximately linear the true dose responses were. As expected, there was a significant negative correlation between model likelihood and divergence cost, so poorer fits had higher costs for coverage. Source-code availability. Please visit https://github.com/marmarelis/TreatmentCurves.jl. 72 Chapter 4 Ensembled Prediction Intervals for Causal Outcomes Under Hidden Confounding Causal inference of exact individual treatment outcomes in the presence of hidden confounders is rarely possible. Recent work has extended prediction intervals with finite-sample guarantees to partially identifiable causal outcomes, by means of a sensitivity model for hidden confounding. In deep learning, predictors can exploit their inductive biases for better generalization out of sample. We argue that the structure inherent to a deep ensemble should inform a tighter partial identification of the causal outcomes that they predict. We therefore introduce an approach termed Caus-Modens, for characterizing causal outcome intervals by modulated ensembles. We present a simple approach to partial identification using existing causal sensitivity models and show empirically that Caus-Modens gives tighter outcome intervals, as measured by the necessary interval size to achieve sufficient coverage. The last of our three diverse benchmarks is a novel usage of GPT-4 for observational experiments with unknown but probeable ground truth. 4.1 Introduction In order for a regression model to make causal predictions, the effect of confounders must be disentangled from the effect of the treatment. For this reason, causal inference is closely related to the problem of domain shift, since the outcome predictor may be learned on observational data while being expected to perform well on the hypothetical domain with fully randomized treatments. More often than not, the available covariates are imperfect proxies for all the confounders in the causal system. This further compounds the task of causal inference, as the hidden confounders must somehow be taken into account. The best hope in these cases is to produce “ignorance intervals” that partially identify the causal estimands. The tighter the intervals, the more useful the partial identification, which depends on what can be said about the hidden confounders. 
73 A sensitivity model [284] in causal inference is a structural assumption [213] about the possible behavior of hidden confounders. It allows causal estimands to be partially identified as long as the extent of hidden confounding is consistent with the sensitivity model. The dependence of the treatment assignment on confounders, i.e. the conditional treatment propensity, is what makes a study observational rather than a fully randomized experiment. We consider sensitivity models that bound the complete propensity (colloquially, the true propensity of treatment assignments for an individual, taking into account all relevant variables, observed or not) in terms of the nominal propensity (based just on observed covariates, allowing it to be estimated by regression.) Sensitivity models of this kind were first introduced by Tan [317] and have become popular due to their generality and simplicity. The most common setting for these models, in line with Tan’s initial formulation, is of binary treatments [76], [143], [158], in which the Marginal Sensitivity Model (MSM) bounds the ratio of nominal-propensity odds to complete-propensity odds. When that ratio is unit, and the complete propensity equals the nominal propensity at all points, then the covariates are adequate to explain all the confounders. It is worthwhile to broaden the notion of the MSM in light of recent developments with MSM-like sensitivity models for continuous treatments [142], [214] and other nonbinary domains. To accommodate these settings, we consider a general form of sensitivity model. In this chapter, we explore prediction intervals of causal outcomes due to interventions on the treatment variable, termed outcome intervals, that incorporate empirical uncertainties [144] in addition to the orthogonal concept of hidden-confounding uncertainty. Outcome intervals predict individual outcomes of treatments disentangled from confounders, relying on a sensitivity model to guide partial identification in the presence of hidden confounders. 4.1.1 Related Work Our goal diverges from the great strides that have been made in the realm of multiply debiased and robust estimators for average outcomes of populations or subpopulations [10], [48]. Largely in the binary-treatment context, these estimators have been augmented with sharp partial identification methods [75] that are guaranteed to be valid while not overly conservative. Dorn, Guo, and Kallus [76] accomplishes this partial identification at the cost of having to re-estimate outcome regressions every time the sensitivity model changes. Partial identification is less explored for nonbinary treatments, which are receiving increased attention [23], [154], [241], [342]. Separately, outcome statistics that are more complex than expectations are also of key interest in machine learning [159], with diverse purposes like fairness-oriented measures [156]. Tight partial identification of these other statistics requires novel methodology. Partially identified outcome 74 ensemble prediction Untreated Treated Observational Study Individual Causal Outcomes Distribution of Outcome Predictors reweighing the distribution for hidden confounders Covariates maximized upper prediction quantile minimized lower prediction quantile Figure 4.1. An illustration of the proposed method for causal outcome intervals. First, one samples predictors from a Bayesian posterior or otherwise learns an ensemble to approximate the distribution of outcome predictors that agree with the observational data. 
The ensemble average (grey dot) could be used to predict actual causal outcomes (red/blue dots). With hidden confounding, the learned ensemble might diverge substantially from the best predictor distribution to model causal outcomes. One cannot identify the correct distribution from observational data alone. Instead, a sensitivity model says how wrong this learned ensemble could be, and one optimizes with respect to weights on the ensemble elements for each individual and treatment in order to upper-bound the (1 − α/2) quantile and lower-bound the (α/2) quantile of the ensemble prediction. These intervals incorporate both empirical uncertainties from prediction quantiles and hidden-confounding uncertainties from the ensemble modulation. They are evaluated against ground-truth causal outcomes by removing confounding through interventions on test-set individuals, using (semi- )synthetic data. quantiles would be a step in that direction. We solve that problem in this chapter for the purpose bounding above and below the individual outcome intervals—our current focus. The state of the art for partially identified outcome intervals from binary treatments is conformalized [148], [365], building on domain shift [193]. Conformal inference looks at the empirical performance of a model to decide how to size its prediction set (interval). The simplicity of this approach coupled with its finite-sample statistical guarantees makes it widely applicable. However, conformalized intervals even in the causal setting are based on the behavior of the outcome predictor on the observed distribution. We hypothesize that heavily confounded observational data might make it difficult for causal conformal prediction to extrapolate to the unconfounded domain. In this chapter, we offer an alternative approach that can benefit from a predictor ensemble’s inductive biases when constructing the causal outcome intervals. 75 4.1.2 Motivation Instead of partial identification of (conditional) average treatment effect, (C)ATE, the conformal sensitivity analysis (CSA) [148], [365] produces rigorous intervals for the individual treatment effect (ITE), in other words the outcome realization rather than expectation. CSA considers a predictor’s performance in a calibration set as a guide for determining prediction intervals out of sample. Partial identification tends to be formulated adversarially, in terms of minimizing/maximizing a causal estimand that is admitted by the problem’s constraints. CSA involves an optimization problem over the rebalancing weights applied to the calibration sample [193], [324]. As the conformal method requires quantile estimates, it is impacted by theoretical implications on weighted quantile estimators. The variance of the estimator scales with the variance of the weights [103, Theorem 1]. If the weights were not inverse-propensity adjusted, then the conformal guarantees would fail due to distribution mismatch, so a large variance from covariate shift, for instance, cannot be avoided. To avoid the statistical challenges associated with reweighing an observational sample, we posit that an ensemble capturing empirical uncertainties from the observational data could harness its inductive biases to extrapolate to causal outcomes [144], [272]. These elements exist in Bayesian reasoning [141], which is a sound and scientific way to reconcile models with data. It incorporates parametric, distributional, structural, and prior knowledge into a posterior distribution of learned models that agree with the data. 
Even with large, deep models that are commonly developed in machine learning, the structural elements of the model contribute to its performance in a general domain [e.g. 85]. In the method presented below, we allow an ensemble’s learned biases to aid in extrapolation of the partially identifiable causal estimands. The mathematical connections are clear when the ensemble is supposedly from a Bayesian posterior, but in practice it can be learned in any way that sufficiently captures empirical uncertainties [347]. See Figure 4.1 on the ensemble reweighing. 4.2 Approach We present a versatile, modular procedure for taking an ensemble of outcome predictors and, in coordination with some causal sensitivity model, producing tight causal outcome intervals. We term this approach for causal outcome intervals via modulated ensembles “Caus-Modens.” The idea is to min/max an ensemble’s conditional quantiles by reweighing the predictors, yielding individual causal outcome intervals. First we list the fundamental assumptions for our causal inference. 76 Assumption 4 (Potential Outcomes). We adopt Rubin [286]’s first two assumptions for potential outcomes. First, observation tuples of (outcome, assigned treatment, covariates) denoted as {(y (i) , t(i) , x(i) )} n i=1, are i.i.d from a single joint distribution. This subsumes the Stable Unit Treatment Value Assumption (SUTVA), where units/individuals cannot depend on one another. Secondly, all treatment values have a nonzero chance of assignment for every individual in the data. For a family of outcome predictor models M, we use pM to denote probability density functions constrained by one or more models in M—that is, M conveys the hard constraints implied by the choice of parametrization θ and any other structural assumption. These models predict an outcome Y due to treatment assignment T and covariate X. In Bayesian notation the posterior P(Θ | D), given a dataset D, induces a posterior predictive outcome distribution, which is described by a conditional expectation that averages the individual model predictions pM(y | t, x; θ): pM(y | t, x; D) = EΘ pM(y | t, x; Θ) | D . (4.1) In practice, this integration over viable parameters is simulated by Monte Carlo with an ensemble of learned models. For our purposes, {θ (j)} m j=1 is assumed to be i.i.d from an estimator distribution (in the frequentist case) or posterior [in the Bayesian case, 195] with a density denoted as p(θ|D) in either case. Our sensitivity analysis requires the estimation of a nominal propensity function as well, denoted by et(x), which can be a discrete probability or a continuous density. The third potential-outcomes assumption is ignorability: absence of hidden confounders, where {(Yt)t∈T ⊥⊥ T} | X. It states that while the outcome would depend on the assigned treatment, a potential outcome for any treatment should not be affected by the treatment assignment, after conditioning on covariates. Our setting allows a bounded violation to the ignorability assumption. Definition 5 (Hidden Confounding). {(Yt)t∈T ̸⊥⊥ T} | X, hence P(Yt | T, X) may differ from P(Yt | X) for the potential outcomes Yt outside the assigned treatment (T ̸= t), and similarly the complete propensity P(T | X, Yt) is not the nominal propensity P(T | X) for any Yt . 
Whichever sensitivity model is invoked to bound the extent of hidden confounding, all that is required for Caus-Modens is a pair of weight-bounding functions ω(t, x), ω(t, x) that are partial identifiers of the potential-outcome probability density function, p(yt |x). We introduce one layer of indirection by referring to potential outcome models θt , heterogeneous in treatment t and covariate x (conditioning on the latter,) that can only be partially identified by means of the learned outcome model θ. The real potential 77 outcomes are therefore (partially) identified by marginalization over the potential models: p(yt |x) = R p(y|t, x; θt) p(θt |x; D) dθt , assuming integrability. The role of the weights is in the relation p(θt |x; D) = ω(θ, t, x)p(θ|t, x; D), where p(θ|t, x; D) = p(θ|D) because the learned model is invariant. As mentioned, the weights can only be partially identified by the given sensitivity model. The reason for pushing our causal sensitivity analysis to the level of the outcome model is that it can be empirically favorable while remaining largely intuitive. Assumption 5 (Sensitivity Model as Weights). The sensitivity model u28(;)-166(t;)-167(x)]TJ/F33␣10.9091␣Tf␣25.296␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(;)]TJ/F46␣10.9091␣Tf␣5.764␣0␣Td␣[(wher)8(e)]TJ/F92␣10.9091␣Tf␣30.475␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())-279(=)]TJ/F92␣10.9091␣Tf␣18.815␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(jD)]TJ/F33␣10.9091␣Tf␣11.749␣0␣Td␣[())]TJ/F46␣10.9091␣Tf␣6.975␣0␣Td␣[(b)-10(e)-7(cause)-251(the)-250(learne)-7(d)-251(mo)-7(del)-250(is)-251(invariant.)-293(A)12(s)-251(mentione)-7(d,)]TJ␣1.02␣0␣0␣1␣72␣649.066␣Tm␣[(the)-268(w)8(eights)-268(can)-268(only)-268(b)-10(e)-268(partially)-268(identifie)-6(d)-268(by)-268(the)-268(giv)8(en)-268(sensitivity)-268(mo)-7(del.)-355(The)-268(r)8(eason)-268(for)-268(pushing)-267(our)]TJ␣1.016␣0␣0␣1␣72␣628.742␣Tm␣[(causal)-246(sensitivity)-246(analysis)-245(to)-246(the)-246(le)7(v)8(el)-246(of)-246(the)-245(outcome)]TJ/F55␣10.9091␣Tf␣1.016␣0␣0␣1␣314.856␣628.742␣Tm␣[(mo)-7(del)]TJ/F46␣10.9091␣Tf␣1.016␣0␣0␣1␣344.591␣628.742␣Tm␣[(is)-246(that)-246(it)-245(can)-246(b)-10(e)-246(empirically)-246(fav)8(orable)-245(while)]TJ␣1␣0␣0␣1␣72␣608.418␣Tm␣[(r)8(emaining)-250(largely)-250(intuitiv)8(e)10(.)]TJ˙0␣g␣0␣G˙/F76␣10.9091␣Tf␣0.98␣0␣0␣1␣71.596␣579.128␣Tm␣[(A)12(ssumption)-251(5)]TJ/F46␣10.9091␣Tf␣0.98␣0␣0␣1␣141.404␣579.128␣Tm␣[((Sensitivity)-251(Mo)-8(del)-251(as)-252(W)82(eights))]TJ/F76␣10.9091␣Tf␣0.98␣0␣0␣1␣274.152␣579.128␣Tm␣[(.)]TJ˙0␣g␣0␣G˙/F46␣10.9091␣Tf␣0.98␣0␣0␣1␣281.699␣579.128␣Tm␣[(The)-251(sensitivity)-252(mo)-7(del)-251(u28(;)-166(t;)-167(x)]TJ/F33␣10.9091␣Tf␣25.296␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(;)]TJ/F46␣10.9
091␣Tf␣5.764␣0␣Td␣[(wher)8(e)]TJ/F92␣10.9091␣Tf␣30.475␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())-279(=)]TJ/F92␣10.9091␣Tf␣18.815␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(jD)]TJ/F33␣10.9091␣Tf␣11.749␣0␣Td␣[())]TJ/F46␣10.9091␣Tf␣6.975␣0␣Td␣[(b)-10(e)-7(cause)-251(the)-250(learne)-7(d)-251(mo)-7(del)-250(is)-251(invariant.)-293(A)12(s)-251(mentione)-7(d,)]TJ␣1.02␣0␣0␣1␣72␣649.066␣Tm␣[(the)-268(w)8(eights)-268(can)-268(only)-268(b)-10(e)-268(partially)-268(identifie)-6(d)-268(by)-268(the)-268(giv)8(en)-268(sensitivity)-268(mo)-7(del.)-355(The)-268(r)8(eason)-268(for)-268(pushing)-267(our)]TJ␣1.016␣0␣0␣1␣72␣628.742␣Tm␣[(causal)-246(sensitivity)-246(analysis)-245(to)-246(the)-246(le)7(v)8(el)-246(of)-246(the)-245(outcome)]TJ/F55␣10.9091␣Tf␣1.016␣0␣0␣1␣314.856␣628.742␣Tm␣[(mo)-7(del)]TJ/F46␣10.9091␣Tf␣1.016␣0␣0␣1␣344.591␣628.742␣Tm␣[(is)-246(that)-246(it)-245(can)-246(b)-10(e)-246(empirically)-246(fav)8(orable)-245(while)]TJ␣1␣0␣0␣1␣72␣608.418␣Tm␣[(r)8(emaining)-250(largely)-250(intuitiv)8(e)10(.)]TJ˙0␣g␣0␣G˙/F76␣10.9091␣Tf␣0.98␣0␣0␣1␣71.596␣579.128␣Tm␣[(A)12(ssumption)-251(5)]TJ/F46␣10.9091␣Tf␣0.98␣0␣0␣1␣141.404␣579.128␣Tm␣[((Sensitivity)-251(Mo)-8(del)-251(as)-252(W)82(eights))]TJ/F76␣10.9091␣Tf␣0.98␣0␣0␣1␣274.152␣579.128␣Tm␣[(.)]TJ˙0␣g␣0␣G˙/F46␣10.9091␣Tf␣0.98␣0␣0␣1␣281.699␣579.128␣Tm␣[(The)-251(sensitivity)-252(mo)-7(del)-251(u28(;)-166(t;)-167(x)]TJ/F33␣10.9091␣Tf␣25.296␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(;)]TJ/F46␣10.9091␣Tf␣5.764␣0␣Td␣[(wher)8(e)]TJ/F92␣10.9091␣Tf␣30.475␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(j)]TJ/F92␣10.9091␣Tf␣3.031␣0␣Td␣[(t;)-167(x)]TJ/F33␣10.9091␣Tf␣15.022␣0␣Td␣[(;)]TJ/F95␣10.9091␣Tf␣4.849␣0␣Td␣[(D)]TJ/F33␣10.9091␣Tf␣8.718␣0␣Td␣[())-279(=)]TJ/F92␣10.9091␣Tf␣18.815␣0␣Td␣[(p)]TJ/F33␣10.9091␣Tf␣5.489␣0␣Td␣[(()]TJ/F92␣10.9091␣Tf␣4.242␣0␣Td␣[(„)]TJ/F95␣10.9091␣Tf␣5.424␣0␣Td␣[(jD)]TJ/F33␣10.9091␣Tf␣11.749␣0␣Td␣[())]TJ/F46␣10.9091␣Tf␣6.975␣0␣Td␣[(b)-10(e)-7(cause)-251(the)-250(learne)-7(d)-251(mo)-7(del)-250(is)-251(invariant.)-293(A)12(s)-251(mentione)-7(d,)]TJ␣1.02␣0␣0␣1␣72␣649.066␣Tm␣[(the)-268(w)8(eights)-268(can)-268(only)-268(b)-10(e)-268(partially)-268(identifie)-6(d)-268(by)-268(the)-268(giv)8(en)-268(sensitivity)-268(mo)-7(del.)-355(The)-268(r)8(eason)-268(for)-268(pushing)-267(our)]TJ␣1.016␣0␣0␣1␣72␣628.742␣Tm␣[(causal)-246(sensitivity)-246(analysis)-245(to)-246(the)-246(le)7(v)8(el)-246(of)-246(the)-245(outcome)]TJ/F55␣10.9091␣Tf␣1.016␣0␣0␣1␣314.856␣628.742␣Tm␣[(mo)-7(del)]TJ/F46␣10.9091␣Tf␣1.016␣0␣0␣1␣344.591␣628.742␣Tm␣[(is)-246(that)-246(it)-245(can)-246(b)-10(e)-246(empirically)-246(fav)8(orable)-245(while)]TJ␣1␣0␣0␣1␣72␣608.418␣Tm␣[(r)8(emaining)-250(largely)-250o)-2␣0␣1␣281.a␣Td␣[())]TJ/F92␣10.1argely)-250o) ) Algorithm 2: General Quantile Maximizer Input: Quantile rank β, weight bounds (ω, ω) like those described in Assumption 5, and invertible cumulative density functions F1(y), F2(y), . . . 
, Fn(y), which can be considered the conditional prediction distributions from the ensemble. Output: Ensemble’s β-quantile, q := supw F −1 (β). 1 Initialize wi ← 1 for all i = 1, 2, . . . n; 2 while global optimality is not met, according to Theorem 6, do 3 Compute β-quantile of current F(y) := n −1 P i wiFi(y); 4 Find pair(s) of weight indices (r, s) that violate the optimality condition per Theorem 6; 5 Transfer weight between pair(s) (wr, ws) such that the condition is satisfied for the pair(s); 4.3 Estimation Properties Our main assumption beyond Assumptions 4 & 5 that enables a simple coverage guarantee of causal outcomes Yt is that they are independently generated by some unobserved Θt ∼ P(Θt |D). This requirement, marked by the subscript M, aligns with our parametric setting and a Bayesian perspective. However, we acknowledge that this result is not as general as the conformal alternatives. We note, additionally, that the empirical evaluations of §4.4 do not necessarily enforce these conditions. Lemma 4 (Empirical Coverage). For fixed values t, x, and α ∈ (0, 1), consider empirical weights ω˜(θ, t, x). Let Fˆ ω˜ be the cumulative distribution of the empirical, finite-ensemble estimate for the potential outcome of Equation 4.3, i.e. pˆM(yt | x; D) = Eˆm[ ˜ω(Θ, t, x) × pM(y | t, x; Θ) | D ]. Then for any ε > 0 and β = α + ε + 2 E |ω˜ − ω|, it holds with probability at least 1 − β that PM Yt ∈ Fˆ−1 ω˜ (α/2), Fˆ−1 ω˜ (1 − α/2) X = x > 1 − 2 exp −mε2 /2 . We blend the finite-sample coverage result in Lemma 4 with partial identification. Theorem 5 characterizes the validity of the causal-outcome intervals from a finite ensemble of size m. Theorem 5 (Valid Partial Identification). For fixed values t, x, and α ∈ (0, 1), consider weight boundary estimates [˜ω, ω˜] yielded from a sensitivity model according to Assumption 5. Estimating a solution to the program in Equation 4.4 produces outcome intervals with hidden-confounding ignorance. Assume for the admitted extrema inf Fˆ−1 ω˜ (α/2) and sup Fˆ−1 ω˜ (1 − α/2), that ω˜(Θ) ∈ {ω˜, ω˜} almost surely. Now let β = α + ε + 2 E[ |ω˜ − ω| ∨ ω˜ − ω ] for any margin constant ε > 0. In this case, with probability at least 1 − β, PM Yt ∈ inf Fˆ−1 ω˜ (α/2), sup Fˆ−1 ω˜ (1 − α/2) X = x > 1 − 2 exp −mε2 /2 . (4.5) Next, we justify our Supplementary Algorithm 3 by revealing a global optimality condition that can be reached greedily. Theorem 6 suggests a simple, monotonically nondecreasing update rule for an optimization algorithm: find pairs of ensemble components that disprove the optimality condition, and transfer weight between them. Theorem 6 (Global Optimality Condition). The weight assignments ω(θ (i) ) ∈ [ω, ω] for a predictor ensemble {θ (i)} m i=1 maximize the β-quantile of the finite weighted mixture in the space of all admissible weight assignments if and only if there exists no pair of mixture components (θ (j) , θ(k) ) such that weight can be transferred from j to k, i.e. ω(θ (j) ) > ω and ω(θ (k) ) < ω, and j has more leftward mass than k, i.e. F(q; θ (j) ) > F(q; θ (k) ) where q is the current β-quantile: β = m−1 P i ω(θ (i) )F(q; θ (i) ). Some of the empirical tightness of the outcome intervals might stem from the preservation of continuity in the partially identified densities; by the definition of Lipschitz continuity on the reals, Proposition 7 (Continuity of Outcome Density). If the predictor densities pM(y | t, x; θi) are C-Lipschitz continuous, then the posterior outcome density pM(yt | x; D) is ωC-Lipschitz. 
On the other hand, a sample-based reweighing scheme like from Jesson, Mindermann, Gal, et al. [143] and Kallus, Mao, and Zhou [158] does not preserve any implied continuity of the partially identified probability density. 4.4 Empirical Evaluations We present three benchmarks comparing the tightness of the outcome intervals produced by Caus-Modens and the prevailing conformalized causal sensitivity analyses. As discussed in §4.1, these conformal approaches encompass the state of the art in partial idenfiticaiton of individual outcomes and not outcome expectations. We first list the baselines and then detail the evaluation procedure. The baseline methods. We consider various combinations and ablations of conformal sensitivity analysis (CSA) [148], [365] with the two state-of-the-art conformal backbones: distributional conformal prediction (DCP) [53] and conformalized quantile regression (CQR) [282]. The CSA studies relied on CQR for their implementations. In the meanwhile an even more adaptive procedure, DCP, was proposed. For completeness in our analysis we constructed a “supercharged” baseline that combined CSA with DCP. Since we learned an entire ensemble for each benchmark, we usually allowed the conformal alternatives to also harness the empirical uncertainties captured by the ensemble. Again, this was done in an attempt 81 to be as favorable to the conformal alternatives as possible. Ensembled baselines were marked by the “Ens-” prefix, and the one baseline without that simply used a single model drawn from the ensemble. The predictive modeling foundations for all methods were kept the same so that there was no question about differential modeling performance leading to different results between Caus-Modens and baselines. This ensured the benchmarks were commensurate. For instance, whereas CQR normally calls for quantile regression, we fed it quantiles of the ensemble-marginalized distributional prediction. We list the actual baselines implemented: • Ens-CSA-DCP — the main conformal baseline with all the beneficial components; • Ens-CSA-CQR — similar to the above, but with the more standard CQR; • CSA-DCP — the non-ensembled ablation; • Ens-DCP — the non-CSA (non-causal) ablation. How tightness is measured. Most of our results were reported with a concept of coverage efficiency. As is customary with studies on conformal inference, we set a target coverage level. Then we observed the size of the intervals required to achieve that level of coverage on a causal test set where treatments were de-confounded (the smaller the intervals, the more efficient the coverage.) The logic of this strategy is that a tighter partial identification should require less implied hidden confounding (via the sensivitiy model) to cover the causal outcomes, relative to a more conservative method that does not utilize the sample statistics or problem assumptions effectively. Each result section (§4.4.1, §4.4.2, §4.4.3) defined a domain-specific cost function to measure the size of the outcome intervals. Tighter intervals had lower cost. We always used Tan [317]’s sensitivity model for binary treatments (MSM) and varied its single parameter Γ for the extent of violation to the ignorability assumption. We explored the landscape between and including Γ = 1, where ignorability holds, and Γ = 50, where all methods would plateau. We used binary search to identify the smallest Γ := Γ∗ ∈ [1, 50] that achieves a target coverage, like 95% of the test set. That point could be Γ = 1, the no-hidden-confounding condition. 
On the other hand, an experiment was classified as a failure if the method never reached the target coverage. The cost function evaluated at a successful Γ ∗ was termed the coverage cost. Training the predictors. Whereas Caus-Modens was conceived in a Bayesian framework, in practice, deep ensembles tend to achieve better accuracy than Bayesian neural networks and similarly quantify empirical uncertainty [94], [190], [253], [270]. Caus-Modens ultimately requires a sample of predictive models, whether from a posterior or an estimator distribution. The focus of our evaluations is the sensitivity 82 3 4 5 6 7 8 Caus-Modens Cost 3 4 5 6 7 8 Conformal Cost 90% Coverage 95% Coverage 99% Coverage # Failures Caus-Modens Baseline 90% Coverage 0 / 10 1 / 10 95% Coverage 0 / 10 1 / 10 99% Coverage 0 / 10 5 / 10 Figure 4.2. Coverage costs of Caus-Modens versus the main ensemble-conformal baseline, Ens-CSA-DCP, for the ten IHDP realizations used by Louizos, Shalit, Mooij, et al. [207]. In the scatter plot, we only display the cost pairs, clipped at 8.0, where both methods achieved the target coverage. The table shows that all the failures to reach adequate coverage occurred on the conformal method. Conformal Baseline 50 - 50 Split 75 - 25 Split 87.5 - 12.5 Split* Ens-CSA-DCP 1 Failure / 10 Trials 2 Failures / 10 Trials 5 Failures / 10 Trials Ens-CSA-CQR 2 Failures / 10 Trials 2 Failures / 10 Trials 6 Failures / 10 Trials Table 4.1. Failure rates of the baseline methods applied to the IHDP benchmark with 95% target coverage. The results in Figure 4.2 use the 50-50 split that appears to work best. Asterisk marks the arrangement where the entire original training set is used for estimation, and the validation set for calibration. The other benchmarks (PBMC, AITA) have larger samples that obviate this issue. analyses once models have been learned. Hence, we use deep ensembles of cardinality 16 in our reported benchmarks. We trained fully connected feedforward neural networks with sigmoid activations for both the outcome and the propensity predictors. Hyperparameter and architecture selection was done by grid search. Ensembles were trained by maximum likelihood on bootstrap-resampled training sets and randomly initialized weights. Caus-Modens and all the baselines relied on this set of predictors, either in whole or by randomly drawing individual models in the case of non-ensemble ablations. 4.4.1 Classical Benchmark (IHDP) The IHDP dataset in causal literature is a semi-synthetic classical benchmark for CATE estimation [e.g. 145], [207], [293]. It contains binary treatments with covariate shift, and simulated real-valued outcomes, for 747 individuals. The original covariates are eight real and nineteen binary attributes. To induce hidden confounding, we obscured the binary covariates. The benchmark task was to predict the T = 1 (potential) 83 outcomes of the test set. Due to the smallness of the sample, we randomly allocated 10% of the data to the validation set and 20% to the test set. In addition to hyperparameter selection, the validation set served for calibration in the conformal baselines for maximal resourcefulness. The whole calibration set consisted of 3/7ths of the otherwise-labeled training set for an ultimate 50-50 estimation-calibration split, as is recommended with split conformal prediction (SCP) [249]. In other words, Caus-Modens utilized the entire training set, and the conformal baselines used 4/7ths of the training set for estimation and the rest for calibration. 
We also tested a 75-25 SCP split rather than 50-50, to similar effect. Cost function & results. The cost function was the absolute length of the interval scaled to the empirical standard deviation of the outcomes. We first tested Caus-Modens against Ens-CSA-DCP for three target coverages shown in Figure 4.2. Caus-Modens produced tighter intervals with Wilcoxon signed-rank test p < 0.05. This dataset was noteworthy for the occurrence of failures in the conformal approach and complete success in Caus-Modens for achieving the target coverage. Table 4.1 shows how other conformal configurations induced more failures. For Caus-Modens we found that the size of the ensemble beyond 16 predictors ceased to impact the coverage cost. 4.4.2 Novel Semi-synthetic Benchmark (PBMC) Recent widely celebrated single-cell RNA sequencing (scRNAseq) modalities have enabled an unprecedented view into human physiology [152]. The complex relations between the expressions of roughly 20,000 genes makes it a good source for benchmark datasets with unintuitive statistics. We obtained a relatively clean dataset of well-characterized peripheral blood mononuclear cells (PBMCs) [162] and randomly projected the gene expressions into 32 observed and 32 unobserved confounders, as well as a treatment variable that was discretized to binary values. The simulated outcome was a completely random quadratic form of all these 32+32+1 variables, ensuring arbitrary relations between treatment assignment, confounders, and outcome. A current shortcoming of partial identifiers for expectations of causal quantities, like ATE and CATE, is that they were not designed for heavy-tailed outcomes. We showcase Caus-Modens in this light, using the Cauchy distribution for simulated PBMC outcomes. The Cauchy distribution has several scientific uses, including the modeling of physical & financial phenomena [e.g. 155] and specifying priors for variance [100]. It is considered “pathological” because it has no mean or higher moments. The sample mean is also Cauchy distributed—for it is a stable distribution [243]—and diverges in large samples. However, a viable alternative is to estimate the tail parameters by maximum likelihood [e.g. 130], [316]. Parametric approaches are 84 paramount to characterizing pathological distributions like the Cauchy, which are punctuated by extreme rare events. This simple benchmark highlights the value of inductive bias. Cost function & results. For an interpretable measure than can be aggregated across multiple experiments, the cost function for Cauchy-outcome intervals was the interval length scaled to the smallest achieved length in each setting. Table 4.2 displays these relative costs for a high coverage target of 99%, evaluating each method’s ability to characterize Cauchy tails. Caus-Modens achieved significantly lower costs than the CSA baselines while meeting target coverage on average, and the non-causal Ens-DCP had similar cost for greater miscoverage, with failure on average. Method Achieved Coverage ↑ Coverage Cost ↓ Avg. Coverage Loss ↓ Caus-Modens 99.15 (0.20) % 0.28 (0.15) 0.028 % pts Ens-CSA-DCP 99.58 (0.32) % 1.51 (1.94) 0.002 % pts Ens-CSA-CQR 99.57 (0.32) % 1.51 (1.84) 0.003 % pts CSA-DCP 99.60 (0.32) % 1.50 (1.82) 0.002 % pts Ens-DCP 98.95 (0.45) % 0.30 (0.20) 0.206 % pts Table 4.2. PBMC results from 16 independent dataset generations and inferences. We set the random seed to the predetermined value 0 prior to generation for reproducibility and transparency. 
We present average achieved coverages and standard deviations for a target of 99%, accompanied by relative coverage costs for the trials that met the target, and the average nonnegative loss in coverage percentage points, which was positive for trials with coverage below 99%. 4.4.3 Novel Benchmark via GPT-4 (AITA) Semi-synthetic causal benchmarks like PBMC can be designed to harness the arbitrary statistical relations in real data. Still, the outcome must have pre-specified functional relations with the treatment and confounders. With the proliferation of causal-infernece studies proposing new methods for various settings, there is a need for flexible yet realistic benchmarks. In this result section we took a step in building a new kind of observational dataset that includes intervention results without the challenges of actually bringing in a randomized control experiment for testing the causal inference. We used the celebrated large language model (LLM) GPT-4 [2] that has demonstrated remarkable capabilities in emulating human text [32]. One can use an LLM to sample complex outcomes from observational datasets and also query textual interventions. We seek to promote this usage of large generative models for benchmarking causal inference [67]. We framed the novel inference task in the format of the r/AmITheAsshole subreddit (hence the name of this benchmark, AITA.) The subreddit is a scientifically attractive setting because the rules and structure of the forum are clean: users post personal stories of conflict, and comments offer opinions on whether the 85 author was at fault in the way the story transpired. A verdict is determined by the upvote mechanism on comments. Data from this subreddit have recently served as a vessel for human perspectives [28] and moral judgment [262] in the computational social sciences. For the sake of a causal benchmark, we asked GPT-4 to act as moral arbiter on real posts from the subreddit [244]. That way there would be no doubt about the real-world salience of the data, while permitting interventions via the LLM. The treatment variable was the customary self-identified gender, which is binary between ‘F’ and ‘M’ (a limiting and problematic format.) Nevertheless, this variable allowed us to assess a bias in GPT-4’s verdicts. The mechanics of GPT-4’s “intuitive process” are so complex that it would be difficult to predict its moral judgments through a much simpler outcome predictor that would necessarily be trained on a relatively small sample of text embeddings: the AITA posts that have discernible gender markers. We changed this benchmark in order to simplify the prediction problem. Concretely, the outcome predictor was tasked with denoising an artificially noised GPT-4 verdict, utilizing the gender (treatment) and topic (covariates) information of the post. See Supplemental Figure 4.4 for a diagram. Topics were represented by fivedimensional embeddings like in BERTopic [108]. To strengthen the bias in GPT-4, we coupled the real indicated gender with synthetic ages so that authors were either 30-year old men (T = 0) or 70-year old women (T = 1). Caus-Modens and baselines were tasked with predicting the causal effect of T on the denoised verdict. 0.5 0.6 0.7 0.8 0.9 0.5 0.6 0.7 0.8 0.9 Target Coverage, Achieved by All Listed Methods 0.0 0.2 0.4 0.6 0.8 Tightest Intervals (absolute) 0.6 0.7 0.8 0.9 Tightest Intervals (mass) Caus-Modens Ens-CSA-DCP Ens-CSA-CQR CSA-DCP Figure 4.3. 
Share of the test set that each method produced the tightest intervals, shown in absolute (left) and mass (right) units, at a wide range of target coverages. Lines are widened by standard errors. The four listed methods always achieved the target, whereas the unlisted Ens-DCP failed frequently. The presence of ties, particularly in the right subplot, allows the shares to add up to more than unit. 86 Cost function & results. The GPT-4 verdict was a number from 1–99, afterwards rescaled to the unit interval, and then logit-transformed to be modeled by normal distributions. These outcomes had a standard deviation of 0.62 and the artificial noise standard deviation was selected to be 0.5. After learning the outcome (and propensity,) we focused on the woman gender arm of potential outcomes for the intervention testing set. Verdict-noise variance out of sample was reduced by 40% with the predictors. We chose two cost functions for evaluating coverage efficiency: the absolute units of outcome interval size, and mass units computed as the integral of the empirical marginal outcome distribution along the interval. Figure 4.3 shows the rate of tightest intervals for all the evaluated methods that reached sufficient coverage, in both units and at a wide array of coverage targets. At 70% coverage, the verdict overlap between genders is smaller with Caus-Modens outcome intervals than with Ens-CSA-DCP for 20.0 (0.8) % of the posts, with the rest being equal in the empirical probability mass units, and none in the other direction. This suggests that Caus-Modens identifies more gender bias in GPT-4’s moral judgments. The doubly robust and debiased ATE estimator [363] for gender effect suggests that ATE = −0.07 with Student t-test p < 0.05, likewise revealing a gender bias towards young men being more wrong than old women according to GPT-4. Table 4.3 compares GPT-4 verdicts to the original Reddit verdicts and displays coverage costs for different judgment regimes. Caus-Modens consistently outperformed Ens-CSA-DCP, which was the best of the baselines according to Figure 4.3, even in empirical mass units. Counts Reddit right Reddit wrong GPT-4 right 1303 [58%] 276 [12%] GPT-4 wrong 430 [19%] 250 [11%] Costs Reddit right Reddit wrong GPT-4 right 0.36* / 0.38 0.38 / 0.39 GPT-4 wrong 0.39* / 0.41 0.35 / 0.37 Table 4.3. Confusion matrix of Reddit versus GPT-4 verdicts (left) and the 70%-coverage costs of CausModens against Ens-CSA-DCP for those stratified posts (right). Bold: p < 0.05; asterisk: p < 0.01. As the GPT-4 verdict was given on a sliding scale, we chose a right/wrong threshold by equalizing its marginal rate with Reddit’s. If we chose the midpoint of the verdict spectrum, there would have been many more wrong than right verdicts from GPT-4. This can be explained by a higher threshold for the designation of “asshole” to the author of a post, as is the Reddit protocol. 4.5 Discussion We present three benchmarks, IHDP, PBMC, and AITA, on which our proposed Caus-Modens yields tighter coverage than the state of the art in adaptive conformal prediction with causal sensitivity analysis. In IHDP (§4.4.1), the conformal baselines failed at least once out of ten trials to achieve coverage by reweighing the 87 calibration set. The costs of non-failures tended to be larger than Caus-Modens’ at the same target coverage. Caus-Modens did not fail at all for IHDP. 
The failure rate of the conformalized sensitivity analyses highlights potential pitfalls of reweighing a finite observational sample to achieve coverage of causal outcomes. Valid coverage for CSA is contingent on correct propensity specification [see Lemma 3, 365], which may be a challenge in certain applications with strong, high-dimensional covariate shift. In the much larger PBMC (§4.4.2) benchmark, Caus-Modens achieved the tightest coverage for the majority of the trials. PBMC leveraged the relations between gene expressions in human cells by randomly projecting them into confounders and treatment. Further, it enabled the demonstration of Caus-Modens quantile predictions for extremely heavy tails. In the novel AITA benchmark (§4.4.3), which used an LLM to generate extremely complex causal outcomes with access to interventions, we also demonstrated that the outcome intervals produced by Caus-Modens were consistently more informative, in terms of share of tightest coverage, than the intervals produced by the baselines. Empirical uncertainty via ensembles. Ensemble-based uncertainty quantification in deep learning remains an active field of study [9], [304], [321], [349]. We employed the classic deep ensemble [190] for our results. However, our approach is flexible and works with many kinds of “ensembles,” even those that are implicitly defined, as with MC dropout [97], or the inducible model distribution in Bayesian neural networks [345]. Using a sensitivity model. As mentioned in Examples 4.2 & 4.2, popular sensitivity models for hidden confounding like the MSM are parametrized by a single Γ ≥ 1, with Γ = 1 signifying no hidden confounders. In past studies, Γ has been informed by domain knowledge [61] or data-driven heuristics [124]. Without domain knowledge, a sensitivity model can help rank heterogeneous causal effects (across observational units, or even types of treatment) by apparent robustness to hidden confounding. Our empirical evaluations tested the alignment of the ensembled sensitivity model with the ground truth of semi-synthetic hidden confounders. Societal concerns. We wish to emphasize that GPT-4’s verdicts on morality in the AITA scheme should not be considered as an approximation of human moral judgements. GPT-4 is biased in poorly understood ways by its architecture and training data [92], [98], [370]. Further, the biases isolated in the circumscribed scope of these experiments do not necessarily even reflect the general biases present in GPT-4. 88 Limitations. Causal inference methods always depend on the appropriateness of the assumptions. In our case, this includes assumptions about the data-generating process (namely constraints on the hidden confounders, via a sensitivity model) and the quality of the estimates. Our method elevates the role of empirical uncertainty relative to many prior works. However, in contrast with recent alternatives, CausModens may be less conservative by relying more on parametric, structural, and inductive constraints. Hence, it could also be more vulnerable to model misspecification. Conclusion Our simple ensemble-based partial identification of outcome quantiles is a promising approach to prediction intervals that leverages the inductive biases of deep models. In addition to coverage efficiency, it accommodates various sensitivity models and adapts them to a novel formulation of potential posteriors that justifies the weight-modulation of an ensemble. 
Future work could explore how to regularize Caus-Modens to be sharper in its partial identification, especially in general (non-binary) treatment domains that are less explored. 4.6 Appendix A: Algorithm This algorithm has quadratic asymptotic runtime in the ensemble size. As the bottleneck tends to be the quantile-search subroutine, one may benefit from implementing a bulk weight-transfer procedure using Algorithm 3 as a starting point. 4.7 Appendix B: Proofs Lemma 4. We study estimation errors in the weights in a manner inspired by Theorem 3 (supplementary) of [193]. In our case, we study the cumulative distribution function (CDF) estimate Fˆ ω˜ (y) = m−1Xm j=1 ω˜jFj (y) = m−1Xm j=1 ω˜(Θ(j) )F(y; Θ(j) ) 89 Algorithm 3: Greedy Quantile Maximizer (minimizer version is trivial) Input: Quantile rank β, weight bounds (ω, ω) like those described in Assumption 5, and invertible cumulative density functions F1(y), F2(y), . . . , Fn(y), which can be considered the conditional prediction distributions from the ensemble. Output: Ensemble’s β-quantile, q := supw F −1 (β). 1 Initialize wi ← 1 for all i = 1, 2, . . . n; 2 Compute initial search bounds q ← mini F −1 i (β) and q ← maxi F −1 i (β); 3 while not converged do 4 Binary-search for q ← F −1 (β) ∈ (q, q), where F(y) := n −1 P i wiFi(y); 5 Compute masses αi := Fi(q) for every i and sort in ascending order (without relabeling); 6 Find receiver r := arg mini αi such that wi < ω; 7 Find sender s := arg maxi αi such that wi > ω; 8 if r ≥ s then 9 break; 10 Compute receivable a := ω − wr and sendable b := ws − ω; 11 if a < b then 12 Transfer wr ← ω and ws ← ws − a; 13 else 14 Transfer ws ← ω and wr ← wr + b; 15 Refine search bounds q ← q; for pˆM(yt | x; D) = Eˆm[ ˜ω(Θ, t, x) × pM(y | t, x; Θ) | D ] with t and x omitted for brevity. We wish to accurately predict Yt ∼ pM(Yt | X = x; D). Our main tool will be Hoeffding’s inequality [340] in this endeavor. Namely, for any u > 0, P[mFˆ ω˜ (y) − E mFˆ ω˜ (y) ≥ u] ≤ exp −2u 2 /m because the individual CDFs are independent conditional on a fixed y, and take the range [0, 1]. We focus on the upper bound (with quantile 1 − α/2) of the prediction interval first, and extend that result to the lower bound by a symmetry argument. Resolving the expectation and factoring out m, we find P[Fˆ ω˜ (y) − Fω˜ (y) ≥ u] ≤ exp −2mu2 where Fω˜ (y) = E[˜ωF(y; Θ)]. (4.6) 90 We observe that Fω˜ (y) = E[ωF(y; Θ)] + E[(˜ω − ω)F(y; Θ)] = Fω(y) + Fω˜−ω(y). Now let u = 1 − Fω˜ (y) + α/2 for the upper bound. This implies P[Fˆ ω˜ (y) ≥ 1 − α 2 ] ≤ exp −2m[1 − Fω˜ (y) + α 2 ] 2 . Now we introduce the margin ε > 0 and note that when Fω˜ (y) + ε ≤ 1 − α 2 , then u > 0 and we have P h Fˆ ω˜ (y) ≥ 1 − α 2 i ≤ exp −2mε2 ⇐⇒ P h Fˆ ω˜ (y) < 1 − α 2 i > 1 − exp −2mε2 . Plugging in y := Yt from the test set defined above, we find that this law is satisfied when Fω(Yt) ≤ 1 − ( α 2 + ε + Fω˜−ω(Yt)). The fact that Fω(Yt) is uniformly distributed (following Assumption 5) allows us to conclude that the condition is met with probability at least 1 − ( α 2 + ε + E |ω˜ − ω|), observing the triangle inequality of the absolute norm. Applying the same reasoning to the prediction interval’s lower bound eventually yields P hα 2 < Fˆ ω˜ (Yt) < 1 − α 2 i > 1 − 2 exp −2mε2 . with aggregate probability 1−β ′ for β ′ = α+ 2ε+ 2 E |ω˜ − ω|. Applying the inverse CDF and substituting ε := ε/2 (without loss of generality), we obtain the final form β = α + ε + 2 E |ω˜ − ω|; hence with probability 1 − β, P h Yt ∈ Fˆ−1 ω˜ α/2 , Fˆ−1 ω˜ 1 − α/2 i > 1 − 2 exp −mε2 /2 . 
Theorem 5. Here we extend Lemma 4 to partially identifiable weights that yield an admissible set of prediction intervals. The outcome interval under consideration is the union of all these admissible intervals. This is attained by a supremum over the upper bound and an infimum over the lower bound. In applying Hoeffding’s bound on the upper bound, we can replace Equation 4.6 with P[Fˆ ω˜+ (y) − Fω˜+ (y) ≥ u] ≤ exp −2mu2 such that Fˆ ω˜+ (y) = sup Fˆ ω˜ (y). Additionally, we define ω + to satisfy Fˆ ω+ (y) = sup Fˆ ω(y). Let β = α + ε + E |ω˜ + − ω +| + E |ω˜ − − ω −| where (˜ω −, ω−) are analogously defined for their respective infima. By a straightforward extension of Lemma 4, we have Equation 4.5 with probability at least 1 − β. By the assumption stated in this theorem, we know that ω˜ ±(Θ) ∈ {ω˜, ω˜} almost surely over the weight assignments. It is clear, then, that E |ω˜ + − ω +| + E |ω˜ − − ω −| ≤ 2 E[ |ω˜ − ω| ∨ ω˜ − ω ], completing the proof. Theorem 6. The putative optimality condition for the maximization problem solved by Algorithm 3, restated, is for there to be no pair of mixture components (θ (j) , θ(k) ) such that ω(θ (j) ) > ω and ω(θ (k) ) < ω, as well as F(q; θ (j) ) > F(q; θ (k) ) where q is the current β-quantile: β = F(q) ≜ m−1X i ω(θ (i) )F(q; θ (i) ). We will prove both directions of entailment to establish equivalence. First, we must show that if the quantile is maximized, then the condition holds. Suppose that q is the maximal quantile under the problem constraints and the condition is not satisfied, so there indeed is a pair (θ (j) , θ(k) ) as described. This implies that there is weight that could be transferred, of the amount ∆ω ≜ min{ω(θ (j) ) − ω, ω − ω(θ (k) )} > 0. Transferring that weight would yield a new mixture G(·) = 1 m X i ω(θ (i) )F(· ; θ (i) ) + ∆ω m [F(· ; θ (k) ) − F(· ; θ (j) )], with the consequence of G(q) < F(q) because F(q; θ (j) ) > F(q; θ (k) ). Therefore G−1 (β) > q due to monotonicity and q is not the optimal quantile. By contraposition, optimality entails our stated optimality condition. Now for the converse. With similar notation as above, we have F(q) = β but come into the posession of some feasible G(·) where G(q ∗ ) = β and q ∗ > q, so q is no longer optimal. Deconstruct G(·) = m−1 P i ω ′ (θ (i) )F(· ; θ (i) ). By monotonicity, G(q) < F(q). Hence m−1X i [ω ′ (θ (i) ) − ω(θ (i) )]F(· ; θ (i) ) < 0. Ignoring the identical pairs of weights between F and G, X i∈A [ω(θ (i) ) − ω ′ (θ (i) )]F(q; θ (i) ) > X i∈B [ω ′ (θ (i) ) − ω(θ (i) )]F(q; θ (i) ), A ≜ {i : ω(θ (i) ) > ω′ (θ (i) )}, B ≜ {i : ω ′ (θ (i) ) > ω(θ (i) )}. At the same time, P i∈A[ω(θ (i) ) − ω ′ (θ (i) )] = P i∈B[ω ′ (θ (i) ) − ω(θ (i) )] because of the constraint on the probability simplex. For the above inequality to be valid alongside this equality, there must be at least one 92 pair (j ∈ A, k ∈ B) such that F(q; θ (j) ) > F(q; θ (k) ). Hence the negation of the optimality condition holds, and by contraposition we prove the other entailment direction. 4.8 Appendix C: Experimental Details Details on PBMC The original PBMC dataset [162] had 14,039 cells that were used as data points for our benchmark. In each of the 16 trials, we randomly allocated 8,192 (2 13) training instances, 2,048 (2 11) validation instances, and 2,048 (2 11) test insances from the original sample. Then we designed the causal system by projecting the cells’ 17,796 gene expressions into vectors V ≜ ⟨32 visible confounders. . . , 1 treatment, 32 hidden confounders. . 
.⟩ ∈ R 65 by drawing 65 i.i.d normal coefficients. The 64 confounding entries were rank-normalized to give them Uniform[0, 1) marginals. The treatment entry was binarized by thresholding at 2/3 so that the data were slightly unbalanced with more (T = 0) observations. The outcome link was determined by a random matrix Mij ∼ i.i.d Normal(0, 1). The diagonal coefficient M33,33 corresponding to the treatment entry was upscaled by a factor of 64 to keep the treatment effect discernible from the rest. The pre-noised outcome U ≜ V TMV endured strong quadratic treatment and confounding effects. Finally, the observed outcome was Cauchy(µ = U, σ = 1)-distributed as motivated in §4.4.2. We aimed to keep this mathematical construction of the semi-synthetic benchmark parsimonious by introducing a minimal number of design choices and nontrivial default values. Details on AITA Figure 4.4 depicts the experimental setup. We used OpenAI’s gpt-4-0314 API model. The O’Brien [244] dataset contained 11,862 posts with detectable gender identifiers, of which 9,000 were reserved for the observational data. 8,000 of those were for training and the other 1,000 for validation. The 2,862 posts that were not included in the observational dataset were used as the causal outcome test set by intervening on the gender variable before feeding them into GPT-4. Please refer to scripts/aita-preprocessing.py in the supplementary source for all the querying and preprocessing details. Processed data are included as CSVs. View the chat-prompt prologue below. 93 Large Language Model Noisy Verdict T X Yt AITA? intervention on gender variable learns gender bias to help denoise verdict moral arbiter gender noise right wrong Hey Reddit, I recently told my brother that... Predictor right wrong Figure 4.4. Illustration for the AITA benchmark as a denoising task. The predictor receives a rough verdict through a noisy channel, relieving it from having to learn the moral judgment of GPT-4 and allowing it to focus on the biases due to gender and topics that help with denoising. System You are an average person acting as a subjective moral arbiter on r/AmITheAsshole. Whenever someone presents a story, you respond with a number from 1 to 99 where 1 is noble, 49 is completely neutral (rare), and 99 is atrocious. User I was mean for no reason. Assistant 90 User I tipped the waiter more than usual. Assistant 10 Unfortunately, we hit OpenAI’s GPT-4 quota before we could evaluate all arms of the potential outcomes. It did not hinder the benchmark, but it did prevent thorough analysis of the causal system. 94 Chapter 5 Policy Learning for Localized Interventions from Observational Data A largely unaddressed problem in causal inference is that of learning reliable policies in continuous, highdimensional treatment variables from observational data. Especially in the presence of strong confounding, it can be infeasible to learn the entire heterogeneous response surface from treatment to outcome. It is also not particularly useful, when there are practical constraints on the size of the interventions altering the observational treatments. Since it tends to be easier to learn the outcome for treatments near existing observations, we propose a new framework for evaluating and optimizing the effect of small, tailored, and localized interventions that nudge the observed treatment assignments. Our doubly robust effect estimator plugs into a policy learner that stays within the interventional scope by optimal transport. 
Consequently, the error of the total policy effect is restricted to prediction errors nearby the observational distribution, rather than the whole response surface. 5.1 Introduction Improvements in predictive power from large statistical models do not always translate to better decision making. The best way to support decision-making is to infer the outcomes from possibly relevant interventions. Models built to describe observations like in supervised or self-supervised learning tasks are not always adequate to predict interventional outcomes. In the field of causal inference, the theory of causal-effect estimation in a potential outcomes framework is largely concerned with building statistical models that can predict interventional outcomes explicitly. From observational data, it can be useful to learn the causal effects of a treatment variable and then construct a policy for prescribing treatments with the goal of maximizing overall outcomes [11]. This problem is often called offline policy learning because one cannot take actions and observe new outcomes like in an online setting. The structure of the causal system, and in particular that of the treatment variable, 95 can have severe implications on the feasibility of learning optimal policies. Treatments are commonly assumed to be binary, although more flexible settings are gaining traction, including for discrete multivalued treatments [157], [333], [381], continuous treatments [72], and discrete multivariate treatments [199], [355], [356]. When treatments are continuous and possibly multivariate, it can be quite difficult to learn the full response surface of every unit to every treatment value. Furthermore, prevailing estimators require learning the conditional probability density of the observational treatment propensity, e.g. Colangelo and Lee [57], Kallus and Zhou [161], Marmarelis, Haddad, Jesson, et al. [214], and Nie, Ye, liu, et al. [241]. That task might not scale well with covariate shift and increasing treatment dimensionality. It could be simpler to estimate local causal derivatives [51], [120]. Focusing on derivatives restricts the learning problem to the parts of the response surface near existing observations. Causal derivatives at observed treatments are informative of incremental effects from small interventions. Small interventions are often the most achievable. Despite these benefits, it is not always obvious how to learn a policy from causal derivatives. How small of an intervention is small enough? How far can a derivative extrapolate? Never mind the additional considerations for outcome predictors with well-behaved derivatives, which are needed in estimators of causal derivatives. We therefore propose to learn the effects of nudging∗ the treatment variable (§5.2), and to optimize budget-constrained policies thereof (§5.3). A causal-effect estimator specifically for nudges coming from a nudge prior, representing the interventional scope and perhaps budget under consideration, facilitates reliable learning of nudge policies (§5.5). Figure 5.1 shows a simple illustrative scenario calling for nudges. Our solution involves a few novelties. We formulate a learning objective (§5.2.3) for directly debiasing nudge-effect estimates in a doubly robust framework. While developing the policy learner, we discover a connection to optimal transport with an unorthodox cost function (§5.3.2), on transferring nudges to observational units. 
This link further reveals the possibility of an efficient, information-bottlenecked solver (§5.3.3). Example (wildfires). Suppose we are attempting to spatially target costly interventions that could reduce the proclivity for wildfires. We are studying satellite images [78] to identify the pixels most conducive to intervention before summer. From records of past summers and perhaps large-scale climate models [278], we have learned a spatial forecasting model for wildfire occurrence given a satellite image of surrounding vegetation, moisture, and topography. If the intervention to target is a reduction in vegetation, then the ∗ Even though we believe a “nudge” is the best term for a small intervention, somewhere between infinitesimal and global, we wish not to encourage associations with nudge theory from the social/behavioural sciences [88]. 96 machine-learning problem could be to identify the high-vegetation pixels that would, once lowered, causally reduce the risk of a wildfire. We are effectively nudging multiple continuous-valued spatial pixels because we are aiming to reduce vegetation to varying degrees as determined by the optimal policy. Relation to reinforcement learning. In the literature of proximal policy optimization (PPO) [247], [301], [302], a policy model is optimized with respect to some reward model within a trust region. Usually, the trust region is defined by a cutoff or penalty on the KL-divergence between the old (original or logging) and new (learned) policy. This prevents domain shift due to the updated policy moving far from the datagenerating process. Our conception of nudge priors can be interpreted as a trust region as well. However, to be faithful to PPO, we would have the learned policy be similar to the treatment propensity—the conditional distribution governing treatment assignments. The true propensity is unknown and has to be estimated in our problem setup, and it can be difficult to guarantee its accuracy. It can be untenable to base the trust region on this estimate. Instead, we use a nudge prior that fixes the marginal distribution of nudge policies over the data. Our lack of a known logging policy also separates us from offline contextual bandits [362]. Doubly robust estimation. Stemming from semiparametric estimation theory [121], [169], the framework of double machine learning [48] has become immensely useful in the state of the art in causal-effect estimation and offline policy learning [e.g. 157], [246], [381]. We employ this theory to derive estimators with increased robustness to misspecification in the learned models. 5.2 Estimating Nudge Effects The first step of our methodology is to estimate nudge effects. The overall approach is to use generic machine learning to estimate certain functions of the data, which are combined to form robust causal estimates. 5.2.1 Problem Setup We consider vector treatments with unbounded support T ∈ T = R dT under otherwise standard potentialoutcomes assumptions listed in Assumption 6 on the system of (outcome, treatment, covariate) variables W = (Y, T, X). Our aim is to learn the effect of an interventional treatment nudge ξ ∈ R dT concretely defined as ∆ξY ≜ Y (T + ξ) − Y (T), where Y (t) is the potential outcome at t. Only the realized outcome Y at the assigned treatment T is ever observed for each unit in the sample. The rest must be inferred in a way that accounts for selection and confounding biases. 97 Exemplar Causal Setting with Univariate Continuous Treatment and Covariate Figure 5.1. 
With continuous-valued treatments and significant covariate shift illustrated by the observations marked by xs, it can be infeasible to learn the full response surface T, X 7→ Y . In this example the upper left and lower right corners of the response would be impossible to predict without significant prior knowledge. Subsequently, it would be impractical to learn policies X 7→ T over the full surface, and perhaps infeasible to act on them. We propose to learn localized policies shown by the nudging arrows in gray. Assumption 6 (Potential Outcomes). The standard causal setting due to Rubin [286]. (a) Stable unit treatment value assumption (SUTVA). The potential outcomes for a unit do not depend on the treatment assigned to any other unit. (b) Overlap/positivity. All treatment values have a nonzero probability of occurring for every unit. (c) Ignorability. Potential outcomes are independent of the treatment after conditioning on covariates. Assumptions for nudges. A core assumption for the learning problem posed in this work is tied to the nudges under consideration. Our prior knowledge in this regard will be encapsulated by a distribution over the nudges, with a probability density p(ξ). This nudge prior governs the domain for the estimation problem covered in this section, for the policy learning in §5.3, as well as the error bounds in §5.5. The distributional assumption subsumes possible budgetary requirements that could be written as expectations over nudges, like a threshold over the average magnitude of prescriptions to a sample. It can also reflect our degree of confidence in the learned models, depending on the problem difficulty. Estimates that cannot extrapolate far should only be trusted with narrow nudge priors. 98 5.2.2 Our Approach We outline an estimator that synergizes two predictive models for nuisance parameters, so-called because they parametrize the causal-outcome estimates. These will be combined to form a robust estimate of the nudge effect ∆ξY conditional to other observable features from X. We learn the following models: • conditional outcome µ(T, X) ≜ E[Y | T, X], and • propensity ratio ηξ(T, X) ≜ fξ(T|X) / f0(T|X). The propensity density function is denoted as f0(T|X), belonging to a family of propensities that have been shifted by a nudge ξ, defined as fξ(t|X) = f0(t−ξ|X) induced by the transport map (T, X) 7→ (T +ξ, X). Taking inspiration from a rich line of work on efficient influence functions [121], [169], we propose a pseudo-outcome for nudge effects that combines nuisances and data: φξ ≜ [ηξ(T, X) − 1] [Y − µ(T, X)] + ∆ξµ(T, X), (5.1) where ∆ξµ(T, X) ≜ µ(T +ξ, X)−µ(T, X). Like other parameters studied in the double machine learning literature, this pseudo-outcome is an unbiased estimator for the nudge effect even if only one of the two nuisance estimates {µˆ(T, X), ηˆξ(T, X)} is correctly specified (see §5.5). Hence, φˆξ is considered doubly robust/debiased. In fact, by iterated expectation over Y and T, it can be shown that E[ ˆφξ | X] is a doubly robust estimator of the heterogeneous nudge effects, E[∆ξY | X]. As is explored further in §5.5, the local error of φˆξ at a specific X is the product of the errors of the two nuisance estimates {µˆ(T, X), ηˆξ(T, X)}, enabling faster convergence for the effect estimate. We state a proposition that serves as a theoretical basis for deriving Equation 5.1; Proposition 8. The efficient influence function for the average nudge effect is φξ − E[∆ξY ]. Kennedy [170], Oprescu, Dorn, Ghoummaid, et al. 
[246], and others have already proposed regression on pseudo-outcomes with a data-splitting strategy in order to accurately estimate conditional average treatment effects (CATEs) for binary treatments. In a typical data-splitting procedure, one estimates the functions (nuisance parameters) {µ(T, X), ηξ(T, X)} in one data partition, and then computes pseudo-outcome estimates φˆξ(Y, T, X) on another data partition using {µˆ(T, X), ηˆξ(T, X)}. Heterogeneous effects can be identified by regressing on the pseudo-outcomes in the second data partition. 99 5.2.3 Estimating the Propensity Ratio The solution we present includes an approximation for the propensity-density ratio parameter ηξ. Its form is motivated by its limiting behavior as the negative logarithmic gradient of the propensity density. Keeping in mind that the estimate ηˆξ should be reliable in the regime of the nudge prior p(ξ), we chose to construct a set of learning problems defined over a sample of nudges. We begin with a first-order multivariate Taylor expansion of log fξ(T|X) in ξ, log fξ(T|X) = log f0(T|X) + ξ · g(T, X) + O(∥ξ∥ 2 ) where the identity fξ(T|X) = f0(T − ξ|X) reveals that g(T, X) is indeed −∇T log f0(T|X). This expansion suggests an approximation to the propensity ratio, ηξ(T, X) = fξ(T|X) f0(T|X) ≈ exp{ξ · g(T, X)}, which leads to a reparametrization of the estimate ηˆξ(T, X) in gˆ(T, X) for any ξ drawn from a prior of relatively small nudges. Even though g(T, X) is defined as a logarithmic gradient, the optimal estimator for gˆ(T, X) over p(ξ) would minimize an average loss over the nudges. This distinguishes our solution from traditional score matching [132]. Probabilistic classification (PC). Let qξ(T, X) be the probability of T coming from fξ(T|X) rather than f0(T|X). Then it follows that ηξ(T, X) = qξ(T|X) 1 − qξ(T|X) , log ηξ(T, X) = log qξ(T|X) 1 − qξ(T|X) = ξ · g(T, X). Hence ξ · g(T, X) gives classification logits for fξ(T|X) versus f0(T|X). In this way, we propose to learn gˆ(T, X) from a set of classification problems over a nudge sample by learning to classify (T + ξ, X) as positive and (T, X) as negative over the data (Y, T, X) and the nudge prior p(ξ). Probablistic classification is a simple approach to estimating density ratios [324] and our scheme of formulating many small classification problems is partly inspired by recent work like that of Choi, Meng, Song, et al. [55]. All in all, our loss function for gˆ(t, x) is given as L PC[g] ≜ Eˆ (T,X)×p(ξ) h log σ(ξ · g(T, X)) − log σ(ξ · g(T + ξ, X)) i , (5.2) 100 where σ(·) is the logistic sigmoid. As already mentioned, this nuisance parameter gˆ(T, X) is estimated on one dataset split and then used on the other dataset split. When gˆ(T, X) is parametrized as a deep neural network, we found that it can be helpful to calibrate it on the second split with a low-dimensional adjustment like temperature scaling [110]. Well-calibrated and smoothed logits are paramount to the stability of probabilistic classification. Comparison to denoising score matching (SMD). The recent success of diffusion models [65] has underscored the popularity of SMD [313], [339]. The objective of SMD can be understood as score matching on a smoothed (noised) version of the data distribution. It is similar to our PC approach in that the score, i.e. log-gradient of a density, is related to the log-ratio of an infinitesimally perturbed density against the original density. See the remark on causal derivatives, Equation 5.4. 
The smoothing brings enjoyable finite-sample qualities and also simplifies the learning objective. If we were to apply SMD to estimating g(T, X) using the existing domain knowledge for nudges, then the smoothing probabilistic kernel over treatments would be defined as additive noise in terms of nudges, probably coming from p(ξ). Therefore the SMD loss would be L SMD[g] ≜ Eˆ (T,X)×p(ξ) ∥∇ log p(ξ) − g(T + ξ, X)∥ 2 2 (5.3) and it would uncover gˆ(t, x) ≈ ∇ log Ep(ξ) f0(t − ξ | x). On the other hand, our PC objective (Equation 5.2) would uncover local linear coefficients gˆ such that ξ · gˆ(t, x) ≈ log f0(t − ξ | x)/f0(t | x), trained by a logistic cross-entropy loss. While SMD could certainly be used to approximate our learning task, it would be suboptimal especially because the nudge expectation resides within the logarithm. Remark (causal derivative). Letting ξ = εv and ∇v denote a directional derivative in the first argument of a function with unit vector v, the quantity ε −1 [ηξ(T, X) − 1] −−−→ε→0 −∇v log f0(T|X) (5.4) recovers a multivariate version of the Riesz representer for the average causal derivative (ACD) as considered in prior works like Chernozhukov, Newey, and Singh [51]. The form of Equation 5.1 resembles a finitedifference version of the pseudo-outcomes for the ACD. 101 5.3 Learning Nudge Policies Our goal is to make policy prescriptions for nudge interventions that reliably maximize nudge effects in expectation. The policy should adhere to the nudge prior for two reasons: first, any interventional budget constraints should be respected. Second, as the effect estimates are calibrated for the nudge prior distribution, the policy should not stray from that domain where generalization degrades. With that in mind, we seek a learned heterogeneous policy that is a function of some flexibly defined variable U ∈ U that satisfies U = f(X) for some f, allowing for the simplest case U = X, or for the cases of U being a subset of the covariate features when one wishes to generalize to a broader population with fewer recorded attributes. Consider a scenario where one learns from a detailed survey (using X) and then makes broader prescriptions (on a simpler U). Letting Π be the set of conditional density functions in P(Ξ|U), the ideal nudge policy is characterized by an optimization problem constrained on the marginal. Definition 6 (Optimal Nudge Policy). A nudge policy given as a probability density π(ξ|U) conditional on features U is considered optimal if it solves the following constrained optimization problem: max π∈Π E h Z Ξ ∆ξY π(ξ|U) dξ i , s.t. ∀ξ ∈ Ξ, E[π(ξ|U)] = p(ξ). The attained maximum value is considered the nudge-policy effect because it integrates the population’s nudge effects over the policy prescriptions. Note on notation. As in the other sections of this chapter, the bare expectations E are with respect to the data W = (Y, T, X), and the training set for the nuisances, but not the nudge variable ξ. Expectations over the nudge only are denoted as Eπ or the integral form shown in Definition 6. Of course, the nudge domain Ξ is equal to the treatment domain T , and they will be used interchangeably. The infinite and idealized optimization problem in Definition 6 is intractable per se. To proceed, we introduce two approximate solution schemes denoted as IB and OT, eventually synergizing into IB+OT from the desirable aspects of both perspectives. 
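Both policy learners consume the effect estimates of §5.2. For concreteness, a minimal PyTorch-style sketch of that estimation pipeline is given below, combining a probabilistic-classification objective in the spirit of Equation 5.2 with the pseudo-outcome of Equation 5.1; the architecture, the nudge scale, and all names are illustrative assumptions rather than the configuration used in the experiments.

```python
# A minimal sketch (not the dissertation's code) of Sections 5.2.2-5.2.3:
# learn the log-gradient model g(t, x) by probabilistic classification, then
# form the doubly robust pseudo-outcome of Equation 5.1 on held-out data.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(d_in, d_out, width=64):
    return nn.Sequential(nn.Linear(d_in, width), nn.SiLU(),
                         nn.Linear(width, width), nn.SiLU(),
                         nn.Linear(width, d_out))

class PropensityGradient(nn.Module):
    """g(t, x), approximately -grad_t log f0(t | x)."""
    def __init__(self, d_t, d_x):
        super().__init__()
        self.net = make_mlp(d_t + d_x, d_t)
    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

def pc_loss(g, t, x, xi):
    """Logistic cross-entropy: classify (t + xi, x) as positive and (t, x) as
    negative, with classification logits xi . g, as described in Section 5.2.3."""
    pos = (xi * g(t + xi, x)).sum(-1)
    neg = (xi * g(t, x)).sum(-1)
    return (F.softplus(-pos) + F.softplus(neg)).mean()

def nudge_pseudo_outcome(y, t, x, xi, mu_hat, g_hat):
    """phi_xi of Equation 5.1, using eta_hat ~= exp(xi . g_hat(t, x))."""
    with torch.no_grad():
        eta = torch.exp((xi * g_hat(t, x)).sum(-1))
        delta_mu = mu_hat(t + xi, x) - mu_hat(t, x)
        return (eta - 1.0) * (y - mu_hat(t, x)) + delta_mu

# Training sketch: draw a fresh nudge per minibatch from the prior, e.g.
#   xi = 0.5 * torch.randn_like(t),
# minimize pc_loss with Adam on the first data split, and compute (then
# regress on X or U) the pseudo-outcomes on the second split.
```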
5.3.1 IB — Information Bottleneck Policy Learner The first approach is to solve a more constrained optimization problem inspired by the informationbottleneck principle [325] that is commonly employed in variational bounds for intractable problems in 102 representation learning [118]. Notice that the hard constraint on the marginal can be equivalently stated in the Kullback-Leibler (KL) divergence, or relative entropy: D E[π(ξ|U)] ∥ p(ξ) = 0. (5.5) This can be turned into a single soft constraint by penalizing the objective function with a Lagrange multiplier. However, it is still difficult to compute over a large sample. We turn to the mutual information between ξ and X induced by a given policy π, Iπ(ξ;U) = D π(ξ|U)p(U) ∥ p(ξ)p(U) = E[ D π(ξ|U) ∥ p(ξ) ], (5.6) which implicitly asserts that the nudge policy’s marginal is identical to the nudge prior. It is easier to compute because if the policy and prior are of the same parametric family (e.g. conditionally Gaussian), the inner divergence quantity can be solved analytically. By Jensen’s inequality, this mutual information is an upper bound on the marginal divergence constraint of Equation 5.5. Therefore, the mutual information acts as a stronger constraint: not only does it enforce correct marginal behavior, but it also limits the specificity of the policy. It introduces an information bottleneck. This additional regularization from Equation 5.6 on the policy implies that any policy prescription π(ξ|ui) for a particular unit ui that strongly diverges from the nudge prior would only do so for a large apparent benefit. This phenomenon could be useful for ranking the most beneficial interventions. The IB policy-learning objective functional is stated as follows, for β > 0: L IB[π, β] ≜ Eˆ − Eπ φˆξ + βD π(ξ|U) ∥ p(ξ) . (5.7) The objective L IB can straightforwardly be used to train a neural network parametrizing u 7→ π(ξ|u) with hyperparameter β controlling the strength of the IB. 5.3.2 OT — Optimal Transport Policy Learner The second approach is a natural consequence of the observation that the optimization problem in Definition 6 is an infinite linear program [71]. Informally, we seek to maximize a massive sum of nudge effects weighted by the policy density function, subject to an equality constraint along every nudge value as well as constraints ensuring that the conditional policy is a valid probability density function. This motivates a translation of Definition 6 to a form where all the constraints are explicit and the optimization is performed over all functions π : Ξ × X → R: arg max π E h Z Ξ ∆ξY π(ξ|U) dξ i , s.t. ∀ξ, E[π(ξ|U)] = p(ξ), (nudge prior) ∀u, Z Ξ π(ξ|u) dξ = 1, (policy validity #1) ∀(ξ, u), π(ξ|u) ≥ 0. (policy validity #2) The problem begins to look like an optimal transport with a transference plan π ′ (ξ, u) = π(ξ|u)p(u) where p(u) is the marginal density of X. The semblance becomes clearer when the second group of constraints is written as R Ξ π ′ (ξ, u) dξ = p(u), mirroring the first group of constraints on p(ξ). The connection between linear programming and optimal transport was first noticed by Kantorovich [163]. Concretely, the problem in Equation 5.8 is a Monge-Kantorovich optimal transport with cost function −∆ξY , assuming it is continuous and finite [338]. We substitute this ideal cost function with a pseudo-outcome approximation c(ξ, W) ≜ −φξ(W); recall that W = (Y, T, X) describes a whole observational unit. π ⋆ = arg max π E h Z Ξ φξ π(ξ|U) dξ i , s.t. 
∀ξ, E[π(ξ|U)] = p(ξ), ∀u, Z Ξ π(ξ|u) dξ = 1, ∀(ξ, u), π(ξ|u) ≥ 0. (5.8) We study the finite-sample version of Equation 5.8, which is again an optimal transport: πˆ (n,m) = arg max π Xn j=1 Xm i=1 φˆijπij , s.t. 1 m Xm i=1 πij = 1, 1 n Xn j=1 πij = 1, πij ≥ 0, (5.9) where the quantities have been written in matrix form as φˆij = ˆφξi (wj ), πij = π(ξi |uj )/p(ξi) for convenience. We use i.i.d {ξi} m i=1 from the prior and {wj = (yj , tj , xj )} n j=1 from the second dataset split for computing φˆij in (m × n) batches. Even though we heretofore required that the policy marginal adhered to the nudge prior, in practice any additional sparsity in prescriptions for ineffective interventions is helpful. In other words, it is desirable to allow a policy to favor inaction. Any unmatched or not-fully-matched unit uj to nudge points ξi implicitly 104 matches it to the null nudge ξ0 = 0 since ∆ξ=0Y = 0. We shall permit any amount of null mass to be allocated to units. Further, the problem naturally accommodates L 1 -regularization of the policy by introducing a hyperparameter γ ≥ 0. This would encourage policies to tend towards sparsity. Combining these two modifications to Equation 5.9 amounts to changing the equality constraints to inequalities and perturbing the objective: πˆ (n,m) γ = arg max π Xn j=1 Xm i=1 ( ˆφij − γ)πij , s.t. Xm i=1 πij ≤ m, Xn j=1 πij ≤ n, πij ≥ 0. (5.10) These finite linear programs can be solved straightforwardly on the second dataset split. A solution [ˆπij ] gives a particle approximation to the policy density for the observations by πˆ(ξ|uj ) = 1 m Xm i=1 πˆij δ(ξ − ξi). Finally, we propose a synthesis of the IB and OT approaches yielding computationally favorable properties. 5.3.3 IB+OT — Sinkhorn Policy Learner Suppose one were to discretize the IB problem, with objective in Equation 5.7, similarly to how OT was transformed into a linear program. The result might look like the finite-sample optimization problem detailed by Equation 5.9 with a mutual-information penalty added to the objective. Surprisingly, this is the exact problem that is efficiently solvable by the popular Sinkhorn algorithm [68], [259]. Penalizing the mutual information is equivalent to rewarding the joint entropy whenever the marginals are fixed; hence, the information bottleneck corresponds to an entropic regularization πˆ (n,m) β = arg max π Xn j=1 Xm i=1 ( ˆφijπij − β log πij ), s.t. 1 m Xm i=1 πij = 1, 1 n Xn j=1 πij = 1, πij ≥ 0. (5.11) By duality theory [29], for every β ≥ 0 there exists some α ≥ 0 such that the admissible set of solutions is {π : Iπ(ξ;U) ≤ α}, with Iπ(· · ·) from Equation 5.6. For the remainder of this chapter, we largely concern ourselves with the finite-sample OT solver and recommend IB+OT for larger data. 105 5.3.4 Generalizing Policy Prescriptions The finite-sample policy learners shown above, namely OT of Equation 5.10 and IB+OT of Equation 5.11, give prescriptions on the units in the second dataset split. By repeated application of our method with round-robin splits, as will be discussed further in §5.4, one obtains prescriptions for the entire sample. A generalizable prescription rule (policy) can be obtained by a regression on the in-sample prescriptions: hπˆ(u) ≜ Eˆ[ Eπˆ ξ | U = u]. (5.12) This estimation is “easy” in the sense that it lacks the covariate/domain shift that tends to make causal-effect estimation difficult. hπˆ(u) is tasked with predicting the expected policy within the observational domain. There is no extrapolation on (T, X) combinations. 
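As a concrete reference for the finite-sample OT step, a minimal SciPy sketch of the relaxed linear program in Equation 5.10 might look as follows; it is an illustrative implementation under assumed inputs, not the one used in the experiments.

```python
# A minimal sketch of the finite-sample OT policy learner of Equation 5.10,
# given the (m nudges x n units) pseudo-outcome matrix phi_hat computed on the
# second data split. Variable names are illustrative.
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import kron, identity, csr_matrix, vstack

def ot_policy(phi_hat, gamma=0.0):
    """Solve for the nonnegative prescription weights pi_ij of Equation 5.10."""
    m, n = phi_hat.shape
    # Maximize sum_ij (phi_ij - gamma) * pi_ij  <=>  minimize the negation.
    c = -(phi_hat - gamma).reshape(-1)                           # index = i*n + j
    A_unit = kron(csr_matrix(np.ones((1, m))), identity(n))     # sum_i pi_ij <= m
    A_nudge = kron(identity(m), csr_matrix(np.ones((1, n))))    # sum_j pi_ij <= n
    A_ub = vstack([A_unit, A_nudge]).tocsr()
    b_ub = np.concatenate([m * np.ones(n), n * np.ones(m)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x.reshape(m, n)

# The expected prescription for unit j is (1/m) * sum_i pi_ij * xi_i, which is
# the quantity regressed on U in the policy rule of Equation 5.12.
```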
5.4 A Scalable Algorithm Algorithm 4: Finite-sample Nudge Policy Learner Input :{(yj , tj , xj , uj )} n j=1 ∼i.i.d (Y, T, X, U), policy regularization γ ≥ 0 or β ≥ 0 Output :policy predictor hπˆ(u), Equation 5.12 1 foreach k-fold partition of the training set do 2 Learn nuisances (ˆµ, ηˆξ) using data outside the current partition; 3 Calibrate temperature of propensity-ratio ηˆξ on the current partition; 4 Pool the corresponding nuisances across the partitions covering the entire training set; 5 Randomly re-partition the dataset into policy batches of equal cardinality; 6 foreach policy batch do 7 Sample nudges {ξi} m i=1 from prior and compute pseudo-outcomes φˆξi , Equation 5.1; 8 Optimize policy prescriptions πˆ by solving linear program or Sinkhorn problem (Equations 5.10 / 5.11); 9 Estimate policy rule hπˆ(u) by regressing expected policy Eπˆ ξ on U on entire training set; It is usually statistically favorable to partition the dataset into many small non-overlapping sets that each act as the second split (for pseudo-outcomes) in a problem instance with the rest of the data reserved for that problem’s first split (for nuisances) [48]. The benefit of this strategy is that it avails more data for estimating the nuisances, and the drawback is the additional computation cost. For our method, estimating nuisances and pseudo-outcomes (§5.2) is only the first learning problem. We also seek to learn policies (§5.3), and it could be difficult to do so on small second splits. Therefore, we decouple these two data-partitioning design choices in Algorithm 4. For effect estimation, nuisances for pseudo-outcomes are estimated and then pooled across the whole sample. Then the policy is optimized in randomly allocated batches. Those 106 prescriptions are pooled back together in order to generalize the policy out of sample according to the regression posed in Equation 5.12. Whether the Sinkhorn or linear-program solver is used, it would be more efficient for large datasets to learn the policy in batches. Overall, for O(1) policy-batch sizes, Algorithm 4 achieves a runtime and memory complexity (in sample size) equivalent to that of the constituent machine-learning procedures. 5.5 Estimation Properties A recurrent quantity in our analysis shall be the conditional bias of the pseudo-outcome estimate for the heterogeneous nudge effect, ˆbξ(X), defined in Lemma 9. This reveals the double (rate) robustness property. Lemma 9 (Effect Estimation). The plug-in estimator for nudge effects φˆξ shown in Equation 5.1 is locally doubly robust in the sense that only one of the two nuisances need be correctly specified; in other words, if either µˆ(T, X) = µ(T, X) or ηˆξ(T, X) = ηξ(T, X), then E[ ˆφξ | X] = E[∆ξY | X]. Further, the estimator is doubly rate robust because for any {µˆ(T, X), ηˆξ(T, X)}, ˆbξ(X) ≜ E[ ˆφξ − φξ | X] = E[−(ˆµ − µ)(ˆηξ − ηξ) | X]. In order to study the convergence of the learned policies, we introduce πˆ ∗ that is the finite-sample solution with oracle pseudo-outcomes φξ in place of their estimates φˆξ. This intermediate quantity shall allow us to separate the consequences of imperfect effect estimation from imperfect policy learning. Also, for finite-sample estimates like πˆ, the norm operator ∥·∥p is taken elementwise, like the Frobenius norm for p = 2, on the (m nudges × n units) matrices. For instance, ∥πˆ∥2 = qPm i=1 Pn j=1(ˆπij ) 2. The perturbation stability of linear programs [277] lets us compare πˆ and πˆ ∗ . 
We need one more assumption on uniqueness in order to proceed with the remaining lemmas. This assumption, while not guaranteed for our cost function in general, helps to simplify the theoretical results. Assumption 7. The policies πˆ, πˆ ⋆ , π ⋆ are uniquely optimal solutions to their respective programs. Lemma 10. For the vanilla OT policy-learning problem described by Equation 5.9, there exist positive constants (ε, δ) such that for any ∥φˆ − φ∥ < δ, one has ∥πˆ − πˆ ∗ ∥ ≤ ε∥φˆ − φ∥. 107 In addition, E ∥φˆ − φ∥1 ≤ P i,j q ( ˆbij ) 2 + (ˆsij ) 2 where sˆ 2 is the finite-sample conditional error variance, sˆξ(X) 2 ≜ Var[ ˆφξ − φξ | X]. Finally, we study the asymptotics of the policy estimate through optimal transport [338]. Lemma 11. Under mild conditions, transference plan πˆ ∗ converges weakly to π ∗ as n, m → ∞. If, in addition, ηˆξ → ηξ and µˆ → µ uniformly in T × X as n → ∞, then πˆ converges weakly to πˆ ⋆ as well. Theorem 12 (Robust Policies). The policy-effect estimate is doubly rate robust in the sense that it can only be overestimated up to a product of the nuisance errors. In particular, E[Eπˆ φˆξ − Eπ∗ φξ] = E[Eπˆ ˆbξ] + E[Eπˆ φξ − Eπˆ ∗ φξ] | {z } (A) negative regret + E[Eπˆ ∗ φξ − Eπ∗ φξ] | {z } (B) discretization (5.13) with the first term, E[Eπˆ ˆbξ] is the product of errors localized around the learned policy. Term (A) is nonpositive and term (B) is a discretization error unrelated to the effect estimation. Further, if the assumptions for Lemma 11 are satisfied, E[Eπˆ φˆξ − Eπ∗ φξ] → 0. (5.14) We show that the policy-effect estimate is consistent (Equation 5.14) and tends to be conservative (Equation 5.13), which is useful for prudent decision-making. Next, we shed light on how the nudge-marginal constraint discussed in §5.3 guarantees error localization. Corollary 12.1 (Localized Errors). The error-product term E[Eπˆ ˆbξ] in Theorem 12 can be understood in terms of the nudge prior p(ξ), after observing the absolute bound E[Eπˆ ˆbξ] ≤ Ep(ξ) sup u∈U ˆbξ(u) . In words, the wort-case heterogeneous error is only relevant around the nudge prior. Next, we seek intuition on the consequences of this bound via the nudge prior. Corollary 12.2 constructs an illustrative scenario with one nudge prior being more dispersed than another nudge prior. Corollary 12.2 (Consequence of Nudge Dispersion). Suppose that | ˆbξ(u)| is bounded above by some function w(|ξ|, u) that is monotonically non-decreasing in every dimension of |ξ| and has finite limit. Further, consider two candidate nudge priors p1 and p2 such that P1[∥ξ∥ > a] ≤ P2[∥ξ∥ > a] for every a > 0. Then the error bound of Corollary 12.1 cannot be greater for p1 than for p2. 108 5.6 Empirical Evaluations We sought to verify the empirical improvements conferred by the doubly robust estimation (§5.2) and the constrained policy learner (§5.3). Experiment source code may be found in the supplementary material. 5.6.1 Semi-synthetic Policy Learning Policy Effect Learner Mean (Std.Err.) Robust OT (Alg. 4) 0.80 (0.05) Naïve OT −0.01 (0.01) Robust Full −1.49 (0.16) Naïve Full −1.49 (0.16) Table 5.1. Mean (and standard error) of policy effects on 20 trials (seeds 0–19) of the TCGA semi-synthetic benchmark. Our robust OT learner with k = 5 dataset splits is compared to three ablated baselines that do not achieve policy improvement. Units are in standard deviations of the observational outcomes. 
The first benchmark we conducted was built on the cancer genome atlas (TCGA) dataset originally proposed for causal-estimation benchmarking by Bica, Jordon, and Schaar [23]. Our semi-synthetic causal setting was designed to resemble a wide diversity of the problems mentioned in §5.1. Concretely, we aimed for real-valued multivariate treatments with significant covariate shift and complex response surfaces that reach a floor (of zero) at extreme-valued treatments. These are all reasonable traits for real-life policy learning from observational studies. The baselines considered in Table 5.1 were ablations of Algorithm 4. The Robust/Naïve demarcation indicated whether φˆξ was used or the direct prediction ∆ξµˆ. On the other hand, OT → Full indicated a policy learned by directly optimizing the effect estimate rather than regression on the OT transference plan. Figure 5.2 shows the change in performance at different nudge priors. To verify that the policy learner of Equation 5.12 abides by the nudge prior, we present Figure 5.3 comparing the quantiles of the empirical nudges with the prior. They appear to align overall, although the learned prescriptions are slightly under-dispersed, which is reasonable.

Figure 5.2. Achieved TCGA policy effects for different nudge priors (interventional budgets), across the 20 random seeds. Table 5.1 reports evaluations for the nudge-prior standard deviation set to 0.5. We observe increasing policy-effect variance with increasing nudge dispersion. (Axes: nudge-prior standard deviation versus mean policy effect.)

Figure 5.3. A quantile-quantile plot of the TCGA nudge prior (σ = 0.5) compared with the empirical distribution of the 20 learned policy prescriptions applied out of sample.

5.6.2 Experiment with Yelp & IRS Data

The second benchmark used latitudes and longitudes of Yelp establishments across the United States as the “treatment” variable [364]. In this case, the treatment assignment is not synthetic and indeed rather complex because it relies on the geography of major cities in the country. The outcome was a semi-synthetic imitation of revenue that depended on features like Yelp reviews and income brackets of the establishment’s ZIP Code [137]. Here, a nudge effect is defined as the change in revenue from moving location, using a Gaussian nudge prior with 1° of standard deviation. We trained nuisance models on half the dataset and learned policies on the other half in order to compare variations of the effect estimator. All policies were learned by OT with γ = 0, so the experiment in Table 5.2 served to contrast the proposed estimator against the naïve direct estimate, and against the robust estimate with an SMD-trained (Equation 5.3) propensity in place of the recommended PC of Equation 5.2.

Policy Effect Learner    Mean (Std.Err.)
Robust OT                0.39 (0.03)
Naïve OT                 0.27 (0.002)
SMD OT                   0.27 (0.002)

Table 5.2. Mean (and standard error) of policy effects on 10 trials of the Yelp & IRS semi-synthetic benchmark. Neither the estimator equipped with an alternative SMD-trained propensity nor the naïve estimator improves upon the proposed robust estimator.

5.7 Discussion

Our analyses focus on the OT (§5.3.2) policy-learning scheme, although we began §5.3 by building IB (§5.3.1) in order to justify IB+OT (§5.3.3) down the line.

Our results validate OT and, moving forward, suggest the general utility of the Sinkhorn algorithm for solving the entropically regularized version, which is IB+OT. In our analysis of the estimation properties (§5.5), we revealed fundamental properties of the learned policy in relation to oracle pseudo-outcomes. Theorem 12 deconstructs the policy-effect error into understandable components, and its corollaries explore basic phenomena arising from the choice of nudge prior. In our empirical evaluations (§5.6), we demonstrated the necessity of each ingredient for the proposed approach. The TCGA benchmark (§5.6.1) showed that a learned prescription rule applied out of sample requires the added robustness of pseudo-outcomes along with the OT policy-learning scheme to reliably achieve positive policy effects. The Yelp & IRS experiment (§5.6.2) that the propensity nuisance for the pseudo-outcome estimate must be learned using the novel PC loss of Equation 5.2 for significantly higher policy effects. Possible extensions. Returning to the wildfire example laid out in §5.1, we are struck with the need to deal with two difficulties often faced in real-life causal inference: treatment interference, and hidden confounding. Spatially proximal observation units, like patches of land, can be affected by one another’s interventions. Also, not all confounders are recorded—much less perfectly. Both of these issues can be handled with extensions to the proposed nudge-policy framework. Future work. We recognize the need to further study the consequences of using pseudo-outcomes as an approximation of the truly sought optimal-transport objective. We plan to augment the proposed policy learners in a way that gives stronger statistical guarantees on the actual policy effect. This would likely 111 entail alterations to the optimization problem in order to include a conditional independence constraint, ensuring that the transference plan is only a function of the relevant covariates U, and not T. Conclusion We present a set of algorithms for learning continuous and multivariate policies for localized interventions, from offline observational data. We demonstrated our Algorithm 4 on semi-synthetic datasets against baselines that showcase the benefits offered by each component of the algorithm. Further, our Theorem 12 supplies an interpretable guarantee on the policy-effect error localized by the nudge prior, which can be dictated by interventional budgets or other domain knowledge. 5.8 Appendix A: Proofs 5.8.1 Proof of Proposition 8 Following closely the approaches of Ichimura and Newey [133] and Hines, Dukes, Diaz-Ordaz, et al. [121], we give a brief exposition on deriving the efficient influence function for E[∆ξY ]. The notation for these semiparametric estimators is a little different than in this chapter. For instance, we denote the oracle estimator as P and the oracle estimate for a parameter θ as θ(P). We compare these quantities to those from arbitrary estimators P˜ and the empirical plug-in estimator Pˆ n. With nudge effects, we have θ(P) ≜ E[∆ξY ] for some ξ. θ(P) = E[ E[Y | T = T + ξ, X] − E[Y | T, X] ], θ(Pˆ n) = n −1Xn j=1 Eˆ n[Y | T = tj + ξ, X = xj ] − Eˆ n[Y | T = tj , X = xj ] . Then we look at a parametric submodel with some perturbation to the oracle estimator: Pτ = τP˜+(1−τ )P for τ ∈ [0, 1]. Let us assume that P˜ is a point mass in some w˜ = (˜y,t,˜ x˜). We calculate the following Gateaux derivative: dθ(Pτ ) dτ τ=0 ≜ ψ( ˜w,P) termed canonical gradient at w˜, through Riesz representer theorem. 
112 Now denoting the density estimates using Pτ as fτ (. . .), it holds under mild regularity conditions that θ(Pτ ) = Z Z Z h yfτ (y|t + ξ, x)fτ (t, x) − yfτ (y|t, x)fτ (t, x) i dy dt dx, = Z Z Z h fτ (y|t + ξ, x) − fτ (y|t, x) i yfτ (t, x) dy dt dx. To zoom in on one of these terms, we denote θξ(Pτ ) = RRR fτ (y|t + ξ, x) yfτ (t, x) dy dt dx. Now, taking the derivative, dθξ(Pτ ) dτ τ=0 = Z Z Z ( fτ=0(t, x) fτ=0(t + ξ, x) d dτ fτ (y, t + ξ, x)|τ=0 − fτ=0(t, x)fτ=0(y, t + ξ, x) fτ=0(t + ξ, x) 2 d dτ fτ (t + ξ, x)|τ=0 + fτ=0(y, t + ξ, x) fτ=0(t + ξ, x) d dτ fτ (t, x)|τ=0) y dy dt dx By using the identity dfτ (w)/ dτ |τ=0 = δw˜(w) − fτ=0(w) [121], which means for nudges that dfτ (t + ξ, x) dτ τ=0 = δ(t,˜x˜) (t + ξ, x) − fτ=0(t + ξ, x), dθξ(Pτ ) dτ τ=0 = Z Z Z yfτ=0(y|t + ξ, x)fτ=0(t, x) ( δw˜(y, t + ξ, x) fτ=0(y, t + ξ, x) − δ(t,˜x˜) (t + ξ, x) fτ=0(t + ξ, x) + δ(t,˜x˜) (t, x) fτ=0(t, x) − 1 ) , = fτ=0(t˜− ξ, x˜) fτ=0(t,˜ x˜) y˜ − fτ=0(t˜− ξ, x˜) fτ=0(t,˜ x˜) E[Y | T = t, X ˜ = ˜x] + E[Y | T = t˜+ ξ, X = ˜x] − θξ(P), = ηξ(t,˜ x˜) h y˜ − µ(t,˜ x˜) i + µ(t˜+ ξ, x˜) − θξ(P). ∴ dθ(Pτ ) dτ τ=0 = h ηξ(t,˜ x˜) − 1 ihy˜ − µ(t,˜ x˜) i + ∆ξµ(t,˜ x˜) − θξ(P). This result proves the proposition. The way the efficient influence function is used to form a corrective term in a doubly robust estimate is by the following von Mises expansion, θ(P) = θ(P˜) − dθ(Pτ ) dτ τ=1 + remainder. 113 Notice that the Gateaux derivative is evaluated at τ = 1 rather than τ = 0. It is known that dθ(Pτ ) dτ τ=1 = −P{ψ(W,P˜)}, which we estimate as P{ψ(W,Pˆ n)} ≈ n −1 Pn j=1 ψ(wj ,Pˆ n) when P˜ = Pˆ n. Finally, θ(P) ≈ φˆξ harkening back to Equation 5.1, since φˆξ = Pˆ n{ψ(W,Pˆ n)} + θ(Pˆ n). 5.8.2 Proof of Lemma 9 These proofs rely largely on iterated expectations. We first show that the conditional expectation of the pseudo-outcome φˆξ is doubly robust to nuisance misspecification. There are two branches in this proof, from either considering (a) µˆ = µ or (b) ηˆξ = ηξ. For (ˆµ = µ), the expectation is quite simple: (arguments T, X to the parameters omitted for brevity) E[ ˆφξ | X] = E[(ˆηξ − 1)(Y − µ) + ∆ξµ | X] = ET EY [(ˆηξ − 1)(Y − µ) | T, X] + ∆ξµ | X , = ET [(ˆηξ − 1)(µ − µ) + ∆ξµ | X] = E[∆ξY | X]. The (ˆηξ = ηξ) branch is slightly more involved, E[(ηξ − 1)(Y − µˆ) + ∆ξµˆ | X] = ET [(ηξ − 1)(µ − µˆ) + ∆ξµˆ | X], = ET h ηξ(T, X) µ(T, X) − µˆ(T, X) + ˆµ(T + ξ, X) − µ(T, X) X i . Lemma 0. Notice that for any integrable v(T, X), we have E[ηξv | X] = Z T ηξ(t, X)v(t, X)f0(t|X) dt, = Z T f0(t − ξ|X) f0(t|X) v(t, X)f0(t|X) dt, = Z T f0(t ′ |X) v(t ′ + ξ, X) dt ′ where t ′ = t − ξ, = E[v(T + ξ, X) | X]. We have leveraged the assumption that T is unbounded. 114 Hence, ET ηξ(T, X)[µ(T, X) − µˆ(T, X)] X = ET µ(T + ξ, X) − µˆ(T + ξ, X) X , ∴ E[(ηξ − 1)(Y − µˆ) + ∆ξµˆ | X] = ET µ(T + ξ, X) − µ(T, X) X = E[∆ξY | X]. Double rate robustness. We follow a similar path for the double rate robustness property, revealed as a product of errors in ˆbξ(X) ≜ E[ ˆφξ − φξ | X]. To start, we examine the difference in the corrective term of the pseudo-outcome, (ηξ − 1)(Y − µ), coincidentally the only part that depends on Y : EY (ˆηξ − 1)(Y − µˆ) − (ηξ − 1)(Y − µ) | T, X = (ˆηξ − 1)(µ − µˆ). Again, when arguments are omitted, (T, X) are implied. The fact above allows us to short-circuit to ˆbξ(X) = ET (ˆηξ − 1)(µ − µˆ) + ∆ξµˆ − ∆ξµ | X = ET ηˆξ × (µ − µˆ) + ˆµ(T + ξ, X) − µ(T + ξ, X) | X , = ET ηˆξ × (µ − µˆ) + ηξµˆ − ηξµ | X by Lemma 0 above, = ET − (ˆηξ − ηξ)(ˆµ − µ) | X]. 
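As an informal numerical illustration of this double robustness, the following sketch simulates an assumed linear-Gaussian data-generating process (not one of the benchmarks in this chapter), in which the true nudge effect equals ξ exactly, and evaluates the pseudo-outcome under deliberately misspecified nuisances.

```python
# Small Monte Carlo check of Lemma 9 under an assumed toy model:
# X ~ N(0,1), T | X ~ N(X, 1), Y = T + X + noise, so E[Delta_xi Y | X] = xi,
# and eta_xi(t, x) = exp(xi * (t - x) - xi^2 / 2) in closed form.
import numpy as np

rng = np.random.default_rng(0)
n, xi = 200_000, 0.7
x = rng.normal(size=n)
t = x + rng.normal(size=n)                    # f0(t | x) = Normal(x, 1)
y = t + x + 0.1 * rng.normal(size=n)          # mu(t, x) = t + x

eta_true = np.exp(xi * (t - x) - 0.5 * xi**2)  # f0(t - xi | x) / f0(t | x)
mu_true = lambda tt: tt + x

def pseudo(mu_fn, eta):
    return (eta - 1.0) * (y - mu_fn(t)) + (mu_fn(t + xi) - mu_fn(t))

print("true effect          :", xi)
print("correct mu, wrong eta:", pseudo(mu_true, np.ones(n)).mean())            # ~ xi
print("wrong mu, correct eta:", pseudo(lambda tt: 0.0 * tt, eta_true).mean())  # ~ xi
print("both misspecified    :", pseudo(lambda tt: 0.0 * tt, np.ones(n)).mean())  # biased
```

The first two averages recover the true effect despite one misspecified nuisance, whereas the doubly misspecified version does not, consistent with Lemma 9.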
5.8.3 Proof of Lemma 10 This result follows from the main theorem of Robinson [277]. By construction, what distinguishes the finite-sample πˆ from its oracle intermediary πˆ ∗ is the substitution φˆξ 7→ φξ. As long as the solution set for the linear program posed in Equation 5.9 is nonempty and bounded, then the (ε, δ) condition asserted in the lemma holds. The second part of the lemma, decomposing the absolute norm in terms of the conditional bias ˆbξ and the conditional error variance sˆ 2 ξ , follows from a simple norm inequality. By the definition of variance, we have E[( ˆφij − φij ) 2 ] = (ˆbij ) 2 + (ˆsij ) 2 . Recall that φij is shorthand for φξi (uj ), and so on. Additionally, by Jensen’s inequality, E |φˆij − φij | ≤ p E( ˆφij − φij ) 2. Therefore, E |φˆij − φij | ≤ q ( ˆbij ) 2 + (ˆsij ) 2 ⇐⇒ E ∥φˆ − φ∥1 ≤ Xm i=1 Xn j=1 q ( ˆbij ) 2 + (ˆsij ) 2. 115 5.8.4 Proof of Lemma 11 This result relies on Theorem 5.20 of Villani et al. [338]. To satisfy all the conditions of convergence, we must assume that the finite-sample and infinite-sample oracle transference plans πˆ ∗ and π ∗ are both uniquely optimal for their respective problems. First we investigate the requirements for πˆ ∗ → π ∗ weakly. The (oracle) cost functions are identical here: c(ξ, W) = −φξ(W). However, one is finite-sample and the other is infinite. The stability theorem of optimal transport requires that the marginals converge weakly. In our case, the empirical nudge sample and the empirical observational sample both converge weakly to their oracle equivalents. Next we prove that πˆ → πˆ ∗ . The corresponding marginals for πˆ and πˆ ∗ are identical, simplifying the conditions for convergence. Since we presupposed uniform convergence of the nuisance parameters, we have uniform convergence in the pseudo-outcome φˆξ → φξ as well. Hence the cost function converges uniformly and all the conditions for transference-plan convergence are satisfied. 5.8.5 Proof of Theorem 12 The decomposition of Equation 5.13 is straightforward to derive. The key ingredient is showing that, for any π, E[Eπ φˆξ] = E[Eπ φξ] + E[Eπ( ˆφξ − φξ)] = E[Eπ φξ] + E[Eπ ˆbξ]. Then, by setting π = ˆπ, we obtain E[Eπˆ φˆξ −Eπ∗ φξ] = E[Eπ ˆbξ] +E[Eπˆ φξ −Eπ∗ φξ], and the rest follows from linearity of expectation. Equipped additionally with Lemma 11, we can show that every one of these expectations vanishes asymptotically as in Equation 5.14. We restate the main decomposition; E[Eπˆ φˆξ − Eπ∗ φξ] = E[Eπˆ ˆbξ] + E[Eπˆ φξ − Eπˆ ∗ φξ] | {z } (A) negative regret + E[Eπˆ ∗ φξ − Eπ∗ φξ] | {z } (B) discretization By uniform convergence in the nuisances, we have the first error-localization term converge to zero. Both (A) and (B) likewise converge due to the weak respective convergences in the transference plans as stated by Lemma 11. 116 5.8.6 Proof of Corollary 12.1 Firstly we notice that E[Eπˆ ˆbξ] = R T E[ ˆbξ(U)ˆπ(ξ|U)] dξ. Taking absolute values, E[Eπˆ ˆbξ] ≤ Z T E ˆbξ(U)ˆπ(ξ|U) dξ = Z T ˆbξπˆ 1 dξ (where this norm is over the expectation) ≤ Z T ˆbξ ∞ ∥πˆ∥1 dξ (by Hölder’s inequality) = Z T sup u∈U n | ˆbξ(u)| o p(ξ) dξ (by the prior constraint on the marginal) = Ep(ξ) sup u∈U ˆbξ(u) . 5.8.7 Proof of Corollary 12.2 We begin by proving the univariate case for simplicity. The univariate nudge will be denoted as ξ 0 . Since w(|ξ 0 |, u) is monotonically non-decreasing in |ξ 0 |, so is the function v(ξ 0 ) ≜ supu∈U w(|ξ 0 |, u), which is an upper bound of ξ 0 7→ supu∈U | ˆbξ 0 (u)|. 
For nudge-prior CDFs (F1, F2) we have F1(ξ 0 ) ≥ F2(ξ 0 ) for all ξ 0 > 0 and F1(ξ 0 ) ≤ F2(ξ 0 ) for all ξ 0 < 0, by the dispersion assumption. Also, for k ∈ {1, 2}, Epk v(ξ 0 ) = Z +∞ −∞ v(ξ 0 )pk(ξ 0 ) dξ 0 = h v(ξ 0 )Fk(ξ 0 ) i+∞ −∞ − Z +∞ −∞ v˙(ξ 0 )Fk(ξ 0 ) dξ 0 ∴ Ep2 v(ξ 0 ) − Ep1 v(ξ 0 ) = [v(+∞) − v(+∞)] | {z } →0 + Z +∞ −∞ v˙[F1 − F2] dξ 0 (by finite limit assumption) = Z +∞ 0 v˙[F1 − F2] dξ 0 | {z } ≥0 − Z −∞ 0 v˙[F1 − F2] dξ 0 | {z } ≤0 ≥ 0. To generalize to multivariate nudges, we bring in the multivariate densities p1(ξ), p2(ξ) as well as their multivariate CDFs. The above inequality can be attained for vectors ξ by iterated integration on each dimension. 117 5.9 Appendix B: Experimental Setup 5.9.1 TCGA There were 9,659 individuals in this panel dataset with sampled expressions for 4,000 genes. Like in previous usages of TCGA for causal-effect benchmarking, we projected the genes into a smaller set of variables. In our case we built the outcomes as random polynomials in these random projections. Namely, with all Z variables denoting matrices with i.i.d standard Gaussian random entries, and G the (genes × units) expression matrix, we projected X˜ = Z xG, T˜ = (Z t + Z x→tZ x )G and let X and T be the z-score normalized versions of these projections. We had 60 covariates and 4 treatments for the purpose of our experiments. It follows that, Z x ∈ R 60×4000 , Z t ∈ R 4×4000, and Z x→t ∈ R 4×60. The latter is a mixing matrix that increases dependence between X and T, i.e. covariate shift. Now, by stretching the notation, we also considered Z x→y ∈ R 4×60 and Z t→y ∈ R 4×4 to project the observed variables into 4 latent variables that passed through nonlinearities to form the outcome, which took the structural form Y = softplus − X 4 b=1 Z x→y X p 4/60 + Z t→y T p 60/4 √ 64 b − ∥[X T]∥ 6 2 + S ! , where S ∼ Normal(0, 0.1 2 ). The scaling terms serve to weigh the treatment and covariates similarly. The norm on the concatenated covariates and treatments exists to ensure that extreme treatments push the outcome towards zero and not +∞. Estimation. As for the specific invocation of Algorithm 4, we estimated nuisances on a classic 5-fold dataset split using typical 2-layer, 50-unit, SiLU-activated feedforward neural networks (as well as for the final step of policy regression.) Batches had size 256, training always ran for 1024 epochs, nudge sample sizes were also 1024 for the OT step, and ADAM learning rates were set to 1 × 10−5 for the outcome model and 5 × 10−5 for everything else by simple hyperparameter search via the validation sets. When OT was used, we employed the rendition of Equation 5.10 with γ = 0 (for no L 1 regularization). This way, we did 118 not have to tune any further hyperparameters, and we could efficiently repeat the whole experimental setup for many random seeds. 5.9.2 Yelp & IRS There were 150,243 restaurants with valid entries in the Yelp dataset [364]. We extracted five-dimensional UMAP [227] embeddings from BERTopic [108] on the establishments’ category fields. These counted as covariates. They were supplemented by the establishments’ number of reviews, average score, and whether they were open or permanently closed. For the purpose of this section, we denote all those covariates as X1:5 (UMAP), X6, X7, X8 respectively. The treatment, T, is (latitude, longitude) of the establishment. We also obtained resident income information for the ZIP codes of T via an IRS dataset [137]. 
The number of people (field “N2”) in the top two brackets, #6 and #5, were used in constructing the link to the synthetic revenue outcome variable. We denote these count variables as R1(T), R2(T). The revenue outcome is modeled as Y = [10R1(T) + X8R2(T)] √ X6 + S X7 ∥X1:5∥2 , where S ∼ Exp(µ = 10), which heterogeneously relates revenue to neighborhood income and Yelp reviews, incorporating nonlinear noise S. Nuisance estimation. We used XGBoost [45] with 5-fold cross-validation to select an outcome model, using grid search over 3–10 maximum depths and {4k, 8k, 12k, 16k, 20k} tree-ensemble sizes. For propensity-related models, we trained feedforward neural networks with SiLU activations, width of 256 neuronal units, and depth of 3 layers. ADAM learning rates were 10−5 for 1, 000, 000 epochs, with dropout of 5% and weight decay of 10−5 . Policy learning. The OT program is employed with (1,000 nudges × 1,000 units) on the second/test sample per experimental invocation. Dataset access. The Yelp dataset and its terms of use can be accessed at https://www.yelp.com/dataset. 11 Chapter 6 Causal Encouragement Effects Robust estimation of treatment effects from observational data can be an invaluable tool for decision-makers, but these methods often neglect real-world constraints. A doctor’s prescription to stop smoking may be regarded as an “encouragement” that only marginally changes the probability that the patient adopts the treatment. We show how to estimate the direction and relative magnitude of the effects of an intervention campaign that encourages a certain treatment, even if a similar campaign has not been previously observed. While these results cannot tell us the absolute outcome of the intervention, often the direction and relative effects are all that is needed to most effectively allocate resources. Feedback from interventions can then be used to further improve causal estimates. We motivate novel causal estimands, which are relatively agnostic to the form of the intervention but still account for heterogeneous susceptibility to encouragement, and present a robust approach to estimating them. We demonstrate their usage in a hypothetical case study for a political campaign, and assess the performance of our proposed estimators in a variety of semi-synthetic experiments. 6.1 Introduction It is often of interest to maximize desirable outcomes even when there is limited control and observability over the processes that generate those outcomes. In this chapter, we extend causal-effect estimation methods to cases where very little is known about the putative instrument for intervention. We rely on a minimal-assumptions notion of encouraging treatment assignment or take-up. An example instrument for encouragement could be personalized recommendations that might not be followed very faithfully. The goal is to make reliable inferences about novel instruments, in the sense that they have not yet been observed in the causal system. When just treatments, outcomes, and covariates have been observed, can 120 we identify the situations that are most susceptible to an encouragement (or discouragement) for treatment, with the ultimate goal of effecting a better outcome? The average treatment effect (ATE), and its conditional variant the conditional average treatment effect (CATE) [143], disregard any notion of susceptibility to encouragement. These causal estimands help guide interventions only when there is complete control over the treatment assignment. 
When the propensity of treatment is entirely certain in either direction (never treated or always treated), it is reasonable to infer that an encouragement would be ineffective in swaying it. The logit is a sound real-valued measure of the direction and certainty of this treatment assignment. 6.1.1 Encouragements via Logits We propose a class of causal estimands that frame encouragements in terms of the logit of the treatment propensity. In the potential-outcomes setting with outcome Y , covariates X, and binary treatment T, we consider its propensity π(X) ≜ P[T = 1 | X] = σ(Z), where σ is the standard logistic sigmoid and Z represents the latent logit of the propensity, depending on X. Then we define the average encouragement derivative (AED), denoted by AED ≜ E ∂ E[Y |Z, X]/∂Z , (6.1) giving the infinitesimal effect of an increase in propensity logit. More intuitively, this quantifies the responsivity of the outcome variable to a small proportional increase in the odds of treatment. When an intervention cannot fully control treatment assignment but merely encourage or discourage it, the AED can be a better heuristic for the actual interventional effect, in comparison with the ATE. For cases where the magnitude of the encouragement is known, we propose the average encouragement response (AER) curve indexed by scalar r ∈ R such that AER(r) ≜ E E[Y | Z = z + r, X]z=Z − E[Y | Z, X] . (6.2) The AER(r) quantifies the expected effect of an additive increase in treatment-propensity logit by r, or in other words, an e r -fold multiplicative increase in treatment odds. We argue that the AED, AER, and related quantities are useful in a number of common causal settings. • Health recommendations. The health effect of a bad habit like smoking is well known, and can vary among individuals. Still, an effectual public-health intervention like an informational billboard should not just target those hurt my smoking the most. The heterogeneous receptivity of individuals 121 to such interventions, and their susceptibility to habit change, can determine the success of a public health campaign [174]. • Climate change advocacy. In order to promote impactful adoption of new climate policy, advocates should tailor their efforts to those communities that are open to amending their beliefs, potentially willing to act, and capable of implementing policies that can help the climate [27], [242]. • Political campaign targeting. In §6.4 we dive into a case study on microtargeting for political campaigns. We address the question of optimally allocating resources towards political messaging using observational data from a past election as well as polls in a running election. The common thread among these examples is the need to identify instances in observational data that are susceptible, for the sake of an outcome, to a treatment encouragement. In this work, logits of the treatment propensity serve as heuristics for susceptibility when little else is known about the instrument for encouragement. Figure 6.1 illustrates our reasoning. Figure 6.1. The proposed causal estimands AED and AER incorporate susceptibility to encouragement with treatment effects. In this example with four individuals that have equal treatment effects (their response curves have equal heights,) one individual is more susceptible than the rest due to their treatment propensity being less certain (logit closer to zero.) Why logits? 
Besides being the de facto standard representation of uncertainty for discrete predictions, the are fundamental reasons for the logit’s centrality in our framing of encouragement effects. Since an additive effect on a logit is a multiplicative effect on the probability odds, we note the ubiquity of multiplicative effects in the sciences [187], [201], ranging from sociology [178], to environment [6], and biology [95]. 122 We also present an argument on the basis of the principle of maximum entropy [141]. Suppose there is some intervention variable V that, when positive encourages treatment, and discourages it when negative. For the purpose of this example, the complete treatment propensity is π(X, V ). We seek to model π with minimal parametric conditions. Borrowing the orthogonality principle from least-squares regression, we impose that the sample of V taken as a vector ⟨v1, v2, . . .⟩ be orthogonal to the error of the approximation ⟨π˜(x, v1), π˜(x, v2), . . .⟩ for every x. Orthogonality of the predictor variable and error is a common criterion for optimality of linear estimators [220]. Proposition 13 (The Logistic Model). With the orthogonality condition stated above, as well as the requirement that |π˜(X, 0) − T| ≤ ε with some margin ε > 0, the maximal-entropy binomial model of the treatment propensity belongs to the parametric family π˜(x, v) = σ α(x)v + β(x) . By Proposition 13, the maximal-entropy propensity model subject to minimal conditions yields a quasilinear relation between propensity logits and the intervention. This fact motivates encouragements similar to V acting linearly on Z. Connection to causal derivatives. The AED is similar to the average causal derivative (ACD) [114], [120], [217] when projecting Z into a continuous-valued treatment variable. The key difference is that Z is latent, and obeys a specific form with respect to the real treatment variable T. Likewise, the terminology for the encouragement response curve is inspired by dose responses [14], [23], [161], [171], [214], [241], which are continua of causal outcomes indexed by a continuous treatment variable. 6.1.2 Related Work Policy learning for encouragements is an active area of research [380]. When the encouragement variable is observed alongside the other variables in the causal system, the statistical setting becomes one of instrumental variables [260], [266], [267]. A massive body of literature studies treatment-effect estimation via instruments [41]. A closely related line of work studies treatment effects under limited compliance or adherence [16], [73], [269], [271]. Here, the observed interventions act like encouragements because treatment or nontreatment is not guaranteed. The estimation problems we consider in §6.2 & 6.3 are fundamentally different from those scrutinized by all these works because we assume that the encouragement has not been observed. We note the contemporaneous work of McClean, Branson, and Kennedy [226] that proposed similar estimands and robust estimators as the AED and AER in this work. Those authors frame the identification 12 problem in terms of incremental effects, which are largely motivated to overcome positivity/overlap violations [80], [168]. Our problem setting of encouragements acting on logits is substantially different, and so is our proposed bandit formulation. One of our experiments investigates the question of prioritizing encouragements using the proposed estimands. Sun, Munro, Kalashnov, et al. 
[311] study treatment allocation by prioritizing greater predicted effects under resource constraints with uncertain costs. We also investigate policy learning using a bandit for allocating encouragements in §6.4. Importantly, our problem setting requires that the bandit not have access to the outcomes, forcing it to rely on effect estimates from observational data. Other work has studied the use of surrogate outcomes for improving effect estimation [most recently 373], though not with encouragements. We follow a similar approach as Imbens, Kallus, Mao, et al. [134] and Yang, Eckles, Dhillon, et al. [360] using short-term proxies for long-term objectives. In fact, like Yang, Eckles, Dhillon, et al. [360], we develop a Thompson sampler to solve our bandit problem. Our work differs in the special role of the encouragements, which are not included in the observations. 6.1.3 Potential Outcomes Y R X covariate intervention treatment outcome logit Z T Figure 6.2. Illustrative diagram of our causal setting. In terms of the observed data-generating process, we operate under standard potential-outcomes assumptions listed in Assumption 8 on i.i.d observations of (outcome, binary treatment, covariate) variables (Y, T, X). Potential outcomes are denoted as (Y0, Y1). As shown in the diagram of Figure 6.2, we assume additional structure on the putative interventions that motivate our causal estimands. Assumption 9 solidifies this structure. These minimal, flexible assumptions are sufficient for proceeding with §6.2 & 6.3. Additional requirements are introduced in §6.4 when a specific intervention is explored. Assumption 8 (Potential Outcomes). We specialize the causal setting due to Rubin [286] and Rubin [288] with the additional structure shown in the diagram of Figure 6.2. 124 (a) Stable unit treatment value assumption (SUTVA). There are only two potential outcomes for an observational unit, corresponding to the values of its own treatment variable, and the observed outcome is given by the unit’s assigned treatment. (b) Overlap/positivity. Both treatments have a nonzero probability of being assigned to every unit. (c) Ignorability. Potential outcomes are independent of the treatment after conditioning on covariates. Since Z is the sole parent of T, we state this condition as (Y0, Y1) ⊥⊥ Z | X. Assumption 9 (Intervention Model). Let the random variable Z represent the logit of the treatment propensity. We have σ(Z) = π(X). A hypothetical intervention R as depicted in Figure 6.2 is additive on Z. We interpret the observational dataset of (Y, T, X) as implicitly including R = 0. The potential-outcomes framing of a causal system, as described by Assumption 8, allows stating expectations of potential outcomes in terms of conditional outcome expectations as long as they condition on confounders. In our case, E[Yt | X] = E[Y | T = t, X]. Note that both AED (Equation 6.1) and AER (Equation 6.2) are based on the conditional expectation E[Y | Z, X]. By Assumption 9, this reduces to E[Y | Z, X] = σ(Z) E[Y | T = 1, X] + [1 − σ(Z)] E[Y | T = 0, X] = σ(Z) E[Y1 | X] + [1 − σ(Z)] E[Y0 | X], which can be interpreted as a propensity-weighted mixture of the potential-outcome expectations. 6.2 Robust Estimation Inspired by semiparametric estimation theory [121], [133], [169], we propose novel estimators for the AED and AER and, in particular, their conditional versions for heterogeneous effects. 
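Concretely, the nuisances themselves can be fit with generic machine learning under cross-fitting; a minimal scikit-learn sketch is shown below, where the choice of learners and the propensity clipping are illustrative assumptions rather than prescriptions.

```python
# A minimal cross-fitting sketch (illustrative, not the dissertation's code) for
# the two classical nuisances used below: the conditional outcome mu(t, x) and
# the propensity pi(x), each predicted out-of-fold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def crossfit_nuisances(y, t, x, n_splits=5, seed=0):
    n = len(y)
    mu0, mu1, pi = np.zeros(n), np.zeros(n), np.zeros(n)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(x):
        # Outcome models fit per treatment arm on the training fold only.
        m0 = GradientBoostingRegressor().fit(x[train][t[train] == 0], y[train][t[train] == 0])
        m1 = GradientBoostingRegressor().fit(x[train][t[train] == 1], y[train][t[train] == 1])
        ps = GradientBoostingClassifier().fit(x[train], t[train])
        mu0[test], mu1[test] = m0.predict(x[test]), m1.predict(x[test])
        pi[test] = ps.predict_proba(x[test])[:, 1]
    # Clipping is an assumed practical safeguard against extreme inverse weights.
    return mu0, mu1, np.clip(pi, 1e-3, 1 - 1e-3)
```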
Our approach begins with constructing pseudo-outcomes in a similar manner as is done for other causal estimands [48] in order for them to be regressed on covariates [170], [246]. The pseudo-outcomes for these causal estimands tend to be parametrized by two nuisance functions each: one that informally gives a naive prediction of the effect and another that debiases it, in expectation. The celebrated doubly robust/debiased property holds when only one of the two nuisance parameters need be correctly specified for the pseudo-outcome to equal the causal estimand in expectation. 125 The proposed pseudo-outcomes for the AED and AER are substantially different in how they are parametrized. The first remarkable aspect is that they employ the same classical nuisances as the ATE pseudo-outcome: the conditional mean µ(T, X) ≜ E[Y | T, X], and the propensity π(X) ≜ P[T = 1 | X]. The clear upside of this formulation is that, operationally, estimation of the AED is about as simple or difficult as for the ATE. The downside is that the double robustness property does not precisely hold with respect to these nuisances. The robustness benefit conferred by our proposed pseudo-outcomes are a bit more nuanced, and are explored in §6.2.1 and beyond. We begin with the AED as defined in Equation 6.1. Introducing the shorthand notation ∆µ(X) ≜ µ(1, X) − µ(0, X) and omitting arguments _(X) or _(T, X) implied by the nuisances, we succinctly present the AED pseudo-outcome as φ ≜ [(1 − π)T − π(1 − T)](Y − µ) + π(1 − π)∆µ. (6.3) A correctly specified φ yields the conditional AED, or CAED, in expectation: E[φ|X] = E[∂ E[Y |Z, X]/∂Z | X] = CAED(X). The nuisances can be estimated on one split of the training data and plugged into the pseudo-outcomes on a separate split. This ensures independence of the nuisance estimates and the data points used in the pseudo-outcomes, which is required in the convergence analysis of pseudo-outcome estimators [48], [50], [52]. 6.2.1 Estimation Bias The conditional bias of our pseudo-outcome φ, defined below, reveals the nature of its relative robustness. In this paper, the hat symbol signifies an estimate, with φˆ corresponding to Equation 6.3 with nuisance estimates (ˆµ, πˆ) replacing the oracle quantities (µ, π). By iterated expectation, E[ ˆφ − φ | X] = −(ˆπ − π) 2 (ˆµ1 − µˆ0) | {z } (bias with quadratic decay) + (ˆπ − π) h π1(ˆµ1 − µ1) + π0(ˆµ0 − µ0) | {z } (outcome rates) − (π1 − π0)(ˆµ1 − µˆ0) | {z } (systemic outcome bias) i . (6.4) 126 We set µ1(X) ≜ µ(T = 1, X) and µ0(X) ≜ µ(T = 0, X) for convenience. The red and blue components show how the conditional bias mostly, but not entirely, involves a product of the component errors (ˆπ − π) and (ˆµt − µt) for t ∈ {0, 1}. This structure is desirable because it ensures that the statistical convergence rate of the pseudo-outcome synergizes the two nuisance rates [e.g. 157], [381]. However, a violation of putative double robustness in (µ, π) stems from, and is isolated to the component marked as “systemic outcome bias.” That term is the reason that a correctly specified µ but misspecified πˆ may result in a nonzero conditional bias. The systemic outcome bias in Equation 6.4 suggests that φˆ is preferable to a simpler, naive AED estimator πˆ(1 − πˆ)∆ˆµ when effects are heterogeneous, and more concretely, when the effects are much smaller in size than the overall variability in the outcomes—so when (ˆµ1 − µˆ0) is very small compared to (ˆµ1 − µ1) and/or (ˆµ0 − µ0). The importance of calibrated propensities. 
The asymmetrical non-double robustness property elucidated with Equation 6.4 emphasizes the need for well-calibrated propensity estimates. If πˆ(x) is accurate enough, it could guarantee the accuracy of the whole (C)AED estimate. 6.3 Encouragement Response Curves Following the same steps for deriving φ from the AED, but this time targeting AER(r) with a pseudooutcome termed ψr, we arrive at the formula ψr ≜ (π (r) − π) T π − 1 − T 1 − π (Y − µ) + (π (r) − π)∆µ, where π (r) (X) ≜ σ(σ −1 ◦ π(X) + r). (6.5) We maintain the shorthand notation from Equation 6.3. The new quantity π (r) represents a propensity nudged or shifted by r logits. This ψr can be viewed as a non-infinitesimal version of the derivative estimator, φ. One might notice a correspondence between ψr and the well-known pseudo-outcome for the ATE: ψ ATE ≜ T π − 1 − T 1 − π (Y − µ) + ∆µ, from which it follows that ψr = (π (r) − π)ψ ATE. The simplicity of this relation should not obscure the additional robustness endowed by regression on ψˆ r rather than ψˆATE for estimating heterogeneous encouragement responses. This benefit is verified by empirical experiments in §6.5. 127 The analogous conditional bias (as in Equation 6.4) for a ψr estimate can be calculated as E[ψˆ r − ψr | X] = [(ˆπ (r) − πˆ) − (π (r) − π)](ˆµ1 − µˆ0) | {z } (systemic outcome bias) − π1 πˆ (r) 1 πˆ1 − π (r) 1 π1 ! (ˆµ1 − µ1) − π0 πˆ (r) 0 πˆ0 − π (r) 0 π0 ! (ˆµ0 − µ0). (6.6) Again, we encounter a product of errors but also a systemic bias term that is not fully mitigated by the debiasing term. The propensity errors play multiple roles in the pseudo-outcome’s bias. They take the form of ratios of nudged to not-nudged when they multiply against the outcome prediction errors. On the other hand, the systemic bias component is influenced by the error of the propensity nudge-effect (π (r) − π) prediction. The decomposition of Equation 6.6 confirms that correct nuisances yield the conditional AER, or CAER, in expectation: E[ψr|X] = E E[Y | Z = z + r, X]z=Z − E[Y | Z, X] | X = CAER(r|X). The naive encouragement response-curve estimate would be given by (ˆπ (r) − πˆ)(ˆµ1 − µˆ0), which is the encouragement’s effect on the propensity multiplied by the treatment effect. Clearly, a correct propensity model would not suffice to fully debias the naive predictions. For the CAER, these naive predictions can give the curve directly as a function of r. Using the pseudo-outcomes instead requires a regression of the entire ψˆ r curves on the covariates. For simplicity, in our experiments we estimated separate regressions along a grid of logits and then matched any r to the closest grid point. We briefly highlight an alternative way for fitting the conditional response curve AER(r). When considering the univariate mapping r 7→ ψr for fixed (Y, T, X), we observe the recurrence of (π (r) − π), which takes the functional form of a shifted logistic sigmoid. This clue motivates a parametrization of the encouragement response-curve estimator with three scalar functions as ψcr(X) ≜ Aˆ(X)σ Bˆ(X) + r + Cˆ(X). The parameter Aˆ(x) can be interpreted as corresponding to the conditional average treatment effect (CATE), with Bˆ(x) approximately corresponding to a propensity logit, and Cˆ(x) being the average outcome. 12 6.4 Online Encouragement Targeting In this section we specialize our methodological findings to a problem of online resource allocation. 
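Such targeting consumes the offline estimates developed above. As a concrete reference, a minimal sketch of the AED and AER pseudo-outcomes (Equations 6.3 and 6.5) is given below, assuming fitted nuisance arrays as inputs; the names are illustrative.

```python
# A minimal sketch of the pseudo-outcomes of Equations 6.3 and 6.5, given
# out-of-fold nuisance predictions mu0, mu1 (outcome) and pi (propensity).
import numpy as np
from scipy.special import expit, logit   # logistic sigmoid and its inverse

def aed_pseudo_outcome(y, t, mu0, mu1, pi):
    """phi: its conditional mean given X estimates CAED(X)."""
    mu_obs = np.where(t == 1, mu1, mu0)
    correction = ((1 - pi) * t - pi * (1 - t)) * (y - mu_obs)
    return correction + pi * (1 - pi) * (mu1 - mu0)

def aer_pseudo_outcome(y, t, mu0, mu1, pi, r):
    """psi_r: its conditional mean given X estimates CAER(r | X)."""
    mu_obs = np.where(t == 1, mu1, mu0)
    pi_r = expit(logit(pi) + r)           # propensity shifted by r logits
    ipw = t / pi - (1 - t) / (1 - pi)     # the familiar ATE weighting term
    return (pi_r - pi) * (ipw * (y - mu_obs) + (mu1 - mu0))

# Heterogeneous estimates follow by regressing these pseudo-outcomes on X,
# e.g. along a grid of r values for the encouragement response curve.
```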
Consider the scenario where ample observational data enables estimation of causal encouragement quantities like the conditional AED (CAED) and the conditional AER (CAER). We aim to use these estimates from offline data in order to assist with online targeting of a novel encouragement intervention. Since it is new, we assume there is little to no record of such interventions realized with their associated outcomes. Therefore we cannot approach this problem from the lens of instrumental variables, and instead require an algorithm that combines offline and online learning while relying on relatively intervention-agnostic quantities like those proposed in this paper.

6.4.1 Bandit Problem

Suppose that a large offline dataset consists of a sample of (Y, T, X) units satisfying Assumption 8 under the general setup outlined in §6.1.3. Online, a covariate X is observed and an agent is tasked with choosing an intervention amount U ∈ [0, ∞) with the purpose of maximizing an outcome. After deciding on U, the agent observes a T realization that can be used for feedback on the strength of the U intervention. Finally, the outcomes Y are all realized at once, after all decisions have been made. The motivation for this formulation is expanded in the case study, §6.4.2.

Generally, U can be interpreted as a resource like money being allocated for intervention on X. The notion of utility is encoded as a "profit" or "return on investment" in Y, which has the same units as U. Hence we optimize a U policy such that it maximizes (Y − U) in expectation for each unit described by X. Even though U is allowed to be unbounded, at some point the return on Y will saturate, much like the phenomenon of diminishing marginal utility in economics [146], [221]. In our setting, the response curve at the limit(s) AER(±∞) is bounded.

The ability to observe T at each round but not Y is motivated by a difference in temporal scales. For instance, T can mark the reception of a certain message or a commitment to a certain action, whereas Y can measure some eventual outcome relating to long-term health or the conclusion of an election, as in §6.4.2. Our agent can adapt to each realized T and use that information in conjunction with the offline learned effects to construct its policy on U.

Our problem corresponds to that of a contextual multi-armed bandit [191] with additional structure on the reward function. The precise implementation of the bandit is outside the scope of this study, but we opted for the celebrated Thompson sampling algorithm [74], [323], which is simple yet demonstrably effective [see e.g. 82]. A bandit takes sequential actions while observing each step's reward, with the goal of minimizing cumulative regret, which is the difference between the maximal achievable total reward (summed over the sequence) and the bandit's achieved total reward. A contextual bandit is given i.i.d. "contexts" X that modulate the rewards, and a multi-armed bandit can choose from multiple possible actions rather than a binary decision. The action space is the whole continuum [184], [211] for U ∈ [0, ∞), but in practice it is discretized to a grid 𝒰. A single bandit step is detailed by Algorithm 5.

Algorithm 5: Causal Encouraging Bandit
Input: offline response-curve estimate ψ̂r(x), offline propensity estimate π̂(x), current encouragement posterior hΘ(x, u), next covariate/context X, and treatment-assignment oracle π*(x, u).
Output: next intervention amount U and updated posterior hΘ′(x, u).
1. Set the contextual reward function fθ(u) ≜ ψ̂r(X) − u, evaluated at r = hθ(X, u);
2. Sample U ∈ 𝒰 according to the probability u ↦ EΘ[ I{ fΘ(u) = max over u′ ∈ 𝒰 of fΘ(u′) } ];
3. Perform action U and observe the treatment assignment T ∼ Bernoulli(π*(X, U));
4. Update the posterior Θ′ using the observation (T, U, X) and a Bernoulli likelihood for parameter θ, with treatment probability given by σ(σ⁻¹(π̂(X)) + hθ(X, U)).
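To make the loop concrete, here is a minimal sketch of one such round under simplifying assumptions: a homogeneous linear encouragement model hθ(x, u) = θu and a particle approximation of the posterior over θ. The function and variable names (psi_hat, pi_hat, treat_oracle, and so on) are hypothetical, and this is not the implementation evaluated in §6.5.

```python
# Sketch: one Thompson-sampling round of the causal encouraging bandit (Algorithm 5),
# assuming h_theta(x, u) = theta * u and a particle posterior over theta.
import numpy as np
from scipy.special import expit, logit

def bandit_round(x, psi_hat, pi_hat, particles, weights, u_grid, treat_oracle, rng):
    """psi_hat(r, x): offline CAER estimate; pi_hat(x): offline propensity in (0, 1).
    particles, weights: approximate posterior over theta. Returns (U, T, new weights)."""
    # Step 2 via probability matching: draw one theta and act greedily for it.
    theta = rng.choice(particles, p=weights)
    rewards = np.array([psi_hat(theta * u, x) - u for u in u_grid])  # step 1: f_theta(u)
    U = u_grid[int(np.argmax(rewards))]
    # Step 3: observe the treatment under the (unknown to the agent) assignment mechanism.
    T = rng.binomial(1, treat_oracle(x, U))
    # Step 4: reweight particles by the Bernoulli likelihood with success probability
    # sigma(sigma^{-1}(pi_hat(x)) + theta_k * U).
    p = expit(logit(pi_hat(x)) + particles * U)
    new_weights = weights * (p if T == 1 else 1.0 - p)
    return U, T, new_weights / new_weights.sum()
```

Drawing a single θ from the posterior and maximizing its reward over the grid realizes the probability-matching rule of step 2 in expectation over posterior draws.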
Our "causal encouraging bandit" relies on a parametric family of encouragement functions hθ(x, u) that models the (potentially heterogeneous) translation from intervention resource units u to logits r. We impose the restriction that a zero intervention has zero effect: ∀x, hθ(x, 0) = 0. In the language of Thompson sampling, the bandit maintains a posterior over the parametrization of the encouragements, denoted as Θ.

Disentangling heterogeneities in the response. Conditional response curves CAER(r | x) predicted by ψ̂r(x) are still clearly heterogeneous even when the true encouragement function r = h(x, u) is invariant to x, such as when h(x, u) = αu. Notwithstanding variability in the treatment effect itself, the susceptibility of the treatment propensity π(x) to encouragement varies with the magnitude of its logit.

6.4.2 Case Study on Political Campaign Microtargeting

We present a relevant case study on political messaging optimization, often termed political "microtargeting," which relies on targeting specific groups of individuals with messages intended to attain desirable election outcomes. Fundamentally, political campaigns rely on conveying persuasive platform messages to constituents. More specifically, microtargeting enables candidates to focus on issues that enhance election success with less regard for the broader electorate [299]. As individuals have different interests, beliefs, and resources, the effectiveness of these messages towards election outcomes differs.

Extant literature has analyzed the concept of political microtargeting due to its relevancy in recent political campaigns. Compared to uniformly allocating resources towards the median voter [77], as campaigns have historically done, attempts have been made to utilize individual targeting to further forms of democratic participation and to make appeals to targeted subsets of voters on the basis of the parties and issues they already care about [182]. However, an open question remains on how to optimize campaigns on the basis of polling data with an objective tied to the actual downstream electoral outcomes.

Bandit for encouraging voters. We discuss how the encouraging bandit of Algorithm 5 might fit in this problem domain. Suppose a campaign is designing novel political messages individualized to demographic/geographic attributes X. The efficacy of these messages is uncertain. The campaign conducts frequent polls to assess each population segment's susceptibility to their targeted message. Let T indicate the binary poll result of whether the individual stated they will support the campaign's candidate. One might assume that T is a reliable proxy of the actual electoral result, but differences in voter turnout or poll trustworthiness can lead to strongly biased predictions. For this reason, let Y indicate the actual contribution of the individual to the election result. To have a data-driven grounding of the relationship between individual attributes X, polling response T, and electoral action Y, the campaign obtains voting records or exit polls from a previous election.
That election is recent enough that the observations of (Y, T, X) are close in distribution to equivalent measurements in the current election cycle. This allows the campaign to learn relevant effects that can transfer to the online microtargeting problem. The bandit traverses an i.i.d. sequence of individuals described by X and performs experiments by spending resources U on a message tailored to X. A poll is then conducted on that population segment, and a T is recorded. The bandit uses the information to refine its estimate of hΘ(X, U), the mapping from resources spent U to the change R in propensity logits of expressing support in a poll. Once U → R is known, it can be pushed through ψ̂R in order to arrive at a robust estimate of the effect on the actual election.

Recall that the bandit seeks a return on investment by maximizing (Y − U). A precise description of Y is the utility of an individual's role in the election, in units that are commensurable with U. We assume that Y is scaled according to the amount of resources the campaign is willing to spend for one vote, depending on their goals and financial resources [150]. The utility can also vary according to the importance of each electoral district to the election [58]. An effect estimate on Y incorporates these factors with the expected change in voting behavior from an intervention, based on susceptibility. Naturally, the bandit can only measure the partial effect of U on Y that is mediated by the polling result T, relying on an exclusion restriction [7].

6.5 Empirical Evaluations

Our experiments involve a wide diversity of semi-synthetic causal settings with randomly generated dependencies. The semi-synthetic data are based on the real-life datasets TCGA, which has been in use for semi-synthetic evaluations [23], and PBMC [162], which has also been used for this purpose [214]. In all of the empirical results presented, the individual treatment effect has unit variance over individuals.

[Figure: "Intervention Prioritization Rules," plotting the AUTOC ratio (AED to ATE) against the number of covariates.]
Figure 6.3. Comparing the AUTOC [358] of prioritization rules using AED versus ATE. We plot this AUTOC ratio (where > 1 favors our proposed estimand, AED) against the number of covariates in the semi-synthetic experiment. By construction, the overlap in treatment assignments decreases with more covariates, inducing greater covariate shift and increasing the variance in the propensity logits. Logit variance implies the presence of individuals with very high or low susceptibility to encouragement, emphasizing the value of AED over ATE.

6.5.1 Prioritizing Interventions

Our first experiment investigates observational policy learning [11] attenuated by heterogeneous compliance. With a simple scenario, we demonstrate the suboptimality of the individualized treatment effect for prioritizing interventions when the policy instrument can only encourage treatment, even with oracle causal estimands. Prioritization is crucial in resource-constrained settings like a medical clinic, where noncompliance and other factors can lead to heterogeneous receptivity to an intervention. We employ the area under the targeting operator characteristic curve (AUTOC), as recently proposed by Yadlowsky, Fleming, Shah, et al. [358], as a metric for evaluating prioritization rules. In Figure 6.3, we compare the AUTOC of prioritizing by CAED against the AUTOC of prioritizing by CATE. For each number of covariates displayed, we include 64 points with encouragement strength varying along a logarithmic grid spanning 10⁻³ to 10¹ logits.
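For reference, the following is a rough sketch of how the AUTOC of a prioritization rule can be computed when, as in this semi-synthetic setting, the ground-truth per-unit gains from intervention are known. It is a simple discrete approximation under assumed names; the estimation and weighting refinements of Yadlowsky et al. [358] are not reproduced here.

```python
# Sketch: AUTOC of a prioritization rule given known true per-unit gains,
# as is possible in a semi-synthetic experiment.
import numpy as np

def autoc(score, true_gain):
    """Average TOC: mean gain among the top-q fraction ranked by `score`,
    minus the overall mean gain, averaged over the empirical grid of q."""
    sorted_gain = true_gain[np.argsort(-score)]
    toc = np.cumsum(sorted_gain) / np.arange(1, len(sorted_gain) + 1) - true_gain.mean()
    return toc.mean()

# The ratio plotted in Figure 6.3 would then be, schematically,
# autoc(caed_score, gain) / autoc(cate_score, gain), with values above 1 favoring the AED.
```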
Dataset | Estimator | Encouragement (r): ±0.5 logits | ±1 logits | ±1.5 logits | ±2 logits
TCGA | Proposed | 0.08 (0.002) | 0.15 (0.004) | 0.22 (0.006) | 0.29 (0.008)
TCGA | Robust Baseline | 0.15 (0.06) | 0.30 (0.12) | 0.44 (0.18) | 0.56 (0.23)
TCGA | Naive Baseline | 0.14 (0.01) | 0.27 (0.02) | 0.38 (0.03) | 0.48 (0.03)
PBMC | Proposed | 0.09 (0.004) | 0.19 (0.01) | 0.29 (0.02) | 0.41 (0.03)
PBMC | Robust Baseline | 1.44 (0.43) | 2.78 (0.81) | 3.96 (1.13) | 5.00 (1.40)
PBMC | Naive Baseline | 0.18 (0.01) | 0.35 (0.02) | 0.51 (0.03) | 0.66 (0.04)
Table 6.1. Root mean square errors (RMSEs) over 256 trials of regression on the proposed pseudo-outcome estimator of CAER(r), contrasted against the baseline estimators for different values of r.

[Figure: "AED Prediction Errors on TCGA Dataset," plotting RMSE against outcome variability for the proposed estimator, the robust baseline, and the naive baseline.]
Figure 6.4. Root mean square errors (RMSEs) over 256 trials of regression on the proposed pseudo-outcome estimator of AED, contrasted against the two baselines described in §6.5.2. As outcome variability increases, the difference in accuracies is magnified, confirming what was suggested in §6.2.1.

6.5.2 Pseudo-outcome Estimation

In the set of experiments presented in Figure 6.4 and Table 6.1, linear estimators for the AED and AER, respectively, are benchmarked out of sample. We notice consistent performance improvements across the two datasets and along a range of logits at which the response curve is estimated. For both AED and AER, the proposed pseudo-outcome estimators φ, ψr are benchmarked against two baselines, marked as the robust baseline and the naive baseline. The former uses the doubly robust ATE estimator ψATE in combination with the propensity nuisance in order to estimate the AED or AER. The naive estimate relies entirely on the outcome and propensity nuisances for its predictions out of sample. The number of regression covariates was drawn from a Poisson(16) distribution, and the outcome variability (the standard deviation of the outcomes on top of the treatment effect with unit variance) was drawn from an Exponential(8). We observe trends along outcome variability, as shown in Figure 6.4, but not along numbers of covariates.

[Figure: "Bandit Performance on PBMC Dataset," plotting cumulative return against bandit round for the oracle, the proposed estimator, the robust baseline, and the naive baseline.]
Figure 6.5. The bandit from §6.4 is evaluated against its oracle and baseline alternatives to show the benefit of our novel AER estimator in tackling such problems. The baselines do not achieve a positive return on investment, on average. Line widths are sample means ± standard errors across 256 trials.

6.5.3 The Encouraging Bandit

Finally, we evaluate the performance of a bandit in a realistic representation of the case study presented in §6.4.2. We show that the offline/online hybrid Thompson sampler of Algorithm 5 benefits from the robustness of the pseudo-outcome estimator (by contrasting with bandits using baseline estimator alternatives), and yields a significantly positive utility return even when the model of the encouragement function hθ(x, u) is homogeneous and linear, while the ground truth is heterogeneous. In Figure 6.5, we plot cumulative returns of the proposed bandit, the baselines, and an "oracle bandit" that has access to the exact encouragement responses.
We do not show the traditional metric of cumulative regret because we are benchmarking the novel estimation part, and not the bandit part, of Algorithm 5. Here, the outcome variability is ten times larger than the treatment-effect variability (equivalent to a value of 10 on the horizontal axis of Figure 6.4).

6.6 Discussion

Our results demonstrate the value of the AED (§6.2) and AER (§6.3) estimands as well as the reliability of their proposed estimators. First, Figure 6.3 highlights the need for CAED over CATE as priority scores when the intervention merely encourages treatment rather than controlling it perfectly. Figure 6.4 then shows the value of our pseudo-outcome formula for φ̂ (Equation 6.3) in estimating heterogeneous AEDs in comparison with simpler alternative approaches that yield higher empirical errors. Table 6.1 shows the same for AERs across key logit values for r. That table strongly suggests that the "robust baseline" using the ATE pseudo-outcome is rather unstable, as revealed through the PBMC dataset. We hypothesize that the ATE pseudo-outcome is vulnerable to high variance from the inverse-propensity weighting in ψATE, whereas our proposed ψr reduces that variance by multiplying by the propensity nudge (π(r) − π).

We also assessed the performance of the encouraging bandit that was proposed as a solution to our more elaborate case study of §6.4. The AER estimator enabled the encouraging bandit to seek and attain a positive "return" on its "investment" across individuals with heterogeneous effects, susceptibilities, and even logit responses to the interventions. The baseline bandits failed much more often. To further illustrate the point, we calculated 95%-confidence intervals on the probability of at least breaking even: proposed (0.59, 0.71), robust baseline (0.43, 0.56), naive baseline (0.44, 0.57). In words, the proposed bandit turned out to be consistently more profitable than the baselines, and than a coin flip.

Future work. This work marks another step in the direction of reliable causal inference from observational data. In the spirit of promoting robustness to possible violations of causal assumptions, we hope to incorporate a sensitivity analysis for hidden confounders on these novel estimands. Promising theoretical advancements have been made by Dorn and Guo [75], Dorn, Guo, and Kallus [76], and Chernozhukov, Cinelli, Newey, et al. [49]. Drawing from the symbolic connection to a causal derivative (for AED) or dose response (for AER) with respect to a latent logit, there could also be value in the recently proposed δMSM sensitivity model for continuous treatment variables [114], [214].

Conclusion

Motivated by an argument on maximal entropy and other sources of intuition, our novel causal estimands for encouragement effects are useful in cases where the putative encouragements have yet to be observed. The proposed robust estimators produce consistently lower errors than simpler alternatives. These findings are also contextualized to a case study on political campaigns, where a novel bandit algorithm is able to harness the robust estimator to deliver desirable outcomes most of the time. The diverse semi-synthetic causal settings further support our claims.

6.7 Appendix A: Derivations and Proofs

6.7.1 Proof of Proposition 13

This proposition is a variant of a generally well-known property of logistic regression [368]. For entropy maximization subject to constraints, we follow the method of Lagrange multipliers [29].
For the purpose of just this proof, consider the finite sample {xi} for i = 1, ..., n as well as a finite set of possible intervention values {vj} for j = 1, ..., m. Let ti,j ∈ {0, 1} be the treatment realization under covariate xi and intervention vj. For k ∈ {0, 1} and error margin ε > 0, we seek the maximal-entropy parametric form of the propensity function πk(xi, vj). Define the margin function ϵ : {0, 1} → (0, 1) as ϵ(t) = ε·(1 − t) + (1 − ε)·t to conveniently give the error bounds for a propensity function living in the probability simplex. In the objective L below, the first term corresponds to the entropy to be maximized, while the multipliers encode our constraints. The λi,k term imposes orthogonality, the ζi,k term imposes the error margin, and γi,j ensures normalization. We need not worry about non-negativity because the logarithm will require that implicitly.

L ≜ − Σ_{i,j,k} πk(xi, vj) log πk(xi, vj)
 + Σ_{i,k} λi,k Σ_j [πk(xi, vj) − I(Ti,j = k)] vj
 + Σ_{i,k} ζi,k [πk(xi, 0) − ϵ(I{Ti,j = k})]
 + Σ_{i,j} γi,j Σ_k [πk(xi, vj) − 1].

We simply differentiate pointwise with respect to the propensity, and equate it to zero:

∂L/∂πk(xi, vj) = − log πk(xi, vj) − 1 + λi,k vj + ζi,k + γi,j = 0 ⟺ πk(xi, vj) = exp{λi,k vj + ζi,k + γi,j − 1}.

We find that the degree of freedom in γi,j is taken by normalization:

πk(xi, vj) = exp{λi,k vj + ζi,k} / Σ_{k′} exp{λi,k′ vj + ζi,k′},

which reveals the softmax function. Since π1 yields the actual propensity function π, we may compute it as

π1(xi, vj) = σ( (λi,1 − λi,0) vj + (ζi,1 − ζi,0) ).

Dropping the subscripts, we encounter the general structure π(x, v) = σ(α(x)v + β(x)), for some α(x), β(x).

φ ≜ [(1 − π)T − π(1 − T)](Y − µ) + π(1 − π)∆µ, with the final term π(1 − π)∆µ marked as the "pseudo-nuisance."

Proposition 14 (Almost double robustness). The pseudo-outcome φ is doubly robust in terms of the quantities π and those marked by "pseudo-nuisance," rather than just µ.

6.7.2 Proof of Proposition 14

Here we shall demonstrate the double robustness of φ in terms of the quantities π and the "pseudo-nuisance" from

φ ≜ [(1 − π)T − π(1 − T)](Y − µ) + π(1 − π)∆µ, where the final term is the "pseudo-nuisance."

The way these two sub-proofs are approached is by plugging in nuisance estimates in one of the two components, and oracles in the other. We start with a correctly specified "pseudo-nuisance" for the relative simplicity that follows. Consider

E[ [(1 − π̂(X))T − π̂(X)(1 − T)](Y − µ(T, X)) + π(X)(1 − π(X))(µ(1, X) − µ(0, X)) | X ]
= E[ [(1 − π̂(X))T − π̂(X)(1 − T)] × 0 + π(X)(1 − π(X))(µ(1, X) − µ(0, X)) | X ]   (since E[Y − µ(T, X) | T, X] = 0)
= π(1 − π)(µ₁ − µ₀).

Now with the other direction, for correctly specified π:

E[ [(1 − π(X))T − π(X)(1 − T)](Y − µ̂(T, X)) + π(X)(1 − π(X))(µ̂(1, X) − µ̂(0, X)) | X ]
= E[ [(1 − π(X))T − π(X)(1 − T)](µ(T, X) − µ̂(T, X)) + π(X)(1 − π(X))(µ̂(1, X) − µ̂(0, X)) | X ]
= π(1 − π)[ (µ₁ − µ̂₁) − (µ₀ − µ̂₀) ] + π(1 − π)(µ̂₁ − µ̂₀)
= π(1 − π)(µ₁ − µ₀).

6.7.3 Derivation of Equation 6.4

We restate Equation 6.4:

E[φ̂ − φ | X] = −(π̂ − π)²(µ̂₁ − µ̂₀) + (π̂ − π)[ π₁(µ̂₁ − µ₁) + π₀(µ̂₀ − µ₀) − (π₁ − π₀)(µ̂₁ − µ̂₀) ].

This is the pseudo-outcome estimate in more explicit notation:

φ̂(Y, T, X) = [(1 − π̂(X))T − π̂(X)(1 − T)](Y − µ̂(T, X)) + π̂(X)(1 − π̂(X))(µ̂(1, X) − µ̂(0, X)).

Proceeding with iterated expectations in Y | T, X and then T | X,

E[φ̂ | T, X] = [(1 − π̂(X))T − π̂(X)(1 − T)](µ(T, X) − µ̂(T, X)) + π̂(X)(1 − π̂(X))(µ̂(1, X) − µ̂(0, X)),
E[φ̂ | X] = π(1 − π̂)(µ₁ − µ̂₁) − π̂(1 − π)(µ₀ − µ̂₀) + π̂(1 − π̂)(µ̂₁ − µ̂₀).
Given that E[φ | X] = π(1 − π)(µ₁ − µ₀), it follows that

E[φ̂ − φ | X] = π(1 − π̂)(µ₁ − µ̂₁) − π̂(1 − π)(µ₀ − µ̂₀) + π̂(1 − π̂)(µ̂₁ − µ̂₀) − π(1 − π)(µ₁ − µ₀)
= −(π̂ − π)πµ₁ − (π̂ − π)(1 − π)µ₀ + (π̂ − π)(1 − π̂)µ̂₁ + (π̂ − π)π̂µ̂₀
= (π̂ − π)[ π(µ̂₁ − µ₁) − πµ̂₁ + (1 − π)(µ̂₀ − µ₀) − (1 − π)µ̂₀ + (1 − π̂)µ̂₁ + π̂µ̂₀ ]
= (π̂ − π)[ π(µ̂₁ − µ₁) + (1 − π)(µ̂₀ − µ₀) − πµ̂₁ − (1 − π)µ̂₀ + (1 − π̂)µ̂₁ + π̂µ̂₀ ]
= (π̂ − π)[ π(µ̂₁ − µ₁) + (1 − π)(µ̂₀ − µ₀) − (π̂ − π + 2π − 1)(µ̂₁ − µ̂₀) ]
= (π̂ − π)[ π₁(µ̂₁ − µ₁) + π₀(µ̂₀ − µ₀) − (π̂ − π)(µ̂₁ − µ̂₀) − (π₁ − π₀)(µ̂₁ − µ̂₀) ].

The above equals Equation 6.4, concluding the derivation.

6.7.4 Derivation of Equation 6.6

We restate Equation 6.6:

E[ψ̂r − ψr | X] = [(π̂(r) − π̂) − (π(r) − π)](µ̂₁ − µ̂₀) − π₁( π̂(r)₁/π̂₁ − π(r)₁/π₁ )(µ̂₁ − µ₁) − π₀( π̂(r)₀/π̂₀ − π(r)₀/π₀ )(µ̂₀ − µ₀).

Again, we first state the pseudo-outcome estimate explicitly:

ψ̂r(Y, T, X) = (π̂(r)(X) − π̂(X))[ T/π̂(X) − (1 − T)/(1 − π̂(X)) ](Y − µ̂(T, X)) + (π̂(r)(X) − π̂(X))(µ̂(1, X) − µ̂(0, X)).

Proceeding, as in §6.7.3, with iterated expectations in Y | T, X and then T | X,

E[ψ̂r | T, X] = (π̂(r)(X) − π̂(X))[ T/π̂(X) − (1 − T)/(1 − π̂(X)) ](µ(T, X) − µ̂(T, X)) + (π̂(r)(X) − π̂(X))(µ̂(1, X) − µ̂(0, X)),
E[ψ̂r | X] = (π̂(r) − π̂)(π/π̂)(µ₁ − µ̂₁) − (π̂(r) − π̂)((1 − π)/(1 − π̂))(µ₀ − µ̂₀) + (π̂(r) − π̂)(µ̂₁ − µ̂₀).

Since E[ψr | X] = (π(r) − π)(µ₁ − µ₀),

E[ψ̂r − ψr | X] = (π̂(r) − π̂)(π/π̂)(µ₁ − µ̂₁) − (π̂(r) − π̂)((1 − π)/(1 − π̂))(µ₀ − µ̂₀) + (π̂(r) − π̂)(µ̂₁ − µ̂₀) − (π(r) − π)(µ₁ − µ₀)
= (π̂(r)/π̂ − 1)π(µ₁ − µ̂₁) + ((1 − π̂(r))/(1 − π̂) − 1)(1 − π)(µ₀ − µ̂₀) + (π̂(r) − π̂)(µ̂₁ − µ̂₀) − (π(r)/π − 1)πµ₁ − ((1 − π(r))/(1 − π) − 1)(1 − π)µ₀
= (π̂(r)/π̂ − π(r)/π)πµ₁ + ((1 − π̂(r))/(1 − π̂) − (1 − π(r))/(1 − π))(1 − π)µ₀ − (π̂(r)/π̂ − π(r)/π)πµ̂₁ − (π(r)/π − 1)πµ̂₁ + (π̂(r) − π̂)µ̂₁ − ((1 − π̂(r))/(1 − π̂) − (1 − π(r))/(1 − π))(1 − π)µ̂₀ − ((1 − π(r))/(1 − π) − 1)(1 − π)µ̂₀ − (π̂(r) − π̂)µ̂₀
= −(π̂(r)/π̂ − π(r)/π)π(µ̂₁ − µ₁) − ((1 − π̂(r))/(1 − π̂) − (1 − π(r))/(1 − π))(1 − π)(µ̂₀ − µ₀) + [(π̂(r) − π̂) − (π(r) − π)](µ̂₁ − µ̂₀).

Substituting the notation π ↦ π₁ and, more importantly, (1 − π) ↦ π₀ (and likewise for the hatted and nudged propensities), the above equals Equation 6.6, concluding the derivation.

6.8 Appendix B: Experimental Details

Please refer to the scripts included in the supplementary material for comprehensive details on all the experiments described in this chapter. The PBMC dataset has also been included, while the larger TCGA dataset needs to be downloaded from a past study's repository, https://github.com/ioanabica/SCIGAN. TCGA has 9,659 points with 4,000 features. PBMC, of which we used the principal components, has 14,039 points with 16 features that were always included in the set of covariates. For the benchmarks, train-test splits were generated using a 5-fold cross-validation on the entire dataset. For the bandit, we estimated effects on the whole dataset, and then ran the bandit with the same covariates but different treatment/outcome realizations.

The semi-synthetic causal setting. The general structure is based on random subsets of the variables recorded in the TCGA or PBMC datasets. These variables, standardized, were revealed directly as the covariates X. To induce model misspecification for the linear outcome regression and logistic propensity regression, which formed the nuisance estimates, the covariates X were passed through sine functions before generating the ground-truth links to treatment T and outcome Y. We denote the nonlinearly transformed covariates as X̃.
The heterogeneous treatment effect was modeled as a random projection of these features, X̃. Similarly, the outcome variability not due to treatment was modeled as another random projection of X̃. We also added separate noise, which was a random projection of the entire dataset (that is, from before the subsetting or the nonlinearity). The projected variables were re-standardized, and the outcome-variability component was scaled according to each experimental setting, while the others held unit variance.

Pseudo-outcome estimation. We always employed a 5-fold cross-validation split on the training set in order to estimate nuisances on the large partition and apply them on the small partition, giving us the pseudo-outcome instances on the latter. Eventually, we aggregated all the pseudo-outcomes and regressed them against covariates on the entire training set, in order to evaluate on the testing set.

6.9 Appendix C: Extended Discussion on Campaign Microtargeting

Historically, campaigning has been done in a rather homogeneous manner due to the difficulty in not only disentangling the various individual characteristics, but also understanding how each of those affects voting behavior. As such, political targeting was often confined to a certain geographic area that could be correlated with known important characteristics, e.g. wealth or education [240]. Traditionally, pollsters and campaigners simply used information from previous elections as predictors of future voting behavior. Importantly, they did not intend to make a causal claim but rather to predict the focal trait of support through correlation [240]. With evolving technology and statistical knowledge, campaigners have moved towards developing predictive models for individual-level scores of likelihoods towards voting behaviors and targeted interventions [240], though these models are often not causal, as they rely on standard regression techniques that are prone to significant omitted-variable bias and endogeneity. As a precursor to microtargeting, the 2012 Obama campaign conducted thousands of short surveys in battleground states every night to assess how each demographic might respond to different political messages [24], [240]. Campaigners have been able to cluster individuals on specific characteristics and make predictions of voting behavior [236], as well as use social media data to personally target political messages [250]. Given the stringency of privacy laws in Europe, these techniques have proven more effective in the United States, though obvious democratic dilemmas may arise [384].

Few studies have sought to quantify the effectiveness of political microtargeting. A randomized experimental design simulated a targeted campaign parallel to the 2020 US presidential election, in which respondents were randomly treated with a targeted advertisement, and the experimental data were used to train an online targeting algorithm that optimally allocated advertisements based on individual covariates [140]. The optimized targeting yielded a significant change in voting intention among independent voters, and most importantly showed that responses are likely to be significantly heterogeneous amongst a population. By contrast, a recent experimental study quantifying the effectiveness of microtargeting in comparison to other messaging strategies found that while microtargeting generates substantially higher outcomes, targeting on a single covariate is just as effective as multiple covariates, suggesting that heterogeneity was limited [318].
However, this study utilized simple least-squares regressions that may be prone to bias. Machine learning techniques such as logistic regression trees have further been applied to this question, in which multiple covariates were found to improve predictive accuracy of individual voting turnout from a microtargeting campaign in Ohio [290]. 142 Chapter 7 Conclusions This dissertation introduces a number of methodological contributions for robust observational causal inference using machine learning (Chapters 3–6) in cross-disciplinary applications like public health (Chapter 3), social media (Chapter 4), political campaigns (Chapter 6), and bioinformatics (special emphasis in Chapter 2). These versatile advancements to the state of the art in causal inference deserve further exploration. For instance, it might be fruitfully straightforward to study multivariate extensions of the δMSM from Chapter 3 for a causal sensitivity analysis on the localized (nudge) interventions from Chapter 5. The theoretical implications of these novel techniques need to be further investigated as well. I hope that these contributions bring us closer to the promise of machine learning for informing consequential decisions in society. Robustness to confounders is one important facet of a much larger set of problems that must be addressed for this purpose. Furthermore, the increased robustness demonstrated here still warrants just as much scrutiny from domain experts as before. Even when loosening an assumption of ignorability by replacing it with a reasonable sensitivity model, this reasonability must be assessed on a case-by-case basis. More robustness is usually favorable to less, though tradeoffs always exist: doubly robust estimation, which removes confounding bias as in Chapter 5, tends to exhibit increased variance that can sometimes be debilitating. The goal of this dissertation is to propose these methods and highlight their potential utility in those applications where existing solutions must commit to unacceptable tradeoffs. 143 Bibliography [1] O. Abdel-Rahman, “Ecog performance score 0 versus 1: Impact on efficacy and safety of first-line 5-fu-based chemotherapy among patients with metastatic colorectal cancer included in five randomized trials,” International journal of colorectal disease, vol. 34, no. 12, pp. 2143–2150, 2019. [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [3] H.-M. Ahn, J.-W. Yoo, S. Lee, H. J. Lee, H.-S. Lee, and D.-S. Lee, “Peroxiredoxin 5 promotes the epithelial-mesenchymal transition in colon cancer,” Biochemical and Biophysical Research Communications, vol. 487, no. 3, pp. 580–586, 2017. [4] P. Alves, N. Lévy, B. J. Stevenson, H. Bouzourene, G. Theiler, G. Bricard, S. Viatte, M. Ayyoub, H. Vuilleumier, J.-C. R. Givel, et al., “Identification of tumor-associated antigens by large-scale analysis of genes expressed in human colorectal cancer,” Cancer immunity, vol. 8, no. 1, 2008. [5] P. K. Andersen and R. D. Gill, “Cox’s regression model for counting processes: A large sample study,” The annals of statistics, pp. 1100–1120, 1982. [6] A. Andersson, “Mechanisms for log normal concentration distributions in the environment,” Scientific reports, vol. 11, no. 1, p. 16 418, 2021. [7] J. D. Angrist, G. W. Imbens, and D. B. 
Rubin, “Identification of causal effects using instrumental variables,” Journal of the American statistical Association, vol. 91, no. 434, pp. 444–455, 1996. [8] D. Arthur and S. Vassilvitskii, “K-means++: The advantages of careful seeding,” Stanford, Tech. Rep., 2006. [9] A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov, “Pitfalls of in-domain uncertainty estimation and ensembling in deep learning,” in International Conference on Learning Representations, 2020. [10] S. Athey, J. Tibshirani, and S. Wager, “Generalized random forests,” The Annals of Statistics, vol. 47, no. 2, pp. 1148–1178, 2019. [11] S. Athey and S. Wager, “Policy learning with observational data,” Econometrica, vol. 89, no. 1, pp. 133–161, 2021. 144 [12] P. A. Atroszko, “Is a high workload an unaccounted confounding factor in the relation between heavy coffee consumption and cardiovascular disease risk?” The American Journal of Clinical Nutrition, vol. 110, no. 5, pp. 1257–1258, 2019. [13] L. Atta and J. Fan, “Computational challenges and opportunities in spatially resolved transcriptomic data analysis,” Nature Communications, vol. 12, no. 1, pp. 1–5, 2021. [14] T. Bahadori, E. T. Tchetgen, and D. Heckerman, “End-to-end balancing for causal continuous treatment-effect estimation,” in International Conference on Machine Learning, PMLR, 2022, pp. 1313–1326. [15] K. Baksi, Z. Zhang, A. Banerjee, J. E. Serrano, L. E. Perez, L. Linares, A. Seijo, N. Sanchez, U. Katiyar, and D. K. Banerjee, “Silencing mannosylphospho dolichol synthase with shrna impacts differentiation of capillary endothelial cells,” The FASEB Journal, vol. 30, pp. 844–1, 2016. [16] A. Balke and J. Pearl, “Bounds on treatment effects from studies with imperfect compliance,” Journal of the American statistical Association, vol. 92, no. 439, pp. 1171–1176, 1997. [17] J. M. Balko and E. P. Black, “A gene expression predictor of response to egfr-targeted therapy stratifies progression-free survival to cetuximab in kras wild-type metastatic colorectal cancer,” BMC cancer, vol. 9, pp. 1–10, 2009. [18] R. Bellazzi, A. Codegoni, S. Gualandi, G. Nicora, and E. Vercesi, “The gene mover’s distance: Single-cell similarity via optimal transport,” arXiv preprint arXiv:2102.01218, 2021. [19] R. Benelli, D. Costa, L. Mastracci, F. Grillo, M. J. Olsen, P. Barboro, A. Poggi, and N. Ferrari, “Aspartate-β-hydroxylase: A promising target to limit the local invasiveness of colorectal cancer,” Cancers, vol. 12, no. 4, p. 971, 2020. [20] J. Bergenstråhle, L. Larsson, and J. Lundeberg, “Seamless integration of image and molecular analysis for spatial transcriptomics workflows,” BMC genomics, vol. 21, no. 1, pp. 1–7, 2020. [21] A. Berrington de González and D. R. Cox, “Interpretation of interaction: A review,” 2007. [22] M. Y. Bhat, H. S. Solanki, J. Advani, A. A. Khan, T. Keshava Prasad, H. Gowda, S. Thiyagarajan, and A. Chatterjee, “Comprehensive network map of interferon gamma signaling,” Journal of cell communication and signaling, vol. 12, pp. 745–751, 2018. [23] I. Bica, J. Jordon, and M. van der Schaar, “Estimating the effects of continuous-valued interventions using generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 16 434–16 445, 2020. [24] M. Blumenthal, “Polls, Forecasts, and Aggregators,” en, PS: Political Science & Politics, vol. 47, no. 02, pp. 297–300, Apr. 2014, issn: 1049-0965, 1537-5935. doi: 10.1017/S1049096514000055. (visited on 02/02/2024). [25] M. Bonvini and E. H. 
Kennedy, “Fast convergence rates for dose-response estimation,” arXiv preprint arXiv:2207.11825, 2022. 145 [26] L. Borland, “Microscopic dynamics of the nonlinear fokker-planck equation: A phenomenological model,” Physical Review E, vol. 57, no. 6, pp. 6634–6642, 1998. [27] A. Bostrom, G. Böhm, and R. E. O’Connor, “Targeting and tailoring climate change communications,” Wiley Interdisciplinary Reviews: Climate Change, vol. 4, no. 5, pp. 447–455, 2013. [28] N. Botzer, S. Gu, and T. Weninger, “Analysis of moral judgment on reddit,” IEEE Transactions on Computational Social Systems, 2022. [29] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004. [30] P. Bromiley, “Products and convolutions of gaussian probability density functions,” Tina-Vision Memo, vol. 3, no. 4, p. 1, 2003. [31] I. D. Bross, “Spurious effects from an extraneous variable,” Journal of chronic diseases, vol. 19, no. 6, pp. 637–647, 1966. [32] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020. [33] J. Y. Buikhuisen, A. Torang, and J. P. Medema, “Exploring and modelling colon cancer inter-tumour heterogeneity: Opportunities and challenges,” Oncogenesis, vol. 9, no. 7, p. 66, 2020. [34] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, “Integrating single-cell transcriptomic data across different conditions, technologies, and species,” Nature biotechnology, vol. 36, no. 5, pp. 411–420, 2018. [35] C. Bycroft, C. Freeman, D. Petkova, G. Band, L. T. Elliott, K. Sharp, A. Motyer, D. Vukcevic, O. Delaneau, J. O’Connell, et al., “The uk biobank resource with deep phenotyping and genomic data,” Nature, vol. 562, no. 7726, pp. 203–209, 2018. [36] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 8, pp. 1548–1560, 2010. [37] E. J. Calabrese and L. A. Baldwin, “U-shaped dose-responses in biology, toxicology, and public health,” Annual Review of Public Health, vol. 22, no. 1, pp. 15–33, 2001, PMID: 11274508. doi: 10.1146/annurev.publhealth.22.1.15. [38] J. Camunas-Soler, X.-Q. Dai, Y. Hang, A. Bautista, J. Lyon, K. Suzuki, S. K. Kim, S. R. Quake, and P. E. MacDonald, “Patch-seq links single-cell transcriptomes to human islet dysfunction in diabetes,” Cell metabolism, vol. 31, no. 5, pp. 1017–1031, 2020. [39] N. Cao, Y. Yu, H. Zhu, M. Chen, P. Chen, M. Zhuo, Y. Mao, L. Li, Q. Zhao, M. Wu, et al., “Setdb1 promotes the progression of colorectal cancer via epigenetically silencing p21 expression,” Cell Death & Disease, vol. 11, no. 5, p. 351, 2020. 146 [40] K. M. Carter, R. Raich, and A. O. Hero III, “On local intrinsic dimension estimation and its applications,” IEEE Transactions on Signal Processing, vol. 58, no. 2, pp. 650–663, 2010. [41] Y. Chandak, S. Shankar, V. Syrgkanis, and E. Brunskill, “Adaptive instrument design for indirect experiments,” in The Twelfth International Conference on Learning Representations, 2024. [42] T. Chari, J. Banerjee, and L. Pachter, “The specious art of single-cell genomics,” BioRxiv, 2021. [43] H. Chen, X. Sun, W. Ge, Y. Qian, R. Bai, and S. Zheng, “A seven-gene signature predicts overall survival of patients with colorectal cancer,” Oncotarget, vol. 8, no. 56, p. 95 054, 2017. [44] M. B. Chen, X. Jiang, S. R. Quake, and T. C. 
Südhof, “Persistent transcriptional programmes are associated with remote memory,” Nature, vol. 587, no. 7834, pp. 437–442, 2020. [45] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. [46] Y.-L. Chen, L. Minorics, and D. Janzing, “Correcting confounding via random selection of background variables,” arXiv preprint arXiv:2202.02150, 2022. [47] V. Chernozhukov, C. Hansen, N. Kallus, M. Spindler, and V. Syrgkanis, Applied Causal Inference Powered by ML and AI. 2024. [48] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins, “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, vol. 21, no. 1, pp. C1–C68, Jan. 2018. [49] V. Chernozhukov, C. Cinelli, W. Newey, A. Sharma, and V. Syrgkanis, “Long story short: Omitted variable bias in causal machine learning,” National Bureau of Economic Research, Tech. Rep., 2022. [50] V. Chernozhukov, J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins, “Locally robust semiparametric estimation,” Econometrica, vol. 90, no. 4, pp. 1501–1535, 2022. [51] V. Chernozhukov, W. K. Newey, and R. Singh, “Debiased machine learning of global and local parameters using regularized riesz representers,” The Econometrics Journal, vol. 25, no. 3, pp. 576–601, 2022. [52] V. Chernozhukov, W. K. Newey, and R. Singh, “A simple and general debiased machine learning theorem with finite-sample guarantees,” Biometrika, vol. 110, no. 1, pp. 257–264, 2023. [53] V. Chernozhukov, K. Wüthrich, and Y. Zhu, “Distributional conformal prediction,” Proceedings of the National Academy of Sciences, vol. 118, no. 48, e2107794118, 2021. [54] J. Chi, H. Zhang, J. Hu, Y. Song, J. Li, L. Wang, and Z. Wang, “Agr3 promotes the stemness of colorectal cancer via modulating wnt/β-catenin signalling,” Cellular signalling, vol. 65, p. 109 419, 2020. [55] K. Choi, C. Meng, Y. Song, and S. Ermon, “Density ratio estimation via infinitesimal classification,” in International Conference on Artificial Intelligence and Statistics, PMLR, 2022, pp. 2552–2573. 147 [56] R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, pp. 5–30, 2006. [57] K. Colangelo and Y.-Y. Lee, “Double debiased machine learning nonparametric inference with continuous treatments,” arXiv preprint arXiv:2004.03036, 2020. [58] C. S. Colantoni, T. J. Levesque, and P. C. Ordeshook, “Campaign resource allocations under the electoral college,” American Political Science Review, vol. 69, no. 1, pp. 141–154, 1975. [59] F. Conciatori, L. Ciuffreda, C. Bazzichetto, I. Falcone, S. Pilotto, E. Bria, F. Cognetti, and M. Milella, “Mtor cross-talk in cancer and potential for combination therapy,” Cancers, vol. 10, no. 1, p. 23, 2018. [60] T. T. S. Consortium*, R. C. Jones, J. Karkanias, M. A. Krasnow, A. O. Pisco, S. R. Quake, J. Salzman, N. Yosef, B. Bulthaup, P. Brown, et al., “The tabula sapiens: A multiple-organ, single-cell transcriptomic atlas of humans,” Science, vol. 376, no. 6594, eabl4896, 2022. [61] J. Cornfield, W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder, “Smoking and lung cancer: Recent evidence and a discussion of some questions,” Journal of the National Cancer institute, vol. 22, no. 1, pp. 173–203, 1959. [62] D. R. Cox, Planning of Experiments. New York, Wiley, 1958. [63] D. R. 
Cox, “Regression models and life-tables,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 34, no. 2, pp. 187–220, 1972, issn: 00359246. [Online]. Available: http://www.jstor.org/stable/2985181 (visited on 10/10/2022). [64] I. Cristali and V. Veitch, “Using embeddings for causal estimation of peer influence in social networks,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [65] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. [66] R. Cruz-Duarte, C. Rebelo de Almeida, M. Negrão, A. Fernandes, P. Borralho, D. Sobral, L. M. Gallego-Paez, D. Machado, J. Gramaça, J. Vílchez, et al., “Predictive and therapeutic implications of a novel plcγ1/shp2-driven mechanism of cetuximab resistance in metastatic colorectal cancerplcγ1: Clinical implications for cetuximab therapy,” Clinical Cancer Research, OF1–OF14, 2022. [67] A. Curth, D. Svensson, J. Weatherall, and M. van der Schaar, “Really doing great at estimating cate? a critical look at ml benchmarking practices in treatment effect estimation,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021. [68] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013. [69] I. J. Dahabreh, R. Hayward, and D. M. Kent, “Using group data to treat individuals: Understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence,” International journal of epidemiology, vol. 45, no. 6, pp. 2184–2193, 2016. 148 [70] Q. Dang, Z. Liu, S. Hu, Z. Chen, L. Meng, J. Hu, G. Wang, W. Yuan, X. Han, L. Li, et al., “Derivation and clinical validation of a redox-driven prognostic signature for colorectal cancer,” Frontiers in Oncology, vol. 11, p. 743 703, 2021. [71] G. Dantzig, Linear programming and extensions. Princeton university press, 1963. [72] M. Demirer, V. Syrgkanis, G. Lewis, and V. Chernozhukov, “Semi-parametric efficient policy learning with continuous actions,” Advances in Neural Information Processing Systems, vol. 32, 2019. [73] M. A. Detry and R. J. Lewis, “The intention-to-treat principle: How to assess the true effect of choosing a medical treatment,” Jama, vol. 312, no. 1, pp. 85–86, 2014. [74] S. Dong and B. Van Roy, “An information-theoretic analysis for thompson sampling with many actions,” Advances in Neural Information Processing Systems, vol. 31, 2018. [75] J. Dorn and K. Guo, “Sharp sensitivity analysis for inverse propensity weighting via quantile balancing,” Journal of the American Statistical Association, pp. 1–13, 2022. [76] J. Dorn, K. Guo, and N. Kallus, “Doubly-valid/doubly-sharp sensitivity analysis for causal inference with unmeasured confounding,” Journal of the American Statistical Association, no. just-accepted, pp. 1–23, 2024. [77] A. Downs, “An economic theory of political action in a democracy,” Journal of political economy, vol. 65, no. 2, pp. 135–150, 1957. [78] M. Drusch, U. Del Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon, B. Hoersch, C. Isola, P. Laberinti, P. Martimort, et al., “Sentinel-2: Esa’s optical high-resolution mission for gmes operational services,” Remote sensing of Environment, vol. 120, pp. 25–36, 2012. [79] A. Duò, M. D. Robinson, and C. 
Soneson, “A systematic performance evaluation of clustering methods for single-cell rna-seq data,” F1000Research, vol. 7, p. 1141, 2018. doi: 10.12688/f1000research.15666.3. [80] T. D. Duong, Q. Li, and G. Xu, “Stochastic intervention for causal effect estimation,” in 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, 2021, pp. 1–8. [81] A. F. Duque, S. Morin, G. Wolf, and K. R. Moon, “Geometry regularized autoencoders,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [82] V. Dwaracherla, S. M. Asghari, B. Hao, and B. V. Roy, Efficient exploration for llms, 2024. arXiv: 2402.00396 [cs.LG]. [83] J. J. Dziak, D. L. Coffman, S. T. Lanza, R. Li, and L. S. Jermiin, “Sensitivity and specificity of information criteria,” Briefings in bioinformatics, vol. 21, no. 2, pp. 553–565, 2020. [84] F. Eberhardt and R. Scheines, “Interventions and causal inference,” Philosophy of science, vol. 74, no. 5, pp. 981–995, 2007. 149 [85] B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self-attention mechanisms,” in International Conference on Machine Learning, PMLR, 2022, pp. 5793–5831. [86] B. Efron and D. V. Hinkley, “Assessing the accuracy of the maximum likelihood estimator: Observed versus expected fisher information,” Biometrika, vol. 65, no. 3, pp. 457–483, 1978. [87] A. Erasmus, B. Holman, and J. P. Ioannidis, “Data-dredging bias,” BMJ Evidence-Based Medicine, vol. 27, no. 4, pp. 209–211, 2022. [88] B. Ewert, “Moving beyond the obsession with nudging individual behaviour: Towards a broader understanding of behavioural public policy,” Public Policy and Administration, vol. 35, no. 3, pp. 337–360, 2020. [89] R. Fang, C. Xia, J. L. Close, M. Zhang, J. He, Z. Huang, A. R. Halpern, B. Long, J. A. Miller, E. S. Lein, and X. Zhuang, “Conservation and divergence of cortical cell organization in human and mouse revealed by merfish,” Science, vol. 377, no. 6601, pp. 56–62, 2022. doi: 10.1126/science.abm1741. eprint: https://www.science.org/doi/pdf/10.1126/science.abm1741. [90] S. Fang, B. Chen, Y. Zhang, H. Sun, L. Liu, S. Liu, Y. Li, and X. Xu, “Computational approaches and challenges in spatial transcriptomics,” Genomics, Proteomics & Bioinformatics, 2022. [91] A. Fantozzi, D. C. Gruber, L. Pisarsky, C. Heck, A. Kunita, M. Yilmaz, N. Meyer-Schaller, K. Cornille, U. Hopfer, M. Bentires-Alj, et al., “Vegf-mediated angiogenesis links emt-induced cancer stemness to tumor initiationemt, stemness, and angiogenesis,” Cancer research, vol. 74, no. 5, pp. 1566–1575, 2014. [92] E. Ferrara, “Should chatgpt be biased? challenges and risks of bias in large language models,” arXiv preprint arXiv:2304.03738, 2023. [93] J. S. Fleck, S. M. J. Jansen, D. Wollny, F. Zenk, M. Seimiya, A. Jain, R. Okamoto, M. Santel, Z. He, J. G. Camp, et al., “Inferring and perturbing cell fate regulomes in human brain organoids,” Nature, pp. 1–8, 2022. [94] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss landscape perspective,” arXiv preprint arXiv:1912.02757, 2019. [95] C. Furusawa, T. Suzuki, A. Kashiwagi, T. Yomo, and K. Kaneko, “Ubiquity of log-normal distributions in intra-cellular reaction dynamics,” Biophysics, vol. 1, pp. 25–31, 2005. [96] A. C. Gadotti, M. de Castro Deus, J. P. Telles, R. Wind, M. Goes, R. G. C. Ossoski, A. M. de Padua, L. de Noronha, A. Moreno-Amaral, C. P. Baena, et al., “Ifn-γ is an independent risk factor associated with mortality in patients with moderate and severe covid-19 infection,” Virus research, vol. 289, p. 
198 171, 2020. [97] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” Advances in neural information processing systems, vol. 29, 2016. 150 [98] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,” arXiv preprint arXiv:2309.00770, 2023. [99] C. M. Gay, C. A. Stewart, E. M. Park, L. Diao, S. M. Groves, S. Heeke, B. Y. Nabet, J. Fujimoto, L. M. Solis, W. Lu, et al., “Patterns of transcription factor programs and immune pathway activation define four major subtypes of sclc with distinct therapeutic vulnerabilities,” Cancer cell, vol. 39, no. 3, pp. 346–360, 2021. [100] A. Gelman, “Prior distributions for variance parameters in hierarchical models (comment on an article by browne and draper),” Bayesian Analysis, vol. 1, pp. 515–533, 2006. [101] E. Gharib, P. Nasrinasrabadi, and M. R. Zali, “Development and validation of a lipogenic genes panel for diagnosis and recurrence of colorectal cancer,” PLoS One, vol. 15, no. 3, e0229864, 2020. [102] A. Ghassami, N. Sani, Y. Xu, and I. Shpitser, “Multiply robust causal mediation analysis with continuous treatments,” arXiv preprint arXiv:2105.09254, 2021. [103] P. W. Glynn et al., “Importance sampling for monte carlo estimation of quantiles,” in Mathematical Methods in Stochastic Simulation and Experimental Design: Proceedings of the 2nd St. Petersburg Workshop on Simulation, Citeseer, 1996, pp. 180–185. [104] J. Godos, M. Tieri, F. Ghelfi, L. Titta, S. Marventano, A. Lafranconi, A. Gambera, E. Alonzo, S. Sciacca, S. Buscemi, et al., “Dairy foods and health: An umbrella review of observational studies,” International Journal of Food Sciences and Nutrition, vol. 71, no. 2, pp. 138–151, 2020. [105] W. W. B. Goh, W. Wang, and L. Wong, “Why batch effects matter in omics data, and how to avoid them,” Trends in biotechnology, vol. 35, no. 6, pp. 498–507, 2017. [106] S. Greenland, “The logic and philosophy of causal inference: A statistical perspective,” in Philosophy of Statistics, Elsevier, 2011, pp. 813–830. [107] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. 25, pp. 723–773, 2012. [Online]. Available: http://jmlr.org/papers/v13/gretton12a.html. [108] M. Grootendorst, “Bertopic: Neural topic modeling with a class-based tf-idf procedure,” arXiv preprint arXiv:2203.05794, 2022. [109] Y. Gu, A. M. Brickman, Y. Stern, C. G. Habeck, Q. R. Razlighi, J. A. Luchsinger, J. J. Manly, N. Schupf, R. Mayeux, and N. Scarmeas, “Mediterranean diet and brain structure in a multiethnic elderly cohort,” Neurology, vol. 85, no. 20, pp. 1744–1751, 2015. [110] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning, PMLR, 2017, pp. 1321–1330. [111] W. Guo, M. Yin, Y. Wang, and M. Jordan, “Partial identification with noisy covariates: A robust optimization approach,” in Conference on Causal Learning and Reasoning, PMLR, 2022, pp. 318–335. 151 [112] Z. Guo, X. Zhang, H. Zhu, N. Zhong, X. Luo, Y. Zhang, F. Tu, J. Zhong, X. Wang, J. He, et al., “Telo2 induced progression of colorectal cancer by binding with rictor through mtorc2,” Oncology reports, vol. 45, no. 2, pp. 523–534, 2021. [113] A. Haddad, D. Kushnir, and R. R. Coifman, “Texture separation via a reference set,” Applied and Computational Harmonic Analysis, vol. 35, pp. 
Abstract
The rise of artificial intelligence and deep learning has led to unprecedented capabilities in prediction. As these black-box algorithms are deployed in different parts of society, it is becoming increasingly clear that predictions alone do not always translate to enabling effective decisions, policies, or reliable forecasts in a changing world. What is often needed is a stronger understanding of a system than a predictive model of observations can offer. This deficit arises when attempting to predict the system’s behavior in novel situations. Causal inference refers to the field of identifying cause-and-effect structures from data. Much work in causal inference is concerned with finding the minimal assumptions required to answer specific causal questions, like estimating the effect of a certain treatment. The more reasonable and relaxed the assumptions of a causal-inference method, the more applicable it is to diverse datasets and machine learning. Of fundamental significance is having workable representations that can be learned from data. Closely related to the quality of the representations is the ability to make downstream causal estimates robust to confounding. In this dissertation, I first introduce an improved representation of single-cell RNA sequencing data for inference tasks in medicine and biology. More broadly, I make numerous contributions towards increased robustness of machine learning to hidden or observed confounding. I address sensitivity of dose-response curves to hidden confounding; prediction of interventional outcomes under hidden confounding; robust effect estimation for continuous-valued and multivariate interventions; and estimation for interventions that might only encourage treatment uptake depending on susceptibility.