Statistical Methods for Causal Inference and Densely Dependent Random Sums

by

Shane Sparkes

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

May 2024

Copyright 2024 Shane Sparkes

Dedication

"Behold, everything have We created in due measure and proportion" (Ayah al-Qamar 54:49)

The worlds of experience are indefatigable. Yet, their number and dimension are eclipsed by the even larger infinities of all possible worlds, which lurk beyond the veil of human understanding. Only God knows. What is wrong in this manuscript is mine. What is great is His.

When I started this journey, I had my grandmother and my father-in-law. Now, I do not. This work is thus dedicated to Jean Williams and Akram Ali. Love is—at least in part—identified by the grace and safety bestowed upon those who do you insufficient justice. Like many young people, I was myopic and selfish. Nevertheless, my grandmother loved me. What is wrong in me is mine. What goodness I achieve is hers.

Akram Ali lived for his wife and children. He privileged the mere possibility of a better life for them over the image of his personal aspirations. He was warm, and funny, and kind, and quick-witted, and brilliant. Where I fail as a father, that is mine. Where I succeed, I remember him.

I also dedicate this work to my mother and father, and to my wife, Adiba. They are quite tired of my poetry, however, so I say nothing more. One cannot thank the earth anyway.

To my sons, Amir and Saeed: I love you more than the sun, and the moon, and the stars. I will always love you. And then love you some more. And then even some more. I started this journey for me. However, I completed it for you.

Acknowledgements

I am ineffably grateful for my co-advisors: Dr. Lu Zhang and Dr. Thomas Valente. A graduate program is a crucible and my experience was no different. I came into the program with strong convictions about model building—convictions that I depart with as well. However, like all students, the strength of my convictions was perhaps matched only by the paucity of my experience and the loudness of my academic naivete. I thank Dr. Valente for his patient guidance, for keeping his foot in the door for me, and for his editorial acumen, which helped me chip away at some of my worst continental habits. I also thank him for imparting his wealth of social network knowledge.

My gratitude for Dr. Zhang is immeasurable. Her dedication to helping me strengthen the arguments of this dissertation was incomparable. Although I arrived at the program with some basic mathematical training, my experience rested mostly in social work and philosophy. Evidence of this surfaced in my use of notation at times and, ironically, in my under-appreciation for the role that narrative plays in the presentation of results. She was able to see past these defects, thankfully, and, under her tutelage, my arguments became markedly more rigorous. Without the copious hours of her meticulous proof-checking and diligent advisement, the ideas that constitute these pages would have taken the form of a much weaker science. She was truly indispensable.

I am also greatly indebted to Dr. Erika Garcia and Dr. Paul Marjoram. Dr. Garcia is a gifted professor. Her causal inference class, which I took quite late in the game, provided me with the concepts and tools that I needed to unify my projects, frame them, and corroborate my convictions.
I also had the privilege of working with her on one of my dissertation projects, which partially grew from her course and instruction. This dissertation would not exist if not for her involvement. To Dr. Marjoram, I say this: thank you for your time and shepherding. The first years of my program were characterized by a maelstrom of impoverished ideas and half-baked proofs. I think for every sensible thing that I have since written, I budgeted Dr. Marjoram ten, or perhaps even twenty that were closer in disposition to a statistical penny dreadful than to a sound argument. Nevertheless, he always remained patient and encouraging, like any great teacher should. Thankfully, upon finishing a proof, I have since learned to wait at least a week before sharing. The imagination and force of will are often stronger than the properties of the real numbers, at least for me. Excitement is indeed a necessary, but perilous friend to the budding statistician. Frankly, I am surprised that Dr. Marjoram did not seek legal recompense on behalf of the injured party, Reason.

Importantly, I also owe much to Dr. Sandrah Eckel, Dr. Kim Siegmund, and Dr. Richard Watanabe. At some difficult moments, they offered essential support and counsel. Like Dr. Valente, they chose to open a door when they could have just as easily closed it. To me, this connotes integrity. They are valuable resources for graduate students. I am also grateful to Dr. Nicholas Mancuso in this regard. Finally, I am grateful for the time and effort that my remaining committee members, Dr. Trevor Pickering and Dr. Abigail Horn, dedicated to this process. Their insights and suggestions also encouraged my growth and only improved the contents of this document.

Dr. Kiros Berhane also deserves a special mention. Although we did not work together, his influence on my research direction is worth noting. We held a meeting when I was searching for a mentor early in the program. I presented on some initial findings of mine that were related to empirical likelihood theory. He was affable and academically assiduous. "This is all for independent observations?" he said. I paused, having noted his warm, but unimpressed affect. "Yes," I replied. "Ah. Okay," he said. I started to think about this exchange a lot. Not many phenomena in life are truly independent, or even only locally dependent. The fact that a non-trivial proportion of modern statistical theory is predicated upon these ideas, then, is quite the farce.

I cannot forget my friends and family. This quest of mine started long ago, when I could hear well and the hours of my days were invested more in social activism than in the dry study of number and cacophony. Although non-exhaustive as a list, the following people have impacted my life and development in ways that cannot be captured with any net of mere mortal words: Efren Lopez, Kristhy Morales, Anthony Fallon, and Justin Loeser. I am also thankful for Sara J. Earley and Jose and Mary Esqueda. To my mother, I could not have done this without you. You have always been my paragon. To my father, thank you for your love and companionship. To Atia Ali, who is also my mother, thank you also for your love, and for the endless depth of your support. And to my wife Adiba, who held up the sky as I locked myself in a sepulcher: you are beautiful, you are courageous, you are strong, you are a Tyrian purple amidst the grey, you are a leader of both large and pint-sized men, you are my heart.
Contents

Dedication
Acknowledgements
Abstract
Chapter 1: Introduction
Chapter 2: Properties and Deviations of Random Sums of Densely Dependent Random Variables
  Abstract
  2.1 Introduction
  2.2 A Novel Variance-Covariance Identity for Additive Statistics
  2.3 Asymptotic Properties of Additive Statistics
    2.3.1 Some Consideration for Clustered Statistics
      2.3.1.1 A Problem of Choice
  2.4 Finite Sample Inference with Dense Dependence
    2.4.1 U Random Variables
    2.4.2 Finite Sample Inference for U Random Variables
    2.4.3 Main Results
  2.5 A Quick Extension to Estimating Equations
  2.6 Simulations and a Data Application
    2.6.1 Carbon Dioxide and Global Warming
  2.7 Concluding remarks
  References
  Appendix
    2.A Introduction
    2.B U Random Variables
      2.B.1 Decomposition of Continuous Uniform Random Variables
      2.B.2 U Variables and Regression
      2.B.3 Plug-ins and Model Diagnostics
    References
Chapter 3: The Functional Average Treatment Effect
  Abstract
  3.1 Introduction
  3.2 Functional Average Treatment Effect
    3.2.1 Examples of Applicability
      3.2.1.1 Example 3.1
      3.2.1.2 Example 3.2
    3.2.2 Identifying and Estimating Counterfactual Functional Averages
    3.2.3 The Problem of Estimation
      3.2.3.1 Discrete Estimators of Av~(Y_δ)
      3.2.3.2 Continuous Outcomes
      3.2.3.3 The Hoeffding Bootstrap
  3.3 U Random Variables and Counterfactual Linear Regression
    3.3.1 Counterfactual Linear Regression
  3.4 Monte Carlo Simulations
    3.4.1 Univariate Functional Average Estimation
    3.4.2 Causal Inference with Functional Averages
  3.5 A Data Application
  3.6 Conclusion
  References
  Appendix
    3.A Convergence of Random Sums
    3.B Hoeffding's Bootstrap
      3.B.1 Simulations
        3.B.1.1 Uniform Experiment
        3.B.1.2 Non-smooth Experiment
        3.B.1.3 Arithmetic Mean Experiment
    References
Chapter 4: Extending Generalized Linear Models to Social Network Analysis
  Abstract
  4.1 Introduction
  4.2 GLMs for Relational Inference
    4.2.1 The Problem of Dependence
      4.2.1.1 A Word on Limitations
  4.3 Main Approaches
    4.3.1 Applying the Variance Identity to GLMs and GEEs
    4.3.2 A Measure of Test Robustness
    4.3.3 A Bootstrapping Strategy
  4.4 An Application
    4.4.1 Methods
    4.4.2 Results
  4.5 Conclusion
  References
Chapter 5: Discussion

List of Tables

2.1 Beta(10,10) Simulations
2.2 Beta(α,α) Simulations: n = 500, φ_n = 0.1
2.3 Regression Simulations
3.1 Basic Properties of U Variables
3.2 Functional Average Estimation, Continuous
3.3 Functional Average Estimation, Discrete
3.4 Continuous Effect Estimators: ∆ = 10
3.5 Discrete Effect Estimators: ∆ = 5
3.6 Linear Regression for Functional Averages
3.7 Summary of Model Results
3.B.1 Hoeffding for Sample Minimum
3.B.2 Hoeffding for |Ȳ|
3.B.3 Hoeffding for Ȳ
3.B.4 Bootstrap Normal Approximation
4.1 Friendship Identification

List of Figures

2.1 Model Diagnostics for CO2 Weighted Residuals
2.2 Model Diagnostics for CO2 Residuals
3.1 Model Validation Plots
Abstract

Statistical modeling often relies on two pivotal conditions: non-informative sampling and weak, localized probabilistic dependencies between observations, insofar as they are believed to exist. For causal inference in observational settings, the strong ignorability of treatment exposure is also typically required. The problem with these assumptions is that they are not fulfilled in a plethora of modern observational contexts. This, in turn, damages the cogency of scientific arguments that require them. The methods investigated in this dissertation do not require all of these assumptions. After providing an overview of its contents in Chapter 1, this dissertation establishes a mathematical identity for the variance of an additive statistic that removes the strict need for a detailed dependency model in Chapter 2. Following this, some basic properties of additive statistics are explored under very general conditions of statistical dependence. An important and common type of random variable is also introduced—the U random variable—before a novel concentration inequality for sums of dependent random variables of this class is proven. Ultimately, this inequality can enable the construction of confidence sets for sums of mutually dependent, but weakly correlated random variables of this class. Chapter 3 introduces a new type of causal effect that is related to the average value of a function. This effect measures changes in the extremes of the support of an outcome variable after exposure to treatment in many contexts. Here, it is established that functional average treatment effects are identifiable under conditions that are much milder than those required for expected treatment effects. They do not require non-informative sampling or the measurement of a sufficient set of confounding variables to be identified and estimated. Functional average estimators, such as the arithmetic mean of unique values in a sample and the mid-range, are also explored, and a novel bootstrapping process, the Hoeffding bootstrap, is introduced as a strategy for uncertainty quantification. Pertinently, this chapter also uses U concepts to prove that well-specified regression models, under the semantic framework of a causal theory and the conditions of this chapter, target functional average causal effects under modeling assumptions that are weaker than those that are often used in association studies.
Chapter 4 uses the variance identity of Chapter 2 and the Hoeffding bootstrap of Chapter 3 to extend generalized estimating equation approaches to regressions in social network contexts. It corroborates generalized linear models in particular as a cogent approach for modeling the marginal association between graph-related random variables or statistics and features of interest when the structure or dynamics of the graph as a whole are unknown. This dissertation is concluded in Chapter 5, which offers a summary of main results, further discussion, and some commentary on future research directions.

Chapter 1

Introduction

Classical statistical inference proceeds as follows. We have a set of N ∈ ℕ non-stochastic objects, P = {o_1, o_2, o_3, ..., o_N}, that we wish to observe for purposes of reason. Since observing them all is not feasible, we abide by a rule of chance to select a subset of them. Call this subset of objects ζ ⊂ P. When the properties of the rule of chance are well-behaved, the statistics produced from the elements of ζ approximate mathematical objects of interest that are defined on P. This game is a basis for scientific inference. Pertinently, P can also be treated as a sample of random variables from a larger population. In either case, two things are often held as axiomatic: one, that the sampling mechanism is both known and non-informative, i.e., that it is not associated in any way with the objects being collected, and two, that the outcome of each individual selection is independent of the outcome of others. When these statements are violated, they are typically replaced by marginally less restrictive versions: one, that the sampling mechanism is known and non-informative, provided certain forces are held constant, or that the sampling bias is removable through the employ of a model, and two, that sampling outcomes are related in a restricted, but predictable manner in accordance with a tractable theory, which is also known. These replacements characterize contemporary approaches to statistical inference.

However, although useful and fecund in some settings, contemporary approaches still suffer under one inexorable fact: we probably know very little. In all likelihood, the best of our schematics is to nature as a toddler's drawing is to Leonardo da Vinci's Vitruvian Man. To quote Immanuel Kant:

And we indeed, rightly considering objects of sense as mere appearances, confess thereby that they are based upon a thing in itself, though we know not this thing as it is in itself, but only know its appearances, viz., the way in which our senses are affected by this unknown something. [1]

This point also extends to causal inference, which rests upon a host of additional statistical assumptions that are also exceptionally non-trivial. Further discussion on this matter is reserved for Chapter 3. Although it is now commonplace to accept that 'all models are wrong,' scant attention is often paid to many of the logical consequences of this statement. At its essence, a statistical model is a set of assumptions that empower an ampliative inference from a set of empirical observations to a set of conclusions. However, an ampliative argument is cogent if and only if its premises are at least approximately true. As a consequence, the assumptions that we choose to characterize our models invariably affect the quality and defensibility of our scientific statements. In short, it is not enough to choose a 'useful' statistical model.

[1] Kant, Immanuel. 1977. Prolegomena to Any Future Metaphysics (J. Ellington, Trans.; p. 53). Hackett Publishing Company, Inc. (Original work published in 1783)
Stopping at this point is at best pragmatic and at worst postmodern. We must choose one that seems approximately true with respect to the objects of interest related to P, or which, in the minimum, possesses a strong appearance of at least yielding true probabilistic statements. These are the necessary conditions that constitute the meaning of the word 'useful,' if we are to employ this descriptor.

A stool provides an imperfect, but adequate metaphor. The seat of the stool rests on legs, which, taken together, must hold the weight of the person who sits on it. Simply put, the legs are the premises of the statistical model and the seat represents the set of scientific conclusions that they support. One, two, or even all legs of the stool can be weak, i.e., they can be approximately true or even wrong in some circumstances. What is important is that they do not break when the weight of the world pushes down upon the seat. Exact premises, which produce exact models, will—more often than not—provide legs that are both false and weak. When the world sits, the stool will break. General premises, however, which cover a higher number of possible worlds, have a higher chance of holding strong, even when they are false. While it is true that no stool with false legs is cogent, such a stool can still provide true statements. These statements, in turn, can become the premises of future scientific investigations. They can pave the road to scientific cogency and knowledge.

These facts motivate and contextualize this dissertation. Its purpose is to provide novel statistical methods that can withstand more weight, even if the gain is marginal. In short, this document does not rely on non-informative sampling, a detailed theory of statistical dependence—or even a theory of local dependence—and it does not try to assume much, even when building models for causal inference. Generally, assumptions are kept mild in relation to standard conditions. By proceeding on these grounds, the hope is that the set of methods provided here are more reliable in a wider array of contexts. Moreover, since fewer and simpler conditions often result in statistical tools that are more elementary in characterization, it is also this author's hope that they are more accessible to applied statisticians and researchers who do not possess advanced statistical training.

This dissertation progresses in the following manner. Chapter 2 introduces an elementary, but novel variance identity for additive statistics. What is useful about this identity is that it removes the strict need for exact dependency or covariance modeling, which ultimately requires the specification of n^2 parameters for any sample of n random variables. Instead, a researcher needs to consider only the summary effects of the dependency structure on the variance of the additive statistic: a much more achievable task. More importantly, however, after establishing the statistical consistency of additive statistics under very general dependency conditions, a novel concentration inequality is proven that enables the construction of confidence sets with at least 1 − α coverage for α ∈ (0,1). This concentration inequality applies to common error distributions, and even under conditions of mutual dependence, insofar as the average correlation between outcome variables is weak.
Importantly, the structure of the dependency relations can otherwise remain unknown. The contents of this chapter originate from a manuscript co-authored with Dr. Lu Zhang. [2]

Chapter 3 presents a paper that is co-authored with Dr. Erika Garcia and Dr. Lu Zhang. [3] This manuscript introduces a salient estimand for causal inference that is related to the average value of a function. In some circumstances, this causal effect targets changes in the extremes of an outcome variable following treatment exposure. Unlike prototypical causal estimands, this manuscript proves that functional average effects are identifiable under very mild circumstances—such as consistency and the preservation of supports—and do not require non-informative sampling processes. Moreover, it establishes that linear regressions can target causal parameters under a set of assumptions that are weaker than those typically supposed for association studies, provided a causal theory and the two regularity conditions aforementioned. Finally, this chapter also explores basic estimators of functional averages and introduces a new bootstrapping process that can be used for inference: the Hoeffding bootstrap. This bootstrap is non-traditional. It applies to almost arbitrarily dependent random variables. No theory of dependence is required. Pertinently, it also does not require the targeted statistic to be a smooth functional of the marginal cumulative distribution function(s).

The final manuscript explored in this dissertation, which is provided in Chapter 4, applies the variance identity of Chapter 2 and the Hoeffding bootstrap of Chapter 3 to social network regression modeling. [4] This paper is co-authored with Dr. Thomas Valente. Since outcome variables that are related to social networks exist within complicated mosaics of probabilistic dependencies, generalized estimating equations have been traditionally contraindicated as statistical tools in this domain. Instead, social network analysts have mostly relied upon models that require the specification of the entire dependency structure of the set of relational ties. This paper demonstrates that generalized estimating equations can be used for cogent inference in social network settings. It accomplishes this in conjunction with the methodologies explored in the other chapters. Moreover, in step with the motifs of the dissertation, it substantiates that generalized estimating equations, which are marginal models, are ideal candidates for modeling social network outcomes when the architecture of the graph in question is unknown.

Chapter 5 encompasses a discussion on the major findings of the previous chapters. Importantly, it also investigates future research directions. For instance, some words are provided on the applicability of these methods to the analysis of climate change and sociological domains in general. The latter discussion includes some applications in clinical psychological settings, which are largely phenomenological.

[2] Sparkes, S. & Zhang, L. (2023). Properties and deviations of random sums of densely dependent random variables. arXiv. https://arxiv.org/abs/2310.11554
[3] Sparkes, S., Garcia, E. & Zhang, L. (2023). The Functional Average Treatment Effect. arXiv. https://arxiv.org/abs/2312.00219
[4] Sparkes, S. & Valente, T. (2023). Extending GLMs to Social Network Analysis. Submitted for peer review and currently in revision.
Chapter 2

Properties and Deviations of Random Sums of Densely Dependent Random Variables

Abstract

A classical problem of statistical inference is the valid specification of a model that can account for the statistical dependencies between observations when the true structure is dense, intractable, or unknown. To address this problem, a new variance identity is presented, which is closely related to the Moulton factor. This identity does not require the specification of an entire covariance structure and instead relies on the choice of two summary constants. Using this result, a weak law of large numbers is also established for additive statistics and common variance estimators under very general conditions of statistical dependence. Furthermore, this paper proves a sharper version of Hoeffding's inequality for symmetric and bounded random variables under these same conditions of statistical dependence. Put otherwise, it is shown that, under relatively mild conditions, finite sample inference is possible in common settings such as linear regression, and even when every outcome variable is statistically dependent with all others. All results are extended to estimating equations. Simulation experiments and an application to climate data are also provided.

2.1 Introduction

Popular methods for statistical inference model probabilistic dependencies as limited and schematic in nature. The latter notion is used to justify asymptotic normality, while the former reduces an unknowable picture to one that is tame and mathematically pliable. However, in the now everyday words of George Box, "... all models are wrong" (Box & Draper, 1987). In many research contexts, there is little reason to believe that the system of statistical dependencies governing the joint distribution of outcome variables behaves in accordance with a tractable sequence, or that it is sparse even conditionally. Sociological, climate, or clinical settings are but a few examples. Box of course went on to state that 'some are useful.' However, this final addendum, while by construction irrefutable, requires great caution when cited besides models of statistical uncertainty. Invalid models for the expected values of outcome variables still possess salient interpretations as approximations. On the other hand, an invalid model for statistical dependence will furnish untrue statements about error, or statements in general that are exceptionally vulnerable to doubt. This undercuts the cogency of knowledge construction.

This paper addresses this problem for additive statistics, which we now define. Let I = {1,...,n} be an indexing set for a sample ζ = {Y_i}_{i∈I} of random variables. Furthermore, let {w_i}_{i∈I} be any set of constants. We define S_n = ∑_{i=1}^n w_i Y_i as an additive statistic, although for simplicity, we will often set w_i = 1 without loss of generality (WLOG) since we can simply say Z_i = w_i Y_i and reason about {Z_i}_{i∈I}. It is important to note that Y_i (Z_i) is also general and can be any measurable function of k ∈ ℕ random variables. Ultimately, we are interested in establishing some basic properties and behaviors of S_n, a vector of random sums, under very general but unfavorable conditions of dependence. We do so without any particular theory of how this looks, which makes these results widely applicable.

Lots of work exists on the topic of dependence and additive statistics. The generalized least squares approach encapsulates common classes of estimators.
Ultimately, this method replaces Var(S_n) with a model Σ̃ in an attempt to induce the conditions of the Gauss-Markov theorem (Aitken, 1936; Amemiya, 1985). Generalized linear models and the generalized estimating equation approach of Liang and Zeger (1986) produce estimators related to this type, as do hierarchical linear models (Nelder & Wedderburn, 1972; Gardiner, Luo, & Roman, 2009). The approach of Liang and Zeger, however, also makes use of cluster-robust standard errors in conjunction with specified covariance models (Ziegler, Kastner, & Blettner, 1998; Zorn, 2001). These strategies posit that a user-specified partition of the sample results in independent clusters, the variances of which can be identified. A thorough review of the theory behind cluster-robust variance estimation is available elsewhere (MacKinnon, Nielsen, & Webb, 2023). Spatial and time series methods are also important in this universe. They adopt objects—such as variograms or auto-regressive weight matrices—to facilitate covariance estimation under the supposition of largely localized dependencies (Cressie, 2015; Kedem & Fokianos, 2005; Anselin, 2009). The details of these methods are beyond this paper's scope. In essence, however, they share one or two pivotal facts: (1) they usually posit that the average number of dependencies in a sample is bounded by a constant and that the researcher has knowledge of a partition of ζ that produces K(n) → ∞ independent clusters, or (2) they replace Var(S_n) with a blueprint that is salient, but simple and mathematically convenient. Overall, these strategies set a grand majority of the n^2 covariance parameters to zero a priori in protestation of real conditions.

Inference also depends on mixing conditions or concepts such as m-dependence, which justify a central limit theorem for dependent random variables (Withers, 1981; Berk, 1973). Researchers have also utilized the concentration of measure phenomenon as a tool for inference (Ledoux, 2001). Hoeffding's inequality is a classical result in this domain, as are others (Hoeffding, 1994; Bennett, 1962). Most formulations rely on the assumption of mutual independence (Janson, Luczak, & Rucinski, 2011; Boucheron, Lugosi, & Bousquet, 2003; Talagrand, 1996). Many, however, also allow for weak or local dependence conditions to exist (Daniel, 2014; Kontorovich & Ramanan, 2008; Götze, Sambale, & Sinulis, 2019). For instance, Hoeffding's inequality applies to sums of negatively associated random variables and to additive statistics that can be decomposed into smaller independent sums (Wajc, 2017; Janson, 2004). These results are invaluable for theoretically bounding the tail probabilities of random sums under a wider range of circumstances. Unfortunately, however, most of these inequalities still rely upon restricted dependency pictures that do not fit the concerns of this paper.

The main contribution of this manuscript is to show that finite sample inference is possible for additive statistics of bounded random variables without a detailed dependency model, and even when all measured outcome variables are statistically dependent and no central limit theorem applies. Essentially, this is done by showing that a sharper version of Hoeffding's inequality still applies to S_n − ES_n = ε when ε is densely dependent, insofar as each ε_i behaves in accordance with some symmetric, unimodal, but not necessarily identical probability law. No other restrictions are necessarily placed on the marginal or joint probability functions. While still non-trivial, this contribution is useful since random errors of this type often surface in regression settings. Additionally, Hoeffding's inequality possesses a closed form that is accessible to working statisticians.
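For reference, the classical form of Hoeffding's inequality for mutually independent Y_i with Y_i ∈ [a_i, b_i] states that Pr(|S_n − ES_n| ≥ t) ≤ 2exp{−2t^2 / ∑_{i=1}^n (b_i − a_i)^2}. The following minimal Python sketch—written for this exposition and not taken from the manuscript—simply inverts this baseline bound to obtain a (1 − α)-level half-width; the sharper, dependence-tolerant version developed later in this chapter refines exactly this kind of quantity.

```python
import numpy as np

def hoeffding_halfwidth(ranges, alpha=0.05):
    """Half-width t such that, for independent Y_i with Y_i in [a_i, b_i],
    Pr(|S_n - E S_n| >= t) <= 2 exp(-2 t^2 / sum (b_i - a_i)^2) = alpha.
    Classical (independence-based) Hoeffding bound, shown as a baseline."""
    ranges = np.asarray(ranges, dtype=float)   # ranges[i] = b_i - a_i
    return np.sqrt(np.sum(ranges**2) * np.log(2.0 / alpha) / 2.0)

# Example: 100 observations, each bounded in [0, 1].
t = hoeffding_halfwidth(np.ones(100))
print(t, t / 100)   # half-width for the raw sum and for the sample mean
```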
The paper's secondary contribution is a novel variance identity that is useful for analyzing the properties of S_n under very general and terrible conditions. Extending these results to estimating equations is this paper's tertiary contribution.

Section 2.2 introduces some key definitions and the aforementioned variance identity, while Section 2.3 uses it to prove a weak law of large numbers (WLLN) for additive statistics under the circumstances identified, and to explore the behavior of cluster-robust variance estimators in these same settings. Section 2.4 returns to our main theme of statistical inference in the face of dense, unknown, and intractable dependency structures. It defines a new type of random variable and uses this definition to prove our main results. Following this, Section 2.5 makes use of the work of Jennrich (1969), Yuan and Jennrich (1998), and Hall (2005) to extend the statements of the previous sections to estimating equations. The last part of this paper before the conclusion—Section 2.6—demonstrates the value of our approach with a set of simulation experiments that mimic some worst-case dependency scenarios with linear estimators. After this is accomplished, the association between global changes in temperature and carbon dioxide levels is estimated to demonstrate the utility of the approach.

2.2 A Novel Variance-Covariance Identity for Additive Statistics

We now introduce some important definitions. Say Y^⊤ = (Y_1,...,Y_n) is a 1×n random vector s.t. EY_i^2 < ∞ for ∀i, and say w ∈ ℝ^{p×n} is a matrix of constants. Furthermore, as is tradition, say Cov(Y_i, Y_j) = σ_{i,j} = EY_iY_j − EY_iEY_j, and hence Var(Y_i) = σ_{i,i} = σ_i^2. Graphs are important for this exploration. A graph G = (V, L) is constituted by a set of nodes V and a set of lines, L, that connect them. It is undirected if e_{i,j} ∈ L implies that e_{j,i} ∈ L; otherwise it is directed. Here, only undirected graphs will be of interest. Also, note that I = {1,...,n} can now be seen as a node set. The definition presented next is central and allows for the construction of the variance identity.

Definition 2.1 (Linear Dependency Graph). Let L = (I, L) be a graph with a node set I w.r.t. {Y_i}_{i∈I} and a set of edges L between them. Then e_{i,j} ∈ L if and only if σ_{i,j} ≠ 0.

It is also relevant to know that the degree of a node is defined as the sum of its existent links. Denote this function as d(i) = ∑_{j=1}^{n−1} 1_{e_{i,j}∈L}, where 1_{e_{i,j}∈L} is an indicator function, and similarly denote the mean degree as µ_n = n^{-1}∑_{i=1}^n d(i). In the context of this paper, µ_n is equal to the mean number of random variables that a typical random variable is correlated with in the sample. Pertinently, each 1_{e_{i,j}} will be treated as a non-stochastic function, conditional on the realization of {Y_i}_{i∈I} as a sample of random variables, unless otherwise noted.

A few more definitions are necessary. The ⊙ symbol will signify the Hadamard product, which is the component-by-component multiplication of two matrices. Letting w_s denote the s-th row of w, consider σ̄_{r,t} = |L|^{-1}∑_{i<j}^{|L|} w_{r,i}w_{t,j}σ_{i,j} w.r.t. the statistics w_rY and w_tY. This value will be called an average non-zero covariance term, while φ_{r,t} = {n^{-1}∑_{i=1}^n w_{r,i}w_{t,i}σ_i^2}^{-1} · σ̄_{r,t} will be called an average correlation.
Ultimately, three matrices will also be required for the identity: G, C, and V. The first two matrices are defined from the previously specified quantities: C^{p×p} = (σ̄_{i,j}) and G = 1 + µ_nφ, where 1 is a p×p matrix of ones and φ^{p×p} = (φ_{i,j}). Finally, V is a diagonal matrix s.t. its diagonal is equal to the diagonal of Var(Y). This last matrix is recognizable as Var(Y) under the counterfactual assumption of mutual independence.

Proposition 2.1. Let Y = (Y_1,...,Y_n)^⊤ be an n×1 random vector s.t. EY_i^2 < ∞ for ∀i and w ∈ ℝ^{p×n} is a matrix of constants. Then Var(wY) = wVw^⊤ ⊙ G = wVw^⊤ + nµ_nC.

Proof. Let s,t be arbitrary indexes from {1,...,p}. Then w_sY = ∑_{i=1}^n w_{s,i}Y_i and w_tY = ∑_{i=1}^n w_{t,i}Y_i. It then follows that Cov(w_sY, w_tY) = ∑_{i=1}^n w_{s,i}w_{t,i}σ_i^2 + ∑_{i≠j} w_{s,i}w_{t,j}σ_{i,j}. From here, consider the linear dependency graph L = (I, L) w.r.t. {Y_i}_{i∈I} as previously defined. Then:

∑_{i≠j} w_{s,i}w_{t,j}σ_{i,j} = 2∑_{i<j} w_{s,i}w_{t,j}σ_{i,j} = 2·(∑_{i<j}^{|L|} w_{s,i}w_{t,j}σ_{i,j} + 0) = 2|L|·σ̄_{s,t} = nµ_n·σ̄_{s,t}

The second equality follows from the definition of L, while the fourth follows from the handshake lemma. Hence, Cov(w_sY, w_tY) = (wVw^⊤)_{s,t} + nµ_nC_{s,t}. It then follows that Var(wY) = wVw^⊤ + nµ_nC since s,t were arbitrary. Now, note that:

Cov(w_sY, w_tY) = ∑_{i=1}^n w_{s,i}w_{t,i}σ_i^2 + nµ_n·σ̄_{s,t} = {1 + nµ_n·(∑_{i=1}^n w_{s,i}w_{t,i}σ_i^2)^{-1}·σ̄_{s,t}}·∑_{i=1}^n w_{s,i}w_{t,i}σ_i^2 = {1 + µ_nφ_{s,t}}·∑_{i=1}^n w_{s,i}w_{t,i}σ_i^2

Again, since s,t were arbitrary, it then follows that Var(wY) = wVw^⊤ ⊙ G. ■

The utility of Proposition 2.1 is that it summarizes the impact of an unknowable and inestimable system of statistical dependencies on the variance of an additive statistic with two summary constants that are more defensibly specified or bounded. Although intuition might be lacking as to how some environment dynamically acts upon {Y_i}_{i∈I} to produce Var(Y), this might not be the case for µ_n or the diagonal elements of φ. Prior beliefs or knowledge might exist pertaining to these values. Although this also suggests a possible Bayesian route where these values are conceptualized as draws from prior distributions to model uncertainty, this road is not pursued in this paper.

Moreover, recall that a Moulton factor is an expression for the variance inflation caused by intra-cluster correlation (Moulton, 1986). They typically have the form γ_{s,s} = (wVw^⊤)_{s,s}^{-1}·Var(wY)_{s,s}. From Proposition 2.1, we can see that γ_{s,s} = G_{s,s}. However, Proposition 2.1 is more general since Moulton factors are often derived under specific modeling constraints. This correspondence can also be seen from rearranging an expression from Proposition 2.1. For instance, say Γ = (n^{-1}·wVw^⊤)^{-1}C. Then Var(wY) = wVw^⊤{I^{p×p} + µ_nΓ}, which is also recognizable as an even more general form of a Moulton style covariance factor (Moulton, 1990).
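To make the identity concrete, the following Python sketch (illustrative only; the covariance matrix and all parameter values are hypothetical and not taken from the manuscript) builds a dense covariance structure, computes µ_n, |L|, and C directly from Definition 2.1, and confirms numerically that wVw^⊤ + nµ_nC reproduces Var(wY). Off-diagonal terms are averaged over ordered pairs, i.e., over 2|L| terms, so the check also covers the cross-covariance entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 2

# Hypothetical dense dependence: every pair of outcomes is correlated.
A = rng.normal(size=(n, n))
Sigma = A @ A.T                        # Var(Y): positive definite, fully dense
w = rng.normal(size=(p, n)) / n        # weight matrix with O(1/n) entries

V = np.diag(np.diag(Sigma))            # V: Var(Y) under counterfactual independence
off = Sigma - V                        # off-diagonal covariances sigma_{i,j}

# Linear dependency graph L (Definition 2.1): edge i~j iff sigma_{i,j} != 0.
adj = off != 0
mu_n = adj.sum(axis=1).mean()          # mean degree mu_n
n_edges = np.triu(adj, 1).sum()        # |L|; n * mu_n = 2|L| (handshake lemma)

# C[s, t] = average non-zero covariance term, averaged over ordered pairs.
C = np.array([[(np.outer(w[s], w[t]) * off).sum() / (2 * n_edges)
               for t in range(p)] for s in range(p)])

lhs = w @ Sigma @ w.T                  # Var(wY), computed directly
rhs = w @ V @ w.T + n * mu_n * C       # Proposition 2.1
print(np.allclose(lhs, rhs))           # True: the identity holds exactly
```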
We will prove three bounds on the variances of simpler additive statistics, i.e., for w_s = 1^{1×n}. These bounds can be useful for proving asymptotic properties. The next propositions bound φ under the conditions of fully connected graphs or sampling designs that are relatively uninformative with respect to the covariance structure.

Proposition 2.2. Say S_n = ∑_{i=1}^n Y_i and again consider L. Then −µ_n^{-1} ≤ φ ≤ µ_n^{-1}(n−1) when µ_n > 0.

Proof. Observe ∑_{i≠j} σ_{i,j}. By Cauchy-Schwarz and the Geometric-Mean Inequality, ∑_{i≠j} σ_{i,j} ≤ (n−1)∑_{i=1}^n σ_i^2. From here, we know that 2∑_{i<j} σ_{i,j} = nµ_nσ̄. Then nµ_nσ̄ ≤ (n−1)∑_{i=1}^n σ_i^2, which implies φ ≤ µ_n^{-1}(n−1). Furthermore, since variances are non-negative, 1 + µ_nφ ≥ 0 =⇒ µ_nφ ≥ −1. ■

Corollary 2.1. For fully connected linear dependency graphs, −(n−1)^{-1} ≤ φ ≤ 1.

Proof. Immediate from Proposition 2.2 since µ_n = n−1. ■

Provided that µ_n(n) → ∞ as n → ∞, another informal corollary to Proposition 2.2 is that φ ≥ 0. Philosophically, this is useful to know since it implies that any system that is heavy with statistical dependencies in this fashion almost necessitates non-negative average correlation.

From here, consider ∆, a vector of N indicator variables δ_i, such that δ_i = 1 if and only if Y_i is sampled from a larger population P. This next proposition also provides conditions s.t. φ ≤ 1 without strictly requiring a fully connected graph. It is useful when P possesses a high number of dependent random variables and it is possible to execute a non-informative sampling mechanism s.t. µ_n (now treated as random in a temporary abuse of notation solely for Proposition 2.3) has a probability distribution that places most probability on lower values.

Proposition 2.3. Say S_n = ∑_{i=1}^n Y_i. Suppose n^{-1}∑_{i=1}^n Var(Y_i | ∆) = n^{-1}∑_{i=1}^n σ_i^2 and that |L|^{-1}∑_{i<j}^{|L|} Cov(Y_i, Y_j | ∆) = |L|^{-1}∑_{i<j}^{|L|} σ_{i,j}, i.e., the sampling mechanism is uninformative on average with respect to the mean variance and individual covariances. Furthermore, denote n_* as the number of correlated random variables in the population P and say n_* > n, and thus 2^{-1}n(n−1) is the upper bound of the support of |L|. Then φ ≤ 1.

Proof. Var(S_n | ∆) = ∑_{i=1}^n Var(Y_i | ∆) + 2∑_{i<j} Cov(Y_i, Y_j | ∆) = ∑_{i=1}^n σ_i^2 + 2∑_{i<j}^{|L|} σ_{i,j}, where only |L| is random. The derivation is similar to Proposition 2.2. We skip these steps and observe that nµ_n|L|^{-1}∑_{i<j}^{|L|} σ_{i,j} ≤ (n−1)∑_{i=1}^n σ_i^2. Here, note that µ_n is also random and that µ_n = n−1 when |L| = 2^{-1}n(n−1). However, since (n−1)∑_{i=1}^n σ_i^2 is a constant upper bound of the left-hand side and it is possible to sample n < n_* correlated random variables, it must be the case that n(n−1)|L|^{-1}∑_{i<j}^{|L|} σ_{i,j} ≤ (n−1)∑_{i=1}^n σ_i^2 and hence that φ ≤ 1. ■

This next proposition is less demanding in its premises and hence applicable in more contexts. Furthermore, it will provide another route for establishing that φ ≤ 1 when σ_i^2 = σ^2 for ∀i, i.e., when the assumption of equal variances holds. To these ends, denote η = (n^{-1}∑_{i=1}^n σ_i^2)^{-1}·max_{i∈I}(σ_i^2) as the ratio of the maximum variance to the mean variance. It won't be proven, but it is obvious that η = O(1) when variances are finite. This fact also generalizes to the weighted case insofar as w_s is not a trivial vector of zeroes.

Proposition 2.4. Again observe S_n = ∑_{i=1}^n Y_i. Then Var(S_n) ≤ {1 + µ_nη}∑_{i=1}^n σ_i^2. If σ_i^2 = σ^2 for ∀i, then φ ≤ 1 and Var(S_n) ≤ {1 + µ_n}∑_{i=1}^n σ_i^2.

Proof. Recall that σ̄ = |L|^{-1}∑_{i<j}^{|L|} σ_{i,j}. Hence, σ̄ ≤ max_{i<j}(σ_{i,j}) = σ_{r,s}, say. By Cauchy-Schwarz, σ_{r,s} ≤ σ_rσ_s ≤ max_{i∈I}(σ_i^2). The general result then follows since φ ≤ η. Under the additional premise, η = 1, which implies the last stated bound. ■

Pertinently, Var(S_n) ≤ {1 + µ_n}∑_{i=1}^n σ_i^2 is a tight inequality w.r.t. some dependency structures. When ζ is a sample of independent Y_i, µ_n = 0 and Var(S_n) = ∑_{i=1}^n σ_i^2 ≤ (1 + 0)∑_{i=1}^n σ_i^2 = (1 + µ_n)∑_{i=1}^n σ_i^2. When ζ is a collection of the same random variable n times, S_n = nY_1 WLOG. Then Var(S_n) = n^2σ_1^2 = {1 + (n−1)}·nσ_1^2 = (1 + µ_n)∑_{i=1}^n σ_i^2 since µ_n = n−1.

Here, it is also useful to comment that we can also define a dependency graph D = (I, L_D) s.t. a link exists in this graph if and only if Pr(Y_i, Y_j) ≠ Pr(Y_i)Pr(Y_j). All the results of this section also hold when the corresponding values are defined w.r.t. this graph. We will call these values µ_D and φ_D respectively. When this is done, it is easy to show that µ_nφ_n = µ_Dφ_D. This can be established from the fact that φ_n = |L|^{-1}|L_D|·φ_D.
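As a quick numerical illustration of the equal-variance bound in Proposition 2.4 (a sketch with purely illustrative parameters, not an example from the manuscript), consider an equicorrelated sample, for which the linear dependency graph is fully connected and µ_n = n − 1:

```python
import numpy as np

# Equicorrelated, equal-variance outcomes: the linear dependency graph is
# fully connected, so mu_n = n - 1 and Proposition 2.4 gives
# Var(S_n) <= (1 + mu_n) * n * sigma2.  Parameters are illustrative.
n, sigma2, rho = 100, 2.0, 0.3
Sigma = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

var_Sn = Sigma.sum()                      # Var(S_n) = 1' Sigma 1
mu_n = n - 1                              # every pair is correlated
bound = (1 + mu_n) * n * sigma2           # Proposition 2.4 bound
phi = (var_Sn / (n * sigma2) - 1) / mu_n  # implied average correlation
print(var_Sn <= bound, round(phi, 3))     # True 0.3  (phi = rho <= 1 here)
```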
2.3 Asymptotic Properties of Additive Statistics

We now show the utility of the previous set of results for the asymptotic analysis of additive statistics. Recall that one goal of this paper is to establish some basic but important statistical properties of additive estimators under 'apocalyptic' conditions, i.e., conditions such that independence is violated in such a way as to render the true dependency structure inconceivably checkered with non-nullified co-variation. To this end, we first establish some mild assumptions for the consistency of additive estimators and common variance estimators in this setting. Variance estimators are usually reserved for plug-in strategies with Wald statistics. Although we have no intention of justifying Wald-like hypothesis testing under our assumed conditions, establishing the consistency of variance estimators can still be useful for other purposes. Later, we see that they are still useful for investigating features of the unknown correlation structure, for example.

A1. There exists a C_* ∈ ℝ^+ s.t. |Y_i| ≤ C_* for ∀i ∈ I.

A2. Denote a partition of I s.t. J_k ⊆ I for k ∈ {1,2,...,K} = 𝒦, ∪_{k=1}^K J_k = I, and therefore J_r ∩ J_t = ∅ if r ≠ t. In accordance with this, say ζ = {Y_i}_{i∈I}, ζ_k = {Y_j}_{j∈J_k}, and hence ζ = ∪_{k=1}^K ζ_k. Denote n_k = |ζ_k|, which means ∑_{k=1}^K n_k = n. Then n_r^{-1}n_t = O(1) for ∀r,t ∈ 𝒦.

A3. Let L be an arbitrary linear dependency graph. Then µ_n = o(n).

A1 establishes that we are working within a universe of bounded random variables. If this condition is not strictly necessary for mean and variance estimation, we can of course also do with the common assumption that EY_i^2 < ∞, as A1′, or the assumption that EY_i^4 < ∞ for all i and that all referenced random variables are uniformly integrable. A2 simply stipulates that all cluster sizes have the same asymptotic order. This assumption is met in common research circumstances. We will not always need it. Note that A3 is very mild. For instance, it allows µ_n(n) → ∞ as n → ∞. It only keeps µ_n from linearly scaling with sample size. Put otherwise, A3 is mild because it allows for the typical random variable of a given sample to be correlated, on average, with a diverging number of others also collected—and in any imaginative way—as n becomes arbitrarily large. There is no constraint on how this occurs.

Proposition 2.5. Suppose A1′ and A3 for wY s.t. w = (w_{i,j}) with w_{i,j} = O(n^{-1}). Then wY →_p E(wY) as n → ∞, where →_p denotes convergence in probability.

Proof. Let s be arbitrary and observe w_sY = ∑_{i=1}^n w_{s,i}Y_i. From Proposition 2.1, we know that Var(w_sY) = ∑_{i=1}^n w_{s,i}^2σ_i^2 + nµ_nσ̄_{s,s}, where σ̄_{s,s} = |L|^{-1}∑_{i<j}^{|L|} w_{s,i}w_{s,j}σ_{i,j}. Since σ̄_{s,s} ≤ {max_{i∈I}(|w_{s,i}|)}^2·max_{i∈I}(σ_i^2) and max_{i∈I}(σ_i^2) ≤ C ∈ ℝ^+ for ∀n, it follows that nµ_nσ̄_{s,s} ≤ nµ_n{max_{i∈I}(|w_{s,i}|)}^2·max_{i∈I}(σ_i^2) ≤ nµ_n{max_{i∈I}(|w_{s,i}|)}^2·C. This implies that:

lim_{n→∞} nµ_nσ̄_{s,s} ≤ lim_{n→∞} nµ_n{max_{i∈I}(|w_{s,i}|)}^2·C = 0

The last step follows since C is finite, while µ_n = o(n) and {max_{i∈I}(|w_{s,i}|)}^2 = O(n^{-2}). Since ∑_{i=1}^n w_{s,i}^2σ_i^2 → 0 as n → ∞ as well, this implies that Var(w_sY) → 0. A use of Chebyshev's inequality then implies that w_sY →_p E(w_sY). Because s was arbitrary, the conclusion is reached. Note that a quicker alternative proof could have used Proposition 2.4 since η = O(1). ■
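To see the proposition at work, the short sketch below computes Var(Ȳ) exactly for a block-equicorrelated design in which each outcome is correlated with roughly √n − 1 others, so µ_n = o(n) holds even though the dependence is dense. The design and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Dense but sublinear dependence: outcomes in blocks of size m ~ sqrt(n)
# are equicorrelated, so mu_n ~ sqrt(n) - 1 = o(n) and Proposition 2.5
# predicts that the sample mean still concentrates.
rho, sigma2 = 0.5, 1.0
for n in (100, 1_000, 10_000):
    m = int(np.sqrt(n))                  # block size, so mu_n ~ m - 1
    n_blocks = n // m                    # leftover variables stay independent
    # Var(Ybar) = n^{-2} [ n*sigma2 + n_blocks * m * (m-1) * rho * sigma2 ]
    var_mean = (n * sigma2 + n_blocks * m * (m - 1) * rho * sigma2) / n**2
    print(n, round(var_mean, 6))         # shrinks roughly like mu_n / n
```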
Therefore, an adequate sub-linear behavior w.r.t. µ_n is sufficient for quadratic mean convergence. For instance, if µ_n = n^{1/c}, c > 1, this is still sufficient to establish a WLLN, although the rate of convergence will be sub-optimal, especially for small c. It is also therefore apparent that weak dependence—at least as traditionally conceived—is not a necessary condition for the consistency of a large class of commonly employed statistics, including average loss functions for learning algorithms.

Next, define ê_i = Y_i − ÊY_i. Analyzing variance estimators that make use of squared residuals is easier under the simplifying assumption that the linear dependency graph of {ê_i^2}_{i∈I} is isomorphic to the linear dependency graph for {ê_i}_{i∈I}. This is true for special cases, such as when the ê_i are jointly normally distributed or the probabilistic dependency graph and linear dependency graph are isomorphic. Even if this does not hold exactly, it is arguably mild to assert that it is at least true that the mean degrees of these graphs have the same asymptotic order and thus can be interchangeably described by one µ_n. Moreover, we also use µ_n for the limiting average degree, but context will make this clear. For instance, for finite sample sizes, the linear dependency graph of {ê_i^2}_{i∈I} is conceivably fully connected. However, provided that ÊY_i →_p EY_i for ∀i as n → ∞, and thus ê_i →_p ε_i for ∀i as n becomes arbitrarily large, the limiting linear dependency graph of {ê_i^2}_{i∈I} is equal to the graph for {ε_i^2}_{i∈I}. Hence, when discussing asymptotic orders, it is apparent that µ_n refers to the mean degree of the limiting object. Under the aforementioned simplifying assumption, it is also apropos to note that the limiting mean degree of the graph for {Y_i}_{i∈I} can also be represented by the same µ_n as the two previously mentioned. From here, we will always implicitly assume that an arbitrary weight w is O(n^{-1}).

Proposition 2.6. Assume A1 and A3. Let B = wY = β + wε s.t. Eε = 0 and again let V = diag{Var(Y)}, where diag(A) constructs a diagonal matrix from the diagonal of A. Define ê_i = Y_i − ÊY_i. Then for an arbitrary pair (s,t), it is true that (wV̂w^⊤)_{s,t} = ∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2 →_p (wVw^⊤)_{s,t} as n → ∞.

Proof. Let (s,t) be arbitrary under our premises. Then by Proposition 2.1, Var{∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2} = {1 + µ_nφ_{s,t:n}}∑_{i=1}^n w_{s,i}^2w_{t,i}^2·Var(ê_i^2). Say φ_{s,t:n} ≤ C_φ and Var(ê_i^2) ≤ M for ∀i since we know that they are O(1). Then Var{∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2} ≤ {1 + (n−1)C_φ}·M·∑_{i=1}^n w_{s,i}^2w_{t,i}^2. Note that ∑_{i=1}^n w_{s,i}^2w_{t,i}^2 = O(n^{-3}), while (n−1)C_φ·M is obviously only O(n) since M is just some constant. Thus, it follows that Var{∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2} → 0 as n → ∞. Although this is sufficient, we observe that the variance converges to zero at the asymptotically equivalent rate of max(n^{-3}µ_n, n^{-3}), which can be considerably faster than what was shown since we automatically set µ_n to its upper bound. Under the same premises, ê_i →_p Y_i − EY_i as n → ∞. Hence, Eê_i^2 →_p V_{i,i} as well as n becomes arbitrarily large by the Portmanteau theorem since ê_i is bounded with probability one for ∀i. This is sufficient for our conclusion. ■
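A Monte Carlo caricature of this result is sketched below (illustrative parameters; not an experiment from the manuscript). Under the block design used earlier, the plug-in quantity ∑_i w_i^2 ê_i^2 settles near (wVw^⊤), the variance under counterfactual independence, while the realized variance of the statistic is larger by roughly the factor G = 1 + µ_nφ—which is precisely why a correction is needed later.

```python
import numpy as np

rng = np.random.default_rng(1)

# Block-equicorrelated errors with block size m = sqrt(n), so mu_n = m - 1
# and A3 holds.  The naive estimator sum_i w_i^2 e_i^2 targets wVw' (the
# independence-counterfactual variance), while the realized variance of
# the sample mean is inflated by roughly G = 1 + (m - 1) * rho.
n, rho, reps = 2_500, 0.5, 500
m = int(np.sqrt(n))
w = np.full(n, 1.0 / n)                 # weights for the sample mean

naive, means = np.empty(reps), np.empty(reps)
for r in range(reps):
    shared = np.repeat(rng.normal(size=n // m), m)   # one factor per block
    y = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=n)
    naive[r] = np.sum(w**2 * (y - y.mean())**2)      # plug-in estimate
    means[r] = y.mean()

print(naive.mean())   # ~ 1/n = wVw': what Proposition 2.6 says it targets
print(means.var())    # ~ (1 + (m-1)*rho)/n: the actual Var(wY), much larger
```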
An important note to make in relation to Proposition 2.6 is that Var{∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2} is O(n^{-2}) at worst under our terrible, but manageable conditions. This is good to know since it must converge at a much faster rate than B if it is to be used as a plug-in estimator. To be exact, we require n^2·Var{∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2} → 0 as a sufficient condition for Wald plug-in estimation. Here is an informal proof. Let V̂ = ∑_{i=1}^n w_{s,i}w_{t,i}ê_i^2 and observe that W_n = {γ_s(n)V̂}^{-1/2}(B_s − β_s) is a Wald-like statistic for some function γ_s(n) > 0 that is intended to adjust for missed variability. Typical practices set γ_s(n) = 1 a priori in accordance with the assumption that the employed covariance model is well-specified. For simplicity, we set γ_s(n) to µ_n since we care only about asymptotic orders in this exploration, and we temporarily assume that µ_n is non-zero and does not tend to zero. We know that Var(B_s) = O(n^{-1}µ_n). Thus, {n^{-1}µ_n}^{-1/2} stabilizes the variance of B_s in the sense that nµ_n^{-1}·Var(B_s) converges to a non-zero constant. Observe that W_n = {nV̂}^{-1/2}{n^{-1}µ_n}^{-1/2}(B_s − β_s). Since n^2·Var(V̂) = O(n^{-1}µ_n), plug-in Wald statistics will always behave as intended under A3, at least insofar as asymptotic normality is not being considered.

2.3.1 Some Consideration for Clustered Statistics

Our next object is to explore the properties of cluster-robust variance estimation under thick dependency conditions. As setup, enact a partition of ζ into K groups, as in A2, s.t. B = wY = ∑_{k=1}^K w_kY_k. The basic cluster-robust estimator has the following form: Ĉ = ∑_{k=1}^K w_kê_kê_k^⊤w_k^⊤, where each ê_k is an n_k×1 vector of residuals as previously defined. Classically, this approach has required that K → ∞ as n → ∞ and for max_{k∈𝒦}(n_k) = n_M to be O(1) in addition. If the outcome variables of different groups are independent—or at least uncorrelated—then Ĉ is consistent for Var(B) under these nice conditions (MacKinnon, Nielsen, & Webb, 2023).

Although incredibly useful, the problem with this approach is simple, albeit undeniable. Outside of contrived examples, the specification of a partition that produces K independent or even uncorrelated clusters is an arduous task. Recall that L_n is unknown and inestimable. Hence, provided a dynamic and inter-dependent world, the chances that a user specifies a valid partition with partial and imperfect knowledge about L_n are safely assumed to be small. This point has doubled poignancy when n_M is small in addition. This is problematic because an invalid specification of the partition structure calls into question the consistency and utility of cluster-robust variance estimation.

Here, we prove that—even if a partition is invalidly specified—cluster-robust variance estimation can still be consistent for its identified portion of the variance under a couple of common scenarios. Overall, three cases are considered: (1) the case s.t. n_M ≤ Q ∈ ℕ, (2) the case s.t. min_{k∈𝒦}(n_k) = n_m → ∞ and L_k is fully connected for ∀k ∈ 𝒦, and (3) the case s.t. n_m → ∞, µ_{n_k} = o(n_k) for an arbitrary k ∈ 𝒦, and K = O(1). Although establishing the consistency of a particular Ĉ_s does not help us identify and estimate Var(B_s) in total, we show that the consistent estimation of portions of Var(B_s) can provide information on the magnitude of missed variability, which is why it is still important. Ultimately, this information can be used to inform a researcher's choice of a corrective factor in the style of Proposition 2.1. Gaining insight into the magnitude of µ_nφ_n is also important for additional reasons that become clear in Section 2.6.
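For concreteness, a minimal implementation of the basic estimator Ĉ is sketched below (a generic, uncorrected version written for this exposition; it is not code from the manuscript). As the surrounding discussion emphasizes, when clusters remain correlated across the chosen partition, the function returns a consistent estimate of only the identified portion ∑_k Var(T_k), not of Var(B) itself.

```python
import numpy as np

def cluster_robust_variance(w, e_hat, cluster_ids):
    """Basic cluster-robust estimator  C_hat = sum_k (w_k e_k)(w_k e_k)'
    for a p x n weight matrix w, a length-n residual vector e_hat, and a
    length-n vector of cluster labels.  No small-sample correction is
    applied; if clusters are correlated with one another, this estimates
    only the identified portion of Var(wY)."""
    p = w.shape[0]
    C_hat = np.zeros((p, p))
    for k in np.unique(cluster_ids):
        idx = cluster_ids == k
        u_k = w[:, idx] @ e_hat[idx]      # p-vector w_k e_k for cluster k
        C_hat += np.outer(u_k, u_k)
    return C_hat
```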
The next proposition provides a version of Proposition 2.1 for the sum of random vectors. It is helpful because it gives an expression for the bias induced by an invalid partition and helps to see the problem with more clarity. It is not strictly necessary for our analysis, but we show it for the sake of completeness. Say T_k = w_kY_k = µ_k + w_kε_k and define a directed linear dependency graph L_T = (𝒦, E_{L_T}) s.t. a link exists from node r to t if and only if E{(T_r − µ_r)(T_t − µ_t)^⊤} ≠ 0. Define σ̄_T = |E_{L_T}|^{-1}∑_{r≠t} E{(T_r − µ_r)(T_t − µ_t)^⊤}. Further, define φ_T = {∑_{k=1}^K Var(T_k)}^{-1}·Kσ̄_T.

Proposition 2.7. Observe T_k = w_kY_k = µ_k + w_kε_k for B = ∑_{k=1}^K T_k and define a directed linear dependency graph L_T = (𝒦, E_{L_T}) as previously constructed. Then Var(B) = ∑_{k=1}^K Var(T_k){I + µ_Tφ_T} = ∑_{k=1}^K Var(T_k) + Kµ_Tσ̄_T.

Proof. Observe that Var(B) = E{∑_{k=1}^K (T_k − µ_k)}{∑_{k=1}^K (T_k − µ_k)}^⊤ = ∑_{k=1}^K Var(T_k) + |E_{L_T}|σ̄_T. By the degree-sum formula, |E_{L_T}| = K·µ_T. Substitution and algebraic manipulation of ∑_{k=1}^K Var(T_k) in a manner analogous to Proposition 2.1 yields the identity. ■

Therefore, it is then implied—provided the consistency of Ĉ and a multivariate central limit theorem holds—that Ĉ^{-1/2}(B − β) →_d N(0, I + µ_Tφ_T), where →_d connotes convergence in distribution. Although this identity provides a medium for representing bias, φ_T is not an intuitive object in comparison to the other forms explored. Since most inferential settings truly require valid statements for the variances only, we retreat to this forum. For instance, B_s = ∑_{k=1}^K w_{s,k}Y_k = ∑_{k=1}^K ∑_{j=1}^{n_k} w_{s,k,j}Y_{k,j} = ∑_{k=1}^K T_{s,k}. We can thus resort to an iterated application of Proposition 2.1 and state Var(B_s) = {1 + µ_Kφ_K}∑_{k=1}^K {1 + µ_{n_k}φ_{s,n_k}}∑_{j=1}^{n_k} w_{s,k,j}^2σ_{k,j}^2, where (µ_K, φ_K) correspond to a linear dependency graph connected to {T_{s,k}}_{k∈𝒦} and the (µ_{n_k}, φ_{s,n_k}) correspond to the cluster-specific dependency graphs for the outcome variables in each cluster ζ_k. However, more simply, we can settle for Var(B_s) = {1 + µ_Kφ_K}∑_{k=1}^K Var(T_{s,k}) to represent the bias of cluster-robust methods when consistent estimation of each Var(T_{s,k}) is possible.

Some additional preliminaries help to frame this exploration. Pertinently, we provisionally assume that A2 applies to an arbitrary µ_{n_r} and µ_{n_t} for r,t ∈ 𝒦. This is not a critical assumption. However, it greatly simplifies notation. If one does not wish to apply it, the forthcoming statements hold by replacing µ_{n_k} with max_{k∈𝒦}(µ_{n_k}). Now, recall that Var(B_s) = O{n^{-1}max(µ_n, 1)}. From the above cluster expression and our working assumption, we also know that Var(B_s) = O{n^{-2}max(µ_K, 1)·Kn_k·max(µ_{n_k}, 1)}, which simplifies to O{n^{-1}max(µ_K, 1)·max(µ_{n_k}, 1)} since Kn_k = O(n) for all k under A2. This implies that max(µ_n, 1) and max(µ_K, 1)·max(µ_{n_k}, 1) are equivalent in order.

To progress, we note that V̂ar(T_{s,k}) = Ĉ_{s,k} = ∑_{j=1}^{n_k} w_{s,k,j}^2ê_{k,j}^2 + ∑_{r≠t} w_{s,k,r}w_{s,k,t}ê_{k,r}ê_{k,t} for an arbitrary cluster ζ_k. The usual machinery requires that Var(Ĉ_{s,k}) → 0 as n → ∞. Since Var(Ĉ_{s,k}) ≤ 2{Var(∑_{j=1}^{n_k} w_{s,k,j}^2ê_{k,j}^2) + Var(∑_{r≠t} w_{s,k,r}w_{s,k,t}ê_{k,r}ê_{k,t})}, it is sufficient to reason about the individual variances on the right-hand side. From a previous proposition, we know the form of the first object, Var(∑_{j=1}^{n_k} w_{s,k,j}^2ê_{k,j}^2). Otherwise:

Var(∑_{r≠t} w_{s,k,r}w_{s,k,t}ê_{k,r}ê_{k,t}) = {1 + µ_{n_k(n_k−1)}φ_{s,n_k(n_k−1)}}∑_{r≠t} w_{s,k,r}^2w_{s,k,t}^2·Var(ê_{k,r}ê_{k,t})

For finite samples, we can conservatively expect µ_{n_k(n_k−1)} = n_k(n_k−1) − 1 in most circumstances. Moreover, for simplicity, we can assume that Var(ê_{k,r}ê_{k,t}) ≤ M and φ_{s,n_k(n_k−1)} ≤ C_φ WLOG in addition. Now, let Ĉ_s = ∑_{k=1}^K Ĉ_{s,k}.
We also require $\mathrm{Var}(\hat{C}_s) \to 0$, at least as a typical sufficient condition for establishing that $\hat{C}_s$ is a consistent estimator of the variance model that results from a theorized partition. However, now we have: $\mathrm{Var}(\hat{C}_s) = \{1 + \mu_{K,*}\varphi_{K,*}\}\sum_{k=1}^{K}\mathrm{Var}(\hat{C}_{s,k})$. We must show that this expression tends to zero as $n \to \infty$ to establish consistency. A small handful of cases will be considered. Essentially, we will need to reason about the asymptotic orders of $\mu_{K,*}$, $\mu_{n_k}$, $K$, and $n_k$. Note that we do not need to reason about $\mu_{n_k(n_k-1)}$ since—outside of exceptional situations—it is safe to assume that it has an order of $(n_k - 1)\max(\mu_{n_k}, 1)$. This is because—in the limit—each $\hat{e}_r\hat{e}_t$ can still be expected to be correlated, even if marginally, with other products that share at least one index in the minimum. If we hold one pair of subscripts for some $\hat{e}_r\hat{e}_t$ constant, there are $2(n_k - 1)$ others that meet this criterion. Since $\mu_{n_k(n_k-1)} \leq n_k(n_k - 1) - 1$, the remaining factor should have an order that is equivalent to $\max(\mu_{n_k}, 1)$. Although this has not been rigorously substantiated, we can provisionally assume it for this exploration without much loss. This is also because we truly only need it for Case 3.

Case 1: $n_M \leq Q \in \mathbb{N}$. Under this assumption, $\mathrm{Var}(\hat{C}_{s,k}) = O(n^{-4})$. This follows from the fact that the estimator is a sum of a finite number of $O(n^{-4})$ terms. Hence, $\mathrm{Var}(\hat{C}_s) = O\{\max(n^{-3}\mu_{K,*}, n^{-3})\}$ since $K = O(n)$. Under this scenario, the asymptotic order of $\mathrm{Var}(\hat{C}_s)$ is less than or equal to $O(n^{-2})$ due to the fact that $\mu_{K,*} \leq K - 1$. Therefore, $\hat{C}_s$ is consistent at a rate of $n^{-1}$ for its identified portion of $\mathrm{Var}(B_s)$, even when all clusters are correlated. This also makes estimators of this case a good fit for plug-in Wald estimation, since it is then implied that $n^2\mathrm{Var}(\hat{C}_s) = O\{\max(n^{-1}\mu_{K,*}, n^{-1})\} \to 0$ as $n$ becomes arbitrarily large. This follows from the fact that $\mu_{K,*} \leq \mu_n$ and $\mu_n = o(n)$ by assumption.

Case 2: $n_m \to \infty$, $\mathcal{L}_k$ fully connected for $\forall k \in \mathcal{K}$. For this case, $\mathrm{Var}(\hat{C}_{s,k}) = O(n^{-4}n_k^2\mu_{n_k(n_k-1)}) = O\{(n^{-1}n_k)^4\}$ since the sum of weighted $\hat{e}_r\hat{e}_t$ s.t. $r \neq t$ dominates the variance. Thus, we already know that a sufficient condition for consistency is that $n_k = o(n)$. This is violated when $K = O(1)$. Assuming that $n_k = o(n)$ implies that $\mathrm{Var}(\hat{C}_s) = O\{(n^{-1}n_M)^4\cdot K\cdot\max(\mu_{K,*}, 1)\}$. Thus, a sufficient condition for consistency is that $n_M^3\max(\mu_{K,*}, 1) = o(n^3)$. This condition is fulfilled since $n$ has an asymptotic order that is equivalent to $Kn_M$. Utilization of a Wald-like statistic, however, requires that $n_M^3\max(\mu_{K,*}, 1) = o(n)$, which is obviously non-trivial. These statements are easier to grasp when cluster sizes are equal as a special case. Then $n = Kn_k$ for all $k$ and the general expression for the order of $\mathrm{Var}(\hat{C}_s)$ is $O\{K^{-3}\max(\mu_{K,*}, 1)\}$. Since this implies $\mathrm{Var}(\hat{C}_s) \leq C_*K^{-2}$ for some $C_* \in \mathbb{R}^+$ even when all clusters are correlated, it is a consistent estimator of its portion of the variance. Again, $n^2K^{-3}\max(\mu_{K,*}, 1)$ converging to zero is sufficient for Wald usage. Using substitution, we can translate this expression to $K^{-1}n_k^2\max(\mu_{K,*}, 1)$. This means that one sufficient condition for Wald usage under this specific setup is that $\max(\mu_{K,*}, 1)$ is $O(1)$ and that $K^{-1}n_k^2 \to 0$, which is more restrictive and less forgiving of invalid specifications.

For an example of when this becomes problematic, consider a social network analysis context s.t. the total sample size of possible relationships in a directed network is $N = n(n-1)$, where $n$ is the number of actors sampled.
Researchers often choose to cluster on actor identity. Then $K = n$ and $n_k = n - 1$. Obviously, then, $K^{-1}n_k = O(1)$, and so $K^{-1}n_k^2$ diverges. As a result, the denominator of the Wald statistic would fail to converge to a constant; it would remain random in behavior.

Case 3: $n_m \to \infty$, $\mu_{n_k} = o(n_k)$, $K = O(1)$. This case necessitates that $n_k = O(n)$. Under the condition that $\mu_{n_k(n_k-1)}$ has the same asymptotic order as $(n_k - 1)\max(\mu_{n_k}, 1)$, it is then implied that $\mathrm{Var}(\hat{C}_{s,k}) = O\{n^{-1}\max(\mu_{n_k}, 1)\}$ and $\mathrm{Var}(\hat{C}_s) = O\{n^{-1}\max(\mu_{n_k}, 1)\}$. Therefore, although $\hat{C}_s$ is consistent, it will converge at a rate that also prevents its straightforward application to Wald statistics. This is because $n^2\mathrm{Var}(\hat{C}_s)$ does not converge to zero. To see this, first consider the sub-case s.t. $\max(\mu_{n_k}, 1) = \mu_{n_k}$. When this is true, it is also true that $\mu_{n_k} = O(\mu_n)$. As a consequence, $n^2\mathrm{Var}(\hat{C}_s) = O(n\mu_n) \to \infty$ and $\hat{C}_s$ does not converge to a constant when used to construct a test statistic; it remains random. This situation obviously does not change for the remaining sub-case.

2.3.1.1 A Problem of Choice

Let us recall our central problem to re-frame the last investigation. We have an additive estimator $w_sY$. However, although we believe that it is consistent, we have no access to $\mathcal{L}$ or $\mathcal{D}$ and thus have limited, imperfect—or even incorrect—information pertaining to how to model $\mathrm{Var}(w_sY)$. There are two important situations to now consider: the situation s.t. it is believed a central limit theorem holds and the one where it is believed that one does not. We temporarily consider the former. If the researcher believes that a central limit theorem holds, then we have two non-mutually exclusive choices for Wald-like confidence sets: choose a defensible correction via Proposition 2.1 or use cluster-robust variance estimation under the auspices of a user-specified partition. Hitherto, the primary strategy is to use $\hat{C}_s$. We know from the above analysis that Case 1 estimators are the most robust against invalid partition specifications. However, since they require small cluster sizes, a lot of variability is likely to be missed even if asymptotic normality holds, and a correction will still need to be applied. Estimators in the spirit of Case 2 are also feasible choices, but their 'safe' Wald-like use is predicated upon restrictive conditions. Put shortly, under A2, we require $n_k$ to be $o(n^{1/3})$ when the number of correlated clusters is asymptotically bounded. If $\mu_{K,*}$ diverges with sample size, then each $n_k$ must grow at an even slower, but ultimately unknown rate. Still, even if these estimators can be validly used for plug-in Wald statistics, there is no guarantee that further correction is unnecessary. Asymptotically expanding cluster sizes still does not guarantee that $\mu_K = 0$.

Recall, however: the only explored case s.t. $\hat{C}_s$ was inconsistent for its portion of the variance was Case 2 with $K = O(1)$. Otherwise, it is a consistent estimator of the target quantity identified by the theoretical variance model. The rate of convergence might limit its direct use as a plug-in estimator for confidence sets, but it can still be used to investigate the variance structure. Recall that Proposition 2.2 (in a circumstance of copious dependencies) almost guarantees that $\varphi$ is non-negative. Condition on this supposition, then, and note that $w_sY = \sum_{k=1}^{K_1}T_{s,k} = \sum_{k=1}^{K_2}T_{s,k,*}$ for any two partitions into $K_1$ and $K_2$ groups. Proposition 2.1 then allows us to state that $\{1 + \mu_{K_1}\varphi_{K_1}\}\sum_{k=1}^{K_1}\mathrm{Var}(T_{s,k}) = \{1 + \mu_{K_2}\varphi_{K_2}\}\sum_{k=1}^{K_2}\mathrm{Var}(T_{s,k,*})$.
It then follows WLOG that $\{\sum_{k=1}^{K_2}\mathrm{Var}(T_{s,k,*})\}^{-1}\sum_{k=1}^{K_1}\mathrm{Var}(T_{s,k}) < 1$ implies that $\mu_{K_2}\varphi_{K_2} < \mu_{K_1}\varphi_{K_1}$. The latter inequality, of course, implies that the $K_2$ partition possesses less missed variability and is more robust. Unsurprisingly, then, one should always aim to choose the partition that results in the highest estimated variance. For ample sample sizes, one can use differences such as $\hat{C}_{s,2} - \hat{C}_{s,1}$ for this purpose. Nevertheless, this strategy does not guarantee a correct choice of partition—and still—a correction might be required.

One strategy that synthesizes both approaches is to use various theories of dependence, and hence partitions, to compare $\hat{C}_s$ estimators from Cases 1, 2, and 3. If $n$ is large enough, this should—at minimum—give the researcher imperfect snapshots of the magnitudes of portions of the missed variance. She can then condition on this new (incomplete) knowledge and use it to inform a choice of correction via Proposition 2.1 after making use of the $\hat{C}_s$ estimator that yields the highest estimate. If it is believed that a $\hat{C}_s$ estimator from Case 2 possesses a poor rate of convergence, a Case 1 or Proposition 2.6 type estimator is more appropriate since they possess a better rate of convergence and can be 'treated' as constants more safely for ample sample sizes. The information acquired from the comparisons of cluster-robust estimators can still be factored into the choice of a deterministic correction. For the alternative situation—the one s.t. dependencies are so thick that a central limit theorem fails to hold—the variance estimators can still be used for exploratory ends insofar as at least A3 is true. A small sketch of the comparison strategy follows.
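The following is a small R sketch of this comparison, reusing the hypothetical cluster_var_hat() helper from the earlier sketch. The AR(1) residual stand-ins and both partitions are assumptions made purely for illustration: under positive serial dependence, the coarser partition typically captures more covariance and therefore tends to yield the larger estimate.

```r
# Compare C_hat_s under two candidate partitions; prefer the larger estimate,
# since a smaller total signals more missed variability in the sense that
# mu_K2 * phi_K2 < mu_K1 * phi_K1 for the higher-variance partition.
set.seed(2)
n <- 600
w <- rep(1 / n, n)
e <- as.vector(arima.sim(list(ar = 0.5), n))  # dependent stand-in residuals

g1 <- rep(seq_len(60), each = 10)  # finer partition: K1 = 60 clusters
g2 <- rep(seq_len(20), each = 30)  # coarser partition: K2 = 20 clusters
c1 <- cluster_var_hat(w, e, g1)
c2 <- cluster_var_hat(w, e, g2)
c2 - c1  # a positive difference favors the coarser partition here
```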
2.4 Finite Sample Inference with Dense Dependence

Recall from Section 2.2: for any additive statistic, $\mu_D\varphi_D = \mu_n\varphi_n$ and $\varphi_D = |\mathcal{L}_D|^{-1}|\mathcal{L}|\cdot\varphi_n$. Hence, if $\mu_n = o(\mu_D)$, then $\varphi_D \to 0$. This offers a new perspective on a known possibility. Statistical dependencies can diverge in number, but $\mu_n$ can be $o(n)$ or even bounded. When the latter is stipulated, this simply means that $\mu_D(n)$ and $\varphi_D(n)$ must be two such functions s.t. this requirement is met. We do not need to know the details of these functions otherwise. Their importance is that they inform us—even in the presence of dense statistical dependencies that prevent convergence to normality—that we can still reason about mean convergence for linear statistics and explore strategies for inference. This is what we accomplish here. We will be working under the supposition that $\mu_D \to \infty$, or even that $\mu_D$ is quite possibly $O(n)$. We will prefer that $\mu_n = o(n)$, but we will see that this is not a strict requirement. Provided this setup, we show here that comfortably bounding $\mu_n\varphi_n$ for a given sample is feasibly sufficient for constructing cogent confidence sets through the use of Hoeffding's or Bernstein's inequalities. To do this, we first introduce a type of random variable. The reason we discuss these variables is that they permit us to extend a handful of known concentration inequalities to circumstances s.t. $\mu_D = O(n)$.

2.4.1 U Random Variables

Defining this class of random variable requires familiarity with the average functional value from basic analysis. Say $R_i = \int_{\mathbb{R}} 1_{y_i\in S_i}\,dy_i$, where $S_i$ is the support of the cumulative distribution function (CDF) of $Y_i$, say $F(y_i)$. For counting measures, $R_i = |S_i|$. When $Y = g(X_1,\dots,X_k)$ for some function $g$, we can generalize this WLOG to $R_x = \int_{\mathbb{R}^k} 1_{(x_1,\dots,x_k)\in S^k}\,dx_1\cdots dx_k$, where $S^k \subseteq S_1\times\cdots\times S_k$. Given this setup, and temporarily abandoning subscripts for readability, the average function value is then $\mathrm{Av}(Y) = R^{-1}\int_S y\,dy$. Although we can also define $\mathrm{Av}_x(Y) = R_x^{-1}\int_{S^k} g(x_1,\dots,x_k)\,dx_1\cdots dx_k$, the former is mostly the focus here for exposition. A random variable $Y$ is in the U class if and only if $EY = \mathrm{Av}(Y)$. Similarly, it is said to be in the U class w.r.t. $X$ if $EY = \mathrm{Av}_x(Y)$. Finally, a random variable is said to be regular if its associated support is a single interval of real numbers or a set of integers $\{m, m+1, \dots, M-1, M\}$ s.t. no integer is missing between $m$ and $M$. In this case, it is easy to verify that $\mathrm{Av}(Y) = 2^{-1}(M + m)$, and membership in the U class implies that $EY = 2^{-1}(M + m)$.

A set of basic properties for this class is proven in the supplementary materials. Here, it is sufficient to note only a small handful of them. First, if $Y$ is a regular random variable s.t. $EY = 0$, it is a member of the U class if and only if $m = -M$, i.e., the CDF of $Y$ is defined on symmetric support. Moreover, a regular and continuous random variable $Y$ is in the U class if and only if $\int_S F(y)\,dy = \int_S S(y)\,dy$, where $S(y) = 1 - F(y)$, and hence equivalently if $M - EY = EY - m$. Pertinently, although random variables with symmetric probability distributions are U random variables, they are only a special case: U random variables can be asymmetric. Lastly, it can easily be shown that for any constant $c$, if $Y \in$ U, i.e., it is in the class of U random variables, then $cY \in$ U and $Y + c \in$ U.

These variables are quite common. For instance, we have already established that all regular and symmetric bounded random variables are in this group. If $Y$ has a density $f(y) \propto \sigma_Y$ and $\sigma_Y \to 0$ as $n \to \infty$, then $Y$ behaves more and more like a U random variable as $n$ becomes arbitrarily large. We will also see that the additive error variables associated with density or mass approximation are also elements when the approximating function is correct on average. More important for our purposes, however, is the connection between U status and the common linear model: $Y = x\beta + \varepsilon$. Two conditions are ubiquitously supposed for its usage: (1) $E(\varepsilon|x) = 0$, and (2) $\varepsilon_i \sim N(0, \sigma_i^2)$ for $\forall i$. The latter assumption, however, is simply a fecund fiction. In reality, no observable error distribution for a $Y_i$ on finite support could ever truly be normally distributed. This common assumption is best understood as a mathematically convenient statement that results in (hopefully) negligible error. For instance, we could replace the traditional assumption with the assertion that each $\varepsilon_i$ possesses a normal distribution that has been symmetrically truncated at a value extreme enough to leave out negligibly small probabilities. Such an assertion would be empirically isomorphic to the traditional assumption—and importantly—it would result in each $\varepsilon_i$ being a U random variable.

Therefore, it is safe to work in a universe s.t. $w_sY - \beta_s = \sum_{i=1}^n w_{s,i}\varepsilon_i$ is a sum of U functions. Positing that $\{\varepsilon_i\}_{i\in I}$ is a set of regular and symmetric U random variables is also less restrictive than assuming truncated normality, since many more distributions meet these criteria. In the fixed regression setting, U status can be feasibly verified with the same plot that is used to check for violations of linearity. Namely, we would look at the residual versus fitted plot to confirm that the scatter of points is mean-zero at any location of the graph, and that the points are randomly dispersed between two bands that are roughly equidistant from the horizontal axis, as in the sketch below.
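As a concrete illustration of this check, consider the following R sketch. The simulated model and its uniform errors are assumptions chosen only so that the residual versus fitted plot displays the behavior described above; nothing here is specific to the dissertation's own data.

```r
# Hypothetical diagnostic for U status in a fixed regression: residuals should
# scatter mean-zero at every fitted value, inside bands roughly equidistant
# from the horizontal axis.
set.seed(3)
x <- runif(300, -5, 5)
y <- 20 + 10 * x + runif(300, -8, 8)  # regular, symmetric bounded errors
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
abline(h = c(-1, 1) * max(abs(resid(fit))), col = "grey")  # rough bands
```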
Under the null assumption of a well-specified model, this check is sufficient. More generally, when the $Y_i$ are continuous and identically distributed, we could verify that the behavior of the empirical CDF matches the sum-symmetric behavior described in the previous paragraph. Otherwise, for the discrete case, the empirical probability mass function could be plotted against the observed values to see if it demonstrates a sum-symmetric reallocation of mass from the area around the mode of the distribution to the tails.

2.4.2 Finite Sample Inference for U Random Variables

This section accomplishes two primary objectives. One, it introduces and discusses an additional regularity condition that is required for further theoretical work. Two, it utilizes this condition to extend the Hoeffding and Bernstein inequalities to sums of regular and symmetric U random variables under dense dependence. For this section, $w$ is any $1\times n$ vector of constants. Moreover, $\mathrm{Av}_*(\cdot)$ indicates a functional average taken over a Cartesian product of sets, i.e., over $S^n_* = S_1\times\cdots\times S_n$. Since the random variables being considered are bounded, it is helpful to know that all referenced objects exist for finite $n$.

A5. Let $\varepsilon$ be an $n\times 1$ vector of bounded random variables s.t. $E\varepsilon = 0$. For all $s > 0$, it is then true that

$\max(E(\exp\{s\cdot w\varepsilon\}),\ E(\exp\{s\cdot(-w\varepsilon)\})) \leq \mathrm{Av}_*(\exp\{s\cdot w\varepsilon\})$

This new condition places an implicit bound on the behavior of the distributions of the marginal random variables and their joint distribution simultaneously. This differs from traditional approaches, which mostly place restrictions on the dependency structure. Put succinctly, for A5 to be true, the allocation of mass or density has to be somewhat balanced for the marginal distributions, and the joint distribution has to be biased away from elements of the joint support that produce larger values of the target statistic in addition. Mutual independence alone does not guarantee A5. To see this, assume that $E(\exp\{sw_i\varepsilon_i\}) > \mathrm{Av}(\exp\{sw_i\varepsilon_i\})$ for $\forall i$. Then:

$E(\exp\{s\cdot w\varepsilon\}) = \prod_{i=1}^n E(\exp\{sw_i\varepsilon_i\}) > \prod_{i=1}^n \mathrm{Av}(\exp\{sw_i\varepsilon_i\}) = \mathrm{Av}_*(\exp\{s\cdot w\varepsilon\})$

It follows that A5 is false. Pertinently, this does not make A5 more restrictive than mutual independence in general. It simply informs us that the conditions that would supply it are somewhat complex when there are no constraints placed on the marginal distributions. The informal description in the previous paragraph hints that unimodal and symmetric marginal distributions prevent this quandary from occurring under mutual independence. This indeed turns out to be true. Before showing this, we prove a pivotal lemma. Note that the $\implies$ symbol is readable as 'implies.'

Lemma 2.1. Let $Z \in$ U be regular and continuous s.t. $EZ = 0$ and $\max(S) = M$. Let $s > 0$ and $w \in \mathbb{R}$ be arbitrary constants. Then $\mathrm{Av}(\exp\{swZ\}) \leq \exp\{24^{-1}s^2w^2R^2\}$.

Proof. We note that $\int_S z^k\,dz = 0$ for all odd integers $k$ when $Z$ is a continuous and regular U random variable s.t. $EZ = 0$. This is because $0 = 2^{-1}(M + m)$, which implies that $m = -M$. Hence $\int_S z^k\,dz = (k+1)^{-1}\{M^{k+1} - (-M)^{k+1}\} = 0$ since $k + 1$ is even. This implies that $\mathrm{Av}(Z^k) = 0$ for all odd integers $k$. Additionally, when $k$ is even, observe that $\int_S z^k\,dz = (k+1)^{-1}2M^{k+1}$ and hence $\mathrm{Av}(Z^k) = (k+1)^{-1}M^k$. Let $2\mathbb{N}_0 = \{n \in \mathbb{N} \mid n = 2k,\ k \in \mathbb{N}\cup\{0\}\}$.
From here, for any constants $s > 0$ and $w \in \mathbb{R}$:

$\exp\{swZ\} = \sum_{i=0}^{\infty}(i!)^{-1}s^iw^iZ^i \implies \mathrm{Av}(\exp\{swZ\}) = \sum_{i=0}^{\infty}(i!)^{-1}s^iw^i\mathrm{Av}(Z^i) = \sum_{i\in 2\mathbb{N}_0}(i!)^{-1}(i+1)^{-1}s^iw^iM^i = \sum_{i=0}^{\infty}\{(2i)!\}^{-1}(2i+1)^{-1}w^{2i}s^{2i}M^{2i}$

Now, we compare two sequences: $S_1 = (2k)!\cdot(2k+1)$ and $S_2 = k!\cdot 6^k$ for $k \in \mathbb{N}$. We will prove that $S_1 \geq S_2$ for $\forall k \in \mathbb{N}$ by induction. Since $S_1(0) = S_2(0) = 1$ and $2(2k+1) \geq 6$ for $k \geq 1$, the base cases are established. Now, suppose $(2n)!\cdot(2n+1) \geq n!\cdot 6^n$ for $n \geq 1$. Then, since $2(2n+1) \geq 6$:

$2(2n+1)\cdot(2n)!\cdot(2n+1) \geq n!\cdot 6^{n+1} \implies 2(n+1)(2n+1)\cdot(2n)!\cdot(2n+1) \geq (n+1)!\cdot 6^{n+1} \implies \{2(n+1)\}!\cdot(2n+1) \geq (n+1)!\cdot 6^{n+1} \implies \{2(n+1)\}!\cdot\{2(n+1)+1\} \geq (n+1)!\cdot 6^{n+1}$

Therefore, $S_1^{-1} = \{(2k)!\cdot(2k+1)\}^{-1} \leq S_2^{-1} = \{k!\cdot 6^k\}^{-1}$ for all $k \in \mathbb{N}$. Hence:

$\mathrm{Av}(\exp\{swZ\}) = \sum_{i=0}^{\infty}\{(2i)!\}^{-1}(2i+1)^{-1}w^{2i}s^{2i}M^{2i} \leq \sum_{i=0}^{\infty}(i!)^{-1}6^{-i}w^{2i}s^{2i}M^{2i} = \sum_{i=0}^{\infty}(i!)^{-1}\{6^{-1}s^2w^2M^2\}^i = \exp\{6^{-1}s^2w^2M^2\}$

Since $M = 2^{-1}R$ for these U variables, $\mathrm{Av}(\exp\{swZ\}) \leq \exp\{24^{-1}s^2w^2R^2\}$. ■
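A quick numerical check of the lemma is possible because, for a mean-zero uniform average on $[-M, M]$, $\mathrm{Av}(\exp\{swZ\})$ has the closed form $\sinh(swM)/(swM)$, the same form later used for $\mathrm{Av}_*$ in Section 2.6. The R sketch below simply confirms the bound on a grid; all values are illustrative.

```r
# Numerical check of Lemma 2.1: sinh(swM)/(swM) <= exp(s^2 w^2 R^2 / 24),
# where R = 2M for a regular, continuous, mean-zero U variable on [-M, M].
av_exp <- function(u) ifelse(u == 0, 1, sinh(u) / u)
M <- 2; R <- 2 * M; w <- 0.5
s <- seq(0.1, 5, by = 0.1)
u <- s * w * M
all(av_exp(u) <= exp(s^2 * w^2 * R^2 / 24))  # TRUE across this grid
```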
Proposition 2.8. Suppose $Z \in$ U is a regular and continuous random variable s.t. $EZ = 0$. If $EZ^k \leq \{2^k(k+1)\}^{-1}R^k$ for all even integers $k$ and $EZ^k \leq 0$ for odd integers $k$, then $E(\exp\{sZ\}) \leq \mathrm{Av}(\exp\{sZ\})$ for any constant $s > 0$. If $Z$ has a symmetric probability distribution, $s$ can be any constant.

Proof. From Lemma 2.1, we know that $\mathrm{Av}(\exp\{sZ\}) = \sum_{i\in 2\mathbb{N}_0}(i!)^{-1}\{2^i(i+1)\}^{-1}s^iR^i$. Therefore:

$E(\exp\{sZ\}) = \sum_{i=0}^{\infty}(i!)^{-1}s^iE(Z^i) \leq \sum_{i\in 2\mathbb{N}_0}(i!)^{-1}s^iEZ^i \leq \sum_{i\in 2\mathbb{N}_0}(i!)^{-1}\{2^i(i+1)\}^{-1}s^iR^i = \mathrm{Av}(\exp\{sZ\})$

The expansion of the proposition to the case of symmetric probability distributions is quick. Under this case, all odd moments are also zero. Therefore, the first inequality in the latter sequence of logic becomes a strict equality. Since only even indexes are left, $s$ can be any constant. ■

Lemma 2.2. If $Z$ is a symmetric, regular, and unimodal random variable s.t. $EZ = 0$ and $\max(S) = M$, then $EZ^k \leq \mathrm{Av}(Z^k)$ for all even integers $k$ when $Z$ is continuous. If $Z$ is discrete and symmetric with support $S = \{-c_M, -c_{M-1}, \dots, 0, \dots, c_{M-1}, c_M\}$ w.r.t. constants $c_1 < c_2 < \cdots < c_{M-1} < c_M$, then $EZ^k \leq \{|S| - 1\}^{-1}|S|\cdot\mathrm{Av}(Z^k)$.

Proof. We provide a proof for the continuous case first. By definition, $f(z)$ is increasing on $[-M, 0]$ and decreasing on $[0, M]$ if $Z$ is unimodal and symmetric. However, $z^k$ is decreasing on $[-M, 0]$ and increasing on $[0, M]$ when $k$ is an even integer. WLOG, we reason w.r.t. $[-M, 0] \subset S$. By Chebyshev's integral inequality, it is then true that $\int_{-M}^{0}z^kf(z)\,dz \leq M^{-1}\int_{-M}^{0}z^k\,dz\cdot\int_{-M}^{0}f(z)\,dz$ since $z^k$ and $f(z)$ have opposite monotonicities on $[-M, 0]$. Furthermore, since $\int_{-M}^{0}f(z)\,dz = 2^{-1}$ and $(-c)^kf(-c) = c^kf(c)$ for $\forall c \in S$ when $k$ is an even integer and $f(z)$ is symmetric, $2\int_{-M}^{0}z^kf(z)\,dz = EZ^k$. It is also obviously the case that $2\int_{-M}^{0}z^k\,dz = \int_{-M}^{M}z^k\,dz$ since $k$ is an even integer. Therefore, the displayed inequality implies that $EZ^k \leq (2M)^{-1}\int_{-M}^{M}z^k\,dz = \mathrm{Av}(Z^k)$.

Now, for the discrete case, we will say that $S = \{-M, -(M-1), \dots, M-1, M\}$ WLOG and that $\sum_{z=-M}^{-1}f(z) = \sum_{z=1}^{M}f(z)$, and therefore that $2\sum_{z=-M}^{-1}f(z) = 1 - f(0)$. Once again, we also note that $f(z)$ is increasing on $\{-M, \dots, -1\}$ and $z^k$ is decreasing on this same set for even integers $k$. Thus, by the Chebyshev sum inequality, $\sum_{z=-M}^{-1}z^kf(z) \leq M^{-1}\sum_{z=-M}^{-1}z^k\cdot\sum_{z=-M}^{-1}f(z)$, which implies $2\sum_{z=-M}^{-1}z^kf(z) \leq M^{-1}\cdot 2\sum_{z=-M}^{-1}z^k\cdot\{2^{-1} - 2^{-1}f(0)\}$. Since $z\cdot f(z) = 0$ and $z^k = 0$ when $z = 0$, it is then implied that $EZ^k \leq (2M)^{-1}\sum_{S}z^k\cdot\{1 - f(0)\} \leq (2M)^{-1}\sum_{S}z^k = \{|S| - 1\}^{-1}|S|\cdot\mathrm{Av}(Z^k)$. The last sequence of logic is true because $|S| = 2M + 1$ and $f(0) > 0$. ■

Corollary 2.2. Suppose $Z$ is a regular, symmetric U random variable s.t. $EZ = 0$ and $\max(S) = M$. If $Z$ is continuous, then $E(\exp\{sZ\}) \leq \mathrm{Av}(\exp\{sZ\})$ for any $s \in \mathbb{R}$. If $Z$ is discrete as defined in Lemma 2.2, then $E(\exp\{sZ\}) \leq (2M)^{-1}(2M+1)\cdot\{\mathrm{Av}(\exp\{sZ\}) - (2M+1)^{-1}\}$ for any $s \in \mathbb{R}$.

Proof. Let $s$ be arbitrary. For the continuous case, Lemma 2.2 and Proposition 2.8 imply the result. For the discrete case, we again note that $\mathrm{Av}(Z^k) = 0$ when $k$ is an odd integer because $\sum_{z=-M}^{-1}z^k = -\sum_{z=1}^{M}z^k$ and $z^k = 0$ when $z = 0$. From the premises and Lemma 2.2, we know that $E(\exp\{sZ\}) \leq 1 + (2M)^{-1}(2M+1)\cdot\sum_{i=1}^{\infty}(i!)^{-1}s^i\mathrm{Av}(Z^i)$. Then:

$E(\exp\{sZ\}) \leq 1 + (2M)^{-1}(2M+1)\cdot\{\mathrm{Av}(\exp\{sZ\}) - 1\}$

The statement of interest arrives via basic algebra and is therefore omitted. ■

Proposition 2.8 and the subsequent statements establish that symmetric and regular U random variables—precisely the type of variable that often arises in linear regression settings, as aforementioned—are good candidates for A5. Again, this does not mean that other types of marginal distributions that are not strictly U, regular, or symmetric are not. It is possible to imagine how different dynamics between marginal and joint distributions could still deliver A5. However, a class of variables that can serve as a basic case under mutual independence provides an anchor. Constraints on the joint distribution are next.

From here, the use of A5 can be justified via recourse to another set of identities concerning U variables. Unfortunately—in the continuous case—these identities require the existence of a high-dimensional density. Again say $Z$ is an arbitrary bounded random variable with mass or density function $f(z)$ and let $L = f(z_1,\dots,z_n)$ denote a joint mass function or density w.r.t. $Y = g(Z_1,\dots,Z_n)$ for some function $g$. The two identities are:

$\mathrm{Av}(Z) = EZ + R^{-1}\sigma_{Z,\,f^{-1}(Z)}$ (2.1)

$\mathrm{Av}\{g(Z_1,\dots,Z_n)\} = EY + R_z^{-1}\sigma_{g(Z_1,\dots,Z_n),\,L^{-1}}$ (2.2)

For random variables on finite support, inverse mass functions—both joint and marginal—will always exist for finite $n$. Remember: the support of a marginal or joint distribution is the closure of the set of values s.t. $f(z) > 0$ and $L > 0$ respectively. Hence, $L^{-1}$ and $f^{-1}(z)$ are well-defined for mass functions. They are also well-defined for the marginal and joint densities of bounded random variables if the basic objects exist: a precondition that requires the absolute continuity of the cumulative probability functions. Immediately, we can see that a variable $Z$ is in the U class if and only if it is uncorrelated with its own inverse mass or density. Just the same w.r.t. Eq. (2.2), we can see that $EY \leq \mathrm{Av}\{g(Z_1,\dots,Z_n)\}$ if and only if $\sigma_{g(Z_1,\dots,Z_n),\,L^{-1}} \geq 0$. This of course means that as $g(Z_1,\dots,Z_n)$ takes larger values, $L^{-1}$ tends to take larger ones as well. Informally, this indicates that $L$ tends to take smaller values as $g(Z_1,\dots,Z_n)$ takes larger ones. Hence, at least in a linear sense, the joint mass function or density tends to place smaller amounts of probability or density on the $(z_1,\dots,z_n) \in S^n$ that produce large values of $Y$.
In other words, if the joint density or mass function displays any behavior akin to the concentration of measure, and the marginal distributions behave in the manner previously stated, one can expect $EY \leq \mathrm{Av}\{g(Z_1,\dots,Z_n)\}$ to hold as a condition. Recall that a probability measure is said to concentrate if it places most of its measurement on a subset of values in the support that neighbor the expected value of the random variable. If the values in an arbitrarily small neighborhood of the expected value continue to accumulate measure with sample size, the random variable converges in probability to a constant. Vitally, this suggests that A5 holds when—together with the aforementioned informal restrictions on the marginal distributions—A3 is true (1), and it is also the case that $L$ exists (2) and $S^n$ remains rectangular for arbitrarily large (but finite) samples (3).

Consider the first property again: $\mu_n = o(n)$. When the variances of the $\varepsilon_i$ are finite, $\mu_n = o(n)$ implies that $w_n\varepsilon$ converges in probability to zero. Hence, it will be the case that $\sum_{i=1}^n w_i\varepsilon_i = g(\varepsilon_1,\dots,\varepsilon_n) \approx 0$ for sufficiently sized $n$ with high probability, which implies that the $(e_1,\dots,e_n) \in S^n$ that distance $w_n\varepsilon$ from zero in higher magnitudes will be afforded smaller density, at least overall in some sense, as $n$ gets larger. This, in turn, affords more of a likelihood that $E(\exp\{sw\varepsilon\})$ is bounded by the target value. Again, the second property (and bounded supports) allows Eq. (2.2) and the referenced objects to exist. The third property is important to explore in more detail. Essentially, it states that the mutual dependence between the $\varepsilon_i$ does not remove all density from some $(e_1,\dots,e_n) \in S^n_*$, even as $n$ gets large. Importantly, this does not mean that each $(e_1,\dots,e_n) \in S^n_*$ needs to be afforded a non-negligible density for all sample sizes. The density can become arbitrarily small. It simply needs to be non-zero for all $n$ considered. If this holds, then $S^n = S^n_*$ and thus $\mathrm{Av}(\exp\{sw\varepsilon\}) = \mathrm{Av}_*(\exp\{sw\varepsilon\})$. Importantly, we do not need the last two of these properties to hold in the true limit. We only need them to hold for sufficiently sized, but ultimately finite, $n$.

For clarity, we explore yet another characterization. Define a function $\eta(z) = f(z) - h(z)$, where $f(z)$ and $h(z)$ are two densities defined on the same support WLOG. Then it is apparent that $\int_S \eta(z)\,dz = 0$. Now rearrange and multiply by $z$: $z\cdot f(z) = z\cdot h(z) + z\cdot\eta(z)$. Integrating across, we arrive at $EZ = E_hZ + \int_S z\cdot\eta(z)\,dz$, where $E_h$ indicates an expectation taken w.r.t. $h(z)$. Another set of identities results when $h(z) = R^{-1}$, the uniform density. Then the above becomes $EZ = \mathrm{Av}(Z) + \int_S z\cdot\eta(z)\,dz$ and it follows that $\int_S z\cdot\eta(z)\,dz = -R^{-1}\sigma_{Z,\,f^{-1}(Z)}$. Just the same, $\int_{S^n} g(z_1,\dots,z_n)\cdot\eta(z_1,\dots,z_n)\,dz_1\cdots dz_n = -R_z^{-1}\sigma_{g(Z_1,\dots,Z_n),\,L^{-1}}$. Therefore, U status is directly related to $\eta(z)$ and $g(z)$ being orthogonal. In the supplementary materials, it is also proven that $\eta(Z) \in$ U is a sufficient condition for $Z \in$ U when $h(z) = R^{-1}$. Although this condition seems abstract, it simply means that $E\{\eta(Z)\} = 0$, i.e., that the expected value of the 'approximating' density matches the expected value of the true one. If $EL \to R_z^{-1}$ as $n$ grows, this is also sufficient for $E\{g(Z_1,\dots,Z_n)\} \to \mathrm{Av}\{g(Z_1,\dots,Z_n)\}$ and hence A5 when $S^n_* = S^n$.

2.4.3 Main Results

Before using Lemma 2.1 to derive a sharper version of Hoeffding's inequality, we first extend classical results. A sub-U random variable is one such that $\mathrm{Av}(Z) \leq EZ$.
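For a quick numerical illustration of this definition, note that on the support $[0, 1]$ we have $\mathrm{Av}(Z) = 1/2$, so sub-U status reduces to $EZ \geq 1/2$. The Beta examples below are assumptions chosen only to display one member and one non-member of the class.

```r
# Sub-U check on [0, 1]: Av(Z) = 1/2, so Z is sub-U iff EZ >= 1/2.
av <- 0.5
ez <- function(a, b) integrate(function(z) z * dbeta(z, a, b), 0, 1)$value
c(beta_5_2 = ez(5, 2) >= av,  # EZ = 5/7: left-skewed, sub-U
  beta_2_5 = ez(2, 5) >= av)  # EZ = 2/7: right-skewed, not sub-U
```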
Lemma 2.3 (Extension of Hoeffding's lemma). Let $Z$ be a random variable defined on bounded support $S$ with minimum $m$ and maximum $M$. Then for any $s > 0$, $\mathrm{Av}(\exp\{sZ\}) \leq \exp\{s\mathrm{Av}(Z) + 8^{-1}s^2R^2\}$. If $EZ = 0$ and $Z$ is sub-U, $\mathrm{Av}(\exp\{sZ\}) \leq \exp\{8^{-1}s^2R^2\}$.

Proof. This proof follows the logic of the proof of Hoeffding's lemma, since $\mathrm{Av}(\cdot)$ is a linear operator that is monotonic over inequalities. Since $\exp\{sz\}$ is a convex function of $z$, for all $z \in S$:

$\exp\{sz\} \leq R^{-1}((M - z)\exp\{sm\} + (z - m)\exp\{sM\})$

Then, denoting $\mathrm{Av}(Z) = \mathrm{Av}$:

$\mathrm{Av}(\exp\{sZ\}) \leq R^{-1}((M - \mathrm{Av})\exp\{sm\} + (\mathrm{Av} - m)\exp\{sM\})$

Now, specify a function $g(x) = R^{-1}mx + \log((M - \mathrm{Av}) + (\mathrm{Av} - m)\exp\{x\}) - \log(R)$. Then it is easily demonstrable that $g(0) = 0$, $g'(0) = R^{-1}\mathrm{Av}$, and $g''(x) \leq 4^{-1}$ for $\forall x$. Hence, utilizing a Taylor expansion around $0$, since $g(x) = g(0) + xg'(0) + 2^{-1}x^2g''(x_*)$ for some $x_*$ between $0$ and $x$, it is true that $g(x) \leq R^{-1}x\mathrm{Av} + 8^{-1}x^2$. This implies that $g(sR) \leq s\mathrm{Av} + 8^{-1}s^2R^2$. Hence, $\mathrm{Av}(\exp\{sZ\}) \leq \exp\{s\mathrm{Av} + 8^{-1}s^2R^2\}$. Now, since $s > 0$, when $EZ = 0$ and $Z$ is sub-U, it is true that $\mathrm{Av}(Z) \leq 0$. Therefore, $\mathrm{Av}(\exp\{sZ\}) \leq \exp\{8^{-1}s^2R^2\}$. ■

Theorem 2.1 (Extension of Hoeffding's Inequality). Let $\varepsilon_1,\dots,\varepsilon_n$ be defined on bounded supports $S_1,\dots,S_n$ such that, for an arbitrary $i \in \{1,\dots,n\}$, $m_i \leq \varepsilon_i \leq M_i$ with probability one and $E\varepsilon_i = 0$. Let $w$ be a vector of constants s.t. $S_n = w\varepsilon = \sum_{i=1}^n w_i\varepsilon_i$. Suppose A5 for some fixed $n$ and let $\tau > 0$ be arbitrary. Then $\mathrm{Av}(\varepsilon_i) \leq 0$ for $\forall i$ and $w > 0$ together imply that $\Pr(|S_n| > \tau) \leq 2\exp\{-(\sum_{i=1}^n w_i^2R_i^2)^{-1}2\tau^2\}$. Furthermore, if there exists some $N \in \mathbb{N}$ such that the above conditions hold for $\forall n > N$, then the stated inequality is true for all $n > N$, provided all objects exist.

Proof. Again, note that for any $s > 0$, all integrals exist—including those of A5—since each $\varepsilon_i$ is a bounded random variable. This can easily be shown with an extended application of Fubini's theorem. Let $\tau$ and $s$ be arbitrary real numbers such that $\tau > 0$ and $s > 0$. We will proceed with the statement $\Pr(S_n > \tau)$. Then, by Markov's inequality and A5:

$\Pr(\exp\{sS_n\} > \exp\{s\tau\}) \leq E(\exp\{sS_n\})\exp\{-s\tau\} \leq \mathrm{Av}_*(\exp\{s(\sum_{i=1}^n w_i\varepsilon_i)\})\exp\{-s\tau\}$

However, $\mathrm{Av}_*(\exp\{s(\sum_{i=1}^n w_i\varepsilon_i)\})\exp\{-s\tau\} = \prod_{i=1}^n \mathrm{Av}(\exp\{sw_i\varepsilon_i\})\exp\{-s\tau\}$. By Lemma 2.3, then, since each $sw_i > 0$ and our premise asserts that $\varepsilon_i$ is sub-U for $\forall i$:

$\prod_{i=1}^n \mathrm{Av}(\exp\{sw_i\varepsilon_i\})\exp\{-s\tau\} \leq \prod_{i=1}^n \exp\{8^{-1}s^2w_i^2R_i^2\}\exp\{-s\tau\} = \exp\{8^{-1}s^2\sum_{i=1}^n w_i^2R_i^2 - s\tau\}$

From here, we proceed by finding the minimum of $\exp\{8^{-1}s^2\sum_{i=1}^n w_i^2R_i^2 - s\tau\}$ for $s \in \mathbb{R}^+$. It is easy to verify that this function is minimized at $s = (\sum_{i=1}^n w_i^2R_i^2)^{-1}4\tau$. Thus, $\Pr(S_n > \tau) \leq \exp\{-(\sum_{i=1}^n w_i^2R_i^2)^{-1}2\tau^2\}$. This proves one direction of the main statement. For the other, one proceeds in an analogous fashion for $\Pr(-S_n > \tau)$, simply using A5 and Lemma 2.3 again. For this reason, the steps are omitted. Hence $\Pr(S_n > \tau) + \Pr(S_n < -\tau) \leq 2\exp\{-(\sum_{i=1}^n w_i^2R_i^2)^{-1}2\tau^2\}$, which implies $\Pr(|S_n| > \tau) \leq 2\exp\{-(\sum_{i=1}^n w_i^2R_i^2)^{-1}2\tau^2\}$. The asymptotic statement of the theorem can be achieved by setting some $N \in \mathbb{N}$ such that A5 holds for all natural numbers greater than $N$ and then letting $n \in \mathbb{N}$ be arbitrary such that $n > N$. The proofs then proceed exactly as before. ■

This theorem is very useful for the construction of valid confidence intervals under almost arbitrary conditions of probabilistic dependence, and without any need to specify or completely understand the latent system of dependencies, or even many features of the marginal or joint probability distributions. Again, $\mu_D$ could be $O(n)$. All that is required in terms of dependence is A5.

Example 2.1. Say $\{Y_i\}_{i\in\zeta}$ is an identically distributed sample of sub-U random variables s.t. $EY_i = \mu$. Also say $m, M$ are the minimum and maximum of the associated support, respectively. Furthermore, say $\mu_D = n - 1$, but A5 is fulfilled. Then $[\bar{Y} - R\sqrt{(2n)^{-1}\log(2/\alpha)},\ \bar{Y} + R\sqrt{(2n)^{-1}\log(2/\alpha)}]$ is at least a $1 - \alpha$ confidence set for $\mu$ and $\alpha \in (0, 1)$. If $m \geq 0$ and $m, M$ are unknown, then $\{\mu \mid \{1 + \sqrt{2n^{-1}\log(2/\alpha)}\}^{-1}\bar{Y} \leq \mu \leq \{1 - \sqrt{2n^{-1}\log(2/\alpha)}\}^{-1}\bar{Y}\}$ is also at least a $1 - \alpha$ confidence set when $n > 2\log(2/\alpha)$.
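A minimal R sketch of both sets in Example 2.1 follows, assuming an illustrative Beta sample on $[0, 1]$ (so $R = 1$ and $m \geq 0$); the helper name is hypothetical.

```r
# Hoeffding-type interval of Example 2.1 with known range R:
# Ybar +/- R * sqrt(log(2 / alpha) / (2 n)).
hoeffding_ci <- function(y, R, alpha = 0.05) {
  n <- length(y)
  mean(y) + c(-1, 1) * R * sqrt(log(2 / alpha) / (2 * n))
}

set.seed(4)
y <- rbeta(500, 10, 10)  # support [0, 1], so R = 1
hoeffding_ci(y, R = 1)

# Unknown-extrema variant for m >= 0, valid when n > 2 * log(2 / alpha):
# scale Ybar by {1 +/- sqrt(2 log(2 / alpha) / n)}^{-1}.
d <- sqrt(2 * log(2 / 0.05) / length(y))
mean(y) / c(1 + d, 1 - d)
```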
This next lemma allows us to extend Bernstein's inequality, which can sometimes provide much sharper confidence sets. It is also an adaptation of a classic result.

Lemma 2.4. Let $Z$ be a random variable such that $|Z| \leq M$ almost surely, $EZ = 0$, and $Z$ is sub-U. Then for any $s > 0$, $\mathrm{Av}(\exp\{sZ\}) \leq \exp\{M^{-2}\mathrm{Av}(Z^2)(\exp\{sM\} - 1 - sM)\}$.

Proof.

$\exp\{sZ\} = 1 + sZ + \sum_{k=2}^{\infty}(k!)^{-1}s^kZ^k \leq 1 + sZ + \sum_{k=2}^{\infty}(k!)^{-1}s^k|Z|^k = 1 + sZ + \sum_{k=2}^{\infty}(k!)^{-1}s^kZ^2|Z|^{k-2}$

However, the last expression is less than or equal to:

$1 + sZ + \sum_{k=2}^{\infty}(k!)^{-1}s^kZ^2M^{k-2} = 1 + sZ + M^{-2}Z^2\sum_{k=2}^{\infty}(k!)^{-1}s^kM^k = 1 + sZ + M^{-2}Z^2(\exp\{sM\} - 1 - sM)$

Hence, $\mathrm{Av}(\exp\{sZ\}) \leq 1 + M^{-2}\mathrm{Av}(Z^2)(\exp\{sM\} - 1 - sM) \leq \exp\{M^{-2}\mathrm{Av}(Z^2)(\exp\{sM\} - 1 - sM)\}$. The first of the previous inequalities follows from the fact that $\mathrm{Av}(Z) \leq 0$. The second follows from the fact that $1 + x \leq \exp\{x\}$ for $\forall x \in \mathbb{R}$. ■

Theorem 2.2. Let $\varepsilon_1,\dots,\varepsilon_n$ be defined on bounded supports $S_1,\dots,S_n$ and let $w > 0$ be a constant s.t. $S_n = w\varepsilon = w\cdot\sum_{i=1}^n \varepsilon_i$. Moreover, for an arbitrary $i \in \{1,\dots,n\}$, say $|\varepsilon_i| \leq M$ with probability one and $E\varepsilon_i = 0$. Suppose A5 for some fixed $n$ and let $\tau > 0$ be arbitrary. Finally, define a function $h(u) = \log(u+1)(u+1) - u$. Then, when $\varepsilon_i$ is sub-U for $\forall i$:

$\Pr(|S_n| > \tau) \leq 2\exp\left[-M^{-2}\sum_{i=1}^n \mathrm{Av}(\varepsilon_i^2)\cdot h\left(\{w\cdot\sum_{i=1}^n \mathrm{Av}(\varepsilon_i^2)\}^{-1}\tau M\right)\right]$

Proof. The proof is almost identical to that of Theorem 2.1; thus, we provide only a sketch. Let $\tau > 0$ and $s > 0$ be arbitrary. Then $\Pr(\exp\{sS_n\} > \exp\{s\tau\}) \leq \prod_{i=1}^n \exp\{M^{-2}\mathrm{Av}(\varepsilon_i^2)\cdot(\exp\{swM\} - 1 - swM)\}\exp\{-s\tau\} = \exp\{M^{-2}\sum_{i=1}^n \mathrm{Av}(\varepsilon_i^2)\cdot(\exp\{swM\} - 1 - swM) - s\tau\}$. These statements follow from A5 and Lemma 2.4. Now, we proceed by minimizing the last expression with respect to $s$. It is easy to observe that the minimum is achieved at $s = (wM)^{-1}\log(\{w\cdot\sum_{i=1}^n \mathrm{Av}(\varepsilon_i^2)\}^{-1}\tau M + 1)$. Algebraic rearrangement achieves one side of the bound. Parallel logic, as in Theorem 2.1, achieves the other. ■

Corollary 2.3 (Extension of Bernstein's Inequality). Suppose the same setup as Theorem 2.2. Then for an arbitrary $\tau > 0$:

$\Pr(|S_n| > \tau) \leq 2\exp\left(-\{2w^2\sum_{i=1}^n \mathrm{Av}(\varepsilon_i^2) + 3^{-1}2w\tau M\}^{-1}\tau^2\right)$

Proof. We simply use Theorem 2.2 and then note that $h(u) \geq (2 + 3^{-1}2u)^{-1}u^2$ for $u \geq 0$. Algebraic rearrangement supplies the result. ■

To use Theorem 2.2 and Corollary 2.3 for confidence sets, one only needs to note that $\mathrm{Av}(\varepsilon_i^2) \leq 3^{-1}2R_i^2$ for sub-U variables of the form $\varepsilon_i = Y_i - EY_i$, where $R_i$ here is the range of $Y_i$. One can then substitute these values into the previous results and use additional substitutions depending on the known features of the conditional $Y_i$ or marginal $Y$. For U random variables, we already know that $\mathrm{Av}(\varepsilon_i^2) = 12^{-1}R_i^2$. A short sketch of the resulting interval follows.
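To see how Corollary 2.3 converts into a half-width, set the two-sided bound equal to $\alpha$ with equal weights $w = 1/n$ and solve the resulting quadratic in $\tau$. The R sketch below does this for a hypothetical centered Beta(10,10) sample, where $|\varepsilon_i| \leq 1/2$ and $\mathrm{Av}(\varepsilon_i^2) = 12^{-1}R^2 = 1/12$; all names are illustrative.

```r
# Bernstein-type half-width from Corollary 2.3 with w = 1/n: solve
# tau^2 = log(2/alpha) * {2 w^2 n a + (2/3) w tau M} for tau > 0, where a
# bounds Av(eps_i^2).
bernstein_halfwidth <- function(n, M, a, alpha = 0.05) {
  w <- 1 / n; L <- log(2 / alpha)
  b <- (2 / 3) * w * M * L                   # linear coefficient
  (b + sqrt(b^2 + 8 * w^2 * n * a * L)) / 2  # positive root of the quadratic
}
bernstein_halfwidth(n = 500, M = 0.5, a = 1 / 12)
# roughly 0.036 here, vs. about 0.061 for the Theorem 2.1 width with R = 1
```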
Theorem 2.3. Let $\varepsilon_1,\dots,\varepsilon_n$ be continuous, regular U random variables defined on supports $S_1,\dots,S_n$ such that, for an arbitrary $i \in \{1,\dots,n\}$, $\max(S_i) = M_i$ and $E\varepsilon_i = 0$. Let $w$ be a vector of constants s.t. $S_n = w\varepsilon = \sum_{i=1}^n w_i\varepsilon_i$. Suppose A5 for some fixed $n$ and let $\tau > 0$ be arbitrary. Then $\Pr(|S_n| > \tau) \leq 2\exp\{-(\sum_{i=1}^n w_i^2R_i^2)^{-1}6\tau^2\}$.

Proof. The proof is largely omitted since it mirrors that of Theorem 2.1. Instead of using Lemma 2.3, we can use Lemma 2.1 for U variables. This improves the bound on each $\mathrm{Av}(\exp\{sw_i\varepsilon_i\})$ by a factor of $3^{-1}$ within each exponentiation. It also allows each $w_i$ to be negative. The rest of the proof follows the exact same steps. ■

Example 2.2. Now, say the objective is the estimation of $\beta$ w.r.t. $Y = x\beta + \varepsilon$. Suppose the mean model is well specified and $\varepsilon$ is any vector of regular and continuous U random variables that are densely dependent, but A5 applies. Let $B_s = w_sY$ be arbitrary. Then

$[B_s - \sqrt{\sum_{i=1}^n w_{s,i}^2R_i^2}\cdot\sqrt{6^{-1}\log(2/\alpha)},\ B_s + \sqrt{\sum_{i=1}^n w_{s,i}^2R_i^2}\cdot\sqrt{6^{-1}\log(2/\alpha)}]$

is at least a $1 - \alpha$ confidence interval for $\beta_s$. Note that if each $Y_i$ is non-negative, then, since the range of each $\varepsilon_i$ equals the range of each $Y_i$, we have $R_i \leq 2EY_i$, and hence one can feasibly replace $\sqrt{\sum_{i=1}^n w_{s,i}^2R_i^2}$ with $2\sqrt{\sum_{i=1}^n w_{s,i}^2(\widehat{EY_i})^2}$ when each $R_i$ is unknown. This will still be accurate asymptotically insofar as $\mu_n = o(n)$. One could also replace the former with $R\sqrt{\sum_{i=1}^n w_{s,i}^2}$, where $R$ is the range of the marginal distribution of $Y$, provided it is known. If $Y$ is sub-U marginally and $R$ is unknown, one can also replace the former with $2\bar{Y}\sqrt{\sum_{i=1}^n w_{s,i}^2}$ for large enough $n$. Lastly, we could also make use of the sample range of $\hat{e}$. Although downwardly biased, it is still a reasonable choice, and it is consistent under mild regularity conditions. This topic is explored further in Section 2.6 and the supplementary materials. A sketch of the basic construction follows.
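The following R sketch implements the basic interval of Example 2.2 for a single coefficient, treating the error ranges as known and equal. The design, error law, and helper name are all assumptions for illustration.

```r
# Theorem 2.3 interval for a least-squares coefficient with known, equal
# ranges R_i = R: B_s +/- sqrt(sum(w_s^2) * R^2) * sqrt(log(2/alpha) / 6).
u_reg_ci <- function(X, y, R, s = 2, alpha = 0.05) {
  W <- solve(crossprod(X), t(X))  # rows of W are the weight vectors w_s
  b <- drop(W %*% y)
  half <- sqrt(rowSums(W^2) * R^2) * sqrt(log(2 / alpha) / 6)
  cbind(lower = b - half, est = b, upper = b + half)[s, ]
}

set.seed(5)
t1 <- runif(200, -5, 5)
y <- 20 + 10 * t1 + runif(200, -10, 10)  # regular, symmetric U errors; R = 20
u_reg_ci(cbind(1, t1), y, R = 20)        # interval for the slope
```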
2.5 A Quick Extension to Estimating Equations

In this section, we extend the results of the previous sections to estimating equations, hence expanding their utility to a larger class of estimators. Consider a random variable $h(Y_i, \theta)$ for some measurable function $h$. Although $h(Y_i, \theta)$ will often also depend upon a set of fixed constants for $\forall i$, this will be left implicit. The estimating equations of interest here take the following form, although it will sometimes be relevant to standardize them by a different function of $n$:

$Q(\theta) = n^{-1}\sum_{i=1}^n Eh(Y_i, \theta)$ (2.3)

$Q_n(\theta) = n^{-1}\sum_{i=1}^n h(Y_i, \theta)$ (2.4)

Equations (2.3) and (2.4) together are a bedrock of statistical theory and practice and are well understood. To quicken developments, we will draw heavily upon the prior results of the authors mentioned in the introduction. The main contribution of this section is to demonstrate that the variance identities can be used to establish a uniform weak law of large numbers (UWLLN) for Eq. (2.4) under exceptionally unfavorable conditions, and to show that the estimators resulting from this setup are conditionally or asymptotically additive. The latter fact thus allows for the use of the exponential inequalities of the previous section, even in the face of 'apocalyptic' dependence, provided the right error structure.

Denote $\beta \in \Theta_q$ to be the target parameter of interest once again. If $\beta$ minimizes (maximizes) $Q(\theta)$, then $\hat{\beta}_n$ represents a sequence of random variables that constitute a class of extremum estimators, or M-estimators. For instance, when $h(Y_i, \theta)$ is a log-likelihood, $\hat{\beta}_n$ is a (quasi-)maximum likelihood estimator. When $Q(\beta) = 0$, then $\hat{\beta}_n$ is instead a sequence of root estimators. All generalized method of moments estimators fall into this class (Hall, 2005). This paper will focus on the latter in the spirit of Yuan and Jennrich (1998), although some attention will be provided to the former as well. The reason for this is simple. The main objective is to (minimally) establish that the sequence of estimators $\hat{\beta}_n$ derived in concordance with Eq. (2.3) and Eq. (2.4) converges in probability to $\beta$. Theory associated with the minimization (maximization) of some $Q(\theta)$ can accomplish this; however, since this path ultimately reduces to a consideration of $E\nabla Q(\theta) = 0$ as a moment condition, where $\nabla$ is the gradient, one can ultimately explore this case with greater generality. For clarity, $Q$ will henceforth be used for extremum estimating equations, while $U$ will be used for root equations.

Only a couple of new adjustments are made to the typical set of regularity conditions, partially for simplicity and intuition. The assumptions utilized are provided below:

R1. $\Theta_q$ is a compact Euclidean space

R2. $h(y_i, \theta)$ is continuous in $\theta$ and is at least twice continuously differentiable w.r.t. $\theta \in \Theta$ for $\forall y_i$

R3. $|\partial^2 h(y_i, \theta)/\partial\theta^2| \leq Z_i$ for some random variable $Z_i$, and $\sup_{1\leq i\leq n} Z_i = O(1)$ for $\forall i$

R4. $\hat{\beta}_n \in \Theta$ and $U_n(\hat{\beta}_n) = 0$ for $\forall n$

R5. $\theta \in \Theta$ and $U(\theta) = 0 \iff \theta = \beta$

R6. Denote $\mathcal{L}_n$ as a linear dependency graph s.t. $\zeta = \{h(Y_1, \theta), \dots, h(Y_n, \theta)\}$. Then $\mu_n = o(n)$ for $\forall\theta \in \Theta$

R7. Let $\theta \in \Theta$ be arbitrary. Then $\nabla U(\theta)$ is non-singular and $\nabla U_n(\theta)$ converges in probability to $\nabla U(\theta)$

R8. For all $\theta \in \Theta$ and $\forall i$, $\mathrm{Var}\{h(Y_i, \theta)\} < \infty$

R9. $Q(\beta) < \inf_{\theta\in\Theta;\,\theta\neq\beta} Q(\theta)$

R10. For $\hat{\beta}_n \in \Theta$, $Q_n(\hat{\beta}_n) \leq \inf_{\theta\in\Theta} Q_n(\theta) + o_p(1)$

Note that, in addition to these conditions, there is an implicit assumption that all utilized inverse matrices exist. R8 will be useful for establishing uniform convergence in probability. R8 is a stronger assumption than is usually required; however, it is not expensive. The usual assumption is that $E\sup_{\theta\in\Theta}|h(Y_i, \theta)| < \infty$ for $\forall i$ (R8'). In the usual maneuver, R8' is employed in conjunction with R1, R2, and the assumption that a WLLN exists for the estimating equation to establish stochastic equicontinuity and thereby a UWLLN for Eq. (2.4). Outside of artificial pathological examples, however—such as when it is false that the variance of Eq. (2.4) tends to zero—this logic feasibly implies that R8 is also likely to hold. Regardless, R8 simply restricts $\Theta$ to a space of 'reasonable' values. R8 is also trivially true for well-behaved functions and bounded random variables, which is precisely our universe of concern. Regardless, R8 is not strictly necessary: all results can also be achieved with R8'. As aforementioned, the most important contribution in this section arrives via R6 (or analogous statements). Although it is a slightly higher moment condition, it provides more intuition in practice. Additionally, it makes the proofs quick and easier to follow for non-specialists.

The next lemma establishes that a UWLLN can be obtained even when the mean dependency diverges as a function of $n$. The proof is straightforward and similar to that of Proposition 2.5.

Lemma 2.5. Let $Z_i = h(Y_i, \theta)$ be any random variable for some measurable function $h$ s.t. R8 holds. Denote $\mu_n$ as the mean degree of the linear dependency graph $\mathcal{L}_n$ for $\zeta = \{Z_1, \dots, Z_n\}$, say $\bar{Z}(\theta) = n^{-1}\sum_{i=1}^n Z_i$, and denote $W = \sup_{\theta\in\Theta}|\bar{Z}(\theta) - E\bar{Z}(\theta)|$. If R6 also holds, then $\bar{Z}(\theta)$ converges uniformly in probability to $E\bar{Z}(\theta)$.

Proof. Let $\theta_*$ denote the $\theta \in \Theta$ corresponding to $W = \sup_{\theta\in\Theta}|\bar{Z}(\theta) - E\bar{Z}(\theta)|$. Hence, $W = |\bar{Z}(\theta_*) - E\bar{Z}(\theta_*)|$. Also note by assumption that $\mathrm{Var}(Z_i)$ is finite for $\forall i$ and that $\mu_n = o(n)$. Let $\epsilon > 0$ be arbitrary.
Then:

$\Pr(W > \epsilon) = \Pr(W^2 > \epsilon^2) \leq \epsilon^{-2}\cdot E\{\bar{Z}(\theta_*) - E\bar{Z}(\theta_*)\}^2 \implies \Pr(W > \epsilon) \leq \epsilon^{-2}\cdot\mathrm{Var}(\bar{Z}(\theta_*)) \leq \epsilon^{-2}\cdot(1 + \mu_n\varphi_n)\cdot n^{-2}\sum_{i=1}^n \mathrm{Var}(Z_i) \implies \lim_{n\to\infty}\Pr(W > \epsilon) \leq \lim_{n\to\infty}\epsilon^{-2}\cdot\sup_{1\leq i\leq n}\mathrm{Var}(Z_i)\cdot(1 + \mu_n\varphi_n)\cdot n^{-1} = 0$

Hence, $\sup_{\theta\in\Theta}|\bar{Z}(\theta) - E\bar{Z}(\theta)|$ converges in probability to zero. ■

Xie and Yang (2003) explored a set of exact conditions for the weak and strong convergence of generalized estimating equations under the assumption that the weighted sums of clustered observations were uncorrelated with those of other clusters. Again say that $n = \sum_{k=1}^K n_k$ for $K$ unique clusters. They did so under three settings: 1) $K \to \infty$ and $n_M$ is bounded for all $K$; 2) $K$ is bounded but $n_m \to \infty$; and 3) $n_m \to \infty$ as $K \to \infty$. An extension of Lemma 2.5 can be used to cover these settings. Similar to before, say $\mathcal{L}_{n_k}$ refers to the linear dependency graph for random variables within cluster $\zeta_k$ and $\mathcal{L}_{T_k}$ refers to the linear dependency graph of the set of random variables $\{T_1, \dots, T_K\}$, where now $T_k = \sum_{j=1}^{n_k} h(Y_j, \theta)$. From here, again let $\mu_{n_k}$ signify the mean degree of $\mathcal{L}_{n_k}$ and $\mu_K$ the mean degree of $\mathcal{L}_{T_k}$.

Lemma 2.6. Suppose R1 and R8 and consider $\bar{Z}(\theta) = K^{-1}\sum_{k=1}^K T_k$. Alternatively, for $n_M = \max_{k\in\mathcal{K}}(n_k)$, define $\bar{Z}_M(\theta) = n_M^{-1}\sum_{k=1}^K T_k$ and $n_m = \min_{k\in\mathcal{K}}(n_k)$ as before. Then:

1) Provided $K(n) \to \infty$ as $n \to \infty$ and $n_M$ is bounded for all $K$, if $\mu_K = o(K)$, then $\bar{Z}(\theta) \xrightarrow{p} E\bar{Z}(\theta)$ uniformly

2) If $K(n)$ is bounded but $n_m \to \infty$, and if $\mu_{n_k} = o(n_k)$ for $\forall k$, then $\bar{Z}_M(\theta) \xrightarrow{p} E\bar{Z}_M(\theta)$ uniformly

3) If $n_m \to \infty$ as $K(n) \to \infty$, then $\bar{Z}_M(\theta) \xrightarrow{p} E\bar{Z}_M(\theta)$ uniformly if $\mu_K = O(1)$, $K = o(n_M)$, and $K\cdot\mu_{n_k} = o(n_M)$ for $\forall k$. Alternatively, the same result holds for $\bar{Z}(\theta)$ if $\mu_{n_k} = O(1)$ for $\forall k$, $n_M = o(K)$, and $n_M\cdot\mu_K = o(K)$

Proof. The first case is very similar to that of Lemma 2.5, except now the cluster structure of each $T_k$ is being considered. We will only prove the first two statements since the proof is otherwise largely repetitive.

Case 1: Suppose $K \to \infty$ and $n_M$ is bounded for all $K$ as $n \to \infty$. Call this bounding constant $M_n$. Furthermore, suppose $\mu_K = o(K)$. Since each $T_k$ is a finite linear combination of correlated random variables with finite variance, $\mathrm{Var}(T_k) < \infty$ for $\forall k$. This can be seen by observing that $\mathrm{Var}(T_k) \leq n_k\sum_{j=1}^{n_k}\mathrm{Var}\{h(Y_j, \theta)\} \leq M_n^2\sup_{1\leq j\leq n_k}\mathrm{Var}\{h(Y_j, \theta)\} < \infty$ under the premises. Hence, by Lemma 2.5, $\bar{Z}(\theta)$ obeys a UWLLN.

Case 2: Suppose $K(n)$ is bounded but $n_m \to \infty$. Denote $M_K$ and $C_*$ as the asymptotic bounds of $K(n)$ and $\varphi_{n_k}$ for $\forall k$ respectively. Furthermore, suppose $\mu_{n_k} = o(n_M)$ for $\forall k$. Now, let $W = \sup_{\theta\in\Theta}|\bar{Z}_M(\theta) - E\bar{Z}_M(\theta)| = |\bar{Z}_M(\theta_*) - E\bar{Z}_M(\theta_*)|$ for an arbitrary $\epsilon > 0$. Additionally, denote $\sigma^2 = \max_{1\leq u\leq n}\mathrm{Var}\{h(Y_u, \theta_*)\}$, $\mu_* = \max_{k\in\mathcal{K}}(\mu_{n_k})$ WLOG, and $\varphi_* \leq C_*$. Then $\Pr(W > \epsilon) \leq \epsilon^{-2}\mathrm{Var}\{\bar{Z}_M(\theta_*)\} \leq \epsilon^{-2}K\sum_{k=1}^K n_M^{-2}\mathrm{Var}(T_{k;\theta_*})$. And:

$\epsilon^{-2}K\sum_{k=1}^K n_M^{-2}\mathrm{Var}(T_{k;\theta_*}) \leq \epsilon^{-2}K\sum_{k=1}^K n_M^{-2}(1 + \mu_{n_k}\varphi_{n_k})[\sum_{j=1}^{n_k}\mathrm{Var}\{h(Y_{k,j}, \theta_*)\}] \leq \epsilon^{-2}K^2n_M^{-1}(1 + \mu_*\varphi_*)\sigma^2 \implies \lim_{n\to\infty}\Pr(W > \epsilon) \leq \epsilon^{-2}M_K^2\sigma^2\lim_{n\to\infty}n_M^{-1}(1 + \mu_*C_*) = 0$

■

We note that Lemma 2.5 can be used with known proof strategies to demonstrate that $\hat{\beta}_n \xrightarrow{p} \beta$ for M-estimators when it is possible for $\mu_n(n) \to \infty$, but only under the auspices that $\mu_n(n)$ grows at a sufficiently sub-linear rate (Hall, 2005). Now, Lemma 2.6 is used to demonstrate that $\hat{\beta}_n$ is consistent and asymptotically additive. For this matter, recall that $U_n(\theta) = n^{-1}\sum_{i=1}^n h(Y_i, \theta)$ is now a $q\times 1$ vector and $\{\nabla U_n(\theta)\}^{-1}$ is a $q\times q$ matrix.

Proposition 2.9. Suppose R1–R8 in conjunction with $U_n(\theta)$ and also that $\hat{\beta}_n$ satisfies R4. Additionally, suppose any of the three scenarios stated in Lemma 2.6.
Then $\hat{\beta}_n \xrightarrow{p} \beta$ and $\sqrt{n}(\hat{\beta}_n - \beta)$ converges in distribution to a vector of additive statistics as $n \to \infty$.

Proof. Recall that the premises of Lemma 2.6 make use of two different equations: one standardized by $n_M$ and one standardized by $K$ w.r.t. $U_n(\theta) = n^{-1}\sum_{k=1}^K T_k = \sum_{k=1}^K n^{-1}\sum_{j=1}^{n_k} h(Y_{k,j}, \theta)$. Denote these re-expressions $U_M(\theta) = n_M^{-1}nU_n(\theta)$ and $U_K(\theta) = K^{-1}nU_n(\theta)$ respectively. To establish that $U_K(\theta)$ and $U_M(\theta)$ follow a UWLLN under the relevant conditions of Lemma 2.6, one simply needs to specify an arbitrary component $s$ of either random vector. Since this $s$th component is of the form covered by Lemma 2.6, under the auspices of R1 and R8, it follows that the $s$th component converges uniformly in probability to its expectation. Therefore, since $s$ was arbitrary, $U_K(\theta)$ converges uniformly in probability to $EU_K(\theta)$ and $U_M(\theta)$ does the same to $EU_M(\theta)$. Thus, under Theorem 2.3 of Yuan and Jennrich (1998) and R1, R2, R4, and R5, it is then implied that $\hat{\beta}_n \xrightarrow{p} \beta$.

Since $\{\nabla U_M(\theta)\}^{-1}U_M(\theta) = \{\nabla U_K(\theta)\}^{-1}U_K(\theta) = \{\nabla U_n(\theta)\}^{-1}U_n(\theta)$ for all $\theta \in \Theta$, it is possible to proceed WLOG utilizing only $U_n(\theta)$. Now, note that under R1, R2, R3, R7, and the continuous mapping theorem, $\{\nabla U_n(\theta)\}^{-1}$ converges uniformly in probability to $\{\nabla U(\theta)\}^{-1}$ for $\forall\theta \in \Theta$. Also, for some $\tilde{\beta} \in \Theta$ that is between $\hat{\beta}_n$ and $\beta$ and an arbitrary $q\times 1$ vector of constants $\lambda$:

$\sqrt{n}\cdot\lambda^\top U_n(\beta) = -\sqrt{n}\cdot\lambda^\top\{\nabla U_n(\tilde{\beta})\}(\hat{\beta}_n - \beta)$

Therefore, since $\hat{\beta}_n \xrightarrow{p} \beta$, it is also true under our conditions that:

$\sqrt{n}\cdot\lambda^\top U_n(\beta) \xrightarrow{p} -\sqrt{n}\cdot\lambda^\top\{\nabla U(\beta)\}(\hat{\beta}_n - \beta)$

This implies that $\sqrt{n}(\hat{\beta}_n - \beta) \xrightarrow{d} -\sqrt{n}\{\nabla U(\beta)\}^{-1}U_n(\beta)$ by the Cramér-Wold device (Yuan & Jennrich, 1998; Hall, 2005; Feng, Wang, Han, Xia, & Tu, 2013). This completes the proof since $\sqrt{n}\{\nabla U(\beta)\}^{-1}U_n(\beta)$ is a vector of random sums. ■

It is important to note that $\sqrt{n}(\hat{\beta}_n - \beta)$ will have finite variance without additional stabilization only if the maximum mean degree of linear dependency associated with the score functions is $O(1)$. We provide two general examples to contextualize these results.

Example 2.3 (Quasi-MLEs). In this example, standard generalized linear models (GLMs) are considered for dependent data. Recall that it has often been stated that GLMs are inappropriate when the response variables are dependent. Say $Q_n = n^{-1}\sum_{i=1}^n \ln\{f(Y_i, \theta)\}$ s.t. $Y_i$ is a member of an exponential family with distribution function $f(y_i, \theta)$ for $\forall i$. Also, say $EY_i = \mu_i = g(x_i\beta)$ for a differentiable, monotonic canonical link function $g$ and conformable vector of constants $x_i$. For $Y \in \mathbb{R}^{n\times 1}$, $\mu \in \mathbb{R}^{n\times 1}$, and $x \in \mathbb{R}^{n\times q}$, denote $d = \partial\mu/\partial\beta$ and $W$ as the diagonal $n\times n$ matrix s.t. $W_{i,i} = \phi\,\partial\mu_i/\partial x_i\beta$ for dispersion constant $\phi$. Then $\nabla Q_n = U_n = n^{-1}d^\top W^{-1}(Y - \mu)$. Lastly, define an $\mathcal{L}$ graph for $\{\varepsilon_i\}_{i\in I}$. From the previous results, it is then implied that $\sqrt{n}(\hat{\beta}_n - \beta) \xrightarrow{d} \sqrt{n}\cdot(d^\top W^{-1}d)^{-1}d^\top W^{-1}\varepsilon = \sqrt{n}\cdot w\varepsilon$, say. Then $\mathrm{Var}\{\sqrt{n}(\hat{\beta}_n - \beta)\} = n\cdot wVw^\top\{1_{p\times p} + \mu_n\Gamma\}$ asymptotically, where $\{1_{p\times p} + \mu_n\Gamma\}$ has the same definition as in Section 2.2, and where $V$ is a diagonal matrix of variances. If $\mu_n = O(1)$ and the conditions of Section 2.4 hold, then an exponential inequality can be used for constructing confidence sets. Otherwise, one needs to additionally use a proper choice of $\{1 + \mu_n\varphi_{n,s}\}^{-1/2}$ to stabilize the variance of $\hat{\beta}_s$ and to ensure that all relevant objects exist. Of course, if it is believed that a central limit theorem holds, a deterministic correction can also be employed with a Wald-like statistic, as in the rough sketch below.
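The sketch renders this idea in R for an illustrative Poisson GLM: a model-based covariance is inflated by an assumed factor $\{1 + \mu_n\varphi_n\}$ before a Wald-type interval is formed. The dependence summaries are user-supplied guesses here, and the whole block is a loose illustration of the correction rather than this dissertation's own procedure.

```r
# Deterministic correction for a quasi-MLE GLM: inflate the variance by
# {1 + mu_n * phi_n} before constructing a Wald-type interval.
set.seed(6)
x <- rnorm(300)
y <- rpois(300, exp(0.2 + 0.3 * x))  # stand-in counts; real data may be dependent
fit <- glm(y ~ x, family = poisson)

mu_n <- 10; phi_n <- 0.05            # assumed dependence summaries
v <- vcov(fit)["x", "x"] * (1 + mu_n * phi_n)
coef(fit)["x"] + c(-1, 1) * qnorm(0.975) * sqrt(v)
```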
Example 2.4 (Iteratively re-weighted least squares). Since the variance identity and results of Section 2.4 can be applied to additive statistics with the necessary error structures, they are applicable to estimators calculated via iteratively re-weighted least squares (IRWLS). This extends their utility to a vast number of non-linear contexts. For instance, it then applies to the (weighted) minimization of $Q_n(\theta) = (pn)^{-1}\sum_{i=1}^n\{Y_i - g(x_i\theta)\}^p$ for some measurable function $g$ and $p > 0$, insofar as IRWLS is used as a fitting procedure. Here, the special case s.t. $p = 2$ will be considered under a $K$ cluster partition. This case will also use the notation of the previous example. It is apropos to note that the pragmatic form of quasi-MLEs is a special case of this setup when IRWLS is used as an approximate fitting algorithm.

IRWLS is well studied and has been found to have good, reliable qualities. Under the requirement that $\sum_{k=1}^K d_{k,t}^\top W_{k,t}^{-1}d_{k,t}$ is positive definite, or that the previously stated assumptions hold, the IRWLS estimator $\hat{\beta}_t$ can approximate $\hat{\beta}_n$ to an almost arbitrary precision (Yuan & Jennrich, 1998). To briefly show this, let $\tau \equiv 0$ as a theoretical exercise. Then, only imposing the condition that $W_{k,t}$ be positive definite: $\hat{\beta}_t - \hat{\beta}_{t-1} = (\sum_{k=1}^K d_{k,t}^\top W_{k,t}^{-1}d_{k,t})^{-1}\sum_{k=1}^K d_{k,t}^\top W_{k,t}^{-1}(Y_k - \mu_{k,t}) \equiv 0$, which implies that $K^{-1}\sum_{k=1}^K d_{k,t}^\top W_{k,t}^{-1}(Y_k - \mu_{k,t}) = U_K(\hat{\beta}_t) \equiv U_K(\hat{\beta}_{t-1}) \equiv 0$. Under the assumptions of Proposition 2.9, the IRWLS estimator converges in distribution to the correct target and is asymptotically additive. Hence, once again, all previous results apply for large enough $n$, provided $\mu_n = O(1)$ or proper stabilization is employed, and the other relevant conditions of Section 2.4 hold. However, if one is willing to condition on the sigma-algebra of events generated by some $\hat{\beta}_{t-1}$ following the occurrence of the event that $d(\hat{\beta}_{t-1}, \hat{\beta}_{t-2}) < \tau$ for a sufficiently small $\tau > 0$, and for a long enough run, then $\hat{\beta}_t$ is an additive estimator for all finite samples. Therefore, the variance identity and exponential inequalities of Section 2.4 can be used for conducting cogent inference with functional approximations that minimize error in $\ell_p$ space. Importantly, this can again be accomplished when unknown, intractable, and possibly non-sparse dependency structures are present.
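A minimal IRWLS loop is sketched below, using a Poisson log-link purely for illustration (so the update reduces to a weighted least-squares step on a working response). The stopping rule mirrors the event $d(\hat{\beta}_{t-1}, \hat{\beta}_{t-2}) < \tau$ described above; none of this is the dissertation's own fitting code.

```r
# IRWLS sketch: iterate weighted least-squares steps until successive
# estimates are within tau of one another.
set.seed(7)
x <- cbind(1, rnorm(300))
y <- rpois(300, exp(drop(x %*% c(0.2, 0.3))))

beta <- c(0, 0); tau <- 1e-10
repeat {
  mu   <- drop(exp(x %*% beta))              # for this link, d = x * mu, W = diag(mu)
  z    <- drop(x %*% beta) + (y - mu) / mu   # working response
  bnew <- lm.wfit(x, z, w = mu)$coefficients # weighted least-squares update
  done <- sum((bnew - beta)^2) < tau
  beta <- bnew
  if (done) break
}
beta  # approximately solves U_n(beta) = 0, hence (conditionally) additive
```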
2.6 Simulations and a Data Application

This section offers two simulation experiments under unfavorable dependence conditions: one for $\bar{Y}$ and one for $\hat{\beta}$, both for symmetric U variables. An approximation for gauging the robustness of A5 will help since, in general, the exact form of $E(\exp\{sw\varepsilon\})$ is intractable. To this end, we can use the fact that $E(\exp\{sw\varepsilon\}) \approx 1 + 2^{-1}s^2\{1 + \mu_n\varphi_n\}\sum_{i=1}^n w_i^2\sigma_i^2$ by Taylor approximation, and that the latter expression is bounded by $\exp\{2^{-1}s^2(1 + \mu_n\varphi_n)\sum_{i=1}^n w_i^2\sigma_i^2\}$. For our U variables, it is then implied that $\exp\{2^{-1}s^2(1 + \mu_n\varphi_n)\sum_{i=1}^n w_i^2\sigma_i^2\} \leq \exp\{24^{-1}s^2\sum_{i=1}^n w_i^2R_i^2\}$, at least as an approximate rule of thumb for A5. Although the value on the right side of the inequality bounds the functional average specified in A5, it does so tightly. For instance, if $w_i = n^{-1}$ for $\forall i$, each error variable has the same upper bound for its support, and $s = C\sqrt{n}$ for some positive constant $C$, then $\mathrm{Av}_*(\exp\{sw\varepsilon\}) \to \exp\{24^{-1}C^2R^2\}$ quickly. Regardless, further algebraic manipulation of this setup implies the following rough bound on the summary values of the dependency structure: $\mu_n\varphi_n \leq 12^{-1}\{\sum_{i=1}^n w_i^2\sigma_i^2\}^{-1}\sum_{i=1}^n w_i^2R_i^2 - 1$. Homogeneity of the variances and ranges results in the simpler, weight-invariant bound $12^{-1}\sigma^{-2}R^2 - 1$. For simplicity, all ranges and variances will be equal. Insofar as this approximate bound holds, A5 should approximately hold as well, and inference should remain robust even with mild violations.

All experiments possess $N = 10{,}000$ simulations. To get a more accurate assessment of robustness to violations of A5, we compare $\hat{A} = \max(N^{-1}\sum_{i=1}^N \exp\{sw\mathbf{e}_i\},\ N^{-1}\sum_{i=1}^N \exp\{-sw\mathbf{e}_i\})$, where $w\mathbf{e}_i$ denotes the realized weighted error sum in simulation $i$, to $\mathrm{Av}_* = \prod_{i=1}^n (sw_iM_i)^{-1}\sinh(sw_iM_i)$, which is an exact form of $\mathrm{Av}_*(\exp\{sw\varepsilon\})$ for U observations of our type. Investigating all arbitrary values of $s$ is neither possible nor necessary. Instead, we set $s = \{M^2\cdot c_*\cdot\sum_{i=1}^n w_i^2\}^{-1/2}\cdot 6\cdot\{6^{-1}\log(2/\alpha)\}^{1/2}$ for a context-dependent value of $c_*$ that limits the size of the exponential and average values for readability. This value of $s$ also corresponds to the optimal value of Theorem 2.3 with homogeneous ranges and $\tau$ set to the expression required for conservative, two-sided $1 - \alpha$ confidence sets. Additionally, we employ a value of $s$ that is $O(\sqrt{n})$ since this order better corresponds to the rule of thumb. Since we will be evaluating 95% confidence sets, $\alpha = .05$ for all experiments.

Setup. To test 'apocalyptic' scenarios, each simulation employs fully connected linear dependency graphs. In other words, we set $\mu_n = n - 1$ for $n \in \{100, 500, 1500\}$. For the $\bar{Y}$ setup, $Y_i \sim \mathrm{Beta}(\alpha, \alpha)$. This translates our rule of thumb to $3^{-1}(2\alpha + 1) - 1$. The first set of simulations uses $\alpha = 10$ universally. According to the rule of thumb, then, confidence sets should start to lose their nominal values when $\varphi_n = .06$, $.01$, and $.004$, respectively. For comparison, we establish a baseline at $\varphi = 0$ and make use of values that neighbor these breakdown points. In many unfavorable settings, we would still not expect $\mu_n$ to always attain its maximum, or for $\varphi_n$ to take moderate values when it does. Hence, this series of simulations is built to demonstrate the robustness of the results for finite sample inference in practical settings, since decent performance under these conditions suggests an adequate level of dependability in more modest ones. A second experiment fixes $\varphi$ at $.1$ for $n = 500$ observations and varies only $\alpha$ over values in $\{10, 25, 50, 100\}$ to further examine the relationship between the range-variance ratio and robustness to statistical dependence. For all of these simulations, $c_* = 10$.

For the U regression setup, each $\varepsilon_i \sim \mathrm{TruncNorm}(M = 20, \mu = 0, \sigma^2 = 25)$, truncated symmetrically. The linear model is characterized by $Y_i = 20 + 10\cdot t_i + \varepsilon_i$, where each $t_i$ is a fixed draw from $T_i \sim \mathrm{TruncNorm}(m = -5, M = 5, \mu = 1, \sigma^2 = 1)$. Here, $c_* = \sigma = 5$ for control of exponential size. Dependencies in the outcome variables are induced with Gaussian copulas. Literature on this method is available elsewhere (Embrechts, Lindskog, & McNeil, 2001; Demarta & McNeil, 2005). R version 4.2.2 statistical software and the package 'copula' are employed for all experiments (R Core Team, 2021; Yan, 2007). The Beta distribution simulations make use of an exchangeable correlation matrix with off-diagonal cells populated by $\varphi_n$. The strategy for the regression simulations is more complicated. Essentially, we specify an unstructured correlation matrix and set its non-diagonal values to the corresponding elements of $25^{-1}\varphi_*n^2\cdot w_1w_1^\top$ for $\varphi_* \in \{0, .05, .1, .15\}$. The first row of weights is used as a basis since the product of its values will often match the valence of the weight products in the sum of covariances; this induces a dense mosaic of positive summands and inflates the variance. A sketch of one cell of the Beta experiment follows.
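The following R sketch reproduces one such cell under stated assumptions: it computes the predicted breakdown value of $\varphi_n$ from the rule of thumb, induces exchangeable dependence with the 'copula' package named above, and estimates the coverage of the Theorem 2.3 interval. The simulation count is reduced from $N = 10{,}000$ for speed, so the estimate is rougher than the tabled values.

```r
# One Beta(alpha, alpha) cell: exchangeable Gaussian-copula dependence and
# coverage of the Theorem 2.3 interval for the mean (w_i = 1/n, R = 1).
library(copula)

shp <- 10; n <- 100; phi <- 0.06; N <- 1000; a <- 0.05
(3^-1 * (2 * shp + 1) - 1) / (n - 1)    # predicted breakdown value for phi_n

cop  <- normalCopula(phi, dim = n, dispstr = "ex")
half <- sqrt(log(2 / a) / 6) / sqrt(n)  # sqrt(sum w^2 R^2) = 1 / sqrt(n)
covered <- replicate(N, {
  y <- qbeta(rCopula(1, cop), shp, shp)
  abs(mean(y) - 0.5) <= half
})
mean(covered)  # empirical coverage of the nominal 95% set
```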
As usual, empirical coverage is estimated by $N^{-1}\sum_{i=1}^N 1_{\theta\in C}$, where $C$ is the constructed confidence set. Three coverage values are ultimately estimated—$\widehat{CI}_{\mathrm{Wald}}$, $\widehat{CI}_U$, and $\widehat{CI}_{\hat{R}}$—although the third is estimated for the regression simulations only. The estimated Wald confidence sets make use of cluster-robust standard errors for the Beta simulations, and of robust standard errors in the style of generalized estimating equations with an exchangeable correlation structure for the regression cases (Højsgaard, Halekoh, & Yan, 2006). They suppose asymptotic normality. To mimic the specification of a partially correct but invalid partition, the outcome variables are assigned to $10^{-1}n$ clusters sequentially. The $\widehat{CI}_U$ estimator corresponds to the confidence sets that are constructed in accordance with Theorem 2.3. Importantly, these sets treat $M$ as known. This is not the case for the confidence sets targeted by $\widehat{CI}_{\hat{R}}$. Here, $M$ is treated as unknown. The sets are still constructed in accordance with Theorem 2.3; however, the sample range of the residuals is used in place of $2M$. This method is explored to gauge the robustness of the plug-in strategy, which will often be required in practice. Finally, the average lower and upper endpoints of the U constructed confidence sets for the case that $M$ is known are also provided for reference. The results of these simulations are available in Table 2.1, Table 2.2, and Table 2.3. For each table, A5 is estimated to hold when $\hat{A} \leq \mathrm{Av}_*$ and is estimated to be violated when '>' is shown.

Table 2.1: Beta(10,10) Simulations

n     φn      Mean Lower  Mean Upper  CI_Wald  CI_U   Â        vs.  Av*
100   0       0.42159     0.57841     0.92     1      10.52    <    10.60
      0.06†   0.42159     0.57841     0.51     0.99   10.60    =    —
      0.1     0.42158     0.57840     0.41     0.97   10.66    >    —
      0.2     0.42158     0.57840     0.29     0.88   10.81    >    —
500   0       0.46486     0.53499     0.92     1      192.57   <    194.25
      0.01†   0.46476     0.53489     0.55     0.996  193.65   <    —
      0.05    0.46458     0.53471     0.29     0.84   198.45   >    —
      0.1     0.46444     0.53457     0.2      0.68   204.8    >    —
1500  0       0.47974     0.52023     0.92     1      9059.27  <    9132.95
      0.004†  0.47972     0.52021     0.51     0.99   9128.91  <    —
      0.01    0.47971     0.52020     0.36     0.93   9236.57  >    —
      0.02    0.47969     0.52018     0.26     0.80   9420.2   >    —

The † symbol denotes the predicted threshold value s.t. coverage will begin to falter for fully connected linear dependency graphs.

Table 2.2: Beta(α,α) Simulations: n = 500, φn = .1

α    Threshold  Mean Lower  Mean Upper  CI_Wald  CI_U   Â       vs.  Av*
10   0.012      0.46444     0.53457     0.203    0.68   204.79  >    194.25
25   0.032      0.46462     0.53475     0.203    0.88   8.22    >    8.21
50   0.065      0.46471     0.53484     0.204    0.97   2.86    =    2.86
100  0.132      0.46477     0.53491     0.204    0.997  1.69    =    1.69

The 'Threshold' column represents the predicted breakdown point at α with n held constant.
Table 2.3: Regression Simulations

β0 = 20
n      φ*      Mean Lower   Mean Upper   CI_Wald   CI_U    CI_Rhat   Â          Av*
100    0       15.458       24.556       0.884     1       1         1.047   <  1.262
100    0.05    15.459       24.557       0.823     1       0.999     1.066   <  —
100    0.1     15.460       24.558       0.770     1       0.995     1.086   <  —
100    0.15    15.461       24.559       0.725     1       0.987     1.106   <  —
500    0       17.934       22.063       0.904     1       1         1.044   <  1.247
500    0.05    17.933       22.061       0.672     1       0.990     1.144   <  —
500    0.1     17.932       22.061       0.549     0.993   0.956     1.254   >  —
500    0.15    17.931       22.060       0.475     0.978   0.914     1.374   >  —
1500   0       18.833       21.170       0.902     1       1         1.046   <  1.258
1500   0.05    18.836       21.173       0.487     0.979   0.938     1.382   >  —
1500   0.1     18.838       21.175       0.366     0.908   0.835     1.840   >  —
1500   0.15    18.839       21.176       0.306     0.839   0.753     2.474   >  —

β1 = 10
n      φ*      Mean Lower   Mean Upper   CI_Wald   CI_U    CI_Rhat   Â          Av*
100    0       6.836        13.151       0.885     1       1         1.023   <  1.111
100    0.05    6.835        13.150       0.854     1       0.999     1.028   <  —
100    0.1     6.835        13.150       0.823     1       0.997     1.033   <  —
100    0.15    6.834        13.149       0.794     1       0.996     1.037   <  —
500    0       8.546        11.451       0.903     1       1         1.022   <  1.114
500    0.05    8.546        11.452       0.758     1       0.998     1.047   <  —
500    0.1     8.547        11.452       0.662     1       0.988     1.072   <  —
500    0.15    8.547        11.453       0.594     0.997   0.971     1.098   <  —
1500   0       9.189        10.807       0.903     1       1         1.024   <  1.114
1500   0.05    9.188        10.805       0.601     0.998   0.985     1.097   <  —
1500   0.1     9.187        10.804       0.482     0.976   0.936     1.176   >  —
1500   0.15    9.186        10.804       0.409     0.942   0.882     1.260   >  —

Table 2.1 shows that the coverage of the confidence sets resulting from A5 and Theorem 2.3 starts to dissipate in quality around the predicted points, and also when A5 ceases to be true. Table 2.2 provides evidence that, even if A5 does not hold exactly, the robustness of the confidence sets that result from its employment is a function of the extrema of each $\mathcal S_i$. In general, random variables with higher absolute extremes allow denser dependencies to exist without undermining inference: a fact that has been apparent since at least Hoeffding's lemma. Notably, Table 2.3 substantiates the utility of the plug-in strategy. Although, as expected, it is not as robust as when the upper extreme of the support is known, the simulation evidence shows that the confidence sets that use the sample range maintain their semi-conservative nominal coverage value while A5 holds. Each table also shows that the standard methods employed fail outright to uphold nominal coverage. Overall, the key point supported by these experiments is that A5 is a feasible condition that allows for cogent finite sample inference in many important settings, even when every outcome variable is statistically dependent. Additive statistics, whether from linear models or from estimating equations, are a critical tool for scientific discovery. Our job here was to demonstrate that they remain dependable enough for cogent finite sample inference in complicated modern settings. This has been accomplished, at least in some dimension.

2.6.1 Carbon Dioxide and Global Warming

In this section, we estimate the association between global temperature change and carbon dioxide levels between the years of 1979 and 2022. Monthly averages for global carbon dioxide levels (CO2; ppm) and global temperature anomalies (Temp; °C) were acquired from the Global Monitoring Laboratory and the Goddard Institute for Space Studies, respectively (Lan, Tans, & Thoning, Version 2023-08; GISTEMP Team, 2023). More information pertaining to the latter source and the methods utilized for measurement is obtainable elsewhere (Lenssen, Schmidt, Hansen, Menne, Persin, Ruedy, & Zyss, 2019).
Although the estimation of causality is beyond the scope of this analysis, monthly data for an industrial production index for the G-20 countries (Index) was also accessed for these years to act as a rudimentary adjusting variable (OECD, 2023). The index score for each country was summed within each month to construct a single index. Alternative metrics were also considered for this analysis. For instance, measurements of global population growth and the proportion of landmass covered by forests each year were also obtained. However, they were not utilized due to issues of multicollinearity. The approach of this paper is relevant to this question since, although useful and informative as conceptualizations, there is no reason for any complex ecological time series to strictly abide by the neat schematics of a typical moving average or autoregressive model. Unknown unknowns likely impact the process across time and location. Even if auto-correlations diminish, this does not imply that dependencies do.

Methods. The cardinal association is investigated w.r.t. two units of time: monthly and yearly. Only one time lag is utilized in both cases. This provided $n = 527$ observations for the monthly analysis and $n = 43$ for the yearly. For the latter exploration, all monthly variables are averaged within each year. For the former, a categorical variable for the season (December-February: Winter; March-May: Spring; June-August: Summer; September-November: Fall) is constructed to adjust for additional time trends. CO2 and Index are also log-transformed for both analyses. Two baseline models are estimated: $\mathrm{Temp}_t = \beta_0 + \beta_1\mathrm{Temp}_{t-1} + \beta_2\log\{\mathrm{CO2}_{t-1}\} + \beta_3\log\{\mathrm{Index}_{t-1}\} + \sum_{i=1}^{3}\alpha_i\mathrm{Season}_i + \varepsilon_t$ and $\mathrm{Temp}_t = \beta_0 + \beta_1\log\{\mathrm{CO2}\}_{t-1} + \beta_2\log\{\mathrm{Index}\}_{t-1} + \varepsilon_t$. Non-trend-adjusting variables are dropped from the model if they do not induce at least a 10 percent change in the magnitude of the estimate. All models are fitted via ordinary least squares with a Type I error rate of $\alpha = .05$. Confidence sets are constructed by way of Theorem 2.3, which also applies to stochastic regressions. For instance, when $W = (X^\top X)^{-1}X^\top$, then $B_s = \beta_s + \sum_{i=1}^{n}W_{s,i}\varepsilon_i$ is additive in $Z_{s,i} = W_{s,i}\varepsilon_i$. Strict stationarity, regularity, and symmetry of the $\{Z_{s,i}\}_{i\in I}$ are sufficient for the application of the theorem in this analysis. Pertinently, $B_s$ is not unbiased for $\beta_s$ in the first model since $\mathrm{Temp}_{t-1}$ appears on the right-hand side of the equation. However, it can still be consistent. Consistency is obtained if A3 applies to the mean degrees of the graphs associated with $\{Z_{s,i}\}_{i\in I}$ and $\{\varepsilon_i\}_{i\in I}$. All conditions are feasibly checked with time series plots, histograms, and empirical CDF plots for $\{W_{s,i}\hat{e}_i\}_{i\in I}$ and $\{\hat{e}_i\}_{i\in I}$. Here, we use the plug-in estimator $\hat{R}_s = \max_{i\in I}(W_{s,i}\hat{e}_i) - \min_{i\in I}(W_{s,i}\hat{e}_i)$. Again, this will underestimate the true range in finite samples. However, it is a consistent estimator under the mild regularity conditions we suppose. Proof of this is offered in the supplementary material. Moreover, the magnitude of the underestimation is not likely to undercut the conservatism of the method. In accordance with Theorem 2.3, then, confidence sets have the form $B_s \pm \sqrt{n}\hat{R}_s\cdot\{6^{-1}\log(2/\alpha)\}^{1/2}$. We use the same rule of thumb to gauge the robustness of A5: $\varphi_{s,n} \le (n-1)^{-1}\cdot\{12^{-1}S^{-2}_{W_s\hat{e}}\hat{R}^2_s - 1\}$, where $S^2_{W_s\hat{e}}$ is the sample variance of the $W_{s,i}\hat{e}_i$. This approximate bound is compared to auto-correlation estimates. Using the sample range in the calculation of the rule of thumb helps to counter its limitation as a plug-in. A sketch of this construction is provided below.
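The following is a minimal sketch of the Theorem 2.3 confidence set for a single coefficient, assuming a data frame `dat` whose lagged columns have already been constructed. It is illustrative, not the original analysis code, and all variable names are ours.

    ## Fit the monthly baseline model by ordinary least squares.
    fit <- lm(temp ~ temp_lag + log_co2_lag + season, data = dat)

    X <- model.matrix(fit)
    W <- solve(crossprod(X), t(X))       # W = (X'X)^{-1} X'
    s <- which(colnames(X) == "log_co2_lag")
    wz <- W[s, ] * resid(fit)            # plug-in versions of Z_{s,i} = W_{s,i} e_i

    n     <- length(wz)
    R.hat <- max(wz) - min(wz)           # plug-in range estimate
    alpha <- 0.05
    half  <- sqrt(n) * R.hat * sqrt(log(2 / alpha) / 6)
    coef(fit)[s] + c(-1, 1) * half       # conservative >= 95% confidence set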
Results. Including $\log\{\mathrm{Index}_{t-1}\}$ resulted in an approximate .06 decrease in the effect estimate for $\log\{\mathrm{CO2}_{t-1}\}$ after also adjusting for season. This did not meet the specified threshold. Hence, $\log\{\mathrm{Index}_{t-1}\}$ was dropped from the model. Season was retained to adjust for time trends. Per unit increase in the logarithm of the previous month's CO2 level, there is an estimated mean increase of 1.53 °C (≥95% CI: .25, 2.8) in the next month's global temperature after accounting for season and the previous month's global temperature. This effect was larger than the estimated .6 °C change in average global temperature (≥95% CI: .24, .95) per unit increase in the previous month's temperature. For the annual model, $\log\{\mathrm{Index}\}_{t-1}$ was found to decrease the effect of interest by approximately .07 percent and was therefore removed. Per unit change in the logarithm of the previous year's average CO2 level, the mean global temperature of the next year increased by an estimated 3.87 °C (≥95% CI: 3, 4.73).

Model checking. Figure 2.1 shows the time series, empirical CDF, histogram, and auto-correlation plots for the $W_{s,i}\hat{e}_i$ related to CO2 for each model. Both histograms provide evidence of symmetric and roughly regular distributions since they are unimodal, bell-shaped, and (mostly) vary smoothly over an interval with endpoints roughly symmetric about zero. The empirical CDF plots also display roughly the same amount of area above and below the curve, providing evidence of $U$ status. For our rule of thumb, we require $\varphi_{s,n}$ to be bounded by 0.014 and 0.018 for the monthly and annual models, respectively. It is reasonable to assert that this bound is fulfilled due to the positive-negative oscillating character of the auto-correlation plots. With the number of lags set to $\lfloor 10\log_{10}(n)\rfloor$ and $\lfloor 2^{-1}(n-1)\rfloor$, $\hat\varphi_{s,n} = -.007$ and $-.002$ for the monthly data and $-.03$ and $-.02$ for the annual data, under the supposition of strict stationarity. The time series plot for the monthly data shows no remarkable departure from this latter supposition, although there do appear to be some departures for the annual data. Figure 2.2 shows the auto-correlation and time series plots for the residuals alone. These plots appear consistent with the assumption of stationarity and sub-linear mean dependencies.

[Figure 2.1: Model Diagnostics for CO2 Weighted Residuals. The auto-correlation plots make use of the acf function of the stats package in R under default settings (R Core Team, 2022). More lags are considered when estimating $\varphi_{s,n}$.]

[Figure 2.2: Model Diagnostics for CO2 Residuals.]

Altogether, there is sufficient evidence that the previous month's or year's average log(CO2) level predicts a non-negligible increase in the average global temperature of the following month or year. Since this approach places no specific constraint on the underlying structure of statistical dependencies, e.g., it does not adopt preconceptions about strong or weak mixing or some form of m-dependence, the results of this analysis are arguably stronger. They arrive with fewer theoretical caveats. Nonetheless, important limitations lurk. Model diagnostics still make use of the residuals, which are a biased and constrained representation of the true errors. This situation, however, applies to all regression diagnostic procedures that make use of the residuals and does not rule out their use for the estimation of restricted portions of the auto-correlation. A brief sketch of these diagnostic computations follows.
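Here is a minimal sketch of the checks just described, reusing the objects (`wz`, `n`, `R.hat`) from the earlier confidence-set sketch; again, it is illustrative rather than the original code.

    ## Visual checks for regularity, symmetry, and U status of the weighted
    ## residuals, plus the rule-of-thumb bound on phi_{s,n}.
    hist(wz, breaks = 30)                     # unimodal, bell-shaped, symmetric?
    plot(ecdf(wz))                            # roughly equal area above/below?
    acf(wz, lag.max = floor(10 * log10(n)))   # oscillating auto-correlations?

    ## Approximate upper bound on the admissible average correlation:
    ## phi_{s,n} <= (n - 1)^{-1} * {R.hat^2 / (12 * S^2) - 1}
    (1 / (n - 1)) * (R.hat^2 / (12 * var(wz)) - 1)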
Ultimately, such constraints complicate, but do not eliminate, the utility of these objects' employment as a diagnostic mechanism.

2.7 Concluding remarks

This manuscript accomplished three main objectives. First, it established a small set of related identities for the variance of a vector of random sums. These identities require reasoning only about the mean number of outcome variables correlated within a sample and their average correlation. Since the true dependency structure of any collection of random variables is safely posited to be unknown and empirically unidentifiable in whole, removing the strict need to specify $n^2$ parameters is useful. It was shown that a researcher can elect to reason about these two intuitive summary constants instead, or that she can employ them in conjunction with popular covariance modeling methods to capture at least some of the variability that is missed by an invalid specification. Although these constants are unknown, so are the $n^2$ covariance parameters that statisticians specify on a daily basis. Furthermore, the cogent defense of a conservative choice for these values is a much less demanding task than the alternative in a majority of circumstances. Pertinently, these identities were also used to affirm the consistency of additive estimators, including cluster-robust variance estimators w.r.t. their identified portion of the overall variance, under the very general condition that the average number of correlated variables in a sample is asymptotically sub-linear as a function of $n$. For cluster variance estimators, this was shown to be the case even when no valid partition of the sample exists. A second accomplished objective was to extend these results to estimating equations and hence to the estimators of statistical approaches such as the generalized linear model. The third and most important contribution of this paper was to prove a sharpened version of Hoeffding's inequality for a class of commonly encountered random variables. Notably, it was proven that this inequality can apply even when every single outcome variable in a sample is statistically dependent, insofar as the magnitude of their average correlation is at least moderately controlled. This result is certainly valuable for the many fields where the assumption of weak or local dependence is especially untenable, such as climate science, social network analysis, finance, and really any ecological or sociological domain. That said, more work is due. Like all statistical models, the valid application of this inequality relies on a set of assumptions that can only be feasibly verified in practice. Although imperfect, the diagnostic processes available are equivalent to those used to check the assumptions of common regression models. In this sense, the approach established here at least possesses nomological validity.

References

Box, G. E., & Draper, N. R. (1987). Empirical model-building and response surfaces. John Wiley & Sons.
Aitken, A. C. (1936). IV.—On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.
Amemiya, T. (1985). Generalized least squares theory. In Advanced econometrics.
Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13–22.
Nelder, J. A., & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370–384.
Gardiner, J. C., Luo, Z., & Roman, L. A. (2009). Fixed effects, random effects and GEE: What are the differences? Statistics in Medicine, 28(2), 221–239.
Ziegler, A., Kastner, C., & Blettner, M. (1998). The generalised estimating equations: An annotated bibliography. Biometrical Journal, 40(2), 115–139.
Zorn, C. J. (2001). Generalized estimating equation models for correlated data: A review with applications. American Journal of Political Science, 470–490.
MacKinnon, J. G., Nielsen, M. Ø., & Webb, M. D. (2023). Cluster-robust inference: A guide to empirical practice. Journal of Econometrics, 232(2), 272–299.
Cressie, N. (2015). Statistics for spatial data. John Wiley & Sons.
Kedem, B., & Fokianos, K. (2005). Regression models for time series analysis. John Wiley & Sons.
Anselin, L. (2009). Spatial regression. The SAGE Handbook of Spatial Analysis, 1, 255–276.
Withers, C. S. (1981). Central limit theorems for dependent variables. I. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4), 509–534.
Berk, K. N. (1973). A central limit theorem for m-dependent random variables with unbounded m. The Annals of Probability, 352–354.
Ledoux, M. (2001). The concentration of measure phenomenon. American Mathematical Society.
Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 409–426.
Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297), 33–45.
Janson, S., Łuczak, T., & Ruciński, A. (2011). Random graphs. John Wiley & Sons.
Boucheron, S., Lugosi, G., & Bousquet, O. (2003). Concentration inequalities. In Summer School on Machine Learning (pp. 208–240). Springer.
Talagrand, M. (1996). A new look at independence. The Annals of Probability, 1–34.
Daniel, P. (2014). Concentration inequalities for dependent random variables [Doctoral dissertation, National University of Singapore].
Kontorovich, L., & Ramanan, K. (2008). Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6), 2126–2158.
Götze, F., Sambale, H., & Sinulis, A. (2019). Higher order concentration for functions of weakly dependent random variables. Electronic Journal of Probability, 24(85), 1–19.
Wajc, D. (2017). Negative association: Definition, properties, and applications. Manuscript, available from https://goo.gl/j2ekqM.
Janson, S. (2004). Large deviations for sums of partly dependent random variables. Random Structures & Algorithms, 24(3), 234–248.
Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2), 633–643.
Yuan, K.-H., & Jennrich, R. I. (1998). Asymptotics of estimating equations under natural conditions. Journal of Multivariate Analysis, 65(2), 245–260.
Hall, A. R. (2005). Generalized method of moments. Oxford University Press.
Moulton, B. R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics, 32(3), 385–397.
Moulton, B. R. (1990). An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics, 334–338.
Xie, M., & Yang, Y. (2003). Asymptotics for generalized estimating equations with large cluster sizes. The Annals of Statistics, 31(1), 310–347.
Feng, C., Wang, H., Han, Y., Xia, Y., & Tu, X. M. (2013). The mean value theorem and Taylor's expansion in statistics. The American Statistician, 67(4), 245–248.
Embrechts, P., Lindskog, F., & McNeil, A. (2001). Modelling dependence with copulas. Rapport technique, Département de mathématiques, Institut Fédéral de Technologie de Zurich, Zurich, 14, 1–50.
Demarta, S., & McNeil, A. J. (2005). The t copula and related copulas. International Statistical Review, 73(1), 111–129.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.
Yan, J. (2007). Enjoy the joy of copulas: With a package copula. Journal of Statistical Software, 21, 1–21.
Højsgaard, S., Halekoh, U., & Yan, J. (2006). The R package geepack for generalized estimating equations. Journal of Statistical Software, 15, 1–11.
Lan, X., Tans, P., & Thoning, K. (Version 2023-08). Trends in globally-averaged CO2 determined from NOAA Global Monitoring Laboratory measurements.
GISTEMP Team. (2023). GISS surface temperature analysis (GISTEMP), version 4.
Lenssen, N. J., Schmidt, G. A., Hansen, J. E., Menne, M. J., Persin, A., Ruedy, R., & Zyss, D. (2019). Improvements in the GISTEMP uncertainty model. Journal of Geophysical Research: Atmospheres, 124(12), 6307–6326.
OECD. (2023). Industrial production (indicator). https://doi.org/10.1787/39121c55-en (Accessed on 18 August 2023).
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.

Supplementary material

2.A Introduction

The manuscript introduces the $U$ class of random variables and utilizes a handful of related identities to offer intuition about A5. The purpose of this additional document is to prove some of the claims made in the paper about these variables. They are mostly foundational and elementary, although this does not mean that they are not informative. Moreover, this document seeks to further familiarize the reader with this type of variable and to justify model diagnostic and plug-in methods under the assumption of copious dependencies. Stated informally, the following ideas will be explored and proven here:

1. $U$ variables are characterized by the absence of a type of intrinsic linear distributional bias.
2. Equivalent definitions of $U$ status exist that connect these variables to a generalized notion of sum-symmetry.
3. Every continuous uniform distribution can be decomposed into the sum of an arbitrary unimodal $U$ variable with sub-uniform left tails and an additional error variable s.t. the latter is uncorrelated with all functions of the former.
4. $U$ variables (in a sense) generalize the classic regression condition of exogeneity.
5. The preservation of the intrinsic linear bias of a distribution during sampling is sufficient for unbiased moment estimation, even with informative sampling.
6. When $\max(\mathcal S_i) = M$ for all $i$, where once again $\mathcal S_i$ is the support of $Y_i$, then $Y_{(n)} = \max(Y_1, Y_2, \dots, Y_n)$ is a consistent estimator of $M$ under mild regularity conditions, even when dependencies are plentiful.

For simplicity, we will be avoiding any statements that are required for the existence of referenced objects. We require some additional setup for the fourth and fifth points. Consider a population of random variables $P = \{Y_k\}_{k\in K}$ for $K = \{1,\dots,N\}$ and a sample $\zeta = \{Y_i\}_{i\in I}$ for $I\subset K$, and hence $\zeta\subset P$. We will say $\delta = (\delta_1,\dots,\delta_N)$ s.t. $\delta_k = 1$ if and only if a $Y_k\in P$ is an element of $\zeta$ and it is also uncensored. We will also assume that $E\delta_k > 0$ for all $k$. It is well known, then, that an arbitrary $Y_i\in\zeta$, say $Y_{\delta_i}$, actually has a sampling distribution that is not in general equal to its population counterpart, i.e., $Y_{\delta_i}\sim f_\delta(y_i) = \{E\delta_i\}^{-1}E(\delta_i|y_i)f(y_i)$, where $f(y_i)$ is the density or mass function of $Y_i$. A toy illustration of this weighting follows.
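The following toy illustration is ours, not the manuscript's. It uses a size-biased sampling mechanism, where the inclusion probability is proportional to the outcome itself, to show how $f_\delta$ tilts the distribution while preserving the support.

    ## Informative sampling: delta | y ~ Bernoulli(y), so the observed sample
    ## follows f_delta(y) = y f(y) / E(Y). For Y ~ Beta(2, 2), f_delta is
    ## Beta(3, 2), with mean 0.6 instead of 0.5.
    set.seed(1)
    y     <- rbeta(1e5, 2, 2)             # population draws on (0, 1)
    delta <- rbinom(length(y), 1, y)      # inclusion depends on the outcome
    ys    <- y[delta == 1]                # the observed, size-biased sample

    mean(y); mean(ys)                     # approximately 0.5 versus 0.6
    range(ys)                             # support still (essentially) (0, 1)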
Provided this setup, we only need one new main assumption for this exploration:

C1. Let $\mathcal S$ be the support of the distribution of $Y$ and $\mathcal S_\delta$ be the support of the distribution of $Y_\delta$. Then $\mathcal S = \mathcal S_\delta$.

Essentially, C1 stipulates that sampling preserves the support. This is ultimately a very mild condition since it places no additional constraint on the distortion caused by informative (or biased) sampling mechanisms. All propositions are numbered separately from the main document, i.e., numbering begins anew here.

2.B U Random Variables

The first proposition characterizes a precursor to the distance between $EY$ and $Av(Y)$ by relating it to the aforementioned type of intrinsic distributional bias. It shows that the difference between $EY$ and $Av(Y)$ is proportional to the covariance between a random variable and its inverse density or mass function. It is this covariance that we can conceptualize as an inherent bias within a probability distribution. Roughly put, the more a random variable is correlated with its own (inverse) density or mass function, the farther its expected value will be from its average value. This correlation also imparts deviations from uniformity and skews the shape of the distribution.

Proposition 2.B.1. Let $Y = g(X_1,\dots,X_k)$ be a bounded random variable and construct an indicator variable $1_R$ to designate when $(x_1,\dots,x_k)\in\mathcal S$. Then $Av_x\{g(X_1,\dots,X_k)\} = E\{g(X_1,\dots,X_k)\} + R_x^{-1}\sigma_{g(X_1,\dots,X_k),\,1_Rf(X_1,\dots,X_k)^{-1}}$. Similarly, $Av(Y) = EY + R^{-1}\sigma_{Y,\,1_Rf^{-1}(Y)}$.

Proof. Suppose the premises. Then
$$E\{g(X_1,\dots,X_k)f(X_1,\dots,X_k)^{-1}1_R\} - E\{g(X_1,\dots,X_k)\}E\{1_Rf^{-1}(X_1,\dots,X_k)\} = \int_{\mathcal S}g(x_1,\dots,x_k)\,d(x_1,\dots,x_k) - E\{g(X_1,\dots,X_k)\}R_x,$$
which implies $R_x^{-1}\sigma_{g(X_1,\dots,X_k),\,1_Rf(X_1,\dots,X_k)^{-1}} = Av_x\{g(X_1,\dots,X_k)\} - E\{g(X_1,\dots,X_k)\}$. Rearrangement supplies the result. The proof is exactly analogous for the second identity. ■

As an important side note, functional averages with respect to (w.r.t.) different spaces are not equal in general. The following proposition states when they are.

Corollary 2.B.1. $Av_x\{g(X_1,\dots,X_k)\} = Av(Y)$ if and only if $R^{-1}\sigma_{Y,\,1_Rf^{-1}(Y)} = R_x^{-1}\sigma_{g(X_1,\dots,X_k),\,1_Rf(X_1,\dots,X_k)^{-1}}$.

Proof. This proof follows easily from Proposition 2.B.1. Only one direction will be supplied:
$$Av_x\{g(X_1,\dots,X_k)\} = Av(Y) \implies E\{g(X_1,\dots,X_k)\} + R_x^{-1}\sigma_{g(X_1,\dots,X_k),\,1_Rf(X_1,\dots,X_k)^{-1}} = EY + R^{-1}\sigma_{Y,\,1_Rf^{-1}(Y)} \implies R_x^{-1}\sigma_{g(X_1,\dots,X_k),\,1_Rf(X_1,\dots,X_k)^{-1}} = R^{-1}\sigma_{Y,\,1_Rf^{-1}(Y)}. \blacksquare$$

A special case of when Corollary 2.B.1 holds is when $Y = \sum_i^k X_i$, it is also the case that $Y$ and each $X_i$ are $U$ random variables, and the joint support of $\sum_i^k X_i$ is equal to the Cartesian product of the marginal supports. We prove the most important part of this special case. This will then be used to show that if $Y\in U$, then $cY\in U$ and $Y + c\in U$ for all $c\in\mathbb{R}$.

Proposition 2.B.2. Let $Y = \sum_i^k X_i$. Say $\mathcal S$ is the support of $\sum_i^k X_i$ and $\mathcal S_i$ is the support of $X_i$ for all $i$. If $\mathcal S = \mathcal S_1\times\mathcal S_2\times\cdots\times\mathcal S_k$ and an arbitrary $X_i\in U$, then $Av_x(Y) = EY$.

Proof. We prove the continuous case without loss of generality (WLOG):
$$Av_x(Y) = \{\prod_i^k R_i\}^{-1}\int_{\mathcal S}\{\sum_i^k x_i\}\,dx_1\cdots dx_k = \sum_i^k Av(X_i) = \sum_i^k EX_i = EY. \blacksquare$$

Proposition 2.B.3. Suppose $Y\in U$ and let $c\in\mathbb{R}$. Then $cY\in U$ and $Y + c\in U$.

Proof. Let $c\in\mathbb{R}$ be arbitrary. Since expectations are linear, $E(cY) = cEY = cAv(Y)$.
Since integrals and sums are linear operators, $Av(cY) = R^{-1}\int_{\mathcal S}cy\,dy = c\cdot R^{-1}\int_{\mathcal S}y\,dy = cAv(Y)$, which completes this part of the proof WLOG. Now, say $Z = Y + c$. Since $c$ is a constant, $Av_x(Z) = Av(Y) + c = EZ$ by Proposition 2.B.2. Note that $\mathcal S_Z$, the support of $Z$, is just a shift of $\mathcal S$, the support of $Y$. Hence, for the discrete case, $|\mathcal S_Z| = |\mathcal S| = R$. Then, for $c_i\in\mathcal S$, $Av(Z) = R^{-1}\sum_i^R\{c_i + c\} = Av(Y) + c = Av_x(Z) = EZ$. The continuous case is analogous. We can let $\mathcal S$ be a union of $p$ intervals of real numbers and say $R_i = M_i - m_i$ is the length of the $i$th interval. Then $Av(Y) = \{\sum_i^p R_i\}^{-1}\{\sum_i^p 2^{-1}R_i(M_i + m_i)\}$. Since $Z$ is simply a location shift:
$$Av(Z) = \{\sum_i^p R_i\}^{-1}\int_{\mathcal S_Z}z\,dz = \{\sum_i^p R_i\}^{-1}\{\sum_i^p 2^{-1}R_i(M_i + m_i + 2c)\} = \{\sum_i^p R_i\}^{-1}\{\sum_i^p 2^{-1}R_i(M_i + m_i)\} + c = Av(Y) + c. \blacksquare$$

The next sequence of results establishes equivalent definitions of $U$ status under the additional assumption of regularity, i.e., that the support of the random variable is a single interval of real numbers or a set of integers $\{m, m+1, m+2, \dots, M-1, M\}$. We will condition on $1_R$ to relieve some notational burden.

Lemma 2.B.1. Let $Y$ be a regular random variable defined on support $\mathcal S = [m, M]$ with cumulative distribution function $F(y)$ and survival function $S(y)$. Say $k\neq 0$ is an arbitrary real number. Then:

1. $\int_m^M F(y)^k\,dy = \sigma_{F(Y)^k,\,f(Y)^{-1}} + R\cdot E\,F(Y)^k$
2. $\int_m^M S(y)^k\,dy = \sigma_{S(Y)^k,\,f(Y)^{-1}} + R\cdot E\,S(Y)^k$
3. $\sigma_{F(Y),\,f(Y)^{-1}} = R^{-1}\sigma_{Y,\,f(Y)^{-1}}$

Proof. Statements 1 and 2 are a direct application of Proposition 2.B.1. For statement 3, apply the special case s.t. $k = 1$ to statement 1:
$$\int_m^M F(y)\,dy = \sigma_{F(Y),\,f(Y)^{-1}} + 2^{-1}R \implies M - EY = \sigma_{F(Y),\,f(Y)^{-1}} + 2^{-1}R \implies M - 2^{-1}M + 2^{-1}m - EY = \sigma_{F(Y),\,f(Y)^{-1}} \implies Av(Y) - EY = \sigma_{F(Y),\,f(Y)^{-1}}.$$
The last statement immediately follows by again applying Proposition 2.B.1. ■

Proposition 2.B.4. Suppose $Y$ is a regular, continuous random variable with support $\mathcal S = [m, M]$, CDF $F(y)$, survival function $S(y) = 1 - F(y)$, and define $\varepsilon = Y - EY$. Then the following statements are equivalent:

1. $Y\in U$
2. $\sigma_{Y,\,f(Y)^{-1}} = 0$
3. $M - EY = EY - m$
4. $\int_m^M\{F(y) - S(y)\}\,dy = 0$
5. $\sigma_{F(Y),\,f(Y)^{-1}} = 0$
6. $F(Y)\in U$
7. $\int_{m-EY}^{M-EY}\varepsilon^k\,d\varepsilon = 0$ for all odd integers $k$

Proof. Equivalences between statements 1, 2, 3, 5, and 6 follow easily from Proposition 2.B.1, the definition of being $U$ class, regularity (since $Av(Y) = 2^{-1}(M + m)$), and Lemma 2.B.1. To see how the equivalences with statements 3 and 4 follow:
$$EY = 2^{-1}(M + m) \implies 2EY = M + m \implies EY - m = M - EY.$$
However, the last line is true if and only if $\int_m^M F(y)\,dy = \int_m^M S(y)\,dy$. This can easily be shown by applying integration by parts to each side of the equality. For the last statement, note that statements 1 through 6 imply that $\varepsilon^k$ is an odd function integrated over an interval that is symmetric about zero. For the opposite implication, assuming statement 7 implies that $(M - EY)^s = (EY - m)^s$ for some even integer $s$. Since $M\ge EY$ and $EY\ge m$, it is then true that $M - EY = EY - m$. This completes the proof. ■

The next two propositions are for integer-valued random variables. They can be generalized, but this is not accomplished here.

Proposition 2.B.5. Suppose $Y$ is an integer-valued random variable with $\mathcal S = \{1, 2, 3, \dots, M\}$, CDF $F(y)$, and survival function $S(y) = 1 - F(y)$. Then the following statements are equivalent:

1. $Y\in U$
2. $\sigma_{Y,\,f(Y)^{-1}} = 0$
3. $\sum_i^M F(i) = EY$
4. $\sum_i^M\{F(i) - S(i)\} = 1$
5. $Av\{F(Y)\} = 2^{-1}(M^{-1} + 1)$

Proof. The first two equivalences follow from Proposition 2.B.1 and the definition. By Gauss' summation identity, $\sum_i^M i = 2^{-1}M(M + 1)$. Hence, $Av(Y) = 2^{-1}(M + 1)$.
This implies that $M + 1 = 2EY$, which implies statement 3 via the fact that $\sum_i^M F(i) = M + 1 - \sum_i^M if(i)$. Assuming statement 3 then implies statements 1 and 2, also by this last identity. For statement 4, suppose $Y\in U$. Then:
$$\sum_i^M\{F(i) - S(i)\} = 2\sum_i^M F(i) - M = 2EY - M = 1.$$
Now, suppose $\sum_i^M\{F(i) - S(i)\} = 1$. Then:
$$1 = M + 2 - 2EY \implies EY = 2^{-1}(M + 1).$$
For the last statement:
$$M^{-1}\sum_{j=1}^M F(j) = 2^{-1}(M^{-1} + 1) \implies \sum_{j=1}^M 2F(j) - M = 1 \implies \sum_{j=1}^M\{2F(j) - 1\} = 1 \implies \sum_{j=1}^M\{F(j) - S(j)\} = 1 \implies Y\in U.$$
One reverses the direction for the other equivalences. ■

The next three propositions bound the distance between the expected and average values.

Proposition 2.B.6. Let $Y = g(X_1, X_2, \dots, X_k)$ be a measurable function of bounded random variables. Then:
$$|E(Y) - Av_x(Y)| \le \sqrt{Av_x(Y^2)}\sqrt{R_x\cdot E\{f(X_1, X_2, \dots, X_k)\} - 1}$$
$$|E(Y) - Av(Y)| \le \sqrt{Av(Y^2)}\sqrt{R\cdot E\{f(Y)\} - 1}$$

Proof. We prove the case for continuous random variables w.r.t. $x$ WLOG, writing $x$ as shorthand for $(x_1, x_2, \dots, x_k)$. Since
$$\int_{\mathcal S}g(x)f(x)\,dx - R_x^{-1}\int_{\mathcal S}g(x)\,dx = \int_{\mathcal S}g(x)\{f(x) - R_x^{-1}\}\,dx,$$
the Cauchy-Schwarz inequality implies
$$|E(Y) - Av_x(Y)| \le \sqrt{\int_{\mathcal S}g(x)^2\,dx\int_{\mathcal S}\{f(x) - R_x^{-1}\}^2\,dx}.$$
Now, for the right-hand side of the inequality:
$$\sqrt{\int_{\mathcal S}g(x)^2\,dx\int_{\mathcal S}\{f(x) - R_x^{-1}\}^2\,dx} = \sqrt{R_xAv_x(Y^2)}\sqrt{\int_{\mathcal S}f(x)^2\,dx - 2R_x^{-1} + R_x^{-1}} = \sqrt{Av_x(Y^2)}\sqrt{R_x\cdot E\{f(X_1, X_2, \dots, X_k)\} - 1}. \blacksquare$$

Proposition 2.B.7. Let $Y\sim f(y)$ be continuous and regular with density $f(y)$. Suppose $\min_{y\in\mathcal S}\{f(y)\} = m_f > 0$ and say $\max_{y\in\mathcal S}\{f(y)\} = M_f$. Then:

1. $|EY - Av(Y)| \le (2\sqrt{3})^{-1}\sqrt{R}\sqrt{Av\{f^{-1}(Y)\} - R}$
2. $|EY - Av(Y)| \le (4\sqrt{3})^{-1}\{m_f^{-1} - M_f^{-1}\}$

Proof. By Lemma 2.B.1:
$$|EY - Av(Y)| = |\sigma_{F(Y),\,f^{-1}(Y)}| \le \sigma_{F(Y)}\sigma_{f^{-1}(Y)} = (2\sqrt{3})^{-1}\sigma_{f^{-1}(Y)}.$$
Statement 1 follows from the fact that $Var\{f^{-1}(Y)\} = \int_m^M f^{-1}(y)\,dy - R^2$. Statement 2 follows from the fact that $Var\{f^{-1}(Y)\} \le 4^{-1}(m_f^{-1} - M_f^{-1})^2$, since $f^{-1}(Y)$ is a bounded random variable. ■

Proposition 2.B.8. Let $Y\sim f(y)$ be a regular random variable with CDF $F(y)$. Then $f(Y)\in U$ implies that $Y\in U$ if and only if $F(Y)\in U$. If $Y$ is discrete, this is also provided that $\mathcal S = \{1, 2, 3, \dots, M\}$.

Proof. First we prove the discrete case. Note that, in general, $E\{F(Y)\} = 2^{-1}[1 + E\{f(Y)\}]$. This can be shown by expanding the sum $\sum_{i\in\mathcal S}F(y_i)f(y_i)$, which is equal to $\sum_{i\in\mathcal S}f(y_i)^2 + \sum_{i<j:\,i,j\in\mathcal S}f(y_i)f(y_j)$. Since it is then the case that $\{\sum_{i\in\mathcal S}f(y_i)\}^2 = \sum_{i\in\mathcal S}f(y_i)^2 + 2\sum_{i<j:\,i,j\in\mathcal S}f(y_i)f(y_j) = 1$, $E\{F(Y)\} = 1 - \sum_{i<j:\,i,j\in\mathcal S}f(y_i)f(y_j)$, and $E\{f(Y)\} = \sum_{i\in\mathcal S}f(y_i)^2$, the aforementioned identity follows. Now, suppose $f(Y)\in U$. Then $E\{f(Y)\} = M^{-1}\sum_{i=1}^M f(i) = M^{-1}$. By Proposition 2.B.6, then, $|Av(Y) - EY| \le 0$, which implies that $Y\in U$. Since $E\{f(Y)\} = M^{-1}$, $F(Y)\in U$ by Proposition 2.B.5. The continuous case follows from Proposition 2.B.6 and Proposition 2.B.4 since $E\{f(Y)\} = R^{-1}$. ■

Proposition 2.B.9. Let $Y\sim f(y)$ be a regular and continuous random variable. Then $f^{-1}(Y)\in U$ implies that $Y\in U$. Furthermore, if $Y_k = g(X_1,\dots,X_k)$ for some function $g$ s.t. $\min_{y\in\mathcal S}\{f(y)\} = m_f\to\infty$ as $k\to\infty$ and $\lim_{k\to\infty}Y_k = Z$ is bounded, then $Y_k$ converges in distribution to a $U$ random variable.

Proof. Both statements are consequences of Proposition 2.B.7. Since $E\{f^{-1}(Y)\} = R$, the first statement follows from Proposition 2.B.7 since $|Av(Y) - EY| \le 0$. For the second statement, note that $|Av(Y_k) - EY_k| \le (4\sqrt{3})^{-1}m_f^{-1}$, again by Proposition 2.B.7 and also the fact that $M_f > 0$. Say $Z = \lim_{k\to\infty}Y_k$. Then:
$$\lim_{k\to\infty}|Av(Y_k) - EY_k| = |Av(\lim_{k\to\infty}Y_k) - E(\lim_{k\to\infty}Y_k)| \le 0.$$
Hence, $Y_k$ converges in distribution to $Z\in U$. ■
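As a quick numeric illustration (ours, not part of the manuscript), statements 1 and 2 of Proposition 2.B.4 can be checked by Monte Carlo for a symmetric truncated normal variable, which is regular on $[-2, 2]$ with $EY = Av(Y) = 0$ and a density bounded away from zero:

    ## Monte Carlo check of Proposition 2.B.4: for Y ~ N(0,1) truncated to
    ## [-2, 2], Y is in U, so EY = Av(Y) = 0 and sigma_{Y, 1/f(Y)} = 0.
    set.seed(1)
    p <- pnorm(c(-2, 2))
    y <- qnorm(runif(1e6, p[1], p[2]))   # inverse-CDF draws from the truncation
    f <- dnorm(y) / (p[2] - p[1])        # truncated-normal density at y
    mean(y)                              # approximately 0 = Av(Y) (statement 1)
    cov(y, 1 / f)                        # approximately 0 (statement 2)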
An informal corollary is interesting. Say $Y_k$ has a density in a scale family of distributions, i.e., $Y_k\sim\sigma^{-1}f(\sigma^{-1}y_k)$, where $\sigma = \sqrt{Var(Y_k)}$ and $f(\cdot)\le Q\in\mathbb{R}^+$. Then, if $Var(Y_k)\to 0$ as $k\to\infty$, $\min_{y_k\in\mathcal S}\{\sigma^{-1}f(\sigma^{-1}y_k)\}\to\infty$ and Proposition 2.B.9 applies. This is useful because, insofar as it is reasonable to assert that, say, $\bar{Y}$ is in a scale family of distributions as stated, convergence in mean square is sufficient for convergence to $U$ status.

The next proposition provides insight into how regular $U$ variables can be constructed. To this end, we use the fact that an arbitrary density can be expressed as $f(y) = \{\int_{\mathcal S}g(y)\,dy\}^{-1}g(y)$ w.r.t. an appropriate function $g$ and, just the same, an arbitrary mass function can be expressed as $f(y) = \{\sum_{i\in\mathcal S}g(y_i)\}^{-1}g(y)$ for an appropriate function.

Corollary 2.B.2. Suppose $Y\sim f(y)$ is regular with CDF $F(y)$ and survival function $S(y)$. Suppose $f(y) = \{\int_{\mathcal S}g(y)\,dy\}^{-1}g(y)$ and $G(y) = \int_m^y g(t)\,dt$ for some function $g$. Then $Av\{G(y)\} = 2^{-1}\int_m^M g(y)\,dy$ if and only if $Y\in U$. For the discrete case on $\mathcal S = \{1,\dots,M\}$, $Av\{G(y)\} = 2^{-1}(R^{-1} + 1)\sum_{i=1}^R g(i)$ if and only if $Y\in U$, where $f(y) = \{\sum_{i\in\mathcal S}g(y_i)\}^{-1}g(y)$ and $G(y) = \sum_{i=1}^y g(i)$.

Proof. The proof is largely omitted. It follows from the re-expression of the mass function or density and Propositions 2.B.5 and 2.B.4, respectively. ■

We provide two simple examples of functions that generate continuous $U$ distributions before moving forward.

Example 2.1. Set $g(y) = 1$. Then $G(y) = (y - m)$ and $Av\{G(y)\} = R^{-1}\{R\cdot 2^{-1}(M + m) - Rm\} = 2^{-1}R = 2^{-1}\int_m^M 1\,dy$. This generates a $\mathrm{Unif}(m, M)$ random variable.

Example 2.2. Set $g(y) = \sin(y)$ on $[0, \pi]$. Then $G(y) = 1 - \cos(y)$ and $Av\{G(y)\} = \pi^{-1}\pi = 1 = 2^{-1}\cdot 2 = 2^{-1}\int_0^\pi\sin(y)\,dy$. Then $f(y) = 2^{-1}\sin(y)$ generates a $U$ distribution.

2.B.1 Decomposition of Continuous Uniform Random Variables

This next proposition is theoretically interesting and useful. It essentially states that all continuous uniform distributions can be decomposed into a type of $U$ variable and an error such that the random error is uncorrelated with all functions of that $U$ variable.

Theorem 2.B.1 (Uniform Decomposition Theorem). Consider $U\sim\mathrm{Unif}(m, M)$ on $\mathcal S = [m, M]$. Suppose $Y\in U$ is an arbitrary regular random variable that has a unimodal distribution on $\mathcal S$. Furthermore, suppose $\inf_{y\in\mathcal S}\{f(y)\}\le Av\{f(Y)\}$ s.t. $\inf_{y\in\mathcal S}\{f(y)\}$ occurs in the left tail. Then $U = Y + \varepsilon$ for some random variable $\varepsilon$ s.t. $E(\varepsilon|Y) = 0$.

Proof. Observe that $Av\{f(Y)\} = R^{-1}$, which is the density of $U$. Since $Y$ has a unimodal distribution on the same support as $U$, their cumulative distribution functions possess the single-crossing property, i.e., there is some point $c\in\mathcal S$ s.t. $F(y)\le G(y)$ for $y\in[m, c]$ and $G(y)\le F(y)$ for $y\in[c, M]$, where $G(y)$ is the uniform CDF. Since $\inf_{y\in\mathcal S}\{f(y)\}\le R^{-1}$ in the left tail, this implies that $\int_m^c F(y)\,dy\le\int_m^c G(y)\,dy$ for this $c$ in the support. Now, since $Y\in U$, $EY = EU$. For contradiction, suppose there exists a $t\in\mathcal S$ s.t. $\int_m^t F(y)\,dy > \int_m^t G(y)\,dy$. We know from the above that $t > c$. We now follow the implied logic:
$$\int_m^t F(y)\,dy > \int_m^t G(y)\,dy \implies M - EY - \int_t^M F(y)\,dy > M - EU - \int_t^M G(y)\,dy \implies \int_t^M F(y)\,dy < \int_t^M G(y)\,dy.$$
Recall, since $F(y)$ and $G(y)$ have the single-crossing property, $F(y)\ge G(y)$ for $y\in[c, M]$. Hence, since $t > c$, $\int_t^M F(y)\,dy\ge\int_t^M G(y)\,dy$. Therefore, a contradiction is reached and $\int_m^t F(y)\,dy\le\int_m^t G(y)\,dy$ for all $t\in\mathcal S$. By the Rothschild-Stiglitz theorem, then, this implies the existence of a random variable $\varepsilon$ s.t. $U = Y + \varepsilon$ and $E(\varepsilon|Y) = 0$. ■

Corollary 2.B.3.
Let $Y$ be a unimodal $U$ random variable defined on $\mathcal S = [m, M]$ and suppose $\inf_{y\in\mathcal S}\{f(y)\}\le Av\{f(Y)\}$ s.t. $\inf_{y\in\mathcal S}\{f(y)\}$ occurs in the left tail. Then $Var(Y)\le 12^{-1}R^2$.

Proof. Suppose the premises. Then by Theorem 2.B.1, $U = Y + \varepsilon$, where $U\sim\mathrm{Unif}(m, M)$ and $\varepsilon$ is some random variable s.t. $E(\varepsilon|Y) = 0$. Therefore, $Var(U) = Var(Y) + Var(\varepsilon)$. This means that $12^{-1}R^2\ge Var(Y)$. ■

2.B.2 U Variables and Regression

Up until this point, we have omitted the notation that is related to sampling. We did this to make things more focused and readable. Now, there is an explicit need for it. When we use $E_\delta(\cdot)$, this signifies that the expectation is taken w.r.t. the sampling distribution. A short word will be provided on the applicability of $U$ concepts to stochastic linear regressions, mostly for inference about associations. Since observational studies do not truly fix the analysis on a design matrix very often, stochastic regression concepts are sometimes more appropriate. This implies the following data generating mechanism, which is stated more generally w.r.t. $X_\delta\in\mathbb{R}^{n\times p}$: $Y_\delta = X_\delta\beta + \varepsilon_\delta$. Again, strict exogeneity, i.e., that $E_\delta(\varepsilon|X) = 0$, is a sufficient condition for asymptotically unbiased estimation of $\beta$, provided other weak dependency and regularity conditions. $U$ concepts can offer an alternative condition for consistent estimation that is milder in some respects. To start, we will say two functions $h(x)$ and $g(x)$ are orthogonal if $\int_{\mathcal S_\delta}h(x)g(x)\,dx = 0$ when $\mathcal S_\delta$ is a union of intervals of real numbers, or $\sum_{\mathcal S_\delta}g(x_i)h(x_i) = 0$ when $\mathcal S_\delta$ is discrete. The following lemma provides a general basis for constructing instrumental variables. It states that if a variable $g(Z)$ is orthogonal to the expected error function, conditional on $Z$, then $g(Z)$ weighted by its inverse density (mass) function is a valid instrumental variable.

Lemma 2.B.2. Let $\varepsilon$ be an arbitrary random variable s.t. $E_\delta\varepsilon = 0$ and say $Z\sim f_\delta(z)$ is an arbitrary random variable. Then $g(z)$ and $E_\delta(\varepsilon|z)$ are orthogonal if and only if $E_\delta\{f_\delta(Z)^{-1}g(Z)\varepsilon\} = 0$.

Proof. Suppose the premises. We will prove the statement for continuous $Z$ WLOG:
$$0 = E_\delta\{f_\delta(Z)^{-1}g(Z)\varepsilon\} = E_\delta[E_\delta\{f_\delta(Z)^{-1}g(Z)\varepsilon|Z\}] = E_\delta\{f_\delta(Z)^{-1}g(Z)E_\delta(\varepsilon|Z)\} = \int_{\mathcal S_Z}f_\delta(z)^{-1}g(z)E_\delta(\varepsilon|z)f_\delta(z)\,dz = \int_{\mathcal S_Z}g(z)E_\delta(\varepsilon|z)\,dz. \blacksquare$$

Proposition 2.B.10 generalizes the notion of exogeneity. A further word is provided following the proof.

Proposition 2.B.10. Let $\varepsilon$ be an arbitrary random variable s.t. $E_\delta\varepsilon = 0$ and say $Z\sim f_\delta(z)$ is an arbitrary bounded random variable. Then $E_\delta(\varepsilon|Z)\in U$ if and only if $E_\delta\{f_\delta(Z)^{-1}\varepsilon\} = 0$.

Proof. Suppose the premises. We will prove the statement for continuous $Z$ WLOG. Set $g(Z) = 1$. ($\Leftarrow$) In Lemma 2.B.2, we saw that $0 = E_\delta\{f_\delta(Z)^{-1}\varepsilon\} = \int_{\mathcal S_Z}E_\delta(\varepsilon|z)\,dz$. This, however, implies that $Av\{E_\delta(\varepsilon|Z)\} = 0$. Thus, since $0 = E_\delta\{E_\delta(\varepsilon|Z)\} = Av\{E_\delta(\varepsilon|Z)\}$, it is implied that $E_\delta(\varepsilon|Z)\in U$. ($\Rightarrow$) Suppose $E_\delta(\varepsilon|Z)\in U$. Then it is true that $0 = R_Z^{-1}\int_{\mathcal S_Z}E_\delta(\varepsilon|z)\,dz$, which implies $\int_{\mathcal S_Z}E_\delta(\varepsilon|z)\,dz = E_\delta\{f_\delta(Z)^{-1}\varepsilon\} = 0$. ■

The typical exogeneity condition states that $E_\delta(\varepsilon|Z) = 0$. This stipulation is recognizable as a special case of Proposition 2.B.10 since, if $E_\delta(\varepsilon|Z) = 0$, it is trivially true that $E_\delta(\varepsilon|Z)\in U$. Hence, Lemma 2.B.2 and Proposition 2.B.10 offer another route to identifying and consistently estimating regression parameters, although this is not explored further here.
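The following toy example is ours, not the manuscript's, and it illustrates Proposition 2.B.10: with $Z\sim\mathrm{Beta}(2, 2)$ and $E_\delta(\varepsilon|Z) = \sin(2\pi Z)$, strict exogeneity fails pointwise, yet $\sin(2\pi z)$ integrates to zero over $(0, 1)$, so $E_\delta(\varepsilon|Z)\in U$ and the inverse-density-weighted moment condition still holds.

    ## E(eps | z) = sin(2*pi*z) is not identically zero, but it integrates to
    ## 0 over the support, so E(eps | Z) is in U (Proposition 2.B.10).
    set.seed(1)
    z   <- rbeta(1e6, 2, 2)
    eps <- sin(2 * pi * z)         # E_delta(eps) = 0 by symmetry of Beta(2, 2)

    mean(eps / dbeta(z, 2, 2))     # approximately 0: weighted moment holds
    cov(z, eps)                    # clearly nonzero: ordinary exogeneity fails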
Next, recall that C1 stipulates that a sampling process has preserved the support of a random variable, and that $Y_\delta$ is a sampled random variable, i.e., its probability distribution is a sample distribution. The next proposition is useful since it establishes that preserving the intrinsic linear bias of a population distribution during sampling is sufficient and necessary for population moment identification, provided the support has been preserved.

Proposition 2.B.11. Denote a random variable $Y_\delta$ and let $g$ be some measurable function such that C1 holds. Then $E_\delta\{g(Y)\} = E\{g(Y)\}$ if and only if $\sigma_{g(Y_\delta),\,1_Rf_\delta^{-1}(Y)} = \sigma_{g(Y),\,1_Rf^{-1}(Y)}$.

Proof. Suppose C1. Then $Av\{g(Y_\delta)\} = Av\{g(Y)\}$. ($\Leftarrow$) Suppose $\sigma_{g(Y_\delta),\,1_Rf_\delta^{-1}(Y)} = \sigma_{g(Y),\,1_Rf^{-1}(Y)}$. Then:
$$Av\{g(Y_\delta)\} = Av\{g(Y)\} \implies E_\delta\{g(Y)\} + R^{-1}\sigma_{g(Y_\delta),\,1_Rf_\delta^{-1}(Y)} = E\{g(Y)\} + R^{-1}\sigma_{g(Y),\,1_Rf^{-1}(Y)} \implies E_\delta\{g(Y)\} = E\{g(Y)\} + R^{-1}\sigma_{g(Y),\,1_Rf^{-1}(Y)} - R^{-1}\sigma_{g(Y_\delta),\,1_Rf_\delta^{-1}(Y)} \implies E_\delta\{g(Y)\} = E\{g(Y)\} + 0.$$
($\Rightarrow$) This case is analogous in logic and is thus omitted. ■

An easy corollary is that $\sigma_{g(Y),\,E\{\delta_i|g(Y)\}} = 0$ if and only if the intrinsic linear bias of a distribution is preserved, conditional on the support of the distribution remaining intact under sampling. Since many different sample distributions of $Y$ exist that meet these conditions, this substantiates a much milder condition for unbiased moment estimation than non-informative sampling.

2.B.3 Plug-ins and Model Diagnostics

The concentration inequalities employed for inference in the manuscript ultimately require $R_i = M_i - m_i$ to be known or estimable. It is known that $Y_{(n)} - Y_{(1)}$ (the maximum minus the minimum order statistic) is a consistent estimator when statistical dependencies are local or restricted, i.e., when some version of weak mixing applies, or when outcome variables become approximately independent w.r.t. some conception of distance (Leadbetter & Rootzén, 1982). We do not suppose this here. Hence, we require a different justification for using extreme order statistics. The example analysis of the paper made use of stochastic linear regression w.r.t. the model $Y = X\beta + \varepsilon$. As in the paper, say $(X^\top X)^{-1}X^\top = W$ and $B = WY$ is an estimator of $\beta$ s.t. $E(W\varepsilon)\to 0$ as $n$ grows arbitrarily large. Also consider $\{W_{s,i}\hat{e}_i\}_{i\in I}$, where $I = \{1,\dots,n\}$, $\hat{e}_i = Y_i - X_iB$, and $W_{s,i}$ is cell $(s,i)$ of $W$. Ultimately, we are interested in the properties of $\{W_{s,i}\varepsilon_i\}_{i\in I}$. When an arbitrary $\mu_{s,n} = o(n)$, $B\overset{p}{\to}\beta$ and hence $\hat{e}_i\overset{p}{\to}\varepsilon_i$ as $n\to\infty$. When this is the case, $W_{s,i}\hat{e}_i = W_{s,i}\varepsilon_i + W_{s,i}o_p(1)$ s.t. $nW_{s,i}\cdot o_p(1)\to 0$ as $n\to\infty$. Since we are assuming strict stationarity, we can let $s$ be arbitrary and say $Z_i = W_{s,i}\varepsilon_i$ to reason more generally. The next proposition offers an asymptotic justification for using $Z_{(n)}$ WLOG.

Proposition 2.B.12. Consider $\{Z_i\}_{i\in I}$ s.t. each $Z_i\sim F_i(z)$ is a continuous random variable on support $\mathcal S_i$ and $\max_{i\in I}(\mathcal S_i) = M$. Moreover, say $Z_{(n)} = \max(Z_1,\dots,Z_n)$ has a density supported on $\mathcal S_n\subseteq\mathcal S = [-M, M]$ s.t. $\max(\mathcal S_n) = M$ for all $n\in\mathbb{N}$. Now, let $Z_{A_i}\sim F_i(z|Z_{i-1}\le z, Z_{i-2}\le z,\dots,Z_1\le z)$ be well-defined under the convention that $Z_{A_1}\sim F_1(z)$. Denote the sequence of conditional CDFs $(F_n\{z|Z_{n-1}\le z,\dots,Z_1\le z\}, F_{n-1}\{z|Z_{n-2}\le z,\dots,Z_1\le z\},\dots,F_2\{z|Z_1\le z\}, F_1\{z\})$ as $\mathcal F = (F_i\{z|A_i\})_{i\in I}$. Next, note that it is true that $1_{z<M}F_i(z|A_i) < 1$ for all $z\in\mathcal S$ for at least $k(n)\in\mathbb{N}$ of the distribution functions of $\mathcal F$. If $k(n)\to\infty$ as $n\to\infty$, then $Z_{(n)}\overset{p}{\to}M$ as $n\to\infty$.

Proof. First, note that for any random variable $X$ on $\mathcal S_X = [m, M]$ s.t. $-M\le m$, it is true that $\int_{-M}^{M}\{2^{-1} - F(x)\}\,dx = EX$. Now, note that $F(Z_{(n)}\le z) = \Pr(Z_1\le z, Z_2\le z,\dots,Z_n\le z) = F_n(z|Z_{n-1}\le z,\dots,Z_1\le z)\cdot F_{n-1}(z|Z_{n-2}\le z,\dots,Z_1\le z)\cdots F_2(z|Z_1\le z)\cdot F_1(z)$ for an arbitrary $z\in\mathcal S$. Construct a set $\mathcal W\subseteq I$ w.r.t. $\mathcal F$ s.t. for an arbitrary $w\in\mathcal W$, $1_{z<M}F_w(z|A_w) < 1$ for all $z\in\mathcal S$.
Furthermore, say $F^*(z^*|A^*) = \max_{w\in\mathcal W,\,z\in\mathcal S}\{1_{z<M}F_w(z|A_w)\} < 1$. Then for $z < M$:
$$F(Z_{(n)}\le z) = F_n(z|Z_{n-1}\le z,\dots,Z_1\le z)\cdots F_2(z|Z_1\le z)\cdot F_1(z) \le 1^{n-k(n)}\cdot\{F^*(z^*|A^*)\}^{k(n)} \le \{F^*(z^*|A^*)\}^{k(n)}.$$
From the above, we know that $EZ_{(n)} = \int_{-M}^{M}\{2^{-1} - F(Z_{(n)}\le z)\}\,dz$. Since $F(Z_{(n)}\le z)\le\{F^*(z^*|A^*)\}^{k(n)}$ for all $z < M$, it follows that $2^{-1} - \{F^*(z^*|A^*)\}^{k(n)}\le 2^{-1} - F(Z_{(n)}\le z)$ for all $z < M$ and:
$$\int_{-M}^{M}[2^{-1} - \{F^*(z^*|A^*)\}^{k(n)}]\,dz \le EZ_{(n)} \le M.$$
Since $\{F^*(z^*|A^*)\}^{k(n)}\to 0$ as $n\to\infty$ and hence $k(n)\to\infty$:
$$M - 2M\cdot\lim_{n\to\infty}\{F^*(z^*|A^*)\}^{k(n)} \le \lim_{n\to\infty}EZ_{(n)} \le M.$$
Finally, this of course implies that $M - 0 = M\le\lim_{n\to\infty}EZ_{(n)}\le M$. Hence, by the squeeze theorem, $EZ_{(n)}\to M$ as $n\to\infty$. From here, since we know that $\max(\mathcal S_n) = M$ for all $n\in\mathbb{N}$, and $EZ_{(n)}\to M$ as $n\to\infty$, it is then implied that $F(Z_{(n)}\le z)$ converges to a distribution function s.t. $\Pr(Z_{(n)} < M) = 0$ and $\Pr(Z_{(n)}\ge M) = 1$ on limiting support $\mathcal S_n = \{M\}$. Hence, $Z_{(n)}\overset{d}{\to}M$, which implies that $Z_{(n)}\overset{p}{\to}M$ since $M$ is a constant. ■

Stated informally, the main condition of the last proposition will be fulfilled when the dependencies present in the sample are not very extreme. This is a mild condition because it can still hold even when every single outcome variable is statistically dependent. Pertinently, since $Z_{(1)} = -\max(-Z_1,\dots,-Z_n)$, Proposition 2.B.12 can automatically be applied to sample minimums as well. This is also important since it establishes that $2^{-1}(Z_{(1)} + Z_{(n)})$ is a consistent estimator of the mid-range. If $Z\in U$, it is also a consistent estimator of $EZ$, possibly even when $\bar{Z}$ is not. For our paper, Proposition 2.B.12 can be applied to $\hat{Z}_i = W_{s,i}\hat{e}_i$ under the auspices of its additional premises when it is also true that $nW_{s,i}\hat{e}_i$ converges in probability to $nW_{s,i}\varepsilon_i$. This approach also works for fixed regressions with very little difference. For this case, we simply require that $\mu_n = o(n)$ for $\{\varepsilon_i\}_{i\in I}$, that Proposition 2.B.12 applies to $\{Y_i\}_{i\in I}$ w.r.t. their marginal distributions, and that the model is validly specified for its first moments. Lastly, we made use of the empirical CDF for checking the $U$ status of $W_{s,i}\varepsilon_i$. Since the empirical CDF is an arithmetic average of indicator functions, we actually require $\mu_D = o(n)$ for consistency. This is because two indicator variables are uncorrelated if and only if they are independent. Consistency follows from the propositions in the main document. This does not mean that the empirical CDF is useless when $\mu_D = O(n)$. In this case, each estimator is still unbiased. Therefore, although it is not likely to be consistent, it can still be used to gauge the veracity of the assumption. If a $U$ shape is visually present, this is still adequate evidence that the assumption is at least approximately true. This is because the probability of observing such a shape should be fairly low when the assumption is not met. In a situation where convergence in probability does not hold, there should naturally be more variability present. In short, we would expect to see shapes that deviate from our expectation more often than not.

References

Leadbetter, M., & Rootzén, H. (1982). Extreme value theory for continuous parameter stationary processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 60(1), 1–20.

Chapter 3
The Functional Average Treatment Effect

Abstract

This paper establishes the functional average as an important estimand for causal inference. The significance of the estimand lies in its robustness against traditional issues of confounding.
We prove that this robustness holds even when the probability distribution of the outcome, conditional on treatment or some other vector of adjusting variables, differs arbitrarily from its counterfactual analogue. This paper also examines possible estimators of the functional average, including the sample mid-range, and proposes a new type of bootstrap for robust statistical inference: the Hoeffding bootstrap. After this, the paper explores a new class of variables, the $U$ class of variables, that simplifies the estimation of functional averages. This class of variables is also used to establish mean exchangeability in some cases and to provide the results of elementary statistical procedures, such as linear regression and the analysis of variance, with causal interpretations. Simulation evidence is provided. The methods of this paper are also applied to a National Health and Nutrition Survey data set to investigate the causal effect of exercise on the blood pressure of adult smokers.

3.1 Introduction

The Neyman-Rubin (NR) model is an important framework for causal inference with observational designs (Rubin, 2019; Pearl, 2010; Holland, 1986; Imbens & Rubin, 2015). Say we are interested in studying a population of random variables $\{Y_i\}_{i\in I}$, where $I = \{1,\dots,N\}$, conditional on exposure to $T_i$. Here, we specify $T_i$ as binary for simplicity and denote this conditional variable as $Y_{t_i}$. At any point in time, it is impossible to observe both $Y_{t_i=1}$ and $Y_{t_i=0}$ for an arbitrary unit $i$: a fact that makes individual causal contrasts, such as $Y_{t_i=1} - Y_{t_i=0}$, undefined. This problem characterizes the 'fundamental problem of causal inference' (Ding & Li, 2018). The NR model frames this challenge in the language of missing outcomes and constructs populations of counterfactual probability distributions to address it, say $\{Y_i^{t=1}\}_{i\in I}$ and $\{Y_i^{t=0}\}_{i\in I}$. These populations represent a hypothetical situation such that (s.t.) the treatment status of the entire population has been experimentally fixed to $T = t$. The goal under the NR framework is to identify summary causal effects that would have occurred, provided one could actually have constructed and manipulated both $\{Y_i^{t=1}\}_{i\in I}$ and $\{Y_i^{t=0}\}_{i\in I}$ (Hernán & Robins, 2010). More specifically, under this rubric, we wish to identify some multivariate function $g$ and some vector of adjusting variables $L$ s.t. $E\{g(Y_{t_1=1},\dots,Y_{t_{n_1}=1}) - g(Y_{t_1=0},\dots,Y_{t_{n_0}=0})\,|\,L\}$ equals its counterpart evaluated from the elements of the hypothetical populations, at least asymptotically (Imbens & Rubin, 2010). Here, we again use the difference of estimands only as an example. When this is possible, unmeasured variables are said to be ignorable, conditional on $L$, and the conclusions of the observational study are equivalent to those of an experimental one (Rosenbaum & Rubin, 1983; Holland & Rubin, 1987).

Although $g$ can be any function that has scientifically meaningful properties, a small number of summary functions have dominated the literature. Estimands related to quantiles have commandeered some attention, for instance (Imbens & Rubin, 2010; Jin & Rubin, 2008; Belloni, Chernozhukov, Fernández-Val, & Hansen, 2017; Gangl, 2010). A lion's share, however, has been claimed by the arithmetic mean. Unfortunately, the adage that 'there is no free lunch' applies to this function.
When a researcher wishes to replace $EY^t$ with $EY_t$ to estimate the expected treatment effect, a set of additional conditions is required: consistency (C1), mean exchangeability (C2), and positivity (C3) (Cole & Frangakis, 2009). These conditions will soon be defined in detail. For now, it suffices to state that researchers often attempt to achieve C2 conditional on a vector of adjusting variables $L$ so that $EY_{t,L} = EY_L^t$. Standardization (iterated expectation) is then used to yield the quantity of interest, since $E\{EY_L^t\} = EY^t$ under C1-C3. A cardinal problem is that C2 is non-trivial to achieve in observational studies, even conditional on some random vector $L$ (Hernán & Robins, 2006; Greenland, Pearl, & Robins, 1999a, 1999b). Moreover, $L$ is often high in dimension. Parametric methods must then be employed to approximate the mean model of $Y$ in conjunction with standardization, and bias likely ensues outside of toy examples (Hernán & Robins, 2010). One of our cardinal contributions is to highlight a different summary causal effect, the functional average, as a valuable estimand for the NR framework, since it avoids many of these challenges, it can impart a causal interpretation to the results of standard statistical procedures, and it is identifiable under mild conditions. For example, if $Y_t$ and $Y^t$ have the same image in the traditional analysis sense, this is sufficient. Their probability distributions can otherwise differ arbitrarily. Confounding is immaterial, insofar as it does not change the image of the underlying function(s). So is informative sampling more generally. All that matters for identification is that, theoretically, $Y_t$ and $Y^t$ pull from the same set of real numbers, at least conditional on some $L$. Although not necessary, this is at least sufficient for establishing what we call functional average exchangeability.

The remainder of this paper goes as follows. For clarity, we mathematically define functional averages and prove an elementary but fundamental claim in Section 3.2. Moreover, we show that the functional average treatment effect is salient when the researcher believes that an intervention alters the set of possible values that an outcome can achieve or when no expected causal effect exists. After this, we examine a small set of functional average estimators, including the sample mid-range, and establish their statistical consistency under general conditions. Since their sampling distributions are largely intractable, however, we re-purpose the bootstrap as a method for conservative inference. In Section 3.3, we provide elucidation on a particular class of bounded random variables, the $U$ class, that generalizes the notion of symmetry and assists in the estimation of functional averages. We also show that $U$ random variables possess many favorable properties when it comes to causal inference. These facts allow us to also prove, under the auspices of a causal theory, that linear regressions estimate causal effects under a standard set of assumptions already employed for associational studies. Section 3.4 presents simulation evidence that substantiates our claims. Finally, in Section 3.5, we use our strategies in conjunction with data from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS) to investigate whether exercise activity causally impacts the systolic blood pressure (SBP) of adult smokers.
Plentiful evidence exists that smoking is associated with cardiovascular disease processes and mortality (Glantz & Parmley, 1991; Stallones, 2015). Evidence has also been presented that smoking is a factor in arterial stiffening (Narkiewicz, Kjeldsen, & Hedner, 2005). However, while some literature has supported the propositions that exercise lowers arterial blood pressure (Elley & Arroll, 2002) and that smokers who exercise show fewer signs of arterial stiffening (Park, Miyachi, & Tanaka, 2014), the evidence is not yet definitive. Functional average estimation targets deterministic changes in the structure of an outcome variable and is thus an informative tool in this context.

3.2 Functional Average Treatment Effect

In this section, we first introduce important definitions and notation, although some concepts will be left implicit for readability. For instance, we leave the underlying probability space of the form $(\Omega, \mathcal F, P)$ for an arbitrary random variable $Y(\omega):\Omega\to\mathbb{R}$ unstated, and the same goes for probability spaces defining joint distributions. Recall that the support of a random variable is a smallest closed set $\mathcal S$ s.t. $\Pr(Y\in\mathcal S) = 1$. Alternatively, it can also be defined as the closure of the set of values $\mathcal S$ s.t. the density or mass function $f(y) > 0$ for all $y\in\mathcal S$. Here, we will be dealing with bounded random variables, which means that $\mathcal S$ is a strict subset of the real numbers. This is not a limiting constraint: anything that can be empirically measured is necessarily bounded. With these concepts, we can revisit the functional average. If $\mathcal S$ is discrete, define $R = |\mathcal S|$, where $|\cdot|$ in this context denotes the number of elements in the set. If $Y$ is continuous, then $R = \int_{\mathbb R}1_{y\in\mathcal S}\,dy$ and the functional average $Av(\cdot)$ is $Av(Y) = R^{-1}\int_{\mathbb R}y\,1_{y\in\mathcal S}\,dy$. For discrete variables, it is $Av(Y) = R^{-1}\sum_{y_i\in\mathcal S}y_i$. Note that we have avoided the use of general measures for purposes of accessibility. Sometimes it will be the case that, for some measurable function $g$, $Y = g(X_1,\dots,X_k)$. Then the support of $Y$ with respect to (w.r.t.) the joint distribution of $(X_1,\dots,X_k) = X\in\mathbb{R}^k$ is some general region $\mathcal R\subseteq\mathcal S_1\times\cdots\times\mathcal S_k$, where each $\mathcal S_i$ indicates the support of $X_i$. Without loss of generality (WLOG), we will henceforth deal only with the continuous case. In this context, $R = \int_{\mathbb R^k}1_{(x_1,\dots,x_k)\in\mathcal R}\,dx_1\cdots dx_k$ and $Av_x(Y) = R^{-1}\int_{\mathcal R}g(x_1,\dots,x_k)\,dx_1\cdots dx_k$. Now, let $E_hY$ indicate that the expectation of $Y$ is taken w.r.t. a different density or mass function $h(y)$ that is also defined on $\mathcal S$. Then it is also apparent that $Av(Y) = E_hY$ when $h(y)$ is a uniform density or mass function. When a subscript is omitted and $Y\sim f(y)$, it will be understood that the expectation is taken w.r.t. the baseline density (mass) function $f$, provided it exists. Otherwise, we say that $E_UY = Av(Y)$ as a special case, although we will mostly avoid this notation. This is because $Av(Y)$ is best interpreted through the lens of basic, deterministic analysis. The exception to this statement is when $Y$ truly follows a uniform probability law. A functional average treatment effect, for any two treatment values of interest $t, t'$, can then be defined as $h\{Av(Y^t), Av(Y^{t'})\} = h\{E_UY^t, E_UY^{t'}\}$ for a user-specified function $h$. In this paper, we set $h$ to a simple difference for exploratory purposes, i.e., $h\{Av(Y^t), Av(Y^{t'})\} = \Delta_{t,t'} = Av(Y^t) - Av(Y^{t'})$.

3.2.1 Examples of Applicability

The average functional value is not a usual focus in statistical settings. The expected value w.r.t.
the baseline measure has instead largely been the object of interest. Hence, we offer a short argument and demonstration of its importance as a preliminary apologia. By definition, an expected value is a sum of all possible values, where each value is weighted by the probability of observation or its density. However, the chance (or density) of observation is extraneous to causal relationships that are unrelated to altered probabilities. Although the functional average can also be construed, albeit counterfactually in most cases, as an expected value, the uniform measure imbues it with a more deterministic interpretation. This is because it does not require a probabilistic framework, although such a framework is often necessary for its estimation.

3.2.1.1 Example 3.1

Say $T$ is a binary treatment variable s.t. $T = 1$ when a particular psychotropic medication is received and $Y$ is a Likert scale measuring anxiety in individuals with clinical depression. Also say that $Y_{t=0}$ can take any integer between one and ten with the following probabilities: $\{.01, .04, .05, .1, .15, .15, .3, .1, .05, .05\}$. Then $EY_{t=0} = 6.14$ and $Av(Y_{t=0}) = 5.5$. Now, say $Y_{t=1}$ has non-zero mass only on integers between one and eight with the following probabilities: $\{.01, .01, .01, .05, .1, .5, .18, .14\}$. Under this scheme, it is also the case that $EY_{t=1} = 6.14$. This makes the detection of a causal effect impossible if only the expected treatment effect is utilized. However, $Av(Y_{t=1}) = 4.5$: a value that possibly reflects the elimination of extreme anxiety under treatment, albeit at the cost of more mild to moderate anxiety experiences. These quantities are easily verified, as in the short check below.
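A quick numeric check of these figures (ours, for illustration only):

    ## Example 3.1: equal expected values, different functional averages.
    p0 <- c(.01, .04, .05, .10, .15, .15, .30, .10, .05, .05)  # support 1:10
    p1 <- c(.01, .01, .01, .05, .10, .50, .18, .14)            # support 1:8

    sum(1:10 * p0)   # E[Y | t = 0] = 6.14
    sum(1:8  * p1)   # E[Y | t = 1] = 6.14
    mean(1:10)       # Av(Y_{t=0})  = 5.5
    mean(1:8)        # Av(Y_{t=1})  = 4.5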
3.2.2 Identifying and Estimating Counterfactual Functional Averages

Next, we prove some basic statements about functional averages under mild conditions and the rubric of informative sampling. For this, we specify a conditional population of interest $\mathcal{P} = \{Y_{t,1}, \dots, Y_{t,N}\}$ s.t. $Y_{t,1} \sim f(y\,|\,T = t)$ WLOG. This setup can be defined with additional conditioning or extended to unconditional circumstances, but this is omitted here for brevity. Additionally, observe a complete-case sample $\zeta \subset \mathcal{P}$ and a complementary vector of indicator variables $\boldsymbol{\delta} = (\delta_1, \dots, \delta_N)$ s.t. $\delta_i = 1$ if and only if $Y_{t,i} \in \zeta$. We also assume that $\mathrm{E}(\delta_i\,|\,y_i, t) > 0$ for all $i$, which implies that $\mathrm{E}(\delta_i\,|\,t) = \pi_i > 0$ for all $i$, an assumption that is typically called sampling positivity. It is well-known that an arbitrary $Y_{t,i} \in \zeta$ does not, in general, follow the distribution of the theoretical population (Pfeffermann & Sverchkov, 2009; Pfeffermann, Krieger, & Rinott, 1998; Patil, Rao, Zelen, & Patil, 1987; Patil & Rao, 1978). Instead, $Y_{t,i} \in \zeta$ possesses a weighted density or mass function $f_\delta(y_i\,|\,t) = \pi_i^{-1}\mathrm{E}(\delta_i\,|\,y_i, t)f(y_i\,|\,t)$. It is easy to show, then, that $\mathrm{E}_\delta Y_{t,i} = \pi_i^{-1}\sigma_{Y_{t,i}, \mathrm{E}(\delta_i|Y_{t,i})} + \mathrm{E}Y_{t,i}$. Here, the notation $\sigma_{Y_{t,i}, \mathrm{E}(\delta_i|Y_{t,i})}$ denotes the covariance $\mathrm{E}\{Y_{t,i}\mathrm{E}(\delta_i\,|\,Y_{t,i})\} - \mathrm{E}Y_{t,i}\,\pi_i$. It is also easy to show, insofar as $\mathrm{E}(\delta_i\,|\,y_i, t) > 0$ for all $y_i$ and $t$, that $f_\delta(y\,|\,t)$ is supported on the same set as $f(y\,|\,t)$. For conciseness, we often denote $Y_{t,i} \in \zeta$ as $Y_{t,\delta_i}$ under the implicit assumption that $\delta_i = 1$. We also use $Y_{T_i} \in \zeta$ or $Y_{T,\delta_i}$ with the understanding that $T$ is fixed to whatever value it takes for unit $i \in I$.

We now specify our short list of assumptions more formally, with a re-statement of C1-C3 for clarity.

C1: $Y_{t,\delta_i} = Y^t_{t,\delta_i}$ for all $t \in \mathcal{S}_T$ and $\delta_i$ (Consistency)

C2: Let $L$ be a vector of random variables. Then $\mathrm{E}Y_t = \mathrm{E}Y^t$ or, conditional on $L$, $\mathrm{E}Y_{t,L} = \mathrm{E}Y^t_L$ (Mean exchangeability)

C3: $0 < \Pr(T_i = 1\,|\,L) < 1$ for all $i \in I$ (Positivity)

C4: The supports of $Y_{T,\delta_i}$ and $Y^{T_i}$ are the same, i.e., for all $Y_{T_i} \in \zeta$, $\mathcal{S}_{Y_{T,\delta_i}} = \mathcal{S}_{Y^{T_i}}$. This can also be stated conditional on $L$

C5: Let $Z_{\delta_i}$ be a random variable and say that $L_n = (I, E_n)$ is an undirected graph with node set $I$ and link set $E_n$ s.t. a link $e_{i,j} \in E_n$ between two nodes $i, j \in I$ is present if and only if $\sigma_{Z_{\delta_i}, Z_{\delta_j}} \neq 0$. Then the mean degree of this graph $n^{-1}\sum_{i=1}^n\sum_{j=1}^{n-1}\mathbf{1}_{e_{i,j} \in E_n} = \mu_n = o(n)$, where $\mathbf{1}_{e_{i,j} \in E_n} = 0$ when $i = j$ by convention and each indicator variable is non-stochastic

The usual proviso that referenced mathematical objects exist is mostly omitted. Reiterating the meaning of C4 is useful as a stepping stone to further consideration. Put succinctly, when C4 holds, it means that the counterfactual distribution and the conditional distribution possess the same support, either conditional on some $L = l$ or marginally. Rejecting this notion is equivalent to positing that certain values in the support of $Y^T$ ($Y^T_l$) can never materialize with $Y_T$ ($Y_{T,l}$). This is a strong assertion with non-trivial epistemic consequences, especially in the context of noninformative sampling. If it is believed that C4 cannot be obtained, then those values that exist in the counterfactual support alone have no real-world meaning. In this circumstance, we can simply condition on those that can materialize at no empirical loss. Notably, sufficient conditions for C4 to hold are the existence of a possibly unmeasured and unknown composite confounder $U$ s.t. $Y^t$ is conditionally independent of $T$ provided $U$, the truth of C1, and the stronger form of sampling positivity. These conditions are articulated without further conditioning on $L$ at no loss of generality. For demonstrative purposes, we prove that these statements imply C4 informally for the case s.t. $U$ is absolutely continuous, also at no loss.
To this end, observe the following identity under the first stated premise that $U$ and the referenced densities exist:

$f(y^t) = \int_{\mathcal{S}_U} f(y^t\,|\,t, u)f(u)\,du = \int_{\mathcal{S}_U} f(y\,|\,t, u)f(u)\,du$

Now, note that the following statement is also true:

$f(y\,|\,t) = \int_{\mathcal{S}_U} f(y\,|\,t, u)f(u\,|\,t)\,du$

Since both $f(y^t)$ and $f(y\,|\,t)$ are functions of $y$ alone and otherwise share $f(y\,|\,t, u)$ as a basis, it is then implied that $f(y^t)$ and $f(y\,|\,t)$ are strictly positive on the same set of values. Furthermore, since $\mathrm{E}(\delta\,|\,y, t) > 0$ implies that $f_\delta(y\,|\,t) > 0$ on the same set of $y$ values s.t. $f(y\,|\,t) > 0$, by transitivity, $f_\delta(y\,|\,t)$ also shares the same support with $f(y^t)$. This supplies C4.

The reason we chose to treat C4 as an assumption, however, is that doing so does not require the existence of a random variable $U$ with the stated properties. It is therefore feasible to achieve C4 in more general circumstances. Nevertheless, the conditions that imply C4 are also relatively mild and employed on a regular basis. For instance, every condition except one (the stronger form of sampling positivity) is required by methodologies that estimate expected causal effects via C1-C3. The stronger form of sampling positivity is often assumed implicitly. As a consequence, if one believes that it is even possible to attempt identification of the expected causal effect, then the functional average treatment effect is already identified in many circumstances.

The last assumption, C5, is required for establishing statistical consistency. Results are often proven under the premises of mutual independence and non-informative sampling. This restricts their utility, especially since many modern research settings depend upon non-probability samples of outcome variables that partake in complicated and unknown systems of possibly 'long-range' probabilistic dependence. Informative sampling is a further complication, since it can itself induce statistical dependencies. By proving our results under more general conditions, we expand their reliability into these contexts. C5 essentially asserts that the mean number of outcome variables in a sample that a typical one is correlated with is sub-linear in $n$, i.e., that $n^{-1}\mu_n \to 0$ as $n \to \infty$. Note that this is a very mild assumption, since it still allows the mean number of statistical dependencies present in the sample to diverge with sample size. It places no additional constraint on the exact form of the probabilistic dependencies. We also reference an alternative:

C5': This assumption is exactly the same, except it makes use of a dependence graph s.t. a link exists between two nodes if and only if their corresponding outcome variables are statistically dependent.

Our first proposition establishes functional average exchangeability. Although it is trivial mathematically, it provides a useful foundation.

Proposition 3.1. (Functional Average Exchangeability) Suppose C1 and C4. Then $\mathrm{Av}(Y_{T,\delta}) = \mathrm{Av}(Y^T)$.

Proof. The result follows directly from the premises. Since $\mathcal{S}_{Y_{T,\delta}} = \mathcal{S}_{Y^T}$, $\int_{\mathcal{S}_{Y_{T,\delta}}} 1\,dy = \int_{\mathcal{S}_{Y^T}} 1\,dy = R$ WLOG for the continuous case, and $\int_{\mathcal{S}_{Y_{T,\delta}}} y\,dy = \int_{\mathcal{S}_{Y^T}} y\,dy$. ■

Proposition 3.1 is extendable to $\mathrm{Av}_x(Y^T)$ or to the conditional case s.t. there exists some vector $L$ where $\mathrm{Av}(Y_{T,L,\delta}) = \mathrm{Av}(Y^T_L)$. However, again, this is omitted.
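To illustrate Proposition 3.1, the sketch below draws an informative sample whose inclusion probability depends on the outcome itself. All numbers are hypothetical; the point is that the biased sampling badly distorts the mean while, because $\mathrm{E}(\delta\,|\,y) > 0$ everywhere, the support (and hence the functional average of a regular variable) survives.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: an SBP-like outcome truncated to [110, 370],
# as in Example 3.2, with a wide sd so both extremes carry some mass.
N = 500_000
pop = rng.normal(155, 60, N)
pop = pop[(pop >= 110) & (pop <= 370)]

# Informative sampling: inclusion probability depends on the outcome,
# but E(delta | y) > 0 everywhere, so the support is preserved (C4).
p_incl = np.where(pop < 140, 0.9, 0.1)
sample = pop[rng.random(pop.size) < p_incl]

midrange = lambda v: 0.5 * (v.min() + v.max())
print(pop.mean(), sample.mean())         # the sampled mean is badly biased
print(midrange(pop), midrange(sample))   # both sit near Av(Y) = 240 for large N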
Before continuing, a common caveat is due. Identifying a counterfactual parameter statistically is not equivalent to identifying a causal one. Causal relationships cannot be inferred from statistical relationships alone (Pearl, 2003). A theory of causation, perhaps represented by a structural causal model, is therefore still required if causal meanings are to be supplied to the functional average (Pearl, 2010; Hernán & Robins, 2010). These results only simplify this process for a small set of related parameters by removing the strict need to measure a proper set of adjusting variables in some circumstances. In other words, if C1 and C4 hold, then, provided a structural causal model, the functional average effect can be identified even without accounting for unmeasured confounders.

3.2.3 The Problem of Estimation

The simplicity of Proposition 3.1 and the relative mildness of C4 unfortunately coexist with the difficulty of estimating $\mathrm{Av}(Y_{T,\delta})$. Here, the 'no free lunch' adage returns. A theoretical estimator can be constructed, nevertheless, using the following two identities, where we tacitly condition on $\mathbf{1}_{\mathcal{S}_Y}$ for ease of reading: $\mathrm{E}\{f^{-1}(Y)Y\} = \int_{\mathcal{S}} y\,dy$ and $\mathrm{E}\{f^{-1}(Y)\} = R$. This naturally suggests an estimator of the following form:

$\widetilde{\mathrm{Av}}(Y_\delta) = \{\sum_{i=1}^n f_{\delta_i}^{-1}(Y_{\delta_i})\}^{-1}\sum_{i=1}^n f_{\delta_i}^{-1}(Y_{\delta_i})Y_{\delta_i}$  (3.1)

In this section, we investigate some of the features of plug-in estimators for eq. (3.1). After exploring the discrete case, we offer brief commentary on the difficulties of the continuous one. Then we revisit the sample mid-range estimator. After completing these explorations, we introduce a bootstrapping strategy for conducting inference.

3.2.3.1 Discrete Estimators of $\widetilde{\mathrm{Av}}(Y_\delta)$

When $Y_{\delta_i}$ is discrete and $\hat{f}_n(y) = n^{-1}\sum_{i}^n \mathbf{1}_{Y_{\delta_i} = y} = n^{-1}M_y$, the empirical plug-in for eq. (3.1) reduces to an intuitive estimator. Say $\mathbf{1}_{y \in \mathcal{S}_\zeta}$ is an indicator that a value $y \in \mathcal{S}_Y$ is observed and therefore in the support of the empirical distribution, $\mathcal{S}_\zeta$. Then the empirical plug-in for eq. (3.1) reduces to $\widehat{\mathrm{Av}}(Y_\delta) = \{\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\}^{-1}\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\,y$ under the convention that $\hat{f}_n^{-1}(y) = 0$ when $y \notin \mathcal{S}_\zeta$. To see this, observe that for an arbitrary set of materialized sample values in $\zeta$, where $\zeta$ is temporarily treated as a set of constants, $\sum_{y \in \zeta}\hat{f}_n^{-1}(y) = n \cdot |\mathcal{S}_\zeta|$ and $\sum_{y \in \zeta}\hat{f}_n^{-1}(y)\,y = n \cdot \sum_{y \in \mathcal{S}_\zeta} y$. Therefore, the discrete plug-in for eq. (3.1) is simply the arithmetic average of the unique values observed in the sample.

We now establish the statistical consistency of this plug-in under general dependency conditions. For the rest of this section, we omit notation for $\delta$ with the understanding that it is implicit whenever we are dealing with sampled outcomes. To this end, note that $\Pr(Y_1 \neq y, Y_2 \neq y, \dots, Y_n \neq y) = \Pr(Y_n \neq y\,|\,Y_{n-1} \neq y, \dots, Y_1 \neq y) \cdot \Pr(Y_{n-1} \neq y\,|\,Y_{n-2} \neq y, \dots, Y_1 \neq y) \cdots \Pr(Y_1 \neq y)$, and define a corresponding sequence $F = (\Pr(Y_i \neq y\,|\,A_i))_{i \in I}$ under the convention that $\Pr(Y_1 \neq y\,|\,A_1) = \Pr(Y_1 \neq y)$.

Proposition 3.2. Suppose a sample $\zeta = \{Y_i\}_{i \in I}$. Observe $F = (\Pr(Y_i \neq y\,|\,A_i))_{i \in I}$ as previously defined and say $k(n) = |\{s \in F\,|\,s < 1\}|$, where $s \in F$ indicates here that $s$ is present in the sequence. If $k(n) \to \infty$ as $n \to \infty$, then $\widehat{\mathrm{Av}}(Y) \overset{a.s.}{\to} \mathrm{Av}(Y)$ as $n \to \infty$, where $\overset{a.s.}{\to}$ denotes almost sure convergence.

Proof. Let $y \in \mathcal{S}_Y$ be arbitrary and denote $\mathcal{S}_\zeta \subseteq \mathcal{S}_Y$ as the set of observed values. Then $\Pr(y \in \mathcal{S}_\zeta) = 1 - \Pr(y \notin \mathcal{S}_\zeta) = 1 - \Pr(Y_1 \neq y, Y_2 \neq y, \dots, Y_n \neq y) = 1 - \Pr(Y_n \neq y\,|\,Y_{n-1} \neq y, \dots, Y_1 \neq y) \cdot \Pr(Y_{n-1} \neq y\,|\,Y_{n-2} \neq y, \dots, Y_1 \neq y) \cdots \Pr(Y_1 \neq y)$. Now, suppose $k(n)$ probabilities in the sequence $F$ are strictly less than one. Denote the maximum of these probabilities as $\Pr(Y_* \neq y)$ and note that since $\Pr(Y_* \neq y) < 1$, there exists some $\varepsilon > 0$ s.t. $\Pr(Y_* \neq y) = 1 - \varepsilon$.
Then:

$\Pr(y \notin \mathcal{S}_\zeta) = \Pr(Y_n \neq y\,|\,Y_{n-1} \neq y, \dots, Y_1 \neq y) \cdot \Pr(Y_{n-1} \neq y\,|\,Y_{n-2} \neq y, \dots, Y_1 \neq y) \cdots \Pr(Y_1 \neq y) \leq 1^{n-k(n)} \cdot \{\Pr(Y_* \neq y)\}^{k(n)} = \{1 - \varepsilon\}^{k(n)}$

Hence:

$0 \leq \lim_{n \to \infty}\Pr(y \notin \mathcal{S}_\zeta) \leq \lim_{n \to \infty}\{1 - \varepsilon\}^{k(n)} = 0$

This of course implies that $\Pr(y \in \mathcal{S}_\zeta) \to 1$ as $n \to \infty$. Next, define an indicator variable $\mathbf{1}_{y \in \mathcal{S}_\zeta}$ and also $Z_{k^*} = \sup_{k > n}|\mathbf{1}_{y \in \mathcal{S}_{\zeta_k}} - \Pr(y \in \mathcal{S}_{\zeta_k})| = |\mathbf{1}_{y \in \mathcal{S}_{\zeta_{k^*}}} - \Pr(y \in \mathcal{S}_{\zeta_{k^*}})|$. Letting $\varepsilon > 0$ be arbitrary again:

$\Pr(Z_{k^*} > \varepsilon) \leq \varepsilon^{-2}\{\Pr(y \in \mathcal{S}_{\zeta_{k^*}}) \cdot (1 - \Pr\{y \in \mathcal{S}_{\zeta_{k^*}}\})\}$

This then implies that:

$\lim_{n \to \infty}\Pr(Z_{k^*} > \varepsilon) \leq \lim_{n \to \infty}\varepsilon^{-2}\{\Pr(y \in \mathcal{S}_{\zeta_{k^*}}) \cdot (1 - \Pr\{y \in \mathcal{S}_{\zeta_{k^*}}\})\} = 0$

Hence, $\mathbf{1}_{y \in \mathcal{S}_\zeta} \overset{a.s.}{\to} 1$. Thereby, since $y$ was arbitrary, it is then implied that $\widehat{\mathrm{Av}}(Y) = \{\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\}^{-1}\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\,y \overset{a.s.}{\to} R^{-1}\sum_{y \in \mathcal{S}_Y} y = \mathrm{Av}(Y)$, since $R$ is finite. ■

The elementary nature of $\widehat{\mathrm{Av}}(Y)$ makes it a reliable estimator for relatively simple outcome variables. When the scale of the outcome variable possesses a small number of unique values, $\widehat{\mathrm{Av}}(Y)$ will converge almost surely at an unknown but, in all likelihood, very fast rate, even in the presence of stark probabilistic dependencies. This statement applies to the sample extremes as well. Unfortunately, however, quantifying the rate of convergence, or the uncertainty associated with finite sample estimates, is difficult. This is true even under mutual independence. To appreciate this, it is sufficient to observe $\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\,y$. Since $\mathrm{E}\mathbf{1}_{y \in \mathcal{S}_\zeta} = p_{y,n}$ is unknown, so is $\mathrm{Var}\{\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\,y\} = \sum_{y \in \mathcal{S}_Y} p_{y,n}(1 - p_{y,n})\,y^2$. If $\zeta$ is a sample of identically distributed and mutually independent outcome variables, then we can attempt to estimate $p_{y,n}$ with $\hat{p}_{y,n} = 1 - \{1 - \hat{f}_n(y)\}^n$. However, $\hat{p}_{y,n} \equiv 0$ when $y \notin \mathcal{S}_\zeta$. Furthermore, when $y \notin \mathcal{S}_\zeta$, it is also unknown by definition. Hence, reasonably estimating $\mathrm{Var}\{\sum_{y \in \mathcal{S}_Y}\mathbf{1}_{y \in \mathcal{S}_\zeta}\,y\}$ requires knowledge that makes estimating $\mathrm{Av}(Y)$ arguably redundant. A recourse to the central limit theorem is also unavailable. This is because $|\mathcal{S}_Y|$ is finite by construction. Therefore, $\widehat{\mathrm{Av}}(Y)$ will always be a finite sum of random variables. In some circumstances, such as when $|\mathcal{S}_Y|$ is reasonably large, a normal approximation might still function with an acceptable degree of accuracy. However, for reasons already explored, this strategy will still require a strong set of assumptions about sampling probabilities and potential values.
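A minimal sketch of this discrete plug-in, using the Likert-type outcome of Example 3.1 as a stand-in population; np.unique does the work of recovering the empirical support.

import numpy as np

def av_hat_discrete(y):
    """Discrete plug-in for eq. (3.1): the arithmetic average
    of the unique values observed in the sample."""
    return np.unique(np.asarray(y)).mean()

rng = np.random.default_rng(7)
support = np.arange(1, 11)
probs = np.array([.01, .04, .05, .10, .15, .15, .30, .10, .05, .05])
y = rng.choice(support, size=2_000, p=probs)

print(y.mean())              # estimates EY (about 6.14)
print(av_hat_discrete(y))    # estimates Av(Y) = 5.5 once all values appear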
3.2.3.2 Continuous Outcomes

Kernel density estimation is an intuitive choice for estimating eq. (3.1) in the continuous case. However, the properties of this plug-in are also largely intractable and unknown. For example, although the properties of a kernel density estimator $\hat{f}_n(y)$ are well-researched for a constant $y \in \mathcal{S}_Y$ (Hansen, 2008; Chen, 2017; Zambom & Ronaldo, 2013), the behavior of $\hat{f}_n(Y)$, i.e., the random variable defined and evaluated on the same random outcome that was utilized to construct it, is not as well-studied. This is because kernel density estimates are typically evaluated on a grid of deterministic points. Establishing the asymptotic properties of a statistic of the form $\{\sum_{i=1}^n \hat{f}_n^{-1}(Y_i)\}^{-1}\sum_{i=1}^n \hat{f}_n^{-1}(Y_i)Y_i$, where $\hat{f}_n^{-1}(Y_i) = nh \cdot \{\sum_{j=1}^n K\{h^{-1}(Y_i - Y_j)\}\}^{-1}$ for some bandwidth $h > 0$ and kernel function $K(\cdot)$, although promising, is therefore also non-trivial. Such considerations also require a detailed examination of possible kernel functions. Since, in general, we are interested in establishing statistical consistency under very general dependency conditions, we avoid this enterprise in this manuscript.

We also avoid other options for density estimation, since they arrive with similar challenges, some as of yet undisclosed. For instance, estimators that use reciprocal estimated densities can possess unstable variances when the underlying distribution possesses a density that decays smoothly toward zero. Moreover, since each $\hat{f}_n(y)$ is typically a function of the entire sample, plug-ins for eq. (3.1) will necessarily possess a myriad of complex dependencies. This will prevent any elementary citation of a central limit theorem. Just as importantly, it will also limit the applicability of concentration inequalities for finite sample inference. Hence, although this is a promising area of research that demands attention, no further consideration is offered here.

Mid-range estimation

For a large special class of bounded random variables, the mid-range is a simple alternative for estimating functional averages, including those from continuous distributions. Let $Y_{(i)}$ for $i \in I_n = \{1, \dots, n\}$ denote the $i$th order statistic of a sample s.t. $Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}$. The mid-range, $\widehat{\mathrm{MR}}\{Y\}$, or simply $\widehat{\mathrm{MR}}$ when convenient, is defined as follows: $\widehat{\mathrm{MR}}\{Y\} = 2^{-1}\{Y_{(1)} + Y_{(n)}\}$. Naturally, the sample mid-range estimates the population mid-range $\mathrm{MR} = 2^{-1}(m + M)$. Linear combinations of order statistics are well-studied (Chernoff, Gastwirth, & Johns, 1967; Hosking, 1990; Bickel, 1973; David & Nagaraja, 2004). However, the sample mid-range is often ignored, especially in applied settings, because of its possible inefficiency and because its distribution admits no closed-form expression in a majority of settings. Before offering an exposition on some of its properties, we offer a useful definition, which highlights our interest in it. We say a random variable is regular when it is supported on a single interval of real numbers or a complete subset of integers. This definition is helpful because $\mathrm{Av}(Y) = \mathrm{MR}\{Y\}$ when $Y$ is a regular random variable.

Definition 3.1. A random variable will be said to be regular if and only if its support $\mathcal{S}$ is a single interval of real numbers or a complete subset of integers starting at some $m \in \mathbb{N}$ and ending with a maximum integer $M$ s.t. if integer $c \in \mathcal{S} \setminus M$, then $c + 1 \in \mathcal{S}$.

$\widehat{\mathrm{MR}}$ is a statistically consistent estimator of $\mathrm{MR}$ for outcome variables with finite support under the assumption of mutual independence. Barndorff-Nielsen (1963) established sufficient and necessary conditions for the statistical consistency of extreme order statistics. Almost sure convergence, and therefore also convergence in probability ($\overset{p}{\to}$), of an extreme order statistic to its asymptotic target is trivially fulfilled when there exists a $y \in \mathcal{S}_Y$ s.t. $F(y) = 1$ and $F(y - \varepsilon) < 1$ for all $\varepsilon > 0$. Hence, $\widehat{\mathrm{MR}}$ also converges almost surely to its population value for all bounded distributions under this setup. Sparkes and Zhang (2023) extended this result to much more general scenarios of statistical dependence. If we define a sequence of conditional cumulative distribution functions (CDFs) $F$ in the same spirit as Proposition 3.2, it can be demonstrated that extreme order statistics converge in probability to their target values for bounded random variables insofar as the number of conditional CDFs in $F$ that are strictly less than unity diverges as $n$ becomes arbitrarily large. This is once again a very mild assumption, since the dependencies involved can otherwise induce arbitrary changes in the behaviors of the distribution functions.
Since the random variables considered here are bounded, convergence in probability of the sample extremes also implies their almost sure convergence.

Nevertheless, as previously mentioned, when the distribution of $Y$ is unknown, no reliable expression for the distribution of $\widehat{\mathrm{MR}}$ is accessible for inference: a situation that is analogous to that of the discrete plug-in estimator for eq. (3.1). Although extreme value theory helps to address this issue under the assumption of an independent and identically distributed sample or a stationary sequence of outcome variables (Leadbetter & Rootzén, 1988), it is insufficient without additional parametric constraints. For instance, classical results establish, for suitable sequences of constants $a_n$ and $b_n$, that $a_n^{-1}\{Y_{(n)} - b_n\}$ converges weakly to one of three distributions under certain regularity conditions (Leadbetter & Rootzén, 1988; Smith, 1990; Haan & Ferreira, 2006; Kotz & Nadarajah, 2000). These are the Gumbel (type I), Fréchet (type II), and reverse Weibull (type III) distributions. These results are also sufficient for reasoning about the sample minimum, since $Y_{(1)} = -\max_{i \in I_n}(-Y_i)$. Bingham (1995, 1996) uses the convergence of $a_n^{-1}\{Y_{(n)} - b_n\}$ to a type I or III extreme value distribution and the asymptotic independence between $Y_{(1)}$ and $Y_{(n)}$ to derive limiting distributions of $\widehat{\mathrm{MR}}$ when its underlying distribution function is also symmetric. However, the expressions derived for these asymptotic distributions ultimately depend upon unknown normalizing constants that are specific to the marginal distribution function of the sampled outcomes. Broffitt (1974) and Arce and Fontana (1988) provide similar explorations for $\widehat{\mathrm{MR}}$ under the auspices that $\zeta$ is an independent and identically distributed sample from a symmetric power law distribution. Under this constraint, the limiting variance of $\widehat{\mathrm{MR}}$ is derived as $\{12 \cdot a^2 \cdot \{\log(n)\}^{2(1 - 1/a)}\}^{-1} \cdot \pi^2$ for some distribution-specific $a > 0$, for instance. Unlike the sample mean under these same conditions, though, an expression for a governing probability law that does not depend on the marginal distribution functions in question is again unavailable, even asymptotically. This situation extends to the asymmetric and non-independent cases, which are even more poorly studied.

Bootstrapping is a feasible option for inference, given these challenges. However, it is not without problems. Traditional bootstraps condition on the observed values of $\zeta$ and use the strong consistency of the empirical CDF $\hat{F}_n(x)$ to emulate the sampling distribution of a statistic of interest via a re-sampling procedure (Efron & Tibshirani, 1994). They require that the statistic of interest, say $T_0(Y_1, Y_2, \dots, Y_n)$, is a well-behaved functional of the marginal CDF and that the targeted parameter is not a boundary value of the support (Bickel & Freedman, 1981). Overall, bootstrapping processes usually behave as intended under the same set of conditions that supply a central limit theorem. For these reasons, traditional bootstraps are problematic for functions of extreme order statistics. The m-out-of-n bootstrap, however, has proven to be an effective procedure in this domain (Bickel & Sakov, 2008). Essentially, a basic m-out-of-n bootstrapping process re-samples $m$ observations from $\zeta$ with or without replacement s.t. $n^{-1}m \to 0$ as $m \to \infty$. It can provide approximately valid inference when traditional methods fail.
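For reference, a basic m-out-of-n percentile bootstrap for the mid-range might be sketched as follows. The choice $m = \lceil\sqrt{n}\rceil$ mirrors the setting later used in Section 3.4 and is an illustrative convention, not a recommendation; the function name is hypothetical.

import numpy as np

def midrange(y):
    return 0.5 * (np.min(y) + np.max(y))

def m_out_of_n_percentile_ci(y, alpha=0.05, B=500, rng=None):
    """Percentile CI for the mid-range from an m-out-of-n bootstrap,
    re-sampling m = ceil(sqrt(n)) observations with replacement."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y)
    m = int(np.ceil(np.sqrt(y.size)))
    stats = np.array([midrange(rng.choice(y, size=m, replace=True))
                      for _ in range(B)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])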
For additional background and resources on the topic, see Bickel and Ren (2001), Swanepoel (1986), Beran and Ducharme (1991), or Politis, Romano, and Wolf (2001). Pertinently, the m-out-of-n bootstrap is also capable of handling situations with dependent observations insofar as an appropriate sub-sampling strategy is used.

Nevertheless, the m-out-of-n bootstrap (and other forms of bootstraps for dependent observations) is still insufficient for the context and conditions of this paper. Three reasons substantiate this claim. Firstly, we require a version of the bootstrap that is capable of reliably capturing $\theta$ under fairly general but unknowable dependency conditions. This rules out approaches such as the m-out-of-n bootstrap, or bootstrapping processes such as the block bootstrap, which require a re-sampling theory that corresponds adequately to the unknown dependency structure, and which typically exclude the existence of long-range dependencies (Shao, 2010; Hall, Horowitz, & Jing, 1995; Kreiss & Paparoditis, 2011; Lahiri, 2003). Secondly, we are interested in reasoning about $\theta = \mathrm{Av}(Y)$ and not $\mathrm{E}\{\widehat{\mathrm{MR}}\}$ for a particular $n$. In most circumstances where $\widehat{\mathrm{MR}}$ will be used, i.e., those circumstances s.t. the marginal distributions are not symmetric, it will be a biased estimator (Arce & Fontana, 1988). It is likely that $\theta$ rests on the boundary of the support of $\widehat{\mathrm{MR}}$ in these circumstances. Convergence to $\theta$ might also be slow and characterized by an unknown rate (Broffitt, 1974). Consequently, any inferential procedure for $\widehat{\mathrm{MR}}$ must be flexible enough to provide cogent statements about $\theta$, and not simply about $\mathrm{E}\{\widehat{\mathrm{MR}}\}$ at a particular value of $n$, even when sample sizes are modest. This necessitates conservative approaches to inference that allow $\theta$ to sit outside of the empirical distribution of the bootstrapped statistics. It therefore rules out popular bootstrapping methodologies, which construct confidence sets that are subsets of this observed range. Lastly, we wish to use a bootstrapping strategy that does not rely on the assumption that $T_0$ is a smooth functional of $F(x)$.

Further work to produce more efficient closed-form approximations is of course a preferable route. Since the mid-range is more efficient than the sample mean when non-negligible probability rests in the extremes of the support (Rider, 1957), it can provide a more efficient estimator of expected causal effects in many circumstances: a fact that is often neglected. Overall, however, due to the complicated probabilistic character of order statistics, this is an onerous road that possesses no immediate destination, especially when outcome variables are dependent and their joint distribution is unknown. This ultimately necessitates a different type of bootstrapping strategy.

3.2.3.3 The Hoeffding Bootstrap

With these prior facts in mind, we offer two limited but related solutions, although only the first is discussed in this section. In summary, we assert that the bootstrap can be re-purposed to construct conservative confidence sets under fairly general conditions of statistical dependence and under milder regularity conditions. Notably, this re-purposed bootstrap, which we call the Hoeffding bootstrap, can be applied to all functional average estimators previously explored. We now provide a synopsis of the first approach. Further details and the proof are provided in the supplementary materials.
Essentially, we show that (1) if a statistician does not condition on the observed values of $\zeta$ and treats each re-sampled $Y_i$ as random, (2) if the maximum order statistic of the bootstrapped statistics is a discrete or an absolutely continuous random variable (at least asymptotically), and (3) if the outcome variables being re-sampled are not monotonic transformations of one another, say, or do not partake in other forms of truly extreme statistical dependence, then the estimator of the range of the bootstrapped statistics is a statistically consistent estimator of a value that is greater than or equal to the range of $T_0(Y_1, Y_2, \dots, Y_n)$ as $n$ and the number of bootstraps become arbitrarily large. Call this estimator $\hat{M} - \hat{m}$ in relation to $M$ and $m$, which now designate the maximum and minimum of the support of the bootstrap distribution. Insofar as $T_0(Y_1, Y_2, \dots, Y_n)$ has finite support, the estimator of the bootstrap range can then be used in conjunction with Hoeffding's inequality to produce large-sample confidence intervals for $\theta$ with at least $1 - \alpha$ coverage of the form $T_0(Y_1, Y_2, \dots, Y_n) \pm \{\hat{M} - \hat{m}\}\sqrt{2^{-1}\log(2/\alpha)}$. The performance of confidence sets constructed with this strategy is evaluated in Section 3.4 and also in the supplementary materials. For clarity, we provide a schematic of the process below:

I. Acquire a sample of random variables $\zeta = \{Y_i\}_{i \in I_n}$ and compute a statistic $T_0(Y_1, \dots, Y_n)$

II. Draw $m \leq n$ random variables from $\zeta$ with or without replacement via a simple random sample or a theoretically guided process that attempts to reproduce a dependency structure. Compute the new statistic $T_1$ from these variables

III. Repeat I. and II. $K(n) - 1$ times, where $K(n) = K$ is reasonably large, and construct $\{T_k\}_{k \in \mathcal{K}}$ for $\mathcal{K} = \{0, 1, 2, \dots, K\}$

IV. Set $\hat{M} - \hat{m} = \max_{k \in \mathcal{K}}(T_k) - \min_{k \in \mathcal{K}}(T_k)$

V. Construct an estimate of an at least $1 - \alpha$ confidence set with $T_0 \pm \{\hat{M} - \hat{m}\}\sqrt{2^{-1}\log(2/\alpha)}$

A minimal implementation of this schematic is sketched at the end of this subsection. Requiring the maximum order statistic of the bootstrap sample to possess a density when it is continuous is non-trivial and might limit the applicability of the approach. However, the stipulation can be feasibly checked by observing the histogram of the bootstrap distribution. If it possesses a smooth shape without too many jagged breaks in its continuity across the x-axis, this is at least a good sign. Nevertheless, this limitation is addressed in Section 3.3. Although the provisional solution offered there does not strictly require U concepts, which we also introduce in Section 3.3, they provide clarity on the topic. Essentially, we show that it is still probably safe to use a slightly more conservative form of the same confidence set when a density does not exist, or even when the sample bootstrap range is not a statistically consistent estimator of $M - m$.

Moreover, even if the conditions that validate this approach are not met, it is apropos to state that the Hoeffding bootstrap will always perform better than strategies that use bootstrapped t-statistics or bootstrapped normal approximations. This fact essentially flows from Popoviciu's inequality, which states that $\mathrm{Var}(T_0) \leq 4^{-1}(M_0 - m_0)^2$ when $T_0$ is bounded. This inequality also applies to empirical distributions. For instance, observe the case of $\alpha = .05$. Then $1.96 \cdot S_{T_k} \leq (\hat{M} - \hat{m}) \leq \{\hat{M} - \hat{m}\}\sqrt{2^{-1}\log(2/\alpha)} \approx 1.35 \cdot (\hat{M} - \hat{m})$, where $S_{T_k}$ is the sample standard deviation of the bootstrap distribution. This last fact partially motivates the use of Hoeffding's inequality.
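The following Python sketch implements steps I through V for a generic statistic. The helper name, the default simple-random-sampling re-sampling scheme, and the toy usage are illustrative assumptions rather than prescriptions.

import numpy as np

def hoeffding_bootstrap_ci(y, stat, alpha=0.05, B=500, m=None, rng=None):
    """Steps I-V: compute T0, re-sample B - 1 times, and penalize the
    bootstrap range with Hoeffding's inequality."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y)
    m = m or y.size                      # step II: m <= n draws per replicate
    t0 = stat(y)                         # step I
    stats = [t0] + [stat(rng.choice(y, size=m, replace=True))
                    for _ in range(B - 1)]           # steps II-III
    rng_hat = np.max(stats) - np.min(stats)          # step IV
    half = rng_hat * np.sqrt(0.5 * np.log(2.0 / alpha))
    return t0 - half, t0 + half                      # step V

# Usage: a conservative interval for Av(Y) via the sample mid-range.
rng = np.random.default_rng(3)
y = rng.uniform(0, 20, size=2_500)
print(hoeffding_bootstrap_ci(y, lambda v: 0.5 * (v.min() + v.max()), rng=rng))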
Note that if $m \leq T_0 \leq M$, it is already implied that $\mathrm{E}T_0 \in [m, M]$ and hence that, for some sufficiently large $n$ and $K$, $\mathrm{E}T_0 \in [\hat{m}, \hat{M}]$ with probability that is approximately one. However, since we actually want to capture $\theta$ and, in practice, we often use only a moderately sized $K$ with moderate $n$, Hoeffding's inequality supplies an intuitive and well-established interval that already arrives with a penalty that is adjusted by $\alpha$. Insofar as $T_0 \overset{a.s.}{\to} \theta$ as $n \to \infty$, using Hoeffding's inequality as a penalty also asymptotically guarantees at least $1 - \alpha$ coverage for $\theta$, as previously mentioned. A method for constructing confidence intervals such as this, although conservative, avoids dependency modeling and applies to a much larger class of statistics.

3.3 U Random Variables and Counterfactual Linear Regression

In this section, we discuss a class of random variables, the U class, that can help us avoid the difficulties encountered in Section 3.2. Recall: although we established that functional average causal effects can be identified and statistically consistently estimated under very mild assumptions, even without adjusting for confounders, establishing efficient methods of statistical inference for these estimators is challenging. Ultimately, this is because the plug-in estimators defined and the sample mid-range possess largely intractable properties, even asymptotically, in the absence of additional constraints that are in all likelihood inappropriate for applied settings. These difficulties are removed when working with U random variables as outcomes, since they ultimately allow the functional average to be estimated by standard additive statistics with well-known properties. We also show that U random variables are important because they can imbue basic linear regressions and analyses of variance with counterfactual, and thus possibly causal, interpretations under conditions traditionally assumed for estimating associations. On a similar note, we also prove that properties of U variables can be used to establish sufficient conditions for mean exchangeability and that they can be used to defend an extension of the Hoeffding bootstrap.

First, however, a definition of a U random variable is helpful. We assume that all integrals and mathematical objects exist when referenced, as per usual.

Definition 3.2. Let $g$ be a measurable function. A random variable $g(Y)$ will be said to be in the class of U random variables if and only if $\mathrm{E}\{g(Y)\} = \mathrm{Av}\{g(Y)\}$. Similarly, the same will be said w.r.t. $\mathbf{X} \in \mathbb{R}^k$ for $Y = g(X_1, \dots, X_k)$ if and only if $\mathrm{E}Y = \mathrm{Av}_x(Y)$.

Definition 3.2 stipulates that a random variable is U class w.r.t. some space when its expected value is equal to its average functional value in that space. Stated in a probabilistic fashion, a variable $Y$ is U class if one can take its expectation w.r.t. a uniform measure without changing its value. This type of random variable is ubiquitous in practice. A host of their properties has been investigated elsewhere (Sparkes & Zhang, 2023). To familiarize the reader, we provide a list of important ones in Table 3.1. Essentially, U variables are closely related in concept to sum-symmetry of the CDF and structured but uncorrelated deviations from uniformity. All bounded and symmetric random variables are in the U class, for instance, although symmetry is not a necessary condition.
Continuous and regular random variables with densities that are proportional to their standard deviation behave more and more like U random variables as $n \to \infty$ if their variance tends to zero. In an abuse of notation, we will say $Y \in U$ if $Y$ is in this class of random variables.

Table 3.1: Basic Properties of U Variables

Property | Variable Type | Conditions and Definitions
$\sigma_{Y, f^{-1}(Y)} = 0$ | A | $Y \sim f(y)$
$Y \in U \implies cY \in U$ and $Y + c \in U$ | A | $c \in \mathbb{R}$
$\mathrm{E}\{f(Y)\} = R^{-1}$ implies $Y \in U$ | A | —
$Y \in U$ is equivalent to $F(Y) \in U$ | R, C | $F(y)$ is the CDF of $Y$
$\int_m^M F(y)\,dy = \int_m^M S(y)\,dy$ | R, C, D | $S(y) = 1 - F(y)$, $\mathcal{S}_Y = [m, M]$
$U = Y + \varepsilon$ s.t. $\mathrm{E}(\varepsilon\,|\,Y) = 0$ | R, C | $U \sim \mathrm{Unif}(m, M)$; $f(y) \leq R^{-1}$ in the left tail and unimodal
$\Pr(|S_n - \mathrm{E}S_n| > \varepsilon) \leq 2\exp\{-\{\sum_{i=1}^n R_i^2\}^{-1}6\varepsilon^2\}$ | R, C | $S_n = \sum_{i=1}^n Y_i$, $Y_i \in U$, C*
$\sum_{i=1}^M F(i) = \mathrm{E}Y$ | R, D | $\mathcal{S}_Y = \{1, 2, \dots, M\}$
$\sum_{i=1}^M F(i) - \sum_{i=1}^M S(i) = 1$ | R, D | $\mathcal{S}_Y = \{1, 2, \dots, M\}$

A = All, R = Regular, C = Continuous, D = Discrete
C* = $\max\{\mathrm{E}(\exp\{sS_n\}), \mathrm{E}(\exp\{-sS_n\})\} \leq \mathrm{Av}_y(\exp\{sS_n\})$, $s > 0$

Out of these properties, we draw special attention to the concentration inequality $\Pr(|S_n - \mathrm{E}S_n| > \varepsilon) \leq 2\exp\{-\{\sum_{i=1}^n R_i^2\}^{-1}6\varepsilon^2\}$. The condition detailed in the table footnote is very mild and does not require independence. In fact, it can be true even when every single outcome variable in a sample is statistically dependent, insofar as the average correlation between those variables is mild, or $\mu_n$ is bounded if this is not the case. Discussion of this assumption is also available elsewhere (Sparkes & Zhang, 2023). Put succinctly, a researcher can expect it to be fulfilled if each $Y_i$ is symmetric, or at least relatively symmetric, and the joint distribution of the sample is biased away from n-tuples in the joint support that inflate $S_n$. We note that this is useful, since these conditions often apply to the error distributions of statistics of interest, including those of linear regressions. Moreover, if $S_n$ converges in probability to $\mathrm{E}S_n$ as $n \to \infty$, this is also supportive of the notion that the condition is fulfilled for sufficient sample sizes. A proof that $S_n$ converges almost surely to $\mathrm{E}S_n$ under our conditions is provided in the supplementary material.

Next, we introduce two simple propositions with direct practical or theoretical interest for causal inference. Proposition 3.3 follows directly from our main conditions and solves the problem of estimating $\mathrm{Av}(Y_{\delta,t})$ for variables in the U class, since it allows the estimators of the previous section to be replaced with the sample mean. Proposition 3.4 establishes an interesting sufficient condition for mean exchangeability. We only prove these statements for functional averages w.r.t. the range of $Y$; this is for conciseness. Note that all results are easily extended to the excluded case. Finally, observe that, although the notation is omitted, the following results also apply when the random variables are conditioned upon another vector of random variables $L$, perhaps to facilitate the fulfillment of C4 or U status. In this case, we would also assume C3, although this too is left unmentioned.

Proposition 3.3. Suppose C1 and C4. If $Y_{\delta,t} \in U$, then $\mathrm{E}Y_{\delta,t} = \mathrm{Av}(Y^t)$.

Proof. The proof is again one line under the premises: $\mathrm{E}Y_{\delta,t} = \mathrm{Av}(Y_{\delta,t}) = \mathrm{Av}(Y^t)$. ■

Again, the properties of plug-in estimators for eq. (3.1) are not easy to discern in general, and citations of the central limit theorem are also questionable or implausible. However, the properties of $\bar{Y}_{\delta,t}$ are exceptionally well-known. This largely solves the problem insofar as the sampling process secures a sample of U variables.
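Because U status is what licenses replacing the mid-range with the sample mean, a simple empirical screen is useful. The sketch below compares the sample mean with the sample mid-range and, following the CDF property in Table 3.1, compares the estimated areas below and above the empirical CDF over the observed range; both checks are heuristic diagnostics under the assumption of a regular, bounded outcome, not formal tests, and the function name is hypothetical.

import numpy as np

def u_class_diagnostics(y):
    """Heuristic checks of U status for a regular, bounded outcome."""
    y = np.sort(np.asarray(y))
    n = y.size
    mean, mid = y.mean(), 0.5 * (y[0] + y[-1])
    # Approximate the areas under F and S = 1 - F over [min, max]
    # with a Riemann sum on an equally spaced grid.
    grid = np.linspace(y[0], y[-1], 1_000)
    F = np.searchsorted(y, grid, side="right") / n
    width = y[-1] - y[0]
    return {"mean": mean, "midrange": mid,
            "area_F": F.mean() * width, "area_S": (1 - F).mean() * width}

rng = np.random.default_rng(11)
print(u_class_diagnostics(rng.uniform(0, 10, 5_000)))   # U: mean ~ midrange, areas match
print(u_class_diagnostics(rng.beta(5, 1, 5_000)))       # skewed: both checks diverge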
Since C4 only requires the preservation of the support, this allows for an arbitrary distortion of the population distribution otherwise: a fact that is liberating w.r.t. study design and execution. Note also that Proposition 3.3 is not as trivial as it seems. It is well known that the sample mean and mid-range estimate the same parameter when the underlying distribution is symmetric. However, it is false that all U random variables are symmetric. Hence, the U concept expands the universe in which the sample mean can replace the mid-range. The next proposition establishes a new route to justifying the validity of C2, as previously mentioned.

Proposition 3.4. Suppose $Y^t$ and $Y_{\delta,t}$ are both U random variables under C1 and C4. Then $\mathrm{E}Y^t = \mathrm{E}Y_{\delta,t}$.

Proof. By our premises, the following string of equalities applies: $\mathrm{E}Y^t = \mathrm{Av}(Y^t) = \mathrm{Av}(Y_{\delta,t}) = \mathrm{E}Y_{\delta,t}$. ■

Great care and energy of argument are often expended to establish that $\mathrm{E}Y^t = \mathrm{E}Y_t$. Proposition 3.4 offers a new manner of doing so insofar as it is believed that the experimental distribution is sum-symmetric. Conditional or unconditional on some vector $L = l$, insofar as the researcher is willing to posit that the experimental distribution is in the U class, all that is actually required is a sufficiently executed sampling process that preserves the support and induces any form of U status. Then it is implied that (conditional) mean exchangeability is achieved. Once more, since it seems plausible that $Y_t$ or $Y_{t,l}$ can be mapped into a great number of U distributions on the same support via different sampling designs or conditioning, this result is potentially very useful. For instance, if it is believed that the counterfactual distribution is symmetric, then a non-rejection of a statistical hypothesis of symmetry in the observed distribution can be supporting evidence that C2 is fulfilled. More generally, if the distance between the mid-range and the sample mean is small, and here one must be diligent in deciding what precisely defines the quality of this distance, this can also be construed as evidence. A researcher can also observe the behavior of the empirical CDF for visual confirmation. For regular random variables, the areas below and above the curve should be approximately equal.

The Hoeffding Bootstrap, Continued

With U random variables introduced, we now provide an informal justification for extending a slightly more conservative version of the Hoeffding bootstrap to an arbitrary $T_0$. This justification makes use of the principle of indifference, which states that the assignment of a uniform measure minimizes risk in the absence of information. Although it is preferable to commence from 'known' statements to derive a bound on uncertainty, we show that employing this principle is consistent with bounds derived under oracle assumptions in all circumstances except the most extreme. Importantly, for this exploration, we do condition on the empirical distribution.

We introduce some notation first. Say $T_*$ is a bootstrapped statistic s.t. $T_* \sim F_{T_*}(t)$, where $F_{T_*}(t) = \Psi\{t, \hat{F}_n(y)\}$ and $F_{T_0} = \Psi\{t, F(y)\}$. The $\Psi$ notation indicates that $F_{T_0}$ is a functional of the marginal distribution function(s). Importantly, this does not assume independence, since $\Psi\{t, F(y)\}$ can be complicated in an unknown fashion as a consequence of probabilistic dependencies. Like before, we then say $\{T_{*,k}\}_{k \in \mathcal{K}}$ is a sample of bootstrap statistics identically and independently drawn from $F_{T_*}$, except now $\mathcal{K} = \{1, 2, \dots, B\}$.
Here, we only assert 1) that there is some $N$ s.t. for all $n > N$, it is true that $T_0 \in U$, and 2) that the minimum and maximum values of the support of $F_{T_*}$, say $\{m_{*,0}, M_{*,0}\}$, are finite almost surely: a fact that is already implied by the bounded nature of $T_0$. These presuppositions are relatively light, and only 2) is truly necessary. Since $F_{T_*}(t)$ is a random CDF, it is also safe to assert that $\{m_{*,0}, M_{*,0}\}$ are random variables. For functions $q_m$ and $q_M$, then, say $m_{*,0} = m_0 + q_m(Y_1, Y_2, \dots, Y_n)$ WLOG. Constructing this object is always valid, since $q_m = m_{*,0} - m_0$ is defined on the same probability space as $m_{*,0}$. Symmetry of argument supplies the same equation for $M_{*,0}$: $M_{*,0} = M_0 + q_M(Y_1, Y_2, \dots, Y_n)$. Consequently:

$M_{*,0} - m_{*,0} = M_0 - m_0 + \{q_M(Y_1, Y_2, \dots, Y_n) - q_m(Y_1, Y_2, \dots, Y_n)\}$

Now, define $q_{M-m} = q_M(Y_1, Y_2, \dots, Y_n) - q_m(Y_1, Y_2, \dots, Y_n)$ for brevity. It is apparent that when $q_{M-m} \geq 0$, it follows that $M_0 - m_0 \leq M_{*,0} - m_{*,0}$. Because $q_{M-m} \geq 0$ provides us with a desired bound, diligence dictates examining the opposite valence. Due to the fact that $Y$ is non-degenerate by tacit assumption, we can also assert with confidence that $-(M_0 - m_0) < q_{M-m}$. Thereby, reasoning conservatively requires us to stipulate a lower bound $LB_{q_{M-m}}$ s.t. $-(M_0 - m_0) < LB_{q_{M-m}} < 0$. Since every irrational number $c \in (0, 1)$ can be approximated arbitrarily well by a rational number, we can define two unknown integers $a < b$ to conclude that $M_0 - m_0 \leq \{b - a\}^{-1} \cdot b \cdot \{M_{*,0} - m_{*,0}\}$ almost surely. As a consequence, for any bounded statistic $T_0$ and arbitrary $\varepsilon > 0$, there exist integers $a < b$ s.t. $\Pr(|T_0 - \mathrm{E}T_0| > \varepsilon) \leq 2 \cdot \exp\{-\{b \cdot (M_{*,0} - m_{*,0})\}^{-2} \cdot \{b - a\}^2 \cdot 2\varepsilon^2\}$. Call this the bootstrap concentration inequality.

By the form of the inequality, it is apparent that $\{a, b\}$ are related to the dependency structure and that $a$ especially controls extreme behavior. For instance, as $a \to b$, the bootstrap bound becomes trivial. If $b \to \infty$ much faster than $a$, then we recover the more efficient bound. Further modeling work that relates $\{a, b\}$ to the dependency structure of $\{Y_i\}_{i \in I_n}$ will thus be fortuitous. Just the same, additional discussion that incorporates a wider array of prior distributional assumptions on $\{a, b\}$ will indubitably be interesting and beneficial. Out of ignorance, for our purposes, we cite the principle of indifference, which places a uniform measure on $[-(M_0 - m_0), 0]$. This suggests that $LB_{q_{M-m}} \equiv -2^{-1} \cdot (M_0 - m_0)$ is a defensible choice that minimizes risk, since it is the expected value of the (constrained) $q_{M-m}$ function. Doing so is also equivalent to setting $b \equiv 2$ and $a \equiv 1$. As a consequence, without further information, it is implied that $M_0 - m_0 \leq 2 \cdot \{M_{*,0} - m_{*,0}\}$. An employment of Hoeffding's inequality then supplies that $T_0 \pm 2 \cdot \{M_{*,0} - m_{*,0}\} \cdot \sqrt{2^{-1}\log(2/\alpha)}$ is a confidence set for $\alpha \in (0, 1)$ with at least $1 - \alpha$ coverage. If $T_0 \in U$ for sufficiently large $n$, then $T_0 \pm 2 \cdot \{M_{*,0} - m_{*,0}\} \cdot \sqrt{6^{-1}\log(2/\alpha)}$ can be used as an improvement. Still, this approach is more conservative than the one in Section 3.2. Altogether, the improved bound on the tail probabilities for this section inflates the error around the estimate by an approximate factor of 1.15 in comparison.

$M_{*,0}$ and $m_{*,0}$ are also unknown. However, since $T_{*,(1)} \overset{a.s.}{\to} m_{*,0}$ for all $n$ as $B \to \infty$ WLOG, and $\{T_{*,k}\}_{k \in \mathcal{K}}$ can be made arbitrarily large, $M_{*,0} - m_{*,0}$ can be replaced with $T_{*,(K)} - T_{*,(1)}$ as a plug-in with negligible error for large enough $B$. If the empirical distribution of the $T_{*,k}$ demonstrates heavy tails, then typical choices for $B$ will likely suffice. However, if the $T_{*,k}$ possess light tails, $B$ will need to be much larger to compensate for the sub-optimal convergence rate of extreme order statistics.
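Under the indifference choice $b = 2$, $a = 1$, the extension amounts to doubling the bootstrap range and, when $T_0$ is plausibly U class, using the sharper $6^{-1}$ constant. A minimal sketch: the function below takes a statistic and its bootstrap replicates, however generated, and returns the more conservative set; all names are illustrative.

import numpy as np

def conservative_hoeffding_ci(t0, boot_stats, alpha=0.05, u_class=True):
    """Section 3.3 set: T0 +/- 2 * (bootstrap range) * sqrt(c * log(2/alpha)),
    with c = 1/6 under assumed U status and c = 1/2 otherwise."""
    boot_stats = np.asarray(boot_stats)
    rng_hat = boot_stats.max() - boot_stats.min()   # plug-in for M*,0 - m*,0
    c = 1.0 / 6.0 if u_class else 0.5
    half = 2.0 * rng_hat * np.sqrt(c * np.log(2.0 / alpha))
    return t0 - half, t0 + half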
A Defense of the Principle of Indifference

From here, we substantiate the use of the principle of indifference with a supplementary exploration. We show that the use of an oracle assumption leads to the same bounds in all but the most extreme of situations. The oracle assumption is as follows: for large enough $n$, $\mathrm{E}T_0$ is contained in $\mathcal{S}_{T_*}$ in the sense that $m_{*,0} \leq \mathrm{E}T_0 \leq M_{*,0}$. We say that this is an oracle assumption because it automatically supplies a population bound of probability one on the expected value. As a caveat, note that using this assertion is in no way illegal statistically if it can be justified. Taking a wider view, there is no thematic difference between this strategy and the supposition of a particular dependency structure so that asymptotic normality can be inferred. In place of estimating variance parameters, we would instead be replacing $M_{*,0}$, say, with its bootstrap estimator. Nevertheless, taking a route that does not suppose these extra conditions directly is preferable. For instance, if C5' holds and the bootstrap minimally works in the sense that $T_0 \overset{a.s.}{\to} \mathrm{E}T_0$ implies $\mathrm{E}T_* \to \mathrm{E}T_0$ as $n \to \infty$, the oracle condition is relatively safe to assume in many contexts, at least for moderate sample sizes. Plug-ins for estimands that are linear functionals of $F(y)$ will often qualify, as will unbiased statistics more generally.

Now, we commence the defense. When comparing the extremes of $\mathcal{S}_{T_0}$ and $\mathcal{S}_{T_*}$, there are only four possible cases:

(1) $m_{*,0} \leq m_0$ and $M_0 \leq M_{*,0}$
(2) $m_{*,0} \leq m_0$ and $M_{*,0} \leq M_0$
(3) $m_0 \leq m_{*,0}$ and $M_0 \leq M_{*,0}$
(4) $m_0 \leq m_{*,0}$ and $M_{*,0} \leq M_0$

Case (1) is trivial and automatically implies that $M_0 - m_0 \leq M_{*,0} - m_{*,0} \leq 2 \cdot \{M_{*,0} - m_{*,0}\}$. One can generally expect case (1) to hold under conditions of mutual negative dependence. Cases (2) and (3) provide the targeted bound in conjunction with the oracle statement. Using it allows us to conclude that $m_{*,0} - m_0 \leq 0 \leq M_{*,0} - \mathrm{E}T_0$ for case (2) at no loss of generality. As a consequence, if $T_0 \in U$, it then follows that $M_0 - m_0 \leq 2 \cdot \{M_{*,0} - m_{*,0}\}$. Cases (2) and (3) are likely to hold when dependencies are predominantly negative or positive, but not extreme in magnitude, and the average number of dependencies is not linear in $n$. It is only case (4) that is problematic. For this case, $m_{*,0} - m_0$ is strictly positive, for instance. Asserting that $m_{*,0} - m_0 \leq M_{*,0} - \mathrm{E}T_0$ is thus even more non-trivial. Situations of extreme dependence s.t. a large proportion of the outcome variables are positively dependent in a strong and redundant sense can induce this case. In an extreme example, consider sampling $\{Y_i\}_{i \in I_n}$ where each outcome variable is secretly equal to $Y_1$ plus additional random noise. Then $T_0$ is 'almost' a function of just one variable, and the support of this statistic will be more similar to the support of $T(Y_1)$. In other words, it will not concentrate. C5' prevents a large proportion of these extreme cases by definition, and also because it is sufficient for establishing the uniform and almost sure convergence of the empirical CDF. Nevertheless, to be comprehensive, we show that the desired bound holds for this case if and only if the principle of indifference holds. We can show one sub-case WLOG under the supposition of regular U status, since $\int_{\mathcal{T}'} F_{T_0}\,dt = \int_{\mathcal{T}'}\{1 - F_{T_0}\}\,dt$.
This exercise applies to discrete random variables as well, since integrating CDFs that are step functions over an interval with the same endpoints as the support still supplies the targeted values. Now, assume $m_{*,0} - m_0 \leq M_{*,0} - \mathrm{E}T_0$ at no loss of generality. If $T_0 \in U$, then this ensures the bound. Observe the following, however, where $\implies$ should be read as 'implies':

$m_{*,0} - m_0 \leq M_{*,0} - \mathrm{E}T_0 \implies q_m \leq M_0 + q_M - \mathrm{E}T_0 \implies -q_{M-m} \leq M_0 - \mathrm{E}T_0 \implies -q_{M-m} \leq 2^{-1}\{M_0 - m_0\}$

The left-hand side of the inequality is maximized when $q_{M-m} = -2^{-1}\{M_0 - m_0\}$, which gives us the same assertion as the principle of indifference. Moving in the other direction entails a reversal of the logic and is omitted.

In conclusion, under the mild assumption of U limiting behavior, three of the four exhaustive cases supply the same bound without using the principle of indifference as a premise. Therefore, statisticians can feel comfortable asserting it insofar as the most extreme of dependency scenarios are reasonably excluded. Since many popular statistics conceivably adopt U-like behavior for even moderate sample sizes, this is a fecund route for inference in the all-too-common face of unknown and intractable systems of statistical dependencies. Establishing this route more rigorously will undoubtedly be a boon. Finally, we note that the analysis of this section can be generalized to the situation s.t. we do not condition on the empirical distribution. In this scenario, we already know that $\mathrm{E}T_0 \in [m, M]$. However, if the dependency conditions are violated, we cannot consistently estimate these extremes. In this case, we can replace $M_{*,0} - m_{*,0}$ with $\hat{M} - \hat{m}$ in the statements above and still employ the same arguments with little alteration to conclude that the same, slightly more conservative confidence set is a defensibly principled choice, even when $\hat{M} - \hat{m}$ is statistically inconsistent.

3.3.1 Counterfactual Linear Regression

Linear regression is a popular tool for causal and predictive inference. For the former, inverse probability weighting of the marginal structural model or standardization of the adjusted model are common approaches (Hernán & Robins, 2010, 2006; Imbens, 2004; Mansournia & Altman, 2016). The marginal structural model w.r.t. an event $\{T = t\}$ is defined as follows:

$\mathrm{E}Y^t = \beta_0 + \beta_1 t$  (3.2)

The adjusted model is defined similarly in conjunction with a vector of adjusting variables $L$, which are posited to conditionally fulfill C2:

$\mathrm{E}(Y\,|\,t, L) = \mathrm{E}(Y^t\,|\,L) = \beta_{0*} + \beta_1 t + L\boldsymbol{\beta}$  (3.3)

Standardization then yields that $\mathrm{E}\{\mathrm{E}(Y^t\,|\,L)\} = \beta_{0*} + \beta_1 t + \mathrm{E}\{L\}\boldsymbol{\beta} = \mathrm{E}Y^t$ under C1-C3. As aforementioned, the identification of a vector $L$ that achieves C2 for all levels of $T$ required for causal contrasts is non-trivial. This sentiment generalizes to the identification of a vector of random variables that is sufficient for estimating the probability model necessary for inverse probability of treatment weighting. The difficulty is further exacerbated in both cases by informative sampling. For instance, for eq. (3.3), what is actually estimated is $\mathrm{E}_\delta(Y\,|\,t, L) = \alpha_0 + \alpha_1 t + L_\delta\boldsymbol{\alpha}$. Hence, even when C2 is met, only $\mathrm{E}_\delta Y^t$ is identifiable in the absence of non-informative sampling or additional constraints. $\mathrm{E}_\delta Y^t$, however, might not be the target of interest. We now use concepts from the previous sections to demonstrate that the core assumptions of linear regression for predictive (associative) inference are sufficient, in conjunction with C4, for the identification of causal parameters.
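Before turning to the data-generating mechanism, a minimal sketch of the standardization step under eq. (3.3), with simulated data standing in for a real study; the data-generating values (a true effect of 1.5 and a logistic treatment assignment) are hypothetical.

import numpy as np

rng = np.random.default_rng(5)
n = 5_000
L = rng.normal(0, 1, n)                                # adjusting variable
T = rng.binomial(1, 1 / (1 + np.exp(-L)))              # treatment depends on L
Y = 2.0 + 1.5 * T + 1.0 * L + rng.uniform(-2, 2, n)    # U-class errors; effect 1.5

# Fit E(Y | t, L) = b0 + b1 t + b2 L by least squares.
X = np.column_stack([np.ones(n), T, L])
b = np.linalg.lstsq(X, Y, rcond=None)[0]

# Standardize: average the fitted surface over the empirical law of L.
EY1 = b[0] + b[1] * 1 + b[2] * L.mean()
EY0 = b[0] + b[1] * 0 + b[2] * L.mean()
print(EY1 - EY0)   # approximately 1.5 under C1-C3

Because this toy model has no treatment-covariate interaction, the standardized contrast reduces to the coefficient on $T$; with interactions, the averaging over $L$ would matter.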
We specify the data-generating mechanism for the linear model in eq. (3.4) below with $L = l$ fixed:

$Y_{\delta_i} = \alpha_0 + \alpha_1 t_i + l_i\boldsymbol{\alpha} + \varepsilon_{\delta_i}$  (3.4)

Traditionally, eq. (3.4) requires that $\mathrm{E}_\delta(\varepsilon_i\,|\,t_i, l_i) = 0$ for all $i$ if predictive inference alone is the goal and all covariate patterns of interest have truly been experimentally fixed. This is weaker than strict exogeneity, which requires that $\mathrm{E}_\delta(\varepsilon_i\,|\,T_i, L_i) = 0$ when $T_i$ and $L_i$ are stochastic. Using standardization requires a slightly weaker form of strict exogeneity, conditioned on all treatment contrasts of interest, since the method averages over $L_\delta$: $\mathrm{E}_\delta(\varepsilon_i\,|\,t_i, L_i) = 0$ for all $i$. Otherwise, for finite sample inference, the second core assumption is that $\varepsilon_{\delta_i} \sim N(0, \sigma_i^2)$ for all $i$. The first core assumption is commonly evaluated using the predicted versus residual plot. Under the working proposition of valid specification, this plot should demonstrate an approximately symmetric scatter of the residuals about the horizontal zero line for any arbitrarily small neighborhood around any predicted point on the x-axis.

To make use of these traditional conditions, we must first make an inconsequential adjustment to the assumption of normality. We are working within a universe of bounded random variables. Consequently, the $\varepsilon_{\delta_i}$ of eq. (3.4) cannot be normally distributed. This is no great loss for four related reasons. Firstly, in a grand majority of scientific investigations, $Y_i$ is bounded. For example, if each $Y_i$ is a measurement of a person's blood pressure, it is impossible for it to be less than zero or greater than an arbitrary real number. Its distribution cannot be supported on a set that is equal to $\mathbb{R}$. This automatically implies that the $\varepsilon_i$ cannot truly be normally distributed. In these situations, when statisticians assume normality, it is intended as a feasible approximation that results in negligible error, and one that is fecund mathematically. The second reason is similar to the first. Even if someone wishes to insist that $Y_i$ is supported on the entire real line, $Y_{\delta_i}$ often cannot be, due to the intrinsic limitations of measurement and observation. Thirdly, as hinted in the first reason, the assumption of normality can be replaced with the assumption that $\varepsilon_{\delta_i}$ has the CDF $F_i(e_\delta)$ of a normal distribution that has been symmetrically truncated around zero. This is equivalent to positing that each $\varepsilon_{\delta_i}$ is related to some variable $Z_i \sim N(0, \sigma_{Z_i}^2)$ s.t., for an (almost) arbitrarily small $\tau > 0$, $\Pr(-M_i \leq Z_i \leq M_i) = 1 - \tau$ and $F_i(e_\delta) = (1 - \tau)^{-1}\Phi_{Z_i}(e_\delta)$ on $[-M_i, M_i]$. Provided this setup, the bias that results from treating $\varepsilon_{\delta_i}$ as strictly normally distributed for mathematical convenience is unimportant, especially since one does not need to identify $\sigma_{Z_i}^2$. The fourth and last reason, which motivates the next proposition, is related to the requirement of symmetric scatter in the residual versus fitted plot. A symmetrically truncated normal distribution is a special case of a U random variable. Moreover, when a continuous random variable has an expected value of zero, all that is required for regular U status is for it to be supported on a symmetric interval $[-M_i, M_i]$. Hence, the typical set of assumptions already employed for fixed linear regression already requires that each $\varepsilon_{\delta_i} \in U$. Additionally, we also note that positing only that $\varepsilon_{\delta_i} \in U$ for all $i$ is a fundamentally weaker assumption than (symmetrically truncated) normality.
Under this milder condition, a researcher only needs to verify that the residual versus predicted plot is (approximately) symmetrically supported around zero in any neighborhood of predicted values. The behavior of the scatter within any neighborhood is otherwise unimportant, insofar as it reasonably justifies that the expected value is also zero. Nevertheless, it is apropos to state that, if only this milder condition is supposed, then the concentration inequality of Table 3.1 or a central limit theorem is required for the construction of confidence intervals. Of course, under copious amounts of dependencies, a central limit theorem will not necessarily apply.

We now present a useful main result in Proposition 3.5, although it is technically a more detailed case of Proposition 3.3. The extra assumption of regularity is not strictly necessary.

Proposition 3.5. Assume C1 and C4. Say $Y_{t,l,\delta} = g(t, l, \boldsymbol{\alpha}) + \varepsilon_{t,l,\delta}$ for some measurable (possibly monotonic) function $g$. Suppose each $\varepsilon_{t,l,\delta}$ is regular, $\mathrm{E}_\delta(\varepsilon\,|\,t, l) = 0$, and $\varepsilon_{t,l,\delta} \in U$ for all $t, l$ fixed. Then $\mathrm{E}_\delta Y_{t,l} = \mathrm{Av}(Y^t_l)$.

Proof. Let $t, l$ be arbitrary. Then $\mathrm{E}_\delta Y_{t,l} = g(t, l, \boldsymbol{\beta})$, since $\mathrm{E}_\delta(\varepsilon\,|\,t, l) = 0$. For an arbitrary bounded random variable $Z$, say $\min(Z) = \min_{z \in \mathcal{S}_Z}(z)$ and $\max(Z) = \max_{z \in \mathcal{S}_Z}(z)$, the greatest lower and least upper bounds of the closed set $\mathcal{S}_Z$, respectively. Since $g(t, l, \boldsymbol{\beta})$ is a constant, it follows that $\min(Y_{t,l,\delta}) = g(t, l, \boldsymbol{\beta}) + \min(\varepsilon_{t,l,\delta})$ and $\max(Y_{t,l,\delta}) = g(t, l, \boldsymbol{\beta}) + \max(\varepsilon_{t,l,\delta})$. Moreover, since each $\varepsilon_{t,l,\delta}$ is regular, each $Y_{t,l,\delta}$ is also obviously regular. From here:

$\min(Y_{t,l,\delta}) + \max(Y_{t,l,\delta}) = 2g(t, l, \boldsymbol{\beta}) + \min(\varepsilon_{t,l,\delta}) + \max(\varepsilon_{t,l,\delta})$
$\implies 2^{-1}\{\min(Y_{t,l,\delta}) + \max(Y_{t,l,\delta})\} = g(t, l, \boldsymbol{\beta}) + 2^{-1}\{\min(\varepsilon_{t,l,\delta}) + \max(\varepsilon_{t,l,\delta})\}$
$\implies \mathrm{Av}(Y_{t,l,\delta}) = g(t, l, \boldsymbol{\beta}) + 0$
$\implies \mathrm{E}_\delta Y_{t,l} = \mathrm{Av}(Y_{t,l,\delta}) = \mathrm{Av}(Y^t_l)$

The third line follows from regularity and the fact that $\varepsilon_{t,l,\delta} \in U$ for all $t, l$ fixed. The last line follows from substitution, C1, and C4. ■

Set $g\{t, l, (\alpha_0, \alpha_1, \boldsymbol{\alpha})\} = \alpha_0 + \alpha_1 t + l\boldsymbol{\alpha}$ to recover eq. (3.4). Under the assumption that $(t', l)$ and $(t'', l)$ have both been fixed, where the values $t'$ and $t''$ represent the treatment values to be contrasted, Proposition 3.5 implies that $\alpha_1 \propto \mathrm{Av}(Y^{t'}_l) - \mathrm{Av}(Y^{t''}_l)$: the difference in the average values of $Y^T_l$ when $T = t'$ and $T = t''$. When $(t', l)$ and $(t'', l)$ have not been fixed but at least all treatment values have been, i.e., when the researcher did not fix $l$ for all contrasts of scientific interest, the previous statement still holds in general when $\mathrm{E}_\delta(\varepsilon\,|\,t, L) = 0$ for all $t$ involved. The proof of Proposition 3.5 would only need to use this stronger statement with no further change. If the researcher wishes to reason about contrasts of $T$ that have not been fixed as well, then the (non-trivial) assumption that $\mathrm{E}_\delta(\varepsilon\,|\,T, L) = 0$ suffices. This proves that conditions weaker than those traditionally supposed for making inferences about associations are sufficient for inference w.r.t. causal parameters. A statistician can target these parameters using linear regression under (almost) arbitrary sampling bias, insofar as linearity and U status of the error distributions are feasibly defensible w.r.t. the sample measures and at least the supports have been preserved. Even this last condition can be weakened further, since we truly only require the preservation of functional averages.
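The U condition on the errors can be screened with the same residual-versus-fitted plot already in routine use. The sketch below, with hypothetical data, checks whether the residual scatter is approximately symmetrically supported around zero within bins of the fitted values; it is a heuristic diagnostic in the spirit of the discussion above, not a formal test.

import numpy as np

def residual_support_check(fitted, resid, bins=10):
    """Within bins of fitted values, compare max(resid) with -min(resid);
    near-equality is consistent with symmetric support (U-type errors)."""
    edges = np.quantile(fitted, np.linspace(0, 1, bins + 1))
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        r = resid[(fitted >= lo) & (fitted <= hi)]
        if r.size > 1:
            out.append((lo, hi, r.max(), -r.min()))
    return out

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 2_000)
y = 1.0 + 0.5 * x + rng.uniform(-3, 3, 2_000)   # U-class errors by design
b1, b0 = np.polyfit(x, y, 1)                    # slope, intercept
fitted = b0 + b1 * x
for lo, hi, up, dn in residual_support_check(fitted, y - fitted):
    print(f"[{lo:5.2f}, {hi:5.2f}]  max: {up:4.2f}  -min: {dn:4.2f}")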
We choose to highlight three additional points of interest. The first concerns the interpretation of $\alpha_1$. It uses similar language to current interpretations; however, care is due. Although the word 'average' is often used for interpreting the coefficients of typical linear regressions, this is imprecise slang for the change in expected value. The word 'average' is also imprecise for the functional average, but it is at least closer in spirit in comparison to other applications, since the functional average is a uniform averaging of the support. The second point is that Proposition 3.5 enables reasoning about counterfactual conditional functional averages in high-dimensional settings. Often, however, the researcher cares mostly about a marginal estimate. Although they can serve a similar purpose, $\mathrm{E}\{\mathrm{Av}(Y^t_L)\} \neq \mathrm{Av}(Y^t)$ and $\mathrm{Av}\{\mathrm{Av}(Y^t_L)\} \neq \mathrm{Av}(Y^t)$ in general. This does not signify that these parameters do not possess scientific meaning. For example, $\mathrm{E}\{\mathrm{Av}(Y^{t'}_L)\} - \mathrm{E}\{\mathrm{Av}(Y^{t''}_L)\}$ can still be interpreted as an expected functional average effect over $L$. Nevertheless, one special circumstance when the identity $\mathrm{E}\{\mathrm{Av}(Y^t_L)\} = \mathrm{E}Y^t$ does hold is when C2, the conditions of Proposition 3.5 under stricter exogeneity, and the additional requirement that $Y^t \in U$ are valid. The last point is concise to state. The analysis of variance (ANOVA) is a special case of the linear regression model presented. It is an elementary tool that is ubiquitous in research. All prior discussion and Proposition 3.5 therefore apply to ANOVA procedures under conditions that are already stipulated. Hence, insofar as C1 and C4 are defensible, this means that a plethora of prior work can be re-interpreted with a restricted causal lens in partnership with a structural causal model.

3.4 Monte Carlo Simulations

Before we apply our strategy to real data, we illustrate its utility with a set of Monte Carlo simulations that show how functional average and U concepts are useful for causal inference. For simplicity, we proceed with non-informative sampling conditions that presuppose mutual independence. All simulations use $M = 1{,}000$ iterations for sample sizes $n \in \{500, 2500, 5000, 10000\}$. Furthermore, all constructed $1 - \alpha$ confidence sets use $\alpha = .05$. Three main simulation experiments are provided in total. The first examines the behavior of basic functional average estimators for symmetric and non-symmetric distributions. The second and third simulations demonstrate that causal effects can be consistently estimated without controlling for confounding. All experiments examine the performance of Hoeffding-style bootstrapping procedures.

3.4.1 Univariate Functional Average Estimation

The first simulation of this experiment examines the performance of $\widehat{\mathrm{MR}}$ for three truncated normal distributions: $Y_1 \sim TN(m = 0, M = 20, \mu = 10, \sigma = 5)$, $Y_2 \sim TN(m = 0, M = 15, \mu = 10, \sigma = 3)$, and $Y_3 \sim TN(m = 0, M = 15, \mu = 5, \sigma = 3)$. The first random variable, $Y_1$, is in the U class and hence $\theta_1 = \mathrm{Av}(Y_1) = \mathrm{E}Y_1 = 10$. However, $Y_2$ and $Y_3$ are not U variables. Their distributions are skewed, with tails that impact convergence behavior. For these variables, $\theta = \mathrm{Av}(Y) = 7.5$. Hoeffding bootstrap style confidence sets ($\widehat{CI}_H$) are constructed as described in Section 3.2. To contrast their performance, we also use an m-out-of-n bootstrap. Again, an important requirement of the m-out-of-n bootstrap is that $m \to \infty$ but $n^{-1}m \to 0$. To meet this criterion, we set $m = r(\sqrt{n})$, since this setting produces relatively conservative results. Hence, if it fails to perform well, this highlights the utility of the Hoeffding procedure. Percentile confidence sets ($\widehat{CI}_{p,m}$) are constructed from the m-out-of-n bootstraps.
For reference, we also construct Hoeffding-style confidence sets (CÎ_{H,m}) from this same process. Importantly, all bootstrap procedures make use of simple random sampling with replacement and only 500 bootstrap samples. Although sub-optimal, a low number of bootstrap samples is used to limit computational burden. A decent performance of the Hoeffding bootstrap at B = 500 is still a good indicator. Empirical coverage rates are estimated with EC = M^{-1} Σ_{i=1}^{M} 1{θ ∈ CÎ_i} WLOG. Table 3.2 presents the results of this experiment. Importantly, all values in tables henceforth represent the arithmetic average of simulated objects, including the endpoints of confidence sets.

Table 3.2: Functional Average Estimation, Continuous

Av(Y)     n      MR̂     CÎ_{H,m}         EC_{H,m}   CÎ_{p,m}        EC_{p,m}   CÎ_H            EC_H
θ = 10    500    10.01  (1.85, 18.15)    1          (8.02, 11.98)   1          (8.87, 11.13)   1
          2500   10     (4.5, 15.5)      —          (8.72, 11.27)   —          (9.72, 10.28)   —
          5000   10     (5.42, 14.58)    —          (8.97, 11.03)   —          (9.86, 10.14)   —
          10000  10     (6.26, 13.74)    —          (9.17, 10.83)   —          (9.93, 10.07)   —
θ = 7.5   500    8.11   (2.1, 14.12)     1          (7.58, 10.64)   0.40       (6.51, 9.7)     0.88
          2500   7.74   (2.98, 12.5)     —          (7.6, 10.09)    0.30       (6.82, 8.65)    0.94
          5000   7.63   (3.31, 11.96)    —          (7.59, 9.87)    0.25       (6.96, 8.31)    0.97
          10000  7.58   (3.7, 11.46)     —          (7.58, 9.67)    0.21       (7.13, 8.03)    0.98
θ = 7.5   500    6.91   (0.88, 12.94)    1          (4.37, 7.44)    0.43       (5.31, 8.51)    0.90
          2500   7.26   (2.46, 12.06)    —          (4.91, 7.4)     0.30       (6.34, 8.18)    0.95
          5000   7.37   (3.04, 11.69)    —          (5.13, 7.41)    0.23       (6.69, 8.04)    0.97
          10000  7.42   (3.52, 11.33)    —          (5.33, 7.42)    0.18       (6.97, 7.87)    0.97
† θ_i corresponds to the functional average of the TN distribution Y_i introduced above; '—' indicates that the value is the same as in the first row.

These results demonstrate that MR̂ behaves as intended. It is unbiased for the symmetric distribution, and convergence behavior—albeit slow—is observable w.r.t. the target parameters for the asymmetric distributions that exhibit more problematic tail behavior. Importantly, the Hoeffding-style bootstrap also appears to behave as intended. Although it failed to uphold nominal coverage values for the skewed distributions at n = 500, it quickly overcame this behavior to provide conservative empirical coverage for θ. Also, the difficulties of efficiently estimating Av(Y) are evident for non-U variables, highlighting the utility of this type of variable. Notably, the m-out-of-n percentile interval did not perform well even when m was O(√n).

The second experiment is for discrete variables. Like before, we use three truncated normal distributions: Y_1 ∼ TN(m = 0, M = 40, µ = 20, σ = 5), Y_2 ∼ TN(m = 0, M = 40, µ = 25, σ = 8), and Y_3 ∼ TN(m = 0, M = 40, µ = 15, σ = 8). These random variables are uniformly rounded to the nearest integer to induce discreteness. Here, we contrast MR̂ with Av̂, the discrete plug-in for eq. (3.1). Hoeffding-style bootstraps are again employed to construct confidence sets. However, now we use a U approximation in accordance with Section 3.3. Even if U status does not hold exactly, insofar as convergence behavior to a constant holds, the method should remain robust. Results for this simulation are available in Table 3.3; we no longer examine the performance of the m-out-of-n bootstrap. A minimal sketch of the Hoeffding-style interval construction used throughout is given first.
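The sketch assumes the interval form T_0 ± (M̂ − m̂)√(log(2/α)/2) described in Section 3.B of the supplementary material, with the √(log(2/α)/6) variant when U status is defensible; it reuses y and midrange() from the previous sketch.

## Hoeffding-style bootstrap interval: resample, take the range of the
## bootstrapped statistics, and penalize it with the Hoeffding factor.
hoeffding_ci <- function(y, stat, B = 500, alpha = .05, u = FALSE) {
  t0   <- stat(y)
  reps <- replicate(B, stat(sample(y, replace = TRUE)))
  pen  <- sqrt(log(2 / alpha) / if (u) 6 else 2)
  t0 + c(-1, 1) * (max(reps) - min(reps)) * pen
}
hoeffding_ci(y, midrange)            # CI_H for the mid-range of the Y2 draw
hoeffding_ci(y, midrange, u = TRUE)  # sharper set under a U approximation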
Table 3.3: Functional Average Estimation, Discrete, Av(Y) = 20

          n      Av̂      CÎ_H            EC_H    MR̂      CÎ_H            EC_H
r(Y_1)    500    20.012  (15.21, 24.82)  1       20.044  (13.88, 26.21)  1
          2500   19.992  (15.84, 24.14)  —       19.969  (14.93, 25.01)  —
          5000   20.014  (16.18, 23.85)  —       20.023  (15.58, 24.47)  —
          10000  20.020  (16.39, 23.65)  —       20.016  (16, 24.04)     —
r(Y_2)    500    21.921  (17.62, 26.22)  0.971   21.218  (16.48, 25.96)  0.955
          2500   20.490  (18.19, 22.79)  0.980   20.365  (18.17, 22.56)  0.963
          5000   20.194  (18.55, 21.83)  0.985   20.169  (18.67, 21.67)  0.975
          10000  20.051  (19, 21.1)      0.972   20.050  (19.06, 21.04)  0.957
r(Y_3)    500    18.053  (13.8, 22.31)   0.963   18.788  (14.04, 23.54)  0.953
          2500   19.518  (17.22, 21.82)  0.985   19.638  (17.45, 21.83)  0.967
          5000   19.803  (18.16, 21.44)  0.987   19.833  (18.34, 21.32)  0.980
          10000  19.941  (18.87, 21.01)  0.977   19.942  (18.94, 20.94)  0.963

The results of the discrete simulation largely match those of the preceding continuous one. The plug-in estimator performed more favorably than the sample mid-range only for r(Y_1): the distribution with the lightest tails. Importantly, although many simulations demonstrated departures from U status, the coverage remained robust due to the conservative nature of the Hoeffding bootstrap. Further details pertaining to this fact are available in the supplementary material.

3.4.2 Causal Inference with Functional Averages

Next, we demonstrate a classical case of confounding where the functional average treatment effect is identifiable and statistically consistently estimable without any adjustment. We use the following variables for this simulation: C ∼ Bern(.5), T ∼ Bern(.3 + .5C), and e ∼ TN(m = −50, M = 50, µ = 0, σ = τ). Moreover, we will say that Y^{t=1} = 110 + 50C + e, Y^{t=0} = 100 + 50C + e, and Y = Y^{t=1}T + (1 − T)Y^{t=0}. The confounding variable is C, which is present in half of the theoretical population on average. When C = 1, the probability of allocation to treatment is larger. The structure of the confounding also preserves the support of the counterfactual distribution w.r.t. the observed one. We also set τ to 5 and then to 25 to demonstrate the performance of a functional average estimator when the tail probabilities are thin or heavy. We contrast simple linear regression with MR̂ for estimating ∆ = Av(Y^{t=1}) − Av(Y^{t=0}) = 10. Here, the mid-range estimator of ∆ is ∆̂_MR = 2^{-1}{Y_{(1),t=1} + Y_{(n_1),t=1}} − 2^{-1}{Y_{(1),t=0} + Y_{(n_0),t=0}}. This simulation is executed WLOG since it can be implicitly assumed that the researcher has stratified on some set of variables—such as propensity scores—to limit confounding bias or achieve the equality of supports. Confidence sets and estimators of the empirical coverage are otherwise constructed as previously explored using the Hoeffding bootstraps of Section 3.2. Power is estimated by EP_H = M^{-1} Σ_{i=1}^{M} 1{0 ∉ CÎ_{H,i}}. Table 3.4 has the results.

Table 3.4: Continuous Effect Estimators, ∆ = 10

τ = 5    n      ∆̂_OLS   ∆̂_MR   CÎ_H             EP_H
         500    33.45   11.76  (1.53, 22)       0.71
         2500   34.39   11.69  (3.05, 20.33)    0.85
         5000   35.09   11.65  (3.67, 19.63)    0.90
         10000  35.18   11.47  (3.88, 19.07)    0.92
τ = 25   n      ∆̂_OLS   ∆̂_MR   CÎ_H             EP_H
         500    33.44   12.82  (−10.47, 36.11)  0.05
         2500   34.40   10.84  (2.73, 18.95)    0.91
         5000   35.11   10.47  (5.79, 15.15)    1
         10000  35.20   10.25  (7.68, 12.82)    —
† All empirical coverage estimators returned 1 for ∆̂_MR.

As expected, ∆̂_OLS is a confounded estimator of EY^{t=1} − EY^{t=0} = ∆ = 10. However, this is not the case for ∆̂_MR, which demonstrates convergence behavior towards the true parameter value, albeit at a sub-optimal rate. A condensed sketch of this experiment is provided below.
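This sketch reuses rtnorm() and midrange() from the earlier sketches; the seed and single sample size are our own choices.

## Confounded design: OLS targets the confounded mean contrast, while the
## mid-range contrast targets Av(Y^{t=1}) - Av(Y^{t=0}) = 10 because the
## counterfactual supports are preserved within each treatment arm.
set.seed(3)
n   <- 5000
C   <- rbinom(n, 1, .5)
trt <- rbinom(n, 1, .3 + .5 * C)
e   <- rtnorm(n, -50, 50, 0, 25)               # the tau = 25 case
y   <- ifelse(trt == 1, 110, 100) + 50 * C + e
coef(lm(y ~ trt))["trt"]                       # confounded; near 35
midrange(y[trt == 1]) - midrange(y[trt == 0])  # near the target of 10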
It is worth reiterating a poignant fact here: MR̂ demonstrates this behavior without adjustment. Moreover, although the sampling conditions supposed here were non-informative, this was unnecessary. Insofar as the supports are preserved, informative sampling conditions are unimportant w.r.t. statistical consistency, although they might impact convergence rates. These results are also consistent with prior discussions of the mid-range. We expect the mid-range to possess more favorable properties when non-negligible mass or density rests in the tails, which is the case when τ = 25 in these simulations.

Next, we present a restricted discrete analogue to the last experiment. C and T retain their definitions, but e ∼ Binom(τ, 2^{-1}) for τ ∈ {30, 50}. Otherwise, Y^{t=0} = 10C + e, Y^{t=1} = Y^{t=0} + 5, and Y = Y^{t=1}T + (1 − T)Y^{t=0} as before. For this simulation, we employ the Hoeffding bootstrap for U statistics once more. Table 3.5 presents the results.

Table 3.5: Discrete Effect Estimators, ∆ = 5

τ = 30   n      ∆̂_OLS   ∆̂_Av̂    CÎ_H            EP_H    ∆̂_MR̂   CÎ_H             EP_H
         500    9.694   6.059   (0.55, 11.57)   0.681   5.899  (−0.47, 12.27)   0.447
         2500   9.875   5.91    (1.34, 10.48)   0.871   5.861  (0.63, 11.09)    0.680
         5000   10.017  5.8     (1.45, 10.15)   0.890   5.757  (0.83, 10.69)    0.728
         10000  10.037  5.765   (1.68, 9.85)    0.921   5.729  (1.19, 10.27)    0.805
τ = 50   500    9.696   6.443   (−0.16, 13.04)  0.458   6.192  (−2.11, 14.49)   0.236
         2500   9.875   6.268   (0.75, 11.79)   0.728   6.150  (−0.69, 12.99)   0.419
         5000   10.017  6.147   (0.87, 11.42)   0.734   6.052  (−0.48, 12.59)   0.457
         10000  10.036  6.074   (1.08, 11.07)   0.804   5.995  (−0.03, 12.02)   0.551
† All empirical coverage estimators were > .998.

The discrete estimators demonstrate finite sample bias. Nonetheless, they also demonstrate convergence towards ∆ as n becomes large, while ∆̂_OLS remains confounded. Although ∆ = 5 is a small to moderate effect size, ∆̂_Av̂ shows ample power to detect it against the null hypothesis that ∆ = 0 for reasonable sample sizes when τ is set to 30. Traditional levels of acceptable power are only achieved by the ∆̂_Av̂ estimator at n = 10000 when τ = 50. The increase in variance and the lowered likelihood of observing some values of the support hinder both estimators' performance. Although the mid-range estimator seems to possess less bias in these simulations, the plug-in estimator seems to possess more power. A sketch of the discrete plug-in contrast is given below.
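The sketch assumes that the discrete plug-in for eq. (3.1) uniformly averages the distinct observed support values, consistent with the earlier description of the functional average as a uniform averaging of the support; av_hat() is our own shorthand.

av_hat <- function(y) mean(unique(y))       # discrete plug-in for eq. (3.1)
set.seed(4)
n   <- 5000
C   <- rbinom(n, 1, .5)
trt <- rbinom(n, 1, .3 + .5 * C)
e   <- rbinom(n, 30, .5)                    # the tau = 30 case
y   <- 10 * C + e + 5 * trt                 # Y^{t=1} = Y^{t=0} + 5
av_hat(y[trt == 1]) - av_hat(y[trt == 0])   # plug-in contrast; target is 5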
Our last experiment demonstrates how linear regression can still be used for causal inference when mean exchangeability does not hold but each error term is a U variable. To this end, we use a new setup for any measurable function g: T ∼ Bern(.3), U_1 ∼ TN(m = −10, M = 10, µ = 0, σ = 2), and µ_T = 100 + 20T. Then we will say U_2 ∼ TN(m = µ_T − g(T) − 10, M = µ_T − g(T) + 10, µ = 0, σ = 2), Y_T = µ_T + U_1, and Y^T = g(T) + U_2. For simplicity, we set g(T) = 90 + 10T. This model structure can be amended to include more covariates. However, this is unnecessary for our demonstration. What is important is that, conditional on a design matrix x, the functional averages are preserved and linearity holds for Y_x. The true generating process for the counterfactual distribution can be unknown. Here, ∆ = 20. However, confounding is present since EY^{t=1} − EY^{t=0} = 10 ≠ EY_{t=1} − EY_{t=0} = 20. The results are provided in Table 3.6. The CÎ_t column provides the arithmetic average of the standard t-distribution confidence set endpoints. For reference, we also include CÎ_U for confidence sets constructed from the concentration inequality in Table 3.1 for U errors. CÎ_U is constructed as follows. Say x is the design matrix for a regression to estimate β, and let w = (x^⊤x)^{-1}x^⊤. Then β̂ − β = wε. Therefore, β̂_T − β_T = Σ_{i=1}^{n} w_{s,i} ε_i, where s is the row of w corresponding to the treatment feature. Algebraic rearrangement of the concentration inequality then yields confidence sets of the form β̂_T ± √(Σ_{i=1}^{n} R_i²) · √(6^{-1} log(2/α)), where R_i is the population range of w_{s,i} ε_i. Under the assumption of a valid mean model specification, the maximum of the supports of the ε_i is feasibly estimable with ê_{(n)} WLOG, where ê denotes a typical residual. Hence, we can use an approximate confidence set of the following form when the extremes of the support of Y are unknown: β̂_T ± {ê_{(n)} − ê_{(1)}} · √(Σ_{i=1}^{n} w_{s,i}²) · √(6^{-1} log(2/α)). We contrast these confidence sets with those constructed using the Hoeffding bootstrap of Section 3.2.

Table 3.6: Linear Regression for Functional Averages

n      ∆̂_OLS   CÎ_t            CÎ_U            CÎ_H
500    20      (19.62, 20.37)  (19.09, 20.91)  (19.9, 21.58)
2500   —       (19.83, 20.17)  (19.52, 20.48)  (19.28, 20.71)
5000   —       (19.88, 20.12)  (19.64, 20.35)  (19.49, 20.51)
10000  —       (19.91, 20.09)  (19.74, 20.26)  (19.64, 20.36)

Since the properties of linear regression are well understood, a thorough discussion is unnecessary. The results again substantiate the utility of U random variables. Under their framework, efficient estimation of causal parameters is more readily achievable, especially if mutual independence is a feasible assumption.

3.5 A Data Application

In this section, we employ NHEFS data to demonstrate our concepts. The NHEFS conducted medical examinations from 1971 to 1975 on non-institutionalized civilian adults aged 24–74 (N = 14,407) in the United States as part of a national probability sample. Follow-up surveys were then administered in 1982, 1984, and subsequent years to collect measurements of behavioral, nutritional, and clinical variables. Further documentation is available elsewhere (Madans, Kleinman, Cox, Barbano, Feldman, Cohen, et al., 1986). The subset of data we use here (n = 1,479) originates from the original 1971 medical examination and the follow-up in 1982.

Exercise (0: moderate to much; 1: little to none) is the treatment variable (T) of interest. Age, chronic bronchitis/emphysema diagnosis (1: yes; 0: never), education attained in 1971 (1: < 8th grade; 2: HS dropout; 3: HS; 4: college dropout; 5: college), income, race (1: non-white; 0: white), sex (1: female; 0: male), years smoking, alcohol frequency, and weight (kilograms) are utilized as adjusting covariates (henceforth denoted as L). Although systolic blood pressure (SBP) was measured in integer values, it is still treated as continuous in most instances. Pertinently, T and most covariates were all measured in 1971. Only SBP and weight were measured in 1982. All continuous covariates are centered on their observed sample means for this analysis. Here, we are interested in whether exercise exerted a causal effect on SBP in smokers. We aim to estimate E{Av(Y_L^{t=1}) − Av(Y_L^{t=0})} and E{Av(Y_S^{t=1}) − Av(Y_S^{t=0})} as summary causal effects, where S = s represents a stratum constructed from the quintiles of ê(L), the estimated propensity scores. Our goal is to estimate Av(Y^{t=1}) − Av(Y^{t=0}) and β̂_t in addition. To this end, race is treated as a confounder since it represents both genetic information and socio-historical constructions (Witzig, 1996).
In a similar vein, we choose to adjust for sex to account for possible biological influences, and because it is also an imperfect proxy for social institutions that can impact exercise habits, other health behaviors, and therefore blood pressure. All other variates mentioned are adjusted for since they are either known to affect both SBP and exercise habits directly or to act as conduits for more general institutional or ecological influences. Theoretically, adjusting for them can help to block backdoor paths from a subset of unknown confounders, which, although mostly irrelevant here in terms of their probabilistic effects, can still impact supports.

Methods. The following methods are used to estimate possible causal effects: simple linear regression (LR), multiple linear regression with and without standardization (S; MR), linear regression adjusted for propensity score strata with standardization (PS), and sample functional average estimation (Av) with the discrete plug-in for eq. (3.1). Propensity scores are estimated with logistic regression using the same covariates as the regression model. No covariate transformations are used for the logit model, although we introduce higher-order terms to the regression if doing so appears to improve linearity. Standard errors for all standardization estimators are calculated via a standard bootstrapping procedure with B = 1,000 replications. We use the Hoeffding bootstrap of Section 3.3 with the same value of B for the eq. (3.1) plug-in. Standard t-statistic-based confidence sets are employed for the LR and MR models. All tests are conducted at the α = .05 level using R version 4.2.2 statistical software (R Core Team, 2022). The assumptions of U status and valid mean model specification are substantiated through the inspection of residual-versus-fitted plots and empirical CDF plots.

Results. The coefficient estimate for the simple linear regression is β̂_t = −3.87 (95% CI: −5.821, −1.912), while the plug-in functional average estimate of Av(Y^{t=1}) − Av(Y^{t=0}) is −3.81 (95% CI: −24.17, 17.42). Additional model results are available in Table 3.7.

Table 3.7: Summary of Model Results

Parameter                             Method   Estimate   95% Confidence Interval
β̂_t                                   MR       −.74       [−2.57, 1.1]
Av(Y^{t=1}) − Av(Y^{t=0})             Av       −3.81      [−24.17, 17.42]
E{Av(Y_L^{t=1}) − Av(Y_L^{t=0})}      S        −.74       [−2.56, 1.09]
E{Av(Y_S^{t=1}) − Av(Y_S^{t=0})}      PS       −1.014     [−2.99, .97]

Initial fitting procedures for the multiple linear regression model showed mild departures from linearity. The addition of a quadratic term for age appeared to improve model fit. From this reformed model, the estimate of the expected functional average change in SBP is Ê{Av(Y_L^{t=1}) − Av(Y_L^{t=0})} = −0.74 (95% CI: −2.56, 1.09). This estimate is approximately equivalent to the estimated (functional) average change in blood pressure for adult smokers who exercised, conditional on all other covariates: −.74 (95% CI: −2.57, 1.1). Finally, the estimate of the expected functional average change in SBP w.r.t. the propensity score stratified model is Ê{Av(Y_S^{t=1}) − Av(Y_S^{t=0})} = −1.014 (95% CI: −2.99, .97).

Model checking. Empirical CDF plots for the SBP distributions are presented by propensity score strata and by treatment status in Figure 3.1. The residual-versus-fitted plot for the MR model is also included.

[Figure 3.1: Model Validation Plots — empirical CDF plots of SBP by propensity-score stratum and treatment status, alongside the residual-versus-fitted plot for the MR model; the figure is not reproduced in this transcript.]

A sketch of how such diagnostics can be produced is given below.
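This is a minimal sketch only; fit, ps_stratum, exercise, and sbp are hypothetical stand-ins for the analysis objects, and the actual NHEFS code is not reproduced here.

## Residual-versus-fitted plot: look for approximately symmetric support
## around the zero line (U status), not any particular scatter shape.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
## Empirical CDFs of SBP by exercise level within each propensity stratum;
## roughly equal areas above the curves suggest equal functional averages.
for (s in levels(ps_stratum)) {
  plot(ecdf(sbp[ps_stratum == s & exercise == 1]), main = paste("Stratum", s))
  lines(ecdf(sbp[ps_stratum == s & exercise == 0]), col = 2)
}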
We note no remarkable departures from linearity in the residual-versus-fitted plot for the multiple linear regression model. Moreover, the residuals appear to possess an approximately symmetric spread around the zero horizontal line, although some outlying points appear to violate the assumption that the errors are supported around symmetric extremes. Altogether, although departures from strict U status are observable, the assumption of U status appears to be feasibly met. The stratified empirical CDF plots in Figure 3.1 do not appear to corroborate approximate sum-symmetric behavior, since they show more area below the functional lines than above in most cases. Nevertheless, we note, conditional on an arbitrary stratum besides IV, that the area above each empirical CDF function appears roughly equal by exercise level. This alone is strong visual evidence that no difference in functional averages exists. The empirical CDF plots for the non-adjusted conditional distributions do not support the supposition of sum-symmetry. Again, this is because the areas above and below each function are not approximately equal. Hence, the results of the simple t-test cannot be afforded a causal interpretation w.r.t. a change in functional average.

Discussion. Since the NHEFS was a national probability sample of non-institutionalized adults, there is little reason to believe that C4 was not fulfilled, conditional on our adjusting covariates. Recall that positing the opposing notion in this context is to affirm—for individuals who smoked—that there were possible values of SBP in each treatment population that had zero probability of being observed in the sample. Insofar as the NHEFS survey was truly a probability sample and hence non-informative, rejecting C4 also means that these potential values of SBP could never be observed in the real world. In conjunction with the fact that U status appeared to be approximately verified for the basic multiple regression model, we are confident that—conditional on our covariates—the expected (functional average) causal effect of exercise upon SBP in smokers is plausibly within a neighborhood of zero. This conclusion is further corroborated by the validity of C4 and the direct estimate of the difference in functional averages, which was also not statistically significantly different from zero. However, this test was hampered by the fact that each conditional SBP population possessed a relatively light right tail and the sample size was small. Vitally, we have no reason to believe that we successfully adjusted for all confounding variables. Hence, we do not purport to interpret the former two effect estimates in terms of expected treatment effects. However, if mean exchangeability (and positivity) did hold, they would also be estimates of this contrast. It is also apropos to note that the consistency assumption might be violated in this analysis. This is because respondents were asked whether they exercised little to none, moderately, or much; the meanings of these words, however, carry no absolute standard. Hence, it is possible that multiple exercise treatments actually existed under the premise of one coding. This does not undermine what is formally specified in C4 conditional on the variable observed, although it does complicate the generalizability of the results if present.

3.6 Conclusion

In this article, we demonstrated that causal inference is achievable in the absence of mean exchangeability if the support of the counterfactual distribution is preserved. Moreover, we offered exposition on the possible utility and scientific meaningfulness of functional average change.
To overcome some of the difficulties of functional average estimation, we introduced a simple class of random variables—the U class—that possesses a host of practical properties. Using the U random variable framework, we showed that ubiquitously employed statistical procedures produce estimates with causal interpretations under exceptionally mild conditions, many of which are already supposed in most applied settings to investigate associations. Hence, even if a researcher fails to control for all confounding variables, she still might be left with a second prize of sorts, and one that possesses salient causal meaning. Since uncontrolled confounding is safely assumed to be nearly omnipresent outside of toy examples, we believe that this framework provides a strong defense of elementary methods. Further work is of course due. We observe that we did not pursue the sample minimum and maximum as counterfactual estimators in and of themselves, although the set of assumptions employed here also establishes their utility for estimands with potential causal interpretations. Developing this area of theory will most certainly be advantageous. Lastly, we also presented a new approach to the bootstrapping process, which we called the Hoeffding bootstrap. Although a less conservative form of it was proven, and strictly for the case s.t. the maximum bootstrap statistic is discrete or admits a density, we resorted to a defensible citation of the principle of indifference to extend a slightly more conservative version of it to a wider class of statistics. This is a good start. However, it cannot be the end. More theoretical work on the relationship between the extremes of the support of the empirical distribution of the statistic and those of the population distribution will most certainly provide fruit. A defensible set of sufficient conditions that ensures the approach more generally will do much to lighten the burden of uncertainty.

Acknowledgements: The NHEFS data was acquired from Dr. Miguel Hernan's faculty website (https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/).

References

Rubin, D. B. (2019). Essential concepts of causal inference: A remarkable history and an intriguing future. Biostatistics & Epidemiology, 3(1), 140–155.
Pearl, J. (2010). Causal inference. Causality: Objectives and Assessment, 39–58.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Ding, P., & Li, F. (2018). Causal inference. Statistical Science, 33(2), 214–237.
Hernán, M. A., & Robins, J. M. (2010). Causal inference.
Imbens, G. W., & Rubin, D. B. (2010). Rubin causal model. In Microeconometrics (pp. 229–241). Springer.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Holland, P. W., & Rubin, D. B. (1987). Causal inference in retrospective studies. ETS Research Report Series, 1987(1), 203–231.
Jin, H., & Rubin, D. B. (2008). Principal stratification for causal inference with extended partial compliance. Journal of the American Statistical Association, 103(481), 101–111.
Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1), 233–298.
Gangl, M. (2010). Causal inference in sociological research. Annual Review of Sociology, 36, 21–47.
Cole, S. R., & Frangakis, C. E. (2009). The consistency statement in causal inference: A definition or an assumption? Epidemiology, 20(1), 3–5.
Hernán, M. A., & Robins, J. M. (2006). Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7), 578–586.
Greenland, S., Pearl, J., & Robins, J. M. (1999a). Causal diagrams for epidemiologic research. Epidemiology, 37–48.
Greenland, S., Pearl, J., & Robins, J. M. (1999b). Confounding and collapsibility in causal inference. Statistical Science, 14(1), 29–46.
Glantz, S. A., & Parmley, W. W. (1991). Passive smoking and heart disease: Epidemiology, physiology, and biochemistry. Circulation, 83(1), 1–12.
Stallones, R. A. (2015). The association between tobacco smoking and coronary heart disease. International Journal of Epidemiology, 44(3), 735–743.
Narkiewicz, K., Kjeldsen, S. E., & Hedner, T. (2005). Is smoking a causative factor of hypertension?
Elley, C. R., & Arroll, B. (2002). Aerobic exercise reduces systolic and diastolic blood pressure in adults. Evidence Based Medicine, 7(6), 170–170.
Park, W., Miyachi, M., & Tanaka, H. (2014). Does aerobic exercise mitigate the effects of cigarette smoking on arterial stiffness? The Journal of Clinical Hypertension, 16(9), 640–644.
Pfeffermann, D., & Sverchkov, M. (2009). Inference under informative sampling. In Handbook of Statistics (pp. 455–487, Vol. 29). Elsevier.
Pfeffermann, D., Krieger, A. M., & Rinott, Y. (1998). Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 1087–1114.
Patil, G. P., Rao, C. R., Zelen, M., & Patil, G. P. (1987). Weighted distributions. Citeseer.
Patil, G. P., & Rao, C. R. (1978). Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics, 179–189.
Pearl, J. (2003). Statistics and causal inference: A review. Test, 12, 281–345.
Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3), 726–748.
Chen, Y.-C. (2017). A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1), 161–187.
Zambom, A. Z., & Ronaldo, D. (2013). A review of kernel density estimation with applications to econometrics. International Econometric Review, 5(1), 20–42.
Chernoff, H., Gastwirth, J. L., & Johns, M. V. (1967). Asymptotic distribution of linear combinations of functions of order statistics with applications to estimation. The Annals of Mathematical Statistics, 38(1), 52–72.
Hosking, J. R. (1990). L-moments: Analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society: Series B (Methodological), 52(1), 105–124.
Bickel, P. J. (1973). On some analogues to linear combinations of order statistics in the linear model. The Annals of Statistics, 597–616.
David, H. A., & Nagaraja, H. N. (2004). Order statistics. John Wiley & Sons.
Barndorff-Nielsen, O. (1963). On the limit behaviour of extreme order statistics. The Annals of Mathematical Statistics, 34(3), 992–1002.
Sparkes, S., & Zhang, L. (2023). Properties and deviations of random sums of densely dependent random variables. Preprint available at https://arxiv.org/abs/2310.11554.
Leadbetter, M., & Rootzén, H. (1988). Extremal theory for stochastic processes. The Annals of Probability, 431–478.
Smith, R. L. (1990). Extreme value theory. Handbook of Applicable Mathematics, 7(437–471), 18.
Haan, L. de, & Ferreira, A. (2006). Extreme value theory: An introduction (Vol. 3). Springer.
Kotz, S., & Nadarajah, S. (2000). Extreme value distributions: Theory and applications. World Scientific.
Bingham, N. (1995). The sample mid-range and symmetrized extremal laws. Statistics & Probability Letters, 23(3), 281–288.
Bingham, N. (1996). The sample mid-range and interquartiles. Statistics & Probability Letters, 27(2), 131–136.
Broffitt, J. D. (1974). An example of the large sample behavior of the midrange. The American Statistician, 28(2), 69–70.
Arce, G. R., & Fontana, S. A. (1988). On the midrange estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(6), 920–922.
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC Press.
Bickel, P. J., & Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6), 1196–1217.
Bickel, P. J., & Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 967–985.
Bickel, P. J., & Ren, J.-J. (2001). The bootstrap in hypothesis testing. Lecture Notes–Monograph Series, 91–112.
Swanepoel, J. W. (1986). A note on proving that the (modified) bootstrap works. Communications in Statistics–Theory and Methods, 15(11), 3193–3203.
Beran, R., & Ducharme, G. R. (1991). Asymptotic theory for bootstrap methods in statistics.
Politis, D. N., Romano, J. P., & Wolf, M. (2001). On the asymptotic theory of subsampling. Statistica Sinica, 1105–1124.
Shao, X. (2010). The dependent wild bootstrap. Journal of the American Statistical Association, 105(489), 218–235.
Hall, P., Horowitz, J. L., & Jing, B.-Y. (1995). On blocking rules for the bootstrap with dependent data. Biometrika, 82(3), 561–574.
Kreiss, J.-P., & Paparoditis, E. (2011). Bootstrap methods for dependent data: A review. Journal of the Korean Statistical Society, 40(4), 357–378.
Lahiri, S. N. (2003). Resampling methods for dependent data. Springer Science & Business Media.
Rider, P. R. (1957). The midrange of a sample as an estimator of the population midrange. Journal of the American Statistical Association, 52(280), 537–542.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29.
Mansournia, M. A., & Altman, D. G. (2016). Inverse probability weighting. BMJ, 352.
Madans, J. H., Kleinman, J. C., Cox, C. S., Barbano, H. E., Feldman, J. J., Cohen, B., Finucane, F. F., & Cornoni-Huntley, J. (1986). 10 years after NHANES I: Report of initial followup, 1982–84. Public Health Reports, 101(5), 465.
Witzig, R. (1996). The medicalization of race: Scientific legitimization of a flawed social construct. Annals of Internal Medicine, 125(8), 675–679.
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.

Supplementary material

3.A Convergence of Random Sums

The manuscript supposes fairly general dependency conditions and reasons about random sums in the final sections. Hence, it is important to establish the statistical consistency of random sums under C5. That is accomplished in Proposition 3.A.1. For this document, I = {1, 2, 3, ..., n}.

Proposition 3.A.1. Suppose C5 and let S_n = Σ_{i=1}^{n} w_{n,i} Y_i for any set of constants {w_{n,i}}_{i∈I} that are not all equal to zero s.t. w_{n,i} = O(n^{-1}), and say that Var(Y_i) < ∞ for all i. Then S_n → ES_n almost surely as n → ∞.
Proof. The proof uses a variance identity established by Sparkes and Zhang (2023). For an arbitrary random sum S_n under our premises,

   Var(S_n) = {1 + µ_n φ_n} Σ_{i=1}^{n} w_{n,i}² Var(Y_i) ≤ {1 + µ_n C} Σ_{i=1}^{n} w_{n,i}² Var(Y_i)

for some positive constant C. Now, denote Z_{k*} = sup_{k>n} |S_k − ES_k| = |S_{k*} − ES_{k*}|. Let ε > 0 be arbitrary and say max_{i∈I}{Var(Y_i)} ≤ M* and max_{i∈I}(|w_{n,i}|) = w_{n,*}. Then by Markov's inequality:

   Pr(Z_{k*} > ε) ≤ ε^{-2} {1 + µ_{k*} C} Σ_{i=1}^{k*} w_{k*,i}² Var(Y_i) ≤ ε^{-2} {1 + µ_{k*} C} · w_{k*,*}² k* M*

Since w_{k*,*}² = O(k*^{-2}) and k*^{-1} µ_{k*} → 0 as n → ∞ and hence k* → ∞:

   lim_{n→∞} Pr(Z_{k*} > ε) ≤ lim_{n→∞} ε^{-2} {1 + µ_{k*} C} · w_{k*,*}² k* M* = 0

Hence, Z_{k*} → 0 in probability. However, this implies that S_n → ES_n almost surely. ■

3.B Hoeffding's Bootstrap

In the paper, we also introduced estimators for the functional average. We showed that these estimators are consistent in the statistical sense under very general dependence conditions. Unfortunately, however, their properties are otherwise nebulous. This is unsatisfactory for inference. We made the claim that the bootstrap could be re-purposed to (conservatively) solve this problem under some conditions. We now qualify this statement. The literature on the bootstrap is vast and we do not review it. Essentially, the bootstrap resamples from the observed empirical distribution to emulate the population distribution of interest. Insofar as the empirical distribution function F̂_n(x) is strongly consistent, the targeted statistic is a smooth functional of F(x), and the researcher's theory about the underlying dependency structure that informs the re-sampling process is approximately correct, it provides a feasible approach to inference. Unfortunately, our theories pertaining to the dependencies between observations are not empirically verifiable as a whole. Like all models, they are also probably far off the mark and invalid. Our main approach does not condition on the observed sample for re-draws. Rather, it treats them as stochastic. To this end, say θ̂ = T_0(X_1, X_2, ..., X_n) : R^n → R is our statistic from the sample ζ = {X_i}_{i∈I}. We will assume a simple random sampling process with replacement (SRSWR), as usual, for executing the re-sampling process. This assumption is not strictly necessary—once again, sampling can be directed by a theory of dependence—but we avoid this here. Denote a kth re-sampling ζ_k as a collection of the newly sampled variables. For instance, ζ_k = {X_1, X_2, X_2, X_4, ..., X_n} if X_2 has been sampled twice and only X_3 has been replaced (here we are abusing set notation). Pertinently, we then note that T_k(X_1, X_2, ..., X_n) : R^{n-1} → R, but T_k still has the same 'working' magnitude as T since X_2 has simply taken the place of X_3 and occupied both arguments of the function. For our example, this would look like the following if T were the arithmetic mean:

   T_0(X_1, ..., X_n) = X̄ = n^{-1} Σ_{i=1}^{n} X_i
   T_k(X_1, X_2, ..., X_n) = n^{-1} · 2 · X_2 + n^{-1} Σ_{i∈I\{2,3}} X_i

Hence, any 'bootstrap' of K = K(n) re-samples results in a sample of random functions {T_k}_{k∈K}, where K = {0, 1, 2, ..., K} in this exploration. Note that we are making K an implicit function of n. We do not specify this function. We only require that K → ∞ when n → ∞. From here, we rely on the same mechanism that was used elsewhere to prove the statistical consistency of the sample maximum (Sparkes & Zhang, 2023). Note that for any two elements s, r ∈ K, Pr(T_s ≤ t | T_r ≤ t) = 1 if and only if Pr(T_s ≤ t ∩ T_r ≤ t) = Pr(T_r ≤ t), which indicates that the event {ω ∈ Ω : T_s(ω) ≤ t} is implied by the event {ω ∈ Ω : T_r(ω) ≤ t}.
When this occurs—for an arbitrary assortment of conditioning events—the conditional CDF is called trivial. Altogether, note that there are 1 ≤ Q(n) ≤ K + 1 non-trivial conditional CDFs. More detail is required when defining Q(n), but we leave this for the next proposition. For clarity, one possible situation that can give rise to a trivial CDF for the arithmetic mean is as follows. Say X_2 = βX_1 and observe T_r(X_1, X_2) = n^{-1}{c_1 X_1 + c_2 X_2} s.t. c_1 + c_2 = n, and T_s(X_1) = n^{-1} Σ_{i=1}^{n} X_1 = X_1. These functions correspond to the events that X_1 is sampled n consecutive times and that X_1, X_2 are sampled c_1 and c_2 times. Since X_2 = βX_1, T_r(X_1, X_2) = n^{-1}{c_1 + c_2 β}X_1, and T_r and T_s are monotonic in relation. Hence, Pr(T_r ≤ t | T_s ≤ t) ∈ {0, 1}, depending on the constant β. Delineating the number of conditional CDFs that are trivially one for an arbitrary statistic is not possible. However, it should be apparent that a diverging number of the n^n re-samples should produce non-trivial conditional CDFs for reasonable statistics, provided a dependency structure that is not extreme. By extreme, we mean those dependency structures s.t. ζ is in reality a sample of random variables that are monotonic transformations of one another, or one s.t. the random variables are closely related in a system of constraints. In practice, we would not expect this to be the case in most circumstances. Next, construct T_(K) = max_{k∈K}{T_k} and note that Pr(T_(K) ≤ t) = Pr(T_K ≤ t, T_{K-1} ≤ t, ..., T_0 ≤ t), where T_0 is the statistic of the original sample. We can then do a typical factoring WLOG:

   Pr(T_K ≤ t, T_{K-1} ≤ t, ..., T_0 ≤ t) = Pr(T_K ≤ t | T_{K-1} ≤ t, T_{K-2} ≤ t, ..., T_0 ≤ t) · Pr(T_{K-1} ≤ t | T_{K-2} ≤ t, T_{K-3} ≤ t, ..., T_0 ≤ t) ··· Pr(T_1 ≤ t | T_0 ≤ t) · Pr(T_0 ≤ t)

Recall: many of these conditional CDFs will evaluate to one. We call this sequence of conditional CDFs C.

Proposition 3.B.1. Assume a sample of n random variables ζ = {X_i}_{i∈I} and let T : R^n → R be a measurable function. Denote {T_k}_{k∈K} for K = {0, 1, 2, ..., K} as a set of random functions s.t. T_k : R^l → R for l ∈ N, l ≤ n, and T_k is a function of the collection of random variables re-sampled from ζ as previously described. Observe that T_0 = T(X_1, X_2, X_3, ..., X_n), the function of the original sample, is included. Next, suppose for an arbitrary k ∈ K that S_k, the support of T_k, is a finite set of real numbers with maximum M_k. Furthermore, define T_(K) = max_{k∈K}(T_k) ∼ Pr(T_(K) ≤ t) and suppose it is a discrete random variable or possesses a density when continuous. Finally, denote 1 ≤ Q(n) ≤ K + 1 as the number of conditional CDFs in the sequence C that are strictly less than one when t < max_{k∈K}(M_k) = M. Then, if Q(n) → ∞ as n → ∞ and K → ∞, T_(K) → M almost surely.

Proof. Suppose the premises. The style of this proof also follows Sparkes and Zhang (2023). First, note that the support of T_(K), say S, is a subset of the union of the supports of each T_k, with a supremum that is the maximum of their respective suprema. Hence, S ⊆ ∪_{k∈K} S_k s.t. M = max_{k∈K}(M_k) = max(S). Now, observe for an arbitrary t that Pr(T_(K) ≤ t) = Pr(T_K ≤ t | T_{K-1} ≤ t, ..., T_0 ≤ t) · Pr(T_{K-1} ≤ t | T_{K-2} ≤ t, ..., T_0 ≤ t) ··· Pr(T_1 ≤ t | T_0 ≤ t) · Pr(T_0 ≤ t), and say Q(n) is the number of conditional CDFs that are less than one in the sequence C when 1{t < M} = 1. For brevity, say Pr(T_K ≤ t | T_{K-1} ≤ t, ..., T_0 ≤ t) = Pr(T_K ≤ t | A_K) WLOG under the convention that Pr(T_0 ≤ t | A_0) = Pr(T_0 ≤ t). Moreover, denote F(t) as the maximum of those conditional CDFs that are strictly less than unity when t < M. Then there exists an ε > 0 s.t.
F(t) = 1 − ε and:

   Pr(T_(K) ≤ t) = Pr(T_K ≤ t | A_K) · Pr(T_{K-1} ≤ t | A_{K-1}) ··· Pr(T_0 ≤ t)
                ≤ 1^{K+1−Q(n)} F(t)^{Q(n)} = 1^{K+1−Q(n)} · (1 − ε)^{Q(n)} = (1 − ε)^{Q(n)}

Next, as a minor lemma, note that for any discrete random variable Z with CDF F(z) and mass function f(z) on the bounded support {m, z_1, z_2, ..., z_{R-2}, M}, with minimum m and maximum M, ∫_m^M F(z) dz = M − EZ, since F(z) is a step function that can be integrated over [m, M]. To see this, express ∫_m^M F(z) dz = (z_1 − m) · f(m) + (z_2 − z_1) · {f(m) + f(z_1)} + (z_3 − z_2) · {f(m) + f(z_1) + f(z_2)} + ··· + (M − z_{R-2}) · {1 − f(M)}. Expand this to note the pattern of cancellations:

   ∫_m^M F(z) dz = (z_1 − m) · f(m) + (z_2 − z_1) · {f(m) + f(z_1)} + (z_3 − z_2) · {f(m) + f(z_1) + f(z_2)} + ···
               = −m f(m) − z_1 f(z_1) + z_2{f(m) + f(z_1)} − z_2{f(m) + f(z_1)} − z_2 f(z_2) + ···
               = −m f(m) − Σ_{i=1}^{R-2} z_i f(z_i) + M − M f(M)
               = M − EZ

Therefore, EZ = M − ∫_m^M F(z) dz. Furthermore, for any continuous random variable Y with a density f(y) and CDF F(y) s.t. min(S_Y) = m > −M and max(S_Y) = M, it also follows that EY = ∫_{−M}^{M} {2^{-1} − F(y)} dy = M − ∫_m^M F(y) dy. Now, designate m = min(S) specifically. Then, for either of our cases, ET_(K) = M − ∫_m^M Pr(T_(K) ≤ t) dt and:

   M − {M − m} · (1 − ε)^{Q(n)} ≤ ET_(K) ≤ M

It then of course follows that:

   M − {M − m} · lim_{n→∞} (1 − ε)^{Q(n)} ≤ lim_{n→∞} ET_(K) ≤ M

which of course implies that:

   M ≤ lim_{n→∞} ET_(K) ≤ M

and therefore that ET_(K) → M as n → ∞ and K → ∞. This in turn implies that T_(K) converges in distribution to the constant M, since M is a boundary value of the support, which implies that T_(K) → M in probability as n and K become arbitrarily large. Next, note that {T_(K)}_{k∈K} is a bounded and non-decreasing sequence of random variables. Hence, the point-wise limit of T_(K) as n → ∞ and K → ∞ exists. Call this limit L. However, since T_(K) converges in probability to M as both n → ∞ and K → ∞, there is a sub-sequence of {T_(K)}_{k∈K} that converges almost surely to M. This then suggests that T_(K) → M almost surely as n → ∞ and K → ∞ by the uniqueness of limits, which implies L = M almost surely. ■

As stated in the paper, the supposition that the density of the maximum statistic exists is non-trivial. This assumption, however, is commonplace and at least fecund, and it can be circumvented with the more conservative version justified by the principle of indifference. Also, since min_{k∈K}(T_k) = −max_{k∈K}(−T_k), we can employ the last proposition to establish the statistical consistency of the minimum as well. Now, we can make the utility of these propositions clear. First, consider the setup of Proposition 3.B.1. For this case, since T_0 is included in {T_k}_{k∈K}, max(S_{T_0}) ≤ M and min(S_{T_0}) ≥ m. Therefore, M_0 − m_0 ≤ M − m. Under the conditions of this proposition, we know that we can estimate M − m consistently. Since T_0 is a bounded random variable and therefore sub-Gaussian, a straightforward application of Hoeffding's inequality for an arbitrary ε > 0 yields:

   Pr(|T_0 − ET_0| > ε) ≤ 2 exp{−(M_0 − m_0)^{-2} · 2ε²}

And this implies that:

   Pr(|T_0 − ET_0| > ε) ≤ 2 exp{−(M − m)^{-2} · 2ε²}

One can then use the plug-in estimate from the bootstrap to produce 1 − α confidence sets of the (informal) form T_0 ± (M̂ − m̂)√(2^{-1} log(2/α)), where M̂ is the maximum of the bootstrapped statistics WLOG. Under the assumptions of the manuscript, and perhaps a more controlled re-sampling process s.t. each T_k converges almost surely or in probability to its expectation with probability one for the first setup, we would expect M̂ − m̂ to tend towards zero. If T_0 ∈ U, we can improve the confidence set to T_0 ± (M̂ − m̂)√(6^{-1} log(2/α)). Recall that we expect T_0 to approach U status if it converges almost surely or in probability, or if it is in a scale family s.t. its density is proportional to its standard deviation, which tends to zero. This provides justification for the sharper confidence set, at least for moderately sized samples.
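As a quick numerical sanity check of the step-function identity above, the following sketch evaluates both sides on a toy discrete distribution of our own choosing.

z  <- c(0, 1, 3, 7)                      # support, so m = 0 and M = 7
p  <- c(.2, .3, .4, .1)                  # mass function
EZ <- sum(z * p)
Fz <- cumsum(p)                          # CDF evaluated at the support points
int_F <- sum(diff(z) * Fz[-length(z)])   # integral of the step CDF over [m, M]
c(EZ = EZ, identity = max(z) - int_F)    # both evaluate to 2.2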
We are left with one final problem to consider. Often, the researcher is interested in constructing an approximate 1 − α confidence set w.r.t. a parameter of interest, say θ. However, while ET_0 → θ as n → ∞, it is often the case that ET_0 ≠ θ for finite samples. Consequently, the 1 − α confidence sets, which are built around ET_0, can fall short of the nominal value. This is ultimately another reason to use a Hoeffding-style bootstrap, however. We demonstrate this informally. Trivially, we know that T_0 ± {M − m} · √(2^{-1} log(2/α)) contains ET_0 with probability one for an arbitrary n when 0 < α ≤ 2 exp{−2}. We designate η(n) = {M − m} · √(2^{-1} log(2/α)) > 0. Next, suppose S(n) is some function of n s.t. √n S(n) stabilizes the variance of T_0 and therefore that √n S(n){T_0 − θ} converges almost surely to some random variable D with expectation zero and finite variance. It is then implied that √n S(n) η(n) → C* > 0 as n → ∞ or √n S(n) η(n) → ∞. Now, observe that if √n S(n){T_0 − θ} → D almost surely, then there is some random variable Z_n s.t. √n S(n){T_0 − θ} = D + Z_n and Z_n → 0 almost surely. Then we can also say that √n S(n){ET_0 − θ} = EZ_n and that EZ_n → 0 as n → ∞. Since EZ_n → 0, but √n S(n) η(n) → C* > 0 or is positively infinite, this implies that ET_0 − θ = {√n S(n)}^{-1} EZ_n → 0 as n → ∞ at a faster rate than η(n). Now, let N ∈ N be a natural number s.t. for all n > N, |ET_0 − θ| is negligible relative to η(n), and consider the almost sure event that |T_0 − ET_0| ≤ {M − m} · √(2^{-1} log(2/α)). Then:

   T_0 − {M − m} · √(2^{-1} log(2/α)) ≤ ET_0 ≤ T_0 + {M − m} · √(2^{-1} log(2/α))

is equivalent to:

   T_0 − η(n) ≤ ET_0 − θ + θ ≤ T_0 + η(n)

and, effectively:

   T_0 − η(n) ≤ θ ≤ T_0 + η(n)

In conclusion, then, at least for sufficiently large n, constructing a confidence set of the form T_0 ± {M − m} · √(2^{-1} log(2/α)) that contains ET_0 with probability one is equivalent to forming an almost sure confidence set around θ. We of course point out that this same logic applies without the added factor of √(2^{-1} log(2/α)), and that more efficient bounds on |T_0 − ET_0| can be chosen. Since more efficient bounds come at the cost of constraints on the distribution of T_0, this was avoided. On the former matter, it is apropos to recognize that M̂ − m̂ will be biased downward w.r.t. the population extremes of the bootstrap distribution in real-life applications that make use of finite n and K. Worse, its rate of convergence will be unknown and, in all likelihood, sub-optimal. Hence, the employment of the added factor in concordance with Hoeffding's inequality is intended to compensate for these deficiencies. Although employed somewhat arbitrarily, Hoeffding's inequality is well established and provides an intuitive choice for penalization. Nevertheless, since the confidence set constructed provides almost certain coverage for θ in theory, the under-estimation of the bootstrap range is not likely to undermine the cogency of confidence sets with at least a working 1 − α coverage in a majority of applied circumstances. This is also true since M_0 ≤ M WLOG. As a final point, we recognize that this strategy will be very conservative in most circumstances. However, we do not see this as a fault.
Excluding situations where it is riskier to fail to reject a null hypothesis, say, this methodology will always be less subject to doubt and hence will also supply more cogent scientific arguments. A researcher can always make use of constraints to derive a more efficient method for inference, but the results of that methodology will always be subject to more doubt as a consequence of the additional assumptions, which are false in all likelihood. The Hoeffding bootstrap—at a minimum—can serve as a tool for sensitivity analysis, e.g., for distinguishing which propositions are the most inscrutable. Moreover, when its premises are satisfied, the Hoeffding bootstrap (at least nominally) eliminates the multiple testing issue for sufficiently large n. This trivially follows from the fact that, if the probability of a Type I error is zero for an arbitrary test, the family-wise error rate is preserved by implication.

3.B.1 Simulations

We offer three simulation experiments as a proof of concept under some dependence conditions. The bootstrap is known to fail when estimating the minimum of n independent uniform distributions. Hence, we start with this example. It is also known to fail for non-smooth functions. Hence, our second example is T_0 = |Ȳ_n| when EȲ_n = 0. The third is for the arithmetic mean. We demonstrate robustness to dependence by incorporating scenarios of simple but relatively extreme dependence into all experiments. We call these setups the 'sneaky twin,' 'sneaky decuplet,' and 'sneaky venti-cuplet' scenarios, which are used for the first, second, and third experiments respectively. Essentially, for the first case, we draw n/2 independent Y_i WLOG; however, for each Y_i drawn, we include it in the sample a second time. The other two setups function analogously, but replace 2 with 10 and 20. This does not validate this type of bootstrap for all scenarios. However, it is extreme enough to demonstrate the method's utility, since the sneaky setup is more extreme than a number of setups that are commonly supposed. For instance, if T_0 is a random sum, the sneaky venti-cuplet scenario invisibly increases its variance by a factor of twenty. The basic bootstrapping procedures will only use B = 500 bootstrap samples. The only two exceptions are for the sneaky twin and venti-cuplet scenarios used for the sample minimums and the means: here, we use B = 2000 and B = 1500 bootstrap samples respectively to compensate for the behavior of the sample minimum and the more extreme dependence picture. All setups enact simple random sampling with replacement without any theory of dependence. This is not preferable in practice. In applied settings, it is best to enact a sampling scheme that approximates what is known about the dependency structure. Here, an 'out of the box' approach is used to demonstrate robustness. A non-targeted sampling will produce more trivial samples on average when relatively extreme dependence is present. Thus, an adequate performance when blind is a good baseline.

3.B.1.1 Uniform Experiment

Set T_k = min_{j∈ζ_k}(Y_j). For the case of mutual independence, note that an arbitrary Pr(T_k | A_k) will equal one only if the kth re-sample is contained in any of the other bootstrapped samples. Out of n original random variables, there are 2^n − 1 unique T_k possible. As n → ∞ and B → ∞, our conditions will be fulfilled. Since it is not the focus of this manuscript, we did not derive the exact properties of the sneaky twin scenario. An informal treatment will suffice; a sketch of the sneaky-twin draw and the resulting one-sided interval is given first.
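This is a minimal sketch under the sneaky-twin description above; each independent uniform enters the sample twice, and the one-sided set uses the sample minimum as the right endpoint, as in Table 3.B.1 below.

set.seed(5)
n <- 500
y <- rep(runif(n / 2), each = 2)              # sneaky twins: each draw twice
t0 <- min(y)                                  # T0, the sample minimum
reps <- replicate(2000, min(sample(y, replace = TRUE)))
lower <- t0 - (max(reps) - min(reps)) * sqrt(log(2 / .05) / 2)
c(lower = lower, upper = t0)                  # conservative one-sided set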
Let T_{0,*} be the sample minimum of the n/2 independent Y_i. Then it is apparent that T_0 = T_{0,*}, the minimum of the n/2 independent outcome variables. In this scenario, a simple random sample with replacement of the n outcomes will result in a large number of trivial conditional CDFs. This is because there are at most n/2 independent variables to draw from, and thus a large proportion of the re-samples will have strictly fewer than n/2 of the independent variables. This will lead to the evaluation of conditional CDFs of the form Pr(T_j ≤ t | T_s ≤ t, ...), where T_j and T_s are sample minimums composed of nested or almost nested samples. For such a setup, B = 2000 is low, since convergence of the bootstrap distribution will be slow. Nevertheless, it is computationally feasible and in the neighborhood of a common choice for an 'approximate' bootstrap. For reference, we also employ the U Hoeffding bootstrap of Section 3.3. The results of this simulation are in Table 3.B.1. CÎ_H provides the average results for the Hoeffding bootstrap of Section 3.2 WLOG, while CÎ_HU provides the results for the method of Section 3.3. These are conservative one-sided 1 − α confidence sets with the average sample minimum provided as the right endpoint.

Table 3.B.1: Hoeffding for Sample Minimum

              n      CÎ_H              EC_H    CÎ_HU             EC_HU
Independent   500    (−0.013, 0.002]   0.992   (−0.028, 0.002]   1
              2500   (−0.003, < .001]  0.990   (−0.006, < .001]  1
              5000   (−0.001, < .001]  0.987   (−0.003, < .001]  0.999
Sneaky Twin   500    (−0.014, 0.004]   0.930   (−0.032, 0.004]   0.984
              2500   (−0.003, 0.001]   0.930   (−0.006, 0.001]   0.986
              5000   (−0.001, < .001]  0.943   (−0.003, < .001]  0.987
              10000  (−0.001, < .001]  0.945   (−0.002, < .001]  0.984

It is apparent that the coverage properties suffered in the sneaky twin scenario. However, this is partially due to the low number of bootstraps. Otherwise, its performance could have been improved if the re-draws were focused—even imperfectly—on the n/2 independent originals. Nonetheless, the Hoeffding bootstrap performed largely as intended for both methods. Note that this seems trivial since we already know that the parameter is zero; in actual practice, we would not know this. From here, only the Hoeffding bootstrap of Section 3.2 is simulated. This is sufficient since the one from Section 3.3 will always perform more conservatively. Therefore, if the Section 3.2 method offers decent performance, so does the one from Section 3.3.

3.B.1.2 Non-smooth Experiment

For this experiment, each Y_i ∼ TN(m = −20, M = 20, µ = 0, σ = 5). Again, we make use of the sneaky decuplet method for the non-independence cases. This is accomplished by drawing n/10 independent variables and then sneaking in an additional nine copies of each. The results are below.

Table 3.B.2: Hoeffding for |Ȳ|

                  n      T_0     CÎ_H              EC_H
Independent       500    0.179   (−0.987, 1.344)   1
                  2500   0.081   (−0.443, 0.605)   —
                  5000   0.057   (−0.313, 0.428)   —
Sneaky decuplet   500    0.560   (−0.929, 2.049)   0.985
                  2500   0.253   (−0.42, 0.927)    0.989
                  5000   0.179   (−0.298, 0.657)   0.986
                  10000  0.122   (−0.214, 0.458)   0.990

The bootstrap behaves as intended. Although this bootstrap does not capture the true distribution of |Ȳ|, that is not the goal. Instead, the goal was to construct cogent confidence sets for a parameter on the boundary of a non-smooth function. This is accomplished.

3.B.1.3 Arithmetic Mean Experiment

The same outcome distribution as in the non-smooth experiment is used here as well.
Since this experiment possesses a relatively extreme dependency structure, we provide results for B = 500 and B = 1500 bootstrap procedures for the venti-cuplet simulations. Moreover, since this is a case where the standard bootstrap could be applied, we also provide a complementary simulation using the same random seed. The results for the Hoeffding bootstrap are in Table 3.B.3, while those for the traditional bootstrap, which uses a bootstrap variance estimate for a normal approximation, are in Table 3.B.4.

Table 3.B.3: Hoeffding for Ȳ

                      n      T_0      CÎ_H              EC_H
Independent           500    −0.006   (−1.845, 1.832)   1
                      2500   0.001    (−0.822, 0.825)   —
                      5000   −0.000   (−0.583, 0.582)   —
Sneaky venti-cuplet   500    −0.017   (−1.815, 1.781)   0.902
(B = 500)             2500   −0.015   (−0.83, 0.801)    0.927
                      5000   −0.000   (−0.585, 0.584)   0.933
                      10000  0.005    (−0.406, 0.416)   0.934
Sneaky venti-cuplet   500    0.027    (−1.958, 2.011)   0.929
(B = 1500)            2500   −0.007   (−0.913, 0.899)   0.963
                      5000   0.004    (−0.64, 0.648)    0.958
                      10000  0.009    (−0.445, 0.463)   0.953

Table 3.B.4: Bootstrap Normal Approximation

                      n      T_0      CÎ_t              EC_t
Independent           500    −0.006   (−0.443, 0.431)   0.955
                      2500   0.001    (−0.194, 0.197)   0.95
                      5000   −0.000   (−0.139, 0.138)   0.95
Sneaky venti-cuplet   500    0.027    (−0.4, 0.454)     0.318
(B = 1500)            2500   −0.007   (−0.202, 0.188)   0.343
                      5000   0.004    (−0.134, 0.143)   0.340
                      10000  0.009    (−0.089, 0.107)   0.328

As expected, a larger number of bootstrap samples improves the performance of the Hoeffding bootstrap. In contrast, the standard method fails.

References

Sparkes, S., & Zhang, L. (2023). Properties and deviations of random sums of densely dependent random variables. Preprint available at https://arxiv.org/abs/2310.11554.

Chapter 4
Extending Generalized Linear Models to Social Network Analysis

Abstract

Generalized linear models (GLMs) and generalized estimating equations (GEEs) have been labeled as inappropriate tools for social network analysis. This is because social network ties are not independent, and specifying independent clusters is a non-trivial task in network settings. This paper discusses an expression for the variance of an additive statistic that can help facilitate cogent statistical inference with GLMs and GEEs in social network contexts. Instead of specifying an entire dependence or graph model, an approach using this identity requires reasoning about two summary constants: the average correlation and the average number of dependencies in a sample. Additionally, this paper shows that this identity can also be used to quantify the robustness of statistical decisions when covariance models are invalid, and that an alternative bootstrapping procedure is available for inference when the underlying system of dependencies is mostly unknown. After establishing these facts, these methods are applied to a high school friendship network to investigate the association between different forms of student homophily and friendship identification.

4.1 Introduction

The statistical modeling of social network ties is challenging. A main contributor to this challenge is probabilistic dependence. Classical methodologies, such as those associated with ordinary least squares and generalized linear models (GLMs), have traditionally been predicated upon mutually independent observations (Nelder & Wedderburn, 1972; McCullagh & Nelder, 2019). However, the assumption of independence is violated when dealing with random tie variables from a social network. For this reason, GLMs have historically been seen as inappropriate in this setting.
Even if their parameters are identified and consistently estimable, the unknown and possibly intractable sequences of dependencies that are present between network outcome variables disqualify standard model-based approaches to uncertainty quantification. This weakens the cogency of statistical inference in the absence of an alternative strategy. As a consequence, a lot of scholarship has been dedicated to the development of statistical approaches that do not require independence, and which factor underlying theories about network formation and structure into the model. Two of the most popular instances of this approach are exponential random graph models (ERGMs) and stochastic actor-oriented models (SAOMs). The first class of models presupposes that social networks are constellations of locally emergent substructures of ties (Lusher, Koskinen, & Robins, 2013; Robins, Pattison, Kalish, & Lusher, 2007; Harris, 2013). Network ties that are not connected via local structures are typically designated as conditionally independent, provided knowledge of these structures (Snijders, Pattison, Robins, & Handcock, 2006; Wang, Pattison, & Robins, 2013; Butts, 2006). SAOMs operate under the conception that actors within a network sequentially optimize their ties over time, conditional on knowledge of local configurations and all other ties in the network (Snijders, 2017, 1996). These models are invaluable for reasoning about the probability that a tie is formed, conditional on a specific graph theory and knowledge of all other ties. However, this is not the object of interest in many circumstances. For instance, scientific interest also often centers around the marginal formation of ties, i.e., around the probability of tie formation more generally—and perhaps even conditional on a set of known nodal covariates—when no other information exists about the contemporaneous state of the network or its part-versus-whole dynamics. Despite their usefulness, ERGMs and SAOMs also possess well-known limitations. For instance, SAOMs are known to place unrealistic constraints on actor behavior (Leifeld & Cranmer, 2022, 2019). This can hamper the interpretation of results. ERGMs often suffer from model degeneracy, a situation where seemingly reasonable theories about graph architecture result in the fitting of trivial graphs (Cranmer & Desmarais, 2011; Robins, Snijders, Wang, Handcock, & Pattison, 2007; Goodreau, 2007). As a consequence, researchers are often required to enforce constraints on graph architecture in a post hoc or purely formal fashion to successfully fit a model. Ultimately, these theoretical adjustments result from considerations that are often extraneous to the original hypotheses or scientific conceptions. Fitting processes for both models can also be computationally intensive (An, 2016; Snijders et al., 2002; Amati, Schönenberger, & Snijders, 2015). On the other hand, GLMs possess favorable properties for social network analysis in these dimensions. GLMs are marginal models. They are therefore capable of modeling the formation of relationships in the absence of knowledge pertaining to contemporaneous ties or graph structures. When a researcher wants to learn about the association between nodal covariates and link formation, but wishes to do so in the absence of a greater theory about the network's dynamics or human behavior, GLMs are a powerful tool. A minimal illustration of this marginal use is given below.
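The following sketch uses simulated data; the network size, the dyadic similarity covariate, and the effect sizes are hypothetical and are not drawn from the application in Section 4.4.

## Marginal logistic GLM for directed tie variables: the diagonal of the
## adjacency matrix is dropped and the remaining N = n(n - 1) cells are
## modeled with a logit link, with no conditioning on other ties.
set.seed(6)
n_nodes <- 30
x_node  <- rnorm(n_nodes)
sim     <- -abs(outer(x_node, x_node, "-"))   # dyadic similarity covariate
A       <- matrix(rbinom(n_nodes^2, 1, plogis(-2 + 1.5 * sim)), n_nodes)
off     <- row(A) != col(A)                   # exclude self-ties
fit     <- glm(A[off] ~ sim[off], family = binomial)
summary(fit)$coefficients                     # marginal association estimates
## Note: the model-based standard errors here still presuppose independent
## ties; Section 4.3 addresses this gap.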
Moreover, strategies for checking the specification of the mean model, and other aspects of goodness-of-fit, are relatively well known, which is an added boon (Breslow, 1996; Agresti, 2015). They are also computationally simple and reliable. Finally, like least squares methods in general, GLMs are interpretable as a type of optimal approximation to the true data generating mechanism, even when the mean model is invalidly specified (White, 1982). This last fact provides GLMs with a type of intrinsic validity.

Our primary contribution in this manuscript is to demonstrate that GLMs, and generalized estimating equations (GEEs) more generally, can be cogently, albeit conservatively, utilized for inference with binary network data, and with dependent observations in general. We show that this is the case even when the details of the underlying dependency structure are mostly unknown and intractable. Ultimately, we accomplish this through the use of a novel variance identity for additive statistics. The utility of this identity is that it expresses the variance of an additive statistic as the product of two factors: one, the variance of the statistic under the counterfactual assumption of independence, and two, a factor defined by the average number of dependencies in the sample and the average correlation between them. Hence, in place of requiring a complete theory about the architecture of a social network, say, a statistician can elect to reason about two summary constants. This is beneficial in many applied contexts because the defensible specification of two summary constants is intuitively more feasible than choosing a model for $n^2$ unknown parameters. Additionally, we show that cogent inference can be accomplished with a particular type of bootstrapping process insofar as network dependencies are not too extreme.

This manuscript develops as follows. Section 4.2 offers exposition on notation and popular approaches to modeling dependence to further contextualize our approach. Once this is achieved, Section 4.3 introduces the variance identity and shows how it can be used in conjunction with GLMs and GEEs. Furthermore, we also explore how the variance identity can be employed to gauge the robustness of statistical decisions when a user-specified covariance model is incorrect. After this, we introduce the bootstrapping strategy. Section 4.4 presents an example analysis of a real high school friendship network to demonstrate the utility of these tools.

4.2 GLMs for Relational Inference

In this section, we briefly review some basics of GLMs and (quasi) maximum likelihood (ML) estimation. We do this to emphasize the additive nature of the estimators that result from these models. This allows us to highlight known challenges in the context of social network analysis. It also provides a basis for the feasible solution that is presented in Section 4.3. We also introduce some required assumptions.

Let $Y_A \in \mathbb{R}^{n \times n}$ be a random adjacency matrix of a social network, or of a graph $G$ more generally. When $G$ is a directed graph, $Y_A$ can be collapsed into an $n(n-1) \times 1$ vector of tie variables $Y$ by excluding its diagonal cells and concatenating its rows. In the directed case, letting $N = n(n-1)$, $Y$ is an $N \times 1$ random vector. We work with the directed case without loss of generality (WLOG). This collapse is illustrated in the short sketch below.
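To make the notation concrete, the following minimal R sketch collapses a toy directed adjacency matrix into the $N \times 1$ tie vector $Y$; the matrix and all values are illustrative only and carry no connection to the data analyzed later.

    # Toy illustration: collapsing a directed adjacency matrix into the tie vector Y.
    set.seed(1)
    n <- 4
    A <- matrix(rbinom(n * n, 1, 0.3), n, n)
    diag(A) <- 0                                       # no self-ties
    Y <- as.vector(t(A))[as.vector(t(diag(n) == 0))]   # concatenate rows, drop diagonal
    length(Y) == n * (n - 1)                           # TRUE: N = n(n - 1)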
For some monotonic function $g$, a fixed $N \times p$ matrix of constants $x$, and a vector of weights $\beta \in \Theta \subset \mathbb{R}^{p \times 1}$, we are interested in the following mean model, where $\mu = EY$:

$g(\mu) = x\beta$    (4.1)

Since each individual tie variable $Y_{i,j} \sim f_{i,j}(y)$ of $Y$ is a Bernoulli random variable with expected value $\mu_{i,j}$, eq. (4.1) with a logit link function is a canonical choice. Note that eq. (4.1) is a marginal model in two senses. Firstly, it is averaged over all known and unknown implicit clusters in $Y$. This is the traditional sense that is often used in the context of hierarchical models. Secondly, for any indexes $i, j$, the estimation of $g(\mu_{i,j})$ is not conditioned upon other ties. Instead, eq. (4.1) models the association between relational formation and the features constituting $x$ without holding other possible relationships constant. Again, this is useful when the researcher wishes to know how a certain set of properties impacts the realization of contemporaneous relationships in general, in the absence of any knowledge of the greater network at the time of measurement.

Equation (4.1) is typically fit using maximum likelihood (ML) estimation when it is safe to assume that the outcome variables are mutually independent. For ease of reading, we will relabel each $Y_{s,t}$ using a single index $i \in \{1, \ldots, N\}$. The ML estimation procedure maximizes the objective function $Q_N(\beta) = N^{-1} \sum_{i=1}^N \ln\{f(Y_i \,|\, \beta, x_i)\}$ with respect to the parameter vector $\beta$. Some basic regularity conditions pertaining to the behavior of the functions involved and the parameter space are required to establish statistical consistency. We direct the reader to technical resources on standard regularity conditions if they have further interest (Amemiya, 1985; Yuan & Jennrich, 1998). Otherwise, we assume that the average number of dependent outcome variables in the sample is sub-linear in $N$, i.e., that if $\mu_N$ is the average number of dependent random tie variables in the sample, then $N^{-1}\mu_N \to 0$ as $N \to \infty$. In conjunction with the aforementioned standard regularity conditions, this is sufficient for establishing statistical consistency under general conditions of probabilistic dependence (Sparkes & Zhang, 2023). Note that this assumption is light and allows $\mu_N$ to become arbitrarily large.

Now, define an $N \times p$ matrix $D = \partial\mu/\partial\beta^\top$, a fixed $N \times N$ covariance matrix $V$, and the $N \times 1$ vector $Y - \mu$. We designate $\nabla$ as the gradient operator. For models in the exponential family making use of the canonical link function, also note that $\nabla Q_N(\beta) = U_N(\beta) = D^\top V^{-1}(Y - \mu)$, which is recognizable as a vector of estimating equations.

When $\hat\beta$ converges in probability to the parameter value $\beta_0$ and standard regularity conditions are true, the following also holds, where $\xrightarrow{d}$ denotes convergence in distribution and $z(N)$ is an unspecified function of sample size:

$z(N)(\hat\beta - \beta_0) \xrightarrow{d} z(N)\{\nabla U_N(\beta_0)\}^{-1} U_N(\beta_0)$

The importance of this statement in the context of this paper is that it highlights $z(N)(\hat\beta - \beta_0)$ as a vector of asymptotically additive statistics, i.e., each of its components is a weighted random sum of outcome variables. This follows from the fact that $U_N(\beta_0)$ is additive. For finite samples, iteratively re-weighted least squares (IRWLS) methods are often used to achieve a close approximation to the true solutions (Green, 1984). Once the iterative algorithm converges for some specified tolerance level, the approximating estimator is additive in $Y$ for all samples if one conditions on the previous steps. A toy illustration of fitting eq. (4.1) marginally follows.
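As a concrete illustration of eq. (4.1) with a logit link, the following self-contained R sketch simulates directed ties whose marginal formation probability depends on a single homophily covariate and fits the marginal GLM; all names and parameter values are hypothetical rather than drawn from the study data.

    # A toy marginal logistic GLM for tie formation; all values are illustrative.
    set.seed(2)
    n <- 30
    hisp <- rbinom(n, 1, 0.5)                             # a binary nodal covariate
    pairs <- subset(expand.grid(i = 1:n, j = 1:n), i != j)
    homoph <- as.integer(hisp[pairs$i] == hisp[pairs$j])  # dyadic homophily term
    p <- plogis(-2 + 1 * homoph)                          # marginal tie probabilities
    y <- rbinom(nrow(pairs), 1, p)                        # simulated directed ties
    fit <- glm(y ~ homoph, family = binomial)             # eq. (4.1) with a logit link
    coef(fit)                                             # roughly recovers (-2, 1)

Note that this fit says nothing about the joint distribution of the ties; the dependence between them is handled separately, as developed in Section 4.3.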
Appreciating the (asymptotically) additive nature of $\hat\beta$ is important because it allows us to recognize the general form of any one component $\hat\beta_s$ and the simple expression of its variance. For instance, we can designate $w = (D^\top V^{-1} D)^{-1} D^\top V^{-1}$ as a $p \times N$ matrix of regression weights. When an arbitrary $\hat\beta_s$ is additive, at least for sufficiently large samples, this implies that $\hat\beta_s = \beta_s + \sum_{i=1}^N w_{s,i}(Y_i - \mu_i)$ with respect to (w.r.t.) the vector of weights $w_s$ that originates from the sth row of $w$. Since this is just the expression for an additive random variable, it is then also true that

$\mathrm{Var}(\hat\beta_s) = \sum_{i=1}^N w_{s,i}^2 \mathrm{Var}(Y_i) + 2\sum_{i<j} w_{s,i} w_{s,j} \mathrm{Cov}(Y_i, Y_j)$

When the $Y_i$ are mutually independent (or uncorrelated), the right-hand term of course disappears.

Our final assumption is that the necessary preconditions for asymptotic normality exist w.r.t. an arbitrary $z(N)(\hat\beta_s - \beta_s)$, conditional on $x$. More information on the primitive conditions that are sufficient for confidently asserting asymptotic normality for functions of dependent random variables is available elsewhere (Berk, 1973; Hoeffding & Robbins, 1948; Le Cam, 1986; Withers, 1981). In summary, we assume—at least for large enough $N$ and conditional on $x$—that an independent partition of the weighted outcome variables into $K$ finite clusters exists s.t. no single clustered statistic dominates the variance of the estimator, and that $K \to \infty$ as $N$ does. However, this partition is assumed to be otherwise unknown and mysterious. Note that the supposition of asymptotic normality is consistent with theories that social network ties are locally dependent in some sense, at least after conditioning on an ample set of variates $x$ that would have otherwise contributed to the system of dependence. In the event that this assumption is not believed to hold, we offer an alternative setup and solution in Section 4.3.

4.2.1 The Problem of Dependence

A cardinal problem with using eq. (4.1) for relational inference is that $Y$ is not a set of mutually independent random variables. When the outcome variables are dependent, the GLM procedure can no longer be classified as ML estimation. More importantly, however, the possibly intractable dependencies compromise model-based strategies for calculating standard errors.

White (1982) demonstrated that the first problem is not fatal. Under the supposition that $f(Y_i \,|\, \beta, x_i)$ is well-specified, mutual independence of the $Y_i$ is only a sufficient condition for arriving at $Q_N$ as an objective function. One can still utilize $Q_N$ for consistent estimation of $\beta$ if the set of first moments alone is well-specified. In the event that independence is violated and the distribution function posited is inaccurate, the maximum of $Q_N(\beta)$ still has an interpretation as the optimal approximation to the unknown true distribution function, say $h(Y_i \,|\, \beta, x_i)$, w.r.t. the Kullback-Leibler divergence criterion. The solution to $Q_N(\beta)$ also retains its interpretation as an optimal least squares approximation when $V$ is non-diagonal, or when it cannot be partitioned into covariance matrices of independent clusters more generally. Since it has become a truism that all models are invalidly specified, it is evident that the first stated complication is not unique to the use of GLMs in contexts of statistical dependence. Since they retain their meaning as feasible least squares approximations, their use for knowledge construction remains fecund. The second problem, however, is more serious.
As previously stated, intractable dependencies undercut the construction of cogent statements about the uncertainty surrounding parameter estimates. They therefore undermine scientific induction. A modeling approach for $\mathrm{Var}(Y) = E(Y - \mu)(Y - \mu)^\top = \Sigma_Y$ is hence necessary to address this challenge. For context, we briefly summarize some of the most common ones. We mostly reserve commentary until the end of this endeavor. From here, we also omit further considerations on SAOMs since they are more specialized and outside of our immediate scope.

Generalized Least Squares. The generalized least squares (GLS) approach is the most straightforward. Due to its generality, it encompasses many others as special cases. In theory, it sets $V = \Sigma_Y$ (Kariya & Kurata, 2004). Since $\Sigma_Y$ is unknown, however, it is replaced by a user-specified model, which is then estimated. This latter strategy has also been called feasible least squares (FLS) in the literature (Fomby, Johnson, & Hill, 1984). When $V$ is correctly specified, $\hat\beta$ is an asymptotically efficient estimator. When $V$ at least approximates $\Sigma_Y$, it is often asserted to be a feasible attempt at efficiency.

Generalized Estimating Equations. GEEs require a partition of $Y$ into $K$ clusters to exist s.t. any random variable that is an element of one cluster is independent of any of those belonging to others (Ballinger, 2004; Zorn, 2001; Zeger, Liang, & Albert, 1988). Using the same notation as the previous section, for the $K$ identified clusters with $n_i$ elements each s.t. $\sum_{i=1}^K n_i = N$, the vector of estimating equations in this scenario becomes $U_N(\beta) = \sum_{i=1}^K D_i^\top V_i^{-1}(Y_i - \mu_i)$, where each matrix is now defined relative to the random variables in each group. Like with FLS, each $V_i$ is afforded a working theory for estimation. A fitting procedure such as IRWLS is then used to approximate the root of $U_N(\beta)$ to estimate eq. (4.1) under this theory. Although invalidly specifying the working covariance structure does not necessarily impact consistency, it undermines inference for the same reasons previously explored. Robust standard errors, which still require independence between clusters, are often employed for variance estimation in an attempt to avoid this quandary (MacKinnon, Nielsen, & Webb, 2023). This method's reliance on independent clusters, however, has resulted in its labeling as inappropriate for social network regressions (Lyons, 2011; VanderWeele, Ogburn, & Tchetgen, 2012). Even if independent clusters exist, they can safely be assumed to be unknown.

Hierarchical Models. In contrast to GEEs, hierarchical GLMs do not treat the clustered dependence structure as a nuisance. In place of eq. (4.1), the new model of interest is $g\{E(Y \,|\, x, z, \alpha)\} = x\beta + z\alpha$, where $z$ is a known matrix of cluster memberships and $\alpha$ is a mean-zero random variable. A good summary of the assumptions and fitting strategies for this model is available elsewhere (Gardiner, Luo, & Roman, 2009). The most stringent conditions of these models are that, conditional on $(\alpha, x)$, $Y$ becomes an independent sample—and furthermore, that the components of $\alpha$ are mutually independent. In a social network setting, this would translate to the requirement that the vector of relational variables $Y$ is perfectly partitioned by the researcher's theory and that the variables of $\alpha$, which act on each perfectly grouped set of possible relationships, offer no information on one another.

Exponential Random Graph Models. Dependency graphs are an essential component of ERGMs.
A dependency graph $\mathcal{D}$ is any graph s.t. a link exists between two nodes if and only if the random variables they represent are statistically dependent (Schweinberger & Handcock, 2015). Although defining the next object is not strictly required for ERGMs, we note that a linear dependency graph $\mathcal{L}$ can be constructed similarly; it becomes important for generalizing the strategy explored in Section 4.3 to other types of outcome variables. In this case, a link exists between any two nodes if and only if the covariance of the random variables they represent is non-zero. Note that for a set of $N$ random variables, the corresponding (linear) dependency graph is undirected and possesses $\binom{N}{2}$ possible edges. In this paper, these graphs are treated as non-random. However, whenever the $N$ random variables are sampled from a larger population, this is not necessarily the case.

Frank and Strauss (1986) offered exposition on conditional dependency graphs for $Y$. Recall: $Y$ is composed of random edges of the form $Y_{i,j}$ w.r.t. $G$, the main random graph being considered. Hence, $\mathcal{D}$ is composed of possible edges between any of the nodes that represent the random outcome tie variables $Y_{i,j}$ and $Y_{s,t}$. An edge is said to exist in the edge set of $\mathcal{D}$ if and only if the formation of a link from node $i$ to $j$ in $G$ is conditionally dependent on the formation of a tie from node $s$ to $t$, provided knowledge of the remaining relationships in $G$. Common examples are the Bernoulli, dyadic independence, and Markov graphs (Lusher, Koskinen, & Robins, 2013), although other types of dependence graphs have also grown in popularity (Wang, Pattison, & Robins, 2013; Butts, 2006; Block, Stadtfeld, & Snijders, 2019).

A cardinal contribution of Frank and Strauss (1986) was to demonstrate, through the use of the Hammersley-Clifford theorem, that Markov dependency graphs are sufficient for establishing that $\Pr(Y = y)$ is log-linear in form. More specifically, they established, provided a dependency graph of this type, that $\Pr(Y = y)$ is an exponential family expression with sufficient statistics defined by the counts of graph motifs such as edges, stars, and triangles (Lusher, Koskinen, & Robins, 2013; Robins, Pattison, Kalish, & Lusher, 2007). Models that make use of dependency graphs as sufficient conditions for an exponential family probability model more generally, and even beyond the Markov case, are what have come to be known as ERGMs (Robins, Pattison, Kalish, & Lusher, 2007; Snijders, Pattison, Robins, & Handcock, 2006; Cranmer & Desmarais, 2011). The utility of this log-linear form is that it can be used to characterize change statistics w.r.t. the sufficient statistics of the exponential probability model that arises from a theory of dependence. This ultimately supplies an estimable model that is similar to basic logistic regression, although it instead relates the probability that a tie is formed, conditional on all other ties, to the influence of the change statistics.

An inherent strength of ERGMs is their ability to enable inference on graphs as a whole, and to do so while considering structural or node-level phenomena. The nature of the graph itself is not treated as a nuisance. Propositions about tie formation are nested in working notions about how node-level attributes and sub-graph configurations interact to produce a certain probability that a network is observed.
Since ERGMs explicitly model the joint distribution of $Y$ as a single network object, however, inference related to the formation of relationships is invariably nested in specific theories of dependence and graph architecture. As stated in the introduction, this is a strength if the researcher wishes to reject particular hypotheses about the structure of the network or to predict the formation of a tie provided knowledge of all the other ties, also conditional on the aforementioned network theory. However, these goals differ from those served by eq. (4.1). The tie variables of eq. (4.1) are not conditioned on knowledge of all remaining ties. Moreover, they are not predicated upon a specific theory of how a graph emerges or becomes constrained by a user-identified set of motifs.

4.2.1.1 A Word on Limitations

All of the approaches surveyed share one general property. Namely, they replace $\Sigma_Y$ or $\mathcal{D}$ with a detailed theory. The word detailed is used here in the sense that a working version of the dependency structure is blueprinted using a set of posited elementary patterns. An immediate consequence of this is that the inductive cogency of statistical inference is also conditional on the details of the specification. Recall that this is not the case when only the mean model is invalidly specified. We can still treat estimated model coefficients as fecund approximations. This is generally not true for covariance approximations. Confidence sets will not have the correct coverage. Statements associated with them will often be false.

This problem is exacerbated by the fact that both $\mathcal{D}$ and $\Sigma_Y$ are empirically unidentifiable. Although this problem is endemic to the consideration of any dependency graph formed from a single vector of $N$ random variables, the challenge is compounded in settings similar to those of social networks by sheer size. For instance, if there are $N = n(n-1)$ elements to consider, then there are $2^{-1} n(n-1)\{n(n-1) - 1\}$ dependencies between links to consider, which can grow unreasonably large very quickly. For example, for a directed network of just 100 people, $\mathcal{D}$ possesses just over 49 million dependencies that require theoretical consideration. The true structure is hence not only inestimable, but largely impossible to conceptualize. All methods that estimate it do so only in part, and after the researcher has set a vast majority of the parameters to zero a priori under the dictates of a hypothetical map that cannot be empirically substantiated in total.

The number of covariance parameters generated by $n$ nodes, however, is perhaps not as important as the possibility that $Y$ cannot be successfully partitioned into independent or uncorrelated blocks as previously described, or that its dependency structure does not admit a closed-form expression to populate $\mathcal{D}$ or $\Sigma_Y$. Recall: generalized least squares, hierarchical modeling, and generalized estimating equation approaches all rely upon the valid specification of clustering mechanism(s) that enact independent partitions, or on the availability of tractable formulae for $V$. This also applies to cluster-robust standard error approaches. In a large proportion of settings, however, including those pertaining to social networks, partitions or closed-form descriptions are not guaranteed to be known or to even exist: a point that precedes their epistemic accessibility. This problem extends to ERGMs as well.
Not only do they presuppose that statistical dependencies between tie variables are neatly local, they also require the researcher to understand the underlying parts-whole relationships that constitute the network in some defensible manner. Inference pertaining to fitted models is predicated upon this understanding. However, in a circular fashion, these models are also supposed to supply it.

Therefore, a more agnostic approach to the modeling of dependencies can be beneficial. This is especially true when statistical inference, say for a statistic $\hat\beta$, only strictly requires reasoning about the total impact of the statistical dependencies on the asymptotic distribution, and not on how that impact actually arises. We assert that—in many circumstances—it is much easier to conservatively specify and defend a total effect than to justify that a particular reduction serves sufficiently as an approximation to a system of thousands, or even millions, of invisible relationships. This is especially true when these relationships exist in an ineffable terrain of intersecting local and non-local environments and institutions.

4.3 Main Approaches

This section introduces a variance identity for additive statistics that enables a more general approach to covariance modeling and inference. Pertinently, this identity requires reasoning only about the average correlation between random variables in a sample and the average number of random variables correlated with any one observation. After demonstrating how this identity can be used in conjunction with GLMs and GEEs, we illustrate how it can also be used to measure the robustness of a statistical decision when the underlying covariance model is wrong. Finally, for cases s.t. asymptotic normality is not believed to hold, we cover a bootstrapping procedure that can also be used.

Say $\{Z_i\}_{i \in I}$ for $I = \{1, \ldots, n\}$ is a sample of $n$ random variables s.t. $\mathrm{Var}(Z_i) = \sigma^2_{Z_i} < \infty$ for all $i$. Moreover, construct a linear dependency graph $\mathcal{L}$ with node set $I$ and edge set $E$ that possesses an edge $(i, j) \in E$ if and only if $\sigma_{Z_i, Z_j} \neq 0$. Now, consider any two sets of constants $\{a_i\}_{i \in I}$ and $\{b_i\}_{i \in I}$ to define $S_r = \sum_{i=1}^n a_i Z_i$ and $S_t = \sum_{i=1}^n b_i Z_i$. Finally, say $\mu_n$ is the mean degree of $\mathcal{L}$, $\bar\sigma_I = |E|^{-1} \sum_{(i,j) \in E,\, i<j} a_i b_j \sigma_{Z_i, Z_j}$, and $\varphi = \{n^{-1} \sum_{i=1}^n a_i b_i \sigma^2_{Z_i}\}^{-1} \bar\sigma_I$. We define the last two values as an average (weighted) covariance between the outcome variables and their average (weighted) correlation coefficient, respectively.

Now, consider the random vector $wZ$ w.r.t. a $p \times n$ matrix of constants $w$. We identify $\odot$ as a symbol for the Hadamard product (the component-wise multiplication of two matrices) and $V$ as the diagonal matrix composed of the diagonal elements of $\mathrm{Var}(Z)$, i.e., the covariance matrix of $Z$ under the counterfactual assumption of independence. Elsewhere, Sparkes and Zhang (2023) proved the following statement: $\mathrm{Var}(wZ) = wVw^\top \odot \{\mathbf{1} + \mu_n \varphi\}$, where $\mathbf{1}$ is a $p \times p$ matrix of ones and $\varphi$ is a $p \times p$ matrix of average correlation coefficients in conformance with the details described in the previous paragraph. Ultimately, this is the vector version of the variance identity we utilize. Therefore, letting $s$ be arbitrary, it is the case that $\mathrm{Var}(wZ)_{s,s} = \sum_{i=1}^n w_{s,i}^2 \sigma^2_{Z_i} + n \mu_n \bar\sigma_{I,s,s} = \{1 + \mu_n \varphi_{s,s}\} \sum_{i=1}^n w_{s,i}^2 \sigma^2_{Z_i}$, and hence $\{\sum_{i=1}^n w_{s,i}^2 \sigma^2_{Z_i}\}^{-1} \cdot \mathrm{Var}(wZ)_{s,s} = 1 + \mu_n \varphi_{s,s} = \{\mathbf{1} + \mu_n \varphi\}_{s,s}$. Notably, this last expression is recognizable as a generalization of the Moulton factor ($\gamma$) in some situations. A small numeric check of the scalar version of the identity is given below.
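The following self-contained R sketch numerically verifies the scalar version of the identity for a weighted sum $S = \sum_i a_i Z_i$ (i.e., $a = b$ and $p = 1$); the block-diagonal covariance matrix and the weights are arbitrary illustrative choices, not quantities from the paper.

    # Numeric check of Var(S) = {1 + mu * phi} * sum(a^2 * sigma^2).
    set.seed(3)
    blk <- function() crossprod(matrix(rnorm(9), 3, 3))
    Sigma <- matrix(0, 6, 6)
    Sigma[1:3, 1:3] <- blk(); Sigma[4:6, 4:6] <- blk()   # two independent clusters
    n <- 6
    a <- runif(n)
    direct <- as.numeric(t(a) %*% Sigma %*% a)           # Var(S) computed directly
    indep  <- sum(a^2 * diag(Sigma))                     # variance under independence
    E    <- which(upper.tri(Sigma) & Sigma != 0, arr.ind = TRUE)  # dependency edges
    sbar <- mean(a[E[, 1]] * a[E[, 2]] * Sigma[E])       # average weighted covariance
    mu   <- 2 * nrow(E) / n                              # mean degree of the graph
    phi  <- sbar / (indep / n)                           # average weighted correlation
    all.equal(direct, indep * (1 + mu * phi))            # TRUE

The check succeeds for any symmetric covariance matrix, since the identity is purely algebraic; its practical force comes from replacing the unknown $\mu_n$ and $\varphi$ with defensible bounds.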
For instance, if we set $\hat\beta_s = \beta_s + \sum_{i=1}^n w_{s,i}(Y_i - \mu_i)$ w.r.t. the linear model, the factor $\{1 + \mu_n \varphi_{s,s}\}$ is then a more general expression for the variance inflation caused by within-cluster and between-cluster correlation (Moulton, 1986). The variance-covariance matrix identity is also closely related to other forms of the Moulton factor, although the latter is often expressed in terms of traditional matrix products under more specific constraints on the study design and $\Sigma_Z$ (Moulton, 1990; Kloek, 1981; Greenwald, 1983). The usefulness of the identity presented here is precisely its generality, however. From this exposition, we can see that the variance of any random sum—including the variances of estimated regression coefficients—can be decomposed into a theoretical variance under the counterfactual assumption of independence and a corrective factor that summarizes all between- and within-cluster covariation: $\{1 + \mu_n \varphi_{s,s}\}$. Any defensible theory allowing for the specification of both $\mu_n$ and $\varphi_{s,s}$ is then sufficient for this aspect of inference.

4.3.1 Applying the Variance Identity to GLMs and GEEs

Section 4.2 established that each component of $\hat\beta$ is asymptotically additive under mild regularity conditions and that its IRWLS approximation is conditionally additive for finite samples. This makes the application of the identity straightforward. To see this, remember that the sth component of $\hat\beta$ is $\hat\beta_s = \beta_s + \sum_{i=1}^N w_{s,i}(Y_i - \mu_i)$ and hence $\mathrm{Var}(\hat\beta_s) = \{1 + \varphi_{s,s}\mu_N\} \sum_{i=1}^N w_{s,i}^2 \mathrm{Var}(Y_i) = \{1 + \varphi_{s,s}\mu_N\}(D^\top V^{-1} D)^{-1}_{s,s}$, where $V$ is a diagonal matrix. Currently, all popular statistical computing platforms calculate the standard errors of $\hat\beta$ under the assumption of independence. Thus, the identity offers a simple way to use existing software under almost arbitrary dependence conditions. The researcher need only fit the model and correct the standard errors by multiplying them by a well-reasoned $\sqrt{1 + \varphi_{s,s}\mu_N}$ for (asymptotically) valid inference. Although it is not strictly necessary under our strategy, it is often ill-advised to ignore what is justifiably known about the dependence structure.

For GEEs, recall that $U_N(\beta) = \sum_{i=1}^K D_i^\top V_i^{-1}(Y_i - \mu_i)$ with respect to a set of user-identified clusters $\{1, \ldots, K\}$. For the resulting sth component $\hat\beta_s = \beta_s + \sum_{i=1}^K \sum_{j=1}^{n_i} w_{s,i,j}(Y_{i,j} - \mu_{i,j})$, we can consider a new linear dependency graph $\mathcal{L}_{T,s}$ with respect to the statistics $T_{s,i} = \sum_{j=1}^{n_i} w_{s,i,j}(Y_{i,j} - \mu_{i,j})$ s.t. a link exists if and only if the covariance $\sigma_{T_{s,i}, T_{s,j}} \neq 0$. Now, define $V$ as a diagonal matrix composed of the variances of $\Sigma_{\hat\beta}$. Then, under similar regularity conditions, the following remains true for sufficiently large sample sizes, or for all sample sizes w.r.t. the conditional IRWLS approximation: $V = \{1 + \varphi_K \mu_K\}\sigma^2$, where $\{1 + \varphi_K \mu_K\}$ is a conformable diagonal matrix with components corresponding to the new set of linear dependency graphs. Here, $\sigma^2$ is the diagonal of:

$\Big(\sum_{i=1}^K D_i^\top V_i^{-1} D_i\Big)^{-1} \Big\{\sum_{i=1}^K D_i^\top V_i^{-1} \Sigma_{Y_i} V_i^{-1} D_i\Big\} \Big(\sum_{i=1}^K D_i^\top V_i^{-1} D_i\Big)^{-1}$

This formulation allows us to circumvent the need for independent or uncorrelated clusters, which has been seen as this method's cardinal weakness w.r.t. social network analysis. Moreover, it allows us to partially account for at least one dimension of the unknown dependency structure with an imperfect partition. This approach requires the consistent estimation of $\sigma^2$. The conditions for this are explored elsewhere (Sparkes & Zhang, 2023). Essentially, in addition to the standard regularity conditions for variance estimation (Zeger, Liang, & Albert, 1988; MacKinnon, Nielsen, & Webb, 2023), we typically require that $\mu_K$ be asymptotically finite, although this is not strictly required. Therefore, a well-chosen and defensible $\sqrt{1 + \varphi_{K_{s,s}}\mu_K}$ correction factor is also sufficient in the GEE case, except that the researcher now needs to reason about the average dependency and correlation between the weighted cluster statistics. In practice, we suggest reasoning about one bounding pair $(\mu_K, \varphi_K)$ that is componentwise greater than or equal to $(\mu_K, \varphi_{K_{s,s}})$ for all $s$. We also note that the GLM correction previously described is a special case of this one. We explored it first, nevertheless, since it is simpler and provides more intuition. A minimal sketch of the GLM version of the correction follows.
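Continuing the toy fit from Section 4.2, the following sketch applies the direct correction to independence-based standard errors; the bound on $\mu_N$ mirrors the power-law bound used later in Section 4.4, while the value of $\varphi$ is a purely hypothetical specification.

    # Correcting independence-based GLM standard errors by sqrt(1 + phi * mu_N);
    # `fit` and `pairs` continue the toy example above, and phi is assumed.
    N    <- nrow(pairs)                 # number of tie variables
    mu_N <- sqrt(N) - 1                 # an assumed bound on the mean dependency count
    phi  <- 0.01                        # an assumed bound on the average correlation
    se   <- summary(fit)$coefficients[, "Std. Error"] * sqrt(1 + phi * mu_N)
    cbind(estimate = coef(fit),
          lower = coef(fit) - qnorm(0.975) * se,
          upper = coef(fit) + qnorm(0.975) * se)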
4.3.2 A Measure of Test Robustness

We are aware that some researchers might be uncomfortable with the specification of a corrective factor. There is a short answer to this discomfort. Since $\mathrm{Var}(\hat\beta_s) = \{1 + \varphi_{s,s}\mu_N\} \sum_{i=1}^N w_{s,i}^2 \sigma^2_{Y_i}$ WLOG, a user-specified model $\Sigma_M$ accomplishes precisely this due to the fact that $(\Sigma_M)_{s,s} = \{1 + \varphi_{M_{s,s}}\mu_{M_s}\} \sum_{i=1}^N w_{s,i}^2 \sigma^2_{Y_i}$ implicitly. Invariably, the specification of a model is the specification of a corrective factor. Ultimately, this process seems preferable since the theoretical $\varphi_{M_{s,s}}\mu_{M_s}$ can be estimated from the sample. However, since the true $\Sigma_Y$ is unknown and inestimable, the specification of a model still reduces to a subjective—albeit hopefully informed—choice that inevitably omits dependencies. The specification of the corrective factor is no different—and, as we have seen, it always supplements the specification of a theory anyway, even if that theory is mutual independence.

Nevertheless, we show that the identity has another straightforward application for sensitivity analyses in addition to its use as a vehicle for variance modeling. This additional application allows researchers to determine just how much variation can be missed before a different statistical decision must be made. Consider a two-tailed hypothesis test with level $\alpha \in (0,1)$ s.t. a researcher deems it significant insofar as the confidence set excludes the null value. Provided this setup, we can reason about the robustness of the test (and confidence set) by considering the distance between an interval endpoint and the null value. To this end, we consider a conservative confidence set that provides at least $1 - \alpha$ coverage. By definition, a conservative hypothesis test remains statistically significant insofar as adjustments to the interval endpoints still exclude the null value. We use this fact in conjunction with the variance identity to give this distance a more salient meaning.

Let $c_{\alpha/2}$ be a critical value. For two-sided tests with the null hypothesis that $\beta_s = 0$ for an arbitrary $s$, the test retains its significance WLOG for $\hat\beta_s > 0$ when $\hat\beta_s - c_{\alpha/2}\sqrt{1 + \varphi_{s,s}\mu_N}\sqrt{v} > 0$, where $v$ is the (estimated) variance. It then follows that if $(c^2_{\alpha/2} v)^{-1}\hat\beta_s^2 - 1 > \varphi_{s,s}\mu_N$, then the test would remain statistically significant. This implies that the statistic $\mathcal{K} = (c^2_{\alpha/2} v)^{-1}\hat\beta_s^2 - 1$ can be interpreted as a measure of robustness against missed variability, since a significant test remains so insofar as $\varphi_{s,s}\mu_N$—the product of the mean dependency and correlation—is strictly less than $\mathcal{K}$. This statistic is easily extended to one-tailed tests or other null values. If the researcher is comfortable with the observed value of $\mathcal{K}$ and interval estimation is not the goal, then a corrective factor is not necessary. The computation of $\mathcal{K}$ from standard GLM output is illustrated below.
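As an illustration, the following lines compute $\mathcal{K}$ for each coefficient of the hypothetical toy fit from above; `qnorm(0.975)` plays the role of $c_{\alpha/2}$ for a two-sided test at $\alpha = .05$.

    # The robustness statistic K = beta^2 / (c^2 * v) - 1 for each coefficient;
    # a significant decision is robust to any dependence with phi * mu_N < K.
    v <- summary(fit)$coefficients[, "Std. Error"]^2   # independence-based variances
    K <- coef(fit)^2 / (qnorm(0.975)^2 * v) - 1
    K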
Of course, discerning robustness by considering the distance between the interval endpoints and the null value is not a novel approach. For instance, one can set $\gamma_s = \sqrt{1 + \varphi_{s,s}\mu_N}$ to accomplish the same ends without consideration of the right-hand side, i.e., without the semantic framework offered by the mean degree and average correlation. However, framing it in this way provides more intuition. In many modeling circumstances, the researcher has at least some prior information or beliefs pertaining to $\mu_N$ and $\varphi_{s,s}$. Moreover, it also allows the researcher to condition on $\varphi_{s,s}$ or $\mu_N$ for additional sensitivity analyses. For instance, if a researcher believes that $\mu_N = N - 1$ (that all outcome variables are correlated), then we arrive at an upper bound on the average correlation w.r.t. the preservation of a significant test result: $\varphi_{s,s} < (N-1)^{-1} \cdot \mathcal{K}$.

4.3.3 A Bootstrapping Strategy

The approach of the previous section assumes that $T_s = V^{-1/2}(\hat\beta_s - \beta_s) \xrightarrow{d} N(0, 1 + \mu_N\varphi_{s,s})$ at no loss of generality in our context, where $V = \sum_{i=1}^N w_{s,i}^2 \sigma^2_{Y_i}$ here. Forming conservative $1 - \alpha$ confidence sets entails a specification of $\mu_N\varphi_{s,s}$ for correction and a consistent plug-in estimator for the identifiable portion of the variance, $V$. However, if the social network ties involved are densely dependent, i.e., the structure of the network emerges from non-local forces in addition to local ones, this strategy is invalid. An immediate fix is to place fewer restrictions on the asymptotic distribution of $T_s$. For instance, if $T_s$ is unimodal, then the same strategy applies, except one must replace normal critical values with $2 \cdot 3^{-1} \cdot \alpha^{-1/2}$. This follows from the Vysochanskij–Petunin inequality. Using $\alpha^{-1/2}$ in accordance with Chebyshev's inequality is universally valid, but even more conservative. Still, these options presuppose that a defensible theory of $\mu_N\varphi_{s,s}$ exists.

If it does not, then the Hoeffding bootstrap is another option for cogent inference with GLMs and GEEs in a social network context (Sparkes, Garcia, & Zhang, 2023). This type of bootstrap is theoretically distinct from traditional ones. Typically, a bootstrapping procedure conditions on the empirical cumulative distribution function $\hat F_n(y)$ to approximate the sampling distribution of a statistic that is a functional of the true marginal distribution function(s). Like GLMs, the classic approach relies upon the assumption of mutual independence. Many extensions exist for dependent observations, such as the moving block bootstrap or bootstraps for network statistics (Lahiri, 2003, 1993; Snijders, Borgatti, et al., 1999). However, all of them rely upon a re-sampling procedure that covers or closely approximates the underlying dependency structure. The problem with this is that a feasible theory for re-sampling will usually imply a feasible theory for the specification of a correction. If the latter does not exist, then a suitable re-sampling theory probably does not either. Hence, typical bootstraps for dependent observations are also not reliable solutions for the circumstance currently being considered.

A more thorough introduction to the Hoeffding bootstrap is available elsewhere (Sparkes, Garcia, & Zhang, 2023). It has two essential requirements: 1) that the statistic being bootstrapped is bounded, and 2) that the dependencies between the outcome variables are not too extreme when they are copious.
In essence, the Hoeffding bootstrap works by producing a set of $B$ re-sampled statistics of various dimensions under the auspices that, as $N \to \infty$ and $B \to \infty$, the extremes of this bootstrap distribution are consistently estimable. These estimated extremes are then used as plug-ins for Hoeffding's inequality. A pivotal difference in this process is that the re-sampled outcome variables are treated as random. Stated otherwise, the Hoeffding bootstrap does not condition on the observed sample. Moreover, it does not attempt to emulate the true sampling distribution. The intuition behind this approach is that the support of the probability distribution governing the bootstrapped functions contains the support of the statistic of interest by construction. Therefore, the extremes of the bootstrap distribution can be used to form confidence sets with almost sure coverage. Theoretically, the added factor that results from using Hoeffding's inequality is not necessary. However, since extreme order statistics typically possess a sub-optimal convergence rate and it is not feasible to use an arbitrarily large $B$, the added factor helps to compensate for the bias of the sample bootstrap range that results from small to moderate $B$.

A more precise definition of extreme dependence in this context requires some new notation and setup. Let $T_0 : \mathbb{R}^N \to \mathbb{R}$ be the original statistic of the $N$ outcome variables. An arbitrary re-sampling of $N$ of the original outcome variables with replacement will produce a statistic $T_k : \mathbb{R}^l \to \mathbb{R}$ s.t. $l \leq N$. Other sampling schemes are possible. However, they are not considered here. To illustrate this scheme, consider $T_0 = N^{-1} \sum_{i=1}^N Y_i$. Then, in the toy circumstance s.t. $Y_1$ is re-sampled $N$ consecutive times for the kth re-sampling, $T_k = Y_1$. Now, let $\{T_k\}_{k \in K}$ be the set of statistics being considered, where $K = \{0, 1, 2, \ldots, B\}$ and $0 \in K$ means that $T_0$ has been included. As previously hinted, and at no loss of generality, we require $T_{(K)}$, the maximum order statistic of the elements of $\{T_k\}_{k \in K}$, to converge in probability to a constant $M$ s.t. $M \geq M_0$, where $M_0$ is the maximum of the support of $T_0$. Again, since $T_0 \in \{T_k\}_{k \in K}$, it is true that $M_0 \leq M$ by construction.

Next, consider the cumulative distribution function of $T_{(K)}$: $\Pr(T_{(K)} \leq t)$. This function can always be decomposed into the following product:

$\Pr(T_B \leq t \,|\, T_{B-1} \leq t, \ldots, T_0 \leq t) \cdot \Pr(T_{B-1} \leq t \,|\, T_{B-2} \leq t, \ldots, T_0 \leq t) \cdots \Pr(T_0 \leq t)$

When dependencies between outcome variables are extreme, many more of the factors of this product trivially evaluate to one. For example, consider a set of strictly non-negative outcome variables $\{Y_1, Y_2, Y_3\}$ and again use the arithmetic average as a basis function: $T_0 = 3^{-1} \sum_{i=1}^3 Y_i$. Temporarily assume $Y_2$ and $Y_3$ are strongly dependent in the sense that $Y_3 \leq Y_2$ with probability one, and condition on the event that $T_1 = 3^{-1} Y_1 + 3^{-1} \cdot 2 Y_2 \leq t$. Since $T_0 = T_1 + 3^{-1}(Y_3 - Y_2)$, it follows that $T_0 \leq T_1$ and hence that $\Pr(T_0 \leq t \,|\, T_1 \leq t) = 1$ trivially. If $Y_2$ and $Y_3$ are only weakly to moderately dependent, however, it is false that $\Pr(T_0 \leq t \,|\, T_1 \leq t) = 1$ in this sense. If dependencies are extreme and copious, then it is possible for the product of the conditional functions to tend towards a non-zero limit. For the Hoeffding bootstrap to work, we require a sufficient number of the $1 \leq Q(B) \leq B + 1$ factors of $\Pr(T_{(K)} \leq t)$ to be strictly less than unity when $t < M$ and $N$ is sufficiently large. If $Q(B) \to \infty$ as $B \to \infty$, then this is sufficient to establish that $T_{(K)}$ converges almost surely to $M$ as $B \to \infty$ WLOG. Therefore, an extreme and copious system of dependencies is one s.t. $Q(B)$ fails to diverge. A minimal sketch of the bootstrap mechanics follows.
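Mechanically, the procedure is simple. The following self-contained R sketch implements our reading of it for a sample mean, using the rule-of-thumb penalized interval discussed below; the default $B$ and the toy data are illustrative only, and a full treatment is available in Sparkes, Garcia, and Zhang (2023).

    # A sketch of the Hoeffding bootstrap for the sample mean. The observed
    # statistic T0 is included in the re-sampled set by construction, and the
    # interval uses the rule-of-thumb penalty described in the text.
    hoeffding_boot <- function(y, B = 5000, alpha = 0.05) {
      N  <- length(y)
      T0 <- mean(y)                                        # the observed statistic
      Tk <- c(T0, replicate(B, mean(sample(y, N, replace = TRUE))))
      half <- (max(Tk) - min(Tk)) * sqrt(log(2 / alpha) / 2)
      c(estimate = T0, lower = T0 - half, upper = T0 + half)
    }
    hoeffding_boot(rbinom(1000, 1, 0.3))                   # toy usage on bounded outcomes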
On this topic, we again cite the caveat that it is not possible to empirically verify dependency structures. It then follows, of course, that one cannot empirically verify that $Q(B)$ diverges. Nevertheless, this condition is mild in comparison to standard assumptions. The $Q(B)$ divergence condition will probably apply to network data insofar as the tie variables do not cause one another in abundance. More generally, it is likely safe to assume that $Q(B)$ diverges in a network setting for reasonable $T_0$ if knowledge of one relationship in the network does not imply, with probability one, that an asymptotically non-negligible proportion of others adopt a certain state as well. Methods already in popular employ implicitly constrain dependencies in this way. Consequently, the use of the Hoeffding bootstrap possesses at least nomological consistency.

Insofar as the dependency structure is not extreme and copious in the manner described, we can then assert that $ET_0 \in [T_{(1)}, T_{(K)}]$ with probability one for some arbitrarily large $N$ and $B$. This interval represents a non-penalized, almost sure confidence set for unbiased statistics. For finite $N$ and $B$, however, it is more realistic to assert that a confidence set of the form $T_0 \pm \{T_{(K)} - T_{(1)}\} \cdot \sqrt{2^{-1}\log(2/\alpha)}$ possesses at least $1 - \alpha$ coverage, at least as a rule of thumb (Sparkes, Garcia, & Zhang, 2023). Before concluding this section, we also point out that one can reasonably regain some efficiency from the penalization by using $\sqrt{6^{-1}\log(2/\alpha)}$ instead, provided the distribution of $T_0$ becomes sum-symmetric asymptotically. A sum-symmetric distribution function is one s.t. the areas above and below the cumulative distribution function are equal. Sum-symmetry of a distribution function is also relatively safe to suppose for large enough sample sizes if $T_0$ is an additive statistic that converges in probability to a constant, or if it is believed that its asymptotic distribution becomes relatively symmetric.

4.4 An Application

In this section, we present an analysis of a directed friendship network measured from a Los Angeles area high school. The data originated from a larger study on smoking diffusion, which sampled the friendship networks of five different Los Angeles schools at four different time points (de la Haye, Shin, Vega Yon, & Valente, 2019). Students were provided a roster of all other students in their grade cohort and were asked to identify up to 19 close friends. Additional information on the design and the surveys employed is available in the report of the original study.

The nodal covariates accessed for this analysis were Hispanic ethnicity (Hispanic: 1; Non-Hispanic: 0), History of Smoking (Hx of Smoking: 1; No Hx: 0), Sex (Female: 1; Male: 0), History of Drinking (Hx: 1; No Hx: 0), GPA (1-5, increments of .5), and Housing Status (Owns Home: 1; Rents: 2; Unsure: 3). In place of Housing Status and GPA, however, we used the following two transformations: Above Average Grades (GPA > 3.78: 1; 0 otherwise) and Owns (Owns Home: 1; 0 otherwise). Complete case data from the fourth wave (Spring, 12th grade) of four schools was employed, resulting in a nodal set of n = 776 students and N = 152,238 link variables. Only the fourth wave of data was used since we wished to focus on contemporaneous ties for this demonstration. A hypothetical sketch of the corresponding edge-level design follows.
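Anticipating the regression coding described next, the following hypothetical R sketch assembles an edge-level design of ego, alter, and homophily terms from a nodal covariate table; `nodes` and the adjacency matrix `adj` are illustrative stand-ins for the study data, not the data themselves.

    # Hypothetical edge-level design construction from nodal covariates.
    set.seed(4)
    n <- 50
    nodes <- data.frame(smoke = rbinom(n, 1, 0.2), sex = rbinom(n, 1, 0.5))
    adj <- matrix(rbinom(n * n, 1, 0.05), n, n); diag(adj) <- 0
    pairs <- subset(expand.grid(i = 1:n, j = 1:n), i != j)
    edges <- data.frame(
      y            = adj[cbind(pairs$i, pairs$j)],   # 1 if ego i named alter j
      ego_smoke    = nodes$smoke[pairs$i],
      alter_smoke  = nodes$smoke[pairs$j],
      # the smoking homophily term is one only if both report a history
      homoph_smoke = as.integer(nodes$smoke[pairs$i] == 1 & nodes$smoke[pairs$j] == 1),
      homoph_sex   = as.integer(nodes$sex[pairs$i] == nodes$sex[pairs$j])
    )
    head(edges)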
For the regression analysis, we created two duplicates of each binary covariate: one that identified values for the ego ($i$) and another for the values of the alter ($j$). For example, for link $Y_{i,j}$, we created an ego Hispanic ethnicity variable that equaled one if the respondent associated with the first index of the link, $i$, identified as Hispanic, and zero otherwise. We constructed a similar Hispanic ethnicity variable for the alter index, $j$. Moreover, with two exceptions, we created binary homophily variables that equaled one if the corresponding ego and alter covariate values possessed the same value. The homophily covariates for smoking and drinking history served as the two exceptions. They equaled one only if the alter and ego both reported a history of the health behavior. This was done to better assess the impact of positive similarity on friendship identification. Here, the effect of having a smoking history was our main focus, but we were also interested in the association between friendship identification and other homophily terms.

4.4.1 Methods

Logistic regression was used to estimate the associations of interest. To address confounding, we adjusted for all other variates previously specified. Moreover, we also included indicators for a student's school in the model as a proxy to adjust for environmental influences. Confidence sets were constructed using both direct correction and the penalized Hoeffding bootstrap of the form $T_0 \pm \{T_{(K)} - T_{(1)}\} \cdot \sqrt{6^{-1}\log(2/\alpha)}$. Simple random sampling with replacement and $B = 5{,}000$ draws characterized the bootstrapping procedure. All sets used an $\alpha = .05$ level.

For direct correction, our strategy was to approximate a bound on $\varphi_{s,s}$ for an arbitrary $s$, say $\varphi$, by selecting a few clustering mechanisms that would hypothetically produce an estimable, strictly positive exchangeable correlation coefficient larger than each $\varphi_{s,s}$. We then took the maximum of these estimated coefficients to use as the correction. As a baseline assumption, we also posited that the degree distribution of $\mathcal{D}$, the dependency graph for the tie variables conditional on $x$, was bounded by a power law. This assumption suggested the following feasible bound: $\mu_N \leq \sqrt{N} - 1$. Recall that a basic assumption explored in Section 4.2 posits that, conditional on $x$, the dependency structure of the ties is localized enough as to admit a central limit theorem. For this reason, it is safe to assert that $\mu_N$ is actually asymptotically finite. Hence, even when the bound placed on $\varphi_{s,s}$ is invalid, it will still be the case that $\varphi_{s,s}\mu_N \leq \varphi(\sqrt{N} - 1)$ for large enough $N$, since both $\varphi_{s,s}$ and $\varphi > 0$ are also bounded with probability one. Nevertheless, we also note that because $\varphi_{s,s} = \{N^{-1}\sum_{i=1}^N w_{s,i}^2\sigma^2_{Y_i}\}^{-1} \cdot |E|^{-1}\sum_{i<j} w_{s,i}w_{s,j}\sigma_{i,j}$ is a sum of both negative and positive terms, a non-weighted $\varphi$ of strictly positive terms that is believed to bound $\{N^{-1}\sum_{i=1}^N \sigma^2_{Y_i}\}^{-1} \cdot |E|^{-1}\sum_{i<j}\sigma_{i,j}$ is still a reasonable choice.

Available scholarship on health behaviors and the formation of high school friendships also informed our choices for clustering and $\mu_N$. Past research into the nature of high school social networks has indicated that they possess a high degree of assortative mixing by race, ethnicity, gender, age, and social class (Block & Grund, 2014; Goodreau, Kitts, & Morris, 2009; Shrum, Cheek Jr, & Hunter, 1988; Smetana, Campione-Barr, & Metzger, 2006).
Since adolescents of this age typically branch off from parental attachments as a developmental milestone, they require more intimate attachments to their peers. This has been posited to create more mutuality of ties and triad closures (McFarland, Moody, Diehl, Smith, & Thomas, 2014; Prisbell & Andersen, 1980). This dynamic has also been noted to be present in the context of shared health behaviors (Valente, Gallaher, & Mouttapa, 2004). For instance, there is ample evidence that shared smoking behaviors and alcohol use also encourage the formation of ties (Cheadle, Walsemann, & Goosby, 2015; Rice, Donohew, & Clayton, 2003; Valente, 2003).

For the specification of $\mu_N$, the ecological context of tie formation in the spirit of Bronfenbrenner (1992) was also important to consider. McFarland, Moody, Diehl, Smith, and Thomas (2014) noted that, although there is plentiful evidence that student-level assortative processes by demographic similarity are a pivotal factor in the formation of ties, variations in the presentation and extent of homophily and the differentiated patterns of empirical network expression give credence to ecological factors as explanatory forces. Building upon Crosnoe, Johnson, and Elder Jr (2004), they posited that school size, demographic composition, and educational climate interact with micro-level processes to determine social formation. Larger school environments are more conducive to the proliferation of outside forces of assortment, all of which can induce complicated patterns of tie dependence.

Since this exploration conditioned on school, Hispanic ethnicity, self-reported sex, a proxy for social status, smoking and drinking behaviors, and age, it is feasible to assert that these dependencies were nullified to some extent. However, since there was no other information available about student affiliations or outside environments, we made the conservative assumption that omitted variables induced a mosaic of tie dependencies beyond what was specified and known. This also seemed reasonable to assert since high schools are often defined by a handful of thematic social groups (Eckert, 1989). These social groups, which are nested in wider societal institutions and environments, would have created a non-trivial joint distribution. For these reasons, it seemed cogent to assert the power law bound, which is conservative enough to cover unknown clustering under our baseline assumptions and for large enough $N$.

Although we had the option of using clustering algorithms on past waves of the friendship networks to choose a working partition, we instead opted to first cluster via the propensity for an ego and alter to share the same self-identification of sex and Hispanic ethnicity, provided knowledge of all ego, alter, and homophily terms for smoking history, drinking history, home ownership, and grades, as well as knowledge of the school attended. Propensity scores are the (estimated) probabilities of an event of interest, provided a set of known covariate values (Rubin, 2010; Thoemmes & Kim, 2011). Rosenbaum and Rubin (1983) showed that they are the coarsest summary of the information encoded in the covariate values with respect to the events of interest. In our context, they represent the propensity for homophily to occur in some salient dimension, provided knowledge about other shared properties. We see this as a reasonable strategy for imperfectly capturing some dimensions of dependence. Although graph clustering mechanisms can also achieve this, they work with realized graphs that cannot be conflated with the associated dependency graphs. Propensity scores, on the other hand, can offer a refined score on the propensity of similarity
Although graph clustering mechanisms can also achieve this, they work with realized graphs that cannot be conflated with the associated dependency graphs. Propensity scores, on the other hand, can offer a refined score on the propensity of similarity 192 w.r.t. certain social dimensions that are directly associated with the underlying dependency graph. We chose propensity scores for an ego and alter to share the same self-identification of sex and Hispanic ethnicity since sex and ethnic identification are important for assortative mixing. To supplement this, we also chose to cluster via propensity scores for smoking homophily, conditional on all other covariates except those related to smoking status. We also clustered by ego identity for comparison. We used logistic regression to estimate the propensity scores. All exchangeable correlation coefficients were estimated using the gee package in R (Carey, Lumley, Moler, Ripley, & Ripley, 2022). Ceiling{N/log(N)} = 12,758 strata were formed using the quantiles of the estimated scores. Specifying K = ceiling{N/log(N)} strata guarantees that K → ∞ with sample size and that the variances of the cluster statistics are consistently estimable under our regularity conditions (Sparkes & Zhang, 2023). We designated φGH, φS, and φE as the exchangeable correlation coefficients for the gender-ethnic, smoking history, and ego strategies respectively. Letting φ = max(φGH,φS,φE), we then set the universal correction to the model-derived standard errors calculated under the counterfactual assumption of mutual independence to q 1+φˆ · √ N −1. Lastly, we also calculated K values to gauge the robustness of statistical decisions under the assumption of mutual independence. 4.4.2 Results The exchangeable correlation coefficients were estimated as follows: φˆGH = 0.0045,φˆ S = 0.0096, and φˆ E = 0.0038. Therefore, φ was set to .0096 because it was the maximum. Consequently, we set q 1+φˆ · √ N −1 to 2.17. The results of the regression are available in Table 4.1 below. 193 Table 4.1: Friendship Identification Variable Estimate SE CIˆ N CIˆ H KN KU Intercept -6.42 0.12 (-6.95, -5.9) (-7.15, -5.7) 698.89 301.48 Smoking Hx. Alter -0.23 0.08 (-0.56, 0.11) (-0.71, 0.26) 1.14 — Ego -0.10 0.07 (-0.41, 0.21) (-0.55, 0.35) — — Homophily 0.44 0.16 (-0.24, 1.13) (-0.49, 1.38) 0.99 — Drinking Hx. 
Alter -0.04 0.07 (-0.32, 0.24) (-0.4, 0.32) — — Ego 0.11 0.06 (-0.15, 0.38) (-0.29, 0.51) — — Homophily 0.11 0.11 (-0.36, 0.58) (-0.46, 0.68) — — Hispanic Alter -0.01 0.08 (-0.34, 0.32) (-0.48, 0.46) — — Ego -0.24 0.08 (-0.57, 0.09) (-0.7, 0.22) 1.48 0.07 Homophily †⋆ 1.21 0.07 (0.91, 1.51) (0.78, 1.64) 74.09 31.45 Sex Alter 0.15 0.05 (-0.06, 0.37) (-0.16, 0.47) 1.43 0.05 Ego -0.06 0.05 (-0.28, 0.15) (-0.34, 0.21) — — Homophily †⋆ 1.16 0.05 (0.95, 1.38) (0.89, 1.44) 140.05 59.96 Owns Home Alter 0.00 0.05 (-0.21, 0.21) (-0.28, 0.29) — — Ego 0.12 0.05 (-0.09, 0.33) (-0.15, 0.39) 0.64 — Homophily 0.10 0.05 (-0.1, 0.3) (-0.16, 0.36) 0.26 — High Grades Alter 0.16 0.05 (-0.04, 0.35) (-0.11, 0.42) 2.12 0.35 Ego 0.13 0.05 (-0.07, 0.32) (-.14, .39) 0.97 — Homophily † 0.26 0.04 (0.08, 0.45) (-0.01, 0.54) 8.47 3.09 School †⋆ 2 0.51 0.06 (0.25, 0.78) (0.13, 0.9) 17.08 6.81 3 0.22 0.07 (-0.06, 0.5) (-0.19, 0.63) 1.94 0.27 4 0.49 0.07 (0.21, 0.77) (0.13, 0.85) 13.08 5.08 The ’†’ notation signifies statistical significance after the theoretically derived conservative correction under the assumption of normality with the specified correction, while ’⋆’ denotes statistical significance according to the 194 penalized Hoeffding bootstrap. The table also supplies the corresponding confidence sets: CIˆ N and CIˆ H for the normal and Hoeffding procedure respectively. Finally, KN supplies the K statistic under the assumption of asymptotic normality, while KU provides the same statistic under only the assumption that each estimator possesses a unimodal distribution. All but one of the following confidence intervals are from the Hoeffding bootstrap with penalization. According to this analysis, students who both identified as Hispanic or non-Hispanic had 3.36 the odds of friendship identification versus the baseline (≥ 95% CI: 2.18, 5.15), while students of the same self-reported sex possessed 3.21 the odds of friendship identification (≥ 95% CI: 2.43, 4.21). School environment also appeared to be associated with friendship identification after adjusting for the other variables. Students in the second and fourth schools had 1.67 (≥ 95% CI: 1.13, 2.46) and 1.63 (≥ 95% CI: 1.42, 2.33) times the odds of identifying friends respectively in comparison to those of the first school. Homophily of grade status was also estimated to be statistically significantly associated with friendship identification, but only according to the nonpenalized Hoeffding bootstrap. Conditional on all other variates, students with the same classification of scholastic achievement had 1.16 the odds of friendship identification (≥ 95% CI: 1.09, 1.56) in comparison to those who did not. Notably, the conclusions of the conservative normal approximation and the Hoeffding procedure were largely concordant. The smallest KU value of interest was observed to be .05. Conditional on the design matrix, this signified that an arbitrary φs,sµN had to be bounded by .05 for each statistically significant result to remain robust against an invalid specification of the dependency model, at least insofar as each statistic possessed a unimodal distribution. This bound improved to .26 under the assumption of normality. 195 Finally, a shared history of smoking was estimated to increase the odds of friendship identification by a factor of 1.56 (robustly statistically significant up until φSmokeHx.,Homophily ≤ µ −1 N · .99 under the assumption of normality). 
Similarly, a positive smoking history on the part of the alter alone was estimated to decrease the odds of friendship identification by a factor of .79 (robust up until $\varphi_{\mathrm{SmokeHx.,Alter}} \leq \mu_N^{-1} \cdot 1.14$ under normality), but there was insufficient evidence to deem this estimate statistically significantly different from zero by conservative methods. Overall, there was strong statistical evidence that a school's environment and homophily of Hispanic ethnicity and sex are associated with friendship identification and thus social network formation. Although there was some evidence that a shared history of smoking is positively associated with friendship identification and, by proxy, tie formation, this evidence was ultimately insufficient as a basis for strong inferences on this matter.

A fundamental limitation of this study was that it did not collect information on student associations, both school-wide and in the community. Doing so could have at least increased our readiness to posit a stricter bound on $\mu_N$, conditional on these added associations. As it stands, under our possibly conservative estimate for $\mu_N$, we would require $\varphi_{\mathrm{SmokeHx.,Homophily}}$ to be bounded by .0026, and only under the assumption that normality also holds, to trust in a decision of statistical significance. Although possible, there was obviously room for a great degree of doubt. By definition, then, an induction to the conclusion that a true association exists could not be strong in character.

4.5 Conclusion

We have demonstrated that GLMs—and GEEs more generally—can be validly and cogently utilized for social network analysis, and in the context of unknown or intractable dependency structures more generally. We accomplished this with a variance identity for additive statistics that is closely related to the Moulton factor, and with a different type of bootstrapping process that is fairly robust to statistical dependencies in the absence of a particular theory. Since the true dependency structure is unknown, inestimable, and in all likelihood intractable, we believe that these strategies have merit. Justifying a choice for two values that summarize the impact of dependencies on the variance of a statistic is a much less Herculean task than cogently specifying the minutiae of a system of dynamic and mostly invisible relations, especially when the number of those relations diverges to infinity at a rate much faster than sample size. More generally, even if a researcher does not wish to directly specify these constants, they minimally provide a route to gauging the robustness of statistical decisions. Lastly, although not assumption-free, the Hoeffding bootstrapping procedure provides an alternative method for statistical inference that also does not require the specification of a dependency or network structure.

Although a loss of efficiency is the cost of these strategies, we argue that this is no real cost at all in many circumstances. Efficiency can only be achieved if the dependency model is validly specified. This does not occur outside of toy circumstances. Of course, models that presuppose a particular dependency structure will be preferable in some circumstances. The same goes for more liberal methodologies in general, due to the fact that failure to detect a salient association or effect can sometimes pose a greater real-world risk. Nevertheless, these methods provide a feasible route to the analysis of marginal tie formation and—at the minimum—a tool for sensitivity analysis.
This feasibility is important because the analysis of marginal ties in the absence of any knowledge of network structure largely defines the study of covert networks. We imagine that the methods outlined here will especially bear fruit in this domain.

References

Nelder, J. A., & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370–384.
McCullagh, P., & Nelder, J. A. (2019). Generalized linear models. Routledge.
Lusher, D., Koskinen, J., & Robins, G. (2013). Exponential random graph models for social networks: Theory, methods, and applications. Cambridge University Press.
Robins, G., Pattison, P., Kalish, Y., & Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2), 173–191.
Harris, J. K. (2013). An introduction to exponential random graph modeling (Vol. 173). Sage Publications.
Snijders, T. A., Pattison, P. E., Robins, G. L., & Handcock, M. S. (2006). New specifications for exponential random graph models. Sociological Methodology, 36(1), 99–153.
Wang, P., Pattison, P., & Robins, G. (2013). Exponential random graph model specifications for bipartite networks—a dependence hierarchy. Social Networks, 35(2), 211–222.
Butts, C. T. (2006). Cycle census statistics for exponential random graph models.
Snijders, T. A. (2017). Stochastic actor-oriented models for network dynamics. Annual Review of Statistics and Its Application, 4, 343–363.
Snijders, T. A. (1996). Stochastic actor-oriented models for network change. Journal of Mathematical Sociology, 21(1-2), 149–172.
Leifeld, P., & Cranmer, S. J. (2022). The stochastic actor-oriented model is a theory as much as it is a method and must be subject to theory tests. Network Science, 10(1), 15–19.
Leifeld, P., & Cranmer, S. J. (2019). A theoretical and empirical comparison of the temporal exponential random graph model and the stochastic actor-oriented model. Network Science, 7(1), 20–51.
Cranmer, S. J., & Desmarais, B. A. (2011). Inferential network analysis with exponential random graph models. Political Analysis, 19(1), 66–86.
Robins, G., Snijders, T., Wang, P., Handcock, M., & Pattison, P. (2007). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2), 192–215.
Goodreau, S. M. (2007). Advances in exponential random graph (p*) models applied to a large social network. Social Networks, 29(2), 231–248.
An, W. (2016). Fitting ERGMs on big networks. Social Science Research, 59, 107–119.
Snijders, T. A., et al. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3(2), 1–40.
Amati, V., Schönenberger, F., & Snijders, T. A. (2015). Estimation of stochastic actor-oriented models for the evolution of networks by generalized method of moments. Journal de la Société Française de Statistique, 156(3), 140–165.
Breslow, N. E. (1996). Generalized linear models: Checking assumptions and strengthening conclusions. Statistica Applicata, 8(1), 23–41.
Agresti, A. (2015). Foundations of linear and generalized linear models. John Wiley & Sons.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 1–25.
Amemiya, T. (1985). Advanced econometrics. Harvard University Press.
Yuan, K.-H., & Jennrich, R. I. (1998). Asymptotics of estimating equations under natural conditions. Journal of Multivariate Analysis, 65(2), 245–260.
Sparkes, S., & Zhang, L. (2023). Properties and deviations of random sums of densely dependent random variables [Preprint available at https://arxiv.org/abs/2310.11554].
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society: Series B (Methodological), 46(2), 149–170.
Berk, K. N. (1973). A central limit theorem for m-dependent random variables with unbounded m. The Annals of Probability, 352–354.
Hoeffding, W., & Robbins, H. (1948). The central limit theorem for dependent random variables.
Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, 78–91.
Withers, C. S. (1981). Central limit theorems for dependent variables. I. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4), 509–534.
Kariya, T., & Kurata, H. (2004). Generalized least squares. John Wiley & Sons.
Fomby, T. B., Johnson, S. R., & Hill, R. C. (1984). Feasible generalized least squares estimation. Advanced econometric methods, 147–169.
Ballinger, G. A. (2004). Using generalized estimating equations for longitudinal data analysis. Organizational Research Methods, 7(2), 127–150.
Zorn, C. J. (2001). Generalized estimating equation models for correlated data: A review with applications. American Journal of Political Science, 470–490.
Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 1049–1060.
MacKinnon, J. G., Nielsen, M. Ø., & Webb, M. D. (2023). Cluster-robust inference: A guide to empirical practice. Journal of Econometrics, 232(2), 272–299.
Lyons, R. (2011). The spread of evidence-poor medicine via flawed social-network analysis. Statistics, Politics, and Policy, 2(1).
VanderWeele, T. J., Ogburn, E. L., & Tchetgen, E. J. T. (2012). Why and when "flawed" social network analyses still yield valid tests of no contagion. Statistics, Politics, and Policy, 3(1).
Gardiner, J. C., Luo, Z., & Roman, L. A. (2009). Fixed effects, random effects and GEE: What are the differences? Statistics in Medicine, 28(2), 221–239.
Schweinberger, M., & Handcock, M. S. (2015). Local dependence in random graph models: Characterization, properties and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3), 647.
Frank, O., & Strauss, D. (1986). Markov graphs. Journal of the American Statistical Association, 81(395), 832–842.
Block, P., Stadtfeld, C., & Snijders, T. A. (2019). Forms of dependence: Comparing SAOMs and ERGMs from basic principles. Sociological Methods & Research, 48(1), 202–239.
Moulton, B. R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics, 32(3), 385–397.
Moulton, B. R. (1990). An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics, 334–338.
Kloek, T. (1981). OLS estimation in a model where a microvariable is explained by aggregates and contemporaneous disturbances are equicorrelated. Econometrica, 205–207.
Greenwald, B. C. (1983). A general analysis of bias in the estimated standard errors of least squares coefficients. Journal of Econometrics, 22(3), 323–338.
Sparkes, S., Garcia, E., & Zhang, L. (2023). The functional average treatment effect [Preprint available at https://arxiv.org/abs/2312.00219].
Lahiri, S. N. (2003). Resampling methods for dependent data. Springer Science & Business Media.
Lahiri, S. N. (1993). On the moving block bootstrap under long range dependence. Statistics & Probability Letters, 18(5), 405–413.
Snijders, T. A., Borgatti, S. P., et al. (1999). Non-parametric standard errors and tests for network statistics. Connections, 22(2), 161–170.
de la Haye, K., Shin, H., Vega Yon, G. G., & Valente, T. W. (2019). Smoking diffusion through networks of diverse, urban American adolescents over the high school period. Journal of Health and Social Behavior, 60(3), 362–376.
Block, P., & Grund, T. (2014). Multidimensional homophily in friendship networks. Network Science, 2(2), 189–212.
Goodreau, S. M., Kitts, J. A., & Morris, M. (2009). Birds of a feather, or friend of a friend? Using exponential random graph models to investigate adolescent social networks. Demography, 46(1), 103–125.
Shrum, W., Cheek Jr., N. H., & MacD. Hunter, S. (1988). Friendship in school: Gender and racial homophily. Sociology of Education, 227–239.
Smetana, J. G., Campione-Barr, N., & Metzger, A. (2006). Adolescent development in interpersonal and societal contexts. Annual Review of Psychology, 57, 255.
McFarland, D. A., Moody, J., Diehl, D., Smith, J. A., & Thomas, R. J. (2014). Network ecology and adolescent social structure. American Sociological Review, 79(6), 1088–1121.
Prisbell, M., & Andersen, J. F. (1980). The importance of perceived homophily, level of uncertainty, feeling good, safety, and self-disclosure in interpersonal relationships. Communication Quarterly, 28(3), 22–33.
Valente, T. W., Gallaher, P., & Mouttapa, M. (2004). Using social networks to understand and prevent substance use: A transdisciplinary perspective. Substance Use & Misuse, 39(10-12), 1685–1712.
Cheadle, J. E., Walsemann, K. M., & Goosby, B. J. (2015). Teen alcohol use and social networks: The contributions of friend influence and friendship selection. Journal of Alcoholism and Drug Dependence, 3(5).
Rice, R. E., Donohew, L., & Clayton, R. (2003). Peer network, sensation seeking, and drug use among junior and senior high school students. Connections, 25(2), 32–58.
Valente, T. W. (2003). Social network influences on adolescent substance use: An introduction. Connections, 25(2), 11–16.
Bronfenbrenner, U. (1992). Ecological systems theory. Jessica Kingsley Publishers.
Crosnoe, R., Johnson, M. K., & Elder Jr., G. H. (2004). School size and the interpersonal side of education: An examination of race/ethnicity and organizational context. Social Science Quarterly, 85(5), 1259–1274.
Eckert, P. (1989). Jocks and burnouts: Social categories and identity in the high school. Teachers College Press.
Rubin, D. B. (2010). Propensity score methods. American Journal of Ophthalmology, 149(1), 7–9.
Thoemmes, F. J., & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46(1), 90–118.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Carey, V. J., Lumley, T. S., Moler, C., Ripley, B., & Ripley, M. B. (2022). Package 'gee'.

Chapter 5

Discussion

This chapter concludes the dissertation. Altogether, it accomplishes the following ends: it reviews and contextualizes key findings from the preceding chapters, discusses some of their limitations and strengths, highlights a subset of potential applications in key fields, and outlines future research directions.
Since a work such as this is incomplete without the warmth of a bookend, it also brings back the metaphor of the stool.

In summation, Chapter 2 established the following: that additive statistics are statistically consistent estimators of their expected values provided the average number of (linear) dependencies in a sample is sub-linear in n, that exact models for dependency structures are unnecessary, and that a series of classical concentration inequalities can apply to random sums even under mutual dependence. This chapter also introduced an important class of variable—the U class—that encompasses a large proportion of commonly employed error distributions. Pertinently, a concentration inequality that is sharper than many standard results was also proven for this class of variable. This body of concepts and results was harnessed for an analysis of climate change. Typically, the statistical analysis of climate data requires relatively strong suppositions: namely, that the probabilistic dependencies present between outcome variables are locally limited, or that mixing conditions apply such that asymptotic normality is achieved. However, climate data emerge from complicated ecologies by definition, which, in all likelihood, impose dynamic and ineffable dependencies in multiple dimensions, only two of which are spatial and temporal. In short, there is no reason to believe that the underlying dependency relations can be represented by a tractable schema, especially one that is even asymptotically local in character. Since the conclusions reached by the analysis presented in this chapter do not depend upon any of these premises, they are less prone to doubt and afford a better foundation for further scientific deliberation. Furthermore, since additive statistics are nearly ubiquitous tools for estimating parameters that at least approximate values of scientific interest, the methods delineated within this chapter share in this quality with respect to applicability.

This last point also highlights the contributions of the manuscript presented in Chapter 4. Social networks are constituted by individual relationships that emerge from similar ecologies of dependence. These ecologies are defined by a milieu of interacting societal forces and institutions, the nature of which is largely unknown and arguably inconceivable. Additionally, this picture is complicated by the sheer magnitude of dependencies that require specification. For a directed network with n actors, there are n(n−1)/2 probabilistic dependencies between random tie variables that require consideration. A defensible theory for such an object is mythological. Crude reductions are inevitable, and this fact bears consequences for any statistical model that depends upon the finer details of such a theory. Generalized linear models—and generalized estimating equations more generally—do not require such a theory. More importantly, they can optimally approximate the probability that a relationship is formed, conditional on some set of covariates, without any knowledge of existing ties or graph structure. Being able to accomplish this is invaluable in many settings since, by construction, the features of the network are largely unknown before observation. The study of covert networks, which includes the study of terrorist networks, provides one extreme case of this. Therefore, a cogent modeling approach that allows for the prediction of tie formation in the absence of this information is integral; the sketch below gestures at what such a model looks like in practice.
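As a schematic only, the following fits a logistic GLM to simulated dyad-level data with ego, alter, and homophily covariates. The sample size, coefficients, and variable names are all invented for illustration, and the naive standard errors such a fit reports are precisely what the variance identity and the Hoeffding bootstrap are designed to replace under dependence.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_dyads = 2000

    # Hypothetical dyad-level design: one row per ordered pair (ego, alter),
    # with a binary attribute for each actor and a homophily indicator
    # marking whether the two attributes match.
    ego = rng.integers(0, 2, n_dyads)
    alter = rng.integers(0, 2, n_dyads)
    homophily = (ego == alter).astype(int)

    # Simulate tie indicators with an invented homophily effect.
    logits = -2.0 + 0.1 * ego + 0.1 * alter + 1.2 * homophily
    ties = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

    # Fit the marginal tie model: no graph structure enters the design matrix.
    X = sm.add_constant(np.column_stack([ego, alter, homophily]))
    fit = sm.GLM(ties, X, family=sm.families.Binomial()).fit()
    print(fit.params)   # consistent point estimates; naive SEs are the weak link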
Through the use of the variance identity, concentration inequalities, and the Hoeffding bootstrapping strategy, this chapter accomplished this. Moreover, it also demonstrated the worth of this approach with data from a directed high school friendship network. Prior research has indicated that homophily of sex and ethnicity, as well as a school's environment, are associated with friendship formation in this setting. These associations were ultimately confirmed. Again, since this analysis was not conditioned upon knowledge of network structure or a particular dependency theory, its conclusions can stand against scrutiny with additional degrees of fortitude.

Chapter 3 proved two main results under two relatively mild assumptions: counterfactual consistency and preserved supports. Firstly, it proved that causal effects that are related to the functional average are identified, regardless of informative sampling or the existence of unmeasured confounders. Secondly, it proved that basic regression analyses target functional average causal effects under these same conditions insofar as the usual set of regression assumptions, ordinarily reserved for association studies, also holds. Pertinently, both propositions were proven under the auspices of a structural causal theory. Proving the second statement was a central achievement of this dissertation since it implies that a plethora of past investigations can be revisited with a causal lens. For decades, it has been believed that elementary regression analyses that omit confounding variables cannot support causal statements. Since the identification and measurement of a set of variables that is adequate for causal identification is exceptionally non-trivial, this has made the employment of regression methods for causal inference a vexing endeavor. However, the contents of Chapter 3 dispensed with this requirement. It established that any statistical regression with a defensible causal theory can be utilized for causal inference, regardless of unmeasured confounding or informative sampling conditions, insofar as 1) the study design preserved the set of values that the outcome could have taken under an experimental design, 2) counterfactual consistency held, 3) the regression model appeared sufficiently well specified, and 4) there was reasonable evidence supporting the notion of sum-symmetric errors. Hence, it established that empirically accounting for a sufficient set of confounding forces is not a necessary condition for causal inference. This fact will greatly serve future research, doubly so since methods such as linear regression are within the grasp of every applied scientist.

This stated, Chapter 3 also substantiated the properties of basic functional average estimators. In particular, it demonstrated that the sample mid-range estimator is an important statistic for causal inference, despite its near-banishment to the dusty lower shelves of statistical history (a minimal sketch of this estimator follows at the end of this passage). Since the extremes of the support of outcome variables are preserved under fairly mild regularity conditions, the work of this chapter gestures towards an entire terrain of investigation and scholarly synthesis. For instance, under the conditions of Chapter 3, extreme value theory automatically doubles as a theory for causal inference. Indubitably, this is useful for all fields of study, although it is of particular interest to the climate and clinical sciences.
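Because the mid-range may be unfamiliar, here is a minimal sketch of the estimator and of a two-arm contrast built from it. The data are synthetic, and the uniform supports are chosen only so the answer can be checked by eye; under the conditions of Chapter 3, the quantity being targeted is the midpoint of each arm's support.

    import numpy as np

    def mid_range(x):
        # Sample mid-range: the average of the sample minimum and maximum.
        x = np.asarray(x)
        return (x.min() + x.max()) / 2.0

    # Synthetic two-arm contrast with known supports (illustration only).
    rng = np.random.default_rng(1)
    treated = rng.uniform(2.0, 8.0, size=200)   # support midpoint 5
    control = rng.uniform(1.0, 5.0, size=200)   # support midpoint 3
    print(mid_range(treated) - mid_range(control))   # close to 5 - 3 = 2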
Next, attention returns to the stool. This dissertation replaced at least three of its legs. In place of an exact theory of dependence that requires, at a minimum, the specification of n² parameters, there now exists a leg characterized by the defense (or bounding) of two constants only: the average number of correlations within a sample and the average magnitude of these correlations. Replacing finite sample or asymptotic normality is finite sample or asymptotic sum-symmetry. Instead of requiring non-informative sampling, or sampling with correctable bias, the sole requirement is now the preservation of theoretical supports or extremes. For causal inference, this latter amendment to the stool precludes the need for a hackneyed third leg—one that is so often cracked and inharmoniously balanced amidst the others—namely, the stipulation that most or all relevant factors related to a causal question be identified and measured. Now, working scientists need not know it all before they can know something.

As stated in the introduction, the boon of this stool is that it is made of wood and can bear more weight. By resting upon premises that are more general—and thus ones that apply to a larger number of possible worlds, a fact that heightens the likelihood that within this assortment our own finds a home—it also provides quantitative statements about uncertainty that are, in general, more conservative than those of exact models. In this fact rests a fundamental limitation, in addition to another dimension of strength. For the latter, first appreciate the specter that haunts every statistical argument: the unverifiable nature of pivotal statistical assertions. Put otherwise, although made of wood, our new stool is certainly not concrete. It too can break, and quite easily. Certainty about whether, or when, this transpires is unfortunately beyond any empirical enterprise. Still, even when invalidly applied, the methodologies supplied by this dissertation will almost always have an increased probability, whatever it might be, of providing true scientific statements in comparison to the others reviewed.

This conservatism, however, is not appropriate for every scientific question; and now, we return to the fundamental limitation. In some situations, statistical conservatism is the riskier enterprise. For instance, if it is hypothesized that an outcome variable possesses a sum-symmetric distribution, and a failure to reject this hypothesis is taken as evidence that mean exchangeability holds, a conservative approach will ultimately lead to more spurious causal statements. Barring these circumstances, though, the conservatism of these methods testifies to no other intrinsic deficiency. Ultimately, this is because efficiency is a red herring more often than not. For an estimator to be efficient in any useful sense, it must emerge from a model that is valid, which, in turn, requires it to emerge from a set of premises that are true. And while such an estimator is most certainly efficient in some reality, the chance that this reality is our own is negligible at worst and non-discernible at best, which disqualifies it from boasting its crown at the ball.

Some future research directions have already been mentioned in passing. Two more receive exposition here, although this accounting is not exhaustive. Since the primary author of the manuscripts covered in this dissertation started his career as a clinical social worker, future applications in this field are of primary concern. The statistical analysis of human experience and psychopathology is fraught with challenges.
Concisely put, there is simply no reason for the usual battery of modeling assumptions to apply, even in semblance. Any two given people emerge from an indelibly connected and globalized world, where any number of institutions, forms of media, or socio-ecological-historical circumstances impart shared influence across time and space, and in ways that are mysterious to the understanding. This claim also holds for a plethora of sociological studies. Additionally, for any two given experiences that originate from the same awareness, appeals to any form of local dependence, or even to stationarity in many circumstances, seem prima facie empty of truth. This makes the statistical analysis of the efficacy of mental health services non-trivial. The methods presented in this dissertation can address these obstacles with greater success. Just the same, they can also bear great fruit in specific sub-fields of social network analysis. Key future developments, for instance, can arise in relation to the construction of cogent models that aim to predict the formation of relationships within covert, or more specifically, terrorist networks. Networks such as these often exist within much larger social, religious, or political networks, which are more amenable to study. The observed features of the participants within these networks, in conjunction with previously examined (covert) terrorist relationships, can become an important tool for predicting future ones in the absence of any knowledge of the properties or contemporaneous relationships of the actual (covert) terrorist network of interest. Obviously, this is pivotal, as knowledge of these types mostly does not exist by definition. If it did, the network under consideration would not be very covert.
Abstract
Statistical modeling often relies on two pivotal assumptions: non-informative sampling conditions and weak, localized probabilistic dependencies between observations. For causal inference in observational settings, the strong ignorability of treatment assignment is also typically required. These assumptions are usually not fulfilled in modern observational contexts. Chapter 2 establishes a mathematical identity for the variance of an additive statistic that removes the strict need for a detailed dependency model. Following this, some basic properties of additive statistics are explored under very general conditions of statistical dependence. An important type of random variable is also introduced, the U random variable, before a novel concentration inequality for sums of dependent random variables of this class is proven. This inequality can be used to construct confidence sets for sums of mutually dependent, but weakly correlated random variables of this class. Chapter 3 introduces a new type of causal effect that is identified under mild conditions. It is related to the average value of a function. Functional average estimators are explored and a novel bootstrapping process, the Hoeffding bootstrap, is introduced as a strategy for uncertainty quantification. U concepts are then used to prove that well-specified regression models, under the semantic framework of a causal theory and the conditions of this chapter, target functional average causal effects under modeling assumptions that are often used in association studies. Chapter 4 uses the variance identity of Chapter 2 and the Hoeffding bootstrap of Chapter 3 to extend generalized estimating equation approaches to regressions in social network contexts.