Essays on Treatment Effect and Policy Learning

by Yue Fang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ECONOMICS)

May 2023

Copyright 2023 Yue Fang

Dedication

To my parents.

Acknowledgements

I would like to express my deepest gratitude to my dissertation committee members, Geert Ridder, John Strauss, and Jong-Shi Pang, for their great support and invaluable guidance throughout my Ph.D. studies. I am extremely grateful to my advisor, Professor Geert Ridder, for his encouragement, mentorship, and support. Professor Ridder is truly the best advisor I could have ever imagined. With his profound theoretical background and great passion for research, he consistently understands my questions and provides the most insightful guidance. He is patient, supportive, and responsive. I am also immensely grateful to Professor Strauss for his support of all my decisions and his valuable suggestions. Professor Pang's exceptional mathematical prowess and his introduction to the world of rigorous optimization theory have been instrumental in my research journey. I feel incredibly fortunate to have these amazing professors as my committee members.

I would like to extend my heartfelt thanks to my friend, Sizhu Lu, who has been by my side through all the ups and downs. Sizhu's generous sharing of academic resources has been immensely helpful to my research. Furthermore, I feel fortunate to have intelligent and supportive colleagues in the Department of Economics at USC. Our shared ideas, discussions, and mutual support have been invaluable. I would also like to thank my friends, Shuzhou, Shaoning, and Zhan, for their assistance with algorithms and computations.

I am deeply grateful to my dear parents for their unconditional love and support. They have given me the courage to pursue my dreams and have always been there to listen to my desires and concerns, share my joys and sorrows, and celebrate my achievements. Their love and support have been the driving force behind my academic journey, and I am forever grateful.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Optimal Policy Learning Under an Inequality Constraint
  2.1 Introduction
  2.2 Problem formulation
  2.3 Related Literature
  2.4 Theoretical Results
  2.5 Multi-objective problem and chance-constraint problem
    2.5.1 Multi-objective problem
    2.5.2 Chance-constraint problem
  2.6 Simulation
    2.6.1 Algorithm
      2.6.1.1 Grid Search
      2.6.1.2 MIP for the multi-objective problem
    2.6.2 Simulation Results
  2.7 Empirical Application
  2.8 Conclusion
Chapter 3: Optimal Dynamic Policy Learning in Short Periods
  3.1 Introduction
    3.1.1 Related literature
  3.2 Model Formulation
  3.3 Theoretical Properties
    3.3.1 Simple case: two periods
    3.3.2 Multiple periods
    3.3.3 Unknown propensity score
  3.4 Simulation
  3.5 Conclusion
Chapter 4: Bounds on the Variance Estimator of the Average Treatment Effect in Cluster Randomized Trials
  4.1 Introduction
  4.2 Set up
  4.3 Exact variance and upper bound estimators
    4.3.1 Exact variance
    4.3.2 Cluster RCT
      4.3.2.1 Upper bound estimators
      4.3.2.2 From V(·) and P_U = 1
      4.3.2.3 An estimator derived from the difference with the Liang-Zeger estimator
  4.4 Simulation
    4.4.1 First estimator
    4.4.2 Second estimator
  4.5 Application
  4.6 Conclusion
References
Appendices
  A Appendix to Chapter 1
    A.1 Proof of Theorem 2.1
    A.2 Proof of Theorem 2.2
    A.3 Mean-log deviation
    A.4 Transform of constraint function
  B Appendix to Chapter 2
    B.1 Two periods
    B.2 Multiple periods
    B.3 Unknown propensity score
  C Appendix to Chapter 3

List of Tables

2.1 Simulation results from DGP1
2.2 Simulation results from DGP2
2.3 Penalized objective maximization with different penalization parameters
2.4 Welfare and Gini index with NSW subset data
2.5 CGSM by varying the constraint level
2.6 Pre-treatment variables of the targeted group under different policies
3.1 Welfare from DGP 1
3.2 Welfare from DGP 2
4.1 Comparison of estimators, P_U = 1
4.2 Upper bound estimator on the variance of the ATE estimator
4.3 Effect on school participation, Panel A: Household sample
4.4 Effect on school participation, Panel B: School sample

List of Figures

2.1 Approximation with smooth functions
2.2 Upper and lower bounds on the indicator function

Abstract

Treatment effect and policy learning are important topics in both econometrics and applied microeconomics. Policy learning builds on heterogeneous treatment effect estimation. The topic has become quite popular in recent years and connects different fields, including statistics, optimization, computer science, and econometrics. In this thesis, I consider finding the optimal policy in different settings, and I discuss variance estimators of the average treatment effect estimator in clustered randomized trials. All three chapters aim to contribute to policy evaluation and optimization. In Chapter 1, I briefly introduce the concept of policy learning and summarize the three chapters that follow. In Chapter 2, I consider a setting with an inequality constraint and focus on the resulting constrained optimization problem. There is an ongoing discussion of the trade-off between efficiency and equality: in the real world, besides mean income, the inequality level of a society is also an important concern. By solving the empirical constrained optimization problem, we obtain an estimated policy whose welfare converges to the best attainable mean income and which satisfies the constraint with high probability. In Chapter 3, I study the dynamic setting with a small number of periods based on sequential randomized trials. By solving the problem in each period backward, we obtain the set of estimated optimal policies for all periods, whose welfare converges to the best obtainable welfare at the optimal rate. In Chapter 4, I propose improved variance estimators of the average treatment effect estimator in clustered randomized trials, in the finite population setting.

Chapter 1: Introduction

Policy learning refers to learning the best treatment assignment for the population of interest from existing experiments or observational studies, in order to achieve some specific goals. It is rooted in the existence of heterogeneous treatment effects and is built on causal inference and treatment effect estimation. Many social and business programs are costly, so policy makers and companies would like to apply treatments to the people who bring the most benefit or revenue, instead of treating everyone.
This is also called the personalized or individualized treatment rule, because whether to treat someone and which treatment to apply are determined by individual characteristics. Based on how the data are collected, policy learning can be divided into online learning and offline learning. Online learning refers to the situation where data are collected continuously, so the policy maker can observe the outcome of the treatment assignment and update the policy accordingly. Offline learning refers to the situation where data are collected only once, and the policy maker can only observe the outcome of the treatment assignment after the policy is implemented. In this thesis, I focus on the offline learning setting.

There can be different goals and settings in policy learning. For example, the goal could be to maximize the average treatment effect, or to maximize the average treatment effect for a specific subgroup; it could also be to minimize the social inequality level. Besides the objective function, various constraints can be considered. The setting can be static or dynamic, and interference can also be taken into account.

In Chapter 2, I consider the setting of maximizing mean income while maintaining a satisfactory equality level. The objective function is the mean income, and the constraint is an inequality measure. The setting is static, and there is no interference. I formulate the problem as a constrained optimization problem and solve its empirical counterpart using the information collected from the experiment. I first assume that the global optimum of the empirical problem can be found, and I show the theoretical properties of the empirical solution with regard to the population welfare and the population inequality level when the empirical solution is applied to the population. I also discuss how to use grid search to find the global optimum of the empirical problem when the policy class is simple. Then I discuss potential methods to tackle the problem when the policy class is more complicated. Since the constrained optimization problem is non-convex, it is hard to find the global optimum. I refer to Cui et al. (2022) to discuss potential methods for finding stationary points; however, a gap remains regarding how a local optimum performs relative to the best obtainable welfare.

In Chapter 3, I consider the dynamic setting with short periods, where the number of periods is at most around 10. I propose to extend the Empirical Welfare Maximization method of Kitagawa and Tetenov (2018) to the dynamic setting and to solve the problem backward period by period. A sequential randomized experiment is needed, and we need to assume that for each combination of information groups there are both treated units and control units in the experiment. The optimization problem in the last period is solved first, given all previously realized information. I then assume that the estimated optimal policies will be adopted in the future periods, and solve the optimization problem in period t, taking into consideration both the outcome in period t and the expected outcomes in future periods. I show that, when the set of estimated optimal policies is applied to the population, the total welfare, defined as the sum of outcomes over all periods, converges to the best obtainable welfare at the optimal rate.

Chapter 4 discusses the average treatment effect estimator in clustered randomized trials in the finite population setting.
Instead of focusing on the treatment effect estimator itself, I consider how to find a better variance estimator of the corresponding ATE estimator. Clustered standard errors are very common and important in applied economics, and the Liang-Zeger variance estimator is a widely adopted benchmark. In clustered randomized trials, units in the same cluster are assigned to the same treatment group, either treated or control. We first follow Abadie et al. (2023) and calculate the exact variance of the ATE estimator in clustered randomized trials, as well as the difference between the exact variance and the Liang-Zeger estimator. We find that in the finite population setting this difference does not go to 0. The difference includes a term involving the covariance between the cluster-average potential outcome when treated and the cluster-average potential outcome when controlled, which cannot be directly estimated. We refer to Aronow et al. (2014) to bound the covariance term, leading to upper bound estimators of the exact variance that can improve over the Liang-Zeger estimator.

In the future, I plan to continue working on treatment effects and policy learning and to consider other settings, for example when interference exists or when a Nash equilibrium is incorporated.

Chapter 2: Optimal Policy Learning Under an Inequality Constraint

2.1 Introduction

Optimal policy learning is a classical topic, and it has gained even more popularity during the past few years, particularly in statistics, computer science, and econometrics. The reason we want to learn an optimal policy is that treatment effects are heterogeneous among people with different covariates. With growing interest and technical progress in estimating heterogeneous treatment effects, offline optimal policy learning has taken a large step forward. In contrast to online policy learning, where data are constantly coming in, offline policy learning is based on a given dataset, either experimental or observational: it estimates conditional average treatment effects and then solves a maximization problem to infer the optimal treatment policy. Another reason is that policy makers generally want the policy to be simple, straightforward, and interpretable. This research topic brings several fields together, including statistics, biostatistics, computer science, operations research, and econometrics. Although different fields might tackle the question in different ways, the intersection of interests has further boosted the popularity of offline optimal policy learning.

It consists of two stages. In the first stage, an experiment or an observational study is conducted in which the treatment assignment is independent of the potential outcomes conditional on observed covariates, and conditional average treatment effects (CATE) can be estimated from the collected dataset. In the second stage, with the information collected from the experiment or observational study, we would like to infer which types of people should be treated, in terms of some selected covariates, in order to achieve specific policy goals. Usually, the goal is to maximize the population mean outcome, which is equivalent to maximizing the population average treatment effect. In this paper, we focus on job training as the policy and income as the outcome; the policy's goal is therefore to maximize the mean income of the targeted population.
However, there could be other objectives, such as maximizing a weighted average income, reducing social inequality, or lowering the poverty rate. As in the social welfare literature, there are many ways of defining social welfare: the social planner might assign different weights to different subgroups, leading to multiple forms of social welfare functions. In this paper, while our main objective is to maximize mean income, we also want to ensure that social equality is maintained at a satisfactory level. Social well-being comprises both efficiency and equity, and a high level of social inequality can lead to collective violence. Policy makers are thus motivated to reduce inequality to a certain level to prevent riots, and to avoid the harm to growth that usually results from extreme redistribution policies. Our aim is to allocate resources to those at a disadvantage, in order to make society more equal, even at the cost of some efficiency. The most important target is still growth, however, as long as inequality stays around a satisfactory level that reduces the probability of riots and avoids further extreme policies. We frame the problem as a constrained optimization problem, with mean income as the objective function and an inequality index as the constraint. Although there are many inequality measures in public economics, we focus on the Gini index because of its wide recognition; other measures, such as the mean-log deviation, Atkinson's index, the Hoover index, the Theil index, and the 20/20 ratio, can also be incorporated into our setting. Additionally, by using a pre-determined policy class, we can ensure that the selected policy aligns with policy makers' preferences while also satisfying exogenous restrictions such as anti-discrimination requirements and budget constraints. In the simulation and application sections of this paper, we consider the functional class of linear eligibility scores, where the policy variables are selected based on policy makers' preferences. This policy class allows for easy interpretability and implementation in practice, and the simulation and application sections provide examples of how it can be used to achieve good policy outcomes while satisfying the desired restrictions.

In summary, our goal is to determine which subgroups of people should be treated, and we have formulated this as a constrained optimization problem. The decision variable, which takes a functional form, appears in both the objective and the constraint functions. Since we do not know the true distribution of the treatment effect and of individuals' characteristics, we must maximize the sample analog of the population problem to derive a satisfactory policy.

After conducting a pilot experiment or observational study, it is possible to estimate the conditional average treatment effect (CATE) based on high-dimensional covariates. If the propensity score is known, one can use the inverse probability weighting (IPW) estimator to estimate the average treatment effect under a specific policy. If the propensity score is unknown, one can first estimate the conditional mean outcomes and the propensity score using various methods, both parametric and non-parametric, and in a second step use the doubly robust (DR) estimator or the augmented inverse probability weighting (AIPW) estimator to estimate the average treatment effect under a specific policy.
The objective is to determine whom to treat based on some low-dimensional covariates, and to estimate the policy as a function of these covariates from the available information. The policy is evaluated through regret analysis, which measures the difference between the maximal population welfare and the population welfare induced by the estimated policy. In this paper, an upper bound on the convergence rate of the regret is derived; it depends on the complexity of the pre-defined policy class and the sample size in the first stage. After introducing an inequality constraint, we demonstrate that it is not possible to satisfy the hard constraint and simultaneously achieve the optimal convergence rate for the objective function when directly solving the corresponding sample analog of the population problem. Therefore, we propose to relax the constraint in order to achieve the optimal regret convergence rate. Additionally, we suggest two alternative ways to formulate the problem: penalizing the constraint function in the objective function, and using continuous functions to approximate the indicator function and solving the corresponding transformed problem.

The paper is structured as follows. In Section 2, we present the formulation of the constrained problem, the relationship between the population problem and the empirical problem, and one approach to solve it. Section 3 discusses the related literature. Section 4 presents the theoretical properties of the proposed approach. Section 5 introduces two additional ways to formulate the problem. In Section 6, we discuss the algorithms and present simulation results that demonstrate the properties of the proposed approach. In Section 7, we apply the approach to the National Supported Work Demonstration experiment and estimate the welfare and the Gini index. Finally, Section 8 concludes the paper.

2.2 Problem formulation

Following Kitagawa and Tetenov (2018), the data come from an experiment consisting of a random sample of size n of Z_i = (Y_i, D_i, X_i). The observable pre-treatment covariates of individual i are denoted by X_i ∈ 𝒳 ⊂ R^{d_x}, D_i is the treatment assignment in the experiment, and Y_i is the observed outcome. The population from which the sample is drawn is characterized by P, the joint distribution of (Y_i(0), Y_i(1), D_i, X_i), which is unknown. Y_i(1) is the potential outcome of individual i if treated, and Y_i(0) is the potential outcome of individual i if untreated. Assume the sample and the target population are drawn from the same distribution P, which belongs to a class of probability distributions 𝒫. The objective is to learn the optimal non-randomized policy that performs uniformly well over the entire distribution class. A policy g(X) ∈ {0, 1} determines the non-randomized treatment assignment for an individual with observed characteristics X, and the class of policies that policy makers can choose from is denoted by 𝒢. In the experiment, the potential outcomes Y_i(1) and Y_i(0) are fixed, and we can only observe one of them, depending on the treatment assignment D_i. When estimating the conditional average treatment effect, we consider the intent-to-treat effect without accounting for non-compliance.

With policy g(X) applied to the target population, the welfare defined by the mean outcome is

\tilde{W}(g) = E_P[Y(1) g(X) + Y(0)(1 - g(X))],

where E_P[·] denotes the expectation with respect to the joint distribution P.
As discussed in Kitagawa and Tetenov (2018), Y can represent a variety of outcomes, reflecting a wide range of social preferences, and it can also be set to a known function of these outcomes.

Following the literature, we denote the conditional mean outcome by m_d(x) = E[Y(d) | X = x] and the conditional average treatment effect by τ(x) = E[Y(1) − Y(0) | X = x]. We can express W̃(g) as a function of τ(X) to demonstrate the equivalence between maximizing the mean outcome and maximizing the average treatment effect:

\tilde{W}(g) = E_P[Y(1) g(X) + Y(0)(1 - g(X))]
            = E_P[(Y(1) - Y(0)) g(X) + Y(0)]
            = E_P[Y(0)] + E_P[E[Y(1) - Y(0) | X] g(X)]
            = E_P[Y(0)] + E_P[\tau(X) g(X)].

Denote W(g) = E_P[τ(X) g(X)].

Given a pilot experiment, we have access to the propensity score e(X) = P(D = 1 | X) by design, and we can use the inverse probability weighting (IPW) estimator to estimate the average treatment effect under policy g; by avoiding estimation of the conditional mean outcomes, we avoid introducing additional estimation error. In the case of an observational study where the propensity score is unknown, we could first estimate both the propensity score and the conditional mean outcomes and then use the augmented inverse probability weighting (AIPW) estimator to estimate the average treatment effect under policy g. In this paper, we focus on the experimental setting, assuming that the sample and the target population come from the same distribution, that the potential outcomes are unconfounded, and that the propensity score is bounded away from 0 and 1. Under these assumptions, we can express W̃(g) as follows, free of estimation uncertainty in e(X):

\tilde{W}(g) = E_P\left[\frac{YD}{e(X)} g(X) + \frac{Y(1-D)}{1-e(X)} (1-g(X))\right]
            = E_P[Y(0)] + E_P\left[\left(\frac{YD}{e(X)} - \frac{Y(1-D)}{1-e(X)}\right) g(X)\right].

When considering inequality, various indices can be used. One popular measure is the Gini index, defined as I_Gini = 1 − ∫(1 − F(y))^2 dy / E[Y]. The Gini index under policy g is

I_{Gini}(g) = 1 - \frac{\int (1 - F_g(y))^2 \, dy}{E_g[Y]},

where

F_g(y) = \int \left[ F_{Y(1)|X=x}(y) g(x) + F_{Y(0)|X=x}(y)(1 - g(x)) \right] dP_X(x),
E_g[Y] = \tilde{W}(g) = E_P[Y(1) g(X) + Y(0)(1 - g(X))],

P_X is the marginal distribution of X, and F_{Y(1)|X} and F_{Y(0)|X} are the conditional distribution functions of the potential outcomes Y(1) and Y(0). Other measures of inequality could also be considered, such as the coefficient of variation I_{CV}(g) = \sqrt{\mathrm{var}_g(Y)} / E_g[Y] and the mean-log deviation I_{MLD}(g) = \ln(E[\tau(X) g(X) + m_0(X)]) - E[\tau^{\log}(X) g(X) + m_0^{\log}(X)], where \tau^{\log}(X) = E[\ln(Y(1)) - \ln(Y(0)) | X] and m_d^{\log}(X) = E[\ln(Y(d)) | X].

If the distribution of (Y(1), Y(0), D, X) were fully known to the policy maker, an optimal policy would be derived by solving

max_{g \in \mathcal{G}} \; W(g) := E_P\left[\left(\frac{DY}{e(X)} - \frac{(1-D)Y}{1-e(X)}\right) g(X)\right]    (2.2.1)
s.t. \; I(g) := 1 - \frac{\int \left(1 - \int \left[F_{Y(1)|X=x}(y) g(x) + F_{Y(0)|X=x}(y)(1-g(x))\right] dP_X(x)\right)^2 dy}{E_P\left[\frac{DY}{e(X)} g(X) + \frac{(1-D)Y}{1-e(X)} (1-g(X))\right]} \le \alpha,

where g(X) is an indicator function and α is the inequality level chosen by the policy maker. The true distribution is unknown, so we need to consider the empirical counterpart:

max_{g \in \mathcal{G}} \; \hat{W}(g) = \frac{1}{n}\sum_{i=1}^n \left(\frac{D_i Y_i}{e(X_i)} - \frac{(1-D_i)Y_i}{1-e(X_i)}\right) g(X_i)    (2.2.2)
s.t. \; \hat{I}(g) = 1 - \frac{\int \left(1 - \int \left[\hat{F}_{Y(1)|X=x}(y) g(x) + \hat{F}_{Y(0)|X=x}(y)(1-g(x))\right] dP_X(x)\right)^2 dy}{\frac{1}{n}\sum_{i=1}^n \left[\frac{D_i Y_i}{e(X_i)} g(X_i) + \frac{(1-D_i)Y_i}{1-e(X_i)} (1-g(X_i))\right]} \le \alpha,

where

\hat{F}_{Y(1)|X=x}(y) = \frac{1}{n}\sum_{i=1}^n \frac{D_i}{e(X_i)} 1(Y_i \le y), \qquad
\hat{F}_{Y(0)|X=x}(y) = \frac{1}{n}\sum_{i=1}^n \frac{1-D_i}{1-e(X_i)} 1(Y_i \le y).

Both the objective function and the constraint function include the decision variable g(X), and as a result both are discontinuous, non-linear, and non-convex.
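Given a candidate policy, both Ŵ(g) and Î(g) in (2.2.2) are simple sample averages combined with a one-dimensional integral. The following Python sketch illustrates this evaluation step under the assumptions above (known propensity score, non-negative bounded outcomes). The function names, the grid, and the trapezoidal integration are illustrative choices, not part of the dissertation.

import numpy as np

def ipw_welfare(y, d, e, g):
    """Empirical IPW welfare W_hat(g) = (1/n) sum_i [d*y/e - (1-d)*y/(1-e)] * g."""
    return np.mean((d * y / e - (1 - d) * y / (1 - e)) * g)

def ipw_gini(y, d, e, g, n_grid=200):
    """Empirical Gini index I_hat(g) under policy g, using IPW plug-in estimates
    of the outcome distribution F_g and of the mean outcome E_g[Y]."""
    w1, w0 = d / e, (1 - d) / (1 - e)                    # IPW weights, treated / control
    mean_y = np.mean(w1 * y * g + w0 * y * (1 - g))      # estimate of E_g[Y]
    ys = np.linspace(0.0, y.max(), n_grid)               # grid over the outcome support
    # F_g(t): each unit's indicator 1(Y_i <= t), weighted by its arm under policy g
    F = np.array([np.mean((w1 * g + w0 * (1 - g)) * (y <= t)) for t in ys])
    return 1.0 - np.trapz((1.0 - F) ** 2, ys) / mean_y

# Example: evaluate the "treat everyone" policy on simulated data.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
d = rng.binomial(1, 0.5, size=n)
y = np.exp(1.0 + 0.3 * x + 0.2 * d + 0.5 * rng.normal(size=n))   # positive incomes
e = np.full(n, 0.5)                                              # known propensity score
g = np.ones(n)                                                   # treat everyone
print(ipw_welfare(y, d, e, g), ipw_gini(y, d, e, g))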
In general it is not possible to guarantee obtaining the global optimum of a non-convex constrained optimization problem, and this is the main difficulty of this paper. To derive theoretical results, we assume that the global optimum of the empirical problem can be found and discuss when this assumption can be satisfied. When the policy class is simple, for example 𝒢 = {g : g(X) = 1(X_1 ≤ d)} or 𝒢 = {g : g(X) = 1(β_0 + β_1 X_1 + β_2 X_2 ≥ 0)}, the global optimum can be found by grid search. We then discuss potential methods to study the problem when the global optimum of the empirical problem cannot be guaranteed.

We first introduce some assumptions on the population distribution and the empirical distribution. Following Sun (2021), we assume:

Assumption 2.1. There exists a distribution P_0 ∈ 𝒫 under which a non-empty set of policies satisfies the constraint exactly, G_0 = {g ∈ 𝒢 : I(g; P_0) = α}. The distribution class 𝒫 includes a sequence of empirical data distributions {P_{h_n}} that are contiguous to P_0 and satisfy: for all g ∈ G_0, there exists C > 0 such that n^{γ} (I(g; P_{h_n}) − α) > C.

Assumption 2.2. Let g*_{P_0} be the optimal solution to the population problem under distribution P_0. Assume that the inequality constraint is binding at the constrained optimum: I(g*_{P_0}; P_0) = α.

Assumption 2.3. Under distribution P_0, there exists η > 0 such that W(g*_{P_0}) − W(g) ≥ η for any feasible policy g whenever I(g; P_0) < α.

The above assumptions state that the population constraint is binding at the optimal policy: the policy maker has chosen an inequality level at which there is no more room for improvement. They also state that if a policy makes the population constraint bind, it violates the constraint under a contiguous empirical distribution P_{h_n} by at least C n^{-γ}, and that no policy outperforms the population optimum g*_{P_0} under distribution P_0. Under these assumptions, any policy that satisfies the empirical constraint induces welfare that is lower than the optimal welfare by no less than η, and any policy that attains the optimal population welfare is not a feasible solution to the empirical problem.

Besides these assumptions on the distribution, we introduce two more assumptions on the distribution class 𝒫 regarding the outcome and the treatment assignment in the experiment.

Assumption 2.4 (Bounded Outcomes). There exists M < ∞ such that the support of the outcome variable Y is contained in [0, M]. Assume also that the population average income without treatment is bounded away from 0: E[Y(0)] ≥ M_0 > 0.

Because we consider income as the outcome Y, it naturally has bounded support. By bounding the outcome, the treatment effect is also bounded, |τ(X)| ≤ M, with τ(X) = E[Y(1) − Y(0) | X] the conditional average treatment effect. This assumption also implies that a tail condition similar to that in Kitagawa and Tetenov (2021) holds: for all P ∈ 𝒫,

\int \sqrt{P(|Y(d)| > y)} \, dy \le M.

Indeed, with bounded outcomes Y ≤ M,

\int \sqrt{P(|Y(d)| > y)} \, dy = \int_0^M \sqrt{P(Y(d) > y)} \, dy \le \int_0^M 1 \, dy = M.

The assumption that mean income in society is strictly above 0 even without treatment is also reasonable.

Assumption 2.5 (Strict Overlap). The propensity score satisfies: there exists κ ∈ (0, 1/2] such that e(x) ∈ [κ, 1 − κ] for all x ∈ 𝒳.

This assumption guarantees that at each x there are both treated units and control units in the observed sample, so that the conditional average treatment effect can be estimated.
In order to obtain a policy that induces welfare close to the optimal population welfare, we need to relax the constraint in the empirical problem by choosing α_ε = α + ε_n, with ε_n > 0:

max_{g \in \mathcal{G}} \; \hat{W}(g)    (2.2.3)
s.t. \; \hat{I}(g) \le \alpha_\varepsilon.

Definition 2.1 (Constrained policy set). The constrained policy set is the set of policies that satisfy the population-level constraint,

G(\alpha) = \{g \in \mathcal{G} : I(g) \le \alpha\},

and the estimated constrained policy set is the set of policies that satisfy the empirical constraint,

\hat{G}(\alpha_\varepsilon) = \{g \in \mathcal{G} : I_n(g) \le \alpha_\varepsilon\} = \{g \in \mathcal{G} : I_n(g) \le \alpha + \varepsilon_n\}.

Different from the true constrained policy set, the estimated policy set is stochastic and depends on the sample. We require the estimated policy set to approach the true constrained set as the sample size grows, by choosing an appropriate ε_n.

Lemma 2.1. By choosing ε_n = n^{-γ} with γ < 1/2, we have G(α) ⊆ Ĝ(α_ε) with probability at least 1 − δ_n, where δ_n = (C/ε_n)√(v/n) → 0.

Lemma 2.1 relies on Lemma 2.5 in Section 2.4, which is proven in the Appendix, and is used to establish the relationship between the two policy sets. Let ĝ* be one of the solutions to the empirical constrained maximization problem. When implementing ĝ* on the population, it would be desirable to obtain approximately optimal welfare while also approximately satisfying the constraint. It is possible that ĝ* does not fulfill the population-level constraint, I(ĝ*) > α, but we can ensure that the violation converges to 0 as the sample size increases.

When addressing the empirical constrained maximization problem, we examine both the indicator function g(X) and an approximating continuous function s(X), where s(X) could be a sigmoid function or a piecewise function, such as the smooth surrogate used to approximate the 0-1 loss (Qi et al., 2019). Besides solving the empirical constrained maximization problem directly for the global optimum, we also provide two other perspectives on the problem. The first method follows Sun (2021), which incorporates the constraint function into the objective function through a penalty, transforming the original constrained problem into an unconstrained one. We reformulate the objective as

max_{g \in \mathcal{G}} \; W(g) - \lambda \max\{C(g), 0\},

where W(g) = E_P[τ(X) g(X)] and C(g) = (1 − α) E_g[Y] − ∫(1 − F_g(y))^2 dy. The penalization parameter λ is selected based on policy considerations. We refer to this model as the multi-goal objective function to distinguish it from the original optimization problem. The other method is based on Cui et al. (2022) and involves two sub-problems: a restricted problem and a relaxed problem. Instead of smooth functions, Cui et al. (2022) suggest bounding the indicator function with piecewise affine functions, applying an exact penalization method, and then adopting surrogate functions to turn the problem into solving finitely many convex problems. The solution obtained in this way is a local optimum. When the sample size is large, solving the empirical constrained maximization problem for the global optimum can become computationally expensive; in addition, we may be unable to solve the empirical problem outside a simple policy class. In these situations, this approximation method becomes very important. However, there remains a gap regarding how the local optimum performs with respect to the population welfare and the population constraint. We only give a brief discussion of this method; further analysis is left for future research.
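Operationally, the relaxation only shifts the empirical constraint level by ε_n = n^{-γ}. A minimal illustrative helper, assuming the hypothetical ipw_gini sketch from earlier in this section, is:

def in_relaxed_feasible_set(y, d, e, g, alpha, gamma=0.25):
    """Check membership in G_hat(alpha_eps) with eps_n = n**(-gamma), gamma < 1/2."""
    eps_n = len(y) ** (-gamma)
    return ipw_gini(y, d, e, g) <= alpha + eps_n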
Consistent with the optimal policy literature in econometrics, we use regret to assess the policy's performance. Regret is defined as the difference between the optimal welfare achievable in the policy class under the population constraint and the welfare attained by implementing the estimated policy under distribution P. We derive a non-asymptotic, distribution-free upper bound on the regret and show that the regret converges to 0 uniformly over the distribution class 𝒫 at the rate 1/√n. Additionally, we demonstrate that the inequality measure induced by the estimated policy approximately satisfies the population constraint, with a non-asymptotic, distribution-free upper bound on the violation.

We evaluate policies using both simulations and a real application. In these parts, we consider a pre-determined policy class based on linear functions, defined as

\mathcal{G} = \{ g : g(X) = 1(X^T \beta + \beta_0 \ge 0) \}.

To ensure a globally optimal solution to the empirical constrained problem, we begin by considering two variables, X_1 and X_2, in the policy function, which correspond to income and education in the real-world application. The parameters we aim to learn are β_0, β_1, and β_2. In the simulations, we vary the sample sizes and data generating processes and compare the welfare and Gini index under policies estimated by different methods. In the real application, we use the National Supported Work Demonstration; the data are the male sub-sample used by LaLonde (1986). It should be noted that our theoretical results are not limited to the policy class used in the simulations and application: they apply to all policy classes that satisfy the assumptions in Section 4.

2.3 Related Literature

Optimal policy learning, also known as optimal treatment rule learning, has been extensively studied in statistics, medicine, econometrics, and machine learning. Some of the literature focuses on estimating the conditional treatment effect and deciding the optimal treatment rule based on its sign (Murphy, 2003; Robins, 2004). This approach relies on correct specification of the conditional mean outcome model (Q-learning) or the conditional treatment effect model (A-learning); if the model is misspecified, the rule derived from these estimators will not be optimal. Another strand proceeds by separately estimating the conditional mean outcome (or conditional treatment effect) and then estimating the optimal policy (Qian and Murphy, 2011; Athey and Wager, 2021). As long as the estimation error in the first step satisfies certain conditions, the optimality of the policy derived in the second step can be guaranteed.

In econometrics, Manski (2004) and Manski (2009) started the literature on learning optimal treatment rules. Manski (2004) proposed Conditional Empirical Success (CES) rules, assessed their theoretical properties by maximum regret, and derived finite-sample regret bounds. CES rules partition the covariate space into finite sets and assign to each set the treatment arm that yields the highest sample average outcome. When the partition is fixed, the regret converges to zero at a rate of at least 1/√n. In the early literature, most work was based on an unconstrained policy class, where the policy is based on the sign of the conditional treatment effect. According to Hirano and Porter (2009), when estimation of the conditional average treatment effect achieves the rate 1/√n, treatment assignment rules obtained by thresholding the estimated conditional average treatment effect are asymptotically minimax-optimal. Stoye (2009) studied treatment rules that may depend arbitrarily on covariates and proved finite-sample minimax properties of such rules. Kitagawa and Tetenov (2018) studied treatment rule learning in the setting where the policy is constrained to a pre-determined policy class 𝒢 and showed that, when the propensity score is known, regret bounds relative to the best policy in 𝒢 can be obtained at the rate 1/√n, scaling with the complexity of 𝒢. Athey and Wager (2021) studied optimal policy learning in observational studies, where the propensity score is unknown and treatment assignment might be endogenous. Using a doubly robust estimator of the conditional average treatment effect and relying on the semiparametric efficiency literature, Athey and Wager (2021) proved that the learned policy asymptotically achieves the optimal rate 1/√n. While Kitagawa and Tetenov (2018) focused on non-asymptotic bounds, Athey and Wager (2021) relied on semiparametric efficiency and looked at the asymptotic rate of the optimal rule. The existing literature thus shows that the optimal convergence rate of the estimated optimal policy is 1/√n, without particular assumptions on the distribution.

Surrogates and approximations of the indicator function have been widely used in statistics and computer science to study optimal policy learning. Policy learning has drawn a lot of attention in the statistics and machine learning literature, with most applications in clinical trials: because of heterogeneous treatment effects among patients, researchers want the optimal treatment rule for patients based on their observed characteristics. Zhao et al. (2012) adopted inverse propensity weighting to derive optimal individualized treatment rules, which they call outcome weighted learning. They used the hinge loss, common in the support vector machine framework, to approximate the binary 0-1 loss arising from the discontinuous indicator function, showed that the estimator is consistent, and derived a finite-sample bound on the regret of the estimated policy. Zhang et al. (2012) used a doubly robust score to learn optimal policies from observational studies, but did not derive the 1/√n convergence rate of the regret.

Besides the mean outcome, some researchers also consider other types of targets. Kock et al. (2021) studied general functional forms of the outcome. To make the policy fair, Kitagawa and Tetenov (2021) proposed studying the treatment assignment problem by maximizing a rank-dependent social welfare function. Rank-dependent social welfare functions are commonly used in welfare economics to represent different types of social welfare: under such functions, policy makers assign more weight to less advantaged individuals (e.g., those with lower income) when making decisions. Viviano and Bradic (2020) maximized fairness, defined according to sensitive attributes, within the Pareto frontier. Both papers derived the 1/√n convergence rate of the regret, defined as the difference between the estimated social welfare (fairness) and the best obtainable welfare (fairness) within the pre-specified policy class, where the best obtainable outcome is the one that could be achieved if the population distribution were fully known.

This paper is closely related to the work of Sun (2021). The main difference is that we consider a non-linear constraint while Sun (2021) focuses on a linear budget constraint. Sun (2021) defines asymptotic optimality in terms of welfare and asymptotic feasibility in terms of the budget, proves that it is impossible to achieve both simultaneously, and provides, separately, a 1/√n convergence rate for feasibility at the cost of welfare and a 1/√n convergence rate for optimality at the cost of feasibility. We focus on the non-asymptotic upper bound on the convergence rate of the regret. Our goal is closer to the second part of Sun (2021), which takes optimality as the goal and allows some tolerance with respect to the constraint. Sun changes the objective function by combining the welfare function and the budget function, with a penalty assigned to the constraint. In Section 5 of this paper, we follow this method and study the problem with the nonlinear constraint as a comparison; we show that the combined objective does not give a more satisfying solution, from either the welfare or the equality perspective. In Section 6, we also refer to Cui and Pang (2021) to derive a locally optimal solution when the global optimum of the empirical problem cannot be obtained. Cui and Pang (2021) studied generalized chance-constrained programs involving probabilities of disjunctive nonconvex functional events; they integrated the parameterization of upper and lower approximations of the indicator function, constraint penalization, and convexification of nonconvexity and nondifferentiability via surrogation. Our setting can be viewed as an application of their method.

2.4 Theoretical Results

Whether the population-level non-convex constrained problem (2.2.1) can be solved for its global maximum is not obvious. When the potential outcomes Y(1), Y(0) are bounded, the conditional treatment effect τ(X) is also bounded, and the decision variable satisfies g(X) ∈ {0, 1}; therefore the objective function E[τ(X) g(X)] is always bounded and the global maximum of (2.2.1) exists. It is possible that multiple solutions exist, corresponding to different treatment assignments. We care about W(g): if g_1 and g_2 both achieve the maximal welfare and satisfy the constraint, we treat them as equivalent. By drawing a random sample from the population and running an experiment on the sample, we obtain information on the heterogeneous treatment effect and can formulate the empirical problem (2.2.3), whose global maximum can be computed. By solving the empirical problem, we obtain an estimated optimal solution ĝ*, which is the optimal policy when applied to the sample. The policy maker would expect that, when policy ĝ* is applied to the population, the population welfare is approximately optimal and the population constraint is approximately satisfied.

We first introduce two more standard assumptions from the causal inference literature.

Assumption 2.6 (Unconfoundedness). Y(1), Y(0) ⟂ D | X.

This assumption means that, after conditioning on observed characteristics, the potential outcomes are independent of the treatment assignment. In randomized experiments, the unconfoundedness assumption is satisfied by design.

We next introduce an assumption on the policy class 𝒢 that restricts its complexity. Since we select the optimal policy from 𝒢, we do not want 𝒢 to be too complex; otherwise the convergence rate of the regret to 0 would be badly affected.

Assumption 2.7 (VC). The class of decision rules 𝒢 has a finite VC-dimension v < ∞.

We now write the population problem explicitly:

max_{g \in \mathcal{G}} \; E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right) Y g(X)\right]    (2.4.1)
s.t. \; 1 - \frac{\int_0^M \left( E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right) g(X) 1(Y>y) + \frac{1-D}{1-e(X)} 1(Y>y)\right] \right)^2 dy}{E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right) Y g(X) + \frac{(1-D)Y}{1-e(X)}\right]} \le \alpha.

The corresponding empirical problem is

max_{g \in \mathcal{G}} \; \frac{1}{n}\sum_{i=1}^n \left(\frac{D_i}{e(X_i)} - \frac{1-D_i}{1-e(X_i)}\right) Y_i g(X_i)    (2.4.2)
s.t. \; 1 - \frac{\int_0^M \left( \frac{1}{n}\sum_{i=1}^n \left[\left(\frac{D_i}{e(X_i)} - \frac{1-D_i}{1-e(X_i)}\right) g(X_i) + \frac{1-D_i}{1-e(X_i)}\right] 1(Y_i>y) \right)^2 dy}{\frac{1}{n}\sum_{i=1}^n \left[\left(\frac{D_i}{e(X_i)} - \frac{1-D_i}{1-e(X_i)}\right) Y_i g(X_i) + \frac{(1-D_i)Y_i}{1-e(X_i)}\right]} \le \alpha_\varepsilon,

where α_ε = α + ε_n.

Lemma 2.2 [Kitagawa and Tetenov (2021), Lemma A.4]. Let ℋ be a VC-subgraph class of uniformly bounded functions: there exists H̄ < ∞ such that ||h||_∞ ≤ H̄ for all h ∈ ℋ, and ℋ has VC-dimension v < ∞. Then there is a universal constant C_1 such that

E_{P^n}\left[\sup_{h \in \mathcal{H}} |E_n(h) - E_P(h)|\right] \le C_1 \bar{H} \sqrt{\frac{v}{n}}

holds for all n ≥ 1, where C_1 is the universal constant given explicitly in Kitagawa and Tetenov (2018).

Lemma 2.2 is proved in Kitagawa and Tetenov (2018) and is also used in Kitagawa and Tetenov (2021). It connects the VC-dimension of the policy class with the corresponding VC-subgraph class, so that classical empirical process results can be applied directly.

Lemma 2.3. Under the above assumptions, and assuming 𝒢 has a finite VC-dimension v < ∞, we have

\sup_{P \in \mathcal{P}} E_{P^n}\left[\sup_{g \in \mathcal{G}} |W(g) - \hat{W}(g)|\right] \le C_1 \frac{M}{\kappa} \sqrt{\frac{v}{n}}

for all n ≥ 1.

Lemma 2.3 indicates that, uniformly over the distribution class 𝒫, the difference between the expected empirical welfare across samples and the population welfare under the same policy g is bounded from above.

Lemma 2.4 [Kitagawa and Tetenov (2021), Lemma A.5]. Let ℋ be a VC-subgraph class of uniformly bounded functions with bound H̄ and VC-dimension v < ∞. If ∫ √(P(|Y| > |y|)) dy ≤ M, then

\int E_{P^n}\left[\sup_{h \in \mathcal{H}} \left|\frac{1}{n}\sum_{i=1}^n h(Z_i) 1(|Y_i| > |y|) - E_P[h(Z) 1(|Y| > |y|)]\right|\right] dy \le 2(C_1 + 1) \bar{H} M \sqrt{\frac{v}{n}}

holds for all n ≥ 1.

Lemma 2.4 provides the tools to bound the empirical process for the Gini index: by bounding the difference between the empirical and population distributions, we can bound the difference between the empirical and population Gini index. In this paper, we first derive the following lemma, a uniform upper bound on the difference between the population Gini index and the sample Gini index. Lemma 2.5 is proved using Lemma 2.4 and indicates that, uniformly over the distribution class 𝒫, the supremum of the difference between the expected estimated Gini index and the true population Gini index under the same policy g is bounded from above.

Lemma 2.5. Under the above assumptions, and assuming the policy class 𝒢 has finite VC-dimension v < ∞,

\sup_{P \in \mathcal{P}} E_{P^n}\left[\sup_{g \in \mathcal{G}} |I_n(g) - I(g)|\right] \le (9C_1 + 8) \frac{M}{M_0} \sqrt{\frac{v}{n}}

for all n ≥ 1.
Then ^ g satises that with probability at least 1 , with = (9C 1 +8)M "n p v n , we have sup P2P E P n " sup g2G() W (g)W (^ g ) # sup P2P E P n " sup g2 ^ G(") W (g)W (^ g ) # 2C 1 M r v n : Also, we have sup P2P E P n [I(^ g )] (9C 1 + 8) M M r v n +n : where C 1 is a constant introduced above. 22 According to Theorem 2.1, the optimal solution of (2.4.2), ^ g , when applied to the population, could induce the population welfare that is at most 2C 1 M p v n lower from the maximal population welfare obtainable from the constrained policy setG(), and induce the Gini index that at most exceeds the satisfying level by n . Specically, we consider policy classG =fg :g(X) = 1( 0 + 1 X 1 + 2 X 2 0); ( 0 ; 1 ; 2 )2 Bg, whereB is the restricted parameter space. VC(G) = 3 in this case. Besides consider- ing the indicator function g(X), we could also consider a smooth function s(f(X)), where f(X) = 0 + 1 X 2 + 2 X 2 , to approximate the indicator function. The benets of adopting a smooth function for approximation is that it allows more possibilities to solve the empirical constrained optimization problem. Instead of considering the policy classG, we consider a poliy class of continuous functionsF. A natural type of smooth function to consider is the sigmoid functions(f(X)) = 1 1+e f(X) , where determined how close the smooth function is to the indicator function. The approximation function considered in (Qi et al., 2019) is as follows s(f(X)) = 8 > > > > > > > < > > > > > > > : 0 f(X)1 (1+f(X)) 2 2 1<f(X) 0 1 (1f(X)) 2 2 0<f(X) 1 1 f(X)> 1 . We plot the indicator function and several smooth approximation functions in Figure 2.1. The population and empirical optimization problem with respect to f are max f2F;I(f) W (f) (2.4.3) max f2F; ^ I(f)" ^ W (f) (2.4.4) Actually, as long as the function classF has nite VC dimension, the solution to Problem (2.4.4) also has the properties shown in Theorem 2.1. The advantage of considering a smooth 23 Figure 2.1: Approximation with smooth functions approximation function of the indicator function is that it simplies computation of the empirical optimization problem. 2.5 Multi-objective problem and chance-constraint problem However, it is not always the case that (2.2.3) can be solved for the global optimal solution. In this section, we discuss two potential ways to further study the problem. We rst transform the original problem to a multi-objective problem, by penalizing the objective function with the constraint function. We will derive the distribution-free upper bound of the dierence between the welfare induced by the solution to the empirical multi-objective problem and the maximal welfare obtainable inG() if the true distribution is known, and the upper 24 bound of the dierence between the Gini index induced by the estimated solution and the satisfying level . 2.5.1 Multi-objective problem The multi-objective problem can be written as max g2G V (g), ( E P D e(X) 1D 1e(X) Yg(X) max n C(g); 0 o ) (2.5.1) where is the penalization parameter chosen by policy makers, and C(g) = (1)E P D e(X) 1D 1e(X) Yg(X) + 1D 1e(X) Y Z M 0 D e(X) 1D 1e(X) g(X) + 1D 1e(X) 1(Y >y) 2 dy: SinceC(g) = ~ W (g) (I(g)), where ~ W (g) =E[Y (1)g(X) +Y (0)(1g(X))], the constraint C(g) 0 in (2.5.1) is equivalent to the constraint I(g) in (2.4.1). 
The corresponding empirical problem is max g2G V n (g), ( 1 n n X i=1 D i e(X i ) 1D i 1e(X i ) Y i g(X i ) max n ^ C(g); 0 o ) (2.5.2) where ^ C(g) = (1) 1 n n X i=1 D i e(X i ) 1D i 1e(X i ) Y i g(X i ) + 1D i 1e(X i ) Y i Z M 0 1 n n X i=1 D i e(X i ) 1D i 1e(X i ) g(X i ) + 1D i 1e(X i ) 1(Y i >y) ! 2 dy: 25 Lemma 2.6. Let ^ g m be the optimal solution of (2.5.2), then we have sup g2G() W (g)W (^ g m ) sup g2G V (g)V (^ g m ): Based on Lemma 2.6, we have our next theorem. Theorem 2.2. Suppose Assumptions 1-4 holds, Let ^ g m be the optimal solution of Problem (2.5.2). Then the welfare and the Gini index induced by ^ g m satisfy sup P2P E P n " sup g2G() W (g)W (^ g m ) # sup P2P E P n sup g2G V (g)V (^ g m ) 2C 1 1 +(1) 1 + 3 M r v n ; sup P2P E P n [I(^ g m )] 2C 1 (1) 1 + 3M 2 r v n : Theorem 2.2 states that by solving the empirical multi-objective problem, we could obtain a policy that obtains similar results as Theorem 2.1. When applying the empirical optimal solution to the population, the population welfare converges to the achievable optimal welfare at rate 1 p n , and the population constraint converges to 0 at rate 1 p n . However, the constant in the welfare upper bound is larger than that in Theorem 2.1, while the constraint upper has fast rate than that in Theorem 2.1. 2.5.2 Chance-constraint problem Although the methods discussed above both obtain estimated policies inducing welfare and the Gini index on population level, the dierence of which from the best policy inG() can be upper bounded, Theorem 2.1 is based on the fact that the empirical problem can be solved. Due to the non-convexity in the empirical problem, the global optimum is not always obtainable, and Multi-Objective problem might suer from slow computation. Next 26 we discuss brie y about the local optimum of the empirical problem, by applying chance- constraint method in optimization to this setting. This is based on the recent literature on solving non-convex optimization problems. As widely acknowledged in optimization lit- erature, when the constraint is non-convex, local optimum instead of global optimum is obtained. But the local optimum can be better than the global optimum of a convexied version of the original non-convex problem. Assume Z = (X;Y;D) and ~ Z = ( ~ X; ~ Y; ~ D) are from the same distribution P , we can expand the constraint into ~ C(g) =(1)E P D e(X) 1D 1e(X) Yg(X) + 1D 1e(X) Y E P " D e(X) 1D 1e(X) ~ D e( ~ X) 1 ~ D 1e( ~ X) ! g(X)g( ~ X) min(Y; ~ Y ) # E P " D e(X) 1D 1e(X) 1 ~ D 1e( ~ X) g(X) min(Y; ~ Y ) # E P " 1D 1e(X) ~ D e( ~ X) 1 ~ D 1e( ~ X) ! g( ~ X) min(Y; ~ Y ) # E P " (1D)(1 ~ D) (1e(X))(1e( ~ X)) min(Y; ~ Y ) # 0: The derivation of ~ C(g) is included in the Appendix. This constraint falls into the framework of (Cui et al., 2022). Consider specically the policy classG = g(X) = 1fX T 0g;2B . Instead of approximating the indicator function, (Cui et al., 2022) bound the indicator func- tion with piece-wise ane functions ub (t; ) = min n max 1 + t ; 0 ; 1 o 1(t 0) max n min t ; 1 ; 0 o = lb (t; ), where represents dierence performance regarding how close the piece-wise ane functions are to the indicator function. can be either set xed, for example 1, or can be treated as a variable to optimize. We set as xed. Let m 1 (Z) = E h DY e(X) i , m 0 (Z) = E h (1D)Y 1e(X) i , and q + 1 (Z; ~ Z) = (1)E h DY e(X) i , q + 2 (Z; ~ Z) = E h (1D) ~ D (1e(X))e( ~ X) + D(1 ~ D) e(X)(1e( ~ X)) min(Y; ~ Y ) i ,q + 3 (Z; ~ Z) =q + 4 (Z; ~ Z) =E h (1D)(1 ~ D) (1e(X))(1e( ~ X)) min(Y; ~ Y ) i . 
2.5.2 Chance-constraint problem

Although both methods above yield estimated policies whose population welfare and Gini index differ from those of the best policy in G(α) by amounts that can be bounded from above, Theorem 2.1 relies on the empirical problem being solved exactly. Due to the non-convexity of the empirical problem, the global optimum is not always obtainable, and the multi-objective problem may suffer from slow computation. Next we briefly discuss the local optimum of the empirical problem, applying the chance-constraint approach from the optimization literature to this setting. This builds on the recent literature on solving non-convex optimization problems: as is widely acknowledged there, when the constraint is non-convex, a local optimum rather than the global optimum is obtained, but the local optimum can be better than the global optimum of a convexified version of the original non-convex problem.

Assume Z = (X, Y, D) and Z̃ = (X̃, Ỹ, D̃) are independent draws from the same distribution P. We can expand the constraint into

\tilde{C}(g) = (1-\alpha)\, E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right) Y g(X) + \frac{(1-D)Y}{1-e(X)}\right]
  - E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right)\left(\frac{\tilde{D}}{e(\tilde{X})} - \frac{1-\tilde{D}}{1-e(\tilde{X})}\right) g(X) g(\tilde{X}) \min(Y, \tilde{Y})\right]
  - E_P\left[\left(\frac{D}{e(X)} - \frac{1-D}{1-e(X)}\right) \frac{1-\tilde{D}}{1-e(\tilde{X})} g(X) \min(Y, \tilde{Y})\right]
  - E_P\left[\frac{1-D}{1-e(X)}\left(\frac{\tilde{D}}{e(\tilde{X})} - \frac{1-\tilde{D}}{1-e(\tilde{X})}\right) g(\tilde{X}) \min(Y, \tilde{Y})\right]
  - E_P\left[\frac{(1-D)(1-\tilde{D})}{(1-e(X))(1-e(\tilde{X}))} \min(Y, \tilde{Y})\right] \le 0.

The derivation of C̃(g) is included in the Appendix. This constraint falls into the framework of Cui et al. (2022). Consider specifically the policy class 𝒢 = {g(X) = 1{X^T β ≥ 0}, β ∈ B}. Instead of approximating the indicator function, Cui et al. (2022) bound it with piecewise affine functions

\phi_{ub}(t; \rho) = \min\{\max(1 + t/\rho, 0), 1\} \;\ge\; 1(t \ge 0) \;\ge\; \max\{\min(t/\rho, 1), 0\} = \phi_{lb}(t; \rho),

where ρ governs how closely the piecewise affine functions track the indicator function; ρ can either be fixed, for example at 1, or treated as a variable to optimize. We set ρ as fixed. Let m_1(Z) = DY/e(X), m_0(Z) = (1−D)Y/(1−e(X)), and

q_1^+(Z, \tilde{Z}) = (1-\alpha)\, \frac{DY}{e(X)},
q_2^+(Z, \tilde{Z}) = \left[\frac{(1-D)\tilde{D}}{(1-e(X))e(\tilde{X})} + \frac{D(1-\tilde{D})}{e(X)(1-e(\tilde{X}))}\right] \min(Y, \tilde{Y}),
q_3^+(Z, \tilde{Z}) = q_4^+(Z, \tilde{Z}) = \frac{(1-D)(1-\tilde{D})}{(1-e(X))(1-e(\tilde{X}))} \min(Y, \tilde{Y}).

Figure 2.2: Upper and lower bounds on the indicator function

The functions q_l^-(Z, Z̃), l = 1, ..., 4, collect the terms involving (Z, Z̃) and g(X), g(X̃) that enter C̃(g) with negative signs, and the remaining term, which does not involve g, corresponds to the last term in C̃(g). If we substitute the upper and lower bounds into the positive and negative parts, respectively, in both the objective and the constraint, we obtain two subproblems. The first is the relaxed problem: in the objective function, the indicator is replaced by the upper bound function when its coefficient is positive and by the lower bound function when its coefficient is negative; in the constraint function, the indicator is replaced by the lower bound function when its coefficient is positive and by the upper bound function when its coefficient is negative. In this way, the new objective function is guaranteed to be no smaller than the original objective, and the new constraint is easier to satisfy than the original constraint. Similarly, we obtain the restricted problem by replacing the indicator function with the approximating functions in the opposite direction.

Relaxed problem:

max_{\beta \in B} \; E_P\left[m_1(Z)\, \phi_{ub,0}(X; \beta, \rho) - m_0(Z)\, \phi_{lb,0}(X; \beta, \rho)\right]    (2.5.3)
s.t. \; E_P\left[\sum_{l=1}^4 \left( q_l^+(Z, \tilde{Z})\, \phi_{lb,l}(X, \tilde{X}; \beta, \rho) - q_l^-(Z, \tilde{Z})\, \phi_{ub,l}(X, \tilde{X}; \beta, \rho) \right)\right] \le 0.

Restricted problem:

max_{\beta \in B} \; E_P\left[m_1(Z)\, \phi_{lb,0}(X; \beta, \rho) - m_0(Z)\, \phi_{ub,0}(X; \beta, \rho)\right]    (2.5.4)
s.t. \; E_P\left[\sum_{l=1}^4 \left( q_l^+(Z, \tilde{Z})\, \phi_{ub,l}(X, \tilde{X}; \beta, \rho) - q_l^-(Z, \tilde{Z})\, \phi_{lb,l}(X, \tilde{X}; \beta, \rho) \right)\right] \le 0.

It should be noted that, as ρ → 0, any accumulation point of the globally optimal solutions of the relaxed or restricted problem must be a globally optimal solution of the original problem (Cui et al., 2022).

We then analyze the restricted and relaxed problems. After bounding the discontinuous indicator function with continuous functions, the objective function becomes directionally differentiable and globally Lipschitz continuous with a Lipschitz constant Lip_0^{rst/rlx}(x; ρ) > 0 satisfying E[Lip_0^{rst/rlx}(X̃)] < ∞, and the constraint function is B-differentiable.

Lemma 2.7. After approximating the indicator function, the new objective function is Lipschitz continuous, and the new constraint is B-differentiable.
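The piecewise affine bounds are simple to code. The sketch below (illustrative, with ρ fixed at 1 as suggested above) verifies the sandwich relation φ_lb(t; ρ) ≤ 1(t ≥ 0) ≤ φ_ub(t; ρ) on a grid.

import numpy as np

def phi_ub(t, rho=1.0):
    """Upper piecewise affine bound: min(max(1 + t/rho, 0), 1) >= 1(t >= 0)."""
    return np.minimum(np.maximum(1.0 + t / rho, 0.0), 1.0)

def phi_lb(t, rho=1.0):
    """Lower piecewise affine bound: max(min(t/rho, 1), 0) <= 1(t >= 0)."""
    return np.maximum(np.minimum(t / rho, 1.0), 0.0)

t = np.linspace(-2, 2, 401)
ind = (t >= 0).astype(float)
assert np.all(phi_lb(t) <= ind) and np.all(ind <= phi_ub(t))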
The empirical counterpart of the penalized surrogate problem is max 2B 1 n n X i=1 c 0 (X i ;; ) max( 1 n 2 n X i=1 n X j=1 c 1 (X i ;X j ;; ) s ; 0) (2.5.7) 30 (Cui et al., 2022) provides an algorithm to derive the optimal solution of (2.5.7). The obtained solution ^ is a stationary point of the Problem (2.5.5). We postpone theoretical discussion of chance-constraint problem to the appendix. 2.6 Simulation In this section, we discuss the computation methods to obtain the optimal solution to the empirical problems. In this paper, we consider the policy classG =fg : g(X) = 1( 0 + 1 X 1 + 2 X 2 0)g. Variable X 1 and X 2 are chosen by the policy makers, which is the same as EWM. Because the function in the indicator function is scale-invariant, we normalize all parameters with 2 . There are 2 parameters left for our policy: ~ 1 = 1 2 and ~ 0 = 0 2 . We compare the policies obtained by our proposed method with the following policies: 1. Empirical Welfare Maximization method as proposed in Kitagawa and Tetenov (2018), which does not take Gini inequality measure into consideration; 2. plug-in method using parametric model to estimate the conditional mean outcomes; 3. plug-in method using non-parametric model to estimate the conditional mean outcomes; 4. treat everyone; 5. treat no one; 6. grid search without considering the constraint. The Empirical Welfare Maximization (EWM) method selects the policy from the pre-determined policy class by maximizing the empirical welfare without imposing the inequality constraint. Plug-in policies are not restricted to the pre-determined policy class and will select any indi- vidual who has positive treatment eect to treat. For parametric plug-in methods, we rst use linear functions to estimate conditional mean outcomes m 1 (x) =E[Y (1)j X = x] and m 0 (x) = E[Y (0)j X = x]. Then we use the estimated parameters to predict the coun- terfactual outcome Y i (1) or Y i (0) for each individual. If the predicted individual treatment 31 eect is positive, this individual will be selected to treat. The policy can be written as g(X) = 1f ^ m 1 (X) ^ m 0 (X) 0g. Non-parametric plug-in policy follows the same logic, but uses non-parametric models to m 1 (x) and m 0 (x). We use Gradient Boosting Regression in non-parametric plug-in method to predict conditional mean outcomes. We also calculate the welfare and Gini index when all individuals are treated and when all individuals are not treated as baseline comparisons. We consider dierent data generation processes. In the rst DGP, both the conditional average treatment eect and outcome are linear functions of exogeneous variables X. In the second DGP, both the conditional average treatment eect and outcome are non-linear in the exogeneous variables X. We will report both the welfare and Gini index obtained with the corresponding methods. 2.6.1 Algorithm 2.6.1.1 Grid Search This algorithm needs to loop through all pairs (i;j), which takes O(n 2 ). Within each loop, (b 1 ;:::;b n ) needs to be calculated, which takes time O(n). Therefore, the total time com- plexity of this algorithm would be O(n 3 ). With the sample size around 10; 000, the solution can be found in around 6 hours. 
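Before stating Algorithm 1, the evaluation performed inside each loop iteration can be sketched as follows. This is a simplified illustration, not the dissertation's code: the Gini computation below is a plug-in weighted Gini used as a stand-in for the IPW constraint estimator defined in (2.5.2), the satisfactory level 0.3 matches the simulation setting later in this section, and all names are assumptions.

```python
# Simplified sketch of the per-candidate evaluation step of the grid search:
# for a candidate (beta0, beta1) it computes the treatment indicators b_i,
# an IPW estimate of the welfare, and a weighted Gini index of the realized
# outcomes as a stand-in for the paper's constraint estimator C_hat(g).
import numpy as np

def evaluate_candidate(beta0, beta1, X1, X2, D, Y, e, alpha=0.3):
    """Return (welfare, gini, feasible) for the policy 1{beta0 + beta1*X1 + X2 >= 0}."""
    b = (beta0 + beta1 * X1 + X2 >= 0).astype(float)

    # IPW weights: a unit whose observed treatment matches the policy
    # recommendation stands in for everyone treated the same way by the policy.
    w = D * b / e + (1 - D) * (1 - b) / (1 - e)
    welfare = np.mean(w * Y)

    # Weighted Gini of the induced outcome distribution
    # (pairwise mean-absolute-difference formula).
    diffs = np.abs(Y[:, None] - Y[None, :])
    gini = (w[:, None] * w[None, :] * diffs).sum() / (
        2 * w.sum() ** 2 * np.average(Y, weights=w))
    return welfare, gini, gini <= alpha
```

Each call is O(n^2) because of the pairwise Gini term, which is what drives the overall O(n^3) complexity once the loop over candidate intersections is added.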
32 Algorithm 1 Constrained Grid Search Method (CGSM) For i = 1;:::;n: Line i: ~ 0 +X 1i ~ 1 +X 2i = 0 For j = 1;:::;n: Line j: ~ 0 +X 1j ~ 1 +X 2j = 0 Calculate the intersection of Line i and Line j: ( ~ 0 ; ~ 1 ) = ( X 2i X 1j X 1i X 2j X 1j X 1i ; X 2j X 2i X 1j X 1i ) Choose a direction (1;d) perpendicular to ( ~ 1 ; ~ 2 ) that satisfies ~ 0 +X 1i ~ 1 +X 2i 0 and ~ 0 +X 1j ~ 1 +X 2j 0 Move in direction (d 0 ;d 1 ) = (1;d); (1;d); (1;d); (1;d) by ": ( ~ 0 +d 0 "; ~ 1 + d 1 ") Obtain b 1 ;:::;b n using ( ~ 0 +d 0 "; ~ 1 +d 1 ") Check the Gini index constraint with (b 1 ;:::;b n ) If satisfied: Log ( ~ 0 ; ~ 1 ) into feasible set and calculate the welfare Else: Continue Select the maximal welfare from the feasible set Output the corresponding ( ~ 0 ; ~ 1 ) 2.6.1.2 MIP for the multi-objective problem We could write the inequality constraint as following: 1 n 2 n X i=1 n X j=1 D i e(X i ) 1D i 1e(X i ) D j e(X j ) 1D j 1e(X j ) 1f ~ 0 + min( ~ 1 X i1 +X i2 ; ~ 1 X j1 +X j2 ) 0g + D i e(X i ) 1D i 1e(X i ) 1D j 1e(X j ) 1f ~ 0 + ~ 1 X i1 +X i2 0g + D j e(X j ) 1D j 1e(X j ) 1D i 1e(X i ) 1f ~ 0 + ~ 1 X j1 +X j2 0g + 1D i 1e(X i ) 1D j 1e(X j ) min(Y i ;Y j ) (1 e ) 1 n n X i=1 D i e(X i ) 1D i 1e(X i ) 1f ~ 0 + ~ 1 X j1 +X j2 0g + 1D i 1e(X i ) Y i 0 33 Therefore, we formulate the whole problem as max b 1 ;:::;bn; ~ 0 ; ~ 1 1 n n X i=1 i Y i b i s.t. 1 n 2 n X i=1 n X j=1 [ i j b i b j + i j b i + j i b j + i j ]Y ij (1 e ) 1 n n X i=1 [ i b i + i ]Y i 0 (2b i 1)( ~ 0 + ~ 1 X 1 +X 2 ) 0 b i 2f0; 1g;8i = 1;:::;n where i = D i e(X i ) 1D i 1e(X i ) , i = 1D i 1e(X i ) , j = D j e(X j ) 1D j 1e(X j ) , j = 1D j 1e(X j ) ,Y ij = minfY i ;Y j g, and Y i are observed from the sampled data. (b 1 ;:::;b n ) are binary decision variables, and ~ 0 and ~ 1 are real numbers. We formulate the multi-goal objective function as a mixed integer problem. max b 1 ;:::;bn 1;2 ;:::; 1;n ;::: n1;n ; 0 1 n n X i=1 i Y i b i max ( (1) 1 n n X i=1 i Y i b i 2 n n X i=1 i 2 n X j>i j Y ij ij + (1) 1 n n X i=1 i Y i 2 n 2 n X i=1 X j>i i j Y ij ; 0 ) s.t. b i 2f0; 1g; ij 2f0; 1g; ij b i ; ij b j ; ij b i +b j 1; X T i C i b i X T i C i + 1; where constantC i satisesC i > sup 2B jX T i j. These constraints indicate that whenX T i 0, then b i = 1; otherwise b i = 0, and ij =b i b j . 34 For the algorithm on chance-constraint problem, please refer to (Cui et al., 2022). We save the simulation results of this method to the appendix. 2.6.2 Simulation Results In the simulation, we try sample size of 500, 1; 000 and 2; 000, with 500 replications, and compare the in-sample welfare as well as Gini index among dierent methods. To evaluate the policies, we calculate the welfare and Gini index that could be obtained by each policy. In the rst DGP, where the conditional mean outcomes m 1 (x) and m 0 (x) are correctly specied by linear models, the parametric plug-in method should be close to the rst best policy with respect to welfare, which is dened as the maximal welfare that could be achieved without any constraint on the policy. In the second DGP, where the conditional mean outcomes are no longer linear functions of X, we expect non-parametric plug-in method to be a good approximation of the rst best policy. It is possible that there is no feasible policy for the constrained optimization problem. WhetherG() is an empty set or not depend on the distribution of the potential outcomes and the restrictions on the equality level. 
When the pre-treatment income and the treatment eect are not very heterogeneous, we expect all policies to satisfy the constraint on the Gini index. But when the pre-treatment income is very heterogeneous and the treatment eect is relatively small, or the correlation between the treatment eect and the pre-treatment income is highly positive, we expect no policy could achieve the satisfactory level of equal- ity. In other words, if the society's original inequality index is very high, and the treatment doesn't have a large eect, we would like to either implement the treatment for more periods, or consider other types of treatment which could produce larger treatment eect. When the treatment can only enlarge the inequality, we might consider give up this type of treatment. 35 We try two dierent data generation processes as follows. DGP1: Y i = i + X T i + (X T i )D i +u i : DGP2: construct new variables ~ X = exp(X); ~ ~ X = ln(X 2 + 2) Y i = i + ( ~ X T 1 + ~ ~ X T 2 )D i +u i : In table 2.1, we report the true welfare achieved by dierent policies in the sample. The results are based on 500 simulations. As can be seen, since the conditional treatment eect is linear in covariates, the parametric plug-in method achieves the highest welfare. However, this policy is not from the pre-determined policy class. And the covariates to estimate the conditional treatment eect are not chosen by the policy maker, and might contain some sensitive covariates. All the policies could satisfy the inequality constraint with the satisfactory level at 0:3. Then among all the other methods except the parametric plug-in, CGSM could achieve the highest welfare from the pre-determined policy class, especially compared with EWM. In table 2.1, we report the estimators of the average welfare and the Gini index, as well as the corresponding standard errors. In DGP2, we change both the outcome and the conditional average treatment eect to be non-linear functions of X. As shown in table 2.2, the Gini index achieved with CGSM is much smaller than that achieved by EWM. Meanwhile, the welfare achieved with CGSM is also higher than that achieved with EWM. But the dierence is not statistically signicant. The dierence between EWM method and unconstrained grid search method provides the evidence. Both EWM and unconstrained grid search methods are trying to select the policy from the pre-determined policy class to solve the welfare maximization problem without equality constraint. EWM solves the problem via MIP, and grid search travels through all 36 500 1000 2000 CGSM Welfare 5.419 5.602 5.633 (1.653) (1.807) (1.644) Gini 0.207 0.208 0.206 (0.019) (0.018) (0.025) EWM Welfare 5.133 5.286 5.365 (1.977) (2.030) (1.821) Gini 0.190 0.201 0.199 (0.052) (0.023) (0.041) Parametric Plug-in Welfare 5.594 5.741 5.780 (1.704) (1.826) (1.663) Gini 0.209 0.208 0.209 (0.021) (0.020) (0.022) Non-Parametric Plug-in Welfare 5.448 5.619 5.678 (1.693) (1.828) (1.670) Gini 0.204 0.206 0.207 (0.017) (0.017) (0.019) Treat every one Welfare 5.086 5.250 5.316 (2.015) (2.047) (1.854) Gini 0.187 0.200 0.196 (0.054) (0.023) (0.045) Treat no one Welfare 5.143 5.290 5.255 (1.525) (1.684) (1.568) Gini 0.208 0.207 0.207 (0.019) (0.018) (0.019) Unconstrained grid search Welfare 5.374 5.563 5.595 (1.665) (1.825) (1.659) Gini 0.208 0.206 0.206 (0.018) (0.018) (0.024) Table 2.1: Simulation results from DGP1 possible derived based on the sample. It turns out grid search method could achieve higher outcome than EWM, along with the cost of consuming much longer time. 
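The structure of DGP1 introduced above can be sketched as follows. The coefficient values, the propensity score of 0.5, and the function names are placeholders chosen for illustration, not the values used in the dissertation's simulations.

```python
# Illustrative sketch of a data generation process with the same structure as
# DGP1: a linear baseline outcome and a linear conditional treatment effect in
# X, with a completely randomized treatment assignment. Coefficients are
# placeholders, not the dissertation's actual parameter values.
import numpy as np

def simulate_dgp1(n, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))                  # covariates (X1, X2)
    D = rng.binomial(1, 0.5, size=n)             # randomized treatment, e(X) = 0.5
    baseline = 5.0 + X @ np.array([1.0, 0.5])    # intercept + X'beta
    cate = X @ np.array([0.8, -0.4])             # linear treatment effect X'delta
    Y = baseline + cate * D + rng.normal(scale=1.0, size=n)
    return X, D, Y

X, D, Y = simulate_dgp1(1000)
# Under randomization, np.mean(Y[D == 1]) and np.mean(Y[D == 0]) estimate the
# mean outcomes of "treat everyone" and "treat no one", the two baseline policies.
```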
Since the treatment eect can be negative from DGP1, it is not surprising that treating everyone induces smaller welfare than the other methods. Parametric plug-in method seems to work quite well, considering both the welfare and the Gini index. This is due to the fact that the true conditional average treatment eect is a linear function in X. 37 500 1000 2000 CGSM Welfare 5.810 5.783 5.899 (2.282) (2.203) (1.981) Gini 0.332 0.333 0.331 (0.082) (0.085) (0.079) EWM Welfare 5.259 5.307 5.256 (2.048) (1.919) (2.008) Gini 0.367 0.362 0.369 (0.115) (0.106) (0.107) Parametric Plug-in Welfare 5.591 5.645 5.790 (2.017) (1.908) (1.801) Gini 0.338 0.335 0.332 (0.094) (0.094) (0.085) Non-Parametric Plug-in Welfare 6.426 6.281 6.271 (2.017) (1.953) (1.864) Gini 0.262 0.280 0.291 (0.073) (0.077) (0.074) Treat every one Welfare 4.926 5.150 5.334 (1.907) (2.026) (1.979) Gini 0.380 0.370 0.365 (0.113) (0.113) (0.104) Treat no one Welfare 5.004 5.053 5.170 (1.871) (1.717) (1.603) Gini 0.370 0.370 0.365 (0.100) (0.100) (0.090) Unconstrained grid search Welfare 6.080 6.017 6.026 (2.151) (2.078) (1.920) Gini 0.351 0.345 0.339 (0.093) (0.092) (0.085) Table 2.2: Simulation results from DGP2 We then look at the penalized objective maximization problem with MIP by varying the penalization parameter . Table 2.3 reports the welfare and Gini index obtained with the penalized objective function of sample size 500. With the 4 penalization parameters we con- sider, the penalized objective maximization problem doesn't seem to give a more satisfying policy than EWM method. 38 Welfare Gini 0.5 5.097 (2.007) 0.367 (0.121) 1.0 5.371 (2.264) 0.369 (0.116) 1.5 4.777 (1.857) 0.388 (0.124) 2.0 5.252 (1.741) 0.371 (0.096) Table 2.3: Penalized objective maximization with dierent penalization parameters 2.7 Empirical Application National Supported Work Demonstration (NSW) was a pilot experiment for further pro- grams. It was designed to help disadvantaged workers move into the labor market and was conducted at 15 sites, 10 of which were used for the research. A detailed illustration of the program can be found in (LaLonde, 1986). The NSW program assigned qualied applicants to the treatment group and control group randomly. People assigned to the treatment group could receive all the benets, including the guaranteed work experience and counseling in a sheltered environment for 9 to 18 months. The NSW program paid the treatment group participants for their supported jobs, but at a lower wage rate than the regular jobs in the market. The participants could stay on their supported jobs until their terms of the program ended. The original dataset is consisted of 6,616 individuals, with 3,214 participants in the treatment group, and 3,402 participants in the control group. Male and female participants in the treatment group usually got dierent types of supported jobs. Male participants were more involved in the construction sectors. The cost of the program was discussed in (Kem- per et al., 1981). We consider trainees' subsidized wages as transfers instead of cost, and therefore will take cost of the program as $2; 700 per participant. Earnings and other demo- graphic data were collected from both the treatment and the control group at the baseline before the treatment, and were also collected every nine months after the treatment. We use a male subset of the original dataset as used in (LaLonde, 1986), consisting of 722 sample units. There were 297 observations in the treatment group, and 425 in the control group. 
The dataset includes earnings after the program in 1978, baseline earnings in 1975, education, age, Black (1 if Black, 0 otherwise), Hispanic (1 if Hispanic, 0 otherwise), married (1 if married, 0 otherwise), and no degree (1 if no degree, 0 otherwise). The first variable is the outcome; the remaining variables are pre-treatment variables. Based on the summary statistics of the experiment, we set the satisfactory Gini index to 0.55.

In Table 2.4, we report the welfare and Gini index calculated under the optimal estimated policies of the different methods. Welfare is measured by average income in 1978, and the Gini index is also calculated from income in 1978. CGSM, EWM, and the unconstrained grid search method select policies from the pre-determined policy class G = {g(X) = 1{β_0 + β_1 X_1 + X_2 ≥ 0} : (β_0, β_1) ∈ R^2}. The parametric and non-parametric plug-in methods select unrestricted policies. If no participants are treated, the average income is $5090.048 and the Gini index is 0.582. Treating everybody increases the average income to $5191.099 and reduces the Gini index to 0.565, which indicates that this specific treatment can reduce the social inequality level. Treating only participants with a positive estimated treatment effect yields an average income between $6244.437 and $7132.743 and a Gini index between 0.513 and 0.556, depending on how the conditional average treatment effect is estimated. With a sample size of around 700, the Gradient Boosting Regression may well be overfitted, so we regard the parametric plug-in method as more convincing. When selecting policies from our pre-determined policy class, the unconstrained grid search method achieves welfare of $6195.886 and a Gini index of 0.564, slightly better than EWM. When the inequality constraint is imposed, the constrained grid search method (CGSM) reduces the Gini index to 0.550, at a small sacrifice in welfare, which falls to $6190.822. We then vary the satisfactory inequality level to trace out the trade-off between efficiency and equality.

We are also interested in which subgroups of people the different policies select to treat. We report the baseline average earnings, average years of education, and average age of the targeted groups under the different policies in Table 2.6. The Constrained Grid Search Method targets the subgroup with lower pre-treatment earnings, while EWM selects everyone to treat. The average education level does not vary much across policies.

Methods                     Welfare       Gini index
CGSM                        $6190.822     0.550
                            ($459.133)    (0.020)
Unconstrained grid search   $6195.886     0.564
                            ($444.338)    (0.022)
EWM                         $5976.352     0.565
                            ($499.078)    (0.022)
Parametric plug-in          $6244.437     0.556
                            ($514.179)    (0.027)
Non-parametric plug-in      $7132.743     0.513
                            ($318.667)    (0.023)
Treat everyone              $5191.099     0.565
                            ($467.732)    (0.022)
Treat nobody                $5090.048     0.582
                            ($314.893)    (0.017)
Table 2.4: Welfare and Gini index with NSW subset data

Satisfactory level   Welfare       Gini index
0.570                $6195.886     0.564
0.560                $6190.822     0.550
0.550                $6190.822     0.550
0.548                $6187.286     0.548
< 0.548              infeasible    infeasible
Table 2.5: CGSM by varying the satisfactory Gini level

2.8 Conclusion

In this paper, we proposed a constrained maximization problem that maximizes society's average income subject to a satisfactory social inequality level. We adopt a constrained grid search method to derive the optimal policy. We show that the estimated policy delivers a mean income that converges to the supremum of the mean income achievable in the constrained policy class, while guaranteeing that the Gini index converges to the satisfactory level.
We show that the proposed constrained grid search method performs well in both simulations and a real-data application. We consider a linear eligibility score based on pre-treatment earnings and years of education for the policy class, and we find that the constrained grid search method biases the policy towards people with lower pre-treatment income (Table 2.6).

Methods                     Earnings in 1975   Education Years   Age      Number of treated
Total                       $3042.896          10.267            24.521   722
CGSM                        $1762.644          10.304            24.296   651
Unconstrained grid search   $1840.527          10.307            24.319   658
EWM                         $3042.896          10.267            24.521   722
Parametric plug-in          $1633.567          10.607            25.182   527
Non-parametric plug-in      $2496.736          10.389            24.565   414
Table 2.6: Pre-treatment variables of the targeted group under different policies

We also approximate the discontinuous indicator function with a smooth function and solve the constrained optimization problem with Bayesian Optimization for the global maximum. We further consider changing the original objective function by penalizing it with the inequality constraint and then solving the multi-objective function by mixed integer programming. Although this approach also guarantees convergence of the mean outcome, its finite sample performance is not very satisfying, and it is computationally expensive. Lastly, we connect our problem to the chance-constraint problem in optimization, bounding the indicator function with piece-wise affine functions and applying the exact penalization method to both the population problem and the empirical problem. The limiting point obtained in this way is not guaranteed to be a global optimum of the surrogate problem, only a local one.

The constraint we consider is inequality, and we formalize the problem as a constrained optimization problem: as long as social inequality stays around an acceptable level, we focus on the mean income. Other constraints can be considered as well, for example the poverty rate, or other types of inequality measure. The constrained grid search method developed in this paper can be applied to other settings with any type of constraint. Its limitation is that it can only handle policy classes with no more than two variables; with more variables in the policy class, the constrained grid search method becomes unmanageable. If the constraint is less complex, like the poverty rate, mixed integer programming can be adopted and more variables can be incorporated into the policy. When the sample size becomes very large or the policy class is more complex, one needs to approximate the indicator function with a smooth function and apply Bayesian Optimization, or turn to the chance-constraint formulation and solve for a local optimum.

Chapter 3

Optimal Dynamic Policy Learning in Short Periods

3.1 Introduction

Optimal policy learning refers to finding the optimal treatment rules to achieve a specific goal. When the treatment is applied in multiple periods instead of only one period, the set of optimal treatment rules in all periods is called the dynamic policy. This policy specifies whom to treat in each period and is set at the beginning of all periods, unlike adaptive policies. Dynamic policies have received attention in medical science and statistics for a long time and are also known as dynamic treatment regimes, dynamic treatment decisions, dynamic treatment rules, and so on.
As introduced in Murphy (2003) and Chakraborty and Murphy (2014), "A dynamic regime consists of a sequence of decision policies, one per stage of intervention, that dictates how to individualize treatments to patients, based on evolving treatments and covariates history," and "dynamic treatments explicitly incorporate the heterogeneity in need for treatment across individuals and the heterogeneity in need for treatment across time within an individual." In this paper, we adopt the name "dynamic treatment policy" to stay consistent with the econometrics literature.

Heterogeneous treatment effects are an intrinsic motivation and a fundamental basis for research on optimal treatment policies. The concept has its roots in clinical medicine, where people with different characteristics, features, or covariates may react differently to the various treatment arms, so doctors aim to choose the best treatment for each new patient. Dynamicity corresponds to evolving characteristics and health improvements, which may require multiple stages of treatment. Although the study of treatment policies originated in clinical trials, the research is now expanding into different fields because heterogeneity and dynamicity are such general phenomena. Economists are also interested in this topic because it applies to social programs such as poverty reduction and job training, as well as to advertising and marketing projects.

In this paper, we focus on an application in marketing. Specifically, we consider a scenario where a company wants to send $10 coupons to current customers in multiple periods to maximize total profit. The treatment effect may vary among customers: some ignore the coupon altogether, while others are motivated to spend more time shopping around and purchase more than the minimum amount required to use the coupon. Moreover, customer behavior may change across periods and depend on previous purchases, making the treatment effect both heterogeneous and dynamic. This heterogeneity is the main motivation for learning optimal treatment policies. The reason for learning a dynamic policy is to take future expected outcomes into consideration when making decisions in the current period, a concept commonly seen in dynamic programming. Our goal is to develop, at the beginning of all periods, a set of policies specifying whom the coupons should be sent to in each period.

There are two main data-driven methods for estimating optimal policies. The first is the indirect method, which involves estimating a function related to the treatment effect and then selecting individuals whose estimated treatment effect exceeds a threshold. This approach includes Q-learning and A-learning. Q-learning, originally from reinforcement learning, is an approximate dynamic programming procedure that first models and estimates the Q-functions, the stage-specific conditional expectations of the sum of current and future rewards. The expected future rewards are calculated by assuming that optimal decisions are made in all future periods, given the current individual covariate history. Optimal policies are then obtained by maximizing the Q-functions. In contrast, A-learning models and estimates the regret functions, which measure the loss incurred by not following the optimal policy, and optimal policies are obtained by minimizing the regret functions.
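As a point of reference for the indirect approach just described, a minimal two-period parametric Q-learning sketch is given below. This is an illustration of the general idea, not the dissertation's code: the linear Q-function specification with treatment interactions, the scikit-learn estimator, and all variable names are assumptions.

```python
# Minimal sketch of two-period Q-learning with linear Q-functions, as described
# above. Data arrays X (n x d), D1, Y1, D2, Y2 and the feature construction are
# assumptions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

def q_learning_two_periods(X, D1, Y1, D2, Y2):
    """Backward Q-learning: fit Q2, then fit Q1 on the pseudo-outcome
    Y1 + max_{d2} Q2(H2, d2). Returns the two fitted stage models."""
    # Stage 2: history H2 = (X, D1, Y1); regress Y2 on (H2, D2, D2*H2).
    H2 = np.column_stack([X, D1, Y1])
    Z2 = np.column_stack([H2, D2[:, None], D2[:, None] * H2])
    q2 = LinearRegression().fit(Z2, Y2)

    def q2_pred(H2, d2_value):
        d2 = np.full(len(H2), d2_value, dtype=float)
        return q2.predict(np.column_stack([H2, d2[:, None], d2[:, None] * H2]))

    # Value of acting optimally in stage 2: max over d2 in {0, 1}.
    v2 = np.maximum(q2_pred(H2, 0), q2_pred(H2, 1))

    # Stage 1: pseudo-outcome = current reward + optimal future value.
    H1 = X
    Z1 = np.column_stack([H1, D1[:, None], D1[:, None] * H1])
    q1 = LinearRegression().fit(Z1, Y1 + v2)
    return q1, q2
```

The estimated rules are then read off as g_t(H_t) = 1{Q_t(H_t, 1) > Q_t(H_t, 0)}; if the linear specification of the Q-functions is wrong, these rules inherit the misspecification, which is the limitation discussed next.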
However, because of the two-step estimation process, the optimal policies in Q-learning and A-learning rely heavily on accurate estimation of the Q-functions and regret functions. If the model in the first step is misspecified or poorly fitted, the policy estimated in the second step may be far from optimal.

The second method is the direct method, which maximizes an estimator of the population mean outcome. The key is to find a good estimator for the mean outcome, and this method has gained popularity in econometrics because it avoids the misspecification problems associated with the first-stage estimation in the indirect method. The policies selected from a pre-determined policy class are flexible, and researchers focus on proving whether the mean outcome obtained by applying the estimated policy to the whole population converges to the optimal mean outcome, and on deriving the optimal convergence rate. In the static setting, 1/√n has been shown to be the optimal convergence rate in both experimental and observational settings.

In this paper, we focus on the direct method in short-period dynamic settings and present theoretical results on the performance of the approximate optimal policy when the population expected outcome is estimated with the inverse probability weighted estimator and with the doubly robust estimator. Our goal is to select the best policy from pre-determined policy classes for each period, taking the purchase behavior in previous periods into account. As in the static setting, our method consists of two stages. In the first stage, we draw information from an existing experiment or observational study and obtain a score for each group of individuals; the score can be the expected current and future outcomes. In the second stage, we take the scores as fixed and maximize the estimated mean outcome function, which is a function of the fixed scores and of the decision variables indicating whether to treat specific groups. The decision variable is an indicator function belonging to the pre-determined policy class. Unlike the current literature, we do not use surrogate functions to approximate the indicator functions. Instead, we work with the binary loss directly to obtain the optimal policy and derive the statistical properties of the corresponding estimated policy. As in the current literature, we use backward induction to solve the dynamic problem.

In contrast to the static setting, when estimating the policy for the current period we need to consider both the current period's outcome and the expected outcome of future periods given the current period's policy. With the backward induction method, we solve the maximization problem starting from the last period and plug the estimated optimal policies into the model to obtain the optimal expected future outcomes as we move backward and solve the maximization problems period by period. In this method, when solving the maximization problem for the current period, we assume that the estimated optimal policies will be implemented in all future periods. The regret is defined as the difference between the best possible expected total outcome and the expected total outcome obtained by implementing the set of estimated policies. Theoretically, we prove a finite-sample, distribution-free upper bound on the regret of the estimated policy. Computationally, we use mixed integer programming to solve the non-convex optimization problems directly in the different periods. We first consider the simplest case of
We rst consider the simplest case of 47 two periods when the propensity score is known. Then we extend the approach to longer periods with known propensity score and also consider the case when the propensity score is unknown in short periods. This paper extends the direct maximization method in static settings to the dynamic setting and show the theoretical properties of the estimated policy. However, we recently notice that Sakaguchi (2021) considers the same method in tackling down the problem. Moreover, they not only consider the backward induction method, but also consider the method that simultaneously solves the policies for all periods. In statistics literature, Zhao et al. (2015) has talked about both the backward outcome weighted learning (BOWL) and simultaneous outcome weighted learning (SOWL), and shown that SOWl is more computationally inecient but could guarantee the consistent estimation while BOWL might fail. Sakaguchi (2021) shows the similar result with 0-1 loss. While this paper was started in 2021, I need to admit that we follow quite similar method as Sakaguchi (2021), and I only consider the backward induction method. This paper is structured as follows. We begin by introducing the related literature in the rest part of the rst section. In the second section, we formulate the problem mathematically and present the model explicitly, including cases with both known and unknown propensity scores. In the third section, we provide theoretical proofs for the properties of the solu- tions obtained from solving the optimization problems. In the fourth section, we conduct simulations to compare the methods studied in this paper with indirect methods, such as Q-learning method. Finally, we conclude the paper in the last section. 3.1.1 Related literature The topic of optimal dynamic treatment policies has been widely studied in various elds, and has gained popularity in econometrics. Q-learning, a classic method from reinforcement learning, is often used to estimate optimal dynamic treatment policies. As an alternative 48 to Q-learning, Murphy (2003) proposed A-learning, which estimates the regret function instead of the conditional mean functions. However, both Q-learning and A-learning have two main limitations. First, model misspecication can lead to an estimated policy that is far from optimal. Second, estimation is based on minimizing the prediction error in tting the conditional mean functions or regret functions, which may not necessarily result in maximal long-term welfare. To overcome the model misspecication problem, many researchers have attempted to use more exible non-parametric regression techniques (Qian and Murphy, 2011; Zhao et al., 2012; Chakraborty and Moodie, 2013)). However, the mitigation of the risk of model misspecication often results in uninterpretable estimated policies. To ensure the interpretability of estimated policies, Zhang et al. (2018) proposed a tree-based decision list based on estimated Q-functions. The direct method of maximizing the estimator of the expected welfare has been well studied in static setting of optimal policy estimation, both in statistical theory and computational eciency. Kitagawa and Tetenov (2018) estimated the population expected welfare with inverse probability weighted estimator and proved the n 1=2 convergence rate of the regret when the propensity score was known in the data used to derive the estimated policy. 
Athey and Wager (2021) showed that when observational data are used to estimate the policy, the policy obtained by maximizing the doubly robust estimator of the expected welfare function achieves the optimal convergence rate n^{-1/2}. When the treatment consists of multiple stages, the inverse probability weighted estimator has also been adopted to estimate the expected welfare function in each stage. Zhao et al. (2015) proposed backward outcome weighted learning and simultaneous outcome weighted learning, and suggested replacing the 0-1 loss with a convex surrogate loss to solve the maximization problem. A number of papers adopt the doubly robust estimator (Pan and Zhao, 2021; Tao et al., 2018; Nie et al., 2021). In econometrics, Heckman and Navarro (2007) consider the marginal treatment effect but do not aim at identification. Han (2019) relaxes the sequential randomization assumption commonly adopted in the dynamic treatment regime literature, allows for non-compliance, and characterizes the sharp set of policy rules.

This paper builds on the static-setting literature (Kitagawa and Tetenov, 2018; Athey and Wager, 2021) and is closely related to the dynamic-setting literature (Tao et al., 2018; Nie et al., 2021). When the propensity score is known in the given data, we propose to use the IPW estimator of the expected welfare function and to maximize the non-convex estimated welfare function directly. With data from observational studies, we propose to use the doubly robust estimator. Different from Nie et al. (2021), we consider decisions on whom to treat in each stage, instead of when to start treatment for different subgroups, but we are limited to settings with at most around 10 periods. As for the class of decision policies, we use linear policies, and refer to Tao et al. (2018) for tree-based policies, because of their simplicity and interpretability. We derive the convergence rate of the regret guarantees of the estimated policies in dynamic settings with short horizons.

3.2 Model Formulation

We want to design policies for sending coupons to existing customers over the next few periods. The outcome we care about is the purchase amount. Whether we send a coupon to a specific customer depends on previous purchases, previous treatment assignments, and other observed covariates. We start with the two-period case, in which the optimal policy is learned from a sequential two-stage randomized trial.

Let the data be a size-n random sample of Z_i = (X_i, D_1i, Y_1i, D_2i, Y_2i). X_i ∈ X ⊂ R^{d_x} denotes the observable pre-treatment covariates of individual i, and D_1i, D_2i ∈ {0, 1} are binary indicators of the individual's treatment assignments in the two periods. Denote the history information in the first period by H_1i = X_i and the history information in the second period by H_2i = (X_i, D_1i, Y_1i). The population from which the sample is drawn is characterized by P, the joint distribution of (X_i, D_1i, Y_1i(1), Y_1i(0), D_2i, Y_2i(0,0), Y_2i(0,1), Y_2i(1,0), Y_2i(1,1)). The total welfare is defined as the sum of the outcomes in the two periods,

W(g_1, g_2) = W_1(g_1, g_2) = E_P[Y_1(g_1) + Y_2(g_1, g_2)],

and for simplicity we set the discounting factor to 1.
The optimal policy is the policy that maximizes W (g 1 ;g 2 ): (g 1 ;g 2 )2 arg max (g 1 ;g 2 )2G 1 G 2 W (g 1 ;g 2 ): In the direct method, We rst need to nd an estimator ^ W (g 1 ;g 2 ) for the welfare function, and then nd the policy that maximizes the estimator: (^ g 1 ; ^ g 2 )2 arg max (g 1 ;g 2 )2G 1 G 2 ^ W (g 1 ;g 2 ): Following the literature, we use the utilitarian regret to evaluate the estimated policy. The utilitarian regret of deploying decision policy (g 1 ;g 2 ) is dened as the welfare dierence between the current policy (g 1 ;g 2 ) and the optimal policy (g 1 ;g 2 ) when applying the policies to the population: R(g 1 ;g 2 ) =W (g 1 ;g 2 )W (g 1 ;g 2 ) = max (g 1 ;g 2 )2G 1 G 2 fE P [Y 1 (g 1 (H 1 )) +Y 2 (g 1 (H 1 );g 2 (H 2 ))]g E P [Y 1 (g 1 (H 1 )) +Y 2 (g 1 (H 1 );g 2 (H 2 ))]: 51 To guarantee that we obtain satisfactory welfare when applying the estimated optimal policy to the population, we would like the population welfare converges to the obtainable optimal welfare regardless of the distribution P , and the convergence rate of the worst-case welfare loss to be fast enough (achieves n 1=2 ). We adopt the backward learning method to estimate the policies period by period. When there are only two periods, we rst estimate the policy in the second period. The welfare in the second period is expressed as W 2 (g 2 ) =E P [Y 2 (D 1 ;g 2 (H 2 ))] =E P [Y 2 (D 1 ; 1)g 2 (H 2 ) +Y 2 (D 1 ; 0)(1g 2 (H 2 ))]: The optimal policy in the second period can be obtained by solving the estimator of the second period's welfare function: ^ g 2 2 arg max g 2 2G 2 ^ W 2 (g 2 ): Then we substitute ^ g 2 into the outcome function by assuming the we will adopt the optimal policy in the second period, and further estimate the optimal policy in the rst period by maximizing the estimator of future's expected outcomes. W (g 1 ; ^ g 2 ) =E P [Y 1 (g 1 (H 1 )) +Y 2 (g 1 (H 1 );g 2 (H 2 ))] ^ g 1 2 arg max g 1 2G 1 ^ W (g 1 ; ^ g 2 ): 52 Then we introduce the estimator of the welfare function. When a sequential randomized experiment is possible, the propensity score is known, therefore we can count on the IPW estimator. We write the population expected welfare functions as follows: W 2 (g 2 ) =E P D 2 g 2 (H 2 )Y 2 e 2 (H 2 ) + (1D 2 )(1g 2 (H 2 ))Y 2 1e 2 (H 2 ) ; W (g 1 ;g 2 ) =W 1 (g 1 ;g 2 ) =E P D 1 g 1 (H 1 )Y 1 e 1 (H 1 ) + (1D 1 )(1g 1 (H 1 ))Y 1 1e 1 (H 1 ) +E P D 1 g 1 (H 1 ) e 1 (H 1 ) + (1D 1 )(1g 1 (H 1 )) 1e 1 (H 1 ) D 2 g 2 (H 2 ) e 2 (H 2 ) + (1D 2 )(1g 2 (H 2 )) 1e 2 (H 2 ) Y 2 =E P D 1 g 1 (H 1 ) e 1 (H 1 ) + (1D 1 )(1g 1 (H 1 )) 1e 1 (H 1 ) D 2 g 2 (H 2 ) e 2 (H 2 ) + (1D 2 )(1g 2 (H 2 )) 1e 2 (H 2 ) (Y 1 +Y 2 ) : The estimated welfare functions are ^ W 2 (g 2 ) = 1 n n X i=1 D 2 g 2 (H 2 )Y 2 e 2 (H 2 ) + (1D 2 )(1g 2 (H 2 ))Y 2 1e 2 (H 2 ) ; ^ W (g 1 ;g 2 ) = ^ W 1 (g 1 ;g 2 ) = 1 n n X i=1 " D 1 1fH 1 2g 1 g e 1 (H 1 ) + (1D 1 )1fH 1 = 2g 1 g 1e 1 (H 1 ) D 2 1fH 2 2g 2 g e 2 (H 2 ) + (1D 2 )1fH 2 = 2g 2 g 1e 2 (H 2 ) (Y 1 +Y 2 ) # : Similarly, when information of sequential randomized experiments of more periods are avail- able, we could estimate the optimal policy for multiple periods. Suppose there areT periods in total. The population welfare function in the last period T is W T (G T ) =E P [Y T (D 1 ;:::;D T1 ;g T (H T ))]; 53 and in any period t before the last period is W t (g t ;:::;g T ) =E P [Y t (D 1 ;:::;D t1 ;g t (H t )) + +Y T (D 1 ;:::;D t1 ;g t (H t );:::;g t (H t ))]: Same as in the two-period case, we use backward induction to estimate the optimal policies. 
We rst estimate the optimal policy in the last period T : max g T 2G T ^ W T (g T ): After obtaining the optimal policies in later period (^ g t+1 ; ^ g t+2 ;:::; ^ g T ), we continue the esti- mate the optimal policy in period t: max Gt2Gt ^ W t (g t ; ^ g t+1 ;:::; ^ g T ); where ^ W t (g t ; ^ g t+1 ;:::; ^ g T ) = 1 n n X i=1 " D t g t (H t ) e t (H t ) + (1D t )(1g t (H t )) 1e t (H t ) T Y t 0 =t+1 D t g t 0(H t 0) e t 0(H t 0) + (1D t 0)(1g t 0(H t 0)) 1e t 0(H t 0) T X t 0 =t y t 0 # : In this way, we obtain the set of optimal policies (^ g 1 ;:::; ^ g T ). 3.3 Theoretical Properties When the data come from sequential multiple assignment randomized. trials, the propensity score is known. In this section, We will show that the estimated optimal policies obtained by maximizing the inverse probability weighted estimator of the welfare functions can guarantee that the population welfare converges to the optimal welfare at the optimal convergence 54 rate. We rst introduce some assumptions that we need to impose on the data. Besides the classic assumptions on the potential outcomes and treatment assignments, we also impose an assumption on the complexity of the policy classG t . Assumption 3.1. 1. Sequential randomization of the treatment assignment: the treatment in periodt is assigned independently of the potential outcomes of periodt, conditional on the information that can be observed before period t. Y t (D 1 ;:::;D t1 ; 0);Y t (D 1 ;:::;D t1 ; 1)?D t jH t = (X;D 1 ;:::;D t1 ;Y 1 ;:::;Y t1 );8t: 2. The potential outcomes in periodt,Y t (;d 1 ;:::;d t1 ;D t ) is independent of future treatment assignments (D t+1 ;:::;D T ). Y t (D 1 ;:::;D t1 ; 0);Y t (D 1 ;:::;D t1 ; 1)? (D t+1 ;:::;D T );8t: 3. Bounded outcomes:9M <1;s:t: the support of outcome variablesY t , wheret = 1;:::;T , lies in [ M 2 ; M 2 ]. Therefore, the treatment eect is bounded by M. Y t (D 1 ;:::;D t1 ; 0);Y t (D 1 ;:::;D t1 ; 1)2 [ M 2 ; M 2 ];8t: 4. Strict overlap: The propensity score is denoted by e t (H t ) = Pr(D t = 1jH t =H t ), where t = 1;:::;T . Assume that92 (0; 1=2),such that the propensity scores satisfy e t (H t )2 [; 1]. e t (H t )2 [; 1];8t 5. The policy class in any period t:G t , where t = 1;:::;T , has nite VC-dimension v t <1 and is countable. 55 Regarding interpretability and straightforwardness, the popular choices of policy classes include classes of linear policies and classes of trees. Denition 3.1. First best policy in periodt: policyG t is called the rst best policy in period t if E P " T X s=t Y s (D 1 ;:::;D t1 ;G t ;G t+1 ;:::;G s )Y s (D 1 ;:::;D t1 ; 1G t ;G t+1 ;:::;G s )jH t =H t # 0: In other words, the rst best policy treats everyone whose conditional average treatment eect, considering current and future periods, is positive. Linear policy classes. Regarding the linear policy classes, the set in period t can be expressed as G t;LR = H t 2R d H t : t;0 + H T t;sub t;sub 0 : ( t;0 ; t;sub )2R d t;sub +1 The VC-dimension ofG t;LR is v t =d t;sub + 1<1. Decision Trees. Trees represent treatment policies recursively. In each period, a depth-L (L 1) decision tree T L is represented by a splitting variable x j ;j2 1;:::;d x , a threshold k, and two depth-(L 1) decision trees T (L1);A ;T (L1);B . T L (x) =T (L1);A (x) if x j k and T L (x) = T (L1);B (x) if x j > k. To split the tree, we adopt the purity measure proposed by Laber and Zhao (2015). 3.3.1 Simple case: two periods Consider a company which plans to send coupons to customers to increase prot. 
The set of policies consists of two periods. The goal is to maximize customers' purchase amount. Sup- pose the company rst conducts a two-period randomized experiment. The assignments of 56 treatment in two periods are both randomized conditional on observed covariates. We adopt the potential outcome framework, under which the potential outcomes in the rst period Y 1 (0);Y 1 (1) and potential outcomes in the second period Y 2 (0; 0);Y 2 (0; 1);Y 2 (1; 0);Y 2 (1; 1) are xed. With the experiment, we collect datafX i ;D i1 ;Y i1 ;D i2 ;Y i2 g n i=1 . We estimate the optimal policy with IPW estimator in the following steps. If the sequential experiments are completely random, then propensity score e t (H t ) is constant. In the strati- ed randomized experiments, although the propensity score is no longer constant, it is still known by design. Step1: we estimate the optimal policy for period 2 ^ g 2 2 arg max g 2 2G 2 1 n n X i=1 D 2 g 2 (H 2 ) e 2 (H 2 ) + (1D 2 )(1g 2 (H 2 )) 1e 2 (H 2 ) Y 2 Step2: assuming that the policy taken in period 2 is the estimated solution in step 1, we estimate the optimal policy for period 1 ^ g 1 2 arg max g 1 2G 1 1 n n X i=1 " D 1 g 1 (H 1 ) e 1 (H 1 ) + (1D 1 )(1g 1 (H 1 )) 1e 1 (H 1 ) D 2 g 2 (H 2 ) e 2 (H 2 ) + (1D 2 )1fH 2 = 2 ^ g 2 g 1e 2 (H 2 ) (Y 1 +Y 2 ) # After obtaining the estimated policy (^ g 1 ; ^ G 2 ), we evaluate the policy by analyzing the regret W (g 1 ;g 2 )W (^ g 1 ; ^ G 2 ) =W (g 1 ;g 2 ) ^ W (g 1 ;g 2 ) + ^ W (g 1 ;g 2 ) ^ W (^ g 1 ; ^ G 2 ) + ^ W (^ g 1 ; ^ G 2 )W (^ g 1 ; ^ G 2 ) 2 sup (g 1 ;g 2 )2G 1 G 2 W (g 1 ;g 2 ) ^ W (g 1 ;g 2 ) : 57 The regret is upper bounded by the above empirical process. Theorem 3.1. Upper bound: Under the above assumptions, sup P2P(M;) E P n[W (g 1 ;g 2 )W (^ g 1 ; ^ g 2 )]C 0 M 2 r v 1 +v 2 n where C 0 = 48 R 1 0 q ln( p 2K) + 1 + 2 ln(16e) 2 ln"d" q e (e1) log 2 ln( 2e log 2 ). This upper bound is uniform over distribution classP that satises Assumption 3.1-(3), (4). Theorem 3.2. Lower bound: sup P2P(M;) E P n[W (g 1 ;g 2 )W (^ g 1 ; ^ g 2 )] expf2 p 2g r v 1 +v 2 n Based on Theorem 3.1 and Theorem 3.2, we know that the convergence rate of the regret of the estimated optimal policy is guaranteed to be 1= p n. 3.3.2 Multiple periods In multiple-period sequential randomized experiment, if the experiments are completely ran- domized, then the true propensity score of all subgroups is known to be constants. However, if there are many periods, then the product of propensity scores of all periods might still be very close to 0. In this paper, we only consider short periods T 6. So we don't need to worry about the above problem. Same as the two-period case, we start from estimating the optimal policy in the last period by maximizing the IPW estimator of the welfare function. ^ g T 2 arg max g T 2G T 1 n n X i=1 D iT g T (H iT ) e T (H iT ) + (1D iT )(1g T (H iT )) 1e T (H iT ) Y iT : 58 After obtaining the estimated optimal policy in t + 1;:::;T , we continue to period t: ^ g t 2 arg max gt2Gt 1 n n X i=1 " D it g i (H it ) e t (H it ) + (1D it )(1g t (H it )) 1e t (H it ) T Y t 0 =t+1 D it ^ g t 0(H it 0) e t 0(H it 0) + (1D it 0)(1 ^ g it 0(H it 0)) 1e t 0(H it 0) T X t 0 =t y t 0 # : Theorem 3.3. Upper bound: Under the above assumptions, sup P2P(M;) E P n [W (g 1 ;:::;g T )W (^ g 1 ;:::; ^ g T )]C 0 TM 2 T s P T t=1 v t n ; where C 0 is the same constant as in the two periods.This upper bound is uniform over dis- tribution classP that satises Assumption 3.1-(3), (4). 
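For concreteness, the two-step backward procedure of Section 3.3.1 with a known propensity score can be sketched as follows. For illustration the sketch searches over randomly drawn candidate linear rules instead of the mixed integer program used in the paper; the candidate search and all names are assumptions.

```python
# Compact sketch of the backward procedure for two periods with known
# propensity scores e1, e2, over linear rules 1{H @ theta >= 0}. The crude
# random candidate search stands in for the paper's MIP solver.
import numpy as np

def ipw_value(keep, D, e, Y):
    """IPW estimate of the welfare when the policy recommendation is `keep`."""
    return np.mean((D * keep / e + (1 - D) * (1 - keep) / (1 - e)) * Y)

def best_linear_rule(H, score_fn, n_candidates=2000, seed=0):
    """Maximize score_fn over rules 1{H @ theta >= 0} (intercept included in H)."""
    rng = np.random.default_rng(seed)
    best_theta, best_val = None, -np.inf
    for theta in rng.normal(size=(n_candidates, H.shape[1])):
        val = score_fn((H @ theta >= 0).astype(float))
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta

def backward_ewm_two_periods(X, D1, Y1, D2, Y2, e1, e2):
    H1 = np.column_stack([np.ones(len(X)), X])            # (1, X)
    H2 = np.column_stack([np.ones(len(X)), X, D1, Y1])    # (1, X, D1, Y1)

    # Step 1: second-period rule maximizing the IPW estimate of W_2(g_2).
    theta2 = best_linear_rule(H2, lambda g2: ipw_value(g2, D2, e2, Y2))
    g2 = (H2 @ theta2 >= 0).astype(float)

    # Step 2: first-period rule, weighting each observation by whether the
    # observed D2 agrees with the estimated second-period rule, applied to Y1 + Y2.
    w2 = D2 * g2 / e2 + (1 - D2) * (1 - g2) / (1 - e2)
    theta1 = best_linear_rule(
        H1,
        lambda g1: np.mean((D1 * g1 / e1 + (1 - D1) * (1 - g1) / (1 - e1)) * w2 * (Y1 + Y2)),
    )
    return theta1, theta2
```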
3.3.3 Unknown propensity score Sequential experiments are not always available, and more common data resource is obser- vational studies. In the observational studies, the propensity score needs to be estimated. In the static setting, Kitagawa and Tetenov (2018) has shown that when the propensity score is unknown, the estimated policy based on the IPW estimator of the welfare function can only achieve a convergence rate of the regret slower than n 1=2 . This also holds in the dynamic setting. Therefore, we consider both the IPW estimator with estimated propensity score and the doubly robust estimator of the welfare functions in this case. Because the history information is expanding with time, which means that the dimension of covariates increases over time, it's possible that the estimated propensity score is quite close to 0 or 1. Denotea t = min Ht2Ht e t (H t )(1e t (H t )). Ifa t ! 0, there is limited overlap. In this paper, since we only consider a short period, we assume a t to be away from 0. When considering longer periods, we could apply regularization to estimating the propensity 59 score and adopt the IPW estimator with the regularized estimated propensity score. When estimating the propensity scores, We refer to the regularized calibrated estimator ((Tan, 2020)). When using IPW estimator, the optimization problem formulation wiis the same as above, except we replace the true propensity score with the estimated propensity score. In period T : ^ g p T 2 arg max g T 2G p T 1 n n X i=1 D iT g T (H iT ) ^ e T (H iT ) + (1D iT )(1g(H iT )) 1 ^ e T (H iT ) Y iT : In period t: ^ g p t 2 arg max G p t 2Gt 1 n n X i=1 " D t g t (H it ) ^ e t (H t ) + (1D t )(1g t (H it )) 1 ^ e t (H t ) T Y t 0 =t+1 D t 0^ g p t 0 (H t 0) ^ e t 0(H t 0) + (1D t 0)(1 ^ g p t 0 (H t 0)) 1 ^ e t 0(H t 0) T X t 0 =t Y it 0 # : Assumption 3.2. For a class of data generating processP e , the estimator of propensity score ^ e t (H t ) satises sup P2Pe sup t2f1;:::;Tg n E P n " 1 n n X i=1 D it e t (H it ) D it ^ e t (H it ) # <1 60 Denote ~ W t as the empirical welfare function using estimated propensity score. After obtain- ing (^ g p 1 ;:::; ^ g p T ), we evaluate the policy by analyzing the regret W (g 1 ;:::;g T )W (^ g p 1 ;:::; ^ g p T ) =W (g 1 ;:::;g T ) ^ W (g 1 ;:::;g T ) + ^ W (g 1 ;:::;g T ) ~ W (g 1 ;:::;g T ) + ~ W (g 1 ;:::;g T ) ~ W (^ g p 1 ;:::; ^ g p T ) + ~ W (^ g p 1 ;:::; ^ g p T ) ^ W (^ g p 1 ;:::; ^ g p T ) + ^ W (^ g p 1 ;:::; ^ g p T )W (^ g p 1 ;:::; ^ g p T ) 2 sup (g 1 ;:::;G T )2G 1 G T W (g 1 ;:::;g T ) ^ W (g 1 ;:::;g T ) + 2 sup (g 1 ;:::;g T )2G 1 G T ^ W (g 1 ;:::;g T ) ~ W (g 1 ;:::;g T ) : Since (^ g p 1 ;:::; ^ g p T ) is the optimal solution to ~ W (g 1 ;:::;g T ), so we have ~ W (g 1 ;:::;g T ) ~ W (^ g p 1 ;:::; ^ g p T ) 0. As shown in the inequality equation, the regret is upper bounded by the above two terms. The rst term is the same as the empirical process using true propensity score. Theorem 3.4. Upper bound: Under the above assumptions, sup P2P(M;) E P n[W (g 1 ;:::;g T )W (^ g p 1 ;:::; ^ g p T )]C 0 TM 2 T s P T t=1 v t n +O( 1 n ): C 0 is a constant as in previous theorems. This upper bound is uniform over distribution class P that satises Assumption 3.1-(3), (4). Denote conditional mean outcomes as m 1 (d 1 ; H 1 ) = E[Y 1 (d 1 ) j H 1 ], and m 2 (d 2 ; H 2 ) = E[Y 2 (D 1 ;d 2 )j H 2 ]. The information of D 1 in the second period's conditional mean outcome functions is contained in H 2 . Conditioning on H 2 , D 1 is also xed. 
We impose an assumption on the estimation error of the conditional mean outcome function. 61 Assumption 3.3. For a class of data generating processP e , the estimator of conditional mean outcome ^ m t (D t ;H t ) satises sup P2Pe sup t2f1;:::;Tg n E P n " 1 n n X i=1 j ^ m t (D it ;H it )m t (D it ;H it )j # <1: If we obtain the optimal solution (g m 1 ;g m 2 ) by solving the problem ^ g 2 2 arg max g2G 1 n n X i=1 ^ 2 (g 2 ; H 2i ) and ^ g 1 2 arg max g2G 1 n n X i=1 ^ 1 (g 1 ; ^ g 2 ; H 1i ) where ^ 2 and ^ 1 are estimators of 2 (g 2 ; H 2 ) =m 2 (1; H 2 )g 2 (H 2 ) +m 2 (0; H 2 )(1g 2 (H 2 )) and 1 (g 1 ;g 2 ; H 1 ) =g 1 (H 1 )(m 1 (1; H 1 ) +E P [m 2 (1; H 2 )g 2 (H 2 ) +m 2 (0; H 2 )(1g 2 (H 2 ))j H 1 ;D 1 = 1]) + (1g 1 (H 1 ))(m 1 (0; H 1 ) +E P [m 2 (1; H 2 )g 2 (H 2 ) +m 2 (0; H 2 )(1g 2 (H 2 ))j H 1 ;D 1 = 0]) respectively. Theorem 3.5. If the basic assumptions in previous sections as well as Assumption 3.3 hold, sup P2P E P n[W (g 1 ;g 2 )W (^ g dr 1 ; ^ g dr 2 )]C 0 M 2 r v 1 +v 2 n +O( 1 n ): 62 3.4 Simulation In this section, we run simulations to evaluate the performance of the backward rules in two periods. The policy lies in the policy class G 1 =f(x 1 ;x 2 )2R 2 : 1;0 + 1;1 x 1 + 1;2 x 2 0; ( 1;0 ; 1;1 ; 1;2 )2R 3 g; G 2 =f(x 1 ;x 2 ;y 1 )2R 2 : 2;0 + 2;1 x 1 + 2;2 x 2 + 2;3 y 1 0; ( 2;0 ; 2;1 ; 2;2 ; 2;3 )2R 4 g: Besides using the binary loss, we also consider using surrogate functions to smooth the em- pirical welfare functions and solve the surrogate welfare functions. The policies we would like to compare include: backward rule with IPW estimator, backward rule with surrogate IPW estimator We compare the welfare obtained from our estimated optimal dynamic policies with the welfare obtained from parametric Q-learning methods, treating no one, and treat- ing everyone. It should be noticed that the linear plug-in policies we use as comparisons are also limited to the restricted policy class. We consider the case when the propensity score is known. In the rst data generation process, we consider the treatment eect to be linear in observable covariates. In the second data generation process, we consider non-linear functions of treatment eects. We run 500 simulations, considering dierent sample sizes n = 100; 200; 500; 1000; 1500. We report DGPs as follows. In DGP1, treatment eect is a linear function of observed variables. In DGP2, treatment eect is a non-linear function of observed variables. DGP1: Let X = (X 1 ;X 2 ;X 3 ;X 4 ;X 2 1 ;X 1 X 2 ), and H 2 = (X;D 1 ;Y 1 ). Y i1 = + (X T i 1 )D i1 + X T i 1 +u i1 ; Y i2 = (H T i2 2 )D i2 + H T i2 2 +u i2 : 63 DGP2: Y i1 = + X T i 1 + (30X 3 i2 0:3 ln(X 2 i2 + 0:0001) + 2X i1 X i2 + 2:5X 2 i1 )D i1 + X T i 1 +u i1 ; Y i2 = (20Y 3 i1 + 0:2 ln((100Y i1 X i1 ) 2 + 0:001) 100Y i1 X i1 X i2 )D i2 + H T i2 2 +u i2 : We report in the table both the median and the mean of the out-of-sample welfare from the simulations. As can be seen in Table 3.1 and Table 3.2, the backward EWM method outperforms the other methods when the treatment eect is not linear in the observed variables. 
As shown in Table 3.1, the Q-learning method performs very similarly to the backward rule with IPW. This is because in DGP 1 the treatment effect is linear in the covariates, so Q-learning with linear functions performs well. In Table 3.2, the treatment effect function is no longer linear in the covariates, and Q-learning with linear functions performs much worse than backward EWM because of misspecification.

Methods                   100     200     500     1000    1500
Backward EWM     mean     1.047   1.000   0.960   0.994   1.038
                 median   1.049   1.000   0.960   0.992   1.041
Q-learning       mean     1.050   1.001   0.959   0.997   1.040
                 median   1.052   1.001   0.961   0.998   1.041
Treat everyone   mean     1.098   1.010   0.905   1.006   1.072
                 median   1.098   1.010   0.905   1.006   1.072
Treat no one     mean     0.994   0.992   1.028   0.986   1.007
                 median   0.994   0.992   1.028   0.986   1.007
Table 3.1: Welfare from DGP 1

Methods                   100     200     500     1000    1500
Backward EWM     mean     3.048   3.505   3.559   3.972   3.689
                 median   3.146   3.716   3.813   4.352   4.495
Q-learning       mean     3.012   3.057   3.376   2.998   3.143
                 median   3.151   3.161   3.275   2.761   3.033
Treat everyone   mean     3.306   3.311   3.080   3.256   3.118
                 median   3.306   3.311   3.080   3.256   3.118
Treat no one     mean     0.990   1.009   0.988   1.036   1.017
                 median   0.990   1.009   0.988   1.036   1.017
Table 3.2: Welfare from DGP 2

3.5 Conclusion

In this paper, we extend the Empirical Welfare Maximization method (Kitagawa and Tetenov, 2018) to a dynamic setting with short horizons. When the propensity score is known, we rely on the IPW estimator of the mean outcome, using either the true propensity score or an estimated regularized propensity score, to compute the empirical welfare function. When the propensity score is unknown, we rely on the doubly robust estimator of the mean outcome to compute the empirical welfare function. We analyze the regret, defined as the difference between the maximal welfare within a restricted policy class and the welfare obtained with the estimated optimal policies. We show that the regret is bounded above by a distribution-free bound that converges to 0 as the sample size grows, with a convergence rate that matches the optimal rate in the literature, 1/√n. We then evaluate the welfare with simulations. Across different data generation processes, the method proposed in this paper is more robust than the indirect method that estimates the conditional average treatment effect parametrically, which can suffer from model misspecification, and easier to interpret than the indirect method that estimates the conditional average treatment effect non-parametrically.

Chapter 4

Bounds on the Variance Estimator of the Average Treatment Effect in Cluster Randomized Trials

4.1 Introduction

It is common in empirical economics to adjust standard errors for clustering when there are thought to be clusters in the population. As stated in Abadie et al. (2017), whether one should cluster standard errors in an experiment depends on the sampling and treatment assignment mechanisms, and clustering is in essence a design problem. Abadie et al. (2017) described two special cases in which clustering is not necessary: when there is no clustering in the sampling and no clustering in the assignment, and when there is no heterogeneity in the treatment effect and no clustering in the treatment assignment.

When an adjustment for clustering is needed, the model-based approach is very important, and the Liang-Zeger variance is commonly used. We consider the same setting as Abadie et al.
(2017), where a scalar outcome Y_i is modeled in terms of a binary treatment covariate W_i ∈ {0, 1}, with the cluster of unit i denoted by C_i ∈ {1, ..., C}. We estimate the linear model Y_i = α + τ W_i + ε_i = β^T X_i + ε_i under the assumptions E[ε | X, C] = 0 and E[εε^T | X, C] = Ω, where β^T = (α, τ) and X_i^T = (1, W_i). The OLS estimator of β is β̂ = (X^T X)^{-1}(X^T Y). To allow for general heteroskedasticity and to relax the assumptions on the within-cluster correlation, Liang and Zeger (1986) derived an estimator that leaves the correlation between units i and j with C_i = C_j unrestricted. The logic is as follows: order the units by cluster, let Ω_c denote the N_c × N_c sub-matrix of Ω corresponding to cluster c, and let X_c denote the sub-matrix of X for cluster c. Then

V_LZ(β̂) = (X^T X)^{-1} (Σ_{c=1}^{C} X_c^T Ω_c X_c) (X^T X)^{-1}.

Abadie et al. (2017) derived the exact variance of the average effect estimator of a binary treatment when the sampling and/or the treatment is clustered, and they identified three cases in which the LZ variance is approximately correct: (1) there is no heterogeneity in the treatment effects across units; (2) only a few clusters in the population of clusters are observed; (3) there is at most one sampled unit per cluster. None of these three cases may hold in applications. In particular, when all the clusters in the population are included in the sample, the LZ variance is quite conservative. Abadie et al. (2017) therefore proposed another estimator, V̂_CA, which improves over the LZ variance estimator V̂_LZ when there is variation in the treatment within clusters. They argued that no general improvement over the LZ variance is possible when the assignment is perfectly correlated within clusters, which is the case in cluster randomized controlled trials. Yet cluster RCTs are very common in empirical studies, because treatment is often randomized at the cluster level in fields such as education, micro-finance, and health. For example, Miguel and Kremer (2004) evaluated a project in which school-based mass treatment with deworming drugs was randomly phased in at the school level rather than assigned to individuals. In this paper, we illustrate how to improve over V̂_LZ in cluster RCTs by applying the upper bound estimator proposed by Aronow et al. (2014).

This paper works within the potential outcome framework, viewing potential outcomes as fixed values, and focuses on finite populations. The potential outcome framework in causal inference has recently gained much attention in econometrics. This view dates back to the setting of Neyman (1923), where potential outcomes are taken as fixed quantities, some observed and some missing, and randomness comes from the treatment assignment. Both Neyman and Fisher conducted their analyses under this setting for a finite population. The setting can also accommodate super-populations by treating the potential outcomes of each individual as fixed, with the randomness in the corresponding finite sample coming from sampling uncertainty. In addition to this framework, the model-based framework takes a different perspective, treating potential outcomes as random variables drawn from a distribution; its key idea is to impute the missing outcomes from a distribution rather than as fixed numbers. This framework also applies to both finite and infinite populations, but is more often used in infinite population analyses.

Under the potential outcome framework, uncertainty can come from two sources: sampling and treatment assignment.
Under the potential outcome framework, uncertainty can come from two sources: sampling and treatment assignment. Abadie et al. (2020) clearly illustrate the difference between sampling-based inference and design-based inference in finite populations. Consider a finite population consisting of $N$ units, with each unit characterized by a pair of variables $Y_i$ and $X_i$. Sampling-based uncertainty about an estimand, which is a function of the full set of pairs $\{(Y_i, X_i)\}_{i=1}^{N}$, arises because we only observe a sample of $(Y_i, X_i)$. With a variable $R_i \in \{0,1\}$ indicating whether unit $i$ is observed, an estimator is a function of the observed data $\{R_i, R_i Y_i, R_i X_i\}_{i=1}^{N}$. The sampling process that determines $\{R_i\}_{i=1}^{N}$ for different samples provides the information for sampling-based inference. On the other hand, design-based uncertainty arises from the unobservability of one potential outcome, $Y_i(1)$ or $Y_i(0)$; $W_i \in \{0,1\}$ indicates which potential outcome is observed, $Y_i = Y_i(W_i)$. Design-based uncertainty comes from the process that determines $\{W_i\}_{i=1}^{N}$. The two types of inference assess the variability of the estimator across different samples based on these two sources of uncertainty.

Among randomized experiments, estimators in completely randomized experiments have been studied broadly, in terms of both unbiasedness and efficiency. We first illustrate the completely randomized controlled trial as a starting point, considering both sampling-based and design-based uncertainty. There are $N$ units in the entire population, with $N_1$ treated and $N_0$ control units. We sample $n$ units from the population, with $n_1$ treated and $n_0$ controls. The variance of the difference-in-means estimator in a completely randomized experiment can be written as

$$\mathrm{var}(\hat\tau \mid n_1, n_0) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_{01}^2}{N_0 + N_1},$$

where $S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}\big(Y_i(x) - \frac{1}{N}\sum_{j=1}^{N} Y_j(x)\big)^2$ for $x = 0,1$, and $S_{01}^2 = \frac{1}{N-1}\sum_{i=1}^{N}\big(Y_i(1) - Y_i(0) - \frac{1}{N}\sum_{j=1}^{N}(Y_j(1) - Y_j(0))\big)^2$ (Abadie et al., 2020). It is well known that neither unbiased nor consistent variance estimation is possible in this setting (Splawa-Neyman et al., 1990; Imbens and Rubin, 2015), because the joint distribution of the potential outcomes can never be fully recovered from empirical data, and $N = N_0 + N_1$ is finite. The most common estimator is the conservative Neyman estimator with finite population correction, $\frac{(n_1-1)\hat S_1^2}{n_1^2} + \frac{(n_0-1)\hat S_0^2}{n_0^2}$, but it is neither unbiased nor consistent in finite population settings. Researchers have been working on improving this variance estimator. Abadie et al. (2020) exploit the presence of fixed attributes to improve the efficiency of the average treatment effect estimator, while Aronow et al. (2014) propose an interval estimator, consistent for sharp bounds, based on the Fréchet-Hoeffding inequality. This interval estimator is the smallest interval that contains all values of the variance that can be inferred from the empirical data. The upper bound of this interval is never larger than the conventional estimator, and this sharp bound is the source of the efficiency improvement. With this upper bound, the narrowest Wald-type intervals can be constructed asymptotically.
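Before turning to the cluster case, the conservative Neyman-type estimator just discussed can be illustrated with a short sketch. This uses the standard within-arm sample variances and drops the unobservable $S_{01}^2$ term, which is what makes it conservative; the finite-population-corrected variant quoted above differs only in the scaling of each term. Names and toy data are illustrative, not from the dissertation.

```python
import numpy as np

def diff_in_means_with_neyman_var(y, w):
    """Difference-in-means estimate and a conservative Neyman variance estimate
    for a completely randomized experiment. The unobservable term S_01^2/(N_0+N_1)
    is dropped, so the variance estimate is biased upward."""
    y1, y0 = y[w == 1], y[w == 0]
    n1, n0 = y1.size, y0.size
    tau_hat = y1.mean() - y0.mean()
    v_hat = y1.var(ddof=1) / n1 + y0.var(ddof=1) / n0
    return tau_hat, v_hat

# illustrative toy data
rng = np.random.default_rng(1)
n = 100
w = np.zeros(n, dtype=int)
w[rng.choice(n, n // 2, replace=False)] = 1     # completely randomized assignment
y = 0.3 * w + rng.normal(size=n)                # constant treatment effect of 0.3
tau_hat, v_hat = diff_in_means_with_neyman_var(y, w)
print(tau_hat, np.sqrt(v_hat))
```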
This paper extends the Aronow et al. (2014) upper bound estimator to settings of cluster RCTs with varying cluster sizes. In a cluster RCT, the treatment assignment is randomized at the cluster level, and the entire cluster receives the same treatment. Considering the cluster averages, either $\bar Y_c(1)$ or $\bar Y_c(0)$ can be observed in the data, but not both. The joint distribution of $\bar Y_c(1)$ and $\bar Y_c(0)$ therefore cannot be recovered, and the average treatment effect within each cluster cannot be estimated. We first calculate the exact variance of the average treatment effect estimator as well as the normalized limit of the Liang-Zeger variance estimator under cluster RCT, and show the difference between them. This difference is related to the joint distribution of $\bar Y_c(1)$ and $\bar Y_c(0)$. By exploiting the sharp bounds on the joint distribution, we can obtain an interval estimator of this difference and gain efficiency over the commonly used clustered variance estimator, the Liang-Zeger estimator. Alternatively, we can apply the sharp bounds to the exact variance directly to derive another estimator. We show how our new estimators improve over the Liang-Zeger variance estimator with both simulations and an application.

In the next section, we describe the setup and the assumptions on sampling and treatment assignment without replacement under the potential outcome framework in a finite population. Section 3 calculates the exact variance of the average treatment effect estimator and proposes two estimators based on upper bounds. Section 4 reports the comparison between our estimators and the LZ estimator. Section 5 applies our estimators to an empirical study of an education program in Morocco.

4.2 Setup

We adopt the basic setting in Aronow et al. (2014). There is a sequence of finite populations; the $p$-th population has $N_p$ units, with $N_p$ strictly increasing in $p$, and is partitioned into $C_p$ clusters, with $C_p$ weakly increasing in $p$. We focus on one specific population and omit the subscript $p$. Let $N$ denote the total number of units in this population and $C$ the number of clusters; $N$ and $C$ are both finite. Let $C_i \in \{1,\dots,C\}$ denote the cluster that unit $i$ belongs to, $C_{ic} = 1\{C_i = c\}$, $N_c = \sum_{i=1}^{N} C_{ic}$, and $N = \sum_{c=1}^{C} N_c$. The sizes of the clusters can differ. Neyman's potential outcome framework is adopted, under which the potential outcomes and cluster indicator of unit $i$, $(Y_i(1), Y_i(0), C_i)$, are fixed; stochasticity comes only from sampling and treatment assignment. The parameter of interest is the population average treatment effect $\tau = \frac{1}{N}\sum_{i=1}^{N}(Y_i(1) - Y_i(0))$. The cluster average treatment effect is $\tau_c = \frac{1}{N_c}\sum_{i=1}^{N} C_{ic}(Y_i(1) - Y_i(0))$.

The sampling process consists of two stages. In the first stage, $S = C P_C$ clusters are randomly sampled. In the second stage, within each sampled cluster $c$, $N_c P_U$ units are sampled. A binary variable $R_i \in \{0,1\}$ indicates whether a unit is observed. Denote the sample size by $n$; we have $n \to_p N P_C P_U$.

The treatment assignment probability $q_c$ for cluster $c$ can vary across clusters. Suppose $q_c$ is drawn from a distribution with mean $\mu$ and variance $\sigma^2$. If $\sigma^2 = 0$, there is no clustering in the assignment process and the experiment is completely randomized. If $\sigma^2 = \mu(1-\mu)$, then $q_c \in \{0,1\}$ and the experiment is a cluster RCT, in which the whole cluster is assigned to the same treatment group. In this paper we do not assume the true distribution of $q_c$; instead, we observe $q_c$ for the sampled clusters, rely on this empirical distribution, and define $\frac{1}{C}\sum_{c=1}^{C} q_c = \bar q$. Within each sampled cluster $c$, $N_c P_U q_c$ units are assigned to the treatment group and $N_c P_U (1 - q_c)$ units to the control group. We use $W_i \in \{0,1\}$ to denote whether a unit receives treatment.
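To fix ideas, the two-stage sampling and cluster-level assignment scheme just described can be simulated as follows. This is a minimal sketch with illustrative parameter values for the cluster sizes, $P_C$, $P_U$, and $\bar q$, not the designs used later in the simulations.

```python
import numpy as np

def draw_design(cluster_sizes, P_C, P_U, q_bar=0.5, cluster_rct=True, rng=None):
    """One draw of the sampling indicators R and treatment indicators W.
    Stage 1: sample S = C * P_C clusters; stage 2: sample N_c * P_U units in each
    sampled cluster; then assign a fraction q_c of the sampled units to treatment,
    with q_c in {0, 1} in the cluster-RCT case."""
    rng = rng or np.random.default_rng()
    C = len(cluster_sizes)
    cluster_of = np.repeat(np.arange(C), cluster_sizes)
    R = np.zeros(cluster_of.size, dtype=int)
    W = np.zeros(cluster_of.size, dtype=int)

    S = int(round(C * P_C))
    for c in rng.choice(C, S, replace=False):                 # stage 1
        units = np.flatnonzero(cluster_of == c)
        m = int(round(units.size * P_U))
        sampled = rng.choice(units, m, replace=False)         # stage 2
        R[sampled] = 1
        q_c = float(rng.random() < q_bar) if cluster_rct else q_bar
        treated = rng.choice(sampled, int(round(m * q_c)), replace=False)
        W[treated] = 1
    return cluster_of, R, W

cluster_of, R, W = draw_design([40, 35, 50, 45, 60], P_C=0.8, P_U=1.0)
print("sampled units:", R.sum(), " treated units:", W.sum())
```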
Our paper is under the setting of conducting both sampling and treatment assignment with- out replacement, and this is slightly dierent from Abadie et al. (2017), which implemented sampling and treatment assignment with replacement. The dierences can be shown with the following possibilities: Assumption 4.1. The vector of assignments (W ) is independent of the vector of sampling indicators (R). 71 The sampling probability is as follows Pr(R i = 1) =P C P U Pr(R i = 1;R j = 1jC i 6=C j ) = S(S 1) C(C 1) P 2 U = CP C 1 C 1 P C P 2 U The probability of cluster c being sampled is P C = S C Pr(R i = 1;R j = 1jC i =C j =c) = S C P U N c P U 1 N c 1 =P C P U N c P U 1 N c 1 And the probability of an individualj being in a specic clusterc is Nc N . Therefore, we have Pr(R i = 1;R j = 1jC i =C j ) = Pr(R i = 1;R j = 1jC i =C j =c) Pr(C j =c) = C X c=1 N c N (P C P U N c P U 1 N c 1 ) In Abadie et al. (2017), Pr(R i = 1;R j = 1jC i 6=C j ) =P 2 C P 2 U and Pr(R i = 1;R j = 1jC i = C j ) =P C P 2 U . The probability of being assigned to treatment is Pr(W i = 1jC i =c) =q c Pr(W i = 1) = C X c=1 N c N q c =! p q Pr(W i = 1;W j = 1jC i =C j ) = C X c=1 N c N q c N c P U q c 1 N c P U 1 Pr(W i = 1;W j = 1jC i 6=C j ) = ( C X c=1 N c N q c ) 2 72 In Abadie et al. (2017), q c is drawn randomly from a distribution Pr(W i = 1) = = 1 2 , where half of the sampled units are expected to be assigned to the treatment. Pr(W i = 1;W j = 1jC i 6=C j ) = 2 and Pr(W i = 1;W j = 1jC i =C j ) = 2 + 2 . If the true distribution of q c ( and 2 ) is not known, and only realized q c for cluster c is observed. We would like to estimate the linear regression: Y i = +W i +" i , where " i is assumed to have expectation as 0. Y (0) = and Y (1) = +. And we are interested in the average treatment eect, which can be dened in the linear-in-mean form: = Y (1) Y (0) = 1 N N X i=1 Y i (1) 1 N N X i=1 Y i (0) The estimator of is ^ = ^ Y (1) ^ Y (0), where ^ Y (1) = 1 N 1 P N i=1 R i W i Y i (1),N 1 = P N i=1 R i W i ; ^ Y (0) = 1 N 0 P N i=1 R i (1W i )Y i (0), N 0 = P N i=1 R i (1W i ). Residuals are dened as: " i (w) =Y i (w) Y (w) =Y i (w) 1 N P N i=1 Y i (w);w = 0; 1. We also dene cluster parameters: c = Y c (1) Y c (0) = 1 Nc P N i=1 C ic (Y i (1)Y i (0)) as cluster treatment eect, and " c (w) = 1 Nc P N i=1 C ic " i (w) as cluster residuals, where C ic = 1 if C i =c. In this setting, only R i and W i are stochastic. We use a transformation T i = 2W i 1, and denote T = E(T i ) = 2 1. We derive the exact variance of ^ = ^ Y (1) ^ Y (0) based on the linear approximation: p n(^ ) =o p (1); = 2 P N i=1 R i " i (T i T ) p NP C P U (1 2 T ) (4.2.1) where T i = 2W i 1, and T = 2E(W i ) 1 = 2 1. The proof of derivation if this linear approximation is in the appendix. This linear approximation is the same as Abadie et al. (2017) if we set E(W i ) = 1 2 , where we can derive = 2 P N i=1 R i " i T i p NP C P U . 73 4.3 Exact variance and upper bound estimators 4.3.1 Exact variance Proposition 4.1. The exact variance of is V () = 4 n [(1 2 T )] 2 N X i=1 N X j=1 E[R i R j " i " j (T i T )(T j T )] (4.3.1) and the limit of normalized Liang-Zeger estimator is n ^ V LZ (^ ) = 4 n [(1 2 T )] 2 C X c=1 X i:C i =c X j:C j =c E(R i R j " i " j (T i T )(T j T )) Therefore, the dierence between the exact variance and the limit of normalized Liang-Zeger estimator is V ()n ^ V LZ (^ ) = 4 n [(1 2 T )] 2 X X C j 6=C i E[R i R j " i " j (T i T )(T j T )] = P 2 C P 2 U n C X c=1 N 2 c ( " c (1) " c (0)) 2 " c (1) " c (0) = Y c (1) Y (1) Y c (0) Y (0) = c . 
This is consistent with Abadie et al. (2017). And they proposed an estimator ^ V CA (^ ) = ^ V LZ (^ ) 1 n 2 P C c=1 n 2 c (^ c ^ ) 2 when there is variation in treatment within each cluster and P C = 1. V CA (^ ) ^ V (^ ) = N 2 P 2 C P 2 U n 2 P C c=1 N 2 c N 2 ( " c (1) " c (0)) 2 ! p P C c=1 N 2 c N 2 ( c ) 2 We use information of the sample to estimate V CA (^ ). ^ V CA (^ ) = P C c=1 n 2 c n 2 (^ c ^ ) 2 . 74 Considering the only stochasticity comes from R i and W i here, we can further derive the exact variance as V () = 1 N(1 2 T ) 2 N X i=1 ( [(1K 1 )( T + 2) + 4K 1 (K 3 K 2 )] ~ " i (1) 2 + [(1K 1 )( T + 2) + 4K 1 (K 3 K 2 )] ~ " i (0) 2 + 8K 1 (K 3 K 2 )~ " i (0)~ " i (1) ) + 1 N(1 2 T ) 2 C X c=1 ( ( T + 2)(K 1 K 4 ) + 4K 1 (K 2 K 3 ) +K 4 (1 2 T ) N 2 c ~ " c (1) 2 + (2 T )(K 1 K 4 ) + 4K 1 (K 2 K 3 ) +K 4 (1 2 T ) N 2 c ~ " c (0) 2 +2 4K 1 (K 2 K 3 ) +K 4 (1 2 T ) N 2 c ~ " c (1) ~ " c (0) ) where ~ " i (1) = (1 T )" i (1), ~ " i (0) = (1 + T )" i (0), = P C c=1 Nc N q c . K 1 ;K 2 ;K 3 ;K 4 are all constants. K 1 = P C c=1 Nc(NcP U 1) N(Nc1) , K 2 = P C c=1 Nc(Ncqc1) N(Nc1) q c , K 3 = P C c=1 Nc N q c , K 4 = S1 C1 P U = CP C 1 C1 P U . 4.3.2 Cluster RCT In applications, it is very common to randomize the treatment assignment on cluster level, instead of unit level, which means the cluster is assigned to the treatment group or the control group as a whole. The treatment fraction within a cluster c, q c , is either 0 or 1. E(W i ) = ! p E(q c ) = q. Denote S 0 as number of control clusters, and S 1 as treated clusters. q is treatment probability, which is the fraction of sampled clusters being assigned to treatment. We have S 1 S = q. In this case, cluster treatment eect can not be estimated, because only one of Y c (1) and Y c (0) can be observed. Abadie et al. (2017) claimed that there was no improvement over LZ estimator in cluster RCT since ^ c was not estimated. However, although we don't observe the joint distribution of " c (1) and " c (0) or the joint distribution 75 of Y c (1) and Y c (0), we can nd an upper bound and a lower bound of the joint distribution from marginal distributions. 4.3.2.1 Upper bound estimators In this part, we rst would like to derive the sharp bounds on cov(y c (1);y c (0)). Lemma 4.1. Following Aronow et al. (2014) Lemma 1, we deneG C (y) = 1 C P C c=1 1(y c (1) y) andF C (y) = 1 C P C c=1 1(y c (0)y) as marginal distributions ofy c (1) andy c (0). Dene the left-continuous inverses as G 1 C (u) = inffy :G C (y)ug and F 1 C (u) = inffy :F C (y)ug. Dene y c (1) = 1 C P C c=1 y c (1) and y c (0) = 1 C P C c=1 y c (0) C (y c (1);y c (0)) =cov(y c (1);y c (0)) H C (y c (1);y c (0)) = Z 1 0 G 1 C (u)F 1 C (u)du y c (1) y c (0) L C (y c (1);y c (0)) = Z 1 0 G 1 C (u)F 1 C (1u) y c (1) y c (0) Given G N and F N , and no other information on the joint distribution of (Y (1);Y (0)), the bound L C (y c (1);y c (0)) C (y c (1);y c (0)) H C (y c (1);y c (0)) is sharp. The upper bound is attained when y c (1) and y c (0) is co-monotonic, and the lower bound is attained when y c (1) and y c (0) are counter-monotonic. NeitherG C norF C is observed in practice. Instead, we can observe their estimates ^ G C (y) = 1 C 1 P C c=1 q c 1(y c (1)y) =y (dC 1 ue) (1) and ^ F C (y) = 1 C 0 P C c=1 (1q c )1(y c (0)y) =y (C 1 +dC 0 ue) (0), and left-continuous inverses ^ G 1 C (u) = inffy : ^ G C (y)ug and ^ F 1 C (u) = inffy : ^ F C (y)ug. 
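The plug-in interval for the covariance of the cluster-level potential outcomes described in Lemma 4.1, and written out explicitly just below, can be computed from the two observed marginals alone. The following is a minimal sketch, not the dissertation's code; it evaluates the empirical quantile functions on the common partition of $[0,1]$, which handles unequal numbers of treated and control clusters.

```python
import numpy as np

def emp_quantile(sorted_vals, u):
    """Left-continuous empirical quantile inf{y : F_hat(y) >= u}, for u in (0, 1]."""
    k = sorted_vals.size
    return sorted_vals[min(int(np.ceil(k * u)) - 1, k - 1)]

def cov_bounds(y_treated, y_control):
    """Frechet-Hoeffding (sharp) bounds on cov(y_c(1), y_c(0)) when only the two
    marginals are observed: y_treated holds treated-cluster values and y_control
    holds control-cluster values. Returns (lower, upper)."""
    y1 = np.sort(np.asarray(y_treated, dtype=float))
    y0 = np.sort(np.asarray(y_control, dtype=float))
    # partition of [0, 1] on which both empirical quantile functions are constant
    grid = np.unique(np.concatenate([np.arange(y1.size + 1) / y1.size,
                                     np.arange(y0.size + 1) / y0.size]))
    upper = lower = 0.0
    for a, b in zip(grid[:-1], grid[1:]):
        u = 0.5 * (a + b)                                 # interior point of the cell
        g = emp_quantile(y1, u)
        upper += (b - a) * g * emp_quantile(y0, u)        # comonotonic pairing
        lower += (b - a) * g * emp_quantile(y0, 1.0 - u)  # countermonotonic pairing
    mean_prod = y1.mean() * y0.mean()
    return lower - mean_prod, upper - mean_prod

# toy example with unequal numbers of treated and control clusters
lo, hi = cov_bounds([0.2, 0.9, 1.4], [0.1, 0.5, 0.8, 1.1])
print(lo, hi)
```

The upper bound pairs the sorted treated and control values in the same order (comonotonic coupling), while the lower bound pairs them in opposite orders (countermonotonic coupling), matching the attainment conditions stated in Lemma 4.1.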
76 An interval estimator [ L C (y c (1);y c (0)); H C (y c (1);y c (0))] is ^ H C (y c (1);y c (0)) = Z 1 0 ^ G 1 C (u) ^ F 1 C (u)du ^ y c (1) ^ y c (0) = P X c=1 (p c p c1 )y [c] (1)y [c] (0) ^ y c (1) ^ y c (0) ^ L C (y c (1);y c (0)) = Z 1 0 ^ G 1 C (u) ^ F 1 C (1u)du ^ y c (1) ^ y c (0) = P X c=1 (p c p c1 )y [c] (1)y [P +1c] (0) ^ y c (1) ^ y c (0) where the [0; 1]-partitionP m;n =fp 0 ;p 1 ;:::;p P g be the ordered distinct elements off0; 1 C 1 ; 2 C 1 ;:::; 1g[ f0; 1 C 0 ; 2 C 0 ;:::; 1g. Let y [c] (1) =y (dC 1 pce) (1) and y [c] (0) =y (C 1 +dC 0 pce) (0). 4.3.2.2 From V () and P U = 1 Abadie et al. (2017) stated there is no general improvement over the LZ variance available if the treatment assignment is perfectly available, because cluster average " c (0) and " c (1) for one cluster can not be simultaneously estimated in cluster RCT. Here we apply Aronow et al. (2014) upper bound to cluster aggregates to obtain eciency improvement in variance estimators. We rst specify the exact variance of average treatment eect estimator in cluster RCT. De- note cluster residual aggregates " T c (1) = P N i=1 C ic " i (1) and " T c (0) = P N i=1 C ic " i (0). Because this upper bound estimator is based on the sum of residuals within a cluster, we need to observe all units within a sampled cluster, which means P U = 1. 77 V () = T N(1 2 T ) 2 (1 S 1 C 1 ) C X c=1 ((1 T ) 2 " T c (1) 2 (1 + T ) 2 " T c (0) 2 ) + 1 N(1 2 T ) 2 (1 S 1 C 1 ) C X c=1 (1 T )" T c (1) (1 + T )" T c (0) 2 + 1 N(1 2 T ) 2 (1 S 1 C 1 2 T ) C X c=1 (1 T )" T c (1) + (1 + T )" T c (0) 2 When considering estimators, we write V () =Z 1 C X c=1 " T c (1) 2 +Z 2 C X c=1 " T c (0) 2 +Z 3 C X c=1 " T c (1)" T c (0) (4.3.2) for convenience, where Z 1 = 1 N(1+ T ) 2 (1 S1 C1 )(1 + T ) + (1 S1 C1 2 T ) , Z 2 = 1 N(1 T ) 2 (1 S1 C1 )(1 T ) + (1 S1 C1 2 T ) , Z 3 = 2(S1) N(C1) , with n NP C ! p 1, we write Z 1 ;Z 2 ;Z 3 as Z 1 = P C n(1+ T ) 2 (1 S1 C1 )(1 + T ) + (1 S1 C1 2 T ) , Z 2 = P C n(1 T ) 2 (1 S1 C1 )(1 T ) + (1 S1 C1 2 T ) , Z 3 = 2P C (S1) n(C1) . We need to apply Aronow et al. (2014) upper bound to P C c=1 " T c (1)" T c (0). Also, to unbiasedly estimate Y (1) Y (0), we also need an estimator forcov(Y i (1);Y i (0)), for which the upper bound estimator can be obtained with the same logic. 78 ^ V H () = (C 1) ( Z 1 ^ var(Y T c (1)) + ^ Y (1) 2 n 0 n 1 n ^ var(Y i (1)) ^ var(M c ) +Z 2 ^ var(Y T c (0)) + ^ Y (0) 2 n 1 n 0 n ^ var(Y i (0)) ^ var(M c ) +Z 3 ^ cov H (Y T c (0);Y T c (1)) + ^ Y (1) ^ Y (0) 1 N ^ cov L (Y i (1);Y i (0)) ^ var(M c ) ) : Our proposed estimator is ^ V h (^ ) = 1 n ^ V H (). The information we need to know about the population is the number of total cluster C or the probability of one cluster being sampled P C . 4.3.2.3 An estimator derived from the dierence with ^ V LZ The dierence between the limit of Liang-Zeger variance estimator and the exact variance is V ()nV LZ (^ ) = P 2 C P 2 U n C X c=1 N 2 c ( " c (1) " c (0)) 2 Instead of applying bounds to the exact variance directly, we also propose an estimator to improve over LZ variance estimator in cluster RCT as ^ V H CRCT (^ ) = ^ V LZ (^ ) (C 1) [ diff L where [ diff L = ^ var(y c (1)) + ^ var(y c (0)) + ^ Y (1) 2 n 0 n 1 n ^ var(Y i (1)) + ^ Y (0) 2 n 1 n 0 n ^ var(Y i (0)) + ^ Y (1) ^ Y (0) 1 n ^ cov H (Y i (1);Y i (0)) ^ var( c ) + ^ cov L (y c (1);y c (0)); 79 where c = Nc N , and it can be estimated by nc n . c = 1 C . y c (t) = c Y c (t) and y c (t) = 1 C Y (t). This estimator is an extension of Abadie et al. 
(2017)'s estimator to cluster RCTs. They assume $P_C$ to be 1. With our estimator, we do not need $P_C = 1$, but we need to know the value of either $P_C$ or $C$. Beyond this, we do not need to know $P_U$ or any other information about the population. In the simulations, we assume $P_C = 1$ for simplicity.

4.4 Simulation

We evaluate the performance of the linear approximation and then compare our two upper bound estimators to the LZ variance estimator in cluster RCTs. The data are generated by $Y_i = \tau_c W_i + \varepsilon_i$, with $\varepsilon_i \sim N(0,1)$. The potential outcomes are $Y_i(0) \sim N(0,1)$ and $Y_i(1) = \tau_c + Y_i(0)$ if $C_i = c$. $\tau_c$ varies across clusters and is drawn from $N(0, 1^2)$.

To evaluate the linear approximation, we calculate the exact variance of $\Gamma$, which linearly approximates $n V(\hat\tau)$. For each DGP, we first sample clusters and units and then randomly assign the sampled clusters to the treatment group and the control group. To simulate the true variance of $\hat\tau$, we draw samples 1000 times and calculate the variance of $\hat\tau$. Because cluster sizes vary, the sample size $n$ differs across samples. The comparison for the linear approximation is between the expectation of $V(\Gamma)/n$ and $V(\hat\tau)$.

The second estimator is based on the Liang-Zeger estimator $\hat V_{LZ}(\hat\tau)$, which can be written as

$$\hat V_{LZ}(\hat\tau) = \frac{1}{n}\big(E[W_i^2] - E[W_i]^2\big)^{-2}
\begin{pmatrix} -E(W_i) & 1 \end{pmatrix}
E\left[\sum_{c=1}^{C}\begin{pmatrix}\sum_{i:C_i=c}\hat\varepsilon_i \\ \sum_{i:C_i=c}\hat\varepsilon_i W_i\end{pmatrix}\begin{pmatrix}\sum_{i:C_i=c}\hat\varepsilon_i \\ \sum_{i:C_i=c}\hat\varepsilon_i W_i\end{pmatrix}^{\top}\right]
\begin{pmatrix} -E(W_i) \\ 1 \end{pmatrix}.$$

For each DGP, we run the simulation 1000 times, and we calculate $E[V(\Gamma)/n]$ as the exact variance, $E[\hat V_{LZ}(\hat\tau)]$ for the LZ estimator, $E[\hat V_h(\hat\tau)]$ for the first estimator, and $E[\hat V^H_{CRCT}(\hat\tau)]$ for the second estimator.

4.4.1 First estimator

The first estimator is based on the exact variance of $\Gamma$. We vary the number of clusters $C$, the range of cluster sizes $[\underline{M}_c, \overline{M}_c]$, the probability of a cluster being treated $q$, and the probability of a cluster being sampled $P_C$. $V(\hat\tau)$ is the variance of $\hat\tau$ calculated from 1000 replications. We also report the coverage rates for $\tau$ associated with the 95% confidence intervals of the LZ estimator and of the upper bound estimator. As can be seen from Table 4.1 below, $\Gamma$ is a good linear approximation of $\sqrt{n}(\hat\tau - \tau)$, and the approximation improves with the sample size. The upper bound estimator is smaller than the LZ estimator and closer to the true variance of $\hat\tau$. Comparing coverage rates, the upper bound estimator outperforms the LZ estimator in most DGPs.
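A stripped-down version of this Monte Carlo exercise is sketched below: it draws from the DGP $Y_i = \tau_c W_i + \varepsilon_i$ with cluster-level assignment and compares the Liang-Zeger variance estimate with the simulated variability of $\hat\tau$. The paper's two upper-bound estimators are omitted for brevity, the population is redrawn in each replication (so population and design variability are mixed, unlike in the tables), and all names and parameter values are illustrative.

```python
import numpy as np

def one_replication(C=30, size_range=(30, 45), q=0.5, rng=None):
    """One cluster-RCT draw from Y_i = tau_c W_i + eps_i. Returns the estimation
    error tau_hat - tau_fp (tau_fp is this draw's finite-population average effect)
    and the Liang-Zeger variance estimate for tau_hat."""
    rng = rng or np.random.default_rng()
    sizes = rng.integers(size_range[0], size_range[1] + 1, C)
    cluster = np.repeat(np.arange(C), sizes)
    n = sizes.sum()
    tau_c = rng.normal(0.0, 1.0, C)                        # cluster-level effects
    treated = rng.choice(C, int(round(q * C)), replace=False)
    W = np.isin(cluster, treated).astype(float)
    Y = tau_c[cluster] * W + rng.normal(size=n)
    tau_fp = tau_c[cluster].mean()                         # finite-population ATE

    X = np.column_stack([np.ones(n), W])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ Y)
    e = Y - X @ beta
    meat = np.zeros((2, 2))
    for c in range(C):
        s = X[cluster == c].T @ e[cluster == c]
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return beta[1] - tau_fp, V[1, 1]

rng = np.random.default_rng(42)
err, v_lz = np.array([one_replication(rng=rng) for _ in range(1000)]).T
print("simulated variance of tau_hat:", err.var())
print("mean LZ estimate:             ", v_lz.mean())
print("95% CI coverage with LZ s.e.: ", np.mean(np.abs(err) <= 1.96 * np.sqrt(v_lz)))
```

In the paper's simulations (Table 4.1), the LZ estimate tends to exceed the true variance of $\hat\tau$; this sketch gives a rough sense of that comparison, though it holds neither the population nor the cluster sizes fixed across replications.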
81 Table 4.1: Comparison of estimators, P U = 1 C M c q P C V (^ ) Exact variance LZ esti- mator Upper bound estima- tor LZ Cov- erage for Upper bound Cover- age for 30 [30,45] 0.4 0.75 0.019151 0.021090 0.025558 0.024310 96.0% 95.4% 1 0.011738 0.013073 0.019398 0.016331 98.0% 97.1% 0.5 0.75 0.029057 0.027979 0.039936 0.033116 96.4% 94.7% 1 0.015671 0.016646 0.030172 0.021388 98.7% 97.3% 0.6 0.75 0.019797 0.019946 0.028564 0.023657 97.1% 96.0% 1 0.012307 0.012337 0.021205 0.015608 98.4% 96.1% [300,450] 0.4 0.75 0.018547 0.019689 0.023946 0.020522 94.1% 92.8% 1 0.012339 0.011886 0.018215 0.012718 96.5% 93.3% 0.5 0.75 0.015952 0.014764 0.022727 0.016961 95.0% 92.4% 1 0.008749 0.008532 0.016930 0.010259 99.1% 95.8% 0.6 0.75 0.009222 0.008490 0.015678 0.010277 98.0% 94.6% 1 0.004750 0.004568 0.011662 0.006009 99.4% 96.3% 50 [50,75] 0.4 0.75 0.012784 0.012485 0.016432 0.014988 96.0% 95.4% 1 0.007526 0.007464 0.012637 0.010173 96.9% 95.7% 0.5 0.75 0.005284 0.005322 0.007999 0.006942 97.0% 96.1% 1 0.003089 0.003224 0.005862 0.004582 98.9% 97.9% 0.6 0.75 0.009473 0.009210 0.015190 0.011343 98.2% 96.4% 1 0.005617 0.005463 0.011071 0.007264 99.3% 96.9% [500,750] 0.4 0.75 0.010875 0.011406 0.014626 0.012010 96.5% 93.1% 1 0.007122 0.006915 0.011127 0.007539 96.8% 94.3% 0.5 0.75 0.008481 0.007893 0.012329 0.008999 96.6% 93.5% 1 0.004869 0.004655 0.009006 0.005289 98.6% 94.8% 0.6 0.75 0.008005 0.006623 0.011960 0.007452 97.2% 93.2% 1 0.003588 0.003731 0.008856 0.004356 99.6% 96.1% 100 [100,150] 0.4 0.75 0.006134 0.006058 0.008086 0.006795 96.5% 95.1% 1 0.003872 0.003749 0.006083 0.004442 97.7% 96.0% 0.5 0.75 0.004691 0.004614 0.007253 0.005545 98.3% 96.3% 1 0.002943 0.002767 0.005435 0.003631 99.0% 96.6% 0.6 0.75 0.003540 0.003256 0.005628 0.003845 98.1% 95.1% 1 0.001965 0.001921 0.004215 0.002515 99.1% 97.0% [1000,1500] 0.4 0.75 0.005775 0.005652 0.007527 0.005918 97.0% 95.1% 1 0.003482 0.003473 0.005734 0.003776 98.5% 95.4% 0.5 0.75 0.004076 0.003876 0.006119 0.004184 98.5% 94.9% 1 0.002240 0.002327 0.004601 0.002573 99.2% 97.1% 0.6 0.75 0.002489 0.002361 0.004383 0.002586 98.8% 95.0% 1 0.001337 0.001336 0.003302 0.001561 99.3% 96.3% 82 4.4.2 Second estimator For the second estimator, we set P C = 1, and vary the number of clusters C, cluster sizes [M c ;M c ], probability of units being sampled in sampled clusters P U , and the probability of each cluster being treated q. 
Table 4.2: Upper bound estimator on V (^ ) C M c q P U V (^ ) LZ estimator Upper bound LZ Coverage for Upper bound Coverage for 30 [30,45] 0.4 0.4 0.018063 0.021633 0.019825 95.3% 94.6% 0.6 0.014608 0.018702 0.016976 95.8% 95.3% 0.8 0.013027 0.017167 0.015461 96.4% 95.7% 1 0.012309 0.016184 0.014488 95.7% 94.7% 0.5 0.4 0.017450 0.023000 0.020910 96.7% 95.5% 0.6 0.012747 0.019381 0.017273 97.6% 96.7% 0.8 0.011774 0.017996 0.015819 97.8% 96.7% 1 0.011093 0.017407 0.015070 97.5% 96.6% 0.6 0.4 0.010598 0.014201 0.013584 97.4% 97.2% 0.6 0.007491 0.011451 0.010834 98.2% 98.0% 0.8 0.005275 0.009896 0.009255 99.2% 98.7% 1 0.005022 0.009155 0.008482 98.5% 98.1% [300,450] 0.4 0.4 0.011457 0.014708 0.011932 95.4% 93.1% 0.6 0.010020 0.014722 0.011757 96.4% 94.2% 0.8 0.010465 0.014869 0.011749 95.8% 93.8% 1 0.009716 0.014581 0.011439 96.2% 94.6% 0.5 0.4 0.012033 0.019764 0.015825 97.9% 96.4% 0.6 0.011186 0.019396 0.015309 97.8% 96.0% 0.8 0.010830 0.019316 0.015106 97.9% 96.5% 1 0.010838 0.019308 0.015008 98.6% 96.8% 0.6 0.4 0.003058 0.005747 0.005088 98.8% 98.1% 0.6 0.002594 0.005400 0.004700 98.7% 98.2% 0.8 0.002498 0.005245 0.004508 99.2% 98.6% 1 0.002419 0.005217 0.004448 98.8% 98.4% 50 [50,75] 0.4 0.4 0.011619 0.016230 0.014107 97.1% 95.8% 0.6 0.010263 0.015577 0.013194 98.3% 96.7% 0.8 0.009516 0.014934 0.012438 97.9% 96.5% 1 0.009806 0.014545 0.011952 97.4% 95.1% 0.5 0.4 0.008114 0.012965 0.011578 98.3% 97.8% 0.6 0.007224 0.011939 0.010420 98.2% 97.7% 83 0.8 0.006373 0.011392 0.009776 98.9% 97.8% 1 0.005839 0.011067 0.009382 99.0% 97.7% 0.6 0.4 0.008668 0.013222 0.012187 98.3% 97.9% 0.6 0.008359 0.012082 0.010899 98.0% 97.4% 0.8 0.006868 0.011436 0.010178 98.4% 97.8% 1 0.006368 0.011165 0.009830 98.6% 98.3% [500,750] 0.4 0.4 0.008292 0.013796 0.010942 97.9% 95.9% 0.6 0.008164 0.013634 0.010705 97.4% 95.0% 0.8 0.009200 0.013693 0.010689 96.5% 95.1% 1 0.008722 0.013750 0.010690 97.4% 95.4% 0.5 0.4 0.004910 0.009430 0.007571 98.4% 97.3% 0.6 0.005173 0.009298 0.007366 99.0% 97.0% 0.8 0.004887 0.009270 0.007285 98.8% 97.4% 1 0.004649 0.009289 0.007254 98.8% 97.6% 0.6 0.4 0.002961 0.006477 0.005368 99.5% 98.9% 0.6 0.002770 0.006383 0.005205 99.5% 99.2% 0.8 0.002912 0.006360 0.005129 99.4% 98.6% 1 0.002816 0.006304 0.005053 99.5% 98.8% 100 [100,150] 0.4 0.4 0.004583 0.006695 0.005756 97.6% 96.4% 0.6 0.003962 0.006451 0.005440 98.1% 97.0% 0.8 0.004086 0.006207 0.005175 97.9% 95.8% 1 0.003727 0.006231 0.005150 98.6% 97.5% 0.5 0.4 0.003740 0.006546 0.005643 98.8% 98.2% 0.6 0.003736 0.006242 0.005269 98.2% 97.2% 0.8 0.003240 0.006215 0.005176 98.8% 98.2% 1 0.003240 0.006162 0.005083 98.5% 97.8% 0.6 0.4 0.002504 0.005164 0.004630 99.6% 99.2% 0.6 0.002040 0.004856 0.004255 99.8% 99.6% 0.8 0.002197 0.004732 0.004087 99.9% 99.4% 1 0.002074 0.004618 0.003947 99.4% 99.0% [1000,1500] 0.4 0.4 0.003420 0.005397 0.004277 98.2% 96.6% 0.6 0.003177 0.005296 0.004164 97.9% 96.4% 0.8 0.003288 0.005302 0.004146 98.1% 96.6% 1 0.003534 0.005264 0.004107 98.3% 95.6% 0.5 0.4 0.002446 0.004465 0.003531 99.0% 98.0% 0.6 0.002195 0.004405 0.003446 99.5% 98.8% 0.8 0.002240 0.004417 0.003433 99.0% 98.3% 1 0.002261 0.004387 0.003393 98.7% 97.5% 0.6 0.4 0.001748 0.004338 0.003508 99.8% 99.6% 0.6 0.001669 0.004315 0.003452 99.8% 99.4% 0.8 0.001844 0.004251 0.003379 99.4% 98.9% 1 0.001612 0.004248 0.003360 99.8% 99.4% 84 4.5 Application For application, we apply our estimator to Benhassine et al. (2015). 
They used a large randomized experiment in Morocco to evaluate a "labeled cash transfer" (LCT), compared its effect with that of conditional cash transfers (CCT), and checked whether targeting mothers makes a difference relative to targeting fathers. A labeled cash transfer is a small cash transfer made to parents of school-aged children in poor rural communities, not conditional on school attendance but explicitly labeled as an education support program. There were four intervention arms: labeled cash transfers to fathers, labeled cash transfers to mothers, conditional cash transfers to fathers, and conditional cash transfers to mothers. In this large experiment, a total of 320 rural school sectors (65% of all school sectors in the selected regions) were randomly assigned to one of the four treatment groups or to a control group. Objective measures of school participation for over 44,000 children and detailed survey data for over 4,000 households were collected. Each school sector includes one "main" primary school and, on average, four satellite school units. Out of the 320 school sectors (318 sectors effectively attended, since 2 sectors were inaccessible because of flooding), 60 (59 effective) sectors were assigned to the control group, 40 to LCT to fathers, 40 to LCT to mothers, 90 to CCT to fathers, and 90 (89 effective) to CCT to mothers.

The effect of being assigned to each of the treatment groups was estimated with the following specification in Benhassine et al. (2015):

$$Y_{ij} = \alpha + \beta_1 \mathrm{TAYSSIR}_j + \beta_2 \mathrm{LCTm}_j + \beta_3 \mathrm{CCTf}_j + \beta_4 \mathrm{CCTm}_j + X_{ij}'\delta + \varepsilon_{ij}, \qquad (4.5.1)$$

where $Y_{ij}$ is the outcome of student $i$ in school sector $j$; $\mathrm{TAYSSIR}_j = 1$ if school sector $j$ is assigned to any treatment group; $\mathrm{LCTm}_j = 1$ if school sector $j$ is in the LCT-to-mothers group; $\mathrm{CCTf}_j = 1$ if school sector $j$ is in the CCT-to-fathers group; $\mathrm{CCTm}_j = 1$ if school sector $j$ is in the CCT-to-mothers group; and $X_{ij}$ includes strata dummies, school-level controls, and child-level controls.

With this specification, $\hat\beta_1$ measures the effect of the labeled cash transfer paid to fathers, $\hat\beta_1 + \hat\beta_2$ captures the effect of making the labeled cash transfer to mothers, $\hat\beta_1 + \hat\beta_3$ estimates the effect of the conditional cash transfer to fathers, and $\hat\beta_1 + \hat\beta_4$ estimates the impact of the conditional cash transfer to mothers. In these OLS regressions, the standard errors were clustered at the school-sector level using Liang-Zeger estimators.

In this paper, we calculate the standard errors with our upper bound estimators as well as with the LZ estimator. We first regress $Y_{ij}$ on $X_{ij}$ to remove the effect of the control variables,

$$Y_{ij} = \alpha + X_{ij}'\delta + v_{ij},$$

and use the residuals $\hat v_{ij}$ as the dependent variable in the following regressions. Instead of pooling all observations and estimating the effects with specification (4.5.1), we estimate the effect of LCT to fathers, LCT to mothers, CCT to fathers, and CCT to mothers by comparing one specific treatment group with the control group. We estimate the following specifications:

$\hat v_{ij} = \alpha_1 + \tau_1 \mathrm{LCTf}_j + u_{ij}$ in the LCT-to-fathers treatment group and the control group;
$\hat v_{ij} = \alpha_2 + \tau_2 \mathrm{LCTm}_j + u_{ij}$ in the LCT-to-mothers treatment group and the control group;
$\hat v_{ij} = \alpha_3 + \tau_3 \mathrm{CCTf}_j + u_{ij}$ in the CCT-to-fathers treatment group and the control group;
$\hat v_{ij} = \alpha_4 + \tau_4 \mathrm{CCTm}_j + u_{ij}$ in the CCT-to-mothers treatment group and the control group.

In the following tables, we report $\hat\tau_1, \hat\tau_2, \hat\tau_3, \hat\tau_4$ and their variance estimators $\hat V^H_{CRCT}$, $\hat V_h$, and $\hat V_{LZ}$. By construction, $\hat V^H_{CRCT}$ is always smaller than $\hat V_{LZ}$, while $\hat V_h$ is sometimes larger than $\hat V_{LZ}$.
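The confidence intervals reported in the example that follows are standard Wald-type intervals, $\hat\tau \pm 1.96 \times \text{s.e.}$; note that the intervals quoted in the text are reproduced when the tabulated values for $\hat V_{LZ}$, $\hat V^H_{CRCT}$, and $\hat V_h$ are used as standard errors. A minimal sketch using the CCT-to-mothers numbers from the text:

```python
# Wald-type 95% confidence interval: tau_hat +/- 1.96 * s.e.
def wald_ci(tau_hat, se, z=1.96):
    return tau_hat - z * se, tau_hat + z * se

tau_hat = 0.0764   # CCT-to-mothers effect on re-enrollment (value quoted in the text)
for label, se in [("LZ", 0.0318), ("H_CRCT", 0.0228), ("h", 0.0301)]:
    lo, hi = wald_ci(tau_hat, se)
    print(f"{label:7s}: ({lo:.4f}, {hi:.4f})")
```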
Take the impact of being in the group of CCT to mothers on \Attending school by end of year 2 if had dropped out at any time before baseline" for an example. ^ = 0:0764, which means 7.64% increase in attendance rate in school. And an 95% condence interval for ^ with ^ V LZ is (0:0141; 0:1387), with ^ V H CRCT is (0:0317; 0:1211), and with ^ V h is (0:0174; 0:1354). 87 Table 4.3: Eect on school participation: Panel A. Household sample Dependent variable Impact of LCT to fathers Impact of LCT to mothers Impact of CCT to fathers Impact of CCT to mothers Attending school by end of year 2, among 6{15 at baseline 0.0650 0.0390 0.0519 0.0544 ^ V H CRCT 0.0119 0.0105 0.0113 0.0105 ^ V h 0.0119 0.0105 0.0113 0.0119 ^ V LZ 0.0120 0.0107 0.0115 0.0121 Dropped out by year 2, among enrolled in grades 1{4 at baseline -0.0637 -0.0478 -0.0507 -0.0598 ^ V H CRCT 0.0094 0.0076 0.0090 0.0077 ^ V h 0.0093 0.0078 0.0090 0.0078 ^ V LZ 0.0095 0.0078 0.0091 0.0080 Attending school by year 2 if dropped out before baseline 0.0917 0.0689 0.0614 0.0764 ^ V H CRCT 0.0275 0.0230 0.0209 0.0228 ^ V h 0.0261 0.0235 0.0209 0.0301 ^ V LZ 0.0277 0.0233 0.0210 0.0318 Never enrolled in school by end of year 2, among 6{15 in year 0 -0.0126 -0.0061 -0.0040 -0.0034 ^ V H CRCT 0.0062 0.0045 0.0057 0.0045 ^ V h 0.0062 0.0045 0.0057 0.0046 ^ V LZ 0.0063 0.0046 0.0058 0.0047 88 Table 4.4: Eect on school participation, Panel B School Sample Dependent variable Impact of LCT to fathers Impact of LCT to mothers Impact of CCT to fathers Impact of CCT to mothers Dropped out by year 2, among enrolled in grades 1{4 -0.0376 -0.0366 -0.0360 -0.0389 ^ V H CRCT 0.0076 0.0066 0.0073 0.0068 ^ V h 0.0070 0.0064 0.0073 0.0078 ^ V LZ 0.0079 0.0070 0.0077 0.0084 Dropped out in year 1, among enrolled in grades 1{4 -0.0146 -0.0136 -0.0121 -0.0146 ^ V H CRCT 0.0055 0.0048 0.0052 0.0050 ^ V h 0.0051 0.0045 0.0052 0.0053 ^ V LZ 0.0057 0.0051 0.0056 0.0059 Dropped out in year 2, among enrolled in grades 1{4 -0.0242 -0.0242 -0.0253 -0.0259 ^ V H CRCT 0.0040 0.0036 0.0042 0.0037 ^ V h 0.0039 0.0036 0.0043 0.0050 ^ V LZ 0.0041 0.0038 0.0044 0.0049 Attendance rate during surprise school visits, among enrolled 0.0050 0.0151 0.0135 0.0081 ^ V H CRCT 0.0061 0.0053 0.0050 0.0054 ^ V h 0.0062 0.0051 0.0049 0.0063 ^ V LZ 0.0062 0.0055 0.0051 0.0065 Completed primary school, among enrolled in grade 5 0.0271 0.0334 0.0399 0.0557 ^ V H CRCT 0.0209 0.0215 0.0309 0.0211 ^ V h 0.0214 0.0211 0.0299 0.0271 ^ V LZ 0.0210 0.0216 0.0314 0.0262 89 4.6 Conclusion We calculate the exact variance of average eect estimators under the setting of sampling and randomizing treatment without replacement in cluster RCT, and show the dierence between this exact variance and the normalized limit of Liang-Zeger variance estimator. Then we propose two variance estimators based on Aronow et al. (2014) upper bounds. The rst estimator is derived by applying the upper bound directly to the exact variance to get rid of the joint distribution estimation, but this estimator requires more information about the population: the number of total clusters in the population and the sampling probability within each cluster. The second estimator is based on the dierence between the limit of LZ variance estimator and the exact variance, and it is therefore always slightly smaller than the LZ variance estimator and needs less information about the population: the number of total clusters in the population or the probability of a cluster being sampled from the population. 
From the simulation results, we can see the coverage rate of the 95% condence interval of the rst estimator is more stable. We also demonstrate and compare the performances of three variance estimators in a labeled cash transfer program aimed at primary school children in Morocco. In the future, we would like to further extend the upper bound estimators to unconfoundedness settings in observational studies, where the treatment is on the cluster level. 90 References Abadie, A. et al. (2017). When should you adjust standard errors for clustering? Tech. rep. National Bureau of Economic Researc. { (2020). \Sampling-based versus design-based uncertainty in regression analysis". Econometrica 88.1, pp. 265{296. { (2023). \When should you adjust standard errors for clustering?" The Quarterly Journal of Economics 138.1, pp. 1{35. Aronow, P. M., D. P. Green, and D. K. K. Lee (2014). \Sharp bounds on the variance in randomized experiments". Annals of Statistics 42.3, pp. 850{871. Athey, S. and S. Wager (2021). \Policy learning with observational data". Econometrica 89.1, pp. 133{161. Benhassine, N. et al. (2015). \Turning a shove into a nudge? A" labeled cash transfer" for education". American Economic Journal: Economic Policy 7.3, pp. 86{125. Chakraborty, B. and E. Moodie (2013). \Statistical methods for dynamic treatment regimes". Springer-Verlag. doi 10, pp. 978{1. Chakraborty, B. and S. A. Murphy (2014). \Dynamic treatment regimes". Annual review of statistics and its application 1, p. 447. Cui, Y. and J.-S. Pang (2021). Modern nonconvex nondierentiable optimization. SIAM. Cui, Y., J. Liu, and J.-S. Pang (2022). \Nonconvex and Nonsmooth Approaches for Ane Chance-Constrained Stochastic Programs". Set-Valued and Variational Analysis, pp. 1{63. Han, S. (2019). \Optimal dynamic treatment regimes and partial welfare ordering". arXiv preprint arXiv:1912.10014. 91 Heckman, J. J. and S. Navarro (2007). \Dynamic discrete choice and dynamic treatment eects". Journal of Econometrics 136.2, pp. 341{396. Hirano, K. and J. R. Porter (2009). \Asymptotics for statistical treatment rules". Econometrica 77.5, pp. 1683{1701. Imbens, G. W. and D. B. Rubin (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Kemper, P. et al. (1981). \The supported work evaluation: Final benet-cost analysis". Kitagawa, T. and A. Tetenov (2018). \Who should be treated? empirical welfare maximization methods for treatment choice". Econometrica 86.2, pp. 591{616. { (2021). \Equality-minded treatment choice". Journal of Business & Economic Statistics 39.2, pp. 561{574. Kock, A. B., D. Preinerstorfer, and B. Veliyev (2021). \Functional sequential treatment allocation". Journal of the American Statistical Association, pp. 1{13. Laber, E. B. and Y.-Q. Zhao (2015). \Tree-based methods for individualized treatment regimes". Biometrika 102.3, pp. 501{514. LaLonde, R. J. (1986). \Evaluating the econometric evaluations of training programs with experimental data". The American economic review, pp. 604{620. Liang, K.-Y. and S. L. Zeger (1986). \Longitudinal data analysis using generalized linear models". Biometrka 73.1, pp. 13{22. Manski, C. F. (2004). \Statistical treatment rules for heterogeneous populations". Econometrica 72.4, pp. 1221{1246. { (2009). Identication for prediction and decision. Harvard University Press. Miguel, E. and M. Kremer (2004). \Worms: identifying impacts on education and health in the presence of treatment externalities". Econometrica 72.1, pp. 159{217. Murphy, S. A. 
(2003). \Optimal dynamic treatment regimes". Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65.2, pp. 331{355. Neyman, J. (1923). \Sur les applications de la th eorie des probabilit es aux experiences agricoles: Essai des principes". Roczniki Nauk Rolniczych 10, pp. 1{51. Nie, X., E. Brunskill, and S. Wager (2021). \Learning when-to-treat policies". Journal of the American Statistical Association 116.533, pp. 392{409. 92 Pan, Y. and Y.-Q. Zhao (2021). \Improved doubly robust estimation in learning optimal individualized treatment rules". Journal of the American Statistical Association 116.533, pp. 283{294. Qi, Z., J.-S. Pang, and Y. Liu (2019). \Estimating individualized decision rules with tail controls". arXiv preprint arXiv:1903.04367. Qian, M. and S. A. Murphy (2011). \Performance guarantees for individualized treatment rules". Annals of statistics 39.2, p. 1180. Robins, J. M. (2004). \Optimal structural nested models for optimal sequential decisions". In: Proceedings of the second seattle Symposium in Biostatistics. Springer, pp. 189{326. Sakaguchi, S. (2021). \Estimation of Optimal Dynamic Treatment Assignment Rules under Policy Constraints". arXiv preprint arXiv:2106.05031. Splawa-Neyman, J., D. M. Dabrowska, and T. Speed (1990). \On the application of probability theory to agricultural experiments. Essay on principles. Section 9." Statistical Science, pp. 465{472. Stoye, J. (2009). \Minimax regret treatment choice with nite samples". Journal of Econometrics 151.1, pp. 70{81. Sun, L. (2021). \Empirical welfare maximization with constraints". arXiv preprint arXiv:2103.15298. Tan, Z. (2020). \Regularized calibrated estimation of propensity scores with model misspecication and high-dimensional data". Biometrika 107.1, pp. 137{158. Tao, Y., L. Wang, and D. Almirall (2018). \Tree-based reinforcement learning for estimating optimal dynamic treatment regimes". The annals of applied statistics 12.3, p. 1914. Van Der Vaart, A. and J. A. Wellner (2009). \A note on bounds for VC dimensions". Institute of Mathematical Statistics collections 5, p. 103. Viviano, D. and J. Bradic (2020). \Fair policy targeting". arXiv preprint arXiv:2005.12395. Zhang, B. et al. (2012). \A robust method for estimating optimal treatment regimes". Biometrics 68.4, pp. 1010{1018. Zhang, Y. et al. (2018). \Interpretable dynamic treatment regimes". Journal of the American Statistical Association 113.524, pp. 1541{1549. 93 Zhao, Y.-Q. et al. (2015). \New statistical learning methods for estimating optimal dynamic treatment regimes". Journal of the American Statistical Association 110.510, pp. 583{598. Zhao, Y. et al. (2012). \Estimating individualized treatment rules using outcome weighted learning". Journal of the American Statistical Association 107.499, pp. 1106{1118. 94 Appendices A Appendix to Chapter 1 Our problem is formulated as max g2G W (g) s.t. I(g) We consider mean outcome function as our objective function, and use Gini coecient to measure the inequality. W (g) = Z 1 0 (1F g (y))dy I(g) = 1 R 1 0 (1F g (y)) 2 dy R 1 0 (1F g (y))dy The empirical counter part of the population problem is max g2G ^ W (g) s.t. ^ I(g) where ^ W (g) = Z 1 0 (1 ^ F g (y))dy; ^ I(g) = 1 R 1 0 (1 ^ F g (y)) 2 dy R 1 0 (1 ^ F g (y))dy 95 Instead of solving the above empirical problem, we relax the constraint problem in order to guarantee the optimality of the objective function. max g2G ^ W (g) s.t. ^ I(g) +" n A.1 Proof of Theorem 2.1 Proof. 
Letg 2 arg max g2G() W (g), ^ g 2 arg max g2G() ^ W (g), and ^ g 2 arg max g2 ^ G(") ^ W (g). Following Kitagawa and Tetenov (2021), denote 2 (F g (y)) = (1F g (y)) 2 and 1 (F g (y)) = 1F g (y). I Gini (g) = 1 R 1 0 2 (F g (y))dy R 1 0 1 (F g (y))dy ^ I Gini (g) = 1 R 1 0 2 ( ^ F g (y)^ 0)dy R 1 0 1 ( ^ F g (y)^ 0)dy sup g2G ^ I Gini (g)I Gini (g) = sup g2G R 1 0 2 ( ^ F g (y)^ 0)dy R 1 0 1 ( ^ F g (y)^ 0)dy R 1 0 2 (F g (y))dy R 1 0 1 (F g (y))dy = sup g2G R 1 0 2 (F g (y))dy R 1 0 1 (F g (y))dy R 1 0 2 (F g (y))dy R 1 0 1 ( ^ F g (y)^ 0)dy + R 1 0 2 (F g (y))dy R 1 0 1 ( ^ F g (y)^ 0)dy R 1 0 2 ( ^ F g (y)^ 0)dy R 1 0 1 ( ^ F g (y)^ 0)dy = sup g2G R 1 0 2 (F g (y))dy R 1 0 1 (F g (y))dy R 1 0 1 ( ^ F g (y)^ 0)dy Z 1 0 1 ( ^ F g (y)^ 0)dy Z 1 0 1 (F g (y))dy + 1 R 1 0 1 ( ^ F g (y)^ 0)dy Z 1 0 2 (F g (y))dy Z 1 0 2 ( ^ F g (y)^ 0)dy sup g2G 1 R 1 0 1 ( ^ F g (y)^ 0)dy Z 1 0 h 1 ( ^ F g (y)^ 0) 1 (F g (y)) i dy + Z 1 0 h 2 (F g (y)) 2 ( ^ F g (y)^ 0) i dy : 96 In Chapter 2, we have assumed that ^ E g [Y ] = R 1 0 1 ( ^ F g (y)^ 0)dyM > 0. Therefore, we have sup g2G ^ I Gini (g)I Gini (g) M 1 sup g2G Z 1 0 h 1 ( ^ F g (y)^ 0) 1 (F g (y)) i dy + sup g2G Z 1 0 h 2 (F g (y)) 2 ( ^ F g (y)^ 0) i dy : Denote w(Z i ;g) = D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) . E P n sup g2G Z 1 0 h 1 ( ^ F g (y)^ 0) 1 (F g (y)) i dy =E P n sup w2W jE n [w]E P [w]j C 1 M r v n : sup g2G Z 1 0 h 2 (F g (y)) 2 ( ^ F g (y)^ 0) i dy sup g2G Z 1 0 2 (F g (y)) 2 ( ^ F g (y)^ 0) dy Z 1 0 sup g2G 2 (F g (y)) 2 ( ^ F g (y)^ 0) dy j 0 2 (0)j Z 1 0 sup g2G ^ F g (y)F g (y) dy: ^ F g (y)F g (y) = 1 n n X i=1 w(Z i ;g)1fY i >ygE[w(Z;g)1fY >yg] : E P n Z 1 0 sup g2G ^ F g (y)F g (y) dy = Z 1 0 E P n sup g2G ^ F g (y)F g (y) dy Z 1 0 2(1 +C 1 ) r v n p P(Y >y)dy = 2(1 +C 1 ) r v n Z 1 0 p P(Y 1 >y;D = 1) +P(Y 0 >y;D = 0)dy 2(1 +C 1 ) r v n Z 1 0 p P(Y 1 >y;D = 1) + p P(Y 0 >y;D = 0)dy 4M (1 +C 1 ) r v n =C r v n : 97 E P n sup g2G ^ I Gini (g)I Gini (g) (9C 1 + 8) M M r v n : P(sup g2G j ^ I Gini (g)I Gini (g)j" n ) E[sup g2G j ^ I Gini (g)I Gini (g)j] " n C " n r v n Let = C "n p v n ! 0;( ) = C p v n ! 0 P(sup g2G j ^ I Gini (g)I Gini (g)j( )) 1 =)P( ^ I Gini (g)I Gini (g)( )) 1 =)G() ^ G( e ) with probability 1 : If we set " n =n , then we need n 1 2 ! 0, which requires < 1 2 . With probability 1 , jW (g )W (^ g )j =jW (g )W (^ g ) +W (^ g )W (^ g )j =jW (g ) ^ W (g ) + ^ W (g ) ^ W (^ g ) + ^ W (^ g )W (^ g ) +W (^ g ) ^ W (^ g ) + ^ W (^ g ) ^ W (^ g ) + ^ W (^ g )W (^ g )j W (g ) ^ W (g ) + ^ W (^ g )W (^ g ) sup g2G() W (g) ^ W (g) + sup g2 ^ G(") W (g) ^ W (g) 2 sup g2G W (g) ^ W (g) : The rst inequality is based on triangle inequality and the fact that ^ W (g ) ^ W (^ g ) 0 andG() ^ G( " ). The last inequality is based onG()G and ^ G( " )G. jI(^ g )j =jI(^ g ) " +" n j I(^ g ) ^ I(^ g ) +" n : 98 Therefore, we have E P n sup g2G jW (g )W (^ g )j 2C 1 M r v n with probability 1 ; E P n [I(^ g )]E P n sup g2G I(g) ^ I(g) +" n C r v n +n : A.2 Proof of Theorem 2.2 Proof. DenoteV (g) =W (g) maxfC(g); 0g, whereC(g) = (1)W (g) R 1 0 2 (F g (y))dy. 
Denote g m = arg max g2G V (g) and ^ g m = arg max g2G V n (g) V (g m )V (^ g m ) = (W (g ) maxfC(g ); 0g) (W (^ g ) maxfC(^ g ); 0g) = (W (g m )W (^ g m )) (maxfC(g m ); 0g maxfC(^ g m ); 0g) (W (g m )W (^ g m )) +jC(g m )C(^ g m )j Then we look at (W (g m )W (^ g m )) and jC(g m )C(^ g m )j separately: W (g m )W (^ g m ) =W (g m ) ^ W (g m ) + ^ W (g m ) ^ W (^ g m ) + ^ W (^ g m )W (^ g m ) 2 sup g2G W (g) ^ W (g) C(g m )C(^ g m ) =C(g m ) ^ C(g m ) + ^ C(g m ) ^ C(^ g m ) + ^ C(^ g m )C(^ g m ) 2 sup g2G C(g) ^ C(g) 99 Therefore we could obtain sup g2G V (g)V (^ g ) =V (g m )V (^ g m ) 2 sup g2G W (g) ^ W (g) + sup g2G C(g) ^ C(g) : Let (X) = D e(X) + 1D 1e(X) . ^ W (g)W (g) =E n [(X)Yg(X)]E P [(X)Yg(X)] Let f W = (X)Yg(X), we have F W = M , Similarly, letf H 1 = (X)( ~ X) min(Y; ~ Y )g(X)g( ~ X), we have F H 1 = M 2 . Letf H 2 = (X)Yg(X), we have F H 2 = M 2 . Let f H 3 = (X)Yg(X), we have F H 3 = M 2 . sup P2P E P n sup g2G V (g)V (^ g ) 2C 1 F W +(1)( F W + F H 1 + F H 2 + F H 3 ) r v n = 2C 1 1 +(1) 1 + 3 M r v n : V (g m )V (^ g m ) =W (g m ) max(C(g m ); 0)W (^ g m ) + max(C(^ g m ); 0) =W (g m )W (^ g m ) + max(C(^ g m ); 0) W (g m )W (^ g m ) The rst equality holds because g m satises the population consraint C(g m ) 0. So max(C(g m ); 0) = 0 The second inequality holds because max(C(^ g m ); 0) 0. Therefore, sup P2P E P n [W (g m )W (^ g m )] sup P2P E P n [V (g m )V (^ g )] 2C 1 1 +(1) 1 + 3 M r v n : 100 I(^ g m ) = C(^ g m ) W (^ g m ) 1 M C(^ g m ) 1 M (C(^ g m )C(g m )) 2 sup g2G C(g) ^ C(g) 2C 1 (1) 1 + 3 M r v n A.3 Mean-log deviation It should be noted that the framework in this paper is not only limited to Gini Coecient. The algorithm itself can also be applied to cases using other inequality measures. But the proof of the theoretical results need to be modied. For example, another popular inequality measure in development economics is mean-log deviation (MLD), which is dened as I MLD = lnE[Y ]E[ln(Y )]. Under policy g, the MLD is I MLD (g) = lnE[Y (g)]E[ln(Y (g))]: sup g2G ^ I MLD (g)I MLD (g) = (lnE n [Y (g)]E P [ln(Y (g))]) (lnE P [ ^ Y (g)]E P [ln( ^ Y (g))]) sup g2G ln E n [Y (g)] E P [Y (g)] + sup g2G jE n [ln(Y (g))]E P [ln(Y (g))]j sup g2G E n [Y (g)] E P [Y (g)] 1 +C 1 ln M r v n 1 M sup g2G jE n [Y (g)]E P [Y (g)]j +C 1 ln M r v n C 1 M M + ln M r v n : Other analysis should be the same as that using Gini coecient. 101 A.4 Transform of constraint function The empirical constraint with ^ I Gini (g) can be expressed as Z M 0 " 1 n n X i=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) 1(Y i >y) # 2 dy (1 " ) 1 n n X i=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) Y i 0 () Z M 0 " 1 n 2 n X i=1 n X j=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) D j g(X j ) e(X j ) + (1D j )(1g(X j )) 1e(X j ) 1(Y i >y)1(Y j >y) # dy (1 " ) 1 n n X i=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) Y i 0 () 1 n 2 n X i=1 n X j6=i D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) D j g(X j ) e(X j ) + (1D j )(1g(X j )) 1e(X j ) Z M 0 1(Y i >y)1(Y j >y)dy + 1 n 2 n X i=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) Z M 0 1(Y i >y)dy (1 " ) 1 n n X i=1 D i g(X i ) e(X i ) + (1D i )(1g(X i )) 1e(X i ) Y i 0 Because R M 0 1(Y i > y)dy = Y i , and R M 0 1(Y i > y)1(Y j > y)dy = minfY i ;Y j g, we have the constraint further expressed as 1 n 2 ( n X i=1 D i e(X i ) g(X i ) + 1D i 1e(X i ) (1g(X i )) 2 Y i + 2 X i X j > < > > : 1 2 with prob. 1 2 + 1 ; 1 2 with prob. 1 2 1 ; and, if b 1j = 0, Y 1 (1) = 8 > > < > > : 1 2 with prob. 1 2 1 ; 1 2 with prob. 
1 2 + 1 ; if b 2j = 1, Y 2 (D 1 ; 1) = 8 > > < > > : 1 2 with prob. 1 2 + 2 ; 1 2 with prob. 1 2 2 ; and, if b 2j = 0, Y 2 (D 1 ; 1) = 8 > > < > > : 1 2 with prob. 1 2 2 ; 1 2 with prob. 1 2 + 2 ; where 1 ; 2 2 [0; 1 1 ]. 111 As forY 1 (0) andY 2 (D 1 ; 0)'s conditional distribution, we consider the degenerate distribution at Y 1 (0) = 0 at every H 1 = H 1j and Y 2 (D 1 ; 0) = 0 at every H 2 = H 2j ;j = 1;:::;v 1 +v 2 . When b 1j = 1; 1 (H 1j ) = 1 , and when b 1j = 0; 1 (H 1j ) = 1 . When b 2j = 1; 2 (H 2j ) = 2 , and when b 2j = 0; 2 (H 2j ) = . For each b2f(0; 0); (0; 1); (1; 0); (1; 1)g v 1 +v 2 , P b 2P(1;) holds. We dene a subclass ofP(1;) byP =fP b : b2f(0; 0); (0; 1); (1; 0); (1; 1)g v 1 +v 2 g. With P b 2P , the optimal treatment assignment rule is G 1b =fH 1j :b 1j = 1;jv 1 +v 2 g;G 2b =fH 2j :b 2j = 1;jv 1 +v 2 g; which is feasible (G 1b ;G 2b )2G 1 G 2 by the construction of the support points of H 1 H 2 . The maximized social welfare is W (g 1b ;g 2b ) = (v 1 +v 2 ) 1 ( 1 v 1 +v 2 X j=1 b 1j + 2 v 1 +v 2 X j=1 b 2j ): Let ( ^ G 1 ; ^ G 2 ) be an arbitrary treatment rule depending on sample (Z 1 ;:::;Z n ), and ^ b2 f(0; 0); (0; 1); (1; 0); (1; 1)g v 1 +v 2 be a vector whose jth element is ^ b 1j = 1fH 1j 2 ^ G 1 g; ^ b 2j = 1fH 2j 2 ^ G 2 g. Consider(b) a prior distribution of b such thatb 11 ;:::;b 1(v 1 +v 2 ) ;b 21 ;:::;b 2(v 1 +v 2 ) are i.i.d. and b 11 ;b 21 Ber(1=2). The welfare loss satises the following inequalities: 112 sup P2P(1;) E P n[W (g 1 ;g 2 )W (^ g 1 ; ^ g 2 )] sup P b 2P E P n b [W (G 1b ;G 2b )W (^ g 1 ; ^ g 2 )] Z b E P n b [W (g 1b ;g 2b )W (^ g 1 ; ^ g 2 )]d(b) = Z b ( 1 E[P H 1 (G 1b ^ G 1 )] + 2 E[P H 2 (G 2b ^ G 2 )])d(b) = Z b Z Z 1 ;:::;Zn [ 1 P H 1 (fb 1 (H 1 )6= ^ b 1 (H 1 )g) + 2 P H 2 (fb 2 (H 2 )6= ^ b 2 (H 2 )g)]dP n b (Z 1 ;:::;Z n )d(b) inf ^ G 1 ; ^ G 2 Z b Z Z 1 ;:::;Zn [ 1 P H 1 (fb 1 (H 1 )6= ^ b 1 (H 1 )g) + 2 P H 2 (fb 2 (H 2 )6= ^ b 2 (H 2 )g)]dP n b (Z 1 ;:::;Z n )d(b); where b(H 1 ; H 2 ) = (b 1 (H 1 );b 2 (H 2 )) and ^ b(H 1 ; H 2 ) are elements of b and ^ b, such that b 1 (H 1j ) =b 1j ;b 2 (H 2j ) =b 2j and ^ b 1 (H 1j ) = ^ b 1j ; ^ b 2 (H 2j ) = ^ b 2j . ^ g 1 = n H 1j :(b 1j = 1jZ 1 ;:::;Z n ) 1 2 ;jv 1 +v 2 o ; ^ g 2 = n H 2j :(b 2j = 1jZ 1 ;:::;Z n ) 1 2 ;jv 1 +v 2 o ; where (b 1j = 1j Z 1 ;:::;Z n ) and (b 2j = 1j Z 1 ;:::;Z n ) are posterior probabilties for b 1j = 1 and b 2j = 1. The minimized Bayes risk is given by Z Z 1 ;:::;Zn 1 E H 1 [minf(b 1 (H 1 ) = 1jZ 1 ;:::;Z n ); 1(b 1 (H 1 ) = 1jZ 1 ;:::;Z n )g] + 2 E H 2 [minf(b 2 (H 2 ) = 1jZ 1 ;:::;Z n ); 1(b 2 (H 2 ) = 1jZ 1 ;:::;Z n )g] d ~ P n = (v 1 +v 2 ) 1 Z Z 1 ;:::;Zn v 1 +v 2 X j=1 1 minf(b 1 (H 1 ) = 1jZ 1 ;:::;Z n ); 1(b 1 (H 1 ) = 1jZ 1 ;:::;Z n )g + 2 minf(b 2 (H 2 ) = 1jZ 1 ;:::;Z n ); 1(b 2 (H 2 ) = 1jZ 1 ;:::;Z n )g d ~ P n ; where ~ P n is the marginal likelihood off(Y 1i (1);Y 1i (0);Y 2i (0; 0);Y 2i (0; 1);Y 2i (1; 0);Y 2i (1; 1);D 1i ;D 2i ; X 1i ) : i = 1;:::;ng with prior (b). 
113 For each j = 1;:::; (v 1 +v 2 ), let k + 1j = #fi : H 1i = H 1j ;Y 1i D 1i = 1 2 g; k 1j = #fi : H 1i = H 1j ;Y 1i D 1i = 1 2 g; k + 2j = #fi : H 2i = H 2j ;Y 2i D 2i = 1 2 g; k 2j = #fi : H 2i = H 2j ;Y 2i D 2i = 1 2 g: The poterior for b 1j = 1 and b 2j = 1 can be written as (b 1j = 1jZ 1 ;:::;Z n ) = 8 > > < > > : 1 2 if #fi : H 1i = H 1j ;D 1i = 1g = 0; ( 1 2 + 1 ) k + 1j ( 1 2 1 ) k 1j ( 1 2 + 1 ) k + 1j ( 1 2 1 ) k 1j +( 1 2 + 1 ) k 1j ( 1 2 1 ) k + 1j otherwise; (b 2j = 1jZ 1 ;:::;Z n ) = 8 > > < > > : 1 2 if #fi : H 2i = H 2j ;D 2i = 1g = 0; ( 1 2 + 2 ) k + 2j ( 1 2 2 ) k 2j ( 1 2 + 2 ) k + 2j ( 1 2 2 ) k 2j +( 1 2 + 2 ) k 2j ( 1 2 2 ) k + 2j otherwise: Hence, for b 1 , minf(b 1 (H 1 ) = 1jZ 1 ;:::;Z n ); 1(b 1 (H 1 ) = 1jZ 1 ;:::;Z n )g = minf( 1 2 + 1 ) k + 1j ( 1 2 1 ) k 1j ; ( 1 2 + 1 ) k 1j ( 1 2 1 ) k + 1j g ( 1 2 + 1 ) k + 1j ( 1 2 1 ) k 1j + ( 1 2 + 1 ) k 1j ( 1 2 1 ) k + 1j = min n 1; 1 2 + 1 1 2 1 k + 1j k 1j o 1 + 1 2 + 1 1 2 1 k + 1j k 1j = 1 1 +a jk + 1j k 1j j 1 ; where a 1 = 1 + 2 1 1 2 1 > 1; 114 and for b 2 , minf(b 2 (H 2 ) = 1jZ 1 ;:::;Z n ); 1(b 2 (H 2 ) = 1jZ 1 ;:::;Z n )g = minf( 1 2 + 2 ) k + 2j ( 1 2 2 ) k 2j ; ( 1 2 + 2 ) k 2j ( 1 2 2 ) k + 2j g ( 1 2 + 2 ) k + 2j ( 1 2 2 ) k 2j + ( 1 2 + 2 ) k 2j ( 1 2 2 ) k + 2j = min n 1; 1 2 + 2 1 2 2 k + 2j k 2j o 1 + 1 2 + 2 1 2 2 k + 2j k 2j = 1 1 +a jk + 2j k 2j j 2 ; where a 2 = 1 + 2 2 1 2 2 > 1: Since k + 1j k 1j = P i:H 1i =H 1j 2Y 1i D 1i , k + 2j k 2j = P i:H 2i =H 2j 2Y 2i D 2i , (v 1 +v 2 ) 1 v 1 +v 2 X j=1 E ~ P n 2 4 1 1 +a j P i:H 1i =H 1j 2Y 1i D 1i j 1 + 2 1 +a j P i:H 2i =H 2j 2Y 2i D 2i j 2 3 5 1 2(v 1 +v 2 ) v 1 +v 2 X j=1 E ~ P n 2 4 1 a j P i:H 1i =H 1j 2Y 1i D 1i j 1 + 2 a j P i:H 2i =H 2j 2Y 2i D 2i j 2 3 5 1 2(v 1 +v 2 ) v 1 +v 2 X j=1 1 a E ~ P n j P i:H 1i =H 1j 2Y 1i D 1i j 1 + 2 a E ~ P n j P i:H 2i =H 2j 2Y 2i D 2i j 2 The marginal distribution of Y 1i (1) and Y 2i (D 1i ; 1) are Pr(Y 1i (1) = 1=2) = Pr(Y 1i (1) = 1=2) = 1=2 and Pr(Y 2i (D 1i ; 1) = 1=2) = Pr(Y 2i (D 1i ; 1) =1=2) = 1=2, so E ~ P n X i:H 1i =H 1j 2Y 1i D 1i =E ~ P n X i:H 1i =H 1j ;D 1i =1 2Y 1i = n X k 1 =0 0 B @ n k 1 1 C A 1 2(v 1 +v 2 ) k 1 1 1 2(v 1 +v 2 ) nk 1 E B(k 1 ; 1 2 ) k 1 2 n X k 1 =0 0 B @ n k 1 1 C A 1 2(v 1 +v 2 ) k 1 1 1 2(v 1 +v 2 ) nk 1 r k 1 4 r n 8(v 1 +v 2 ) ; 115 and E ~ P n X i:H 2i =H 2j 2Y 2i D 2i =E ~ P n X i:H 2i =H 2j ;D 2i =1 2Y 2i = n X k 2 =0 0 B @ n k 2 1 C A 1 2(v 1 +v 2 ) k 2 1 1 2(v 1 +v 2 ) nk 2 E B(k 2 ; 1 2 ) k 2 2 n X k 2 =0 0 B @ n k 2 1 C A 1 2(v 1 +v 2 ) k 2 1 1 2(v 1 +v 2 ) nk 2 r k 2 4 r n 8(v 1 +v 2 ) : Hence, the Bayes risk is bounded from below by 1 a p n 8(v 1 +v 2 ) 1 + 2 a p n 8(v 1 +v 2 ) 2 1 expf(a 1 1) r n 8(v 1 +v 2 ) g + 2 expf(a 2 1) r n 8(v 1 +v 2 ) g = 1 expf 4 1 1 2 1 r n 8(v 1 +v 2 ) g + 2 expf 4 2 1 2 2 r n 8(v 1 +v 2 ) g: When 1 and 2 are set to be proportional to n 1=2 , the lower bound of the Bayes risk has the slowest convergence rate. Let 1 = 2 = q v 1 +v 2 n , 1 2 1 expf 4 1 1 2 1 r n 8(v 1 +v 2 ) g + 1 2 2 expf 4 2 1 2 2 r n 8(v 1 +v 2 ) g = 1 2 r v 1 +v 2 n (expf p 2 1 2 1 g + expf p 2 1 2 2 g) r v 1 +v 2 n expf2 p 2g if 1 2 1 1 2 ; 1 2 2 1 2 : The conditions 1 2 1 1 2 ; 1 2 2 1 2 are equivalent to n 16(v 1 +v 2 ). 
116 B.2 Multiple periods In multiple periods, the welfare function in period 1 can be written as W (g 1 ;:::;g T ) =E P [Y 1 (g 1 (H 1 )) + +Y T (g 1 (H 1 );:::;g T (H T ))] =E P " T Y t=1 D t g t (H t ) e t (H t ) + (1D t )(1g t (H t )) 1e t (H t ) T X t=1 Y t # : The welfare function in period t can be written as W t (g t ;:::;g T ) =E P [Y 1 (g 1 (H 1 )) + +Y T (g 1 (H 1 );:::;g T (H T ))] =E P " T Y s=t D s g s (H s ) e s (H s ) + (1D s )(1g s (H s )) 1e t (H t ) T X s=t Y s # : By backward induction, ^ g T 2 arg max g T 2G T W T (g T ); ^ g t 2 arg max gt2Gt W t (g t ; ^ g t+1 ;:::; ^ g T ); ^ g 1 2 arg max gt2G 1 W (g 1 ; ^ g 2 ;:::; ^ g T ): 117 With the estimated optimale policy (^ g 1 ;:::; ^ g T ), we show the proof of upper bound of regret in multiple periods. R(^ g 1 ;:::; ^ g T ) =W (g 1 ;:::;g T )W (^ g 1 ;:::; ^ g T ) = (W (g 1 ;:::;g T )W (g 1 ;:::; ^ g T )) + W (g 1 ;:::; ^ g T )W (g 1 ;:::; ^ g T1 ; ^ g T ) + + (W (g 1 ; ^ g 2 ;:::; ^ g T )W (^ g 1 ; ^ g 2 ;:::; ^ g T )) =W (g 1 ;:::;g T )W n (g 1 ;:::;g T ) +W n (g 1 ;:::;g T )W n (g 1 ;:::; ^ g T ) +W n (g 1 ;:::; ^ g T )W (g 1 ;:::; ^ g T ) +W (g 1 ;:::; ^ g T )W n (g 1 ;:::; ^ g T ) +W n (g 1 ;:::; ^ g T )W n (g 1 ;:::; ^ g T1 ; ^ g T ) +W n (g 1 ;:::; ^ g T1 ; ^ g T )W (g 1 ;:::; ^ g T1 ; ^ g T ) + +W (g 1 ; ^ g 2 ;:::;g T )W n (g 1 ; ^ g 2 ;:::;g T ) +W n (g 1 ; ^ g 2 ;:::;g T )W n (^ g 1 ; ^ g 2 ;:::;g T ) +W n (^ g 1 ; ^ g 2 ;:::;g T )W (^ g 1 ; ^ g 2 ;:::;g T ) 2 sup (g 1 ;:::;g T )2G 1 G T jW (g 1 ;:::;g T )W n (g 1 ;:::;g T )j +W n (g 1 ;:::;g T )W n (g 1 ;:::; ^ g T ) + +W n (g 1 ; ^ g 2 ;:::;g T )W n (^ g 1 ; ^ g 2 ;:::;g T ) 2 sup (g 1 ;:::;g T )2G 1 G T jW (g 1 ;:::;g T )W n (g 1 ;:::;g T )j: We can derive W n (g 1 ;:::;g t ; ^ g t+1 ;:::; ^ g T )W n (g 1 ;:::; ^ g t ; ^ g t+1 ;:::; ^ g T ) 0 118 with W n (g 1 ;:::;g t ; ^ g t+1 ;:::; ^ g T )W n (g 1 ;:::; ^ g t ; ^ g t+1 ;:::; ^ g T ) =E n (g 1 ;D 1 ; H 1 )::: (g t1 ;D t1 ; H t1 ) (g t ;D t ; H t ) (^ g t ;D t ; H t ) Y t + (^ g t+1 ;D t+1 ; H t+1 )Y t+1 + + (^ g t+1 ;D t+1 ; H t+1 )::: (^ g T ;D T ; H T )Y T =E (g 1 ;D 1 ; H 1 )::: (g t1 ;D t1 ; H t1 )E n " (g t ;D t ; H t ) (^ g t ;D t ; H t ) Y t + (^ g t+1 ;D t+1 ; H t+1 )Y t+1 + + (^ g t+1 ;D t+1 ; H t+1 )::: (^ g T ;D T ; H T )Y T j H t1 # 0: According to Lemma .3, we have E P n sup f2F jE n (f)E P (f)j 2C 2 F r VC(F) n : And F = TM 2 T Therefore, we have E P n [W (g 1 ;:::;g T )W (^ g 1 ;:::; ^ g T )]C 0 TM T s P T t=1 v t n ; where C 0 = 48 R 1 0 q ln( p 2K) + 1 + 2 ln(16e) 2 ln"d" q e (e1) log 2 ln( 2e log 2 ). B.3 Unknown propensity score Consider the two-period case with unknown propensity score using the inverse probability weighting estimator, with estimated propensity score. E P n[W (g )W (^ g )] 2E P n[sup g2G jW (g) ^ W (g)j] + 2E P n[sup g2G j ^ W (g) ~ W (g)j] 119 where ^ W (g 1 ;g 2 ) = 1 n n X i=1 D i1 g 1 (X i1 ) e 1 (X i1 ) + (1D i1 )(1g 1 (X i1 )) 1e 1 (X i1 ) D i2 g 2 (H i2 ) e 2 (H i2 ) + (1D i2 )(1g 2 (H i2 )) 1e 2 (H i2 ) (Y i1 +Y i2 ); and ~ W (g 1 ;g 2 ) = 1 n n X i=1 D i1 g 1 (X i1 ) ^ e 1 (X i1 ) + (1D i1 )(1g 1 (X i1 )) 1 ^ e 1 (X i1 ) D i2 g 2 (H i2 ) ^ e 2 (H i2 ) + (1D i2 )(1g 2 (H i2 )) 1 ^ e 2 (H i2 ) (Y i1 +Y i2 ): Denote i11 = D i1 e(X i1 ) , i10 = 1D i1 1e 1 (X i1 ) , i21 = D i2 e 2 (H i2 ) , and i20 = 1D i2 1e 2 (H i2 ) . 120 Denote ^ i11 = D i1 ^ e 1 (X i1 ) , ^ i10 = 1D i1 1e(X i1 ) , ^ i21 = D i2 ^ e(H i2 ) , and ^ i20 = 1D i2 1^ e 2 (H i2 ) . 
E P n[ sup (g 1 ;g 2 )2G 1 G 2 j ^ W (g 1 ;g 2 ) ~ W (g 1 ;g 2 )j] =E P n " sup (g 1 ;g 2 )2G 1 G 2 1 n n X i=1 ( i11 g 1 (X i1 ) + i10 (1g 1 (X i1 ))) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) (Y i1 +Y i2 ) 1 n n X i=1 ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) ^ i21 g 2 (H i2 ) + ^ i20 (1g 2 (H i2 )) (Y i1 +Y i2 ) # E P n " sup (g 1 ;g 2 )2G 1 G 2 1 n n X i=1 ( i11 g 1 (X i1 ) + i10 (1g 1 (X i1 ))) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) ^ i21 g 2 (H i2 ) + ^ i20 (1g 2 (H i2 )) Y i1 +Y i2 # E P n " sup (g 1 ;g 2 )2G 1 G 2 1 n n X i=1 ( i11 g 1 (X i1 ) + i10 (1g 1 (X i1 ))) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) + ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) ^ i21 g 2 (H i2 ) + ^ i20 (1g 2 (H i2 )) Y i1 +Y i2 # E P n " sup (g 1 ;g 2 )2G 1 G 2 2M n n X i=1 i11 ^ i11 g 1 (X i1 ) + i10 ^ i10 (1g 1 (X i1 )) ( i21 g 2 (H i2 ) + i20 (1g 2 (H i2 ))) + ^ i11 g 1 (X i1 ) + ^ i10 (1g 1 (X i1 )) i21 ^ i21 g 2 (H i2 ) + i20 ^ i20 (1g 2 (H i2 )) # E P n " 2M n n X i=1 i11 ^ i11 + i10 ^ i10 + i21 ^ i21 + i20 ^ i20 # O( 1 n ): 121 Similarly, if we consider the case with estimated conditonal mean outcome, we have E P n " sup (g 1 ;g 2 )2G 1 G 2 ^ W (g 1 ;g 2 ) ~ W (g 1 ;g 2 ) # =E P n " sup (g 1 ;g 2 )2G 1 G 2 1 n n X i=1 ( 1 (g 1 ;g 2 ; H i1 ) ^ 1 (g 1 ;g 2 ; H i1 )) # E P n " sup (g 1 ;g 2 )2G 1 G 2 1 n n X i=1 g 1 (H i1 )(m 1 (1; H i1 ) +D i1 (m 2 (1; H i2 )g 2 (H i2 ) +m 2 (0; H i2 )(1g 2 (H i2 )))) + (1g 1 (H i1 ))(m 1 (0; H i1 ) + (1D i1 )(m 2 (1; H i2 )g 2 (H i2 ) +m 2 (0; H i2 )(1g 2 (H i2 )))) g 1 (H i1 )( ^ m 1 (1; H i1 ) +D i1 ( ^ m 2 (1; H i2 )g 2 (H i2 ) + ^ m 2 (0; H i2 )(1g 2 (H i2 )))) (1g 1 (H i1 ))( ^ m 1 (0; H i1 ) + (1D i1 )( ^ m 2 (1; H i2 )g 2 (H i2 ) + ^ m 2 (0; H i2 )(1g 2 (H i2 )))) # O( 1 n ): C Appendix to chapter 3 Linear approximation p n(^ ) Denote T i = 2D i 1, so that D i = T i +1 2 and 1D i = 1T i 2 Y i =D i Y i (1) + (1D i )Y i (0) =T i Y i (1)Y i (0) 2 + Y i (1)+Y i (0) 2 Consider linear regression Y i = +D i +" i , we only observe (Y i ;D i ) when R i = 1 OLS estimator for is ^ = P N i=1 R i Y i (D i W ) P N i=1 R i (D i W ) 2 = 2 P N i=1 R i Y i (T i T ) P N i=1 R i (T i T ) 2 = 2 1 N P N i=1 R i Y i (T i T ) 1 N P N i=1 R i (T i T ) 2 = 2 U V ; where U i =R i Y i (T i T ), V i =R i (T i T ) 2 , and T = P N i=1 R i T i P N i=1 R i . 122 U = 1 N N X i=1 R i Y i (T i T ) = 1 N N X i=1 R i Y i (T i T ) = 1 N N X i=1 R i T i Y i (1)Y i (0) 2 + Y i (1) +Y i (0) 2 (T i T ) ! p E R i T i (T i T ) Y (1) Y (0) 2 +E R i (T i T ) Y (1) + Y (0) 2 = 1 2 E [R i ] 1 2 T Y (1) Y (0) , u V = 1 N N X i=1 R i (T i T ) 2 ! p E R i (T i T ) 2 =E [R i ] 1 2 T , v U V ! p u v = E [R i ] (1 2 T ) Y (1) Y (0) 2E [R i ] (1 2 T ) = 2 Next we apply delta method: f 0 B @ 2 6 4 U V 3 7 5 1 C A = U V ;f 0 B @ 2 6 4 u v 3 7 5 1 C A = u v rf() = 0 B @ 1 v u 2 v 1 C A ; the covariance matrix of 0 B @ U V 1 C A = 0 B @ var(U) n cov(U;V ) cov(U;V ) var(V ) n 1 C A 123 U V = u v +rf() T 0 B @ U u V v 1 C A = u v + 1 v U u u 2 v V v +o p (1) = u v + 1 E [R i ] (1 2 T ) 1 N N X i=1 R i Y i (T i T ) 1 2 E [R i ] 4 E(D 2 i )E(D i ) 2 Y (1) Y (0) ! E [R i ] (E(D 2 i )E(D i ) 2 ) Y (1) Y (0) 2 (E [R i ] (1 2 T )) 2 1 n n X i=1 R i (T i T ) 2 E [R i ] 4 E(D 2 i )E(D i ) 2 ! 
The probability limits of $\bar U$ and $\bar V$ are
$$
\begin{aligned}
\bar U&=\frac1N\sum_{i=1}^NR_iY_i(T_i-\bar T)
=\frac1N\sum_{i=1}^NR_i\Big(T_i\frac{Y_i(1)-Y_i(0)}{2}+\frac{Y_i(1)+Y_i(0)}{2}\Big)(T_i-\bar T)\\
&\to_p\mathbb{E}\big[R_iT_i(T_i-\bar T)\big]\frac{\bar Y(1)-\bar Y(0)}{2}+\mathbb{E}\big[R_i(T_i-\bar T)\big]\frac{\bar Y(1)+\bar Y(0)}{2}
=\frac12\,\mathbb{E}[R_i]\big(1-\bar T^2\big)\big(\bar Y(1)-\bar Y(0)\big)\equiv u,\\
\bar V&=\frac1N\sum_{i=1}^NR_i(T_i-\bar T)^2\to_p\mathbb{E}\big[R_i(T_i-\bar T)^2\big]=\mathbb{E}[R_i]\big(1-\bar T^2\big)\equiv v,\\
\frac{\bar U}{\bar V}&\to_p\frac uv=\frac{\mathbb{E}[R_i]\big(1-\bar T^2\big)\big(\bar Y(1)-\bar Y(0)\big)}{2\,\mathbb{E}[R_i]\big(1-\bar T^2\big)}=\frac\tau2 .
\end{aligned}
$$
Next we apply the delta method. With
$$
f\begin{pmatrix}\bar U\\\bar V\end{pmatrix}=\frac{\bar U}{\bar V},\qquad
f\begin{pmatrix}u\\v\end{pmatrix}=\frac uv,\qquad
\nabla f=\begin{pmatrix}\frac1v\\-\frac u{v^2}\end{pmatrix},\qquad
\text{covariance matrix of }\begin{pmatrix}\bar U\\\bar V\end{pmatrix}=\frac1n\begin{pmatrix}\operatorname{var}(U)&\operatorname{cov}(U,V)\\\operatorname{cov}(U,V)&\operatorname{var}(V)\end{pmatrix},
$$
and using $1-\bar T^2=4\big(\mathbb{E}(D_i^2)-\mathbb{E}(D_i)^2\big)$, we obtain
$$
\begin{aligned}
\frac{\bar U}{\bar V}&=\frac uv+\nabla f^{\,T}\begin{pmatrix}\bar U-u\\\bar V-v\end{pmatrix}+o_p(1)
=\frac uv+\frac1v(\bar U-u)-\frac u{v^2}(\bar V-v)+o_p(1)\\
&=\frac uv+\frac1{\mathbb{E}[R_i]\big(1-\bar T^2\big)}\Big(\frac1N\sum_{i=1}^NR_iY_i(T_i-\bar T)-\frac12\,\mathbb{E}[R_i]\cdot4\big(\mathbb{E}(D_i^2)-\mathbb{E}(D_i)^2\big)\big(\bar Y(1)-\bar Y(0)\big)\Big)\\
&\qquad-\frac{\mathbb{E}[R_i]\big(\mathbb{E}(D_i^2)-\mathbb{E}(D_i)^2\big)\big(\bar Y(1)-\bar Y(0)\big)}{2\big(\mathbb{E}[R_i]\big(1-\bar T^2\big)\big)^2}\Big(\frac1N\sum_{i=1}^NR_i(T_i-\bar T)^2-\mathbb{E}[R_i]\cdot4\big(\mathbb{E}(D_i^2)-\mathbb{E}(D_i)^2\big)\Big)+o_p(1)\\
&=\frac uv+\frac{\frac2N\sum_{i=1}^NR_iY_i(T_i-\bar T)-\frac\tau N\sum_{i=1}^NR_i(T_i-\bar T)^2}{2\,\mathbb{E}[R_i]\big(1-\bar T^2\big)}+o_p(1).
\end{aligned}
$$
Moreover,
$$
\frac2N\sum_{i=1}^NR_iY_i(T_i-\bar T)
=\frac2N\sum_{i=1}^NR_i\Big(\varepsilon_i+T_i\frac\tau2+\frac{\bar Y(1)+\bar Y(0)}{2}\Big)(T_i-\bar T)
=\frac2N\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)+\frac\tau N\sum_{i=1}^NR_iT_i(T_i-\bar T).
$$
Therefore, we have
$$
\frac{\bar U}{\bar V}=\frac uv+\frac{\frac2N\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{2\,\mathbb{E}[R_i]\big(1-\bar T^2\big)}+o_p(1),
$$
and, using $\mathbb{E}[R_i]=P_CP_U$ and $n=NP_CP_U$,
$$
\sqrt n(\hat\tau-\tau)=\sqrt n\Big(2\frac{\bar U}{\bar V}-2\frac uv\Big)+o_p(1)
=\sqrt n\,\frac{\frac2N\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{\mathbb{E}[R_i]\big(1-\bar T^2\big)}+o_p(1)
=\sqrt{NP_CP_U}\,\frac{2\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{NP_CP_U\big(1-\bar T^2\big)}+o_p(1)
=\frac{2\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{\sqrt{NP_CP_U}\big(1-\bar T^2\big)}+o_p(1).
$$
Denote
$$
\Gamma=\frac{2\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{\sqrt{NP_CP_U}\big(1-\bar T^2\big)};
$$
when $\mathbb{E}[D_i]=\frac12$, so that $\mathbb{E}[T_i]=\bar T=0$, the above linear approximation reduces to $\frac{2\sum_{i=1}^NR_i\varepsilon_iT_i}{\sqrt{NP_CP_U}}$, as stated in Abadie et al. (2017).

Liang-Zeger variance and its limit

For $Y_i=\alpha+\tau D_i+\varepsilon_i=(\alpha,\tau)\begin{pmatrix}1\\D_i\end{pmatrix}+\varepsilon_i=\beta^TX_i+\varepsilon_i$, the Liang-Zeger (cluster-robust) variance is
$$
V_{LZ}(\hat\beta)=(X^TX)^{-1}\Big(\sum_{c=1}^CX_c^T\Omega_cX_c\Big)(X^TX)^{-1},
$$
where $\Omega_c$ is the within-cluster covariance matrix of the errors, with estimator and limit
$$
n\hat V_{LZ}(\hat\beta)=n(X^TX)^{-1}\Bigg(\sum_{c=1}^C\Big(\sum_{i:C_i=c}\hat\varepsilon_iX_i\Big)\Big(\sum_{i:C_i=c}\hat\varepsilon_iX_i\Big)^T\Bigg)(X^TX)^{-1}
\to\mathbb{E}(X_iX_i^T)^{-1}\,\frac1n\,\mathbb{E}\Bigg(\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_iX_i\Big)\Big(\sum_{i:C_i=c}\varepsilon_iX_i\Big)^T\Bigg)\,\mathbb{E}(X_iX_i^T)^{-1}.
$$
Here
$$
\mathbb{E}(X_iX_i^T)=\begin{pmatrix}1&\mathbb{E}(D_i)\\\mathbb{E}(D_i)&\mathbb{E}(D_i^2)\end{pmatrix},\qquad
\mathbb{E}(X_iX_i^T)^{-1}=\frac1{\mathbb{E}(D_i^2)-(\mathbb{E}(D_i))^2}\begin{pmatrix}\mathbb{E}(D_i^2)&-\mathbb{E}(D_i)\\-\mathbb{E}(D_i)&1\end{pmatrix},
$$
and
$$
\Big(\sum_{i:C_i=c}\varepsilon_iX_i\Big)\Big(\sum_{i:C_i=c}\varepsilon_iX_i\Big)^T
=\begin{pmatrix}\big(\sum_{i:C_i=c}\varepsilon_i\big)^2&\big(\sum_{i:C_i=c}\varepsilon_i\big)\big(\sum_{i:C_i=c}\varepsilon_iD_i\big)\\\big(\sum_{i:C_i=c}\varepsilon_i\big)\big(\sum_{i:C_i=c}\varepsilon_iD_i\big)&\big(\sum_{i:C_i=c}\varepsilon_iD_i\big)^2\end{pmatrix}.
$$
Taking the $(\tau,\tau)$ element of the sandwich gives
$$
n\hat V_{LZ}(\hat\tau)\to\frac1{n\big(\mathbb{E}(D_i^2)-\mathbb{E}(D_i)^2\big)^2}
\Bigg\{\big(\mathbb{E}(D_i)\big)^2\,\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_i\Big)^2
-2\,\mathbb{E}(D_i)\,\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_i\Big)\Big(\sum_{i:C_i=c}\varepsilon_iD_i\Big)
+\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_iD_i\Big)^2\Bigg\}.
$$
Let $T_i=2D_i-1$, so that $D_i=\frac{T_i+1}{2}$, let $\varepsilon_i$ denote the regression error, and denote $\bar T=2\,\mathbb{E}(D_i)-1$. If we assume $\mathbb{E}(D_i)=\frac12$, then $\bar T=0$, and substituting $D_i=\frac{T_i+1}{2}$ into the right-hand side and simplifying gives
$$
\text{RHS}=\frac4{n\big[(1-\bar T^2)\big]^2}
\Bigg\{\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_i\Big)^2
-2\,\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_i\Big)\Big(\sum_{i:C_i=c}\varepsilon_i(T_i+1)\Big)
+\mathbb{E}\sum_{c=1}^C\Big(\sum_{i:C_i=c}\varepsilon_i(T_i+1)\Big)^2\Bigg\}
=\frac4{n\big[(1-\bar T^2)\big]^2}\sum_{c=1}^C\sum_{i:C_i=c}\sum_{j:C_j=c}\mathbb{E}\big(T_iT_jR_i\varepsilon_iR_j\varepsilon_j\big),
$$
where the cluster sums run over the sampled units, so that
$$
n\hat V_{LZ}(\hat\tau)\to_p\frac4n\sum_{c=1}^C\sum_{i:C_i=c}\sum_{j:C_j=c}\mathbb{E}\big(T_iT_jR_i\varepsilon_iR_j\varepsilon_j\big).
$$
If we do not assume $\mathbb{E}(D_i)=\frac12$, the same substitution gives
$$
\text{RHS}=\frac4{n\big[(1-\bar T^2)\big]^2}\,\mathbb{E}\Bigg\{\sum_{c=1}^C\Bigg[4\big(\mathbb{E}(D_i)\big)^2\Big(\sum_{i:C_i=c}\varepsilon_i\Big)^2
-4\,\mathbb{E}(D_i)\Big(\sum_{i:C_i=c}\varepsilon_i\Big)\Big(\sum_{i:C_i=c}\varepsilon_i(T_i+1)\Big)
+\Big(\sum_{i:C_i=c}\varepsilon_i(T_i+1)\Big)^2\Bigg]\Bigg\}
=\frac4{n\big[(1-\bar T^2)\big]^2}\sum_{c=1}^C\sum_{i:C_i=c}\sum_{j:C_j=c}\mathbb{E}\big(R_iR_j\varepsilon_i\varepsilon_j(T_i-\bar T)(T_j-\bar T)\big).
$$
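The sandwich form above can be computed directly. The following sketch is not from the dissertation; it simulates clustered data with cluster-level treatment assignment (an assumption made only for illustration) and implements the Liang-Zeger estimator for the regression of $Y_i$ on $(1,D_i)$.

```python
# Cluster-robust (Liang-Zeger) sandwich variance for tau_hat in Y = a + tau*D + e.
import numpy as np

rng = np.random.default_rng(2)
C, Nc = 50, 20                          # 50 clusters of 20 units (illustrative)
cluster = np.repeat(np.arange(C), Nc)
Dc = rng.binomial(1, 0.5, size=C)       # treatment assigned at the cluster level
D = Dc[cluster].astype(float)
u_c = rng.normal(scale=0.5, size=C)     # cluster random effect
Y = 1.0 + 2.0 * D + u_c[cluster] + rng.normal(size=C * Nc)

X = np.column_stack([np.ones_like(D), D])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat

# "meat": sum over clusters of (sum_i e_i X_i)(sum_i e_i X_i)'
meat = np.zeros((2, 2))
for c in range(C):
    idx = cluster == c
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)

V_LZ = XtX_inv @ meat @ XtX_inv
print("tau_hat:", beta_hat[1], "cluster-robust SE:", np.sqrt(V_LZ[1, 1]))
```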
Exact variance of $\Gamma$

Because $\mathbb{E}(\Gamma)=0$, $\operatorname{var}(\Gamma)=\mathbb{E}(\Gamma^2)-[\mathbb{E}(\Gamma)]^2=\mathbb{E}(\Gamma^2)$, and
$$
\mathbb{E}(\Gamma^2)
=\mathbb{E}\Bigg[\Bigg(\frac{2\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{\sqrt{NP_CP_U}\,(1-\bar T^2)}\Bigg)^2\Bigg]
=\frac4{n\big[(1-\bar T^2)\big]^2}\sum_{i=1}^N\sum_{j=1}^N\mathbb{E}\big[R_iR_j\varepsilon_i\varepsilon_j(T_i-\bar T)(T_j-\bar T)\big].
$$
The difference between the Liang-Zeger variance limit and the exact variance is therefore
$$
-\frac4{n\big[(1-\bar T^2)\big]^2}\sum_{C_j\neq C_i}\mathbb{E}\big[R_iR_j\varepsilon_i\varepsilon_j(T_i-\bar T)(T_j-\bar T)\big],
$$
the term collecting pairs of units $i$ and $j$ that are not in the same cluster. Now
$$
\begin{aligned}
\varepsilon_i\varepsilon_j(T_i-\bar T)(T_j-\bar T)
&=\Big(\frac{\varepsilon_i(1)-\varepsilon_i(0)}{2}T_i+\frac{\varepsilon_i(1)+\varepsilon_i(0)}{2}\Big)\Big(\frac{\varepsilon_j(1)-\varepsilon_j(0)}{2}T_j+\frac{\varepsilon_j(1)+\varepsilon_j(0)}{2}\Big)(T_i-\bar T)(T_j-\bar T)\\
&=\frac14\Big[\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)\big(1-T_i\bar T-T_j\bar T+T_iT_j\bar T^2\big)\\
&\qquad+\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)+\varepsilon_j(0)\big)\big(T_j-T_iT_j\bar T-\bar T+T_i\bar T^2\big)\\
&\qquad+\big(\varepsilon_i(1)+\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)\big(T_i-T_iT_j\bar T-\bar T+T_j\bar T^2\big)\\
&\qquad+\big(\varepsilon_i(1)+\varepsilon_i(0)\big)\big(\varepsilon_j(1)+\varepsilon_j(0)\big)\big(T_iT_j-T_i\bar T-T_j\bar T+\bar T^2\big)\Big].
\end{aligned}
$$
When $i$ and $j$ are not in the same cluster we have $\mathbb{E}[T_iT_j]=\bar T^2$, so only the first term survives in expectation and
$$
\sum_{C_j\neq C_i}\mathbb{E}\big[R_iR_j\varepsilon_i\varepsilon_j(T_i-\bar T)(T_j-\bar T)\big]
=\frac14(1-\bar T^2)^2P_C^2P_U^2\sum_{C_i\neq C_j}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big).
$$
We also know that
$$
\sum_{i=1}^N\sum_{j=1}^N\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)=0,
$$
which means
$$
\sum_{C_i\neq C_j}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)+\sum_{C_i=C_j}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)=0 .
$$
We can therefore write
$$
\sum_{C_i\neq C_j}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)
=-\sum_{C_i=C_j}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\big(\varepsilon_j(1)-\varepsilon_j(0)\big)
=-\sum_{c=1}^C\sum_{i=1}^{N_c}\big(\varepsilon_i(1)-\varepsilon_i(0)\big)\sum_{j=1}^{N_c}\big(\varepsilon_j(1)-\varepsilon_j(0)\big)
=-\sum_{c=1}^CN_c^2\big(\bar\varepsilon_c(1)-\bar\varepsilon_c(0)\big)^2 .
$$
The difference is thus
$$
\frac{P_C^2P_U^2}{n}\sum_{c=1}^CN_c^2\big(\bar\varepsilon_c(1)-\bar\varepsilon_c(0)\big)^2 .
$$
The improvement over the LZ variance estimator is based on this term. In a cluster RCT we bound this term, which can be rewritten as
$$
\frac{P_C^2P_U^2}{n}\sum_{c=1}^CN_c^2\big(\bar\varepsilon_c(1)-\bar\varepsilon_c(0)\big)^2
=n\sum_{c=1}^C\Big(\frac{N_c}{N}\Big)^2\Big(\bar Y_c(1)-\bar Y(1)-\big(\bar Y_c(0)-\bar Y(0)\big)\Big)^2 .
$$
Denote $\lambda_c=\frac{N_c}{N}$, so that
$$
\sum_{c=1}^C\Big(\frac{N_c}{N}\Big)^2\Big(\bar Y_c(1)-\bar Y(1)-\bar Y_c(0)+\bar Y(0)\Big)^2
=\sum_{c=1}^C\Big(\lambda_c\bar Y_c(1)-\lambda_c\bar Y(1)-\lambda_c\bar Y_c(0)+\lambda_c\bar Y(0)\Big)^2 .
$$
For $t=0,1$, $\sum_{c=1}^C\lambda_c\bar Y_c(t)=\bar Y(t)$, and the average of these terms across clusters is $\frac1C\sum_{c=1}^C\lambda_c\bar Y_c(t)=\frac1C\bar Y(t)$. Denote $y_c(t)=\lambda_c\bar Y_c(t)$ and $\bar y(t)=\frac1C\bar Y(t)$. The above sum can be further written as
$$
\begin{aligned}
&\sum_{c=1}^C\Big(\lambda_c\bar Y_c(1)-\tfrac1C\bar Y(1)+\tfrac1C\bar Y(1)-\lambda_c\bar Y(1)-\lambda_c\bar Y_c(0)+\tfrac1C\bar Y(0)-\tfrac1C\bar Y(0)+\lambda_c\bar Y(0)\Big)^2\\
&=\sum_{c=1}^C\Big(\big(y_c(1)-\bar y(1)\big)-\big(y_c(0)-\bar y(0)\big)-\big(\bar Y(1)-\bar Y(0)\big)\big(\lambda_c-\tfrac1C\big)\Big)^2\\
&=\sum_{c=1}^C\big(y_c(1)-\bar y(1)\big)^2+\sum_{c=1}^C\big(y_c(0)-\bar y(0)\big)^2
+\Big(\bar Y(1)^2+\bar Y(0)^2-2\bar Y(1)\bar Y(0)\Big)\sum_{c=1}^C\Big(\lambda_c-\tfrac1C\Big)^2\\
&\quad-2\big(\bar Y(1)-\bar Y(0)\big)\Bigg[\sum_{c=1}^C\big(y_c(1)-\bar y(1)\big)\Big(\lambda_c-\tfrac1C\Big)-\sum_{c=1}^C\big(y_c(0)-\bar y(0)\big)\Big(\lambda_c-\tfrac1C\Big)\Bigg]
-2\sum_{c=1}^C\big(y_c(1)-\bar y(1)\big)\big(y_c(0)-\bar y(0)\big).
\end{aligned}
$$
Based on this expansion, we need to estimate
$$
\operatorname{var}(y_c(1)),\ \operatorname{var}(y_c(0)),\ \operatorname{var}(\lambda_c),\ \operatorname{cov}(y_c(1),\lambda_c),\ \operatorname{cov}(y_c(0),\lambda_c),\ \operatorname{cov}(y_c(1),y_c(0))
$$
and
$$
\bar Y(1)^2,\ \bar Y(0)^2,\ \bar Y(1)\bar Y(0),\ \bar Y(1),\ \bar Y(0).
$$
Writing each sum in terms of sample variances and covariances across clusters (with divisor $C-1$),
$$
\begin{aligned}
n\sum_{c=1}^C\big(\lambda_c\bar Y_c(1)-\lambda_c\bar Y(1)-\lambda_c\bar Y_c(0)+\lambda_c\bar Y(0)\big)^2
=n(C-1)\Big\{&\operatorname{var}(y_c(1))+\operatorname{var}(y_c(0))
+\big(\bar Y(1)^2+\bar Y(0)^2-2\bar Y(1)\bar Y(0)\big)\operatorname{var}(\lambda_c)\\
&-2\big(\bar Y(1)-\bar Y(0)\big)\big[\operatorname{cov}(y_c(1),\lambda_c)-\operatorname{cov}(y_c(0),\lambda_c)\big]
-2\operatorname{cov}(y_c(1),y_c(0))\Big\}.
\end{aligned}
$$
The estimator for this term can be written as
$$
\begin{aligned}
n(C-1)\Big\{&\widehat{\operatorname{var}}(y_c(1))+\widehat{\operatorname{var}}(y_c(0))
+\Big(\hat{\bar Y}(1)^2-\tfrac{n_0}{n_1n}\widehat{\operatorname{var}}(Y_i(1))
+\hat{\bar Y}(0)^2-\tfrac{n_1}{n_0n}\widehat{\operatorname{var}}(Y_i(0))
-2\big(\hat{\bar Y}(1)\hat{\bar Y}(0)-\tfrac1n\widehat{\operatorname{cov}}_L(Y_i(1),Y_i(0))\big)\Big)\widehat{\operatorname{var}}(\lambda_c)\\
&-2\big[\hat{\bar Y}(1)-\hat{\bar Y}(0)\big]\big[\widehat{\operatorname{cov}}(y_c(1),\lambda_c)-\widehat{\operatorname{cov}}(y_c(0),\lambda_c)\big]
-2\,\widehat{\operatorname{cov}}_H(y_c(1),y_c(0))\Big\}.
\end{aligned}
$$
To improve over the LZ variance estimator, we propose
$$
\hat V_H(\hat\tau)=\hat V_{LZ}(\hat\tau)-(C-1)\,\widehat{\operatorname{diff}},
$$
where
$$
\begin{aligned}
\widehat{\operatorname{diff}}=\;&\widehat{\operatorname{var}}(y_c(1))+\widehat{\operatorname{var}}(y_c(0))
+\Big(\hat{\bar Y}(1)^2-\tfrac{n_0}{n_1n}\widehat{\operatorname{var}}(Y_i(1))
+\hat{\bar Y}(0)^2-\tfrac{n_1}{n_0n}\widehat{\operatorname{var}}(Y_i(0))
-2\big(\hat{\bar Y}(1)\hat{\bar Y}(0)-\tfrac1n\widehat{\operatorname{cov}}_L(Y_i(1),Y_i(0))\big)\Big)\widehat{\operatorname{var}}(\lambda_c)\\
&-2\big[\hat{\bar Y}(1)-\hat{\bar Y}(0)\big]\big[\widehat{\operatorname{cov}}(y_c(1),\lambda_c)-\widehat{\operatorname{cov}}(y_c(0),\lambda_c)\big]
-2\,\widehat{\operatorname{cov}}_H(y_c(1),y_c(0)).
\end{aligned}
$$
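The cluster-level term that drives this correction involves both potential outcomes, so it is an oracle quantity. The sketch below, not from the dissertation, computes it in a simulation under the simplifying assumption $P_C=P_U=1$ (so $n=N$), only to illustrate what $\widehat{\operatorname{diff}}$ is trying to estimate.

```python
# Oracle computation (simulation only) of
# n * sum_c (N_c/N)^2 * (Ybar_c(1) - Ybar(1) - (Ybar_c(0) - Ybar(0)))^2.
import numpy as np

rng = np.random.default_rng(3)
C = 40
N_c = rng.integers(10, 30, size=C)                  # unequal cluster sizes
N = N_c.sum()
effect_c = rng.normal(loc=2.0, scale=1.0, size=C)   # cluster-level effect heterogeneity
Y0 = [rng.normal(size=m) for m in N_c]              # potential outcomes Y_i(0)
Y1 = [y0 + eff for y0, eff in zip(Y0, effect_c)]    # potential outcomes Y_i(1)

Ybar0 = np.concatenate(Y0).mean()
Ybar1 = np.concatenate(Y1).mean()
lam = N_c / N
dev = np.array([y1.mean() - Ybar1 - (y0.mean() - Ybar0) for y0, y1 in zip(Y0, Y1)])
diff_term = N * np.sum(lam**2 * dev**2)             # n = N under P_C = P_U = 1
print("oracle cluster-level difference term:", diff_term)
```

With homogeneous cluster-level effects the deviations are zero and the LZ limit coincides with the exact variance; heterogeneity makes this term positive, which is the slack the proposed $\hat V_H$ removes.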
Explicit expression of $V(\Gamma)$

Next we derive an explicit expression for $V(\Gamma)$. Writing the error in terms of the potential-outcome errors,
$$
\begin{aligned}
\Gamma&=\frac{2\sum_{i=1}^NR_i\varepsilon_i(T_i-\bar T)}{\sqrt{NP_CP_U}\,(1-\bar T^2)}
=\frac1{\sqrt{NP_CP_U}\,(1-\bar T^2)}\cdot2\sum_{i=1}^NR_i\Big(T_i\frac{\varepsilon_i(1)-\varepsilon_i(0)}{2}+\frac{\varepsilon_i(1)+\varepsilon_i(0)}{2}\Big)(T_i-\bar T)\\
&=\frac1{\sqrt{NP_CP_U}\,(1-\bar T^2)}\sum_{i=1}^NR_i\Big(T_i\big(\varepsilon_i(1)-\varepsilon_i(0)\big)+\big(\varepsilon_i(1)+\varepsilon_i(0)\big)\Big)(T_i-\bar T)\\
&=\frac1{\sqrt{NP_CP_U}\,(1-\bar T^2)}\Bigg(\sum_{i=1}^NR_iT_i\big(\tilde\varepsilon_i(1)+\tilde\varepsilon_i(0)\big)+\sum_{i=1}^NR_i\big(\tilde\varepsilon_i(1)-\tilde\varepsilon_i(0)\big)\Bigg),
\end{aligned}
$$
where $\tilde\varepsilon_i(1)=(1-\bar T)\varepsilon_i(1)$ and $\tilde\varepsilon_i(0)=(1+\bar T)\varepsilon_i(0)$. Denote
$$
S=\frac{\sum_{i=1}^NR_i\big(\tilde\varepsilon_i(1)-\tilde\varepsilon_i(0)\big)}{\sqrt{NP_CP_U}\,(1-\bar T^2)}
\qquad\text{and}\qquad
D=\frac{\sum_{i=1}^NR_iT_i\big(\tilde\varepsilon_i(1)+\tilde\varepsilon_i(0)\big)}{\sqrt{NP_CP_U}\,(1-\bar T^2)}.
$$
We know that $\mathbb{E}(S)=0$ and $\mathbb{E}(D)=0$, but $\mathbb{E}(SD)\neq0$. Starting from
$$
\mathbb{E}(SD)=\frac1{NP_CP_U(1-\bar T^2)^2}\sum_{i=1}^N\sum_{j=1}^N\mathbb{E}\big(R_iT_iR_j\big)\big(\tilde\varepsilon_i(1)+\tilde\varepsilon_i(0)\big)\big(\tilde\varepsilon_j(1)-\tilde\varepsilon_j(0)\big),
$$
splitting the double sum into diagonal terms, pairs within the same cluster (with $C_{ic}$ the indicator that unit $i$ belongs to cluster $c$), and pairs in different clusters, and collecting terms, we obtain
$$
\mathbb{E}(SD)=\frac{\bar T}{N(1-\bar T^2)^2}\Bigg\{\sum_{i=1}^N\Big(1-\sum_{c=1}^C\frac{N_c}{N}\frac{N_cP_U-1}{N_c-1}\Big)\big(\tilde\varepsilon_i(1)^2-\tilde\varepsilon_i(0)^2\big)
+\sum_{c=1}^C\Big(\sum_{c=1}^C\frac{N_c}{N}\frac{N_cP_U-1}{N_c-1}-P_U\frac{CP_C-1}{C-1}\Big)N_c^2\big(\bar{\tilde\varepsilon}_c(1)^2-\bar{\tilde\varepsilon}_c(0)^2\big)\Bigg\}.
$$
Similarly, starting from
$$
\mathbb{E}(S^2)=\frac1{NP_CP_U(1-\bar T^2)^2}\sum_{i=1}^N\sum_{j=1}^N\mathbb{E}\big[R_iR_j\big]\big(\tilde\varepsilon_i(1)-\tilde\varepsilon_i(0)\big)\big(\tilde\varepsilon_j(1)-\tilde\varepsilon_j(0)\big),
$$
the same split gives
$$
\mathbb{E}(S^2)=\frac1{N(1-\bar T^2)^2}\sum_{i=1}^N\Big(1-\sum_{c=1}^C\frac{N_c}{N}\frac{N_cP_U-1}{N_c-1}\Big)\big(\tilde\varepsilon_i(1)-\tilde\varepsilon_i(0)\big)^2
+\frac1{N(1-\bar T^2)^2}\Big(\sum_{c=1}^C\frac{N_c(N_cP_U-1)}{N(N_c-1)}-\frac{CP_C-1}{C-1}P_U\Big)\sum_{c=1}^CN_c^2\big(\bar{\tilde\varepsilon}_c(1)-\bar{\tilde\varepsilon}_c(0)\big)^2,
$$
and, starting from
$$
\mathbb{E}(D^2)=\frac1{NP_CP_U(1-\bar T^2)^2}\sum_{i=1}^N\sum_{j=1}^N\mathbb{E}\big(R_iT_iR_jT_j\big)\big(\tilde\varepsilon_i(1)+\tilde\varepsilon_i(0)\big)\big(\tilde\varepsilon_j(1)+\tilde\varepsilon_j(0)\big),
$$
we obtain
$$
\begin{aligned}
\mathbb{E}(D^2)&=\frac1{N(1-\bar T^2)^2}\sum_{i=1}^N\Bigg[1-\Big(\sum_{c=1}^C\frac{N_c(N_cP_U-1)}{N(N_c-1)}\Big)\Big(4\sum_{c=1}^C\frac{N_c(N_cq_c-1)}{N(N_c-1)}q_c-4\sum_{c=1}^C\frac{N_c}{N}q_c+1\Big)\Bigg]\big(\tilde\varepsilon_i(1)+\tilde\varepsilon_i(0)\big)^2\\
&\quad+\frac1{N(1-\bar T^2)^2}\sum_{c=1}^C\Bigg[\sum_{c=1}^C\frac{N_c(N_cP_U-1)}{N(N_c-1)}\Big(4\sum_{c=1}^C\frac{N_c(N_cq_c-1)}{N(N_c-1)}q_c-4\sum_{c=1}^C\frac{N_c}{N}q_c+1\Big)\\
&\qquad\qquad-\frac{CP_C-1}{C-1}P_U\Big(4\Big(\sum_{c=1}^C\frac{N_c}{N}q_c\Big)^2-4\sum_{c=1}^C\frac{N_c}{N}q_c+1\Big)\Bigg]N_c^2\big(\bar{\tilde\varepsilon}_c(1)+\bar{\tilde\varepsilon}_c(0)\big)^2 .
\end{aligned}
$$
Note: $\bar T=2\sum_{c=1}^C\frac{N_c}{N}q_c-1$.
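Because the closed-form expressions for $\mathbb{E}(SD)$, $\mathbb{E}(S^2)$, and $\mathbb{E}(D^2)$ depend on the exact sampling and assignment design, it is useful to check them numerically. The sketch below, not from the dissertation, approximates $\operatorname{var}(\Gamma)$ by Monte Carlo under a simplified, hypothetical design (Bernoulli cluster and unit sampling, independent Bernoulli($q_c$) assignment within clusters); it only illustrates how such formulas can be verified by simulation.

```python
# Monte Carlo approximation of var(Gamma) with
# Gamma = 2 * sum_i R_i eps_i (T_i - Tbar) / (sqrt(N P_C P_U) * (1 - Tbar^2)),
# under a hypothetical Bernoulli sampling/assignment design.
import numpy as np

rng = np.random.default_rng(4)
C = 30
N_c = rng.integers(15, 25, size=C)
N = N_c.sum()
cluster = np.repeat(np.arange(C), N_c)
P_C, P_U = 0.8, 0.9
q_c = np.full(C, 0.5)
Tbar = 2 * np.sum((N_c / N) * q_c) - 1              # population mean of T_i

# fixed potential-outcome errors, centered so that they sum to zero
eps1 = rng.normal(size=N); eps1 -= eps1.mean()
eps0 = rng.normal(size=N); eps0 -= eps0.mean()

def draw_gamma():
    Rc = rng.binomial(1, P_C, size=C)[cluster]      # sampled clusters
    R = Rc * rng.binomial(1, P_U, size=N)           # sampled units
    T = 2 * rng.binomial(1, q_c[cluster]) - 1       # treatment assignment
    eps = T * (eps1 - eps0) / 2 + (eps1 + eps0) / 2
    return 2 * np.sum(R * eps * (T - Tbar)) / (np.sqrt(N * P_C * P_U) * (1 - Tbar**2))

draws = np.array([draw_gamma() for _ in range(5000)])
print("Monte Carlo var(Gamma):", draws.var())
```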
Combining the three terms above,
$$
V(\Gamma)=\mathbb{E}\big[(S+D)^2\big]=\mathbb{E}(S^2)+2\,\mathbb{E}(SD)+\mathbb{E}(D^2),
$$
with $\mathbb{E}(SD)$, $\mathbb{E}(S^2)$, and $\mathbb{E}(D^2)$ as displayed above; in the $\mathbb{E}(D^2)$ term the factor $4\big(\sum_{c=1}^C\frac{N_c}{N}q_c\big)^2-4\sum_{c=1}^C\frac{N_c}{N}q_c+1$ can be replaced by $\bar T^2$.

Estimators in cluster RCT

Let $Y_c^T(t)=\sum_{i:C_i=c}Y_i(t)$ and $\varepsilon_c^T(t)=\sum_{i:C_i=c}\varepsilon_i(t)$ denote the cluster totals for $t=0,1$. Then
$$
\begin{aligned}
\sum_{c=1}^C\varepsilon_c^T(1)^2
&=\sum_{c=1}^C\Big(Y_c^T(1)-\tfrac NC\bar Y(1)+\tfrac NC\bar Y(1)-N_c\bar Y(1)\Big)^2\\
&=\sum_{c=1}^C\Big(Y_c^T(1)-\tfrac NC\bar Y(1)\Big)^2+\sum_{c=1}^C\Big(\tfrac NC\bar Y(1)-N_c\bar Y(1)\Big)^2+2\sum_{c=1}^C\Big(Y_c^T(1)-\tfrac NC\bar Y(1)\Big)\Big(\tfrac NC\bar Y(1)-N_c\bar Y(1)\Big)\\
&=(C-1)\operatorname{var}\big(Y_c^T(1)\big)+\bar Y(1)^2\sum_{c=1}^C\Big(\tfrac NC-N_c\Big)^2-2\bar Y(1)(C-1)\operatorname{cov}\big(Y_c^T(1),N_c\big),
\end{aligned}
$$
and analogously
$$
\sum_{c=1}^C\varepsilon_c^T(0)^2=(C-1)\operatorname{var}\big(Y_c^T(0)\big)+\bar Y(0)^2\sum_{c=1}^C\Big(\tfrac NC-N_c\Big)^2-2\bar Y(0)(C-1)\operatorname{cov}\big(Y_c^T(0),N_c\big),
$$
$$
\sum_{c=1}^C\varepsilon_c^T(1)\varepsilon_c^T(0)
=(C-1)\Big(\operatorname{cov}\big(Y_c^T(1),Y_c^T(0)\big)-\bar Y(0)\operatorname{cov}\big(Y_c^T(1),N_c\big)-\bar Y(1)\operatorname{cov}\big(Y_c^T(0),N_c\big)\Big)
+\bar Y(1)\bar Y(0)\sum_{c=1}^C\Big(\tfrac NC-N_c\Big)^2 .
$$
We know that $\mathbb{E}\big(\hat{\bar Y}(t)\big)=\bar Y(t)$ for $t=0,1$, and
$$
\mathbb{E}\big(\hat{\bar Y}(t)^2\big)=\big(\mathbb{E}\hat{\bar Y}(t)\big)^2+\operatorname{var}\big(\hat{\bar Y}(t)\big)=\bar Y(t)^2+\frac{n-n(t)}{n\,n(t)}\operatorname{var}\big(Y_i(t)\big),
$$
$$
\mathbb{E}\big(\hat{\bar Y}(1)\hat{\bar Y}(0)\big)=\mathbb{E}\big(\hat{\bar Y}(1)\big)\,\mathbb{E}\big(\hat{\bar Y}(0)\big)+\operatorname{cov}\big(\hat{\bar Y}(1),\hat{\bar Y}(0)\big)=\bar Y(1)\bar Y(0)+\frac{\operatorname{cov}\big(Y_i(1),Y_i(0)\big)}{n}.
$$

Equivalence between one regression and step-wise regression under homoskedasticity

Consider
$$
Y_i=\alpha+\tau D_i+X_i\beta+\varepsilon_i,\qquad \hat Y_i=\hat\alpha+\hat\tau D_i+X_i\hat\beta .
$$
We first regress $Y_i$ on $X_i$,
$$
Y_i=\alpha_1+X_i\beta+u_i,
$$
and get the residuals $\hat u_i=Y_i-\hat\alpha_1-X_i\hat\beta$. Then we regress $\hat u_i$ on $D_i$,
$$
\hat u_i=\alpha_2+\tau D_i+v_i,
$$
and get the fitted values $\hat{\hat u}_i=\hat\alpha_2+\hat\tau D_i$. Under the assumption of homoskedasticity, the first method gives
$$
\hat V_1(\hat\tau)=\frac{\frac1{N(N-2)}\sum_{i=1}^N\big(Y_i-\hat Y_i\big)^2}{\bar D(1-\bar D)},
$$
while the second method leads to
$$
\hat V_2(\hat\tau)=\frac{\frac1{N(N-2)}\sum_{i=1}^N\big(\hat u_i-\hat{\hat u}_i\big)^2}{\bar D(1-\bar D)}.
$$
Here
$$
Y_i-\hat Y_i=Y_i-\hat\alpha-\hat\tau D_i-X_i\hat\beta,\qquad
\hat u_i-\hat{\hat u}_i=Y_i-\hat\alpha_1-X_i\hat\beta-\hat\alpha_2-\hat\tau D_i,
$$
with $\hat\alpha_1=\bar Y-\hat\beta\bar X$ and $\hat\alpha_2=\bar{\hat u}-\hat\tau\bar D$. Therefore $\hat\alpha_1+\hat\alpha_2=\hat\alpha$, and $Y_i-\hat Y_i=\hat u_i-\hat{\hat u}_i$, asymptotically.
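The asymptotic equivalence of the one-step and step-wise regressions can be illustrated numerically. In the sketch below, which is not from the dissertation, $D_i$ is randomized so that $D$ is nearly orthogonal to $X$ in the sample; the two slope estimates and the two residual vectors then agree up to small finite-sample differences, consistent with the "asymptotically" qualifier above.

```python
# One-step regression of Y on (1, D, X) versus the two-step procedure:
# residualize Y on X, then regress the residual on D.
import numpy as np

rng = np.random.default_rng(5)
N = 20000
X = rng.normal(size=N)
D = rng.binomial(1, 0.5, size=N).astype(float)   # randomized treatment
Y = 1.0 + 2.0 * D + 0.8 * X + rng.normal(size=N)

def ols(Z, y):
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

# one-step regression
Z = np.column_stack([np.ones(N), D, X])
coef1 = ols(Z, Y)
tau_one = coef1[1]

# two-step regression
coef_yx = ols(np.column_stack([np.ones(N), X]), Y)
u_hat = Y - np.column_stack([np.ones(N), X]) @ coef_yx
coef_ud = ols(np.column_stack([np.ones(N), D]), u_hat)
tau_two = coef_ud[1]

resid_one = Y - Z @ coef1
resid_two = u_hat - np.column_stack([np.ones(N), D]) @ coef_ud
print("one-step tau:", tau_one, " two-step tau:", tau_two)
print("max residual discrepancy:", np.max(np.abs(resid_one - resid_two)))
```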