Essays on the Estimation and Inference of
Heterogeneous Treatment Effects
by
Jingbo Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Economics)
August 2021
Copyright 2021 Jingbo Wang
I dedicate this thesis to my parents,
for their unconditional love and sacrifices.
Acknowledgements
First of all, I would like to thank my advisor, Professor Cheng Hsiao, for his genuine support and unselfish help during all these years. It was never easy, and I am very fortunate to have him as my advisor. He not only teaches me by example how to conduct first-class research and be a true scholar, but also shows me the significance of being resilient and optimistic.
I would like to thank my dissertation committee members, Professor Yufeng Huang, Professor
Jeffrey Nugent, and Professor Sha Yang, especially for their generous help and mentorship when I
was on the job market. It has been an unusual year and their kind support helped me through.
I would like to thank all the professors who have nurtured my scholarly character. In particular, I would like to thank Professor Yingying Fan, Professor Jinchi Lv, Professor Roger Moon, Professor Paulina Oliva, Professor Jong-shi Pang, Professor Geert Ridder, and Professor Gerard Tellis at the University of Southern California; Professor Manuel Arellano, Professor David Dorn, Professor Gerard Llobet, Professor Pedro Mira, and Professor Rafael Repullo at CEMFI; and Professor Meijin Wang and Professor Xianxiang Xu at Sun Yat-sen University.
I would like to thank all of my friends, both for the good times we have had together and for being with me on my blue days. For a long time, I have suffered from depression and insomnia; their priceless company helped me get through difficult times and overcome obstacles. Owing to space limitations, it is a pity that I cannot list all their names here.
Finally, I would like to thank my family: my parents Faqing Wang and Xiaoying Hu, for their endless love; my late grandparents Futing Hu and Meiju Wang, for raising me from infancy and giving me a happy childhood; and my fiancée Jiayi Wang, for bringing peace and joy into my life.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Estimation and Inference
  1.1 Related literature
  1.2 Distributional nearest neighbors
    1.2.1 Model setting
    1.2.2 Distributional nearest neighbors estimator
    1.2.3 Technical conditions
    1.2.4 Asymptotic results
  1.3 Two-scale bias reduction
    1.3.1 Two-scale DNN Estimator
    1.3.2 Asymptotic normality
    1.3.3 Implementation
  1.4 Simulation studies
    1.4.1 Two-scale DNN
    1.4.2 Comparisons with random forest
  1.5 Discussions

Chapter 2: Endogeneity
  2.1 Related literature
  2.2 Theoretical framework
    2.2.1 Model setup
    2.2.2 Heterogeneous treatment effect
    2.2.3 Control function approach
    2.2.4 Assumptions
    2.2.5 Identification
    2.2.6 An explanation
  2.3 Estimation and inference
    2.3.1 Estimation
    2.3.2 Inference
  2.4 Monte Carlo simulation
  2.5 Discussion

Chapter 3: Empirical applications
  3.1 Application 1: The effects of maternal smoking
  3.2 Application 2: The price elasticities of yogurt
    3.2.1 Sample construction
    3.2.2 Empirical findings

Bibliography

Appendices
  C Proofs of main results
    C.1 Proof of Theorem 1
      C.1.1 Lemma 1 and its proof
      C.1.2 Lemma 2 and its proof
      C.1.3 Proof of Theorem 1
    C.2 Proof of Theorem 2
      C.2.1 Lemma 3 and its proof
      C.2.2 Proof of Theorem 2
    C.3 Proofs of Theorems 3 and 4
    C.4 Proof of Theorem 5
List of Tables

1.1 DNN and random forest on fixed points
1.2 DNN and random forest on random points
2.1 Simulation results
3.1 Yogurt brand and size
3.2 Yogurt brand and size
3.3 Price elasticities at representative points
List of Figures

1.1 The choice of subsampling scale s and the resulting bias and mean squared error
1.2 The choice of neighborhood size k and the resulting bias and mean squared error
2.1 An explanation on identification
2.2 Simulation with linear and quadratic heterogeneity
2.3 Simulation with the BLP as true model
3.1 Heterogeneous treatment effects of smoking on child birth weights
3.2 Own and cross price elasticities with respect to the price of Dannon small
3.3 Own and cross price elasticities with respect to the price of Yoplait small
3.4 Own and cross price elasticities with respect to the price of Dannon large
3.5 Own and cross price elasticities with respect to the price of Yoplait large
Abstract

In this dissertation, I propose a novel estimator of heterogeneous treatment effects and further generalize it to the case of treatment endogeneity. To achieve this goal, I adapt the classical nearest neighbors estimator from the statistics literature and reshape it from the modern perspective of machine learning. These changes are grounded in novel theoretical results that I derive, and in practice they bring significant empirical improvements, which I demonstrate in a series of Monte Carlo simulations.

I further tackle the problem of estimating heterogeneous treatment effects in the presence of treatment endogeneity. I propose a novel and straightforward approach to deal with endogeneity, building on the classical control function literature in economics. This route is particularly convenient for statistical inference: I prove that the bootstrap can be directly used for inference.

In this dissertation, I also conduct two empirical studies with the newly proposed methods, in the subfields of health economics and business economics. These two studies are not merely illustrative applications; they target important economic problems and are intended to bring new insights to our understanding of the economic world.
Chapter 1
Estimation and Inference
In nearly all fields of economics, economists aspire to infer causal relationships. Has international trade polarized domestic income inequality? Does metro access lower neighborhood housing values? Are food stamps actually alleviating poverty? The introduction of the potential outcomes framework, or the Rubin causal model (Rubin, 1974; Imbens and Rubin, 2015), has revolutionized the way such questions are answered. The potential outcomes framework perceives outcomes as living in parallel universes, one in which the event happened and one in which it did not. The usual object of interest is the mean difference between the two conceptual outcomes, which is the classic average treatment effect (ATE). A natural refinement conditions the average treatment effect on an individual's fixed feature vector; this unit of analysis goes under the name of conditional average treatment effects (CATE) (MaCurdy et al., 2011) or heterogeneous treatment effects (HTE) (Heckman et al., 1997; Crump et al., 2007). Heterogeneous treatment effects are treatment effects at the individual level, as opposed to average treatment effects at the population level.
The concept of heterogeneous treatment effects has begun to receive increasing attention and has its own advantages in the modern big data era. First, it addresses an identification concern for average treatment effects with high-dimensional covariates. The identification of average treatment effects rests on the unconfoundedness condition and the overlap condition, also known as the common support condition. The overlap condition requires that, at any covariate value, an observation has a positive chance of being in the treatment (or control) group. However, D'Amour et al. (2017) have shown that common support can be less plausible to maintain with high-dimensional covariates. Therefore, we may not be sure what averages are being reported by average treatment effects in a high-dimensional setting. Second, heterogeneous treatment effects can help further explore the causal mechanisms behind treatment effects. Traditional average treatment effect estimations usually act like a black box, and the causal mechanisms behind treatment effects are largely ignored (Imai et al., 2011; Ding et al., 2018). On the contrary, heterogeneous treatment effects can be incorporated into the mediation analysis framework (Imai et al., 2010; Tchetgen and Shpitser, 2012), which aims to uncover causal mechanisms behind treatment effects and has been popular in political science, epidemiology, and biomedical studies. Third, heterogeneous treatment effects are themselves the center of interest in policy evaluation, personalized medicine, and customized marketing (Imai and Ratkovic, 2013; Grimmer et al., 2017; Powers et al., 2018). They provide richer information than averages, and the heterogeneity information can be invaluable in a wide range of modern big data challenges in economics, business, and healthcare.
To our knowledge, Wager and Athey (2018) are the first to address the need for a useful estimator of heterogeneous treatment effects in a high-dimensional setting. In their seminal paper, they establish the asymptotic theory for the random forests algorithm and creatively introduce this machine learning method into the estimation of heterogeneous treatment effects. In Monte Carlo simulations, their method outperforms the classical nonparametric k-NN estimator in bias and mean squared error. The intuition for this groundbreaking result is that the random forests algorithm can be perceived as a variant of nearest neighbor methods, one that is able to fully exploit the information in the data and consequently assigns adaptive weights to the nearest neighbors. With this extra information, their method achieves improved precision with high-dimensional covariates in finite samples.
In this paper, we provide an alternative method to estimate heterogeneous treatment effects. Our recipe is to subsample the data and average the 1-nearest neighbor estimators from each subsample. This turns out to be equivalent to assigning a monotone weight to the nearest neighbors in a distributional fashion. We name the estimator the (primitive) distributional nearest neighbors (DNN) estimator and prove it to be asymptotically unbiased and normal when the subsampling scale diverges with the sample size n. A nice feature of the distributional nearest neighbors concept is that it generates an interesting class of estimators: new DNN estimators with reduced bias can be formed as weighted sums of existing DNN estimators with different subsampling scales. Moreover, the bias reduction can take the estimation performance of heterogeneous treatment effects to a new level.

Our approach originates from panel bias reduction (Hahn and Kuersteiner, 2002; Hahn and Newey, 2004; Arellano and Hahn, 2013; Dhaene and Jochmans, 2015) in panel data analysis (Hsiao, 2014). If each subsample of size s is treated as repeated observations of an individual, the joint analysis of all subsamples becomes a panel data problem; the subsample size s in our paper is the counterpart of the time-series length T in panel analysis. However, the essence of the various bias reduction practices dates back at least to the jackknife (Schucany et al., 1971; Efron, 1982), whose classical case pivots on an invariant bias form and leverages the two sample-size scales n and n - 1.
Our paper is rooted in the nearest neighbors literature (Mack, 1980; Györfi et al., 2002; Samworth, 2012; Biau and Devroye, 2015; Berrett et al., 2018) and the matching literature (Abadie and Imbens, 2006; Rosenbaum, 2010). The form of our (primitive) distributional nearest neighbors estimator coincides with the limit of a particular case of bootstrap aggregating (bagging) nearest neighbors without replacement (Hall and Samworth, 2005; Biau et al., 2010; Samworth, 2012). However, the nature of heterogeneous treatment effects estimation is a regression problem, and the central pursuit here is point-wise inference. Our paper thus differs from existing work and contributes by formally deriving a desirable class of estimators beyond the classic scope of bagging and by extending the bagging idea to a more general setting. The bias reduction algorithm we propose turns out to work very well, and we demonstrate its effectiveness in several Monte Carlo simulations. We also derive the asymptotic normality result, which is central for inference in empirical applications.
The rest of this paper is organized as follows. We give a brief review of related literature in Section 1.1. Section 1.2 introduces the DNN procedure and investigates its asymptotic properties. We formally introduce our two-scale DNN framework and establish the asymptotic properties of the new method in Section 1.3. Section 1.4 presents several Monte Carlo simulation examples to demonstrate the advantages of DNN. We provide an application of our method to a real-life data set, studying the heterogeneity of the treatment effects of smoking on children's birth weights across mothers' ages, in Section 3.1. Section 1.5 discusses some implications and extensions of our work. The proofs of the main results are relegated to the Appendix.
1.1 Related literature
There is an emerging literature on causal inference with high-dimensional covariates in economics (Fan et al., 2011; Belloni et al., 2014a). Big data (Fan and Lv, 2008; Fan and Fan, 2009; Fan et al., 2010; Fan and Lv, 2018) holds tremendous potential for many studies in economics (Athey and Imbens, 2017; Mullainathan and Spiess, 2017). In causal inference, there have been encouraging methodological advances. Novel tools for the high-dimensional setting have been ingeniously developed for the estimation of average treatment effects (Belloni et al., 2014b; Chernozhukov et al., 2015; Fan et al., 2016; Chernozhukov et al., 2017; Athey et al., 2018). Modern machine learning methods, such as random forests, have also been innovatively introduced to the estimation of heterogeneous treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018; Athey et al., 2018).

This paper also connects in spirit to the literature on higher order kernels (Schucany and Sommers, 1977; Bierens, 1987; Fan and Hu, 1992; Newey et al., 2004). The central idea of bias reduction, both in higher order kernels and in our paper, is the generalized jackknife (Schucany et al., 1971; Efron, 1982); Jones and Foster (1993) provide an excellent review. Our contribution is not that we invent a new idea to reduce bias but that we provide a practically effective way to implement it.
1.2 Distributional nearest neighbors
1.2.1 Model setting
For the estimation of heterogeneous treatment effects, we exploit the conventional potential outcomes framework. While our results can be conveniently extended to multi-valued treatment settings, we focus on the binary case without loss of generality. Suppose we have observations on $(X, W, Y)$, in which $Y$ is a scalar response, $X$ is the pre-treatment feature vector with dimensionality $d$, and $W$ is a binary treatment assignment indicator, with $W = 1$ treated and $W = 0$ untreated. It is assumed that there is one potential outcome associated with each treatment assignment status, namely $Y(0)$ and $Y(1)$ in the binary case, where $Y(0)$ and $Y(1)$ are the two random variables corresponding to the outcome without and with treatment, respectively. The dilemma is that they are never observed simultaneously. The observed response is $Y = Y(0)(1 - W) + Y(1)W$.

Traditionally, the interest is in the (super-population) average treatment effect (ATE) $\tau$ of $W$ on $Y$, which is defined as
$$\tau = \mathbb{E}[Y(1) - Y(0)]. \tag{1}$$
Given a fixed feature vector $x$, the heterogeneous treatment effect (HTE) of $W$ on $Y$ at the point $x$ is given by
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]. \tag{2}$$
The estimation and inference of the HTE $\tau(x)$ is our goal in this paper. Ideally, if both $Y(1)$ and $Y(0)$ were observable, the problem would reduce to a classical nonparametric regression problem. However, that is a luxury we do not have in observational studies. Instead we assume the unconfoundedness condition (Rubin, 1974).

Assumption 1. The treatment assignment is unconfounded in that it does not depend on the potential outcomes conditional on $X$, that is,
$$Y(0), Y(1) \perp\!\!\!\perp W \mid X. \tag{3}$$

The unconfoundedness condition entails that treatment assignments can be regarded as random for observations with the same feature vector. Under this condition, it holds that
$$\tau(x) = \mathbb{E}[Y \mid X = x, W = 1] - \mathbb{E}[Y \mid X = x, W = 0]. \tag{4}$$
Thus the estimation of $\tau(x)$ can be decomposed into the estimation of $\mathbb{E}[Y \mid X = x, W = 1]$ in the treated group and of $\mathbb{E}[Y \mid X = x, W = 0]$ in the control group, which are classical nonparametric regression problems.

In this paper, we take the approach of estimating $\mathbb{E}[Y \mid X = x, W = 1]$ and $\mathbb{E}[Y \mid X = x, W = 0]$ separately and then combining them to estimate the heterogeneous treatment effect $\tau(x)$. To ease our presentation, we demonstrate the estimation method for the treated group with $W = 1$; for the control group with $W = 0$, the method and theory apply in the same fashion. Another way to put it is that we can assume, without loss of generality, that the untreated response is a constant of zero. In real applications, separate regressions are run on both arms. The separated regression is also adopted in Wager and Athey (2018). The potential drawback is the loss of semi-parametric efficiency, which is discussed in Section 1.5.

Similar to nonparametric regression models, we assume that for the treated group,
$$Y = \mu(X) + \varepsilon, \tag{5}$$
with $\varepsilon$ an independent noise with mean zero and $\mu(\cdot)$ the unknown relationship between $X$ and $Y$. Moreover, an independent and identically distributed (i.i.d.) sample of size $n$, $\{(X_i, Y_i)\}_{i=1}^{n}$, is observed for the treated group. In the following, we take (5) as our working model, and our target is to estimate $\mu(x)$ for some given $x$. Here, $x$ can be beyond the $X_i$'s appearing in the sample, and any further assumptions are with respect to model (5).
1.2.2 Distributional nearest neighbors estimator
We first revisit the classical k-nearest neighbors (k-NN) procedure for nonparametric regression. Given a fixed point $x$, we can compute the Euclidean distance of each sample point to $x$ and then reorder the sample by this distance. The sample can be relabeled using the order of distances,
$$\|X_{(1)} - x\| \le \|X_{(2)} - x\| \le \cdots \le \|X_{(n)} - x\|,$$
where $\|\cdot\|$ denotes the Euclidean distance, and ties are broken by maintaining the original order of labels. Other distance measures can be used, but they are not the focus of this paper. Here $X_{(1)}$ is the closest point in the sample to $x$, and the $Y_{(1)}$ associated with $X_{(1)}$ is thus the 1-nearest neighbor estimate of $\mu(x)$. In general, the k-nearest neighbors (k-NN) estimator uses the first $k$ nearest neighbors for estimation,
$$\hat{\mu}_{k\text{-NN}} = \frac{1}{k} \sum_{i=1}^{k} Y_{(i)}.$$
The closest $k$ nearest neighbors have equal weights $1/k$ while the other observations have zero weights. It is well known that the classical k-nearest neighbors estimator suffers from the curse of dimensionality. There is a bias term that vanishes asymptotically but compromises finite-sample estimation precision, and this bias term becomes more of a problem as the dimension of the covariates grows. The challenge in adapting nearest neighbor methods to high-dimensional data is therefore how to effectively remove the bias term.
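To fix ideas, here is a minimal sketch in Python of the k-NN estimator just described, written as a weighted average over distance-sorted responses; the function name and interface are our own illustration, not from the dissertation.

```python
import numpy as np

def knn_estimate(X, y, x, k):
    """Classical k-NN regression estimate of mu(x): equal weight 1/k on the
    k responses whose covariates are closest to x, zero weight elsewhere."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))  # rank sample by distance to x
    return y[order[:k]].mean()
```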
In this paper we propose a new estimator which makes the bias reduction straightforward and practical. Before we proceed, we give formal notation for the 1-nearest neighbor estimator on subsampled observations, which is the building block of our (primitive) DNN estimator. Let $\{i_1, \ldots, i_s\}$ with $i_1 < i_2 < \cdots < i_s$ and $s \le n$ be a subset of $\{1, \ldots, n\}$. With $Z_{i_j}$ as a shorthand for $(X_{i_j}, Y_{i_j})$, we define $F(x; Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s})$ as the 1-nearest neighbor estimator of $\mu(x)$ in the subsample $\{(Z_{i_j})_{j=1}^{s}\}$,
$$F(x; Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s}) = Y_{(1)}(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s}). \tag{6}$$
Our recipe for the (primitive) DNN estimator is thus to average the 1-nearest neighbor estimators from all the subsamples of size $s$, where $1 \le s \le n$. When $s = n$, it is just the conventional 1-nearest neighbor estimator $Y_{(1)}$. When $s = 1$, it reduces to the simple sample average. This setup happens to coincide with the classical idea of a U-statistic with $F$ as its kernel (Hoeffding, 1948; Hájek, 1968; Korolyuk and Borovskich, 1994). Our formal definition of a (primitive) DNN estimator with subsampling scale $s$ is
$$D_n(s)(x) = \binom{n}{s}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_s \le n} F(x; Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s}). \tag{7}$$
Averaging over all subsamples is equivalent to taking a weighted average of the distance-sorted responses: the $i$-th nearest neighbor of $x$ is the nearest one in exactly $\binom{n-i}{s-1}$ of the $\binom{n}{s}$ subsamples, so that
$$D_n(s)(x) = \sum_{i=1}^{n} \binom{n-i}{s-1} \binom{n}{s}^{-1} Y_{(i)}, \tag{8}$$
an L-statistic that places monotone, distribution-like weights on the nearest neighbors.
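As a sanity check on this distributional view, the following sketch (our own illustration, not from the dissertation) enumerates all size-$s$ subsamples for a tiny $n$ and confirms that the subsample average in (7) coincides with the weighted sum of distance-sorted responses in (8).

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(0)
n, s, d = 12, 4, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x = np.zeros(d)

dist = np.linalg.norm(X - x, axis=1)

# U-statistic form (7): average the 1-NN response over all size-s subsamples.
u_form = np.mean([y[min(idx, key=dist.__getitem__)]
                  for idx in combinations(range(n), s)])

# L-statistic form (8): the i-th nearest neighbor is the nearest point in
# C(n-i, s-1) of the C(n, s) subsamples (math.comb gives 0 when s-1 > n-i).
w = np.array([comb(n - i, s - 1) for i in range(1, n + 1)]) / comb(n, s)
l_form = w @ y[np.argsort(dist)]

assert np.isclose(u_form, l_form)  # the two representations agree
```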
1.2.3 Technical conditions

With the underlying data generating process specified above, we further assume that we have i.i.d. data.
Assumption 3. We have an i.i.d. sample of size $n$, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, from model (5).
In summary, we have three assumptions in total. Condition 1 is the unconfoundedness assumption, the fundamental setup in the potential outcomes framework for causal inference; with this setup, the causal inference problem reduces to a nonparametric regression problem. Condition 2 is regular and commonly imposed in nonparametric regressions. Condition 3 specifies the data we have. Although it appears idealized at first sight, the assumption of i.i.d. data is commonly made in modern machine learning; it enables researchers to provide key insights with a simplified technical presentation.
1.2.4 Asymptotic results
We are ready to present the asymptotic properties of (primitive) DNN. Our first theorem establishes the bias of the (primitive) DNN estimator $D_n(s)(x)$, and the second theorem shows that $D_n(s)(x)$ can be asymptotically normal with an appropriately chosen subsampling scale $s$.

Theorem 1. Given $x \in \mathrm{supp}(X)$, under Conditions 2–3 we have
$$\mathbb{E} D_n(s)(x) = \mu(x) + B(s), \tag{9}$$
where
$$B(s) = \Gamma(2/d + 1)\, \frac{f(x)\,\mathrm{tr}\!\left(\mu''(x)\right) + 2\,\mu'(x)^{T} f'(x)}{2 d\, V_d^{2/d}\, f(x)^{1 + 2/d}}\; s^{-2/d} + o\!\left(s^{-2/d}\right), \qquad V_d = \frac{\pi^{d/2}}{\Gamma(1 + d/2)}, \tag{10}$$
$\Gamma(\cdot)$ denotes the Gamma function, $f'(x)$ and $\mu'(x)$ are the first order gradients at $x$ of $f(x)$ and $\mu(x)$, respectively, $\mu''(x)$ is the Hessian matrix of $\mu(\cdot)$ at $x$, and $\mathrm{tr}(\cdot)$ gives the trace.
Theorem 1 gives the form of the asymptotic bias of the (primitive) DNN estimator. The key idea of the proof comes from Biau and Devroye (2015) for the case of k-nearest neighbors; details of our proof are provided in the Appendix. We can see that the leading order of the bias converges to 0 at rate $s^{-2/d}$. Therefore, when the subsampling scale $s \to \infty$, the (primitive) DNN estimator is asymptotically unbiased. The interesting and surprising result in Theorem 1 is that the coefficient in the leading order of the bias term $B(s)$ does not depend on the subsampling scale $s$. This is the key feature that opens the door to our two-scale bias reduction, presented in Section 1.3. When the dimensionality $d$ of the features is large, the rate of convergence of the main bias term is slow; in situations like this, it is all the more beneficial to remove the first order bias.
Theorem 2. Given $x \in \mathrm{supp}(X)$, under Conditions 2–3, and assuming in addition that $s \to \infty$ and $s/n \to 0$, we have for some positive $\sigma_n$ with $\sigma_n^2 = O(s/n)$,
$$\frac{D_n(s)(x) - \mu(x) - B(s)}{\sigma_n} \xrightarrow{\;D\;} N(0, 1). \tag{11}$$
Details of the proof are given in the Appendix. Theorem 2 establishes the asymptotic normality of the DNN estimator. Since the DNN estimator is a U-statistic, our proof builds on the traditional U-statistic framework in Serfling (2018) and Korolyuk and Borovskich (1994). The major difference between our approach and the traditional U-statistic framework is that the classical framework only allows the subsampling scale $s$ to be finite; some of the classical results do not apply naturally to the case when $s \to \infty$.

Theorem 2 tells us that the (primitive) DNN estimator is asymptotically normal. An interesting difference from the random forest approach is that the convergence rate of the (primitive) DNN estimator can be derived to be $\sqrt{n/s}$. When $s = O(n^{d/(d+4)})$, the optimal rate of convergence in terms of mean squared error is obtained. Another feature is that the DNN estimator is a simple L-statistic, so the bootstrap (Efron, 1982) can be directly used for variance estimation (Tu and Ping, 1989; Shao and Tu, 1995). Consequently, we do not need to know the exact form of the asymptotic variance as long as it is bounded. The exact form depends on the unknown error variance $\sigma_\varepsilon^2$, the underlying unknown distribution $f$, and the evaluated point $x$; even if it were available, the plug-in estimation of the variance would still be challenging given the intermediate unknowns.
1.3 Two-scale bias reduction
1.3.1 Two-scale DNN Estimator
As seen in Theorem 1, the DNN estimator can be asymptotically unbiased and normal for $s$ appropriately chosen. In fact, we see in Theorem 1 that
$$\mathbb{E} D_n(s)(x) = \mu(x) + c\, s^{-2/d} + o(s^{-2/d}). \tag{12}$$
Here the constant $c$ is specified in Theorem 1. It depends only on the underlying data generating process and does not change when we choose a different subsampling scale $s$. Such an appealing property gives us an effective way to reduce the leading order bias.

Consider two (primitive) DNN estimators with different subsampling scales $s_1$ and $s_2$. Their asymptotic biases have the following forms:
$$\mathbb{E} D_n(s_1)(x) = \mu(x) + c\, s_1^{-2/d} + o(s_1^{-2/d}),$$
$$\mathbb{E} D_n(s_2)(x) = \mu(x) + c\, s_2^{-2/d} + o(s_2^{-2/d}).$$
We then proceed with solving the following system of linear equations,
$$w_1 + w_2 = 1, \tag{13}$$
$$w_1 s_1^{-2/d} + w_2 s_2^{-2/d} = 0, \tag{14}$$
yielding the weights $w_1 = w_1(s_1, s_2) = s_2^{-2/d}\big/\big(s_2^{-2/d} - s_1^{-2/d}\big)$ and $w_2 = -\,s_1^{-2/d}\big/\big(s_2^{-2/d} - s_1^{-2/d}\big)$.
We propose the two-scale DNN estimator as
$$D_n(s_1, s_2)(x) = w_1 D_n(s_1)(x) + w_2 D_n(s_2)(x).$$
Equation (13) ensures that the two-scale DNN estimator remains asymptotically unbiased for $\mu(x)$, while Equation (14) imposes the constraint that removes the first order bias. Compared to the (primitive) DNN estimator $D_n(s)(x)$, the two-scale DNN estimator $D_n(s_1, s_2)(x)$ is free of the first order bias. As a result, the trade-off between bias and variance can be made at a new level. In our extensive simulation studies in Section 1.4, this bias reduction substantively improves estimation precision and mean squared errors. The construction of confidence intervals also becomes more meaningful with reduced bias.

From the derivations above, we can see that either $w_1$ or $w_2$ is negative. This implies that the DNN framework can assign negative weights to distant nearest neighbors, in contrast to the nonnegative weights that existing random forests algorithms assign. This feature may partially provide an intuitive rationale for the performance of DNN estimators.

It is also worth noting that, under further conditions, bias reduction with three or more scales can remove higher order bias terms as well. There are related discussions in the literature on higher order kernels; we reserve this question for future work.
1.3.2 Asymptotic normality
We next give a formal theorem for the two-scale DNN estimator.
Theorem 3. Given $x \in \mathrm{supp}(X)$, under Conditions 2–3, and assuming in addition that $s_1 \to \infty$ with $s_1/n \to 0$, $s_2 \to \infty$ with $s_2/n \to 0$, and $s_1 \ne s_2$, for some positive $\sigma_n$ with $\sigma_n^2 = O(s_1/n + s_2/n)$ we have
$$\frac{D_n(s_1, s_2)(x) - \mu(x) - \Lambda}{\sigma_n} \xrightarrow{\;D\;} N(0, 1), \tag{15}$$
where $(w_1, w_2)$ is the solution to Equations (13) and (14), and $\Lambda = o\big(s_1^{-2/d} + s_2^{-2/d}\big)$.
Finally, we present a theorem for the case when the control group is not degenerate. Subscripts return temporarily for the next paragraphs. For the treated group and the control group, respectively, let $n_1$ and $n_0$ denote the i.i.d. sample sizes, $s^{(1)}$ and $s^{(0)}$ the subsampling scales, $\mathrm{supp}(X_1)$ and $\mathrm{supp}(X_0)$ the supports, $\mu_1(\cdot)$ and $\mu_0(\cdot)$ the regression functions, $\varepsilon_1$ and $\varepsilon_0$ the random noises, $D_n^{(1)}(s_1)(x)$ and $D_n^{(0)}(s_0)(x)$ the (primitive) DNN estimators, $\Lambda$ the bias, and $Y_1$ and $Y_0$ the responses. The heterogeneous treatment effect at point $x$ is $\tau(x)$.
Theorem 4. Given $x \in \mathrm{supp}(X_1) \cap \mathrm{supp}(X_0)$, under Conditions 1–3 for both the treated and the control group with their respective subscripts, and assuming in addition that $s_i^{(1)} \to \infty$ with $s_i^{(1)}/n_1 \to 0$ for $i = 1, 2$, $s_i^{(0)} \to \infty$ with $s_i^{(0)}/n_0 \to 0$ for $i = 1, 2$, and $s_1^{(0)} \ne s_2^{(0)}$, $s_1^{(1)} \ne s_2^{(1)}$, for some positive $\sigma_n$ with $\sigma_n^2 = O\big(s_1^{(1)}/n_1 + s_2^{(1)}/n_1 + s_1^{(0)}/n_0 + s_2^{(0)}/n_0\big)$ we have
$$\frac{\big[D_n^{(1)}(s_1^{(1)}, s_2^{(1)})(x) - D_n^{(0)}(s_1^{(0)}, s_2^{(0)})(x)\big] - \tau(x) - \Lambda}{\sigma_n} \xrightarrow{\;D\;} N(0, 1), \tag{16}$$
where $\Lambda = o\big((s_1^{(1)})^{-2/d} + (s_2^{(1)})^{-2/d} + (s_1^{(0)})^{-2/d} + (s_2^{(0)})^{-2/d}\big)$.
The evaluated point $x$ is required to be in the support of both the treated and control groups. This condition is easier to check and verify than the common support condition in estimating average treatment effects (ATE), which requires that the supports fully overlap. We can also see that the final rate of convergence is dominated by the slower side. A potential caveat of the separated regression is semi-parametric efficiency. However, an advantage of the separated regression approach is that the whole sample does not need to represent the proper proportions of treated and control units in the population. This feature provides flexibility in observational studies when the sampling cost differs greatly between the treated and the control; for example, it is easier to survey those who come to vote than those who do not. It is also clear that the separated regression approach can be flexibly extended to multi-valued treatment settings.
1.3.3 Implementation
For the remainder of the paper we exploit the (primitive) DNN estimators $D_n(s)$ and $D_n(2s)$ and use this combination to form new (two-scale) DNN estimators; the choice is only for simplicity. We can obtain the weights by solving Equations (13) and (14) for the scales $s$ and $2s$, so that
$$\frac{1}{1 - 2^{2/d}}\, D_n(s) + \frac{2^{2/d}}{2^{2/d} - 1}\, D_n(2s)$$
is our new (two-scale) DNN estimator.
When $d = 3$, the new (two-scale) DNN estimator is approximately $-1.70\, D_n(s) + 2.70\, D_n(2s)$. When $d = 10$, it is approximately $-6.73\, D_n(s) + 7.73\, D_n(2s)$. The coefficients for the case of $d = 10$ are large and can inflate the variance. We can compromise in situations like this and choose the weights for $d = 3$, that is, $-1.70\, D_n(s) + 2.70\, D_n(2s)$. This simplification comes from our choice of $s$ and $2s$ and may not be optimal; it is a trade-off between bias and variance. When other scales are chosen, this simplification is not necessary.
Athey and Imbens (2016) and Wager and Athey (2018) proposed an honest rule: the response of each observation is used either to estimate the treatment effects or to decide where to place the splits, but not both. They therefore split the whole sample into two subsamples, one used for deciding partitions and the other for estimation. Our approach is different. When we apply the DNN estimator with a specific choice of the subsampling scale $s$, the weight distribution is deterministic, and when we compute distances and obtain rankings, only the information in $X$ is used. Our framework thus agrees with the honest rule, and there is no need to split the data. In a data-rich environment, a portion of the whole sample can still be set aside to tune the subsampling scale $s$.
In this paper, we also provide a simple and straightforward tuning algorithm. Our tuning procedure computes the debiased two-scale DNN estimator for $s = 1, 2, \ldots$ and continues until the difference in absolute differences of successive debiased two-scale DNN estimates changes sign; this is the point where the curvature of the debiased two-scale DNN estimate changes (a code sketch of this rule appears below). The intuition comes from the curve structure in Figure 1.1. This algorithm works fast and well in our simulations. In future work, it will be helpful to see how cross-validation methods can assist the tuning process.
For computation, we use the L-statistic representation instead of the U-statistic one. The L-statistic representation avoids the computational cost of running through all subsample combinations. To be specific, we first compute the distributional weights for subsampling scales $s$ and $2s$ according to Equation (8). We then combine them as $-1.70\, D_n(s) + 2.70\, D_n(2s)$ to form the new distributional weight for the two-scale DNN. After this, we simply sort the observations by Euclidean distance and apply the new weight vector to take a weighted average. The distributional point of view is scalable and greatly reduces the cost of computation.
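The recipe above condenses to a few lines. The following sketch is our own implementation under the stated choices (scales $s$ and $2s$, weights from Equations (13)–(14)), together with one possible reading of the tuning rule from the previous paragraph; the function names are ours, not from the dissertation.

```python
import numpy as np
from math import comb

def dnn_weights(n, s):
    """Distributional weights of Equation (8): the i-th nearest neighbor of x
    receives weight C(n-i, s-1) / C(n, s); requires s <= n."""
    w = np.array([comb(n - i, s - 1) for i in range(1, n + 1)], dtype=object)
    return (w / comb(n, s)).astype(float)

def two_scale_dnn(X, y, x, s, d=None):
    """Two-scale DNN estimate at x from scales s and 2s (needs 2*s <= n),
    with weights w1 = 1/(1 - 2^(2/d)) and w2 = 2^(2/d)/(2^(2/d) - 1)
    solving Equations (13)-(14)."""
    n = X.shape[0]
    d = X.shape[1] if d is None else d      # d = 3 can be forced, as in the text
    a = 2.0 ** (2.0 / d)
    w1, w2 = 1.0 / (1.0 - a), a / (a - 1.0)
    weights = w1 * dnn_weights(n, s) + w2 * dnn_weights(n, 2 * s)
    order = np.argsort(np.linalg.norm(X - x, axis=1))  # sort sample by distance to x
    return float(weights @ y[order])

def tune_s(X, y, x, s_max=125):
    """Our reading of the tuning rule in the text: scan s = 1, 2, ... and stop
    where the difference of absolute successive differences changes sign."""
    est = np.array([two_scale_dnn(X, y, x, s) for s in range(1, s_max + 1)])
    dd = np.diff(np.abs(np.diff(est)))      # difference of absolute differences
    flips = np.nonzero(dd[:-1] * dd[1:] < 0)[0]
    return int(flips[0]) + 2 if flips.size else s_max
```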
Given the choice of $s$, the two-scale DNN estimator, as a linear combination of (primitive) DNN estimators, is still an L-statistic. The bootstrap can be directly employed to estimate the variance of DNN estimators (Tu and Ping, 1989; Shao and Tu, 1995). When the untreated group is not degenerate, as in our application in Section 3.1, we run separate regressions on the treated and control groups, respectively, and then take the difference. We bootstrap the difference by resampling within each group stratum to provide inference for the heterogeneous treatment effect estimate.
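A minimal sketch of the stratified bootstrap just described, reusing `two_scale_dnn` from the previous sketch (our code, with an assumed 95% percentile interval):

```python
import numpy as np

def hte_bootstrap_ci(X1, y1, X0, y0, x, s1, s0, n_boot=500, seed=0):
    """HTE at x via separated regressions on the treated (X1, y1) and control
    (X0, y0) arms; the bootstrap resamples within each group stratum."""
    rng = np.random.default_rng(seed)
    point = two_scale_dnn(X1, y1, x, s1) - two_scale_dnn(X0, y0, x, s0)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        i1 = rng.integers(0, len(y1), len(y1))   # resample treated stratum
        i0 = rng.integers(0, len(y0), len(y0))   # resample control stratum
        draws[b] = (two_scale_dnn(X1[i1], y1[i1], x, s1)
                    - two_scale_dnn(X0[i0], y0[i0], x, s0))
    lo, hi = np.percentile(draws, [2.5, 97.5])   # percentile confidence interval
    return point, (lo, hi)
```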
1.4 Simulation studies
This section presents two simulation studies to demonstrate the usefulness of the DNN framework.
The first one studies the newly suggested DNN in a setting where we go through all choices for s
and plot the resulting biases and mean squared errors (MSEs) as functions of s in one graph. In
the second study, we compare the performance of our DNN estimators with the random forests
approach in Wager and Athey (2018) in various settings.
1.4.1 Two-scale DNN
We here present the properties of DNN estimators and their potentially substantial improvements in bias reduction and mean squared errors. We conduct a Monte Carlo simulation with the data generating process
$$y = (x_1 - 1)^2 + (x_2 + 1)^3 - 3 x_3 + \varepsilon, \tag{17}$$
with the vector $(x_1, x_2, x_3, \varepsilon)^T \sim N(0, I_4)$ and sample size $n = 1000$. The evaluated target point is $(0.5, 0.5, 0.5)^T$; the coordinates are chosen to place the target point well in the interior and avoid irregular border cases. We estimate the heterogeneous treatment effect at this point with subsampling scales $s$ running from 1 to 250. The two-scale DNN is implemented using subsampling scales $s$ and $2s$ for simplicity. The simulation results are presented in Figure 1.1.
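A hedged replication sketch of this design (our code, reusing `two_scale_dnn` from the sketch in Section 1.3.3 and one arbitrary choice of $s$):

```python
import numpy as np

rng = np.random.default_rng(2021)
n = 1000
X = rng.normal(size=(n, 3))
eps = rng.normal(size=n)
y = (X[:, 0] - 1) ** 2 + (X[:, 1] + 1) ** 3 - 3 * X[:, 2] + eps  # model (17)

x0 = np.array([0.5, 0.5, 0.5])                      # evaluated target point
truth = (0.5 - 1) ** 2 + (0.5 + 1) ** 3 - 3 * 0.5   # mu(x0) = 2.125
print(truth, two_scale_dnn(X, y, x0, s=50))         # the text scans s = 1, ..., 250
```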
Figure 1.1: The choice of subsampling scale s and the resulting bias and mean squared error
The subsampling scale $s$ is on the horizontal axis, while the resulting biases and mean squared errors are on the vertical axis. The upper left panel of Figure 1.1 depicts the bias from (primitive) DNN estimation, while the upper right depicts the mean squared error from (primitive) DNN estimation. The lower two panels are devoted to the two-scale DNN estimation in the same fashion.
First, we can see from Figure 1.1 that as the subsampling scale $s$ increases, the bias of the (primitive) DNN estimator shrinks toward zero, consistent with the asymptotic theory. However, the marginal benefit of the bias reduction becomes smaller and smaller. The classical U-shaped pattern of the bias and variance trade-off shows up in the upper-right panel of Figure 1.1: the mean squared errors first decrease because the bias is smaller, but then the variance effect dominates and the mean squared errors increase.

Second, compared to (primitive) DNN estimators, the two-scale DNN speeds up both the process of bias reduction and the bias-variance trade-off, squeezing the curves toward lower levels of the subsampling scale $s$. Most notably, the best mean squared error achieved drops by over half. This significant improvement takes only an extra step of weighted averaging.
Figure 1.2: The choice of neighborhood size k and the resulting bias and mean squared error
Third, we repeat the same exercise for classical k-NN estimators and try the two-scale bias reduction strategy on k-NN. The bias form of k-NN estimators can be found in Mack (1980), the ninth equation; the k-NN and 2k-NN estimators are used to remove the bias term $c (k/n)^{2/d}$. We present this Monte Carlo result in Figure 1.2. It is interesting, and surprising, that the best mean squared error achieved does not seem to improve under two-scale bias reduction on k-NN. While we do not rule out the existence of bias reduction algorithms that work well directly with k-NN, which was in fact our initial motivation, we are skeptical that such an algorithm can in general outperform DNN.
1.4.2 Comparisons with random forest
We also compare our DNN estimators with the random forest (RF) approach in Wager and Athey (2018). The comparisons are made in a series of settings where we gradually increase the dimensionality $d$. In all nine settings, the Monte Carlo simulations are run 1000 times with sample size 1000. To be specific, for $j = 10, 15, 20, 25, 30, 35, 40, 45$, and $50$, respectively, the data generating process is
$$y = \log\left[\sum_{i=1}^{j} \left(x_i^3 - 2x_i^2 + 2x_i\right)\right]^2 + \varepsilon \tag{18}$$
with $(x_1, x_2, \ldots, x_j, \varepsilon)^T \sim N(0, I_{j+1})$. We make comparisons both at a fixed point and at a random point. The fixed test point is the $x$ whose first $\lfloor (j-1)/2 \rfloor$ coordinates are $0.5$ and whose other coordinates are zero, where $\lfloor (j-1)/2 \rfloor$ denotes the largest integer no greater than $(j-1)/2$. The random test point has its nonzero coordinates independently drawn from the uniform distribution on $[-1, 1]$, $[-1.645, 1.645]$, and $[-1.96, 1.96]$, respectively. Our simulations are designed to mimic real-data applications in which researchers vary some variables of interest while holding the remaining variables at their mean or median levels. An example of such an empirical application is provided in Section 3.1.
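For concreteness, the test points can be constructed as follows (our sketch of the description above, shown for one value of $j$ and one of the three uniform ranges):

```python
import numpy as np

rng = np.random.default_rng(1)
j = 20
m = (j - 1) // 2                       # floor((j - 1) / 2) nonzero coordinates

x_fixed = np.zeros(j)
x_fixed[:m] = 0.5                      # fixed test point

x_random = np.zeros(j)
x_random[:m] = rng.uniform(-1.645, 1.645, size=m)  # random test point
```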
Details on the implementation of DNN were provided in Section 1.3.3: the estimator is a simple weighted average given the choices of subsampling scales, and we gave an example algorithm for choosing $s$ and using subsampling scales $s$ and $2s$ to reduce bias. For random forests, we use the R package grf, version 0.10.2, from Athey et al. (2018), which extends the results in Wager and Athey (2018) and provides a powerful computing algorithm.

The simulation results are shown in Table 1.1 and Table 1.2; Table 1.1 gives results at the fixed points and Table 1.2 at the random points. In both tables, the first column indexes the setting (settings 1–9 correspond to $j = 10, 15, \ldots, 50$ covariates) and the second column gives the method used. In Table 1.1, Column 3 gives the mean error from the true value over the 1000 simulations and Column 4 the mean squared error; the variance of the estimates across the 1000 simulations is shown in Column 5, and Column 6 presents the mean of the estimated variances over the 1000 simulations.
Setting  Method  Bias    MSE      Variance  Est. Var
1        DNN     0.0695  0.1529   0.1482    0.1186
1        RF      1.4261  2.0991   0.0655    0.2657
2        DNN     0.3011  0.2316   0.1411    0.1017
2        RF      1.9127  3.7078   0.0493    0.2336
3        DNN     0.9215  0.9524   0.1033    0.0824
3        RF      2.4864  6.2175   0.0354    0.1897
4        DNN     1.1361  1.3733   0.0826    0.0675
4        RF      2.6469  7.0319   0.0256    0.1464
5        DNN     1.5270  2.3950   0.0633    0.0556
5        RF      2.9171  8.5318   0.0226    0.1228
6        DNN     1.6459  2.7601   0.0510    0.0472
6        RF      2.9740  8.8623   0.0177    0.1011
7        DNN     1.8930  3.6268   0.0432    0.0406
7        RF      3.1334  9.8328   0.0145    0.0886
8        DNN     1.9717  3.9254   0.0378    0.0358
8        RF      3.1595  9.9952   0.0129    0.0801
9        DNN     2.1543  4.6742   0.0332    0.0319
9        RF      3.2626  10.6559  0.0113    0.0706

Table 1.1: DNN and random forest on fixed points
Setting  Method  Bias1   MSE1     Bias2   MSE2    Bias3   MSE3
1        DNN     1.6501  5.9719   0.3733  2.6169  0.0522  3.3099
1        RF      2.2921  9.0874   1.1239  4.0355  0.7950  4.3272
2        DNN     2.1731  8.1449   0.6118  3.1575  0.0782  2.6222
2        RF      3.0175  13.1980  1.3901  5.4754  0.7690  3.9108
3        DNN     2.5371  9.8113   0.9139  3.9884  0.2735  2.7997
3        RF      3.3814  15.3962  1.6712  6.7085  0.9332  4.3545
4        DNN     2.6107  9.8556   0.9113  3.1277  0.2721  2.2102
4        RF      3.4155  15.2258  1.5937  5.4840  0.8836  3.5566
5        DNN     2.9096  12.0461  0.9927  2.8333  0.4285  2.1475
5        RF      3.6706  17.4889  1.6473  4.9908  1.0193  3.5361
6        DNN     2.8729  11.1165  0.9628  2.5058  0.3775  1.7125
6        RF      3.5862  16.1031  1.5641  4.4397  0.9260  2.8518
7        DNN     2.9534  11.1935  1.0500  2.8721  0.4872  1.6026
7        RF      3.6233  15.9013  1.6201  4.7788  1.0227  2.7662
8        DNN     2.9842  11.2533  1.0322  2.1603  0.3847  1.2181
8        RF      3.6321  15.7957  1.5835  3.8708  0.8815  2.1056
9        DNN     3.0837  11.8569  1.1340  2.5356  0.4594  1.2802
9        RF      3.7046  16.3090  1.6578  4.2431  0.9297  2.1797

Table 1.2: DNN and random forest on random points
In Table 1.2, Bias1 gives the mean of the errors from the true values over the 1000 simulations when the random points are generated uniformly from $[-1, 1]$, and MSE1 is the corresponding mean squared error. Bias2 and MSE2 present the results in the same fashion when the random points are generated uniformly from $[-1.645, 1.645]$, and Bias3 and MSE3 are for $[-1.96, 1.96]$.

From our simulations, we observe that DNN excels at controlling the size of the bias. This property brings improvements in mean squared error when bias is the main impediment to estimation; in situations where bias is not the main concern, other methods may have an advantage. The bootstrap also offers a good estimate of the variance.
1.5 Discussions
In this paper we have built the DNN framework for the estimation of heterogeneous treatment effects. The framework encompasses both theory and practice. We further test the DNN estimator in Monte Carlo simulations and in a real-life empirical study; in both cases, the DNN estimator demonstrates great potential for precise estimation and inference.

It is also inevitable that DNN has imperfections. First, the DNN framework mitigates the curse of dimensionality but still suffers from it. We are curious whether data-exploiting algorithms and theoretical derivations can be combined to push the boundaries further. Second, we have assumed i.i.d. data for the analysis, which is most plausible in a cross-sectional setting. It would thus be interesting to see how the DNN estimator can adapt to other popular data structures, such as time series and panel data. The third issue is semi-parametric efficiency. The separated regression approach consists only of estimations of regression functions and ignores the information between the treatment status and the control variables. The efficient influence function for heterogeneous treatment effects is not used, and there is a loss of semi-parametric efficiency; a semi-parametrically efficient approach to HTE estimation would be appealing. Fourth, the dimensionality of the features $d$ has been held fixed in this paper. Although this is the dominant case in economics, we reserve the question of allowing $d \to \infty$ for future studies. This generality can connect DNN to the dimension reduction literature and make the unconfoundedness condition more plausible.
Chapter 2
Endogeneity
Price elasticity, the percentage change in sales due to a one percent change in price, is of paramount importance in economics and marketing. As a primary measure of market structure, the precise estimation of price elasticity is central in many situations, such as when economists conduct welfare analysis, evaluate the price effects of mergers, or estimate the pass-through effects of taxation or tariffs. In business economics and marketing, price elasticities can provide rich information about the behaviors of current and potential customers. This information can assist firms in making pivotal market competition decisions, such as pricing and targeting.
Despite its general usefulness, the estimation of price elasticity faces two main challenges. The first challenge is price endogeneity. Observed prices in observational studies are often determined simultaneously by market demand and supply, which inevitably raises the concern of price endogeneity. If not properly addressed, price endogeneity can undermine the consistency of elasticity estimates and sometimes yields confusing results, such as an upward-sloping demand curve. The second challenge is the flexible estimation of price elasticity. The responsiveness to prices is heterogeneous by nature; price elasticity estimates ideally should adjust to different price levels, allow rich substitution patterns, and capture meaningful behavioral changes along demographics.
Current popular approaches mostly use a parametric framework to address these difficulties. On the one hand, a parametric setup of the indirect utility enables one to back out the unobserved omitted product characteristics, which clears the way for the generalized method of moments to deal with price endogeneity. On the other hand, a parametric setup of the distribution of random coefficients can help overcome the independence of irrelevant alternatives, an ingenious feature that offers a route to flexible substitution patterns. However, a parametric setup can also be a double-edged sword. First, the consumer choice process can be complicated, often involving choices over multiple goods and multiple quantities, under limited consideration, and with search frictions; parameterizing the choice process in these cases runs the risk of misspecification. Second, a parametric model also relies on assumptions about preference heterogeneity. It often becomes pivotal to correctly specify these taste distributions, and the estimation results can be sensitive to the specification, sometimes changing fundamentally under a different one. In such situations, it is strongly desirable to also have a model that does not rest on parametric assumptions.
With this motivation, we propose a nonparametric method to estimate price elasticities with aggregate market-level data in differentiated products markets. In particular, we investigate the problem of price elasticity estimation from a causal perspective. When the market price is perceived as a continuous but endogenous market policy, the estimation of price elasticity becomes a policy evaluation problem. If we further require the price elasticity to be flexible, the problem turns into the estimation of treatment effects conditional on different values of the control variables. In this way, we can borrow recent developments from the heterogeneous treatment effect literature to estimate price elasticities in a fully nonparametric fashion.
To be specific, we adapt classical nonparametric IV models (Blundell and Powell, 2003; Newey and Powell, 2003; Hall and Horowitz, 2005) and deliberately shift the interest from structural functions to point-wise estimation. In particular, this paper takes as its starting point an adaptation of the triangular control function approach in Newey et al. (1999), since it provides a straightforward economic interpretation. We first specify a triangular simultaneous equation system, where the first equation specifies a nonparametric relationship between demand and prices and the second the relationship between prices and exogenous instrumental variables. It can be shown that, under commonly made control function assumptions, the point-wise price slope is identified as a combination of several point-wise conditional expectations. As a result, we can estimate point-wise price elasticities from estimates of several intermediate point-wise conditional expectations. This estimation route, to the best of our knowledge, has not been explored before in empirical studies, and this paper is the first to use this result to flexibly estimate price elasticity.
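Schematically, and in notation of our own choosing (demand $Q$, endogenous price $P$, exogenous covariates $X$, instruments $Z$, first-stage error $V$), the triangular system and the logic of the point-wise identification can be summarized as
$$
\begin{aligned}
Q &= g(P, X) + U, \\
P &= h(Z, X) + V, \qquad \mathbb{E}[U \mid Z, V] = \mathbb{E}[U \mid V] =: \lambda(V), \\
\text{so that}\quad \mathbb{E}[Q \mid P, X, V] &= g(P, X) + \lambda(V).
\end{aligned}
$$
With the control variable $V$ recovered as the first-stage residual $P - \mathbb{E}[P \mid Z, X]$, the point-wise price slope $\partial g(p, x)/\partial p$ can then be obtained from derivatives, or differences, of the conditional expectation $\mathbb{E}[Q \mid P, X, V]$, each ingredient being a point-wise conditional expectation estimable by nonparametric regression.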
The difficulty in implementing this strategy lies mainly in the estimation of the point-wise conditional expectations. In this paper, we propose the use of modern machine learning methods for this problem. The reason machine learning methods are preferred is that classical nonparametric methods, such as kernel and nearest neighbors estimators, suffer the curse of dimensionality: when the dimension of the conditioning covariates grows large, classical nonparametric estimation can become unstable in practice. Modern machine learning methods, on the contrary, are often empowered by data-driven algorithms and have been shown to work well with a relatively large number of conditioning variables. In this way, we are able to accommodate many covariates in a fully nonparametric fashion and have more leverage to pursue the heterogeneity along all these dimensions.
However, the disadvantage of using popular machine learning methods is statistical inference. While machine learning methods have been widely used in business practice, it has received less emphasis how one can conduct valid statistical inference with them. In this paper, we prove that, if the bootstrap averaged (bagged) nearest neighbors estimator (Biau et al., 2010; Fan et al., 2018) is used for point-wise prediction, the standard bootstrap procedure can be directly used for its inference. Since the bootstrap commutes with smooth functions, we are then able to conveniently use the bootstrap to derive inference for the price elasticity estimates directly. It is worth noting, however, that when statistical inference is not a concern, other popular machine learning methods, such as deep neural networks and random forests, can also be employed for the conditional predictions within our framework.
We first conduct Monte Carlo simulations to test the working validity of our approach, demonstrating two settings. The first setting follows a conventional reduced-form setup. It is shown that our approach can recover flexible shapes of price slopes well despite the contamination of price endogeneity. We also derive confidence intervals using the bootstrap along with our estimates; in our simulations, these confidence intervals cover the model truth well. The second Monte Carlo setting uses the standard BLP model (Berry et al., 1995) as the true model. It is shown that our approach is compatible with the BLP model and applies approximately to market-level data aggregated from BLP-type individual choices.
With this confidence, our method is then applied to the yogurt industry to estimate the price elasticities of two leading national brands, Yoplait and Dannon, using the IRI academic dataset. Our main focus is on the package size of yogurt. We estimate the own and cross price elasticities of small-sized and large-sized yogurt for both Yoplait and Dannon. It is found that the competing brand's yogurt in a similar package size is more of a substitute than the own brand's yogurt in a different size. This result is obtained without a priori structural assumptions on consumer preferences. We also trace how the price elasticities change across own price levels.
Our paper contributes to the literature on demand estimation. In particular, we revisit the problem of price elasticity estimation from a causal perspective and in a nonparametric fashion. Compared to models based on multinomial choices, our framework carries fewer parametric assumptions and can thus be more flexible in various situations, for example, when there is concern about multiple purchase quantities or endogenous consideration sets. Compared to the existing nonparametric IV approaches (Blundell et al., 2012, 2016), our interest is mostly in the point-wise heterogeneity, and we incorporate modern machine learning algorithms to empower classical nonparametric strategies, instead of imposing additional economic and econometric constraints to regularize classical estimates.
Our paper also contributes to the emerging literature on heterogeneous treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018). In particular, we adapt the control function approach in Newey et al. (1999) and shift its focus to point-wise estimation. This adaptation offers a simple and intuitive pathway to deal with treatment endogeneity in the context of heterogeneous treatment effects. We also enrich the theoretical results in Fan et al. (2018) by proving that the standard bootstrap procedure can be directly adopted for statistical inference when the bagged nearest neighbors estimator is used for the intermediate predictions.
Our paper makes an empirical contribution as well. We find that price elasticities exhibit very different patterns for yogurt in small and large package sizes. However, it is common in empirical studies to pool yogurt of different sizes for the same brand before further analysis. Our results raise a concern about this conventional practice, since yogurt of different sizes can be intrinsically very different products. We also argue that the heterogeneity patterns we have found can serve as a primitive for sophisticated structural modeling.
The rest of the paper is organized as follows. Section 2.1 gives a brief review of related literature. We formally introduce our theoretical framework in Section 2.2. Section 2.3 gives more details on our estimation and inference strategy. We further justify our method with Monte Carlo simulations in Section 2.4. Section 3.2 applies our method to estimate the price elasticities of yogurt with the IRI academic dataset. Section 2.5 is devoted to a discussion.
2.1 Related literature
Demand analysis has been one of the oldest economic problems. Working (1927); Stone (1954);
Deaton and Muellbauer (1980) had studied this problem repeatedly with new insights. More re-
cently, the BLP model (Berry et al., 1995) becomes the leading work-horse model in structural
demand analysis. The BLP model (Berry et al., 1995) makes important changes of the multino-
mial choice model (McFadden, 1973) and provides a seminal framework to deal with endogeneity
and flexibility (Nevo, 2001). Meanwhile, the reduced-form approaches are also quickly evolving.
This line of research works on relaxing the linear demand system in traditional analysis (Hausman
and Newey, 1995; Banks et al., 1997; Hausman and Newey, 2015) and improving nonparametric
estimates with additional theoretical or empirical constraints (Haag et al., 2009; Blundell et al.,
2012; Dette et al., 2016; Blundell et al., 2016). Interestingly, the seemingly unrelated structural
and reduced-form perspectives can actually be reconciled under a general nonparametric setting.
Berry and Haile (2014) have ingeniously shown that with connected substitutes
(Berry et al., 2013), an index restriction is enough to transform the discrete choice demand model
into a nonparametric IV problem. Compiani (2019) further shows that this nonparametric ap-
proach is able to give better shapes of demand curves when consumers experience inattention or
loss aversion. Our paper contributes to the demand literature by offering another route to flexibly
estimate price elasticity, where machine learning methods can be used to empower nonparametric
regressions.
Our paper is related to the literature of heterogeneous treatment effects. The concept of treat-
ment effect heterogeneity is deeply rooted in economics and possesses great potential in empirical
studies (Heckman et al., 1997; Athey and Imbens, 2016; Wager and Athey, 2018; Fan et al., 2018).
While previous studies mostly focus on exogenous treatments, our framework in this paper offers
a simple pathway to deal with treatment endogeneity in the context of heterogeneous treatment
effects. The simplicity of our method comes from adapting classical nonparametric IV models and
focusing on point-wise estimation. Compared to the random forests local generalized method of
moments approach (Athey et al., 2019), our framework is fully nonparametric, open to the use of
various machine learning methods, and enjoys an easy economic interpretation.
We have used the control function approach in our paper. As far as we know, the control
function approach was introduced to economics as a generalization of the linear instrumental
variable approach (Heckman and Robb, 1985). For these early insights, see, for example, Smith
and Blundell (1986); Rivers and Vuong (1988). For more recent developments on nonparametric
IV models, see, for example, Chesher (2003); Matzkin (2003); Blundell and Powell (2003);
Newey and Powell (2003); Das et al. (2003); Hall and Horowitz (2005); Blundell et al. (2007);
Matzkin (2008); Imbens and Newey (2009); Blundell et al. (2013); Chen et al. (2014); Matzkin
(2015); Hahn and Ridder (2017, 2018). Different from most of this literature, our empirical
strategy follows a point-wise approach. In this way, we can uncover observed heterogeneity, and
many modern machine learning methods can then be incorporated into the conventional control
function framework for the first time. It is also worth mentioning that Chen and Pouzo (2015)
have obtained point-wise bootstrap confidence bands for linear and nonlinear functionals of
nonparametric IV and nonparametric quantile IV estimators, and Chen and Christensen (2018)
have obtained bootstrap uniform confidence bands for linear and nonlinear functionals of sieve
nonparametric IV estimators.
Our paper makes use of the bootstrap averaged (bagged) nearest neighbors method. The bagged
nearest neighbors regression estimator, as far as we know, first appears as a special case when
averaging nearest neighbor estimators over subsamples drawn without replacement in Biau
et al. (2010). More recently, Fan et al. (2018) derives the same estimator independently from a
panel perspective, proves its point-wise asymptotic normality, and proposes to use the generalized
jackknife to analytically remove its higher-order bias. The generalized jackknife procedure, when
working together with the bagging algorithm (Breiman, 1996), turns out to be very powerful and
can effectively mitigate the curse of dimensionality. For the purpose of this paper, we would like to
perform the generalized jackknife in a flexible fashion and combine intermediate estimates into the
final parameters of interest, in which case the variance formula becomes less useful for inference.
In this paper, we complement existing results and prove that the bootstrap can be directly used for
the bagged nearest neighbors estimator. This bootstrap result removes the obstacles to conducting
inference with the bagged nearest neighbors.
Machine learning methods have been gaining increasing interest in the business world as well
as in economic research (Mullainathan and Spiess, 2017). One way to accommodate machine
learning methods in economic research is to use penalized regressions, such as the LASSO,
for variable selection, and then apply conventional econometric methods to the reduced set of
variables. This convenience can build on the assumption of ignorable approximation
errors (Belloni et al., 2014a,b), or more recently on the orthogonality features of many economic
problems (Chernozhukov et al., 2017). Our paper demonstrates another possibility for introducing
machine learning methods into economics. Our philosophy is to transform well-known economic
and econometric models into a combination of intermediate conditional expectation problems and
then utilize modern machine learning methods to work on these conditional expectations. We treat
machine learning as an extension of traditional nonparametric methods.
Our paper connects to the literature of empirical industrial organization. While the current
canonical demand estimation approach builds on the discrete choice framework (McFadden, 1973)
and the BLP model (Berry et al., 1995; Nevo, 2001; Petrin, 2002), empirical studies for many in-
dustries have also offered various alternatives that can relax different restrictions of the canonical
model. These extensions include, but are not limited to, cases when consumers purchase multiple
goods or multiple units (Hendel, 1999; Dubé et al., 2018; Kim et al., 2002), make decisions un-
der limited consideration (Goeree, 2008) or search frictions (De Los Santos et al., 2012), choose
among geographically-differentiated options (Houde, 2012), and purchase complementary prod-
ucts (Gentzkow, 2007). Our approach complements the above literature and can be useful for a
wide range of empirical challenges like these.
Our empirical study builds on previous findings in the yogurt industry. The landscape of the
yogurt industry can be simplified as several major producers, a few grocery chains, local stores,
and a vast number of individual consumers. A typical consumer usually purchases multiple units
of yogurt in multiple flavors from different product lines of a single brand (Kim et al., 2002). The
conventional pricing practice for grocery stores is that different product lines are priced differently,
but prices are uniform across flavors (Draganska and Jain, 2006; Draganska et al., 2009). The
vertical relationship between grocery chains and yogurt producers is known to be important and
can affect the offerings of yogurt in local grocery stores (Villas-Boas, 2007). It is also found that
consumers can exhibit brand inertia (Pavlidis and Ellickson, 2018) and face fixed purchasing costs
(Huang and Bronnenberg, 2018). Interestingly, the package size of yogurt, a classic price
discrimination device, has received little attention. In this paper, we offer a novel analysis of
the differences between yogurts in different package sizes.
2.2 Theoretical framework
In this section, we introduce our theoretical framework for estimating price elasticity. We first
present our problem in a nonparametric model, explain how it can be approached from a
heterogeneous treatment effect perspective, list the set of assumptions we need, show how price
slopes can be identified, and finally provide intuition for the identification using directed acyclic
graphs (Pearl, 2009).
2.2.1 Model setup
We assume that product j at market t has sales s_{jt} and price p_{jt}, for j = 1, 2, ..., J and
t = 1, 2, ..., T. Let p_t = (p_{1t}, p_{2t}, ..., p_{Jt})^T denote the price vector for products
{1, 2, ..., J} at market t, where (·)^T is the transpose. Similarly, let x_{jt} denote the vector of
observed product characteristics for product j at market t and x_t = (x_{1t}^T, x_{2t}^T, ..., x_{Jt}^T)^T
the stacked vector of observed product characteristics at market t. We assume that for j = 1, 2, ..., J
and t = 1, 2, ..., T, the structural relationship between market sales, market prices, and the observed
product characteristics is

    s_{jt} = f_j(p_t, x_t) + ε_{jt},    (1)

where ε_{jt} is the aggregate unobserved shock to the demand of product j at market t, assumed
to have mean zero. The pivotal problem here is that the disturbance ε_{jt} can be potentially
correlated with p_{jt}, which raises the concern of price endogeneity. The source of price
endogeneity can be very general: it may come from the omission of a confounding variable
unobserved by econometricians, such as unobserved product characteristics, from measurement
error in prices, or from sample selection issues.

The demand for product j at market t depends on the prices and observed product characteristics
of all products at market t in a nonlinear fashion. In our assumption, this demand function for
product j, f_j(·), is structural and unchanged across markets. The variation of the demand of
product j across markets comes from the variation of the prices of all products, the product
characteristics of all products, and the realized unobserved shocks in different markets. Our setup
can further allow for market-specific effects if we introduce one more product with market features
as its product characteristics. The model can also incorporate infinite dimensions of unobserved
heterogeneity across geographic locations and time if we allow the demand function f_j(·) to be
location and time specific, that is, we can split the data into subsamples and estimate a demand
function for each subsample. However, this flexibility comes at the price of substantially fewer
observations for each estimation. We do not pursue these extensions in this paper since they are
not our main focus.
2.2.2 Heterogeneous treatment effect
We can look at the estimation of price elasticity from the potential outcomes perspective. When
price is perceived as a continuous market policy, its corresponding demand is the observed
outcome in parallel universes. When our goal is the causal effect of market price on demand,
the problem becomes the evaluation of a market price policy. As a result, the estimation of the
price slope is the estimation of the treatment effect of price on demand. By contrast, the structural
approach directly models the preferences of individuals and then uses the stability of those
preferences to derive the price slope. We argue that the recovery of human nature is actually a
more fundamental and complex issue. If our goal is solely the price slope, the causal perspective
is possibly more straightforward.

Price elasticity, which measures normalized price sensitivity, should be heterogeneous by
nature. Price sensitivity should change at different price levels, vary across income groups, and
evolve with age. Flexibility is therefore one of the central concerns when estimating price
elasticity. In this context, we are eager to learn more about the heterogeneity of price treatment
effects on demand. The heterogeneous treatment effect provides one solution to allow flexibility
and capture heterogeneity. The heterogeneous treatment effect, proposed in Athey and Imbens
(2016); Wager and Athey (2018), is defined as the conditional treatment effect at a fixed value of
all control variables, that is,

    E[Y(1) − Y(0) | X = x],

where Y(1) is the potential outcome when treated and Y(0) when untreated. The difference between
the heterogeneous treatment effect and the conditional average treatment effect is that here the
control variables X can potentially be of high dimension. In our context, when we consider price as
a treatment, the partial derivative ∂_{p} f_j(p_t, x_t) is just the continuous-version heterogeneous
treatment effect conditional on (p_t, x_t). This point-wise strategy can help us recover the
heterogeneity of treatment effects because we can deliberately make repeated estimations at
different points and observe the change in estimates. For some problems, knowledge of the price
slopes would suffice. When it is not the case, we need to further normalize the obtained price slope
to form the price elasticity. The normalization is also achieved point-wise.
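For concreteness, the point-wise normalization is the standard one: at an evaluation point, the
own-price elasticity scales the price slope by the local price-to-demand ratio,

    e_j(p_t, x_t) = ∂_{p_{jt}} f_j(p_t, x_t) · p_{jt} / f_j(p_t, x_t),

and cross-price elasticities are formed analogously by differentiating with respect to a rival's price.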
The remaining challenge is price endogeneity. When there is a confounder that simultaneously
affects price and demand through the disturbance ε, we will not be able to estimate the heterogeneous
treatment effect ∂_{p} f_j(p_t, x_t) directly. The intuition is that we cannot distinguish whether the
demand change comes from the price change or from the change in the confounder, which is beyond
our control and moves simultaneously with price. To overcome this empirical hurdle, this paper
proposes a control function approach to estimate heterogeneous treatment effects in the presence
of treatment endogeneity. We will show that the problem can be conveniently transformed by
adapting classical nonparametric IV models. In this paper, we use the triangular simultaneous
equation system in Newey, Powell, and Vella (1999) as our starting point.
2.2.3 Control function approach
As is common in situations with endogeneity, we need some additional machinery to work on the
problem. It is further assumed that p_t is related to a vector of instrumental variables z_t through

    p_t = g(z_t) + u_t,    (2)

where g(·) denotes a nonparametric relationship between price and the instrumental variables. In
a reduced-form interpretation, g(z_t) can be the conditional expectation of p_t given z_t. In this
way, it can be assumed without loss of generality that

    E(u_t | z_t) = 0.    (3)

If we instead understand Equation (2) from a structural setup, Equation (3) implies that the
endogeneity issue is confined to Equation (1) and does not carry over to Equation (2) for the
exogenous instrumental variables z_t. In both situations, the disturbance u_t can be interpreted as
the aggregation of all factors that affect price p_t beyond, and orthogonal to, the instrumental
variables z_t. The above model setup routinely follows the triangular control function approach in
Newey et al. (1999). If f(·) and g(·) are replaced with linear functions, we can see its immediate
origin in the classical two-step model of, for example, Heckman and Robb (1985). It is worth
noting that we allow a possible overlap between the instrumental variables z_t and the observed
product characteristics x_t; the famous BLP instrument in the demand literature is in fact the
observed characteristics of other products. To make this point explicit, we decompose z_t into
(z_t^{(1)}, z_t^{(2)}), where z_t^{(1)} are the instrumental variables that overlap with x_t, and
z_t^{(2)} are the instrumental variables excluded from the observed product characteristics x_t.
The dimension of the excluded z_t^{(2)} is d_z.

Before we move on, we compare the above setup with the multinomial choice model. The
multinomial choice model is often featured with an indirect utility that is linear in product
characteristics and a Type I extreme value error. In this case, the aggregated market share has a
logistic form. To our knowledge, this logistic demand function cannot be decomposed into an
additive function of the unobserved product characteristics. As a result, the market demand from
the random coefficients multinomial choice model, which integrates the logistic market demand
over local realizations of the random coefficients, is also not in general additively separable in
unobserved product characteristics. In other words, the triangular nonparametric model in this
paper does not nest the standard random coefficients multinomial choice model as a special case,
and the reverse is also true. However, we argue in the appendix that if we follow a point-wise
strategy and take a point-wise Taylor expansion of the logistic demand function, it can be
approximately true that the demand function from the standard multinomial choice model is
additively separable in unobserved product characteristics.
2.2.4 Assumptions
Now we are ready to formally list the assumptions we need for estimation and inference. The
most important assumption we are going to make is the exclusion assumption.

Assumption 4. For j = 1, 2, ..., J, it holds that

    E(ε_{jt} | x_t, z_t, u_t) = E(ε_{jt} | u_t).

Assumption 4 states that once the orthogonal shock u_t is known, the error resulting from the
unobserved disturbance ε_{jt} only comes from the u_t side. This assumption is pivotal in the
triangular simultaneous equations of Newey et al. (1999). We can understand it in this way. First
of all, the controls x_t are exogenous, so they can safely be moved away. However, since z_t and
u_t are components of prices, they could in principle be correlated with the disturbance ε_{jt}.
The essence of Assumption 4 is that z_t has no impact on demand except through its effect on
price and u_t. This point will become more apparent in Figure 2.1 when we discuss the intuition
behind identification. Beyond this requirement, we also need to ensure that z_t and u_t are
uncorrelated and can be separated. This is ensured by the conditional independence assumption.

Assumption 5. For the instrumental variables z_t, it holds that

    E(u_t | z_t) = 0.

Another way to understand Assumption 5 is that there is no direct link between z_t and u_t. This
point will be further explained in Figure 2.1. We have made two modeling assumptions by now;
they are crucial to make our strategy work. The coming Assumption 6 and Assumption 7 are
regularity conditions to ensure a well-defined solution.

Assumption 6. There are no fewer excluded instrumental variables than endogenous variables,
that is, d_z ≥ J.

Assumption 7. f_j(p_t, x_t) for j = 1, 2, ..., J, E(ε_{jt} | u_t) for j = 1, 2, ..., J, and g(z_t) are
first-order continuously differentiable with respect to all arguments. Moreover, the Jacobian
matrix of g(z_t) with respect to the excluded instruments z_t^{(2)} is of full column rank.

Assumption 6 is commonly made in the instrumental variables literature. It requires enough
sources of variation to deal with endogeneity; the convenience it brings will become more explicit
when we derive the identification result. Assumption 7 ensures that the partial derivatives exist
and have desirable properties. It is a technical condition.
2.2.5 Identification
This section shows the algebra that identifies the heterogeneous treatment effect
∂_{p_t} f_j(p_t, x_t) in the presence of endogeneity. What we are about to find has actually already
appeared in Newey, Powell, and Vella (1999). What is also interesting is that, to the best of our
knowledge, this identification and estimation route seems not to have been explored so far. Our
paper may be the first to give this result an interpretation in the context of heterogeneous
treatment effects and to practically use the result for estimation. Before we proceed, as always,
we first introduce new notation. Let

    h_j(p_t, x_t, z_t^{(2)}) := E(s_{jt} | p_t, x_t, z_t),

which is the conditional expectation of demand given the levels of prices, product characteristics,
and instrumental variables, and let

    λ(u_t) := E(ε_{jt} | p_t, x_t, z_t) = E(ε_{jt} | u_t),

where the second equality comes directly from Assumption 4. When we take conditional
expectations on both sides of Equation (1), it can be shown without much effort that

    h_j(p_t, x_t, z_t^{(2)}) = f_j(p_t, x_t) + λ(u_t),    (4)

which states that the conditional demand function h_j is an additive function of the structural
demand function f_j and the control function λ(·). As the classical argument of the control function
approach goes, this equation implies that with the inclusion of the control function, the problem of
endogeneity turns into an omitted variable problem. In other words, price endogeneity will no
longer be a concern in the presence of prices, product characteristics, and the newly included
control function.

Thanks to the regularity conditions in Assumptions 6 and 7, we can continue to take partial
derivatives on both sides of Equation (4) with respect to p_t and z_t^{(2)}. Recalling that
u_t = p_t − g(z_t), the chain rule of calculus gives

    ∂_{p_t} h_j(p_t, x_t, z_t^{(2)}) = ∂_{p_t} f_j(p_t, x_t) + ∂_{u_t} λ(u_t),    (5)

    ∂_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)}) = −∂_{z_t^{(2)}} g(z_t) ∂_{u_t} λ(u_t).    (6)

Here ∂_{p_t} h_j(p_t, x_t, z_t^{(2)}) is the J × 1 Jacobian of the conditional demand function h_j
with respect to the price vector p_t, evaluated at the point (p_t, x_t, z_t^{(2)}). The other terms
are defined in the same fashion: ∂_{p_t} f_j and ∂_{u_t} λ are J × 1, ∂_{z_t^{(2)}} h_j is d_z × 1,
and ∂_{z_t^{(2)}} g(z_t) is d_z × J. We explicitly list the dimensions of these Jacobian matrices to
avoid potential confusion over the various definitions of the Jacobian matrix.

Our goal is the heterogeneous treatment effect ∂_{p_t} f_j(p_t, x_t). We can rearrange Equation (5)
and Equation (6) and get

    ∂_{z_t^{(2)}} g(z_t) ∂_{p_t} f_j(p_t, x_t) = ∂_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)}) + ∂_{z_t^{(2)}} g(z_t) ∂_{p_t} h_j(p_t, x_t, z_t^{(2)}),    (7)

which gives us a system of d_z linear equations in J unknowns, whose solution is discussed in the
following two scenarios.

When d_z > J, that is, when the number of excluded instrumental variables is larger than the
number of endogenous prices, Equation (7) is an over-identified system. Since ∂_{z_t^{(2)}} g(z_t)
has full column rank by Assumption 7, we are able to obtain the minimum distance solution

    ∂_{p_t} f_j(p_t, x_t) = ∂_{p_t} h_j(p_t, x_t, z_t^{(2)}) + (∂_{z_t^{(2)}} g(z_t)^T ∂_{z_t^{(2)}} g(z_t))^{−1} ∂_{z_t^{(2)}} g(z_t)^T ∂_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)}).

When the number of excluded instruments is equal to the number of endogenous prices, that is,
when d_z = J, Equation (7) is just identified. In this case, full column rank ensures that
∂_{z_t^{(2)}} g(z_t) is invertible and we get

    ∂_{p_t} f_j(p_t, x_t) = ∂_{p_t} h_j(p_t, x_t, z_t^{(2)}) + ∂_{z_t^{(2)}} g(z_t)^{−1} ∂_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)}),    (8)

which is a simplification of the over-identified solution.

Now we take a closer look at Equation (8). If different endogenous prices do not share the same
excluded instrument, for example, when each product price has only its own Hausman price IV,
the Jacobian matrix ∂_{z_t^{(2)}} g(z_t) is diagonal. In this case, each diagonal element of the
inverse of ∂_{z_t^{(2)}} g(z_t) is the reciprocal of the corresponding diagonal element of
∂_{z_t^{(2)}} g(z_t), which implies that we can further simplify the equation and deal with price
endogeneity for each price separately. However, if there exists a shared excluded instrument, such
as a common cost shifter, the inverse of ∂_{z_t^{(2)}} g(z_t) in this more complicated case needs
to be solved jointly.
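A minimal sketch of this linear-algebra step, assuming the Jacobians have already been estimated
(for example, by the finite differences discussed in Section 2.3); np.linalg.lstsq covers both the
just- and over-identified cases:

```python
import numpy as np

def structural_price_slopes(dg_dz2, dh_dp, dh_dz2):
    """Solve Equation (7) for the J structural price slopes.

    dg_dz2: (d_z, J) Jacobian of g w.r.t. the excluded instruments
    dh_dp:  (J,)     gradient of h_j w.r.t. prices
    dh_dz2: (d_z,)   gradient of h_j w.r.t. the excluded instruments
    """
    rhs = dh_dz2 + dg_dz2 @ dh_dp
    # Least squares yields the minimum distance solution when d_z > J
    # and the exact solution when d_z = J (full column rank assumed).
    slopes, *_ = np.linalg.lstsq(dg_dz2, rhs, rcond=None)
    return slopes
```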
Figure 2.1: An explanation on identification
Note: This figure depicts a simplified relationship between demand, price, unobserved confounder, and
instrumental variable using the directed acyclic graphs (Pearl, 2009).
2.2.6 An explanation
As we have mentioned, the above relationship appeared in Newey, Powell, and Vella (1999)
with slightly different notation. Here we give the equation an interpretation in the context of
heterogeneous treatment effects. We use directed acyclic graphs (Pearl, 2009) for illustration, and
our interpretation need not be the only one. To make the argument easier, let us further simplify
Equation (8) to the case with one endogenous price and one excluded instrumental variable,
that is,

    ∂_p f_j(p_t, x_t) = ∂_p h_j(p_t, x_t, z_t) + ∂_z g(z_t)^{−1} ∂_z h_j(p_t, x_t, z_t).    (9)
Figure 2.1 depicts the endogenous situation with a directed acyclic graph (Pearl, 2009). We
hope to evaluate the causal effect of price on demand, which is Channel 2 in Figure 2.1. It is the
causal effect because it is the effect of price on demand between parallel universes where
price differs but the unobserved confounder is fixed. However, the unobserved confounder
affects price and demand simultaneously through Channel 3 and Channel 4. In this case,
when we move price a little bit, the change in demand comes from two sources. One change
comes directly from the change in price, which is Channel 2, while the other comes from the
co-movement of the unobserved confounder with price, which is Channel 4 followed by Channel 3.
What we can observe in the data is the total effect of price on demand, but what we are more
interested in is the partial effect of price on demand, which is also the causal effect of price on
demand. However, if we have access to an instrumental variable, we are still able to decompose
the total effect of price on demand with this new machinery.

For this trick to work, the instrumental variable needs to satisfy two conditions. First, the
instrumental variable must not affect demand directly, which implies that there is no direct link
between the instrumental variable and demand. This is what Assumption 4 requires: when there
is a change in the instrumental variable, the change in demand can only come from its effect on
the price and the unobserved confounder. Second, there should be no direct link between the
instrumental variable and the unobserved confounder; they are two orthogonal determinants of the
price. This point is guaranteed by Assumption 5, where the unobserved confounder and the
instrumental variable are set to be orthogonal.

Now let us take a step back and look at what the data can identify. First of all, the data can
inform us of the total effect of price on demand, which includes the part from the unobserved
confounder. This is ∂_p h_j(p_t, x_t, z_t) in Equation (9). Second, we also know how the
instrumental variable changes the price, which is Channel 1 in Figure 2.1 and ∂_z g(z_t) in
Equation (9). Moreover, we also know how the instrumental variable affects demand when
holding the price constant. The instrumental variable can have such an effect precisely because
the price is held constant: the instrumental variable and the unobserved confounder are two
determinants of the price, so with price fixed, a change in the instrumental variable must be offset
through the confounder side. This combines Channels 1, 4, and 3 in Figure 2.1. In this way, the
indirect effect of price on demand, Channel 4 followed by Channel 3, can be worked out by
dividing the effect of the instrumental variable on demand by the effect of the instrumental variable
on the price; the corresponding term in Equation (9) is ∂_z g(z_t)^{−1} ∂_z h_j(p_t, x_t, z_t).
Finally, the total effect minus the indirect effect gives us the direct partial effect of price on
demand. It is also the heterogeneous treatment effect of price on demand.
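To make this decomposition concrete, the following is a minimal sketch of a plug-in version of
Equation (9) for the scalar case, assuming fitted conditional-expectation functions h_hat and
g_hat (hypothetical names standing in for whatever nonparametric or machine learning estimator
is used) and the forward finite differences discussed in Section 2.3:

```python
def price_slope_cf(h_hat, g_hat, p, x, z, delta=1e-2):
    """Plug-in version of Equation (9): one endogenous price, one
    excluded instrument, forward finite differences.

    h_hat(p, x, z): fitted conditional expectation E(s | p, x, z)
    g_hat(z):       fitted first-stage conditional expectation E(p | z)
    """
    # Total effect of price on demand, d h / d p, at (p, x, z).
    dh_dp = (h_hat(p + delta, x, z) - h_hat(p, x, z)) / delta
    # First-stage effect of the instrument on price, d g / d z.
    dg_dz = (g_hat(z + delta) - g_hat(z)) / delta
    # Effect of the instrument on demand holding price fixed, d h / d z.
    dh_dz = (h_hat(p, x, z + delta) - h_hat(p, x, z)) / delta
    # Structural (causal) price slope from Equation (9).
    return dh_dp + dh_dz / dg_dz
```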
2.3 Estimation and inference
This section is devoted to the implementation details of our nonparametric strategy.
We first discuss possible approaches to estimate the point-wise conditional expectations,
including classical nonparametric regressions and modern machine learning methods. Once it is
clear how to obtain point-wise predictions, the next problem is how to obtain partial
derivatives. In this paper, we use finite differences to numerically approximate partial derivatives.
Finally, we discuss how to conduct statistical inference. We propose the use of the bagged
nearest neighbors method for point-wise prediction and formally prove that the bootstrap can be
directly used for inference when the bagged nearest neighbors estimator is used.
2.3.1 Estimation
The major difference between our method and the nonparametric IV literature is that we
deliberately transform the problem into point-wise estimations, while current approaches in the
nonparametric IV literature focus mostly on the estimation and inference of structural functions.
This point-wise transformation brings two intriguing changes. First, we are now capable of
capturing the observed heterogeneity of interest. For example, if our goal is to explore
heterogeneous treatment effects, we can repeatedly change the conditioning values and compare
the treatment effects at these points. As a result, the point-wise approach gives us leverage to
recover heterogeneity. Second, when the economic and econometric problem is transformed
into point-wise predictions, another way to utilize modern machine learning methods opens up
to economists. Although not fully understood, machine learning methods, such as deep neural
networks, have demonstrated superior performance in making predictions in a nonlinear
fashion. While the current discussion of machine learning in econometrics is more on the variable
selection side, this paper shows that these machine learning methods can be conveniently
incorporated into the well-studied nonparametric IV models.
A class of modern machine learning methods can be viewed as algorithm-enhanced classical
nonparametric methods. For example, the random forest regression, in its essence, has a
convenient representation as the nearest neighbor method, but with adaptive and data-driven local
weights (Wager and Athey, 2018). Deep neural networks, at a high level, can be seen as
a sophisticated functional approximation using sieves (Chen, 2007), which are already
familiar to economists; the novelty is that the basis of the sieves is now implicitly chosen by
layers of linear combinations and nonlinear activations. The bagged nearest neighbors method, to
be elaborated in the next subsection, can likewise be seen as a bagging-algorithm (Breiman, 1996)
enhanced matching estimator (Abadie and Imbens, 2006), which is already widely used in
causal studies. Despite these parallels with classical methods, modern machine learning methods
turn out to work fairly well in practice, especially when there is a relatively large dimension of
control variables. The reason for this surprising difference remains an open question in probability
and statistical theory. One possible explanation is that the algorithms in machine learning methods
help alleviate the curse of dimensionality.
In this paper, our framework allows the general use of any method that delivers consistent
conditional predictions. When the dimension of the conditioning variables is low, traditional
nonparametric methods, such as the sieve estimator and kernel methods, can also be used within
our framework. In this case, the sieve estimator possesses an extra advantage. Our intermediate
goals are partial derivatives of conditional expectations. Since the sieve estimator gives us an
approximation of the structural function, we can take partial derivatives of the basis functions and
directly obtain an approximation of the partial derivatives. This strategy is straightforward and
can work very well when there are few conditioning variables. When the dimension of the
conditioning variables is relatively large, we conjecture that the commonly used deep neural
networks can enjoy the same advantage. However, since it is not the focus of this paper, we
will not explore this direction further. In general, we can use numerical methods to derive partial
derivatives. The most popular method in this domain is perhaps the finite difference method,
which is widely used in modern numerical analysis; for details, see, for example, Strikwerda
(2004). The finite difference method is a discretization method in which finite differences are
used to approximate derivatives. In other words, the partial derivative is by definition the limit of
a difference quotient, so we can take a small difference and use the difference quotient as an
approximation. At the implementation level, the difference quotient can be constructed under
various schemes, such as the forward scheme, the backward scheme, and the central scheme. In
this paper, for simplicity, we mostly use the forward scheme, that is, we use the difference quotient
[g(z + δ) − g(z)]/δ to approximate the partial derivative g_z(z).
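As a minimal sketch of this forward scheme (the helper name and the default step size are
illustrative choices, not from the original text):

```python
def forward_diff(func, point, index, delta=1e-2):
    """Forward-difference approximation of the partial derivative of
    `func` along coordinate `index`, evaluated at `point` (a sequence).
    """
    bumped = list(point)
    bumped[index] += delta  # move one coordinate forward by delta
    return (func(bumped) - func(point)) / delta
```

A central scheme would instead evaluate at point ± delta/2, trading one extra function call for a
smaller discretization error.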
2.3.2 Inference
Machine learning methods can give fairly good point-wise predictions. The remaining difficulty
is statistical inference. To close this gap, we propose to use the bootstrap averaged
(bagged) nearest neighbors estimator for the point-wise conditional expectations. We formally
prove in this paper that the bootstrap can be directly used for inference with the bagged nearest
neighbors estimator. Since the bootstrap commutes with smooth functions, we can therefore
directly use the bootstrap to derive inference for the price elasticity estimates. It is worth noting
that the price elasticity estimates are not necessarily asymptotically normal.
The bootstrap averaged (bagged) nearest neighbors regression estimator, to the best of our
knowledge, first appears as a special example when averaging nearest neighbors estimators over
subsamples drawn without replacement in Biau et al. (2010). The nearest neighbors estimator
here is more commonly known in economics as the matching estimator (Abadie and Imbens,
2006). More recently, Fan et al. (2018) obtains the same estimator independently from a panel
perspective. The idea is that we can construct an artificial panel structure in the data and average
nearest neighbors from each cross-section. This strategy is in effect equivalent to subsampling,
and thereby the subsample size in subsample bootstrapping becomes the counterpart of the time
dimension in panel data models (Hsiao, 2014). As a result, the panel jackknife method (Arellano
and Hahn, 2013; Dhaene and Jochmans, 2015) can be readily applied to the bagged nearest
neighbors estimator to remove higher-order bias. In other words, the bagged nearest neighbors
estimator can be understood as the joint product of the familiar matching estimator, the bootstrap
averaging algorithm (Breiman, 1996), and the generalized jackknife procedure. Like other
machine learning methods, this hybrid turns out to enjoy desirable theoretical properties and
demonstrates fairly good performance with a relatively large dimension of control variables. We
give a formal and concise introduction to the bagged nearest neighbors in the appendix.
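As a minimal sketch of the averaging idea (the Euclidean distance and the default number of
subsamples are illustrative choices; the subsample size plays the role of the smoothing parameter):

```python
import numpy as np

def bagged_1nn_predict(X, y, x0, subsample_size, n_subsamples=200, rng=None):
    """Bagged 1-nearest-neighbor prediction of E(y | x = x0).

    Draws subsamples without replacement and averages the response of
    the nearest neighbor of x0 within each subsample.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    preds = np.empty(n_subsamples)
    for b in range(n_subsamples):
        idx = rng.choice(n, size=subsample_size, replace=False)
        dists = np.linalg.norm(X[idx] - x0, axis=1)  # distances to x0
        preds[b] = y[idx[np.argmin(dists)]]          # 1-NN response
    return preds.mean()
```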
Since we need to combine several predictions to form price elasticities, statistical inference for
these final objects of interest is not a trivial problem. The obstacles are that 1) we may not have
asymptotic normality, 2) the variance formula is not easy to obtain, and 3) the variance formula
may not achieve working precision if many plug-in estimates are required. To overcome this
hurdle, we prove in this paper that the bootstrap can be used for inference with the bagged nearest
neighbors. Since the bootstrap commutes with smooth functions, it follows that the bootstrap can
be directly used for inference on the price elasticity estimates if the bagged nearest neighbors
estimators are used for point-wise predictions. We show in our Monte Carlo simulations that the
bootstrap strategy works well in practice and offers valid inference with working precision. Our
formal theorem and its proof for the bootstrap result are relegated to the appendix, after the
introduction of the bagged nearest neighbors. As far as we know, this is the first time this property
is established for the bagged nearest neighbors. Our proof is novel and becomes mathematically
neat with the introduction of the Hoeffding decomposition (Hoeffding, 1948) and the Mallows
distance (Bickel and Freedman, 1981).
To conclude this section, our implementation strategy is to first obtain point-wise conditional
predictions at various points, use them to numerically derive partial derivatives, and then combine
all relevant ingredients to obtain our final estimate of the price elasticities. For inference, we
repeat the above process on bootstrapped samples of the full data. The distribution of the final
estimates from the bootstrapped samples approximates the asymptotic distribution of the
elasticity estimator, and we use this distribution to derive inference. Moreover, in many cases
we are not satisfied with estimating the price elasticity at only one point. The same estimation and
inference procedure can be repeated at all other points of interest. In particular, it is interesting to
deliberately change the value of one or two conditioning variables while holding all others
constant. As a result, we can trace the change of price elasticities along one or more dimensions
of interest.
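A minimal sketch of this loop, assuming a routine estimate_elasticity(sample) (a hypothetical
wrapper around the prediction and finite-difference steps above) that returns the elasticity estimate
at a fixed point of interest:

```python
import numpy as np

def bootstrap_ci(data, estimate_elasticity, n_boot=500, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for a point-wise elasticity.

    `data` holds one observation per row; `estimate_elasticity` maps a
    sample to a scalar estimate at the evaluation point.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        boot[b] = estimate_elasticity(data[idx])   # re-run the full pipeline
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```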
2.4 Monte Carlo simulation
We conduct Monte Carlo simulations in this section to show that our empirical strategy works
well in practice. Our method is applied to two parametric settings where we can conveniently
derive analytical solutions. We then compare our estimates with their analytical counterparts at
various given points. This design is aimed at showing that our method can capture the heterogeneity
of price slopes well, even in the presence of endogeneity. In particular, we use the following data
generating process in our simulation,

    s_1 = g(p_1, p_2, p_3, p_4) + ε,
    ε = e − 2u,
    p_1 = z + u,

where p_2, p_3, p_4, u, e, and z are independent and follow the standard normal distribution. Here
u can be interpreted as an unobserved common shock that affects both p_1 and s_1; the presence
of the common shock u raises the concern of endogeneity. For the other variables, s_1 can be
interpreted as the demand for product 1, p_1, p_2, p_3, p_4 as the prices of products 1 to 4, and e
as a random demand shock. Moreover, z can be interpreted as a product-specific cost component
and can therefore serve as an instrument for p_1.
For the demand function g, we use two model specifications. In particular, for Model 1,

    g(p_1, p_2, p_3, p_4) = 5 + p_1^2 + 2p_1 − 3p_2 + p_3 − p_4^2,

and for Model 2,

    g(p_1, p_2, p_3, p_4) = 5 + p_1^3 + 2p_1 − 3p_2 + p_3 − p_4^2,

where the difference is only in the higher-order term in p_1. The reason we make this distinction
is that the two models have a linear and a quadratic price slope along p_1, respectively. To be
specific, it is easy to verify that for Model 1, the price slope is

    ∂g/∂p_1 (p_1, 0, 0, 0) = 2p_1 + 2,

and for Model 2,

    ∂g/∂p_1 (p_1, 0, 0, 0) = 3p_1^2 + 2.
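A minimal sketch of this data generating process for Model 1 (the function name is ours; the
sample size matches the text):

```python
import numpy as np

def simulate_model1(n=10_000, rng=None):
    """One sample from the Model 1 design: s1 = g(p1,...,p4) + eps,
    with p1 endogenous through the common shock u and instrument z."""
    rng = np.random.default_rng(rng)
    p2, p3, p4, u, e, z = rng.standard_normal((6, n))
    p1 = z + u                    # endogenous price
    eps = e - 2 * u               # disturbance correlated with p1 via u
    s1 = 5 + p1**2 + 2 * p1 - 3 * p2 + p3 - p4**2 + eps
    return s1, p1, p2, p3, p4, z
```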
Figure 2.2 depicts our estimation results for Model 1 and Model 2 from one simulated sample
with 10,000 independent observations. The estimations are conducted point by point as p_1
moves gradually from −0.8 to 0.8 and the other variables are held at 0. We also compute the
95% confidence intervals using the bootstrap; they are presented as the dashed lines in Figure
2.2. From Figure 2.2, we can see that even in the presence of endogeneity, our method can still
capture the price slopes well at different price levels. In other words, the heterogeneity of the price
treatment effects has been fully recovered.
Figure 2.2: Simulation with linear and quadratic heterogeneity
(Two panels, Model 1 and Model 2, each plotting the analytical price slope, the estimated price
slope, and the bootstrap 95% confidence interval for p_1 between −0.8 and 0.8.)
Note: This figure plots the estimations of our method for one simulated sample with 10,000
independent observations. The estimations are conducted point by point with p_1 varying from
−0.8 to 0.8 and the other variables at 0.

We further repeat the above simulation exercise 500 times. Table 2.1 summarizes the results.
In Table 2.1, the column p_1 gives the values of p_1 at the points where we make estimations; the
other variables are held at their mean and median level of 0. The two Slope columns give the
analytical values of the price slopes at these points from our theoretical derivation. As a comparison,
our price slope estimates are presented in the neighboring Estimated Slope columns. The two
Variance columns report the variances of the respective price slope estimates across the 500
simulations, while the two Estimated Variance columns report the means of the respective 500
bootstrap variance estimates.

It can be seen from Table 2.1 that our bootstrap inference procedure gives a reliable and
convenient estimate of the true variance of the price slope estimator. This feature becomes very
important when statistical inference is also a concern. The trajectories of the slope estimates
mimic their analytical truth well, which can be very useful in many economic research problems,
such as the study of taxation or tariff pass-through. In the appendix, we also conduct a Monte
Carlo simulation in which the sample is generated from a random coefficients multinomial model;
it seems that our method is compatible with this important case.
To conclude, we have employed a simple Monte Carlo simulation to demonstrate the potential
usefulness of our estimation and inference strategy. It shows that our method is able to recover
heterogeneity in the presence of endogeneity and to deliver convenient and valid inference.
The BLP model (Berry et al., 1995), or the random coefficients multinomial choice model,
has been the workhorse model of demand analysis for decades. This section shows that our
approach can be compatible with the standard random coefficients multinomial choice model.
A commonly used random coefficients multinomial choice model first explicitly specifies the
indirect utility from individual consumption, that is, individual i choosing product j at market t
enjoys the indirect utility

    u_{ijt} = α_i + β_i p_{jt} + ξ_j + ε_{ijt},    (10)

where p_{jt} is the price of product j at market t and ξ_j collects the other product characteristics
of product j. It is usually assumed that the price p_{jt} is product and market specific, while ξ_j
is only product specific. The crucial difference between p_{jt} and ξ_j, however, is that ξ_j is
observed by individual consumers but not by econometricians. Since prices are often correlated
with product characteristics, the omitted product characteristics become the source of price
endogeneity in this standard model. In other settings, price endogeneity can also arise from
measurement error in prices, an endogenous product choice set, or sample selection issues.
Table 2.1: Simulation results

                Model 1                                    Model 2
p_1     Slope  Est. Slope  Variance  Est. Variance  Slope  Est. Slope  Variance  Est. Variance
-0.80   0.40   0.34        0.0603    0.0694         3.92   3.75        0.1390    0.1498
-0.72   0.56   0.51        0.0607    0.0696         3.56   3.39        0.1293    0.1358
-0.64   0.72   0.68        0.0613    0.0691         3.23   3.06        0.1169    0.1229
-0.56   0.88   0.85        0.0629    0.0702         2.94   2.77        0.1092    0.1146
-0.48   1.04   1.02        0.0660    0.0712         2.69   2.52        0.1071    0.1080
-0.40   1.20   1.20        0.0668    0.0705         2.48   2.31        0.0980    0.1022
-0.32   1.36   1.36        0.0679    0.0717         2.31   2.13        0.0932    0.0959
-0.24   1.52   1.53        0.0706    0.0716         2.17   1.99        0.0893    0.0934
-0.16   1.68   1.70        0.0700    0.0739         2.08   1.89        0.0863    0.0908
-0.08   1.84   1.88        0.0702    0.0756         2.02   1.84        0.0838    0.0895
 0.00   2.00   2.05        0.0743    0.0775         2.00   1.82        0.0867    0.0902
 0.08   2.16   2.22        0.0770    0.0784         2.02   1.84        0.0880    0.0903
 0.16   2.32   2.39        0.0772    0.0808         2.08   1.91        0.0883    0.0915
 0.24   2.48   2.56        0.0776    0.0842         2.17   2.01        0.0871    0.0946
 0.32   2.64   2.73        0.0786    0.0868         2.31   2.15        0.0883    0.0987
 0.40   2.80   2.90        0.0819    0.0899         2.48   2.33        0.0932    0.1022
 0.48   2.96   3.07        0.0832    0.0931         2.69   2.55        0.0969    0.1115
 0.56   3.12   3.25        0.0866    0.0956         2.94   2.80        0.1054    0.1175
 0.64   3.28   3.42        0.0921    0.0999         3.23   3.09        0.1190    0.1275
 0.72   3.44   3.59        0.0946    0.1029         3.56   3.43        0.1311    0.1395
 0.80   3.60   3.76        0.1016    0.1089         3.92   3.79        0.1508    0.1529

Note: This table summarizes our simulation results from 500 simulations. Est. is short for
Estimated. For each simulation, we generate a sample with 10,000 independent observations and
make estimations point by point with the values of p_1 varying from −0.8 to 0.8 and the other
variables at 0. For each estimation, the estimated variance is obtained by the bootstrap. Meanwhile,
Variance is the variance of the estimated slopes across the 500 simulations.
In Equation (10), α_i and β_i represent individual preferences, where α_i is the individual fixed
effect and β_i is the individual sensitivity to the product price. The difference between this model
and the multinomial choice model (McFadden, 1973) is that α_i and β_i are assumed to be random
coefficients. This ingenious feature is introduced to avoid the independence of irrelevant
alternatives in multinomial choice models. In this way, the substitution patterns between products
can depend on demographic variables and thus the price elasticity is allowed to be flexible. In our
simulation, we explicitly assume

    α_i = α_1 ν_{i1},
    β_i = β_0 + β_1 ν_{i1} + β_2 ν_{i2},

where ν_{i1} and ν_{i2} represent demographic variables. They are assumed to be independent
and to follow a standard normal distribution in this simulation. The other preference parameters,
α_1, β_0, β_1, and β_2, are set to the pre-determined fixed values (0.8, −3, 0.5, 0.5).

The error term ε_{ijt} is often assumed to follow the i.i.d. Type I extreme value distribution. In
this case, the market share s_{jt} of product j at market t has a closed-form expression,
    s_{jt} = ∫∫ [ exp(β_0 p_{jt} + ξ_j + α_1 ν_{i1} + β_1 ν_{i1} p_{jt} + β_2 ν_{i2} p_{jt}) /
              (1 + Σ_{q=1}^{J} exp(β_0 p_{qt} + ξ_q + α_1 ν_{i1} + β_1 ν_{i1} p_{qt} + β_2 ν_{i2} p_{qt})) ] dF(ν_{i1}) dF(ν_{i2}).    (11)
At a high level, this particular form of the market share function comes from the linear setup of
the indirect utility and the i.i.d. Type I extreme value error. If these fail to hold, the aggregated
market share function turns into a general nonlinear function of product prices, observed
characteristics, and unobserved characteristics. If the unobserved product characteristics are the
only source of endogeneity, our model setup in this paper requires that the unobserved product
characteristic be additively separable in the demand function, which need not be true for Equation
(11). However, as we illustrate at the end of this section, this additivity can be approximately true
at each evaluated point. Other modeling routes exist; for example, Berry and Haile (2014) propose
to impose an index restriction.
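A minimal sketch of how the integral in Equation (11) can be evaluated by Monte Carlo,
averaging the logit choice probability over simulated demographic draws (the function and
argument names and the draw count are our illustrative choices):

```python
import numpy as np

def market_share(j, prices, xi, a1, b0, b1, b2, n_draws=5_000, rng=None):
    """Simulate the market share in Equation (11) by averaging the
    logit choice probability over demographic draws nu_i1, nu_i2."""
    rng = np.random.default_rng(rng)
    prices = np.asarray(prices, dtype=float)   # shape (J,)
    xi = np.asarray(xi, dtype=float)           # shape (J,)
    v1, v2 = rng.standard_normal((2, n_draws, 1))
    beta_i = b0 + b1 * v1 + b2 * v2            # random price coefficient
    util = beta_i * prices + xi + a1 * v1      # (n_draws, J) utilities
    denom = 1.0 + np.exp(util).sum(axis=1)     # outside good utility = 0
    return (np.exp(util[:, j]) / denom).mean()
```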
Figure 2.3: Simulation with the BLP as the true model
(Two panels, own elasticity and cross elasticity, each plotting the true elasticity, the estimated
elasticity, and the bootstrap 95% confidence interval for p_1 between 0.2 and 0.8.)
Note: This figure plots the estimations of our method when the true model is the BLP model. The
sample size is 100,000. The estimations are conducted point by point with p_1 varying from 0.2
to 0.8 and the other variables at their median levels.
Back to the standard model, the unobserved product characteristics ξ_j can be backed out to
deal with endogeneity. Berry (1994) has shown that, if there is an outside good with market share
s_{0t}, log s_{jt} − log s_{0t} can back out the unobserved product characteristics ξ_j. The
backed-out unobserved product characteristics can then be used to form moment conditions if
instrumental variables are available. As a result, the preference parameters in Equation (11) can
be recovered using the generalized method of moments. If it is further assumed that these
parameters are stable in and out of sample, their estimates can be used to conduct counterfactual
analysis.
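As a simple illustration of the inversion, consider the special case of Equation (11) with no
preference heterogeneity, α_1 = β_1 = β_2 = 0. The market shares then collapse to the plain logit
form and the inversion is available in closed form,

    log s_{jt} − log s_{0t} = β_0 p_{jt} + ξ_j,

so ξ_j is read off directly from observed shares and prices. With random coefficients the inversion
is implicit, but the logic is the same.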
In our case, we know the true data generating process since it is a Monte Carlo simulation. We
have assumed that

    p_t = (0.5, 0.5, 0.75, 1)^T + 0.5 u_t + 0.5 z_t,
    ξ_t = (5, 6, 7, 8)^T + u_t,

where u_t is a common shock to the price p_t and the unobserved product characteristics ξ_t,
while z_t only affects p_t and can therefore serve as the instrument. u_t and z_t are independent
random vectors whose elements follow independent uniform distributions on [−0.5, 0.5]. In total,
we assume 100,000 independent markets with 4 products. Figure 2.3 gives our estimation results.
We have argued that Equation (11) cannot in general be guaranteed to be expressible as an
additive model in the unobserved product characteristics ξ. However, if we take a Taylor
expansion at a fixed point (x̄_t, p̄_t, ξ̄_t),

    s_{jt}(x_t, p_t, ξ_t) = s_{jt}(x̄_t, p̄_t, ξ̄_t) + A^T (x_t − x̄_t) + B^T (p_t − p̄_t) + C^T (ξ_t − ξ̄_t) + o(·),

where

    A = ∇_{x_t} s_{jt}(x̄_t, p̄_t, ξ̄_t),
    B = ∇_{p_t} s_{jt}(x̄_t, p̄_t, ξ̄_t),
    C = ∇_{ξ_t} s_{jt}(x̄_t, p̄_t, ξ̄_t),

it seems that it can be approximately true that s_{jt} is additive in ξ locally at each point.
2.5 Discussion
In this paper, we have proposed a point-wise approach to flexibly estimate price elasticity in the
presence of endogeneity. Our framework is open to the use of modern machine learning methods
to estimate point-wise conditional expectations. In this way, the curse of dimensionality can be
mitigated and the working performance of our approach improved. In particular, if the bagged
nearest neighbors estimator is used for point-wise prediction, we prove that the standard bootstrap
procedure can be directly employed for inference. We believe that our flexible price elasticity
estimates can be very useful in a wide range of economic problems, including welfare analysis,
firm pricing, and tax incidence studies.

Since this paper focuses on the estimation and inference of price elasticities, we have not given
an explicit discussion of how counterfactual analysis can be conducted in a nonparametric setup.
In a parametric model, the conventional assumption for counterfactual analysis is that the
structural parameters are stable in and out of sample; they reflect deep preferences and will not
change in the new scenario. In this way, structural models can provide predictions and
counterfactual analysis even when the market becomes totally different. We argue that if similar
assumptions are made, counterfactuals can also be obtained in our framework. However, instead
of assuming constant price coefficients, we can impose other restrictions, such as shape restrictions
on how price elasticities should evolve. This feature can be important when the new policy is
believed to have a significant impact on individual preferences, that is, when structural parameters
are most likely to be misspecified.

We are also curious about the potential future use of deep neural networks in our framework.
Deep neural networks have been widely used in business practice to deal with big data, where
they have demonstrated superior reliability and usefulness. For our problem, one particularly
interesting direction is how deep neural networks can be used to predict the slopes directly. If
there is a convenient and fast algorithm to achieve this, we are curious to know how much further
it can push the dimensionality limit. Moreover, if unstructured data, such as satellite images,
product pictures, and social network texts, can also be utilized to answer economic questions, we
wonder what new insights such a change can bring.
Chapter 3
Empirical applications
3.1 Application 1: The effects of maternal smoking
In this section we work with real-life data and study the heterogeneity of the treatment effects of
smoking on children's birth weights across mothers' ages. This is the same empirical research
question that Abrevaya, Hsu, and Lieli (2015) studied with their kernel-based estimator. We work
with the same data set* in order to give a relatively complete picture of existing approaches to
economic heterogeneity. However, while the research interest in both cases is economic
heterogeneity, our empirical targets differ. Abrevaya, Hsu, and Lieli (2015) target the average
treatment effect conditional on a single variable, such as age, whereas this study focuses on the
heterogeneous treatment effect conditional on the full set of control variables. The difference
reflects different perspectives on observed heterogeneity.

In our study, the feature vector X includes variables such as mother's age, mother's education,
father's education, gestation length in weeks, and the number of prenatal visits. The response
variable Y is the child's birth weight. The binary treatment W is whether the mother smoked
during pregnancy. The data set consists of 475,506 observations in total: 85,062 black mothers,
of whom 4,926 smoked during pregnancy, and 390,444 white mothers, of whom 58,977 smoked
during pregnancy. For more details on the data, see Abrevaya, Hsu, and Lieli (2015). We ignore
the panel structure in the data and directly assume that the unconfoundedness condition and the
i.i.d. condition hold.

* The data set used in this section is obtained from the research webpage of Robert P. Lieli,
https://sites.google.com/site/robertplieli/research. A related data set (Abrevaya, 2006) has also
been used as the touchstone of heterogeneity studies in panel data models.

Figure 3.1: Heterogeneous treatment effects of smoking on child birth weights
We estimate the heterogeneous treatment effects with all control variables, except for age,
fixed at the average levels of the corresponding treated group. For the age variable, we deliberately
choose its level and vary it from 16 to 35. Our purpose is to estimate the treatment effect
heterogeneity across ages with the other variables fixed at average levels. Since Abrevaya, Hsu,
and Lieli (2015) also conduct a subgroup analysis with black and white mothers, we do the same
exercise here; in effect, this is equivalent to adding an extra dimension of heterogeneity across race.

Figure 3.1 shows the empirical results with our two-scale DNN estimators. First of all, a clear
downward-sloping curve can be observed for both black and white mothers. It implies that as
age increases, the loss in newborn weight associated with the mother's smoking becomes larger.
This empirical finding is consistent with Abrevaya, Hsu, and Lieli (2015). Our estimates are
significant, with the bootstrapped 95% confidence intervals away from zero. Moreover, our
empirical results suggest a different pattern between black and white mothers. For black mothers,
there is an interesting hump around the age of 26, an undocumented pattern that is potentially
useful for policy makers. Finally, the downward-sloping curve can also give us a hint about the
causal mechanism behind smoking behaviors. Without further assumptions, we can infer that the
causal channel through which smoking affects birth weight may involve some unobserved factors
associated with age.
3.2 Application 2: The price elasticities of yogurt
In this section, we demonstrate the applicability of our approach by estimating price elasticities
among yogurt products. Yogurt is a widely studied product category in empirical industrial
organization and marketing. However, the yogurt literature has emphasized different aspects of
consumer behavior and, consequently, has adopted different and sometimes mutually incompatible
demand models. In this paper, we estimate own- and cross-price elasticities among popular yogurt
products without imposing a priori structural assumptions on consumer behavior. We believe that
the empirical elasticity patterns we find are 1) directly informative about product markups and price
pass-through, and 2) instructive about more reasonable modeling assumptions if structural
approaches are to be used to characterize demand.
The rich empirical literature on yogurt offers us a myriad of different characterizations of the
demand function that go beyond the "baseline" random coefficient demand system (Berry et al.,
1995). In particular, Villas-Boas (2007) uses the conventional random coefficient discrete choice
model but ingeniously points out the importance of the interaction between retailer and
manufacturer. Meanwhile, various other studies focus on aspects of the consumer choice process
that are beyond the scope of a conventional discrete choice model. For example, Kim et al. (2002)
study consumers' choices of a variety of products and in multiple quantities. Pavlidis and Ellickson
(2018) study consumer switching costs across products and brands and discuss the implications
for dynamic pricing. Huang and Bronnenberg (2018) study costly consideration-set formation in
a demand system where consumers can choose a variety of products in multiple quantities.
In this section, we use the IRI academic dataset (Bronnenberg et al. (2008)) to characterize
elasticities across yogurt products. This dataset has been commonly used in marketing and pro-
vides weekly sales information on yogurt for various store chains from 2001 to 2007 in several
states of the United States. From this dataset, we will demonstrate two interesting features that have not been much emphasized in the existing literature. First, as in many consumer-packaged-goods categories, yogurt products are packaged in various sizes, such as 8 oz or 32 oz. Classical microeconomic theory might view these size offerings as a form of second-degree price discrimination. However, in empirical studies, consumer preference heterogeneity across package sizes is often abstracted away. Commonly, yogurts of various package sizes are pooled within the brand for structural analyses, possibly due to the complexity of specifying the preference distribution across sizes. In this paper, we will analyze package sizes without a priori structural assumptions. Second, in our sample period, one yogurt brand is consistently priced higher than the other. Such persistent differences might be an outcome of cost differences.
Our empirical analysis aims to speak to these two features of demand.
3.2.1 Sample construction
We first picture the landscape of total yogurt sales during our sample period. Summary statistics are provided in Table 3.1. During our sample period, the most popular brand is Yoplait. Dannon follows as the second most popular brand, with only about two-thirds of Yoplait's sales. The third most popular brand is the private label yogurt provided by stores. The private label is technically not a brand, since private label yogurt can be very different, and often of distinct quality, from store to store. For Yoplait and Dannon, we decompose their sales further into popular sizes and report the size-associated total sales. The numbers show that both Yoplait and Dannon offer two vertically-differentiated product lines. One is priced at around $1.70 per pound and the other is premium, at around $2.40 per pound. In this paper, we will focus on the regular product line, since in total it sells the most units and can best reflect the competitive interaction between Yoplait and Dannon.
Table 3.1: Yogurt brand and size

Brand           Total sales    Size     Individual sales   Price
Yoplait         626,760,636    0.375         490,723,711   $1.66
                               0.25           63,490,752   $2.55
                               1.5            26,982,361   $1.73
                               1.125          23,543,991   $2.37
                               0.6875         11,123,974   $2.31
                               0.5             6,517,007   $2.18
                               3               1,884,085   $1.52
Dannon          397,986,186    0.375         199,607,994   $1.58
                               0.5            71,257,840   $1.40
                               1              33,874,083   $2.24
                               2              23,157,036   $1.41
                               1.5            14,539,104   $1.63
Private label   277,347,934    0.5           211,898,981   $0.98
                               0.375          42,254,002   $1.13
                               2              13,661,875   $0.97

Note: This table summarizes yogurt sales in popular brands and sizes from our dataset. Sales are in units, sizes are in pounds, and prices have been normalized to dollars per pound. Only the most popular brands and sizes are listed.
Table 3.2: Summary statistics

Variables          Mean     S.D.     Min     Q1       Median   Q3       Max       Obs.
Yoplait small
  Price            0.92     0.17     0.10    0.80     0.92     1.05     1.58      156,580
  Sales            416.13   272.49   66.00   207.75   346.13   556.88   1343.25   156,580
  IV               0.84     0.05     0.72    0.81     0.84     0.87     0.97      156,580
Yoplait large
  Price            0.92     0.12     0.17    0.83     0.92     1.00     1.46      156,580
  Sales            89.65    61.50    15.00   43.50    72.00    120.00   301.50    156,580
  IV               0.86     0.02     0.75    0.84     0.86     0.88     0.91      156,580
Yoplait premium
  Price            0.64     0.08     0.15    0.58     0.63     0.70     0.98      156,580
Dannon small
  Price            0.83     0.18     0.11    0.69     0.80     0.94     1.36      156,580
  Sales            216.38   180.58   20.25   83.25    160.50   291.75   915.00    156,580
  IV               0.77     0.06     0.58    0.73     0.77     0.81     0.90      156,580
Dannon large
  Price            0.78     0.11     0.27    0.69     0.77     0.86     1.30      156,580
  Sales            131.13   93.37    18.00   62.00    106.00   172.50   502.00    156,580
  IV               0.73     0.02     0.64    0.72     0.73     0.74     0.80      156,580
Private label
  Price            0.54     0.09     0.18    0.48     0.53     0.60     1.46      156,580
Other controls
  Store ACV        0.22     0.09     0.04    0.16     0.20     0.27     1.00      156,580
  Shelf.1 (chain)  0.25     0.10     0.04    0.15     0.22     0.33     0.49      156,580
  Shelf.2 (chain)  0.31     0.10     0.03    0.25     0.33     0.40     0.63      156,580
  Week (time)      0.51     0.29     0.02    0.27     0.52     0.75     1.00      156,580
  Year (time)      0.58     0.28     0.17    0.33     0.67     0.83     1.00      156,580

Note: This table provides summary statistics for the variables used in our analysis. Prices reported here are per 8 oz, except that the premium Yoplait price is per 4 oz.
We define Dannon 0.375 pounds (6 oz) and 0.5 pounds (8 oz) as Dannon small, and Dannon 1.5 pounds (24 oz) and 2 pounds (32 oz) as Dannon large. For Yoplait, Yoplait 0.375 pounds (6 oz) is defined to be Yoplait small and Yoplait 1.5 pounds (24 oz) Yoplait large. Since Yoplait offers more at the premium line, we define Yoplait 0.25 pounds (4 oz), 0.6875 pounds (11 oz), and 1.125 pounds (18 oz) as Yoplait premium and use it as a control. For the private label, we do not further distinguish sizes and use the average price as a control. When we compute the average price, we divide the total revenue by the total pounds sold.
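For concreteness, a pandas sketch of this aggregation step follows. It assumes a hypothetical raw table df with columns store, week, brand, size_lb, units, and revenue; these column names are illustrative and not those of the IRI files.

```python
# A pandas sketch of the size aggregation, assuming a hypothetical raw
# table df with columns store, week, brand, size_lb, units, revenue.
import pandas as pd

SMALL = {("Dannon", 0.375), ("Dannon", 0.5), ("Yoplait", 0.375)}
LARGE = {("Dannon", 1.5), ("Dannon", 2.0), ("Yoplait", 1.5)}
PREMIUM = {("Yoplait", 0.25), ("Yoplait", 0.6875), ("Yoplait", 1.125)}

def label(row):
    key = (row["brand"], row["size_lb"])
    if row["brand"] == "Private label":
        return "Private label"
    if key in SMALL:
        return row["brand"] + " small"
    if key in LARGE:
        return row["brand"] + " large"
    if key in PREMIUM:
        return "Yoplait premium"
    return None

def aggregate(df):
    df = df.assign(product=df.apply(label, axis=1)).dropna(subset=["product"])
    df["pounds"] = df["units"] * df["size_lb"]
    g = (df.groupby(["store", "week", "product"])
           .agg(revenue=("revenue", "sum"), pounds=("pounds", "sum"))
           .reset_index())
    g["price_per_lb"] = g["revenue"] / g["pounds"]  # revenue / pounds sold
    return g
```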
With the above definitions, we eventually arrive at a sample of 156,580 observations at the store-week level. For each observation, we have the prices of Yoplait small, Yoplait large, Yoplait premium, Dannon small, Dannon large, and the private label. We instrument for the potentially endogenous prices using the Hausman instruments (Hausman et al., 1994; Berry et al., 1995), that is, prices of the focal product in other geographic markets. We further control for fixed effects at the store, chain, and time levels. We use the all-commodity volume (ACV) provided by the IRI dataset as a control for the store level fixed effect; it is a weighted measure of product availability based on store aggregate sales. The chain level fixed effects are represented by two variables, Shelf.1 and Shelf.2, the ratios of Dannon sales and private label sales to total yogurt sales in the grocery chain. They reflect the chain level preference over Dannon, Yoplait, and the private label; one interpretation is that they reflect the shelf space allocation. We also add the number of the week within the year and the number of the year within our sample to control for time level fixed effects. Both variables have been normalized to [0,1]. Note that although our analysis uses proxies to control for fixed effects, we do not restrict their functional forms, thereby permitting the controls to enter in a flexible manner. We provide summary statistics on all these variables in Table 3.2.
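One common implementation of the Hausman instrument is sketched below. It assumes a hypothetical long-format table panel with columns market, week, product, and price; the leave-one-out mean implements the idea of using the focal product's prices in other geographic markets.

```python
# A minimal sketch of the Hausman instrument, assuming a hypothetical
# long table `panel` with columns: market, week, product, price.
import pandas as pd

def hausman_iv(panel):
    """Leave-one-out mean price of the same product-week in other markets."""
    g = panel.groupby(["product", "week"])["price"]
    total, count = g.transform("sum"), g.transform("count")
    return (total - panel["price"]) / (count - 1)  # excludes the own market

# usage: panel["iv"] = hausman_iv(panel)
```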
3.2.2 Empirical findings
Our empirical findings are presented in Figures 3.2–3.5. In each of the four figures, we report the price elasticities of all four focal yogurt products with respect to a price change in one of them (e.g. Dannon small).
Figure 3.2: Own and cross price elasticities with respect to the price of Dannon small
[Figure: elasticity (vertical axis) plotted against the price of Dannon small (horizontal axis, $0.70 to $1.00 per 8 oz), with curves for Dannon small, Dannon large, Yoplait small, and Yoplait large and their 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and
Yoplait large with respect to the price of Dannon small. These elasticities are evaluated at 30 price levels of
Dannon small, from 0.7 to 1 dollars per 8 oz.
Figure 3.3: Own and cross price elasticities with respect to the price of Yoplait small
[Figure: elasticity (vertical axis) plotted against the price of Yoplait small (horizontal axis, $0.70 to $1.00 per 8 oz), with curves for Dannon small, Dannon large, Yoplait small, and Yoplait large and their 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and
Yoplait large with respect to the price of Yoplait small. These elasticities are evaluated at 30 price levels of
Yoplait small, from 0.7 to 1 dollars per 8 oz.
Figure 3.4: Own and cross price elasticities with respect to the price of Dannon large
[Figure: elasticity (vertical axis) plotted against the price of Dannon large (horizontal axis, $0.70 to $1.00 per 8 oz), with curves for Dannon small, Dannon large, Yoplait small, and Yoplait large and their 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and
Yoplait large with respect to the price of Dannon large. These elasticities are evaluated at 30 price levels of
Dannon large, from 0.7 to 1 dollars per 8 oz.
Figure 3.5: Own and cross price elasticities with respect to the price of Yoplait large
[Figure: elasticity (vertical axis) plotted against the price of Yoplait large (horizontal axis, $0.70 to $1.00 per 8 oz), with curves for Dannon small, Dannon large, Yoplait small, and Yoplait large and their 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and Yoplait large with respect to the price of Yoplait large. These elasticities are evaluated at 30 price levels of Yoplait large, from 0.7 to 1 dollars per 8 oz.
Table 3.3: Price elasticities at representative points

                     Dannon small   Dannon large   Yoplait small   Yoplait large
(A) Dannon small         -2.961          0.268           1.621           0.957
(B) Dannon large          0.122         -1.371           0.048           0.970
(C) Yoplait small         3.880          0.649          -4.927           1.367
(D) Yoplait large         1.413          0.204          -0.294          -1.576

Note: This table presents the price elasticity estimates when product A (B, C, D) is priced at $0.85 per 8 oz and all other products are at their median price levels.
We report estimated price elasticities at each of the 50 price levels between $0.70 and $1.00 per 8 oz, while holding all other prices, market, and control variables at their median levels. The bootstrapped 95% confidence intervals for all these point estimates are reported in dashed lines. We also report the own and cross elasticity estimates at an own price of $0.85 per 8 oz, with the other prices at their median levels, in Table 3.3.
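To fix ideas, here is a sketch of how such point-wise elasticities can be computed from a fitted conditional mean. The function q_hat is a hypothetical stand-in for our nonparametric demand estimate, and the central finite difference is one simple way to approximate the price derivative.

```python
# A sketch of the point-wise elasticity computation, assuming a
# hypothetical fitted demand function q_hat(p_focal, others).
import numpy as np

def elasticity_curve(q_hat, p_grid, others, h=0.01):
    """Own-price elasticity (dq/dp)(p/q) along a grid of focal prices."""
    out = []
    for p in p_grid:
        q = q_hat(p, others)
        dq = (q_hat(p + h, others) - q_hat(p - h, others)) / (2 * h)
        out.append(dq * p / q)
    return np.asarray(out)

p_grid = np.linspace(0.70, 1.00, 50)  # 50 price levels per 8 oz
# For the confidence bands, refit q_hat on bootstrap resamples of the
# data, recompute the curve, and take the 2.5% and 97.5% quantiles.
```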
First of all, both Dannon's and Yoplait's small-sized products exhibit elastic demand, and their own-price elasticities increase in magnitude with the price. The downward sloping own-price elasticity is consistent with Marshall's second law of demand, which is commonly assumed in theoretical IO. Second, at the price of $0.85, Dannon small has an own elasticity of about -3, whereas Yoplait small has a cross elasticity of about 1.6. In contrast, at the price of $0.85, Yoplait small has an own elasticity of about -5, whereas Dannon small has a cross elasticity of about 4. This suggests that Yoplait small consumers are more sensitive to price. Third, we also observe different cross elasticity patterns for Dannon small and Yoplait small. When we slightly change the price of Dannon small, the cross elasticity of Yoplait small slowly increases with this change. However, when we slightly change the price of Yoplait small, the cross elasticity of Dannon small slowly decreases. Considering that Yoplait is mostly priced a bit higher than Dannon in our sample, an immediate explanation for this difference is that when price levels are closer, the competition between the two products is more intense. Fourth, we observe very similar patterns for Yoplait large and Dannon large when we change the price of Yoplait small or Dannon small. It gives the impression that the large-sized yogurts are specialized for
their targeted consumer groups, to whom the brand of small-sized yogurt is less relevant. Finally,
when we slowly change the price of Dannon large, the strongest substitution comes from Yoplait
large, which is consistent with our findings for small-sized yogurt. However, the pattern does
not hold when we change the price of Yoplait large. For Yoplait large, we also fail to observe
a downward sloping own elasticity curve. In addition, the price elasticities for the other three
products, i.e. Dannon small, Yoplait small, and Dannon large, are estimated with relatively tight
confidence intervals. In contrast, the price elasticities of Yoplait large are estimated with much
wider confidence intervals and for a large price region, the cross elasticities for Dannon large and
Yoplait small are not distinguishable from zero. All these facts hinder us from reaching a confident
interpretation for Yoplait large.
Since the substitution among yogurt products appears to be more prevalent within package size than within brand, let us focus solely on the small-sized yogurt and, for now, ignore the other products. This simplification can shed some light on the cost structures of Yoplait and Dannon. From our estimation, at the median price levels of Yoplait and Dannon, which are ($0.92, $0.80), the own-price elasticities are about -6 for Yoplait and about -2.5 for Dannon. If this is the pricing equilibrium and both firms have constant marginal costs, we can conveniently derive the first order optimality conditions for each firm. Plugging in the numbers, a simple computation tells us that the marginal cost is about $0.77 for Yoplait and $0.48 for Dannon. Conversely, if we have information on marginal costs, our point-wise price elasticity estimates can also inform optimal pricing.
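The computation is a one-line application of the Lerner condition (p - c)/p = 1/|elasticity| implied by single-product pricing with a constant marginal cost. The snippet below, a back-of-the-envelope check under those stated assumptions, reproduces the numbers in the text.

```python
# A back-of-the-envelope check of the implied marginal costs: with a
# constant marginal cost c, the single-product first order condition
# gives the Lerner index (p - c) / p = 1 / |elasticity|.
prices = {"Yoplait small": 0.92, "Dannon small": 0.80}
own_elast = {"Yoplait small": -6.0, "Dannon small": -2.5}

for prod, p in prices.items():
    c = p * (1 - 1 / abs(own_elast[prod]))
    print(f"{prod}: implied marginal cost ${c:.2f} per 8 oz")
# Yoplait small: implied marginal cost $0.77 per 8 oz
# Dannon small: implied marginal cost $0.48 per 8 oz
```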
To conclude, our results show that small-sized yogurt and large-sized yogurt are very different products by nature, and that competition among yogurt products exists more within the same package size than within the same brand. In empirical studies, an array of papers have first aggregated data across all package sizes within a brand and then conducted further analysis. Whereas this approach simplifies the product space, it might also abstract away important dimensions of product differentiation and substitution. It is worth noting that our results build only on minimal assumptions on functional forms, functional differentiability, and the validity of instrumental variables, regardless of the source of endogeneity. The heterogeneity patterns we find are obtained without any a priori structural assumptions on individuals and markets. We believe that such information can be valuable in its own right and constructive for further structural modeling.
Bibliography
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average
treatment effects. Econometrica 74(1), 235–267.
Abrevaya, J. (2006). Estimating the effect of smoking on birth outcomes using a matched panel
data approach. Journal of Applied Econometrics 21(4), 489–519.
Abrevaya, J., Y . Hsu, and R. Lieli (2015). Estimating conditional average treatment effects. Journal
of Business & Economic Statistics 33(4), 485–505.
Arcones, M. A. and E. Giné (1992). On the bootstrap of U and V statistics. The Annals of Statistics 20(2), 655–674.
Arellano, M. and J. Hahn (2013). Understanding Bias in Nonlinear Panel Models: Some Recent Developments, Volume 3, pp. 381–409. Cambridge: Cambridge University Press.
Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceed-
ings of the National Academy of Sciences 113(27), 7353–7360.
Athey, S. and G. Imbens (2017). The state of applied econometrics: Causality and policy evalua-
tion. Journal of Economic Perspectives 31(2), 3–32.
Athey, S., G. Imbens, and S. Wager (2018). Approximate residual balancing: De-biased inference
of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), to appear.
Athey, S., J. Tibshirani, and S. Wager (2018). Generalized random forests. The Annals of Statistics,
to appear.
Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of Statis-
tics 47(2), 1148–1178.
Banks, J., R. Blundell, and A. Lewbel (1997). Quadratic engel curves and consumer demand. The
Review of Economics and Statistics 79(4), 527–539.
Belloni, A., V. Chernozhukov, and C. Hansen (2014a). High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives 28(2), 29–50.
Belloni, A., V. Chernozhukov, and C. Hansen (2014b). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81(2), 608–650.
Berrett, T. B., R. J. Samworth, and M. Yuan (2018). Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics, to appear.
Berry, S., A. Gandhi, and P. Haile (2013). Connected substitutes and invertibility of demand.
Econometrica 81(5), 2087–2111.
Berry, S., J. Levinsohn, and A. Pakes (1995). Automobile prices in market equilibrium. Econo-
metrica 63(4), 841–890.
Berry, S. T. (1994). Estimating Discrete-Choice models of product differentiation. The RAND
Journal of Economics 25(2), 242.
Berry, S. T. and P. A. Haile (2014). Identification in differentiated products markets using market
level data. Econometrica 82(5), 1749–1797.
Biau, G., F. Cérou, and A. Guyader (2010). On the rate of convergence of the bagged nearest neighbor estimate. Journal of Machine Learning Research 11(3), 687–712.
Biau, G. and L. Devroye (2015). Lectures on the Nearest Neighbor Method. Springer.
Bickel, P. J. and D. A. Freedman (1981). Some asymptotic theory for the bootstrap. The Annals of
Statistics 9(6), 1196–1217.
Bierens, H. J. (1987). Kernel estimators of regression functions, Volume 1, pp. 99–144. Cambridge University Press.
Blundell, R., X. Chen, and D. Kristensen (2007). Semi-Nonparametric IV Estimation of Shape-
Invariant Engel Curves. Econometrica 75, 1613–1669.
Blundell, R., J. Horowitz, and M. Parey (2016). Nonparametric estimation of a nonseparable
demand function under the slutsky inequality restriction. The Review of Economics and Statis-
tics 99(2), 291–304.
Blundell, R., J. L. Horowitz, and M. Parey (2012). Measuring the price responsiveness of gaso-
line demand: Economic shape restrictions and nonparametric demand estimation. Quantitative
Economics 3(1), 29–51.
Blundell, R., D. Kristensen, and R. L. Matzkin (2013). Control functions and simultaneous equa-
tions methods. American Economic Review 103(3), 563–69.
Blundell, R. and J. L. Powell (2003). Endogeneity in Nonparametric and Semiparametric Regression Models, Volume 2, pp. 312–357. Cambridge University Press.
Borovkov, A. A. (2013). Probability Theory. Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140.
Bronnenberg, B. J., M. W. Kruger, and C. F. Mela (2008). Database Paper-The IRI marketing data
set. Marketing Science 27(4), 745–748.
Chen, X. (2007). Chapter 76: Large Sample Sieve Estimation of Semi-Nonparametric Models, Volume 6, pp. 5549. Elsevier.
Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2014). Local Identification of Nonparametric and Semiparametric Models. Econometrica 82, 785–809.
Chen, X. and T. M. Christensen (2018). Optimal sup-norm rates and uniform inference on nonlin-
ear functionals of nonparametric iv regression. Quantitative Economics 9(1), 39–84.
Chen, X. and D. Pouzo (2015). Sieve Wald and QLR Inferences on Semi/Nonparametric Condi-
tional Moment Models. Econometrica 83(3), 1013–1079.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey (2017). Double/Debiased/Neyman machine learning of treatment effects. American Economic Review 107(5), 261–65.
Chernozhukov, V., C. Hansen, and M. Spindler (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics 7(1), 649–688.
Chesher, A. (2003). Identification in nonseparable models. Econometrica 71(5), 1405–1441.
Compiani, G. (2019). Market counterfactuals and the specification of multi-product demand: A nonparametric approach. Working paper.
Crump, R. K., J. V. Hotz, G. W. Imbens, and O. A. Mitnik (2007). Nonparametric tests for treatment effect heterogeneity. Review of Economics and Statistics 90(3), 389–405.
D’Amour, A., P. Ding, A. Feller, L. Lei, and J. Sekhon (2017). Overlap in observational studies
with high-dimensional covariates. arXiv preprint arXiv:1711.02582.
Das, M., W. K. Newey, and F. Vella (2003). Nonparametric estimation of sample selection models.
The Review of Economic Studies 70(1), 33–58.
De Los Santos, B., A. Hortaçsu, and M. R. Wildenbeest (2012). Testing models of consumer search using data on web browsing and purchasing behavior. American Economic Review 102(6), 2955–80.
Deaton, A. and J. Muellbauer (1980). An almost ideal demand system. The American Economic
Review 70(3), 312–326.
Dette, H., S. Hoderlein, and N. Neumeyer (2016). Testing multivariate economic restrictions using
quantiles: The example of slutsky negative semidefiniteness. Journal of Econometrics 191(1),
129–144.
Dhaene, G. and K. Jochmans (2015). Split-panel jackknife estimation of fixed-effect models. The
Review of Economic Studies 82(3), 991–1030.
Ding, P., A. Feller, and L. Miratrix (2018). Decomposing treatment effect variation. Journal of the
American Statistical Association to appear.
Draganska, M. and D. C. Jain (2006). Consumer preferences and Product-Line pricing strategies:
An empirical analysis. Marketing Science 25(2), 164–174.
Draganska, M., M. Mazzeo, and K. Seim (2009). Beyond plain vanilla: Modeling joint product
assortment and pricing decisions. Quantitative Marketing and Economics 7(2), 105–146.
Dubé, J., G. J. Hitsch, and P. E. Rossi (2018). Income and wealth effects on private-label demand: Evidence from the great recession. Marketing Science 37(1), 22–53.
Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans. SIAM.
Efron, B. and C. Stein (1981). The jackknife estimate of variance. The Annals of Statistics 9(3),
586–596.
Fan, J. and Y . Fan (2009). High-dimensional classification using features annealed independence
rules. The Annals of Statistics 36(6), 2605–2637.
Fan, J., Y . Fan, and Y . Wu (2010). High-dimensional classification. High-Dimensional Data Anal-
ysis (T. T. Cai and X. Shen, eds.), pp. 3–37. World Scientific.
Fan, J. and T. Hu (1992). Bias correction and higher order kernel functions. Statistics & Probability Letters 13(3), 235–243.
Fan, J., K. Imai, H. Liu, Y . Ning, and X. Yang (2016). Improving covariate balancing propensity
score: A doubly robust and efficient approach. Manuscript.
Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space (with
discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5),
849–911.
Fan, J. and J. Lv (2018). Sure independence screening (invited review article). Wiley StatsRef:
Statistics Reference Online.
Fan, J., J. Lv, and L. Qi (2011). Sparse high-dimensional models in economics. Annual Review of
Economics 3(1), 291–317.
Fan, Y ., J. Lv, and J. Wang (2018). DNN: A two-scale distributional tale of heterogeneous treatment
effect inference. Working paper.
Gentzkow, M. (2007). Valuing new goods in a model with complementarity: Online newspapers.
American Economic Review 97(3), 713–744.
Goeree, M. S. (2008). Limited information and advertising in the U.S. personal computer industry.
Econometrica 76(5), 1017–1074.
Grimmer, J., S. Messing, and S. J. Westwood (2017). Estimating heterogeneous treatment effects
and the effects of heterogeneous treatments with ensemble methods. Political Analysis 25(4),
1–22.
Györfi, L., M. Kohler, A. Krzyżak, and H. Walk (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Haag, B. R., S. Hoderlein, and K. Pendakur (2009). Testing and imposing slutsky symmetry in
nonparametric demand systems. Journal of Econometrics 153(1), 33–50.
Hahn, J. and G. Kuersteiner (2002). Asymptotically unbiased inference for a dynamic panel model
with fixed effects when both n and t are large. Econometrica 70(4), 1639–1657.
Hahn, J. and W. Newey (2004). Jackknife and analytical bias reduction for nonlinear panel models.
Econometrica 72(4), 1295–1319.
Hahn, J. and G. Ridder (2017). Instrumental variable estimation of nonlinear models with nonclas-
sical measurement error using control variables. Journal of Econometrics 200, 238–250.
Hahn, J. and G. Ridder (2018). Three-stage semi-parametric inference: Control variables and
differentiability. Journal of Econometrics.
Hájek, J. (1968). Asymptotic normality of simple linear rank statistics under alternatives. The Annals of Mathematical Statistics 39(2), 325–346.
Hall, P. and J. L. Horowitz (2005). Nonparametric methods for inference in the presence of instru-
mental variables. The Annals of Statistics 33, 2904–2929.
Hall, P. and R. J. Samworth (2005). Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3), 363–379.
Hausman, J., G. Leonard, and D. J. Zona (1994). Competitive analysis with differentiated products. Annales d'Économie et de Statistique (34), 159–180.
Hausman, J. A. and W. K. Newey (1995). Nonparametric estimation of exact consumers surplus
and deadweight loss. Econometrica 63(6), 1445–1476.
Hausman, J. A. and W. K. Newey (2015). Nonparametric welfare analysis. Annual Review of
Economics 9(1), 1–26.
Heckman, J. and R. Robb (1985). Alternative methods for evaluating the impact of interventions:
An overview. Journal of Econometrics 30(1), 239–267.
Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an econometric evaluation estima-
tor: Evidence from evaluating a job training programme. The Review of Economic Studies 64(4),
605–654.
Heckman, J. J., J. Smith, and N. Clements (1997). Making the most out of programme evaluations
and social experiments: Accounting for heterogeneity in programme impacts. The Review of
Economic Studies 64(4), 487–535.
Hendel, I. (1999). Estimating multiple-discrete choice models: An application to computerization
returns. The Review of Economic Studies 66(2), 423–446.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics 19(3), 293–325.
Houde, J. (2012). Spatial differentiation and vertical mergers in retail markets for gasoline. Amer-
ican Economic Review 102(5), 2147–82.
Hsiao, C. (2014). Analysis of Panel Data. Cambridge University Press.
Huang, Y . and B. J. Bronnenberg (2018). Pennies for your thoughts: Costly product consideration
and purchase quantity thresholds. Marketing Science 37(6), 1009–1028.
Imai, K., L. Keele, D. Tingley, and T. Yamamoto (2011). Unpacking the black box of causality:
Learning about causal mechanisms from experimental and observational studies. American
Political Science Review 105(4), 765–789.
Imai, K., L. Keele, and T. Yamamoto (2010). Identification, inference and sensitivity analysis for
causal mediation effects. Statistical Science 25(1), 51–71.
Imai, K. and M. M. Ratkovic (2013). Estimating treatment effect heterogeneity in randomized
program evaluation. The Annals of Applied Statistics 7(1), 443–470.
Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous
equations models without additivity. Econometrica 77(5), 1481–1512.
Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical
Sciences. Cambridge University Press.
Jones, M. and P. Foster (1993). Generalized jackknifing and higher order kernels. Journal of Nonparametric Statistics 3(1), 81–94.
Kim, J., G. M. Allenby, and P. E. Rossi (2002). Modeling consumer demand for variety. Marketing
Science 21(3), 229–250.
Korolyuk, V. S. and Y. V. Borovskich (1994). Theory of U-statistics. Springer.
Mack, Y . (1980). Local properties of k-NN regression estimates. SIAM Journal on Algebraic
Discrete Methods 2(3), 311–323.
MaCurdy, T., X. Chen, and H. Hong (2011). Flexible estimation of treatment effect parameters.
American Economic Review 101(3), 544–551.
Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Economet-
rica 71(5), 1339–1375.
Matzkin, R. L. (2008). Identification in nonparametric simultaneous equations models. Economet-
rica 76(5), 945–978.
Matzkin, R. L. (2015). Estimation of nonparametric models with simultaneity. Economet-
rica 83(1), 1–66.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior, pp. 105–142.
Academic Press.
Mullainathan, S. and J. Spiess (2017). Machine learning: An applied econometric approach. Jour-
nal of Economic Perspectives 31(2), 87–106.
Nevo, A. (2001). A practitioner’s guide to estimation of Random-Coefficients logit models of
demand. Journal of Economics & Management Strategy 9(4), 513–548.
Newey, W. K., F. Hsieh, and J. M. Robins (2004). Twicing kernels and a small bias property of semiparametric estimators. Econometrica 72(3), 947–962.
Newey, W. K. and J. L. Powell (2003). Instrumental Variable Estimation of Nonparametric Models.
Econometrica 71, 1565–1578.
Newey, W. K., J. L. Powell, and F. Vella (1999). Nonparametric estimation of triangular simulta-
neous equations models. Econometrica 67(3), 565–603.
Pavlidis, P. and P. B. Ellickson (2018). Implications of parent brand inertia for multiproduct pric-
ing. Quantitative Marketing and Economics 15(4), 369–407.
Pearl, J. (2009). Causality: Models, reasoning and inference. Cambridge University Press.
Petrin, A. (2002). Quantifying the benefits of new products: The case of the minivan. Journal of
Political Economy 110(4), 705–729.
Powers, S., J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2018).
Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics
in Medicine 37(11), 1767–1787.
Rivers, D. and Q. H. Vuong (1988). Limited information estimators and exogeneity tests for
simultaneous probit models. Journal of Econometrics 39(3), 347–366.
Rosenbaum, P. R. (2010). Design of Observational Studies. Springer.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology 66(5), 688–701.
Samworth, R. J. (2012). Optimal weighted nearest neighbour classifiers. The Annals of Statis-
tics 40(5), 2733–2763.
Schucany, W., H. Gray, and D. Owen (1971). On bias reduction in estimation. Journal of the American Statistical Association 66(335), 524–533.
Schucany, W. and J. P. Sommers (1977). Improvement of kernel type density estimators. Journal of the American Statistical Association 72(358), 420–423.
Serfling, R. J. (2018). Approximation Theorems of Mathematical Statistics. Wiley Series in Prob-
ability and Statistics.
Shao, J. and D. Tu (1995). The Jackknife and Bootstrap. Springer.
Smith, R. and R. Blundell (1986). An exogeneity test for a simultaneous equation tobit model with
an application to labor supply. Econometrica 54(3), 679–685.
Stone, R. (1954). Linear expenditure systems and demand analysis: An application to the pattern
of british demand. The Economic Journal 64(255), 511–527.
Strikwerda, J. (2004). Finite Difference Schemes and Partial Differential Equations (Second ed.). Society for Industrial and Applied Mathematics.
Tchetgen, E. and I. Shpitser (2012). Semiparametric theory for causal mediation analysis: Ef-
ficiency bounds, multiple robustness and sensitivity analysis. The Annals of Statistics 40(3),
1816–1845.
Tu, D. and C. Ping (1989). Bootstrapping the untrimmed L-statistics. Journal of Systems Science and Complexity 9(1), 14–23.
Villas-Boas, S. (2007). Vertical relationships between manufacturers and retailers: Inference with
limited data. The Review of Economic Studies 74(2), 625–652.
Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association, to appear.
Working, E. (1927). What do statistical “Demand curves” show? The Quarterly Journal of
Economics 41(2), 212–235.
Appendices
C Proofs of main results
C.1 Proof of Theorem 1
Our road map for the proof of Theorem 1 consists of three steps. First, in Lemma 1 we derive the bias term of the 1-nearest neighbor feature vector to the target point $x$ in a general case with an i.i.d. sample of size $n$. Second, in Lemma 2 a bridge is built to link the discrepancy in the features to the discrepancy in the response. Third, we employ the two facilities developed above to analyze the DNN estimator $D_n(s)$ with subsampling scale $s$.

There are several ways to prove Lemma 1. In this appendix, we present a clean proof using the Lebesgue differentiation theorem. Lemma 2 is the key to the proof of Theorem 1; with the powerful idea of projecting onto a half line from Biau and Devroye (2015), we present a neat proof with spherical integration. As we will see, Theorem 1 is just one step away from Lemmas 1–2.
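As a computational aside (not part of the proof): because the kernel of $D_n(s)$ is the 1-nearest neighbor response within a subsample, averaging over all size-$s$ subsamples reduces to an L-statistic, since the $i$-th closest observation to $x$ is a subsample's nearest neighbor with probability $\binom{n-i}{s-1}/\binom{n}{s}$. A minimal numpy sketch of this evaluation, using log-binomial weights for numerical stability, is below.

```python
# A sketch of evaluating D_n(s)(x) as a weighted sum of ordered responses.
import numpy as np
from scipy.special import gammaln

def dnn(X, Y, x, s):
    """D_n(s)(x): mean 1-NN response over all size-s subsamples."""
    n = len(Y)
    order = np.argsort(np.linalg.norm(X - x, axis=1))  # rank points by distance to x
    i = np.arange(1, n + 1)
    # log of C(n-i, s-1) / C(n, s); entries with n-i < s-1 get zero weight
    logw = (gammaln(n - i + 1) - gammaln(s) - gammaln(n - i - s + 2)
            - gammaln(n + 1) + gammaln(s + 1) + gammaln(n - s + 1))
    w = np.where(n - i >= s - 1, np.exp(logw), 0.0)
    return float(np.dot(w, Y[order]))
```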
C.1.1 Lemma 1 and its proof
Lemma 1. Given $x \in \mathrm{supp}(X)$, under Conditions 2–3, and when $n \to \infty$, the 1-nearest neighbor to $x$ in the i.i.d. sample $\{(X_i, Y_i)\}_{i=1}^n$ has its 1-nearest neighbor feature $X_{(1)}$ satisfying

$$E\|X_{(1)} - x\|^2 = \frac{\Gamma(2/d + 1)}{(f(x) V_d)^{2/d}}\, n^{-2/d} + o(n^{-2/d}), \qquad (C.1)$$

where $V_d = \pi^{d/2} / \Gamma(1 + d/2)$ and $\Gamma(\cdot)$ is the Gamma function.
Proof of Lemma 1: We will use the following two results in the proof of Lemma 1. Since they are well known, their proofs are not presented in this appendix.

1. By the Lebesgue differentiation theorem, when $r \to 0$, $\mu(B(x, r)) = f(x) V_d r^d + o(r^d)$, where $\mu$ is the measure of $X$, $B(x, r)$ is the Euclidean ball in $\mathbb{R}^d$, $V_d$ is the volume of the unit ball, $f$ is the density of the measure $\mu$ with respect to the Lebesgue measure $\lambda$, and $f$ is continuous at $x$.

2. For $a > 0$ and $b > 0$, we have $\int_0^\infty x^{a-1} \exp(-b x^p)\, dx = \frac{1}{p}\, b^{-a/p}\, \Gamma\!\big(\frac{a}{p}\big)$.
For our target,

$$E\|X_{(1)} - x\|^2 = \int_0^\infty P(\|X_{(1)} - x\|^2 > t)\, dt = \int_0^\infty P(\|X_{(1)} - x\| > \sqrt{t})\, dt = \int_0^\infty \big[1 - \mu(B(x, \sqrt{t}))\big]^n\, dt = n^{-2/d} \int_0^\infty \big[1 - \mu(B(x, \sqrt{t}\, n^{-1/d}))\big]^n\, dt.$$
We then take the limit as $n \to \infty$,

$$\lim_{n\to\infty} n^{2/d}\, E\|X_{(1)} - x\|^2 = \lim_{n\to\infty} \int_0^\infty \big[1 - \mu(B(x, \sqrt{t}\, n^{-1/d}))\big]^n\, dt = \lim_{n\to\infty} \int_0^\infty \Big[1 - (f(x)V_d + o(1))\, \frac{t^{d/2}}{n}\Big]^n\, dt$$
$$= \int_0^\infty \lim_{n\to\infty} \Big[1 - (f(x)V_d + o(1))\, \frac{t^{d/2}}{n}\Big]^n\, dt = \int_0^\infty \exp\big\{-(f(x)V_d + o(1))\, t^{d/2}\big\}\, dt = \frac{\Gamma(2/d + 1)}{(f(x)V_d)^{2/d}} + o(1),$$
which completes the proof of Lemma 1. Since we have assumed $f(x)V_d$ to be bounded everywhere, the resulting $o(1)$ is uniform in $t$ in the second equality, and in the third equality the integral and the limit can be interchanged.

From Lemma 1, we know that the squared distance of the closest feature vector in an i.i.d. random sample of size $n$ to the target point $x$ is of first order $n^{-2/d}$. If the i.i.d. sample has size $s$, as a consequence, the first order of the squared distance is $s^{-2/d}$. When the size $s \to \infty$, the squared distance goes to zero; when the dimensionality $d$ of the features is large, the rate of convergence is slow. The coefficient is also explicitly derived: intuitively, when the density $f(x)$ at $x$ is small, the squared distance is large.
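A quick Monte Carlo check of this rate is easy to set up. The sketch below takes $X$ uniform on $[0,1]^d$, so that $f(x) = 1$ at an interior point, and compares the simulated mean squared 1-nearest-neighbor distance with the leading term of (C.1).

```python
# A Monte Carlo check of Lemma 1 with X uniform on [0,1]^d (f(x) = 1 at
# the interior point x = (0.5, ..., 0.5)); compare the simulated mean
# squared 1-NN distance with Gamma(2/d+1) * (f(x)*V_d)**(-2/d) * n**(-2/d).
import numpy as np
from scipy.special import gamma

d, n, reps = 2, 2000, 2000
rng = np.random.default_rng(0)
x = np.full(d, 0.5)
sq = [np.min(np.sum((rng.random((n, d)) - x) ** 2, axis=1))
      for _ in range(reps)]
V_d = np.pi ** (d / 2) / gamma(1 + d / 2)
theory = gamma(2 / d + 1) / (V_d ** (2 / d)) * n ** (-2 / d)
print(np.mean(sq), theory)  # both about 1/(pi*n) when d = 2
```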
C.1.2 Lemma 2 and its proof
As in Biau and Devroye (2015), we first define the projection onto the half line $\|X - x\|$,

$$m(r) = \lim_{\delta \to 0} E[m(X) \mid r \le \|X - x\| \le r + \delta] = E[Y \mid \|X - x\| = r]. \qquad (C.2)$$

An immediate consequence of this definition is that $m(0) = E[Y \mid X = x] = m(x)$.

Lemma 2. When $r \to 0$, we have for $x \in \mathrm{supp}(X)$,

$$m(r) = m(0) + \frac{f(x)\,\mathrm{tr}(m''(x)) + 2\, m'(x)^T f'(x)}{2 d f(x)}\, r^2 + o(r^2). \qquad (C.3)$$
Proof of Lemma 2: Spherical coordinate integration is to be used in our proof. We first introduce the notation: $B(0, r)$ denotes the ball with radius $r$ centered at $0$ in the Euclidean space $\mathbb{R}^d$, $S^{d-1}$ denotes the unit sphere in $\mathbb{R}^d$, $\nu$ denotes a measure constructed on the sphere $S^{d-1}$, and $\xi \in S^{d-1}$ denotes a point on the sphere. We omit other details. Integration in spherical coordinates is equivalent to the standard integration,

$$\int_{B(0,r)} f(x)\, dx = \int_0^r u^{d-1} \int_{S^{d-1}} f(u\xi)\, \nu(d\xi)\, du.$$
Some useful integration formulas are

$$\int_{S^{d-1}} \nu(d\xi) = d V_d, \qquad \int_{S^{d-1}} \xi\, \nu(d\xi) = 0, \qquad \int_{S^{d-1}} \xi^T M \xi\, \nu(d\xi) = \mathrm{tr}(M)\, V_d,$$

where $M$ is a $d \times d$ matrix.
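These identities are easy to verify numerically. In the sketch below, normalized Gaussian vectors give uniform draws on $S^{d-1}$; the uniform distribution has $E[\xi^T M \xi] = \mathrm{tr}(M)/d$, and multiplying by the total measure $\nu(S^{d-1}) = d V_d$ recovers $\mathrm{tr}(M) V_d$.

```python
# A Monte Carlo check of the third spherical integration identity.
import numpy as np
from scipy.special import gamma

d = 3
rng = np.random.default_rng(1)
M = rng.random((d, d))
Z = rng.standard_normal((200_000, d))
xi = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # uniform on S^{d-1}
V_d = np.pi ** (d / 2) / gamma(1 + d / 2)
mc = np.einsum("ni,ij,nj->n", xi, M, xi).mean() * d * V_d
print(mc, np.trace(M) * V_d)  # the two values should be close
```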
First, we decompose $m(r)$ into two components which we will analyze separately,

$$m(r) = \lim_{\delta \to 0} E[m(X) \mid r \le \|X - x\| \le r + \delta] = \lim_{\delta \to 0} \frac{E[m(X)\, 1(r \le \|X - x\| \le r + \delta)]}{P(r \le \|X - x\| \le r + \delta)}.$$

Before we proceed, we use the spherical coordinate representation for the denominator and the numerator,

$$P(r \le \|X - x\| \le r + \delta) = \int_r^{r+\delta} u^{d-1} \int_{S^{d-1}} f(x + u\xi)\, \nu(d\xi)\, du,$$
$$E[m(X)\, 1(r \le \|X - x\| \le r + \delta)] = \int_r^{r+\delta} u^{d-1} \int_{S^{d-1}} m(x + u\xi)\, f(x + u\xi)\, \nu(d\xi)\, du.$$
Using spherical coordinate integration, we also have

$$\int_{S^{d-1}} f(x + r\xi)\, \nu(d\xi) = \int_{S^{d-1}} \Big( f(x) + f'(x)^T \xi\, r + \tfrac{1}{2}\, \xi^T f''(x)\, \xi\, r^2 + o(r^2) \Big)\, \nu(d\xi) = f(x)\, d V_d + \tfrac{1}{2}\, \mathrm{tr}(f''(x))\, V_d\, r^2 + o(r^2),$$
and

$$\int_{S^{d-1}} m(x + r\xi)\, f(x + r\xi)\, \nu(d\xi)$$
$$= \int_{S^{d-1}} \big[ m(x) + m'(x)^T \xi\, r + \tfrac{1}{2}\, \xi^T m''(x)\, \xi\, r^2 + o(r^2) \big]\big[ f(x) + f'(x)^T \xi\, r + \tfrac{1}{2}\, \xi^T f''(x)\, \xi\, r^2 + o(r^2) \big]\, \nu(d\xi)$$
$$= \int_{S^{d-1}} \Big\{ m(x) f(x) + \big[ f(x)\, m'(x)^T \xi + m(x)\, f'(x)^T \xi \big] r + \big[ \tfrac{1}{2} f(x)\, \xi^T m''(x)\, \xi + \tfrac{1}{2} m(x)\, \xi^T f''(x)\, \xi + \xi^T m'(x) f'(x)^T \xi \big] r^2 + o(r^2) \Big\}\, \nu(d\xi)$$
$$= m(x) f(x)\, d V_d + \tfrac{1}{2} \big[ f(x)\, \mathrm{tr}(m''(x)) + m(x)\, \mathrm{tr}(f''(x)) \big] V_d\, r^2 + m'(x)^T f'(x)\, V_d\, r^2 + o(r^2).$$
With the above results, by L'Hôpital's rule,

$$m(r) = \lim_{\delta\to 0} \frac{E[m(X)\, 1(r \le \|X - x\| \le r + \delta)]}{P(r \le \|X - x\| \le r + \delta)} = \frac{m(x) f(x)\, d V_d + \tfrac{1}{2}\big[f(x)\mathrm{tr}(m''(x)) + m(x)\mathrm{tr}(f''(x))\big] V_d r^2 + m'(x)^T f'(x) V_d r^2 + o(r^2)}{f(x)\, d V_d + \tfrac{1}{2}\mathrm{tr}(f''(x))\, V_d r^2 + o(r^2)}$$
$$= m(x) + \frac{f(x)\,\mathrm{tr}(m''(x)) + 2\, m'(x)^T f'(x)}{2 d f(x)}\, r^2 + o(r^2).$$
This completes the proof of Lemma 2. Lemma 2 tells us that when the distance $r \to 0$, the difference in projections onto the half line $\|X - x\|$ can be approximated by the squared distance $r^2$. We already know the order of the squared distance from Lemma 1. The only remaining question is the discrepancy between the points and their projections.
C.1.3 Proof of Theorem 1
Suppose $X_{(1)}$ is known. Then $E[Y \mid \|X - x\| = \|X_{(1)} - x\|] \ne E[Y \mid X = X_{(1)}]$ in general: the former is the mean on a sphere while the latter is the mean at a point of the sphere. However, by iterated expectations, the unconditional mean of the response of the 1-nearest neighbor is $E Y_{(1)} = E_{X_{(1)}} E[Y \mid X = X_{(1)}] = E_{X_{(1)}} E[Y \mid \|X - x\| = \|X_{(1)} - x\|]$. Letting $r_{(1)} = \|X_{(1)} - x\|$, then $E\, m(r_{(1)}) = E Y_{(1)}$. One interesting fact is that their variances are different in general; this fact will be revisited in Lemma 3. We are now able to finish the proof of Theorem 1,
$$E D_n(s)(x) = E\,\Phi(x; Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s}) = E\big[Y_{(1)}(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s})\big] = E\big[m(r_{(1)})(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_s})\big]$$
$$= m(x) + \Gamma(2/d + 1)\, \frac{f(x)\,\mathrm{tr}(m''(x)) + 2\, m'(x)^T f'(x)}{2 d\, V_d^{2/d}\, f(x)^{1 + 2/d}}\, s^{-2/d} + o(s^{-2/d}).$$
The result follows directly from Lemmas 1–2. To summarize, the projection idea (Biau and Devroye, 2015) enables us to derive the asymptotic bias of the DNN estimator. Moreover, the coefficient here does not depend on the subsample size $s$. This interesting fact opens the door to the new two-scale framework.
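As a sketch of that framework: because the bias coefficient is common across scales, two estimates $D_n(s_1)$ and $D_n(s_2)$ can be combined with weights that sum to one and cancel the $s^{-2/d}$ term. The snippet below illustrates this weighting; it is an illustration of the idea, not the paper's full procedure.

```python
# A minimal sketch of the two-scale combination: weights (w1, w2) solve
# w1 + w2 = 1 and w1 * s1**(-2/d) + w2 * s2**(-2/d) = 0, so the common
# first-order bias of D_n(s1) and D_n(s2) cancels in w1*D1 + w2*D2.
import numpy as np

def two_scale(D1, D2, s1, s2, d):
    """Combine DNN estimates at two subsampling scales s1 < s2."""
    A = np.array([[1.0, 1.0],
                  [s1 ** (-2 / d), s2 ** (-2 / d)]])
    w1, w2 = np.linalg.solve(A, np.array([1.0, 0.0]))
    return w1 * D1 + w2 * D2
```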
C.2 Proof of Theorem 2
We prove the asymptotic normality of the DNN estimator in Theorem 2. Our proof builds on the U-statistic framework. The classical results on the asymptotic normality of U-statistics are not directly applicable since the subsampling scale $s \to \infty$. However, we follow the routine used for classical U-statistics (Hájek, 1968) and then give a sufficient condition that makes the transition go through. The condition we give in this paper is $s/n \to 0$. This condition is so intuitive a posteriori that it simply means that the subsample size $s$ is relatively small compared to the whole sample size $n$ and can be treated as a constant, as in classical U-statistics. A related approach is that of Wager and Athey (2018), which is built on the ANOVA framework (Efron and Stein, 1981).
C.2.1 Lemma 3 and its proof
Lemma 3 gives us the order of the variance of the first order Hájek projection. This interpretation will soon explain itself in the formal proof of Theorem 2. As always, we introduce the new notation first. Given $x$, the projection of $\Phi(x; Z_1, Z_2, \ldots, Z_s)$ onto $Z_1$ is denoted as $\Phi_1(x; z_1)$,

$$\Phi_1(x; z_1) = E[\Phi(x; Z_1, Z_2, \ldots, Z_s) \mid Z_1 = z_1] = E[\Phi(x; z_1, Z_2, \ldots, Z_s)]. \qquad (C.4)$$

In this section, let $E_1$ ($\mathrm{var}_1$) and $E_2$ ($\mathrm{var}_2$) denote expectation (variance) with respect to $Z_1$ and $\{Z_2, Z_3, \ldots, Z_s\}$, respectively, and $\widetilde{X}_{(1)}$ the closest $X$ to $x$ among $\{X_2, X_3, \ldots, X_s\}$.

Lemma 3. Given $x$, under Conditions 2 and 3, for the variance of $\Phi_1(x; z_1)$, denoted as $\eta_1$, when $s \to \infty$, we have

$$\eta_1 = \mathrm{var}_1 \Phi_1(x; z_1) = \frac{\sigma^2}{2s - 1} + o(s^{-2}). \qquad (C.5)$$
Proof of Lemma 3: The strategy is first to decompose $\Phi_1(x; z_1)$ into several terms. We then analyze the terms one by one and try to understand them intuitively; their properties will be carefully studied. Finally, we use these results to derive the order of the variance of $\Phi_1(x; z_1)$. Observe that

$$\Phi_1(x; z_1) = E_2\, \Phi(x; z_1, Z_2, \ldots, Z_s)$$
$$= E_2\big[\Phi(x; z_1, Z_2, \ldots, Z_s)\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|)\big] + E_2\big[\Phi(x; z_1, Z_2, \ldots, Z_s)\, 1(\|x_1 - x\| > \|\widetilde{X}_{(1)} - x\|)\big]$$
$$= E_2\, y_1\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|) + E_2\big[\Phi(x; Z_2, \ldots, Z_s)\, 1(\|x_1 - x\| > \|\widetilde{X}_{(1)} - x\|)\big]$$
$$= y_1\, E_2\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|) + E_2\, \Phi(x; Z_2, \ldots, Z_s) - E_2\big[\Phi(x; Z_2, \ldots, Z_s)\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|)\big]$$
$$= E_2\, \Phi(x; Z_2, \ldots, Z_s) + AB,$$

where

$$A = y_1 - E_2\big[\Phi(x; Z_2, \ldots, Z_s) \mid \|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|\big], \qquad B = E_2\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|).$$
Despite its complicated appearance, the representation above is actually very intuitive. It says that the marginal contribution of knowing the first observation $z_1$ to the value of the response of the 1-nearest neighbor, $\Phi_1(x; z_1) - E_2\Phi(x; Z_2, \ldots, Z_s)$, equals the product of $A$, the marginal contribution from observation $y_1$ when $x_1$ is actually closer than the rest, and $B$, the probability of that scenario.
For $A$ and $B$, there are some useful facts.

1.) $E_1 AB = E\Phi(x; Z_1, Z_2, \ldots, Z_s) - E_2\Phi(x; Z_2, \ldots, Z_s) = B(s) - B(s-1) = O(s^{-2/d}) - O((s-1)^{-2/d}) = o(s^{-1})$.

2.) $A$ and $B$ are asymptotically independent when $s \to \infty$ since $B$ degenerates.

3.) $E_1 B = E_1 E_2\, 1(\|x_1 - x\| \le \|\widetilde{X}_{(1)} - x\|) = \frac{1}{s}$, by symmetry.

4.) $E_1 B = E_1\big[1 - \mu(B(x, \|x_1 - x\|))\big]^{s-1}$.

5.) $E_1 B^2 = E_1\big[1 - \mu(B(x, \|x_1 - x\|))\big]^{2s-2} = \frac{1}{2s-1}$.
Combining the results together, if we denote $\sigma^2 = E_1 A^2 < \infty$, then as $s \to \infty$,

$$\eta_1 = \mathrm{var}_1 \Phi_1(x; z_1) = \mathrm{var}_1(AB) = E_1 A^2 B^2 - (E_1 AB)^2 = \frac{\sigma^2}{2s - 1} + o(s^{-2}).$$

We emphasize that $\sigma^2$ is different from $\sigma_\varepsilon^2$ in general, as mentioned earlier in the proof of Theorem 1. Besides the noise variance $\sigma_\varepsilon^2$, $\sigma^2$ also contains the variation arising from landing at different points on the sphere with the same radius. Thus, $\sigma^2$ also depends on the target point $x$ and the density function $f$. We have also restricted the response $y$ to be bounded, thus $\sigma^2 = O(1)$. This completes the proof of Lemma 3.
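Facts 3)–5) admit a quick simulation check: by the probability integral transform, $U = \mu(B(x, \|X_1 - x\|))$ is uniform on $(0,1)$, so $B = (1-U)^{s-1}$ with $E B = 1/s$ and $E B^2 = 1/(2s-1)$. The snippet below is a minimal sanity check of these moments.

```python
# A simulation check of facts 3)-5): with U = mu(B(x, ||X_1 - x||))
# Uniform(0, 1), B = (1 - U)**(s - 1), E[B] = 1/s, E[B^2] = 1/(2s - 1).
import numpy as np

s = 50
rng = np.random.default_rng(2)
U = rng.random(1_000_000)
B = (1 - U) ** (s - 1)
print(B.mean(), 1 / s)                 # both about 0.020
print((B**2).mean(), 1 / (2 * s - 1))  # both about 0.0101
```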
C.2.2 Proof of Theorem 2
In this section, we omit $x$ for simplicity whenever there is no confusion. We will first introduce Hoeffding's canonical decomposition (Hoeffding, 1948), which is an extension of the projection idea. We will then see that the Hájek projection can be viewed as the first order part of the decomposition, and since the Hájek projection is a sum of i.i.d. terms, it can be asymptotically normal. Finally, we explicitly compare the orders of the different terms and give a sufficient condition for our DNN estimator to achieve asymptotic normality.
We first present Hoeffding's canonical decomposition (Hoeffding, 1948). To ease notation, we use $Z_i$ as a shorthand for $(X_i, Y_i)$. The following definitions are natural extensions of the $\Phi_1$ projection in Lemma 3. Define

$$\Phi_1(z_1) = E\Phi(z_1, Z_2, \ldots, Z_s), \quad \Phi_2(z_1, z_2) = E\Phi(z_1, z_2, Z_3, \ldots, Z_s), \quad \ldots, \quad \Phi_s(z_1, z_2, z_3, \ldots, z_s) = E\Phi(z_1, z_2, z_3, \ldots, z_s),$$

together with the centered versions

$$\widetilde{\Phi}_1(z_1) = \Phi_1(z_1) - E\Phi, \quad \widetilde{\Phi}_2(z_1, z_2) = \Phi_2(z_1, z_2) - E\Phi, \quad \ldots, \quad \widetilde{\Phi}_s(z_1, z_2, z_3, \ldots, z_s) = \Phi_s(z_1, z_2, z_3, \ldots, z_s) - E\Phi.$$
The canonical terms are defined as

$$g_1(z_1) = \widetilde{\Phi}_1(z_1),$$
$$g_2(z_1, z_2) = \widetilde{\Phi}_2(z_1, z_2) - g_1(z_1) - g_1(z_2),$$
$$g_3(z_1, z_2, z_3) = \widetilde{\Phi}_3(z_1, z_2, z_3) - \sum_{i=1}^3 g_1(z_i) - \sum_{1 \le i < j \le 3} g_2(z_i, z_j),$$
$$\vdots$$
$$g_s(z_1, z_2, \ldots, z_s) = \widetilde{\Phi}_s(z_1, z_2, z_3, \ldots, z_s) - \sum_{i=1}^s g_1(z_i) - \sum_{1 \le i < j \le s} g_2(z_i, z_j) - \cdots - \sum_{1 \le i_1 < i_2 < \cdots < i_{s-1} \le s} g_{s-1}(z_{i_1}, z_{i_2}, \ldots, z_{i_{s-1}}).$$
So the kernel $\Phi$ can then be written as the sum of canonical terms,

$$\Phi(z_1, z_2, \ldots, z_s) - E\Phi = \sum_{i=1}^s g_1(z_i) + \sum_{1 \le i < j \le s} g_2(z_i, z_j) + \cdots + g_s(z_1, z_2, \ldots, z_s). \qquad (C.6)$$
Another perspective on the above equation is the Efron–Stein ANOVA decomposition (Efron and Stein, 1981), in which a symmetric kernel of $s$ arguments is decomposed into $2^s - 1$ random variables that all have zero mean and are mutually uncorrelated. From either perspective, we have

$$\mathrm{var}\, \Phi(Z_1, \ldots, Z_s) = \binom{s}{1} E g_1^2 + \binom{s}{2} E g_2^2 + \cdots + \binom{s}{s} E g_s^2.$$
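As a numeric illustration of this variance formula, the sketch below takes the symmetric kernel $h(z_1, z_2) = \max(z_1, z_2)$ with $s = 2$ and uniform inputs, for which $\Phi_1(z) = E[h(z, Z_2)] = (1 + z^2)/2$ in closed form, and checks that $\mathrm{var}(h)$ matches $\binom{2}{1} E g_1^2 + \binom{2}{2} E g_2^2$.

```python
# A numeric check of the variance formula for s = 2, using the symmetric
# kernel h(z1, z2) = max(z1, z2) with z1, z2 ~ Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(3)
z1, z2 = rng.random(10**6), rng.random(10**6)
h = np.maximum(z1, z2)
Eh = 2 / 3  # E[max of two uniforms]
g1_a, g1_b = (1 + z1**2) / 2 - Eh, (1 + z2**2) / 2 - Eh  # g1 at z1, z2
g2 = h - Eh - g1_a - g1_b  # canonical second-order term
lhs = h.var()
rhs = 2 * (g1_a**2).mean() + (g2**2).mean()  # C(2,1) Eg1^2 + C(2,2) Eg2^2
print(lhs, rhs)  # both close to 1/18
```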
We now use Equation (C.6) to decompose $D_n(s)$,

$$D_n(s) - E D_n(s) = \binom{n}{s}^{-1} \sum_{1 \le i_1 < \cdots < i_s \le n} \big[\Phi(x; Z_{i_1}, \ldots, Z_{i_s}) - E\Phi\big].$$