Statistical Learning in High Dimensions: Interpretability, Inference
and Applications
by
Simeng Shao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)
August 2022
Copyright 2022 Simeng Shao
I dedicate this thesis to my grandparents,
Junshu and Shoujuan Li,
for their eternal love.
Acknowledgements
First and foremost, I would like to express my sincerest gratitude to my PhD advisors, Dr. Adel
Javanmard and Dr. Jacob Bien, who have shown endless patience, guidance, and support during
my academic career. Adel has always impressed me as one of the most intelligent people I have
known in my life. His domain expertise and research intuition have been a compass in my academic
voyage and have navigated me through my research projects. Jacob not only helped me through
research; his curiosity, generosity, common sense, empathy, rigor, and passion for research have
influenced me in all aspects of life. I will always remember his excitement when we had
breakthroughs on our projects, which never failed to immediately lift me up from rock bottom;
the long paragraphs of advice that he gave me as a “newbie” researcher are worth reading
again and again.
I would also like to thank Dr. Timothy Armstrong for being on my defense committee and Dr.
Yingying Fan, Dr. Wenguang Sun, and Dr. Meissam Razaviyayn for being on my qualifying committee;
they did not hesitate to share extremely valuable advice. I thank Dr. Vishal Gupta and Dr.
Grourab Mukherjee, our PhD coordinators, for all their help and support, and Dr. Hamid
Nazerzadeh, who was one of my co-authors on my first project. I also thank all of the
faculty and staff members in the Data Sciences and Operations (DSO) department for making the
department a lovely place to learn and do research, and for all their help throughout this journey.
I would also like to thank everyone in my cohort, Michael Huang, Wilson Lin, Brad Rava,
and Patrick Vossler, for being my best friends and for many interesting academic discussions. I
am thankful to former and fellow PhD students Luella Fu, Shobhit Jain, Joshua Derenski,
Junxiong Yin, Yiqiu Shen, Aikaterini Giannoutsou, Justin Mulvany, Mohammad Mehrabi, and
Greg Faletto. I am thankful to my dear friends Justin and Claire for candid conversations and
for making the ride much more enjoyable.
Last but by no means least, I would like to thank my parents for always spoiling me and
being my strongest support. I feel extremely fortunate to have had their unwavering belief and
encouragement in achieving everything over my past 26 years. My mom, Fang Li, who is an independent
and career-driven woman and never stops learning, has always inspired me to thrive and aim higher,
and my dad, Zhengbo Shao, has been the biggest cheerleader for both of us (and is also independent
and career-driven). I am grateful to my loving husband, Michael, for being with me through thick
and thin. He was not only a partner-in-crime and the joy of my life, but also a role model for me.
It is safe to say he makes me a better version of who I am. I also thank my in-laws for supporting
me and being my family in this foreign country. Lastly, I want to give a brief shout-out to my cat
Opal for being a great emotional support animal, for sometimes not refusing my cuddles, and for
being the funniest cat in the world.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xiii
Chapter 1: Introduction 1
1.1 Multi-Product Dynamic Pricing in High-dimensions . . . . . . . . . . . . . . . . . 1
1.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Related Literature and Our Contributions . . . . . . . . . . . . . . . . . . 2
1.2 Controlling the False Split Rate in Tree-Based Aggregation . . . . . . . . . . . . . 3
1.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Related Literature and Our Contribution . . . . . . . . . . . . . . . . . . . 4
1.3 Prediction Sets for High-Dimensional Mixture of Experts Models . . . . . . . . . 5
1.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Related Literature and Our Contribution . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Multi-Product Dynamic Pricing in High-dimensions 7
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Benchmark policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Multi-Product Pricing Policy (M3P) . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Regret Analysis for M3P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Proof of Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.1 Proof of Theorem 2.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.2 Proof of Theorem 2.5.5 . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Proof of Proposition 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Proof of Lemma 2.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.9 Proof of Proposition 2.5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.10 Proof of Lemma 2.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Proof of Proposition 2.6.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.11.1 Proof of Lemma 2.6.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.11.2 Proof of Lemma 2.11.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.11.3 Proof of Lemma 2.11.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 3: Controlling the False Split Rate in Tree-Based Aggregation 38
3.1 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.1 A multiple hypothesis testing formulation for aggregation . . . . . . . . . 41
3.1.2 FSR on a tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Hierarchical aggregation testing with FSR control . . . . . . . . . . . . . . . . . . 46
3.2.1 Testing with independent p-values . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2 Testing with arbitrarily dependent p-values . . . . . . . . . . . . . . . . . 51
3.3 Two statistical applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1 Testing equality of means . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1.1 Simes’ p-values . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.2 Testing equality of regression coefficients . . . . . . . . . . . . . . . . . . 55
3.3.2.1 Constructing p-values for the null hypotheses . . . . . . . . . . 56
3.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Testing on a binary tree with idealized p-values . . . . . . . . . . . . . . . 61
3.4.2 Testing on a non-binary tree with idealized p-values . . . . . . . . . . . . 64
3.4.3 Two statistical applications . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3.1 Testing equality of means . . . . . . . . . . . . . . . . . . . . . 65
3.4.3.2 Testing equality of regression coefficients . . . . . . . . . . . . 66
3.5 Data examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.1 Application to stocks data . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.2 Application to NYC taxi data . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.2.1 Aggregation results . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.2.2 Comparing prediction performance . . . . . . . . . . . . . . . . 79
3.5.2.3 FSR with synthetic data . . . . . . . . . . . . . . . . . . . . . . 80
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Proof of main theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7.1 Proof of Theorem 3.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7.2 Proof of Theorem 3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.7.3 Proof of Theorem 3.2.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.7.3.1 Proof of Theorem 3.2.4 . . . . . . . . . . . . . . . . . . . . . . 89
3.8 Proof of technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.8.1 Proof of Lemma 3.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.8.2 Proof of Lemma 3.1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.8.3 Proof of Lemma 3.1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.8.4 Proof of Lemma 3.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.8.5 Proof of Proposition 3.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.9 Proof of Proposition 3.7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.9.1 Proof of Lemma 3.9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.10 Some useful lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Chapter 4: Prediction Sets for High-Dimensional Mixture of Experts Models 108
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.1.1 Comparison with Conformal Inference Literature . . . . . . . . . . . . . . 110
4.1.2 Illustration of Results with a Toy Example . . . . . . . . . . . . . . . . . . 111
4.1.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2.1 Prelude: EM-based Estimator . . . . . . . . . . . . . . . . . . . . . . . . 114
4.2.2 Debiased Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.2.3 Prediction Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.3 Numerical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.3.1 Example 1: Low Dimensional Case with Unbalanced Components . . . . . 121
4.3.2 Example 2: Low Dimensional Case with Unequal Variance . . . . . . . . . 123
4.3.3 Example 3: High Dimensional Case . . . . . . . . . . . . . . . . . . . . . 127
4.4 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
References 134
List of Tables
3.1 The table shows 40 clusters that are aggregated by our procedure. The rightmost
column is the mean log-volatility. The column Company/Number of Companies
shows how many companies are in each cluster (or the name of the company when
there is only one). The columns from Sector to Industry National show the classi-
fication of the clusters that are selected. . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Achieved number of groups with decaying sample size. . . . . . . . . . . . . . . . 78
3.3 Prediction performance of the 6 methods with the test data set. . . . . . . . . . . . 80
3.4 Achieved FSR and average power by our algorithm with synthetic data, where the noise
level is σ = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1 Out-of-sample MSE calculated on 1000 validation data points for fitted models with
K = 1, 2, 3, 4, 5. When K = 2, the MSE is lowest. . . . . . . . . . . . . . . 131
4.2 Coverage rates of 95% prediction sets conditional on the predictor “weighted standard
deviation of thermal conductivity”. The subgroups are defined by which of the equal-length
sub-intervals within the range of this predictor the test data fall into. The conditional
coverage rates are satisfied. . . . . . . . . . . . . . . . . . . . . . . . . . 132
List of Figures
3.1 An example of a leaf partition. There are p = 12 leaves in total, hence 11 potential
barriers. The solid barriers indicate the true splitting of leaves, while the dashed
barriers result in the achieved splitting of leaves. In terms of the vector of barriers,
FDP = 1/3 and power = 2/3. In terms of the splitting of leaves,
FSP = (5−4)/(4−1) = 1/3 and power = 1 − (5−4)/(4−1) = 2/3. . . . . . . . . . . . 42
3.2 An example of T with 11 leaves and 19 nodes in total (including leaf nodes).
The dashed boxes show the true aggregation of the leaves, C*, into K = 5 groups,
with B* = {d1, d2, c2, c3, b2}. The thicker edges and the nodes they connect form
T_rej, with marks distinguishing the true rejections from the false rejections F. The
rejections correspond to an estimated aggregation with M = 7 groups: {d1, d2},
{d3, d4}, {d5, d6}, {d7}, {d8}, {d9}, {d10, d11}. On the right branch of the tree,
two false rejections lead to a nonzero false split rate, FSP = (8−5)/(7−1) = 3/6. We have
V = (3−0) + (2−1) − 1 = 3 and R = (2−2) + (3−0) + (2−1) + (3−0) − 1 = 6.
Hence V/R = FSP. On the left branch of the tree, there is one missing rejection (c1)
that leads to a true positive proportion of TPP = 1 − (8−7)/(5−1) = 3/4. . . . . . . . 45
3.3 Plots of achieved FSR and average power by our algorithm (HAT) and Lynch and
Guo’s algorithm (LG), on a binary tree with p = 1000 leaves. The number of true
groups K varies from 100 to 500. The p-values are simulated to be independent.
The target FSR levels are α = 0.1, 0.2, 0.3. . . . . . . . . . . . . . . . . . . 63
3.4 Plots of achieved FSR and average power by our algorithm (HAT) and Lynch and
Guo’s algorithm (LG) on a binary tree. The number of leaves p is fixed at 1000
and the number of true groups K is fixed at 500. The p-values are simulated to be
independent. The target FSR level α varies from 0.01 to 0.5. . . . . . . . . . . . 64
3.5 Plot of achieved FSR by our algorithm (HAT) and Lynch and Guo’s algorithm (LG)
on a non-binary tree. The number of true groups K is fixed at 5 and the number
of leaves p ranges from 14 to 41. The p-values are simulated to be independent.
The target FSR level α is set at 0.1, 0.2, and 0.3, respectively. While HAT controls FSR
under the target levels, LG fails to do so. . . . . . . . . . . . . . . . . . . . 65
3.6 Plots of achieved FSR and mean power with varying number of true groups K with
ANOVA p-values on a 3-regular tree (p = 243, σ = 0.3). The hierarchical testing
is done with our algorithm (HAT) directly, as well as with Simes’ procedure and the
reshaped threshold function (3.18) (HAT-Simes-reshaped). The target FSR levels α
are fixed at 0.1, 0.2, and 0.3, respectively. . . . . . . . . . . . . . . . . . . 67
3.7 Plots of empirical CDFs of three nodes under the setting n = 100, p = 243, β = 0.6,
K = 30, ρ = 0.2, σ = 0.6. Node #110 is a non-null node, node #86 is a null node
in B*, and node #13 is a null node that is a child of #86. . . . . . . . . . . . . 68
3.8 Plots of achieved FSR and average power, tested with the hierarchical testing procedure
with our threshold function (3.15) (HAT) and the reshaped threshold function
(3.18) (HAT-reshaped) on a 3-regular tree (n = 100, p = 243, β = 0.6, ρ = 0.2,
σ = 0.6). The p-values are generated by the debiasing procedure. The number of
true groups K varies from 21 to 93. The target FSR levels α are 0.1, 0.2, and 0.3. . . 69
3.9 Plots of the achieved FSR and the average power for our threshold function (3.15)
(HAT) and the reshaped threshold function (3.18) (HAT-reshaped) on a 3-regular
tree (n = 100, p = 243, β = 0.6, ρ = 0.2, σ = 0.6). The p-values are generated by
the debiasing procedure. The number of true groups K is fixed at three values: 21,
57, and 81. The target FSR level α varies from 0.01 to 0.5. . . . . . . . . . . . 70
3.10 The subsector “Credit Intermediation and Related Activities” consists of 347 companies,
whose log-volatilities are represented by the points. The subsector consists
of 3 industry groups and 8 industries. Applying HAT rejects the null hypothesis
that the 3 industry groups have the same mean stock log-volatility, but does not reject
the null hypotheses that, within each industry group, different industries have the
same mean stock log-volatility. . . . . . . . . . . . . . . . . . . . . . . . . 75
3.11 Map of neighborhoods colored by the percentage of drivers who have started a trip
from there. Most neighborhoods had fewer than 10% of the drivers starting their
trips there in the month of December 2013. . . . . . . . . . . . . . . . . . . . 76
3.12 Map colored with log-transformed least-squares coefficients fitted by regressing fare
on features from HAT’s aggregation of neighborhoods of New York City. There
are 44 aggregated clusters out of the 194 neighborhoods. Darker colors correspond
to higher fitted coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 The plot on the top left shows how the true probabilities of the two clusters, p1(t)
and p2(t), vary with time t (dotted), as well as the estimated probabilities p̂1(t) and
p̂2(t) (solid). The plot on the bottom left contains observed outcomes (black dots),
predicted outcome curves (solid), and true curves (dotted). The plot in the middle
shows the prediction set at the 95% confidence level as a shaded area. The plots on
the top right and bottom right contain the averaged coverage rates and averaged
lengths of prediction sets, respectively. The averages are taken over 500 runs; for
each run, 1000 outcomes are generated from the model and the percentage of
them falling into the confidence set is calculated. . . . . . . . . . . . . . . . . 113
4.2 Performance of estimation and prediction sets for the setting of Example 1, where
K = 2, T = 100, the distribution of observations from the two components is
unbalanced, and the variances in the two linear models are the same. The plot on the
top left shows how the true probabilities of clusters 1 and 2, p1(t) and p2(t), vary
with time t (dotted), as well as the estimated probabilities p̂1(t) and p̂2(t) (solid).
The plot on the bottom left contains observed outcomes (black dots), predicted
outcome centers (solid), and true centers (dotted) of the two clusters. The plot in
the middle shows, as a shaded area, the prediction set at the 95% confidence level.
The plots on the top right and bottom right contain the averaged coverage rates and
averaged lengths of prediction sets, respectively. The averages are taken over 500
runs; for each run, 1000 outcomes are generated from the model and the percentage of
them falling into the confidence region is calculated. . . . . . . . . . . . . . . 124
4.3 Performance of estimation and prediction sets for the setting of Example 2, where
K = 2, T = 100, the distribution of observations from the two components is
balanced, and the variances in the two linear models are 0.1 and 0.2, respectively. The
plot on the top left shows how the true probabilities of clusters 1 and 2, p1(t) and
p2(t), vary with time t (dotted), as well as the estimated probabilities p̂1(t) and
p̂2(t) (solid). The plot on the bottom left contains observed outcomes (black dots),
the predicted outcome curve (solid), and the true curve (dotted). The plot in the middle
shows the predicted outcome (solid) and the true outcome (dotted); the shaded bands
are the prediction set at the 95% confidence level. The plots on the top right and
bottom right contain the averaged coverage rates and averaged lengths of the prediction
region, respectively. The averages are taken over 500 runs; for each run, 1000 outcomes
are generated from the model and the percentage of them falling into the confidence
region is calculated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.4 Scatter plot of the coverage rate of the 95% prediction region against the true
p1(x_new) (the probability that the data point is drawn from cluster 1). The coverage
rate of our prediction sets is guaranteed at the 95% pre-specified level regardless of
p1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.5 Scatter plot of the average lengths of the 95% prediction sets against the true
p1(x_new) (the probability that the data point is drawn from cluster 1). The lengths
are smaller when p1(x_new) is closer to 0 or 1, and larger when p1(x_new)
approaches 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.6 Scatter plot of the average lengths of the 95% prediction region against the distance
between the two centers. The average lengths increase with distance initially, then
fluctuate between 6.5 and 11. . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.7 Plot of prediction sets for 100 randomly sampled data points in the test set. The
black points are the log(1+temperature) and the bars correspond to the prediction
sets for each observation. We use red when there is only one interval and cyan
when there are two intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Abstract
The past decades have witnessed revolutionary changes in the measurement, collection, and storage
of data. High-dimensional data are ubiquitous in many domains of study, bringing tremendous
opportunities and challenges. In this thesis, we discuss a few statistical approaches that address
the curse of high dimensionality. Our proposed work builds upon the high-dimensional literature
with novel applications, improvements to interpretability, and the development of statistical
inference tools.
We first consider leveraging high-dimensional contexts for the problem of multi-product dynamic
pricing. In this environment, customers arrive over time and products are described by
high-dimensional feature vectors. Each customer chooses a product according to the widely used
Multinomial Logit (MNL) choice model, and her utility depends on the product features as well as
the prices offered. Our model allows for heterogeneous price sensitivities across products. The
seller does not know the parameters of the choice model a priori but can learn them through
interactions with the customers. The seller’s goal is to design a pricing policy that maximizes her
cumulative revenue. This model is motivated by online marketplaces such as the Airbnb platform
and online advertising. We measure the performance of a pricing policy in terms of regret, which
is the expected revenue loss with respect to a clairvoyant policy that knows the parameters of the
choice model in advance and always sets the revenue-maximizing prices. We propose a pricing
policy, named M3P, that achieves a T-period regret of O(log(Td)(√T + d log(T))) under
heterogeneous price sensitivity for products with features of dimension d. We also prove that no
policy can achieve worst-case T-regret better than Ω(√T).
Next, we discuss an approach that helps reduce dimensionality and increase model inter-
pretability. We observe that in many domains, data measurements can naturally be associated with
the leaves of a tree, expressing the relationships among these measurements. The problem of tree-
based aggregation asks which of these tree-defined subgroups of leaves should really be treated as
a single entity and which of these entities should be distinguished from each other. We introduce
the false split rate, an error measure that describes the degree to which subgroups have been split
when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-
based aggregation, which we prove controls this error measure. We focus on two main examples
of tree-based aggregation, one which involves aggregating means and the other which involves
aggregating regression coefficients. We apply this methodology to aggregate stocks based on their
volatility and to aggregate neighborhoods of New York City based on taxi fares.
Finally, we address the issue of heterogeneity in data that often accompanies high dimensionality.
The Mixture of Experts (MoE) model has become a popular tool for analyzing heterogeneous,
high-dimensional data, because it allows not only the mixture centers but also the mixture
weights to depend on features, which gives rise to more flexibility in the data-generating process.
Although powerful tools, such as the Expectation-Maximization (EM) algorithm, have been
developed to estimate such models and make predictions, there has been little work on
predictive inference for high-dimensional MoE models. Motivated by this, we propose a debiased
predictor for a new observation drawn from an MoE model, and we construct a prediction set that
takes the form of a union of intervals. We apply our proposed method to the analysis of a dataset
of superconducting materials and investigate the relationship between the critical temperature of
materials and other elementary and electronic attributes.
Chapter 1
Introduction
This chapter provides an introduction to the approaches we propose to address issues of high
dimensionality arising in the following aspects: learning, interpretation, and statistical
inference. Section 1.1 presents an application that learns and optimizes with high-dimensional
features in a multi-product dynamic pricing setting.¹ Section 1.2 presents an approach that
reduces dimensionality and improves interpretability through a tree-based aggregation procedure.²
Section 1.3 presents an approach that enables predictive inference for a high-dimensional Mixture
of Experts (MoE) model.
1.1 Multi-Product Dynamic Pricing in High-dimensions
1.1.1 Motivation
The work presented in Chapter 2 is motivated by online marketplaces that offer many products,
each with a large number of features. The goal is to help online marketplace owners maximize
their revenue by setting “optimal” prices for all products.
¹ Published in the 2020 IEEE International Symposium on Information Theory (ISIT). http://doi.org/10.1109/ISIT44484.2020.9174296
² Under revision. https://doi.org/10.48550/arXiv.2108.05350
We use the example of the Airbnb platform to illustrate the scenario: in booking a “stay”
(product), the user first narrows down her choice to a so-called consideration set by selecting
the location and dates. The products here are highly differentiated. Each product has a
high-dimensional feature vector that encodes its properties, such as space, amenities, walking
score, house rules, reviews from previous tenants, etc. The platform then sets the prices for the
products in the consideration set by learning and leveraging the feature vectors.
The seller has to balance the following trade-off: A high price may drive the user away, hurting
the revenue. On the other hand, a low price encourages the user to purchase the product but results
in a smaller revenue from each sale. Therefore, for the seller to maximize its revenue, it must try to
learn the users’ purchase behavior. Through interactions with users and observing their purchasing
decisions, the seller can learn how users weigh different features in their purchasing decisions.
1.1.2 Related Literature and Our Contributions
Dynamic pricing has been widely studied [28, 4]. Some previous work considers
parametric models for the demand function with a small number of parameters [8, 13, 29, 27, 20,
33, 53]. The work [8] studies the dynamic pricing problem through both parametric and
non-parametric lenses.

There has also been interest in dynamic pricing in contextual settings. The works [1, 22, 59,
50, 6] consider a single-product setting, where the seller receives a single product to sell at
each step, and assume equal price sensitivities β = 1 for all products. The work [50] studies
dynamic pricing in a high-dimensional contextual setting with a sparsity structure in a
single-product scenario. The works [66, 70] study high-dimensional multi-product pricing with a
low-dimensional linear model for the aggregate demand. However, they assume the demand vector for
all the products is observed at each step.
We study a setting where the utility of buying a product is a function of its features and price.
Let u(θ0, γ0; x̃, p) be the utility obtained from buying a product with feature vector x̃ at price
p, where the parameter vectors θ0 and γ0, in R^d, represent the users’ purchase behavior. We
employ the Multinomial Logit (MNL) choice model [63] to model the probability that the customer
chooses each product based on its utility. We propose a dynamic pricing policy for
high-dimensional environments, the Multi-Product Pricing Policy (M3P), which uses the maximum
likelihood method to estimate the true parameters of the utility model based on the previous
purchasing behavior of the users. We measure the performance of a pricing policy in terms of
regret, which is the difference between the expected revenue obtained by the pricing policy and
the revenue gained by a clairvoyant policy that has full knowledge of the parameters of the
utility model and always offers the revenue-maximizing prices. Our policy achieves a T-regret of
O(log(Td)(√T + d log(T))), where d and T respectively denote the feature dimension and the length
of the time horizon. Furthermore, we also prove that our policy is almost optimal, in the sense
that no policy can achieve worst-case T-regret better than Ω(√T).
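To make the choice model concrete, here is a minimal sketch of MNL purchase probabilities under an assumed linear utility u_i = θ·x̃_i − (γ·x̃_i)·p_i, so that each product's price sensitivity γ·x̃_i depends on its own features, with the no-purchase option assigned utility zero. The function name and the exact utility specification are illustrative only; the model and the M3P policy are developed precisely in Chapter 2.

```python
import numpy as np

def mnl_choice_probs(X, prices, theta, gamma):
    """MNL choice probabilities over n products plus a no-purchase option.

    Illustrative utility: u_i = theta @ x_i - (gamma @ x_i) * p_i, so the
    price sensitivity gamma @ x_i varies across products (heterogeneous).
    The outside (no-purchase) option has utility 0.
    """
    sensitivities = X @ gamma                  # per-product price sensitivities
    utilities = X @ theta - sensitivities * prices
    m = max(0.0, utilities.max())              # shift for numerical stability
    expu = np.exp(utilities - m)
    denom = np.exp(-m) + expu.sum()            # exp(-m) is the shifted outside option
    return expu / denom, np.exp(-m) / denom    # (product probs, no-purchase prob)

rng = np.random.default_rng(0)
d, n = 5, 3                                    # feature dimension, products on offer
X = rng.normal(size=(n, d))                    # product feature vectors (rows)
theta = rng.normal(size=d)
gamma = np.abs(rng.normal(size=d))             # keep price sensitivities positive
prices = np.array([1.0, 2.0, 3.0])
probs, p0 = mnl_choice_probs(X, prices, theta, gamma)
# probs and p0 sum to 1; expected revenue at these prices is (prices * probs).sum()
```

A clairvoyant seller would choose prices to maximize the expected revenue (prices * probs).sum() under the true parameters; M3P instead estimates (θ0, γ0) by maximum likelihood from the observed purchase decisions and prices accordingly.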
1.2 Controlling the False Split Rate in Tree-Based Aggregation
1.2.1 Motivation
With the development of increasingly advanced data measurement and collection tools, we have
access to more high-quality, finely measured, multi-aspect data. However, this also brings the
challenge of balancing complex models that capture the subtleties of a phenomenon against simpler
models that are more interpretable and more efficient to learn. We seek the right balance between
the two and leverage data structures that help achieve this goal. For example, in many datasets
the relationships among data measurements have a tree structure. Examples include, but are not
limited to, the North American Industry Classification System (NAICS) [82] for organizing business
categories, ecological taxonomic trees for categorizing species, and universal product codes (UPC)
for arranging retail stock products.
Take the North American Industry Classification System (NAICS) as an example. This industry
classification system provides a hierarchical structure that first divides business establishments
into 20 sectors, then drills down into subsectors, industry groups, industries, and national
industries. Intuitively, businesses that belong to the same low-level branch have more in common,
making it sensible for practitioners to prioritize aggregating them when a simpler model is needed.
In Chapter 3, we use the term tree-based aggregation to refer to the process of deciding which
branches’ leaves are similar (i.e., should be aggregated) and which are different from each other
(i.e., should be split apart).
1.2.2 Related Literature and Our Contribution
Tree-based aggregation procedures have been proposed in various contexts, including regression
problems, in which features represent counts of rare events [90] or counts of microbial species
[10], and in graphical modeling [88]. These approaches focus on prediction and estimation but do
not address the hypothesis testing question of whether a particular split should occur.
We treat the general tree-based aggregation problem as a multiple testing problem involving a
parameter vectorq
whose elements correspond to the leaves of a known tree. By merging leaves
under nodes in the tree, we force the parameters in each aggregation group to share the same value.
We address the aggregation quality by defining an error measure, false split rate (FSR), which
corresponds to the fraction of splits made that are “incorrect”. We highlight a fundamental con-
nection between FSR and false discovery rate (FDR) [7] mathematically. (Indeed, in the particular
case of a binary tree, we prove that FSR and FDR are equivalent.) We propose a tree-based aggregation procedure that proceeds top-down and show that this procedure can control the FSR at any pre-specified level. We extend our tree-based aggregation algorithm to two concrete applications,
one where $\theta$ is the mean of a scalar signal and one where $\theta$ is a vector of regression coefficients.
The second application is especially significant because it helps address the issue of rare features
in high-dimensional settings. Finally, we apply our algorithms to simulation studies and real data
studies. We demonstrate aggregation of stocks based on their log-volatility and aggregation of
neighborhoods of New York City based on a regression model for predicting taxi drivers’ profits.
1.3 Prediction Sets for High-Dimensional Mixture of Experts
Models
1.3.1 Motivation
In Chapter 4, we focus on predictive inference in high-dimensional settings. We primarily focus
on developing prediction inference for the popular Mixture of Experts (MoE) model which is com-
monly utilized to analyze heterogenous data with high dimensionality. The Mixture of Experts
model, first proposed by [51], handles the heterogeneity issue by clustering similar data in order
to disentangle the multimodality of the data. This approach has been successful in modeling com-
plex data such as in [42], which models how changing environmental conditions impact different
groups of phytoplankton populations using ocean sample data. The chapter provides background
on the Mixture of Experts model, highlights how to apply debiasing techniques to address the
high-dimensional data, and proposes a novel method to construct prediction sets. We additionally provide extensive experiments to highlight how our approach fares in different settings and
consider a real data application using superconductivity data provided in [76].
1.3.2 Related Literature and Our Contribution
The Mixture of Experts (MoE) model is similar to the Mixture of Linear Regressions (MLR)
model. Unlike MoE, the MLR has been well studied with works studying the convergence of the
EM estimator ([93, 5, 54, 56, 55]) and statistical inference on the estimator ([98]). Additionally,
the MLR has also been studied in high-dimensional settings, with convergence and oracle results for the $\ell_1$-regularized maximum likelihood estimator ([75]), convergence results for the MLR in latent variable models ([92]), and convergence results and confidence intervals for the setting with two high-dimensional linear components ([97]).
Our work is also related to work in conformal inference for regression ([68], [87], [58], [57]).
These approaches are attractive as they can utilize residuals of hold-out data to quantify the uncertainty of predictions, and specialized methods, such as Conformalized Quantile Regression, produce locally adaptive intervals that provide marginal coverage ([72]). These methods assume the subgroup is observed and achieve an equal coverage rate for all subgroups. Our method makes no such assumption and is complementary to conformal inference.
Chapter 2
Multi-Product Dynamic Pricing in High-dimensions
Online marketplaces offer a very large number of products described by a large number of features. This contextual information creates differentiation among products and also affects the
willingness-to-pay of buyers. To provide more context, let us consider the Airbnb platform: the
products sold in this market are “stays.” In booking a stay, the user first selects the destination city,
dates of visit, type of place (entire place, 1 bedroom, shared room, etc) and hence narrows down
her choice to a so-called consideration set. The platform then sets the prices for the products in
the consideration set. Notably, the products here are highly differentiated. Each product can be described by a high-dimensional feature vector that encodes its properties, such as space, amenities,
walking score, house rules, reviews of previous tenants, etc. We study a model where the platform
aims to maximize its cumulative revenue.
In setting prices, there is a clear trade-off: a high price may drive the user away (decreasing the likelihood of a sale) and hence hurts the revenue. A low price, on the other hand, encourages the user to purchase the product; however, it results in smaller revenue from that sale. Therefore, in
order for the seller to maximize its revenue, it must try to learn the purchase behavior of the users.
¹ A. Javanmard, H. Nazerzadeh and S. Shao, "Multi-Product Dynamic Pricing in High-Dimensions with Heterogeneous Price Sensitivity," 2020 IEEE International Symposium on Information Theory (ISIT), 2020, pp. 2652-2657, doi: 10.1109/ISIT44484.2020.9174296.
Through interactions with users and observing their purchasing decisions, the seller can learn how
users weigh different features in their purchasing decisions.
In this work, we study a setting where the utility from buying a product is a function of the product features and its price. Let $u(\theta_0, \gamma_0; \tilde{x}, p)$ be the utility obtained from buying a product with feature vector $\tilde{x}$ at price $p$, where the parameter vectors $\theta_0$ and $\gamma_0$, in $\mathbb{R}^d$, represent the users' purchase behavior. Namely, $\theta_0$ captures the contribution of each feature to the user's valuation of the products and $\gamma_0$ captures the sensitivity of the utility to the price. We focus on the utility model

$u(\theta_0, \gamma_0; \tilde{x}, p) = \psi\big( \langle \phi(\tilde{x}), \theta_0 \rangle - \langle \phi(\tilde{x}), \gamma_0 \rangle p + z \big), \qquad (2.1)$

where $\phi : \mathbb{R}^{\tilde{d}} \mapsto \mathbb{R}^d$ is a feature mapping, $\psi : \mathbb{R} \mapsto \mathbb{R}$ is a general strictly increasing function, and $\langle a, b \rangle$ indicates the inner product of two vectors $a$ and $b$. The mapping $\phi$ and the function $\psi$ are fixed and known. The term $z$, a.k.a. market noise, captures the idiosyncratic change in the valuation of each user. Throughout the chapter, we use the notation $x \equiv \phi(\tilde{x})$ for the mapped features and define $\beta_x = \langle x, \gamma_0 \rangle$ as the price sensitivity parameter of a product with (mapped) feature $x$. Examples of such utility models include: (i) the log-log model ($\psi(y) = e^y$, $\phi(y) = \ln(y)$); (ii) the semi-log model ($\psi(y) = e^y$, $\phi(y) = y$); and (iii) the logistic model ($\psi(y) = e^y/(1+e^y)$, $\phi(y) = y$). We encode the "no-purchase" option as a new product with zero utility.
We emphasize that the parameters of the utility model, $\theta_0$ and $\gamma_0$, are a priori unknown to the seller.
In our model, given a consideration set, the customer chooses the product that results in the highest utility. We study the widely used Multinomial Logit (MNL) choice model [63], which corresponds to having the noise terms in Eq. (2.1) drawn independently from a standard Gumbel distribution, whose cdf is given by $G(y) = e^{-e^{-y}}$ [78]. The MNL model is arguably the most popular random utility model, with a long history in economics and marketing research for modeling choice behavior [63, 62, 36].
We propose a dynamic pricing policy, called M3P, for Multi-Product Pricing in high-dimensional environments. Our policy uses the maximum likelihood method to estimate the true parameters of the utility model based on the previous purchasing behavior of the users.

We measure the performance of a pricing policy in terms of regret, which is the difference between the expected revenue obtained by the pricing policy and the revenue gained by a clairvoyant policy that has full information about the parameters of the utility model and always offers the revenue-maximizing price. Our policy achieves a $T$-period regret of $O(\log(Td)(\sqrt{T} + d\log(T)))$, where $d$ and $T$ respectively denote the feature dimension and the length of the time horizon. Furthermore, we prove that our policy is almost optimal, in the sense that no policy can achieve a worst-case $T$-period regret better than $\Omega(\sqrt{T})$.
In the next section, we briefly review the work related to ours. We would like to highlight that our work is distinguished from the previous literature in two major aspects: (i) multi-product pricing, which should take into account the interaction of different products, since changing the price of one product may shift the demand for other products; this dependence makes the pricing problem even more complex; and (ii) heterogeneity and uncertainty in the price sensitivity parameters.
2.1 Related Work
There is a vast literature on dynamic pricing as one of the central problems in revenue management.
We refer the reader to [28, 4] for extensive surveys on this area. A popular theme in this area is
dynamic pricing with learning, where there is uncertainty about the demand function, but information about it can be obtained via interaction with customers. A line of work [2, 31, 39, 19, 32, 21] took a Bayesian approach to the learning part and studied this problem in a non-contextual setting.
Another related line of work assumes parametric models for the demand function with a small
number of parameters, and proposes policies to learn these parameters using statistical procedures
such as maximum likelihood [8, 13, 29, 27, 20] or least square estimation [8, 33, 53].
Recently, there has been interest in dynamic pricing in the contextual setting. The works [1, 22, 59, 50, 6] consider the single-product setting, where the seller receives a single product to sell at each step (corresponding to $N = 1$ in our setting), and assume equal price sensitivities $\beta = 1$ for all products. In [1], the authors consider a noiseless valuation model with a strategic buyer and propose a policy with $T$-period regret of order $O(T^{2/3})$. This setting has been extended to include market noise and also a market of strategic buyers who are utility maximizers [34]. In [22], the authors propose a pricing policy based on binary search in high dimension with adversarial features that achieves regret $O(d^2 \log(T/d))$. The work [50] studies dynamic pricing in the high-dimensional contextual setting with sparsity structure and proposes a policy with regret $O(s_0 \log(d)\log(T))$, but in a single-product scenario. The dynamic pricing problem has also been studied under time-varying coefficient valuation models [43] to address the time-varying purchase behavior of customers and the perishability of sales data. Very recently, [66] studied high-dimensional multi-product pricing, with a low-dimensional linear model for the aggregate demand. In this model, the demand vector for all the products at each step is observed, while in our work the seller only sees the index of the product chosen by the buyer at each step. Similarly, [70] studies a model where the seller can observe the aggregate demand and proposes a myopic policy based on least-squares estimation that obtains logarithmic regret.
2.2 Model
We consider a firm that sells a set of products to customers who arrive over time. The products are differentiated and each is described by a wide range of features. At each step $t$, the customer selects a consideration set $\mathcal{C}_t$ of size at most $N$ from the available products. This is the set the customer will actively consider in her purchase decision. The seller sets the price for each of the products in this set, after which the customer may choose (at most) one of the products in $\mathcal{C}_t$. If she chooses a product, a sale occurs and the seller collects revenue in the amount of the posted price; otherwise, no sale occurs and the seller does not collect any revenue.
Each product $i$ is represented by an observable vector of features $\tilde{x}_i \in \mathbb{R}^{\tilde{d}}$. Products offered at different rounds can be highly differentiated (their features vary) and we assume that the feature vectors are sampled independently from a fixed, but unknown, distribution $\mathcal{D}$ on $\mathbb{R}^{\tilde{d}}$. We recall the notation $x \equiv \phi(\tilde{x})$, where $\phi : \mathbb{R}^{\tilde{d}} \mapsto \mathbb{R}^d$ is the feature mapping.

We assume that $\|x_i\|_\infty \le 1$ for all $\tilde{x}_i$ in the support of $\mathcal{D}$, and $\|(\theta_0, \gamma_0)\| \le W$, for an arbitrarily large but fixed constant $W$. Throughout the chapter, we use $\|\cdot\|$ to indicate the $\ell_2$ norm.
If an item $i$ (at period $t$) is priced at $p_{it}$, then the customer obtains utility $u_{it}$ from buying it, where²

$u_{it} = \psi\big( \langle \phi(\tilde{x}_i), \theta_0 \rangle - \langle \phi(\tilde{x}_i), \gamma_0 \rangle p_{it} + z_{it} \big).$

Here, $\theta_0, \gamma_0 \in \mathbb{R}^d$ are the parameters of the demand curve and are a priori unknown to the seller. The feature mapping $\phi$ and the function $\psi$ are both fixed and known.

This is a random utility model, with the $z_{it}$ components representing market shocks, modeled as zero-mean random variables drawn independently and identically from a standard Gumbel distribution. At each round, the customer chooses the product that results in the highest utility. Since $\psi$ is strictly increasing, this corresponds to the product with the highest $\psi^{-1}(u_{it})$. For noise drawn from a standard Gumbel distribution, this brings us to the multinomial logit (MNL) choice model that has been widely used in academic literature and practice [78, 30]. Under the MNL model, the probability of choosing an item $i$ from the set $\mathcal{C}_t$ is given by

$q_{it} \equiv \mathbb{P}(i_t = i \mid \mathcal{C}_t) = \frac{\exp(u'_{it})}{1 + \sum_{\ell \in \mathcal{C}_t} \exp(u'_{\ell t})}, \quad \text{for } i \in \mathcal{C}_t, \qquad (2.2)$

where $u'_{it} = \langle x_i, \theta_0 \rangle - \langle x_i, \gamma_0 \rangle p_{it}$, for $i \in \mathcal{C}_t$, and we recall that $x_i \equiv \phi(\tilde{x}_i)$.

² In general, the offered price depends not only on the feature vector $x_i$ but also on the period $t$, as the estimate of the model parameters may vary across time $t$. We make this explicit in the notation $p_{it}$ by considering both $i$ and $t$ in the subscript.
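As a concrete sanity check, the choice probabilities in Eq. (2.2) are straightforward to compute. The sketch below uses made-up features, parameters, and prices (all values and the function name `mnl_choice_probs` are illustrative, not from the chapter).

```python
import numpy as np

def mnl_choice_probs(X, theta, gamma, prices):
    """Choice probabilities of Eq. (2.2) for products in a consideration set.

    X      : (n, d) matrix of mapped features x_i
    theta  : (d,) valuation parameter theta_0
    gamma  : (d,) price-sensitivity parameter gamma_0
    prices : (n,) posted prices p_it

    Returns the n purchase probabilities; the leftover mass 1 - sum(probs)
    is the no-purchase option, whose utility is normalized to 0.
    """
    u = X @ theta - (X @ gamma) * prices   # deterministic utilities u'_it
    w = np.exp(u)
    return w / (1.0 + w.sum())             # the "1 +" is the no-purchase term

# toy consideration set with d = 3 features and N = 2 products
X = np.array([[1.0, 0.5, 0.2],
              [1.0, -0.3, 0.4]])
theta = np.array([0.8, 0.1, 0.2])
gamma = np.array([0.6, 0.05, 0.1])
q = mnl_choice_probs(X, theta, gamma, np.array([1.0, 1.2]))
```

Raising the price of a product lowers its deterministic utility and hence its choice probability, consistent with the trade-off described above.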
We refer to the term $\beta_i = \langle x_i, \gamma_0 \rangle$ in the utility model as the price sensitivity of product $i$. Note that our model allows for heterogeneous price sensitivities. We also encode the no-purchase option by an item $\emptyset$, with market utility $z_{\emptyset t}$ drawn from a zero-mean Gumbel distribution. The random utility $z_{\emptyset t}$ can be interpreted as the utility obtained from choosing an option outside the offered ones. This is equivalent to $u'_{\emptyset t} = 0$. Having established the utility model as above, at all steps the user chooses the item with maximum utility from her consideration set; in case of equal utilities, we break the tie randomly.
To summarize, our setting is as follows. At each period $t$:

1. The user narrows down her options by forming a consideration set $\mathcal{C}_t$ of size at most $N$.

2. For each product $i \in \mathcal{C}_t$, the seller offers a price $p_{it}$.³

3. The user chooses the item $i_t \in \mathcal{C}_t \cup \{\emptyset\}$, where $i_t = \arg\max_{i \in \mathcal{C}_t \cup \{\emptyset\}} u_{it}$.

4. The seller observes which product is chosen from the consideration set and uses this information to set the future prices.
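One round of this interaction can be simulated directly; the toy dimensions, distributions, and variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 3, 4
theta0 = rng.normal(size=d)              # unknown to the seller in the model
gamma0 = rng.uniform(0.5, 1.5, size=d)   # positive price sensitivities

# Step 1: the user forms a consideration set of at most N products.
X = rng.uniform(0, 1, size=(N, d))       # mapped features x_i

# Step 2: the seller posts a price for each product in the set.
prices = rng.uniform(0, 1, size=N)

# Step 3: the user draws Gumbel market shocks and picks the utility
# maximizer; the no-purchase option has deterministic utility 0 plus
# its own Gumbel shock.
u = X @ theta0 - (X @ gamma0) * prices + rng.gumbel(size=N)
u_none = 0.0 + rng.gumbel()
chosen = int(np.argmax(u)) if u.max() > u_none else None

# Step 4: the seller observes only `chosen` (an index in the set, or None).
```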
We make the following assumption, which ensures positivity of the products' price sensitivity parameters. Per this assumption, note that for a product with feature $\tilde{x}_i$ drawn from $\mathcal{D}$, the price sensitivity is given by $\langle \phi(\tilde{x}_i), \gamma_0 \rangle = \langle x_i, \gamma_0 \rangle$.

Assumption 2.2.1. We have $\min\{ \langle x, \gamma_0 \rangle : \phi^{-1}(x) \in \mathcal{D} \} \ge L_0 > 0$, for some constant $L_0$.
Before proceeding with the policy description, we discuss the benchmark policy, which is used in defining the notion of regret and measuring the performance of pricing policies.

³ Equivalently, the seller can determine all the prices in advance and reveal them after the user determines the consideration set. We note that the consideration set of the user does not depend on the prices, but the choice she makes from the consideration set depends on the prices. In addition, recall that all the users share the same $\theta_0$ and $\gamma_0$, and the choice of consideration set does not reveal information about these parameters.
2.3 Benchmark policy

The seller's goal is to minimize her regret, which is defined as the expected revenue loss against a clairvoyant policy that knows the utility model parameters $\theta_0, \gamma_0$ in advance and always offers the revenue-maximizing prices. We next characterize the benchmark policy. Let $p^*_t = (p^*_{it})_{i \in \mathcal{C}_t}$ and $\beta_t = (\beta_i)_{i \in \mathcal{C}_t}$, where $\beta_i = \langle x_i, \gamma_0 \rangle$, and let $X_t \in \mathbb{R}^{|\mathcal{C}_t| \times d}$ be the feature matrix obtained by stacking $x_i$, $i \in \mathcal{C}_t$, as its rows (recall that $|\mathcal{C}_t| \le N$). The proposition below gives an implicit formula that writes the vector of optimal prices $p^*_t$ as a function $p^*_t = g(X_t \gamma_0, X_t \theta_0)$. We refer to $g$ as the pricing function.

Proposition 2.3.1. The benchmark policy that knows the utility model parameters $\theta_0, \gamma_0$ sets the optimal prices as follows: for product $i \in \mathcal{C}_t$, the optimal price is given by $p^*_{it} = \frac{1}{\langle x_i, \gamma_0 \rangle} + B^0_t$, where $B^0_t$ is the unique value of $B$ satisfying the following equation:

$B = \sum_{\ell \in \mathcal{C}_t} \frac{1}{\langle x_\ell, \gamma_0 \rangle} e^{-(1 + \langle x_\ell, \gamma_0 \rangle B)} e^{\langle x_\ell, \theta_0 \rangle}. \qquad (2.3)$

The proof of Proposition 2.3.1 is similar to that of [96, Theorem 3.1] and is deferred to Appendix 2.7.
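Because the left-hand side of (2.3) is increasing in $B$ while the right-hand side is decreasing, the fixed point $B^0_t$ can be found by bisection. The sketch below (our own function name and illustrative values) implements this.

```python
import numpy as np

def optimal_prices(X, theta, gamma, tol=1e-10):
    """Benchmark prices of Proposition 2.3.1.

    Solves B = sum_l (1/beta_l) exp(-(1 + beta_l B)) exp(<x_l, theta>)
    by bisection (LHS increasing, RHS decreasing => unique root), then
    sets p*_it = 1/beta_i + B.
    """
    beta = X @ gamma                       # price sensitivities, assumed > 0
    a = np.exp(X @ theta)

    def rhs(B):
        return np.sum((a / beta) * np.exp(-(1.0 + beta * B)))

    lo, hi = 0.0, 1.0
    while hi < rhs(hi):                    # grow the bracket until LHS >= RHS
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid < rhs(mid):
            lo = mid
        else:
            hi = mid
    B = 0.5 * (lo + hi)
    return 1.0 / beta + B, B

X = np.array([[1.0, 0.2], [0.5, 0.8]])
theta = np.array([0.5, 0.3])
gamma = np.array([1.0, 0.4])
p_star, B_star = optimal_prices(X, theta, gamma)
```

The returned $B$ can be checked by substituting it back into (2.3) and verifying that the fixed-point residual is essentially zero.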
We can now formally define the notion of regret. Let $\pi$ be a pricing policy that sets the vector of prices $p^\pi_t = (p^\pi_{it})_{i \in \mathcal{C}_t}$ at time $t$ for the products in the consideration set $\mathcal{C}_t$. Then, the seller's expected revenue at period $t$ under such a policy will be

$\mathrm{rev}^\pi_t = \sum_{i \in \mathcal{C}_t} q_{it}\, p^\pi_{it}, \qquad (2.4)$

with $q_{it}$ being the probability of buying product $i$ from the set $\mathcal{C}_t$, as given by Eq. (2.2).⁴ Similarly, we let $\mathrm{rev}^*_t$ be the seller's expected revenue under the benchmark policy that sets price vectors $p^*_t$ at period $t$. The worst-case cumulative regret of policy $\pi$ is defined as

$\mathrm{Regret}^\pi(T) \equiv \max\Big\{ \sum_{t=1}^{T} \big( \mathrm{rev}^*_t - \mathrm{rev}^\pi_t \big) \,:\, \|(\theta_0, \gamma_0)\| \le W,\ \phi(\mathcal{D}) \subseteq [-1, 1]^d \Big\}. \qquad (2.5)$
Input: (at time 0) function $g$, parameter $W$ (bound on $\|\nu_0\|$)
Input: (arrives over time) covariate matrices $\{X_t\}_{t \in [T]}$
Output: prices $\{p_t\}_{t \in [T]}$
1: $\tau_1 \leftarrow 1$, $p_1 \leftarrow 0$, $\hat{\theta}_1 \leftarrow 0$, and set the length of the $k$-th episode: $\ell_k \leftarrow k + d$
2: for each episode $k = 1, 2, \ldots$ do
3: Exploration phase: for the initial $d$ periods of episode $k$, choose the prices of items independently as $p_{it} \sim \mathrm{Uniform}([0,1])$.
4: At the end of the exploration phase, update the model parameter estimate $\hat{\nu}_k$ using the ML estimator applied to the previous exploration periods:

$\hat{\nu}_k = \arg\min_{\|\nu\| \le W} \mathcal{L}_k(\nu), \qquad (2.6)$

with $\mathcal{L}_k(\nu)$ given by (2.10).
5: Exploitation phase: offer prices based on the current estimate $\hat{\nu}_k = (\hat{\theta}_k, \hat{\gamma}_k)$ as

$p_{it} \leftarrow \frac{1}{\langle x_{it}, \hat{\gamma}_k \rangle} + B_t, \qquad (2.7)$

where $B_t$ is the unique value of $B$ satisfying the following equation:

$B = \sum_{\ell \in \mathcal{C}_t} \frac{1}{\langle x_\ell, \hat{\gamma}_k \rangle} e^{-(1 + \langle x_\ell, \hat{\gamma}_k \rangle B)} e^{\langle x_\ell, \hat{\theta}_k \rangle}. \qquad (2.8)$

Algorithm 1: M3P policy for multi-product dynamic pricing
⁴ More precisely, $\mathrm{rev}^\pi_t$ is the expected revenue conditional on the filtration $\mathcal{F}_{t-1}$, where $\mathcal{F}_t$ is the sigma algebra generated by the feature matrices $X_1, \ldots, X_{t+1}$ and the market shocks $z_1, \ldots, z_t$.
2.4 Multi-Product Pricing Policy (M3P)

In this section, we provide a formal description of our multi-product dynamic pricing policy (M3P). The policy sees the time horizon in an episodic structure, where the lengths of the episodes grow linearly. Specifically, episode $k$ is of length $\ell_k = k + d$, with the first $d$ periods referred to as exploration periods and the other $k$ as exploitation periods. Throughout the chapter, we use the notation $E_k$ to refer to the periods in episode $k$, i.e., $E_k = \{\tau_k, \ldots, \tau_{k+1} - 1\}$, with $\tau_k$ denoting the first period of episode $k$. In the exploration periods, the product prices are chosen independently as $p_{it} \sim \mathrm{Uniform}([0,1])$.⁵ In the exploitation periods, we choose the optimal prices based on the current estimate of the model parameters. Concretely, let $I_k$ be the set of exploration periods up to episode $k$, i.e., the set consisting of the initial $d$ periods of episodes $1, \ldots, k$, and hence $|I_k| = kd$. We form the negative log-likelihood function for estimating $\nu_0 = (\theta_0^T, \gamma_0^T)^T$ using the sales data in the periods $I_k$:

$\mathcal{L}_k(\nu) = -\frac{1}{kd} \sum_{t \in I_k} \log\Bigg( \frac{\exp(u'_{i_t t}(\nu))}{\sum_{\ell \in \mathcal{C}_t \cup \{\emptyset\}} \exp(u'_{\ell t}(\nu))} \Bigg),$

where $i_t$ denotes the product purchased at time $t$, and

$u'_{it}(\nu) = \langle x_i, \theta \rangle - \langle x_i, \gamma \rangle p_{it}, \qquad (2.9)$

with $\nu = (\theta^T, \gamma^T)^T$. We adopt the convention that $i_t = \emptyset$ for the "no-purchase" case, with $u'_{\emptyset t}(\cdot) = 0$.
The log-likelihood loss can be written in a more compact form. We let $y_t = (y_{it})_{i \in \mathcal{C}_t \cup \{\emptyset\}}$ be the response vector that indicates which product is purchased at time $t$:

$y_{it} = \begin{cases} 1 & \text{product } i \text{ is chosen}, \\ 0 & \text{otherwise}. \end{cases}$

We also let $u'_t(\nu) = (u'_{it}(\nu))_{i \in \mathcal{C}_t \cup \{\emptyset\}}$. Then, the log-likelihood loss can be written as

$\mathcal{L}_k(\nu) = -\frac{1}{kd} \sum_{t \in I_k} \log\Bigg( \frac{y_t^T \exp(u'_t(\nu))}{1 + \sum_{\ell \in \mathcal{C}_t} \exp(u'_{\ell t}(\nu))} \Bigg). \qquad (2.10)$

⁵ Indeed, we can offer $p_{it} \sim \mathrm{Uniform}([0, C])$ for any fixed constant $C > 0$, and $C$ appears in the regret bound. But since we treat $C$ as a constant, it does not affect the order of the regret.
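For concreteness, the loss (2.10) can be evaluated as in the sketch below; the function and variable names are our own, and the purchased index is set to `None` for the no-purchase case.

```python
import numpy as np

def neg_log_lik(nu, X_list, price_list, choice_list, d):
    """Negative log-likelihood of Eq. (2.10), averaged over exploration
    periods. nu stacks (theta, gamma); choice_list[t] is the index of the
    purchased product in the consideration set, or None for no-purchase."""
    theta, gamma = nu[:d], nu[d:]
    total = 0.0
    for X, p, i in zip(X_list, price_list, choice_list):
        u = X @ theta - (X @ gamma) * p        # u'_lt(nu), l in C_t
        log_denom = np.log1p(np.exp(u).sum())  # log(1 + sum_l exp(u'_lt))
        u_chosen = 0.0 if i is None else u[i]  # no-purchase utility is 0
        total += log_denom - u_chosen          # = -log q_{i_t, t}
    return total / len(X_list)

rng = np.random.default_rng(1)
d = 2
X_list = [rng.uniform(0, 1, size=(3, d)) for _ in range(5)]
price_list = [rng.uniform(0, 1, size=3) for _ in range(5)]
choice_list = [0, None, 2, 1, None]
val = neg_log_lik(np.zeros(2 * d), X_list, price_list, choice_list, d)
```

At $\nu = 0$ all utilities are zero, so each of the four outcomes (three products plus no-purchase) has probability $1/4$ and the averaged loss equals $\log 4$, a useful sanity check.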
We construct the estimate $\hat{\nu}_k = (\hat{\theta}_k, \hat{\gamma}_k)$ by solving the following optimization problem:

$\hat{\nu}_k = \arg\min_{\|\nu\| \le W} \mathcal{L}_k(\nu). \qquad (2.11)$

In the exploitation periods of the $k$-th episode, the policy adheres to the estimate $\hat{\nu}_k = (\hat{\theta}_k, \hat{\gamma}_k)$ throughout the episode and sets the price vectors as $p_t = g(X_t \hat{\gamma}_k, X_t \hat{\theta}_k)$.

The policy terminates at time $T$, but note that the policy does not need to know $T$ in advance. Also, by design, when the policy does not have much information about the model parameters it updates its estimates frequently (since the lengths of the episodes are small); but as time proceeds, the policy gathers more information about the parameters, updates its estimates less frequently, and uses them over longer episodes.
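The episodic bookkeeping is simple to make explicit. The helper below is our own sketch: since episode $k$ has length $k + d$, the first $K$ episodes cover about $K^2/2 + Kd$ periods, which is why $K = \lfloor\sqrt{2T}\rfloor$ episodes suffice in the regret analysis.

```python
def episode_schedule(T, d):
    """Label periods 1..T under the M3P episode structure: episode k has
    length k + d, with its first d periods devoted to exploration."""
    labels, t, k = [], 0, 1
    while t < T:
        for j in range(k + d):
            if t == T:
                break
            labels.append(("explore", k) if j < d else ("exploit", k))
            t += 1
        k += 1
    return labels

sched = episode_schedule(T=100, d=5)
```

For instance, with $T = 100$ and $d = 5$, episode 1 consists of 5 exploration periods followed by 1 exploitation period, and only 10 episodes are started before the horizon ends.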
2.5 Regret Analysis for M3P

We next state our result on the regret of the M3P policy.

Theorem 2.5.1 (Regret upper bound). Consider the choice model (2.2). Then, the worst-case $T$-period regret of the M3P policy is $O(\log(Td)(\sqrt{T} + d\log(T)))$, with $d$ and $T$ being the feature dimension and the length of the time horizon.

Below, we state the key lemmas in the proof of Theorem 2.5.1 and refer to the appendices for the proofs of the technical steps. Let $p_t = (p_{it})_{i \in \mathcal{C}_t}$ be the vector of prices posted at time $t$ for the products in the consideration set $\mathcal{C}_t$. Recall that M3P sets the prices as $p_t = g(X_t \hat{\gamma}_k, X_t \hat{\theta}_k)$, where $g(\cdot,\cdot)$ is the pricing function whose implicit characterization is given by Proposition 2.3.1.
Our next lemma shows that the pricing function $g(\cdot,\cdot)$ is Lipschitz. We recall that $L_0$ is given in Assumption 2.2.1, $W$ is the initial bound on the model parameters ($\|(\theta_0, \gamma_0)\| \le W$, as described in Section 2.2), and $N$ is the maximum size of the consideration set at each step.

Lemma 2.5.2. Suppose that $p_1 = g(X_t \gamma_1, X_t \theta_1)$ and $p_2 = g(X_t \gamma_2, X_t \theta_2)$. Then, there exists a constant $C = C(W, L_0) > 0$ such that the following holds:

$\|p_1 - p_2\| \le C N^2 \big( \|X_t(\gamma_1 - \gamma_2)\|^2 + \|X_t(\theta_1 - \theta_2)\|^2 \big)^{1/2}. \qquad (2.12)$
We next upper bound the right-hand side of Eq. (2.12) by bounding the estimation error of the proposed estimator.

Proposition 2.5.3. Let $\hat{\nu}_k$ be the solution of the optimization problem (2.11). Then, there exist constants $c_0, c_1$, and $c_2$ (depending on $W, L_0, N$) such that for $k \ge c_0 d$, with probability at least $1 - d^2 k^{-1.5} - 2e^{-c_2 kd}$, we have

$\|\hat{\theta}_k - \theta_0\|^2 + \|\hat{\gamma}_k - \gamma_0\|^2 \le c_1 \frac{\log(kd)}{k}. \qquad (2.13)$
The last part of the proof relates the regret of the policy at each period $t$ to the distance between the posted price vector $p_t$ and the price vector $p^*_t$ posted by the benchmark. Recall the definition of the revenue $\mathrm{rev}^\pi_t$ from (2.4) and define the regret as $\mathrm{reg}_t \equiv \mathrm{rev}^*_t - \mathrm{rev}^\pi_t$.

Lemma 2.5.4. Let $p^*_t = g(X_t \gamma_0, X_t \theta_0)$ be the optimal price vector posted by the benchmark policy that knows the model parameters $\theta_0$ and $\gamma_0$ in advance. Then the following holds:

$\mathrm{reg}_t \le c_3 N \|p_t - p^*_t\|^2,$

for some constant $c_3 = c_3(W, L_0) > 0$.
The reason that, in Lemma 2.5.4, the revenue gap depends on the square of the difference of the price vectors is that $p^*_t = \arg\max_p \mathrm{rev}_t(p)$ is the optimal price and hence $\nabla \mathrm{rev}_t(p^*_t) = 0$. Therefore, by a Taylor expansion of the function $\mathrm{rev}_t(p)$ around $p^*_t$, we see that the first-order term vanishes and the second-order term $O(\|p_t - p^*_t\|^2)$ matters.
The proof of Theorem 2.5.1 follows by combining Lemma 2.5.2, Proposition 2.5.3, and Lemma 2.5.4. We refer to Section 2.6.1 for its proof.

Our next theorem provides a lower bound on the $T$-period regret of any pricing policy. The proof of Theorem 2.5.5 is given in Section 2.6.2 and employs the notion of 'uninformative prices' introduced by [13].

Theorem 2.5.5 (Regret lower bound). Consider the choice model (2.2). Then, the $T$-period regret of any pricing policy in this case is $\Omega(\sqrt{T})$.

Theorem 2.5.5 implies that M3P has optimal cumulative regret in $T$, up to a logarithmic factor.
2.6 Proof of Main Theorems

2.6.1 Proof of Theorem 2.5.1

By Lemma 2.5.4, we have

$\mathrm{reg}_t \le c_3 N \|p_t - p^*_t\|^2 \le c_3 C^2 N^4 \big( \|X_t(\hat{\gamma}_k - \gamma_0)\|^2 + \|X_t(\hat{\theta}_k - \theta_0)\|^2 \big), \qquad (2.14)$
where we used Lemma 2.5.2. Note that the estimates $\hat{\gamma}_k$ and $\hat{\theta}_k$ are constructed using the samples in $I_k$ and consequently are independent of the current features $X_t$ ($t$ is in the exploitation phase of episode $k$). Taking the expectation first with respect to $X_t$, we obtain

$\mathbb{E}(\mathrm{reg}_t) \le c_3 C^2 N^5\, \mathbb{E}\big[ \|\Sigma^{1/2}(\hat{\gamma}_k - \gamma_0)\|^2 + \|\Sigma^{1/2}(\hat{\theta}_k - \theta_0)\|^2 \big] \le c_3 c_{\max} C^2 N^5\, \mathbb{E}\big[ \|\hat{\gamma}_k - \gamma_0\|^2 + \|\hat{\theta}_k - \theta_0\|^2 \big], \qquad (2.15)$

where $\Sigma = \mathbb{E}(x_i x_i^T) \in \mathbb{R}^{d \times d}$ is the population covariance of the feature distribution and $c_{\max}$ is the maximum singular value of $\Sigma$. We let $G$ be the probability event that (2.13) holds true. Then, by Proposition 2.5.3 we have $\mathbb{P}(G) \ge 1 - d^2 k^{-1.5} - 2e^{-c_2 kd}$.
We are now in place to bound the regret of our policy. Given that the lengths of the episodes grow linearly, we have $\mathrm{Regret}(T) \le \sum_{k=1}^{K} \mathrm{Regret}_k$, with $\mathrm{Regret}_k$ denoting the total expected regret during episode $k$ and $K = \lfloor \sqrt{2T} \rfloor$. Consider the two cases below:

$k \le c_0$: Here, $c_0$ is the constant in the statement of Proposition 2.5.3. Since the benchmark prices are bounded by $P$, given by Lemma 2.9.1, the regret at each step is at most $P$ and hence the total regret over such episodes is at most $P\big( \sum_{k=1}^{c_0} k + c_0 d \big) < (c_0^2 + c_0 d) P$.
$k > c_0$: In this case, we write, for $t \in E_k$,

$\mathbb{E}(\mathrm{reg}_t) \le c_3 c_{\max} C^2 N^5 \Big\{ \mathbb{E}\big[ \big( \|\hat{\gamma}_k - \gamma_0\|^2 + \|\hat{\theta}_k - \theta_0\|^2 \big) \mathbb{I}_G \big] + \mathbb{E}\big[ \big( \|\hat{\gamma}_k - \gamma_0\|^2 + \|\hat{\theta}_k - \theta_0\|^2 \big) \mathbb{I}_{G^c} \big] \Big\} \qquad (2.16)$

$\le c_3 c_{\max} C^2 N^5 \Big( c_1 \frac{\log(kd)}{k} + 4W^2\, \mathbb{P}(G^c) \Big) \le c_3 c_{\max} C^2 N^5 \Big( c_1 \frac{\log(kd)}{k} + 4W^2 \big( d^2 k^{-1.5} + 2e^{-c_2 kd} \big) \Big). \qquad (2.17)$
Hence, the total expected regret over episode $k$ is bounded as follows:

$\mathrm{Regret}_k = \sum_{t \in E_k} \mathbb{E}(\mathrm{reg}_t) \le C \log(Td)\,(1 + d/k), \qquad (2.18)$

where we used the fact that $|E_k| = k + d$; the constant $C$ hides various other constants.

To bound the cumulative expected regret up to time $T$, let $K = \lfloor \sqrt{2T} \rfloor$. By combining the two cases, we obtain

$\mathrm{Regret}(T) \le (c_0^2 + c_0 d) P + C \log(Td) \sum_{k=1}^{K} (1 + d/k) \le \tilde{C} \log(Td)\, (\sqrt{T} + d \log(T)).$
This concludes the proof.
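The final display uses $\sum_{k=1}^{K}(1 + d/k) \le K + d(1 + \log K)$ with $K = \lfloor\sqrt{2T}\rfloor$, which is what produces the $\sqrt{T} + d\log T$ shape. A quick numerical check, with arbitrary illustrative $T$ and $d$:

```python
import math

T, d = 10_000, 50
K = math.isqrt(2 * T)                    # floor of sqrt(2T)
s = sum(1 + d / k for k in range(1, K + 1))
bound = K + d * (1 + math.log(K))        # K + d*(1 + log K) dominates the sum
```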
2.6.2 Proof of Theorem 2.5.5

Since we are treating $N$ (the maximum size of a consideration set) as a constant and are only interested in the lower bound on the regret in terms of the time horizon $T$, we consider the case of $N = 1$. The $\Omega(\sqrt{T})$ bound follows from the existence of the so-called 'uninformative prices' [13]. For a fixed time $t$, recall that $e^{u'_t}/(1 + e^{u'_t})$ is the purchase probability, where $u'_t = \langle x_t, \theta_0 \rangle - \langle x_t, \gamma_0 \rangle p_t$, and $x_t$ and $p_t$ are respectively the product feature and the posted price at time $t$. An uninformative price $p$ is any price at which all the purchase probability curves (across model parameters) intersect. The name comes from the fact that such a price does not reveal any information about the underlying model parameters and hence does not help with the learning part (exploration of the space of the model parameters). Now, if an uninformative price is also the optimal price for a specific choice of parameters, then we get a clear tension between the exploitation and exploration objectives. Indeed, for a policy to learn the underlying model parameters fast enough, it must necessarily choose prices that are away from the uninformative prices, and this in turn leads to accruing regret when an uninformative price is in fact the optimal price for the true model parameters.
To construct uninformative prices, we let $\theta_{0,j} = 0$ and $\gamma_{0,j} = 0$ for $j \ge 2$, and let $x_{t,1} = 1$ for all products. We then have

$u'_t = \langle x_t, \theta_0 \rangle - \langle x_t, \gamma_0 \rangle p_t = \theta_{0,1} - \gamma_{0,1}\, p_t.$

We set $\gamma_{0,1} = \theta_{0,1}$ and get $u'_t = \theta_{0,1}(1 - p_t)$. Therefore, $p_t = 1$ is an uninformative price, since the purchase probability curves depend only on $u'_t$ (see Eq. (2.2)) and hence they all (across $\theta_{0,1}$) intersect at the common price $p_t = 1$. In addition, for $\theta_{0,1} = 2$, it is easy to verify that $p_t = 1$ is the optimal price, using Proposition 2.3.1.
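A quick numerical illustration of this construction (with $u' = \theta(1 - p)$ as in the text; the sample values of $\theta$ are our own):

```python
import numpy as np

def purchase_prob(p, theta):
    """Purchase probability when theta_{0,1} = gamma_{0,1} = theta and the
    first feature equals 1, so u' = theta * (1 - p)."""
    u = theta * (1.0 - p)
    return np.exp(u) / (1.0 + np.exp(u))

thetas = [1.5, 2.0, 2.5]
at_uninformative = [purchase_prob(1.0, th) for th in thetas]  # all equal 1/2
elsewhere = [purchase_prob(0.5, th) for th in thetas]         # curves separate
```

At $p = 1$ every curve passes through $1/2$ regardless of $\theta$, so observing the sale outcome there carries no information about $\theta$; at any other price the curves are distinct.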
Now that we have established the existence of uninformative prices, it can be shown that the worst-case $T$-period regret of any policy is lower bounded by $\Omega(\sqrt{T})$.

Our next proposition makes the above insight rigorous and formally states a lower bound on the regret of any policy. Before proceeding with the statement, we establish a lemma. We consider a problem class $\mathcal{C}$ to be a pair $\mathcal{C} = (\mathcal{P}, \Theta)$, with $\Theta = [\theta_{\min}, \theta_{\max}]$ and $\mathcal{P} = [p_{\min}, p_{\max}]$, where $\theta_{\min}, p_{\min} \ge 0$. We would like to consider models with $\theta_{0,1} = \gamma_{0,1} = \theta \in \Theta$ and pricing policies that set prices in $\mathcal{P}$. Note that in this case, a posted price $p$ yields the purchase probability $e^{\theta(1-p)}/(1 + e^{\theta(1-p)})$. We also assume that for the optimal price under the model parameter $\theta$, denoted by $p^*(\theta)$, we have $p^*(\theta) \in \mathcal{P}$ for all $\theta \in \Theta$. Our next lemma gives a sufficient condition for this assumption to hold.

Lemma 2.6.1. Suppose that $p_{\min} \le 1/\theta_{\max}$ and $p_{\max} \ge \max(1, 2/\theta_{\min})$. Then, $p^*(\theta) \in \mathcal{P}$ for all $\theta \in \Theta$.

We refer to Appendix 2.11.1 for the proof of Lemma 2.6.1.
Proposition 2.6.2. Define a problem class $\mathcal{C} = (\mathcal{P}, \Theta)$ by letting $\mathcal{P} = [2/5, 4/3]$ and $\Theta = [3/2, 5/2]$, and the purchase probability

$q(p; \theta) = \frac{e^{\theta(1-p)}}{1 + e^{\theta(1-p)}}.$

Then for any pricing policy $\pi$ setting prices in $\mathcal{P}$ and any $T \ge 2$, there exists a parameter $\theta \in \Theta$ such that

$\mathrm{Regret}^\pi(\theta; T) \ge \frac{\sqrt{T}}{3(41)^4},$

where $\mathrm{Regret}^\pi(\theta; T)$ is the total expected regret of policy $\pi$ up to time $T$, under the model parameter $\theta$.

The proof of Proposition 2.6.2 is similar to the proof of [13, Theorem 3.1]. However, that theorem applies only to the specific purchase probability function $1/2 + \theta - \theta p$. The proof of Proposition 2.6.2 requires some detailed analysis that is deferred to Appendix 2.11.
2.7 Proof of Proposition 2.3.1

In the benchmark policy, the seller knows the model parameters $\theta_0, \gamma_0$. For simplicity, we use the shorthands $\beta_i = \langle x_i, \gamma_0 \rangle$, $e_{it} = \exp(\langle x_i, \theta_0 \rangle - \beta_i p_{it})$, and write the sum as $G(e_t) = \sum_{\ell \in \mathcal{C}_t} e_{\ell t}$. The revenue function can be written in terms of $e_{it}$ as

$\mathrm{rev}_t(p_t) = \sum_{i \in \mathcal{C}_t} p_{it}\, \mathbb{P}(i_t = i \mid \mathcal{C}_t) = \sum_{i \in \mathcal{C}_t} p_{it}\, \frac{e_{it}}{1 + G(e_t)},$

where we used (2.2). Writing the stationarity condition for the optimal price vector $p^*_t$, we get, for each $i \in \mathcal{C}_t$:

$\frac{\partial\, \mathrm{rev}_t(p_t)}{\partial p_{it}} = \frac{e_{it} - e_{it} \beta_i p_{it}}{1 + G(e_t)} + \frac{\big( \sum_{\ell \in \mathcal{C}_t} p_{\ell t} e_{\ell t} \big) e_{it} \beta_i}{(1 + G(e_t))^2} = 0,$

which is equivalent to

$\frac{\beta_i e_{it}}{1 + G(e_t)} \Bigg( \frac{1}{\beta_i} - p_{it} + \underbrace{\sum_{\ell \in \mathcal{C}_t} \frac{p_{\ell t} e_{\ell t}}{1 + G(e_t)}}_{\mathrm{rev}_t(p_t)} \Bigg) = 0.$
Since $e_{it} > 0$, the above equation implies that

$p^*_{it} = \frac{1}{\beta_i} + \mathrm{rev}_t(p^*_t). \qquad (2.19)$

Define $B^0_t \equiv \mathrm{rev}_t(p^*_t)$. We next show that $B^0_t$ is the solution to Equation (2.3). By multiplying both sides of (2.19) by $e_{\ell t}$ and summing over $\ell \in \mathcal{C}_t$, we have

$\sum_{\ell \in \mathcal{C}_t} e_{\ell t}\, p^*_{\ell t} = \sum_{\ell \in \mathcal{C}_t} \frac{e_{\ell t}}{\beta_\ell} + B^0_t \sum_{\ell \in \mathcal{C}_t} e_{\ell t} = \sum_{\ell \in \mathcal{C}_t} \frac{e_{\ell t}}{\beta_\ell} + B^0_t\, G(e_t).$

By the definition of $B^0_t$, the left-hand side of the above equation is equal to $B^0_t (1 + G(e_t))$. By rearranging the terms, we obtain

$B^0_t = \sum_{\ell \in \mathcal{C}_t} \frac{e_{\ell t}}{\beta_\ell} = \sum_{\ell \in \mathcal{C}_t} \frac{1}{\beta_\ell} e^{\langle x_\ell, \theta_0 \rangle - \beta_\ell p^*_{\ell t}} = \sum_{\ell \in \mathcal{C}_t} \frac{1}{\beta_\ell} \exp\Big\{ \langle x_\ell, \theta_0 \rangle - \beta_\ell \Big( \frac{1}{\beta_\ell} + B^0_t \Big) \Big\} = \sum_{\ell \in \mathcal{C}_t} \frac{1}{\beta_\ell}\, e^{\langle x_\ell, \theta_0 \rangle}\, e^{-(1 + \beta_\ell B^0_t)},$

where the third equality follows from Equation (2.19).

Regarding the uniqueness of the solution of (2.3), note that the left-hand side of (2.3) is strictly increasing in $B$ and is zero at $B = 0$, while the right-hand side is strictly decreasing in $B$ and is positive at $B = 0$. Therefore, Equation (2.3) has a unique solution.
2.8 Proof of Lemma 2.5.2
Define function f :RR
N
R
N
7!R as
f(B;d;b) B
å
`2C
t
1
b
`
e
d
`
e
(1+b
`
B)
: (2.20)
By characterization of the pricing function g, given in Proposition 2.3.1, we have p
(1)
it
=
1
b
(1)
i
+B
(1)
t
and p
(2)
it
=
1
b
(2)
i
+ B
(2)
t
, where B
(1)
t
and B
(2)
t
are the solution of
f(B;X
t
q
1
;X
t
g
1
)= 0;
and
f(B;X
t
q
2
;X
t
g
2
)= 0:
By implicit function theorem for a point (B;d;b) that satisfies f(B;d;b) = 0, there exists an
open set around(d;b), and a unique differentiable function h : U7!R such that h(d;b)= B and
f(h(z
1
;z
2
);z
1
;z
2
)= 0 for all(z
1
;z
2
)2 U. Furthermore, the partial derivative of g can be computed
as
$$\frac{\partial h}{\partial\delta_i}(\delta,\beta) \;=\; -\Big[\frac{\partial f}{\partial B}\Big]^{-1}\frac{\partial f}{\partial\delta_i}\,(h(\delta,\beta),\delta,\beta) \;=\; \Big(1+\sum_{\ell\in C_t}e^{\delta_\ell}e^{-(1+\beta_\ell B)}\Big)^{-1}\frac{1}{\beta_i}\,e^{-(1+\beta_i B)}\,e^{\delta_i} \;<\; \frac{e^{\delta_i}}{\beta_i} \;<\; \frac{e^{W}}{L_0},$$
where in the last step we use the normalization $|\delta_i|=|\langle x_i,\theta_1\rangle|\le W$, and $0<L_0\le\min_i\beta_i$ is the lower bound on the price sensitivities. Likewise, we have
$$\frac{\partial h}{\partial\beta_i}(\delta,\beta) \;=\; -\Big[\frac{\partial f}{\partial B}\Big]^{-1}\frac{\partial f}{\partial\beta_i}\,(h(\delta,\beta),\delta,\beta) \;=\; \Big(1+\sum_{\ell\in C_t}e^{\delta_\ell}e^{-(1+\beta_\ell B)}\Big)^{-1}\Big(\frac{B}{\beta_i}+\frac{1}{\beta_i^2}\Big)e^{-(1+\beta_i B)}\,e^{\delta_i} \;<\; \Big(\frac{B}{L_0}+\frac{1}{L_0^2}\Big)e^{\delta_i} \;<\; \big(Ne^{W-1}+1\big)\frac{e^{W}}{L_0^2},$$
where we used the fact that the solution $B$ of $f(B;\delta,\beta)=0$ satisfies $B\le Ne^{W-1}/L_0$. (This follows readily by noting that the right-hand side of (2.20) is non-increasing in $B$.) This shows that $h(\delta,\beta)$ is a Lipschitz function of $\delta$, with Lipschitz constant $CN$, where $C\equiv e^{2W}/\min(L_0^2,1)$. Therefore,
$$\begin{aligned}
|p^{(1)}_{it}-p^{(2)}_{it}| &\le |B^{(1)}_t-B^{(2)}_t| + \Big|\frac{1}{\beta^{(1)}_i}-\frac{1}{\beta^{(2)}_i}\Big|\\
&= |h(X_t\theta_1,X_t\gamma_1)-h(X_t\theta_2,X_t\gamma_2)| + \frac{1}{\beta^{(1)}_i\beta^{(2)}_i}\,|\langle x_i,\gamma_1-\gamma_2\rangle|\\
&\le CN\big(\|X_t(\theta_1-\theta_2)\|_1 + \|X_t(\gamma_1-\gamma_2)\|_1\big) + \frac{1}{L_0^2}\,|\langle x_i,\gamma_1-\gamma_2\rangle|. \qquad (2.21)
\end{aligned}$$
By the Cauchy–Schwarz inequality we have $\|X_t(\theta_1-\theta_2)\|_1\le\sqrt{N}\,\|X_t(\theta_1-\theta_2)\|$ and $\|X_t(\gamma_1-\gamma_2)\|_1\le\sqrt{N}\,\|X_t(\gamma_1-\gamma_2)\|$. Hence, $\|p^{(1)}_t-p^{(2)}_t\|\le\tilde{C}N^2\big(\|X_t(\theta_1-\theta_2)\|+\|X_t(\gamma_1-\gamma_2)\|\big)$, for some constant $\tilde{C}$. The claim now follows by using $a+b\le\sqrt{2(a^2+b^2)}$.
2.9 Proof of Proposition 2.5.3

We start by recalling the notation $\nu_0=(\theta_0^\top,\gamma_0^\top)^\top$ and define $\tilde{X}_t=[X_t,\,-\mathrm{diag}(p_t)X_t]$. To prove Proposition 2.5.3, we first rewrite the loss function in terms of the augmented parameter vector $\nu$. (Recall our convention that $\emptyset$ corresponds to "no-purchase", with $u^0_{\emptyset t}(\cdot)=0$.)
$$\begin{aligned}
\mathcal{L}_k(\nu) &= -\frac{1}{kd}\sum_{t\in I_k}\log\Bigg(\frac{\exp(u^0_{i_t t})}{\sum_{\ell\in C_t\cup\{\emptyset\}}\exp(u^0_{\ell t})}\Bigg)\\
&= \frac{1}{kd}\sum_{t\in I_k}\Bigg(\log\Big(1+\sum_{\ell\in C_t}e^{\langle x_\ell,\theta\rangle-p_{\ell t}\langle x_\ell,\gamma\rangle}\Big) - \begin{cases} 0 & i_t=\emptyset\\ \langle x_{i_t},\theta\rangle-p_{i_t t}\langle x_{i_t},\gamma\rangle & \text{otherwise}\end{cases}\Bigg)\\
&= \frac{1}{kd}\sum_{t\in I_k}\Bigg\{\log\Big(1+\sum_{\ell\in C_t}e^{\langle\tilde{x}_\ell,\nu\rangle}\Big) - \begin{cases} 0 & i_t=\emptyset\\ \langle\tilde{x}_{i_t},\nu\rangle & \text{otherwise}\end{cases}\Bigg\}, \qquad (2.22)
\end{aligned}$$
where $\tilde{x}_\ell=[x_\ell^\top,\,-p_{\ell t}x_\ell^\top]^\top$. The gradient and the Hessian of $\mathcal{L}_k$ are given by
$$\nabla\mathcal{L}_k(\nu) = \frac{1}{kd}\sum_{t\in I_k}\Bigg(\frac{\sum_{\ell\in C_t}\exp(u^0_\ell)\,\tilde{x}_\ell}{1+\sum_{\ell\in C_t}\exp(u^0_\ell)} - \tilde{x}_{i_t}\Bigg), \qquad (2.23)$$
$$\nabla^2\mathcal{L}_k(\nu) = \frac{1}{kd}\sum_{t\in I_k}\frac{\big(1+\sum_{\ell\in C_t}\exp(u^0_\ell)\big)\big(\sum_{\ell\in C_t}\exp(u^0_\ell)\,\tilde{x}_\ell\tilde{x}_\ell^\top\big) - \big(\sum_{\ell\in C_t}\exp(u^0_\ell)\,\tilde{x}_\ell\big)\big(\sum_{\ell\in C_t}\exp(u^0_\ell)\,\tilde{x}_\ell\big)^\top}{\big(1+\sum_{\ell\in C_t}\exp(u^0_\ell)\big)^2}. \qquad (2.24)$$
We proceed by bounding the gradient and the Hessian of the loss function. Before that, we establish an upper bound on the prices that are set by the pricing function $g$.

Lemma 2.9.1. Suppose that $\|x_\ell\|_\infty\le 1$ and $\|\nu_0\|_1\le W$. Let $B_u=B_u(W,L_0,N)$ be the solution to the following equation:
$$B \;=\; N\,\frac{1}{L_0}\,e^{-(1+L_0 B)}\,e^{W}. \qquad (2.25)$$
Then, the prices set by the pricing function, $p_t=g(X_t\theta_0,X_t\gamma_0)$ where $\nu_0=(\theta_0^\top,\gamma_0^\top)^\top$, are bounded by $P=1/L_0+B_u$.
The proof of the above lemma follows readily by noting that the right-hand side of (2.25) is an upper bound for the right-hand side of (2.3), and therefore $B^0_t\le B_u$. The result then follows by recalling that the pricing function sets prices as $p_{it}=1/\beta_i+B^0_t$.
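The domination argument in Lemma 2.9.1 can be illustrated numerically. The sketch below uses arbitrary illustrative values (not from the text) satisfying $\beta_\ell\ge L_0$ and $|\langle x_\ell,\theta_0\rangle|\le W$, and checks that the fixed point of the envelope equation (2.25) upper bounds the fixed point of the instance-specific equation (2.3).

```python
import math

# Each term of the right-hand side of (2.3) is dominated by
# (1/L0) * exp(W) * exp(-(1 + L0 * B)) when beta_l >= L0 and
# |<x_l, theta_0>| <= W, so the solution B_t^0 of (2.3) is at most
# the solution B_u of (2.25). Illustrative instance:
W, L0, N = 1.0, 0.5, 3
betas = [0.6, 0.9, 1.4]   # all >= L0
utils = [0.7, -0.4, 1.0]  # all with |u| <= W

def solve_fixed_point(f):
    # bisection for the root of B - f(B); f is positive and decreasing
    lo, hi = 0.0, f(0.0)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rhs_instance = lambda B: sum(math.exp(u) * math.exp(-(1 + b * B)) / b
                             for b, u in zip(betas, utils))
rhs_envelope = lambda B: (N / L0) * math.exp(-(1 + L0 * B)) * math.exp(W)

B0 = solve_fixed_point(rhs_instance)  # instance-level B_t^0
Bu = solve_fixed_point(rhs_envelope)  # envelope bound B_u of (2.25)
```

Since both right-hand sides are decreasing and one dominates the other pointwise, the ordering of the two fixed points follows, matching the lemma.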
To bound the gradient of the loss function at the true model parameters, note that
$$\nabla\mathcal{L}_k(\nu_0) = \frac{1}{kd}\sum_{t\in I_k}S_t, \qquad\text{with}\qquad S_t \equiv \frac{\sum_{\ell\in C_t}\exp(u^0_{\ell t})\,\tilde{x}_\ell}{1+\sum_{\ell\in C_t}\exp(u^0_{\ell t})} - \tilde{x}_{i_t}. \qquad (2.26)$$
We also have
$$|u^0_{\ell t}| \;\le\; |\langle x_\ell,\theta_0\rangle| + |\langle x_\ell,\gamma_0\rangle|\,p_{\ell t} \;\le\; W(1+P) \;\equiv\; M, \qquad (2.27)$$
for a constant $M=M(W,L_0,N)>0$, and so $\|S_t\|\le(M+1)\sqrt{d(1+P^2)}$, because $\|\tilde{x}_\ell\|\le\sqrt{d(1+P^2)}$.
Note that by (2.26), $\nabla\mathcal{L}_k(\nu_0)$ is written as a sum of $kd$ terms. In each term, the index $i_t$ has randomness coming from the market noise distribution. By a straightforward calculation, one can verify that each of these terms has zero expectation. Using (2.27) and by applying the matrix Freedman inequality to the right-hand side of (2.26), followed by union bounding over the $d$ coordinates of the feature vectors, we obtain
$$\|\nabla\mathcal{L}_k(\nu_0)\| \;\le\; \lambda_k, \qquad\text{with}\qquad \lambda_k \;\equiv\; 2(M+1)\sqrt{d(1+P^2)\,\frac{\log(d\,|I_k|)}{|I_k|}}, \qquad (2.28)$$
with probability at least $1-d^{-0.5}|I_k|^{-1.5}$.
We next pass to lower bounding the Hessian of the loss. For any $\tilde\nu$ with $\|\tilde\nu\|\le W$, writing $\tilde u^0_{\ell t}=\langle\tilde x_\ell,\tilde\nu\rangle$, we have
$$\begin{aligned}
&\langle\nu_0-\hat\nu_k,\ \nabla^2\mathcal{L}_k(\tilde\nu)\,(\nu_0-\hat\nu_k)\rangle\\
&\quad= \frac{1}{kd}\sum_{t\in I_k}\frac{\big(1+\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\big)\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\big(\tilde x_\ell^\top(\nu_0-\hat\nu_k)\big)^2 - \big(\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\,\tilde x_\ell^\top(\nu_0-\hat\nu_k)\big)^2}{\big(1+\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\big)^2}\\
&\quad\overset{(a)}{\ge} \frac{1}{kd}\sum_{t\in I_k}\frac{\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\big(\tilde x_\ell^\top(\nu_0-\hat\nu_k)\big)^2}{\big(1+\sum_{\ell\in C_t}e^{\tilde u^0_{\ell t}}\big)^2}\\
&\quad\overset{(b)}{\ge} \frac{e^{-M}}{kd}\sum_{t\in I_k}\frac{\sum_{\ell\in C_t}\big(\tilde x_\ell^\top(\nu_0-\hat\nu_k)\big)^2}{\big(1+Ne^{M}\big)^2}\\
&\quad\overset{(c)}{=} \frac{e^{-M}}{kd\,(1+Ne^{M})^2}\,\big\|\tilde{X}^{(k-1)}(\nu_0-\hat\nu_k)\big\|^2 \;\ge\; \frac{c_0(W,L_0,N)}{Nkd}\,\big\|\tilde{X}^{(k-1)}(\nu_0-\hat\nu_k)\big\|^2, \qquad (2.29)
\end{aligned}$$
where $(a)$ follows from the Cauchy–Schwarz (Jensen) inequality, $\big(\sum_\ell e^{\tilde u^0_{\ell t}}\tilde x_\ell^\top(\nu_0-\hat\nu_k)\big)^2 \le \big(\sum_\ell e^{\tilde u^0_{\ell t}}\big)\big(\sum_\ell e^{\tilde u^0_{\ell t}}(\tilde x_\ell^\top(\nu_0-\hat\nu_k))^2\big)$; $(b)$ is because $|\tilde u^0_{\ell t}|\le M$, by definition (2.27) and the assumption $\|\tilde\nu\|\le W$; and in $(c)$ we define $n_k\equiv\sum_{t\in I_k}|C_t|\le Nkd$ and construct $\tilde{X}^{(k-1)}$, of size $n_k$ by $2d$, by stacking the features $\tilde x_\ell$, $\ell\in C_t$, for $t\in I_k$, row-wise.
By optimality of $\hat\nu_k$ and the second-order Taylor expansion we have
$$0 \;\le\; \mathcal{L}(\nu_0)-\mathcal{L}(\hat\nu_k) \;=\; -\langle\nabla\mathcal{L}(\nu_0),\,\hat\nu_k-\nu_0\rangle - \frac{1}{2}\langle\hat\nu_k-\nu_0,\,\nabla^2\mathcal{L}(\tilde\nu)(\hat\nu_k-\nu_0)\rangle, \qquad (2.30)$$
for some $\tilde\nu$ on the segment between $\nu_0$ and $\hat\nu_k$.
Therefore, by (2.30) and (2.29) we arrive at
$$\frac{c_0(W,L_0,N)}{2Nkd}\,\big\|\tilde{X}^{(k-1)}(\nu_0-\hat\nu_k)\big\|^2 \;\le\; \|\nabla\mathcal{L}(\nu_0)\|\,\|\hat\nu_k-\nu_0\|.$$
Using the bound on the gradient given by (2.28), we get that with probability at least $1-d^{-0.5}|I_k|^{-1.5}$,
$$\frac{c_0(W,L_0,N)}{2Nkd}\,\big\|\tilde{X}^{(k-1)}(\nu_0-\hat\nu_k)\big\|^2 \;\le\; \lambda_k\,\|\hat\nu_k-\nu_0\|. \qquad (2.31)$$
Next, in order to lower bound the left-hand side of (2.31), we first lower bound the minimum eigenvalue of $\hat\Sigma_k$, the empirical second moment of $\tilde{X}^{(k-1)}$, defined as
$$\hat\Sigma_k \;\equiv\; \frac{1}{n_k}\,\big(\tilde{X}^{(k-1)}\big)^\top\tilde{X}^{(k-1)} \;=\; \frac{1}{n_k}\sum_{t\in I_k,\,\ell\in C_t}\tilde{x}_\ell\tilde{x}_\ell^\top,$$
where we recall that $\tilde{x}_\ell=[x_\ell^\top,\,-p_{\ell t}x_\ell^\top]^\top$ for $\ell\in C_t$.
Since the rows of $\tilde{X}^{(k-1)}$ are bounded, they are subgaussian. Using [86, Remark 5.40], there exist universal constants $c$ and $C$ such that for every $m\ge 0$, the following holds with probability at least $1-2e^{-cm^2}$:
$$\big\|\hat\Sigma_k-\tilde\Sigma\big\|_{\mathrm{op}} \;\le\; \max(\delta,\delta^2), \qquad\text{where}\qquad \delta = C\sqrt{\frac{d}{n_k}}+\frac{m}{\sqrt{n_k}}, \qquad (2.32)$$
where $\tilde\Sigma=\mathbb{E}[\tilde{x}_\ell\tilde{x}_\ell^\top]\in\mathbb{R}^{2d\times 2d}$ is the covariance of these vectors. We proceed by computing
$\tilde\Sigma\equiv\mathbb{E}[\tilde{x}_\ell\tilde{x}_\ell^\top]$. Recall that for periods $t\in I_k$, the prices are set uniformly at random from $[0,B]$. Hence, letting $\mu=\mathbb{E}[x_\ell]$ and $\Sigma\equiv\mathbb{E}[x_\ell x_\ell^\top]$ denote the population first and second moments of $x_\ell$, we have
$$\tilde\Sigma \;=\; \mathbb{E}[\tilde{x}_\ell\tilde{x}_\ell^\top] \;=\; \begin{pmatrix} \Sigma & -\frac{B}{2}\mu\\ -\frac{B}{2}\mu^\top & \frac{B^2}{3}\end{pmatrix}. \qquad (2.33)$$
The Schur complement of $\tilde\Sigma$, with respect to the block $\Sigma$, reads as
$$\frac{B^2}{3}-\frac{B^2}{4}\,\mu^\top\Sigma^{-1}\mu \;\ge\; \frac{B^2}{3}-\frac{B^2}{4} \;=\; \frac{B^2}{12},$$
where we used the fact that $\mu^\top\Sigma^{-1}\mu\le 1$, which follows readily by looking at the Schur complement of the second moment of $a=[x_t^\top,1]^\top$, i.e., $\mathbb{E}[aa^\top]=\begin{pmatrix}\Sigma&\mu\\ \mu^\top&1\end{pmatrix}$, and using the fact that the Schur complement of a positive semidefinite matrix is nonnegative. As a result of (2.33), the singular values of $\tilde\Sigma$ are larger than $\tilde{c}_{\min}\equiv\min(c_{\min},B^2/12)$.
Then, for $n_k>c_0 d$, with probability at least $1-2e^{-c_2 n_k}$, the following is true:
$$\big\|\hat\Sigma_k-\tilde\Sigma\big\|_{\mathrm{op}} \;\le\; \frac{1}{2}\tilde{c}_{\min}, \qquad (2.34)$$
and hence the minimum singular value of $\hat\Sigma_k$ is at least $\tilde{c}_{\min}/2$. Using this in Equation (2.31), we get that with probability at least $1-d^{-0.5}|I_k|^{-1.5}-2e^{-c_2 n_k}$,
$$\frac{1}{2}\tilde{c}_{\min}\,\|\hat\nu_k-\nu_0\|^2 \;\le\; \frac{1}{n_k}\,\big\|\tilde{X}^{(k-1)}(\hat\nu_k-\nu_0)\big\|^2. \qquad (2.35)$$
Combining (2.35) and (2.31) leads to
$$\frac{n_k\,c_0(W,L_0,N)}{4Nkd}\,\tilde{c}_{\min}\,\|\hat\nu_k-\nu_0\|^2 \;\le\; \frac{c_0(W,L_0,N)}{2Nkd}\,\big\|\tilde{X}^{(k-1)}(\hat\nu_k-\nu_0)\big\|^2 \;\le\; \lambda_k\,\|\nu_0-\hat\nu_k\|, \qquad (2.36)$$
which gives that with probability at least $1-d^{-2}k^{-1.5}-2e^{-c_2 kd}$, the following holds true:
$$\|\hat\nu_k-\nu_0\| \;\le\; \frac{4Nkd}{n_k\,c_0\,\tilde{c}_{\min}}\,\lambda_k \;\le\; \frac{4N}{c_0\,\tilde{c}_{\min}}\,\lambda_k.$$
The proof is complete.
2.10 Proof of Lemma 2.5.4

In this lemma, we aim at bounding the revenue loss (against the clairvoyant policy) in terms of the distance between the posted price vector and the optimal one posted by the clairvoyant policy. By Taylor expansion,
$$\mathrm{rev}_t(p_t) \;=\; \mathrm{rev}_t(p^*_t) + \nabla\mathrm{rev}_t(p^*_t)^\top(p_t-p^*_t) + \frac{1}{2}(p_t-p^*_t)^\top\nabla^2\mathrm{rev}_t(\tilde{p})\,(p_t-p^*_t), \qquad (2.37)$$
for some $\tilde{p}$ between $p_t$ and $p^*_t$. Note that $p^*_t=\arg\max_p\mathrm{rev}_t(p)$, thus $\nabla\mathrm{rev}_t(p^*_t)=0$ and the first-order term in the Taylor expansion vanishes.

In order to prove the result, it suffices to show that the operator norm $\|\nabla^2\mathrm{rev}_t(\tilde{p})\|_2$ is bounded. Fix $i,j\in C_t$. We have
$$\frac{\partial\,\mathrm{rev}_t(p)}{\partial p_i} \;=\; \frac{e^{u^0_{it}}\,(1-\beta_i p_i)}{1+\sum_{\ell\in C_t}e^{u^0_{\ell t}}} + \frac{\beta_i\,e^{u^0_{it}}\sum_{k\in C_t}p_k\,e^{u^0_{kt}}}{\big(1+\sum_{\ell\in C_t}e^{u^0_{\ell t}}\big)^2}, \qquad (2.38)$$
with $\beta_i=\langle x_i,\gamma_0\rangle$. Taking the derivative with respect to $p_j$, we get
$$\frac{\partial^2\mathrm{rev}_t(p)}{\partial p_i\,\partial p_j} \;=\; \frac{e^{u^0_{jt}}}{\big(1+\sum_{\ell\in C_t}e^{u^0_{\ell t}}\big)^2}\Big[\beta_j\,(1-\beta_i p_i)\,e^{u^0_{it}} + \beta_i\,e^{u^0_{it}}\,(1-\beta_j p_j)\Big] + \frac{2\beta_i\beta_j\,e^{u^0_{it}}\,e^{u^0_{jt}}\sum_{k\in C_t}p_k\,e^{u^0_{kt}}}{\big(1+\sum_{\ell\in C_t}e^{u^0_{\ell t}}\big)^3}. \qquad (2.39)$$
By Lemma 2.9.1, we have $p_{it}\le P$. Also, by (2.27), we have $|u^0_{\ell t}|\le M$. In addition, $0\le\beta_i\le\|x_i\|_\infty\|\gamma_0\|_1\le W$. Since $P$ and $M$ are constants depending only on $W$ and $L_0$, there exists a constant $c_1(W,L_0)>0$ such that
$$\Big|\frac{\partial^2\mathrm{rev}_t}{\partial p_i\,\partial p_j}\Big| \;\le\; c_1(W,L_0), \qquad (2.40)$$
uniformly over $i,j\in C_t$. We next bound the operator norm of $\nabla^2\mathrm{rev}_t(p)$. Note that for a matrix $A\in\mathbb{R}^{N\times N}$, we have
$$\|A\|_2 \;=\; \sup_{\|u\|\le 1}|u^\top A u| \;\le\; \sup_{\|u\|\le 1}\Big\{\sum_{i,j=1}^N|A_{ij}|\,|u_i|\,|u_j|\Big\} \;\le\; |A|_\infty\,\sup_{\|u\|\le 1}\|u\|_1^2 \;\le\; N\,|A|_\infty, \qquad (2.41)$$
where $|A|_\infty=\max_{1\le i,j\le N}|A_{ij}|$. Therefore, the result follows by using (2.40).
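The bound (2.41), $\|A\|_2\le N\,|A|_\infty$, holds for any matrix and is easy to test numerically. The sketch below draws an arbitrary random matrix (purely illustrative) and estimates its spectral norm by power iteration on $A^\top A$.

```python
import math
import random

# Check (2.41): ||A||_2 <= N * max_{i,j} |A_{ij}| for an arbitrary matrix.
random.seed(0)
N = 6
A = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
max_abs = max(abs(A[i][j]) for i in range(N) for j in range(N))

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

At = [list(row) for row in zip(*A)]  # transpose of A

# Power iteration on A^T A converges to the top right-singular vector,
# so ||A v|| approximates the spectral norm ||A||_2.
v = [1.0] * N
for _ in range(500):
    w = matvec(At, matvec(A, v))
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
spec_norm = math.sqrt(sum(x * x for x in matvec(A, v)))
```

Power iteration can only underestimate the true spectral norm, so the assertion below is a conservative check of (2.41).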
2.11 Proof of Proposition 2.6.2

Define the purchase probability $q(p,\theta)$ and the revenue function $r(p,\theta)$ as
$$q(p,\theta) \;=\; \frac{e^{\theta(1-p)}}{1+e^{\theta(1-p)}}, \qquad r(p,\theta) \;=\; p\,\frac{e^{\theta(1-p)}}{1+e^{\theta(1-p)}}. \qquad (2.42)$$
We next find the optimal price corresponding to parameter $\theta$. Write
$$r'(p,\theta) \;=\; 1 - \frac{1+\theta p}{1+e^{\theta(1-p)}} + \frac{\theta p}{\big(1+e^{\theta(1-p)}\big)^2}, \qquad (2.43)$$
which implies that the optimal price $p^*(\theta)$ satisfies the following relation:
$$\theta\,p^*(\theta) \;=\; 1 + e^{\theta(1-p^*(\theta))}. \qquad (2.44)$$
Our next lemma is on some properties of the problem class $\mathcal{C}$ considered in the statement of Proposition 2.6.2. Its proof is given in Appendix 2.11.2.

Lemma 2.11.1. For all $p\in\mathcal{P}=[2/5,4/3]$ and $\theta\in\Theta=[3/2,5/2]$:

(i) For $\theta_0=2$, we have $|q(p,\theta)-q(p,\theta_0)| \le |p^*(\theta_0)-p|\,|\theta-\theta_0|$.

(ii) $r(p^*(\theta))-r(p) \ge \frac{1}{520}\,(p^*(\theta)-p)^2$.

(iii) $|p^*(\theta)-p^*(\theta_0)| \ge 0.2\,|\theta-\theta_0|$.
In order to establish a lower bound on the regret of a policy, we need to quantitatively measure the uncertainty of the policy about the unknown model parameter $\theta$. To this end, we leverage the notion of the KL divergence of two probability measures $f_0$ and $f_1$ defined on a discrete sample space $\Omega$:
$$\mathrm{KL}(f_0,f_1) \;\equiv\; \sum_{\omega\in\Omega}f_0(\omega)\,\log\frac{f_0(\omega)}{f_1(\omega)}. \qquad (2.45)$$
For any pricing policy $\pi$ and parameter $\theta\in\Theta$, we let $f^{\pi,\theta}_t:\{0,1\}^t\to[0,1]$ be the probability distribution of the customer purchase responses $Y_t=(Y_1,\ldots,Y_t)$ under pricing policy $\pi$ and model parameter $\theta$. Formally, for all $y_t=(y_1,\ldots,y_t)\in\{0,1\}^t$,
$$f^{\pi,\theta}_t(y_t) \;=\; \prod_{i=1}^t q(p_i,\theta)^{y_i}\,\big(1-q(p_i,\theta)\big)^{1-y_i},$$
where $p_i=\pi(y_{i-1})$ is the price posted under policy $\pi$.
The next lemma shows that in order to reduce the uncertainty about the model parameter $\theta_0$ (equivalently, increasing $\mathrm{KL}(f^{\pi,\theta_0}_t,f^{\pi,\theta}_t)$), the policy incurs a large regret. We refer to Appendix 2.11.3 for its proof.

Lemma 2.11.2. For any $\theta\in\Theta$, $t\ge 1$, $\theta_0=2$ and any policy $\pi$ setting prices in $\mathcal{P}$, we have
$$\mathrm{KL}(f^{\pi,\theta_0}_t,f^{\pi,\theta}_t) \;\le\; 3640\,(\theta_0-\theta)^2\,\mathrm{Regret}_\pi(t,\theta_0),$$
where $\mathrm{Regret}_\pi(t,\theta_0)$ indicates the total expected regret of the pricing policy $\pi$, up until step $t$, under model parameter $\theta_0$.
The next lemma is similar to [13, Lemma 3.4] and its proof follows along the same lines for our purchase probability function (in particular, using Lemma 2.11.1 (ii, iii)).

Lemma 2.11.3. Let $\pi$ be any pricing policy setting prices in $\mathcal{P}$. Then, for any $T\ge 2$ and model parameters $\theta_0=2$ and $\theta_1=\theta_0+\frac{1}{4}T^{-1/4}$, we have
$$\mathrm{Regret}_\pi(T,\theta_0)+\mathrm{Regret}_\pi(T,\theta_1) \;\ge\; \frac{\sqrt{T}}{2080\,(41)^2}\,e^{-\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T)}.$$
Armed with Lemmas 2.11.2 and 2.11.3, we are ready to prove Theorem 2.5.5. Let $\theta_0=2$ and $\theta_1=\theta_0+\frac{1}{4}T^{-1/4}$. Then, by non-negativity of the KL divergence and by virtue of Lemma 2.11.2, we have
$$\mathrm{Regret}_\pi(T,\theta_0)+\mathrm{Regret}_\pi(T,\theta_1) \;\ge\; \frac{16}{3640}\,\sqrt{T}\,\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T).$$
Combining this bound with Lemma 2.11.3 we get
$$\begin{aligned}
2\big\{\mathrm{Regret}_\pi(T,\theta_0)+\mathrm{Regret}_\pi(T,\theta_1)\big\} &\ge \frac{16}{3640}\,\sqrt{T}\,\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T) + \frac{\sqrt{T}}{2080\,(41)^2}\,e^{-\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T)}\\
&\ge \frac{\sqrt{T}}{2080\,(41)^2}\Big(\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T) + e^{-\mathrm{KL}(f^{\pi,\theta_0}_T,\,f^{\pi,\theta_1}_T)}\Big)\\
&\ge \frac{\sqrt{T}}{2080\,(41)^2},
\end{aligned}$$
where we used that $a+e^{-a}\ge 1$ for all $a\ge 0$. Hence,
$$\max_{\theta\in\{\theta_0,\theta_1\}}\mathrm{Regret}_\pi(T,\theta) \;\ge\; \frac{\sqrt{T}}{2\cdot 2080\,(41)^2} \;>\; \frac{\sqrt{T}}{3\,(41)^4},$$
which completes the proof.
2.11.1 Proof of Lemma 2.6.1

By (2.44) we have $\theta\,p^*(\theta)\ge 1$. Hence, for $\theta\in\Theta$ we have $p^*(\theta)\ge 1/\theta_{\max}$. To prove the other side, assume that $p^*(\theta)>1$ for some $\theta\in\Theta$. Then, by (2.44), we have
$$\theta\,p^*(\theta) \;=\; 1+e^{\theta(1-p^*(\theta))} \;<\; 2,$$
which gives $p^*(\theta)<2/\theta\le 2/\theta_{\min}$. As a result, $p^*(\theta)\le\max(1,\,2/\theta_{\min})$, completing the proof of the lemma.
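The price bounds of Lemma 2.6.1 can be verified numerically: for $\Theta=[3/2,5/2]$ the bounds $1/\theta_{\max}=2/5$ and $\max(1,2/\theta_{\min})=4/3$ pin $p^*(\theta)$ inside $\mathcal{P}$. The sketch below solves (2.44) on a grid of $\theta$ values by bisection.

```python
import math

# p*(theta) is the root of g(p) = theta*p - 1 - exp(theta*(1 - p)),
# which is strictly increasing in p (g'(p) = theta*(1 + exp(theta*(1-p))) > 0),
# so bisection converges to the unique solution of (2.44).
def p_star(theta, lo=0.0, hi=10.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if theta * mid - 1 - math.exp(theta * (1 - mid)) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Optimal prices over a grid of theta in [3/2, 5/2].
prices = [p_star(1.5 + 0.01 * k) for k in range(101)]
```

For $\theta_0=2$, (2.44) gives $2\cdot 1 = 1 + e^0$, so $p^*(2)=1$ exactly, which the test also checks.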
2.11.2 Proof of Lemma 2.11.1

To prove the first item, write
$$\begin{aligned}
|q(p,\theta)-q(p,\theta_0)| &= \left|\frac{1}{1+e^{-\theta(1-p)}}-\frac{1}{1+e^{-\theta_0(1-p)}}\right| \;\le\; \frac{\big|e^{\theta(1-p)}-e^{\theta_0(1-p)}\big|}{\big(1+e^{\theta(1-p)}\big)\big(1+e^{\theta_0(1-p)}\big)}\\
&\le \frac{e^{\max(\theta,\theta_0)\,|1-p|}\,\big(1-e^{-|\theta_0-\theta|\,|1-p|}\big)}{\big(1+e^{\theta(1-p)}\big)\big(1+e^{\theta_0(1-p)}\big)} \;\le\; 1-e^{-|\theta_0-\theta|\,|1-p|}\\
&\le |\theta_0-\theta|\,|1-p| \;=\; |\theta_0-\theta|\,|p^*(\theta_0)-p|,
\end{aligned}$$
where the penultimate inequality follows since $1-e^{-x}\le x$ for $x\ge 0$, and the last equality holds since $p^*(\theta_0)=1$ for $\theta_0=2$, by using (2.44).
To prove the second item, by some algebraic calculation,
$$r''(p) \;=\; \theta\,e^{\theta(1-p)}\,\frac{\theta p\,\big(1-e^{\theta(1-p)}\big) - 2\big(1+e^{\theta(1-p)}\big)}{\big(1+e^{\theta(1-p)}\big)^3}. \qquad (2.46)$$
Since $\theta\in\Theta$ and $p\in\mathcal{P}$, if $p\le 1$ then $\theta p\,(1-e^{\theta(1-p)})-2(1+e^{\theta(1-p)})\le -4$; otherwise,
$$\theta p\,\big(1-e^{\theta(1-p)}\big) - 2\big(1+e^{\theta(1-p)}\big) \;\le\; \frac{10}{3}\big(1-e^{\theta(1-p)}\big) - 2\big(1+e^{\theta(1-p)}\big) \;=\; \frac{4}{3}-\frac{16}{3}\,e^{\theta(1-p)} \;\le\; \frac{4}{3}-\frac{16}{3}\,e^{-5/6} \;<\; -0.98.$$
Combining these two cases and using the characterization (2.46), we arrive at
$$r''(p) \;\le\; -0.98\,\frac{\theta\,e^{\theta(1-p)}}{\big(1+e^{\theta(1-p)}\big)^3} \;\le\; -0.98\cdot\frac{3}{2}\cdot\frac{e^{-5/6}}{\big(1+e^{3/2}\big)^3} \;<\; -\frac{1}{260}, \qquad (2.47)$$
where we used that $e^{\theta(1-p)}\in[e^{-5/6},\,e^{3/2}]$ for $\theta\in\Theta$ and $p\in\mathcal{P}$.
Now by the Taylor expansion of $r(p)$ around $p^*(\theta)$ we obtain
$$r(p) \;=\; r(p^*(\theta)) + r'(p^*(\theta))\,(p-p^*(\theta)) + \frac{1}{2}\,r''(\tilde{p})\,(p-p^*(\theta))^2,$$
for some $\tilde{p}$ between $p$ and $p^*(\theta)$. Note that by optimality of $p^*(\theta)$, we have $r'(p^*(\theta))=0$. Further, (2.47) implies that
$$r(p^*(\theta)) - r(p) \;\ge\; \frac{1}{520}\,(p-p^*(\theta))^2.$$
Finally, to prove item (iii), taking the derivative of both sides of (2.44) with respect to $\theta$ gives us
$$p^*(\theta) + \theta\,\frac{d}{d\theta}p^*(\theta) \;=\; \Big(1-p^*(\theta)-\theta\,\frac{d}{d\theta}p^*(\theta)\Big)\,e^{\theta(1-p^*(\theta))},$$
and therefore, by rearranging the terms,
$$\frac{d}{d\theta}p^*(\theta) \;=\; -\frac{1}{\theta}\left(p^*(\theta)+\frac{e^{\theta(1-p^*(\theta))}}{1+e^{\theta(1-p^*(\theta))}}\right).$$
Since $e^{\theta(1-p)}\ge e^{-5/6}$ for $\theta\in\Theta$ and $p\in\mathcal{P}$, we get
$$\left|\frac{d}{d\theta}p^*(\theta)\right| \;\ge\; \frac{2}{5}\left(\frac{2}{5}+\frac{e^{-5/6}}{1+e^{-5/6}}\right) \;\ge\; 0.2.$$
The result then follows by an application of the Mean Value Theorem.
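The curvature bound (2.47), which drives item (ii), is simple to verify on a grid. The sketch below evaluates $r''$ via the closed form (2.46) over the rectangle $\mathcal{P}\times\Theta$; the grid resolution is arbitrary.

```python
import math

# Grid check of (2.47): r''(p) < -1/260 for theta in [3/2, 5/2] and
# p in [2/5, 4/3], using the closed form (2.46) with E = exp(theta*(1-p)).
def r_pp(p, theta):
    E = math.exp(theta * (1 - p))
    return theta * E * (theta * p * (1 - E) - 2 * (1 + E)) / (1 + E) ** 3

P_LO, P_HI = 0.4, 4.0 / 3.0
T_LO, T_HI = 1.5, 2.5
vals = [r_pp(P_LO + (P_HI - P_LO) * i / 50, T_LO + (T_HI - T_LO) * j / 50)
        for i in range(51) for j in range(51)]
```

In this range the curvature is in fact far more negative than $-1/260$ (on the order of $-0.3$ or below), so the bound stated in the proof is comfortably conservative.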
2.11.3 Proof of Lemma 2.11.2

By employing the chain rule for KL divergence [24, Theorem 2.5.3],
$$\begin{aligned}
\mathrm{KL}(f^{\pi,\theta_0}_t,\,f^{\pi,\theta}_t) &= \sum_{s=1}^t \mathrm{KL}\big(f^{\pi,\theta_0}_s,\,f^{\pi,\theta}_s \mid Y_{s-1}\big)\\
&= \sum_{s=1}^t\ \sum_{y_s\in\{0,1\}^s} f^{\pi,\theta_0}_s(y_s)\,\log\Bigg(\frac{f^{\pi,\theta_0}_s(y_s\mid y_{s-1})}{f^{\pi,\theta}_s(y_s\mid y_{s-1})}\Bigg)\\
&= \sum_{s=1}^t\ \sum_{y_{s-1}\in\{0,1\}^{s-1}} f^{\pi,\theta_0}_{s-1}(y_{s-1})\sum_{y_s\in\{0,1\}} f^{\pi,\theta_0}_s(y_s\mid y_{s-1})\,\log\Bigg(\frac{f^{\pi,\theta_0}_s(y_s\mid y_{s-1})}{f^{\pi,\theta}_s(y_s\mid y_{s-1})}\Bigg)\\
&= \sum_{s=1}^t\ \sum_{y_{s-1}\in\{0,1\}^{s-1}} f^{\pi,\theta_0}_{s-1}(y_{s-1})\,\mathrm{KL}\big(f^{\pi,\theta_0}_s(\cdot\mid y_{s-1}),\,f^{\pi,\theta}_s(\cdot\mid y_{s-1})\big)\\
&\le \sum_{s=1}^t \frac{1}{q(p_s,\theta)\,(1-q(p_s,\theta))}\ \sum_{y_{s-1}\in\{0,1\}^{s-1}} f^{\pi,\theta_0}_{s-1}(y_{s-1})\,\big(q(p_s,\theta_0)-q(p_s,\theta)\big)^2\\
&\le 7\sum_{s=1}^t\ \sum_{y_{s-1}\in\{0,1\}^{s-1}} f^{\pi,\theta_0}_{s-1}(y_{s-1})\,\big(q(p_s,\theta_0)-q(p_s,\theta)\big)^2,
\end{aligned}$$
where in the penultimate step we used the inequality $\mathrm{KL}(B_1,B_2)\le\frac{(q_1-q_2)^2}{q_2(1-q_2)}$ for two Bernoulli random variables $B_1$ and $B_2$ with parameters $q_1$ and $q_2$, respectively [79, Corollary 3.1]. In the last step we used that $q(p_s,\theta)\,(1-q(p_s,\theta))\ge 1/7$ for $p_s\in\mathcal{P}$ and $\theta\in\Theta$. Next, by using Lemma 2.11.1 (item (i)) we obtain
$$\mathrm{KL}(f^{\pi,\theta_0}_t,\,f^{\pi,\theta}_t) \;\le\; 7\,(\theta_0-\theta)^2\sum_{s=1}^t\ \sum_{y_{s-1}\in\{0,1\}^{s-1}} f^{\pi,\theta_0}_{s-1}(y_{s-1})\,\big(p^*(\theta_0)-p_s\big)^2 \;=\; 7\,(\theta_0-\theta)^2\sum_{s=1}^t \mathbb{E}_{\theta_0}\big[(p^*(\theta_0)-p_s)^2\big],$$
where we used the observation that $p_s$ is a measurable function of $y_{s-1}$, and $\mathbb{E}_{\theta_0}$ denotes expectation with respect to the $f^{\pi,\theta_0}_{s-1}$ measure. Next, by using Lemma 2.11.1 (item (ii)), we get
$$\mathrm{KL}(f^{\pi,\theta_0}_t,\,f^{\pi,\theta}_t) \;\le\; 3640\,(\theta_0-\theta)^2\sum_{s=1}^t \mathbb{E}_{\theta_0}\big[r(p^*(\theta_0))-r(p_s)\big] \;\le\; 3640\,(\theta_0-\theta)^2\,\mathrm{Regret}_\pi(t,\theta_0).$$
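Two ingredients of the proof above are directly checkable: the Bernoulli KL bound of [79, Corollary 3.1] (a special case of the standard $\mathrm{KL}\le\chi^2$ inequality), and the variance floor $q(p,\theta)(1-q(p,\theta))\ge 1/7$ over $\mathcal{P}\times\Theta$. The sketch below verifies both on grids; resolutions are arbitrary.

```python
import math

# (1) KL(Bern(q1), Bern(q2)) <= (q1 - q2)^2 / (q2 * (1 - q2)).
def kl_bern(q1, q2):
    return (q1 * math.log(q1 / q2)
            + (1 - q1) * math.log((1 - q1) / (1 - q2)))

grid = [0.05 * k for k in range(1, 20)]  # probabilities in (0, 1)
kl_ok = all(kl_bern(a, b) <= (a - b) ** 2 / (b * (1 - b)) + 1e-12
            for a in grid for b in grid)

# (2) q(p, theta) * (1 - q(p, theta)) >= 1/7 on P x Theta.
def q(p, theta):
    E = math.exp(theta * (1 - p))
    return E / (1 + E)

variance_floor = min(
    q(0.4 + (4/3 - 0.4) * i / 40, 1.5 + 1.0 * j / 40)
    * (1 - q(0.4 + (4/3 - 0.4) * i / 40, 1.5 + 1.0 * j / 40))
    for i in range(41) for j in range(41))
```

The minimum of $q(1-q)$ is attained at the corner $p=2/5$, $\theta=5/2$ (where $e^{\theta(1-p)}=e^{3/2}$), giving roughly $0.149 > 1/7$.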
Chapter 3
Controlling the False Split Rate in Tree-Based Aggregation
1
A common challenge in data modeling is striking the right balance between models that are suf-
ficiently flexible to adequately describe the phenomenon being studied and those that are simple
enough to be easily interpretable. We consider this tradeoff within the increasingly common con-
text in which data measurements can be associated with the leaves of a known tree. Such data
structures arise in myriad domains from business to science:
- The U.S. Bureau of Labor Statistics' Standard Occupational Classification system arranges jobs as leaves of a tree [83].
- The North American Industry Classification System (NAICS) provides a hierarchical organization of businesses [82].
- In retail, product hierarchies arrange the types of items sold within nested categories.
- Geographic data arrange locations within a hierarchy of regions from local to global (e.g., within the US, from ZIP code to county to state).
- Ecological data arrange species (or even finer-grained types) as leaves of a taxonomic tree.
1 S. Shao, J. Bien and A. Javanmard, "Controlling the False Split Rate in Tree-Based Aggregation," arXiv:2108.05350.
Measurements in low-level branches of the tree may share a lot in common, and so—in the
absence of evidence to the contrary—a data modeler would favor a simpler (literally “high-level”)
description in which distinctions within low-level branches would not be made; on the other hand,
when there is evidence of a difference between sibling branches, then modeling them as distinct
from each other may be warranted. We use the term tree-based aggregation to refer to the process
of deciding which branches’ leaves should be treated as the same (i.e., aggregated) and which
should be treated as different from each other (i.e., split apart).
We formulate the general tree-based aggregation problem as a multiple testing problem involving a parameter vector $\theta^*$ whose elements correspond to leaves of a known tree. Our goal is to partition the leaves based on branches of the tree so that the set of parameters in each group share the same value. Every non-leaf node has an associated null hypothesis that states that all of its leaves have the same parameter value. Type I errors correspond to splitting up groups unnecessarily; type II errors correspond to aggregating groups with different parameter values.
The rest of the chapter is organized as follows. In Section 3.1, we define an error measure,
called the false split rate (FSR), that corresponds to the fraction of splits made that were unneces-
sary. Within our tree-based setting, we show that controlling the FSR is related to controlling the
false discovery rate [7], with equivalence in the special case of a binary tree.
In Section 3.2, we propose a tree-based aggregation procedure that leverages this connection.
Our algorithm proceeds in a top-down fashion, only testing hypotheses of nodes whose parents
were rejected. Such an approach to hierarchical testing originates with [91], which lays the foun-
dation for the multiple testing problem on trees. Our procedure is closely related to more recent
work by [60], which increases power using carefully chosen node-specific thresholds that depend
on where the hypothesis is located in the hierarchy. This work was in turn further developed in
[71]. Other work involving various forms of a hierarchy-based multiple testing problem (although
not having to do with aggregation in the sense of this chapter) include [12, 40, 52]. While these
works focus on FDR control, another line of work uses hierarchical testing for gradually locating
non-zero variables while controlling the family-wise error rate [65, 37].
In Section 3.3, we consider two concrete scenarios where tree-based aggregation is natural. In the first scenario, the parameter vector $\theta^*$ represents the mean of a scalar signal measured on the leaves of the tree. In the second scenario, $\theta^*$ is a (potentially high-dimensional) vector of regression coefficients where features are associated with leaves of the tree.
Finally, we demonstrate through simulation studies (Section 3.4) and real data experiments
(Section 3.5) the empirical merits of our framework and algorithm. We consider two applica-
tions, corresponding to the two concrete scenarios of tree-based aggregation. The first application
involves aggregation of stocks (with respect to the NAICS’s sector-industry tree) based on mean
log-volatility. The second application aggregates neighborhoods of New York City (with respect to
a geographically based hierarchy) based on a regression vector for predicting taxi drivers’ monthly
total fares based on the frequency of different starting locations.
Notation: For an integer $p$, we use the shorthand $[p]=\{1,2,\ldots,p\}$. For two real numbers $a,b$, we write $a\wedge b$ and $a\vee b$ for their minimum and maximum, respectively. We use $e_i$ to denote the $i$-th standard basis element, e.g., $e_1=(1,0,\ldots,0)^\top$. The $\ell_q$ norm of a vector $x\in\mathbb{R}^p$ is defined as $\|x\|_q=\big(\sum_{j=1}^p|x_j|^q\big)^{1/q}$ for $q>0$. For a set $S\subseteq[p]$, $x_S=(x_i)_{i\in S}$ is the vector obtained by restricting the vector $x$ to the indices in the set $S$.
We use the term "tree" throughout the chapter to denote a rooted directed tree; more specifically, a directed graph whose underlying undirected graph has no cycle, with one node having no incoming edges, referred to as the root. We use the terms parent and child in the natural way: if $u\to v$ is an edge, then $u$ is the parent of $v$ and $v$ is a child of $u$. Each node can have multiple children but only one parent. The nodes with no children are called leaves, and we write $\mathcal{L}$ for this set of nodes. Given a tree $\mathcal{T}$, we use the notation $\mathcal{T}_u$ to represent its subtree rooted at node $u\in\mathcal{T}$ and $\mathcal{L}_u$ to refer to this subtree's set of leaves.
3.1 Problem setup
3.1.1 A multiple hypothesis testing formulation for aggregation
Let $\mathcal{T}$ be a known tree with $p$ leaves, each corresponding to a coordinate of the unobserved parameter vector $\theta^*\in\mathbb{R}^p$. We formulate the tree-aggregation task as a multiple hypothesis testing problem: to each internal (non-leaf) node $u$ of the tree we assign a null hypothesis
$$H^0_u:\ \text{all elements of }\theta^*_{\mathcal{L}_u}\text{ have the same value}, \qquad (3.1)$$
where $\theta^*_{\mathcal{L}_u}$ is the subvector of $\theta^*$ restricted to the leaves of the subtree rooted at $u$. Rejecting the null hypothesis $H^0_u$ implies that the leaves under $u$ should be further split into smaller groups. Given the way the hypotheses are defined, a logical constraint to impose on the output of a testing procedure is the following:
Constraint 1. The parent of a rejected node must itself be rejected.
By Constraint 1, the set of rejected nodes forms a subtree $\mathcal{T}_{\mathrm{rej}}$ of $\mathcal{T}$ (sharing the same root as $\mathcal{T}$), and furthermore the subtrees rooted at the leaves of $\mathcal{T}_{\mathrm{rej}}$ represent the aggregated groups. Our goal is to develop testing procedures that result in high-quality splits of the parameters. In order to measure the performance of an aggregation (or equivalently, a set of splits), we propose a new criterion as follows.
False Split Rate (FSR). Recall that we are interested in splits that can be expressed as a combination of branches of the tree $\mathcal{T}$. Therefore, if we order the leaves from left to right, then whenever two leaves are in the same group, the leaves between them are also in that group. For partitioning an ordered sequence of $p$ leaves, we have $p-1$ potential positions for the barriers between groups. We use a vector $J\in\{0,1\}^{p-1}$ to denote whether the corresponding barrier exists at each position. Each realization of such a vector results in a unique splitting of the leaves, and vice
Figure 3.1: An example of a partition of the leaves. There are $p=12$ leaves in total, hence 11 potential barriers. The solid barriers indicate the true splitting of the leaves, while the dashed barriers give the achieved splitting. In terms of the vector of barriers, $\mathrm{FDP}=\frac{1}{3}$ and power $=\frac{2}{3}$. In terms of the splitting of the leaves, $\mathrm{FSP}=\frac{5-4}{4-1}=\frac{1}{3}$ and power $=1-\frac{5-4}{4-1}=\frac{2}{3}$.
versa. Let $J^*$ and $\hat{J}$ respectively denote the corresponding vectors for the true splitting $\mathcal{C}^*$ and an achieved splitting $\hat{\mathcal{C}}$. In Figure 3.1 we give an example with $p=12$ leaves. The solid barriers mark the true splitting, $J^*=(0,1,0,0,1,0,0,0,1,0,0)$; the dashed barriers mark the achieved splitting, $\hat{J}=(0,0,0,0,1,0,0,1,1,0,0)$.
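The numbers reported for the Figure 3.1 example can be recomputed directly from the two barrier vectors, using the barrier-level definitions that follow:

```python
# Barrier vectors of the Figure 3.1 example (p = 12 leaves, 11 positions).
J_star = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # true barriers
J_hat  = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # achieved barriers

false_disc = sum(1 for s, h in zip(J_star, J_hat) if s == 0 and h == 1)
true_disc = sum(1 for s, h in zip(J_star, J_hat) if s == 1 and h == 1)

fdp_b = false_disc / sum(J_hat)   # false discovery proportion: 1/3
tpp_b = true_disc / sum(J_star)   # true positive proportion:   2/3
```

The only false barrier is at position 8, out of three discoveries, giving $\mathrm{FDP}=1/3$; two of the three true barriers (positions 5 and 9) are recovered, giving power $2/3$, matching the caption.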
We can view the splitting task as a barrier discovery problem. The false discovery proportion can then be written as
$$\mathrm{FDP}_b := \frac{\big|\{j\in[p-1] : J^*_j=0,\ \hat{J}_j=1\}\big|}{\big|\{j\in[p-1] : \hat{J}_j=1\}\big|}. \qquad (3.2)$$
Likewise, the true positive proportion for the barrier discovery problem is given by
$$\mathrm{TPP}_b := \frac{\big|\{j\in[p-1] : J^*_j=1,\ \hat{J}_j=1\}\big|}{\big|\{j\in[p-1] : J^*_j=1\}\big|}. \qquad (3.3)$$
Since a set of barriers determines a certain splitting of the leaves, we can express the above quantities in terms of the resulting groups. Suppose $\hat{\mathcal{C}}=\{\hat{C}_1,\ldots,\hat{C}_M\}$ is a splitting of the leaves $[p]$, and $\mathcal{C}^*=\{C^*_1,\ldots,C^*_K\}$ is the true splitting. For each true group $C^*_i$, $i\in\{1,\ldots,K\}$, we count the number of splits of $C^*_i$ made by members of $\hat{\mathcal{C}}$, i.e., $\sum_{j=1}^M\mathbf{1}\{C^*_i\cap\hat{C}_j\neq\emptyset\}-1$. Therefore, the total number of excessive (false) splits is given by
$$\sum_{i=1}^K\Bigg(\sum_{j=1}^M\mathbf{1}\{C^*_i\cap\hat{C}_j\neq\emptyset\}-1\Bigg) \;=\; \sum_{i=1}^K\sum_{j=1}^M\mathbf{1}\{C^*_i\cap\hat{C}_j\neq\emptyset\}-K,$$
while the total number of splits is $(M-1)\vee 1$.
We define the false split proportion (FSP) as
$$\mathrm{FSP} := \frac{\sum_{i=1}^K\sum_{j=1}^M\mathbf{1}\{C^*_i\cap\hat{C}_j\neq\emptyset\}-K}{(M-1)\vee 1}. \qquad (3.4)$$
We also define the true positive proportion in a similar manner, by switching the roles of $\mathcal{C}^*$ and $\hat{\mathcal{C}}$:
$$\mathrm{TPP} := 1-\frac{\sum_{i=1}^M\sum_{j=1}^K\mathbf{1}\{\hat{C}_i\cap C^*_j\neq\emptyset\}-M}{K-1}. \qquad (3.5)$$
In the next lemma, we prove that the quantities FSP and TPP, defined in terms of groups, are equivalent to the quantities $\mathrm{FDP}_b$ and $\mathrm{TPP}_b$ for the barrier discovery problem.

Lemma 3.1.1. For the quantities FSP and TPP, given by Equations (3.4) and (3.5), and the quantities $\mathrm{FDP}_b$ and $\mathrm{TPP}_b$, given by Equations (3.2) and (3.3), the following holds true:
$$\mathrm{FSP}=\mathrm{FDP}_b, \qquad \mathrm{TPP}=\mathrm{TPP}_b.$$
We refer to Appendix 3.8.1 for the proof of Lemma 3.1.1. The false split rate (FSR) and the expected power are defined as
$$\mathrm{FSR} := \mathbb{E}(\mathrm{FSP}), \qquad \mathrm{Power} := \mathbb{E}(\mathrm{TPP}), \qquad (3.6)$$
where the expectation is with respect to the randomness in $\hat{\mathcal{C}}$, which in our context will depend on the p-values for the hypotheses of the form (3.1). In the next section we provide another characterization of the FSR in the tree-aggregation context, and in Section 3.2 we develop a testing procedure that controls the FSR at a pre-specified level $\alpha<1$.
3.1.2 FSR on a tree

While the FSR metric can be calculated for a general splitting of $p$ objects using definition (3.4), in this section we focus on splittings that can be expressed as a combination of branches of $\mathcal{T}$, as explained in the previous section. We will provide an equivalent characterization of FSP in this context in terms of specific structural properties of the tree $\mathcal{T}$.

For a testing procedure satisfying Constraint 1, the rejected nodes on the tree still maintain the tree structure. We use $\mathcal{T}_{\mathrm{rej}}$ to represent the subtree of rejected nodes on the tree $\mathcal{T}$. We also define $\deg_{\mathcal{T}}(u)$ as the (out-)degree of node $u$ on the tree $\mathcal{T}$ (the number of children of node $u$); similarly, $\deg_{\mathcal{T}_{\mathrm{rej}}}(u)$ is the degree of node $u$ on the subtree $\mathcal{T}_{\mathrm{rej}}$. We use $\mathcal{F}$ as the set of false rejections in $\mathcal{T}$. Lastly, we define $\mathcal{B}^*$ as the set of nodes whose leaf sets correspond to the true aggregation, i.e., $\mathcal{B}^*$ is such that
$$\mathcal{C}^* = \{\mathcal{L}_u \mid u\in\mathcal{B}^*\}.$$
This characterization of $\mathcal{C}^*$ stems from the assumption that the true aggregation is among the partitions allowed by the tree.
Our next lemma characterizes the number of false splits and the total number of splits in terms of the tree $\mathcal{T}$ and its subtree $\mathcal{T}_{\mathrm{rej}}$. By virtue of this lemma we have an alternative characterization of FSP (and FSR), which is more amenable to analysis.

Lemma 3.1.2. Define $V$ and $R$ as follows:
$$V := \sum_{u\in\mathcal{F}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\mathrm{rej}}}(u)\big) - |\mathcal{B}^*\cap\mathcal{F}|, \qquad (3.7)$$
$$R := \max\Bigg\{\sum_{u\in\mathcal{T}_{\mathrm{rej}}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\mathrm{rej}}}(u)\big)-1,\ 0\Bigg\}. \qquad (3.8)$$
Then $V$ and $R$ quantify the number of false splits and the total number of splits, respectively.
Figure 3.2: An example of $\mathcal{T}$ with 11 leaves (labeled $d_1,\ldots,d_{11}$) and 19 nodes in total (including leaf nodes). The dashed boxes show the true aggregation of the leaves, $\mathcal{C}^*$, into $K=5$ groups, with $\mathcal{B}^*=\{d_1,d_2,c_2,c_3,b_2\}$. The thicker edges and the nodes they connect form $\mathcal{T}_{\mathrm{rej}}$; among these rejections, the false ones constitute $\mathcal{F}$. The rejections correspond to an estimated aggregation with $M=7$ groups: $\{d_1,d_2\}$, $\{d_3,d_4\}$, $\{d_5,d_6\}$, $\{d_7\}$, $\{d_8\}$, $\{d_9\}$, $\{d_{10},d_{11}\}$. On the right branch of the tree, two false rejections lead to a nonzero false split proportion, $\mathrm{FSP}=\frac{8-5}{7-1}=\frac{3}{6}$. We have $V=(3-0)+(2-1)-1=3$ and $R=(2-2)+(3-0)+(2-1)+(3-0)-1=6$; hence $\frac{V}{R}=\mathrm{FSP}$. On the left branch of the tree, there is one missing rejection ($c_1$) that leads to a true positive proportion of $\mathrm{TPP}=1-\frac{8-7}{5-1}=\frac{3}{4}$.
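The group-level quantities of the Figure 3.2 example can be recomputed directly from the two partitions via the definitions (3.4) and (3.5):

```python
# True aggregation (K = 5 groups) and estimated aggregation (M = 7 groups)
# of the 11 leaves d1..d11, as in the Figure 3.2 example.
C_true = [{'d1'}, {'d2'}, {'d3', 'd4'}, {'d5', 'd6'},
          {'d7', 'd8', 'd9', 'd10', 'd11'}]
C_hat = [{'d1', 'd2'}, {'d3', 'd4'}, {'d5', 'd6'},
         {'d7'}, {'d8'}, {'d9'}, {'d10', 'd11'}]
K, M = len(C_true), len(C_hat)

# Definition (3.4): count pairs of groups with nonempty intersection.
overlaps = sum(1 for Ci in C_true for Cj in C_hat if Ci & Cj)
fsp = (overlaps - K) / max(M - 1, 1)          # (8 - 5) / 6 = 1/2

# Definition (3.5): same count with the roles of the partitions switched.
overlaps_hat = sum(1 for Ci in C_hat for Cj in C_true if Ci & Cj)
tpp = 1 - (overlaps_hat - M) / (K - 1)        # 1 - (8 - 7) / 4 = 3/4
```

Both overlap counts equal 8 here, reproducing $\mathrm{FSP}=3/6$ and $\mathrm{TPP}=3/4$ from the caption, and agreeing with $V/R=3/6$ computed from (3.7)-(3.8).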
Consequently, we have
$$\mathrm{FSP}=\frac{V}{R}, \quad\text{and}\quad \mathrm{FSR}=\mathbb{E}\Big[\frac{V}{R}\Big],$$
where FSP and FSR are defined as in (3.4) and (3.6).
A key quantity in the above characterization is $\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\mathrm{rej}}}(u)$, which counts the number of additional splits due to rejecting $H^0_u$. Figure 3.2 presents a concrete example illustrating the quantities and the equivalence stated in the lemma.
Remark 1. Let us stress that the FSP metric in general can be very different from the commonly used metric of false discovery rate (FDR) for multiple hypothesis testing. FDR measures the overall performance of the testing rule, including the hypotheses at the inner nodes, while FSR concerns the quality of the splitting of the leaves. Therefore, the proposed methods for controlling FDR on trees cannot be applied to control FSR. We further discuss this point in Section 3.4.1 through numerical examples (see Figure 3.5). That said, we show in the next lemma that FSP and FDP become equivalent for the special case of a binary tree.
The following lemma states the equivalence for the special case in which $\mathcal{T}$ is a binary tree. In this case, FSP corresponds exactly to the commonly used FDP, which is the ratio between the number of false rejections and the total number of rejections.
Lemma 3.1.3. For a binary tree, the quantities $V$ and $R$ given by Equations (3.7) and (3.8) simplify to
$$V = |\mathcal{F}|, \qquad (3.9)$$
and
$$R = \big|\mathcal{T}_{\mathrm{rej}}\big|. \qquad (3.10)$$
Therefore,
$$\mathrm{FSP} = \frac{|\mathcal{F}|}{|\mathcal{T}_{\mathrm{rej}}|}, \qquad \mathrm{FSR} = \mathrm{FDR} := \mathbb{E}\bigg[\frac{|\mathcal{F}|}{|\mathcal{T}_{\mathrm{rej}}|}\bigg], \qquad (3.11)$$
where we recall that $\mathcal{F}$ and $\mathcal{T}_{\mathrm{rej}}$ respectively denote the set of false rejections and the set of rejections in the multiple hypothesis testing problem of (3.1).
We defer the proofs for Lemma 3.1.2 and Lemma 3.1.3 to Appendix 3.8.
3.2 Hierarchical aggregation testing with FSR control
So far we have defined the metric FSR to measure the quality of a splitting of the leaves, and proposed an alternate characterization of it in terms of the structure of the rejected (and falsely rejected) nodes, as in Lemma 3.1.2. In this section, we introduce a new multiple testing procedure to test the null hypotheses $H^0_u$, starting from the root and proceeding down the tree. The procedure assumes that each non-leaf node $u$ has a p-value $p_u$ that is super-uniform under $H^0_u$, i.e.,
$$\mathbb{P}(p_u\le t) \;\le\; t \quad\text{for all } t\in[0,1]. \qquad (3.12)$$
Later, in Section 3.3, we discuss how to construct such p-values for two statistical applications.
We call our multiple testing procedure HAT, shorthand for hierarchical aggregation testing, as
the parameters in the returned splits can be aggregated together to improve model interpretability
and in some cases improve the predictive power of the model. The HAT procedure controls the
FSR both for independent p-values (Section 3.2.1) and under arbitrary dependence of the p-values
(Section 3.2.2).
The hypotheses defined in (3.1) are indeed intersection hypotheses, i.e.,
$$H^0_u \text{ holds} \;\Longrightarrow\; H^0_v \text{ holds for all } v\in\mathcal{T}_u, \qquad (3.13)$$
where $\mathcal{T}_u$ is the subtree rooted at node $u$. In other words, the parent of a non-null node must be non-null, and if a node is null then every child of it is null as well. This property motivates us to use a top-down sequential testing algorithm on the tree that honors Constraint 1.
Before describing the HAT algorithm, we need to establish some notation. For clarity, we sometimes use the notation $H^0_{d,u}$ to make it explicit that node $u$ is at depth $d$ of the tree $\mathcal{T}$, where the depth of a node is one plus the length of the unique path that connects the root to that node (the root is at depth 1). We also use $\mathcal{T}^d$ for the set of non-leaf nodes at depth $d$ of $\mathcal{T}$.

The testing procedure runs as follows. Let $\alpha$ be our target FSR level. Starting from the root node (depth 1), at each level $d$ we only test hypotheses at the nodes whose parents are rejected. The test levels for the hypotheses are determined by a step-up threshold function, so that the test level at each hypothesis $H^0_{d,u}$ depends on the number of leaves under the node, $|\mathcal{L}_u|$, the target level $\alpha$, the maximum node degree, denoted by $\Delta$, and the number of splits made in previous levels, denoted by $R_{1:(d-1)}$.
The details of our HAT procedure are given in Algorithm 2, and depend on node-specific thresholds
$\alpha_u(r)$, both explicitly and through the function

$$R^d(r) := \sum_{u \in \mathcal{T}^d} \mathbf{1}\{p_u \le \alpha_u(r)\}\,(\deg_{\mathcal{T}}(u) - 1). \tag{3.14}$$
Input: FSR level $\alpha$, tree $\mathcal{T}$, p-values $p_u$ for $u \in \mathcal{T} \setminus \mathcal{L}$.
Output: Aggregation of leaves such that the procedure controls FSR.
Initialize $\mathcal{T}^1_{\mathrm{rej}} = \{\mathrm{root}\}$, $R^{1:1} = \deg_{\mathcal{T}}(\mathrm{root}) - 1$.
1: repeat
2:   From depth $d = 2$ to maximum depth $D$ of the tree $\mathcal{T}$, perform hypothesis testing
     on each node in $\mathcal{T}^d$. Compute $r_d$ as
     $$r_d = \max\{r \ge 0 : r \le R^d(r)\},$$
     where $R^d(r)$ is defined in (3.14), with threshold function $\alpha_u(r)$ given by (3.15)
     (for the case of independent p-values) or (3.18) (under general dependence among p-values).
     Reject the nodes in the set $\mathcal{T}^d_{\mathrm{rej}} = \{u \in \mathcal{T}^d : p_u \le \alpha_u(r_d)\}$.
3:   Update $\mathcal{T}^{1:d}_{\mathrm{rej}} = \mathcal{T}^{1:(d-1)}_{\mathrm{rej}} \cup \mathcal{T}^d_{\mathrm{rej}}$, and $R^{1:d} = R^{1:(d-1)} + r_d$.
4: until no node in the current depth has a rejected parent or $d = D$.
Algorithm 2: Hierarchical Aggregation Testing (HAT) Algorithm
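The top-down loop of Algorithm 2 can be sketched in a few lines. The sketch below is illustrative
and not the dissertation's code: the tree is assumed to be stored as plain dictionaries, and
`threshold` is a caller-supplied function implementing $\alpha_u(r)$ (e.g., (3.15) or (3.18)); the
constant placeholder threshold used in the example is for demonstration only.

```python
def hat(tree, pvals, threshold):
    """Sketch of Algorithm 2 (HAT).

    tree: {'root': u0, 'parent': {u: v}, 'children': {u: [...]},
           'depths': {d: [nodes at depth d]}}
    pvals: {u: p_u} for non-leaf nodes u.
    threshold(u, r, R_prev, rejected): the test level alpha_u(r).
    """
    children, parent = tree['children'], tree['parent']
    deg = {u: len(children[u]) for u in pvals}      # degrees of non-leaf nodes
    rejected = {tree['root']}                       # T^1_rej = {root}
    R_prev = deg[tree['root']] - 1                  # R^{1:1} = deg(root) - 1
    for d in range(2, max(tree['depths']) + 1):
        # Only test non-leaf nodes whose parent was rejected (top-down rule).
        level = [u for u in tree['depths'][d]
                 if u in pvals and parent[u] in rejected]

        def R_d(r):  # additional splits produced at depth d, per (3.14)
            return sum(deg[u] - 1 for u in level
                       if pvals[u] <= threshold(u, r, R_prev, rejected))

        max_r = sum(deg[u] - 1 for u in level)
        # step-up search: r_d = max{r >= 0 : r <= R^d(r)}
        r_d = max(r for r in range(max_r + 1) if r <= R_d(r))
        rejected |= {u for u in level
                     if pvals[u] <= threshold(u, r_d, R_prev, rejected)}
        R_prev += r_d                               # R^{1:d} update (line 3)
        if not level:                               # no testable node remains
            break
    return rejected
```

Here the root is rejected by convention ($\mathcal{T}^1_{\mathrm{rej}} = \{\mathrm{root}\}$), and
only nodes whose parent has been rejected are ever tested, which enforces the top-down constraint.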
3.2.1 Testing with independent p-values

Assuming that the node p-values $p_u$ are independent, the threshold function $\alpha_u(r)$ used for
testing $H^0_{d,u}$ is defined as:

$$\alpha_u(r) = \mathbf{1}\{\mathrm{parent}(u) \in \mathcal{T}^{d-1}_{\mathrm{rej}}\}\;
\frac{1}{\Delta}\,
\frac{\alpha\,|\mathcal{L}_u|\,(R^{1:(d-1)} + r)}
{p\left(1 - \frac{1}{\Delta^2}\right)\bar{h}_{d,r} + \alpha\,|\mathcal{L}_u|\,(R^{1:(d-1)} + r)}, \tag{3.15}$$

where $\bar{h}_{d,r}$ is the partial harmonic sum given by

$$\bar{h}_{d,r} = 1 + \sum_{m = R^{1:(d-1)} + r + 1}^{\,p - 1 - \left(\sum_{u \in \mathcal{T}^d} \deg_{\mathcal{T}}(u) - |\mathcal{T}^d| - r\right)} \frac{1}{m}. \tag{3.16}$$
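The threshold (3.15) and the partial harmonic sum (3.16) are simple enough to transcribe directly. A
sketch, with all tree quantities ($|\mathcal{L}_u|$, $\Delta$,
$\sum_{u\in\mathcal{T}^d}\deg_{\mathcal{T}}(u)$, $|\mathcal{T}^d|$) passed in as plain numbers;
function and argument names are illustrative:

```python
def harmonic_partial(lo, hi):
    # sum_{m = lo}^{hi} 1/m (empty sum if hi < lo)
    return sum(1.0 / m for m in range(lo, hi + 1))

def hbar(p, R_prev, r, sum_deg_d, n_nodes_d):
    # partial harmonic sum (3.16)
    hi = p - 1 - (sum_deg_d - n_nodes_d - r)
    return 1.0 + harmonic_partial(R_prev + r + 1, hi)

def alpha_u_indep(parent_rejected, alpha, n_leaves_u, R_prev, r,
                  Delta, p, sum_deg_d, n_nodes_d):
    # threshold function (3.15) for independent p-values
    if not parent_rejected:
        return 0.0
    num = alpha * n_leaves_u * (R_prev + r)
    den = p * (1 - 1 / Delta**2) * hbar(p, R_prev, r, sum_deg_d, n_nodes_d) + num
    return (1 / Delta) * num / den
```

As noted below (3.15), the threshold vanishes when the parent is not rejected and grows with both
$R^{1:(d-1)}$ and $|\mathcal{L}_u|$.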
Theorem 3.2.1. Consider a tree with maximum node degree $\Delta$ and suppose that for each node $u$
in the tree, under the null hypothesis $H^0_u$, the p-value $p_u$ is super-uniform (see (3.12)).
Further, assume that the p-values for the null nodes are independent from each other and from the
non-null p-values. Then using Algorithm 2 with threshold function (3.15) to test intersection
hypotheses $H^0_u$ controls FSR under the target level $\alpha$.
The proof of Theorem 3.2.1 is given in Section 3.7.1 of the appendix and uses a combination of
different ideas. At the core of the proof is a 'leave-one-out' technique to decouple the quantities
$V$ and $R$. We also use the following self-consistency property of the testing rule. Observe that
$R^d(r)$ counts the additional splits of the leaves that result from the rejected nodes at depth
$d$, assuming that the threshold level $\alpha_u(r)$ is used. We prove that the following
self-consistency property holds: $R^d(r_d) = r_d$, where $r_d$ is defined in Step 2 of Algorithm 2.
In words, using $r_d$ to test the nodes in $\mathcal{T}^d$ (node $u$ being tested at level
$\alpha_u(r_d)$) gives us $r_d$ additional splits of the leaves, and therefore the update rule for
$R^{1:d}$ in line 3 of the algorithm ensures that this quantity counts the number of splits formed
from testing nodes at depths $1, \dots, d$. Using the self-consistency property and the
leave-one-out technique, along with intricate probabilistic bounds in terms of structural properties
of $\mathcal{T}$, we prove that FSR is controlled at the pre-assigned level $\alpha$.
A few remarks are in order regarding the testing threshold $\alpha_u(r)$. From its definition, we
have $\alpha_u(r) = 0$ if the parent hypothesis of $u$ is not rejected. Also note that since the
testing is done in a downward manner, the event
$\{\mathrm{parent}(u) \in \mathcal{T}^{d-1}_{\mathrm{rej}}\}$ is observed by the time the node $u$
is tested. Also note that as we reject more hypotheses early on, the burden of proof reduces for the
subsequent hypotheses, because $\alpha_u(r)$ is increasing in $R^{1:(d-1)}$. This trend is similar
to FDR control methods (e.g., [7, 49]). We also observe that $\alpha_u(r)$ is increasing in
$|\mathcal{L}_u|$. For the nodes at upper levels of the tree, this is crucially useful, as
$R^{1:(d-1)}$ is small for these nodes while $|\mathcal{L}_u|$ is large and compensates for it in
the threshold function.
Our next theorem is a generalization of Theorem 3.2.1 to the case where the null p-value
distribution deviates from a super-uniform distribution. We will use Theorem 3.2.2 to control FSR in
Section 3.3.2, where we aim to aggregate the features in a linear regression setting. As we will
discuss, for this application we suggest constructing the p-values using a debiasing approach, which
results in p-values that are asymptotically super-uniform (as the sample size $n$ diverges).
Theorem 3.2.2. Consider a tree with maximum node degree $\Delta$ and suppose that for each non-leaf
node $u$ in the tree, under the null hypothesis $H^0_u$, the p-value $p_u$ satisfies

$$\mathbb{P}(p_u \le t) \le t + \epsilon_0, \quad \text{for all } t \in [0,1],$$

for a constant $\epsilon_0 > 0$. Further, assume that the p-values for the null nodes are
independent from each other and from the non-null p-values. Consider running Algorithm 2 to test
intersection hypotheses $H^0_u$ with the threshold function given by

$$\alpha_u(r) = \mathbf{1}\{\mathrm{parent}(u) \in \mathcal{T}^{d-1}_{\mathrm{rej}}\}
\left\{
\frac{1}{\Delta}\,
\frac{\alpha\,|\mathcal{L}_u|\,(R^{1:(d-1)} + r)}
{p\left(1 - \frac{1}{\Delta^2}\right)\bar{h}_{d,r} + \alpha\,|\mathcal{L}_u|\,(R^{1:(d-1)} + r)}
- \epsilon_0
\right\}. \tag{3.17}$$

Then, FSR is controlled under the target level $\alpha$.
3.2.2 Testing with arbitrarily dependent p-values

Theorems 3.2.1 and 3.2.2 assume that the null p-values are independent from each other and from the
non-null p-values. To handle arbitrarily dependent p-values, we propose a modified threshold
function:

$$\alpha_u(r) = \mathbf{1}\{\mathrm{parent}(u) \in \mathcal{T}^{d-1}_{\mathrm{rej}}\}\;
\frac{\alpha\,|\mathcal{L}_u|\,\beta_d(R^{1:(d-1)} + r)}
{p\left(\Delta - \frac{1}{\Delta}\right)(\Delta - 1)}, \tag{3.18}$$

where $\beta_d(\cdot)$ is a reshaping function of the form

$$\beta_d(R^{1:(d-1)} + r) = \frac{R^{1:(d-1)} + r}
{\displaystyle\sum_{k = \delta(d-1)}^{\sum_{u \in \mathcal{T}^d} \deg_{\mathcal{T}}(u)} \frac{1}{k}}, \tag{3.19}$$

and $\delta$ is the minimum node degree in $\mathcal{T}$. It is straightforward to see that the
reshaping function lowers the test thresholds compared to the independent p-values case, making the
testing procedure more conservative so as to handle general dependence among p-values. In the next
theorem, we show that with the reshaped testing threshold, FSR is controlled for arbitrarily
dependent p-values.

Theorem 3.2.3. Consider a tree with maximum node degree $\Delta$ and minimum node degree $\delta$,
and suppose that for each node $u$ in the tree, under the null hypothesis $H^0_u$, the p-value is
super-uniform, i.e., (3.12) holds. The p-values for the nodes can be arbitrarily dependent. Then,
HAT (Algorithm 2) with the reshaped threshold (3.18) controls FSR under the target level $\alpha$.

The proof of Theorem 3.2.3 builds upon a lemma from [11] on dependency control of a pair of
non-negative random variables. We refer to Section 3.7.3 of the appendix for further details and
the complete proof.
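The reshaped threshold (3.18) and the reshaping function (3.19) can also be transcribed directly. A
sketch under illustrative conventions (tree quantities passed as plain numbers; names are not from
any released code):

```python
def beta_reshape(x, delta, d, sum_deg_d):
    # reshaping function (3.19), evaluated at x = R^{1:(d-1)} + r;
    # delta is the minimum node degree, sum_deg_d = sum of degrees at depth d
    denom = sum(1.0 / k for k in range(delta * (d - 1), sum_deg_d + 1))
    return x / denom

def alpha_u_dep(parent_rejected, alpha, n_leaves_u, R_prev, r,
                Delta, delta, d, p, sum_deg_d):
    # reshaped threshold (3.18) for arbitrarily dependent p-values
    if not parent_rejected:
        return 0.0
    num = alpha * n_leaves_u * beta_reshape(R_prev + r, delta, d, sum_deg_d)
    den = p * (Delta - 1 / Delta) * (Delta - 1)
    return num / den
```

Since $\beta_d$ divides by a harmonic sum, the reshaped levels are lower than the corresponding
unreshaped ones, reflecting the extra conservativeness needed under dependence.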
We conclude this section with a result analogous to Theorem 3.2.3, where the p-values are
approximately super-uniform. This can also be perceived as a generalization of Theorem 3.2.2 to the
case of arbitrarily dependent p-values.

Theorem 3.2.4. Consider a tree with maximum node degree $\Delta$ and minimum node degree $\delta$,
and suppose that for each non-leaf node $u$ in the tree, under the null hypothesis $H^0_u$, the
p-value $p_u$ satisfies

$$\mathbb{P}(p_u \le t) \le t + \epsilon_0, \quad \text{for all } t \in [0,1],$$

for a constant $\epsilon_0 > 0$. The p-values for the nodes can be arbitrarily dependent. Consider
running Algorithm 2 to test intersection hypotheses $H^0_u$ with the threshold function given by

$$\alpha_u(r) = \mathbf{1}\{\mathrm{parent}(u) \in \mathcal{T}^{d-1}_{\mathrm{rej}}\}
\left\{
\frac{\alpha\,|\mathcal{L}_u|\,\beta_d(R^{1:(d-1)} + r)}
{p\left(\Delta - \frac{1}{\Delta}\right)(\Delta - 1)}
- \epsilon_0
\right\}, \tag{3.20}$$

with the reshaping function $\beta_d(\cdot)$ defined by (3.19). Then, FSR is controlled under the
target level $\alpha$.

The proof of Theorem 3.2.4 is similar to that of Theorem 3.2.3, and is deferred to Section 3.7.3.1
of the appendix.
3.3 Two statistical applications

Here we consider two statistical applications of tree-based aggregation. In Section 3.3.1, we study
the problem of testing equality of means, for which the nodewise p-values are formed by one-way
ANOVA tests. In Section 3.3.2 we study the problem of aggregating features with the same
coefficients in a linear regression setting. This application is motivated by [90], which describes
the prevalence of "rare features" in modern prediction problems, in which features correspond to the
counts of events that rarely occur in the population. That work shows that aggregating rare features
into denser features can lead to better-performing predictive models.
3.3.1 Testing equality of means

In this application, we imagine that $\theta^*$ is a vector of unknown means and that at each leaf
node $i$ there is a noisy observation of the corresponding mean:

$$y_i = \theta^*_i + \epsilon_i,$$

where $\epsilon_1, \dots, \epsilon_p \sim N(0, \sigma^2)$ independently. Given the tree
$\mathcal{T}$, we study the problem of splitting leaves into groups with the same mean parameter
$\theta^*_i$, from observed data $y = (y_1, \dots, y_p)^\top$. We want to aggregate the leaves by
testing the equality of means.

To this end, we use our proposed testing procedure from the previous section to test the equality of
mean parameters. For each node $u \in \mathcal{T}$, we construct a p-value based on a one-way ANOVA
test with known $\sigma > 0$,

$$p_u = 1 - F_{\chi^2_{D_u - 1}}\!\left(\frac{1}{\sigma^2} \sum_{v \in \mathrm{child}(u)} |\mathcal{L}_v|\,(\bar{y}_v - \bar{y}_u)^2\right), \tag{3.21}$$

where $\bar{y}_v = |\mathcal{L}_v|^{-1} \sum_{i \in \mathcal{L}_v} y_i$, and $\mathrm{child}(u)$ is
the set of children of $u$. Also $D_u := \deg_{\mathcal{T}}(u) = |\mathrm{child}(u)|$ and
$F_{\chi^2_{D_u - 1}}$ is the cdf of a $\chi^2_{D_u - 1}$ random variable.
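The nodewise ANOVA p-value (3.21) is straightforward to compute; a minimal sketch using SciPy, with
the observations at each child's leaves passed as arrays (the function name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def anova_node_pvalue(groups, sigma):
    """p-value (3.21) at a node; `groups` holds one 1-D array of leaf
    observations per child of the node, and sigma is the known noise level."""
    all_y = np.concatenate(groups)
    ybar_u = all_y.mean()                      # grand mean over L_u
    stat = sum(len(g) * (g.mean() - ybar_u) ** 2 for g in groups) / sigma ** 2
    return 1.0 - chi2.cdf(stat, df=len(groups) - 1)
```

Under the null (all group means equal), the statistic is $\chi^2_{D_u-1}$ distributed, so the
returned value is uniform, as Lemma 3.3.1 states.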
We show in the following lemma that the above construction gives bona fide p-values for our testing
procedure.

Lemma 3.3.1. The p-value defined in (3.21) is uniform under $H^0_u$: all
$\theta^*_{\mathcal{L}_u}$ have the same value. Furthermore, for any two distinct nodes
$a, b \in \mathcal{T} \setminus \mathcal{L}$, $p_a$ and $p_b$ are independent.
One special case is when the tree $\mathcal{T}$ is a binary tree, i.e., each node has exactly two
children. Then the nodewise p-value simplifies to a two-sided two-sample z-test p-value, namely,

$$p_u = 2\left[1 - \Phi\!\left(\frac{|\bar{y}_{u,\mathrm{left}} - \bar{y}_{u,\mathrm{right}}|}{\sigma \sqrt{1/n_1 + 1/n_2}}\right)\right], \tag{3.22}$$

where $n_1 = |\mathcal{L}_{\mathrm{left\text{-}child}(u)}|$,
$n_2 = |\mathcal{L}_{\mathrm{right\text{-}child}(u)}|$, and

$$\bar{y}_{u,\mathrm{left}} = \frac{1}{n_1} \sum_{i \in \mathcal{L}_{\mathrm{left\text{-}child}(u)}} y_i, \qquad
\bar{y}_{u,\mathrm{right}} = \frac{1}{n_2} \sum_{i \in \mathcal{L}_{\mathrm{right\text{-}child}(u)}} y_i.$$

Also $\Phi$ is the cdf of a standard normal random variable.
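One can check numerically that, for a binary node, the $\chi^2$ form (3.21) and the two-sided z form
(3.22) give the same p-value; a quick check on assumed toy data:

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(1)
left, right = rng.normal(0.0, 1.0, 40), rng.normal(0.5, 1.0, 60)
sigma = 1.0

# chi-square form (3.21), node with two children
y = np.concatenate([left, right])
stat = (len(left) * (left.mean() - y.mean()) ** 2
        + len(right) * (right.mean() - y.mean()) ** 2) / sigma ** 2
p_chi2 = 1 - chi2.cdf(stat, df=1)

# two-sided z form (3.22)
z = abs(left.mean() - right.mean()) / (sigma * np.sqrt(1 / 40 + 1 / 60))
p_z = 2 * (1 - norm.cdf(z))

print(p_chi2, p_z)
```

The agreement follows from the identity $1 - F_{\chi^2_1}(z^2) = 2(1 - \Phi(|z|))$.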
3.3.1.1 Simes' p-values

Recall that the nodewise hypotheses $\{H^0_u\}_u$ are intersection hypotheses as in (3.13), and
therefore one can apply Simes' procedure to form bona fide intersection p-values.

For each node $a$, the set $p_{S(a)}$ denotes the p-values at the non-leaf descendants of $a$ and at
$a$ itself. Sort this set of p-values in increasing order, and denote the ordered p-values as
$p_{(1)}, \dots, p_{(|S(a)|)}$. Then the Simes' p-value at node $a$ is constructed as follows:

$$p_{a,\mathrm{Simes}} := \min_{1 \le k \le |S(a)|} \left\{ p_{(k)}\, \frac{|S(a)|}{k} \right\}.$$

As shown by [74], since the original p-values are independent (as per Lemma 3.3.1), the Simes'
p-values constructed as above are super-uniform, and hence can be used to test the nodewise
hypotheses. However, note that the Simes' p-values are no longer independent, so when applying the
HAT procedure, we need to use the reshaped threshold function (3.18).
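Simes' intersection p-value is a one-liner; a small sketch (the function name is illustrative):

```python
import numpy as np

def simes(pvals):
    # Simes' intersection p-value: min_k p_(k) * |S(a)| / k
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    return float(np.min(p * m / (np.arange(m) + 1)))
```

For example, with sorted p-values $(0.01, 0.02, 0.9)$ the candidates are $0.03$, $0.03$, and $0.9$,
so the Simes' p-value is $0.03$.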
3.3.2 Testing equality of regression coefficients

Consider a linear model where the response variables are generated as

$$y = X\theta^* + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n). \tag{3.23}$$

Here $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$ is the response vector,
$X \in \mathbb{R}^{n \times p}$ is the feature (design) matrix, $\theta^* \in \mathbb{R}^p$ is the
vector of unobserved model parameters, and $\epsilon \in \mathbb{R}^n$ is a vector of independent
zero-mean Gaussian errors with variance $\sigma^2$. We use $X_1, \dots, X_p$ to represent the
columns of $X$.
In many applications the features are count data, i.e., $X_{ij}$ records the frequency of an event
$j$ occurring in observation $i$. As discussed in [90], it is often the case that a large number of
these events rarely occur, and therefore the corresponding features are zero for nearly all
observations. A common practice in analyzing such data is to remove the rare features in a
pre-processing step. However, [90] show that when a tree is available, rare features can be
aggregated to create informative predictors for the response. Their tree-based aggregation method
forms new features that count the frequency of unions of events. Their feature aggregation is
equivalent to adding equality constraints on the coefficients based on the tree structure. The chief
focus of [90] was on developing point estimators with better predictive performance. Here we focus
on the aggregation recovery itself by controlling the false split rate (FSR) criterion. In doing so,
we use the point estimator of [90], along with a debiasing approach, to construct the nodewise
p-values for our proposed testing procedure.

The proposed point estimator of [90] is the solution to the following optimization problem:

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \frac{1}{2n} \|y - X\theta\|_2^2 + P_{\nu,\lambda}(\theta), \tag{3.24}$$

where the penalty term $P_{\nu,\lambda}(\theta)$ is given by

$$P_{\nu,\lambda}(\theta) := \min_{\gamma \in \mathbb{R}^{|\mathcal{T}|}} \lambda \left( \nu \sum_{u \in \mathcal{T} \setminus \mathrm{root}} |\gamma_u| + (1 - \nu) \sum_{j=1}^p |\theta_j| \right) \quad \text{s.t. } \theta = A\gamma. \tag{3.25}$$

Here the matrix $A \in \mathbb{R}^{p \times |\mathcal{T}|}$ encodes the tree structure.
Specifically, each entry $A_{ij}$ is 1 if leaf $i$ is a descendant of node $j$, and is 0 otherwise.
The penalty term is developed using a tree-based parameterization strategy to impose sparsity and to
constrain the coefficient vector to be piecewise constant on the tree. The level sets of
$\hat{\theta}$ (sets where $\hat{\theta}$ takes on the same value) correspond to features that are
aggregated. We refer to [90] for further discussion and insight behind the penalty term
$P_{\nu,\lambda}(\theta)$. It is worth noting that $P_{\nu,\lambda}(\theta)$ is a convex function.
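The parameterization $\theta = A\gamma$ can be made concrete on a toy tree: assigning a nonzero
$\gamma_u$ to an internal node shifts every leaf below $u$ by the same amount, which is exactly how
level sets (aggregated features) arise. The node ordering below is an assumption chosen for
illustration:

```python
import numpy as np

# Toy tree with nodes ordered [root, u1, u2, leaf1, leaf2, leaf3, leaf4],
# where u1 has leaves 1-2 below it and u2 has leaves 3-4.
# A[i, j] = 1 if leaf i descends from node j (or i == j).
A = np.array([
    [1, 1, 0, 1, 0, 0, 0],   # leaf1: root, u1, itself
    [1, 1, 0, 0, 1, 0, 0],   # leaf2
    [1, 0, 1, 0, 0, 1, 0],   # leaf3
    [1, 0, 1, 0, 0, 0, 1],   # leaf4
])

# Sparse gamma on the tree: root contributes 1.0, u2 contributes 2.0.
gamma = np.array([1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0])
theta = A @ gamma
print(theta)   # [1. 1. 3. 3.]  -- leaves 1-2 share one value, leaves 3-4 another
```

With only two nonzero entries of $\gamma$, the induced $\theta$ is piecewise constant on the two
subtrees, illustrating why penalizing $\|\gamma\|_1$ encourages aggregation.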
3.3.2.1 Constructing p-values for the null hypotheses

We would like to use the estimator $\hat{\theta}$ to construct p-values for the null hypotheses
$H^0_u$ given in (3.1), and then apply the HAT procedure to return a splitting of parameters into
their level sets (or, equivalently, to split the features into groups which should be aggregated).
However, a major obstacle is that deriving the distribution of the estimator $\hat{\theta}$ is not
tractable. Moreover, due to the regularization term $P_{\nu,\lambda}(\theta)$, this estimator is
biased. In order to deal with these challenges and to perform inference based on $\hat{\theta}$, we
follow a debiasing approach.

The debiasing approach was pioneered in [45, 94, 85, 48] for statistical inference in high
dimensions, where the sample size is much smaller than the dimension of the features (i.e.,
$n \ll p$). Regularized estimators such as the lasso [80] are popular point estimators in these
regimes because they promote specific structures in the estimator. However, these estimators are
biased due to the regularization. The focus of the debiasing work has been on statistical inference
on individual model parameters, namely constructing p-values for null hypotheses of the form
$H_{0,i} : \theta^*_i = 0$. The debiasing approach has been extended to inference on linear
functions of model parameters [14, 16] and also general functionals of them [44]. The original
debiasing method can also be used to perform inference on a group of model parameters, e.g.,
constructing valid p-values for a null hypothesis $H_0 : \theta^*_A = 0$ where the group size $|A|$
is fixed as $n, p \to \infty$ (see, e.g., [45, Section 3.4]). More recently, [37] have studied the
group inference problem for the linear regression model by considering sum-type statistics, namely
quadratic-form hypotheses $H_0 : \theta^\top_A G\, \theta_A = 0$ for a positive definite matrix $G$.
They propose a debiasing approach to directly estimate the quadratic form
$\theta^\top_A G\, \theta_A$ and to provide asymptotically valid p-values for the corresponding
hypotheses. The constructed p-values are valid for any group size in terms of type-I error control.
This work also discusses how, by a direct application of the methodology developed in [65], one can
test the significance of multiple groups, where the groups are defined by a tree structure. The
method of [65] is based on a hierarchical approach to testing variable importance. At its core, it
constructs hierarchically adjusted p-values to account for the multiplicity of testing problems and
controls the family-wise error rate at the prespecified level. At every level of the tree, the
p-value adjustment is a weighted Bonferroni correction, and across different levels it is a
sequential procedure with no correction but with the constraint that if a parent hypothesis is not
rejected then the procedure does not go further down the tree. By comparison, our HAT algorithm
controls the false split rate, which is a very different criterion than the family-wise error rate.
Also, HAT does not make any adjustment to the p-values, and instead chooses the threshold levels in
a sequential manner depending on the previous rejections and the structural properties of the tree.
Here we follow the same methodology as [37] to construct valid p-values for the HAT procedure, using
the point estimator (3.24). Specifically, for each node $u$, we transform the null hypothesis
$H^0_u$ into an equivalent problem:

$$\widetilde{H}^0_u : \; Q_u := \theta^{*\top}_{\mathcal{L}_u} G_u\, \theta^*_{\mathcal{L}_u} = 0.$$
To lighten the notation, below we use the shorthand $\theta^*_u := \theta^*_{\mathcal{L}_u}$ and
$m = |\mathcal{L}_u|$. In addition, we choose $G_u \in \mathbb{R}^{m \times m}$ to be the following
projection matrix:

$$G_u = I_m - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^\top.$$

Recall that $I_m$ is the identity matrix of size $m$ and $\mathbf{1}_m$ is the all-one vector of
length $m$. To see that $H^0_u$ and $\widetilde{H}^0_u$ are equivalent, observe that

$$\theta^{*\top}_{\mathcal{L}_u} G_u\, \theta^*_{\mathcal{L}_u}
= \theta^{*\top}_u \left(I_m - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^\top\right) \theta^*_u
= \frac{1}{2m} \sum_{a,b \in \mathcal{L}_u} (\theta^*_a - \theta^*_b)^2.$$
To make inference on the quadratic form $Q_u$, we first consider the point estimator
$\hat{Q}_u := \hat{\theta}^\top_u G_u \hat{\theta}_u$, where $\hat{\theta}$ is the estimator given
by (3.24).
To debias $\hat{Q}_u$, we first decompose the error term as

$$\hat{Q}_u - Q_u = \hat{\theta}^\top_u G_u \hat{\theta}_u - \theta^{*\top}_u G_u \theta^*_u
= 2\,\hat{\theta}^\top_u G_u \left(\hat{\theta}_u - \theta^*_u\right)
- \left(\hat{\theta}_u - \theta^*_u\right)^\top G_u \left(\hat{\theta}_u - \theta^*_u\right).$$
The dominating term in this decomposition is
$2\,\hat{\theta}^\top_u G_u (\hat{\theta}_u - \theta^*_u)$. The approach in [37] is to develop an
unbiased estimate of this term and then subtract this estimate from $\hat{Q}_u$. Given a projection
direction $\hat{b}$, the unbiased estimate is of the form

$$\frac{1}{n} \hat{b}^\top X^\top (y - X\hat{\theta})
= \hat{b}^\top \hat{\Sigma} (\theta^* - \hat{\theta}) + \frac{1}{n} \hat{b}^\top X^\top \epsilon,$$

where $\hat{\Sigma} := \frac{1}{n} X^\top X$ is the sample covariance matrix. The idea is to find a
projection direction $\hat{b}$ such that $\hat{b}^\top \hat{\Sigma} (\hat{\theta} - \theta^*)$ is a
good estimate of $\hat{\theta}^\top_u G_u (\hat{\theta}_u - \theta^*_u)$. The projection direction
$\hat{b}$ is constructed by solving the following optimization problem:

$$\hat{b} = \arg\min_b \; b^\top \hat{\Sigma} b \quad \text{s.t.} \quad
\max_{w \in \mathcal{C}_u} \left\langle w,\; \hat{\Sigma} b - [\hat{\theta}^\top_u G_u,\, 0]^\top \right\rangle
\le \|G_u \hat{\theta}_u\|_2\, \lambda_n, \tag{3.26}$$

where

$$\mathcal{C}_u = \left\{ e_1, \dots, e_p,\; \frac{1}{\|G_u \hat{\theta}_u\|_2} [\hat{\theta}^\top_u G_u,\, 0]^\top \right\}.$$

Here $e_i \in \mathbb{R}^p$ is the $i$-th element of the standard basis, the vector with one at the
$i$-th coordinate and zero everywhere else. In addition, $\lambda_n$ is chosen to be of order
$\sqrt{\log(p)/n}$.
Finally, the debiased estimator for $Q_u$ is constructed as

$$\hat{Q}^d_u := \hat{\theta}^\top_u G_u \hat{\theta}_u + \frac{2}{n} \hat{b}^\top X^\top (y - X\hat{\theta}). \tag{3.27}$$
We next discuss the result of [37] on the asymptotic distribution of the debiased estimator
$\hat{Q}^d_u$. Suppose that the true model $\theta^*$ is $s_0$-sparse (i.e., it has $s_0$ nonzero
entries). As shown in [37, Theorem 2], under the condition $s_0 (\log p)/\sqrt{n} \to 0$, and
assuming that the initial estimator $\hat{\theta}$ satisfies
$\|\hat{\theta} - \theta^*\|_2 \le C \sqrt{s_0 (\log p)/n}$ and
$\|\hat{\theta} - \theta^*\|_1 \le C s_0 \sqrt{(\log p)/n}$ for some constant $C > 0$, the residual
$\hat{Q}^d_u - Q_u$ asymptotically admits a Gaussian distribution. More specifically,
$\hat{Q}^d_u - Q_u = Z_u + \Delta_u$, where

$$Z_u \sim N\big(0, \mathrm{var}(\hat{Q}^d_u)\big), \qquad
\mathrm{var}(\hat{Q}^d_u) = \frac{4\sigma^2}{n}\, \hat{b}^\top \hat{\Sigma}\, \hat{b}. \tag{3.28}$$
In addition, for any constant $c_1 > 0$, there exists a constant $c_2 > 0$ depending on $c_1$ such
that

$$\mathbb{P}\left( |\Delta_u| \ge c_1 \left( \|G_u \hat{\theta}_u\|_2 + \|G_u\|_2 \right) \frac{s_0 \log p}{n} \right) \le 2pe^{-c_2 n}. \tag{3.29}$$

The above bound states that with high probability the bias term $\Delta_u$ is of order
$s_0 (\log p)/n$, while $\mathrm{var}(\hat{Q}^d_u)$ is of order $1/n$. Therefore, under the
condition $s_0 (\log p)/\sqrt{n} \to 0$, the noise term $Z_u$ dominates the bias term $\Delta_u$.²
Note that $\mathrm{var}(\hat{Q}^d_u)$ involves the noise variance $\sigma^2$ (which is the same for
all nodes $u$). Let $\hat{\sigma}$ be a consistent estimate of $\sigma$. Then the variance of the
debiased estimator $\hat{Q}^d_u$ is estimated by

$$\widehat{\mathrm{var}}_\tau(\hat{Q}^d_u) = \frac{4\hat{\sigma}^2}{n}\, \hat{b}^\top \hat{\Sigma}\, \hat{b} + \frac{\tau}{n}, \tag{3.30}$$

for some positive fixed constant $\tau$. The term $\tau/n$ is just to ensure that the estimated
variance is at least of order $1/n$ (in the case of $\hat{b}^\top \hat{\Sigma}\, \hat{b} = 0$), and
so it dominates the bias component of $\hat{Q}^d_u$. The exact choice of $\tau$ does not matter in
the large-sample limit ($n \to \infty$).
Using this result, we construct the two-sided p-value for the null hypothesis $\widetilde{H}^0_u$ as
follows:

$$p_u = 2\left[ 1 - \Phi\!\left( \frac{|\hat{Q}^d_u|}{\sqrt{\widehat{\mathrm{var}}_\tau(\hat{Q}^d_u)}} \right) \right],$$

where $\Phi$ is the cdf of the standard normal distribution.
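Given the debiased statistic $\hat{Q}^d_u$, the projection direction $\hat{b}$, and a noise-level
estimate $\hat{\sigma}$, the variance estimate (3.30) and the two-sided p-value are mechanical to
compute; a sketch (function name illustrative, inputs assumed precomputed elsewhere):

```python
import numpy as np
from scipy.stats import norm

def debiased_pvalue(Q_hat_d, b_hat, Sigma_hat, sigma_hat, n, tau=1.0):
    """Two-sided p-value for the debiased quadratic-form statistic.

    Q_hat_d: debiased estimate (3.27); b_hat: projection direction (3.26);
    Sigma_hat: sample covariance X'X/n; sigma_hat: noise-level estimate;
    tau: the fixed constant in the variance estimate (3.30)."""
    var_hat = 4 * sigma_hat ** 2 / n * float(b_hat @ Sigma_hat @ b_hat) + tau / n
    return 2 * (1 - norm.cdf(abs(Q_hat_d) / np.sqrt(var_hat)))
```

The floor $\tau/n$ keeps the denominator bounded away from zero even when
$\hat{b}^\top \hat{\Sigma}\, \hat{b} = 0$.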
Proposition 3.3.2. Consider the asymptotic distributional characterization of $\hat{Q}^d_u$ given by
(3.28) and (3.29). Let $\hat{\sigma} = \hat{\sigma}(y, X)$ be an estimator of the noise level
satisfying, for any fixed $\epsilon > 0$,

$$\lim_{n \to \infty} \mathbb{P}\left( \left| \frac{\hat{\sigma}}{\sigma} - 1 \right| \ge \epsilon \right) = 0.$$

Under the condition $s_0 (\log p)/\sqrt{n} \to 0$, for any fixed arbitrarily small constant
$\epsilon_0$ (say $\epsilon_0 = 0.001$), there exists $n_0 > 0$ such that for all $n > n_0$ we have

$$\mathbb{P}(p_u \le t) \le t + \epsilon_0, \quad \text{for all } t \in [0,1].$$

² In [37], the probability bound $2pe^{-c_2 n}$ was further simplified to $p^{-c_0}$, since
$n \gtrsim \log p$ and assuming $n, p \to \infty$.
We refer to Appendix 3.8.5 for the proof of Proposition 3.3.2.

By virtue of Proposition 3.3.2, the constructed p-values satisfy the assumption of Theorem 3.2.4,
and therefore by running the HAT procedure we are able to control FSR under the target level.

3.4 Simulations

In this section, we conduct simulation studies to understand the performance of HAT in different
settings. Sections 3.4.1 and 3.4.2 study HAT with idealized p-values, whereas Section 3.4.3
considers its use in the contexts of aggregating means and aggregating regression coefficients. All
simulation studies use the simulator R package [9].
3.4.1 Testing on a binary tree with idealized p-values

We first conduct a simulation study to evaluate the performance of our proposed hierarchical
aggregation testing procedure. Since we have drawn the equivalence between FSR and FDR for binary
trees (cf. Lemma 3.1.3), we can also compare our procedure with other testing procedures that have
been proposed to control FDR through hierarchical testing on trees (for non-binary trees there is no
such reference to compare with, since FSR is a criterion proposed by the present work, and there is
no other algorithm in the literature to control FSR). Specifically, we will compare our HAT
procedure to the procedure proposed by [60]. Their method corresponds to Algorithm 2 with several
modifications. First, their thresholds are given by
$$\alpha_u(r) = \alpha\, \frac{|\mathcal{L}_u(\widetilde{\mathcal{T}})|}{|\mathcal{L}_{\mathrm{root}}(\widetilde{\mathcal{T}})|}\,
\frac{m_u(\widetilde{\mathcal{T}}) + R^{1:(d-1)} + r - 1}{m_u(\widetilde{\mathcal{T}})}, \tag{3.31}$$

where $\widetilde{\mathcal{T}}$ is the tree obtained by taking $\mathcal{T}$ and removing the
leaves, $m_u(\widetilde{\mathcal{T}})$ is the number of descendants of node $u$ in
$\widetilde{\mathcal{T}}$, and $|\mathcal{L}_u(\widetilde{\mathcal{T}})|$ is the number of leaves in
$\widetilde{\mathcal{T}}$ that descend from $u$. Also, they initialize $R^{1:1} = 1$ and, instead of
(3.14), they take

$$R^d(r) = \sum_{u \in \widetilde{\mathcal{T}}^d} \mathbf{1}\{p_u \le \alpha_u(r)\}.$$
We randomly generate $p$ points from $\mathrm{Unif}[0,1]$ and form a binary tree structure among
them using hierarchical clustering. We let $K = |\mathcal{B}^*|$ be the number of true groups,
obtained by cutting the tree into $K$ disjoint subtrees with the R function cutree. The nodes that
are the roots of the subtrees form $\mathcal{B}^*$. All non-leaf nodes in $\mathcal{B}^*$ and their
non-leaf descendants are null nodes, and all ancestors of $\mathcal{B}^*$ are non-null nodes. We
simulate p-values for each null and non-null non-leaf node in the following way:

For null nodes, the corresponding p-values are generated independently from $\mathrm{Unif}([0,1])$.
For non-null nodes, the corresponding p-values are generated independently from $\mathrm{Beta}(1,60)$.

For each pair of $p$ and $K$, the set of p-values is simulated independently for 100 repetitions as
described above. With each set of p-values and a target FSR level $\alpha$, we perform a
hierarchical testing procedure with our algorithm and with [60]'s method, which we refer to as LG.
We calculate FSP and TPP based on the aggregation of leaves that results. The simulated values of
the FSR and the mean power are obtained by averaging over the 100 values of FSP and TPP,
respectively.
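The idealized p-value generation described above can be sketched as follows (names are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pvalues(null_nodes, nonnull_nodes):
    # Idealized p-values: Unif(0,1) at null nodes, Beta(1, 60) at non-null nodes,
    # all drawn independently.
    p = {u: rng.uniform() for u in null_nodes}
    p.update({u: rng.beta(1, 60) for u in nonnull_nodes})
    return p

pv = simulate_pvalues(range(10), range(10, 15))
```

The $\mathrm{Beta}(1,60)$ draws have mean $1/61 \approx 0.016$, so non-null nodes carry small
p-values with high probability.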
Figure 3.3: Plots of achieved FSR and average power by our algorithm (HAT) and Lynch and Guo's
algorithm (LG), on a binary tree with $p = 1000$ leaves. The number of true groups $K$ varies from
100 to 500. The p-values are simulated to be independent. The target FSR levels are
$\alpha = 0.1, 0.2, 0.3$.
Figure 3.3 shows how FSR and average power change with $K$ when $p$ is fixed at 1000. We can see
that both methods control FSR under the target $\alpha$'s. In terms of power, when $\alpha = 0.1$,
the LG method enjoys slightly higher power. For larger $\alpha$, however, the average power achieved
by our HAT method is higher; the gap in power widens as $K$ increases. When $K$ is large with the
tree fixed, meaning that the $\mathcal{B}^*$ nodes are at deeper levels, LG's power drops at a
faster rate than ours. Indeed, for these $\alpha$ values, our method shows a substantial advantage
when we have a deep tree and the non-null nodes appear at deeper levels of the tree.

Figure 3.4 shows how achieved FSR and average power change with $\alpha$ in the setting where
$p = 1000$ and $K = 500$. We observe again that HAT achieves higher power than LG when $\alpha$ is
above 0.1. From the left panel, we see that both methods are conservative in that the achieved FSR
is lower than the target level $\alpha$, but as evident from the right-most panel, HAT showcases a
better tradeoff between FSR and the mean power.
Figure 3.4: Plots of achieved FSR and average power by our algorithm (HAT) and Lynch and Guo's
algorithm (LG) on a binary tree. The number of leaves $p$ is fixed at 1000 and the number of true
groups $K$ is fixed at 500. The p-values are simulated to be independent. The target FSR level
$\alpha$ varies from 0.01 to 0.5.
3.4.2 Testing on a non-binary tree with idealized p-values

The LG algorithm is guaranteed to control FSR in the previous section due to the equivalence between
FSR and FDR in the special case of a binary tree. However, for a non-binary tree, the LG algorithm
has no theoretical guarantee of FSR control.

We generate a tree where the root has degree 5, and each child of the root is either a non-leaf node
with degree 10 or a leaf node; we vary the number of non-root non-leaf nodes from 1 to 4, which
results in $p$ ranging from 14 to 41. The number of true groups is fixed at 5; therefore the root is
the only non-null node. We simulate p-values for the interior nodes in the same fashion as in
Section 3.4.1: the p-values for null nodes are simulated independently from $\mathrm{Unif}([0,1])$
and the p-values for non-null nodes are simulated independently from $\mathrm{Beta}(1,60)$. For both
methods, we fix the target FSR level $\alpha$ at three different levels: 0.1, 0.2, 0.3. An estimate
of FSR is obtained by averaging FSP over 100 runs. The achieved FSR is shown in Figure 3.5. As
expected, we
Figure 3.5: Plot of achieved FSR by our algorithm (HAT) and Lynch and Guo's algorithm (LG) on a
non-binary tree. The number of true groups $K$ is fixed at 5 and the number of leaves $p$ ranges
from 14 to 41. The p-values are simulated to be independent. The target FSR level $\alpha$ is set at
0.1, 0.2, 0.3, respectively. While HAT controls FSR under the target levels, LG fails to do so.
observe that the HAT procedure controls FSR under each target $\alpha$ for all values of $p$,
whereas the LG algorithm does not.

Therefore, for aggregating leaves in general settings where the tree can be non-binary, only our
algorithm provably controls FSR under the pre-specified level. This highlights the importance of
using our approach, which has guaranteed FSR control for tree-based aggregation problems with
non-binary trees.
3.4.3 Two statistical applications

3.4.3.1 Testing equality of means

In this section we apply the HAT procedure to the problem of testing equality of means. To simulate
this setting, we form a balanced 3-regular tree with $p = 243$ leaves. For each $K$, we cut the tree
into $K$ disjoint subtrees, which leads to $K$ non-overlapping subgroups of leaves. We assign a
value to each leaf as

$$y_i = \theta^*_{k(i)} + \epsilon_i, \qquad k(i) \in \{1, \dots, K\},\; i \in \{1, \dots, p\},$$

where $k(i)$ represents the group of leaf node $i$, the elements of $\theta^*$ are independently
generated from a $\mathrm{Unif}(1, 1.5)$ distribution multiplied by random signs, and the
$\epsilon_i$'s are drawn from a $N(0, \sigma^2)$ distribution.

We simulate 100 runs by generating 100 independent $\epsilon$'s with the noise level set to
$\sigma = 0.3$. The p-values are calculated as in (3.21).
Recall that we have shown in Lemma 3.3.1 that the ANOVA p-values are independent, which satisfies
the assumptions of Theorem 3.2.1. Thus we can perform our testing procedure using our threshold
function for independent p-values. Alternatively, we can also form the bona fide p-values using
Simes' procedure, and test with the reshaped threshold function that is designed for arbitrarily
dependent p-values.

We calculate FSR and average power by taking the average of the FSP and power over the 100 runs.
Figure 3.6 demonstrates how FSR and average power change with $K$. We observe that using Simes'
p-values together with the reshaped thresholds achieves both lower FSR and higher power, which makes
sense in this context because large effect sizes low in the tree may not translate to large effect
sizes high in the tree.
3.4.3.2 Testing equality of regression coefficients

We apply our testing procedure HAT to the application of testing equality of regression
coefficients. We assume a high-dimensional linear model as described in Section 3.3.2 and generate
$p$ coefficients that take $K$ unique values. This partition comes from leaves of disjoint subtrees
of $\mathcal{T}$. We compute the p-values using the debiasing method at each node as in
Section 3.3.2.1. The details of the data-generating process are described below.
Figure 3.6: Plots of achieved FSR and mean power with varying number of true groups $K$ with ANOVA
p-values on a 3-regular tree ($p = 243$, $\sigma = 0.3$). The hierarchical testing is done with our
algorithm (HAT) directly, as well as with Simes' procedure and the reshaped threshold function
(3.18) (HAT-Simes-reshaped). The target FSR levels $\alpha$ are fixed at 0.1, 0.2, 0.3,
respectively.
We first form a balanced 3-regular tree with $p = 243$ leaves. We express the tree by a binary
matrix $A \in \{0,1\}^{p \times |\mathcal{T}|}$ with rows corresponding to features and columns
corresponding to nodes. Each entry $A_{ju} = 1$ if node $u$ is an ancestor of leaf $j$ or if
$u = j$, and $A_{ju} = 0$ otherwise. For a given $K$, we cut the tree into $K$ subtrees. The roots
of the subtrees form $\mathcal{B}^*$. We want to set the coefficients corresponding to the leaves
within each subtree to the same value. To achieve this, we generate a vector of length $K$, denoted
as $\widetilde{\theta}^*$, with the first $(1 - \beta)K$ elements set to 0; the other $\beta K$
elements of $\widetilde{\theta}^*$ are independently drawn from $N(0, 0.5^2)$. Then we set
$\theta^* = A_{\mathcal{B}^*} \widetilde{\theta}^*$, where $A_{\mathcal{B}^*}$ is the matrix $A$
restricted to the columns that correspond to the nodes in $\mathcal{B}^*$. Note that the columns of
$A_{\mathcal{B}^*}$ have disjoint supports, as no two nodes in $\mathcal{B}^*$ can share a common
descendant. The parameter $\beta$ controls the sparsity of $\widetilde{\theta}^*$, and therefore the
sparsity of $\theta^*$.
To simulate a setting with rare features, we consider a design matrix
$X := \widetilde{X} \circ W \in \mathbb{R}^{n \times p}$ drawn from a Bernoulli-Gaussian
distribution. The entries $\widetilde{X}_{ij}$ are generated i.i.d. from a standard normal
distribution, and the entries $W_{ij}$ are drawn i.i.d. from $\mathrm{Bernoulli}(\rho)$. The
Bernoulli parameter $\rho$ determines the level of rareness in the design matrix, and $\circ$
represents the entry-wise product of two matrices. Finally, the high-dimensional linear model is
generated by (3.23) where $\sigma = c\,\|X\theta^*\|_2/\sqrt{n}$.

Figure 3.7: Plots of empirical CDFs of p-values at three nodes under the setting $n = 100$,
$p = 243$, $\beta = 0.6$, $K = 30$, $\rho = 0.2$, $\sigma = 0.6$. Node #110 is a non-null node, node
#86 is a null node in $\mathcal{B}^*$, and node #13 is a null node that is a child of #86.

We fix the parameters as $n = 100$, $p = 243$, $\beta = 0.6$, $\rho = 0.2$, $\sigma = 0.6$, and vary
$K$ from 21 to 93. For each $K$, we simulate 100 independent $\epsilon$'s. The initial estimator
$\hat{\theta}$ that solves the optimization problem (3.24) is obtained using the R package rare
[89]. The tuning parameters $\lambda$ and $\nu$ are chosen by cross-validation over a $2 \times 10$
grid. We then follow the steps described in Section 3.3.2.1 to compute the p-values at each node.
The positive constant $\tau$ in (3.30) is set to one, and the noise level estimate $\hat{\sigma}$ is
obtained using the scaled lasso [77] (R package scalreg).
Figure 3.7 shows the empirical cdf of the p-values, obtained from the 100 realizations of the
noise, at three representative nodes when K = 57. Among the three nodes, node #110 is a non-
null node, which means q
L
110
contains at least two distinct values. Nodes #13 and #86 are both
null nodes but at different depths on the tree. node #86 is one of theB
nodes and node #13
is a descendant of node #86. The curve of p-values at node #110 is above the diagonal line,
which means the distribution has a higher density at small values than uniform distribution. On
the contrary, the distribution of p-values at nodes #13 and #86 are super-uniform. The curve for a
deeper level node seems to be further away from the diagonal line than its ancestor node.
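The super-uniformity seen in Figure 3.7 can be checked for any node by comparing the empirical CDF of its simulated p-values against the diagonal. A minimal sketch (the p-values below are hypothetical stand-ins for the 100 simulated values at one node):

```python
import numpy as np

def ecdf(pvals, grid):
    """Empirical CDF F_n(x) of the p-values, evaluated on a grid of x values."""
    pvals = np.asarray(pvals)
    return np.array([(pvals <= x).mean() for x in grid])

# Super-uniform p-values should satisfy F_n(x) <= x (up to sampling noise).
grid = np.linspace(0.0, 1.0, 11)
F = ecdf([0.3, 0.7, 0.9, 0.5], grid)
```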
Figure 3.8: Plots of achieved FSR and average power, tested with the hierarchical testing procedure with our threshold function (3.15) (HAT) and the reshaped threshold function (3.18) (HAT-reshaped) on a 3-regular tree ($n=100$, $p=243$, $\beta=0.6$, $\rho=0.2$, $\sigma=0.6$). The p-values are generated by the debiasing procedure. The number of true groups $K$ varies from 21 to 93. The target FSR levels $\alpha$ are 0.1, 0.2, 0.3.
We perform our testing procedure with the constructed p-values. Note that the p-values generated are not necessarily independent. We therefore use the reshaped threshold function (3.18), which we have shown in theory controls FSR with arbitrarily dependent p-values. We also test with the threshold function (3.15), for which we have not proven FSR control when the p-values are dependent. In Figure 3.8, we show the results with varying $K$ and target FSR levels 0.1, 0.2, 0.3 for both threshold functions. In Figure 3.9 we fix $K$ at three different values and vary $\alpha$ from 0.01 to 0.5. We observe from the plots that testing with both threshold functions controls FSR below each target level $\alpha$. The reshaping function makes the threshold more conservative; hence the power of the HAT test with the reshaping function is generally lower.
Figure 3.9: Plots of the achieved FSR and the average power for our threshold function (3.15) (HAT) and the reshaped threshold function (3.18) (HAT-reshaped) on a 3-regular tree ($n=100$, $p=243$, $\beta=0.6$, $\rho=0.2$, $\sigma=0.6$). The p-values are generated by the debiasing procedure. The number of true groups $K$ is fixed at three values: 21, 57, and 81. The target FSR level $\alpha$ varies from 0.01 to 0.5.
3.5 Data examples
3.5.1 Application to stocks data
In this section, we analyze whether the volatility of stocks is similar when companies are in similar categories. To investigate this question, we use daily stock price data from January 1, 2015 to December 31, 2019, derived from the US Stock Database ©2021 Center for Research in Security Prices (CRSP), The University of Chicago Booth School of Business [25]. Specifically, we wish
to aggregate stocks in a similar sector unless their volatility levels are significantly different. We
use several criteria for screening stocks of interest: We only keep common stocks that are publicly
traded throughout this entire period; we also avoid penny stocks that have prices under $0.01 per
share. After pre-screening, we have n= 2538 stocks in total. Following [69] and [61], we use the
high-low range estimator for the daily variance,
\[
v_t = \frac{1}{4\log(2)}\big(\log(H_t) - \log(L_t)\big)^2,
\]
where $H_t$ and $L_t$ are day $t$'s highest and lowest prices, respectively. We take the average of $v_t$ throughout the 5-year period as our estimate of the volatility of each stock and log-transform the volatility to reduce skewness.
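In code, the estimator and the subsequent log-transform might look like the following sketch, assuming NumPy arrays of daily highs and lows for a single stock (data loading and screening are omitted):

```python
import numpy as np

def high_low_variance(high, low):
    """High-low range estimator of the daily variance:
    v_t = (log(H_t) - log(L_t))^2 / (4 log 2)."""
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    return (np.log(high) - np.log(low)) ** 2 / (4.0 * np.log(2.0))

def log_volatility(high, low):
    """Average v_t over the sample period, then log-transform to reduce skewness."""
    return np.log(high_low_variance(high, low).mean())

# Example with two days of (high, low) prices:
v = high_low_variance([101.0, 102.0], [99.0, 100.0])
lv = log_volatility([101.0, 102.0], [99.0, 100.0])
```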
We combine this stock log-volatility data with company industry classification information
provided by the Compustat database [23]. The classification system we use is the North Ameri-
can Industry Classification System (NAICS), an industry classification system that employs a six
digit code: the first two digits designate the largest sector; the third, fourth, fifth and sixth dig-
its designate the subsector, industry group, industry, and national industry, respectively. We use
this hierarchy to construct a tree with the first six layers representing the digits and the last layer,
namely the leaves, corresponding to the individual companies.
At every node of the tree, we acquire a p-value by performing an F-test (Equation 8.4, [73]) for testing equality of the log-volatilities of all stocks within the subtree defined by that node. We further apply Simes' procedure to the p-values. We use HAT with the reshaped thresholds and $\alpha = 0.4$. The achieved aggregation result is summarized in Table 3.1.
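For each internal node, testing equality of means across its child groups amounts to a one-way ANOVA. A minimal sketch follows; using scipy's `f_oneway` here is an assumption standing in for the exact test of Equation 8.4 of [73]:

```python
from scipy.stats import f_oneway

def node_pvalue(child_groups):
    """F-test for equality of mean log-volatility across the children of a node.

    child_groups: list of 1-D arrays, one array of leaf log-volatilities per
    child subtree.  Returns the p-value of the one-way ANOVA F-test.
    """
    return f_oneway(*child_groups).pvalue

# Example: three child groups with nearly equal means -> a large p-value.
p_val = node_pvalue([[1.0, 1.1, 0.9], [1.05, 0.95, 1.0], [1.02, 0.98, 1.01]])
```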
The final aggregation result consists of 40 clusters at a variety of levels: 21 at sector level, 8 at
subsector level, 10 at industry group level, and one at company level. Two sectors “Manufacturing
II” and “Finance and Insurance” are split into further clusters while other sectors remain undivided.
Figure 3.10 focuses on the 347 companies in the subsector “Credit Intermediation and Related
Activities”. Each point represents the log-volatility of a company. The three facets correspond
to three industry groups within the subsector and eight levels on the y-axis correspond to the
eight industries nested in the industry groups. As can be observed in the plot, the industry group
“Depository Credit Intermediation” has significantly lower mean (around -8.27) compared to the
other two industry groups in the subsector (around -7.67 and -7.59 respectively). Therefore, the
71
null hypothesis that the three industry groups have similar mean volatility is rejected. On the contrary, within each industry group there are no noticeable differences among the industries, so none of the null hypotheses at the industry-group level is rejected.
Table 3.1: The table shows 40 clusters that are aggregated by our procedure. The rightmost column is the mean log-volatility.
The column Company/Number of Companies shows how many companies are in each cluster (or the name of the company when
there is only one). The columns from Sector to Industry National show the classification of the clusters that are selected.
Sector Subsector Industry Group Industry Industry National Company/Number of Companies Mean
Agriculture, Forestry, Fishing and Hunting 5 -7.69
Mining, Quarrying, and Oil and Gas Extraction 63 -6.51
Utilities 28 -8.4
Construction 45 -7.55
Manufacturing I 73 -8.03
Manufacturing III 576 -7.48
Wholesale Trade 75 -7.62
Retail Trade I 75 -7.56
Retail Trade II 34 -7.5
Transportation 44 -7.81
Warehousing 3 -8.31
Information 261 -7.5
Real Estate and Rental and Leasing 50 -7.34
Professional, Scientific, and Technical Services 77 -7.55
Administrative and Support and Waste Management and Remediation Services 57 -7.65
Educational Services 12 -7.38
Health Care and Social Assistance 41 -7.36
Arts, Entertainment, and Recreation 16 -7.75
Accommodation and Food Services 52 -7.86
Other Services (except Public Administration) 6 -8.08
Nonclassifiable Establishments 9 -7.28
Manufacturing II
  Wood Product Manufacturing 6 -7.55
  Paper Manufacturing 19 -8.14
  Printing and Related Support Activities 4 -7.91
  Petroleum and Coal Products Manufacturing 16 -7.93
  Plastics and Rubber Products Manufacturing 17 -7.65
  Nonmetallic Mineral Product Manufacturing 11 -7.44
  Chemical Manufacturing
    Basic Chemical Manufacturing 28 -7.18
    Resin, Synthetic Rubber, and Artificial and Synthetic Fibers and Filaments Manufacturing 6 -7.91
    Pesticide, Fertilizer, and Other Agricultural Chemical Manufacturing 6 -7.14
    Pharmaceutical and Medicine Manufacturing 299 -6.37
    Paint, Coating, and Adhesive Manufacturing 6 -8.06
    Soap, Cleaning Compound, and Toilet Preparation Manufacturing 14 -8.18
    Other Chemical Product and Preparation Manufacturing 5 -8.35
Finance and Insurance
  Securities, Commodity Contracts, and Other Financial Investments and Related Activities 64 -7.95
  Insurance Carriers and Related Activities 87 -8.3
  Funds, Trusts, and Other Financial Vehicles MANHATTAN BRIDGE CAPITAL INC -7.64
  Credit Intermediation and Related Activities
    Depository Credit Intermediation 306 -8.27
    Nondepository Credit Intermediation 32 -7.67
    Activities Related to Credit Intermediation 9 -7.59
3.5.2 Application to NYC taxi data
We apply our method of aggregating features to the NYC Yellow Taxi Trip data (available at
data.cityofnewyork.us). The data set contains billions of records of trips made by yellow taxi
drivers throughout the year of 2013. We restrict our attention to the single month of December.
After cleaning the data, we have 13.5 million trips made by a total of n= 32704 taxi drivers,
who are uniquely identified by hack license sequences. We study the association between pickup
locations and fares. Specifically, we take the total fare each taxi driver earned in December 2013 as
the response variable and take the number of rides starting from each neighborhood as the features.
We adopt neighborhood tabulation areas (NTAs) as our definition of neighborhoods; these were originally created for population projection [67]. In total, there are $p=194$ road-accessible neighborhoods where taxi pickups can occur. To form a tree with the NTAs as leaves, we start by connecting the root to five nodes representing the boroughs of New York City. Within each borough, we apply hierarchical clustering to the NTAs based on their geographical coordinates. This results in a tree with depth 10.
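The within-borough step can be sketched with scipy's hierarchical clustering. The coordinates below are made up, and Ward linkage is an assumption (the text does not specify the linkage used); each merge in the linkage matrix becomes an internal node of the tree beneath the borough node.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(1)

# Made-up (lat, lon) coordinates for the NTAs of one borough.
coords = rng.uniform(low=(40.5, -74.2), high=(40.9, -73.7), size=(12, 2))

# Hierarchical clustering of the NTAs within the borough.
Z = linkage(coords, method="ward")

# For example, cutting the within-borough dendrogram into 3 clusters:
labels = cut_tree(Z, n_clusters=3).ravel()
```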
The design matrix $X$ stores the driver-neighborhood count data, i.e., entry $X_{ij}$ is the number of trips started in neighborhood $j$ by driver $i$ within the month of December. The availability of taxis is not uniformly distributed across the city (see Figure 3.11), and $X$ is a highly sparse matrix: most areas had fewer than 10% of the drivers starting their trips there during that month, and in fact 109 out of the 194 neighborhoods saw less than 1% of the drivers. The response variable that we use for guiding the aggregation is the total amount of fare earned by each driver within the month.
To aggregate neighborhood features, we perform the following procedure: with data $X$ and $y$, as well as the given tree structure, we first fit the penalized regression (3.24) to construct an initial estimate $\hat{\theta}$ of the coefficients. The estimation is carried out using the rare package, with cross-validation for choosing the regularization parameters $\lambda$ and $\alpha$ across a grid of $5\times 50$
(Figure 3.10 panels: "Activities Related to Credit Intermediation", "Depository Credit Intermediation", and "Nondepository Credit Intermediation". Industries on the y-axis: Financial Transactions Processing, Reserve, and Clearinghouse Activities; Mortgage and Nonmortgage Loan Brokers; Other Activities Related to Credit Intermediation; Commercial Banking; Savings Institutions; Credit Card Issuing; Other Nondepository Credit Intermediation; Sales Financing. x-axis: log(volatility).)
Figure 3.10: The subsector "Credit Intermediation and Related Activities" consists of 347 companies, whose log-volatilities are represented by the points. The subsector consists of 3 industry groups and 8 industries. Applying HAT rejects the null hypothesis that the 3 industry groups have similar mean stock volatility, but does not reject the null hypotheses that within each industry group different industries have the same mean stock log-volatility.
Figure 3.11: Map of neighborhoods colored with percentage of drivers who have started a trip
from there. Most neighborhoods have fewer than 10% of the drivers starting their trips there in the
month of December 2013.
values. Next, we carry out the debiasing step by solving the optimization problem (3.26) with the R package quadprog. Note that the noise level $\sigma$ is unknown; we estimate it using the scaled lasso ([77]; R package scalreg). Moreover, the positive constant $\tau$ in (3.30) is set to one. After constructing the p-values for each non-leaf node of the tree, we run HAT with $\alpha = 0.3$.
3.5.2.1 Aggregation results
Our testing result shows 40 rejections in the tree, leading to 44 aggregated clusters. Among these clusters, the boroughs of the Bronx and Staten Island remain undivided, meaning that the neighborhoods in the Bronx and Staten Island each form one aggregated cluster. Brooklyn, Queens, and Manhattan are divided into 7, 14, and 21 subgroups, respectively. We perform least squares subject to the equality constraints from our aggregation result, which is equivalent to an unconstrained least squares with 44 features, in which each feature is the sum of the raw features in its group.
In Figure 3.12, we show a choropleth of these coefficient values. Trips starting from Manhattan and parts of Queens, especially the airports, have higher coefficient values. Within Manhattan, areas like Hell's Kitchen, Times Square, and Penn Station have some of the highest coefficient values. This may be attributed to poorer access to public transportation in the case of Hell's Kitchen, and to being tourist and transportation hubs in the case of Times Square and Penn Station. Staten Island and the eastern part of Queens, however, have lower coefficient values, likely because they are more residential, with the exception of the areas close to Manhattan and of Flushing, which are more commercial.
We can also study how the aggregation results vary with the sample size. To do so, we randomly subset the original dataset to different sizes and perform the above procedure on each sample. The number of groups achieved for each sample size is shown in Table 3.2. As expected, reduced sample sizes lead to fewer rejections and therefore fewer aggregated groups.
Figure 3.12: Map colored with log-transformed least squares coefficients fitted by regressing fare on features from HAT's aggregation of the neighborhoods of New York City. There are 44 aggregated clusters out of the 194 neighborhoods. Darker colors correspond to higher fitted coefficients.
Table 3.2: Achieved number of groups with decreasing sample size.
Sample Size      p    Number of Groups
n = 32704        194  44
n/2 = 16352      194  42
n/4 = 8176       194  29
n/8 = 4088       194  29
n/16 = 2044      194  18
n/32 = 1022      194  12
3.5.2.2 Comparing prediction performance
In this section, we assess prediction performance achieved by our aggregated features. We hold
out a random sample of 20% of the drivers as the test set, and train with the remaining 80%
observations. For comparison, we also consider the following models that cluster features or apply
regularization:
Lasso with the original variables (L1): We fit the model with the original 194 features. After
a 10-fold cross-validation procedure with the training data, the model chosen contains 38
non-zero variables.
Lasso with only dense features (L1-dense): We perform a pre-processing step of filtering out the neighborhoods from which fewer than 0.5% of the drivers have started trips (namely, feature frequencies < 0.5%). After filtering, we are left with 99 neighborhoods. We fit a lasso regression with the remaining 99 features and select the model with a 10-fold cross-validation procedure using the training data.
Least squares with clusters aggregated at a pre-specified hierarchical level (ls-boro): We aggregate the neighborhood features to the borough level and fit a least squares model with these 5 features.
Lasso with clusters aggregated at optimized height (L1-agg-h). In addition to the standard
lasso tuning parameter, we use an extra parameter h that determines the height in the tree
where we aggregate the features. The parameter h varies on a grid of 5 equally spaced
values. We use a 10-fold cross-validation procedure for selecting both parameters on the
training data.
Rare regression, proposed by [90], which encourages both sparsity and aggregation through a penalization term (Rare). The parameters are chosen with 10-fold cross-validation on a $2\times 100$
Table 3.3: Prediction performance of the 6 methods with the test data set.
Method MSPE
L1 95.780
L1-dense 95.864
ls-boro 147.438
L1-agg-h 102.593
Rare 95.725
HAT (our method) 95.459
grid. The aggregation is formed at the same time the model is fit. The model chosen contains
29 aggregated features.
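The aggregation step shared by the ls-boro and L1-agg-h baselines, summing the leaf counts within each cluster of a tree cut, can be sketched as follows. The leaf-to-cluster map is a hypothetical input, e.g., obtained by cutting the tree at a fixed height:

```python
import numpy as np

def aggregate_features(X, leaf_cluster):
    """Sum the columns of X within each cluster of leaves.

    X: (n, p) count matrix; leaf_cluster: length-p integer array mapping each
    leaf to its cluster after cutting the tree.  Returns an (n, n_clusters)
    matrix of aggregated features.
    """
    clusters = np.unique(leaf_cluster)
    return np.column_stack(
        [X[:, leaf_cluster == c].sum(axis=1) for c in clusters]
    )

X_toy = np.arange(12).reshape(3, 4)  # 3 drivers, 4 leaf neighborhoods
agg = aggregate_features(X_toy, np.array([0, 0, 1, 1]))
```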
We compute the mean squared prediction error (MSPE) of each method on the test set (see Table 3.3). The L1 and L1-dense methods are not aggregation-based and achieve similar performance. Both ls-boro and L1-agg-h achieve some level of aggregation, but the aggregations are restricted to fixed heights; L1-agg-h has an additional tuning parameter and is therefore at an advantage. Lastly, both Rare and our method achieve aggregation in a flexible way, and their prediction results are comparable. Rare selects 43 aggregation clusters, while our method yields 46 groups in total.
3.5.2.3 FSR with synthetic data
To directly evaluate the aggregation recovery performance of HAT, we create a synthetic response based on the tree structure $\mathcal{T}$ constructed from the neighborhoods, as well as the observed trip counts data $X$. In addition, we take the aggregation result and fitted coefficients from Section 3.5.2.1 as the true aggregation and the true vector $\theta^*$. We simulate the response 100 times independently according to (3.23) with $\sigma = 15$. We use the same debiasing method to calculate the node-wise p-values and perform our testing procedure with target FSR levels varying from $\alpha = 0.01$ to $\alpha = 0.5$. We compare the aggregation results with the true aggregation and compute the FSR and average power over the 100 runs. The results are shown in Table 3.4.
Table 3.4: Achieved FSR and average power by our algorithm with synthetic data where the noise level is $\sigma = 15$.
Target Level FSR Average Power
0.01 0.000 0.547
0.02 0.000 0.560
0.05 0.000 0.577
0.10 0.001 0.593
0.20 0.003 0.608
0.30 0.003 0.620
0.40 0.004 0.626
0.50 0.005 0.632
3.6 Conclusion
In many application domains, ranging from business and e-commerce, to computer vision and
image processing, biology and ecology, the data measurements are naturally associated with the
leaves of a tree which represents the data structure. Motivated by these applications, in this work we studied the problem of splitting the measurements into non-overlapping subgroups that can be expressed as a combination of branches of the tree. The subgroups ideally express the leaves that should be aggregated together and perceived as single entities. We formulated the task of tree-based aggregation/splitting as a multiple testing problem and introduced a novel metric called the false split rate, which corresponds to the fraction of splits made that were unnecessary. In addition, we proposed the so-called HAT procedure (and a few variants of it), which returns a splitting of the leaves that is guaranteed to control the false split rate under the target level.
It is worth noting some of the salient distinctions between the setup considered in this chapter and classical hierarchical clustering. First, in hierarchical clustering the tree is cut at a fixed level, while our framework allows for a more flexible summarization of the tree in which different branches are cut at different depths. In other words, our framework yields a multi-scale resolution of the data. Second, the clustering problem is often formulated as an unsupervised problem. In contrast, our framework can be perceived as a supervised clustering problem, where the labeled data are used to group the leaves by combining branches of the tree.
3.7 Proof of main theorems
3.7.1 Proof of Theorem 3.2.1
Recall the definition of the quantities $V$ and $R$:
\begin{align*}
V &:= \sum_{u\in\mathcal{F}} \big(\deg_{\mathcal{T}}(u) - \deg_{\mathcal{T}_{\rm rej}}(u)\big) - |\mathcal{B}^*\cap\mathcal{F}|,\\
R &:= \max\Big\{\sum_{u\in\mathcal{T}_{\rm rej}} \big(\deg_{\mathcal{T}}(u) - \deg_{\mathcal{T}_{\rm rej}}(u)\big) - 1,\; 0\Big\}.
\end{align*}
Note that $R\geq 0$ because $\mathcal{T}_{\rm rej}\subseteq\mathcal{T}$ (recall that $\mathcal{T}_{\rm rej}$ does not include any leaves of $\mathcal{T}$, as there is no hypothesis associated with those nodes). As we showed in Lemma 3.1.2, the false split rate can be written in terms of $V$ and $R$:
\[
{\rm FSR} = \mathbb{E}\left[\frac{V}{R\vee 1}\right].
\]
For a node $a\in\mathcal{B}^*$, let $\mathcal{F}_a = \mathcal{F}\cap\mathcal{T}_a$, and define the quantity $V_a$ as follows:
\[
V_a = \begin{cases}
\displaystyle\sum_{u\in\mathcal{F}_a} \big(\deg_{\mathcal{T}}(u) - \deg_{\mathcal{T}_{a,{\rm rej}}}(u)\big) - 1, & \text{if } \mathcal{F}_a \neq \emptyset,\\
0 & \text{otherwise.}
\end{cases} \quad (3.32)
\]
By definition, $V_a \geq 0$. Indeed, from the proof of Lemma 3.1.2, $V_a$ is the number of false splits in the set $\mathcal{L}_a$. Also, it is easy to verify that $V = \sum_{a\in\mathcal{B}^*} V_a$.
82
We first show that
E
V
a
R
ajL
a
j
p
; for a2B
: (3.33)
Denote by S(T
a
) the set of all nonempty subtrees ofT
a
rooted at node a. We also let V
a
(T
0
) be
the number of false splits inL
a
when the rejection subtree isT
0
, i.e.,
V
a
(T
0
)=
å
u2T
0
(deg
T
(u) deg
T
0(u)) 1:
Here we used that a2B
and therefore any rejection inT
0
is a false rejection and soF
a
=T
0
.
Define
˜
R
T
0 to be the total number of splits when we set p
u
= 0 for u2T
0
and p
u
= 1 for u2
T
a
nT
0
.
Note that $\tilde{R}_{\mathcal{T}_{a,{\rm rej}}} = R$, since for $u\in\mathcal{T}_{a,{\rm rej}}$ the p-value $p_u$ is already below the threshold at node $u$, and for $u\in\mathcal{T}_a\setminus\mathcal{T}_{a,{\rm rej}}$, $p_u$ is already above the threshold at that node $u$. Therefore, writing $\mathcal{P}^c_{\mathcal{T}_a} = \{p_u : u\notin\mathcal{T}_a\}$, we have
\begin{align*}
\mathbb{E}\left[\frac{V_a}{R}\mathbf{1}(V_a>0)\,\Big|\,\mathcal{P}^c_{\mathcal{T}_a}\right]
&= \sum_{\mathcal{T}'\in S(\mathcal{T}_a)} \mathbb{E}\left[\frac{V_a(\mathcal{T}')}{\tilde{R}_{\mathcal{T}'}}\mathbf{1}(\mathcal{T}_{a,{\rm rej}}=\mathcal{T}')\,\Big|\,\mathcal{P}^c_{\mathcal{T}_a}\right]\\
&= \sum_{\mathcal{T}'\in S(\mathcal{T}_a)} \frac{V_a(\mathcal{T}')}{\tilde{R}_{\mathcal{T}'}}\,\mathbb{P}(\mathcal{T}_{a,{\rm rej}}=\mathcal{T}'), \quad (3.34)
\end{align*}
where $S(\mathcal{T}_a)$ denotes the set of all nonempty subtrees of $\mathcal{T}_a$ rooted at node $a$, and we have used the fact that $V_a(\mathcal{T}')$ is non-random and $\tilde{R}_{\mathcal{T}'}$ is constant conditional on $\mathcal{P}^c_{\mathcal{T}_a}$.
Define $R_d(r) := \sum_{u\in\mathcal{T}_d} \mathbf{1}\{p_u \leq \alpha_u(r)\}\,(\deg_{\mathcal{T}}(u) - 1)$. Observe that $R_d(r_d)$ is the additional number of splits made by the rejected nodes at depth $d$, going from depth $d-1$ to depth $d$, because the hypotheses $H_0^u$ at depth $d$ are tested at level $\alpha_u(r_d)$. Using our notation, this can be written as the identity $R_{1:d} = R_{1:(d-1)} + R_d(r_d)$.
We argue that $r_d = R_d(r_d)$. To see why, note that by definition
\[
r_d = \max\Big\{ 0\leq r\leq \sum_{u\in\mathcal{T}_d}\deg_{\mathcal{T}}(u) - |\mathcal{T}_d| \,:\, r\leq R_d(r) \Big\}.
\]
Hence, $r_d\leq R_d(r_d)$ and $r_d+1 > R_d(r_d+1)$. Since $R_d(r)$ is an integer-valued function, the fact that $R_d(r_d+1) < r_d+1$ implies $R_d(r_d+1)\leq r_d$. Thus, $r_d \leq R_d(r_d) \leq R_d(r_d+1) \leq r_d$, which gives $r_d = R_d(r_d)$, and consequently
\[
R_{1:d} = R_{1:(d-1)} + r_d. \quad (3.35)
\]
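Since $R_d(r)$ is integer-valued and non-decreasing in $r$, the quantity $r_d$ can be found by a simple scan over the admissible range, and the argument above guarantees the scan stops at a fixed point. A sketch, with a toy non-decreasing function standing in for the actual split-count function $R_d$:

```python
def find_r_d(R_d, r_max):
    """Largest r in {0, ..., r_max} with r <= R_d(r).  When R_d is
    integer-valued and non-decreasing, this r satisfies r = R_d(r)."""
    r_d = 0
    for r in range(r_max + 1):
        if r <= R_d(r):
            r_d = r
    return r_d

# Toy non-decreasing split-count function (hypothetical):
R_toy = lambda r: min(2 + r // 2, 5)
r_d = find_r_d(R_toy, 10)
```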
We next continue by upper bounding the right-hand side of (3.34). Based on our testing methodology, described in Algorithm 2, a typical node $u$ at depth $d$ is tested at level $\alpha_u(r_d)$ given by (3.15). We have
\begin{align*}
\alpha_u(r_d) &= \mathbf{1}\{{\rm parent}(u)\in\mathcal{T}^{d-1}_{\rm rej}\}\,\frac{1}{D}\cdot\frac{\alpha|\mathcal{L}_u|(R_{1:(d-1)}+r_d)}{p\big(1-\frac{1}{D^2}\big)\bar{h}_{d,r} + \alpha|\mathcal{L}_u|(R_{1:(d-1)}+r_d)}\\
&= \mathbf{1}\{{\rm parent}(u)\in\mathcal{T}^{d-1}_{\rm rej}\}\,\frac{1}{D}\cdot\frac{\alpha|\mathcal{L}_u|R_{1:d}}{p\big(1-\frac{1}{D^2}\big)\bar{h}_{d,r} + \alpha|\mathcal{L}_u|R_{1:d}}\\
&= \mathbf{1}\{{\rm parent}(u)\in\mathcal{T}^{d-1}_{\rm rej}\}\,\frac{1}{D}\cdot\frac{\gamma_u}{p\big(1-\frac{1}{D^2}\big) + \gamma_u}, \quad (3.36)
\end{align*}
with
\[
\gamma_u := \frac{\alpha}{\bar{h}_{d,r}}\,|\mathcal{L}_u|\,R_{1:d}.
\]
Note that $\alpha_u(r_d)$ is increasing in $\gamma_u$.
Lemma 3.7.1. Suppose that $u\in\mathcal{T}_a$ and the node $a$ is at level $d_a$. Then, on the event $\{\mathcal{T}_{a,{\rm rej}}=\mathcal{T}'\}$ we have
\[
\gamma_u \leq \frac{\alpha}{\bar{h}_{d,r}}\,|\mathcal{L}_a|\,\tilde{R}_{\mathcal{T}'}. \quad (3.37)
\]
The proof of Lemma 3.7.1 follows readily from the fact that on the event $\{\mathcal{T}_{a,{\rm rej}}=\mathcal{T}'\}$, we have $R_{1:d}\leq\tilde{R}_{\mathcal{T}'}$. Also, since $u\in\mathcal{T}_a$ we have $|\mathcal{L}_u|\leq|\mathcal{L}_a|$.
We next provide an upper bound on the thresholds $\alpha_u(r_d)$ for all nodes $u\in\mathcal{T}_{a,{\rm rej}}$, which will be useful in controlling FSR. For a positive integer $m$, define
\[
\tilde{\gamma}_{a,m} := \frac{\alpha}{\bar{h}_{d,r}}\,|\mathcal{L}_a|\,m. \quad (3.38)
\]
Using Lemma 3.7.1 and the fact that $\alpha_u(r_d)$ is increasing in $\gamma_u$, we obtain that on the event $\{\tilde{R}_{\mathcal{T}'}=m\}$, the following holds:
\[
\alpha_u(r_d) \leq \tilde{\alpha}_{a,m} := \frac{1}{D}\cdot\frac{\tilde{\gamma}_{a,m}}{p\big(1-\frac{1}{D^2}\big)+\tilde{\gamma}_{a,m}}. \quad (3.39)
\]
We are now ready to upper bound the right-hand side of (3.34).
Proposition 3.7.2. Let $a\in\mathcal{B}^*$ and assume that the null p-values are mutually independent, and independent from the non-null p-values. For our testing procedure described in Algorithm 2, the following holds true:
\[
\mathbb{E}\left[\sum_{\mathcal{T}'\in S(\mathcal{T}_a)} \frac{V_a(\mathcal{T}')}{\tilde{R}_{\mathcal{T}'}}\,\mathbb{P}(\mathcal{T}_{a,{\rm rej}}=\mathcal{T}'\,|\,\mathcal{P}^c_{\mathcal{T}_a})\right] \leq \frac{\alpha|\mathcal{L}_a|}{p}. \quad (3.40)
\]
The proof of Proposition 3.7.2 uses the equality (3.36) and the structural properties of the tree $\mathcal{T}$. Its proof is deferred to Section 3.9 of the appendix. The bound (3.33) now follows readily by applying iterated expectation to (3.40).
Proof of Theorem 3.2.1. By using the bound (3.33) and noting that $V=\sum_{a\in\mathcal{B}^*}V_a$, we have
\[
{\rm FSR} = \sum_{a\in\mathcal{B}^*}\mathbb{E}\left[\frac{V_a}{R\vee 1}\right] = \sum_{a\in\mathcal{B}^*}\mathbb{E}\left[\frac{V_a\,\mathbf{1}(V_a>0)}{R}\right] \leq \sum_{a\in\mathcal{B}^*}\frac{|\mathcal{L}_a|}{p}\,\alpha = \alpha.
\]
The result follows.
3.7.2 Proof of Theorem 3.2.2
Theorem 3.2.2 can be proved by following similar lines as the proof of Theorem 3.2.1, so we omit a detailed proof here. The main difference is that in this case, the quantity $\tilde{\alpha}_{a,m}$ should be defined as
\[
\tilde{\alpha}_{a,m} := \frac{1}{D}\cdot\frac{\tilde{\gamma}_{a,m}}{p\big(1-\frac{1}{D^2}\big)+\tilde{\gamma}_{a,m}} - \varepsilon_0. \quad (3.41)
\]
Also, the bound (3.66) is updated as
\[
q_{u,m} = \mathbb{P}(u\in\mathcal{T}^m_{a,{\rm rej}}) \leq (\tilde{\alpha}_{a,m}+\varepsilon_0)^{{\rm depth}(u)-{\rm depth}(a)+1}, \quad (3.42)
\]
and therefore, similar to (3.67), we have
\[
\sum_{u\in\mathcal{T}_a} q_{u,m} \leq \frac{1}{D}\cdot\frac{D(\tilde{\alpha}_{a,m}+\varepsilon_0)}{1-D(\tilde{\alpha}_{a,m}+\varepsilon_0)} = \frac{1}{Dp\big(1-\frac{1}{D^2}\big)}\,\tilde{\gamma}_{a,m}, \quad (3.43)
\]
which is the same bound as in (3.67), albeit via a slightly different derivation and choice of threshold levels $\alpha_u(r)$. The rest of the proof is identical to the proof of Theorem 3.2.1.
86
3.7.3 Proof of Theorem 3.2.3
Let a2B
, we have
E
V
a
R
1fV
a
> 0g
=E
"
å
T
0
2S(T
a
)
V
a
(T
0
)
R
1
T
a;rej
=T
0
#
(D
1
D
)
å
T
0
2S(T
a
)
E
jT
0
j
R
1
T
a;rej
=T
0
=(D
1
D
)
å
T
0
2S(T
a
)
å
u2T
0
E
"
1
T
a;rej
=T
0
R
#
=(D
1
D
)
å
u2T
a
å
T
0
2S(T
a
):u2T
0
E
"
1
T
a;rej
=T
0
R
#
=(D
1
D
)
å
u2T
a
E
"
1
u2T
a;rej
R
#
=(D
1
D
)
å
u2T
a
E
"
1
p
u
a
u
(r
d
)
R
#
(D
1
D
)
å
u2T
a
E
"
1
p
u
a
u
(r
d
)
R
1:(d1)
+ r
d
#
=(D
1
D
)
å
u2T
a
E
2
6
6
4
1
p
u
ajL
u
jb
d
(R
1:(d1)
+r
d
)
p(D
1
D
)(D1)
R
1:(d1)
+ r
d
3
7
7
5
; (3.44)
where the first inequality follows from Lemma 3.10.1; the second inequality is because R R
1:d
=
R
1:(d1)
+ r
d
.
Next, we will use the following proposition by Blanchard & Roquain.
Proposition 3.7.3 ([11]). A couple $(U,V)$ of possibly dependent nonnegative random variables such that $U$ is superuniform, i.e., $\forall t\in[0,1]$, $\mathbb{P}(U\leq t)\leq t$, satisfies the following inequality:
\[
\forall c>0,\quad \mathbb{E}\left[\frac{\mathbf{1}\{U\leq c\,\beta(V)\}}{V}\right] \leq c,
\]
if $\beta(\cdot)$ is a shape function of the following form:
\[
\beta(x) = \int_0^x t\,d\nu(t),
\]
where $\nu$ is an arbitrary probability distribution on $(0,\infty)$, and $V$ is arbitrary.
Letting $U=p_u$, $V=R_{1:(d-1)}+r_d$, and $c=\frac{\alpha|\mathcal{L}_u|}{p(D-\frac{1}{D})(D-1)}$, we have
\begin{align*}
\Big(D-\frac{1}{D}\Big)\sum_{u\in\mathcal{T}_a}\mathbb{E}\left[\frac{\mathbf{1}\{p_u\leq\alpha_u(r_d)\}}{R_{1:(d-1)}+r_d}\right]
&\leq \Big(D-\frac{1}{D}\Big)\sum_{u\in\mathcal{T}_a}\frac{\alpha|\mathcal{L}_u|}{p(D-\frac{1}{D})(D-1)}\\
&= \frac{\alpha}{p}\left[\sum_{u\in\mathcal{T}_a}\frac{|\mathcal{L}_u|}{D-1}\right] \leq \frac{\alpha|\mathcal{L}_a|}{p},
\end{align*}
where the last inequality follows from
\[
\sum_{u\in\mathcal{T}_a}|\mathcal{L}_u| \leq \sum_{d=2}^{D}\sum_{u\in\mathcal{T}_d\cap\mathcal{T}_a}|\mathcal{L}_u| = \sum_{d=2}^{D}|\mathcal{L}_a| = (D-1)\,|\mathcal{L}_a|.
\]
It is reasonable to use a measure $\nu$ that puts mass proportional to $\frac{1}{x}$ only on the values that its argument could possibly take. By the design of the tree, we have
\[
R_{1:(d-1)}+r_d \geq (d-1)(d-1)+d-1 = d(d-1),
\]
since at least one node has to be rejected at each depth from 1 to $d-1$ for the algorithm to carry on to depth $d$, and
\[
R_{1:(d-1)}+r_d \leq \sum_{u\in\mathcal{T}_{d-1}}\deg_{\mathcal{T}}(u) - 1.
\]
Therefore,
\[
\beta_d(R_{1:(d-1)}+r_d) = \frac{R_{1:(d-1)}+r_d}{\displaystyle\sum_{k=d(d-1)}^{\sum_{u\in\mathcal{T}_{d-1}}\deg_{\mathcal{T}}(u)-1}\frac{1}{k}}.
\]
The rest of the proof is identical to the proof of Theorem 3.2.1.
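Numerically, this choice of $\nu$ makes $\beta_d$ the identity rescaled by a truncated harmonic sum over the admissible values of $R_{1:(d-1)}+r_d$; a sketch (the endpoints in the example are hypothetical):

```python
def beta_d(x, lo, hi):
    """Shape function beta_d(x) = x / sum_{k=lo}^{hi} (1/k), where [lo, hi]
    is the range of values that R_{1:(d-1)} + r_d can take."""
    harmonic = sum(1.0 / k for k in range(lo, hi + 1))
    return x / harmonic

# Example: depth d = 3, so lo = d(d - 1) = 6; say hi = 20 for the upper bound.
b = beta_d(10, 6, 20)
```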
3.7.3.1 Proof of Theorem 3.2.4
The result of the theorem can be derived by a similar argument to the one used in the proof of Theorem 3.2.3. We leave out a detailed proof and only highlight the required modifications to the proof of Theorem 3.2.3.
Following the chain of inequalities in (3.44), we have that for $a\in\mathcal{B}^*$,
\begin{align*}
\mathbb{E}\left[\frac{V_a}{R}\mathbf{1}\{V_a>0\}\right]
&\leq \Big(D-\frac{1}{D}\Big)\sum_{u\in\mathcal{T}_a}\mathbb{E}\left[\frac{\mathbf{1}\{p_u\leq\alpha_u(r_d)\}}{R_{1:(d-1)}+r_d}\right]\\
&= \Big(D-\frac{1}{D}\Big)\sum_{u\in\mathcal{T}_a}\mathbb{E}\left[\frac{\mathbf{1}\Big\{p_u\leq\frac{\alpha|\mathcal{L}_u|\,\beta_d(R_{1:(d-1)}+r_d)}{p(D-\frac{1}{D})(D-1)}-\varepsilon_0\Big\}}{R_{1:(d-1)}+r_d}\right]\\
&= \Big(D-\frac{1}{D}\Big)\sum_{u\in\mathcal{T}_a}\mathbb{E}\left[\frac{\mathbf{1}\Big\{p_u+\varepsilon_0\leq\frac{\alpha|\mathcal{L}_u|\,\beta_d(R_{1:(d-1)}+r_d)}{p(D-\frac{1}{D})(D-1)}\Big\}}{R_{1:(d-1)}+r_d}\right], \quad (3.45)
\end{align*}
where we used the definition of the threshold function $\alpha_u(r_d)$ given by (3.20). Also, by the theorem's assumption, $p_u+\varepsilon_0$ is super-uniform because
\[
\mathbb{P}(p_u+\varepsilon_0\leq t) = \mathbb{P}(p_u\leq t-\varepsilon_0) \leq (t-\varepsilon_0)+\varepsilon_0 = t.
\]
Therefore we can apply Proposition 3.7.3 with $U=p_u+\varepsilon_0$, $V=R_{1:(d-1)}+r_d$, and $c=\frac{\alpha|\mathcal{L}_u|}{p(D-\frac{1}{D})(D-1)}$. The rest of the proof is identical to the proof of Theorem 3.2.3 and is omitted.
89
3.8 Proof of technical lemmas
3.8.1 Proof of Lemma 3.1.1
Following the example given in Figure 3.1, we use solid shape for true barriers and dashed shape
for achieved barriers. Therefore, FDP
b
can be written as
FDP
b
=
ji2[p] : J
i
= 0;
b
J
i
= 1j
ji2[p] :
b
J
i
= 1j
=
#fslots with dashed but not solid barriersg
#fslots with dashed barriersg
:
Similarly, we can write TPP
b
as
TPP
b
=
ji2[p] : J
i
= 1;
b
J
i
= 1j
ji2[p] : J
i
= 1j
=
#fslots with solid and dashed barriersg
#fslots with solid barriersg
;
where(R
b
)
C
:=(H
b
0
[H
b
1
)nR
b
is the set of non-rejections.
Furthermore, we know that n barriers yield(n+1) groups. Hence #fslots with only dashed barriersg=
M 1 and #fslots with only solid barriersg= K 1.
Fixing the solid barriers, any dashed barrier in a slot where a solid barrier does not already
exist will divide one true group into two. That is to say, within each true group C
i
, the number of
dashed barriers, say m, will yield(m+ 1) pairs offi; jg such that C
i
\
b
C
j
6= / 0. Therefore,
#fslots with dashed but not solid barriersg=
K
å
i=1
M
å
j=1
1fC
i
\
b
C
j
6= / 0g 1
!
=
K
å
i=1
M
å
j=1
1fC
i
\
b
C
j
6= / 0g
!
K:
90
Similarly, by exchanging the role of dashed and solid barriers, we also get
#fslots with solid but not dashed barriersg=
M
å
i=1
K
å
j=1
1fC
i
\
b
C
j
6= / 0g 1
!
=
M
å
i=1
K
å
j=1
1fC
i
\
b
C
j
6= / 0g
!
M:
Finally, by plugging in the terms, we can draw the equivalence between FDP
b
and FSP, as well
as the two true positive proportions.
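The slot-and-barrier counting used in this proof is straightforward to verify numerically. A sketch, encoding a grouping of $p$ ordered leaves as integer labels, with a barrier in the slot between two consecutive leaves that carry different labels:

```python
import numpy as np

def barriers(labels):
    """J_i = 1 if a barrier separates leaf i from leaf i + 1."""
    labels = np.asarray(labels)
    return (labels[:-1] != labels[1:]).astype(int)

def fdp_tpp(true_labels, est_labels):
    """FDP and TPP of the achieved (dashed) barriers vs. the true (solid) ones."""
    J, J_hat = barriers(true_labels), barriers(est_labels)
    fdp = ((J == 0) & (J_hat == 1)).sum() / max(J_hat.sum(), 1)
    tpp = ((J == 1) & (J_hat == 1)).sum() / max(J.sum(), 1)
    return fdp, tpp

# True grouping {1,2,3}{4,5}; estimated grouping {1,2}{3}{4,5}:
fdp, tpp = fdp_tpp([0, 0, 0, 1, 1], [0, 0, 1, 2, 2])
```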
3.8.2 Proof of Lemma 3.1.2
We will prove the lemma by showing that
\[
\max\Big\{\sum_{u\in\mathcal{T}_{\rm rej}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big)-1,\;0\Big\} = M-1 \quad (3.46)
\]
and
\[
\sum_{u\in\mathcal{F}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big)-|\mathcal{B}^*\cap\mathcal{F}| = \left(\sum_{i=1}^{K}\sum_{j=1}^{M}\mathbf{1}\{C_i\cap\hat{C}_j\neq\emptyset\}\right)-K. \quad (3.47)
\]
The proof is based on induction on the depth $D$ of the tree.
We first prove the induction basis, when $D=2$. In this case, $\mathcal{T}$ consists of one root node $u_0$ and its children as leaves. We therefore have only one hypothesis, $H_0^{u_0}$.
If $H_0^{u_0}$ fails to be rejected, then $\mathcal{F}=\mathcal{T}_{\rm rej}=\emptyset$ and $M=1$. Both the left-hand side and the right-hand side of equation (3.46) are 0. For equation (3.47), the left-hand side is clearly 0, and the right-hand side is also 0, since $\sum_{i=1}^{K}\sum_{j=1}^{M}\mathbf{1}\{C_i\cap\hat{C}_j\neq\emptyset\} - K = (\sum_{i=1}^{K}1) - K = 0$.
If $H_0^{u_0}$ is rejected, we have $M=\deg_{\mathcal{T}}(u_0)$ and $\mathcal{T}_{\rm rej}=\{u_0\}$. Equation (3.46) holds because $\sum_{u\in\mathcal{T}_{\rm rej}}(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u))-1 = \deg_{\mathcal{T}}(u_0)-\deg_{\mathcal{T}_{\rm rej}}(u_0)-1 = M-1$, since $\deg_{\mathcal{T}_{\rm rej}}(u_0)=0$. For equation (3.47), we consider two scenarios:
- If $H_0^{u_0}$ is true, then $K=|\mathcal{B}^*|=1$ and $\mathcal{F}=\mathcal{T}_{\rm rej}=\{u_0\}$. So the left-hand side of (3.47) becomes $\deg_{\mathcal{T}}(u_0)-1 = M-1$, and the right-hand side becomes $M-K = M-1$; hence the equality holds.
- Otherwise, $H_0^{u_0}$ is false and $K=|\mathcal{B}^*|=\deg_{\mathcal{T}}(u_0)=M$, and $\mathcal{F}=\emptyset$. So the left-hand side of (3.47) becomes 0, and the right-hand side becomes $M-K=0$; hence the equality holds.
Next we proceed by proving the induction step for equation (3.46). Let $D>2$ be an arbitrary integer. We assume that for a tree with maximum depth $D-1$, identity (3.46) holds. We want to show that it holds for a tree with maximum depth $D$.
Clearly, this equation holds when the root node is not rejected, i.e., $\mathcal{T}_{\rm rej}=\emptyset$ and $M=1$. We henceforth discuss the case where the root node is rejected. In this case, equation (3.46) can be simplified as
\[
\sum_{u\in\mathcal{T}_{\rm rej}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big) = M.
\]
For a tree $\mathcal{T}$ with maximum depth $D$, if we remove the root node, we are left with a forest where each tree has maximum depth less than $D$. Within each tree, identity (3.46) holds by the induction hypothesis. We refer to the set of trees in the forest as $S_{\rm root}$. Furthermore, we write $M_{\mathcal{T}'}$, for $\mathcal{T}'\in S_{\rm root}$, for the number of achieved groups in each such tree. Obviously,
\[
\sum_{\mathcal{T}'\in S_{\rm root}} M_{\mathcal{T}'} = M. \quad (3.48)
\]
Therefore,
\begin{align*}
\sum_{u\in\mathcal{T}_{\rm rej}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big)
&= \sum_{u\in\mathcal{T}_{\rm rej}\setminus{\rm root}}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big) + \deg_{\mathcal{T}}({\rm root})-\deg_{\mathcal{T}_{\rm rej}}({\rm root})\\
&= \sum_{\mathcal{T}'\in S_{\rm root}}\;\sum_{u\in\mathcal{T}_{\rm rej}\cap\mathcal{T}'}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big) + \deg_{\mathcal{T}}({\rm root})-\deg_{\mathcal{T}_{\rm rej}}({\rm root})\\
&= \sum_{\substack{\mathcal{T}'\in S_{\rm root}\\ \mathcal{T}'\cap\mathcal{T}_{\rm rej}\neq\emptyset}}\;\sum_{u\in\mathcal{T}_{\rm rej}\cap\mathcal{T}'}\big(\deg_{\mathcal{T}}(u)-\deg_{\mathcal{T}_{\rm rej}}(u)\big) + \deg_{\mathcal{T}}({\rm root})-\deg_{\mathcal{T}_{\rm rej}}({\rm root})\\
&= \sum_{\substack{\mathcal{T}'\in S_{\rm root}\\ \mathcal{T}'\cap\mathcal{T}_{\rm rej}\neq\emptyset}} M_{\mathcal{T}'} + \deg_{\mathcal{T}}({\rm root})-\deg_{\mathcal{T}_{\rm rej}}({\rm root})\\
&= \sum_{\substack{\mathcal{T}'\in S_{\rm root}\\ \mathcal{T}'\cap\mathcal{T}_{\rm rej}\neq\emptyset}} M_{\mathcal{T}'} + \sum_{\substack{\mathcal{T}'\in S_{\rm root}\\ \mathcal{T}'\cap\mathcal{T}_{\rm rej}=\emptyset}} M_{\mathcal{T}'} = M,
\end{align*}
where the fourth equality is by the induction hypothesis; the fifth equality holds because there are $\deg_{\mathcal{T}}({\rm root})-\deg_{\mathcal{T}_{\rm rej}}({\rm root})$ subtrees $\mathcal{T}'\in S_{\rm root}$ such that $\mathcal{T}'\cap\mathcal{T}_{\rm rej}=\emptyset$, and their $M_{\mathcal{T}'}=1$; the last equality follows from (3.48). This proves the induction step and hence completes the proof of identity (3.46).
We next proceed to prove (3.47). Suppose that the induction hypothesis holds for trees with depth at most $D-1$. We want to prove it for trees of depth $D$. Note that this identity trivially holds when the root is not rejected, and therefore we focus on the case where the root is rejected. There are two scenarios: (1) the root is a true rejection, or (2) the root is a false rejection.
We first assume the root is a true rejection. Then we have
\[
K = \sum_{T'\in S_{\mathrm{root}}} K_{T'}, \tag{3.49}
\]
where $K_{T'} \ge 1$ is defined as the number of true groups in each $T' \in S_{\mathrm{root}}$.
Then the left-hand side of (3.47) becomes
\[
\begin{aligned}
&\sum_{u\in F}\big(\deg_T(u) - \deg_{T_{\mathrm{rej}}}(u)\big) - |\mathcal{B}\cap F|\\
&\quad= \sum_{T'\in S_{\mathrm{root}}}\Bigg(\sum_{u\in F\cap T'}\big(\deg_T(u) - \deg_{T_{\mathrm{rej}}}(u)\big) - |\mathcal{B}\cap F\cap T'|\Bigg)\\
&\quad= \sum_{T'\in S_{\mathrm{root}}}\Bigg[\Bigg(\sum_{1\le i\le K_{T'}}\;\sum_{1\le j\le M_{T'}}\mathbf{1}\{C_i\cap\hat C_j\neq\emptyset\}\Bigg) - K_{T'}\Bigg]\\
&\quad= \Bigg(\sum_{1\le i\le K}\;\sum_{1\le j\le M}\mathbf{1}\{C_i\cap\hat C_j\neq\emptyset\}\Bigg) - K,
\end{aligned}
\]
where the first equality holds because the $T'\in S_{\mathrm{root}}$ are disjoint from each other and the root is not in $\mathcal{B}\cap F$; the second equality follows from the induction hypothesis, and the last equality follows from (3.48) and (3.49).
For the case where the root is a false rejection, we have $K = 1$, $\mathcal{B} = \{\mathrm{root}\}$, and any rejection is a false rejection ($F = T_{\mathrm{rej}}$). We write
\[
\sum_{u\in F}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - |\mathcal{B}\cap F| = \sum_{u\in T_{\mathrm{rej}}}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - 1 = M - 1,
\]
where in the last step we used identity (3.46). On the other hand, in this case there is only one true group ($K = 1$), which consists of all leaves. Therefore, any returned group $\hat C_j$ will intersect with it, and we get
\[
\Bigg(\sum_{i=1}^{K}\sum_{j=1}^{M}\mathbf{1}\{C_i\cap\hat C_j\neq\emptyset\}\Bigg) - 1 = M - 1.
\]
Comparing the previous two equations implies that identity (3.47) holds for the tree $T$. This completes the induction step and hence proves identity (3.47).
3.8.3 Proof of Lemma 3.1.3
The proof of Lemma 3.1.3 follows from Lemma 3.1.2 and the fact that $\deg_T(u) = 2$ for all non-leaf nodes $u\in T$. It suffices to show
\[
\sum_{u\in F}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - |\mathcal{B}\cap F| = |F|, \tag{3.50}
\]
and
\[
\max\Bigg\{\sum_{u\in T_{\mathrm{rej}}}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - 1,\; 0\Bigg\} = |T_{\mathrm{rej}}|. \tag{3.51}
\]
To prove equation (3.50), note that if a node is falsely rejected, all of its rejected children are also false rejections. Therefore, $\sum_{u\in F}\deg_{T_{\mathrm{rej}}}(u)$ counts the total number of edges both of whose endpoints are in $F$. Hence,
\[
\begin{aligned}
\sum_{u\in F}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - |\mathcal{B}\cap F|
&= 2|F| - |\{u: u\in F,\ \mathrm{parent}(u)\in F\}| - |\mathcal{B}\cap F|\\
&= 2|F| - |\{u: u\in F,\ \mathrm{parent}(u)\in F\}| - |\{u: u\in F,\ \mathrm{parent}(u)\notin F\}|\\
&= 2|F| - |F|\\
&= |F|.
\end{aligned}
\]
Equation (3.51) holds trivially when $|T_{\mathrm{rej}}| = 0$. When $|T_{\mathrm{rej}}| > 0$, the root node is rejected, and we write
\[
\sum_{u\in T_{\mathrm{rej}}}\big(\deg_T(u)-\deg_{T_{\mathrm{rej}}}(u)\big) - 1
= 2|T_{\mathrm{rej}}| - \sum_{u\in T_{\mathrm{rej}}}\deg_{T_{\mathrm{rej}}}(u) - 1
= 2|T_{\mathrm{rej}}| - (|T_{\mathrm{rej}}|-1) - 1
= |T_{\mathrm{rej}}|.
\]
This completes the proof.
3.8.4 Proof of Lemma 3.3.1
We use the shorthand $\ell_u := |L_u|$ for a node $u$. Define the random vector $w \in \mathbb{R}^{D_a}$ with elements $w_u = \ell_u^{1/2}\bar y_u$ and the fixed unit vector $r \in \mathbb{R}^{D_a}$ with elements $r_u = (\ell_u/\ell_a)^{1/2}$. We have
\[
r^\top w = \sum_{u\in\mathrm{child}(a)} (\ell_u/\ell_a)^{1/2}\big(\ell_u^{1/2}\bar y_u\big) = \ell_a^{-1/2}\sum_{u\in\mathrm{child}(a)} \ell_u \bar y_u = \ell_a^{1/2}\bar y_a, \tag{3.52}
\]
from which it follows that
\[
\sum_{u\in\mathrm{child}(a)} \ell_u(\bar y_u - \bar y_a)^2
= \sum_{u\in\mathrm{child}(a)} \big(\ell_u^{1/2}(\bar y_u - \bar y_a)\big)^2
= \sum_{u\in\mathrm{child}(a)} \big(w_u - r_u r^\top w\big)^2
= \big\|(I_{D_a} - rr^\top)w\big\|^2.
\]
The random vector $w$ is multivariate normal with $\mathbb{E}[w_u] = \ell_u^{1/2}\bar\theta_u$, where $\bar\theta_u = \frac{1}{|L_u|}\sum_{i\in L_u}\theta_i$ is the average of the parameters on the leaf nodes $L_u$. In addition, $\mathrm{Cov}(w) = \sigma^2 I_{D_a}$. Taking the expectation of (3.52) establishes that
\[
\mathbb{E}\big[(rr^\top w)_u\big] = \mathbb{E}\big[r_u(r^\top w)\big] = (\ell_u/\ell_a)^{1/2}\,\ell_a^{1/2}\,\mathbb{E}[\bar y_a] = \ell_u^{1/2}\bar\theta_a.
\]
Under $H_a$, we have $\bar\theta_u = \bar\theta_a$ and thus
\[
(I_{D_a} - rr^\top)w \sim N\big(0,\, \sigma^2(I_{D_a} - rr^\top)\big),
\]
where we use the fact that $\|r\|_2 = 1$, so that $I_{D_a} - rr^\top$ is a projection matrix. This establishes that
\[
\sum_{u\in\mathrm{child}(a)} \ell_u(\bar y_u - \bar y_a)^2 \sim \sigma^2\chi^2_{D_a-1}
\]
under $H_a$, meaning that $p_a$ is uniform. Now consider some node $b \neq a$. If $L_a \cap L_b = \emptyset$, then $p_a$ and $p_b$ are clearly independent (because they depend only on $y_{L_a}$ and $y_{L_b}$, respectively). Thus, it remains to consider the case that $L_a \subseteq L_b$ (i.e., $a$ is a descendant of $b$). There must exist $v\in\mathrm{child}(b)$ with $L_a \subseteq L_v \subseteq L_b$. From (3.21), $p_b = f(\bar y_v, y_{L_b\setminus L_v})$. Since $(L_b\setminus L_v)\cap L_a = \emptyset$, we know that $p_a$ is independent of $y_{L_b\setminus L_v}$. It therefore remains to show that $p_a$ is also independent of $\bar y_v$. To do so, observe that
\[
\ell_v \bar y_v = \sum_{i\in L_a} y_i + \sum_{i\in L_v\setminus L_a} y_i = \ell_a^{1/2} r^\top w + \sum_{i\in L_v\setminus L_a} y_i. \tag{3.53}
\]
Thus,
\[
\mathrm{Cov}\big([I_{D_a} - rr^\top]w,\, \bar y_v\big)
= \ell_v^{-1}\,\mathrm{Cov}\big([I_{D_a} - rr^\top]w,\, \ell_v\bar y_v\big)
= \ell_a^{1/2}\ell_v^{-1}\,\mathrm{Cov}\big([I_{D_a} - rr^\top]w,\, r^\top w\big)
= \sigma^2\ell_a^{1/2}\ell_v^{-1}\,[I_{D_a} - rr^\top]r
= 0,
\]
where the second equality follows from observing that the second term in (3.53) is independent of $w$ (which depends only on $y_{L_a}$), and the third equality uses that $\mathrm{Cov}(w) = \sigma^2 I_{D_a}$. This establishes that $p_a$ is independent of $p_b$.
3.8.5 Proof of Proposition 3.3.2
Note that for any node $u$, we have $\|G_u\|_2 = 1$ since $G_u$ is a projection matrix. Also, by using [37, Lemma 2] (which itself follows from [15, Lemma 1]), we have
\[
\|G_u\hat\theta\|_2 \le c_0\big(\hat b^\top \hat\Sigma\, \hat b\big)^{1/2}, \tag{3.54}
\]
for some constant $c_0 > 0$. This inequality follows by analyzing the optimization (3.26), which is used to define the direction $\hat b$. Therefore, for any node $u$, we obtain
\[
\frac{\|G_u\hat\theta\|_2 + \|G_u\|_2}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta^d_u)}}
\le \frac{c_0(\hat b^\top\hat\Sigma\hat b)^{1/2} + 1}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta^d_u)}}
= \frac{c_0(\hat b^\top\hat\Sigma\hat b)^{1/2} + 1}{\sqrt{\dfrac{4\hat\sigma^2}{n}\hat b^\top\hat\Sigma\hat b + \dfrac{t}{n}}}
\le \frac{c_0\sqrt{n}}{2\hat\sigma} + \sqrt{\frac{n}{t}}. \tag{3.55}
\]
Define $\xi := \Phi^{-1}\big(1 - \tfrac{t}{2}\big)$ and
\[
\eta_n := c_1\Bigg(\frac{c_0}{2\hat\sigma} + \sqrt{\frac{1}{t}}\Bigg)\frac{s_0\log p}{\sqrt{n}}, \tag{3.56}
\]
with $c_1$ given in (3.29). Under the null hypothesis $\widetilde H_{0,u}$ (or equivalently $H_{0,u}$), we have for all $t\in[0,1]$,
\[
\begin{aligned}
\mathbb{P}(p_u \le t)
&= \mathbb{P}\Bigg(\Phi^{-1}\Big(1-\frac{t}{2}\Big) \le \frac{|\hat\Theta^d_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg)
= \mathbb{P}\Bigg(\xi \le \frac{|\hat\Theta^d_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg)\\
&= \mathbb{P}\Bigg(\xi \le \frac{|Z_u + \Delta_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg)
\le \mathbb{P}\Bigg(\xi - \eta_n \le \frac{|Z_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg) + \mathbb{P}\Bigg(\eta_n \le \frac{|\Delta_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg). 
\end{aligned}
\tag{3.57}
\]
By using the bias bound (3.29), together with (3.55) and the definition of $\eta_n$ given by (3.56), we have
\[
\mathbb{P}\Bigg(\eta_n \le \frac{|\Delta_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg) \le 2pe^{-c_2 n}, \tag{3.58}
\]
for all nodes $u$. In addition,
\[
\begin{aligned}
\mathbb{P}\Bigg(\xi - \eta_n \le \frac{|Z_u|}{\sqrt{\widehat{\mathrm{var}}_t(\hat\Theta_u)}}\Bigg)
&\le \mathbb{P}\Bigg(\xi - \eta_n \le \frac{|Z_u|}{\sqrt{\dfrac{4\hat\sigma^2}{n}\hat b^\top\hat\Sigma\hat b}}\Bigg)
= \mathbb{P}\Bigg((\xi - \eta_n)\,\frac{\hat\sigma}{\sigma} \le \frac{|Z_u|}{\sqrt{\mathrm{var}(\hat\Theta^d_u)}}\Bigg)\\
&\le \mathbb{P}\Bigg((\xi - \eta_n)(1-\varepsilon) \le \frac{|Z_u|}{\sqrt{\mathrm{var}(\hat\Theta^d_u)}}\Bigg) + \mathbb{P}\Big(\frac{\hat\sigma}{\sigma} \le 1-\varepsilon\Big)\\
&= 2\Phi\big(\varepsilon\xi - \xi + \eta_n - \varepsilon\eta_n\big) + \mathbb{P}\Big(\frac{\hat\sigma}{\sigma} \le 1-\varepsilon\Big).
\end{aligned}
\tag{3.59}
\]
Combining (3.57), (3.58) and (3.59) we obtain
\[
\mathbb{P}(p_u \le t) \le 2\Phi\big(\varepsilon\xi - \xi + \eta_n - \varepsilon\eta_n\big) + \mathbb{P}\Big(\frac{\hat\sigma}{\sigma} \le 1-\varepsilon\Big) + 2pe^{-c_2 n}.
\]
Note that the right-hand side of the above equation does not depend on the node $u$; in other words, it is a uniform bound for all nodes. Under the condition $s_0(\log p)/\sqrt{n} \to 0$, we have $\eta_n \to 0$ as $n\to\infty$. Therefore, for any fixed $\varepsilon_0 > 0$, by choosing $\varepsilon > 0$ small enough and $n_0 = n_0(\varepsilon)$ large enough, we can ensure that for all $n \ge n_0$,
\[
\mathbb{P}(p_u \le t) \le 2\Phi(-\xi) + \varepsilon_0 = 2\big(1 - \Phi(\xi)\big) + \varepsilon_0 = t + \varepsilon_0,
\]
for all nodes $u$.
3.9 Proof of Proposition 3.7.2
For depth $d$ we define the quantities $L_d := R_{1:(d-1)} + r_d$ and $U_d := p - 1 - \big(\sum_{u\in T^d}\deg_T(u) - |T^d| - r_d\big)$.
For a node $a\in\mathcal{B}$ with $\mathrm{depth}(a) = d$, we write
\[
\begin{aligned}
\sum_{T'\in S(T_a)} \frac{V_a(T')}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})
&\overset{(a)}{\le} \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{T'\in S(T_a)} \frac{|T'|}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\\
&= \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{T'\in S(T_a)}\sum_{u\in T'} \frac{1}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\\
&= \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\;\sum_{T'\in S(T_a):\,u\in T'} \frac{1}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\\
&\overset{(b)}{\le} \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\sum_{m=L_d}^{U_d}\;\sum_{T'\in S(T_a):\,u\in T'} \frac{1}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\,\mathbf{1}(\tilde R_{T'} = m)\\
&= \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\sum_{m=L_d}^{U_d}\frac{1}{m}\sum_{T'\in S(T_a)} \mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\,\mathbf{1}(\tilde R_{T'} = m,\ u\in T')\\
&= \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\sum_{m=L_d}^{U_d}\frac{1}{m}\,\mathbb{P}(\tilde R_{T_{a,\mathrm{rej}}} = m,\ u\in T_{a,\mathrm{rej}} \mid P^c_{T_a}).
\end{aligned}
\tag{3.60}
\]
Here $\Delta$ denotes the maximum node degree in $T$.
Here (a) follows from Lemma 3.10.2, and (b) holds since for $T'\in S(T_a)$, the number of total rejections $\tilde R_{T'}$ satisfies
\[
L_d \le \tilde R_{T'} \le U_d.
\]
The lower bound holds trivially since $T'\in S(T_a)$ and $\mathrm{depth}(a) = d$: the number of splits made by the algorithm up to level $d$ is $R_{1:d} = R_{1:(d-1)} + r_d$ by Equation (3.35). For the upper bound, note that one can split the $p$ leaves at most $p-1$ times. Now focusing on nodes at depth $d$, rejecting a node $u$ results in $\deg_T(u) - 1$ additional splits. So the nodes at depth $d$ can make up to $\sum_{u\in T^d}\deg_T(u) - |T^d|$ additional splits, while the algorithm makes $r_d$ additional splits, as we discussed in Equation (3.35). So the difference between these two quantities, $\sum_{u\in T^d}\deg_T(u) - |T^d| - r_d$, is the number of potential splits that the testing rule has missed while testing nodes at depth $d$. This argument implies that the total number of splits can go up to $U_d = p - 1 - \big(\sum_{u\in T^d}\deg_T(u) - |T^d| - r_d\big)$.
Now by using bound (3.39), on the event $\{\tilde R_{T'} = m\}$ we have $\alpha_u(r_d) \le \tilde\alpha_{a,m}$. Define $T^m_{a,\mathrm{rej}}$ as the rejection subtree obtained when the test levels $\alpha_u(r_d)$ are replaced by $\tilde\alpha_{a,m}$. Therefore $T_{a,\mathrm{rej}} \subseteq T^m_{a,\mathrm{rej}}$, which implies
\[
\mathbb{P}(\tilde R_{T_{a,\mathrm{rej}}} = m,\ u\in T_{a,\mathrm{rej}} \mid P^c_{T_a}) \le \mathbb{P}(\tilde R_{T_{a,\mathrm{rej}}} = m,\ u\in T^m_{a,\mathrm{rej}} \mid P^c_{T_a}).
\]
Combining this inequality with (3.60) and taking the expectation gives
\[
\mathbb{E}\Bigg[\sum_{T'\in S(T_a)} \frac{V_a(T')}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\Bigg]
\le \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\sum_{m=L_d}^{U_d}\frac{1}{m}\,\mathbb{P}(\tilde R_{T_{a,\mathrm{rej}}} = m,\ u\in T^m_{a,\mathrm{rej}}). \tag{3.61}
\]
Focusing on the innermost summation, and writing $\tilde R := \tilde R_{T_{a,\mathrm{rej}}}$ and $T^m := T^m_{a,\mathrm{rej}}$ within the display to lighten notation, we have
\[
\begin{aligned}
\sum_{m=L_d}^{U_d}\frac{1}{m}\,\mathbb{P}(\tilde R = m,\ u\in T^m)
&= \sum_{m=L_d}^{U_d}\frac{1}{m}\Big[\mathbb{P}(\tilde R \ge m,\ u\in T^m) - \mathbb{P}(\tilde R \ge m+1,\ u\in T^m)\Big]\\
&= \sum_{m=L_d}^{U_d}\frac{1}{m}\,\mathbb{P}(\tilde R \ge m,\ u\in T^m) - \sum_{m'=L_d+1}^{U_d+1}\frac{1}{m'-1}\,\mathbb{P}(\tilde R \ge m',\ u\in T^{m'-1})\\
&= \sum_{m=L_d+1}^{U_d}\Big[\frac{1}{m}\,\mathbb{P}(\tilde R \ge m,\ u\in T^m) - \frac{1}{m-1}\,\mathbb{P}(\tilde R \ge m,\ u\in T^{m-1})\Big]\\
&\qquad + \frac{1}{L_d}\,\mathbb{P}(\tilde R \ge L_d,\ u\in T^{L_d}) - \frac{1}{U_d}\,\mathbb{P}(\tilde R \ge U_d+1,\ u\in T^{U_d})\\
&\le \sum_{m=L_d+1}^{U_d}\Big[\frac{1}{m}\,\mathbb{P}(\tilde R \ge m,\ u\in T^m) - \frac{1}{m-1}\,\mathbb{P}(\tilde R \ge m,\ u\in T^{m-1})\Big] + \frac{1}{L_d}\,\mathbb{P}(u\in T^{L_d})\\
&\le \sum_{m=L_d+1}^{U_d}\frac{1}{m}\,\mathbb{P}\big(\tilde R \ge m,\ u\in T^m\setminus T^{m-1}\big) + \frac{1}{L_d}\,\mathbb{P}(u\in T^{L_d})\\
&\le \sum_{m=L_d+1}^{U_d}\frac{1}{m}\,\mathbb{P}\big(u\in T^m\setminus T^{m-1}\big) + \frac{1}{L_d}\,\mathbb{P}(u\in T^{L_d}),
\end{aligned}
\tag{3.62}
\]
where in the last steps we used the observation $T^{m-1}_{a,\mathrm{rej}} \subseteq T^m_{a,\mathrm{rej}}$, since $\tilde\alpha_{a,m}$ is increasing in $m$.
For exposition purposes, we define the shorthand $q_{u,m} = \mathbb{P}(u\in T^m_{a,\mathrm{rej}})$ for $u\in T_a$ and $m\ge 1$. Then, from the chain of inequalities in (3.62) we get
\[
\begin{aligned}
\sum_{m=L_d}^{U_d}\frac{1}{m}\,\mathbb{P}(\tilde R_{T_{a,\mathrm{rej}}} = m,\ u\in T^m_{a,\mathrm{rej}})
&\le \sum_{m=L_d+1}^{U_d}\frac{1}{m}\,\mathbb{P}\big(u\in T^m_{a,\mathrm{rej}}\setminus T^{m-1}_{a,\mathrm{rej}}\big) + \frac{1}{L_d}\,\mathbb{P}(u\in T^{L_d}_{a,\mathrm{rej}})\\
&= \sum_{m=L_d+1}^{U_d}\frac{1}{m}\,(q_{u,m} - q_{u,m-1}) + \frac{1}{L_d}\,q_{u,L_d}\\
&= \sum_{m=L_d}^{U_d}\frac{1}{m}\,q_{u,m} - \sum_{m=L_d}^{U_d-1}\frac{1}{m+1}\,q_{u,m}\\
&= \frac{1}{U_d}\,q_{u,U_d} + \sum_{m=L_d}^{U_d-1}\Big(\frac{1}{m} - \frac{1}{m+1}\Big)q_{u,m}.
\end{aligned}
\tag{3.63}
\]
By deploying (3.63) in the bound (3.61), we get
\[
\begin{aligned}
\mathbb{E}\Bigg[\sum_{T'\in S(T_a)} \frac{V_a(T')}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\Bigg]
&\le \Big(\Delta - \frac{1}{\Delta}\Big)\sum_{u\in T_a}\Bigg(\frac{1}{U_d}\,q_{u,U_d} + \sum_{m=L_d}^{U_d-1}\Big(\frac{1}{m} - \frac{1}{m+1}\Big)q_{u,m}\Bigg)\\
&= \Big(\Delta - \frac{1}{\Delta}\Big)\Bigg(\frac{1}{U_d}\sum_{u\in T_a} q_{u,U_d} + \sum_{m=L_d}^{U_d-1}\frac{1}{m(m+1)}\sum_{u\in T_a} q_{u,m}\Bigg). 
\end{aligned}
\tag{3.64}
\]
Our next step is to upper bound $\sum_{u\in T_a} q_{u,m}$, which is the subject of the following lemma.

Lemma 3.9.1. For any integer $m \ge 1$ we have
\[
\sum_{u\in T_a} q_{u,m} \le \frac{\tilde\gamma_{a,m}}{p(\Delta - 1/\Delta)},
\]
where $\tilde\gamma_{a,m}$ is given by (3.38).

The proof of Lemma 3.9.1 is deferred to Section 3.9.1.

By virtue of Lemma 3.9.1 and (3.64), we have
\[
\begin{aligned}
\mathbb{E}\Bigg[\sum_{T'\in S(T_a)} \frac{V_a(T')}{\tilde R_{T'}}\,\mathbb{P}(T_{a,\mathrm{rej}} = T' \mid P^c_{T_a})\Bigg]
&\le \frac{1}{p}\Bigg(\frac{\tilde\gamma_{a,U_d}}{U_d} + \sum_{m=L_d}^{U_d-1}\frac{\tilde\gamma_{a,m}}{m(m+1)}\Bigg)\\
&= \frac{\alpha|L_a|}{p\,\bar h_{d,r}}\Bigg(1 + \sum_{m=L_d}^{U_d-1}\frac{1}{m+1}\Bigg)
= \frac{\alpha|L_a|}{p\,\bar h_{d,r}}\Bigg(1 + \sum_{m=L_d+1}^{U_d}\frac{1}{m}\Bigg)
= \frac{\alpha|L_a|}{p}.
\end{aligned}
\tag{3.65}
\]
3.9.1 Proof of Lemma 3.9.1
Since $a\in\mathcal{B}$, any node $u\in T_a$ is a true null and hence has a super-uniform p-value, i.e., for any $x\in[0,1]$ we have $\mathbb{P}(p_u \le x) \le x$. In addition, by our assumption the null p-values are independent, and if a node $u$ is rejected, so are the nodes on the path from node $a$ to it. Therefore,
\[
q_{u,m} = \mathbb{P}(u\in T^m_{a,\mathrm{rej}}) \le \tilde\alpha_{a,m}^{\,\mathrm{depth}(u)-\mathrm{depth}(a)+1}. \tag{3.66}
\]
Here we used the fact that the rejection thresholds in $T^m_{a,\mathrm{rej}}$ are set to $\tilde\alpha_{a,m}$.

Also, since the node degrees in $T$ are at most $\Delta$, the number of nodes in the subtree $T_a$ that are at depth $d$ of the tree $T$ is at most $\Delta^{d-\mathrm{depth}(a)}$. We therefore have
\[
\sum_{u\in T_a} q_{u,m}
\le \sum_{d=\mathrm{depth}(a)}^{D} \Delta^{d-\mathrm{depth}(a)}\,\tilde\alpha_{a,m}^{\,d-\mathrm{depth}(a)+1}
\le \sum_{d=1}^{\infty} \Delta^{d-1}\tilde\alpha_{a,m}^{\,d}
= \frac{1}{\Delta}\cdot\frac{\Delta\tilde\alpha_{a,m}}{1-\Delta\tilde\alpha_{a,m}}
= \frac{\tilde\gamma_{a,m}}{p(\Delta - 1/\Delta)}, \tag{3.67}
\]
which completes the proof.
3.10 Some useful lemmas
Lemma 3.10.1. Consider a tree $T$ with maximum degree $\Delta$. Denote by $L$ the set of leaf nodes in $T$. We then have
\[
|L| \le \frac{(\Delta-1)|T| + 1}{\Delta},
\]
where $|T|$ denotes the number of nodes in $T$.
Proof. Recall that the degree of a node $u$ is the number of its children in the tree. The leaves have degree zero and the other nodes have degree at most $\Delta$. Therefore, writing $p = |L|$,
\[
(|T| - p)\Delta \ge \sum_{u\in T}\deg_T(u) = |T| - 1.
\]
By rearranging the terms we get
\[
p \le \frac{(\Delta-1)|T| + 1}{\Delta}.
\]
Lemma 3.10.2. Consider a tree $T$ with maximum degree $\Delta$. For $T'$, a subtree of $T$, define
\[
V(T') = \sum_{u\in T'}\big(\deg_T(u) - \deg_{T'}(u)\big) - 1.
\]
We then have the following bound on $V(T')$:
\[
V(T') \le \frac{(\Delta^2-1)|T'| + 1}{\Delta} - 1 \le \Big(\Delta - \frac{1}{\Delta}\Big)|T'|,
\]
where $|T'|$ denotes the number of nodes in $T'$.
Proof. If node $u\in T'$ is not a leaf of $T'$, we have $\deg_{T'}(u) \ge 1$ and so
\[
\deg_T(u) - \deg_{T'}(u) \le \deg_T(u) - 1 \le \Delta - 1.
\]
If $u\in T'$ is a leaf of $T'$, we have
\[
\deg_T(u) - \deg_{T'}(u) = \deg_T(u) \le \Delta.
\]
We therefore have
\[
\begin{aligned}
V(T') &= \sum_{u\in T'}\big(\deg_T(u) - \deg_{T'}(u)\big) - 1\\
&\le |L_{T'}|\Delta + \big(|T'| - |L_{T'}|\big)(\Delta-1) - 1\\
&= |T'|(\Delta-1) + |L_{T'}| - 1\\
&\le \frac{(\Delta^2-1)|T'| + 1}{\Delta} - 1 \le \Big(\Delta - \frac{1}{\Delta}\Big)|T'|,
\end{aligned}
\]
where the second inequality follows from Lemma 3.10.1.
Chapter 4
Prediction Sets for High-Dimensional Mixture of Experts
Models
4.1 Introduction
Rapid development of technology has led to the collection of larger volumes of data and the measurement of increasingly fine-grained data. Analyzing these large datasets introduces new challenges. Due to the diversity of the data, applying a single model to a large dataset often cannot fully capture the complex relationships in the data. This problem is known as heterogeneity: the observations are not homogeneous and need to be represented by a mixture of models. The most natural way to deal with heterogeneity is to simply cluster similar observations to disentangle the mixture. However, identifying the best clustering of observations is difficult. [51] proposed allowing the mixture probabilities to depend on features associated with the observations; this type of model is called a Mixture of Experts (MoE). Using a high-dimensional MoE model, [42] modeled how the dynamics of phytoplankton populations change with environmental conditions. Despite the powerful EM algorithm developed by [42] to estimate such MoE models, there has been a lack of work on statistical inference for MoE predictions. Motivated by this, the current paper aims at providing statistical inference for predictions made under the assumption of a Mixture of Experts model.
We consider the MoE model as a mixture model in which both the cluster centers and the mixing probabilities are generalized linear models (GLMs). Specifically, for an observation $x_i \in \mathbb{R}^p$, the outcome $y_i$ is drawn from a mixture of Gaussians with $k$-th component given by
\[
(y_i \mid x_i, z_i = k) \sim N(x_i^T\beta_k,\, \sigma_k^2),
\]
where $z_i$ is a latent variable that determines the cluster membership and follows a multinomial distribution, i.e.,
\[
(z_i \mid x_i) \sim \mathrm{Multinomial}\big(\pi_1(x_i), \ldots, \pi_K(x_i)\big),
\]
with
\[
\pi_k(x_i) = \frac{\exp(x_i^T\alpha_k)}{\sum_{\ell=1}^{K}\exp(x_i^T\alpha_\ell)}.
\]
We can express the model that includes the latent variable by
\[
y_i = \sum_{k=1}^{K}\mathbf{1}\{z_i = k\}\big(x_i^T\beta_k + \varepsilon_k\big), \tag{4.1}
\]
\[
\mathbb{P}(z_i = k) = \pi_k(x_i). \tag{4.2}
\]
For a new observation drawn from model (4.1), i.e.,
\[
y_{\mathrm{new}} = \sum_{k=1}^{K}\mathbf{1}\{z_{\mathrm{new}} = k\}\big(x_{\mathrm{new}}^T\beta_k + \varepsilon_k\big),
\]
we aim to construct a prediction set $\Omega$ such that the prediction set covers $y_{\mathrm{new}}$ with pre-specified confidence $(1-q)\cdot 100\%$, i.e.,
\[
\mathbb{P}(y_{\mathrm{new}} \in \Omega) \ge 1 - q. \tag{4.3}
\]
4.1.1 Comparison with Conformal Inference Literature
Conformal prediction is a distribution-free predictive inference method for regression ([68], [87], [58], [57]). The general idea of conformal prediction (also known as full conformal prediction) is to use residuals of hold-out data from regression models fitted on training data to help quantify the uncertainty of predictions. Conformalized Quantile Regression (CQR, [72]) combines conformal prediction with quantile regression, producing locally adaptive intervals that provide marginal coverage.

Our proposed method differs from conformal prediction in the following aspects. First, the (full) conformal prediction literature relaxes the assumption on the data from independence to exchangeability, but it only guarantees marginal coverage, meaning that the coverage is over a random draw of both training and test data ([57]). By contrast, our method does not need to assume that $x_{\mathrm{new}}$ is random in order to give the desired coverage. Second, full conformal intervals have fixed lengths by their nature, whereas CQR intervals have adaptive lengths. Our prediction sets have lengths adaptive to $x_{\mathrm{new}}$, which affects the uncertainty of a new prediction.

Fundamentally, our prediction inference problem is complementary to conformal inference. Our method assumes that the subgroup is unobserved, whereas conformal inference achieves an equal coverage rate across subgroups and requires knowledge of the subgroups.
4.1.2 Illustration of Results with a Toy Example
To illustrate our result, we consider a simple low-dimensional case where there are $K = 2$ components in the mixture. For $t = 1,\ldots,T = 100$ time points, there are $p = 2$ covariates varying with time:
\[
x_{1,t} = \frac{t-1}{T-1}, \qquad x_{2,t} = x_{1,t}^2, \qquad t = 1,\ldots,T.
\]
The centers of the two components are respectively
\[
\mu_1(x_t) = -10(T-1)\,x_{1,t} + 10(T-1)^2\,x_{2,t} + \frac{5}{2}, \qquad \mu_2(x_t) = 1.
\]
The outcomes $\{y_t\}_{t=1}^T$ conditional on the cluster assignment variable $z_t$ are generated from
\[
y_t \mid x_t, z_t = k \sim N\big(\mu_k(t),\, 0.1^2\big),
\]
where the probability of being drawn from the first cluster, $\pi_1(t) := \mathbb{P}(z_t = 1 \mid x_t) = 1 - \pi_2(t)$, follows a logistic function, with
\[
\log\frac{\pi_1(t)}{1-\pi_1(t)} = \log(9) - 2\log(9)\,t.
\]
This makes $\pi_1(t)$ a monotonic function of $t$: when $t\to 0$, we have $\pi_1 \to 0.9$, and when $t\to T$, we have $\pi_1 \to 0.1$.
We simulate one data point from the above model at each of the $t = 1,\ldots,100$ time points. We estimate the parameters with the EM algorithm introduced in [35]. We plot the $\pi_1(t), \pi_2(t)$ curves in the top left panel of Figure 4.1, along with the estimated $\hat\pi_1(t), \hat\pi_2(t)$ curves produced by the EM algorithm. We can observe from the plot that the two true probability trends (dotted) are symmetric, so we have approximately equal numbers of data points from the two clusters. We can also observe that the estimated probabilities capture the true probabilities quite well, as the solid curves are close to the dotted curves. On the bottom left of Figure 4.1, we plot the predicted outcome along with the true outcome.

We provide the 95% confidence prediction sets obtained with the algorithm introduced in this work in the middle panel of Figure 4.1, where the shaded areas indicate the prediction sets. We notice that at different time points, the prediction set can be either a single interval or the union of two intervals, depending largely on the distance between the two centers.

Finally, we directly examine the performance of our prediction sets. We repeat the above process 500 times. Each time we record the lengths of our prediction sets and whether or not the observation $y_t$ falls into our prediction sets. We average the two statistics over the 500 runs. The top right and bottom right panels of Figure 4.1 show the average coverage rate and average lengths of our prediction sets. We observe that the coverage rates are quite close to the pre-specified level. When the two mixture centers are close, the prediction set becomes a single interval and its length becomes shorter. In contrast, when the two centers are far apart, the prediction set adapts to cover both centers, and the total length of the two intervals becomes longer than when they overlap.
4.1.3 Related Works
The MLR model has been well studied in the literature ([64]). Extensive work has been developed
to study the convergence of the EM estimator ([93, 5, 54, 56, 55]), as well as statistical inference
Figure 4.1: The plot on the top left shows how the true probabilities of the two clusters, $\pi_1(t)$ and $\pi_2(t)$ respectively, vary with time $t$ (dotted), as well as the estimated probabilities $\hat\pi_1(t), \hat\pi_2(t)$ (solid). The plot on the bottom left contains observed outcomes (black dots), predicted outcome curves (solid) and true curves (dotted). The plot in the middle shows the prediction set at the 95% confidence level as a shaded area. The plots on the top right and bottom right contain the averaged coverage rates and averaged lengths of the prediction sets, respectively. The averages are taken over 500 runs; for each run, there are 1000 outcomes generated from the model and the percentage of them falling into the confidence set is calculated.
([98]). In recent years, there has been some work on MLR in high-dimensional settings. [75] proposed an $\ell_1$-regularized maximum likelihood estimator and established convergence and oracle results for the estimator. [92] proved convergence of high-dimensional MLR in latent variable models. [97] considered the case of two high-dimensional linear components, and established the rate of convergence of the coefficients as well as confidence intervals. To the best of our knowledge, problems of statistical and predictive inference for the MoE model have not been addressed in the literature.

The debiasing approach for constructing confidence intervals for coefficients has been widely used in linear regression models in high-dimensional settings ([46, 47, 95, 84, 48]). In recent years, there has been work on inference for general linear functions ([14, 38, 44, 99]). In terms of debiasing linear predictions, [14, 81, 3] proposed bias-corrected estimators for the single linear regression model. [18] considered inference for individualized treatment effects (ITE), which is a linear contrast of two coefficient vectors, $x_{\mathrm{new}}^T(\beta_1 - \beta_2)$. To the best of our knowledge, the method proposed in this work is the first to provide statistical inference for predictions of mixed linear models.
4.2 Methodology
4.2.1 Prelude: EM-based Estimator
The expectation-maximization (EM) algorithm ([26]) is often used for finding maximum-likelihood
solutions for problems involving missing data, alternating between the E (expectation) step and the
M (maximization) step, while managing to increase the objective function.
Assume $n$ data points $\{(x_i, y_i)\}_{i=1}^n$ are drawn independently from the MoE model with $K$ clusters. The log-likelihood function of the data is
\[
\ell\Big(\underbrace{\big(\{\beta_k^T\}_1^K, \{\alpha_k^T\}_1^K, \{\sigma_k\}_1^K\big)^T}_{\theta\,\in\,\mathbb{R}^{(2p+1)K}}\Big) = \frac{1}{n}\sum_{i=1}^{n}\log\Bigg[\sum_{k=1}^{K}\pi_k(x_i)\,\phi_k(x_i, y_i)\Bigg], \tag{4.4}
\]
where
\[
\pi_k(x_i) = \frac{\exp(x_i^T\alpha_k)}{\sum_{\ell=1}^{K}\exp(x_i^T\alpha_\ell)},
\]
and
\[
\phi_k(x_i, y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Bigg(-\frac{(y_i - x_i^T\beta_k)^2}{2\sigma_k^2}\Bigg).
\]
Throughout the paper we use $\theta = \big(\{\beta_k^T\}_1^K, \{\alpha_k^T\}_1^K, \{\sigma_k\}_1^K\big)^T \in \mathbb{R}^{(2p+1)K}$ as the stacked vector of parameters to be estimated.
We can incorporate the latent variable $z_i$ that determines the cluster membership of the $i$-th data point, and define the complete log-likelihood as
\[
\ell(\theta) = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\mathbf{1}\{z_i = k\}\big[\log\pi_k(x_i) + \log\phi_k(x_i, y_i)\big].
\]
We take conditional expectations of the membership indicator $\mathbf{1}\{z_i = k\}$ given $(x_i, y_i)$. Note that the conditional probabilities can be expressed by
\[
\gamma_{i,k}(\theta) := \mathbb{E}\big[\mathbf{1}\{z_i = k\}\mid x_i, y_i\big] = \frac{\pi_k(x_i)\,\phi_k(x_i, y_i)}{\sum_{\ell=1}^{K}\pi_\ell(x_i)\,\phi_\ell(x_i, y_i)},
\]
which are referred to as responsibilities in the literature (see, e.g., [42]).
To overcome the issues brought by high dimensionality, we add regularization terms $\|\beta_k\|_1, \|\alpha_k\|_1$ to the log-likelihood to enforce sparsity. The regularized objective function is thus of the form
\[
Q(\theta\mid\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\gamma_{i,k}(\hat\theta)\big[\log\pi_k(x_i) + \log\phi_k(x_i, y_i)\big] + \sum_{k=1}^{K}\lambda_\alpha\|\alpha_k\|_1 + \sum_{k=1}^{K}\lambda_\beta\|\beta_k\|_1. \tag{4.5}
\]
The estimation of the parameters can be done by an EM algorithm, which iterates between estimating the conditional membership probabilities $\gamma_{i,k}(\hat\theta)$ and minimizing (4.5) to update the parameters $\hat\theta = \big(\{\hat\beta_k^T\}_1^K, \{\hat\alpha_k^T\}_1^K, \{\hat\sigma_k\}_1^K\big)^T$. We list in detail how each step is formulated:
E step: Estimate the conditional responsibility of membership based on the latest parameter estimate:
\[
\gamma_{i,k}(\hat\theta) = \frac{\dfrac{\hat\pi_{i,k}}{\hat\sigma_k}\exp\Bigg(-\dfrac{(y_i - x_i^T\hat\beta_k)^2}{2\hat\sigma_k^2}\Bigg)}{\sum_{\ell=1}^{K}\dfrac{\hat\pi_{i,\ell}}{\hat\sigma_\ell}\exp\Bigg(-\dfrac{(y_i - x_i^T\hat\beta_\ell)^2}{2\hat\sigma_\ell^2}\Bigg)}.
\]
M step: For each cluster $k = 1,\ldots,K$, update the estimated parameters $(\hat\beta_k, \hat\alpha_k, \hat\sigma_k)$ to lower the objective value in (4.5).
– Update $\hat\beta_k$:
\[
\hat\beta_k = \arg\min_{\beta}\ \sum_{i=1}^{n}\frac{\gamma_{i,k}(\hat\theta)}{\hat\sigma_k^2}\,\frac{(y_i - x_i^T\beta)^2}{2n} + \lambda_\beta\|\beta\|_1. \tag{4.6}
\]
– Update $\hat\alpha_k$:
\[
\hat\alpha_k = \arg\min_{\alpha_k}\ -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\gamma_{i,k}(\hat\theta)\Bigg(x_i^T\alpha_k - \log\Bigg(\sum_{\ell=1}^{K}\exp(x_i^T\alpha_\ell)\Bigg)\Bigg) + \lambda_\alpha\sum_{k=1}^{K}\|\alpha_k\|_1. \tag{4.7}
\]
– Update $\hat\sigma_k$:
\[
\hat\sigma_k^2 = \frac{\sum_{i=1}^{n}\gamma_{i,k}(\hat\theta)\,(y_i - x_i^T\hat\beta_k)^2}{\sum_{i=1}^{n}\gamma_{i,k}(\hat\theta)}. \tag{4.8}
\]
The algorithm terminates when the regularized negative log-likelihood converges, i.e., when the improvement in (4.5) is negligible.
In practice, we use the R package flowmix ([41]) to carry out this EM algorithm.
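As a rough illustration of the E and M steps above, the following sketch runs a stripped-down EM for a mixture of linear regressions. It deliberately simplifies the procedure in the text in two labeled ways: the $\ell_1$ penalties are dropped ($\lambda_\alpha = \lambda_\beta = 0$), and the gating model is intercept-only, so $\pi_k$ does not depend on $x$ and the $\alpha$-update reduces to averaging the responsibilities. The quantile-based initialization is a heuristic of ours; the analyses in this chapter use flowmix, not this sketch.

```python
import numpy as np

def em_mixreg(X, y, K=2, n_iter=50):
    """Toy EM for a K-component mixture of linear regressions.

    Simplified relative to the text: no l1 penalties and an
    intercept-only gating model (pi_k constant in x).
    """
    n, p = X.shape
    # crude hard initialization: split observations by quantiles of y
    gam = np.zeros((n, K))
    for k, chunk in enumerate(np.array_split(np.argsort(y), K)):
        gam[chunk, k] = 1.0
    betas, sigmas = np.zeros((K, p)), np.ones(K)
    for _ in range(n_iter):
        # M step: weighted least squares per component, cf. (4.6) and (4.8)
        for k in range(K):
            w = gam[:, k] + 1e-12
            Xw = X * w[:, None]
            betas[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            sigmas[k] = np.sqrt((w * (y - X @ betas[k]) ** 2).sum() / w.sum())
        pis = gam.mean(axis=0)          # alpha-update collapses to proportions
        # E step: responsibilities gamma_{i,k} via Bayes' rule
        resid = y[:, None] - X @ betas.T                      # (n, K)
        logw = np.log(pis + 1e-12) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
        logw -= logw.max(axis=1, keepdims=True)               # stabilize
        gam = np.exp(logw)
        gam /= gam.sum(axis=1, keepdims=True)
    return betas, sigmas, pis, gam
```

On well-separated components this recovers the per-component coefficients up to label permutation; the ridge term `1e-8` only guards against singular weighted Gram matrices.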
4.2.2 Debiased Prediction
Due to the regularization terms, the parameters estimated using the procedure above are biased and have an intractable distribution. Thus the prediction given any $x_{\mathrm{new}}$ is biased. To address this issue and construct prediction sets, we propose a debiasing procedure to enable statistical inference. The debiasing procedure for generalized linear problems relies on the Fisher information matrix $I(\theta)$, which is defined via the second derivative of (4.4). We can estimate the sample Fisher information matrix by
\[
\tilde I(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n}\nabla\ell_i(\hat\theta)\,\nabla\ell_i^T(\hat\theta),
\]
where
\[
\nabla\ell_i = \Bigg(\Big\{\frac{\partial\ell_i}{\partial\beta_k}\Big\}_{k=1}^{K}, \Big\{\frac{\partial\ell_i}{\partial\alpha_k}\Big\}_{k=1}^{K}, \Big\{\frac{\partial\ell_i}{\partial\sigma_k}\Big\}_{k=1}^{K}\Bigg)^T
\]
is the stacked gradient vector, and $\hat\theta = \big(\{\hat\beta_k^T\}_{k=1}^K, \{\hat\alpha_k^T\}_{k=1}^K, \{\hat\sigma_k\}_{k=1}^K\big)^T$ are the parameter estimates obtained by the EM algorithm.

For a new observation $x_{\mathrm{new}}$ that is assumed to be drawn from the model, the outcome can be expressed as
\[
y_{\mathrm{new}} = \sum_{k=1}^{K}\mathbf{1}\{z_{\mathrm{new}} = k\}\big(x_{\mathrm{new}}^T\beta_k + \varepsilon_k\big), \tag{4.9}
\]
and
\[
\mathbb{P}(z_{\mathrm{new}} = k) = \pi_k(x_{\mathrm{new}}) = \frac{\exp(x_{\mathrm{new}}^T\alpha_k)}{\sum_{\ell=1}^{K}\exp(x_{\mathrm{new}}^T\alpha_\ell)}.
\]
Since $\hat\beta_k$ is a biased estimate, $x_{\mathrm{new}}^T\hat\beta_k$ is also a biased prediction of $y_{\mathrm{new}}$ given $z_{\mathrm{new}} = k$. We use $\hat\Gamma_k := x_{\mathrm{new}}^T\hat\beta_k$ to denote the initial prediction, and $\Gamma_k := x_{\mathrm{new}}^T\beta_k$ as the true value. Suppose that $u_k$ is an arbitrary vector. We propose to construct a bias-corrected prediction
\[
\hat\Gamma^d_k := x_{\mathrm{new}}^T\hat\beta_k + u_k^T\,\frac{1}{n}\sum_{i}\frac{\partial\ell_i(\hat\theta)}{\partial\beta_k}.
\]
The direction $u_k$ generalizes those of [46, 95, 14], which aim to find a direction that minimizes variance while controlling bias. However, the goal of [46, 95, 14] is to establish inference for coefficients, which differs from our goal of establishing inference for predictions. Here we follow the method in [17] to construct a "variance-enhancement projection direction," which adds an additional constraint to ensure that variance has a dominating role compared to bias. Our optimization problem differs from [17] in that we use a constrained Fisher information matrix, $\tilde I_{\beta_k}(\hat\theta)$, in the objective of (4.10), and we allow the matrix $\tilde\Sigma_k$ in the constraint of (4.10) to differ from the matrix in the objective, while [17] keeps them the same.

We propose to identify $u_k$ by solving the optimization problem
\[
\min_u\ u^T\tilde I_{\beta_k}(\hat\theta)\,u \quad \text{s.t.} \quad \sup_{w\in\mathcal{C}}\big|\big\langle w,\ \tilde\Sigma_k u - x_{\mathrm{new}}\big\rangle\big| \le \|x_{\mathrm{new}}\|_2\,\lambda_k, \tag{4.10}
\]
where we define $\tilde\Sigma_k := \frac{1}{n}\sum_{i=1}^{n}\frac{\gamma_{i,k}(\hat\theta)}{\hat\sigma_k^2}\,x_i x_i^T$ and $\lambda_k \asymp s\sqrt{\frac{\log p}{n}}$, with $\|\beta_k\|_0 \le s$ for all $k\in[K]$; the set $\mathcal{C} = \{e_1,\ldots,e_p,\ x_{\mathrm{new}}/\|x_{\mathrm{new}}\|_2\}$, where $e_i$ denotes the $i$-th standard Euclidean basis vector; and $\tilde I_{\beta_k}(\hat\theta)$ is the estimated sample Fisher information matrix restricted to $\beta_k$, which is formulated as
\[
\tilde I_{\beta_k}(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n}\Bigg(\gamma_{i,k}(\hat\theta)\,\frac{y_i - x_i^T\hat\beta_k}{\hat\sigma_k^2}\,x_i\Bigg)\Bigg(\gamma_{i,k}(\hat\theta)\,\frac{y_i - x_i^T\hat\beta_k}{\hat\sigma_k^2}\,x_i\Bigg)^T.
\]
Finally, the estimated variance for $\hat\Gamma^d_k$ is of the form
\[
\hat V_k = \frac{1}{n}\,u_k^T\,\tilde I_{\beta_k}(\hat\theta)\,u_k, \qquad k = 1,\ldots,K, \tag{4.11}
\]
and the prediction variance estimate is
\[
b_k^2 := \hat V_k + \hat\sigma_k^2, \qquad k = 1,\ldots,K. \tag{4.12}
\]
4.2.3 Prediction Sets
Recall that our goal is to construct a $(1-q)\cdot 100\%$ prediction set $\Omega$, i.e., $\mathbb{P}(y_{\mathrm{new}}\in\Omega) \ge 1-q$. By the nature of the mixture, we seek a prediction set of the form
\[
\Omega = \bigcup_{k=1}^{K} L_k,
\]
where each $L_k = [l_k, u_k]$ is centered at a debiased estimator $\hat\Gamma^d_k$. Intuitively, we would like mixture components with larger $\pi_k(x_{\mathrm{new}})$ to have wider intervals.
To this end, we form a probability density function using a weighted mixture of Gaussian densities,
\[
f(y) = \sum_{k=1}^{K}\frac{\hat\pi_k(x_{\mathrm{new}})}{b_k}\,\phi\Bigg(\frac{y - \hat\Gamma^d_k}{b_k}\Bigg), \tag{4.13}
\]
where $\phi(\cdot)$ is the standard normal pdf.

We seek a set of intervals $[y_1^-, y_1^+], \ldots, [y_K^-, y_K^+]$ such that
\[
\sum_{k=1}^{K}\int_{y_k^-}^{y_k^+} f(y)\,dy = 1 - q, \tag{4.14}
\]
while minimizing $\sum_{k=1}^{K}|y_k^+ - y_k^-|$.
Assume without loss of generality that $\hat\Gamma^d_1 \le \cdots \le \hat\Gamma^d_K$. We first define the interval
\[
Q = \big[\hat\Gamma^d_1 - C z_{1-q/2}\,b_1,\ \hat\Gamma^d_K + C z_{1-q/2}\,b_K\big],
\]
which serves as a window of possible cut-offs, where $C \ge 1$ is a multiplier. We then divide $Q$ into sub-intervals of size $\varepsilon$, denoted $Q_1, \ldots, Q_{|Q|/\varepsilon}$. Thus we can approximate the area under the curve $f(y)$ within $Q_i$ by $\varepsilon h_i$, where $h_i$ is $f(y)$ evaluated at a point $y\in Q_i$.

To achieve the minimal total length, we sort the $h_i$'s taken from the sub-intervals in decreasing order, i.e.,
\[
h_{(1)} \ge h_{(2)} \ge \cdots \ge h_{(|Q|/\varepsilon)},
\]
and find $N$ such that
\[
\sum_{i=1}^{N} h_{(i)} \ge \frac{1-q}{\varepsilon}. \tag{4.15}
\]
The prediction set is thus $\bigcup_{i=1}^{N} Q_{(i)}$.
Input: Confidence level $1-q$, debiased estimate of each center $\hat\Gamma^d_k$, prediction standard error estimate $b_k$, membership probability estimate $\hat\pi_k(x_{\mathrm{new}})$, $k = 1,\ldots,K$.
Output: Prediction set with confidence level $1-q$.
1: Form a weighted mixture of Gaussian densities
\[
f(y) = \sum_{k=1}^{K}\frac{\hat\pi_k(x_{\mathrm{new}})}{b_k}\,\phi\Bigg(\frac{y - \hat\Gamma^d_k}{b_k}\Bigg).
\]
2: Find a window $Q$ within which the integral $\int_Q f(y)\,dy \ge 1-q$, and divide $Q$ into sub-intervals of size $\varepsilon$. Take the middle point $y_i$ of each sub-interval $Q_i$ and evaluate $h_i = f(y_i)$.
3: Sort the $h_i$ in decreasing order,
\[
h_{(1)} \ge h_{(2)} \ge \cdots \ge h_{(|Q|/\varepsilon)}.
\]
4: Find $N$ such that
\[
\sum_{i=1}^{N} h_{(i)} \ge \frac{1-q}{\varepsilon}.
\]
5: Take the union of the corresponding sub-intervals, i.e., $\bigcup_{i=1}^{N} Q_{(i)}$.
Algorithm 3: Constructing prediction sets with confidence level $1-q$
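A direct transcription of Algorithm 3 might look as follows; the normal quantile comes from the standard library's `NormalDist`, and the default values of `C` and `eps` are arbitrary illustrative choices rather than recommendations from the text.

```python
import numpy as np
from statistics import NormalDist

def prediction_set(gammas, bs, pis, q=0.05, C=1.5, eps=1e-3):
    """Sketch of Algorithm 3: highest-density prediction set for the
    weighted Gaussian mixture (4.13).  gammas are the debiased centers,
    bs the prediction standard errors b_k, pis the estimated membership
    probabilities.  Returns a list of disjoint intervals."""
    gammas, bs, pis = map(np.asarray, (gammas, bs, pis))
    z = NormalDist().inv_cdf(1 - q / 2)
    lo = (gammas - C * z * bs).min()                 # window Q of cut-offs
    hi = (gammas + C * z * bs).max()
    grid = np.arange(lo, hi, eps) + eps / 2          # sub-interval midpoints
    dens = (pis / bs * np.exp(-0.5 * ((grid[:, None] - gammas) / bs) ** 2)
            / np.sqrt(2 * np.pi)).sum(axis=1)        # f on the grid
    # keep the highest-density sub-intervals until mass reaches 1 - q
    order = np.argsort(dens)[::-1]
    cut = np.searchsorted(np.cumsum(dens[order]) * eps, 1 - q) + 1
    mask = np.zeros(len(grid), dtype=bool)
    mask[order[:cut]] = True
    # merge consecutive kept sub-intervals into intervals
    intervals, start = [], None
    for i, kept in enumerate(mask):
        if kept and start is None:
            start = grid[i] - eps / 2
        if not kept and start is not None:
            intervals.append((start, grid[i] - eps / 2))
            start = None
    if start is not None:
        intervals.append((start, grid[-1] + eps / 2))
    return intervals
```

For two well-separated components the returned union consists of two intervals, one around each debiased center; when the centers are close, the kept sub-intervals merge into a single interval, matching the behavior seen in Figure 4.1.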
4.3 Numerical Study
4.3.1 Example 1: Low Dimensional Case with Unbalanced Components
We first look at a setting similar to the example that we discussed in Section 4.1.2 where there are $K = 2$ components in the mixture. The data are generated for $T = 100$ time points, with centers of the two components respectively
\[
\mu_1(x_t) = 10(T-1)^2\,x_{2,t} - 10(T-1)\,x_{1,t} + \frac{5}{2}, \qquad \mu_2(x_t) = 1,
\]
where
\[
x_{1,t} = \frac{t-1}{T-1}, \qquad x_{2,t} = x_{1,t}^2, \qquad t = 1,\ldots,T.
\]
The outcomes $\{y_t\}_{t=1}^T$ conditional on the cluster assignment variable $z_t$ are generated from
\[
y_t \mid x_t, z_t = k \sim N\big(\mu_k(t),\, \sigma_k^2\big).
\]
The probability of being drawn from the first cluster, $\pi_1(t) := \mathbb{P}(z_t = 1)$, follows a logistic function, where
\[
\log\frac{\pi_1(t)}{1-\pi_1(t)} = \log(10) - \log(4)\log(10)\,t,
\]
and $\sigma_1 = \sigma_2 = 0.1$. Thus, when $t\to 0$, $\pi_1 \to 0.91$ approximately, and when $t\to T$, we have $\pi_1 \to 0.29$ approximately. Overall, around 60% of the data points come from cluster 1 and 40% from cluster 2, which differs from the balanced case in Section 4.1.2. In the top-left corner of Figure 4.2, we plot the trend of the true probability of being drawn from each cluster, as well as the estimated probabilities. We can see that overall the probability of cluster 1 is on average higher than the probability of cluster 2; hence cluster 1 is the majority group.

On the bottom left of Figure 4.2, we show the true trends of the two centers varying with time, as well as the estimated trends. The black dots represent the observed outcomes, which are noisy around the true centers. We can see that the EM algorithm achieves a good estimate of the model, as the estimated curves (solid) fit closely to the true curves (dotted).
The plot in the middle of Figure 4.2 shows the prediction sets, indicated by the gray area. First of all, we can observe that the prediction sets have good coverage of the observations. Second, compared with Figure 4.1, we notice that the band that covers cluster 2 is much thinner, while the band that covers cluster 1 is thicker. This can be explained by the fact that the majority cluster needs a larger interval to capture its variability, whereas the minority cluster needs a shorter one. We also notice that as $t$ increases, the lengths of the intervals corresponding to cluster 1 decrease, while the lengths for cluster 2 increase, which might be explained by the less dominant probability that cluster 1 enjoys as $t$ increases.

On the right of Figure 4.2, two plots show the average coverage rates and average lengths of the prediction sets, respectively. We can see that the coverage rate is maintained at the pre-specified 95% level. Compared with Figure 4.1, we observe that the average lengths increase with the probability of the minority group, rather than following the constant trend shown in Figure 4.1.
4.3.2 Example 2: Low Dimensional Case with Unequal Variance
We again focus on a setting that is similar to the example discussed in Section 4.1.2 where there
are K = 2 components in the mixture. The data are generated for T = 100 time points, with the centers of the two components given by
$$m_1(x_t) = 10(T-1)^2\,x_{2,t} - 10(T-1)\,x_{1,t} + \frac{5}{2}, \qquad m_2(x_t) = 1,$$
Figure 4.2: Performance of estimation and prediction sets for the setting of Example 1, where K = 2, T = 100, the distribution of observations across the two components is unbalanced, and the variances in the two linear models are equal. The plot on the top left shows how the true probabilities of clusters 1 and 2, $p_1(t)$ and $p_2(t)$, vary with time t (dotted), along with the estimated probabilities $\hat{p}_1(t)$ and $\hat{p}_2(t)$ (solid). The plot on the bottom left contains the observed outcomes (black dots), the predicted outcome centers (solid), and the true centers (dotted) of the two clusters. The plot in the middle shows, as the shaded area, the prediction sets at the 95% confidence level. The plots on the top right and bottom right contain the average coverage rates and average lengths of the prediction sets, respectively. Averages are taken over 500 runs; in each run, 1000 outcomes are generated from the model and the percentage of them falling into the confidence region is calculated.
where
$$x_{1,t} = \frac{t-1}{T-1}, \qquad x_{2,t} = x_{1,t}^{2}, \qquad t = 1,\dots,T.$$
The outcomes $\{y_t\}_{t=1}^{T}$ conditional on the cluster assignment variable $z_t$ are generated from
$$y_t \mid x_t,\, z_t = k \;\sim\; N\big(m_k(t),\, s_k^2\big),$$
where $s_1 = 0.1$ and $s_2 = 0.2$. For the probability $p_1(t)$, we let
$$\log\frac{p_1(t)}{1 - p_1(t)} = \log(9) - 2\log(9)\,t.$$
In this setting the sample sizes from the two clusters are approximately equal (in contrast to Section 4.3.1), but the noise levels of the two clusters differ.
We plot the results in Figure 4.3. The top left plot shows the trend of the true probability of being drawn from each cluster, together with the estimated probabilities. On the bottom left, we show the true trends of the two centers over time, along with the estimated trends. The black dots represent the observed outcomes, which are noisy around the true centers. We can see that in this case the distribution across the two clusters is overall balanced, and both the probabilities and the centers are estimated well.
The plot in the middle of Figure 4.3 shows the prediction sets, displayed as the gray area. First, we observe that the prediction sets cover the observations well. Second, compared with Figure 4.1, the band that covers cluster 2 is much thicker than the corresponding band in Figure 4.1. In addition, the band that covers cluster 2 is always
Figure 4.3: Performance of estimation and prediction sets for the setting of Example 2, where K = 2, T = 100, the distribution of observations across the two components is balanced, and the variances in the two linear models are 0.1 and 0.2, respectively. The plot on the top left shows how the true probabilities of clusters 1 and 2, $p_1(t)$ and $p_2(t)$, vary with time t (dotted), along with the estimated probabilities $\hat{p}_1(t)$ and $\hat{p}_2(t)$ (solid). The plot on the bottom left contains the observed outcomes (black dots), the predicted outcome curves (solid), and the true curves (dotted). The plot in the middle shows the predicted outcomes (solid) and true outcomes (dotted); the shaded bands are the prediction sets at the 95% confidence level. The plots on the top right and bottom right contain the average coverage rates and average lengths of the prediction regions, respectively. Averages are taken over 500 runs; in each run, 1000 outcomes are generated from the model and the percentage of them falling into the confidence region is calculated.
thicker than the band that covers cluster 1. This is easily explained: the prediction sets need to assign more length to the noisier group.
On the right of Figure 4.3, two plots show the coverage rates and average lengths of the prediction sets, respectively. We can see that the coverage rate is maintained at the pre-specified level. Compared with Figure 4.1, the average lengths increase overall as the probability of the minority group increases, rather than following the constant trend shown in Figure 4.1. When the noisier group carries more weight in the mixture, the lengths of the prediction sets naturally need to increase as well.
4.3.3 Example 3: High Dimensional Case
We consider a high-dimensional case with K = 2 components and T = 150 time points, where at each time point there are $n_t = 5$ observations. There are p = 500 covariates. The true coefficients are
$$\beta_{1,0} = -2, \qquad \beta_1 = (4, 2, 4, 6, 2, 0, \dots, 0)^{T},$$
$$\beta_{2,0} = 2, \qquad \beta_2 = (0, 0, 0, 0, 0, 4, 2, 4, 6, 2, 0, \dots, 0)^{T},$$
$$\alpha_{1,0} = 5, \qquad \alpha_1 = (0, \dots, 0, 0.7, 0.7, 0.7, 0.7, 0.7)^{T},$$
$$\alpha_{2,0} = 5, \qquad \alpha_2 = (0, \dots, 0, 0.7, 0.7, 0.7, 0.7, 0.7, 0, 0, 0, 0, 0)^{T}.$$
We simulate $X \in \mathbb{R}^{n \times p}$ with i.i.d. rows drawn from $N(0, I)$, and generate
$$y_i \mid x_i,\, z_i = k \;\sim\; N(m_{k,i},\, s_k^2),$$
where $m_{k,i} = x_i^{T}\beta_k + \beta_{k,0}$, $s_k = 1$ for $k = 1, 2$, and
$$P(z_i = k) = \frac{\exp\big(\alpha_{k,0} + x_i^{T}\alpha_k\big)}{\sum_{\ell=1}^{2} \exp\big(\alpha_{\ell,0} + x_i^{T}\alpha_\ell\big)}.$$
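A minimal simulation of this data-generating process might look as follows; the seed is arbitrary, the sign $\beta_{1,0} = -2$ is as read from the display above, and the time structure (T = 150 blocks of $n_t = 5$) is flattened into n = 750 i.i.d. rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 750, 500  # T = 150 time points x n_t = 5 observations each

# Sparse true coefficients from Example 3.
beta1 = np.zeros(p); beta1[:5] = [4, 2, 4, 6, 2]; b10 = -2.0
beta2 = np.zeros(p); beta2[5:10] = [4, 2, 4, 6, 2]; b20 = 2.0
alpha1 = np.zeros(p); alpha1[-5:] = 0.7; a10 = 5.0
alpha2 = np.zeros(p); alpha2[-10:-5] = 0.7; a20 = 5.0

X = rng.standard_normal((n, p))  # i.i.d. N(0, I) rows

# Softmax mixing weights P(z_i = k | x_i).
logits = np.column_stack([a10 + X @ alpha1, a20 + X @ alpha2])
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# Cluster labels and outcomes y_i | x_i, z_i = k ~ N(x_i' beta_k + beta_{k,0}, 1).
z = (rng.random(n) < probs[:, 1]).astype(int)  # 0 -> cluster 1, 1 -> cluster 2
mu = np.where(z == 0, X @ beta1 + b10, X @ beta2 + b20)
y = mu + rng.standard_normal(n)
```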
We run the EM procedure on the above-mentioned training data using the flowmix package to obtain parameter estimates. For the regularization parameters, we use 5-fold cross-validation to choose $(\hat{\lambda}_\alpha, \hat{\lambda}_\beta)$ from a $10 \times 10$ logarithmically spaced grid.
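Concretely, such a grid can be built from logarithmically spaced candidate values; the endpoints below are illustrative assumptions, since the text does not specify the range:

```python
import numpy as np

# 10 x 10 logarithmically spaced grid of (lambda_alpha, lambda_beta)
# candidates for 5-fold cross-validation; the endpoints 1e-3 and 1e1
# are assumed for illustration only.
lam_alpha = np.logspace(-3, 1, 10)
lam_beta = np.logspace(-3, 1, 10)
grid = [(la, lb) for la in lam_alpha for lb in lam_beta]
```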
To evaluate our method, we generate 100 independent $x_{\text{new}} \in \mathbb{R}^{p}$. For each $x_{\text{new}}$, we run our algorithm to obtain the 95% prediction set. We then generate 100 independent $y_{\text{new},i}$'s conditional on $x_{\text{new}}$, count the percentage of the $y_{\text{new},i}$'s falling into the prediction set, and calculate the length of the prediction region corresponding to each $x_{\text{new}}$.
We repeat the above procedure 500 times, keeping the 100 $x_{\text{new}}$'s fixed. For each $x_{\text{new}}$, we then summarize:
1. the average length of the prediction sets (averaged over the 500 runs), and
2. the coverage probability (the proportion of covered outcomes, averaged across the 500 runs).
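Given a prediction set represented as a union of disjoint intervals and fresh draws of $y_{\text{new}}$, the two per-run summaries can be computed as in the following sketch:

```python
import numpy as np

def evaluate_prediction_set(intervals, y_samples):
    """Empirical coverage and total length of a prediction set given as
    a union of disjoint (lo, hi) intervals, evaluated on fresh y draws."""
    covered = np.zeros(len(y_samples), dtype=bool)
    for lo, hi in intervals:
        covered |= (y_samples >= lo) & (y_samples <= hi)
    total_length = sum(hi - lo for lo, hi in intervals)
    return covered.mean(), total_length
```

Averaging the returned coverage and length over the 500 runs gives the quantities plotted in Figures 4.4–4.6.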
We plot in Figure 4.4 the relationship between the coverage rate and the true probability of being drawn from cluster 1, $p_1(x_{\text{new}})$, for each $x_{\text{new}}$. We observe that the coverage rate of our prediction sets meets the pre-specified 95% level, regardless of $p_1(x_{\text{new}})$.
In Figure 4.5, we plot the relationship between the average lengths of the prediction sets and the true probability of being drawn from cluster 1, $p_1(x_{\text{new}})$. We observe that the lengths tend to be smaller when $p_1(x_{\text{new}})$ is close to the boundary values, and larger when $p_1(x_{\text{new}})$ approaches 0.5.
We plot in Figure 4.6 the average length of the prediction intervals against the distance between the two cluster centers, i.e., $\left|x_{\text{new}}^{T}(\beta_1 - \beta_2)\right|$.
Figure 4.4: Scatter plot of the coverage rate of the 95% prediction region against the true $p_1(x_{\text{new}})$ (the probability of the data point being drawn from cluster 1). The coverage rate of our prediction sets is maintained at the pre-specified 95% level regardless of $p_1$.
Figure 4.5: Scatter plot of the average lengths of the 95% prediction sets against the true $p_1(x_{\text{new}})$ (the probability of the data point being drawn from cluster 1). The lengths are smaller when $p_1(x_{\text{new}})$ is close to 0 or 1, and larger when $p_1(x_{\text{new}})$ approaches 0.5.
Figure 4.6: Scatter plot of the average lengths of the 95% prediction region against the distance between the two centers. The average lengths increase with distance initially, then fluctuate between 6.5 and 11.
We notice that there appears to be a monotonic trend when $\left|x_{\text{new}}^{T}(\beta_1 - \beta_2)\right|$ is less than 10, which is probably also the point at which the two intervals stop overlapping. Heuristically, the average length of either cluster's interval is around 5 (note that when the distance between the two centers is 0, i.e., the two clusters lie on top of each other, the total length is around 5). Once the distance reaches 10, the total lengths of the prediction sets start to vary. We do not notice any trend beyond that point, and the variation is probably due to the probabilities of being drawn from either cluster for each $x_{\text{new}}$.
4.4 Real Data Application
We apply our method to the superconductivity data provided in [76]. This dataset contains the critical temperature and a set of p = 81 attributes for each material; the attributes used as predictors are elemental property statistics and electronic structure features. In total, around 21K materials are present in the dataset. We notice that some critical temperatures are less than 1 mK; a log transformation would turn them into outliers, which would eventually cause our prediction intervals to be unreasonably long. Therefore, we add 1 to the critical temperature before taking the log transform, which essentially replaces the log of very small temperatures with 0. We also center and scale each predictor column.
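The preprocessing just described can be sketched as follows (a hypothetical helper; data loading is omitted):

```python
import numpy as np

def preprocess(temps, X):
    """Transform outcomes as y = log(1 + temperature), so temperatures far
    below 1 K map to roughly 0, and center and scale each predictor column."""
    y = np.log1p(temps)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return y, Xs
```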
Since we do not have information on the true number of clusters, K, we validate on a held-out set of data. Specifically, we randomly sample n = 200 records as training data. We vary K = 1, 2, 3, 4, 5 and estimate the parameters of each model with the EM algorithm; the regularization parameters $(\lambda_\alpha, \lambda_\beta)$ are chosen by 5-fold cross-validation. For each model, we compute the test error, measured as mean squared error (MSE), on 1000 held-out data points (validation data), which are randomly sampled and do not overlap with the training data. The model with K = 2 achieves the lowest MSE, so we continue with K = 2 clusters when constructing prediction sets. We report the MSE for each K in Table 4.1.
K MSE
1 8.953
2 0.867
3 0.978
4 1.318
5 0.869
Table 4.1: Out-of-sample MSE calculated on 1000 validation data points for fitted models with K = 1, 2, 3, 4, 5. The MSE is lowest when K = 2.
To examine the coverage of our constructed prediction sets, we use the remaining 20K data points that are included in neither the training set nor the validation set. For each data point we construct the prediction set and check whether the true outcome falls into it. The proportion of test data points whose outcomes fall into their prediction sets is 97.1%. The average length of the intervals, after being transformed back from log-values, is around 42 Kelvin.
We randomly sample 100 data points from the test data, and plot their prediction sets in Figure
4.7.
To better understand the conditional coverage of our prediction sets, we can condition on particular predictors. We consider the predictor most related to critical temperature, which is the weighted
Figure 4.7: Plot of prediction sets for 100 randomly sampled data points in the test set. The black points are log(1 + temperature) and the bars correspond to the prediction sets for each observation. Bars are red when the prediction set is a single interval and cyan when it consists of two intervals.
standard deviation of thermal conductivity. We divide the range of this variable into 5 equally spaced sub-intervals, and divide the test data points into 5 subgroups accordingly. We compute the conditional coverage rate as the coverage rate of the data points within each subgroup. Table 4.2 contains the resulting conditional coverage rates; we can see that the conditional coverage requirement is satisfied.
Subgroup Number of data Coverage Rates
1 4000 0.98
2 4039 0.99
3 3960 0.97
4 3999 0.97
5 4000 0.96
Table 4.2: Coverage rates of the 95% prediction sets conditional on the predictor "weighted standard deviation of thermal conductivity". Subgroups are defined by which of the five equal-length sub-intervals of this predictor's range a test point falls into. The conditional coverage rates are satisfied.
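The conditional coverage rates in Table 4.2 can be computed by binning the chosen predictor into equally spaced sub-intervals; a sketch:

```python
import numpy as np

def conditional_coverage(x_col, covered, n_bins=5):
    """Coverage rate within subgroups defined by equally spaced
    sub-intervals of a single predictor column. `covered` is a boolean
    array indicating whether each test outcome fell in its prediction set."""
    edges = np.linspace(x_col.min(), x_col.max(), n_bins + 1)
    # Assign each point to one of n_bins sub-intervals (last bin right-inclusive).
    bins = np.clip(np.digitize(x_col, edges[1:-1]), 0, n_bins - 1)
    return [covered[bins == b].mean() for b in range(n_bins)]
```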
4.5 Conclusion
The present paper introduces a procedure to construct a debiased predictor for a new observation drawn from a high-dimensional Mixture of Experts (MoE) model, and to construct prediction sets with a pre-specified confidence level. It is worth noting that our proposed method differs fundamentally from work in the conformal prediction literature because we do not assume knowledge of the subgroup clustering. On top of that, we provide intervals with conditional coverage and flexible lengths without ad hoc adjustment.
References
[1] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions
with strategic buyers. In Advances in Neural Information Processing Systems, pages 1169–
1177, 2013.
[2] Victor F Araman and René Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations Research, 57(5):1169–1188, 2009.
[3] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased
inference of average treatment effects in high dimensions. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
[4] Yossi Aviv and Gustavo Vulcano. Dynamic list pricing. In The Oxford handbook of pricing
management. 2012.
[5] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em
algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–
120, 2017.
[6] Gah-Yi Ban and N Bora Keskin. Personalized dynamic pricing with machine learning. 2017.
[7] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B
(Methodological), 57(1):289–300, 1995.
[8] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk
bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.
[9] Jacob Bien. The simulator: an engine to streamline simulations. arXiv preprint
arXiv:1607.00021, 2016.
[10] Jacob Bien, Xiaohan Yan, Léo Simpson, and Christian Lorenz Müller. Tree-aggregated predictive modeling of microbiome data. bioRxiv, 2020.
[11] Gilles Blanchard and Etienne Roquain. Two simple sufficient conditions for fdr control.
Electronic Journal of Statistics, 2(0):963–992, 2008.
[12] Marina Bogomolov, Christine B Peterson, Yoav Benjamini, and Chiara Sabatti. Test-
ing hypotheses on a tree: new error rates and controlling strategies. arXiv preprint
arXiv:1705.07529, 2017.
[13] Josef Broder and Paat Rusmevichientong. Dynamic pricing under a general parametric
choice model. Operations Research, 60(4):965–980, 2012.
[14] T Tony Cai, Zijian Guo, et al. Confidence intervals for high-dimensional linear regression:
Minimax rates and adaptivity. The Annals of statistics, 45(2):615–646, 2017.
[15] Tianxi Cai, Tony Cai, and Zijian Guo. Individualized treatment selection: An optimal hy-
pothesis testing approach in high-dimensional models. arXiv preprint arXiv:1904.12891,
2019.
[16] Tianxi Cai, Tony Cai, and Zijian Guo. Optimal statistical inference for individualized treat-
ment effects in high-dimensional models. arXiv preprint arXiv:1904.12891 (To appear in
Journal of the Royal Statistical Society: Series B), 2019.
[17] Tianxi Cai, Tony Cai, and Zijian Guo. Optimal statistical inference for individualized treat-
ment effects in high-dimensional models, 2019.
[18] Tianxi Cai, T Tony Cai, and Zijian Guo. Optimal statistical inference for individualized
treatment effects in high-dimensional models. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 83(4):669–719, 2021.
[19] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve
prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564,
2015.
[20] Xi Chen, Zachary Owen, Clark Pixton, and David Simchi-Levi. A statistical learning ap-
proach to personalization in revenue management. 2015.
[21] Wang Chi Cheung, David Simchi-Levi, and He Wang. Dynamic pricing and demand learning
with limited price experimentation. Operations Research, 65(6):1722–1731, 2017.
[22] Maxime Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. 2016.
[23] Compustat Industrial - Annual Data, 2015-2019. Available: Standard & Poor’s Compustat
[01/26/2021]. Retrieved from Wharton Research Data Service.
[24] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons,
2012.
[25] CRSP Stocks, 2015-2019. Available: Center For Research in Security Prices. Graduate
School of Business. University of Chicago [01/26/2021]. Retrieved from Wharton Research
Data Service.
[26] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incom-
plete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Method-
ological), 39(1):1–22, 1977.
[27] A. V. den Boer and A. P. Zwart. Mean square convergence rates for maximum (quasi-)likelihood estimation. Stochastic Systems, 4:1–29, 2014.
[28] Arnoud V den Boer. Dynamic pricing and learning: historical origins, current research, and
new directions. Surveys in operations research and management science, 20(1):1–18, 2015.
[29] Arnoud V den Boer and Bert Zwart. Simultaneously learning and optimizing using controlled
variance pricing. Management science, 60(3):770–783, 2013.
[30] Ossama Elshiewy, Daniel Guhl, and Yasemin Boztug. Multinomial logit models in
marketing-from fundamentals to state-of-the-art. Marketing ZFP, 39(3):32–49, 2017.
[31] Vivek F Farias and Benjamin Van Roy. Dynamic pricing with a prior on market response.
Operations Research, 58(1):16–29, 2010.
[32] Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue manage-
ment using thompson sampling. 2016.
[33] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic
Systems, 3(1):230–261, 2013.
[34] Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning:
Robust pricing in contextual auctions. http://dx.doi.org/10.2139/ssrn.3144034, 2018.
[35] Bettina Grün and Friedrich Leisch. FlexMix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4):1–35, 2008.
[36] Peter M Guadagni and John DC Little. A logit model of brand choice calibrated on scanner
data. Marketing science, 2(3):203–238, 1983.
[37] Zijian Guo, Claude Renaux, Peter Bühlmann, and T Tony Cai. Group inference in high dimensions with applications to hierarchical testing. arXiv preprint arXiv:1909.01503, 2019.
[38] Zijian Guo, Claude Renaux, Peter Bühlmann, and Tony Cai. Group inference in high dimensions with applications to hierarchical testing. Electronic Journal of Statistics, 15(2):6633–6676, 2021.
[39] J Michael Harrison, N Bora Keskin, and Assaf Zeevi. Bayesian dynamic pricing policies:
Learning and earning under a binary prior distribution. Management Science, 58(3):570–
586, 2012.
[40] Ruth Heller, Nilanjan Chatterjee, Abba Krieger, and Jianxin Shi. Post-selection inference fol-
lowing aggregate level hypothesis testing in large-scale genomic data. Journal of the Ameri-
can Statistical Association, 113(524):1770–1783, 2018.
[41] Sangwon Hyun. flowmix: Ocean Flow Cytometry Analysis, 2022. R package version
0.0.0.9000.
[42] Sangwon Hyun, Mattias Rolf Cape, Francois Ribalet, and Jacob Bien. Modeling cell popula-
tions measured by flow cytometry with covariates using sparse mixture of regressions. arXiv
preprint arXiv:2008.11251, 2020.
[43] Adel Javanmard. Perishability of data: dynamic pricing under varying-coefficient models.
The Journal of Machine Learning Research, 18(1):1714–1744, 2017.
[44] Adel Javanmard and Jason D Lee. A flexible framework for hypothesis testing in high
dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
82(3):685–718, 2020.
[45] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for
high-dimensional regression. The Journal of Machine Learning Research, 15, 06 2013.
[46] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for
high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909,
2014.
[47] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression
under the gaussian random design model: Asymptotic theory. IEEE Transactions on Infor-
mation Theory, 60(10):6522–6554, 2014.
[48] Adel Javanmard and Andrea Montanari. Debiasing the lasso: Optimal sample size for gaus-
sian designs. The Annals of Statistics, 46(6A):2593–2622, 2018.
[49] Adel Javanmard, Andrea Montanari, et al. Online rules for control of false discovery rate and
false discovery exceedance. The Annals of statistics, 46(2):526–554, 2018.
[50] Adel Javanmard and Hamid Nazerzadeh. Dynamic pricing in high-dimensions. Journal of
Machine Learning Research, 20(9):1–49, 2019.
[51] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algo-
rithm. Neural computation, 6(2):181–214, 1994.
[52] Eugene Katsevich and Chiara Sabatti. Multilayer knockoff filter: Controlled variable selec-
tion at multiple resolutions. The annals of applied statistics, 13(1):1, 2019.
[53] Bora Keskin. Optimal dynamic pricing with demand model uncertainty: A squared-
coefficient-of-variation rule for learning and earning. Working Paper, 2014.
[54] Jason M Klusowski, Dana Yang, and WD Brinda. Estimating the coefficients of a mixture
of two linear regressions by expectation maximization. IEEE Transactions on Information
Theory, 65(6):3515–3524, 2019.
[55] Jeongyeol Kwon and Constantine Caramanis. Em converges for a mixture of many linear
regressions. In International Conference on Artificial Intelligence and Statistics, pages 1727–
1736. PMLR, 2020.
[56] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis.
Global convergence of the em algorithm for mixtures of two component linear regression.
In Conference on Learning Theory, pages 2055–2110. PMLR, 2019.
[57] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman.
Distribution-free predictive inference for regression. Journal of the American Statistical As-
sociation, 113(523):1094–1111, 2018.
[58] Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of
the American Statistical Association, 108(501):278–287, 2013.
[59] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for con-
textual decision-making. arXiv preprint arXiv:1611.00829, 2016.
[60] Gavin Lynch and Wenge Guo. On procedures controlling the fdr for testing hierarchically
ordered hypotheses. arXiv preprint arXiv:1612.04467, 2016.
[61] Martin Martens and Dick van Dijk. Measuring volatility with the realized range. Journal of
Econometrics, 138(1):181–207, 2007. 50th Anniversary Econometric Institute.
[62] Daniel McFadden. Economic choices. American economic review, 91(3):351–378, 2001.
[63] Daniel McFadden et al. Conditional logit analysis of qualitative choice behavior. 1973.
[64] Geoffrey J McLachlan, Shu-Kay Ng, and Richard Bean. Robust cluster analysis via mixture
models. Austrian Journal of Statistics, 35(2&3):157–174, 2006.
[65] Nicolai Meinshausen. Hierarchical testing of variable importance. Biometrika, 95(2):265–
278, 06 2008.
[66] Jonas Mueller, Vasilis Syrgkanis, and Matt Taddy. Low-rank bandit methods for high-
dimensional dynamic pricing. arXiv preprint arXiv:1801.10242, 2018.
[67] NYC Planning, 2020. Available: ”Neighborhood Tabulation Areas (Formerly ”Neighbor-
hood Projection Areas”)”. Retrieved from September 22, 2020.
[68] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.
[69] Michael Parkinson. The extreme value method for estimating the variance of the rate of
return. The Journal of Business, 53(1):61–65, 1980.
[70] Sheng Qiang and Mohsen Bayati. Dynamic pricing with demand covariates. arXiv preprint
arXiv:1604.07463, 2016.
[71] Aaditya Ramdas, Jianbo Chen, Martin J. Wainwright, and Michael I. Jordan. Dagger: A
sequential algorithm for fdr control on dags. ArXiv, abs/1709.10250, 2017.
[72] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression.
Advances in neural information processing systems, 32, 2019.
[73] George A. F. Seber and Alan J. Lee. Linear regression analysis. Wiley, 2012.
[74] R. J. Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika,
73(3):751–754, 12 1986.
[75] Nicolas Städler, Peter Bühlmann, and Sara A. van de Geer. ℓ1-penalization for mixture regression models. TEST, 19:209–256, 2010.
[76] Valentin Stanev, Corey Oses, A. Gilad Kusne, Efrain Rodriguez, Johnpierre Paglione, Ste-
fano Curtarolo, and Ichiro Takeuchi. Machine learning modeling of superconducting critical
temperature. npj Computational Materials, 4(1):29, 2018.
[77] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898,
2012.
[78] Kalyan T Talluri and Garrett J Van Ryzin. The theory and practice of revenue management,
volume 68. Springer Science & Business Media, 2006.
[79] Inder Jeet Taneja and Pranesh Kumar. Relative information of type s, Csiszár's f-divergence, and information inequalities. Information Sciences, 166(1-4):105–125, 2004.
[80] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[81] Nilesh Tripuraneni and Lester Mackey. Debiasing linear prediction. arXiv preprint
arXiv:1908.02341, 2019.
[82] US OMB. North american industry classification system. Executive Office of the President;
Office of Management and Budget, 2017.
[83] US OMB. Standard occupational classification manual. Executive Office of the President;
Office of Management and Budget, 2018.
[84] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
[85] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
[86] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Com-
pressed sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012.
[87] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
[88] Ines Wilms and Jacob Bien. Tree-based node aggregation in sparse graphical models. arXiv
preprint arXiv:2101.12503, 2021.
[89] Xiaohan Yan and Jacob Bien. rare: Linear Model with Tree-Based Lasso Regularization for
Rare Features. R package version 0.1.0.
[90] Xiaohan Yan and Jacob Bien. Rare feature selection in high dimensions. Journal of the
American Statistical Association, 0(0):1–14, 2020.
[91] Daniel Yekutieli. Hierarchical false discovery rate-controlling methodology. Journal of the
American Statistical Association, 103(481):309–316, 2008.
[92] Xinyang Yi and Constantine Caramanis. Regularized em algorithms: A unified framework
and statistical guarantees. Advances in Neural Information Processing Systems, 28, 2015.
[93] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for
mixed linear regression. In International Conference on Machine Learning, pages 613–621.
PMLR, 2014.
[94] Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional param-
eters in high dimensional linear models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 76(1):217–242, 2014.
[95] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional param-
eters in high dimensional linear models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 76(1):217–242, 2014.
[96] Heng Zhang, Paat Rusmevichientong, and Huseyin Topaloglu. Multi-product pricing under
the generalized extreme value models with homogeneous price sensitivity parameters. 2018.
[97] Linjun Zhang, Rong Ma, T Tony Cai, and Hongzhe Li. Estimation, confidence intervals, and
large-scale hypotheses testing for high-dimensional mixed linear regression. arXiv preprint
arXiv:2011.03598, 2020.
[98] Hong-Tu Zhu and Heping Zhang. Hypothesis testing in mixture regression models. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):3–16, 2004.
[99] Yinchu Zhu and Jelena Bradic. Linear hypothesis testing in dense high-dimensional linear
models. Journal of the American Statistical Association, 113(524):1583–1600, 2018.
Abstract
The past decades have witnessed revolutionary changes in the measurement, collection, and storage of data. High-dimensional data are ubiquitous in many domains of study, which has brought tremendous opportunities and challenges. In this thesis, we discuss several statistical approaches that address the curse of high dimensionality. Our proposed work builds upon the high-dimensional literature with novel applications, improvements to interpretability, and the development of statistical inference tools.
We first consider leveraging high-dimensional contexts for the problem of multi-product dynamic pricing. In this environment, customers arrive over time and products are described by high-dimensional feature vectors. Each customer chooses a product according to the widely used Multinomial Logit (MNL) choice model, and her utility depends on the product features as well as the prices offered. Our model allows for heterogeneous price sensitivities across products. The seller does not know the parameters of the choice model a priori but can learn them through interactions with the customers. The seller's goal is to design a pricing policy that maximizes her cumulative revenue. This model is motivated by online marketplaces such as the Airbnb platform and online advertising. We measure the performance of a pricing policy in terms of regret, which is the expected revenue loss with respect to a clairvoyant policy that knows the parameters of the choice model in advance and always sets the revenue-maximizing prices. We propose a pricing policy, named M3P, that achieves an optimal T-period regret of O(log(Td)) under heterogeneous price sensitivity.
Next, we discuss an approach that helps reduce dimensionality and increase model interpretability. We observe that in many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. The problem of tree-based aggregation asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the false split rate, an error measure that describes the degree to which subgroups have been split when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients. We apply this methodology to aggregate stocks based on their volatility and to aggregate neighborhoods of New York City based on taxi fares.
Finally, we address the issue of heterogeneity that can arise in high-dimensional data. The Mixture of Experts (MoE) model has become a popular tool to analyze heterogeneous, high-dimensional data because it allows not only the mixture centers but also the mixture weights to depend on features, giving rise to more flexibility in the data generating process. Even though powerful tools, such as the Expectation-Maximization (EM) algorithm, have been developed to estimate such models and make predictions, there has been little work on predictive inference for high-dimensional MoE models. Motivated by this, we propose a debiased predictor for a new observation drawn from an MoE model. We construct a prediction set, which takes the form of a union of intervals. We apply our proposed method to the analysis of a dataset of superconducting materials, and investigate the relationship between the critical temperature of materials and other elemental and electronic attributes.
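The union-of-intervals form of an MoE prediction set can be sketched as follows. The parameterization here is hypothetical and only meant to convey the structure (softmax gating over linear experts with Gaussian noise); it is not the debiased procedure developed in the thesis, and the weight threshold `weight_tol` is an assumption of this toy sketch.

```python
import numpy as np

def moe_prediction_set(x, gate_W, expert_B, sigma, z=1.96, weight_tol=0.05):
    """Prediction set for a Mixture-of-Experts model as a union of intervals.

    Hypothetical model: gating weights are softmax(gate_W @ x), and expert k
    predicts expert_B[k] @ x with noise sd sigma[k].  For each expert with
    non-negligible gating weight we form a normal interval; the prediction
    set is the union of those intervals (stylized illustration only).
    """
    logits = gate_W @ x
    w = np.exp(logits - logits.max())
    w /= w.sum()                                # softmax gating weights
    intervals = []
    for k, wk in enumerate(w):
        if wk >= weight_tol:                    # skip negligible experts
            mu = expert_B[k] @ x
            intervals.append((mu - z * sigma[k], mu + z * sigma[k]))
    return intervals

# Toy example: two experts, both with non-negligible gating weight,
# producing two disjoint intervals.
x = np.array([1.0, 0.5])
gate_W = np.array([[0.2, -0.1], [-0.3, 0.4]])
expert_B = np.array([[1.0, 2.0], [-1.0, 0.5]])
sets = moe_prediction_set(x, gate_W, expert_B, sigma=np.array([0.5, 0.5]))
print(sets)
```

When the experts' predictions are far apart, the resulting set is genuinely disconnected; a single interval covering both experts would be much wider, which is one motivation for reporting a union of intervals.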
Asset Metadata
Creator: Shao, Simeng (author)
Core Title: Statistical Learning in High Dimensions: Interpretability, Inference and Applications
School: Marshall School of Business
Degree: Doctor of Philosophy
Degree Program: Business Administration
Degree Conferral Date: 2022-08
Publication Date: 07/29/2022
Defense Date: 07/29/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: dynamic pricing, false discovery rate, high dimensional statistics, multiple testing, OAI-PMH Harvest, rare features, statistical inference
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Bien, Jacob (committee chair); Javanmard, Adel (committee chair); Armstrong, Timothy (committee member)
Creator Email: simengsh@usc.edu, simengshao@hotmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111375808
Unique Identifier: UC111375808
Legacy Identifier: etd-ShaoSimeng-11042
Document Type: Dissertation
Rights: Shao, Simeng
Type: texts
Source: 20220801-usctheses-batch-964 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu