Nonparametric Ensemble Learning and Inference

by

Patrick Vossler

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

August 2022

Copyright 2022 Patrick Vossler

Acknowledgements

I feel that words fail to express accurately how truly grateful I am for the steadfast support of my family, friends, and mentors as I have pursued this Ph.D. Nevertheless, I hope that what follows can somewhat convey how fortunate I feel to have received your love and support over these past few years.

First, I would not have reached this point in my academic career without my family. I want to thank my parents, Tracey Tierney and David Vossler, for their patience and love, for encouraging me to be curious about the world around me and for pushing me to give "always more, never less," and my sister, Laura Vossler, for her support and advice as I've progressed through my Ph.D. and for introducing me to social science research. I am also grateful for the support of my aunts, June Tierney and Kellie Burke; I have done my best to take their advice of not counting myself out to heart.

I want to thank my advisors, Jinchi Lv and Yingying Fan, for guiding me through my Ph.D., patiently answering all of my questions, and helping show me what it means to be a statistician. Their enthusiasm and commitment to seeking the answers to impactful research questions will always be a source of inspiration. On top of everything else, they have created a fantastic research group where I was fortunate to meet my collaborators Chien-Ming Chi, Emre Demirkaya, Xinze Du, and Lan Gao.

I want to thank Jacob Bien for allowing me to co-teach BUAD 312 with him and shaping my teaching philosophy; Vishal Gupta for challenging me and my cohort with his whiteboard problems and encouraging us to be curious researchers. To Michael Huang, Wilson Lin, Simeng Shao, and Bradley Rava, thank you for all the late nights working on assignments together, introducing me – an L.A. native – to new restaurants, and all the jokes and laughs. Finally, thank you, Vitalii Ostrovsky, for demonstrating what it means to be passionate about research and Benjamin Graham for introducing me to statistics and programming in the first place.

Most of all, I am grateful for the unending support and love of my partner, Audrey Chu, throughout my Ph.D.

Finally, I wish to acknowledge that much of this dissertation is based on the following collaborations with Chien-Ming Chi, Emre Demirkaya, Yingying Fan, Lan Gao, Jinchi Lv, and Jingbo Wang:

1. Chapter 2 and Appendix A contain work from the pre-print:
Chi, C.-M., Vossler, P., Fan, Y. & Lv, J. Asymptotic properties of high-dimensional random forests. arXiv preprint arXiv:2004.13953 (2020).

2. Chapter 3 and Appendix B contain work from the pre-print:
Demirkaya, E., Fan, Y., Gao, L., Lv, J., Vossler, P. & Wang, J. Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors. arXiv preprint arXiv:1808.08469 (2021).

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Nonparametric Regression
    1.1.1 Approximation error
    1.1.2 Curse of dimensionality
    1.1.3 The random forest algorithm and model sparsity
  1.2 Ensemble Estimators
    1.2.1 Bagging and subagging
    1.2.2 Bagging and nearest neighbors
    1.2.3 Distributional nearest neighbors

Chapter 2: High-Dimensional Random Forest
  2.1 Introduction
    2.1.1 Notation
  2.2 Random Forest
    2.2.1 Model setting and the random forest algorithm
    2.2.2 CART-split criterion
    2.2.3 Roadmap for the bias-variance decomposition analysis
  2.3 Main Results
    2.3.1 Definitions and technical conditions
      2.3.1.1 Examples satisfying SID
      2.3.1.2 SID and model sparsity: sparsity parameter α_1
    2.3.2 Convergence rates
    2.3.3 Sharper convergence rates with binary features
    2.3.4 The role of relevant features
    2.3.5 Related work
  2.4 Approximation Theory
    2.4.1 Main results
    2.4.2 Sample tree-growing rule
  2.5 Upper bounds for the statistical estimation error
  2.6 Discussion

Chapter 3: Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors
  3.1 Introduction
  3.2 Model Setting
    3.2.1 Distributional nearest neighbors (DNN)
  3.3 Two-scale distributional nearest neighbors
    3.3.1 Two-scale DNN
    3.3.2 Accuracy and asymptotic distributions of two-scale DNN
  3.4 Variance and distribution estimates for the two-scale DNN estimator
    3.4.1 Jackknife estimator
    3.4.2 Bootstrap estimator
    3.4.3 Bootstrap estimator for distribution of TDNN
  3.5 Application to heterogeneous treatment effect estimation and inference
  3.6 Simulation studies
    3.6.1 Two-scale DNN versus DNN
    3.6.2 Comparisons with DNN and k-NN for nonparametric inference
  3.7 Real data application
  3.8 Discussion

Chapter 4: Ongoing and future research
  4.1 Global control of random forest variance and allowing for fully-grown trees
  4.2 Multiscale DNN and extensions to high-dimensional settings

References

Appendices
  A Chapter 2 Proofs
    A.1 Proofs of main results
      A.1.1 Technical preparation
    A.2 Additional examples for SID
    A.3 Proof of Theorem 1
      A.3.1 Proof of Theorem 2
      A.3.2 Proof of Theorem 3
      A.3.3 Proof of Theorem 4
      A.3.4 Proof of Theorem 5
    A.4 Proofs of Corollaries 1–2, Proposition 1, and some key lemmas
      A.4.1 Proof of Corollary 1
      A.4.2 Proof of Corollary 2
      A.4.3 Proof of Proposition 1
      A.4.4 Proof of Lemma 1
      A.4.5 Proof of Lemma 2
      A.4.6 Proof of Lemma 3
      A.4.7 Lemma 6 and its proof
      A.4.8 Lemma 7 and its proof
    A.5 Additional lemmas and technical details
      A.5.1 Lemma 8 and its proof
      A.5.2 Lemma 9 and its proof
      A.5.3 Verifying Condition 1 for Example 1
      A.5.4 Verifying Condition 1 for Example 2
      A.5.5 Verifying Condition 1 for Example 3
      A.5.6 Verifying Condition 1 for Example 4
      A.5.7 Proof for Remark 3
      A.5.8 Verifying Condition 1 for Example 6
    A.6 Verifying Condition 1 for Example 7
      A.6.1 Verifying Condition 1 for Example 8
  B Chapter 3 additional numerical results and proofs
    B.1 Additional simulation results
      B.1.1 Comparison with k-NN
      B.1.2 Simulation setting 1 with uniform design
    B.2 Proofs of main results
      B.2.1 Proof of Theorem 6
      B.2.2 Proof of Theorem 7
      B.2.3 Proof of Theorem 8
      B.2.4 Proof of Theorem 9
      B.2.5 Proof of Theorem 10
      B.2.6 Proof of Theorem 11
      B.2.7 Proof of Theorem 12
      B.2.8 Proof of Theorem 13
    B.3 Some key lemmas and their proofs
      B.3.1 Lemma 10 and its proof
      B.3.2 Lemma 11 and its proof
      B.3.3 Lemma 12 and its proof
      B.3.4 Lemma 13 and its proof
      B.3.5 Lemma 14 and its proof
      B.3.6 Lemma 15 and its proof
      B.3.7 Lemma 16 and its proof
      B.3.8 Lemma 17 and its proof
      B.3.9 Lemma 18 and its proof
    B.4 Additional technical details
      B.4.1 Lemma 19 and its proof
      B.4.2 Lemma 20 and its proof
      B.4.3 Lemma 21 and its proof

List of Tables

2.1 Comparison of the properties of recent consistency rate results for the random forest algorithm.
2.2 Comparison of the consistency rates of the random forest algorithm with different splitting rules.
3.1 Comparison of DNN, k-NN, and TDNN in simulation setting 1 described in Section 3.6.1.
3.2 Comparison of DNN, k-NN, and TDNN in simulation setting 2 described in Section 3.6.2.
3.3 Comparison of DNN, k-NN, and TDNN in simulation setting 3 described in Section 3.6.2.
3.4 The MSEs of different nonparametric learning methods on the real data application in Section 3.7.
4.1 A modified version of the comparison of DNN, k-NN, and TDNN in simulation setting 1 as described in Section 3.6.2, but with the random feature vector X drawn from U([0,1]^3) instead of N(0, I_3).

List of Figures

2.1 Diagram of a Level 2 (Height 3) tree.
2.2 Visual representation of the modified sample tree-growing rule used in Section 2.4.2.
2.3 Visual depiction of the proof strategy for Lemma 2.
3.1 The results of simulation setting 1 described in Section 3.6.1 for DNN and TDNN.
4.1 A graphical illustration of the notation used in the proof of Theorem A.3.2.
4.2 A graphical illustration of piecewise SID for Example 7.
4.3 The bias and MSE results for k-NN in Section 3.6.1.

Abstract

Nonparametric regression methods are flexible statistical learning tools that require minimal identification assumptions compared to their parametric counterparts. Thus, practitioners frequently use these methods in applications due to their appealing empirical performance. In this dissertation, we study two nonparametric methods – the random forest algorithm and distributional nearest neighbors (DNN) – for which the existing theoretical work provides inference guarantees that hold only for a modified version of the method or do not hold under high-order smoothness assumptions.

In the first part of this dissertation, we use a bias-variance decomposition analysis to derive consistency rates for the random forest algorithm with the sample classification and regression tree (CART) splitting criterion in a general high-dimensional nonparametric regression setting.
Our new results provide theoretical justification for the ability of the random forest algorithm to adapt to high dimensionality while remaining flexible enough to allow for a discontinuous regression function and dependent covariates. Furthermore, our bias analysis explicitly characterizes how the asymptotic bias of the random forest algorithm depends on the sample size, tree height, and column subsampling parameter.

In the second part of this dissertation, we study the distributional nearest neighbors method, which assigns monotonic, nonnegative weights to the entire sample in a distributional fashion. The DNN method achieves the optimal nonparametric minimax convergence rate under a second-order smoothness assumption but fails to do so under a high-order smoothness assumption. We show that the slow convergence rate of DNN under high-order smoothness assumptions is due to its asymptotic bias. Thus, we propose eliminating the DNN estimator's first-order bias by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. We prove that the TDNN estimator has an optimal nonparametric convergence rate for estimating the regression function under a fourth-order smoothness condition. Furthermore, we establish that the DNN and TDNN methods are asymptotically normal as the subsampling scales and sample size approach infinity. Finally, we provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for TDNN.

We conclude by discussing potential extensions of our work in this dissertation and open questions related to the random forest and TDNN.

Chapter 1
Introduction

This dissertation attempts to expand our collective understanding of the random forest and bagged one-nearest-neighbor (1-NN) estimators by addressing the following questions:

1. Can we derive a consistency rate for the random forest algorithm that uses the sample classification and regression tree (CART) splitting criterion in a general high-dimensional nonparametric regression setting?
2. Why does the random forest method perform well in high-dimensional settings despite the well-known challenges nonparametric methods face in such settings?
3. Can we characterize the asymptotic distribution of bagged 1-NN estimators?
4. How can we adapt the bagged 1-NN estimator to achieve the optimal nonparametric convergence rate under high-order smoothness assumptions?

The following sections explain the motivation for our interest in these questions and provide a high-level introduction to both of these methods.

1.1 Nonparametric Regression

Given a set of n observations {(X_i, Y_i)}_{i=1}^n from the nonparametric model Y = f(X) + ε, where f(·) is the unknown mean regression function and ε is the model error with E(ε) = 0, is it possible to construct a function f_n that approximates f? Since mathematicians first used ordinary least squares to address this problem around the turn of the 19th century (we are vague about attributing the discovery of ordinary least squares, given that the disagreement between Gauss and Legendre has been described as "the most famous priority dispute in the history of statistics"; see [3] for more details), devising solutions for this problem has taken on even greater importance. Rapid advances in modern technology have resulted in a massive amount of data, often with more features than observations, and have spurred the development of regression methods capable of extracting patterns from big data.

Depending on the application domain, one might have some knowledge about the form of the function f being estimated. For instance, we might know that the function f is linear in X (i.e., f(X) = b^T X for some unknown parameter b) or that f can be well approximated by such a linear function. In this case, the regression task consists of estimating the parameter b, which can be done with a bit of linear algebra.
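For the linear case just described, the least-squares estimate of b is indeed a short linear-algebra computation. A minimal sketch on synthetic data (all names and settings here are hypothetical, for illustration only):

import numpy as np

# Synthetic linear model y = b^T x + noise, a toy stand-in for f(X) = b^T X.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.uniform(size=(n, d))
b_true = np.array([1.5, -2.0, 0.0, 0.7, 3.0])
y = X @ b_true + rng.normal(scale=0.1, size=n)

# "A bit of linear algebra": the least-squares estimate of b.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat, 2))  # should be close to b_true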
Unfortunately, in many applications, little or no information is known about f a priori; therefore, it may be incorrect to assume a parametric form. This dilemma has motivated the development of nonparametric regression procedures that infer more arbitrary functions f from data. Formally, a regression method is nonparametric if the only assumption is that f belongs to some infinite-dimensional class. While this allows us to model the regression function more flexibly, it also comes at a price.

No matter how we choose to model the regression function, the metric of success remains the same: how well does the estimator f_n approximate f? As we often wish to compare different estimators, it is useful to formalize a notion of the approximation error.

1.1.1 Approximation error

In regression analysis, one considers a random vector (X, Y) with values in R^d x R satisfying E(Y^2) < ∞, and the aim is to estimate the relationship between X and Y. Typically, the aim is to minimize the mean squared error, or L_2 risk. Thus, we are interested in constructing a (measurable) function f^*: R^d -> R satisfying

E(\|Y - f^*(X)\|^2) = \min_{g: \mathbb{R}^d \to \mathbb{R}} E(\|Y - g(X)\|^2).

We let f: R^d -> R, f(x) = E(Y | X = x), denote the regression function; because every (measurable) candidate m satisfies

E(\|Y - m(X)\|^2) = E(\|Y - f(X)\|^2) + E(\|m(X) - f(X)\|^2),

f is the optimal predictor f^*. Thus, a good estimate f_n: R^d -> R should keep the L_2 error

E(\|f_n(X) - f(X)\|^2)   (1.1)

small. The L_2 error decomposes nicely into bias and variance terms, which is useful for much of the analysis in this dissertation because it allows us to focus on each term individually. A bit of algebra yields the classic bias-variance decomposition:

E(\|f_n(X) - f(X)\|^2) = E(\|f_n(X) - E(f_n(X))\|^2) + E(\|E(f_n(X)) - f(X)\|^2).

The L_2 error (1.1) is our notion of the error of f_n relative to f. Universally consistent estimators, such as kernel estimators, exist; however, for practical purposes, we are more interested in assessing the number of samples required for a good estimate. In other words, we aim to characterize the rate at which E(\|f_n(X) - f(X)\|^2) approaches zero relative to the sample size n.

1.1.2 Curse of dimensionality

It is well known that we must restrict the class of regression functions under consideration in order to obtain nontrivial convergence rate results. Thus, we introduce the following definition of (s, C)-smoothness.

Definition 1. Let s = q + r with q ∈ N_0 and 0 < r ≤ 1. A function f: R^d -> R is called (s, C)-smooth if for every α_i ∈ N_0, i = 1, ..., d, with \sum_{i=1}^d α_i = q, the partial derivative \partial^q f / (\partial x_1^{\alpha_1} \partial x_2^{\alpha_2} \cdots \partial x_d^{\alpha_d}) exists and satisfies

\left| \frac{\partial^q f(y)}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} - \frac{\partial^q f(z)}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right| \le C \|y - z\|^r

for any y, z ∈ R^d, where d is the ambient dimension of the feature space X.

The seminal work in [4] determined the optimal minimax rate of convergence in nonparametric regression for (s, C)-smooth functions. For additional context, a sequence of positive numbers (a_n) with n ∈ N is called a lower minimax rate of convergence for the class of distributions D if

\liminf_{n \to \infty} \inf_{f_n} \sup_{(X,Y) \in \mathcal{D}} \frac{E(\|f_n - f\|^2)}{a_n} = C_1 > 0.

Additionally, a sequence is called an achievable rate of convergence for the class of distributions D if

\limsup_{n \to \infty} \sup_{(X,Y) \in \mathcal{D}} \frac{E(\|f_n - f\|^2)}{a_n} = C_2 < \infty.

Finally, a sequence is called an optimal minimax rate of convergence if it is both a lower minimax and an achievable rate of convergence.
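To make these rate notions concrete, one can estimate the L_2 error (1.1) of a simple nonparametric estimator by Monte Carlo at several sample sizes and watch it shrink. The sketch below does so for a k-NN regressor on a hypothetical one-dimensional target; the target function, noise level, and choice of k are all illustrative assumptions, not taken from this dissertation:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # smooth target function
x_test = rng.uniform(size=(2000, 1))          # test points for estimating the L2 error

for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(20):                       # Monte Carlo over training samples
        X = rng.uniform(size=(n, 1))
        y = f(X).ravel() + rng.normal(scale=0.3, size=n)
        fn = KNeighborsRegressor(n_neighbors=max(1, int(n ** 0.6))).fit(X, y)
        errs.append(np.mean((fn.predict(x_test) - f(x_test).ravel()) ** 2))
    print(n, round(float(np.mean(errs)), 4))  # the estimated L2 error decreases with n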
In particular, [4] showed that the optimal minimax rate of convergence for the estimation of an (s, C)-smooth regression function is

n^{-\frac{2s}{2s + d}}.   (1.2)

Although (1.2) is optimal, it suffers from a characteristic feature in the case of high-dimensional functions: if d is relatively large compared with s, this convergence rate can be extremely slow. This well-known phenomenon is often called the "curse of dimensionality."

At a high level, nonparametric methods suffer from this phenomenon because many of these methods operate by approximating the target function locally with simpler functions. However, these simpler functions are still only approximations and result in local errors that aggregate globally. Thus, we must obtain satisfactory performance in most local areas to approximate the entire function well. For example, suppose that constant functions approximate the target function well in regions of radius at most 0 < r < 1. Then, one approach for an estimator would be to divide the domain of X into regions with radius no larger than r and attempt to approximate each of these smaller regions well. If we employ this method and X is d-dimensional, then the smallest (and hopefully most accurate) partition is of size O(r^{-d}). However, the issue with this approach is that if we want all of these local approximations to be accurate, we require sufficient data points to fall into each such region. In other words, we require a data size exponential in d to approximate the target function locally with simpler functions. Unfortunately, in many modern applications d is often quite large; thus, a requirement such as n > 2^d is often impractical.

1.1.3 The random forest algorithm and model sparsity

The curse of dimensionality appears to rule out nonparametric regression methods for the increasingly high-dimensional data sets ubiquitous in modern applications. In applications ranging from text classification to genomic analysis, the number of features of X can be much larger than the number of observations. However, in many cases, the dimensionality of the data is believed to be large only in the superficial sense of having many coordinates, whereas the true degrees of freedom are much smaller. For example, this might be the case due to strong dependencies between features, because some features are irrelevant, or for both reasons. In any of these cases, a high-dimensional data set would have a sparse subset of relevant features. Two main estimation approaches exploit the existence of this sparse subset of features in high-dimensional data with nonparametric methods:

1. Use a preprocessing method to screen for relevant features [5–7] and then estimate the regression function with the smaller set of selected features.
2. Use a nonparametric method that adapts to the sparsity of the data as part of its training procedure.

In this dissertation, we are interested in better understanding how the methods described in the second approach adapt to sparsity.

One such nonparametric method is the random forest algorithm [8, 9]. The random forest algorithm is regularly used in applications because it provides accurate estimates with limited tuning. Given the curse of dimensionality, we would expect that, similar to other nonparametric methods such as kernel estimators and nearest neighbor methods, the performance of the random forest algorithm quickly degrades as the ambient dimension of a data set increases. However, evidence to the contrary exists in practice. The random forest method has been successfully used with high-dimensional genomics data [10] and high-dimensional longitudinal data [11].
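This robustness to many irrelevant features is easy to observe empirically. The sketch below, which uses synthetic data, hypothetical settings, and scikit-learn's off-the-shelf RandomForestRegressor (not the estimator analyzed later in this dissertation), compares a random forest with a k-NN baseline when only two of fifty features are active:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
n, p = 1000, 50                                            # many features, few of them active
X = rng.uniform(size=(n, p))
m = lambda X: 2 * X[:, 0] + np.sin(2 * np.pi * X[:, 1])    # sparse regression function
y = m(X) + rng.normal(scale=0.2, size=n)

X_test = rng.uniform(size=(2000, p))
truth = m(X_test)

rf = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0).fit(X, y)
knn = KNeighborsRegressor(n_neighbors=20).fit(X, y)

for name, model in [("random forest", rf), ("k-NN", knn)]:
    mse = np.mean((model.predict(X_test) - truth) ** 2)
    print(name, round(float(mse), 3))   # the forest typically adapts far better to the sparsity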
There have been several important recent results on the consistency of the random forest algorithm and its variants. However, almost all results concerning the consistency of the random forest algorithm in high-dimensional settings have analyzed modified versions of it. Often these modified versions use splitting rules that do not consider the response when choosing how to partition the feature space. One exception is [12], in which only binary features are considered.

In light of this, in Chapter 2 we derive the consistency rates of the random forest algorithm with trees split using the sample CART-splitting criterion from [8] in a general high-dimensional nonparametric regression setting through a bias-variance decomposition analysis. We provide a new technical analysis for polynomially growing dimensionality through natural regularity conditions that characterize the intrinsic learning behavior of the random forest method at the population level. In summary, our new theoretical results demonstrate that the random forest method can adapt to high dimensionality and allow for a discontinuous regression function. Our results provide the theoretical justification for what practitioners have observed empirically: the random forest algorithm is a flexible nonparametric learning tool, even in high-dimensional settings.

1.2 Ensemble Estimators

In statistics and machine learning, ensemble methods combine the predictions of multiple learners to obtain better predictive performance. The success of ensemble algorithms on many benchmark data sets has motivated researchers to explore the reasons for the success of such methods. In fact, the generalizability of an ensemble can be significantly better than that of a single predictor. See [13, 14] for an overview of ensemble methods.

1.2.1 Bagging and subagging

Bagging (bootstrap aggregating) [15] and subagging (subsample aggregating) [16] are some of the most straightforward methods for combining estimators to improve their performance. For regression problems, bagging (subagging) draws resamples (subsamples) from the original data set, constructs an estimate from each resample (subsample), and produces a final estimate by aggregating the resampled (subsampled) estimates. Though computationally intensive, these methods are effective for improving unstable estimates, especially for high-dimensional data sets. Thus, numerous recent theoretical contributions have been made to bagging and its variants [16–21].

Breiman's bagging procedure has a simple application in the case of nearest neighbor methods. Nearest neighbor methods are some of the oldest approaches to regression and classification problems [22, 23]. Despite their relative simplicity, they have convenient theoretical properties and demonstrate strong empirical performance. Furthermore, implementing these methods only requires a measure of distance in the sample space and training data; hence, these methods are commonly used as a building block for more complex methods.

1.2.2 Bagging and nearest neighbors

The 1-NN regression estimate is as simple as it sounds. For a test point x, the 1-NN estimate is m(x) = Y_{(1)}(x), where Y_{(1)}(x) is the response of the observation that has the smallest Euclidean distance to x among all observations X_1, ..., X_n. By itself, this estimator is inconsistent [24]; however, using bagging, one can transform the 1-NN estimator into a consistent one, provided that the resample size is sufficiently small [21]. We let m_s(x) be the 1-NN regression estimate for a random subsample of size s drawn with or without replacement from {(X_i, Y_i)}_{i=1}^n.
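A minimal sketch of this subsampled 1-NN estimate m_s(x), together with a finite-B average that approximates the bagged estimate defined in the next paragraph, is given below; all function names and settings are hypothetical illustrations:

import numpy as np

def one_nn(x, X, y):
    """Plain 1-NN regression estimate at a single test point x."""
    return y[np.argmin(np.sum((X - x) ** 2, axis=1))]

def bagged_one_nn(x, X, y, s, B=500, rng=None):
    """Average of 1-NN estimates over B random subsamples of size s, a finite-B
    approximation of the bagged estimate E(m_s(x)) defined in the text."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    preds = []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)   # subagging: draw without replacement
        preds.append(one_nn(x, X[idx], y[idx]))
    return float(np.mean(preds))

# toy usage
rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=500)
x0 = np.array([0.3, 0.6])
print(one_nn(x0, X, y), bagged_one_nn(x0, X, y, s=50, rng=rng))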
We obtain the bagged estimate by repeating the random sampling an infinite number of times and taking the average of the individual estimates. Thus, the bagged regression estimate \hat{m}(x) is defined as

\hat{m}(x) = E(m_s(x)),

where the expectation is with respect to the resampling distribution, conditional on the observed data. The work of [20] demonstrated that if s -> ∞ and s/n -> 0, then the bagged version of the 1-NN regression estimate is universally consistent. Furthermore, [21, 25] found that the bagged 1-NN regression estimate achieves the optimal nonparametric rate of convergence, assuming Lipschitz continuity of the regression function.

1.2.3 Distributional nearest neighbors

The critical insight behind these results is that we can recast the bagged 1-NN estimator as a weighted nearest neighbor regression estimator. We refer to this version of the estimator as the distributional nearest neighbor (DNN) estimator because it automatically assigns monotonic weights to the nearest neighbors in a distributional fashion over the entire sample. Despite these significant results, characterizing the asymptotic distribution of the DNN estimator has not been attempted. Additionally, the DNN estimator fails to achieve the optimal nonparametric rate when the regression function satisfies higher-order smoothness conditions.

In Chapter 3, we provide an in-depth analysis of the slow convergence rate of DNN under high-order smoothness assumptions and determine that its asymptotic bias is the cause of its slow convergence rate. To address this issue, we propose eliminating the first-order bias of DNN by linearly combining two DNN estimators with different subsampling scales, resulting in the TDNN estimator. We show that the TDNN estimator can also be represented as a weighted nearest neighbor estimator, one that assigns negative weights to some observations. The use of negative weights enables the optimal nonparametric convergence rate under the higher-order smoothness assumption on the regression function. Furthermore, we establish that DNN and TDNN are asymptotically normal as the subsampling scales and sample size diverge to infinity. To facilitate the practical implementation of our method, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be used to construct valid confidence intervals for the nonparametric inference of the regression function.

Chapter 2
High-Dimensional Random Forest

2.1 Introduction

Since its introduction by [8, 9], the random forest algorithm has become a fixture in applied statistics [26, 27], in part because it achieves impressive "off-the-shelf" empirical performance despite its relatively simple structure. The forest builds many independent decision trees using the training sample and outputs the average of the predictions of individual trees as the prediction at a test point. As a result, statisticians use the random forest algorithm in applications, including those involving high-dimensional data, such as finance [28], bioinformatics [10, 29], and multisource remote sensing [30]. In addition to its success in prediction and classification, the random forest method has also been adapted to other statistical applications, such as feature selection with importance measures [31, 32], heterogeneous treatment effect estimation [33], and survival analysis [34, 35] (see, e.g., [36] for a recent overview of different applications of the random forest algorithm).

Despite its lauded empirical performance, explaining why the random forest method performs so well from a theoretical perspective has been challenging.
There is a relatively limited but important line of recent work on the consistency of the random forest algorithm. Some earlier consistency results in [35, 37–40] usually considered certain simplified versions of the original random forest algorithm, where the splitting rules are assumed to be independent of the response. Recently, [41] made an important contribution to the consistency of the original version of the random forest algorithm in the classical setting of a fixed-dimensional ambient feature space. As mentioned before, the random forest method can deal with high-dimensional feature spaces with promising empirical performance. To better understand such a phenomenon, [37, 42] established consistency results for simplified versions of the random forest algorithm where the convergence rates depend on the number of informative features in sparse models. In a special case where all features are binary, [12] derived high-dimensional consistency rates for the random forest algorithm without column subsampling. Additional results along this line include pointwise consistency [42, 43], asymptotic distributions [43], and confidence intervals for predictions [44].

However, even with the considerable advancements in our understanding of the empirical success of the random forest algorithm, a gap remains between the version of the random forest algorithm used in practice and the version used for theoretical analysis. In this chapter, we characterize the consistency rate for the original algorithm (in this thesis, the original version of the random forest algorithm refers to the algorithm that 1) grows trees using Breiman's CART [8] and 2) uses column subsampling, the critical distinction between the random forest algorithm [8] and bagging [15]) in a general high-dimensional nonparametric regression setting, which is the first result of its kind in the literature to the best of our knowledge.

Despite the existing theory for the random forest algorithm, it remains largely unclear how to characterize the consistency rate for its original version in a general high-dimensional nonparametric regression setting. Our main contribution is to characterize such a consistency rate for the random forest algorithm with non-fully grown trees. To this end, we introduce a new condition, the sufficient impurity decrease (SID), on the underlying regression function and feature distribution to facilitate our technical analysis. Assuming regularity conditions and SID, we show that the random forest estimator can be consistent with a rate of polynomial order in the sample size, provided that the feature dimensionality increases at most polynomially with the sample size. To the best of our knowledge, this is the first result on the high-dimensional consistency rate for the original version of the random forest algorithm.

Due to the bias-variance decomposition of the prediction loss, we separately establish upper bounds on the squared bias (i.e., the approximation error) and the variance. Our bias analysis reveals a novel and interesting understanding of how the bias depends on the sample size, column subsampling parameter, and forest height. The latter two are the most important model complexity parameters. We also establish the convergence rate of the estimation variance, which is less precise than our bias results in characterizing the effects of the model complexity parameters. Such a limitation is primarily due to technical challenges.

The SID, formally defined in Condition 1, is a key assumption for obtaining the desired consistency rates.
We discover that, conditional on each node in the feature space, if there exists one global split of the node such that a sufficient amount of the estimation bias can be reduced (we use the terms "impurity," "bias," and "approximation error" interchangeably in this chapter), then the desired convergence rate results follow. This finding greatly helps us understand how the random forest method controls the bias.

The SID condition can accommodate discontinuous regression functions and dependent covariates, making our results applicable to numerous applications. The SID condition is new to the literature, and a concurrent work [12] exploits it independently in a simpler setting with all binary features. At a high level, the idea for SID comes from a concept frequently used in the context of tree models, called impurity decrease. Impurity decrease typically measures the importance of individual features; however, we use it in this chapter to prove the consistency of the random forest estimator. A related well-known feature importance measure is the mean decrease impurity [45], which evaluates the importance of a feature by calculating how much variance of the response can be reduced by including that feature in the training of the random forest algorithm. Details on the mean decrease impurity measure and other feature importance measures are presented in [45, 46].

In terms of technical innovations, we precisely characterize how column subsampling affects the bias by combining the SID condition with a global view of the biases from all individual trees in the random forest. This global approach is one of our major technical innovations in obtaining a high-dimensional consistency rate for the random forest estimator. Our variance analysis of the forests uses a "grid" discretization approach, which is also new to the random forest literature. Despite the technical innovation in our variance analysis, we are forced to take a local view and bound the random forest variance by establishing a variance bound for the individual trees. A weakness of this local approach is that the resulting variance upper bound is the worst-case bound among all column subsampling parameters and, hence, is less precise. Establishing global control of the variance of the random forest method remains an open question.

In Table 2.1, we provide a comparison of our consistency theory with some closely related results in the literature. The consistency of the original version of the random forest algorithm was first investigated in the seminal work [41] under the setting of a continuous additive regression function and independent covariates with fixed dimensionality, and no explicit rate of convergence was provided. Their results are for the random forest method with fully grown trees, where each terminal node contains exactly one data point, and demonstrate the importance of row subsampling for achieving consistency.

Two commonly studied variants of the random forest algorithm are the centered random forest and the Mondrian random forest. The centered random forest uses a splitting rule which uniformly selects a feature from the set of available features and performs splits at the center of the node along the prechosen attribute. In a fixed-dimensional feature space, [37] derived a rate of convergence which depends on the number of relevant features, assuming a Lipschitz continuous regression function. Recently, [42] proved improved consistency rates for the centered random forest compared to [37]. In addition, [47] established the minimax rate of convergence for another variant of the random forest method, the Mondrian random forest.
The Mondrian random forest uses trees that are partitioned by draws from a Mondrian random process [48]; the rate in [47] was established under the assumptions of fixed dimensionality, a Hölder continuous regression function, and dependent covariates. A fundamental difference between the original random forest method and these variants is that the original version uses the response to guide the splits, while these variants do not.

Table 2.1: Comparison of the properties of recent consistency rate results for the random forest algorithm.

             p >> n   Consistency rate   Conditions                                                           Original algorithm
  Our work   Yes      Yes                The SID assumption (Section 2.3.1)                                   Yes
  [41]       No       No                 Independent covariates and continuous additive regression function   Yes
  [42]       No       Yes                Dependent covariates and Lipschitz continuous function               No
  [47]       No       Yes                Dependent covariates and Hölder continuous functions                 No

The rest of the chapter is organized as follows. Sections 2.2.1 to 2.2.2 introduce the model setting and the random forest algorithm. Section 2.2.3 provides a roadmap of the bias-variance decomposition analysis of the random forest, which is fundamental to our main results. Then we present SID in Section 2.3.1 with examples justifying its usefulness. The main results on the consistency rates are provided in Section 2.3.2. To further appreciate our main results, we provide consistency rates under a simple example with binary features in Section 2.3.3; there, with restrictive model assumptions, we derive sharper convergence rates. In addition to motivating SID through several examples, in Section 2.3.4 we discuss the relationships between SID and the relevance of active features. In Section 2.3.5, we compare our results to recent related work. We detail our analysis of the approximation error and the estimation variance in Sections 2.4 and 2.5, respectively. Section 2.6 discusses some implications and extensions of our work. All proofs and technical details are provided in Appendix A.

2.1.1 Notation

To facilitate the technical presentation, we first introduce the notation used throughout the chapter. We let (Ω, F, P) be the underlying probability space. Further, a_n = o(b_n) if lim_n a_n / b_n = 0 for some real a_n and b_n, and a_n = O(b_n) if limsup_n |a_n| / |b_n| < ∞. The number of elements in a set S is denoted as #S, and for an interval t, we define |t| := sup t - inf t. When a summation is over an empty set, we define its value as zero, and we define 0^0 = 0. For simplicity, we frequently represent a sequence of elements A_1, ..., A_k as A_{1:k}. Unless otherwise noted, all logarithms used in this chapter are logarithms with base 2.

2.2 Random Forest

2.2.1 Model setting and the random forest algorithm

The measurable nonparametric regression function with p-dimensional random vector X taking values in [0,1]^p is denoted by m(X). The random forest aims to learn the regression function nonparametrically based on the observations x_i ∈ [0,1]^p, y_i ∈ R, i = 1, ..., n, from the nonparametric regression model

y_i = m(x_i) + ε_i,   (2.1)

where X, x_i, and ε_i, i = 1, ..., n, are independent, and {x_i} and {ε_i} are two sequences of identically distributed random variables. In addition, x_1 is distributed identically as X.

[Figure 2.1: Diagram of a Level 2 (Height 3) tree. The root node t_0 := [0,1]^p at Level 0 is split by (j_1 ∈ Θ_{1,1}, c_1) into nodes t_{1,1} and t_{1,2}; these are in turn split by (j_2 ∈ Θ_{2,1}, c_2) and (j_3 ∈ Θ_{2,2}, c_3) into nodes t_{2,1}, t_{2,2} and t_{2,3}, t_{2,4}, respectively. Each split node records the point at which the current node is split to produce new nodes. The sets of features eligible for splitting nodes at level k-1 are denoted as Θ_k := {Θ_{k,1}, ..., Θ_{k,2^{k-1}}} with Θ_{k,s} ⊂ {1, ..., p}.]
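The recursive structure depicted in Figure 2.1 is easy to mirror in code. The following minimal sketch, a hypothetical representation rather than the notation used in the proofs, stores a node as a list of per-coordinate intervals and applies splits of the form (j, c), as formalized in the paragraphs below:

# A node t = t_1 x ... x t_p is stored as a list of (low, high) intervals in [0, 1].
def root_node(p):
    return [(0.0, 1.0)] * p

def split_node(node, j, c):
    """Split a node along coordinate j at point c, returning the two daughter nodes
    {x_j < c} and {x_j >= c}, in the spirit of the (j, c) split notation of Section 2.2.1."""
    low, high = node[j]
    left, right = list(node), list(node)
    left[j] = (low, c)
    right[j] = (c, high)
    return left, right

# Growing the level-2 tree of Figure 2.1: split the root, then each daughter once.
t0 = root_node(p=3)
t11, t12 = split_node(t0, j=0, c=0.4)
t21, t22 = split_node(t11, j=1, c=0.7)
t23, t24 = split_node(t12, j=2, c=0.5)
print(t21, t24)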
In the following paragraphs, we introduce our random forest estimates and briefly review how a tree algorithm grows a tree using a given splitting criterion. The algorithm recursively partitions the root node, using the splitting criterion to determine where to split each node. This split involves two components: the direction or feature j to split on and the feature value c to split at. In addition, the algorithm restricts the set of available features used to determine the direction of each split. This procedure repeats until the tree height reaches a predetermined level, and the last grown nodes are called the end nodes of the tree.

Next, we introduce the notation for the structure of a tree and its nodes. A node is defined as a rectangle t = ×_{j=1}^p t_j := t_1 × ··· × t_p, where × denotes the Cartesian product and each t_j is a closed or half-closed interval in [0,1]. For a node t, its two daughter nodes, t_1 × ··· × t_{j-1} × (t_j ∩ [0, c)) × t_{j+1} × ··· × t_p and t_1 × ··· × t_{j-1} × (t_j ∩ [c, 1]) × t_{j+1} × ··· × t_p, are obtained by splitting t according to the split (j, c) with direction j ∈ {1, ..., p} and point c ∈ t_j. We use Θ_k := {Θ_{k,1}, ..., Θ_{k,2^{k-1}}} to denote the sets of available features for the 2^{k-1} splits at level k-1 that grow the 2^k nodes at level k of the tree. Figure 2.1 presents a graphical example. A split is also referred to as a cut, and t(j, c) denotes one of the daughter nodes of t after the split (j, c).

Among the many existing splitting criteria, we are particularly interested in analyzing the statistical properties of the CART-split criterion [8, 9] used in the original random forest algorithm and formally introduced in Section 2.2.2. In this chapter, we use deterministic splitting criteria to characterize the statistical properties of CART splits. A deterministic splitting criterion provides a split for a given node t and set of available features Θ ⊂ {1, ..., p} without being subject to the variation of the observed sample (x_i, y_i). An important example of a deterministic splitting criterion, known as the theoretical CART-split criterion in the literature [41, 46], is introduced in Section 2.3.1. The splits made by the theoretical CART-split criterion are not affected by the observed sample. Some other deterministic splitting criteria are introduced in Section 2.4.1. Next, we introduce the notation for our technical analysis.

In light of how a tree algorithm grows trees, given any (deterministic) splitting criterion and a set of Θ_{1:k}, we can grow all nodes in a tree at each level until level k. We introduce a tree-growing rule for recording these nodes. A tree-growing rule denoted as T is associated with this splitting criterion, and given Θ_{1:k}, T(Θ_{1:k}) denotes the collection of all sequences of nodes connecting the root node to the end nodes at level k of this tree. Precisely, for each end node of this tree, we can list a unique k-dimensional tuple of nodes that connects the root node t_0 := [0,1]^p to this end node. These tuples of nodes can be thought of as "tree branches" that trace down from the root node. An example of a tree branch in Figure 2.1 is (t_{1,1}, t_{2,2}), which connects the root node to the end node t_{2,2}. In particular, the collection T(Θ_{1:k}) contains 2^k such k-dimensional tuples of nodes.

Given any T (and hence the associated splitting criterion) and Θ_{1:k}, the tree estimate denoted as \hat{m}_{T(Θ_{1:k})} for a test point c ∈ [0,1]^p is defined as

\hat{m}_{T(\Theta_{1:k})}(c, \mathcal{X}_n) := \sum_{(t_1, \ldots, t_k) \in T(\Theta_{1:k})} \mathbf{1}_{c \in t_k} \frac{\sum_{i \in \{i:\, x_i \in t_k\}} y_i}{\#\{i:\, x_i \in t_k\}},   (2.2)

where \mathcal{X}_n := \{x_i, y_i\}_{i=1}^n.
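A small sketch of how (2.2) is evaluated at a test point for a given collection of end nodes may be helpful; the representation is hypothetical, and the empty-node and indicator conventions it follows are spelled out in the next paragraph:

import numpy as np

def in_node(x, node):
    """Membership of a point in a node stored as per-coordinate (low, high) intervals."""
    return all(low <= xj <= high for xj, (low, high) in zip(x, node))

def tree_estimate(c, end_nodes, X, y):
    """Evaluate (2.2): find the end node containing the test point c and
    average the responses of the training points falling in that node."""
    for node in end_nodes:
        if in_node(c, node):
            mask = np.array([in_node(xi, node) for xi in X])
            return float(y[mask].mean()) if mask.any() else 0.0  # empty node -> 0
    return 0.0  # unreachable when the end nodes partition [0,1]^p

# toy usage: a two-cell partition of [0,1]^2 split on the first feature at 0.5
end_nodes = [[(0.0, 0.5), (0.0, 1.0)], [(0.5, 1.0), (0.0, 1.0)]]
rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] >= 0.5).astype(float) + rng.normal(scale=0.1, size=200)
print(tree_estimate(np.array([0.7, 0.3]), end_nodes, X, y))  # close to 1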
Moreover, the fraction is defined as zero when no sample is in the node t_k, and 1_{c ∈ t_k} is an indicator function taking the value 1 if c ∈ t_k and 0 otherwise. In (2.2), each test point in [0,1]^p belongs to one end node because {t_k : (t_1, ..., t_k) ∈ T(Θ_{1:k})} for each integer k > 0 is a partition of [0,1]^p.

Now that we have introduced individual trees, we discuss how to combine trees to form a forest. As mentioned, only a subset of all features is used to determine a split for each node of a tree. Every set of available features Θ_{l,s}, l = 1, ..., k, s = 1, ..., 2^{l-1}, in this chapter has ⌈γ_0 p⌉ distinct integers among 1, ..., p, with ⌈·⌉ the ceiling function, for some predetermined constant parameter 0 < γ_0 ≤ 1. The default parameter value γ_0 = 1/3 is used in most implementations of the random forest algorithm. Given a growing rule T, each sequence of sets of available features Θ_{1:k} can be employed to grow a level k tree, and a sequence of distinct Θ_{1:k} results in a distinct tree. Given k, p, and γ_0, the forests considered in this chapter consist of all possible distinct trees, in the sense that all possible sets of available features Θ_{1:k} are considered.

The prediction of a random forest is the average of the predictions of all tree models in the forest. For a precise definition, we introduce boldface random mappings Θ_{1:k}, which are independent and uniformly distributed over all possible Θ_{1:k} for each integer k. The random forest estimate for c with the observations X_n is given by

E\big(\hat{m}_{T(\boldsymbol{\Theta}_{1:k})}(c, \mathcal{X}_n) \mid \mathcal{X}_n\big) = \sum_{\Theta_{1:k}} P\big(\cap_{s=1}^k \{\boldsymbol{\Theta}_s = \Theta_s\}\big) \, \hat{m}_{T(\Theta_{1:k})}(c, \mathcal{X}_n).   (2.3)

That is, we take the expectation over the sets of available features. (For clarity, given the tree height k+1, the number of features p, and γ_0, the number of distinct Θ_{1:k} is \binom{p}{\lceil \gamma_0 p \rceil}^{2^k - 1}; moreover, for each set of available features, P(\cap_{s=1}^k \{\boldsymbol{\Theta}_s = \Theta_s\}) = \binom{p}{\lceil \gamma_0 p \rceil}^{1 - 2^k}. When γ_0 = 1, the expectation is redundant.) This step of tree aggregation is called column subsampling. In practice, column subsampling can be used in more than one way; in this work, we use all possible trees in the tree aggregation for column subsampling in (2.3).

Remark 1. In contrast to the conventional notation for estimates, the notation for forest estimates in (2.3) uses the conditional expectation. Alternatively, we can define the finite discrete parameter space Q and write (2.3) as (#Q)^{-1} \sum_{\Theta \in Q} \hat{m}_{T(\Theta)}(c, \mathcal{X}_n), where Θ denotes a sequence of sets of available features. However, we choose to use the definition in (2.3), as the exact definition of the discrete parameter space and the cardinality of this space are not strictly relevant to our technical analysis. We define other forest estimates similarly.

Remark 2. We use the conditional expectation for a compact notation for random forest estimates. However, by definition, for the conditional expectation to exist, the first moment of the integrand must exist. Rigorously, regularity conditions are necessary for this purpose (for details, see Section A.3.1 in Appendix A).

In addition to column subsampling, the random forest algorithm also resamples observations for making predictions. We let A = {a_1, ..., a_B} be a set of subsamples, with each a_i consisting of ⌈bn⌉ observations (indices) drawn without replacement from {1, ..., n}, for some positive integer B and 0 < b ≤ 1. In addition, each a_i is independent of the model training. The default values of the parameters B and b are 500 and 0.632, respectively, in the randomForest R package [49] (another default setup sets b = 1 but draws observations with replacement). The tree estimate using subsample a is defined as

\hat{m}_{T(\Theta_{1:k}), a}(c, \mathcal{X}_n) := \sum_{(t_1, \ldots, t_k) \in T(\Theta_{1:k})} \mathbf{1}_{c \in t_k} \frac{\sum_{i \in a \cap \{i:\, x_i \in t_k\}} y_i}{\#\big(a \cap \{i:\, x_i \in t_k\}\big)}.   (2.4)
Then, the random forest estimate given A is defined as

B^{-1} \sum_{a \in A} E\big(\hat{m}_{T, a}(\boldsymbol{\Theta}_{1:k}, c, \mathcal{X}_n) \mid \mathcal{X}_n\big) := B^{-1} \sum_{a \in A} E\big(\hat{m}_{T(\boldsymbol{\Theta}_{1:k}), a}(c, \mathcal{X}_n) \mid \mathcal{X}_n\big),   (2.5)

where the boldface notation moves up into the parentheses for simplicity, and the conditional expectation above is with respect to Θ_{1:k}. (In particular, for independent random variables such as Θ_k, we use an expectation with a subscript to indicate which variables the expectation is with respect to, which is equivalent to the expectation conditional on all other variables. We use conditional expectations to align the expressions with those in the technical proofs, where we repeatedly manipulate conditional expectations.)

The estimate in (2.5) is an abstract random forest estimate, as a generic tree-growing rule T is used. The benefit of using abstract random forest estimates can be observed in Theorem 3 in Section 2.4.1 for analyzing the bias of the random forest. In addition to abstract random forest estimates, we consider the sample random forest estimate introduced in (2.7) in Section 2.2.2. For simplicity, we refer to both versions as random forest estimates unless the distinction is necessary.

2.2.2 CART-split criterion

Given a node t, a subset of observation indices a, and a set of available features Θ ⊂ {1, ..., p}, the CART split is defined as

(\hat{j}, \hat{c}) := \operatorname*{argmin}_{j \in \Theta, \; c \in \{x_{ij}:\, x_i \in t,\, i \in a\}} \left[ \sum_{i \in a \cap P_L} (\bar{y}_L - y_i)^2 + \sum_{i \in a \cap P_R} (\bar{y}_R - y_i)^2 \right],   (2.6)

where P_L := \{i:\, x_i \in t,\, x_{ij} < c\}, P_R := \{i:\, x_i \in t,\, x_{ij} \ge c\}, and

\bar{y}_L := \frac{\sum_{i \in a \cap P_L} y_i}{\#(a \cap P_L)}, \qquad \bar{y}_R := \frac{\sum_{i \in a \cap P_R} y_i}{\#(a \cap P_R)}.

The criterion breaks ties randomly. To simplify the analysis, we assume that the criterion splits at a random point if \#\{x_{ij}:\, x_i \in t,\, i \in a\} \le 1, a situation where both summations in (2.6) are zero (the summation over an empty index set is defined as zero). Furthermore, we define the criterion such that every split results in two nonempty daughter nodes (an empty node has zero volume); this statement can be made rigorous with a more sophisticated definition of CART splits, but we omit the details for simplicity. Given a sample, the CART-split criterion conditional on the sample is a deterministic (except for random splits due to ties) splitting criterion, and conditioning on another sample leads to another deterministic splitting criterion.

We define \hat{T}_a as the sample tree-growing rule associated with a splitting criterion following (2.6). In (2.2) and (2.4), we introduced tree estimates based on the tree-growing rules associated with deterministic splitting criteria. The tree estimates using \hat{T}_a can be defined similarly because the sample tree-growing rule is a deterministic tree-growing rule when the sample X_n is given. Specifically, we have

\hat{m}_{\hat{T}_a(\Theta_{1:k})}(c, \mathcal{X}_n) := \sum_{(t_1, \ldots, t_k) \in \hat{T}_a(\Theta_{1:k})} \mathbf{1}_{c \in t_k} \frac{\sum_{i \in \{i:\, x_i \in t_k\}} y_i}{\#\{i:\, x_i \in t_k\}},   (2.7)

and the definition is the same for \hat{m}_{\hat{T}_a, a}. (The notation in (2.7) is unsuitable for such cases as honest trees [43], where the samples used for growing trees and for prediction are different.) Hence, the random forest estimate for a test point c ∈ [0,1]^p is given by

B^{-1} \sum_{a \in A} E\big(\hat{m}_{\hat{T}_a, a}(\boldsymbol{\Theta}_{1:k}, c, \mathcal{X}_n) \mid \mathcal{X}_n\big),   (2.8)

where the average and the conditional expectation correspond to the sample subsampling and the column subsampling, respectively. The average and the conditional expectation are interchangeable.

2.2.3 Roadmap for the bias-variance decomposition analysis

We are now ready to introduce the bias-variance decomposition analysis for the random forest algorithm, whose L_2 prediction loss is defined as

E\Big( m(X) - B^{-1} \sum_{a \in A} E\big(\hat{m}_{\hat{T}_a, a}(\boldsymbol{\Theta}_{1:k}, X, \mathcal{X}_n) \mid X, \mathcal{X}_n\big) \Big)^2.   (2.9)
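Before turning to the decomposition of (2.9), a toy end-to-end sketch may help fix ideas: it grows depth-limited trees with the sample CART split (2.6) on row subsamples and random feature subsets, and averages their predictions in the spirit of (2.8). This is a simplified illustration under hypothetical settings, not the exact estimator analyzed in this chapter:

import numpy as np

def cart_split(X, y, rows, features):
    """Sample CART split (2.6): over the available features j and observed split points c,
    minimize the within-daughter sums of squared deviations."""
    best = (None, None, np.inf)
    for j in features:
        for c in np.unique(X[rows, j]):
            left, right = rows[X[rows, j] < c], rows[X[rows, j] >= c]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((y[left] - y[left].mean()) ** 2).sum() + ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, c, loss)
    return best[0], best[1]

def grow_tree(X, y, rows, depth, mtry, rng):
    """Depth-limited CART tree; each split sees a random subset of mtry features
    (column subsampling). Returns a prediction function for a single test point."""
    if depth == 0 or len(rows) <= 1:
        value = y[rows].mean() if len(rows) else 0.0
        return lambda x: value
    features = rng.choice(X.shape[1], size=mtry, replace=False)
    j, c = cart_split(X, y, rows, features)
    if j is None:
        value = y[rows].mean()
        return lambda x: value
    left = grow_tree(X, y, rows[X[rows, j] < c], depth - 1, mtry, rng)
    right = grow_tree(X, y, rows[X[rows, j] >= c], depth - 1, mtry, rng)
    return lambda x: left(x) if x[j] < c else right(x)

def forest_predict(x, X, y, n_trees=50, depth=3, mtry=None, b=0.632, rng=None):
    """Average tree predictions over row subsamples and feature subsets, as in (2.8)."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    mtry = mtry or max(1, p // 3)          # gamma_0 = 1/3 as a default choice
    preds = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=int(np.ceil(b * n)), replace=False)
        preds.append(grow_tree(X, y, rows, depth, mtry, rng)(x))
    return float(np.mean(preds))

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 10))
y = (X[:, 0] >= 0.5).astype(float) + rng.normal(scale=0.1, size=200)
print(forest_predict(np.array([0.8] + [0.5] * 9), X, y, mtry=5, rng=rng))  # pulled toward the true value 1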
We define some notation for further illustration. For a tree-growing rule T and Θ_{1:k}, the population version of (2.2) is defined as

m^*_{T(\Theta_{1:k})}(c) := \sum_{(t_1, \ldots, t_k) \in T(\Theta_{1:k})} \mathbf{1}_{c \in t_k} E\big(m(X) \mid X \in t_k\big)   (2.10)

for each test point c ∈ [0,1]^p. For simplicity, we use m^*_T(Θ_{1:k}, c) to denote the left-hand side (LHS) of (2.10). The population estimate (2.10) can also be employed with the sample tree-growing rule, and m^*_{\hat{T}_a} is similarly defined with \hat{T}_a in place of T in (2.10). To simplify the notation, we temporarily consider the case that uses the full sample a = {1, ..., n}, where \hat{T}_a and \hat{m}_{\hat{T}_a, a} are denoted as \hat{T} and \hat{m}_{\hat{T}}, respectively. Because we use the full sample, the sample subsampling and the average B^{-1} \sum_{a \in A}(\cdot) in the random forest estimate in (2.9) are no longer needed. Thus, (2.9) becomes

E\big( m(X) - E(\hat{m}_{\hat{T}}(\boldsymbol{\Theta}_{1:k}, X, \mathcal{X}_n) \mid X, \mathcal{X}_n) \big)^2.

By Jensen's inequality (specifically, the conditional Jensen's inequality; for simplicity, we omit "conditional" unless otherwise necessary) and the Cauchy–Schwarz inequality, we can deduce that

\frac{1}{2} E\big( m(X) - E(\hat{m}_{\hat{T}}(\boldsymbol{\Theta}_{1:k}, X, \mathcal{X}_n) \mid X, \mathcal{X}_n) \big)^2 \le E\big( m(X) - m^*_{\hat{T}}(\boldsymbol{\Theta}_{1:k}, X) \big)^2 + E\big( m^*_{\hat{T}}(\boldsymbol{\Theta}_{1:k}, X) - \hat{m}_{\hat{T}}(\boldsymbol{\Theta}_{1:k}, X, \mathcal{X}_n) \big)^2,   (2.11)

where the right-hand side (RHS) is a summation of the approximation error (the first term, also referred to as the squared bias) and the estimation variance (the second term). Details on deriving (2.11) are found in Section A.3 of Appendix A. The first term on the RHS of (2.11) is referred to as the approximation error. By the definition of m^*_T in (2.10), it holds that for any tree-growing rule T and each Θ_{1:k}, on \cap_{l=1}^k \{\boldsymbol{\Theta}_l = \Theta_l\},

E\Big( \big( m(X) - m^*_T(\boldsymbol{\Theta}_{1:k}, X) \big)^2 \,\Big|\, \boldsymbol{\Theta}_{1:k} = \Theta_{1:k} \Big) = \sum_{(t_1, \ldots, t_k) \in T(\Theta_{1:k})} P(X \in t_k) \, \mathrm{Var}\big(m(X) \mid X \in t_k\big).   (2.12)

The RHS of (2.12) is the average approximation error resulting from L_2-approximating m(X) by the class of step functions \{f(X): f(X) = \sum_{(t_1, \ldots, t_k) \in T(\Theta_{1:k})} c(t_k) \mathbf{1}_{X \in t_k}, \; c(t_k) \in \mathbb{R}\}. We observe that the approximation error in (2.11) is also subject to sample variation because sample CART splits are used to build the trees.

In Section 2.3.2, we obtain the desired convergence rates for random forest consistency by bounding the two terms in (2.11), and we introduce these results in Theorem 1 and Corollary 1. In Section 2.4, we analyze and establish an upper bound for the average approximation error in Lemma 1. Next, the estimation variance term is analyzed in Section 2.5, where we introduce the estimation foundation for the high-dimensional random forest to establish the convergence rate for the estimation variance in Lemma 2. Finally, we note that γ_0 is a predetermined constant parameter, and we do not specify the value of γ_0 when it is not directly relevant to the results (e.g., theorems or lemmas).
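As a small numerical illustration of the approximation-error formula (2.12), the sketch below Monte Carlo-evaluates the right-hand side for a hypothetical regression function and fixed axis-aligned partitions of increasing resolution; the function and partitions are illustrative assumptions only:

import numpy as np

rng = np.random.default_rng(6)
m = lambda X: np.sin(2 * np.pi * X[:, 0]) + X[:, 1]    # hypothetical regression function
X = rng.uniform(size=(100000, 2))                       # draws from the feature distribution

def approx_error(n_cells):
    """Monte Carlo estimate of (2.12) for a partition of [0,1]^2 into n_cells x n_cells
    axis-aligned cells: sum over cells of P(X in cell) * Var(m(X) | X in cell)."""
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    vals = m(X)
    total = 0.0
    for i in range(n_cells):
        for j in range(n_cells):
            in_cell = ((X[:, 0] >= edges[i]) & (X[:, 0] < edges[i + 1]) &
                       (X[:, 1] >= edges[j]) & (X[:, 1] < edges[j + 1]))
            if in_cell.any():
                total += in_cell.mean() * vals[in_cell].var()
    return total

for n_cells in [1, 2, 4, 8, 16]:
    print(n_cells, round(approx_error(n_cells), 4))     # finer partitions -> smaller bias term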
2.3 Main Results

2.3.1 Definitions and technical conditions

For a node t and its two daughter nodes t' and t'', we define

(I)_{t, t'} := P(X \in t' \mid X \in t) \, \mathrm{Var}(m(X) \mid X \in t') + P(X \in t'' \mid X \in t) \, \mathrm{Var}(m(X) \mid X \in t''),   (2.13)

and

(II)_{t, t'} := P(X \in t' \mid X \in t) \big( E(m(X) \mid X \in t') - E(m(X) \mid X \in t) \big)^2 + P(X \in t'' \mid X \in t) \big( E(m(X) \mid X \in t'') - E(m(X) \mid X \in t) \big)^2,   (2.14)

and we define (I)_{t, t''} and (II)_{t, t''} the same as (I)_{t, t'} and (II)_{t, t'}, respectively. To facilitate our technical analysis, we introduce some natural regularity conditions and their intuitions below.

Condition 1. There exists some α_1 ≥ 1 such that for each node t = t_1 × ··· × t_p,

\mathrm{Var}(m(X) \mid X \in t) \le \alpha_1 \sup_{j \in \{1, \ldots, p\},\; c \in t_j} (II)_{t, t(j,c)}.

Condition 2. The distribution of X has a density function f bounded away from 0 and ∞.

Condition 3. Assume model (2.1) and p = O(n^{K_0}) for some positive constant K_0. In addition, assume a symmetric distribution around 0 for ε_1 and E|ε_1|^q < ∞ for sufficiently large q > 0, whose value will be specified depending on the context.

Condition 4. Assume that \sup_{c \in [0,1]^p} |m(c)| \le M_0 for some M_0 > 0.

We refer to Condition 1 above as the sufficient impurity decrease, and it is new to the literature. Overall, Conditions 2–4 are basic assumptions in nonparametric regression models. In particular, our technical analysis allows for polynomially growing dimensionality p. The symmetric distribution of the model error is a technical assumption that can be relaxed.

The SID assumption introduced in Condition 1 plays a key role in our technical analysis, and we motivate the need for this condition as follows. Consider two tree models: f_1(X) = 1_{X ∈ t_0} E(m(X) | X ∈ t_0) and f_2(X) = 1_{X ∈ t'} E(m(X) | X ∈ t') + 1_{X ∈ t''} E(m(X) | X ∈ t''), where t' and t'' are the daughter nodes of t_0 after some split. The squared biases given these tree models are, respectively, E(m(X) - f_1(X))^2 = Var(m(X) | X ∈ t_0) and E(m(X) - f_2(X))^2 = (I)_{t_0, t'}. We observe that (I)_{t_0, t'} is the squared bias for approximating m(X) with f_2(X). As it is the squared bias remaining after the split on t_0, it is also called the remaining bias. Analogously, we extend the definition to an arbitrary node t and one of its daughter nodes t' and use (I)_{t, t'} to denote the "conditional remaining bias." The term (II)_{t, t'} and Var(m(X) | X ∈ t) are respectively called the "conditional bias decrease" (or conditional impurity decrease) and the "conditional total bias" because Var(m(X) | X ∈ t) = (I)_{t, t'} + (II)_{t, t'}.

Intuitively, having a large conditional bias decrease on each node is a desired property for achieving good control of the squared bias of the random forest estimate. This naturally motivates the SID condition. The SID only requires a nontrivial lower bound for the maximum conditional bias decrease, and the split (j^*, c^*) = \operatorname*{argsup}_{j \in \Theta,\; c \in t_j} (II)_{t, t(j,c)} with the column restriction Θ is the theoretical CART [41, 46, 50].

In Section 2.3.2, we establish the convergence rate for the random forest estimate with m(x) in the functional class

SID(α) := \{m(X): m(X) \text{ satisfies SID with } \alpha_1 \le \alpha\}.

The size of SID(α) is nondecreasing in α ≥ 1: if m(X) ∈ SID(α - c) for some α - c ≥ 1 and c > 0, then m(X) ∈ SID(α).
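The decomposition Var(m(X) | X ∈ t) = (I)_{t,t'} + (II)_{t,t'} used above is easy to check numerically. The following minimal sketch does so for a hypothetical univariate regression function and a single split of the root node (so the conditional probabilities reduce to unconditional ones):

import numpy as np

rng = np.random.default_rng(7)
m = lambda x: x ** 2                      # hypothetical regression function on [0, 1]
x = rng.uniform(size=500_000)             # X uniform on the root node t = [0, 1]
c = 0.3                                   # a candidate split point
left, right = x < c, x >= c

vals = m(x)
total_var = vals.var()                                                          # Var(m(X) | X in t)
term_I = left.mean() * vals[left].var() + right.mean() * vals[right].var()      # (2.13)
term_II = (left.mean() * (vals[left].mean() - vals.mean()) ** 2 +
           right.mean() * (vals[right].mean() - vals.mean()) ** 2)              # (2.14)
print(round(total_var, 5), round(term_I + term_II, 5))   # the two agree up to Monte Carlo error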
In Section 2.3.1.1, we verify that many popular regression functions canbelongtotheabovefunctionalclassandderivethecorrespondingvaluesofα. Theseexamples demonstrate that the SID condition can accommodate nonadditive and discontinuous regression functionsandallowsfordependentfeatures. InSection2.3.1.2,weillustrateanimportantrelation betweenSIDandmodelsparsity. 2.3.1.1 ExamplessatisfyingSID Example1. Considerm(X X X)=1 1 1 X 1 ∈[b,1] forsome0≤b≤1,and thedistributionofX X X isarbitrary. Then,m(X X X)∈SID(1). Example 2. Let X X X have a uniform distribution over [0,1] p and let 00,independentofthemodelcoefficients. Example3. LetX X X beuniformlydistributedover [0,1] p andconsiderm(X X X)with ( ∂m(z z z) ∂z 1 ,···, ∂m(z z z) ∂z p ) to be continuous in [0,1] p . In addition, 1) for each j∈ S ∗ , either M 1 ≤ ∂m(z z z) ∂z j ≤ M 2 for every z z z∈ [0,1] p or M 1 ≤− ∂m(z z z) ∂z j ≤ M 2 for every z z z∈ [0,1] p , where M 2 ≥ M 1 >0 are constants and S ∗ is a subset of{1,···,p}, and 2) for each j̸∈ S ∗ , ∂m(z z z) ∂z j = 0. Then, m(X X X)∈ SID(c(#S ∗ ) 2 ) for a constant c > 0 depending only on M 1 ,M 2 . Furthermore, if m(X X X) = ∑ s ∗ j=1 m j (X j ) for a positive integers ∗ ,thenm(X X X)∈SID(cs ∗ ). Example4. LetX X X be uniformly distributedover [0,1] p . Additionally,m(X X X)=∑ s ∗ j=1 m j (X j )where 1≤ s ∗ ≤ p is an integer. Suppose that for some c 0 > 0 and 1 2 <λ < 1, it holds that for every 1≤ j≤s ∗ andevery (a,b)⊂[0,1], sup x∈Λ(a,b) 1 x−a Z x a m j b−x x−a (z−a)+x −m j (z) dz 2 ≥c 0 Var(m j (X j )|X j ∈(a,b)), (2.15) whereΛ(a,b)=[λa+(1−λ)b,(1−λ)a+λb]. Then,m(X X X)∈SID(cs ∗ )forsomec>0depending onc 0 ,λ. Inparticular,(2.15)holdsifm j (z)isdifferentiableon[0,1],andforsomec 1 >0andevery (a,b)⊂[0,1], LHSof(2.15)≥c 1 sup z∈(a,b) |m ′ j (z)| 2 (b−a) 2 . (2.16) Example1confirmsthatSIDallowsfordependentfeatures. Example2demonstratesthatSID is satisfied in high-dimensional sparse quadratic models. Example 3 considers a general struc- ture for m(X X X) which can include the special cases of cumulative distribution, linear, logistic, and nonadditive polynomial functions. Example 4 provides sufficient conditions ensuring SID with uniform X X X under the sparse additive model setting. In particular, if m(X X X) in Examples 2–3 is also 25 additive,thenitcanbeverifiedtosatisfytherequirementsofExample4. Sufficientconditionssim- ilartothoseinExample4canbederivedfornonadditivemodelsbuttakemuchmorecomplicated forms; the additive structure in Example 4 is imposed for simplified forms of these conditions. More examples, including logistic regression functions, higher-order polynomial functions with interactions, additive piecewise linear models, and linear combinations of indicator functions of hyperrectangles, are found in Section A.2 of Appendix A. The proofs for these examples and the exampleinRemark3areinSectionsA.5.3–A.5.7ofAppendixA. Remark3. ThecoefficientrestrictionisnecessaryforSIDtoholdinExample2. Acounterexample violating the coefficient restriction of Example 2 is m(X X X) = X 1 X 2 −0.5X 1 −0.5X 2 +0.25 with uniformlydistributedX X X (seeAppendixAforaformalproofthatthisexampleviolatesSID).Section 5in[12]examinedtheinconsistencyoftherandomforestalgorithmundersimilarmodelsettings, suggestingthatcertaincoefficientrestrictionisnecessaryforsatisfactoryperformanceofCARTin these cases. Identifying the necessary condition for the consistency of the random forest method andstudyinghowfarSIDisfromsuchaconditionareopenquestionsforfuturestudy. 
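The population quantities behind these examples can also be explored numerically. The following sketch is a purely illustrative Monte Carlo check written by us (it is not part of the formal analysis and the function names are our own): it estimates the conditional impurity decrease $(\mathrm{II})_{t,t(j,c)}$ of (2.14) on a rectangular node over a grid of candidate splits, and reports the empirical ratio $\mathrm{Var}(m(X)\mid X\in t)/\sup_{j,c}(\mathrm{II})_{t,t(j,c)}$; for any single node, this ratio is a lower bound on any $\alpha_1$ for which $m(X)\in\mathrm{SID}(\alpha_1)$, and the split attaining the supremum is the theoretical CART split on that node.

import numpy as np

def impurity_decrease(X, m_vals, node, j, c):
    # Monte Carlo estimate of (II)_{t, t(j,c)} in (2.14): the probability-weighted
    # squared differences between the daughter-node means of m and the parent-node
    # mean, where the node t = [lower, upper) is split along coordinate j at c.
    lower, upper = node
    in_t = np.all((X >= lower) & (X < upper), axis=1)
    if not in_t.any():
        return 0.0
    mean_t = m_vals[in_t].mean()
    out = 0.0
    for side in (in_t & (X[:, j] < c), in_t & (X[:, j] >= c)):
        if side.any():
            out += (side.sum() / in_t.sum()) * (m_vals[side].mean() - mean_t) ** 2
    return out

def sid_ratio(X, m_vals, node, n_cuts=50):
    # Empirical version of Var(m(X) | X in t) / sup_{j,c} (II)_{t, t(j,c)} over a
    # finite grid of cutpoints; SID with parameter alpha_1 requires this ratio to
    # be at most alpha_1 at every node.
    lower, upper = node
    in_t = np.all((X >= lower) & (X < upper), axis=1)
    best = 0.0
    for j in range(X.shape[1]):
        for c in np.linspace(lower[j], upper[j], n_cuts + 2)[1:-1]:
            best = max(best, impurity_decrease(X, m_vals, node, j, c))
    return np.inf if best == 0.0 else m_vals[in_t].var() / best

# Example 1 with b = 0.4: at the root node the ratio is close to 1 (up to Monte
# Carlo and cut-grid error), consistent with m(X) belonging to SID(1).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200000, 3))
m_vals = (X[:, 0] >= 0.4).astype(float)
print(sid_ratio(X, m_vals, (np.zeros(3), np.ones(3))))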
2.3.1.2 SIDandmodelsparsity: sparsityparameterα 1 Asmallervalueofα 1 inSIDimpliesthattheoptimalsplitcanreducemoreimpurityintermsofthe conditional total bias given each node. In contrast, in sparse models, the examples in the previous section show that the required value ofα 1 for m(X X X)∈SID(α 1 ) is at least linearly proportional to the number of active features in m(X X X). These results echo the intuition on how CART works in reducing impurity: more active features implies that each split contributes less (proportionally) to the impurity reduction on average. The previous examples aim for generality; hence, the derived α 1 may not be optimal. For example, in Example 3, with the additional assumption of an additive model,thevalueofα 1 linearlydependsonthenumberofactivefeaturescomparedtothequadratic dependence without such an assumption. In addition, the values of α 1 in these examples may be smallerforspecificmodelcoefficients. 26 2.3.2 Convergencerates We can now characterize the explicit convergence rates for the consistency of the random forest algorithm in a fairly general high-dimensional nonparametric regression setting. Both b T a and γ 0 are defined in Sections 2.2.2 and 2.2.1, respectively. Moreover, B is the number of trees used for row subsampling and 0< b≤ 1 is the proportion of training data used for row subsampling. Thesedefinitionsarepresenteddirectlypreceding(2.4). Thedetailsontherandomforestestimates E(b m b T a ,a (···) X X X,X n )and 1 B ∑ a∈A E(b m b T a ,a (···) X X X,X n )arepresentedinSection2.2.2. Theorem1. AssumethatConditions1–4holdandlet01,0<η <1/8, 0 < c < 1/4, and δ > 0 be given with 2η <δ < 1 4 . Let A ={a 1 ,···,a B } with #a i =⌈bn⌉ for i=1,···,B and a∈A be given. Then, there exists someC >0 such that for all large n and each 1≤k≤clog 2 ⌈bn⌉, E m(X X X)−E b m b T a ,a (Θ Θ Θ 1:k ,X X X,X n ) X X X,X n 2 ≤C α 1 (⌈bn⌉) −η +(1−γ 0 (α 1 α 2 ) −1 ) k +(⌈bn⌉) −δ+c . (2.17) Inaddition,whenweaggregateoverrowsubsamples(i.e.,overa∈A),wehave E m(X X X)− 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1:k ,X X X,X n ) X X X,X n 2 ≤C α 1 (⌈bn⌉) −η +(1−γ 0 (α 1 α 2 ) −1 ) k +(⌈bn⌉) −δ+c . (2.18) To the best of our knowledge, Theorem 1 is the first result for the consistency rates for the original version of the random forest algorithm(see Table 2.1 in the Introduction for detailed dis- cussion). Theconstantα 2 >1isarbitraryandisnecessarytoaccountfortheestimationerrorfrom using sample CART splits. Although Condition 2 restricts the feature dimensionality p, the upper boundsinTheorem1donotdependon pexplicitly. Remark4discussestheimplicitdependenceof the rates on p. Additionally, see Section 2.3.3 for more informative convergence rates depending on pformodelswithbinaryfeatures. 27 Our results provide no interesting information about the tuning parameter b. The technical reasonisthatweappliedtheCauchy–Schwarzinequalitywhenderiving(2.18)from(2.17),which holds even in the worst case when all trees are highly correlated with each other. In this sense, (2.18) only provides a highly conservative upper bound. Because random forest improves upon bagging [15] by using column subsampling, we focus on the random forest algorithm using only columnsubsamplingwiththefullsample(i.e.,b=1anda={1,···,n}). Thenotationforthecase withb=1ispresentedinSection2.2.3. To gain a more in-depth understanding of the upper bounds in Theorem 1, we provide Corol- lary1restatingtheresultsinTheorem1withmoreemphasisonthebias-variancedecomposition. Corollary 1. 
Under all the conditions of Theorem 1, for all large n and each 1≤k≤clog 2 n, it holdsforthetwotermsontheRHSof (2.11)that Squaredbias:=E m(X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 ≤O n −η +(1−γ 0 (α 1 α 2 ) −1 ) k | {z } Maintermofbias + O(n −δ+c ) | {z } Uninterestingerror (2.19) and Estimationvariance:=E m ∗ b T (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤O(n −η )+ O(n −δ+c ) | {z } Uninterestingerror . (2.20) Thethirdtermin(2.17),whichisalsodisplayedin(2.19)and(2.20)asthe“uninterestingerror,” is caused by technical difficulty and does not carry too much meaning (see Remark 5 for details). Such a term could be eliminated using a more refined technical analysis. Thus, the discussion in thissectionignoresthistermforsimplicity. Our results provide a fresh understanding of the random forest, especially regarding how the random forest method controls the bias. The second term on the RHS of (2.19) is the main term of the bias of the random forest, whereas the first term n −η is the upper bound of the error caused 28 by the sample CART splits. If theoretical CART splits (see the formal definition in Section 2.3.1) are used, then the first term on the RHS of (2.19) vanishes, and in the second term, α 2 = 1 (see Remark 7 in Section 2.4.1). We contribute to quantitatively characterizing how the use of column subsampling controls the bias when k≤clogn. The exponential decay rate in the second term on the RHS of (2.19) is derived through the global control of the bias of all trees in the forest and the SID condition makes such precise quantification possible. For a fixed sample size n, the first termn −η intheupperboundofthebiasdoesnotvanishasthetreeheightincreases. Fortheupper bound of the bias to decrease to zero, it requires both n,k→∞. Tuning a higherγ 0 decreases the second term in the bias decomposition because a larger γ 0 makes each split more likely to be on relevant features(see Definition 1), decreasing the bias faster. The first term on the RHS of (2.19), O(n −η ), is likely a conservative upper bound on the error caused by sample CART splits because this term is invariant toγ 0 and k. Accurately characterizing how this error term depends onγ 0 , k, and n is an interesting future research topic. Moreover, this error term and the first term of the estimation variance can generally have different convergence rates; O(n −η ) is a common upper boundforbothtermsforsimplicityofthetechnicalpresentation. Our upper bound of the estimation variance in (2.20) is less informative than the upper bound of the bias in that it does not reflect any effects ofγ 0 and k. The variance upper bound is conser- vativeduetocertaintechnicaldifficulties. Weboundthevarianceoftherandomforestestimateby establishing a uniform upper bound for the variances of individual trees, a great distinction from theglobalapproachinourbiasanalysisusingSID. The literature has indicated that the random forest algorithm with column subsampling is closely related to the adaptive nearest neighbors method [51] and adaptive weighted estimation [52]. We draw analogy to the adaptive nearest neighbors method to assist in the understanding of our results. As revealed in [51], the forest estimate can be viewed as an adaptive weighted nearest neighborsestimator. 
Thecolumnsubsamplingrateγ 0 controlstheadaptiveweightsassignedtothe nearest neighbors, with γ 0 = 1 producing uniform weights on only the nearest neighbors (under some distance metric) and zero weights on farther ones, and smaller γ 0 producing more adaptive 29 and distributedweights extended eventoevenfartherneighbors. Thissuggests thattheestimation varianceshoulddependonγ 0 . However,ourupperboundisconservative,holdingforallvaluesof γ 0 . Itisalsocommonlybelievedthatalargerkusuallycauseslargervariancefortheforestestimate becausetheendnodestendtocontainfewerobservations. InSection2.3.3,weillustratetheeffect of k on estimation variance under a simplified model setting. It is a challenging and interesting future research topic to precisely characterize how variance depends onγ 0 and k. As a result, the fact that the upper bound in Theorem 1 decreases withγ 0 is a coincidence rather than reality. The crudeupperboundscausethisdecreaseinseveralpartsinthebias-variancedecomposition. Theorem1holds(nonuniformly)foreachcombinationof(k,η,δ,c)satisfying1≤k≤clog 2 n, 0 <η < 1/8, δ∈ (2η,1/4) and 0 < c < 1/4. If we set η = 1 8 −ε, δ = 1 4 −ε, c = 1 8 , and k = ⌊ 1 8 log 2 (n)⌋inTheorem1,weobtainamoreinformativeconvergencerateasshowninCorollary2. Corollary2. AssumethatConditions2–4holdandlet0<ε < 1 8 ,α 1 ≥1,α 2 >1,and0<γ 0 ≤1 begiven. Foralllargenandtreeheightk=⌊ 1 8 log 2 n⌋, sup m(X X X)∈SID(α 1 ) E m(X X X)−E b m b T (Θ Θ Θ 1:k ,X X X,X n ) X X X,X n 2 ≤O n − 1 8 +ε +(1−γ 0 (α 1 α 2 ) −1 ) ⌊ log 2 (n) 8 ⌋ | {z } Maintermofbias ≤O n − 1 8 +ε +n − log 2 (e) 8 × γ 0 α 1 α 2 , where the term n −1/8+ε is a common bound of all terms on the RHS of (2.19) and (2.20) except forthe“maintermofbias.” Remark 4. The feature dimensionality p and tree height k determine the number of all possible nodes when growing trees. In deriving the consistency rates in Theorem 1, we must account for the probabilistic deviations between the sample moments (e.g., means) conditional on all possible nodes and their corresponding population counterparts, which is easy to understand for the vari- anceanalysis. WemustalsoaccountforsuchdeviationsduetothesampleCARTsplitsforthebias analysis. Hoeffding’s inequality can quantify the deviations between the sample and population 30 moments on each cell. However, to achieve uniform control over all possible nodes, we must re- strictthenumberofnodesupper-boundedby2 clog 2 n p(⌈n 1+ρ 1 ⌉+1) clog 2 n forsomeρ 1 >0when heightk≤clog 2 n. (see(A.8)inAppendixAfordetails). Theconditionof p=O(n K 0 )issufficient forrestrictingthenumberofallpossiblenodesforouranalysisanddemonstratesthatthebiasand varianceimplicitlydependon pthroughninourupperbounds. Remark5. WebrieflyexplainhowthethirdtermintheupperboundinTheorem1isanartifactof ourtechnicalanalysis. Whenatreeisgrowntolevelk,thereare2 k nodes,whichformapartitionof the p-dimensionalhypercube. Amongthesenodes,ifanodet t t istoosmallsuchthatP(X X X∈t t t)≤c n with c n depending only on n, there will not be enough observations in t t t with high probability. Consequently,wecannotuseHoeffding’sinequalitytocontroltheprobabilisticdifferencebetween E(m(X X X)|X X X∈t t t) and its sample counterpart. The total probability of all these small nodes is less than c n 2 k , when c n and k≤ clog 2 n are sufficiently small; thus, we use Conditions 3 and 4 to establish the upper bounds for the mean differences on these small nodes. This is one of the reasons we need the brute-force analysis method (for details, see Lemma 1 in Section 2.4). 
Thus, we limit the tree height parameter to $c<1/4$. In addition, another reason for the third term in the rates is the use of our high-dimensional estimation framework. The details of this framework are provided in Lemma 2 in Section 2.5. Whether it is possible, and how, to eliminate this term through a finer analysis is an interesting future research topic.

2.3.3 Sharper convergence rates with binary features

This section demonstrates that our bias-variance decomposition analysis technique can yield a sharper upper bound under simplified model settings. We consider Example 5 below and assume the absence of row subsampling in this section for simplicity.

Example 5. Assume that $X_1,\cdots,X_p$ are independent, $P(X_j=1)=P(X_j=0)=\frac12$ for all $j$, and that $m(X)=\sum_{j=1}^{s^*}\beta\mathbf{1}_{X_j=1}$ for some $|\beta|>0$ and $s^*\le p$.

We gain some insight into how binary features can greatly simplify the problem by considering any end node $t_k$ and the branch $(t_0,\cdots,t_k)$ connecting it to the root node $t_0$. Along this branch, once a coordinate $j$ has a split with $c\in(0,1]$, any further split on this coordinate results in zero decrease in population impurity. In addition, each sample CART split is on some data point; thus, the split can only be either $(j,1)$ or $(j,0)$ for some $j$, with the latter $(j,0)$ resulting in an empty daughter node and zero population impurity decrease; thus, split $(j,0)$ can be safely removed from consideration. In summary, the CART essentially tries to minimize (2.6) with regard to only $j\in\{1,\cdots,p\}$ to find a split of the form $(j,1)$. For these reasons, the problem is considerably simplified.

Nevertheless, by definition, the sample CART could still lead to a split of form $(j,0)$ when the event $\#\{i: x_{ij}=1\}=0$ occurs for some $j$, making the analysis tedious. To avoid this, the CART in this section is redefined as follows: for a node $t$ and feature restriction $\Theta$, the split is $(\widehat{j},1)$ such that
$$ \widehat{j}:=\arg\min_{j\in\Theta}\Bigg[\sum_{i\in P_L}\Big(\frac{\sum_{i\in P_L}y_i}{\#P_L}-y_i\Big)^2 + \sum_{i\in P_R}\Big(\frac{\sum_{i\in P_R}y_i}{\#P_R}-y_i\Big)^2\Bigg], $$
where $P_L:=\{i: x_i\in t,\ x_{ij}<1\}$ and $P_R:=\{i: x_i\in t,\ x_{ij}\ge 1\}$. CART stops splitting when all available coordinates have been split. To obtain an equal height for each tree branch, we can consider trivial splits that result in empty daughter nodes (see the proof of Proposition 1 for details); a schematic implementation of this simplified splitting rule is sketched at the end of this subsection.

Proposition 1. Consider Example 5 and i.i.d. observations from (2.1) with $|\varepsilon_1|\le M_\varepsilon$ for some $M_\varepsilon>0$ and $E(\varepsilon_1)=0$. Moreover, $0<\gamma_0\le 1$, $0<\eta<1$, and $\varepsilon>0$ are given, and $(\log_e p)^{2+\varepsilon}=o(n^{1-\eta})$. Then, 1) $m(X)\in\mathrm{SID}(s^*)$ and 2) for all large $n$ and every $0\le k\le \eta\log_2(n)$,
$$ E\Big[\big(m(X)-E\big(\widehat{m}_{\widehat{T}}(\boldsymbol{\Theta}_{1:k},X,\mathcal{X}_n)\mid X,\mathcal{X}_n\big)\big)^2\Big] \le \underbrace{2\big(1-\gamma_0(s^*)^{-1}\big)^k\,\mathrm{Var}(m(X))}_{\text{Squared bias}} + \underbrace{2(3M_0+2M_\varepsilon)^2\,\frac{2^k\big(\log_e(\max\{n,p\})\big)^{2+\varepsilon}}{n}}_{\text{Estimation variance}} + o(n^{-1}), \qquad (2.21) $$
where $M_0=\sup_{c\in[0,1]^p}|m(c)|$, and $\widehat{m}_{T}(\boldsymbol{\Theta}_{1:0},X,\mathcal{X}_n)$ is defined as $n^{-1}\sum_{i=1}^n y_i$. Particularly, if $\gamma_0=1$, then (2.21) holds with the squared bias bound replaced with $\max\{(s^*-k)\beta^2/2,\,0\}$.

We compare Proposition 1 with Theorem 1 (see footnote 10 below). This upper bound is free of uninteresting errors, and the estimation variance depends on tree level $k$ more explicitly due to the simpler model setting with binary features. The squared bias term on the RHS of (2.21) explicitly depends on the column subsampling parameter $\gamma_0$, tree level $k$, and sparsity parameter $s^*$ (with $\alpha_1=s^*$). The squared bias does not depend on the sample size $n$ and $\alpha_2$ because, under this simplified model setting, we can show that the sample CART approximates the theoretical CART perfectly on a high probability event. Our upper bound in Proposition 1 exhibits the explicit bias-variance trade-off with respect to tree level $k$, which is not present in Theorem 1.
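To make the redefined splitting rule concrete, the following minimal sketch (our own illustrative code for the binary-feature setting of Example 5, not the construction analyzed in the proofs) selects the split coordinate by minimizing the total within-child sum of squared errors with the cutpoint fixed at 1.

import numpy as np

def binary_cart_split(X, y, node_idx, available_features):
    # Redefined CART split for binary features: for each candidate coordinate j in
    # the feature restriction, split the node at 1 (left: x_ij < 1, right: x_ij >= 1)
    # and return the j minimizing the total within-child sum of squared errors.
    best_j, best_sse = None, np.inf
    for j in available_features:
        left = node_idx[X[node_idx, j] < 1]
        right = node_idx[X[node_idx, j] >= 1]
        sse = 0.0
        for part in (left, right):
            if part.size > 0:
                sse += np.sum((y[part] - y[part].mean()) ** 2)
        if sse < best_sse:
            best_j, best_sse = j, sse
    return best_j  # the chosen split is (best_j, 1)

# Usage sketch under Example 5 with s* = 3 active features and beta = 2 (values
# chosen arbitrarily for illustration); the returned coordinate should be 0, 1, or 2.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 10))
y = 2.0 * X[:, :3].sum(axis=1) + rng.uniform(-0.1, 0.1, size=500)
print(binary_cart_split(X, y, np.arange(500), available_features=range(10)))

Applying this split recursively to each daughter node, and stopping once all available coordinates have been split, yields the simplified trees studied in Proposition 1.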
However,thetrade-offwithrespecttoγ 0 isstillnotreflectedintheoverallupperboundbecause our estimation variance bound is universal for all 0<γ 0 ≤1. Such a caveat is caused by different technicalapproachesinourbiasandvarianceanalysis,wherewegloballycontrolbiasviaSIDbut bound the random forest estimation variance by bounding individual tree estimation variance. A moredetaileddiscussionispresentedattheendoftheproofofProposition1. The special upper bound when γ 0 =1 corresponds to the regression tree model. In such case theoptimalconvergencerate(accordingtoProposition1)isachievedwhenk=s ∗ (i.e.,thesquared bias is zero) and is of order 2 s ∗ n (log e (max{n,p})) 2+ε . This optimal rate is faster than the best rate obtained by minimizing the RHS of (2.21) with respect to k. This is because (2.21) is proved by applying our general bias analysis technique developed for proving Theorem 1 and hence makes no use of the specific model structure in Example 5. Our optimal rate when γ 0 = 1 is consistent with those in Theorems 3.3 and 4.4 of [12], up to a logarithmic factor. Their main results concern a more general model than the linear model in Example 5, but rely on the same binary features assumptionandareconfinedtothecaseofγ 0 =1. 10 ThecontinuousfeatureassumptioninTheorem1canbeeasilyrelaxedtoaccommodatebinaryfeatures 33 2.3.4 Theroleofrelevantfeatures Inthissection,weformallystudytheroleofrelevantfeaturesforSID.Weshowthatiftheregularity conditions and SID are assumed, only the splits along the relevant feature directions can reduce a sufficient amount of bias for some nodes. More precisely, we introduce a variant of SID with S 0 ⊂{1,···,p}. Condition1. Thereexistssomeα 1 ≥1suchthatforeachnodet t t =t 1 ×···×t p , Var(m(X X X)|X X X∈t t t)≤α 1 sup j∈S 0 ,c∈t j (II) t t t,t t t(j,c) . For simplicity, we refer to the above condition as SID2. The difference from SID is that the supremum in SID2 is taken only over the features in S 0 . In what follows, we observe that SID2 holds only if S 0 includes all relevant features when the regularity conditions on the underlying re- gressionfunctionandSIDareassumed. Webeginwithaformaldefinitionoftherelevantfeatures. Definition1. Afeature jisrelevanttotheregressionfunctionm(X X X)ifandonlyifthereexistssome constantι >0suchthat E(Var(m(X X X)|X s ,s∈{1,···,p}\{j}))>ι. In Theorem 2, we characterize the magnitude of theL 2 loss when a relevant feature is left out duringmodeltraining. Theorem 2. Assume that Conditions 3–4 hold and some relevant feature j is not involved in the randomforestmodeltrainingprocedure. Then,wehave E m(X X X)− 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 ≥ι. By Theorem 1, the regularity conditions and SID2 are sufficient for high-dimensional random forest consistency using only features in set S 0 . Furthermore, if SID2 holds with some S 0 , it 34 holds with each S 1 such that S 0 ⊂ S 1 . Thus, the result in Theorem 2 suggests that, assuming the regularity conditions and SID, all relevant features must be included in S 0 for SID2 to hold with some S 0 . Otherwise, we assume that j ∗ ̸∈S 0 is the index of some relevant feature. From previous discussions, SID holds, and SID2 holds with S 1 ={1,···,p}\{j ∗ }, implying the consistency of therandomforestestimatorwithorwithouttherelevantfeature j ∗ ,contradictingTheorem2. ThenoninclusionassumptionofarelevantfeatureinTheorem2isanalogoustoassumingthat a relevant feature is never split in random forest training and might be unnecessarily strong for ourpurpose. 
Nevertheless,ourgoalistodeliverthekeymessagethattherandomforestalgorithm mustsplitoneveryrelevantfeaturedirectiontocontrolthebias. Remark 6. Definition 1 provides a natural measurement of feature importance. Alternative defi- nitions have been considered in the literature. For example, for nonparametric feature screening, [53]assumedthatE(m(X X X))=0andforsomeconstantι >0,itholdsthatforeachrelevantfeature j, Var(m(X X X))−E Var(m(X X X)|X j ) ≥ι. (2.22) The difference is that Definition 1 measures the conditional importance of each feature, given all otherfeatures,whereas(2.22)measuresthemarginalimportanceoffeatures. 2.3.5 Relatedwork Weoutlinethedifferencebetweenourrandomforestestimateandstandardrandomforestsoftware packages and provide a detailed comparison of our consistency results with the existing results from recent literature. Our setting differs from standard random forest software packages in the number of trees grown and the height of trees. In practice, random forest packages first randomly draw a set of subsamples a with #a =⌈bn⌉ (two subsampling modes are available; see [49] for details) and available columns for splitting. Then, these packages follow (2.6) to split nodes and stop splitting a node if and only if the node contains one observation. By default, these packages grow 500 such independent trees. For our work, for each l∈{1,···,B} with an arbitrary integer 35 B > 0, a l contains⌈bn⌉ distinct indices in{1,···,n}. These indices can be chosen in any way independent of the training sample. Then, we grow a forest with all possible trees defined in Section 2.2.1 for each a l . Moreover, we consider trees with a height of at most clog 2 n for some possibly small c> 0 and our sample trees defined in Section 2.2.2 continue to grow nodes for a nodewithoneobservation. Next, we compare our consistency results in Section 2.3.2 to the existing findings from the recent literature. For an easier comparison, we focus on the case of k =clog 2 n and drop the un- interesting error (i.e., the third term of (⌈bn⌉) −δ+c ) from the upper bounds in Theorem 1. With such conventions, noting that (1−γ 0 (α 1 α 2 ) −1 ) clog 2 n ≈ n −cγ 0 (α 1 ,α 2 ) −1 , our convergence rate be- comes n − cγ 0 α 1 α 2 +n −η . Table 2.2 summarizes our convergence rate and the rates for two modified versions of the random forest algorithm, the centered random forests [42] and Mondrian random forests [47]. Moreover, s represents the number of informative features and β > 0 denotes the exponent for the Hölder continuity condition. As mentioned in the Introduction, these modified versionsoftherandomforestalgorithmusesplittingmethodsthatareindependentoftheresponse inthetrainingsample,whichisadeparturefromtheoriginalversionoftherandomforestalgorithm proposedin[8]. Theorem 1 and [42] both consider the sparsity parameter similarly, whereas [47] did not con- sider sparsity. In contrast to our work, [42] did not consider the scenario of growing sparsity parameter(seeSection2.3.1.2). Theresultin[47]achievedaminimaxrateunderaclassofHölder continuous functions with parameter β. This rate nontrivially depends on the dimensionality p and becomes uninformative for a large p. Our consistency result is the only result so far that al- lows for the original random forest algorithm, growing sparsity parameter, and growing ambient dimensionality. Furthermore, our rates explicitly consider the effect of column subsampling for the random forest method; hence, γ 0 appears in the rates. 
Such differences make our consistency result unique and useful for understanding the original random forests algorithm. However, we acknowledge that the rate of convergence given in Theorem 1 is not optimal due to the technical difficulties discussed in Section 2.3.2.

Table 2.2: Comparison of consistency rates of the random forest algorithm with different splitting rules.

Method | Rate of convergence | Growing sparsity parameter | Explicit dependence on dimensionality p
Our Theorem 1 | $n^{-c\gamma_0/(\alpha_1\alpha_2)}$ (squared bias) $+\; n^{-\eta}$ (variance) | Yes | No
Centered RF [42] | $\big(n(\sqrt{\log n})^{s-1}\big)^{-\frac{1}{s\log 2+1}}$ | No | No
Mondrian RF [47] | $n^{-\frac{2\beta}{p+2\beta}}$ | No | Yes

2.4 Approximation Theory

This section builds the approximation theory of the random forest in two steps. First, in Theorem 3, we derive the decreasing rates of the approximation error from approximating $m(X)$. We approximate $m(X)$ using a class of theoretical forest estimates, each of which is associated with a tree-growing rule from a class of tree-growing rules represented by $\mathcal{T}$ (see definition). Each growing rule in $\mathcal{T}$ is associated with a deterministic splitting criterion comparable to the theoretical CART-split criterion in terms of impurity decrease. Then, in Theorem 4, we verify that on a high probability event, a version of the sample tree-growing rule conditional on the observed sample is an instance of $\mathcal{T}$. In other words, we demonstrate that the sample CART splits are comparable to the theoretical CART splits in terms of the impurity decrease. We start in Lemma 1 by presenting the bound on the approximation error of (2.11), which plays a key role in establishing the consistency rate in Theorem 1.

Lemma 1. Assume that Conditions 1–4 hold and let $0<\gamma_0\le 1$, $\alpha_2>1$, $0<\eta<\frac{1}{8}$, $\delta$ with $2\eta<\delta<\frac{1}{4}$, and $c>0$ be given. Then, for all large $n$ and each $1\le k\le c\log_2 n$,
$$ E\big(m(X)-m^*_{\widehat{T}}(\boldsymbol{\Theta}_{1:k},X)\big)^2 \le 8M_0^2 n^{-\delta}2^k + 2\alpha_1\alpha_2 n^{-\eta} + 2M_0^2\big(1-\gamma_0(\alpha_1\alpha_2)^{-1}\big)^k + 2n^{-1}. $$

The main idea for proving the desired upper bound in Lemma 1 is to determine a class of deterministic tree-growing rules $\mathcal{T}$ such that, given an event $U_n$ of asymptotic probability one, a slightly modified version of the sample rule $\widehat{T}$ is an instance of $\mathcal{T}$. Hence, we obtain
$$ E\big[(m(X)-m^*_{\widehat{T}}(\boldsymbol{\Theta}_{1:k},X))^2\mathbf{1}_{U_n}\mid \mathcal{X}_n\big] \lesssim \sup_{T\in\mathcal{T}} E\big[(m(X)-m^*_{T}(\boldsymbol{\Theta}_{1:k},X))^2\mathbf{1}_{U_n}\mid \mathcal{X}_n\big] \le \sup_{T\in\mathcal{T}} E\big(m(X)-m^*_{T}(\boldsymbol{\Theta}_{1:k},X)\big)^2, \qquad (2.23) $$
where the expectation is with respect to $\boldsymbol{\Theta}_{1:k}$ and $X$, and we employ the notation $\lesssim$ in the first step to emphasize that a slightly modified version of $\widehat{T}$ is involved in the rigorous derivation. The last step holds because $\mathcal{T}$ consists of growing rules associated with deterministic splitting criteria. With these inequalities, the work to bound the very last term in (2.23) remains. Then, $P(U_n^c)$ is sufficiently small for all large $n$; thus, we obtain the desired result in Lemma 1. We define $\mathcal{T}$ in Section 2.4.1 and discuss the slightly modified version of $\widehat{T}$ and how to bound the RHS of (2.23) in Section 2.4.2.

2.4.1 Main results

Given parameters $\varepsilon$, $\alpha_2$, and $k$, all tree-growing rules satisfying Condition 5 below form a class of tree-growing rules (i.e., $\mathcal{T}$), each of which is associated with an abstract deterministic splitting criterion.

Condition 5.
For the tree-growing rule T, there exist someε≥0, α 2 ≥1, and positive integer k suchthatforanysetsofavailablefeaturesΘ 1 ,···,Θ k ,each(t t t 1 ,···,t t t k )∈T(Θ 1 ,···,Θ k ),andeach 1≤l≤k, 1) if (II) t t t l−1 ,t t t l ≤ε,then sup (j∈Θ l ,c) (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 ε; 2) if (II) t t t l−1 ,t t t l >ε,thensup (j∈Θ l ,c) (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 (II) t t t l−1 ,t t t l , where we do not specify to which Θ l,1 ,···,Θ l,2 l−1 in Θ l feature j belongs in the supremum for simplicity. 38 Whenα 2 =1andε =0,foreachintegerk>0,onlyonetree-growingrulesatisfiesCondition5, the one associated with the theoretical CART-split criterion. The parameters α 2 > 1 and ε > 0 areintroducedtoaccountforthestatisticalestimationerrorwhenusingthesampleCARTsplitsto estimatethetheoreticalCARTsplits. Theorem4inSection2.4.2revealsthat,withhighprobability, aslightlymodifiedversionofthesampletree-growingrulesatisfiesConditionrefHRF:tree. Theorem 3. Assume that Condition 1 holds with α 1 ≥ 1, Var(m(X X X))<∞, and the tree growing ruleT satisfiesCondition5withsomeintegerk>0,ε≥0,andα 2 ≥1. Thenforeach0<γ 0 ≤1, wehave E m(X X X)−m ∗ T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 ≤α 1 α 2 ε+ 1−γ 0 (α 1 α 2 ) −1 k Var(m(X X X)). Remark 7. If we set α 2 = 1 and ε = 0, Theorem 3 applies only to the growing rule associated withthetheoreticalCART-splitcriterion. Inthissense,Condition5canbeunderstoodasawayto extend the applicability of Theorem 3 to a wider class of tree-growing rules inT . Indeed, we set ε asn −η ,accountingfortheestimationerrorduetosampleCARTsplitswhenprovingTheorem1 byexploitingTheorem3. Condition5enablesustoapplyTheorem3toanabstracttree-growingruleobtainedbyslightly modifying the tree-growing rule associated with the sample CART-splitting criterion. In Sec- tion 2.4.2, we discuss this abstract tree-growing rule in detail. The exponential upper bound 1−γ 0 (α 1 α 2 ) −1 k in Theorem 3 is obtained through a recursive analysis. To appreciate the re- cursiveanalysis,weconsiderthespecialcaseofε =0. Then,wedemonstratethat E(m(X X X)−m ∗ T (Θ Θ Θ 1:k ,X X X)) 2 ≤ 1− γ 0 α 1 α 2 E(m(X X X)−m ∗ T (Θ Θ Θ 1:k−1 ,X X X)) 2 . (2.24) In(2.24),weobservetherecursivestructureforcontrollingtheapproximationerror. SectionA.3.2 inAppendixApresentsmoredetailsontherecursiveinequality. 39 Theorem 3 is the key result that makes our approximation error analysis unique and practical. Itdifferssharplyfromtheexistingliteratureinthesensethatourtechnicalanalysisismorespecific totherandomforestalgorithmanddoesnotrelyongeneralmethodsofdata-independentpartition such as Stone’s theorem [54] or data-dependent partition ( e.g., [55]). This focus is in contrast to mostexistingstudies[20,37,41,43]. 2.4.2 Sampletree-growingrule InTheorem4,weanalyzeaversionofthesampletree-growingruledefinedbelowanddemonstrate that, conditional on the observed sample, for a high probabilityX n -measurable event, this rule satisfiesCondition5withε =ε n decreasingtozero. GivensetsofavailablefeaturesΘ 1:k forsome positive integer k, we consider the following procedure of modifying a subtree with ζ > 0. For each (t t t 1 ,···,t t t k )∈ b T(Θ 1:k ) with P(X X X∈ t t t l−1 ) <ζ for some 1≤ l≤ k, we fix l 0 := min{l−1 : P(X X X∈t t t l−1 )<ζ,1≤ l≤ k}. Then, we trim the descendant nodes of t t t l 0 from b T(Θ 1:k ) and grow newdescendantnodesinsuchawaythateachnewdescendantnodet t t ′ anditsparentnodet t t satisfy sup (j∈Θ,c) (II) t t t,t t t(j,c) =(II) t t t,t t t ′. ThesetsofavailablefeaturesarethoseinΘ 1:k ,andwedonotspecify them in the supremum. 
Each new descendant node of t t t l 0 is grown according to the theoretical CART-splitcriterion. AgraphicalillustrationisdisplayedinFigure2.2. ... (a) Trim the subtree after the specified node. ... :) A B C (b) Split the nodes A, B, and C using the theoretical CART-split criterion with the correspondingsetsofavailablefeatures. Figure2.2: Trimasubtreeandregrowit. Next, we define the modified version of the sample tree. For each node path in b T(Θ 1:k ), there is at most one node t t t l 0 , as defined previously. We collect these nodes in a subset, perform the 40 previously described procedure accordingly, and obtain the modified sample tree. The new tree is denotedas b T ζ (Θ 1:k ),andwerefertoitasthesemi-sampletree-growingrule. Theorem 4. Assume that Conditions 2–4 hold and let α 2 > 1, 0 < η < 1 8 , c > 0, and δ with 2η <δ < 1 4 be given. Then there exists anX n -measurable eventU U U n withP(U U U c n ) =o(n −1 ) such thatconditionalonX n ,oneventU U U n andforalllargen, b T ζ withζ =n −δ satisfiesCondition5with k=⌊clog 2 n⌋,ε =n −η ,andα 2 . Remark8. Theorem4isaresultforthesemi-sampletree-growingruleinsteadofthesampletree- growing rule; thus, the tree height parameter c>0 is an arbitrary constant in Theorem 4. To use the result of Theorem 4 for Lemma 1, we must control theL 2 difference between the population version of random forest estimates using these two rules, which is the reason for the first term in Lemma1. SuchatermisboundedbyO(n −δ+c );thus,thevalueofcmustbelimitedwhenapplying Lemma1toobtainTheorem1. ThereisasimilarremarkforLemma3inSection2.5. Remark9. DuetoTheorem4,wecanapplyTheorem3to b T ζ withζ =n −δ andobtainthefollow- inginequality: E " m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 X n # 1 1 1 U U U n ≤α 1 α 2 n −η +(1−γ 0 (α 1 α 2 ) −1 ) k Var(m(X X X)), where the estimate m ∗ b T ζ is similarly defined as m ∗ b T (for more detail, see the proof of Lemma 1 in Section A.4.4 in Appendix A). As a result, we do not need Condition 5 in Lemma 1 beause it is a consequence of Theorem 4 that the sample tree-growing rule is an instance of Condition 5 in a probabilistic sense. This is one of our main contributions to proving such a result in Theorem 4 insteadofassumingthatthesampletree-growingrulesatisfiesCondition5. 41 2.5 Upperboundsforthestatisticalestimationerror In this section, we develop a general high-dimensional estimation foundation to analyze the con- sistency of the random forest estimate and use it to derive the convergence rate for the estimation variance(i.e.,thesecondtermin(2.11)). Lemma 2. Assume that Conditions 2–4 hold and let 0 <η < 1/4, 0 < c < 1/4, and ν > 0 be given. Then,thereexistssomeconstantC>0suchthatforalllargenandeach1≤k≤clog 2 n, E m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 ≤n −η +C2 k n − 1 2 +ν . (2.25) Wegainsomeinsightintothechallengeassociatedwith(2.25). Wemustcontroltheprobabilis- tic difference between every conditional mean and average on each node grown using the sample tree-growingruletoanalyzetheestimationvariance. Thistaskischallengingbecausethesenodes have random boundaries. A naive consideration of all possible nodes in [0,1] p considers every nodebutresultsinanuncountablyinfiniteset,precludingtheapplicationofstandardconcentration inequalities,suchasHoeffding’sinequality. Next,wedescribeourapproachtoovercomingsucha challengeindetail. 
Instead of considering estimation of conditional means on all possible nodes with random boundariesin [0,1] p directly,weestimateonlyconditionalmeansoneachofasetofdeterministic nodes from a predetermined grid, which we will formally define next. The grid contains many nodes such that for an arbitrary node t t t, there is a node t t t # on the grid being so close to t t t that the values of their theoretical conditional means, E(m(X X X)|X X X∈t t t) andE(m(X X X)|X X X∈t t t # ), are very close, and the values of their empirical conditional means, ∑ x x x i ∈t t t y i #{i:x x x i ∈t t t} and ∑ x x x i ∈t t t # y i #{i:x x x i ∈t t t # } , are also close. Sincethenumberofnodesonthegridisnottoolarge,wecanshowthatthetheoreticalconditional meanE(m(X X X)|X X X∈t t t # )anditsempiricalcounterpart ∑ x x x i ∈t t t # y i #{i:x x x i ∈t t t # } areuniformlycloseusingHoeffding’s inequality. CombiningalltheseresultscanyieldLemma2. The mentioned grid is defined as follows. Let ρ 1 be a given positive constant and consider a sequence of b i = i ⌈n 1+ρ 1⌉ with 0≤ i≤⌈n 1+ρ 1 ⌉. We construct hyperplanes such that along each 42 Figure 2.3: From the left to the right panel, we move the original boundaries to the nearest grid lines. jth coordinate, each point b i is crossed by only one of the hyperplanes, which is perpendicular to the jth axis. The result is exactly (⌈n 1+ρ 1 ⌉+1) p distinct hyperplanes and each boundary of the root node [0,1] p is also one of these hyperplanes. These hyperplanes form a grid on [0,1] p , and we refer to each of these hyperplanes as a grid hyperplane or a grid line. For a node t t t, we definethenodet t t # bymovingallboundariesoft t t tothecorrespondingnearestgridlines(seeFigure 2.3 for a graphical illustration). For a tree-growing rule T, we define T # such that for each Θ 1:k , (t t t # 1 ,···,t t t # k )∈T # (Θ 1:k )if (t t t 1 ,···,t t t k )∈T(Θ 1:k ). Weobservetwoimportantpropertiesofthesharp notation. First,foreachnodet t t,ift t t ′ andt t t ′′ areitsdaughternodes,then (t t t ′ ) # and (t t t ′′ ) # aredaughter nodes of t t t # . Second, for each integer k>0, the collection of end nodes t t t # k at level k is a partition of [0,1] p . As a result, T # can be understood as a tree-growing rule (induced by T). The same definitionofthesharpnotationappliesforthesampletree-growingrule. We demonstrate how to use the grid to obtain the result in Lemma 2. To control theL 2 loss betweenm ∗ b T andb m b T asin(2.25),wedecomposethesquaredlossintothreetermsas L 2 lossbetweenm ∗ b T andm ∗ b T # | {z } Controlledby(2.28) ←→ L 2 lossbetweenm ∗ b T # andb m b T # | {z } ControlledbyTheorem5 ←→ L 2 lossbetweenb m b T # andb m b T | {z } Controlledby(2.29) , (2.26) and establish bounds for each of them in Theorem 5, (2.28), and (2.29) below, respectively, using the grid. In particular, in Theorem 5 the grid helps bound the LHS of (2.27) uniformly over all possible tree-growing rules T. This approach provides a solution to a fundamental estimation 43 problem in proving random forest consistency that involves infinitely many possible T values. By (2.2)and(2.10),foranyT andΘ 1:k ,wecandeducethaton∩ k i=1 {Θ Θ Θ i =Θ i }, E m ∗ T # (Θ Θ Θ 1:k ,X X X)−b m T #(Θ Θ Θ 1:k ,X X X,X n ) 2 Θ Θ Θ 1:k =Θ 1:k ,X n = ∑ (t t t 1 ,···,t t t k )∈T # (Θ 1:k ) P(X X X∈t t t k ) E(m(X X X)|X X X∈t t t k )− ∑ i∈{i:x x x i ∈t t t k } y i #{i:x x x i ∈t t t k } 2 . 
(2.27) From the expression on the RHS of (2.27), with the grid, we only deal with the estimation of the conditionalmeansoneachnodeintheset {t t t # :t t t∈{AllendnodesgrownbyallpossiblegrowingrulesgivenΘ 1 ,···,Θ k }}. Such a set contains only finitely many distinct nodes given k, n, and p. This set can be further enlarged to consider all possible growing rules and sets of available features (i.e., the collection of end nodes grown by all possible T # (Θ 1:k )). According to (A.8) in Appendix A, the number of distinctnodesoftheenlargedsetisboundedby2 k p(⌈n 1+ρ 1 ⌉+1) k . Theorem5. AssumethatConditions3–4holdandlet0<η < 1 4 and0<c< 1 4 begiven. Thenfor alllargenandeach1≤k≤clog 2 n,wehave E sup T E m ∗ T # (Θ Θ Θ 1:k ,X X X)−b m T #(Θ Θ Θ 1:k ,X X X,X n ) 2 Θ Θ Θ 1:k ,X n ≤n −η , where the supremum is over all possible tree growing rules. Note that due to the use of the grid, thesupremumcanbesimplifiedtoamaxoverafinitelymanytreegrowingrules. Lemma3. Assume that Conditions 2–4 hold and let 1/2<∆<1 and c>0 be given. Then there existssomeconstantC>0suchthatforalllargenandeach1≤k≤clogn, E m ∗ b T # (Θ Θ Θ 1:k ,X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 ≤C2 k n ∆−1 (2.28) 44 and E b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤C2 k n ∆−1 . (2.29) 2.6 Discussion Inthischapter,weinvestigatedtheasymptoticpropertiesoftherandomforestalgorithmina high-dimensional feature space. In contrast to existing theoretical results, our asymptotic analysis considered the original version of the random forest in a general high-dimensional nonparametric regression setting where covariates can be dependent and the underlying true regression func- tion can be discontinuous. Explicit convergence rates were established for the high-dimensional consistency of the random forest, justifying its theoretical advantages as a flexible nonparametric learning tool in high dimensions. We present a new technical analysis for polynomially growing dimensionality through natural regularity conditions characterizing the intrinsic learning behavior oftherandomforestatthepopulationlevel. 45 Chapter3 OptimalNonparametricInference withTwo-ScaleDistributionalNearestNeighbors 3.1 Introduction Theideaunderlyingnearestneighbormethodsissimple: “thingsthatappearsimilararelikelysim- ilar.” [56] This simple and time-tested 1 idea has led to the k-nearest neighbors (k-NN) procedure and its extensions, including the weighted nearest neighbors method, which are known for their straightforwardimplementationandappealingtheoreticalproperties. Forexistingresultsandsome recentdevelopmentsrelatedtok-NNregression,see,forexample,[57–61]. Despite the advantage of the weighted nearest neighbors (WNN) method over the unweighted k-NN, the selection of the adaptive weights can be challenging in practice. The bagged 1-NN estimator,an ensemble learning method, has been proposed toaddress such an issue. Specifically, [25]and[21]proposedtoestimatethemeanregressionfunctionbyaveragingall1-NNestimators constructed from randomly subsampling s observations with or without replacement, where s is required to diverge with the total sample size n. 
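As a concrete illustration of this subsample-and-average construction, the following minimal sketch (our own code, not taken from the referenced works) computes the bagged 1-NN estimate for subsampling without replacement. Rather than enumerating subsamples, it uses the closed-form rank weights $\binom{n-i}{s-1}/\binom{n}{s}$, which equal the probability that the $i$-th nearest neighbor of $x$ is the nearest neighbor within a uniformly drawn subsample of size $s$.

import numpy as np
from scipy.special import comb

def dnn_estimate(X, y, x0, s):
    # Bagged 1-NN (distributional nearest neighbors) estimate of the regression
    # function at x0: the average of the 1-NN predictions over all size-s
    # subsamples drawn without replacement, written as a rank-weighted average
    # in which the i-th nearest neighbor of x0 has weight C(n-i, s-1) / C(n, s).
    n = len(y)
    # Stable sort so that distance ties are broken by the smaller original index.
    order = np.argsort(np.linalg.norm(X - x0, axis=1), kind="stable")
    i = np.arange(1, n + 1)
    weights = comb(n - i, s - 1) / comb(n, s)
    return float(np.dot(weights, y[order]))

The weights sum to one, reduce to the sample mean when $s=1$, and put all weight on the nearest neighbor when $s=n$; for very large $n$ they are better computed recursively via $w_{i+1}=w_i\,(n-i-s+1)/(n-i)$, or on a log scale, to avoid overflowing the binomial coefficients.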
[25] showed that this procedure automatically assigns monotonic nonnegative weights to the nearest neighbors in a distributional fashion on the entiresample,motivatingustonameitasthedistributionalnearestneighbors(DNN)inthisthesis 1 The first recorded discussion of the nearest neighbor idea can be traced back to Alhazen’s Book of Optics in the early 11th century where he discussed using nearest neighbor classification as an explanation for visual object recognition. See[56]forfurtherdiscussionofthehistoryofnearestneighbormethods. 46 for easy presentation. The seminal work of [15] first introduced the bagging technique, and it has since been employed to improve the performance of base estimators for ensemble methods. For instance,see[18]fortheasymptoticpropertiesofbaggednearestneighborclassifiers. [21] proved that DNN achieves the nonparametric minimax optimal convergence rate under a Lipschitzcontinuityassumptionoftheregressionfunction. However,resultsdescribingtheasymp- totic distribution of DNN do not exist, limiting its application to statistical inference. In addition, when the mean regression function has higher-order smoothness, the DNN estimator no longer achieves the nonparametric optimal rate. In this chapter, we discover through thorough investiga- tions that the non-optimality of DNN is caused by the slow convergence rate due to the bias. For further bias reduction, we establish the higher-order asymptotic expansion for the bias of DNN. Basedonsuchabiasexpansion,weproposetoeliminatetheleadingorderbiasofDNNbylinearly combiningtwoDNNestimatorswithdifferentsubsamplingscales,resultinginthenoveltwo-scale DNNprocedurefornonparametricestimationandinference. The DNN estimator has a representation of L-statistic with weights depending only on the rank of the observations [25], facilitating easy and fast implementation. However, such a repre- sentation does not help with establishing the sampling properties. For the theoretical analysis, we further demonstrate that DNN estimator has an equivalent representation of U-statistic with a kernel function of diverging dimensionality equal to the subsampling scale s, and therefore, the two-scale DNN estimator also has a U-statistic representation with a new and carefully con- structed diverging-dimensional kernel. Despite the nice U-statistic representations, the classical theory does not apply to DNN or two-scale DNN for deriving their asymptotic properties because of the diverging dimensionality of the kernel functions. To overcome such a technical challenge, we exploit Hoeffding’s canonical decomposition introduced in [62], and carefully collect and an- alyze the higher-order terms in our decomposition. Our theoretical results suggest that, when the subsampling scales are appropriately chosen, two-scale DNN achieves the nonparametric opti- mal rate under the fourth-order smoothness assumption on the regression function and the density function of covariates. A larger implication of our study is that, for regression function with even 47 higher-order smoothness, the multi-scale DNN can be constructed in the same fashion to achieve theoptimalnonparametricconvergencerate;weleavethedetailedinvestigationforfuturestudy. By construction, some weights in the two-scale DNN take negative values. The advantage of using negative weights in the weighted nearest neighbors classifiers was formally investigated in [63]. 
For the problem of regression, although [59] theoretically showed that the weighted nearest neighbors estimator allowing for negative weights can improve upon that with only nonnegative weights in terms of the rate of convergence, it still remains largely unclear how to practically choose these weights. Our two-scale DNN provides an explicit and easy-to-implement way to assign negative weights which endorses the optimal nonparametric convergence rate under the higher-ordersmoothnessassumptionoftheregressionfunction. We further show that DNN and two-scale DNN are asymptotically normal as the subsampling scales and sample size n diverge to infinity. The asymptotic variance of the two-scale DNN esti- mator, however, does not admit a simple analytic form that is practically useful for statistical in- ference. Weexploittwomethods,thejackknifeandbootstrap,forasymptoticvarianceestimation. We formally demonstrate that both methods yield consistent estimates of the asymptotic variance. Ourproofsaremoreintricatethanthestandardtechniqueintheliteraturebecauseofthediverging subsampling scales. The key is to write the jackknife estimator as a weighted summation of a sequence of U-statistics and carefully analyze the higher-order terms. Our proof for the bootstrap estimator is built on our results for the jackknife estimator. Although both methods yield consis- tent variance estimates, the bootstrap estimator is much more computationally efficient. We also provide a bootstrap method to directly estimate the distribution of the two-scale DNN estimator withoutestimatingtheasymptoticvariance. We then demonstrate the superior finite-sample performance of our method using simulation studies and a real data application. The two-scale DNN estimator has two parameters to tune – the two subsampling scales – and it is equivalent to tune the ratio between the two subsampling scalesandoneofthesubsamplingscales. Weproposetojointlytunethesetwoparametersusinga 48 two-dimensionalgrid,andchoosethecombinationofthetwoparametersthatminimizesthemean- squared estimation error (MSE). As an application, we discuss the usage of the two-scale DNN for the heterogeneous treatment effect (HTE) estimation and inference with theoretical guarantee underthesettingofrandomizedexperiments. The rest of the chapter is organized as follows. Section 3.2 introduces the model setting for nonparametricregressionestimationandreviewstheDNNestimator. Wepresentthetwo-scaledis- tributionalnearestneighbors(TDNN)procedureanditssamplingpropertiesinSection3.3. Section 3.4 investigates the variance estimation and distribution estimation for the TDNN estimator. We showtheapplicationofTDNNtoHTEestimationandinferenceinSection3.5. Weprovideseveral simulationexamplesandarealdataapplicationjustifyingourtheoreticalresultsandillustratingthe finite-sample performance of the suggested TDNN method in Sections 3.6 and 3.7, respectively. Section 3.8 discusses some implications and extensions of our work. All the proofs and technical detailsareprovidedinAppendixB. 3.2 ModelSetting Consider a sample of independent and identically distributed (i.i.d.) observations{(X i ,Y i )} n i=1 fromthefollowingnonparametricmodel Y =µ(X)+ε, (3.1) where Y is the response, X∈R d represents the vector of covariates with fixed dimensionality d, µ(X) is the unknown mean regression function, andε is the model error. The goal is to estimate and infer the underlying true mean regression function µ(x) at some given feature vector x in the supportofX. 
49 3.2.1 Distributionalnearestneighbors(DNN) Given a fixed feature vectorx∈R d , we calculate the Euclidean distance of each observed feature vector X i to the target x and then reorder the sample according to such distances. Denote the reorderedsampleas{(X (1) ,Y (1) ),···,(X (n) ,Y (n) )}with ∥X (1) −x∥≤∥X (2) −x∥≤···≤∥X (n) −x∥, (3.2) where∥·∥ denotes the Euclidean norm of a given vector and the ties are broken by assigning the smallest rank to the observation with the smallest natural index. Then the weighted nearest neighbors(WNN)estimate[57]isdefinedas b µ WNN (x)= n ∑ i=1 w ni Y (i) , (3.3) where (w n1 ,w n2 ,···,w nn ) is some deterministic weight vector with all the components summing up to one. In practice, one can also use the non-Euclidean distances given by certain manifold structures. ThetheoreticalpropertiesoftheWNNestimator(3.3)havebeenstudiedextensivelyin[59]. In particular,ithasbeenprovedthereinthat,withanappropriatelyselectednonnegativeweightvector, b µ WNN (x) can be consistent with the optimal rate of convergence O P (n −2/(d+4) ) when the second- order derivative exists and can have asymptotic normality. Moreover, the optimal rate of con- vergence can be improved by allowing for negative weights under higher-order derivatives. These existingresultsprovideonlysomegeneralsufficientconditionsontheweightvector(w n1 ,···,w nn ) in order to deliver the theoretical properties. However, identifying a practical weight vector with provably appealing properties can be highly nontrivial. Furthermore, the asymptotic variance of b µ WNN (x)canadmitarathercomplicatedformanddependuponsomeunknownpopulationquanti- tiesthatareverydifficulttoestimateinpractice,hinderingtheapplicabilityinstatisticalinference. Incontrast,thebagged1-NNestimatorproposedandstudiedin[25]and[21](whichwereferto astheDNNestimatorinthischapterfortheeaseofpresentation)automaticallyassignsmonotonic 50 weights to the nearest neighbors in a distributional fashion on the entire sample. Denote by s with 1≤s≤n the subsampling scale. Let{i 1 ,···,i s } with i 1 0suchthatP(∥X−x∥≥R)≤e −αR foreachR>0. Assumption 2. The density f(·) is bounded away from 0 and ∞, f(·) and µ(·) are four times continuously differentiable with bounded second, third, and fourth-order partial derivatives in a neighborhood ofx, andEY 2 <∞. Moreover, the model errorε has zero mean and finite variance σ 2 ε >0,andisindependentofX. Assumption 3. We have an i.i.d. sample{(X 1 ,Y 1 ),(X 2 ,Y 2 ),...,(X n ,Y n )} of size n from model (3.1). We begin with presenting an asymptotic expansion of the bias of single-scale DNN estimator inthetheorembelow. Theorem6. Assume that Conditions 1–3 hold and s→∞. Then for any fixedx∈supp(X)⊂R d , wehave ED n (s)(x)=µ(x)+B(s) (3.13) with B(s)=Γ(2/d+1) f(x)tr(µ ′′ (x))+2µ ′ (x) T f ′ (x) 2dV 2/d d f(x) 1+2/d s −2/d +R(s), (3.14) R(s)= 8 > < > : O(s −3 ), d =1, O(s −4/d ), d≥2, 9 > = > ; . (3.15) whereV d = π d/2 Γ(1+d/2) ,Γ(·)isthegammafunction, f ′ (·)andµ ′ (·)denotethefirst-ordergradientsof f(·)andµ(·),respectively, f ′′ (·)andµ ′′ (·)representthed×d Hessianmatricesof f(·)andµ(·), respectively,andtr(·)standsforthetraceofagivenmatrix. Theorem 6 above shows that the first-order asymptotic bias of the single-scale DNN estimator D n (s)(x)isoforders −2/d ,andthesecond-orderasymptoticbiasisoforders −4/d ford≥2andof order s −3 for d =1. The rate of convergence for the bias term becomes slower as the feature di- mensionalityd grows,whichiscommonfornonparametricestimators. 
Itthuswouldbebeneficial toremovethefirst-orderasymptoticbiascompletelytoimprovethefinite-sampleperformance. 55 We relate our results to the existing literature. [21] showed that DNN achieves the optimal convergence rate of n 1/(d+2) under the Lipschitz continuity assumption on the regression function when d≥3. Our Theorem 6 is proved assuming the fourth-order smoothness condition (see Con- dition 2). Under the fourth-order smoothness condition, DNN does not achieve the nonparametric optimalrateof n 4/(d+8) ,mainlybecauseofthebias. Ourresultsrevealthatinsuchacase,biasre- ductionisneededforimprovedconvergencerate. Thefourth-ordersmoothnessconditionismainly usedtoobtaintheexplicitformofthecoefficientinfrontofthefirst-orderbiass −2/d andtheorder oftheremainderR(s),whicharecriticalforsuccessfulbiasreductionandalsoplayimportantroles in developing our asymptotic normality theory (which until now has been absent from literature). In addition, as mentioned in the Introduction, the de-biasing idea here can be similarly applied by constructing the multiscale DNN to further reduce the higher-order bias under even higher-order smoothnessassumption. Corollary3. Assume that Conditions 1–3 hold and s→∞. Then for any fixedx∈supp(X)⊂R d andthetwo-scaleDNNestimatorwithweightsdefinedin(3.9)–(3.10),wehave ED n (s 1 ,s 2 )(x)=µ(x)+R(s), (3.16) where R(s)= 8 > < > : O(s −3 ), d =1, O(s −4/d ), d≥2. 9 > = > ; . ByusingtheTDNNestimator,theasymptoticbiasreducestothesecond-ordertermO(s −4/d 1 + s −4/d 2 )ford≥2andO(s −3 1 +s −3 2 )ford =1. Corollary3isadirectconsequenceofTheorem6. We further characterize the asymptotic distribution of the single-scale DNN estimator in the followingtheorem,whichisnewtotheliterature. 56 Theorem 7. Assume that Conditions 1–3 hold, s→ ∞, and s = o(n). Then for any fixed x∈ supp(X)⊂R d ,itholdsthatforsomepositivesequenceσ n oforder (s/n) 1/2 , D n (s)(x)−µ(x)−B(s) σ n D −→N(0,1) (3.17) asn→∞,whereB(s)isgivenin(3.14). Theorem7requirestheassumptionsofs→∞ands=o(n),wheretheformerleadstovanishing biasandthelatterleadstocontrolledvarianceasymptotically. ThetechnicalanalysisofTheorem7 exploitsHoeffding’scanonicaldecomposition[62]whichisanextensionoftheHájekprojection. DespitetheU-statisticrepresentationofD n (s)(x)givenin(3.5),theclassicalU-statisticasymp- totic theory (e.g., [65, 66]) is not readily applicable because of the typical assumption of fixed subsampling scale s. In contrast, our method requires the opposite assumption of diverging sub- samplingscales. Such a statistic is called an infinite-order U-statistic (IOU) and has gained more interest in the recent literature; see, e.g., [52, 75–77]. Unfortunately, the assumptions on the kernel functions of the U-statistics in most of the IOU literature are not satisfied for the TDNN. For instance, [76] assumedthatthekernelsareconvergingasthesamplesizegrows. However,inourcase,thekernel Φ(x;Z i 1 ,Z i 2 ,...,Z i s ) =Y (1) (Z i 1 ,Z i 2 ,...,Z i s ) becomes degenerate as s tends to infinity. Another exampleis[75]whoconsideredscalar-valuedrandomvariables. Inourcase,Z i ’sarevector-valued andtherebytheresultsof[75]arenotreadilyapplicable. In [33], the asymptotic distribution of the random forests [1, 8, 9] estimator was studied via examining the asymptotic normality of the IOU. Both their proof and ours rely on Hoeffding’s decomposition of the U-statistics (and in particular IOUs) to establish the asymptotic normality. However, the main challenge in these proofs is controlling the variance of the first-order Hájek projection. 
This variance term takes different forms for nearest neighbors methods and tree based methods, and thus it needs to be handled differently for each case. For instance, Theorem 3.3 and Corollary 3 in [33] demonstrate bounds for tree based methods, which are not directly extendable 57 to the nearest neighbors methods. Instead, we use Lemma 15 in Section B.3.6 of Appendix B to boundvariancespecificallyforourmethod. Recently, [77] established convergence theory similar to our Theorem 7 under more general settingandmorecomplicatedassumptionswhichalsoconcernthekerneloftheU-statisticsandthe Hájekprojectionofthekernel. Incontrast,ourTheorem7isdevelopedundersimplerassumptions that are more targetedto the TDNN. It might be possible to check the conditions and then employ the results of [77] to prove our Theorem 7. However, the efforts on checking these assumptions canberathersignificantandevencomparabletothefulldevelopmentofourproof. We proceed with characterizing the asymptotic distribution for the two-scale DNN estimator introducedin(3.11). Theorem8. Assume that Conditions 1–3 hold, s 2 →∞, s 2 =o(n), and there exist some constants 0<c 1 <c 2 <1 such that c 1 ≤s 1 /s 2 ≤c 2 . Then for any fixedx∈supp(X)⊂R d , it holds that for somepositivesequenceσ n oforder (s 2 /n) 1/2 , D n (s 1 ,s 2 )(x)−µ(x)−Λ σ n D −→N(0,1) (3.18) asn→∞,whereΛ=O(s −4/d 1 +s −4/d 2 )ford≥2andΛ=O(s −3 1 +s −3 2 )ford =1. We note that the positive sequenceσ n in Theorem 8 is different from the sequenceσ n in The- orem 7, with the former representing the asymptotic standard deviation of the TDNN estimator and the latter representing the asymptotic standard deviation of the single-scale DNN estimator. We use the same generic notation for the convenience of technical presentation. Since the explicit form of the asymptotic standard deviation will not be used, this should not cause any confusion. Theorem 8 requires both subsampling scales s 1 and s 2 to diverge and be of smaller orders of the full sample size n in order to give the best trade off between the squared bias and variance. We would like to point out that Theorem 8 is not a simple consequence of Theorem 7, since marginal asymptotic normalities do not necessarily entail joint asymptotic normality. To deal with such a 58 technical difficulty, we have to jointly analyze the two single-scale DNN estimators. A key in- gredient of our technical analysis of Theorem 8 is to show that the TDNN estimator also admits a U-statistic representation, which enables us to exploit Hoeffding’s decomposition and calculate thevariancesofthekernelandtheassociatedfirst-orderHájekprojection. We also obtain the theorem below on the mean-squared error (MSE) of our TDNN estimator. Setting c = (s 1 /s 2 ) 2/d , the weights of the two single-scale DNN estimators are given by w ∗ 1 = c/(c−1)andw ∗ 2 =−1/(c−1)accordingto(3.9)and(3.10). Theorem 9. Assume that Conditions 1–3 hold, s 2 →∞, s 2 = o(n), and c is a constant in (0,1). Thenforanyfixedx∈supp(X)⊂R d ,wehavethatwhend≥2, E D n (s 1 ,s 2 )(x)−µ(x) 2 ≤ A (c−1) 2 n R 1 (x,d, f,µ)c −2 s −8/d 2 +σ 2 ε s 2 n o , (3.19) andwhend =1, E D n (s 1 ,s 2 )(x)−µ(x) 2 ≤ A (c−1) 2 n R 2 (x,d, f,µ)c −1 s −6 2 +σ 2 ε s 2 n o , (3.20) whereAissomepositiveconstant,andR 1 (x,d, f,µ)andR 2 (x,d, f,µ)aresomeconstantsdepend- ingontheboundsofthefirstfourderivativesof f(·)andµ(·)inaneighborhoodofx. Theorem 9 provides an upper bound for the pointwise MSE and such result can be applied easily to obtain the integrated MSE under some regularity conditions. 
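To connect the weights in Theorem 9 with the two-scale construction, note that $w_1^*+w_2^*=1$ while $w_1^*s_1^{-2/d}+w_2^*s_2^{-2/d}=0$, so the leading $s^{-2/d}$ bias term from Theorem 6 cancels in the combination. The following minimal sketch (illustrative only; it reuses the dnn_estimate function sketched in Section 3.1 and is not the authors' implementation) forms the TDNN estimate in exactly this way.

def tdnn_estimate(X, y, x0, s1, s2):
    # Two-scale DNN: a linear combination of two single-scale DNN estimates whose
    # weights sum to one and cancel the first-order s^{-2/d} bias, with
    # c = (s1/s2)^(2/d), w1 = c/(c-1), and w2 = -1/(c-1); cf. (3.9)-(3.11).
    d = X.shape[1]
    c = (s1 / s2) ** (2.0 / d)
    w1 = c / (c - 1.0)
    w2 = -1.0 / (c - 1.0)
    return w1 * dnn_estimate(X, y, x0, s1) + w2 * dnn_estimate(X, y, x0, s2)

With $s_1<s_2$ we have $c\in(0,1)$ and hence $w_1^*<0$, which is the source of the negative weights discussed below.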
The optimal choice of the subsampling scale s_2 in terms of achieving the best bias-variance trade-off is given by s_2 = O(n^{d/(8+d)}) for d >= 2 and s_2 = O(n^{1/7}) for d = 1, yielding the corresponding consistency rate of order O(n^{-4/(8+d)}) for d >= 2 and O(n^{-3/7}) for d = 1. Note that such a rate of convergence is minimax optimal (see, e.g., [4]) when d >= 2 under the smoothness assumptions in Condition 2. Compared to the result in [59], where the minimax optimal convergence rate for the single-scale DNN was obtained for d >= 3 under the Lipschitz continuity condition, our result remains minimax optimal when d >= 2 under the different smoothness assumptions in Condition 2.

3.4 Variance and distribution estimates for two-scale DNN estimator

3.4.1 Jackknife estimator

As unveiled in Lemma 16 in Section B.3.7 of Appendix B, the two-scale DNN estimator D_n(s_1, s_2)(x) with s_1 < s_2 [...]

It is seen that w_1^* < 0 and hence the TDNN estimator assigns negative weights to some nearest neighbors, which manages to reduce the bias of DNN. However, to control the variance of TDNN, the ratio c = s_2/s_1 should be chosen appropriately away from one.

With the above choice of weights, there are two parameters to tune for the TDNN: the subsampling scale s_1 and the ratio c = s_2/s_1. To tune the parameters for prediction at a given feature vector x, we perform a weighted leave-one-out cross-validation (LOOCV) procedure using each of the B nearest neighbors of x as a single left-out observation. Specifically, we set aside each of the B nearest neighbors of x and make a prediction for it using the TDNN estimator with all the remaining n - 1 observations and the given combination (c, s_1). The tuned (c, s_1) is then obtained by minimizing a weighted sum of the squared errors over those B left-out nearest neighbors, where the weights are defined by the standard Gaussian kernel distances of the nearest neighbors to the given feature vector x. Finally, we calculate our TDNN estimate D_n(s_1, c s_1)(x) for the given point x using the s_1 and c selected by our weighted LOOCV tuning procedure.

In our analysis, we always select the ratio c from a set of candidate values. Then, for a given value of c, we choose the subsampling scale s_1 through a sign-change tuning method. Specifically, for the prediction at a feature vector x, we compute the TDNN estimator D_n(s_1, c s_1)(x) for each consecutive s_1 starting from 1. We continue this process until the difference in the absolute differences of consecutive TDNN estimators changes sign. Intuitively, the sign change represents the value of s_1 at which the curvature of the TDNN estimator, as a function of s_1, changes. We denote the subsampling scale chosen by the sign-change tuning process as s_sign. This process is motivated by the curve structure in Figure 3.1 from the simulation example in Section 3.6.1. One issue with the simple sign-change tuning method is that we may risk selecting a value of s_1 that corresponds to a local minimum of the MSE of TDNN as a function of s_1. To mitigate such a concern, we consider a sequence of subsampling scales in the next step of our tuning process with s_sign as the lower limit and 2 s_sign as the upper limit, where the initial value s_sign given by the sign-change tuning method provides a warm start for, and specifies the order of, the tuning parameter s_1.
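The tuning rule above can be encoded as follows. This is a minimal sketch of one reading of the sign-change rule and of the weighted LOOCV criterion; the index convention in `sign_change_s1` and both function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sign_change_s1(tdnn_curve):
    """Given tdnn_curve[k] = D_n(s1, c*s1)(x) evaluated at consecutive s1 = 1, 2, ...,
    return s_sign, the first s1 at which the difference of consecutive absolute
    differences of the curve changes sign (the index-to-s1 mapping is approximate)."""
    a = np.abs(np.diff(tdnn_curve))      # |D(s1 + 1) - D(s1)|
    b = np.diff(a)                       # difference of consecutive absolute differences
    if b.size == 0:
        return len(tdnn_curve)
    flips = np.where(np.sign(b) != np.sign(b[0]))[0]
    return int(flips[0]) + 1 if flips.size else len(tdnn_curve)

def weighted_loocv_error(preds, y_heldout, dists_to_x):
    """Weighted LOOCV criterion: squared errors on the B left-out nearest neighbors
    of x, weighted by standard Gaussian kernel distances to x."""
    w = np.exp(-0.5 * dists_to_x ** 2)
    return float(np.sum(w * (preds - y_heldout) ** 2))
```

For each candidate c, s_1 would then be searched over [s_sign, 2 s_sign], and the pair (c, s_1) minimizing `weighted_loocv_error` over the B left-out nearest neighbors would be selected.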
3.6.1 Two-scale DNN versus DNN

To illustrate the effectiveness of the two-scale framework compared to the single-scale DNN, we simulate n = 1000 data points from the following model.

Setting 1. Assume that Y = mu(X) + epsilon, where
\[
\mu(\mathbf{x}) = (x_1 - 1)^2 + (x_2 + 1)^3 - 3 x_3
\]
with x = (x_1, x_2, x_3)^T and (X^T, epsilon)^T ~ N(0, I_4).

Our goal here is to compare the mean-squared error (MSE) of the TDNN estimator with those of the DNN and k-NN estimators at a fixed test point chosen to be (0.5, -0.5, 0.5)^T. For the implementation of the DNN, we estimate the regression function at this test point and calculate the MSE while varying the subsampling scale s from 1 to 250. For the TDNN, we estimate the regression function with fixed c = 2 for simplicity and s_1 varying from 1 to 250.

[Figure 3.1 - panels: bias of DNN, MSE of DNN, bias of TDNN, and MSE of TDNN plotted against the subsampling scale s; labeled minima: DNN MSE 0.1123, TDNN MSE 0.0603, tuned TDNN MSE 0.0643.]

Figure 3.1: The results of simulation setting 1 described in Section 3.6.1 for DNN and TDNN. The rows show the bias and MSE as functions of the subsampling scale s for DNN and TDNN, respectively. The top right panel also depicts a zoomed-in plot where the U-shaped pattern is more apparent. The dashed lines in the MSE plots are labeled with the minimum MSE value for each of the methods. The tuned TDNN MSE minimum corresponds to the weighted LOOCV tuning method described at the beginning of Section 3.6.

Figure 3.1 presents the simulation results for DNN and TDNN in terms of both the bias and the MSE. A first observation is that as the subsampling scale s increases, the bias of the DNN estimator shrinks toward zero, because a larger subsampling scale s leads to the use of the information in the sample concentrated around the fixed test point. From the MSE plot for DNN, we observe the classical U-shaped pattern of the bias-variance trade-off.

Thanks to the higher-order asymptotic expansions, the two-scale procedure of TDNN is completely free of the first-order asymptotic bias. The substantial difference between the dominating first-order asymptotic bias in DNN and the second-order asymptotic bias in TDNN at the finite-sample level is evident in the left panel of Figure 3.1. From the MSE plot for TDNN, we also see a similar bias-variance trade-off. An interesting phenomenon revealed by comparing the two smooth U-shaped curves in the right panel of Figure 3.1 is that the minimum of the MSE for TDNN is attained at a much smaller subsampling scale s than that for DNN. Furthermore, we observe that because of the reduced finite-sample bias, TDNN attains a more than 45% reduction of the minimum MSE compared to the single-scale DNN. We also show in the bottom right panel of Figure 3.1 the MSE obtained by our weighted LOOCV tuning procedure for TDNN described at the beginning of Section 3.6, without using any knowledge of the underlying true regression function. We see that our tuning procedure provides a good approximation to the true MSE despite considering a smaller range of subsampling scales in a data-adaptive way. Finally, an additional comparison of TDNN and k-NN is included in Section B.1.1 of Appendix B.
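A simplified Monte Carlo sketch of this comparison is given below. It reuses the illustrative `dnn_estimate` and `tdnn_estimate` helpers from the earlier sketch; the grid of scales, the fixed choice c = 2, and the number of replications are reduced for brevity and are not the exact settings used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 1000, 200
x0 = np.array([0.5, -0.5, 0.5])
mu0 = (x0[0] - 1) ** 2 + (x0[1] + 1) ** 3 - 3 * x0[2]   # true mu(x0) = -1.125

def simulate_once():
    X = rng.standard_normal((n, 3))
    eps = rng.standard_normal(n)
    y = (X[:, 0] - 1) ** 2 + (X[:, 1] + 1) ** 3 - 3 * X[:, 2] + eps
    return X, y

s_grid = [5, 10, 20, 40, 80]
dnn_err = {s: [] for s in s_grid}
tdnn_err = {s: [] for s in s_grid}
for _ in range(n_reps):
    X, y = simulate_once()
    for s in s_grid:
        dnn_err[s].append(dnn_estimate(X, y, x0, s) - mu0)
        tdnn_err[s].append(tdnn_estimate(X, y, x0, s, 2 * s) - mu0)   # c = 2

for s in s_grid:
    # Monte Carlo bias estimates: DNN shrinks slowly, TDNN is close to zero
    print(s, np.mean(dnn_err[s]), np.mean(tdnn_err[s]))
```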
3.6.2 Comparisons with DNN and k-NN for nonparametric inference

We further compare TDNN with DNN and k-NN over three simulation examples. The first two examples compare the estimation accuracy of each method in nonparametric regression settings, while the third example compares the ability of each method to estimate and infer the heterogeneous treatment effects (HTEs) under the setting of randomized experiments.

For each simulation setting, we use a training sample size of n = 1000, and the summary statistics are calculated based on 1000 simulation replications. Throughout our simulations, we estimate the variance of the TDNN estimator and the DNN estimator using the bootstrap method that has been theoretically justified in Section 3.4.2. As for inference by the k-NN estimator, we adopt the modeling strategy in [33] and model $\hat{\mu}_{kNN}$ as Gaussian with mean mu(x) and variance $\hat{\sigma}^2_{kNN}/(k-1)$, where $\hat{\sigma}^2_{kNN}$ is the sample variance over the k nearest neighbors. We tune our TDNN estimator using the weighted LOOCV tuning method by leaving out each of the B nearest neighbors of a given feature vector x to predict, which has been described at the beginning of Section 3.6.1. We also adopt the same weighted LOOCV tuning strategy for the DNN estimator. We employ the kknn R package [91] to tune the neighborhood size k for the k-NN estimator using leave-one-out cross-validation. In our simulation studies, B for the weighted LOOCV tuning procedure is always chosen as 20, the subsampling scale s for DNN varies from 1 to 250, and the neighborhood size k for k-NN varies from 1 to 200.

The first simulation setting in this section also uses Setting 1 described in Section 3.6.1. We evaluate the performance of TDNN, DNN, and k-NN in terms of the bias, variance, and MSE at a fixed test point (0.5, -0.5, 0.5)^T as well as for a set of 100 random test points drawn from the distribution of the covariates X ~ N(0, I_3). The MSE, bias, and variance for the set of random test points are obtained by averaging over all the random test points. For the TDNN estimator, the ratio c = s_2/s_1 is chosen from the sequence {2, 4, 6, 8, 10, 15, 20, 25, 30} for the random test points, and we fix c = 2 for the fixed test point for simplicity. The subsampling scale s_1 is chosen from the interval [s_sign, 2 s_sign] for each given c, where s_sign is given by the sign-change tuning process (related to the curvature) introduced at the beginning of Section 3.6.

                Fixed Test Point                 Random Test Points
  Method     MSE      Bias^2   Variance      MSE       Bias^2    Variance
  DNN        0.1249   0.0556   0.0623        15.0989   14.4701   0.5968
  k-NN       0.3207   0.0062   0.3114         9.4558    6.8510   2.2138
  TDNN       0.0576   0.0082   0.0464         7.3142    5.3296   1.5235

Table 3.1: Comparison of DNN, k-NN, and TDNN in simulation setting 1 described in Section 3.6.1.

We observe from Table 3.1 that for both the fixed test point and the random test points, the TDNN estimator significantly outperforms the DNN and k-NN estimators in terms of MSE. In addition, the improvement over the DNN is mainly due to the largely reduced bias, which is in line with our theory. In contrast, the TDNN has reduced variance compared to the k-NN, because TDNN is a bagged statistic and the bagging technique is known to be successful in variance reduction. We can see that the average MSE over the set of random test points is much larger than the MSE at the fixed test point (0.5, -0.5, 0.5). The main reason is that the covariate vector X is generated from a normal distribution and the density function at extreme values is close to zero, and thus the theoretical MSE can be very large for those extreme points. As a comparison, we also present the simulation results under the same setting except that X ~ U([0,1]^3) in Section B.1.2 of Appendix B. It is seen from Table 4.1 in Section B.1.2 of Appendix B that under the uniform distribution setting, the MSE for random test points is only slightly larger than the MSE at the fixed test point.
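Since Section 3.4.2 is not reproduced in this transcript, the sketch below shows a generic nonparametric bootstrap variance estimate paired with a normal confidence interval, together with the k-NN plug-in variance described above; it is only a stand-in consistent with the description here and may differ in detail from the bootstrap procedure justified in Section 3.4.2. The `estimator` argument is assumed to have the signature `estimator(X, y, x0)`, e.g., `lambda X, y, x0: tdnn_estimate(X, y, x0, s1, s2)` from the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_normal_ci(X, y, x0, estimator, n_boot=500, level=0.95, seed=0):
    """Nonparametric bootstrap variance estimate for a point estimator at x0,
    combined with a normal confidence interval centered at the full-sample estimate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    point = estimator(X, y, x0)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)            # resample rows with replacement
        reps[b] = estimator(X[idx], y[idx], x0)
    se = reps.std(ddof=1)
    z = norm.ppf(0.5 + level / 2)
    return point, se, (point - z * se, point + z * se)

def knn_normal_ci(y_neighbors, level=0.95):
    """k-NN plug-in inference: treat the k-NN estimate as Gaussian with variance
    equal to the sample variance over the k nearest neighbors divided by (k - 1)."""
    k = len(y_neighbors)
    est = y_neighbors.mean()
    se = np.sqrt(y_neighbors.var(ddof=1) / (k - 1))
    z = norm.ppf(0.5 + level / 2)
    return est, se, (est - z * se, est + z * se)
```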
For the second simulation setting, we investigate the performance of TDNN, DNN, and k-NN in the setting below, which is a modified version of a simulation setting first considered in [92].

Setting 2. Assume that Y = mu(X) + epsilon, where
\[
\mu(\mathbf{X}) = 4\,(4x_1 - 2 + 8x_2^2)^2 + (3 - 4x_2)^2 + 16\sqrt{x_3 + 1}\,(2x_3 - 1)^2
\]
with X = (x_1, ..., x_p)^T, X ~ U([0,1]^p), and epsilon ~ N(0, 1) independent of X. We increase the ambient dimensionality p along the sequence {3, 5, 10, 15, 20}.

                      Fixed Test Point                Random Test Points
  Method   p       MSE      Bias^2   Variance      MSE      Bias^2   Variance
  DNN      3       0.7266   0.5159   0.2379        5.6775   4.5131   1.2315
  k-NN     3       0.6630   0.1453   0.5158        5.1323   2.2046   2.8698
  TDNN     3       0.2594   0.0535   0.2707        3.4788   1.6190   2.0933
  DNN      5       0.7271   0.5176   0.2390        5.9057   4.6419   1.2638
  k-NN     5       0.6699   0.1664   0.5272        5.3712   2.3072   2.9450
  TDNN     5       0.2579   0.0699   0.2789        3.6883   1.6990   2.1640
  DNN      10      0.8756   0.5822   0.2602        6.2705   4.8632   1.3218
  k-NN     10      0.8297   0.1992   0.5643        5.7003   2.4493   3.0715
  TDNN     10      0.2867   0.0503   0.2987        3.9243   1.7798   2.3084
  DNN      15      0.9376   0.6434   0.2685        6.4583   4.9885   1.3466
  k-NN     15      0.7919   0.2213   0.5693        5.8418   2.5189   3.1275
  TDNN     15      0.2823   0.0439   0.3083        4.0509   1.8257   2.3743
  DNN      20      0.9653   0.6341   0.2735        6.8909   5.2649   1.4073
  k-NN     20      0.8174   0.1868   0.5729        6.2427   2.6994   3.2703
  TDNN     20      0.3298   0.0530   0.3276        4.4064   1.9583   2.5184

Table 3.2: Comparison of DNN, k-NN, and TDNN in simulation setting 2 described in Section 3.6.2.

Since the theoretical properties of TDNN established in this chapter rely on the assumption of fixed dimensionality, it is natural to expect that the performance of TDNN can deteriorate as the dimensionality grows. To alleviate such difficulty, we exploit the feature screening idea [6, 93, 94] for dimension reduction to accompany the implementation of TDNN. For the screening step, we test the null hypothesis of independence between the response and each feature using the nonparametric tool of the distance correlation statistic [95, 96] and calculate the corresponding p-value. We then select the features with p-values less than alpha/p for some significance level alpha in (0, 1) and make predictions using these selected features. For our simulation studies, we fix alpha = 0.001. For the TDNN estimator, the ratio c = s_2/s_1 is chosen from the sequence {2, 4, 6, 8, 10, 15, 20, 25, 30} for random test points, and we fix c = 2 for the fixed test point for simplicity. The subsampling scale s_1 is chosen from the interval [s_sign, 2 s_sign] for each given c, where s_sign is given by the sign-change tuning process introduced at the beginning of Section 3.6.

We again evaluate the performance of the three estimators at a fixed test point chosen as x_1 = 0.2, x_2 = 0.4, x_3 = 0.6, and x_j = 0.5 for j > 3, as well as for a set of 100 test points randomly drawn from the hypercube [0,1]^p.
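The screening step just described can be sketched as follows. The chapter presumably uses the (asymptotic) distance correlation test of [95, 96]; the permutation test here is only a self-contained stand-in, and the function names and defaults are illustrative.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays via double-centered
    pairwise distance matrices (V-statistic version)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def screen_features(X, y, alpha=0.001, n_perm=200, seed=0):
    """Keep feature j if a permutation p-value for dCor(X_j, y) is below alpha / p.
    Note: with a permutation test, n_perm must exceed p / alpha for any p-value
    to be able to fall below alpha / p; an asymptotic test avoids this issue."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    keep = []
    for j in range(p):
        obs = distance_correlation(X[:, j], y)
        perm = np.array([distance_correlation(X[:, j], y[rng.permutation(n)])
                         for _ in range(n_perm)])
        pval = (1 + np.sum(perm >= obs)) / (n_perm + 1)
        if pval < alpha / p:
            keep.append(j)
    return keep
```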
The simulation results in Table 3.2 show that the screening technique works well and that the TDNN estimator has significantly reduced MSEs compared to the single-scale DNN and k-NN estimators. Observe that although the density function of the covariates is uniform, the average MSE for random test points is larger than the MSE for the fixed test point, because the MSE also depends on the values of the regression function and its derivatives.

The first two simulation examples demonstrate the estimation accuracy of TDNN for general nonparametric regression, while the third one focuses on heterogeneous treatment effect (HTE) estimation and inference with confidence interval coverage. We use a modified version of the second simulation setting for causal inference in [33].

Setting 3. Assume that the treatment propensity e(x) = 0.5, the main effect m(x) = (1/8)(x_1 - 1) for the control group, and the treatment effect tau(x) = varsigma(x_1) varsigma(x_2) varsigma(x_3) with
\[
\varsigma(x) = 1 + \bigl\{1 + \exp\bigl(-20\,(x - \tfrac{1}{3})\bigr)\bigr\}^{-1}
\]
for the treatment group, where x = (x_1, ..., x_p)^T. Further assume that the feature vector X ~ U([0,1]^p) and the regression error epsilon ~ N(0, 1) independent of X for both groups. We increase the ambient dimensionality p along the sequence {3, 5, 10, 15, 20}.

                          Fixed Test Point                                     Random Test Points
  Method  p     MSE     Bias^2  Variance  Coverage  Width       MSE     Bias^2  Variance  Coverage  Width
  DNN     3     0.1511  0.0414  0.0977    0.816     1.1541      0.3152  0.1580  0.1066    0.6727    1.2215
  k-NN    3     0.1269  0.0517  0.0756    0.856     1.0702      0.3916  0.3130  0.0733    0.5340    1.0511
  TDNN    3     0.0899  0.0145  0.0836    0.948     1.1236      0.3022  0.0672  0.1670    0.8196    1.5124
  DNN     5     0.1706  0.0430  0.0967    0.801     1.1551      0.3204  0.1612  0.1061    0.6707    1.2188
  k-NN    5     0.1320  0.0560  0.0752    0.852     1.0676      0.4013  0.3208  0.0731    0.5262    1.0499
  TDNN    5     0.1008  0.0168  0.0833    0.915     1.1209      0.3063  0.0704  0.1668    0.8162    1.5112
  DNN     10    0.1600  0.0364  0.0987    0.833     1.1647      0.3337  0.1718  0.1083    0.6635    1.2305
  k-NN    10    0.1302  0.0489  0.0780    0.869     1.0866      0.4154  0.3325  0.0750    0.5251    1.0627
  TDNN    10    0.1014  0.0113  0.0852    0.934     1.1318      0.3174  0.0764  0.1722    0.8143    1.5336
  DNN     15    0.1687  0.0313  0.1019    0.825     1.1808      0.3428  0.1782  0.1093    0.6608    1.2361
  k-NN    15    0.1287  0.0500  0.0782    0.872     1.0868      0.4291  0.3427  0.0759    0.5201    1.0682
  TDNN    15    0.1021  0.0109  0.0888    0.923     1.1536      0.3237  0.0791  0.1746    0.8124    1.5445
  DNN     20    0.1628  0.0382  0.0985    0.820     1.1669      0.3394  0.1757  0.1094    0.6642    1.2368
  k-NN    20    0.1330  0.0497  0.0798    0.877     1.0981      0.4232  0.3366  0.0764    0.5248    1.0721
  TDNN    20    0.1061  0.0125  0.0892    0.927     1.1564      0.3215  0.0772  0.1748    0.8144    1.5464

Table 3.3: Comparison of DNN, k-NN, and TDNN in simulation setting 3 described in Section 3.6.2.

As with simulation setting 2 above, we evaluate the performance of the three nonparametric learning and inference methods at a fixed test point chosen as x_1 = 0.2, x_2 = 0.4, x_3 = 0.6, and x_j = 0.5 for j > 3, as well as for a set of 100 test points randomly drawn from the hypercube [0,1]^p. For the TDNN estimator, the ratio c = s_2/s_1 is chosen from the sequence {2, 4, 6, 8, 10, 15, 20, 25, 30} for random test points, and we fix c = 2 for the fixed test point for simplicity. The subsampling scale s_1 is chosen from the interval [s_sign, 2 s_sign] for each given c, where s_sign is given by the sign-change tuning process introduced at the beginning of Section 3.6. We apply the TDNN estimator to the treatment group and the control group separately, and then take the difference between the TDNN estimators for the two groups to estimate the HTE. In addition, we also report the coverage probability of 95% confidence intervals for the HTE constructed based on the asymptotic normality results established in Section 3.5. The DNN and k-NN estimators are similarly applied for estimation and inference of the HTE.
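The HTE construction described above (fit each arm separately and take the difference) can be sketched as follows, reusing the illustrative `tdnn_estimate` helper from the earlier sketch. The confidence interval here uses a bootstrap standard error rather than the asymptotic-normality plug-in of Section 3.5, which is not reproduced in this transcript; all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def hte_estimate(X, y, w, x0, s1, s2):
    """Plug-in HTE estimate at x0: difference of TDNN fits on the treated (w == 1)
    and control (w == 0) groups."""
    tau1 = tdnn_estimate(X[w == 1], y[w == 1], x0, s1, s2)
    tau0 = tdnn_estimate(X[w == 0], y[w == 0], x0, s1, s2)
    return tau1 - tau0

def hte_ci(X, y, w, x0, s1, s2, n_boot=300, level=0.95, seed=0):
    """Resample the pooled sample with replacement and form a normal confidence
    interval from the bootstrap standard error of the HTE estimate."""
    rng = np.random.default_rng(seed)
    point = hte_estimate(X, y, w, x0, s1, s2)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        reps[b] = hte_estimate(X[idx], y[idx], w[idx], x0, s1, s2)
    z = norm.ppf(0.5 + level / 2)
    se = reps.std(ddof=1)
    return point, (point - z * se, point + z * se)
```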
In particular, we see from the results in Table 3.3 that the TDNN estimator indeed provides lower MSEs for HTE estimation and valid confidence intervals for HTE inference with higher coverage compared to the DNN and k-NN estimators.

3.7 Real data application

In this section, we demonstrate the practical performance of the suggested TDNN procedure for nonparametric learning on the Abalone data set, which is available at the UCI repository (https://archive.ics.uci.edu/ml/datasets/abalone). The Abalone data set has been widely investigated in the literature for the illustration of various nonparametric regression methods; see, e.g., [8, 97] and [25]. This data set contains 4177 observations on 8 input variables and a response that represents the number of rings indicating the age of an abalone. The major goal of this real data application is to predict the response based on the information of the 8 input variables. Since the first input variable is categorical and consists of three categories indicating the sex (Male, Female, and Infant), we search for nearest neighbors only within each category. Consequently, there are 7 features after splitting the data set into the three categories. Because the nonparametric rate of convergence for the nearest neighbors methods becomes slower as the feature dimensionality grows, we exploit the popular tool of principal component analysis (PCA) to reduce the dimensionality of the feature space and employ the first m principal components for nonparametric learning. In our analysis, we choose m = 3 since the first three principal components account for more than 99% of the variation in the response.

Specifically, we randomly set aside 25% of the 4177 observations as a test set and train the TDNN estimator based on the remaining 75% of the observations. As mentioned in Section 3.6, the tuning of the two subsampling scales s_1 and s_2 is equivalent to that of the subsampling scale s_1 and their ratio c = s_2/s_1. We adopt the same strategy as described in Section 3.6 to tune both parameters s_1 and c for the TDNN in a data-adaptive fashion. For a given feature vector x in the test set, each of the B nearest neighbors of x is chosen as the left-out observation in the weighted LOOCV tuning procedure. The tuned (c, s_1) is then obtained by minimizing the weighted squared error over those B left-out observations, with the weights defined by the corresponding standard Gaussian kernel distances to the given feature vector x. Finally, we apply the TDNN estimator constructed with the tuned (c, s_1) to the test set and calculate the prediction error in terms of the MSE. The above procedure involving random data splitting is repeated 50 times and the prediction errors are averaged over those 50 random splits.

In particular, we tune (c, s_1) by considering values of c in {1.2, 1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20} and s_1 in [s_sign, 2 s_sign], with s_sign obtained by the sign-change tuning process (related to the curvature) introduced in Section 3.6. The subsampling scale s for the DNN estimator is chosen from the sequence starting from 50 to 250 with an increment of 5. We set the neighborhood size B = 50 for the implementation of the weighted LOOCV tuning procedure. We compare the prediction performance of the TDNN to that of the k-NN, DNN, and random forests (RF) in terms of the MSE evaluated on the test data. Table 3.4 summarizes the results of all the nonparametric learning methods on this real data application.

  Method   k-NN    DNN     TDNN    RF
  MSE      4.99    4.559   4.546   4.60

Table 3.4: The MSEs of different nonparametric learning methods on the real data application in Section 3.7.
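Before turning to the results, a rough sketch of the evaluation pipeline described above is given below. The data file name and column names follow the usual UCI layout but are assumed here, the subsampling scales are fixed rather than tuned, and the per-category neighbor search is handled simply by grouping on the sex column; this is an illustrative simplification, not the authors' pipeline, and it reuses the illustrative `tdnn_estimate` helper from the earlier sketch.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Assumed UCI column layout for abalone.data
cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
abalone = pd.read_csv("abalone.data", header=None, names=cols)
feats = [c for c in cols if c not in ("sex", "rings")]

rng = np.random.default_rng(0)
mse_per_split = []
for _ in range(50):                                   # 50 random 75/25 splits
    test_idx = rng.choice(len(abalone), size=len(abalone) // 4, replace=False)
    test_mask = np.zeros(len(abalone), bool)
    test_mask[test_idx] = True
    errors = []
    for sex, grp in abalone.groupby("sex"):           # neighbors searched within each sex category
        train, test = grp[~test_mask[grp.index]], grp[test_mask[grp.index]]
        pca = PCA(n_components=3).fit(train[feats])   # first three principal components
        Ztr, Zte = pca.transform(train[feats]), pca.transform(test[feats])
        ytr = train["rings"].to_numpy()
        for z, y_true in zip(Zte, test["rings"].to_numpy()):
            # fixed (s1, s2) in place of the weighted LOOCV tuning described above
            pred = tdnn_estimate(Ztr, ytr, z, s1=20, s2=40)
            errors.append((pred - y_true) ** 2)
    mse_per_split.append(np.mean(errors))
print(np.mean(mse_per_split))
```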
In particular, the results for the k-NN and RF are extracted from [25]. Indeed, from Table 3.4 we see that TDNN improves over both k-NN and DNN at the finite-sample level, which is in line with our theoretical results and simulation examples. Moreover, the TDNN also outperforms the RF.

3.8 Discussion

In this chapter, we have investigated the problems of estimation and inference for the nonparametric mean regression function using the two-scale DNN (TDNN), a bias-reduced estimator based on the distributional nearest neighbors (DNN). Our suggested method of TDNN alleviates the finite-sample bias issue of the classical k-nearest neighbors and admits easy implementation with simple tuning under the assumption of fourth-order smoothness on the mean regression function. We have provided theoretical justifications for the proposed estimator and established the asymptotic normality theory for practical use of TDNN in nonparametric statistical inference with optimality. Finally, we have also demonstrated that the new TDNN tool can be exploited for heterogeneous treatment effect (HTE) estimation and inference, which is key to identifying individualized treatment effects.

Chapter 4

Ongoing and future research

We conclude by discussing future research questions related to the previously discussed chapters.

4.1 Global control of random forest variance and allowing for fully-grown trees

In Chapter 2 we analyze the bias of all individual trees from a global perspective and, combined with our SID condition, we can precisely explain how column subsampling affects the asymptotic bias of random forests. However, we cannot apply this same global approach to our asymptotic variance analysis and instead must bound the variance of each tree. As a result, our variance upper bound is less precise because it is the most conservative bound over all possible values of the column subsampling parameter. Establishing global control of the random forests variance remains an interesting open question.

Additionally, our current results only apply to random forests with non-fully-grown trees, i.e., the height of each tree is upper-bounded by c log_2(n). However, in practice, trees are often grown until there is only one observation in each leaf, i.e., fully grown. Therefore, extending our current theoretical framework to reflect how random forests are used in practice is an essential line of future research.

4.2 Multiscale DNN and extensions to high-dimensional settings

Our bias reduction idea from Chapter 3 can be generalized to construct the Multiscale DNN when the mean regression function has even higher-order smoothness. In such a case, DNN or TDNN no longer enjoys the nonparametric minimax optimal convergence rate. By exploiting a higher-order asymptotic bias expansion, a Multiscale DNN can be constructed in the same fashion for achieving the nonparametric optimal convergence rate. We leave the detailed investigations for future study.

It would also be interesting to extend the idea of TDNN to the settings of diverging or high feature dimensionality and to consider non-i.i.d. data settings such as time series, panel, and survival data. Since the distance function plays a natural role in identifying the nearest neighbors, it would be interesting to investigate the choice of different distance metrics, aside from the Euclidean distance, that are pertinent to specific manifold structures intrinsic to the data. These problems are beyond the scope of this dissertation and will be interesting topics for future research.

References

1. Chi, C.-M., Vossler, P., Fan, Y. & Lv, J. Asymptotic properties of high-dimensional random forests. arXiv preprint arXiv:2004.13953 (2020).

2.
Demirkaya, E., Fan, Y., Gao, L., Lv, J., Vossler, P. & Wang, J. Optimal Nonparametric In- ferencewithTwo-ScaleDistributionalNearestNeighbors.arXivpreprintarXiv:1808.08469 (2021). 3. Stigler, S. M. Gauss and the invention of least squares. the Annals of Statistics, 465–474 (1981). 4. Stone, C. J. Optimal global rates of convergence for nonparametric regression. The annals ofstatistics,1040–1053(1982). 5. Fan, J. & Lv, J. Sure independence screening for ultrahigh dimensional feature space (with discussion).JournaloftheRoyalStatisticalSocietySeriesB70,849–911(2008). 6. Fan, J. & Fan, Y. High dimensional classification using features annealed independence rules.Annalsofstatistics36,2605(2008). 7. Fan,J.&Lv,J.Sureindependencescreening(invitedreviewarticle).WileyStatsRef:Statis- ticsReferenceOnline(2018). 8. Breiman,L.Randomforests.MachineLearning45,5–32(2001). 9. Breiman,L.Manualonsettingup,using,andunderstandingrandomforestsv3.1. Statistics DepartmentUniversityofCaliforniaBerkeley,CA,USA1,58(2002). 10. Díaz-Uriarte, R. & De Andres, S. A. Gene selection and classification of microarray data usingrandomforest.BMCBioinformatics7,3(2006). 11. Capitaine, L., Genuer, R. & Thiébaut, R. Random forests for high-dimensional lon- gitudinal data. Statistical Methods in Medical Research 30, 166–184. doi:10.1177/ 0962280220946080 (2021). 12. Syrgkanis, V. & Zampetakis, M. Estimation and inference with trees and forests in high dimensionsinConferenceonLearningTheory(2020),3453–3454. 13. Sagi, O. & Rokach, L. Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery8,e1249.doi:https://doi.org/10.1002/widm.1249 (2018). 14. Dietterich, T. G. Ensemble Methods in Machine Learning in Multiple Classifier Systems (SpringerBerlinHeidelberg,Berlin,Heidelberg,2000),1–15. 15. Breiman,L.Baggingpredictors.Machinelearning24,123–140(1996). 16. Bühlmann,P.&Yu,B.Analyzingbagging.TheannalsofStatistics30,927–961(2002). 17. Friedman,J.H.&Hall,P.Onbaggingandnonlinearestimation.Journalofstatisticalplan- ningandinference137,669–683(2007). 18. Hall,P.&Samworth,R.J.Propertiesofbaggednearestneighbourclassifiers.J.R.Stat.Soc. Ser.BStat.Methodol.67,363–379.doi:10.1111/j.1467-9868.2005.00506.x (2005). 19. Buja,A.&Stuetzle,W.Observationsonbagging.StatisticaSinica,323–351(2006). 81 20. Biau, G., Devroye, L. & Lugosi, G. Consistency of random forests and other averaging classifiers.JournalofMachineLearningResearch9,2015–2033(2008). 21. Biau,G.,Cérou,F.&Guyader,A.Ontherateofconvergenceofthebaggednearestneighbor estimate.JournalofMachineLearningResearch11,687–712(2010). 22. Silverman,B.W.&Jones,M.C.E.FixandJ.L.Hodges(1951):AnImportantContribution to Nonparametric Discriminant Analysis and Density Estimation: Commentary on Fix and Hodges (1951). International Statistical Review / Revue Internationale de Statistique 57, 233–238(1989). 23. Cover,T.&Hart, P.Nearestneighborpatternclassification. IEEE transactions on informa- tiontheory13,21–27(1967). 24. Devroye,L.,Györfi,L.&Lugosi,G.Aprobabilistictheoryofpatternrecognition(Springer Science&BusinessMedia,2013). 25. Steele, B. M. Exact bootstrap k-nearest neighbor learners. Machine Learning 74, 235–255 (2009). 26. Varian, H. R. Big data: New tricks for econometrics. Journal of Economic Perspectives28, 3–28(2014). 27. Howard, J. & Bowles, M. The two most important algorithms in predictive modeling today inStrataConferencepresentation28(2012). 28. Khaidem, L., Saha, S. & Dey, S. R. Predicting the direction of stock market prices using randomforest.arXivpreprintarXiv:1605.00003(2016). 29. 
Qi,Y.inEnsembleMachineLearning307–323(Springer,2012). 30. Gislason, P. O., Benediktsson, J. A. & Sveinsson, J. R. Random forests for land cover clas- sification.PatternRecognitionLetters27,294–300(2006). 31. Goldstein,B.A.,Polley,E.C.&Briggs,F.B.Randomforestsforgeneticassociationstud- ies.StatisticalApplicationsinGeneticsandMolecularBiology10,32(2011). 32. Mentch, L. & Hooker, G. Ensemble trees and CLTs: Statistical inference for supervised learning.arXivpreprintarXiv:1404.6473(2014). 33. Wager, S. & Athey, S. Estimation and inference of heterogeneous treatment effects using randomforests.JournaloftheAmericanStatisticalAssociation113,1228–1242(2018). 34. Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. TheAnnalsofAppliedStatistics2,841–860(2008). 35. Ishwaran, H. & Kogalur, U. B. Consistency of random survival forests. Statistics & Proba- bilityLetters80,1056–1064(2010). 36. Biau,G.&Scornet,E.Arandomforestguidedtour.Test25,197–227(2016). 37. Biau, G. Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095(2012). 38. Bai,Z.-D.,Devroye,L.,Hwang,H.-K.&Tsai,T.-H.Maximainhypercubes.RandomStruc- tures&Algorithms27,290–309(2005). 39. Genuer,R.Variancereductioninpurelyrandomforests.JournalofNonparametricStatistics 24,543–562(2012). 40. Zhu,R.,Zeng,D.&Kosorok,M.R.Reinforcementlearningtrees.JournaloftheAmerican StatisticalAssociation110,1770–1784(2015). 41. Scornet, E., Biau, G. & Vert, J.-P. Consistency of random forests. The Annals of Statistics 43,1716–1741(2015). 82 42. Klusowski, J. M. Sharp Analysis of a Simple Model for Random Forests in Proceedings of The24thInternationalConferenceonArtificialIntelligenceandStatistics(edsBanerjee,A. &Fukumizu,K.)130(2021),757–765. 43. Wager, S. & Athey, S. Estimation and inference of heterogeneous treatment effects using randomforests.JournaloftheAmericanStatisticalAssociation113,1228–1242(2018). 44. Wager,S.,Hastie,T.&Efron,B.Confidenceintervalsforrandomforests:Thejackknifeand theinfinitesimaljackknife.JournalofMachineLearningResearch15,1625–1651(2014). 45. Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Advances in neural information processing systems 26, 431– 439(2013). 46. Scornet, E. Trees, forests, and impurity-based variable importance. arXiv preprint arXiv:2001.04295(2020). 47. Mourtada, J., Gaïffas, S. & Scornet, E. Minimax optimal rates for Mondrian trees and forests.TheAnnalsofStatistics48,2253–2276.doi:10.1214/19-AOS1886 (2020). 48. Roy,D.M.,Teh,Y.W.,etal.TheMondrianProcess.inNIPS (2008),1377–1384. 49. Liaw, A. & Wiener, M. Classification and Regression by randomForest. R News 2, 18–22 (2002). 50. Klusowski,J.M.AnalyzingCART.arXivpreprintarXiv:1906.10086 (2019). 51. Lin,Y.&Jeon,Y.Randomforestsandadaptivenearestneighbors.JournaloftheAmerican StatisticalAssociation101,578–590(2006). 52. Athey, S., Tibshirani, J., Wager, S., et al. Generalized random forests. Annals of Statistics 47,1148–1178(2019). 53. Fan, J., Feng, Y. & Song, R. Nonparametric independence screening in sparse ultra-high- dimensionaladditivemodels.JournaloftheAmericanStatisticalAssociation106,544–557 (2011). 54. Stone, C. J. Consistent nonparametric regression. The Annals of Statistics 5, 595–620 (1977). 55. Nobel, A. Histogram regression estimation using data-dependent partitions. The Annals of Statistics24,1084–1105(1996). 56. Chen,G.H., Shah,D., et al. 
Explainingthesuccessofnearestneighbormethodsinpredic- tion.FoundationsandTrends˝ oinMachineLearning10,337–588(2018). 57. Mack,Y.LocalPropertiesofk-NNRegressionEstimates. SIAM Journalon AlgebraicDis- creteMethods2,311–323.doi:10.1137/0602035 (1980). 58. Györfi,L.,Kohler,M.,Krzyak,A.&Walk,H.ADistribution-FreeTheoryofNonparametric Regressiondoi:10.1007/b97848 (Springer,2002). 59. Biau,G.&Devroye,L.Lecturesonthenearestneighbormethod (Springer,2015). 60. Berrett, T. B., Samworth, R. J. & Yuan, M. Efficient multivariate entropy estimation via k-nearestneighbourdistances.TheAnnalsofStatistics47,288–318(2019). 61. Lin, Z., Ding, P. & Han, F. Estimation based on nearest neighbor matching: from density ratiotoaveragetreatmenteffect.arXivpreprintarXiv:2112.13506 (2021). 62. Hoeffding, W. A Class of Statistics with Asymptotically Normal Distribution. en. The An- nalsofMathematicalStatistics19,293–325.doi:10.1214/aoms/1177730196 (1948). 63. Samworth,R.J.Optimalweightednearestneighbourclassifiers.en.TheAnnalsofStatistics 40,2733–2763.doi:10.1214/12-AOS1049 (2012). 83 64. Hájek, J. Asymptotic normality of simple linear rank statistics under alternatives. The An- nalsofMathematicalStatistics39,325–346.doi:10.1214/aoms/1177698394 (1968). 65. Korolyuk, V. S. & Borovskich, Y. V. Theory of U-statistics doi:10.1007/978-94-017- 3515-5 (Springer,1994). 66. Serfling, R. J. Approximation Theorems of Mathematical Statistics doi:10 . 1002 / 9780470316481 (WileySeriesinProbabilityandStatistics,1980). 67. Hall,P.Effectofbiasestimationoncoverageaccuracyofbootstrapconfidenceintervalsfor aprobabilitydensity.TheAnnalsofStatistics,675–694(1992). 68. Schucany, W. & Sommers, J. P. Improvement of kernel type density estimators. Journal of theAmericanStatisticalAssociation72,420–423(1977). 69. Calonico,S.,Cattaneo,M.D.&Farrell,M.H.Ontheeffectofbiasestimationoncoverage accuracy in nonparametric inference. Journal of the American Statistical Association 113, 767–779(2018). 70. Newey, W. K., Hsieh, F. & Robins, J. M. Twicing kernels and a small bias property of semiparametricestimators.Econometrica72,947–962(2004). 71. Cheang,W.-K.&Reinsel,G.C.Biasreductionofautoregressiveestimatesintimeseriesre- gressionmodelthroughrestrictedmaximumlikelihood.JournaloftheAmericanStatistical Association95,1173–1184(2000). 72. Leblanc, A. A bias-reduced approach to density estimation using Bernstein polynomials. JournalofNonparametricStatistics22,459–475(2010). 73. Hall, P. The bootstrap and Edgeworth expansion (Springer Science & Business Media, 2013). 74. Schucany,W.,Gray,H.&Owen,D.Onbiasreductioninestimation. Journalof the Ameri- canStatisticalAssociation66,524–533(1971). 75. Borovskikh,I.I.V.U-statisticsinBanachSpaces(VSP,1996). 76. Frees,E.W.InfiniteorderU-statistics.ScandinavianJournalofStatistics,29–45(1989). 77. Song, Y., Chen, X., Kato, K., et al. Approximating high-dimensional infinite-order U- statistics: Statistical and computational guarantees. Electronic Journal of Statistics 13, 4794–4848(2019). 78. Quenouille, M. H. Approximate tests of correlation in time-series. J. Roy. Statist. Soc. Ser. B11,68–84(1949). 79. Quenouille, M. H. Notes on bias in estimation. Biometrika 43, 353–360. doi:10.1093/ biomet/43.3-4.353 (1956). 80. Arvesen, J. N. JackknifingU-statistics. Ann. Math. Statist. 40, 2076–2100. doi:10.1214/ aoms/1177697287 (1969). 81. Efron,B.Bootstrapmethods:anotherlookatthejackknife.Ann.Statist.7,1–26(1979). 82. Rubin, D. B. 
Estimating causal effects of treatments in randomized and nonrandomized studies.JournalofEducationalPsychology66,688–701(1974). 83. Imbens, G. W. & Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sci- ences(CambridgeUniversityPress,2015). 84. Crump,R.K.,Hotz,V.J.,Imbens,G.W.&Mitnik,O.A.Nonparametrictestsfortreatment effectheterogeneity.ReviewofEconomicsandStatistics90,389–405(2008). 85. Lee, M.-j. Non-parametric tests for distributional treatment effect for randomly censored responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 243–264(2009). 84 86. Shalit, U., Johansson, F. D. & Sontag, D. Estimating individual treatment effect: gener- alization bounds and algorithms in Proceedings of the 34th International Conference on MachineLearning-Volume70(2017),3076–3085. 87. Hahn, P. R., Murray, J. S., Carvalho, C. M., et al. Bayesian regression tree models for causalinference:regularization,confounding,andheterogeneouseffects.BayesianAnalysis (2020). 88. Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., et al. Some meth- ods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint arXiv:1707.00102(2017). 89. Zaidi,A.&Mukherjee,S.GaussianProcessMixturesforEstimatingHeterogeneousTreat- mentEffects.arXivpreprintarXiv:1812.07153(2018). 90. Hitsch,G.J.&Misra,S.Heterogeneoustreatmenteffectsandoptimaltargetingpolicyeval- uation.AvailableatSSRN3111957 (2018). 91. Hechenbichler, K. & Schliep, K. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification2004. 92. Dette, H. & Pepelyshev, A. Generalized Latin Hypercube Design for Computer Experi- ments.Technometrics52,421–429.doi:10.1198/TECH.2010.09157 (2010). 93. Fan, J. & Lv, J. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journalof theRoyal Statistical Society: Series B (Statistical Methodology)70, 849–911.doi:10.1111/j.1467-9868.2008.00674.x (2008). 94. Fan,J.&Lv,J.Sureindependencescreening(invitedreviewarticle).WileyStatsRef:Statis- ticsReferenceOnline(2018). 95. Székely, G. J., Rizzo, M. L. & Bakirov, N. K. Measuring and testing dependence by corre- lationofdistances.TheAnnalsofStatistics35,2769–2794(2007). 96. Gao, L., Fan, Y., Lv, J. & Shao, Q. Asymptotic distributions of high-dimensional distance correlationinference.TheAnnalsofStatistics49,1999–2020(2021). 97. Breiman,L.Usingadaptivebaggingtodebiasregressionstech.rep.(TechnicalReport547, StatisticsDept.UCB,1999). 98. Bennett, G. Probability inequalities for the sum of independent random variables. Journal oftheAmericanStatisticalAssociation57,33–45(1962). 99. Bernstein, S. Sur une modification de linéqualité de Tchebichef. Annal. Sci. Inst. Sav. Ukr. Sect.Math.I,38–49(1924). 100. Borovkov,A.A.ProbabilityTheorydoi:10.1007/978-1-4471-5201-9 (Springer,2013). 101. Berry,A.C.TheaccuracyoftheGaussianapproximationtothesumofindependentvariates. Trans.Amer.Math.Soc.49,122–136.doi:10.2307/1990053 (1941). 102. Peng, W., Coleman, T. & Mentch, L. Asymptotic Distributions and Rates of Convergence forRandomForestsviaGeneralizedU-statistics.arXivpreprintarXiv:1905.10651(2019). 85 Appendices A Chapter2Proofs ThisappendixcontainstheproofsofallmainresultsandtechnicallemmasdiscussedinChapter2 as well as some additional technical details. All the notation is the same as defined in Chapter 2. WeuseC todenoteagenericpositiveconstantwhosevaluemaychangefromlinetoline. 
A.1 Proofsofmainresults A.1.1 Technicalpreparation Wenowdescribefurtherdetailsofthe gridintroducedinSection 2.5fortheanalysispurpose. For anys∈{1,···,p}andinteger0≤q<⌈n 1+ρ 1 ⌉,define 1 t t t(s,q):=[0,1] s−1 ×[b q ,b q+1 )×[0,1] p−s , where we recall that b i = i ⌈n 1+ρ 1 ⌉ ’s are the grid points defined in Section 2.5; the parameter ρ 1 is defined in the same section. Let us assume that Condition 2 is satisfied with f(·) denoting the densityfunctionofthedistributionofX X X. Thenitholdsthatforeachρ 1 >0andn≥1, sup s,q P(X X X∈t t t(s,q))≤ sup f ⌈n 1+ρ 1 ⌉ . (A.1) 1 We can let the interval [b ⌈n 1+ρ 1⌉−1 ,b ⌈n 1+ρ 1⌉ ] have a closed right end. Since we assume that the density of the distributionofX X X exists,itdoesnotaffectourtechnicalanalysis. 86 Thus, for eachρ 1 >0, n≥1, positive integer k, and each nodet t t with at most k boundaries not on thegridhyperplanes(e.g.,fortheleftplotinFigure2.3,thebluenodehas4boundariesnotonthe gridhyperplanes,whereastheredonehasonly2),wehave sup t t t P(X X X∈t t t∆t t t # )≤k× sup f ⌈n 1+ρ 1 ⌉ , (A.2) where the supremum is over all possible sucht t t and A∆B:=(A∩B c )∪(A c ∩B) for any two sets A andB. Observethat(A.2)appliestoallnodesconstructedbyatmostk cuts. Let p-dimensional random vectors x x x i ,i = 1,···,n, be independent and identically distributed (i.i.d.) with the same distribution as X X X. Let ρ 2 > 0 be given. We next show that if Condition 2 holds,itfollowsfrom(A.1)thatforeach p≥1andalllargen, P ∪ s,q n #{i:x x x i ∈t t t(s,q)}≥⌈(logn) 1+ρ 2 ⌉ o ≤ p⌈n 1+ρ 1 ⌉ sup f n ρ 1 (logn) 1+ρ 2 . (A.3) wheretheunionisoverallpossibles∈{1,···,p}and0≤q<⌈n 1+ρ 1 ⌉. Todevelopsomeintuition for(A.3),notethatifP(x x x i ∈t t t(s,q))=cn −1 forsomeconstantc>0,then#{i:x x x i ∈t t t(s,q)}hasan asymptoticPoissondistributionwithmeanc. Moreover,theprobabilityupperboundin(A.1)isin factmuchsmallerthann −1 asymptotically. Toestablish(A.3),adirectcalculationshowsthat P ∪ s,q n #{i:x x x i ∈t t t(s,q)}≥⌈(logn) 1+ρ 2 ⌉ o ≤ p⌈n 1+ρ 1 ⌉sup s,q P #{i:x x x i ∈t t t(s,q)}≥⌈(logn) 1+ρ 2 ⌉ = p⌈n 1+ρ 1 ⌉×sup s,q n ∑ l≥l 0 n l P(x x x i ∈t t t(s,q)) l 1−P(x x x i ∈t t t(s,q)) n−l ! , (A.4) 87 where l 0 =⌈(logn) 1+ρ 2 ⌉. Since the cumulative probability inside the parentheses on the RHS of ((A.4)) is an increasing function ofP(x x x i ∈t t t(s,q)), it follows from (A.1) and Condition 2 that foralllargen, RHSof(A.4)≤ p⌈n 1+ρ 1 ⌉× n ∑ l≥l 0 n l sup f ⌈n 1+ρ 1 ⌉ l 1− sup f ⌈n 1+ρ 1 ⌉ n−l ! ≤ p⌈n 1+ρ 1 ⌉ (l 0 !) −1 n ∑ l≥l 0 sup f n ρ 1 l ! ≤ p⌈n 1+ρ 1 ⌉ sup f n ρ 1 l 0 1− sup f n ρ 1 −1 (l 0 !) −1 ≤ p⌈n 1+ρ 1 ⌉ sup f n ρ 1 l 0 , (A.5) wherel!:=1×···×l. Thiscompletestheproofof(A.3). WedenotetheeventontheLHSof(A.3)byA withρ 1 ,ρ 2 >0asfollows. A := ∪ s∈{1,···,p},0≤q<⌈n 1+ρ 1⌉ n #{i:x x x i ∈t t t(s,q)}≥⌈(logn) 1+ρ 2 ⌉ o c =∩ s∈{1,···,p},0≤q<⌈n 1+ρ 1⌉ n #{i:x x x i ∈t t t(s,q)}<(logn) 1+ρ 2 o . (A.6) OneventA,itholdsthatforeachnodet t t constructedusingatmostk cuts, #{i:x x x i ∈t t t∆t t t # }<k(logn) 1+ρ 2 . (A.7) We next provide an upper bound on the number of conditional means required to be estimated. Define G n,k as the set containing all nodes constructed by at most k cuts with cuts all on the grid hyperplanes. We can see that there are at most (p(⌈n 1+ρ 1 ⌉+1)) k distinct choices of k cuts on the grid hyperplanes. Furthermore, each of these k cuts results in at most 2 k nodes, which are all possiblenodesgrownbythegivenk cuts. Thus,wecanobtainthat #G n,k ≤2 k p(⌈n 1+ρ 1 ⌉+1) k . 
(A.8) 88 A.2 AdditionalexamplesforSID We provide three additional examples for showing the flexibility of SID. In particular, Example 6 below is an example of Example 3, and Example 7 considers regression function m(X X X) that is not monotonic. Example8isanon-additivemodelwithalinearcombinationofintercepts. Theproofs fortheseexamplesarerespectivelyinSectionsA.5.8–A.6.1. Example6. AssumethatX X X isuniformlydistributedin [0,1] p ,andletS ∗ besomesubsetof {1,...,p}. 1) Letm(X X X)= exp(∑ j∈S ∗β j X j ) 1+exp(∑ j∈S ∗β j X j ) begivenwith|β j |̸=0. Then, m(X X X)∈SID 4 #S ∗ max j∈S ∗|β j | min j∈S ∗|β j | 2 ×exp 2 ∑ j∈S ∗ |β j | . 2) Let m(X X X) = ∑ k 1 k=1 β kk Π j∈T k X r jk j +∑ j∈S ∗β j X j be given with r jk ’s being positive integers, ∪ k 1 k=1 T k ⊂S ∗ ,andallpositive(orallnegative)β kk ’sandβ j ’s. Then m(X X X)∈SID 4(#S ∗ ) 2 (max j,k r jk )∑ k 1 k=1 |β kk |+max j∈S ∗|β j | min j∈S ∗|β j | 2 . Example7. AssumethatX X X isuniformlydistributedin [0,1] p with p≥s ∗ forsomepositiveinteger s ∗ . Theregressionfunctionisdefinedasm(X X X):=∑ s ∗ j=1 m j (X j ),whereforeach j≤s ∗ andx∈[0,1], m j (x):=h j,K (x)1 1 1 [b j,K−1 ,b j,K ] + K−1 ∑ k=1 h j,k (x)1 1 1 [b j,k−1 ,b j,k ) with some integer K > 0, linear functions h j,1 ,···,h j,K such that m j (x) is continuous and that r≤| dh j,k (x) dx |≤R for all j,k with some R≥r >0, and constants b j,k ’s such that 0 =b j,0 <···< b j,K =1. Then,m(X X X)∈SID s ∗ 1024R 5 (b ∗ ) 3 r 5 ,whereb ∗ :=min j≤s ∗ ,1≤k≤K (b j,k −b j,k−1 ). Example8. AssumethatX X X isuniformlydistributedin [0,1] p with p≥s ∗ forsomepositiveinteger s ∗ . Let positive integer k j be given and 0 = c (j) 0 <··· < c (j) k j = 1 be real numbers for each j = 89 1···,s ∗ . Letβ(i 1 ,···,i s ∗)with1≤i j ≤k j and1≤ j≤s ∗ berealcoefficientssuchthatforsomeι > 0, it holds that for each j, either 1) for every (i 1 ,···,i s ∗) with i j ≥2,∆β :=β(i 1 ,···,i j ,···,i s ∗)− β(i 1 ,···,i j −1,···,i s ∗)≥ι, or 2) for every (i 1 ,···,i s ∗) with i j ≥ 2, ∆β≤−ι. The regression functionisdefinedtobe m(X X X)= k 1 ∑ i 1 =1 ··· k s ∗ ∑ i s ∗=1 β(i 1 ,···,i s ∗)Π s ∗ j=1 1 1 1 X j ∈[c (j) i j −1 ,c (j) i j ) . In addition, assume that sup c c c∈[0,1] p|m(c c c)|≤ M 0 . Then, m(X X X)∈ SID s ∗ c † (1−c † ) 2M 0 ι 2 , where c † :=min{ 1 4 ,min j≤s ∗ ,1≤i≤k j {c (j) i −c (j) i−1 }}. A.3 ProofofTheorem1 We begin with considering the case when a contains the full sample. We will apply standard inequalities to separate the L 2 loss into two terms that can be dealt with by Lemmas 1 and 2, respectively,toobtaintheconclusionin(2.17). Observethattheresultsintheselemmasareappli- cable to the case of any a with #a =⌈bn⌉ and without replacement. The other case with sample subsamplingcanbedealtwithsimilarlybyanapplicationofJensen’sinequality. Let us first examine the case without sample subsampling. By Jensen’s inequality and the triangleinequality,wecandeducethat E m(X X X)−E b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 ≤E m(X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 =E m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) +m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 ≤2 E m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 +E m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 . (A.9) Thisresultisalsoshownin(2.11). 90 ByLemma1,itholdsthatforalllargenandeach1≤k≤clogn, E m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 ≤8M 2 0 n −δ 2 k +2α 1 α 2 n −η +2M 2 0 (1−γ 0 (α 1 α 2 ) −1 ) k +2n −1 . (A.10) Letν >0 be sufficiently small. 
Then by Lemma 2, there exists some constantC>0 such that for alllargenandeach1≤k≤clog 2 n, E m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 ≤n −η +C2 k n − 1 2 +ν . (A.11) In view of (A.9)–(A.11), we can conclude that there exists some constantC >0 such that for all largenandeach1≤k≤clog 2 n, E m(X X X)−E b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 ≤C α 1 α 2 n −η +(1−γ 0 (α 1 α 2 ) −1 ) k +n −δ+c . The above result uses the fact that n − 1 2 +ν+c =o(n −δ+c ) due to a smallν. Thus, replacing n with ⌈bn⌉leadsto(2.17). Toshowthesecondassertion,weuseJensen’sinequalitytoobtainthat E m(X X X)−B −1 ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 =E h B −1 ∑ a∈A m(X X X)−E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n i 2 ≤B −1 ∑ a∈A E m(X X X)−E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 . CombiningthisresultandthefirstassertioncompletestheproofofTheorem1. A.3.1 ProofofTheorem2 Recall that X n ={x x x i ,y i } n i=1 are i.i.d. training data, and X X X is the independent copy of x x x 1 = (x 11 ,···,x 1p ) T . When the jth feature is not involved in the random forest training procedure, 91 the random forests estimate (2.8) is trained on{(y i ,x i1 ,···,x i(j−1) ,x i(j+1) ,···,x ip )} n i=1 . We first showthatsucharandomforestsestimateis (X X X −j ,X n )-measurable,where X X X −j :=(X 1 ,···,X j−1 ,X j+1 ,···,X p ) T . Then by the independence between X X X andX n , we can resort to the projection theorem to obtain thedesiredconclusion. Letusbegintheformalproof. Wedenotesucharandomforestestimateby 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n (A.12) Inorderthattheconditionalexpectationin(A.12)iswell-defined,weuseConditions3–4toensure the existence of the first moment of the integrand in (A.12). Specifically, by Conditions 3–4, it holdsthatforeacha∈A, E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) <∞, andhence(A.12)iswell-defined. By assumption, during the training phase, the jth feature is not involved, which entails that b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,c c c 1 ,X n ) = b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,c c c 2 ,X n ) for each c c c i := (c i1 ,···,c ip ) T ∈ [0,1] p ,i = 1,2,withc 1l =c 2l forl̸= j. Thenitfollowsthat 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1:k ,X X X,X n ) Θ Θ Θ 1:k ,X X X,X n = 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1:k ,X X X,X n ) Θ Θ Θ 1:k ,X X X −j ,X n . Inviewofthisresult,wecanseethat 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n is (X X X −j ,X n )-measurable. (A.13) 92 SinceX n isindependentofX X X,wehave Var m(X X X)|X X X −j ,X n =Var m(X X X)|X X X −j . (A.14) Bythedefinitionofrelevantfeatures,wecandeducethat E m(X X X)− 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 =E n E h m(X X X)− 1 B ∑ a∈A E b m b T a ,a (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 X X X −j ,X n io ≥E Var m(X X X)|X X X −j ,X n =E Var m(X X X)|X s ,s∈{1,···,p}\{j} ≥ι. Here, in the first inequality, we apply (A.13) and the projection theorem. For the second equality, weresortto(A.14). ThisconcludestheproofofTheorem2. A.3.2 ProofofTheorem3 Let 0<γ 0 ≤1 be given. We deal with the case where there are no random splits first (see the end of this proof for details). Let us begin with a closed-form expression for the L 2 approximation errorin(A.15)belowobtainedusing(2.12). 
Wearguethatintheexpression E m(X X X)−m ∗ T (Θ Θ Θ 1:k ,X X X) 2 = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1 ,···,t t t k )∈T(Θ 1:k ) P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k ), (A.15) the(average)conditionalvarianceontheendnodesatthelastlevelcanbeboundedbythe(average) conditionalvarianceattheonetothelastlevelmultipliedbyafactor (1−γ 0 (α 1 α 2 ) −1 ),andhence wehavetherecursiveargumentasin(2.24). However,accordingtoCondition5,forthenodewith 93 too small probabilities, we need to use a different approach to deal with the case, which results in anadditionaltermεα 1 α 2 inTheorem3. In what follows, T(Θ 1:k ) is categorized into two groups, where upper bounds are constructed accordingly. Let ε≥ 0 be given. Then we introduce a set of tuples denoted as T ε . For each Θ 1 ,···,Θ k , define a set of k-dimensional tuples T ε (Θ 1 ,···,Θ k ) such that if the following two propertieshold: 1) (t t t 1 ,···,t t t k )∈T(Θ 1 ,···,Θ k ), 2) Thereexistssomepositiveintegerl≤ksuchthatsup j,c (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 ε withthesupre- mumoverallpossible (j,c)’s, then (t t t 1 ,···,t t t k )∈T ε (Θ 1 ,···,Θ k ). InviewofthedefinitionofT ε ,wecandeducethat RHSof(A.15) = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) " ∑ (t t t 1 ,···,t t t k )∈T ε (Θ 1:k ) P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k ) + ∑ (t t t 1 ,···,t t t k )∈ T(Θ 1:k ) T ε (Θ 1:k ) P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k ) # , (A.16) wherethesummationisoverallpossibleΘ 1:k . Forsimplicity,defineT † (Θ 1:k ):=T(Θ 1:k ) T ε (Θ 1:k )andV(t t t):=Var(m(X X X)|X X X∈t t t). Wecanob- servetwopropertiesofT † . First,if(t t t 1 ,···,t t t k )∈T † (Θ 1:k ),thenwehave(t t t 1 ,···,t t t l )∈T † (Θ 1:l )for each1≤l <k,butnottheotherwayaround. Second,if (t t t 1 ,···,t t t k )∈T † (Θ 1:k )andt t t ′ k istheother daughter node of t t t k−1 in T(Θ 1:k ), then it holds that (t t t 1 ,···,t t t k−1 ,t t t ′ k )∈ T † (Θ 1:k ); that is, daugh- ter nodes are included in T † (Θ 1:k ) as a pair. It is worth emphasizing that T † (Θ 1:k ) and T ε (Θ 1:k ) are two sets of tuples such that{t t t k :(t t t 1 ,···,t t t k )∈T † (Θ 1:k )} and{t t t k :(t t t 1 ,···,t t t k )∈T ε (Θ 1:k )} are mutuallyexclusive,andcollectivelytheyareapartitionoffeaturespace. 
94 <latexit sha1_base64="cnGyrbZws9somTDCNg6CuLsDY+Q=">AAAPxnic1Vdbb9s2FFbbre2cbG62x70QS4KlgOLKRm/DoKLoHtaHPXRD0xazjICiji0iFCWQVB1XELB/OezHDNihLMcXOYmDNShKIDFF8vB837mQPGEmuDae98+Nm7e++PL2nbtftba2v/6mfW/n2zc6zRWDI5aKVL0LqQbBJRwZbgS8yxTQJBTwNjz5xc6/fQ9K81S+NpMMBgkdST7kjBocOt7ZyloEWxDCiMvC8JMPGWcmV1D2D5+5z3xtgAoT/+gSAe9BPOhoMxHgF5qHqHJEIoRIJQPik67Hkge73Xrl0kznEUtKl2hGUdbrPHaNolIPU5UQHVNENQUh0whIP0xFZHsDUgS/UTUCshfYMT1J8Kcw5XHhlXskCMi66eB1DIbimq7bLffKamMWcxH1BR/FJhQ5uAQtMiEm5uxkqrmo/tvWgFCSs7lqm7Ovucxlck3ZpvzCHoIiLLJAnkWp0TWVxp59iHDRUKUJyagCafDHxMQvDoLKm9WY3blz2PW8++TwkExnKulqQqbKxPfLwYqCcwHWJF1yqmM+NP6hZ/17jrvO/HEy90eTREXatS6R7qaErsJlHZsZH6MApk5DrzVWbYCYXBdkQj4S6OXvshmZi3QGG4XppRE6/yqXM2i9rnXpNCeJ6XTVfNoQ5idKozNujSxa58+rQfy0ifHZ58LHiKrV4F+4h9ZeOIuRvsF9c4FQk8S6zBpciP+6dluhRj5/btcX9OsDr3FqzK7etYfahhf6ObZetda1bXtRsJQ/t/bRZldrJMsNXi9ADJwawqVJic5Di6eCplv71VOzT/FJKH2GvgHlTmYG7VqDUkMO6H1S7P25FwQ9z7M4NhIKl4UCkNHSs7p1fG/X63hVI81Ot+7sOnV7dbxzyw+ilOUJqmSCat3vepkZFFQZzgTuGOQaMrxS6Qj62JU0AT0oqtqgJPs4EhF8auMfxmU1uihR0ETbpxquTDBm9eqcHXQtgXUL+rkZPh0UXKLBQbKptmEuCFrcVhtYBChgRkywQ5niCBhDiCrK0HjLqrLRMBOYXi4JE6Q0+9RgkEmC6eRLGIM2U7oWkOChompSVBWE7owgTcAozlyqVDrW7pAbFxErfupmqea23sGKBeUjGGK1VFmiOCsKyuKPX1+UxcOe++SR2+s9xXVWCapqTQN8Fs51CYRlTWG5UTkSmI4qzWWEhmapkkjNJUMuBK4Zx9zgdKTo2K/fPUsBRBIueZJjHcQ/gI9RNB8Z88ieIbPqqQJRh/EMgwXBdaoZCNCIkNdoLlcXgyU+PT5WVE5RVFYlYaoiUMjOUAP+Tx5uJ1mcKr86s2pQs4JgQ8vMi6+1Vloo02z6/j+rtcqWTbjuano1O296ne7jzsPfe7vPX9Spd9f53vnBOXC6zhPnufPSeeUcOWzr761/t29v32m/bMt23h5Pl968Uct85yy19l//AUf8Yo8=</latexit> t0 ⇥ 1,1 ··· ⇥ k,1 ··· ··· ··· ··· ··· ··· ··· Figure4.1: ThethickbluetreebranchisthefirsttreebranchofT(Θ 1:k ). With these notations, let us deal with the second term on the RHS of (A.16) first. Simple calculationsshowthat ThesecondtermontheRHSof(A.16) = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1:k )∈T † (Θ 1:k ) P(X X X∈t t t k−1 )P(X X X∈t t t k |X X X∈t t t k−1 )V(t t t k ) = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 = ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) ∑ Θ k P(Θ Θ Θ k =Θ k ) ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 , (A.17) where the second equality is due to the definition of (I) t t t k−1 ,t t t k in (2.13) and the second property of T † ,andthethirdequalityisduetotheindependenceofrandomparameters. To deal with the RHS of (A.17), we consider tree branches of the tree T(Θ 1:k ) (not T † (Θ 1:k )) asfollows. Thereare2 k distincttreebranchest t t 1:k inT(Θ 1:k ),andwecallthefirsttwoofthesetree branches“thefirsttreebranchofT(Θ 1:k ),”whosecorrespondinglastcolumnsetrestrictionisΘ k,1 (recall that Θ k ={Θ k,1 ,···,Θ k,2 k−1}). See Figure 4.1 for a graphical illustration. Note that there are two daughter nodes of the first tree branch. In addition, note that it is possible that some tree branches of T(Θ 1:k ) are not included in T † (Θ 1:k ); in such cases, the corresponding summations (e.g., see (A.18) below) ignore these tree branches since we have defined that summations over emptysetsarezeros. 95 Now, with the definition of tree branches, we write the inner term on the RHS of (A.17) as follows. 
∑ Θ k P(Θ Θ Θ k =Θ k ) ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 = ∑ Θ k P(Θ Θ Θ k,1 =Θ k,1 ,···,Θ Θ Θ k,2 k−1 =Θ k,2 k−1) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 + ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isnotthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 , (A.18) and then, since the first tree branch is related to only the feature restrictionΘ Θ Θ k,1 and the other tree branchesareonlysubjecttoΘ Θ Θ k,2 ,···,Θ Θ Θ k,2 k−1,andthatΘ Θ Θ k,l ’sareindependent, RHSof(A.18) = ∑ Θ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 + ∑ Θ k,2 ,···,Θ k,2 k−1 P(Θ Θ Θ k,2 =Θ k,2 ,···,Θ Θ Θ k,2 k−1 =Θ k,2 k−1) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isnotthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 . (A.19) With(A.19),wecanfocusonthesummationwiththefirsttreebranchinthefollowing;without loss of generality, we suppose the first tree branch of T(Θ 1:k ) is also included in T † (Θ 1:k ) (oth- erwise, we can consider some other tree branch). Observe that there exists at least one optimal feature j ∗ suchthatsup c (II) t t t k−1 ,t t t k−1 (j ∗ ,c) =sup j,c (II) t t t k−1 ,t t t k−1 (j,c) ,wherethesupremumontheRHS 96 is the unconstrained supremum. It is not difficult to see that the probability thatΘ Θ Θ k,1 includes one oftheseoptimalfeaturesisatleastγ 0 ;thatis, P(Θ Θ Θ k,1 includesoneoftheoptimalfeatures)≥γ 0 (thegoodstate), P({Θ Θ Θ k,1 includesoneoftheoptimalfeatures} c )<1−γ 0 (thebadstate). (A.20) In (A.17), it follows from the definition of T † that sup j,c (II) t t t k−1 ,t t t k−1 (j,c) >α 2 ε. Then ifΘ k,1 is in thegoodstate,bythefirstitemofCondition5itholdsthat (II) t t t k−1 ,t t t k >ε. Bythis,theseconditem ofCondition5,andthefactthatΘ k,1 isinthegoodstate,wehave (I) t t t k−1 ,t t t k =Var(m(X X X|X X X∈t t t k−1 )−(II) t t t k−1 ,t t t k ≤Var(m(X X X)|X X X∈t t t k−1 )−α −1 2 sup j,c (II) t t t k−1 ,t t t k−1 (j,c) . (A.21) Moreover,itfollowsfromCondition1that RHSof(A.21)≤Var(m(X X X)|X X X∈t t t k−1 )(1−(α 1 α 2 ) −1 ). Ontheotherhand,ifΘ k,1 isinthebadstate,itholdsthat (I) t t t k−1 ,t t t k ≤Var(m(X X X)|X X X∈t t t k−1 ). 97 Bytheaboveobservation,forthefirsttreebranch, ∑ Θ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 ≤ ∑ goodΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )(1−(α 1 α 2 ) −1 ) 2 + ∑ badΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) × ∑ (t t t 1 ,···,t t t k )∈T † (Θ 1:k )where (t t t 1 ,···,t t t k−1 )isthefirsttreebranchinT(Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 ) 2 . 
(A.22) If (t t t 1 ,···,t t t k−1 ) is the first tree branch in T(Θ 1:k ), due to the facts that there are two daughter nodes of t t t k−1 and that the terms in the summations on the RHS of (A.22) does not depend on t t t k , and(A.20), RHSof(A.22) ≤ ∑ goodΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 )P(X X X∈t t t k−1 )V(t t t k−1 )(1−(α 1 α 2 ) −1 ) + ∑ badΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 )P(X X X∈t t t k−1 )V(t t t k−1 ) ≤ sup γ≥γ 0 γP(X X X∈t t t k−1 )V(t t t k−1 )(1−(α 1 α 2 ) −1 )+(1−γ)P(X X X∈t t t k−1 )V(t t t k−1 ) ≤(1−γ 0 (α 1 α 2 ) −1 )P(X X X∈t t t k−1 )V(t t t k−1 ). (A.23) Wenoticethat(A.23)holdsifthefirsttreebranchisnotincludedinT † (Θ 1:k )sincethesummation wouldbezero. Wecanapplytheargumentsfor(A.18)–(A.23)toeachtreebranchinT(Θ 1:k )toget RHSof(A.19)≤ ∑ (t t t 1 ,···,t t t k−1 )∈T † (Θ 1:k−1 ) (1−γ 0 (α 1 α 2 ) −1 )P(X X X∈t t t k−1 )V(t t t k−1 ). 98 Thus, RHSof(A.17) ≤(1−γ 0 (α 1 α 2 ) −1 ) ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) ∑ (t t t 1:k−1 )∈T † (Θ 1:k−1 ) P(X X X∈t t t k−1 )V(t t t k−1 ). (A.24) Wecanrepeatthecalculationin(A.24)k timestoconcludethat RHSof(A.24)≤(1−γ 0 (α 1 α 2 ) −1 ) k Var(m(X X X)). (A.25) Next, we bound the first term in (A.16). Let (t t t 1 ,···,t t t k )∈ T ε (Θ 1 ,···,Θ k ) be a given tuple. By the second property in the definition of T ε , there exists a smallest integer 1≤l≤k such that sup j,c (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 ε. ByCondition1,wehave Var(m(X X X)|X X X∈t t t l−1 )≤α 1 α 2 ε. (A.26) Denote by S the set of tuples in T ε (Θ 1 ,···,Θ k ) such that the first l−1 nodes aret t t 1 ,···,t t t l−1 . For each q∈{l−1,···,k−1}, let S q be the set of distinct tuples in{(t t t 1 ,···,t t t q ) : (t t t 1 ,···,t t t k )∈ S}. Thenwecandeducethat ∑ (t t t 1 ,···,t t t k )∈S P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k ) = ∑ (t t t 1 ,···,t t t k )∈S P(X X X∈t t t k−1 )P(X X X∈t t t k |X X X∈t t t k−1 )Var(m(X X X)|X X X∈t t t k ) ≤ ∑ (t t t 1 ,···,t t t k−1 )∈S k−1 P(X X X∈t t t k−1 )Var(m(X X X)|X X X∈t t t k−1 ) ≤ ∑ (t t t 1 ,···,t t t l−1 )∈S l−1 P(X X X∈t t t l−1 )Var(m(X X X)|X X X∈t t t l−1 ) =P(X X X∈t t t l−1 )Var(m(X X X)|X X X∈t t t l−1 ) ≤P(X X X∈t t t l−1 )α 1 α 2 ε. (A.27) 99 Here,thefirstinequalityin(A.27)followsfromthefactthatVar(m(X X X)|X X X∈t t t k−1 )≥(I) t t t k−1 ,t t t k . The second inequality is obtained by repeating the same argument for the first inequality. Moreover, the second equality is because S l−1 contains exactly one tuple, while the last inequality follows from(A.26). GivenΘ 1 ,···,Θ k andε,weseethatsummingtheLHSof(A.27)overallpossible(andmutually exclusive) tuple sets S is bounded by the summation over the probabilities of exclusive events multipliedbyα 1 α 2 ε. Thus,itholdsthat ∑ (t t t 1 ,···,t t t k )∈T ε (Θ 1:k ) P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k )≤α 1 α 2 ε. SincesummingovertheprobabilitiesofΘ 1:k givesone,wehave ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1 ,···,t t t k )∈T ε (Θ 1:k ) P(X X X∈t t t k )Var(m(X X X)|X X X∈t t t k )≤α 1 α 2 ε. Therefore, combining this inequality, and (A.15)–(A.17), and (A.24)–(A.25) yields the desired conclusionofTheorem3forthecasewithoutrandomsplits. Forthecasewheretherearerandomsplits,wecanconditionontheserandomsplitsandapply the previous arguments to get the same conclusion. Specifically, let “Random splits” denote the randomparameteroftheserandomsplits,wehave E m(X X X)−m ∗ T (Θ Θ Θ 1:k ,X X X) 2 =E E(RHSof(A.15)|Randomsplits) ≤E α 1 α 2 ε+ 1−γ 0 (α 1 α 2 ) −1 k Var(m(X X X)) =α 1 α 2 ε+ 1−γ 0 (α 1 α 2 ) −1 k Var(m(X X X)), wheretheinequalityisduetothepreviousarguments. 
This completes the proof of Theorem 3. Finally, we note that the previous arguments also lead to the desired bound in (2.24).

A.3.3 Proof of Theorem 4

Let us first briefly outline the proof idea for the main assertion of Theorem 4. We argue that for each node $\bm{t}$, the (sample) CART-split criterion in (2.6) gives results that are very close to those of the theoretical CART-split introduced in Section 2.4.1. More precisely, let $\widehat{\bm{t}}$ be one of the daughter nodes of $\bm{t}$ after the CART-split given a set of available features $\Theta$, and we argue that the value of $(\mathrm{II})_{\bm{t},\widehat{\bm{t}}}$ is very close to that of $\sup_{j\in\Theta,c}(\mathrm{II})_{\bm{t},\bm{t}(j,c)}$. Since this argument involves the quantities $(\mathrm{II})$, to obtain the desired result we need to control the differences between the theoretical and sample (conditional) moments. Thus, we rely on the grid introduced in Section 2.5, where we also introduce the ideas behind the grid.

The formal proof starts with constructing the $\mathcal{X}_n$-measurable event $\bm{U}_n$ described in Theorem 4. Define, for some $\Delta>0$ and $s>0$ (further requirements on $\Delta,s$ will be specified shortly in (A.28) below),
\[
\bm{U}_n := \bm{C}_n\cap A_1(k,\Delta)\cap A_2(k,\Delta)\cap A_3(k+1,\Delta)\cap A,
\]
where $k=\lfloor c\log(n)\rfloor$, $\bm{C}_n=\cap_{i=1}^n\{|\varepsilon_i|\le n^s\}$, the event $A$ is defined in (A.6), and $A_i(k,\Delta)$, $i\in\{1,2,3\}$, are defined in Lemma 8 in Section A.5.1; note that we let $k=\lfloor c\log(n)\rfloor$ in the proof of Theorem 4 for simplicity. Briefly, the events $A_i(k,\Delta)$, $i\in\{1,2,3\}$, control the conditional moments, which include the conditional means and probabilities, and the numbers of observations on each of the sufficiently large nodes on the grid hyperplanes. Since we have assumed Condition 2 and Condition 3 with sufficiently large $q$, it follows from Lemma 8 and (A.3) that for all large $n$,
\[
\mathbb{P}(\bm{U}_n^c)=o(n^{-1}),
\]
which concludes the first assertion of Theorem 4 regarding the event $\bm{U}_n$.

It remains to show the second assertion of Theorem 4. Let us introduce some needed notation and parameter restrictions as follows. It is required that $\frac12<\Delta<1-2\delta$, which is possible because $\delta<\frac14$ ($\delta$ and $\eta$ in (A.28) below are given by Theorem 4). In addition, we let $\Delta'$ and a sufficiently small $s>0$ be such that $\frac12<\Delta'<\Delta$ and
\[
\eta<\min\Big\{\frac{\Delta'}{4}-2s,\ \delta-2s,\ \frac{\delta}{2}\Big\}. \tag{A.28}
\]

To better understand the technical arguments, we provide some useful intuitions first. For each node $\bm{t}=\times_{j=1}^p t_j$ and a set of available features $\Theta$, let us fix a best cut $(j^*(\bm{t}),c^*(\bm{t})):=\arg\sup_{j\in\Theta,c}(\mathrm{II})_{\bm{t},\bm{t}(j,c)}$, and for simplicity we do not specify the dependence of the cut on $\Theta$. Let $\bm{t}^*$ be one of the daughter nodes of $\bm{t}$ after the cut $(j^*(\bm{t}),c^*(\bm{t}))$. Our goal is to find a lower bound of $(\mathrm{II})_{\bm{t},\widehat{\bm{t}}}-(\mathrm{II})_{\bm{t},\bm{t}^*}$ in terms of the sample size $n$. The main idea of the proof is to find a semi-sample daughter node of $\bm{t}$, denoted as $\bm{t}^\dagger$, such that $\bm{t}^\dagger$ is grown by a cut $(j^*(\bm{t}),c^\dagger(\bm{t}))$ with $c^\dagger(\bm{t})=x_{i,j^*(\bm{t})}$ for some $i\in\{1,\dots,n\}$ and the value of $c^\dagger(\bm{t})$ very close to $c^*(\bm{t})$ (recall that $\bm{x}_i=(x_{i,1},\dots,x_{i,p})^T$'s are the observations in the sample). Intuitively, on one hand, $(\mathrm{II})_{\bm{t},\widehat{\bm{t}}}-(\mathrm{II})_{\bm{t},\bm{t}^\dagger}$ should be bounded from below because $\widehat{\bm{t}}$ maximizes the sample counterpart of $(\mathrm{II})_{\bm{t},\widehat{\bm{t}}}$, so that $\widehat{(\mathrm{II})}_{\bm{t},\widehat{\bm{t}}}\ge\widehat{(\mathrm{II})}_{\bm{t},\bm{t}^\dagger}$ (the sample conditional bias decrease; a formal definition is in (A.31) below), and these sample counterparts are close to their population counterparts, respectively, in a probabilistic sense. On the other hand, the difference between $c^\dagger(\bm{t})$ and $c^*(\bm{t})$ is very small and thus $|(\mathrm{II})_{\bm{t},\bm{t}^\dagger}-(\mathrm{II})_{\bm{t},\bm{t}^*}|$ is controlled. Then by the use of the semi-sample daughter node, we can complete the technical analysis.
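Throughout this proof the sample CART-split is compared with its theoretical counterpart through the empirical conditional bias decrease. The following minimal sketch (with our own, hypothetical function and variable names, and assuming splits of the form $\{x_j<c\}$ at observed cut points) illustrates the computation behind the sample CART-split criterion (2.6): for the data falling in a node, scan the available features and candidate cuts, evaluate the sample analogue of $(\mathrm{II})$ (formalized in (A.31) below), and return the maximizing pair.

```python
import numpy as np

def sample_bias_decrease(y_node, mask_left):
    """Empirical analogue of the conditional bias decrease (II) for one candidate
    split of a node: probability-weighted squared deviations of the two daughter
    sample means from the parent sample mean (an empty daughter contributes 0)."""
    n_node = y_node.size
    parent_mean = y_node.mean()
    decrease = 0.0
    for mask in (mask_left, ~mask_left):
        if mask.any():
            decrease += (mask.sum() / n_node) * (y_node[mask].mean() - parent_mean) ** 2
    return decrease

def sample_cart_split(X_node, y_node, available_features):
    """Scan the available features and observed cut points; return the (feature, cut)
    pair maximizing the empirical bias decrease, i.e., the sample CART-split."""
    best = (None, None, -np.inf)
    for j in available_features:
        for c in np.unique(X_node[:, j]):
            dec = sample_bias_decrease(y_node, X_node[:, j] < c)
            if dec > best[2]:
                best = (j, c, dec)
    return best  # (j_hat, c_hat, maximal empirical decrease)
```

The argument below shows that, on the event $\bm{U}_n$, the population bias decrease of the split returned by this sample criterion cannot fall much below the best theoretical bias decrease.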
We now introduce some necessary notation for the remaining proof. Let us fix an interval $I^*(\bm{t})$ such that $c^*(\bm{t})\in I^*(\bm{t})\subset t_{j^*(\bm{t})}$ and
\[
\mathbb{P}\big(X_{j^*(\bm{t})}\in I^*(\bm{t})\mid \bm{X}\in\bm{t}\big)=n^{-\delta}.
\]
In view of Condition 2, such an $I^*(\bm{t})$ is well-defined. In addition, for the node $\bm{t}$ and $\Theta$, we fix another cut $(j^*(\bm{t}),c^\dagger(\bm{t}))$ such that $c^\dagger(\bm{t})$ is an element of the set
\[
\big\{x_{i,j^*(\bm{t})}:\ \bm{x}_i\in\bm{t},\ x_{i,j^*(\bm{t})}\in I^*(\bm{t})\big\} \tag{A.29}
\]
when this set is not empty, and otherwise $c^\dagger(\bm{t})$ is a random value in $t_{j^*(\bm{t})}$.

Recall that $\widehat{\bm{t}}$, $\bm{t}^\dagger$, and $\bm{t}^*$ denote, respectively, one of the daughter nodes constructed by the CART-split (2.6), by the cut $(j^*(\bm{t}),c^\dagger(\bm{t}))$, and by the cut $(j^*(\bm{t}),c^*(\bm{t}))$. In particular, due to the definition of $c^\dagger(\bm{t})$, we have ensured that
\[
\big|\mathbb{P}(\bm{X}\in\bm{t}^\dagger\mid\bm{X}\in\bm{t})-\mathbb{P}(\bm{X}\in\bm{t}^*\mid\bm{X}\in\bm{t})\big|\le n^{-\delta}. \tag{A.30}
\]

Given the node $\bm{t}$ and an arbitrary partition of it into $\bm{t}'$ and $\bm{t}''$, we can define the sample version of (2.14) as
\[
\widehat{(\mathrm{II})}_{\bm{t},\bm{t}'}:=\frac{\#\{i:\bm{x}_i\in\bm{t}'\}}{\#\{i:\bm{x}_i\in\bm{t}\}}\Big(\frac{\sum_{\bm{x}_i\in\bm{t}'}y_i}{\#\{i:\bm{x}_i\in\bm{t}'\}}-\frac{\sum_{\bm{x}_i\in\bm{t}}y_i}{\#\{i:\bm{x}_i\in\bm{t}\}}\Big)^2
+\frac{\#\{i:\bm{x}_i\in\bm{t}''\}}{\#\{i:\bm{x}_i\in\bm{t}\}}\Big(\frac{\sum_{\bm{x}_i\in\bm{t}''}y_i}{\#\{i:\bm{x}_i\in\bm{t}''\}}-\frac{\sum_{\bm{x}_i\in\bm{t}}y_i}{\#\{i:\bm{x}_i\in\bm{t}\}}\Big)^2. \tag{A.31}
\]
Here, we define a summation over an empty set as zero. In particular, $\widehat{(\mathrm{II})}_{\bm{t},\bm{t}'}$ is zero if $\bm{t}$ contains only one observation.

To complete the proof of the second conclusion, we need Lemmas 6 and 7 in Sections A.4.7 and A.4.8, respectively. Let a constant $c_1>0$ with $\alpha_2\ge 1+c_1$ be given. It follows from Lemmas 6 and 7 and the definition of the sample tree growing rule $\widehat{T}$ that on the event $\bm{U}_n$, there exists some constant $C>0$ such that for all large $n$, each sequence of sets of available features $\Theta_1,\dots,\Theta_k$, each $(\bm{t}_1,\dots,\bm{t}_k)\in\widehat{T}(\Theta_1,\dots,\Theta_k)$, and each $1\le l\le k$ with $\mathbb{P}(\bm{X}\in\bm{t}_{l-1})\ge n^{-\delta}$, we have
\[
\begin{aligned}
&(\mathrm{II})_{\bm{t}_{l-1},\bm{t}_l}-(\mathrm{II})_{\bm{t}_{l-1},\bm{t}^*_l}\\
&=\underbrace{(\mathrm{II})_{\bm{t}_{l-1},\bm{t}_l}-(\mathrm{II})_{\bm{t}^\#_{l-1},\bm{t}^\#_l}}_{(i)}
+\underbrace{(\mathrm{II})_{\bm{t}^\#_{l-1},\bm{t}^\#_l}-\widehat{(\mathrm{II})}_{\bm{t}^\#_{l-1},\bm{t}^\#_l}}_{(ii)}
+\underbrace{\widehat{(\mathrm{II})}_{\bm{t}^\#_{l-1},\bm{t}^\#_l}-\widehat{(\mathrm{II})}_{\bm{t}_{l-1},\bm{t}_l}}_{(iii)}
+\underbrace{\widehat{(\mathrm{II})}_{\bm{t}_{l-1},\bm{t}_l}-\widehat{(\mathrm{II})}_{\bm{t}_{l-1},\bm{t}^\dagger_l}}_{(iv)}\\
&\quad+\underbrace{\widehat{(\mathrm{II})}_{\bm{t}_{l-1},\bm{t}^\dagger_l}-\widehat{(\mathrm{II})}_{(\bm{t}_{l-1})^\#,(\bm{t}^\dagger_l)^\#}}_{(v)}
+\underbrace{\widehat{(\mathrm{II})}_{(\bm{t}_{l-1})^\#,(\bm{t}^\dagger_l)^\#}-(\mathrm{II})_{(\bm{t}_{l-1})^\#,(\bm{t}^\dagger_l)^\#}}_{(vi)}
+\underbrace{(\mathrm{II})_{(\bm{t}_{l-1})^\#,(\bm{t}^\dagger_l)^\#}-(\mathrm{II})_{\bm{t}_{l-1},\bm{t}^\dagger_l}}_{(vii)}
+\underbrace{(\mathrm{II})_{\bm{t}_{l-1},\bm{t}^\dagger_l}-(\mathrm{II})_{\bm{t}_{l-1},\bm{t}^*_l}}_{(viii)}\\
&\ge -C\big(n^{-\frac{\delta}{2}}+n^{-\frac{\Delta'}{4}+2s}+n^{-\delta+2s}\big)\\
&\ge -c_1 n^{-\eta},
\end{aligned}\tag{A.32}
\]
where we suppress the dependence of all the daughter nodes on the set of available features. In (A.32), terms (i)–(iii) and (v)–(vii) are bounded in Lemma 7, while terms (iv) and (viii) are analyzed in Lemma 6. To apply Lemma 6, notice that $\bm{t}_l$ in term (iv) is grown by the (sample) CART-split given $\bm{t}_{l-1}$ and the available features. The last inequality above holds for all large $n$ due to (A.28).

In view of (A.32) and the definition of $\widehat{T}_\zeta$, on the event $\bm{U}_n$ it holds that for all large $n$, each sequence of sets of available features $\Theta_1,\dots,\Theta_k$, and each $(\bm{t}_1,\dots,\bm{t}_k)\in\widehat{T}_\zeta(\Theta_1,\dots,\Theta_k)$ with $\zeta=n^{-\delta}$, we have for $1\le l\le k$,
\[
\sup_{j\in\Theta_l,c}(\mathrm{II})_{\bm{t}_{l-1},\bm{t}_{l-1}(j,c)}\le(\mathrm{II})_{\bm{t}_{l-1},\bm{t}_l}+c_1 n^{-\eta}, \tag{A.33}
\]
where, for simplicity, we do not specify to which set of available features $j$ belongs in the supremum, as in Condition 5.
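To make the semi-sample device concrete, here is a small sketch (hypothetical names; an illustration of the construction in (A.29), not code from the dissertation) of how $c^\dagger(\bm{t})$ can be chosen: among the observations in the node whose $j^*$-th coordinate falls in $I^*(\bm{t})$, take any one as the cut point, and fall back to a random point of $t_{j^*(\bm{t})}$ only if no such observation exists. Because $c^\dagger$ is an observed value, $\widehat{(\mathrm{II})}_{\bm{t},\widehat{\bm{t}}}\ge\widehat{(\mathrm{II})}_{\bm{t},\bm{t}^\dagger}$ holds automatically, while $I^*(\bm{t})$ having conditional mass $n^{-\delta}$ is what yields (A.30).

```python
import numpy as np

rng = np.random.default_rng(0)

def semi_sample_cut(X_node, j_star, interval_star, t_j_bounds):
    """Pick c_dagger in the spirit of (A.29): an observed value of feature j_star
    lying in I_star(t) when such an observation exists, otherwise a random point
    of the node's j_star-th side (the fallback case of the construction)."""
    lo, hi = interval_star
    vals = X_node[:, j_star]
    candidates = vals[(vals >= lo) & (vals <= hi)]
    if candidates.size > 0:
        return float(candidates[0])    # any element of the set in (A.29) works
    a, b = t_j_bounds
    return float(rng.uniform(a, b))    # fallback: random value in t_{j_star}
```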
Observe that because of the construction of the semi-sample tree growing rule, wedonotrequiretheconditionofP(X X X∈t t t l−1 )≥n −δ inthestatementof(A.33)aswedoin(A.32). 104 Finally, by the same conditions as for (A.33) and the choices of α 1 and c 1 , we have that for each1≤l≤k,if (II) t t t l−1 ,t t t l >n −η ,itholdsthat sup j∈Θ l ,c (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 (II) t t t l−1 ,t t t l , andif (II) t t t l−1 ,t t t l ≤n −η ,itholdsthat sup j∈Θ l ,c (II) t t t l−1 ,t t t l−1 (j,c) ≤α 2 n −η , whichconcludestheproofofTheorem4. A.3.4 ProofofTheorem5 Tooutlinetheproofidea,letusrewritetheexpectationandobtainaclosed-formexpressionbelow. From(2.27),wecanseethat E sup T E m ∗ T # (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m T #(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 Θ Θ Θ 1:k ,X n =E h sup T ∑ (t t t 1 ,···,t t t k )∈T # (Θ 1 ,···,Θ k ) P(X X X∈t t t k ) E(m(X X X)|X X X∈t t t k )− ∑ i∈{i:x x x i ∈t t t k } y i #{i:x x x i ∈t t t k } 2 i . (A.34) To have a clearer picture of how to apply Hoeffding’s inequality to our case, we utilize an even larger upper bound to get rid of Θ 1 ,···,Θ k (feature restrictions). Observe that the sum- mation on the RHS of (A.34) is over the partition{t t t k : (t t t 1 ,···,t t t k )∈T # (Θ 1 ,···,Θ k )}. Thus, the RHS of (A.34) can be further bounded by considering the supremum of partitions over all T and Θ 1 ,···,Θ k ; we use T k to denote a level k tree such that{t t t k : (t t t 1 ,···,t t t k )∈ T # k } is an instance of suchapartitiontosimplifythenotation. Thenitfollowsthat TheRHSof(A.34) ≤E 2 4 sup T # k ∑ (t t t 1 ,···,t t t k )∈T # k P(X X X∈t t t k ) E(m(X X X)|X X X∈t t t k )− ∑ i∈{i:x x x i ∈t t t k } y i #{i:x x x i ∈t t t k } 2 3 5 . (A.35) 105 Notice that there is no Θ on the RHS of (A.35), and hence the outer expectation is only over the sample. Inwhatfollows,weboundtheRHSof(A.35). TheargumentisbasedontheeventA 1 (k,∆) introduced in Lemma 8 in Section A.5.1, which in turn relies on Hoeffding’s inequality. On such event, for each nodet t t on the grid constructed with at most k cuts and satisfyingP(X X X∈t t t)≥n ∆−1 , thedeviationbetweenitssampleandpopulationconditionalmeanscanbecontrolled. Let∆>0,∆ ′ >0,andsufficientlysmall0 ∆ ′ 2 . Assume that the momentconditionparameterqinCondition3issufficientlylargewithq> 5+2δ s anddefine E n,k :=sup T # k ∑ (t t t 1 ,···,t t t k )∈T # k P(X X X∈t t t k ) E(m(X X X)|X X X∈t t t k )− ∑ x x x i ∈t t t k y i #{i:x x x i ∈t t t k } ! 2 . ThentheRHSof(A.35)canberewrittenas E E n,k 1 1 1 ∪ n i=1 {|ε i |>n s } +E E n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } . (A.37) Let us bound the first term in (A.37). By Condition 4, which requires sup c c c∈[0,1] p|m(c c c)|≤M 0 , asimpleupperboundforE n,k isgivenby E n,k ≤ M 0 + n ∑ i=1 |y i | 2 (A.38) 106 for each n≥ 1 and k≥ 1. It follows from the Cauchy–Schwarz inequality, (A.38), Minkowski’s inequality, Conditions 3–4, and the definition ofδ that there exists some constantC>0 such that foreachn≥1andk≥1, E E n,k 1 1 1 ∪ n i=1 {|ε i |>n s } ≤ q E(E 2 n,k ) r P ∪ n i=1 {|ε i |>n s } ≤ s E M 0 + n ∑ i=1 |y i | 4 s n ∑ i=1 P |ε i |>n s ≤ M 0 + n ∑ i=1 E|y i | 4 1/4 2 s n ∑ i=1 P |ε i |>n s ≤ (n+1)M 0 +n(E|ε 1 | 4 ) 1/4 2 s n ∑ i=1 P |ε i |>n s ≤Cn −δ . (A.39) Wenextdealwiththesecondtermin(A.37). Letusdefineforeacht t t, E t t t,n :=E(m(X X X)|X X X∈t t t)− ∑ x x x i ∈t t t y i #{i:x x x i ∈t t t} , E † n,k :=sup T # k ∑ (t t t 1 ,···,t t t k )∈T # k , P(X X X∈t t t k )≥n ∆−1 P(X X X∈t t t k ) E t t t k ,n 2 . 
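The quantity $E_{\bm{t},n}$ just defined is a node-level deviation between a population conditional mean and a sample average, and the argument below controls it uniformly over the grid partition via Hoeffding-type bounds. The toy Monte Carlo sketch below (our own illustrative model and names, not taken from the dissertation) shows the typical behavior for a single fixed node: the deviation shrinks at roughly the $1/\sqrt{n\,\mathbb{P}(\bm{X}\in\bm{t})}$ rate, so the rescaled quantity printed at the end stabilizes as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def node_deviation(n, sigma=1.0):
    """One Monte Carlo draw of the node-level deviation E_{t,n} for a toy model:
    X ~ Uniform[0,1]^2, m(x) = x_1 + x_2, y = m(X) + sigma * noise, and the fixed
    node t = [0, 0.5] x [0, 0.5], so P(X in t) = 0.25 and E(m(X) | X in t) = 0.5."""
    X = rng.uniform(size=(n, 2))
    y = X.sum(axis=1) + sigma * rng.normal(size=n)
    in_node = (X[:, 0] <= 0.5) & (X[:, 1] <= 0.5)
    return 0.5 - y[in_node].mean() if in_node.any() else 0.5

# the rescaled deviation sqrt(n * P(X in t)) * |E_{t,n}| stabilizes as n grows
for n in (10**3, 10**4, 10**5):
    devs = [abs(node_deviation(n)) for _ in range(200)]
    print(n, round(float(np.mean(devs)) * (n * 0.25) ** 0.5, 2))
```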
Wecanmakethreeusefulobservations: 1) Under Condition 4 and on the event∩ n i=1 {|ε i |≤n s }, it holds that for eacht t t, all large n, and eachk≥1, (E t t t,n ) 2 ≤2n 2s . 2) OntheeventA 1 (k,∆),itholdsthatforalllargenandeach1≤k≤clogn, E † n,k ≤n − ∆ ′ 2 . 3) Foreachn≥1andk≥1, E † n,k ≤sup t t t (E t t t,n ) 2 , 107 wherethesupremumisoverallpossiblenodes. By observation 1) above and the definition ofE † n,k , we have that for all large n and each 1≤k≤ clog 2 n, E E n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } =E sup T # k ∑ (t t t 1 ,···,t t t k )∈T # k , P(X X X∈t t t k )<n ∆−1 P(X X X∈t t t k ) E t t t k ,n 2 + ∑ (t t t 1 ,···,t t t k )∈T # k , P(X X X∈t t t k )≥n ∆−1 P(X X X∈t t t k ) E t t t k ,n 2 1 1 1 ∩ n i=1 {|ε i |≤n s } ≤2n c+∆+2s−1 +E E † n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } , (A.40) where the first term on the RHS of the inequality follows from the fact that the summation is over atmost2 clog 2 n nodes. FromthethreeobservationsaboveandLemma8(withκ inLemma8settoδ),wecandeduce thatforalllargenandeach1≤k≤clogn, E E † n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } =E E † n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } 1 1 1 A 1 (k,∆) c +E E † n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } 1 1 1 A 1 (k,∆) ≤E sup t t t (E t t t,n ) 2 1 1 1 ∩ n i=1 {|ε i |≤n s } 1 1 1 A 1 (k,∆) c +E E † n,k 1 1 1 A 1 (k,∆) ≤2n 2s P A 1 (k,∆) c +E E † n,k 1 1 1 A 1 (k,∆) ≤2n − ∆ ′ 2 , (A.41) whereforthelastinequality,recallthatδ−2s> ∆ ′ 2 . Theninlightof(A.36)and(A.40)–(A.41),itholdsthatforalllargenandeach1≤k≤clogn, E E n,k 1 1 1 ∩ n i=1 {|ε i |≤n s } ≤ n −η 2 . (A.42) 108 Therefore, combining (A.34)–(A.35), (A.37), (A.39), and thatδ >η completes the proof of The- orem5. A.4 ProofsofCorollaries1–2,Proposition1,andsomekeylemmas A.4.1 ProofofCorollary1 The arguments for showing (2.19) and (2.20) in Corollary 1 can be found in (A.10) and (A.11) in SectionA.3,respectively. A.4.2 ProofofCorollary2 First, we setη = 1 8 −ε,δ = 1 4 −ε, c= 1 8 , and k =⌊ 1 8 log 2 (n)⌋ in Theorem 1. Since e x ≥(1− x n ) n for0≤x≤n,itholdsthatforalllargen, (1−γ 0 (α 1 α 2 ) −1 ) k ≤e − kγ 0 α 1 α 2 ≤2 − 1 8 log 2 (e)γ 0 α 1 α 2 log 2 (n) ×e γ 0 α 1 α 2 =n − log 2 (e) 8 × γ 0 α 1 α 2 ×e γ 0 α 1 α 2 . By this, Theorem 1, we can show that there exist N > 0 and C > 0 such that for any m(X X X) satisfiesCondition1withα 1 andalln≥N, E m(X X X)−E b m b T (Θ Θ Θ 1:k ,X X X,X n ) X X X,X n 2 ≤C n − 1 8 +ε +n − log 2 (e) 8 × γ 0 α 1 α 2 . (A.43) To obtain (A.43), we note that the results in Lemmas 1–3 and Theorems 4–5 can be shown to be uniform over all m(X X X) satisfying the respective requirements of these results. Particularly, the result in Theorem 3 is already for all m(X X X)∈ SID(α 1 ). For simplicity, we omit the detailed analysisfor(A.43). By(A.43)andthedefinitionofSID(α 1 ),weconcludethedesiredresult. 109 A.4.3 ProofofProposition1 Letusdealwiththefirstassertionfirst. Adirectcalculationshowsthatforeveryt t t =t 1 ×···×t p , 8 > > < > > : sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) = β 2 4 , if Var(m(X X X)|X X X∈t t t)>0, sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) =0, if Var(m(X X X)|X X X∈t t t)=0, andthatVar(m(X X X)|X X X∈t t t)≤s ∗ β 2 4 ,whichconcludesthatm(X X X)∈SID(s ∗ ). Next, we proceed to deal with the second assertion, and we begin with the bias-variance de- composition upper bound in (A.44) and some details for CART in (A.45) below. 
By Jensen’s inequalityandtriangularinequality, E m(X X X)−E b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) X X X,X n 2 ≤2 E m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 +E m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−b m b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X,X n ) 2 . (A.44) AshavementionedintheremarkbeforeProposition1,foreachfeaturerestrictionΘandnode t t t,thesampleCARTsplitinthecasewithbinaryfeaturesis ( b j,1)with b j:=argmax j∈Θ d (II) t t t,t t t(j,1) , (A.45) wherefort t t anditstwodaughternodest t t 1 ,t t t 2 , d (II) t t t,t t t 1 = ∑ n i=1 1 1 1 x x x i ∈t t t 1 ∑ n i=1 1 1 1 x x x i ∈t t t ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 + ∑ n i=1 1 1 1 x x x i ∈t t t 2 ∑ n i=1 1 1 1 x x x i ∈t t t ∑ n i=1 1 1 1 x x x i ∈t t t 2 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 , and the ties are broken randomly; the definition of splits here is the same as the one given in the remarkbeforeProposition1. 110 Additional remarks for splitting in this case with binary features are as follows. Due to the definition that 0 0 = 0, for any trivial split (j,c), which is a split gives a daughter node t t t ′ with P(X X X∈t t t ′ ) = 0, it holds that d (II) t t t,t t t(j,c) = 0 and (II) t t t,t t t(j,c) = 0. If all coordinates in some Θ have already been split, the CART stops splitting; to have well-defined level k trees, we allow CART to make trivial splits that give empty sets as daughter nodes, and we define daughter nodes of an emptysettobetwoemptysets. Asaresult, b T(Θ 1:k )maycontainemptyendnodes. To bound the two terms on the RHS of (A.44), our first step is to show that the sample CART split ( b j,1) for each node t t t is “very close to” its theoretical CART split counterpart (j ∗ ,c ∗ ) = argsup j∈Θ,c∈t j (II) t t t,t t t(j,c) , in the sense as in iii) of Lemma 5 below. To get Lemma 5, we need an eventU n definedasfollows. DenotebyG n thecollectionofallendnodesoftreesoflevellowerthanlog 2 (n),andthenodes areformedbyusingthesplits (1,1),···(p,1). Adirectcalculationshowsthat #G n ≤ ⌊log 2 (n)⌋ ∑ k=0 p k 2 k ≤1+p log 2 (n) nlog 2 (n). (A.46) Defineevents Q 1 (t t t)= n E(m(X X X)1 1 1 X X X∈t t t )−n −1 n ∑ i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ≤ log e (max{n,p}) 2+ε 2 q 2M 2 0 +M 2 ε r P(X X X∈t t t) n o , Q 2 (t t t)= n P(X X X∈t t t)−n −1 n ∑ i=1 1 1 1 x x x i ∈t t t ≤ log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n o , U n = ∩ t t t∈G n Q 1 (t t t) ∩ ∩ t t t∈G n Q 2 (t t t) . Note that U n depends only on the training dataX n and is independent of X X X, which is the independentcopyofx x x 1 . Itholdsthat P(U c n )=o(n −1 ), (A.47) 111 whoseproofisdeferredtotheendoftheproofofProposition1. With event U n , we introduce Lemma 5 below, whose proof is also deferred to the end of the proofofProposition1. Lemma 5. i) For every t t t and every split (j,c), it is either (II) t t t,t t t(j,c) = β 2 4 or (II) t t t,t t t(j,c) = 0. In addition,onU n ,foralllargen,itholdsthatforeveryendnodet t t oftreesoflevelk≤ηlog 2 (n)−1, ii) Forevery1≤ j≤ p, ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j,1) −P(X X X∈t t t)(II) t t t,t t t(j,1) ≤18(M ε +2M 0 ) 2 log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n . iii) ForeachfeaturerestrictionΘandthesampleCARTsplit ( b j,1)givenin(A.45), (II) t t t,t t t( b j,1) = sup j∈Θ,c∈t j (II) t t t,t t t(j,c) . 
Now, we deal with the two terms on the RHS of (A.44), and begin with the first term. By the specificmodelsettingassumedhere,wehave E m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 ≤E (m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n +E (m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U c n ≤E (m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n +4M 2 0 P(U c n ), (A.48) where m ∗ T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) is defined to be E(m(X X X)) when k = 0, which is a trivial theoretical randomforestsmodel. To further deal with the first term on the RHS of (A.48), let us define T ∗ (Θ 1:k ) to be a tree of level k grown by theoretical CART splits with sets of available features specified as in Θ 1:k . We want to make a connection between b T and T ∗ as in (A.49) below. However, because theoretical CART and sample CART split the nodes differently, it is unclear whether the equality in (A.49) 112 holds if ties are broken randomly. To ensure such an equality, we additionally require that for all largenand0≤k≤ηlog 2 (n),thetheoreticalCARTbreakstiessuchthat (m(X X X)−m ∗ T ∗(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n =(m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n , whichispossiblebecauseofiii)ofLemma5. Therefore, E (m(X X X)−m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n =E (m(X X X)−m ∗ T ∗(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)) 2 1 1 1 U n ≤E m(X X X)−m ∗ T ∗(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 . (A.49) TodealwiththefirsttermontheRHSof(A.49),weneed(A.50)below. Foreachk≥0, E m(X X X)−m ∗ T ∗(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 ≤(1−γ 0 (s ∗ ) −1 ) k Var(m(X X X)). (A.50) Inaddition,ifitisknownthatγ 0 =1,asharpsquaredbiasupperboundcanbeobtainedby E m(X X X)−m ∗ T ∗(Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 ≤max (s ∗ −k) β 2 4 ,0 (A.51) for each k≥ 0. On the other hand, the second term on the RHS of (A.44) is bounded by (A.52) below. Foreach0≤k≤ηlog 2 (n), E m ∗ b T (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤(3M 0 +2M ε ) 2 2 k log e (max{n,p}) 2+ε n +(2M 0 +M ε ) 2 P(U c n ). (A.52) The proofs of (A.50)–(A.52) are deferred to the end of the proof of Proposition 1. We also give some intuition in Remark 10 in the proof of (A.52) for improving the estimation upper bound. The results of (A.44) and (A.47)–(A.52) lead to the desired result, and hence we have finished the proof. 113 Letusgiveaclosingremark. Itisseenthatbothestimationvarianceandsquaredbiasanalyses rely on the event U n . A general version of such an event is used for random forests analysis for general cases, and the technique is called “the grid.” A brief introduction of the grid for general cases, which is far more complicated than the simple case here, can be found in Section 2.5. In addition,thebiasanalysis(A.48)–(A.50)isasimpleversionoftheoneinSection2.4. Asremarked afterProposition1,thegeneralbiasanalysisdependsonthesamplesizesincetheoptimalspliton eachcoordinateisunknownandfeaturesaredependent. Proofof (A.47). ToboundtheprobabilitiesofthecomplementsoftheeventsQ 1 (t t t),Q 2 (t t t),U n with concentration inequalities, we need the variance upper and lower bounds and (A.54) below. For eachnodet t t, Var(1 1 1 x x x 1 ∈t t t )=P(X X X∈t t t)(1−P(X X X∈t t t)), Var(ε 1 )P(X X X∈t t t)≤Var(1 1 1 x x x 1 ∈t t t (m(x x x 1 )+ε 1 )) =Var(1 1 1 x x x 1 ∈t t t m(x x x 1 ))+Var(ε 1 )P(X X X∈t t t) ≤(2M 2 0 +M 2 ε )P(X X X∈t t t). (A.53) Sincet isconstructedbyatmostk≤ηlog 2 (n)cuts, P(X X X∈t t t)≥n −η . (A.54) By Bernstein’s inequality [98, 99], (A.53), the assumptions of i.i.d. 
observations and bounded regression function and model errors, (A.54) and that (log e p) 2+ε =o(n 1−η ), it holds that for all largenandeveryt t t, P((Q 1 (t t t)) c )≤2exp −(log e (max{n,p})) 2+ε 3 , (A.55) 114 andifP(X X X∈t t t)<1,foralllargen, P((Q 2 (t t t)) c )≤2exp −(log e (max{n,p})) 2+ε 3 , (A.56) andifP(X X X∈t t t)=1,foralln≥1, P((Q 2 (t t t)) c )=0, (A.57) sinceP(X X X∈t t t)=n −1 ∑ n i=1 1 1 1 x x x i ∈t t t =1. By(A.46)and (A.55)–(A.57)andthe assumptionsof i.i.d. observations,abounded regression function,andboundedmodelerrors,itholdsthat P(U c n )≤2× 1+p log 2 (n) nlog 2 (n) ×2exp − 1 3 (log e (max{n,p})) 2+ε =o(n −1 ), (A.58) whichfinishestheproof. ProofofLemma5. Thefirstassertioncanbeshownbyadirectcalculationandhenceweomitthe detail. LetafeaturerestrictionΘbegiven. Thethirdassertionisaresultofthefirsttwoassertions andthat P(X X X∈t t t)≥n −η , (A.59) which is due to that t t t is constructed by k≤ ηlog 2 (n)−1 cuts. Specifically, suppose the first two assertions hold and let (j ∗ ,c ∗ ) = argsup j∈Θ,c∈t j (II) t t t,t t t(j,c) and j † such that (II) t t t,t t t(j † ,1) = 0 be given. If sup j∈Θ,c∈t j (II) t t t,t t t(j,c) = 0, the desired result is obviously true. Suppose otherwise sup j∈Θ,c∈t j (II) t t t,t t t(j,c) = β 2 4 (bythefirstassertion). Bythefactthatfeaturesarebinary, (II) t t t,t t t(j ∗ ,1) =(II) t t t,t t t(j ∗ ,c ∗ ) , 115 whichincombinationwithsup j∈Θ,c∈t j (II) t t t,t t t(j,c) = β 2 4 andthesecondassertionleadsto ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j ∗ ,1) ≥ β 2 4 P(X X X∈t t t)−18(M ε +2M 0 ) 2 log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n . Meanwhile,by (II) t t t,t t t(j † ,1) =0andthesecondassertion, 18(M ε +2M 0 ) 2 log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n ≥ ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j † ,1) . Combiningtheseand(A.59), (log e p) 2+ε =o(n 1−η ),itholdsthatforalllargen, ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j ∗ ,1) > ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j † ,1) , whichincombinationwiththefactthat ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t( b j,1) ≥ ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j ∗ ,1) , which is due to j ∗ ∈Θ, implies that d (II) t t t,t t t( b j,1) > d (II) t t t,t t t(j † ,1) for every such (j † ,1). Therefore, we have (II) t t t,t t t( b j,1) >0 in this scenario. This together with the first assertion concludes the third assertion. In the following, we prove the second assertion. Let us consider a node t t t and a split (j,1). If the jth coordinate has already been split, the desired result is obviously true. Therefore, we supposethe jthcoordinateoft t t hasnotbeenspliton. Lett t t 1 andt t t 2 denotethetwodaughternodes, respectively. Ourgoalistodealwiththedifference ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j,1) −P(X X X∈t t t)(II) t t t,t t t(j,1) , (A.60) 116 where ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j,1) = ∑ n i=1 1 1 1 x x x i ∈t t t 1 n ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 + ∑ n i=1 1 1 1 x x x i ∈t t t 2 n ∑ n i=1 1 1 1 x x x i ∈t t t 2 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 , (A.61) and P(X X X∈t t t)(II) t t t,t t t(j,1) =P(X X X∈t t t 1 )(E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 +P(X X X∈t t t 2 )(E(m(X X X)|X X X∈t t t 2 )−E(m(X X X)|X X X∈t t t)) 2 . 
(A.62) We begin with the difference between the respective first terms of the RHS of (A.61)–(A.62) as follows. ∑ n i=1 1 1 1 x x x i ∈t t t 1 n ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 −P(X X X∈t t t 1 )(E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 ≤ ∑ n i=1 1 1 1 x x x i ∈t t t 1 n ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 − ∑ n i=1 1 1 1 x x x i ∈t t t 1 n (E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 + ∑ n i=1 1 1 1 x x x i ∈t t t 1 n (E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 −P(X X X∈t t t 1 )(E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 , (A.63) 117 andthatthedifferencebetweenthefirsttwotermsontheRHSof(A.63)isboundedby ∑ n i=1 1 1 1 x x x i ∈t t t 1 n ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 − ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 2 − ∑ n i=1 1 1 1 x x x i ∈t t t 1 n (E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 ≤ ∑ n i=1 1 1 1 x x x i ∈t t t 1 n ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 −E(m(X X X)|X X X∈t t t 1 ) + E(m(X X X)|X X X∈t t t)− ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t (2M ε +4M 0 ), (A.64) where we use the identity a 2 −b 2 = (a−b)(a+b) and the assumptions of a bounded regression functionandmodelerrors. TwotermsofdifferencesontheRHSof(A.64)canbefurtherboundedrespectivelyasfollows. ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t 1 −E(m(X X X)|X X X∈t t t 1 ) = n −1 ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) n −1 ∑ n i=1 1 1 1 x x x i ∈t t t 1 − E(m(X X X)1 1 1 X X X∈t t t 1 ) n −1 ∑ n i=1 1 1 1 x x x i ∈t t t 1 + E(m(X X X)1 1 1 X X X∈t t t 1 ) n −1 ∑ n i=1 1 1 1 x x x i ∈t t t 1 − E(m(X X X)1 1 1 X X X∈t t t 1 ) P(X X X∈t t t 1 ) ≤ ∑ n i=1 1 1 1 x x x i ∈t t t 1 n −1 ∑ n i=1 1 1 1 x x x i ∈t t t 1 (m(x x x i )+ε i ) n −E(m(X X X)1 1 1 X X X∈t t t 1 ) + E(m(X X X)1 1 1 X X X∈t t t 1 ) P(X X X∈t t t 1 ) P(X X X∈t t t 1 )− ∑ n i=1 1 1 1 x x x i ∈t t t 1 n , (A.65) andsimilarly, ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t −E(m(X X X)|X X X∈t t t) ≤ ∑ n i=1 1 1 1 x x x i ∈t t t n −1 ∑ n i=1 1 1 1 x x x i ∈t t t (m(x x x i )+ε i ) n −E(m(X X X)1 1 1 X X X∈t t t ) + E(m(X X X)1 1 1 X X X∈t t t ) P(X X X∈t t t) P(X X X∈t t t)− ∑ n i=1 1 1 1 x x x i ∈t t t n . (A.66) 118 On the other hand, by the assumption of a bounded m(·), the last two terms on the RHS of (A.63)isboundedby ∑ n i=1 1 1 1 x x x i ∈t t t 1 n (E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 −P(X X X∈t t t 1 )(E(m(X X X)|X X X∈t t t 1 )−E(m(X X X)|X X X∈t t t)) 2 ≤4M 2 0 ∑ n i=1 1 1 1 x x x i ∈t t t 1 n −P(X X X∈t t t 1 ) . (A.67) By (A.61)–(A.67), it holds that for all large n and every end node t t t of trees of level k≤ ηlog 2 (n)−1,onU n , max j ∑ n i=1 1 1 1 x x x i ∈t t t n d (II) t t t,t t t(j,1) −P(X X X∈t t t)(II) t t t,t t t(j,1) ≤2 (2M ε +4M 0 )×2( q 2M 2 0 +M 2 ε +M 0 )+4M 2 0 log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n ≤18(M ε +2M 0 ) 2 log e (max{n,p}) 2+ε 2 r P(X X X∈t t t) n . Thisandtheargumentbefore(A.60)concludethesecondassertion,andhencewehavefinished theproof. Proofof (A.50). 
TheproofideafollowsthatforproofofTheorem3,butismuchsimplifiedasthe theoretical tree growing rule T ∗ is considered here. In what follows, we deal with the case where therearenorandomsplitsfirst(seetheendofthisprooffordetails). Letusstartwithanexpression 119 for the LHS of (A.50) in (A.68) below, which can be obtained by direct calculations when there arenorandomsplits. Itholdsthat E m(X X X)−m ∗ T ∗(Θ Θ Θ 1:k ,X X X) 2 = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1:k )∈T ∗ (Θ 1:k ) P(X X X∈t t t k )V(t t t k ) = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1:k )∈T ∗ (Θ 1:k ) P(X X X∈t t t k−1 )P(X X X∈t t t k |X X X∈t t t k−1 )V(t t t k ) = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1:k )∈T ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) (I) t t t k−1 ,t t t k 2 = ∑ Θ 1:k P(Θ Θ Θ 1:k =Θ 1:k ) ∑ (t t t 1:k )∈T ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 = ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) × ∑ Θ k P(Θ Θ Θ k =Θ k ) ∑ (t t t 1:k )∈T ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 , (A.68) whereV(t t t):=Var(m(X X X)|X X X∈t t t), the third equality follows from the definition of (I) t t t k−1 ,t t t k and the fact that there are exactly two daughter nodes after each t t t k−1 , the fourth equality is due to the identity Var(m(X X X)|X X X∈t t t) = (I) t t t,t t t(j,c) +(II) t t t,t t t(j,c) for every t t t and j,c∈t j , and the last equality is fromtheindependencebetweencolumnsets. Inaddition,weletVar(m(X X X)|X X X∈t t t):=0,(I) t t t,t t t ′ :=0, and (II) t t t,t t t ′ :=0ift t t =t t t ′ = / 0. To proceed, we separately deal with tree branches in the tree T ∗ (Θ 1:k ) as follows. There are 2 k distinct tree branchest t t 1:k in T ∗ (Θ 1:k ), and we call the first two of these tree branches “the first 120 treebranchofT ∗ (Θ 1:k ),”whosecorrespondinglastcolumnsetrestrictionisΘ k,1 (recallthatΘ k = {Θ k,1 ,···,Θ k,2 k−1}). SeeFigure4.1foragraphicalillustration. Sincecolumnsetsareindependent, RHSof(A.68) = ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) × ∑ Θ k P(Θ Θ Θ k =Θ k ) ∑ ThefirsttreebranchofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 + ∑ Θ k P(Θ Θ Θ k =Θ k ) ∑ OthertreebranchesofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 = ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) × ∑ Θ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) ∑ ThefirsttreebranchofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 + ∑ Θ k,2 ,···,Θ k,2 k−1 P((Θ Θ Θ k,2 ,···,Θ Θ Θ k,2 k−1)=(Θ k,2 ,···,Θ k,2 k−1)) × ∑ OthertreebranchesofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 . (A.69) Now,letussayagoodcolumnsetrestrictionΘw.r.t. anodet t t issuchthat sup j∈Θ,c∈t j (II) t t t,t t t(j,c) = sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) . Becausem(X X X)∈SID(s ∗ ),itholdsthatforanodet t t, 8 > > < > > : V(t t t)−sup j∈Θ,c∈t j (II) t t t,t t t(j,c) ≤(1−(s ∗ ) −1 )V(t t t), ifΘisgood, V(t t t)−sup j∈Θ,c∈t j (II) t t t,t t t(j,c) ≤V(t t t), o.w. (A.70) 121 By(A.70),wedealwiththefirsttreebranchasfollows. ∑ Θ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) ∑ ThefirsttreebranchofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 )−(II) t t t k−1 ,t t t k 2 ≤ ∑ GoodΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) ∑ ThefirsttreebranchofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) (1−(s ∗ ) −1 )V(t t t k−1 ) 2 + ∑ BadΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 ) ∑ ThefirsttreebranchofT ∗ (Θ 1:k ) P(X X X∈t t t k−1 ) V(t t t k−1 ) 2 . (A.71) Notice that the end node t t t k is not needed on the RHS of (A.71), and that the first tree branch consistsofexactlytwodaughternodes. Lett t t 1 ,···,t t t k−1 denotethefirsttreebranch. 
Then, RHSof(A.71) = ∑ GoodΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 )P(X X X∈t t t k−1 )(1−(s ∗ ) −1 )V(t t t k−1 ) + ∑ BadΘ k,1 P(Θ Θ Θ k,1 =Θ k,1 )P(X X X∈t t t k−1 )V(t t t k−1 ) . (A.72) Furthermore, recall that the probability of having a good column set is at leastγ 0 according to our model assumption. Specifically, the probability of having any active j inΘ Θ Θ isγ 0 ; if no active feature is left for t t t, then Θ Θ Θ is a good column set with probability one. By this and the fact that 1−(s ∗ ) −1 ≤1, RHSof(A.72) ≤ γ 0 ×P(X X X∈t t t k−1 )(1−(s ∗ ) −1 )V(t t t k−1 )+(1−γ 0 )×P(X X X∈t t t k−1 )V(t t t k−1 ) ≤ (1−γ 0 (s ∗ ) −1 )×P(X X X∈t t t k−1 )V(t t t k−1 ) . (A.73) Next,weapplytheargumentfor(A.71)–(A.73)toothertreebranchesin(A.69)andobtain RHSof(A.69) ≤(1−γ 0 (s ∗ ) −1 ) ∑ Θ 1:k−1 P(Θ Θ Θ 1:k−1 =Θ 1:k−1 ) ∑ (t t t 1:k−1 )∈T ∗ (Θ 1:k−1 ) P(X X X∈t t t k−1 )V(t t t k−1 ). (A.74) 122 Toconclude,werecursivelyapplytheseargumentstoshowthat RHSof(A.74)≤(1−γ 0 (s ∗ ) −1 ) k V(t t t 0 ), whichleadstothedesiredresult. Lastly,toconsiderrandomsplits,welet“Randomsplits”denotetherandomparameterofthese randomsplits,andhence E m(X X X)−m ∗ T ∗(Θ Θ Θ 1:k ,X X X) 2 =E E m(X X X)−m ∗ T ∗(Θ Θ Θ 1:k ,X X X) 2 |Randomsplits ≤E (1−γ 0 (s ∗ ) −1 ) k V(t t t 0 ) =(1−γ 0 (s ∗ ) −1 ) k V(t t t 0 ), wheretheinequalityisduetothepreviousarguments. Thisconcludestheproof. Proofof (A.51). Recallthatγ 0 =1meansthattheallcolumnsetscontainallfeatures,andthatthe forestmodelisessentiallyadecisiontreemodel. Inaddition,recallthatthetotalvarianceis Var(m(X X X))=s ∗ β 2 4 . (A.75) SincethetreemodelisgrownbyusingtheoreticalCART,byi)ofLemma5,thefirstsplitison oneofthefirsts ∗ coordinates;thetotalbiasdecreaseforthefirstsplitis β 2 4 . Next, we split the resulting two daughter nodes by using theoretical CART. Each split results in conditional bias decrease of an amount of β 2 4 , and that each daughter nodet t t is such thatP(X X X∈ t t t)= 1 2 .Hence,thetotalbiasdecreaseforthesecondsplitis 1 2 β 2 4 + 1 2 β 2 4 = β 2 4 . Thesestepsrepeatuntiltherearenoactivefeaturestobesplit;weseethatateachlevelk≤s ∗ , thetotalbiasdecreaseis β 2 4 . By(A.75)andthisargument,weconcludetheproof. 123 Proofof (A.52). By(2.2),(2.10),andtheassumptionsofaboundedregressionfunctionandmodel errors, E m ∗ b T (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤E 1 1 1 U n ∑ (t t t 1:k )∈ b T(Θ Θ Θ 1:k ) 1 1 1 X X X∈t t t k E(m(X X X)|X X X∈t t t k ) − ∑ (t t t 1:k )∈ b T(Θ Θ Θ 1:k ) 1 1 1 X X X∈t t t k ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k 2 +(2M 0 +M ε ) 2 P(U c n ) =E 1 1 1 U n ∑ (t t t 1:k )∈ b T(Θ Θ Θ 1:k ) 1 1 1 X X X∈t t t k E(m(X X X)|X X X∈t t t k )− ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k 2 +(2M 0 +M ε ) 2 P(U c n ), (A.76) wherethesecondequalityisduetothefactthat1 1 1 X X X∈t t t ×1 1 1 X X X∈t t t ′ =0ift t t∩t t t ′ = / 0. The RHS of (A.76) can be further dealt with as follows. By the model assumptions, for every endnodet t t oftreesoflevelk, P(X X X∈t t t)≥2 −k . (A.77) For every end nodet t t k in (A.76) with 0≤k≤ηlog 2 (n), it holds that eithert t t k ∈G n ort t t k is an emptyset. Ift t t k isanemptyset,bythedefinitionthat 0 0 =0, E(m(X X X)|X X X∈t t t k )− ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k =0. 
(A.78) 124 Ontheotherhand,onU n ,foreachnodet t t k ∈G n , E(m(X X X)|X X X∈t t t k )− ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k ≤ E(m(X X X)1 1 1 X X X∈t t t k ) P(X X X∈t t t k ) − n −1 ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) P(X X X∈t t t k ) + n −1 ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) P(X X X∈t t t k ) − ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k ≤ 1 P(X X X∈t t t k ) E(m(X X X)1 1 1 X X X∈t t t k )− ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) n + 1 P(X X X∈t t t k ) ∑ n i=1 1 1 1 x x x i ∈t t t k (m(x x x i )+ε i ) ∑ n i=1 1 1 1 x x x i ∈t t t k ∑ n i=1 1 1 1 x x x i ∈t t t k n −P(X X X∈t t t k ) ≤ 1 P(X X X∈t t t k ) ( q 2M 2 0 +M 2 ε +M 0 +M ε ) log e (max{n,p}) 2+ε 2 r P(X X X∈t t t k ) n ≤2 k 2 (3M 0 +2M ε ) log e (max{n,p}) 2+ε 2 √ n , (A.79) wherethethirdinequalityisduetoeventU n andtheassumptionthatt t t k ∈G n ,andthelastequality isdueto(A.77)andthesubadditivityinequality. By(A.76),(A.78)–(A.79), E m ∗ b T (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤(3M 0 +2M ε ) 2 2 k log e (max{n,p}) 2+ε n +(2M 0 +M ε ) 2 P(U c n ), whichconcludesthedesiredresult. Remark10. Letusgivesomeintuitionforhowtoestablishasharperestimationupperboundthat depends on γ 0 . The way we deal with the estimation variance here is to bound the squared dif- ferencesin(A.76)directly;essentially,weestablishtheestimationvarianceupperboundforeach tree. Notice that the end nodes for each tree b T(Θ 1:k ) are exclusive, but end nodes of distinct trees are not exclusive. Our intuition is that it may be possible to aggregate distinct trees and sharpen the estimation upper bound. The new upper bound should depend on γ 0 since column aggrega- tion (i.e., the expectation over Θ Θ Θ 1:k ) depends on γ 0 . An example of utilizing column aggregation 125 for analysis can be seen in our bias analysis in (A.50) and Section 2.4. There, we argue that the overall squared bias of a forest is controlled instead of arguing that each tree’s bias is controlled, which is not right since there are always trees with high bias and trees with low bias in a forest withγ 0 <1andallpossibletrees. A.4.4 ProofofLemma1 RecallthatX n denotestheni.i.d. observationsandX X X istheindependentcopyofx x x 1 . Letζ =n −δ , and b T ζ andU U U n be as defined in Theorem 4. See Sections 2.4.2 and A.3.3 for the definitions of b T ζ and eventU U U n , respectively. The main idea of the proof is the same as that described in (2.23), but in the formal proof, it is b T ζ instead of b T that satisfies Condition 5. For details, see Theorem 4 and Remark9. Anapplicationofthetriangleinequalityleadsto E m(X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 =E m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) − m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 ≤2 E m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 +E m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 ! . (A.80) FromTheorem4,weseethatforalllargen,oneventU U U n , b T ζ withζ =n −δ satisfiesCondition5 with k =⌊clogn⌋,ε =n −η , andα 2 . Observe that if a tree growing rule satisfies Condition 5 with k, it satisfies Condition 5 with each positive integer no larger than k. By the fact thatX n is independentofX X X andΘ Θ Θ,U U U n isX n -measurable,Condition4(whichstatesthatsup c c c∈[0,1] p|m(c c c)|≤ M 0 ),andTheorem3withε =n −η ,itholdsthatforeach1≤k≤clogn, E " m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 X n # 1 1 1 U U U n ≤α 1 α 2 n −η +(1−γ 0 (α 1 α 2 ) −1 ) k M 2 0 . 
(A.81) 126 Thenby(A.81)andCondition4,itholdsthatforalllargenandeachinteger1≤k≤clogn, E m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 =E " m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 1 1 1 U U U n # +E " m(X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 1 1 1 U U U c n # ≤α 1 α 2 n −η +(1−γ 0 (α 1 α 2 ) −1 ) k M 2 0 +n −1 . (A.82) Here,toboundthesecondtermontheRHSoftheequalityabove,weutilizeCondition4,standard inequalities,andthatP(U U U c n )=o(n −1 ). Ontheotherhand,byCondition4wehavesup c c c∈[0,1] p|m ∗ b T (c c c)−m ∗ b T ζ (c c c)|≤2M 0 . Bythisandthe factthatthereareatmost2 k nodesatlevelk,itholdsforeachΘ 1 ,···,Θ k thaton∩ k l=1 {Θ Θ Θ l =Θ l }, E h m ∗ b T (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X)−m ∗ b T ζ (Θ Θ Θ 1 ,···,Θ Θ Θ k ,X X X) 2 Θ Θ Θ 1 ,···,Θ Θ Θ k ,X n i ≤ζ2 k (2M 0 ) 2 . (A.83) Hence,wecanconcludethatforeach1≤k≤clogn, E m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T ζ (Θ Θ Θ 1:k ,X X X) 2 ≤ζ2 k+2 M 2 0 . (A.84) Therefore,inviewof(A.80),(A.82),and(A.84),itholdsthatforalllargenandeachinteger1≤k≤clogn, E m(X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 ≤2 4M 2 0 n −δ 2 k +α 1 α 2 n −η +M 2 0 (1−γ 0 (α 1 α 2 ) −1 ) k +Cn −1 , (A.85) whichconcludestheproofofLemma1. 127 A.4.5 ProofofLemma2 ThemainideaoftheproofforthislemmaisbasedonthegridandhasbeendiscussedinSection2.5. Letthe grid be defined with positive parametersρ 1 andρ 2 ; see Section A.1.1 for details of these parameters. With thegridandbysomesimplecalculations,wecanwrite E m ∗ b T (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 =E m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T # (Θ Θ Θ 1:k ,X X X)+m ∗ b T # (Θ Θ Θ 1:k ,X X X)−b m b T # (Θ Θ Θ 1:k ,X X X,X n ) +b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤3 E m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T # (Θ Θ Θ 1:k ,X X X) 2 +E m ∗ b T # (Θ Θ Θ 1:k ,X X X)−b m b T # (Θ Θ Θ 1:k ,X X X,X n ) 2 +E b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 . (A.86) Letuschoosemin{1,ν+1/2}>∆>1/2. ThenitfollowsfromLemma3thatthereexistssomeconstant C>0suchthat E m ∗ b T (Θ Θ Θ 1:k ,X X X)−m ∗ b T # (Θ Θ Θ 1:k ,X X X) 2 +E b m b T # (Θ Θ Θ 1:k ,X X X)−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤C2 k n − 1 2 +ν . (A.87) Recall that conditional onX n , b T is essentially associated with a deterministic splitting criterion. By this factandTheorem5,wecandeducethat E m ∗ b T # (Θ Θ Θ 1:k ,X X X)−b m b T # (Θ Θ Θ 1:k ,X X X,X n ) 2 =E E m ∗ b T # (Θ Θ Θ 1:k ,X X X)−b m b T # (Θ Θ Θ 1:k ,X X X,X n ) 2 Θ Θ Θ 1:k ,X n ≤E sup T E m ∗ T # (Θ Θ Θ 1:k ,X X X)−b m T #(Θ Θ Θ 1:k ,X X X,X n ) 2 Θ Θ Θ 1:k ,X n ≤n −η , (A.88) wherethesupremumisoverallpossibletreegrowingrules. Therefore,combining(A.86)–(A.88)completes theproofofLemma2. 128 A.4.6 ProofofLemma3 Recall thatX n denotes the n i.i.d. observations and X X X is the independent copy of x x x 1 . The parameters ∆,c,C aregivenbyLemmaA.4.6. EssentiallyLemma3showsthatthepopulationmeansconditionalonan arbitrary nodet t t is very close to those conditional on nodet t t # in terms of theL 2 distance. In addition to the populationmeans,Lemma3alsoconsidersthedeviationsofthesamplemeans. Tocontrolthosedeviations, we exploit the results in (A.2) and (A.7), the moment bounds of the model errors, and Condition 4 (which states that sup c c c∈[0,1] p|m(c c c)|≤M 0 ). In what follows, we first establish the bound in (2.28). 
By Condition 4, wehavethatforeachn≥1andk≥1, E m ∗ b T # (Θ Θ Θ 1:k ,X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 =E 2 4 E 0 @ ∑ (t t t 1 ,···,t t t k )∈ b T(Θ Θ Θ 1:k ) m ∗ b T # (Θ Θ Θ 1:k ,X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 1 1 1 X X X∈t t t k Θ Θ Θ 1:k ,X n 1 A 3 5 ≤E " E ∑ (t t t 1 ,···,t t t k )∈ b T(Θ Θ Θ 1:k ) P(X X X∈t t t k )≥n ∆−1 m ∗ b T # (Θ Θ Θ 1:k ,X X X)−m ∗ b T (Θ Θ Θ 1:k ,X X X) 2 1 1 1 X X X∈t t t k Θ Θ Θ 1:k ,X n ! +2 k (n ∆−1 )(2M 0 ) 2 # . (A.89) It follows from Condition 4 and the definitions of the sharp notation and the population tree model that foreachn≥1and k≥1, RHSof (A.89) ≤E " E ∑ (t t t 1 ,···,t t t k )∈ b T(Θ Θ Θ 1:k ) P(X X X∈t t t k )≥n ∆−1 E(m(X X X)|X X X∈t t t k )−E(m(X X X)|X X X∈t t t # k ) 2 1 1 1 X X X∈t t t k ∩t t t # k +(2M 0 ) 2 1 1 1 X X X∈t t t k ∆t t t # k Θ Θ Θ 1:k ,X n !# +2 k (n ∆−1 )(2M 0 ) 2 ≤E " E ∑ (t t t 1 ,···,t t t k )∈ b T(Θ Θ Θ 1:k ) P(X X X∈t t t k )≥n ∆−1 E(m(X X X)|X X X∈t t t k )−E(m(X X X)|X X X∈t t t # k ) 2 1 1 1 X X X∈t t t k +(2M 0 ) 2 1 1 1 X X X∈t t t k ∆t t t # k Θ Θ Θ 1:k ,X n !# +2 k (n ∆−1 )(2M 0 ) 2 . (A.90) 129 TodealwiththeRHSof(A.90),weneedtoestablishanupperboundfor E(m(X X X)|X X X∈t t t k )−E(m(X X X)|X X X∈t t t # k ). In light of Condition 2(f(·) is the density of the distribution of X X X), (A.2), and Condition 4, it holds that for t t t k in(A.90)withP(X X X∈t t t k )≥n ∆−1 and1≤k≤clogn, E(m(X X X)|X X X∈t t t k )−E(m(X X X)|X X X∈t t t # k ) = E(m(X X X)1 1 1 X X X∈t t t k ) P(X X X∈t t t k ) − E(m(X X X)1 1 1 X X X∈t t t # k ) P(X X X∈t t t k ) + E(m(X X X)1 1 1 X X X∈t t t # k ) P(X X X∈t t t k ) − E(m(X X X)1 1 1 X X X∈t t t # k ) P(X X X∈t t t # k ) ≤ E(|m(X X X)|1 1 1 X X X∈t t t k ∆t t t #) P(X X X∈t t t k ) + E(|m(X X X)|1 1 1 X X X∈t t t # k ) P(X X X∈t t t # k ) 1− P(X X X∈t t t # k ) P(X X X∈t t t k ) ≤2 M 0 P(X X X∈t t t k ∆t t t # k ) P(X X X∈t t t k ) ≤2M 0 (sup f)(clogn) n 1−∆ n 1+ρ 1 , (A.91) where the third inequality follows from|P(A)−P(B)|≤P(A∆B) for two events A,B. Then by (A.90) – (A.91),wehavethatforalllargenandeach1≤k≤clogn, RHSof (A.90)≤E " E ∑ (t t t 1 ,···,t t t k )∈ b T(Θ Θ Θ 1:k ) 2M 0 (sup f)(clogn) n 1−∆ n 1+ρ 1 2 1 1 1 X X X∈t t t k +(2M 0 ) 2 1 1 1 X X X∈t t t k ∆t t t # k Θ Θ Θ 1:k ,X n !# +2 k (n ∆−1 )(2M 0 ) 2 ≤2 k 2M 0 (sup f)(clogn) n 1−∆ n 1+ρ 1 2 + (2M 0 ) 2 (sup f)(clogn) 1 n 1+ρ 1 ! +2 k (n ∆−1 )(2M 0 ) 2 , (A.92) whichleadsto(2.28). We next proceed to show the bound in (2.29). Let ¯ ∆ with 1/2< ¯ ∆<∆ and sufficiently small s>0 be givensuchthat ¯ ∆=∆−2s. LetusdefineC C C n :=∩ n i=1 {|ε i |≤n s }. Observethatforeachn≥1,conditionalon X n wehave 1) ForeachΘ 1 ,···,Θ k andtreegrowingrule,sup c c c∈[0,1] p|b m T(Θ 1:k ,c c c,X n ) |≤∑ n i=1 |y i |; 2) ForeachΘ 1 ,···,Θ k andtreegrowingrule,sup c c c∈[0,1] p|b m T(Θ 1:k ,c c c,X n ) |1 1 1 C C C n ≤M 0 +n s . 130 WefurtherdefineanX n -measurableevent E E E n :=C C C n ∩A 3 (⌊clogn⌋, ¯ ∆)∩A, where the eventA 3 (⌊clogn⌋, ¯ ∆) is given in Lemma 8 in Section A.5.1 andA is defined in (A.6). In particular,eventA 3 (⌊clogn⌋, ¯ ∆)saysthatthenumberofobservationsoneachnodet t t onthegridconstructed byatmost⌊clogn⌋cutsandwithP(X X X∈t t t)>n ¯ ∆−1 isnolessthann 1/2 . Then by property 1) above, the Cauchy–Schwarz inequality, and Minkowski’s inequality, it holds that foreachn≥1and k≥1, E b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 ≤E (2 n ∑ i=1 |y i |) 2 1 1 1 E E E c n + b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 1 1 1 E E E n ! ≤4 n ∑ i=1 E|y i | 4 1/4 ! 
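The truncation event $\bm{C}_n=\cap_{i=1}^n\{|\varepsilon_i|\le n^s\}$ and its relatives recur throughout this appendix. One standard way to control its complement, consistent with the role of the moment order $q$ in Condition 3 (the precise bound used in the text comes from (A.8) and is not reproduced here, so treat this as an illustrative sketch with hypothetical numbers), is a union bound combined with a Markov-type moment bound: $\mathbb{P}(\bm{C}_n^c)\le n\,\mathbb{E}|\varepsilon_1|^q\,n^{-sq}=O(n^{1-sq})$, which is $o(n^{-\kappa})$ as soon as $q>(\kappa+1)/s$.

```python
def truncation_tail_bound(n, s, q, eps_moment=1.0):
    """Union bound plus Markov: P(union_i {|eps_i| > n^s}) <= n * E|eps_1|^q / n^(s*q).
    Illustrative bookkeeping only; the appendix's own constants come from (A.8)."""
    return n * eps_moment / n ** (s * q)

# e.g. with s = 0.05: q > 40 already gives an o(n^{-1}) bound at n = 10^4
for q in (20, 40, 80):
    print(q, truncation_tail_bound(n=10_000, s=0.05, q=q))
```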
2 P(E E E c n ) 1 2 +E b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 1 1 1 E E E n . (A.93) Tobound the second term above, from the aforementioned property 2) and some basic calculations, we can obtainthatforeachnand1≤k≤clogn, E b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 1 1 1 E E E n ≤Q 1 +2 k (2n ¯ ∆−1 )(2(M 0 +n s )) 2 , (A.94) where Q 1 :=E " E ∑ (t t t 1:k )∈ b T(Θ Θ Θ 1:k ) P(X X X∈t t t k )≥2n ¯ ∆−1 b m b T # (Θ Θ Θ 1:k ,X X X,X n )−b m b T (Θ Θ Θ 1:k ,X X X,X n ) 2 1 1 1 X X X∈t t t k 1 1 1 E E E n X n ,Θ Θ Θ 1:k !# andthesecondtermontheright-handsideof(A.94)istheupperboundforatermsimilartoQ 1 butsumming over{t t t k :t t t 1:k ∈ b T(Θ Θ Θ 1:k )}withP(X X X∈t t t k )<2n ¯ ∆−1 . 131 To further deal with Q 1 , we need the following results (A.95)–(A.96). Due to Condition 2, (A.2), and thefactthatk≤clog 2 (n),foralllargen,itholdsthatifP(X X X∈t t t k )≥2n ¯ ∆−1 ,then P(X X X∈t t t # k )≥n ¯ ∆−1 . (A.95) Inaddition,itfollowsfromthedefinitionofsharpnotationandproperty2)abovethatforeach(t t t 1 ,···,t t t k )∈ b T(Θ 1:k ), b m b T # (Θ 1:k ,X X X,X n )−b m b T (Θ 1:k ,X X X,X n ) 2 1 1 1 X X X∈t t t k 1 1 1 E E E n ≤ ¯ y(t t t # k )− ¯ y(t t t k ) 2 1 1 1 X X X∈t t t k ∩t t t # k +(2(M 0 +n s )) 2 1 1 1 X X X∈t t t k \t t t # k 1 1 1 E E E n ≤ ¯ y(t t t # k )− ¯ y(t t t k ) 2 1 1 1 X X X∈t t t k +(2(M 0 +n s )) 2 1 1 1 X X X∈t t t k ∆t t t # k 1 1 1 E E E n , (A.96) whereforeachnodet t t, ¯ y(t t t):= ∑ x x x i ∈t t t y i #{i:x x x i ∈t t t} , and ¯ y(t t t)isdefinedaszeroifthedenominatoriszero. By(A.95)–(A.96),foralllargenand1≤k≤clogn, RHSof(A.94)≤Q 2 +2 k (2n ¯ ∆−1 )(2(M 0 +n s )) 2 (A.97) with Q 2 := E " E ∑ (t t t 1:k )∈ b T(Θ Θ Θ 1:k ) P(X X X∈t t t k )≥2n ¯ ∆−1 P(X X X∈t t t # k )≥n ¯ ∆−1 ¯ y(t t t # k )− ¯ y(t t t k ) 2 1 1 1 X X X∈t t t k +(2(M 0 +n s )) 2 1 1 1 X X X∈t t t k ∆t t t # k 1 1 1 E E E n X n ,Θ Θ Θ 1:k !# . 132 Inwhatfollows,wedealwith ¯ y(t t t # k )− ¯ y(t t t k ). Bysimplecalculations,foreveryt t t k , ¯ y(t t t # k )− ¯ y(t t t k ) = ∑ x x x i ∈t t t # k y i #{i:x x x i ∈t t t # k } − ∑ x x x i ∈t t t k y i #{i:x x x i ∈t t t k } = (#{i:x x x i ∈t t t k }) ∑ x x x i ∈t t t # k y i −(#{i:x x x i ∈t t t # k }) ∑ x x x i ∈t t t k y i (#{i:x x x i ∈t t t # k })×(#{i:x x x i ∈t t t k }) = ∑ x x x i ∈t t t # k y i − ∑ x x x i ∈t t t k y i #{i:x x x i ∈t t t # k } + #{i:x x x i ∈t t t k }−#{i:x x x i ∈t t t # k } #{i:x x x i ∈t t t # k } ∑ x x x i ∈t t t k y i #{i:x x x i ∈t t t k } ≤ ∑ x x x i ∈t t t # k ∆t t t (|m(x x x i )|+|ε i |) #{i:x x x i ∈t t t # k } + #{i:x x x i ∈t t t k ∆t t t # k } #{i:x x x i ∈t t t # k } ∑ x x x i ∈t t t k (|m(x x x i )|+|ε i |) #{i:x x x i ∈t t t k } . (A.98) By the definition ofA 3 (k, ¯ ∆) (in Lemma 8; recall that ¯ ∆ > 1 2 ), for each t t t k satisfying the conditions specifiedinQ 2 , #{i:x x x i ∈t t t # k }≥n 1 2 . By this, (A.7), Condition 4, and the fact that k≤clogn, it holds that on E E E n , for each t t t k satisfying the conditionsspecifiedinQ 2 , RHSof(A.98)≤ 2(M 0 +n s )c(logn) 2+ρ 2 n 1 2 , (A.99) wherewerecallthatρ 2 >0isdefinedin(A.6). By (A.98)–(A.99), (A.2), and∆= ¯ ∆+2s, there exists some constantC>0 such that for all large n and 1≤k≤clogn, RHSof(A.97) ≤2 k 2(M 0 +n s )c⌈logn⌉ 2+ρ 2 n 1 2 2 + (2(M 0 +n s )) 2 n ! +2 k (2n ¯ ∆−1 )(2(M 0 +n s )) 2 ≤2 k ×(2(M 0 +n s )) 2 × 2c 2 ⌈logn⌉ 4+2ρ 2 n +2n ¯ ∆−1 ≤C2 k n ∆−1 , (A.100) 133 which gives the bound for the second term on the RHS of (A.93). 
For the first term on the RHS of (A.93), inviewofConditions3–4,wehave (∑ n i=1 (E|y i | 4 ) 1/4 ) 2 n 2 =O(1), andbyLemma8,(A.3),andCondition3withsufficientlylargeq,itholdsthat P(E E E c n )=o(n −6 ). Therefore,combiningtheseresults,(A.93),and(A.100)yields(2.29),whichconcludestheproofofLemma 3. A.4.7 Lemma6anditsproof All the assumptions and notation in Lemma 6 below follow those in Theorem 4. In particular, we set k=⌊clog(n)⌋. Lemma6. ThereexistssomeconstantC>0suchthatoneventA 3 (k+1,∆)∩A,itholdsthatforalllarge n and each set of available features, each t t t constructed by less than k cuts with P(X X X∈t t t)≥ n −δ and its daughternodes b t t t,t t t † ,andt t t ∗ satisfythefollowingproperties: 1) c † (t t t)isnotrandom(i.e.,c † (t t t)isanelementin(A.29)). 2) |(II) t t t,t t t †−(II) t t t,t t t ∗|≤Cn − δ 2 . 3) d (II) t t t, b t t t ≥ d (II) t t t,t t t †. Proof. Letusassumethatproperty1)holdsforthemoment. Thenbythedefinitionof d (II)andthedefinitions of b t t t andt t t † (they are both daughter nodes oft t t and the corresponding cuts are along directions subject to the samesetofavailablefeatures),wehavethatifc † (t t t)isnotrandom,thenitholdsthat d (II) t t t, b t t t ≥ d (II) t t t,t t t †,which establishesproperty 3). From (A.30), the assumptions of Theorem 4, and some simple calculations, we can seethatthereexistssomeconstantC>0suchthatforalllargenandeachsetofavailablefeatures, |(II) t t t,t t t †−(II) t t t,t t t ∗|≤Cn − δ 2 , 134 whichprovesproperty2). Notethat(A.30)holdsforeachfeaturerestrictionΘ. Now it remains to establish property 1), which means that we have to show that the set of (A.29) is not empty. Foreachnodet t t =× p j=1 t j ,defineanode I I I(t t t,h,I):=t 1 ×···×t h−1 ×I×t h+1 ×···×t p forh∈{1,···,p}andanintervalI⊂[0,1]. DenotebyR(t t t,h,δ)asetcontainingalltheintervalsJ suchthat J⊂t h andP(X h ∈J|X X X∈t t t) =n −δ . Observe that if node t t t is constructed by less than k cuts, then I I I(t t t,h,I) withI∈R(t t t,h,δ)isconstructedbyatmostk+1cuts. Foreachintegerk,defineH k asthesetcontainingall nodesconstructed byat most k arbitrarycuts (thesecuts arenot necessarilyon thegridlines). Letusdefine anevent B B B(k):= 8 > < > : inf P(X X X∈t t t)≥n −δ , t t t∈H k−1 , h∈{1,···,p}, I∈R(t t t,h,δ) n ∑ i=1 1 1 1 x x x i ∈I I I(t t t,h,I) <1 9 > = > ; , wheretheinfimumisoverallt t t,h,andI suchthattheconditionshold. Thenwecanseethatonevent(B B B(k)) c , property1)holds,wherethesuperscriptcdenotesthesetcomplement. Next, recall some notation related to the grid defined in Sections 2.5 and A.1.1, includingt t t # , the event A, and parametersρ 1 >0,ρ 2 >0. By∆<1−2δ (see (A.28) for the definitions ofδ,∆) and Condition 2, foralllargenwehave B B B(k)⊂ 8 > > > < > > > : inf P(X X X∈t t t)≥ n ∆−1 +(k+1) sup f ⌈n 1+ρ 1⌉ n δ , t t t∈H k−1 , h∈{1,···,p}, I∈R(t t t,h,δ) n ∑ i=1 1 1 1 x x x i ∈I I I(t t t,h,I) <1 9 > > > = > > > ; , (A.101) whereweusen −δ ≥ n ∆−1 +(k+1) sup f ⌈n 1+ρ 1⌉ n δ foralllargenbecauseof∆<1−2δ andk=⌊clogn⌋(recall that k is defined to be⌊clogn⌋ in this proof). Notice that the infimum on the RHS of (A.101) is over the nodesin W := I I I(t t t,h,I):P(X X X∈t t t)≥ n ∆−1 +(k+1) sup f ⌈n 1+ρ 1 ⌉ n δ ,t t t∈H k−1 ,h∈{1,···,p},I∈R(t t t,h,δ) . 135 Next,itfollowsfromthedefinitionsofR(t t t,h,δ)andH k thatforeveryn≥1, W⊂ t t t :t t t∈H k+1 ,P(X X X∈t t t)≥n ∆−1 +(k+1) sup f ⌈n 1+ρ 1 ⌉ , andhenceforeachn≥1, RHSof (A.101) ⊂ 8 < : inf t t t: t t t∈H k+1 , P(X X X∈t t t)≥n ∆−1 +(k+1) sup f ⌈n 1+ρ 1⌉ n ∑ i=1 1 1 1 x x x i ∈t t t <1 9 = ; . 
(A.102)
Moreover, from (A.2) we can obtain that for each $n\ge1$,
\[
\text{RHS of (A.102)}\subset\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}}<1\Big\}, \tag{A.103}
\]
and by simple calculations,
\[
\text{RHS of (A.103)}=\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\sum_{i=1}^n\big(\mathbf{1}_{\bm{x}_i\in\bm{t}^\#}+\mathbf{1}_{\bm{x}_i\in\bm{t}\setminus\bm{t}^\#}-\mathbf{1}_{\bm{x}_i\in\bm{t}^\#\setminus\bm{t}}\big)<1\Big\}
\subset\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\Big(\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#}-\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#\Delta\bm{t}}\Big)<1\Big\}. \tag{A.104}
\]
Then by (A.7), which says that $\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#\Delta\bm{t}}<(k+1)(\log n)^{1+\rho_2}$ on $A$, it holds that for each $n\ge1$,
\[
\text{RHS of (A.103)}\subset\Big(\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#}<1+(k+1)(\log n)^{1+\rho_2}\Big\}\cap A\Big)\cup A^c. \tag{A.105}
\]
By $k=\lfloor c\log(n)\rfloor$, for all large $n$,
\[
\text{RHS of (A.105)}\subset\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#}<n^{\frac12}\Big\}\cup A^c, \tag{A.106}
\]
where we also remove the intersection with the event $A$. By the definitions of $G_{n,k+1}(\Delta)$ and $A_3(k+1,\Delta)$ in Lemma 8 of Section A.5.1,
\[
\Big\{\inf_{\substack{\bm{t}\in H_{k+1}\\ \mathbb{P}(\bm{X}\in\bm{t}^\#)\ge n^{\Delta-1}}}\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}^\#}<n^{\frac12}\Big\}
=\Big\{\inf_{\bm{t}\in G_{n,k+1}(\Delta)}\sum_{i=1}^n\mathbf{1}_{\bm{x}_i\in\bm{t}}<n^{\frac12}\Big\}
=\cup_{\bm{t}\in G_{n,k+1}(\Delta)}\big\{\#\{i:\bm{x}_i\in\bm{t}\}<n^{\frac12}\big\}
=(A_3(k+1,\Delta))^c.
\]
By this,
\[
\text{RHS of (A.106)}\subset(A_3(k+1,\Delta))^c\cup A^c. \tag{A.107}
\]
Therefore, in view of (A.101)–(A.107), we can conclude that $A_3(k+1,\Delta)\cap A\subset(\bm{B}(k))^c$, which leads to property 1). This completes the proof of Lemma 6.

A.4.8 Lemma 7 and its proof

All the assumptions and notation in Lemma 7 below follow those in Theorem 4. In particular, we set $k=\lfloor c\log(n)\rfloor$.

Lemma 7. There exists some constant $C>0$ such that on the event $\bm{U}_n$, it holds that for all large $n$, each node $\bm{t}$ constructed by less than $k$ cuts with $\mathbb{P}(\bm{X}\in\bm{t})\ge n^{-\delta}$ and each daughter node $\bm{t}'$ of $\bm{t}$ satisfy the following properties:
1) $|(\mathrm{II})_{\bm{t},\bm{t}'}-(\mathrm{II})_{\bm{t}^\#,(\bm{t}')^\#}|\le Cn^{-\delta}$.
2) $|(\mathrm{II})_{\bm{t}^\#,(\bm{t}')^\#}-\widehat{(\mathrm{II})}_{\bm{t}^\#,(\bm{t}')^\#}|\le C(n^{-\delta+2s}+n^{-\frac{\Delta'}{4}+2s})$.
3) $|\widehat{(\mathrm{II})}_{\bm{t}^\#,(\bm{t}')^\#}-\widehat{(\mathrm{II})}_{\bm{t},\bm{t}'}|\le C(n^{-\delta+2s}+n^{-\frac{\Delta'}{4}+2s})$.
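Lemma 7 compares the population and sample bias decreases on a node $\bm{t}$ and on its grid version $\bm{t}^\#$. As a rough illustration of the $\#$ operation (under the simplifying assumption, ours and not stated in this excerpt, that the grid of Section 2.5 consists of equispaced hyperplanes of mesh $1/\lceil n^{1+\rho_1}\rceil$ in each coordinate), the sketch below snaps every boundary of a node to the nearest grid value; with a bounded density, each snapped cut then moves at most $\sup f/\lceil n^{1+\rho_1}\rceil$ of probability mass, which is the role played by (A.2) in the proof that follows.

```python
import math

def snap_node_to_grid(node, n, rho1=0.1):
    """Sketch of the '#' operation: snap each coordinate interval of a node to the
    nearest grid hyperplanes, assuming an equispaced grid of mesh 1/N with
    N = ceil(n^(1 + rho1)); node is given as a list of (lo, hi) pairs per coordinate."""
    N = math.ceil(n ** (1 + rho1))
    return [(round(lo * N) / N, round(hi * N) / N) for lo, hi in node]

# a node built from two cuts, snapped for n = 1000
print(snap_node_to_grid([(0.123456, 0.777777), (0.0, 1.0)], n=1000))
```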
By(A.2),Condition2(f(·)isthedensityofthedistribution of X X X, which is the independent copy of x x x 1 ), and the choices of∆ andδ (∆,δ are defined in Theorem 4), it holdsthat a) Foralllargenandeacht t t withP(X X X∈t t t)≥n −δ , P(X X X∈t t t # )≥n ∆−1 ; b) Foralllargenandeacht t t ′ andt t t withP(X X X∈t t t ′ |X X X∈t t t)≥n −δ andP(X X X∈t t t)≥n −δ , P(X X X∈(t t t ′ ) # )≥n ∆−1 . 138 To show this result, note that by (A.2), that k =⌊clogn⌋, and the assumption thatt t t is constructed by lessthank cuts, |P(X X X∈t t t ′ )−P(X X X∈(t t t ′ ) # )|≤⌊clogn⌋ sup f ⌈n 1+ρ 1 ⌉ , andhencebytheassumptionsandthat∆−1<−2δ,itholdsthatforalllargen, P(X X X∈(t t t ′ ) # )≥P(X X X∈t t t ′ )−⌊clogn⌋ sup f ⌈n 1+ρ 1 ⌉ ≥n −2δ −⌊clogn⌋ sup f ⌈n 1+ρ 1 ⌉ ≥n ∆−1 . Recallthatρ 1 >0isdefinedforthegirdinSection2.5. c) Foralllargenandeacht t t ′ andt t t withP(X X X∈t t t ′ |X X X∈t t t)<n −δ andP(X X X∈t t t)≥n −δ , P(X X X∈(t t t ′ ) # |X X X∈t t t # )<2n −δ . Toshowthisresult,noticethat P(X X X∈(t t t ′ ) # ) P(X X X∈t t t # ) − P(X X X∈(t t t ′ )) P(X X X∈t t t) = P(X X X∈(t t t ′ ) # ) P(X X X∈t t t # ) − P(X X X∈(t t t ′ ) # ) P(X X X∈t t t) + P(X X X∈(t t t ′ ) # ) P(X X X∈t t t) − P(X X X∈(t t t ′ )) P(X X X∈t t t) ≤ (P(X X X∈t t t)−P(X X X∈t t t # ))P(X X X∈(t t t ′ ) # ) P(X X X∈t t t # )P(X X X∈t t t) +n δ P(X X X∈(t t t ′ ) # ∆t t t ′ ) ≤2n δ P(X X X∈(t t t ′ ) # ∆t t t ′ ) =o(n −δ ), where the inequalities use the assumptionP(X X X∈t t t)≥n −δ and the last equality follows from (A.2). ThedesiredresultfollowsfromthisandtheassumptionP(X X X∈t t t ′ |X X X∈t t t)<n −δ . 139 We first consider the case of t t t and t t t ′ withP(X X X∈t t t ′ | X X X∈t t t)≥n −δ . In light of (A.109) and the above claimsa)andb),thereexistssomeconstantC>0suchthatoneventU U U n (specifically,onA 1 (⌊clog(n)⌋,∆)∩ A 2 (⌊clog(n)⌋,∆)),foralllargenandeachsucht t t andt t t ′ wehave #{i:x x x i ∈(t t t ′ ) # } #{i:x x x i ∈t t t # } 0 @ ∑ x x x i ∈(t t t ′ ) # y i #{i:x x x i ∈(t t t ′ ) # } − ∑ x x x i ∈t t t # y i #{i:x x x i ∈t t t # } 1 A 2 −P(X X X∈(t t t ′ ) # |X X X∈t t t # ) E(m(X X X)|X X X∈(t t t ′ ) # )−E(m(X X X)|X X X∈t t t # ) 2 ≤Cn −∆ ′ 4 +s , (A.110) where∆ ′ isdefinedinTheorem4. Fortheothercaseoft t t andt t t ′ withP(X X X∈t t t ′ |X X X∈t t t)<n −δ ,itfollowsfrom(A.109)andtheaboveclaim c)thatthereexistssomeconstantC>0suchthatoneventU U U n ,foralllargenandeachsucht t t andt t t ′ wehave LHSof (A.110)≤C(n −δ+2s +n − ∆ ′ 4 +2s ). (A.111) Therefore, combining (A.108) and (A.110)–(A.111), we can establish property 2), which concludes the proofofLemma7. A.5 Additionallemmasandtechnicaldetails A.5.1 Lemma8anditsproof Let the sample size n and tree level k be given. Let G n,k be as defined in Section A.1.1; for the reader’s convenience, G n,k is the set containing all nodes on the grid constructed by at most k cuts with cuts all on thegridhyperplanesdefinedinSection2.5. For∆>0,wealsodefineG n,k (∆)asthesubsetofG n,k suchthat ift t t∈G n,k andP(X X X∈t t t)≥n ∆−1 ,thent t t∈G n,k (∆). Tosimplifythenotation,thecomplementofaneventthat dependsonsomeparameterssuchasA(·)isdenotedasA c (·). 140 Lemma 8. Let 1 2 <∆<1, c>0, κ >0, and 0<∆ ′ <∆ be given and assume Condition 3 with q> 4+4κ ∆ ′ andCondition4. 
Wedefine A c 1 (k,∆):=∪ t t t∈G n,k (∆) ∑ x x x i ∈t t t y i #{i:x x x i ∈t t t} −E(m(X X X)|X X X∈t t t) ≥n − ∆ ′ 4 , A c 2 (k,∆):=∪ t t t∈G n,k−1 (∆) t t t ′ ∈G n,k t t t ′ ⊂t t t ( #{i:x x x i ∈t t t ′ } #{i:x x x i ∈t t t} −P(X X X∈t t t ′ |X X X∈t t t) ≥n − ∆ ′ 4 ) , A c 3 (k,∆):=∪ t t t∈G n,k (∆) n #{i:x x x i ∈t t t}<n 1 2 o . Then,itholdsthatforalllargenand0≤k≤clog(n)+1, P(A c 1 (k,∆))≤n −κ , P(A c 2 (k,∆))≤n −κ , P(A c 3 (k,∆))≤n −κ . (A.112) Proof. The arguments for the three inequalities in (A.112) are similar, and we begin with showing the first one. The main idea of the proof is based on Hoeffding’s inequality. Since Hoeffding’s inequality is for boundedrandomvariables,wewillconsiderthetruncatedmodelerrorsinordertoapplythisinequality. For eachn≥1andk≥0,wecandeducethat P(A c 1 (k,∆)) =P A c 1 (k,∆)∩ ∩ n i=1 {|ε i |≤n ∆ ′ 4 } +P A c 1 (k,∆)∩ ∪ n i=1 {|ε i |>n ∆ ′ 4 } =P ∪ t t t∈G n,k (∆) E E E(t t t) ∩ ∩ n i=1 {|ε i |≤n ∆ ′ 4 } +P A c 1 (k,∆)∩ ∪ n i=1 {|ε i |>n ∆ ′ 4 } ≤ ∑ t t t∈G n,k (∆) P(E E E(t t t))+ n ∑ i=1 P |ε i |>n ∆ ′ 4 = ∑ t t t∈G n,k (∆) ∑ B P {i:x x x i ∈t t t}=B P E E E(t t t) {i:x x x i ∈t t t}=B + n ∑ i=1 P |ε i |>n ∆ ′ 4 , (A.113) 141 where∑ B representsthesummationoverallpossiblesubsetsof{1,···,n}and E E E(t t t):= 8 > < > : ∑ x x x i ∈t t t m(x x x i )+ε i 1 1 1 |ε i |≤n ∆ ′ 4 #{i:x x x i ∈t t t} −E(m(X X X)|X X X∈t t t) ≥n − ∆ ′ 4 9 > = > ; . Note that the summation for the first term on the RHS of (A.113) can be further decomposed into two termsas RHSof(A.113) = ∑ t t t∈G n,k (∆) ∑ #B≥ n ∆ 2 P {i:x x x i ∈t t t}=B P E E E(t t t) {i:x x x i ∈t t t}=B + ∑ t t t∈G n,k (∆) ∑ #B< n ∆ 2 P {i:x x x i ∈t t t}=B P E E E(t t t) {i:x x x i ∈t t t}=B + n ∑ i=1 P |ε i |>n ∆ ′ 4 ≤ ∑ t t t∈G n,k (∆) ∑ #B≥ n ∆ 2 P {i:x x x i ∈t t t}=B P E E E(t t t) {i:x x x i ∈t t t}=B + ∑ t t t∈G n,k (∆) P n ∑ i=1 1 1 1 x x x i ∈t t t < n ∆ 2 + n ∑ i=1 P |ε i |>n ∆ ′ 4 . (A.114) Thelastinequalityaboveisdueto ∑ #B< n ∆ 2 P {i:x x x i ∈t t t}=B P E E E(t t t) {i:x x x i ∈t t t}=B = ∑ #B< n ∆ 2 P(E E E(t t t)∩{{i:x x x i ∈t t t}=B}) ≤ ∑ #B< n ∆ 2 P({i:x x x i ∈t t t}=B) =P n ∑ i=1 1 1 1 x x x i ∈t t t < n ∆ 2 . 142 Then by the definition of G n,k (∆),∆>1/2, and Conditions 3–4, an application of Lemma 9 in Section A.5.2showsthatforalllargenandeachk≥0, RHSof(A.114)≤ ∑ t t t∈G n,k (∆) 2exp −n ∆−∆ ′ 8 ! +2exp −(logn) 2+∆ 2 ! + n ∑ i=1 P |ε i |>n ∆ ′ 4 . (A.115) Thus, it follows from ∆>∆ ′ , p =O(n K 0 ) in Condition 3 with q> 4+4κ ∆ ′ , and (A.8) that for all large n and each0≤k≤clogn+1, RHSof(A.115)≤n −κ , which establishes the first inequality in (A.112). The other two inequalities in (A.112) can be shown in a similarfashion,whichcompletestheproofofLemma8. A.5.2 Lemma9anditsproof Lemma 9. Assume that x x x i ’s are independent copies of X X X. Then for each n≥ 1, ∆ > 0, and t t t such that P(X X X∈t t t)≥n ∆−1 ,wehave P n ∑ i=1 1 1 1 x x x i ∈t t t ≤n ∆ − √ n(logn) 1+ ∆ 2 ! ≤2exp −(logn) 2+∆ 2 . (A.116) Assume further that sup c c c∈[0,1] p|m(c c c)|<∞, x x x i andε i ’s are independent,ε i ’s are identically distributed, and ε 1 hasasymmetricdistributionaroundzero. ThenforeachB⊂{1,···,n},t t t andt t t ′ witht t t ′ ⊂t t t,∆ ′′ >0,and t >0,itholdsforalllargenthat P 0 @ ∑ i∈B E(m(X X X)|X X X∈t t t)− m(x x x i )+ε i 1 1 1 |ε i |≤n ∆ ′′ #B ≥t {i:x x x i ∈t t t}=B 1 A ≤2exp −t 2 #B 4n 2∆ ′′ , P 0 @ ∑ i∈B P(X X X∈t t t ′ |X X X∈t t t)−1 1 1 x x x i ∈t t t ′ #B ≥t {i:x x x i ∈t t t}=B 1 A ≤2exp −t 2 #B 2 . (A.117) 143 Proof. 
ObservethatbyHoeffding’sinequality,wehavethatforeach∆>0andn≥1, P n −1 n ∑ i=1 1 1 1 x x x i ∈t t t −P(X X X∈t t t) ≥ (logn) 1+ ∆ 2 √ n ! ≤2exp −(logn) 2+∆ 2 . Thenbysomealgebraiccalculations,wecanestablishthedesiredprobabilityboundin(A.116). Ontheotherhand,itfollowsfromtheassumptionsthatforeachi∈B,∆ ′′ >0,andt t t⊂t t t 0 ,wehave E E(m(X X X)|X X X∈t t t)− m(x x x i )+ε i 1 1 1 |ε i |≤n ∆ ′′ {s:x x x s ∈t t t}=B =E E(m(X X X)|X X X∈t t t)− m(x x x i )+ε i 1 1 1 |ε i |≤n ∆ ′′ x x x i ∈t t t =0, which along with conditional Hoeffding’s inequality leads to the first probability bound in (A.117). The secondprobabilityboundin(A.117)canalsobeshownusingsimilararguments,whichconcludestheproof ofLemma9. A.5.3 VerifyingCondition1forExample1 Foranodet t t =× p j=1 t j ,ifb̸∈t 1 ,thenVar(m(X X X)|X X X∈t t t)=0. Inthiscase,thedesiredresultclearlyholds. As fortheothercasewithb∈t 1 ,asplitatbonthefirstcoordinateleadstothedesiredresult. A.5.4 VerifyingCondition1forExample2 Inthisproof,weshowthatforeacht t t =t 1 ×···×t p ,sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) isproportionalto Var(m(X X X)|X X X∈t t t). A simple calculation shows that the conditional bias decrease (II) t t t,t t t ′ given a split (j,x) withx∈t j isboundedfrombelowsuchthat (II) t t t,t t t ′≥P(X X X∈t t t ′ |X X X∈t t t)P(X X X∈t t t ′′ |X X X∈t t t) H j (x) 2 , (A.118) where H j (x):=E(m(X X X)|X X X∈t t t ′′ )−E(m(X X X)|X X X∈t t t ′ ), 144 and thatt t t ′ =t 1 ×···×t ′ j ×···×t p andt t t ′′ =t 1 ×···×t ′′ j ×···×t p witht ′ j =[inft j ,x) andt ′′ j =[x,supt j ]∩t j . Forthereader’sconvenience,recallthat (II) t t t,t t t ′ =P(X X X∈t t t ′ |X X X∈t t t) E(m(X X X)|X X X∈t t t ′ )−E(m(X X X)|X X X∈t t t) 2 +P(X X X∈t t t ′′ |X X X∈t t t) E(m(X X X)|X X X∈t t t ′′ )−E(m(X X X)|X X X∈t t t) 2 . Inthefollowing,weshowthattheRHSof(A.118)giventheoptimalsplitamong (j, 1 2 inft j + 1 2 supt j ) | {z } Midpointonthejthcoordinate and (j, 1 4 inft j + 3 4 supt j ) | {z } Thirdquarterofthejthcoordinate , j =1,···,s ∗ is at least a proportion of Var(m(X X X)|X X X∈t t t). Notice that for the splits (j, 1 2 inft j + 1 2 supt j ) and (j, 1 4 inft j + 3 4 supt j ),itholdsthat P(X X X∈t t t ′ ) P(X X X∈t t t) P(X X X∈t t t ′′ ) P(X X X∈t t t) = 1 4 and P(X X X∈t t t ′ ) P(X X X∈t t t) P(X X X∈t t t ′′ ) P(X X X∈t t t) = 3 16 respectivelyforeach j duetotheuniform distributionassumptiononX X X. Westartwiththefollowingresults(A.119)–(A.121). Recallthat∑ s ∗ l>s ∗(···)= 0sinceitisasummationoveranemptyset. Itholdsthat Var(m(X X X)|X X X∈t t t) = s ∗ ∑ j=1 1 3 H j inft j +supt j 2 2 + β 2 jj 180 (supt j −inft j ) 4 + s ∗ ∑ l>j β 2 lj 144 (supt j −inft j ) 2 (supt l −inft l ) 2 . (A.119) If H j ( inft j +supt j 2 ) 2 ≤ 1 1296 β 2 jj (supt j −inft j ) 4 ,then H j 1 4 inft j + 3 4 supt j 2 ≥ β 2 jj (supt j −inft j ) 4 648 . (A.120) Foreach j, s ∗ ∑ l>j β 2 lj 144 (supt j −inft j ) 2 (supt l −inft l ) 2 ≤ 1 3 H j inft j +supt j 2 2 . (A.121) 145 The results of (A.119)–(A.121) are proven after the proof for Example 2; the coefficient assumption of Example2isusedforderiving(A.121). Define T j := 1 3 H j inft j +supt j 2 2 | {z } thefirstterm + β 2 jj 180 (supt j −inft j ) 4 | {z } thesecondterm + s ∗ ∑ l>j β 2 lj 144 (supt j −inft j ) 2 (supt l −inft l ) 2 | {z } thethirdterm . By(A.119),wesupposewithoutlossofgeneralitythatforsome j≤s ∗ , T j ≥ Var(m(X X X)|X X X∈t t t) s ∗ . (A.122) By(A.121),oneof thefirsttwoterms of T j isthelargesttermin T j (tiesare possible). 
Ifthelargesttermof T j isitsfirstterm,thenby(A.122), 1 3 H j inft j +supt j 2 2 ≥ Var(m(X X X)|X X X∈t t t) 3s ∗ . Therefore,by(A.118),thesplit (j, inft j +supt j 2 )issuchthat (II) t t t,t t t ′≥ 1 4 Var(m(X X X)|X X X∈t t t) s ∗ . (A.123) If the second term of T j is the largest term of T j and that H j ( inft j +supt j 2 ) 2 > 1 1296 β 2 jj (supt j −inft j ) 4 , then by(A.122), H j inft j +supt j 2 2 ≥ 180 1296 × Var(m(X X X)|X X X∈t t t) 3s ∗ , whichincombinationwith(A.118)concludesthatthesplit (j, inft j +supt j 2 )issuchthat (II) t t t,t t t ′≥ 5 432 Var(m(X X X)|X X X∈t t t) s ∗ . (A.124) 146 Otherwise, if the second term of T j is the largest term of T j and that H j ( inft j +supt j 2 ) 2 ≤ 1 1296 β 2 jj (supt j − inft j ) 4 ,thenby(A.120)and(A.122), H j 1 4 inft j + 3 4 supt j 2 ≥ 180 648 × Var(m(X X X)|X X X∈t t t) 3s ∗ , whichincombinationwith(A.118)concludesthatthesplit (j, 1 4 inft j + 3 4 supt j )issuchthat (II) t t t,t t t ′≥ 5 288 Var(m(X X X)|X X X∈t t t) s ∗ . (A.125) Theresultsof(A.123)–(A.125)concludesthatsup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) isproportionalto Var(m(X X X)|X X X∈t t t)foreacht t t,andthat m(X X X)∈SID(86.4×s ∗ ), whichisthedesiredresult. Proofsof (A.119)–(A.121). Westartwithproving(A.121). Define g j (X X X)=β jj X 2 j +β j X j + s ∗ ∑ l=1 l̸=j β lj X l X j with β lj =β jl . Consider a node t t t, a split (j,x), and two daughter nodes t t t ′ and t t t ′′ as in (A.118). By the uniformdistributionassumptiononX X X, H j (x)=E(g j (X X X)|X X X∈t t t ′′ )−E(g j (X X X)|X X X∈t t t ′ ), whosedetailedderivationisomittedforsimplicity. Recallthatt t t ′ =t 1 ×···×t ′ j ×···×t p andt t t ′′ =t 1 ×···× t ′′ j ×···×t p witht ′ j =[inft j ,x)andt ′′ j =[x,supt j ]∩t j . 147 Inwhatfollows,wederiveaclosed-form expressionfor H j (x)in(A.127)below. Bytheuniformdistri- butionassumptiononthefeaturevectorandthechangeofvariablesformula, E(g j (X X X)|X X X∈t t t ′′ ) = 1 |t ′′ j | Z z∈t ′′ j β jj z 2 +β j z+ s ∗ ∑ l=1;l̸=j β lj R l 2 zdz = 1 |t ′ j | Z z∈t ′ j β jj (f(z)) 2 + β j + s ∗ ∑ l=1;l̸=j β lj R l 2 ! f(z)dz, (A.126) where f(z)= (supt j −x)(z−inft j ) x−inft j +x,R j :=supt j +inft j ,andr j :=supt j −inft j . Since 1 |t ′ j | Z z∈t ′ j (f(z)) k dz= (f(x)) k+1 (k+1)(supt j −x) − (f(inft j )) k+1 (k+1)(supt j −x) = (supt j ) k+1 −x k+1 (k+1)(supt j −x) , wehave RHSof(A.126)= β jj 3 (supt j ) 2 +(supt j )x+x 2 + β j + s ∗ ∑ l=1;l̸=j β lj R l 2 ! supt j +x 2 . Bythisandsimilarcalculations,itholdsthat H j (x)= β jj r j 3 R j +x + β j + s ∗ ∑ l=1;l̸=j β lj R l 2 ! r j 2 . (A.127) Hence, H j ( 1 2 inft j + 1 2 supt j )= r j 2 β jj R j +β j + s ∗ ∑ l=1;l̸=j β lj R l 2 ! , H j ( 1 4 inft j + 3 4 supt j )= r j 2 β jj 5 6 inft j + 7 6 supt j +β j + s ∗ ∑ l=1;l̸=j β lj R l 2 ! , (A.128) wherethefirstequalityconcludes(A.121)giventhecoefficientassumption. Next,weproceedtoshow(A.119). 
Wehave Var(m(X X X)|X X X∈t t t)= s ∗ ∑ j=1 Var(g j (X X X)|X X X∈t t t)− s ∗ −1 ∑ j=1 s ∗ ∑ l>j β 2 lj Var(X l X j |X X X∈t t t), (A.129) 148 where to justify the equality, we notice that when expanding Var(m(X X X)|X X X∈t t t), all terms appearing are i) 2E (β i X i −E(β i X i |X X X∈t t t))(β j X j −E(β j X j |X X X∈t t t))|X X X∈t t t , ii)2E (β ij X i X j −E(β ij X i X j |X X X∈t t t))(β kl X k X l −E(β kl X k X l |X X X∈t t t))|X X X∈t t t , iii)2E (β j X j −E(β j X j |X X X∈t t t))(β kl X k X l −E(β kl X k X l |X X X∈t t t))|X X X∈t t t , iv)2E (β jj X 2 j −E(β jj X 2 j |X X X∈t t t))(β k X k −E(β k X k |X X X∈t t t))|X X X∈t t t , v)2E (β jj X 2 j −E(β jj X 2 j |X X X∈t t t))(β kl X k X l −E(β kl X k X l |X X X∈t t t))|X X X∈t t t , vi)2E (β jj X 2 j −E(β jj X 2 j |X X X∈t t t))(β kk X 2 k −E(β kk X 2 k |X X X∈t t t))|X X X∈t t t , vii)Var(β lj X l X j |X X X∈t t t), viii)Var(β jj X 2 j |X X X∈t t t), ix)Var(β j X j |X X X∈t t t), x)2E (β j X j −E(β j X j |X X X∈t t t))(β lj X l X j −E(β lj X l X j |X X X∈t t t))|X X X∈t t t , xi)2E (β ij X i X j −E(β ij X i X j |X X X∈t t t))(β lj X l X j −E(β lj X l X j |X X X∈t t t))|X X X∈t t t , xii)2E (β jj X 2 j −E(β jj X 2 j |X X X∈t t t))(β j X j −E(β j X j |X X X∈t t t))|X X X∈t t t ,and xiii) 2E (β jj X 2 j −E(β jj X 2 j |X X X∈t t t))(β lj X l X j −E(β lj X l X j |X X X∈t t t))|X X X∈t t t , where i, j,k,l are distinct in- dices. Amongthem,i)–vi)arezerossincefeaturesareindependent,viii)–xiii)areconsideredin ∑ s ∗ j=1 Var(g j (X X X)|X X X∈t t t), and vii) is considered twice (for each l, j) in ∑ s ∗ j=1 Var(g j (X X X)|X X X∈t t t), which con- cludes(A.129). Inaddition,bytheuniformdistributionassumptiononX X X andadirectcalculation, Var(g j (X X X)|X X X∈t t t)=β 2 jj Var(X 2 j |X j ∈t j )+β 2 j Var(X j |X j ∈t j ) + s ∗ ∑ l=1;l̸=j β 2 lj Var(X l X j |X X X∈t t t) + s ∗ ∑ k=1 k̸=l k̸=j s ∗ ∑ l=1 l̸=j β kj β lj E(X k X l X 2 j |X X X∈t t t)−E(X k X j |X X X∈t t t)E(X l X j |X X X∈t t t) +2β jj β j E(X 3 j |X j ∈t j )−E(X 2 j |X j ∈t j )E(X j |X j ∈t j ) +2 s ∗ ∑ l=1 l̸=j β jj β lj E(X 3 j X l |X X X∈t t t)−E(X 2 j |X j ∈t j )E(X l X j |X X X∈t t t) +2 s ∗ ∑ l=1 l̸=j β j β lj E(X 2 j X l |X X X∈t t t)−E(X j |X j ∈t j )E(X l X j |X X X∈t t t) =:(A)+(B)+(C)+(D)+(E)+(F)+(G), 149 where (A)= β 2 jj 45 r 2 j 4(supt j ) 2 +7(supt j )(inft j )+4(inft j ) 2 , (B)=β 2 j r 2 j 12 , (C)= s ∗ ∑ l=1 l̸=j β 2 lj 144 3R 2 j r 2 l +3R 2 l r 2 j +r 2 j r 2 l , (D)= s ∗ ∑ k=1 k̸=l k̸=j s ∗ ∑ l=1 l̸=j β kj β lj 48 R l R k r 2 j , (E)= β jj β j 6 r 2 j R j , (F)= s ∗ ∑ l=1 l̸=j β jj β lj 12 R l r 2 j R j , and (G)= s ∗ ∑ l=1 l̸=j β j β lj 12 R l r 2 j . Bythis,(A.128),andadirectcalculation, Var(g j (X X X)|X X X∈t t t)− 1 3 H j ( 1 2 inft j + 1 2 supt j ) 2 = β 2 jj 180 r 4 j + s ∗ ∑ l=1;l̸=j β 2 lj 144 (3R 2 j r 2 l +r 2 j r 2 l ). (A.130) Also,bytheuniformdistributionassumptiononX X X,theaboveexpressionfor(C),andthat s ∗ ∑ j=1 s ∗ ∑ l=1 l̸=j 1=2 s ∗ −1 ∑ j=1 s ∗ ∑ l>j 1, itholdsthat s ∗ ∑ j=1 s ∗ ∑ l=1 l̸=j β 2 lj 144 (3R 2 j r 2 l +r 2 j r 2 l )= s ∗ −1 ∑ j=1 s ∗ ∑ l>j β 2 lj Var(X l X j |X X X∈t t t)+ s ∗ −1 ∑ j=1 s ∗ ∑ l>j β 2 lj 144 r 2 j r 2 l , whichincombinationwith(A.129)–(A.130)leadsto(A.119). Lastly,wedealwith(A.120). By(A.128), H j ( 1 4 inft j + 3 4 supt j )−H j ( 1 2 inft j + 1 2 supt j )= β jj r 2 j 12 . 
150 Therefore,if|H j ( 1 2 inft j + 1 2 supt j )|≤ |β jj |r 2 j 36 ,then H j ( 1 4 inft j + 3 4 supt j ) 2 − H j ( 1 2 inft j + 1 2 supt j ) 2 = |β jj |r 2 j 12 β jj r 2 j 12 +2H j ( 1 2 inft j + 1 2 supt j ) ≥ |β jj |r 2 j 12 |β jj |r 2 j 12 −2 H j ( 1 2 inft j + 1 2 supt j ) ! ≥ β 2 jj r 4 j 432 . Thisimpliesthatif (H j ( 1 2 inft j + 1 2 supt j )) 2 ≤( β jj r 2 j 36 ) 2 = β 2 jj r 4 j 1296 ,then H j 1 4 inft j + 3 4 supt j 2 ≥ β 2 jj r 4 j 648 , whichconcludesthedesiredresultof(A.120). Wehavefinishedtheproofsfor(A.119)–(A.121). A.5.5 VerifyingCondition1forExample3 The proof idea is straightforward: for each node t t t =t 1 ×···t p , we establish appropriate upper and lower bounds for Var(m(X X X)|X X X∈ t t t) and sup j,c (II) t t t,t t t(j,c) , respectively, to conclude the desired result, where we use sup j,c (II) t t t,t t t(j,c) to denote sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) for simplicity. We begin with an upper bound of Var(m(X X X)|X X X∈t t t)asfollows. Var(m(X X X)|X X X∈t t t) =E m(X X X)−E(m(X X X)|X X X∈t t t) 2 |X X X∈t t t ≤E m(X X X)− sup z z z∈t t t m(z z z)+inf z z z∈t t t m(z z z) 2 2 |X X X∈t t t ! ≤ 1 4 E m(X X X)−sup z z z∈t t t m(z z z)−inf z z z∈t t t m(z z z)+m(X X X) 2 |X X X∈t t t ! ≤ 1 4 E sup z z z∈t t t m(z z z)−m(X X X)+m(X X X)−inf z z z∈t t t m(z z z) 2 |X X X∈t t t ! = 1 4 (sup z z z∈t t t m(z z z)−inf z z z∈t t t m(z z z)) 2 . (A.131) 151 Meanwhile,bytheassumptionson ∂m(z z z) ∂z j ’s,wecanshowthat |sup z z z∈t t t m(z z z)−inf z z z∈t t t m(z z z)|≤ ∑ j∈S ∗ |t j |M 2 , andweomitthedetailsforsimplicity. Bythisand(A.131), Var(m(X X X)|X X X∈t t t)≤ M 2 2 4 ( ∑ j∈S ∗ |t j |) 2 . (A.132) Next, we deal with the lower bound of sup j,c (II) t t t,t t t(j,c) , which is by definition larger than (II) t t t,t t t(j ∗ ,c ∗ ) , where j ∗ :=argmax j∈S ∗ |t j | and c ∗ := supt j ∗+inft j ∗ 2 . Lett t t ∗ 1 andt t t ∗ 2 bethecorrespondingdaughternodesoft t t suchthatthe j ∗ thcoordinateoft t t ∗ 1 is inft j , supt j +inft j 2 . Recallthat (II) t t t,t t t ∗ 1 =P(X X X∈t t t ∗ 1 |X X X∈t t t) E(m(X X X)|X X X∈t t t ∗ 1 ))−E(m(X X X)|X X X∈t t t)) 2 +P(X X X∈t t t ∗ 2 |X X X∈t t t) E(m(X X X)|X X X∈t t t ∗ 2 ))−E(m(X X X)|X X X∈t t t)) 2 . Toestablishalowerboundof (II) t t t,t t t ∗ 1 ,itsufficestoestablishalowerboundon (E(m(X X X)|X X X∈t t t ∗ 1 )−E(m(X X X)|X X X∈t t t ∗ 2 )) 2 . Tothisend,wefirstnoticethatbythefundamentaltheoremofcalculus, m(z z z+ 1 2 |t j ∗|e e e j ∗)=m(z z z)+ Z z j ∗+ 1 2 |t j ∗| z j ∗ ∂m(w w w) ∂w j ∗ dw j ∗, 152 where e e e j is a unit vector with its jth coordinate being one and z z z = (z 1 ,···,z p ) T . Now, suppose that the partial derivative along the j ∗ th coordinate is positive in the following arguments for simplicity; that is, ∂m(w w w) ∂w j ∗ ≥M 1 >0. BytheseandtheuniformdistributionassumptiononX X X, E(m(X X X)|X X X∈t t t ∗ 2 )= 1 |t t t ∗ 2 | Z z z z∈t t t ∗ 2 m(z z z)dz z z = 1 |t t t ∗ 1 | Z z z z∈t t t ∗ 1 " m(z z z)+ Z z j ∗+ 1 2 |t j ∗| z j ∗ ∂m(w w w) ∂w j ∗ dw j ∗ # dz z z ≥ 1 |t t t ∗ 1 | Z x x x∈t t t ∗ 1 m(z z z)+ 1 2 |t j ∗|M 1 dz z z. (A.133) With(A.133), E(m(X X X)|X X X∈t t t ∗ 2 )−E(m(X X X)|X X X∈t t t ∗ 1 )≥ 1 2 |t j ∗|M 1 , whichalongwithsimplecalculationsshows (II) t t t,t t t ∗ 1 = 1 2 E(m(X X X)|X X X∈t t t)−E(m(X X X)|X X X∈t t t ∗ 1 ) 2 + E(m(X X X)|X X X∈t t t)−E(m(X X X)|X X X∈t t t ∗ 2 ) 2 ≥ 1 16 |t j ∗| 2 M 2 1 . 
(A.134) By(A.134),(A.132),and (∑ j∈S ∗|t j |) 2 |t j ∗| 2 ≤(#S ∗ ) 2 duetothedefinitionof j ∗ ,itholdsthat 4M 2 2 M −2 1 (#S ∗ ) 2 sup j,c (II) t t t,t t t(j,c) ≥ 4M 2 2 M −2 1 (#S ∗ ) 2 (II) t t t,t t t ∗ 1 ≥Var(m(X X X)|X X X∈t t t). This concludes the desired result for the case with a positive partial derivative. The same arguments applytotheothercase,andhencewehavefinishedtheproofofthefirstassertion. Asforthecasewiththeadditivemodelassumption,wehave Var(m(X X X)|X X X∈t t t)= s ∗ ∑ j=1 Var(m j (X j )|X j ∈t j ), (A.135) wheret t t =t 1 ×···×t p . Bytheargumentssimilartothosein(A.131)–(A.132), Var(m(X X X)|X X X∈t t t)≤s ∗ 1 4 max 1≤j≤s ∗ |t j | 2 M 2 2 , (A.136) 153 whichincombinationwith(A.134)leadsto 4M 2 2 M −2 1 s ∗ sup j,c (II) t t t,t t t(j,c) ≥ 4M 2 2 M −2 1 s ∗ (II) t t t,t t t ∗ 1 ≥Var(m(X X X)|X X X∈t t t). Thisconcludesthesecondassertionandfinishestheproof. A.5.6 VerifyingCondition1forExample4 BytheuniformdistributionassumptiononX X X,foreachnodet t t =t 1 ×···×t p , Var(m(X X X)|X X X∈t t t)= s ∗ ∑ l=1 Var(m l (X l )|X l ∈t l ). (A.137) With (A.137) and (2.15), to finish the proof of the first assertion, we show that the LHS of (2.15) is propor- tionaltotheconditionalbiasdecreaseasfollows. A simple calculation shows that the conditional bias decrease (II) t t t,t t t ′ given a split (j,x) with x∈t j is boundedfrombelowsuchthat (II) t t t,t t t ′≥P(X X X∈t t t ′ |X X X∈t t t)P(X X X∈t t t ′′ |X X X∈t t t) H j (x) 2 , (A.138) where H j (x):=E(m(X X X)|X X X∈t t t ′′ )−E(m(X X X)|X X X∈t t t ′ ), and thatt t t ′ =t 1 ×···×t ′ j ×···×t p andt t t ′′ =t 1 ×···×t ′′ j ×···×t p witht ′ j =[inft j ,x) andt ′′ j =[x,supt j ]∩t j . Forthereader’sconvenience,recallthat (II) t t t,t t t ′ =P(X X X∈t t t ′ |X X X∈t t t) E(m(X X X)|X X X∈t t t ′ )−E(m(X X X)|X X X∈t t t) 2 +P(X X X∈t t t ′′ |X X X∈t t t) E(m(X X X)|X X X∈t t t ′′ )−E(m(X X X)|X X X∈t t t) 2 . Recallthatm(z z z)=∑ p j=1 m j (z j ). Then E(m(X X X)|X X X∈t t t ′′ )= 1 supt j −x Z supt j x m j (z)dz+ s ∗ ∑ l̸=j;l=1 E(m l (X l )|X X X∈t t t ′′ ), 154 and E(m(X X X)|X X X∈t t t ′ )= 1 x−inft j Z x inft j m j (z)dz+ s ∗ ∑ l̸=j;l=1 E(m l (X l )|X X X∈t t t ′′ ). Thus,bythechangeofvariablesformula, H j (x)= 1 supt j −x Z supt j x m j (z)dz− 1 x−inft j Z x inft j m j (z)dz = 1 x−inft j Z x inft j m j supt j −x x−inft j (z−inft j )+x −m j (z) dz. (A.139) Supposethesplitisalong j:=argmax 1≤l≤s ∗Var(m(X l )|X l ∈t l ). By(A.139)and(2.15), sup x∈Λ(inft j ,supt j ) (H j (x)) 2 ≥c 0 max 1≤l≤s ∗ Var(m l (X l )|X l ∈t l ), whichincombinationwith(A.138), sup l∈{1,···,s ∗ },c∈t l (II) t t t,t t t(l,c) ≥sup c∈t j (II) t t t,t t t(j,c) ≥λ(1−λ)c 0 max 1≤l≤s ∗ Var(m l (X l )|X l ∈t l ). Bythisand(A.137),itholdsthat m(X X X)∈SID s ∗ λ(1−λ)c 0 , whichconcludesthefirstassertion. 155 Forthesecondassertion,wenotethatifm j (z)isdifferentiableon [0,1],then Var(m j (X j )|X j ∈t j ) =E m j (X j )−E(m j (X j )|X j ∈t j ) 2 X j ∈t j ≤E 0 @ m j (X j )− sup z∈t j m j (z)+inf z∈t j m j (z) 2 !! 2 X j ∈t j 1 A ≤ 1 4 E 0 @ m j (X j )−sup z∈t j m j (z)−inf z∈t j m j (z)+m j (X j ) ! 2 X j ∈t j 1 A ≤ 1 4 E 0 @ sup z∈t j m j (z)−m j (X j )+m j (X j )−inf z∈t j m j (z) ! 2 X j ∈t j 1 A = 1 4 (sup z∈t j m j (z)−inf z∈t j m j (z)) 2 ≤ 1 4 (sup t∈t j |m ′ j (z)||t j |) 2 , where the last step is because of the mean value theorem. This and (2.16) lead to (2.15) with some c 0 >0, andhencewehavefinishedtheproof. 
A.5.7 ProofforRemark3 Inthisproof,weshowthatgiventhemodel m(X X X)=X 1 X 2 −0.5X 1 −0.5X 2 +0.25=(X 1 −0.5)(X 2 −0.5) with uniformly distributed X X X, there exists a node t t t =t 1 ×···×t p such that a constantα 1 ≥1 for SID does notexist. Infact,thereareinfinitelymanysuchnodesinthiscase. Consideranodet t t =[0.5−d 1 ,0.5+d 1 ]× [0.5−d 2 ,0.5+d 2 ]×[a 3 ,b 3 ]×···×[a p ,b p ]forsomepositived 1 ≤0.5,d 2 ≤0.5and0≤a j 1inthefollowing. 157 Let us start with an upper bound for Var(m(X X X)|X X X∈t t t). By the assumptions of an additive model and a uniformdistributionofX X X,foreveryt t t, Var(m(X X X)|X X X∈t t t)= s ∗ ∑ l=1 Var(m l (X l )|X l ∈t l ), (A.140) andbythedefinitionofR,foreveryt t t =t 1 ×···×t p andl≤s ∗ , Var(m l (X l )|X l ∈t l )≤(|t l |R) 2 . (A.141) Define j:=arg max 1≤l≤s ∗ |t l |, andhenceby(A.140)–(A.141), Var(m(X X X)|X X X∈t t t)≤s ∗ (|t j |R) 2 . (A.142) Now, we proceed to deal with the lower bound of the conditional bias decrease. Let us introduce some notation for referring to the linear functions and splits ont j . The rightmost linear function ont j is denoted byh 1 (x),withl>0beingthelengthofitsdomainont j ;andthelinearfunctiontotheleftofh 1 (x)isdenoted byh 2 (x),withL≥0beingthelengthofitsdomainont j . Inaddition,weconsiderfoursplitpointsA,B,C, and D, which are respectively in the middle of the domain of h 1 (x), the left-end of h 1 (x), the middle of the domain of h 2 (x), and the left-end of h 2 (x). A graphical illustration is in Figure 4.2. Moreover, we refer to thecorrespondingdaughternodesoft j ontheright-handsideofthesplitsast ′ jA ,t ′ jB t ′ jC ,andt ′ jD ,respectively. Lett t t ′ s =t 1 ×···t j−1 ×t ′ js ×t j+1 ×···×t p fors∈{A,B,C,D}. Bythedefinitionof (II) t t t,t t t ′ s andtheassumptionofauniformdistributionofX X X, max{(II) t t t,t t t ′ A ,(II) t t t,t t t ′ B } ≥max n l 2|t j | (E(m j (X j )|X j ∈t ′ jA )−E(m j (X j )|X j ∈t j )) 2 , l |t j | (E(m j (X j )|X j ∈t ′ jB )−E(m j (X j )|X j ∈t j )) 2 o . (A.143) 158 Figure4.2: Thepartofm j (x)ont j . Inthisexample,therearethreepiecewiselinearfunctions. By the model assumptions and simple calculations, it holds that|E(m j (X j )|X j ∈t ′ jA )−E(m j (X j )|X j ∈ t ′ jB )|≥ lr 4 . Bythis,if|E(m j (X j )|X j ∈t ′ jA )−E(m j (X j )|X j ∈t j )|≤ lr 8 ,then |E(m j (X j )|X j ∈t ′ jB )−E(m j (X j )|X j ∈t j )|≥ lr 8 ;theotherwayroundisalsotrue. Hence, RHSof(A.143)≥ l 2|t j | ×( lr 8 ) 2 = l 3 r 2 128|t j | . (A.144) NoticethattheaboveargumentsdonotdependonthevalueofL. Next,bythedefinitionof (II) t t t,t t t ′ s andtheassumptionofauniformdistributionofX X X, max{(II) t t t,t t t ′ C ,(II) t t t,t t t ′ D } ≥max n 1 |t j | ( L 2 +l)(E(m j (X j )|X j ∈t ′ jC )−E(m j (X j )|X j ∈t j )) 2 , 1 |t j | (L+l)(E(m j (X j )|X j ∈t ′ jD )−E(m j (X j )|X j ∈t j )) 2 o . (A.145) TolowerboundtheRHSof(A.145),wewrite E(m j (X j )|X j ∈t ′ jD )= L 2 ( 1 L+l )E(m j (X j )|X j ∈[D,C))+( L 2 +l)( 1 L+l )E(m j (X j )|X j ∈t ′ jC ), whichfollowsfromtheuniformdistributionassumptiononthefeaturevector. Hence, |E(m j (X j )|X j ∈t ′ jD )−E(m j (X j )|X j ∈t ′ jC )| = L 2 ( 1 L+l ) E(m j (X j )|X j ∈[D,C))−E(m j (X j )|X j ∈t ′ jC ) . (A.146) 159 Since we have assumed that m j (x) is continuous with slope upper and lower bounds, if the value of l is sufficiently small, then|E(m j (X j )|X j ∈ [D,C))−E(m j (X j )|X j ∈t ′ jC )| is sufficiently large. For example, if lR≤ Lr 2 , thenE(m j (X j )|X j ∈t ′ jC )≤m j (C) and henceE(m j (X j )|X j ∈[D,C))≥E(m j (X j )|X j ∈t ′ jC )+ Lr 4 in Figure4.2. 
Forgeneralcases,iflR≤ Lr 2 ,itholdsthat RHSof(A.146)≥ L 2 ( 1 L+l )× Lr 4 = L 2 r 8(L+l) . Bythisandargumentssimilartothosefor(A.144), RHSof(A.145)≥ 1 |t j | ( L 2 +l)( L 2 r 16(L+l) ) 2 ≥ L 3 r 2 512|t j | . (A.147) To have our conclusion, we need an observation as follows. Due to the definition of b ∗ and K > 1, it holdsthatb ∗ ≤ 1 2 .Also,if L |t j | Lr 2R ,thenby(A.144), sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) ≥ (l) 3 r 2 128|t j | ≥ r 5 (b ∗ ) 3 (|t j |) 2 1024R 3 . (A.149) If L |t j | ≥b ∗ andl≤ Lr 2R ,thenby(A.147), sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) ≥ (b ∗ ) 3 (r|t j |) 2 512 . (A.150) 160 By(A.142)and(A.148)–(A.150),weconcludethat s ∗ (|t j |R) 2 × 1024R 3 r 5 (b ∗ ) 3 (|t j |) 2 sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) ≥Var(m(X X X)|X X X∈t t t), whichleadstothedesiredresult. A.6.1 VerifyingCondition1forExample8 The proof idea is to find an upper bound of Var(m(X X X)|X X X∈t t t) and a lower bound of maximum conditional biasdecreaseforeachnodet t t =t 1 ×···×t p suchthatsup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) islowerboundedintermsof Var(m(X X X)|X X X∈t t t). We begin with noticing that if there are no jump points on t t t, then Var(m(X X X)|X X X∈t t t)=0 andhencetheproofistrivial. Wethereforeconsiderthecasewithatleastonejumppointont t t. In the following, we deal with the lower bound of the conditional bias decrease. Due to the regression function form, we can establish a lower bound for the conditional bias decrease on a nodet t t given a split at any jump point ont t t as follows. Let a nodet t t be given with a jump point on it; lett t t ′ ,t t t ′′ denote two daughter nodesafterthesplitatoneofthejumppointsont t t. Thenitholdsthat |E(m(X X X)|X X X∈t t t ′ )−E(m(X X X)|X X X∈t t t ′′ )|≥ι, whichfollowsfromthemodelassumptions,andthat (II) t t t,t t t ′ =ζ(E(m(X X X)|X X X∈t t t ′ )−E(m(X X X)|X X X∈t t t)) 2 +(1−ζ)(E(m(X X X)|X X X∈t t t ′′ )−E(m(X X X)|X X X∈t t t)) 2 ≥ inf x∈R ζ(E(m(X X X)|X X X∈t t t ′ )−x) 2 +(1−ζ)(E(m(X X X)|X X X∈t t t ′′ )−x) 2 ≥ζ(1−ζ)(E(m(X X X)|X X X∈t t t ′ )−E(m(X X X)|X X X∈t t t ′′ )) 2 =ζ(1−ζ)ι 2 , (A.151) whereζ =P(X X X∈t t t ′ |X X X∈t t t). On the other hand, to establish the upper bound for Var(m(X X X)|X X X∈t t t), we separate the proof into two cases: (i)therearemorethantwojumppointsonsomecoordinatesoft t t,and(ii)thereareatmosttwojump pointsoneverycoordinateoft t t. 161 We deal with the first case first. Write the regression function conditional on t t t as ∑ k 0 k=1 β (k) 1 1 1 X X X∈C (k) for some k 0 , whereC (1) ,···,C (k 0 ) are all the subnodes on t t t, with their respective coefficients denoted by β (1) ,···,β (k 0 ) . Duetothecoefficientassumptions, max 1≤l≤k 0 β (l) − min 1≤l≤k 0 β (l) ≤2M 0 . (A.152) Hence, Var(m(X X X)|X X X∈t t t)=E ( k 0 ∑ k=1 β (k) 1 1 1 X X X∈C (k)−E(m(X X X)|X X X∈t t t)) 2 |X X X∈t t t ≤E ( k 0 ∑ k=1 β (k) 1 1 1 X X X∈C (k)−β (1) ) 2 |X X X∈t t t =E ( k 0 ∑ k=1 β (k) 1 1 1 X X X∈C (k)− k 0 ∑ k=1 β (1) 1 1 1 X X X∈C (k)) 2 |X X X∈t t t =E ( k 0 ∑ k=1 (β (k) −β (1) )1 1 1 X X X∈C (k)) 2 |X X X∈t t t ≤(2M 0 ) 2 , (A.153) wherethelastinequalityisdueto(A.152). Suppose the jth coordinate has at least three jump points. Let us consider a split at any jump point in between the first and the last jump points on the jth coordinate; let t t t ′ and t t t ′′ denote the resulting daughter nodes. 
InlightoftheuniformdistributionassumptiononX X X, min{P(X X X∈t t t ′ |X X X∈t t t),P(X X X∈t t t ′′ |X X X∈t t t)}≥ min j≤s ∗ ,1≤i≤k j c (j) i −c (j) i−1 =:c ∗ , whichalongwith(A.151)and(A.153)showsthat 2M 0 ι 2 1 c ∗ (1−c ∗ ) sup j∈{1,···,p},c∈t j (II) t t t,t t t(j,c) ≥ 2M 0 ι 2 1 c ∗ (1−c ∗ ) (II) t t t,t t t ′ ≥Var(m(X X X)|X X X∈t t t), (A.154) whichconcludestheproofofcase(i). Let us proceed to analyze case (ii). We again denote all the subnodes and their respective coefficients byC (1) ,···,C (k 0 ) andβ (1) ,···,β (k 0 ) . Inaddition,defineρ j ≥0suchthat 162 1. ρ j =|t j | −1 max{a−inft j ,supt j −b}ifthe jthcoordinateoft t t hastwojumppointsata0 since we have assumed a nontrivial case where there is at least onejumppointont t t. AmongC (1) ,···,C (k 0 ) , let us fix a subnodeC (k ∗ ) such that 1) if the lth coordinate of t t t has two jump points at 0 < > : O(s −4/d 1 +s −4/d 2 ), d≥2, O(s −3 1 +s −3 2 ), d =1. Thistogetherwith(B.24)completestheproofofTheorem8. B.2.4 ProofofTheorem9 The main idea of the proof is to apply the bias-variance decomposition for the mean-squared error. Recall thatD n (s 1 ,s 2 )(x)=w ∗ 1 D n (s 1 )(x)+w ∗ 2 D n (s 2 )(x)and E[D n (s 1 ,s 2 )(x)]=w ∗ 1 E[µ(X (1) (s 1 ))]+w ∗ 2 E[µ(X (1) (s 2 ))], 175 whereX (1) (s 1 )=X (1) (X 1 ,...,X s 1 ) denotes the 1-nearest neighbor ofx among{X 1 ,...,X s 1 } and similarly, X (1) (s 2 )=X (1) (X 1 ,...,X s 2 ). Thenwehavethebias-variancedecomposition E [D n (s 1 ,s 2 )(x)−µ(x)] 2 = E n D n (s 1 ,s 2 )(x)−w ∗ 1 E[µ(X (1) (s 1 ))]−w ∗ 2 E[µ(X (1) (s 2 ))] 2 o + E(D n (s 1 ,s 2 )(x))−µ(x) 2 :=I 1 (x)+I 2 (x). (B.25) Let us first deal with the bias term I 2 (x). Using the similar arguments to those in the proofs of Lemmas 13 and14,wecandeducethat I 2 (x)≤ 8 > > > > < > > > > : R 2 1 (x,d,f,µ) (c−1) 2 c −1 s −6 2 , d =1, R 2 2 (x,d,f,µ) (c−1) 2 c −2 s −8/d 2 , d≥2, where R 1 (x,d, f,µ)and R 2 (x,d, f,µ)aresomeconstantsdependingontheboundsforthefirstfourderiva- tivesof f(·)andµ(·)inaneighborhoodofx. WenowanalyzethevariancetermI 1 (x). Itholdsthat I 1 (x)≤(w ∗ 1 ) 2 E n s 1 −1 ∑ 1≤i 1 0dependinguponw ∗ 1 ,w ∗ 2 , x,andthedistributionofε suchthat E [Φ ∗ (x;Z ∗ 1 ,...,Z ∗ s 2 )] 4 ≤M. (B.87) Proof. Sincetheobservationsinthebootstrapsample{Z ∗ 1 ,...,Z ∗ n }areselectedindependentlyanduniformly fromtheoriginalsample{Z 1 ,...,Z n },wehave E [Φ ∗ (x;Z ∗ 1 ,...,Z ∗ s 2 )] 4 = E E [Φ ∗ (x;Z ∗ 1 ,...,Z ∗ s 2 )] 4 Z 1 ,...,Z n =n −s 2 n ∑ i 1 =1 ··· n ∑ i s 2 =1 E [Φ ∗ (x;Z i 1 ,...,Z i s 2 )] 4 . Observethatfordistincti 1 ,...,i s 2 ,wehaveshownintheproofofLemma11inSectionB.3.2thatass 2 →∞, E [Φ ∗ (x;Z 1 ,...,Z s 2 )] 4 →A forsomepositiveconstantAthatdependsuponw ∗ 1 ,w ∗ 2 ,x,andthedistributionofε. Furthermore,notethatifi 1 =i 2 =...=i c andtheremainingargumentsaredistinct,thenitholdsthat Φ(x;Z i 1 ,...,Z i s 2 )=Φ(x;Z i 1 ,Z i c+1 ,...,Z i s 2 ). Therefore, there exists some positive constant M depending upon w ∗ 1 , w ∗ 2 , x, and the distribution of ε such that E [Φ ∗ (x;Z i 1 ,...,Z i s 2 )] 4 ≤M forany1≤i 1 ≤n,...,1≤i s 2 ≤n. ThiscompletestheproofofLemma12. B.3.4 Lemma13anditsproof In Lemma 13 below, we will provide the asymptotic expansion ofE∥X (1) −x∥ k with k≥1 and its higher- orderasymptoticexpansionforthecaseofk=2asthesamplesizen→∞. 195 Lemma13. Assume that Conditions 1–3 hold andx∈supp(X)⊂R d is fixed. Then the 1-nearest neighbor (1NN)X (1) ofxinthei.i.d. 
sample{X 1 ,···,X n }satisfiesthatforanyk≥1, E∥X (1) −x∥ k = Γ(k/d+1) (f(x)V d ) k/d n −k/d +o(n −k/d ) (B.88) as n→∞, whereΓ(·) is the gamma function andV d = π d/2 Γ(1+d/2) . In particular, when k =2, there are three cases. Ifd =1,wehave E∥X (1) −x∥ 2 = Γ(2/d+1) (f(x)V d ) 2/d n −2/d − Γ(2/d+2) d(f(x)V d ) 2/d n −(1+2/d) +o(n −(1+2/d) ). (B.89) Ifd =2,wehave E∥X (1) −x∥ 2 = Γ(2/d+1) (f(x)V d ) 2/d n −2/d − tr(f ′′ (x))Γ(4/d+1) f(x)(f(x)V d ) 4/d d(d+2) + Γ(2/d+2) d(f(x)V d ) 2/d n −4/d +o(n −4/d ), (B.90) where f ′′ (·)standsfortheHessianmatrixofthedensityfunction f(·). Ifd≥3,wehave E∥X (1) −x∥ 2 = Γ(2/d+1) (f(x)V d ) 2/d n −2/d − tr(f ′′ (x))Γ(4/d+1) f(x)(f(x)V d ) 4/d d(d+2) n −4/d +o(n −4/d ). (B.91) Proof. Denote by φ the probability measure onR d given by random vector X. We begin with obtaining an approximation of φ(B(x,r)), where B(x,r) represents a ball in the Euclidean space R d with center x and radius r > 0. Recall that by Condition 2, the density function f(·) of measure φ with respect to the Lebesgue measureλ is four times continuously differentiable with bounded corresponding derivatives in a neighborhoodofx. ThenusingtheTaylorexpansion,weseethatforanyξ∈S d−1 and0<ρ <r, f(x+ρξ)= f(x)+ f ′ (x) T ξρ+ 1 2 ξ T f ′′ (x)ξρ 2 +o(ρ 2 ), (B.92) 196 where S d−1 denotes the unit sphere inR d , and f ′ (·) and f ′′ (·) stand for the gradient vector and the Hessian matrix,respectively,ofthedensityfunction f(·). Withtheaidoftherepresentationin(B.92),anapplication ofthesphericalintegrationleadsto φ(B(x,r))= Z r 0 Z S d−1 f(x+ρξ)ρ d−1 ν(dξ)dρ = Z r 0 Z S d−1 f(x)+ f ′ (x) T ξρ+ 1 2 ξ T f ′′ (x)ξρ 2 +o(ρ 2 ) ρ d−1 ν(dξ)dρ = Z r 0 h f(x)dV d ρ d−1 + tr(f ′′ (x))V d 2 ρ d+1 +o(ρ d+1 ) i dρ = f(x)V d r d + tr(f ′′ (x))V d 2(d+2) r d+2 +o(r d+2 ), (B.93) where ν denotes a measure constructed on the unit sphereS d−1 as characterized in Lemma 19 in Section B.4.1andd·standsforthedifferentialofagivenvariablehereafter. WenowturnourattentiontothetargetquantityE∥X (1) −x∥ k foranyk≥1. Itholdsthat E∥X (1) −x∥ k = Z ∞ 0 P(∥X (1) −x∥ k >t)dt = Z ∞ 0 P(∥X (1) −x∥>t 1/k )dt = Z ∞ 0 [1−φ(B(x,t 1/k ))] n dt =n −k/d Z ∞ 0 " 1−φ B x, t 1/k n 1/d !!# n dt. (B.94) To evaluate the integration in (B.94), we need to analyze the term h 1−φ B x, t 1/k n 1/d i n . It follows from theasymptoticexpansionofφ(B(x,r))in(B.93)that " 1−φ B x, t 1/k n 1/d !!# n = h 1− f(x)V d t d/k n − tr(f ′′ (x))V d 2(d+2) t (d+2)/k n 1+2/d +o(n −(1+2/d) ) i n . (B.95) From(B.95),weseethatforeachfixedt >0, lim n→∞ " 1−φ B x, t 1/k n 1/d !!# n =exp(−f(x)V d t d/k ). 197 Moreover,byCondition1,wehave " 1−φ B x, t 1/k n 1/d !!# n ≤ " exp −α t 1/k n 1/d !# n ≤exp −αt 1/k . Thus,anapplicationofthedominatedconvergencetheoremyields lim n→∞ Z ∞ 0 " 1−φ B x, t 1/k n 1/d !!# n dt = Z ∞ 0 lim n→∞ " 1−φ B x, t 1/k n 1/d !!# n dt = Z ∞ 0 exp(−f(x)V d t d/k )dt = Γ(k/d+1) (f(x)V d ) k/d , (B.96) whichestablishesthedesiredasymptoticexpansionin(B.88)foranyk≥1. We further investigate higher-order asymptotic expansion for the case of k = 2. The leading term of the asymptotic expansion for E∥X (1) −x∥ 2 has been identified in (B.96) with the choice of k = 2. But we now aim to conduct a higher-order asymptotic expansion. To do so, we will resort to the higher-order asymptotic expansion given in (B.95). 
In view of (B.95), we can deduce from the Taylor expansion for functionlog(1−x)around0that " 1−φ B x, t 1/2 n 1/d !!# n −exp n −f(x)V d t d/2 o =exp 8 < : nlog 2 4 1− f(x)V d t d/2 n − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 1+2/d +o(n −(1+2/d) ) 3 5 9 = ; −exp n −f(x)V d t d/2 o =exp 8 < : −f(x)V d t d/2 − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d − f 2 (x)V 2 d t d 2n +o(n −(2/d) ) 9 = ; −exp n −f(x)V d t d/2 o (B.97) as n→∞. To determine the order of the above remainders, there are three separate cases, that is, d = 1, d =2,andd≥3. 198 First,forthecaseofd =1,itfollowsfrom(B.97)that " 1−φ B x, t 1/2 n 1/d !!# n −exp n −f(x)V d t d/2 o =exp −f(x)V d t d/2 − f 2 (x)V 2 d t d 2n +o(n −1 ) −exp n −f(x)V d t d/2 o =exp n −f(x)V d t d/2 o exp − f 2 (x)V 2 d t d 2n +o(n −1 ) −1 =exp n −f(x)V d t d/2 o − f 2 (x)V 2 d t d 2n +o(n −1 ) (B.98) asn→∞. Furthermore,itholdsthat Z ∞ 0 exp n −f(x)V d t d/2 o − f 2 (x)V 2 d t d 2 dt =− Γ(2/d+2) d(f(x)V d ) 2/d , (B.99) wherewehaveusedthefactthatforanya>0andb>0, Z ∞ 0 x a−1 exp(−bx p )dx= 1 p b −a/p Γ( a p ). (B.100) Therefore, combining (B.94), (B.96), (B.98), and (B.99) results in the desired higher-order asymptotic ex- pansionin(B.89)forthecaseofk=2and d =1. Whend =2,notingthat2/d =1,itfollowsfrom(B.97)that " 1−φ B x, t 1/2 n 1/d !!# n −exp n −f(x)V d t d/2 o =exp 8 < : −f(x)V d t d/2 − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d − f 2 (x)V 2 d t d 2n 2/d +o(n −(2/d) ) 9 = ; −exp n −f(x)V d t d/2 o =exp n −f(x)V d t d/2 o 0 @ exp 8 < : − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d − f 2 (x)V 2 d t d 2n 2/d +o(n −(2/d) ) 9 = ; −1 1 A =exp n −f(x)V d t d/2 o 0 @ − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d − f 2 (x)V 2 d t d 2n 2/d +o(n −(2/d) ) 1 A (B.101) 199 asn→∞. Applyingequality(B.100)againyields Z ∞ 0 exp n −f(x)V d t d/2 o − tr(f ′′ (x))V d 2(d+2)n 2/d t (d+2)/2 dt =− tr(f ′′ (x))Γ(4/d+1) d(d+2)f(x)(f(x)V d ) 4/d n −2/d . (B.102) Hence,combining(B.94),(B.96),(B.99),(B.101),and(B.102)leadstothedesiredhigher-orderasymptotic expansionin(B.90)forthecaseofk=2andd =2. Finally, it remains to investigate the case of d≥3. In view of n −1 =o(n −2/d ) for d≥3, we can obtain from(B.97)that " 1−φ B x, t 1/2 n 1/d !!# n −exp n −f(x)V d t d/2 o =exp 8 < : −f(x)V d t d/2 − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d +o(n −(2/d) ) 9 = ; −exp n −f(x)V d t d/2 o =exp n −f(x)V d t d/2 o 0 @ exp 8 < : − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d +o(n −(2/d) ) 9 = ; −1 1 A =exp n −f(x)V d t d/2 o 0 @ − tr(f ′′ (x))V d 2(d+2) t (d+2)/2 n 2/d +o(n −(2/d) ) 1 A . (B.103) Consequently, combining (B.94), (B.96), (B.102), and (B.103) yields the desired higher-order asymptotic expansionin(B.91)forthecaseofk=2andd≥3. ThisconcludestheproofofLemma13. B.3.5 Lemma14anditsproof As in [59], we define the projection of the mean function µ(X)=E(Y|X) onto the positive half lineR + = [0,∞)givenby∥X−x∥as m(r)= lim δ→0+ E[µ(X)|r≤∥X−x∥≤r+δ]=E[Y|∥X−x∥=r] (B.104) foranyr≥0. Clearly,thedefinitionin(B.104)entailsthat m(0)=E[Y|X=x]=µ(x). (B.105) 200 We will show in Lemma 14 below that the projection m(·) admits an explicit higher-order asymptotic ex- pansionasthedistancer→0. Lemma14. Foreachfixedx∈supp(X)⊂R d ,wehave m(r)=m(0)+ f(x)tr(µ ′′ (x))+2µ ′ (x) T f ′ (x) 2d f(x) r 2 +O 4 r 4 (B.106) as r→0, where O 4 is some bounded quantity depending only on d and the fourth-order partial derivatives of the underlying density function f(·) and regression function µ(·). Here g ′ (·) and g ′′ (·) stand for the gradientvectorandtheHessianmatrix,respectively,ofagivenfunctiong(·). Proof. 
Wewillexploitthesphericalcoordinateintegrationinourproof. Letusfirstintroducesomenecessary notation. DenotebyB(0,r)theballcenteredat0andwithradiusr intheEuclideanspaceR d ,S d−1 theunit sphereinR d ,ν ameasureconstructedontheunitsphereS d−1 asin(B.93),andξ =(ξ i )∈S d−1 anarbitrary pointontheunitsphere. LetV d bethevolumeoftheunitballinR d asgivenin(B.88). Theintegrationwith thesphericalcoordinatesisequivalenttothestandardintegrationthroughtheidentity Z B(0,r) f(x)dx= Z r 0 u d−1 Z S d−1 f(uξ)ν(dξ)du. (B.107) FromLemma19inSectionB.4.1,wehavethefollowingintegrationformulaswiththesphericalcoordinates Z S d−1 ν(dξ) = dV d , (B.108) Z S d−1 ξ ν(dξ) = 0, (B.109) Z S d−1 ξ T Aξ ν(dξ) = tr(A)V d , (B.110) Z S d−1 ξ i ξ j ξ k ν(dξ) = 0 forany1≤i, j,k≤d, (B.111) where A is any d×d symmetric matrix. We will make use of the identities in (B.108)–(B.111) in our technicalanalysis. 201 Letusdecomposem(r)intotwotermsthatwewillanalyzeseparately m(r)= lim δ→0+ E[µ(X)|r≤∥X−x∥≤r+δ] = lim δ→0+ E[µ(X)1(r≤∥X−x∥≤r+δ)] P(r≤∥X−x∥≤r+δ) , (B.112) where 1(·) stands for the indicator function. In view of (B.107), we can obtain the spherical coordinate representationsforthedenominatorandnumeratorin(B.112) P(r≤∥X−x∥≤r+δ)= Z r+δ r u d−1 Z S d−1 f(x+uξ)ν(dξ)du (B.113) and E[µ(X)1(r≤∥X−x∥≤r+δ)] = Z r+δ r u d−1 Z S d−1 µ(x+uξ)f(x+uξ)ν(dξ)du. (B.114) Notethatinlightof(B.112)–(B.114),anapplicationofL’Hôpital’sruleleadsto m(r)= lim δ→0+ E[µ(X)1(r≤∥X−x∥≤r+δ)] P(r≤∥X−x∥≤r+δ) = R S d−1µ(x+rξ)f(x+rξ)ν(dξ) R S d−1 f(x+rξ)ν(dξ) . (B.115) Firstletusexpandthedenominator. Usingthesphericalcoordinateintegration,wecandeducethat Z S d−1 f(x+rξ)ν(dξ) = Z S d−1 f(x)+ f ′ (x) T ξr+ 1 2 ξ T f ′′ (x)ξr 2 + 1 6 ∑ 1≤i,j,k≤d ∂ 3 f(x) ∂x i ∂x j ∂x k ξ i ξ j ξ k r 3 + 1 24 ∑ 1≤i,j,k,l≤d ∂ 4 f(x+θrξ) ∂x i ∂x j ∂x k ∂x l ξ i ξ j ξ k ξ l r 4 ν(dξ), (B.116) 202 where0<θ <1. Notethatthefourth-orderpartialderivativesof f areboundedinsomeneiborghhoodofx byCondition2,and Z S d−1 ∑ 1≤i,j,k,l≤d |ξ i ξ j ξ k ξ l |ν(dξ)= Z S d−1 d ∑ i=1 |ξ i | 4 ν(dξ) ≤ Z S d−1 d 2 d ∑ i=1 ξ 2 i 2 ν(dξ) =d 2 Z S d−1 ν(dξ)=d 3 V d . (B.117) Thus,from(B.108)–(B.111)and(B.117)wecanobtain Z S d−1 f(x+rξ)ν(dξ)= f(x)dV d + 1 2 tr(f ′′ (x))V d r 2 +R 1 (d, f,x)r 4 , (B.118) where the coefficient R 1 (d, f,x) in the remainder term is bounded and depends only on the fourth-order partialderivativesof f anddimensionalityd. Forthenumerator,itholdsthat Z S d−1 µ(x+rξ)f(x+rξ)ν(dξ) = Z S d−1 h µ(x)+µ ′ (x) T ξr+ 1 2 ξ T µ ′′ (x)ξr 2 + 1 6 ∑ 1≤i,j,k≤d ∂ 3 µ(x) ∂x i ∂x j ∂x k ξ i ξ j ξ k r 3 + 1 24 ∑ 1≤i,j,k,l≤d ∂ 4 µ(x+θ 1 rξ) ∂x i ∂x j ∂x k ∂x l ξ i ξ j ξ k ξ l r 4 i × h f(x)+ f ′ (x) T ξr+ 1 2 ξ T f ′′ (x)ξr 2 + 1 6 ∑ i,j,k ∂ 3 f(x) ∂x i ∂x j ∂x k ξ i ξ j ξ k r 3 + 1 24 ∑ 1≤i,j,k,l≤d ∂ 4 f(x+θ 2 rξ) ∂x i ∂x j ∂x k ∂x l ξ i ξ j ξ k ξ l r 4 i ν(dξ), (B.119) 203 where 0 <θ 1 < 1 and 0 <θ 2 < 1. In the same manner as deriving (B.117), we can bound the integrals associated with r 4 and the higher-orders r 5 ,r 6 ,r 7 , and r 8 under Condition 2 that the fourth-order partial derivativesof f(·)andµ(·)areboundedinaneighborhoodofx. 
Hence,wecandeducethat Z S d−1 µ(x+rξ)f(x+rξ)ν(dξ) =µ(x)f(x) Z S d−1 ν(dξ)+ µ(x)r 2 2 Z S d−1 ξ T f ′′ (x)ξ ν(dξ) +r 2 Z S d−1 ξ T µ ′ (x)f ′ (x) T ξ ν(dξ)+ f(x)r 2 2 Z S d−1 ξ T µ ′′ (x)ξ ν(dξ) +R 2 (d, f,x)r 4 +o(r 4 ) =µ(x)f(x)dV d + 1 2 [f(x)tr(µ ′′ (x))+µ(x)tr(f ′′ (x))]V d r 2 +µ ′ (x) T f ′ (x)V d r 2 +R 2 (d, f,x)r 4 +o(r 4 ), (B.120) where the coefficient R 2 (d, f,x) in the remainder term is bounded and depends only on the fourth-order partial derivatives of f and dimensionality d. The last equality in (B.120) follows from (B.108)–(B.111). Therefore,substituting(B.118)and(B.120)into(B.115)leadsto m(r)=µ(x)+ f(x)tr(µ ′′ (x))+2µ ′ (x) T f ′ (x) 2d f(x) r 2 +O 4 r 4 as r→ 0, where O 4 is a bounded quantity depending only on d and the fourth-order partial derivatives of f(·)andµ(·). ThiscompletestheproofofLemma14. B.3.6 Lemma15anditsproof Lemma15belowprovidesuswiththeorderofthevarianceforthefirst-orderHájekprojection. Tosimplify the technical presentation, we use Z i as a shorthand notation for (X i ,Y i ). Given any fixed vector x, the projectionofΦ(x;Z 1 ,Z 2 ,...,Z s )ontoZ 1 isdenotedasΦ 1 (x;z 1 )givenby Φ 1 (x;z 1 )=E[Φ(x;Z 1 ,Z 2 ,...,Z s )|Z 1 =z 1 ] =E[Φ(x;z 1 ,Z 2 ,...,Z s )]. (B.121) DenotebyE i andE i:s theexpectationswithrespecttoZ i and{Z i ,Z i+1 ,...,Z s },respectively. 204 Lemma15. For any fixedx, the varianceη 1 ofΦ 1 (x;Z 1 ) defined in (B.121) satisfies that when s→∞ and s=o(n), lim n→∞ Var(Φ) nη 1 =0. (B.122) Proof. A main ingredient of the proof is to decompose Var(Φ) and η 1 using the conditioning arguments. Denotebyζ i,s theindicatorfunctionfortheeventthatX i isthe1NNofxamong{X 1 ,···,X s }. Bysymmetry, wecanseethatζ i,s areidenticallydistributedwithmean Eζ i,s =s −1 . Inaddition,observethatΦ(x;Z 1 ,Z 2 ,...,Z s )=∑ s i=1 y i ζ i,s . ThenwecanobtainanupperboundofVarΦas Var(Φ)≤ E[Φ 2 ]= E h s ∑ i=1 y i ζ i,s 2 i = s ∑ i=1 E[y 2 i ζ i,s ] =sE[y 2 1 ζ 1,s ], wherewehaveusedthefactthatζ i,s ζ j,s =0withprobabilityonewheni̸= j. Since E[ε|X]=0byassumption,itholdsthat sE[y 2 1 ζ 1,s ]=sE[µ 2 (X 1 )ζ 1,s ]+σ 2 ε sE[ζ 1,s ] = E 1 [µ 2 (X 1 )sE 2:s [ζ 1,s ]]+σ 2 ε . AkeyobservationisthatE 2:s [ζ 1,s ]={1−φ(B(x,∥X 1 −x∥))} s−1 andE 1 [sE 2:s [ζ 1,s ]]=1. SeeLemma20in Section B.4.2 for a list of properties for the indicator functions ζ i,s . Thus, sE 2:s [ζ 1,s ] behaves like a Dirac measureatxass→∞. SuchobservationleadstoLemma21inSectionB.4.3,whichentailsthat Var(Φ)≤µ 2 (x)+σ 2 ε +o(1) (B.123) ass→∞. 205 Toderivealowerboundforη 1 ,weexploittheideainTheorem3of[102]. LetBbetheeventthatX 1 is thenearestneighborofxamong{X 1 ,...,X s }. DenotebyX ∗ 1 thenearestpointtoxandy ∗ 1 thecorresponding response. Thenwecandeducethat Φ 1 (x;Z 1 )= E[y 1 1 B |Z 1 ]+ E[y ∗ 1 1 B c|Z 1 ] =y 1 E[1 B |Z 1 ]+ E[y ∗ 1 1 B c|Z 1 ] =ε 1 E[1 B |X 1 ]+µ(X 1 )E[1 B |X 1 ]+ E[µ(X ∗ 1 )1 B c|X 1 ] =ε 1 E[1 B |X 1 ]+ E[µ(X ∗ 1 )|X 1 ]. Sinceε isanindependentmodelerrortermwith E[ε|X]=0byassumption,itholdsthat η 1 =Var(Φ 1 (x;Z 1 ))=Var(ε 1 E[1 B |X 1 ])+Var(E[µ(X ∗ 1 )|X 1 ]) ≥Var(ε 1 E[1 B |X 1 ])=σ 2 ε E E 2 [1 B |X 1 ] = σ 2 ε 2s−1 , (B.124) wherewehaveusedthefactthat E[E 2 [1 B |X 1 ]]= E[1 B ′|X 1 ]= 1 2s−1 withB ′ representingtheeventthatX 1 isthenearestneighborofxamongthei.i.d. observations {X 1 ,X 2 ,...,X s ,X ′ 2 ,...,X ′ s }. Wenowturntotheupperboundforη 1 . FromthevariancedecompositionforVar(Φ)givenin(B.7),we canobtain Var(Φ)= s ∑ j=1 s j Var(g j (x;Z 1 ,...,Z j )) =sη 1 + s ∑ j=2 s j Var(g j (x;Z 1 ,...,Z j )), whichalongwith(B.123)entailsthat sη 1 ≤Var(Φ)≤µ 2 (x)+σ 2 ε +o(1). 
(B.125) 206 Consequently,combining(B.124)and(B.125)leadsto η 1 ∼s −1 , (B.126) where∼ denotes the asymptotic order. Finally, recall that it has been shown that Var(Φ)≤C for some positiveconstantdependinguponµ(x)andσ ε . Therefore,weseethataslongass→∞ands=o(n), Var(Φ) nη 1 =O( s n )→0, whichyieldsthedesiredconclusionin(B.122). ThisconcludestheproofofLemma15. B.3.7 Lemma16anditsproof Assumethats 1 0bearbitrarilygiven. Bythecontinuityoffunction f atpointx,thereexistsaneighborhoodB(x,δ) ofxwithsomeδ >0suchthat |f(X 1 )− f(x)|<ε forallX 1 ∈B(x,δ). Wewilldecomposetheaboveexpectationin(B.150)intotwoparts: oneinsideandthe otheroutsideofB(x,δ)as E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]]=E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]1 B(x,δ) (X 1 )] +E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]1 B c (x,δ) (X 1 )], (B.151) wherethesuperscriptcstandsforsetcomplementinR d . Thefirsttermontheright-handsideof(B.151)isboundedbyε since E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]1 B(x,δ) (X 1 )]≤E 1 [εsE 2:s [ζ 1,s ]1 B(x,δ) (X 1 )] ≤E 1 [εsE 2:s [ζ 1,s ]]=ε. (B.152) 218 Toboundthesecondtermontheright-handsideof(B.151),observethat B(x,δ)⊂B(x,∥X 1 −x∥) whenX 1 ∈B c (x,δ). ThenanapplicationofLemma20gives E 2:s [ζ 1,s ]≤(1−φ(B(x,δ))) s−1 whenX 1 ∈B c (x,δ). Thus,wecandeducethat E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]1 B c (x,δ) (X 1 )] ≤E 1 [|f(X 1 )− f(x)|s(1−φ(B(x,δ))) s−1 1 B c (x,δ) (X 1 )] ≤s(1−φ(B(x,δ))) s−1 E 1 [|f(X 1 )− f(x)|] ≤s(1−φ(B(x,δ))) s−1 (∥f∥ L 1+ f(x)), (B.153) where∥·∥ L 1 denotestheL 1 -normofagivenfunction. Finally, we see that the right-hand side of the last equation in (B.153) tends to 0 as s→∞. Therefore, forlargeenoughs,thequantity E 1 [|f(X 1 )− f(x)|sE 2:s [ζ 1,s ]1 B c (x,δ) (X 1 )] can be bounded from above by 2ε. Since the choice of ε > 0 is arbitrary, combining such upper bound, (B.150), (B.151), and (B.152) yields the desired limit in (B.149) as s→∞. This completes the proof of Lemma21. 219
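As a numerical companion to the nearest-neighbor distance expansions above (in particular the leading term of Lemma 13), the following short Monte Carlo sketch compares a simulated value of E||X_(1) - x||^2 for a uniform design on [0,1]^d against the leading term Gamma(2/d+1)/(f(x)V_d)^{2/d} n^{-2/d}. The uniform design, the evaluation point, the sample sizes, and the function name mc_nn_moment are illustrative assumptions and not part of the dissertation itself.

```python
import numpy as np
from math import gamma, pi

def mc_nn_moment(n, d, x0, k=2, reps=3000, seed=0):
    """Monte Carlo estimate of E||X_(1) - x0||^k when X_1, ..., X_n are i.i.d. uniform on [0,1]^d."""
    rng = np.random.default_rng(seed)
    vals = np.empty(reps)
    for r in range(reps):
        X = rng.random((n, d))
        vals[r] = np.min(np.linalg.norm(X - x0, axis=1)) ** k
    return vals.mean()

d, k = 2, 2
x0 = np.full(d, 0.5)                    # interior point; the uniform density there is f(x0) = 1
V_d = pi ** (d / 2) / gamma(1 + d / 2)  # volume of the unit ball in R^d
for n in (500, 2000):
    leading = gamma(k / d + 1) / (V_d) ** (k / d) * n ** (-k / d)
    print(n, mc_nn_moment(n, d, x0, k), leading)  # the two columns should agree to first order
```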
Abstract
Nonparametric regression methods are flexible statistical learning tools that require minimal identification assumptions compared to their parametric counterparts, and practitioners frequently use them in applications because of their appealing empirical performance. In this dissertation, we study two nonparametric methods -- the random forest algorithm and distributional nearest neighbors (DNN) -- for which existing theoretical guarantees either hold only for modified versions of the methods or fail under high-order smoothness assumptions.
In the first part of this dissertation, we use a bias-variance decomposition analysis to derive consistency rates for the random forest algorithm with the sample classification and regression tree (CART) splitting criterion in a general high-dimensional nonparametric regression setting. Our new results provide theoretical justification for the ability of the random forest algorithm to adapt to high dimensionality while remaining flexible enough to allow for a discontinuous regression function and dependent covariates. Furthermore, our bias analysis explicitly characterizes how the asymptotic bias of the random forest algorithm depends on the sample size, tree height, and column subsampling parameter.
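To make concrete the quantities this analysis tracks, the following minimal sketch fits a random forest regressor with scikit-learn, whose CART-based trees expose tuning knobs that correspond loosely to the tree height and column subsampling parameter discussed above. The synthetic data, parameter values, and use of scikit-learn are illustrative assumptions rather than the settings studied in the theory.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative sketch only (not the dissertation's experimental setup):
# synthetic sparse regression data where only the first two of p features matter.
rng = np.random.default_rng(0)
n, p = 1000, 50
X = rng.random((n, p))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)

rf = RandomForestRegressor(
    n_estimators=500,   # number of CART trees averaged in the ensemble
    max_depth=10,       # caps the tree height
    max_features=0.5,   # fraction of columns tried at each split (column subsampling)
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))
```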
In the second part of this dissertation, we study the distributional nearest neighbors method, which assigns monotonic, nonnegative weights to the entire sample in a distributional fashion. The DNN method achieves the optimal nonparametric minimax convergence rate under a second-order smoothness assumption but fails to do so under a high-order smoothness assumption. We show that the slow convergence rate of DNN under high-order smoothness assumptions is due to its asymptotic bias. Thus, we propose eliminating the DNN estimator's first-order bias by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. We prove that the TDNN estimator has an optimal nonparametric convergence rate for estimating the regression function under a fourth-order smoothness condition. Furthermore, we establish that the DNN and TDNN methods are asymptotically normal as the subsampling scales and sample size approach infinity. Finally, we provide variance estimators and a distribution estimator for TDNN using the jackknife and bootstrap techniques.
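To make the two-scale construction concrete, here is a minimal sketch of a DNN estimate and its two-scale combination. The neighbor weights come from averaging the 1-nearest-neighbor response over all size-s subsamples, and the combination weights are chosen to cancel a leading bias term assumed here to decay like s^{-2/d}; the function names and this particular weighting convention are illustrative assumptions, not the dissertation's exact definitions.

```python
import numpy as np
from scipy.special import comb

def dnn_estimate(X, y, x0, s):
    """DNN estimate at x0 with subsampling scale s: the average of the 1-nearest-neighbor
    response over all size-s subsamples, computed via closed-form neighbor weights
    C(n - i, s - 1) / C(n, s) (fine for moderate n; use log-binomials for very large n)."""
    n = len(y)
    order = np.argsort(np.linalg.norm(X - x0, axis=1))  # indices sorted by distance to x0
    ranks = np.arange(1, n + 1)
    weights = comb(n - ranks, s - 1) / comb(n, s)       # weight on the i-th nearest neighbor
    return float(np.dot(weights, y[order]))

def tdnn_estimate(X, y, x0, s1, s2, d):
    """Two-scale DNN: combine two DNN estimates (s1 < s2) so that a leading bias term,
    assumed here to scale like s**(-2/d), cancels."""
    ratio = (s1 / s2) ** (2.0 / d)
    w2 = 1.0 / (1.0 - ratio)   # weight above one on the larger, less biased scale
    w1 = 1.0 - w2              # negative weight on the smaller scale
    return w1 * dnn_estimate(X, y, x0, s1) + w2 * dnn_estimate(X, y, x0, s2)
```

For instance, taking s2 to be a fixed multiple of s1 makes the combination subtract a small multiple of the more biased small-scale estimate, in the spirit of Richardson extrapolation.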
We conclude by discussing potential extensions of our work in this dissertation and open questions related to the random forest and TDNN.
Asset Metadata
Creator: Vossler, Patrick (author)
Core Title: Nonparametric ensemble learning and inference
School: Marshall School of Business
Degree: Doctor of Philosophy
Degree Program: Business Administration
Degree Conferral Date: 2022-08
Publication Date: 05/31/2022
Defense Date: 05/10/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bagging, bootstrap aggregation, Cart, distributional nearest neighbor, ensemble learning, nearest neighbor, nonparametric statistics, OAI-PMH Harvest, random forests
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lv, Jinchi (committee chair), Fan, Yingying (committee member), Ridder, Geert (committee member)
Creator Email: patrick.vossler18@gmail.com, pvossler@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111339149
Unique identifier: UC111339149
Document Type: Dissertation
Rights: Vossler, Patrick
Internet Media Type: application/pdf
Type: texts
Source: 20220608-usctheses-batch-945 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu