LARGE N, T ASYMPTOTIC ANALYSIS OF PANEL DATA MODELS WITH INCIDENTAL PARAMETERS

by

Martin Weidner

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ECONOMICS)

May 2011

Copyright 2011 Martin Weidner

Acknowledgments

First, I would like to express my deepest gratitude to Hyungsik Roger Moon, who in the past years was a great teacher, advisor and friend for me, and without whom I would probably neither have become an Econometrician nor have had much success in that profession.

I am deeply indebted to Cheng Hsiao and Geert Ridder for teaching me Econometrics, giving me feedback on my research, and continuously supporting me before, during and after the job-market.

I am also very grateful to Iván Fernández-Val and Matthew Shum, whom I had the great honor and pleasure to work with and to learn from, and who also helped and supported me so much in the past few months.

I owe special thanks to Michelle Sovinsky Goeree, Jinyong Hahn, Jinchi Lv and Fernando Zapatero for serving on either my qualifying or my dissertation committee and for giving me very valuable and insightful feedback on my research.

I would also like to thank Dominic Brewer, John Ham and Daniel Klerman for giving me the opportunity to work for them as a research assistant, which was very stimulating, and in the course of which I have really learned a lot.

I am very grateful to Young Miller and Morgan Ponder for helping me with all administrative concerns that occurred throughout my time at USC, and who really do a terrific job in overseeing and holding the department together.

Special thanks also go to Andrew McEachin, Eleonora Granziera, Kyoung Eun Kim, Shuyang Sheng and Bo Kim, who either worked or shared an office with me in the past years, and who made work very pleasant and enjoyable for me during that time.

I also want to acknowledge Stéphane Bonhomme and Stefan Hoderlein, for advising and supporting me with my research and for being extremely helpful and kind before and during the job-market.

Last, but not least, I wish to thank my parents, Dagmar and Frank Weidner, and my fiancée, Sakura Schäfer-Nameki, for all their love and support that helped me so much in my life and during my studies.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects
  2.1 Introduction
  2.2 Model, QMLE and Consistency
  2.3 Asymptotic Profile Likelihood Expansion
    2.3.1 When $R = R^0$
    2.3.2 When $R > R^0$
  2.4 Justification of Assumptions 2.4 and 2.5
  2.5 Monte Carlo Simulations
  2.6 Conclusions
Chapter 3: Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects
  3.1 Introduction
  3.2 Model
  3.3 Estimation
    3.3.1 Extension: regressor endogeneity with respect to $e_{jt}$
  3.4 Consistency and Asymptotic Distribution of $\hat\alpha$ and $\hat\beta$
  3.5 Monte Carlo Results
  3.6 Empirical application: demand for new automobiles, 1973-1988
  3.7 Conclusion
Chapter 4: Semiparametric Estimation of Nonlinear Panel Data Models with Generalized Random Effects
  4.1 Introduction
  4.2 Model
  4.3 Description of Estimators and Main Results
    4.3.1 Sampling Issues (Generalized Random Effect Assumption)
    4.3.2 Identification Issues (Smoothness Assumption on $\pi$)
    4.3.3 Main Results
  4.4 Asymptotic Analysis of the Estimators
    4.4.1 Uniform Consistency of $\hat\theta(\pi)$
    4.4.2 Score and Hessian of the Integrated Likelihood
    4.4.3 Joint Maximum Likelihood Estimation of $\theta$ and $\pi$
  4.5 Generalized Random Effects
    4.5.1 Imposing an Appropriate Smoothness Assumption
    4.5.2 Computation
  4.6 Monte Carlo Simulations
  4.7 Conclusions
Bibliography
Appendix A
  A.1 Proof of Consistency
  A.2 Proof of Likelihood Expansion
Appendix B
  B.1 Alternative GMM approach
  B.2 Details for Section 3.4 (Consistency and Asymptotic Distribution)
    B.2.1 Formulas for Asymptotic Bias Terms
    B.2.2 Assumptions for Consistency
    B.2.3 Additional Assumptions for Asymptotic Distribution and Bias Correction
    B.2.4 Bias and Variance Estimators
  B.3 Proofs
    B.3.1 Proof of Consistency
    B.3.2 Proof of Limiting Distribution
    B.3.3 Consistency of Bias and Variance Estimators
Appendix C
  C.1 Assumptions
    C.1.1 Assumptions for Consistency
    C.1.2 Further Regularity Conditions on the Model
  C.2 Proofs
    C.2.1 Proofs for Section 4.4.1
    C.2.2 Proofs for Section 4.4.2
    C.2.3 Proofs for Section 4.4.3
  C.3 Further Discussions for Section 4.5
    C.3.1 Approximating Unknown Distributions
    C.3.2 Approximate Identification of $\pi(\alpha|x)$

List of Tables

2.1 Simulation results for the bias and standard error (std) of the QMLE $\hat\beta_R$
2.2 Simulation results for the quantiles of $\sqrt{NT}(\hat\beta_R - \beta^0)$
3.1 Simulation results for specification 1 (no heteroscedasticity)
3.2 Simulation results for specification 2 (heteroscedasticity in $e^0_{jt}$)
3.3 Parameter estimates (and t-values) for automobile demand estimation
3.4 Parameter estimates (and t-values) for model specification A and B
3.5 Summary statistics for the 23 product-aggregates used in estimation
3.6 Estimated price elasticities for specification B in t = 1988
3.7 Estimated price elasticities for specification C (BLP case) in t = 1988
4.1 Monte Carlo Results for T = 12
4.2 Same as Table 4.1, but with T = 24 and only for $\sigma_\pi = 0.7$

List of Figures

4.1 For T = 12 (left) and T = 24 (right) we plot $\sqrt{T^{-1} I^{-1}(\alpha, \theta^0, y_{i0})}$
4.2 Examples of "basis functions" for T = 12
4.3 Same as Figure 4.2, but for T = 24
4.4 True and estimated individual effect distributions
4.5 Same as Figure 4.4, but with T = 24 and only for $\sigma_\pi = 0.7$
B.1 Example for multiple local minima in the objective function $\mathcal{L}(\beta)$

Abstract

This dissertation contributes to the econometrics of panel data models and their application to economic problems. In particular, it considers "large T" panels, where in addition to the cross-sectional dimension N also the number of time periods T is relatively large.

Chapter 1 provides an introduction to the field of large T panel data econometrics and explains the contribution of the dissertation to this field.

Chapter 2 analyzes linear panel regression models, allowing for unobserved factors (interactive fixed effects) in the error structure of the model. In particular, it is shown that, under appropriate assumptions, the limiting distribution of the Gaussian quasi maximum likelihood estimator for the regression coefficients is independent of the number of factors used in the estimation. The important practical implication of this result is that for inference on the regression coefficients there is no need to estimate the number of factors consistently.

Chapter 3 extends the Berry, Levinsohn and Pakes (1995) random coefficients discrete-choice demand model by adding interactive fixed effects to the unobserved product characteristics of this model. The interactive fixed effects can be arbitrarily correlated with the observed product characteristics, which accommodates endogeneity, and they can capture strong persistence in market shares across products and markets. A two step least squares-minimum distance procedure is proposed to estimate the model, and the asymptotic properties of this estimator are derived. This methodology is then applied to the estimation of US automobile demand.
Chapter 4 proposes a new approach for higher order bias correction in large T non-linear panel data models that is based on inference on the individual effect distribution. Under appropriate assumptions it is shown that the incidental parameter bias for the estimator of the parameters of interest can converge to zero at an arbitrary polynomial rate in T, i.e. that the incidental parameter problem can vanish very rapidly in this approach as T increases. This has important implications in particular for applications where T is modestly large and N is much larger than T.

Chapter 1

Introduction

This thesis is concerned with the statistical analysis of some particular panel data models and their application to economic problems. A panel data set consists of observations on many cross-sectional units (individuals, products, firms, countries, etc.) in multiple time periods. The information contained in such a data set is generally much richer than in a pure cross-sectional data set or in a pure time-series data set. In particular, the availability of panel data makes it possible to differentiate the heterogeneous influences and characteristics that are unique to each cross-sectional unit from the structural relations or common trends that are common to all cross-sectional units.

Several textbooks have been written on the subject of panel data (e.g. Hsiao (2003), Arellano (2003), Wooldridge (2010)), and it continues to be a very active research area. The goal of panel data analysis typically is to estimate and do inference on a few parameters of interest (regression coefficients, marginal effects, etc.), while at the same time controlling for unobserved heterogeneity, which is often modeled by unobserved individual effects. There are well-established techniques for handling the unobserved heterogeneity in purely linear panel data models, but this issue is still a serious econometric challenge for many non-linear models. One of the difficulties in non-linear models is that the parameters of interest may not be uniquely identified from the data in the presence of the unobserved individual effects. Based on how to handle this potential identification problem, and which asymptotic theory is considered, one can broadly classify the existing panel data literature as follows.

Firstly, there is the "classic" panel data literature that considers point identification and point estimation under an asymptotic where the number of time periods T remains constant, while the cross-sectional size N goes to infinity. Obtaining a consistent point estimator for the parameters of interest at fixed T is desirable and can indeed be achieved for some non-linear models (e.g. Rasch (1960), Andersen (1970), Chamberlain (1984), Hausman, Hall and Griliches (1984), Manski (1987), Honoré (1992), Horowitz and Lee (2004), Bonhomme (2010)). However, at fixed T a non-linear panel data model may not be point identified, or may not possess a $\sqrt{N}$-consistent estimator, as discussed by Chamberlain (2010) for the static binary choice model. Furthermore, an incidental parameter problem (Neyman and Scott (1948), see e.g. Lancaster (2000) for a review) usually appears in fixed T estimation of non-linear panel data models since the number of incidental parameters (individual effects) grows with the sample size. Resolving this problem usually requires a model specific augmentation of standard estimation procedures like maximum likelihood. We refer e.g. to Chamberlain (1984), Arellano and Honoré (2001), and the above mentioned textbooks for reviews of this branch of the literature.
Secondly, there is the "large T" panel data literature, which includes e.g. Phillips and Moon (1999), Hahn and Kuersteiner (2002), Lancaster (2002), Woutersen (2002), Hahn and Kuersteiner (2004), Hahn and Newey (2004), Carro (2007), Arellano and Bonhomme (2009), Fernández-Val (2009), Bester and Hansen (2009), Bai (2009b), Dhaene and Jochmans (2010); a review is provided by Arellano and Hahn (2007). This literature considers an asymptotic where both panel dimensions N and T go to infinity. This large T asymptotic guarantees point identification of a very large class of models under weak regularity conditions, and provides an asymptotic solution to the incidental parameter problem. Namely, the (maximum likelihood) estimator for the parameters of interest is shown to have a bias of order 1/T, which thus vanishes asymptotically, and bias correction techniques are discussed that augment the convergence rate of the bias further.

Finally, there are a couple of papers that acknowledge the fact that many non-linear panel data models are not point identified at fixed T and consequently discuss set identification (bound analysis) for the parameters of interest or for certain policy parameters like marginal effects. These include e.g. Chernozhukov, Hahn and Newey (2005), Honoré and Tamer (2006), Chernozhukov, Fernández-Val, Hahn and Newey (2009a) and Chernozhukov, Fernández-Val and Newey (2009b).

These three estimation approaches for non-linear panel data models should be viewed as complements rather than substitutes. If for fixed T a ($\sqrt{N}$-) consistent estimator is available for the particular model under consideration, then it probably should be used. If this is not the case, then chances are that the model may not be point identified at fixed T, and in particular for small values of T one needs to consider inference using bound analysis. However, the above cited papers on set identification all point out that the bounds can be very tight and shrink rather rapidly as T grows. Thus, if T is sufficiently large one can safely ignore the fact that the model might only be set-identified and simply use the large T estimation methodology.

Each of the following three chapters analyzes a particular class of panel data models with distinct motivations and different potential applications. The common theme to all three chapters is that they consider cases where both panel dimensions N and T are relatively large, i.e. they contribute to the branch of "large T" panel data mentioned above. All three chapters consider an asymptotic where both N and T go to infinity, which, as noted above, is theoretically appealing since it overcomes the identification problem and provides an approximate solution to the incidental parameter problem. In addition, this asymptotic is motivated by the fact that many panel data sets that are available nowadays indeed have relatively large N and T. This applies to microeconomic surveys (e.g. the Panel Study of Income Dynamics or the National Longitudinal Survey of Youth), to macroeconomic data sets (e.g. OECD, Eurostat and UNESCO provide data of many countries over multiple decades), financial data sets (e.g. the Center for Research in Security Prices and COMPUSTAT provide data on prices, earnings, ratings, etc. of thousands of companies over many decades), and other areas.

Chapter 2 considers a linear panel regression model with interactive fixed effects. While the model has a linear regression specification, it is non-linear in the specification of the unobserved error structure, which contains (multiple) individual effects and time effects that interact multiplicatively.
We refer to this multiplicative specification as a factor structure or an interactive effect, and both the individual effects and the time effects are estimated as parameters, i.e. treated as fixed effects. We show how to analyze the fixed effect Gaussian quasi maximum likelihood estimator (QMLE) of this model under the $N,T \to \infty$ asymptotic. In particular, under appropriate assumptions we show that the limiting distribution of the QMLE for the regression coefficients is independent of the number of interactive fixed effects used in the estimation, as long as this number does not fall below the true number of interactive fixed effects present in the data. The important practical implication of this result is that for inference on the regression coefficients one does not need to estimate the number of interactive effects consistently, but can simply rely on any known upper bound of this number to calculate the QMLE. Further details and references are provided in the introduction of the chapter itself. The research of this part of the thesis also entered into Moon and Weidner (2010b), which is a joint paper with Hyungsik Roger Moon.

In Chapter 3 we extend the Berry, Levinsohn and Pakes (BLP, 1995) random coefficients discrete-choice demand model, which underlies much recent empirical work in industrial organization. In the context of demand estimation the panel structure is given by observing market shares and characteristics of multiple products (the cross-sectional units) in different markets (e.g. over time). Analogous to the linear regression model in Chapter 2 we add interactive fixed effects in the form of a factor structure to the unobserved product characteristics, which is the structural error of the BLP demand model. The interactive fixed effects can be arbitrarily correlated with the observed product characteristics (including price), which accommodates endogeneity and, at the same time, captures strong persistence in market shares across products and markets. We propose a two step least squares-minimum distance procedure to calculate the estimator. This estimator is easy to compute, Monte Carlo simulations show that it performs well, and we discuss an empirical application to US automobile demand. Further details and references are again provided in the chapter itself. This chapter of the thesis also gave rise to a joint paper with Hyungsik Roger Moon and Matthew Shum (Moon, Shum and Weidner (2010)).

Both Chapter 2 and Chapter 3 introduce interactive fixed effects into the error structure of the respective model, i.e. there are individual effects (factor loadings) and time effects (factors) that are both estimated as parameters. Under the asymptotic where both panel dimensions become large one therefore faces an incidental parameter problem not only in the cross-sectional dimension, but also in the time dimension, which turns out to give rise to a bias of order 1/N (or 1/J, in the notation of Chapter 3) in the estimator for the parameters of interest, in addition to the "standard" cross-sectional incidental parameter bias of order 1/T. The chapters show how to derive the magnitude of these biases asymptotically, which then also allows for bias correction. Regarding bias correction in linear factor regression models we also refer to Bai (2009b) and Moon and Weidner (2010a). Bias correction in other non-linear large dimensional panel data models with individual and time effects is discussed in Fernández-Val and Weidner (2010).
Chapter 4 of the thesis does not consider time effects, but studies relatively general non-linear panel data models with individual effects when both N and T become large. As mentioned above, it is well-known that the fixed effect estimation approach in these models typically results in an incidental parameter bias of order 1/T. We show that under appropriate assumptions this incidental parameter bias can be substantially reduced if instead of a fixed effect approach one estimates the distribution of the individual effects jointly with the parameter of interest by maximum likelihood, thereby treating the individual effect distribution non-parametrically. The convergence rate of the incidental parameter bias in this approach is shown to be only limited by the smoothness properties of the true individual effect distribution. To allow inference on this distribution we make a "generalized random effect" assumption, which requires the cross-sectional units to be partitioned into groups and imposes a random effect assumption in each group. In Monte Carlo simulations we consider the dynamic binary choice model, and we find the finite sample properties of our estimator to be in accordance with the asymptotic results. The results on higher order bias correction in this chapter are particularly important in applications where T is only modestly large, while N is much larger than T, which is the case in many microeconomic panel surveys. For very large values of N bias correction becomes particularly important, since the standard error of the estimator of the parameters of interest is very small, so that even a small remaining bias may easily dominate the properties of the estimator. For further details and references we again refer to the introduction of the chapter itself. This chapter was the basis for my job-market paper (Weidner (2011)).

Chapter 2

Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects

2.1 Introduction

Panel data models typically incorporate individual and time effects to control for heterogeneity in cross-section and across time-periods. While often these individual and time effects enter the model additively, they can also be interacted multiplicatively, thus giving rise to so called interactive effects, which we also refer to as a factor structure. The multiplicative form captures the heterogeneity in the data more flexibly, since it allows for common time-varying shocks (factors) to affect the cross-sectional units with individual specific sensitivities (factor loadings). It is this flexibility that motivated the discussion of interactive effects in the econometrics literature, e.g. Holtz-Eakin, Newey and Rosen (1988), Ahn, Lee, Schmidt (2001; 2007), Pesaran (2006), Bai (2009b; 2009a), Zaffaroni (2009), Moon and Weidner (2010a).

Analogous to the analysis of individual specific effects, one can either choose to model the interactive effects as random (random effects/correlated effects) or as fixed (fixed effects), with each option having its specific merits and drawbacks, that have to be weighed in each empirical application separately. In this chapter, we consider the interactive fixed effect specification, i.e. we treat the interactive effects as nuisance parameters,
Let R 0 be the true number of interactive effects (number of factors) in the data, and let R be the number of interactive effects used by the econometrician in the data analysis. A key restriction in the existing literature on interactive fixed effects is that R 0 is assumed to be known, 2 i.e. R = R 0 . This is true both for the quasi-differencing analysis in Holtz-Eakin, Newey and Rosen (1988) 3 and for the least squares analysis of Bai(2009b). AssumingR 0 tobeknowncouldbequiterestrictive,sinceinmanyempirical applications there is no consensus about the exact number of factors in the data or in the relevant economic model, so that an estimator which is not robust towards some degree of mis-specification of R 0 should not be used. The goal of the present chapter is to overcome this problem. For a linear panel regression model with interactive fixed effects we consider the Gaussian quasi maximum likelihood estimator (QMLE), 4 which jointly minimized the sum of squared residuals over the regression parameters and the interactive fixed effects parameters(seeKiefer(1980), Bai(2009b), andMoonandWeinder(2010a)). Weemploy an asymptotic where both the number of cross-sectional and the number of time-serial 1 Note that Ahn, Lee, Schmidt (2001; 2007) take a hybrid approach in that they treat the fac- tors as non-random, but the factor loadings as random. The common correlated effects estimator of Pesaran (2006) was introduced in a context, where both the factor loadings and the factors follow cer- tain probability laws, but also exhibits some properties of a fixed effects estimator. When we refer to interactive fixed effects we mean that both factors and factor loadings are treated as non-random parameters. 2 Intheliterature, consistent estimation procedures forR 0 are established onlyfor purefactor models, not for the model with regressors. 3 Holtz-Eakin, Newey and Rosen (1988) assume just one interactive effect, but their approach could easily be generalized to multiple interactive effects, as long as their number is known 4 The QMLE is sometimes called concentrated least squares estimator in the literature. 8 dimensions becomes large, while the number of interactive effects R 0 (and also R) is constant. Themain findingof the chapter is that underappropriateassumptions the QMLEof the regression parameters has the same limiting distribution for all R≥R 0 . Thus, the QMLE is robust towards inclusion of extra interactive effects in the model, and within the QMLEframework there is no asymptotic efficiency loss from choosingR larger than R 0 . This result is surprising because the conjecture in the literature is that the QMLE with R > R 0 might be consistent but could be less efficient than the QMLE with R 0 (e.g., see Bai (2009b)). 5 The important empirical implication of our result is that as long as a valid upper bound on the number of factors is known one can use this upper bound to construct the QMLE, and need not worry about consistent estimation of the number of factors. Since the limiting distribution of the QMLE with R>R 0 is identical to the one withR=R 0 the results of Bai (2009b) and Moon and Weidner (2010a) regarding inference on the regression parameters become applicable. In order to derive the asymptotic theory of the QMLE with R ≥ R 0 we study the properties of the profile likelihood function, which is the quasi likelihood function after integrating out the interactive fixed effect parameters. 
Concretely, we derive an approximate quadratic expansion of this profile likelihood in the regression parameters. This expansion is difficult to perform, since integrating out the interactive fixed effects results in an eigenvalue problem in the formulation of the profile likelihood. For $R = R^0$ we show how to overcome this difficulty by performing a joint expansion of the profile likelihood in the regression parameters and in the idiosyncratic error terms. Using the perturbation theory of linear operators we prove that the profile quasi likelihood function is analytic in a neighborhood of the true parameter, and we obtain explicit formulas for the expansion coefficients, in particular analytic expressions for the approximated score and the approximated Hessian for $R = R^0$.

To generalize the result to $R > R^0$ we then show that the difference between the profile likelihood for $R = R^0$ and for $R > R^0$ is just a constant term plus a term whose dependence on the regression parameters is sufficiently small to be irrelevant for the asymptotic distribution of the QMLE. Due to the eigenvalue problem in the likelihood function, the derivation of this last result requires some very specific knowledge about the eigenvectors and eigenvalues of the random covariance matrix of the idiosyncratic error matrix. We provide high-level assumptions under which the results hold, and we show that these high-level assumptions are satisfied when the idiosyncratic errors of the model are independent and identically normally distributed. As we explain in Section 2.4, the justification of our high-level assumptions for more general distributions of the idiosyncratic errors requires some further progress in the Random Matrix Theory of real random covariance matrices, both regarding the properties of their eigenvalues and of their eigenvectors (see Bai (1999) for a review of this literature).

The chapter is organized as follows. In Section 2.2 we introduce the interactive fixed effect model, its Gaussian quasi likelihood function, and the corresponding QMLE, and also discuss consistency of the QMLE. The asymptotic profile likelihood expansion is derived in Section 2.3. Section 2.4 provides a justification for the high-level assumptions that we impose, and discusses the relation of these assumptions to the random matrix theory literature. Monte Carlo results which illustrate the validity of our conclusion at finite sample are presented in Section 2.5, and the conclusions of the chapter are drawn in Section 2.6.

A few words on notation. The transpose of a matrix $A$ is denoted by $A'$. For a column vector $v$ its Euclidean norm is defined by $\|v\| = \sqrt{v'v}$. For the $n$-th largest eigenvalue (counting multiple eigenvalues multiple times) of a symmetric matrix $B$ we write $\mu_n(B)$. For an $m \times n$ matrix $A$ the Frobenius or Hilbert-Schmidt norm is $\|A\|_{HS} = \sqrt{\operatorname{Tr}(AA')}$, and the operator or spectral norm is $\|A\| = \max_{0 \neq v \in \mathbb{R}^n} \frac{\|Av\|}{\|v\|}$, or equivalently $\|A\| = \sqrt{\mu_1(A'A)}$. Furthermore, we use $P_A = A(A'A)^{-1}A'$ and $M_A = \mathbb{1} - A(A'A)^{-1}A'$, where $\mathbb{1}$ is the $m \times m$ identity matrix, and $(A'A)^{-1}$ denotes some generalized inverse if $A$ is not of full column rank. For square matrices $B$, $C$, we use $B > C$ (or $B \geq C$) to indicate that $B - C$ is positive (semi) definite. We use "wpa1" for "with probability approaching one", and $A =_d B$ to indicate that the random variables $A$ and $B$ have the same probability distribution.
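For the numerical illustrations that accompany this chapter it is convenient to have these objects available in code. The following is a minimal numpy sketch of the notation just introduced; the function names are ours, chosen for the snippets below, and are not part of the chapter itself.

```python
import numpy as np

def P(A):
    """Projector onto the column space of A: P_A = A (A'A)^{-1} A'.
    pinv provides the generalized inverse when A is rank deficient."""
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def M(A):
    """Orthogonal complement projector: M_A = 1 - P_A."""
    return np.eye(A.shape[0]) - P(A)

def norm_HS(A):
    """Frobenius / Hilbert-Schmidt norm ||A||_HS = sqrt(Tr(A A'))."""
    return np.linalg.norm(A, 'fro')

def norm_op(A):
    """Operator / spectral norm ||A|| = sqrt(mu_1(A'A))."""
    return np.linalg.norm(A, 2)

def mu(B, n):
    """n-th largest eigenvalue mu_n(B) of a symmetric matrix B."""
    return np.linalg.eigvalsh(B)[::-1][n - 1]
```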
2.2 Model, QMLE and Consistency

A linear panel regression model with cross-sectional dimension $N$, time-serial dimension $T$, and interactive fixed effects of dimension $R^0$, is given by

$$Y = \sum_{k=1}^{K} \beta^0_k X_k + \varepsilon, \qquad \varepsilon = \lambda^0 f^{0\prime} + e, \qquad (2.1)$$

where $Y$, $X_k$, $\varepsilon$ and $e$ are $N \times T$ matrices, $\lambda^0$ is an $N \times R^0$ matrix, $f^0$ is a $T \times R^0$ matrix, and the regression parameters $\beta^0_k$ are scalars — the superscript zero indicates the true value of the parameters. We write $\beta$ for the $K$-vector of regression parameters, and introduce the notation $\beta \cdot X \equiv \sum_{k=1}^{K} \beta_k X_k$. All matrices, vectors and scalars in this chapter are real valued.

A choice for the number of interactive effects $R$ used in the estimation needs to be made, and we may have $R \neq R^0$ since the true number of factors $R^0$ may not be known accurately. Given the choice $R$, the quasi maximum likelihood estimator (QMLE) for the parameters $\beta^0$, $\lambda^0$ and $f^0$ is given by

$$\left( \hat\beta_R,\, \hat\Lambda_R,\, \hat F_R \right) = \operatorname*{argmin}_{\{\beta \in \mathbb{R}^K,\; \Lambda \in \mathbb{R}^{N \times R},\; F \in \mathbb{R}^{T \times R}\}} \left\| Y - \beta \cdot X - \Lambda F' \right\|^2_{HS}. \qquad (2.2)$$

[Footnote 6: The optimal $\hat\Lambda_R$ and $\hat F_R$ in (2.2) are not unique, since the objective function is invariant under right-multiplication of $\Lambda$ with a non-degenerate $R \times R$ matrix $S$, and simultaneous right-multiplication of $F$ with $(S^{-1})'$. However, the column spaces of $\hat\Lambda_R$ and $\hat F_R$ are uniquely determined.]

The square of the Hilbert-Schmidt norm is simply the sum of the squared elements of the argument matrix, i.e. the QMLE is defined by minimizing the sum of squared residuals, which is equivalent to minimizing the likelihood function for iid normal idiosyncratic errors. The estimator is the quasi MLE since the idiosyncratic errors need not be iid normal and since $R$ might not equal $R^0$. The QMLE for $\beta^0$ can equivalently be defined by minimizing the profile quasi likelihood function, namely

$$\hat\beta_R = \operatorname*{argmin}_{\beta \in \mathbb{R}^K} \mathcal{L}^R_{NT}(\beta), \qquad (2.3)$$

where

$$\mathcal{L}^R_{NT}(\beta) = \min_{\{\Lambda \in \mathbb{R}^{N \times R},\, F \in \mathbb{R}^{T \times R}\}} \frac{1}{NT} \left\| Y - \beta \cdot X - \Lambda F' \right\|^2_{HS} = \min_{F \in \mathbb{R}^{T \times R}} \frac{1}{NT} \operatorname{Tr}\left[ (Y - \beta \cdot X)\, M_F\, (Y - \beta \cdot X)' \right] = \frac{1}{NT} \sum_{t=R+1}^{T} \mu_t\left[ (Y - \beta \cdot X)'(Y - \beta \cdot X) \right]. \qquad (2.4)$$

Here, we first concentrated out $\Lambda$ by use of its own first order condition. The resulting optimization problem for $F$ is a principal components problem, so that the optimal $F$ is given by the $R$ largest principal components of the $T \times T$ matrix $(Y - \beta \cdot X)'(Y - \beta \cdot X)$. At the optimum the projector $M_F$ therefore exactly projects out the $R$ largest eigenvalues of this matrix, which gives rise to the final formulation of the profile likelihood function as the sum over its $T - R$ smallest eigenvalues. [Footnote 7: Since the model is symmetric under $N \leftrightarrow T$, $\Lambda \leftrightarrow F$, $Y \leftrightarrow Y'$, $X_k \leftrightarrow X'_k$, there also exists a dual formulation of $\mathcal{L}^R_{NT}(\beta)$ that involves solving an eigenvalue problem for an $N \times N$ matrix.] This last formulation of $\mathcal{L}^R_{NT}(\beta)$ is very convenient since it does not involve any explicit optimization over nuisance parameters. Numerical calculation of eigenvalues is very fast, so that the numerical evaluation of $\mathcal{L}^R_{NT}(\beta)$ is unproblematic for moderately large values of $T$. The function $\mathcal{L}^R_{NT}(\beta)$ is not convex in $\beta$ and might have multiple local minima, which have to be accounted for in the numerical calculation of $\hat\beta_R$. We write $\mathcal{L}^0_{NT}(\beta)$ for $\mathcal{L}^{R^0}_{NT}(\beta)$, which is the profile likelihood obtained from the true number of factors.
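The eigenvalue formulation in the last line of (2.4) translates directly into a few lines of code. The sketch below is our own illustration (with $X$ passed as a list of $K$ regressor matrices); it evaluates $\mathcal{L}^R_{NT}(\beta)$ without any explicit optimization over $\Lambda$ and $F$.

```python
def profile_likelihood(beta, Y, X, R):
    """L^R_NT(beta): sum of the T-R smallest eigenvalues of
    (Y - beta.X)'(Y - beta.X), divided by NT, as in (2.4)."""
    N, T = Y.shape
    res = Y - sum(b * Xk for b, Xk in zip(beta, X))   # Y - beta.X
    evals = np.linalg.eigvalsh(res.T @ res)           # ascending order
    return evals[:T - R].sum() / (N * T)              # drop the R largest
```

Because $\mathcal{L}^R_{NT}$ is not convex, any numerical minimization of this function over $\beta$ should be started from several initial values.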
In order to show consistency of $\hat\beta_R$ we impose the following assumptions.

Assumption 2.1.
(i) $\|X_k\| = O_p(\sqrt{NT})$, $k = 1, \ldots, K$,
(ii) $\|e\| = O_p(\sqrt{\max(N,T)})$.

One can justify Assumption 2.1(i) by use of the norm inequality $\|X_k\| \leq \|X_k\|_{HS}$ and the fact that $\|X_k\|^2_{HS} = \sum_{i,t} X^2_{k,it} = O_p(NT)$, where $i = 1, \ldots, N$ and $t = 1, \ldots, T$, and the last step follows e.g. if $X_{k,it}$ has a uniformly bounded second moment. Assumption 2.1(ii) is a condition on the largest eigenvalue of the random covariance matrix $e'e$, which is often studied in the literature on random matrix theory, e.g. Geman (1980), Bai, Silverstein, Yin (1988), Yin, Bai, and Krishnaiah (1988), Silverstein (1989). The results in Latala (2005) show that $\|e\| = O_p(\sqrt{\max(N,T)})$ if $e$ has independent entries with mean zero and uniformly bounded fourth moment. Some weak dependence of the entries $e_{it}$ across $i$ and $t$ is also permissible (see, e.g., Moon and Weidner (2010a)).
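The scaling in Assumption 2.1(ii) is easy to illustrate by simulation. The sketch below (iid standard normal entries, a case covered by the Latala (2005) result just cited) shows that $\|e\|/\sqrt{\max(N,T)}$ stays bounded as the dimensions grow; for iid errors with unit variance the largest singular value behaves like $\sqrt{N} + \sqrt{T}$, so the printed ratio settles near $1 + \sqrt{\min(N,T)/\max(N,T)}$.

```python
rng = np.random.default_rng(0)
for N, T in [(50, 50), (200, 200), (800, 800), (800, 200)]:
    e = rng.standard_normal((N, T))
    # ||e|| is the largest singular value of e, i.e. sqrt(mu_1(e'e))
    print(N, T, np.linalg.norm(e, 2) / np.sqrt(max(N, T)))
```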
Assumption 2.2.
(i) $\frac{1}{\sqrt{NT}} \operatorname{Tr}(X_k e') = O_p(1)$, $k = 1, \ldots, K$.
(ii) Consider linear combinations $X_\alpha = \sum_{k=1}^{K} \alpha_k X_k$ of the regressors $X_k$ with $K$-vector $\alpha$ such that $\|\alpha\| = 1$. We assume that there exists a constant $b > 0$ such that

$$\min_{\{\alpha \in \mathbb{R}^K,\, \|\alpha\|=1\}} \; \sum_{t=R+R^0+1}^{T} \mu_t\left( \frac{X'_\alpha X_\alpha}{NT} \right) \geq b, \quad \text{wpa1.}$$

Assumption 2.2(i) requires weak exogeneity of the regressors $X_k$. Assumption 2.2(ii) is a generalization of the usual non-collinearity condition on the regressors. It requires $X'_\alpha X_\alpha$ to be non-degenerate even after elimination of the largest $R + R^0$ eigenvalues (the sum in the assumption only runs over the smallest $T - R - R^0$ eigenvalues of this matrix, while running over all eigenvalues would give the trace operator, and thus the usual non-collinearity condition). In particular, this assumption is violated if there exists a linear combination of the regressors with $\|\alpha\| = 1$ and $\operatorname{rank}(X_\alpha) \leq R + R^0$, i.e. the assumption rules out "low-rank regressors" like time-invariant regressors or cross-sectionally invariant regressors. These low-rank regressors require a special treatment in the interactive fixed effect model (see Bai (2009b) and Moon and Weidner (2010a)), and we ignore them in the present chapter. If one is not interested explicitly in their regression coefficients, one can always eliminate the low-rank regressors by an appropriate projection of the data, e.g. subtraction of the time (or cross-sectional) means from the data eliminates all time-invariant (or cross-sectionally invariant) regressors.

Theorem 2.1. Let Assumptions 2.1 and 2.2 be satisfied and let $R \geq R^0$. For $N,T \to \infty$ we then have

$$\sqrt{\min(N,T)}\left( \hat\beta_R - \beta^0 \right) = O_p(1).$$

Remarks.
(i) The theorem guarantees consistency of $\hat\beta_R$, $R \geq R^0$, in an arbitrary limit $N,T \to \infty$. In the rest of this chapter we consider asymptotics where $N$ and $T$ grow at the same rate, i.e. $N/T \to \kappa^2$, for some positive constant $\kappa$. For these restricted asymptotics the theorem already guarantees $\sqrt{N}$ (or equivalently $\sqrt{T}$) consistency of $\hat\beta_R$, which is a useful intermediate result.
(ii) The $\sqrt{\min(N,T)}$ convergence rate in Theorem 2.1 can be generalized further. If we generalize Assumption 2.1(ii) and Assumption 2.2(i) to Assumption 2.1(ii*), $\frac{1}{\sqrt{NT}}\|e\| = O_p(\xi_{NT})$, and Assumption 2.2(i*), $\frac{1}{NT}\operatorname{Tr}(X_k e') = O_p(\xi_{NT})$, $k = 1, \ldots, K$, where $\xi_{NT} \to 0$, then it is possible to establish that $\sqrt{\xi_{NT}}\left( \hat\beta_R - \beta^0 \right) = O_p(1)$.

The proof of Theorem 2.1 is presented in the appendix. The theorem imposes no restriction at all on $f^0$ and $\lambda^0$, apart from the condition $R \geq R^0$. To derive the results in the rest of the chapter we do however make the following strong factor assumption. [Footnote 8: The strong factor assumption is regularly imposed in the literature on large N and T factor models, including Bai and Ng (2002), Stock and Watson (2002) and Bai (2009b). Onatski (2006) discussed an alternative "weak factor" assumption for the purpose of estimating the number of factors in a pure factor model, and a more general discussion of strong and weak factors is given in Chudik, Pesaran and Tosetti (2009).]

Assumption 2.3.
(i) $0 < \operatorname{plim}_{N,T\to\infty} \frac{1}{N} \lambda^{0\prime} \lambda^0 < \infty$,
(ii) $0 < \operatorname{plim}_{N,T\to\infty} \frac{1}{T} f^{0\prime} f^0 < \infty$.

The main result of this chapter is that the inclusion of unnecessary factors in the estimation does not change the asymptotic distribution of the QMLE for $\beta^0$. Before deriving this result rigorously, we want to provide an intuitive explanation for it. As already mentioned above, the estimator $\hat F_R$ is given by the first $R$ principal components of the matrix $(Y - \hat\beta_R \cdot X)'(Y - \hat\beta_R \cdot X)$. We have

$$Y - \hat\beta_R \cdot X = \lambda^0 f^{0\prime} + e - (\hat\beta_R - \beta^0) \cdot X. \qquad (2.5)$$

For asymptotics where $N$ and $T$ grow at the same rate, we find that Assumption 2.1 and the result of Theorem 2.1 guarantee that $\|e - (\hat\beta_R - \beta^0) \cdot X\| = O_p(\sqrt{N})$. The strong factor assumption implies that the norms of the columns of $\lambda^0$ and $f^0$ each grow at a rate of $\sqrt{N}$ (or equivalently $\sqrt{T}$), so that the spectral norm of $\lambda^0 f^{0\prime}$ grows at the rate $\sqrt{NT}$. The strong factor assumption therefore guarantees that $\lambda^0 f^{0\prime}$ is the dominant component of $Y - \hat\beta_R \cdot X$, which implies that the first $R^0$ principal components of $(Y - \hat\beta_R \cdot X)'(Y - \hat\beta_R \cdot X)$ are close to $f^0$, i.e. the true factors are correctly picked up by the principal component estimator. The additional $R - R^0$ principal components that are estimated for $R > R^0$ cannot pick up any more true factors and are thus mostly determined by the remaining term $e - (\hat\beta_R - \beta^0) \cdot X$. Our results below show that $\hat\beta_R$ is not only $\sqrt{N}$ consistent, but actually $\sqrt{NT}$ consistent, so that $\|(\hat\beta_R - \beta^0) \cdot X\| = O_p(1)$, which makes the idiosyncratic error matrix $e$ the dominant part of $e - (\hat\beta_R - \beta^0) \cdot X$, i.e. the $R - R^0$ additional principal components in $\hat F_R$ are mostly determined by $e$, and more precisely are close to the $R - R^0$ principal components of $e'e$. This means that they are essentially random and close to uncorrelated with the regressors $X_k$. Including unnecessary factors in the QMLE calculation is therefore analogous to including irrelevant regressors in a linear regression which are uncorrelated with the relevant regressors $X_k$. From the second line in equation (2.4) we see that these additional random components of $\hat F_R$ project out the corresponding $R - R^0$ dimensional subspace of the $T$-dimensional space spanned by the observations over time, thus effectively reducing the number of time dimensions by $R - R^0$. This usually results in a somewhat increased finite sample variance of the QMLE, but has no influence asymptotically as $T$ goes to infinity, so that the asymptotic distributions of $\hat\beta_{R^0}$ and $\hat\beta_R$ are identical for $R \geq R^0$.
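The principal components estimator that underlies this intuition can be sketched in a few lines. For a given $\beta$ the code below (our own illustration, normalizing $\hat F'\hat F/T$ to the identity) extracts $\hat F$ and the implied loadings from the second line of (2.4); when $R > R^0$, the trailing $R - R^0$ columns of $\hat F$ essentially pick up principal components of $e'e$.

```python
def estimate_factors(beta, Y, X, R):
    """First R principal components of (Y - beta.X)'(Y - beta.X):
    the minimizing F (up to rotation) in (2.4), and the implied loadings."""
    N, T = Y.shape
    res = Y - sum(b * Xk for b, Xk in zip(beta, X))
    evals, evecs = np.linalg.eigh(res.T @ res)        # ascending order
    F_hat = evecs[:, ::-1][:, :R] * np.sqrt(T)        # leading R eigenvectors
    Lam_hat = res @ F_hat / T                         # from F'F = T * identity
    return Lam_hat, F_hat
```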
2.3 Asymptotic Profile Likelihood Expansion

To derive the asymptotics of $\hat\beta_R$, we study the asymptotic properties of the profile likelihood function $\mathcal{L}^R_{NT}(\beta)$ around $\beta^0$. First we notice that the expression cannot easily be discussed by analytic means, since there is no explicit formula for the eigenvalues of a matrix. In particular, a standard Taylor expansion of $\mathcal{L}^0_{NT}(\beta)$ around $\beta^0$ cannot easily be derived. In Section 2.3.1 we show how to overcome this problem when the true number of factors is known, i.e. $R = R^0$, and in Section 2.3.2 we generalize the results to $R > R^0$.

When the true $R^0$ is known, the approach we choose is to perform a joint expansion in the regression parameters and in the idiosyncratic error terms. To perform this joint expansion we apply the perturbation theory of linear operators (e.g., Kato (1980)). We thereby obtain an approximate quadratic expansion of $\mathcal{L}^0_{NT}(\beta)$ in $\beta$, which can be used to derive the first order asymptotic theory of the QMLE $\hat\beta_{R^0}$.

To carry the results for $R = R^0$ over to $R > R^0$, we first note that equation (2.4) implies that

$$\mathcal{L}^R_{NT}(\beta) = \mathcal{L}^0_{NT}(\beta) - \frac{1}{NT} \sum_{t=R^0+1}^{R} \mu_t\left[ (Y - \beta \cdot X)'(Y - \beta \cdot X) \right]. \qquad (2.6)$$

The extra term $\frac{1}{NT} \sum_{t=R^0+1}^{R} \mu_t[(Y - \beta \cdot X)'(Y - \beta \cdot X)]$ is due to overfitting on the extra factors. We show that the $\beta$-dependence of this term is sufficiently small, so that apart from a constant the approximate quadratic expansions of $\mathcal{L}^R_{NT}(\beta)$ and $\mathcal{L}^0_{NT}(\beta)$ around $\beta^0$ are identical. To obtain this result we first strengthen Theorem 2.1 and show that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$, so that we only have to discuss the $\beta$-dependence of the extra term in $\mathcal{L}^R_{NT}(\beta)$ within an $N^{3/4}$ shrinking neighborhood of $\beta^0$. From the analysis of $\mathcal{L}^R_{NT}(\beta)$ we can then deduce the main result of the chapter, namely

$$\sqrt{NT}\left( \hat\beta_R - \beta^0 \right) = \sqrt{NT}\left( \hat\beta_{R^0} - \beta^0 \right) + o_p(1). \qquad (2.7)$$

This implies that the limiting distributions of $\hat\beta_R$ and $\hat\beta_{R^0}$ are identical, and that over-estimating the number of factors results in no efficiency loss in terms of the asymptotic variance of the QMLE.

2.3.1 When $R = R^0$

We want to expand the profile likelihood $\mathcal{L}^0_{NT}(\beta)$ simultaneously in $\beta$ and in the spectral norm of $e$. Let the $K + 1$ expansion parameters be defined by $\epsilon_0 = \|e\|/\sqrt{NT}$ and $\epsilon_k = \beta^0_k - \beta_k$, $k = 1, \ldots, K$, and define the $N \times T$ matrix $X_0 = (\sqrt{NT}/\|e\|)\, e$. With these definitions we obtain

$$\frac{1}{\sqrt{NT}}(Y - \beta \cdot X) = \frac{1}{\sqrt{NT}}\left[ \lambda^0 f^{0\prime} + (\beta^0 - \beta) \cdot X + e \right] = \frac{\lambda^0 f^{0\prime}}{\sqrt{NT}} + \sum_{k=0}^{K} \epsilon_k \frac{X_k}{\sqrt{NT}}. \qquad (2.8)$$

According to equation (2.4) the profile likelihood $\mathcal{L}^0_{NT}(\beta)$ can be written as the sum over the $T - R^0$ smallest eigenvalues of the matrix in (2.8) multiplied by its transpose. We consider $\sum_{k=0}^{K} \epsilon_k X_k/\sqrt{NT}$ as a small perturbation of the unperturbed matrix $\lambda^0 f^{0\prime}/\sqrt{NT}$, and thus expand $\mathcal{L}^0_{NT}(\beta)$ in the perturbation parameters $\epsilon = (\epsilon_0, \ldots, \epsilon_K)$ around $\epsilon = 0$, namely

$$\mathcal{L}^0_{NT}(\beta) = \frac{1}{NT} \sum_{g=0}^{\infty} \sum_{k_1,\ldots,k_g=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \cdots \epsilon_{k_g}\; L^{(g)}\left( \lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g} \right), \qquad (2.9)$$

where $L^{(g)} = L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g})$ are the expansion coefficients.

The unperturbed matrix $\lambda^0 f^{0\prime}/\sqrt{NT}$ has rank $R^0$, so that the $T - R^0$ smallest eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ are all zero, i.e. $\mathcal{L}^0_{NT}(\beta) = 0$ for $\epsilon = 0$ and thus $L^{(0)}(\lambda^0, f^0) = 0$. Due to Assumption 2.3 the $R^0$ non-zero eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ converge to positive constants as $N,T \to \infty$. This means that the "separating distance" of the $T - R^0$ zero-eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ converges to a positive constant, i.e. the next largest eigenvalue is well separated. This is exactly the technical condition under which the perturbation theory of linear operators guarantees that the above expansion of $\mathcal{L}^0_{NT}$ in $\epsilon$ exists and is convergent as long as the spectral norm of the perturbation $\sum_{k=0}^{K} \epsilon_k X_k/\sqrt{NT}$ is smaller than a particular convergence radius $r_0(\lambda^0, f^0)$, which is closely related to the separating distance of the zero-eigenvalues. For details on that see Kato (1980) and Appendix A.2, where we define $r_0(\lambda^0, f^0)$ and show that it converges to a positive constant as $N,T \to \infty$. Note that for the expansion (2.9) it is crucial that we have $R = R^0$, since the perturbation theory of linear operators describes the perturbation of the sum of all zero-eigenvalues of the unperturbed matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$.
For $R > R^0$ the sum in $\mathcal{L}^R_{NT}(\beta)$ leaves out the $R - R^0$ largest of these perturbed zero-eigenvalues, which results in a much more complicated mathematical problem, since the structure and ranking among these perturbed zero-eigenvalues needs to be discussed.

The above expansion of $\mathcal{L}^0_{NT}(\beta)$ is applicable whenever the operator norm of the perturbation matrix $\sum_{k=0}^{K} \epsilon_k X_k/\sqrt{NT}$ is smaller than $r_0(\lambda^0, f^0)$. Since our assumptions guarantee that $\|X_k/\sqrt{NT}\| = O_p(1)$, for $k = 0, \ldots, K$, and $\epsilon_0 = O_p(\min(N,T)^{-1/2}) = o_p(1)$, we have $\left\| \sum_{k=0}^{K} \epsilon_k X_k/\sqrt{NT} \right\| = O_p(\|\beta - \beta^0\|) + o_p(1)$, i.e. the above expansion is always applicable asymptotically within a shrinking neighborhood of $\beta^0$ — which is sufficient since we already know that $\hat\beta_R$ is consistent for $R \geq R^0$.

In addition to guaranteeing convergence of the series expansion, the perturbation theory of linear operators also provides explicit formulas for the expansion coefficients $L^{(g)}$, namely for $g = 1, 2, 3$ we have

$$L^{(1)}\left( \lambda^0, f^0, X_k \right) = 0, \qquad L^{(2)}\left( \lambda^0, f^0, X_{k_1}, X_{k_2} \right) = \operatorname{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X'_{k_2} \right),$$
$$L^{(3)}\left( \lambda^0, f^0, X_{k_1}, X_{k_2}, X_{k_3} \right) = -\frac{1}{3}\left[ \operatorname{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X'_{k_2} \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} X'_{k_3} \right) + \ldots \right],$$

where the dots refer to 5 additional terms obtained from the first one by permutation of $k_1$, $k_2$ and $k_3$, so that the expression becomes totally symmetric in these indices. A general expression for the coefficients for all orders in $g$ is given in Lemma A.1 in the appendix. One can show that for $g \geq 3$ the coefficients $L^{(g)}$ are bounded as follows

$$\frac{1}{NT}\left| L^{(g)}\left( \lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g} \right) \right| \leq a_{NT} (b_{NT})^g\, \frac{\|X_{k_1}\|}{\sqrt{NT}} \frac{\|X_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|X_{k_g}\|}{\sqrt{NT}}, \qquad (2.10)$$

where $a_{NT}$ and $b_{NT}$ are functions of $\lambda^0$ and $f^0$ that converge to finite positive constants in probability. This bound on the coefficients $L^{(g)}$ allows us to derive a bound on the remainder term when the profile likelihood expansion is truncated at a particular order.

The likelihood expansion can be applied under more general asymptotics, but here we only consider the limit $N,T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, i.e. $N$ and $T$ grow at the same rate. Then, the relevant coefficients of the expansion, which are not treated as part of the remainder term, are

$$\mathcal{L}^0_{NT}(\beta^0) = \frac{1}{NT} \sum_{g=2}^{\infty} \epsilon_0^g\, L^{(g)}\left( \lambda^0, f^0, X_0, X_0, \ldots, X_0 \right) = \frac{1}{NT} \sum_{g=2}^{\infty} L^{(g)}\left( \lambda^0, f^0, e, e, \ldots, e \right),$$
$$W_{k_1 k_2} = \frac{1}{NT}\, L^{(2)}\left( \lambda^0, f^0, X_{k_1}, X_{k_2} \right) = \frac{1}{NT} \operatorname{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X'_{k_2} \right),$$
$$C^{(1)}_k = \frac{1}{\sqrt{NT}}\, L^{(2)}\left( \lambda^0, f^0, X_k, e \right) = \frac{1}{\sqrt{NT}} \operatorname{Tr}\left( M_{\lambda^0} X_k M_{f^0} e' \right),$$
$$C^{(2)}_k = \frac{3}{2\sqrt{NT}}\, L^{(3)}\left( \lambda^0, f^0, X_k, e, e \right) = -\frac{1}{\sqrt{NT}}\Big[ \operatorname{Tr}\left( e M_{f^0} e' M_{\lambda^0} X_k f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \right) + \operatorname{Tr}\left( e' M_{\lambda^0} e M_{f^0} X'_k \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \right) + \operatorname{Tr}\left( e' M_{\lambda^0} X_k M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \right) \Big]. \qquad (2.11)$$

In the first line above we used the fact that $L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g})$ is linear in the arguments $X_{k_1}$ to $X_{k_g}$ and that $\epsilon_0 X_0 = e$. The $K \times K$ matrix $W$ with elements $W_{k_1 k_2}$ is the approximated Hessian of the profile likelihood function $\mathcal{L}^0_{NT}(\beta)$. The $K$-vectors $C^{(1)}$ and $C^{(2)}$ with elements $C^{(1)}_k$ and $C^{(2)}_k$ constitute the approximated score of $\mathcal{L}^0_{NT}(\beta)$.
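In a simulation, where $\lambda^0$, $f^0$ and $e$ are known by construction, the approximated Hessian $W$ and the leading score term $C^{(1)}$ in (2.11) can be evaluated directly. A sketch reusing the projector helper $M$ from the notation section (names ours); the three trace terms of $C^{(2)}$ can be coded in exactly the same way.

```python
def approx_hessian_and_score(lam0, f0, X, e):
    """W[k1,k2] = Tr(M_lam0 X_k1 M_f0 X_k2') / NT and
    C1[k] = Tr(M_lam0 X_k M_f0 e') / sqrt(NT), as in (2.11)."""
    N, T = e.shape
    Ml, Mf = M(lam0), M(f0)
    MXM = [Ml @ Xk @ Mf for Xk in X]                  # M_lam0 X_k M_f0
    W = np.array([[np.trace(A @ Xk2.T) for Xk2 in X] for A in MXM]) / (N * T)
    C1 = np.array([np.trace(A @ e.T) for A in MXM]) / np.sqrt(N * T)
    return W, C1
```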
From the expansion (2.9) and the bound (2.10) we obtain the following theorem, whose proof is provided in the appendix.

Theorem 2.1. Let Assumptions 2.1 and 2.3 be satisfied. Suppose that $N,T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$. Then we have

$$\mathcal{L}^0_{NT}(\beta) = \mathcal{L}^0_{NT}(\beta^0) - \frac{2}{\sqrt{NT}}\left( \beta - \beta^0 \right)' \left( C^{(1)} + C^{(2)} \right) + \left( \beta - \beta^0 \right)' W \left( \beta - \beta^0 \right) + \mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta),$$

where the remainder term $\mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta)$ satisfies for any sequence $c_{NT} \to 0$

$$\sup_{\{\beta:\, \|\beta-\beta^0\| \leq c_{NT}\}} \frac{\left| \mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p\left( \frac{1}{NT} \right).$$

Corollary 2.2. Let Assumptions 2.1, 2.2, and 2.3 be satisfied. Furthermore assume that $C^{(1)} = O_p(1)$. In the limit $N,T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have

$$\sqrt{NT}\left( \hat\beta_{R^0} - \beta^0 \right) = W^{-1}\left( C^{(1)} + C^{(2)} \right) + o_p(1) = O_p(1).$$

Since the estimator $\hat\beta_{R^0}$ minimizes $\mathcal{L}^0_{NT}(\beta)$ it must in particular satisfy $\mathcal{L}^0_{NT}(\hat\beta_{R^0}) \leq \mathcal{L}^0_{NT}\left( \beta^0 + W^{-1}(C^{(1)} + C^{(2)})/\sqrt{NT} \right)$. The corollary follows from applying Theorem 2.1 to this inequality and using the consistency of $\hat\beta_{R^0}$. Details are given in the appendix. Using Theorem 2.1, the corollary is also directly obtained from the results in Andrews (1999). Our assumptions already guarantee $C^{(2)} = O_p(1)$ and $W^{-1} = O_p(1)$, so that only $C^{(1)} = O_p(1)$ needs to be assumed explicitly in the corollary.

Corollary 2.2 allows us to replicate the result in Bai (2009b). Furthermore, the assumptions in the corollary do not restrict the regressors to be strictly exogenous, and the techniques developed here are applied in Moon and Weidner (2010a) to discuss pre-determined regressors in the linear factor regression model with $R = R^0$, in which case the score term $C^{(1)}$ contributes an additional incidental parameter bias to the asymptotic distribution of $\hat\beta_R$.

Remark. If we weaken Assumption 2.1(ii) to $\|e\| = o_p(N^{2/3})$, then Theorem 2.1 still continues to hold. If we assume that $C^{(2)} = O_p(1)$, then Corollary 2.2 also holds under this weaker condition on $\|e\|$.
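The content of Corollary 2.2 — and, once Corollary 2.5 below is established, the irrelevance of overestimating $R$ — can be checked in a small Monte Carlo of the kind reported in Section 2.5. The sketch below is entirely our own design, not the chapter's: one regressor correlated with the factor structure, $R^0 = 1$, and a bounded scalar minimization of the non-convex profile likelihood from the earlier snippet (a careful implementation would try several starting values).

```python
from scipy.optimize import minimize_scalar

def qmle_scalar(Y, X1, R):
    """Scalar-regressor QMLE: minimize the profile likelihood (2.4) over beta."""
    obj = lambda b: profile_likelihood([b], Y, [X1], R)
    return minimize_scalar(obj, bounds=(-2.0, 4.0), method='bounded').x

rng = np.random.default_rng(1)
N = T = 100
beta0, R0 = 1.0, 1
draws = {1: [], 3: []}
for _ in range(200):
    lam0 = rng.standard_normal((N, R0))
    f0 = rng.standard_normal((T, R0))
    X1 = lam0 @ f0.T + rng.standard_normal((N, T))   # regressor loads on the factor
    e = rng.standard_normal((N, T))
    Y = beta0 * X1 + lam0 @ f0.T + e
    for R in (1, 3):                                  # R = R0 and R > R0
        draws[R].append(np.sqrt(N * T) * (qmle_scalar(Y, X1, R) - beta0))
for R in (1, 3):
    # the two distributions should be nearly identical, per the main result
    print(R, np.mean(draws[R]), np.std(draws[R]))
```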
2.3.2 When $R > R^0$

We now extend the likelihood expansion to the case $R > R^0$. Let $\hat\lambda(\beta)$ and $\hat f(\beta)$ be the minimizing parameters in the first line of equation (2.4) for $R = R^0$. These are the first $R^0$ principal components of $(Y - \beta \cdot X)(Y - \beta \cdot X)'$ and $(Y - \beta \cdot X)'(Y - \beta \cdot X)$, respectively. For the corresponding orthogonal projectors we use the notation $M_{\hat\lambda}(\beta) \equiv M_{\hat\lambda(\beta)}$ and $M_{\hat f}(\beta) \equiv M_{\hat f(\beta)}$. For the residuals after taking out these first $R^0$ principal components we write $\hat e(\beta) \equiv Y - \beta \cdot X - \hat\lambda(\beta) \hat f'(\beta)$.

Analogous to the expansion of $\mathcal{L}^0_{NT}(\beta)$, the perturbation theory of linear operators also provides an expansion for $M_{\hat\lambda}(\beta)$, $M_{\hat f}(\beta)$ and $\hat e(\beta)$ in $(\beta - \beta^0)$ and $\|e\|$, i.e. in addition to describing the sum of the perturbed eigenvalues $\mathcal{L}^0_{NT}(\beta)$ it also describes the structure of the corresponding perturbed eigenvectors. For example, we have $\hat e(\beta) = M_{\lambda^0} e M_{f^0} - \sum_k (\beta_k - \beta^0_k) M_{\lambda^0} X_k M_{f^0} + \text{higher order terms}$. The details of these expansions are presented in Lemmas A.1 and A.2 in the appendix. These expansions are crucial when generalizing the likelihood expansion to $R > R^0$. Equation (2.6) can equivalently be written as

$$\mathcal{L}^R_{NT}(\beta) = \mathcal{L}^0_{NT}(\beta) - \frac{1}{NT} \sum_{t=1}^{R-R^0} \mu_t\left[ \hat e'(\beta)\, \hat e(\beta) \right]. \qquad (2.12)$$

Here we used that $\hat e'(\beta)\hat e(\beta)$ is the residual of $(Y - \beta \cdot X)'(Y - \beta \cdot X)$ after subtracting the first $R^0$ principal components, which implies that the eigenvalues of these two matrices are the same, except for the $R^0$ largest ones, which are missing in $\hat e'(\beta)\hat e(\beta)$. By applying the expansion of $\hat e(\beta)$ to this expression for $\mathcal{L}^R_{NT}(\beta)$ one obtains the following.

Theorem 2.3. Under Assumptions 2.1 and 2.3 and for $R > R^0$ we have

(i) $$\mathcal{L}^R_{NT}(\beta) = \mathcal{L}^0_{NT}(\beta) - \frac{1}{NT} \sum_{t=1}^{R-R^0} \mu_t\left[ A(\beta) \right] + \mathcal{L}^{R,\mathrm{rem},1}_{NT}(\beta),$$

where

$$A(\beta) = M_{f^0}\left[ e - (\beta - \beta^0) \cdot X \right]' M_{\lambda^0}\left[ e - (\beta - \beta^0) \cdot X \right] M_{f^0},$$

and for any constant $c > 0$

$$\sup_{\{\beta:\, \sqrt{N}\|\beta-\beta^0\| \leq c\}} \frac{\left| \mathcal{L}^{R,\mathrm{rem},1}_{NT}(\beta) \right|}{\sqrt{N} + \sqrt{NT}\, \|\beta - \beta^0\|} = O_p\left( \frac{1}{NT} \right).$$

(ii) $$\mathcal{L}^R_{NT}(\beta) = \mathcal{L}^0_{NT}(\beta) - \frac{1}{NT} \sum_{t=1}^{R-R^0} \mu_t\left[ B(\beta) + B'(\beta) \right] + \mathcal{L}^{R,\mathrm{rem},2}_{NT}(\beta),$$

where

$$B(\beta) = \frac{1}{2} A(\beta) - M_{f^0} e' M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} + M_{f^0}\left[ (\beta - \beta^0) \cdot X - e \right]' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} + M_{f^0} e' M_{\lambda^0}\left[ (\beta - \beta^0) \cdot X \right] f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} + M_{f^0} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\left[ (\beta - \beta^0) \cdot X \right] M_{f^0} + B^{(eeee)} + M_{f^0} B^{(\mathrm{rem},1)}(\beta) P_{f^0} + P_{f^0} B^{(\mathrm{rem},2)} P_{f^0},$$

and

$$B^{(eeee)} = - M_{f^0} e' M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} + M_{f^0} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} - \frac{1}{2} M_{f^0} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e M_{f^0} + \frac{1}{2} M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0}.$$

Here, $B^{(\mathrm{rem},1)}(\beta)$ and $B^{(\mathrm{rem},2)}$ are $T \times T$ matrices, $B^{(\mathrm{rem},2)}$ is independent of $\beta$ and satisfies $\|B^{(\mathrm{rem},2)}\| = O_p(1)$, and for any constant $c > 0$

$$\sup_{\{\beta:\, \sqrt{N}\|\beta-\beta^0\| \leq c\}} \frac{\left\| B^{(\mathrm{rem},1)}(\beta) \right\|}{1 + \sqrt{NT}\, \|\beta - \beta^0\|} = O_p(1), \qquad \sup_{\{\beta:\, \sqrt{N}\|\beta-\beta^0\| \leq c\}} \frac{\left| \mathcal{L}^{R,\mathrm{rem},2}_{NT}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p\left( \frac{1}{NT} \right).$$

Here, the remainder terms $\mathcal{L}^{R,\mathrm{rem},1}_{NT}(\beta)$ and $\mathcal{L}^{R,\mathrm{rem},2}_{NT}(\beta)$ stem from terms in $\hat e'(\beta)\hat e(\beta)$ whose spectral norm is smaller than $O_p(1)$ and $o_p(1)$, respectively, within a $\sqrt{N}$ shrinking neighborhood of $\beta^0$, after dividing by $\sqrt{N} + \sqrt{NT}\|\beta - \beta^0\|$ and $1 + \sqrt{NT}\|\beta - \beta^0\|$, respectively. Using Weyl's inequality those terms can be separated from the eigenvalues $\mu_t[\hat e'(\beta)\hat e(\beta)]$.

The expression for $B(\beta)$ looks complicated, in particular the terms in $B^{(eeee)}$. Note, however, that $B^{(eeee)}$ is $\beta$-independent and satisfies $\|B^{(eeee)}\| = O_p(1)$ under our assumptions, so that it is relatively easy to deal with these terms. Note furthermore that the structure of $B(\beta)$ is closely related to the expansion of $\mathcal{L}^0_{NT}(\beta)$, since by definition we have $\mathcal{L}^0_{NT}(\beta) = (NT)^{-1} \operatorname{Tr}(\hat e'(\beta)\hat e(\beta))$, which can be approximated by $(NT)^{-1} \operatorname{Tr}(B(\beta) + B'(\beta))$. Plugging the definition of $B(\beta)$ into $(NT)^{-1} \operatorname{Tr}(B(\beta) + B'(\beta))$ one indeed recovers the terms of the approximated Hessian and score provided by Theorem 2.1, which is a convenient consistency check. We do not give explicit formulas for $B^{(\mathrm{rem},1)}(\beta)$ and $B^{(\mathrm{rem},2)}$, because those terms enter $B(\beta)$ projected by $P_{f^0}$, which makes them orthogonal to the leading term $A(\beta)$, so that they can only have limited influence on the eigenvalues of $B(\beta) + B'(\beta)$. The bounds on the norms of $B^{(\mathrm{rem},1)}(\beta)$ and $B^{(\mathrm{rem},2)}$ provided in the theorem are sufficient for all conclusions on the properties of $\mu_t[B(\beta) + B'(\beta)]$ below. The proof of the theorem can be found in the appendix.

The first part of Theorem 2.3 is useful to show that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$. The purpose of the second part is to show that $\hat\beta_R$ has the same limiting distribution as $\hat\beta_{R^0}$. To actually obtain these two results one requires further conditions on the $\beta$-dependence of the largest few eigenvalues of $A(\beta)$ and $B(\beta) + B'(\beta)$.

Assumption 2.4. For all constants $c > 0$

$$\sup_{\{\beta:\, \sqrt{N}\|\beta-\beta^0\| \leq c\}} \frac{\left| \sum_{t=1}^{R-R^0}\left\{ \mu_t\left[ A(\beta) \right] - \mu_t\left[ A(\beta^0) \right] - \mu_t\left[ \tilde A(\beta) \right] \right\} \right|}{\sqrt{N} + N^{5/4}\|\beta - \beta^0\| + N^2 \|\beta - \beta^0\|^2/\log(N)} \leq O_p(1),$$

where $\tilde A(\beta) = M_{f^0}\left[ (\beta - \beta^0) \cdot X \right]' M_{\lambda^0}\left[ (\beta - \beta^0) \cdot X \right] M_{f^0}$.

Corollary 2.4. Let $R > R^0$, let Assumptions 2.1, 2.2, 2.3 and 2.4 be satisfied, and furthermore assume that $C^{(1)} = O_p(1)$. In the limit $N,T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have

$$N^{3/4}\left( \hat\beta_R - \beta^0 \right) = O_p(1).$$

The corollary follows from the inequality $\mathcal{L}^R_{NT}(\hat\beta_R) \leq \mathcal{L}^R_{NT}(\beta^0)$ by applying the first part of Theorem 2.3, Assumption 2.4, and our expansion of $\mathcal{L}^0_{NT}(\beta)$. The justification of Assumption 2.4 is discussed in the next section.
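The leading-term approximation in part (i) of Theorem 2.3 can also be inspected numerically: in a simulation, the $R - R^0$ largest eigenvalues of $\hat e'(\beta)\hat e(\beta)$ should track those of $A(\beta)$ for $\beta$ near $\beta^0$, up to the remainder bounded in the theorem. A sketch of $A(\beta)$, reusing the earlier helpers (names ours):

```python
def A_of_beta(beta, beta0, lam0, f0, X, e):
    """A(beta) = M_f0 [e-(beta-beta0).X]' M_lam0 [e-(beta-beta0).X] M_f0
    from Theorem 2.3(i); beta and beta0 are K-vectors."""
    Ml, Mf = M(lam0), M(f0)
    res = e - sum((b - b0) * Xk for b, b0, Xk in zip(beta, beta0, X))
    return Mf @ res.T @ Ml @ res @ Mf

# Comparison: with Lam_hat, F_hat = estimate_factors(beta, Y, X, R0) and
# e_hat = Y - beta.X - Lam_hat @ F_hat.T, the R - R0 largest eigenvalues
# of e_hat.T @ e_hat should be close to mu_t[A(beta)] near beta0.
```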
Knowing that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$ is a convenient intermediate result. It implies that we only have to study the properties of $\mathcal{L}^R_{NT}(\beta)$ within an $N^{3/4}$ shrinking neighborhood of $\beta^0$, which is reflected in the formulation of the following assumption.

Assumption 2.5. For all constants $c > 0$

$$\sup_{\{\beta:\, N^{3/4}\|\beta-\beta^0\| \leq c\}} \frac{\left| \sum_{t=1}^{R-R^0}\left\{ \mu_t\left[ B(\beta) + B'(\beta) \right] - \mu_t\left[ B(\beta^0) + B'(\beta^0) \right] \right\} \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p(1).$$

Combining the second part of Theorem 2.3, Assumption 2.5, and Theorem 2.1, we find that the profile likelihood for $R > R^0$ can be written as

$$\mathcal{L}^R_{NT}(\beta) = \mathcal{L}^R_{NT}(\beta^0) - \frac{2}{\sqrt{NT}}\left( \beta - \beta^0 \right)'\left( C^{(1)} + C^{(2)} \right) + \left( \beta - \beta^0 \right)' W \left( \beta - \beta^0 \right) + \mathcal{L}^{R,\mathrm{rem}}_{NT}(\beta),$$

with a remainder term that satisfies for all constants $c > 0$

$$\sup_{\{\beta:\, N^{3/4}\|\beta-\beta^0\| \leq c\}} \frac{\left| \mathcal{L}^{R,\mathrm{rem}}_{NT}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p\left( \frac{1}{NT} \right).$$

This result, together with the $N^{3/4}$-consistency of $\hat\beta_R$, gives rise to the following corollary.

Corollary 2.5. Let $R > R^0$, let Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5 be satisfied, and furthermore assume that $C^{(1)} = O_p(1)$. In the limit $N,T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have

$$\sqrt{NT}\left( \hat\beta_R - \beta^0 \right) = W^{-1}\left( C^{(1)} + C^{(2)} \right) + o_p(1) = O_p(1).$$

The proof of Corollary 2.5 is analogous to that of Corollary 2.2. The combination of both corollaries shows that our main result in equation (2.7) holds, i.e. the limiting distributions of $\hat\beta_R$ and $\hat\beta_{R^0}$ are indeed identical. What is left to do is to justify the high-level Assumptions 2.4 and 2.5.

2.4 Justification of Assumptions 2.4 and 2.5

We start with the justification of Assumption 2.4. We have $A(\beta) = A(\beta^0) + \tilde A(\beta) - A^{\mathrm{mixed}}(\beta)$, where $A^{\mathrm{mixed}}(\beta) = M_{f^0} e' M_{\lambda^0}\left[ (\beta - \beta^0) \cdot X \right] M_{f^0} + \text{the same term transposed}$. By applying Weyl's inequality [Footnote 9: Weyl's inequality says $\mu_m(G + H) \leq \mu_m(G) + \mu_1(H)$ for arbitrary symmetric $n \times n$ matrices $G$ and $H$ and $1 \leq m \leq n$. Here, we refer to a generalization of this, which reads $\sum_{t=1}^{m} \mu_t(G + H) \leq \sum_{t=1}^{m} \mu_t(G) + \sum_{t=1}^{m} \mu_t(H)$. These inequalities are standard results in linear algebra and are readily derived from the Courant-Fischer-Weyl min-max principle.] we thus find

$$\sum_{t=1}^{R-R^0}\left\{ \mu_t\left[ A(\beta) \right] - \mu_t\left[ A(\beta^0) \right] - \mu_t\left[ \tilde A(\beta) \right] \right\} \leq \sum_{t=1}^{R-R^0} \mu_t\left[ A^{\mathrm{mixed}}(\beta) \right] \leq 2(R - R^0)\, K\, \|e\|\, \|\beta - \beta^0\|\, \max_k \left\| M_{\lambda^0} X_k M_{f^0} \right\|. \qquad (2.13)$$

For asymptotics with $N$ and $T$ growing at the same rate, Assumption 2.1(ii) guarantees $\|e\| = O_p(\sqrt{N})$. Using this and inequality (2.13), we find that $\|M_{\lambda^0} X_k M_{f^0}\| = O_p(N^{3/4})$ is a sufficient condition for Assumption 2.4. This condition can be justified by assuming that $X_k = \Gamma_k f^{0\prime} + \tilde X_k$, where $\Gamma_k$ is an $N \times R^0$ matrix and $\tilde X_k$ is an $N \times T$ matrix with $\|\tilde X_k\| = O_p(N^{3/4})$, i.e. $X_k$ has an approximate factor structure with the same factors that enter into the equation for $Y$, and an idiosyncratic component $\tilde X_k$. Analogous to our discussion of Assumption 2.1(ii) we can obtain the bound on the norm of $\tilde X_k$ by assuming that its entries $\tilde X_{k,it}$ are mean zero, have bounded fourth moment, and are only weakly correlated across $i$ and $t$.

We have thus provided a way to justify Assumption 2.4 without imposing any additional condition on the error matrix $e$, but by restricting the data generating process for the regressors $X_k$. Alternatively, one can derive the statement in the assumption by imposing weaker restrictions on $X_k$, but making further assumptions on the error matrix $e$.
An example of this is provided by Theorem 2.1 below, where we only assume that $X_k = \bar X_k + \tilde X_k$, with $\mathrm{rank}(\bar X_k)$ bounded, but without assuming that $\bar X_k$ is generated by the factors $f^0$.

The discussion of Assumption 2.5 is more complicated. By Weyl's inequality we know that the absolute value of $\mu_t[B(\beta)+B'(\beta)] - \mu_t\big[B(\beta^0)+B'(\beta^0)\big]$ is bounded by the spectral norm of $B(\beta)+B'(\beta)-B(\beta^0)-B'(\beta^0)$, which is of order $O_p(N^{3/2})\|\beta-\beta^0\| + O_p(N^2)\|\beta-\beta^0\|^2$. This bound is obviously too crude to justify the assumption. What we need here is a bound that takes into account not only the spectral norm of the difference between $B(\beta)+B'(\beta)$ and $B(\beta^0)+B'(\beta^0)$, but also the structure of the eigenvectors of the various matrices involved.

The assumption only restricts the properties of $B(\beta)$ in an $N^{3/4}$ shrinking neighborhood of $\beta^0$. In this shrinking neighborhood the dominant term in $B(\beta)+B'(\beta)$ is $M_{f^0} e' M_{\lambda^0} e M_{f^0}$, since its spectral norm is of order $N$, while the spectral norm of the remaining terms, e.g. $A^{\mathrm{mixed}}(\beta)$ above, is at most of order $N^{3/4}$. Our goal is to show that the largest few eigenvalues of $B(\beta)+B'(\beta)$ differ only by $o_p(1)$ from those of the leading term $M_{f^0} e' M_{\lambda^0} e M_{f^0}$, within the shrinking neighborhood of $\beta^0$. To do so, we first need to introduce some notation.

Let $w_t \in \mathbb{R}^T$, $t=1,\dots,T-R^0$, be the normalized eigenvectors of $M_{f^0} e' M_{\lambda^0} e M_{f^0}$ with the constraint $f^{0\prime} w_t = 0$, and let $\rho_t$, $t=1,\dots,T-R^0$, be the corresponding eigenvalues. Let $v_i \in \mathbb{R}^N$, $i=1,\dots,N-R^0$, be the normalized eigenvectors of $M_{\lambda^0} e M_{f^0} e' M_{\lambda^0}$ with the constraint $\lambda^{0\prime} v_i = 0$, and let $\rho_i$, $i=1,\dots,N-R^0$, be the corresponding eigenvalues.[10] We assume that eigenvalues are sorted in decreasing order, i.e. $\rho_1 \ge \rho_2 \ge \dots$. Note that the eigenvalues $\rho_t$ and $\rho_i$ are identical for $t=i$. Let

$d^{(1)}_{NT} = \max_{i,t,k} \big|v_i'\, X_k\, w_t\big|, \qquad d^{(2)}_{NT} = \max_i \big\|v_i'\, e\, P_{f^0}\big\|, \qquad d^{(3)}_{NT} = \max_t \big\|w_t'\, e'\, P_{\lambda^0}\big\|,$
$d^{(4)}_{NT} = N^{-3/4} \max_{i,k} \big\|v_i'\, X_k\, P_{f^0}\big\|, \qquad d^{(5)}_{NT} = N^{-3/4} \max_{t,k} \big\|w_t'\, X_k'\, P_{\lambda^0}\big\|,$

where $i=1,\dots,N-R^0$, $t=1,\dots,T-R^0$ and $k=1,\dots,K$. Furthermore, define $d_{NT} = \max\big\{1,\; d^{(1)}_{NT},\; d^{(2)}_{NT},\; d^{(3)}_{NT},\; d^{(4)}_{NT},\; d^{(5)}_{NT}\big\}$.

Theorem 2.1. Let Assumptions 2.1 and 2.3 hold, let $R>R^0$, and consider a limit $N,T\to\infty$ with $N/T\to\kappa^2$, $0<\kappa<\infty$. Assume that $\rho_{R-R^0} > aN$, wpa1, for some constant $a>0$. Furthermore, let there exist a sequence of integers $q_{NT} > R-R^0$ such that

$d_{NT}\, q_{NT} = o_p\big(N^{1/4}\big), \qquad \text{and} \qquad \frac{1}{q_{NT}} \sum_{t=q_{NT}}^{T-R^0} \frac{1}{\rho_{R-R^0} - \rho_t} = O_p(1).$

Then, for all constants $c>0$ and $t=1,\dots,R-R^0$, we have

$\sup_{\{\beta:\,N^{3/4}\|\beta-\beta^0\|\le c\}} \Big|\mu_t\big[B(\beta)+B'(\beta)\big] - \rho_t\Big| = o_p(1),$

which implies that Assumption 2.5 is satisfied.

[10] For $T<N$ the vectors $v_i$, $i = T-R^0+1,\dots,N-R^0$, correspond to null-eigenvalues, and if there are multiple null-eigenvalues those $v_i$ are not uniquely defined. In that case we assume that those $v_i$ are drawn randomly from the Haar measure on the unit sphere of the corresponding null-eigenspace. For $T>N$ we assume the same for $w_t$, $t = N-R^0,\dots,T-R^0$. This specification avoids correlation between $X_k$ and those $v_i$ and $w_t$ being caused by a particular choice of the eigenvectors that correspond to degenerate null-eigenvalues.

We can now justify Assumption 2.5 by showing that the conditions of Theorem 2.1 are satisfied. The following discussion is largely heuristic.
Since $v_i$ and $w_t$ are the normalized eigenvectors of $M_{\lambda^0} e M_{f^0} e' M_{\lambda^0}$ and $M_{f^0} e' M_{\lambda^0} e M_{f^0}$, respectively, we expect them to be essentially uncorrelated with $X_k$ and $e P_{f^0}$, and therefore we expect $v_i' X_k w_t = O_p(1)$, $\|v_i' e P_{f^0}\| = O_p(1)$, and $\|w_t' e' P_{\lambda^0}\| = O_p(1)$. We also expect $\|v_i' X_k P_{f^0}\| = O_p(\sqrt{T})$ and $\|w_t' X_k' P_{\lambda^0}\| = O_p(\sqrt{N})$, which is different from the preceding terms involving $e$, since $X_k$ can be correlated with $f^0$ and $\lambda^0$. In the definition of $d_{NT}$ the maxima over these terms are taken over $i$ and $t$, so that we anticipate some weak dependence of $d_{NT}$ on $N$ (or equivalently $T$). Note that we need $d_{NT} = o_p(N^{1/4})$, since otherwise $q_{NT}$ does not exist.

The key to making this discussion rigorous, and to showing that indeed $d_{NT} = o_p(N^{1/4})$ or smaller, is a good knowledge of the properties of the eigenvectors $v_i$ and $w_t$. If the entries $e_{it}$ are iid normal, then the matrix of $v_i$'s and $w_t$'s is Haar-distributed (on the $N-R^0$ and $T-R^0$ dimensional subspaces spanned by $M_{\lambda^0}$ and $M_{f^0}$). In that case the formalization of the above discussion becomes relatively easy, and the result is summarized in Theorem 2.2 below.

The conjecture in the random matrix theory literature is that the limiting distribution of the eigenvectors of a random covariance matrix is "distribution free", i.e. is independent of the particular distribution of $e_{it}$ (see, e.g., Silverstein (1990), Bai (1999)). However, we are not aware of a formulation and corresponding proof of this conjecture that is sufficient for our purposes.

The second condition in Theorem 2.1 concerns the eigenvalues $\rho_t$ of the random covariance matrix $M_{f^0} e' M_{\lambda^0} e M_{f^0}$. Eigenvalues are studied more intensively than eigenvectors in the random matrix theory literature, and it is well known that the properly normalized empirical distribution of the eigenvalues (the so-called empirical spectral distribution) of an iid sample covariance matrix converges to the Marčenko-Pastur law (Marčenko and Pastur (1967)) for asymptotics where $N$ and $T$ grow at the same rate. This means that the sum over the eigenvalues $\rho_t$ in Theorem 2.1 asymptotically becomes an integral over the Marčenko-Pastur limiting spectral distribution.[11] To derive a bound on this sum, one furthermore needs to know the asymptotic properties of $\rho_{R-R^0}$. For random covariance matrices from iid normal errors, it is known from Johnstone (2001) and Soshnikov (2002) that the properly normalized few largest eigenvalues converge to the Tracy-Widom law.[12]

An additional subtlety in the discussion of the eigenvalues and eigenvectors of the random covariance matrix $M_{f^0} e' M_{\lambda^0} e M_{f^0}$ are the projections with $M_{f^0}$ and $M_{\lambda^0}$, which stem from integrating out the true factors and factor loadings of the model. Those projectors are not normally present in the literature on large dimensional random covariance matrices. If the idiosyncratic error distribution is iid normal, these projections are unproblematic, since the distribution of $e$ is rotationally invariant in that case, i.e. the projections are mathematically equivalent to a reduction of the sample size by $R^0$ in both directions.

Thus, if the $e_{it}$ are iid normal, then we can show that the conditions of Theorem 2.1 are satisfied, and we can therefore verify that the high-level assumptions of the last section hold. This result is summarized in the following theorem.

Theorem 2.2. Let $R>R^0$, let Assumption 2.3 hold, and consider a limit $N,T\to\infty$ with $N/T\to\kappa^2$, $0<\kappa<\infty$.
Furthermore, assume that

(i) For all $k=1,\dots,K$ we can decompose $X_k = \bar X_k + \tilde X_k$, such that

$\|\tilde X_k\| = O_p(N^{3/4}), \quad \|\tilde X_k\|_{HS} = O_p(\sqrt{NT}), \quad \|\bar X_k\| = O_p(\sqrt{NT}), \quad \mathrm{rank}(\bar X_k) \le Q_k,$

where $Q_k$ is independent of $N$ and $T$. For the $K\times K$ matrix $\tilde W$ defined by $\tilde W_{k_1 k_2} = \frac{1}{NT}\,\mathrm{Tr}\big(\tilde X_{k_1} \tilde X_{k_2}'\big)$ we assume that $\operatorname{plim}_{N,T\to\infty} \tilde W > 0$. In addition, we assume that $E\big|(M_{\lambda^0} X_k M_{f^0})_{it}\big|^{24+\epsilon}$, $E\big|(M_{\lambda^0} X_k)_{it}\big|^{6+\epsilon}$ and $E\big|(X_k M_{f^0})_{it}\big|^{6+\epsilon}$ are bounded uniformly across $i$, $t$, $N$ and $T$, for some $\epsilon>0$.

(ii) The error matrix $e$ is independent of $\lambda^0$, $f^0$, $\bar X_k$ and $\tilde X_k$, $k=1,\dots,K$, and its elements $e_{it}$ are distributed as iid $N(0,\sigma^2)$.

Then Assumptions 2.1, 2.2, 2.4 and 2.5 are satisfied, and we have $C^{(1)} = O_p(1)$. By Corollaries 2.2 and 2.5 we can therefore conclude that $\sqrt{NT}\big(\hat\beta_R - \beta^0\big) = \sqrt{NT}\big(\hat\beta_{R^0} - \beta^0\big) + o_p(1)$.

[11] To make this argument mathematically rigorous, one needs to know the convergence rate of the empirical spectral distribution to its limit law, which is an ongoing research subject in the literature, e.g. Bai (1993), Bai, Miao and Yao (2004), Götze and Tikhomirov (2010).

[12] To our knowledge this result is not established for error distributions that are not normal. Soshnikov (2002) has a result under non-normality, but only for asymptotics with $N/T\to 1$.

It seems quite challenging to extend Theorem 2.2 to non-iid-normal $e_{it}$, given the present status of the literature on eigenvalues and eigenvectors of large dimensional random covariance matrices, and we would like to leave this as a future research topic.

2.5 Monte Carlo Simulations

Here we consider a panel model with one regressor ($K=1$), two factors ($R^0=2$) and the following data generating process (DGP):

$Y_{it} = \beta^0 X_{it} + \sum_{r=1}^{2} \lambda_{ir} f_{tr} + e_{it}, \qquad X_{it} = 1 + \tilde X_{it} + \sum_{r=1}^{2} (\lambda_{ir} + \chi_{ir})\, f_{tr}, \qquad (2.14)$

where $i=1,\dots,N$ and $t=1,\dots,T$. The random variables $\tilde X_{it}$, $\lambda_{ir}$, $f_{tr}$, $\chi_{ir}$ and $e_{it}$ are mutually independent, $\tilde X_{it}$ is distributed as iid $N(1,1)$, and $\lambda_{ir}$, $f_{tr}$ and $\chi_{ir}$ are all distributed as iid $N(1,1)$. For $e_{it}$ we also assume that it is iid across $i$ and $t$, but we consider two different specifications for the marginal distribution, namely either $N(0,1)$ or a Student's t-distribution with 5 degrees of freedom. We choose $\beta^0 = 1$ and use 10,000 repetitions in our simulation. For each draw of $Y$ and $X$ we compute the QMLE $\hat\beta_R$ according to equation (2.3) for different values of $R$.

Table 2.1 reports the bias and standard error of $\hat\beta_R$ for sample sizes $N=T=50$ and $N=T=100$. For $R=0$ (OLS estimator) and $R=1$ we have $R<R^0$, i.e. fewer factors are used in the estimation than are present in the DGP. As a result, the QMLE is heavily biased for these values of $R$, since the factor structure in the DGP is correlated with the regressors but is not controlled for in the estimation. In contrast, for all values $R \ge R^0$ the bias of the QMLE is negligible compared to its standard error. Furthermore, the standard error remains almost constant as $R$ increases beyond $R=R^0$; concretely, from $R=2$ to $R=5$ it increases only by about 7% for $N=T=50$ and only by 5% for $N=T=100$.

                      N = T = 50                          N = T = 100
              e_it ~ N(0,1)     e_it ~ t(5)       e_it ~ N(0,1)     e_it ~ t(5)
              bias      std     bias      std     bias      std     bias      std
  R = 0    0.42741  0.02710  0.42788  0.02699  0.42806  0.01890  0.42813  0.01884
  R = 1    0.29566  0.05712  0.29633  0.05830  0.29597  0.03725  0.29541  0.03717
  R = 2    0.00047  0.02015  0.00175  0.02722  0.00005  0.00974  0.00057  0.01296
  R = 3    0.00046  0.02101  0.00139  0.02693  0.00007  0.00993  0.00062  0.01314
  R = 4    0.00051  0.02183  0.00140  0.02792  0.00010  0.01012  0.00062  0.01335
  R = 5    0.00042  0.02259  0.00137  0.02888  0.00011  0.01028  0.00061  0.01361

Table 2.1: Simulation results for the bias and standard error (std) of the QMLE $\hat\beta_R$. Notes: For different values of $R$ we list the results for two different sample sizes $N$ and $T$, and the two different specifications for $e_{it}$. The data generating process is described in the main text; in particular, the true number of factors here is $R^0=2$. We used 10,000 repetitions in the simulation.
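To make the Monte Carlo design concrete, the following sketch (our illustration; all function and variable names are ours) generates one draw from the DGP in (2.14) and computes $\hat\beta_R$ by minimizing the profile quasi-likelihood, written as the sum of the $T-R$ smallest eigenvalues of the residual Gram matrix (the same eigenvalue representation used in equation (3.8) of Chapter 3). For $R \ge 2$ the estimate should be close to $\beta^0 = 1$, while for $R < 2$ it reproduces the large biases in Table 2.1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
N, T, R0, beta0 = 100, 100, 2, 1.0

# one draw from the DGP in equation (2.14)
lam = rng.normal(1.0, 1.0, (N, R0))      # lambda_ir ~ iid N(1,1)
f   = rng.normal(1.0, 1.0, (T, R0))      # f_tr      ~ iid N(1,1)
chi = rng.normal(1.0, 1.0, (N, R0))      # chi_ir    ~ iid N(1,1)
Xt  = rng.normal(1.0, 1.0, (N, T))       # X~_it     ~ iid N(1,1)
e   = rng.standard_normal((N, T))        # e_it      ~ iid N(0,1)
X   = 1.0 + Xt + (lam + chi) @ f.T       # regressor, correlated with the factors
Y   = beta0 * X + lam @ f.T + e

def profile_objective(beta, R):
    """(NT)^{-1} times the sum of the T-R smallest eigenvalues of
    (Y - beta X)'(Y - beta X), i.e. the quasi-likelihood with the
    factors and loadings concentrated out."""
    res = Y - beta * X
    eigs = np.sort(np.linalg.eigvalsh(res.T @ res))   # ascending order
    return eigs[: T - R].sum() / (N * T)

for R in range(6):
    fit = minimize_scalar(profile_objective, args=(R,),
                          bounds=(-2.0, 4.0), method="bounded")
    print(f"R = {R}: beta_hat = {fit.x: .4f}")
```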
Table 2.2 reports quantiles of the appropriately normalized QMLE for $R \ge R^0$ and $N=T=100$. One finds that the quantiles remain almost constant as $R$ increases. In particular, the differences in the quantiles across different values of $R$ are small relative to the spread of the quantiles themselves, so that the size of a test statistic based on $\hat\beta_R$ is essentially independent of the choice of $R \ge R^0$.

                           Quantiles of $\sqrt{NT}(\hat\beta_R - \beta^0)$
  e_it ~            2.5%     5%     10%    25%    50%    75%    90%    95%   97.5%
  N(0,1)   R = 2  -1.903 -1.598 -1.239 -0.643  0.008  0.663  1.240  1.616  1.916
           R = 3  -1.977 -1.625 -1.253 -0.650  0.011  0.658  1.276  1.650  1.952
           R = 4  -1.998 -1.664 -1.275 -0.666  0.016  0.682  1.296  1.694  1.992
           R = 5  -2.041 -1.672 -1.284 -0.682  0.019  0.698  1.328  1.723  2.000
  t(5)     R = 2  -2.537 -2.095 -1.614 -0.807  0.072  0.935  1.716  2.188  2.573
           R = 3  -2.550 -2.116 -1.642 -0.817  0.071  0.946  1.757  2.206  2.626
           R = 4  -2.592 -2.147 -1.653 -0.829  0.067  0.961  1.796  2.259  2.664
           R = 5  -2.652 -2.181 -1.688 -0.854  0.071  0.972  1.805  2.296  2.720

Table 2.2: Simulation results for the quantiles of $\sqrt{NT}(\hat\beta_R - \beta^0)$. Notes: We list the results for $N=T=100$, the two different specifications of $e_{it}$, different values of $R$, and the data generating process described in the main text with $R^0=2$. We used 10,000 repetitions in the simulation.

The findings of the Monte Carlo simulations described in the last two paragraphs hold just as well for the specification with normally distributed $e_{it}$ as for the specification where $e_{it}$ has a Student's t-distribution. From this finding one may conjecture that Theorem 2.2 also holds for more general error distributions.

2.6 Conclusions

In this chapter we showed that under certain regularity conditions the limiting distribution of the QMLE of a linear panel regression with interactive fixed effects does not change when we include redundant factors in the estimation. The important empirical implication of this result is that one can use an upper bound on the number of factors in the estimation without asymptotic efficiency loss. For inference on the regression coefficients one thus need not worry about consistent estimation of the number of factors in the model. As regularity conditions we mostly impose high-level assumptions, and we verify that these hold under iid normal errors. Our simulation results suggest that normality of the error distribution is not necessary. Along the lines of the arguments presented in Section 2.4, we expect that progress in the literature on large dimensional random covariance matrices will allow verification of our high-level assumptions under more general error distributions. This is a vital and interesting topic for future research.

Chapter 3

Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects

3.1 Introduction

The Berry, Levinsohn and Pakes (1995) (hereafter BLP) demand model, based on the random coefficients logit multinomial choice model, has become the workhorse of demand modelling in empirical industrial organization and antitrust analysis.
An important virtue of this model is that it parsimoniously and flexibly captures substitution possibilities between the products in a market. At the same time, the nested simulated GMM procedure proposed by BLP accommodates possible endogeneity of the observed product-specific regressors, notably price. This model and estimation approach has proven very popular (e.g. Nevo (2001), Petrin (2002); surveyed in Ackerberg et al. (2007)).

Taking a cue from recent developments in panel data econometrics (e.g. Bai and Ng (2006), Bai (2009b), and Moon and Weidner (2010a; 2010b)), we extend the standard BLP demand model by adding interactive fixed effects to the unobserved product characteristic, which is the main "structural error" in the BLP model. This interactive fixed effect specification combines market (or time) specific fixed effects with product specific fixed effects in a multiplicative form, which is often referred to as a factor structure.

Our factor-based approach extends the baseline BLP model in two ways. First, we offer an alternative to the usual moment-based GMM approach. The interactive fixed effects "soak up" some important channels of endogeneity, which may obviate the need for instrumental variables for endogenous regressors such as price. This is important, as such instruments may not be easy to identify in practice. Second, even if endogeneity persists in the presence of the interactive fixed effects, the instruments only need to be exogenous with respect to the residual part of the unobserved product characteristics, which is not explained by the interactive fixed effect. This may expand the set of variables which may be used as instruments.

Our model represents one of the first applications of fixed effect factor modelling in panel data, which heretofore has mainly been considered in a linear setting, to nonlinear models. Relative to the existing factor literature (for instance, Bai (2009b), and Moon and Weidner (2010a; 2010b)), our model poses some estimation challenges. The usual principal components approach is inadequate due to the nonlinearity of the model and the potential endogeneity of the regressors. The GMM estimation approach of BLP cannot be used due to the presence of the interactive fixed effects. Hence, we propose an alternative estimator which we call the Least Squares-Minimum Distance (LS-MD) estimator. The new estimator is calculated in two steps. The first step consists of a least squares fit, which includes the interactive fixed effects and the instrumental variables as regressors. The second step minimizes the norm of the estimated IV coefficients from the first step. We show that our estimator is consistent and derive the limit distribution under asymptotics where both the number of products and the number of markets go to infinity. In practice, the estimator is simple and straightforward to compute, and shows good small-sample performance in our Monte Carlo simulations.

Our work complements some recent papers in which alternative estimation approaches for, and extensions of, the standard random coefficients logit model have been proposed, including Villas-Boas and Winer (1999), Knittel and Metaxoglou (2008), Dube, Fox and Su (2008), Harding and Hausman (2007), Bajari, Fox, Kim and Ryan (2008b), and Gandhi, Kim and Petrin (2010).

We implement our estimator on a dataset of market shares for automobiles, inspired by the exercise in BLP. This application illustrates that our estimator is easy to compute in practice.
Significantly, we find that, once factors are included in the specification, the results assuming that price is exogenous or endogenous are quite similar, suggesting that the factors are indeed capturing much of the unobservable product and time effects leading to price endogeneity. Moreover, including the interactive fixed effects leads to estimates which are both more precise and (for the most part) larger in magnitude than estimates obtained from models without factors, and imply larger (in absolute value) price elasticities than the standard model.

The chapter is organized as follows. Section 2 introduces the model. In Section 3 we discuss the LS-MD estimation method. Consistency and asymptotic normality are discussed in Section 4. Section 5 contains Monte Carlo simulation results, and Section 6 discusses the empirical example. Section 7 concludes. In appendix B.1 we discuss the advantages of our estimation approach over a more standard GMM approach in the presence of factors. The rest of the appendix lists our assumptions for the asymptotics and provides all technical derivations and proofs of the results in the main text.

Notation

We write $A'$ for the transpose of a matrix or vector $A$. For column vectors $v$ the Euclidean norm is defined by $\|v\| = \sqrt{v'v}$. For the $n$-th largest eigenvalue (counting multiple eigenvalues multiple times) of a symmetric matrix $B$ we write $\mu_n(B)$. For an $m\times n$ matrix $A$ the Frobenius norm is $\|A\|_F = \sqrt{\mathrm{Tr}(AA')}$, and the spectral norm is $\|A\| = \max_{0\ne v\in\mathbb{R}^n} \frac{\|Av\|}{\|v\|}$, or equivalently $\|A\| = \sqrt{\mu_1(A'A)}$. Furthermore, we use $P_A = A(A'A)^{-1}A'$ and $M_A = \mathbb{1}_m - A(A'A)^{-1}A'$, where $\mathbb{1}_m$ is the $m\times m$ identity matrix, and $(A'A)^{-1}$ denotes some generalized inverse if $A$ is not of full column rank. The vectorization of an $m\times n$ matrix $A$ is denoted $\mathrm{vec}(A)$, which is the $mn\times 1$ vector obtained by stacking the columns of $A$. For square matrices $B$, $C$, we use $B>C$ (or $B\ge C$) to indicate that $B-C$ is positive (semi) definite. We use $\nabla$ for the gradient of a function, i.e. $\nabla f(x)$ is the vector of partial derivatives of $f$ with respect to each component of $x$. We use "wpa1" for "with probability approaching one".

3.2 Model

The random coefficients logit demand model is an aggregate market-level model, formulated at the individual consumer level. Consumer $i$'s utility of product $j$ in market[1] $t$ is given by

$u_{ijt} = \delta^0_{jt} + \epsilon_{ijt} + X_{jt}' v_i, \qquad (3.1)$

where $\epsilon_{ijt}$ is an idiosyncratic product preference and $v_i = (v_{i1},\dots,v_{iK})'$ is an idiosyncratic characteristic preference. The mean utility is defined as

$\delta^0_{jt} = X_{jt}'\, \beta^0 + \xi^0_{jt}, \qquad (3.2)$

where $X_{jt} = (X_{1,jt},\dots,X_{K,jt})'$ is a vector of $K$ observed product characteristics (including price), and $\beta^0 = \big(\beta^0_1,\dots,\beta^0_K\big)'$ is the corresponding vector of coefficients. Following BLP, $\xi^0_{jt}$ denotes unobserved product characteristics of product $j$, which can vary across markets $t$. This is a "structural error", in that it is observed by all consumers when they make their decisions, but is unobserved by the econometrician. In this chapter, we focus on the case where these unobserved product characteristics vary across products and markets according to a factor structure:

$\xi^0_{jt} = \lambda^{0\prime}_j f^0_t + e^0_{jt}, \qquad (3.3)$

where $\lambda^0_j = \big(\lambda^0_{1j},\dots,\lambda^0_{Rj}\big)'$ is a vector of factor loadings corresponding to the $R$ factors $f^0_t = \big(f^0_{1t},\dots,f^0_{Rt}\big)'$, and $e^0_{jt}$ is a random component.

[1] The $t$ subscript can also denote different time periods.
Here $\lambda^{0\prime}_j f^0_t$ represents the interactive fixed effects, in that both the factors $f^0_t$ and the factor loadings $\lambda^0_j$ are unobserved to the econometrician, and can be correlated arbitrarily with the observed product characteristics $X_{jt}$. We assume that the number of factors $R$ is known.[2] The superscript zero indicates the true parameters, and objects evaluated at the true parameters.

The factor structure in equation (3.3) approximates reasonably some unobserved product and market characteristics of interest in an interactive form. For example, television advertising is well known to be composed of a product-specific component as well as an annual cyclical component (peaking during the winter and summer months).[3] The factors and factor loadings can also explain strong correlation of the observed market shares over both products and markets, which is a stylized fact in many industries that has motivated some recent dynamic oligopoly models of industry evolution (e.g. Besanko and Doraszelski (2004)). The standard BLP estimation approach, based on moment conditions, allows for weak correlation across markets and products, but does not admit strong correlation due to shocks that affect all products and markets simultaneously, which we model by the factor structure.

To begin with, we assume that the regressors $X_{jt}$ are exogenous with respect to the errors $e^0_{jt}$, i.e. $X_{jt}$ and $e^0_{jt}$ are uncorrelated for given $j$, $t$. This assumption, however, is only made for ease of exposition, and in both section 3.3.1 below and the empirical application we consider the more general case where regressors (such as price) may be endogenous. Notwithstanding, regressors which are strictly exogenous with respect to $e^0_{jt}$ can still be endogenous with respect to the $\xi^0_{jt}$, due to correlation with the factors and factor loadings.[4] When the index $t$ refers to time (or otherwise possesses some natural ordering), then sequential exogeneity is allowed throughout the whole chapter, i.e. $X_{jt}$ can be correlated with past values of the errors $e^0_{jt}$. The errors $e^0_{jt}$ are assumed to be independent across $j$ and $t$, but heteroscedasticity is allowed.

We assume that the distributions of $\epsilon = (\epsilon_{ijt})$ and $v = (v_i)$ are mutually independent, and are also independent of $X = (X_{jt})$ and $\xi^0 = (\xi^0_{jt})$. We also assume that $\epsilon_{ijt}$ follows a marginal type I extreme value distribution iid across $i$ and $j$ (but not necessarily independent across $t$). For given preferences $v_i$ and $\delta_t = (\delta_{1t},\dots,\delta_{Jt})$, the probability that agent $i$ chooses product $j$ in market $t$ then takes the multinomial logit form:

$\pi_{jt}(\delta_t, X_t, v_i) = \frac{\exp\big(\delta_{jt} + X_{jt}' v_i\big)}{1 + \sum_{l=1}^J \exp\big(\delta_{lt} + X_{lt}' v_i\big)}. \qquad (3.4)$

We do not observe individual specific choices, but market shares of the $J$ products in the $T$ markets. The market share of product $j$ in market $t$ is given by

$s_{jt}(\alpha^0, \delta_t, X_t) = \int \pi_{jt}(\delta_t, X_t, v)\, dG_{\alpha^0}(v), \qquad (3.5)$

where $G_{\alpha^0}(v)$ is the known distribution of consumer tastes $v_i$ over the product characteristics, and $\alpha^0$ is an $L\times 1$ vector of parameters of this distribution.

[2] Known $R$ is also assumed in Bai (2009) and Moon and Weidner (2009) for the linear regression model with interactive fixed effects. Allowing for $R$ to be unknown presents a substantial technical challenge even for the linear model, and therefore goes beyond the scope of the present chapter. In pure factor models consistent inference procedures on the number of factors are known, e.g. Bai and Ng (2002), Onatski (2005), and Harding (2007).

[3] Cf. TV Dimensions (1997).

[4] An example of the case where price $p_{jt}$ is an endogenous regressor with respect to the common factors but exogenous with respect to the error is as follows. Suppose that $t$ denotes time. If $p_{jt}$ is determined as a function of past unobserved product characteristics and some additional exogenous (wrt $e^0_{jt}$) variables $Z_{jt}$, i.e. $p_{jt} = p(Z_{jt}, \xi^0_{j,t-1}, \xi^0_{j,t-2}, \dots)$, then $p_{jt}$ is endogenous wrt $\xi^0_{jt}$ (because $\xi^0_{jt}$ is correlated across time), but sequentially exogenous with respect to $e^0_{jt}$. In this example the price endogeneity is completely captured by the factor structure, and no instrument for price is required in our estimation procedure.
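Equations (3.4) and (3.5) translate directly into simulated market shares: draw tastes $v$ from $G_\alpha$ and average the logit probabilities. The sketch below (our illustration, for the special case $K=1$ with a single normal random coefficient, as used later in the Monte Carlo section) is one minimal way to evaluate $s_{jt}$ for a single market.

```python
import numpy as np

def market_shares(delta_t, x_t, alpha, n_draws=10_000, seed=0):
    """Simulated version of eq. (3.5): average the multinomial logit
    probabilities (3.4) over draws v ~ N(0, alpha^2).
    delta_t, x_t : length-J vectors for one market (K = 1)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, alpha, n_draws)                  # consumer tastes
    u = delta_t[:, None] + x_t[:, None] * v[None, :]     # J x n_draws utilities
    m = np.maximum(u.max(axis=0), 0.0)                   # stabilizer (outside good has u = 0)
    eu = np.exp(u - m[None, :])
    probs = eu / (np.exp(-m) + eu.sum(axis=0))[None, :]  # eq. (3.4), incl. outside option
    return probs.mean(axis=1)                            # integrate over dG_alpha(v)

# example: five products in one market
shares = market_shares(np.linspace(-1.0, 1.0, 5), np.linspace(0.5, 2.0, 5), alpha=1.0)
print(shares, shares.sum())   # inside shares sum to less than one
```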
The most often used specification in this literature is to assume that the random coefficients are jointly multivariate normally distributed, corresponding to the assumption that $v \sim N(0, \Sigma^0)$, where $\Sigma^0$ is a $K\times K$ matrix of parameters, which can be subject to constraints (e.g. only one or a few regressors may have random coefficients, in which case the components of $\Sigma^0$ are only non-zero for these regressors), and $\alpha^0$ consists of the independent parameters in $\Sigma^0$.[5] The distribution $dG_{\alpha^0}(v)$ could have a finite number of support points, but in any case we assume a continuum of agents $i$ (at each support point, if their number is finite), in order to have a deterministic interpretation of the above market shares.

The observables in this model are the market shares $s^0_{jt}$ and the regressors $X_{jt}$.[6] In addition, we need $M$ instruments $Z_{jt} = (Z_{1,jt},\dots,Z_{M,jt})'$ in order to estimate the parameters $\alpha$, with $M \ge L$. These additional instruments can, for example, be chosen to be non-linear transformations of the $X_{k,jt}$. Note that these additional instruments are also needed in the usual BLP estimation procedure, even in the absence of the factor structure. There, even if all the $X$'s were exogenous with respect to $\xi^0_{jt}$, instruments analogous to the $Z$'s in our model would still be required to identify the covariance parameters in the random coefficients distribution.

The unknown parameters are $\alpha^0$, $\beta^0$, $\lambda^0$, and $f^0$. Of these, the important parameters to estimate are $\beta^0$ and $\alpha^0$, in terms of which we can calculate the ultimate objects of interest (such as price elasticities). The factors and factor loadings $\lambda^0$ and $f^0$ are not directly of interest, and are treated as nuisance parameters.

The existing literature on demand estimation usually considers asymptotics with either $J$ large and $T$ fixed, or $T$ large and $J$ fixed. Under these standard asymptotics, the estimation of the nuisance parameters $\lambda^0$ and $f^0$ creates a Neyman and Scott (1948) incidental parameter problem: because the number of nuisance parameters grows with the sample size, the estimators for the parameters of interest become inconsistent. Following some recent panel data literature, e.g. Hahn and Kuersteiner (2002; 2004) and Hahn and Newey (2004), we handle this problem by considering asymptotics where both $J$ and $T$ become large. Under this alternative asymptotic, the incidental parameter problem is transformed into the issue of asymptotic bias in the limiting distribution of the estimators of the parameters of interest. This asymptotic bias can be characterized and corrected for. Our Monte Carlo simulations suggest that the alternative asymptotics provides a good approximation of the properties of our estimator at finite sample sizes, as long as $J$ and $T$ are moderately large.

[5] We focus in this chapter on the case where the functional form of the distribution function $G_\alpha$ is known by the researcher. Recent papers have addressed estimation when this is not known; e.g. Bajari, Fox, Kim and Ryan (2008b), (2008a).

[6] In the present chapter we assume that the true market shares $s_{jt} = s_{jt}(\delta^0_t)$ are observed. Berry, Linton and Pakes (2004) explicitly consider sampling error in the observed market shares in their asymptotic theory. Here, we abstract away from this additional complication and focus on the econometric issues introduced by the factor structure in $\xi^0$.
3.3 Estimation

Following BLP, one can assume (or, under appropriate assumptions on $G_\alpha$ and $X_{jt}$, one can show) that equation (3.5) is invertible, i.e. for each market $t$ the mean utilities $\delta_t = (\delta_{1t},\dots,\delta_{Jt})$ are unique functions of $\alpha$, the market shares $s_t = (s_{1t},\dots,s_{Jt})$, and the regressors $X_t = (X_{1t},\dots,X_{Jt})$.[7] We denote these functions by $\delta_{jt}(\alpha, s_t, X_t)$. We have

$\delta^0_{jt} = \delta_{jt}(\alpha^0, s_t, X_t) = \sum_{k=1}^K \beta^0_k X_{k,jt} + \sum_{r=1}^R \lambda^0_{jr} f^0_{tr} + e^0_{jt}. \qquad (3.6)$

If $\delta^0_{jt}$ is known, then the above model reduces to the linear panel regression model with interactive fixed effects. Estimation of this model was discussed under fixed $T$ asymptotics in e.g. Holtz-Eakin, Newey and Rosen (1988) and Ahn, Lee, Schmidt (2001), and for $J,T\to\infty$ asymptotics in Bai (2009b), and Moon and Weidner (2010a; 2010b).

In the previous section we introduced the random coefficients logit model, which provides one specification for the market shares as a function of mean utilities. Our analysis extends to other specifications as long as the relation between market shares and mean utilities is invertible, i.e. $\delta_{jt} = \delta_{jt}(\alpha^0, s_t, X_t)$ is well-defined, and the assumptions below are satisfied.

[7] Gandhi (2008) shows this result under general conditions, and Berry and Haile (2009) and Chiappori and Komunjer (2009) utilize this inverse mapping in their nonparametric identification results.

The computational challenge in estimating the model (3.6) lies in accommodating both the model parameters $(\alpha,\beta)$, which in the existing literature has mainly been done in a GMM framework, as well as the nuisance elements $\lambda_j$, $f_t$, which in the existing literature have been treated using a principal components decomposition in a least-squares context (e.g., Bai (2009b), and Moon and Weidner (2010a; 2010b)). Our estimation procedure combines both the GMM approach to demand estimation and the least squares (or QMLE) approach to the interactive fixed effect model. Our least squares-minimum distance (LS-MD) estimators for $\alpha$ and $\beta$ are defined by:

Step 1 (least squares): for given $\alpha$, let $\delta(\alpha) = \delta(\alpha, s, X)$ and
$\big(\tilde\beta_\alpha,\, \tilde\gamma_\alpha,\, \tilde\lambda_\alpha,\, \tilde f_\alpha\big) = \operatorname*{argmin}_{\{\beta\in\mathcal{B}_\beta,\,\gamma,\,\lambda,\,f\}} \left\|\delta(\alpha) - \sum_{k=1}^K \beta_k X_k - \sum_{m=1}^M \gamma_m Z_m - \lambda f'\right\|_F^2,$

Step 2 (minimum distance):
$\hat\alpha = \operatorname*{argmin}_{\alpha\in\mathcal{B}_\alpha}\; \tilde\gamma_\alpha'\, W_{JT}\, \tilde\gamma_\alpha,$

Step 3 (least squares): let $\delta(\hat\alpha) = \delta(\hat\alpha, s, X)$ and
$\big(\hat\beta,\, \hat\lambda,\, \hat f\big) = \operatorname*{argmin}_{\{\beta\in\mathcal{B}_\beta,\,\lambda,\,f\}} \left\|\delta(\hat\alpha) - \sum_{k=1}^K \beta_k X_k - \lambda f'\right\|_F^2. \qquad (3.7)$

Here, $\delta(\alpha, s, X)$, $X_k$ and $Z_m$ are $J\times T$ matrices, $\lambda$ is $J\times R$ and $f$ is $T\times R$, $W_{JT}$ is a positive definite $M\times M$ weight matrix, and $\mathcal{B}_\alpha \subset \mathbb{R}^L$ and $\mathcal{B}_\beta \subset \mathbb{R}^K$ are parameter sets for $\alpha$ and $\beta$. In step 1, we include the IVs $Z_m$ as auxiliary regressors, with coefficients $\gamma\in\mathbb{R}^M$. Step 2 is based on imposing the exclusion restriction on the IVs, which requires that $\gamma=0$ at the true value of $\alpha$. Thus, we first estimate $\beta$, $\lambda$, $f$, and the instrument coefficients $\gamma$ by least squares for fixed $\alpha$, and subsequently we estimate $\alpha$ by minimizing the norm of $\tilde\gamma_\alpha$ with respect to $\alpha$.
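Step 1 presupposes that $\delta(\alpha, s, X)$ can be computed. The text does not prescribe a particular algorithm for this inversion; one standard choice, sketched here purely for illustration, is the BLP contraction mapping $\delta \mapsto \delta + \log s^{obs} - \log s(\delta)$, applied market by market (market_shares is the hypothetical helper from the sketch above).

```python
import numpy as np

def invert_shares(s_obs, x_t, alpha, tol=1e-12, max_iter=5_000):
    """Solve s(delta_t) = s_obs for the mean utilities delta_t of one
    market via the standard BLP contraction mapping."""
    delta = np.log(s_obs) - np.log(1.0 - s_obs.sum())     # logit starting values
    for _ in range(max_iter):
        step = np.log(s_obs) - np.log(market_shares(delta, x_t, alpha))
        delta = delta + step
        if np.abs(step).max() < tol:
            break
    return delta
```

Stacking invert_shares over the $T$ markets then yields the $J\times T$ matrix $\delta(\alpha, s, X)$ that enters step 1 of (3.7).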
Step 3 in (3.7), which defines $\hat\beta$, is just a repetition of step 1, but with $\alpha=\hat\alpha$ and $\gamma=0$. One could also use the step 1 estimator $\tilde\beta_{\hat\alpha}$ to estimate $\beta$. Under the assumptions for consistency of $(\hat\alpha,\hat\beta)$ presented below, this alternative estimator is also consistent for $\beta^0$. However, in general $\tilde\beta_{\hat\alpha}$ has a larger variance than $\hat\beta$, since irrelevant regressors are included in the estimation of $\tilde\beta_{\hat\alpha}$.

For given $\alpha$, $\beta$ and $\gamma$, the optimal factors and factor loadings in the least squares problems in step 1 (and step 3) of (3.7) turn out to be the principal components estimators for $\lambda$ and $f$. These incidental parameters can therefore be concentrated out easily, and the remaining objective function for $\beta$ and $\gamma$ turns out to be given by an eigenvalue problem (see e.g. Chapter 2 and Moon and Weidner (2010a; 2010b) for details), namely

$\big(\tilde\beta_\alpha,\, \tilde\gamma_\alpha\big) = \operatorname*{argmin}_{\{\beta,\gamma\}} \sum_{t=R+1}^{T} \mu_t\!\left[\left(\delta(\alpha) - \sum_{k=1}^K \beta_k X_k - \sum_{m=1}^M \gamma_m Z_m\right)'\left(\delta(\alpha) - \sum_{k=1}^K \beta_k X_k - \sum_{m=1}^M \gamma_m Z_m\right)\right]. \qquad (3.8)$

This formulation greatly simplifies the numerical calculation of the estimator, since eigenvalues are easy and fast to compute, and we only need to perform numerical optimization over $\beta$ and $\gamma$, not over $\lambda$ and $f$.

The step 1 optimization problem in (3.7) has the same structure as the interactive fixed effect regression model. Thus, for $\alpha=\alpha^0$ it is known from Bai (2009b), and Moon and Weidner (2010a; 2010b) that (under their assumptions) $\hat\beta_{\alpha^0}$ is $\sqrt{JT}$-consistent for $\beta^0$ and asymptotically normal as $J,T\to\infty$ with $J/T\to\kappa^2$, $0<\kappa<\infty$.

The LS-MD estimator we propose above is distinctive because of the inclusion of the instruments $Z$ as regressors in the first step. This can be understood as a generalization of an estimation approach for a linear regression model with endogenous regressors. Consider a simple structural equation $y_1 = Y_2\alpha + e$, where the endogenous regressor $Y_2$ takes the reduced form specification $Y_2 = Z\delta + V$, and $e$ and $V$ are correlated. The two stage least squares estimator of $\alpha$ is $\hat\alpha_{2SLS} = (Y_2' P_Z Y_2)^{-1} Y_2' P_Z y_1$, where $P_Z = Z(Z'Z)^{-1}Z'$. In this setup, it is possible to show that $\hat\alpha_{2SLS}$ is also an LS-MD estimator with a suitable choice of the weight matrix. Namely, in the first step the OLS regression of $(y_1 - Y_2\alpha)$ on the regressors $Z$ yields the OLS estimator $\tilde\gamma_\alpha = (Z'Z)^{-1}Z'(y_1 - Y_2\alpha)$. Then, in the second step, minimizing the distance $\tilde\gamma_\alpha' W \tilde\gamma_\alpha$ with respect to $\alpha$ gives $\hat\alpha(W) = \big[Y_2' Z(Z'Z)^{-1} W (Z'Z)^{-1} Z' Y_2\big]^{-1}\big[Y_2' Z(Z'Z)^{-1} W (Z'Z)^{-1} Z' y_1\big]$. Choosing $W = Z'Z$ thus results in $\hat\alpha = \hat\alpha(Z'Z) = \hat\alpha_{2SLS}$. Obviously, for our nonlinear model, strict 2SLS is not applicable; however, our estimation approach can be considered a generalization of this alternative iterative estimator, in which the exogenous instruments $Z$ are included as "extra" regressors in the initial least-squares step.[8]

[8] Recently, Chernozhukov and Hansen (2006) used a similar two stage estimation method for a class of instrumental quantile regressions.

Moreover, the presence of the factors makes it difficult to use the moment condition-based GMM approach proposed by BLP. Specifically, we know of no way to handle the factors and factor loadings in a GMM moment condition setting such that the resulting estimator for $\alpha$ and $\beta$ is consistent. In appendix B.1 we consider an alternative GMM estimator in which, rather than including the instruments $Z$ as "extra" regressors in the first step, we estimate all the structural parameters of interest $(\alpha,\beta)$ by using GMM on the implied moment conditions of the model, after obtaining estimates of the factors $\lambda$ and $f$ via a preliminary principal components step.
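For completeness, here is a minimal sketch of the concentrated step 1 objective (3.8) above (our code and naming; delta is the $J\times T$ matrix $\delta(\alpha)$, and Xs and Zs are lists of $J\times T$ regressor and instrument matrices). Step 2 then simply minimizes $\tilde\gamma_\alpha' W_{JT} \tilde\gamma_\alpha$ over $\alpha$ in an outer loop.

```python
import numpy as np
from scipy.optimize import minimize

def step1_objective(theta, delta, Xs, Zs, R):
    """Eq. (3.8): sum of the T-R smallest eigenvalues of the residuals'
    Gram matrix, i.e. the least squares objective with the factor term
    lambda f' concentrated out by principal components."""
    K = len(Xs)
    beta, gamma = theta[:K], theta[K:]
    res = delta - sum(b * X for b, X in zip(beta, Xs)) \
                - sum(g * Z for g, Z in zip(gamma, Zs))
    T = delta.shape[1]
    return np.sort(np.linalg.eigvalsh(res.T @ res))[: T - R].sum()

def step1(delta, Xs, Zs, R):
    """Return (beta_tilde, gamma_tilde) of step 1 for a given alpha."""
    K, M = len(Xs), len(Zs)
    fit = minimize(step1_objective, np.zeros(K + M),
                   args=(delta, Xs, Zs, R), method="Nelder-Mead")
    return fit.x[:K], fit.x[K:]
```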
We show that in the absence of factors ($R=0$) our LS-MD estimator is equivalent to the GMM estimator for an appropriate choice of weight matrix, but in the presence of factors the two estimators can be different, and the GMM estimator may not be consistent as $J,T\to\infty$.

3.3.1 Extension: regressor endogeneity with respect to $e_{jt}$

So far, we have assumed that the regressors $X$ can be endogenous only through the factors $\lambda_j' f_t$, and that they are exogenous wrt $e^0$. However, this could be restrictive in some applications, e.g. when price $p_{jt}$ is determined by $\xi_{jt}$ contemporaneously. Hence, we consider here the possibility that the regressors $X$ could also be correlated with $e^0$. This is readily accommodated within our framework. Let $X^{end} \subset X$ denote the endogenous regressors, with $\dim(X^{end}) = K_2$. (Hence, the number of exogenous regressors equals $K-K_2$.) Similarly, let $\beta^{end}$ denote the coefficients on these regressors, while $\beta$ continues to denote the coefficients on the exogenous regressors. Correspondingly, we assume that $M$, the number of instruments, exceeds $L+K_2$. Then we define the following estimator, which is a generalized version of our previous estimator:

Step 1: for given $\alpha^{end} = (\alpha, \beta^{end})$, let $\delta(\alpha) = \delta(\alpha, s, X)$ and
$\big(\tilde\beta_{\alpha^{end}},\, \tilde\gamma_{\alpha^{end}},\, \tilde\lambda_{\alpha^{end}},\, \tilde f_{\alpha^{end}}\big) = \operatorname*{argmin}_{\{\beta\in\mathcal{B}_\beta,\,\gamma,\,\lambda,\,f\}} \left\|\delta(\alpha) - \sum_{k=1}^{K_2} \beta^{end}_k X^{end}_k - \sum_{k=K_2+1}^{K} \beta_k X_k - \sum_{m=1}^M \gamma_m Z_m - \lambda f'\right\|_F^2,$

Step 2:
$\hat\alpha^{end} = (\hat\alpha, \hat\beta^{end}) = \operatorname*{argmin}_{\alpha^{end}\in\mathcal{B}_\alpha\times\mathcal{B}^{end}_\beta}\; \tilde\gamma_{\alpha^{end}}'\, W_{JT}\, \tilde\gamma_{\alpha^{end}},$

Step 3: let $\delta(\hat\alpha) = \delta(\hat\alpha, s, X)$ and
$\big(\hat\beta,\, \hat\lambda,\, \hat f\big) = \operatorname*{argmin}_{\{\beta\in\mathcal{B}_\beta,\,\lambda,\,f\}} \left\|\delta(\hat\alpha) - \sum_{k=1}^{K_2} \hat\beta^{end}_k X^{end}_k - \sum_{k=K_2+1}^{K} \beta_k X_k - \lambda f'\right\|_F^2, \qquad (3.9)$

where $\mathcal{B}_\alpha$, $\mathcal{B}_\beta$ and $\mathcal{B}^{end}_\beta$ are the parameter sets for $\alpha$, $\beta$ and $\beta^{end}$. The difference between this estimator and the previous one, for which all the regressors were assumed exogenous, is that the estimation of $\beta^{end}$, the coefficients on the endogenous regressors $X^{end}$, has been moved to the second step. The structure of the estimation procedure in (3.9) is exactly equivalent to that of our original LS-MD estimator (3.7), only that $\alpha$ is replaced by $\alpha^{end}$, and $\delta(\alpha)$ is replaced by $\delta(\alpha) - \sum_{k=1}^{K_2} \beta^{end}_k X^{end}_k$. Thus, all results below on the consistency, asymptotic distribution and bias correction of the LS-MD estimator (3.7) with only sequentially exogenous regressors directly generalize to the estimator (3.9) with more general endogenous regressors. Given this discussion, we see that the original BLP (1995) model can be considered a special case of our model in which the factors are absent (i.e. $R=0$).

3.4 Consistency and Asymptotic Distribution of $\hat\alpha$ and $\hat\beta$

In this section we present our results on the properties of the LS-MD estimators $\hat\alpha$ and $\hat\beta$ defined in (3.7) under the asymptotics $J,T\to\infty$. We define the $JT\times K$ matrix $x^f$, the $JT\times M$ matrix $z^f$, and the $JT\times 1$ vector $d(\alpha)$ by

$x^f_{\cdot,k} = \mathrm{vec}\big(X_k M_{f^0}\big), \qquad z^f_{\cdot,m} = \mathrm{vec}\big(Z_m M_{f^0}\big), \qquad d(\alpha) = \mathrm{vec}\big(\delta(\alpha) - \delta(\alpha^0)\big), \qquad (3.10)$

where $k=1,\dots,K$ and $m=1,\dots,M$. The key assumption that guarantees consistency of the LS-MD estimator is the following.

Assumption 3.1. There exists a constant $c>0$ such that for all $\alpha\in\mathcal{B}_\alpha$ we have

$\frac{1}{JT}\left[d'(\alpha)\, P_{(x^f,\, z^f)}\, d(\alpha) - \max_\lambda\; d'(\alpha)\, P_{(x^f,\, M_{f^0}\otimes\lambda)}\, d(\alpha)\right] \ge c\,\|\alpha-\alpha^0\|^2, \quad \text{wpa1},$

where we maximize over all $J\times R$ matrices $\lambda$.

This is a relevancy assumption on the instruments $Z$. It requires that the combination of $X_k M_{f^0}$ and $Z_m M_{f^0}$ has more explanatory power for $\delta(\alpha)-\delta(\alpha^0)$ than the combination of $X_k M_{f^0}$ and any $J\times R$ matrix $\lambda$, uniformly over $\alpha$.
The appearance of the projector $M_{f^0}$ is not surprising in view of the interactive fixed effect specification in (3.6). For example, if $R=1$ and $f^0_t = 1$ for all $t$, then multiplying with $M_{f^0}$ is equivalent to subtracting the time-mean for each cross-sectional unit, which is the standard procedure in a model with only individual fixed effects.[9] The appearance of $\lambda$ in Assumption 3.1 requires that, loosely speaking, the instruments must be more relevant for $\delta(\alpha)-\delta(\alpha^0)$ than any $\lambda$. Although $\lambda$ can be chosen arbitrarily, it is time-invariant; it follows, then, that the instruments $Z_m$ cannot all be chosen time-invariant without violating Assumption 3.1.[10]

[9] Remember, however, that both $f^0$ and $\lambda^0$ are unobserved in our model.

[10] Time-invariant instruments are also ruled out by Assumption B.4 in the appendix, which requires instruments to be "high-rank".

The matrix-valued function $\delta(\alpha) = \delta(\alpha, s, X)$ was introduced as the inverse of equation (3.5) for the market shares $s_{jt}(\delta_t)$. Thus, once a functional form for $s_{jt}(\delta_t)$ is chosen and some distributional assumptions on the data generating process are made, it should in principle be possible to analyze Assumption 3.1 further and to discuss validity and optimality of the instruments. Unfortunately, too little is known about the properties of $\delta(\alpha)$ to make such a further analysis feasible at the present time. Even if no endogeneity in the regressors is present, and even if $R=0$, it is still difficult to prove that a given set of instruments satisfies our relevancy Assumption 3.1. This, however, is not only a feature of our approach, but is also true for BLP, and for Berry, Linton and Pakes (2004). If some regressors, in particular price, are treated as endogenous (wrt $e^0$), the discussion of relevance (and exogeneity) of the instruments becomes even more complicated.

The remaining assumptions B.1 to B.4, which are referred to in the following consistency theorem, are presented in appendix B.2. These additional assumptions are only slight modifications of the ones used in Bai (2009b) and Moon and Weidner (2010a; 2010b) for the linear models with interactive fixed effects, and we refer to those papers for a more detailed discussion. The main contribution of the present chapter is the generalization of the factor analysis to non-linear random coefficient discrete-choice demand models, and Assumption 3.1 is the key assumption needed for this generalization. Note also that Assumption B.2 (in the appendix) requires $\frac{1}{JT}\,\mathrm{Tr}\big(e^0 Z_m'\big) = o_p(1)$, i.e. exogeneity of the instruments with respect to $e^0$, but instruments can be correlated with the factors and factor loadings, and thus also with $\xi^0$.

Theorem 3.1. Let Assumptions 3.1, B.1, B.2, B.3 and B.4 be satisfied, let $\alpha^0\in\mathcal{B}_\alpha$ and $\beta^0\in\mathcal{B}_\beta$, and let $\mathcal{B}_\alpha\subset\mathbb{R}^L$ and $\mathcal{B}_\beta\subset\mathbb{R}^K$ be bounded. In the limit $J,T\to\infty$ we then have $\hat\alpha = \alpha^0 + o_p(1)$ and $\hat\beta = \beta^0 + o_p(1)$.

The proof of Theorem 3.1 is given in the appendix. Next, we present results on the limiting distribution of $\hat\alpha$ and $\hat\beta$. This requires some additional notation. We define the $JT\times K$ matrix $x^{\lambda f}$, the $JT\times M$ matrix $z^{\lambda f}$, and the $JT\times L$ matrix $g$ by

$x^{\lambda f}_{\cdot,k} = \mathrm{vec}\big(M_{\lambda^0} X_k M_{f^0}\big), \qquad z^{\lambda f}_{\cdot,m} = \mathrm{vec}\big(M_{\lambda^0} Z_m M_{f^0}\big), \qquad g_{\cdot,l} = -\mathrm{vec}\big(\nabla_l\, \delta(\alpha^0)\big), \qquad (3.11)$

where $k=1,\dots,K$, $m=1,\dots,M$, and $l=1,\dots,L$. Note that $x^{\lambda f} = (\mathbb{1}_T\otimes M_{\lambda^0})\,x^f$, $z^{\lambda f} = (\mathbb{1}_T\otimes M_{\lambda^0})\,z^f$, and $g$ is the vectorization of the gradient of $\delta(\alpha)$, evaluated at the true parameter.[11]
We introduce the $(L+K)\times(L+K)$ matrix $G$ and the $(K+M)\times(K+M)$ matrix $\Omega$ as follows:

$G = \operatorname*{plim}_{J,T\to\infty} \frac{1}{JT}\begin{pmatrix} g'\, x^{\lambda f} & g'\, z^{\lambda f} \\ x^{\lambda f\prime}\, x^{\lambda f} & x^{\lambda f\prime}\, z^{\lambda f} \end{pmatrix}, \qquad \Omega = \operatorname*{plim}_{J,T\to\infty} \frac{1}{JT}\,\big(x^{\lambda f},\, z^{\lambda f}\big)'\, \mathrm{diag}\big(\Sigma^{\mathrm{vec}}_e\big)\,\big(x^{\lambda f},\, z^{\lambda f}\big), \qquad (3.12)$

where $\Sigma^{\mathrm{vec}}_e = \mathrm{vec}\Big[\big(E\,(e^0_{jt})^2\big)_{j=1,\dots,J;\; t=1,\dots,T}\Big]$ is the $JT$-vector of vectorized variances of $e^0_{jt}$. Finally, we define the $(K+M)\times(K+M)$ weight matrix $W$ by

$W = \operatorname*{plim}_{J,T\to\infty}\left[\begin{pmatrix} \big(\tfrac{1}{JT}\, x^{\lambda f\prime} x^{\lambda f}\big)^{-1} & 0_{K\times M} \\ 0_{M\times K} & 0_{M\times M} \end{pmatrix} + \begin{pmatrix} -\big(x^{\lambda f\prime} x^{\lambda f}\big)^{-1} x^{\lambda f\prime} z^{\lambda f} \\ \mathbb{1}_M \end{pmatrix} \Big(\tfrac{1}{JT}\, z'\, M_{x^{\lambda f}}\, z\Big)^{-1} W_{JT} \Big(\tfrac{1}{JT}\, z'\, M_{x^{\lambda f}}\, z\Big)^{-1} \begin{pmatrix} -\big(x^{\lambda f\prime} x^{\lambda f}\big)^{-1} x^{\lambda f\prime} z^{\lambda f} \\ \mathbb{1}_M \end{pmatrix}'\,\right]. \qquad (3.13)$

Existence of these probability limits is imposed by Assumption B.5 in the appendix. Some further regularity conditions are necessary to derive the limiting distribution of our LS-MD estimator, and those are summarized in Assumption B.6 in the appendix. Assumption B.6 is again a straightforward generalization of the assumptions imposed by Moon and Weidner (2010a; 2010b) for the linear model, except for part (i) of the assumption, which demands that $\delta(\alpha)$ can be linearly approximated around $\alpha^0$ such that the Frobenius norm of the remainder term of the expansion is of order $o_p\big(\sqrt{JT}\,\|\alpha-\alpha^0\|\big)$ in any $\sqrt{J}$ shrinking neighborhood of $\alpha^0$.

[11] We do not necessarily require that all $\delta_{jt}(\alpha)$ are differentiable. All we need is that $J\times T$ matrices $\nabla_l\,\delta(\alpha)$, $l=1,\dots,L$, exist which satisfy Assumption B.6(i) in the appendix.

Theorem 3.2. Let Assumptions 3.1 to B.6 be satisfied, and let $\alpha^0$ and $\beta^0$ be interior points of $\mathcal{B}_\alpha$ and $\mathcal{B}_\beta$, respectively. In the limit $J,T\to\infty$ with $J/T\to\kappa^2$, $0<\kappa<\infty$, we then have

$\sqrt{JT}\begin{pmatrix} \hat\alpha-\alpha^0 \\ \hat\beta-\beta^0 \end{pmatrix} \;\to_d\; N\Big(\kappa B_0 + \kappa^{-1} B_1 + \kappa B_2,\;\; \big(GWG'\big)^{-1}\, GW\Omega WG'\, \big(GWG'\big)^{-1}\Big),$

with the formulas for $B_0$, $B_1$ and $B_2$ given in appendix B.2.1.

The proof of Theorem 3.2 is provided in the appendix. Analogous to the QMLE in the linear model with interactive fixed effects, there are three bias terms in the limiting distribution of the LS-MD estimator. The bias term $\kappa B_0$ is only present if regressors or instruments are pre-determined, i.e. if $X_{jt}$ or $Z_{jt}$ are correlated with $e_{j\tau}$ for $t>\tau$ (but not for $t=\tau$, since this would violate weak exogeneity). A reasonable interpretation of this bias term thus requires that the index $t$ refers to time, or has some other well-defined ordering. The other two bias terms, $\kappa^{-1}B_1$ and $\kappa B_2$, are due to heteroscedasticity of the idiosyncratic error $e^0_{jt}$ across firms $j$ and markets $t$, respectively. The first and last bias terms are proportional to $\kappa$, and thus are large when $T$ is small compared to $J$, while the second bias term is proportional to $\kappa^{-1}$, and thus is large when $T$ is large compared to $J$. Note that no asymptotic bias is present if regressors and instruments are strictly exogenous and the errors $e^0_{jt}$ are homoscedastic. There is also no asymptotic bias when $R=0$, since then there are no incidental parameters. For a more detailed discussion of the asymptotic bias, we again refer to Bai (2009b) and Moon and Weidner (2010a; 2010b).

While the structure of the asymptotic bias terms is analogous to the bias encountered in linear models with interactive fixed effects, we find that the structure of the asymptotic variance matrix for $\hat\alpha$ and $\hat\beta$ is analogous to the GMM variance matrix. According to the discussion in appendix B.1, the LS-MD estimator is equivalent to the GMM estimator if no factors are present. In that case the weight matrix $W$ that appears in Theorem 3.2 is indeed just the probability limit of the GMM weight matrix,[12] and our asymptotic variance matrix thus exactly coincides with the one for GMM.
If factors are present, there is no GMM analog of our estimator, but the only change in the structure of the asymptotic variance matrix is the appearance of the projectors $M_{f^0}$ and $M_{\lambda^0}$ in the formulas for $G$, $\Omega$ and $W$. The presence of these projectors implies that those components of $X_k$ and $Z_m$ which are proportional to $f^0$ and $\lambda^0$ do not contribute to the asymptotic variance, i.e. do not help in the estimation of $\hat\alpha$ and $\hat\beta$. This is again analogous to the standard fixed effect setup in panel data, where time-invariant components do not contribute to the identification of the regression coefficients.

[12] See also equation (B.4) in the appendix.

Using the explicit expressions for the asymptotic bias and variance of the LS-MD estimator, one can provide estimators for this asymptotic bias and variance. By replacing the true parameter values $(\alpha^0,\beta^0,\lambda^0,f^0)$ by the estimated parameters $(\hat\alpha,\hat\beta,\hat\lambda,\hat f)$, the error term $e^0$ by the residuals $\hat e$, and population values by sample values, it is easy to define estimators $\hat B_0$, $\hat B_1$, $\hat B_2$, $\hat G$, $\hat\Omega$ and $\widehat W$ for $B_0$, $B_1$, $B_2$, $G$, $\Omega$ and $W$. This is done explicitly in appendix B.2.4.

Theorem 3.3. Let the assumptions of Theorem 3.2 and Assumption B.7 be satisfied. In the limit $J,T\to\infty$ with $J/T\to\kappa^2$, $0<\kappa<\infty$, we then have $\hat B_1 = B_1 + o_p(1)$, $\hat B_2 = B_2 + o_p(1)$, $\hat G = G + o_p(1)$, $\hat\Omega = \Omega + o_p(1)$ and $\widehat W = W + o_p(1)$. If in addition the bandwidth parameter $h$, which enters in the definition of $\hat B_0$, satisfies $h\to\infty$ and $h^5/T\to 0$, then we also have $\hat B_0 = B_0 + o_p(1)$.

The proof is again given in the appendix. Theorem 3.3 motivates the introduction of the bias corrected estimator

$\begin{pmatrix} \hat\alpha^* \\ \hat\beta^* \end{pmatrix} = \begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix} - \frac{1}{T}\,\hat B_0 - \frac{1}{J}\,\hat B_1 - \frac{1}{T}\,\hat B_2. \qquad (3.14)$

Under the assumptions of Theorem 3.2 the bias corrected estimator is asymptotically unbiased, normally distributed, and has asymptotic variance $(GWG')^{-1}\, GW\Omega WG'\, (GWG')^{-1}$, which is consistently estimated by $\big(\hat G\widehat W\hat G'\big)^{-1}\, \hat G\widehat W\hat\Omega\widehat W\hat G'\, \big(\hat G\widehat W\hat G'\big)^{-1}$. These results allow inference on $\alpha^0$ and $\beta^0$.

From the standard GMM analysis it is known that the $(K+M)\times(K+M)$ weight matrix $W$ which minimizes the asymptotic variance is given by $W = c\,\Omega^{-1}$, where $c$ is an arbitrary scalar. If the errors $e^0_{jt}$ are homoscedastic with variance $\sigma^2_e$, we have $\Omega = \sigma^2_e\, \operatorname{plim}_{J,T\to\infty} \frac{1}{JT}\big(x^{\lambda f}, z^{\lambda f}\big)'\big(x^{\lambda f}, z^{\lambda f}\big)$, and in this case it is straightforward to show that the optimal $W = \sigma^2_e\,\Omega^{-1}$ is attained by choosing

$W_{JT} = \frac{1}{JT}\, z'\, M_{x^{\lambda f}}\, z. \qquad (3.15)$

Under heteroscedasticity of $e^0_{jt}$ there are in general not enough degrees of freedom in $W_{JT}$ to attain the optimal $W$. The reason for this is that we have chosen the first stage of our estimation procedure to be an ordinary least squares step, which is optimal under homoscedasticity but not under heteroscedasticity. By generalizing the first stage optimization to weighted least squares one would obtain the additional degrees of freedom to attain the optimal $W$ also under heteroscedasticity, but in the present chapter we will not consider this possibility further. In our Monte Carlo simulations and in the empirical application we always choose $W_{JT}$ according to equation (3.15). Under homoscedasticity this choice of weight matrix is optimal in the sense that it minimizes the asymptotic variance of our LS-MD estimator, but nothing is known about the efficiency bound in the presence of interactive fixed effects, i.e. a different alternative estimator could theoretically have even lower asymptotic variance.
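The weight matrix (3.15) is simple to construct once the projected regressors and instruments have been vectorized as in (3.10)-(3.11). A minimal sketch (ours; x_lf and z are assumed to be the $JT\times K$ and $JT\times M$ matrices obtained by stacking $\mathrm{vec}(M_{\hat\lambda} X_k M_{\hat f})$ and the corresponding instrument columns):

```python
import numpy as np

def weight_matrix(x_lf, z):
    """W_JT = (1/JT) z' M_{x^{lambda f}} z, as in eq. (3.15).
    x_lf : (J*T) x K matrix, z : (J*T) x M matrix."""
    JT = x_lf.shape[0]
    # apply the residual maker M_x = I - x (x'x)^{-1} x' without forming I
    Mz = z - x_lf @ np.linalg.solve(x_lf.T @ x_lf, x_lf.T @ z)
    return (z.T @ Mz) / JT
```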
3.5 Monte Carlo Results

For our Monte Carlo simulation, we assume that there is one factor ($R=1$), a constant regressor, and one additional regressor $X_{jt}$, and we consider the following data generating process for the mean utility and the regressor:

$\delta_{jt} = \beta^0_1 + \beta^0_2 X_{jt} + \lambda^0_j f^0_t + e^0_{jt}, \qquad X_{jt} = \tilde X_{jt} + \lambda^0_j + f^0_t + \lambda^0_j f^0_t, \qquad (3.16)$

where $\lambda^0_j$, $f^0_t$ and $\tilde X_{jt}$ are all independently and identically distributed across $j$ and $t$ as $N(0,1)$, and are also mutually independent. For the distribution of the error term $e^0_{jt}$ conditional on $X_{jt}$, $\lambda^0_j$ and $f^0_t$ we use two different specifications:

specification 1: $\quad e^0_{jt}\,\big|\,X,\lambda^0,f^0 \;\sim\; \text{iid } N(0,1),$
specification 2: $\quad e^0_{jt}\,\big|\,X,\lambda^0,f^0 \;\sim\; \text{iid } N\big(0,\sigma^2_{jt}\big), \qquad \sigma^2_{jt} = \frac{1}{1+\exp\big(-\lambda^0_{jt}\big)}. \qquad (3.17)$

In the first specification, the error is distributed independently of the regressor, factor and factor loading. In the second specification there is heteroscedasticity conditional on the factor loading, namely the variance $\sigma^2_{jt}$ of $e^0_{jt}$ is an increasing function of $\lambda^0_{jt}$, which is chosen nonlinearly such that $\sigma^2_{jt}\in(0,1)$. According to our asymptotic analysis, we expect this heterogeneity in $e^0_{jt}$ to result in a bias of the LS-MD estimator, which is accounted for in our bias corrected estimator. As values for the regression parameters we choose $\beta^0_1 = 0$ and $\beta^0_2 = 1$.

The market shares are computed from the mean utilities according to equations (3.4) and (3.5), where we assume a normally distributed random coefficient on the regressor $X_{jt}$, i.e. $v\sim N(0,\alpha^2)$, and we set $\alpha^0=1$. Although the regressors are strictly exogenous (wrt $e^0_{jt}$) in our simulation, we still need an instrument to identify $\alpha$, and we choose $Z_{jt} = X^2_{jt}$, the square of the regressor $X_{jt}$. Both $X_{jt}$ and $Z_{jt}$ are therefore endogenous with respect to the total unobserved error $\lambda^0_j f^0_t + e^0_{jt}$, since the factors and factor loadings enter into their distributions. Results of simulation runs with 1000 repetitions are reported in tables 3.1 and 3.2. The bias corrected estimators $\hat\alpha^*$ and $\hat\beta^*$, whose summary statistics are reported, are the ones defined in (3.14), but without inclusion of $\hat B_0$, since there is no pre-determined regressor in this setup.

  J,T     statistics   alpha_hat  beta2_hat  beta1_hat  alpha_hat*  beta2_hat*  beta1_hat*
  20,20   bias          -0.0009    -0.0097    -0.0036     0.0040     -0.0044     -0.0033
          std            0.2438     0.2269     0.0630     0.3326      0.3091      0.0659
          rmse           0.2437     0.2270     0.0631     0.3325      0.3090      0.0659
  50,50   bias          -0.0048    -0.0065    -0.0013    -0.0046     -0.0063     -0.0013
          std            0.0993     0.0960     0.0238     0.0990      0.0953      0.0239
          rmse           0.0993     0.0962     0.0239     0.0990      0.0955      0.0239
  80,80   bias          -0.0017    -0.0024    -0.0001    -0.0018     -0.0026     -0.0001
          std            0.0614     0.0597     0.0133     0.0612      0.0595      0.0133
          rmse           0.0614     0.0597     0.0133     0.0612      0.0595      0.0133

Table 3.1: Simulation results for specification 1 (no heteroscedasticity). Notes: We report the bias, standard errors (std), and square roots of the mean square errors (rmse) of the LS-MD estimator $(\hat\alpha,\hat\beta)$ and its bias corrected version $(\hat\alpha^*,\hat\beta^*)$. 1000 repetitions were used.

In specification 1 there is no heteroscedasticity in $e^0_{jt}$, and we thus expect no bias in the LS-MD estimator. The corresponding results in table 3.1 verify this expectation. We find that both the LS-MD estimator $(\hat\alpha,\hat\beta)$ and its bias corrected version $(\hat\alpha^*,\hat\beta^*)$ have biases that are very small compared to their standard errors, and are not statistically significant in our sample of 1000 simulation runs.
For $J=T=50$ and $J=T=80$ the standard errors of $(\hat\alpha,\hat\beta)$ and of $(\hat\alpha^*,\hat\beta^*)$ are almost identical, but for $J=T=20$ the standard errors for $(\hat\alpha^*,\hat\beta^*)$ are up to 37% larger than for $(\hat\alpha,\hat\beta)$ in specification 1, i.e. at these small values of $J$ and $T$ the bias correction adds some noise to the estimators and thus increases their standard errors.

  J,T     statistics   alpha_hat  beta2_hat  beta1_hat  alpha_hat*  beta2_hat*  beta1_hat*
  20,20   bias          -0.0326    -0.0460    -0.0004    -0.0040     -0.0277     -0.0036
          std            0.2068     0.1988     0.0475     0.5875      0.4523      0.0598
          rmse           0.2093     0.2039     0.0475     0.5872      0.4529      0.0599
  50,50   bias          -0.0207    -0.0240     0.0016    -0.0056     -0.0063      0.0006
          std            0.0831     0.0852     0.0166     0.1325      0.1313      0.0167
          rmse           0.0856     0.0885     0.0167     0.1325      0.1314      0.0167
  80,80   bias          -0.0116    -0.0138     0.0008    -0.0022     -0.0025      0.0001
          std            0.0452     0.0458     0.0096     0.0441      0.0444      0.0096
          rmse           0.0467     0.0478     0.0097     0.0441      0.0445      0.0096

Table 3.2: Simulation results for specification 2 (heteroscedasticity in $e^0_{jt}$). Notes: We report the bias, standard errors (std), and square roots of the mean square errors (rmse) of the LS-MD estimator $(\hat\alpha,\hat\beta)$ and its bias corrected version $(\hat\alpha^*,\hat\beta^*)$. 1000 repetitions were used.

In specification 2 there is heteroscedasticity in $e^0_{jt}$. Correspondingly, from table 3.2 we find that the LS-MD estimators $\hat\alpha$ and $\hat\beta_2$ have biases of magnitudes between 15% and 30% of their standard errors (the constant coefficient $\hat\beta_1$ is essentially unbiased). In contrast, the biases of the bias corrected estimators $\hat\alpha^*$ and $\hat\beta^*_2$ are much smaller, and are not statistically significant at the 5% level in our sample of 1000 simulation runs. This shows that our bias correction formula adequately corrects for the bias due to heteroscedasticity in $e^0_{jt}$. For $J=T=80$ the standard errors of $(\hat\alpha,\hat\beta)$ and $(\hat\alpha^*,\hat\beta^*)$ are almost identical (implying that the root mean square error of $(\hat\alpha^*,\hat\beta^*)$ is smaller, since its bias is smaller), which confirms our asymptotic result that the bias correction removes the bias but leaves the variance of the estimators unchanged as $J,T\to\infty$. However, as we already found for specification 1, in finite samples the bias correction also adds some noise to the estimators. For $J=T=50$ the standard errors of $(\hat\alpha^*,\hat\beta^*)$ are up to 60% larger, and for $J=T=20$ even up to 195% larger, than the standard errors of $(\hat\alpha,\hat\beta)$. Thus, in finite samples there is a trade-off between the bias and the variance of the estimators: $(\hat\alpha,\hat\beta)$ has smaller variance but larger bias, while $(\hat\alpha^*,\hat\beta^*)$ has smaller bias but larger variance. Depending on the sample size, it may thus be advantageous in empirical applications to ignore the bias due to heteroscedasticity of $e^0_{jt}$ and to simply use the LS-MD estimator without bias correction.

3.6 Empirical application: demand for new automobiles, 1973-1988

As an illustration of our procedure, we estimate an aggregate random coefficients logit model of demand for new automobiles, modelled after the analysis in BLP (1995). We compare specifications with and without factors, and with and without price endogeneity. Throughout, we allow for one normally-distributed random coefficient, attached to price.

For this empirical application, we use the same data as was used in BLP (1995), which are new automobile sales from 1971-1990.[13] However, our estimation procedure requires a balanced panel for the principal components step. Since there is substantial entry and exit of individual car models, we aggregate up to the manufacturer-size level, and assume that consumers choose between aggregate composites of cars.[14]
Furthermore, we also reduce our sample window to the sixteen years 1973-1988. In Table 3.5 we list the 23 car aggregates employed in our analysis, along with the across-year averages of the variables. Apart from the aggregation, our variables are the same as in BLP. Market share is given by total sales divided by the number of households in that year. The unit for price is $1000 of 1983/84 dollars. Our unit for "horsepower over weight" (hp/weight) is 100 times horsepower over pounds. "Miles per dollar" (mpd) is obtained from miles per gallon divided by the real price per gallon, and is measured in miles per 1983/84 dollar. Size is given by length times width, and is measured in $10^{-4}$ inch$^2$. The choice of units is rather arbitrary; we simply tried to avoid too small and too large decimal numbers.

[13] The data are available on the webpage of James Levinsohn.

[14] This resembles the treatment in Esteban and Shum's (2007) study of the new and used car markets.

We construct instruments using the idea of Berry (1994). The instruments for a particular aggregated model and year are given by the averages of hp/weight, mpd and size over all cars produced by different manufacturers in the same year.

Results. Table 3.3 contains estimation results from four specifications of the model. In specification A, prices are considered exogenous (wrt $e^0_{jt}$), but one factor is present, which captures some degree of price endogeneity (wrt $\xi_{jt}$). Specification B also contains one factor, but treats prices as endogenous, even conditional on the factor. Specification C corresponds to the BLP (1995) model, where prices are endogenous but no factor is present. Finally, in specification D, we treat prices as exogenous and do not allow for a factor. This final specification is clearly unrealistic, but is included for comparison with the other specifications. In table 3.3 we report the bias corrected LS-MD estimator (this only makes a difference for specifications A and B), which accounts for bias due to heteroscedasticity in the error terms and due to pre-determined regressors (we choose bandwidth $h=2$ in the construction of $\hat B_0$). The estimation results without bias correction are reported in table 3.4. It turns out that it makes little difference whether the LS-MD estimator or its bias corrected version is used. The t-values of the bias corrected estimators are somewhat larger, but apart from the constant, which is insignificant anyway, the bias correction changes neither the sign of the coefficients nor the conclusion of whether the coefficients are significant at the 5% level.

In specification A, most of the coefficients are precisely estimated. The price coefficient is -4.109, and the characteristics coefficients take the expected signs. The $\alpha$ parameter, corresponding to the standard deviation of the random coefficient on price, is estimated to be 2.092. These point estimates imply that roughly 97% of the time the random price coefficient is negative, which is as we should expect.

Compared to this baseline, specification B allows price to be endogenous (even conditional on the factor). The point estimates for this specification are virtually unchanged from those in specification A, except for the constant term.
Results. Table 3.3 contains estimation results from four specifications of the model. In specification A, prices are considered exogenous (with respect to e⁰_jt), but one factor is present, which captures some degree of price endogeneity (with respect to ξ_jt). Specification B also contains one factor, but treats prices as endogenous, even conditional on the factor. Specification C corresponds to the BLP (1995) model, where prices are endogenous but no factor is present. Finally, in specification D, we treat prices as exogenous and do not allow for a factor. This final specification is clearly unrealistic, but is included for comparison with the other specifications. In Table 3.3 we report the bias corrected LS-MD estimator (this only makes a difference for specifications A and B), which accounts for bias due to heteroscedasticity in the error terms and due to pre-determined regressors (we choose bandwidth h = 2 in the construction of B̂_0). The estimation results without bias correction are reported in Table 3.4. It turns out that it makes little difference whether the LS-MD estimator or its bias corrected version is used. The t-values of the bias corrected estimators are somewhat larger, but apart from the constant, which is insignificant anyway, the bias correction changes neither the sign of the coefficients nor the conclusion whether the coefficients are significant at the 5% level.

Specification:   A: R=1           B: R=1           C: R=0           D: R=0
                 exogenous p      endogenous p     endogenous p     exogenous p
price           -4.109 (-3.568)  -3.842 (-4.023)  -1.518 (-0.935)  -0.308 (-1.299)
hp/weight        0.368 (1.812)    0.283 (1.360)   -0.481 (-0.314)   0.510 (1.981)
mpd              0.088 (2.847)    0.117 (3.577)    0.157 (0.870)    0.030 (1.323)
size             5.448 (3.644)    5.404 (3.786)    0.446 (0.324)    1.154 (2.471)
α                2.092 (3.472)    2.089 (3.837)    0.894 (0.923)    0.171 (1.613)
const            3.758 (1.267)    0.217 (0.117)   -3.244 (-0.575)  -7.827 (-8.984)

Table 3.3: Parameter estimates (and t-values) for automobile demand estimation. Notes: We list results for four different model specifications (no factor, R = 0, vs. one factor, R = 1; exogenous price vs. endogenous price). α is the standard deviation of the random coefficient distribution (only price has a random coefficient), and the regressors are p (price), hp/weight (horse power per weight), mpd (miles per dollar), size (car length times car width), and a constant.

Specification:   A: R=1           B: R=1
                 exogenous p      endogenous p
price           -3.112 (-2.703)  -2.943 (-3.082)
hp/weight        0.340 (1.671)    0.248 (1.190)
mpd              0.102 (3.308)    0.119 (3.658)
size             4.568 (3.055)    4.505 (3.156)
α                1.613 (2.678)    1.633 (3.000)
const           -0.690 (-0.232)  -2.984 (-1.615)

Table 3.4: Parameter estimates (and t-values) for model specifications A and B. Notes: Here we report the LS-MD estimators without bias correction, while in Table 3.3 we report the bias corrected LS-MD estimators.

In Specification A, most of the coefficients are precisely estimated. The price coefficient is -4.109, and the characteristics coefficients take the expected signs. The α parameter, corresponding to the standard deviation of the random coefficient on price, is estimated to be 2.092. These point estimates imply that, roughly 97% of the time, the random price coefficient is negative, which is as we should expect.

Compared to this baseline, Specification B allows price to be endogenous (even conditional on the factor). The point estimates for this specification are virtually unchanged from those in Specification A, except for the constant term. Overall, the estimation results for specifications A and B are very similar, and show that once factors are taken into account it does not make much difference whether price is treated as exogenous or endogenous. This suggests that the factors indeed capture most of the price endogeneity in this application.

In contrast, the estimation results for specifications C and D, which are the two specifications without any factors, differ substantially from each other. The t-values for specification C are rather small (i.e. standard errors are large), so that the differences in the coefficient estimates between these two specifications are not actually statistically significant. However, the differences in the t-values themselves show that it makes a substantial difference for the no-factor estimation results whether price is treated as exogenous or endogenous.

Specifically, in Specification C, the key price coefficient and α are substantially smaller in magnitude; furthermore, the standard errors are large, so that none of the estimates are significant at usual significance levels. Moreover, the coefficient on hp/weight is negative, which is puzzling. In Specification D, which corresponds to a BLP model but without price endogeneity, we see that the price coefficient is reduced dramatically relative to the other specifications, down to -0.308. This may be attributed to the usual attenuation bias. Altogether, these estimates seem less satisfactory than the ones obtained for Specifications A and B, where factors were included.

Elasticities. The sizeable differences in the magnitudes of the price coefficients across the specifications with and without factors suggest that these models may imply economically meaningful differences in price elasticities. For this reason, we computed the matrices of own- and cross-price elasticities for Specifications B (in Table 3.6) and C (in Table 3.7). In both these matrices, the elasticities were computed using the data in 1988, the final year of our sample.
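For a random coefficients logit with a single normal coefficient on price, the individual choice probabilities s_ij imply the familiar share derivatives ∂s_j/∂p_k = E[b_i s_ij (1{j=k} − s_ik)], where b_i is the individual price coefficient. The following sketch (our own illustration, not the code used for Tables 3.6 and 3.7; the mean utilities delta, prices p, mean price coefficient beta_p and random-coefficient scale alpha are hypothetical inputs) computes the elasticity matrix by simulating over the random coefficient:

import numpy as np

def logit_elasticities(delta, p, beta_p, alpha, n_draws=10000, seed=0):
    """Own- and cross-price elasticity matrix for a random coefficients
    logit with one normal random coefficient on price.
    delta : (J,) mean utilities (already include beta_p * p)
    Returns a (J, J) matrix; entry (j, k) is d log s_j / d log p_k."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n_draws)                  # taste draws
    b = beta_p + alpha * v                            # individual price coefficients
    # individual utilities: add the taste deviation (b - beta_p) * p to delta
    u = delta[None, :] + (b[:, None] - beta_p) * p[None, :]
    ev = np.exp(u)
    s_i = ev / (1.0 + ev.sum(axis=1, keepdims=True))  # (n_draws, J) choice probs
    s = s_i.mean(axis=0)                              # market shares
    J = len(p)
    E = np.empty((J, J))
    for k in range(J):
        # d s_j / d p_k = E[ b * s_ij * (1{j == k} - s_ik) ]
        ds = (b[:, None] * s_i * ((np.arange(J) == k) - s_i[:, [k]])).mean(axis=0)
        E[:, k] = ds * p[k] / s
    return E

As in the tables below, rows of the returned matrix correspond to market shares and columns to the prices with respect to which the elasticities are taken.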
Comparing these two sets of elasticities, the most obvious difference is that the elasticities (both own- and cross-price) for Specification C, corresponding to the standard BLP model without factors, are substantially smaller (about one-half in magnitude) than the Specification B elasticities. For instance, reading down the first column of Table 3.6, we see that a one-percent increase in the price of a small Chevrolet car would result in a 28% reduction in its market share, but increase the market share for large Chevrolet cars by 1.5%. For the results in Table 3.7, however, this same one-percent price increase would reduce the market share for small Chevrolet cars by only 13%, and increase the market share for large Chevrolet cars by less than half a percent.

Product#  Make                    Size Class  Manuf.  Mkt Share %  Price    hp/weight  mpd      size
                                                      (avg)        (avg)    (avg)      (avg)    (avg)
1         CV (Chevrolet)          small       GM      1.39          6.8004  3.4812     20.8172  1.2560
2         CV                      large       GM      0.49          8.4843  3.5816     15.9629  1.5841
3         OD (Oldsmobile)         small       GM      0.25          7.6786  3.4789     19.1946  1.3334
4         OD                      large       GM      0.69          9.7551  3.6610     15.7762  1.5932
5         PT (Pontiac)            small       GM      0.46          7.2211  3.4751     19.3714  1.3219
6         PT                      large       GM      0.31          8.6504  3.5806     16.6192  1.5686
7         BK (Buick)              all         GM      0.84          9.2023  3.6234     16.9960  1.5049
8         CD (Cadillac)           all         GM      0.29         18.4098  3.8196     13.6894  1.5911
9         FD (Ford)               small       Ford    1.05          6.3448  3.4894     21.7885  1.2270
10        FD                      large       Ford    0.63          8.9530  3.4779     15.7585  1.6040
11        MC (Mercury)            small       Ford    0.19          6.5581  3.6141     22.2242  1.2599
12        MC                      large       Ford    0.32          9.2583  3.4610     15.9818  1.6053
13        LC (Lincoln)            all         Ford    0.16         18.8322  3.7309     13.6460  1.7390
14        PL (Plymouth)           small       Chry    0.31          6.2209  3.5620     22.7818  1.1981
15        PL                      large       Chry    0.17          7.7203  3.2334     15.4870  1.5743
16        DG (Dodge)              small       Chry    0.35          6.5219  3.6047     23.2592  1.2031
17        DG                      large       Chry    0.17          7.8581  3.2509     15.4847  1.5681
18        TY (Toyota)             all         Other   0.54          7.1355  3.7103     24.3294  1.0826
19        VW (Volkswagen)         all         Other   0.17          8.2388  3.5340     24.0027  1.0645
20        DT/NI (Datsun/Nissan)   all         Other   0.41          7.8120  4.0226     24.5849  1.0778
21        HD (Honda)              all         Other   0.41          6.7534  3.5442     26.8501  1.0012
22        SB (Subaru)             all         Other   0.10          5.9568  3.4718     25.9784  1.0155
23        REST                    all         Other   1.02         10.4572  3.6148     19.8136  1.2830

Table 3.5: Summary statistics for the 23 product-aggregates used in estimation.
       CVs    CVl    ODs    ODl    PTs    PTl    BK     CD     FDs    FDl    MCs    MCl    LC     PLs    PLl    DGs    DGl    TY     VW     DTNI   HD     SB     REST
CVs   -28.07   0.82   0.70   1.70   0.96   0.31   2.77   0.14   1.32   2.38   0.41   1.45   0.03   0.32   0.22   0.44   0.31   1.57   0.57   1.74   0.91   0.15   6.58
CVl     1.50 -34.54   0.72   2.02   0.79   0.21   3.27   0.73   0.97   3.54   0.37   2.15   0.16   0.21   0.21   0.30   0.30   1.21   0.40   1.62   0.71   0.10  10.17
ODs     1.29   0.72 -35.78   2.08   0.72   0.18   3.36   1.15   0.84   3.90   0.35   2.37   0.25   0.17   0.20   0.25   0.28   1.06   0.34   1.53   0.63   0.08  11.35
ODl     0.98   0.64   0.65 -35.80   0.59   0.13   3.37   2.09   0.64   4.34   0.30   2.63   0.45   0.12   0.17   0.18   0.25   0.84   0.25   1.36   0.51   0.06  12.86
PTs     1.76   0.80   0.72   1.90 -32.51   0.26   3.09   0.38   1.14   3.02   0.39   1.84   0.08   0.26   0.22   0.37   0.31   1.39   0.48   1.70   0.81   0.12   8.56
PTl     2.17   0.81   0.68   1.55   0.98 -26.85   2.53   0.06   1.40   1.97   0.41   1.21   0.01   0.35   0.22   0.48   0.31   1.65   0.61   1.72   0.94   0.16   5.37
BK      0.99   0.64   0.66   2.09   0.60   0.13 -34.47   2.04   0.65   4.33   0.30   2.62   0.44   0.12   0.18   0.18   0.25   0.84   0.25   1.36   0.51   0.06  12.81
CD      0.00   0.01   0.01   0.08   0.00   0.00   0.12  -6.97   0.00   0.36   0.00   0.21   3.67   0.00   0.00   0.00   0.00   0.00   0.00   0.02   0.00   0.00   1.19
FDs     2.03   0.82   0.71   1.71   0.95   0.31   2.79   0.15 -28.99   2.41   0.41   1.47   0.03   0.32   0.22   0.44   0.31   1.56   0.57   1.74   0.90   0.15   6.67
FDl     0.61   0.50   0.55   1.95   0.42   0.07   3.13   4.23   0.40 -34.69   0.23   2.80   0.90   0.06   0.14   0.10   0.20   0.56   0.15   1.07   0.34   0.04  14.05
MCs     1.57   0.77   0.72   1.99   0.81   0.22   3.24   0.63   1.02   3.41 -34.49   2.07   0.14   0.23   0.21   0.32   0.30   1.26   0.42   1.64   0.74   0.11   9.77
MCl     0.62   0.50   0.55   1.95   0.43   0.07   3.14   4.15   0.41   4.64   0.23 -36.50   0.88   0.06   0.14   0.11   0.20   0.56   0.15   1.08   0.35   0.04  14.03
LC      0.00   0.01   0.02   0.09   0.01   0.00   0.15  20.15   0.00   0.41   0.00   0.24 -23.81   0.00   0.00   0.00   0.00   0.00   0.00   0.02   0.00   0.00   1.39
PLs     2.21   0.79   0.64   1.40   0.98   0.34   2.29   0.03   1.42   1.65   0.40   1.01   0.01 -23.54   0.21   0.49   0.30   1.66   0.63   1.67   0.95   0.16   4.42
PLl     1.47   0.75   0.72   2.03   0.78   0.21   3.29   0.78   0.96   3.59   0.37   2.18   0.17   0.21 -35.26   0.30   0.29   1.19   0.39   1.61   0.70   0.10  10.33
DGs     2.17   0.81   0.68   1.55   0.98   0.33   2.54   0.06   1.40   1.99   0.41   1.22   0.01   0.35   0.22 -26.80   0.31   1.64   0.61   1.72   0.94   0.16   5.41
DGl     1.47   0.75   0.72   2.03   0.78   0.21   3.29   0.78   0.96   3.59   0.37   2.18   0.17   0.21   0.20   0.30 -35.18   1.19   0.39   1.61   0.70   0.10  10.33
TY      1.94   0.81   0.72   1.79   0.93   0.29   2.91   0.22   1.25   2.65   0.41   1.62   0.05   0.30   0.22   0.41   0.31 -30.16   0.54   1.73   0.87   0.14   7.41
VW      2.13   0.82   0.69   1.61   0.97   0.32   2.63   0.09   1.37   2.13   0.41   1.31   0.02   0.34   0.22   0.47   0.31   1.62 -27.86   1.73   0.93   0.15   5.85
DTNI    1.49   0.76   0.72   2.02   0.79   0.21   3.28   0.75   0.97   3.55   0.37   2.16   0.16   0.21   0.21   0.30   0.29   1.20   0.40 -33.74   0.71   0.10  10.22
HD      1.88   0.81   0.72   1.83   0.91   0.28   2.97   0.26   1.22   2.77   0.40   1.69   0.06   0.29   0.22   0.40   0.31   1.47   0.52   1.72 -31.39   0.13   7.77
SB      2.16   0.82   0.68   1.58   0.98   0.33   2.57   0.07   1.39   2.04   0.41   1.25   0.02   0.35   0.22   0.47   0.31   1.64   0.61   1.73   0.94 -27.60   5.58
REST    0.56   0.47   0.53   1.91   0.40   0.07   3.07   4.71   0.37   4.65   0.22   2.80   1.00   0.06   0.13   0.09   0.19   0.51   0.13   1.02   0.32   0.03 -25.42

Table 3.6: Estimated price elasticities for specification B in t = 1988. Notes: Rows (i) correspond to market shares (s_jt), and columns (j) correspond to prices (p_jt) with respect to which elasticities are calculated. Product labels are as in Table 3.5 (s = small, l = large); columns follow the same product order as the rows.
       CVs    CVl    ODs    ODl    PTs    PTl    BK     CD     FDs    FDl    MCs    MCl    LC     PLs    PLl    DGs    DGl    TY     VW     DTNI   HD     SB     REST
CVs   -12.95   0.46   0.46   0.48   0.46   0.47   0.48   1.45   0.46   0.51   0.46   0.51   1.41   0.48   0.46   0.47   0.46   0.46   0.46   0.46   0.46   0.47   0.51
CVl     0.43 -15.20   0.49   0.53   0.45   0.41   0.53   2.46   0.43   0.60   0.46   0.59   2.39   0.40   0.47   0.41   0.47   0.43   0.42   0.47   0.44   0.41   0.61
ODs     0.41   0.47 -15.79   0.53   0.44   0.39   0.53   2.83   0.41   0.61   0.46   0.61   2.73   0.37   0.47   0.39   0.47   0.42   0.40   0.47   0.42   0.39   0.63
ODl     0.38   0.45   0.48 -16.57   0.41   0.35   0.53   3.40   0.38   0.63   0.44   0.63   3.28   0.33   0.45   0.35   0.45   0.39   0.36   0.45   0.40   0.36   0.65
PTs     0.44   0.47   0.49   0.51 -14.32   0.44   0.51   2.01   0.45   0.56   0.47   0.56   1.95   0.44   0.47   0.44   0.47   0.45   0.44   0.47   0.45   0.44   0.58
PTl     0.46   0.44   0.43   0.44   0.44 -11.76   0.44   1.09   0.46   0.45   0.44   0.45   1.07   0.51   0.44   0.48   0.44   0.45   0.47   0.44   0.45   0.48   0.46
BK      0.38   0.45   0.48   0.53   0.42   0.35 -16.54   3.38   0.38   0.63   0.44   0.63   3.26   0.33   0.45   0.35   0.45   0.39   0.36   0.45   0.40   0.36   0.65
CD      0.03   0.06   0.07   0.10   0.05   0.03   0.10  -7.85   0.03   0.15   0.06   0.15   5.14   0.02   0.06   0.03   0.06   0.04   0.03   0.06   0.04   0.03   0.16
FDs     0.46   0.46   0.47   0.48   0.46   0.47   0.48   1.48 -13.03   0.51   0.46   0.51   1.44   0.48   0.46   0.47   0.46   0.46   0.46   0.46   0.46   0.47   0.52
FDl     0.33   0.42   0.45   0.52   0.37   0.30   0.51   4.22   0.33 -17.46   0.41   0.63   4.06   0.27   0.42   0.30   0.42   0.35   0.31   0.42   0.36   0.30   0.65
MCs     0.43   0.47   0.49   0.53   0.45   0.42   0.52   2.35   0.43   0.59 -14.99   0.59   2.28   0.41   0.47   0.42   0.47   0.44   0.42   0.47   0.44   0.42   0.60
MCl     0.33   0.42   0.45   0.52   0.38   0.30   0.52   4.20   0.33   0.63   0.41 -17.44   4.03   0.27   0.42   0.30   0.42   0.35   0.31   0.42   0.36   0.30   0.66
LC      0.04   0.07   0.08   0.11   0.05   0.03   0.10   5.75   0.04   0.16   0.06   0.16  -8.59   0.02   0.07   0.03   0.07   0.04   0.03   0.07   0.04   0.03   0.17
PLs     0.45   0.40   0.40   0.39   0.42   0.48   0.39   0.79   0.45   0.39   0.41   0.39   0.77 -10.42   0.40   0.48   0.40   0.43   0.47   0.40   0.43   0.47   0.39
PLl     0.42   0.47   0.49   0.53   0.45   0.41   0.53   2.51   0.42   0.60   0.46   0.60   2.43   0.40 -15.28   0.41   0.47   0.43   0.41   0.47   0.44   0.41   0.61
DGs     0.46   0.44   0.44   0.44   0.44   0.48   0.44   1.10   0.46   0.45   0.44   0.45   1.08   0.51   0.44 -11.80   0.44   0.45   0.47   0.44   0.45   0.48   0.46
DGl     0.42   0.47   0.49   0.53   0.45   0.41   0.53   2.51   0.42   0.60   0.46   0.60   2.43   0.40   0.47   0.41 -15.28   0.43   0.41   0.47   0.44   0.41   0.61
TY      0.46   0.47   0.48   0.49   0.46   0.46   0.49   1.69   0.46   0.53   0.46   0.53   1.64   0.46   0.47   0.46   0.47 -13.58   0.46   0.47   0.46   0.46   0.54
VW      0.46   0.45   0.45   0.46   0.45   0.48   0.45   1.24   0.46   0.48   0.45   0.47   1.21   0.50   0.45   0.48   0.45   0.46 -12.28   0.45   0.45   0.47   0.48
DTNI    0.42   0.47   0.49   0.53   0.45   0.41   0.53   2.48   0.43   0.60   0.46   0.60   2.40   0.40   0.47   0.41   0.47   0.43   0.42 -15.22   0.44   0.41   0.61
HD      0.45   0.47   0.48   0.50   0.46   0.45   0.50   1.79   0.45   0.54   0.47   0.54   1.74   0.46   0.47   0.45   0.47   0.45   0.45   0.47 -13.83   0.45   0.55
SB      0.46   0.44   0.44   0.45   0.45   0.48   0.45   1.15   0.46   0.46   0.44   0.46   1.13   0.50   0.44   0.48   0.44   0.45   0.47   0.44   0.45 -12.00   0.47
REST    0.32   0.41   0.45   0.51   0.37   0.29   0.51   4.37   0.32   0.63   0.40   0.63   4.19   0.26   0.41   0.29   0.41   0.34   0.30   0.41   0.35   0.29 -17.59

Table 3.7: Estimated price elasticities for specification C (BLP case) in t = 1988. Notes: Rows (i) correspond to market shares (s_jt), and columns (j) correspond to prices (p_jt) with respect to which elasticities are calculated. Product labels are as in Table 3.5 (s = small, l = large); columns follow the same product order as the rows.

Clearly, these differences in elasticities would have significant competitive implications. As a rough estimate, using an inverse-elasticity pricing rule as a benchmark, we see that Specification C, which corresponds to the standard BLP model, implies markups which are roughly twice the size of the markups from Specification B, which allows for factors.
This is an economically very significant difference, and suggests that not including the factors may lead to underestimation of the degree of competitiveness in this market.

On the whole, then, this empirical application shows that our estimation procedure is feasible even for moderate-sized datasets like the one used here. Including interactive fixed effects delivers results which are strikingly different from those obtained from specifications without these fixed effects.

3.7 Conclusion

In this chapter, we considered an extension of the popular BLP random coefficients discrete-choice demand model, which underlies much recent empirical work in IO. We add interactive fixed effects in the form of a factor structure on the unobserved product characteristics. The interactive fixed effects can be arbitrarily correlated with the observed product characteristics (including price), which accommodates endogeneity and, at the same time, captures strong persistence in market shares across products and markets. We propose a two step least squares-minimum distance (LS-MD) procedure to calculate the estimator. Our estimator is easy to compute, and Monte Carlo simulations show that it performs well.

We apply our estimator to US automobile demand. Significantly, we find that, once factors are included in the specification, the results assuming that price is exogenous or endogenous are quite similar, suggesting that the factors are indeed capturing much of the unobservable product and time effects leading to price endogeneity.

The model in this chapter is, to our knowledge, the first application of factor-modelling to a nonlinear setting with endogenous regressors. Since many other models used in applied settings (such as duration models in labor economics, and parametric auction models in IO) have these features, we believe that factor-modelling may prove an effective way of controlling for unobserved heterogeneity in these models. We are exploring these applications in ongoing work.

Chapter 4

Semiparametric Estimation of Nonlinear Panel Data Models with Generalized Random Effects

4.1 Introduction

This chapter explores a new approach to higher order bias correction in non-linear panel data models under an asymptotics where both the number of cross-sectional units N and the number of time periods T become large (referred to simply as "large T asymptotics" in the following). Instead of estimating the individual effects, which parameterize the unobserved heterogeneity of each individual, as additional parameters of the model (fixed effect approach), we propose to consistently estimate the distribution of these individual effects conditional on the regressors in a non-parametric way. Our main findings are, firstly, that as T grows the incidental parameter bias of the parameters of interest vanishes at a rate that is determined by the convergence rate of the estimator for the individual effect distribution (under the Hellinger distance); and secondly, that these convergence rates can be much higher than the ones obtained from the existing bias reduction techniques in the fixed effect approach. This improvement, however, also comes at a cost: in order to allow for consistent non-parametric inference on the individual effect distribution, we need to impose assumptions that restrict the dependence of the individual effects on the regressors, and that require this distribution to be sufficiently smooth.
Our results are particularly important for applications in which the number of time periods T is modestly large, while the number of cross-sectional units N is much larger than T. In such a scenario the existing bias correction techniques may be insufficient to reduce the incidental parameter bias to a level where it can be ignored relative to the standard error of the estimator, so that imposing additional restrictions in order to further reduce the bias becomes a very attractive option.

As discussed in Chapter 1, the large T approach, which is also used in the present chapter, is convenient since the alternative asymptotics N,T → ∞ guarantees existence of a consistent point estimator for a large class of models, and provides an asymptotic solution to the incidental parameter problem. Various techniques have been developed to decrease the order of the incidental parameter bias from 1/T to smaller orders (e.g. to 1/T²). Whether the remaining bias is problematic depends on the relative size of N and T. The standard error of the estimator is of order (NT)^{-1/2}, i.e. it depends on both N and T, while the bias only depends on T. For applications where N is not too large relative to T one can thus ignore the bias relative to the standard error, but for very large values of N relative to T the bias dominates the standard error. In the latter case, getting additional data in the cross-sectional dimension, i.e. increasing N without increasing T, does not improve the estimator further, and in fact worsens the size properties of test statistics based on this biased estimator. The main research problem in the large T panel data literature is thus to find ways to decrease the incidental parameter bias further, in order to make the estimation methodology applicable for a larger range of sample sizes N, T.

With the exceptions discussed below, most of the large T panel literature uses a "fixed effect" approach, in which the individual effects (which parameterize the cross-sectional heteroscedasticity) are themselves estimated as incidental parameters. In the present chapter we consider a "random effect" approach, in which the distribution of the individual effects is estimated instead. While we allow for correlation between the individual effects and the observed regressors, we require some constraints on the structure of this correlation, since otherwise inference on the conditional distribution of the individual effects is infeasible due to the curse of dimensionality (large dimensional support of the conditioning variables, i.e. the regressors). For most of our results we need not specify the nature of these correlation constraints, so that they are applicable to various ways of reducing the dimensionality of the conditioning variables. As a concrete example of such a correlation constraint we discuss the "generalized random effect" assumption, which assumes that individuals can be grouped based on their observed regressor values, and imposes independence between the individual effects and the regressors within each group. Apart from this generalized random effect assumption we impose no further parametric constraints on the individual effect distribution. We estimate the parameters of interest (parametric component) jointly with the individual effect distribution (non-parametric component) by maximum likelihood. The generalized random effect assumption (or any other constraint that reduces the dimension of the regressors as conditioning variables) is restrictive, but it also turns out to have very powerful consequences for the incidental parameter bias.
We show that the rate at which the bias decreases with T depends on the smoothness of the true individual effect distribution, and that the bias can decrease at an arbitrary polynomial rate as long as the true individual effect distribution is sufficiently smooth. The bias may therefore be much smaller than the one obtained from the existing methods in the large T panel literature. In particular, in applications where T is modestly large and N is much larger than T, one may therefore be willing to impose the generalized random effect assumption together with a smoothness assumption on the individual effect distribution, in order to avoid (or substantially reduce) the incidental parameter problem.

The technical derivation of our result is done in two steps. Firstly, we derive the properties of the maximum likelihood estimator for the parameters of interest, assuming that some estimator for the individual effect distribution is given and is used to integrate out the individual effects from the likelihood function. We show that the resulting incidental parameter bias for the parameters of interest is bounded by an expression that involves the Hellinger distance between the true individual effect distribution and its estimator. The rate at which the incidental parameter bias vanishes as T increases therefore depends on the rate at which the individual effect distribution can be estimated. Secondly, we consider estimation of the individual effect distribution by maximum likelihood and show how in this case the convergence rate of the estimator in T depends on the smoothness properties of the true distribution. Our first result on the incidental parameter bias of the parameters of interest is also applicable to other estimators for the individual effect distribution, and it would clearly be interesting to explore alternative estimation approaches (beyond maximum likelihood) for this distribution in future research.

The purpose of this chapter is not to replace the existing methods on bias correction in large T panel data, but to provide an interesting alternative with somewhat complementary properties. For example, the Jackknife bias correction methods developed in Hahn and Newey (2004) and Dhaene and Jochmans (2010) also allow for higher order bias correction (to order 1/T², 1/T³, etc.). Compared to this method, we have to impose a restriction on the correlation structure between the individual effects and the regressors, which is not required for the Jackknife. On the other hand, the Jackknife method needs to impose a stationarity assumption on all observed variables, and higher order Jackknife bias correction can significantly increase the standard error of the estimator, neither of which is the case in our approach.

Our estimation approach is based on the integrated likelihood function (integrating over the individual effects), as opposed to the profile likelihood function (maximizing over the individual effects) that appears in fixed effect estimation. Woutersen (2002) and Arellano and Bonhomme (2009) also use integration instead of profiling for the purpose of bias correction in large T panel data models, but they only discuss how to reduce the bias to order 1/T² or o(1/T), respectively. For the most part these papers do not discuss consistent estimation of the individual effect distribution, which we show to be the key tool to achieve higher order bias correction. In the second part of Arellano and Bonhomme (2009), bias reduction to o(1/T) is discussed for a parametric random effect model using joint maximum likelihood estimation of all parameters.
The analysis in the present chapter can be viewed as an extension of their results to the semi-parametric generalized random effect case and to higher order bias reduction. The econometric techniques used here are, however, quite different from their work, and in particular the higher order bias correction is crucial from an applied perspective.

Another related paper is Bester and Hansen (2007). They consider non-linear panel data models with "flexible correlated random effects" and discuss semiparametric sieve estimation. The difference to the present chapter is that they consider fixed T asymptotics, starting from the assumption that the model is identified at fixed T. Their paper is therefore complementary to our approach, just as the fixed T and large T panel data literatures are complementary in general.

As mentioned above, we consider joint maximum likelihood estimation of the parameters of interest and the conditional distribution of the individual effects, treating the latter non-parametrically. There is a large literature on semi- and non-parametric estimation, including non-parametric density estimation (reviewed e.g. in Härdle and Linton (1994), Chen (2007), and Ichimura and Todd (2007)). We essentially employ a sieve approach, i.e. a different parameter set for the non-parametric component is chosen for different sample sizes, in such a way that asymptotically a very large class of individual effect distributions can be approximated. Usually in sieve estimation the parameter set is chosen sample size dependent for the purpose of keeping the sampling error in check. However, in our case the parameter set is chosen sample size dependent in order to keep the fixed T identification problem in check instead of controlling the sampling error, i.e. the role that is played by the large T asymptotics is somewhat non-standard. This is also illustrated by the fact that under the large T asymptotics the maximum likelihood estimator for the parameters of interest (parametric component) is consistent even if we plug in a fixed prior for the individual effect distribution (i.e. an inconsistent estimator for the non-parametric component). This is usually not the case in semi-parametric estimation problems, but is very intuitive in view of the existing large T panel literature (since profiling out and integrating out nuisance parameters should yield similar results for the parameters of interest). Our results are therefore more readily interpreted from the perspective of this latter literature.

For our Monte Carlo simulations we consider a dynamic binary choice model, which is known not to be identified at finite T, see e.g. Honoré and Tamer (2006). The simulations confirm our asymptotic results in that they show that the bias of our estimator for the parameters of interest is very small, as long as the true distribution for the individual effects is sufficiently smooth. The finite sample variance of our estimator is found to be very close to the variance of the fixed effect maximum likelihood estimator, which is a very important result, since most large T bias correction techniques have a tendency to increase the finite sample variance of the estimator relative to the fixed effect MLE.
For future research it would be interesting to consider alternative estimation procedures for the individual effect distribution, both within the maximum likelihood framework, where one could explore the choice of alternative parameter sets for the non-parametric density estimation, but, as already mentioned above, also alternatives to maximum likelihood, e.g. the predictive recursion method that has recently been considered in the statistics literature (Newton, Quintana and Zhang (1998), Newton (2002), Martin and Tokdar (2010)). Furthermore, it would be fascinating to explore more general ways to reduce the dimension of the conditioning variables in the individual effect distribution that go beyond our generalized random effect assumption, and that allow for a more general correlation structure between the individual effects and the regressors. We formulated many of our results under high-level assumptions, instead of directly considering the generalized random effect case, exactly for the purpose of allowing these types of generalizations.

The chapter is organized as follows. In Section 4.2 we introduce the model and some additional notation. Section 4.3 defines the estimators for the parameters of interest and the individual effect distribution, and provides a brief discussion of the main conceptual ideas and results of the chapter. Section 4.4 derives the asymptotic distribution for the estimator of the parameters of interest under appropriate high-level assumptions, which is the main technical contribution of the chapter. In Section 4.5 we apply these general results to the special case of generalized random effects. Monte Carlo simulations are presented in Section 4.6, and some concluding remarks are given in Section 4.7. The appendix contains figures and tables for the Monte Carlo simulations, and provides the regularity assumptions that are referred to in the theorems of the main text, as well as the proofs of these theorems.

4.2 Model

We consider a panel data model with N cross-sectional units and T time periods. A dependent variable Y_it, a vector of time-varying independent variables X_it and a vector of time-invariant independent variables Z_i are observed, where i = 1,...,N and t = 1,...,T. Let Y_i = (Y_i1,...,Y_iT) and X_i = (X_i1,...,X_iT; Z_i), i.e. all independent variables are summarized by X_i. We assume that the unobserved heterogeneity in the distribution of Y_i conditional on X_i can be described by (a vector of) individual specific effects α_i. The random variables Y_i, X_i and α_i take values in the sets Y^T, X^T and A, where A ⊂ R^M, and M is some finite positive integer. For elements of Y^T, X^T and A we use the notation y_i, x_i and α_i, or simply y, x and α, i.e. we distinguish non-random from random objects by using lower case as opposed to capital letters and by not using bold face type for the fixed effects. We assume cross-sectional independence and that for each i = 1,...,N the distribution of Y_i conditional on X_i is given by

  f_{Y|X}(y_i | x_i; \theta, \pi) = \int_A f(y_i | x_i, \alpha; \theta) \, \pi(\alpha | x_i) \, d\alpha,   (4.1)

where θ ∈ Θ ⊂ R^K are the parameters of interest, f(y_i | x_i, α; θ) is the distribution of Y_i conditional on X_i and α_i for given θ, and π(α | x_i) is the distribution of the individual effects conditional on the regressors. Since the individual effects are unobserved, they are integrated over in equation (4.1). In the following, when the distribution of one generic cross-sectional unit is considered, we will often drop the index i for notational convenience.

Equation (4.1) describes the distribution of Y given X as a mixture of the distributions f(y|x,α;θ) over the distribution of α. We impose a parametric model for f(y|x,α;θ), i.e. we assume that f(y|x,α;θ) is known up to the finite dimensional parameter θ. We furthermore assume that f(y|x,α;θ) has the structure

  f(y | x, \alpha; \theta) = \prod_{t=1}^T f_t(y_t | x, y^{(t-1)}, \alpha; \theta),   (4.2)

with y^{(t)} = (y_1,...,y_t). Here, f_t(y_t | x, y^{(t-1)}, α; θ) is the period likelihood function, which describes the distribution of Y_it conditional on X_i, lags of Y_it, and the individual effects α_i.

Some Further Notation

We use θ_0 and π_0 = π_0(α|x) to denote the true parameters of interest and the true conditional distribution of the individual effects, i.e. we assume that f_{Y|X}(y|x; θ_0, π_0) describes the actual distribution of Y_i conditional on the covariate value X_i = x.

Let Π^A_T be the set of all conditional probability densities π(α|x) with respect to the Lebesgue measure, over α ∈ A and conditional on x ∈ X^T. We only consider conditional distributions for the individual effects α that are absolutely continuous with respect to the Lebesgue measure on A, and we therefore use the terms distribution and density interchangeably, i.e. we often refer to π(α|x) as the distribution of α. The following subsets of Π^A_T impose a lower respectively upper bound on π(α|x):

  \Pi^{low}_T = \{ \pi \in \Pi^A_T : \forall x \in X^T, \forall \alpha \in A : \pi(\alpha|x) \ge \pi^{low}_T(\alpha|x) \},
  \Pi^{up}_T  = \{ \pi \in \Pi^A_T : \forall x \in X^T, \forall \alpha \in A : \pi(\alpha|x) \le \pi^{up}_T(\alpha|x) \}.   (4.3)

Here, the bounds π^{low}_T(α|x) and π^{up}_T(α|x) do not integrate to one. We define the Hellinger distance between distributions π, π_0 ∈ Π^A_T, and between the distributions of the outcome variable Y that are implied by π and π_0 (at the true θ_0), as follows:

  D_H(\pi, \pi_0) = \sqrt{ \frac{1}{N} \sum_{i=1}^N \int_A \Big[ \sqrt{\pi(\alpha|X_i)} - \sqrt{\pi_0(\alpha|X_i)} \Big]^2 d\alpha },
  D_H(f_Y(\pi), f_Y(\pi_0)) = \sqrt{ \frac{1}{N} \sum_{i=1}^N \int_{Y^T} \Big[ \sqrt{f_{Y|X}(y|X_i;\theta_0,\pi)} - \sqrt{f_{Y|X}(y|X_i;\theta_0,\pi_0)} \Big]^2 dy }.   (4.4)

The Kullback-Leibler divergence measures between π, π_0 ∈ Π^A_T, and between their implied distributions for the outcome Y, read

  D_{KL}(\pi \,\|\, \pi_0) = \frac{1}{N} \sum_{i=1}^N \int_A \log\!\left[ \frac{\pi_0(\alpha|X_i)}{\pi(\alpha|X_i)} \right] \pi_0(\alpha|X_i) \, d\alpha,
  D_{KL}(f_Y(\pi) \,\|\, f_Y(\pi_0)) = \frac{1}{N} \sum_{i=1}^N \int_{Y^T} \log\!\left[ \frac{f_{Y|X}(y|X_i;\theta_0,\pi_0)}{f_{Y|X}(y|X_i;\theta_0,\pi)} \right] f_{Y|X}(y|X_i;\theta_0,\pi_0) \, dy.   (4.5)

These two Hellinger distances and two Kullback-Leibler divergences can all be viewed as distance measures between the distributions π and π_0. These distance measures are random variables, since sample averages over functions of the covariates appear in their definitions.
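The unit-level building blocks of (4.4) and (4.5) are easy to evaluate numerically when densities are tabulated on a grid. A minimal sketch (our own illustration; pi and pi0 are hypothetical density arrays on a common grid for α):

import numpy as np

def hellinger_sq(pi, pi0, dalpha):
    """Squared Hellinger distance between two densities on a common grid.
    pi, pi0 : (n_grid,) arrays of density values; dalpha : grid spacing."""
    return np.sum((np.sqrt(pi) - np.sqrt(pi0)) ** 2) * dalpha

def kl_div(pi, pi0, dalpha):
    """D_KL(pi || pi0) in the convention of (4.5): integrate log(pi0/pi)
    against pi0."""
    return np.sum(np.log(pi0 / pi) * pi0) * dalpha

# hypothetical example: two normal densities on a grid
grid = np.linspace(-6, 6, 2001)
d = grid[1] - grid[0]
dens = lambda m, s: np.exp(-0.5 * ((grid - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
print(hellinger_sq(dens(0, 1), dens(0.5, 1), d), kl_div(dens(0, 1), dens(0.5, 1), d))

The full measures in (4.4) and (4.5) additionally average such unit-level terms over the N conditioning values X_i.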
Equation (4.1) describes the distribution of Y given X as a mixture of distributions f(y|x,α;θ) over the distribution of α. We impose a parametric model for f(y|x,α;θ), i.e. we assume that f(y|x,α;θ) is known up to the finite dimensional parameter θ. We furthermore assume that f(y|x,α;θ) has the structure f(y|x,α;θ)= T Y t=1 f t (y t |x,y (t−1) ,α;θ), (4.2) withy (t) =(y 1 ,...,y t ). Here,f t (y t |x,y (t−1) ,α;θ)istheperiodlikelihoodfunction,which describes the distribution of Y it conditional on X i , lags of Y it and individual effectsα i . Some Further Notation We use θ 0 and π 0 = π 0 (α|x) to denote the true parameters of interest and the true conditional distribution of the individual effects, i.e. we assume that f Y|X (y|x;θ 0 ,π 0 ) describes the actual distribution of Y i conditional on the covariate value X i =x. Let Π A T be the set of all conditional probability densities π(α|x) with respect to the Lebesgue measure, overα∈A and conditional onx∈X T . We only consider conditional distributions for the individual effectsα that are absolutely continuous with respect to the Lebesgue measure on A, and we therefore use the terms distribution and density interchangeably, i.e. we often refer to π(α|x) as the distribution of α. The following subsets of Π A T impose a lower respectively upper bound on π(α|x) Π low T = π∈Π A T ∀x∈X T ,∀α∈A : π(α|x)≥π low T (α|x) , Π up T = π∈Π A T ∀x∈X T ,∀α∈A : π(α|x)≤π up T (α|x) . (4.3) 70 Here,theboundsπ low T (α|x)andπ up T (α|x)donotintegratetoone. WedefinetheHellinger distance between distributionsπ,π 0 ∈Π A T and between the distributions of the outcome variable Y that are implied by π and π 0 (at the trueθ 0 ) as follows D H (π,π 0 )= v u u t 1 N N X i=1 Z A h p π(α|X i )− p π 0 (α|X i ) i 2 dα, D H (f Y (π),f Y (π 0 ))= v u u t 1 N N X i=1 Z Y T hq f Y|X (y|X i ;θ 0 ,π)− q f Y|X (y|X i ;θ 0 ,π 0 ) i 2 dy, (4.4) TheKullbackLeiblerdivergencemeasuresbetweenπ,π 0 ∈Π A T andbetweentheirimplied distributions for the outcome Y read D KL (π||π 0 )= 1 N N X i=1 Z A log π 0 (α|X i ) π(α|X i ) π 0 (α|X i )dα, D KL (f Y (π)||f Y (π 0 ))= 1 N N X i=1 Z Y T log " f Y|X (y|X i ;θ 0 ,π 0 ) f Y|X (y|X i ;θ 0 ,π) # f Y|X (y|X i ;θ 0 ,π 0 )dy. (4.5) These two Hellinger distances and two Kullback Leibler divergences can all be viewed as distance measures between the distributions π and π 0 . These distance measures are random variables since sample averages over functions of covariates appear in their definitions. 4.3 Description of Estimators and Main Results The unknown parameters in the model (4.1) are the finite dimensional vector θ and the conditional distribution function π = π(α|x). The (log-) likelihood function over these parameters is L NT (θ,π)= 1 NT N X i=1 log f Y|X (Y i |X i ;θ,π). (4.6) 71 For given π, the maximum likelihood estimator forθ and the corresponding profile like- lihood function are ˆ θ(π)=argmax θ∈Θ L NT (θ,π), L NT (π)=max θ∈Θ L NT (θ,π). (4.7) Themaximumlikelihoodestimator forπ isobtainedbymaximizingtheprofilelikelihood L NT (π) over an appropriate set Π T of individual effect distributions, i.e. ˆ π =argmax π∈Π T L NT (π), ˆ θ = ˆ θ(ˆ π). (4.8) Thechoice of theparameter setΠ T crucially affects thepropertiesof thejointmaximum likelihood estimators ˆ π and ˆ θ. Clearly, we need Π T ⊂ Π A T (the set of all conditional distributions that integrate to one), but the discussion below makes clear why it is importanttoconstrainΠ T further. 
While a different set Π_T might be chosen for different values of N and T, it is the T-dependence of the parameter set that turns out to be decisive, which is why we make this dependence explicit in the subscript T.

It is well known that the fixed effect maximum likelihood estimator for θ has an incidental parameter bias of order 1/T (e.g. Hahn and Newey (2004)). Similarly, the estimator θ̂ in general also possesses an incidental parameter bias. The main objective of this chapter is to show that the rate at which the bias of θ̂ vanishes with T is only restricted by the properties of the true distribution π_0, as long as the parameter set Π_T is chosen appropriately.

There are different types of restrictions that have to be imposed on π_0 and Π_T, which can be associated either with finite N sampling issues or with finite T identification issues, and we discuss those separately in the following.

4.3.1 Sampling Issues (Generalized Random Effect Assumption)

The incidental parameter problem in panel data (Neyman and Scott (1948)) is very familiar when the individual effects α_i are modeled as fixed effects, i.e. when a separate parameter α_i is introduced and estimated for each cross-sectional unit. The problem occurs because the number of parameters α_i grows with the sample size, which results in an inconsistency of the estimator (e.g. the maximum likelihood estimator) for the parameters of interest under fixed T asymptotics.

Since we model the individual effects as correlated random effects, i.e. via their conditional distribution π(α|x), the problem is somewhat less obvious. However, the incidental parameter problem also arises in this approach whenever the support X^T of the regressors is "large" (i.e. either discrete with very many support points, or continuous and high-dimensional), which is to be expected in particular for large values of T. For example, if X^T is discrete with cardinality much larger than the sample size N, then we expect the realization of each X_i to be unique within the sample, so that there is only one available observation for the estimation of π(α|x) at x = X_i. In that case there is no difference between the unrestricted correlated random effects approach and the standard fixed effects approach.

For the correlated random effects approach this incidental parameter problem is resolved asymptotically for fixed T and N → ∞ (e.g. for discrete X^T one eventually has many units with the same X_i under this asymptotic). However, this asymptotic consideration may not be relevant for the estimation problem at a given finite sample size N, T, in particular if T is (moderately) large. In order to estimate the conditional distribution π(α|x) consistently, we thus either need the support X^T to be "small" in the first place (a small number of continuous dimensions, or only few discrete support points, relative to N), or we have to make further assumptions to overcome this curse of dimensionality in the conditioning variables.

Various ways are conceivable to reduce the dimension of the conditioning variables. The restriction that we consider explicitly in this chapter is what we call a "generalized random effect" assumption. Namely, we assume that there exists a partitioning of X^T into a finite number of groups such that the conditional densities π(α|x) and π(α|x̃) are identical if x and x̃ belong to the same group, i.e. we impose a random effect assumption within each group.
This generalized random effect assumption solves the incidental parameter problem, since we only need to estimate a different distribution π(α|x) for each group but not separately for each x∈X T . Theassumption thatthe truedistributionπ 0 satisifies this generalized randomeffect condition can also be written as α ⊥ X g, (4.9) i.e. the individual effects α are independent of the regressors X once we condition on thegroupg. Animportantgeneralization ofthisconditionalindependenceassumptionis obtained byreplacing theconditioning on the groupg with a conditioning on some more general observedvariableZ, i.e. α ⊥ X Z. Aslongas thesupportofZ is nottoolarge this more general condition also resolves the incidental parameter problem. The gener- alized random effect restriction which we consider in this chapter is simply the special case whereZ has discrete support. For future research it would clearly be interesting to consider the more general case as well. The existence for such a conditioning variableZ is also the basis for the control function approach, discussed e.g. in Imbens and Newey (2009). However, here we do not assume that the model is point identified at fixed T, even after the conditional independence assumption is imposed. Example: The model which we consider explicitly in our Monte Carlo simulations below is the single index dynamic binary choice model, for which Y it ∈{0,1}, i.e. Y T = {0,1} T , and Y it =1 θY i,t−1 +α i +ε it ≥0 . (4.10) Here, 1{.} is the indicator function, ε it is a random shock that is independent and identically distributed across i and t (with known distributions), and is independent of α i . For simplicity we consider the case where no additional regressors X it are present, 74 butwe assumethattheinitial periodoutcome variableY i0 is observed (thetotal number ofobservedtime-periodsisthusT+1),andwetreatthisinitialoutcomeasaconditioning variable, i.e. Z i =Y i0 and therefore X T ={0,1} in the above notation. In this example one thus needs to estimate the parameter θ∈ R and the two densities π(α|Y i0 =0) and π(α|Y i0 =1). SinceX T is finite with a constant number of elements (independent ofT), thismodelalreadysatisfiesthegeneralized randomeffects assumption,withoutimposing any further constraints. 4.3.2 Identification Issues (Smoothness Assumption on π) The motivation for the largeT asymptotics in the panel data literature is to resolve the incidentalparameterestimationproblem,toovercomethepotentialfixedT identification problem, and to do this in a way that is not model specific. We have argued that the generalized random effect assumption already overcomes the incidental parameter problem, but there may still remain an identification problem at finite T. For example, under the random effects approach there is no incidental parameter problem in the dynamic binary choice model without additional regressors, which we just introduced. Nevertheless the model is not fixed T identified, as discussed in Honor´ e and Tamer (2006). It is this identification problem that motivates the use of largeT asymptotics in the present chapter. If the particular model of interest is point identified at fixed T once the generalized random effect assumption is imposed, then one can use the sieve estimation approach of Bester and Hansen (2007), which allows for semi-parametric estimation at fixed T. However,todeterminewhetherthemodelispointidentifiedornotrequires(apotentially complicated) model specific analysis. 
4.3.2 Identification Issues (Smoothness Assumption on π)

The motivation for the large T asymptotics in the panel data literature is to resolve the incidental parameter estimation problem, to overcome the potential fixed T identification problem, and to do this in a way that is not model specific. We have argued that the generalized random effect assumption already overcomes the incidental parameter problem, but there may still remain an identification problem at finite T. For example, under the random effects approach there is no incidental parameter problem in the dynamic binary choice model without additional regressors, which we just introduced. Nevertheless the model is not fixed T identified, as discussed in Honoré and Tamer (2006). It is this identification problem that motivates the use of large T asymptotics in the present chapter.

If the particular model of interest is point identified at fixed T once the generalized random effect assumption is imposed, then one can use the sieve estimation approach of Bester and Hansen (2007), which allows for semi-parametric estimation at fixed T. However, determining whether the model is point identified or not requires (a potentially complicated) model specific analysis. When T is sufficiently large, we show that one can avoid this by considering the alternative asymptotics N,T → ∞.

To discuss how the identification problem vanishes as T → ∞ we consider the expected (log-) likelihood function

  \bar L_{NT}(\theta, \pi) = E\big[ L_{NT}(\theta, \pi) \,\big|\, X_1, \ldots, X_N \big].   (4.11)

This expected likelihood is not quite the population likelihood function, since we still condition on the regressors X_i, i.e. L̄_NT(θ,π) is still random, and for finite N not all of the possible variation in X_i is already accounted for in L̄_NT(θ,π). For π ∈ Π^A_T we define¹

  \theta(\pi) = \operatorname*{argmax}_{\theta \in \Theta} \bar L_{NT}(\theta, \pi).   (4.12)

For sufficiently large values of N and T we assume that θ(π_0) = θ_0 with probability one, i.e. if the true distribution of the individual effects is known, then θ is point-identified. Our analysis in the next section shows that if π satisfies a certain generalized Lipschitz condition with Lipschitz constant κ_T = o(√T), then under appropriate regularity conditions we have

  \theta(\pi) - \theta(\pi_0) = O_p\!\left( \frac{\kappa_T}{T} \, D_H(\pi, \pi_0) \right),   (4.13)

where D_H(π, π_0) is the Hellinger distance introduced above. This result has two important consequences.

Firstly, by choosing some fixed prior distribution π for each T we can achieve κ_T = O(1) and D_H(π, π_0) ≤ √2 (which always holds for the Hellinger distance). We thus obtain θ(π) − θ_0 = O_p(1/T), i.e. the identification problem for θ vanishes as T becomes large, and the size of the identified set shrinks at least at the rate 1/T.²

Footnote 1: Note that π depends on T, but we suppress this dependence.
Footnote 2: θ(π) has a given value for each T and we have θ_0 = θ(π) + O_p(1/T), which implies that the identified set for θ_0 has to shrink at the rate 1/T.

Secondly, the actual rate at which the identified set for θ_0 shrinks may be much faster, and depends on how fast the identified set for π_0 shrinks in terms of the Hellinger distance. This rate can be very high, depending on the smoothness assumptions that are imposed on the allowed conditional densities π. We are going to discuss this issue a little further now.

For simplicity, consider the case where θ_0 is known (the conclusions for the case where θ is estimated turn out to be equivalent), and define

  \bar\pi = \operatorname*{argmax}_{\pi \in \Pi_T} \bar L_{NT}(\theta_0, \pi),   (4.14)

for some appropriate parameter set Π_T. Here, the optimal π̄ may not be unique, but we assume that it exists and that one of the optimal values is chosen if there are multiple ones. The rate at which D_H(π̄, π_0) vanishes as T → ∞ provides an upper bound for the rate at which the identified set for π_0 shrinks with T.

Maximizing L̄_NT(θ_0,π) is equivalent to minimizing D_KL(f_Y(π) || f_Y(π_0)), which is the Kullback-Leibler divergence of the outcome variable distributions that are implied by π and π_0 (see the definition above). Thus, the rate at which D_H(π̄,π_0) converges to zero as T → ∞ depends on:³

(i) How fast D_KL(f_Y(π̄) || f_Y(π_0)) converges to zero, i.e. how well the distribution of the outcome variable generated by an element in Π_T can approximate the true distribution of the data.

(ii) Whether a small value of D_KL(f_Y(π) || f_Y(π_0)) for π ∈ Π_T also implies that D_H(π,π_0) is small.

Footnote 3: Assumption 4.3 below provides a formal statement of these conditions.

Note that the first point demands Π_T to be large enough, while the second point requires it not to be too large. This is analogous to the problem one faces in non-parametric sieve estimation (see e.g. Chen (2007)), only that our condition (ii) is related to identification, while there it is about controlling the sample variation of the non-parametric estimator.

To discuss condition (i) one can, for example,
use the inequality D_KL(f_Y(π) || f_Y(π_0)) ≤ D_KL(π || π_0), which holds generally (chain rule for the Kullback-Leibler divergence). Satisfying condition (i) is therefore mainly an exercise in approximation theory, with the distance measure given by the Kullback-Leibler divergence. There are many possibilities to approximate an unknown function, summarized e.g. in the review of Chen (2007). Since π is a probability density, we also need to impose the constraints that the density is positive and integrates to one. The quality of the approximation of π_0 will strongly depend on how smooth π_0 is as a function of α. To satisfy condition (ii) we are going to choose the set Π_T such that it only contains distributions that are sufficiently smooth in a well-defined sense, as will be discussed extensively in Section 4.5.

4.3.3 Main Results

In Section 4.4 below we derive the asymptotic properties of the estimator θ̂ under high-level assumptions on the parameter set Π_T and the true distribution π_0. In Section 4.5 we then discuss the generalized random effect assumption as one particular example of how to satisfy these high-level assumptions. However, our asymptotic results for θ̂ in Section 4.4 are applicable more generally. The high-level assumptions we impose there reflect the restrictions discussed above: firstly, that the dependence of α and X needs to be restricted to avoid a curse of dimensionality problem when estimating π(α|x); secondly, that the true distribution π_0(α|x) can be well-approximated by an element in Π_T; and thirdly, that distributions in Π_T are sufficiently smooth to control for the fact that the model may not be identified at fixed T.

The method we use to control the asymptotic bias in θ̂ is to bound the difference between θ̂ and θ̂(π_0), namely we show that

  \hat\theta - \hat\theta(\pi_0) = O_p\!\left( \frac{\kappa_T \, \mu_T}{T} \right) + o_p\!\left( \frac{1}{\sqrt{NT}} \right),   (4.15)

where κ_T is the generalized Lipschitz constant that describes the smoothness of the functions in Π_T, and μ_T is the rate at which the true distribution π_0 can be approximated by an element of Π_T in terms of the Hellinger distance as T → ∞. The result (4.15) is very powerful, since the infeasible estimator θ̂(π_0) has no asymptotic bias, i.e. the equation states that all terms that contribute to the asymptotic bias of θ̂ are of the order κ_T μ_T / T, which can vanish very rapidly as T increases.

To describe the limiting distribution of θ̂, we note that under appropriate regularity conditions √(NT) (θ̂(π_0) − θ_0) is asymptotically normal with mean zero and variance I_0^{-1}. Here, I_0 is the large N, T limit of the appropriately scaled information matrix of the model for given π_0. Thus, as long as N is not growing too fast asymptotically, namely as long as N = o(T / (μ_T² κ_T²)), we can conclude that the right hand side of equation (4.15) is of order o_p(1/√(NT)), and therefore √(NT) (θ̂ − θ_0) →_d N(0, I_0^{-1}) for N,T → ∞.

We then apply these general results to the special case of generalized random effects, where the regressor domain X^T is decomposed into G_T groups and a random effect assumption is imposed in each group. We discuss how Π_T can be chosen appropriately in that case. We then show that if π_0(α|x) is r times continuously differentiable in α with bounded derivatives, and r ≥ 1, then under appropriate regularity conditions we have

  \frac{\kappa_T \, \mu_T}{T} = \frac{1}{T} \left( \frac{\log T}{T} \right)^{(r-1)/2}.   (4.16)

Thus, the rate at which the bias of θ̂ decreases in T is only restricted by the smoothness of π_0. This is a very powerful result: by imposing a generalized random effect assumption together with a smoothness assumption on the distribution π_0, we can obtain an estimator θ̂ whose asymptotic bias vanishes very rapidly. The rate at which the bias decreases in T can be substantially higher than the rates obtained from other bias correction techniques. Consequently, the estimator θ̂ can be asymptotically unbiased under a much larger range of asymptotics, allowing N to grow much faster than T. Our Monte Carlo simulations confirm this asymptotic result, since we find θ̂ to have very little bias also in scenarios where N is much larger than T, as long as π_0 can be well-approximated by an element of Π_T.
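To put the rate (4.16) into perspective, consider a concrete smoothness level (this numerical illustration is our own arithmetic, not taken from the text):

\[
  r = 5: \qquad \frac{\kappa_T \, \mu_T}{T} \;=\; \frac{1}{T}\left(\frac{\log T}{T}\right)^{2} \;=\; \frac{(\log T)^2}{T^3},
\]

so the bias is, up to the log factor, a full order of magnitude in T below the 1/T² rate delivered by standard analytical or Jackknife corrections. Correspondingly, plugging (κ_T μ_T)² = (log T / T)⁴ into the condition N = o(T/(μ_T² κ_T²)) shows that asymptotic unbiasedness then only requires N = o(T⁵/(log T)⁴), compared to N = o(T³) for an estimator with bias of order 1/T².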
4.4 Asymptotic Analysis of the Estimators

In this section we analyze the large N, T asymptotic properties of the estimator θ̂(π) that was introduced in (4.7), and of which the joint maximum likelihood estimator θ̂ = θ̂(π̂) is a special case. In subsection 4.4.1 we show uniform consistency of θ̂(π) over all π ∈ Π^low_T, i.e. over all individual effect distributions π = π(α|x) that are appropriately bounded from below. Having established uniform consistency, we then continue to analyze the local properties of the integrated likelihood function L_NT(θ,π) around θ_0. In subsection 4.4.2 we derive uniform bounds on the difference of the scores and the Hessians of L_NT(θ,π) between two different individual effect distributions. We then use these bounds for the score and Hessian to also bound the difference θ̂(π̂) − θ̂(π_0) in terms of the Hellinger distance between π̂ and π_0. In subsection 4.4.3 we then derive the convergence rate of π̂ to π_0 in terms of the Hellinger distance, under appropriate assumptions. This also gives an upper bound for the convergence rate of θ̂(π̂) to θ̂(π_0). Since θ̂(π_0) is asymptotically normal and unbiased, we can use this result to characterize the asymptotic distribution of θ̂(π̂).

4.4.1 Uniform Consistency of θ̂(π)

We are going to show uniform consistency of θ̂(π) over a large class of distributions π(α|x) for the asymptotics N,T → ∞. Our strategy for doing so is to relate the integrated likelihood that was defined in (4.6) to the profile likelihood, which reads

  L^p_{NT}(\theta) = \frac{1}{NT} \sum_{i=1}^N \max_{\alpha \in A} \log f(Y_i | X_i, \alpha; \theta) = \frac{1}{NT} \sum_{i=1}^N \log f(Y_i | X_i, \hat\alpha^p_i(\theta); \theta),   (4.17)

where α̂^p_i(θ) = argmax_{α∈A} f(Y_i | X_i, α; θ). The profile likelihood is the one that is maximized in the fixed effect approach, and the corresponding fixed effect estimator for the parameters of interest reads

  \hat\theta^p = \operatorname*{argmax}_{\theta \in \Theta} L^p_{NT}(\theta).   (4.18)

Consistency of θ̂^p is well-established in the literature on large T panel data. Our goal is therefore to show θ̂(π) = θ̂^p + o_p(1). In order for this to hold, the key condition on π(α|x) is the existence of a lower bound π^low_T(α|x) > 0, i.e. we impose the condition π(α|x) ≥ π^low_T(α|x), or equivalently π ∈ Π^low_T in the notation introduced above. Further regularity conditions on the model and on π^low_T(α|x) are presented in the appendix.

Theorem 4.1 (Consistency). Let Assumption C.1 be satisfied and let N,T → ∞. Then we have

  \sup_{\pi \in \Pi^{low}_T} \big\| \hat\theta(\pi) - \theta_0 \big\| = o_p(1).

The proof of the theorem is based on relating the integrated likelihood function L_NT(θ,π) to the profile likelihood function L^p_NT(θ). In general, as long as π(α|x) integrates to one, we have L_NT(θ,π) ≤ L^p_NT(θ) (by the mean value theorem for integration).
To prove the theorem we need to show that the opposite inequality also holds up to o_p(1), in order to conclude that L_NT(θ,π) = L^p_NT(θ) + o_p(1), within a neighborhood of θ_0 and uniformly over π ∈ Π^low_T. For this step, the lower bound on π(α|x) is required. Finally, under appropriate regularity conditions we know from the fixed effect panel literature that L^p_NT(θ) has a well-separated global maximum close to θ_0, so that the same must be true for L_NT(θ,π), which leads us to conclude θ̂(π) = θ_0 + o_p(1), uniformly over π ∈ Π^low_T. For details we refer to the appendix.

Our assumptions allow the lower bound π^low_T(α|x) to converge to zero at an arbitrary polynomial rate in T as T → ∞, i.e. the constraint to have a lower bound for π(α|x) quickly becomes less and less restrictive as T increases.

It is crucial that the consistency result for θ̂(π) holds uniformly over the set Π^low_T, since ultimately we want to consider θ̂(π̂), where π̂ is an estimator for π_0, i.e. is random. The uniform consistency result then shows that θ̂(π̂) is consistent as long as π̂ takes values in Π^low_T.

4.4.2 Score and Hessian of the Integrated Likelihood

Having established uniform consistency of θ̂(π), we are now going to study the local properties of the integrated likelihood function L_NT(θ,π) as a function of θ around the true parameter θ_0, in particular the score and the Hessian of L_NT(θ,π) at θ_0. This local analysis of L_NT(θ,π) can then be combined with the uniform consistency result to derive the asymptotic properties of θ̂(π).

While it is sufficient for consistency of θ̂(π) to impose a lower bound on π(α|x), it now becomes crucial to also impose a smoothness condition on the density π(α|x), in the form of a generalized Lipschitz condition. To formulate the Lipschitz condition we require a distance measure on the space of individual effects A. For each x ∈ X^T let d_x : A × A → R be a measurable non-negative function, such that d_x(α,α) = 0 for all α ∈ A. In many applications one might simply set d_x(α,β) = ||α − β||. However, in general d_x(α,β) need neither be symmetric nor satisfy a triangle inequality. We define the following parameter set for π:

  \Pi^{lip}_{T,\kappa} = \big\{ \pi \in \Pi^A_T : \forall x \in X^T, \forall \alpha,\beta \in A : |\pi(\alpha|x) - \pi(\beta|x)| \le \kappa_T \, \pi(\alpha|x) \, d_x(\alpha,\beta) \big\}.   (4.19)

This is the set of all conditional distributions that satisfy a Lipschitz type condition with generalized Lipschitz constant κ_T. In contrast to a standard Lipschitz condition, π(α|x) also appears on the right hand side of the condition. This is natural here, since it is important to control the magnitude of relative changes of π(α|x) across α, as opposed to absolute changes. Locally, the condition is simply a standard Lipschitz condition on log π(α|x).

Theorem 4.2. Consider the limit N,T → ∞ and let Assumption C.2(iii) and (vi) be satisfied. Let κ_T > 0 be a series such that κ_T^{-1} = O(1) and κ_T = O(T). Then

  \sup_{\pi_1, \pi_2 \in \Pi^{lip}_{T,\kappa}} \left\| \frac{\partial L_{NT}(\theta_0, \pi_1)}{\partial \theta} - \frac{\partial L_{NT}(\theta_0, \pi_2)}{\partial \theta} \right\| = O_p\!\left( \frac{\kappa_T}{T} \right),
  \sup_{\pi_1, \pi_2 \in \Pi^{lip}_{T,\kappa}} \left\| \frac{\partial^2 L_{NT}(\theta_0, \pi_1)}{\partial \theta \, \partial \theta'} - \frac{\partial^2 L_{NT}(\theta_0, \pi_2)}{\partial \theta \, \partial \theta'} \right\| = O_p\!\left( \frac{\kappa_T}{\sqrt{T}} \right).

Again, it is crucial that the result of Theorem 4.2 holds uniformly over π, since we are ultimately interested in applications where π is replaced by the estimator π̂. We focus on the result for the score. The theorem implies that all scores ∂L_NT(θ_0,π)/∂θ for different π(α|x) become asymptotically close to each other, as long as π(α|x) satisfies a Lipschitz type condition with κ_T = o(T). The most interesting application of the theorem is obtained by setting π_1 = π_0. The score ∂L_NT(θ_0,π)/∂θ is unbiased for π = π_0.
Assuming that π_0 ∈ Π^lip_{T,κ}, we can thus conclude from the theorem that the bias of the score of the integrated likelihood is at most of order κ_T/T for all π ∈ Π^lip_{T,κ}.

The bound on the bias of the score provided by Theorem 4.2 is independent of π. However, we clearly expect the bias of the score to be small whenever π is close to the true distribution π_0, and in the following we formalize this intuition. To do so, we decompose the score of the integrated likelihood as follows:

  \frac{\partial L_{NT}(\theta_0, \pi)}{\partial \theta} = \frac{\partial L_{NT}(\theta_0, \pi_0)}{\partial \theta} + \frac{\partial \bar L_{NT}(\theta_0, \pi)}{\partial \theta} + \frac{\nu_{NT}(\pi)}{\sqrt{NT}},   (4.20)

where L̄_NT(θ,π) is the expected likelihood conditional on the regressors X_1,...,X_N, which was introduced in equation (4.11), and ν_NT(π) is given by

  \nu_{NT}(\pi) = \frac{1}{\sqrt{NT}} \sum_{i=1}^N \nu_{NT,i}(\pi),
  \nu_{NT,i}(\pi) = \frac{\partial \log f_{Y|X}(Y_i|X_i;\theta_0,\pi)}{\partial \theta} - \frac{\partial \log f_{Y|X}(Y_i|X_i;\theta_0,\pi_0)}{\partial \theta} - E\!\left[ \frac{\partial \log f_{Y|X}(Y_i|X_i;\theta_0,\pi)}{\partial \theta} \,\Big|\, X_i \right].   (4.21)

In equation (4.20) the first term of the decomposition is the score at π_0, which is unbiased, i.e. the bias of the integrated likelihood score originates from the remaining two terms. In the following we provide bounds on these two terms.

Theorem 4.3. Let κ_T > 0 be a series such that κ_T^{-1} = O(1) and κ_T = O(T), and assume that there exists an upper bound π^up_T(α|x) such that π_0(α|x) ≤ π^up_T(α|x) for all α ∈ A and x ∈ X^T. Let Assumption C.2(iv) be satisfied and consider the limit N,T → ∞. Then there exists a series of random variables C_T > 0 with C_T = O_p(1) such that for all π ∈ Π^lip_{T,κ} ∩ Π^up_T

  \left\| \frac{\partial \bar L_{NT}(\theta_0, \pi)}{\partial \theta} \right\| \le C_T \, \frac{\kappa_T}{T} \, D_H(\pi, \pi_0).

Note that C_T is independent of π, so that the theorem can also be applied when an estimator π̂ is plugged in. The notation Π^up_T for the set of all conditional probability densities that are bounded from above by π^up_T(α|x) was introduced earlier.

At the true parameters we know that ∂L̄_NT(θ_0,π_0)/∂θ = 0, and one intuitively expects that the expected score ∂L̄_NT(θ_0,π)/∂θ should be close to zero whenever π is close to π_0. Theorem 4.3 formalizes this intuition and shows that an appropriate distance measure for the individual effect distributions is the Hellinger distance.

Next, we discuss the asymptotic properties of ν_NT(π), which is the sum over the differences of individual score functions minus the mean of this difference (the score at π_0 is mean zero, so this term does not appear explicitly in equation (4.21)).

Lemma 4.4. Let Assumption C.2(i) and (ii) be satisfied, consider the limit N,T → ∞, and let κ_T > 0 be a series such that κ_T^{-1} = O(1). For all series π_T with π_T ∈ Π^lip_{T,κ} we have E[(ν_NT(π_T))²] = O(κ_T²/T), and therefore ν_NT(π_T) = O_p(κ_T/√T).

From this lemma we conclude that ν_NT(π) = o_p(1) for those π that satisfy a generalized Lipschitz condition with κ_T = o(√T).⁴ However, the lemma does not make a uniform statement over all π that satisfy this condition. Therefore we cannot apply the lemma when π is replaced by the estimator π̂. Another way of seeing that there is a complication here is to realize that the result ν_NT(π) = O_p(κ_T/√T) is derived in the lemma by evaluating the second moment of ν_NT(π), using the fact that the ν_NT,i(π) are mean zero and independent across i conditional on the regressors. This, however, only holds for non-stochastic π, and the argument breaks down when the estimator π̂ is plugged in. The problem of generalizing a pointwise convergence or boundedness result to the corresponding uniform result is well-known in the literature on empirical processes (see e.g. Andrews (1994) for a review).
The key in going from the pointwise to the uniform result is to impose a condition (e.g. a stochastic equicontinuity condition) that guarantees that the parameter set (in our case $\Pi_T$) is sufficiently "small" in an appropriate sense. This problem is therefore directly related to our discussion in Section 4.3.1, where we argued that the space of conditional densities $\pi(\alpha|x)$ is too "large" for our purposes and needs to be restricted. Instead of making such a restriction explicit here, we impose the following high-level assumption, which has the advantage that our results in this section can be applied under various restrictions on $\Pi_T$ that satisfy this assumption.

Assumption 4.1. $\sup_{\pi \in \Pi_T} \|\nu_{NT}(\pi)\| = o_p(1)$.

Using this assumption and applying the preceding results of this section, we obtain the following corollary.

Corollary 4.5. Let $\kappa_T > 0$ be a series such that $\kappa_T^{-1} = O(1)$ and $\kappa_T = o(\sqrt{T})$, and assume that an upper bound $\pi^{\rm up}_T(\alpha|x)$ exists such that $\pi^0(\alpha|x) \le \pi^{\rm up}_T(\alpha|x)$ for all $\alpha \in \mathcal{A}$ and $x \in \mathcal{X}_T$. Let Assumptions C.2 and 4.1 be satisfied and consider the limit $N,T \to \infty$. Let $\hat\pi$ be an estimator that takes values in $\Pi_T \subset \Pi^{\rm lip}_{T,\kappa} \cap \Pi^{\rm up}_T$ for all $T$. Then we have
\[
\hat\theta(\hat\pi) - \hat\theta(\pi^0) = O_p\!\left( \frac{\kappa_T}{T}\, D_H(\hat\pi,\pi^0) \right) + o_p\!\left( \frac{1}{\sqrt{NT}} \right).
\]

This result holds independently of whether $\hat\pi$ is the joint maximum likelihood estimator introduced in equation (4.8), i.e. it can also be applied to alternative estimation procedures for the individual effect distribution. It is interesting to compare Corollary 4.5 to Theorem 4 in Arellano and Bonhomme (2009), which makes a very similar statement. They use the $L_2$ Kullback-Leibler loss instead of the Hellinger distance, assume a parametric specification for $\pi$, and impose a random effect assumption. The key difference, however, is that they need to impose that $N/T$ converges to a constant, while we impose no restriction on how $N$ and $T$ go to infinity. This means that our result can be applied to discuss higher order bias correction of $\hat\theta(\hat\pi)$.

Note that $\hat\theta(\pi^0)$ is asymptotically unbiased and that the term $o_p(1/\sqrt{NT})$ can be ignored in the limiting distribution of $\hat\theta(\hat\pi)$, since it is small compared to the asymptotic standard error of $\hat\theta(\hat\pi)$. Thus, according to Corollary 4.5, the asymptotic bias of $\hat\theta(\hat\pi)$ crucially depends on the asymptotics of $D_H(\hat\pi,\pi^0)$, i.e. how fast $\hat\pi$ converges to the true distribution in terms of the Hellinger distance. This is what we study next.

4.4.3 Joint Maximum Likelihood Estimation of $\theta$ and $\pi$

We now study the properties of the joint maximum likelihood estimators $\hat\theta$ and $\hat\pi$ that were introduced in (4.8). The profile likelihood $\mathcal{L}_{NT}(\pi)$ was defined in (4.7). A Taylor expansion in $\theta$ gives
\[
\mathcal{L}_{NT}(\pi) = \mathcal{L}_{NT}(\theta^0,\pi)
+ \big( \hat\theta(\pi) - \theta^0 \big)' \frac{\partial \mathcal{L}_{NT}(\theta^0,\pi)}{\partial\theta}
+ \frac{1}{2} \big( \hat\theta(\pi) - \theta^0 \big)' \frac{\partial^2 \mathcal{L}_{NT}(\theta^0,\pi)}{\partial\theta\,\partial\theta'} \big( \hat\theta(\pi) - \theta^0 \big)
+ O_p\!\left( \big\| \hat\theta(\pi) - \theta^0 \big\|^3 \right). \tag{4.22}
\]
Using our results on $\hat\theta(\pi)$, as well as the bounds on the score and Hessian from the last subsection, it is easy to provide appropriate bounds on the terms in $\mathcal{L}_{NT}(\pi)$ that involve $\hat\theta(\pi)$. The new task here is to handle the term $\mathcal{L}_{NT}(\theta^0,\pi)$.
We have
\[
\mathcal{L}_{NT}(\theta^0,\pi) = \mathcal{L}_{NT}(\theta^0,\pi^0)
- \frac{1}{T}\, D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)
- \frac{1}{T\sqrt{N}}\, \psi_{NT}(\pi), \tag{4.23}
\]
where we have defined
\[
\psi_{NT}(\pi) = \sqrt{N}\,T \left[ \mathcal{L}_{NT}(\theta^0,\pi^0) - \mathcal{L}_{NT}(\theta^0,\pi) - \bar{\mathcal{L}}_{NT}(\theta^0,\pi^0) + \bar{\mathcal{L}}_{NT}(\theta^0,\pi) \right]
= \frac{1}{\sqrt{N}} \sum_{i=1}^N \left\{ \log \frac{f_{Y|X}(Y_i|X_i;\theta^0,\pi^0)}{f_{Y|X}(Y_i|X_i;\theta^0,\pi)} - \mathbb{E}\!\left[ \log \frac{f_{Y|X}(Y_i|X_i;\theta^0,\pi^0)}{f_{Y|X}(Y_i|X_i;\theta^0,\pi)} \,\Big|\, X_i \right] \right\}. \tag{4.24}
\]
Note that $\psi_{NT}(\pi)$ and $\nu_{NT}(\pi)$ are closely related: $\nu_{NT}(\pi)$ is obtained from $\psi_{NT}(\pi)$ by taking the derivative with respect to $\theta^0$ and multiplying by minus $T^{-1/2}$. Both are zero-mean empirical processes with index $\pi$. Analogously to $\nu_{NT}(\pi)$, we can bound $\psi_{NT}(\pi)$ pointwise by evaluating the second moment, as stated in the following lemma.

Lemma 4.6. Let Assumption C.3 be satisfied, let $\pi^0 \in \Pi^{\rm up}_T$ and let $\pi_T \in \Pi^{\rm low}_T$. Then we have
\[
\psi_{NT}(\pi_T) = O_p(1)\, \sqrt{ D_{KL}\big( f_Y(\pi_T) \,\|\, f_Y(\pi^0) \big) }.
\]

However, just as we discussed for $\nu_{NT}(\pi)$ above, the pointwise bound is not sufficient for our purposes, and we impose a high-level assumption to guarantee the uniform bound. Satisfying this assumption again requires constraining the parameter set $\Pi_T$ appropriately.

Assumption 4.2. Let there exist $\kappa_T > 0$ with $\kappa_T^{-1} = O(1)$ and $\kappa_T = o(\sqrt{T})$, and a series of random numbers $c_T = o_p(1)$, such that $\Pi_T \subset \Pi^{\rm lip}_{T,\kappa}$ and
\[
\forall \pi \in \Pi_T:\quad \big| \psi_{NT}(\pi) \big| \le c_T\, \frac{\sqrt{T}}{\kappa_T}\, \sqrt{ D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big) }.
\]

Apart from $\psi_{NT}(\pi)$, the other term on the right-hand side of equation (4.23) that depends on $\pi$ is $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$. If we consider the population level in the cross-sectional dimension and assume $\theta^0$ is known, then maximizing the expected likelihood over $\pi$ is equivalent to minimizing the Kullback-Leibler divergence $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$. However, the model may not be identified, i.e. minimizing $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$ may not necessarily give $\pi = \pi^0$. It is, however, crucial for our discussion that minimizing $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$ over $\pi \in \Pi_T$ yields a small value of $D_H(\pi,\pi^0)$, at least for sufficiently large values of $T$, because in the last section we have shown that the convergence rate of $D_H(\hat\pi,\pi^0)$ determines the rate at which the bias of $\hat\theta$ converges to zero as $T \to \infty$. This motivates the following assumption.

Assumption 4.3. Let there exist $\mu_T = o(1)$ and a series of random numbers $C_T = O_p(1)$ such that

(i) $\inf_{\pi \in \Pi_T} D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big) = O_p(\mu_T^2)$;

(ii) $\forall \pi \in \Pi_T:\ D_H(\pi,\pi^0) \le C_T \left[ \sqrt{ D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big) } + \mu_T \right]$.

Assumption 4.3(i) demands that the true distribution of the dependent variable $Y$ can be approximated better and better (as $T \to \infty$), in terms of the Kullback-Leibler divergence, by a distribution of $Y$ that is implied by some $\pi \in \Pi_T$. This assumption is trivially satisfied if $\pi^0 \in \Pi_T$, but unless we are willing to assume a parametric form for $\pi$ it is not reasonable to expect $\pi^0 \in \Pi_T$. For the semiparametric approach we have to choose $\Pi_T$ such that every distribution $\pi^0$ that satisfies certain regularity conditions can be approximated well by an element of $\Pi_T$. The rate of convergence of this approximation of $\pi^0$ is given by $\mu_T$; this rate crucially depends on the (smoothness) properties of $\pi^0$. In this sense Assumption 4.3(i) requires $\Pi_T$ to be sufficiently "large". In contrast, Assumption 4.3(ii) demands that $\Pi_T$ be sufficiently "small": it should not contain ill-behaved distributions for which $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$ is small while $D_H(\pi,\pi^0)$ is not. Note that $\mu_T$ appears on the right-hand side of Assumption 4.3(ii), i.e.
the assumption allows for the possibility that $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$ is (close to) zero while $D_H(\pi,\pi^0)$ is not, which is important since the model may not be identified at finite $T$. However, as $T$ becomes large, the assumption requires this identification problem to vanish at the rate $\mu_T$.

We have $D_H(\pi,\pi^0) \le \sqrt{D_{KL}(\pi\|\pi^0)}$, which is a general relation between the Hellinger distance and the Kullback-Leibler divergence. Thus, a slightly stronger form of Assumption 4.3(ii) is obtained by replacing $D_H(\pi,\pi^0)$ with $\sqrt{D_{KL}(\pi\|\pi^0)}$ on the left-hand side of the assumption. This is interesting, because by the "chain rule for the Kullback-Leibler divergence" we have $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big) \le D_{KL}(\pi\|\pi^0)$. Thus, Assumption 4.3(ii) essentially requires that this general inequality can be reverted for $\pi \in \Pi_T$, allowing for some random pre-factor $C_T$ and some "slackness" $\mu_T$.

Theorem 4.7. Let Assumptions 4.1, 4.2, 4.3 and C.2 be satisfied. Then we have
\[
\text{(i)}\quad D_H(\hat\pi,\pi^0) = O_p(\mu_T) + o_p\!\left( \frac{1}{\kappa_T} \sqrt{\frac{T}{N}} \right),
\qquad
\text{(ii)}\quad \hat\theta - \hat\theta(\pi^0) = O_p\!\left( \frac{\kappa_T\, \mu_T}{T} \right) + o_p\!\left( \frac{1}{\sqrt{NT}} \right).
\]

We want to give some intuition for the proof of this theorem. Consider $\bar\pi = \operatorname{argmax}_{\pi \in \Pi_T} \bar{\mathcal{L}}_{NT}(\theta^0,\pi)$, which is an infeasible "estimator" for $\pi^0$ based on the expected likelihood and on knowledge of $\theta^0$. It is easy to see that Assumption 4.3 implies $D_H(\bar\pi,\pi^0) = O_p(\mu_T)$. Part (i) of Theorem 4.7 generalizes this result to the feasible estimator $\hat\pi$. The fact that $\hat\pi$ is obtained from $\mathcal{L}_{NT}(\theta,\pi)$ instead of the expected likelihood $\bar{\mathcal{L}}_{NT}(\theta,\pi)$ can be controlled by the decomposition (4.23) and Assumption 4.2, and results in the additional term $o_p\big( \kappa_T^{-1} \sqrt{T/N} \big)$. This term also accounts for the fact that $\theta^0$ is not known and $\hat\theta(\pi)$ is used instead; the results of the previous subsection are crucial to show this. The second part of Theorem 4.7 then follows directly from the first part and Corollary 4.5 above. For the details we refer to the appendix.

Part two of Theorem 4.7 provides a bound on the difference between the feasible maximum likelihood estimator $\hat\theta$ and the infeasible "miracle" maximum likelihood estimator $\hat\theta(\pi^0)$ that could be obtained if the distribution of the individual effects were known. Under appropriate regularity conditions one can show that $\hat\theta(\pi^0)$ is asymptotically normal and unbiased. Unbiasedness stems from the fact that the score $\partial \mathcal{L}_{NT}(\theta^0,\pi)/\partial\theta$ is unbiased for $\pi = \pi^0$. Asymptotic normality can, for example, be shown by relating $\hat\theta(\pi^0)$ to the fixed effect estimator $\hat\theta^p$, whose asymptotic theory is well studied in the large-$T$ literature. For the sake of generality, we formulate this as an assumption, which can be justified in different ways.

Assumption 4.4. As $N,T \to \infty$ we have $\sqrt{NT}\,\big( \hat\theta(\pi^0) - \theta^0 \big) \to_d \mathcal{N}\big( 0,\ \mathcal{I}_0^{-1} \big)$, for a positive definite $K \times K$ matrix $\mathcal{I}_0$.

From Theorem 4.7(ii) we then obtain the following corollary.

Corollary 4.8. Let the assumptions of Theorems 4.1 and 4.7 as well as Assumption 4.4 be satisfied, and let $N = o\big( T/(\mu_T^2 \kappa_T^2) \big)$ for $N,T \to \infty$. Then we have
\[
\sqrt{NT}\,\big( \hat\theta - \theta^0 \big) \to_d \mathcal{N}\big( 0,\ \mathcal{I}_0^{-1} \big).
\]

Thus, as long as $N$ does not grow too fast relative to $T$ as $N,T \to \infty$, we find that $\hat\theta$ is asymptotically unbiased. How rapidly $N$ is allowed to go to infinity relative to $T$ is specified by $\mu_T$ and $\kappa_T$, the details of which depend on the particular low-level assumptions that are made to justify the high-level assumptions in this section. One concrete example is discussed in the following.

4.5 Generalized Random Effects

To apply the results of the last section one needs to satisfy the high-level Assumptions 4.1, 4.2 and 4.3.
Assumptions 4.1 and 4.2 demand specific uniform bounds on the empirical processes $\nu_{NT}(\pi)$ and $\psi_{NT}(\pi)$. We gave conditions under which these bounds are satisfied pointwise in Lemmas 4.4 and 4.6, but in order to show the uniform result over $\pi \in \Pi_T$ one needs to impose restrictions on the parameter set of allowed individual effect distributions $\Pi_T$. Since Assumption 4.3(i) demands that the true distribution $\pi^0(\alpha|x)$ be well approximated by an element of $\Pi_T$ (strictly speaking, Assumption 4.3(i) only requires the outcome variable distributions implied by $\pi^0$ and by some $\pi \in \Pi_T$ to be close to each other, but the easiest way to satisfy this assumption is to show that $\pi^0$ itself can be well approximated by $\pi \in \Pi_T$ in terms of the Kullback-Leibler distance), the restrictions on $\Pi_T$ require us to also impose restrictions on $\pi^0$. In particular, the unrestricted correlated random effect model, where apart from basic regularity conditions no restriction is imposed on $\pi(\alpha|x)$, is ruled out by these high-level assumptions, unless the regressor domain $\mathcal{X}_T$ is discrete and only contains a small number of elements relative to $N$. (A small-dimensional continuous support $\mathcal{X}_T$ can also be admissible, as long as a smoothness assumption on $\pi(\alpha|x)$ as a function of $x$ is imposed.) If $\mathcal{X}_T$ contains too many elements, then the set of conditional distributions $\pi(\alpha|x)$ is too rich, so that the uniform bounds in Assumptions 4.1 and 4.2 cannot be satisfied. In other words, we face a curse of dimensionality in estimating the conditional distribution $\pi(\alpha|x)$ if the domain of the conditioning variables is too large.

In order to overcome this problem one needs to restrict the correlation structure between the individual effects and the regressors. Different ways of doing this are conceivable. Here, we explore the generalized random effect restriction, which assumes a partitioning of $\mathcal{X}_T$ into groups and imposes a random effect assumption within each group. Let $\mathcal{G}_T$ be a known partition of $\mathcal{X}_T$ into $G_T$ groups. We assume that the distributions $\pi(\alpha|x)$ and $\pi(\alpha|\tilde x)$ are identical if $x$ and $\tilde x$ belong to the same group, i.e. if $x,\tilde x \in g$ with $g \in \mathcal{G}_T$. The set of distributions for which this constraint holds is given by
\[
\Pi^{\mathcal{G}}_T = \left\{ \pi \in \Pi^{\mathcal{A}}_T \;\Big|\; \forall g \in \mathcal{G}_T,\ \forall x,\tilde x \in g,\ \forall \alpha \in \mathcal{A}:\ \pi(\alpha|x) = \pi(\alpha|\tilde x) \right\}. \tag{4.25}
\]
We assume $\pi^0 \in \Pi^{\mathcal{G}}_T$, i.e. the true distribution satisfies the generalized random effect assumption, and we restrict the parameter set $\Pi_T$ to be a subset of $\Pi^{\mathcal{G}}_T$. The generalized random effect assumption reduces the dimension of $\Pi_T$ dramatically, compared to the unrestricted correlated random effect case. Once this assumption and some further regularity conditions on $\Pi_T$ are imposed, one can use methods from empirical process theory to show that the uniform bounds in Assumptions 4.1 and 4.2 are satisfied, as long as the number of groups $G_T$ increases sufficiently slowly as $N,T \to \infty$.

4.5.1 Imposing an Appropriate Smoothness Assumption

We now discuss how to choose $\Pi_T \subset \Pi^{\mathcal{G}}_T$ such that Assumption 4.3 is satisfied, and which rates $\mu_T$ and $\kappa_T$ can be obtained. Assumption 4.3(ii) is an approximate identification condition for $\pi$ within $\Pi_T$. For given $\theta^0$ (note that $\theta^0$ enters the definition of $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$), the assumption demands that $\pi$ be close to $\pi^0$ (in terms of the Hellinger distance) whenever the distributions for the outcome variable $Y$ that are implied by $\pi$ and $\pi^0$ are close to each other (in terms of the Kullback-Leibler divergence). However, this identification of $\pi$ from the distribution of $Y$ is only required approximately, since $\mu_T$ appears as a slackness parameter in the assumption.
Only as $T$ becomes large does the slackness $\mu_T$ converge to zero, so that the individual effect distribution is identified in the limit $T \to \infty$.

Determining the distribution $\pi(\alpha|x)$ from the distribution of $Y$ is an ill-posed inverse problem for fixed $T$, as discussed e.g. in Bonhomme (2010). One way to understand why this problem is solved asymptotically as $T$ becomes large is to ask how one could estimate $\pi(\alpha|x)$ within the fixed effect approach. The fixed effect approach starts from the realizations $\alpha^0_i$ of the individual effects of each cross-sectional unit, which are distributed according to $\pi(\alpha|X_i)$. If the realizations $\alpha^0_i$ were observed, then, once the generalized random effect assumption is imposed, one could estimate $\pi(\alpha|x)$ consistently for fixed $T$ as $N \to \infty$ by using standard kernel density estimation within each group $g \in \mathcal{G}_T$. In reality the $\alpha^0_i$ are not observed, but the fixed effect approach provides estimators $\hat\alpha^p_i$, which are obtained by maximizing the profile likelihood jointly with the parameters of interest. As $T$ becomes large, we have $\hat\alpha^p_i = \alpha^0_i + O_p(T^{-1/2})$. Thus, a kernel density estimator over the $\hat\alpha^p_i$ within each group provides an estimator for $\pi(\alpha|x)$ which is consistent as both $N$ and $T$ become large.

We do not consider the kernel density estimator over the $\hat\alpha^p_i$ further, because it does not allow for higher order bias correction (it gives some fixed convergence rate of $D_H(\hat\pi,\pi^0)$ in $T$, which is not optimal). However, apart from showing that $\pi(\alpha|x)$ can be consistently estimated as $N,T \to \infty$, this estimator illustrates another very important point: for a given value of $T$, the model intrinsically determines the "resolution" of the individual effects. In the fixed effect approach, this "resolution" is described by the variance of $\hat\alpha^p_i$. For large values of $T$ the variance of $\hat\alpha^p_i$ is approximated by $T^{-1} \mathcal{I}^{-1}(\alpha^0_i,\theta^0,X_i)$, where $\mathcal{I}(\alpha,\theta,x)$ is the information matrix of the model with respect to the individual effects, namely
\[
\mathcal{I}(\alpha,\theta,x) = -\,\mathbb{E}\!\left[ \frac{1}{T}\, \frac{\partial^2 \log f(Y_i|X_i,\alpha_i;\theta)}{\partial\alpha\,\partial\alpha'} \;\Big|\; X_i = x,\ \alpha_i = \alpha,\ \theta \right]
= -\,\frac{1}{T} \int_{y \in \mathcal{Y}^T} \frac{\partial^2 \log f(y|x,\alpha;\theta)}{\partial\alpha\,\partial\alpha'}\, f(y|x,\alpha;\theta)\, dy. \tag{4.26}
\]
This information matrix also plays an important role for our maximum likelihood estimator of $\pi(\alpha|x)$, and for the question of how to choose $\Pi_T$ such that Assumption 4.3(ii) is satisfied. To understand this, consider a quadratic expansion of $\log f(Y_i|X_i,\alpha;\theta)$ in $\alpha$ around its maximum value $\hat\alpha^p_i = \hat\alpha^p_i(\theta) = \operatorname{argmax}_\alpha f(Y_i|X_i,\alpha;\theta)$. For large $T$ we have
\[
f(Y_i|X_i,\alpha;\theta) \approx f(Y_i|X_i,\hat\alpha^p_i;\theta)\, \exp\!\left[ \frac{1}{2}\, \big( \alpha - \hat\alpha^p_i \big)'\, \frac{\partial^2 \log f(Y_i|X_i,\hat\alpha^p_i;\theta)}{\partial\alpha\,\partial\alpha'}\, \big( \alpha - \hat\alpha^p_i \big) \right]
\approx f(Y_i|X_i,\hat\alpha^p_i;\theta)\, \exp\!\left[ -\frac{T}{2}\, \big( \alpha - \hat\alpha^p_i \big)'\, \mathcal{I}(\hat\alpha^p_i,\theta,X_i)\, \big( \alpha - \hat\alpha^p_i \big) \right].
\]
In the last line we used that, under appropriate regularity conditions, the Hessian converges to its expected value as $T$ becomes large. Thus, as a function of $\alpha$, $f(Y_i|X_i,\alpha;\theta)$ approximates a non-normalized multivariate normal pdf with mean $\hat\alpha^p_i$ and variance $\mathcal{I}^{-1}(\hat\alpha^p_i,\theta,X_i)/T$. This variance therefore describes how fast $f(Y_i|X_i,\alpha;\theta)$ varies as a function of $\alpha$, which is important: according to equation (4.1), the distribution of $Y_i$ given $X_i$ depends on $\pi(\alpha|X_i)$ only via integration over $\alpha$, with $f(Y_i|X_i,\alpha;\theta)$ also appearing in the integrand. Therefore, variations in $\pi(\alpha|X_i)$ over $\alpha$ that are more rapid than the variations in $f(Y_i|X_i,\alpha;\theta)$ over $\alpha$ will tend to be "washed out" by the integration, i.e. very rapid fluctuations in $\pi(\alpha|X_i)$ have very little effect on the distribution of $Y_i$ given $X_i$.
Thus, if we do not impose an appropriate smoothness assumption on $\pi(\alpha|x)$, then $D_{KL}\big( f_Y(\pi) \,\|\, f_Y(\pi^0) \big)$ can be close to zero while $\pi$ and $\pi^0$ are quite different. How smooth $\pi(\alpha|x)$ needs to be at a particular value of $\alpha$ is determined by $\mathcal{I}^{-1}(\alpha,\theta^0,x)/T$.

To specify an appropriate parameter set $\Pi_T$ that accounts for this smoothness requirement, we first need to introduce some further notation. Let $\phi(\alpha;\beta,\Omega)$ be the multivariate normal pdf with mean $\beta$ and variance matrix $\Omega$. For $x \in \mathcal{X}_T$ and $\alpha,\beta \in \mathcal{A}$ we define the kernel function
\[
K^{\Omega}_T(\alpha,\beta;x) = \frac{\phi\big( \alpha;\, \beta,\, \Omega_T(\beta,x) \big)}{\int_{\mathcal{A}} \phi\big( \gamma;\, \beta,\, \Omega_T(\beta,x) \big)\, d\gamma}, \tag{4.27}
\]
where $\Omega_T(\beta,x)$ is a positive definite $M \times M$ matrix for all values of $T$, $\beta$ and $x$. In order to be compatible with the generalized random effect assumption we impose the restriction $\Omega_T(\beta,x) = \Omega_T(\beta,\tilde x)$ for all $x,\tilde x$ in the same group. When the kernel $K^{\Omega}_T$ is applied to $\pi \in \Pi^{\mathcal{G}}_T$ one obtains another conditional distribution in $\Pi^{\mathcal{G}}_T$, namely
\[
\big( K^{\Omega}_T \pi \big)(\alpha|x) = \int_{\mathcal{A}} K^{\Omega}_T(\alpha,\beta;x)\, \pi(\beta|x)\, d\beta. \tag{4.28}
\]
Using this kernel function we now define the parameter set of individual effect distributions as follows:
\[
\Pi_T = K^{\Omega}_T \Big( \Pi^{\mathcal{G}}_T \cap \Pi^{\rm up}_T \cap \Pi^{\rm low}_T \Big). \tag{4.29}
\]
This is the set of all distributions that can be generated by applying the kernel $K^{\Omega}_T$ to an element of $\Pi^{\mathcal{G}}_T \cap \Pi^{\rm up}_T \cap \Pi^{\rm low}_T$ (the set of distributions that satisfy the generalized random effect assumption as well as appropriate upper and lower bounds). We assume $\pi^0 \in \Pi^{\rm up}_T \cap \Pi^{\rm low}_T$. The parameter set satisfies $\Pi_T \subset \Pi^{\mathcal{G}}_T$. We impose the upper and lower bound restrictions for technical reasons; in our Monte Carlo simulations below we do not impose any upper or lower bound restriction on $\pi(\alpha|x)$, i.e. we simply use $\Pi_T = K^{\Omega}_T \Pi^{\mathcal{G}}_T$, which turns out not to affect the good performance of the estimators.

Since applying $K^{\Omega}_T$ amounts to Gaussian kernel smoothing with variance $\Omega_T(\alpha,x)$, the smoothness of the distributions in $\Pi_T$ depends on the choice of the variance matrix $\Omega_T(\alpha,x)$: the larger $\Omega_T(\alpha,x)$, the smoother the distributions in $\Pi_T$. Motivated by the above discussion, we choose
\[
\Omega_T(\alpha,x) = \frac{\rho_T}{T} \left[ \frac{1}{N_{g(x)}} \sum_{i \in \{1,\dots,N\}:\ X_i \in g(x)} \mathcal{I}\big( \alpha,\, \tilde\theta,\, X_i \big) \right]^{-1}, \tag{4.30}
\]
where $\tilde\theta$ is some preliminary consistent estimator for $\theta^0$ (e.g. the fixed effect maximum likelihood estimator), $\rho_T > 0$ is a scalar bandwidth parameter, $g(x) \in \mathcal{G}_T$ denotes the group to which $x \in \mathcal{X}_T$ belongs (i.e. $x \in g(x)$), and $N_{g(x)}$ is the number of observed individuals in this group, which is also the set of individuals summed over in (4.30). For the bandwidth we require
\[
\frac{\rho_T}{\log T} \to \text{const.} > 0, \qquad \text{as } N,T \to \infty. \tag{4.31}
\]
We assume that $\pi^0(\alpha|x)$ satisfies the generalized random effect assumption and is $r$ times continuously differentiable in $\alpha$ with bounded derivatives, where $r \ge 1$. Under appropriate further regularity conditions one can then show that Assumption 4.3 is satisfied for the above choice of $\Pi_T$ and $\Omega_T$ with
\[
\kappa_T = \sqrt{\frac{T}{\log T}}, \qquad \mu_T = \left( \frac{\log T}{T} \right)^{r/2}. \tag{4.32}
\]
By Theorem 4.7 the bias of $\hat\theta$ converges at the rate $\kappa_T \mu_T / T$, which now equals $T^{-1} \big[ (\log T)/T \big]^{(r-1)/2}$. Thus, the smoother $\pi^0$ is, the faster the bias of $\hat\theta$ converges to zero.

Alternative choices for the parameter set $\Pi_T$ are conceivable. The advantage of defining $\Pi_T$ as the image of a Gaussian kernel smoothing operator is that the smoothness properties of $\pi(\alpha|x)$ can be controlled separately for each value of $\alpha$ and within each group, in terms of the variance matrix $\Omega_T(\alpha,x)$, which is related to the information matrix. The individual effect distributions in $\Pi_T$ are simply infinite mixtures of normal distributions with different means and specified variances. Further technical details and motivation for this construction of $\Pi_T$ are discussed in the appendix.
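To make this construction concrete, the following is a minimal numerical sketch (our own illustration, not part of the original text) of the smoothing operator in (4.27)-(4.28) applied on a grid, for a scalar individual effect and a single group; the grid, the stand-in information function `info`, and the example density are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def kernel_matrix(grid, omega):
    """Discretized kernel K[q, r] ~ K^Omega(alpha_q, beta_r): a Gaussian pdf in
    alpha centered at beta_r with variance omega[r], renormalized over the grid."""
    K = np.empty((len(grid), len(grid)))
    for r, (beta, om) in enumerate(zip(grid, omega)):
        col = norm.pdf(grid, loc=beta, scale=np.sqrt(om))
        K[:, r] = col / col.sum()            # each column sums to one
    return K

# Illustrative setup: scalar alpha on a uniform grid, one group.
grid = np.linspace(-5.0, 5.0, 201)
T, rho = 12, 4.0
info = 0.15 + 0.05 * grid**2                  # assumed stand-in for I(alpha, theta, x)
omega = rho / (T * info)                      # Omega_T(alpha) = rho_T/T * I(alpha)^{-1}, cf. (4.30)

K = kernel_matrix(grid, omega)

# Apply the smoothing operator to a discretized density pi: pi_smooth = K pi.
pi = np.exp(-0.5 * grid**2); pi /= pi.sum()   # assumed example density on the grid
pi_smooth = K @ pi                            # an element of the (discretized) set Pi_T
```

With the columns of `K` normalized to sum to one, every probability vector `pi` is mapped to another probability vector on the grid, mirroring how $K^{\Omega}_T$ maps $\Pi^{\mathcal{G}}_T$ into itself.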
4.5.2 Computation

In contrast to standard choices for a non-parametric sieve space, the parameter set $\Pi_T$ defined in (4.29) is still infinite dimensional. The numerical implementation of the estimator, however, requires discretizing $\Pi_T$. A convenient method for doing this is to discretize the support $\mathcal{A}$ of the individual effects. This discretization should be chosen such that it does not affect the properties of the estimator. As explained above, one can interpret the standard error of the fixed effect estimator for the individual effects (which for large $T$ is given by the square root of the diagonal elements of $T^{-1} \mathcal{I}^{-1}(\alpha,\theta^0,x)$) as the "resolution" in the space $\mathcal{A}$ that is implied by the model. The numerical error due to the discretization of $\mathcal{A}$ will be small as long as the discretization step-size is chosen sufficiently small relative to this standard error. In particular, it is natural to choose a different discretization step-size for different values of $\alpha$ and different groups $g \in \mathcal{G}_T$. For example, in our Monte Carlo simulations below the set $\mathcal{A}$ is one-dimensional, and for given $\alpha$ and $x \in g$ we choose the discretization step-size equal to $1/6$ of the approximate standard error $\sqrt{T^{-1} \mathcal{I}^{-1}(\alpha,\tilde\theta,x)}$, using some preliminary consistent estimator $\tilde\theta$ of $\theta^0$. We verified that the choice of $1/6$ did not affect the performance of the estimator in that case, i.e. a smaller discretization yields essentially the same estimator for the parameters of interest. (Note that we only discretize $\mathcal{A}$ for the estimation procedure, not for the data generating process of the Monte Carlo simulation, i.e. the realizations $\alpha^0_i$ are drawn from unrestricted continuous distributions, limited only by computer precision.) If the set $\mathcal{A}$ is unbounded, then the discretization requires imposing some bounds, which can however be chosen very broadly.

The discretization of $\mathcal{A}$ and the calculation of the kernel variance $\Omega_T(\alpha,x)$ both require evaluating the information matrix $\mathcal{I}(\alpha,\theta,x)$, which involves integration over $y \in \mathcal{Y}^T$. Unless the model allows for an analytical evaluation of $\mathcal{I}(\alpha,\theta,x)$, it will often be easiest to perform a Monte Carlo integration to determine $\mathcal{I}(\alpha,\theta,x)$: draw many $y$ from the distribution $f(y|x,\alpha;\theta)$ and use minus $1/T$ times the sample average of the corresponding Hessians as an approximation of $\mathcal{I}(\alpha,\theta,x)$. It is sufficient for our procedure to get a reasonable approximation of $\mathcal{I}(\alpha,\theta,x)$, i.e. a small sampling error in $\mathcal{I}(\alpha,\theta,x)$ does not affect the performance of the estimation.
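To illustrate this Monte Carlo step, here is a minimal sketch (ours, not from the text) that approximates $\mathcal{I}(\alpha,\theta,x)$ for a scalar individual effect; the callables `simulate_y` and `loglik` are assumed model-specific inputs (one draw from $f(y|x,\alpha;\theta)$, and the log-likelihood $\log f(y|x,\alpha;\theta)$, respectively), and the second derivative is replaced by a central finite difference in $\alpha$.

```python
import numpy as np

def info_mc(alpha, theta, x, simulate_y, loglik, T, n_draws=2000, h=1e-4, seed=0):
    """Monte Carlo approximation of I(alpha, theta, x) for scalar alpha:
    minus 1/T times the average Hessian of log f(y | x, alpha; theta),
    with the Hessian computed by a central finite difference in alpha."""
    rng = np.random.default_rng(seed)
    hess = 0.0
    for _ in range(n_draws):
        y = simulate_y(rng, x, alpha, theta, T)     # one draw from f(y | x, alpha; theta)
        d2 = (loglik(y, x, alpha + h, theta)
              - 2.0 * loglik(y, x, alpha, theta)
              + loglik(y, x, alpha - h, theta)) / h**2
        hess += d2
    return -hess / (n_draws * T)
```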
Letα ∗ q ,q =1,...,Q,bethechosendiscretizationofA,andletK Ω qr ,q,r =1,...,Qbethe corresponding discretization of K Ω T (α,β,x), 9 which in the pure random effect case does not depend on x. The discretized version of the integrated likelihood function, defined in (4.6), then reads L approx NT (θ,π disc )= 1 NT N X i=1 log Q X q,r=1 f(Y i |X i ,α ∗ q ;θ)K Ω qr π disc r , (4.33) where π disc r , r = 1,...,Q, describes the distribution to which the Kernel smoothing is applied in order to parameterize Π T , and the superscript “disc” refers to discretization. The constraints P r π disc r = 1 and π disc r ≥ 0, for all r = 1,...,Q, need to be imposed, and further upper and lower bounds on π disc r can also be imposed without difficulty. The estimators ˆ θ and ˆ π disc are obtained by joint maximization of L approx NT (θ,π disc ) over θ andπ disc . The number of parameter π disc may be quite large, but this is numeri- cally unproblematic, since for givenθ the objective function is smooth and concave over π disc and the constraints on π disc are linear, i.e. the optimization problem over π disc is 9 One can, for example, choose K Ω qr =Φ „ α ∗ q+1 +α ∗ q 2 ;α ∗ r ,ΩT(α ∗ r ) « −Φ „ α ∗ q +α ∗ q−1 2 ;α ∗ r ,ΩT(α ∗ r ) « , for q = 2,...,Q−1, where Φ(α;β,Ω) is the cdf of a normal distribution with mean β and variance Ω. For q =1 the second term is omitted, and for q =Q the first term is set to 1. This definition guarantees P q K Ω qr =1. 98 very well-behaved, and the corresponding gradient and Hessian can be easily calculated analytically. The structure of L approx NT (θ,π disc ) as a function of both θ and π disc may, however, be more complicated. In principle, multiple local maxima could exist, and it may therefore be necessary to repeat the joint maximization over L NT (θ,π) with multiple starting values, or to perform an initial grid search over θ∈Θ. An interesting alternative option — which we also make use of in our Monte Carlo simulations —istoperformtheoptimization sequentially. Startingwithsomeconsistent preliminary estimator for θ, one first optimizes over π disc for given θ, then takes the optimal valueπ disc and optimizes overθ withπ disc fixed, and so on. Asymptotically, one can show thatalready after afinitenumberof repetitions thissequential approach yields anestimatorforθwhoseasymptoticbiasdecreasesatthesamerateasthejointmaximum likelihood estimator. In our Monte Carlo setup this sequential approach turned out to converge quite rapidly. Note that once the estimator ˆ π disc is found, one can obtain the actual estimator for the distribution π by applying the Kernel functionK Ω rq . 4.6 Monte Carlo Simulations Inoursimulation studyweconsiderthedynamicbinarychoice modelwithoutadditional regressors, as introduced in equation (4.10). In this modelonly thebinaryoutcome vari- able Y it is observed for time periods t = 0,...,T and cross-sectional units i = 1,...,N. The parameter of interest θ ∈ Θ = R and the individual effect α i ∈ A = R are both scalars. The initial period outcome Y i0 is used as a conditioning variable. We consider 99 a probit model, i.e. shocks ε it are standard normally distributed, iid across both i and t. The (log-) likelihood function of the model thus reads L NT (θ,π) = 1 NT N X i=1 log Z R ( T Y t=1 [Φ(θY i,t−1 +α)] Y it [1−Φ(θY i,t−1 +α)] 1−Y it ) π(α|Y i0 )dα, (4.34) whereΦ(.)is thecdfofthestandardnormaldistribution. Theunknownparametersthat enter the likelihood function are θ and the two densities π(α|Y i0 =0) andπ(α|Y i0 =1). 
For the data generating process we choose the true parameter of interest $\theta^0 = 1$. The conditional distribution of $\alpha$ given $Y_{i0} = 1$ is chosen to be a t-distribution with 5 degrees of freedom, centered at $\alpha = 0$ and rescaled such that its standard deviation is $\sigma_\pi$. The conditional distribution of $\alpha$ given $Y_{i0} = 0$ is chosen to be an equal mixture of two t-distributions with 5 degrees of freedom, one centered at $\alpha = 1$ and one centered at $\alpha = -1$, both rescaled such that each has standard deviation $\sigma_\pi$. Thus, $\sigma_\pi$ parameterizes the smoothness of the conditional density $\pi(\alpha|Y_{i0})$, and we consider different values of $\sigma_\pi$ in the simulations below. Finally, we let $Y_{i0} = 0$ and $Y_{i0} = 1$ each occur with probability $0.5$ in the data generating process.

To estimate the model, we discretize the set $\mathcal{A} = \mathbb{R}$ by choosing a lower bound of $\alpha = -9$, an upper bound of $\alpha = 9$, and a discretization step-size that is sufficiently small relative to the variance of the fixed effect estimator for $\alpha$, as described in the computation section above.

In Figure 4.1 we plot $\sqrt{T^{-1} \mathcal{I}^{-1}(\alpha,\theta^0,Y_{i0})}$ as a function of $\alpha$ for $Y_{i0} = 0$ and $1$, and for $T = 12$ and $24$. We have argued that this quantity, which approximates the standard error of the fixed effect estimator for $\alpha$, can be viewed as the "resolution scale" that the model provides for the estimation of the individual effect distribution. The figure shows that we cannot expect to resolve much structure in $\pi(\alpha|Y_{i0})$ for, say, $\alpha < -2.5$ and $\alpha > 2$, since $\sqrt{T^{-1} \mathcal{I}^{-1}(\alpha,\theta^0,Y_{i0})}$ then quickly becomes very large.

[Figure 4.1: For $T = 12$ (left) and $T = 24$ (right) we plot $\sqrt{T^{-1} \mathcal{I}^{-1}(\alpha,\theta^0,y_{i0})}$ against $\alpha$, for $y_{i0} = 0$ and $y_{i0} = 1$. Notes: We use $\theta^0 = 1$. For large $T$ the plotted quantity approximates the standard error of the fixed effect estimator for the individual effects $\alpha$.]

The figure also shows that we can expect to resolve somewhat finer structures for $T = 24$ than for $T = 12$. Note also that $\sqrt{T^{-1} \mathcal{I}^{-1}(\alpha,\theta^0,Y_{i0})}$ is not symmetric around $\alpha = 0$ (it would only be so for $\theta = 0$) and that it is slightly different for $Y_{i0} = 0$ and $Y_{i0} = 1$.

We choose a bandwidth $\rho_T = 4$ for all simulations. According to our kernel construction we approximate $\pi(\alpha|Y_{i0})$ as a mixture of normal distributions with arbitrary means $\alpha$ but given variances $\Omega_T(\alpha,Y_{i0}) = \rho_T\, T^{-1} \mathcal{I}^{-1}(\alpha,\tilde\theta,Y_{i0})$. We plot a few of the normal distributions that are used as "basis functions" in Figure 4.2 for $T = 12$ and in Figure 4.3 for $T = 24$, for $\tilde\theta = \theta^0$ (the dependence on $\tilde\theta$ is not very strong, i.e. using an estimator $\tilde\theta$, as we do in the actual estimation procedure, does not change these plots much). As expected from Figure 4.1, these basis functions become wider and wider, corresponding to less and less resolution power, as the absolute value of $\alpha$ becomes larger. One can also see that the basis functions for $T = 24$ are narrower, i.e. they are able to resolve more structure in $\pi(\alpha|Y_{i0})$. By choosing a smaller value of the bandwidth $\rho_T$ one could resolve finer structures in the individual effect distribution. However, smaller values of $\rho_T$ also mean that the identification of $\pi(\alpha|Y_{i0})$ from the distribution of $Y_{it}$ becomes more problematic, and one needs to compromise between these two competing goals.

[Figure 4.2: Examples of "basis functions" (density against $\alpha$) for $T = 12$. Notes: These "basis functions" are used to approximate the true distributions; plots are for $y_{i0} = 0$ (left) and $y_{i0} = 1$ (right).]
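To make the data generating process concrete, here is a minimal simulation sketch (our own; the function names are illustrative) of the design just described.

```python
import numpy as np

def draw_alpha(rng, y0, sigma_pi, df=5):
    """Draw alpha from pi0(alpha | y0): a scaled t(5) centered at 0 if y0 = 1,
    an equal mixture of scaled t(5)'s centered at +1 and -1 if y0 = 0."""
    scale = sigma_pi / np.sqrt(df / (df - 2.0))   # so the draw has sd sigma_pi
    center = 0.0 if y0 == 1 else rng.choice([-1.0, 1.0])
    return center + scale * rng.standard_t(df)

def simulate_panel(rng, N, T, theta, sigma_pi):
    """Dynamic probit DGP: Y_i0 ~ Bernoulli(0.5), then
    Y_it = 1{theta * Y_i,t-1 + alpha_i + eps_it > 0} with eps_it ~ N(0, 1)."""
    Y = np.empty((N, T + 1), dtype=int)
    Y[:, 0] = rng.integers(0, 2, size=N)
    alpha = np.array([draw_alpha(rng, y0, sigma_pi) for y0 in Y[:, 0]])
    for t in range(1, T + 1):
        eps = rng.standard_normal(N)
        Y[:, t] = (theta * Y[:, t - 1] + alpha + eps > 0).astype(int)
    return Y

rng = np.random.default_rng(42)
Y = simulate_panel(rng, N=500, T=12, theta=1.0, sigma_pi=1.0)
```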
[Figure 4.3: Same as Figure 4.2, but for $T = 24$.]

Once the bandwidth $\rho_T$, and thus the basis functions, are chosen, the key question is whether the true distribution $\pi^0(\alpha|Y_{i0})$ can be well approximated by these basis functions (i.e. by an element of the parameter set $\Pi_T = K^{\Omega}_T \Pi^{\mathcal{A}}_T$). This crucially depends on how smooth $\pi^0(\alpha|Y_{i0})$ is, which in our setup is regulated by the parameter $\sigma_\pi$. Note that we can only expect a good performance of our joint maximum likelihood estimator for $\theta$ and $\pi$ if the true distribution $\pi^0$ can be well approximated by an element of $\Pi_T$.

Figure 4.4 shows the maximum likelihood estimator for $\pi(\alpha|Y_{i0})$ obtained from a sample of size $T = 12$ and $N = 10000$ (taking $\theta = \theta^0$ as given) for different values of $\sigma_\pi$. Here we used $N = 10000$, which is larger than the sample sizes in the actual Monte Carlo simulations below, in order to keep the sampling error small, so that we can focus on the question of whether $\pi^0$ can be well approximated at $T = 12$ for our bandwidth choice. The figure shows that the approximation of $\pi^0$ is relatively good at $\sigma_\pi = 1.4$, but rather bad at $\sigma_\pi = 0.7$. Thus, we expect the joint maximum likelihood estimator $\hat\theta$ to have little bias in the case $\sigma_\pi = 1.4$ but potentially large bias in the case $\sigma_\pi = 0.7$. It turns out that the approximation of $\pi^0$ in the intermediate case $\sigma_\pi = 1$ is still sufficiently good to obtain little bias of $\hat\theta$, although from Figure 4.4 alone it would probably not be clear what to expect in that case.

Table 4.1 contains our actual Monte Carlo results for various estimators of the parameters of interest at $T = 12$, and for different values of $N$ and $\sigma_\pi$. We performed 1000 simulation repetitions for $N = 100$ and $500$, and 500 repetitions for $N = 2500$. The fixed effect estimators (based on the profile likelihood) for $\theta$ that we consider are the fixed effect maximum likelihood estimator (FE-MLE), the first order split-panel jackknife estimator (FE-JACK-1), which eliminates the asymptotic bias of order $1/T$, and the second order split-panel jackknife estimator (FE-JACK-2), which in addition eliminates the asymptotic bias of order $1/T^2$. These jackknife estimators are obtained by estimating two sub-panels of sample size $T/2$ (and, for FE-JACK-2, also three sub-panels of sample size $T/3$), and then forming appropriate linear combinations of the estimators at the different sample sizes, as described in Dhaene and Jochmans (2010), and originally proposed by Hahn and Newey (2004) for panels without time-correlation. We use the jackknife method here since, to our knowledge, it is the only bias correction method in the literature so far that in principle allows for arbitrary higher order bias correction, which makes it a natural object of comparison for our random effect method.
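For concreteness, a minimal sketch of the first order combination (our rendering of the half-panel jackknife idea, under the assumption of two equal half-panels; see Dhaene and Jochmans (2010) for the general construction): with $\hat\theta$ the full-panel fixed effect MLE and $\hat\theta_{(1)}, \hat\theta_{(2)}$ the MLEs from the two half-panels of length $T/2$,
\[
\hat\theta_{1/2} = 2\,\hat\theta - \tfrac{1}{2}\big( \hat\theta_{(1)} + \hat\theta_{(2)} \big),
\]
so that if $\hat\theta$ has bias $B/T + O(T^{-2})$, then each half-panel estimator has bias $2B/T + O(T^{-2})$, and the leading $1/T$ bias term cancels in the combination.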
The random effect estimators (based on the integrated likelihood) that we calculate are: the random effect "miracle" estimator $\hat\theta(\pi^0)$ (RE-MIR), which is infeasible since it assumes knowledge of $\pi^0$; the random effect estimator with fixed prior distribution $\hat\theta(\pi^{\rm prior})$ (RE-PRIOR), using a normal prior distribution $\pi^{\rm prior}(\alpha|Y_{i0})$ with mean zero and standard deviation equal to $4$ for both values of $Y_{i0}$; and finally our joint maximum likelihood estimator (RE-MLE), which is obtained by maximizing the integrated likelihood over $\theta \in \mathbb{R}$ and $\pi \in \Pi_T$.

[Figure 4.4: True and estimated individual effect distributions. For $\sigma_\pi = 1.4$, $1$, and $0.7$, the true conditional distribution $\pi^0(\alpha|y_{i0})$ is plotted as a dotted line, and the corresponding maximum likelihood estimator of $\pi(\alpha|y_{i0})$ as a solid line, for both $y_{i0} = 0$ (left) and $y_{i0} = 1$ (right). Notes: The estimates are obtained from a sample with $T = 12$ and $N = 10000$. The structure of the true distribution cannot be resolved very well for the case $\sigma_\pi = 0.7$, given our particular bandwidth choice $\rho_T = 4$.]

Table 4.1: Monte Carlo results for $T = 12$.

σπ = 1.4         N = 100               N = 500                 N = 2500
             bias    std    rmse    bias    std    rmse     bias     std     rmse
FE-MLE     -0.313  0.124   0.336  -0.308  0.056   0.313   -0.3115  0.0243  0.3125
FE-JACK-1   0.029  0.151   0.153   0.037  0.067   0.077    0.0308  0.0299  0.0429
FE-JACK-2  -0.014  0.252   0.253   0.003  0.115   0.115   -0.0044  0.0516  0.0517
RE-MIR      0.003  0.109   0.109   0.003  0.050   0.050   -0.0003  0.0214  0.0214
RE-PRIOR   -0.171  0.124   0.212  -0.169  0.056   0.178   -0.1728  0.0242  0.1745
RE-MLE     -0.006  0.129   0.129   0.003  0.059   0.059   -0.0002  0.0257  0.0257

σπ = 1           N = 100               N = 500                 N = 2500
             bias    std    rmse    bias    std    rmse     bias     std     rmse
FE-MLE     -0.320  0.112   0.339  -0.314  0.052   0.319   -0.3160  0.0253  0.3169
FE-JACK-1   0.025  0.138   0.141   0.029  0.063   0.069    0.0278  0.0290  0.0402
FE-JACK-2  -0.008  0.244   0.244  -0.004  0.106   0.106   -0.0059  0.0493  0.0497
RE-MIR      0.003  0.095   0.095   0.002  0.045   0.045    0.0012  0.0204  0.0204
RE-PRIOR   -0.196  0.113   0.227  -0.194  0.052   0.201   -0.1948  0.0240  0.1963
RE-MLE     -0.016  0.116   0.117  -0.002  0.055   0.055   -0.0005  0.0249  0.0249

σπ = 0.7         N = 100               N = 500                 N = 2500
             bias    std    rmse    bias    std    rmse     bias     std     rmse
FE-MLE     -0.332  0.112   0.350  -0.323  0.050   0.327   -0.3237  0.0221  0.3244
FE-JACK-1   0.017  0.135   0.136   0.023  0.060   0.064    0.0225  0.0266  0.0349
FE-JACK-2  -0.017  0.230   0.230  -0.011  0.102   0.102   -0.0083  0.0449  0.0457
RE-MIR     -0.005  0.090   0.090   0.002  0.040   0.040    0.0004  0.0179  0.0179
RE-PRIOR   -0.223  0.112   0.250  -0.214  0.051   0.220   -0.2149  0.0223  0.2161
RE-MLE     -0.049  0.113   0.124  -0.037  0.051   0.063   -0.0424  0.0223  0.0479

Notes: Bias, standard error (std) and root mean square error (rmse) of different estimators for the parameter of interest $\theta$ in simulations at $T = 12$, for different values of $N$ and $\sigma_\pi$. The results are based on 1000 repetitions for $N = 100$ and $N = 500$, and on 500 repetitions for $N = 2500$. The estimators are the fixed effect MLE (FE-MLE), the first and second order split-panel jackknife (FE-JACK-1 and FE-JACK-2), the infeasible random effect miracle estimator (RE-MIR), a random effect estimator with fixed prior distribution (RE-PRIOR), and the random effect joint MLE over $\theta$ and $\pi$ (RE-MLE).
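For reference, a minimal sketch (ours) of how the reported summary statistics are computed from the vector of estimates across Monte Carlo repetitions:

```python
import numpy as np

def mc_summary(theta_hats, theta0):
    """Bias, standard deviation and RMSE of the estimates theta_hats
    across Monte Carlo repetitions, relative to the true value theta0."""
    err = np.asarray(theta_hats) - theta0
    bias = err.mean()
    std = err.std(ddof=1)
    rmse = np.sqrt((err**2).mean())
    return bias, std, rmse
```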
Table 4.1 shows that, as expected, for $T = 12$ the FE-MLE is severely biased due to the incidental parameter problem. The first order jackknife bias correction eliminates around 90% of this bias, and the second order jackknife correction reduces the bias even further. However, the bias correction also increases the standard error of the estimator: by around 20% for the first order correction and by around 100% for the second order correction. Both jackknife estimators are known to have the same asymptotic variance as the MLE, but the phenomenon of finite sample variance inflation is also well known. In terms of root mean square error the second order jackknife estimator performs worse than the first order jackknife estimator in all our simulations, due to its much larger standard error. In the following we therefore concentrate on the comparison between the first order jackknife estimator and the RE-MLE.

The RE-MLE performs very well in the $T = 12$ simulations for $\sigma_\pi = 1.4$ and $\sigma_\pi = 1$. In those cases the RE-MLE is essentially unbiased at $N = 500$ and $N = 2500$ (the bias is not statistically significant at the 5% level, given the number of simulation repetitions), and it has a bias at $N = 100$ which is still very small relative to the standard error at $N = 100$. Furthermore, it has a standard error that is almost identical to the standard error of the FE-MLE, and which is therefore smaller than the standard error of the FE-JACK-1. Note that the bias of the FE-JACK-1 is essentially independent of $N$, while its standard error decreases like $N^{-1/2}$. In our simulation design at $T = 12$ we find that at $N = 2500$ the bias and the standard error of the FE-JACK-1 are almost identical, i.e. for all values of $N$ larger than 2500 the bias will dominate the standard error of the FE-JACK-1. Even at $N = 500$ the bias is about half the size of the standard error for the FE-JACK-1, which would be very problematic for testing purposes. Thus, in particular for large values of $N$, the RE-MLE performs much better than the FE-JACK-1 for $\sigma_\pi = 1.4$ and $1$.

However, as already anticipated from Figure 4.4, this is not true for $\sigma_\pi = 0.7$. In that case we find the bias of the RE-MLE to be around twice the bias of the FE-JACK-1, which also results in a larger root mean square error for large values of $N$. The properties of the FE-JACK-1 are essentially independent of $\sigma_\pi$, since in the fixed effect approach the properties of the individual effect distribution are not important. In contrast, for our random effect approach it makes a big difference whether $\sigma_\pi = 1$ or $\sigma_\pi = 0.7$, since in one case the true individual effect distribution can be reasonably approximated, while in the other case it cannot. These results very clearly show the tradeoff one faces between using the RE-MLE and using the FE-JACK-1.

[Figure 4.5: Same as Figure 4.4, but with $T = 24$ and only for $\sigma_\pi = 0.7$.]

Given our bandwidth choice, we thus found that we cannot properly resolve the distribution $\pi^0$ for $\sigma_\pi = 0.7$ at $T = 12$. This problem is, however, automatically resolved if $T$ becomes larger. Figure 4.5 shows that at $T = 24$ one can already approximate the true individual effect distribution for $\sigma_\pi = 0.7$ relatively well, and Table 4.2 shows the corresponding Monte Carlo results for the parameters of interest.
In that case, the bias of the RE-MLE is again very small relative to its standard error, and for $N = 500$ and $N = 2500$ the RE-MLE therefore again performs much better than the FE-JACK-1. It is also interesting that in all our simulations the standard errors of the FE-MLE, the RE-PRIOR and the RE-MLE are very similar, while the infeasible RE-MIR has a somewhat smaller variance. Asymptotically all these standard errors converge, but in finite samples knowing the true $\pi^0$ not only results in an insignificant bias of the RE-MIR estimator, but also in increased efficiency in terms of the standard error. Finally, we want to point out that the performance of the FE-MLE and the RE-PRIOR estimator is very similar not only in terms of standard error but also in terms of bias: both estimators have a bias of similar order that decreases at the rate $1/T$.

Table 4.2: Same as Table 4.1, but with $T = 24$ and only for $\sigma_\pi = 0.7$.

σπ = 0.7         N = 100               N = 500                  N = 2500
             bias    std    rmse    bias     std     rmse     bias     std     rmse
FE-MLE     -0.166  0.078   0.183  -0.1662  0.0334  0.1695   -0.1664  0.0164  0.1672
FE-JACK-1   0.008  0.087   0.087   0.0068  0.0373  0.0379    0.0062  0.0176  0.0187
FE-JACK-2   0.007  0.115   0.116   0.0061  0.0517  0.0520    0.0048  0.0226  0.0231
RE-MIR      0.003  0.066   0.066  -0.0003  0.0288  0.0288    0.0004  0.0143  0.0143
RE-PRIOR   -0.116  0.079   0.140  -0.1178  0.0336  0.1226   -0.1180  0.0166  0.1192
RE-MLE     -0.007  0.080   0.081  -0.0028  0.0349  0.0350   -0.0003  0.0170  0.0170

4.7 Conclusions

This chapter presents an alternative approach to higher order bias correction in nonlinear panel data models with large $N$ and $T$. Instead of profiling out the individual effects (fixed effect approach), we propose to integrate out the individual effects from the likelihood function, and to use the resulting integrated likelihood to estimate the parameters of interest. We show that if a consistent estimator for the individual effect distribution is used to integrate out the individual effects, then the rate at which the bias of the estimator for the parameters of interest decreases with $T$ is proportional to the rate at which the estimator of the individual effect distribution approaches the true distribution (in terms of the Hellinger distance). Compared to the fixed effect maximum likelihood estimator, which has a bias of order $1/T$, we can thus obtain a significant improvement in the convergence rate of the bias, as long as a good estimator for the individual effect distribution is available. The bias that results from our estimation approach can also be significantly lower than the one obtained from existing bias correction techniques. This result on the bias correction for the parameters of interest is applicable to all estimators of the individual effect distribution that satisfy some weak regularity conditions.

The estimator for the distribution that we consider explicitly in this chapter is the joint maximum likelihood estimator, which maximizes the likelihood function jointly over the parameters of interest and the individual effect distribution. The properties of this estimator are crucially determined by the choice of the parameter set of distributions over which the estimation is performed. To allow for non-parametric estimation of the individual effect distribution, this parameter set needs to be chosen sample size dependent, analogously to a semiparametric sieve estimation approach.
Under appropriate high-level assumptions on this parameter set and on the true distribution of the individual effects, we then derive the convergence rate of the estimator for the distribution (in terms of the Hellinger distance), and thus also obtain (an upper bound on) the convergence rate of the incidental parameter bias of the estimator for the parameters of interest.

The high-level assumptions that are employed to derive these general results require a restriction on the correlation structure between the regressors and the individual effects. As a concrete example of such a restriction we consider the case of generalized random effects, which demands that individuals can be partitioned into groups, and imposes a random effect assumption within each group. No further parametric assumptions are made on the distribution of the individual effects. We discuss how to choose an appropriate parameter set for the individual effect distribution in this case, and show that the convergence rate of the incidental parameter bias only depends on the smoothness properties of the true individual effect distribution. As long as this distribution is sufficiently smooth, the bias can decrease at an arbitrary polynomial rate in $T$.

For future work it would be fascinating to discuss alternative choices for the estimator of the individual effect distribution. Furthermore, it would be interesting to discuss alternative restrictions on the correlation structure between the regressors and the individual effects, which go beyond the generalized random effect assumption discussed explicitly in the present chapter. An important extension would also be the estimation of policy parameters like marginal effects. Finally, it is also important to develop a data-dependent selection method for the bandwidth $\rho_T$ that enters into the non-parametric estimation of the individual effect distribution.

Bibliography

Ackerberg, D., Benkard, L., Berry, S., and Pakes, A. (2007). Econometric tools for analyzing market outcomes. In Heckman, J. and Leamer, E., editors, Handbook of Econometrics, Vol. 6A. North-Holland.

Ahn, S., Lee, Y., and Schmidt, P. (2007). Panel data models with multiple time-varying individual effects. Journal of Productivity Analysis.

Ahn, S. C., Lee, Y. H., and Schmidt, P. (2001). GMM estimation of linear panel data models with time-varying individual effects. Journal of Econometrics, 101(2):219–255.

Andersen, E. (1970). Asymptotic properties of conditional maximum-likelihood estimators. Journal of the Royal Statistical Society, Series B (Methodological), 32(2):283–301.

Andrews, D. (1994). Empirical process methods in econometrics. In Engle, R. F. and McFadden, D. L., editors, Handbook of Econometrics, Volume IV.

Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary. Econometrica, 67(6):1341–1384.

Arellano, M. (2003). Panel Data Econometrics. Oxford University Press, Oxford.

Arellano, M. and Bonhomme, S. (2009). Robust priors in nonlinear panel data models. Econometrica, 77(2):489–536.

Arellano, M. and Hahn, J. (2007). Understanding bias in nonlinear panel models: Some recent developments. Econometric Society Monographs, 43:381.

Arellano, M. and Honoré, B. (2001). Panel data models: some recent developments. Handbook of Econometrics, 5:3229–3296.

Bai, J. (2009a). Likelihood approach to small T dynamic panel models with interactive effects. Manuscript.

Bai, J. (2009b). Panel data models with interactive fixed effects. Econometrica, 77(4):1229–1279.

Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221.

Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150.
Bai, Z. (1993). Convergence rate of expected spectral distributions of large random matrices. Part II. Sample covariance matrices. The Annals of Probability, 21(2):649–672.

Bai, Z. (1999). Methodologies in spectral analysis of large dimensional random matrices, a review. Statistica Sinica, 9:611–677.

Bai, Z., Miao, B., and Yao, J. (2004). Convergence rates of spectral distributions of large sample covariance matrices. SIAM Journal on Matrix Analysis and Applications, 25(1):105–127.

Bai, Z. D., Silverstein, J. W., and Yin, Y. Q. (1988). A note on the largest eigenvalue of a large dimensional sample covariance matrix. Journal of Multivariate Analysis, 26(2):166–168.

Bajari, P., Fox, J., Kim, K., and Ryan, S. (2008a). The random coefficients logit model is identified. Manuscript, University of Minnesota.

Bajari, P., Fox, J., Kim, K., and Ryan, S. (2008b). A simple nonparametric estimator for the distribution of random coefficients. Manuscript, University of Minnesota.

Berry, S. and Haile, P. (2009). Identification of discrete choice demand from market-level data. Manuscript, Yale University.

Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica, 63(4):841–890.

Berry, S., Linton, O. B., and Pakes, A. (2004). Limit theorems for estimating the parameters of differentiated product demand systems. The Review of Economic Studies, 71(3):613–654.

Berry, S. T. (1994). Estimating discrete-choice models of product differentiation. The RAND Journal of Economics, 25(2):242–262.

Besanko, D. and Doraszelski, U. (2004). Capacity dynamics and endogenous asymmetries in firm size. RAND Journal of Economics, 35:23–49.

Bester, A. and Hansen, C. (2007). Flexible correlated random effects estimation in panel models with unobserved heterogeneity. Technical report, mimeo.

Bester, C. and Hansen, C. (2009). A penalty function approach to bias reduction in nonlinear panel models with fixed effects. Journal of Business and Economic Statistics, 27(2):131–148.

Bonhomme, S. (2010). Functional differencing. Technical report, mimeo.

Carro, J. (2007). Estimating dynamic panel data discrete choice models with fixed effects. Journal of Econometrics, 140(2):503–528.

Chamberlain, G. (1984). Panel data. In Griliches, Z. and Intriligator, M., editors, Handbook of Econometrics, Chapter 22, pages 1247–1318.

Chamberlain, G. (2010). Binary response models for panel data: Identification and information. Econometrica, 78(1):159–168.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549–5632.

Chernozhukov, V., Fernández-Val, I., Hahn, J., and Newey, W. (2009a). Identification and estimation of marginal effects in nonlinear panel models.

Chernozhukov, V., Fernández-Val, I., and Newey, W. (2009b). Quantile and average effects in nonseparable panel models. CeMMAP working papers.

Chernozhukov, V., Hahn, J., and Newey, W. (2005). Bound analysis in panel models with correlated random effects. Technical report, MIT, UCLA and MIT.

Chernozhukov, V. and Hansen, C. (2006). Instrumental quantile regression inference for structural and treatment effect models. Journal of Econometrics, 132(2):491–525.

Chiappori, P. and Komunjer, I. (2009). On the nonparametric identification of multiple choice models. Manuscript, Columbia University.

Chudik, A., Pesaran, M., and Tosetti, E. (2009). Weak and strong cross section dependence and estimation of large panels. Manuscript.

Dhaene, G. and Jochmans, K. (2010). Split-panel jackknife estimation of fixed-effect models.
Dube, J. P., Fox, J., and Su, C. (2008). Improving the numerical performance of BLP static and dynamic discrete choice random coefficients demand estimation. Manuscript.

Esteban, S. and Shum, M. (2007). Durable goods oligopoly with secondary markets: the case of automobiles. RAND Journal of Economics, 38:332–354.

Fernández-Val, I. (2009). Fixed effects estimation of structural parameters and marginal effects in panel probit models. Journal of Econometrics, 150:71–85.

Fernández-Val, I. and Weidner, M. (2010). Individual and time effects in nonlinear panel data models with large N, T. Manuscript.

Gandhi, A. (2008). On the nonparametric foundations of discrete-choice demand estimation. Manuscript, University of Wisconsin.

Gandhi, A., il Kim, K., and Petrin, A. (2010). The interaction of observed and unobserved factors in discrete choice demand models. Manuscript.

Geman, S. (1980). A limit theorem for the norm of random matrices. Annals of Probability, 8(2):252–261.

Götze, F. and Tikhomirov, A. (2010). The rate of convergence of spectra of sample covariance matrices. Theory of Probability and its Applications, 54:129.

Hahn, J. and Kuersteiner, G. (2002). Asymptotically unbiased inference for a dynamic panel model with fixed effects when both n and T are large. Econometrica, 70(4):1639–1657.

Hahn, J. and Kuersteiner, G. (2004). Bias reduction for dynamic nonlinear panel models with fixed effects. Manuscript.

Hahn, J. and Newey, W. (2004). Jackknife and analytical bias reduction for nonlinear panel models. Econometrica, 72(4):1295–1319.

Harding, M. (2007). Structural estimation of high-dimensional factor models. Manuscript.

Harding, M. and Hausman, J. (2007). Using a Laplace approximation to estimate the random coefficients logit model by nonlinear least squares. International Economic Review, 48:1311–1328.

Härdle, W. and Linton, O. (1994). Applied nonparametric methods. Handbook of Econometrics, 4:2295–2339.

Hausman, J., Hall, B., and Griliches, Z. (1984). Econometric models for count data with an application to the patents–R&D relationship. Econometrica, 52(4):909–938.

Holtz-Eakin, D., Newey, W., and Rosen, H. S. (1988). Estimating vector autoregressions with panel data. Econometrica, 56(6):1371–95.

Honoré, B. (1992). Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica, 60(3):533–565.

Honoré, B. and Tamer, E. (2006). Bounds on parameters in panel dynamic discrete choice models. Econometrica, 74(3):611–629.

Horowitz, J. and Lee, S. (2004). Semiparametric estimation of a panel data proportional hazards model with fixed effects. Journal of Econometrics, 119(1):155–198.

Hsiao, C. (2003). Analysis of Panel Data. Cambridge University Press.

Ichimura, H. and Todd, P. (2007). Implementing nonparametric and semiparametric estimators. Handbook of Econometrics, 6:5369–5468.

Imbens, G. and Newey, W. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512.

Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327.

Kato, T. (1980). Perturbation Theory for Linear Operators. Springer-Verlag.

Kiefer, N. (1980). A time series-cross section model with fixed effects with an intertemporal factor structure. Manuscript, Department of Economics, Cornell University.
Knittel, C. and Metaxoglou, K. (2008). Estimation of random coefficient demand models: Challenges, difficulties and warnings. Manuscript.

Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95(2):391–413.

Lancaster, T. (2002). Orthogonal parameters and panel data. The Review of Economic Studies, 69(3):647–666.

Latala, R. (2005). Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133:1273–1282.

Manski, C. (1987). Semiparametric analysis of random effects linear models from binary panel data. Econometrica, 55(2):357–362.

Marčenko, V. and Pastur, L. (1967). Distribution of eigenvalues for some sets of random matrices. Sbornik: Mathematics, 1(4):457–483.

Martin, R. and Tokdar, S. (2010). Semiparametric inference in mixture models with predictive recursion. Manuscript.

Media Dynamics, Inc. (1997). TV Dimensions. Media Dynamics, Inc. annual publication.

Moon, H., Shum, M., and Weidner, M. (2010). Estimation of random coefficients logit demand models with interactive fixed effects. Manuscript.

Moon, H. and Weidner, M. (2010a). Dynamic linear panel regression models with interactive fixed effects. Manuscript.

Moon, H. and Weidner, M. (2010b). Linear regression for panel with unknown number of factors as interactive fixed effects. Manuscript.

Nevo, A. (2001). Measuring market power in the ready-to-eat cereals industry. Econometrica, 69:307–342.

Newton, M. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā: The Indian Journal of Statistics, Series A, 64(2):306–322.

Newton, M., Quintana, F., and Zhang, Y. (1998). Nonparametric Bayes methods using predictive updating. Practical Nonparametric and Semiparametric Bayesian Statistics, 133:45–61.

Neyman, J. and Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16(1):1–32.

Onatski, A. (2005). Determining the number of factors from empirical distribution of eigenvalues. Discussion Papers 0405-19, Columbia University, Department of Economics.

Onatski, A. (2006). Asymptotic distribution of the principal components estimator of large factor models when factors are relatively weak. Manuscript.

Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74(4):967–1012.

Petrin, A. (2002). Quantifying the benefits of new products: the case of the minivan. Journal of Political Economy, 110:705–729.

Phillips, P. C. B. and Moon, H. (1999). Linear regression limit theory for nonstationary panel data. Econometrica, 67(5):1057–1111.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980) with foreword and afterword by B. D. Wright, Chicago: The University of Chicago Press.

Silverstein, J. (1990). Weak convergence of random functions defined by the eigenvectors of sample covariance matrices. The Annals of Probability, 18(3):1174–1194.

Silverstein, J. W. (1989). On the eigenvectors of large dimensional sample covariance matrices. Journal of Multivariate Analysis, 30(1):1–16.

Soshnikov, A. (2002). A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. Journal of Statistical Physics, 108(5):1033–1056.

Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97:1167–1179.
Villas-Boas, J. and Winer, R. (1999). Endogeneity in brand choice models. Management Science, 45:1324-1338.
Weidner, M. (2011). Semiparametric Estimation of Nonlinear Panel Data Models with Generalized Random Effects. Manuscript.
Wooldridge, J. (2010). Econometric analysis of cross section and panel data. The MIT Press.
Woutersen, T. (2002). Robustness against incidental parameters. Manuscript.
Yin, Y. Q., Bai, Z. D., and Krishnaiah, P. (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probability Theory and Related Fields, 78:509-521.
Zaffaroni, P. (2009). Generalized least squares estimation of panel with common shocks. Manuscript.

Appendix A

Appendix to Chapter 2

A.1 Proof of Consistency

Proof of Theorem 2.1. We first establish a lower bound on $\mathcal{L}^0_{NT}(\beta)$. Consider the last expression for $\mathcal{L}^0_{NT}(\beta)$ in equation (2.4), plug in $Y = \sum_k \beta^0_k X_k + \lambda^0 f^{0\prime} + e$, then replace $\lambda^0 f^{0\prime}$ by $\lambda f'$, and minimize over the $N \times R^0$ matrix $\lambda$ and the $T \times R^0$ matrix $f$. This gives
\[
\mathcal{L}^0_{NT}(\beta) \ge \frac{1}{NT} \min_{\tilde F} \operatorname{Tr}\Bigg[ \Big( \sum_k (\beta^0_k - \beta_k) X_k + e \Big) M_{\tilde F} \Big( \sum_k (\beta^0_k - \beta_k) X_k + e \Big)' \Bigg]
\ge b \, \|\beta - \beta^0\|^2 + O_p\!\Big( \tfrac{\|\beta - \beta^0\|}{\sqrt{\min(N,T)}} \Big) + \frac{1}{NT} \operatorname{Tr}(ee') + O_p\!\Big( \tfrac{1}{\min(N,T)} \Big), \qquad (A.1)
\]
where in the first line we minimize over all $T \times (R + R^0)$ matrices $\tilde F$, and to arrive at the second line we decomposed the expression into the components quadratic in $(\beta - \beta^0)$, linear in $(\beta - \beta^0)$, and independent of $(\beta - \beta^0)$, and applied Assumptions 2.1 and 2.2. Next, we establish an upper bound on $\mathcal{L}^0_{NT}(\beta^0)$. We have
\[
\mathcal{L}^0_{NT}(\beta^0) = \frac{1}{NT} \sum_{t=R+1}^{T} \mu_t\Big[ \big(\lambda^0 f^{0\prime} + e\big)' \big(\lambda^0 f^{0\prime} + e\big) \Big] \le \frac{1}{NT} \operatorname{Tr}\big( e' M_{\lambda^0} e \big) = \frac{1}{NT} \operatorname{Tr}(ee') + O_p\!\Big( \tfrac{1}{\min(N,T)} \Big). \qquad (A.2)
\]
Since we could choose $\beta = \beta^0$ in the minimization over $\beta$, the optimal $\hat\beta$ needs to satisfy $\mathcal{L}^0_{NT}(\hat\beta) \le \mathcal{L}^0_{NT}(\beta^0)$. With the above results we thus find
\[
b \, \|\hat\beta - \beta^0\|^2 + O_p\!\Big( \tfrac{\|\hat\beta - \beta^0\|}{\sqrt{\min(N,T)}} \Big) + O_p\!\Big( \tfrac{1}{\min(N,T)} \Big) \le 0. \qquad (A.3)
\]
From this it follows that $\|\hat\beta - \beta^0\| = O_p\big( \min(N,T)^{-1/2} \big)$, which is what we wanted to show.

A.2 Proof of Likelihood Expansion

Definition A.1. For the $N \times R$ matrix $\lambda^0$ and the $T \times R$ matrix $f^0$ we define
\[
d_{\max}(\lambda^0, f^0) = \frac{1}{\sqrt{NT}} \big\| \lambda^0 f^{0\prime} \big\| = \sqrt{ \tfrac{1}{NT} \mu_1\big( \lambda^0 f^{0\prime} f^0 \lambda^{0\prime} \big) }, \qquad d_{\min}(\lambda^0, f^0) = \sqrt{ \tfrac{1}{NT} \mu_R\big( \lambda^0 f^{0\prime} f^0 \lambda^{0\prime} \big) }, \qquad (A.4)
\]
i.e. $d_{\max}(\lambda^0, f^0)$ and $d_{\min}(\lambda^0, f^0)$ are the square roots of the maximal and the minimal (non-zero) eigenvalue of $\lambda^0 f^{0\prime} f^0 \lambda^{0\prime} / NT$. Furthermore, the convergence radius $r_0(\lambda^0, f^0)$ is defined by
\[
r_0(\lambda^0, f^0) = \Bigg( \frac{4 \, d_{\max}(\lambda^0, f^0)}{d^2_{\min}(\lambda^0, f^0)} + \frac{1}{2 \, d_{\max}(\lambda^0, f^0)} \Bigg)^{-1}. \qquad (A.5)
\]
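To make Definition A.1 concrete, the following is a minimal numerical sketch (not part of the original derivation) that computes $d_{\max}$, $d_{\min}$ and $r_0$ from given loading and factor matrices; the function and variable names are illustrative.

```python
import numpy as np

def convergence_radius(lam0, f0):
    """Compute d_max, d_min and r_0 of Definition A.1 for an N x R
    loading matrix lam0 and a T x R factor matrix f0."""
    N, T = lam0.shape[0], f0.shape[0]
    R = lam0.shape[1]
    # the squared singular values of lam0 f0' / sqrt(NT) are the
    # non-zero eigenvalues of lam0 f0' f0 lam0' / (NT) in (A.4)
    s = np.linalg.svd(lam0 @ f0.T, compute_uv=False) / np.sqrt(N * T)
    d_max, d_min = s[0], s[R - 1]
    r0 = 1.0 / (4.0 * d_max / d_min**2 + 1.0 / (2.0 * d_max))
    return d_max, d_min, r0

# e.g. for i.i.d. N(0,1) loadings and factors, d_max, d_min and r0
# converge to positive constants as N and T grow, as used in the proofs below
rng = np.random.default_rng(0)
print(convergence_radius(rng.standard_normal((200, 2)), rng.standard_normal((100, 2))))
```

Working with singular values of $\lambda^0 f^{0\prime}/\sqrt{NT}$ avoids forming the $N \times N$ matrix in (A.4) explicitly.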
Lemma A.1. If the following condition is satisfied,
\[
\sum_{k=1}^{K} \big| \beta^0_k - \beta_k \big| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} < r_0(\lambda^0, f^0), \qquad (A.6)
\]
then
(i) the profile quasi likelihood function can be written as a power series in the $K+1$ parameters $\epsilon_0 = \|e\|/\sqrt{NT}$ and $\epsilon_k = \beta^0_k - \beta_k$, namely
\[
\mathcal{L}^0_{NT}(\beta) = \frac{1}{NT} \sum_{g=2}^{\infty} \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \cdots \epsilon_{k_g} \, L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \mathcal{X}_{k_2}, \ldots, \mathcal{X}_{k_g} \big),
\]
where the expansion coefficients are given by
\[
L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) = \tilde L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{(k_1}, \ldots, \mathcal{X}_{k_g)} \big) = \frac{1}{g!} \Big[ \tilde L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) + \text{all permutations of } k_1, \ldots, k_g \Big],
\]
(Footnote 11: Here we use the round bracket notation $(k_1, k_2, \ldots, k_g)$ for total symmetrization of these indices.) i.e. $L^{(g)}$ is obtained by total symmetrization of the last $g$ arguments of
\[
\tilde L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) = \sum_{p=1}^{g} (-1)^{p+1} \sum_{\substack{\nu_1 + \ldots + \nu_p = g \\ m_1 + \ldots + m_{p+1} = p-1 \\ 2 \ge \nu_j \ge 1, \; m_j \ge 0}} \operatorname{Tr}\Big( S^{(m_1)} T^{(\nu_1)}_{k_1 \ldots} S^{(m_2)} \cdots S^{(m_p)} T^{(\nu_p)}_{\ldots k_g} S^{(m_{p+1})} \Big),
\]
(Footnote 12: One finds $\tilde L^{(1)}( \lambda^0, f^0, \mathcal{X}_{k_1} ) = 0$, which is why the sum in the power series of $\mathcal{L}^0_{NT}$ starts from $g = 2$ instead of $g = 1$.) with
\[
S^{(0)} = -M_{\lambda^0}, \qquad S^{(m)} = \lambda^0 \big[ (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} \big]^{m} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}, \quad \text{for } m \ge 1,
\]
\[
T^{(1)}_k = \lambda^0 f^{0\prime} \mathcal{X}_k' + \mathcal{X}_k f^0 \lambda^{0\prime}, \qquad T^{(2)}_{k_1 k_2} = \mathcal{X}_{k_1} \mathcal{X}_{k_2}', \quad \text{for } k, k_1, k_2 = 0, \ldots, K,
\]
\[
\mathcal{X}_0 = \frac{\sqrt{NT}}{\|e\|} \, e, \qquad \mathcal{X}_k = X_k, \quad \text{for } k = 1, \ldots, K.
\]
(ii) the projector $M_{\hat\lambda}(\beta)$ can be written as a power series in the same parameters $\epsilon_k$ ($k = 0, \ldots, K$), namely
\[
M_{\hat\lambda}(\beta) = \sum_{g=0}^{\infty} \sum_{k_1=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \cdots \epsilon_{k_g} \, M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big),
\]
where the expansion coefficients are given by $M^{(0)}(\lambda^0, f^0) = M_{\lambda^0}$, and for $g \ge 1$,
\[
M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) = \tilde M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{(k_1}, \ldots, \mathcal{X}_{k_g)} \big) = \frac{1}{g!} \Big[ \tilde M^{(g)}\big( \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) + \text{all permutations of } k_1, \ldots, k_g \Big],
\]
i.e. $M^{(g)}$ is obtained by total symmetrization of the last $g$ arguments of
\[
\tilde M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) = \sum_{p=1}^{g} (-1)^{p+1} \sum_{\substack{\nu_1 + \ldots + \nu_p = g \\ m_1 + \ldots + m_{p+1} = p \\ 2 \ge \nu_j \ge 1, \; m_j \ge 0}} S^{(m_1)} T^{(\nu_1)}_{k_1 \ldots} S^{(m_2)} \cdots S^{(m_p)} T^{(\nu_p)}_{\ldots k_g} S^{(m_{p+1})},
\]
where $S^{(m)}$, $T^{(1)}_k$, $T^{(2)}_{k_1 k_2}$, and $\mathcal{X}_k$ are given above.
(iii) For $g \ge 3$ the coefficients $L^{(g)}$ in the series expansion of $\mathcal{L}^0_{NT}(\beta)$ are bounded as follows:
\[
\frac{1}{NT} \Big| L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) \Big| \le \frac{R \, g \, d^2_{\min}(\lambda^0, f^0)}{2} \Bigg( \frac{16 \, d_{\max}(\lambda^0, f^0)}{d^2_{\min}(\lambda^0, f^0)} \Bigg)^{g} \frac{\|\mathcal{X}_{k_1}\|}{\sqrt{NT}} \frac{\|\mathcal{X}_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|\mathcal{X}_{k_g}\|}{\sqrt{NT}}.
\]
Under the stronger condition
\[
\sum_{k=1}^{K} \big| \beta^0_k - \beta_k \big| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} < \frac{d^2_{\min}(\lambda^0, f^0)}{16 \, d_{\max}(\lambda^0, f^0)}, \qquad (A.7)
\]
we therefore have the following bound on the remainder when the series expansion for $\mathcal{L}^0_{NT}(\beta)$ is truncated at order $G \ge 2$:
\[
\Bigg| \mathcal{L}^0_{NT}(\beta) - \frac{1}{NT} \sum_{g=2}^{G} \sum_{k_1=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \cdots \epsilon_{k_g} \, L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) \Bigg| \le \frac{R \, (G+1) \, \alpha^{G+1} \, d^2_{\min}(\lambda^0, f^0)}{2 (1-\alpha)^2},
\]
where
\[
\alpha = \frac{16 \, d_{\max}(\lambda^0, f^0)}{d^2_{\min}(\lambda^0, f^0)} \Bigg( \sum_{k=1}^{K} \big| \beta^0_k - \beta_k \big| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} \Bigg) < 1.
\]
(iv) The operator norm of the coefficient $M^{(g)}$ in the series expansion of $M_{\hat\lambda}(\beta)$ is bounded as follows, for $g \ge 1$:
\[
\Big\| M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) \Big\| \le \frac{g}{2} \Bigg( \frac{16 \, d_{\max}(\lambda^0, f^0)}{d^2_{\min}(\lambda^0, f^0)} \Bigg)^{g} \frac{\|\mathcal{X}_{k_1}\|}{\sqrt{NT}} \frac{\|\mathcal{X}_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|\mathcal{X}_{k_g}\|}{\sqrt{NT}}.
\]
Under condition (A.7) we therefore have the following bound on the operator norm of the remainder of the series expansion of $M_{\hat\lambda}(\beta)$, for $G \ge 0$:
\[
\Bigg\| M_{\hat\lambda}(\beta) - \sum_{g=0}^{G} \sum_{k_1=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \cdots \epsilon_{k_g} \, M^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} \big) \Bigg\| \le \frac{(G+1) \, \alpha^{G+1}}{2 (1-\alpha)^2}.
\]

Proof of Theorem 2.1. The $R^0$ non-zero eigenvalues of the matrix $\lambda^0 f^{0\prime} f^0 \lambda^{0\prime} / NT$ are identical to the eigenvalues of the $R^0 \times R^0$ matrix $(f^{0\prime}f^0/T)^{1/2} (\lambda^{0\prime}\lambda^0/N) (f^{0\prime}f^0/T)^{1/2}$, and Assumption 2.3 guarantees that these eigenvalues, and thus also $d_{\max}(\lambda^0, f^0)$ and $d_{\min}(\lambda^0, f^0)$, converge to positive constants in probability. Therefore $r_0(\lambda^0, f^0)$ also converges to a positive constant in probability. Assumptions 2.1 and 2.3 furthermore imply that in the limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we have
\[
\frac{\|\lambda^0\|}{\sqrt{N}} = O_p(1), \quad \frac{\|f^0\|}{\sqrt{T}} = O_p(1), \quad \Big\| \big( \lambda^{0\prime}\lambda^0 / N \big)^{-1} \Big\| = O_p(1), \quad \Big\| \big( f^{0\prime}f^0 / T \big)^{-1} \Big\| = O_p(1), \quad \frac{\|X_k\|}{\sqrt{NT}} = O_p(1), \quad \frac{\|e\|}{\sqrt{NT}} = O_p\big( N^{-1/2} \big). \qquad (A.8)
\]
Thus, for $\|\beta - \beta^0\| \le c_{NT}$ with $c_{NT} = o(1)$, we have $\alpha \to 0$ as $N, T \to \infty$, i.e. condition (A.7) in part (iii) of Lemma A.1 is asymptotically satisfied, and by applying the lemma we find
\[
\frac{1}{NT} (\epsilon_0)^{g-r} \, L^{(g)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_r}, \mathcal{X}_0, \ldots, \mathcal{X}_0 \big) = O_p\Bigg[ \bigg( \frac{\|e\|}{\sqrt{NT}} \bigg)^{g-r} \Bigg] = O_p\big( N^{-\frac{g-r}{2}} \big), \qquad (A.9)
\]
where we used $\epsilon_0 \mathcal{X}_0 = e$ and the linearity of $L^{(g)}( \lambda^0, f^0, \mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_g} )$ in the arguments $\mathcal{X}_k$.
Truncating the expansion of $\mathcal{L}^0_{NT}(\beta)$ at order $G = 3$ and applying the corresponding result in Lemma A.1(iii) we obtain
\[
\mathcal{L}^0_{NT}(\beta) = \frac{1}{NT} \sum_{k_1,k_2=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \, L^{(2)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \mathcal{X}_{k_2} \big) + \frac{1}{NT} \sum_{k_1,k_2,k_3=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \epsilon_{k_3} \, L^{(3)}\big( \lambda^0, f^0, \mathcal{X}_{k_1}, \mathcal{X}_{k_2}, \mathcal{X}_{k_3} \big) + O_p\big( \alpha^4 \big)
\]
\[
= \mathcal{L}^0_{NT}(\beta^0) - \frac{2}{\sqrt{NT}} \big( \beta - \beta^0 \big)' \big( C^{(1)} + C^{(2)} \big) + \big( \beta - \beta^0 \big)' W \big( \beta - \beta^0 \big) + \mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta), \qquad (A.10)
\]
where, using (A.9), we find
\[
\mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta) = \frac{3}{NT} \sum_{k_1,k_2=1}^{K} \epsilon_{k_1} \epsilon_{k_2} \epsilon_0 \, L^{(3)}\big( \lambda^0, f^0, X_{k_1}, X_{k_2}, \mathcal{X}_0 \big) + \frac{1}{NT} \sum_{k_1,k_2,k_3=1}^{K} \epsilon_{k_1} \epsilon_{k_2} \epsilon_{k_3} \, L^{(3)}\big( \lambda^0, f^0, X_{k_1}, X_{k_2}, X_{k_3} \big)
\]
\[
+ O_p\Bigg[ \bigg( \sum_{k=1}^{K} \big| \beta^0_k - \beta_k \big| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} \bigg)^4 \Bigg] - O_p\Bigg[ \bigg( \frac{\|e\|}{\sqrt{NT}} \bigg)^4 \Bigg]
\]
\[
= O_p\big( \|\beta - \beta^0\|^2 N^{-1/2} \big) + O_p\big( \|\beta - \beta^0\|^3 \big) + O_p\big( \|\beta - \beta^0\| N^{-3/2} \big) + O_p\big( \|\beta - \beta^0\|^2 N^{-1} \big) + O_p\big( \|\beta - \beta^0\|^3 N^{-1/2} \big) + O_p\big( \|\beta - \beta^0\|^4 \big). \qquad (A.11)
\]
Here $O_p\big[ ( \|e\| / \sqrt{NT} )^4 \big]$ is not just some term of that order, but exactly the term of that order contained in $O_p(\alpha^4) = O_p\big[ \big( \sum_{k=1}^{K} | \beta^0_k - \beta_k | \|X_k\| / \sqrt{NT} + \|e\| / \sqrt{NT} \big)^4 \big]$. This term is not present in $\mathcal{L}^{0,\mathrm{rem}}_{NT}(\beta)$ since it is already contained in $\mathcal{L}^0_{NT}(\beta^0)$. (Footnote 13: Alternatively, we could have truncated the expansion at order $G = 4$. Then the term $O_p\big[ ( \|e\| / \sqrt{NT} )^4 \big]$ would be more explicit, namely it would equal $\frac{1}{NT} \epsilon_0^4 L^{(4)}( \lambda^0, f^0, \mathcal{X}_0, \mathcal{X}_0, \mathcal{X}_0, \mathcal{X}_0 )$, which is clearly contained in $\mathcal{L}^0_{NT}(\beta^0)$.) Equation (A.11) shows that the remainder satisfies the bound stated in the theorem, which concludes the proof.

Proof of Corollary 2.2. Using Assumption 2.2(ii) we find for $R = R^0$ that the smallest eigenvalue of $W$ satisfies
\[
\mu_K(W) = \min_{\{\alpha \in \mathbb{R}^K, \|\alpha\|=1\}} \alpha' W \alpha = \min_{\{\alpha \in \mathbb{R}^K, \|\alpha\|=1\}} \frac{1}{NT} \operatorname{Tr}\big( M_{f^0} X_\alpha' M_{\lambda^0} X_\alpha M_{f^0} \big) \ge \min_{\{\alpha \in \mathbb{R}^K, \|\alpha\|=1\}} \frac{1}{NT} \sum_{t=2R^0+1}^{T} \mu_t\big( X_\alpha' X_\alpha \big) \ge b_2, \quad \text{wpa1}, \qquad (A.12)
\]
where $X_\alpha = \sum_{k=1}^{K} \alpha_k X_k$, and therefore $\|W^{-1}\| \le 1/b_2$ wpa1. Using Assumption 2.1 we find
\[
\big| C^{(2)}_k \big| \le \frac{9 R^0}{2 \sqrt{NT}} \, \|e\|^2 \, \|X_k\| \, \big\| \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \big\| = O_p(1), \qquad (A.13)
\]
and therefore $\gamma \equiv W^{-1} \big( C^{(1)} + C^{(2)} \big) / \sqrt{NT} = O_p( 1/\sqrt{NT} )$. Applying Theorem 2.1 to the inequality $\mathcal{L}^0_{NT}(\hat\beta_{R^0}) \le \mathcal{L}^0_{NT}(\beta^0 + \gamma)$ then gives
\[
\big( \hat\beta_{R^0} - \beta^0 - \gamma \big)' W \big( \hat\beta_{R^0} - \beta^0 - \gamma \big) \le \mathcal{L}^{0,\mathrm{rem}}_{NT}(\gamma) - \mathcal{L}^{0,\mathrm{rem}}_{NT}(\hat\beta_{R^0}) = o_p\Big( \frac{1}{NT} \Big) - \mathcal{L}^{0,\mathrm{rem}}_{NT}(\hat\beta_{R^0}). \qquad (A.14)
\]
From this and consistency of $\hat\beta_{R^0}$ it follows that $\sqrt{NT} ( \hat\beta_{R^0} - \beta^0 ) = O_p(1)$, since otherwise the inequality would be violated asymptotically due to the bound on $\mathcal{L}^{0,\mathrm{rem}}_{NT}(\hat\beta_{R^0})$. From $\sqrt{NT}$-consistency of $\hat\beta_{R^0}$ it now follows that $\mathcal{L}^{0,\mathrm{rem}}_{NT}(\hat\beta_{R^0}) = o_p(1/NT)$, and using this the above inequality yields $\sqrt{NT} ( \hat\beta_{R^0} - \beta^0 - \gamma ) = o_p(1)$, which proves the corollary.

Lemma A.2. Under the assumptions of Theorem 2.1 we have
\[
\hat e(\beta) = M_{\lambda^0} e M_{f^0} + \hat e^{(1)}_e + \hat e^{(2)}_e - \sum_{k=1}^{K} \big( \beta_k - \beta^0_k \big) \big( \hat e^{(1)}_{X,k} + \hat e^{(2)}_{X,k} \big) + \hat e^{(\mathrm{rem})}(\beta),
\]
where the $N \times T$ matrix-valued expansion coefficients read
\[
\hat e^{(1)}_{X,k} = M_{\lambda^0} X_k M_{f^0},
\]
\[
\hat e^{(2)}_{X,k} = - M_{\lambda^0} X_k M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} - M_{\lambda^0} e M_{f^0} X_k' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
\]
\[
- \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} X_k' M_{\lambda^0} e M_{f^0} - \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} X_k M_{f^0}
\]
\[
- M_{\lambda^0} X_k f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} - M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} X_k M_{f^0},
\]
\[
\hat e^{(1)}_e = - M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} - \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e M_{f^0} - M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0},
\]
\[
\hat e^{(2)}_e = M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
- M_{\lambda^0} e M_{f^0} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
\]
\[
- M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0}
+ M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
\]
\[
+ \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
+ M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0}
\]
\[
+ \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0}
+ \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e M_{f^0}
\]
\[
- \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} e' M_{\lambda^0} e M_{f^0}
- M_{\lambda^0} e f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} e' M_{\lambda^0} e M_{f^0},
\]
and the remainder term satisfies, for any sequence $c_{NT} \to 0$,
\[
\sup_{\{\beta : \|\beta - \beta^0\| \le c_{NT}\}} \frac{\big\| \hat e^{(\mathrm{rem})}(\beta) \big\|}{N \|\beta - \beta^0\|^2 + \|\beta - \beta^0\| + N^{-1}} = O_p(1).
\]

Proof. The general expansion of $M_{\hat\lambda}(\beta)$ is given in Lemma A.1, and the analogous expansion for $M_{\hat f}(\beta)$ is obtained by applying the symmetry $N \leftrightarrow T$, $\lambda \leftrightarrow f$, $e \leftrightarrow e'$, $X_k \leftrightarrow X_k'$. For the residuals $\hat e(\beta)$ we have
\[
\hat e(\beta) = M_{\hat\lambda}(\beta) \big( Y - \beta \cdot X \big) M_{\hat f}(\beta) = M_{\hat\lambda}(\beta) \big( e - (\beta - \beta^0) \cdot X + \lambda^0 f^{0\prime} \big) M_{\hat f}(\beta), \qquad (A.15)
\]
and plugging in the expansions of $M_{\hat\lambda}(\beta)$ and $M_{\hat f}(\beta)$ it is straightforward to derive the expansion of $\hat e(\beta)$ from this, including the bound on the remainder.

Proof of Theorem 2.3. The terms in $B(\beta) + B'(\beta)$ in addition to $A(\beta)$ all have a spectral norm of order $O_p(\sqrt{N})$ for $\sqrt{N} \|\beta - \beta^0\| \le c$. Thus, the first part of the theorem directly follows from the second part by applying Weyl's inequality. What is left to show is that the second part holds. Applying the expansion of $\hat e(\beta)$ in Lemma A.2 together with $\| M_{\lambda^0} e M_{f^0} \| = O_p(\sqrt{N})$, $\| \hat e^{(1)}_e \| = O_p(1)$, $\| \hat e^{(2)}_e \| = O_p(N^{-1/2})$, $\| \hat e^{(1)}_{X,k} \| = O_p(N)$, $\| \hat e^{(2)}_{X,k} \| = O_p(\sqrt{N})$, and the bound on $\| \hat e^{(\mathrm{rem})} \|$ given in the lemma, we obtain
\[
\hat e'(\beta) \, \hat e(\beta) = B(\beta) + B'(\beta) + T^{(\mathrm{rem})}(\beta), \qquad (A.16)
\]
where the terms $B^{(\mathrm{rem},1)}(\beta)$ and $B^{(\mathrm{rem},2)}$ in $B(\beta)$ are given by
\[
B^{(\mathrm{rem},1)}(\beta) = M_{f^0} \big[ (\beta - \beta^0) \cdot X \big]' M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
+ M_{f^0} e' M_{\lambda^0} \big[ (\beta - \beta^0) \cdot X \big] M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
\]
\[
+ M_{f^0} e' M_{\lambda^0} e M_{f^0} \big[ (\beta - \beta^0) \cdot X \big]' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}
+ M_{f^0} \Big( M_{f^0} e' M_{\lambda^0} \hat e^{(2)}_e + \hat e^{(1)\prime}_e \hat e^{(2)}_e + \hat e^{(2)\prime}_e M_{\lambda^0} e M_{f^0} \Big) P_{f^0},
\]
\[
B^{(\mathrm{rem},2)} = \frac{1}{2} P_{f^0} \Big( M_{f^0} e' M_{\lambda^0} \hat e^{(2)}_e + \hat e^{(1)\prime}_e \hat e^{(2)}_e + \hat e^{(2)\prime}_e M_{\lambda^0} e M_{f^0} \Big) P_{f^0}
= f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} e M_{f^0} e' M_{\lambda^0} e M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}, \qquad (A.17)
\]
and for $\sqrt{N} \|\beta - \beta^0\| \le c$ (which implies $\| \hat e(\beta) \| = O_p(\sqrt{N})$) we have
\[
\big\| T^{(\mathrm{rem})}(\beta) \big\| = O_p\big( N^{-1/2} \big) + \|\beta - \beta^0\| \, O_p\big( N^{1/2} \big) + \|\beta - \beta^0\|^2 \, O_p\big( N^{3/2} \big), \qquad (A.18)
\]
which holds uniformly over $\beta$.
Note also that
\[
B^{(eeee)} + B^{(eeee)\prime} = M_{f^0} \Big( M_{f^0} e' M_{\lambda^0} \hat e^{(2)}_e + \hat e^{(1)\prime}_e \hat e^{(2)}_e + \hat e^{(2)\prime}_e M_{\lambda^0} e M_{f^0} \Big) M_{f^0}. \qquad (A.19)
\]
Thus we have $\| B^{(\mathrm{rem},2)} \| = O_p(1)$, and for $\sqrt{N} \|\beta - \beta^0\| \le c$ we have $\| B^{(\mathrm{rem},1)}(\beta) \| = O_p(1) + \|\beta - \beta^0\| O_p(N)$, and by Weyl's inequality
\[
\mu_t\big[ \hat e'(\beta) \hat e(\beta) \big] = \mu_t\big[ B(\beta) + B'(\beta) \big] + o_p\big[ 1 + \|\beta - \beta^0\|^2 \big], \qquad (A.20)
\]
again uniformly over $\beta$. This proves the theorem.

Proof of Corollary 2.4. From Theorem 2.1 we know that $\sqrt{N} ( \hat\beta_R - \beta^0 ) = O_p(1)$, so that the bounds in Theorem 2.3 and Assumption 2.4 are applicable. Since $\hat\beta_R$ minimizes $\mathcal{L}^R_{NT}(\beta)$, it must in particular satisfy $\mathcal{L}^R_{NT}(\hat\beta_R) \le \mathcal{L}^R_{NT}(\beta^0)$. Applying Theorem 2.3(i), Theorem 2.1, and Assumption 2.4 to this inequality gives
\[
\big( \hat\beta_R - \beta^0 \big)' W \big( \hat\beta_R - \beta^0 \big) - \frac{2}{\sqrt{NT}} \big( \hat\beta_R - \beta^0 \big)' \big( C^{(1)} + C^{(2)} \big)
\le \frac{1}{NT} \Bigg\{ \sum_{t=1}^{R - R^0} \mu_t\big[ \tilde A( \hat\beta_R ) \big] + O_p\Big[ \sqrt{N} + N^{5/4} \| \hat\beta_R - \beta^0 \| + N^2 \| \hat\beta_R - \beta^0 \|^2 / \log(N) \Big] \Bigg\}. \qquad (A.21)
\]
Our assumptions guarantee $C^{(2)} = O_p(1)$, and we explicitly assume $C^{(1)} = O_p(1)$. Furthermore, Assumption 2.2 guarantees that
\[
\big( \hat\beta_R - \beta^0 \big)' W \big( \hat\beta_R - \beta^0 \big) - \frac{1}{NT} \sum_{t=1}^{R - R^0} \mu_t\big[ \tilde A( \hat\beta_R ) \big] \ge b_2 \, \| \hat\beta_R - \beta^0 \|^2. \qquad (A.22)
\]
Thus we obtain
\[
b_2 \big( N^{3/4} \| \hat\beta_R - \beta^0 \| \big)^2 \le O_p(1) + O_p\big( N^{3/4} \| \hat\beta_R - \beta^0 \| \big) + o_p\Big[ \big( N^{3/4} \| \hat\beta_R - \beta^0 \| \big)^2 \Big], \qquad (A.23)
\]
from which we can conclude that $N^{3/4} \| \hat\beta_R - \beta^0 \| = O_p(1)$, which proves the first part of the corollary.

Proof of Corollary 2.5. Having $N^{3/4} \| \hat\beta_R - \beta^0 \| = O_p(1)$, the bound in Assumption 2.5 becomes applicable. We already introduced $\gamma \equiv W^{-1} ( C^{(1)} + C^{(2)} ) / \sqrt{NT} = O_p( 1/\sqrt{NT} )$. Since $\hat\beta_R$ minimizes $\mathcal{L}^R_{NT}(\beta)$, it must in particular satisfy $\mathcal{L}^R_{NT}(\hat\beta_R) \le \mathcal{L}^R_{NT}(\beta^0 + \gamma)$. Using Theorem 2.3(ii) and Assumption 2.5 it follows that
\[
\mathcal{L}^0_{NT}(\hat\beta_R) \le \mathcal{L}^0_{NT}(\beta^0 + \gamma) + \frac{1}{NT} \, o_p\Big[ \big( 1 + \sqrt{NT} \, \| \hat\beta_R - \beta^0 \|^2 \big)^2 \Big]. \qquad (A.24)
\]
The rest of the proof is analogous to the proof of Corollary 2.2.

Appendix B

Appendix to Chapter 3

B.1 Alternative GMM approach

In this section we show that in the presence of factors a moment-based estimation approach along the lines originally proposed by BLP is inadequate. The moment conditions imposed by the model are
\[
\mathbb{E}\big[ e_{jt}\big( \alpha^0, \beta^0, \lambda^0 f^{0\prime} \big) X_{k,jt} \big] = 0, \quad k = 1, \ldots, K, \qquad \mathbb{E}\big[ e_{jt}\big( \alpha^0, \beta^0, \lambda^0 f^{0\prime} \big) Z_{m,jt} \big] = 0, \quad m = 1, \ldots, M, \qquad (B.1)
\]
where $e_{jt}( \alpha, \beta, \lambda f' ) = \delta_{jt}( \alpha, s_t, X_t ) - \sum_{k=1}^{K} \beta_k X_{k,jt} - \sum_{r=1}^{R} \lambda_{jr} f_{tr}$. Note that we write the residuals $e_{jt}$ as a function of the $J \times T$ matrix $\lambda f'$ in order to avoid the ambiguity of the decomposition into $\lambda$ and $f$. The corresponding sample moments read
\[
m^X_k\big( \alpha, \beta, \lambda f' \big) = \frac{1}{JT} \operatorname{Tr}\big[ e( \alpha, \beta, \lambda f' ) X_k' \big], \qquad m^Z_m\big( \alpha, \beta, \lambda f' \big) = \frac{1}{JT} \operatorname{Tr}\big[ e( \alpha, \beta, \lambda f' ) Z_m' \big]. \qquad (B.2)
\]
We also define the sample moment vectors $m^X( \alpha, \beta, \lambda f' ) = ( m^X_1, \ldots, m^X_K )'$ and $m^Z( \alpha, \beta, \lambda f' ) = ( m^Z_1, \ldots, m^Z_M )'$. An alternative estimator for $\alpha$, $\beta$, $\lambda$ and $f$ is then given by (Footnote 1: The minimizing $\hat\lambda_{\alpha,\beta}$ and $\hat f_{\alpha,\beta}$ are the principal components estimators, e.g. $\hat\lambda_{\alpha,\beta}$ consists of the eigenvectors corresponding to the $R$ largest eigenvalues of the $J \times J$ matrix $\big( \delta(\alpha, s, X) - \sum_{k=1}^{K} \beta_k X_k \big) \big( \delta(\alpha, s, X) - \sum_{k=1}^{K} \beta_k X_k \big)'$.)
\[
\big( \hat\lambda_{\alpha,\beta}, \hat f_{\alpha,\beta} \big) = \operatorname*{argmin}_{\{\lambda, f\}} \; \sum_{j=1}^{J} \sum_{t=1}^{T} e^2_{jt}\big( \alpha, \beta, \lambda f' \big),
\]
\[
\big( \hat\alpha_{\mathrm{GMM}}, \hat\beta_{\mathrm{GMM}} \big) = \operatorname*{argmin}_{\{\alpha \in \mathcal{B}_\alpha, \, \beta\}} \;
\begin{pmatrix} m^X\big( \alpha, \beta, \hat\lambda_{\alpha,\beta} \hat f'_{\alpha,\beta} \big) \\ m^Z\big( \alpha, \beta, \hat\lambda_{\alpha,\beta} \hat f'_{\alpha,\beta} \big) \end{pmatrix}'
\mathcal{W}_{JT}
\begin{pmatrix} m^X\big( \alpha, \beta, \hat\lambda_{\alpha,\beta} \hat f'_{\alpha,\beta} \big) \\ m^Z\big( \alpha, \beta, \hat\lambda_{\alpha,\beta} \hat f'_{\alpha,\beta} \big) \end{pmatrix}, \qquad (B.3)
\]
where $\mathcal{W}_{JT}$ is a positive definite $(K+M) \times (K+M)$ weight matrix. The main difference between this alternative estimator and our estimator (3.7) is that the least-squares step is used solely to recover estimates of the factors and factor loadings (principal components estimator), while the structural parameters $(\alpha, \beta)$ are estimated in the GMM second step.
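As a concrete illustration of the sample moments in (B.2), here is a minimal numpy sketch, assuming the residual matrix $e$ and the regressors and instruments are available as $J \times T$ arrays; all names are illustrative and this is not part of the formal estimator definition.

```python
import numpy as np

def sample_moments(e, X_list, Z_list):
    """Sample moments (B.2): m^X_k = Tr(e X_k')/(JT), m^Z_m = Tr(e Z_m')/(JT)."""
    J, T = e.shape
    # Tr(e X') equals the elementwise sum of e * X, avoiding the product e @ X.T
    m_X = np.array([np.sum(e * X) for X in X_list]) / (J * T)
    m_Z = np.array([np.sum(e * Z) for Z in Z_list]) / (J * T)
    return m_X, m_Z
```

Stacking the two vectors into $g = (m^{X\prime}, m^{Z\prime})'$ and forming $g' \mathcal{W}_{JT} g$ gives the GMM objective minimized in the second line of (B.3).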
The relation between $\hat\alpha$ and $\hat\beta$ defined in (3.7) and $\hat\alpha_{\mathrm{GMM}}$ and $\hat\beta_{\mathrm{GMM}}$ defined in (B.3) is as follows.

(i) Let $R = 0$ (no factors) and set
\[
\mathcal{W}_{JT} = \begin{pmatrix} \big( \frac{1}{JT} x'x \big)^{-1} & 0_{K \times M} \\ 0_{M \times K} & 0_{M \times M} \end{pmatrix}
+ \begin{pmatrix} -(x'x)^{-1} x'z \\ \mathbb{1}_M \end{pmatrix}
\Big( \tfrac{1}{JT} z' M_x z \Big)^{-1} W_{JT} \Big( \tfrac{1}{JT} z' M_x z \Big)^{-1}
\begin{pmatrix} -(x'x)^{-1} x'z \\ \mathbb{1}_M \end{pmatrix}', \qquad (B.4)
\]
where $x$ is a $JT \times K$ matrix and $z$ is a $JT \times M$ matrix, given by $x_{\cdot,k} = \operatorname{vec}(X_k)$, $k = 1, \ldots, K$, and $z_{\cdot,m} = \operatorname{vec}(Z_m)$, $m = 1, \ldots, M$. Then $\hat\alpha$ and $\hat\beta$ solve (3.7) with weight matrix $W_{JT}$ if and only if they solve (B.3) with this weight matrix $\mathcal{W}_{JT}$, i.e. in this case we have $(\hat\alpha, \hat\beta) = (\hat\alpha_{\mathrm{GMM}}, \hat\beta_{\mathrm{GMM}})$. (Footnote 2: With this weight matrix $\mathcal{W}_{JT}$ the second-stage objective function in (B.3) becomes
\[
\big( d(\alpha) - x\beta \big)' x (x'x)^{-1} x' \big( d(\alpha) - x\beta \big) / JT + d'(\alpha) M_x z \big( z' M_x z \big)^{-1} W_{JT} \big( z' M_x z \big)^{-1} z' M_x d(\alpha)
= \big( d(\alpha) - x\beta \big)' P_x \big( d(\alpha) - x\beta \big) / JT + \tilde\gamma_\alpha' W_{JT} \tilde\gamma_\alpha,
\]
where $d(\alpha) = \operatorname{vec}\big( \delta(\alpha, s, X) - \delta(\alpha^0, s, X) \big)$. Here $\beta$ only appears in the first term, and by choosing $\beta = \hat\beta = (x'x)^{-1} x' d(\alpha)$ this term becomes zero. Thus we are left with the second term, which is exactly the second-stage objective function in (3.7) in this case, since for $R = 0$ by the Frisch-Waugh theorem we have $\tilde\gamma_\alpha = ( z' M_x z )^{-1} z' M_x d(\alpha)$.)

(ii) Let $R > 0$ and $M = L$ (exactly identified case). Then a solution of (3.7) is also a solution of (B.3), but not every solution of (B.3) needs to be a solution of (3.7).

(iii) For $M > L$ and $R > 0$ there is no straightforward characterization of the relationship between the estimators in (3.7) and (B.3).

We want to discuss the exactly identified case $M = L$ a bit further. The reason why in this case every solution of (3.7) also solves (B.3) is that the first order conditions (FOCs) with respect to $\beta$ and $\gamma$ of the first-stage optimization in (3.7) read $m^X(\hat\alpha, \hat\beta, \hat\lambda_{\hat\alpha,\hat\beta} \hat f'_{\hat\alpha,\hat\beta}) = 0$ and $m^Z(\hat\alpha, \hat\beta, \hat\lambda_{\hat\alpha,\hat\beta} \hat f'_{\hat\alpha,\hat\beta}) = 0$, which implies that the GMM objective function of (B.3) is zero, i.e. minimized. The reverse statement is not true, because for $R > 0$ the first-stage objective function in (3.7) is no longer a quadratic function of $\beta$ and $\gamma$ once one concentrates out $\lambda$ and $f$, and it can have multiple local minima that satisfy the FOC. Therefore $\hat\alpha_{\mathrm{GMM}}$ and $\hat\beta_{\mathrm{GMM}}$ can be inconsistent while $\hat\alpha$ and $\hat\beta$ are consistent, which is the main reason to consider the latter in this paper.

To illustrate this important difference between $(\hat\alpha_{\mathrm{GMM}}, \hat\beta_{\mathrm{GMM}})$ and $(\hat\alpha, \hat\beta)$, we give a simple example of a linear model in which the QMLE objective function has multiple local minima. Consider a DGP where $Y_{jt} = \beta^0 X_{jt} + \lambda^0_j f^0_t + e_{jt}$, with $X_{jt} = 1 + 0.5 \tilde X_{jt} + \lambda^0_j f^0_t$, and $\tilde X_{jt}$, $e_{jt}$, $\lambda^0_j$ and $f^0_t$ all identically distributed as $N(0,1)$, mutually independent, and independent across $j$ and $t$. Here the number of factors is $R = 1$, and we assume that $Y_{jt}$ and $X_{jt}$ are observed and that $\beta^0 = 0$. The least squares objective function in this model, which corresponds to our inner loop, is given by $L(\beta) = \sum_{t=2}^{T} \mu_t\big[ (Y - \beta X)' (Y - \beta X) \big]$. For $J = T = 100$ and a concrete draw of $Y$ and $X$, this objective function is plotted in Figure B.1. The shape of this objective function is qualitatively unchanged for other draws of $Y$ and $X$, or for larger values of $J$ and $T$.
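The following minimal simulation sketch (not the code behind Figure B.1) reproduces this example: it draws one panel from the DGP above and evaluates $L(\beta)$ on a grid. The division by $JT$ is a normalization added here for readability; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
J = T = 100
lam, f = rng.standard_normal(J), rng.standard_normal(T)
Xtilde, e = rng.standard_normal((J, T)), rng.standard_normal((J, T))
X = 1 + 0.5 * Xtilde + np.outer(lam, f)   # regressor correlated with lam f'
Y = np.outer(lam, f) + e                  # beta0 = 0

def L(beta, R=1):
    # profile objective: sum of all but the R largest eigenvalues of
    # (Y - beta X)'(Y - beta X), i.e. sum_{t=R+1}^T mu_t[.]
    res = Y - beta * X
    mu = np.linalg.eigvalsh(res.T @ res)  # eigenvalues in ascending order
    return mu[:-R].sum() / (J * T)

grid = np.linspace(-0.5, 1.5, 201)
vals = [L(b) for b in grid]
print("global minimizer:", grid[int(np.argmin(vals))])  # close to beta0 = 0
```

For typical draws, scanning `vals` also reveals the second local minimum near $\beta \approx 0.8$ discussed next.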
As predicted by our consistency result, the global minimum of $L(\beta)$ is close to $\beta^0 = 0$, but another local minimum is present, which neither vanishes nor converges to $\beta^0 = 0$ when $J$ and $T$ grow to infinity. Thus, the global minimum of $L(\beta)$ gives a consistent estimator, but the solution to the FOC $\partial L(\beta) / \partial \beta = 0$ does not. In this example, the principal components estimators of $\lambda(\beta)$ and $f(\beta)$, which are derived from $Y - \beta X$, become very bad approximations of $\lambda^0$ and $f^0$ for $\beta \gtrsim 0.5$. Thus, for $\beta \gtrsim 0.5$, the fixed effects are essentially no longer controlled for in the objective function, and the local minimum around $\beta \approx 0.8$ reflects the resulting endogeneity problem.

[Figure B.1 here. Axes: $\beta$ (horizontal, from -0.5 to 1.5) against the objective function (vertical, from 0.9 to 1.5). Caption: Example for multiple local minima in the objective function $L(\beta)$. Notes: The global minimum can be found close to the true value $\beta^0 = 0$, but another local minimum exists around $\beta \approx 0.8$, which renders the FOC inappropriate for defining the estimator $\hat\beta$.]

B.2 Details for Section 3.4 (Consistency and Asymptotic Distribution)

B.2.1 Formulas for Asymptotic Bias Terms

Let the $J \times 1$ vector $\Sigma^{(1)}_e$, the $T \times 1$ vector $\Sigma^{(2)}_e$, and the $T \times T$ matrices $\Sigma^{X,e}_k$, $k = 1, \ldots, K$, and $\Sigma^{Z,e}_m$, $m = 1, \ldots, M$, be defined by
\[
\Sigma^{(1)}_{e,j} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big( e^0_{jt} \big)^2, \qquad \Sigma^{(2)}_{e,t} = \frac{1}{J} \sum_{j=1}^{J} \mathbb{E}\big( e^0_{jt} \big)^2, \qquad \Sigma^{X,e}_{k,t\tau} = \frac{1}{J} \sum_{j=1}^{J} \mathbb{E}\big( X_{k,jt} \, e^0_{j\tau} \big), \qquad \Sigma^{Z,e}_{m,t\tau} = \frac{1}{J} \sum_{j=1}^{J} \mathbb{E}\big( Z_{m,jt} \, e^0_{j\tau} \big), \qquad (B.5)
\]
where $j = 1, \ldots, J$ and $t, \tau = 1, \ldots, T$. Furthermore, let
\[
b^{(x,0)}_k = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\big( P_{f^0} \Sigma^{X,e}_k \big), \qquad
b^{(x,1)}_k = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\Big[ \operatorname{diag}\big( \Sigma^{(1)}_e \big) M_{\lambda^0} X_k f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \Big],
\]
\[
b^{(x,2)}_k = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\Big[ \operatorname{diag}\big( \Sigma^{(2)}_e \big) M_{f^0} X_k' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \Big],
\]
\[
b^{(z,0)}_m = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\big( P_{f^0} \Sigma^{Z,e}_m \big), \qquad
b^{(z,1)}_m = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\Big[ \operatorname{diag}\big( \Sigma^{(1)}_e \big) M_{\lambda^0} Z_m f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \Big],
\]
\[
b^{(z,2)}_m = \operatorname*{plim}_{J,T \to \infty} \operatorname{Tr}\Big[ \operatorname{diag}\big( \Sigma^{(2)}_e \big) M_{f^0} Z_m' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \Big], \qquad (B.6)
\]
and we set $b^{(x,i)} = \big( b^{(x,i)}_1, \ldots, b^{(x,i)}_K \big)'$ and $b^{(z,i)} = \big( b^{(z,i)}_1, \ldots, b^{(z,i)}_M \big)'$, for $i = 0, 1, 2$. With these definitions we can now give the expression for the asymptotic bias terms which appear in Theorem 3.2, namely
\[
B_i = - \big( G \mathcal{W} G' \big)^{-1} G \mathcal{W} \begin{pmatrix} b^{(x,i)} \\ b^{(z,i)} \end{pmatrix}, \qquad (B.7)
\]
where $i = 0, 1, 2$.

B.2.2 Assumptions for Consistency

Assumption B.1. We assume that the probability limits of $\lambda^{0\prime}\lambda^0 / J$ and $f^{0\prime}f^0 / T$ exist, are finite and have full rank, i.e. (a) $\operatorname{plim}_{J,T \to \infty} \lambda^{0\prime}\lambda^0 / J > 0$, (b) $\operatorname{plim}_{J,T \to \infty} f^{0\prime}f^0 / T > 0$.

Assumption B.2. (i) $\frac{1}{JT} \operatorname{Tr}\big( e^0 X_k' \big) = o_p(1)$ for $k = 1, \ldots, K$, and $\frac{1}{JT} \operatorname{Tr}\big( e^0 Z_m' \big) = o_p(1)$ for $m = 1, \ldots, M$. (ii) $\| e^0 \| = O_p\big( \sqrt{\max(J,T)} \big)$.

Assumption B.3. (i) $\sup_{\alpha \in \mathcal{B}_\alpha \setminus \{\alpha^0\}} \| \delta(\alpha) - \delta(\alpha^0) \|_F \, / \, \| \alpha - \alpha^0 \| = O_p\big( \sqrt{JT} \big)$. (ii) $W_{JT} \to_p W > 0$.

Assumption B.4. (a) Let $\Xi_{jt} = \big( X_{1,jt}, \ldots, X_{K,jt}, Z_{1,jt}, \ldots, Z_{M,jt} \big)'$ be the $(K+M)$-vector of regressors and instruments that appear in step 1 of (3.7). We assume that the probability limit of the $(K+M) \times (K+M)$ matrix $(JT)^{-1} \sum_{j,t} \Xi_{jt} \Xi_{jt}'$ exists and is positive definite, i.e. $\operatorname{plim}_{J,T \to \infty} \big[ (JT)^{-1} \sum_{j=1}^{J} \sum_{t=1}^{T} \Xi_{jt} \Xi_{jt}' \big] > 0$.

(b) We assume that the $K$ regressors $X$ can be decomposed into $n$ "low-rank regressors" $X_{\mathrm{low}}$ and $K - n$ "high-rank regressors" $X_{\mathrm{high}}$. The two types of regressors satisfy:

(i) For $\rho \in \mathbb{R}^{K+M-n}$ define the $J \times T$ matrix $\Xi_{\mathrm{high},\rho} = \sum_{m=1}^{M} \rho_m Z_m + \sum_{k=1}^{K-n} \rho_{M+k} X_{\mathrm{high},k}$, which is a linear combination of high-rank regressors and instruments. We assume that there exists a constant $b > 0$ such that
\[
\min_{\{\rho \in \mathbb{R}^{K+M-n}, \|\rho\|=1\}} \; \sum_{t=2R+n+1}^{T} \mu_t\Bigg( \frac{\Xi_{\mathrm{high},\rho}' \Xi_{\mathrm{high},\rho}}{JT} \Bigg) \ge b \quad \text{wpa1}.
\]

(ii) For the low-rank regressors we assume $\operatorname{rank}(X_{\mathrm{low},k}) = 1$, $k = 1, \ldots, n$, i.e.
they can be written as $X_{\mathrm{low},k} = w_k v_k'$ for $J \times 1$ vectors $w_k$ and $T \times 1$ vectors $v_k$, and we define the $J \times n$ matrix $w = (w_1, \ldots, w_n)$ and the $T \times n$ matrix $v = (v_1, \ldots, v_n)$. We assume that there exists $B > 0$ such that $J^{-1} \lambda^{0\prime} M_v \lambda^0 > B \, \mathbb{1}_R$ wpa1, and $T^{-1} f^{0\prime} M_w f^0 > B \, \mathbb{1}_R$ wpa1.

Assumption B.1 guarantees that $\|\lambda^0\|$ and $\|f^0\|$ grow at rates $\sqrt{J}$ and $\sqrt{T}$, respectively. This is a so-called "strong factor" assumption that makes sure the influence of the factors is sufficiently large, so that the principal components estimators $\hat\lambda$ and $\hat f$ can pick up the correct factor loadings and factors. Assumption B.2 imposes (i) weak exogeneity of $X_k$ and $Z_m$ with respect to $e^0$, and (ii) a bound on the spectral norm of $e^0$, which is satisfied as long as $e^0_{jt}$ has mean zero, has a uniformly bounded fourth moment (across $j$, $t$, $J$, $T$), and is weakly correlated across $j$ and $t$. Assumption B.3(i) demands that a bound on the Frobenius norm of $(\delta(\alpha) - \delta(\alpha^0)) / \|\alpha - \alpha^0\|$ exists, which is satisfied as long as, e.g., the elements $(\delta_{jt}(\alpha) - \delta_{jt}(\alpha^0)) / \|\alpha - \alpha^0\|$ are uniformly bounded (across $j$, $t$, $J$, $T$). Assumption B.3(ii) requires existence of a positive definite probability limit of the weight matrix $W_{JT}$. Assumption B.4(a) is the standard non-collinearity assumption on the regressors $X_k$ and the instruments $Z_m$. As discussed in Bai (2009b) and Moon and Weidner (2010a; 2010b), just assuming weak exogeneity and non-collinearity is not sufficient for consistency of the QMLE in the presence of factors, and the same is true here. In particular, in a model with factors one needs to distinguish so-called "low-rank regressors" and "high-rank regressors" and treat them differently. This distinction is introduced in Assumption B.4(b), and additional assumptions on the low- and high-rank regressors are imposed. Low-rank regressors are, for example, regressors that are constant over either markets $t$ or products $j$, or that more generally factor into a component that depends only on $j$ and a component that depends only on $t$. All other regressors are usually high-rank regressors. Assumption B.4 in this paper is equivalent to Assumption 4 in Moon and Weidner (2010a; 2010b), and some further discussion can be found there. If there are no low-rank regressors (if $n = 0$ in Assumption B.4), then Theorem 3.1 holds even without imposing Assumption B.1, i.e. also when factors are "weak".

B.2.3 Additional Assumptions for Asymptotic Distribution and Bias Correction

Assumption B.5. We assume existence of the probability limits $G$, $\Omega$, $\mathcal{W}$, $b^{(x,i)}$ and $b^{(z,i)}$, $i = 0, 1, 2$. In addition, we assume $G \mathcal{W} G' > 0$ and $G \mathcal{W} \Omega \mathcal{W} G' > 0$.

Assumption B.6. (i) There exist $J \times T$ matrices $r^\Delta(\alpha)$ and $\nabla_l \delta(\alpha^0)$, $l = 1, \ldots, L$, such that
\[
\delta(\alpha) - \delta(\alpha^0) = \sum_{l=1}^{L} \big( \alpha_l - \alpha^0_l \big) \nabla_l \delta(\alpha^0) + r^\Delta(\alpha),
\]
and
\[
\frac{1}{\sqrt{JT}} \big\| \nabla_l \delta(\alpha^0) \big\|_F = O_p(1), \quad \text{for } l = 1, \ldots, L, \qquad
\sup_{\{\alpha : \sqrt{J} \|\alpha - \alpha^0\| < c, \, \alpha \ne \alpha^0\}} \frac{1}{\sqrt{JT}} \frac{\| r^\Delta(\alpha) \|_F}{\|\alpha - \alpha^0\|} = o_p(1), \quad \text{for all } c > 0.
\]
(ii) $\|\lambda^0_j\|$ and $\|f^0_t\|$ are uniformly bounded across $j$, $t$, $J$ and $T$.
(iii) The errors $e^0_{jt}$ are independent across $j$ and $t$, they satisfy $\mathbb{E} e^0_{jt} = 0$, and $\mathbb{E} \big( e^0_{jt} \big)^{8+\epsilon}$ is bounded uniformly across $j$, $t$ and $J$, $T$, for some $\epsilon > 0$.
(iv) The regressors $X_k$, $k = 1, \ldots, K$ (both high- and low-rank regressors), and the instruments $Z_m$, $m = 1, \ldots, M$, can be decomposed as $X_k = X^{\mathrm{str}}_k + X^{\mathrm{weak}}_k$ and $Z_m = Z^{\mathrm{str}}_m + Z^{\mathrm{weak}}_m$. The components $X^{\mathrm{str}}_k$ and $Z^{\mathrm{str}}_m$ are strictly exogenous, i.e. $X^{\mathrm{str}}_{k,jt}$ and $Z^{\mathrm{str}}_{m,jt}$ are independent of $e^0_{i\tau}$ for all $j$, $i$, $t$, $\tau$.
The components $X^{\mathrm{weak}}_k$ and $Z^{\mathrm{weak}}_m$ are weakly exogenous, and we assume
\[
X^{\mathrm{weak}}_{k,jt} = \sum_{\tau=1}^{t-1} c_{k,j\tau} \, e^0_{j,t-\tau}, \qquad Z^{\mathrm{weak}}_{m,jt} = \sum_{\tau=1}^{t-1} d_{m,j\tau} \, e^0_{j,t-\tau},
\]
for some coefficients $c_{k,j\tau}$ and $d_{m,j\tau}$ that satisfy $|c_{k,j\tau}| < \alpha^\tau$, $|d_{m,j\tau}| < \alpha^\tau$, where $\alpha \in (0,1)$ is a constant that is independent of $\tau = 1, \ldots, T-1$, $j = 1, \ldots, J$, $k = 1, \ldots, K$ and $m = 1, \ldots, M$. We also assume that $\mathbb{E}\big( X^{\mathrm{str}}_{k,jt} \big)^{8+\epsilon}$ and $\mathbb{E}\big( Z^{\mathrm{str}}_{m,jt} \big)^{8+\epsilon}$ are bounded uniformly over $j$, $t$ and $J$, $T$, for some $\epsilon > 0$.

Assumption B.2 is implied by Assumption B.6, so it is not necessary to impose it explicitly in Theorem 3.2. Parts (ii), (iii) and (iv) of Assumption B.6 are identical to Assumption 5 in Moon and Weidner (2010a; 2010b), except for the appearance of the instruments $Z_m$ here, which need to be included since they appear as additional regressors in the first step of our estimation procedure. Part (i) of Assumption B.6 can, for example, be justified by assuming that within any $\sqrt{J}$-shrinking neighborhood of $\alpha^0$ we have wpa1 that $\delta_{jt}(\alpha)$ is differentiable, that $|\nabla_l \delta_{jt}(\alpha)|$ is uniformly bounded across $j$, $t$, $J$ and $T$, and that $\nabla_l \delta_{jt}(\alpha)$ is Lipschitz continuous with a Lipschitz constant that is uniformly bounded across $j$, $t$, $J$ and $T$, for all $l = 1, \ldots, L$. But since the assumption is only on the Frobenius norm of the gradient and remainder term, one can also conceive weaker sufficient conditions for Assumption B.6(i).

Assumption B.7. For all $c > 0$ and $l = 1, \ldots, L$ we have
\[
\sup_{\{\alpha : \sqrt{JT} \|\alpha - \alpha^0\| < c\}} \big\| \nabla_l \delta(\alpha) - \nabla_l \delta(\alpha^0) \big\|_F = o_p\big( \sqrt{JT} \big).
\]
This last assumption is needed to guarantee consistency of the bias and variance estimators that are presented in the following.

B.2.4 Bias and Variance Estimators

Given the LS-MD estimators $\hat\alpha$ and $\hat\beta$, we can define the residuals
\[
\hat e = \delta(\hat\alpha, s, X) - \sum_{k=1}^{K} \hat\beta_k X_k - \hat\lambda \hat f'. \qquad (B.8)
\]
We also define the $JT \times K$ matrix $\hat x^{\lambda f}$, the $JT \times M$ matrix $\hat z^{\lambda f}$, and the $JT \times L$ matrix $\hat g$ by
\[
\hat x^{\lambda f}_{\cdot,k} = \operatorname{vec}\big( M_{\hat\lambda} X_k M_{\hat f} \big), \qquad \hat z^{\lambda f}_{\cdot,m} = \operatorname{vec}\big( M_{\hat\lambda} Z_m M_{\hat f} \big), \qquad \hat g_{\cdot,l} = - \operatorname{vec}\big( \nabla_l \delta(\hat\alpha) \big), \qquad (B.9)
\]
where $k = 1, \ldots, K$, $m = 1, \ldots, M$, and $l = 1, \ldots, L$. The definitions of $\hat\Sigma^{\mathrm{vec}}_e$, $\hat\Sigma^{(1)}_e$ and $\hat\Sigma^{(2)}_e$ are analogous to those of $\Sigma^{\mathrm{vec}}_e$, $\Sigma^{(1)}_e$ and $\Sigma^{(2)}_e$, but with $e$ replaced by $\hat e$. The $T \times T$ matrices $\hat\Sigma^{X,e}_k$, $k = 1, \ldots, K$, and $\hat\Sigma^{Z,e}_m$, $m = 1, \ldots, M$, are defined by
\[
\hat\Sigma^{X,e}_{k,t\tau} = \begin{cases} \frac{1}{J} \sum_{j=1}^{J} X_{k,jt} \, \hat e_{j\tau} & \text{for } 0 < t - \tau \le h, \\ 0 & \text{otherwise}, \end{cases}
\qquad
\hat\Sigma^{Z,e}_{m,t\tau} = \begin{cases} \frac{1}{J} \sum_{j=1}^{J} Z_{m,jt} \, \hat e_{j\tau} & \text{for } 0 < t - \tau \le h, \\ 0 & \text{otherwise}, \end{cases} \qquad (B.10)
\]
where $t, \tau = 1, \ldots, T$, and $h \in \mathbb{N}$ is a bandwidth parameter. Using these objects we define
\[
\hat G = \frac{1}{JT} \begin{pmatrix} \hat g' \hat x^{\lambda f} & \hat g' \hat z^{\lambda f} \\ \hat x^{\lambda f\prime} \hat x^{\lambda f} & \hat x^{\lambda f\prime} \hat z^{\lambda f} \end{pmatrix}, \qquad
\hat\Omega = \frac{1}{JT} \big( \hat x^{\lambda f}, \hat z^{\lambda f} \big)' \operatorname{diag}\big( \hat\Sigma^{\mathrm{vec}}_e \big) \big( \hat x^{\lambda f}, \hat z^{\lambda f} \big),
\]
\[
\hat b^{(x,0)}_k = \operatorname{Tr}\big( P_{\hat f} \hat\Sigma^{X,e}_k \big), \qquad
\hat b^{(x,1)}_k = \operatorname{Tr}\Big[ \operatorname{diag}\big( \hat\Sigma^{(1)}_e \big) M_{\hat\lambda} X_k \hat f (\hat f'\hat f)^{-1} (\hat\lambda'\hat\lambda)^{-1} \hat\lambda' \Big], \qquad
\hat b^{(x,2)}_k = \operatorname{Tr}\Big[ \operatorname{diag}\big( \hat\Sigma^{(2)}_e \big) M_{\hat f} X_k' \hat\lambda (\hat\lambda'\hat\lambda)^{-1} (\hat f'\hat f)^{-1} \hat f' \Big],
\]
\[
\hat b^{(z,0)}_m = \operatorname{Tr}\big( P_{\hat f} \hat\Sigma^{Z,e}_m \big), \qquad
\hat b^{(z,1)}_m = \operatorname{Tr}\Big[ \operatorname{diag}\big( \hat\Sigma^{(1)}_e \big) M_{\hat\lambda} Z_m \hat f (\hat f'\hat f)^{-1} (\hat\lambda'\hat\lambda)^{-1} \hat\lambda' \Big], \qquad
\hat b^{(z,2)}_m = \operatorname{Tr}\Big[ \operatorname{diag}\big( \hat\Sigma^{(2)}_e \big) M_{\hat f} Z_m' \hat\lambda (\hat\lambda'\hat\lambda)^{-1} (\hat f'\hat f)^{-1} \hat f' \Big], \qquad (B.11)
\]
for $k = 1, \ldots, K$ and $m = 1, \ldots, M$. We set $\hat b^{(x,i)} = \big( \hat b^{(x,i)}_1, \ldots, \hat b^{(x,i)}_K \big)'$ and $\hat b^{(z,i)} = \big( \hat b^{(z,i)}_1, \ldots, \hat b^{(z,i)}_M \big)'$, for $i = 0, 1, 2$. The estimator of $\mathcal{W}$ is given by
\[
\widehat{\mathcal{W}} = \begin{pmatrix} \big( \frac{1}{JT} \hat x^{\lambda f\prime} \hat x^{\lambda f} \big)^{-1} & 0_{K \times M} \\ 0_{M \times K} & 0_{M \times M} \end{pmatrix}
+ \begin{pmatrix} -(\hat x^{\lambda f\prime} \hat x^{\lambda f})^{-1} \hat x^{\lambda f\prime} \hat z^{\lambda f} \\ \mathbb{1}_M \end{pmatrix}
\Big( \tfrac{1}{JT} \hat z^{\lambda f\prime} M_{\hat x^{\lambda f}} \hat z^{\lambda f} \Big)^{-1} W_{JT} \Big( \tfrac{1}{JT} \hat z^{\lambda f\prime} M_{\hat x^{\lambda f}} \hat z^{\lambda f} \Big)^{-1}
\begin{pmatrix} -(\hat x^{\lambda f\prime} \hat x^{\lambda f})^{-1} \hat x^{\lambda f\prime} \hat z^{\lambda f} \\ \mathbb{1}_M \end{pmatrix}'. \qquad (B.12)
\]
Finally, for $i = 0, 1, 2$, we have
\[
\hat B_i = - \big( \hat G \widehat{\mathcal{W}} \hat G' \big)^{-1} \hat G \widehat{\mathcal{W}} \begin{pmatrix} \hat b^{(x,i)} \\ \hat b^{(z,i)} \end{pmatrix}. \qquad (B.13)
\]
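The truncated estimator in (B.10) is just a banded cross-moment matrix; the following is a minimal numpy sketch, assuming $X_k$ and $\hat e$ are given as $J \times T$ arrays and $h$ is the bandwidth. All names are illustrative.

```python
import numpy as np

def sigma_hat(Xk, ehat, h):
    """Truncated estimator (B.10): Sigma[t, tau] = (1/J) sum_j X[j,t] e[j,tau]
    for 0 < t - tau <= h, and zero otherwise."""
    J, T = Xk.shape
    full = Xk.T @ ehat / J                                  # all (t, tau) cross-moments
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]     # lag = t - tau
    return np.where((lag > 0) & (lag <= h), full, 0.0)
```

Given a projector `P_fhat` onto the column space of $\hat f$, the estimate $\hat b^{(x,0)}_k$ of (B.11) is then `np.trace(P_fhat @ sigma_hat(Xk, ehat, h))`.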
The only subtlety here lies in the definition of $\hat\Sigma^{X,e}_k$ and $\hat\Sigma^{Z,e}_m$, where we explicitly impose the constraint that $\hat\Sigma^{X,e}_{k,t\tau} = \hat\Sigma^{Z,e}_{m,t\tau} = 0$ for $t - \tau \le 0$ and for $t - \tau > h$, where $h \in \mathbb{N}$ is a bandwidth parameter. On the one side ($t - \tau \le 0$) this constraint stems from the assumption that $X_k$ and $Z_m$ are only correlated with past values of the errors $e^0$, not with present and future values; on the other side ($t - \tau > h$) we need the bandwidth cutoff to guarantee that the variance of our estimator for $B_0$ converges to zero. Without imposing this constraint and introducing the bandwidth parameter, our estimator for $B_0$ would be inconsistent.

B.3 Proofs

In addition to the vectorizations $x$, $x^f$, $x^{\lambda f}$, $z$, $z^f$, $z^{\lambda f}$, $g$, and $d(\alpha)$, which were already defined above, we also introduce the $JT \times 1$ vector $\varepsilon = \operatorname{vec}\big( e^0 \big)$.

B.3.1 Proof of Consistency

Proof of Theorem 3.1. From Bai (2009b) and Moon and Weidner (2010a; 2010b) we know that for $\alpha = \alpha^0$ one has $\tilde\gamma_{\alpha^0} = o_p(1)$. Since the optimal choice $\hat\alpha$ minimizes $\tilde\gamma_\alpha' W_{JT} \tilde\gamma_\alpha$, we have
\[
\tilde\gamma_{\hat\alpha}' W_{JT} \tilde\gamma_{\hat\alpha} \le \tilde\gamma_{\alpha^0}' W_{JT} \tilde\gamma_{\alpha^0} = o_p(1), \qquad (B.14)
\]
and therefore $\tilde\gamma_{\hat\alpha} = o_p(1)$, since $W_{JT}$ converges to a positive definite matrix in probability. After minimization over $\lambda$ and $f$, the objective function of the step-1 optimization reads
\[
L_\alpha(\beta, \gamma) = \min_{\lambda, f} \frac{1}{JT} \operatorname{Tr}\Bigg[ \Big( \delta(\alpha) - \sum_{k=1}^{K} \beta_k X_k - \sum_{m=1}^{M} \gamma_m Z_m - \lambda f' \Big) \Big( \delta(\alpha) - \sum_{k=1}^{K} \beta_k X_k - \sum_{m=1}^{M} \gamma_m Z_m - \lambda f' \Big)' \Bigg]
\]
\[
= \min_{f} \frac{1}{JT} \operatorname{Tr}\Bigg[ \Big( \delta(\alpha) - \sum_{k=1}^{K} \beta_k X_k - \sum_{m=1}^{M} \gamma_m Z_m \Big) M_f \Big( \delta(\alpha) - \sum_{k=1}^{K} \beta_k X_k - \sum_{m=1}^{M} \gamma_m Z_m \Big)' \Bigg]
\]
\[
= \min_{f} \frac{1}{JT} \operatorname{Tr}\Bigg[ \Big( \lambda^0 f^{0\prime} + \delta(\alpha) - \delta(\alpha^0) - \sum_{k=1}^{K} \big( \beta_k - \beta^0_k \big) X_k - \sum_{m=1}^{M} \gamma_m Z_m + e^0 \Big) M_f \Big( \cdot \Big)' \Bigg], \qquad (B.15)
\]
where $(\cdot)$ denotes the same expression as in the first factor. Defining
\[
L^{\mathrm{up}}_\alpha(\beta, \gamma) = \frac{1}{JT} \operatorname{Tr}\Bigg[ \Big( \delta(\alpha) - \delta(\alpha^0) - \sum_{k=1}^{K} \big( \beta_k - \beta^0_k \big) X_k - \sum_{m=1}^{M} \gamma_m Z_m + e^0 \Big) M_{f^0} \Big( \cdot \Big)' \Bigg],
\]
\[
L^{\mathrm{low}}_\alpha(\beta, \gamma) = \min_{\lambda} \frac{1}{JT} \operatorname{Tr}\Bigg[ M_\lambda \Big( \delta(\alpha) - \delta(\alpha^0) - \sum_{k=1}^{K} \big( \beta_k - \beta^0_k \big) X_k - \sum_{m=1}^{M} \gamma_m Z_m + e^0 \Big) M_{f^0} \Big( \cdot \Big)' \Bigg], \qquad (B.16)
\]
we have for all $\beta$, $\gamma$:
\[
L^{\mathrm{low}}_\alpha(\beta, \gamma) \le L_\alpha(\beta, \gamma) \le L^{\mathrm{up}}_\alpha(\beta, \gamma). \qquad (B.17)
\]
Here, for the upper bound we simply choose $f = f^0$ in the minimization problem of $L_\alpha(\beta, \gamma)$. We arrive at the lower bound by starting with the dual formulation of $L_\alpha(\beta, \gamma)$, in which we minimize over $\lambda$, and subtracting the term with $P_{f^0}$, which can be written as the trace of a positive definite matrix. Due to the projection with $M_{f^0}$, the $\lambda^0 f^{0\prime}$ term drops out of both $L^{\mathrm{low}}_\alpha(\beta, \gamma)$ and $L^{\mathrm{up}}_\alpha(\beta, \gamma)$. Let $(\tilde\beta_\alpha, \tilde\gamma_\alpha)$ be the minimizing parameters of $L_\alpha(\beta, \gamma)$ given $\alpha$; let $(\tilde\beta^{\mathrm{up}}_\alpha, \tilde\gamma^{\mathrm{up}}_\alpha)$ be the minimizing parameters of $L^{\mathrm{up}}_\alpha(\beta, \gamma)$ given $\alpha$; and let $\tilde\beta^{\mathrm{low}}_{\alpha,\gamma}$ be the minimizing parameter of $L^{\mathrm{low}}_\alpha(\beta, \gamma)$ given $\alpha$ and $\gamma$. We then have
\[
L^{\mathrm{low}}_\alpha\big( \tilde\beta^{\mathrm{low}}_{\alpha,\tilde\gamma_\alpha}, \tilde\gamma_\alpha \big) \le L^{\mathrm{low}}_\alpha\big( \tilde\beta_\alpha, \tilde\gamma_\alpha \big) \le L_\alpha\big( \tilde\beta_\alpha, \tilde\gamma_\alpha \big) \le L_\alpha\big( \tilde\beta^{\mathrm{up}}_\alpha, \tilde\gamma^{\mathrm{up}}_\alpha \big) \le L^{\mathrm{up}}_\alpha\big( \tilde\beta^{\mathrm{up}}_\alpha, \tilde\gamma^{\mathrm{up}}_\alpha \big). \qquad (B.18)
\]
Using the vectorizations of $X_k$, $Z_m$ and $e^0$, we can rewrite the lower and upper bounds in vector notation:
\[
L^{\mathrm{up}}_\alpha(\beta, \gamma) = \frac{1}{JT} \big( d(\alpha) - x(\beta - \beta^0) - z\gamma + \varepsilon \big)' \big( M_{f^0} \otimes \mathbb{1}_J \big) \big( d(\alpha) - x(\beta - \beta^0) - z\gamma + \varepsilon \big),
\]
\[
L^{\mathrm{low}}_\alpha\big( \tilde\beta^{\mathrm{low}}_{\alpha,\gamma}, \gamma \big) = \min_{\beta, \lambda} \frac{1}{JT} \big( d(\alpha) - x(\beta - \beta^0) - z\gamma + \varepsilon \big)' \big( M_{f^0} \otimes M_\lambda \big) \big( d(\alpha) - x(\beta - \beta^0) - z\gamma + \varepsilon \big). \qquad (B.19)
\]
For given $\lambda$, let $\tilde\beta^{\mathrm{low}}_{\alpha,\gamma,\lambda}$ be the optimal $\beta$ in the last equation.
We have
\[
\begin{pmatrix} \tilde\beta^{\mathrm{up}}_\alpha \\ \tilde\gamma^{\mathrm{up}}_\alpha \end{pmatrix} = \Big[ \big( x^f, z^f \big)' \big( x^f, z^f \big) \Big]^{-1} \big( x^f, z^f \big)' \big( d(\alpha) + \varepsilon \big), \qquad
\tilde\beta^{\mathrm{low}}_{\alpha,\gamma,\lambda} = \Big[ x' \big( M_{f^0} \otimes M_\lambda \big) x \Big]^{-1} x' \big( M_{f^0} \otimes M_\lambda \big) \big( d(\alpha) - z\gamma + \varepsilon \big), \qquad (B.20)
\]
and therefore
\[
L^{\mathrm{up}}_\alpha\big( \tilde\beta^{\mathrm{up}}_\alpha, \tilde\gamma^{\mathrm{up}}_\alpha \big) = \frac{1}{JT} \big( d^f(\alpha) + \varepsilon^f \big)' M_{(x^f, z^f)} \big( d^f(\alpha) + \varepsilon^f \big)
= \frac{1}{JT} \big( d^f(\alpha) + \varepsilon^f \big)' \big( d^f(\alpha) + \varepsilon^f \big) - \frac{1}{JT} d'(\alpha) P_{(x^f, z^f)} d(\alpha) - R_1(\alpha),
\]
\[
L^{\mathrm{low}}_\alpha\big( \tilde\beta^{\mathrm{low}}_{\alpha,\gamma}, \gamma \big) = \min_{\lambda} \frac{1}{JT} \big( d^f(\alpha) - z^f\gamma + \varepsilon^f \big)' M_{(x^f, \, M_{f^0} \otimes \lambda)} \big( d^f(\alpha) - z^f\gamma + \varepsilon^f \big)
\]
\[
= \frac{1}{JT} \big( d^f(\alpha) - z^f\gamma + \varepsilon^f \big)' \big( d^f(\alpha) - z^f\gamma + \varepsilon^f \big) - \max_{\lambda} \frac{1}{JT} \big( d(\alpha) - z\gamma + \varepsilon \big)' P_{(x^f, \, M_{f^0} \otimes \lambda)} \big( d(\alpha) - z\gamma + \varepsilon \big)
\]
\[
= \frac{1}{JT} \big( d^f(\alpha) + \varepsilon^f \big)' \big( d^f(\alpha) + \varepsilon^f \big) - \max_{\lambda} \frac{1}{JT} \big( d(\alpha) - z\gamma \big)' P_{(x^f, \, M_{f^0} \otimes \lambda)} \big( d(\alpha) - z\gamma \big) + R_2(\alpha, \gamma, \lambda), \qquad (B.21)
\]
where $\varepsilon^f = \operatorname{vec}\big( e^0 M_{f^0} \big)$, $d^f(\alpha) = \operatorname{vec}\big( (\delta(\alpha) - \delta(\alpha^0)) M_{f^0} \big)$, and the remainder terms $R_1(\alpha)$ and $R_2(\alpha, \gamma, \lambda)$ are given by
\[
R_1(\alpha) = \frac{2}{JT} d'(\alpha) P_{(x^f, z^f)} \varepsilon + \frac{1}{JT} \varepsilon' P_{(x^f, z^f)} \varepsilon,
\]
\[
R_2(\alpha, \gamma, \lambda) = \frac{2}{JT} \big( d(\alpha) - z\gamma \big)' P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon + \frac{1}{JT} \varepsilon' P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon + \frac{1}{JT} \big( d^f(\alpha) + \varepsilon^f \big)' z \gamma. \qquad (B.22)
\]
The inequality $L^{\mathrm{low}}_\alpha\big( \tilde\beta^{\mathrm{low}}_{\alpha,\tilde\gamma_\alpha}, \tilde\gamma_\alpha \big) \le L^{\mathrm{up}}_\alpha\big( \tilde\beta^{\mathrm{up}}_\alpha, \tilde\gamma^{\mathrm{up}}_\alpha \big)$ evaluated at $\alpha = \hat\alpha$ thus gives
\[
R_2\big( \hat\alpha, \tilde\gamma_{\hat\alpha}, \tilde\lambda \big) - R_1(\hat\alpha) \ge \frac{1}{JT} d'(\hat\alpha) P_{(x^f, z^f)} d(\hat\alpha) - \max_{\lambda} \frac{1}{JT} d'(\hat\alpha) P_{(x^f, \, M_{f^0} \otimes \lambda)} d(\hat\alpha) \ge c \, \| \hat\alpha - \alpha^0 \|^2, \quad \text{wpa1}, \qquad (B.23)
\]
where $\tilde\lambda$ is the optimal choice of $\lambda$ in $L^{\mathrm{low}}_{\hat\alpha}\big( \tilde\beta^{\mathrm{low}}_{\hat\alpha,\tilde\gamma_{\hat\alpha}}, \tilde\gamma_{\hat\alpha} \big)$, and we used Assumption 3.1. Assumption B.4(a) implies $\|X_k\| = O_p(\sqrt{JT})$ and $\|Z_m\| = O_p(\sqrt{JT})$. Using this and Assumption B.2 we find $\big( x^f, z^f \big)' \varepsilon = o_p(JT)$ and therefore
\[
\big\| P_{(x^f, z^f)} \varepsilon \big\|_F = \Big\| \big( x^f, z^f \big) \Big[ \big( x^f, z^f \big)' \big( x^f, z^f \big) \Big]^{-1} \big( x^f, z^f \big)' \varepsilon \Big\|_F = o_p\big( \sqrt{JT} \big), \qquad (B.24)
\]
where we also used that Assumption B.4 guarantees $\big\| \big[ ( x^f, z^f )' ( x^f, z^f ) \big]^{-1} \big\| = O_p\big( (JT)^{-1} \big)$. Below we also show that $\| P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon \|_F = o_p(\sqrt{JT})$. Using these results, Assumption B.3(i), and the fact that $\tilde\gamma_{\hat\alpha} = o_p(1)$, one obtains
\[
R_2\big( \hat\alpha, \tilde\gamma_{\hat\alpha}, \tilde\lambda \big) = o_p(1) + o_p\big( \| \hat\alpha - \alpha^0 \| \big), \qquad R_1(\hat\alpha) = o_p(1) + o_p\big( \| \hat\alpha - \alpha^0 \| \big). \qquad (B.25)
\]
Therefore we have
\[
o_p(1) + o_p\big( \| \hat\alpha - \alpha^0 \| \big) \ge c \, \| \hat\alpha - \alpha^0 \|^2, \qquad (B.26)
\]
which implies $\hat\alpha - \alpha^0 = o_p(1)$.

What is left to show is that $\| P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon \|_F = o_p(\sqrt{JT})$. For $A = x^f$ and $B = M_{f^0} \otimes \lambda$ we use the general formula $P_{(A,B)} = P_B + M_B P_{(M_B A)} M_B$ and the fact that $M_{(M_{f^0} \otimes \lambda)} x^f = M_{(\mathbb{1}_T \otimes \lambda)} x^f$ to obtain
\[
P_{(x^f, \, M_{f^0} \otimes \lambda)} = P_{(M_{f^0} \otimes \lambda)} + M_{(M_{f^0} \otimes \lambda)} P_{[M_{(\mathbb{1}_T \otimes \lambda)} x^f]} - M_{(M_{f^0} \otimes \lambda)} P_{[M_{(\mathbb{1}_T \otimes \lambda)} x^f]} P_{(M_{f^0} \otimes \lambda)}, \qquad (B.27)
\]
and therefore
\[
\big\| P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon \big\|_F \le 2 \big\| P_{(M_{f^0} \otimes \lambda)} \varepsilon \big\|_F + \big\| P_{[M_{(\mathbb{1}_T \otimes \lambda)} x^f]} \varepsilon \big\|_F. \qquad (B.28)
\]
Note that the $JT \times K$ matrix $M_{(\mathbb{1}_T \otimes \lambda)} x^f$ is simply $\big[ \operatorname{vec}\big( M_\lambda X_k M_{f^0} \big) \big]_{k=1,\ldots,K}$. We have
\[
\big\| P_{(M_{f^0} \otimes \lambda)} \varepsilon \big\|_F = \big\| P_\lambda e^0 M_{f^0} \big\|_F \le R \, \| e^0 \| = o_p\big( \sqrt{JT} \big), \qquad (B.29)
\]
and thus also $x^{f\prime} M_{(\mathbb{1}_T \otimes \lambda)} \varepsilon = o_p(JT)$, which, analogous to (B.24), implies $\| P_{[M_{(\mathbb{1}_T \otimes \lambda)} x^f]} \varepsilon \|_F = o_p(\sqrt{JT})$, so that the required result follows.

Having $\hat\alpha = \alpha^0 + o_p(1)$, the proof of $\hat\beta = \beta^0 + o_p(1)$ is straightforward using the methods in Bai (2009b) and Moon and Weidner (2010a; 2010b): the only additional term appearing here is $\delta(\hat\alpha) - \delta(\alpha^0)$, which for $\| \hat\alpha - \alpha^0 \| = o_p(1)$ has no effect on the consistency of $\hat\beta$.

B.3.2 Proof of Limiting Distribution

Lemma B.1. Let the assumptions of Theorem 3.1 (consistency) be satisfied, and in addition let $(JT)^{-1/2} \operatorname{Tr}\big( e^0 X_k' \big) = O_p(1)$ and $(JT)^{-1/2} \operatorname{Tr}\big( e^0 Z_m' \big) = O_p(1)$. In the limit $J, T \to \infty$ with $J/T \to \kappa^2$, $0 < \kappa < \infty$, we then have $\sqrt{J} ( \hat\alpha - \alpha^0 ) = O_p(1)$.

Proof. The proof is analogous to the consistency proof. We know from Moon and Weidner (2010a; 2010b) that $\sqrt{J} \tilde\gamma_{\alpha^0} = O_p(1)$ and therefore $\sqrt{J} \tilde\gamma_{\hat\alpha} = O_p(1)$, applying the same logic as in equation (B.14).
With the additional assumptions in the lemma we furthermore obtain $\| P_{(x^f, z^f)} \varepsilon \|_F = O_p(\sqrt{J})$ and $\| P_{(x^f, \, M_{f^0} \otimes \lambda)} \varepsilon \|_F = O_p(\sqrt{J})$, and we can conclude
\[
R_2\big( \hat\alpha, \tilde\gamma_{\hat\alpha}, \tilde\lambda \big) - R_1(\hat\alpha) = O_p\big( J^{-1} \big) + O_p\big( J^{-1/2} \| \hat\alpha - \alpha^0 \| \big). \qquad (B.30)
\]
This implies
\[
O_p\big( J^{-1} \big) + O_p\big( J^{-1/2} \| \hat\alpha - \alpha^0 \| \big) \ge c \, \| \hat\alpha - \alpha^0 \|^2, \qquad (B.31)
\]
so that we obtain $\sqrt{J} ( \hat\alpha - \alpha^0 ) = O_p(1)$.

Proof of Theorem 3.2. Assumption B.6 guarantees $(JT)^{-1/2} \operatorname{Tr}\big( e^0 X_k' \big) = O_p(1)$ and $(JT)^{-1/2} \operatorname{Tr}\big( e^0 Z_m' \big) = O_p(1)$, so that we can apply Lemma B.1 to conclude $\sqrt{J} ( \hat\alpha - \alpha^0 ) = O_p(1)$. The first step in the definition of the LS-MD estimator is equivalent to the linear regression model with interactive fixed effects, but with an error matrix that has an additional term $\Delta\delta(\alpha) \equiv \delta(\alpha) - \delta(\alpha^0)$, namely $E(\alpha) \equiv e + \Delta\delta(\alpha)$. Using $\hat\alpha - \alpha^0 = o_p(1)$ and Assumption B.3(i) we have $\| E(\hat\alpha) \| = o_p(\sqrt{JT})$, so that the results in Moon and Weidner (2010a; 2010b) guarantee $\tilde\beta_{\hat\alpha} - \beta^0 = o_p(1)$ and $\| \tilde\gamma_{\hat\alpha} \| = o_p(1)$, which we already used in the consistency proof. Using $\sqrt{J} ( \hat\alpha - \alpha^0 ) = O_p(1)$ and Assumption B.6(i) we find $\| E(\hat\alpha) \| = O_p(\sqrt{J})$, which allows us to truncate the asymptotic likelihood expansion derived in Moon and Weidner (2010a; 2010b) at an appropriate order. Namely, applying their results we have
\[
\sqrt{JT} \begin{pmatrix} \tilde\beta_\alpha - \beta^0 \\ \tilde\gamma_\alpha \end{pmatrix} = V^{-1}_{JT} \begin{pmatrix} \big[ C^{(1)}( X_k, E(\alpha) ) + C^{(2)}( X_k, E(\alpha) ) \big]_{k=1,\ldots,K} \\ \big[ C^{(1)}( Z_m, E(\alpha) ) + C^{(2)}( Z_m, E(\alpha) ) \big]_{m=1,\ldots,M} \end{pmatrix} + r^{\mathrm{QMLE}}(\alpha), \qquad (B.32)
\]
where
\[
V_{JT} = \frac{1}{JT} \begin{pmatrix} \big[ \operatorname{Tr}( M_{f^0} X_{k_1}' M_{\lambda^0} X_{k_2} ) \big]_{k_1,k_2} & \big[ \operatorname{Tr}( M_{f^0} X_k' M_{\lambda^0} Z_m ) \big]_{k,m} \\ \big[ \operatorname{Tr}( M_{f^0} Z_m' M_{\lambda^0} X_k ) \big]_{m,k} & \big[ \operatorname{Tr}( M_{f^0} Z_{m_1}' M_{\lambda^0} Z_{m_2} ) \big]_{m_1,m_2} \end{pmatrix} = \frac{1}{JT} \big( x^{\lambda f}, z^{\lambda f} \big)' \big( x^{\lambda f}, z^{\lambda f} \big), \qquad (B.33)
\]
and for $\mathcal{X}$ either $X_k$ or $Z_m$ and $E = E(\alpha)$ we have
\[
C^{(1)}(\mathcal{X}, E) = \frac{1}{\sqrt{JT}} \operatorname{Tr}\big( M_{f^0} E' M_{\lambda^0} \mathcal{X} \big),
\]
\[
C^{(2)}(\mathcal{X}, E) = - \frac{1}{\sqrt{JT}} \Big[ \operatorname{Tr}\big( E M_{f^0} E' M_{\lambda^0} \mathcal{X} f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \big) + \operatorname{Tr}\big( E' M_{\lambda^0} E M_{f^0} \mathcal{X}' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \big)
\]
\[
+ \operatorname{Tr}\big( E' M_{\lambda^0} \mathcal{X} M_{f^0} E' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \big) \Big], \qquad (B.34)
\]
and finally for the remainder we have
\[
r^{\mathrm{QMLE}}(\alpha) = O_p\big[ (JT)^{-3/2} \| E(\alpha) \|^3 \| X_k \| \big] + O_p\big[ (JT)^{-3/2} \| E(\alpha) \|^3 \| Z_m \| \big]
+ O_p\big[ (JT)^{-1} \| E(\alpha) \| \| X_k \|^2 \| \tilde\beta_\alpha - \beta^0 \| \big] + O_p\big[ (JT)^{-1} \| E(\alpha) \| \| Z_m \|^2 \| \tilde\gamma_\alpha \| \big], \qquad (B.35)
\]
which holds uniformly over $\alpha$. The first two terms in $r^{\mathrm{QMLE}}(\alpha)$ stem from the bound on higher-order terms in the score function ($C^{(3)}$, $C^{(4)}$, etc.), where $E(\alpha)$ appears three times or more in the expansion, while the last two terms in $r^{\mathrm{QMLE}}(\alpha)$ reflect the bound on higher-order terms in the Hessian expansion, and beyond. Note that Assumption B.4 already guarantees that $V_{JT} > b > 0$, wpa1. Applying $\| X_k \| = O_p(\sqrt{JT})$, $\| Z_m \| = O_p(\sqrt{JT})$, and $\| E(\alpha) \| = O_p(\sqrt{J})$ within $\sqrt{J} \| \alpha - \alpha^0 \| < c$, we find for all $c > 0$
\[
\sup_{\{\alpha : \sqrt{J} \|\alpha - \alpha^0\| < c\}} \frac{\| r^{\mathrm{QMLE}}(\alpha) \|}{1 + \sqrt{JT} \| \tilde\beta_\alpha - \beta^0 \| + \sqrt{JT} \| \tilde\gamma_\alpha \|} = o_p(1). \qquad (B.36)
\]
The inverse of the partitioned matrix $V_{JT}$ is given by
\[
V^{-1}_{JT} = JT \times \begin{pmatrix} \big( x^{\lambda f\prime} M_{z^{\lambda f}} x^{\lambda f} \big)^{-1} & - \big( x^{\lambda f\prime} M_{z^{\lambda f}} x^{\lambda f} \big)^{-1} x^{\lambda f\prime} z^{\lambda f} \big( z^{\lambda f\prime} z^{\lambda f} \big)^{-1} \\ - \big( z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \big)^{-1} z^{\lambda f\prime} x^{\lambda f} \big( x^{\lambda f\prime} x^{\lambda f} \big)^{-1} & \big( z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \big)^{-1} \end{pmatrix}. \qquad (B.37)
\]
Using $\sqrt{J} ( \hat\alpha - \alpha^0 ) = O_p(1)$ and Assumption B.6(i) we find
\[
\begin{pmatrix} \big[ C^{(1)}( X_k, E(\hat\alpha) ) \big]_{k=1,\ldots,K} \\ \big[ C^{(1)}( Z_m, E(\hat\alpha) ) \big]_{m=1,\ldots,M} \end{pmatrix} = \frac{1}{\sqrt{JT}} \big( x^{\lambda f}, z^{\lambda f} \big)' \varepsilon - \frac{1}{JT} \big( x^{\lambda f}, z^{\lambda f} \big)' g \, \sqrt{JT} \big( \hat\alpha - \alpha^0 \big) + o_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big),
\]
\[
\begin{pmatrix} \big[ C^{(2)}( X_k, E(\hat\alpha) ) \big]_{k=1,\ldots,K} \\ \big[ C^{(2)}( Z_m, E(\hat\alpha) ) \big]_{m=1,\ldots,M} \end{pmatrix} = \begin{pmatrix} c^{(2)}_x \\ c^{(2)}_z \end{pmatrix} + O_p\big( \sqrt{J} \| \hat\alpha - \alpha^0 \| \big), \qquad (B.38)
\]
where
\[
c^{(2)}_x = \big[ C^{(2)}( X_k, e ) \big]_{k=1,\ldots,K}, \qquad c^{(2)}_z = \big[ C^{(2)}( Z_m, e ) \big]_{m=1,\ldots,M}. \qquad (B.39)
\]
From this one can conclude that $\sqrt{JT} \| \tilde\beta_{\hat\alpha} - \beta^0 \| = O_p(1) + O_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big)$ and $\sqrt{JT} \| \tilde\gamma_{\hat\alpha} \| = O_p(1) + O_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big)$, so that we find $r^{\mathrm{QMLE}}(\hat\alpha) = o_p(1) + o_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big)$.
Combining the above results we obtain
\[
\sqrt{JT} \, \tilde\gamma_{\hat\alpha} = \Big( \tfrac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \Big)^{-1} \Bigg[ \frac{1}{\sqrt{JT}} z^{\lambda f\prime} M_{x^{\lambda f}} \varepsilon + c^{(2)}_z - z^{\lambda f\prime} x^{\lambda f} \big( x^{\lambda f\prime} x^{\lambda f} \big)^{-1} c^{(2)}_x - \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} g \, \sqrt{JT} \big( \hat\alpha - \alpha^0 \big) \Bigg] + o_p(1) + o_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big). \qquad (B.40)
\]
The above result holds not only for $\hat\alpha$ but uniformly for all $\alpha$ in any $\sqrt{J}$-shrinking neighborhood of $\alpha^0$ (we still made this explicit in the bound on $r^{\mathrm{QMLE}}(\alpha)$ above, but found it too tedious to define remainder terms in all intermediate steps), i.e. we have
\[
\sqrt{JT} \, \tilde\gamma_\alpha = \Big( \tfrac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \Big)^{-1} \Bigg[ \frac{1}{\sqrt{JT}} z^{\lambda f\prime} M_{x^{\lambda f}} \varepsilon + c^{(2)}_z - z^{\lambda f\prime} x^{\lambda f} \big( x^{\lambda f\prime} x^{\lambda f} \big)^{-1} c^{(2)}_x - \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} g \, \sqrt{JT} \big( \alpha - \alpha^0 \big) \Bigg] + r^{\gamma}(\alpha), \qquad (B.41)
\]
where for all $c > 0$
\[
\sup_{\{\alpha : \sqrt{J} \|\alpha - \alpha^0\| < c\}} \frac{\| r^{\gamma}(\alpha) \|}{1 + \sqrt{JT} \|\alpha - \alpha^0\|} = o_p(1). \qquad (B.42)
\]
Therefore, the objective function for $\hat\alpha$ reads
\[
JT \, \tilde\gamma_\alpha' W_{JT} \tilde\gamma_\alpha = A_0 - 2 A_1' \Big[ \sqrt{JT} \big( \alpha - \alpha^0 \big) \Big] + \Big[ \sqrt{JT} \big( \alpha - \alpha^0 \big) \Big]' A_2 \Big[ \sqrt{JT} \big( \alpha - \alpha^0 \big) \Big] + r^{\mathrm{obj}}(\alpha), \qquad (B.43)
\]
where $A_0$ is a scalar, $A_1$ is an $L \times 1$ vector, and $A_2$ is an $L \times L$ matrix, defined (writing, for brevity, $u \equiv \frac{1}{\sqrt{JT}} z^{\lambda f\prime} M_{x^{\lambda f}} \varepsilon + c^{(2)}_z - z^{\lambda f\prime} x^{\lambda f} ( x^{\lambda f\prime} x^{\lambda f} )^{-1} c^{(2)}_x$ and $Q \equiv \big( \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \big)^{-1}$) by
\[
A_0 = u' \, Q \, W_{JT} \, Q \, u, \qquad
A_1 = \frac{1}{JT} g' M_{x^{\lambda f}} z^{\lambda f} \, Q \, W_{JT} \, Q \, u, \qquad
A_2 = \frac{1}{JT} g' M_{x^{\lambda f}} z^{\lambda f} \, Q \, W_{JT} \, Q \, \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} g, \qquad (B.44)
\]
and the remainder term in the objective function satisfies
\[
\sup_{\{\alpha : \sqrt{J} \|\alpha - \alpha^0\| < c\}} \frac{| r^{\mathrm{obj}}(\alpha) |}{\big( 1 + \sqrt{JT} \|\alpha - \alpha^0\| \big)^2} = o_p(1). \qquad (B.45)
\]
Under our assumptions one can show that $\| A_1 \| = O_p(1)$ and $\operatorname{plim}_{J,T \to \infty} A_2 > 0$. Combining the expansion of the objective function with the $\sqrt{J}$-consistency of $\hat\alpha$, we can thus conclude that
\[
\sqrt{JT} \big( \hat\alpha - \alpha^0 \big) = A_2^{-1} A_1 + o_p(1). \qquad (B.46)
\]
Analogous to equation (B.32) for the first step, we can apply the results in Moon and Weidner (2010a; 2010b) to the third step of the LS-MD estimator to obtain
\[
\sqrt{JT} \big( \hat\beta - \beta^0 \big) = \Big( \tfrac{1}{JT} x^{\lambda f\prime} x^{\lambda f} \Big)^{-1} \Big[ C^{(1)}( X_k, E(\hat\alpha) ) + C^{(2)}( X_k, E(\hat\alpha) ) \Big]_{k=1,\ldots,K} + o_p(1)
\]
\[
= \Big( \tfrac{1}{JT} x^{\lambda f\prime} x^{\lambda f} \Big)^{-1} \Bigg[ \frac{1}{\sqrt{JT}} x^{\lambda f\prime} \varepsilon - \frac{1}{JT} x^{\lambda f\prime} g \, \sqrt{JT} \big( \hat\alpha - \alpha^0 \big) + c^{(2)}_x \Bigg] + o_p(1). \qquad (B.47)
\]
Here the remainder term $o_p\big( \sqrt{JT} \| \hat\alpha - \alpha^0 \| \big)$ is already absorbed into the $o_p(1)$ term, since (B.46) already shows $\sqrt{JT}$-consistency of $\hat\alpha$. Let $G_{JT}$ and $\mathcal{W}_{JT}$ be the expressions in equations (3.12) and (3.13) before taking the probability limits, i.e. $G = \operatorname{plim}_{J,T \to \infty} G_{JT}$ and $\mathcal{W} = \operatorname{plim}_{J,T \to \infty} \mathcal{W}_{JT}$. One can show that
\[
G_{JT} \mathcal{W}_{JT} G_{JT}' = \frac{1}{JT} (g, x)' P_{x^{\lambda f}} (g, x) + \begin{pmatrix} A_2 & 0_{L \times K} \\ 0_{K \times L} & 0_{K \times K} \end{pmatrix}. \qquad (B.48)
\]
Using this, one can rewrite equations (B.46) and (B.47) as follows:
\[
G_{JT} \mathcal{W}_{JT} G_{JT}' \, \sqrt{JT} \begin{pmatrix} \hat\alpha - \alpha^0 \\ \hat\beta - \beta^0 \end{pmatrix} = \frac{1}{\sqrt{JT}} (g, x)' P_{x^{\lambda f}} \varepsilon + \begin{pmatrix} A_1 + \frac{1}{JT} g' x^{\lambda f} \big( \frac{1}{JT} x^{\lambda f\prime} x^{\lambda f} \big)^{-1} c^{(2)}_x \\ c^{(2)}_x \end{pmatrix} + o_p(1), \qquad (B.49)
\]
and therefore
\[
\sqrt{JT} \begin{pmatrix} \hat\alpha - \alpha^0 \\ \hat\beta - \beta^0 \end{pmatrix} = \big( G_{JT} \mathcal{W}_{JT} G_{JT}' \big)^{-1} G_{JT} \mathcal{W}_{JT} \frac{1}{\sqrt{JT}} \big( x^{\lambda f}, z^{\lambda f} \big)' \varepsilon
+ \big( G_{JT} \mathcal{W}_{JT} G_{JT}' \big)^{-1} \begin{pmatrix} A_3 c^{(2)}_z + \big( g' x^{\lambda f} - A_3 z^{\lambda f\prime} x^{\lambda f} \big) \big( x^{\lambda f\prime} x^{\lambda f} \big)^{-1} c^{(2)}_x \\ c^{(2)}_x \end{pmatrix} + o_p(1)
\]
\[
= \big( G \mathcal{W} G' \big)^{-1} G \mathcal{W} \Bigg[ \frac{1}{\sqrt{JT}} \big( x^{\lambda f}, z^{\lambda f} \big)' \varepsilon + \begin{pmatrix} c^{(2)}_x \\ c^{(2)}_z \end{pmatrix} \Bigg] + o_p(1), \qquad (B.50)
\]
where $A_3 = \frac{1}{JT} g' M_{x^{\lambda f}} z^{\lambda f} \big( \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \big)^{-1} W_{JT} \big( \frac{1}{JT} z^{\lambda f\prime} M_{x^{\lambda f}} z^{\lambda f} \big)^{-1}$. Having equation (B.50), all that is left to do is to derive the asymptotic distribution of $c^{(2)}_x$, $c^{(2)}_z$ and $\frac{1}{\sqrt{JT}} \big( x^{\lambda f}, z^{\lambda f} \big)' \varepsilon$. This was done in Moon and Weidner (2010a; 2010b) under the same assumptions that we impose here. They show that
\[
c^{(2)}_x = - \kappa^{-1} b^{(x,1)} - \kappa \, b^{(x,2)} + o_p(1), \qquad c^{(2)}_z = - \kappa^{-1} b^{(z,1)} - \kappa \, b^{(z,2)} + o_p(1), \qquad (B.51)
\]
and
\[
\frac{1}{\sqrt{JT}} \big( x^{\lambda f}, z^{\lambda f} \big)' \varepsilon \; \longrightarrow_d \; \mathcal{N}\Bigg( - \kappa \begin{pmatrix} b^{(x,0)} \\ b^{(z,0)} \end{pmatrix}, \; \Omega \Bigg). \qquad (B.52)
\]
Plugging this into (B.50) gives the result on the limiting distribution of $\hat\alpha$ and $\hat\beta$ stated in the theorem.

B.3.3 Consistency of Bias and Variance Estimators

Proof of Theorem 3.3. From Moon and Weidner (2010a; 2010b) we already know that under our assumptions we have $\hat\Omega = \Omega + o_p(1)$, $\hat b^{(x,i)} = b^{(x,i)} + o_p(1)$ and $\hat b^{(z,i)} = b^{(z,i)} + o_p(1)$, for $i = 0, 1, 2$. They also show that $\| M_{\hat\lambda} - M_{\lambda^0} \| = O_p(J^{-1/2})$ and $\| M_{\hat f} - M_{f^0} \| = O_p(J^{-1/2})$, from which we can conclude that $\widehat{\mathcal{W}} = \mathcal{W} + o_p(1)$. These results on $M_{\hat\lambda}$ and $M_{\hat f}$, together with $\sqrt{JT}$-consistency of $\hat\alpha$ and Assumption B.7, are also sufficient to conclude $\hat G = G + o_p(1)$. It follows that $\hat B_i = B_i + o_p(1)$, for $i = 0, 1, 2$.

Appendix C

Appendix to Chapter 4

C.1 Assumptions

C.1.1 Assumptions for Consistency

To state our assumptions we first define
\[
\mathcal{I}_i = - \frac{1}{T} \frac{\partial^2 \log f\big( Y_i | X_i, \hat\alpha^p_i(\theta^0); \theta^0 \big)}{\partial \alpha \, \partial \alpha'}, \qquad
\mathcal{B}_{c,i} = \Big\{ \alpha \in \mathbb{R}^M \; : \; \big( \alpha - \hat\alpha^p_i(\theta^0) \big)' \mathcal{I}_i \big( \alpha - \hat\alpha^p_i(\theta^0) \big) \le \frac{c}{T} \Big\}. \qquad (C.1)
\]

Assumption C.1. There exist $c_1, c_2, c_3, c_4, c_5 > 0$ such that wpa1
(i) $\hat\theta^p = \theta^0 + o_p(1)$, $\mathcal{L}^p_{NT}(\hat\theta^p) = \mathcal{L}^p_{NT}(\theta^0) + o_p(1)$.
(ii) $\forall \theta \in \Theta$: $\mathcal{L}^p_{NT}(\theta) \le \mathcal{L}^p_{NT}(\hat\theta^p) - \min\big( c_1, \, c_2 \| \theta - \hat\theta^p \|^2 \big)$.
(iii) $\forall i \in \{1, \ldots, N\}$, $\forall \alpha \in \mathcal{B}_{c_3,i} \cap \mathcal{A}$:
\[
\frac{1}{T} \log f\big( Y_i | X_i, \alpha; \theta^0 \big) \ge \frac{1}{T} \log f\big( Y_i | X_i, \hat\alpha^p_i; \theta^0 \big) - \frac{c_4}{2} \big( \alpha - \hat\alpha^p_i \big)' \mathcal{I}_i \big( \alpha - \hat\alpha^p_i \big),
\]
where $\hat\alpha^p_i = \hat\alpha^p_i(\theta^0)$.
(iv) $\forall i \in \{1, \ldots, N\}$: $\operatorname{vol}\big( \mathcal{B}_{c_3,i} \cap \mathcal{A} \big) \ge c_5 \operatorname{vol}\big( \mathcal{B}_{c_3,i} \big)$.
(v) $\frac{1}{N} \sum_{i=1}^{N} \sqrt{\| \mathcal{I}_i^{-1} \|} = o_p\big( T^{3/2} \big)$.
(vi) The logarithm of $\pi^{\mathrm{low}}_T(\alpha | x)$ is Lipschitz continuous in $\alpha$ with a Lipschitz constant that is uniformly bounded over $x \in \mathcal{X}^T$. Furthermore, $\pi^{\mathrm{low}}_T(\alpha | x)$ satisfies
\[
\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sqrt{\det \mathcal{I}_i}}{\pi^{\mathrm{low}}_T\big( \hat\alpha^p_i(\theta^0) | X_i \big)} \le o_p(T).
\]

Part (i) of this assumption demands consistency of the fixed effects estimator and some continuity of the profile likelihood around $\theta^0$. Part (ii) requires that $\mathcal{L}^p_{NT}(\theta)$ have a properly isolated maximum at $\hat\theta^p$ with a non-degenerate Hessian. Part (iii) is a similar assumption on the maximum of $\log f( Y_i | X_i, \alpha; \theta^0 )$ in $\alpha$. Part (iv) demands that the boundary of $\mathcal{A}$ be well-behaved, where "vol" refers to the volume of a set. Part (v) requires a lower bound on the eigenvalues of $\mathcal{I}_i$; here $\| \cdot \|$ is the operator norm. Part (vi) is a regularity condition on the lower bound $\pi^{\mathrm{low}}_T(\alpha | x)$, which still allows this lower bound to decrease in $T$ at any polynomial rate.

C.1.2 Further Regularity Conditions on the Model

Define
\[
f_{\alpha|Y,X}\big( \alpha | y, x; \theta, \pi \big) = \frac{f\big( y | x, \alpha; \theta \big) \, \pi(\alpha | x)}{f_{Y|X}\big( y | x; \theta, \pi \big)}. \qquad (C.2)
\]
This is the posterior distribution of $\alpha$ given $Y = y$ under the prior $\pi(\alpha | x)$, for given values of $x$ and $\theta$. Similarly, the posterior distribution of $\alpha$ under a uniform prior reads
\[
f^{\mathrm{unif}}_{\alpha|Y,X}\big( \alpha | y, x; \theta \big) = \frac{f\big( y | x, \alpha; \theta \big)}{\int_{\mathcal{A}} f\big( y | x, \beta; \theta \big) d\beta}. \qquad (C.3)
\]
It is convenient to introduce the following notation:
\[
J^{(1)}(y, x) = \frac{1}{\sqrt{T}} \int_{\mathcal{A}} \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta} \, f^{\mathrm{unif}}_{\alpha|Y,X}\big( \alpha | y, x; \theta^0 \big) d\alpha,
\]
\[
J^{(2)}(y, x) = \frac{1}{T} \int_{\mathcal{A}} \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta} \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta'} \, f^{\mathrm{unif}}_{\alpha|Y,X}\big( \alpha | y, x; \theta^0 \big) d\alpha,
\]
\[
H_{k_1 k_2}(y, x) = \frac{1}{T^2} \int_{\mathcal{A}} \Bigg( \frac{\partial^2 \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_1} \partial \theta_{k_2}} + \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_1}} \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_2}} \Bigg)^2 f^{\mathrm{unif}}_{\alpha|Y,X}\big( \alpha | y, x; \theta^0 \big) d\alpha,
\]
\[
D^{(q)}(y, x) = \int_{\mathcal{A}} \int_{\mathcal{A}} \Big[ \sqrt{T} \, d_x(\alpha, \beta) \Big]^q f^{\mathrm{unif}}_{\alpha|Y,X}\big( \alpha | y, x; \theta^0 \big) d\alpha \, f^{\mathrm{unif}}_{\alpha|Y,X}\big( \beta | y, x; \theta^0 \big) d\beta, \qquad (C.4)
\]
where $q = 2, 4$. Note that in the definition of $J^{(1)}(y, x)$ the factor $1/\sqrt{T}$ is the appropriate normalization for the score function $\partial \log f( y | x, \alpha; \theta^0 ) / \partial \theta$, since the score at the true parameters has zero mean, and since $f^{\mathrm{unif}}_{\alpha|Y,X}( \alpha | y, x; \theta^0 )$ will be centered around the realized value $\alpha^0_i$ if evaluated at $y = Y_i$ and $x = X_i$.
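For a scalar incidental parameter, the posterior (C.3) and the object $J^{(1)}$ in (C.4) can be evaluated by simple quadrature. The following sketch is not part of the formal argument; it assumes user-supplied functions logf and score_theta and an illustrative grid over $\mathcal{A}$.

```python
import numpy as np

def J1(logf, score_theta, y, x, theta0, T, a_grid):
    """Riemann-sum version of J^(1) in (C.4) for scalar alpha and theta:
    the theta-score integrated against the posterior (C.3) under a uniform prior."""
    ll = np.array([logf(y, x, a, theta0) for a in a_grid])
    w = np.exp(ll - ll.max())        # proportional to f(y|x,alpha;theta0)
    post = w / w.sum()               # discretized f^unif_{alpha|Y,X}(alpha|y,x)
    s = np.array([score_theta(y, x, a, theta0) for a in a_grid])
    return (s * post).sum() / np.sqrt(T)
```

Subtracting the maximum of the log-likelihood before exponentiating keeps the posterior weights numerically stable when $T$ is large.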
Similarly, for $H_{k_1 k_2}(y, x)$ the expression $\frac{\partial^2 \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_1} \partial \theta_{k_2}} + \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_1}} \frac{\partial \log f( y | x, \alpha; \theta^0 )}{\partial \theta_{k_2}}$ is mean zero at $\alpha^0_i$, so that $1/T^2$ is the appropriate normalization for the square of this expression. Also, in the definition of $D^{(q)}(y, x)$ it is natural to rescale $d_x(\alpha, \beta)$ by $\sqrt{T}$, since the standard deviation of the distribution $f^{\mathrm{unif}}_{\alpha|Y,X}( \alpha | y, x; \theta )$ is of order $1/\sqrt{T}$. We make the following high-level assumptions.

Assumption C.2. We assume that
(i) $Y_i$ and $X_i$ are independently and identically distributed across $i$.
(ii) $\mathbb{E} J^{(2)}(Y_i, X_i) = O(1)$, and $\mathbb{E} D^{(q)}(Y_i, X_i) = O(1)$, for $q = 2, 4$.
(iii) $\frac{1}{N} \sum_{i=1}^{N} \big[ J^{(1)}_k(Y_i, X_i) \big]^2 = O_p(1)$, $\frac{1}{N} \sum_{i=1}^{N} \big[ J^{(2)}_{k_1 k_2}(Y_i, X_i) \big]^2 = O_p(1)$, $\frac{1}{N} \sum_{i=1}^{N} H(Y_i, X_i) = O_p(1)$, and $\frac{1}{N} \sum_{i=1}^{N} \big[ D^{(q)}(Y_i, X_i) \big]^2 = O_p(1)$, for $q = 2, 4$.
(iv)
\[
\frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{A}} \Big[ \mathbb{E}\big( J^{(1)}(Y, X) \,\big|\, X = X_i, \alpha \big) \Big]^2 \pi^{\mathrm{up}}_T(\alpha | X_i) \, d\alpha = O_p(1/T),
\]
\[
\frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{A}} \Big[ \mathbb{E}\big( J^{(2)}(Y, X) \,\big|\, X = X_i, \alpha \big) \Big]^2 \pi^{\mathrm{up}}_T(\alpha | X_i) \, d\alpha = O_p(1),
\]
\[
\frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{A}} \Big[ \mathbb{E}\big( D^{(q)}(Y, X) \,\big|\, X = X_i, \alpha \big) \Big]^2 \pi^{\mathrm{up}}_T(\alpha | X_i) \, d\alpha = O_p(1), \quad q = 2, 4.
\]
(v) $\sqrt{NT} \, \frac{\partial \mathcal{L}_{NT}(\theta^0, \pi^0)}{\partial \theta} = O_p(1)$, and $\exists c > 0$ such that $\Big| \frac{\partial^2 \mathcal{L}_{NT}(\theta^0, \pi^0)}{\partial \theta \, \partial \theta'} \Big| > c$, wpa1.
(vi) $\frac{\partial^3 \mathcal{L}_{NT}(\theta, \pi)}{\partial \theta_{k_1} \partial \theta_{k_2} \partial \theta_{k_3}} = O_p(1)$, uniformly in a neighborhood of $\theta^0$ and over $\pi \in \Pi^{\mathrm{lip}}_{T,\kappa}$ with $\kappa_T = \sqrt{T}$.

These regularity assumptions look complicated. However, a key advantage of our analysis of the integrated likelihood is that it does not involve a Laplace approximation and therefore allows the distributions $\pi(\alpha | x)$ to be, e.g., non-differentiable in $\alpha$; only a Lipschitz condition is imposed.

Assumption C.3. $\pi^{\mathrm{up}}_T(\alpha | x) / \pi^{\mathrm{low}}_T(\alpha | x)$ is uniformly bounded over $\alpha \in \mathcal{A}$, $x \in \mathcal{X}^T$ and $T$.

Assumption C.3 is a convenient technical condition, but it can probably be relaxed without affecting the validity of our conclusions. For the moment, we leave this generalization for future work.

C.2 Proofs

C.2.1 Proofs for Section 4.4.1

Proof of Theorem 4.1. By the mean value theorem for integration there exist $\tilde\alpha_i(\theta, \pi, Y_i, X_i) \in \mathcal{A}$ such that
\[
\mathcal{L}_{NT}(\theta, \pi) = \frac{1}{NT} \sum_{i=1}^{N} \log f\big( Y_i | X_i, \tilde\alpha_i(\theta, \pi, Y_i, X_i); \theta \big), \qquad (C.5)
\]
and therefore $\mathcal{L}_{NT}(\theta, \pi) \le \mathcal{L}^p_{NT}(\theta)$. We have thus obtained an upper bound on $\mathcal{L}_{NT}(\theta, \pi)$. Next, we derive a lower bound on $\mathcal{L}_{NT}(\theta^0, \pi)$. Let $\hat\alpha^p_i = \hat\alpha^p_i(\theta^0)$. We have wpa1
\[
\mathcal{L}_{NT}(\theta^0, \pi) = \frac{1}{NT} \sum_{i=1}^{N} \log \int_{\mathcal{A}} f\big( Y_i | X_i, \alpha; \theta^0 \big) \pi(\alpha | X_i) \, d\alpha
\]
\[
\ge \frac{1}{NT} \sum_{i=1}^{N} \log \int_{\mathcal{B}_{c_3,i} \cap \mathcal{A}} f\big( Y_i | X_i, \hat\alpha^p_i; \theta^0 \big) \exp\Big[ - \frac{c_4 T}{2} \big( \alpha - \hat\alpha^p_i \big)' \mathcal{I}_i \big( \alpha - \hat\alpha^p_i \big) \Big] \pi^{\mathrm{low}}_T(\alpha | X_i) \, d\alpha
\]
\[
\ge \mathcal{L}^p_{NT}(\theta^0) + \frac{1}{NT} \sum_{i=1}^{N} \log \Big[ \exp\Big( - \frac{c_3 c_4}{2} \Big) \inf_{\alpha \in \mathcal{B}_{c_3,i} \cap \mathcal{A}} \pi^{\mathrm{low}}_T(\alpha | X_i) \int_{\mathcal{B}_{c_3,i} \cap \mathcal{A}} d\alpha \Big]
\]
\[
= \mathcal{L}^p_{NT}(\theta^0) - \frac{c_3 c_4}{2T} + \frac{1}{NT} \sum_{i=1}^{N} \inf_{\alpha \in \mathcal{B}_{c_3,i} \cap \mathcal{A}} \log \pi^{\mathrm{low}}_T(\alpha | X_i) + \frac{1}{NT} \sum_{i=1}^{N} \log \operatorname{vol}\big( \mathcal{B}_{c_3,i} \cap \mathcal{A} \big)
\]
\[
\ge \mathcal{L}^p_{NT}(\theta^0) + \frac{1}{NT} \sum_{i=1}^{N} \log \pi^{\mathrm{low}}_T\big( \hat\alpha^p_i | X_i \big) - \frac{b}{NT} \sum_{i=1}^{N} \sqrt{ \frac{c_3 \, \| \mathcal{I}_i^{-1} \|}{T} } + \frac{1}{NT} \sum_{i=1}^{N} \log \big[ c_5 \operatorname{vol}\big( \mathcal{B}_{c_3,i} \big) \big] + o_p(1)
\]
\[
= \mathcal{L}^p_{NT}(\theta^0) + \frac{1}{NT} \sum_{i=1}^{N} \log \frac{\pi^{\mathrm{low}}_T\big( \hat\alpha^p_i | X_i \big)}{T^{M/2} \sqrt{\det \mathcal{I}_i}} + o_p(1) \ge \mathcal{L}^p_{NT}(\theta^0) + o_p(1), \qquad (C.6)
\]
uniformly over $\pi \in \Pi^{\mathrm{low}}_T$. Here $b > 0$ is the Lipschitz constant of $\log \pi^{\mathrm{low}}_T$. Then we have, uniformly over $\pi \in \Pi^{\mathrm{low}}_T$,
\[
\mathcal{L}^p_{NT}\big( \hat\theta(\pi) \big) \ge \mathcal{L}_{NT}\big( \hat\theta(\pi), \pi \big) \ge \mathcal{L}_{NT}\big( \theta^0, \pi \big) \ge \mathcal{L}^p_{NT}(\theta^0) + o_p(1) = \mathcal{L}^p_{NT}(\hat\theta^p) + o_p(1). \qquad (C.7)
\]
Applying our assumption on the shape of $\mathcal{L}^p_{NT}(\theta)$ we thus obtain
\[
c_2 \, \big\| \hat\theta(\pi) - \hat\theta^p \big\|^2 \le \mathcal{L}^p_{NT}(\hat\theta^p) - \mathcal{L}^p_{NT}\big( \hat\theta(\pi) \big) = o_p(1), \qquad (C.8)
\]
which implies $\| \hat\theta(\pi) - \hat\theta^p \| = o_p(1)$, and therefore $\| \hat\theta(\pi) - \theta^0 \| = o_p(1)$, uniformly over $\pi \in \Pi^{\mathrm{low}}_T$.

C.2.2 Proofs for Section 4.4.2

Lemma C.1. For all $\kappa_T > 0$, $y \in \mathcal{Y}^T$ and $x \in \mathcal{X}^T$ we have
(i)
\[
\sup_{\pi_1, \pi_2 \in \Pi^{\mathrm{lip}}_\kappa} \Bigg( \frac{\partial \log f_{Y|X}(y|x; \theta^0, \pi_1)}{\partial \theta_k} - \frac{\partial \log f_{Y|X}(y|x; \theta^0, \pi_2)}{\partial \theta_k} \Bigg)^2 \le 8 \kappa_T^2 \, J^{(2)}_{kk}(y,x) \Bigg[ D^{(2)}(y,x) + \frac{\kappa_T^2}{2T} D^{(4)}(y,x) \Bigg].
\]
(ii)
\[
\sup_{\pi \in \Pi^{\mathrm{lip}}_\kappa} \Bigg( \frac{\partial \log f_{Y|X}(y|x; \theta^0, \pi)}{\partial \theta_k} - \int_{\mathcal{A}} \frac{\partial \log f(y|x,\alpha; \theta^0)}{\partial \theta_k} f^{\mathrm{unif}}_{\alpha|Y,X}(\alpha|y,x; \theta^0) \, d\alpha \Bigg)^2 \le 4 \kappa_T^2 \, J^{(2)}_{kk}(y,x) \Bigg[ D^{(2)}(y,x) + \frac{\kappa_T^2}{2T} D^{(4)}(y,x) \Bigg].
\]
In addition, either let $\tilde{\mathbb{E}}$ be a (conditional) expected value over the random variables $\tilde Y = Y$ and $\tilde X = X$, or let $\tilde{\mathbb{E}} = \frac{1}{N} \sum_{i=1}^{N}$ be a sample average over the sample $\tilde Y = Y_i$ and $\tilde X = X_i$. Then we have
(iii)
\[
\sup_{\pi_1, \pi_2 \in \Pi^{\mathrm{lip}}_\kappa} \Bigg[ \tilde{\mathbb{E}} \Bigg( \frac{\partial \log f_{Y|X}(\tilde Y|\tilde X; \theta^0, \pi_1)}{\partial \theta_k} - \frac{\partial \log f_{Y|X}(\tilde Y|\tilde X; \theta^0, \pi_2)}{\partial \theta_k} \Bigg) \Bigg]^2 \le 8 \kappa_T^2 \, \Big[ \tilde{\mathbb{E}} J^{(2)}_{kk}(\tilde Y, \tilde X) \Big] \Bigg[ \tilde{\mathbb{E}} D^{(2)}(\tilde Y, \tilde X) + \frac{\kappa_T^2}{2T} \tilde{\mathbb{E}} D^{(4)}(\tilde Y, \tilde X) \Bigg].
\]
(iv)
\[
\sup_{\pi \in \Pi^{\mathrm{lip}}_\kappa} \Bigg[ \tilde{\mathbb{E}} \Bigg( \frac{\partial \log f_{Y|X}(\tilde Y|\tilde X; \theta^0, \pi)}{\partial \theta_k} - \int_{\mathcal{A}} \frac{\partial \log f(\tilde Y|\tilde X, \alpha; \theta^0)}{\partial \theta_k} f^{\mathrm{unif}}_{\alpha|Y,X}(\alpha|\tilde Y, \tilde X; \theta^0) \, d\alpha \Bigg) \Bigg]^2 \le 4 \kappa_T^2 \, \Big[ \tilde{\mathbb{E}} J^{(2)}_{kk}(\tilde Y, \tilde X) \Big] \Bigg[ \tilde{\mathbb{E}} D^{(2)}(\tilde Y, \tilde X) + \frac{\kappa_T^2}{2T} \tilde{\mathbb{E}} D^{(4)}(\tilde Y, \tilde X) \Bigg].
\]

Proof. Part (i): Applying Chebyshev's inequality one gets
\[
\Bigg( \frac{\partial \log f_{Y|X}(y|x; \theta^0, \pi_1)}{\partial \theta_k} - \frac{\partial \log f_{Y|X}(y|x; \theta^0, \pi_2)}{\partial \theta_k} \Bigg)^2
= \Bigg( \int_{\mathcal{A}} \frac{\partial \log f(y|x,\alpha; \theta^0)}{\partial \theta_k} f(y|x,\alpha; \theta^0) \Bigg[ \frac{\pi_1(\alpha|x)}{f_{Y|X}(y|x; \theta^0, \pi_1)} - \frac{\pi_2(\alpha|x)}{f_{Y|X}(y|x; \theta^0, \pi_2)} \Bigg] d\alpha \Bigg)^2
\]
\[
\le \underbrace{ \int_{\mathcal{A}} \Bigg( \frac{\partial \log f(y|x,\alpha; \theta^0)}{\partial \theta_k} \Bigg)^2 \frac{f(y|x,\alpha; \theta^0)}{\int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta} \, d\alpha }_{= \, T \, J^{(2)}_{kk}(y,x)}
\; \times \;
\underbrace{ \int_{\mathcal{A}} \Bigg( \frac{\pi_1(\alpha|x) \int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta}{f_{Y|X}(y|x; \theta^0, \pi_1)} - \frac{\pi_2(\alpha|x) \int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta}{f_{Y|X}(y|x; \theta^0, \pi_2)} \Bigg)^2 \frac{f(y|x,\alpha; \theta^0)}{\int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta} \, d\alpha }_{\equiv \, b(y,x)}. \qquad (C.9)
\]
For the integrand in the second term we have
\[
\Bigg( \frac{\pi_1(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_1)} - \frac{\pi_2(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_2)} \Bigg)^2
= \Bigg( \Bigg[ \frac{\pi_1(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_1)} - 1 \Bigg] - \Bigg[ \frac{\pi_2(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_2)} - 1 \Bigg] \Bigg)^2
\le 2 \Bigg( \frac{\pi_1(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_1)} - 1 \Bigg)^2 + 2 \Bigg( \frac{\pi_2(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_2)} - 1 \Bigg)^2. \qquad (C.10)
\]
Furthermore,
\[
\frac{\pi_1(\alpha|x) \int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta}{f_{Y|X}(y|x; \theta^0, \pi_1)} - 1
= \int_{\mathcal{A}} \frac{f(y|x,\beta; \theta^0) \big[ \pi_1(\alpha|x) - \pi_1(\beta|x) \big]}{f_{Y|X}(y|x; \theta^0, \pi_1)} \, d\beta
\le \int_{\mathcal{A}} \frac{f(y|x,\beta; \theta^0) \big| \pi_1(\alpha|x) - \pi_1(\beta|x) \big|}{f_{Y|X}(y|x; \theta^0, \pi_1)} \, d\beta
\]
\[
\le \kappa_T \int_{\mathcal{A}} \frac{f(y|x,\beta; \theta^0) \, \pi_1(\beta|x) \, d_x(\beta,\alpha)}{f_{Y|X}(y|x; \theta^0, \pi_1)} \, d\beta
= \kappa_T \int_{\mathcal{A}} d_x(\beta,\alpha) \, f_{\alpha|Y,X}(\beta|y,x; \theta^0, \pi_1) \, d\beta. \qquad (C.11)
\]
Therefore, also applying Jensen's inequality (namely $[\mathbb{E} Z]^2 \le \mathbb{E}[Z^2]$), we obtain
\[
A_1 \equiv \int_{\mathcal{A}} \Bigg( \frac{\pi_1(\alpha|x) \int f \, d\beta}{f_{Y|X}(\pi_1)} - 1 \Bigg)^2 \frac{f(y|x,\alpha; \theta^0)}{\int_{\mathcal{A}} f(y|x,\beta; \theta^0) d\beta} \, d\alpha
\le \kappa_T^2 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^2(\beta,\alpha) \, f_{\alpha|Y,X}(\beta|y,x; \theta^0, \pi_1) \, \frac{f(y|x,\alpha; \theta^0)}{\int f(y|x,\gamma; \theta^0) d\gamma} \, d\beta \, d\alpha
\]
\[
\le \kappa_T^2 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^2(\beta,\alpha) \, \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha
+ \kappa_T^2 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^2(\beta,\alpha) \Bigg[ f_{\alpha|Y,X}(\beta|y,x; \theta^0, \pi_1) - \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \Bigg] \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha
\]
\[
\le \kappa_T^2 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^2(\beta,\alpha) \, \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha
+ \kappa_T^2 \sqrt{ A_1 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^4(\beta,\alpha) \, \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha }, \qquad (C.12)
\]
where in the last step we applied Chebyshev's inequality. This implies
\[
A_1 \le 2 \kappa_T^2 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^2(\beta,\alpha) \, \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha
+ \kappa_T^4 \int_{\mathcal{A}} \int_{\mathcal{A}} d_x^4(\beta,\alpha) \, \frac{f(y|x,\beta; \theta^0)}{\int f \, d\gamma} \frac{f(y|x,\alpha; \theta^0)}{\int f \, d\gamma} \, d\beta \, d\alpha
= \frac{2 \kappa_T^2}{T} D^{(2)}(y,x) + \frac{\kappa_T^4}{T^2} D^{(4)}(y,x). \qquad (C.13)
\]
By symmetry we obtain the same result for $\pi_2$, and we denote the corresponding term by $A_2$. Combining the above inequalities we find
\[
b(y,x) \le 2 A_1 + 2 A_2 \le \frac{8 \kappa_T^2}{T} D^{(2)}(y,x) + \frac{4 \kappa_T^4}{T^2} D^{(4)}(y,x). \qquad (C.14)
\]
Combining the above results gives part (i) of the lemma. Part (ii) of the lemma is obtained analogously, but in that case there is no $A_2$ term, so that the bound is a factor of two smaller.
Parts (iii) and (iv) are also obtained by following the same arguments, but with $\tilde{\mathbb{E}}$ taken into account whenever Chebyshev's inequality is applied.

Proof of Theorem 4.2. Part I (Score): Applying part (iii) of Lemma C.1 yields
\[
\sup_{\pi_1, \pi_2 \in \Pi^{\mathrm{lip}}_\kappa} \Bigg\| \frac{\partial \mathcal{L}_{NT}(\theta^0, \pi_1)}{\partial \theta_k} - \frac{\partial \mathcal{L}_{NT}(\theta^0, \pi_2)}{\partial \theta_k} \Bigg\|^2
= \sup_{\pi_1, \pi_2 \in \Pi^{\mathrm{lip}}_\kappa} \Bigg[ \frac{1}{NT} \sum_{i=1}^{N} \Bigg( \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_1)}{\partial \theta_k} - \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_2)}{\partial \theta_k} \Bigg) \Bigg]^2
\]
\[
\le \frac{8 \kappa_T^2}{T^2} \Bigg( \frac{1}{N} \sum_{i=1}^{N} J^{(2)}_{kk}(Y_i, X_i) \Bigg) \Bigg[ \frac{1}{N} \sum_{i=1}^{N} D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} \frac{1}{N} \sum_{i=1}^{N} D^{(4)}(Y_i, X_i) \Bigg]. \qquad (C.15)
\]
Together with the assumptions, this shows the result.

Part II (Hessian): We have
\[
\frac{\partial^2 \mathcal{L}_{NT}(\theta^0, \pi)}{\partial \theta \, \partial \theta'} = \frac{1}{NT} \sum_{i=1}^{N} \frac{\partial^2 \log f_{Y|X}(Y_i|X_i; \theta^0, \pi)}{\partial \theta \, \partial \theta'}
\]
\[
= \frac{1}{NT} \sum_{i=1}^{N} \Bigg\{ \int_{\mathcal{A}} \Bigg[ \frac{\partial^2 \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta \, \partial \theta'} + \frac{\partial \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta} \frac{\partial \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta'} \Bigg] f_{\alpha|Y,X}(\alpha|Y_i, X_i; \theta^0, \pi) \, d\alpha
- \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi)}{\partial \theta} \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi)}{\partial \theta'} \Bigg\}. \qquad (C.16)
\]
Thus we have
\[
\frac{\partial^2 \mathcal{L}_{NT}(\theta^0, \pi_1)}{\partial \theta \, \partial \theta'} - \frac{\partial^2 \mathcal{L}_{NT}(\theta^0, \pi_2)}{\partial \theta \, \partial \theta'} = A_1 - A_2, \qquad (C.17)
\]
where
\[
A_1 = \frac{1}{NT} \sum_{i=1}^{N} \int_{\mathcal{A}} \Bigg[ \frac{\partial^2 \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta \, \partial \theta'} + \frac{\partial \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta} \frac{\partial \log f(Y_i|X_i, \alpha; \theta^0)}{\partial \theta'} \Bigg] \Big[ f_{\alpha|Y,X}(\alpha|Y_i, X_i; \theta^0, \pi_1) - f_{\alpha|Y,X}(\alpha|Y_i, X_i; \theta^0, \pi_2) \Big] d\alpha, \qquad (C.18)
\]
and
\[
A_2 = \frac{1}{NT} \sum_{i=1}^{N} \Bigg[ \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_1)}{\partial \theta} \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_1)}{\partial \theta'} - \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_2)}{\partial \theta} \frac{\partial \log f_{Y|X}(Y_i|X_i; \theta^0, \pi_2)}{\partial \theta'} \Bigg]. \qquad (C.19)
\]
Analogous to the proof of Lemma C.1 that was used in Part I, we obtain the following bound for $A_1$:
\[
A_{1,k_1 k_2}^2 \le \frac{8 \kappa_T^2}{T} \Bigg( \frac{1}{N} \sum_{i=1}^{N} H_{k_1 k_2}(Y_i, X_i) \Bigg) \Bigg[ \frac{1}{N} \sum_{i=1}^{N} D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} \frac{1}{N} \sum_{i=1}^{N} D^{(4)}(Y_i, X_i) \Bigg]. \qquad (C.20)
\]
Therefore $A_1 = O_p( \kappa_T / \sqrt{T} )$ under our assumptions, uniformly over $\pi_1$ and $\pi_2$. Applying part (ii) of Lemma C.1 we obtain for $A_2$:
\[
| A_{2,k_1 k_2} | \le \frac{2}{N \sqrt{T}} \sum_{i=1}^{N} \Bigg\{ \Big| J^{(1)}_{k_1}(Y_i, X_i) \Big| \sqrt{ 4 \kappa_T^2 \, J^{(2)}_{k_2 k_2}(Y_i, X_i) \Big[ D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} D^{(4)}(Y_i, X_i) \Big] }
+ \big( \text{same term with } k_1 \leftrightarrow k_2 \big)
+ 4 \kappa_T^2 \, J^{(2)}_{k_2 k_2}(Y_i, X_i) \Big[ D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} D^{(4)}(Y_i, X_i) \Big] \Bigg\}
\]
\[
\le \frac{8 \kappa_T^2}{\sqrt{T}} \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \Big[ J^{(1)}_{k_1}(Y_i, X_i) \Big]^2 } \sqrt{ \frac{1}{N} \sum_{i=1}^{N} J^{(2)}_{k_2 k_2}(Y_i, X_i) \Big[ D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} D^{(4)}(Y_i, X_i) \Big] }
+ \big( \text{same term with } k_1 \leftrightarrow k_2 \big)
+ \frac{8 \kappa_T^2}{\sqrt{T}} \frac{1}{N} \sum_{i=1}^{N} J^{(2)}_{k_2 k_2}(Y_i, X_i) \Big[ D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} D^{(4)}(Y_i, X_i) \Big]. \qquad (C.21)
\]
Furthermore, we have
\[
\frac{1}{N} \sum_{i=1}^{N} J^{(2)}_{kk}(Y_i, X_i) \Big[ D^{(2)}(Y_i, X_i) + \frac{\kappa_T^2}{2T} D^{(4)}(Y_i, X_i) \Big]
\le \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \Big[ J^{(2)}_{kk}(Y_i, X_i) \Big]^2 } \Bigg[ \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big[ D^{(2)}(Y_i, X_i) \big]^2 } + \frac{\kappa_T^2}{2T} \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big[ D^{(4)}(Y_i, X_i) \big]^2 } \Bigg]. \qquad (C.22)
\]
Thus, our assumptions also guarantee $A_2 = O_p( \kappa_T / \sqrt{T} )$, uniformly over $\pi_1$ and $\pi_2$, so that the same holds for the difference of the Hessians.

Proof of Theorem 4.3. We have
\[
\frac{\partial \mathcal{L}_{NT}(\theta^0, \pi)}{\partial \theta} = \frac{1}{NT} \sum_{i=1}^{N} \int_{\mathcal{A}} \mathbb{E}_{Y|X_i,\alpha} \Bigg[ \frac{\partial \log f_{Y|X}(Y|X_i; \theta, \pi)}{\partial \theta} \Bigg] \pi^0(\alpha|X_i) \, d\alpha
= \frac{1}{NT} \sum_{i=1}^{N} \int_{\mathcal{A}} \mathbb{E}_{Y|X_i,\alpha} \Bigg[ \frac{\partial \log f_{Y|X}(Y|X_i; \theta, \pi)}{\partial \theta} \Bigg] \big[ \pi^0(\alpha|X_i) - \pi(\alpha|X_i) \big] \, d\alpha
\]
\[
= \frac{1}{NT} \sum_{i=1}^{N} \int_{\mathcal{A}} \mathbb{E}_{Y|X_i,\alpha} \Bigg[ \frac{\partial \log f_{Y|X}(Y|X_i; \theta, \pi)}{\partial \theta} \Bigg] \Big[ \sqrt{\pi^0(\alpha|X_i)} + \sqrt{\pi(\alpha|X_i)} \Big] \Big[ \sqrt{\pi^0(\alpha|X_i)} - \sqrt{\pi(\alpha|X_i)} \Big] \, d\alpha. \qquad (C.23)
\]
Applying Chebyshev's inequality we find
\[
\Bigg| \frac{\partial \mathcal{L}_{NT}(\theta^0, \pi)}{\partial \theta_k} \Bigg| \le \frac{1}{T} D_H(\pi, \pi^0) \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{A}} \Bigg[ \mathbb{E}_{Y|X_i,\alpha} \frac{\partial \log f_{Y|X}(Y|X_i; \theta^0, \pi)}{\partial \theta_k} \Bigg]^2 \Big[ \sqrt{\pi^0(\alpha|X_i)} + \sqrt{\pi(\alpha|X_i)} \Big]^2 d\alpha }
\le \frac{\sqrt{2}}{T} D_H(\pi, \pi^0) \sqrt{B(\pi)},
\]
where
\[
B(\pi) \equiv \frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{A}} \Bigg[ \mathbb{E}_{Y|X_i,\alpha} \frac{\partial \log f_{Y|X}(Y|X_i; \theta^0, \pi)}{\partial \theta_k} \Bigg]^2 \big[ \pi^0(\alpha|X_i) + \pi(\alpha|X_i) \big] \, d\alpha. \qquad (C.24)
\]
Using the upper bound on $\pi^0(\alpha|X_i)$ and $\pi(\alpha|X_i)$ we find
\[
B(\pi)\le\frac2N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[\frac{\partial\log f_{Y|X}(Y|X_i;\theta^0,\pi)}{\partial\theta_k}\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha.\tag{C.25}
\]
For the integrand in the last expression we have
\[
\left(\mathbb E_{Y|X_i,\alpha}\!\left[\frac{\partial\log f_{Y|X}(Y|X_i;\theta^0,\pi)}{\partial\theta_k}\right]\right)^2
\le\Bigg(\underbrace{\mathbb E_{Y|X_i,\alpha}\!\left[\int_{\mathcal A}\frac{\partial\log f(Y|X_i,\beta;\theta^0)}{\partial\theta_k}\,f^{\mathrm{unif}}_{\alpha|Y,X}(\beta|Y,X_i;\theta^0)\,d\beta\right]}_{=\,\sqrt T\,\mathbb E_{Y|X_i,\alpha}\left[J^{(1)}_k(Y,X_i)\right]}
+\mathbb E_{Y|X_i,\alpha}\!\left[\frac{\partial\log f_{Y|X}(Y|X_i;\theta^0,\pi)}{\partial\theta_k}-\int_{\mathcal A}\frac{\partial\log f(Y|X_i,\beta;\theta^0)}{\partial\theta_k}\,f^{\mathrm{unif}}_{\alpha|Y,X}(\beta|Y,X_i;\theta^0)\,d\beta\right]\Bigg)^2
\]
\[
\le 2T\left(\mathbb E_{Y|X_i,\alpha}\!\left[J^{(1)}_k(Y,X_i)\right]\right)^2
+2\left(\mathbb E_{Y|X_i,\alpha}\!\left[\frac{\partial\log f_{Y|X}(Y|X_i;\theta^0,\pi)}{\partial\theta_k}-\int_{\mathcal A}\frac{\partial\log f(Y|X_i,\beta;\theta^0)}{\partial\theta_k}\,f^{\mathrm{unif}}_{\alpha|Y,X}(\beta|Y,X_i;\theta^0)\,d\beta\right]\right)^2
\]
\[
\le 2T\left(\mathbb E_{Y|X_i,\alpha}\!\left[J^{(1)}_k(Y,X_i)\right]\right)^2
+4\kappa_T^2\,\mathbb E_{Y|X_i,\alpha}\!\left[J^{(2)}_{kk}(Y,X_i)\right]\left\{\mathbb E_{Y|X_i,\alpha}\!\left[D^{(2)}(Y,X_i)\right]+\frac{\kappa_T^2}{2T}\,\mathbb E_{Y|X_i,\alpha}\!\left[D^{(4)}(Y,X_i)\right]\right\},\tag{C.26}
\]
where we applied part (iv) of Lemma C.1 with $\tilde{\mathbb E}=\mathbb E_{Y|X_i,\alpha}$. By the Cauchy–Schwarz inequality and the assumptions, we thus obtain
\[
B(\pi)\le 4T\,\frac1N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[J^{(1)}_k(Y,X_i)\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha
\]
\[
+8\kappa_T^2\sqrt{\frac1N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[J^{(2)}_{kk}(Y,X_i)\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha}\;\sqrt{\frac1N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[D^{(2)}(Y,X_i)\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha}
\]
\[
+\frac{4\kappa_T^4}{T}\sqrt{\frac1N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[J^{(2)}_{kk}(Y,X_i)\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha}\;\sqrt{\frac1N\sum_{i=1}^N\int_{\mathcal A}\left(\mathbb E_{Y|X_i,\alpha}\!\left[D^{(4)}(Y,X_i)\right]\right)^2\pi^{\mathrm{up}}_T(\alpha|X_i)\,d\alpha}
=O_p(1)+O_p(\kappa_T^2).\tag{C.27}
\]
Combining the above results gives the statement in the theorem.

Proof of Lemma 4.4. Applying part (iii) of Lemma C.1 we find
\[
\mathbb E\,\nu^2_{NT}(\pi_T)=\frac1{NT}\sum_{i=1}^N\mathbb E\,\nu^2_{NT,i}(\pi_T)
\le\frac1T\,\mathbb E\!\left[\frac{\partial\log f_{Y|X}(Y|X;\theta^0,\pi)}{\partial\theta_k}-\frac{\partial\log f_{Y|X}(Y|X;\theta^0,\pi^0)}{\partial\theta_k}\right]^2
\]
\[
\le\frac{8\kappa_T^2}{T}\,\mathbb E\,J^{(2)}_{kk}(Y,X)\left[\mathbb E\,D^{(2)}(Y,X)+\frac{\kappa_T^2}{2T}\,\mathbb E\,D^{(4)}(Y,X)\right]
=O\!\left(\frac{\kappa_T^2}{T}\right),\tag{C.28}
\]
and therefore $\nu_{NT}(\pi_T)=O_p(\kappa_T/\sqrt T)$, uniformly over $\pi\in\Pi^{\mathrm{lip}}_{T,\kappa}$.

Proof of Corollary 4.5. Consistency of $\hat\theta(\pi^0)$ and $\hat\theta(\hat\pi)$, together with Assumptions C.2(v) and (vi) and Theorem 4.2, implies that
\[
\hat\theta(\pi^0)-\theta^0=-\left[\frac{\partial^2 L_{NT}(\theta^0,\pi^0)}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{\partial L_{NT}(\theta^0,\pi^0)}{\partial\theta}+o_p\big((NT)^{-1/2}\big),
\qquad
\hat\theta(\hat\pi)-\theta^0=-\left[\frac{\partial^2 L_{NT}(\theta^0,\hat\pi)}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{\partial L_{NT}(\theta^0,\hat\pi)}{\partial\theta}+o_p\big((NT)^{-1/2}\big).\tag{C.29}
\]
Therefore,
\[
\hat\theta(\hat\pi)-\hat\theta(\pi^0)=-\left[\frac{\partial^2 L_{NT}(\theta^0,\pi^0)}{\partial\theta\,\partial\theta'}\right]^{-1}\left[\frac{\partial L_{NT}(\theta^0,\hat\pi)}{\partial\theta}-\frac{\partial L_{NT}(\theta^0,\pi^0)}{\partial\theta}\right]
-\left\{\left[\frac{\partial^2 L_{NT}(\theta^0,\hat\pi)}{\partial\theta\,\partial\theta'}\right]^{-1}-\left[\frac{\partial^2 L_{NT}(\theta^0,\pi^0)}{\partial\theta\,\partial\theta'}\right]^{-1}\right\}\frac{\partial L_{NT}(\theta^0,\hat\pi)}{\partial\theta}+o_p\big((NT)^{-1/2}\big).\tag{C.30}
\]
By Assumption C.2(v) and Theorem 4.3, the first term on the right-hand side of (C.30) is of order $O_p(\kappa_T/T)\,D_H(\hat\pi,\pi^0)$. By Theorem 4.2 and again Assumption C.2(v), the second term on the right-hand side is of order $o_p\big((NT)^{-1/2}\big)$. For this last step we need to bound the difference between the inverses of two matrices, which can, for example, be done using the identity $A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}$, which implies $\|A^{-1}-B^{-1}\|\le\|A^{-1}\|\,\|B-A\|\,\|B^{-1}\|$ in the operator norm. The statement of the corollary thus follows from (C.30).

C.2.3 Proofs for Section 4.4.3

Proof of Lemma 4.6. Cross-sectional independence implies that
\[
\mathbb E\big(\psi^2_{NT}(\pi_T)\,\big|\,X_1,\ldots,X_N\big)\le\Big[D^{(2)}_{KL}\big(f_Y(\pi_T)\,\big\|\,f_Y(\pi^0)\big)\Big]^2,\tag{C.31}
\]
where
\[
D^{(2)}_{KL}\big(f_Y(\pi)\,\big\|\,f_Y(\pi^0)\big)=\sqrt{\frac1N\sum_{i=1}^N\int_{\mathcal Y^T}\left[\log\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{f_{Y|X}(y|X_i;\theta^0,\pi)}\right]^2 f_{Y|X}(y|X_i;\theta^0,\pi^0)\,dy}.\tag{C.32}
\]
Therefore
\[
\psi_{NT}(\pi_T)=O_p(1)\,D^{(2)}_{KL}\big(f_Y(\pi_T)\,\big\|\,f_Y(\pi^0)\big).\tag{C.33}
\]
By assumption, $\pi^0(\alpha|x)/\pi_T(\alpha|x)\le\pi^{\mathrm{up}}_T(\alpha|x)/\pi^{\mathrm{low}}_T(\alpha|x)$ is bounded. This also implies that $f_{Y|X}(y|x;\theta^0,\pi^0)\big/f_{Y|X}(y|x;\theta^0,\pi)$ is bounded. Note that for all $0<z\le b$ we have $(\log z)^2\le d^2\,(\log z+1/z-1)$, with $d^2=b^2/(b-1)$; this elementary inequality is spot-checked numerically in the sketch below.
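The following minimal sketch confirms that the gap in this inequality is nonnegative on $(0,b]$; the grid and the values of $b$ are arbitrary illustrative choices.

import numpy as np

# Spot check of (log z)^2 <= d^2 (log z + 1/z - 1) with d^2 = b^2/(b-1) on (0, b].
for b in (1.5, 2.0, 5.0, 50.0):
    z = np.linspace(1e-8, b, 500_000)
    d2 = b**2 / (b - 1.0)
    gap = d2 * (np.log(z) + 1.0 / z - 1.0) - np.log(z)**2
    print(f"b = {b:5.1f}   min gap on grid = {gap.min():.3e}")   # nonnegative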
Thus, there exists a constant $d$ such that
\[
D^{(2)}_{KL}\big(f_Y(\pi)\,\big\|\,f_Y(\pi^0)\big)\le d\,\sqrt{D_{KL}\big(f_Y(\pi)\,\big\|\,f_Y(\pi^0)\big)}.\tag{C.34}
\]
This proves the lemma.

Proof of Theorem 4.7. Assumption 4.3(i) guarantees that there exists $\tilde\pi_T\in\Pi_T$ such that $D_{KL}\big(f_Y(\tilde\pi_T)\,\|\,f_Y(\pi^0)\big)=O_p(T^{-2\mu})$. For such a $\tilde\pi_T$ we have
\[
T\Big[L_{NT}(\tilde\pi_T)-L_{NT}(\theta^0,\pi^0)\Big]\ge T\Big[L_{NT}(\theta^0,\tilde\pi_T)-L_{NT}(\theta^0,\pi^0)\Big]
=O_p\big(T^{-2\mu}\big)+o_p\!\left(\sqrt{\tfrac TN}\,\frac{T^{-\mu}}{\kappa_T}\right).\tag{C.35}
\]
Here we have also used Assumption 4.2. The optimal $\hat\pi$ needs to satisfy $L_{NT}(\hat\pi)\ge L_{NT}(\theta^0,\tilde\pi_T)$. Therefore
\[
T\Big[L_{NT}(\hat\pi)-L_{NT}(\theta^0,\pi^0)\Big]\ge O_p\big(T^{-2\mu}\big)+o_p\!\left(\sqrt{\tfrac TN}\,\frac{T^{-\mu}}{\kappa_T}\right).\tag{C.36}
\]
Using the expansion (4.22) and our results on the score, the Hessian, and $\hat\theta(\pi)$, the last inequality yields
\[
D_{KL}\big(f_Y(\hat\pi)\,\big\|\,f_Y(\pi^0)\big)
\le O_p\big(T^{-2\mu}\big)+o_p\!\left(\sqrt{\tfrac TN}\,\frac{T^{-\mu}}{\kappa_T}\right)
+o_p\!\left(\frac1{\kappa_T}\sqrt{\tfrac TN}\right)\sqrt{D_{KL}\big(f_Y(\hat\pi)\,\big\|\,f_Y(\pi^0)\big)}
+T\left[O_p\!\left(\frac1{\sqrt{NT}}\right)+O_p\!\left(\frac{\kappa_T}{T}\right)D_H(\hat\pi,\pi^0)\right]^2
\]
\[
\le O_p\big(T^{-2\mu}\big)+o_p\!\left(\sqrt{\tfrac TN}\,\frac{T^{-\mu}}{\kappa_T}\right)
+o_p\!\left(\frac1{\kappa_T}\sqrt{\tfrac TN}\right)\sqrt{D_{KL}\big(f_Y(\hat\pi)\,\big\|\,f_Y(\pi^0)\big)}
+O_p\!\left(\frac1N\right)+O_p\!\left(\frac{\kappa_T}{\sqrt{NT}}\right)\left[\sqrt{D_{KL}\big(f(\hat\pi)\,\big\|\,f(\pi^0)\big)}+T^{-\mu}\right]
+O_p\!\left(\frac{\kappa_T^2}{T}\right)\left[\sqrt{D_{KL}\big(f(\hat\pi)\,\big\|\,f(\pi^0)\big)}+T^{-\mu}\right]^2.\tag{C.37}
\]
From this we can conclude that
\[
\sqrt{D_{KL}\big(f(\hat\pi)\,\big\|\,f(\pi^0)\big)}
=O_p\big(T^{-\mu}\big)+o_p\!\left(\frac1{\kappa_T}\sqrt{\tfrac TN}\right)+O_p\!\left(\frac1{\sqrt N}\right)+o_p\!\left(\Big(\sqrt{\tfrac TN}\,\frac{T^{-\mu}}{\kappa_T}\Big)^{1/2}\right)
=O_p\big(T^{-\mu}\big)+o_p\!\left(\frac1{\kappa_T}\sqrt{\tfrac TN}\right).\tag{C.38}
\]
By Assumption 4.3(ii) this implies part (i) of the theorem. Part (ii) of the theorem follows from part (i) by applying Corollary 4.5.

C.3 Further Discussions for Section 4.5

We now present the technical justification for the choice of the parameter set $\Pi_T$ in equation (4.29).

C.3.1 Approximating Unknown Distributions

For simplicity we consider the case $\mathcal A=\mathbb R$, i.e. the dimension of the incidental parameter space is $M=1$, and there are no additional restrictions on $\mathcal A$. In that case we have $K^\Omega_T(\alpha,\beta;x)=\phi\big(\alpha;\beta,\Omega_T(\beta,x)\big)$, a normal pdf with mean $\beta$ and variance $\Omega_T(\beta,x)=\frac{\rho_T}{T}\Lambda_T(\beta,x)$, where $\Lambda_T(\beta,x)$ denotes the inverse that appears in equation (4.30). In the rest of this subsection we drop the dependence on the regressor value $x$ for notational convenience.

We have to show that an unknown density $\pi^0(\alpha)$ can be approximated well by $\pi^{\mathrm{approx}}(\alpha)=\int_{\mathbb R}\phi\big(\alpha;\beta,\Omega_T(\beta)\big)\,\pi(\beta)\,d\beta$ for some appropriate choice of density $\pi(\beta)$. First we note that if both $\pi^0(\alpha)$ and $\Omega_T(\alpha)$ are arbitrarily often differentiable, then we can achieve $\pi^0=\pi^{\mathrm{approx}}$ by choosing
\[
\pi(\alpha)=\left[\sum_{q=0}^\infty\frac1{2^q\,q!}\,\frac{d^{2q}}{d\alpha^{2q}}\,\big[\Omega_T(\alpha)\big]^q\right]^{-1}\pi^0(\alpha).\tag{C.39}
\]
This expression has to be understood as a formal power expansion, which can be rewritten as
\[
\pi(\alpha)=\left[1+\sum_{q=1}^\infty\frac1{2^q\,q!}\,\frac{d^{2q}}{d\alpha^{2q}}\,\big[\Omega_T(\alpha)\big]^q\right]^{-1}\pi^0(\alpha)
=\sum_{r=0}^\infty\left(-\sum_{q=1}^\infty\frac1{2^q\,q!}\,\frac{d^{2q}}{d\alpha^{2q}}\,\big[\Omega_T(\alpha)\big]^q\right)^r\pi^0(\alpha)
\]
\[
=\pi^0(\alpha)-\frac12\,\frac{d^2}{d\alpha^2}\big[\Omega_T(\alpha)\,\pi^0(\alpha)\big]
+\frac14\,\frac{d^2}{d\alpha^2}\left(\Omega_T(\alpha)\,\frac{d^2}{d\alpha^2}\big[\Omega_T(\alpha)\,\pi^0(\alpha)\big]\right)+\ldots
=\sum_{q=0}^\infty\left(\frac{\rho_T}{T}\right)^q A_q(\alpha),\tag{C.40}
\]
where the first few expansion coefficients $A_q(\alpha)$ read
\[
A_0(\alpha)=\pi^0(\alpha),\qquad
A_1(\alpha)=-\frac12\,\frac{d^2}{d\alpha^2}\big[\Lambda_T(\alpha)\,\pi^0(\alpha)\big],
\]
\[
A_2(\alpha)=\frac14\,\frac{d^2}{d\alpha^2}\left(\Lambda_T(\alpha)\,\frac{d^2}{d\alpha^2}\big[\Lambda_T(\alpha)\,\pi^0(\alpha)\big]\right)-\frac18\,\frac{d^4}{d\alpha^4}\big[\Lambda_T(\alpha)^2\,\pi^0(\alpha)\big],\tag{C.41}
\]
and the expressions for all higher $A_q(\alpha)$ can be obtained by expanding the first line of equation (C.40) and sorting terms by powers of $\Omega_T(\alpha)$.¹

¹ In the special case where $\Lambda_T(\alpha)$ does not depend on $\alpha$ one obtains the simple general formula $A_q(\alpha)=\big[(-2)^q\,q!\big]^{-1}\Lambda_T^q\,d^{2q}\pi^0(\alpha)/d\alpha^{2q}$.

Under appropriate regularity conditions we have $\int_{\mathbb R}A_q(\alpha)\,d\alpha=0$ for all $q\ge1$. For example,
\[
\int_{\mathbb R}A_1(\alpha)\,d\alpha=\lim_{\alpha\to-\infty}\frac12\,\frac{d}{d\alpha}\big[\Lambda_T(\alpha)\,\pi^0(\alpha)\big]-\lim_{\alpha\to+\infty}\frac12\,\frac{d}{d\alpha}\big[\Lambda_T(\alpha)\,\pi^0(\alpha)\big],
\]
and we assume that these limits are both zero. The highest derivatives of $\pi^0(\alpha)$ and $\Lambda_T(\alpha)$ that appear in $A_q(\alpha)$ are of order $2q$. A symbolic sanity check of this inversion, in the constant-$\Lambda_T$ case of footnote 1, is sketched below.
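The check applies the smoothing operator $\sum_p(\Omega/2)^p\,(d^{2p}/d\alpha^{2p})/p!$, truncated at order $Q$, to the truncated inverse series and confirms that $\pi^0$ is recovered up to terms of order $\Omega^{Q+1}$. The Gaussian $\pi^0$ below is an illustrative choice; any smooth density works.

import sympy as sp

# Symbolic sanity check (constant-Lambda case): the truncated inverse series
# pi = sum_q (-Omega/2)^q / q! * d^{2q} pi0 / d alpha^{2q}, pushed back through
# the smoothing operator sum_p (Omega/2)^p / p! * d^{2p}/d alpha^{2p}, recovers
# pi0 up to O(Omega^{Q+1}).
a, w = sp.symbols("alpha Omega")
pi0 = sp.exp(-a**2 / 2) / sp.sqrt(2 * sp.pi)   # illustrative smooth density
Q = 3
inv = sum((-w / 2)**q / sp.factorial(q) * sp.diff(pi0, a, 2 * q) for q in range(Q + 1))
fwd = sum((w / 2)**p / sp.factorial(p) * sp.diff(inv, a, 2 * p) for p in range(Q + 1))
residual = sp.expand(fwd - pi0)
print([sp.simplify(residual.coeff(w, n)) for n in range(Q + 1)])  # [0, 0, 0, 0]

The cancellation is purely combinatorial: the coefficient of $\Omega^n$ in the composition is $(1/2-1/2)^n/n!$ times $d^{2n}\pi^0/d\alpha^{2n}$, which vanishes for all $n\ge1$.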
Thus, if $\pi^0(\alpha)$ is $r$ times differentiable, and assuming that $\Lambda_T(\alpha)$ is also sufficiently often differentiable, we can choose
\[
\pi(\alpha)=\sum_{q=0}^{\lfloor r/2\rfloor}\left(\frac{\rho_T}{T}\right)^q A_q(\alpha),\tag{C.42}
\]
where $\lfloor r/2\rfloor$ denotes the largest integer smaller than or equal to $r/2$. For large $T$ this choice of $\pi(\alpha)$ is close to $\pi^0(\alpha)$, so that $\pi(\alpha)\ge0$ is satisfied asymptotically. For this choice of $\pi(\alpha)$ one can show, under appropriate regularity conditions, that $D_{KL}\big(\pi^{\mathrm{approx}}\,\|\,\pi^0\big)=O_p\big[(\rho_T/T)^r\big]$. Since $D_{KL}\big(f_Y(\pi^{\mathrm{approx}})\,\|\,f_Y(\pi^0)\big)\le D_{KL}\big(\pi^{\mathrm{approx}}\,\|\,\pi^0\big)$, this implies that Assumption 4.3(i) is satisfied with $\mu_T=(\rho_T/T)^{r/2}$.

C.3.2 Approximate Identification of $\pi(\alpha|x)$

Assumption 4.3(ii) is an approximate identification condition for $\pi=\pi(\alpha|x)$ within the set $\Pi_T$. The identification is approximate because the "slackness" $\mu_T$ appears on the right-hand side of the inequality in the assumption. We first define an infeasible parameter set that satisfies Assumption 4.3(ii) with $\mu_T=0$ for algebraic reasons, and then show that the kernel construction of Section 4.5.1 provides a feasible parameter set that approximates the infeasible one sufficiently well.

Define
\[
f_{\alpha|Y,X}(\alpha|y,x;\theta,\pi)=\frac{f(y|x,\alpha;\theta)\,\pi(\alpha|x)}{f_{Y|X}(y|x;\theta,\pi)},\qquad
K^0_T(\alpha,\beta;x)=\int_{\mathcal Y^T}f_{\alpha|Y,X}\big(\alpha|y,x;\theta^0,\pi^0\big)\,f(y|x,\beta;\theta^0)\,dy.\tag{C.43}
\]
This is not a Bayesian work, but $f_{\alpha|Y,X}(\alpha|y,x;\theta,\pi)$ clearly has a Bayesian interpretation: it is the posterior distribution of $\alpha$ given $Y=y$ under the prior $\pi(\alpha|x)$, for given values of $x$ and $\theta$. In $K^0_T(\alpha,\beta;x)$ this posterior distribution is integrated over the true distribution of $Y$, i.e. $K^0_T(\alpha,\beta;x)$ is the expected posterior distribution under the prior $\pi^0$, conditional on the individual effect being equal to $\beta$. $K^0_T(\alpha,\beta;x)$ is a kernel function that defines an endomorphism of the distributions over $\mathcal A$ for each $x\in\mathcal X^T$: for $\pi\in\Pi^{\mathcal A}_T$ we have
\[
\big(K^0_T\,\pi\big)(\alpha|x)=\int_{\mathcal A}K^0_T(\alpha,\beta;x)\,\pi(\beta|x)\,d\beta.\tag{C.44}
\]
For given $q\in\mathbb N$ an infeasible parameter set is defined by
\[
\Pi^{(q)}_T=\big(K^0_T\big)^q\left(\Pi^{\mathrm{up}}_T\cap\Pi^{\mathrm{low}}_T\right).\tag{C.45}
\]
This is the set of all distributions that can be generated by $q$ consecutive applications of the kernel $K^0_T$ to an element of $\Pi^{\mathrm{up}}_T\cap\Pi^{\mathrm{low}}_T$ (the set of distributions that satisfy appropriate upper and lower bounds). The parameter set $\Pi^{(q)}_T$ is infeasible, since $\pi^0$ and $\theta^0$ enter into the definition of $K^0_T$. We assume $\pi^0\in\Pi^{\mathrm{up}}_T\cap\Pi^{\mathrm{low}}_T$. Then $\pi^0\in\Pi^{(q)}_T$ for all $q\in\mathbb N$, because $\pi^0$ is a fixed point of $K^0_T$.

The main motivation for defining $\Pi^{(q)}_T$ is the following algebraic result: for every $q$ there exists a constant $c_q$ such that for all $\pi\in\Pi^{(q)}_T$
\[
D_H(\pi,\pi^0)\le c_q\left[D_H\big(f(\pi),f(\pi^0)\big)\right]^{1-\frac1{2q+1}}.\tag{C.46}
\]
The proof is given below. Since $D_H\big(f(\pi),f(\pi^0)\big)\le\sqrt{D_{KL}\big(f(\pi)\,\|\,f(\pi^0)\big)}$, this means that as $q$ becomes large the set $\Pi^{(q)}_T$ approximately satisfies Assumption 4.3(ii).

In order to approximate this infeasible parameter set by a feasible one, we note that $K^0_T(\alpha,\beta;x)$ has a generic shape as $T$ becomes large. Namely, under some regularity conditions on $\pi^0$, one can apply a Laplace approximation argument to show that the distribution of $\alpha$ whose probability density function is given by $K^0_T(\,\cdot\,,\beta;x)$ approaches a Gaussian distribution with mean $\beta$ and variance $2\,\mathcal I^{-1}(\beta,\theta^0,x)/T$ as $T\to\infty$, where the variance contains the inverse of the information matrix. Thus, applying $K^0_T$ to a distribution becomes asymptotically equivalent to applying a Gaussian kernel smoothing with variance $2\,\mathcal I^{-1}(\beta,\theta^0,x)/T$; a toy Monte Carlo illustration of this claim in the normal-means model is sketched below.
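To illustrate (rather than prove) the claim, consider the toy model $Y_t\,|\,\alpha\sim N(\alpha,1)$, $t=1,\ldots,T$, for which the information is $\mathcal I=1$, so the predicted kernel variance is $2/T$. The sketch below uses an illustrative conjugate Gaussian prior, draws from the expected posterior $K^0_T(\,\cdot\,,\beta)$ by Monte Carlo, and checks that its variance is close to $2/T$; all numbers are arbitrary choices.

import numpy as np

# Monte Carlo illustration: in the model Y_t | alpha ~ N(alpha, 1) with a
# conjugate prior alpha ~ N(0, tau2), the expected-posterior kernel K0_T(., beta)
# has mean ~ beta and variance ~ 2/T for large T (information I = 1).
rng = np.random.default_rng(1)
T, beta, tau2, R = 200, 0.3, 1.0, 200_000
ybar = rng.normal(beta, np.sqrt(1.0 / T), size=R)   # sufficient statistic given beta
v = 1.0 / (T + 1.0 / tau2)                          # exact conjugate posterior variance
m = T * ybar * v                                    # exact conjugate posterior mean
draws = rng.normal(m, np.sqrt(v))                   # draws from K0_T(., beta)
print("mean ~ beta:", draws.mean().round(4), "| variance * T ~ 2:", (draws.var() * T).round(3))

The factor 2 arises from two contributions of size $1/T$ each: the posterior variance and the sampling variance of the posterior mean.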
This suggests defining a feasible parameter set by replacing $(K^0_T)^q$ with a Gaussian kernel of appropriate variance, which is exactly the construction of Section 4.5.1, with $q\to\infty$ as $N,T\to\infty$.

We now prove inequality (C.46). Note that for $\pi\in\Pi^{(1)}_T$ there exist $\tilde f(y|x)$ and $\tilde\pi(\alpha|x)$ such that
\[
\tilde f(y|x)=\int_{\mathcal A}f(y|x,\alpha;\theta^0)\,\tilde\pi(\alpha|x)\,d\alpha,\qquad
\pi(\alpha|x)=\int_{\mathcal Y^T}\frac{f(y|x,\alpha;\theta^0)\,\pi^0(\alpha|x)}{f_{Y|X}(y|x;\theta^0,\pi^0)}\,\tilde f(y|x)\,dy.\tag{C.47}
\]
Therefore,
\[
D^2_H(\pi,\pi^0)=\frac1N\sum_{i=1}^N\int_{\mathcal A}\left[\sqrt{\pi(\alpha|X_i)}-\sqrt{\pi^0(\alpha|X_i)}\right]^2 d\alpha
=2-\frac2N\sum_{i=1}^N\int_{\mathcal A}\sqrt{\frac{\pi^0(\alpha|X_i)}{\pi(\alpha|X_i)}}\;\pi(\alpha|X_i)\,d\alpha
\]
\[
=2-\frac2N\sum_{i=1}^N\int_{\mathcal A}\int_{\mathcal Y^T}\frac1{\sqrt{\pi(\alpha|X_i)\big/\pi^0(\alpha|X_i)}}\,f_{\alpha|Y,X}(\alpha|y,X_i;\theta^0,\pi^0)\,\tilde f(y|X_i)\,dy\,d\alpha
\]
\[
\le 2-\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\frac1{\sqrt{\int_{\mathcal A}\frac{\pi(\alpha|X_i)}{\pi^0(\alpha|X_i)}\,f_{\alpha|Y,X}(\alpha|y,X_i;\theta^0,\pi^0)\,d\alpha}}\;\tilde f(y|X_i)\,dy
=2-\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\sqrt{\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{f_{Y|X}(y|X_i;\theta^0,\pi)}}\;\tilde f(y|X_i)\,dy
\]
\[
=D^2_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)+\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\left[1-\sqrt{\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{f_{Y|X}(y|X_i;\theta^0,\pi)}}\right]\Big[\tilde f(y|X_i)-f_{Y|X}(y|X_i;\theta^0,\pi)\Big]\,dy
\]
\[
=D^2_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)-\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\left[1-\sqrt{\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{f_{Y|X}(y|X_i;\theta^0,\pi)}}\right]\left[1-\sqrt{\frac{\tilde f(y|X_i)}{f_{Y|X}(y|X_i;\theta^0,\pi)}}\right]\left[1+\sqrt{\frac{\tilde f(y|X_i)}{f_{Y|X}(y|X_i;\theta^0,\pi)}}\right]f_{Y|X}(y|X_i;\theta^0,\pi)\,dy
\]
\[
\le D^2_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)+2\sqrt2\,D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)\,D_H\big(f_{Y|X}(\pi),\tilde f\big)\left[1+\sup_{y,x}\sqrt{\frac{\tilde f(y|x)}{f_{Y|X}(y|x;\theta^0,\pi)}}\right].\tag{C.48}
\]
Here we have used Jensen's inequality in the fourth line and the Cauchy–Schwarz inequality in the last step. Note that
\[
\sup_{y,x}\frac{\tilde f(y|x)}{f_{Y|X}(y|x;\theta^0,\pi)}=\sup_{y,x}\frac{f_{Y|X}(y|x;\theta^0,\tilde\pi)}{f_{Y|X}(y|x;\theta^0,\pi)}\le\sup_{\alpha,x}\frac{\tilde\pi(\alpha|x)}{\pi(\alpha|x)}.\tag{C.49}
\]
Applying the triangle inequality $D_H\big(f_{Y|X}(\pi),\tilde f\big)\le D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)+D_H\big(\tilde f,f_{Y|X}(\pi^0)\big)$ we thus obtain
\[
D^2_H(\pi,\pi^0)\le a_1\,D^2_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)+a_2\,D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)\,D_H\big(\tilde f,f_{Y|X}(\pi^0)\big)\tag{C.50}
\]
for suitable constants $a_1$ and $a_2$. Next, we use $\tilde f(y|x)=f_{Y|X}(y|x;\theta^0,\tilde\pi)$ to obtain
\[
D^2_H\big(\tilde f,f_{Y|X}(\pi^0)\big)=2-\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\sqrt{\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{\tilde f(y|X_i)}}\,f_{Y|X}(y|X_i;\theta^0,\tilde\pi)\,dy
=2-\frac2N\sum_{i=1}^N\int_{\mathcal Y^T}\int_{\mathcal A}\sqrt{\frac{f_{Y|X}(y|X_i;\theta^0,\pi^0)}{\tilde f(y|X_i)}}\,f(y|X_i,\alpha;\theta^0)\,\tilde\pi(\alpha|X_i)\,d\alpha\,dy
\]
\[
\le 2-\frac2N\sum_{i=1}^N\int_{\mathcal A}\frac1{\sqrt{\int_{\mathcal Y^T}\frac{\tilde f(y|X_i)}{f_{Y|X}(y|X_i;\theta^0,\pi^0)}\,f(y|X_i,\alpha;\theta^0)\,dy}}\;\tilde\pi(\alpha|X_i)\,d\alpha
=2-\frac2N\sum_{i=1}^N\int_{\mathcal A}\sqrt{\frac{\pi^0(\alpha|X_i)}{\pi(\alpha|X_i)}}\,\tilde\pi(\alpha|X_i)\,d\alpha
\]
\[
=D^2_H(\pi,\pi^0)+\frac2N\sum_{i=1}^N\int_{\mathcal A}\left[1-\sqrt{\frac{\pi^0(\alpha|X_i)}{\pi(\alpha|X_i)}}\right]\Big[\tilde\pi(\alpha|X_i)-\pi(\alpha|X_i)\Big]\,d\alpha
\le D^2_H(\pi,\pi^0)+2\sqrt2\,D_H(\pi,\pi^0)\,D_H(\tilde\pi,\pi)\left[1+\sup_{\alpha,x}\sqrt{\frac{\tilde\pi(\alpha|x)}{\pi(\alpha|x)}}\right].\tag{C.51}
\]
Again, using the triangle inequality for the Hellinger distance, we thus obtain
\[
D^2_H\big(\tilde f,f_{Y|X}(\pi^0)\big)\le a_3\,D^2_H(\pi,\pi^0)+a_4\,D_H(\pi,\pi^0)\,D_H(\tilde\pi,\pi^0)\tag{C.52}
\]
for suitable constants $a_3$ and $a_4$. Combining this with the above result gives
\[
D^2_H(\pi,\pi^0)\le a_1\,D^2_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)+a_2a_3\,D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)\,D_H(\pi,\pi^0)
+a_2a_4\,D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)\sqrt{D_H(\pi,\pi^0)\,D_H(\tilde\pi,\pi^0)}.\tag{C.53}
\]
From this we can conclude
\[
D_H(\pi,\pi^0)\le c_1\left[D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)\right]^{2/3}\tag{C.54}
\]
for a suitable constant $c_1$: writing $D=D_H(\pi,\pi^0)$ and $F=D_H\big(f_{Y|X}(\pi),f_{Y|X}(\pi^0)\big)$, and noting that $D_H(\tilde\pi,\pi^0)\le\sqrt2$ is bounded, (C.53) has the form $D^2\le C\big(F^2+FD+F\sqrt D\big)$; since $D$ and $F$ are bounded, either $D\le\mathrm{const}\cdot F$ or $D^2\le\mathrm{const}\cdot F\sqrt D$, and both cases give $D\le\mathrm{const}\cdot F^{2/3}$. By iterating the above argument for $\pi\in\Pi^{(1)}_T$ one obtains the result for general $\Pi^{(q)}_T$. A numerical illustration of this last solving step follows.
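The sketch below solves the scalar boundary equation numerically and confirms that the largest admissible $D$ scales like $F^{2/3}$ as $F\to0$; the constants $a,b,c$ are arbitrary illustrative choices.

import numpy as np
from scipy.optimize import brentq

# Illustration of the solving step behind (C.54): if D^2 <= a F^2 + b F D + c F sqrt(D),
# the largest admissible D scales like F^(2/3) as F -> 0.
a, b, c = 1.0, 1.0, 1.0

def d_max(F):
    g = lambda D: a * F**2 + b * F * D + c * F * np.sqrt(D) - D**2
    return brentq(g, 1e-16, 10.0)   # g > 0 near 0, g < 0 for large D: unique root

for F in (1e-2, 1e-4, 1e-6, 1e-8):
    D = d_max(F)
    print(f"F = {F:.0e}   D_max = {D:.3e}   D_max / F^(2/3) = {D / F ** (2 / 3):.3f}")

The printed ratio settles near a constant (here $c^{2/3}=1$), consistent with the exponent $2/3$ in (C.54).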
Abstract
This dissertation contributes to the econometrics of panel data models and their application to economic problems. In particular, it considers "large T" panels, where, in addition to the cross-sectional dimension N, the number of time periods T is also relatively large. Chapter 1 provides an introduction to the field of large T panel data econometrics and explains the contribution of the dissertation to this field.