Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Bayesian multilevel quantile regression for longitudinal data
(USC Thesis Other)
Bayesian multilevel quantile regression for longitudinal data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
BAYESIAN MULTI LEVEL QUANTILE REGRESSION FOR LOGITUDINAL DATA by Chih-Chieh Chang A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS) May 2015 Copyright 2015 Chih-Chieh Chang ii Contents List of Tables iv List of Figures vi Chapter 1 Introduction 1 1.1Introduction 1 1.1.1 Motivation 1: Is the difference of BMI between overweight/obese children and children with normal weights associated with health risks? 2 1.1.2 Motivation 2: Are the effects of health risk factors on overweight/obese children the same with those on children with normal weights? 5 1.2 The Southern California Children’s Health Study (CHS) 7 1.3 Statistical Methodology Review 9 1.3.1 Modeling the expectation of a dependent variable 9 1.3.2 Modeling the quantile of dependent variable for independent data 10 1.3.3 Quantile regression applied to longitudinal data 14 1.3.4 Bayesian quantile regression 15 1.4 Proposed Methodology 16 1.4.1 Quantile regression based multi-level mixed-effects modeling 16 1.4.2 Bayesian multilevel quantile regression for longitudinal data 16 Chapter 2 Quantile Regression Based Multi-Level Mixed-Effects Model 18 2.1 Introduction 18 2.2 Data 22 2.2.1 Computation of BMI at 85%ile 22 2.2.2 Analysis data set for quantile regression based multi-level modeling 25 2.3 Methods 31 2.3.1 Computing the BMI 85%ile reference point 31 2.3.1.1 BMI at 85%ile provided by the CDC growth charts 31 2.3.1.2 Using quantile regression to get the BMI at 85%ile 32 2.3.2 “2-Step” quantile regression based multi-level approach 33 2.3.3 Testing for gender differences 35 iii 2.3.4 Confounders 35 2.3.5 Effect modifiers 35 2.4 Results 36 2.5 Discussion 48 Chapter 3 Bayesian Multi-Level Quantile Regression for Longitudinal Data 53 3.1 Introduction 53 3.2 Model 64 3.3 Estimation (Proposed Approach) 67 3.4 Simulation 72 3.4.1 Simulations based on “location-shift” models 72 3.4.2 Simulations based on “location-scale-shift” models 75 3.4.3 Simulations based on CHS BMI data structure 76 3.5 Results 84 3.5.1 Results of simulations based on “location-shift” models 84 3.5.2 Results of simulations based on “location-scale-shift” models 87 3.5.3 Results of simulation based on CHS BMI data structure 88 3.6 Application (Real Data Analysis) 92 3.6.1 Children’s Health Study (Data description) 93 3.6.2 Methods 95 3.6.3 Results 97 3.7 Conclusion 109 3.8 Appendix 110 Chapter 4 Summary of Findings and Future Plans 113 4.1 Summary of Findings 113 4.2 Future Plans 113 Reference 116 iv List of Tables Table 2.1 Descriptive statistics of CHS data used for computing internal BMI 85%ile 25 Table 2.2 Built environment variables used in the data analysis organized under the Lynch framework (Adopted from Table 1, Jerrett et al., 2010) 30 Table 2.3 Descriptive statistics of subject characteristics in the CHS data used for multi-level analysis 36 Table 2.4 Descriptive statistics for BMI in CHS data used for multi-level analysis 40 Table 2.5 Descriptive statistics of potential confounders/effect modifiers in CHS data used for multi-level analysis 42 Table 2.6 Results of testing confounding effects and Interaction effects (for Second Hand Smoke Exposure at Baseline (SHS)) 46 Table 2.7 Estimated Effects of Second Hand Smoke Exposure and Selected Effect Modifiers based on Final Model1 Using Different Outcomes 47 Table 3.1 Results of Simulation 1 (comparison between four methods) 85 Table 3.2 Results of Simulation 2 (comparison between four methods) 86 Table 3.3 Results of Simulation 3 (Comparing the proposed Bayesian Multilevel QR with LQMM and to Estimate the coefficients at 75%ile percentiles based on the data simulated in Geraci and Bottai’s paper in 2014) 88 Table 3.4 Results of Simulation 4 89 Table 3.5 Results of Simulation 5 90 Table 3.6 Results of Simulation 6 91 Table 3.7 Results of Simulation 7 92 Table 3.8 Descriptive statistics of subject characteristics in CHS data used for Bayesian multi-level quantile regression analysis 94 Table 3.9 Descriptive statistics of considered covariates in CHS data used for Bayesian multi-level quantile regression analysis 95 Table 3.10 Estimates of Second Hand Smoke Exposure by Selected Effect Modifiers Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender) 106 Table 3.11 Estimates of Second Hand Smoke Exposure by Selected Effect Modifiers Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender and Ethnicity) 107 v Table 3.12 Estimates of Second Hand Smoke Exposure Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender) 108 vi List of Figures Figure 2.1 Gamma Index Based on 500 m Buffer around Subject’s House 28 Figure 2.2 BMI 85%ile reference curves based various approaches (dot line = Females, solid line = Males) 37 Figure 2.3 Percent of Children Classified as Overweight Based on Various Methods at Age of 9.5 yrs and 16 yrs (QR: Quantile regression; CDC: Growth Chart of BMI; NHW: non-Hispanic White Children; Hisp=Hispanic Children) 39 Figure 2.4 The predicted ∆BMI.p85 curves for Hispanic children under condition of being allergic and exposed to high dose of NOx emitted from non-freeway sources (top blue: male exposed to SHS; bottom blue: male unexposed to SHS; top red: female exposed to SHS; bottom red: female unexposed to SHS). 47 Figure 2.5 The predicted ∆BMI.p85 curves for Caucasian children under condition of being allergic and exposed to high dose of NOx emitted from non-freeway sources (top blue: male exposed to SHS; bottom blue: male unexposed to SHS; top red: female exposed to SHS; bottom red: female unexposed to SHS). 48 Figure 3.1 The iteration process of Monte Carlo EM algorithm 59 Figure 3.2 The probability density function of asymmetric Laplace distribution: (a) for 0.5 ≤ τ < 1 with μ=0, σ=1; (b) for 0.5 ≤ τ < 1 with μ=2, σ=1; (c) for 0.5 ≤ τ < 1 with μ=0, σ=2; (d) for 0.5 ≤ τ < 1 with μ=0, σ=0 112 1 Chapter 1 Introduction 1.1 Introduction Rates of overweight and obesity in children have nearly doubled over the past two decades (between 1973 and 1994) in the United States. In addition, increasing prevalence of childhood obesity has become a globally pervasive problem since similar or worse trends have been reported in numerous industrialized and developing countries (Deckelbaum and Williams, 2001; Fiorino and Brooks, 2009). Childhood overweight or obesity status has been proven to increase the risk of obesity during adulthood, as well as the risk of cancer (such as endometrial cancer, postmenopausal breast cancer, rectum cancer, and colon cancer), type-2 diabetes, and cardiovascular disease (Calle and Thun, 2004; Dietz and Robinson, 2005). While most studies in this area have been based on the cross sectional design, those that deal with longitudinal assessments at various levels of aggregation have used multilevel growth curve models to assess the effects of health risk factors (Environmental tobacco use) and built environment factors (including food access around home, traffic density around residence, and proximity to parks and recreation resources) on attained levels and rates of changes of BMI based on longitudinal data from the Children Health Study (CHS) (Jerrett et al, 2010; Wolch et al, 2011; Dunton et al, 2012). Other commonly used approaches on obesity researches are applicable for various types of outcomes, such as continuous BMI, BMI z-score or binary indicator of obesity/overweight. Logistic regression (Seach, 2010; Sonneville et al, 2011; Wen et al, 2012) and generalized estimating equations (GEE) (Wu et al, 2011) are conventionally applied for modeling the dichotomized BMI indicator while regression analysis, GEE, and multilevel modeling are traditionally executed for modeling continuous BMI (Jerrett et al, 2010; Xie et al, 2010; Wolch et al, 2011; Dunton et al, 2012) or BMI z-score (Branum et al, 2011; 2 Sonneville et al, 2011; Rundle et al, 2012) with risk factors. These models are set up to assess the effect of risk factors on either the expected value of BMI/BMI z-score or probability of obesity. Hence, there is no guarantee that the estimated effects of risk factors will consistently represent the size and nature of these effects on the whole distribution of BMI, especially, on the extreme tails of the BMI distribution. Due to limited assessments of the covariate effects on the extremes of distribution of the outcome, we are forced to assume that the effects of health risk factors on attained levels and growth rates of BMI are constant for all children regardless of their weight status. Therefore, several important questions cannot be appropriately and directly investigated by the conventional multilevel growth curve modeling approach. The first question deals with whether the difference of BMI between overweight/obese children and children with normal weights is associated with exposure of health risk factors. The second question deals with whether the effects of health covariates uniformly affect the children regardless of their weight status? Fortunately, we could potentially use the concepts of quantile regression, introduced by Koenker and Bassett (1978), to develop methodologies that could statistically resolve these two questions. Quantile regression is an analytic approach that provides a more complete picture of the relationship between covariates effects and a response variable by estimating the conditional quantile functions of an outcome, i.e., the quantiles of the conditional distribution of a response variable are expressed as functions of explanatory covariates. 1.1.1 Motivation 1: Is the difference of BMI between overweight/obese children and children with normal weights associated with health risks? To answer this question, we first must know the reference BMI values of overweight/obese children such that we would be able to compute the outcome of interest, the 3 deviation of individual BMI from the referenced BMI of overweight/obese children. According to the definition for weight classification suggested by Centers for Disease Control and Prevention (CDC), a child or adolescent is considered overweight if his/her BMI is between the 85 th and the 95 th percentiles for children of the same gender and age while obesity is defined as BMI level at or above the 95 th percentile. The commonly used reference BMI values of overweight/obese children are from the revised growth chart of BMI, adjusted for age and gender, provided by CDC in 2000. For example, the reference BMI cutoff values of being overweight and obese for a 2-yr-old boy are 18.16 and 19.34, respectively. The CDC growth charts were generated based on a modified version of the LMS method (Cole, 1990; Kuczmarski et al, 2002). The main concept of the LMS method, extended from the Box-Cox power transformation, is mapping the skewed BMI measurements (or other anthropometric measurements for children) to their standardized BMI z scores (as well as the BMI percentiles), and vice versa. Furthermore, because the distribution of BMI in children changes according to age and gender, this LMS method summarizes the changing distribution of a measurement by three smoothed curves representing skewness (parameterized by lambda), median, and coefficient of variation (parameterized by sigma). Nevertheless, there is one fundamental limitation associated with using the reference BMI values from the CDC growth charts. When the environmental and ethnic backgrounds of the target population are prominently different from those of the reference population of the CDC BMI growth chart, the relevance and accuracy of CDC growth charts could be diminished (Carey et al, 2004). The CDC growth charts are based on five cross-sectional, nationally representative health examination surveys conducted between 1963 and 1994 while, for instance, the CHS data is a longitudinal, multi-ethnic data collected from 1993 onwards. Consequently, applying the CDC BMI growth charts to children 4 in the CHS would be problematic since the reference BMI values of CDC growth chart may not be suitable for the race/ethnicity composition in the CHS. One may use the LMS method on CHS data to generate BMI reference values at various percentiles. However, there are three issues in applying the LMS method used by CDC. First, the LMS method used by CDC may be only applicable to cross-sectional data because it was cross-sectionally derived and only relevant for single measurements (Cole, 1994). Although Cole (1994) proposed a LMS method for longitudinal data that generates the conditional reference charts and advocated the importance of adequately reflecting the longitudinal aspect (the dependencies of observations on the same subject were modeled by auto-regressive concepts) when quantifying growth status, this method requires impractical condition of regular spacing of the measurement times. Second, the LMS method is not conducive to incorporating additional covariates that offer additional crucial information, due to sparsity of data in most applications. To apply the LMS method for generating the race-, age-, and gender-specific reference BMI values, the first step is to group data according to the status of race, age, and gender. Thus, it is often difficult to maintain the sample size for each grouped data while incorporating more covariates. Third, the key assumption and purpose of the LMS method is that the skewed data would be approximately normally distributed after appropriate transformation, whereas the ability of LMS method to achieve its promised normality over the full range of relevant ages is questionable (Wei et al, 2006). Some background details of this modified LMS method used for generating CDC growth charts are described in Chapter 2. To overcome the above discussed disadvantages of the LMS method, quantile regression could be a viable alternative to generate the reference BMI values at 85 th (overweight) and 95 th (obese) percentiles since it is relatively easier to incorporate new covariates and the method does 5 not rely on the assumption of normality. Thereafter, one could study the association between health risks and the deviation of subjects’ BMI from the reference BMI values of overweight/obese children. 1.1.2 Motivation 2: Are the effects of health risk factors on overweight/obese children the same with those on children with normal weights? In contrast to the conventional regression approaches that are only able to model the association between explanatory variables and the expectation of a continuous outcome, Koenker and Hallock (2001) illustrated how the effects of covariates (such as prenatal medical care) vary on different quantiles of the distribution of the birth-weights (outcome) with quantile regression. By using this new approach, they did detect that the effects of covariates are not uniform across the whole distribution of birth-weights; moreover, the sizes of covariate effects (such as mother’s weight gain) on infants with low birth-weight (lower tail of distribution of birth-weights) is substantially different from that on infants with normal birth-weight (>2500g). Thus, extending from the conventional multilevel modeling approach used for previous studies on association between health risk factors and continuous BMI (Jerrett et al, 2010; Xie et al, 2010; Wolch et al, 2011; Duntion st al, 2012), it is natural to consider developing the quantile regression version of multilevel modeling approaches that (1) are capable of directly modeling the effects of risk factors on overweight/obese children (upper tail of BMI distribution), (2) are able to characterize the heterogeneity among subjects, and (3) could account for cross-level confounding and interactions. In order to study the effects of the risk factors (at several levels of aggregation) on the attained levels and rates of change of BMI at upper tails of the BMI distribution (i.e., health effects on overweight/obese children) in a multilevel setting, some challenges need to be 6 addressed. The first challenge is that of dealing with the correlation between the observations of the same subject in longitudinal data. Ignoring the within subject variation due to the repeated measurements on the same subject would lead to improper inference (Edwards, 2000; Wu 2010). Several quantile regression approaches have been introduced to cope with correlation between observations on the same subject via simple location-shift models, including the quantile regression with penalized fixed subject-specific effects (Koenker 2004) and quantile regression with random subject-specific effects (Geraci and Bottai, 2007). These methods account for the dependence of serial measurements on the same individual via addition of subject-specific intercepts. However, because the correlation between observations on the same subject is too complicated to be characterized by a simple location-shift model (a model including subject- specific intercepts that assumes multiple observations on the same subject to be independent, conditional on a subject specific intercept), a quantile regression approach that has ability to handle a complex correlation structure is desirable. A detailed discussion on such an approach is presented in Chapter 3. The second challenge is of dealing with a multi-level structure. To our knowledge, no approach has been proposed to dealing with multilevel structured data in the quantile regression paradigm. The conventional mixed effects modeling approach has the ability to examine the effects of cross-level risk factors directly on the expectation of BMI by adding random effects from different aggregated levels. In addition, this mixed-effects modeling approach properly accounts for cross-level confounding and interaction. The third challenge is to characterize the non-linear trajectories of BMI. In conventional mixed effects modeling approach utilizes the spline technique to describe the non-linear growth curves of BMI in children and adolescents (Jerrett et al, 2010; Wolch et al, 2011). In quantile 7 regression, the smoothing spline technique with roughness penalty, introduced by Cole (1994), is proposed to describe the non-linear association between an outcome and explanatory variables. Thus, quantile regression with incorporation of the spline technique that has ability to longitudinally handle the non-linear trajectories of BMI is desirable. In this dissertation, we propose and implement two multi-level quantile regression techniques for longitudinal modeling of childhood obesity. In the first, the deviations of individual subjects’ BMI trajectories from the appropriate quantile curves are used as outcomes, and conventional multi-level mixed-effects model are applied to assess the effects of health risk factors on individual BMI trajectories relative to reference quantile curves. In the second, a Bayesian multilevel quantile regression approach will be developed to allow for assessing the effects of risk factors directly on the quantile of BMI distribution while accounting for longitudinal and multi-level structure of the data. To do this, the existing quantile regression approach is extended to handle longitudinal data with hierarchical structure under the Bayesian framework. Note that, while our motivating example and primary application is for the CHS, the methodologies are applicable to any study with a similar multi-level study design. 1.2 The Southern California Children’s Health Study (CHS) The proposed methods are inspired by an ongoing, longitudinal epidemiologic study with rich prospective data resources to address childhood obesity risk and related health behaviors. The Southern California Children’s Health Study (CHS) is a longitudinal study involving five multi-ethnic cohorts comprising more than 11,000 children from 16 Southern California communities with diverse socio-economic, built-environment and demographic profiles. The study was initiated in 1993 with 3600 school children from three cohorts of 10 th , 7 th , and 4 th graders, commonly referred to as cohorts A, B, and C, respectively. To further enhance the 8 ability to assess effects of long term exposure to air pollution on children’s respiratory health, an additional cohort of 4 th graders (cohort D, N=~2600) was recruited in 1996. A cohort of K-1 graders (cohort E, N= ~5500) was added in 2001 to more closely assess the effect of air pollution on onset of asthma starting even at younger age. These cohorts have extensive information on objective yearly measurements of weight and height, and a wealth of information at the community, school, individual and temporal levels. To assess the effects of the built- environment on childhood obesity, additional data were compiled, via separate funding, to characterize the built-environment around the participants’ homes, neighborhoods, schools, and communities. Children were followed from grade of study entry through high school graduation; hence, each child (of cohorts A, B, C, and D) has up to 8 annual visits / measurements. Also, since children in cohort E had been followed up since 2001, each participant has up to 12 measurements. The anthropometric data of height (cm) and weight (pound) were measured at children’s schools annually, by a trained technician, following a standardized physical examination (Peters et al, 1999). The BMI was calculated based on the measured height and weight. At study entry, a parent/guardian of the recruited child completed a comprehensive questionnaire that provided information of family demographic characteristics, status of chronic disease (such as, history of asthma, wheeze, or allergy), physical activity, indoor sources of air pollutant exposure and household characteristics. The details of CHS data used to illustrate the new methods are described in Chapter 2. 9 1.3 Statistical Methodology Review 1.3.1 Modeling the expectation of a dependent variable There are many regression analysis techniques used to study the relationship between an outcome and one or more risk factors, and to explore the form of this relationship. More specifically, regression analysis can quantify how the typical value of an outcome (often indicating the mean of the outcome) changes due to the change in one risk factor while the other risk factors are held constant. The conventional regression analysis technique estimates the conditional expectation of the outcome on the risk factors, i.e., the mean of the outcome when the risk factors are held fixed at a given level. A large body of methods has been developed to conduct such regression analysis. The methods are classified into 2 groups, parametric and non-parametric methods. The well-known parametric methods, such as linear regression and ordinary least square regression, define the regression function in terms of a finite number of unknown parameters while the non-parametric methods allow the regression function lie in a specified set of (infinite-dimensional) functions. The underlying assumptions of regression analysis are (1) the error is a random variable with mean zero conditional on independent variables; (2) the independent variables are linearly independent to each other; (3) the errors for each individual are uncorrelated; (4) the error is homoscedastic. These assumptions imply that the estimators of regression parameters are unbiased, consistent, and efficient. The ordinary linear regression models the expected value of a given unknown quantity as a linear combination of a set of independent variables. This approach is appropriate to use only when the outcome follows a normal distribution. However, many types of outcomes violate this normality assumption, such as binary outcome, or categorical outcome. Generalized linear 10 models relax the normal assumption by allowing the outcome have an error distribution from the exponential family of distributions, and for an arbitrary link function of the outcome to vary linearly with the predicted values (rather than assuming that the response itself must vary linearly). These methods only provide a regression function that characterizes the relationship between the averages of the distributions with a set of risk factors. However, when the assumptions listed above are violated, it becomes difficult to properly assess the association between risk factors and any part of the distribution of the outcome. It, therefore, indicates that approaches for modeling the expectation of an outcome might give a rather incomplete picture because “Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a corresponding incomplete picture for a set of distributions.” (Mosteller and Tukey (1977, p. 266)) 1.3.2 Modeling the quantile of dependent variable for independent data Quantile regression was initially introduced by Koenker and Bassett in 1978. Quantile regression is a type of regression analysis used to estimate the conditional quantile of the response variable. On the other hand, while the least squares approach results in estimates that approximate the conditional mean of the response conditional on the predictors. Since the 1990s, quantile regression has been widely used in empirical economics fields based on the concepts of “going beyond models for the conditional mean.” Chamberlain (1994) found that the union wage premium at 10 th percentile of manufacturing workers (28 percent) was over 25 percent higher than that at 90 th percentile (0.3 percent) while the estimated mean union wage premium was 15 percent based the least squares method. Therefore, this finding by Chamberlain just shows that conventional regression may provide misleading information. In biomedical 11 research, one might be more interested in estimating the effects of various factors on the health outcome of subjects under unusually vulnerable situation, for instance, effect of prenatal second hand smoke exposure on the lung function of infants with low birth-weight. Specifically, Koenker and Hallock (2001) observed that the estimates of effects of the health factors on the conditional mean of birth-weights might inconsistently represent the size and nature of these effects on the children with low birth-weights, i.e. children on the lower tail of the birth-weight distribution. To deal with this problem, they used quantile regression to estimate a serial of quantile functions to see a more complete picture of effects of health risk factors. Therefore, quantile regression is desirable for studying how the effects of risk factors might vary on the outcome across percentiles of distribution of the outcome, especially when the errors distribution is heteroscedastic. In fact, some risk factors might be found not to be statistically significantly associated with the expected value of the outcome except in the upper/lower tails of the outcome distribution. Applications of quantile regression have been used in economics (see Koenker and Hallock (2001)); and reference growth charts, where percentile curves are commonly used to identify subjects with abnormal growth; see Wei et al. (2006). The definition of a quantile is such that at the -th quantile of Y, a random variable with distribution function = ≤ , is given by = = : ≥ , where 0 ≤ ≤ 1. The specific quantile of Y could be found by minimizing the expected loss of Y-u with respect to u with defined a loss function as = − < 0 !: min % & ' ( − ) ! = min % − 1 * − ) + % ∞ + min % * − ) + ∞ % This can be solved by 12 - -) & ' ( − ) ! = - -) . − 1 * − ) + % ∞ + * − ) + ∞ % / 0 = 1 − 0 + % ∞ − 0 + ∞ % 0 = 1 − 0 + ) % ∞ − 1 − 0 + ) % ∞ ! 0 = ) − Let the solution be 1 , then we get 1 = . Thus, 1 is -th quantile of Y (population quantile). The -th sample quantile of a random sample, , 3 ,…, 5 , can be obtained by solving the following minimization problem 1 6 = 789min %∈; < = − ) 5 => = 789min %∈; ? − 1 < = − ) @ A B% + < = − ) @ A C% D When one is interested in a -th conditional quantile function, |F = GH , where X is a matrix containing k covariates. Given the distribution function of Y, H can be obtained by solving H = 789 min IJ; K & − GH ! Furthermore, solving the sample analogue gives the estimator of H . H L = 789 min IJ; K < = − G = H 5 => The minimization problem can be reformulated as a linear programming problem 13 min I,%,M ∈; K ×; O PQ 5 ′ ) + 1 − 5 ′ R|GH + ) − R = , where X denotes an n by k regression design matrix and ) = ,R = : = 1,2,…, represents the positive and negative parts of a vector of the residuals. In the R statistical software (Quantile Regression in R: A Vignette), the rq function in the quantreg package is used to fit quantile functions via solving the linear programming problem described above based on various algorithms. These algorithms include the simplex algorithm (exterior point method) (Koenker and d’Orey, 1987), the Frisch-Newton algorithm (Interior point method) (Portnoy and Koenker, 1997), and the preprocessing algorithm for the Frisch Newton algorithm (Portnoy and Koenker, 1997). The choice of an algorithm depends on the sample size. Suggested by Hu and Wei (2005), the simplex algorithm is suitable for data with less than 5000 observations and less than 50 variables; interior point method is computationally efficient for large data set; and interior point method with preprocessing algorithm is appropriate for dealing with extremely large data (n > 100000). In addition, the standard errors of parameters in quantile functions can be estimated by methods available in the summary.rq function (in quantreg package), including inverting a default rank test when sample size is smaller than 1000 observations (argument name in summary.rq function: se=“rank”, Koenker, 1994), asymptotic covariance matrix based on assumption of iid errors (“iid”, Koenker and Bassett, 1978), presumes local (in ) linearity (in explanatory variable(s)) of the conditional quantile functions and computes a Huber sandwich estimate using a local estimate of the sparsity (“nid”), kernel estimate of the sandwich (“ker”, Newey et al, 1990), and the resampling method (“boot”, Bilias et al, 2000). 14 1.3.3 Quantile regression applied to longitudinal data Several quantile regression methods have been proposed for modeling longitudinal data. Koenker (2004) considered a quantile regression approach employing T regulation methods. This approach accounts for the dependence between serial measurements on the same subjects by adding a “large” number of subject-specific fixed effects. The inclusion of a large number of subject fixed effects can induce inflation of variability of estimated effects of other covariates. To overcome this inflation effect, Koenker employed T regulation methods to shrink the over- parameterized individual fixed effects. However, optimizing the performance of this approach requires a suitable choice for the penalty parameter. There are currently no theoretically or practically accepted guidelines on how to choose the optimal level of shrinkage (Geraci and Bottai, 2007). In addition, this approach is computationally inefficient under some circumstances such as when dealing with complex models, or with large sample sizes. Geraci and Bottai (2007) proposed a likelihood-based quantile regression that accounts for the correlation within a subject by adding subject-level random effects. The main features of this approach are: (1) the assumption that the outcome follows an asymmetric Laplace distribution so that the maximum-likelihood estimation is equivalent to the minimization of the objective function of conventional quantile regression, sum of absolute residuals; and (2) this method automatically reaches a nearly optimal degree of shrinkage of subject effects, while the quantile regression with penalized fixed effects requires a suitable choice for the penalty parameter to regularize the subject effects. This approach estimates the quantile function via Monte Carlo EM algorithm (MCEM) due to the unobservable random effects and lack of a closed-form solution for maximizing the likelihood function. The disadvantages of this iterative approach are: (1) it would suffer from computational burden when the model is complicated; (2) 15 the choices of initial values for the MCEM algorithm would heavily influence on convergence speed and accuracy of estimates. One potential pitfall is that the MCEM method with inappropriate initial values could converge to a local maximum rather than global maximum. 1.3.4 Bayesian quantile regression Since the Bayesian approach enables obtaining the entire posterior distributions of parameters of interests by Markov chain Monte Carlo (MCMC) methods, Yu and Moyeed (2001) introduced Bayesian quantile regression. They assume that the conditional quantile of the outcome follows the asymmetric Laplace distribution (ALD) regardless of the original distribution of the data. One important verified fact of this assumption is that the minimization of the loss function of the conditional quantile regression is exactly equivalent to the maximization of the likelihood function of ALD (Yu and Moyeed, 2001; Geraci and Bottai, 2007). An important feature of their approach is that the resulting posterior densities of parameters of interests would be proper even if improper prior distributions are assigned. This Bayesian approach uses an MCMC algorithm with a random-walk Metropolis algorithm to extract the posterior distribution of unknown parameters because the standard conjugate prior distribution is not available for quantile regression based on the ALD. Because the posteriors of parameters of interests are not analytically tractable when assuming ALD for error terms, Kozumi and Kobayashi (2011) developed a mixture representation of an ALD so that the conditional posterior distributions are tractable and the existing estimating procedures are significantly simplified. Kozumi and Kobayashi (2011) proved that an ALD random variable can be represented as a location-scale mixture of exponential and normal distributions. Furthermore, the resulting Gibbs sampler is 16 accomplishable by sampling from either the normal or the generalized inverse Gaussian distribution. 1.4 Proposed Methodology Two methods are proposed in this dissertation with the aim of solving the two major questions described in section 1.1: 1.4.1 Quantile regression based multi-level mixed-effects modeling The first method is motivated by a desire to assess the effects of health risk factors (at various level of aggregation) on individual BMI trajectories relative to the reference quantile curves of overweight/obese children. In doing this, the attained BMI is subtracted for the reference BMI at the 85 th percentile or 95 th percentile (for overweight or obesity cutoffs, respectively) as an outcome to fit an ordinary multi-level model. Thus, we would be able to study the association between health risk factors and the deviations of individual subjects’ BMI trajectories from reference percentile curves of overweight/obese children. This approach allows for the examination by how much the effects of risk factors deviate the attained BMI level from the reference cutoffs; essentially, creating a hybrid outcome for the binary indicator of overweight or obesity and the attained BMI levels. 1.4.2 Bayesian multilevel quantile regression for longitudinal data The second method is motivated by a desire to assess the health effects of risk factors (from various levels of aggregation) directly on the quantile of the BMI distribution while considering the longitudinal aspect and the multi-level structure in the data. To do this, we extend the Bayesian quantile regression approach by combining it with the ordinary mixed- effects modeling approach so that our approach allows for :(1) the possibilities that correlation 17 between the observations on the same subject is characterized via various types of subject-level random effects ; (2) the non-linear trajectories of a given BMI percentile could be accommodated by Spline terms of age, such as B-spline or penalized B-spline; (3) the ability to assess the effects of cross-level risk factors directly on the quantile of distribution of outcome; and (4) the ability to properly account for cross-level confounding and interaction. The remainder of this dissertation is organized as follows: in Chapter 2, quantile regression is applied to get the reference BMI values of overweight children, i.e. 85%ile of BMI distribution. Furthermore, the deviated BMI (i.e., BMI measures subtracted from the 85 th percentile BMI reference values are used as outcome). Specifically, based on CHS data, we use the conventional mixed-effect modeling approach to examine the effects of risk factors on the deviation of a subject’s BMI trajectory from the 85 th percentile of the reference BMI curve. In Chapter 3, we present an integrated Bayesian framework of multilevel quantile regression of overweight children as a function of health risk factors, socio-economic factors, and ecological factors. In addition, we conduct simulation studies to examine the finite sample properties of the proposed method is workable and preferable approach in terms of relative bias and computational efficiency compared to other existing methods. In Chapter 4, we discuss the main utilities of the proposed approach through an application of the proposed Bayesian multilevel quantile regression on CHS data and we also outline areas for future work 18 Chapter 2 Quantile Regression Based Multi-Level Mixed-Effects Model 2.1 Introduction In this chapter, we discuss a proposed approach that examines the association between risk factors and deviations of individual subjects’ BMI trajectories from reference percentile curves, at specific %iles such as the 85 th for overweight, in a “two-step” quantile regression based multi-level mixed-effects modeling structure. Since BMI, an inexpensive and simple measurement calculated based on height and weight, is highly correlated with body fat mass, BMI is a useful predictor of future adiposity, morbidity, and mortality (Field et al, 2003; Daniels, 2009). In contrast, other measures such as the BMI z-scores are not widely used in clinical practice due to difficulty experienced by patients (or study participants) in fully understanding their meaning (Daniels, 2009). Previous studies applied mixed-effects modeling to identify risk factors (including traffic density, food access, proximity of parks and recreation, and second hand smoke) that could be associated with changes in attained level of BMI and rate of change of BMI (Jerrett et al, 2014; Jerrett et al, 2010; Xie et al, 2010; Wolch et al, 2011; Dunton at al, 2012). These studies concluded that the risk factors positively associated with BMI increments would have implications of increasing the probability of being overweight or obese. However, under the conventional mixed effects modeling approach that treats BMI as an outcome, it is impossible to examine the magnitude of the risk factors’ effect on the deviation of the attained BMI values from the reference BMI cutoff values for overweight status. To properly examine the association between risk factors and the deviation of BMI from a reference quantile cut-off (e.g., individuals’ BMI at a given age for a given sex, subtracted 19 from the corresponding reference BMI values of overweight), the reference BMI values of being overweight is required before the analysis. The most widely used BMI reference values for classifying overweight/obesity are from the growth charts of BMI by the US Center for Disease Control and Prevention (CDC). In practice, as suggested by CDC, a child is defined as overweight if the BMI is between the 85 th and 95 th percentile of BMI distribution of children with same age and sex while obesity is defined as being above the 95 th percentile. When using the 85 th percentile BMI reference point from the CDC growth charts for BMI, we need to be cautious because we are comparing children with reference values from data nationally collected between 1963 and 1994, which may not be representative of the ethnic distribution of children in studies such as the CHS (see details in section 2.2.1). Carey et al (2004) pointed out that using external growth charts could be questionable when the environmental and ethnic backgrounds of the target population are markedly different from those of the reference population. Hence, Carey et al (2004) applied the LMS method, an approach similar to that used to generate the CDC growth charts, to construct reference growth charts to identify HIV-infected children with somatic growth failure based on their own data representing the population of HIV-infected children. The LMS method assumes that potentially skewed data, such as children’s BMI, would be normally distributed across the full range of relevant covariate (e.g. age) after appropriate transformation. Conversely, Wei et al (2006) illustrated that using quantile regression to construct the growth charts is preferable not only when the characteristics of a targeted population is different from those of the reference population (for instance, differences in ethnic diversity or physical condition (HIV-infected infants)) but also when the normality assumption is not attainable even after using various approaches of data transformation. In addition, another feature of quantile regression is that it is easier to incorporate additional 20 covariates of interests, compared to LMS method. Because of potential ethnic differences in BMI trajectories in children (Yates et al, 2004), as for example in the motivating example of the multi-ethnic Children’s Health Study, using the age- and gender-adjusted BMI reference values at 85 th percentiles from CDC growth charts may not be valid for all ethnicities. Hence, instead of using age- and gender-adjusted BMI values of CDC growth chart, quantile regression is utilized to compute the age-, gender-, and ethnicity-specific BMI at 85 th percentile of the BMI distribution based on data from Children’s Health Study (CHS). Because of the increasing overall BMI levels in children due to the obesity epidemic in the last two decades, we use only cross sectional data from the CHS at 1993 to create our reference growth curves. The Children’s Health Study (CHS) is a long running prospective cohort study of Southern California children that has longitudinal yearly objective measurements of height and weight of children covering 5- 18 years of age, and covers a wide range of exposure levels to ambient air pollution and other built environment factors. In the proposed model, instead of using BMI as outcome, we use the quantitative deviation of BMI from appropriate reference at a chosen quantile as an outcome. For illustrative purposes, we focus on the 85 th percentile of BMI, a cutoff point for overweight status. Thus, this outcome could be interpreted as deviations of individual subjects’ BMI level from the 85 th percentile (minimal BMI value of being overweight) of the children population. This approach contrasts with a similar approach that creates a binary indicator based on similar cutoff values, essentially creating a hybrid outcome that combine features of the continuous attained BMI and binary data based on age-, gender-, and ethnicity specific cutoff for overweight status. The advantage of the proposed approach is in that it concretely provides the magnitude of deviation from the reference curves instead of overly relying on an arbitrary cutoff value. 21 After computing the deviated BMI outcome, the flexible multilevel mixed effects modeling approach could be utilized to examine the effects of risk factors on changing the deviated BMI longitudinally on data from the motivating CHS data. Until now, the flexible multi-level modeling approach was applied to CHS longitudinal data to examine the association between environmental and/or genetic risk factors on level or rate of growth of BMI (Jerrett et al, 2014; Jerrett et al, 2010; Wolch et al, 2011) or lung function (Gauderman et al, 2004; Gauderman et al, 2007), after fully characterizing the non-linear growth trajectories via spline functions. This novel approach allowed us to study the cumulated health effects not only on the level of BMI/lung function at some attained age, but also on the growth rate of BMI/lung function. Hence, by using deviated BMI as an outcome, the proposed model on deviations would allow us to focus on the extremes of the outcome distribution. In this chapter, we illustrate the modeling procedure by using second hand smoke (SHS) as a health risk factor of interest. Some past studies, using traditional modeling approaches, indicated that SHS exposure is associated with increasing BMI or BMI z-score (Kowk et al, 2010; Raum et al, 2011). Here, since the SHS exposure has been shown to be positively associated with BMI, we hypothesize that SHS exposure would increase the deviations of individual subjects’ BMI level from the 85 th percentile of the reference population of children. This is tested by using the proposed quantile regression based multi-level modeling approach. To review this approach, the reference value of BMI 85 th percentile is computed based on quantile regression; in next step, we treated the BMI subtracted the BMI 85 th reference percentile as an outcome to fit an ordinary multi-level model as described previously (Berhane and Molitor, 2008) which allowed us to test for confounding or interactions across various levels of aggregation based on the multi-level structured CHS data. Hence, this analysis will provide a 22 novel approach with focus on the effects of health factors on increasing/decreasing of the deviations of individual subjects’ BMI level from the BMI at 85 th percentile of the reference population of children. 2.2 Data To examine the effects of risk factors on the deviation of BMI away from the 85%ile BMI reference value (observed BMI – BMI 85%ile, noted as ∆BMI.p85), estimating reference BMI 85 th percentile is required before the analysis. Two data sets are used for computing BMI 85 th percentile reference values: (1) data from CDC growth charts; (2) a cross-sectional baseline data set from the three cohorts in children health study (CHS) recruited in 1993. Subsequently, longitudinal CHS data from the two 4 th grade cohorts are used to test the association between risk factors and BMI deviation (∆BMI.p85). The details of the CHS data, including inclusion and exclusion criteria, are described below. 2.2.1 Computation of BMI at 85%ile National data used for generating the CDC BMI growth Charts The revised growth charts for children’s BMI provided by CDC in 2000 is based on five cross-sectional national representative health examination surveys, including national health examination survey II (NHES II) (National Center for Health Statistics, 1967), NHES III (National Center for Health Statistics, 1994), national health and nutrition examination survey I (NHANES I) (National Center for Health Statistics, 1973), NHANES II (National Center for Health Statistics, 1981), and NHANES III (National Center for Health Statistics, 1994). These surveys were conducted during 1963 and 1994. Furthermore, the age range of the surveys, used to generate the BMI growth charts, is between 2 years and 20 years while the observations aged 23 from birth to 2 years are from supplemental data described elsewhere (Kuczmarski et al, 2002). Observations were excluded if the information was from NHANES III with age equal to 6 years or older because including those observations made the overweight criteria based on BMI percentile under-classified (Kuczmarski et al, 2002). Regarding anthropometric data, information on BMI and age was collected at the time of examination. Furthermore, the age of the subjects was grouped by single month of age. Each month of age was rounded to the nearest completed month. For example, 12 months of age referred to subjects aged between 12.0 and 12.9 month. All Surveys consisted of a home interview and a standardized physical examination performed in a mobile examination center (Kuczmarski et al, 2002). Details on the design, sampling algorithm, subject selection and assessment of anthropometric data are reported elsewhere (Kuczmarski et al, 2002). Children’s Health Study data used for generating BMI values at 85 th percentile The children’s health study (CHS) cohorts are described elsewhere (Peters et al, 1999). Three cohorts (A, B, and C) of CHS were used for calculating the internal BMI 85%ile values. At study entry (1993-1994), the children in these cohorts were 4 th graders, 7 th graders, and 10 th graders, respectively. Height (cm) and weight (pound) were measured at children’s schools annually, by a trained technician, following standardized procedures for physical examination. BMI was calculated based on the measured height and weight. At study entry, a parent/guardian of the participating child completed a detailed questionnaire that provided information of family demographic characteristics, history of respiratory diseases (such as asthma history), physical activity, and household characteristics. All the parents/guardians of participating children provided written informed consent; in addition, the study protocol was approved by the Institutional Review Board of the University of Southern California. 24 While the CHS is a longitudinal study, to make the data structure similar to the cross- sectionally designed surveys that led to the CDC growth charts and to avoid the relatively high BMI values in later years, we only used the first observation (baseline measurement) of the serial annul measurements of the same subject in calculating the reference cutoff values. However, the cross-sectional data covered the entire childhood age large because they were recruited from 4 th , 7 th , and 10 th grades. Furthermore, eligible observations needed to meet two additional inclusion criteria; (1) the age of the first observation was between 9.5 years and 16 years (observations with age less than 9.5 yrs or larger than 16 yrs were excluded due to small sample size, <450 observations), and (2) children were either Non-Hispanic White or Hispanic (other races/ethnicities were not included because of small sample sizes). In total, there were 2704 children included for computing the internal BMI 85%ile reference values. There were slightly more female children in this study (1391 (51.4%) vs. 1313 (48.6%), Table 2.1). In general, the distribution of cohorts, entry years, and ethnicities were similar in both genders. Specifically, the majority of subjects were from Cohort C (4 th grader, 55.8% in Females and 60.9% in Males, Table 2.1). The observations taken in 1993 between two genders were about the same (about 94%, table 2.1). Moreover, more non-Hispanic children were included in this study compared to Hispanic children (67.1% and 71.3%, table 2.1) in both gender. 25 Table 2.1 Descriptive statistics of CHS data used for computing internal BMI 85%ile Female (N=1391, 51.4%) Male (N=1313, 48.6%) N (%) 1 N (%) 1 Cohort A 189 (13.6%) 152 (11.6%) B 426 (30.6%) 361 (27.5%) C 776 (55.8%) 800 (60.9%) Entry Year 2 1993 1311 (94.4%) 1243 (94.7%) 1994 78 (5.6%) 69 (5.3%) Ethnicity Hispanic 458 (32.9%) 377 (28.7%) Non-Hisp White 933 (67.1%) 936 (71.3%) 1. Percentage within gender. 2. Due to the missing values in Entry year, the sample size varies. 2.2.2 Analysis data set for quantile regression based multi-level modeling Children health data used for examining the effects of health factors on BMI deviation from overweight BMI reference value (BMI 85%ile) The details of study design, site selection, subject inclusion and assessment of health factors and anthropometric measurements are published elsewhere (Peters et al., 1999). In summary, two cohorts of CHS, recruited in 1993 and 1996, were used for testing the association of health factors and BMI deviated from that of Overweight cut-off values (∆BMI.p85). The children in these two cohorts were 4 th graders at study entry located in 11 communities of Southern California. These children were followed from study entry up to high school graduation; and each child had up to 8 annual visits/measurements. The anthropometric data of height (cm) and weight (pound) were measured at children’s schools annually, by a trained technician, following a standardized physical examination. BMI was calculated based on the measured height and weight. At study entry, a parent/guardian of the recruited child completed a comprehensive questionnaire that provided information of family demographic characteristics, 26 status of chronic disease (such as, history of asthma, wheeze, or allergy), physical activity, indoor sources of air pollutants exposure and household characteristics. All the parents or guardians of participating children provided written informed consent. The study protocol was approved by the Institutional Review Board of the University of Southern California. Inclusion and exclusion criteria Since this is a longitudinal analysis focus on growth, it was restricted to children with at least two observations. Meanwhile, because the BMI 85%ile reference values were only generated to Hispanic and non-Hispanic White aged between 9.5 yrs and 16 yrs, eligible observations of children for this analysis were restricted to those two race/ethnicities with age between 9.5 yrs and 16 yrs (N=2540, table 2.2). In addition, only children with home addresses geo-coded, using the Teleatlas geocoding database to the corresponding road network, accurately were included in the analysis (Jerrett et al, 2010). An address is accurately geo-coded if address is located on the correct side of the street with their relative position between cross-streets determined by linear interpolation of residence number between the nearest intersections. The detailed definition of geo-coded accuracy was reported elsewhere (McConnell et al, 2006). Risk factor of interest Second hand smoke: The information of maternal smoking exposure during pregnancy and child’s history of second hand smoke exposure was provided by a parent or guardian at baseline and each year during the follow-up period. Potential Confounders and Effect Modifiers NOx: The traffic related pollutant exposure at a participants’ residence was characterized using the information on traffic volume, wind speed and wind direction in the geographic information system. A line source dispersion model (Benson) was applied to estimate the 27 residential traffic-related air pollutants from local freeway and non-freeway sources, including oxides of nitrogen (McConnell et al, 2006). Moreover, this line source dispersion model estimated the incremental contribution of nitrogen oxides (NOx) above the regional background level (McConnell et al, 2006; McConnell et al, 2010). Nevertheless, this modeled NOx reflects the mixture of multiple pollutants from nearby traffic; therefore, it represents a high correlation of pollutants in the mixture of traffic related pollutants not just a specific pollutant, as described in the Supplemental Material (doi:10.1289/ehp.0901232 via http://dx.doi.org/). Built environment measurements: The built-environmental variables were previously reported as potential confounders associated with incremental/decremented BMI (Jerrett et al, 2014; Jerrett et al, 2010; Wolch et al, 2011). In addition, the built environment variables were organized in a geographic information system (GIS) through a classic Lynch framework as described previously (Jerrett et al, 2010). These variables included estimated population density, walkability represented by street connectivity, available parks and recreation facilities (unit: acres), green space based on normalized density vegetation index, access to food outlets, and neighborhood socio-economic characteristics including unemployment rate (in town level and census block level) and violent crime rate (in town level). Population density: Population density was presented by the number of people within the 500 m buffer (Jerrett et al, 2010) Street connectivity: The street connectivity was assessed using the length of streets and street network patterns in digital and network files. We used gamma index (within 500m road network buffer around the house) to represent the connectivity since it is the most commonly used indicator in urban planning literatures index. Gamma index is a measure of connectivity in a network which is a ratio of actual number of edges (e) with maximum possible number of 28 edges for the vertices (v) in that network (eq. 2.1). This index ranges from 0 (no connections between nodes) to 1.0 (the maximum number of connections, with direct links between all the vertices). In other words, if a house is located in an area with developed/ efficient transport network, the value of gamma index should be close to 1 (Figure 2.1). Gamma index = U VM3 (eq. 2.1) Figure 2.1 Gamma Index Based on 500 m Buffer around Subject’s House Parks and recreation: Park space (in acres) was measured within 500m buffer of children’s home, a distance that provides easy access (10–15 min of travel time) for children going to a park on foot or bike (Dill, 2004; Wolch et al., 2005). Only the area within the buffer distance was included. While small parks tend to be used more often by children, both small and large parks were found to have similar frequency of usage among adolescents (Grow et al., 2008). Parents may provide more car transportation for their adolescents to team sports or other structured activities at sports facilities such as facilities in large parks, which may be farther from home. 29 The definition of park land use was based on four classification systems, including Southern California Association of Governments, the San Diego Association of Governments, the Santa Barbara Assessor’s office, and the City of Atascadero, as reported elsewhere (Wolch et al, 2011). Normalized difference vegetation index (NDVI): NDVI was measured within 500 m buffer of participant’s home. This NDVI, range from -1 to 1, was derived from the 2001 Lansat Enhanced Thematic Mapper Plus (ETM+) data with a resolution of 30 m. The higher NDVI means greater coverage of vegetation (Wolch et al, 2011); furthermore, higher neighborhood greenness was associated with lower BMI among adults (Tilt et al, 2007) and children (Bell et al, 2008). Crime rate: The town level crime rate data were downloaded from the Bureau of Justice Statistics (http://www.ojp.usdoj.gov/bjs). Because the participating children were recruited in 1993 and 1996, the annual crime data in year 1992, 1993, 1995, and 1996 were used for the analysis. Rom all the available crime variables, only violent crime rate was used to examine the confounding effect of second hand smoke associated with ∆BMI.p85 since violent crime rate was reported as a potential confounders associated with elevated levels of BMI (Doyle et al, 2006). Moreover, the annual average of town level violent crime rate based on data from year 92, 93, 95, and 96 was used due to the small variation of violent crime rate across years of the CHS observation period. Traffic density exposure estimates: Traffic exposure variables were based on the functional classification data in year 2000 from California Department of Transportation. Additionally, the annual average daily traffic (AADT) volumes were combined to Teleatlas road network, a road network with high positional accuracy, to improve the less accurate functional 30 classification data (Wu et al, 2005). The approach used in the collection of traffic data was previously described (Jerrett et al, 2010). Jerrett et al (2010) concluded that because the spatial characteristics of traffic density did not change rapidly during the observation period of CHS, the traffic density measurements could be considered as long-term traffic patterns around the participants’ houses. In this study, we focused on AADT within 150m and 300m distances buffer of children’s houses. Table 2.2 summarizes the variables tested in this analysis on the confounding effects and effect modification of built environment influences. Unless otherwise noted, variables were tested within 500m Euclidian buffers. Some covariates at smaller radii, smaller than 500m, were used for this analysis due to the data availability. Table 2.2 Built environment variables used in the data analysis organized under the Lynch framework (Adopted from Table 1, Jerrett et al., 2010) Lynch framework Variable Operationalized 1 Paths Connectivity Gamma index and beta index Nodes Unhealthy food access Healthy food nodes Recreation InfoUSA detailed business data on food retail sources InfoUSA detailed business data on food (over the counter food vendors) InfoUSA detailed business data on grocery stores Location of parks and recreation programs assembled from web audits of public data sources Districts Crime Green cover Air pollution Traffic density Population density Land use mix Block size Crime per 100,000 for various crimes at the community scale from the California Department of Justice Total green space based on land use, and on normalized difference vegetation indices derived from landsat images taken during the wet season Estimated by CALINE dispersion models Annualized average daily traffic from the California functional class data (conflated to the Teleatlas road network) Pop/sq. Mile Homogeneity index similar to Frank et al. 1994 Average block size 1 Variables operationalized in 500 m Euclidean buffers around the subjects homes. For food access, 500 m network buffers were used as well around homes and schools. Other spatial scales tested but not reported. See text for details. 31 All the potential confounders/effect modifiers listed above were time independent, baseline measurements. 2.3 Methods 2.3.1 Computing the BMI 85%ile reference point 2.3.1.1 BMI at 85%ile provided by the CDC growth charts The BMI percentile of CDC growth chart was generated based on a two-stage smoothing processes: curve smoothing stage and transformation stage. In the curve smoothing stage, the locally weighted regression approach (LWR) was used to 10 empirical percentiles (3 rd , 5 th , 10 th , 25 th , 50 th , 75 th , 85 th , 95 th , and 97 th ) of BMI-for-age growth chart. Specifically, for ages at 2 to 12.5 years, the settings of LWR were 5-point smoothing at midpoint of each age interval. For 13 to 20 years, the settings of LWR were 25-point smoothing for boys and 27-point smoothing for girls. In the transformation stage, to remove the skewness of the anthropometric data in younger children and adolescents, the Box-Cox transformation could make the skewed data to approximately follow the normal distribution. Since the three parameters (L: power (degree of skewness), M: mean, and S: coefficient of variation) used in the Box-Cox transformation could change across time (Cole, 1989), a modified LMS method was applied to estimate these three parameters. In the original LMS method proposed by Cole (1989), the three parameters (L, M, and S) could be estimated through penalized likelihood (Cole, 1990). However, the modified LMS method used by CDC used smoothing empirical percentile curves (10 curves) in the first stage. Then, at each age interval, a group of 10 equations for the 10 smoothed empirical percentile curves were generated by specifying LMS transformation equations (eq. 2.3). The estimated L, M, and S parameters were obtained by minimizing the sum of squared errors. The benefit of the modified LMS method is in ensuring that the fit of smoothed curves from the 32 smoothing stage (1 st stage) is close to the empirical data (Kuczmarski et al, 2002). Finally, noting that the assumption of the LMS method is that the data would be normally distributed after transformation, the BMI percentiles in CDC growth chart were adjusted for age and gender. The CDC growth chart of BMI was downloaded from http://www.cdc.gov/growthcharts/data/zscore/bmiagerev.xls W = X1 + YZ [ \ , ≠ 0 = X_ `a , = 0 (eq. 2.3) 2.3.1.2 Using quantile regression to get the BMI at 85%ile Quantile regression (QR) provides another possibility for computing the BMI percentiles for overcoming the potential pitfall that the characteristics of the targeted population are different from those of children in the CDC growth charts (Wei et al, 2006). To generate the BMI 85%ile reference values based on CHS data, we first generated cubic B-spline basis functions (degree of freedom = 3) of age. The estimated BMI percentiles were based on the solution to the criterion of minimizing the absolute residuals as shown in eq. 2.4 (Koenker and Bassett, 1978). H L b,cU,b.ef ,H L g,cU,b.ef = min I h,ij ,I kK,ij lX = − 79_ = !0.85 − lX = − 79_ = < 0 !, (eq. 2.4) where g is the index of gender, e is the index of ethnicity, i is the index of subjects. In eq. 2.4, the 79_ = , eq. 2.5, contains the basis functions of cubic B-spline of age. 79_ = = H bcU + ∑ H g,p K cU q g,VcU 79_ = r g>b , (eq. 2.5) where m is the number of age intervals that have been divided into regions based on the overall age range of the data set. The basis functions, defined recursively, are shown below (eq. 2.6 and eq. 2.7). Equation 2.6 shows the basis functions of degree p = 0 while equation 2.7 shows that 33 higher order of basis functions could be recursively estimated from lower order basis functions. The basis functions of a cubic B-Spline could be generated using the BS function in R. q s,t 79_ = u 1, 79_ s ≤ 79_ < 79_ sv 0,wxℎ_8z{_ , (eq. 2.6) where p=0, and j= 0, 1, ….., m-2. q s,t 79_ = |cU|cU } |cU }O~ |cU } q s,t 79_ + |cU }O~Ok |cU |cU }O~Ok |cU }Ok q sv,t 79_ , (eq. 2.7) where p=1, 2, 3, and j=0, 1, …., m-3-2. Second, since there is evidence (Yates et al, 2004; Freedman et al, 2006) showing that the BMI distributions were different between genders and ethnicities, the estimated BMI 85%ile was adjusted for age, gender, and ethnicity. Finally, to be compatible with the CDC growth charts, the BMI 85%ile reference values were based on the restricted cross-sectional CHS data as described in section 2.2.1. 2.3.2 “2-Step” quantile regression based multi-level approach The proposed “2-Step” multi-level approach allows us to exploit and account for cross- level confounding and interactions; and the set-up of the multi-level model as shown in eq. 2.8a, eq. 2.8b, eq. 2.8c, and eq. 2.8d. In eq. 2.8a, the lX b.ef represents the estimated BMI 85%ile based on quantile regression or CDC growth chart as described in the previous section (section 2.3.1). In eq. 2.8b, the focused outcome, lX =s − lX b.ef , is the residuals at the 85 th percentile when lX b.ef was computed based on the quantile regression. By definition, lX =s is the measured BMI of subject i at visit j while the lX b.ef is the minimal BMI value of becoming overweight for subject i at visit j (also for persons who is with same age, gender, and ethnicity of 34 subject i). Thus, lX =s − lX b.ef could be explained as the deviation of a subject’s BMI trajectory from the BMI reference curve of being overweight (at the 85 th %ile). Furthermore, “lX =s − lX b.ef < 0” means that the BMI of subject i is below 85%ile of the population (individuals formed by same age, gender, and ethnicity). For a case with “lX =s − lX b.ef ≥ 0”, subject i is considered as overweight due to a BMI higher than 85%ile of the population’s (individuals formed by same age, gender, and ethnicity) BMI distribution. The goal of this study is to demonstrate that the proposed approach might be a better tool for examining the association between SHS exposure and deviations of individual subjects’ BMI trajectories from the 85 th percentile curves in a “two-step” quantile regression based multi-level mixed-effects modeling structure. The features of model (Eq. 2.8a-d) (1) provide a “two-step” mixed effects model structure, (2) allow for potential inference on complex nonlinear functionals of interest via a Bayesian approach proposed by Berhane and Molitor (2008), and (3) are designed to allow for proper accounting for cross-level confounding and interactions. Specifically, the linear spline of age (with knots at age 12 and 14) is applied to account for the non-linearity of BMI trajectory due to puberty (Jerrett et al, 2014). In addition, the effects of second hand smoke exposure will be tested on both growth rate of ∆BMI.p85 and the ∆BMI.p85 level at age 16 (cumulative effects of second hand smoke exposure up to age 16). This “two- step” quantile regression based multi-level modeling was done using the rq (level 0) and lme (level 1 to 2b) functions in R. Level 0: BMI b.ef age ˆ‰ ,gender ˆ ,hispanic ˆ !=γ b�‘,ef% +γ �‘,ef% age ˆ‰ (eq. 2.8a) 35 Level 1: “BMI ˆ‰ − BMI b.ef age ˆ‰ ,gender ˆ ,hispanic ˆ !” = α •�,ˆ +α �,ˆ age ˆ‰ − 16 16 − 9.5 +α 3 town + α V cohort + α › hispanic +α f� age ˆ‰ − 12 v +α f� age ˆ‰ − 14 v (eq. 2.8b) Level 2a: α •�,ˆ = β b�,b +β b�, SHS ˆ +ε ˆ (eq. 2.8c) Level 2b: α �,ˆ = β �,b +β �, SHS ˆ +ε 3ˆ (eq. 2.8d) 2.3.3 Testing for gender differences The effect of the variable of interest, SHS exposure, was tested for possible gender differences on the intercept (effect on level at attained age 16) and slope (effect on growth rate) using the likelihood ratio test approach. For the screening purpose, we use the significance level of 0.05). 2.3.4 Confounders A factor would be considered as a confounder if the estimated effect of the variable of interest changed by more than 10% after adjustment for the factor. Moreover, a qualified confounder is a variable confounding the association of second hand smoke exposure and ∆BMI.p85 either on attained level at age 16 or on growth rate of ∆BMI.p85. 2.3.5 Effect modifiers A factor would be considered as an effect modifier if the p-value of interaction effect of the main risk factor and the testing factor is smaller than 0.05. 36 2.4 Results For this study, data from 2540 children were used to examine the association between second hand smoke and ∆BMI.p85. Table 2.3 indicates that the distributions of ethnicities were similar in both genders. In addition, about 6.0% of children were born outside the United States of America. Table 2.3 Descriptive statistics of subject characteristics in the CHS data used for multi-level analysis Subject characteristics: Female (N=1266, %=49.8%) Male (N=1274, %=50.2) N (%) 1 N (%) 1 Cohort - C1 (recruited in 1993) 590 (46.6.06%) 598 (46.9%) - D (recruited in 1996) 676 (53.4%) 676 (53.1%) Ethnicity - Non-Hispanic White 792 (62.6%) 809 (63.5%) - Hispanic 474 (37.4) 465 (36.5%) Foreign Born 2 - No 1178 (94.0%) 1188 (94.1%) - Yes 75 (6.0%) 75 (5.9%) 1. Percentage within gender. 2. Due to the missing values in Entry year, the sample size varies. The estimated BMI 85%ile curves based on quantile regression and the CDC growth charts are shown in Figure 2.1. Since the majority of subjects used for generating CDC’s BMI growth charts are Caucasians, the shape of BMI 85%ile reference curves based on CDC (Figure 2.1(a)) was similar to those based on quantile regression for Non-Hispanic Whites (Figure 2.1(b)). Given that the overweight reference curves for Non-Hispanic Whites based on quantile regression (Figure 2.1(b)) are generally higher than those based on CDC, it might be an evidence shown that the shape of the BMI distribution did not change but shifted to the right (the mean BMI increased for newer generation). We also noted that the reference curves for Hispanic Whites are not only higher than that of Non-Hispanic Whites but also is clearly different in the shapes of the curves (Figure 2.1(b) and (c)). Thus, the weight distribution is different between Hispanic Whites and Non-Hispanic Whites. 37 (a) (b) (c) Figure 2.2 BMI 85%ile reference curves based various approaches (dot line = Females, solid line = Males) 10 11 12 13 14 15 16 18 20 22 24 26 28 30 Age (y r) BM I 85th Percentile BMI based on CDC's BMI Grow th Charts 10 11 12 13 14 15 16 18 20 22 24 26 28 30 Age (y r) BM I non-Hispanic 85th Percentile BMI of Cohort C & D based on B-Spline basis func of cohort A , B, and C 10 11 12 13 14 15 16 18 20 22 24 26 28 30 Age (y r) BM I Hispanic 85th Percentile BMI of Cohort C & D based on B-Spline basis func of cohort A , B, and C 38 At study entry, the average BMI was 18.63 kg/m 2 for males and 18.52 kg/m 2 for females (table 2.4). Furthermore, based on CDC growth charts, in females, the percentage of overweight (BMI %ile ≥ 85) children in the Hispanic group (32.1%) was 10% higher than that in the non- Hispanic White group (21.3%). In males, percentage of overweight children in the Hispanic group was 14% higher than that in the non-Hispanic White group. Hence, based on CDC growth charts, more Hispanic children were identified as overweight compared to non-Hispanic White children for both genders. In contrast, based on the internal BMI 85%ile of CHS data (based on the quantile regression approach), the proportions of overweight male children were about the same in the Hispanic (16.3%) and non-Hispanic White (16.3%) groups. However, the proportion of overweight female children was 3.3% higher in the non-Hispanic White group (16.0%) compared to that in the Hispanic group (12.7%). Therefore, the percentage of overweight children identified by internal BMI 85%ile was smaller than that by CDC growth charts in both ethnic groups (Figure 2.2). In addition, based on the CDC growth charts, more overweight children were identified in the Hispanic than that in the non-Hispanic White group in both genders. Moreover, based on internal BMI 85%ile, more overweight children were identified in female non-Hispanic White compared with that in female Hispanic. However, the percentages of overweight children were about the same among these two ethnicities in males. The results imply that more Hispanic children were classified as overweight based on CDC BMI 85%ile compared with based on internal 85%ile. Knowing that Hispanic children are generally heavier than Caucasian children (i.e., ethnics difference of BMI distribution) and majority of the children used for generating CDC growth charts are Caucasian, using CDC BMI 85%ile may misclassify Hispanic children with normal weight as overweight due to ignoring the ethnic differences. 39 Figure 2.3 Percent of Children Classified as Overweight Based on Various Methods at Age of 9.5 yrs and 16 yrs (QR: Quantile regression; CDC: Growth Chart of BMI; NHW: non- Hispanic White Children; Hisp=Hispanic Children) At age 16, the average BMI were 23.26 and 23.24 in males and females (table 2.4), respectively. When CDC growth charts were used (BMI ≥ BMI 85%ile) to identify overweight children, more Hispanic children were classified as overweight compared to non-Hispanic White children in both genders (females: 36.3% vs. 24.2%; males: 43.2% vs. 26.6%). In contrast, based on the internal BMI 85%ile of CHS data using quantile regression, the percentages of children identified as overweight in both race/ethnicities were about the same for males and females (females: 16.96% vs. 17.11%; males: 22% vs. 18.56%) (Figure 2.2). Comparing the BMI level at entry (age 10) with that at age 16, the average BMI increased by 4.7 units for females and males. In addition, the percentages of children identified 0 5 10 15 20 25 30 35 40 45 Age 9.5 Age 9.5 Age 9.5 Age 9.5 Age 16 Age 16 Age 16 Age 16 QR CDC QR CDC QR CDC QR CDC NHW NHW Hisp Hisp NHW NHW Hisp Hisp % of Children Classified as Overweight Females Males 40 as overweight increased in both race/ethnicities and in both genders no matter which approach was used to classify the weight status. Table 2.4 Descriptive statistics for BMI in CHS data used for multi-level analysis Major outcomes: Female (N=1266) Male (N=1274) N (%) 1 Mean (SD) N (%) 1 Mean (SD) BMI at Entry (average age = 10) 1266 18.52 (3.58) 1274 18.63 (3.64) BMI – BMI 85%ile (CDC) at entry 1266 -1.57 (3.52) 1274 -0.89 (3.63) BMI – BMI 85%ile (QR) at entry 1266 -3.54 (3.55) 1274 -3.18 (3.60) BMI ≥ BMI 85%ile (CDC) at entry: - Non-Hispanic White 169 (21.3%) 200 (24.7%) - Hispanic 152 (32.1%) 180 (38.7%) BMI ≥ BMI 85%ile (QR) at entry: - Non-Hispanic White 127 (16.0%) 132 (16.3%) - Hispanic 60 (12.7%) 76 (16.3%) BMI at age 16 416 2 23.24 (4.68) 406 2 23.26 (4.89) BMI – BMI 85%ile (CDC) at age 16 416 2 -1.26 (4.68) 406 2 -0.73 (4.89) BMI – BMI 85%ile (QR) at age 16 416 2 -3.45 (4.65) 406 2 -3.06 (4.81) BMI ≥ BMI 85%ile (CDC) at age 16: - Non-Hispanic White 60 (24.2%) 71 (26.6%) - Hispanic 61 (36.3%) 60 (43.2%) BMI ≥ BMI 85%ile (QR) at age 16: - Non-Hispanic White 32 (12.9%) 41 (15.4%) - Hispanic 36 (21.4%) 34 (24.5%) 1. Percentage within gender. 2. Due to the missing values in Entry year, the sample size varies. Regarding the prevalence of exposure to second hand smoke at study entry, the proportions of those “ever exposed” to postnatal second hand smoke were about the same in both genders (32.3% and 33.1% for males and females, respectively). In addition, compared with the proportion of postnatal second hand smoke, the prevalence of those exposed to in utero tobacco smoke exposure was lower in both genders (17.2% and 17.5% for males and females, respectively) at study entry. 41 To have an easier comparison among BMI levels of the children living in various environments in terms of NOx levels emitted from non-freeway resources, each child’s house was dichotomized into either high or low levels of NOx (using median cut). The average BMI (at study entry) was slightly higher in those living in an area with high NOx levels (female: 18.77, male: 18.68) compared to those living in an low area (female: 18.22; male: 18.47) at study entry. At the end of the study period (age = 16), BMI was still higher in the high Nox non- freeway (female: 23.82, male: 23.42) category compared to low NOx non-freeway environments (female: 22.72; male: 23.03). At age 9.5, the prevalence of the allergies was 27.2% for females and 30.1% for males, respectively. Additional descriptive information on other potential confounders and effect modifiers is shown in Table 2.3. In summary, most of the parents were at least high school graduates (female: 85.3%, male: 85.7%). The prevalence of asthma at age 9.5 was 11.8% for females and 16.2% for males. The incidence of wheezing in the past 12 months was 17.9% for females and 20.5% for males. The percentage of children who had ever wheezed was 30.8% for females and 37.3% for males. In general, Table 2.5 shows that more males experienced those chronic diseases compared to females. For the potential confounders and effect modifiers of neighborhood characteristics, the annual average daily traffic volumes (AADT) within a 150-meter of a child’s house was 21.2 for females and 22.8 for males. The average 300-meter traffic exposure (AADT 300m) was 25.1 for females and 26.3 for males. In addition, the average amount of NOx emitted from freeway sources was 3.36 ppb for females and 3.35 ppb for males. In contrast, the average NOx emitted from non-freeway sources was 2.74 ppb for females and 2.79 ppb for males. Regarding connectivity, the mean of gamma index for street connectivity was 0.39 for females and 0.4 for males. For the food environments around the participants' houses, the average of number of 42 grocery stores and fast food restaurants was 1.2 stores for females and 0.95 stores for males. The percentage of the child living in a home without any fast food restaurants or grocery stores within in a 500-meter buffer was 75.3% for females and 75.7% for males. In addition, to describe the greenness characteristics of a child’s house, the average normalized difference vegetation index of greenness (NDVI) within 500 meters of the home was 0.04 for females and males. The average area of parks and recreation within 500 meters of a child’s house was 5.68 acres for females and 6.00 cares for males. For the socioeconomics characteristics, the average population within 500 meters was 1226.88 for females and 1227.54 for males. The average unemployment rate was 0.05 for females and 0.05 for males. In general, Table 2.5 shows that there is no obvious sign of gender differences in those neighborhood characteristics (all p-values > 0.05, data not shown). Table 2.5 Descriptive statistics of potential confounders/effect modifiers in CHS data used for multi-level analysis Female (N=1266) Male (N=1274) N 1 (%) 2 Mean (SD) N 1 (%) 2 Mean (SD) Individual characteristics: Exposed to utero smoke - No 1012 (82.5%) 1023 (82.8%) - Yes 215 (17.5%) 212 (17.2%) Parental Education - Less than high school 178 (14.7%) 176 (14.3%) - High school graduated 255 (21.1%) 232 (18.9%) - Above than high school graduated 775 (64.2%) 822 (66.8%) Asthma status at baseline - No 1087 (88.2%) 1046 (83.8%) - Yes 146 (11.8%) 202 (16.2%) Current wheeze (in past 12 months) - No 969 (82.1%) 957 (79.5%) - Yes 212 (17.9%) 246 (20.5%) Ever wheeze - No 823 (69.2%) 761 (62.7%) - Yes 366 (30.8%) 453 (37.3%) 43 Allergy - No 876 (72.8%) 842 (69.9%) - Yes 328 (27.2%) 363 (30.1%) Ever hay fever - No 974 (84.7%) 959 (82.3%) - Yes 176 (15.3%) 206 (17.7%) Home/neighborhood environment: AADT Density-150m 1234 21.19 (52.60) 1242 22.82 (59.72) AADT Density-300m 1234 25.07 (54.35) 1242 26.28 (57.19) NOX freeway 1235 3.36 (4.57) 1242 3.35 (4.41) NOX non-freeway 1235 2.74 (2.46) 1250 2.79 (0.09) Gamma index 1242 0.39 (0.09) 1250 0.40 (0.09) Total fast food restaurants and grocery stores within a 500 meter road network distance of the subject 1240 1.20 (2.96) 1249 0.95 (2.28) Indicator function showing those subjects with no food within a 500 meter road network distance of the subject - No 306 (24.7%) 304 (24.3%) - Yes 934 (75.3%) 945 (75.7%) NDVI within 500m 1242 0.04 (0.04) 1250 0.04 (0.04) Parks and recreation within 500m Buffer (unit: ACRE) 1242 5.68 (14.94) 1250 6 (13.7) Total population within 500m (Population Density) 1242 1226.88 (1065.99) 1250 1227.54 (969.8) Census block level: proportion of unemployment total (male+ female) 1242 0.05 (0.04) 1250 0.05 (0.04) Community social contents: % of unemployed total (male + female): p_utp_total/100 1266 0.04 (0.01) 1274 0.04 (0.01) Violent Crime rate 1266 892.16 (435.54) 1274 893.83 (441.41) Exposure variable of interest: Second hand smoke (ever) - No 810 (66.9%) 824 (67.7%) - Yes 401 (33.1%) 394 (32.3%) 1. Due to the missing values in Entry year, the sample size varies. 2. Percentage within gender. The effects of second hand smoke exposure on BMI level (at age 16) and growth rate of BMI were not statistically different between genders (p = 0.19, df=2). However, the fitted 44 trajectories of ∆BMI.p85 were still gender specific since they follow distinctly different patterns due to natural physiologic variation in the pubertal process (figures 2.3 and 2.4). In table 2.6, when the BMI 85 th percentile was based on CHS data, only parental education and in utero smoke exposure were found to confound the association between SHS exposure and ∆BMI.p85. Meanwhile, NOx non-freeway, allergy, and traffic density were found to be potentially significant effect modifiers (p < 0.1). However, after a selection process (adding the most significant effect modifier at a time until the p-values of the remaining “leave-out” variables are larger than 0.05), allergy and NOx non-freeway were the only effect modifiers included in the model (p <0.05). It is worth noting that those factors that qualified as effect modifiers were not highly correlated with each other. Hence, the final model included in-utero smoke exposure as a confounder and allergy status NOx non-freeway exposure as effect modifiers for SHS. In Table 2.7, the effect of exposed to second hand smoke on the “deviated BMI” level at age 16 was 0.85 kg/m2 higher for children with allergy compared to those without allergy. However, both the main effect of second hand smoke exposure and the interaction effect of second hand smoke exposure and allergy status on the “deviated BMI” level at age 16 were found not statistically significant. In contrast, effect of second hand smoke exposure on the growth rate of deviated BMI was statistically significantly modified by allergy status (p<0.05, Table 2.7). Specifically, the effect of exposed to second hand smoke on the growth rate of the deviated BMI level was 1.07 kg/m2 higher per year for children with allergy compared to those without allergy. When focusing on the interaction effect of second hand smoke and NOx emitted from non-freeway resources, Table 2.7 indicates that the effect of second hand smoke exposure on the “deviated BMI” level at age 16 was statistically significantly higher by 1.68 kg/m2 for children 45 with exposure to higher NOx from non-freeway resources by 5.76 units (the 10 th – 90 th percentile range). Moreover, effect of second hand smoke exposure on the growth rate of deviated BMI was statistically significantly modified by NOx emitted from non-freeway resources (p<0.05, Table 2.7). Specifically, the effect of exposure to second hand smoke on the growth rate of the deviated BMI level was 0.95 kg/m2 higher per year for children exposed to higher NOx from non-freeway resources by 5.76 units. For comparison with the use of reference BMI values for overweight (based on the quantile regression), we also use the BMI threshold values from the CDC’s growth charts to compute the “deviated BMI” as the outcome and refit the model including in-utero smoke exposure as a confounder and allergy status and NOx emitted from non-freeway sources as effect modifiers. Table 2.7 shows that results using CDC’s growth charts to get the reference BMI values at the 85 th %ile were similar to those using the proposed 2-step quantile regression based multilevel modeling approach. However, the effect size of second hand smoke exposure was smaller when the “deviated BMI” was based on the CDC’s growth charts compared to that based on the quantile regression approach. We note that the results using CDC’s growth charts were very similar to those using the raw BMI measurements in terms of the magnitude and the p- values of the estimated coefficients. 46 Table 2.6 Results of testing confounding effects and Interaction effects (for Second Hand Smoke Exposure at Baseline (SHS)) Tested for confounding/ef fect modifying Effect SHS.coef intercept change SHS.coef slope change Unadj.SHS coef.intercept EST (SD) Unadj.SHS coef.slope EST (SD) Adj.SHS coef.intercept EST (SD) Adj.SHS coef.slope EST (SD) Intr.Act LRT 1 HS -4.8% -1.8% 0.99 (0.21)*** 0.73 (0.18)*** 0.94 (0.21)*** 0.72 (0.18)*** 0.355 NDVI -1.6% -2.9% 0.99 (0.21)*** 0.73 (0.18)*** 0.97 (0.21)*** 0.71 (0.18)*** 0.095 utsmoke -14.0% -6.4% 0.99 (0.21)*** 0.73 (0.18)*** 0.85 (0.23)*** 0.68 (0.20)*** 0.060 TotPop500 -0.7% -0.4% 0.99 (0.21)*** 0.73 (0.18)*** 0.98 (0.21)*** 0.73 (0.18)*** 0.055 PUnempSES -4.3% -5.3% 0.99 (0.21)*** 0.73 (0.18)*** 0.94 (0.21)*** 0.69 (0.18)*** 0.782 Park500 -0.1% 0.1% 0.99 (0.21)*** 0.73 (0.18)*** 0.98 (0.21)*** 0.73 (0.18)*** 0.493 TotFood500 -3.9% -2.3% 0.99 (0.21)*** 0.73 (0.18)*** 0.95 (0.21)*** 0.71 (0.18)*** 0.465 NoFood500 -5.0% -2.6% 0.99 (0.21)*** 0.73 (0.18)*** 0.94 (0.21)*** 0.71 (0.18)*** 0.700 NOXFWY -1.2% -0.4% 0.99 (0.21)*** 0.73 (0.18)*** 0.97 (0.21)*** 0.73 (0.18)*** 0.011 NOXNFWY -2.1% -0.3% 0.99 (0.21)*** 0.73 (0.18)*** 0.97 (0.21)*** 0.73 (0.18)*** 0.002 GAMMA 0.3% -0.6% 0.99 (0.21)*** 0.73 (0.18)*** 0.99 (0.21)*** 0.72 (0.18)*** 0.507 wheznow -4.0% -2.2% 0.99 (0.21)*** 0.73 (0.18)*** 0.95 (0.21)*** 0.71 (0.18)*** 0.888 everwhez -3.3% -0.5% 0.99 (0.21)*** 0.73 (0.18)*** 0.95 (0.21)*** 0.73 (0.18)*** 0.382 allergy -0.6% -0.4% 0.99 (0.21)*** 0.73 (0.18)*** 0.98 (0.21)*** 0.73 (0.18)*** 0.032 anyhayf -2.9% -0.8% 0.99 (0.21)*** 0.73 (0.18)*** 0.96 (0.21)*** 0.72 (0.18)*** 0.734 YASTHMA -0.5% -0.0% 0.99 (0.21)*** 0.73 (0.18)*** 0.98 (0.21)*** 0.73 (0.18)*** 0.353 forborn -1.1% 0.7% 0.99 (0.21)*** 0.73 (0.18)*** 0.98 (0.21)*** 0.73 (0.18)*** 0.249 AADT150m -4.4% -2.5% 0.99 (0.21)*** 0.73 (0.18)*** 0.94 (0.21)*** 0.71 (0.18)*** 0.006 AADT300m -2.3% -1.7% 0.99 (0.21)*** 0.73 (0.18)*** 0.96 (0.21)*** 0.72 (0.18)*** 0.008 VioCrime -0.2% -0.3% 0.99 (0.21)*** 0.73 (0.18)*** 0.99 (0.21)*** 0.73 (0.18)*** 0.870 UTPt -2.3% -2.7% 0.99 (0.21)*** 0.73 (0.18)*** 0.97 (0.21)*** 0.71 (0.18)*** 0.189 ***: p <0.05; **: p <0.1; *: p<0.2 1: based on likelihood ratio test without including missing indicator of effect modifier 47 Table 2.7 Estimated Effects of Second Hand Smoke Exposure and Selected Effect Modifiers based on Final Model1 Using Different Outcomes BMI – BMI 85%ile.QR based BMI – BMI 85%ile CDC BMI SHS effect on SHS effect on SHS effect on ∆BMI.p85 level at age 16 Growth rate of ∆BMI.p85 ∆BMI.p85 level at age 16 Growth rate of ∆BMI.p85 ∆BMI.p85 level at age 16 Growth rate of ∆BMI.p85 SHS -0.22 (0.35) -0.11 (0.31) -0.18 (0.35) -0.07 (0.31) -0.18 (0.35) -0.06 (0.31) NOx non-Freeway 2 -0.18 (0.43) -0.45 (0.24) -0.13 (0.43) -0.4 (0.23) -0.13 (0.43) -0.4 (0.23) Allergy -0.66 (0.26)* -0.53 (0.23)* -0.62 (0.26)* -0.46 (0.22)* -0.62 (0.26)* -0.46 (0.22)* SHS*NOx non-Freeway 2 1.68 (0.49)* 0.95 (0.42)* 1.64 (0.48)* 0.89 (0.42)* 1.64 (0.48)* 0.89 (0.42)* SHS*Allergy 0.85 (0.47) 1.07 (0.41)* 0.76 (0.47) 0.95 (0.41)* 0.76 (0.47) 0.95 (0.41)* 1. Adjusted for sex, ethnicity, cohort, community, in-utero smoke exposure. 2. The coefficients are rescaled to 10-90 th %ile range of NOx non-Freeway: 5.7614 *: p value < 0.05 Figure 2.4 The predicted ∆BMI.p85 curves for Hispanic children under condition of being allergic and exposed to high dose of NOx emitted from non-freeway sources (top blue: male exposed to SHS; bottom blue: male unexposed to SHS; top red: female exposed to SHS; bottom red: female unexposed to SHS). 48 Figure 2.5 The predicted ∆BMI.p85 curves for Caucasian children under condition of being allergic and exposed to high dose of NOx emitted from non-freeway sources (top blue: male exposed to SHS; bottom blue: male unexposed to SHS; top red: female exposed to SHS; bottom red: female unexposed to SHS). 2.5 Discussion As briefly stated in the introduction, using BMI as an outcome would limit the assessments of the covariate effects on the extremes of the distribution of BMI. Moreover, treating BMI z-scores as an outcome would lead the same problem as that of using BMI as an outcome. One may apply the proposed approach to BMI z-scores; however, the estimated effects of the risk factors would be the same with those based on the conventional multi-level mixed- effects modeling approach because the deviated BMI z-scores are shifted by a constant, 1.364 (=the BMI z-score at 85 th percentile). Another popular method in obesity studies is the logistic regression for binary indicators of obesity or overweight. Nevertheless, there are two limitations to this approach. First, in most studies, the indicator of obesity or overweight was generated 49 based on the CDC’s BMI growth charts (Seach, 2010; Sonneville et al, 2011; Wen et al, 2012). Because most of the data used to producing the CDC’s growth charts were from Caucasian children, the BMI percentile and BMI z-score of CDC may only be applicable to Caucasian children. The use of the CDC growth charts for all BMI data may lead to misclassification of overweight or obesity status of a child from races/ethnicities other than Caucasian. Second, based on logistic regression, the estimated effects of risk factors are expressed / interpreted as odds ratios. Assuming that two subjects have the same biological / environmental backgrounds except BMI (one with BMI = 10 while the other with BMI = 21), they would have the same probability of being obese/overweight (regardless their differences in BMI), if they both were exposed to some health risk factors. Thus, binary indicator logistic regression is limited to assessing covariate effects on the extremes of the distribution of BMI as well. One of the major advantages of the approach we used in this paper is the fact that we were able to quantify the magnitude of BMI units above a meaningful threshold as a function of changes in a risk factor of interest. Previous longitudinal studies have shown significant and interesting associations between health outcomes (as quantified by continuous BMI) and built environmental factors (Jerrett et al, 2014; Jerrett et al, 2010; Wolch et al, 2010) via a flexible multi-level modeling approach based on the CHS data. In this study, we used an outcome defined as the deviation of an observed BMI value from a threshold BMI value for overweight. The reference BMI values for overweight are available from the CDC’s BMI growth charts. However, the reference population of the CDC’s growth charts might be inadequate for the CHS data due to the reasons related to the environment and potential ethnic diversity. Also, the model-fitting evaluation of the LMS method used to generate the CDC’s growth charts was largely dependent on subjective choices 50 of smoothing parameters. Nevertheless, there is no objective statistical approach to determining the smoothing parameters (Wei et al, 2006). Moreover, another disadvantage of applying the LMS method is that it is difficult to adjust for as many covariates as necessary due to the sparsity of data in most applications. To generate age, gender, and ethnicity-adjusted BMI percentile reference values based on the LMS method, data would have to be initially grouped based on the status of the subjects’ age, gender, and ethnicity. Hence, it is difficult to maintain an adequate sample size for each grouped data while adjusting for multiple covariates of interest. In contrast to the LMS method, quantile regression allows for objective examination of either a specific part or the whole conditional distribution of the outcome. In addition, quantile regression is a more flexible and natural method for generating the reference growth curves of BMI and the subsequent calculation of percentiles of interest in a setting that allows for the natural incorporation of covariates. As discussed above, our approach allows us to (1) directly assess the effects of risk factors on the difference between BMI and reference BMI cutoffs for overweight or obesity (opposite to using BMI and BMI z-score); (2) have results that are easier to interpret compared to BMI z-score; (3) get more representative reference values based on our data compared to those based on external referenced population. Therefore, in this study, instead of using BMI or BMI z-score as outcome, we used the BMI 85 th percentile subtracted from measured BMI as outcome since this outcome could be interpreted as the deviations of individual subjects’ BMI trajectories from the 85 th percentile curves. Furthermore, the multi-level modeling approach provides an ideal way to analyze the association that accounts for the correlation within subjects. In addition, this two-step multi-level modeling approach is able to examine the confounding and interaction effects across various levels of aggregation. Hence, we applied the novel two-step multi-level 51 modeling approach to study the effect of SHS exposure on increasing the “deviated BMI” (i.e. the difference between individual subjects’ BMI trajectories and the 85 th percentile of population’s BMI conditioning on age, gender, and ethnicity). Our major findings are that SHS exposure is associated with increments in the difference between individual subjects’ BMI trajectories and the threshold BMI values for overweight. Furthermore, the effect of second hand exposure is confounded by in-utero smoking exposure and modified by the allergy status of a child and the emission of NOx from non-freeway resources. There are some limitations to this study. First, quantile regression provides an easier way to generate the BMI percentile with adjustments for as many covariates as possible; however, due to the small sample size on races/ethnicities, including Asian, African American, and other ethnicities, the results may not be generalizable to non-Hispanic and non-Caucasian children. Secondly, quantile regression was used to compute the conditional BMI percentile based on a cross-sectional sub data set of the CHS because that allows us to compare with cross-sectional- based CDC’s BMI growth charts. In addition, unlike the quantile regression for cross-sectional data, the quantile regression methods for longitudinal data are not well developed. Thus, we are unable to examine the effect of SHS exposure on BMI percentile directly and longitudinally. Finally, to extend this study, we may need to consider an approach that allows us to examine the effects of health factors directly on the conditional quantile of the distribution of the outcome. In conclusion, the proposed two-step multi-level modeling approach was successfully utilized to (1) generate the 85 th percent reference values of BMI for age-gender-ethnicity; (2) to examine the effect of SHS exposure on levels and slope of the “deviated BMIs”. Based on this novel approach, the association between SHS and deviated BMIs is found to be confounded by in-utero smoke exposure. The effects of SHS exposure are also modified by the exposure of 52 NOx emitted from non-freeway sources and allergy status of children. Moreover, the effects of SHS exposure are stronger when a child is exposed to high level of NOx emitted from non- freeway sources and/or experiences allergies. 53 Chapter 3 Bayesian Multi Level Quantile Regression for Longitudinal Data 3.1 Introduction To analyze the health effects on health outcome longitudinally, mixed effects regression is a well established and commonly used approach since it allows us to study the cumulated health effects not only on the level of health outcome at certain attained age, but also on the growth rate of an outcome while characterizing the non-linear growth trajectories of outcome via spline functions (Jerrett et al, 2010; Wolch et al, 2011; Duntion et al, 2012). However, there are two major limitations in current methods of applying mixed effects modeling on longitudinal data. Firstly, the mixed effects regression is mainly based on the distributional assumption of Gaussian or some other distribution. Therefore, when the assumption is not attainable, the results based on mixed effects regression are questionable. Secondly, because the mixed effects regression only allows us to study the effects on the conditional mean of the outcome, applying mixed effects regression may be inappropriate when the interest is in studying the effects on any other specific part (e.g., extreme quantiles as in overweight or obesity related problems) and/or across the entire outcome distribution. For instance, traditional mixed effects regression could examine the effect of post-natal second hand smoke (SHS) solely on mean BMI but not test the effect of SHS on the children at risk of being overweight (the BMI percentile ≥ 0.85). Regarding the issue of achievability of the normality assumption for continuous outcomes, some health outcomes such as anthropometric data (weight, BMI) in children or adolescents, are usually extremely skewed (Cole et al, 1995; Wei et al, 2006). One of the most popular solutions to resolving the skewness problem is applying the LMS method, proposed by 54 Cole (1990). The LMS method is a method that transforms the health outcome to the corresponding standardized value so that the transformed data would be approximately normal. However, the ability of achieving promised normality may be attainable at some specific part of distribution but could not be guaranteed over the full range of a relevant covariate, e.g. age (Wei et al, 2006). In addition, Geraci and Bottai (2007) pointed out that mean regression would be extraordinarily inadequate not only in estimating the location but also in estimating the quantile of the conditional distribution of the response variable while the sign of skewness of the conditional distribution changes over the full range of the response. To overcome the difficulty in satisfying the assumption of normality, quantile regression has become an ideal approach to apply. Moreover, Wei et al (2006) successfully demonstrated that the model fitting performance of quantile regression was better than that of the LMS method while the underlying distribution of height of children was different with normal distribution (distribution was bimodal in this case), using data from Finnish population. To date, the approaches of quantile regression for analyzing longitudinal data are not fully developed. One of the major challenges is how to characterize the dependence between observations from the same subjects since ignoring the dependence might lead to inaccurate statistical inferences for the parameters of interests. Koenker (2004) presented a quantile regression approach that accounts for the correlation of observations within a subject by including the “large” number of fixed individual effects. Koenker started with a simple model as shown in eq. 3.1, for the conditional quantile functions of the response, =s ,of the j th observation on subject i. In this simple model, the dependence between observations from the same subject is represented by = , capturing the subject specific source of unobserved heterogeneity, that was not accounted for by other covariates in the model. Furthermore, the = in the model has a pure 55 location shift effect on the conditional quantiles of response, i.e. = does not change over quantiles, while the effects of ¡ =s , H , are allowed to be -dependent. @ A} |¡ =s ! = = + ¡ =s ¢ H (eq. 3.1) where j=1,2, …, £ = , and i = 1,2,…,n. Regarding the estimates of parameters in the conditional quantile functions, Koenker (2004) initially proposed using the estimators that solve the T -type penalty terms incorporated in the conventional loss function shown in eq. 3.2 due to the statistical and computational advantages. In eq. 3.2, ¤ g controls the influence of the q quantiles, , 3 ,.., ¥ , on the estimation of the = parameters. Hence, eq. 3.2 implies that we can estimate multiple quantile functions, simultaneously. Specifically, Koenker illustrated that using the T shrinkage method would tolerate more discrepancies so that highly unusual = ’s would be authentically shrunken toward zero compared to those obtained using T 3 Gaussian penalty. When ¦ → 0, eq. 3.2 would become a purely fixed effects based modeling; in contrast, when ¦ → ∞ and = → 0 for all i, we obtain a model without fixed effects. min ¨ © ,I © ∑ ∑ ∑ ¤ g K =s − = − ¡ =s ¢ H K ! 5 => r A s> ¥ g> +¦ ∑ | = | 5 => (eq. 3.2) Based on the simulation results provided by Koenker (2004), the performance of this T - penalized quantile regression (on median) is shown to be better than that based on least square methods, in terms of bias and root mean square error, when the data are simulated based on location shift model (homoscedastic error) and location-scale shift model (heteroscedastic error). In addition, the penalized quantile regression approach is more preferable than least square approaches when the distribution of random errors is asymmetric, such as the chi-square 56 distribution. However, one fundamental issue with this penalized quantile regression approach is that there is no statistical rule on how to choose the value of the penalty parameter, ¦. Geraci and Bottai (2007) have demonstrated via simulated studies that inadequate values of ¦ might heavily influence the magnitude of bias in estimated parameters via simulation studies. To overcome the difficulty in proper making choice for the penalty parameter, Geraci and Bottai (2007) introduced quantile regression with random effects for longitudinal data. This quantile regression approach relies mainly on the assumption of asymmetric Laplace distribution (ALD) for the error distribution. The main features of this approach are that (1) the Maximum Likelihood Estimates (MLE) for the parameters of interest are equivalent to the solution for minimization of the objective function, i.e., the sum of absolute residuals; (2) this method provides an automatic choice of the optimal level of penalization. Again, Geraci and Bottai demonstrated the advantage of applying quantile regression with random effects via a simple location-shift model shown in eq. (3.3). In this model, the correlation of serial measurements on the same subject was accommodated via location-shift random effects. @ A} |¡ =s ,) = ! = ) = + ¡ =s ¢ H (eq. 3.3) If a random variable Y follows ALD with parameters ª,«,and , the probability density function (pdf) of ALD is |ª,«, = ¬ _¡®− “ @¯ ¬ ”°, where ¡ = ¡ − ¡ ≤ 0 !, 0< <1 is the skewness parameter, « >0 is the scale parameter, −∞ < ª < ∞ is the location parameter, −∞ < < ∞ and . is the indicator function. Particularly, given , the likelihood for the conditional quantile of response =s is =s |H,) = ,«! = ¬ _ ³ © . ´ A} µ¶ A µ· A} ¸ ¹ © º / (eq. 3.4). 57 The properties of ALD are described in Appendix (Section 3.8) of this chapter. Based on the likelihood of conditional quantile of =s , it is implied that the maximization of the likelihood in eq. 3.4 with respect to the parameter H is equivalent to the minimization of the conventional loss function in eq 3.5, provided that « is treated as nuisance. min I u » @ A} % A ¼ A} ¸ I © ¬ ½¾ (eq. 3.5) Furthermore, the density for subject i (with = observations, ¿ = = = , =3 ,…, =5 A ,!) conditional on the random intercept ) = is “¿ = |H,) = ,«” = ∏ =s |H,) = ,«! 5 A s> . In this setting, the random effects are assumed to follow some distribution, g, with parameter, Á. Then, the complete-data likelihood function for subject i is given by “¿ = |H,Á,«” = ∏ =s |H,) = ,«! 5 A s> 9) = |Á . Hence, if there are N subjects with response, ¿ = “¿ ,¿ 3 ,…,¿  ”, and the corresponding random subject effects are à = ) ,) 3 ,…,)  , the joint density of “¿,Ô is “¿,Ã|H,Á,«” = ∏ ∏ =s |H,) = ,«! 5 A s>  => 9) = |Á . In Geraci and Bottai (2007), a Monte Carlo EM (MCEM) algorithm is proposed to estimate the parameters of interest, H,Á, and «. In summary, in step 1 of the MCEM process, this iterative algorithm initially assigns the initial values of H,Á, and «. Then, in step 2, at iteration t, a sample of random effects for subject i, v = Ä = “R = Ä ,R =3 Ä ,…,R =r A Ä ” of size £ = , is drawn from ℎ“) = |¿ = ,H Ä ,Á Ä ,« Ä ” for i = 1, 2, .., N using a Gibbs sampler in conjunction with adaptive rejection sampling algorithm due to the log-concavity in ) = . However, since the random effects () = ) are not observable, the sampled random effects for subject i are drawn from the known joint likelihood of “¿ = ,) = ” given by “¿ = |H,) = ,«”9) = |Á due to 58 ℎ“) = |¿ = ,H,Á,«” ∝ “ = |H,) = ,«”9) = |Á . In step 3, the E-step, H Äv is the solution that minimizes & Æ H,Á,«|H Ä ,Á Ä ,« Ä ! = ∑ r A ∑ ∑ “ =s − R =gs Ä − ¡ =s ¢ H” 5 A s> r A g>  => , which could be solved by using a linear programming algorithm as in conventional quantile regression. Here, « Äv = ∑ 5 A Ç AÈk ∑ r A Ç AÈk ∑ ∑ ∑ “ =s − R =g Ä − ¡ =s ¢ H Äv ” 5 A s> r A g>  => and Á Äv = ∑ 5 A Ç AÈk ∑ r A Ç AÈk ∑ ∑ R =g Ä 5 A g>  => . Furthermore, step 2 and 3 are mainly based on Monte Carlo E steps. Finally, steps 1 through 3 would be repeated until parameters, H,Á,and «, converged (figure 3.1). 59 Figure 3.1 The iteration process of Monte Carlo EM algorithm A} |G =s ,) = ! = G =s ¢ H+ ) = Fitting Model: H b ,« b ,Á b ! Step 1: Initializing the parameters = ∗ “H,«,Á|H x ,« x ,Ê x ” = 1 £ = < T “H,«,Á;¿ ,R Ì x ” r A g> Step 2: MC E-step Drawing a sample of random subject effects Í = Ä = “R = Ä ,R =3 Ä ,…,R =r A Ä ”, from ℎ“) = |¿ = ,H Ä ,« Ä ,Á Ä ”, where ℎ“) = |¿ = ,H Ä ,« Ä ,Á Ä ” ∝ “¿ = |) = ,H Ä ,« Ä ”9) = |Á Ä ! H Äv = argmin I∈; Î < 1 £ = < < “ =s − R =g Ä − G =s ¢ H” 5 A s> r A g>  => « Äv = 1 ∑ =  => 1 ∑ £ =  => << < “ =s − R =g Ä − G =s ¢ H Äv ” 5 A s> r A g>  => Á Äv = R78“Í Ä ,Í 3 Ä ,…Í Â Ä ” Step 3: MC M-step Convergence criteria: ÏÐÑ Ò Ó Ò ÔOÕ Ò Ô Ò Ô v Ö Õ Ó <Ö × , Ò Ô = Ø Ô ,Ù Ô ,Á Ô ! 60 Geraci and Bottai (2007) have shown that the performance of the likelihood-based quantile regression approach with random effects was better than quantile regression with penalized fixed effects in terms of relative bias. More importantly, they illustrated that the likelihood-based quantile regression with random effects could automatically achieve the optimal degree of shrinkage of the individual effects leading to significant improvement by overcoming the limitations in the approach based on quantile regression with penalized fixed effects. However, there are still some limitations of quantile regression with random effects that need to be addressed. First, the choices of initial values of parameters of interest could significantly affect the speed of convergence and the accuracy of the estimated parameters (see section 3.5 for details). Second, in addition to the fact that the model fitting process is complex, the Monte Carlo EM algorithm suffers from heavy computational burden; and this gets even worse, when inadequate initial values are used. Third, the results provided by Geraci and Bottai (2007) are purely based on the location-shift model. They failed to assure the performance of quantile regression with random effects on data with heteroscedastic errors (location-scale-shift model). In 2014, Geraci and Bottai further proposed linear quantile mixed model (LQMM) as an improvement of MCEM, an approach can be computational demanding when modeling complex structures of random effects. The LQMM is proposed to reduce the computational burden when estimating the fixed effects coefficients and the covariance matrix of multiple random effects. Specifically, the estimates of fixed effects and random effects based on LQMM are based on the combination of Gaussian quadrature approximation and non-smooth optimization algorithm. Their results of simulation show that LQMM is robust even when including more random effects, the random effects are not normally distributed, and errors are heteroscedastic. However, the settings of simulating heteroscedastic errors are different from the CHS BMI data. 61 In CHS data, we are interested in the BMI growth curves with respected to X (time variable, e.g. age) while the X in their simulation is generated based on a mixture distribution of two standard normal distribution (Geraci and Bottai, 2014). Thus, the performance of LQMM on fitting data similar to CHS data is still not clear. In parallel to the development of frequentist based quantile regression model, several Bayesian approaches to quantile regression have been proposed (Yu and Moyeed, 2001; Reich at al, 2010; Kozumi and Kobayashi, 2011). The Bayesian approach is useful and desirable because if provides full access to the posterior distributions of parameters of interest via Markov chain Monte Carlo (MCMC) methods exploiting the desirable feature, Yu and Moyeed (2001) introduced a Bayesian quantile regression modeling approach. Yu and Moyeed assumed that the conditional outcome follows the asymmetric Laplace distribution (ALD) regardless of the distribution from which the data may have come. An important feature of this approach is that the minimization of the loss function of conditional quantile regression (eq. 3.6) is equivalent to maximization of the likelihood function of ALD (eq. 3.7) when the scale parameter of ALD is considered as nuisance. min I Ú =s − ¡ =s ¢ H !Û (eq. 3.6) =s |,H,«! = ¬ _ ³ © . ´ A} µ· A} ¸ ¹ © º / (eq. 3.7). Yu and Moyeed (2001) used MCMC with a random-walk Metropolis algorithm to extract the posterior distribution of unknown parameters because the standard conjugate prior distribution is not available for quantile regression based on the ALD. In summary, given the observations, = , 3 ,…, 5 , the fitted model is |¡ = = ¡ =s ¢ H, the posterior distribution 62 of H, H| is given by H| ∝ @ |H I H , where I H is the prior distribution of H and @ |H is the likelihood function based on ALD written as @ |H = 5 1 − 5 _ ∑ ³ © @ A ¼ A ¸ I ! Q AÈk . In practice, there usually is lack of information about the prior distribution for H. However, Yu and Moyeed (2001) have shown that even one may use improper uniform prior distributions, the resulting joint posterior distribution would be proper, i.e. 0 < 0 @ |H I H +H I < ∞. The details of the proof are given in Yu and Moyeed (2001). Unlike the intractable conditional posterior distribution of ALD, Kozumi and Kobayashi (2011) developed a mixture representation of an ALD so that the conditional posterior distributions are tractable. This might lead to significantly simplified estimating procedures. In summary, Yu and Moyeed dealt with a simple model as shown in eq 3.8 under the assumption that Ü = follows an asymmetric Laplace distribution (ALD), Ü = | = 1 − _ ³ © Ý A , setting the scale parameter to be equal to 1 (eq.3.7). In order to apply the Gibbs sampling algorithm, Kozumi and Kobayashi (2011) proved that an ALD random variable Ü = can be decomposed as a location-scale mixture of exponential and normal distributions given by Ü = = Þ ß = + Þ 3 àß = ) = , where Þ = 3 , Þ 3 = á 3 , ß = ~&¡1 , and ) = ~ã0,1 . Hence, the conditional density of = , 3 ,…, 5 given H and ß = ß ,ß 3 ,…,ß 5 becomes “|H ,ß” ∝ W∏ ß = k P 5 => ä_ å∑ “´ A µ· A ¸ ¹ © µæ k ç A ” P Pæ P P ç A Q AÈk è and eq. 3.8 could be rewritten as eq. 3.9. = = ¡ = ¢ H + Ü = , where i = 1, 2, …, n (eq. 3.8) = = ¡ = ¢ H + Þ ß = + Þ 3 àß = ) = (eq. 3.9) 63 To estimate parameters of interest using the fully Bayesian approach, Kozumi and Kobayashi (2011) assumed that the prior distribution of H is normal given by H ~ãH ,b ,é ,b !. Furthermore, a Gibbs sampling algorithm for the quantile regression model (eq 3.9) is used to draw samples of H and ß from their full conditional distributions. When conditioning on ß the full conditional distribution of H is given by H | ,ß ~ ãH L ,é ê !, where H L = é ê ®∑ ¼ A @ A ë k ì A ë P P ì A 5 => + é ,b H ,b °, and é ê = ∑ í î í î ï ð × × ñ î ò î>Õ + é ,b . At the mean, the ß could be sampled from the generalized inverse Gaussian distribution with the density given by ß = |,H ~ó ó “ 3 ,ô L = ,õ ö = ” = ÷ 6 A ø ê A ⁄ ! k P \ 3ú k/P ø ê A ÷ 6 A ! ß = k P _ k P ø ê A P ì A µk v÷ 6 A P ì A ! , where ô L = 3 = @ A ¼ A ¸ I © ! P ë P P , õ ö = 3 = 2 + ë k P ë P P , and ü /3 . is a modified Bessel function of the third kind (Barndorff-Nielsen and Shephard, 2001). One advantage of this Gibbs sampler algorithm is in generating draws of parameters of interest with low correlation. In more generalized cases, i.e. when the scale parameter « is not equal to 1, eq 3.9 could be rewritten as eq. 3.10. To develop the Gibbs sampling algorithm, Kozumi and Kobayashi (2011) re-parameterized eq. 3.10 as eq. 3.11. For the prior distributions of parameters, Kozumi and Kobayashi (2011) assumed that H ~ãH ,b ,é ,b !; «~ ó “ 5 h 3 , ý h 3 ”, where ó “ 5 h 3 , ý h 3 ” is the inverse Gamma distribution with parameters b and { b ; and R = ~ℰ« . Next, the samples of H , «, and R = R = ,R 3 ,…,R 5 would be drawn from their full conditional distributions. First, H is drawn from the density of H | ,R ,«~ ãH ,é !, where H = é ê ®∑ ¼ A @ A ë k M A ë P P ¬M A 5 => + é ,b H ,b °, and é = ∑ í î í î ï ð × × Ù î ò î>Õ + é ,b . Second, R is drawn from the density of 64 R = |,H ,« ~ ó ó “ 3 ,ô = ,õ = ”, where ô = 3 = @ A ¼ A ¸ I © ! P ë P P ¬ , õ ö = 3 = 3 ¬ + ë k P ë P P ¬ . Third, « is drawn from the fully conditional distribution given by «| ,H ,R ~ ó “ 5 3 , ý̃ 3 ”, where = b + 3, {̃ = { b + 2∑ R = 5 => + ∑ @ A ¼ A ¸ I © ë k M A ! P ë P P M A 5 => . = = ¡ = ¢ H + «Þ ß = + «Þ 3 àß = ) = , where i = 1, 2, …, n (eq. 3.10) = = ¡ = ¢ H + Þ R = + Þ 3 à«R = ) = , where R = = «ß = (eq. 3.11) The Bayesian quantile regression approaches discussed above are applicable only for cross-sectional data. To our knowledge, Bayesian quantile regression approaches for dealing with longitudinal and multi-level data have received little attention to date. In addition, as we discussed before, the iteration-based quantile regression for longitudinal data (MCEM) proposed by Geraci and Bottai (2007) suffers from heavy computational burden when dealing with relatively more complex models that have a large number of random effects. Therefore, the main objective of this chapter is to introduce a new Bayesian quantile regression approach for longitudinal data with hierarchical structure. Generally speaking, the proposed method will allow us to study non-linear growth trajectories and associated parameters using a flexible spline based modeling approach. However, our simulated studies and data analysis will be performed assuming linear growth trajectories. 3.2 Model Consider a longitudinal/clustered data, involving 3 levels, that are defined as follows: assume that there are j=1, 2, …, = level-1 units that are nested within i=1, 2, …, level-2 units that are in turn nested within c=1, 2, …, T level 3 units. Furthermore, assuming the data 65 are in the form G ¢ , ¢ ,G = ¢ , = ¢ ,G =s ¢ , =s ¢ ,x =s , =s !, for j = 1, 2, …, = , i = 1, 2, …, , and c= 1, 2, …, , ∑ ∑ = 5 => ¢ > = ã, where G =s ¢ and =s ¢ are level-1 covariate vectors, G = ¢ and = ¢ are level-2 covariate vectors, G ¢ and ¢ are level-3 covariate vectors, x =s is the time variable, and =s is the j th measurement of outcome on the i th level-2 unit (e.g. subject) of c th level-3 unit (e.g. town). The multilevel quantile functions are then defined as Level 1: A} = 7 ,= + ∑ q ,= g ã g x =s ! g + ô , G =s + õ , =s (eq. 3.12) Level 2a: 7 ,= = 7 , + ô ,3 G = + õ ,3 = + ) = (eq. 3.13a) Level 2b k : q ,= g = q ,g + ô ,33g G = + õ ,33g = + R = g (eq. 3.13b), k=1, 2, …, K Level 3a: 7 , = 7 ,b + ô ,V G + õ ,V + ) (eq. 3.14a) Level 3b k : q ,g = q ,bg + ô ,V3g G + õ ,V3g + R g (eq. 3.14b), k=1, 2, …, K We now define the various parameters and random effects in the multi-level model described above, for a fixed quantile . In the first level (eq. 3.12), subject specific intercepts and growth parameters (such as 7 ,= , q ,= g ) are extracted after common adjustments for all time dependent covariates. In other words, 7 ,= and q ,= g represent the cluster-specific random effects at the second level. Note that ã g x =s !, k=1, 2, …K are some Spline basis functions (such as linear- Spline) used to model the non-linear trajectories of BMI across the childhood growth period. In the second level, each of the subject-specific regression coefficients extracted from eq. 3.12 (7 ,= and q ,= g ) is modeled as a function of group-level variables at level 2 (eq. 3.13a and 3.13b). More specifically, we assume the errors at level 2 were normally distributed given by ) = ~ã0,« ||, 3 !, R = g ~ã0,« g , 3 !, and wR) = ,R = g ! = « | g , . The term ) = 66 indicates the deviation of the subject-specific intercepts at level 2 away from the community- specific intercept at level 3 after accounting for time independent covariates G = ¢ and = ¢ . Analogously, R = g measures the deviation of the slopes of ã g x =s ! within each subject (ci) from the overall slope within community (c) after accounting for G = ¢ and = ¢ . « ||, 3 and « g , 3 are the variances of subject-specific intercepts and slopes at level 2 while the « | g , represents the covariance between intercepts and slopes. In the third level, each of the community-specific (c) regression coefficients defined in eq. 3.13a and 3.13b (7 , and q ,g ) is modeled as a function of group-level variables at level 3, i.e., community level (eq. 3.14a and 3.14b). More specifically, we assume the errors at level 3 were normally distributed as given by ) ~ã0,« ||, 3 !, R g ~ã0,« g , 3 !, and wR) ,R g ! = « | g , . The term ) indicates the deviation of the intercepts of each community (c) at level 3 from the overall intercept, 7 ,b , after accounting for ecological covariates G ¢ and ¢ . Analogously, R g measures indicates the deviation of the slopes of each community (c) at level 3 (q ,g ) from the overall slope, q ,bg , after accounting for G ¢ and ¢ . « ||, 3 and « g , 3 are the variances of the community-specific intercepts while the « | g , represents the covariance between intercepts () ) and slopes (R g ). Hence, this multilevel quantile regression summarizes the distribution of the cluster-specific coefficients in two parts: (1) fixed effects that are invariant across groups (e.g. 7 ,b , ô ,V , õ ,V ) and random effects (e.g. ) , R g , ) = , R = g ) that depict group level heterogeneity left unexplained by the fixed covariates. For the most part, and without loss of generality, random effects are assumed to be independent across levels. However, this could be easily relaxed as needed. By combining the eq. 3.12, eq. 3.13a, eq. 3.13b, eq. 3.14a, and eq. 3.14b, we get a combined quantile regression mixed effects model as in eq. 3.15 below. 67 =s ! = 7 ,b + ô ,V G + õ ,V + ô ,3 G = + õ ,3 = + ∑ q ,bg ã g x =s ! + ∑ ô ,V3g G ã g x =s ! g + g ∑ õ ,V3g ã g x =s ! g + ∑ ô ,33g G = ã g x =s ! g + ∑ õ ,33g = ã g x =s ! g + ô , G =s + õ , =s + ) + ) = + ∑ R = g ã g x =s ! + ∑ R g ã g x =s ! g g (eq. 3.15) Also note that, for a given , all the coefficients of fixed effects (including 7 ,b , ô ,V , õ ,V , ô ,3 , õ ,3 , q ,bg , ô ,V3g , õ ,V3g , ô ,33g , õ ,33g , ô , , õ , ) are -dependent while the random effects are left to be independent of (including ) , R g , ) = , R = g ) because we generally condition on all the random effects. 3.3 Estimation (Proposed Approach) To explain our proposed approach for estimating the coefficients, we start with a simpler case, assuming the data followed a “location-shift” model with 2 levels (eq 3.16). We fit a conditional %T_ function of outcome, Q Y ,ˆ‰ |U bˆ ,U ˆ ,X ˆ‰ !, where b= and = are the random effects on intercept and slope at the second level. Also, due to the homoscedasticity of errors, the intercept (H ,b ) is percentile-dependent while the effects of risk factor (H , ) is constant across the distribution of health outcome (eq 3.17), i.e., it is quantile-independent. We assume that the quantile residuals ( ,=s − ,=s | b= , = ,G =s !, denoted as Ü , follow asymmetric Laplace distribution (,0,« ). Specifically, we let ε = ε ,ˆ‰ − F ε ,ˆ‰ !~ALDτ,0,« , where Ü ,=s ! is the inverse function of cumulated density function of Ü ,=s and the probability density function of ALD is fx;μ,σ ,τ = ¬ k exp“− ¬ k τ − Ix ≤ μ !”. In this case, fε ;0,σ ,τ = " k exp“− # k$ " k τ − Iε ≤ 0 !” = ¬ k _¡“− 1,% −H 1,0 −H 1,1 G % ¬ k & − 68 1,% − H 1,0 − H 1,1 G % ≤ 0!'”. Thus, to estimate the coefficients β ,b and β , under Bayesian framework, we just need to assign appropriate priors and hyper priors (if any) to the parameters in the joint likelihood function. In summary, when errors are homoscedastic, we can apply the approach suggested by Yu and Moyeed (2001) with conditioning on the random effects. ,=s = H b + b= + H + = G =s + Ü ,=s (eq. 3.16) ,=s | b= , = ,G =s ! = H ,b + H , G =s (eq. 3.17), where H ,b = H b + b= + Ü ,=s ! and H , = H + = ,=s − ,=s | b= , = ,G =s ! = &H b + b= + H + = G =s + Ü ,=s ' − H ,b + H , G =s ! =&H b + b= + H + = G =s + Ü ,=s ' − &H b + b= + Ü ,=s ! + H + = G =s ' =Ü ,=s − Ü ,=s ! However, when dealing with more complicated models, the method discussed above may not work. What would happen if we move from a “location-shift” model to a “location-scale- shift” model by adding G =s Ü ,=s to ,=s (the case discussed above)? After adding G =s Ü ,=s to ,=s , 3,=s follows a “location-scale-shift” model with 2 levels (eq 3.18). The fitted conditional %T_ function is denoted by 2,% | 0 , 1 ,G % !. Since the errors are heteroscedastic, the intercept (H 3,b ) and slope (H 3, ) are both percentile-dependent (eq 3.19). Again, we assume that the quantile residuals ( 3,=s − 3,=s | b= , = ,G =s ! , denoted by Ü 3 , follow asymmetric Laplace distribution. Specifically, Ü 3 is equal to 1 + G =s !&Ü ,=s − Ü ,=s !' (eq. 3.18 - eq. 3.19), where Ü ,=s ! is the inverse function of cumulated density function of Ü ,=s . However, does the Ü 3 still follow asymmetric Laplace distribution given the assumption of Ü ,=s − Ü ,=s !~,0,« 1 ? According to the properties of ALD, if Ü ,=s ~,0,« 1 , Ü 1,% ~,0,c« , where c is a constant. Therefore, with conditioning on G =s , Ü 3 = 69 1 + G % !&Ü 1,% − −1 Ü 1,% !' would follow ,0,1 + G % !« !. However, according to Geraci and Bottai (2007, 2014) and other published paper, 1 + X ˆ‰ !&ε ,ˆ‰ − F ε ,ˆ‰ !'~ALDτ,0,σ , where « is a constant regardless the values of G =s . Thus, instead of assuming Ü 3 ~,0,« , we propose to assume Ü 3 ~,0,1 + G % !« !. To estimate the coefficients H 3,b and H 3 under Bayesian framework, we need to assign appropriate priors and hyper priors (if any) to the parameters in the new proposed joint likelihood function based on ,0,1 + G % !« !. 3,=s = H b + b= + H + = G =s + 1 + G =s !Ü ,=s = ,=s + G =s Ü ,=s (eq. 3.18) 3,=s | b= , = ,G =s ! = H 3,b + H 3, G =s (eq. 3.19) where H 3,b = H b + b= + Ü ,=s ! and H 3, = H + = + Ü ,=s ! (eq. 3.19) - (eq. 3.18): 3,=s − 3,=s | b= , = ,G =s ! = &H b + b= + H + = G =s + 1 + G =s !Ü ,=s ' − H 3,b + H 3, G =s ! =&H b + b= + H + = G =s + 1 + G =s !Ü ,=s ' − )H b + b= + Ü ,=s ! + “H + = + Ü ,=s !”G =s * =1 + G =s !&Ü ,=s − Ü ,=s !', where &Ü ,=s − Ü ,=s !'~,0,« 1 3,=s − 3,=s | b= , = ,G =s !~ ,0,1 + G =s !« 1 ! The cases discussed above indirectly assume the true errors structure is known. In practice, we rarely have information on the errors pattern; thus, it would be impossible to determine what assumption of Ü – we should use for analysis, Ü ~,0,« or Ü ~,0,1 + G % !«!. To make our proposed method applicable to real data analysis, we further modified the assumption as shown in eq 3.20. In eq 3.20, the constant a is used to provide the information on the error structure of the data. The constant, a, could be any real number depending on the nature of the distribution of data. For instance, a is equal to zero for the first case (homoscedastic errors) while a is 1 for the second case (heteroscedastic errors). 70 Furthermore, to estimate the coefficients H s′ under a Bayesian framework, we assign appropriate priors and hyper priors (if any) to the parameters and to a in the new proposed joint likelihood function based on ,0,1 + 7G % !« !. Ü ~,0,1 + 7G % !«! (eq. 3.20) In the framework of this generalized model, no matter what the original distribution of the data is, we assume @ =s |ª =s ! is ª =s ,« Ý ,! and the ª =s ! = 7 ,b + ô ,V G + õ ,V + ô ,3 G = + õ ,3 = + ∑ q ,bg ã g x =s ! + ∑ ô ,V3g G ã g x =s ! g + ∑ õ ,V3g ã g x =s ! g g + ∑ ô ,33g G = ã g x =s ! g + ∑ õ ,33g = ã g x =s ! g + ô , G =s + õ , =s + ) + ) = + ∑ R = g ã g x =s ! + ∑ R g ã g x =s ! g g for any 0 < < 1, where as « Ý = 1 + 7 ô ,V + 7 3 õ ,V + 7 V ô ,3 + 7 › õ ,3 + 7 f q ,bg + 7 , ô ,V3g + 7 - õ ,V3g + 7 e ô ,33g + 7 . õ ,33g + 7 b ô , + 7 õ , «. Because the standard conjugate prior distribution is not available for quantile regression, the Markov Chain Monte Carlo (MCMC) method could be applied to extract the posterior distributions of parameters of interest. Thus, this allows us to use any prior distribution. Moreover, the Bayesian framework implemented via MCMC not only gives us the marginal and joint distributions of all parameters of interest but also takes the uncertainty of parameters into account for predictive inferences (Yu and Moyeed, 2001). In the Bayesian setting, given the observed outcomes, = / =s 0, the joint posterior distribution of H ,=s = ô , ,õ , ,q ,bg ,ô ,V3g ,õ ,V3g ,ô ,33g ,õ ,33g !, H ,= = ô ,3 ,õ ,3 ,ô ,33g ,õ ,33g !, H , = 7 ,b ,ô ,V ,õ ,V ,!, ) = = ) = ,R = g !, ) = ) ,R g !, and « Ý is given by 71 “H ,=s ,H ,= ,H , ,) = ,) ,« Ý |” ∝ “|H ,=s ,H ,= ,H , ,) = ,) ,« Ý ,”) = !) !“H ,=s ”“H ,= ”“H , ”« ∏ 7 g b g> (eq. 3.21), where “|H ,=s ,H ,= ,H , ,) = ,) ,« Ý ,” is the asymmetric Laplace distribution density function, and ) = !,) !,“H =s ”,“H ,= ”,“H , ”,« , and 7 g are prior densities of ) = ,) ,H =s ,H = ,H ,« and 7 g , respectively. Therefore, the likelihood function “|H ,=s ,H ,= ,H , ,) = ,) ,« Ý ,” = Ç Ç ¬ 1 Ç _¡®−∑ ∑ ∑ =s − 7 ,b − ô ,V G − 5 A s> 5 => > õ ,V − ô ,3 G = − õ ,3 = − ∑ q ,bg ã g x =s ! − ∑ ô ,V3g G ã g x =s ! g − ∑ õ ,V3g ã g x =s ! g g − ∑ ô ,33g G = ã g x =s ! g − ∑ õ ,33g = ã g x =s ! g − ô , G =s − õ , =s − ) − ) = − ∑ R = g ã g x =s ! − ∑ R g ã g x =s ! g g !° (eq. 3.22), where k=1, 2, …, K. Theoretically, we could use any prior in eq. 3.21. Yu and Moyeed (2001) showed that improper uniform prior distributions of all components of H ,=s ,H ,= ,and H , could result in a proper joint posterior distribution. We also generally impose the hyper-prior distribution of Σ % A 3 and Σ % 3 (Σ % A 3 ! and Σ % 3 !) in eq 3.21 and get eq. 3.23. In eq 3.23, the joint posterior distribution of the parameters is proportional to the product of the likelihood of “|H ,=s ,H ,= ,H , ,) = ,Σ % A 3 ,Σ % 3 ,) ,«,7 g ,” and the prior (including ) = |Σ % A 3 !) |Σ % 3 !“H ,=s ”“H ,= ”“H , ”« ,and 7 g ) and hyper-prior (including Σ % A 3 !,and Σ % 3 !) distributions of parameters of interests. 72 “H ,=s ,H ,= ,H ,, ,) = ,) ,Σ % A 3 ,Σ % 3 ,« Ý |” ∝ “|H ,=s ,H ,= ,H , ,) = ,Σ % A 3 ,Σ % 3 ,) ,«,7 g ,”) = |Σ % A 3 !Σ % A 3 !) |Σ % 3 !Σ % 3 !“H ,=s ”“H ,= ”“H , ”« ∏7 g (eq. 3.23) In this setting, we assume that the prior densities of all components of H ,=s ,H ,= ,and H , are independent and identically distributed uniform random variables while the prior distributions of the random effects () = and ) ), accounting for the dependencies of observations from the same subject, are independently and identically distributed as multivariate normal distributions given as ) = |Σ % A 3 !~X ã3,Σ % A 3 ! and ) |Σ % 3 !~X ã3,Σ % 3 !, respectively. Moreover, theoretically, Σ % A 3 and Σ % 3 are allowed to be any type of positive-definite variance-covariance matrix. For simplicity, we assume that all the off-diagonal elements of both matrices are zero, i.e. assume independence for the variance-covariance matrix structure. However, this potentially restrictive assumption could be relaxed as needed, at the expense of heavier computational cost. Moreover, the prior distribution of « and the hyper-prior densities of elements in Σ % A 3 and Σ % 3 are assumed to follow inverse gamma distributions while the prior of a 4 (constant for modifying the « Ý parameter in the ALD) is assumed to be X ã3,Σ | K 3 !. 3.4 Simulation 3.4.1 Simulations based on “location-shift” models We initially conducted two simulation studies to evaluate the performance of MCEM for the quantile function at 85 th percentile (=0.85). For the first simulation, the data were generated based on a “location-shift” model with (1) homoscedastic errors and (2) subject-specific random intercepts to account for the dependence among observations from the same subject. The data for the second simulation study were simulated based on a “location-scale-shift” model with (1) 73 homoscedastic errors and (2) subject-specific random intercepts and random slopes to account for the correlation between observations from the same subject. The details regarding these two simulations are described as follow: Simulation 1 Simulation Model: Level 1: Y ˆ‰ = α bˆ + 2X ˆ‰ + ε ˆ‰ (eq. 3.24) Level 2: α bˆ = 1 + u ˆ (eq. 3.25) Y ˆ‰ = 1 + u ˆ + 2X ˆ‰ + ε ˆ‰ (eq. 3.26) u ˆ ~N0,1 , ε ˆ‰ ~N0,1 , i = 1, 2, …, 100, and j = 1, 2, …, 23 Model Fitting: 0.85|) = ,G =s ! = q ,b.ef + ) = + q 3,b.ef G =s (eq. 3.27) For computational convenience, eq 3.24 and eq 3.25 were combined to get eq 3.26. The model, eq. 3.26, used to simulate the data follows the location-shift model used in a previous simulation study by Geraci and Bottai (2007). We simulated data for 100 subjects (N=100) and 23 observations for each subject ( = = = 23, for i=1,2,…..,100). The error terms, ε ˆ‰ , were generated from a standard normal distribution, N(0,1). The values of G =s was set equal to j for all i. In other words, G = =1, G =3 =2,….., G =3V =23. Here, all G =s s were set to be positive in order to avoid the problem of quantile crossing (i.e., the possibility that at some G =s , the predicted value of 85 th percentile is larger than that of 95 th percentile). The random effects, ) = , were generated based on a standard normal distribution, N(0,1). In total, we simulated 1000 replications. 74 The estimated quantile function is shown in eq. 3.27. According to the simulated model (eq. 3.26), only the intercept is quantile-dependent (quantile specific) since the random errors were homoscedastic. Furthermore, for all the simulations in this chapter, we focus on the quantile function of the 85 th percentile (τ=0.85) because 85 th percentile of BMI was the cut-off point of being overweight suggested by the Center of Disease Control and Prevention (CDC). Based on this set-up, the true value of the intercept of 85 th percentile is 2.03 (1 + 0.85 = 1 + Z b.ef = 2.0364)) while the true value of the slope is 2. Simulation 2 Simulation Model: Level 1: Y ˆ‰ = α bˆ + α ˆ X ˆ‰ + ε ˆ‰ (eq. 3.28) Level 2a: α bˆ = 1 + u ˆ (eq. 3.29) Level 2b: α ˆ = 2 + v ˆ (eq. 3.30) Y ˆ‰ = 1 + u ˆ + 2 + v ˆ X ˆ‰ + ε ˆ‰ (eq. 3.31) u ˆ ~N0,1 , v ˆ ~N0,1 ,ε ˆ‰ ~N0,1 , i = 1, 2,…, 100, and j = 1, 2, …, 23 Model Fitting: 0.85|) = ,v = ,G =s ! = q ,b.ef + ) = + q 3,b.ef G =s + v = G =s (eq. 3.32) The model depicted by eq 3.31 (obtained by combining eq. 3.28, 3.29 and 3.30), is used to simulate the data and is an extended version of the model for simulation 1. Again, we simulated 100 subjects (N=100) and 23 observations for each subject ( = = = 23, for i=1,2,…..,100). In addition, the error terms, ε ˆ‰ , were generated from a standard normal 75 distribution, N(0,1) as well. In eq. 3.31, we have included random slopes at the subject level in addition to the random intercepts. Here, G =s was set equal to j for all i. In other words, G = =1, G =3 =2,….., G =3V =23. The random effects, ) = and R = , were each generated from a standard normal distribution, N(0,1). In total, we simulated 1000 replications. The estimated quantile function is shown in eq. 3.32. According to this simulated model (eq. 3.31), the intercept is quantile-dependent (quantile specific) while the slope is constant across quantiles since the random errors were homoscedastic. In this setting, the true value of the intercept at the 85 th percentile is 2.03 (1 + 0.85 = 1 + Z b.ef = 2.0364)) while the true value of the slope is still 2. 3.4.2 Simulations based on “location-scale-shift” models To evaluate our proposed method by comparing performance to other published methods, we also conducted a simulation study based on the model proposed by Geraci and Bottai (2014). These simulations would allow us to fairly compare the performance of our proposed method to that from other methods, including the linear quantile mixed modeling (LQMM, Geraci and Bottai 2014) and a Bayesian approach without the modified scale parameter (σ) of ALD. Simulation 3 Simulation Model: =s = 100 + ) = + 2G =s + Z =s + 1 + 0.25G =s !_ =s (eq. 3.33) ) = ~ã0,5 G =s = ô = + 8 =s , ô = ~0,1 , 8 =s ~0,1 Z =s ~lw£ 1,0.5 _ =s ~ã0,5 76 i =1, 2, ..100, j=1, 2, …,10 Model Fitting: 0.75|) = ,v = ,G =s ,Z =s ! = q ,b.-f + ) = + q 3,b.-f G =s + q V,b.-f Z =s (eq. 3.34) For simulation 3, we simulated 100 subjects (N=100) and 10 observations for each subject ( = = = 10, for i=1, 2,….., 100). In addition, following simulation studies proposed by Geraci and Bottai (2014), the error terms, ε ˆ‰ , were generated from a normal distribution, N(0,5). In eq. 3.33, we did not only include random intercepts but also included one additional covariate (Z =s ). Here, Z =s was simulated from a binomial distribution with n=1and p =0.5. Here, G =s was generated from two independent Uniform distributions, U(0,1). The random effects, ) = , were each generated from a normal distribution, N(0,5). In total, we simulated 1000 replications. Since Geraci and Bottai (2014) only published the results focusing on the median, 75 th %ile and 90 th %ile, for easier comparison with their results, we focused on estimating 75%ile quantile function as shown in eq. 3.34. According to this simulated model (eq. 3.33), the intercept and the coefficient of G =s are quantile-dependent (quantile specific) while the coefficient of Z =s is constant across quantiles since the random errors were independent of Z =s . In this setting, the true values of the intercept and q 3,b.-f at the 75 th percentile are 101.51 (100 + √5Z b.-f ) and 2.38 (2 + 0.25√5Z b.-f ), respectively. Here, the true value of the slope of Z =s is still 1. 3.4.3 Simulations based on CHS BMI data structure Our main motivation for the methodological development is to develop methods that could be eventually applied to real data such as those examining effects of risk factors on BMI growth curves at a given percentile based on longitudinal data similar to those in the CHS. We 77 conducted the following four simulation studies (simulations 4 to 7) to imitate the BMI growth curves by generating BMI measurements and centered age at 13 under different hypothetical scenarios. Using this set-up, we would be able to evaluate the robustness of the proposed Bayesian multilevel quantile regression for the quantile function at 85 th percentile (=0.85) under different situations. For simulation 4, the data were generated based on a “location-shift” model with (1) homoscedastic errors and (2) subject-specific random intercepts and slopes to account for the dependencies between correlated observations. The data for the simulation 5 was extended from simulation 4. In simulation 5, we still simulated data with random intercepts and slopes to account for dependence between observations. In addition, we made the coefficient of X percentile-dependent by including heteroscedastic errors. For simulation 6, the data were generated based on a model with (1) homoscedastic errors and (2) three-level structures at the temporal, subject, and town levels. Specifically, we included random subject-level intercepts and slopes, and random town-level intercepts and slopes to account for the correlations between- and across-levels. Thus, simulation 6 is extended more general simulation set-up compared to simulation 4 as it includes additional random effects from multiple levels. In simulation 7, we extend the set-up of simulation 6 by modeling the coefficient of X as percentile-dependent and by including heteroscedastic errors. The details regarding these four simulations are described as follows: Simulation 4 Simulation Model: _R_T 1: Ä=s = H b= + H = G Ä=s + Ü Ä=s (eq. 3.35) _R_T 27: H b= = H b + ) b.= (eq. 3.36) 78 _R_T 2q: H = = H + ) .= (eq. 3.37) =s = H b + ) b.= + H + ) .= ¡ =s + Ü =s , (eq. 3.38) where β b = 25, β = 2, u b.ˆ ~N0,1 and u .ˆ ~N0,1 are the subject-specific random effects on intercepts and x ˆ‰ , x ˆ‰ = u U0,1 , if j = 1 x ˆ + j − 1 , if j ≠ 1 , and e ˆ‰ ~N0,1 . i = 1, 2,…, 500, and j = 1, 2, …, 4. Model Fitting: 0.85|) Ä ,) Ä= ,v = ,G =s ! = q ,b.ef + q 3,b.ef G =s + ) 0. + ) 1. G =s (eq. 3.39) The model depicted by eq 3.38 (obtained by combining eq. 3.35, 3.36 and 3.37), is used to simulate the data. We simulated 500 subjects (N=500) and 4 observations for each subject ( = = = 4, for i=1, 2,….., 500). In addition, the error terms, ε ˆ‰ , were generated by using a standard normal distribution, N(0,1) as well. In eq. 3.38, we did not only include random intercepts but also included random slopes at the subject level. Here, G = (first observation of subject i) was drawn from the Uniform distribution, U(0,1) and G =s was set to equal to G = + % − 1 for all i and j>1. In other words, G =s s were generated to imitate centered annual age. The random effects, ) b.= and ) .= , were each generated from a standard normal distribution, N(0,1). In total, we simulated 500 replications. The estimated quantile function is shown in eq. 3.39. According to this simulated model (eq. 3.38), the intercept is quantile-dependent (i.e., quantile specific) while the slope is constant across quantiles since the random errors were homoscedastic. In this setting, the true value of the intercept at the 85 th percentile is 26.03 (25 + 0.85 = 1 + Z b.ef = 26.0364)) while the true value of the slope is still 2. 79 Simulation 5 Simulation Model: _R_T 1: Ä=s = H b= + H = G Ä=s + 1 + ¡ =s !Ü Ä=s (eq. 3.40) _R_T 27: H b= = H b + ) b.= (eq. 3.41) _R_T 2q: H = = H + ) .= (eq. 3.42) =s = H b + ) b.= + H + ) .= ¡ =s + 1 + ¡ =s !Ü =s , (eq. 3.43) where β b = 25, β = 2, u b.ˆ ~N0,1 and u .ˆ ~N0,1 are the subject-specific random effects on intercepts and x ˆ‰ , x ˆ‰ = u U0,1 , if j = 1 x ˆ + j − 1 , if j ≠ 1 , and e ˆ‰ ~N0,1 . i = 1, 2, …, 500, and j = 1, 2, …, 4. The model depicted by eq 3.43 (obtained by combining eq. 3.40, 3.41 and 3.42), is used to simulate the data and is an extended version of simulation 4. The simulation settings are similar with those in simulation 4 except that the errors are heteroscedastic. The estimated quantile function is the same with that in simulation 4 (shown in eq. 3.39). According to this simulated model (eq. 3.43), the coefficients of intercept and slope are quantile-dependent (i.e., quantile specific) since the random errors are heteroscedastic. In this setting, the true values of the intercept and slope at the 85 th percentile are 26.03 (25 + 0.85 = 25 + Z b.ef = 26.0364) and 3.03 (2 + 0.85 = 2 + Z b.ef = 3.0364)), respectively. Simulation 6 Simulation Model: _R_T 1: Ä=s = H bÄ= + H = G Ä=s + Ü Ä=s (eq. 3.44) 80 _R_T 27: H bÄ= = H bÄ + ) b.= (eq. 3.45) _R_T 2q: H = = H Ä + ) .= (eq. 3.46) _R_T 37: H bÄ = H b + R b. (eq. 3.47) _R_T 3q: H bÄ = H + R . (eq. 3.48) =s = H b + ) b.= + R b. + H + ) .= + R . ¡ =s + _ =s (eq. 3.49) where β b = 25, β = 2, u b.< ~N0,1 and u .< ~N0,1 are the town-specific random effects on intercepts and x <ˆ‰ , u b.<ˆ ~N0,1 and u .<ˆ ~N0,1 are the subject-specific random effects on intercepts and x <ˆ‰ , x <ˆ‰ = u U0,1 , if j = 1 x <ˆ + j − 1 , if j ≠ 1 , and e <ˆ‰ ~N0,1 . Model Fitting: 0.85|) Ä ,) Ä= ,v = ,G =s ! = q b,b.ef + q ,b.ef G =s + ) 0. + R 0. + ) 1. + R 1. G =s (eq. 3.50) The model, eq 3.49 (combining eq. 3.44 through 3.48), is used to simulate the data and is an extended version of the model used for simulation 4. Specifically, to mimic the set up in the Children Health Study (CHS) data, we simulated data containing 10 towns. Moreover, in each town, we simulated 50 subjects ( Ä = 50, for t=1, 2,…, 12) and 4 observations for each subject ( Ä= = 4, for t=1, 2, …, 10; i= 1, 2,…, 50). The error terms, e ˆ‰ , were generated by using a standard normal distribution, N(0,1). In eq. 3.49, the model contained the random effects from various levels, including random subject intercepts and slopes, and random town intercepts and slopes. The G = (first observation of subject i in town c) was drawn from uniform distribution, U(0,1) and G =s was set to equal to G = + % − 1 for all c and i, and j>1. The random effects, 81 ) b.= , ) .= , R b. and R . , were each generated from the standard normal distribution, N(0,1). In total, we simulated 500 replications. The estimated quantile function is shown in eq. 3.50. According to the simulated model (eq. 3.49), the intercept is quantile-dependent (i.e., quantile specific) while the slope is constant across quantiles since the random errors are homoscedastic. Meanwhile, the true value of intercept for the 85 th percentile is 26.03 (25 + 0.85 = 25 + Z b.ef = 26.0364)) while the true value of slope is still 2. Simulation 7 Simulation Model: _R_T 1: =s = H b= + H = G =s + 1 + ¡ =s !Ü =s (eq. 3.51) _R_T 27: H b= = H b + ) b.= (eq. 3.52) _R_T 2q: H = = H + ) .= (eq. 3.53) _R_T 37: H b = H b + R b. (eq. 3.54) _R_T 3q: H b = H + R . (eq. 3.55) =s = H b + ) b.= + R b. + H + ) .= + R . ¡ =s + 1 + ¡ =s !_ =s (eq. 3.56) where β b = 25, β = 2, u b.< ~N0,1 and u .< ~N0,1 are the town-specific random effects on intercepts and x <ˆ‰ , u b.<ˆ ~N0,1 and u .<ˆ ~N0,1 are the subject-specific random effects on intercepts and x <ˆ‰ , x <ˆ‰ = u U0,1 , if j = 1 x <ˆ + j − 1 , if j ≠ 1 , and e <ˆ‰ ~N0,1 . Model fitting will be conducted using the model depicted in eq 3.50. The model depicted by eq 3.56 (obtained by combining eq. 3.51 through 3.55), is used to simulate the data and is an 82 extended version of simulation 6. The simulation settings are similar with those in simulation 6 with the exception that the errors here are heteroscedastic. The estimated quantile function is the same with to that in simulation 6 (shown in eq. 3.50). According to this simulated model (eq. 3.56), the coefficients of the intercept and slope are quantile-dependent (i.e., quantile specific) since the random errors were heteroscedastic. In this setting, the true values of the intercept and slope at the 85 th percentile are 26.03 (25 + 0.85 = 25 + Z b.ef = 2.0364) and 3.03 (2 + 0.85 = 2 + Z b.ef = 3.0364)), respectively. In simulation studies 1 and 2, we compared our proposed Bayesian multilevel quantile regression with other methods, including conventional quantile regression (noted as Naïve, i.e., treating the longitudinal data as cross-sectional data) and the quantile regression with random effects proposed by Geraci and Bottai (2007) where the initial values of parameters were from the conventional quantile regression (i.e. treat all observations are independent to each other, notated as MCEM). To study how different initial values influence the performance of MCEM, we also used Bayesian multilevel quantile regression (notated as MCEM + Bayes) to determine the initial values with assignments for all prior densities of H =s ,H = , H , ) = and ) . The priors of ) = and ) were set as independent and identically distributed non-informative multivariate normal distributions, i.e. normal distribution with large variances, given as ) = !~ã3,Σ % => 3 ! and ) !~ã3,Σ % = 3 !, respectively. Therefore, this Bayesian quantile regression used to get the initial values still considers the longitudinal aspects except for the fact that this Bayesian quantile regression assigns non-informative prior densities to the parameters of interest without assigning hyper-prior distributions. Furthermore, for each replication, the samples of each subject-specific random effects () = and ) ) used in fitting the quantile regression with random effects (MCEM) were drawings of size 5000 via the Gibbs sampler with an adaptive rejection sampling algorithm. 83 In simulation 3, we compared the proposed Bayesian multilevel quantile regression with LQMM (Geraci and Bottai, 2014). In addition, since Geraci and Bottai (2007) did not report the results for 85%ile, we compared the estimated coefficients at 75%ile. Thus, we would be able to fairly compare our results with theirs. In simulations 4, 5, 6, and 7, we compared (1) the proposed Bayesian multilevel quantile regression, assuming ,0,1 + G % !«! to (2) LQMM (Geraci and Bottai, 2014), and (3) Bayesian multilevel quantile regression, assuming ,0,« . We need to keep in mind that assuming ,0,« might be inaccurate for the cases with heteroscedastic errors. Meanwhile, the results based on the proposed Bayesian multilevel quantile regression with assuming ,0,« (in simulations 4 and 6) or ,0,1 + G % !«!(in simulations 5 and 7) were treated as gold standards since they were based on the accurate/appropriate assumption for « Ý . For each replication, the Bayesian multi-level quantile regression process executed Markov chain Monte Carlo (MCMC) to generate the sampled posterior densities of parameters of interest with simple size of 7000 (based on 15000 draws with thinning rate 4 after a 15000 iteration burn-in period). Furthermore, convergence was assessed based on the trace plots of the Markov chains and the Gelman-Rubin statistic (larger than 1.1 deemed as not properly converged, based on 2 chains). The performance of the methods is evaluated based on the relative bias averaged over replications formulated as, q?7{ @ H L ,g ! = bb ∑ I ê ©,K A I ©,K BI ©,K B bb C> , where H L ,g C is the estimated effect of G g on the conditional quantile τ for the rth replication and H ,g is the true effect on the quantile of the conditional distribution of response. All analyses were conducted using R 2.12.2 and WinBugs14. 84 3.5 Results 3.5.1 Results of simulations based on “location-shift” models Simulation 1 Table 3.1 indicates that all four approaches had adequate performance in estimating the 85 th percentile slope (relative bias: -0.009% - 0.006%). When regarding the performance of estimating the 85 th intercept, table 3.1 indicates that applying MCEM is not superior to the naïve approach (21.06% reduced to 18.21%). However, the performance of MCEM was improved enormously when the initial values were determined by Bayesian quantile regression (18.21% reduced to -0.032%). The reason why MCEM + Bayes might provide better estimates of the intercept and the slope (compared with MCEM) is that the Bayesian quantile regression part of MCEM + Bayes generates good initial values for the intercept and slope (closer to the true values of intercept and slope). Hence, it implies that quantile regression with random effects might be sensitive to initial values; also, it confirms that solutions provided by MCEM could be local extreme values instead of the desired global extreme values. When comparing Bayesian multilevel quantile regression and MCEM + Bayes, the performance of the MCEM + Bayes approach is only slightly better than that of the proposed Bayesian multilevel quantile regression approach (relative bias on 85 th percentile of intercept: 0.48% vs. -0.032%, table 3.1). Thus, once the initial values are determined by Bayesian quantile regression, the MCEM process did not enormously improve the accuracy of parameters. Regarding the computational time, the Bayesian multilevel quantile regression (average computation time = 12.42 minutes) is the most computationally efficient method compared to MECM approaches (MCEM = 30.98 mins; MCEM + Bayes = 30.89 mins). 85 In summary, the overall performance of estimating the intercept and slope based on MCEM + Bayes is as good as that based on our proposed Bayesian multilevel quantile regression. Moreover, the Bayesian multilevel quantile regression approach is more computationally efficient than MCEM + Bayes. Table 3.1 Results of Simulation 1 (comparison between four methods) Methods D Õ,3.EF D ×,3.EF EST (Bias) EST (Bias) Naive 1 2.407 (21.060%) 2.000 (0.003%) MCEM 2 2.407 (18.214%) 2.000 (-0.009%) MCEM + Bayes 3 2.036 (-0.032%) 2.000 (0.006%) Bayes Multilevel QR 4 2.046 (0.480%) 2.000 (<0.001%) 1. Ignoring the dependencies between observations of the same subject, i.e., treating the longitudinal data as cross-sectional. 2. The initial values of parameters were from the conventional quantile regression 3. The initial values of parameters were from the Bayesian quantile regression with assigning normal prior densities to parameters of interests 4. The prior densities of parameters of interests are uniform distribution Simulation 2 Table 3.2 indicates that Bayesian multilevel quantile regression and MCEM + Bayes approaches maintained adequate performance in estimating the 85 th percentile slope (relative bias: -0.039% - 0.670%) after including additional random subject-specific slopes. However, the naïve approach did poorly in estimating the 85 th percentile slope because of its shortcoming due to ignoring the random subject effects on slope. Moreover, when the initial values were from the conventional quantile regression approach (i.e., the naïve method), the relative bias for the slope was not diminished significantly via the MCEM process (50.04% reduced to 48.23%). This finding once again confirms that the MCEM approach could suffer from inappropriate choices of initial values for the parameters of interest. When considering the performance of the methods in estimating the 85 th %ile intercept, Table 3.2 indicates that the performance of MCEM was satisfactory whether the initial values were based on the naïve method or based on the Bayesian quantile regression (MCEM: 0.387%; 86 MCEM + Bayes: 0.318%). Even though it had large bias in estimating the slope (48.233%), the MCEM predicted the intercept with relatively high accuracy (0.387%). When comparing the proposed Bayesian multilevel quantile regression and MCEM + Bayes, the performance of the Bayesian multilevel quantile regression is only slightly better than that of MCEM + Bayes (relative bias on 85 th percentile of intercept: -0.022% vs. 0.318%, table 3.2). Thus, once the initial values are determined by Bayesian quantile regression, the MCEM process did not enormously improve the accuracy of parameters. Regarding the computational time, the Bayesian multilevel quantile regression (average computation time = 10.38 minutes) is the most computationally efficient method compared to MECM approaches (MCEM = 106.15 mins; MCEM + Bayes = 48.61 mins). In summary, the overall performance of estimating the intercept and slope based on MCEM + Bayes is as good as that based on Bayesian multilevel quantile regression. However, Bayesian multilevel quantile regression is more computationally efficient than MCEM + Bayes. Table 3.2 Results of Simulation 2 (comparison between four methods) Methods D Õ,3.EF D ×,3.EF EST (Bias) EST (Bias) Naive 1 1.552 (-23.782%) 3.001 (50.041%) MCEM 2 2.044 (0.387%) 2.965 (48.233%) MCEM + Bayes 3 2.043 (0.318%) 1.999 (-0.039%) Bayes Multilevel QR 4 2.036 (-0.022%) 2.013 (0.670%) 1. Ignoring the dependencies between observations of the same subject, i.e., treating the longitudinal data as cross-sectional. 2. The initial values of parameters were from the conventional quantile regression 3. The initial values of parameters were from the Bayesian quantile regression with assigning normal prior densities to parameters of interests 4. The prior densities of parameters of interests are uniform distribution 87 3.5.2 Results of simulations based on “location-scale-shift” models Simulation 3 In this simulation scenario, we have 2-level data (temporal, subject) with two covariates (X and Z). However, only the intercept and the coefficient of X are quantile-dependent while the coefficient of Z is constant across quantiles. In table 3.3, the first row contains the results based on the proposed Bayesian multilevel quantile regression while the results based on LQMM are shown on the second row of the table. The third row of table 3.3 shows the results published by Geraci and Bottai in 2014. In summary, applying LQMM on our simulated data was providing similar results to those published by Geraci and Bottai (2014). Furthermore, Table 3.3 indicates that all approaches had outstanding performance in estimating the 75 th percentile slope (relative bias: <1%). As we discussed previously (in Section 3.3), LQMM assumes that the quantile residuals follow ,0,« regardless of the errors structure while our proposed approach assumes ,0,1 + a X ˆ‰ + a 3 Z ˆ‰ !«!. Furthermore, our proposed approach will automatically determine the coefficients of a and a 3 based on the nature of the data. If all a ˆ s’ close to zero, the errors could be homoscedastic. On the other hand, if the coefficient of X changes across quantiles, the a should be significantly different from zero. Table 3.3 shows that the bias is very small regardless of whether we assumed ,0,« or ,0,1 + a X ˆ‰ + a 3 Z ˆ‰ !«!. To further confirming that both assumptions would produce similar results, we conducted simulations 4, 5, 6, and 7. Here, our simulations imitated the structure of the BMI data from the CHS. 88 Table 3.3 Results of Simulation 3 (Comparing the proposed Bayesian Multilevel QR with LQMM and to Estimate the coefficients at 75%ile percentiles based on the data simulated in Geraci and Bottai’s paper in 2014) Method Intercept 1 X 2 Z 3 EST Bias EST Bias EST Bias Bayesian Multilevel QR 101.510 <0.0001 2.389 0.005 1.008 0.008 LQMM 101.544 0.0004 2.364 -0.005 0.996 -0.004 LQMM (from GB 2014 paper) NA <0.0001 NA -0.005 NA 0.012 1. The true value of Intercept of 75%ile function is 101.508. 2. The true value of X’s coefficient of 75%ile function is 2.377. 3. The true value of Z’s coefficient of 75%ile function is 1. 3.5.3 Results of simulation based on CHS BMI data structure Simulation 4 In this simulation scenario, we have 2-level data (temporal, subject) with homoscedastic errors. Thus, only the intercept is quantile-dependent while the coefficient of X is constant across quantiles. Table 3.4, the first row contains the results based on the proposed multilevel quantile regression with an accurate assumption of the quantile residuals’ distribution (i.e., when we know the true errors structure). In this case, the accurate distributional assumption is ,0,« . The results in row 1 are treated as gold standard. In addition, the second row includes the results based on our proposed approach assuming ,0,1 + aX ˆ‰ !«!. In this case, the estimated a should be ideally close to zero. The results based on LQMM are shown in the third row of Table 3.4. In summary, the performance of LQMM (-0.1% on intercept and 1.7% on X ˆ‰ ) was as good as that of our approach that assumed ,0,« (-0.5% on intercept and 1.1% on X ˆ‰ ) while our approach that assumed ,0,1 + aX ˆ‰ !«! provided slightly bigger bias on etstimating coefficient of X ˆ‰ (2.2%, Table 3.4) which might be due to introducing additional parameter a that modifying the scale parameter («) in ALD. 89 Table 3.4 Results of Simulation 4 Methods Intercepts Slopes Ave. Est. Relative Bias (%) Ave. Est. Relative Bias (%) Gold Standard 25.910 -0.487 2.021 1.056 Multilevel QR 25.864 -0.661 2.045 2.245 LQMM 25.998 -0.149 2.034 1.680 Simulation 5 In this simulation scenario, we have 2-level data (temporal, subject) with heteroscedastic errors. Thus, both the intercept and the coefficient of X are quantile-dependent. In Table 3.5, the first row contains the results based on the proposed multilevel quantile regression with an accurate assumption of the quantile residuals’ distribution (i.e., when we know the true errors structure). In this case, the accurate assumption is ,01 + X ˆ‰ !,«!. The results in row 1 are treated as gold standard. In addition, the second row includes the results based on our proposed approach with assuming ,0,1 + aX ˆ‰ !«!. In this case, the estimated a should be ideally close to one. In addition, the third row includes the results based on our proposed approach with assuming ,0,« . This approach used the same assumption as that used by Geraci and Boottai (2007, 2014). The results based on LQMM are shown in the fourth row of Table 3.4. In summary, all four approaches provided results with small bias in estimating the intercept (from -0.6% to 1.8%, Table 3.5). However, the LQMM and the Bayesian quantile method that assumed ,0,« estimated the slope with very large bias (-10.8% and -11.9%, Table 3.5). Thus, these results indicated that using the wrong assumption about the distribution of the quantile residuals would lead to biased estimates of covariate effects. Based on the Table 3.5, the results based on our proposed approach (assuming ,0,1 + aX ˆ‰ !«!) are the closet 90 to those based on the “Gold standard” model/approach (intercept: -1.1% vs. -0.6%; slope: 1.9% vs. -0.7%). We also note that the bias in the slope could be reduced from -11.9% to 1.9% by using the assumption ,0,1 + aX ˆ‰ !«! (instead using ,0,« ). This is of great practical importance in applying our proposed approach. Table 3.5 Results of Simulation 5 Methods Intercepts Slopes Ave. Est. Relative Bias (%) Ave. Est. Relative Bias (%) Gold Standard 25.887 -0.576 3.014 -0.728 Multilevel QR 25.741 -1.133 3.093 1.852 Bayes ALD 26.492 1.751 2.677 -11.852 LQMM 26.493 1.753 2.709 -10.780 Simulation 6 In this simulation scenario, we have 3-level data (temporal, subject, and town) with homoscedastic errors. Thus, only the intercept is quantile-dependent while the coefficient of X is constant across quantiles. In Table 3.6, the first row contains the results based on the proposed multilevel quantile regression with an accurate assumption of the quantile residuals’ distribution (i.e., when we know the true errors structure). In this case, the accurate assumption is ,0,« . The results in row 1 are treated as gold standard. In addition, the second row includes the results based on our proposed approach assuming ,0,1 + aX ˆ‰ !«!. In this case, the estimated a should be ideally close to zero. The LQMM was not applied to this simulation since the LQMM function in R is not feasible to fit a model with more than 2 levels. In summary, the results showed in Table 3.6 are similar to Table 3.4. The proposed Bayesian multilevel quantile regression assuming ,0,1 + aX ˆ‰ !«! provided similarly small bias in estimating the intercept (-0.6%), but slightly bigger relative bias for the coefficient of X ˆ‰ (2.2%, Table 3.4), compared to that based on the gold standard. The slightly larger bias in estimating 91 slope could be due to the introduction of the additional parameter a in the likelihood function of ALD. We also evaluated the performance of our proposed approach in terms of the coverage rates of the parameters. Here, the coverage rate is the proportion of number of replicated data’s 95% credible intervals that covered the true values of the parameters. The conclusion based on the coverage rate still similar to that based on the relative bias. The proposed Bayesian multilevel quantile regression assuming ,0,1 + aX ˆ‰ !«! provided similarly high coverage rate in estimating the intercept (95.4%) and the slope (96.8%) compared to that based on the gold standard (intercept: 97.0%; slope: 97.4%). Table 3.6 Results of Simulation 6 Methods D Õ,3.EF D ×,3.EF Ave. Est. Relative Bias (%) Coverage Rate (%) Ave. Est. Relative Bias (%) Coverage Rate (%) Gold Standard 25.913 -0.5 97.0% 2.018 0.9 97.4% Multilevel QR 25.873 -0.627 95.4% 2.046 2.286 96.8% Simulation 7 In this simulation scenario, we have 3-level data (temporal, subject, and town) with heteroscedastic errors. Thus, both the intercept and coefficient of X are quantile-dependent. In Table 3.7, the first row contains the results based on the proposed multilevel quantile regression with an accurate assumption of the quantile residuals’ distribution (i.e., when we know the true errors structure). In this case, the accurate assumption is ,01 + X ˆ‰ !,«!. The results in row 1 are treated as gold standard. In addition, the second row includes the results based on our proposed approach assuming ,0,1 + aX ˆ‰ !«!. In this case, the estimated a should be ideally close to one. In addition, the third row includes the results based on our proposed approach assuming ,0,« . This approach made the same assumption as that used by 92 Geraci and Boottai (2007, 2014). Note that the LQMM was not applied to this simulation scenario because the LQMM function in R is not able to fit a model with more than 2 levels. In summary, all three approaches provided results with small bias in estimating the intercept (from -0.5% to 1.8%, Table 3.7). However, the Bayesian quantile method assuming ,0,« estimated the slope with large bias (-12.2%, Table 3.7). In addition, while checking the coverage rate, the proposed Bayesian multilevel quantile regression assuming ,0,1 + aX ˆ‰ !«! provided relatively high coverage rate in estimating the intercept (92.6%) and the slope (97.8%) compared to that based on approach assuming ,0,« (intercept: 84.2%; slope: 89.8%). Thus, these results indicated that making the wrong assumption about the quantile residuals would lead to biased estimates of covariate effects and lower coverage rate. Specifically, the bias of the slope could be reduced from -12.2% to 1.6% and coverage rate increased from 89.8% to 97.8% by using our proposed approach with assumption ,0,1 + aX ˆ‰ !«! (instead of using ,0,« ). Table 3.7 Results of Simulation 7 Methods Intercepts Slopes Ave. Est. Relative Bias (%) Coverage Rate (%) Ave. Est. Relative Bias (%) Coverage Rate (%) Gold Standard 25.897 -0.536 97.0% 3.008 -0.944 98.4% Multilevel QR 25.753 -1.089 92.6% 3.086 1.622 97.8% Bayes ALD 26.514 1.800 84.2% 2.665 -12.2 89.8% 3.6 Application (Real Data Analysis) In section 3.5, we demonstrated that our proposed Bayesian multilevel quantile regression could successfully estimate the quantile-specific effects of health risk factors, and the procedure has good performance in finite sample settings. Thus, in this section, we present results for real data analysis using CHS data as an illustration. Because our analysis in this 93 chapter is mainly intended to be illustrative (and not exuastive), we will be guided by the final models that were identified in our more thorough analysis in Chapter 2. 3.6.1 Children’s Health Study (Data description) We analyzed a subset of the data (used in Chapter 2) for 757 children, focusing on a 4- year follow-up period to avoid issues of nonlinearity in the growth trajectories. The main objective of this limited analysis is to demonstrate the ability to fit such models using our newly developed methods and associated software and to see if there any additional insights that could be gained through the use of the newly proposed modeling approach. A thorough analysis of the full CHS data using the new modeling approach is beyond the scope of this dissertation. Subjects included in this analysis were selected based on the following criteria: (1) subjects contributed at least 4 observations from age 13 yrs to 16 yrs; and (2) subjects without missing information on BMI, in utero tobacco smoke exposure, allergy status at baseline, and history of second hand smoke exposure (at baseline). Table 3.8 indicates that more subjects were recruited in 1996 (53% in Females, 47% in Males) and more non-Hispanic Whites (63% in Females, 70% in Males) were included in this study. 94 Table 3.8 Descriptive statistics of subject characteristics in CHS data used for Bayesian multi-level quantile regression analysis Major outcomes: Female (N=399, %=52.7%) Male (N=358, %=47.3) N (%) 1 Mean (SD) N (%) 1 Mean (SD) Cohort - C1 (recruited in 1993) 154 (38.6%) 149 (41.6%) - D (recruited in 1996) 245 (61.4%) 209 (58.4%) Ethnicity - Non-Hispanic White 251 (62.9%) 250 (69.8%) - Hispanic 148 (37.1%) 108 (30.2%) BMI at Entry (Age 13) - Non-Hispanic White 251 (62.9%) 20.61 (3.8) 250 (69.8%) 20.17 (4.1) - Hispanic 148 (37.1%) 21.79 (4.7) 108 (30.2%) 22.42 (5.0) BMI in year 4 (Age 16) - Non-Hispanic White 251 (62.9%) 22.45 (3.9) 250 (69.8%) 22.46 (4.3) - Hispanic 148 (37.1%) 23.53 (5.0) 108 (30.2%) 24.69 (5.7) 1. Percentage within gender. At study entry (age 13), Table 3.8 indicates that Hispanic participants were heavier than non-Hispanic Whites in both genders. Furthermore, at study entry Males were heavier than Females in both ethnicities. In Year 4 of the study, the average BMI of Hispanic participants were still higher than that of non-Hispanic Whites in both genders. The average BMI of Males was higher than that of Females in both ethnicities. In Table 3.9, more female participants were exposed to utero smoke (17%) and second hand smoke (30%) while more male subjects reported having allergy before study entry (32%). 95 Table 3.9 Descriptive statistics of considered covariates in CHS data used for Bayesian multi-level quantile regression analysis Female (N=399, %=52.7%) Male (N=358, %=47.3) N (%) 1 Mean (SD) N (%) 1 Mean (SD) Individual characteristics: Exposed to utero smoke - No 332 (83.2%) 312 (87.2%) - Yes 67 (16.8%) 46 (12.8%) Allergy - No 290 (72.7%) 243 (67.9%) - Yes 109 (27.3%) 115 (32.1%) Exposure variable of interest: Second hand smoke (ever) - No 281 (70.4%) 264 (73.7%) - Yes 118 (29.6%) 94 (26.3%) 1. Percentage within gender. 3.6.2 Methods Because we are using 4-year follow-up data from CHS and the growth trajectories have been shown to be linear in this group (Gauderman 2000, 2002), we fit multilevel models with linear growth curves for BMI. As mentioned in Chapter 2, the CDC’s BMI growth Charts were stratified by gender. In this analysis, we fitted models with full stratification by gender (Males and Females). Moreover, due to potential differences between genders and ethnicities, we also fitted models with full stratification by gender (Males and Females) and ethnicity (Hispanic Whites and Non-Hispanic Whites). According to the results shown in Chapter 2, we have also found that in-utero exposure to maternal smoking was an important confounder of the association between BMI and SHS. Moreover, the effect of SHS was modified by allergy status at baseline. Thus, in our illustrative models, we included in-utero exposure to maternal smoking as a confounder and allergy as an effect modifier for illustration purposes. The effect of SHS will be reported on both the growth 96 rate of 85%ile BMI and the attained 85%ile BMI level at age 13 (baseline). We fitted the following multilevel quantile regression model. Level 1: lX =s ! = H b= + H = x =s − 12.5! (eq. 3.57) Level 2a:H b= = H b + q b Hwℎw8x = + q %b )x{£Ì = + q ýb YIY = + q |b 7TT_89 = + q ý|b YIY = ∗ 7TT_89 = + _ b= (eq. 3.58) Level 2b:H = = q b + q Hwℎw8x = + q % )x{£Ì = + q ý YIY = + q | 7TT_89 = + q ý| YIY = ∗ 7TT_89 = + _ = (eq. 3.59) Level 3:H b = q b + _ b (eq. 3.60) We now define the various parameters and random effects in the multi-level model described above, for a fixed quantile . In the first level (eq. 3.57), subject specific intercepts and growth parameters (such as H b= , H = ) are extracted. In other words, H b= and H = represent the subject-specific random effects. In the second level, each of the subject-specific regression coefficients extracted from eq. 3.57 (H b= and H = ) is modeled as a function of subject-level variables at level 2 (eq. 3.58 and 3.59). More specifically, we assume that the errors at level 2 were independent normal distributions given by _ b= ~ã0,« J, 3 !, _ = ~ã0,« , 3 !. The term _ b= indicates the deviation of the subject-specific intercepts at level 2 away from the town- specific intercept at level 3 after accounting for subject-level covariates Hwℎw8x = , )x{£Ì = , YIY = , and 7TT_89 = . Also, « J, 3 and « , 3 are the variances of subject-specific intercepts and slopes at level 2. In the third level, each of the town-specific regression coefficients defined in eq. 3.60 is modeled as a function of group-level variables at level 3. More specifically, we assume that the errors at level 3 were normally distributed as given by _ b ~ã0,« J, 3 !. The term _ b indicates the deviation of the intercepts of each town c at level 3 from the overall intercept, q b . Here,« b, 3 is the variances of the town-specific intercepts. Hence, this multilevel quantile regression model 97 summarizes the distribution of the cluster-specific coefficients in two parts: (1) fixed effects that are invariant across groups (e.g. q b , q , q % ) and random effects (e.g. _ b= , _ = , _ b ) that depict group level heterogeneity left unexplained by the fixed covariates. Also note that, for a given , all the coefficients of fixed effects are -dependent while the random effects are left to be independent of because we generally condition on all the random effects. Most importantly, based on our proposed method to estimate the quantile-dependent fixed effects, we assume that the quantile residuals follow lX =s − lX =s !,« =s ,!, where « =s = &1 + 7x =s − 12.5!'«. If K is away from zero, the fixed effect estimates would change across quantiles. All analyses in this example were conducted using WinBugs 1.4.3 and R2WinBUGS package in R (version i386 3.0.2). 3.6.3 Results The estimated regression quantiles (25%, 50%, 75%, and 85%) and their 95% credible intervals are shown in Tables 3.10 and 3.11. The results based on the mean-based mixed effects modeling approach and 95% confidence intervals are provided for comparison. In Tables 3.10 and 3.11, within each category of gender or combination of gender and ethnicity, the first three rows indicates the estimated effects of SHS, allergy and interaction of SHS and allergy on the attained BMI level at age 13 yrs. Also, the last three rows show the estimated effects of SHS, allergy and interaction of SHS and allergy on the growth rate of BMI. In the Males group, the effect of SHS on the attained BMI levels at age 13 yrs was modified by allergy status at similar levels across all quantiles (85%: -1.10; 75%: -1.08; 50%: - 1.10; 25%: -1.10). The negative estimates of SHS*allergy’ coefficients indicates SHS effect on the attained BMI level at age 13 yrs were larger on subjects without allergy compared to those 98 with allergy. Also, note that this interaction effect on attained BMI levels at age 13 yrs was not statistically significantly different from zero. However, the interaction effect of SHS and allergy on the growth rate of BMI was weaker at the higher quantiles (85%ile: 0.20; 75%ile: 0.20) and stronger at the median and the lower quantules(50%: 0.26;25%: 0.31). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect (on the growth rate of BMI) was larger on subjects with allergy compared to those without allergy. Also, note that this interaction effect on the growth rate of BMI was not statistically significantly different from zero. The results based on the mean-based mixed effects modeling approach were similar to those based on the median regression but the estimated effects of SHS*allergy on both the attained BMI levels at age 13yrs were stronger. In the Females group, the interaction effect of SHS and allergy status on the attained BMI levels at age 13 was weaker at the 85 th percentile (1.16) compared to other quantiles (75%: 1.40; 50%: 1.49; 25%: 1.46). The positive estimates of the SHS*allergy’ coefficients indicate that the SHS effect on the attained BMI level at age 13 yrs was consistently larger for subjects with allergy compared to those without allergy across all quantiles, even though the interaction effect on attained BMI levels at age 13 yrs was not statistically significant across all quantiles. In contrast, the interaction effect of SHS and allergy on the growth rate of BMI was similar across all quantiles (85%ile: 0.12; 75%ile: 0.13; 50%: 0.10; 25%: 0.11). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect on the growth rate of BMI was larger for subjects with allergy compared to those without allergy. The interaction effect had similar magnitude all quantiles but not statistically significantly different from zero. We also note that the results from the mean-based mixed effects modeling approach were similar to those from 99 quantile regression models at the median (50%). The only exception was that the estimated effect of SHS*allergy on the attained BMI levels at age 13yrs was larger. In the Hispanic Males group, the effect of SHS on the attained BMI levels at age 13 was less modified at the median and lower quantiles (50%: 1.72; 25%: 1.75) compared to the upper quartiles (85%: 2.09; 75%: 2.00). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect on the attained BMI level at age 13 yrs was larger on subjects with allergy compared to those without allergy. Also, note that this interaction effect on attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the interaction effect of SHS and allergy on the growth rate of BMI was weaker at higher quantiles (85%ile: 0.25; 75%ile: 0.33; 50%: 0.47;25%: 0.52). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect on the growth rate of BMI was larger on subjects with allergy compared to those without allergy. Also, note that this interaction effect on the growth rate of BMI was not statistically significantly different from zero. The results from the mean-based mixed effects modeling approach were similar to those from the quantile regression at the median. In the non-Hispanic White Males group, the interaction effect of SHS and allergy status on the attained BMI levels at age 13 was weaker at the higher quantiles and lower quantiles and became stronger at the median (85%: -1.52, 75%: -1.55, 50%: -1.58; 25%: -1.52). The negative estimates of the SHS*allergy’ coefficients indicate that the SHS effect on the attained BMI level at age 13 yrs was consistently larger on subjects without allergy compared to those with allergy across quantiles. Also, note that this interaction effect on attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the interaction effect of SHS and allergy on the growth rate of BMI was weaker at the higher quantiles (85%ile: 0.22; 75%ile: 100 0.22; 50%: 0.26; 25%: 0.28). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect on the growth rate of BMI was larger on subjects with allergy compared to those without allergy. However, this interaction effect was weaker on subjects with BMI at higher quantiles. This interaction effect on the growth rate of BMI was not statistically significantly different from zero. The results from the corresponding mean-based mixed effects modeling approach were similar to those from the quantile regression at the median. However, the estimated effect of SHS*allergy on the attained BMI levels at age 13yrs was larger. In the Hispanic Females group, the interaction effect of SHS and allergy on the attained BMI levels at age 13 was stronger at the upper quartiles (85%: -4.07, 75%: -4.04) compared to that at the median (-3.74) and the lower quantile (25%: -3.48). The magnitude of this interaction effect was weaker at the median and the lower quantile. The negative estimates of SHS*allergy’ coefficients indicate that the SHS effect on the attained BMI level at age 13 yrs was larger on subjects without allergy compared to those with allergy, but not statistically significant. The interaction effect of SHS and allergy on the growth rate of BMI was stronger at the 75 th percentile and quantiles below 75 th percentile (85%ile: 0.60; 75%ile: 0.64; 50%: 0.63; 25%: 0.63). The positive estimates of SHS*allergy’ coefficients indicate that the SHS effect on the growth rate of BMI was larger for subjects with allergy compared to those without allergy. This interaction effect on the growth rate of BMI was statistically significant at the 75 th percentile and quantiles below 75 th percentile. However, this interaction effect on the growth rate of BMI became marginal at the 85 th percentile. In terms of statistical significance, the results from the mean-based mixed effects modeling approach draw the same conclusions as those from the quantile regression at the median. However, the estimated effects of SHS*allergy were slightly larger on both the attained BMI levels at age 13 yrs and the growth rate. 101 In the non-Hispanic White Females group, the interaction effect of SHS and allergy status on the attained BMI levels at age 13 was stronger at the higher quantiles (85%: 3.48, 75%: 3.37) compared to the median (3.32) and the lower quantile (25%: 3.09). The positive estimates of the SHS*allergy’ coefficients indicate that the SHS effect on the attained BMI level at age 13 yrs was consistently larger for subjects with allergy compared to those without allergy across all quantiles, even though for the interaction effect on attained BMI levels at age 13 yrs was not statistically significant across all quantiles. In contrast, the interaction effect of SHS and allergy on the growth rate of BMI was slightly weaker at the higher and lower quantiles compared to that at the median (85%ile: -0.06; 75%ile: -0.09; 50%: -0.15; 25%: -0.11). The negative estimates of SHS*allergy’ coefficients indicate that the SHS effect on the growth rate of BMI was larger for subjects without allergy compared to those with allergy. The interaction effect had smaller magnitude for subjects with BMI at the higher and lower quantiles but not statistically significantly different from zero. We also note that the results from the mean-based mixed effects modeling approach were similar to those from quantile regression models at the median. The only exception was that the estimated effect of SHS*allergy on the attained BMI levels at age 13yrs was larger. In conclusion, the interaction effects of SHS and allergy on the attained BMI level at age 13 yrs and the growth rate of BMI were inconsistent across genders and ethnicities. For the interaction effect on the attained BMI level at age 13 yrs, we found qualitatively positively stronger effects at higher quartiles in Hispanic boys and non-Hispanic White girls. In contrast, we found that the interaction effect was negatively larger at the upper quantiles in the group of the Hispanic girls. Meanwhile, in non-Hispanic White boys, the interaction the interaction effect was negatively larger at the median and smaller at the lower and upper quantiles. 102 Regarding the interaction effect of SHS and allergy on the growth rate of BMI, inconsistent results between genders and ethnicities were observed as well. We found stronger effect at the 75 th percentile and quartiles below 75 th percentile in Hispanic girls. In addition, this interaction effect was statistically significant across quantiles except that at the 85 th percentile. In contrast, the effect of SHS*allergy was generally weaker at higher quantiles in the other three groups (in Hispanic Male group, non-Hispanic White Male, and non-Hispanic White Female). The only interaction effect of SHS and allergy on the growth rate of BMI that was found to be statistically significant was that for the Hispanic Female group. A more complete analysis of CHS data across more cohorts with longer follow-up period to cover the entire childhood age range might provide more meaningful and potentially statistically significant results. Because the interaction effects of second hand smoke exposure and allergy status were not found to be statistically significant, we refitted models without including allergy status and the interaction term of second hand smoke exposure and allergy status (see Table 3.12). In the Males group, the effect of SHS on the attained BMI levels at age 13 yrs was smaller at upper and lower quartiles (85%: 0.32, 75%: 0.28, 25%: 0.56) compared to that at the median (0.56). The positive estimates of SHS coefficients indicate that SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI at a given percentile. Also, note that this effect on attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was weaker at the median and lower quantiles but stronger at the higher quantiles (85%ile: 0.29; 75%ile: 0.29; 50%: 0.16; 25%: 0.20). The positive estimates of SHS coefficients indicate that the SHS effect on the growth rate of BMI was positively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was 103 statistically significantly different from zero. The results from the mean-based mixed effects modeling approach were similar to those from the quantile models at the median. However, the estimated effects of SHS on the attained BMI levels at age 13 yrs were relatively smaller while the SHS had larger effect on the growth rate of BMI when comparing results from the mean- based mixed effects modeling approach to those from the median regression approach. In the Females group, the effect of SHS on the attained BMI levels at age 13 was stronger at the upper and lower quartiles (85%: 1.02, 75%: 1.07, 25%: 0.94) compared to that at the median (0.66). The positive estimates of SHS coefficients indicate that SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI levels at a given percentile. Also, note that this effect on the attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was stronger at the median (85%ile: -0.03; 75%ile: -0.04; 50%: -0.10; 25%: -0.06). The negative estimates of SHS coefficients indicate that the SHS effect on the growth rate of BMI was negatively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was not statistically significantly different from zero. The results from the mean-based mixed effects modeling approach were similar to those from the median regression approach. However, the magnitude of SHS’s effect based on the mean-based regression approach was larger on the attained BMI levels at age 13 yrs but smaller on the growth rate of BMI compared to that based on the median regression approach. When we further used data stratified by gender and ethnicity, we note the results are quite different between genders and ethnicities. In the Hispanic Males group, the effect of SHS on the attained BMI levels at age 13 yrs was smaller at the upper quartiles and median compared to that at the lower quantiles (85%: 0.77; 75%: 0.83; 50%: 0.99; 25%: 1.14). The positive estimates of 104 SHS coefficients indicate that SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI levels at a given percentile. Also, note that this effect on the attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was larger at upper quantiles (85%ile: 0.36; 75%ile: 0.32; 50%: 0.20; 25%: 0.09). The positive estimates of SHS coefficients indicate that the SHS effect on the growth rate of BMI was positively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was not statistically significantly different from zero. The results from the mean-based mixed effects modeling approach were similar to those at the median. However, the magnitude of SHS’s effect based on the mean-based regression approach was larger on the attained BMI levels at age 13 yrs compared to that based on the median regression approach. In the non-Hispanic White Males group, the effect of SHS on the attained BMI levels at age 13 was weaker at the upper quartiles (85%: 0.09, 75%: 0.11) compared to the median (0.16) and the lower quantile (25%: 0.28). The positive estimates of the SHS coefficients indicate that the SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI levels at a given percentile. Also, note that this effect on the attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was similar between the upper quantiles and the median (85%ile: 0.27; 75%ile: 0.28; 50%: 0.26) but smaller at the lower quantiles (25%: 0.21). The positive estimates of SHS coefficients indicate that the SHS effect on the growth rate of BMI was positively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was statistically significantly different from zero. The results 105 from the mean-based mixed effects modeling approach were similar to those from the median regression approach. In the Hispanic Females group, the effect of SHS on the attained BMI levels at age 13 yrs was stronger at the upper quartiles (85%: 1.41, 75%: 1.38) compared to the median (1.26) and lower quantile (25%: 1.22). The positive estimates of SHS coefficients indicate that SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI levels at a given percentile. Also, note that this effect on the attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was similar across quantiles (85%ile: -0.26; 75%ile: -0.26; 50%: -025; 25%: -0.24). The negative estimates of SHS coefficients indicate that the SHS effect on the growth rate of BMI was negatively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was not statistically significantly different from zero. The results from the mean-based mixed effects modeling approach show the stronger effect of SHS on the attained BMI levels and the growth rate of BMI compared to those from the median regression approach. In the non-Hispanic White Females group, the effect of SHS on the attained BMI levels at age 13 was weaker at the upper and lower quartiles (85%: 0.96, 75%: 1.01; 25%: 0.90) compared to the median (1.02). The positive estimates of SHS coefficients indicate that the SHS effect on the attained BMI levels at age 13 yrs was positively associated with increasing BMI levels at a given percentile. Also, note that this effect on the attained BMI levels at age 13 yrs was not statistically significantly different from zero. In contrast, the effect of SHS on the growth rate of BMI was stronger at the upper quantiles (85%ile: 0.14; 75%ile: 0.12; 50%: 0.06; 25%: 0.05). The positive estimates of SHS coefficients indicate that the SHS effect on the 106 growth rate of BMI was positively associated with elevating the growth rate of BMI at a given percentile. Also, note that this SHS effect on the growth rate of BMI was not statistically significantly different from zero. The results from the mean-based mixed effects modeling approach show the weaker effect of SHS on the attained BMI levels but stronger effect on the growth rate of BMI compared to those from the median regression approach. Table 3.10 Estimates of Second Hand Smoke Exposure by Selected Effect Modifiers Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender) Mean BMI 1 25%ile BMI 2 50%ile BMI 2 75%ile BMI 2 85%ile BMI 2 Est. (SE) 95%CI 3 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Male SHS 0.79 (0.7) (-0.5, 2.1) 0.87 (0.7) (-0.4, 2.2) 0.73 (0.7) (-0.6, 2.1) 0.72 (0.7) (-0.6, 2.1) 0.71 (0.7) (-0.6, 2.1) Allergy 0.18 (0.6) (-0.9, 1.3) 0.09 (0.6) (-1.0, 1.2) 0.16 (0.6) (-1.0, 1.3) 0.18 (0.6) (-0.9, 1.3) 0.15 (0.6) (-/01, 1.3) SHS:allergy -1.15 (1.1) (-3.3, 1.0) -1.10 (1.1) (-3.3, 1.0) -1.10 (1.1) (-3.2, 1.0) -1.08 (1.1) (-3.2, 1.1) -1.10 (1.1) (-3.4, 1.0) age:SHS 0.17 (0.1) (0.0, 0.4) 0.09 (0.1) (-0.1, 0.3) 0.15 (0.1) (0.0, 0.4) 0.22 (0.1) (0.0, 0.4) 0.22 (0.1) (0.0, 0.4) age:allergy -0.19 (0.1) (-0.4, 0.0) -0.20 (0.1) (-0.4, .00) -0.17 (0.1) (-0.3, 0.0) -0.16 (0.1) (-0.3, 0.0) -0.16 (0.1) (-0.3, 0.0) age:SHS:allergy 0.25 (0.2) (-0.1, 0.6) 0.31 (0.2) (0.0, 0.6) 0.26 (0.2) (0.0, 0.6) 0.20 (0.2) (-0.1, 0.5) 0.20 (0.2) (-0.1, 0.5) Female SHS 0.61 (0.6) (-0.5, 1.8) 0.51 (0.6) (-0.6, 1.6) 0.73 (0.6) (-0.5, 1.9) 0.73 (0.6) (-0.5, 1.9) 0.73 (0.6) (-0.5, 1.9) Allergy -0.63 (0.5) (-1.7, 0.4) -0.65 (0.5) (-1.7, 0.4) -0.59 (0.5) (-1.6, 0.4) -0.43 (0.6) (-1.5, 0.7) -0.38 (0.6) (-1.5, 0.7) SHS:allergy 1.64 (1.0) (-0.4, 3.6) 1.46 (1.0) (-0.6, 3.4) 1.49 (1.0) (-0.5, 3.5) 1.40 (1.1) (-0.8, 3.6) 1.16 (1.1) (-1.0, 3.3) age:SHS -0.08 (0.1) (-0.3, 0.1) -0.10 (0.1) (-0.3, 0.1) -0.10 (0.1) (-0.3, 0.1) -0.08 (0.1) (-0.3, 0.1) -0.05 (0.1) (-0.3, 0.2) age:allergy -0.08 (0.1) (-0.3, 0.1) -0.08 (0.1) (-0.3, 0.1) -0.07 (0.1) (-0.3, 0.1) -0.10 (0.1) (-0.3, 0.1) -0.11 (0.1) (-0.3, 0.1) age:SHS:allergy 0.12 (0.2) (-0.2, 0.5) 0.11 (0.2) (-0.2, 0.5) 0.10 (0.2) (-0.2, 0.4) 0.13 (0.2) (-0.2, 0.5) 0.12 (0.2) (-0.3, 0.5) 1. Based on mixed effects modeling with adjusted for cohort, ethnicity, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 2. Based on the proposed Bayesian multilevel quantile regression with adjusted for cohort, ethnicity, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 3. 95% confidence interval 4. 95% credible interval 107 Table 3.11 Estimates of Second Hand Smoke Exposure by Selected Effect Modifiers Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender and Ethnicity) Mean BMI 1 25%ile BMI 2 50%ile BMI 2 75%ile BMI 2 85%ile BMI 2 Est. (SE) 95%CI 3 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Male + Hispanic SHS 0.89 (1.3) (-1.7, 3.5) 0.78 (1.4) (-2.0, 3.5) 0.63 (1.4) (-2.1, 3.4) 0.39 (1.4) (-2.4, 3.2) 0.36 (1.4) (-2.5, 3.2) allergy -1.79 (1.3) (-4.4, 0.8) -1.71 (1.3) (-4.3, 1.0) -1.61 (1.4) (-4.3, 1.0) -1.72 (1.4) (-4.4, 1.0) -1.74 (1.4) (-4.5, 1.0) SHS:allergy 1.78 (2.8) (-3.7, 7.3) 1.75 (2.9) (-4.0, 7.4) 1.72 (2.9) (-4.0, 7.3) 2.00 (3.0) (-3.8, 7.9) 2.09 (2.9) (-3.7, 7.7) age:SHS 0.13 (0.2) (-0.3, 0.6) -0.02 (0.2) (-0.5, 0.4) 0.09 (0.2) (-0.4, 0.5) 0.24 (0.2) (-0.2, 0.7) 0.29 (0.2) (-0.2, 0.8) age:allergy -0.11 (0.2) (-0.5, 0.3) -0.08 (0.2) (-0.5, 0.4) -0.16 (0.2) (-0.6, 0.3) -0.15 (0.2) (-0.6, 0.3) -0.13 (0.2) (-0.6, 0.3) age:SHS:allergy 0.44 (0.5) (-0.5, 1.4) 0.52 (0.5) (-0.5, 1.5) 0.47 (0.5) (-0.5, 1.4) 0.33 (0.5) (-0.6, 1.3) 0.25 (0.5) (-0.7, 1.2) Male + non-Hispanic Whites SHS 0.77 (0.8) (-0.8, 2.3) 0.86 (0.8) (-0.7, 2.4) 0.78 (0.8) (-0.8, 2.3) 0.68 (0.8) (-0.9, 2.3) 0.67 (0.8) (-0.9, 2.2) allergy 0.60 (0.6) (-0.6, 1.8) 0.52 (0.6) (-0.7, 1.7) 0.51 (0.6) (-0.7, 1.8) 0.50 (0.6) (-0.7, 1.8) 0.49 (0.7) (-0.8, 1.8) SHS:allergy -1.62 (1.2) (-3.9, 0.7) -1.52 (1.2) (-3.8, 0.8) -1.58 (1.2) (-3.9, 0.8) -1.55 (1.2) (-4.0, 0.8) -1.52 (1.2) (-3.9, 0.9) age:SHS 0.17 (0.1) (-0.1, 0.4) 0.11 (0.1) (-0.1, 0.3) 0.15 (0.1) (-0.1, 0.4) 0.19 (0.1) (0.0, 0.4) 0.19 (0.1) (-0.1, 0.4) age:allergy -0.22 (0.1) (-0.4, 0.0) -0.22 (0.1) (-0.4, 0.0) -0.18 (0.1) (-0.4, 0.0) -0.17 (0.1) (-0.4, 0.0) -0.17 (0.1) (-0.4, 0.0) age:SHS:allergy 0.25 (0.2) (-0.1, 0.6) 0.28 (0.2) (-0.1, 0.6) 0.26 (0.2) (-0.1, 0.6) 0.22 (0.2) (-0.1, 0.6) 0.22 (0.2) (-0.1, 0.6) Female + Hispanic SHS 2.54 (1.1) (0.4, 4.7) 2.27 (1.1) (0.1, 4.4) 2.43 (1.1) (0.2, 4.6) 2.63 (1.1) (0.4, 4.9) 2.68 (1.2) (0.4, 4.9) allergy 0.92 (1.1) (-1.2, 3.1) 0.92 (1.1) (-1.2, 3.1) 0.87 (1.1) (-1.3, 3.1) 0.91 (1.2) (-1.4, 3.2) 0.89 (1.2) (-1.4, 3.1) SHS:allergy -3.79 (2.0) (-7.6, 0.1) -3.48 (1.9) (-7.3, 0.3) -3.74 (2.0) (-7.7, 0.1) -4.04 (2.1) (-8.1, 0.0) -4.07 (2.1) (-8.2, 0.1) age:SHS -0.49 (0.2) (-0.8, -0.2) -0.43 (0.2) (-0.7, -0.1) -0.44 (0.2) (-0.8, -0.1) -0.46 (0.2) (-0.8, -0.1) -0.46 (0.2) (-0.8, -0.1) age:allergy -0.22 (0.2) (-0.6, 0.1) -0.21 (0.2) (-0.5, 0.1) -0.19 (0.2) (-0.5, 0.1) -0.19 (0.2) (-0.5, 0.2) -0.17 (0.2) (-0.5, 0.2) age:SHS:allergy 0.69 (0.3) (0.1, 1.3) 0.63 (0.3) (0.0, 1.2) 0.63 (0.3) (0.0, 1.2) 0.64 (0.3) (0.0, 1.2) 0.60 (0.3) (0.0, 1.2) Female + non-Hispanic Whites SHS 0.06 (0.7) (-1.3, 1.4) 0.15 (0.7) (-1.2, 1.4) 0.23 (0.7) (-1.2, 1.6) 0.21 (0.7) (-1.2, 1.6) 0.12 (0.7) (-1.3, 1.5) allergy -1.12 (0.6) (-2.3, 0.1) -1.04 (0.6) (-2.2, 0.1) -1.00 (0.6) (-2.2, 0.2) -0.95 (0.6) (-2.2, 0.3) -0.96 (0.7) (-2.3, 0.3) SHS:allergy 3.37 (1.2) (1.1, 5.7) 3.09 (1.1) (0.9, 5.3) 3.32 (1.2) (0.9, 5.7) 3.37 (1.2) (0.9, 5.8) 3.48 (1.3) (1.0, 6.0) age:SHS 0.16 (0.1) (-0.1, 0.4) 0.07 (0.1) (-0.2, 0.3) 0.08 (0.1) (-0.2, 0.3) 0.13 (0.1) (-0.1, 0.4) 0.15 (0.1) (-0.1, 0.4) age:allergy 0.01 (0.1) (-0.2, 0.2) -0.02 (0.1) (-0.2, 0.2) -0.02 (0.1) (-0.2, 0.2) -0.05 (0.1) (-0.3, 0.2) -0.08 (0.1) (-0.3, 0.2) age:SHS:allergy -0.16 (0.2) (-0.6, 0.3) -0.11 (0.2) (-0.5, 0.3) -0.15 (0.2) (-0.6, 0.3) -0.09 (0.2) (-0.5, 0.4) -0.06 (0.2) (-0.5, 0.4) 1. Based on mixed effects modeling with adjusted for cohort, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 2. Based on the proposed Bayesian multilevel quantile regression with adjusted for cohort, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 3. 95% confidence interval 4. 95% credible interval 108 Table 3.12 Estimates of Second Hand Smoke Exposure Based on Proposed Bayesian Multilevel Quantile regression (Data with fully Stratified by Gender) Mean BMI 25%ile BMI 50%ile BMI 75%ile BMI 85%ile BMI Est. (SE) 95%CI 3 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Est. (SE) 95%CI 4 Male 1,2 SHS 0.40 (0.6) (-0.7, 1.5) 0.56 (0.6) (-0.5, 1.7) 0.76 (0.7) (-0.6, 2.1) 0.28 (0.6) (-0.8, 1.4) 0.32 (0.6) (-0.8, 1.5) age:SHS 0.26 (0.1) (0.1, 0.4) 0.20 (0.1) (0.0, 0.4) 0.16 (0.1) (0.0, 0.4) 0.29 (0.1) (0.1, 0.5) 0.29 (0.1) (0.1, 0.5) Female 1, 2 SHS 1.02 (0.5) (0.0, 2.0) 0.94 (0.5) (-0.1, 1.9) 0.66 (0.6) (-0.5, 1.8) 1.07 (0.5) (0.0, 2.1) 1.02 (0.6) (-0.1, 2.1) age:SHS -0.05 (0.1) (-0.2, 0.1) -0.06 (0.1) (-0.2, 0.1) -0.10 (0.1) (-0.3, 0.1) -0.04 (0.1) (-0.2, 0.1) -0.03 (0.1) (-0.2, 0.2) Male + Hispanic 3, 4 SHS 1.28 (1.2) (-1.1, 3.6) 1.14 (1.2) (-1.2, 3.6) 0.99 (1.2) (-1.4, 3.5) 0.83 (1.3) (-1.7, 3.3) 0.77 (1.3) (-1.7, 3.3) age:SHS 0.22 (0.2) (-0.2, 0.6) 0.09 (0.2) (-0.3, 0.5) 0.20 (0.2) (-0.2, 0.6) 0.32 (0.2) (-0.1, 0.7) 0.36 (0.2) (0.0, 0.8) Male + non-Hispanic Whites 3, 4 SHS 0.16 (0.6) (-1.1, 1.4) 0.28 (0.6) (-1, 1.5) 0.16 (0.6) (-1.1, 1.5) 0.11 (0.7) (-1.2, 1.4) 0.09 (0.7) (-1.2, 1.4) age:SHS 0.26 (0.1) (0.1, 0.5) 0.21 (0.1) (0, 0.4) 0.26 (0.1) (0.1, 0.4) 0.28 (0.1) (0.1, 0.5) 0.27 (0.1) (0.1, 0.5) Female + Hispanic 3, 4 SHS 1.39 (0.9) (-0.4, 3.2) 1.22 (0.9) (-0.5, 3.0) 1.26 (0.9) (-0.6, 3.1) 1.38 (1.0) (-0.5, 3.2) 1.41 (1.0) (-0.5, 3.4) age:SHS -0.29 (0.1) (-0.6, 0.0) -0.24 (0.1) (-0.5, 0.0) -0.25 (0.1) (-0.5, 0.0) -0.26 (0.1) (-0.6, 0.0) -0.26 (0.2) (-0.6, 0.0) Female + non-Hispanic Whites 3, 4 SHS 0.91 (0.6) (-0.3, 2.1) 0.90 (0.6) (-0.3, 2.1) 1.02 (0.6) (-0.2, 2.3) 1.01 (0.7) (-0.3, 2.3) 0.96 (0.7) (-0.4, 2.3) age:SHS 0.13 (0.1) (-0.1, 0.4) 0.05 (0.1) (-0.2, 0.3) 0.06 (0.1) (-0.2, 0.3) 0.12 (0.1) (-0.1, 0.4) 0.14 (0.1) (-0.1, 0.4) 1. Based on mixed effects modeling with adjusted for cohort, ethnicity, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 2. Based on the proposed Bayesian multilevel quantile regression with adjusted for cohort, ethnicity, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 3. Based on mixed effects modeling with adjusted for cohort, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 4. Based on the proposed Bayesian multilevel quantile regression with adjusted for cohort, in-utero smoke exposure, and including random subject intercepts, random subject slopes, and random town intercepts. 5. 95% confidence interval 6. 95% credible interval 109 3.7 Conclusion Due to ignoring the random effects (dependence among observations from the same subject), the naïve method was providing consistently inadequate estimates of the 85 th percentile slope and intercept except for the slope in simulation 1. In addition, even when the initial values were generated based on the conventional quantile regression (naïve) method, the performance of MCEM was consistently inferior in estimating intercept (simulation 1) or the slope (simulation 2). To assess the extent to whether the MCEM was sensitive to different initial values, the MCEM with initial values based on Bayesian quantile regression (MCEM + Bayes) approach was applied in the simulation studies as well. According to the results of all three simulation studies, the performance of MCEM + Bayes was favorable in simulations 1 and 2. However, when more random effects added from various levels, MCEM + Bayes estimated the intercept and slope parameters with large bias and suffered from heavier computational burden. Compared to the naïve, MCEM, and MCEM + Bayes approaches, the fully Bayesian multilevel quantile regression performed much better in the first two simulation studies, in terms of bias of slope and intercept and computational efficiency. Thus, the Bayesian multilevel quantile regression is more preferable to the other three approaches when errors are homoscedastic. Geraci and Bottai (2014) proposed LQMM that would allow the inclusion of more random effects without suffering from computational burden. By conducting the simulation studies done by Geraci and Bottai (2014), the performance of LQMM is robust for cases with heteroscedastic errors. However, the performance of LQMM is disappointing when the data are simulated to mimic the BMI data of CHS (> 10% bias of estimating slope, Table 3.5). 110 When we simulated cases with heteroscedastic errors, the performance of the Bayesian multilevel quantile regression (assuming ,0,« ) is disappointing (>10%, Tables 3.5 and 3.7). After our proposed in medication in the assumption about the distribution of the quantile residuals (using ,0,1 + aX ˆ‰ !«!), the performance of the Bayesian multilevel quantile regression are significantly improved and becomes suitable to fit the data simulated by Geraci and Bottai and the data simulated to imitate BMI data of CHS. In conclusion, the performance of MCEM is unstable and heavily influenced by the choice of initial values. The performance of LQMM is not generally preferable due to the inaccurate assumption of quantile residuals (,0,« ). In contrast to other approaches, the proposed Bayesian multilevel quantile regression method (assuming ,0,1 + aX ˆ‰ !«!) consistently provided appropriate estimates of 85 th percentile of intercept and slope even when including heteroscedastic errors, more covariates, and additional random effects from various levels. After validated our proposed multilevel quantile regression via simulation studies, we further applied it to a subset of CHS data to justify that it is a practical statistical approach to evaluate the conditional quantiles of health outcome with longitudinal data. 3.8 Appendix Property of Asymmetric Laplace Distribution If a random variable Y follows asymmetric Laplace distribution (ALD) with parameters ª,«,and , then Pr(Y < ª) = and Pr(Y ≥ ª) = 1 − which means that ª is the th quantile of distribution. Specifically, the probability density function (pdf) of ALD is |ª,«, = ¬ _¡®− “ @¯ ¬ ”°, where ¡ = ¡ − ¡ ≤ 0 !, 0< <1 is the skew parameter, « >0 is the scale parameter, −∞ < ª < ∞ is the location parameter, −∞ < < ∞ and . is the indicator function. The panels of Figure 1 show the ALD under various settings of parameters 111 ª,«,7+ . Furthermore, when is larger than 0.5, the distribution is left skewed (figure 1 (a) - (d)). While comparing (a) and (c), it shows that the tail of ALD in figure 1(c) is heavier and longer than that in figure 1(a) due to larger «. Conversely, figure 1(d) shows that the distribution was squeezed towards the location of th quanitle compared to figure 1 (a) and (c) because of smallest « in figure 1 (a), (b), and (c). Meanwhile, when comparing (a) and (b) of figure 1, it shows that those two figures are identical except for the locations of the th quantile because the location of the th quantile was determined by ª. Moreover, when is equal to 0.5, the distribution is symmetric. Furthermore, the probability density function, |ª,«,0.5 = ›¬ _¡®− |@¯| 3¬ °, is the probability density function of double exponential distribution. This double exponential distribution is a special case of ALD. If a random variable, Y, follows ALD, then the distribution function of ALD is ; ª,«, = å _ » kµ© º @¯ ½ , ≤ ª 1 − 1 − _ » © º @¯ ½ , > ª , the quantile function of ALD is ; ª,«, = å ª + ¬ Tw9 “ @ ”, 0 ≤ ≤ ª − ¬ Tw9 “ @ ”, < ≤ 1 , and the moment generating function is ∅x = 1 − U MN ¬Ä ¬Äv , where ¬ < x < ¬ . In addition, the kth central moment of Y is & − ª g = Ì!« g 1 − “ KOk + K KOk ”, the mean of Y is & = ª + ¬3 , the variance is 78 = ¬ P 3v3 P ! P P , the skewness is 3 ~ ~ ! P v P ~ P \ , and the Kurtosis is . P v , P P v . P 33 P P . 112 (a) Q=0, Ù=1 (b) Q=2, Ù=1 (c) Q=0, Ù=2 (d) Q=0, Ù=0.5 Figure 3.2 The probability density function of asymmetric Laplace distribution: (a) for 0.5 ≤ R < 1 with Q=0, Ù=1; (b) for 0.5 ≤ R < 1 with Q=2, Ù=1; (c) for 0.5 ≤ R < 1 with Q=0, Ù=2; (d) for 0.5 ≤ R < 1 with Q=0, Ù=0. 113 Chapter 4 Summary of Findings and Future Plans 4.1 Summary of Findings In Chapter 2, we use a continuous BMI variable redefined as a deviation from reference age- and gender- cutoff point as an outcome of interest. This redefined outcome provides a quantifiable way of assessing effects of risk factors in relation to overweight or obesity indicators. This is achieved by creating a hybrid variable that combines the raw BMI variable (which may not directly connect to the issue of overweight or obesity) and the binary overweight or obesity indicators (which treats BMI values that are close to the cutoff as identical to those that are far away from the cutoffs). This new variable is especially meaningful when dealing with the age- and gender- dependent definition of overweight or obesity due to variations in growth trajectories. 4.2 Future Plans In Chapter 3, we propose a new Bayesian Multilevel quantile regression model when errors are homoscedastic or heteroscedastic, allowing for a potentially large number of covariates, complex non-linear growth trajectories and the inclusion of a large number of random effects within / across levels, than allowed by previous methods. In addition to giving analytic arguments that led to an improved theoretical setup for the models, we also evaluated the finite sample performance of our proposed approach, compared to other competing methods in the literature, by using simulation studies. The two major improvements achieved by the dissertation are significant computational efficiency due to the fully Bayesian of the estimation procedure and the ability to include as many levels (and hence random effects) as desired in models without encountering prohibitive computational barriers. This is important for studies such as the CHS 114 and also for modeling childhood BMI data via non-linear growth trajectories. Our work intuitively leads to natural extensions in two important directions. The first extension deals with model mis-specification. Because our model evaluation focused on relative bias (similar to other preceding work in the area), it would be desirable to assess the robustness of our approach by evaluating the performance of the new approach when models are potentially mis-specified. For example, one could assess the performance of the new modeling approach when data are generated based on a 3-level model, but a Bayesian 2-level quantile regression function is used to fit the model (the issue of under-fitting). In the opposite direction, one may also assess the issue of model misspecification when a Bayesian 3-level quantile regression model is fit for data generated based on a 2-level model (the issue of over- fitting). The second extension deals with assessment of model performance when non-linear growth trajectories are considered. For example, this could be useful when fitting models on BMI data from the entire childhood period and there is a need to depict the gender-specific non- linear growth trajectories due to growth spurts in adolescence. While the set-up of our models is already set up (eqs. 3.12-3.14b) to handle this more general situation, our simulation studies (sections 3.4 and 3.5) and illustrative data analysis (section 3.6), so far have focused on linear growth of the outcome Y (e.g., BMI) as a function of age. While there is nothing fundamentally different in the modeling approach due to addition of non-linearity (e.g., using spline function that could lead to parametric with an extended design matrix), it will be useful to extend our simulation studies to ensure that analysis of longitudinal data with relatively longer observation periods covering the entire childhood period is supported by results from simulation studies. In similar vain, we would like to extend our data analysis of the BMI data from the CHS to be more 115 comprehensive via a more thorough model building (and checking) and covering the full range of the childhood period. For the mean regression approaches, including parametric / non- parametric splines is the commonly used tool to capture the non-linear relationship between outcome and predictors (Berhane and Molitor, 2008; Riedel and Imre, 1993). For quantile regression, Koenker et al (1994) proposed quantile smoothing splines to deal with non-linearity by incorporating penalty terms in the object function of minimizing the absolute deviation (eq 4.1). The penalty parameter (¦) is used to control the roughness of the curves, i.e., to prevent situations of over-smooth / under smooth fitting the data. £ c ∑ = − ¡ = ! + ¦“0 |9"¡ | t +¡ b ” /t (eq. 4.1) They applied their approach to study non-linear associations between body mass index (BMI) and running speed at median, 70%ile, and 90%ile. However, their demonstration is limited to cross-sectional data. Our newly proposed modeling framework given in Chapter 3 extends this work to cover longitudinal multi-level data by using parametric splines. The extensions to handle fully non-parametric smoothing techniques as in eq. 4.1, in principle, are again very similar to our proposed approach. A promising direction for such extension would be to restructure the fully non-parametric smoothing spline approach given in eq. 4.1 in its equivalent mixed-effects model form and then have essentially a version of our Bayesian multi-level quantile regression model that has more random effects to deal with the smoothing splines. The additional random effects would enable us to estimate the smoothing parameters. 116 Reference 1. Barndorff-Nielsen OE, Shephard N. Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2001; 63: 167–241. 2. Bell JF, Wilson JS, Liu GC. Neighborhood greenness and 2-year changes in body mass index of children and youth. American Journal of Preventive Medicine 2008; 35: 547– 553. 3. Benson PE. CALINE4—A dispersion model for predicting air pollution concentrations near roadways. State of California Department of Transportation: Sacramento, CA 1989. 4. Berhane K, Molitor NT. A Bayesian approach to functional based multi-level modeling of longitudinal data: With applications to environmental epidemiology. Biostatistics 2008; 9: 686–699. 5. Bilias Y, Chen S, Ying Z. Simple resampling methods for censored regression quantiles. Journal of Econometrics 2000; 99: 373–386. 6. Branum AM, Parker JD, Keim SA, Schempf AH. Prepregnancy body mass index and gestational weight gain in relation to child body mass index among siblings. American Journal of Epidemiology 2011; 174: 1159–1165. 7. Calle EE, Thun MJ. 2004. Obesity and cancer. Oncogene 2004; 23: 6365–6378. 8. Carey VJ, Yong FH, Frenkel LM, McKinney RM. Growth velocity assessment in paediatric AIDS: smoothing, penalized quantile regression and the definition of growth failure. Statistics in Medicine 2004; 23: 509–526. 9. Chamberlain G. Quantile regression, censoring, and the structure of wages. Advances in Econometrics, Sixth World Congress 1994; 1:171–209. 117 10. Cole TJ. The LMS Method for Constructing Normalized Growth Standards. European Journal of Clinical Nutrition 1990; 44: 45–60. 11. Cole TJ. Using the LMS method to measure skewness in the NCHS and Dutch National height standards. Annals of Human Biology 1989; 16: 407–419. 12. Cole TJ. Growth charts for both cross-sectional and longitudinal data. Statistics in Medicine 1994; 13: 2477–2492. 13. Cole TJ, Freeman JV, Preece MA. Body mass index reference curves for the UK, 1990. Archives of Disease in Childhood 1995; 73: 25–29. 14. Daniels SR. The use of BMI in the clinical setting. Pediatrics 2009; 124 (Suppl.): S35– S41. 15. Deckelbaum RJ, Williams CL. Childhood obesity: the health issue. Obesity Research 2001; 9 (Suppl.): 239S–243S. 16. Dietz WH, Robinson TN. Clinical practice: overweight children and adolescents. New England Journal of Medicine 2005; 352: 2100–2109 17. Dill J. Measuring Network Connectivity for Bicycling and Walking. Transport Research Board 2004 Annual MeetingTransportation Research Board, Washington DC 2004. 18. Doyle S, Kelly-Schwartz A, Schlossberg M, Stockard J. Active community environments and health: the relationship of walkable and safe communities to individual health. Journal of the American Planning Association 2006; 72: 19–31. 19. Dunton G, McConnell R, Jerrett M, Wolch J, Lam C, Gilliland F, Berhane K. Organized physical activity in young school children and subsequent 4-year change in body mass index. Archives of Pediatrics and Adolescent Medicine 2012; 166: 713–718. 118 20. Edwards LJ. Modern statistical techniques for the analysis of longitudinal data in biomedical research. Pediatric Pulmonology 2000; 30: 330–334. 21. Field AE, Laird N, Steinberg E, Fallon E, Semega-Janneh M, Yanovski JA. Which metric of relative weight best captures body fatness in children? Obesity Research 2003; 11: 1345–1352. 22. Fiorino EK, Brooks LJ. Obesity and respiratory diseases in childhood. Clinics In Chest Medicine 2009; 30: 601–608. 23. Freedman DS, Khan LK, Serdula MK, Ogden CL, Dietz WH. Racial and ethnic differences in secular trends for childhood BMI, weight, and height. Obesity 2006; 14: 301–308. 24. Gauderman WJ, Avol E, Gilliland F, Vora H, Thomas D, Berhane K, McConnell R, Kuenzli N, Lurmann F, Rappaport E, Margolis H, Bates D, Peters J. The effect of air pollution on lung development from 10 to 18 years of age. The New England Journal of Medicine 2004; 351: 1057–1067. 25. Gauderman WJ, Gilliland F, Vora H, Avol E, Stram D, McConnell R, Thomas D, Lurman F, Margolis HG, Rappaport EB, Berhane K, Peters J. Association between air pollution and lung function growth in southern California children: results from a second cohort. American Journal of Respiratory and Critical Care Medicine 2002; 166: 76–84. 26. Gauderman WJ, McConnell R, Gilliland F, London S, Thomas D, Avol E, Vora H, Berhane K, Rappaport EB, Lurman F, Margolis HG, Peters J. Association between air pollution and lung function growth in southern California children. American Journal of Respiratory and Critical Care Medicine 2000; 162: 1383–1390. 119 27. Gauderman WJ, Vora H, McConnell R, Berhane K, Gilliland F, Thomas D, Lurmann F, Avol E, Kuenzli N, Jerrett M, Peters J. Effect of exposure to traffic on lung development from 10 to 18 years of age: a cohort study. Lancet 2007; 369: 571–577. 28. Geraci M, Bottai M. Linear quantile mixed models. Statistics and Computing 2014; 24: 461–479. 29. Geraci M, Bottai M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics 2007; 8: 140–154. 30. Grow HM, Saelens BE, Kerr J, Durant NH, Norman GJ, Sallis JF. Where are youth active? Roles of proximity, active transport, and built environment. Medicine & Science in Sports & Exercise 2008; 40: 2071-2079. 31. Jerrett M, McConnell R, Chang CCR, Wolch J, Reynolds K, Lurmann F, Gilliland F, Berhane K. Automobile traffic around the home and attained body mass index: a longitudinal cohort study of children aged 10-18 years. Preventive Medicine 2010; 50 (Suppl. 1): S50–58. 32. Jerrett M, McConnell R, Wolch J, Chang R, Lam C, Dunton G, Gilliland F, Lurmann F, Islam, Berhane K. Traffic-related air pollution and obesity formation in children: a longitudinal, multilevel analysis. Environmental Health 2014; 13:49. 33. Koenker R, Bassett G. Regression Quantiles. Econometrica 1978; 46: 33–50. 34. Koenker R, Hallock KF. Quantile Regression. The Journal of Economic Perspectives 2001; 15: 143–156. 35. Koenker R, Ng P, Portnoy S. Quantile smoothing splines. Biometrika 1994; 81: 673–680. 36. Koenker R. Quantile Regression for Longitudinal Data. Journal of Multivariate Analysis 2004; 91: 74–89. 120 37. Koenker RW, d’Orey V. Algorithm as 229: Computing regression quantiles. Journal of the Royal Statistical Society. Series C (Applied Statistics) 1987; 36: 383–393. 38. Kozumi H, Kobayashi G. Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation 2011; 81: 1565–1578. 39. Kuczmarski RJ, Ogden CL, Guo SS, Grummer-Strawn LM, Flegal KM, Mei Z, Wei R, Curtin LR, Roche AF, Johnson CL. 2000 CDC Growth Charts for the United States: Methods and Development. Vital and Health Statistics.Series 11, Data from the National Health Survey 2002; 246: 1–190. 40. Kwok MK, Schooling CM, Lam TH, Leung GM. Paternal smoking and childhood overweight: evidence from the Hong Kong “Children of 1997”. Pediatrics 2010; 126: 46- 56 41. McConnell R, Berhane K, Yao L, Jerrett M, Lurmann F, Gilliland F, Kunzli N, Gauderman J, Avol E, Thomas D, Peters J. Traffic, susceptibility, and childhood asthma. Environ Health Perspectives 2006; 114: 766–772. 42. McConnell R, Islam T, Shankardass K, Jerrett M, Lurmann F, Gilliland F, Gauderman J, Avol E, Künzli N, Yao L, Peters J, Berhane K. Childhood incident asthma and traffic- related air pollution at home and school. Environmental Health Perspectives 2010; 118: 1021–1026. 43. Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. 1st ed. Addison Wesley 1977. 44. Newey WK, Powell JL, Walker JR. Semiparametric estimation of selection models: some empirical results. The American Economic Review 1990; 80: 324–328. 121 45. NHANES - National Health and Nutrition Examination Survey Homepage, n.d. http://www.cdc.gov/nchs/nhanes.htm. 46. Peters JM, Avol E, Navidi W, London SJ, Gauderman WJ, Lurmann F, Linn WS, Margolis H, Rappaport E, Gong H, Thomas DC. A Study of Twelve Southern California Communities with Differing Levels and Types of Air Pollution. I: Prevalence of Respiratory Morbidity. American Journal of Respiratory and Critical Care Medicine 1999; 159: 760-767. 47. Plan and Operation of a Health Examination Survey of U.S. Youths 12-17 Years of Age. Vital and Health Statistics. Ser. 1, Programs and Collection Procedures 1969; 8: 1-80. 48. Plan and Operation of the Third National Health and Nutrition Examination Survey, 1988-94. Series 1: Programs and Collection Procedures. Vital and Health Statistics. Ser. 1, Programs and Collection Procedures 1994; 32: 1-407. 49. Plan, Operation, and Response Results of a Program of Children’s Examinations. Vital and Health Statistics. Ser. 1, Programs and Collection Procedures 1967; 5: 1-56. 50. Portnoy S, Koenker R. The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators. Statistical Science 1997; 12: 279–300. 51. Raum E, Kupper-Nybelen J, Lamerz A, Hebebrand J, Herpertz-Dahlmann B, Brenner H. Tobacco smoke exposure before, during, and after pregnancy and risk of overweight at age 6. Obesity (Silver Spring) 2011; 19: 2411–2417. 52. Reich BJ, Bondell HD, Wang HJ. Flexible Bayesian quantile regression for independent and clustered data. Biostatistics 2010; 11: 337–352. 53. Riedel KS, Imre K. Smoothing spline growth curves with covariates. Communication in Statistics – Theory and Methods 1993; 22: 1795–1818. 122 54. Rundle A, Hoepner L, Hassoun A, Oberfield S, Freyer G, Holmes D, Reyes M, Quinn J, Camann D, Perera F, and Whyatt R. Association of childhood obesity with maternal exposure to ambient air polycyclic aromatic hydrocarbons during pregnancy. American Journal of Epidemiology 2012; 175: 1163–1172 55. Seach KA, Dharmage SC, Lowe AJ, Dixon JB. Delayed introduction of solid feeding reduces child overweight and obesity at 10 years. International Journal of Obesity 2010; 34: 1475–1479. 56. Sonneville KR, Rifas-Shiman SL, Oken E, Peterson KE, Gortmaker SL, Gillman MW, TaverasEM. Longitudinal association of maternal attempt to lose weight during the postpartum period and child obesity at age 3 year. Obesity 2011; 19: 2046–2052. 57. Tilt JH, Unfried TM, Roca B. Using objective and subjective measures of neighborhood greenness and accessible destinations for understanding walking trips and BMI in Seattle, Washington. American Journal of Health Promotion 2007; 21: 371–379. 58. Wei Y, Pere A, Koenker R, He X. Quantile regression methods for reference growth charts. Statistics in Medicine 2006; 25: 1369–1382. 59. Wen X, Gillman MW, Rifas-Shiman SL, Sherry B, Kleinman K, Taveras EM. Decreasing prevalence of obesity among young children in Massachusetts from 2004 to 2008. Pediatrics 2012; 129: 823-831. 60. Wolch J, Jerrett M, Reynolds K, McConnell R, Chang R, Dahmann N, Brady K, Gilliland F, Su JG, Berhane K. Childhood obesity and proximity to urban parks and recreational resources: a longitudinal cohort study. Health & Place 2011; 17: 207–214. 61. Wolch J, Wilson JP, Fehrenbach J. Parks and Park Funding in Los Angeles: An Equity- mapping Analysis. Urban Geography 2005; 26: 4–35. 123 62. Wu J, Funk T, Lurman FW, Winer AM. Improving spatial accuracy of roadway networks and geocoded addresses. Transactions in GIS 2005; 9: 585–601. 63. Wu L. Mixed effects models for complex data. Boca Raton, Florida: Chapman & Hall/CRC 2010 64. Wu T, Dixon WE, Dalton WT, Tudiver F, Liu X. Joint effects of child temperament and maternal sensitivity on the development of childhood obesity. Maternal and Child Health Journal 2011; 15: 469–477. 65. Xie B, Palmer PH, Pang Z, Sun P, Duan H, Johnson CA. Environmental Tobacco Use and Indicators of Metabolic Syndrome in Chinese Adults. Nicotine & Tobacco Research 2010; 12: 198–206. 66. Yates A, Edman J, Aruguete M. Ethnic differences in BMI and body/self-dissatisfaction among Whites, Asian subgroups, Pacific Islanders, and African-Americans. Journal of Adolescent Health 2004; 34: 300–307. 67. Yu K, Moyeed RA. Bayesian quantile regression. Statistics & Probability Letters 2001; 54: 437–447.
Abstract (if available)
Abstract
Conventional mixed effects regression focuses only on effects on the conditional mean, which may be inappropriate when the interest is in testing the effects on extremes of the outcome distribution (e.g. BMI at the 85th percentile). We propose a multilevel quantile regression model with errors following the Asymmetric Laplace Distribution with a data-driven skew parameter under a Bayesian framework. Using our approach, the estimates of parameters would be unbiased regardless of the structure of random errors. This approach allows 1)direct modeling of risk effects with higher accuracy, 2)characterizing inter-subject heterogeneity, 3)accounting for cross-level effects, and 4)modeling non-linear growth trajectories in longitudinal/multi-level data. Besides deriving analytic solutions with improved properties, we conducted simulations to show that our proposed approach consistently provided appropriate estimates at extreme percentiles even when dealing with heteroscedastic errors, and multiple random effects from various levels. We illustrate the new approach through analysis of longitudinal BMI to model determinants of overweight status based on data from the Children's Health Study.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
PDF
Quantile mediation models: methods for assessing mediation across the outcome distribution
PDF
Functional based multi-level flexible models for multivariate longitudinal data
PDF
Bayesian model averaging methods for gene-environment interactions and admixture mapping
PDF
Bayesian hierarchical models in genetic association studies
PDF
Effect of measurement error on the association between baseline and longitudinal change
PDF
Street connectivity and childhood obesity: a longitudinal, multilevel analysis
PDF
Evaluating the associations between the baseline and other exposure variables with the longitudinal trajectory when responses are measured with error
PDF
Incorporating prior knowledge into regularized regression
PDF
Disparities in exposure to traffic-related pollution sources by self-identified and ancestral Hispanic descent in participants of the USC Children’s Health Study
PDF
Three essays on linear and non-linear econometric dependencies
PDF
Bayesian models for a respiratory biomarker with an underlying deterministic model in population research
PDF
Large-scale inference in multiple Gaussian graphical models
PDF
Attrition in longitudinal twin studies: a comparative study of SEM estimation methods
PDF
The design, implementation, and evaluation of accelerated longitudinal designs
PDF
Comparison of nonlinear mixed effect modeling methods for exhaled nitric oxide
PDF
Examining exposure to extreme heat and air pollution and its effects on all-cause, cardiovascular, and respiratory mortality in California: effect modification by the social deprivation index
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Feature selection in high-dimensional modeling with thresholded regression
PDF
Native American ancestry among Hispanic Whites is associated with higher risk of childhood obesity: a longitudinal analysis of Children’s Health Study data
Asset Metadata
Creator
Chang, Chih-Chieh (author)
Core Title
Bayesian multilevel quantile regression for longitudinal data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
02/12/2015
Defense Date
02/12/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Bayesian,childhood obesity,longitudinal data,mixed-effect modeling,multilevel modeling,OAI-PMH Harvest,quantile regression
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Berhane, Kiros T. (
committee chair
), Conti, David V. (
committee member
), Fan, Yingying (
committee member
), Gauderman, William James (
committee member
), McConnell, Rob (
committee member
)
Creator Email
chihchie@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-532589
Unique identifier
UC11297874
Identifier
etd-ChangChihC-3187.pdf (filename),usctheses-c3-532589 (legacy record id)
Legacy Identifier
etd-ChangChihC-3187.pdf
Dmrecord
532589
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Chang, Chih-Chieh
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
Bayesian
childhood obesity
longitudinal data
mixed-effect modeling
multilevel modeling
quantile regression