ANALYSIS USING GENERALIZED LINEAR MODELS AND ITS APPLIED COMPUTATION WITH R

by

Yini Cui

A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE (STATISTICS)

May 2009

Copyright 2009 Yini Cui

Epigraph

By always thinking unto them ... I keep the subject constantly before me and wait till the first dawnings open little by little into the full light.

Isaac Newton (1642-1727)

Acknowledgments

Heartfelt thanks to my thesis advisor, Professor Larry Goldstein, for his enthusiasm, guidance and support throughout the completion of this thesis. He helped to make Statistics fun for me with his wise advice and good teaching. Above all, his encouragement is much appreciated. His true scientist's intuition has exceptionally inspired and enriched my growth as a master's student. I am indebted to him more than he knows.

I would like to express my deep and sincere gratitude to the many professors who have taught me Statistics and Mathematics. I especially want to thank Professor Miguel Dumett. Without his generous, careful supervision of my research, his meticulous reading of my drafts, his continued belief in my ability, and his friendship, this thesis would not have been completed. The creativity, determination and sense of joy with which he responds to life's challenges are very impressive.

I am also indebted to my committee: Professor Remigijus Mikulevicius, Professor Alan Schumitzky and Professor Jay Bartroff. They kindly granted me their precious time to answer my questions and gave valuable suggestions from which I will benefit for a long time to come.

And I am grateful to my family and friends, whose unwavering support and encouragement motivated me to continue pursuing my goal.
Table of Contents

Epigraph
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1
  1.1 Linear Regression Model
  1.2 Exponential Family And Its Properties
  1.3 Generalized Linear Regression Models
Chapter 2
  2.1 Newton-Raphson Iterative Method
  2.2 Inference Procedure
Chapter 3
  3.1 Hypothesis Testing
  3.2 Taylor Series Approximations
  3.3 Sampling Distribution for MLEs
  3.4 Log-likelihood Ratio Statistic
  3.5 Deviance Calculation
Chapter 4
  4.1 Case Analysis I: Binomial GLM
  4.2 Case Analysis II: Poisson GLM
  4.3 Case Analysis III: Gamma GLM
Chapter 5
  5.1 Conclusions
Bibliography
Appendix

List of Tables

Table 1: Link functions and variance functions
Table 2: Output of Model I by R
Table 3: Output of Model II by R
Table 4: Deviance value of Model I and Model II by R
Table 5: Output of Model_GLM I by R
Table 6: Output of Model_GLM II by R
Table 7: Comparison of models with one predictor less than the larger model
Table 8: Output of Model_GLM III by R
Table 9: 95% confidence intervals for each parameter in Model_GLM III
Table 10: Information of the top one influential case
Table 11: Output of Model III by R
Table 12: Output of Model_GLM IV by R
Table 13: 95% confidence intervals for each parameter in Model_GLM IV
Table 14: Output of Model IV by R
Table 15: Output of Model_GLM V by R
Table 16: 95% confidence intervals for each parameter in Model_GLM V

List of Figures

Figure 1-1: Histogram for proportion of cases
Figure 1-2: Density for proportion of cases
Figure 1-3: Boxplot for proportion of cases vs age
Figure 1-4: Residuals vs fitted
Figure 1-5: Scale-Location plot
Figure 1-6: Normal Q-Q plot
Figure 1-7: Residuals vs leverage
Figure 1-8: Half normal plot
Figure 2-1: Plots of incidents vs service
Figure 2-2: Boxplot for incidents vs type
Figure 2-3: Residuals vs fitted
Figure 2-4: Relationship between mean and variance
Figure 2-5: Residuals vs fitted
Figure 2-6: Scale-Location plot
Figure 2-7: Normal Q-Q plot
Figure 2-8: Residuals vs leverage
Figure 3-1: Plots of longevity vs thorax
Figure 3-2: Boxplot for longevity vs activity
Figure 3-3: Residuals vs fitted
Figure 3-4: Scale-Location plot
Figure 3-5: Normal Q-Q plot
Figure 3-6: Residuals vs leverage

Abstract

Generalized linear models (GLMs) were introduced by Nelder and Wedderburn (see [8]). As an extension of normal linear regression for a single dependent variable, GLMs are widely used for regression modeling of non-normal data with a minimum of extra complication. Unifying various other statistical models under one framework, GLMs come with a general algorithm for maximum likelihood estimation in all these models. GLMs are flexible enough to include a wide range of common situations, but at the same time allow most of the familiar ideas of normal linear regression to carry over. Furthermore, the link function, which provides the relationship between the linear predictor and the mean of the distribution function, does not have to be linear. Taking advantage of the statistical software R (see [3]), the efficiency of GLM analysis on different datasets is greatly improved and diagnostics become visible.

Chapter 1

1.1 Linear Regression Model

Linear regression models (see [9]) are frequently used to build a formula that can predict the value of the dependent variable when values of the predictor variables are supplied. They also allow a test of whether a given variable has a significant effect on the dependent variable when many related variables are present. Generally, linear regression procedures estimate a linear equation of the form

$$Y = X\beta + \varepsilon$$

where $Y$ is a vector of dependent variables, $X$ is a matrix whose elements are predictor variables, $\varepsilon$ is a vector of normally distributed residuals, and $\beta$ represents the unknown parameters that describe how the dependent variable changes corresponding to a change of one unit in the predictor variables. For example,

$$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$$

where $Y_i \sim N(\mu_i, \sigma^2)$ and $\varepsilon_i \sim N(0, \sigma^2)$. Then

$$X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad E(Y_i) = \mu_i = X_i^T \beta.$$

There may be two insufficiencies when applying linear regression models in some cases: normality and linearity. The normal approximation to the distribution of $Y_i$ is accurate in some situations, but not all $Y_i$ are normally distributed; $Y_i$ can instead be Binomial, Poisson, Gamma distributed, and so on. For example, when $Y_i$ is binary, with $Y_i = 1$ if a flipped coin lands on its face and $Y_i = 0$ if on its tail, then $0 < \mu_i < 1$. The linear model is inadequate in these cases because complicated and unnatural constraints on $\beta$ would be required to make sure that $\mu_i$ stays in its possible range. GLMs instead assume a link function to build the relationship in place of this linear relationship (see [4], [5], [7]).
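A minimal sketch in R (on simulated data, not from the thesis) of the problem just described: ordinary least squares applied to a binary response can produce fitted "probabilities" outside the interval (0, 1).

set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- rbinom(50, 1, plogis(2 * x))  # binary dependent variable
fit <- lm(y ~ x)                   # naive linear model
range(fitted(fit))                 # typically extends below 0 and above 1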
1.2 Exponential Family And Its Properties

For a single random variable $Y$ whose probability distribution depends on a single parameter $\theta$, a family of probability density functions or probability mass functions is called an exponential family if it can be expressed as

$$f(y; \theta) = s(y)\, t(\theta)\, e^{a(y) b(\theta)}$$

where $a(y)$, $b(\theta)$, $s(y)$, $t(\theta)$ are known functions and $\theta$ is called the parameter of the family. Also, when $d(y) = \log s(y)$ and $c(\theta) = \log t(\theta)$, the expression can be transformed to the following form:

$$f(y; \theta) = \exp\left[ a(y) b(\theta) + c(\theta) + d(y) \right].$$

If $a(y) = y$, the exponential family is said to be in canonical form. The canonical form is non-unique; by defining a transformed variable $z = a(y)$, it is always possible to convert an exponential family to canonical form.

Many well-known distributions belong to the exponential family. These include the discrete families Bernoulli, Binomial, Poisson and Negative Binomial, and the continuous families Normal, Exponential, Gamma, Chi-squared and Beta. The following are detailed examples of the representation of some useful distributions as exponential families.

For the Binomial distribution, the probability mass function is

$$f(y; \theta) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y},$$

where $y$ takes the values $0, 1, \ldots, n$. It can be expressed as

$$f(y; \theta) = \exp\left[ y \log\theta - y \log(1 - \theta) + n \log(1 - \theta) + \log\binom{n}{y} \right]$$

with $a(y) = y$, $b(\theta) = \log\dfrac{\theta}{1 - \theta}$, $c(\theta) = n \log(1 - \theta)$, $d(y) = \log\dbinom{n}{y}$.

For the Poisson distribution, the probability mass function for the discrete random variable $Y$ is

$$f(y; \theta) = \frac{\theta^{y} e^{-\theta}}{y!},$$

where $y$ takes the values $0, 1, 2, \ldots$ It can be expressed as

$$f(y; \theta) = \exp\left[ y \log\theta - \theta - \log y! \right]$$

with $a(y) = y$, $b(\theta) = \log\theta$, $c(\theta) = -\theta$, $d(y) = -\log y!$.

For the Normal distribution, the probability density function is

$$f(y; \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[ -\frac{1}{2\sigma^2}(y - \theta)^2 \right],$$

where $\sigma^2$ is assumed known. It can be expressed as

$$f(y; \theta) = \exp\left[ \frac{y\theta}{\sigma^2} - \frac{\theta^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) - \frac{y^2}{2\sigma^2} \right]$$

with $a(y) = y$, $b(\theta) = \dfrac{\theta}{\sigma^2}$, $c(\theta) = -\dfrac{\theta^2}{2\sigma^2} - \dfrac{1}{2}\log(2\pi\sigma^2)$, $d(y) = -\dfrac{y^2}{2\sigma^2}$.
To derive the expressions for the expectation and variance of $a(Y)$, several properties of distributions in the exponential family are used. Here, all integrals considered are in the sense of Lebesgue. For a fixed $\theta$,

$$\int_{-\infty}^{+\infty} f(y; \theta)\, dy = 1.$$

Differentiating both sides with respect to $\theta$ gives

$$\frac{d}{d\theta}\int_{-\infty}^{+\infty} f(y; \theta)\, dy = 0.$$

For simplicity, we assume that the order of integration and differentiation is reversible; in that case,

$$\int_{-\infty}^{+\infty} \frac{df(y; \theta)}{d\theta}\, dy = 0.$$

Similarly, differentiating $\int f(y;\theta)\, dy = 1$ twice with respect to $\theta$, and again assuming the interchange is valid, gives

$$\int_{-\infty}^{+\infty} \frac{d^2 f(y; \theta)}{d\theta^2}\, dy = 0.$$

For the exponential family, $f(y;\theta) = \exp[a(y)b(\theta) + c(\theta) + d(y)]$. Differentiating $f(y;\theta)$ with respect to $\theta$,

$$\frac{df(y;\theta)}{d\theta} = \left[a(y) b'(\theta) + c'(\theta)\right] \exp\left[a(y)b(\theta) + c(\theta) + d(y)\right] = \left[a(y) b'(\theta) + c'(\theta)\right] f(y;\theta)$$

and

$$\int \frac{df(y;\theta)}{d\theta}\, dy = \int \left[a(y) b'(\theta) + c'(\theta)\right] f(y;\theta)\, dy = 0. \tag{A}$$

Since $\int a(y) f(y;\theta)\, dy = E[a(Y)]$ and $\int c'(\theta) f(y;\theta)\, dy = c'(\theta) \int f(y;\theta)\, dy = c'(\theta)$, it follows from (A) that

$$b'(\theta)\, E[a(Y)] + c'(\theta) = 0. \tag{B}$$

Similarly, differentiating $f(y;\theta)$ twice with respect to $\theta$,

$$\frac{d^2 f(y;\theta)}{d\theta^2} = \left[a(y) b''(\theta) + c''(\theta)\right] f(y;\theta) + \left[a(y) b'(\theta) + c'(\theta)\right]^2 f(y;\theta).$$

By (B), $a(y) b'(\theta) + c'(\theta) = b'(\theta)\{a(y) - E[a(Y)]\}$, and by definition $\mathrm{var}[a(Y)] = \int \{a(y) - E[a(Y)]\}^2 f(y;\theta)\, dy$, so

$$\int \frac{d^2 f(y;\theta)}{d\theta^2}\, dy = b''(\theta)\, E[a(Y)] + c''(\theta) + \left[b'(\theta)\right]^2 \mathrm{var}[a(Y)] = 0. \tag{C}$$

Therefore, the expressions for the expectation and variance of $a(Y)$ simplify as follows. From (B),

$$E[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}, \tag{D}$$

and from (C),

$$\mathrm{var}[a(Y)] = \frac{b''(\theta)\, c'(\theta) - c''(\theta)\, b'(\theta)}{\left[b'(\theta)\right]^3}. \tag{E}$$

When the exponential family is in canonical form, that is $a(y) = y$, (D) gives

$$E[Y] = -\frac{c'(\theta)}{b'(\theta)}$$

and (E) gives

$$\mathrm{var}[Y] = \frac{b''(\theta)\, c'(\theta) - c''(\theta)\, b'(\theta)}{\left[b'(\theta)\right]^3}.$$
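As a quick numerical illustration (a sketch, not part of the derivation above), identities (D) and (E) can be checked in R for the Poisson family: with $b(\theta) = \log\theta$ and $c(\theta) = -\theta$, both formulas evaluate to $\theta$.

theta <- 4
y <- rpois(1e5, theta)   # Monte Carlo sample from Poisson(theta)
c(mean(y), var(y))       # both close to theta = 4, as (D) and (E) predict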
To get the expectation and variance of the derivative of the log-likelihood function, a function $U$ called the score statistic is introduced. The log-likelihood function for a distribution in the exponential family is

$$l(\theta; y) = a(y) b(\theta) + c(\theta) + d(y).$$

Differentiating $l(\theta; y)$ with respect to $\theta$ gives the function $U$,

$$U = \frac{dl(\theta; y)}{d\theta}.$$

Since $U$ depends on $y$, it can be viewed as a random variable:

$$U = a(y)\, b'(\theta) + c'(\theta). \tag{F}$$

Taking expectations on both sides of (F), $E(U) = b'(\theta)\, E[a(Y)] + c'(\theta)$, and from (D),

$$E(U) = -b'(\theta)\frac{c'(\theta)}{b'(\theta)} + c'(\theta) = 0.$$

Define $J = \mathrm{var}(U)$, called the information. From (F) and (E),

$$\mathrm{var}(U) = \left[b'(\theta)\right]^2 \mathrm{var}[a(Y)] = \frac{b''(\theta)\, c'(\theta) - c''(\theta)\, b'(\theta)}{b'(\theta)}. \tag{G}$$

Since $\mathrm{var}(U) = E(U^2) - [E(U)]^2$ and $E(U) = 0$, we have $\mathrm{var}(U) = E(U^2)$. Differentiating (F) with respect to $\theta$,

$$E(U') = b''(\theta)\, E[a(Y)] + c''(\theta) = -\frac{b''(\theta)\, c'(\theta)}{b'(\theta)} + c''(\theta),$$

and comparing with (G), $E(U') = -\mathrm{var}(U)$. In summary, one property of $U$ is

$$\mathrm{var}(U) = E(U^2) = -E(U').$$

1.3 Generalized Linear Regression Models

A GLM is defined in terms of a set of independent random variables $Y_1, Y_2, \ldots, Y_N$ satisfying two properties: (1) the distribution of each $Y_i$ belongs to the exponential family in the same canonical form and depends on a single parameter $\theta_i$, though the $\theta_i$ do not have to be the same for all $i$; (2) the distributions of all the $Y_i$ are of the same form.

Usually, the parameters $\theta_i$ do not serve as the parameters of interest, since there would be too many unknown parameters to estimate. For model specification we focus on a smaller set of parameters $\beta_1, \beta_2, \ldots, \beta_p$, where $p \ll N$. For a GLM there is a transformation of $\mu_i$ such that $E(Y_i) = \mu_i$ and $g(\mu_i) = x_i^T \beta$. The function $g$ is a monotone, differentiable function called the link function, which provides the relationship between the linear predictor and the mean of the distribution function. The parameters $\theta_i$ are thus replaced by the parameter vector $\beta$, which makes the estimation process easier.

Table 1 lists the canonical link functions and variance functions used for several distributions in the exponential family (taken from Wikipedia).

Table 1: Link functions and variance functions

Distribution Family    Link Function                  Variance Function
Binomial               g(mu) = log(mu / (1 - mu))     mu(1 - mu)
Poisson                g(mu) = log(mu)                mu
Normal                 g(mu) = mu                     1
Gamma                  g(mu) = mu^(-1)                mu^2
Inverse Gaussian       g(mu) = mu^(-2)                mu^3

For most analyses of continuous data, linear models are set up under the assumption that the random variables $Y_i$ are independent with $Y_i \sim N(\mu_i, \sigma^2)$, so that $E(Y_i) = \mu_i = x_i^T \beta$. Compared to linear models, GLMs are applicable to problems in more general situations: (1) the dependent variable can have a distribution other than the Normal distribution; it can have any distribution belonging to the exponential family in canonical form; (2) the relationship between the dependent and predictor variables need not be of the simple linear form above.

There are several advantages to introducing GLMs. (1) We do not have to transform the dependent variable $Y$ to normality. (2) Many "nice" properties of the Normal distribution are shared by the exponential family of distributions. (3) There can be a non-linear function relating $E(Y_i) = \mu_i$ to $x_i^T \beta$, namely $g(\mu_i) = x_i^T \beta$. Such models have now been further generalized to situations where functions may be estimated numerically.
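In R, the canonical choices of Table 1 are carried by family objects, whose link and variance components can be inspected directly; a small sketch (these are components of base R's stats package):

> binomial()$linkfun(0.5)          # logit link: log(0.5 / 0.5) = 0
> poisson()$variance(3)            # variance function mu, evaluated at mu = 3
> Gamma(link = log)$variance(3)    # variance function mu^2, giving 9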
Chapter 2

2.1 Newton-Raphson Iterative Method

To find the value of $x$ at which a function $t$ crosses the x-axis, we approximate the slope of $t$ at a value $x^{(m-1)}$ by

$$t'(x^{(m-1)}) = \frac{dt}{dx}\bigg|_{x = x^{(m-1)}} \approx \frac{t(x^{(m)}) - t(x^{(m-1)})}{x^{(m)} - x^{(m-1)}},$$

where the distance $x^{(m)} - x^{(m-1)}$ is small. If $x^{(m)}$ is the required solution, so that $t(x^{(m)}) = 0$, then

$$x^{(m)} \approx x^{(m-1)} - \frac{t(x^{(m-1)})}{t'(x^{(m-1)})}.$$

This is the Newton-Raphson formula for solving $t(x) = 0$. Starting with an initial guess $x^{(1)}$, successive approximations are obtained until the iterative process converges. This is usually the case when $x^{(1)}$ is close enough to the solution of $t(x) = 0$.

For maximum likelihood estimation using the score function and the information, the estimating equation is

$$\theta^{(m)} = \theta^{(m-1)} - \frac{U^{(m-1)}}{U'^{(m-1)}}.$$

Since we can approximate $U'$ by its expectation $E(U') = -J$ in maximum likelihood estimation,

$$\theta^{(m)} = \theta^{(m-1)} - \frac{U^{(m-1)}}{E(U'^{(m-1)})} = \theta^{(m-1)} + \frac{U^{(m-1)}}{J^{(m-1)}}. \tag{H}$$

2.2 Inference Procedure

Maximum likelihood estimation is a well-known statistical method for fitting a mathematical model to data. Consider independent random variables $Y_1, Y_2, \ldots, Y_N$ satisfying the properties of a GLM. We need to estimate the parameters $\beta$, which are related to the $Y_i$ through $E(Y_i) = \mu_i$ and $g(\mu_i) = x_i^T \beta$.

For each $Y_i$, since $a(y_i) = y_i$ in canonical form, the log-likelihood function is

$$l_i = y_i\, b(\theta_i) + c(\theta_i) + d(y_i).$$

For all the $Y_i$, the log-likelihood function is

$$l = \sum_{i=1}^{N} l_i = \sum y_i\, b(\theta_i) + \sum c(\theta_i) + \sum d(y_i).$$

To get the maximum likelihood estimator of the parameter $\beta_j$, $j = 1, 2, \ldots, p$, we differentiate $l$ with respect to $\beta_j$ using the chain rule:

$$U_j = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{N} \frac{\partial l_i}{\partial \beta_j} = \sum_{i=1}^{N} \frac{\partial l_i}{\partial \theta_i}\, \frac{\partial \theta_i}{\partial \mu_i}\, \frac{\partial \mu_i}{\partial \beta_j}. \tag{I}$$

For the first factor of (I), because of (D),

$$\frac{\partial l_i}{\partial \theta_i} = y_i\, b'(\theta_i) + c'(\theta_i) = b'(\theta_i)\,(y_i - \mu_i).$$

For the second factor of (I), because of (E),

$$\frac{\partial \mu_i}{\partial \theta_i} = \frac{\partial}{\partial \theta_i}\left(-\frac{c'(\theta_i)}{b'(\theta_i)}\right) = \frac{b''(\theta_i)\, c'(\theta_i) - c''(\theta_i)\, b'(\theta_i)}{\left[b'(\theta_i)\right]^2} = b'(\theta_i)\, \mathrm{var}(Y_i),$$

so by the Inverse Function Theorem,

$$\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b'(\theta_i)\, \mathrm{var}(Y_i)}.$$

For the third factor of (I), set $\eta_i = g(\mu_i) = x_i^T \beta$; then

$$\frac{\partial \mu_i}{\partial \beta_j} = \frac{\partial \mu_i}{\partial \eta_i}\, \frac{\partial \eta_i}{\partial \beta_j} = \frac{\partial \mu_i}{\partial \eta_i}\, x_{ij}.$$

Therefore,

$$U_j = \sum_{i=1}^{N} \frac{(y_i - \mu_i)}{\mathrm{var}(Y_i)}\, x_{ij}\, \frac{\partial \mu_i}{\partial \eta_i}. \tag{J}$$

The variance-covariance matrix of the $U_j$ has terms $J_{jk} = E(U_j U_k)$. Because of (J), the information matrix $J$ has the form

$$J_{jk} = E\left[ \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{\mathrm{var}(Y_i)}\, x_{ij}\, \frac{\partial \mu_i}{\partial \eta_i} \sum_{l=1}^{N} \frac{(Y_l - \mu_l)}{\mathrm{var}(Y_l)}\, x_{lk}\, \frac{\partial \mu_l}{\partial \eta_l} \right].$$

Since the $Y_i$ are independent, $E[(Y_i - \mu_i)(Y_l - \mu_l)] = 0$ when $i \ne l$, so

$$J_{jk} = \sum_{i=1}^{N} \frac{E\left[(Y_i - \mu_i)^2\right] x_{ij} x_{ik}}{\left[\mathrm{var}(Y_i)\right]^2} \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2,$$

and by definition $E[(Y_i - \mu_i)^2] = \mathrm{var}(Y_i)$, giving

$$J_{jk} = \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{\mathrm{var}(Y_i)} \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2.$$

Let $b^{(m)}$ be the vector of estimates of the parameters $\beta_1, \beta_2, \ldots, \beta_p$ (where $p \ll N$) in the $m$-th iteration. From (H), the Newton-Raphson iterative method can be written as

$$b^{(m)} = b^{(m-1)} + \left[J^{(m-1)}\right]^{-1} U^{(m-1)}.$$
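This iteration is what glm() carries out internally (as iteratively reweighted least squares). A minimal sketch of it for a Poisson GLM with the canonical log link, where (J) reduces to $U = X^T(y - \mu)$ and $J = X^T \mathrm{diag}(\mu) X$; the function name and simulated data below are ours, not from any package:

irls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (m in seq_len(maxit)) {
    mu   <- exp(drop(X %*% beta))   # inverse of the log link applied to eta
    U    <- crossprod(X, y - mu)    # score vector, from (J)
    J    <- crossprod(X, X * mu)    # information matrix X' diag(mu) X
    step <- drop(solve(J, U))       # J^{-1} U
    beta <- beta + step             # the update b_m = b_{m-1} + J^{-1} U
    if (sum(abs(step)) < tol) break
  }
  beta
}
set.seed(2)
X <- cbind(1, runif(100))
y <- rpois(100, exp(0.5 + 1.5 * X[, 2]))
cbind(irls = irls_poisson(X, y), glm = coef(glm(y ~ X[, 2], family = poisson)))

The two columns should agree to numerical precision, which is the point of Section 2.2.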
Chapter 3

3.1 Hypothesis Testing

Hypothesis tests are performed by comparing how well two related models fit the data. For GLMs, the two models should have the same probability distribution and the same link function, but the linear component of one model has more parameters than the other. The simpler model, corresponding to the null hypothesis $H_0$, is a special case of the other, more general model, corresponding to the alternative hypothesis $H_1$. If the simpler model fits the data as well as the more general model does, it is preferred because it introduces fewer parameters, and $H_0$ is accepted. If the more general model fits significantly better, it is retained because it explains the data better, and $H_1$ is accepted.

To make these comparisons, we use summary statistics describing how well the models fit the data. These goodness of fit statistics may be based on the maximum value of the likelihood function, the minimum value of the sum of squares criterion, or a composite statistic based on the residuals. The steps can be summarized as follows:

(1) Develop a simpler model $M_0$ with fewer parameters for $H_0$ and a more general model $M_1$ for $H_1$.
(2) Calculate the goodness of fit statistics $G_0$ and $G_1$ for $M_0$ and $M_1$ respectively.
(3) Test $H_0'$: $G_0 = G_1$ against $H_1'$: $G_0 \ne G_1$ using a sampling distribution.
(4) If $H_0'$ is not rejected, then $H_0$ is not rejected: because $M_0$ achieves almost the same goodness of fit as $M_1$ while having fewer parameters to estimate, $M_0$ is the preferred model. If $H_0'$ is rejected, then $H_0$ is rejected: because $G_0 \ne G_1$ and $M_1$ achieves a better goodness of fit, $M_1$ is the preferred model even though it has more parameters to estimate.

If there is only one parameter $\beta$, the score statistic has the asymptotic sampling distribution

$$\frac{U - E(U)}{\sqrt{\mathrm{var}(U)}} \sim N(0, 1) \quad \text{and} \quad \frac{\left[U - E(U)\right]^2}{\mathrm{var}(U)} \sim \chi^2(1).$$

Since $E(U) = 0$ and $\mathrm{var}(U) = J$, we have

$$\frac{U}{\sqrt{J}} \sim N(0, 1) \quad \text{and} \quad \frac{U^2}{J} \sim \chi^2(1).$$

If there is more than one parameter in $\beta$, then the score vector $U$ has an asymptotic multivariate Normal distribution and

$$U^T J^{-1} U \sim \chi^2(p).$$
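In R, the comparison of a simpler nested model with a more general one is automated by anova(); a sketch, where fit0 and fit1 are hypothetical nested GLM fits with the same family and link:

> anova(fit0, fit1, test = "Chisq")   # compares the change in deviance to a chi-squared distribution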
3.2 Taylor Series Approximations

A Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(t)}{n!}\, (x - t)^n,$$

where $f^{(n)}(t)$ denotes the $n$-th derivative of $f$ evaluated at the point $t$; the 0th derivative of $f$ is defined to be $f$ itself, and $(x - t)^0$ and $0!$ are both defined to be 1. According to the Taylor series approximation, when $x$ is close to $t$,

$$f(x) \approx f(t) + (x - t)\frac{df}{dx}\bigg|_{x = t} + \frac{1}{2}(x - t)^2 \frac{d^2 f}{dx^2}\bigg|_{x = t}.$$

3.3 Sampling Distribution for MLEs

If there is only one parameter $\beta$ and $b$ is the estimate of $\beta$, the Taylor series approximation gives

$$l(\beta) \approx l(b) + (\beta - b)\frac{dl}{d\beta}\bigg|_{\beta = b} + \frac{1}{2}(\beta - b)^2 \frac{d^2 l}{d\beta^2}\bigg|_{\beta = b} = l(b) + (\beta - b)\, U(b) + \frac{1}{2}(\beta - b)^2\, U'(b).$$

Since we can approximate $U'(b)$ by its expectation $E[U'(b)]$ in maximum likelihood estimation, and $J(b) = -E[U'(b)]$,

$$l(\beta) \approx l(b) + (\beta - b)\, U(b) - \frac{1}{2}(\beta - b)^2\, J(b).$$

Similarly, the corresponding approximation of the log-likelihood function for a vector parameter $\beta$ is

$$l(\beta) \approx l(b) + (\beta - b)^T U(b) - \frac{1}{2}(\beta - b)^T J(b)\, (\beta - b), \tag{K}$$

where $U$ is the vector of scores and $J$ is the information matrix. Also, we have

$$U(\beta) \approx U(b) - J\, (\beta - b).$$

Since $b = \hat\beta$ is the estimator that maximizes $l(b)$, we have $U(b) = 0$, so

$$U \approx -J\, (\beta - b) = J\, (b - \beta).$$

Provided that $J$ is non-singular,

$$b - \beta \approx J^{-1} U.$$

Taking expectations on both sides, $E(b - \beta) \approx J^{-1} E(U) = 0$, so $E(b) \approx \beta$ and $b$ is an (asymptotically) unbiased estimator of $\beta$. The variance-covariance matrix of $b$ is

$$E\left[(b - \beta)(b - \beta)^T\right] \approx J^{-1} E(U U^T)\, (J^{-1})^T.$$

Since $E(U U^T) = J$ and $(J^T)^{-1} = J^{-1}$ because $J$ is symmetric,

$$E\left[(b - \beta)(b - \beta)^T\right] \approx J^{-1}.$$

The asymptotic sampling distribution of $b$ is therefore

$$(b - \beta)^T J(b)\, (b - \beta) \sim \chi^2(p),$$

and when $p = 1$, $b \sim N(\beta,\ J^{-1})$.
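The result $b \sim N(\beta, J^{-1})$ is what Wald-type intervals rest on. A sketch in R, where fit is a hypothetical fitted GLM and vcov(fit) plays the role of the estimated $J^{-1}$:

se <- sqrt(diag(vcov(fit)))            # standard errors from the estimated J^{-1}
cbind(lower = coef(fit) - 1.96 * se,
      upper = coef(fit) + 1.96 * se)   # approximate 95% Wald intervals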
3.4 Log-likelihood Ratio Statistic

Another way of assessing the adequacy of a model is to compare it with a saturated model, the model with the maximum number of parameters that can be estimated. For GLMs, if there are $N$ observations $Y_1, Y_2, \ldots, Y_N$, all with potentially different values of the linear component $x_i^T \beta$, then a saturated model can be specified with $N$ parameters. If some of the observations have the same linear component or covariate pattern, that is, they correspond to the same combination of factor levels and have the same values of any continuous predictor variables, they are called replicates. In that case, the maximum number of parameters that can be estimated for the saturated model is equal to the number of potentially different linear components, which may be less than $N$. Here, let $m$ be the maximum number of parameters that can be estimated; if all $N$ observations have distinct linear components, then $m = N$.

Let $\beta_{\max}$ be the parameter vector for the saturated model and $b_{\max}$ the maximum likelihood estimator of $\beta_{\max}$. The likelihood function for the saturated model evaluated at $b_{\max}$, $L(b_{\max}; y)$, will certainly be larger than any other likelihood function for these observations with the same assumed distribution and link function, because it provides the most complete description of the data. Let $L(b; y)$ be the maximum value of the likelihood function for the model to be tested. Then the likelihood ratio

$$\lambda = \frac{L(b_{\max}; y)}{L(b; y)}$$

provides a way of assessing the goodness of fit of the model. Taking logarithms,

$$\log\lambda = l(b_{\max}; y) - l(b; y).$$

Large values of $\log\lambda$ mean that the model of interest is a poor description of the data relative to the saturated model. If $\log\lambda$ is close to zero, we are encouraged to apply the model with fewer parameters, since it fits the data almost as well as the saturated model. To determine the critical region for $\log\lambda$, we need its sampling distribution.

3.5 Deviance Calculation

The deviance, also called the log-likelihood ratio statistic, is defined as

$$D = 2\left[ l(b_{\max}; y) - l(b; y) \right].$$

It can also be written in the following form:

$$D = 2\left[ l(b_{\max}; y) - l(\beta_{\max}; y) \right] - 2\left[ l(b; y) - l(\beta; y) \right] + 2\left[ l(\beta_{\max}; y) - l(\beta; y) \right]. \tag{L}$$

Recall from (K) that $l(\beta) \approx l(b) + (\beta - b)^T U(b) - \frac{1}{2}(\beta - b)^T J(b)(\beta - b)$ with $U(b) = 0$. For the first term of (L),

$$2\left[ l(b_{\max}; y) - l(\beta_{\max}; y) \right] \approx (\beta_{\max} - b_{\max})^T J(b_{\max})\, (\beta_{\max} - b_{\max}) \sim \chi^2(m).$$

For the second term of (L), similarly,

$$2\left[ l(b; y) - l(\beta; y) \right] \sim \chi^2(p).$$

For the third term of (L), let $v = 2[l(\beta_{\max}; y) - l(\beta; y)]$, a positive constant. It will be near zero when the model of interest fits the data almost as well as the saturated model. Therefore, the sampling distribution of the deviance $D$ is approximately the non-central distribution

$$D \sim \chi^2(m - p,\ v).$$

Consider $H_0$: $\beta = [\beta_1 \cdots \beta_q]^T$ for model $M_0$ and $H_1$: $\beta = [\beta_1 \cdots \beta_p]^T$ for model $M_1$, where $q < p < m$. Then

$$\Delta D = D_0 - D_1 = 2\left[ l(b_{\max}; y) - l(b_0; y) \right] - 2\left[ l(b_{\max}; y) - l(b_1; y) \right] = 2\left[ l(b_1; y) - l(b_0; y) \right].$$

Since $D_0 \sim \chi^2(m - q)$ and $D_1 \sim \chi^2(m - p)$,

$$\Delta D \sim \chi^2(p - q).$$

To eliminate the term $\sigma^2$, we can use the ratio

$$F = \frac{(D_0 - D_1)/(p - q)}{D_1/(m - p)} \sim F(p - q,\ m - p).$$
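The $\Delta D$ test of Section 3.5 can also be done "by hand" in R; a sketch, where fit0 and fit1 are hypothetical nested GLM fits with $q$ and $p$ parameters respectively:

dD <- deviance(fit0) - deviance(fit1)          # Delta-D
df <- df.residual(fit0) - df.residual(fit1)    # p - q
pchisq(dD, df, lower.tail = FALSE)             # p-value from chi-squared(p - q)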
Chapter 4

4.1 Case Analysis I: Binomial GLM

R is a free software environment for statistical computing and graphics, and it runs on a wide variety of operating systems. Although there are several excellent statistical packages, we use R for GLM data analysis in this thesis, because statistical modeling is one of R's strongest features thanks to its unified approach, its variety of model types and its excellent diagnostic capabilities.

The data "esoph" from the R package "faraway" come from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. The data frame has records for 88 age/alcohol/tobacco combinations, with 5 variables: age group, alcohol consumption, tobacco consumption, number of cases and number of controls. Age group is divided into 6 groups: 25-34, 35-44, 45-54, 55-64, 65-74 and 75+ years. Alcohol consumption is divided into 4 groups: 0-39, 40-79, 80-119 and 120+ gm/day. Tobacco consumption is divided into 4 groups: 0-9, 10-19, 20-29 and 30+ gm/day.

We are interested in how age, alcohol consumption and tobacco consumption affect esophageal cancer; therefore we need to find out how the proportion of cases is related to these three factors. The first stage in data analysis is usually to take advantage of graphical summaries to gain an understanding of the data (see [1], [2], [6]). Considering just one variable at a time, histograms are a good way of examining the distribution of a variable. Set

$$Pro = \frac{ncases}{ncases + ncontrols}$$

as the proportion of cases, which is the variable of interest in the histogram below. From Figure 1-1, we notice that Pro is obviously not normally distributed.

[Figure 1-1: Histogram for proportion of cases]

[Figure 1-2: Density for proportion of cases]

[Figure 1-3: Boxplot for proportion of cases vs age]

Since a histogram is a fairly crude estimate, sensitive to the choice of bins, Figure 1-2 uses a kernel density estimate as a smoother version of a histogram. The distribution is skewed and has an extra peak at the right tail, so it does not satisfy one of the prerequisites for applying linear regression models. Figure 1-3 uses a boxplot to show the distribution of Pro by age; it gives a clear overview that age is a factor contributing greatly to Pro.

Let Pro, Agegp, Alcgp and Tobgp stand for the proportion of cases, age, alcohol consumption and tobacco consumption respectively. Let Pro be the dependent variable and Agegp, Alcgp, Tobgp be the predictor variables. As a naive way to analyze this dataset, we develop two models using linear regression directly:

Model I: $Pro = \beta_1 + \beta_2\, Agegp + \beta_3\, Alcgp + \beta_4\, Tobgp + \varepsilon$

Model II: $Pro = \beta_1 + \beta_2\, Agegp + \beta_3\, Alcgp + \beta_4\, Tobgp + \beta_5\, Agegp \cdot Alcgp + \beta_6\, Agegp \cdot Tobgp + \beta_7\, Alcgp \cdot Tobgp + \varepsilon$

Model II adds interaction terms to Model I, since the three predictor variables may have some correlation.

Table 2: Output of Model I by R

                  Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)       1.2303     1.0899       1.129     0.26217
unclass(agegp)    0.5547     0.1660       3.342     0.00124 **
unclass(alcgp)    0.1786     0.2435       0.733     0.46532
unclass(tobgp)    -0.5290    0.2447       -2.162    0.03344 *

Residual standard error: 2.548 on 84 degrees of freedom
Multiple R-squared: 0.1727, Adjusted R-squared: 0.1432
F-statistic: 5.846 on 3 and 84 DF, p-value: 0.001123

Table 3: Output of Model II by R

                                 Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                      -0.26208   2.52147      -0.104    0.9175
unclass(agegp)                   1.14841    0.55600      2.066     0.0421 *
unclass(alcgp)                   0.23046    0.79787      0.289     0.7734
unclass(tobgp)                   -0.15533   0.79078      -0.196    0.8448
unclass(agegp):unclass(alcgp)    -0.07814   0.15110      -0.517    0.6065
unclass(agegp):unclass(tobgp)    -0.17353   0.15032      -1.154    0.2517
unclass(alcgp):unclass(tobgp)    0.08474    0.22122      0.383     0.7027

Residual standard error: 2.568 on 81 degrees of freedom
Multiple R-squared: 0.1897, Adjusted R-squared: 0.1297
F-statistic: 3.161 on 6 and 81 DF, p-value: 0.007754

$R^2$, the coefficient of determination, gives some information about the goodness of fit of a model. It is usually defined as the proportion of the variance of the dependent variable that is predictable from (that can be explained by) the predictor variables. When the value of $R^2$ is low, there are two possible explanations: either there is little relationship between the dependent and predictor variables, or the model fit is poor. Since $R^2$ never decreases when a new predictor is added to the model, as a criterion for choosing models it favors the largest model. The adjusted $R^2$, written $R_a^2$, is therefore introduced; it makes allowance for the fact that a larger model uses more parameters, and it increases only when a predictor with some predictive value is added to the model.

From Table 2 and Table 3, $R_a^2$ for Model I and Model II is 0.1432 and 0.1297 respectively, which is too low to be satisfactory, and most of the estimated parameters are not statistically significant. Yet Figure 1-3 shows an obvious relationship between the dependent and predictor variables. Therefore we consider Model I and Model II poor. We also calculate the deviances of Model I and Model II, listed in Table 4.

Table 4: Deviance value of Model I and Model II by R

            Model I     Model II
Deviance    545.5504    534.3249

The deviance is a general measure of goodness of fit; for linear models, it is the residual sum of squares. From Table 4, the deviances of Model I and Model II calculated by R are 545.5504 and 534.3249 respectively, both too large to be acceptable.

The standard linear model is clearly not directly suitable here. Although we might use a transformation or weighting to modify the models, it is better to develop a model that is directly suited to the data. We can view Pro as binomially distributed and use a Binomial GLM to estimate the parameters. We develop Model_GLM I and get its output with the following R commands:

> modelglm1 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph, family = binomial())
> summary(modelglm1)

Table 5: Output of Model_GLM I by R

                  Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       -5.59594   0.41540      -13.471   < 2e-16 ***
unclass(agegp)    0.52867    0.07188      7.355     1.91e-13 ***
unclass(tobgp)    0.27446    0.08074      3.399     0.000676 ***
unclass(alcgp)    0.69382    0.08342      8.317     < 2e-16 ***

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227.241 on 87 degrees of freedom
Residual deviance: 73.959 on 84 degrees of freedom

In Table 5, the residual deviance is the deviance of the current model, while the null deviance is the deviance of a model with just an intercept term and no other predictors. We can see that, compared to Model I, Model_GLM I achieves a much smaller deviance of 73.959, and all 4 estimates are statistically significant with p-values smaller than 0.05. Considering that there may be interaction effects, we still need to develop Model_GLM II, which introduces interaction terms into Model_GLM I, before we leap to any conclusions.
We develop Model_GLM II and get its output with the following R commands:

> modelglm2 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp) + unclass(agegp) * unclass(alcgp) + unclass(agegp) * unclass(tobgp) + unclass(alcgp) * unclass(tobgp), data = esoph, family = binomial())
> summary(modelglm2)

Table 6: Output of Model_GLM II by R

                                 Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)                      -7.59377   1.09109      -6.960    3.41e-12 ***
unclass(agegp)                   0.83810    0.22118      3.789     0.000151 ***
unclass(tobgp)                   0.72136    0.34784      2.074     0.038096 *
unclass(alcgp)                   1.49461    0.35020      4.268     1.97e-05 ***
unclass(agegp):unclass(alcgp)    -0.12156   0.06928      -1.755    0.079318
unclass(agegp):unclass(tobgp)    -0.01907   0.06665      -0.286    0.774768
unclass(alcgp):unclass(tobgp)    -0.15615   0.07615      -2.051    0.040312 *

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227.241 on 87 degrees of freedom
Residual deviance: 67.732 on 81 degrees of freedom

In Table 6, Model_GLM II achieves a lower deviance of 67.732 and obtains a barely statistically significant parameter for the interaction between Alcgp and Tobgp, with p-value 0.040312. To test whether any of these interaction terms should be added, we use the following R command to compare all models with one predictor less than the larger model, with results shown in Table 7:

> drop1(modelglm2, test = "F")

Table 7: Comparison of models with one predictor less than the larger model

                                 Df   Deviance   AIC       F value   Pr(F)
<none>                                67.732     229.213
unclass(agegp):unclass(alcgp)    1    70.772     230.253   3.6355    0.06010
unclass(agegp):unclass(tobgp)    1    67.813     227.294   0.0974    0.75572
unclass(tobgp):unclass(alcgp)    1    71.911     231.392   4.9975    0.02814 *

The Akaike Information Criterion (AIC) is another popular criterion. It is defined as

$$AIC = -2 \log(\text{maximum likelihood}) + 2p,$$

where $p$ is the number of parameters. The results in Table 7 show that the deviance and AIC values of all 4 models are very close. Only the interaction between Alcgp and Tobgp is worth adding, with a p-value of 0.02814.
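As a sketch of the AIC definition just given, the value in the <none> row of Table 7 can be recomputed from the fitted log-likelihood (assuming the fit above is in the workspace):

> -2 * as.numeric(logLik(modelglm2)) + 2 * length(coef(modelglm2))
> AIC(modelglm2)   # should agree with the 229.213 of Table 7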
We then develop Model_GLM III:

$$\mathrm{logit}(Pro) = \beta_1 + \beta_2\, Agegp + \beta_3\, Alcgp + \beta_4\, Tobgp + \beta_5\, Alcgp \cdot Tobgp$$

We build Model_GLM III and get its output with the following R commands:

> modelglm3 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp) + unclass(alcgp) * unclass(tobgp), data = esoph, family = binomial())
> summary(modelglm3)

Table 8: Output of Model_GLM III by R

                                 Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)                      -6.23633   0.56166      -11.103   < 2e-16 ***
unclass(agegp)                   0.52699    0.07214      7.305     2.77e-13 ***
unclass(tobgp)                   0.59323    0.19663      3.017     0.00255 **
unclass(alcgp)                   0.96637    0.17710      5.457     4.85e-08 ***
unclass(alcgp):unclass(tobgp)    -0.13183   0.07460      -1.767    0.07718

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227.241 on 87 degrees of freedom
Residual deviance: 70.852 on 83 degrees of freedom

According to Table 8, though the interaction parameter $\beta_5$ is not quite statistically significant (p-value 0.07718), we view Model_GLM III as the better predictive model for the low deviance of 70.852 it achieves while adding only one more parameter. From the estimates in Table 8, we get

$$\mathrm{logit}(\widehat{Pro}) = -6.23633 + 0.52699\, Agegp + 0.96637\, Alcgp + 0.59323\, Tobgp - 0.13183\, Alcgp \cdot Tobgp.$$

This shows that age, alcohol consumption and tobacco consumption are three factors that can lead to esophageal cancer, with alcohol consumption the biggest contributor. There is also an interaction between alcohol consumption and tobacco consumption.

Furthermore, the 95% confidence intervals for each parameter in Model_GLM III can be calculated with the following R command:

> confint(modelglm3)

Table 9: 95% confidence intervals for each parameter in Model_GLM III

                                 2.5 %        97.5 %
(Intercept)                      -7.3691986   -5.16492547
unclass(agegp)                   0.3882982    0.67143821
unclass(tobgp)                   0.2049054    0.97734457
unclass(alcgp)                   0.6223358    1.31755150
unclass(alcgp):unclass(tobgp)    -0.2781207   0.01482488

Even though there are some theories specific to GLM residuals, we analyze them with the known techniques for regression models. For diagnostics of Model_GLM III, we use the following R commands:

> plot(modelglm3)
> halfnorm(influence(modelglm3)$hat)

[Figure 1-4: Residuals vs fitted]

[Figure 1-5: Scale-Location plot]

Figure 1-4 and Figure 1-5, showing the residuals against the fitted values, can be used to detect lack of fit. In this instance, there is no sign that some change to the model is required or that a transformation of one of the variables is needed. The variance seems roughly constant as the fitted values vary.

The residuals can be assessed for normality using a Q-Q plot. For GLMs we do not expect the residuals to be normally distributed, but we are still interested in detecting outliers. According to Figure 1-6, there are 3 points that diverge slightly.

[Figure 1-6: Normal Q-Q plot]

[Figure 1-7: Residuals vs leverage]

The Cook statistics reduce the information to a single value for each case: they represent a scaled measure of the change in the fit if the single case is dropped from the original dataset. In Figure 1-7, all points lie below the contour line of Cook's distance = 0.5. To pick out the single most influential case, we can use the following R command:

> esoph[cooks.distance(modelglm3) > 0.1, ]

Table 10: Information of the top one influential case

Obs #   agegp   alcgp   tobgp       ncases   ncontrols
67      65-74   40-79   0-9 g/day   17       34

By R, the Cook's distance of observation #67 is 0.221344, which is acceptable.

[Figure 1-8: Half normal plot]

From Figure 1-8, there are no apparent outlier points diverging from the rest of the data in the half-normal plot. In conclusion, the results from the diagnostics of Model_GLM III are good.
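Before moving on, note that since Model_GLM III uses the binomial family's default logit link, its coefficients can also be read on the odds scale by exponentiating; a short sketch using the fitted object above:

> exp(coef(modelglm3))      # multiplicative effects on the odds of being a case
> exp(confint(modelglm3))   # the intervals of Table 9 on the odds scale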
4.2 Case Analysis II: Poisson GLM

The data "ships" from the R package "MASS" give the number of damage incidents and aggregate months of service by ship type, year of construction, and period of operation. Let inc, serv, type, year and period stand for the number of damage incidents, aggregate months of service, ship type, year of construction and period of operation respectively. To find out how inc is affected by these factors, we first plot the data to get an overview.

[Figure 2-1: Plots of incidents vs service]

Clearly, the points are not evenly spread; most are compressed into the bottom left of Figure 2-1, which means inc is not normally distributed. From Figure 2-2, we can see that inc is highly correlated with type.

[Figure 2-2: Boxplot for incidents vs type]

If we use a standard linear regression model, Model III, directly, we get the following results:

Table 11: Output of Model III by R

                   Estimate     Std. Error   t value   Pr(>|t|)
(Intercept)        -3.534e+01   1.906e+01    -1.854    0.0721
service            1.430e-03    1.443e-04    9.910     1.07e-11 ***
unclass(type)      -6.504e-01   8.874e-01    -0.733    0.4685
unclass(year)      2.697e-01    2.209e-01    1.221     0.2304
unclass(period)    3.279e-01    1.603e-01    2.046     0.0483 *

Residual standard error: 7.596 on 35 degrees of freedom
Multiple R-squared: 0.7687, Adjusted R-squared: 0.7422
F-statistic: 29.07 on 4 and 35 DF, p-value: 1.083e-10

[Figure 2-3: Residuals vs fitted]

$R_a^2$ is 0.7422 in Table 11, which is not good but still acceptable. However, the deviance of this model is 2019.549, which is too large, and we see clear evidence of non-constant variance in Figure 2-3. Since the dependent variable inc is a non-negative count, we may consider using a Poisson GLM to estimate the parameters and explain inc in terms of the given predictors. Developing Model_GLM IV with the following R commands, we get the results in Table 12:

> modelglm4 <- glm(incidents ~ service + unclass(type) + unclass(year) + unclass(period), family = poisson, ships)
> summary(modelglm4)

Table 12: Output of Model_GLM IV by R

                   Estimate     Std. Error   z value   Pr(>|z|)
(Intercept)        -8.176e+00   1.194e+00    -6.847    7.52e-12 ***
service            9.131e-05    5.278e-06    17.300    < 2e-16 ***
unclass(type)      -2.106e-01   5.405e-02    -3.896    9.76e-05 ***
unclass(year)      6.720e-02    1.301e-02    5.167     2.38e-07 ***
unclass(period)    8.001e-02    9.336e-03    8.570     < 2e-16 ***

(Dispersion parameter for poisson family taken to be 1)
Null deviance: 730.25 on 39 degrees of freedom
Residual deviance: 234.59 on 35 degrees of freedom

From Table 12, Model_GLM IV has a much smaller deviance of 234.59 compared to Model III, and it also achieves good statistical significance for all parameters.

For a Poisson distribution, the mean is equal to the variance. To investigate the relationship between mean and variance in this model, we use the following R commands to plot; a line representing mean equal to variance is also shown in Figure 2-4:

> plot(log(fitted(modelglm4)), log((ships$incidents - fitted(modelglm4))^2))
> abline(0, 1)

[Figure 2-4: Relationship between mean and variance]

From Figure 2-4, we see that most points are concentrated beside the line y = x, which shows that the variance is proportional to, and almost equal to, the mean. According to Table 12, we get

$$\log(\widehat{inc}) = -8.176 + 9.131 \times 10^{-5}\, serv - 2.106 \times 10^{-1}\, type + 6.72 \times 10^{-2}\, year + 8.001 \times 10^{-2}\, period.$$

This tells us that all 4 predictors have some effect on inc, while serv has the biggest and most critical influence: inc increases dramatically with the growth of serv.
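A quick numerical check of the var = mean assumption, complementary to Figure 2-4 (a sketch using the fitted object above): the Pearson statistic divided by the residual degrees of freedom estimates the dispersion, which equals 1 for an exactly Poisson response.

> sum(residuals(modelglm4, type = "pearson")^2) / df.residual(modelglm4)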
Furthermore, the 95% confidence intervals for each parameter in Model_GLM IV can be calculated with the following R command:

> confint(modelglm4)

Table 13: 95% confidence intervals for each parameter in Model_GLM IV

                   2.5 %           97.5 %
(Intercept)        -1.055932e+01   -5.8753853466
service            8.118537e-05    0.0001018970
unclass(type)      -3.176695e-01   -0.1055708975
unclass(year)      4.205088e-02    0.0930844404
unclass(period)    6.207501e-02    0.0987124364

For diagnostics of Model_GLM IV, we use the following R command:

> plot(modelglm4)

[Figure 2-5: Residuals vs fitted]

[Figure 2-6: Scale-Location plot]

[Figure 2-7: Normal Q-Q plot]

From Figure 2-5 and Figure 2-6, the variance seems to stay constant as the fitted values vary, though there are 3 outliers, as labeled in those figures. According to Figure 2-7, there are 3 outliers, consistent with what was shown in Figure 2-5 and Figure 2-6. More information about the background of the data gathering for this dataset, and a further detailed case analysis of these 3 cases, would be needed.

[Figure 2-8: Residuals vs leverage]

Contours of equal Cook's distance are shown in Figure 2-8. There are 2 labeled points with Cook's distance larger than 1; usually, points with a Cook's distance of 1 or more are considered for closer examination in the analysis. In conclusion, the results from the diagnostics of Model_GLM IV are satisfying, though the 3 outlier cases need more investigation, especially observations #9 and #11.

4.3 Case Analysis III: Gamma GLM

The data "fruitfly" from the R package "faraway" form a data frame with 124 rows and 3 columns. 125 fruitflies were divided randomly into 5 groups of 25 each. The dependent variable was the longevity of the fruitfly in days. One group was kept solitary, while another was kept individually with a virgin female each day. Another group was given 8 virgin females per day. As an additional control, the fourth and fifth groups were kept with one or eight pregnant females per day; pregnant fruitflies will not mate. The thorax length of each male was measured, as this was known to affect longevity. One observation, in the "many" group, has been lost. The data frame contains 3 columns: thorax length, lifetime in days and activity. Activity is divided into 5 groups: isolated = fly kept solitary; one = fly kept with one pregnant fruitfly; many = fly kept with eight pregnant fruitflies; low = fly kept with one virgin fruitfly; high = fly kept with eight virgin fruitflies.

To show how thorax length and activity influence lifetime in days, we first plot the data:

[Figure 3-1: Plots of longevity vs thorax]

[Figure 3-2: Boxplot for longevity vs activity]

From Figure 3-1 and Figure 3-2, we can see that longevity is not normally distributed. Clearly, there is a strong positive correlation with thorax. If we develop Model IV with a standard linear regression model, we get the following results:

Table 14: Output of Model IV by R

                     Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)          -40.6061   13.0459      -3.113    0.00231 **
thorax               133.6439   15.0704      8.868     7.94e-15 ***
unclass(activity)    -3.9053    0.8052       -4.850    3.72e-06 ***

Residual standard error: 12.54 on 121 degrees of freedom
Multiple R-squared: 0.4955, Adjusted R-squared: 0.4872
F-statistic: 59.43 on 2 and 121 DF, p-value: < 2.2e-16

Calculated by R, the deviance of Model IV is 19040.71, which is far too large, and from Table 14, $R_a^2$ is 0.4872, which is not high enough.
Since the standard linear regression model does not work well with the data and the dependent variable is continuous, we try the alternative of developing Model_GLM V with the following R commands:

> modelglm5 <- glm(longevity ~ thorax + unclass(activity), fruitfly, family = Gamma(link = log))
> summary(modelglm5)

Table 15: Output of Model_GLM V by R

                     Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)          2.09351    0.24390      8.584     3.71e-14 ***
thorax               2.62882    0.28175      9.330     6.35e-16 ***
unclass(activity)    -0.07610   0.01505      -5.055    1.55e-06 ***

(Dispersion parameter for Gamma family taken to be 0.05500025)
Null deviance: 13.2803 on 123 degrees of freedom
Residual deviance: 6.8165 on 121 degrees of freedom

According to Table 15, Model_GLM V reduces the deviance dramatically, from 19040.71 for Model IV to 6.8165, and also achieves better statistical significance for all three parameters. We get

$$\log(\widehat{longevity}) = 2.09351 + 2.62882\, thorax - 0.0761\, activity.$$

This shows that longevity is mainly determined by thorax length, while sexual activity also serves as an explanatory factor with a negative effect.

Furthermore, to obtain the 95% confidence intervals of the estimated parameters in Model_GLM V, we can use the following R command:

> confint(modelglm5)

Table 16: 95% confidence intervals for each parameter in Model_GLM V

                     2.5 %        97.5 %
(Intercept)          1.6103516    2.58266005
thorax               2.0619222    3.19070128
unclass(activity)    -0.1065976   -0.04561404

For diagnostics of Model_GLM V, we use the following R command:

> plot(modelglm5)

[Figure 3-3: Residuals vs fitted]

Figure 3-3 and Figure 3-4 show the residuals against the fitted values; there is no evidence that the variance is not roughly constant as the fitted values vary.

[Figure 3-4: Scale-Location plot]

[Figure 3-5: Normal Q-Q plot]

We use the Q-Q plot as a good way of detecting outliers. In Figure 3-5, there is 1 point slightly diverging from the other points.

[Figure 3-6: Residuals vs leverage]

The Cook's distance of an observation is a measure of the global influence of that observation on all the predicted values. In Figure 3-6, all points have Cook's distance much less than 0.5, which means that no single point greatly affects the estimates. In conclusion, all the diagnostic figures for Model_GLM V show that the model works quite well.
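With the log link of Model_GLM V, the fitted equation can be read on the response scale directly; a sketch for a hypothetical fly with thorax length 0.8 in activity group 3 (these inputs are illustrative, not taken from the dataset):

> exp(2.09351 + 2.62882 * 0.8 - 0.07610 * 3)   # expected longevity, roughly 53 days
> head(predict(modelglm5, type = "response"))  # fitted longevities for the actual data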
Chapter 5

5.1 Conclusions

Standard linear models cannot handle non-normal dependent variables such as counts or proportions. This was the motivation for the development of GLMs, which can deal with categorical, binary and other dependent variables belonging to an exponential family in canonical form. The link function describes how the mean of the dependent variable is related to a linear combination of the predictor variables, and it can be non-linear. In principle, any monotone, continuous and differentiable function will do, but there are some convenient and common choices.

We choose $\beta$ as the estimated parameters instead of $\theta$, which reduces the number of unknown parameters from N to p. Sometimes we can maximize the likelihood analytically and find an exact solution for the MLE $\hat\beta$, while typically we need numerical optimization by applying the Newton-Raphson iterative method.

To choose the best GLM for prediction, we use hypothesis tests to see whether a simpler model can achieve the same statistical fit as a more general model, which can always attain a high fit by including sufficiently many higher-order terms. For goodness of fit testing, the scaled deviance is asymptotically $\chi^2$ with the appropriate degrees of freedom. For GLM diagnostics, we focus on residuals, Cook's distance and outliers to check the adequacy of the assumptions that support the GLM.

Three typical non-normal datasets from public R data packages were analyzed in Chapter 4 of this thesis, showing the inadequacy of classic linear regression and the advantages of GLMs.

Bibliography

[1] Bickel, P. and K. Doksum (1977). Mathematical Statistics: Basic Ideas and Selected Topics. San Francisco: Holden Day.
[2] Bowman, A. and A. Azzalini (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press.
[3] Dalgaard, P. (2002). Introductory Statistics with R. New York: Springer.
[4] Dobson, A. (1990). An Introduction to Generalized Linear Models. London: Chapman and Hall.
[5] Gill, J. (2001). Generalized Linear Models: A Unified Approach. Thousand Oaks, CA: Sage.
[6] Lindsey, J. K. (1997). Applying Generalized Linear Models. New York: Springer.
[7] Myers, R. and D. Montgomery (1997). A tutorial on generalized linear models. Journal of Quality Technology 29, 274-291.
[8] Nelder, J. and R. Wedderburn (1972). Generalized Linear Models. Journal of the Royal Statistical Society, Series A 135(3), 370-384.
[9] Weisberg, S. (2005). Applied Linear Regression (3rd ed.). New York: Wiley.
Appendix

> # case analysis I
> library(faraway)
> data(esoph)
> esoph
> summary(esoph)
> plot(ncases/(ncases+ncontrols) ~ agegp, esoph, xlab="age", ylab="probability of cases")
> plot(ncases/(ncases+ncontrols) ~ alcgp, esoph, xlab="alcohol", ylab="probability of cases")
> plot(ncases/(ncases+ncontrols) ~ tobgp, esoph, xlab="tobacco", ylab="probability of cases")
> pro <- esoph$ncases/(esoph$ncases+esoph$ncontrols)
> hist(pro, xlab="probability of cases")
> plot(density(pro))
> model1 <- lm(ncases ~ unclass(agegp)+unclass(alcgp)+unclass(tobgp), esoph)
> summary(model1)
> model2 <- lm(ncases ~ unclass(agegp) + unclass(alcgp) + unclass(tobgp) + unclass(agegp)*unclass(alcgp) + unclass(agegp)*unclass(tobgp) + unclass(alcgp)*unclass(tobgp), esoph)
> summary(model2)
> predict(model1)
> residuals(model1)
> deviance(model1)
> predict(model2)
> residuals(model2)
> deviance(model2)
> modelglm1 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp), data = esoph, family = binomial())
> modelglm2 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp) + unclass(agegp)*unclass(alcgp) + unclass(agegp)*unclass(tobgp) + unclass(alcgp)*unclass(tobgp), data = esoph, family = binomial())
> summary(modelglm1)
> summary(modelglm2)
> drop1(modelglm1, test="F")
> drop1(modelglm2, test="F")
> modelglm3 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(tobgp) + unclass(alcgp) + unclass(alcgp)*unclass(tobgp), data = esoph, family = binomial())
> summary(modelglm3)
> anova(model1, modelglm1)
> confint(modelglm3)
> plot(modelglm3)
> esoph[cooks.distance(modelglm3) > 0.1, ]
> halfnorm(influence(modelglm3)$hat)
> # case analysis II
> library(MASS)
> data(ships)
> ships
> summary(ships)
> plot(incidents ~ service, ships, main="incidents vs service")
> plot(incidents ~ year, ships, main="incidents vs year")
> model3 <- lm(incidents ~ service+unclass(type)+unclass(year)+unclass(period), ships)
> summary(model3)
> plot(predict(model3), residuals(model3), xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted")
> modelglm4 <- glm(incidents ~ service + unclass(type) + unclass(year) + unclass(period), family=poisson, ships)
> summary(modelglm4)
> deviance(model3)
> # deviance residuals of the Poisson GLM, on the mean and linear-predictor scales
> plot(residuals(modelglm4) ~ predict(modelglm4, type="response"), xlab=expression(hat(mu)), ylab="deviance residuals")
> plot(residuals(modelglm4) ~ predict(modelglm4, type="link"), xlab=expression(hat(eta)), ylab="deviance residuals")
> plot(residuals(modelglm4, type="response") ~ predict(modelglm4, type="link"), xlab=expression(hat(eta)), ylab="response residuals")
> plot(log(fitted(modelglm4)), log((ships$incidents-fitted(modelglm4))^2))
> abline(0, 1)
> plot(modelglm4)
> cooks.distance(modelglm4)
> confint(modelglm4)
> # case analysis III
> library(faraway)
> data(fruitfly)
> fruitfly
> summary(fruitfly)
> plot(longevity ~ thorax, fruitfly, main="longevity vs thorax")
> plot(longevity ~ activity, fruitfly, main="longevity vs activity")
> model4 <- lm(longevity ~ thorax+unclass(activity), fruitfly)
> summary(model4)
> deviance(model4)
> modelglm5 <- glm(longevity ~ thorax+unclass(activity), fruitfly, family=Gamma(link=log))
> summary(modelglm5)
> deviance(modelglm5)
> plot(modelglm5)
> confint(modelglm5)