ANALYSIS OF ROBUSTNESS AND RESIDUALS IN THE AFFYMETRIX
GENE EXPRESSION MICROARRAY SUMMARIZATION

by

Huanying Ge

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(STATISTICS)

August 2008

Copyright 2008 Huanying Ge

Dedication

To my parents.

Acknowledgements

I am extremely grateful for having Dr. Lei Li as my advisor. I thank him for his continual encouragement, training and support during the years I spent in his lab. I could never have done all this work without his patient guidance. He has been an inspiration to me in doing this research.

I next thank Dr. Larry Goldstein, in the Department of Mathematics, for his suggestions and help in writing this thesis. Most of my statistical knowledge about linear regression analysis was learned in Dr. Larry Goldstein's classes. I also appreciate the other committee member, Dr. Liang Chen, for her time and help in reading and improving the thesis.

I also thank many people in the PhD program of computational biology and bioinformatics, including Dr. Chao Cheng, Dr. Xiaotu Ma, Dr. Quansong Ruan, Min Xu, Yu Huang and Fang Fang, for their help and support during my research. I benefited greatly from the collaboration and discussion with them.

I acknowledge and appreciate the love and support from my parents, without whom I would be nothing. I dedicate this thesis to them.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Regression analysis
  1.2 Least squares solution
  1.3 Breakdown point
  1.4 Sensitivity curve and influence function
Chapter 2: Method of least absolute deviations (LAD)
  2.1 Introduction to LAD
  2.2 LAD solution for categorical variables
  2.3 Computation of LAD and median polish
  2.4 Robustness of LAD method

Chapter 3: Application of LAD to oligonucleotide microarray pre-processing
  3.1 Introduction to microarray pre-processing
  3.2 Summarization using LAD method
    3.2.1 A two-factor summarization model
    3.2.2 LAD estimation and its computation
  3.3 Robustness of LAD estimates

Chapter 4: Robustness assessment by standard spike-in dataset
  4.1 Affymetrix spike-in data sets
  4.2 Differential gene expression estimation
  4.3 Residual analysis
  4.4 Accuracy of log-ratio estimate
  4.5 Sensitivity curve and bound
  4.6 Standardized log-ratio
  4.7 Discussion

References

List of Tables

4.1 The spike-in probe-sets and their concentrations in the HG-U133A "Expt-3-4" dataset. The unit for the concentration values shown in the table is picomolar.

4.2 log_2-transformed intensity values after normalization for one probe-set example.
4.3 Log-ratio and SAE values for each spike-in gene estimated by LAD regression and by the median polish procedure.

List of Figures

1.1 Scatter plot and LS regression line for simulated data. Solid: LS fitted line using all the data points; dashes: LS fitted line using only the first four data points.

2.1 Scatter plot of simulated data and two fitted lines by LS and LAD. Solid: LS fitted line; dashes: LAD fitted line.

4.1 M-A plots of the "Expt-3-4" data set estimated by LAD regression. Grey: non-spike-in probe-sets; black: spike-in probe-sets with expected log-ratio 1; magenta: spike-in probe-sets with zero input concentration in Exp 3; green: suggested spike-in homologs. Spike-in probe-sets with zero input concentration in Exp 4 do not show up in the plot because their M value (log-ratio of Exp 4 versus Exp 3) is very low (< -1.5).

4.2 The residual distributions of 4 spike-in probe-sets after the LAD regression fit.

4.3 Distributions of "outlier residual" ratios for both spike-in and non-spike-in probe-sets. The "outlier residual" is defined using cutoff value 0.3 (A) or 0.5 (B). Histogram: distribution for non-spike-in probe-sets (22235 probe-sets in total); black dots: the "outlier residual" ratio for each spike-in or homolog probe-set; green curve: the density curve for the spike-in probe-sets (71 in total).

4.4 Heatmap of residuals for five probe-sets. Each row is one probe; each column is one array. The left three arrays are from Exp 3; the right three are from Exp 4.

4.5 In panel A, the x-axis is the average log_2(concentration) designed in Exp 3 and 4 and the y-axis is the estimated 1/f(0). In panel B, the x-axis is the estimated log-ratio from the LAD model fit.
4.6 Deviation of the log-ratio estimate with one additional perturbation point. y-axis: deviation of the log-ratio (\hat{m}_{n+1} - \hat{m}_n); x-axis: the log_2 intensity value of the additional datum.

4.7 Deviation of the log-ratio estimate with two additional perturbation observations. z-axis: deviation of the log-ratio (\hat{m}_{n+2} - \hat{m}_n); x- and y-axes: the log_2 intensity values of the two additional observations. The two 3-D plots illustrate two cases: one additional datum on Exp 4, probe 9 and the other on Exp 3, probe 4; one additional datum on Exp 4, probe 8 and the other on Exp 3, probe 9.

4.8 Adjusted M-A plots of the "Expt-3-4" data set after standardization using the 1/f(0) estimates. Grey: non-spike-in probe-sets; black: spike-in probe-sets with expected log-ratio 1; magenta: spike-in probe-sets with zero input concentration in Exp 3; green: suggested spike-in homologs.

Abstract

The DNA microarray has been widely used in the field of functional genomics. The estimation of gene expression from microarrays is a statistical problem on which much effort has been spent. In this study, we focus on the summarization step of Affymetrix microarray pre-processing. We apply Least Absolute Deviation (LAD) regression to estimate the probe- and treatment-specific effects in the widely used two-factor model. Median polish can serve as an approximation to the LAD regression in this two-factor summarization model. We show that the LAD estimator is robust in the sense that it has bounded influence, where the bound is strongly associated with the RNA concentration. Furthermore, we calculate the influence bound and the standard error, which are used as measures of accuracy for the log-ratio estimate.
Chapter 1

Introduction

1.1 Regression analysis

We use regression analysis to examine the relationship between a response variable and explanatory variables. In practice, we try to find a curve or line which fits a series of data points under some constraints. Given a set of data points (x_i, y_i), i = 1, 2, ..., n, we try to find a function f such that \hat{y}_i = f(x_i) \approx y_i. We may suppose that the function f has a certain form with several parameters which need to be estimated. For example, suppose it is a simple linear function, f(x) = ax + b. We then fit the simple linear model

    y_i = a x_i + b + \epsilon_i

to the observed variables {(x_i, y_i)}, i = 1, ..., n. Here x is the explanatory variable, y is the response variable and \epsilon_i is the error term. When we have more than one explanatory variable, {x_{i1}, x_{i2}, ..., x_{ip}}, the linear model is enlarged to

    y_i = \beta_0 + x_{i1}\beta_1 + ... + x_{ip}\beta_p + \epsilon_i = X_i^T \beta + \epsilon_i,  i = 1, ..., n,

or Y = X\beta + \epsilon, where

    X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix},
    \qquad
    Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.

For each dataset, we can obtain a regression line according to a certain objective constraint or optimization, yielding the coefficient estimates \hat{\beta} = [\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p]^T.

1.2 Least squares solution

The least squares (LS) method is the most popular way to fit linear models. The objective is to minimize the residual sum of squares (RSS):

    \arg\min_{\hat{\beta}} \sum_{i=1}^{n} r_i^2,

where r_i = y_i - \hat{y}_i and \hat{y}_i = \hat{\beta}_0 + x_{i1}\hat{\beta}_1 + ... + x_{ip}\hat{\beta}_p. In the case of the simple linear model y_i = a x_i + b + \epsilon_i, we seek the values a and b that minimize the RSS:

    (\hat{a}, \hat{b}) = \arg\min_{a,b} \sum_{i=1}^{n} (y_i - a x_i - b)^2.

In the LS model, the error term \epsilon is assumed to be Gaussian noise with mean zero. The regression coefficient estimates can be obtained by solving the normal equation:

    X^T X \beta = X^T Y.    (1.1)

If X has full rank, we can get the solution:

    \hat{\beta} = (X^T X)^{-1} X^T Y.
    (1.2)

If the columns of X are not linearly independent, we cannot get a unique solution from (1.2). One solution is given by

    \hat{\beta} = (X^T X)^{-} X^T Y,    (1.3)

where (X^T X)^{-} is any generalized inverse of (X^T X).

In real data analysis, the assumption of normally distributed noise may not be satisfied, and applying the LS method to such a dataset can produce wrong estimates. Let us consider a set of simple observations (X, Y) = {(1.2, 1.5), (2.4, 2.3), (5.6, 5), (8.1, 8.5), (10, 40)} and fit the simple linear model y_i = \beta_0 + \beta_1 x_i + \epsilon_i. We obtain the LS estimates \hat{\beta} = [-7.493, 3.471]^T. We get quite different estimates, \hat{\beta} = [0, 1]^T, if we leave out the data point (10, 40). The two fitted results are shown in Figure 1.1. We see that the regression line is severely affected by a single data point; statistics estimated from data containing outliers can be misleading. Such outliers prompt the statistician to think about the robustness of regression analysis and how to measure it.

Figure 1.1: Scatter plot and LS regression line for simulated data. Solid: LS fitted line using all the data points; dashes: LS fitted line using only the first four data points.

1.3 Breakdown point

We saw that a single outlier can offset the least squares estimator, which indicates that LS is not a robust estimator. To measure robustness, we first introduce the concept of the breakdown point [1,2].

Definition 1.3.1. Suppose we have a sample X_0 = (x_1, x_2, ..., x_n) and a statistic T_n(X_0) based on this sample. If we replace m out of the n components of this sample by arbitrary values, we obtain a new sample X_m and a new statistic value T_n(X_m). T_n has the breakdown value V(T_n, X_0):

    V(T_n, X_0) = m^* / n,

where m^* is the smallest m such that

    \sup_{X_m} |T_n(X_m) - T_n(X_0)| = \infty.

The breakdown value can be interpreted as the smallest proportion of sample perturbation that can drive the statistic to infinity.
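Both the LS fit above and the breakdown behaviour can be checked numerically. The sketch below (plain Python; `ls_fit` is a helper name of my choosing, using the closed-form simple-regression solution) reproduces the two estimates from Figure 1.1 and then drives the outlier's y-value upward to show the LS slope following it without bound:

```python
def ls_fit(pts):
    """Closed-form least squares for y = b + a*x: a = Sxy / Sxx, b = ybar - a*xbar."""
    n = len(pts)
    xbar = sum(x for x, _ in pts) / n
    ybar = sum(y for _, y in pts) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in pts)
    sxx = sum((x - xbar) ** 2 for x, _ in pts)
    a = sxy / sxx
    return a, ybar - a * xbar

data = [(1.2, 1.5), (2.4, 2.3), (5.6, 5.0), (8.1, 8.5), (10.0, 40.0)]

a_all, b_all = ls_fit(data)        # all five points, outlier included
a_four, b_four = ls_fit(data[:4])  # first four points only
print(round(a_all, 3), round(b_all, 3))    # 3.471 -7.493, as in the text
print(round(a_four, 3), round(b_four, 3))  # slope 1, intercept 0

# Breakdown: pushing a single y-value toward infinity drags the LS slope with it.
for bad_y in (40.0, 400.0, 4000.0):
    slope, _ = ls_fit(data[:4] + [(10.0, bad_y)])
    print(bad_y, round(slope, 2))
```

Replacing the LS criterion with a bounded-influence one is the motivation for the LAD method of Chapter 2.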
For example, for the mean \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} x_i we have V(\bar{X}_n, X_0) = 1/n: if a single datum in the sample is driven to infinity, the sample mean goes to infinity. For the median of a sample X of odd size n,

    median(X) = x_{(\frac{n+1}{2})},  V(median(X), X_0) = \frac{n+1}{2n},

so the sample median has a breakdown value of almost 1/2. For the regression case with sample Z = {(x_1, y_1), ..., (x_n, y_n)}, the breakdown point of the least squares estimator is 1/n, which goes to zero when the sample size is very large.

1.4 Sensitivity curve and influence function

A good estimator is resistant to certain data perturbations. The effect on the estimate of adding an observation can be formulated by the sensitivity curve. The definition of the sensitivity curve is given by [1].

Definition 1.4.1. The sensitivity curve of the estimator T_n based on the sample X = {x_1, ..., x_{n-1}} is

    SC(z; T_n) = n (T_n({x_1, ..., x_{n-1}, z}) - T_{n-1}({x_1, ..., x_{n-1}})).    (1.4)

The sensitivity curve is tied to a particular sample. The influence function (or influence curve), introduced by Hampel in 1974 [3], extends the robustness measure from a sample to a distribution. Let X have probability distribution P_\theta, where \theta is a parameter of interest, and let F(X, \theta) be the distribution function of P_\theta on R^d. We can express \theta as a functional T(F) of the distribution function. The estimator of \theta based on the empirical probability distribution F_n is T_n(F_n).

Definition 1.4.2. The influence function (IF) of an estimator T at distribution F is

    IF(z; T, F) = \lim_{\epsilon \to 0} \frac{T(G_\epsilon) - T(F)}{\epsilon} = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)F + \epsilon H) - T(F)}{\epsilon},

where G_\epsilon is the mixture distribution G_\epsilon = (1-\epsilon)F + \epsilon H, 0 < \epsilon < 1, and H is the perturbation distribution. H can be the probability distribution placing point mass one on z, where z is an additional observation; G_\epsilon is then the mixture distribution under which an observation has probability (1-\epsilon) of being generated by F and probability \epsilon of being the arbitrary value z.
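Definition 1.4.1 is easy to evaluate numerically. A minimal sketch (the sample and the helper name are my own choices) contrasting the sample mean, whose sensitivity curve grows without bound in z, with the sample median, whose curve flattens out:

```python
import statistics

def sensitivity_curve(estimator, sample, z):
    """SC(z; T_n) = n * (T_n(sample + [z]) - T_{n-1}(sample)), as in Definition 1.4.1."""
    n = len(sample) + 1
    return n * (estimator(sample + [z]) - estimator(sample))

sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # the n-1 = 5 fixed observations
for z in (10.0, 100.0, 1000.0):
    print(z,
          sensitivity_curve(statistics.mean, sample, z),    # grows like z - mean
          sensitivity_curve(statistics.median, sample, z))  # stays at 3.0
```

The mean's curve matches the analytic form z - \hat{\mu} (cf. Example 1.4.1), while the median's bounded curve anticipates the robustness results of Chapter 2.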
Hampel also introduced an important summary value of the influence function, called the "gross-error-sensitivity" in his paper [3]. It is the bound of the influence function.

Definition 1.4.3. The gross-error-sensitivity is defined as the supremum of the absolute value of the influence function, i.e. the largest influence that contamination can have on the estimates:

    \gamma(T, F) = \sup_{x} |IF(x; T, F)|.

Remark 1.4.1. There is a relationship between the asymptotic behavior of the estimator and its influence function, according to [4]:

    \sqrt{n}(T_n(F_n) - T(F)) \to N\left(0, \int IF(x; F, T)^2 f(x)\,dx\right);

    \lim_{n \to \infty} n\,var(T_n | F) = \int IF(x; F, T)^2 f(x)\,dx.

Example 1.4.1. The sample mean or expectation is

    \mu = T(F) = \int x\,dF = E(X),  \qquad \hat{\mu} = T_n(F_n) = \int x\,dF_n = \frac{1}{n}\sum_{i=1}^{n} X_i.

The sensitivity curve for the sample {x_1, ..., x_{n-1}} is

    SC(z; T_n) = n (T_n({x_1, ..., x_{n-1}, z}) - T_{n-1}({x_1, ..., x_{n-1}}))
               = n \left( \frac{1}{n}\Big(\sum_{i=1}^{n-1} x_i + z\Big) - \frac{1}{n-1}\sum_{i=1}^{n-1} x_i \right)
               = z - \frac{1}{n-1}\sum_{i=1}^{n-1} x_i
               = z - \hat{\mu}.

The influence function for the mean is

    IF(z; T, F) = \lim_{\epsilon \to 0} \frac{T(G_\epsilon) - T(F)}{\epsilon}
                = \lim_{\epsilon \to 0} \frac{(1-\epsilon)\int u\,dF(u) + \epsilon \int u\,dH_z(u) - \int u\,dF(u)}{\epsilon}
                = \lim_{\epsilon \to 0} \frac{-\epsilon \int u\,dF(u) + \epsilon \int u\,dH_z(u)}{\epsilon}
                = z - \mu.

The sensitivity curve of the sample mean converges to the influence function as n \to \infty. The gross-error-sensitivity of the mean is

    \gamma(T, F) = \sup_{z} |IF(z; T, F)| = \sup_{z} |z - \mu| = \infty.

The asymptotic variance of the mean estimate obtained through its influence function is

    \lim_{n \to \infty} n\,var(T_n | F) = \int (x - \mu)^2 f(x)\,dx = \sigma^2.

This matches the well-known result that the sample mean has var(\bar{x}) = \sigma^2 / n, or n\,var(\bar{x}) = \sigma^2.

Chapter 2

Method of least absolute deviations (LAD)

2.1 Introduction to LAD

Least absolute deviations (LAD), also known as least absolute errors (LAE) or the L_1-norm problem, is another fitting technique [2]. In the simple regression case, it seeks the best a and b which minimize the sum of absolute errors (SAE):

    (\hat{a}, \hat{b}) = \arg\min_{a,b} \sum_{i=1}^{n} |y_i - a x_i - b|.
    (2.1)

We can use the same data as in Figure 1.1 to illustrate the different fitting results of LS and LAD. Figure 2.1 is the scatter plot with the two fitted regression lines for those data points. The LAD regression line is indicated by the dashed line and the LS regression line by the solid line. We can see that the LAD regression line is not severely affected by the outlier data point (10, 40). To obtain the LAD solution, we can reformulate the problem as a linear programming (LP) problem, which can then be solved via LP algorithms such as the simplex and interior point algorithms [5-9]. It is worth mentioning that an LAD fit, like the median estimate, may have multiple solutions. The question of nonuniqueness of LAD regression was discussed in detail by Harter (1977) [10]. Edgeworth [11] also mentioned that it was common to adopt the middle of the indeterminate tract of LAD fits as the final solution.

Figure 2.1: Scatter plot of simulated data and two fitted lines by LS and LAD. Solid: LS fitted line; dashes: LAD fitted line.

2.2 LAD solution for categorical variables

If we have a sample Y = {y_1, y_2, ..., y_n}, we can decompose it as

    y_i = \mu + \epsilon_i,

where \mu is the location parameter and \epsilon_i is the error term. If we use the least squares (LS) criterion, we get \hat{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} y_i.

Proposition 2.2.1. If we use the least absolute deviations (LAD) criterion for the above sample Y, then we get \hat{\mu} = median{Y}.

Proof. According to the LAD criterion, we try to minimize the sum of absolute errors:

    SAE = \sum_{i=1}^{n} |\epsilon_i| = \sum_{i=1}^{n} |y_i - \mu|,

whose derivative with respect to \mu is (SAE)' = -\sum_{i=1}^{n} sign(y_i - \mu). The solution can be written as

    \hat{\mu} = y_{(\frac{n+1}{2})} if n is odd;  \hat{\mu} = x, with y_{(\frac{n}{2})} < x < y_{(\frac{n+2}{2})}, if n is even,

where y_{(i)} is the i-th order statistic of Y (y_{(1)} < y_{(2)} < ... < y_{(n)}). The LAD solution is unique if and only if the sample size is odd. We can also see that the error term has the property median({\epsilon_i}) = 0.

If we separate the sample Y by one factor (e.g.
marriage status) into two levels (married and single), we can label the sample as Y = {y_{11}, y_{12}, ..., y_{1k_1}, y_{21}, y_{22}, ..., y_{2k_2}}, where k_1 + k_2 = n; y_{1i} is a measured value from a married person and y_{2i} is a measured value from a single person. We can decompose the sample as

    y_{ij} = \mu + \alpha_i + \epsilon_{ij}.

This states that each value y_{ij} is the sum of the global term (\mu), the deviation due to the classification of marriage status (\alpha_i) and the random error term \epsilon_{ij}.

Proposition 2.2.2. If we apply the LAD criterion to the above sample Y, then we get

    \mu + \alpha_1 = median({y_{1j}}), j = 1, ..., k_1;
    \mu + \alpha_2 = median({y_{2j}}), j = 1, ..., k_2.

Proof. We try to minimize the SAE:

    SAE = \sum_i \sum_j |\epsilon_{ij}|
        = \underbrace{\sum_{j=1}^{k_1} |y_{1j} - \mu - \alpha_1|}_{(1)} + \underbrace{\sum_{j=1}^{k_2} |y_{2j} - \mu - \alpha_2|}_{(2)}.

As in Proposition 2.2.1, we get \mu + \alpha_1 = median({y_{1j}}) by minimizing (1) and \mu + \alpha_2 = median({y_{2j}}) by minimizing (2).

Remark 2.2.3. We can also see that

    median({\epsilon_{1j}}) = 0, j = 1, ..., k_1;  median({\epsilon_{2j}}) = 0, j = 1, ..., k_2.

This property can be written as median({\epsilon_{ij}} | i) = 0: the median of the errors at each level of the factor is zero. Furthermore, we also have median({\epsilon_{ij}}) = 0.

2.3 Computation of LAD and median polish

In the two-factor model

    y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk},

the sum of absolute errors (SAE) is

    SAE = \sum_{i,j,k} |y_{ijk} - \mu - \alpha_i - \beta_j|.

We have the following residual properties:

    median({\epsilon_{ijk}} | i) = 0;  median({\epsilon_{ijk}} | j) = 0;  median({\epsilon_{ijk}}) = 0.

Therefore, there is a close relationship between LAD fitting and Tukey's procedure, "median polish" [12], which is a nice tool of exploratory data analysis (EDA) for a two-way table. In Tukey's median polish, we associate one factor with the rows and the other with the columns; we may call the two factors the "row effect" and the "column effect". Each row (or column) represents one level of the row (or column) factor. The median polish procedure consists of several iterations, and each iteration has two "half-steps".
During each half-step, we subtract the row (or column) medians from the entries in the table and add these median values to the corresponding row (or column) effects. We alternate the half-step operation between rows and columns. The sum of absolute errors (SAE) is reduced at each step, and we stop the iteration once it becomes stable. The values remaining in the table are the residuals [12-14].

Example 2.3.1. We consider a 3x3 table:

    3 5 7
    4 7 8
    8 9 9

The median polish procedure then runs as follows (row effects on the right, column effects at the bottom, the global term in the corner):

    3 5 7 | 0        -2 0 2 | 5         0 0  1 | -2
    4 7 8 | 0   =>   -3 0 1 | 7   =>   -1 0  0 |  0
    8 9 9 | 0        -1 0 0 | 9         1 0 -1 |  2
    ------+--        -------+--        --------+--
    0 0 0 | 0         0 0 0 | 0        -2 0  1 |  7

We get the row effects [-2, 0, 2], the column effects [-2, 0, 1], and the global term \hat{\mu} = 7.

The median polish procedure produces an additive model fit which is close to optimal in the least-absolute-residuals sense. It provides a useful and easy way to approximate LAD regression in a two-way table.

2.4 Robustness of LAD method

The LAD method is robust in the sense of protecting against the influence of an outlier. An important property of the LAD method is that the estimates are not sensitive to the magnitudes of the residuals, only to their signs. In the presence of outliers, especially with heavy-tailed distributions, LAD estimators are better than the least squares estimator [7,10,15]. Furthermore, the LAD estimator is the maximum likelihood estimator when the error variable follows a double exponential (Laplace) distribution [2]. Monte Carlo studies also demonstrate that it works well in the case of a mixture of normals or contaminated normal error distributions [11,15]. Here, we consider the linear model

    Y = X\beta + \epsilon,    (2.2)

or

    y_i = x_i^T \beta + \epsilon_i,  i = 1, ..., n,    (2.3)

where x_i^T is a 1 x p row vector and the i-th row of the n x p design matrix X; n is the sample size; \beta \in R^p is a vector of unknown regression parameters and p is the number of parameters; \epsilon_i is the error term.
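The half-step iteration of Section 2.3 can be sketched in a few lines (my implementation following Tukey's scheme as described above; the fixed iteration cap `n_iter` is an assumption). Run on the 3x3 table of Example 2.3.1, it reproduces the fit found there:

```python
import numpy as np

def median_polish(table, n_iter=10):
    """Tukey's median polish for a two-way table: alternately sweep row and
    column medians into the effects; what remains in the table are residuals."""
    z = np.asarray(table, dtype=float).copy()
    overall = 0.0
    row = np.zeros(z.shape[0])
    col = np.zeros(z.shape[1])
    for _ in range(n_iter):
        r = np.median(z, axis=1)                      # row half-step
        z -= r[:, None]; row += r
        d = np.median(col); col -= d; overall += d    # keep column effects centered
        c = np.median(z, axis=0)                      # column half-step
        z -= c[None, :]; col += c
        d = np.median(row); row -= d; overall += d    # keep row effects centered
    return overall, row, col, z

mu, row, col, resid = median_polish([[3, 5, 7], [4, 7, 8], [8, 9, 9]])
print(mu, row, col)   # 7.0 [-2. 0. 2.] [-2. 0. 1.], as in Example 2.3.1
```

The fitted table decomposes exactly as \hat{\mu} + row effect + column effect + residual.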
Before we quantify the robustness of LAD using the influence function (IF), we make two assumptions on the error term and the explanatory variable x.

Assumption 2.4.1. The regression errors {\epsilon_i} are i.i.d. with a positive and continuous density function f; \epsilon and x are independent, and the median of {\epsilon_i} is zero.

Assumption 2.4.2. The regression explanatory variable x, a column vector with elements x_1, x_2, ..., x_p, has the property that Q = E(x x^T) is a positive definite (p x p) matrix. Equivalently, if we have a sample of size n and x_i is the i-th value of the variable x, written in the form of the design matrix X with x_i^T as its i-th row, the law of large numbers gives that \lim_{n \to \infty} \frac{1}{n} X^T X = Q is a positive definite matrix.

Then, according to [2,16], we have the following theorem.

Theorem 2.4.1. If the regression model satisfies Assumptions 2.4.1 and 2.4.2, then the influence function of the LAD estimator at the point (x, y) is given by

    \frac{1}{2 f(0)} Q^{-1} x\, sign(y - x^T \beta),    (2.4)

where f(0) is the density of \epsilon at its median zero, and the last term is the sign function, taking values -1, 0, or 1.

We can see that for fixed x the influence function is bounded, as a sign function of y. If x is not fixed but can only take dummy values, as in the categorical case, the influence function is still bounded, and the largest influence on the estimates depends only on f(0). The proof of this theorem can be found in [2,16].

Another important statistical property of the LAD estimator was stated by Koenker and Bassett [17]: the LAD estimator has an asymptotically normal limiting distribution, by the following theorem.

Theorem 2.4.2. Assume the regression model satisfies Assumptions 2.4.1 and 2.4.2. Then the LAD regression estimate \hat{\beta} (p x 1) has a limiting normal distribution:

    \sqrt{n}(\hat{\beta} - \beta) \to N\left(0, \frac{1}{(2 f(0))^2} Q^{-1}\right) in distribution.    (2.5)

The proof of the theorem can be found in [17] and [18]. This asymptotic distribution can be used to construct confidence intervals for the LAD regression estimator [19,20].
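The bounded influence promised by Theorem 2.4.1 can be observed directly on the five-point data of Figure 2.1. The sketch below fits simple-regression LAD by brute force over all lines through pairs of points (valid here because some LAD optimum always interpolates at least two data points; `lad_fit` is my helper name), then makes the outlier ever more extreme:

```python
from itertools import combinations

def lad_fit(pts):
    """Brute-force LAD fit of y = b + a*x: scan lines through every pair of
    points and keep the one with the smallest sum of absolute errors (SAE)."""
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        sae = sum(abs(y - a * x - b) for x, y in pts)
        if best is None or sae < best[0]:
            best = (sae, a, b)
    return best[1], best[2]

base = [(1.2, 1.5), (2.4, 2.3), (5.6, 5.0), (8.1, 8.5)]
for bad_y in (40.0, 400.0, 4000.0):
    a, b = lad_fit(base + [(10.0, bad_y)])
    print(bad_y, round(a, 3), round(b, 3))  # the fit stops reacting once the point is an outlier
```

Contrast this with least squares, whose slope grows without bound under the same perturbation (Section 1.2): only the sign of the outlier's residual enters the LAD fit, not its magnitude.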
Chapter 3

Application of LAD to oligonucleotide microarray pre-processing

3.1 Introduction to microarray pre-processing

The DNA microarray has become a widely used technique in the field of functional genomics. It measures gene expression levels using tens of thousands of probes which are spotted on a solid surface, providing a global expression profile for thousands of genes simultaneously.

Microarray technology

Microarray technologies can be divided into two major categories: cDNA arrays and high-density oligonucleotide arrays. The typical cDNA array is two-channel (or two-color). The probes are sequence fragments which are generated by PCR and spotted on the solid surface. Generally, each probe is designed to represent one kind of mRNA, or one gene. During the experiment, mRNAs from two samples (e.g. cancer cells versus normal cells) are isolated and reverse transcribed into cDNAs. These cDNAs are labeled sample-specifically with two different fluorophore colors. The two cDNA samples are mixed and hybridized to one microarray, and the array is scanned to get the intensities of the two fluorophores. The change of gene expression levels between the two samples can be determined from the relative fluorophore intensities.

The oligonucleotide microarray is designed to estimate the absolute value of gene expression. Several commercial oligonucleotide microarrays are available from companies such as Affymetrix, Agilent, NimbleGen and Illumina. These arrays are composed of oligonucleotides on a silica substrate; the lengths of the oligonucleotides vary among products. The Affymetrix expression microarray uses 11-21 probes, each 25 bases long, to represent one gene; as a whole they are called a probe-set. Associated with each perfect match (PM) probe, a mismatch (MM) probe that differs only in the middle (13th) base is included in most expression arrays.
Because the oligonucleotide microarray is a one-channel array, each array is used to measure the gene expression profile of one sample. After hybridization and image processing, we get the fluorescence intensity of each probe on all arrays. The final mRNA abundance estimate for each gene is calculated from these intensity values.

Microarray pre-processing

The estimation of gene expression from probe intensities is a statistical problem on which much effort has been spent. Among the proposed methods are Affymetrix's MAS 5.0 [21], Li and Wong's dChip [22], and RMA from Irizarry et al. [23]. Each method mainly consists of two modules: normalization and summarization. Normalization aims to balance intensity values in order to reduce non-biological variation introduced during sample preparation and instrumental reading. Most of the available normalization methods work at the probe level, including loess, contrasts, the invariant set [22,24], quantiles [25] and sub-array methods [26,27]. Summarization combines the normalized signal values in each probe-set and produces a final abundance estimate for the corresponding gene. Several summarization methods are available, including Affymetrix's MAS 5.0 [21], dChip's MBEI [22] and RMA's median polish [28].

3.2 Summarization using LAD method

After the proper background correction, normalization and PM correction steps, summarization is the last step in pre-processing, where multiple probe intensities within a probe-set are combined to produce an expression value. Suppose that we have I (I > 1) arrays; then for a given probe-set with J (J > 1) probes, we have I x J adjusted intensity values {y_{ij}} for the corresponding gene. The final measured expression value of the gene on each array will be estimated based on a certain statistical model.

3.2.1 A two-factor summarization model

The RMA package performs background correction first, and then normalizes the arrays using quantile normalization.
In order to obtain the final expression value for each probe-set, RMA assumes that the background-corrected, normalized and log-transformed PM intensity values y_{ij} follow a linear additive model [28]:

    y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij},    (3.1)

where \mu is the global term; \mu + \alpha_i represents the log-scale expression value of this probe-set on array i, i = 1, ..., I; \beta_j represents the log-scale probe-specific effect for probe j, j = 1, ..., J; and \epsilon_{ij} is the error term. Generally, the log_2 scale is used for the log-transformation.

3.2.2 LAD estimation and its computation

In a typical microarray experiment, we have several replicate arrays at each treatment level. In RMA's summarization model, \alpha_i represents the array-specific effect; after the model fitting, the expression value for each treatment is usually calculated as the median or mean across the replicate arrays. In this study, we use a treatment-specific effect instead of the array-specific effect, so we obtain the expression value for each treatment directly. We modify the summarization to the following objective problem:

Problem 3.2.1. In each probe-set, let the variable y_{ijk} be the j-th probe intensity value on the k-th replicate array of the i-th treatment, where i = 1, ..., I; j = 1, ..., J; k = 1, ..., K. We try to decompose each intensity value according to the following model:

    log_2(y_{ijk}) = \mu + \alpha_i + \beta_j + \epsilon_{ijk},    (3.2)

where \mu is the global term, \alpha_i represents the i-th treatment-specific effect, and \beta_j represents the j-th probe-specific effect.

To estimate the parameters in the above additive model, we adopt the LAD method. That is, we minimize the sum of absolute errors (SAE):

    SAE = \sum_{i,j,k} |\epsilon_{ijk}| = \sum_{i,j,k} |log_2(y_{ijk}) - \mu - \alpha_i - \beta_j|.    (3.3)

There are existing LAD regression functions available in R and SAS. In the R environment, the function rq() in the R package quantreg [29-31] with the quantile parameter tau = 0.5 runs the LAD regression. However, in model (3.2), we notice that all regressors are categorical variables.
Therefore, we can minimize the sum of absolute deviations by the median polish approach [12], treating the treatment-specific effect as the row (column) effect and the probe-specific effect as the column (row) effect.

3.3 Robustness of LAD estimates

We further investigate the LAD estimates of gene expression in the typical treatment-control case where the experimental design is balanced.

Definition 3.3.1. In a balanced treatment-control microarray experiment, we have only two treatment levels, treatment versus control, and the same number of replicate arrays (a number k > 1) under both treatment and control.

We rewrite model (3.2) in the matrix form

    Y = X\beta + \epsilon = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \beta + \epsilon,    (3.4)

where Y = {y_1, ..., y_n} is the vector of signal values and n is the sample size; X is the design matrix; \beta is a vector of parameters including the effects of each probe and treatment; and x_i^T is an indicator vector which indexes the signal value y_i by its levels on each factor.

In a balanced microarray experiment as defined by 3.3.1, we have only two treatment levels, \alpha_1 and \alpha_2. Assume that the probe-set has N_p PM probes. If we use the sum contrasts, we get \alpha_2 = -\alpha_1, \beta_{N_p} = -(\beta_1 + ... + \beta_{N_p - 1}), and \beta^T = {\mu, \alpha_1, \beta_1, ..., \beta_{N_p - 1}}. For example, the indicator vector

    x_i^T = [1, 1, 0, 0, 1, 0, 0, ..., 0]

(one intercept entry, one treatment entry, then the probe indicator entries) means that the signal value y_i is from the 1st treatment and the 3rd probe; and the indicator vector

    x_j^T = [1, -1, -1, -1, -1, ..., -1]

means that the signal value y_j is from the 2nd treatment and the N_p-th probe.

The design matrix X has (N_p + 1) columns:

    X = \begin{pmatrix}
    1 &  1 &  1 &  0 &  0 & \cdots &  0 \\
    \vdots & \vdots & \vdots & \vdots & \vdots &  & \vdots \\
    1 &  1 &  1 &  0 &  0 & \cdots &  0 \\
    1 &  1 &  0 &  1 &  0 & \cdots &  0 \\
    \vdots & \vdots & \vdots & \vdots & \vdots &  & \vdots \\
    1 &  1 &  0 &  1 &  0 & \cdots &  0 \\
    \vdots & \vdots & \vdots & \vdots & \vdots &  & \vdots \\
    1 & -1 & -1 & -1 & -1 & \cdots & -1 \\
    \vdots & \vdots & \vdots & \vdots & \vdots &  & \vdots \\
    1 & -1 & -1 & -1 & -1 & \cdots & -1
    \end{pmatrix}.    (3.5)

Proposition 3.3.1.
If the microarray experiment is a balanced treatment-control design (3.3.1), then (\frac{1}{n} X^T X)^{-1} is constant and positive definite.

Proof. We have the same number of replicate arrays (a number k > 1) under treatment and control. From a design matrix X of the form (3.5), we get

    \frac{1}{n} X^T X = \begin{pmatrix}
    1 & 0 & 0 & 0 & \cdots & 0 \\
    0 & 1 & 0 & 0 & \cdots & 0 \\
    0 & 0 & 2/N_p & 1/N_p & \cdots & 1/N_p \\
    0 & 0 & 1/N_p & 2/N_p & \cdots & 1/N_p \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & 1/N_p & 1/N_p & \cdots & 2/N_p
    \end{pmatrix}.    (3.6)

This is an (N_p + 1) x (N_p + 1) matrix. Since it has full rank, we can calculate its inverse directly:

    \left(\frac{1}{n} X^T X\right)^{-1} = \begin{pmatrix}
    1 & 0 & 0 & 0 & \cdots & 0 \\
    0 & 1 & 0 & 0 & \cdots & 0 \\
    0 & 0 & N_p - 1 & -1 & \cdots & -1 \\
    0 & 0 & -1 & N_p - 1 & \cdots & -1 \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & -1 & -1 & \cdots & N_p - 1
    \end{pmatrix}.    (3.7)

The formula for \frac{1}{n} X^T X does not depend on the number of replicates in the experiment.

Let us denote the log-ratio of treatment versus control by m. We get m = \alpha_1 - \alpha_2 = 2\alpha_1. According to Theorems 2.4.1 and 2.4.2, we obtain the following important properties of LAD-estimated log-ratios.

Theorem 3.3.2. If we assume that the error term in summarization model (3.4) satisfies Assumption 2.4.1, then the LAD-estimated log-ratio \hat{m} in the balanced treatment-control design (3.3.1) has the following properties:

- The influence on the log-ratio estimate is bounded, with gross-error-sensitivity value

    \gamma(m, F) = \frac{1}{f(0)}.    (3.8)

- The log-ratio estimate has an asymptotically normal limiting distribution:

    \sqrt{n}(\hat{m} - m) \to N\left(0, \frac{1}{(f(0))^2}\right) in distribution.    (3.9)

Proof. We notice that the entries coding \alpha_1 or \alpha_2 are 1 or -1. According to Theorem 2.4.1 and the formula (3.7) for (\frac{1}{n} X^T X)^{-1}, we have

    \gamma(\alpha_1, F) = \frac{1}{2 f(0)}.

Because m = \alpha_1 - \alpha_2 = 2\alpha_1, we get

    \gamma(m, F) = \frac{1}{f(0)}.

Similarly, the asymptotically normal limiting distribution of the log-ratio is derived directly from Theorem 2.4.2 and the formula (3.7).

The robustness of the LAD estimator in microarray summarization is thus shown by the bounded influence function. The largest influence on the log-ratio estimates, the gross-error-sensitivity, is 1/f(0).
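As a worked special case (my illustration; the thesis fixes no particular error law beyond Assumption 2.4.1): if the summarization errors were double exponential, the case in which LAD is the maximum likelihood estimator (Section 2.4), the bound and the standard error of \hat{m} take closed forms:

```latex
\begin{aligned}
\varepsilon &\sim \mathrm{Laplace}(0,b):\quad
  f(x) = \tfrac{1}{2b}\,e^{-|x|/b}, \qquad f(0) = \tfrac{1}{2b},\\[4pt]
\gamma(m,F) &= \frac{1}{f(0)} = 2b
  \qquad\text{(gross-error-sensitivity, eq.\ (3.8))},\\[4pt]
\operatorname{se}(\hat m) &\approx \frac{1}{f(0)\sqrt{n}} = \frac{2b}{\sqrt{n}}
  \qquad\text{(from the limiting distribution (3.9))}.
\end{aligned}
```

Heavier noise (larger b) inflates the influence bound and the standard error proportionally.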
Chapter 4
Robustness assessment by standard spike-in dataset

4.1 Affymetrix spike-in data sets

We evaluate the LAD estimation using the Affymetrix HG-U133A data set [32]. It contains triplicates of 14 experiments with 42 spike-in genes. The spike-in genes are organized into 14 gene groups, each of which contains 3 genes at the same concentration. The concentrations range from 0 to 512 pM according to a 14 × 14 Latin square design. We analyze the arrays from experiments 3 and 4 in particular. Experiment 3 consists of the triplicate Exp03_R1, Exp03_R2 and Exp03_R3; experiment 4 consists of the triplicate Exp04_R1, Exp04_R2 and Exp04_R3. The designed concentrations of 36 spike-in genes in experiment 4 are two-fold higher than in experiment 3; the concentrations of the remaining 6 spike-in genes are 0 in either experiment 3 or experiment 4 (see Table 4.1). Besides these designed spike-in probe-sets, a recent paper has suggested that 22 additional probe-sets detect the spike-in RNAs due to cross-hybridization. Hereafter we refer to these six arrays as the data set "Expt-3-4".

Table 4.1: The spike-in probe-sets and their concentrations in the HG-U133A "Expt-3-4" data set. The unit for the concentration values shown in the table is picomolar.

Group ID  Probe-sets                                           Exp 3   Exp 4
1         203508_at; 204563_at; 204513_s_at                    0.25    0.5
2         204205_at; 204959_at; 207655_s_at                    0.5     1
3         204836_at; 205291_at; 209795_at                      1       2
4         207777_s_at; 204912_at; 205569_at                    2       4
5         207160_at; 205692_s_at; 212827_at                    4       8
6         209606_at; 205267_at; 204417_at                      8       16
7         205398_s_at; 209734_at; 209354_at                    16      32
8         206060_s_at; 205790_at; 200665_s_at                  32      64
9         207641_at; 207540_s_at; 204430_s_at                  64      128
10        203471_s_at; 204951_at; 207968_s_at                  128     256
11        AFFX-r2-TagA_at; AFFX-r2-TagB_at; AFFX-r2-TagC_at    256     512
12        AFFX-r2-TagD_at; AFFX-r2-TagE_at; AFFX-r2-TagF_at    512     0
13        AFFX-r2-TagG_at; AFFX-r2-TagH_at; AFFX-DapX-3_at     0       0.125
14        AFFX-LysX-3_at; AFFX-PheX-3_at; AFFX-ThrX-3_at       0.125   0.25

Table 4.2: log2-transformed intensity values after normalization for one probe-set example.
Probe          Exp3_1    Exp3_2    Exp3_3    Exp4_1    Exp4_2    Exp4_3
204951_at 1    11.3015   11.1926   11.3355   12.3765   12.3397   12.3513
204951_at 2    12.1043   12.1233   12.1167   13.0352   12.9705   12.9534
204951_at 3    10.6867   10.9686   10.7920   12.0007   12.0300   12.0101
204951_at 4    11.8429   11.8305   11.7989   12.7083   12.7357   12.7399
204951_at 5    11.4366   11.2920   11.3008   12.3142   12.2674   12.3432
204951_at 6    11.7439   11.6624   11.7167   12.5873   12.6092   12.5812
204951_at 7    11.8941   12.0064   11.9865   12.7253   12.6998   12.7953
204951_at 8    10.1346   10.2715   10.2883   11.2470   11.0445   11.2128
204951_at 9    12.2291   12.1939   12.2202   13.0232   13.0331   13.0374
204951_at 10   11.7488   11.7329   11.8503   12.4277   12.3369   12.4212
204951_at 11   10.9033   10.7454   10.8151   11.8521   11.8929   11.9052

4.2 Differential gene expression estimation

We use RMA's background correction and normalization in the BioConductor affy package [33,34] to carry out the pre-processing. The normalization method of RMA is quantile normalization. It uses the complete data information and defines a pseudo-reference by the averaged quantiles; normalization is then carried out via the quantile-quantile transformation between each target array and the pseudo-reference [35,36].

In the data set "Expt-3-4", most probe-sets have 11 probes [37]. Table 4.2 illustrates the intensity values of one probe-set example after normalization and log2 transformation. We estimate 11 probe-specific parameters and 2 treatment-specific parameters in each probe-set. Here, the treatment-specific effect refers to the gene expression value specific to experiment 3 or 4. We do not estimate an array-specific effect for each replicate in experiments 3 and 4.

After normalization, in each probe-set we use either LAD regression or the median polish method to estimate the treatment- and probe-specific effects based on the model in 3.2.

Figure 4.1: M-A plots of the "Expt-3-4" data set estimated by LAD regression.
Grey: non-spike-in probe-sets; black: spike-in probe-sets with expected log-ratio 1; magenta: spike-in probe-sets with zero input concentration in Exp 3; green: suggested spike-in homologs. Spike-in probe-sets with zero input concentration in Exp 4 do not show up in the plot because their M value (log-ratio of Exp 4 versus Exp 3) is very low (< -1.5).

We use the rq() function in the R package quantreg [29-31] to run the LAD regression; we run the median polish procedure with a maximum of 30 iterations.

In Figure 4.1, we show the M-A plot obtained by LAD regression. In an M-A plot, each "M" value is the log-ratio of experiment 4 versus experiment 3 and each "A" value is the average log-intensity of the two experiments for one probe-set. The expected log-ratio of spike-in genes is M = 1; the expected log-ratio of non-spike-in genes is M = 0. We can see that the non-spike-in genes show small variation along M = 0, with some outliers. The spike-in genes (black points in Figure 4.1) are close to M = 1 in the middle intensity range, but not in the low or high intensity ranges: in the low intensity range this is mainly due to the low hybridization signal, and in the high intensity range it is due to saturation of probe hybridization. The suggested spike-in homologs (green points) do show signal differentiation between Exp 3 and 4.

Table 4.3: Log-ratio and SAE value for each spike-in gene, estimated by LAD regression and by the median polish procedure.
                       log-ratio                  SAE
Probe-set              LAD         medpolish      LAD         medpolish
203508_at              0.0442947   0.0442948      9.402492    9.402494
204563_at              0.1885193   0.1885194      8.535332    8.535333
204513_s_at            0.2649478   0.2649480      11.542628   11.542625
204205_at              0.4381062   0.4381060      7.952603    7.952603
204959_at              0.5392192   0.5392200      11.709618   11.709621
207655_s_at            0.3511599   0.3511600      10.421022   10.421025
204836_at              1.2017324   1.2001560      20.470940   20.472519
205291_at              0.1715140   0.1715140      10.265282   10.265283
209795_at              0.7232303   0.7232300      21.089031   21.089033
207777_s_at            0.7810454   0.7810460      12.850392   12.850391
204912_at              0.6051201   0.6051200      12.659032   12.659030
205569_at              0.5166385   0.5166380      12.789001   12.789001
207160_at              1.3601266   1.3601260      18.721968   18.721968
205692_s_at            0.6861964   0.6861960      10.317876   10.317874
212827_at              0.5617307   0.5617300      14.074102   14.074105
209606_at              1.0842565   1.0842560      17.887382   17.887379
205267_at              0.7872672   0.7889680      8.629322    8.631025
204417_at              1.3855846   1.3852940      9.328554    9.328841
205398_s_at            0.8782355   0.8782360      10.971365   10.971367
209734_at              0.5922348   0.5922340      5.192269    5.192269
209354_at              0.8266641   0.8266640      14.480909   14.480913
206060_s_at            1.0658765   1.0658760      8.644506    8.644503
205790_at              0.4438017   0.4438020      5.692341    5.692339
200665_s_at            1.0730610   1.0730620      7.743551    7.743550
207641_at              0.8168570   0.8214420      5.841037    5.845625
207540_s_at            0.9806993   0.9807000      4.790139    4.790140
204430_s_at            1.3701159   1.3701160      4.710365    4.710364
203471_s_at            0.8785265   0.8792320      8.040324    8.041027
204951_at              0.9051457   0.9051460      4.473050    4.473050
207968_s_at            0.9185119   0.9185120      11.277754   11.277755
AFFX-r2-TagA_at        0.6571885   0.6571880      5.392241    5.392241
AFFX-r2-TagB_at        0.6390517   0.6390520      6.367735    6.367734
AFFX-r2-TagC_at        0.3592692   0.3592700      3.933364    3.933366
AFFX-LysX-3_at         1.0110455   0.9699140      53.500760   53.500756
AFFX-PheX-3_at         0.5461029   0.5458140      26.796007   26.796011
AFFX-ThrX-3_at         0.7666006   0.7655060      26.432437   26.432441

The results obtained using median polish give almost the same M-A plot.
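The M-A coordinates used in these plots can be computed directly from the two summarized expression values of a probe-set. A minimal sketch (the function name is our own; in the analysis above the log2 expression values come from the LAD or median polish fit):

```python
import numpy as np

def ma_values(log2_exp3, log2_exp4):
    """M-A coordinates for summarized probe-set expression:
    M = log-ratio of Exp 4 versus Exp 3, A = average log2 intensity."""
    e3 = np.asarray(log2_exp3, dtype=float)
    e4 = np.asarray(log2_exp4, dtype=float)
    return e4 - e3, 0.5 * (e4 + e3)
```

For a spike-in probe-set whose concentration doubles between the two experiments, the expected M value is 1; for a non-spike-in probe-set it is 0.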
In Table 4.3, we further show the log-ratio and the sum of absolute errors (SAE) for each spike-in probe-set under both methods. The log-ratios estimated by the LAD fit are almost identical to those estimated by the median polish procedure, and the median polish procedure achieves essentially the same SAE as the LAD regression. This is strong evidence supporting the claim that the median polish procedure can be used to obtain the LAD solution approximately in the summarization.

4.3 Residual analysis

We are interested in the log-ratio estimate \hat{m}_i for probe-set i, but we also want to know how accurate the estimate is. The accuracy of the log-ratio estimate can be assessed through residual analysis after the LAD model fitting. In Figure 4.2, we show the residual distributions of four spike-in probe-sets, 204951_at, 207540_s_at, 204513_s_at and 204836_at, which have log-ratio estimates 0.9051, 0.9807, 0.2649 and 1.2017, respectively. The residual distributions clearly show different accuracies and confidence intervals: residuals from probe-sets 204951_at and 207540_s_at have a sharp peak at the zero median, while residuals from probe-sets 204513_s_at and 204836_at are widely spread.

We have checked the residual distributions of all spike-in probe-sets and their homologs. We find that spike-in probe-sets tend to have more large residual values (of both negative and positive sign) than non-spike-in probe-sets. To quantify this observation, we first define residuals whose absolute values are larger than a cutoff as "outlier residuals", and then calculate the percentage of "outlier residuals" in each probe-set.

Figure 4.2: The residual distributions of 4 spike-in probe-sets after the LAD regression fit.

Figure 4.3: Distribution of "outlier residual" ratios for both spike-in and non-spike-in probe-sets. The "outlier residual" is defined using cutoff value 0.3 (A) or 0.5 (B).
Histogram: distribution of non-spike-in probe-sets (22,235 probe-sets in total); black dots: the "outlier residual" ratio for each spike-in or homolog probe-set; green curve: the density curve of the spike-in probe-sets (71 probe-sets in total).

In Figure 4.3, we show the distribution of "outlier residual" ratios for both spike-in and non-spike-in probe-sets, using two empirically chosen cutoff values, 0.3 and 0.5. The distribution of "outlier residual" ratios for spike-in probe-sets has a longer tail than that of non-spike-in ones under both cutoff conditions.

The residual analysis reveals an important problem in Affymetrix probe design. Because probes within the same probe-set have different binding affinities, they show different hybridization (intensity) changes in response to the same RNA concentration change. A probe-set will have more "outlier residuals" if the hybridization changes among its probes are less similar. For each non-spike-in probe-set, the probes receive the same RNA concentration in Exp 3 and 4, so any change in hybridization intensity is due only to noise. For each spike-in probe-set, in contrast, the probes receive different RNA concentrations in Exp 3 and 4. After the summarization model fitting, the residuals of a non-spike-in probe-set therefore contain fewer "outlier residuals" than those of a spike-in probe-set.

Figure 4.4: Heatmap of residuals for five probe-sets. Each row is one probe; each column is one array. The left three arrays are from Exp 3; the right three are from Exp 4.

Furthermore, we examine the probes producing "outlier residuals" using heatmaps (Figure 4.4). We find that most "outlier residuals" are grouped by probe: within such a probe, all residuals across the replicates are "outlier residuals". These probes detect expression changes different from the ones we estimated.
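The "outlier residual" ratio defined above is a simple per-probe-set summary. A minimal sketch (function name is illustrative; the cutoff values 0.3 and 0.5 are the ones used in the text):

```python
import numpy as np

def outlier_residual_ratio(residuals, cutoff=0.3):
    """Fraction of a probe-set's residuals whose absolute value
    exceeds the cutoff, i.e. the 'outlier residual' ratio."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return float((r > cutoff).mean())
```

Applying this to every probe-set and comparing the resulting distributions for spike-in versus non-spike-in probe-sets reproduces the comparison shown in Figure 4.3.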
If a probe-set has too many "outlier residuals", the estimated expression change is not reliable, since the probes do not reach a consensus in detecting the RNA concentration change.

4.4 Accuracy of the log-ratio estimate

From the residual analysis, we notice that the log-ratio estimates have different accuracies in different probe-sets. According to Theorem 3.3.2, the reciprocal of the residual density at the zero median, \frac{1}{f(0)}, is the key quantity for measuring the accuracy of the log-ratio estimate. It is the gross-error sensitivity (the influence bound) of \hat{m}. It is also the standard deviation of the asymptotically normal limiting distribution of \sqrt{n}(\hat{m} - m), where m is the true log-ratio value and n is the number of intensity values in the probe-set across all samples. The value of \frac{1}{f(0)} can be calculated from the sparsity function [19]:

    s(\tau) = [f(F^{-1}(\tau))]^{-1}.    (4.1)

In the LAD case, \tau = 0.5, so

    \frac{1}{f(0)} = s(0.5) = \frac{d}{dt} F^{-1}(t) \Big|_{t=0.5}.    (4.2)

Figure 4.5: In panel A, the x-axis is the average log2(concentration) designed in Exp 3 and 4 and the y-axis is the estimated \frac{1}{f(0)}. In panel B, the x-axis is the log-ratio estimated by the LAD model fitting.

In Figure 4.5A, we show the scatter plot of \frac{1}{f(0)} versus the average log2(concentration) in Exp 3 and 4. The concentration values are the designed input concentrations shown in Table 4.1. There is a clear trend between \frac{1}{f(0)} and the concentration level: the accuracy increases with the designed input concentration. On the other hand, the scatter plot of \frac{1}{f(0)} versus the log-ratio estimates shows no obvious relationship between the two.

4.5 Sensitivity curve and bound

The influence function and the asymptotic normal distribution are based on large-sample theory, where n \to \infty. In this microarray data set, most probe-sets have sample size n = 66. Here, we follow the idea of the sensitivity curve to measure the robustness and resistance of the LAD log-ratio estimates.
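Equation (4.2) is estimated in practice as a difference quotient of the empirical quantile function. A minimal sketch (the fixed bandwidth here is our own illustrative choice; the thesis uses the Hall-Sheather rule to pick it):

```python
import numpy as np

def sparsity_at_median(residuals, h=0.05):
    """Difference-quotient estimate of s(0.5) = 1/f(0) from the
    empirical quantile function of the residuals, as in eq. (4.2).
    h is the quantile bandwidth (illustrative; Hall-Sheather in the text)."""
    r = np.asarray(residuals, dtype=float)
    return (np.quantile(r, 0.5 + h) - np.quantile(r, 0.5 - h)) / (2.0 * h)
```

A tight residual distribution (sharp density peak at zero) gives a small 1/f(0) and hence an accurate log-ratio estimate; a widely spread one gives a large value.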
In Figure 4.6, we illustrate the sensitivity curve for probe-set 204951_at. Because the additional observation can be placed on any of the 11 probes and in either experiment 3 or 4, there are 22 cases in total. In each plot, the y-axis is the deviation (\hat{m}_{n+1} - \hat{m}_n), where \hat{m}_n is the log-ratio estimate based on the original intensities and \hat{m}_{n+1} is the log-ratio estimate based on the intensities with an additional observation x; the x-axis is the log2 intensity value of the additional observation x. We scanned the log2 intensity value of x from -5 to 25 with step 0.001. The log2 intensities of the original 66 values in probe-set 204951_at lie in the range [10.135, 13.037]. We find that the deviation (\hat{m}_{n+1} - \hat{m}_n) is stable when x < 10.135 or x > 13.037. From the figure, we can see that the additional value does not perturb the log-ratio estimate when it is located at some factor levels, such as (Exp 3, probe 2) and (Exp 4, probe 5). It fluctuates in the other cases, with max(\hat{m}_{n+1} - \hat{m}_n) = 0.0257 and min(\hat{m}_{n+1} - \hat{m}_n) = -0.004. The deviation is bounded and does not change as the intensity value becomes extremely large or small.

Figure 4.6: The deviation of the log-ratio estimate with one additional perturbation point. y-axis: deviation of the log-ratio (\hat{m}_{n+1} - \hat{m}_n); x-axis: the log2 intensity value of the additional observation.

Figure 4.7: The deviation of the log-ratio estimate with two additional perturbation observations. z-axis: deviation of the log-ratio (\hat{m}_{n+2} - \hat{m}_n); x- and y-axes: the log2 intensity values of the two additional observations. The two 3-D plots illustrate two cases: in one, the additional observations are at (Exp 4, probe 9) and (Exp 3, probe 4); in the other, they are at (Exp 4, probe 8) and (Exp 3, probe 9).

Moreover, we have tried adding two observations to probe-set 204951_at. There are 22 × 22 combinations of ways to place the two perturbation observations.
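The perturbation scheme used here — append an artificial observation, refit, and record the change in the estimate — can be sketched in miniature. For brevity this illustrative version perturbs a plain median (a one-parameter location fit) rather than the full two-factor LAD model; the bounded, eventually flat deviation it exhibits is the same qualitative behavior described for the log-ratio estimate:

```python
import numpy as np

def sensitivity_scan(data, grid, estimator=np.median):
    """Deviation estimator(data + [x]) - estimator(data) as the added
    observation x scans a grid (the sensitivity-curve idea, shown for
    a single location estimate instead of the full probe-set model)."""
    data = np.asarray(data, dtype=float)
    base = estimator(data)
    return np.array([estimator(np.append(data, x)) - base for x in grid])
```

However extreme the added point, the deviation of a median-type estimate stays within a fixed band, in contrast to a mean-type estimate, whose deviation grows without bound.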
In Figure 4.7, we illustrate the deviation surface (\hat{m}_{n+2} - \hat{m}_n) against the two additional observations for two of these cases. The deviation surface is difficult to describe and formulate. However, we find that the deviation of the log-ratio estimate is still bounded by the same range [-0.004, 0.0257].

4.6 Standardized log-ratio

Based on Theorem 3.3.2, we calculate the standard deviation of the log-ratio estimate as \theta_n = \frac{1}{\sqrt{n} f(0)}. In Figure 4.8, we show the adjusted M-A plot after standardizing the log-ratio by \theta_n. The variation of the log-ratio for non-spike-in probe-sets is stabilized along the A values. Compared with the M-A plot in Figure 4.1, the adjusted M (log-ratio) values for spike-in probe-sets provide a higher specificity in the high intensity range.

Figure 4.8: Adjusted M-A plots of the "Expt-3-4" data set after standardization using the \frac{1}{f(0)} estimates. Grey: non-spike-in probe-sets; black: spike-in probe-sets with expected log-ratio 1; magenta: spike-in probe-sets with zero input concentration in Exp 3; green: suggested spike-in homologs.

4.7 Discussion

In this study, we apply LAD regression to the Affymetrix microarray summarization. We have assumed that the background correction and normalization steps have been carried out correctly. The log-ratio and its variation estimates are based on the normalized results; the results will vary if other normalization methods are adopted.

The standard deviation and the influence function bound we derived for the log-ratio estimate depend only on \frac{1}{f(0)}. However, estimating \frac{1}{f(0)} remains a hard problem, to which much effort has been devoted [38,39]. The method we use to estimate \frac{1}{f(0)} is based on the sparsity function [19]. In practical computation, we need to specify a bandwidth for calculating the difference quotient of the empirical quantile function in (4.2). The bandwidth option we use is the Hall-Sheather bandwidth [38], which is the same as the default setting of quantile regression in R.
We have also tried kernel estimation of f(0), using its default bandwidth setting in R. In terms of variation stabilization for non-spike-in probe-sets, the standard deviation computed from the kernel estimate of f(0) does not perform as well as the one computed from the sparsity-function estimate of \frac{1}{f(0)}.

References

[1] Hoaglin DC, Mosteller F, Tukey JW: Understanding Robust and Exploratory Data Analysis. John Wiley and Sons 1983.

[2] Bloomfield P, Steiger WL: Least Absolute Deviations: Theory, Applications, and Algorithms. Birkhauser 1983.

[3] Hampel FR: The Influence Curve and Its Role in Robust Estimation. Journal of the American Statistical Association 1974, 69(346):383–393.

[4] Huber PJ: Robust Statistics. Wiley 1981.

[5] Bloomfield P, Steiger W: Least Absolute Deviations Curve-Fitting. SIAM Journal on Scientific and Statistical Computing 1980, 1(2):290–301.

[6] Gentle JE: Least absolute values estimation: an introduction. Communications in Statistics - Simulation and Computation 1977, 6(4):313–328.

[7] Armstrong RD, Frome EL, Sklar MG: Linear Programming in Exploratory Data Analysis. Journal of Educational and Behavioral Statistics 1980, 5(4):293–307.

[8] Koenker RW, D'Orey V: Algorithm AS 229: Computing Regression Quantiles. Applied Statistics 1987, 36(3):383–393.

[9] Barrodale I, Roberts F: An improved algorithm for discrete L1 linear approximation. SIAM Journal on Numerical Analysis 1973, 10(5):839–848.

[10] Harter HL: Nonuniqueness of least absolute values regression. Communications in Statistics - Theory and Methods 1977, 6(9):829–838.

[11] Narula SC, Wellington JF: The Minimum Sum of Absolute Errors Regression: A State of the Art Survey. International Statistical Review / Revue Internationale de Statistique 1982, 50(3):317–326.

[12] Tukey JW: Exploratory Data Analysis. Addison-Wesley 1977.

[13] Leinhardt S, Wasserman SS: Exploratory Data Analysis: An Introduction to Selected Methods. Sociological Methodology 1979, 10:311–365.
[14] Siegel AF: Low Median and Least Absolute Residual Analysis of Two-Way Tables. Journal of the American Statistical Association 1983, 78(382):371–374.

[15] Wilson H: Least-squares versus minimum absolute deviations estimation in linear models. Decision Sciences 1978, 9:322–335.

[16] Boldin MV, Simonova GI, Tyurin YN: Sign-Based Methods in Linear Statistical Models. American Mathematical Society 1997.

[17] Bassett G, Koenker R: Asymptotic Theory of Least Absolute Error Regression. Journal of the American Statistical Association 1978, 73(363):618–622.

[18] Pollard D: Asymptotics for least absolute deviation regression estimators. Econometric Theory 1991, 7:186–199.

[19] Chen C, Wei Y: Computational Issues for Quantile Regression. Sankhya: The Indian Journal of Statistics 2005, 67:399–417.

[20] Kocherginsky M, He X, Mu Y: Practical Confidence Intervals for Regression Quantiles. Journal of Computational and Graphical Statistics 2005, 14.

[21] Affymetrix: Affymetrix Microarray Suite User Guide, version 5. Santa Clara, CA 2001.

[22] Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001, 98.

[23] Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–264.

[24] Li C, Wong WH: Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2001, 2(8).

[25] Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193.

[26] Li LM, Cheng C: Differentiation Detection in Microarray Normalization. Chapter 2 in "Methods in Microarray Normalization". CRC, edited by Phillip Stafford 2008.

[27] Cheng C, Li LM: Sub-array normalization subject to differentiation. Nucleic Acids Res 2005, 33(17):5565–5573.
[28] Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4).

[29] Koenker R: quantreg: Quantile Regression 2007, [http://www.r-project.org]. [R package version 4.10].

[30] Koenker R, Bassett G: Regression Quantiles. Econometrica 1978, 46:33–50.

[31] Koenker R, d'Orey V: Computing regression quantiles. Applied Statistics 1994, 43:410–414.

[32] Affymetrix: Human Genome U133 Data Set [http://www.affymetrix.com/support/technical/sample_data/datasets.affx].

[33] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10).

[34] Gautier L, Cope L, Bolstad BM, Irizarry RA: affy -- analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20:307–315.

[35] Bolstad B: Probe Level Quantile Normalization of High Density Oligonucleotide Array Data 2001, [http://bmbolstad.com/stuff/qnorm.pdf].

[36] Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193.

[37] McGee M, Chen Z: New Spiked-In Probe Sets for the Affymetrix HGU-133A Latin Square Experiment. COBRA Preprint Series 2006, (5).

[38] Hall P, Sheather SJ: On the Distribution of a Studentized Quantile. Journal of the Royal Statistical Society 1988, 50(3).

[39] Bloch DA, Gastwirth JL: On a Simple Estimate of the Reciprocal of the Density Function. The Annals of Mathematical Statistics 1968, 39(3).
Abstract

DNA microarrays have been widely used in the field of functional genomics. The estimation of gene expression from microarray data is a statistical problem to which much effort has been devoted. In this study, we focus on the summarization step of Affymetrix microarray pre-processing. We apply Least Absolute Deviation (LAD) regression to estimate the probe- and treatment-specific effects in the widely used two-factor model. Median polish can be used as an approximation to the LAD regression in this two-factor summarization model. We show that the LAD estimator is robust in the sense that it has bounded influence, where the bound is strongly associated with the RNA concentration. Furthermore, we calculate the influence bound and the standard error, which are used as measures of accuracy for the log-ratio estimate.