A linear model for measurement errors in oligonucleotide microarray experiment
A LINEAR MODEL FOR MEASUREMENT ERRORS IN OLIGONUCLEOTIDE MICROARRAY EXPERIMENT

by Furong Wang

A Thesis Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE (APPLIED BIOSTATISTICS AND EPIDEMIOLOGY)

December 2002

Copyright 2002 Furong Wang

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 1414857. UMI Microform 1414857. Copyright 2003 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90089-1695

This thesis, written by [handwritten name illegible] under the direction of the thesis committee, and approved by all its members, has been presented to and accepted by the Director of Graduate and Professional Programs, in partial fulfillment of the requirements for the degree of [handwritten entry illegible]. Director. Date. Thesis Committee Chair.

ACKNOWLEDGEMENTS

I would like to express my gratitude to Dr. Paul Marjoram for his helpful guidance, comments, suggestions and editing throughout the course of my research and in the preparation of this manuscript. I am also grateful to my thesis committee, for without their guidance, I could not have completed this portion of my education. Committee members: Paul Marjoram, PhD (chair); Kimberly Siegmund, PhD; Timothy J.
Triche, MD, PhD

And, of course, my deepest thanks go to my family for their strong support.

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
INTRODUCTION
STATISTICAL METHODS
RESULTS
DISCUSSION
REFERENCE

LIST OF TABLES
Table 1. Comparison of R² in four models for all chips
Table 2. Regression of standard deviation and gene expression among three replicate chips on log level, at different times, under different treatments
Table 3. Variance of gene expression, time and treatment in the model
Table 4. Number of genes with significantly altered expression after treatment detected by t-test, using observed or predicted model, at different time points
Table 5. Standard deviation of gene expression among replicate chips

LIST OF FIGURES
Figure 1. Measured values of gene expressions in two identical chips A and B using the same mRNA sample
Figure 2. Variation of gene expressions within and across replicate chips
Figure 3.1.
Association of gene expression and variance of gene expression
Figure 3.2. Association of log(gene expression) with variance of log(gene expression)
Figure 3.3. Association of gene expression and standard deviation of gene expression
Figure 3.4. Association of log(gene expression) with log(standard deviation)
Figure 4. Regression line for samples at nine experimental conditions
Figure 5. Linear model for reference data
Figure 6.1. Distribution of p values from two methods of two-sample t-test
Figure 6.2. Distribution of p values from two-sample t-test, p less than 0.05
Figure 7. Regression lines for replicates in control group, in treatment group and reference data

ABSTRACT

Microarray technology is a powerful technology for measuring the expression of large numbers of genes simultaneously. However, “measurement error” of gene expression must be allowed for in any microarray analysis. We propose a linear model for measurement error as a function of the gene expression level in oligonucleotide microarray experiments. This model finds that measurement errors increase exponentially as gene expression level increases. This model helps to explicitly allow for the effect of measurement error and, by pooling the information from all genes simultaneously, provides more robust estimates of that error. This allows us to identify more genes with altered expression under differing experimental conditions.
We also demonstrate that our model has better statistical properties than analysis based upon the traditional “gene-by-gene” t-test. We provide evidence to suggest our model is likely to be more widely applicable.

INTRODUCTION

The genome project is a milestone in gene research. It aims to decipher organism genomes within a few decades. However, as more genes are explored, new challenges present themselves. The first completely sequenced eukaryotic genome is that of S. cerevisiae, a yeast, whose sequencing was finished in 1996. It tells us that there are approximately 6,200 “real” genes in the yeast genome, but only about one quarter of them have known functions [Goffeau et al. 1996]. The total number of human genes is estimated to be between 30,000 and 35,000 (http://www.genome.gov). Up to now, approximately 90% of the human genome has been sequenced. However, the functions of most of these genes have yet to be explored.

The process by which a gene’s coded information is converted into the structures present and operating in the cell is called gene expression. Expressed genes include those transcribed into mRNA and then translated into protein, and those transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs). Regulation of gene expression in a cell occurs on several levels. The first level of regulation happens in the process of DNA transcription into mRNA. Other levels of regulation include differential degradation of mRNA in the cytoplasm, and the process of mRNA translation into protein. By studying how a particular gene is regulated, i.e. up-regulated or down-regulated, in varying conditions, we can get some clues about the function of that gene. It is protein that performs biological functions for a corresponding gene.
The type and amount of proteins in a cell determine the phenotypes of cells and tissues, but vary in response to environmental change through variation in the expression of the corresponding genes. So measuring mRNA will give us a good indication of variation of gene expression in response to environment [Schena et al. 1995].

In the past, Northern blotting and PCR were the methods commonly used to detect expression of genes in an experiment. The advantage of Northern blotting is its sensitivity. However, only a few genes can be examined at a time with this approach. An even more sensitive and faster method of measuring gene expression, quantitative PCR, can measure several hundred genes at a time. But, considering that there are tens of thousands of genes in the human genome, quantitative PCR is not a satisfactory method either. These facts indicate that traditional gene-by-gene approaches cannot meet the requirement of providing a “global view” of gene expression in a cell.

In the past few years, the DNA microarray has emerged as a new technology that can meet this requirement, and it has attracted tremendous interest among biological researchers. One advantage of this technology is that it is able to monitor the whole genome on a single chip. Thus we can get expression profiles for all genes in a cell simultaneously. With microarray technology, we can obtain a better picture of how thousands of genes in a cell interact with each other in biological processes or disease.

One goal of microarray experiments is to estimate changes of gene expression patterns by observing mRNA changes under different conditions, e.g. in tumor and normal cells. An array is an orderly arrangement of probes, which are gene fragments designed to be complementary to the selected genes or expressed sequence tags (EST) reference sequences.
Based on base-pairing rules (i.e., A-T and G-C for DNA; A-U and G-C for RNA [Southern et al. 1999]) and hybridization, an array automates the process of measuring the expression of genes in biological samples. Samples for an array experiment can be deposited on a common gene assay system such as microplates or standard blotting membranes by hand, or by robotics.

There are two main types of arrays: cDNA arrays and oligonucleotide arrays. The basic technology of these two arrays is the same: DNA complementary to genes of interest is generated and laid out on solid surfaces at defined positions (spots). Sample cDNA, reverse-transcribed from sample mRNA, is labeled with a fluorescent dye and eluted on the surface, then hybridizes to spots on the array. The amount of hybridization at each spot is measured by the intensity of fluorescence. Presence of hybridization is detected by fluorescence following laser excitation.

Each spot contains either DNA oligomers (oligonucleotide arrays), or a longer DNA sequence (cDNA arrays) designed to be complementary to a particular gene of interest. Oligonucleotide arrays are designed and synthesized based on gene sequence information alone, while cDNA arrays are generated using physical intermediates such as clones, PCR products, cDNAs and so on. Oligonucleotide arrays are generated by using photolithography techniques to synthesize oligomers directly on the glass slides [Fodor et al. 1993, Lipshutz et al. 1999]. They are manufactured and marketed primarily by Affymetrix Inc. cDNA arrays are created by mechanical gridding, where prepared material is applied to each spot by ink-jet or physical deposition [Schena et al. 1995, Duggan et al. 1999].

In an oligonucleotide microarray experiment, gene expression is measured for one sample at a time, so gene expression in a sample is represented by the intensity of spots on the array.
In a cDNA array experiment, two samples are measured simultaneously to compare the relative abundance of gene expression in the two samples. The two samples are labeled with differently colored dyes, then mixed and hybridized with the arrayed spots. After hybridization, fluorescence measurements for each dye are done separately. The ratio of fluorescence intensity for each gene in the two samples is used to express the relative gene expression.

An oligonucleotide array also has other specific features. Each oligomer contains a 25-nucleotide sequence designed to bind to the gene of interest. Each spot in an oligonucleotide array contains roughly 300,000 identical oligomers. Such a spot is referred to as a probe, and the probes for a given gene form its probe set (there are 16-20 probe pairs per gene on Affymetrix arrays). Probes are non-overlapping if possible, or minimally overlapping if necessary, so that they serve as sensitive, unique, sequence-specific detectors. Probe design is based on being complementary to the selected genes or EST reference sequences, uniqueness relative to family members and other genes, and absence of near-complementarity to other RNAs that may be highly abundant in the sample (for example, rRNA, tRNA, actin mRNA). The key to this approach is the use of multiple oligonucleotides of different sequence designed to hybridize to different regions of the same mRNA.

An additional level of redundancy comes from the use of mismatch (MM) control probes. MM probes are identical to their perfect match (PM) partners, except for a single base difference in the central position. For an Affymetrix array, MM probes differ from their PM partners in the 13th base of a 25-base oligonucleotide. PM and MM probes form a probe pair. The MM probes act as specificity controls that allow the direct subtraction of both background and cross hybridization signals.
The average of the PM and MM difference (PM - MM) for all probe pairs in a probe set (called “average difference”) is commonly used as an expression index for the target gene [Lipshutz et al. 1999].

Microarrays have the potential to answer many interesting questions in the field of genetics because they can reveal gene expression in a global view. However, there are some problems with microarray experiments. When we study the changes of gene expression, in a disease for example, some factors unrelated to the disease or variable under study may falsely link certain genes with the disease. So it is critical to identify the sources of microarray variability and the magnitude of variation to better understand true gene expression patterns.

Variability of measurement in a microarray may be due to specific experimental procedures. In addition, other possible sources of array variability have been investigated. Most variation of gene expression detected by microarray is caused by biological variation, which means that gene expression patterns differ among tissues (tissue heterogeneity), or among cells (cell heterogeneity), or among individuals of a population (inter-individual variation). Bakay et al. (2002) provide a systematic study of sources of variability for human muscle biopsies. They find that the expression of the same gene differs considerably among regions of a muscle biopsy from the same patient, reflecting variation of gene expression in cell type content even in a relatively homogeneous tissue such as muscle. Because humans are highly outbred, there is gene polymorphism between individuals in a population. This is another large source of variation. Bartosiewicz et al. (2000) study gene expression in mouse liver tissue among different animals, and find gene expression to be highly variable.
Another type of variation is chip variation, introduced by the different designs of the chip technologies. Compared with biological variation, chip variation is probably relatively minor [Bakay et al. 2002]. When performing an array experiment for identical samples on varying types of chips, e.g. oligonucleotide chips compared to cDNA chips, hybridization of the sample with probes on different arrays may vary. Thus gene expressions detected from those arrays may be different. Such variation is called chip variation. Thus, observed alterations in gene expression can also be caused by characteristics of arrays instead of true gene expression change. Also, microarray results heavily depend on particular biological conditions [Brazma et al. 2000]. All sources of variation will ultimately influence measurement values of gene expression.

However, even after having allowed for all the types of variation mentioned above, there is still “inherent” noise in microarray data. Lee et al. (2000) indicate that substantial inconsistency exists among replicated chips when measuring the same sample on the same type of chip. This type of variation is measurement error. In this thesis, we focus on measurement error. We aim to estimate the distribution of measurement error in order to make it possible to distinguish altered gene expression from noise with high confidence.

The simplest microarray experiment aims to identify genes that are differentially expressed with respect to a single factor, e.g. samples with and without treatment. However, as mentioned above, a variety of sources of variation in microarray experiments may lead to inaccurate estimation of gene expression. How do we estimate the magnitude of gene expression for a given gene with confidence?
In particular, how do we determine the statistical significance of an observed change of gene expression? Accurate estimation of the error distribution is necessary for making valid, rigorous inferences from the experiment.

To identify genes differentially expressed in varying conditions, one common practice is to compute ratios of two raw signals to estimate “fold change” in expression going from one condition to another [Chen et al. 1997]. Some studies use an arbitrary threshold for “significant fold-change” in expression level, e.g. gene expression changes greater than two-fold are declared significant [Gray et al. 1998]. Others use three-fold as the threshold for significance. Although relatively straightforward, this fold-change approach is highly problematic. There is no natural choice for the threshold that needs to be reached to report a gene as “significantly” changed. Furthermore, fold ratios formed from small numbers have a much higher variance than ratios formed from large numbers. Consequently, no single threshold value can be appropriate for all genes.

The two-sample t-test is a commonly used method to compare gene expression for two groups. Previous studies have found evidence of differential variability for various genes [Chen et al. 1997, Newton et al. 2001]. In particular, the variance of gene expression increases as gene expression increases, so an assumption of unequal variance seems to be more appropriate for a t-test based upon gene expression values. The t-test depends on the assumption of Normality of expression values for the gene of interest. It is unclear whether an assumption of Normality is appropriate.

Non-parametric methods do not assume that the data follow any particular type of distribution. One method is the Wilcoxon test, which is based on the sum of ranks of each gene on the chips.
However, because non-parametric methods make no distributional assumptions, they lack power and appear to be relatively insensitive for identifying changes in gene expression [Jin et al. 2001].

Measured gene expression is typically not equal to true gene expression. It is common to get substantially different values for the same gene when it is measured several times. We demonstrate this in Figure 1, in which we show a scatter plot of two microarray experiments in which both microarrays use the same mRNA sample. Each dot stands for the measured values of expression for one gene. The spread of measurement error appears to increase as the level of gene expression increases. The more highly expressed genes tend to have higher measurement errors. Therefore, if we use the observed values of gene expression in the microarray experiment but do not consider the effect of measurement error on recorded gene expression values, the analysis results for the microarray experiment might be misleading. In this thesis, we address the relationship between measurement error and gene expression in a microarray experiment.

[Figure 1. Measured values of gene expressions in two identical chips labeled A and B using the same mRNA sample. Scatter plot; horizontal axis: gene expressions in chip A; vertical axis: gene expressions in chip B.]

In our study, we have tried many different models to find a suitable association between measured gene expression values and the variance of those values among replicate chips in a microarray experiment. The arrays we consider here are replicates in the sense that the same sample mRNA is hybridized to the same type of chips several times.
The best model we find indicates an exponential increase of measurement error as the true level of gene expression increases. For those genes with the same true expression value, this model predicts the same measurement error distribution. Informally speaking, the model pools the information for all genes expressed at a similar value in order to better estimate their measurement error distribution. Such a model is therefore likely to lead to more accurate analysis results.

We also compare two methods of performing a two-sample t-test to identify genes with altered expression under different conditions. One method uses the observed variance for a given gene in replicated chips, while the other uses the predicted variance from our model. For the data set we consider, the number of genes identified by our model as having altered expression is larger than the number identified by the other method. Our model also helps to find more highly expressed genes with altered expression patterns.

In addition, we performed an analysis of variance (ANOVA) to consider comparisons among microarray data collected from different conditions, in an attempt to see whether factors such as treatment or time influence the measurement error in the experiment. Our results show that although there is a statistically significant difference in model fits under different treatment conditions or at different time points (both p < 0.001), the difference is very small and is not likely to be biologically significant. Thus it is reasonable to use the same measurement error model under different treatments or time points.

STATISTICAL METHODS

In this section, we explain the statistical methods used in this study. The results corresponding to each analysis are shown in the subsequent Results section.
We use the Affymetrix oligonucleotide microarray HG-U95Av2 to measure gene expression and Affymetrix GeneChip software to process the microarray image and convert image information into numerical form. In total, there are 12625 genes on each chip. Each gene has 20 probe pairs. For each probe pair, both PM and MM gene fragments are designed, and the probe intensities of PM and MM are reported. As is standard, the average of the difference between PM and MM (PM - MM) is used to represent probe pair expression intensity.

DNA-chip Analyzer (Dchip) is the software package we use to analyze the oligonucleotide arrays. Dchip analyzes microarray experiment data to estimate gene expression with a model created by C. Li and W. H. Wong at Harvard University. This model describes how the probe intensity values respond to changes in the expression level of the gene. Let $\theta_i$ represent the true expression of a target gene in sample $i$. It is assumed that the intensity value of a probe increases linearly with $\theta_i$, but different probe pairs do not increase at the same rate. Within the same probe pair, the PM intensity is assumed to increase at a higher rate than the MM intensity. Let $PM_{ij}$ and $MM_{ij}$ denote the PM and MM intensity values for the $i$th array and the $j$th probe pair for a given gene, $\nu_j$ denote the baseline response of the $j$th probe pair due to non-specific hybridization, $\alpha_j$ denote the rate of increase of the MM response of the $j$th probe pair, $\phi_j$ denote the additional rate of increase in the corresponding PM response, and $\varepsilon_{ij}$ denote a generic random error. The full model is then written as:

$$MM_{ij} = \nu_j + \theta_i \alpha_j + \varepsilon_{ij}$$
$$PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon_{ij}$$

The rates of increase are assumed to be nonnegative.
The model for individual probe responses implies an even simpler model for the PM - MM differences:

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \qquad \sum_j \phi_j^2 = J, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$

Here $J$ is the number of probe pairs in the probe set. This model is called the reduced model.

Suppose that, for gene A, the $\phi_j$ have been learned from a large number of arrays. They can be treated as constants and used to analyze the mean and variance of the expression index estimate. For a single array, the reduced model becomes:

$$y_j = PM_j - MM_j = \theta \phi_j + \varepsilon_j$$

Given the $\phi_j$, the linear least-squares estimate for $\theta$ is:

$$\hat{\theta} = \frac{\sum_j y_j \phi_j}{\sum_j \phi_j^2} = \frac{\sum_j y_j \phi_j}{J}, \qquad \text{with } E(\hat{\theta}) = \theta \text{ and } \mathrm{Var}(\hat{\theta}) = \sigma^2 / J$$

Hence, the approximate standard error for the least-squares estimate can be computed as:

$$\text{Std Error}(\hat{\theta}) = \sqrt{\hat{\sigma}^2 / J}, \qquad \text{with } \hat{\sigma}^2 = \sum_j (\text{fitted}_j - \text{observed}_j)^2 / (J - 1)$$

Similarly, the $\phi_j$ and their standard errors can be estimated by treating the $\theta_i$ as known constants.

This model is used to detect and exclude cross-hybridizing probes, image contamination, and outliers from other causes. The 20 $\phi$ values corresponding to the probe pairs of a target gene constitute a particular “probe response pattern” for that gene. The 20 differences of PM - MM are scaled by the target gene’s expression index ($\theta$). An array is considered an outlier for a certain gene if the probe response pattern corresponding to this gene is inconsistent with the response pattern from other arrays. In an outlier array, the estimated $\theta$ has a larger standard error. Those outlier arrays are excluded and the remaining arrays are used to estimate the probe response pattern.
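As an illustration only, the single-array least-squares step described above can be sketched in Python. This is not Dchip's code: the function name `estimate_expression_index` and the toy numbers are hypothetical, and the sketch assumes the probe response pattern ($\phi_j$) is already known. It uses the general least-squares form $\sum_j y_j \phi_j / \sum_j \phi_j^2$, which reduces to the expression in the text when $\sum_j \phi_j^2 = J$.

```python
import math


def estimate_expression_index(pm_minus_mm, phi):
    """Least-squares estimate of theta in y_j = theta * phi_j + eps_j.

    pm_minus_mm: PM - MM differences for one gene on one array (length J).
    phi:         known probe response pattern for that gene (length J).
    Returns (theta_hat, standard_error).
    Hypothetical helper for illustration; not part of Dchip.
    """
    J = len(phi)
    if len(pm_minus_mm) != J or J < 2:
        raise ValueError("need matching inputs with at least two probe pairs")

    sum_phi_sq = sum(p * p for p in phi)  # equals J under the model's constraint
    theta_hat = sum(y * p for y, p in zip(pm_minus_mm, phi)) / sum_phi_sq

    # Residual variance: sum of squared (fitted - observed), divided by J - 1.
    residual_ss = sum((theta_hat * p - y) ** 2 for y, p in zip(pm_minus_mm, phi))
    sigma_sq_hat = residual_ss / (J - 1)

    # Var(theta_hat) = sigma^2 / sum(phi^2), i.e. sigma^2 / J under the constraint.
    std_error = math.sqrt(sigma_sq_hat / sum_phi_sq)
    return theta_hat, std_error


if __name__ == "__main__":
    # Toy example: four probe pairs with a flat response pattern.
    theta, se = estimate_expression_index([10.0, 12.0, 8.0, 10.0], [1.0, 1.0, 1.0, 1.0])
    print(theta, se)
```

In this toy example the estimate is simply the mean of the four differences, and a noisier set of differences would give a larger standard error, which is the quantity the text uses to flag unreliable expression indexes.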
For an outlier array, Dchip still computes its expression index conditional on the estimated probe response pattern, but attaches a large standard error as an indicator of poor reliability of this expression index.

Dchip can handle probe outliers too. A probe is called a probe outlier when its behavior is inconsistent with the rise and fall of other probes for the same gene. Using the conditional standard error of the estimated $\phi_j$, probe outliers can be identified. This inconsistency is probably due to the cross hybridization of such a probe to non-target genes. In addition, a “single outlier” might be an image spike in one array affecting just one PM - MM difference. Such a single outlier may affect estimates of both $\theta_i$ and $\phi_j$ and can be identified by the large residual for this data point. Once identified, single outliers are regarded as “missing data” in the model fitting [Li, Wong 2001].

Because the scanned images for different samples may have different overall image brightness, a proper normalization is required to make these images comparable before comparing the expression levels of genes between arrays. Model-based expression computation requires normalized probe-level data. By default, the array with the median overall intensity is chosen as the baseline. Normalization should be based only on probe values that belong to non-differentially expressed genes, but generally we do not know which genes are non-differentially expressed. Nevertheless, we expect that a probe for a non-differentially expressed gene will have similar intensity ranks across arrays (the ranks are calculated in the arrays separately by listing the genes in order of expression for each array). Dchip uses an iterative procedure to identify a set of probes (called the “invariant set”), which have roughly constant ranks across chips.
A piecewise linear running median line is then fit from the invariant set and used as the normalization curve. After normalization, the two arrays have similar overall brightness and further analysis can be performed [Li, Wong 2001]. More details about Dchip can be found at http://www.biostat.harvard.edu/complab/dchip.

We analyze data from replicate arrays in our study. Replicate arrays here mean that the same sample mRNA is hybridized to the same type of chip several times. In our study, the experimental sample is collected under nine different conditions. Under each condition, an equal amount of sample mRNA is run on three chips. In total, we have 9 × 3 = 27 chips. All 27 chips are used by Dchip to estimate gene expression. We use Dchip-reported expression values as our measured gene expression values. For some genes, Dchip reports negative expression values. As is common, we delete these genes from our final analysis.

We begin by constructing scatter plots of gene expression values from replicate chips, or from chips under different conditions, in order to visually assess the consistency of the chip data. We then create scatter diagrams of measures of variability against average expression values to find a linear relationship between measurement error and gene expression value. Results from such a model can then be used as a robust estimate of variability in later analysis. To find a linear relationship, several models are tried. First, we transform gene expression values; later we transform the measures of variability and average gene expression values from the replicate chips. Least-squares regression is applied to determine the best fitting straight line or polynomial curve for these models.

Let $m$ index genes on a chip and $k$ index the $n$ replicate chips ($n = 3$). $x_{mk}$ denotes the expression of gene $m$ on chip $k$.
We calculate the variance of gene expression among chips according to the standard formula, i.e.

$$\sigma_m^2 = \left( (x_{m1} - \mu_m)^2 + (x_{m2} - \mu_m)^2 + \dots + (x_{mn} - \mu_m)^2 \right) / (n - 1), \qquad m = 1, 2, \dots, 12625; \; k = 1, 2, 3,$$

where $\mu_m$ denotes the true expression of gene $m$. Because the true values of gene expression are unknown, we use the mean of the gene expression values among the replicate chips, i.e. $(x_{m1} + x_{m2} + x_{m3}) / 3$, to estimate the true gene expression $\mu_m$.

To measure goodness of fit for models, the three following criteria are considered:
1. There should be no obvious trend observed in the scatter plot of model error (residual) against fitted value.
2. The model error distribution should follow a Normal distribution with mean of zero and variance of $\sigma^2$.
3. The model should explain a large proportion of the variation of gene expression among replicate chips, as shown by a higher value of the square of the correlation coefficient ($R^2$).

The ultimate aim of many microarray experiments is to detect genes that are differentially expressed under different conditions. One of the most commonly used approaches is the two-sample t-test. Previous studies have found differential variability as a function of the level of gene expression [Chen et al. 1997, Newton et al. 2001], so it is suitable to assume unequal variance of gene expression for the t-test. Although there are only a few replicate expression values for each gene under each of the nine conditions, we assume these values are random samples from the measurement value distribution of that gene. Two methods of two-sample t-test are compared. One method uses the observed variance of gene expression among replicate chips. The other method uses the variance predicted from our model. α = 0.05 is chosen as the gene-specific significance level. P-values for each gene from both methods are reported.
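The steps above (per-gene mean and variance across replicates, a least-squares line relating variability to expression level, and a two-sample t statistic built from either the observed or the model-predicted variance) can be sketched as follows. This is a minimal illustration, not the analysis code used in the thesis, which was run in S-plus and SAS. It assumes a straight-line fit of log(standard deviation) on log(mean expression), the form suggested by the caption of Figure 3.4; all function names and the toy data are hypothetical, and the conversion of the t statistic to a p-value is left out.

```python
import math
import statistics


def replicate_summary(replicates):
    """Mean and sample variance (divisor n - 1) of one gene across replicate chips."""
    return statistics.mean(replicates), statistics.variance(replicates)


def fit_line(xs, ys):
    """Ordinary least-squares straight line y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_sd_model(genes):
    """Regress log(sd) on log(mean) over all genes (assumed model form).

    genes: list of per-gene replicate lists. Genes with a non-positive mean or
    zero variance are skipped because their logarithms are undefined.
    """
    log_means, log_sds = [], []
    for replicates in genes:
        mean, var = replicate_summary(replicates)
        if mean > 0 and var > 0:
            log_means.append(math.log(mean))
            log_sds.append(0.5 * math.log(var))  # log(sd) = 0.5 * log(variance)
    return fit_line(log_means, log_sds)


def predicted_variance(mean_expression, intercept, slope):
    """Variance implied by the fitted log-log line at a given expression level."""
    return math.exp(2.0 * (intercept + slope * math.log(mean_expression)))


def unequal_variance_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Two-sample t statistic without assuming equal variances.

    Pass observed variances for the first method, or model-predicted
    variances for the second method.
    """
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)


if __name__ == "__main__":
    # Toy data in which the standard deviation is 10% of the mean for every gene.
    toy_genes = [[90, 100, 110], [900, 1000, 1100], [1800, 2000, 2200]]
    intercept, slope = fit_log_sd_model(toy_genes)
    print(intercept, slope)
    print(predicted_variance(500, intercept, slope))
```

Because the fitted line pools information from all genes, the predicted variance for a gene depends only on its expression level, not on the three replicate values for that gene alone; that is the design choice the two t-test variants are meant to contrast.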
Analysis of variance (ANOVA) is used to address the influence of treatment and time on gene variation among replicate chips. The modifying effect of time on the treatment effect on gene expression is considered too. S-plus and SAS are the statistical software packages we use for the analyses.

RESULTS

We apply our methods to a data set drawn from cell lines for eye cells with macular degeneration. One sample is chosen as a control and is not given any treatment, while the other, identical sample is treated. mRNA is measured for the control sample at 0, 6, 12, 24 and 48 hours, and for the treated sample at 6, 12, 24 and 48 hours. The control sample at time 0 is the baseline for our study. In each of the nine conditions, the same amount of sample is run on three identical chips, giving us three replicates per condition, or twenty-seven chips altogether.

To test the consistency of gene expression under different conditions, we plot gene expression in one chip against that in another chip. For example, we plot gene expression for the control sample at 0 hours and at 48 hours [Figure 2]. A, B, and C are three replicate chips for the control sample at 0 hours; D, E, and F are three replicate chips for the control sample at 48 hours. From Figure 2 a, b, and c, we can see that most of the genes are expressed similarly among replicate chips, but some genes are expressed differently. When we plot A against D, E or F in Figure 2 d, e and f, we find many more genes that are expressed differently in the two chips. These plots indicate that, as expected, variation of gene expression among replicate chips is less than that across conditions, even when the different conditions are quite similar, as for the data under consideration here.
Because the replicate microarrays in this experiment use the same sample mRNA hybridized to identical arrays, there is no biological or chip variation that might explain the different expression values observed for the same gene across replicates. Each gene is expected to be expressed equally among replicate chips. If $x_A$ and $x_B$ stand for the expression values of a given gene in chips A and B respectively, then when gene expression values in chip A are plotted against gene expression values in chip B, a straight line $x_A = x_B$ should be observed. Any difference we see is caused by measurement error, either random error or technical error, e.g. image processing, hybridization, etc.

[Figure 2. Variation of gene expressions within and across replicate chips. A, B, C are three replicate chips for the control sample at 0 hours. D, E, F are another three replicate chips for the control sample at 48 hours. Graphs a, b and c show the gene expression pattern within replicate chips. Graphs d, e and f show the gene expression pattern across conditions. Both axes show gene expression in one chip, from 0 to 8000.]

Technical errors may affect the accuracy of a microarray experiment and cause noise in gene expression. Technical errors may be introduced by sample preparation; for example, the amount of mRNA extracted from a cell is tiny. Moreover, several steps are involved in mRNA extraction. Therefore, a mistake in any intermediate step may cause the measured mRNA amount to vary.
Although the amount of mRNA used in replicate microarrays is intended to be equal, exact equality is impossible to achieve in practice. In addition, errors might occur during image processing. The measurement of fluorescence intensity might not be accurate. Because of measurement errors in microarray experiments, we generally find a scatter plot like Figure 1: measured expression values of most genes are quite close among replicate chips, some are quite different, but none are exactly the same.

As we have seen, observed data are subject to unavoidable measurement error. Suppose there are $W$ genes on the chip and let $m$ index the set of genes. For gene $m$, let $c_m$ denote the measured expression value, $\mu_m$ denote the true expression value, and $\delta_m$ denote the measurement error. We write:

$c_m = \mu_m + \delta_m$ $(m = 1, 2, \dots, W)$, so $\mu_m = c_m - \delta_m$.

It is important to know precisely how the observed value reflects the underlying true value, i.e. the distribution of measurement error. The degree to which we trust the measured value depends on this precision.

To estimate measurement error in microarray experiments, we study the variance of the measured expression among replicate chips for each gene. The true gene expression is always unknown, so we use the average value among replicate chips for each gene as an estimate of that gene's true expression. We aim to find a function $\delta_m = f(c_m)$ that relates the measurement error to the gene expression value. Since a linear function has computational advantages, and it is easy to examine the goodness of fit of such a model, we attempt to find a linear association. We explore several transformations of the data in an attempt to find such a linear relationship.

In the first model, we test whether there is a linear increase or decrease of gene expression variance as gene expression increases.
A poor linear association is observed in the scatter plot of variance against gene expression values [Figure 3.1 a]. In general, less than 25% of the variation of gene expression is explained by this model. Furthermore, the residuals from the model are not Normally distributed. Moreover, an obvious decreasing trend is observed when the residuals are plotted against the fitted values [Figure 3.1 b, c]. All these indicate that model 1 is not a satisfactory linear model.

In the second model, we log-transform the gene expression values in individual chips and test whether the variance of the log expression value is linearly associated with the average of the log expression values. This again performs poorly [Figure 3.2 a]. Less than 12% of gene variation is explained by this model. The model residuals have a long right tail (see the Normal Q-Q plot) and are not Normally distributed [Figure 3.2 b, c].

In the third model, we use the square root of the variance, i.e. the standard deviation of expression, to see whether a better linear relationship can be found. This time, 59.7% of the gene variation is explained by the model. However, the model residuals are still not randomly spread [Figure 3.3 a, b, c].

Based on model 3, we log-transform both the standard deviation of gene expression and the mean of gene expression to arrive at model 4. An obvious linear trend is now revealed. 56.8% of the gene variation is now explained by gene expression, slightly lower than in model 3 [Figure 3.4 a]. However, the residuals of model 4 are much more randomly spread, reflecting better behavior of the model [Figure 3.4 b]. From the Q-Q plot [Figure 3.4 c] we see that the residuals are skewed, with a long left tail. The values of $R^2$ for the four models are shown in Table 1. Although model 4 is not a perfect model, it is better than the other three models discussed here.
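The four candidate models can be compared with a short least-squares sketch such as the one below. It is an illustration under stated assumptions, not the original S-plus or SAS analysis: `linfit` and `four_models` are hypothetical names, natural logarithms are assumed, and genes with non-positive values or identical replicates (zero variance) must be removed first, because their logarithms are undefined.

```python
import math
from statistics import mean, variance

def linfit(x, y):
    """Simple least-squares line; returns (intercept, slope, R squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)

def four_models(genes):
    """genes: one tuple of positive replicate values per gene."""
    m = [mean(g) for g in genes]                 # mean expression
    v = [variance(g) for g in genes]             # variance among replicates
    sd = [math.sqrt(a) for a in v]               # standard deviation
    logs = [[math.log(x) for x in g] for g in genes]
    ml = [mean(g) for g in logs]                 # mean of log expression
    vl = [variance(g) for g in logs]             # variance of log expression
    return {
        "model 1": linfit(m, v),     # variance vs mean
        "model 2": linfit(ml, vl),   # variance of log vs mean of log
        "model 3": linfit(m, sd),    # SD vs mean
        "model 4": linfit([math.log(a) for a in m],
                          [math.log(s) for s in sd]),  # ln SD vs ln mean
    }
```

Comparing the third element of each result corresponds to the $R^2$ criterion; the residual checks (trend against fitted values, Normal Q-Q plot) still have to be made graphically.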
So we choose model 4 as our final model to show a linear relation between measurement error and gene expression.

[Figure 3.1. Association of gene expression and variance of gene expression. Panels: variance against mean of gene expression; residuals against fitted values; Normal Q-Q plot of residuals.]

[Figure 3.2. Association of log(gene expression) with variance of log(gene expression). Panels: variance against mean of log(gene expression); residuals against fitted values; Normal Q-Q plot of residuals.]

[Figure 3.3. Association of gene expression and standard deviation of gene expression. Panels: standard deviation against mean of gene expression; residuals against fitted values; Normal Q-Q plot of residuals.]

[Figure 3.4. Association of log(gene expression) with log(standard deviation). Panels: scatter plot; residuals against fitted values; Normal Q-Q plot of residuals.]

Table 1. Comparison of $R^2$ in four models for all chips.

Condition            Model 1   Model 2   Model 3   Model 4
Control, 0 hour       0.23      0.12      0.46      0.46
Control, 6 hour       0.16      0.09      0.48      0.44
Control, 12 hour      0.19      0.07      0.52      0.51
Control, 24 hour      0.08      0.04      0.49      0.53
Control, 48 hour      0.18      0.06      0.53      0.46
Treatment, 6 hour     0.32      0.07      0.57      0.57
Treatment, 12 hour    0.15      0.07      0.52      0.56
Treatment, 24 hour    0.23      0.05      0.60      0.55
Treatment, 48 hour    0.20      0.08      0.58      0.56

The chosen relation between measurement error and gene expression is formally expressed as:

$\ln \psi_m = \alpha + \beta \ln \mu_m + \varepsilon_m$ $(m = 1, 2, \dots, 12625)$.
For gene $m$, $\psi_m$ is the standard deviation of measurement error, $\mu_m$ denotes the true value of expression, $\alpha$ reflects a constant background noise and $\varepsilon_m$ is the residual error, which is assumed to follow a Normal distribution with mean zero and a variance that is assumed to be constant across genes. Our final model indicates that the standard deviation of measurement error grows as a power of gene expression, i.e. $\psi_m \propto \mu_m^{\beta}$. Consequently, the higher the gene expression, the higher the measurement error, and the lower the gene expression, the lower the measurement error. Compared with lowly expressed genes, the measured value for a highly expressed gene is less reliable in absolute terms.

We apply our model to the data for each of the nine conditions to get nine linear regression equations. The intercepts ($\alpha$) and slopes ($\beta$) of those equations can be found in Table 2. A highly significant linear association between the standard deviation of expression values and average gene expression, both on the log scale, is found in all nine conditions (all P < 0.001).

Table 2. Regression of standard deviation on gene expression among three replicates, on the log scale, at different times and under different treatments.

            Control                          Treatment
Time        Intercept (SE*)   Slope (SE*)    Intercept (SE*)   Slope (SE*)
0 hour       0.05 (0.04)      0.59 (0.01)
6 hour      -0.25 (0.03)      0.58 (0.01)    -0.74 (0.03)      0.71 (0.01)
12 hour     -0.78 (0.03)      0.65 (0.01)    -0.70 (0.03)      0.69 (0.01)
24 hour     -0.89 (0.03)      0.69 (0.01)    -0.90 (0.03)      0.70 (0.01)
48 hour     -0.41 (0.03)      0.61 (0.01)    -0.61 (0.03)      0.67 (0.01)

*SE: Standard Error

The slope at baseline is 0.59, while for controls at 6, 12, 24 and 48 hours the slopes are 0.58, 0.65, 0.69 and 0.61 respectively. For treatment at 6, 12, 24 and 48 hours, the slopes are 0.71, 0.69, 0.70 and 0.67.
The slopes in the control groups are quite close to each other, and those in the treatment groups are also quite similar. At each time point, the slope for the treatment line is slightly higher than that for the control (0.71 vs. 0.58 at 6 hours, 0.69 vs. 0.65 at 12 hours, 0.70 vs. 0.69 at 24 hours and 0.67 vs. 0.61 at 48 hours). These nine regression lines are shown in Figure 4. All nine lines are close to each other.

We would like to know whether these slight differences are in fact significant and, if so, whether they are caused by an effect due to treatment, time or both. To investigate this, we perform an ANOVA. Model variances corresponding to each factor are listed in Table 3. We can see that variance across different genes accounts for the largest proportion of array data variance, about 30 times the sum of the variances caused by time and treatment. In addition, we find that both time and treatment are highly associated with variation of measured gene expression (variance of 1504.27 and 667.37, both P < 0.0001). That means that as time changes, the relationship between measurement error and expression varies, i.e. the measurement error distributions for genes with the same expression are not the same at different time points, and similarly for the treatment effect. We also tested whether the effect of treatment modifies the effect of time in this study, but did not find any meaningful results.

[Figure 4. Regression lines for samples at the nine experimental conditions (control at 0, 6, 12, 24 and 48 hours; treatment at 6, 12, 24 and 48 hours). x-axis: log(gene expression); y-axis: variance of log(gene expression).]

Table 3. Variance of gene expression, time and treatment in the model.
Factor       Variance     Degrees of Freedom   F test      P-value
Gene          57994.92    1                    112589.00   <0.001
Time           1504.27    4                       730.08   <0.001
Treatment       667.37    1                      1295.60   <0.001
Residual     114955.22    106320

Table 4. Number of genes with significantly altered expression after treatment detected by t-test, using the observed or the predicted variance, at different time points.

Time       N1     N2     N12
6 hour     408    725    137
12 hour    187    298     28
24 hour    412    729    128
48 hour    284    725     75

N1: in the observed variance model. N2: in the predicted variance model. N12: in both models.

Although the results shown here indicate a statistically significant change in model parameters across treatments and time, these changes are relatively minor and are highly unlikely to be biologically meaningful.

One of the major aims of microarray experiments is to identify genes whose expression values have changed under different conditions. Typically, measured values of gene expression have been used to identify over-expressed or under-expressed genes via the commonly used two-sample t-test. Most methods estimate the variation in gene expression for each gene independently. However, it is reasonable to assume that power might be improved by using a more robust estimate of variance. Our model uses expression values for all genes simultaneously to better estimate the measurement error distribution. One might therefore expect results generated from our model to be more reliable.

To investigate the performance of our model, we compare a two-sample t-test using observed measurement error (method 1) to a t-test using the measurement error predicted from our model (method 2). We set 0.05 as the genome-wide false positive rate when comparing gene expression changes. No multiple comparison correction is performed. Approximately 10,000 genes are compared for each of the two methods. We fit the linear model for each of the nine conditions separately.
At each time point, we compare gene expression in treatment and control. In general, the results from our model help to find more genes whose expression is significantly changed under different conditions [Table 4]. Method 2 finds many more genes that are expressed significantly differently between the treatment and control groups than does method 1. However, there are not many genes that are found to be significantly changed by both methods. It appears that the two methods pick up quite different genes when identifying gene expression changes between the two groups.

Given that we are performing no multiple comparison correction, one would expect roughly 500 genes (i.e. 5% of 10,000) to have a gene-specific p-value of less than 0.05 even if there is no difference between the samples. Thus, method 1 finds fewer genes than would be expected even if no genes had changed expression. This suggests that method 2 is outperforming method 1. However, while it might be that we find more genes because there are more genes whose expression has actually changed, it may also be that we are picking up more noise.

To address this issue, we use a reference data set. In the reference set there are six replicate chips formed by splitting the same sample mRNA into six equal parts. We randomly combine three chips to form one group and the other three chips to form the other group. Then we compare gene expression between the two groups. There should be no difference in gene expression values between the two groups. However, for a genome-wide false-positive rate of 5%, we would expect to find about 500 genes with differential expression in the two groups. When we fit a model to the reference data, we get the following regression equation:

$\ln \psi_m = 0.03 + 0.59 \ln \mu_m$.
The model residuals appear to be Normally distributed [Figure 5]. Using method 1, we found 263 (2.18%) such genes; among them, 149 genes (1.3%) are over-expressed in group 2 and 114 genes (0.9%) are suppressed. Using method 2, we found more genes: 359 (2.98%) such genes, of which 205 (1.7%) are over-expressed and 154 (1.3%) are suppressed. From these results, we can see that the false-positive rate from method 2 is closer to 5%, which means method 2 is less conservative than method 1. However, as method 2 still only reaches a false-positive rate of 3%, the test, despite being an improvement, is still conservative.

[Figure 5. Linear association of gene expression and standard deviation in the reference data. Panels include residuals against fitted values and a Normal Q-Q plot of residuals.]

As a further consideration, we plot the p-values against a uniform distribution for both methods [Figure 6.1]. Under the null hypothesis of no difference in mean gene expression values, p-values resulting from both tests should follow a uniform distribution. Figure 6.1 shows that p-values from method 2 are more uniformly distributed than those from method 1, although the fit is still not perfect. Thus, method 2 appears to improve on method 1, while still being conservative. However, our particular interest is in those genes whose p-values are less than 0.05, because we want to identify genes with altered expression under two conditions. So we look more closely at the distribution of p-values that are less than 0.05 [Figure 6.2]. Now we can see that both methods are conservative in identifying statistically significant altered genes, because the p-values are higher than expected. The distribution of p-values using method 2 looks a little better than the distribution using method 1, but there is clearly still room for improvement.
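The calibration check described above amounts to comparing the observed p-values with what a uniform distribution would give. A minimal sketch follows; the function names are hypothetical and no data from the study are used.

```python
def uniform_qq(pvalues):
    """Pairs (expected uniform quantile, observed p-value), sorted by p-value.

    Points lying above the diagonal mean the observed p-values are larger
    than expected under the null, i.e. the test is conservative.
    """
    n = len(pvalues)
    return [((i + 0.5) / n, p) for i, p in enumerate(sorted(pvalues))]

def empirical_rate(pvalues, alpha=0.05):
    """Fraction of tests declared significant at level alpha."""
    return sum(p < alpha for p in pvalues) / len(pvalues)
```

Applied to a comparison in which no gene truly differs, `empirical_rate` should be close to `alpha`. The reported rates of 2.18% (method 1) and 2.98% (method 2) both fall below the nominal 5%, which is the sense in which both tests are conservative.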
DISCUSSION

We propose a model to assess measurement error for oligonucleotide microarray experiments. We find that measurement error increases as a power of gene expression (linearly on the log-log scale), and that measurement error can be slightly affected by factors such as treatment or time. Our model gives a more robust method for predicting measurement error for genes. Using the measurement error predicted by our model, we find a tendency to detect more genes that have significantly changed their expression under varying conditions.

Dchip reports the standard error for each gene based on the variance of the probe sets for that gene in a given chip. The aim of this study is to model measurement error of gene expression among replicates, so we should focus on gene-level variation, not probe-level variation. For this reason we do not use the standard error reported by Dchip as an index of measurement error for genes.

[Figure 6.1. Distribution of p-values from the two methods of two-sample t-test. x-axis: standard p, from 0 to 1.]

[Figure 6.2. Distribution of p-values from the two-sample t-test, for p less than 0.05. x-axis: standard p.]

Although our study finds that measurement error is somewhat different under different treatments or at different time points, the slopes of the linear regression for chips under these conditions do not differ much [Figure 4]. When we further compare the association of measurement error with gene expression among control, treatment, and reference replicates, we do not find a big difference either, with regression slopes of 0.66, 0.63, and 0.59. This similarity can be seen in Figure 7.
Although these differences are "statistically" significant, it appears unlikely that they are "biologically" significant. Based on this, we can say that measurement error is rather consistent in the two data sets used in this study. However, whether measurement error is truly affected by experimental conditions is not clear. We would have to fit our model to more data in the future to fully explore this point.

Although our final model is the best of the models we tried, it can explain only 56.8% of the total variance of measurement variation for gene expression. Moreover, we see in Figure 3.4 that the model residuals are not perfectly Normally distributed. This indicates that our model cannot perfectly assess measurement errors in microarray experiments. Figure 3.2 a illustrates that after logarithmic transformation of the data, the variance of measured genes is nicely stable for highly expressed genes, but varies greatly for lowly expressed genes. From Figure 3.4 a, we can see a similar pattern: measurement errors for lowly expressed genes are more spread, while measurement errors for highly expressed genes are less spread. Therefore, the relation between measurement error and gene expression might be better explained by two different models, one for lowly expressed genes and the other for highly expressed genes.

[Figure 7. Regression lines for replicates in the control group, in the treatment group and in the reference data (labelled standard, control and treatment). x-axis: log(gene expression); y-axis: variance of log(gene expression).]

Microarrays allow us to study thousands of genes simultaneously. This means we are doing multiple hypothesis testing. For example, we perform 12625 two-sample t-tests simultaneously to compare the change of expression for each gene on the chip.
Therefore the gene-specific type I error $\alpha$ should be adjusted for these multiple comparisons. According to the Bonferroni correction, we need to use $\alpha = 0.05/12625$, i.e. only genes whose p-value is less than 0.000004 can be regarded as having significantly altered their expression. None of the p-values from our study met this criterion, which means no genes would be identified as having changed. However, the Bonferroni correction is known to be extremely conservative. So in future work, we would need to find a better way to deal with multiple comparisons, e.g. a step-down correction or permutation tests.

A further limitation of our study is that it uses only three replicates. We would like to apply this model to experiments with more replicates. It is known that the more times a sample is measured, the closer the average measured expression value will be to its true value. However, microarray experiments are both time-consuming and expensive, and no such data are currently available to us. Thus it is impossible to analyze as many replicates as we would like.

We would also like to estimate how many replicates should be used in microarray experiments in order to effectively estimate the true expression of a gene. Table 5 shows the standard deviation of gene expression among replicate chips predicted by our model for some sample expression values. From the table, we can clearly see that the standard deviation of gene expression values among replicate chips is rather small. For example, when true gene expression is 100, the standard deviation is 11.28, but when true gene expression increases 10 times to 1000, the standard deviation is only 51.44. Even when true gene expression reaches 10,000, the standard deviation is 234.71. This indicates that variation of gene expression caused by measurement error is very small.
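The arithmetic behind the Bonferroni threshold and the Table 5 predictions can be checked with a few lines. The coefficients below are an assumption: the text does not state which fitted intercept and slope were used for Table 5, so they are back-solved here from the rows for expression values 100 and 10,000 (giving roughly $\alpha \approx -0.61$ and $\beta \approx 0.66$), and natural logarithms are assumed.

```python
import math

# Bonferroni-adjusted per-gene significance level for 12625 tests.
n_tests = 12625
bonferroni_alpha = 0.05 / n_tests  # about 3.96e-06

# Back-solve ln(sd) = a + b * ln(mu) from two rows of Table 5.
# These coefficients are inferred, not reported in the text.
mu_lo, sd_lo = 100.0, 11.28
mu_hi, sd_hi = 10000.0, 234.71
b = math.log(sd_hi / sd_lo) / math.log(mu_hi / mu_lo)
a = math.log(sd_lo) - b * math.log(mu_lo)

def predicted_sd(mu):
    """Standard deviation implied by the back-solved log-log line."""
    return math.exp(a + b * math.log(mu))

for mu in (100, 1000, 10000):
    sd = predicted_sd(mu)
    # The ratio sd / mu shrinks as expression grows because b < 1.
    print(mu, round(sd, 2), round(sd / mu, 3))
```

With these assumed coefficients the sketch reproduces the tabulated values to within rounding. It also shows that, because the slope is below 1, the standard deviation relative to the expression value falls from about 11% at an expression of 100 to about 2% at 10,000.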
Thus, there is little point in running replicate mRNA on multiple Affymetrix microarrays. Our results differ from those reported by Lee et al. [Lee et al. 2000], who find that at least three replicates are needed for the same sample to avoid significant measurement error. However, this difference is likely to be due to the arrays used in the experiments: we consider oligonucleotide arrays, while they consider cDNA arrays. cDNA arrays are known to have higher variance than oligonucleotide arrays. In future work, we would like to fit our model to data in which replicates represent biological replication, e.g. samples of the same tissue from different subjects. This would enable us to determine how many replicates would be needed in microarray experiments. In addition, from Table 5 we can also see that the variance of gene expression is likely to increase as the expression goes up. So the number of replicates needed in microarray experiments is likely to depend on the expression values of the genes of most interest.

Table 5. Standard deviation of gene expression among replicate chips.

Gene Expression   Standard Deviation
100                11.28
200                17.81
400                28.12
500                32.58
800                44.41
1000               51.44
2000               81.24
4000              128.30
6000              167.61
8000              202.61
10000             234.71

REFERENCES

Bartosiewicz M, Trounstine M, Baker D, et al. Development of a toxicological gene array and quantitative assessment of this technology. Arch Biochem Biophys 2000; 376: 66-73.
Brazma A, Robinson A, Cameron G, et al. One stop shop for microarray data. Nature 2000; 403: 699-700.
Bakay M, Chen YW, Borup R, et al. Sources of variability and effect of experimental approach on expression profiling data interpretation. BMC Bioinformatics 2002; 3(1): 4.
Chen Y, Dougherty ER, Bittner ML.
Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomedical Optics 1997; 2: 364-367.
Duggan DJ, Bittner M, Chen Y, et al. Expression profiling using cDNA microarrays. Nature Genet 1999; 21: 10-14.
Fodor S, Rava R, Huang X, et al. Multiplexed biochemical arrays with biological chips. Nature 1993; 364: 555-556.
Gray NS, Wodicka L, Thunnissen AM, et al. Exploring chemical libraries, structure, and genomics in the search for kinase inhibitors. Science 1998; 281: 533-538.
Goffeau A, Barrell BG, Bussey H, et al. Life with 6000 genes. Science 1996; 274: 563-567.
Lee MLT, Kuo FC, Whitmore GA, et al. Importance of replication in microarray gene expression study: statistical method and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA 2000; 97: 9834-9839.
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2001; 2(8): research0032.1-0032.11.
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 2001; 98: 31-36.
Lipshutz RJ, Fodor SPA, Gingeras TR, et al. High density synthetic oligonucleotide arrays. Nature Genet 1999; 21: 20-24.
Jin H, Yang R, Awad TA, et al. Effects of early ACE inhibition on cardiac gene expression following acute myocardial infarction. Circulation 2001; 195: 736-742.
Newton MA, Kendziorski CM, Richmond CS, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001; 8(1): 37-52.
Schena M, Shalon D, Davis RW, et al. Quantitative monitoring of gene expression patterns with a cDNA microarray. Science 1995; 270: 467-470.
Southern E, Mir K, Shchepinov M. Molecular interactions on microarrays. Nature Genet 1999; 21 (suppl): 5-9.
Thomas JG, Olson JM, Tapscott SJ. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research 2001; 11: 1227-1236.
Asset Metadata
Creator
Wang, Furong
(author)
Core Title
A linear model for measurement errors in oligonucleotide microarray experiment
School
Graduate School
Degree
Master of Science
Degree Program
Applied Biostatistics and Epidemiology
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
biology, biostatistics,biology, genetics,OAI-PMH Harvest,statistics
Language
English
Contributor
Digitized by ProQuest
(provenance)
Advisor
Marjoram, Paul (
committee chair
), Siegmund, Kimberly (
committee member
), Triche, Timothy J. (
committee member
)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-299490
Unique identifier
UC11341091
Identifier
1414857.pdf (filename),usctheses-c16-299490 (legacy record id)
Legacy Identifier
1414857.pdf
Dmrecord
299490
Document Type
Thesis
Rights
Wang, Furong
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA