Population substructure and its impact on genome-wide association studies with admixed populations
(USC Thesis Other)
POPULATION SUBSTRUCTURE AND ITS IMPACT ON GENOME-WIDE ASSOCIATION STUDIES WITH ADMIXED POPULATIONS

by Jinghua Liu

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (STATISTICAL GENETICS AND GENETIC EPIDEMIOLOGY)

August 2012

Copyright 2012 Jinghua Liu

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Use of admixed populations for genetic association studies
1.2 Degree of admixture in different populations
1.2.1 Migration history
1.2.2 Cryptic structure
1.2.3 Self-reported ancestry, global ancestry, and local ancestry
1.3 Background of the Hispanic population
1.3.1 Background
1.3.2 Population substructures identified among Hispanic samples
1.3.3 Asthma among Hispanics
1.3.4 Potential confounding of genetic association studies among Hispanics
1.4 Methods for control of confounding
1.4.1 EIGENSTRAT
1.4.2 STRUCTURE & ADMIXTURE
1.4.3 HAPMIX & LAMP
1.5 Remaining challenges for genetic association studies among Hispanics
1.6 Introduction to graphical modeling
Chapter 2 Population substructures among Hispanics
2.1 The USC Children’s Health Study (CHS)
2.1.1 Samples & markers
2.1.2 Potential confounding by population substructure observed from previous studies
2.2 Ancestry informative markers
2.3 HapMap III populations
2.3.1 Diverse ethnic populations
2.3.2 Global ancestry estimator: EIGENSTRAT, STRUCTURE & ADMIXTURE
2.3.3 Local ancestry estimates among HapMap MEX samples
2.3.4 Comparing methods
2.4 Population substructure among CHS samples
2.4.1 Global ancestry estimates
2.4.2 Local ancestry estimates
Chapter 3 Confounding and Heterogeneity in Genetic Association Studies with Admixed Populations
3.1 Introduction
3.2 Materials and Methods
3.2.1 Graphical Model
3.2.2 Regression models
3.2.3 Simulations
3.2.4 Scenarios
3.2.5 USC Children’s Health Study (CHS)
3.3 Results
3.3.1 Simulation result
3.3.2 Results from the Children’s Health Study
3.4 Discussion
Chapter 4 Mapping by admixture linkage disequilibrium
4.1 Introduction
4.1.1 Concept for admixture mapping
4.1.2 Testing for excess ancestry proportions
4.1.3 Advantages of admixture mapping
4.1.4 Purpose of this study
4.2 Materials and Methods
4.2.1 Regression models
4.2.2 Simulation framework
4.2.3 Scenarios
4.2.4 Real data analysis among African Americans
4.3 Results
4.3.1 Simulation results
4.3.2 Real data analysis results
4.4 Discussion
Chapter 5 Summary
Chapter 6 Future Directions
Bibliography

List of Tables

Table 2-1 Estimated individual global European, Asian, Amerindian, and African ancestry proportions across HapMap MEX samples from different approaches.
Table 2-2 Pearson correlation of estimated individual global European ancestry proportion between different approaches.
Table 2-3 Characteristics of different approaches for estimating individual global and local ancestries.
Table 2-4 Estimated individual global European, Asian/Amerindian, and African ancestry proportions among CHS samples through STRUCTURE with K=6.
Table 3-1 Simulated scenarios A-C.
Table 3-2 P-value and effect estimate for selected markers across ethnic groups and models.
Table 3-3 Investigation of heterogeneity for SNP rs10519951 in the Children’s Health Study combined samples.
Table 4-1 Type 1 error among models for admixture scan.
Table 4-2 Power among models for admixture scan using only ancestry information.
Table 4-3 Power among models for admixture scan incorporating genotype information.
Table 4-4 Analysis details for regions with great disagreement between case-only and case-control analysis.
Table 4-5 Known Hits that replicated from the conventional model.
Table 4-6 Known Hits that are replicated only in models incorporating local ancestry information.
Table 4-7 Effect size for SNPs and on chromosome 9.

List of Figures

Figure 1-1 Graphical model for the concept of confounding. Y represents the trait of interest, G_M represents the genotyped marker, and X represents unknown environmental and genetic factors. Observed variables are drawn as solid square, and unobserved variables are drawn as dashed circle. Association between variables is represented by solid line connecting between variables.
Figure 2-1 Clusters identified among HapMap III samples through EIGENSTRAT.
Figure 2-2 Finer scale cluster among HapMap samples identified through EIGENSTRAT within European, African, and Asian ancestry.
Figure 2-3 STRUCTURE results from analyze only HapMap samples for K from 2 to 7.
Figure 2-4 Ln likelihood of the data estimated from the STRUCTURE with K from 2 to 7 among HapMap samples.
Figure 2-5 ADMIXTURE results from analyze only HapMap samples for K from 2 to 7.
Figure 2-6 Cross validation errors of the data estimated from the ADMIXTURE with K from 2 to 10 among HapMap samples.
Figure 2-7 Comparison of estimated individual global ancestry among HapMap samples from STRUCTURE, ADMIXTURE, and EIGENSTRAT.
Figure 2-8 Estimated individual local ancestry on chromosome 22 from HAPMIX and LAMP for three selected HapMap MEX samples.
Figure 2-9 Comparison of estimated individual global ancestry among HapMap MEX samples from STRUCTURE, ADMIXTURE, HAPMIX, and LAMP.
Figure 2-10 Clusters identified through EIGENSTRAT by analyzing HapMap and CHS combined samples.
Figure 2-11 Ancestry proportions identified through STRUCTURE by analyzing HapMap and CHS combined samples.
Figure 2-12 Individual global ancestry estimated from HAPMIX for CHS samples.
Figure 3-1 (a) Potential confounding paths in genetic association studies among admixed populations. Y represents the outcome of interest, G_M the SNP at a marker locus being tested for association, L the individual local ancestry in the immediate neighborhood of the marker locus, Q the individual global ancestry averaged through L across the genome, X represents other causal factors, either unmeasured environmental factors, or unmeasured causal loci present across the genome, that may be associated with global ancestry, and G_L the immediate neighborhood of the marker locus that is used to estimate individual local ancestry L. (b) Directions of admixture LD and the LD in the parental populations.
Figure 3-2 Type I error rates with and without control for confounding at the marker locus (G_M) in scenario A.
Figure 3-3 Effect of adjustment by global and local ancestries on power in scenario B.
Figure 3-4 Comparison of power in scenario C when there is heterogeneity due to differential LD between ancestries.
Figure 3-5 Local ancestry along chromosome 4 for selected CHS Hispanic samples. Global ancestry Q represents estimated European ancestry proportion.
Figure 3-6 Q-Q plots for model (1)-(4) among combined samples.
Figure 3-7 Analysis results across models (4) and (5) for combined samples.
Figure 3-8 Plausibility of scenario B in the ENCODE regions.
Figure 3-9 Distribution of the D' difference between the CEU and the Asian populations in the ENCODE regions.
Figure 4-1 Simulation framework for admixture mapping.
Figure 4-2 Simulation results across different LD levels among markers with allele frequencies greater than 0.4 between populations.
Figure 4-3 Genome-wide admixture scan using the SNP association and the models use only local ancestry information.
Figure 4-4 Genome-wide admixture scan using the SNP association and the models incorporating both genotype and ancestry information (SUM, MIX, and CCcom).
Figure 4-5 Compare the performance between the proposed CCmix (2df) model and the conventional model.
Figure 4-6 Comparison of the performance between the proposed CCcom (2df) model and the SNP association analysis on the region on chromosome 8.
Figure 4-7 Comparison of models CCcom (2df) and CCcom_GL (3df).
Figure 4-8 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on chromosome 8.
Figure 4-9 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on chromosome 9.
Figure 4-10 Compare model CCcom_GL (3df) to the SNP association model.
Figure 5-1 Comparison of proposed models.
Figure 5-2 Changes on results between proposed models.
Figure 5-3 Histogram of changes in –log10(p-value) when comparing Y~G+L+GL+Q (3df) to the conventional model Y~G+Q.
Figure 5-4 Weighted changes on results between proposed models.

Abstract

Association studies among admixed populations pose many challenges. The purpose of this study is to compare the methods for ancestry estimation and to investigate the control for confounding and the capture of heterogeneity in SNP effect by the use of individual ancestries. In addition, a general regression framework is proposed to perform admixture mapping for both case-only and case-control study designs among admixed populations.
For confounding and heterogeneity, simulation results indicate that 1) adjustment for global ancestry can control for confounding; 2) additional adjustment for local ancestry may increase power when the induced admixture LD is in the opposite direction to the LD in the ancestral populations; and 3) the inclusion of a SNP by local ancestry interaction term can increase power when there is substantial differential LD between ancestral populations. Real data analysis of genome-wide data from the University of Southern California's Children's Health Study of childhood asthma highlights rs10519951 (p=8.5E-7) from the model with the interaction term, a SNP lacking any evidence of association in the SNP association analysis (p=0.5).

For the admixture mapping, simulation and real data analysis results among African Americans from the Multiethnic Cohort Study of prostate cancer indicate that 1) case-only analysis suffers from spurious results in regions with biased local ancestry estimation; 2) our proposed regression model yields performance similar to that of the existing methods; 3) it is more powerful to incorporate genotype information for admixture mapping; and 4) it is more powerful to incorporate a SNP by local ancestry interaction to capture the admixture signal and heterogeneity by local ancestry simultaneously.

Chapter 1 Introduction

1.1 Use of admixed populations for genetic association studies

Genome-wide association studies (GWAS) have been relatively successful at identifying numerous risk variants for many diseases and traits (Hindorff et al.). A majority of these studies have been performed among individuals of European descent (Broderick et al.; Gudbjartsson et al.; Hunter et al.; Saxena et al.; I. Tomlinson et al.; Zanke et al.; Bilguvar et al.; Graham et al.; Liu et al.; Tenesa et al.; I. P. Tomlinson et al.; Wallace et al.; Benyamin et al.; Hicks et al.; H.
Kim et al.; Kottgen et al.; Landi et al.; Ma et al.; Org et al.; Pillai et al.; Simon-Sanchez, Schulte, et al.; Song et al.; Xiong et al.; Arking et al.; Birlea et al.; Chalasani et al.; Eijgelsheim et al.; Hor et al.; Lascorz et al.; Tan et al.; Lessard et al.; O'Seaghdha et al.; Ryu et al.; Van Laer et al.; Voight et al.; Boger et al.; S. Kim et al.; McKay et al.; Panoutsopoulou et al.; Reilly et al.; Simon-Sanchez, van Hilten, et al.; K. Wang et al.; Wijsman et al.). GWAS within European populations are convenient for sample collection, and these studies are also free of major population structure and heterogeneity. However, the field is gradually moving towards GWAS in more diverse ethnic populations (Rosenberg et al.). This is mainly due to the desire to generalize genetic findings to other populations, the belief that variation in local LD structure across populations can assist in localization of the putative causal variants, and the belief that the limited genetic variation within individuals of European ancestry will not be sufficient to find all underlying variants for each disease (Cooper, Tayo and Zhu; Haiman and Stram). Simulation studies have shown that association studies using a broader range of populations, with more or different genetic variation as well as differences in disease prevalence, have additional power for discovery (Pulit, Voight and de Bakker). To date, there have been several reported GWAS among relatively homogeneous non-Caucasian populations such as Chinese (Garcia-Barcelo, Tang, et al.; Garcia-Barcelo, Yeung, et al.; Shu et al.; Guo et al.; Kung et al.; Han et al.; Lei et al.; Ng et al.; Tse et al.; Zhang et al.; Chen et al.; Tsai et al.; F. Wang et al.), Japanese (Hattori et al.; Hiura, Shen, et al.; Satake et al.; Hiura, Tabara, et al.; Kamatani et al.; Tanaka et al.; Unoki et al.; Yamada et al.; Yasuda et al.; Low et al.; Cui et al.; Kumar et al.), and Korean (Cho et al.; Yoon et al.; J. J. Kim et al.).
Furthermore, there have been GWAS among admixed populations such as African Americans (Bostrom et al.; Charles et al.; Lettre et al.) and Hispanics (Hayes et al.; Norris et al.; Rich et al.; Palmer et al.). An advantage of conducting GWAS among admixed populations rather than homogeneous populations is the extended linkage disequilibrium (LD) along the genome (Bonilla et al.; Gonzalez Burchard et al.).

1.2 Degree of admixture in different populations

1.2.1 Migration history

Modern humans originated in Africa, and the expansion from Africa to Asia, more specifically to the Middle East, occurred about 100,000 years ago (Cavalli-Sforza, Menozzi and Piazza). The Middle East is considered the center from which modern humans spread to other parts of the world. It is hypothesized that about 60,000 years ago, East Asia was reached from the Middle East through two routes, one through Central Asia and the other through South Asia. The expansion from Southeast Asia to Australia happened about 55,000 years ago. The expansion to Europe occurred about 35,000 years ago from West Asia. There is strong evidence that the expansion from Northeast Asia to America, more specifically from Siberia to Alaska, happened between 35,000 and 15,000 years ago by way of the Bering Strait (Fagan). Based on linguistic, dental, and genetic information, there were three major migrations from Siberia to America (Cavalli-Sforza, Menozzi and Piazza): (1) a first migration, followed by a rapid occupation of the whole continent by the Amerinds; (2) a second migration, named after the Na-Dene family, which mainly settled in southern Alaska and on the northwestern coast of North America; and (3) a third migration of the Eskimo-Aleut, who occupied Alaska, the northern coast of North America, and the Aleutian islands. It is possible that the Na-Dene and Eskimo-Aleuts had common origins in Asia (Cavalli-Sforza, Menozzi and Piazza).
The first wave of migrations formed the native populations (Amerindians) living in the Americas prior to the arrival of Columbus. Columbus’s arrival in the Caribbean in 1492 brought European and African ancestries into America and marks the formation of Latino (Hispanic) populations in the region (Gonzalez Burchard et al.).

1.2.2 Cryptic structure

Populations that have been geographically distant from each other for a long time form a broad range of ethnicities and tend to have different allele frequencies. For such populations, for example Caucasians and Asians, ethnicity provides a good measure of genetic homogeneity.

There is also cryptic structure among admixed populations, which carry genes from two or more ancestries. When two geographically isolated populations come together, genes flow in small amounts per generation, and there is infusion of individuals from one population into the other (Cavalli-Sforza, Menozzi and Piazza). For example, African-Americans originate from the admixture of Caucasians and Africans, while Hispanics originate from the admixture of Caucasians, Africans, and Amerindians. Individuals from an admixed population have diverse ancestry compositions.

1.2.3 Self-reported ancestry, global ancestry, and local ancestry

Individuals can self-identify into a single ethnic category and into one or multiple race groups. Ethnicity reflects the geography, nationality, or country of birth of a person’s ancestors, whereas race is more an indicator of the person’s genetic background. Together these two form the self-reported ancestry of an individual. Individual global ancestry is the proportion of the genome an individual received from each ancestral population. For individuals from major continental populations, self-reported ancestry would agree perfectly with global ancestry. Another ancestry indicator, individual local ancestry, carries additional ancestry information for admixed individuals.
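As a rough illustration of how global and local ancestry relate, the sketch below treats local ancestry as a list of chromosomal segments, each labeled with the ancestral population of its two chromosome copies, and computes global ancestry as the length-weighted share of each label. The segment lengths, the labels EUR, AMR, and AFR, and the function name global_ancestry are hypothetical and chosen only for illustration. This is not the estimation procedure used in this work, which relies on dedicated programs such as STRUCTURE, ADMIXTURE, HAPMIX, and LAMP.

```python
from collections import Counter


def global_ancestry(segments):
    """Length-weighted share of each ancestry over both chromosome copies.

    `segments` is a list of (length_mb, ancestry_copy1, ancestry_copy2)
    tuples; this layout is an assumption made for the toy example.
    """
    totals = Counter()
    for length_mb, copy1, copy2 in segments:
        # Each segment contributes its length once per chromosome copy.
        totals[copy1] += length_mb
        totals[copy2] += length_mb
    genome = sum(totals.values())
    return {ancestry: amount / genome for ancestry, amount in totals.items()}


# Hypothetical local ancestry calls for one admixed individual.
segments = [
    (40.0, "EUR", "EUR"),
    (25.0, "EUR", "AMR"),
    (30.0, "AMR", "AMR"),
    (5.0, "AFR", "EUR"),
]

q = global_ancestry(segments)
for ancestry, share in sorted(q.items()):
    print(f"{ancestry}: {share:.3f}")
```

With these toy segments the individual carries roughly 55% EUR, 42.5% AMR, and 2.5% AFR ancestry overall, even though the ancestry at any single locus is one of only a few discrete states.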
Genetic admixture occurs when individuals from previously distinct populations interbreed. After several generations, the genomes of the individuals in an admixed population become a mosaic composed of chromosomal segments originating from each of the ancestral populations. The ancestral origin of each of these segments is referred to as local ancestry. Depending on their particular genealogical history, admixed individuals will have different proportions of chromosomal segments throughout their genome originating from each of the ancestral populations. The average of these proportions across the genome is the global ancestry defined above. This individual and localized admixture is due to genetic drift, population admixture, local chromosomal structure, and mating patterns (Salanti, Sanderson and Higgins).

Self-reported ancestry could be a proxy for unmeasured environmental factors. For example, subgroups of the Hispanic population, such as Hispanic White vs. Hispanic Black, could capture the diversity of environmental exposure, culture, lifestyle, and socioeconomic status among the Hispanic origin subgroups. These environmental factors may or may not differ between individuals with different genetic composition (e.g. differences in global or local ancestry composition). However, self-reported ancestry is not an accurate proxy for an individual’s genetic ancestry, especially for admixed individuals. Individual global ancestry, in most cases estimated from the genotypes of individuals, is a better indicator of the individual’s genome composition. It captures the cryptic structure within admixed populations and any environmental and genetic divergence that accompanies the underlying substructure. In contrast, individual local ancestry more likely reflects the recombination events that have occurred since the initial admixture, and is an accurate locus-specific ancestral indicator.
It captures the diversity in LD structure across the genome for each individual as well as the extended LD along the region for admixed populations.

1.3 Background of the Hispanic population

1.3.1 Background

Hispanics are the largest and fastest growing minority population in the United States. The Hispanic population accounted for 13% of the nation’s total population in 2000 and grew by 43% between 2000 and 2010, accounting for over half of the increase in the total population during that period [U.S. Census Bureau. 2010]. In 2010 there were 50.5 million Hispanics in the United States, accounting for 16% of the nation’s total population [U.S. Census Bureau. 2010]. According to the U.S. Census 2010, Hispanic origin is the heritage or nationality group of the individual’s ancestors before their arrival in the United States. Hispanic or Latino refers to “a person of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin regardless of race” [U.S. Census Bureau. 2010]. The three major subgroups of Hispanic origin recorded in the 2010 census are (a) Mexican, Mexican American, (b) Puerto Rican, and (c) Cuban. Individuals of Hispanic origin can belong to any race group, e.g. White, Black, or Asian. As the migration history of America shows, Hispanics are mainly an admixed population of European, Amerindian, and African ancestries (Bertoni et al.; Choudhry, Coyle, et al.; Choudhry, Seibold, et al.; Lee et al.; Salari et al.). The three major Hispanic origin subgroups are distributed differently across the country, with Mexicans mainly clustering in the Southwest and Chicago, Puerto Ricans in the Northeast, and Cubans in Florida (Denavas and Hall).
This uneven distribution of Hispanic origins, together with marital behaviors in different areas, results in genetic heterogeneity (Bertoni et al.) as well as diversity in environmental and socioeconomic factors among Hispanic populations (Reibman and Liu).

1.3.2 Population substructures identified among Hispanic samples

Hispanics are an admixed population with mainly European, Amerindian, and African ancestries. The average ancestry proportions differ for Hispanics from different regions. Hispanics from the San Luis Valley in Southern Colorado have about 62.7% European, 34.1% Amerindian, and 3.2% African ancestry (Bonilla et al.). Puerto Ricans have 59.7% European, 19.1% Amerindian, and 21.3% African ancestry (Choudhry, Coyle, et al.). Hispanics from Los Angeles have on average 48% European, 44% Amerindian, and 8% African ancestry (Price, Patterson, Yu, et al.), while Hispanics residing in 6 census tracts of Los Angeles County from the Latino Eye Study have 40.1% European, 45.2% Amerindian, and 4.9% African ancestry (Shtir et al.). Hispanics from the San Francisco Bay Area, California have 45.4% European, 51.0% Amerindian, and 3.7% African ancestry (Choudhry, Coyle, et al.). Hispanic samples collected from New York City are identified as a mixture of 29.2% European, 44.8% Amerindian, and 26.0% African ancestry (Lee et al.). In general, Hispanics of Puerto Rican and Cuban origin tend to have a greater proportion of African ancestry and a lower proportion of Amerindian ancestry than Mexican Americans (Reibman and Liu).

1.3.3 Asthma among Hispanics

Puerto Ricans have the highest asthma prevalence and mortality in the United States, while Mexican Americans have the lowest (Carter-Pokras and Gergen; Freeman, Schneider and McGarvey; Homa, Mannino and Lara; Choudhry, Coyle, et al.; Salari et al.).
The results from a parent-child trio study of Mexicans and Puerto Ricans support the LTA4H and ALOX5AP genes as risk factors for asthma in Hispanic populations (Via et al.). A GWAS among Puerto Rican samples identified 5q23 as a susceptibility region for asthma (Choudhry, Taub, et al.). (will add more references and established findings) Exposures associated with asthma diagnosis include environmental tobacco smoke and the presence of dampness/mold, roaches, and furry pets in the home (Freeman, Schneider and McGarvey). Ancestry-by-environment interactions (e.g. with SES) modify the risk of asthma among Hispanics (Choudhry, Seibold, et al.). The environmental factors associated with asthma differ among Hispanic subgroups; e.g. bathroom mold and roaches were significantly associated with asthma in Puerto Ricans but not in Mexicans or Dominicans (Freeman, Schneider and McGarvey).

1.3.4 Potential confounding of genetic association studies among Hispanics

A standard design for genetic association studies entails selecting cases and unrelated population-based controls that are representative of the source population that gives rise to the cases (Devlin and Roeder). When the sample being studied is drawn from a population consisting of sub-populations with varying rates of disease, cases will be more likely than randomly selected controls to arise from the sub-populations with the higher rates of disease. Furthermore, for any genetic locus at which allele frequencies differ among the sub-populations, the standard case-control analysis will induce spurious associations, resulting in false positive or false negative results (Thomas and Witte). Significant systematic differences in ancestry proportions (with regard to Amerindian and European ancestries) between cases and controls have been observed consistently among Hispanic samples in many studies (Aldrich et al.; Choudhry, Taub, et al.; Salari et al.).
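The mechanism described in this section can be illustrated with a small simulation. The sketch below is not taken from any of the cited studies; the two sub-populations, their allele frequencies, and their disease rates are hypothetical values chosen only to make the effect visible. The SNP has no effect on disease within either sub-population, yet the pooled (crude) allelic odds ratio departs from 1 because both allele frequency and disease rate differ between the sub-populations.

```python
import numpy as np


def simulate_stratified_sample(n_per_pop=5000, seed=0):
    """Simulate a null SNP in two sub-populations (all parameters hypothetical)."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], n_per_pop)
    allele_freq = np.where(pop == 0, 0.2, 0.6)     # assumed allele frequencies
    disease_rate = np.where(pop == 0, 0.05, 0.20)  # assumed disease rates
    genotype = rng.binomial(2, allele_freq)        # allele count, independent of disease
    disease = rng.binomial(1, disease_rate)        # depends on sub-population only
    return pop, genotype, disease


def allelic_odds_ratio(genotype, disease):
    """Odds ratio comparing allele counts in cases versus controls."""
    case_alt = genotype[disease == 1].sum()
    case_ref = 2 * (disease == 1).sum() - case_alt
    ctrl_alt = genotype[disease == 0].sum()
    ctrl_ref = 2 * (disease == 0).sum() - ctrl_alt
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


pop, genotype, disease = simulate_stratified_sample()
crude_or = allelic_odds_ratio(genotype, disease)
stratum_ors = [
    allelic_odds_ratio(genotype[pop == k], disease[pop == k]) for k in (0, 1)
]
print(f"crude OR ignoring sub-population: {crude_or:.2f}")
print("ORs within each sub-population:", [f"{x:.2f}" for x in stratum_ors])
```

Under these assumed parameters the crude odds ratio is well above 1 while the within-population odds ratios stay close to 1, which is the pattern expected when the association is produced by stratification alone.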
These systematic differences in ancestry between cases and controls indicate potential confounding due to population substructure among Hispanic populations, and it is necessary to adjust for this confounding in association studies of asthma among Hispanic populations (Aldrich et al.; Choudhry, Taub, et al.).

1.4 Methods for control of confounding

It has been suggested that confounding by population substructure might be essentially a non-issue within broad population groupings defined by self-identified race or ethnicity (Wacholder, Rothman and Caporaso; Jorm and Easteal). If ethnicity provides a good measure of genetic homogeneity, or disease risks vary little within such groups, then incorporating this information into an association study (i.e., by matching cases and controls on ethnicity, or analytically adjusting for it) should help reduce the potential for bias due to population stratification. However, for an admixed population, the variation in allele frequencies and disease rates within these broad ethnic groupings (e.g., African-Americans and Hispanics) is unclear. Differing allele frequencies and disease rates arise when there is random mating within, but little or no mating between, sub-populations. The choice of mates is determined by geographic, socioeconomic, religious, cultural, and physical characteristics, which may not segregate into broadly defined ethnic groupings. Furthermore, among sub-populations, these characteristics are dynamic over time and space, often undergoing their own evolution (Cavalli-Sforza and Feldman). Therefore, adjusting for or matching on self-identified race or ethnicity as a proxy for genetic sub-populations may not fully control for population stratification in these admixed populations (Choudhry, Coyle, et al.; Serre et al.). There have been several approaches to control for confounding due to population substructure for GWAS among population-based samples.
These include approaches that aim to control the confounding without necessarily estimating the population structure. Genomic control approaches attempt to adjust the distribution of the test statistic for the presence of stratification (Devlin and Roeder), and logistic regression approaches adjust for many unlinked markers (Setakis, Stirnadel and Balding). Alternatively, there are approaches that rely on estimating the population structure. These include latent variable approaches, which attempt to identify the specific structure in order to adjust for the stratification analytically (Hoggart, Parra, et al.; Satten, Flanders and Yang; Pritchard, Stephens and Donnelly; Alexander, Novembre and Lange), and distance-based multivariate approaches, which capture the variation in fewer dimensions than the original data (Engelhardt and Stephens; Li and Yu; Miclaus, Wolfinger and Czika; Price, Patterson, Plenge, et al.). For the most part, approaches that attempt to estimate the structure focus solely on the global ancestry of a given individual, although it has recently been suggested that correcting for individual local ancestries may be required for genome-wide association scans in admixed populations (Bryc et al.; Kang et al.; Qin et al.; X. Wang et al.). In this study, STRUCTURE, ADMIXTURE, and EIGENSTRAT were used to estimate individual global ancestry, and HAPMIX and LAMP were used to assess individual local ancestry.

1.4.1 EIGENSTRAT

EIGENSTRAT (Price, Patterson, Plenge, et al.) applies principal components analysis to genotype data to infer continuous axes of genetic variation. SNPs are centered and scaled, and eigenvectors are then calculated from the covariance matrix between individuals based on the SNP genotypes. The top continuous axes of variation (top eigenvectors) are used to infer substructures within the study samples.

1.4.2 STRUCTURE & ADMIXTURE

STRUCTURE (Pritchard, Stephens and Donnelly) is a model-based approach.
It assumes K founder populations, each characterized by a set of allele frequencies across a number of independent markers, and assumes Hardy-Weinberg equilibrium within each population. Each individual is assumed to originate from one or more of the K populations and is probabilistically assigned to populations. The program models the likelihood of the observed genotypes given the assigned populations and the allele frequencies within each population, and uses a Markov chain Monte Carlo (MCMC) algorithm to sample the posterior distribution. The basic idea of the program ADMIXTURE (Alexander, Novembre and Lange) is similar to that of STRUCTURE. However, instead of relying on MCMC to sample the posterior distribution, ADMIXTURE uses an optimization technique to maximize the likelihood. Like STRUCTURE, ADMIXTURE gives the estimated proportion of ancestry (averaged across the genome) from each contributing population for each individual under study.

1.4.3 HAPMIX & LAMP

HAPMIX (Price, Tandon, et al.) is a haplotype-based approach that infers local ancestry from dense genome-wide data. It assumes two homogeneous reference ancestral populations for the admixed population under study, and the reference populations are required to be close to the true founder populations of the study samples. The key assumption of the program is that the genome of an admixed individual is a mosaic of small regions that originate from the reference populations. It calculates the likelihood that a haplotype from an admixed individual comes from one reference population or the other at each locus, and the likelihoods from nearby loci are combined through a Hidden Markov Model (HMM) to obtain a probabilistic ancestry estimator for each locus.
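The way an HMM combines evidence from nearby loci can be sketched with a deliberately simplified model. The code below is not HAPMIX's algorithm: HAPMIX works with reference haplotypes and a built-in phasing step, whereas this sketch assumes a single already-phased haplotype, two ancestries labelled A and B, known allele frequencies per marker, and one constant probability of switching ancestry between adjacent markers. The function name, `switch_prob`, `prior_a`, and the toy data are all hypothetical. It runs the standard forward-backward recursion to obtain the posterior probability of ancestry A at each marker.

```python
import numpy as np


def local_ancestry_posterior(alleles, freq_a, freq_b, switch_prob=0.05, prior_a=0.5):
    """Posterior P(ancestry A) per marker for one haplotype (simplified two-state HMM)."""
    alleles = np.asarray(alleles)
    freq_a = np.asarray(freq_a, dtype=float)
    freq_b = np.asarray(freq_b, dtype=float)
    n = len(alleles)
    # emit[t, s]: probability of the observed allele at marker t given ancestry s
    emit = np.column_stack([
        np.where(alleles == 1, freq_a, 1 - freq_a),
        np.where(alleles == 1, freq_b, 1 - freq_b),
    ])
    trans = np.array([[1 - switch_prob, switch_prob],
                      [switch_prob, 1 - switch_prob]])
    fwd = np.zeros((n, 2))
    bwd = np.ones((n, 2))
    # Forward pass, rescaled at every marker to avoid numerical underflow
    fwd[0] = np.array([prior_a, 1 - prior_a]) * emit[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] @ trans) * emit[t]
        fwd[t] /= fwd[t].sum()
    # Backward pass, also rescaled
    for t in range(n - 2, -1, -1):
        bwd[t] = trans @ (emit[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 0]


# Toy haplotype: the allele coded 1 is common in ancestry A and rare in ancestry B
alleles = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
posterior_a = local_ancestry_posterior(alleles, freq_a=[0.9] * 10, freq_b=[0.1] * 10)
print(np.round(posterior_a, 3))
```

In this toy example the posterior for ancestry A is high over the first five markers and low over the last five, reflecting a single ancestry switch in the middle of the haplotype.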
Another advantage of HAPMIX is that, instead of first estimating haplotypes among the study samples and assuming no error in the phasing process, it incorporates a built-in phasing process and averages the inference about ancestry across all possible phase solutions for each admixed individual. LAMP (Pasaniuc, Sankararaman, et al.) uses a sliding window-based framework. It assumes that admixed populations arise from K ancestral populations. It partitions an individual’s genome into small windows and assumes no more than one recombination event that changes the ancestry within each window. It chooses a window size for each locus, and at each locus it computes the likelihood of having different ancestry upstream and downstream within the window.

1.5 Remaining challenges for genetic association studies among Hispanics

With regard to population substructure in the Hispanic population, the remaining challenges for conducting genetic association studies include the following:

a) There are several existing methods for estimating individual global and local ancestry. The question is how to choose among these methods according to the study design, the purpose of the study (e.g. adjusting for confounding, capturing heterogeneity), and the samples under study (general knowledge about the geography and genetic composition of the samples).

b) In an admixed population, each individual carries various proportions of founder ancestries, and these proportions vary across the genome, resulting in LD patterns that vary within and between populations. What does the local ancestry look like, and how much diversity across the genome is observed for these admixed individuals? In addition, although there have been studies showing that local ancestry could be a confounder and that adjustment for individual local ancestry is necessary in genetic association studies, all these conclusions are drawn from simulation studies.
There has not been any study showing the need to adjust for local ancestry in addition to the adjustment for global ancestry (which is generally used in genetic association studies these days).

c) In an admixed population such as Hispanics, there are indicators of self-reported ancestry (e.g. Hispanic White, Hispanic Asian, and Hispanic Black), estimated individual global ancestry, and estimated local ancestry. What is the relation among all these ancestry indicators? How much do they correlate with each other, and how much additional information does each of them capture? What does each ancestry indicator contribute to the interpretation of the underlying genetic and environmental effects on the trait under study?

d) Is there heterogeneity of the SNP marginal effect by self-reported ancestry, estimated individual global ancestry, or local ancestry? Is there a three-way interaction in which SNP effects on disease risk vary by global (or self-reported) ancestry and by local ancestry?

e) Is the testing of gene by environment interaction confounded by population substructure among admixed individuals?

(will add more references for this section)

1.6 Introduction to graphical modeling

Figure 1-1 uses a graphical model to give a structural representation of the concept of confounding. Assume that we are interested in the unknown causal relation between a gene (G_M) and disease (Y). If we undertake an association study in which there exists an unknown factor X (e.g. population substructure, other underlying genetic factors, or any environmental factors) that is associated with genotype and is a risk factor for the disease, then confounding by this unobserved factor X can occur.

Figure 1-1 Graphical model for the concept of confounding. Y represents the trait of interest, G_M represents the genotyped marker, and X represents unknown environmental and genetic factors. Observed variables are drawn as solid squares, and unobserved variables are drawn as dashed circles.
Association between variables is represented by a solid line connecting the variables.

Chapter 2 Population substructures among Hispanics

2.1 The USC Children’s Health Study (CHS)

2.1.1 Samples & markers

The CHS is an ongoing cohort study investigating environmental and genetic influences on asthma in children. The study design is discussed in detail elsewhere (Navidi et al.; McConnell et al.; Li et al.). In this project, we include a total of 2,839 samples (1,246 asthma cases and 1,593 controls) from two self-reported ethnic groups: 1,489 non-Hispanic Whites and 1,350 Hispanics. Genotyping of these samples was performed at the USC Genome Center using both the Illumina HumanHap550 and the Illumina Human 610-Quad BeadChips.

2.1.2 Potential confounding by population substructure observed from previous studies

Several previous studies have shown that population-based genetic association studies among Hispanic populations might be confounded by population stratification. For example, a case-control study among Hispanics showed that Puerto Rican asthma cases had a significantly lower proportion of African ancestry and a significantly higher proportion of European ancestry than controls (Choudhry, Coyle, et al.). In addition, European ancestry was found to be associated with more severe asthma in Mexican-Americans, and there was a strong inverse correlation between Native American and European ancestry in this Hispanic population (Salari et al.).

2.2 Ancestry informative markers

Ancestry informative markers (AIMs) are markers whose allele frequencies differ between parental populations and are therefore selected to estimate individuals’ ancestral proportions. In order to study the observed structure within the CHS multiethnic populations, three mutually exclusive groups of AIMs are used to distinguish continental structure as well as finer within-continent substructure. AIM233 (Smith et al.) and AIM557 (Seldin et al.)
are informative marker sets selected to identify continental genetic structure. AIM233 contains AIMs from four lists, each with 100 SNPs, which are optimal for distinguishing four population mixtures: European vs. West-African, European vs. Amerindian, West-African vs. Amerindian, and European vs. East Asian. Of these 400 SNPs, 233 are unique and have a high probability of successful genotyping using Illumina. AIM557 is a subset of markers found on the Illumina Linkage IV panel that are informative for distinguishing European ancestry from African, East Asian, South Asian, and Amerindian ancestries. Furthermore, in order to detect possible finer scale structure within the European population, we include a group of European substructure ancestry informative markers, AIM192 and AIM1211 (Tian et al.), which are selected from the Illumina 300K and 500K platforms. AIM192 is informative for identifying Northern/Southern European substructure, and AIM1211 is informative for identifying substructure along a West-East gradient within northern Europeans.

2.3 HapMap III populations

2.3.1 Diverse ethnic populations

HapMap Phase III samples can be used as reference populations when estimating individual global and local ancestries among the study samples. HapMap III includes samples from 11 populations: Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), African ancestry in Southwest USA (ASW), Maasai in Kinyawa, Kenya (MKK), Toscani in Italy (TSI), Utah residents with Northern and Western European ancestry from the CEPH collection (CEU), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT), Chinese in Metropolitan Denver, Colorado (CHD), Mexican ancestry in Los Angeles, California (MEX), and Gujarati Indians in Houston, Texas (GIH). When investigating population structure, only unrelated samples are included in the analysis.
In the end, 988 unrelated samples (113 YRI, 90 LWK, 49 ASW, 143 MKK, 88 TSI, 112 CEU, 84 CHB, 86 JPT, 85 CHD, 50 MEX, and 88 GIH) are included in this study.

2.3.2 Global ancestry estimator: EIGENSTRAT, STRUCTURE & ADMIXTURE

2.3.2.1 Results from EIGENSTRAT

Individual global ancestry among HapMap III samples is estimated through the EIGENSTRAT program based on the 1981 AIMs. Figure 2-1 plots the first eigenvector against eigenvectors two through seven for HapMap III samples. The first eigenvector clearly distinguishes two major continental clusters: the African ancestry related groups (YRI, LWK, ASW, and MKK) and the other ethnic groups. Within the African ancestry, it also separates YRI and LWK from ASW and MKK. The second eigenvector identifies the Asian cluster (CHB, JPT, and CHD) as well as the European cluster (TSI and CEU). The third eigenvector further separates the Indian cluster (GIH) from the other ethnic groups. The fourth and fifth eigenvectors separate the Amerindian component (MEX) and the Maasai component of the African ancestry (MKK) from the other ethnic groups. Eigenvector six is the main eigenvector that clearly distinguishes the southern (TSI) and northern (CEU) clusters within the European ancestry. Finally, eigenvector seven distinguishes JPT from CHB and CHD within the Asian ancestry. The remaining eigenvectors (eight to ten) do not show any pattern of clusters related to the ethnic groups within the HapMap III samples. The top ten eigenvectors explain 20% of the variance in the data. Figure 2-2 shows the finer scale clusters within the European (TSI and CEU), African (YRI, LWK, ASW, and MKK), and Asian (CHB, JPT, and CHD) ancestries respectively. For each ancestry group, the two eigenvectors that are most informative for identifying the finer clusters within that ancestry are plotted against each other.
Within the European ancestry, eigenvectors 4 and 6 separate TSI, which represents the southern European cluster, from CEU. Within the African ancestry, the combination of eigenvectors 1 and 4 perfectly separates the four ethnic groups from each other. YRI and LWK form two small clusters that are close to each other, while ASW and MKK are relatively far apart, with greater variation within the clusters. For the Asian ancestry, JPT is clearly separated from the other ethnic groups, mainly through eigenvector 7, and CHB and CHD are together identified as a homogeneous Asian cluster.

Figure 2-1 Clusters identified among HapMap III samples through EIGENSTRAT.

Figure 2-2 Finer scale cluster among HapMap samples identified through EIGENSTRAT within European, African, and Asian ancestry.

2.3.2.2 Results from STRUCTURE

I also estimate individual global ancestry through the STRUCTURE program based on the same set of AIMs among HapMap III samples. When running STRUCTURE, the number of clusters (K) is predefined to a range of values based on our knowledge of the samples under study, and the admixture model is used. The length of the burn-in period is set at 20,000, and the number of MCMC steps after burn-in is set at 10,000. I perform 10 independent runs for each K, and the final number of clusters K used to interpret the substructure is decided based on both the estimated Ln probability of the data and the knowledge about the geography and possible ancestry proportions that I believe to be true among the samples. The results for K ranging from 2 to 7 are shown in Figure 2-3. For each plot, the horizontal axis represents individuals grouped by ethnicity. The vertical axis represents the estimated individual ancestry coefficients, which are continuous variables between 0 and 1 indicating the percentage of different ancestries of the individual’s genome.
Different estimated ancestries are represented by different colors. Here, green represents the estimated European ancestry; red represents African ancestry; orange represents the Maasai component of the African ancestry; blue-purple represents Asian ancestry; pink represents the Amerindian ancestry; and blue represents the Western Indian ancestry. As shown in Figure 2-3, with K=2, STRUCTURE mainly identifies the African ancestry. There are four ethnic groups (YRI, LWK, ASW, and MKK) that mostly consist of African ancestry. For K=3, STRUCTURE further identifies Asian ancestry. There are three ethnic groups (CHB, JPT, and CHD) that are homogeneous in Asian ancestry. STRUCTURE separates out the Indian ancestry among the GIH samples for K=4, and for K=5 it identifies the Maasai component of the African ancestry within the MKK ethnic group. The Amerindian ancestry is finally identified for K=6 within the MEX ethnic group. No additional ancestry clusters are identified by STRUCTURE with K greater than 6. Figure 2-4 shows the Ln likelihood of the data estimated from STRUCTURE with K ranging from 2 to 7. The likelihood increases dramatically from K=2 to K=3, followed by a relatively small increase from K=3 to K=6, and then levels off after K=6. Based on the estimated likelihood from STRUCTURE and the geography of the samples, six clusters (K=6) best represent the population substructure within the HapMap III samples.

Figure 2-3 STRUCTURE results from analyzing only HapMap samples for K from 2 to 7.

Figure 2-4 Ln likelihood of the data estimated from STRUCTURE with K from 2 to 7 among HapMap samples.

2.3.2.3 Results from ADMIXTURE

The results from ADMIXTURE are shown in Figure 2-5 for K ranging from 2 to 7.
Similar to the STRUCTURE analysis, ADMIXTURE identifies the three major continental ancestries (European, African, and Asian) as well as three other distinct ancestry components: Amerindian among MEX samples, Indian ancestry mainly among GIH samples, and the Maasai component of African ancestry. The major differences between the results from ADMIXTURE and STRUCTURE are: 1) the estimated proportion of the Maasai component among MKK and LWK samples is higher from ADMIXTURE; 2) the estimated Amerindian ancestry among MEX is higher from ADMIXTURE; 3) JPT is separated from CHB and CHD by ADMIXTURE.

Figure 2-5 ADMIXTURE results from analyzing only HapMap samples for K from 2 to 7.

Figure 2-6 Cross validation errors of the data estimated from ADMIXTURE with K from 2 to 10 among HapMap samples.

Figure 2-6 shows the cross validation errors of the data estimated from ADMIXTURE with K from 2 to 10. The error decreases dramatically from K=2 to K=3, followed by a relatively small decrease from K=3 to K=6, and then increases again after K=6. This supports the conclusion drawn from the STRUCTURE program that six clusters (K=6) best represent the population substructure within the HapMap III samples. However, as shown by the EIGENSTRAT result, the 1981 selected AIMs are able to distinguish JPT from the other Asian ethnic groups. The finer scale substructure within the Asian ancestry identified by ADMIXTURE with K=7 is consistent with the finding from the EIGENSTRAT program.

Figure 2-7 Comparison of estimated individual global ancestry among HapMap samples from STRUCTURE, ADMIXTURE, and EIGENSTRAT.

In order to compare the results from EIGENSTRAT, STRUCTURE, and ADMIXTURE, we plot the top eigenvectors along with the population substructures identified through STRUCTURE and ADMIXTURE.
As shown in Figure 2-7, the top four eigenvectors capture population substructures similar to those from STRUCTURE and ADMIXTURE. Eigenvector 7 supports the finer scale substructure identified among Asian populations through ADMIXTURE with K=7.

2.3.3 Local ancestry estimates among HapMap MEX samples

When estimating individual local ancestry among HapMap MEX samples, phased haplotypes of HapMap III CEU and Asian (CHB and JPT) samples are used as the reference ancestral populations for HAPMIX, and allele frequencies from each group serve as the reference allele frequencies for LAMP. In addition, global ancestry is calculated for each individual by averaging the estimated local ancestries across the genome. As shown by the STRUCTURE result with K=6, the MEX admixed samples mainly consist of Amerindian and European ancestries. When estimating individual local ancestry among MEX samples, it would be optimal to use a homogeneous Amerindian population as one of the reference populations; however, such a homogeneous population of Amerindian ancestry is not available in this study. Comparing the STRUCTURE results for K=5 and K=6 shows that the Asian ancestry is closest to the Amerindian ancestry, and the HapMap III CHB and JPT are homogeneous ethnic groups of Asian ancestry. Therefore, CHB and JPT are used as the Asian reference population to represent the Amerindian ancestry within MEX samples, and HapMap III CEU and Asian samples together serve as the reference populations when estimating individual local ancestries among MEX samples. Figure 2-8 plots the estimated individual local ancestry on chromosome 22 from HAPMIX and LAMP for three selected MEX samples. Sample NA19679 has an estimated European ancestry proportion of 0.84 from STRUCTURE with K=5, and the estimated European ancestry proportions for samples NA19676 and NA19759 are 0.63 and 0.42 respectively.
In each plot, the horizontal axis represents markers ordered by their physical position on chromosome 22, and the vertical axis represents the proportion of European ancestry at each locus. The genome consists mainly of three kinds of regions: regions with ~0% European ancestry (two copies from the Asian parental population); regions with ~50% European ancestry (one copy from the European parental population and one copy from the Asian parental population); and regions with ~100% European ancestry (two copies from the European parental population). For samples with greater global European ancestry estimated by STRUCTURE, the local ancestry estimated through both HAPMIX and LAMP consistently shows a larger proportion of European ancestry. The local ancestry estimated from LAMP roughly agrees with that estimated from HAPMIX. The Pearson correlations of the estimated local ancestry between HAPMIX and LAMP for samples NA19679, NA19676, and NA19759 are 0.71, 0.82, and 0.74 respectively.

2.3.4 Comparing methods

As the individual global ancestry is an average of the local ancestries across the genome, besides estimating individual global ancestry based on selected AIMs through STRUCTURE and ADMIXTURE, we also calculated individual global ancestry by averaging the estimated individual local ancestry across the genome (437,599 loci in total). Individual global ancestry estimators are shown in Figure 2-9 for STRUCTURE, ADMIXTURE, HAPMIX, and LAMP. The order of the individuals is the same across the plots. The estimated individual global ancestries are consistent among the different approaches. As shown in Table 2-1, the estimated Asian ancestry proportion is similar among STRUCTURE with K=5 (31%), ADMIXTURE with K=5 (31%), HAPMIX (36%), and LAMP (34%). For K=6, STRUCTURE and ADMIXTURE result in much lower European ancestry proportions.
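The two calculations used in this comparison, averaging local ancestry into a global estimate and correlating estimates between methods, can be sketched as follows. The arrays are small made-up examples rather than HAPMIX or LAMP output, and the names `method_1` and `method_2` are placeholders; each value is the European ancestry proportion at a locus (0, 0.5, or 1).

```python
import numpy as np


def global_from_local(local_ancestry):
    """Average locus-level European ancestry across loci for each individual."""
    return np.asarray(local_ancestry, dtype=float).mean(axis=1)


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Hypothetical local ancestry calls: rows are individuals, columns are loci
method_1 = np.array([[1.0, 1.0, 0.5, 0.5],
                     [0.5, 0.5, 0.0, 0.0],
                     [1.0, 0.5, 0.5, 0.0]])
method_2 = np.array([[1.0, 1.0, 1.0, 0.5],
                     [0.5, 0.0, 0.0, 0.0],
                     [1.0, 0.5, 0.5, 0.5]])

global_1 = global_from_local(method_1)
global_2 = global_from_local(method_2)
r_global = pearson(global_1, global_2)                    # agreement of global estimates
r_local_first = pearson(method_1[0], method_2[0])         # agreement along one genome
print(global_1, global_2, round(r_global, 3), round(r_local_first, 3))
```

Global estimates can agree closely between two methods even when their locus-by-locus calls differ, because the averaging smooths out local disagreements.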
As shown in Table 2-2, the Pearson correlation of the European ancestry proportion is high (>0.9) between different approaches. HAPMIX and LAMP result in very similar global ancestry estimators (as shown in Figure 10).

Figure 2-8 Estimated individual local ancestry on chromosome 22 from HAPMIX and LAMP for three selected HapMap MEX samples.

Figure 2-9 Comparison of estimated individual global ancestry among HapMap MEX samples from STRUCTURE, ADMIXTURE, HAPMIX, and LAMP.

Table 2-1 Estimated individual global European, Asian, Amerindian, and African ancestry proportions across HapMap MEX samples from different approaches.

Approach          European   Asian   Amerindian   African   Other
STRUCTURE K=5     60%        31%     -            5%        4%
STRUCTURE K=6     42%        14%     42%          2%        0%
ADMIXTURE K=5     50%        31%     -            6%        3%
ADMIXTURE K=6     27%        6%      57%          4%        6%
HAPMIX            64%        36%     -            -         -
LAMP              66%        34%     -            -         -

Table 2-2 Pearson correlation of estimated individual global European ancestry proportion between different approaches.

                  STRUCTURE K=5   STRUCTURE K=6   ADMIXTURE K=5   ADMIXTURE K=6   HAPMIX   LAMP
STRUCTURE K=5     -               0.97            0.99            0.96            0.95     0.94
STRUCTURE K=6     -               -               0.95            0.99            0.97     0.97
ADMIXTURE K=5     -               -               -               0.94            0.93     0.92
ADMIXTURE K=6     -               -               -               -               0.96     0.96
HAPMIX            -               -               -               -               -        1.00

Table 2-3 Characteristics of different approaches for estimating individual global and local ancestries.

STRUCTURE
  Ancestry estimator: global
  Markers: AIMs
  Time (a): 1 hr
  Choice for study designs: Studies with genotyped AIMs. Need precise & interpretable estimators for individual global ancestry.

ADMIXTURE
  Ancestry estimator: global
  Markers: AIMs; random markers (e.g. 10,000 ~ 100,000)
  Time (a): 1 min
  Choice for study designs: Studies without AIMs. Need a quick but interpretable global estimator that is good enough for adjusting for confounding.

EIGENSTRAT
  Ancestry estimator: global
  Markers: AIMs; random markers (e.g. 10,000 ~ 100,000)
  Time (a): 5 sec
  Choice for study designs: Studies without AIMs. Need quick adjustment in the study without complete interpretation of the estimated clusters. Studies that would like to capture other potential unknown substructures.

HAPMIX
  Ancestry estimator: global and local
  Markers: GWAS; 1000 genome; sequencing
  Time (a): 15 hr
  Choice for study designs: Studies that need a precise local ancestry estimator. Samples mainly consisting of two parental ancestries.

LAMP
  Ancestry estimator: global and local
  Markers: GWAS; 1000 genome; sequencing
  Time (a): 3 min
  Choice for study designs: Studies that need a quick local ancestry estimator at the cost of some accuracy. Studies with samples that apparently consist of more than two parental ancestries.

(a) The time for STRUCTURE, ADMIXTURE, and EIGENSTRAT is measured for running the program on 1981 AIMs among 50 samples; the time for HAPMIX and LAMP is measured for running the program across the whole genome (437,599 autosomal SNPs) among 50 samples.

2.4 Population substructure among CHS samples

Among the 1981 AIMs that are available in HapMap, 1746 are available in both HapMap III and the CHS GWAS. HapMap III samples serve as the reference population when estimating individual global and local ancestries among CHS samples.

2.4.1 Global ancestry estimates

Individual global ancestry is estimated through EIGENSTRAT and STRUCTURE. The results from EIGENSTRAT are shown in Figure 2-10. Figure 2-10 (a) shows the estimated clusters from the top ten eigenvectors within the HapMap reference populations. The result is similar to that from the previous analysis of HapMap samples alone (Figure 2-1). The major difference between running EIGENSTRAT on HapMap samples only and on combined HapMap and CHS samples is that the Amerindian ancestry among HapMap MEX samples is identified earlier (eigenvector two) when HapMap is combined with CHS than when HapMap samples are analyzed alone (eigenvector four). This is driven by the large Hispanic group within the CHS samples. Figure 2-10 (b) shows the clusters within CHS non-Hispanic samples. The scale of each plot is the same as in (a). Compared with the clusters identified in the HapMap reference populations, the CHS non-Hispanic White population, marked in green in the plot, is a fairly homogeneous population that clusters closely around the European ancestry.
There are a few individuals lying between the European and the other ancestry clusters (small tails toward the Asian and African ancestries). Most of the CHS non-Hispanic Asian samples (marked in blue-purple) cluster with the Asian ancestry, and most non-Hispanic Black samples are identified with the African cluster. CHS self-identified non-Hispanic Mix samples (plotted as black circles) are basically a combination of the three CHS non-Hispanic groups stated above, and the non-Hispanic Other population (pink dots) includes several individuals with some level of Amerindian ancestry, as shown in the plot of eigenvector one against eigenvector three. These samples are genetically closer to the CHS Hispanic samples shown in Figure 2-10 (c). In this figure, self-identified Hispanic White, Hispanic Mix, and Hispanic Other samples are marked in pink, and Hispanic Asian and Hispanic Black samples are marked in blue-purple and red respectively. Overall, the CHS Hispanic population is an admixed population mainly clustering between European and Amerindian ancestry, with a few individuals reaching towards the Asian and African ancestries identified among HapMap reference samples. The population substructure identified through STRUCTURE is consistent with that from EIGENSTRAT. Results for the HapMap reference populations and for CHS samples are plotted separately in Figure 2-11 (a) and (b). When HapMap samples are combined with CHS samples, an additional ancestry, the Southern European ancestry within TSI, is identified through STRUCTURE with K=7 compared with the result shown before (Figure 2-3). The estimated individual global European, Asian, Amerindian, and African ancestry proportions among CHS samples through STRUCTURE with K=6 are shown in Table 2-4.

(a) Identified clusters within HapMap reference populations.

Figure 2-10 Clusters identified through EIGENSTRAT by analyzing HapMap and CHS combined samples.
(b) Identified clusters within CHS non-Hispanic samples.
Figure 2-10 (Continued).

(c) Identified clusters within CHS Hispanic samples.
Figure 2-10 (Continued).

(a) Identified ancestry proportions within HapMap reference populations.
(b) Identified ancestry proportions within CHS samples with K=6.
Figure 2-11 Ancestry proportions identified through STRUCTURE by analyzing HapMap and CHS combined samples.

(c) Identified ancestry proportions within CHS samples with K=7.
Figure 2-11 (Continued).

Table 2-4 Estimated individual global European, Asian/Amerindian, and African ancestry proportions among CHS samples through STRUCTURE with K=6.

| Group | Self-reported race | Num | European | Asian | Amerindian | African | Other |
|---|---|---|---|---|---|---|---|
| Non-Hispanics | White | 1749 | 96% | 1% | 1% | 0% | 2% |
| Non-Hispanics | Asian | 48 | 7% | 75% | 0% | 0% | 18% |
| Non-Hispanics | Black | 16 | 24% | 0% | 0% | 75% | 1% |
| Non-Hispanics | Mix | 56 | 62% | 27% | 1% | 1% | 11% |
| Non-Hispanics | Other | 6 | 52% | 1% | 42% | 0% | 5% |
| Hispanics | White | 401 | 67% | 0% | 29% | 0% | 4% |
| Hispanics | Asian | 4 | 23% | 44% | 23% | 0% | 10% |
| Hispanics | Black | 3 | 26% | 3% | 20% | 49% | 2% |
| Hispanics | Mix | 282 | 68% | 5% | 22% | 0% | 5% |
| Hispanics | Other | 839 | 45% | 0% | 49% | 0% | 6% |

2.4.2 Local ancestry estimates

Similar to the approach used when estimating individual local ancestry among HapMap MEX samples, HapMap CEU and Asian (CHB and JPT) samples are used as the reference populations when running HAPMIX among CHS samples. Figure 2-12 shows the individual global ancestry obtained by averaging the estimated local ancestry across the genome.

Figure 2-12 Individual global ancestry estimated from HAPMIX for CHS samples.

Chapter 3 Confounding and Heterogeneity in Genetic Association Studies with Admixed Populations

3.1 Introduction

Genome-wide association studies (GWAS) have been relatively successful at identifying numerous risk variants for many diseases and traits (Hindorff et al.). A majority of these studies have been performed with individuals of European ancestry and so have been free of major population structure and heterogeneity. However, the field is gradually moving towards GWAS in more diverse ethnic populations (Rosenberg et al.).
To date, there have been several reported GWAS among relatively homogeneous non-Caucasian populations, including Chinese (Garcia-Barcelo, Tang, et al.; Guo et al.; Han et al.; Lei et al.; Ng et al.; Tse et al.; Zhang et al.), Japanese (Hattori et al.; Hiura, Shen, et al.; Kamatani et al.; Tanaka et al.; Unoki et al.; Yamada et al.; Yasuda et al.), and Korean populations (Cho et al.). In part, this is due to the desire to generalize genetic findings to other populations, as well as to the belief that the limited genetic variation within individuals with European ancestry will not be sufficient to find all underlying variants for each disease. Association studies using a broader range of populations with more or different genetic variation, as well as differences in disease prevalence, may have additional power for discovery (Pulit, Voight and de Bakker). In addition to expansion to other homogeneous populations, GWAS among admixed populations such as African Americans (Adeyemo et al.; Barnholtz-Sloan et al.) and Hispanics (Hayes et al.; Norris et al.; Palmer et al.; Rich et al.) may be advantageous due to extended linkage disequilibrium (LD) along the genome (Bonilla et al.; Gonzalez Burchard et al.). However, such admixed populations pose new challenges in association studies, most notably potential confounding due to subtle population stratification and heterogeneity of the effects due to differential LD.

Genetic admixture occurs when individuals from previously distinct populations interbreed. For example, African-Americans originate from the admixture of Caucasians and Western Africans, while Hispanic-Americans originate from the admixture of Caucasians, Western Africans, and Amerindians. After several generations, the genomes of the individuals in an admixed population become a mosaic composed of chromosomal segments originating from each of the ancestral populations. The ancestral variation for these segments is referred to as local ancestry.
Depending on their particular genealogical history, each admixed individual will have different proportions of chromosomal segments throughout their genome originating from each of the ancestral populations. The average of these proportions across the genome is referred to as global ancestry.

While controlling for self-identified race and/or ethnicity is possible for broad-scale structure, when finer-scale stratification or admixture is suspected, an alternative approach is to perform family-based studies to obtain valid inference (Gauderman, Witte and Thomas). Between these two extreme study designs there exist many approaches that attempt to either account for the effects of the confounding or identify the unknown structure. These include approaches aiming to control the confounding but not necessarily to estimate the population structure.

In an admixed population, each individual contains various proportions of founder ancestries, with these proportions varying across the genome. This individual and localized admixture is due to genetic drift, population admixture, local chromosomal structure, and mating patterns (Salanti, Sanderson and Higgins), and results in LD patterns varying within and between populations. When testing genetic markers that are proxies for a disease causal locus (as in a GWAS), this differential LD can result in heterogeneity of effect estimates by local ancestry. This heterogeneity can impact not only the power of a GWAS but also meta-analysis. In fact, many researchers have leveraged this to identify causal variants, arguing that consistency in effect estimates across multiple ethnic groups bolsters support for that specific variant being a true causal polymorphism (Teslovich et al.; Waters et al.). To deal with heterogeneity, most current studies perform a test of interaction between the SNP of interest and ethnicity and then conduct a stratified analysis if appropriate.
These methods are optimal for the combined analysis of more homogeneous populations (e.g. European and Asian), but it is unclear how appropriate they are for certain admixed populations with variation in local ancestry along the chromosome (e.g. Hispanics) or when combining an admixed population with others.

In this chapter, we use a graphical diagram to clarify the mechanisms by which admixture can lead to confounding and how heterogeneity in effect estimates may arise. Based on these mechanisms, we investigate the source and effect of confounding and test for heterogeneity via an interaction term between SNP and local ancestry through simulations. Across all models and simulation scenarios we focus on effect estimation, type I error and power. Finally, we apply these models to a GWAS investigating the impact of genetic variation on asthma in the University of Southern California's Children's Health Study. We discuss the overall impact of global ancestry on this analysis and identify several empirical examples where accounting for local ancestry impacts inference.

3.2 Materials and Methods

3.2.1 Graphical Model

Figure 3-1 (a) is a graphical model representing the relationship of several factors involved in genetic association studies among admixed populations (Greenland; Greenland, Pearl and Robins). Here, $Y$ represents the outcome of interest. $G_M$ represents the SNP at a marker being tested for association with $Y$ (with effect $\beta_{G_M}$). $G_D$ represents an unmeasured causal locus (with effect $\beta_{G_D}$) for which we are testing $G_M$ as a proxy. $X$ represents other causal factors that are associated with global ancestry ($Q$), including unmeasured environmental factors and/or unmeasured causal loci. Global ancestry is most often estimated from a subset of markers (i.e. ancestry informative markers) (Seldin et al.; Shtir et al.; Smith et al.; Tian et al.). Alternatively, we view global ancestry as the average of local ancestries along the genome.
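The view of global ancestry as the genome-wide average of local ancestry can be illustrated with a minimal Python sketch. It assumes local ancestry at each marker is coded as the number of copies (0 to 2) inherited from one parental population; the function name and the toy values are hypothetical and are not taken from the CHS analysis.

```python
def global_ancestry(local_copies):
    """Average per-marker local ancestry (0 to 2 copies) into a proportion."""
    if not local_copies:
        raise ValueError("need at least one marker")
    # Each marker carries at most 2 copies, so divide by 2 * number of markers.
    return sum(local_copies) / (2 * len(local_copies))


# Toy individual with local ancestry calls at six markers (illustrative only).
toy_local = [2, 2, 1, 1, 0, 2]
q_toy = global_ancestry(toy_local)
```

Fractional local ancestry estimates, such as posterior expected copy numbers, can be passed in the same way.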
Ancestral variation can lead to differences in allele frequencies for measured genetic variants ($q$) and unmeasured causal variants ($p$). We assume the local ancestry at $G_M$ and $G_D$ is the same, and that there are additional SNPs ($G_L$) that can be used to estimate local ancestry for each subject at each location.

Figure 3-1 (a) Potential confounding paths in genetic association studies among admixed populations. $Y$ represents the outcome of interest, $G_M$ the SNP at a marker locus being tested for association, $L$ the individual local ancestry in the immediate neighborhood of the marker locus, $Q$ the individual global ancestry averaged through $L$ across the genome, $X$ other causal factors, either unmeasured environmental factors or unmeasured causal loci present across the genome, that may be associated with global ancestry, and $G_L$ the immediate neighborhood of the marker locus that is used to estimate individual local ancestry $L$. (b) Directions of admixture LD and the LD in the parental populations.

There are paths between factors that together may lead to confounding of the relationship between the marker ($G_M$ and potentially $G_D$) and $Y$. By definition, global ancestry $Q$ is correlated with local ancestry $L$ ($\rho_{Q,L}$). When allele frequencies at $G_M$ differ between ancestral populations (variation in $q$), $Q$ is related to $G_M$, which results in the confounding path $G_M$-$L$-$Q$-$X$-$Y$ when testing the marker locus $G_M$. Similarly, when allele frequencies at $G_D$ differ, there will be the confounding path $G_D$-$L$-$Q$-$X$-$Y$ if we are testing the disease locus $G_D$ or even the marker locus $G_M$.
There are two components that affect the magnitude of the LD between $G_M$ and $G_D$ in an admixed population: the LD within parental populations ($D'$), and the admixture LD induced by differential frequencies between ancestral populations at both the marker and the disease locus. As shown in Figure 3-1 (a), the admixture LD is indicated as the path $G_M$-$L$-$G_D$ marked in red, and the LD within the parental populations is in black. When the directions of the LD are the same between parental populations, as indicated in Figure 3-1 (b), the reference alleles for $G_M$ and $G_D$ are determined so that the correlation between these two loci is positive within the admixed population. Similarly, a reference local ancestry population for $L$ can be defined such that $L$ is positively correlated with $G_M$ in the admixed population. Thus, the reference allele is the same in both parental populations. Given these reference definitions, when $L$ is negatively correlated with $G_D$ (left panel), there exists an overall negative correlation between $G_M$ and $G_D$ through the path $G_M$-$L$-$G_D$. In this situation, the admixture LD is in a different direction to the LD in the parental populations. This results in a corresponding reduction in the observed magnitude of the LD in the admixed population. In contrast, when $L$ is positively correlated with $G_D$, there is an overall positive correlation between $G_M$ and $G_D$ through the path $G_M$-$L$-$G_D$ (right panel). In this situation, the admixture LD is in the same direction as the LD in the parental populations, and the observed LD between $G_M$ and $G_D$ in the admixed population is enhanced. In addition to the scenarios discussed above, when the LD in the two ancestral populations is in opposite directions, the admixture LD will always enhance the LD in one ancestry while reducing the level of LD in the other. In summary, for a marker $G_M$, admixture LD has the potential to act as an additional confounder of the $G_M$-$Y$ effect.
For a disease locus $G_D$, there is no such potential.

Finally, individual local ancestries may modify the marginal effect at the marker locus because of the differential LD existing across ancestral populations. That is, within a study population the level of association between $G_D$ and $G_M$ varies across individuals as a function of $L$ (e.g. $D'_1 \neq D'_2$). Thus, $L$ acts as an effect modifier of the association between $G_M$ and $Y$.

3.2.2 Regression models

We use the following generalized linear models to investigate the efficiency of controlling for confounding by global ancestry and the potential impact on power of adjusting for local ancestry:

$g(\mu_Y) = \alpha + \beta_{G_M} G_M$ (1)

$g(\mu_Y) = \alpha + \beta_{G_M} G_M + Q \beta_Q$ (2)

$g(\mu_Y) = \alpha + \beta_{G_M} G_M + L \beta_L + Q \beta_Q$ (3)

Specifically, $g(\mu_Y)$ is the logit link, $Y$ is a dichotomized outcome with values of 0 (unaffected) and 1 (affected), and $\mu_Y$ is the probability of $Y=1$, conditional on the covariates included in the model. Alternative outcomes can be handled in a similar manner in the generalized linear framework. $G_M$ represents the number of variant alleles for each individual and $\beta_{G_M}$ is the corresponding marginal effect. A Wald or likelihood ratio test of $\beta_{G_M} = 0$ can be used to test association. For investigating heterogeneity, we compared model (3) to a model that also includes a $G_M \times L$ interaction term:

$g(\mu_Y) = \alpha + \beta_{G_M} G_M + L \beta_L + \beta_{int} G_M L + Q \beta_Q$ (4)

Here, we use a 2-df likelihood ratio test for the joint test of $\beta_{G_M} = 0$ and $\beta_{int} = 0$. This joint test has been shown to be nearly optimal across many different scenarios for main and interacting effects (Kraft et al.).

3.2.3 Simulations

We conduct simulations to investigate the performance of using the models defined above to control for confounding and to capture heterogeneity of effects.
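As an aside on the 2-df joint test used with model (4) in Section 3.2.2, the p-value can be obtained from the log-likelihoods of two nested fits. The sketch below is illustrative only: it assumes the maximized log-likelihoods of model (4) and of a null model that drops both the SNP and the interaction terms have already been obtained from some fitting routine (not shown), and the function name is hypothetical. It relies on the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2).

```python
import math


def lrt_2df_pvalue(loglik_null, loglik_full):
    """P-value of a likelihood ratio test with 2 degrees of freedom."""
    # Likelihood ratio statistic; guard against tiny negative values
    # caused by numerical error in the fits.
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    # For a chi-square variable with 2 df, P(X > x) = exp(-x / 2).
    return math.exp(-stat / 2.0)


# Hypothetical log-likelihoods, for illustration only.
p_joint = lrt_2df_pvalue(loglik_null=-110.0, loglik_full=-100.0)
```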
Simulations are based on the framework represented in the graphical model in Figure 3-1: we simulate data based on the confounding paths and assess the Type I error and power after adjusting for individual ancestries. To test the gain in power as well as the potential over-adjustment by local ancestry, we include simulation scenarios that model the admixture LD. In addition, we simulate data with and without LD differences between populations to gauge the impact of heterogeneity between ancestries. In all scenarios, we generate cases for a binary disease outcome ($Y$) using a logistic regression model incorporating the disease locus $G_D$ and the individual global ancestry $Q$, with a 50% average probability of being a case. For simplicity we assume a direct relationship of $Q$ to $Y$.

More specifically, in the simulations, we assume that individuals come from the admixture of two parental ancestries, $A_1$ and $A_2$. The steps for simulating the dataset are as follows:

a) Each individual $i$ is assigned a local ancestry representing the number of genetic copies from ancestry group $A_2$: 0 ($L_i=0$), 1 ($L_i=1$) or 2 ($L_i=2$). We generate 600 individuals within each ancestry group.

b) Assign allele frequencies at the disease locus ($G_D$) within each parental ancestry: $p_1$ and $p_2$.

c) Assign allele frequencies at the marker locus ($G_M$) within each parental ancestry: $q_1$ and $q_2$.

d) Assign LD between $G_D$ and $G_M$ within each parental ancestry: $D'_1$ and $D'_2$.

e) Within each ancestry group, calculate the haplotype frequencies at the disease ($G_D$) and the marker ($G_M$) loci based on the assigned allele frequencies ($p$ and $q$) and LD ($D'$) within each parental ancestry.

f) Given the haplotype frequencies within each parental ancestry, generate two haplotypes for each individual conditional on their local ancestry.
That is, two haplotypes from $A_1$ for each individual with $L_i=0$, one haplotype from $A_1$ and one haplotype from $A_2$ for each individual with $L_i=1$, and two haplotypes from $A_2$ for each individual with $L_i=2$. From this, we obtain the genotypes at both the marker ($G_M$) and the disease ($G_D$) loci.

g) For simulating global ancestry, we generate $Q$ conditional on $L$ for each individual from the regression model $Q_i = \bar{L}/2 + 0.2(L_i - \bar{L}) + \varepsilon_i$, with $\varepsilon_i \sim N(0, 0.3)$. Then $Q$ is truncated between 0 and 1. In this way, the generated $Q$ is related to local ancestry $L$ with a correlation around 0.5 to reflect the observed distribution among the Hispanic samples in the CHS.

h) We then probabilistically generate case-status for all 1,800 samples using a logistic regression model incorporating the disease locus $G_D$ and the individual global ancestry $Q$. Variables in the logistic regression model are mean centered and there is a baseline risk of 50%, thus resulting in approximately equal numbers of cases and controls for each replicate.

Note that, in the simulations, instead of directly simulating the observed LD in the admixed populations, we generate the two components of the LD between $G_M$ and $G_D$ (as described in the graphical model) separately: the LD within each parental ancestry is assigned as $D'_1$ and $D'_2$; for the admixture LD, following its definition, we assign different allele frequencies to the parental ancestries $A_1$ and $A_2$ to generate a simulated induced admixture LD.

3.2.4 Scenarios

Across all the simulation scenarios (Table 3-1), we vary the population-specific parameters for each parental ancestry ($p$, $q$, and $D'$) and the causal model parameters ($\beta_Q$ and $\beta_{G_D}$). In scenario A, there is no genetic causal effect ($\beta_{G_D} = \log(1)$) but a strong global ancestry effect ($\beta_Q = \log(3.0)$).
The allele frequency is fixed at 0.3 in ancestral population $A_1$ at both $G_M$ and $G_D$, with the allele frequency in ancestral population $A_2$ varying from 0.3 to 0.7 to simulate the induced admixture LD. The LD is the same ($D'=0.9$) within each ancestral population. In this scenario, we simulate different allele frequencies between ancestral populations to investigate the efficiency of control for confounding by individual ancestries in models (2) Y ~ G + Q and (3) Y ~ G + Q + L.

In scenario B, we simulate the genetic causal effect ($\beta_{G_D} = \log(1.2)$) when no effect of $Q$ on $Y$ ($\beta_Q = \log(1)$) is present. We simulate a positive correlation (LD) between $G_M$ and $G_D$ in each ancestral population: $D'_1 = D'_2 = 0.9$. The induced admixture LD due to differential allele frequencies between ancestries is simulated as described in scenario A and, in addition, we simulate induced admixture LD both in the same direction as and in the opposite direction to the LD in the original ancestral populations.

Finally, in scenario C, we simulate a genetic causal effect ($\beta_{G_D} = \log(1.2)$) as well as an effect of $Q$ on $Y$ ($\beta_Q = \log(3.0)$), and we simulate the same allele frequency across ancestral populations ($p=q=0.3$). In this scenario, we vary the $D'$ difference between populations from 0 to 1.8 ($D'_1=0.9$, $D'_2$ varies from -0.9 to 0.9) to gauge the impact of heterogeneity.

Table 3-1 Simulated scenarios A-C.

| Scenario | Allele freq difference^a | SNP effect $\beta_{G_D}$ | Q effect $\beta_Q$ | $D'$ difference (heterogeneity) |
|---|---|---|---|---|
| A | [0,0.4] | None | log(3.0) | None |
| B | [0,0.4] | log(1.2) | None | None |
| C | None | log(1.2) | log(3.0) | [0,1.8] |

a Allele frequencies at both the disease and the marker loci.

For each simulated scenario, we create 1,800 individuals with an equal number of individuals ($N_L = 600$) within each local ancestry group ($L = \{0, 1, 2\}$, where $L$ indicates the number of copies from ancestral population 1). Conditional on $L$ and the corresponding specified parameters for allele frequency, LD and risk, we generate $G_D$, $G_M$, and $Q$.
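Steps (a) through (h) of Section 3.2.3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results: the function names are hypothetical, D' is converted to D with the standard D = D' x Dmax scaling, the 0.3 in N(0, 0.3) is read as a standard deviation, and the mean local ancestry is taken as 1 because the three local ancestry groups have equal size.

```python
import math
import random


def haplotype_freqs(p, q, d_prime):
    """Frequencies of the four (G_D allele, G_M allele) haplotypes (step e)."""
    if d_prime >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d = d_prime * d_max  # assumed standard scaling D = D' * Dmax
    return {(1, 1): p * q + d, (1, 0): p * (1 - q) - d,
            (0, 1): (1 - p) * q - d, (0, 0): (1 - p) * (1 - q) + d}


def draw_haplotype(freqs, rng):
    """Draw one haplotype; weights are clamped at 0 against rounding error."""
    haps = list(freqs)
    return rng.choices(haps, weights=[max(freqs[h], 0.0) for h in haps])[0]


def simulate(n_per_group, anc1, anc2, beta_gd, beta_q, rng):
    """anc1 and anc2 are (p, q, D') tuples for parental ancestries A1 and A2.

    Returns rows of [L, G_D, G_M, Q, case status].
    """
    f1, f2 = haplotype_freqs(*anc1), haplotype_freqs(*anc2)
    rows = []
    for loc in (0, 1, 2):  # step a: copies of A2 ancestry at the locus
        for _ in range(n_per_group):
            # step f: `loc` haplotypes from A2, the remainder from A1
            haps = [draw_haplotype(f2 if k < loc else f1, rng) for k in range(2)]
            g_d = haps[0][0] + haps[1][0]
            g_m = haps[0][1] + haps[1][1]
            # step g: mean L is 1 here; 0.3 is assumed to be an SD
            q_anc = 1 / 2 + 0.2 * (loc - 1) + rng.gauss(0.0, 0.3)
            rows.append([loc, g_d, g_m, min(1.0, max(0.0, q_anc))])
    # step h: mean-centered logistic model with a 50% baseline risk
    mean_gd = sum(r[1] for r in rows) / len(rows)
    mean_q = sum(r[3] for r in rows) / len(rows)
    for r in rows:
        eta = beta_gd * (r[1] - mean_gd) + beta_q * (r[3] - mean_q)
        r.append(1 if rng.random() < 1 / (1 + math.exp(-eta)) else 0)
    return rows


# One replicate resembling scenario A with the largest frequency difference.
replicate = simulate(600, (0.3, 0.3, 0.9), (0.7, 0.7, 0.9),
                     beta_gd=math.log(1.0), beta_q=math.log(3.0),
                     rng=random.Random(2012))
```

Each replicate would then be analyzed with models (1) through (4), and Type I error or power would be taken as the fraction of replicates with a significant test.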
We then probabilistically generate case-status for all 1,800 individuals using a logistic regression model incorporating the disease locus $G_D$ and the individual global ancestry $Q$. Variables in the logistic regression model are mean centered and there is a baseline risk of 50%, thus resulting in approximately equal numbers of cases and controls for each replicate. This simulation framework does not directly simulate potential confounding or heterogeneity by $L$. Rather, potential confounding and heterogeneity are induced by simulating haplotypes, global ancestry and disease status conditional on local ancestries, as reflected in our graphical framework. Specifically, potential confounding is induced via the path $G_M$-$L$-$Q$-$Y$. The Type I error and empirical power are calculated as the proportion of significant tests ($\alpha = 0.05$) over 10,000 replicates.

3.2.5 USC Children's Health Study (CHS)

The CHS is an ongoing cohort study investigating environmental and genetic influences on asthma in children. The study design is discussed in detail elsewhere (Li et al.; McConnell et al.; Navidi et al.). The CHS GWAS is a nested case-control study from the ongoing longitudinal CHS cohort with approximately equal numbers of cases and controls for non-Hispanic whites and Hispanics. All CHS subjects and their parents gave informed consent and the study was approved by the University of Southern California Institutional Review Board.

In this study, we include a total of 2,839 samples from two self-reported ethnic groups: 1,396 non-Hispanic Whites and 1,171 Hispanics. Among non-Hispanic samples there are 595 cases and 801 controls, and there are 532 cases and 639 controls among Hispanics. We analyze the CHS data stratified by ethnicity and in a combined sample, assuming that the non-Hispanic white individuals all have two copies of European local ancestry at each location.
Genotyping of these samples was performed at the USC Genome Center utilizing both the Illumina HumanHap550 and the Illumina Human 610-Quad BeadChips, and the analysis is conducted on 437,599 autosomal SNPs passing a stringent quality control procedure.

We perform several genome-wide scans in the CHS samples with additional covariates (age, gender, community of residence, and self-reported ethnicity). We estimate individual local ancestry $L$ through the program HAPMIX (Price, Tandon, et al.). HAPMIX requires two parental reference populations and estimates individual local ancestry from dense genotyping chips. We focus our investigation on the confounding and heterogeneity due to the structure defined by a European/Amerindian admixture. As the Amerindian component has greatest similarity with Asian populations, and since a homogeneous or non-admixed Amerindian population is unavailable in HapMap, we used the homogeneous HapMap phase III CEPH and East Asian (Han Chinese in Beijing, China and Japanese in Tokyo, Japan) populations as the two reference populations for local admixture estimation ("The International Hapmap Project"; Altshuler et al.). The average of all local ancestry estimates across the genome for each individual is used to estimate global ancestry ($Q_i = \frac{1}{2M} \sum_{m=1}^{M} L_{im}$) for each individual.

3.3 Results

3.3.1 Simulation result

3.3.1.1 Confounding

In scenario A (Figure 3-2), at the marker locus $G_M$, when the allele frequency difference between ancestral populations is greater than 0.1, the crude model has a substantially elevated Type I error rate while models (2) Y ~ G + Q and (3) Y ~ G + Q + L efficiently control for the confounding. The pattern is the same when testing the disease locus $G_D$.

In scenario B, there is no confounding path simulated and all models have the correct test size (not shown). Reflecting these patterns, adjustment by global ancestry (when needed) results in an unbiased effect estimate.
In contrast, there is very little impact on the effect estimate from adjustment for local ancestry. However, when the induced LD due to differential allele frequency between ancestries is in the same direction as the LD from the parental ancestries (Figure 3-3 (a)), adjusting for local ancestry results in a slight loss in power. When the induced LD is in the opposite direction to the LD from the parental ancestries (Figure 3-3 (b)), additional adjustment for local ancestry results in a slight increase in power. However, this decrease/increase in power is negligible for allele frequency differences <0.1. When testing the disease variant directly, as opposed to a marker, the pattern is the same as that shown in Figure 3-3 (a).

Figure 3-2 Type I error rates with and without control for confounding at the marker locus ($G_M$) in scenario A.

(a) Induced LD is in the same direction as the LD in the parental ancestries.
Figure 3-3 Effect of adjustment by global and local ancestries on power in scenario B.

(b) Induced LD is in a different direction to the LD in the parental ancestries.
Figure 3-3 Continued.

3.3.1.2 Heterogeneity

Figure 3-4 shows the effect of including the $G_M \times L$ interaction term by comparing model (3) Y ~ G + Q + L and model (4) Y ~ G + Q + L + GL. For tests at the marker locus $G_M$, when the LD between $G_M$ and $G_D$ is similar in the ancestral populations ($D'$ difference < 0.7), there is a reduction in power for the two degree of freedom test of the joint effect of $G_M$ and $G_M \times L$ from model (4). As the difference in LD increases (>0.7), model (4) has similar or greater power than model (3). For testing a disease variant ($D'$ difference = 0), the loss in power is 0.10 for the simulated scenario.

Figure 3-4 Comparison of power in scenario C when there is heterogeneity due to differential LD between ancestries.
3.3.2 Results from the Children's Health Study

Figure 3-5 displays the local ancestry estimates across chromosome 4 for five individuals sampled across a range of global coefficients of ancestry ($Q = \{0, 0.25, 0.5, 0.75, 1.0\}$). For the individuals with European global ancestry not equal to 0% or 100%, there is substantial diversity as to which regions contain more than zero copies of local European ancestry. Averaging the estimated local ancestries at each SNP location across all Hispanic individuals shows that, on average, each location has slightly more than one copy of European ancestry (bottom panel in Figure 3-5). The average of all local ancestry estimates across the genome across all the individuals yields an estimate of average European ancestry of around 1.35 copies.

Figure 3-5 Local ancestry along chromosome 4 for selected CHS Hispanic samples. Global ancestry $Q$ represents the estimated European ancestry proportion.

When investigating the impact of confounding in the CHS GWAS, as shown in Figure 3-6, a crude analysis Y~G among the combined Hispanic and non-Hispanic White samples results in a large overdispersion ($\lambda_1 = 1.27$) that is reduced substantially with adjustment by global or both global and local ancestries in models (2) Y~G+Q ($\lambda_2 = 1.02$) and (3) Y~G+Q+L ($\lambda_3 = 1.02$). However, the overdispersion parameter is a summary across the entire genome. In the CHS data, for example, about 50% of the markers have a smaller p-value from model (3) Y~G+Q+L compared to model (2) Y~G+Q. About 10% of the SNPs found to be potentially interesting (p<0.05) from an analysis with both global and local ancestry are not noteworthy with an analysis adjusting only for global ancestry. In terms of effect estimates, among those SNPs potentially interesting (p < 0.05) from an analysis adjusting for global ancestry only, additional adjustment for local ancestry results in a 10% or greater change in the effect estimate for over 9% of the SNPs.
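A genome-wide overdispersion summary of this kind can be computed directly from a vector of p-values. The sketch below assumes that λ is the usual genomic-control inflation factor, that is, the median 1-df chi-square statistic implied by the p-values divided by its expected median under the null (about 0.455); the text does not spell out the estimator, so this is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Median implied 1-df chi-square statistic over its null median."""
    z = NormalDist()
    # A two-sided p-value p corresponds to the 1-df statistic z(p/2)^2.
    chi2 = [z.inv_cdf(p / 2.0) ** 2 for p in p_values]
    null_median = z.inv_cdf(0.75) ** 2  # median of a 1-df chi-square
    return median(chi2) / null_median


# Evenly spread p-values mimic a well-calibrated scan (illustrative only).
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform_p)
```

For evenly spread p-values the ratio is close to 1; values well above 1, such as the 1.27 reported for the crude model, indicate inflation of the test statistics.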
Figure 3-6 Q-Q plots for models (1)-(4) among combined samples.

Figure 3-7 shows the p-values from a GWAS analysis of the combined non-Hispanic white and Hispanic populations in the CHS using models (3) and (4). The most notable SNP, rs10119122 ($p = 4.5\times10^{-8}$) from model (3), remains noteworthy with the 2-df test from model (4). There is an additional SNP (rs10519951) that lacks association with model (3) ($p = 0.50$), but is noteworthy with model (4), as indicated by a much smaller p-value ($p = 8.5\times10^{-7}$).

Figure 3-7 Analysis results across models (3) and (4) for combined samples.

Table 3-2 provides the SNP effect estimates and corresponding p-values from both models for these two SNPs from the ethnic-specific and combined analyses. In addition, for model (4) in the Hispanic only analysis and the combined analysis, we present the expected genetic effect estimate within each local ancestry stratum by stratifying individuals by their estimated local ancestry at each SNP and using the stratum-specific effect estimates from model (4). For SNP rs10119122 there are similar allele frequencies across the two populations and there is very little heterogeneity indicated from model (4) in Hispanics. Thus, the marginal effect estimate from model (3) in the Hispanic only analysis ($\beta_{G_M} = -0.30$) is similar to the non-Hispanic estimate ($\beta_{G_M} = -0.36$). It is also interesting to note that in the non-Hispanic white only analysis rs10119122 has an effect estimate of $\beta_{G_M} = -0.36$. For the Hispanic only analysis within the stratum of individuals with two copies of European ancestry, the effect estimate is almost identical, $\beta_{G_M} = -0.37$.

In contrast, SNP rs10519951 has very little evidence for an effect in the non-Hispanic White analysis from model (3) ($\beta_{G_M} = 0.19$, $p = 0.07$).
In the Hispanic only analysis, rs10519951 has a sizeable inverse effect on asthma from model (3) ($\beta_{G_M} = -0.29$, $p = 5.2\times10^{-3}$) and further evidence of heterogeneity by local ancestry from model (4); specifically, the test for only the interaction term has a p-value of $1.6\times10^{-5}$. Also, the estimates of effect are comparable between the non-Hispanic whites only ($\beta_{G_M} = 0.19$) and the Hispanics only within the stratum of individuals having two copies of European ancestry ($\beta_{G_M} = 0.16$). In the Hispanic only analysis within the stratum of individuals carrying 0 copies of European ancestry, the estimate is $\beta_{G_M} = -1.12$. The contrast in estimates across strata is reflected in the more significant result from model (4) in the combined sample ($p = 8.5\times10^{-7}$).

Table 3-2 P-value and effect estimate for selected markers across ethnic groups and models.

| Marker | Local strata^a | non-Hispanic White: N^b | non-Hispanic White: Freq | non-Hispanic White: Model 3^c $\beta_{G_M}$ (p-value) | Hispanics: N | Hispanics: Freq | Hispanics: Model 3 $\beta_{G_M}$ (p-value) | Hispanics: Model 4^d $\beta_{G_M}$ (p-value) | Combined^e: N | Combined^e: Freq | Combined^e: Model 3 $\beta_{G_M}$ (p-value) | Combined^e: Model 4 $\beta_{G_M}$ (p-value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs10119122 | L≈0 | 0 | - | - | 12 | 0.46 | - | 0.03 | 12 | 0.46 | - | 0.04 |
| rs10119122 | L≈1 | 0 | - | - | 410 | 0.60 | - | -0.17 | 410 | 0.60 | - | -0.17 |
| rs10119122 | L≈2 | 1372 | 0.66 | - | 721 | 0.64 | - | -0.37 | 2093 | 0.65 | - | -0.37 |
| rs10119122 | All | 1372 | 0.66 | -0.36 ($2.1\times10^{-5}$) | 1143 | 0.62 | -0.3 ($1.8\times10^{-3}$) | - ($4.7\times10^{-3}$) | 2515 | 0.64 | -0.34 ($4.5\times10^{-8}$) | - ($1.6\times10^{-7}$) |
| rs10519951 | L≈0 | 0 | - | - | 119 | 0.34 | - | -1.12 | 119 | 0.34 | - | -1.17 |
| rs10519951 | L≈1 | 0 | - | - | 531 | 0.26 | - | -0.48 | 531 | 0.26 | - | -0.49 |
| rs10519951 | L≈2 | 1386 | 0.19 | - | 510 | 0.19 | - | 0.16 | 1896 | 0.19 | - | 0.18 |
| rs10519951 | All | 1386 | 0.19 | 0.19 ($7.1\times10^{-2}$) | 1160 | 0.24 | -0.29 ($5.2\times10^{-3}$) | - ($1.6\times10^{-5}$) | 2546 | 0.21 | -0.05 ($5.0\times10^{-1}$) | - ($8.5\times10^{-7}$) |

a Estimated individual local ancestry. $\hat{L}$ is rounded into three categorical groups: 0 ($\hat{L} \le 0.5$), 1 ($0.5 < \hat{L} \le 1.5$), and 2 ($\hat{L} > 1.5$).
b Sample size within each local stratum.
c Effect estimate $\beta_{G_M}$ of the SNP marginal effect followed by the corresponding p-value in parentheses from Model (3): Y~G+Q+L.
d The expected effect estimate $\beta_{G_M}$ of the SNP effect within each local stratum, followed by the 2-df test p-value in parentheses, from Model (4): Y~G+L+GL+Q.
e Combined non-Hispanic White and Hispanic samples for analysis.

In contrast, SNP rs10519951 has very little evidence for an effect in the non-Hispanic white analysis from either model (4) ($\beta_{G_M} = 0.08$, $p = 0.38$) or model (5) ($p = 0.02$), albeit there is some evidence of heterogeneity. In the Hispanic only analysis, rs10519951 has a sizeable inverse effect on asthma from model (4) ($\beta_{G_M} = -0.29$, $p = 5.2\times10^{-3}$) and further evidence of heterogeneity by local ancestry from model (5) ($p = 1.6\times10^{-5}$). For both the non-Hispanic white and Hispanic analyses, there is a larger inverse effect for those individuals with zero copies of European ancestry. This similarity in heterogeneity by local ancestry is reflected in the significant result from model (5) in the combined sample ($p = 1.9\times10^{-7}$).

3.4 Discussion

When confounding arises through global ancestry via a path that links an external factor to the marker being evaluated, global ancestry alone can control for the confounding. This assumes that the estimated global ancestry accurately captures the underlying factor. Previous studies have argued that adjusting for local ancestry is necessary for controlling for confounding (Kang et al.; Qin et al.; X. Wang et al.); however, these papers simulated local ancestry as a strict confounder and did not allow for induced LD in admixed populations. Our simulations demonstrate that the impact of adjustment for local ancestry is more nuanced within admixed populations. When the admixture LD is in a different direction from the LD in the parental ancestries, there is a reduction in the magnitude of the LD in the admixed population, and additional adjustment for local ancestry can increase the power to detect the true association at the marker locus.
But this potential gain in power comes with risk: when the admixture LD is in the same direction as the LD within the ancestral population, adjustment for local ancestry will result in over-adjustment.

To quantify the occurrence of the simulated scenario in real populations, we examine data from the ENCODE regions from the HapMap ENCODE Genotyping Project ("The Encode (Encyclopedia of DNA Elements) Project"). For comparison of the two relevant ancestral populations, we estimate allele frequency and D' within the CEU and East Asian samples. To mimic the potential LD between a genotyped SNP and an unobserved disease variant as in a genome-wide scan, we limit our comparison to only SNPs genotyped in the CHS study (treated as marker loci) vs. those SNPs not genotyped but contained within ENCODE (treated as disease loci). Figure 3-8(a) shows the joint distribution of D' within the Asian and the CEU population for the ENCODE regions. Across all the estimated pairwise D', about 38% are in the opposite direction (the gray areas). For the 62% of markers that have LD in the same direction in parental populations (the white areas in Figure 3-8(a)), the direction of induced LD will determine if there is a resulting over-adjustment or gain in power. The joint distribution of the direction of the D' within parental ancestries and the direction of the induced LD is shown in Figure 3-8(b) for these variants (the variants from the white regions in Figure 3-8(a)). In Figure 3-8(b), the x-axis is the direction of the D' within parental ancestries (positive or negative); the y-axis is the direction of the induced LD. When the direction of the allele frequency difference between ancestries at the marker locus is the same as that at the disease locus (e.g.
at both loci, the allele frequency in European ancestry is higher than that in Amerindian ancestry), we assign a positive sign for the induced LD; when the allele frequency differences are in opposite directions between the marker and the disease loci, we assign a negative sign for the induced LD. Among these loci, the induced LD and the LD within the ancestral populations are in the same direction for about 57% of them.

Figure 3-8 Plausibility of scenario B in the ENCODE regions. (a) Direction of the LD in the parental ancestries. (b) Direction of the induced LD and the LD in the parental ancestries.

When investigating heterogeneity of effect estimates by local ancestry, there is also a potential loss in power by testing both the SNP main effect and the interaction via a 2-df test (Figure 3-4). In the ENCODE regions, about 30% of the estimated differences in D' between the populations are greater than 0.7 (Figure 3-9). Our simulation results demonstrate that model (4) with the SNP-local ancestry interaction has greater power than the conventional model for D' differences above 0.7. Thus, one may expect an increase in power for about 30% of the SNPs, with the remaining 70% having no change or a slight reduction in power. Given this tradeoff, a GWAS for discovery using only model (4) may not be the most advantageous approach. However, in practice most investigators will first perform an analysis without an interaction term. Subsequent analyses with the interaction term included offer the potential to uncover previously unidentified regions. In such a two-step approach, one would need to consider the impact on type I error, but for discovery and further follow-up such impact may be negligible.

Figure 3-9 Distribution of the D' difference between the CEU and the Asian populations in the ENCODE regions.

The graphical model we presented helps to construct and interpret our empirical investigation of the CHS data.
Following the graphical model presented, we view global ancestry as a composite of all local ancestry estimates along the genome, i.e. the L-Q path. Thus, we estimate global ancestry as an average of all local ancestry estimates. This is in contrast to the more commonly used approach of estimating global ancestry using selected ancestry informative markers (AIMs) and the STRUCTURE program (Falush, Stephens and Pritchard "Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies"; Falush, Stephens and Pritchard "Inference of Population Structure Using Multilocus Genotype Data: Dominant Markers and Null Alleles"; Hubisz et al.; Pritchard, Stephens and Donnelly), or EIGENSTRAT (Price, Patterson, Plenge, et al.). In the graphical modeling framework, these approaches could be represented with a box for observed AIMs pointing directly to Q. For comparison, we use the HapMap Phase III (release 2) samples as reference populations and estimate global ancestry using the program STRUCTURE with 1,637 selected AIMs (Seldin et al.; Shtir et al.; Smith et al.; Tian et al.). The estimated individual global ancestries are highly correlated (R² = 0.96) with the global ancestry values calculated by averaging the estimated individual local ancestry across the genome (437,599 loci in total).

For the CHS GWAS, the most notable SNP (rs10119122) from the marginal test of the SNPs remains noteworthy with the 2-df test from model (4). In contrast, SNP rs10519951, a SNP in NR3C2, only has a significant p-value from model (4) (p = 8.5×10⁻⁷ in the combined samples). Although rs10519951 does not reach the conventional cutoff for determining genome-wide significance for main effects (i.e. α = 5×10⁻⁸), the genome-wide significance level for the 2-df test that involves correlated local ancestries across the genome is unclear and is an active area of research.
Such significance levels may depend on the specific admixed population investigated, since the distribution of local ancestry for each individual across all locations in the genome will depend upon the sample. In this case a permutation test for determining significance may be required. Whether the top SNPs are strictly significant or not, there is a clear potential for additional information to be gained from including local ancestry in a test of heterogeneity. Overall for the CHS, and consistent with results from ENCODE, the interaction model results in smaller p-values for 35% of the SNPs across the genome. Notably, for those SNPs with a smaller p-value from model (4), this change is often substantial, suggesting that a great deal of additional information may be captured by jointly considering the main and interaction terms.

When testing the disease variant, adjusting for local ancestry most often results in a loss of power from over-adjustment when the allele frequency is different between ancestries. Likewise, when investigating a measured causal variant in an admixed population, there will be no influence of differential LD between the marker and the causal variant. Thus, the inclusion of a SNP by local ancestry interaction term will not capture any additional information, and stratified estimates across local ancestry strata should be similar. This offers a potential approach to leverage differential LD patterns in an admixed population to help identify causal variants when performing fine-scale mapping or sequencing studies (Stacey et al.).

In addition to capturing the heterogeneity of the SNP effect among admixed populations, it is possible that this observed effect is induced by another genetic or environmental factor that drives the observed effect modification and is correlated with self-identified ethnicity and thus related to local ancestry via global ancestry (i.e., X-Q-L).
In order to investigate the source (environmental or genetic) of the heterogeneity, one can perform further analyses within strata by individual global ancestry and, if available, the strata of self-reported ethnicity. For SNP rs10519951, Table 3-3 shows that the heterogeneity captured by local ancestry is attenuated when stratifying by global ancestry or self-identified ethnicity. These results suggest that this particular observed heterogeneity is most likely due to local genetic structure and not global genetic or environmental differences.

Table 3-3 Investigation of heterogeneity for SNP rs10519951 in the Children's Health Study combined samples.
Stratifier | Strata | N | Allele freq | beta | p-value
All | Combined | 2546 | 0.21 | -0.05 | 5.0×10⁻¹ (a)
L (b) | L≈0 (Asian) | 119 | 0.34 | -1.17 | 8.5×10⁻⁷
L (b) | L≈1 | 531 | 0.26 | -0.49 |
L (b) | L≈2 (European) | 1896 | 0.19 | 0.18 |
Q (b) | Q≈0.25 (Asian) | 21 | 0.43 | -0.76 | 2.9×10⁻² (d)
Q (b) | Q≈0.5 | 480 | 0.30 | -0.35 |
Q (b) | Q≈0.75 (European) | 2045 | 0.19 | 0.06 |
E (c) | Hispanics | 1160 | 0.24 | -0.29 | 5.6×10⁻³ (d)
E (c) | Non-Hispanic White | 1386 | 0.19 | 0.19 |
a Conventional analysis testing of the SNP main effect only.
b Estimated individual ancestry is rounded into three categorical groups when presenting the number of samples and the allele frequency within each stratum. $\hat{L}$: 0 ($\hat{L} \le 0.5$), 1 ($0.5 < \hat{L} \le 1.5$), and 2 ($\hat{L} > 1.5$). Q: 0.25 (Q ≤ 0.33), 0.5 (0.33 < Q ≤ 0.66), and 0.75 (Q > 0.66).
c Self-reported ethnicity.
d 2-df test of the SNP by strata interaction and the SNP marginal effect.

We have demonstrated that one needs to consider the impact of adjustment by local ancestry in addition to the common practice of adjusting for global ancestry. While the adjustment for local ancestry reflects the induced admixture LD within admixed populations, the impact of inclusion of local ancestry depends upon the LD patterns in the ancestral populations.
Furthermore, we have also demonstrated the potential for a 2-df test of SNP main effect and SNP by local ancestry interaction to increase power when there is substantial differential LD between ancestral populations. We realize that for most GWAS utilizing admixed populations, investigators will first scan the genome with a marginal test of association. Thus, we view analyses with the interaction term as secondary follow-up to uncover previously unidentified regions with substantial heterogeneity of SNP effect by local ancestry.

Chapter 4 Mapping by admixture linkage disequilibrium

4.1 Introduction

4.1.1 Concept for admixture mapping

Mapping by admixture linkage disequilibrium is also known as admixture mapping. It is a test for association of the disease with the ancestry, conditioning on the admixture. When genetic risk variants differ in frequency between ancestral populations, cases are more likely to inherit alleles derived from the ancestral population that carries more disease susceptibility alleles (Patterson et al.). As a result, in regions near the disease locus, cases tend to have a higher ancestry proportion from the population in which the disease is more prevalent. The major steps for admixture mapping are inferring ancestral origins at each locus and testing for excess ancestry proportions among cases. The number of markers required for admixture mapping depends on the number of generations since admixture and the information content for ancestry of the markers (a function of allele frequencies in the ancestral populations) (McKeigue).

4.1.2 Testing for excess ancestry proportions

Admixture mapping using family data tests for excess transmission of the genome that derives from one ancestral population (McKeigue; Zheng and Elston). It is a test of association conditional on parental admixture. Near the disease locus, the ancestry is skewed toward one of the ancestral populations given the ancestry of the parents.
Admixture mapping using unrelated individuals can be conducted among either case-only or case-control samples (Montana and Pritchard). The main idea for the case-control study design is to test whether the mean of the local ancestry among the cases significantly diverges from the mean of the local ancestry among the controls at each locus:

$$T_{CC} = \frac{(\bar{L}_d - \bar{L}_c) - 2(\bar{Q}_d - \bar{Q}_c)}{SD(\bar{L}_d - \bar{L}_c)}$$

$$\bar{L}_d = \frac{1}{N_d}\sum_{i}^{N_d} L_{i,d}, \qquad \bar{L}_c = \frac{1}{N_c}\sum_{i}^{N_c} L_{i,c}, \qquad \bar{Q}_d = \frac{1}{N_d}\sum_{i}^{N_d} Q_{i,d}, \qquad \bar{Q}_c = \frac{1}{N_c}\sum_{i}^{N_c} Q_{i,c}$$

In this equation, $N_d$ and $N_c$ represent the number of cases and controls respectively. $L_{i,d}$ represents the local ancestry estimate for individual i among the cases and $Q_{i,d}$ represents the global ancestry estimate for the same individual. $\bar{L}_d$ and $\bar{L}_c$ represent the average local ancestry at the tested locus among the cases and controls respectively. $\bar{Q}_d$ and $\bar{Q}_c$ represent the averaged global ancestry across the cases and controls respectively. Note that environmental and social factors may also differ between ancestral populations (Risch et al.). For a case-control study design, an overall difference in ancestry (a difference in global ancestry) could lead to confounding of the association between local ancestry and disease. Therefore, the test statistic $T_{CC}$ tests whether the difference in local ancestry between cases and controls significantly diverges from the difference in global ancestry between cases and controls.

The case-only study design tests whether the mean local ancestry significantly diverges from the genome-wide mean (global ancestry) among the set of cases:

$$T_{CO} = \frac{\bar{L}_d - 2\bar{Q}_d}{SD(\bar{L}_d)}$$

Case-only analysis uses the global ancestry from the same set of samples as the "controls" to control for the potential confounding that may arise in the case-control admixture study design. However, there are issues that arise when using the case-only study design in an admixture scan.
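As an illustration only, the two statistics above can be sketched in a few lines of Python. The function names and toy inputs are hypothetical, and we assume that SD(·) denotes the standard error of the sample mean (or of the difference of two independent sample means), which the text does not state explicitly.

```python
from math import sqrt
from statistics import mean, variance


def t_case_control(l_cases, q_cases, l_controls, q_controls):
    """Sketch of T_CC: local-ancestry difference minus twice the global-ancestry difference.

    Assumption: SD(L_d - L_c) is taken as the standard error of the difference
    between two independent sample means.
    """
    numerator = (mean(l_cases) - mean(l_controls)) - 2 * (mean(q_cases) - mean(q_controls))
    se = sqrt(variance(l_cases) / len(l_cases) + variance(l_controls) / len(l_controls))
    return numerator / se


def t_case_only(l_cases, q_cases):
    """Sketch of T_CO: mean local ancestry among cases versus twice their mean global ancestry.

    Assumption: SD(L_d) is taken as the standard error of the mean local ancestry.
    """
    numerator = mean(l_cases) - 2 * mean(q_cases)
    se = sqrt(variance(l_cases) / len(l_cases))
    return numerator / se


# Toy data: local ancestry in copies (0-2), global ancestry as a proportion (0-1).
cases_l, cases_q = [1, 2, 2, 1], [0.5, 0.5, 0.5, 0.5]
controls_l, controls_q = [0, 1, 2, 1], [0.5, 0.5, 0.5, 0.5]
print(t_case_control(cases_l, cases_q, controls_l, controls_q))
print(t_case_only(cases_l, cases_q))
```

Both statistics are zero when local ancestry at the locus matches what global ancestry predicts, which is the null situation described in the text.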
For example, if the local ancestry estimate is biased at certain regions among both cases and controls (elevated ancestry proportions towards one parental ancestry), a case-only analysis will lead to spurious results.

4.1.3 Advantages of admixture mapping

First of all, linkage disequilibrium decays rapidly with distance. Genome-wide association studies using background LD (LD among ancestral populations) require a relatively dense set of markers (Gabriel et al.). In contrast, admixture mapping takes advantage of the extended LD within the admixed populations. Considering the size of haplotypes and the size of ancestry blocks in an admixed population, admixture mapping using admixture LD requires fewer markers for testing regions with excess ancestry proportions from one population. Early studies focused on the use of relatively evenly spaced ancestry informative markers (AIMs) across the genome for capturing each region of local ancestry and for testing association. In recent studies, admixture mapping with randomly selected markers has become an alternative to mapping with AIMs. A second advantage of admixture mapping is that for a rare disease, it is more efficient to use a case-only analysis than a case-control analysis (Hoggart, Shriver, et al.). With the establishment of reliable approaches for local ancestry estimation, there have been many recent papers that develop models to incorporate local ancestry information in genetic association studies with admixed samples (Chanock; Ding et al.; Pasaniuc, Zaitlen, et al.; Shriner, Adeyemo and Rotimi; Zhu et al.; Fejerman et al.).

4.1.4 Purpose of this study

We propose a general regression framework to perform admixture mapping for both case-only and case-control study designs. We then test the performance of these proposed models with a comparison to existing approaches. Finally, we apply the various models to a real data set consisting of African Americans from the Multiethnic Cohort Study of prostate cancer.
For comparison to work presented in Chapter 3, we discuss the performance of the models leveraging admixture information relative to a model that incorporates a SNP by local ancestry interaction.

4.2 Materials and Methods

4.2.1 Regression models

4.2.1.1 Proposed models for admixture mapping in regression framework

For the case-control analysis for admixture mapping, we can rewrite the test statistic as:

$$T_{CC} = \frac{(\bar{L}_d - 2\bar{Q}_d) - (\bar{L}_c - 2\bar{Q}_c)}{SD(\bar{L}_d - \bar{L}_c)}$$

Therefore, the t-test for the case-control analysis for admixture mapping is mathematically identical to a regression model $L_i - 2Q_i = \alpha + \beta_Y Y_i + \varepsilon_i$ with a test of whether the coefficient for Y (case-control status) is significantly different from zero. Changing this model to the more general regression form used for testing association between ancestry and disease outcome, we propose a regression model ("CCreg") to implement the original idea of admixture mapping for the case-control study design:

$$\mathrm{logit}(Y_i) = \alpha + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$$

$$H_0: \beta_L = 0 \qquad H_a: \beta_L \neq 0$$

In this model, $L_i$ represents the estimated local ancestry in terms of the number of ancestral chromosomal segments (in the range from 0 to 2) for individual i at the tested locus, and $Q_i$ represents the estimated global ancestry as an average of the local ancestry across the genome divided by 2 (in the range from 0 to 1) for the same individual. $Y_i$ is an indicator of disease status (cases vs. controls), and $\beta_L$ represents the marginal effect of local ancestry. The model tests the association between $Y_i$ and local ancestry ($L_i$) with adjustment for global ancestry ($Q_i$) to control for potential confounding.

Similarly, for the case-only analysis, we can rewrite the test statistic as:

$$T_{CO} = \frac{(\bar{L}_{z=0} - 2\bar{Q}_{z=0}) - 0}{SD(\bar{L}_{z=0})}$$

Here, we define two groups, Z=0 for the cases and Z=1 for a hypothetical group with the measured independent variables all equal to 0 ($\bar{L}_{z=1} - 2\bar{Q}_{z=1} = 0$).
In this way, the t-test for the case-only analysis for admixture mapping is mathematically identical to a regression model $L_i - 2Q_i = \alpha + \beta_Z Z_i + \varepsilon_i$ with a test of whether $\beta_Z$ is significantly different from zero. Because we define $L_i - 2Q_i = 0$ when Z=1, the model can be simplified to $L_i - 2Q_i = \alpha - \alpha Z_i + \varepsilon_i$ with a test of whether $\alpha$ is significantly different from zero. Further restricting the data to the Z=0 group (only the cases), we develop a regression model ("COreg") to implement the idea of admixture mapping for the case-only analysis:

$$L_i = \alpha + 2Q_i + \varepsilon_i$$

$$H_0: \alpha = 0 \qquad H_a: \alpha \neq 0$$

In this model, the test of whether $L_i$ is significantly different from $2Q_i$ is equivalent to the test of whether $\alpha$ is significantly different from 0.

In addition to the model CCreg, we further propose a model CCcom that incorporates both genotype and local ancestry information in the test:

$$\mathrm{logit}(Y_i) = \alpha + \beta_G G_i + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$$

A 2-df likelihood ratio test is used to jointly test the genotype ($G_i$) and the local ancestry ($L_i$) marginal effects.

4.2.1.2 Existing approaches

ADM is an approach proposed by Pasaniuc et al. (Pasaniuc, Zaitlen, et al.) for admixture mapping among cases. This approach defines the likelihood of the data for the individual within each ancestral stratum:

[Equation not recoverable from the transcript.]

In the likelihood, $\Omega$ represents the multiplicative risk for disease given one or two copies of the reference ancestry, and $l_{i,N_i}$ represents the likelihood for individual i with $N_i$ copies of the allele derived from the reference ancestral population. Assuming individuals are independent of each other, the likelihood of the data is then written as the product of the individual likelihoods, and a likelihood ratio test is used to derive the ADM score, which follows a chi-square distribution with 1-df. The other two approaches proposed by Pasaniuc et al. (Pasaniuc, Zaitlen, et al.) for admixture mapping using both genotype and local ancestry information are MIX and SUM.
Both tests combine the ADM model together with a SNP association model. The SNP association model adjusts for local ancestry, and therefore is conditionally independent of the ADM model:

$$
\begin{aligned}
L(p_{A,0}, p_{B,0}, R) ={} & \prod_{Y \in \{0,1\}} p_{A,Y}^{\,2RR_{AA,Y} + RV_{AA,Y}} (1 - p_{A,Y})^{\,2VV_{AA,Y} + RV_{AA,Y}} \\
& \times \prod_{Y \in \{0,1\}} p_{A,Y}^{\,RR_{AB,Y} + 0.5RV_{AB,Y}} (1 - p_{A,Y})^{\,VV_{AB,Y} + 0.5RV_{AB,Y}} \, p_{B,Y}^{\,RR_{AB,Y} + 0.5RV_{AB,Y}} (1 - p_{B,Y})^{\,VV_{AB,Y} + 0.5RV_{AB,Y}} \\
& \times \prod_{Y \in \{0,1\}} p_{B,Y}^{\,2RR_{BB,Y} + RV_{BB,Y}} (1 - p_{B,Y})^{\,2VV_{BB,Y} + RV_{BB,Y}}
\end{aligned}
$$

$$p_{A,1} = \frac{R\,p_{A,0}}{1 - p_{A,0} + R\,p_{A,0}}$$

$$SNP = 2\left[\max_{p_{A,0},\,p_{B,0},\,R} \log L(p_{A,0}, p_{B,0}, R) - \max_{p_A,\,p_B} \log L(p_A, p_B, 1)\right]$$

In the likelihood, R represents the relative increase in risk per extra reference allele, $p_{A,0}$ represents the allele frequency in ancestral population A among controls, and $RR_{AA,0}$ represents the number of individuals with genotype RR and both alleles derived from ancestral population A among the controls. The MIX (1-df) and the SUM (2-df) models are likelihood ratio tests that combine the likelihood (MIX) or the test statistics (SUM) of the single SNP test (adjusting for local ancestry) and the ADM. Both approaches assume no heterogeneity of the SNP effect across ancestral populations. More specifically, the SUM score is the sum of the statistics ADM and SNP, and the SUM statistic follows a chi-square distribution with 2-df.
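A minimal sketch of how a SUM score could be turned into a p-value is given below. It relies only on the standard fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2); the function names and the example statistic values are hypothetical and are not taken from Pasaniuc et al. or from the analyses reported here.

```python
from math import exp, log


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square distribution with 2 df: exp(-x / 2)."""
    return exp(-x / 2.0) if x > 0 else 1.0


def sum_score(adm_stat, snp_stat):
    """Add a 1-df ADM statistic and a 1-df SNP statistic; refer the sum to chi-square(2)."""
    total = adm_stat + snp_stat
    return total, chi2_sf_2df(total)


# Hypothetical statistic values, for illustration only.
total, p_value = sum_score(adm_stat=4.0, snp_stat=9.0)
print(total, p_value)

# The 2-df threshold that corresponds to a p-value of 0.05 is 2 * ln(20).
print(2 * log(20))
```

Adding the two statistics in this way presumes that they are independent under the null, which is the conditional independence property stated for the SNP association model above.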
The MIX model multiplies the likelihoods $L_{admix}(\Omega)$ and $L(p_{A,0}, p_{B,0}, R)$, assuming the following relationship between $\Omega$ and R:

[Equation not recoverable from the transcript.]

The MIX statistic is then calculated as:

$$MIX = 2\left[\max_{p_{A,0},\,p_{B,0},\,R} \log L_{combined}(p_{A,0}, p_{B,0}, R) - \max_{p_{A,0},\,p_{B,0},\,R} \log L_{combined}(p_{A,0}, p_{B,0}, 1)\right]$$

This statistic follows a chi-square distribution with 1-df.

4.2.1.3 Summary of the proposed models

In summary, we propose three models in a regression framework for admixture mapping: Model (1), which uses only ancestry information for the case-only study design; Model (2), which uses only ancestry information for the case-control study design; and Model (3), which incorporates ancestry and SNP genotype information:

COreg: $L_i = \alpha + 2Q_i + \varepsilon_i$ (1)
CCreg: $\mathrm{logit}(Y_i) \sim \alpha + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$ (2)
CCcom: $\mathrm{logit}(Y_i) \sim \alpha + \beta_G G_i + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$ (2-df) (3)

4.2.2 Simulation framework

A total of 9,641 African-American samples with measured genotypes were used for the simulation. Local ancestry was estimated through HAPMIX using HapMap 2 YRI and CEU as the reference populations. Global ancestry was then calculated as an average of the local ancestry across the genome divided by 2 for each individual.

Figure 4-1 Simulation framework for admixture mapping.

As shown in the framework above (Figure 4-1), $G_M$ represents the marker locus we observed and $G_D$ represents the true disease locus that is not directly observed in the genotype data. LD (r²) is the predefined linkage disequilibrium between $G_M$ and $G_D$. p represents the allele frequency at $G_M$, and is calculated directly from the observed genotypes, while q represents the allele frequency at $G_D$. During the simulation, q is set equal to p. Haplotype frequencies at $G_M$ and $G_D$ are calculated based on the predefined LD and the allele frequencies at these two loci. Then the probability of the genotypes at $G_D$ given $G_M$ ($\Pr(G_D \mid G_M)$) is calculated based on the haplotype frequencies at $G_M$ and $G_D$.
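The haplotype-frequency step just described can be sketched as follows. This is an illustration under stated assumptions and not the code used for the simulations: we assume positive LD (D > 0), random pairing of haplotypes within an individual, and genotypes coded as the number of copies of the tracked allele; all function names are hypothetical.

```python
import random
from math import sqrt


def haplotype_freqs(p, q, r2):
    """Two-locus haplotype frequencies from allele frequencies p (marker), q (disease) and r^2.

    Keys are (marker allele, disease allele). Assumes D > 0; with q == p and
    r2 <= 1 all four frequencies stay within [0, 1].
    """
    d = sqrt(r2 * p * (1 - p) * q * (1 - q))
    return {
        (1, 1): p * q + d,
        (1, 0): p * (1 - q) - d,
        (0, 1): (1 - p) * q - d,
        (0, 0): (1 - p) * (1 - q) + d,
    }


def conditional_genotype_probs(p, q, r2):
    """Pr(G_D = 0, 1, 2 | G_M = g) for g in 0, 1, 2, assuming independent haplotypes."""
    h = haplotype_freqs(p, q, r2)
    a1 = h[(1, 1)] / p        # Pr(disease allele | haplotype carries the marker allele)
    a0 = h[(0, 1)] / (1 - p)  # Pr(disease allele | haplotype lacks the marker allele)
    table = {}
    for g in (0, 1, 2):
        x, y = [a1] * g + [a0] * (2 - g)  # one probability per haplotype
        table[g] = [(1 - x) * (1 - y), x * (1 - y) + (1 - x) * y, x * y]
    return table


def draw_disease_genotype(g_m, table, rng):
    """Sample G_D for one individual given the observed marker genotype g_m."""
    return rng.choices([0, 1, 2], weights=table[g_m])[0]


table = conditional_genotype_probs(p=0.3, q=0.3, r2=0.9)
rng = random.Random(0)
simulated = [draw_disease_genotype(g, table, rng) for g in (0, 1, 2, 1, 0)]
print(table)
print(simulated)
```

With r² = 1 and q = p the conditional table becomes the identity, so the simulated disease genotype simply equals the marker genotype.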
Given the conditional probability $\Pr(G_D \mid G_M)$ and the observed genotypes at $G_M$, the genotypes at the disease locus $G_D$ are generated for each individual. After simulating the disease locus genotypes, we resample 20,000 individuals from our African-American samples with replacement. For these samples, we generate case/control status for a binary disease outcome (Y) using a logistic regression model incorporating the disease locus $G_D$ and the predefined disease prevalence (0.1) in the population. Then, 1,000 cases and 1,000 controls are randomly selected (without replacement) from the 20,000 samples. Admixture mapping was conducted among these selected samples to compare the performance of the proposed models.

4.2.3 Scenarios

Scenario A: Testing the performance of the models under the null hypothesis. The effect of $G_D$ on disease outcome Y (odds ratio) equals 1.0.

Scenario B: Testing the performance of the models under the alternative hypothesis. The LD (r²) between $G_M$ and $G_D$ equals 0.9. The effect of $G_D$ on disease outcome Y (odds ratio) is fixed at 2.0. SNPs are grouped into strata according to the allele frequency differences between European and African ancestries (as calculated from HapMap III populations). The performance of the models was tested within each stratum.

4.2.4 Real data analysis among African Americans

We apply the models to the African-American samples in the MEC (Haiman et al.). After quality control, there were 9,641 individuals and 863,431 SNPs remaining for the analysis. Among these samples, there were 4,905 cases and 4,732 controls. The averaged global European ancestry proportion (calculated as the averaged local ancestry across the genome divided by 2 for each individual) is 0.205 among cases and 0.215 among controls.

4.3 Results

4.3.1 Simulation results

A significance threshold of 0.05 is used to assess the type I error rate for the tested models.
SNPs are grouped by their allele frequency differences between CEU and YRI populations from HapMap III. As shown in Table 4-1, under the null hypothesis (OR=1.0), the overall type I error rates are around 0.05 except for the ADM model (0.027) and the SUM score model (0.035).

Table 4-1 Type 1 error among models for admixture scan. OR=1.0; LD=0.9
Information used | Model | Allele freq differences (1): All | < 0.2 | 0.2 ~ 0.4 | > 0.4
Number of loci | | 12393 | 7839 | 3275 | 1279
Genotype | SNP association | 0.051 | 0.051 | 0.048 | 0.055
Ancestry | COreg | 0.058 | 0.057 | 0.062 | 0.055
Ancestry | ADM | 0.027 | 0.027 | 0.028 | 0.023
Ancestry | CCreg | 0.052 | 0.051 | 0.053 | 0.054
Genotype & ancestry | CCcom | 0.051 | 0.053 | 0.050 | 0.045
Genotype & ancestry | SUM | 0.035 | 0.038 | 0.032 | 0.026
Genotype & ancestry | MIX | 0.048 | 0.050 | 0.046 | 0.043
1 Absolute allele frequency differences between African and European populations.

To assess the power of the tested models, a threshold of 1e-05 is used for models that use only local ancestry information (COreg, ADM, and CCreg), and a threshold of 5e-08 is used for models that incorporate a SNP-based test (SNP association, CCcom, SUM, and MIX). Table 4-2 shows the power for models using only ancestry information (COreg, ADM, and CCreg). The simulation results indicate that the case-only analysis is more powerful than the case-control analysis. In addition, our proposed model COreg performs better than the ADM model for the case-only analysis. For models incorporating genotype information, as shown in Table 4-3, models CCcom, SUM, and MIX result in power similar to that of the SNP association model. Across all the tested models, the power for detecting the association increases with greater allele frequency differences between European and African populations.

Table 4-2 Power among models for admixture scan using only ancestry information.
OR=2.0; LD=0.9
Model | Allele freq differences (1): All | < 0.2 | 0.2 ~ 0.4 | > 0.4
Number of loci | 12393 | 7839 | 3275 | 1279
COreg | 0.033 | 0.001 | 0.020 | 0.250
ADM | 0.015 | 0.000 | 0.005 | 0.122
CCreg | 0.011 | 0.000 | 0.005 | 0.094
1 Absolute allele frequency differences between African and European populations.

Table 4-3 Power among models for admixture scan incorporating genotype information. OR=2.0; LD=0.9
Model | Allele freq differences (1): All | < 0.2 | 0.2 ~ 0.4 | > 0.4
Number of loci | 12393 | 7839 | 3275 | 1279
SNP association | 0.739 | 0.659 | 0.865 | 0.935
CCcom | 0.726 | 0.644 | 0.852 | 0.922
SUM | 0.728 | 0.643 | 0.855 | 0.928
MIX | 0.756 | 0.675 | 0.870 | 0.956
1 Absolute allele frequency differences between African and European populations.

In conclusion, the simulation results suggest that at the LD=0.9 level (dense markers across the genome), when genotype information is available, it is always more powerful to incorporate it in the analysis, and we will not gain much in terms of power by incorporating ancestry information. However, among markers with large allele frequency differences between populations (>0.4), when LD decreases to 0.4, our proposed model that incorporates both ancestry and genotype information begins to show greater power than the SNP association analysis; and when LD is lower than 0.28, our proposed case-only model COreg begins to show greater power than the SNP association analysis (Figure 4-2). Note that the case-only analysis has constant power across different levels of LD because the local ancestry background remains constant.

Figure 4-2 Simulation results across different LD levels among markers with allele frequency differences greater than 0.4 between populations.

4.3.2 Real data analysis results

4.3.2.1 Results across the genome

We apply models (SNP association, COreg, CCreg, and CCcom) to a whole genome scan among the MEC African Americans. Figure 4-3 shows the results from the SNP association and the models that use only local ancestry information.
The pattern from the case-only analysis is similar to that from the case-control analysis, but the case-only analysis is clearly much more significant. Note that the resulting pattern is almost identical between ADM and COreg, with COreg attaining more significant p-values. On the other hand, some regions (e.g. regions on Chromosome 6, 8, 11, and 15) show a substantial disagreement between the case-only and the case-control analysis. Within these regions, the case-only analysis results in very significant p-values, while only one of these regions (on Chromosome 8) reaches genome-wide significance (1e-05) from the case-control analysis using only local ancestry. The difference may be due to over-adjustment for individual global ancestry in the case-control setting, or due to a spurious result introduced by the individual local ancestry estimation (e.g., if the estimated local ancestry proportion is higher than the average among both cases and controls). In order to understand the reason for the difference, we list detailed analysis results and ancestry estimates in Table 4-4 for these regions.

Figure 4-3 Genome-wide admixture scan using the SNP association and the models that use only local ancestry information. Panels: SNP association (black and gray) with COreg (red); SNP association (black and gray) with CCreg (red).

As shown in Table 4-4, on Chromosome 6, 11, and 15, there are regions with elevated local ancestry estimates (of European origin) among both cases and controls. The highly significant signals from the case-only analysis (here we only show the result from COreg) match perfectly with these regions. In addition, at these loci, the case-control analysis does not show any hint of association. Therefore, we suspect that the significant results detected within these regions are false positives caused by the local ancestry estimation procedure.
On the other hand, for the region on Chromosome 8, only the local ancestry estimate (proportion inherited from European ancestry) among the cases shows deviation from the mean. In this region, cases tend to have a lower European ancestry proportion than one would expect. At the most significant locus, all the models (SNP association, case-only, and case-control analysis) reach the genome-wide significance level (1e-05). Therefore, we view this region as a promising candidate for the disease under study. The reduced signal from the case-control analysis may be due to the adjustment for individual global ancestry, as global ancestry is highly correlated with local ancestry.

Table 4-4 Analysis details for regions with great disagreement between case-only and case-control analysis. Columns: regions on Chr6, Chr8, Chr11, and Chr15. Rows: -log10(p) for COreg (red) vs. CCreg (blue); local ancestry for cases (red) vs. controls (blue).

Figure 4-4 shows the results from the model using both genotype and ancestry information (CCcom) and compares its performance to the SUM and MIX models. The results indicate that the performance is similar across the three models, and most of the signals from these mixed models are well captured by the SNP association model that uses only the genotype information. Note that the SUM score and the MIX score models combine the genotype signal from the case-control analysis with the admixture signal from the case-only analysis; consequently, these two models generate spurious results at the regions on chromosome 6, 11, and 15 (regions shown in Table 4-4). On the other hand, our proposed mixed model, CCcom, is a combination of the genotype signal and the admixture signal from the case-control analysis, and is therefore immune to these false positive regions. Therefore, model CCcom is a more appropriate model for combining genotype and ancestry information.
Figure 4-4 Genome-wide admixture scan using the SNP association and the models incorporating both genotype and ancestry information (SUM, MIX, and CCcom). Panels: conventional model (black and gray) with SUM (red); conventional model (black and gray) with MIX (red); conventional model (black and gray) with CCcom (red).

4.3.2.2 Results on known hits

We further check the results on the known hits for prostate cancer among MEC African-Americans. Table 4-5 shows the list of SNPs that are replicated from the SNP association model (p-value cutoff 0.05). These SNPs are further classified into two subgroups: group A SNPs that are only significant from the SNP association model, and group B SNPs that also make the 0.05 cutoff among the models that use only ancestry information (COreg or CCreg). For each SNP, the model with the most significant p-value is highlighted in bold. As we expected, for SNPs in group A, the most significant p-value resulted from either the SNP association model or model CCcom, which combines both genotype and ancestry information, and there is not much gain in terms of increased statistical significance by incorporating ancestry information in the combined model. For group B SNPs, all the most significant p-values resulted from the combined model CCcom. Note that the results are almost identical between ADM and COreg, with COreg attaining more significant p-values. In addition, the performance of CCcom is similar to MIX, while the SUM score model results in relatively conservative p-values (less significant results).

Table 4-5 Known Hits that replicated from the conventional model.
Group  SNP         Chr  SNP association  COreg     CCreg     CCcom
A      rs10187424  2    3.33E-02         2.03E-01  8.08E-02  4.38E-02
A      rs12621278  2    1.07E-02         9.77E-01  6.78E-01  3.66E-02
A      rs7584330   2    1.21E-02         7.48E-01  7.59E-01  1.22E-02
A      rs2292884   2    2.90E-03         8.11E-01  6.16E-01  4.19E-03
A      rs12653946  5    2.34E-03         3.43E-01  8.89E-01  9.48E-03
A      rs1983891   6    2.52E-04         6.94E-02  3.18E-01  1.14E-03
A      rs339331    6    1.74E-06         2.45E-01  6.11E-02  1.09E-06
A      rs9364554   6    9.92E-03         5.48E-01  9.40E-01  1.73E-02
A      rs10993994  10   2.43E-03         5.51E-01  8.19E-01  7.46E-03
A      rs7127900   11   1.70E-03         4.64E-01  6.85E-01  4.67E-03
A      rs11228565  11   2.06E-03         9.35E-01  9.61E-01  7.25E-03
A      rs7210100   17   2.65E-08         7.56E-02  6.61E-02  6.04E-09
A      rs8102476   19   1.89E-02         3.91E-01  2.20E-01  1.30E-02
A      rs11672691  19   3.74E-02         4.04E-01  3.18E-01  1.07E-01
B      rs2028898   2    1.56E-02         1.30E-01  3.55E-02  9.36E-03
B      rs10486567  7    8.03E-04         3.91E-02  2.27E-03  4.49E-05
B      rs1512268   8    6.30E-07         4.76E-02  3.74E-03  6.37E-07
B      rs5759167   22   1.00E-04         7.86E-03  1.26E-03  4.05E-05

Table 4-6 lists the SNPs that are not replicated in the SNP association analysis but are captured by the models incorporating ancestry information (p-value cutoff 0.05). Note that rs2121875 and rs130067 yield very significant p-values in the case-only analysis (COreg) but show no signal in the case-control analysis (CCreg). This is due to large differences between local and global ancestry in both cases and controls. The averaged global ancestry estimates are 0.205 and 0.214 among cases and controls, respectively. For rs2121875, the averaged local ancestry estimates are 0.44 and 0.46 among cases and controls; for rs130067, the estimates are 0.46 and 0.48. At both loci, local ancestry estimates are elevated among both the cases and the controls. In the case-only analysis, the comparison of local ancestry to global ancestry therefore yields a significant p-value. In the case-control analysis, local ancestry is comparable between cases and controls, so there is no significant association.
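The contrast described above can be made concrete with the averages quoted in the text. The following is a minimal sketch, assuming (as a simplification, not the thesis's actual test statistics) that a case-only scan reacts to the gap between local and global ancestry among cases, while a case-control scan reacts to the gap in local ancestry between cases and controls. The function name `ancestry_contrasts` is hypothetical.

```python
# Average European ancestry proportions quoted in the text.
GLOBAL_CASES = 0.205
GLOBAL_CONTROLS = 0.214
LOCAL = {
    "rs2121875": {"cases": 0.44, "controls": 0.46},
    "rs130067": {"cases": 0.46, "controls": 0.48},
}


def ancestry_contrasts(local_cases, local_controls):
    """Return simplified contrasts for one locus (illustrative assumption only)."""
    return {
        # Local minus global ancestry among cases: what a case-only scan sees.
        "case_only": local_cases - GLOBAL_CASES,
        # The same excess among controls, showing the elevation is not case-specific.
        "control_excess": local_controls - GLOBAL_CONTROLS,
        # Cases minus controls in local ancestry: what a case-control scan sees.
        "case_control": local_cases - local_controls,
    }


results = {
    snp: ancestry_contrasts(v["cases"], v["controls"]) for snp, v in LOCAL.items()
}

for snp, c in results.items():
    print(
        f"{snp}: case-only = {c['case_only']:+.3f}, "
        f"control excess = {c['control_excess']:+.3f}, "
        f"case-control = {c['case_control']:+.3f}"
    )
```

Under these assumptions the case-only contrast is large at both loci while the case-control contrast is close to zero, which matches the pattern of COreg and CCreg p-values reported for these two SNPs.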
Table 4-6 Known hits that are replicated only in models incorporating local ancestry information.
SNP        Chr  SNP association  COreg     CCreg     CCcom
rs6763931  3    2.85E-01         4.11E-03  3.26E-02  9.97E-02
rs2121875  5    4.62E-01         3.26E-04  9.00E-01  7.01E-01
rs130067   6    1.36E-01         8.93E-13  8.27E-01  3.24E-01
rs2928679  8    8.94E-01         6.54E-02  2.93E-03  1.02E-02
rs4962416  10   1.08E-01         6.83E-02  1.58E-03  3.88E-03

4.3.2.3 Building regression models

As model CCcom generates valid results across the genome, we compare the performance of CCcom to the SNP association model in Figure 4-5. In this figure, the x-axis plots the –log10(p) for the genotype marginal effect from the conventional model, and the y-axis plots the –log10(p) for the 2-df test of the genotype and local ancestry main effects from the CCcom model. The solid line marks where the two models perform the same. The results show that 10 SNPs (all on chromosome 8) are significant in CCcom but not in the SNP association analysis (colored red), and that for 275 SNPs (274 on chromosome 8 and 1 on chromosome 22) the difference in –log10(p) between the models is greater than 4 (colored blue).

Figure 4-5 Comparison of the performance of the proposed CCcom (2df) model and the conventional model.

Figure 4-6 shows the detailed analysis results for the region on chromosome 8. In this figure, the red and blue colored SNPs are among the ones colored in Figure 4-5. Region A (highlighted in red) is the region containing the association signals from the SNP association analysis, and most of the red colored SNPs from Figure 4-5 fall into this region. Region B comprises the two regions highlighted in blue on either side of Region A, and most of the blue colored SNPs from Figure 4-5 fall into these regions. For the red colored SNPs, the association signal comes from both the genotype and the ancestry; for the majority of the blue colored SNPs, the association signal comes mainly from the ancestry information.
Figure 4-6 Comparison of the performance of the proposed CCcom (2df) model and the SNP association analysis for the region on chromosome 8.

Furthermore, we build an additional model based on CCcom that includes the SNP by local ancestry interaction, and propose a 3-df test of G, L, and GL:

CCcom_GL (3-df test):
\[ \mathrm{logit}(Y) \sim \alpha + \beta_G G + \beta_L L + \beta_{int} GL + \beta_Q Q \]

Figure 4-7 compares the performance between CCcom and the SNP association model. Similar to Figure 4-5, SNPs marked in red are the ones that meet the significance cutoff in model CCcom_GL (3-df) and have more significant p-values than in CCcom (2-df). There are 9 such SNPs across the genome, and all of them are on chromosome 8 (as shown in Figure 4-8). In the right panel, the 12 SNPs marked in blue have much smaller p-values than in the conventional model (the difference in –log10(p) is greater than 4). Among these SNPs, 9 are on chromosome 9, and the remaining 3 are on chromosomes 2, 5, and 8, respectively. Results for the region on chromosome 9 are shown in Figure 4-9. The 9 blue colored SNPs (from Figure 4-8) are gathered within the two highlighted regions. The first region contains a transcription factor gene, ZFAND5, which regulates NFkappaB activation and apoptosis. The second region contains ALDH1A1, an aldehyde dehydrogenase enzyme that is responsible for alcohol metabolism and is also involved in the regulation of the metabolic responses to a high-fat diet.

Figure 4-7 Comparison of models CCcom (2df) and CCcom_GL (3df).

Figure 4-8 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on chromosome 8.

Figure 4-9 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on chromosome 9.

Finally, we take the most significant SNP from each of the highlighted regions in Figure 4-9 and show the effect estimates within each ancestry stratum.
As shown in Table 4-7, the allele frequencies for rs4073226 are 0.720 for L=0 (homozygous for African ancestry) and 0.493 for L=2 (homozygous for European ancestry). These estimated allele frequencies are consistent with those from the HapMap samples (0.714 for YRI and 0.535 for CEU). For rs3815836, the allele frequencies are 0.162 (L=0) and 0.537 (L=2), which are also consistent with the HapMap samples (0.129 for YRI and 0.566 for CEU). For both SNPs, the effect sizes are in opposite directions between local ancestry strata L=0 and L=2.

Table 4-7 Effect sizes for SNPs rs4073226 and rs3815836 on chromosome 9.
Marker     Local strata  N     Allele freq  CCcom (2-df) β_G (p-value)  CCcom_GL (3-df) β_G (p-value)
rs4073226  L=0           6189  0.720        -                           -0.059
rs4073226  L=1           2945  0.616        -                           0.178
rs4073226  L=2           504   0.493        -                           0.415
rs4073226  All           9638  0.676        0.048 (0.27)                - (1.6×10^-5)
rs3815836  L=0           6211  0.162        -                           0.144
rs3815836  L=1           2925  0.343        -                           -0.127
rs3815836  L=2           503   0.537        -                           -0.398
rs3815836  All           9639  0.237        -0.001 (0.94)               - (9.5×10^-6)

4.4 Discussion

Results from both the simulations and the real data analysis indicate that case-only analysis (COreg and ADM) is more powerful than case-control analysis (CCreg), and that our proposed regression model COreg for case-only analysis is more powerful than the ADM model. When SNP genotypes are available, it is more powerful to incorporate genotype information in the model (CCcom, MIX, and SUM), and our proposed regression model CCcom performs similarly to the MIX and SUM models. On the other hand, the real data analysis shows that the case-only analysis suffers from spurious results in regions with biased local ancestry estimation (e.g. the estimated ancestry proportion is higher than it should be among both cases and controls). Therefore, the case-only models COreg and ADM, as well as the models MIX and SUM that incorporate signals from the case-only analysis, may produce false positives across the genome.
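The opposite-direction effects in Table 4-7 follow from the interaction term of CCcom_GL: under logit(Y) ~ α + β_G G + β_L L + β_int GL + β_Q Q, the per-allele log odds ratio of G within local ancestry stratum L is β_G + β_int·L. The sketch below illustrates this relation. The coefficient values are an assumption, read off from the stratum-specific estimates in Table 4-7 rather than reported directly in the text, and the function name `stratum_log_or` is hypothetical.

```python
import math

# Coefficients implied by the stratum-specific estimates in Table 4-7
# (assumption: inferred from the table, not stated in the text).
IMPLIED = {
    "rs4073226": {"beta_g": -0.059, "beta_int": 0.237},
    "rs3815836": {"beta_g": 0.144, "beta_int": -0.271},
}


def stratum_log_or(beta_g, beta_int, local_copies):
    """Per-allele log odds ratio of G when L = local_copies (0, 1, or 2)."""
    return beta_g + beta_int * local_copies


effects = {
    snp: [stratum_log_or(c["beta_g"], c["beta_int"], copies) for copies in (0, 1, 2)]
    for snp, c in IMPLIED.items()
}

for snp, values in effects.items():
    for copies, log_or in enumerate(values):
        print(f"{snp} L={copies}: log OR = {log_or:+.3f}, OR = {math.exp(log_or):.3f}")
```

With these implied values the sketch reproduces the three stratum-specific estimates per SNP, which suggests that the tabulated effects change linearly with the number of European-ancestry copies at the locus.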
In conclusion, we suggest using our proposed regression model CCcom, with the 2-df test of both SNP and local ancestry, for admixture scans in admixed populations. Moving from the model using only genotype information to the model incorporating both genotype and local ancestry information, and finally to the model that further considers heterogeneity by local ancestry, at each step we gain power at certain loci and at the same time lose power due to the additional degrees of freedom in the test. More specifically, compared to the SNP association model, model CCcom (2-df) yields more significant p-values for 43% of the loci; and compared to CCcom (2-df), model CCcom_GL (3-df) yields more significant p-values for 34% of the loci. When comparing CCcom_GL directly to the conventional model (as shown in Figure 4-10), we find that 47% of the loci attain more significant p-values, and we notice that the difference between the two models tends to be greater among these loci. For example, rs11777807 on chromosome 8 has a p-value of 2.4e-3 from the conventional model, but a very significant p-value of 4.1e-12 from model CCcom_GL (3-df). This change suggests that after an initial scan with the conventional model, it is more powerful to run an admixture scan using CCcom_GL (3-df), a model that captures the admixture signal and heterogeneity by local ancestry simultaneously.

Figure 4-10 Comparison of model CCcom_GL (3df) and the SNP association model.

Note that our suggestions on the admixture scan for GWAS are based on the results from both simulations and real data analysis. As the real data analysis shows that the case-only analysis can produce spurious results (as shown in Table 4-4), we do not suggest using models that capture the admixture signals through the case-only analysis (COreg, ADM, SUM, and MIX). However, when the issue with case-only analysis is solved (e.g.
no more regions with elevated local ancestry estimates), our suggestions for admixture scans in GWAS may change accordingly. As shown in the simulation, for models that use only local ancestry information, COreg > ADM > CCreg in terms of power. For the combined models, the performance is similar across the CCcom, MIX, and SUM models. Compared to the existing models, our proposed regression models have the following advantages: 1) the analysis is easier to set up and run; 2) it is more flexible to include additional factors in the model (e.g. the SNP by local ancestry interaction); 3) the results are easier to interpret, since the log odds ratios of the SNP effect, the local ancestry effect, and the ancestry-stratum-specific SNP effect can be obtained directly from the regression coefficients in the model; 4) the models can be applied to continuous outcomes as well.

Chapter 5 Summary

Figure 5-1 summarizes our suggested models for GWAS among admixed populations. The two models to the left, Y~G+Q and Y~G+GL+L+Q (2df), are the models we proposed in Chapter 3. The two models to the right, Y~G+L+Q (2df) and Y~G+GL+L+Q (3df), are the models we proposed in Chapter 4. As shown in the figure, moving from the left panel to the right panel, we incorporate admixture signals into the models; moving from the upper panel to the lower panel, we incorporate heterogeneity signals into the models.

Figure 5-1 Comparison of proposed models.

Figure 5-2 shows how the results change when different signals are incorporated in the model. Compared to the conventional model Y~G+Q, about 43% of the loci have more significant p-values when local ancestry information is incorporated (Y~G+L+Q 2df), and about 40% of the loci have more significant p-values when the G by L interaction is incorporated (Y~G+GL+L+Q 2df). When comparing the model that incorporates both the admixture signal and the heterogeneity signal (Y~G+GL+L+Q 3df) to the conventional model, about 47% of the loci have more significant p-values.
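The comparisons above involve tests with 1, 2, or 3 degrees of freedom, and the same improvement in fit is penalized more heavily as the degrees of freedom grow. The sketch below illustrates this trade-off with a likelihood-ratio test. It assumes the tests are likelihood-ratio tests (the text does not state whether likelihood-ratio, Wald, or score tests were used), and the function names `chi2_sf` and `lrt_pvalue` are hypothetical. The chi-square tail probabilities use closed forms that hold only for df = 1, 2, 3.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability for df in {1, 2, 3} (closed forms)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1, 2, or 3")


def lrt_pvalue(loglik_null, loglik_full, df):
    """Likelihood-ratio test p-value for a full model nested over a null model."""
    statistic = 2.0 * (loglik_full - loglik_null)
    return chi2_sf(statistic, df)


# Hypothetical log-likelihoods: the same gain in fit tested with 1, 2, and 3 df,
# e.g. Y~G+Q (1 df), Y~G+L+Q (2 df), Y~G+GL+L+Q (3 df), each against Y~Q.
same_gain = {df: lrt_pvalue(-1000.0, -995.0, df) for df in (1, 2, 3)}
for df, p in same_gain.items():
    print(f"df={df}: -log10(p) = {-math.log10(p):.2f}")
```

A richer model therefore has to improve the fit by more than the extra degrees of freedom cost before it yields a smaller p-value, which is one way to read why only a fraction of the loci become more significant under the 2-df and 3-df tests.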
The distribution of the changes in –log10(p-value) is shown in Figure 5-3. Among the loci with a small difference between the two models, e.g. where the absolute value of the difference is smaller than 0.6, the conventional model has the advantage over the advanced model (more than half of these loci have more significant p-values from the conventional model); however, among the loci with a greater difference between models, our proposed 3-df test model has the advantage over the conventional model.

Figure 5-2 Changes on results between proposed models.

[Histogram of the difference in –log10(p-value), axis ticks at -1, 0, 1, 2, >2; bin percentages: 0.06%, 52.6%, 41.5%, 4.91%, 0.93%.]
Figure 5-3 Histogram of changes in –log10(p-value) when comparing Y~G+L+GL+Q (3df) to the conventional model Y~G+Q.

Although the model incorporating both admixture and heterogeneity signals yields less significant p-values than the conventional model for more than half of the loci (~53%) across the genome, the gains at the loci where it is more significant tend to be larger than the losses at the remaining loci (as indicated in Figure 5-3). In order to reflect the amount of change between models in the comparison, we calculated a weighted change for the loci above (W_above) and below (W_below) the expected line in Figure 4-10, respectively:

\[ W_{above} = \frac{\sum_{i \in \{y > x\}} (y_i - x_i)}{N_{i \in \{y > x\}}} \]

\[ W_{below} = \frac{\sum_{i \in \{y < x\}} (y_i - x_i)}{N_{i \in \{y < x\}}} \]

Here y_i represents the –log10(p) from model Y~G+GL+L+Q (3df) at locus i, x_i represents the –log10(p) from the conventional model Y~G+Q at the same locus, and N is the number of loci in the corresponding set. The weighted changes across models are shown in Figure 5-4. In each comparison, W_above is greater than the absolute value of W_below, suggesting that on average (considering the differences between models), the potential gain in information outweighs the potential loss in significance due to an additional degree of freedom in the test.
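As a small illustration of the weighted changes defined above, the sketch below computes W_above and W_below as the mean of y_i − x_i over the loci with y > x and y < x, respectively. The function name `weighted_changes` and the toy –log10(p) values are hypothetical and are not taken from the study data.

```python
def weighted_changes(x, y):
    """Return (W_above, W_below): mean of y_i - x_i over loci with y > x and y < x."""
    above = [yi - xi for xi, yi in zip(x, y) if yi > xi]
    below = [yi - xi for xi, yi in zip(x, y) if yi < xi]
    w_above = sum(above) / len(above) if above else 0.0
    w_below = sum(below) / len(below) if below else 0.0
    return w_above, w_below


# Toy -log10(p) values: x from the conventional model, y from the 3-df model.
x_conventional = [1.0, 2.0, 0.5, 3.0]
y_three_df = [0.8, 1.9, 0.4, 7.0]

w_above, w_below = weighted_changes(x_conventional, y_three_df)
print(f"W_above = {w_above:.3f}, W_below = {w_below:.3f}")
print("gain outweighs loss:", w_above > abs(w_below))
```

In this toy example three of the four loci lose a little significance and one gains a lot, so W_above exceeds the absolute value of W_below even though most loci fall below the line, mirroring the pattern described for the real data.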
Therefore, GWAS among admixed populations will benefit from incorporating admixture and heterogeneity signals in the analysis.

Figure 5-4 Weighted changes on results between proposed models.

Chapter 6 Future Directions

We have discussed the performance of various models, including SNP association, models using only ancestry information, models using both ancestry and genotype information, and models accounting for heterogeneity by local ancestry (SNP by local ancestry interaction). Each model has its own advantage under different scenarios. The issue then becomes how to select among the models, or how to build a more complete model starting from the initial SNP association model. This model selection can be based on the AIC, which incorporates a penalty for overfitting. Alternatively, one can generate priors for each model under each scenario (e.g. the allele frequency difference between populations at each locus) and select among the models in a Bayesian framework.

Principal Component Analysis (PCA) has been widely used for the detection of population substructure and has become a standard approach for adjusting for confounding in genetic association studies. However, issues remain with this approach. First, the currently prevalent practice is to use the top 10 PCs to control for confounding in the association study. The top 10 PCs may not be enough to control for finer structure in the data, and may be too many for studies with only major population structure (and therefore cause a loss of power when the sample size is small). So there is a question about how many PCs are appropriate for the adjustment, and the answer will differ across study populations. Second, PCs are not very interpretable compared to STRUCTURE results, so there is no clear cutoff for defining outliers in terms of the samples' ancestry composition. Special clustering methods are
necessary for summarizing the results from PCA and interpreting the identified population substructures.

Local ancestry estimates may generate spurious results for admixture scans among only the cases. We observed regions with elevated ancestry estimates within both cases and controls among the MEC African American samples (as shown in Table 4-4). Investigating the underlying causes of this bias, and how to account for it in the estimation of local ancestry, is an area that needs further research. In addition, there is a need for more robust statistical methods for ancestry estimation (robust estimates when the exact reference population is not available or not known).

Another challenge for association studies is over-adjustment, for example, in controlling for confounding. In order to control for any potential population stratification, we suggest adjusting for individual global ancestry in the regression model. This has been shown to be the most efficient way in the literature as well as in our simulations and real data analysis. However, we may also lose power at loci that are in LD with the disease causal locus. Figure 3-3 (a) shows the simulation result under the scenario in which there is no global ancestry effect on the disease. The test is at the marker locus that is in LD with the underlying disease locus. The model adjusting for global ancestry (red dashed line) has less power than the crude model (blue dotted line), especially when there is a substantial allele frequency difference between ancestries.

The conclusions and suggestions we give in Chapters 3 and 4 are based on the GWAS design. For Next Generation Sequencing, our conclusions may change accordingly. First of all, the causal locus is more likely to have been genotyped in sequencing data than in GWAS data.
Therefore, there will still be confounding issues, but heterogeneity by local ancestry will be less likely. Another interesting area is how to use the ancestry information to facilitate genotyping (especially for rare variants), association analysis, and the summary of the results. Genotyping rare variants is a big challenge because of the sparse data in the cluster of the minor genotype (or even missing genotype clusters). Current approaches borrow information on the cluster distributions from a set of other SNPs. It is also possible to use the local ancestry information as a prior for the genotype clusters of rare variants, e.g. to obtain a better estimate of the haplotype probabilities around the region. For analyzing sequencing data, local ancestry information could also be used to generate priors for selecting SNPs for a joint analysis. For example, the results from admixture mapping could suggest the “supporting regions” described in Figure 4-6 (Region B). However, this information should be considered together with the allele frequencies, because an admixture scan is only powerful at loci that have substantial allele frequency differences between ancestral populations (with the further caveat that the disease prevalence has to differ between ancestral populations). In addition, the sequencing data discussed above come from sequencing each individual sample. An alternative approach is pooled sequencing, in which cases and controls are grouped into different pools. In this case, population stratification becomes an issue for samples from admixed populations.

Bibliography

"The Encode (Encyclopedia of DNA Elements) Project." Science 306.5696 (2004): 636-40. Print.
"The International Hapmap Project." Nature 426.6968 (2003): 789-96. Print.
Adeyemo, A., et al. "A Genome-Wide Association Study of Hypertension and Blood Pressure in African Americans." PLoS Genet 5.7 (2009): e1000564. Print.
Aldrich, M.
C., et al. "Comparison of Statistical Methods for Estimating Genetic Admixture in a Lung Cancer Study of African Americans and Latinos." Am J Epidemiol 168.9 (2008): 1035-46. Print.
Alexander, D. H., J. Novembre, and K. Lange. "Fast Model-Based Estimation of Ancestry in Unrelated Individuals." Genome Res 19.9 (2009): 1655-64. Print.
Altshuler, D. M., et al. "Integrating Common and Rare Genetic Variation in Diverse Human Populations." Nature 467.7311 (2010): 52-8. Print.
Arking, D. E., et al. "Genome-Wide Association Study Identifies Gpc5 as a Novel Genetic Locus Protective against Sudden Cardiac Arrest." PLoS One 5.3 (2010): e9879. Print.
Barnholtz-Sloan, J. S., et al. "Fgfr2 and Other Loci Identified in Genome-Wide Association Studies Are Associated with Breast Cancer in African-American and Younger Women." Carcinogenesis 31.8 (2010): 1417-23. Print.
Benyamin, B., et al. "Variants in Tf and Hfe Explain Approximately 40% of Genetic Variation in Serum-Transferrin Levels." Am J Hum Genet 84.1 (2009): 60-5. Print.
Bertoni, B., et al. "Admixture in Hispanics: Distribution of Ancestral Population Contributions in the Continental United States." Hum Biol 75.1 (2003): 1-11. Print.
Bilguvar, K., et al. "Susceptibility Loci for Intracranial Aneurysm in European and Japanese Populations." Nat Genet 40.12 (2008): 1472-7. Print.
Birlea, S. A., et al. "Genome-Wide Association Study of Generalized Vitiligo in an Isolated European Founder Population Identifies Smoc2, in Close Proximity to Iddm8." J Invest Dermatol 130.3 (2010): 798-803. Print.
Boger, C. A., et al. "Cubn Is a Gene Locus for Albuminuria." J Am Soc Nephrol 22.3 (2011): 555-70. Print.
Bonilla, C., et al. "Admixture in the Hispanics of the San Luis Valley, Colorado, and Its Implications for Complex Trait Gene Mapping." Ann Hum Genet 68.Pt 2 (2004): 139-53. Print.
Bostrom, M. A., et al. "Candidate Genes for Non-Diabetic Esrd in African Americans: A Genome-Wide Association Study Using Pooled DNA."
Hum Genet 128.2 (2010): 195-204. Print.
Broderick, P., et al. "A Genome-Wide Association Study Shows That Common Alleles of Smad7 Influence Colorectal Cancer Risk." Nat Genet 39.11 (2007): 1315-7. Print.
Bryc, K., et al. "Colloquium Paper: Genome-Wide Patterns of Population Structure and Admixture among Hispanic/Latino Populations." Proc Natl Acad Sci U S A 107 Suppl 2 (2010): 8954-61. Print.
Carter-Pokras, O. D., and P. J. Gergen. "Reported Asthma among Puerto Rican, Mexican-American, and Cuban Children, 1982 through 1984." Am J Public Health 83.4 (1993): 580-2. Print.
Cavalli-Sforza, L. L., and M. W. Feldman. "The Application of Molecular Genetic Approaches to the Study of Human Evolution." Nat Genet 33 Suppl (2003): 266-75. Print.
Cavalli-Sforza, L. L., Paolo Menozzi, and Alberto Piazza. The History and Geography of Human Genes. Princeton, N.J.: Princeton University Press, 1994. Print.
Chalasani, N., et al. "Genome-Wide Association Study Identifies Variants Associated with Histologic Features of Nonalcoholic Fatty Liver Disease." Gastroenterology 139.5 (2010): 1567-76, 76 e1-6. Print.
Chanock, S. J. "A Twist on Admixture Mapping." Nat Genet 43.3 (2011): 178-9. Print.
Charles, B. A., et al. "A Genome-Wide Association Study of Serum Uric Acid in African Americans." BMC Med Genomics 4 (2011): 17. Print.
Chen, Z. J., et al. "Genome-Wide Association Study Identifies Susceptibility Loci for Polycystic Ovary Syndrome on Chromosome 2p16.3, 2p21 and 9q33.3." Nat Genet 43.1 (2011): 55-9. Print.
Cho, Y. S., et al. "A Large-Scale Genome-Wide Association Study of Asian Populations Uncovers Genetic Factors Influencing Eight Quantitative Traits." Nat Genet 41.5 (2009): 527-34. Print.
Choudhry, S., et al. "Population Stratification Confounds Genetic Association Studies among Latinos." Hum Genet 118.5 (2006): 652-64. Print.
Choudhry, S., et al. "Dissecting Complex Diseases in Complex Populations: Asthma in Latino Americans." Proc Am Thorac Soc 4.3 (2007): 226-33. Print.
Choudhry, S., et al. "Genome-Wide Screen for Asthma in Puerto Ricans: Evidence for Association with 5q23 Region." Hum Genet 123.5 (2008): 455-68. Print.
Cooper, R. S., B. Tayo, and X. Zhu. "Genome-Wide Association Studies: Implications for Multiethnic Samples." Hum Mol Genet 17.R2 (2008): R151-5. Print.
Cui, R., et al. "Common Variant in 6q26-Q27 Is Associated with Distal Colon Cancer in an Asian Population." Gut (2011). Print.
Denavas, C., and M. A. Hall. "The Hispanic Population in the United States: March 1986 and 1987." Curr Popul Rep Popul Charact.434 (1988): 1-89. Print.
Devlin, B., and K. Roeder. "Genomic Control for Association Studies." Biometrics 55.4 (1999): 997-1004. Print.
Ding, L., et al. "Comparison of Measures of Marker Informativeness for Ancestry and Admixture Mapping." BMC Genomics 12.1 (2011): 622. Print.
Eijgelsheim, M., et al. "Genome-Wide Association Analysis Identifies Multiple Loci Related to Resting Heart Rate." Hum Mol Genet 19.19 (2010): 3885-94. Print.
Engelhardt, B. E., and M. Stephens. "Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis." PLoS Genet 6.9 (2010). Print.
Fagan, Brian M. The Great Journey: The Peopling of Ancient America. New York, N.Y.: Thames and Hudson, 1987. Print.
Falush, D., M. Stephens, and J. K. Pritchard. "Inference of Population Structure Using Multilocus Genotype Data: Dominant Markers and Null Alleles." Mol Ecol Notes 7.4 (2007): 574-78. Print.
---. "Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies." Genetics 164.4 (2003): 1567-87. Print.
Fejerman, L., et al. "Admixture Mapping Identifies a Locus on 6q25 Associated with Breast Cancer Risk in Us Latinas." Hum Mol Genet (2012). Print.
Freeman, N. C., D. Schneider, and P. McGarvey. "Household Exposure Factors, Asthma, and School Absenteeism in a Predominantly Hispanic Community." J Expo Anal Environ Epidemiol 13.3 (2003): 169-76. Print.
Gabriel, S. B., et al. "The Structure of Haplotype Blocks in the Human Genome." Science 296.5576 (2002): 2225-9. Print.
Garcia-Barcelo, M. M., et al. "Genome-Wide Association Study Identifies Nrg1 as a Susceptibility Locus for Hirschsprung's Disease." Proc Natl Acad Sci U S A 106.8 (2009): 2694-9. Print.
Garcia-Barcelo, M. M., et al. "Genome-Wide Association Study Identifies a Susceptibility Locus for Biliary Atresia on 10q24.2." Hum Mol Genet 19.14 (2010): 2917-25. Print.
Gauderman, W. J., J. S. Witte, and D. C. Thomas. "Family-Based Association Studies." J Natl Cancer Inst Monogr.26 (1999): 31-7. Print.
Gonzalez Burchard, E., et al. "Latino Populations: A Unique Opportunity for the Study of Race, Genetics, and Social Environment in Epidemiological Research." Am J Public Health 95.12 (2005): 2161-8. Print.
Graham, R. R., et al. "Genetic Variants near Tnfaip3 on 6q23 Are Associated with Systemic Lupus Erythematosus." Nat Genet 40.9 (2008): 1059-61. Print.
Greenland, S. "Quantifying Biases in Causal Models: Classical Confounding Vs Collider-Stratification Bias." Epidemiology 14.3 (2003): 300-6. Print.
Greenland, S., J. Pearl, and J. M. Robins. "Causal Diagrams for Epidemiologic Research." Epidemiology 10.1 (1999): 37-48. Print.
Gudbjartsson, D. F., et al. "Variants Conferring Risk of Atrial Fibrillation on Chromosome 4q25." Nature 448.7151 (2007): 353-7. Print.
Guo, Y., et al. "Genome-Wide Association Study Identifies Aldh7a1 as a Novel Susceptibility Gene for Osteoporosis." PLoS Genet 6.1 (2010): e1000806. Print.
Haiman, C. A., et al. "Genome-Wide Association Study of Prostate Cancer in Men of African Ancestry Identifies a Susceptibility Locus at 17q21." Nat Genet 43.6 (2011): 570-3. Print.
Haiman, C. A., and D. O. Stram. "Exploring Genetic Susceptibility to Cancer in Diverse Populations." Curr Opin Genet Dev 20.3 (2010): 330-5. Print.
Han, J. W., et al. "Genome-Wide Association Study in a Chinese Han Population Identifies Nine New Susceptibility Loci for Systemic Lupus Erythematosus." Nat Genet 41.11 (2009): 1234-7. Print.
Hattori, E., et al. "Preliminary Genome-Wide Association Study of Bipolar Disorder in the Japanese Population." Am J Med Genet B Neuropsychiatr Genet 150B.8 (2009): 1110-7. Print.
Hayes, M. G., et al. "Identification of Type 2 Diabetes Genes in Mexican Americans through Genome-Wide Association Studies." Diabetes 56.12 (2007): 3033-44. Print.
Hicks, A. A., et al. "Genetic Determinants of Circulating Sphingolipid Concentrations in European Populations." PLoS Genet 5.10 (2009): e1000672. Print.
Hindorff, L. A., et al. "Potential Etiologic and Functional Implications of Genome-Wide Association Loci for Human Diseases and Traits." Proc Natl Acad Sci U S A 106.23 (2009): 9362-7. Print.
Hiura, Y., et al. "Identification of Genetic Markers Associated with High-Density Lipoprotein-Cholesterol by Genome-Wide Screening in a Japanese Population: The Suita Study." Circ J 73.6 (2009): 1119-26. Print.
Hiura, Y., et al. "A Genome-Wide Association Study of Hypertension-Related Phenotypes in a Japanese Population." Circ J 74.11 (2010): 2353-9. Print.
Hoggart, C. J., et al. "Control of Confounding of Genetic Associations in Stratified Populations." Am J Hum Genet 72.6 (2003): 1492-504. Print.
Hoggart, C. J., et al. "Design and Analysis of Admixture Mapping Studies." Am J Hum Genet 74.5 (2004): 965-78. Print.
Homa, D. M., D. M. Mannino, and M. Lara. "Asthma Mortality in U.S. Hispanics of Mexican, Puerto Rican, and Cuban Heritage, 1990-1995." Am J Respir Crit Care Med 161.2 Pt 1 (2000): 504-9. Print.
Hor, H., et al. "Genome-Wide Association Study Identifies New Hla Class Ii Haplotypes Strongly Protective against Narcolepsy." Nat Genet 42.9 (2010): 786-9. Print.
Hubisz, MJ, et al. "Inferring Weak Population Structure with the Assistance of Sample Group Information."
Molecular Ecology Resources 9.5 (2009): 1322-32. Print.
Hunter, D. J., et al. "A Genome-Wide Association Study Identifies Alleles in Fgfr2 Associated with Risk of Sporadic Postmenopausal Breast Cancer." Nat Genet 39.7 (2007): 870-4. Print.
Jorm, A. F., and S. Easteal. "Assessing Candidate Genes as Risk Factors for Mental Disorders: The Value of Population-Based Epidemiological Studies." Soc Psychiatry Psychiatr Epidemiol 35.1 (2000): 1-4. Print.
Kamatani, Y., et al. "Genome-Wide Association Study of Hematological and Biochemical Traits in a Japanese Population." Nat Genet 42.3 (2010): 210-5. Print.
Kang, S. J., et al. "Assessing the Impact of Global Versus Local Ancestry in Association Studies." BMC Proc 3 Suppl 7 (2009): S107. Print.
Kim, H., et al. "Genome-Wide Association Study of Acute Post-Surgical Pain in Humans." Pharmacogenomics 10.2 (2009): 171-9. Print.
Kim, J. J., et al. "A Genome-Wide Association Analysis Reveals 1p31 and 2p13.3 as Susceptibility Loci for Kawasaki Disease." Hum Genet 129.5 (2011): 487-95. Print.
Kim, S., et al. "Genome-Wide Association Study of Csf Biomarkers Abeta1-42, T-Tau, and P-Tau181p in the Adni Cohort." Neurology 76.1 (2011): 69-79. Print.
Kottgen, A., et al. "Multiple Loci Associated with Indices of Renal Function and Chronic Kidney Disease." Nat Genet 41.6 (2009): 712-7. Print.
Kraft, P., et al. "Exploiting Gene-Environment Interaction to Detect Genetic Associations." Hum Hered 63.2 (2007): 111-9. Print.
Kumar, V., et al. "Common Variants on 14q32 and 13q12 Are Associated with Dlbcl Susceptibility." J Hum Genet (2011). Print.
Kung, A. W., et al. "Association of Jag1 with Bone Mineral Density and Osteoporotic Fractures: A Genome-Wide Association Study and Follow-up Replication Studies." Am J Hum Genet 86.2 (2010): 229-39. Print.
Landi, M. T., et al. "A Genome-Wide Association Study of Lung Cancer Identifies a Region of Chromosome 5p15 Associated with Risk for Adenocarcinoma." Am J Hum Genet 85.5 (2009): 679-91. Print.
Lascorz, J., et al. "Genome-Wide Association Study for Colorectal Cancer Identifies Risk Polymorphisms in German Familial Cases and Implicates Mapk Signalling Pathways in Disease Susceptibility." Carcinogenesis 31.9 (2010): 1612-9. Print.
Lee, Y. L., et al. "Comparing Genetic Ancestry and Self-Reported Race/Ethnicity in a Multiethnic Population in New York City." J Genet 89.4 (2010): 417-23. Print.
Lei, S. F., et al. "Genome-Wide Association Scan for Stature in Chinese: Evidence for Ethnic Specific Loci." Hum Genet 125.1 (2009): 1-9. Print.
Lessard, C. J., et al. "Identification of a Systemic Lupus Erythematosus Susceptibility Locus at 11p13 between Pdhx and Cd44 in a Multiethnic Study." Am J Hum Genet 88.1 (2011): 83-91. Print.
Lettre, G., et al. "Genome-Wide Association Study of Coronary Heart Disease and Its Risk Factors in 8,090 African Americans: The Nhlbi Care Project." PLoS Genet 7.2 (2011): e1001300. Print.
Li, Q., and K. Yu. "Improved Correction for Population Stratification in Genome-Wide Association Studies by Identifying Hidden Population Structures." Genet Epidemiol 32.3 (2008): 215-26. Print.
Li, Y. F., et al. "Glutathione S-Transferase P1, Maternal Smoking, and Asthma in Children: A Haplotype-Based Analysis." Environ Health Perspect 116.3 (2008): 409-15. Print.
Liu, Y. Z., et al. "Identification of Plcl1 Gene for Hip Bone Size Variation in Females in a Genome-Wide Association Study." PLoS One 3.9 (2008): e3160. Print.
Low, S. K., et al. "Genome-Wide Association Study of Pancreatic Cancer in Japanese Population." PLoS One 5.7 (2010): e11824. Print.
Ma, D., et al. "A Genome-Wide Association Study of Autism Reveals a Common Novel Risk Locus at 5p14.1." Ann Hum Genet 73.Pt 3 (2009): 263-73. Print.
McConnell, R., et al. "Air Pollution and Bronchitic Symptoms in Southern California Children with Asthma." Environ Health Perspect 107.9 (1999): 757-60. Print.
McKay, J. D., et al. "A Genome-Wide Association Study of Upper Aerodigestive Tract Cancers Conducted within the Inhance Consortium." PLoS Genet 7.3 (2011): e1001333. Print.
McKeigue, P. M. "Mapping Genes That Underlie Ethnic Differences in Disease Risk: Methods for Detecting Linkage in Admixed Populations, by Conditioning on Parental Admixture." Am J Hum Genet 63.1 (1998): 241-51. Print.
Miclaus, K., R. Wolfinger, and W. Czika. "Snp Selection and Multidimensional Scaling to Quantify Population Structure." Genet Epidemiol 33.6 (2009): 488-96. Print.
Montana, G., and J. K. Pritchard. "Statistical Tests for Admixture Mapping with Case-Control and Cases-Only Data." Am J Hum Genet 75.5 (2004): 771-89. Print.
Navidi, W., et al. "Design and Analysis of Multilevel Analytic Studies with Applications to a Study of Air Pollution." Environ Health Perspect 102 Suppl 8 (1994): 25-32. Print.
Ng, C. C., et al. "A Genome-Wide Association Study Identifies Itga9 Conferring Risk of Nasopharyngeal Carcinoma." J Hum Genet 54.7 (2009): 392-7. Print.
Norris, J. M., et al. "Genome-Wide Association Study and Follow-up Analysis of Adiposity Traits in Hispanic Americans: The Iras Family Study." Obesity (Silver Spring) 17.10 (2009): 1932-41. Print.
O'Seaghdha, C. M., et al. "Common Variants in the Calcium-Sensing Receptor Gene Are Associated with Total Serum Calcium Levels." Hum Mol Genet 19.21 (2010): 4296-303. Print.
Org, E., et al. "Genome-Wide Scan Identifies Cdh13 as a Novel Susceptibility Locus Contributing to Blood Pressure Determination in Two European Populations." Hum Mol Genet 18.12 (2009): 2288-96. Print.
Palmer, N. D., et al. "Candidate Loci for Insulin Sensitivity and Disposition Index from a Genome-Wide Association Analysis of Hispanic Participants in the Insulin Resistance Atherosclerosis (Iras) Family Study." Diabetologia 53.2 (2010): 281-9. Print.
Panoutsopoulou, K., et al. "Insights into the Genetic Architecture of Osteoarthritis from Stage 1 of the Arcogen Study."
Ann Rheum Dis 70.5 (2011): 864-7. Print. Pasaniuc, B., et al. "Inference of Locus-Specific Ancestry in Closely Related Populations." Bioinformatics 25.12 (2009): i213-21. Print. Pasaniuc, B., et al. "Enhanced Statistical Tests for Gwas in Admixed Populations: Assessment Using African Americans from Care and a Breast Cancer Consortium." PLoS Genet 7.4 (2011): e1001371. Print. Patterson, N., et al. "Methods for High-Density Admixture Mapping of Disease Genes." Am J Hum Genet 74.5 (2004): 979-1000. Print. Pillai, S. G., et al. "A Genome-Wide Association Study in Chronic Obstructive Pulmonary Disease (Copd): Identification of Two Major Susceptibility Loci." PLoS Genet 5.3 (2009): e1000421. Print. Price, A. L., et al. "Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies." Nat Genet 38.8 (2006): 904-9. Print. Price, A. L., et al. "A Genomewide Admixture Map for Latino Populations." Am J Hum Genet 80.6 (2007): 1024-36. Print. Price, A. L., et al. "Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations." PLoS Genet 5.6 (2009): e1000519. Print. Pritchard, J. K., M. Stephens, and P. Donnelly. "Inference of Population Structure Using Multilocus Genotype Data." Genetics 155.2 (2000): 945-59. Print. Pulit, S. L., B. F. Voight, and P. I. de Bakker. "Multiethnic Genetic Association Studies Improve Power for Locus Discovery." PLoS One 5.9 (2010): e12600. Print. Qin, H., et al. "Interrogating Local Population Structure for Fine Mapping in Genome- Wide Association Studies." Bioinformatics 26.23 (2010): 2961-8. Print. Reibman, J., and M. Liu. "Genetics and Asthma Disease Susceptibility in the Us Latino Population." Mt Sinai J Med 77.2 (2010): 140-8. Print. 114 Reilly, M. P., et al. "Identification of Adamts7 as a Novel Locus for Coronary Atherosclerosis and Association of Abo with Myocardial Infarction in the Presence of Coronary Atherosclerosis: Two Genome-Wide Association Studies." 
Lancet 377.9763 (2011): 383-92. Print. Rich, S. S., et al. "A Genome-Wide Association Scan for Acute Insulin Response to Glucose in Hispanic-Americans: The Insulin Resistance Atherosclerosis Family Study (Iras Fs)." Diabetologia 52.7 (2009): 1326-33. Print. Risch, N., et al. "Categorization of Humans in Biomedical Research: Genes, Race and Disease." Genome Biol 3.7 (2002): comment2007. Print. Rosenberg, N. A., et al. "Genome-Wide Association Studies in Diverse Populations." Nat Rev Genet 11.5 (2010): 356-66. Print. Ryu, E., et al. "Genome-Wide Association Analyses of Genetic, Phenotypic, and Environmental Risks in the Age-Related Eye Disease Study." Mol Vis 16 (2010): 2811-21. Print. Salanti, G., S. Sanderson, and J. P. Higgins. "Obstacles and Opportunities in Meta- Analysis of Genetic Association Studies." Genet Med 7.1 (2005): 13-20. Print. Salari, K., et al. "Genetic Admixture and Asthma-Related Phenotypes in Mexican American and Puerto Rican Asthmatics." Genet Epidemiol 29.1 (2005): 76-86. Print. Satake, W., et al. "Genome-Wide Association Study Identifies Common Variants at Four Loci as Genetic Risk Factors for Parkinson's Disease." Nat Genet 41.12 (2009): 1303-7. Print. Satten, G. A., W. D. Flanders, and Q. Yang. "Accounting for Unmeasured Population Substructure in Case-Control Studies of Genetic Association Using a Novel Latent- Class Model." Am J Hum Genet 68.2 (2001): 466-77. Print. Saxena, R., et al. "Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels." Science 316.5829 (2007): 1331-6. Print. Seldin, M. F., et al. "European Population Substructure: Clustering of Northern and Southern Populations." PLoS Genet 2.9 (2006): e143. Print. Serre, D., et al. "Correction of Population Stratification in Large Multi-Ethnic Association Studies." PLoS One 3.1 (2008): e1382. Print. Setakis, E., H. Stirnadel, and D. J. Balding. "Logistic Regression Protects against Population Structure in Genetic Association Studies." 
Genome Res 16.2 (2006): 290- 6. Print. 115 Shriner, D., A. Adeyemo, and C. N. Rotimi. "Joint Ancestry and Association Testing in Admixed Individuals." PLoS Comput Biol 7.12 (2011): e1002325. Print. Shtir, C. J., et al. "Variation in Genetic Admixture and Population Structure among Latinos: The Los Angeles Latino Eye Study (Lales)." BMC Genet 10 (2009): 71. Print. Shu, X. O., et al. "Identification of New Genetic Risk Variants for Type 2 Diabetes." PLoS Genet 6.9 (2010). Print. Simon-Sanchez, J., et al. "Genome-Wide Association Study Reveals Genetic Risk Underlying Parkinson's Disease." Nat Genet 41.12 (2009): 1308-12. Print. Simon-Sanchez, J., et al. "Genome-Wide Association Study Confirms Extant Pd Risk Loci among the Dutch." Eur J Hum Genet (2011). Print. Smith, M. W., et al. "A High-Density Admixture Map for Disease Gene Discovery in African Americans." Am J Hum Genet 74.5 (2004): 1001-13. Print. Song, H., et al. "A Genome-Wide Association Study Identifies a New Ovarian Cancer Susceptibility Locus on 9p22.2." Nat Genet 41.9 (2009): 996-1000. Print. Stacey, S. N., et al. "Ancestry-Shift Refinement Mapping of the C6orf97-Esr1 Breast Cancer Susceptibility Locus." PLoS Genet 6.7 (2010): e1001029. Print. Tan, L., et al. "A Genome-Wide Association Analysis Implicates Sox6 as a Candidate Gene for Wrist Bone Mass." Sci China Life Sci 53.9 (2010): 1065-72. Print. Tanaka, Y., et al. "Genome-Wide Association of Il28b with Response to Pegylated Interferon-Alpha and Ribavirin Therapy for Chronic Hepatitis C." Nat Genet 41.10 (2009): 1105-9. Print. Tenesa, A., et al. "Genome-Wide Association Scan Identifies a Colorectal Cancer Susceptibility Locus on 11q23 and Replicates Risk Loci at 8q24 and 18q21." Nat Genet 40.5 (2008): 631-7. Print. Teslovich, T. M., et al. "Biological, Clinical and Population Relevance of 95 Loci for Blood Lipids." Nature 466.7307 (2010): 707-13. Print. Thomas, D. C., and J. S. Witte. 
"Point: Population Stratification: A Problem for Case- Control Studies of Candidate-Gene Associations?" Cancer Epidemiol Biomarkers Prev 11.6 (2002): 505-12. Print. Tian, C., et al. "Analysis and Application of European Genetic Substructure Using 300 K Snp Information." PLoS Genet 4.1 (2008): e4. Print. Tomlinson, I. P., et al. "A Genome-Wide Association Study Identifies Colorectal Cancer Susceptibility Loci on Chromosomes 10p14 and 8q23.3." Nat Genet 40.5 (2008): 623-30. Print. 116 Tomlinson, I., et al. "A Genome-Wide Association Scan of Tag Snps Identifies a Susceptibility Variant for Colorectal Cancer at 8q24.21." Nat Genet 39.8 (2007): 984- 8. Print. Tsai, F. J., et al. "Identification of Novel Susceptibility Loci for Kawasaki Disease in a Han Chinese Population by a Genome-Wide Association Study." PLoS One 6.2 (2011): e16853. Print. Tse, K. P., et al. "Genome-Wide Association Study Reveals Multiple Nasopharyngeal Carcinoma-Associated Loci within the Hla Region at Chromosome 6p21.3." Am J Hum Genet 85.2 (2009): 194-203. Print. Unoki, H., et al. "Snps in Kcnq1 Are Associated with Susceptibility to Type 2 Diabetes in East Asian and European Populations." Nat Genet 40.9 (2008): 1098-102. Print. Van Laer, L., et al. "A Genome-Wide Association Study for Age-Related Hearing Impairment in the Saami." Eur J Hum Genet 18.6 (2010): 685-93. Print. Via, M., et al. "The Role of Lta4h and Alox5ap Genes in the Risk for Asthma in Latinos." Clin Exp Allergy 40.4 (2010): 582-9. Print. Voight, B. F., et al. "Twelve Type 2 Diabetes Susceptibility Loci Identified through Large-Scale Association Analysis." Nat Genet 42.7 (2010): 579-89. Print. Wacholder, S., N. Rothman, and N. Caporaso. "Population Stratification in Epidemiologic Studies of Common Genetic Variants and Cancer: Quantification of Bias." J Natl Cancer Inst 92.14 (2000): 1151-8. Print. Wallace, C., et al. 
"Genome-Wide Association Study Identifies Genes for Biomarkers of Cardiovascular Disease: Serum Urate and Dyslipidemia." Am J Hum Genet 82.1 (2008): 139-49. Print. Wang, F., et al. "Genome-Wide Association Identifies a Susceptibility Locus for Coronary Artery Disease in the Chinese Han Population." Nat Genet 43.4 (2011): 345-9. Print. Wang, K., et al. "Integrative Genomics Identifies Lmo1 as a Neuroblastoma Oncogene." Nature 469.7329 (2011): 216-20. Print. Wang, X., et al. "Adjustment for Local Ancestry in Genetic Association Analysis Ofadmixed Populations." Bioinformatics (2010). Print. Waters, K. M., et al. "Consistent Association of Type 2 Diabetes Risk Variants Found in Europeans in Diverse Racial and Ethnic Groups." PLoS Genet 6.8 (2010). Print. 117 Wijsman, E. M., et al. "Genome-Wide Association of Familial Late-Onset Alzheimer's Disease Replicates Bin1 and Clu and Nominates Cugbp2 in Interaction with Apoe." PLoS Genet 7.2 (2011): e1001308. Print. Xiong, D. H., et al. "Genome-Wide Association and Follow-up Replication Studies Identified Adamts18 and Tgfbr3 as Bone Mass Candidate Genes in Different Ethnic Groups." Am J Hum Genet 84.3 (2009): 388-98. Print. Yamada, Y., et al. "Identification of Celsr1 as a Susceptibility Gene for Ischemic Stroke in Japanese Individuals by a Genome-Wide Association Study." Atherosclerosis 207.1 (2009): 144-9. Print. Yasuda, K., et al. "Variants in Kcnq1 Are Associated with Susceptibility to Type 2 Diabetes Mellitus." Nat Genet 40.9 (2008): 1092-7. Print. Yoon, K. A., et al. "A Genome-Wide Association Study Reveals Susceptibility Variants for Non-Small Cell Lung Cancer in the Korean Population." Hum Mol Genet 19.24 (2010): 4948-54. Print. Zanke, B. W., et al. "Genome-Wide Association Scan Identifies a Colorectal Cancer Susceptibility Locus on Chromosome 8q24." Nat Genet 39.8 (2007): 989-94. Print. Zhang, X. J., et al. "Psoriasis Genome-Wide Association Study Identifies Susceptibility Variants within Lce Gene Cluster at 1q21." 
Nat Genet 41.2 (2009): 205-10. Print. Zheng, C., and R. C. Elston. "Multipoint Linkage Disequilibrium Mapping with Particular Reference to the African-American Population." Genet Epidemiol 17.2 (1999): 79-101. Print. Zhu, X., et al. "Combined Admixture Mapping and Association Analysis Identifies a Novel Blood Pressure Genetic Locus on 5p13: Contributions from the Care Consortium." Hum Mol Genet 20.11 (2011): 2285-95. Print.
Abstract
Association studies among admixed populations pose many challenges. The purpose of this study is to compare the methods for ancestry estimation and to investigate the control for confounding and the capture of heterogeneity in SNP effect by the use of individual ancestries. In addition, a general regression framework is proposed to perform admixture mapping for both case-only and case-control study designs among admixed populations. For confounding and heterogeneity, simulation results indicate that 1) adjustment for global ancestry can control for confounding
Asset Metadata
Creator
Liu, Jinghua (author)
Core Title
Population substructure and its impact on genome-wide association studies with admixed populations
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
08/01/2012
Defense Date
04/27/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
admixed population, admixture mapping, confounding, genetic association study, GWAS, heterogeneity, OAI-PMH Harvest, population stratification
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Conti, David V. (committee chair), Gauderman, William James (committee member), Gilliland, Frank D. (committee member), Knowles, James (committee member), Thomas, Duncan C. (committee member)
Creator Email
jinghliu@gmail.com,LiuJi3@humgen.ucsf.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-78608
Unique identifier
UC11289260
Identifier
usctheses-c3-78608 (legacy record id)
Legacy Identifier
etd-LiuJinghua-1089.pdf
Dmrecord
78608
Document Type
Dissertation
Rights
Liu, Jinghua
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA