Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Statistical methods and analyses in the Multiethnic Cohort (MEC) human gut microbiome data
(USC Thesis Other)
Statistical methods and analyses in the Multiethnic Cohort (MEC) human gut microbiome data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
1
Statistical Methods and Analyses in the Multiethnic Cohort (MEC)
Human Gut Microbiome Data
by
Kechen Zhao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY (BIOSTATISTICS)
May 2019
Copyright 2019 Kechen Zhao
2
Table of Contents
ACKNOWLEDGEMENTS……………………………….……………………………………...3
ABSTRACT………………………………………………………………………………………4
CHAPTER 1: INTRODUCTION TO THE MICROBIOME………………………………….....7
CHAPTER 2: METHOD COMPARISONS AND DEVELOPMENT FOR MICROBIAL DATA
ANALYSIS………………………………………………………………………………………25
CHAPTER 3: ASSOCIATIONS OF MICROBIAL ABUNDANCE WITH BASELINE AND
ADDITIONAL COVARIATES………………………………………………………………... 53
CHAPTER 4: COMPUTATION AND INTERPRETATION OF ANCESTRY ESTIMATES...64
CHAPTER 5: THE MULTIETHNIC COHORT (MEC) MICROBIAL GENOME-WIDE
ASSOCIATION STUDY………………………………………………………………………..76
CHAPTER 6: CONCLUSIONS AND FUTURE PLANS………………………………………89
REFERENCES: …………………………………………………………………………………91
3
Acknowledgements
I would like to express my deepest gratitude to my Ph.D. mentor and other committee members
for their kind assistance, especially to Dr. Daniel Oscar Stram for his enlightening mentorship,
insightful guidance and generous support in walking me through my thesis. I also want to thank
Drs. Iona Chang, Meredith Hullar, and Timothy Randolph for the great collaboration experience.
I also appreciate valuable comments from my committee members Drs. Wendy Cozen, Kimberley
Siegmund and Mariana Stern. I want to thank Dr. Emily Putnam-Hornstein for offering me the
opportunity to participant statistical research in social science and for her generous support in the
past.
4
Abstract
The thesis is about the development and implementation of statistical methods for an association
study between microbial abundance and baseline covariates (e.g. sex, race, age) and environmental
and other covariates of interest (e.g. diet, lifestyle, medication use) as well as a large-scale Genome
Wide Association Study (GWAS). Chapter 1 provides background information about the
microbiome, current evidences of microbial associations with human health and disease conditions.
It also provides reviews of current bioinformatics tools for raw microbial data processing and
existing statistical analysis methods for downstream analysis. Chapter 2 covers the existing
statistical models used in microbial data analysis. The goal of this chapter is to compare model
performance and select a statistically powerful, flexible and scalable method for our analysis.
Under the assumption that existing methods would yield proper Type 1 error rate as the methods
are all well recognized and established in the past, we can base the analysis of power on fitting the
models of interest on a set of representative bacteria and computing likelihood ratio statistics for
each variable in the models considered. The models of interest include methods for dealing with
excess zeros, in addition we add to the list, the ordinal logistic model (as implemented in SAS and
in various R packages). The ordinal model has some conveniences in interpretation of the
parameter estimates (as a usual log odds ratio). Chapter 2 contains three analyses, first is a power
and computational time comparison for the selected models. What we discovered was that there is
no uniformly most powerful method from among those tested but that the logistic model was
reasonably competitive in terms of power, used 1/3 of the computation time compared to hurdle
or zero-inflated models, and required considerably less model tuning than these methods. The
second part of chapter two developed the mathematical formulae for the computationally efficient
score test from the logistic regression model. The score test is an approximation to the full
likelihood ratio test. In the third part we also considered two other, residual-based, approximation
5
methods, which are even more computationally efficient than the proposed score test procedure.
We found that overall the residual methods were slightly conservative compared to the full ordinal
logistic model, but with -log10 transformed p-values which were very highly correlated between
residual methods and full ordinal logistic regression methods; moreover, we found that we could
use genomic control to inflate the p-values from the residual methods to overcome their
conservative tendencies.
Chapter 3 of the thesis provides results for associations between alpha diversity and individual
microbial abundances and environmental covariates of interest. Of approximately 4000
associations tested 114 were globally significant by the Bonferroni criteria. One striking finding
is that three healthy diet score variables (the dietary approaches to stop hypertension, DASH, score,
the Alternative Healthy Eating Index, AHEI, and the alternative Mediterranean diet score, AMDS)
were all strongly positively associated with increased alpha diversity and with positive or negative
changes in abundances of a large number of individual microbial genera. Among other types of
variables, smoking and diabetes were strongly associated with many of the abundance variables.
A GWAS analysis is given in Chapter 5, prior to running the GWAS, we needed to compute
genetic ancestry information such as principal components and direct estimates of genetic ancestry
and admixture in the association models to control for confounding or cryptic effects association
testing. This is the topic of Chapter 4. Because there is interest in whether genetic admixture is
informative about microbiome composition, in the GWAS we estimated genetic ancestries and
attached ancestry labels to each of the eight estimated components; we then included these
variables as adjustment variables (in lieu of either self-reported race, or principal components) in
the association analysis. One important aspect of Chapter 4 was to define the ethnic heritage origins
for each of the eight estimated ancestry proportions. From a number of different analyses we were
6
able to definitively distinguish admixture components coming from Northern versus Southern
Europeans, Mainland Japanese versus Okinawan Japanese, Native Hawaiian versus Hawaiian
Filipino/Chinese, and Native American and African components
In Chapter 5, we present selected results of performing the full GWAS to examine the associations
of 104 bacterial genera with 3 million SNPs. At least 20 of the genera/SNP associations were
significant at the 5x10
-8
level, compared to an expected number of approximately 104*0.05 = 5.2
under the complete null.
Chapter 6 provides conclusions and summary of this thesis and outlines future work.
7
CHAPTER 1: INTRODUCTION TO THE MICROBIOME
1.1 Human Gut Microbiome
Microbiota is the collection of microbial organisms (bacteria, archaea, microbial eukaryotes and
viruses) from a specific site or environment, such as human gut. (Hall, Tolonen, & Xavier, 2017).
It is reported that there are about 3.3 million unique genes in the human microbiome, which far
exceeds the human genome made up of about 22,000 genes (Qin et al., 2010, 2012). Variations in
microbiome among individuals are immensely more diverse than human genomic variations.
Humans are 99.9% genetically identical to one another but they can be 80%-90% different to
another in the composition of their microbiome of gut or other body sites(Navas-Molina et al.,
2013; Ursell, Metcalf, Parfrey, & Knight, 2012). Recent development in microbiome research
provide increasing evidences that reveal strong links between host microbiome and important
human diseases, such as cancer, obesity, diabetes, IBD and Alzheimer’s Disease (Bull & Plummer,
2014; Harach et al., 2017; Hartstra, Bouter, Bäckhed, & Nieuwdorp, 2015; Holmes, Harris, &
Quince, 2012; John & Mullin, 2016; Rajagopala et al., 2017; T. Harach, N. Marungruang, N.
Dutilleul, V. Cheatham, K. D. Mc Coy, J. J. Neher, M. Jucker, F. Fåk, T., 1989). These findings
shed cues to the direction that taking into account of the variations harbored within the microbiome
may be more beneficial to personalized medicine, than approaches looking at the mostly
unchanged host genome alone.
1.2 Microbiome Phenotyping and Next Generation Sequencing (NGS) Technologies
Until recently, scientific research activities in the microbiome has been largely limited due to
technical challenges in obtaining microbiome data from a sample. For example, in the past
8
microorganisms obtained from a specimen were once examined by culturing them. Yet, many of
the bacterial strains are not culturable when taken outside from its original sample environment
(Sharpton, 2014). As Cho and Blaser in their 2012 Nature Genetics Review notes, with the recent
advance in nucleotide sequencing technologies, direct sequencing of microbiome community from
host environment can readily be obtained and microbiome data is now being generated in an
unprecedented scale. The culture-free sequencing technology allows researchers to explore the
diversity and functional aspects of unculturable microorganisms associated with host microbial
community. This new development in technology coupled with advancing data analysis
tools/methods is expected to help researchers answer important but yet unaddressed questions
about human microbiome and the interactions between human microbiome with host genetics and
disease conditions that would otherwise never be possible to study in the past (Cho & Blaser, 2012).
The two mainstreaming methodologies in microbiome sequencing are the targeted 16s ribosomal
DNA (rDNA) amplicon-based method and the shotgun metagenomics methods. Each of the two
methods generates different type of microbiome data and help to answer different research
questions. Current solutions to sequencing microbiome community include ABI 3730, 454 FLX
Titanium, 454 FLX+, Illumina GAIIx, Illumina HiSeq 2000, Illumina MiSeq, PacBio and
IonTorrent (Kuczynski et al., 2012). These sequencing platforms fall into three categories of
sequencing technologies: capillary sequencing (e.g. ABI 3730), pyrosequencing (e.g. Roche 454,
FLX and FLX Titanium) and Illumina’s clonal arrays (e.g. Illumina GAIIx and Illumina HiSeq
2000) (Kuczynski et al., 2012). While capillary method is associated with high-throughput features,
the pyrosequencing and Illumina platforms are considered as the ultra-high throughput next-
generation sequencing that provides much higher sequencing coverage (Weinstock, 2012).
9
The amplicon based methods utilize one or a few genetic biomarkers to unfold the identities and
diversity of the microbial communities in the sample. The most popular biomarker in amplicon
based methods is the 16s rRNA gene. As described by many authors (Kuczynski et al., 2012;
Sharpton, 2014), the 16s rRNA gene contains both phylogenetically conservative and informative
regions. The 16s rRNA gene is an essential piece of genetic information that instructs the coding
of rRNA, a key component of bacterial ribosome, which is required for bacterial survival.
Functionally, the 16s rDNA contains highly conserved regions in all living organisms, which is
considered a perfect biomarker to detect presence of bacterial in collected samples. On the other
hand, the 16s rRNA gene also contains highly evolving regions that allows ones to differentiate
and classify sequencing reads into specific taxonomic levels. One important issue in conducting
amplicon based microbiome sequencing is the proper choice of polymerase chain reaction (PCR)
primer that amplifies the targeted amplicon. Poor choice of primer could lead to amplification bias
that produces misleading sequencing data not representing the true compositions of the
microbiome community under investigation. The ideal primer set should provide enough coverage
to detect a broad spectrum of bacteria in host-associated microbial communities (Kuczynski et al.,
2012; Sharpton, 2014).
As described by Kuczynski et al., 2012 and Sharpton, 2014, unlike the amplicon based method
that only sequences targeted genomic regions (such as the 16s RNA gene), the shotgun
metagenomics method comprehensively sequences all the genes from all organisms and cells
contained in a sample. The shotgun metagenomics sequencing begins by DNA extraction from all
organisms (such as viruses, bacterial and host cells) in a sample. The extracted DNA then
undergoes a process during which genomic DNA in full length are shredded into smaller DNA
fragments. Each fragments is then independently sequenced and the resulted sequence is called a
10
read. Each read is then compared against or aligned to reference genome such as public genome
databases (such as HGP) to reveal its origin. These reads could come from host cells or non-host
organisms such as viruses and bacteria. For microbiome-related reads, the annotated reads not only
include highly phylogenetically informative markers (such as the 16s rRNA gene) but also include
genes that code proteins. The 16s rRNA gene helps to estimate the microbial composition and
diversity in the sample. The genes that code proteins help to identify types and abundances of the
metabolic pathways in the sample. As a result, the shotgun metagenomics data help researchers
to simultaneously explore two aspects of microbiome: what types of microorganisms are in the
sample; and what biological functions are encoded by these microorganisms (Kuczynski et al.,
2012; Sharpton, 2014).
1.3 Disadvantage and Advantages of 16s based method and shotgun based method
Both 16s RNA gene amplicon method and shotgun metagenomics method have their own strengths
and disadvantages inherited from their experimental designs. One noticeable advantage of 16s
rRNA gene is that it is present in all living organisms and this unique feature helps detect the
presence of a broad spectrum of microbial organisms (Kuczynski et al., 2012). Another strength
of using 16s rRNA gene is that large database and reference sequences of 16s rRNA gene are
abundantly available (such as greengenes, SILVA, and the Ribosomal Database Oproject)
(Kuczynski et al., 2012). Having informative database and reference sequences makes 15s rRNA
gene a better marker than others in facilitating the process of taxonomy assignments. While
popular and powerful, the 16s rRNA gene amplicon method has limitations. The amplicon based
16s rDNA target sequencing method is capable to differentiate or resolute microbiome sequence
reads at genus level but it may have difficulty to differentiate sequence reads at the species level
11
due to highly identical information in the 16s rRNA gene at species level (Jovel et al., 2016).
Poorly conducted PCR experiments could result in bias in estimating the diversity in microbiome
community. Incorrectly assembled and sequencing errors could introduce spurious sequencing
data that is hard to identify its taxonomy (Sharpton, 2014). 16s rRNA amplicon method produce
little or no information to reveal the biologically functional aspects of the microbiome community
(Sharpton, 2014). Using 16s rRNA alone or poorly selected primers could lead to fail to detect
highly diverged or novel microbial strains in the sample (Sharpton, 2014). Also, 16s rRNA method
could lead to overestimate of microbial community diversity because the 16s rRNA genes is
transferable between bacterial taxa (Sharpton, 2014).
As Sharpton and et al. noted the main advantage of using the shotgun metagenomics approaches
is that such data not only enables researchers to identify the composition and determine the
diversity of the microbial community in a sample, but it also allows one to explore the biological
functions associated with the microbial community in the sample. In a nutshell, the metagenomics
approach enables one to answer two important questions: who are there in the sample; what
biologically functional potentials do they have? (Sharpton, 2014). Another major advantage is that
the metagenomics method is able to resolute sequence reads at species level. Recall, from the
precious paragraph, that the 16s rRNA gene method often has difficulty to provide resolution at
the species level due to highly identical information in the 16s rRNA gene at species level. The
metagenomics approach has an edge over the 16s rRNA gene method in providing higher
resolution at the species level. This is achieved by synthesizing all information from the assembled
genomic sequence other than using the information from 16s rRNA alone. While powerful, the
shotgun metagenomics method is not free of limitations. First, the process and analysis of shotgun
metagenomics data is rather complex because it generates massive sequence reads from all DNA
12
fragments in the sample. Secondly, the totality of metagenomes in the sample contains unwanted
DNA (e.g., host DNA). If the wanted unwanted DNA overwhelms the microbiota DNA, special
molecular techniques must be applied to enrich the microbiota DNA before sequencing. Third,
contaminant genomes in the sample could result in misleading diversity measurements of
microbial community (Sharpton, 2014). Forthly, while offering some insights into the functional
aspects of the microbial community in the sample, the shotgun metagenomics approach cannot
fully elucidate all the functional activities in the microbial community. To fully capture the
functional activities in the microbial community would most likely to require other omics data
such as proteomic and metabolomics data (Franzosa et al., 2015). Lastly, the metagenomics
experiment is more expansive than the 16s rRNA gene amplicon method.
1.4 Bioinformatics pipeline (QIIME)
With the recent advances in sequencing technologies, microbiome community from a host or
environmental sample can now be studied by culture-free sequencing technologies. The high
throughput NGS technology (e.g., Roche 454, Illumina’s MiSeq and HiSeq) generates billions of
sequence reads in a single run. To make practical use of these massive fast accumulating sequence
data, it requires efficient and scalable bioinformatics and computational tools. Several software
solutions have been developed and made available to the public. These software include VAMPS
(Navas-Molina et al., 2013), mothur, the RDP tools, ARB and QIIME (Quantitative Insights Into
Microbial Ecology). In the proposal, I will emphasize the use of QIIME pipeline because we have
used QIIME in our study to process raw microbiome sequence reads. QIIME is an open-source
bioinformatics pipeline that provides solutions to both upstream and downstream analysis of the
raw microbiome sequence data. QIIME combines commonly used published methods or third-
13
party tools and implements a variety of methods for calculating diversity, plotting data
visualization and performing statistical analyses. The upstream analysis usually consists of
multiple steps conducted in sequence such as demultiplexing, primer removals, quality-filtering,
OTU (operational taxonomy unit) picking, taxonomy assignment, sequence alignment, phylogeny
construction, constructing OTU tables. The downstream analysis are usually conducted based on
the processed data from upstream analysis. The type of downstream analysis depends on the nature
of the research questions. Typical downstream analysis consists of some of but not limited to the
following analysis: alpha-diversity analysis, beta-diversity analysis, OUT heatmaps, regression
based methods that relates diversity measurements or single OUT information to some covariates
of interest (e.g. disease status, phenotypes, genotypes, nutrition / medication use, sex, ethnicity)
(Navas-Molina et al., 2013)
14
Figure 1.1 The paradigm of bioinformatics processing in QIIME pipeline (Navas-Molina et al.,
2013).
15
1.5 Microbiome and Human Diseases and Conditions
As described by the review (Jandhyala et al., 2015) increasing scientific evidences have revealed
strong relationship between normal gut microbiome and human health. The human gut microbiome
plays important roles in synthesizing important metabolites, regulating immunological activities,
susceptibility to diseases (e.g. cancer, Alzheimer’s disease, IBD), drug and xenobiotic metabolism,
pathogen regulation and maintaining the structure and function of gut mucosal barrier. Infant gut
flora starts to appear like adult gut flora by age of three. Yet, temporal and spatial variation in the
diversity and composition of the gut microbiome remains throughout one’s adulthood. It have been
shown or believed that the diversity and composition of human microbiome are affected by host
environmental and genetic factors and the interactions among them: mode of deliver at birth
(vaginal or caesarean), host diet style during infancy (breast feeds or formula feeds) and adulthood
(vegan or meat based), use of medications (antibiotics), and host genetics (Jandhyala et al., 2015).
Involvement of the gut microbiome in metabolism
Jandhyala and et al. also reviewed the microbial involvement in multiple physiological
metabolisms that relates to carbohydrate, protein, vitamins and bile acid transformation. The
human gut microbiome has important impacts on metabolisms of carbohydrate, protein, vitamins
and bile acid transformation (Jandhyala et al., 2015). The undigested carbohydrates and
indigestible oligosaccharides are left to be fermented by organisms such as Enterobacteria,
Bifidobacterium, Roseburia, Fecalibacterium, and Bacteroides in colon. The resulted products are
the short chain fatty acids (e.g. butyrate, propionate and acetate), which are very energy-rich
metabolites to host. This carbohydrates fermentation process is performed with aids of enzymes
(e.g. glycosyl transferases, glycoside hydrolases and polysaccharide lyases) that are expressed
16
predominately by species of Genus Bacteroides. Genome of some members in genus Bacteroides
even encodes more hydrolases that digest carbohydrates than the human genome. Organisms such
as Oxalobacter formigenes, Lactobacillus species, and Bifidobacterium species synthesize oxalate
in colon, which in turn reduces the risk of oxalate stone formation in the kidney. The gut
microbiome also uses proteinases and peptidases to metabolize protein. Special transporters on
bacterial wall facilitates the entry of the amino acid from intestinal lumen into bacterial cell. Inside,
the amino acids were converted into other molecules functioning as signaling molecules and
antimicrobial peptides to suppress the colonization of other bacteria (Jandhyala et al., 2015).
The gut microbiome also synthesizes vitamin K and some components of vitamin B. In addition,
member of genus Bacteriodes has been found to generate conjugated linoleic acid, which has been
shown to be antidiabetic and influence immunomodulatory properties (Jandhyala et al., 2015).
Some members of genus Bacteroides have been shown to have the capability to transform the bile
acid by deconjugating and dehydrating the primary bile acids and converting them into secondary
bile acids (Jandhyala et al., 2015).
1.6 The role of Gut microbiome in immunity and drug metabolism
Since 1970s, it has been increasingly recognized that the gut microbiome plays important roles in
xenobiotic and drug metabolism (A. Das, Srinivasan, Ghosh, & Mande, 2016). For example, the
microbial enzyme beta-glucoronidase causes deconjugation of the anticancer drug irinotecan that
induces adverse events such as diarrhea and anorexia. These findings have profound implications
in future practices of treating certain diseases.
17
The human intestinal mucus layer provides protection to keep the pathogens away from epithelial
contact. In addition to this layer of physical protection, it has been shown that gut microbiota
possesses antimicrobial function that suppresses pathogenic colonization by producing
antimicrobial proteins such as cathelicidins and C-type lectins. Another mechanism that gut
microbiota used to prevent pathogenic outgrowth is to induce local immunoglobulins (Jandhyala
et al., 2015).
1.7 The microbiome and host immunology
As Berer et al., 2017; Jandhyala et al., 2015; Rooks & Garrett, 2016; Shreiner, Kao, & Young,
2016; Zitvogel, Daillère, Roberti, Routy, & Kroemer, 2017 noted, increasing research has shown
that the gut microbiota community plays important roles in shaping host immunological
development and modulation. Human gastrointestinal track and the microbial communities it
harbors has evolved intricate mutualistic and commensal relationship. The dynamic and active
dialogue between gastrointestinal track and the gut microbiome is important in developing and
maintaining homeostasis. The gut microbiota and its components and metabolites synthesized
form host-microbiota interaction coordinates normal immunological functions and dysfunctions.
The immune modulation is in part regulated by metabolites synthesized by gut microbiota, such
as short-chain fatty acids, aryl hydrocarbon receptor ligands and polyamines. The immune
modulation is also regulated by microbial compoenents, such as Polysaccharide A, Formyl
peptides and heptose-1,7-bisphospate (HBP). A significant physiological perturbation in host or
the gut microbiota could induce disturbance to the mutually beneficial relationship between host
and gut microbiota, which could result in dysbiosis. Dysbiosis is characterized as the imbalance in
the composition of the microbial community and its functions, which has been shown to be
18
associated with the change in immunological functions and susceptibility to immune-related
disease conditions such as allergies, inflammatory diseases and Type-2 diabetes. Evidence has also
shown that the gut microbiota plays roles in developing and maintaining the structure and integrity
of gastrointestinal track. For example, Bacteroides thetaiotaomicron, Lactobacillus rhamnosus and
Akkermansia muciniphilia have been shown to play roles in maintenance of desmosomes,
preventing cytokine induced apoptosis of the intestinal epithelial wall, maintenance and control of
gut barrier functions (Berer et al., 2017; Jandhyala et al., 2015; Rooks & Garrett, 2016; Shreiner,
Kao, & Young, 2016; Zitvogel, Daillère, Roberti, Routy, & Kroemer, 2017).
1.8 Human Health
The composition of gut microbiota and its interaction with host have profound impacts on human
health. It has been long recognized that gut microbiota is associated with gastrointestinal track
related conditions such as diarrhea, Irritable Bowel Syndrome (IBS) and Inflammatory Bowel
Disease (IBD). Recently, it has been increasingly recognized that gut microbiome plays important
roles in immune- and metabolic-mediated diseases such as Alzheimer’s disease, cancer and autism,
obesity, Type 2 Diabetes, and multiple sclerosis (Bull & Plummer, 2014; Harach et al., 2017;
Hartstra et al., 2015; Holmes et al., 2012; John & Mullin, 2016; Rajagopala et al., 2017; T. Harach,
N. Marungruang, N. Dutilleul, V. Cheatham, K. D. Mc Coy, J. J. Neher, M. Jucker, F. Fåk, T.,
1989). The Escherichia coli (E. coli) strain O157:H7 and Clostridium difficile are pathogenic to
human and induce inflammation, bleeding and diarrhea, which could escalate to life-threatening
conditions in some affected individuals. For example, the epidemic of another E. coli strain
O104:H4 in 2011 left hundreds hospitalized and caused several mortalities in Germany (Chen,
2016). As Bull and Plummer noted, Irritable bowel syndrome (IBS) is defined as a range of bowel
19
conditions associated with abdominal pain that is related to a change in bowel functionality and
habits. It is estimated that IBS affects approximately 10%-20% of the worldwide population. The
cause of IBS is complex and is thought to be multifactorial, including genetics, infection,
inflammation, immunity, motor dysfunction of the GIT and microbiome. A disturbance to the
composition of gut microbiome is thought to be associated with the low-grade intestinal
inflammation in IBS. The perturbation to gut microbiota composition would cause dysbiosis in the
gastrointestinal track (GIT) (Bull & Plummer, 2014). Dysbiosis in the GIT may facilitate the
adherences of pathogens that is associated with IBS in walls of GIT. Recent evidences support an
alteration in the composition of microbiota is found in IBS patients. Ponnusamy K, Choi JN, et al.
found the ratio of Firmicutes to Bacteroidetes is increased by 2-fold in IBS patients. A more recent
study found that decreased richness of phylum Bacteoidetes (P=0.007) and the genus Bacillus
(P=0.008) in the IBS-overweight participants (Fourie et al., 2016; Kim et al., 2017). As Bull and
Plummer noted, similar to IBS, Inflammatory Bowel Disease (IBD) is also thought to be associated
with dysbiosis induced by great alteration in the relative abundance of particular microbial strains
or significant change in the composition of gut microbiome. It has been observed that the relative
abundances of members of Firmicutes is greatly reduced in IBD patients. The phylum Firmicutes
is well known to produce short chain fatty acids (e.g. acetate and butyrate) that have anti-
inflammatory properties (Bull & Plummer, 2014). Recently, an increasing body of evidences have
associated the gut microbiome with metabolic syndromes such as obesity and Type 2 diabetes
(T2D). As Hartstra et al. noted, both animal and human studies have found perturbed balance in
the composition of gut microbiome. Although inconclusive in the direction of effect, accumulating
research literatures have pointed to the alteration in the relative abundances of phylum
Bacteroidetes and Firmicutes in patients affected with obesity or T2D. A study have found fecal
20
microbiota transplantation from lean males to males with metabolic syndromes lead to significant
improvement in insulin sensitivity and an increase in microbial diversity, as well as increased
relative abundance in bacterial strains producing butyrate (Hartstra et al., 2015). As Hartstra et al.
noted, metabolites such as butyrate is linked to enhanced mitochondria activity that burns energy
more efficiently (Hartstra et al., 2015). Alzheimer’s disease (AD) is the most prevalent form of
dementia. The affected individuals suffer detrimental symptoms such short-term memory loss,
disorientation, inability of self-care, withdrawal from family and society, and eventually lead to
death. The AD conditions were irreversible once onset and will worsen in the affected as time
progresses. To make matter worse, Alzheimer’s disease remains incurable despite companies
including Eily Lilly and Co. and Merck and Co. conducted several large-scale, multi-billion USD
clinical trials in an attempt to treat or contain the progression of Alzheimer’s disease but only
resulted in failures. Therefore, it is more important than ever to understand the pathological roots
of the Alzheimer’s disease in order to develop effective treatments to AD. It has been well
established that the advancement of Alzheimer’s disease is highly correlated with the level of
cerebral amyloid-β peptide deposition. Recently, T. Harach and et al. conducted environmentally
controlled experiment in laboratory mice and found the expression of cerebral amyloid-β peptide
is highly affected by the gut microbiota. In the experiment, it is found that the abundance of phylum
Firmicutes and phylum Bacteroids are significantly altered in the APPPS1 transgenic mice. Genera
Allobaculum, Akkermansia were decreased and unclassified genera in Rikenellaceae and S24-7
were increased. Furthermore, they found cerebral amyloid-β peptide deposition was significantly
reduced in germ-free APP transgenic mice compared to control mice with gut microbiota. Inducing
gut microbiota to germ-free APP transgenic mice increased cerebral amyloid-β peptide deposition.
Collectively, the evidences suggested gut microbiota may play a role in amyloid-β peptide
21
pathology and may contribute to the development of neurodegenerative disease such as
Alzheimer’s disease (Harach et al., 2017; T. Harach, N. Marungruang, N. Dutilleul, V. Cheatham,
K. D. Mc Coy, J. J. Neher, M. Jucker, F. Fåk, T., 1989). Recent studies also found link and
involvement of gut microbiota in the development and pathogenesis of multiple sclerosis (MS), an
autoimmune disorder of the central nerve system. Kerstin Berer and et al. found the CNS-specific
autoimmunity incidence are higher in transgenic mice induced with gut microbiota from MS-
affected twins than those induced with gut microbiota from healthy co-twins (Berer et al., 2017).
Cekanaviciute, E. and et al. recently found increased abundance of both Akkermansia muciniphila
and Acinetobacter calcoaceticus in MS patients compared with healthy controls. In contrast, the
anti-inflammation stimulating strain Parabacteroides distasonis was found reduced in MS patients.
In addition, animal experiment shows that germ-free mice transplanted with microbiota from MS
patients resulted in severer autoimmune symptoms compared with mice transplanted with
microbiota from healthy controls. Collectively, epidemiological studies and experimental studies
in mice demonstrate evidences that global microbiome structure and specific microbial strains are
likely to play roles in regulating autoimmune responses and be involved in the pathogenesis of MS
(Berer et al., 2017; Cekanaviciute et al., 2017). Understanding the underlying involvement of
microbiome in the pathogenesis of MS would help lead to better therapeutic targets in treating MS.
Although the mechanisms by which microbiome affect the development of tumor and cancer are
not yet fully understood, scientific community starts to see the link between microbiome with
several cancers such as colorectal cancer. Researchers start to see some signature microbial
biomarkers in samples from cancer patients though it is yet unclear whether the microbial markers
are drivers of pathogenesis or as a result of pathogenesis. Using metagenomics profiling,
Rajagopala et al. discovered that two species in the phylum Fusobacterium, Porphyromonas
22
asaccharolytica and Peptostreptococcus stomatis, are overrepresented in the colorectal tumor
samples (Rajagopala et al., 2017). Further, metatranscriptome data showed that Fusobacterium,
Campylobacter, and Leptotrichia genera are highly enriched in colorectal cancer tumor samples
(Rajagopala et al., 2017). Several independent studies have found Fusobacterium is a risk factor
for colorectal adenomas and cancer. Other studies have associated genera Peptostreptococcus,
Prevotella, Parvimonas, Leptotrichia, Campylobacter and Gemella with the detection of colorectal
cancer (Rajagopala et al., 2017). As noted by Rajagopala et al., cancer can be also triggered by
specific pathogens like H. pylori that induces chronic gastric inflammation which can promote
gastric carcinoma. Other mechanisms by which microbiome promotes cancer pathogenesis include
inducing disruption of intestinal barrier integrity. Strains such as E. coli and Bacteroides fragilis
promote intestinal barrier failure by expression bacterial toxins, which in turn leads to circulation
of carcinogens through the disrupted gut. This leads to the development of intestinal
adenocarcinoma and tumors in distant body sites (Rajagopala et al., 2017). Recent research
development also sees links between microbiome and autism. F. Strati and colleagues found
decreased ratio of Bacteroidetes/Firmicutes abundance in autism subjects, mainly due to a
reduction of Bacteroidetes relative abundance. At genus level, it is found that the relative
abundance of Dialister, Bilophila, Parabacteroides, Alistipes, and Veillonella are decreased in
autistic subjects, while the relative abundance of Corynebacterium, Collinsella, Dorea and
Lactobacillus are increased (Strati et al., 2017). D. Kang and colleagues conducted human clinical
trials and found fecal microbiota transplantation significantly improved autism spectrum disorder
symptoms in an open-label study. The study also found increased overall microbial diversity and
increased relative abundances in Bifidobacterium, Prevotella and Desulfovibrio (Kang et al., 2017).
23
1.9 MEC grants description
Obesity has become increasingly prevalent. The increased excess adiposity and obesity-related
behaviors have associated with diseases related to low-grade inflammation, such as T2DM,
cardiovascular disease, and gall bladder disease. Inflammation has also been found related to
several cancers. The increased prevalence of obesity and cancer rates in racial and ethnic groups
may be explained by genetics and environmental factors. One of the important environmental
factors associated with obesity is the composition of gut microbiome. The gut microbiome may
influence adipose distribution through its contribution to differences in body weight, insulin
sensitivity, and glucose and lipid metabolism.
One Aim of our research on the microbiome to test the global hypothesis that the gut microbiome
plays roles in influencing the biological pathways of fat deposition and obesity-related
inflammation and thus contributes to the development of cancer. Current research takes the
advantages of the Multiethnic Cohort (MEC) to examine the composition of gut microbiome as an
effect modifier of the relationship between fat distribution, diet and cancer risk across five major
ethnic U.S. population (African Americans, Latinos, Japanese Americans, Native Hawaiians and
Whites). This research involves extensive collaboration work in which information about the gut
microbiome, other metabolic, genomic, lifestyle/behavior, biochemical, hormonal measurements
are integrated to investigate the amount and distribution of fat and examine their relationships to
the intermediate cancer phenotypes as biomarkers. Microbiome research in the MEC attempts to
identity the gut microbial profiles associated with the amount and distribution of body fat in five
ethnic groups. The gut microbial profiles are generated from the 16S rRNA gene analysis of the
fecal microbiome community using high throughput sequencing technology. A second goal is to
24
test the association of the gut microbial profiles with diet, lifestyle factors and intermediate
biomarkers of cancer risk related to obesity. A third goal of current research is to test the
association of circulating concentrations of lipopolysaccharide-binding protein with colorectal
cancer risk within an ethnically diverse cohort. The work of this this thesis contributes to all three
goals.
The baseline cohort of MEC (Multiethnic Cohort Study of Diet and Cancer) is made up of 215,000
men and women whose ages range from 45 to 75 and who returned a questionnaire by mail
between 1993 and 1996. The study participants completed a 26-page baseline questionnaire that
includes a quantitative food frequency questionnaire, medical history and lifestyle. The
biorepository subcohort of over 69,000 including African American, white, Hawaiian, Japanese
American participants in Hawaii and African American, Japanese American and Latino
participants in Los Angeles. The subcohort gave a blood and urine sample between 2001 and 2006
(Meredith A. J. Hullar, T. Randolph, K. Zhao, D. O. Stram, K. Curtis, W. Copeland, U. Lim, L.
Wilkens, J. W. Lampe, L. LeMarchand, n.d.).
In this dissertation, we want to examine the association and relationship of the gut microbiome
data with some of the rich datasets we have for the MEC, which includes GWAS, nutritional,
lifestyle, medical data. Evaluation of statistical methods for relating the microbiome to those two
datasets is at the heart of the proposed thesis, which will comprise both methodological and applied
parts. Methods development is especially relevant to the joint analysis of GWAS and microbiome
data.
25
CHAPTER 2: METHOD COMPARISONS AND DEVELOPMENT FOR MICROBIAL
DATA ANALYSIS
Given the complex nature of the gut microbiome data collected from high-throughput sequencing
technologies, there are numerous perspectives looking at these data, when flexibly addressing
different kinds of research question. With a currently growing research interest in the microbial
research and increasing availability of high-dimension microbiome data sets, there has been a
growing attention to the development of statistical methods for the analysis of microbiome data.
A large portion of the currently available methods is aimed to examine the relationship of the
microbiome data with another type of information, such as genetic, phenotypic, lifestyle, diet, and
metabolomic data. Here, with a research emphasis on genetics and gut microbiome in this thesis,
I will review statistical methods that have been developed in the past to analyze ecological
community data and methods that have been developed recently to analyze microbial data
generated from the state-of-the-art high throughput sequencing technology.
Various statistical methods have been deployed to investigate the relationship between host
genetics and microbiome composition. These methods differ because they were designed to
address different research questions. A brief review of currently available methods will be
provided.
Genome-wide association studies (GWAS) have long been used to interrogate the association of
genetic variants and phenotypes of interest. A GWAS (or GWA study) can be effectively
considered as a “single SNP versus single phenotype” analysis, where the single phenotype can be
bacterial taxa in microbiome analysis. In a GWAS, regression-based approaches are most often
used to investigate the relationship between a genetic variant and an outcome of interest while
allowing for adjusting additional covariates in the same model. Linear regression is widely adopted
26
in other GWA studies, and investigators have begun to use them in microbiome analysis. Using
linear regression based GWA, it was found that there are significant association between host
genetic variation and microbiome composition in 10 body sites (Blekhman et al., 2015). Another
GWA study identified at least 8 bacterial taxa whose abundances were significantly associated
with host genetic variations. Among the identified genetic variations, a variant near a body mass
index (BMI) related gene is found to be associated with a taxon that affects obesity (genus
Akkermansia) (Beaumont et al., 2016; Davenport, 2016; Davenport et al., 2015).
2.1 Zero inflation models and hurdle models
Operational taxonomic unit (OTU) counts in a typical microbiome study contain excessive zeros,
which are often handled inappropriately by investigators. While a linear regression is the default
method in many GWA studies, it may not be a valid method to conduct GWA in some scenarios.
For example, statistical inferences drawn from a linear regression may not be valid if the
assumption of normality or constant variance is not met. The microbiome relative abundance data
in some taxa contain excessive zeros and are highly skewed in distribution. Lizhe Xu and et al.
(2015) showed, in simulations, that zero-inflated data inflates type 1 error rate, decreases power
and resulted in poor model fit and higher bias when linear regressions were used. Their simulation
studies show that zero inflated and hurdle model have better controlled type 1 error rates, higher
power, better model fit and more accurate parameter estimation (Xu, Paterson, Turpin, & Xu,
2015).
2.2 Distance-based regression
Xing Hua and et al. developed a distance-based regression framework that tests the association
between pairwise similarity in microbiome composition and pairwise similarities in genotypes.
27
The method can be considered as a multiple phenotypes and multiple SNPs approach, which
simultaneously relates information computed from multiple OTUs to information computed from
multiple genotypes. The method can flexibly include other covariates such as age, sex,
environmental variables and other confounders. The method is implemented in a R package
‘microbiomeGWAS’ (Hua, Song, Yu, & Goedert, 2015).
2.3 Kernel-based regression
Hongzhe Li and et al. proposed a kernel-based regression framework to characterize the
association between microbiome composition and an outcome of interest, which is referred to as
the microbiome regression-based kernel association test (MiRKAT). The method begins with
constructing a kernel that measures the similarities of microbiome composition between samples.
It then relates this similarity metric to an outcome of interest. A variance-component test is
developed to formally test the statistical association between microbiome composition and an
outcome of interest (Zhao et al., 2015).
2.4 Co-inertia analysis
Microbiome and GWA data set are “omics” data set that records hundreds to hundreds of
thousands of variables simultaneously. Multivariate approaches have been successfully deployed
to interrogate the biological insights from such “omics” data. Among these methods, the co-inertia
analysis (CIA) has been increasingly deployed in microbiome studies. The method essentially
couples two tables and examines the covariant patterns between the two tables. In a microbiome
GWA study, the two tables are essentially the microbiome composition table consisting of OTU
counts and the genetic table consisting of genotype counts respectively. The method produces plot
28
that consists of two-dimension vectors, where the length of a vector represents the strength of
association between the two tables (Dray, Chessel, & Thioulouse, 2003).
2.5 Distance correlation based analysis
Traditional statistical methods only examine one independent variable and one outcome at a time.
Distance correlation, however, can measure statistical independence between two random vectors
that go beyond dimension of one. Distance correlation based analysis produces a single summary
statistics (dCorr) very similar to Pearson’s correlation coefficient. dCorr takes a value between 0
to 1, where dCorr=0 indicates complete independence and dCorr=1 indicates complete dependence.
A Monte Carlo based method is developed to obtain a p-value, which is used to formally test
statistical significance between the association of the two random vectors (Székely & Rizzo, 2009).
In spite of the availability of statistical methods that employs a multi-variable joint analysis
approach – i.e. distance-based regression, kernel-based regression, co-inertia analysis, distance
correlation based analysis, a primary focus and research of interest in our study is to examine the
relationship or association between variables of interest (demographic information, nutritional
information, medical / medication records, living style, behavior and genotype) and abundance
measurement of a single genus. Bearing this goal in mind, the apparently more suitable methods
for our analysis are those designed for single variable outcome/dependent methods. Given the
nature of microbial abundance measurements, one could easily see these data imposes some
serious technical challenges for anyone attempting to make sense of them. The paragraphs below
outline a few of these challenges and methodological considerations. For one, raw microbial
abundance measurement is an integer, usually rich in inflated zeros and small values. Popular and
widely used ordinary linear regression methods may not have enough power to detect signals or
fail to produce valid statistical estimates due to violations in statistical assumptions. In addition,
29
total number of outcome variables and independent variables to be examined in our study is
massive, which translates to massive number of statistical tests to perform. In our study, we’ve
assembled data on 104 microbial genera, and 41 selected non-genetics independent variables and
over 3,000,000 SNP genotypes. As a result, the computation required to examine the relationship
between all pairs of microbial abundance and independent variables are gigantic. In this emerging
scenario, we need statistical methods that are robust against violations in assumption, easily
generalizable to a range of differently distributed microbial abundance measurements, and which
require minimal model tuning across an array of microbial abundance measurements. In addition
to the other methods mentioned above other methods can handle excessive zeros also, for example,
ordinal logistic regression can be easily adapted to modeling the quantiles of the distribution of
genera relative abundances.
In the following we compare the power of a number of possible methods when applied to a sample
of 12 genera selected to represent the diversity of distributional characteristics seen more generally
(over the 104 genera) measured in 6,112 samples. We focus our comparison on power to detect
associations. We do this empirically (over the 12 genera and baseline characteristics, sex, age, race,
antibiotic, sample month, and batch variable). For each of the methods considered (see below) we
construct type 3 likelihood ratio statistics for the inclusion of each baseline characteristic using
multi-degree of freedom tests when necessary (i.e. 4 df for race, 1 df each for sex, age, antibiotic
use, 3 df for batch, and 11 for sample month). For each characteristic we compute the rank of the
LRT statistic observed in the data analysis.
Figure 2.1 provides a histogram of values taken by the 12 selected genera to be used, we see a
varying number of shapes of these genera, and the number of zeros.
30
Figure 2.1 The distribution of the 12 selected genera.
The methods that we considered are the following. The zero-inflated negative binomial, the zero-
inflated Poisson, a negative binomial hurtle model and a Poisson hurdle model, a logistic
regression of the quintiles of the genera distribution, linear regression using relative abundance,
linear regression using quintile coded ranks (0-4) as the outcome variable.
2.6 The Models
Zero inflated negative binomial model
The model was first proposed by Lambert, 1992 and proposed for microbiome analysis by Xu et
al (2015). This model is really a composite of two models, a negative binomial model (possibly
over dispersed with a dispersion parameter estimated) and a model for what are termed the
“structural zeros” which are the excess zeros over-and above that which is predicted by the (over-
dispersed) negative binomial. This zero portion is allowed to depend on a linear combination of
31
covariate variables, as in a logistic regression except that the structural zeros may not be directly
observed because the negative binomial may allow for zeros as well. The modeling of the non-
excess zeros and positive values is performed using a classic log link function relating the mean
of this distribution to covariates. In our analysis we used three different approaches to the
covariates used in the zero model, for example take the antibiotic (yes/no) use variable. We can
include covariates in both the zero and nonzero portions and give the LRT test for removing the
covariate of interest from both parts of the model (zero and nonzero) simultaneously giving a two-
degree of freedom test. We also can keep this variable out of the zero model (or simplify the zero
model so it only depends on an intercept parameter) giving a 1 df LRT test, or we could include
the antibiotic variable as a covariate in both the zero and nonzero model, but remove it only from
the nonzero model, also giving a 1 df test. We show all three methods in our study of the 12
representative genera.
The mathematical expression of zero-inflated negative binomial model can be decomposed into
two underlying processes, referring to, Process 1 and Process 2. The Process 1 models the source
of ‘structural’ zeros and the Process 2 draws counts from a negative binomial distribution.
Specifically:
where g(.) is the conditional probability mass function (PMF) of the Negative Binomial
distribution given 𝒙 𝒊 which is explicitly expressed as,
Pr(𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊 ) =
𝜃 𝜃 𝜇 𝑖 𝑦 𝑖 Γ(𝜃 +𝑦 𝑖 )
Γ(1+𝑦 𝑖 )Γ(𝜃 )(𝜃 +𝜇 𝑖 )
𝜃 +𝑦 𝑖
0 with probability 𝜔 𝑖
g(𝑦 𝑖| 𝒙 𝒊 ) with probability 1-𝜔 𝑖
𝑦 𝑖 =
32
𝜔 (𝛿 ′
𝒛 𝒊 ) + [1- 𝜔 (𝛿 ′
𝒛 𝒊 )] g(0| 𝒙 𝒊 ) if 𝑦 𝑖 = 0
𝑃𝑟 {𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊,𝒛 𝒊 } =
Jointly, the PMF of {𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊 } is
The probability 𝜔 (.) can depend on covariates value 𝒛 𝒊 and their corresponding coefficient parameters
𝛿 ′
. The link functions that relates the linear combination 𝛿 ′
𝒛 𝒊 to probability 𝜔 are conventionally
specified as a logistic function or the probit function (Xu et al., 2015).
Zero Inflated Poisson
Poisson model is also capable of handling counts data. The Zero Inflated Poisson is a natural
extension to the Poisson model, which allows to account for extra zeros in the observed data. Zero-
inflated Poisson was first proposed by (Lambert Diane 1992). The method was later brought into
the application of microbial data analysis by Lizhe Xu and et al. (2015). Similar to Zero Inflated
Negative Binomial model, Zero Inflated Poisson consists of two parts: the all-zero part that models
the structural zeros and the not-all-zero part that allows for non-zero integers.
The mathematical expression of zero-inflated Poisson model can be decomposed into two
underlying processes, referring to, Process 1 and Process 2. The Process 1 models the source of
‘structural’ zeros and the Process 2 draws counts from a Poisson distribution. Specifically:
where g(.) is the conditional probability mass function (PMF) of the Poisson distribution given 𝒙 𝒊
which is explicitly expressed as,
Pr(𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊 ) =
𝑒 −𝜇 𝑖 𝜇 𝑖 𝑦 𝑖 𝑦 𝑖
Jointly, the PMF of {𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊 } is
[1- 𝜔 (𝛿 ′
𝒛 𝒊 )] g(𝑦 𝑖| 𝒙 𝒊 ) if 𝑦 𝑖 > 0
0 with probability 𝜔 𝑖
g(𝑦 𝑖| 𝒙 𝒊 ) with probability 1-𝜔 𝑖
𝑦 𝑖 =
33
The probability 𝜔 (.) can depend on covariates value 𝒛 𝒊 and their corresponding coefficient
parameters𝛿 ′
. The link functions that relates the linear combination 𝛿 ′
𝒛 𝒊 to probability 𝜔 are
conventionally specified as a logistic function or the probit function.
Negative Binomial Hurdle model Poisson Hurdle model
The hurdle models are also two-part models (Mullahy, 1986). Contrary to a single data generating
process, a hurdle model specifies two data generating processes. The binomial distribution dictates
the data generating process of zero values whereas the truncated distribution at zero (usually a
Negative Binomial or Poisson distribution) governs the data generating process of positive values.
The first part of the model can be parametrized by a logistic or a probit regression while the second
part of the model can be analyzed by a truncated Poisson or Negative Binomial regression. Here,
we briefly outline the hurdle model specifications. The conditional distribution, given covariates
𝒙 𝒊 and 𝒛 𝒊 , of probability density function of a hurdle model can be specified as following:
where 𝜔 relates 𝒛 𝒊 to model zero probability and g(.) is a probability mass function truncated at
zero. The most widely used choices for g(.) are the Truncated Negative Binomial distribution and
the Truncated Poisson distribution.
Ordinal logistic regression
𝜔 (𝛿 ′
𝒛 𝒊 ) + [1- 𝜔 (𝛿 ′
𝒛 𝒊 )] g(0| 𝒙 𝒊 ) if 𝑦 𝑖 = 0
[1- 𝜔 (𝛿 ′
𝒛 𝒊 )] g(𝑦 𝑖| 𝒙 𝒊 ) if 𝑦 𝑖 > 0
Pr{𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊,𝒛 𝒊 } =
𝜔 (𝛿 ′
𝒛 𝒊 ) if 𝑦 𝑖 = 0
[1- 𝜔 (𝛿 ′
𝒛 𝒊 )] g(𝑦 𝑖| 𝒙 𝒊 ) if 𝑦 𝑖 > 0
Pr{𝑌 𝑖 = 𝑦 𝑖 | 𝒙 𝒊,𝒛 𝒊 } =
34
There are several reasons why we propose this method, first it only requires an assumption of
monotonicity rather than linearity or log-linearity for the relationship between the (quantized)
relative abundances and the covariates of interest. Secondly, there is a single, easily interpreted
odds ratio estimated in each model. Finally we expect that computation times are considerably
shorter for this model than for the ZI or hurdle models. This issue arises because of the large
number of tests (over 4 hundred million) that will be performed in the GWAS analysis.
Not only is the OLR method faster than the ZI or hurdle models but also, it is possible to further
speed up testing for nonzero effects by employing the score test methods which can be
implemented as described later in this thesis and can be optimized to be far faster than Wald tests
which require MLE computation. Consider a genus with full four quintile cutoffs (with each
subject taking a value from 1, 2, 3, 4 or 5) and assume the five values are ordered such that
1<2<3<4<5. Then the proportional odds model is explicitly parameterized as following:
P1 = Pr(y≤1) =
exp (𝛼 1
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
1+exp (𝛼 1
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
,
P2 = Pr(y≤2) =
exp (𝛼 2
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
1+exp (𝛼 2
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
,
P1 = Pr(y≤3) =
exp (𝛼 3
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
1+exp (𝛼 3
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
,
P1 = Pr(y≤4) =
exp (𝛼 4
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
1+exp (𝛼 4
+∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
)
,
P1 = Pr(y≤5) = 1.
The MLEs of parameter 𝛼 1
, 𝛼 2
, 𝛼 3
, 𝛼 4
, 𝛽 1
, …, 𝛽 𝑝 can be estimated using numeric method such as
Newton-Ralphson on the joint log-likelihood function. Statistical tests – such as Wald’s test, score
35
test and likelihood ratio test – can be used to test whether any of the 𝛽 𝑘 are statistically significantly
different from zero. More is presented later on this topic.
OLS regression using relative abundance and OLS version of Quantile-based regression
The ordinary least square (OLS) method was also considered in our method comparison. The main
reason is include this method is because OLS method has been widely used in literature and the
method is relatively easy to implement and interpret. The OLS method serves as a benchmark in
our method comparison. In addition, two different outcome variables were each tested using OLS
method. The first type of the two outcome variables is the relative abundance of a single genus.
Relative abundance is a normalized abundance measurement by dividing the subject’s observed
abundance reads from a single genus by the total observed abundance reads combined from all
genera. The model is specified as followings:
𝑅𝑒𝑙𝑎𝑡𝑖𝑣𝑒 𝐴𝑏𝑢𝑛𝑑𝑎𝑛𝑐𝑒 = 𝛼 +∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
+ 𝜖 ,
where 𝛼 is the intercept; 𝛽 𝑘 s are the regression coefficients, 𝑥 𝑘 s are the observed covariate values
and 𝜖 is the error term.
The second type of the two outcome variables is an ordinal categorical counts (1, 2, 3, 4 and 5) by
applying quintile cutoffs on each relative abundance measurement. This procedure may help
remedy the issue of excessive zeros normally seen in relative abundance measurements.
𝑄𝑢𝑖𝑛𝑡𝑖𝑙𝑒 −𝑐𝑢𝑡 𝑅𝑒𝑙𝑎𝑡𝑖𝑣𝑒 𝐴𝑏𝑢𝑛𝑑𝑎𝑛𝑐𝑒 = 𝛼 +∑ 𝛽 𝑘 𝑥 𝑘 𝑝 𝑘 =1
+ 𝜖 ,
where 𝛼 is the intercept; 𝛽 𝑘 s are the regression coefficients, 𝑥 𝑘 s are the observed covariate values
and 𝜖 is the error term.
We used SAS to fit each model. The SAS code used is given in Appendix A for this chapter.
36
2.7 Results of the comparison
As an example. we give the results in the Table 2.1 below which show the observed LRT statistic
for five of the methods (Zero-inflated Poisson and hurdle Poisson model were excluded from the
comparisons because of frequent failures in model convergence when fitted with our data). The
rows indicate genus and the columns give percent of observed zeros for the 12 genera and LRT
statistics for the selected methods, for testing the significance of race in the model while adjusting
for all other baseline covariates. The zero model contains the full list of baseline covariates
including race, here we only show 4 df tests, keeping race in the zero model for zero inflated and
hurdle models. The next table (Table 2.2) gives a summary of the ranks of the LRT test over the
12 selected genera for five different methods with the ranks calculated over all the baseline
covariates (Race, age, sex, season, batch).
The data in these two tables clearly indicates that there is no uniformly most powerful model found
for these data, the LRTs for testing the significance of (Table 2.1) of race, for the same genus
varies considerably by the model chosen. The more powerful methods for testing the race effect
appear to be the ordinal logistic models (and the OLS linear-rank version), but this is far from
universal, for 5 of the 12 genera the LRT for the zero inflated negative binomial is higher than the
ordinal logistic. Table 2.2 reemphasizes this impression with more information about the ordering
of the LRT statistic by model for the full set of baseline covariates. In this summary the Zero
inflated negative binomial model appears to be slightly more powerful than the ordinal logistic.
The hurdle models and the linear models generally perform less well.
37
Finally Table 2.4 shows the execution times in SAS for fitting the full models (with all covariates
included) for one of the genera showed in Figure 2.1 above (number 10). The execution times for
other genera used were very similar
2.8 Conclusions
While no method is uniformly more powerful than all others considered we chose the ordinal
logistic method for our GWAS study since it is competitive with the zero inflated negative
binomial, it uses 1/3 as much computer time, and (unlike the similar linear regression on ranks)
automatically provides easily interpretable odds ratios for each variable in the model.
38
Table 2.1 LRT statistics of the 5 selected methods across the 12 genera when examining race
Genus Percentage of zeros Ordinal Logistic ZI_NB Hurdle-NB Linear_rank Linear_RA
1 24.8 284.162 88.6 32 290.8338 30.253
2 0 371.25 315 9 374.3054 422.5224
3 0 64.133 90.1 8 65.033 81.4164
4 0 35.288 10.9 1 33.9696 10.956
5 18 179.868 147.9 99 179.1578 58.5154
6 0.13 137.722 188.4 84 140.6186 109.7308
7 0.16 14.857 17.1 16.3 15.36 3.0188
8 22.8 181.871 56.1 19 177.4156 9.4122
9 0.28 37.819 50.9 45.9 37.838 27.7106
10 29.7 153.933 70.5 24.7 156.4126 58.197
11 69.6 2.125 18 6.4 2.135 14.7068
12 2.1 54.498 34.4 10.4 54.1788 41.119
39
Table 2.2 the average rank of the LRT test over the 12 selected genera for five different methods , ranks are from 1 to 5 with 5 being
the best.
Ordinal Logistic ZI_NB Hurdle-NB Linear_rank Linear_RA
race 3.416667 3.583333 1.916667 3.75 2.333333
season 3.75 3.5 1.916667 3.25 2.583333
sex 3.333333 3.5 2.25 3.333333 2.5
antibiotics 3.833333 3.333333 1.833333 3.166667 2.833333
batch 3.166667 4 2 3.25 2.583333
40
Execution time (give results in seconds of computer time reported for SAS)
Method Time required for a single run in SAS (in seconds)
Zero-inflated Poisson 1.18
Zero-inflated Negative Binomial 1.01
Poisson Hurdle not converged
Negative Binomial Hurdle 1.14
Ordinal_Logistic 0.34
OLS_RA 0.15
OLS_quintile 0.12
Table 2.3 The computation time in seconds to fit a full baseline model for each of the methods in
SAS for genus number 10 (Figure 2.1)
2.9 Full Score Test for Ordinal Logistic Regression Model
Although an ordinal logistic regression model generally runs faster than other competing model
for counts data such as zero-inflation models and the hurdle models, its computational speed,
unfortunately, impedes its application to analyze data at GWAS scale, especially with 145
outcomes and over 3 million SNP genotypes. One way to sizably speed up the computational speed
to put aside parameter estimation and to focus on just performing statistical tests. It is well known
that score test fits only the null model, and not the alternative to perform statistical tests.
Furthermore, we make the observation that the Fisher’s information regarding the baseline
covariates remains unchanged each time a new genotype is tested. Inspired by these observations
and thoughts, we proposed and developed the following score test paradigm suitable for GWAS
utility.
1. Define cumulative event probability P1, P2, P3, P4, and P5,
P1 = Pr(y≤1) =
exp (𝛼 1
+𝛽 1
𝑥 +𝑐𝑧 )
1+exp (𝛼 1
+𝛽 1
𝑥 +𝑐𝑧 )
,
P2 = Pr(y≤2) =
exp (𝛼 2
+𝛽 2
𝑥 +𝑐𝑧 )
1+exp (𝛼 2
+𝛽 2
𝑥 +𝑐𝑧 )
,
P3 = Pr(y≤3) =
exp (𝛼 3
+𝛽 3
𝑥 +𝑐𝑧 )
1+exp (𝛼 3
+𝛽 3
𝑥 +𝑐𝑧 )
,
41
P4 = Pr(y≤4) =
exp (𝛼 4
+𝛽 4
𝑥 +𝑐𝑧 )
1+exp (𝛼 4
+𝛽 4
𝑥 +𝑐𝑧 )
,
P5 = Pr(y≤5) = 1
2. Event probability
𝜋 1
= P1
𝜋 2
= P2-P1
𝜋 3
= P3-P2
𝜋 4
= P4-P3
𝜋 5
=1-P4,
3. Likelihood and log-likelihood function
L = 𝜋 1
𝐼 1
𝜋 2
𝐼 2
𝜋 3
𝐼 3
𝜋 4
𝐼 4
𝜋 5
𝐼 5
l = 𝐼 1
ln (𝜋 1
)+ 𝐼 2
ln (𝜋 2
)+ 𝐼 3
ln (𝜋 3
)+ 𝐼 4
ln (𝜋 4
)+ 𝐼 5
ln (𝜋 5
)
= 𝐼 1
ln (𝑃 1)+ 𝐼 2
ln (𝑃 2−𝑃 1)+ 𝐼 3
ln (𝑃 3−𝑃 2)+ 𝐼 4
ln (𝑃 4−𝑃 3)+ 𝐼 5
ln (1−𝑃 4)
4. Score vector and multivariate chain rule
Define: 𝜃 : 𝜃 = (𝛼 1
,𝛼 2
,𝛼 3
,𝛼 4
,𝛽 1
,𝛽 2
,𝛽 3
,𝛽 4
,𝑐 )
The score function of 𝜃 is then,
𝜕𝑙
𝜕𝜃
=
𝜕𝑙
𝜕𝑃
𝜕𝑃
𝜕𝜃
𝜕𝑙
𝜕𝑃
= 𝑀 4∗5
𝐼 5∗1
=
(
1
𝑃 1
−
1
𝑃 2
−𝑃 1
0 0 0
0
1
𝑃 2
−𝑃 1
−
1
𝑃 3
−𝑃 2
0 0
0 0
1
𝑃 3
−𝑃 2
−
1
𝑃 4
−𝑃 3
0
0 0 0
1
𝑃 4
−𝑃 3
−
1
1−𝑃 4
)
4∗5
(
𝐼 1
𝐼 2
𝐼 3
𝐼 4
𝐼 5)
5∗1
42
𝜕𝑃
𝜕𝜃
=W=
(
𝑃 1
(1−𝑃 1
) 0 0 0
0 𝑃 2
(1−𝑃 2
) 0 0
0 0 𝑃 3
(1−𝑃 3
) 0
0 0 0 𝑃 4
(1−𝑃 4
)
𝑃 1
(1−𝑃 1
)𝑥 0 0 0
0 𝑃 2
(1−𝑃 2
)𝑥 0 0
0 0 𝑃 3
(1−𝑃 3
)𝑥 0
0 0 0 𝑃 4
(1−𝑃 4
)𝑥 𝑃 1
(1−𝑃 1
)𝑧 𝑃 2
(1−𝑃 2
)𝑧 𝑃 3
(1−𝑃 3
)𝑧 𝑃 4
(1−𝑃 4
)𝑧 )
The transposed score vector is,
S
t
= (𝑀 4∗5
𝐼 5∗1
)
𝑇 𝑊 9∗4
= 𝐼 1∗5
𝑇 𝑀 5∗4
𝑇 𝑊 4∗9
𝑇
Fisher’s information is,
VAR(S) = 𝑊 9∗4
𝑀 4∗5
𝐶𝑂𝑉 (𝐼 5∗1
) 𝑀 𝑇 5∗4
𝑊 𝑇 4∗9
= 𝑊 9∗4
𝑀 4∗5
∑
5∗5
𝑀 𝑇 5∗4
𝑊 𝑇 4∗9
,
where 𝐶𝑂𝑉 (𝐼 5∗1
)=∑
5∗5
=
(
𝜋 1
(1−𝜋 1
) −𝜋 1
𝜋 2
−𝜋 1
𝜋 3
−𝜋 1
𝜋 4
−𝜋 1
𝜋 5
. 𝜋 2
(1−𝜋 2
) −𝜋 2
𝜋 3
−𝜋 2
𝜋 4
−𝜋 2
𝜋 5
. . 𝜋 3
(1−𝜋 3
) −𝜋 3
𝜋 4
−𝜋 3
𝜋 5
. . . 𝜋 4
(1−𝜋 4
) −𝜋 4
𝜋 5
−𝜋 1
𝜋 5
−𝜋 2
𝜋 5
−𝜋 3
𝜋 5
−𝜋 4
𝜋 5
𝜋 5
(1−𝜋 5
))
What we have described is the contribution to the score and the information for a single individual.
We now add index i for individual and define 𝐺 𝑖 =𝑀 𝑖Σ
𝑖𝑀 𝑖 𝑡 , so that the information matrix (i.e.
the variance of the score) is equal to ∑𝑊 𝑖𝐺 𝑖 𝑖 𝑊 𝑖 𝑡 .
Now consider a score test for the inclusion of an additional variable in the model, we will call that
variable z1. We test the null hypothesis that the effect parameter, d, of z1 is zero, note that under
this null hypothesis all the p’s and ’s are the same as above (they do not involve z1 since d=0).
The score test is done as follows.
43
1. Fit by maximum likelihood the 9 parameters in the baseline model which means that the
score statistic for the first nine parameters is set to zero.
2. The score for the 10
th
variable z1 is computed as follows
For each individual, i, define the vector dP
i
=[𝑃 1𝑖 (1−𝑃 1𝑖 ),𝑃 2𝑖 (1−𝑃 2𝑖 ),𝑃 3𝑖 (1−
𝑃 3𝑖 ),𝑃 4𝑖 (1−𝑃 4𝑖 )]
𝑡
Define a transformed 𝑧 1𝑖 as 𝑣 𝑖 =𝑧 𝑖 ×dP
i
Compute the score for z1 as 𝑠 =∑𝑣 𝑖 𝑡 𝑀 𝑖𝐼 𝑖 𝑖
3. We calculate the new information matrix (a 10 x 10 matrix) as
𝐴 = ∑[
𝑊 𝑖𝐺 𝑖𝑊 𝑖 𝑡 𝑊 𝑖𝐺 𝑖𝑣 𝑖 𝑣 𝑖 𝑡 𝐺 𝑖 𝑡 𝑊 𝑖 𝑡 𝑣 𝑖 𝑡 𝐺 𝑖𝑣 𝑖 ]
𝑖 and partition A as [
𝐴 11
𝐴 12
𝐴 21
𝐴 22
]
Define A11 A12 A22
4. We compute the (10,10) element of the inverse of A to serve as the inverse of the variance
of the score for z1. The 10x10 element can be written as 1/(𝐴 22
−𝐴 21
𝐴 11
−1
𝐴 12
).
5. Finally compute the score test as the observed score divided by the square root of its
variance which is
𝑠 √(𝐴 22
−𝐴 21
𝐴 11
−1
𝐴 12
)
Notice that in repetitive use of the score test using the baseline as the null model and iterating
through all the new z variables z1 z2, z3, …, of interest, that A11 does not change (and neither
does it’s inverse) because we are always testing the hypothesis of no effect of each new z in turn
(not adding new z’s to the model). This reduces computation since it avoids repeatedly inverting
44
the full matrix A. Moreover, we can save (during the fitting of the baseline model) the matrices
𝐺 𝑖 , 𝑊 𝑖, and 𝑀 𝑖𝐼 𝑖 and reuse them for each new z of interest.
However, our time estimate for the dataset of interest (5212 individuals) is that programmed as a
loop in R it still takes about 0.2 seconds per test to do all the computation for one test (not counting
the initial baseline computations). Given that we want to do approximately 400 million tests (over
the 104 genera) this would amount to a minimum of 722 days of computation time. Even if we
could have complete use of a 88 node system with no other users this would require 8 days of
computer use, neglecting the time to set up 104 baseline models.
Therefor we seek a faster solution. This brings up the method of residuals, i.e. regression of
Pearson residuals from the baseline model on each new z in turn. The cor.test procedure in R can
do each correlation in approximately 0.00128 seconds or approximately total 110 hours to do the
300 million tests or only 1.2 hours if all 88 nodes were exclusively available.
2.10 Residual Approximation Method
Although the proposed score test can significantly cut down on computational time from a
theoretical standpoint, it would require some optimized Cpp or Rcpp programming work to really
take the advantage of the proposed method. Fortunately, we can receive expertise input from the
Bioinformatics Core here at USC. In the meanwhile, we also proposed and tested two
approximation methods using the residuals from an ordinal logistic regression. In a typical linear
regression model, variations in the outcome variable is usually explained by regressing the
outcome variable on a pre-specified list of covariates and the unexplained variations can be
captured in the residuals, variations of which may be further explained by regressing the residuals
on additional covariate(s) of interest. For OLS regression this “residual” method does not give
45
unbiased estimates of the effect of the additional covariate, however it does give a conservative
test of the hypothesis that the effect of the additional covariate on the outcome is zero. On the basis
of p-value comparisons (given in Chapter 5) between models fit using full ordinal logistic
regression and the Pearson’s residual approximation method we argue that this conservativeness
can be evaluated and corrected for using a genomic control adjustment for the under-dispersion of
the test.
In this spirit, we proposed and tested two residual-style paradigms for the ordinal logistic
regression model, namely the deviance method and the Pearson’s residual method. The resulting
paradigms require much less computational time at the reasonable expense of approximation. In
addition, our study finds that the Pearson’s residual method provides a somewhat better
approximation in p-value than the Deviance residual method when compared with the ordinal
logistic regression.
2.11 Deviance residual method
The deviance residual method is not an unfamiliar quantity in binary logistic regression models.
We extended and generalized the formula to compute deviance residuals in an ordinal logistic
regression setting.
The basic algebra is the same as above Specifically, for a quintized relative abundance – which
takes values from 0, 1, 2, 3, and 4 – we set the group 0 as the reference group and fit an ordinal
logistic regression model:
P1 = Pr(y≤1) =
exp (𝛼 1
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 1
+𝜷 𝟏 ′𝒙 )
,
P2 = Pr(y≤2) =
exp (𝛼 2
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 2
+𝜷 𝟏 ′𝒙 )
,
46
P3 = Pr(y≤3) =
exp (𝛼 3
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 3
+𝜷 𝟏 ′𝒙 )
,
P4 = Pr(y≤4) =
exp (𝛼 4
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 4
+𝜷 𝟏 ′𝒙 )
,
where 𝒙 is a list of baseline covariates in our study that includes sex, age, ethnicity, antibiotics,
season and batch. We then estimate regression the coefficients 𝛼 1
, 𝛼 2
, 𝛼 3
, 𝛼 4
and 𝜷 𝟏 and estimate
the predictive probability for each individual being one of the five quintile categories:
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=0),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=1),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=2),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=3),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=4).
Then for each individual, we compute their ordinal logistic regression deviance residual:
𝐷 _𝑟𝑒𝑠𝑖𝑑𝑎𝑢𝑙 𝑖 =𝑠𝑖𝑔𝑛 (𝑞𝑢𝑖𝑛𝑡𝑖𝑙𝑒 𝑖 − 𝐸 (𝑌 𝑖))*√−2∗log [𝑃𝑟𝑜𝑏 𝑖(𝑞𝑢𝑖𝑛𝑡𝑖𝑙𝑒 𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 )],
where 𝐸 (𝑌 𝑖) = ∑ 𝑘 ∗
4
𝑘 =0
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=k).
Finally, we correlate the deviance residual to one SNP genotype at a time and compute p values to
test whether the correlation coefficient is statistically significantly from zero. The p values
computed from Pearson’s correlation serve as approximation p-values to test whether a SNP is
statistically significantly associated with quintized relative abundance after adjusting for baseline
covariates. We validated the approximate by computing the Pearson’s correlation between the
approximated p values and the respective ordinal logistic regression models. A high Pearson’s
correlation coefficient indicates the approximation method approximates the ordinal logistic model
well in producing p values.
47
2.12 The Pearson’s residual method
The Pearson’s residual method is inspired by Pearson’s residual used in the binary logistic
regression model. We extended and generalized Pearson’s residual in the ordinal logistic
regression model. . Specifically, for a quintized (split into quintiles) relative abundance – which
takes values from 0, 1, 2, 3, and 4 – we set the group 0 as the reference group and fit an ordinal
logistic regression model:
P1 = Pr(y≤1) =
exp (𝛼 1
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 1
+𝜷 𝟏 ′𝒙 )
,
P2 = Pr(y≤2) =
exp (𝛼 2
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 2
+𝜷 𝟏 ′𝒙 )
,
P3 = Pr(y≤3) =
exp (𝛼 3
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 3
+𝜷 𝟏 ′𝒙 )
,
P4 = Pr(y≤4) =
exp (𝛼 4
+𝜷 𝟏 ′𝒙 )
1+exp (𝛼 4
+𝜷 𝟏 ′𝒙 )
,
where 𝒙 is a list of baseline covariates in our study that includes sex, age, ethnicity, antibiotics,
season and batch. We then estimate regression the coefficients 𝛼 1
, 𝛼 2
, 𝛼 3
, 𝛼 4
and 𝜷 𝟏 and estimate
the predictive probability for each individual being one of the five quintile categories:
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=0),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=1),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=2),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=3),
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=4).
Then for each individual, we compute their ordinal logistic regression deviance residual:
𝐷 _𝑟𝑒𝑠𝑖𝑑𝑎𝑢𝑙 𝑖 =(𝑞𝑢𝑖𝑛𝑡𝑖𝑙𝑒 𝑖 − 𝐸 (𝑌 𝑖))/√∑ (𝑘 −𝐸 (𝑌 𝑖 ))
2
∗
4
𝑘 =0
𝑃𝑟𝑜𝑏 𝑖(RA_quintile=k)],
where 𝐸 (𝑌 𝑖) = ∑ 𝑘 ∗
4
𝑘 =0
𝑃𝑟𝑜𝑏 𝑖 (RA_quintile=k).
48
Finally, we correlate the Pearson’s residual to one SNP genotype at a time and compute p values
to test whether the correlation coefficient is statistically significantly from zero. The p values
computed from Pearson’s correlation serve as approximation p-values to test whether a SNP is
statistically significantly associated with quintized relative abundance after adjusting for baseline
covariates. We validated the approximate by computing the Pearson’s correlation between the
approximated p values and the respective ordinal logistic regression models. A high Pearson’s
correlation coefficient indicates the approximation method approximates the ordinal logistic model
well in producing p values.
2.13 Results
We computed p-values of the ordinal logistics regression, using LRT test adjusted for the baseline
variables as well as p-values from the deviance residual method and Pearson’s residual method
from 54 additional covariates for all genera that have full 5 quintile cuts. Figure 2.2 is the scatter
plot of p-value between the deviance residual method and ordinal logistic regression model on log-
10 scale. Log-10 scaled p-values between Deviance’s residual method and ordinal logistic
regression model are highly correlated (Pearson’s coefficient = 0.9905). Figure 2.3 is the scatter
plot of p-value between Pearson’s residual method and ordinal logistic regression model on log-
10 scale. Log-10 scaled p-values between Pearson’s residual method and ordinal logistic
regression model are highly correlated (Pearson’s coefficient = 0.995). By comparing the
Pearson’s correlations between Deviance’s residual method and Pearson’s residual method, we
notice the later method provides a slightly better approximation to the p-values obtained from the
ordinal logistic regression. It is also noticeable that the trend of correlation between Pearson’s
residual method and ordinal logistic regression model is preserved even for those smallest p-values,
49
which is a desired application property in GWAS study in which smallest p-values are often of
interest. Reliable approximation using Pearson’s residual provides a sound possibility of applying
Pearson’s residual method in GWAS studies.
Figure 2.2 The scatter plot of p-value between Deviance’s residual method and ordinal logistic
regression model on log-10 scale.
50
Figure 2.3 The scatter plot of p-value between Pearson’s residual method and ordinal logistic
regression model on log-10 scale.
Appendix A – SAS code
*Negatuve Binomial Hurdle models;
proc fmm data=one;
51
class sex
race
antibx
sample_month
batch_new;
model genus_30 = sex race age antibx sample_month batch_new /dist
= truncnegbin offset= log_reads ;
model genus_30 = / dist = constant;
probmodel sex race age antibx sample_month batch_new;
run;
* Zero-inflated Negative Binomial model;
proc fmm data=one;
class sex
race
antibx
sample_month
batch_new;
model genus_30 = sex race age antibx sample_month batch_new /dist
= negbin offset= log_reads ;
model genus_30 = / dist = constant;
run;
* Poisson Hurdle model;
proc fmm data=one;
class sex
race
antibx
sample_month
batch_new;
model genus_49 = sex race age antibx sample_month batch_new /dist
= truncpoisson offset= log_reads ;
model genus_49 = / dist = constant;
run;
* Zero-inflated Poisson model;
proc fmm data=one;
class sex
race
antibx
sample_month;
model genus_9 = sex race age antibx /dist = poisson offset=
log_reads ;
model genus_9 = / dist = constant;
probmodel sex race age antibx;
run;
52
* Ordinal logistic regression;
proc logistic data = one descending; /* Change input data here */
class sex
race
antibx
sample_month
batch_new;
model rank30 = sex race age antibx sample_month batch_new;
run;
* OLS with Relative Abundance;
proc genmod data = one;
class
sex
race
antibx
sample_month
batch_new;
model rv_genus_30 = sex race age antibx sample_month batch_new;
run;
* OLS with Quitile-cut Relative Abundance;
proc genmod data = one;
class
sex
race
antibx
sample_month
batch_new;
model rv_genus_30 = sex race age antibx sample_month batch_new;
run;
53
CHAPTER 3: ASSOCIATIONS OF MICROBIAL ABUNDANCE WITH BASELINE AND
ADDITIONAL COVARIATES
3.1 Introduction
The characterization of host gut microbiome is thought to be affected by a range of factors such as
host demography, season, nutrition, lifestyle, medical and medication history, host genetics,
among many others (Conlon & Bird, 2015). To what extent these factors influence the diversity
and abundance of host gut microbial profile remains largely unknown from both qualitative and
quantitative perspectives. Due to lack of sequence-based microbial data in the past, no
epidemiological studies in the United States have examined the association of gut microbial profile
with the factors possible affecting them using sufficiently large sample size (> 5000 subjects) to
the best of author’s knowledge. The MEC project that has kept track of hundreds of thousands of
study participants, collecting participant information by means of detailed questionnaire items
ranging from date of birth, sex, ethnicity, lifestyle, medical and medication history, nutritional
information, as well as biological samples such as blood and stool. The informative nature of the
MEC study presents a unique opportunity for one to examine the association of gut microbiome
and factors possibly affecting it.
The Part I of the proposed dissertation is focused on the statistical analysis of the abundance and
diversity of the gut microbiome in association with various factors of interest, such as lifestyle,
medical history and medication use, host nutrition information, age, sex, ethnicity among other
variables. Specifically, general linear models will be applied to examine the association of
diversity and abundance of the gut microbiome with factors of interest while controlling for factors
that may cause confounding effect.
3.2 Research and Methods
54
The remainder of this dissertation will describe an analysis of data from 6118 MEC subjects
collected from the obesity study. Each of the 6118 subjects has information on abundance counts
of 105 genera, 186 covariates that includes lifestyle, nutrition and demography among many others.
In his analysis, relative abundance and α-diversity are treated as independent variables. For the ith
genera in subject j, relative abundance (RA) is calculated as follows,
𝑅𝐴
𝑖𝑗
=
𝑛 𝑖𝑗
𝑁 𝑗 ,
where 𝑛 𝑖𝑗
represents the number of sequencing reads of ith genera in subject j, 𝑁 𝑗 is the total
number of sequencing reads in subject j. The relative abundance, 𝑅𝐴
𝑖𝑗
, is just defined as the ratio
of number of sequencing reads of ith genera in subject j 𝑛 𝑖𝑗
over the total number of sequencing
reads in subject j 𝑁 𝑗 . Therefore, relative abundance takes values between zero and one. One
important reason to use relative abundance as opposed to raw abundance involves data
normalization, which allows for fair comparison. While relative abundance measures number of
sequencing reads of a particular taxon in an individual, α-diversity measures diversity of taxon in
an individual. Although α-diversity can be characterized in many ways, Shannon’s index is the
most popular metric to characterize α-diversity in many biological and ecological literatures.
Shannon’s index is defined as follows,
𝐻 ′
𝑖𝑗
= −∑𝑝 𝑖𝑗
ln(𝑝 𝑖𝑗
)
𝑆 𝑖 =1
,
where 𝑝 𝑖𝑗
represents the proportion of sequencing reads of taxon i in individual j relative to the
total number of sequencing reads in individual j, and ln(𝑝 𝑖𝑗
) denotes the natural logarithm of 𝑝 𝑖𝑗
.
The Shannon’s index, also known as Shannon entropy, measures the number (richness) and
distribution (evenness) of taxa in an individual. The idea is that more different taxa and the more
55
equal their proportional abundances imply higher uncertainty to predict which taxon would be if
drawn randomly, which translates to higher diversity in taxa of interest in ecological literature.
Linear Models. Several sets of linear models are built to answer different questions. The first linear
model is built to examine the association of Shannon’s Index with a core set of fundamental
baseline covariates including age of sample collection, self-reported ethnicity, antibiotics use, sex,
season of sample collection.
𝑦 𝑖 = ∑𝛽 𝑗 𝑡 𝑖𝑗
10
𝑗 =1
+ ∑𝛽 𝑐 𝑥 𝑖𝑐
𝑝 𝑐 =1
+ 𝘀 𝑖, (1.1)
where 𝑦 𝑖 denotes the alpha-diversity using Shannon’s Index, 𝑥 𝑖𝑐
, denotes the value of baseline
covariates of interest for individual i, 𝑡 𝑖𝑗
represent the jth genetic PC for ith individual and 𝘀 𝑖 is
the error term.
The second set of linear models are used to examine the association of Shannon’s Index with a
covariate of interest (e.g. the nutrition score). The full list of covariates of interest is specified in
the supplement table. The model is defined as follows,
𝑦 𝑖 =𝛽 𝑐𝑜𝑣
𝑐𝑜𝑣 𝑖 + ∑𝛽 𝑗 𝑡 𝑖𝑗
10
𝑗 =1
+ ∑𝛽 𝑐 𝑥 𝑖𝑐
𝑝 𝑐 =1
+ 𝛽 𝑒 𝐸 𝑖 +𝑐 𝘀 𝑖, (1.2)
where 𝑐𝑜𝑣 𝑖 is a covariate of interest (e.g. nutrition score). The linear model also adjusts for calorie
intake score variable 𝐸 𝑖 , in addition to the top 10 PCs and baseline covariates specified in Model
1.1.
The third set of linear models are used to examine if adding ancestry proportions (fully described
in Chapter 4) to the model with self-reported ethnicity explains more variation in alpha-diversity.
56
A nested likelihood ratio test will be performed. The full list of covariates of interest is specified
in the supplement table. The models are defined as follows,
𝑦 𝑖 = 𝛼 +∑𝛽 𝑐 𝑥 𝑖𝑐
𝑝 𝑐 =1
+ 𝘀 𝑖 (1.3)
𝑦 𝑖 = 𝛼 +∑𝛽 𝑐 𝑥 𝑖𝑐
𝑝 𝑐 =1
+∑𝛽 𝑎 ∗𝐴𝑛𝑐𝑒𝑠𝑃𝑟𝑜𝑝 𝑖𝑎
7
𝑎 =1
+ 𝘀 𝑖 (1.4)
where 𝐴𝑛𝑐𝑒𝑠𝑃𝑟𝑜𝑝 𝑖𝑎
is estimated percentage of ancestry a for individual i. A Likelihood Ratio Test
(LRT) will be conducted to test if adding additional genetic ancestry information to the model
already adjusted for self-reported ethnicity explains more variation in the alpha-diversity or
relative abundance phenotypes. The LRT test statistics is computed as follows,
LRT= 2(ln𝐿 𝑓𝑢𝑙𝑙 𝑚𝑜𝑑𝑒𝑙 − ln𝐿 𝑛𝑒𝑠𝑡𝑒𝑑 𝑚𝑜𝑑𝑒𝑙 ),
which follows a Chi-squared distribution with 7 degrees of freedom under the null hypothesis.
3.3 Preliminary Results for Baseline Analysis
FIGURE 3.1 summarizes the baseline model results for Shannon’s Index as the independent
variable. The figure shows that sex (1=male, 2=female) is a strong predictor of microbiome
diversity with males having increased microbiome diversity. Ethnicity is also a strong predictor of
diversity with the Japanese ethnicity having decreased diversity compared to white. Also,
increased diversity is also found associated with increased age. Antibiotics usage (yes or no) and
antibiotics usage duration have been found associated with decreased diversity. Seasonal effect is
also associated with diversity with summer months having increased diversity compared to
December.
57
FIGURE 3.1. Summary of baseline model results for Shannon’s Index as the dependent variable.
FIGURE 3.2 through 3.4 summarize baseline model results for the relative abundances for 3 of
the most abundant genera. Sex, ethnicity, age, antibiotics, antibiotics usage duration and seasonal
effect appears to have different patterns in magnitude and direction of observed associations.
58
FIGURE 3.2. Summary of baseline model results for the relative abundance of genus Bacteroides
as the dependent variable.
59
FIGURE 3.3. Summary of baseline model results for the relative abundance of genus Barnesiella
as the dependent variable.
FIGURE 3.4. Summary of baseline model results for the relative abundance of genus
Parabacteroides as the dependent variable.
60
FIGURE 3.5. Summary of top hits from adjusted models for Alpha-diversity.
FIGURE 3.5 shows the summary of top hits from adjusted models for alpha-diversity. The figure
shows that greens, vegetable, along with many of the healthy index scores are significantly
associated with increased diversity. To highlight, amount of fish consumption and amount of
omega-3 consumption are significantly associated with increased diversity. This finding is
consistent with a recently publication highlighting that omega-3 intake is correlated with gut
microbiome diversity (Menni et al., 2017).
61
Figure 3.6 –log10 transformed p-values for associations of environmental variables with microbial
abundances. All transformed p-value were color-stratified by their taxonomic order name. The
horizontal line indicates the multiple testing criteria. All associations above the horizontal line are
considered statistically significant.
62
Figure 3.7 –log10 transformed p-values for associations of environmental variables with microbial
abundances. All transformed p-value were color-stratified by environmental variables. The
horizontal line indicates the multiple testing criteria. All associations above the horizontal line are
considered statistically significant.
Figure 3.1 shows that Shannon’s index or the Alpha-diversity is associated with an arrange of
baseline covariates. Specifically, males are associated with higher Shannon’s index compared to
females. Ethnicity is also statistically significantly associated with Shannon’s index and Japanese
are associated with lower Shannon’s index compared to whites. Age is associated with increased
Shannon’s index. Antibiotics usage and duration are associated with lower Shannon’s index.
Season during which the stool sample was collected is also associated with Shannon’s index and
the late summer months appears to be associated with increased Shannon’s index compared to
December.
63
After adjusting for the baseline covariates, Shannon’s index was found associated with several
environmental covariates. Among the most significant associations were several healthy eating
score indices and soluble fiber.
Figure 3.5 shows plot of p-values rankings for the strength of environmental association by
taxonomic order. The total number of tests is equal to the number of environmental covariates (41)
multiplied by the number of genera (104) The black horizontal line in the plot indicates the global
significance level of association, a total of about 114 associations were strongly significant. We
see the a large fraction of the associations that survive Bonferroni correction are contributed by
members of the order Clostridiales, while constituting almost half of the genera (45.7 percent) this
genera contributes 92 of the 114 (80.7%) significant associations.
Figure 3.6 shows a plot of the same p-values, but this time ordered by environmental variables.
We see that the healthy eating indices, DASH total score, and HEI2021 total score are over
represented in the set of significant associations, as are smoking status and diabetes.
3.4 Conclusions
Alpha diversity and single genera abundances are associated with several environmental covariates
passing multiple testing criteria in some cases with ease. It is very interesting that a large fraction
of the most significant associations are provided by two of the healthy eating scores considered,
and that this corresponds so well with the alpha diversity findings. It is also intriguing that the
largest order (Clostridiales) contributes more than its fair share of significant associations.
64
CHAPTER 4: COMPUTATION AND INTERPRETATION OF ANCESTRY ESTIMATES
4.1 Genetic Ancestry Estimation
Information about genetic ancestry structure is (whether given as principal components, or
ancestry estimates) important for non-genetic and genetic association studies. In a non-genetic
association study, information about genetic ancestry helps to control spurious associations
confounded by genetic effect. In a genetic association study, information about genetic ancestry
(usually based on a principal components analysis) is critical to control for spurious association
due to population stratifications (Price et al., 2006). Genetic ancestry was carefully computed and
analyzed in the study to reveal number of ancestral populations in our admixed samples and to
provide corresponding ancestral proportion estimates for each subject. Information regarding
ancestral proportions is also required for subsequent analysis such as Zhernakova analysis and
GWAS analysis (Zhernakova, A and et al. 2016). There is more than one software implementations
available to compute genetic ancestry. We have made the decision to choose ADMIXTURE to
compute genetic ancestry in this project for its flexibility and computational efficiency (Alexander,
Novembre, & Lange, 2009).
4.2 Methods
Estimates of ancestry percentage are computed using multi-locus genotype data in Admixture
software (Alexander et al., 2009). In order to compute genetic ancestry we used genotypes from a
total of 42,583 SNPs which were a part of all the GWAS chips used in the various MEC GWAS
studies (Haiman et al., 2011) that contribute genetic data to this project. For the purpose of ancestry
estimation we randomly chose 20,000 of the of 42,583 s The genetic ancestries were computed
65
based on the 20,000 SNPs rather than all 42,583 simply to speed up the computations, execution
times are quadratic in the number of SNPs used (Alexander et al., 2009).
To determine the value K (number of estimated admixed ancestries), a cross-validation procedure
is performed. As shown in Figure 4.1, as the K increased there was a large decrease in cross-
validation error, but then a plateau was observed with K=8 result in the lowest cross-validation
error compared to other K values. We selected to estimate reported ancestry percentages and
explored its relationship to other data.
FIGURE 4.1. Cross-validation error by number of estimated admixed ancestries.
The eight ancestry percentages are estimated using a likelihood model in Admixture using the set
of 20,000 SNPs. Let 𝑔 𝑖𝑗
represent the observed counts of allele 1 at SNP j of subject i; 𝑞 𝑖𝑘
represent the fraction of subject i’s genome contributed by population k; 𝑓 𝑗𝑘
represent the
66
frequency of allele 1 at SNP j in population k. Assuming individuals are independent, the log-
likelihood of the entire sample is
,
where 𝑞 𝑖𝑘
are the unknown ancestry percentage parameters to be estimated. The underlying
estimation algorithm involves a fast block relaxation scheme using sequential quadratic
programming for block updates, coupled with a novel quasi-Newton acceleration of convergence
(Alexander et al., 2009; Zhou, Alexander, & Lange, 2011). In Figure 4.2, the eight estimated
ancestry percentages are plotted using eight rainbow colors for each subject, which shows clear
pattern specific to self-report ethnicity.
Figure 4.2 Eight estimated ancestry percentages by subjects stratified by self-reported ethnicity.
67
4.3 Further investigation of subgroups
Figure 2 has some somewhat mysterious characteristics and it is not immediately clear what might
be driving some of the patterns shown. We label the K=8 ancestries, as p1 – p8 according to their
position in the output file produced by Admixture. Whites are not homogeneous showing large
amounts of both the ancestries labeled p1 and p3 which we speculate divide northern from southern
Europeans. African Americans show large amounts of ancestry labeled p1 and p7, so that p7 is
clearly African ancestry. Self-reported Latinos are largely p1, and p8, with a smaller amount of p3.
Japanese are mostly p2 and p6. Native Hawaiians have a complicated composition containing
elements p1, p2, p4, and p5.
In order to dig deeper into understanding the admixture components represented by the 8 groups,
we did the follow analyses:
1. We used European only 1000 genomes data combined with our self-reported Europeans to
clarify European origin for p1 and p3
2. Because previous work indicated that major subgroupings of Japanese-speaking people are
Mainland Japanese and southern Japanese especially those originating from Okinawa, the
southern-most heavily populated island of Japan (Yamaguchi-Kabata Y, Nakazono K,
Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, 2008); we investigated Okinawan
ancestry using publicly available information
(https://kozainternationalplaza.wordpress.com/2015/10/10/okinawan-family-names/)
about Okinawan last names.
3. It is recognized that native Hawaiians are probably the most recently admixed group with
only a few generations of admixture to date. Because of the recency of admixture,
information collected from the initial questionnaire about parental ancestry and admixture
68
we considered to be informative in this particular group. In fact, only 38 percent of self-
reported Native Hawaiians report both parents as being only of Native Hawaiian origin.
This is in contrast with the other admixed groups: for example, nearly all the African
American participants in the MEC report both parents as African American.
4.4 Labeling Assignment of European 1 and European 2:
In order to clarify the European origins of the components p1 and p3 (which we rename European
1 and European 2 which portion of European we downloaded genotypes for three European ethnic
groups from the 1000 genomes study, namely Italians from Tuscany (TSI) a part of north central
Italy, British (BRT), and Fins (FIN). We combined the 1000 genome samples with the self-
reported whites and setting K=4, we computed ancestral proportion estimates labeled K1 to K4
The results are shown graphically below, our analysis showed the following results
1. E1 (mean in White = 38.6%) is highly positively correlated with the ancestry signature of
Southern Europeans (TSC from 1000 Genome project).
69
2. E2 (mean in White = 51.1%) is highly negatively correlated with the ancestry signature of
Southern Europeans ( Italian from 1000 Genome project) (see the plot below)
Figure 4.3 Scatter plot of ancestry proportion for E1 ancestry vs. Italian ancestry (left panel);
Scatter plot of ancestry proportion for E2 ancestry vs. Italian ancestry (right panel)
3. The average British-like ancestry (northern European) signature (K1) is very close to that of the
1000 Genome British, which is consistent with the fact that the European ancestry in our data is
dominated by E2 (mean in White = 51.1%) (see Figure 4.4 below) .
70
Figure 4.4 Scatter plot of K1-K4 ancestry proportions by FIN, GBR, TSI and white.
4. Literature review reveals that the five largest European ancestries in Hawaii are German (7.4%),
Irish (5.2%), English (4.6%), Portuguese (4.3%) and Italian (2.7%) of the total Hawaiian
population,. This indicates the ratio of Northern European to Southern European in Hawaii is about
2.4, which is consistent to the fact that the European ancestry in our data is primarily dominated
by E2, the Northern European ancestry ("https://en.wikipedia.org/wiki/Hawaii#Population").
Together, the evidences and information presented above strongly indicate "European 1" is the
Southern European ancestry, and "European 2" is the Northern European ancestry.
4.5 Labeling Assignment of Japanese 1 and Japanese 2:
Renaming components p2 and p6 as Japanese 1 and Japanese 2 respectively, A two-ancestry only
Admixture analysis was performed on 1000 Genome project Japanese subjects and MEC Japanese
subjects combined. The analysis resulted in two ancestry proportions, namely A1 and A2. Dr.
71
Lynne Wilkens of the University of Hawaii kindly provided frequency of subjects having more
A1 proportion than A2 proportion and vice versa for each of the Okinawa and non-Okinawa last
names in our MEC samples. A 2-by-2 tabulation was formed to examine whether ancestry A1/A2
is associated with Okinawa last names. A Pearson's Chi-squared test was performed and the
respective p-value is 2.2e-16, indicating strong and statistically significant association between
ancestry A1/A2 and Okinawa last names. Further, the 2-by-2 tabulation reveals that subjects
having higher A2 proportion than that of A1 is associated with a higher percentage of Okinawa
last names (28.57%). Subjects having higher A1 proportion than that of A2 is associated with a
smaller percentage of Okinawa last names (3.25%).
Okinawa last
name
non-Okinawa last
name
Percentage of Okinawa last
name
A1 > A2 23 685 3.25%
A1 < A2 38 95 28.57%
Pearson's Chi-squared test with Yates' continuity correction
X-squared = 102.9977, df = 1, p-value < 2.2e-16
Table 4.1 The 2-by-2 tabulation of Okinawa last name versus A1/A1 ancestry.
Together, the evidence presented above strongly indicate a higher A2 proportion is positively
associated with Okinawa ancestry while a higher A1 proportion is positively associated with
mainland Japanese (non-Okinawa) ancestry.
In addition, a second set of analysis was performed to examine the correlation between A1/A2
computed from 1000 genome Japanese JAP and our self-reported MEC Japanese with the earlier
72
two Japanese ancestral components (E2, E6 now termed Japanese1/Japanese2). Simple linear
regression analysis reveals Japanese1 is statistically significantly and positively associated with
A1, the mainland Japanese ancestry (p-value<0.0001, correlation=0.90) . Japanese2 is statistically
significantly and positively associated with A2, the Okinawa ancestry (p-value<0.0001,
correlation=0.98). Together, the evidences and information presented above strongly indicate
"Japanese 1" is the mainland Japanese ancestry, and "Japanese 2" is the Okinawa ancestry.
4.6 Labeling Assignment of Hawaiian 1 and Hawaiian 2:
Define Hawaiian 1 and Hawaiian 2 (these are p4 and p5 from the overall MEC ancestral (K=8)
previously. We hypothesized that Hawaiian 1 or Hawaiian 2 is individually related to either Native
Hawaiian, Filipino or Chinese ancestry. To test this hypothesis, variables related to parental
ancestry serves as surrogate or ancillary information to help determine the ancestry nature of
Hawaiian1 and Hawaiian2. Specifically, number of parents having Filipino ancestry (0,1 or 2),
number of parents having Hawaiian ancestry (0, 1 or 2), and number of parents having Chinese
ancestry (0, 1 or 2) were calculated for each subject. Below, box plot 1 reveals Hawaiian 1 is
strongly positively correlated with number of parents having Native Hawaiian ancestry. Box plot
2 and 3 combined indicates Hawaiian 2 is positively correlated with Filipino/Chinese ancestry
(with slightly more of Filipino ancestry). Together, these evidences indicate that Hawaiian 1 is
related to Native Hawaiian ancestry. Hawaiian 2 is related to Filipino/Chinese ancestry.
In summary, we assign ethnic labels to the 8 estimated ancestries based on our analysis as follow:
k8_p1=European 1 (Southern European) (red)
k8_p2=Japanese 1 (Main island Japanese) (yellow)
k8_p3=European 2 (Northern European) (light green)
73
k8_p4=Native Hawaiian 1 (Hawaiian) (very light blue)
k8_p5=Native Hawaiian 2 (Filipino / Chinese Hawaiian) (light blue)
k8_p6=Japanese 2 (Okinawa Japanese) (blue)
k8_p7=African (purple)
k8_p8=Native American (violet)
Figure 4.5 Scatter plot and regression analysis between k8_p2 ancestry and a1 ancestry (left panel);
scatter plot and regression analysis between k8_p6 ancestry and a2 ancestry (right panel)
74
Figure 4.6 Boxplot of NH_1 ancestry by number of self-report Hawaiian parents.
Figure 4.7 Boxplot of NH_2 ancestry by number of self-report Filipino parents.
75
Figure 4.8 Boxplot of NH_2 ancestry by number of self-report Chinese parents.
76
CHAPTER 5: THE MULTIETHNIC COHORT (MEC) MICROBIAL GENOME-WIDE
ASSOCIATION STUDY
5.1 Background
In this chapter, we describe a GWAS scan on the more abundant bacterial taxonomies, the ones
having 20% or more non-zero measurements in relative abundance. In total, 95 out of 104 genera
qualifies by this criteria. Given the large amount of outcomes and SNP genotypes, identifying a
computationally efficient but robust statistical method is at the heart of this type of GWAS analysis.
In a previous chapter, we identified ordinal regression models with quintized relative abundance
as competitive, in terms of power, compared with several other popular methods used to analyze
this type of data. Yet applying ordinal logistic regression model to perform GWAS on 95 outcomes
each with 3 million SNPs is still a computationally formidable task. The typical statistical
implementation for GWAS involves statistical testing and parameter estimation simultaneously.
However, performing parameter estimation adds extra computational time compared to performing
statistical testing alone. In addition, majority of the associations from a GWAS study are
statistically insignificant, leaving parameter estimation on such associations effectively irrelevant.
With these observations in mind, we decided to adopt a GWAS approach that performs statistical
testing first and performs parameter estimation post-hoc only for the ‘significant’ associations.
Furthermore, this approach can be further accelerated by a modified score test approach that only
needs to compute the baseline model once. We also developed approximation methods that can be
more easily implemented in R. For this GWAS, we used Pearson’s residual method showed in a
precious chapter, which resembles a more computationally efficient and robust solution.
5.2 Methods
77
Study Participants
The MEC Gut Microbiome GWAS uses previously existing GWAS data for 5212 individuals who
also have gut microbiome measurements.
All participants are members of the Multiethnic Cohort Study. MEC, a prospective cohort study
established to investigate the association of lifestyle and dietary factors with chronic disease,
within a diverse multiethnic population. See Kolonel et al., for study design details [4]. In short,
the cohort comprises >215,000 men and women who were recruited in 1993 – 1996 and were
between the ages of 45 to 75 at baseline, self-identified as primarily belonging to one of the five
racial/ethnic populations: African Americans, Japanese Americans, Latinos, Native Hawaiians,
and whites. Potential participants were identified and recruited in Hawaii and California. Each
participant completed an extensive self-administered questionnaire at baseline including questions
about diet, medication, tobacco smoking pre-existing conditions etc. A later full questionnaire,
called “Q3” here, asked very similar questions of the participants. To be eligible for the gut
microbiome study individuals had have returned Q3 as well as the initial questionnaire Q1.
Starting approximately in 2010 MEC participants were contacted and asked to participate in a
large-scale project relating obesity and its components to metabolomic, genetic, and microbiome
data. We deal with one of the sub-projects of the overall Obesity Study, namely a GWAS analysis
of the gut microbiome. The University of Hawaii and the University of Southern California
institutional review boards approved of the study protocol and all participants provided written
informed consent for this study.
Gut Microbiome Samples
Stool samples were obtained using a mailed sampler and were processed at the University of
Washington.
78
Genetic data
Starting in 2009 the MEC has performed genome wide genotyping for DNA samples of over
30,000 participants, These data are available for nearly half of all currently living members of the
MEC. When enrollment for the Obesity Study (OS) a priority was placed upon recruiting
individuals with previously existing GWAS data as part of the data collection. The GWAS data
was available for a series of case control studies of cancers including, breast, prostate. Disease-
free controls for these studies were targeted in the OS gut microbiome substudy.
All genetic data described here is based on Illumina platforms (GWAS chips), but a large variety
of different Illumina chips were utilized. As a part of this study imputation to the Haplotype
Reference Consortium was performed using the Michigan server (S. Das et al., 2016). A total of
approximately 3 million SNPs were successfully imputed (with quality score > 0.4 in all studies)
for each of 5212 participants. The gut microbiome GWAS relies on the imputed data since only
about 40,000 genotyped SNPs are common over all the different studies, which is too few for a
GWAS analysis.
Microbial Abundance Measurements
The original stool samples were processed and microbial genomic DNA was extracted. For all
samples, the extracted DNA then underwent Polymerase Chain Reaction (PCR) amplification at a
various regions of the 16S rRNA gene. The amplified microbial DNA were then sequenced on the
Illumina HiSeq platform to output sequencing data. The sequencing data then underwent
bioinformatics processing using QIIME (Navas-Molina et al., 2013). The resulting output were
clusters or groups of bacterial sequences based on their similarity. Each cluster was then assigned
with a full taxonomic name (family-genera-) based on external reference information. The basic
measurement of microbial abundance were the number of sequence counts in each taxonomy.
79
From the sequence counts, relative abundances – were computed. Relative abundance quantifies
normalized bacterial abundance of a single taxonomy and is mathematically defined as following,
𝑅𝐴
𝑖𝑗
=
𝑛 𝑖𝑗
𝑁 𝑗 ,
where 𝑛 𝑖𝑗
is the number of bacterial counts in ith taxonomy for jth individual and 𝑁 𝑗 is the total
number of bacterial counts of all bacterial taxonomies for jth individual. For ordinal regression
models and approximation methods, quintile-categorical variable was generated for each
taxonomy by quintizing its corresponding relative abundance variable.
Ancestry Proportions
Ancestry percentages are computed using multi-locus genotype data in Admixture software
(Alexander et al., 2009). In order to compute genetic ancestry we used genotypes from a total of
42,583 SNPs which were a part of all the GWAS chips used in the various MEC GWAS studies
that contribute genetic data to this project. For the purpose of ancestry estimation we randomly
chose 20,000 of the of 42,583 s The genetic ancestries were computed based on the 20,000 SNPs
rather than all 42,583 simply to speed up the computations, execution times are quadratic in the
number of SNPs used (cite Admixture paper).
To determine the value K (number of estimated admixed ancestries), a cross-validation procedure
is performed. As shown in Figure 4.1, as the K increased there was a large decrease in cross-
validation error, but then a plateau was observed with K=8 result in the lowest cross-validation
error compared to other K values. We selected to estimate reported ancestry percentages and
explored its relationship to other data.
The eight ancestry percentages are estimated using a likelihood model in Admixture using the set
of 20,000 SNPs. See chapter 4 of this thesis for details and for the resolution of the eight ancestry
80
components in terms of geographical heritage. These ancestry variables were included as
adjustment variables for all genetic analyses.
Statistical Methods
The Pearson’s residual method
The Pearson’s residual method is inspired by Pearson’s residual used in the binary logistic
regression model, as fully described above in Chapter 2.
To control for confounding effect, we included in the baseline model a range of variables that
could affect human gut microbiome: age, sex, self-reported ethnicity,
Genomic Inflation factor and multiple testing control
The Chi-squared statistics were adjusted by the genomic inflation factor by dividing the original
Chi-squared statistics by 0.456/(observed median of the Chi-squared statistics) to control for
cryptic effect from the genetics stratification and for the excess conservativeness of the residual-
based method (Chapter 2). We used the standard 5e-8 criteria in p-value to screen the associations
that passes the genome-wide association level.
5.3 Results
Figure 1.1 and 1.2 is the scatter plot of the p-values from ordinal regression method and the
approximation method from chromosome 19 and 20 in Bacteroides. The correlation coefficient of
the p-value (on the -log10 scale) between the two methods are 0.971 and 0.986 on chromosome
19 and 20 respectively, indicating the approximation method approximates the results obtained
from an ordinal regression method reasonably well.
Figure 1.1 and 1.2 also reveal a significant edge of the approximation method over the ordinal
regression model when performing association tests on super-rare SNPs. Figure 5.1 and 5.2
indicate that applying ordinal regression model to super-rare SNPs may result in misleading
81
significant associations possibly due to a failure in model convergence ; notice the line-up of
circles near x=0. These correspond to SNPs for which the ordinal model failed to converge, with
p-values fixed at exceedingly small values when the iterations were stopped. In contrary, the
approximation method is more robust and insensitive to rarer SNPs.
After validating the approximation method, we applied the approximation method to conduct
GWAS on all 95 genera each with 3 million SNP genotypes. The most significant associations on
each chromosome were identified (also referred to chromosome-wise association). Figure1-3 are
the Manhattan’s plots of the genomic inflation factor (GIF) -adjusted p-values from the
chromosomes where the top 3 associations were identified. The y-axis were in log-10 scale and
the x-axis indicates the genomic position of each SNP on the chromosome. Figure 4 is the Q-Q
plot that shows the genomic control is well performed.
A list of twenty SNP chromosome-wide top hits at p < 5x10
-8
are given in Table 5.1. We show
chromosome and location, genus name, the GIF adjusted p-value from the residual method. We
also show the p-value and per-allele odds ratio from fitting the ordinal model to these SNP x genera
pairs. We see that there is a good correspondence between these p-values, which again helps to
justify our approximation.
In terms of overall results, the GWAS analysis provides only a small degree of evidence of genetic
influences on microbiome relative abundances. As summarized in Table 5.1, there are
approximately 20 genome-wide significant associations out of over 100 scans performed. The
number of significant associations expected under the complete null is approximately 0.05 per scan
times 95 scans or 4.75. Although this is fewer associations that the 20 observed, no single
association reaches a p-value that is both genome-wide and genera-wide significant (i.e. taking
account of the number of scans performed). While perhaps not comparable, the p-values from the
82
association testing of the environmental variables considered in Chapter 3, are far stronger than
what we have seen in this preliminary GWAS.
Figure 5.1 Scatter plot of raw p-values between Pearson’s approximation method and ordinal
logistic regression model on chromosome 19 of bug_8 (left panel); Scatter plot of log10-scaled p-
values between Pearson’s approximation method and ordinal logistic regression model on
chromosome 19 of bug_8 (right panel). The correlation for the right panel is 0.971
83
Figure 5.2 Scatter plot of raw p-values between Pearson’s approximation method and ordinal
logistic regression model on chromosome 20 of bug_8 (left panel); Scatter plot of log10-scaled p-
values between Pearson’s approximation method and ordinal logistic regression model on
chromosome 20 of bug_8 (right panel). The correlation shown in the right panel is 0.986.
Figure 5.3 Manhattan’s plot of log10 scaled GIF-adjusted p-values on chromosome 8 of Bug 60.
The red line indicates the nominal genome-wide significance level 5e-8.
84
Figure 5.4 Manhattan’s plot of log10 scaled GIF-adjusted p-values on chromosome 14 of Bug 72.
The red line indicates the nominal genome-wide significance level 5e-8.
85
Figure 5.4 Manhattan’s plot of log10 scaled GIF-adjusted p-values on chromosome 5 of Bug 69.
The red line indicates the nominal genome-wide significance level 5e-8.
86
Figure 5.4 Manhattan’s plot of log10 scaled GIF-adjusted p-values on chromosome 5 of Bug 34.
The red line indicates the nominal genome-wide significance level 5e-8.
87
Figure 5.5 Q-Q plot of GIF-adjusted p-values on chromosome 8 of Bug 60. The red line is a
diagonal line. Majority of the observed p-values closely follow the red diagonal line, which
indicate the genomic inflation is properly controlled.
88
GIF-adjusted p_value Genus_name Chromosome location
4.50E-09 Ruminococcaceae 8:54397925_102730
9.94E-09 Veillonellaceae 14:39053876_32283
1.12E-08 Veillonellaceae 9:9255823_22158
1.45E-08 Lachnospiraceae_Incertae_Sedis 5:16854042_34338
1.48E-08 Erysipelotrichaceae 5:58185842_100637
1.63E-08 Porphyromonadaceae 5:122525926_202810
2.11E-08 Prevotellaceae;Other 4:161231442_274988
2.11E-08 Lachnospiraceae_Incertae_Sedis 1:114888412_174632
2.13E-08 Enterobacteriaceae 8:3643255_11757
2.39E-08 Enterobacteriaceae 12:6923018_13071
2.47E-08 Prevotellaceae 7:108569296_186395
2.79E-08 Erysipelotrichaceae 19:34958931_57252
3.11E-08 Coriobacteriaceae 19:37236504_61121
3.36E-08 Lachnospiraceae 2:73404070_133225
3.93E-08 Lactobacillaceae 1:62091829_90839
3.97E-08 Porphyromonadaceae 18:60026693_96590
4.08E-08 Ruminococcaceae 1:54823299_78091
4.09E-08 Prevotellaceae 10:81023806_136780
4.12E-08 NA 5:149042191_243978
4.32E-08 Prevotellaceae 1:31231126_48241
Table 5.1 Summary of chromosome-wise top associations for all 104 genera.
89
CHAPTER 6: CONCLUSIONS AND FUTURE PLANS
While microbiome is an emerging field attracting growing research interest, there are many
important yet unanswered research questions concerning the relationship between microbiome and
diseases, the relationship between microbiome and environmental factors and the relationship
between microbiome and genetics. Work from this thesis collectively contribute to the knowledge
body of human gut microbiome research. In this thesis, we conducted through literature reviews
regarding microbial functionalities in human health, potential factors that shape the microbial
abundance and community. In Chapter 2 of this thesis, we reviewed and examined several methods
suitable to analyze microbiome data. We compared the statistical power between zero-inflated
negative binomial model, hurdle negative binomial model, ordinal logistic regression model, linear
regression using quintile outcomes and linear regression using relative abundance as an outcome.
We found while no method is uniformly more powerful than all others considered we chose the
ordinal logistic method for our GWAS study since it is competitive with the zero inflated negative
binomial, it uses 1/3 as much computer time, and (unlike the similar linear regression on ranks)
automatically provides easily interpretable odds ratios for each variable in the model. We also
derived important results for a score test that does not need to re-fit the baseline covariates. The
score test method greatly saves computational time and could be up to scale up microbial GWAS
scans. We also note that there are limitations with the score test method. For example, the score
test requires the very same complete set of data throughout the entire GWAS; and as such, missing
data could cause problem when applying the score test method. These limitations in the score test
lead us to the development of the Deviance residual method and the Pearson’s residual method.
We showed that both approximation methods perform well in approximating the Type-3 p-values
obtained from the corresponding ordinal logistic regression model. In Chapter 2, we used ordinal
90
logistic regression model to examine the association between microbial abundance and many
environmental variables (demographics, medical/medication history, living behavior, nutritional
information). We found statistically significant associations between microbial abundances and
healthy eating score, BMI, Obesity, smoking history along with other covariates. In Chapter 4, we
estimated the ancestry proportions of the MEC participants in this project using genetic data. We
conducted several analyses to assign appropriate ethnicity labels to the estimated ancestries. In
Chapter 5, we applied Pearson’s residual method to perform a GWAS scan for over 3 million SNPs
on the 95 common taxon. We discovery 20 of GWAS hits that passes the nominal GWAS
significance level (5e-8) on various genomic locations across several different taxon. From the
GWAS results, we also noticed the ordinal logistic regression model may fail to converge when
analyzing super-rare SNPs. Yet, the Pearson’s residual method was not affected by rare SNPs. The
Pearson’s approximation method would seem not to be greatly affected by missing data (so long
as they are missing at random) and can handle such scenarios automatically.
For the future work beyond this thesis, we need to follow up with the GWAS hits by bioinformatics
annotations. In addition, there will be two publications resulted from this thesis. First publication
concerns the study that examines the association of microbial abundance with baseline and
environmental covariates given in Chapter 3. The second publication will include methodological
considerations and results from the microbial GWAS scan given in Chapter 5. We may also
consider implementing the score test paradigm in Rcpp to speed it up.
91
Reference
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in
unrelated individuals. Genome Research, 19(9), 1655–1664.
https://doi.org/10.1101/gr.094052.109
Beaumont, M., Goodrich, J. K., Jackson, M. A., Yet, I., Davenport, E. R., Vieira-silva, S., … Bell,
J. T. (2016). Heritable components of the human fecal microbiome are associated with
visceral fat. Genome Biology, 1–19. https://doi.org/10.1186/s13059-016-1052-7
Berer, K., Ann, L., Cekanaviciute, E., Jia, X., Xiao, L., Xia, Z., & Liu, C. (2017). Gut microbiota
from multiple sclerosis patients enables spontaneous autoimmune encephalomyelitis in mice,
6–11. https://doi.org/10.1073/pnas.1711233114
Blekhman, R., Goodrich, J. K., Huang, K., Sun, Q., Bukowski, R., Bell, J. T., … Clark, A. G.
(2015). Host genetic variation impacts microbiome composition across human body sites.
Genome Biology, 16(1), 191. https://doi.org/10.1186/s13059-015-0759-1
Bull, M. J., & Plummer, N. T. (2014). Part 1: The Human Gut Microbiome in Health and Disease.
Integrative Medicine (Encinitas, Calif.), 13(6), 17–22. https://doi.org/26770121
Cekanaviciute, E., Yoo, B. B., Runia, T. F., Debelius, J. W., Singh, S., & Nelson, C. A. (2017).
Gut bacteria from multiple sclerosis patients modulate human T cells and exacerbate
symptoms in mouse models, 2–7. https://doi.org/10.1073/pnas.1711235114
Chen, Z. (2016). Microbiome and Metagenomics : Statistical Methods , Computation and
Applications.
Cho, I., & Blaser, M. J. (2012). The human microbiome : at the interface of health and disease,
13(April). https://doi.org/10.1038/nrg3182
Conlon, M. A., & Bird, A. R. (2015). The impact of diet and lifestyle on gut microbiota and human
health. Nutrients, 7(1), 17–44. https://doi.org/10.3390/nu7010017
Das, A., Srinivasan, M., Ghosh, T. S., & Mande, S. S. (2016). Xenobiotic metabolism and gut
microbiomes. PLoS ONE, 11(10), 1–26. https://doi.org/10.1371/journal.pone.0163099
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A. E., Kwong, A., … Fuchsberger, C. (2016).
Next-generation genotype imputation service and methods. Nature Genetics, 48(10), 1284–
1287. https://doi.org/10.1038/ng.3656
Davenport, E. R. (2016). Elucidating the role of the host genome in shaping microbiome
composition. Gut Microbes, 0976(April), 00–00.
https://doi.org/10.1080/19490976.2016.1155022
Davenport, E. R., Cusanovich, D. A., Michelini, K., Barreiro, L. B., Ober, C., & Gilad, Y. (2015).
Genome-wide association studies of the human gut microbiota. PLoS ONE, 10(11), 1–22.
https://doi.org/10.1371/journal.pone.0140301
Dray, S., Chessel, D., & Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological
data tables. Ecology, 84(11), 3078–3089. https://doi.org/10.1890/03-0178
Fourie, N. H., Wang, D., Abey, S. K., Sherwin, L. B., Joseph, P. V., Rahim-Williams, B., …
92
Henderson, W. A. (2016). The Microbiome of the Oral Mucosa in Irritable Bowel Syndrome.
Gut Microbes, 0976(March), 00–00. https://doi.org/10.1080/19490976.2016.1162363
Franzosa, E. A., Hsu, T., Sirota-Madi, A., Shafquat, A., Abu-Ali, G., Morgan, X. C., &
Huttenhower, C. (2015). Sequencing and beyond: integrating molecular “omics” for
microbial community profiling. Nature Reviews Microbiology, 13(6), 360–372.
https://doi.org/10.1038/nrmicro3451
Haiman, C. A., Chen, G. K., Blot, W. J., Strom, S. S., Berndt, S. I., Kittles, R. A., … Timothy, R.
(2011). African ancestry identifies a susceptibility locus at 17q21, 43(6), 570–573.
https://doi.org/10.1038/ng.839.Genome-wide
Hall, A. B., Tolonen, A. C., & Xavier, R. J. (2017). Human genetic variation and the gut
microbiome in disease. Nature Publishing Group. https://doi.org/10.1038/nrg.2017.63
Harach, T., Marungruang, N., Duthilleul, N., Cheatham, V., Mc Coy, K. D., Frisoni, G., …
Bolmont, T. (2017). Reduction of Abeta amyloid pathology in APPPS1 transgenic mice in
the absence of gut microbiota. Scientific Reports, 7(December 2016), 41802.
https://doi.org/10.1038/srep41802
Hartstra, A. V., Bouter, K. E. C., Bäckhed, F., & Nieuwdorp, M. (2015). Insights into the role of
the microbiome in obesity and type 2 diabetes. Diabetes Care, 38(1), 159–165.
https://doi.org/10.2337/dc14-0769
Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet multinomial mixtures: Generative models
for microbial metagenomics. PLoS ONE, 7(2). https://doi.org/10.1371/journal.pone.0030126
Hua, X., Song, L., Yu, G., & Goedert, J. J. (2015). No Title.
Jandhyala, S. M., Talukdar, R., Subramanyam, C., Vuyyuru, H., Sasikala, M., & Reddy, D. N.
(2015). Role of the normal gut microbiota. World Journal of Gastroenterology, 21(29), 8836–
8847. https://doi.org/10.3748/wjg.v21.i29.8787
John, G. K., & Mullin, G. E. (2016). The Gut Microbiome and Obesity. Current Oncology Reports,
18(7). https://doi.org/10.1007/s11912-016-0528-7
Jovel, J., Patterson, J., Wang, W., Hotte, N., O’Keefe, S., Mitchel, T., … Wong, G. K. S. (2016).
Characterization of the gut microbiome using 16S or shotgun metagenomics. Frontiers in
Microbiology, 7(APR), 1–17. https://doi.org/10.3389/fmicb.2016.00459
Kang, D., Adams, J. B., Gregory, A. C., Borody, T., Chittick, L., Fasano, A., … Sullivan, M. B.
(2017). Microbiota Transfer Therapy alters gut ecosystem and improves gastrointestinal and
autism symptoms : an open-label study. Microbiome, 1–16. https://doi.org/10.1186/s40168-
016-0225-7
Kim, S., Kim, H., Yim, Y. S., Ha, S., Atarashi, K., Tan, T. G., … Huh, J. R. (2017). Maternal gut
bacteria promote neurodevelopmental abnormalities in mouse offspring. Nature.
https://doi.org/10.1038/nature23910
Kuczynski, J., Lauber, C. L., Walters, W. A., Parfrey, L. W., Clemente, J. C., Gevers, D., & Knight,
R. (2012). Experimental and analytical tools for studying the human microbiome. Nat Rev
Genet, 13(1), 47–58. https://doi.org/nrg3129 [pii]\r10.1038/nrg3129
93
Menni, C., Zierer, J., Pallister, T., Jackson, M. A., Long, T., Mohney, R. P., … Valdes, A. M.
(2017). Omega-3 fatty acids correlate with gut microbiome diversity and production of N-
carbamylglutamate in middle aged and elderly women. Scientific Reports, 7(1), 1–11.
https://doi.org/10.1038/s41598-017-10382-2
Meredith A. J. Hullar, T. Randolph, K. Zhao, D. O. Stram, K. Curtis, W. Copeland, U. Lim, L.
Wilkens, J. W. Lampe, L. LeMarchand, I. C. (n.d.). Ethnicity and or dietary impacts on the
gut microbiome in the MEC. TBD.
Navas-Molina, J. A., Peralta-Sánchez, J. M., González, A., McMurdie, P. J., Vázquez-Baeza, Y.,
Xu, Z., … Knight, R. (2013). Advancing our understanding of the human microbiome using
QIIME. Methods in Enzymology (1st ed., Vol. 531). Elsevier Inc.
https://doi.org/10.1016/B978-0-12-407863-5.00019-8
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006).
Principal components analysis corrects for stratification in genome-wide association studies.
Nature Genetics, 38(8), 904–909. https://doi.org/10.1038/ng1847
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., … Zoetendal, E. (2010).
A human gut microbial gene catalogue established by metagenomic sequencing. Nature,
464(7285), 59–65. https://doi.org/10.1038/nature08821
Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., … Wang, J. (2012). A metagenome-wide
association study of gut microbiota in type 2 diabetes. Nature, 490(7418), 55–60.
https://doi.org/10.1038/nature11450
Rajagopala, S. V, Vashee, S., Old, L. M., Suzuki, Y., Venter, J. C., Telenti, A., & Nelson, K. E.
(2017). Minireview The Human Microbiome and Cancer, 10(April), 226–235.
https://doi.org/10.1158/1940-6207.CAPR-16-0249
Rooks, M. G., & Garrett, W. S. (2016). immunity. Nature Publishing Group, 16(6), 341–352.
https://doi.org/10.1038/nri.2016.42
Sharpton, T. J. (2014). An introduction to the analysis of shotgun metagenomic data. Frontiers in
Plant Science, 5(June), 1–14. https://doi.org/10.3389/fpls.2014.00209
Shreiner, A. B., Kao, J. Y., & Young, V. B. (2016). The gut microbiome in health and in disease.
Curr Opin Gastroenterol., 31(1), 69–75.
https://doi.org/10.1097/MOG.0000000000000139.The
Strati, F., Cavalieri, D., Albanese, D., Felice, C. De, Donati, C., Hayek, J., … Filippo, C. De.
(2017). New evidences on the altered gut microbiota in autism spectrum disorders, 1–11.
https://doi.org/10.1186/s40168-017-0242-1
Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. Annals of Applied Statistics,
3(4), 1236–1265. https://doi.org/10.1214/09-AOAS312
T. Harach, N. Marungruang, N. Dutilleul, V. Cheatham, K. D. Mc Coy, J. J. Neher, M. Jucker, F.
Fåk, T., L. and T. B. (1989). Reduction of Alzheimer’s disease beta-amyloid pathology in the
absence of gut microbiota. Journal of Chemical Information and Modeling, 53, 160.
https://doi.org/10.1017/CBO9781107415324.004
94
Ursell, L. K., Metcalf, J. L., Parfrey, L. W., & Knight, R. (2012). Defining the human microbiome.
Nutrition Reviews, 70 Suppl 1(suppl 1), S38-44. https://doi.org/10.1111/j.1753-
4887.2012.00493.x
Weinstock, G. M. (2012). Genomic approaches to studying the human microbiota.
https://doi.org/10.1038/nature11553
Xu, L., Paterson, A. D., Turpin, W., & Xu, W. (2015). Assessment and selection of competing
models for zero-inflated microbiome data. PLoS ONE, 10(7), 1–30.
https://doi.org/10.1371/journal.pone.0129606
Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, K.
N. (2008). Japanese population structure, based on SNP genotypes from 7003 individuals
compared to other ethnic groups: effects on population-based association studies. American
Journal of Human Genetics, 83(4), 445–456. https://doi.org/10.1016/j.ajhg.2008.08.019.
Zhao, N., Chen, J., Carroll, I. M., Ringel-Kulka, T., Epstein, M. P., Zhou, H., … Wu, M. C. (2015).
Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based
kernel association test. American Journal of Human Genetics, 96(5), 797–807.
https://doi.org/10.1016/j.ajhg.2015.04.003
Zhou, H., Alexander, D., & Lange, K. (2011). A quasi-Newton acceleration for high-dimensional
optimization algorithms. Statistics and Computing, 21(2), 261–273.
https://doi.org/10.1007/s11222-009-9166-3
Zitvogel, L., Daillère, R., Roberti, M. P., Routy, B., & Kroemer, G. (2017). Anticancer effects of
the microbiome and its products. Nature Publishing Group, 15(8), 465–478.
https://doi.org/10.1038/nrmicro.2017.44
Abstract (if available)
Abstract
The thesis is about the development and implementation of statistical methods for an association study between microbial abundance and baseline covariates (e.g. sex, race, age) and environmental and other covariates of interest (e.g. diet, lifestyle, medication use) as well as a large-scale Genome Wide Association Study (GWAS). Chapter 1 provides background information about the microbiome, current evidences of microbial associations with human health and disease conditions. It also provides reviews of current bioinformatics tools for raw microbial data processing and existing statistical analysis methods for downstream analysis. Chapter 2 covers the existing statistical models used in microbial data analysis. The goal of this chapter is to compare model performance and select a statistically powerful, flexible and scalable method for our analysis. Under the assumption that existing methods would yield proper Type 1 error rate as the methods are all well recognized and established in the past, we can base the analysis of power on fitting the models of interest on a set of representative bacteria and computing likelihood ratio statistics for each variable in the models considered. The models of interest include methods for dealing with excess zeros, in addition we add to the list, the ordinal logistic model (as implemented in SAS and in various R packages). The ordinal model has some conveniences in interpretation of the parameter estimates (as a usual log odds ratio). Chapter 2 contains three analyses, first is a power and computational time comparison for the selected models. What we discovered was that there is no uniformly most powerful method from among those tested but that the logistic model was reasonably competitive in terms of power, used 1/3 of the computation time compared to hurdle or zero-inflated models, and required considerably less model tuning than these methods. The second part of chapter two developed the mathematical formulae for the computationally efficient score test from the logistic regression model. The score test is an approximation to the full likelihood ratio test. In the third part we also considered two other, residual-based, approximation methods, which are even more computationally efficient than the proposed score test procedure. We found that overall the residual methods were slightly conservative compared to the full ordinal logistic model, but with −log10 transformed p-values which were very highly correlated between residual methods and full ordinal logistic regression methods
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Diet quality and pancreatic cancer incidence in the multiethnic cohort
PDF
The multiethnic nature of chronic disease: studies in the multiethnic cohort
PDF
Shortcomings of the genetic risk score in the analysis of disease-related quantitative traits
PDF
Identifying and quantifying transcriptional module heterogeneity and genetic co-regulation, with applications in asthma
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
PDF
Bayesian model averaging methods for gene-environment interactions and admixture mapping
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
PDF
X-linked repeat polymorphisms and disease risk: statistical power and study designs
PDF
Incorporating prior knowledge into regularized regression
PDF
Extending genome-wide association study methods in African American data
PDF
Improving the power of GWAS Z-score imputation by leveraging functional data
PDF
ROC surface in the presence of verification bias
PDF
A genome wide association study of multiple sclerosis (MS) in Hispanics
PDF
Screening and association testing of coding variation in steroid hormone coactivator and corepressor genes in relationship with breast cancer risk in multiple populations
PDF
Risk factors of pelvic floor disorders in the multiethnic cohort study
PDF
Application of Bayesian methods in association mapping
PDF
Applications of multiple imputations in survival analysis
PDF
Computational algorithms and statistical modelings in human microbiome analyses
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
Asset Metadata
Creator
Zhao, Kechen
(author)
Core Title
Statistical methods and analyses in the Multiethnic Cohort (MEC) human gut microbiome data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
05/02/2019
Defense Date
12/20/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Biostatistics,GWAS,human gut microbiome,MEC,microbial GWAS,microbiome,microbiome GWAS,multiethnic cohort,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Stram, Daniel (
committee chair
), Cozen, Wendy (
committee member
), Putnam-Hornstein, Emily (
committee member
), Siegmund, Kimberly (
committee member
), Stern, Mariana (
committee member
)
Creator Email
kechenzh@usc.edu,zhao_kechen@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-165369
Unique identifier
UC11662625
Identifier
etd-ZhaoKechen-7392.pdf (filename),usctheses-c89-165369 (legacy record id)
Legacy Identifier
etd-ZhaoKechen-7392.pdf
Dmrecord
165369
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhao, Kechen
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
GWAS
human gut microbiome
MEC
microbial GWAS
microbiome
microbiome GWAS
multiethnic cohort