The role of genetic ancestry in estimation of the risk of age-related degeneration (AMD) in the Los Angeles Latino population
THE ROLE OF GENETIC ANCESTRY IN ESTIMATION OF THE RISK OF AGE-RELATED DEGENERATION (AMD) IN THE LOS ANGELES LATINO POPULATION

by
Lijun He

A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)

August 2014

Copyright 2014 Lijun He

DEDICATION

I dedicate this work to my parents and sister for their unconditional and continuous love and support.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Dr. Stanley P. Azen, Dr. Paul Marjoram, and Dr. Jim Gauderman for their continuous guidance and support. I would also like to thank Dr. Rohit Varma, Dr. Jerome Rotter, Dr. Kent Taylor, Dr. Ida Chen, Dr. Xiaoyi Gao, Mina Torres, and John Morrison for their suggestions and support during the work. In addition, this work is supported by grant EY011753 (to Dr. Rohit Varma) from the National Eye Institute, National Institutes of Health.

Finally, I am grateful to my close friends. Without them, I would not have made it through the hard times.

TABLE OF CONTENTS

Dedication
Acknowledgements
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Methods
    2.1 Samples and Study populations
    2.2 SNP Panels
    2.3 Software and Statistical methods
Chapter 3: Results
    3.1 Genetic ancestry estimation of LALES population
    3.2 The optimization of global ancestry estimation with AIM selected from whole genome
    3.3 The ancestry estimation by EIGENSoft
    3.4 The effect of ancestry estimation on the association study
Chapter 4: Conclusion
References

LIST OF FIGURES

Figure 1. The optimization of parameter K for STRUCTURE
Figure 2. Individual ancestry proportions for the LALES
Figure 3. The correlation between the estimates with previously collected SNPs and that with randomly selected SNPs
Figure 4. The effect of SNP size on the ancestry estimates
Figure 5. Principal component scatter plots for two axes of variation of LALES samples. A. PC1 versus PC2, B. PC2 versus PC3, C. PC3 versus PC4
Figure 6. Manhattan plot for the GWAS. A. GWAS without adjustment, B. GWAS after adjustment for population stratification
Figure 7. The effect of principal components on the relationship between AMD risk and genetic factors

ABSTRACT

The Latino population is an admixed population. Previous studies indicate a potential role of ethnicity in the risk of AMD. The purpose of this study is to evaluate the effect of population structure on the association between the risk of AMD and single nucleotide polymorphisms (SNPs).

The Los Angeles Latino Eye Study (LALES) is a unique population-based cohort study designed to explore eye diseases in the Los Angeles Latino population. A total of 1007 Mexican Americans aged 40 years and older, including 490 subjects with AMD and 517 corresponding controls, were selected from the LALES population. DNA was extracted for all subjects using blood cards. Genotyping of these DNA samples was performed using the Illumina HumanOmniExpress BeadChip. Genotypes of 988 HapMap samples and 88 Native American samples were also included in our study as reference groups for ancestry estimation.

The PLINK and R software packages are used for data analysis. Panels with 243 ancestry informative markers, and others with a range of different numbers of randomly selected SNPs, are used to estimate global ancestry. This global ancestry analysis is performed using the software STRUCTURE and EIGENSTRAT. Logistic regression is used to evaluate the effect of the global ancestry estimates on subsequent association testing. The software Haploview is used to display the results of the genome-wide association study (GWAS). The correlation coefficient r² is computed to assess the methods used to estimate genetic ancestry and the effect of population stratification.

The average genetic ancestry for individuals from the LALES population is around 52.4% European, 43.5% Native American, 3.8% African, and 0.3% Asian. The results from logistic regression and GWAS indicate that there is no effect of population stratification on the relationship between risk of AMD and SNPs.

CHAPTER 1: INTRODUCTION

It is estimated that 30% of the US population will be Latino by 2050 (US Census Bureau report, 2008). With the fast growth of the Latino population and the increase in the prevalence of eye diseases (Congdon et al., 2003), a clear epidemiological and genetic understanding of eye diseases such as cataract, glaucoma, age-related macular degeneration (AMD), and diabetic retinopathy (DR) among Latinos is needed to improve the quality of life of this population.

The Los Angeles Latino Eye Study (LALES) is a unique population-based cohort study formed to explore eye diseases in the Latino population. The aims of this study are to 1) understand the prevalence and incidence of eye diseases among Latinos; 2) examine the frequency of different levels of visual impairment due to these diseases; and 3) investigate the association between genetic/environmental risk factors and eye diseases (Los Angeles Latino Eye Study, NEI).

The Latino population is admixed. Admixed populations mostly result from interbreeding between previously separated populations after human migration. Since admixed populations are derived from more than one ancestral population (Seldin, Pasaniuc and Price, 2011), they provide a good basis for exploring genetic diversity. The Latino population is one of the most widely studied admixed populations in ancestry analysis. Based on previous studies, the Latino population is mainly derived from Caucasian, Native American, and African populations (Choudhry et al., 2005; Seldin et al., 2011). Some studies assume that the Latino population has four ancestries: Caucasian, Native American, African, and Asian (Shtir et al., 2009; Yang et al., 2011). The Latino population is therefore an appropriate candidate for ancestry analysis because of its genetic structure.

Human genetic diversity, which is a consequence of recombination, genetic mutation, genetic drift, and other processes, is an important issue in biomedical studies. Without a correct understanding of genetic background, inference regarding disease associations is more difficult. Specifically, the existence of population structure in admixed populations requires special consideration. Genetic ancestry is associated with risk for many diseases, but only a small percentage of genetic diversity is a result of variation between populations. Far greater is the variation within local populations (Jorde et al., 2000; Smith and O'Brien, 2005; Florez et al., 2009). In admixed populations the genetic material is a mixture of material from different populations. As such, the genetic variation is greater than is typical, and the potential for spurious association due to population ancestry cannot be ignored in association studies with these kinds of populations (Seldin et al., 2011). The identification of population structure in admixed populations can be used not only to eliminate spurious relationships with disease, but also to reveal historical population events, such as the transatlantic slave trade and the colonization of the Americas (Moreno-Estrada et al., 2013).

With the increased availability of genotyping technologies and the ongoing reduction in genotyping cost, substantial numbers of genetic markers used for genetic ancestry analysis have been collected. In addition, the continuous development of bioinformatics tools and methods allows for increasingly accurate estimation of the original ancestral proportions and decreasing computational cost for analysis of genome-wide genetic data. As such, there has never been a better time to understand the genetic make-up of admixed populations.

The challenges of ancestry analysis include identifying population structure and assigning individuals to populations accurately (Pritchard, Stephens and Donnelly, 2000). Many methods and software packages are available for ancestry analysis, each with its own emphasis. In general, two main methodologies have been used: clustering and principal component analysis (PCA) (Novembre and Ramachandran, 2011). Cluster analysis is a model-based method that groups subjects with similar SNP frequencies into the same group. PCA converts the information in SNP frequencies into principal components (PCs), which are linearly uncorrelated and can be visualized. These PCs can then be used for ancestry analysis.

The most popular software using the clustering method for ancestry analysis is STRUCTURE, which was first introduced by Pritchard et al. (2000). Four ancestry models are available in the current version of the software: 1) no admixture model; 2) admixture model; 3) linkage model; and 4) model using prior population information (Hubisz et al., 2009; Falush, Stephens and Pritchard, 2003; Pritchard et al., 2000). Markov chain Monte Carlo (MCMC) methods and Gibbs sampling are used to compute the posterior distribution of membership in each of the K possible source populations. This method therefore requires extensive computation for large datasets. In addition, the clustering process is highly sensitive to the number of clusters K, so the choice of an appropriate K plays an important role in accurately estimating genetic ancestry. Because of the computational cost of the method, a smaller ancestry informative marker (AIM) panel is required for ancestry evaluation with STRUCTURE. The AIM panel comprises a group of unlinked markers. The number of AIMs used has grown over time, as computational advances have permitted. We also note that a few previous studies have used different methods to screen the SNPs (Tandon, Patterson and Reich, 2010; Yang et al., 2011).

Eigensoft is one of the popular packages developed for principal component analysis (Liu et al., 2013). This dimension reduction method is computationally efficient on large genetic datasets. For example, it can process over 110,000 SNPs (Price et al., 2006).

Age-related macular degeneration (AMD), one of the phenotypes of interest in the LALES study, is an irreversible visual dysfunction. The exact mechanism causing AMD is still not clear. Several studies have shown that AMD is related to aberrant activation of the complement cascade (Edwards et al., 2005; Haines et al., 2005; Klein et al., 2005). Multiple genetic and environmental risk factors are thought to be involved in the development of AMD (Kokotas, Grigoriadou and Petersen, 2011). The prevalence of late AMD has been reported for different populations, including European, Latino, African, and Asian populations (Klein et al., 2006). These data indicate a potential association between the risk of AMD and race.

The purpose of this study is to 1) estimate global ancestry proportions in the Latino samples; and 2) evaluate the effect of population stratification on the association between AMD and specific risk factors.

CHAPTER 2: METHODS

2.1 Samples and Study populations

A total of 1007 Mexican Americans aged 40 years and older, including 490 subjects with AMD and 517 corresponding controls, were selected from the LALES population. DNA was extracted for all subjects using blood cards. Genotyping of these DNA samples was performed using the Illumina HumanOmniExpress BeadChip.

The genotype data for 988 unrelated HapMap III samples were obtained from the International HapMap Project. There are a total of 11 populations in this dataset: African ancestry in Southwest USA (ASW); Utah residents with Northern and Western European ancestry from the CEPH collection (CEU); Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); Chinese in Metropolitan Denver, Colorado (CHD); Gujarati Indians in Houston, Texas (GIH); Luhya in Webuye, Kenya (LWK); Mexican ancestry in Los Angeles, California (MEX); Maasai in Kinyawa, Kenya (MKK); Toscani in Italy (TSI); and Yoruba in Ibadan, Nigeria (YRI) (International HapMap Project). We also obtained genotype data for 88 Native American samples, courtesy of the University of Michigan.

Genotyping quality control is conducted using the following criteria. The sample call rate is required to be more than 95%. The cutoff for the SNP call rate is set to 0.1. SNPs with minor allele frequency (MAF) less than 0.01 are also excluded from the study. The threshold p-value for Hardy-Weinberg equilibrium is set at 1×10⁻⁶.
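
The call-rate and MAF criteria above can be illustrated with a minimal Python sketch. It is not the PLINK procedure used in the study. The function name `qc_filter`, its parameter names, and the genotype coding (0/1/2 allele counts with NaN for missing calls) are assumptions made for illustration, and the SNP cutoff of 0.1 is interpreted here as a maximum missing fraction per SNP, which the text does not state explicitly. The Hardy-Weinberg filter is left out to keep the sketch short.

```python
import numpy as np


def qc_filter(genotypes, min_sample_call_rate=0.95, max_snp_missing=0.1, min_maf=0.01):
    """Return boolean masks (keep_samples, keep_snps) for a samples x SNPs matrix.

    genotypes: allele counts coded 0/1/2, with np.nan marking a missing call
    (a hypothetical coding chosen for this sketch).
    """
    g = np.asarray(genotypes, dtype=float)

    # Sample-level filter: call rate must be strictly above the threshold.
    sample_call_rate = 1.0 - np.isnan(g).mean(axis=1)
    keep_samples = sample_call_rate > min_sample_call_rate
    g = g[keep_samples]

    # SNP-level missingness, computed on the retained samples only.
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    snp_missing = 1.0 - n_called / g.shape[0]

    # Allele frequency from called genotypes; SNPs with no calls get frequency 0.
    allele_sum = np.nansum(g, axis=0)
    freq = np.divide(allele_sum, 2.0 * n_called,
                     out=np.zeros(g.shape[1]), where=n_called > 0)
    maf = np.minimum(freq, 1.0 - freq)

    keep_snps = (snp_missing <= max_snp_missing) & (maf >= min_maf)
    return keep_samples, keep_snps
```

Applying the sample filter first and then computing SNP statistics on the retained samples is one possible ordering; the text does not say which order was used.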

2.2 SNP Panels

An AIM panel with 243 SNPs is formed by consulting previous publications for candidate markers, keeping those that are available in all three sources of data and that pass the quality control criteria. In addition, the qualified SNPs shared by the three datasets are used as a SNP pool for SNP selection. The selection process is similar to that used by Yang et al. (2011). However, since STRUCTURE runs into tractability issues with more than 2500 SNPs (based on tests), we randomly select 2500 SNPs from the above SNP pool as our AIM panel. In order to examine the effect of panel size, smaller panels were randomly selected from this 2500-SNP AIM panel. The resulting AIM panels are used for genetic ancestry estimation.
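
The random panel selection described above can be sketched as follows. This is an illustrative sketch rather than the code used in the study. The function `draw_panels`, the `pool_size` argument, and the seed are hypothetical, and the subpanel sizes are the ones reported in Section 3.2.

```python
import numpy as np


def draw_panels(pool_size, full_panel_size=2500,
                subpanel_sizes=(200, 500, 1000, 1500, 2000), seed=0):
    """Draw a random full panel from a SNP pool, then random subpanels from it.

    SNPs are identified by their integer index in the pool (an assumption);
    each subpanel is sampled without replacement from the full panel.
    """
    rng = np.random.default_rng(seed)
    full_panel = rng.choice(pool_size, size=full_panel_size, replace=False)
    subpanels = {
        size: rng.choice(full_panel, size=size, replace=False)
        for size in subpanel_sizes
    }
    return full_panel, subpanels
```

Each subpanel is drawn independently from the full panel here, so the subpanels are not nested within one another; the text does not specify which of the two designs was used.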

2.3 Software and Statistical methods

PLINK (version 1.0.7) and R (version 3.0.1) are used for data manipulation. STRUCTURE (version 2.3.4) is run on the high-density dataset to evaluate the population structure. The correlation coefficient r² is used to evaluate the relationship between the estimates obtained using smaller panels and those obtained using the 2500-SNP panel. In addition, EIGENSTRAT (version 5.0.1) is used to examine the population structure in terms of PCs. Logistic regression via PLINK and STATA is used to evaluate the effect of the global ancestry estimates on subsequent association testing. Haploview (version 4.2) is used to display the results of the GWAS. The correlation coefficient r² is computed to assess the relationship between the negative log of the p-values derived from the GWAS and that derived from the GWAS adjusted for population stratification. A significance threshold of 0.05 is used as the cutoff for non-GWAS p-values in this study.

CHAPTER 3: RESULTS

3.1 Genetic ancestry estimation of LALES population

The genetic ancestry estimates are computed using STRUCTURE. AIM panels collected from previous studies are widely used for ancestry analysis with STRUCTURE. Therefore, the AIM panel of 243 SNPs collected from previous studies is used first to estimate genetic ancestry with STRUCTURE.

Determining the parameter K, the number of populations, is a critical step prior to running STRUCTURE. K can be estimated by the method reported by Evanno et al. (Evanno, Regnaut and Goudet, 2005).
The logarithm of the posterior probability, $\ln \Pr(X|K)$, is computed by STRUCTURE to obtain the probability at different K, assuming a uniform prior on K and using Bayes's rule. The change in $\ln \Pr(X|K)$ is defined as

$\ln' \Pr(X|K) = \ln \Pr(X|K) - \ln \Pr(X|K-1)$.

The second-order change in $\ln \Pr(X|K)$ is calculated as

$|\ln'' \Pr(X|K)| = |\ln' \Pr(X|K+1) - \ln' \Pr(X|K)|$.

Finally, ΔK is defined as the mean of $|\ln'' \Pr(X|K)|$ divided by the standard deviation of $\ln \Pr(X|K)$:

$\Delta K = \mathrm{mean}(|\ln'' \Pr(X|K)|) / \mathrm{sd}(\ln \Pr(X|K))$.

These quantities are plotted against K in Figure 1. One obvious peak was detected at K=4 in Figure 1D. Since the highest value of ΔK occurs at the real K, these plots indicate that K=4 is the most likely value for ancestry estimation with STRUCTURE, which suggests that the Latino population has 4 main ancestries.
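
The ΔK calculation defined above can be sketched in Python. This is a minimal illustration, not the code used to produce Figure 1. The function `evanno_delta_k` and its input layout (a dictionary mapping each K to the lnPr(X|K) values from replicate STRUCTURE runs) are assumptions. The second-order difference is taken on the per-K means across replicates, which is one common convention; the text does not specify this detail.

```python
import numpy as np


def evanno_delta_k(ln_prob_runs):
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    ln_prob_runs: dict {K: [lnPr(X|K) for each replicate run]}, with at least
    two replicate runs per K so that a standard deviation can be computed.
    """
    ks = sorted(ln_prob_runs)
    mean = {k: float(np.mean(ln_prob_runs[k])) for k in ks}
    sd = {k: float(np.std(ln_prob_runs[k], ddof=1)) for k in ks}

    delta_k = {}
    for k in ks:
        if (k - 1) in mean and (k + 1) in mean and sd[k] > 0:
            # |ln''Pr(X|K)| = |ln'Pr(X|K+1) - ln'Pr(X|K)|
            #               = |L(K+1) - 2 L(K) + L(K-1)|
            second_order = abs(mean[k + 1] - 2.0 * mean[k] + mean[k - 1])
            delta_k[k] = second_order / sd[k]
    return delta_k
```

The value of K with the largest ΔK can then be read off with `max(delta_k, key=delta_k.get)`. ΔK is undefined at the smallest and largest K tested because the second-order difference needs both neighbours.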

Figure 1. The optimization of parameter K for STRUCTURE. Panels (each plotted against K): A. LnPr(X|K); B. Ln'Pr(X|K); C. Ln''Pr(X|K); D. ΔK.

In addition, the population structure estimates for the LALES population at K=3 to 5 are shown in Figure 2. A single vertical line represents an individual. Each color represents one inferred population. The green color represents a European derived population. The blue color indicates African ancestry. The red color corresponds to Native American ancestry. The light pink color is for Asian ancestry. When K=3, there are three colors, indicating three ancestries (European, African, and Native American). When K increases to 4, there are four colors, indicating four ancestries (European, Asian, African, and Native American). When K increases to 5, only the GIH population (Gujarati Indians in Houston) shows some additional ancestry. Since this study focuses on a Latino population, that additional ancestry is not of interest to us. It seems that the most relevant structure in the data is captured at K=4. When K continues to increase, there is no obvious change in the ancestry estimates for LALES. Therefore, this figure also indicates that the Latino population is derived from four main ancestries.

All the results reinforce the statement that the Latino population is derived from 4 main populations, which is consistent with previous studies. Therefore, K=4 was used for the present analysis.

Figure 2. Individual ancestry proportions for the LALES. Panels: K=3, K=4, K=5.
Note: The blue, yellow, light pink, and red lines represent African, European, Asian, and Native American ancestry.

The performance of the 243 SNPs collected from previous studies can be evaluated using these plots from STRUCTURE. Among LALES samples, the average Native American ancestry is estimated to be 41.9% and the average European ancestry is estimated to be 53.0%. On average, only 3.1% of ancestry is estimated as African and 2.0% as Asian for the LALES individuals. These figures agree well with existing estimates from the literature.

For comparison, we also conducted an ancestry analysis with STRUCTURE based on panels of 243 randomly selected SNPs. The ancestry estimates averaged over 10 such randomly selected panels are 43.2% (SD=1.01%) Native American, 52.1% (SD=1.07%) European, 3.4% (SD=0.45%) African, and 1.4% (SD=0.67%) Asian. The similar patterns suggest that both ways of forming an AIM panel produce comparable ancestry estimates. In addition, we calculated the correlation coefficient between the Native American estimates derived from the literature-based AIM panel and the average Native American estimates derived from the randomly selected panels. This correlation coefficient is 0.881. These results are shown in Figure 3.

In summary, the above results support the view that the AIM panel derived from previously collected SNPs and the AIM panels derived from randomly selected SNPs provide comparable ancestry estimates.

Figure 3. The correlation between the estimates with previously collected SNPs and that with randomly selected SNPs.

3.2 The optimization of global ancestry estimation with AIMs selected from the whole genome

With our available computing equipment, 2500 SNPs is the maximum that we can use to perform ancestry estimation with STRUCTURE. More SNPs will, in general, provide better ancestry estimation. Therefore, we treat the ancestry estimates derived from 2500 randomly selected SNPs as the best estimates we can obtain with our current system. We then used the Native American ancestry estimates for Latino individuals to assess the effect of AIM panel size on the ancestry estimation results. In addition to the benchmark case with 2500 SNPs, we also used randomly selected SNP subsets of size 200, 500, 1000, 1500, and 2000 for ancestry estimation with STRUCTURE. Results are shown in Figure 4. The figure shows that as the size of the SNP set increases, the estimates move closer to those obtained with 2500 SNPs, which supports the view that more SNPs provide more accurate estimates.

We have shown that 2500 SNPs provide better estimates than 200 SNPs, since more SNPs provide more information for accurate estimation. However, ancestry estimation with STRUCTURE is very time consuming when the number of SNPs is large. It takes around 8 hours for data with dimension 2000x2500 on a computer with a 2.7GHz Intel Core i7. When the data has dimension 1500x2500, the computation time is reduced to around 4.5 hours. The correlation coefficient between the estimates with 2500 SNPs and those with 1500 SNPs is more than 0.99. These comparisons indicate that a smaller SNP panel might be enough to reach the same conclusion in a time-efficient way. However, for a final analysis, it pays to use as many SNPs as can be processed with the hardware available.
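
The comparison of smaller panels against the 2500-SNP benchmark amounts to a correlation between two vectors of per-individual ancestry estimates. A minimal sketch, assuming the estimates have already been exported from STRUCTURE into arrays, is given below; the function `panel_agreement` and its input layout are hypothetical.

```python
import numpy as np


def panel_agreement(reference_estimates, panel_estimates):
    """Pearson correlation of each panel's estimates with the reference panel.

    reference_estimates: per-individual ancestry proportions from the benchmark panel.
    panel_estimates: dict {panel_size: per-individual proportions, same individual order}.
    """
    ref = np.asarray(reference_estimates, dtype=float)
    return {
        size: float(np.corrcoef(ref, np.asarray(est, dtype=float))[0, 1])
        for size, est in panel_estimates.items()
    }
```

Squaring the returned values gives r², the statistic named in Section 2.3.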

Figure 4. The effect of SNP size on the ancestry estimates (x-axis: SNP size; y-axis: correlation coefficient).

3.3 The ancestry estimation by EIGENSoft

Principal component analysis was applied to the genotype data to reduce the dimension of the data. Scatter plots of pairs of principal components (PCs) are used to show the diversity of the samples (see Figure 5). In the plot of PC1 versus PC2, the 4 major continental populations (European, Native American, African, and Asian) form clear clusters. The Latino population is mainly an admixture between Europeans and Native Americans. The Asian-specific component makes the smallest contribution to the total variation in the first 4 PCs. Since Asians are genetically close to Native Americans, Native American samples do not form clear clusters in the direction of the Asian-specific component. Note that the Latino population, being an admixture of Native Americans and Europeans, falls on a broad spectrum between those populations on the PC plots.
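
The idea behind the PC plots can be sketched with a singular value decomposition of a standardized genotype matrix. This is an illustrative sketch and not EIGENSOFT's implementation. The function `genotype_pcs`, the 0/1/2 genotype coding without missing values, and the scaling of each SNP by sqrt(2p(1-p)) are assumptions made for this example.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=4):
    """Return (pcs, explained) for a samples x SNPs matrix of 0/1/2 allele counts.

    pcs: sample coordinates on the leading principal components.
    explained: fraction of total variance carried by each returned component.
    """
    g = np.asarray(genotypes, dtype=float)

    # Allele frequency per SNP; monomorphic SNPs carry no information and are dropped.
    p = g.mean(axis=0) / 2.0
    polymorphic = (p > 0.0) & (p < 1.0)
    g, p = g[:, polymorphic], p[polymorphic]

    # Center each SNP at its mean (2p) and scale by its binomial standard deviation.
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

    # The left singular vectors, scaled by singular values, are the sample PCs.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    pcs = u[:, :n_components] * s[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return pcs, explained
```

Plotting column 0 of `pcs` against column 1 would give the analogue of a PC1 versus PC2 panel; in such a plot, admixed individuals are expected to fall between the clusters of their source populations.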

Figure 5. Principal component scatter plots for two axes of variation of LALES samples. A. PC1 versus PC2, B. PC2 versus PC3, C. PC3 versus PC4. Legend groups in each panel: CEU, Asian, YRI, NA, LALES.

3.4 The effect of ancestry estimation on the association study

GWAS plots before and after adjustment for population stratification are generated with Haploview (Figure 6). The adjustment for population stratification was performed with the first four principal components generated by Eigensoft. The p-values for the GWAS with and without PC adjustment are generated using PLINK. No p-value less than 10⁻⁶ is found in the unadjusted GWAS. The patterns of the GWAS plots with and without adjustment for principal components are compared. After adjustment for the first four principal components, the pattern of the GWAS plot does not change significantly and no SNP is found to have a p-value less than 10⁻⁶, which indicates that no SNP is significantly associated with the risk of AMD. The correlation between the results obtained from the GWAS and those obtained from the principal component adjusted GWAS is 0.991. All of the above results indicate that population structure is not a confounding factor between the risk of AMD and SNPs in this Latino population.

In addition, the relationship between AMD risk and the estimated ancestry proportions is assessed with univariate analysis (Figure 7). There is no significant association between AMD risk and the estimated ancestry proportions (p>0.05). This result further confirms that population structure is not a confounding factor in this study.
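
To illustrate what adjusting an association test for principal components does, the sketch below fits a logistic regression for one SNP with and without a PC covariate. It uses simulated data, not LALES data, and a plain Newton-Raphson fit rather than the PLINK and STATA routines used in the study. The function `fit_logistic`, the simulation settings, and the variable names are all assumptions. The simulation deliberately builds in confounding so that the contrast is visible, whereas the thesis reports that adjustment made little difference in LALES.

```python
import numpy as np


def fit_logistic(covariates, outcome, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficient vector [intercept, covariate_1, covariate_2, ...].
    """
    y = np.asarray(outcome, dtype=float)
    x = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(x @ beta)))   # fitted case probabilities
        weights = mu * (1.0 - mu)
        hessian = x.T @ (x * weights[:, None])
        step = np.linalg.solve(hessian, x.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated example: an ancestry axis (pc1) shifts both the allele frequency
# of the SNP and the disease risk, while the SNP itself has no direct effect.
rng = np.random.default_rng(0)
n = 4000
pc1 = rng.normal(size=n)
allele_freq = 1.0 / (1.0 + np.exp(-pc1))
snp = rng.binomial(2, allele_freq)
case = (rng.random(n) < 1.0 / (1.0 + np.exp(-1.5 * pc1))).astype(float)

# SNP coefficient without and with the PC as a covariate.
beta_unadjusted = fit_logistic(snp.reshape(-1, 1), case)
beta_adjusted = fit_logistic(np.column_stack([snp, pc1]), case)
```

In this simulation the unadjusted SNP coefficient (`beta_unadjusted[1]`) reflects the shared dependence on ancestry, and it shrinks toward zero once `pc1` is included. When stratification is not confounding the association, as reported for LALES, the adjusted and unadjusted results are expected to be close.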

Figure 6. Manhattan plot for the GWAS. A. GWAS without adjustment, B. GWAS after adjustment for population stratification.

Figure 7. The effect of principal components on the relationship between AMD risk and genetic factors.

CHAPTER 4: CONCLUSIONS

In this study we showed that the Latino population is an admixture of 4 source populations, with the major components being Native American and European. However, the above results indicate that population structure does not appear to be a major confounding factor between the risk of AMD and SNPs in this Latino population. In addition, the relationship between AMD risk and the estimated ancestry proportions is assessed with univariate analysis. There is no significant association between AMD risk and the estimated ancestry proportions (p>0.05). This result further confirms that population structure is not a confounding factor in this study.

REFERENCES

Choudhry, S., Coyle, N.E., Tang, H., Salari, K., Lind, D., Clark, S.L., et al. (2005) Population stratification confounds genetic association studies among Latinos. Human Genetics, 118, 652–664.
Congdon, N.G. (2003) Important Causes of Visual Impairment in the World Today. JAMA, 290, 2057–2060.
Edwards, A.O. (2005) Complement Factor H Polymorphism and Age-Related Macular Degeneration. Science, 308, 421–424.
Evanno, G., Regnaut, S. and Goudet, J. (2005) Detecting the number of clusters of individuals using the software structure: a simulation study. Molecular Ecology, 14, 2611–2620.
Falush, D., Stephens, M. and Pritchard, J.K. (2003) Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies. Genetics, 164(4), 1567–1587.
Florez, J.C., Price, A.L., Campbell, D., Riba, L., Parra, M.V., Yu, F., et al. (2009) Strong association of socioeconomic status with genetic ancestry in Latinos: implications for admixture studies of type 2 diabetes. Diabetologia, 52, 1528–1536.
Haines, J.L. (2005) Complement Factor H Variant Increases the Risk of Age-Related Macular Degeneration. Science, 308, 419–421.
Hubisz, M.J., Falush, D., Stephens, M. and Pritchard, J.K. (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9, 1322–1332.
International HapMap Project, http://hapmap.ncbi.nlm.nih.gov/
Jorde, L.B., Watkins, W.S., Bamshad, M.J., Dixon, M.E., Ricker, C.E., Seielstad, M.T., et al. (2000) The Distribution of Human Genetic Diversity: A Comparison of Mitochondrial, Autosomal, and Y-Chromosome Data. The American Journal of Human Genetics, 66, 979–988.
Klein, R.J. (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science, 308, 385–389.
Klein, R., Klein, B.E.K., Knudtson, M.D., Wong, T.Y., Cotch, M.F., Liu, K., et al. (2006) Prevalence of Age-Related Macular Degeneration in 4 Racial/Ethnic Groups in the Multi-ethnic Study of Atherosclerosis. Ophthalmology, 113, 373–380.
Kokotas, H., Grigoriadou, M. and Petersen, M.B. (2011) Age-related macular degeneration: genetic and clinical findings. Clinical Chemistry and Laboratory Medicine, 49(4), 601–616.
Los Angeles Latino Eye Study, NEI, http://www.nei.nih.gov/latinoeyestudy
Liu, Y., Nyunoya, T., Leng, S., Belinsky, S.A., Tesfaigzi, Y. and Bruse, S. (2013) Softwares and methods for estimating genetic ancestry in human populations. Human Genomics, 7, 1.
Moreno-Estrada, A., Gravel, S., Zakharia, F., McCauley, J.L., Byrnes, J.K., Gignoux, C.R., et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLOS Genetics, 9, e1003925.
Novembre, J. and Ramachandran, S. (2011) Perspectives on Human Population Structure at the Cusp of the Sequencing Era. Annual Review of Genomics and Human Genetics, 12, 245–274.
Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A. and Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38, 904–909.
Pritchard, J.K., Stephens, M. and Donnelly, P. (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics, 155(2), 945–959.
Seldin, M.F., Pasaniuc, B. and Price, A.L. (2011) New approaches to disease mapping in admixed populations. Nature Reviews Genetics, 12, 523–528.
Shtir, C.J., Marjoram, P., Azen, S., Conti, D.V., Le Marchand, L., Haiman, C.A., et al. (2009) Variation in genetic admixture and population structure among Latinos: the Los Angeles Latino eye study (LALES). BMC Genetics, 10, 71.
Smith, M.W. and O'Brien, S.J. (2005) Mapping by admixture linkage disequilibrium: advances, limitations and guidelines. Nature Reviews Genetics, 6, 623–632.
Tandon, A., Patterson, N. and Reich, D. (2010) Ancestry informative marker panels for African Americans based on subsets of commercially available SNP arrays. Genetic Epidemiology, 35, 80–83.
US Census Bureau Report, 2008, https://www.census.gov/newsroom/releases/archives/population/cb08-123.html
Yang, J.J., Cheng, C., Devidas, M., Cao, X., Fan, Y., Campana, D., et al. (2011) Ancestry and pharmacogenomics of relapse in acute lymphoblastic leukemia. Nature Genetics, 43, 237–241.
Asset Metadata
Creator
He, Lijun
(author)
Core Title
The role of genetic ancestry in estimation of the risk of age-related degeneration (AMD) in the Los Angeles Latino population
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
08/08/2014
Defense Date
08/08/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
AMD,ancestry analysis,Latino population,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Azen, Stanley P. (
committee chair
), Marjoram, Paul (
committee chair
), Gauderman, William James (
committee member
)
Creator Email
LijunHe@gmail.com,lijunhe@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-457444
Unique identifier
UC11286835
Identifier
etd-HeLijun-2795.pdf (filename),usctheses-c3-457444 (legacy record id)
Legacy Identifier
etd-HeLijun-2795.pdf
Dmrecord
457444
Document Type
Thesis
Format
application/pdf (imt)
Rights
He, Lijun
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA