Benchmarking of Computational Tools for Ancestry Prediction using RNA-seq data
by
Anushka Yadav
A Thesis Presented to the
FACULTY OF THE USC ALFRED E. MANN SCHOOL OF PHARMACY
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements of the Degree
MASTER OF SCIENCE
(PHARMACEUTICAL SCIENCES)
August 2023
Copyright 2024 Anushka Yadav
Acknowledgments
I would like to thank my advisor, Dr. Serghei Mangul, for being very helpful and supportive, giving
me advice on the thesis and valuable feedback on all the experiments. I would also like to thank
Dr. Ian Haworth and Dr. Peter Calabrese for being my committee members and providing me
with their support and advice. Finally, I would like to thank my teammate, Ryan Alomair, for
helping me with all the experiments and pipelines.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Materials and Methods
Chapter 3: Results
Chapter 4: Discussion
References
List of Tables
Table 1. Super and Sub-population of the samples
Table 2. FastQC statistics for RNA-seq data
Table 3. EthSEQ results for WES and RNA-seq data
Table 4. K values with minimum CV error for 14 samples
List of Figures
Figure 1. Outline of the project
Figure 2. Pipeline 1
Figure 3. GATK workflow
Figure 4. STAR aligner 1 and 2-pass mode
Figure 5. Differences between DNA-seq and RNA-seq analysis
Figure 6. Example bash script for RNA download of sample NA20847
Figure 7. Bash script for FastQC processing
Figure 8. Bash script for WES data bam file creation
Figure 9. Bash script for RNA-seq data bam file creation
Figure 10. EthSEQ commands
Figure 11. PCA graph for HG00171 WES data
Figure 12. PCA graph for HG00171 RNA-seq data
Figure 13. PCA graph for HG00732 WES data
Figure 14. PCA graph for HG00732 RNA-seq data
Figure 15. Bash script for data pre-processing using GATK
Figure 16. Bash script for variant calling
Figure 17. Bash script for AKT
Figure 18. PCA graph for WES SNPs data
Figure 19. PCA graph for RNA-seq SNPs data
Figure 20. Bash script for PLINK binary file set creation
Figure 21. ADMIXTURE command line code
Abstract
Recently, considerable interest has been placed on identifying the role of genetic architecture in
differential risk patterns based on one’s ancestry. Besides the gold-standard traditional data
sources such as WGS (whole genome sequencing), WES (whole exome sequencing), or genotype
data from the 1000 Genomes Project, RNA-sequencing (RNA-seq) is now popularly used to call
genomic variants, providing inferred estimates for single nucleotide variants (SNVs) via the
Genome Analysis Toolkit (GATK) pipeline. RNA-seq is a prominent technology for transcriptome
profiling that provides accurate measurement of the levels of transcripts and genes and is highly
favorable for its low cost. The aim of our study is to use the SNVs inferred from RNA-seq data to
accurately predict genetic ancestry, capturing both the proportion of ancestral estimates across
the genome and locus-specific allelic ancestral effects. We then compared the ancestry inferred
from RNA-seq data to the ancestry inferred from the gold-standard genomic data (WGS and WES)
and identified differences. Our study highlights the use of RNA-seq data for ancestry estimation
and will inform the genomics community about the best computational tools for identifying
ancestries from RNA-seq data. Our benchmarking results will allow the biomedical field to
effectively annotate the much larger cohort of 80,000 public RNA-seq samples and to evaluate
complex population substructure by expression quantitative trait loci (eQTL) analysis within and
across ancestries in a diverse cohort.
Chapter 1: Introduction
The DNA sequences of two arbitrarily selected individuals are up to 99.9% identical, and yet the
remaining 0.1% is responsible for the heterogeneity in the genome that differentiates each
person. This variation between one genome sequence and another can be due to the presence of
single nucleotide polymorphisms (SNPs). SNPs are variations at a single position in a DNA sequence
among individuals, and these differences can be used to determine the ancestral background of
the individuals. Ethnicity can be defined as belonging to a specific group with mutual religious,
cultural, or racial traits, whereas the genomic variations found among individuals from different
populations because of their ancestors can be referred to as genetic ancestry. Since ethnicity is
more likely to be based on cultural background, it involves an element of self-identification,
which adds uncertainty when it is used as a proxy for genetic ancestry [1].
The rise of genetic ancestry estimation has provided quantitative answers related to the
classification of individuals and the role of population stratification in health conditions and drug
response [2]. In most cases, the available ancestry data may be self-reported, which can reflect
different levels of ancestry and may not match the ancestry predicted using various genomic
technologies [3]. There is an abundance of genomic information in public and private databases,
and genomic data has proven to be an effective tool for inferring ancestry [4]. Much research has
been done on the prediction of ancestry using the standard traditional methods such as WGS
(whole genome sequencing) and WES (whole exome sequencing). These methods use the SNPs
information inferred by various variant calling pipelines, since these technologies provide greater
depth and breadth of sequencing coverage, which has implications for variant calling [5].
The gold standard method refers to the available techniques that have been benchmarked under
reasonable conditions. The term can be used to describe an experimental pipeline or model that
has been exhaustively tested and has gained approval as a reliable method in the research field.
The gold standard techniques used for this project involve the prediction of ancestry using Whole
Genome Sequencing and Whole Exome Sequencing data. WGS is the most comprehensive
source of genetic variations found in an individual [6], whereas WES data is helpful in recognizing
rare as well as common variants while covering less than 2% of the human genome [7]. Multiple
gene-level as well as variant-level pipelines have been developed and used for the analysis of
the WGS and WES data to estimate the ancestry.
DNA-seq can be used to understand the underlying genetic structure of a physiological outcome
as it tells us more about the stable genomic sequence. Since RNA-seq focuses more on the
downstream gene expression, it can be used to support the causative variants that are often
not taken into consideration in WGS or WES data. However, there is a lack of systematic
benchmarking studies of the computational tools that use RNA-sequencing (RNA-seq) data to
detect one’s ancestry. Relatively new bioinformatic procedures allow the detection of DNA
variants indirectly from RNA-seq data, which is inexpensive and can be used when DNA data is
unavailable [4]. There is a growing need for a comprehensive list that clearly elucidates and
compares the efficacy of the tools using RNA-seq with the other existing omics technologies [8].
Studies that focus exclusively on the use of RNA-seq data in different tools, in comparison to the
more widely used gold standard techniques, are lacking [4]. It is still an open question whether
SNPs inferred from RNA-seq data can be used effectively by in silico tools for ancestry
prediction [9]. Altogether, the studies done so far do not provide clarity on whether different
bioinformatics tools can use RNA-seq data to provide correct ancestry estimates.
Chapter 2: Materials and Methods
2.1 Samples
We used the 1000 Genomes Project data on GRCh38 for the low-coverage WGS and exome data.
The high-coverage RNA-seq data was downloaded from the 1000 Genomes Human Genome
Structural Variation Consortium Phase 2. The WGS, WES, and RNA-seq data were aligned to the
human reference genome hg38. Individuals belonging to the super populations (Table 1) African
(AFR), East Asian (EAS), South Asian (SAS), European (EUR), and Mixed American (AMR) were
selected.
Table 1- Super and Sub-population of the samples
Super Population | Sub-population | Sample ID
African (AFR) | Yoruba in Ibadan, Nigeria | NA19238, NA19239
East Asian (EAS) | Han Chinese South | HG00512, HG00513
East Asian (EAS) | Kinh in Ho Chi Minh City, Vietnam | HG01596
East Asian (EAS) | Han Chinese in Beijing, China | NA18534
East Asian (EAS) | Chinese Dai in Xishuangbanna, China | HG00864
East Asian (EAS) | Japanese in Tokyo, Japan | NA18939
European (EUR) | Finnish | HG00171
European (EUR) | British in England and Scotland | HG00096
European (EUR) | Toscani in Italy | NA20509
South Asian (SAS) | Gujarati Indians in Houston, TX | NA20847
Mixed American (AMR) | Puerto Rican in Puerto Rico | HG00731, HG00732
2.2 Outline
The flowchart (fig. 1) depicts the outline of the project, which involves several approaches based
on the input format requirements of the tools. The first pipeline explains how the ground truth
was established using WGS and WES data in the form of fastq files from the 1000 Genomes
Project. After the data preprocessing, GATK filters were used for variant discovery. The vcf files
containing the SNPs information (true SNPs data of known ancestry) were then used as the input
to the respective tools, and the output was marked as the gold standard result. These results
were used to evaluate the predictive accuracy obtained when RNA-seq data replaces the WGS and
WES data in the tools.
Figure 1- Outline of the project
The following pipelines were used for RNA-seq data preprocessing and variant calling-
Approach 1: In this pipeline (fig. 2), bam files were generated as per the requirements of the tool
EthSEQ [10]. The WGS paired-end fastq files downloaded from the 1000 Genomes Project were
aligned to the indexed hg38 reference genome using the BWA-MEM aligner, which created a single
sam file [11]. Samtools was used to sort and index the sam file and convert it to the binary
equivalent bam file format [12]. Single-end RNA-seq fastq files were aligned using the STAR
aligner, as it provides the connectivity details needed to reconstruct the full extent of the spliced
RNA molecules [13].
Figure 2: Pipeline 1
Approach 2: The second pipeline involved data pre-processing steps along with variant discovery
using GATK functionalities (fig. 3) for the tool AKT [14]. The raw unmapped fastq files were
checked for sequence quality, GC content, and adapter sequence presence using FastQC [15].
The reads were mapped using the BWA-MEM and STAR aligners for WGS and RNA-seq data,
respectively. MarkDuplicatesSpark was used to identify and flag duplicate reads, which can arise
from library preparation (for example, PCR amplification) or sequencing artifacts [16]. The
flagged reads were ignored by GATK during the variant calling step.
Figure 3- GATK workflow [17]
Variant calling algorithms rely heavily on base quality scores, since these scores indicate how
much the base call at a particular position can be trusted. Scores assigned by the sequencing
machines are prone to errors. Therefore, the quality scores were recalibrated using GATK’s
BaseRecalibrator, which requires a set of known variants to build a machine-learning model that
is used for adjusting the base quality scores of the query sequence [18]. The known set of variants
was taken from dbSNP for the Homo sapiens assembly in vcf file format. The analysis-ready reads
were used for variant calling with HaplotypeCaller, which calls SNPs and indels using local
re-assembly of haplotypes [19]. To separate the SNPs from the indels, the SelectVariants function
was run with the reference genome hg38 and the generated vcf file to extract the SNPs
information [20].
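The role of base quality scores described above can be illustrated with a minimal Python sketch of the standard Phred relationship, Q = -10 log10(p), where p is the probability that a base call is wrong. The function names and the example quality string are hypothetical, and the Phred+33 offset is an assumption that matches the Sanger / Illumina 1.9 encoding; this is not part of the GATK recalibration model itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in qual]


# Hypothetical quality string: 'I' -> Q40, '5' -> Q20, '+' -> Q10
scores = decode_fastq_quality("II5+")
for q in scores:
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

A score of Q20 therefore corresponds to a 1 in 100 chance of a wrong call, which is why small shifts in recalibrated scores can change which variants a caller reports.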
Approach 3: The third pipeline was used for the tool ADMIXTURE [3] to predict the
contributing populations for the samples. It involved similar steps for data pre-processing and
variant discovery as pipeline 2, with an added step of converting the vcf files to the binary PLINK
file format [21]. The input for the tool was in the form of .bed files that store the genotype data,
along with the supplementary .bim files containing SNP names and map positions and .fam files
with family structure information. PLINK 1.9 provides a --make-bed option that was used to
convert the vcf files into bed file format. Since ADMIXTURE does not recognize missing genotypes
and reports an error along with the line number, the --vcf-half-call ‘h’ option was used to treat
all the half-missing genotype calls as haploid or homozygous [22]. To further clean the dataset,
--geno 0.1 was applied to remove all the variants with over 10% missing genotype data, as
ADMIXTURE’s machine learning algorithm is unable to work with the missing
information [23].
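As a rough illustration of what a missingness filter such as PLINK's --geno 0.1 does, the following Python sketch drops every variant whose missing-call rate exceeds a threshold. PLINK documents --geno as a per-variant filter (per-sample filtering is a separate option). The variant IDs, genotype values, and function names below are hypothetical and only mimic the behavior on a toy matrix; this is not PLINK's implementation.

```python
def variant_missing_rate(calls):
    """Fraction of samples with a missing call (None) at one variant."""
    return sum(call is None for call in calls) / len(calls)


def filter_variants_by_missingness(matrix, max_missing=0.1):
    """Keep variants whose missing-call rate does not exceed max_missing."""
    return {
        variant_id: calls
        for variant_id, calls in matrix.items()
        if variant_missing_rate(calls) <= max_missing
    }


# Hypothetical genotypes for 10 samples (0/1/2 = alternate allele count).
genotypes = {
    "rs_a": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],            # 0% missing
    "rs_b": [0, 1, None, 0, 1, 2, 0, 1, 2, 0],         # 10% missing, kept
    "rs_c": [None, 1, None, 0, 1, 2, 0, 1, 2, 0],      # 20% missing, dropped
}

kept = filter_variants_by_missingness(genotypes, max_missing=0.1)
print(sorted(kept))
```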
2.3 DNA-seq vs RNA-seq analysis
2.3.1 Comparison of BWA and STAR aligner
RNA-seq data present unique computational challenges for the data analysis workflow due to the
gapped nature of the RNA data [24]. RNA-seq reads are typically around 36-200 nt, and mapping
these reads to the reference transcriptome is a critical step in the analysis [25]. Since even the
transcriptomes of well-studied species such as humans and mice are incomplete, RNA-seq reads
are mapped to the reference genome, which acts as a substitute for the transcriptome. Aligning
RNA-seq data to the reference genome is trickier than aligning whole genome sequencing (WGS)
data due to the presence of multiple splice junctions. RNA transcribed from DNA has exons
interspersed with non-coding intron sequences, which are removed later by splicing. Since exons
are derived from non-contiguous genome sequences that are separated by varying distances,
gapped alignment is the main concern while mapping the RNA-seq data to the reference
genome. Several algorithms map the reads to the known exonic regions in the reference, but
they fail when an RNA-seq read spans an exon boundary, since parts of the read cannot be
mapped contiguously to the reference genome. Conventional aligners used for WGS data, such as
BWA, are not able to handle spliced transcripts. Since RNA-seq data has gained popularity over
the past few years, multiple splice-aware aligners are being developed.
Figure 4- STAR aligner 1 and 2-pass mode [26]
STAR, or Spliced Transcripts Alignment to a Reference, is an aligner that was developed to
overcome the challenges faced while aligning RNA-seq data, as it accounts for spliced
alignments. STAR exhibits high accuracy and outperforms other aligners by a factor of 50 in
mapping speed. The efficient mapping done by the STAR aligner is achieved by its algorithm,
which performs seed searching, clustering, stitching, and scoring [26]. The STAR aligner provides
two modes for the alignment of RNA-seq data to the reference genome: one-pass mapping mode
and two-pass mapping mode. In one-pass mode, the STAR aligner maps the RNA-seq data to the
reference genome and creates alignments in SAM/BAM format along with a splice junctions file
in table format. This mode is efficient for differential gene expression analysis. However, the
two-pass mode is recommended for the variant discovery pipeline as it improves alignment
sensitivity [27]. In the two-pass mode, the STAR aligner identifies the splice junctions in the
genome in the first pass and then uses this information in the second pass for accurate alignment
to the reference sequence.
Figure 5- Differences between DNA-seq and RNA-seq analysis [17], [28]
2.3.2 New processing step
Another difference in the RNA-seq analysis as compared to DNA-seq is the added data pre-processing
step of SplitNCigarReads [29]. CIGAR (Compact Idiosyncratic Gapped Alignment
Report) strings are used by the SAM/BAM file format to describe how the query sequence maps
to the reference genome. In spliced RNA-seq alignments, the region of the reference that is
skipped between two exonic segments (an intron) is denoted by the ‘N’ operator in the CIGAR
string, and GATK functions are unable to deal with such reads since DNA alignments do not
normally contain N operations. Aligners might also overlook the exonic sequence at a splice
junction if it is too short, which leaves an overhang of the read in the intronic region. These
overhangs can cause problems in the downstream analysis of the data. SplitNCigarReads splits
the reads that have N operations in their CIGAR strings into separate exon segments, which
groups the information per exon, and trims the overhangs.
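The splitting idea can be sketched in a few lines of Python. The example parses a CIGAR string and returns the reference intervals of the aligned blocks on either side of each N operation. It is a simplified illustration under stated assumptions: the function name and example coordinates are hypothetical, positions are 1-based and inclusive, and the overhang trimming and read re-writing that GATK performs are not reproduced.

```python
import re

# One CIGAR operation: a length followed by an operator defined in the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def split_on_n(pos, cigar):
    """Return (start, end) reference intervals of the blocks separated by N."""
    blocks = []
    start = cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the exon block that ends just before the skipped region.
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += length
            start = cursor
        elif op in REF_CONSUMING:
            cursor += length
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


# Hypothetical spliced read: 50 aligned bases, a 1000 bp intron, 51 aligned bases.
print(split_on_n(100, "50M1000N51M"))
# Unspliced read: a single block.
print(split_on_n(5, "101M"))
```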
2.4 Ancestry predicting tools
Population structure analysis based on genetic ancestry is an important part of genetic
studies that gives us insight into multiple aspects of different genetic problems related to
population stratification. A single method or software is unable to provide answers to all the
issues [30]. In this project, three different ancestry-predicting tools were used. Each of the
tools has distinct input format requirements, which were met by the pipelines described
earlier.
2.4.1 EthSEQ
EthSEQ is an ancestry-predicting tool that provides an automated workflow to annotate ancestry
using the SNP genotypes derived from whole exome sequencing NGS data. It allows easy
and reliable genetic ancestry annotation using WES data and validates it against the 1000
Genomes Project and TCGA data [10]. To run, the tool requires the selection or creation of a
reference model, which is based on the genotype data at SNP positions for samples with known
ancestry, and a list of BAM files for the samples with unknown ancestry, from which the target
model is built. Using the automated workflow, EthSEQ then annotates the ancestry of each
individual sample and produces detailed and comprehensive visual reports.
EthSEQ constructs the reference model based on the genotype data of the individuals with
known ancestry, using the information available from the 1000 Genomes Project, and considers
the conserved populations EUR (Caucasian), EAS (East Asian), SAS (South Asian), and AFR
(African). The target model is built from the input BAM files of the individuals with unknown
ancestry with the help of the ASEQ [31] genotyping module, which calls the genotypes at the
positions in the reference model. Next, the tool performs principal component analysis using the
SNPRelate R package on the target and the reference model genotype data. The first two PCA
components are used to identify the smallest convex regions enclosing the population-specific
groups of the reference model, which are then compared with the target model. The individual
samples that lie inside a convex region are marked with the corresponding population group
and labeled as INSIDE. The samples that lie outside the polygons are labeled as CLOSEST with the
top contributing population groups, calculated from the distance of the samples to the centroids
of the polygons. The percentage of the contributing populations is also mentioned in the report.
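The INSIDE and CLOSEST labels described above can be illustrated with a small geometric sketch in Python. It builds a convex hull per reference population from two principal components, tests whether a sample falls inside a hull, and otherwise ranks the populations by distance to the hull centroid. All coordinates and function names are hypothetical, the centroid is taken as the mean of the hull vertices (an assumption), and the way EthSEQ converts distances into percentages is not reproduced.

```python
import math


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def annotate(reference, sample):
    """Label a sample INSIDE one population hull or CLOSEST with ranked populations."""
    hulls = {pop: convex_hull(pts) for pop, pts in reference.items()}
    for pop, hull in hulls.items():
        if inside(hull, sample):
            return "INSIDE", [pop]
    ranked = sorted(hulls, key=lambda pop: math.dist(centroid(hulls[pop]), sample))
    return "CLOSEST", ranked


# Hypothetical (PC1, PC2) coordinates for three reference populations.
reference = {
    "EUR": [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)],
    "EAS": [(5, 5), (7, 5), (7, 7), (5, 7)],
    "AFR": [(-6, 0), (-4, 0), (-4, 2), (-6, 2)],
}
print(annotate(reference, (1.0, 0.5)))
print(annotate(reference, (3.5, 3.0)))
```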
2.4.2 AKT
AKT, or Ancestry and Kinship Toolkit, is another statistical ancestry-predicting tool that uses
whole genome sequencing data for the quick detection of related samples, predicting ancestry,
and finding the correlation between variants [14]. It works with large multi-sample VCF or BCF
files as input to perform the analysis. It is implemented in C++ and uses HTSlib to read the
VCF/BCF files. It combines multiple sub-functions into a single binary, and the analyses are run
using akt subcommand input.bcf.
One of the most useful analyses performed using AKT is a fast principal component analysis, which
is a rapid and effective way to classify ancestry for unknown samples. AKT provides a set of
commonly found variants taken from different databases such as the 1000 Genomes Project or the
Genome Aggregation Database (gnomAD), which contain genetic variants present in several
populations around the world. The first two PCA components help in the identification of the
large population arrangements present in the input samples. AKT calculates the singular value
decomposition (SVD), which helps in the reduction of the large genotype matrix of M markers and
N samples. Exact SVD calculation is slow when the number of markers M is greater than the
number of samples N. Hence, AKT computes an inexact SVD that yields only the important first
few principal components. This helps with the easy and rapid curation of large WGS data.
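To show why an approximate decomposition is enough when only the leading components are needed, here is a minimal Python sketch that recovers the first principal component of a small genotype matrix by power iteration. Power iteration is a swapped-in textbook technique used purely for illustration; AKT's own approximate SVD routine is not reproduced here. The genotype values and function names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each marker's mean so that every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_pc_scores(matrix, iterations=50):
    """Approximate per-sample scores on the first principal component."""
    x = center_columns(matrix)
    n, m = len(x), len(x[0])
    u = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # w = X^T u projects onto markers, then u = X w projects back onto samples.
        w = [sum(x[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [sum(x[i][j] * w[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in u))
        u = [value / norm for value in u]
    return u


# Hypothetical genotypes (alternate allele counts): rows are samples, columns markers.
genotypes = [
    [0, 0, 1, 2, 2],  # sample 1, group A
    [0, 1, 1, 2, 2],  # sample 2, group A
    [2, 2, 1, 0, 0],  # sample 3, group B
    [2, 2, 1, 0, 1],  # sample 4, group B
]
scores = first_pc_scores(genotypes)
print([round(score, 3) for score in scores])
```

In this toy matrix the two groups receive scores of opposite sign on the first component, which is the kind of separation the PCA plots rely on.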
2.4.3 ADMIXTURE
ADMIXTURE is a high-performance tool that is useful for estimating the proportions of ancestry
using the SNPs information from the samples with unknown ancestry. It works efficiently by
evaluating the maximum likelihood estimates to infer the admixture in the samples, relying on
the observed genotype data [9]. ADMIXTURE offers a variety of functionalities such as
unsupervised and supervised modes that can be used for a better understanding of the
contributing population groups and ancestral background of the individuals.
For the input, ADMIXTURE requires the binary PLINK format, which includes .bed files containing
the genotype information along with the .bim (SNP names and positions) and .fam (family
structure) support files. Apart from the input files, the tool also requires an estimate of the
number of contributing populations, denoted by K. This can be derived from reference samples
with known ancestries. To confirm the number of contributing populations, the cross-validation
option provided by ADMIXTURE can be used, which is enabled using the --cv flag. The correct K
value results in a lower cross-validation error than other values of K. ADMIXTURE provides
advanced features that are helpful in setting the gold standard results while estimating the
populations contributing to the ancestry of the unknown samples, and in estimating the accuracy
of other tools.
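Selecting K from cross-validation output can be sketched as follows in Python. The example assumes the log lines have the form "CV error (K=3): 0.58110", which is how ADMIXTURE reports cross-validation error when run with --cv; the error values themselves are made up for illustration and are not results from this study.

```python
import re

# Matches lines such as "CV error (K=3): 0.58110".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Map each K to its cross-validation error found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Hypothetical log excerpt collected from runs with K = 2..6.
example_log = """\
CV error (K=2): 0.61250
CV error (K=3): 0.58110
CV error (K=4): 0.55020
CV error (K=5): 0.54300
CV error (K=6): 0.56780
"""

cv_errors = parse_cv_errors(example_log)
print(best_k(cv_errors))
```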
Chapter 3: Results
The data was taken from the 1000 Genomes Project which included whole genome, whole
exome, and RNA sequencing data for 14 samples belonging to different populations. These
samples were in the form of raw fastq files and were downloaded using the high-performance
cluster, Discovery (figure 6).
Figure 6 - Example bash script for RNA download of sample NA20847
Due to the high duplication rate and shorter reads with splicing events in RNA-seq data, FastQC
was used to calculate basic statistics such as total sequences, sequence length, %GC content,
and the overabundance of adapter content. FastQC provides a quick quality analysis of the fastq
files and produces well-structured and detailed HTML as well as text-based reports that can be
used to get an overall idea of the sequence quality and structure [32].
Figure 7- Bash script for FastQC processing
FastQC helps in assessing the base quality of the RNA-seq data by flagging the sequences with
poor quality. Trimming of poor-quality bases was not required since all 14 samples passed
the base quality assessment. As shown in Table 2, the average GC content of the samples
lies between 45% and 52%. For the adapter content analysis, 9 out of 14 samples were flagged
with ‘Warn’, which signifies that an adapter sequence is present in more than 5% of the reads.
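The %GC statistic reported by FastQC is the overall share of G and C bases across all reads. A minimal Python sketch of that calculation is shown below; the two FASTQ records are hypothetical, the parser assumes the simple four-lines-per-record layout, and FastQC's own implementation is not reproduced.

```python
def read_fastq_sequences(text):
    """Extract the sequence line from each 4-line FASTQ record."""
    lines = text.strip().splitlines()
    return [lines[i + 1].strip() for i in range(0, len(lines), 4)]


def gc_percent(sequences):
    """Overall percentage of G and C bases across all sequences."""
    total = sum(len(seq) for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


# Two hypothetical reads in FASTQ format.
example_fastq = """\
@read1
GATTACAG
+
IIIIIIII
@read2
CCGGAATT
+
IIIIIIII
"""

sequences = read_fastq_sequences(example_fastq)
print(f"%GC = {gc_percent(sequences):.2f}")
```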
Table 2- FastQC statistics for RNA-seq data
Sample ID | Encoding | Total Sequences | Sequences flagged as poor quality | Sequence Length | %GC | Adapter Content
HG00096 | Sanger/Illumina 1.9 | 372957646 | 0 | 101 | 48 | Pass
HG00171 | Sanger/Illumina 1.9 | 444383891 | 0 | 101 | 48 | Pass
HG00512 | Sanger/Illumina 1.9 | 340344797 | 0 | 101 | 48 | Warn
HG00513 | Sanger/Illumina 1.9 | 429926959 | 0 | 101 | 48 | Warn
HG00731 | Sanger/Illumina 1.9 | 402311534 | 0 | 101 | 46 | Warn
HG00732 | Sanger/Illumina 1.9 | 320525022 | 0 | 101 | 45 | Warn
HG00864 | Sanger/Illumina 1.9 | 318156787 | 0 | 101 | 48 | Pass
HG01596 | Sanger/Illumina 1.9 | 302487288 | 0 | 101 | 49 | Pass
NA18534 | Sanger/Illumina 1.9 | 320863165 | 0 | 101 | 52 | Warn
NA18939 | Sanger/Illumina 1.9 | 397745589 | 0 | 101 | 48 | Warn
NA19238 | Sanger/Illumina 1.9 | 357726504 | 0 | 101 | 46 | Warn
NA19239 | Sanger/Illumina 1.9 | 326685976 | 0 | 101 | 50 | Warn
NA20509 | Sanger/Illumina 1.9 | 325926330 | 0 | 101 | 49 | Warn
NA20847 | Sanger/Illumina 1.9 | 270439426 | 0 | 101 | 48 | Pass
3.1 EthSEQ results
Following pipeline 1, WES and RNA-seq data were aligned to the reference genome hg38 using
BWA-MEM (fig. 8) and STAR aligner respectively. Sorting and indexing of the sam and bam files
were performed using samtools.
Figure 8- Bash script for WES data bam file creation
Figure 9- Bash script for RNA-seq data bam file creation
The bam files generated from WGS and RNA-seq data were used as the input for EthSEQ as two
separate target models. For the reference model selection from the list provided by the tool,
Gencode.Exome [33] was specified as it takes into consideration the SNPs information from the
hg38 human genome assembly. To run the tool with the input data, the R workspace was loaded
using the r/4.0.0 version.
Figure 10- EthSEQ commands
EthSEQ provides multiple filtering options. The mbq (minimum base quality) parameter sets a
threshold for the base quality scores of the sequence while reading the input files and helps
exclude genotype data of poor quality. Other parameters include mrq and mdc, which stand for
minimum read quality and minimum depth of coverage, respectively. Minimum read quality
controls the quality of the reads that are used for predicting the ancestry, making the process
more efficient and accurate. Depth of coverage is the number of times a specific position in the
genome was sequenced. The default values for these parameters were used while running the
WES and RNA-seq bam files.
Table 3- EthSEQ results for WES and RNA-seq data
No. | Sample | Exome results | RNA-seq results
1. | HG00096 | EUR (inside) | EUR (inside)
2. | HG00171 | EUR (closest): EUR (83.62%), SAS (16.38%) | EUR (closest): EUR (89%), AFR (11%)
3. | HG00512 | EAS (inside) | EAS (inside)
4. | HG00513 | EAS (inside) | EAS (inside)
5. | HG00731 | EUR (inside) | EUR (inside)
6. | HG00732 | EUR (closest): EUR (44.66%), SAS (18.64%), AFR (19.1%), EAS (17.61%) | EUR (closest): EUR (83.86%), EAS (16.14%)
7. | HG00864 | EAS (inside) | EAS (inside)
8. | HG01596 | EAS (inside) | EAS (inside)
9. | NA18534 | EAS (inside) | EAS (inside)
10. | NA18939 | EAS (inside) | EAS (inside)
11. | NA19238 | AFR (inside) | AFR (inside)
12. | NA19239 | AFR (inside) | AFR (inside)
13. | NA20509 | EUR (inside) | EUR (inside)
14. | NA20847 | SAS (inside) | SAS (inside)
Figure 11- PCA graph for HG00171 WES data
Figure 12- PCA graph for HG00171 RNA-seq data
Figure 13- PCA graph for HG00732 WES data
Figure 14- PCA graph for HG00732 RNA-seq data
Table 3 summarizes the results for the WES and RNA-seq data for the 14 samples belonging to
different super populations. As mentioned in Table 1, all the samples have only a single super
population contributing to their ancestries, as indicated by the data available from the 1000
Genomes Project. EthSEQ was successful in assigning most of the samples to single ancestral
groups with the flag INSIDE, denoting the presence of the sample points inside the population
polygons. Samples HG00171 (fig. 11 & 12) and HG00732 (fig. 13 & 14) were marked with the
CLOSEST flag and the other population groups contributing to their ancestry. This signifies that
the sample points lie outside the population polygons, and the percentage of the contributing
population groups is calculated from the distance of the sample points to the centroids of the
polygons. The polygon of the population at the minimum distance was considered to be the
closest ancestral group contributing to the ancestry of the sample. The areas covered by the
polygons are different for the WES and RNA-seq results since different SNPs datasets were
analyzed by EthSEQ. RNA-seq data reflects gene expression levels as well as the genetic
variations present in the expressed sequence, whereas WES data focuses on the coding regions
of the genome and captures the genetic variations present in those regions. Since the same
reference SNPs dataset is used in both cases, the general pattern of the area covered by
the polygons remains comparable.
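One simple way to quantify the agreement between the two data types is the fraction of samples that receive the same top population from both inputs. The Python sketch below transcribes the top population calls from Table 3 and counts the matches; the function name is hypothetical, and the metric is only one possible summary that ignores the INSIDE or CLOSEST flags and the reported percentages.

```python
# (sample, top population from WES, top population from RNA-seq), from Table 3.
TABLE3_CALLS = [
    ("HG00096", "EUR", "EUR"),
    ("HG00171", "EUR", "EUR"),
    ("HG00512", "EAS", "EAS"),
    ("HG00513", "EAS", "EAS"),
    ("HG00731", "EUR", "EUR"),
    ("HG00732", "EUR", "EUR"),
    ("HG00864", "EAS", "EAS"),
    ("HG01596", "EAS", "EAS"),
    ("NA18534", "EAS", "EAS"),
    ("NA18939", "EAS", "EAS"),
    ("NA19238", "AFR", "AFR"),
    ("NA19239", "AFR", "AFR"),
    ("NA20509", "EUR", "EUR"),
    ("NA20847", "SAS", "SAS"),
]


def concordance(calls):
    """Count samples whose WES and RNA-seq top population calls agree."""
    agree = sum(1 for _, wes, rna in calls if wes == rna)
    return agree, len(calls)


agree, total = concordance(TABLE3_CALLS)
print(f"{agree}/{total} samples share the same top population call")
```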
3.2 AKT results
Following the creation of BAM files, pipeline 2 was used for data preprocessing and variant
discovery using GATK functionalities. During the preprocessing step (fig. 15),
MarkDuplicatesSpark was used to mark duplicate reads, which can arise from library preparation
or sequencing artifacts. Once flagged, these duplicates were ignored by GATK in the downstream
processing during variant discovery. BaseRecalibrator was used to build a model that takes
known common variants into consideration when re-calculating the quality scores for the bases
in the input sequence.
Figure 15- Bash script for data pre-processing using GATK
The final bam files from the previous step were used for the variant calling process (fig. 16) with
the HaplotypeCaller function of GATK, which produces a vcf file containing SNPs and INDEL
information from the bam files. To separate the SNPs into a different file, the SelectVariants
function was used to select SNPs from the variant vcf file using the -select-type
(--select-type-to-include) option.
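The separation of SNPs from indels can be illustrated with a short Python sketch that keeps only the VCF records in which the reference allele and every alternate allele are single bases. This is a simplified stand-in for the selection step, not GATK's implementation; the record positions are hypothetical and only the first five VCF columns are inspected.

```python
def is_snp(ref, alt):
    """True if REF and every ALT allele are single nucleotides."""
    alleles = alt.split(",")
    return len(ref) == 1 and all(len(a) == 1 and a in "ACGT" for a in alleles)


def select_snps(vcf_lines):
    """Keep header lines and SNP records; drop indels and other variant types."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        if is_snp(ref, alt):
            kept.append(line)
    return kept


# Hypothetical VCF content: two SNP records, one insertion, one deletion.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10177\t.\tA\tC\t50\t.\t.",
    "chr1\t10352\t.\tT\tTA\t50\t.\t.",
    "chr1\t11008\t.\tC\tG,T\t50\t.\t.",
    "chr1\t13110\t.\tGA\tG\t50\t.\t.",
]

snp_lines = select_snps(example_vcf)
print([line.split("\t")[1] for line in snp_lines if not line.startswith("#")])
```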
Figure 16- Bash script for variant calling
The SNPs in the vcf file were used as the input for AKT, which provides a PCA option for
generating visual graphs based on the common SNPs found in the human genome. AKT provides
a list of reference vcf files that contain common SNP information drawn from various
repositories. The vcf file based on the human genome hg38, ‘wgs.hg38.vcf.gz’, was used for
comparing the genotype data in the WGS- and RNA-seq-based vcf files (fig. 17). AKT uses
EIGENSTRAT [34] for the principal component analysis of the genetic data, and the
accompanying R script is pre-installed in the AKT directory. The -W flag was used to specify
the reference SNP file (also known as the weight file). AKT produces several text files with one
line per sample, containing information such as sample ID, genotype information, chromosome
position, and allele data. These variables were used for the comparison to the reference dataset
and led to the generation of visually informative PCA graphs.
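How a sample can be placed on a PCA graph from a pre-computed weight file can be sketched as follows. This is an assumption-laden illustration and not AKT's code: it uses an EIGENSTRAT-style normalization in which each genotype is centered by twice the reference allele frequency and scaled by the binomial standard deviation. The layout of the weight data, the function name, and the example values are hypothetical.

```python
import math


def project_sample(genotypes, weights):
    """Project one sample onto principal components.

    genotypes: dict site -> alternate-allele count (0, 1, 2), None if missing.
    weights: dict site -> (reference_alt_freq, [loading_pc1, loading_pc2, ...]).
    Sites that are missing in the sample are skipped and contribute zero.
    """
    n_pcs = len(next(iter(weights.values()))[1])
    coords = [0.0] * n_pcs
    for site, (freq, loadings) in weights.items():
        g = genotypes.get(site)
        if g is None or freq <= 0.0 or freq >= 1.0:
            continue
        # Center by the expected allele count and scale by its standard deviation.
        z = (g - 2.0 * freq) / math.sqrt(2.0 * freq * (1.0 - freq))
        for k, loading in enumerate(loadings):
            coords[k] += z * loading
    return coords


# Hypothetical two-site weight table and one sample.
example_weights = {
    "chr1:1000": (0.5, [1.0, 0.0]),
    "chr1:2000": (0.5, [0.0, 1.0]),
}
example_sample = {"chr1:1000": 2, "chr1:2000": 1}
print(project_sample(example_sample, example_weights))
```

Under this scheme, a sample whose vcf lacks many reference sites is pulled toward the origin, which is one plausible way sparse RNA-seq genotypes could blur the population clusters.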
Figure 17- Bash script for AKT
Figure 18- PCA graph for WGS SNPs data
Figure 19- PCA graph for RNA-seq SNPs data
The PCA graphs generated by AKT for the WGS and RNA-seq SNP vcf files (fig. 18 & 19) show
regions marked in different colors that denote various conserved population groups based on the
common SNPs found in the human genome. The PCA graph corresponding to the WGS data
placed all the samples correctly, which can be verified using the ancestry listed in Table 1.
However, the RNA-seq-based PCA graph did not reproduce this placement and the results were
inconclusive. This may be due to the amplified representation of some variant groups in the
RNA-seq genotype dataset, a result of the higher duplication rate and transcriptional noise [35],
which might affect the comparison to the reference dataset variants.
3.3 ADMIXTURE cross-validation analysis of the contributing populations
For ADMIXTURE, pipeline 3 was followed, which required converting the vcf files created by
pipeline 2 into binary file format with the help of PLINK [36]. Several flags were used on the
PLINK command line to clean the vcf data so that the output binary files could be run on
ADMIXTURE without errors:
● --vcf-half-call h: treats a half-call (a genotype with one missing allele) as a haploid/homozygous call
● --allow-extra-chr: allows non-numeric chromosome labels to be read by PLINK
● --geno: removes variants whose missing-call rate exceeds a threshold (10% by default), that is, variants with a call rate below 90%
● --chr 1-22 XY: restricts the analysis to the autosomes and the chromosome code XY, which PLINK uses for the pseudo-autosomal region of the X chromosome
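The effect of a missingness filter such as --geno can be sketched in Python. The sketch assumes PLINK's documented behavior, in which a variant is removed when its missing-call rate exceeds the threshold (0.1 by default). The function name and the example genotypes are hypothetical, and PLINK itself operates on binary filesets rather than Python dictionaries.

```python
def filter_by_missingness(genotype_rows, max_missing=0.1):
    """Keep variants whose fraction of missing calls is at most max_missing.

    genotype_rows: dict variant_id -> list of calls, None marks a missing call.
    """
    kept = {}
    for variant_id, calls in genotype_rows.items():
        missing_rate = sum(call is None for call in calls) / len(calls)
        if missing_rate <= max_missing:
            kept[variant_id] = calls
    return kept


# Hypothetical genotypes for ten samples at three variants.
example_rows = {
    "rs1": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],        # no missing calls
    "rs2": [0, None, None, 0, 1, 2, 0, 1, 2, 0],  # 20% missing, removed
    "rs3": [None, 1, 1, 1, 1, 1, 1, 1, 1, 1],     # 10% missing, kept
}
print(sorted(filter_by_missingness(example_rows)))
```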
Figure 20- Bash script for PLINK binary fileset creation
Figure 21- ADMIXTURE command line code
In this project, ADMIXTURE’s cross-validation (CV) procedure was used to confirm the
appropriate number of populations contributing to the ancestry of the 14 samples and to check
the predictive accuracy of the other tools. The number of contributing populations, denoted by K,
can be evaluated using the --cv flag while running ADMIXTURE in unsupervised mode.
Across the samples, the super population groups included African (AFR), European (EUR),
East Asian (EAS), South Asian (SAS), and Admixed American (AMR). Therefore, the
cross-validation error was checked for K values 1 to 5, and the K with the lowest error was picked
for each sample. Table 4 lists the K value with the minimum CV error for every sample.
When compared to the data in Table 1, ADMIXTURE was successful in determining the number
of populations contributing to the ancestral background of the samples.
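Choosing K from the cross-validation output can be sketched as below. The sketch assumes that the ADMIXTURE logs for each K have been concatenated into one text and that each contains a line of the form "CV error (K=3): 0.52305". The function name and the error values in the example are hypothetical and are not results from this project.

```python
import re

# Matches lines such as "CV error (K=3): 0.52305".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def best_k(log_text):
    """Return (K, cv_error) for the smallest CV error found in log_text.

    Ties are broken in favor of the smaller K.
    """
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=lambda key: (errors[key], key))
    return k, errors[k]


# Hypothetical log excerpt covering K = 1 to 3.
example_log = """\
CV error (K=1): 0.00001
CV error (K=2): 0.00004
CV error (K=3): 0.00009
"""
print(best_k(example_log))
```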
Table 4- K values with minimum CV error for 14 samples
No.  Sample   K value for the least CV error          K value for the least CV error   No. of contributing populations
              (1000 Genomes data with known ancestry) (RNA-seq data)                   (1000 Genomes Project database)
1.   HG00096  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EUR)
2.   HG00171  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (EUR)
3.   HG00512  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EAS)
4.   HG00513  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EAS)
5.   HG00731  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (AMR)
6.   HG00732  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (AMR)
7.   HG00864  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EAS)
8.   HG01596  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EAS)
9.   NA18534  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (EAS)
10.  NA18939  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (EAS)
11.  NA19238  K=1, CV error=0.00001                   K=1, CV error=0.00001            1 (AFR)
12.  NA19239  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (AFR)
13.  NA20509  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (EUR)
14.  NA20847  K=1, CV error=0.00002                   K=1, CV error=0.00001            1 (EUR)
Chapter 4: Discussion
Estimating ancestral background has become an essential component of population-based
genetic studies, which rely on the different sets of variants present in distinct population
groups. Population stratification plays an important role in genetic studies, and many
experiments have been conducted using the more established techniques of whole genome and
whole exome sequencing. RNA-sequencing remains a relatively new approach in ancestry
prediction. It captures information about the dynamic, real-time operations of the cell, which
can be used to assess gene expression levels and genetic variation at different time points and
so give a fuller picture of the genetic data.
Several software tools have been developed to predict the ancestry of unknown individuals.
Their accuracy differs depending on the reference panel, the set of genetic markers analyzed,
and the statistical calculations used. This project involved several pipelines to satisfy the input
requirements of each tool. Pipeline 1 converted the raw fastq files into bam file format, which
was later used by EthSEQ as the input. An important difference between the data types was
noted at the alignment step. The whole genome and whole exome sequencing data, taken from
the 1000 Genomes Project, were aligned using the BWA-MEM aligner, as it is efficient at gapped
alignment and can map paired-end reads. RNA-seq data, however, contain reads that span splice
junctions, which requires an aligner that can map different parts of a read to different positions
on the reference genome. The STAR aligner was able to handle these splicing events by using
the splice junction positions for mapping.
Variant discovery from RNA-seq data is challenging because of the high duplication rates and
other transcriptional noise that might interfere with the downstream analysis. Pipeline 2 was
used to identify the duplicates in the RNA-seq data with the help of MarkDuplicatesSpark, which
flagged them so that they were ignored during the variant calling step. Systematic errors can
occur during sequencing and lead to incorrect base calls. BaseRecalibrator builds a
machine-learning model from a set of known variant files and re-assigns the quality scores of
the bases present in the query sequence. HaplotypeCaller was successful in calling variants from
both WGS and RNA-seq data, and its output was used by AKT as input. RNA-seq preprocessing
required an additional step: the SplitNCigarReads function was used to split reads whose CIGAR
strings contain N operations (skipped reference regions such as introns) and to group the
information per exon.
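The splitting step can be illustrated with a short Python sketch that breaks one spliced alignment into per-exon segments at each N operation of its CIGAR string. This is a simplified illustration and not the GATK SplitNCigarReads implementation, which also clips overhanging bases and rewrites the read records. The function name and the example alignment are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_on_n(start, cigar):
    """Split an alignment at N operations.

    start: 1-based leftmost reference position of the alignment.
    Returns a list of (reference_start, cigar) pairs, one per exon segment.
    """
    segments = []
    current = []
    seg_start = start
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the segment before the skipped region and jump over it.
            if current:
                segments.append((seg_start, "".join(current)))
            current = []
            pos += length
            seg_start = pos
            continue
        current.append(f"{length}{op}")
        if op in "MD=X":  # operations that consume reference bases
            pos += length
    if current:
        segments.append((seg_start, "".join(current)))
    return segments


# Hypothetical read: 50 bases, a 1000-base intron, then 50 more bases.
print(split_on_n(100, "50M1000N50M"))
```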
The wide range of functionalities provided by ADMIXTURE proved useful in estimating
the number of populations contributing to the ancestry of the samples. PLINK was used to
create the binary files that are an essential part of the input required by ADMIXTURE. PLINK
provided multiple flags needed for cleaning the variant dataset, which helped the ancestry
prediction tool run properly.
The tools EthSEQ, AKT, and ADMIXTURE all analyze genetic data to predict the ancestry of
individuals. The output produced by each tool differs in how ancestry is assessed. EthSEQ and
AKT use a pre-defined reference panel suitable for the input provided by the user, whereas
ADMIXTURE does not require a reference panel when running in unsupervised mode. The
output produced by these tools expresses the estimates of genetic ancestry for the samples in
different forms.
EthSEQ reported ancestry as the percentage contribution of various populations, supported by
visual PCA graphs generated using the SNPRelate R package. AKT created text files while
performing PCA on the dataset, in which each line represents a single individual along with
variables such as chromosome position and principal component information for the variants,
together with the eigenvalues required for positioning the samples on the PCA graphs relative
to the reference variants. ADMIXTURE provided the cross-validation error values that were used
to check the accuracy of the other two tools for population contribution.
EthSEQ correctly annotated 12 out of 14 samples with the accurate ancestral group.
HG00171 and HG00732 were assigned more than one population, contrary to the grouping
information provided by the 1000 Genomes Project database (Table 1). This might be because
the variant groups considered during ancestry analysis with WES and RNA-seq data differ from
those in EthSEQ’s reference dataset. This is also supported by the difference in the
areas covered by the polygons corresponding to various populations in the WES and RNA-seq
PCA graphs.
AKT was unable to produce conclusive results with RNA-seq genotype data, in contrast to the
WGS sample vcf files, which were placed close to their population groups on the PCA graphs.
AKT uses EIGENSTRAT for PCA, which is designed for genotype data derived from DNA
sequencing. The vcf files created from RNA-seq data might contain genetic variants, such as
SNPs, that affect the regulatory regions or splice junctions of the genome, as RNA-seq primarily
carries gene expression level information. The genotype data deduced from RNA-seq might be
less precise than direct genotyping because more weight is given to variants in specific regions
of the genome, which leads to the discrepancy between the PCA graphs of WGS and RNA-seq
data.
On comparing the precision of the tools against ADMIXTURE’s modeling of the population
groups, EthSEQ performed more accurately than AKT on RNA-seq data. This suggests that
EthSEQ is more flexible than AKT when working with a new data type. However, it also
highlights the need for models and tools with more versatile detection abilities. New tools could
be developed that detect ancestry based on more conserved groups of variants, similar to the
ones identified from RNA-seq data.
References
[1] R. Shraga et al., “Evaluating genetic ancestry and self-reported ethnicity in the context of
carrier screening,” BMC Genet., vol. 18, no. 1, p. 99, Dec. 2017, doi: 10.1186/s12863-017-0570-y.
[2] M. Via, E. Ziv, and E. Burchard, “Recent advances of genetic ancestry testing in
biomedical research and direct to consumer testing,” Clin. Genet., vol. 76, no. 3, pp.
225–235, Sep. 2009, doi: 10.1111/j.1399-0004.2009.01263.x.
[3] D. H. Alexander, J. Novembre, and K. Lange, “Fast model-based estimation of ancestry
in unrelated individuals,” Genome Res., vol. 19, no. 9, pp. 1655–1664, Sep. 2009, doi:
10.1101/gr.094052.109.
[4] R. Barral-Arca, J. Pardo-Seco, X. Bello, F. Martinón-Torres, and A. Salas, “Ancestry
patterns inferred from massive RNA-seq data,” RNA, vol. 25, no. 7, pp. 857–868, Jul.
2019, doi: 10.1261/rna.070052.118.
[5] V. A. Yépez et al., “Clinical implementation of RNA sequencing for Mendelian disease
diagnostics,” Genome Med., vol. 14, no. 1, p. 38, Apr. 2022, doi: 10.1186/s13073-022-01019-9.
[6] P. C. Ng and E. F. Kirkness, “Whole Genome Sequencing,” in Genetic Variation, M. R.
Barnes and G. Breen, Eds., in Methods in Molecular Biology, vol. 628. Totowa, NJ:
Humana Press, 2010, pp. 215–226. doi: 10.1007/978-1-60327-367-1_12.
[7] A. Belkadi et al., “Whole-exome sequencing to analyze population structure, parental
inbreeding, and familial linkage,” Proc. Natl. Acad. Sci., vol. 113, no. 24, pp. 6713–6718,
Jun. 2016, doi: 10.1073/pnas.1606460113.
[8] S. Mangul et al., “Systematic benchmarking of omics computational tools,” Nat.
Commun., vol. 10, no. 1, p. 1393, Mar. 2019, doi: 10.1038/s41467-019-09406-4.
[9] D. H. Alexander and K. Lange, “Enhancements to the ADMIXTURE algorithm for
individual ancestry estimation,” 2011.
[10] A. Romanel, T. Zhang, O. Elemento, and F. Demichelis, “EthSEQ: ethnicity annotation
from whole exome sequencing data,” Bioinformatics, vol. 33, no. 15, pp. 2402–2404,
Aug. 2017, doi: 10.1093/bioinformatics/btx165.
[11] “bwa.1.” https://bio-bwa.sourceforge.net/bwa.shtml (accessed May 18, 2023).
[12] H. Li et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol.
25, no. 16, pp. 2078–2079, Aug. 2009, doi: 10.1093/bioinformatics/btp352.
[13] A. Dobin et al., “STAR: ultrafast universal RNA-seq aligner,” Bioinformatics, vol. 29, no.
1, pp. 15–21, Jan. 2013, doi: 10.1093/bioinformatics/bts635.
[14] R. Arthur, O. Schulz-Trieglaff, and A. J. Cox, “AKT: Ancestry and Kinship Toolkit”.
[15] “Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput
Sequence Data.” https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed
May 18, 2023).
[16] “MarkDuplicatesSpark,” GATK, Nov. 23, 2019.
https://gatk.broadinstitute.org/hc/en-us/articles/360036358972-MarkDuplicatesSpark (accessed May 18, 2023).
[17] “Germline short variant discovery (SNPs + Indels),” GATK, May 08, 2023.
https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- (accessed May 19, 2023).
[18] “BaseRecalibrator,” GATK, Apr. 01, 2020.
https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator (accessed May 18, 2023).
[19] “HaplotypeCaller,” GATK, Jan. 25, 2023.
https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller (accessed May 18, 2023).
[20] “SelectVariants,” GATK, Apr. 26, 2020.
https://gatk.broadinstitute.org/hc/en-us/articles/360037055952-SelectVariants (accessed May 18, 2023).
[21] “Data management - PLINK 2.0.”
https://www.cog-genomics.org/plink/2.0/data#make_pgen (accessed May 18, 2023).
[22] “Standard data input - PLINK 1.9.” https://www.cog-genomics.org/plink2/input (accessed
May 18, 2023).
[23] I. Gorin et al., “Determining the Area of Ancestral Origin for Individuals From North
Eurasia Based on 5,229 SNP Markers,” Front. Genet., vol. 13, p. 902309, May 2022,
doi: 10.3389/fgene.2022.902309.
[24] K. R. Kukurba and S. B. Montgomery, “RNA Sequencing and Analysis,” Cold Spring
Harb. Protoc., vol. 2015, no. 11, p. pdb.top084970, Nov. 2015, doi:
10.1101/pdb.top084970.
[25] C. Trapnell, L. Pachter, and S. L. Salzberg, “TopHat: discovering splice junctions with
RNA-Seq,” Bioinformatics, vol. 25, no. 9, pp. 1105–1111, May 2009, doi:
10.1093/bioinformatics/btp120.
[26] M. M. Piper Bob Freeman, Mary, “Alignment with STAR,” Introduction to RNA-Seq using
high-performance computing - ARCHIVED, Jun. 07, 2017.
https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
(accessed May 18, 2023).
[27] B. A. Veeneman, S. Shukla, S. M. Dhanasekaran, A. M. Chinnaiyan, and A. I.
Nesvizhskii, “Two-pass alignment improves novel splice junction quantification,”
Bioinformatics, vol. 32, no. 1, pp. 43–49, Jan. 2016, doi: 10.1093/bioinformatics/btv642.
[28] “RNAseq short variant discovery (SNPs + Indels),” GATK, May 02, 2023.
https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels- (accessed May 19, 2023).
[29] “SplitNCigarReads,” GATK, Jan. 07, 2020.
https://gatk.broadinstitute.org/hc/en-us/articles/360036899652-SplitNCigarReads (accessed May 18, 2023).
[30] Y. Liu, T. Nyunoya, S. Leng, S. A. Belinsky, Y. Tesfaigzi, and S. Bruse, “Softwares and
methods for estimating genetic ancestry in human populations,” Hum. Genomics, vol. 7,
no. 1, p. 1, Dec. 2013, doi: 10.1186/1479-7364-7-1.
[31] A. Romanel, S. Lago, D. Prandi, A. Sboner, and F. Demichelis, “ASEQ: fast allele-
specific studies from next-generation sequencing data,” BMC Med. Genomics, vol. 8, no.
1, p. 9, Dec. 2015, doi: 10.1186/s12920-015-0084-2.
[32] R. Pereira, J. Oliveira, and M. Sousa, “Bioinformatics and Computational Tools for Next-
Generation Sequencing Analysis in Clinical Genetics,” J. Clin. Med., vol. 9, no. 1, p. 132,
Jan. 2020, doi: 10.3390/jcm9010132.
[33] A. J. Coffey et al., “The GENCODE exome: sequencing the complete human exome,”
Eur. J. Hum. Genet., vol. 19, no. 7, pp. 827–831, Jul. 2011, doi: 10.1038/ejhg.2011.28.
[34] J. Ma and C. I. Amos, “Theoretical Formulation of Principal Components Analysis to
Detect and Correct for Population Stratification,” PLoS ONE, vol. 5, no. 9, p. e12510,
Sep. 2010, doi: 10.1371/journal.pone.0012510.
[35] A. Conesa et al., “A survey of best practices for RNA-seq data analysis,” Genome Biol.,
vol. 17, no. 1, p. 13, Dec. 2016, doi: 10.1186/s13059-016-0881-8.
[36] J. Y. Dutheil, Ed., Statistical Population Genomics, in Methods in Molecular
Biology, vol. 2090. New York, NY: Springer US, 2020. doi: 10.1007/978-1-0716-0199-0.
Abstract
Recently, much interest has been placed on identifying the role of genetic architecture in differential risk patterns based on one’s ancestry. Beyond gold-standard methods such as WGS (whole genome sequencing), WES (whole exome sequencing), or genotype data from the 1000 Genomes Project, RNA-sequencing (RNA-seq) is now widely used to call genomic variants, providing inferred estimates of single nucleotide variants (SNVs) via the Genome Analysis Toolkit (GATK) pipeline. RNA-seq is a prominent technology for transcriptome profiling that provides accurate measurement of transcript and gene levels and is highly favorable for its low cost. The aim of our study is to use the SNVs inferred from RNA-seq data to accurately predict genetic ancestry, capturing both the proportion of ancestral estimates across the genome and locus-specific allelic ancestral effects. We then compared the ancestry inferred from RNA-seq to that inferred with the gold-standard genomic methods (WGS and WES data) and identified differences. Our study highlights the use of RNA-seq data for ancestry estimation and will inform the genomics community about the best computational tools for identifying ancestries from RNA-seq data. Our benchmarking results can be leveraged to allow the biomedical field to effectively annotate the much larger cohort of 80,000 public RNA-seq samples and to evaluate complex population substructure by expression quantitative trait loci (eQTL) analysis within and across ancestries in a diverse cohort.
Asset Metadata
Creator: Yadav, Anushka (author)
Core Title: Benchmarking of computational tools for ancestry prediction using RNA-seq data
School: School of Pharmacy
Degree: Master of Science
Degree Program: Pharmaceutical Sciences
Degree Conferral Date: 2023-08
Publication Date: 06/01/2023
Defense Date: 05/31/2023
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: ancestry predicting tools, benchmarking, bioinformatics, OAI-PMH Harvest, RNA-seq
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Mangul, Serghei (committee chair), Calabrese, Peter (committee member), Haworth, Ian (committee member)
Creator Email: anushkay@usc.edu, anushkay2899@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113141860
Unique identifier: UC113141860
Identifier: etd-YadavAnush-11908.pdf (filename)
Legacy Identifier: etd-YadavAnush-11908
Document Type: Thesis
Rights: Yadav, Anushka
Internet Media Type: application/pdf
Type: texts
Source: 20230601-usctheses-batch-1050 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu