Copyright 2021 Karishma Chhugani
UNLOCKING CAPACITIES OF GENOMICS DATASETS THROUGH
EFFECTIVE COMPUTATIONAL METHODS
by
Karishma Chhugani
A Thesis Presented to the
FACULTY OF THE USC SCHOOL OF PHARMACY
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
PHARMACEUTICAL SCIENCES (PSCI)
August 2021
DEDICATION
Without the endless support, encouragement, and love from my family, friends, and mentors,
going through this Master’s journey would not have been possible.
Thank you for believing in me and my abilities when I did not and pushing me beyond
boundaries which I didn’t even know existed.
For success is built not only by yourself, but who you surround yourself with.
ACKNOWLEDGEMENTS
I am extremely grateful to my advisor, Dr. Serghei Mangul, who from day one
welcomed me into his lab despite my having little to no prior experience in bioinformatics.
Throughout the Master’s, Dr. Mangul has been an excellent mentor and professor who
fostered a communicative and open learning environment, provided me with the ability to
connect and collaborate with fellow researchers around the world, and also gave me the
opportunity to lead my own projects. As the Master’s journey comes to a close, I have gained
knowledge, skills, and insights in the field of bioinformatics which wouldn’t have been possible
without the constant support, guidance, and expertise from Dr. Mangul. I would like to thank
Dhrithi Deshpande and Anushka Rajesh for not only being amazing labmates and friends, but
also being pillars of support for me throughout the two year Master’s. I am extremely grateful for
Aaron Karlsberg for his patience, knowledge, and expertise in the field of bioinformatics as he
helped me strengthen my foundation in bioinformatics. I would also like to thank the rest of the
Mangul lab members for being so welcoming and supportive. I would like to thank Dr. Curtis
Okamoto and Dr. Ryan Schmidt for being part of my thesis committee, as they provided me with
valuable suggestions, feedback, and comments on my thesis and I am so grateful. I extend my
gratitude to Dr. Roger Duncan, Dr. Ian Haworth, and Wade Thompson Harper for guiding me
not only through the Master’s program but also being so helpful and available for answering my
questions. Lastly, I would like to once again thank my family, friends, and mentors for their
unconditional support as I completed my Master’s degree.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
LIST OF BOX AND TABLES
LIST OF FIGURES
ABSTRACT
INTRODUCTION
CHAPTER ONE: DATA-DRIVEN RESEARCH
1.1 Challenges of computational research
1.2 Computational data-driven research is gaining independence
1.3 Establishing meaningful collaborations between computational and experimental biologists
CHAPTER TWO: NEXT GENERATION SEQUENCING TECHNOLOGIES
2.1 Illumina sequencing technology
2.2 Nanopore sequencing technology
2.3 PacBio sequencing technology
CHAPTER THREE: RNA-SEQUENCING
3.1 Survey of RNA-seq tools
3.2 Results and archival stability and usability of RNA-seq tools
CHAPTER FOUR: READ ALIGNMENT
4.1 Challenges of read alignment
4.2 File formats associated with read alignment
CHAPTER FIVE: READ ALIGNMENT TOOLS
5.1 Bowtie
5.2 Nextgenmap
CHAPTER SIX: UNLOCKING THE CAPACITIES OF VIRAL GENOMICS FOR THE COVID-19 PANDEMIC RESPONSE
6.1 Introduction
6.2 Initial SARS-CoV-2 detection and characterization
6.3 The role of genomics in the early COVID-19 outbreak response
6.4 SARS-CoV-2 genomic evolution
6.5 The use of genomics to investigate the pandemic spread of SARS-CoV-2
6.6 Monitoring SARS-CoV-2 transmission through wastewater genomic studies
6.7 Genomics in clinical applications
6.8 Integrating clinical and genomics data
6.9 Discussion
CONCLUSION
METHODS
REFERENCES
LIST OF BOX AND TABLES
Table 1 Questions asked during faculty interviews
Box 1 Advantages and limitations of short and long reads
LIST OF FIGURES
Figure 1 Archival stability and usability of RNA-seq tools
Figure 2 Alternative splicing and RNA-Seq technologies
Figure 3 Available SARS-CoV-2 genomic sequencing data and its usage for outbreak investigation
Figure 4 Variant of Concern (VOC) and Variant of Interest (VOI) circulating throughout the globe
ABSTRACT
Over the past decade, next generation sequencing (NGS) technologies coupled with novel
computational methods have revolutionized the field of genomics. The bioinformatics
community has worked together to unlock the capacities of these genomic datasets, allowing for
scientific progress such as identifying novel biomarkers or uncovering new biological
pathways. The purpose of this work is to showcase the role of genomics, computational methods,
and next generation sequencing technologies across six distinct chapters. The first chapter
discusses the importance of data-driven research and how, although computational data-driven
research faces various challenges, it is on the rise and gaining independence. The second
chapter covers the Illumina, Nanopore, and PacBio
next generation sequencing technologies. The third chapter consists of a survey of RNA-seq
tools, and discusses the archival stability and usability of these RNA-seq tools. The fourth
chapter covers the challenges of read alignment along with the file formats associated with the
process of read alignment. The fifth chapter covers the functionalities of Bowtie and
Nextgenmap, two read alignment tools. Lastly, the sixth chapter discusses the key role genomics
played in helping to address the COVID-19 pandemic.
INTRODUCTION
During the past decade, the rapid advancement of high-throughput technologies has reshaped
modern biomedical research by vastly extending the diversity, richness, and availability of data
and methods across various domains. Currently, computational researchers are empowered with
data, methods, and tools that allow for the possibility of making important contributions in
biomedicine –– through primary analysis of pre-clinical and clinical datasets, the application and
development of novel machine learning algorithms towards task automation and diagnostic or
treatment predictions, and secondary analysis of existing public omics data. RNA-sequencing
(RNA-seq) has become an exemplar technology in modern biology and clinical applications over
the past decade. It has gained immense popularity in recent years driven by continuous efforts of
the bioinformatics community to develop accurate and scalable computational tools. RNA-seq is
a method of analyzing the RNA content of a sample using the modern sequencing platforms. It
generates enormous amounts of transcriptomic data in the form of nucleotide sequences, known
as reads. RNA-seq analysis enables the measurement of gene expression and corresponding
transcripts (sequencing of mRNAs and/or whole transcriptomes), which is essential for answering
important biological questions, such as detecting novel exons and transcripts, quantifying gene
expression, and studying alternative splicing structure. However, obtaining meaningful biological
signals from raw data using computational methods is challenging due to the limitations of modern
sequencing technologies. The need to overcome these technological limitations has pushed the rapid
development of many novel computational tools, which have evolved and diversified in accordance
with technological advancements, leading to the current myriad of RNA-seq tools (e.g.,
Bowtie and Nextgenmap).
Due to these recent advances in high-throughput sequencing technologies and the growing number of
bioinformatic tools, the COVID-19 pandemic, more than any other infectious disease epidemic,
has been characterized by the generation of large volumes of viral genomic data at an incredible
pace. However, distinguishing the most epidemiologically relevant information encoded in these
vast amounts of data requires substantial effort across the research and public health communities.
Studies of SARS-CoV-2 genomes have been critical in tracking the spread of variants and
understanding the epidemic dynamics of the virus, and may prove crucial for controlling future
epidemics and alleviating significant public health burdens. Together, genomic data and
bioinformatics methods enable broad-scale investigations of the spread of SARS-CoV-2 at the local,
national, and global scales and allow researchers to efficiently track the emergence of novel
variants, reconstruct epidemic dynamics, and provide important insights into drug and vaccine
development and disease control.
CHAPTER ONE: DATA-DRIVEN RESEARCH
Recent advances in high throughput sequencing technologies and their application in biology and
biomedicine have created an unprecedented amount of biological data and an increased
dependency on computational analysis. Computational data-driven research focuses on developing
and applying computational models and methods across various types of omics datasets. Such
research is performed in a new type of laboratory, often referred to as a dry lab. In contrast with
wet lab researchers, researchers in the dry lab mine and reanalyze newly available and increasingly
rich large-scale open datasets, and are well positioned to make novel biological discoveries. We discuss
the opportunities present in data-driven research, the challenges associated with computational
research, the rise of computational research as an independent domain of biomedical research, and
the significant collaborative opportunities that arise from integrating computational research with
experimental and translational biology.
Traditionally, computational researchers were focused solely on developing novel computational
tools that are accessible and used by a broad biomedical community. However, as biomedical
research diversifies across a wider range of topics, tools are developed with requirements aimed
at answering specific scientific questions. While this type of research constitutes the main driver
in computational, data-driven research, it represents a small fraction of growing research directions
in computational biology.
As technology rapidly advances alongside the methods to analyze rich and diverse datasets, the
opportunities of computational research are becoming increasingly prevalent. Computational
research can contribute to advancing research by developing new methods and tools for primary
and secondary analysis of biological data; it offers the unique opportunity to bridge analysis on
datasets collected by multidisciplinary teams. As a result, specific manuscripts have boosted their
rankings and citations through the development, implementation and discussion of new
computational methods [1]. Over the past two decades, published bioinformatics papers have
accounted for more than 34% of highly cited papers in science [2].
Computational research additionally has the opportunity to expand into low-income countries, as
scientists in these countries can use online training platforms, bioinformatic tutorials, and
other available resources to expand their knowledge and skills in performing secondary analysis
of publicly available data [3]. Thus, the field of computational biology, through an established
commitment to open and widely accessible platform development and data availability, will
continue to promote not only cross-disciplinary research but also more equitable access to the
tools and data necessary for the success of computational researchers with diverse resources.
1.1 Challenges of computational research
Although computational modeling and methods have provided promising frameworks for
identifying new research directions, there remain challenges before considering the use of
computational models to guide clinical decisions.
Regulation of artificial intelligence algorithms in the clinic
Despite the ubiquity and utility of artificial intelligence (AI) algorithms, whether semi- or fully
autonomous, for automating tasks and assisting in medical diagnosis and treatment design, AI has
not yet been broadly adopted in the clinic. Although several AI-assisted medical devices have
recently been approved in the areas of image-based diagnosis [4,5,6], comprehensive FDA regulatory
guidelines for the approval of AI-based algorithms are still under development [7,8,9]. Precision or
genomic-based medicine incorporates genomics to guide tailored treatment strategies, and
cancer screening is currently at the forefront of efforts in oncology [10,11,12,13]. FDA approval for
genomics-based testing in cancer has not only increased patient accessibility to higher standards
of care, but also opened the door to significant advances in precision cancer
treatment [14,15,16]. Computationally-driven research has the potential to occupy a central role in
assessing and guiding treatment in the clinic. For example, although primary tumor biopsies can
yield important genetic information that may guide initial treatment decisions, liquid biopsies
aimed at measuring circulating tumor DNA (ctDNA) or circulating tumor cells (CTCs) can provide
information on disease progression and metastatic potential, and can highlight detailed molecular
information about a patient’s cancer as it evolves over the course of therapy [17–19]. By incorporating
frequently updated molecular information into clinical trial protocols and combining
bioinformatics with AI-based methods, computational researchers have the opportunity to develop
general methods that optimize treatment strategies over the disease course, with the ultimate
aim of increasing overall survival and improving patient outcomes [20,21–23]. Together with
increasing opportunities to access longitudinal, genomic patient data for precision care,
computational data-driven research in biology is evolving from a bioinformatic, tool-driven research
activity into an interdisciplinary field that incorporates many aspects of mathematics and computation.
1.2 Computational data-driven research is gaining independence
The narrow view of computational data-driven research, in which computational scientists are
expected to have secondary roles in biomedical projects rather than lead projects geared toward
translational discovery, still exists. This phenomenon is due in part to the dependence on wet
labs for omics data generation, which has led computational researchers to focus instead on
developing bioinformatics tools for the community. However, as the accessibility of omics
technologies grows, data generation and analysis become increasingly complex, and computational
biology is likely to evolve into a principal, collaborative role. In the
context of clinical trials, where sample size and volume are often limited, assessing tradeoffs with
respect to optimizing sample usage for appropriate omics platforms, balancing cost, and assuring
statistical significance is necessary to ensure success. Thus, the computational researcher's
involvement early in genomics-based experimental design will be necessary to help answer
complex scientific questions.
Academic infrastructure to support the independence and principal role of computational biology
in biomedical research is growing across the US [24,25,26,27,28]. New computational biology and
biological data science departments are emerging in both university and medical institutions,
signaling a transition from the reliance on "bioinformatic cores", often regarded as a service to
translational biologists or clinicians, to the recognition of computational biology as an integral part
of biomedical research. As clinical papers become more reliant on data-driven analysis, the
computational biologist will become both a powerful driver of biomedical research and
increasingly essential to it.
When empowered with varied and increasingly abundant public data, computational researchers
are now also well positioned to formulate competitive hypotheses. New publishing avenues are
beginning to support secondary analyses as these become an integral part of computational
biomedical research [29]. By performing secondary analysis of datasets across multiple cohorts,
computational researchers gain access to increased sample size and heterogeneity and can
capture novel signals that are undetectable in individual studies due to small sample sizes.
Additionally, meta-analysis can capture signals that are generalizable across diverse populations.
Multiple discoveries in biomedical fields, including novel biomarkers, have been based solely on
secondary analysis of public omics data [30,31,32,33,34].
1.3 Establishing meaningful collaborations between computational and experimental
biologists
Inter-disciplinary research and discovery.
As technology and methods advance, allowing larger-scale experimental data, scientists are
increasingly motivated to incorporate an interdisciplinary approach to discovery. In a typical
collaborative scenario between wet and dry labs, the wet lab generates data and the dry lab either
conducts secondary analysis or collaborates with the wet lab to gain access to the data. To make
novel discoveries, dry labs often rely on domain-specific knowledge to guide the interpretation
and validation of computational results.
Wet labs must generate hypotheses as the first step of their pipeline so that they can plan and
order the associated materials accordingly. In contrast, dry labs are driven by the amount and
quality of available data and, as a result, form their hypotheses based on which data are
available. For this reason, data-driven research is more flexible, as it is not restricted to one
specific domain. Despite the differences between the two types of labs, there are fundamental
similarities. In both wet lab and dry lab approaches, experiments are iterative in nature.
While wet lab researchers may need to alter chemical measurements and biological specimens in
the protocol, dry lab researchers may need to alter their script or data analysis methods multiple
times. In some hybrid laboratories, the activities are intricately connected and one activity may
depend on the other in order to generate and test hypotheses. In a sense, the dry lab methods have
expanded the toolbox, creating the concept of “discovery science”, collecting and analyzing a large
amount of data in order to identify potential hypotheses which can be validated and tested.
Reproducibility.
Reproducibility is another aspect with similar requirements in both types of labs. Dry
lab researchers can reproduce data, but at the expense of needing more computational power,
storage, and space to run the analysis and store the results afterwards. For example, in genomics,
standards of reproducibility need to be uniform in order for computational research to
progress into the clinic [35]. Computational research needs to overcome the limitations of technology
and deliver accurate and reproducible results. Reproducibility in computational research can be
established by implementing standards of open software and open metadata, and by using data and
tools located on archivally stable repositories [36,37]. In contrast to dry lab research, reproducibility
can be limited in wet lab research, as researchers must work with small sample sizes and
gather all essential lab equipment, animal models, and chemicals in order to repeat the
experiments, resulting in increased cost of experimentation. As no literature exists on
the unique opportunities and challenges associated with computational data-driven research, we
hope that our study will not only offer a unique insight into computational data-driven research,
but also encourage computational scientists to use open omics data and novel machine learning
approaches to classify, model, and manage biological data in order to enable high-impact
discoveries in the biomedical field [38]. We have discussed how computational research has increased
opportunities in the biomedical research community and how open omics data in conjunction with
effective bioinformatics methods can enable novel biological discoveries.
As part of the manuscript, we interviewed influential computational researchers in biomedical
science. Because their experience and expertise in this field are essential, we thought it
would be beneficial to highlight their data-driven computational research and how the choice to
undertake computational research influenced their careers. We hope to gain perspective from their
answers to these questions, as these researchers are leading successful careers in computational
research and can motivate the next generation of scientists.
Table 1: Questions asked during faculty interviews
1. When initially starting your career, what aspects of computational research were you drawn
to?
2. What are some of the challenges with being a purely computational researcher? If you have
a wet lab component to your research, what drove you to establishing the wet lab portion
of it?
3. What are unique opportunities in computational research compared to wet-lab research?
4. What are the crucial skills to be successful in computational research?
5. What is your opinion on pure secondary analysis?
6. What are the domains where computational research can make a difference?
7. Do you have any comments on if it is more difficult to publish computational research
compared to wet lab research?
8. Whom do you primarily collaborate with? Are those computational or wet lab people?
9. How were you initially received as a computational researcher by the wet lab people? Did
that change over the years?
10. Over the years, do you see a change in how computational research has been received by
the biomedical community?
11. What is the most unique and exciting aspect of computational research and how do you see
the future of computational research?
12. If you had to give one piece of advice to the younger generation of scientists who are
looking to pursue a career in bioinformatics and computational research, what would you
suggest?
CHAPTER TWO: NEXT GENERATION SEQUENCING TECHNOLOGIES
Over the past few decades, modern high throughput sequencing technologies have extended their
capacities; these technologies are capable of obtaining millions of nucleotide sequences from an
individual transcriptome. Many regions on the transcriptome are crucial for detecting both the
fundamental and unique causes of varying disorders and diseases. The data collected from the
sequencing technologies can help us to identify these influential genes and measure the level of
gene transcription taking place. The three most common sequencing technologies (Illumina,
Nanopore, and PacBio) differ in their sequencing output: each platform is categorized as either
a “short read” or a “long read” platform, and each has its own trade-offs.
2.1 Illumina sequencing technology
The Illumina sequencing platform (iSEQ100, MiniSeq, NextSeq550 series), based on
sequencing-by-synthesis chemistry, was commercialized in 2006. Illumina ligates cDNA fragments to
adapters and immobilizes the fragments on a solid surface before PCR amplification. Next, a reaction
mixture containing the primer, a reversible nucleotide terminator for each base (each labelled with
a different fluorescent dye), and the DNA polymerase enzyme is added to the solid
surface. Each incorporated nucleotide is detected by a CCD camera based on the color of the
fluorescent dye. The reversible terminator and the dye are then removed from the base,
and this cycle is repeated until all bases are identified. The sequences of more than 10
million colonies can be determined in parallel, giving rise to a higher sequencing
throughput [39,40] than other sequencing platforms. The Illumina sequencing
technology, a short read sequencing technology, has an average read length of 100-300 bp (base
pairs) and an error rate of ~0.1% [41].
2.2 Nanopore sequencing technology
Beginning in 2012, Oxford Nanopore Technologies introduced its sequencing devices (MinION,
GridION, and PromethION), which can sequence both DNA and RNA of any length, enabling the
technology to produce both short and long reads in real time. Read lengths vary and can be
up to 1,000 kb (1,000,000 bp), with an error rate of ~5-20% [41]. The core of the technology centers
around a nanopore, a “nano-scale” hole created by a pore-forming protein. An ionic
current and the strand of RNA or DNA to be sequenced are passed simultaneously
through the nanopore. As the strand passes through the nanopore, it causes
characteristic disruptions of the current, and these patterns of disruption identify the
nucleotides one by one. In comparison to strictly short read sequencing technologies such as
Illumina, Nanopore sequencing technology allows for sequencing of DNA or RNA without the added
step of PCR amplification [42].
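To make the trade-off between read length and error rate concrete, the short sketch below estimates how many miscalled bases one would expect in a single read. It uses only the read lengths and error rates quoted above; the assumption that errors occur independently and uniformly at every base is a simplification introduced here for illustration, and the 10,000 bp Nanopore read is a hypothetical example length.

```python
def expected_errors(read_length_bp: int, error_rate: float) -> float:
    """Expected number of miscalled bases in one read.

    Assumes (for illustration only) a uniform, independent per-base error rate.
    """
    return read_length_bp * error_rate


def error_free_probability(read_length_bp: int, error_rate: float) -> float:
    """Probability that a read contains no miscalled base under the same assumption."""
    return (1.0 - error_rate) ** read_length_bp


# Illumina: 300 bp read at ~0.1% error (figures quoted in the text).
illumina_errors = expected_errors(300, 0.001)

# Nanopore: hypothetical 10,000 bp read at the low end (~5%) of the quoted range.
nanopore_errors = expected_errors(10_000, 0.05)

print(f"Illumina, 300 bp: about {illumina_errors:.1f} expected errors per read")
print(f"Nanopore, 10 kb:  about {nanopore_errors:.0f} expected errors per read")
print(f"Chance a 300 bp Illumina read is error-free: {error_free_probability(300, 0.001):.2f}")
```

Under these assumptions a short read usually carries less than one error, whereas a long read carries hundreds, which is one reason alignment tools must tolerate many more mismatches when handling long reads.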
2.3 PacBio sequencing technology
The PacBio sequencing technology, commercially known as SMRT (Single-molecule, real-time)
Sequencing, was introduced in 2010 and generates full-length cDNA sequences (i.e., long reads)
that characterize transcripts within targeted genes or across entire transcriptomes. These read
lengths can range from 10-100 kb (10,000-100,000 bp) [41]. Long reads generated by PacBio are
accurate at the scale of a single molecule (the sequence of the base is derived directly from the
individual RNA strand) through circular consensus sequencing [43,44], which effectively reads the
same cDNA insert many times. The sensitivity of PacBio (Iso-Seq) can be limited by external factors;
for example, Iso-Seq is limited by the ability to produce full-length cDNA during the library
preparation step. It can only generate high quality reads (HiFi reads) if the target cDNA is short
enough to be sequenced in multiple passes.
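The idea behind reading the same insert many times can be illustrated with a toy consensus. The sketch below takes a per-position majority vote over several passes of the same molecule. It assumes the passes are already aligned and of equal length, and the example sequences are invented; it is a simplified illustration of the consensus principle, not PacBio's actual circular consensus algorithm.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Per-position majority vote over equal-length, pre-aligned passes.

    Ties are resolved in favor of the base seen first at that position.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")

    consensus = []
    for column in zip(*passes):
        # most_common(1) returns the most frequent base in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same insert, two of which contain one error each.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA", "ACGTTGCA"]
print(majority_consensus(passes))  # prints ACGTTGCA
```

Because each pass errs at a different position, the errors are outvoted. This is also why a shorter insert, which can be traversed more times, yields a more reliable consensus.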
When analyzing sequencing data, it is important to reduce noise introduced by
non-biological factors, as this noise can affect the results and lead to incorrect conclusions
downstream. These sources of variation are known as batch effects, and they
can arise from factors such as the specific sequencing laboratory, the
day and time the sample is sequenced, the laboratory technician who sequenced the sample, and even
the type of machine or protocol used.
CHAPTER THREE: RNA-SEQUENCING
RNA-sequencing (RNA-seq) has become an exemplar technology in modern biology and clinical
applications over the past decade. It has gained immense popularity in recent years driven by
continuous efforts of the bioinformatics community to develop accurate and scalable
computational tools. RNA-seq is a method of analyzing the RNA content of a sample using the
modern sequencing platforms. It generates enormous amounts of transcriptomic data in the form
of nucleotide sequences, known as reads. RNA-seq analysis enables the probing of genes and
corresponding transcripts which is essential for answering important biological questions, such as
detecting novel exons, transcripts, gene expressions, and studying alternative splicing structure.
However, obtaining meaningful biological signals from raw data using computational methods is
challenging due to the limitations of modern sequencing technologies. The need to overcome these
technological limitations has pushed the rapid development of many novel computational tools,
which have evolved and diversified in accordance with technological advancements, leading to the
current myriad of RNA-seq tools.
Advances in high-throughput sequencing technologies, also known as next-generation sequencing
1 (NGS), have enabled cost-effective probing of gene sequences in living organisms. These
sequencing technologies have been adapted to probe the RNA expression (mRNA or total RNA)
of an organism, known as RNA-seq. RNA-seq has reshaped biomedical research by expanding the
ability to analyze a vast range of biological data. Biomedical researchers are often tasked with
using RNA-seq computational methods which are typically available wrapped as software tools
and packages.
3.1 Survey of RNA-seq tools
We compiled 235 RNA-seq tools published between 2008 and 2020, which have varying purposes
and capabilities based on the type of analysis one is conducting or the biological questions one is
answering. We identified 15 primary areas of application in RNA-seq analysis (data quality
control, read alignment, gene annotation, transcriptome assembly, transcriptome quantification,
differential expression, RNA splicing, cell deconvolution, immune repertoire profiling, allele
specific expression, viral detection, fusion detection, small RNA detection, detecting circRNA,
and visualization tools). After assigning each tool a category based on its area of application, we
highlighted the “Notable Features” for each tool. These notable features encompassed a range of
functionalities: purpose of the tool; features that render the tool unique or not unique within its
category; the form of receiving RNA-seq data as an input; the form of presenting output. We
documented whether each tool was web-based or required the user to work with one or more
programming languages for installation and/or use (“Programming Language”). In addition to the
programming languages, we highlighted whether a package manager (e.g., Anaconda,
Bioconductor, or CRAN) was available. Based on the combination of which programming
language was required and which package manager was available for each tool, we assessed the
expertise required to install or run the tool. If the tool was web-based, whether or not it also
involved programming languages and a package manager, we assigned it the lowest required
expertise, “+”. If the tool required R, alone or together with other programming languages, and
had a package manager available, it was assigned a required expertise of “++”. Likewise, if only
programming languages other than R were required and a package manager was available, the tool
was assigned a “++”. Lastly, tools that required a single programming language (other than R) or
multiple programming languages, and lacked a package manager to aid installation, were assigned
a “+++”, the highest required expertise. Each published tool had a designated software link where
the tool can be downloaded and installed. Based on the type of platform hosting the URL, we
assigned the tools a “1” or “2”. An assignment of “1” meant that the tool’s software was hosted on
a more archivally stable web service designed to host source code. An assignment of “2” meant
that the tool’s software was hosted on a less archivally stable web service (e.g., personal and/or
university web services). Our table of 235 RNA-seq tools can be found at
https://github.com/Mangul-Lab-USC/RNA-seq.
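The expertise rubric described above can be sketched as a small decision function. This is only an illustration of the rule as stated in the text, not the procedure actually used to build the table; the function name and its two boolean parameters are hypothetical, and the handling of an R-only tool without a package manager is an assumption because the text does not cover that case.

```python
def required_expertise(web_based: bool, has_package_manager: bool) -> str:
    """Return "+", "++", or "+++" following the rubric described in the text."""
    # Web-based tools get the lowest rating, with or without a package manager.
    if web_based:
        return "+"
    # A tool installable through a package manager is rated "++", whether it
    # needs R alone, R plus other languages, or only non-R languages.
    if has_package_manager:
        return "++"
    # Assumption: every remaining tool (no web interface, no package manager)
    # is rated "+++". The text states this for non-R and multi-language tools;
    # an R-only tool without a package manager is not explicitly covered.
    return "+++"


print(required_expertise(web_based=True, has_package_manager=False))
print(required_expertise(web_based=False, has_package_manager=True))
print(required_expertise(web_based=False, has_package_manager=False))
```

Under this reading, the programming language only matters when the rubric is stated in prose; once the presence of a web interface and a package manager is known, the rating is determined.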
3.2 Results and archival stability and usability of RNA-seq tools
Maintaining the archival stability of bioinformatics tools is increasingly important in preserving
scientific transparency and reproducibility. We assessed the archival stability of 235 RNA-seq
tools. The majority of RNA-seq tools are stored on archivally stable repositories (e.g., GitHub),
while other tools are hosted on personal or academic webpages (Figure 1b), which often have limited
archival stability [117]. We also assessed the computational expertise required to install and use
RNA-seq tools. A vast majority of tools require the user to operate the command line interface
(Figure 1a). Web-based tools accounted for 8.09% of tools. A growing number of tools offer
web-based interfaces, particularly in the small-RNA detection and cell deconvolution categories
(71.4% and 31.2%, respectively). We also compared the availability of package managers across
RNA-seq tools. Package managers are platforms that automate the installation, configuration, and
maintenance of a software tool, promising to expedite and simplify the installation of a tool and
any required dependencies. The majority of RNA-seq tools lack a package manager implementation
(Figure 1c and 1d). Among the tools with a package manager implementation, Anaconda [118] was the
most commonly used platform, followed by Bioconductor [119] and CRAN [120]. Lastly, we evaluated
the effect of usability on the popularity of RNA-seq tools. We found that tools available through
package managers had significantly more citations per year than tools that are not (Figure 1e;
Mann–Whitney U test, p-value = 1.85 × 10⁻⁷).
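For readers unfamiliar with the Mann–Whitney U test, the statistic behind it can be computed by counting, over all pairs of observations from the two groups, how often a value from the first group exceeds a value from the second. The sketch below shows only that counting step, not the p-value calculation, and the citation values are made up for illustration; they are not taken from the survey.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: the number of (xi, yj) pairs with xi > yj,
    counting each tie as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical citations-per-year values, for illustration only.
with_package_manager = [12.0, 30.5, 8.0, 45.0]
without_package_manager = [3.0, 8.0, 5.5]

u = mann_whitney_u(with_package_manager, without_package_manager)
pairs = len(with_package_manager) * len(without_package_manager)
print(f"U = {u} out of {pairs} pairs")
```

A U value close to the total number of pairs means that values in the first group are almost always larger than values in the second, which is the pattern a small p-value reflects.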
Figure 1 Archival stability and usability of RNA-seq tools. (a) The percentage of required
computational expertise of current computational tools that are used for RNA-seq analysis in
different categories. (b) The percentages of type of URLs for tools collected. (c) The percentages
of package managers for all tools collected. (d) The percentage of availability of package managers
for tools in different categories. (e) Tools that are available through package managers exhibit
increased citations per year compared with tools that are not (Mann–Whitney U test,
p-value = 1.85 × 10⁻⁷) [226].
CHAPTER FOUR: READ ALIGNMENT
After the reads are generated by the chosen sequencing technology, they are analyzed for quality
control (GC content, read duplication, sequencing errors, etc.). As the goal is to map the reads
onto a reference genome or transcriptome with high accuracy, it is essential that quality control
tools are used to discard low-quality reads. The appropriate quality control tools depend on the
length of the reads (short or long) [45]. Once the low-quality bases and reads are removed or
trimmed, the next step in the pipeline is read alignment. Read alignment is a crucial step for
the downstream analysis of RNA and DNA sequencing data. The goal of aligning reads to a
reference genome or transcriptome is to find the origin and location of each read. When aligning
reads to the reference sequence, tools can estimate the coverage, which is the number of reads
overlapping a given position on the reference sequence [46] (Figure 2).
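As a minimal sketch of the coverage definition above, the following function counts how many aligned reads overlap each position of a reference. The interval convention (0-based start, exclusive end) and the toy alignments are assumptions made for the example; real tools derive these intervals from alignment files.

```python
def coverage(ref_length, alignments):
    """Per-position read depth for alignments given as (start, end) pairs,
    0-based with an exclusive end."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


# Three hypothetical reads aligned to a 10-base reference.
print(coverage(10, [(0, 4), (2, 6), (3, 10)]))
# prints [1, 1, 2, 3, 2, 2, 1, 1, 1, 1]
```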
Generally, read alignment tools follow the seed-and-extend method. The seeding step consists
of mapping a short portion of the read (called a seed) to the reference sequence. After the
seed has been mapped, the extension step begins, in which the alignment tool maps the remaining
portion of the read around the location where the seed was mapped. Because a seed may map to
several locations in the genome, the extension step must be attempted at each of those
locations. Compared with DNA-seq reads, the extension step is lengthier and more computationally
expensive for longer RNA-seq reads, because the read may span a splice junction. As there are
many opportunities for error during the read alignment process, it is imperative that the
necessary quality control steps are taken early in the analysis to streamline read alignment as
much as possible [47].
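The seed-and-extend idea can be illustrated with a deliberately simplified sketch. It takes the first few bases of the read as the seed, finds every exact occurrence of the seed in the reference, and then "extends" by comparing the whole read at each location and counting mismatches. Real aligners use indexed seed lookup and gapped alignment during extension; the function name, the seed length, the mismatch threshold, and the toy sequences here are all assumptions for the example.

```python
def seed_and_extend(read, reference, seed_len=4, max_mismatches=1):
    """Return (position, mismatches) for each placement of `read` whose seed
    matches exactly and whose full length has few enough mismatches."""
    seed = read[:seed_len]
    hits = []
    start = reference.find(seed)
    while start != -1:                        # seeding: every exact seed match
        window = reference[start:start + len(read)]
        if len(window) == len(read):          # extension: compare the full read
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return hits


reference = "ACGTTACGTAGGCTAACGTAGC"
print(seed_and_extend("ACGTAGC", reference))
# prints [(5, 1), (15, 0)]
```

In this toy case the seed occurs at three places in the reference, but extension rejects the first one because the rest of the read disagrees at two positions.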
4.1 Challenges of read alignment
Although read alignment is an essential step of the pipeline, there are various challenges associated
with the process of alignment (Box 1). While some of these challenges are linked to the specific
bioinformatics tool utilized, others are inherent to the task of aligning reads. In addition,
read aligners have to compare reads to all potential positions on the genome or transcriptome,
which can be computationally challenging. Aligning individual RNA-seq reads to reference
sequences helps quantify transcript expression, but transcripts that are not present in the
genome annotation will be missed when, as is common, alignment relies on the known transcripts
previously defined by genome annotations [47,48].
Because RNA-seq reads are sequenced from mRNA transcripts, mapping these reads to a reference
sequence poses a challenge: the reads do not contain the intronic regions that are present in
the genome and removed by splicing. RNA-seq alignment tools therefore have to account for splice
junctions, as one portion of a read may map to one exon and another portion to another exon, with
the two exons lying thousands to tens of thousands of bases apart in the genome (the length of
the intervening intron), potentially making mapping inconsistent [48]. A further difficulty is
that splicing joins exons in many different combinations, resulting in a large number of possible
mRNA transcripts (isoforms). Utilizing existing genome annotations for mapping splice junctions
can be useful; however, once again, only known junctions are considered, and novel junctions
cannot be uncovered. Situations also arise in which a read can align to multiple transcripts
(multi-mapping), and in this case it can be difficult to decide which transcript the read
belongs to (Figure 2). Many splice-aware alignment tools and software packages [49,50,51] are
available to help align such reads across the exon-intron junctions present in the reference
genome.
Box 1. Advantages and limitations of short and long reads [226]
i. Error rate - Short read sequencing technologies have a lower error rate when compared to
long read sequencing technologies (a, b).
ii. Throughput - The throughput of long read sequencing technologies is typically lower than
the throughput of short read sequencing technologies (c).
iii. Alignment - Short reads suffer from multi-mapping issues, whereas longer reads, by nature
of carrying more information, can be mapped more accurately to their origin. However, due to the
higher error rate of long reads, pairwise alignment between the read and the reference
transcriptome and/or genome is more challenging for long reads than for short reads.
iv. Assemble novel transcripts - Longer reads are preferred for de novo assembly, because
they make the assembly step efficient. Most short reads do not span the shared region or
shared exon junction, making the assembly step ambiguous. Full-length transcript
sequencing eliminates the need for assembly.
v. Estimate transcripts and gene expression - Shorter reads are preferred for quantification of
transcripts due to their higher throughput. However, assigning short reads to the transcripts
requires more advanced probabilistic and statistical approaches. Longer reads have lower
throughput, but they can usually cover the entire transcript and make determination of the
transcript for each read a straightforward process.
Figure 2 Alternative splicing and RNA-Seq technologies. The flow of genetic information
begins with the DNA, which consists of introns and exons. DNA is transcribed into pre-mRNA,
further processed into mature mRNA, splicing out the introns and leaving the exons glued together.
mRNA is further translated into a protein. Different arrangements of exons (transcripts) may be
formed in a process called alternative splicing or exon skipping. An RNA-seq read is a short
sequence sampled from a transcript. These reads are generated using modern sequencing
technologies such as (a) the Illumina platform - a short-read sequencing technology, (b) Nanopore
and PacBio platforms - long-read sequencing technologies. The figure depicts two scenarios which
illustrate the alignment of the uniquely mapped reads to the transcriptome (c) and the genome (d).
A few of the reads are multicolored, indicating that, when aligned, they span across the exon-exon
junction. Some of the shorter reads (single-colored) are aligned only to a single exon and do not
span across the junction [226].
4.2 File formats associated with read alignment
FASTA [52]: The FASTA format is a text-based format for nucleotide or amino acid sequences
written in single-letter codes (the codes are listed online for reference). Each record begins
with a line that starts with ‘>’ followed by a description of the sequence. The lines that
follow contain the sequence itself (either amino acids or nucleotides). The reference genome is
usually provided as a FASTA file (.fa), which is often indexed before alignment can take place.
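A minimal sketch of reading the FASTA layout described above is shown below. It assumes well-formed input held in a string; the function name and the example record names are hypothetical, and production code would normally rely on an established parser.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""   # first word of the description
            records[name] = []
        elif name is not None:
            records[name].append(line)           # a sequence may wrap over lines
    return {n: "".join(parts) for n, parts in records.items()}


example = ">chr_demo hypothetical record\nACGTAC\nGGTA\n>chr_demo2\nTTGA\n"
print(parse_fasta(example))
# prints {'chr_demo': 'ACGTACGGTA', 'chr_demo2': 'TTGA'}
```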
FASTQ [53]: A FASTQ file is the output file from the sequencing machine. It contains the
sequence data and is the input for many read alignment tools. FASTQ files are plain text and can
be viewed in a text editor or on the command line. Each read in a FASTQ file is described by
4 lines:
1) A sequence identifier, which carries information about the sequencing run
2) The sequence, consisting of the bases A, C, T, G, or N (base could not be determined)
3) A separator line, which begins with ‘+’
4) Quality scores, which encode the probability that each base was called incorrectly
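The four-line layout above can be read with a few lines of code. The sketch below also converts the quality string into error probabilities using the Phred relation P = 10^(-Q/10); the ASCII offset of 33 is an assumption that matches the common Phred+33 encoding, and the record shown is a made-up example.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality string) for each 4-line record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], seq, qual


def error_probabilities(qual, offset=33):
    """Convert quality characters to error probabilities: P = 10^(-Q/10).
    The offset of 33 assumes Phred+33 encoding."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in qual]


record = "@read_demo\nACGN\n+\nII5!\n"
for name, seq, qual in parse_fastq(record):
    print(name, seq, error_probabilities(qual))
```

With this encoding, the character `I` corresponds to a quality of 40 (an error probability of about 1 in 10,000), `5` to a quality of 20 (1 in 100), and `!` to a quality of 0.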
SAM [54]: SAM stands for Sequence Alignment/Map. A SAM file consists of a header section
and an alignment section. Each line of the alignment section contains 11 mandatory fields of
alignment information, including a flag field that encodes properties of the alignment. The SAM
file is the output file for the majority of read alignment tools.
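To make the 11 fields concrete, the sketch below splits one tab-separated alignment line into the mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and decodes two bits of the flag. The function name, the added `is_unmapped` and `is_reverse` keys, and the example line are hypothetical; in practice a dedicated library or samtools would be used.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if line.startswith("@") or len(columns) < 11:   # '@' marks header lines
        raise ValueError("not an alignment line")
    record = {key: (int(value) if key in INT_FIELDS else value)
              for key, value in zip(SAM_FIELDS, columns)}
    record["TAGS"] = columns[11:]                    # optional TAG:TYPE:VALUE fields
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


line = "read_demo\t16\tchr_demo\t101\t42\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
print(parse_sam_line(line))
```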
BAM [55]: BAM stands for Binary Alignment/Map. A BAM file is essentially a compressed, binary
version of the SAM file and likewise contains both a header and an alignment section. A BAM file
is either the direct output of alignment or is converted from a SAM file. The alignment section
can include the following tags for each read:
RG: Read group, which indicates the number of reads for a specific sample.
BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
SM: Single-end alignment quality.
AS: Paired-end alignment quality.
NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
XN: Amplicon name tag, which records the amplicon tile ID associated with the read.
CHAPTER FIVE: READ ALIGNMENT TOOLS
5.1 Bowtie
Developed in 2009, Bowtie is an open-source read alignment tool that aligns short sequenced
reads to genomes. Before beginning the alignment process, Bowtie must index the reference
genome using the Burrows-Wheeler transform (BWT) and an FM index. A BWT-based index is used
because it keeps the memory footprint small while still allowing the large number of reads
generated by the sequencing machine to be searched quickly. For the human genome, the index takes
up approximately 2.2 GB (for unpaired alignment) or 2.9 GB (for paired-end alignment). Once
built, the index can be reused across multiple alignment runs. In addition, the Bowtie
website provides pre-built indexes not only for the human genome but also for other
model organisms. Bowtie builds the index from the reference genome (genome.fa), which is in
FASTA format. The command used for indexing is as follows:
bowtie-build genome.fa bowtieIndex
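To give a feel for what a BWT-based index does, the toy sketch below builds the Burrows-Wheeler transform of a short string and uses FM-index style backward search to find exact matches. It is not Bowtie's implementation: it sorts full suffixes and stores complete occurrence tables, which is only workable for tiny inputs, whereas real indexes use compressed and sampled structures. All function names are hypothetical.

```python
def bwt_index(text):
    """Build the BWT of `text` plus the two lookup tables used by an FM index."""
    text += "$"                                   # unique end marker, sorts first
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # suffix array
    bwt = "".join(text[i - 1] for i in sa)        # character preceding each suffix
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, first, occ


def find(pattern, sa, first, occ):
    """Backward search: return the sorted start positions of `pattern`."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):                   # process the pattern right to left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt, first, occ = bwt_index("ACGTACGA")
print(bwt)                          # prints AGT$AACCG
print(find("ACG", sa, first, occ))  # prints [0, 4]
```

The search touches the index once per pattern character regardless of the reference length, which is why this family of indexes suits aligning very large numbers of short reads.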
The inputs to Bowtie are the index built from the reference genome and a set of reads (.fastq format).
When running read alignment on the command line, you must:
1) Specify the path to the bowtie executable you have installed
/u/home/v/victorx/miniconda3/bin/bowtie
2) Add the exact name of the index you previously generated
-x bowtieIndex
3) Specify the path of the first read (.fastq)
-1 /u/flashscratch/p/pelin/DNASeqTools/reads/splitted/test_1_g.fastq
4) Specify the path of the second read (.fastq)
-2 /u/flashscratch/p/pelin/DNASeqTools/reads/splitted/test_2_g.fastq
5) Specify the path in which the .SAM file will be saved
/u/home/v/victorx/scratch/bowtie_output/test.sam
Altogether, the command should read as follows (the -S option tells Bowtie to write its output in
SAM format):
/u/home/v/victorx/miniconda3/bin/bowtie -S -x bowtieIndex \
    -1 /u/flashscratch/p/pelin/DNASeqTools/reads/splitted/test_1_g.fastq \
    -2 /u/flashscratch/p/pelin/DNASeqTools/reads/splitted/test_2_g.fastq \
    /u/home/v/victorx/scratch/bowtie_output/test.sam
6) Next, to convert from SAM to BAM file, you can use the following command:
samtools view -S -b test.sam > test2.bam
Depending on the options the user specifies, Bowtie runs either with its default high-performance
settings or in “--best” mode. With the default settings, Bowtie will not report alignments for
reads that have too many mismatches, and a reported alignment is not guaranteed to be the best
one available. With the “--best” option, Bowtie guarantees that the reported alignment is the
best in terms of the number of mismatches in the seed portion of the read, at a higher
computational cost. Bowtie supports reads up to 1,024 base pairs (bp) in length and can align
short reads (generated by Illumina sequencing machines) at a rate of more than 25 million reads
per CPU hour [56].
5.2 NextGenMap
NextGenMap is a read alignment tool, developed in 2013, that can align reads of variable
length (short and long) and can handle reference genomes that are highly polymorphic. Unlike
Bowtie, NextGenMap does not rely on the Burrows-Wheeler transform. Instead, it uses a hash table
that stores the positions of short subsequences of the reference genome, which is used to locate
where reads may map. NextGenMap accepts a range of input files, including FASTA, FASTQ, SAM, and
BAM files, and can output both SAM and BAM files. While NextGenMap runs, it computes alignment
scores for the candidate mapping regions (CMRs) of each read and aligns the read to the
highest-scoring regions. In benchmarks against other read alignment tools, NextGenMap was 1.1 to
2.3 times faster than Bowtie2 in terms of total runtime. NextGenMap was also shown to map reads
to highly polymorphic reference genomes (up to 10% polymorphism).
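The general idea of a hash-based search for candidate mapping regions can be sketched as follows. The reference is cut into overlapping k-mers whose positions are stored in a hash table; each k-mer of a read then votes for the reference position where the read would have to start. This illustrates the concept only and is not NextGenMap's actual algorithm or scoring scheme; the function names, the value of k, the vote threshold, and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_regions(read, index, k, min_votes=2):
    """Let each read k-mer vote for an implied read start; return [(start, votes)]
    sorted from most to fewest votes."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1              # where the read would begin
    ranked = [(s, v) for s, v in votes.items() if v >= min_votes]
    return sorted(ranked, key=lambda sv: (-sv[1], sv[0]))


reference = "TTGACCGTAAGCTTGACCATAAGG"
index = build_kmer_index(reference, k=4)
print(candidate_regions("GACCGTAAG", index, k=4))
# prints [(2, 6), (14, 2)]
```

Here position 2 collects votes from all six k-mers of the read, while position 14 shares only two of them; an aligner would compute a full alignment only for the best-supported candidates.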
The first step is to index the reference genome using the following command:
ngm -r reference-sequence.fa
To run single-end mapping with NextGenMap on the command line, use the following command:
ngm -q reads.fastq -r reference-sequence.fa -o results.sam -t 4
To run paired-end mapping with NextGenMap on the command line, use the following command:
ngm -r reference-sequence.fa -1 forward_reads.fq -2 reverse_reads.fq -o results.sam -t 4
Altogether, the command should read as follows:
ngm -r /u/home/v/victorx/scratch/Karishma/nextgenmap/genome.fasta \
    -1 /u/home/v/victorx/scratch/Karishma/nextgenmap/read1.fastq \
    -2 /u/home/v/victorx/scratch/Karishma/nextgenmap/read2.fastq \
    -o results_nextgenmap.sam -t 4
It is important to specify the path where NextGenMap is installed on your computer, as well as
the paths to the reference genome, the first read file, the second read file, and the location
where the SAM output should be stored [57,58].
CHAPTER SIX: UNLOCKING THE CAPACITIES OF VIRAL GENOMICS
FOR THE COVID-19 PANDEMIC RESPONSE
6.1 Introduction
COVID-19, a contagious disease caused by the novel severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2), has reached an extraordinary scale not seen since the influenza
pandemic of 1918–1919 [59]. Within a month of its first reported case in China in December 2019,
COVID-19 had spread to many regions in China [60] and had been detected in several neighboring
countries, including Thailand, Korea, and Japan. As international flights continued to operate,
SARS-CoV-2 then spread to Europe and North America in a short amount of time, and was soon
declared a global pandemic [61,62]. According to the World Health Organization (WHO), in the first
16 months of the pandemic (through April 8, 2021), more than 132.7 million people became
infected worldwide, resulting in more than 2.8 million deaths [63].
Over the past two decades, the biomedical community has become equipped with infrastructure
for basic genomic techniques to support epidemic responses [64]. This capability has enabled the
rapid collection of SARS-CoV-2 genomic information, which in turn has allowed online observation
of SARS-CoV-2 genomic evolution and rapid tracking of SARS-CoV-2 genetic groups, lineages,
variants, variants of interest (VOI), and variants of concern (VOC) [65]. The precise and rapid
tracking of SARS-CoV-2 genetic changes facilitates the fast development of SARS-CoV-2 clinical
tests and the prediction of vaccine efficacy. As sequencing technologies and genomic analysis
tools progress, genome sequencing is becoming more widely integrated into clinical and healthcare
workflows. However, the use of genomic sequencing to its full potential for public health
surveillance and outbreak response has yet to be established and depends on the broad
adoption of the best practices for preventing and limiting outbreaks that were identified
during the COVID-19 response [66]. Herein, we discuss the genomic capacities that can be used to
address many of the public health issues associated with COVID-19.
6.2 Initial SARS-CoV-2 detection and characterization
Genomic analysis conducted on respiratory specimens isolated from the first COVID-19 patients
hospitalized in December 2019 in Wuhan, China, allowed for the prompt detection and
characterization of a novel coronavirus, later named SARS-CoV-2, by January 2020 [61,62,67,68].
Initial sequence analyses revealed that SARS-CoV-2 shared 80% nucleotide identity with
SARS-CoV [69,70], strongly indicating that SARS-CoV-2 was likely a respiratory pathogen that could
spread from human to human and hence had clear epidemic potential. These initial analyses also
revealed that SARS-CoV-2 shared high sequence similarity with related viruses found in bats and
pangolins, suggesting a zoonotic origin [68,69,71–77]. Across its complete genome, SARS-CoV-2 is
most closely related to the bat coronavirus RaTG13, with which it shares approximately 96%
nucleotide sequence identity. However, individual SARS-CoV-2 coding regions share greater
similarity with those of other animal coronaviruses. For example, the spike (S) protein
receptor-binding domain (RBD) exhibits higher sequence identity to that of the Guangdong
pangolin virus (97.4%) than to that of RaTG13 (89.3%), while the SARS-CoV-2 long 1ab (replicase)
open reading frame (ORF) exhibits the highest sequence identity (98.8%) with the RmYN02 bat
coronavirus [78]. In further support of a zoonotic origin, another coronavirus detected in five
bats (RacCS203) is genetically closely related to SARS-CoV-2 [79], and neutralizing antibodies
against SARS-CoV-2 were found in wild pangolins and bats from Thailand [79]. Moreover, the
zoonosis of SARS-CoV-2 resembles those of the SARS-CoV and MERS-CoV coronaviruses, as data
indicate that in all three cases other intermediate animals were likely present in the
transmission chains [80,81]. Together, these findings suggest a complex history of recombination
events prior to the zoonotic transfer of SARS-CoV-2 to humans, although when and in which hosts
these events took place remains unclear [71,72,77,82]. Genomic knowledge acquired through viral
sequencing and phylogenetic analysis greatly contributed to the rapid determination of the
potential epidemiological characteristics and origins of SARS-CoV-2 [83].
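The identity percentages quoted above come from comparing aligned sequences position by position. A minimal sketch of such a calculation is given below; it assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it excludes gap columns from the denominator. Conventions for counting gaps differ between tools, so this is one possible definition and not necessarily the one used in the cited studies. The sequences are toy examples.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned columns (gap columns excluded) with identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue                      # skip columns that contain a gap
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned sequences: 9 comparable columns, 8 of them identical.
print(round(percent_identity("ACGTACGTAC", "ACGTTCG-AC"), 1))
# prints 88.9
```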
6.3 The role of genomics in the early COVID-19 outbreak response
As seen with other recent viral epidemics, viral genome sequencing has become an essential part
of the COVID-19 public health response [84]. Early access to SARS-CoV-2 genome sequences
allowed for the timely development and production of nucleic acid amplification testing
(NAAT)-based diagnostics, expedited vaccine development, and accelerated opportunities for
SARS-CoV-2 genomics-based real-time surveillance [85–90]. The first SARS-CoV-2 tests and vaccine
candidates appeared within one and three months, respectively, of the identification of the first
COVID-19 patient [90–92]. Together with access to modern sequencing technologies, the scale of the
pandemic, in terms of numbers of cases and affected regions, has prompted the collection of
SARS-CoV-2 viral genomic data at an unparalleled magnitude (on average 2,500 genomes per day).
Consequently, the capacity to track virus spread and viral evolution in real time has been
accelerated relative to that associated with prior outbreaks [93]. When the WHO initially
declared a Public Health Emergency of International Concern (PHEIC) on January 30, 2020, 339
SARS-CoV-2 genomes had already been collected and characterized [60–62,86]. By April 7, 2021,
public repositories that host SARS-CoV-2 genomes contained over 1,000,000 genomes [94–96].
Notably, by the end of the sixth month of the pandemic (May 2020), the Global Initiative on
Sharing All Influenza Data (GISAID) and National Center for Biotechnology Information (NCBI)
databases included 110,000 SARS-CoV-2 full-length genome sequences, as compared to the more than
8,000 HIV full-length genome sequences collected by the Los Alamos National Laboratory sequence
database [97] over the past 40 years [98] (Figure 3a). Of the SARS-CoV-2 raw sequencing data
available at NCBI, 86% is Illumina data, 13.7% is Oxford Nanopore, and 0.3% is PacBio, Ion
Torrent, and BGISEQ. There is a correlation between the number of submitted sequences per capita
and the GDP per capita for the majority of the countries in the world; moreover, high-income
countries submitted about 100x more sequences per capita on average than did low-income countries
(Figure 3b). However, it is remarkable that African nations with a low GDP per capita sequenced
viral genomes at a level comparable to that of middle- and high-income countries [99]. Indeed,
owing to several previous programs aimed at controlling outbreaks of other viruses in Africa, the
sequencing capacity of the African healthcare system had improved, helping to increase its
efficiency in sequencing SARS-CoV-2 genomes [99] (Figure 3c). The countries with the highest
ratios of SARS-CoV-2 genomes sequenced to COVID-19 cases, together with relatively low numbers of
reported cases per capita, were Taiwan, New Zealand, Australia, Iceland, and Denmark (Figure 3d).
6.4 SARS-CoV-2 genomic evolution
The unprecedented scale of SARS-CoV-2 genome sequencing offers unique opportunities for
tracking SARS-CoV-2 evolution online and detecting the emergence and spread of new VOI and
VOC [100–102] (Figure 4). SARS-CoV-2 genome sequencing and the subsequent bioinformatics
analysis showed that, because of an intrinsic RNA proofreading mechanism, coronaviruses
exhibit lower mutation rates than do many other RNA viruses, such as Ebola virus and
HIV [103–106]; however, as more variants emerge, mutation rates could slowly be on the rise. In
addition, their evolutionary (i.e., nucleotide substitution) rate partly reflects the action of
host-dependent RNA-editing enzymes (e.g., APOBEC) [107]. Coronaviruses accumulate approximately
1.12 × 10⁻³ nucleotide substitutions per site per year on average. This is comparable to the
SARS-CoV-1 mutation rate of 0.8 × 10⁻³ to 2.38 × 10⁻³ and the Ebola virus mutation rate of
1.3 × 10⁻³, and is lower than the seasonal influenza mutation rate of 6.7 × 10⁻³ and the HIV
mutation rate of 4.4 × 10⁻³ [103–105,108–110].
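To put the per-site rate into perspective, it can be multiplied by a genome length to give an expected number of substitutions per genome. The short calculation below does this for the mean rate quoted above; the genome length of 29,903 nucleotides (the length of the Wuhan-Hu-1 reference genome) is an assumption brought in for the example and is not stated in the text.

```python
RATE_PER_SITE_PER_YEAR = 1.12e-3   # mean rate quoted in the text
GENOME_LENGTH = 29_903             # assumption: Wuhan-Hu-1 reference length (nt)

per_genome_per_year = RATE_PER_SITE_PER_YEAR * GENOME_LENGTH
per_genome_per_month = per_genome_per_year / 12

print(f"{per_genome_per_year:.1f} substitutions per genome per year")    # 33.5
print(f"{per_genome_per_month:.1f} substitutions per genome per month")  # 2.8
```

Under these assumptions, the rate corresponds to roughly two to three substitutions per genome per month.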
Another important aspect of SARS-CoV-2 evolution is that SARS-CoV-2, like many other RNA
viruses, can exist within an individual host as a swarm of closely related variants and has
a tendency to recombine [111]. Genomic studies have demonstrated the presence of such intra-host
diversity [112–118], with one study having identified between 1 and 52 haplotype variants in each
of 25 clinical patients [112]. Identifying the factors that shape these intra-host viral
population structures can promote a better understanding of short-term viral evolution, in
addition to providing insights into host adaptation and drug and vaccine design. For example,
evidence of intra-host recombination [119] may enable estimation of the role of recombination in
the zoonotic origin of SARS-CoV-2 [72] and the emergence of novel viral variants [120–122].
Over the first year of the epidemic, SARS-CoV-2 gradually accumulated mutations and
developed into several viral lineages as it spread through the human population [65,123–126].
However, from the advent of the pandemic through approximately September 2020, there was no
statistical evidence that any of the numerous characterized SARS-CoV-2 mutations had resulted
in a loss or gain of function [103–105]. For example, one study analyzed all 48,454 SARS-CoV-2
genomes from around the world that were available from GISAID as of late July 2020 and identified
12,706 mutations, 398 of which were recurrent, and none of which were associated with a
significant change in transmissibility [127]. During the summer of 2020, the D614G mutation in
the viral S protein attracted attention because this new variant superseded the original
SARS-CoV-2 strain globally. Phylogenetic analyses and clinical evidence indicated that, although
the D614G variant was associated with both increased viral load and infectivity [126,128], it
was also more susceptible to neutralizing antisera and was not linked to any change in vaccine
efficacy or increased pathogenicity [129].
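Mutation labels such as D614G follow a compact convention: the reference amino acid, its 1-based position in the protein, and the amino acid that replaces it. The sketch below parses such labels and applies them to a sequence. It handles only simple single-residue substitutions (not deletions such as ∆69–70), the function names are hypothetical, and the four-residue peptide is a toy example rather than part of the S protein.

```python
import re

_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label such as 'D614G' into (reference, position, variant)."""
    match = _SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a simple amino acid substitution: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_substitution(protein, label):
    """Return `protein` with the substitution applied (positions are 1-based)."""
    ref, pos, alt = parse_substitution(label)
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}")
    return protein[:pos - 1] + alt + protein[pos:]


print(parse_substitution("D614G"))        # prints ('D', 614, 'G')
print(apply_substitution("MKDL", "D3G"))  # prints MKGL
```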
The first SARS-CoV-2 viral variant of concern (VOC) for public health, known as variant B.1.1.7,
was first detected in the UK in September 2020. Genomic analysis revealed that the B.1.1.7 variant
had arisen in late summer or early fall 2020 and then quickly spread through many countries,
including Australia, Denmark, Italy, Iceland, the Netherlands, and now the US [130–132]. However,
the full pathogenic potential of this variant was not recognized until December 2020 [130]. The
B.1.1.7 variant harbors at least 12 mutations, including 2 in the S protein: N501Y, which
increases the ability of SARS-CoV-2 to bind to its cellular receptor, ACE2, and P681H, which
adjoins the furin cleavage site in the S protein [133–135]. Both mutations have been associated
with a 40–80% increase in the transmissibility of this variant as compared to previous SARS-CoV-2
strains [130]. More recently, the B.1.1.7 variant was found to be associated with greater disease
severity and an increased risk of death as compared to other variants [136]. In addition, the
variant carries a ∆69–70 deletion that results in detection failure by some SARS-CoV-2 molecular
tests, which can limit the successful tracing of this VOC [137]. However, there is no evidence
thus far that this variant reduces vaccine efficacy.
The second VOC was discovered in the UK in September 2020 and was characterized by several
mutations, including E484K in the RBD of the S protein. This mutation, which was later
discovered to have arisen independently in other viral variants around the world [138], is
associated with reduced neutralizing activity of human convalescent and post-vaccination sera.
Additional VOCs related to B.1.1.7 include B.1.351, which was first detected in South Africa in
November 2020 [139], where it spread rapidly. Although the latest reports indicate that this
variant has also spread to Zambia and the US, there is no evidence that it impacts disease
severity [140]. This variant also harbors multiple mutations in the S protein, such as K417N,
E484K, and N501Y. The third VOC, P.1, was detected in four travelers who arrived in Japan from
Brazil in January 2021 [141–143]. P.1 carries mutations in the RBD similar to those of B.1.351
(K417T, E484K, N501Y), the latter two of which can increase transmissibility and help the virus
evade neutralizing antibodies. The impact of the K417T mutation is not known. More recently,
another genetic variant, B.1.427/B.1.429, was declared a VOC because of its prevalence in an
outbreak in California. This variant harbors the L452R mutation in the S protein, which is
suspected to confer resistance to SARS-CoV-2 antibodies, although its effect is weaker than that
of the E484K mutation, which is associated with greatly reduced viral susceptibility to antibody
neutralization [144]. The full, up-to-date list of all VOI and VOC can be found on the official
CDC page [145].
Some of these variants were first independently identified in immunodeficient individuals in
different countries, suggesting that their emergence may be the result of convergent evolution
followed by rapid spread. For example, the appearance of the ΔH69/ΔV70 deletion was
documented in an immunosuppressed individual through deep viral genome sequencing at 23 time
points during the course of infection (101 days)
122
. A weakened host immune response can permit
the virus to replicate with little or no control, increasing the likelihood for mutations to occur. The
independent evolution of a given mutation in different geographic locations suggests that this
mutation may confer an adaptive advantage to the virus, such as immune evasion or increased
transmissibility, which is corroborated by clinical studies. Given the likely public health
importance of these VOCs and VUIs, global surveillance for these and other new variants is
expanding, as information for all SARS-CoV-2 lineages is now collected and made available
online for the rapid evaluation of their epidemiologic and vaccine impact and short-term evolution
based on individual data points
65,146
. In order to gain better control over emerging VOCs and VOIs,
a European Commission Recommendation dated 19 January 2021 stated that "all EU Member
States should reach a capacity of sequencing at least 5% - and preferably 10% - of positive test
results. In most Member States, the sequencing capacity for identification of SARS-CoV-2
variants is below the recommendation set by the European Commission to sequence 5-10% of
SARS-CoV-2 positive specimens" (Figure 3d).
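The 5% and 10% targets quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming that a region's sequencing capacity is approximated by the ratio of sequenced genomes to positive tests. The function names and all counts are hypothetical and serve only as an illustration; they are not taken from GISAID or from any country's reports.

```python
def sequencing_fraction(sequenced_genomes, positive_tests):
    """Return the share of positive tests that were sequenced (0.0 to 1.0)."""
    if positive_tests <= 0:
        raise ValueError("positive_tests must be greater than zero")
    return sequenced_genomes / positive_tests


def classify_capacity(fraction, minimum=0.05, preferred=0.10):
    """Compare a sequencing fraction with the 5% and 10% targets quoted above."""
    if fraction >= preferred:
        return "meets preferred target"
    if fraction >= minimum:
        return "meets minimum target"
    return "below recommendation"


# Hypothetical (sequenced genomes, positive tests) pairs, for illustration only.
example_counts = {
    "Region A": (1200, 10000),
    "Region B": (300, 10000),
    "Region C": (600, 10000),
}

report = {
    region: classify_capacity(sequencing_fraction(sequenced, positives))
    for region, (sequenced, positives) in example_counts.items()
}

for region, status in report.items():
    print(f"{region}: {status}")
```

In practice, the numerator and denominator would need to cover the same time window and the same reporting area for the ratio to be meaningful.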
6.5 The use of genomics to investigate the pandemic spread of SARS-CoV-2
Access to rich and diverse publicly available SARS-CoV-2 genomic data across various regions
has allowed scientists and public health officials to efficiently track routes along which COVID-
19 outbreaks have spread locally and internationally. In this context, phylogenetic and genetic
network analyses can provide important public health information regarding viral epidemic
spread
147,148
. Importantly, as viruses accumulate genomic mutations within different populations,
knowledge regarding such evolution can reveal transmission chains and distinguish imported cases
from instances of local transmission if a sufficient number of samples is analyzed, ultimately
identifying high-risk transmission routes that should be subject to enhanced public health
control
149–152
. Genomic analysis has identified SARS-CoV-2 introductions into
Europe from China and into the US from both China and Europe, with subsequent local
transmission
149,153–155
. One recent study suggests that SARS-CoV-2 was introduced in the US in
Connecticut via a domestic transmission route, while another showed that most successful viral
introductions to Arizona were likely from domestic travel
149,156
. Another study revealed that the
New York City area exhibited multiple introductions of SARS-CoV-2, primarily from Europe
157
.
Similarly, SARS-CoV-2 was potentially introduced into France from several countries, including
China, Italy, the United Arab Emirates, Egypt, and Madagascar
150
. We have curated a
comprehensive list of genomic outbreak investigations to date for various geographical regions.
This catalog contains 40 studies and is updated in real time as more studies are published; an online
version is available at https://github.com/Mangul-Lab-USC/COVID-19-outbreak-investigations.
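The idea that accumulated mutations can separate imported cases from local transmission can be illustrated with a pairwise comparison of aligned genomes. The sketch below is a simplified stand-in and not the method used in the cited studies, which rely on phylogenetic and genetic network inference. The sample names and the ten-base toy sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') or an unknown base ('N')
    are skipped, since they carry no information about a substitution.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    uninformative = {"N", "-"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in uninformative and b not in uninformative
    )


# Hypothetical aligned toy sequences, for illustration only.
samples = {
    "local_1": "ACGTACGTAC",
    "local_2": "ACGTACGTAT",
    "import_1": "ACCTACGGAT",
}

# Distance for every unordered pair of samples.
distances = {
    (first, second): snp_distance(samples[first], samples[second])
    for first, second in combinations(samples, 2)
}

for pair, distance in distances.items():
    print(pair, distance)
```

Pairs with small distances are candidates for a shared transmission chain, while more divergent genomes point to a separate introduction. Any cutoff between the two would be an assumption that depends on the mutation rate and the sampling window.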
Viral genomics can also be used to monitor the effectiveness of global travel restrictions and
lockdowns in different countries in limiting viral spread. For example, genomic analysis showed
that the risk of domestic transmission of SARS-CoV-2 in Connecticut exceeded that of
international introduction at the time federal travel restrictions were imposed, highlighting the
critical need for local surveillance
149
. Similarly in Brazil, three clades of European origin were
established prior to the initiation of travel bans and lockdowns
158
. Another genomic analysis
showed that, due to violations of imposed lockdowns with sea trade, several SARS-CoV-2
international introductions likely occurred in Morocco
159
. In Australia, lockdown effectiveness
was validated using agent-based modeling coupled with SARS-CoV-2 genomic data, increased
testing of the population, and employment of mitigation strategies from the government
160
. On
December 19, 2020, in response to the rapidly spreading B.1.1.7 variant found in the UK, the prime
minister implemented a tighter lockdown and other restrictions, and many countries subsequently
closed their borders to people traveling from the UK
161
. The spread of this variant was then
tracked precisely in the US using available sequencing data
162
.
Figure 3 Available SARS-CoV-2 genomic sequencing data and its usage for outbreak
investigation (a) The number of global SARS-CoV-2 genomes sequenced according to Global
Initiative On Sharing All Influenza Data (GISAID) between January 2020-March 2020. (b) The
number of available SARS-CoV-2 sequences in GISAID per 1 million (1M) individuals for each
country vs. the number of cases per capita up to March 2021. (c) The number of available SARS-
CoV-2 sequences in GISAID per 1 million (1M) individuals for each country in Africa vs. the
number of sequencers per capita up to March 2021. The blue line is the trend line through all data points
on the plot. (d) The number of available SARS-CoV-2 sequences in GISAID per number of
reported COVID-19 cases for each country vs. the number of reported COVID-19 cases per capita
up to March 2021. (e) Global outbreak investigations by phylogenetic analysis (red) and
wastewater studies (yellow); dots were placed at the geographical center of each county or
region
227
.
Combining genomic methods with clinical and geospatial data can help characterize viral
infectivity, virulence, and death rates of circulating viral strains more accurately because
epidemics in different areas of the world may have distinct characteristics that depend on viral
genotype, as well as the demographics of the host population. Specifically, integration and analysis
of phylogenetic and epidemiologic data can provide a more complete understanding of the
pandemic transmission dynamics
163
. Available genomic data can also be utilized to examine and
partly explain the relationship between genetic variation in strains of SARS-CoV-2 and disease
severity
164
. Findings from these studies can also help characterize mutation patterns in various
hotspots and identify correlates of infection and death rates in these countries
165
. Novel approaches
can be developed to combine population genomics and genetics to leverage the identification of
molecular markers with unusual pattern variations or relevant single nucleotide polymorphisms in
people from different geographies
67,68
. If SARS-CoV-2 infections continue at their current rate,
population genomic research and pharmacogenomics approaches may be useful in the
development of personalized therapeutics against this pathogen. Although disease severity can be
partly attributed to host genomics, understanding these factors has been difficult due to
contradictory evidence and the limited number of host genomics studies conducted to date
166
.
Figure 4 Variant of Concern (VOC) and Variant of Interest (VOI) circulating throughout
the globe (a) The locations where VOCs and VOIs were initially detected. (b) The timeline
showing when VOCs and VOIs initially appeared in the sequencing data (not the time when they
were declared as VOCs and VOIs)
227
.
6.6 Monitoring SARS-CoV-2 transmission through wastewater genomic studies
Another genomics-based method for population-level pathogen surveillance assesses the presence
of trace viral genomic material in wastewater. Wastewater-based monitoring has been successfully
employed to track antibiotic use
167
and tobacco consumption
168
and for the monitoring of enteric
viruses such as poliovirus
169
. Notably, a 2013 study accomplished the early detection of a viral
outbreak in Sweden by quantifying hepatitis A virus and norovirus genetic material levels in
wastewater
170
. Although COVID-19 is primarily associated with respiratory symptoms, SARS-
CoV-2 is regularly shed in feces
171
. As of August 2020, SARS-CoV-2 RNA had been detected in
wastewater by over 35 studies in 17 countries using NAAT-based methods
(https://www.covid19wbec.org), which can effectively detect the presence and concentration of
viruses in wastewater
172
and potentially estimate the relative number of disease cases in the area
covered by the sewage facilities. However, current NAAT-based methods cannot detect whether
these samples harbor novel mutations
173
, and novel mutations arising in the primer binding sites of the template
have the potential to compromise the ability of NAAT-based methods to
detect the virus
174
. Additionally, wastewater may contain fragmented or defective
genomes which may not be detected with these methods. Alternatively, a potentially promising
approach is the application of metagenomics on a global scale to detect, collect, and store samples
in preparation for future pandemics
175,176
. Metagenomics methods, which can sequence all
available genomic material in a sample, allow the characterization of an entire viral population and
the detection of prevalent SARS-CoV-2 variants in a given geographical space
177,178
. Wastewater
surveillance studies of SARS-CoV-2 RNA concentration across various regions in the world have
taken place between January 2020 and November 2020 (Figure 3e).
Temporal changes in SARS-CoV-2 RNA concentration in wastewater were assessed in Valencia,
Spain from February to April 2020, in Paris, France from March to April 2020, and in many other
regions, and they were consistent with the number of clinically diagnosed cases in a given
community
179,180
. This relationship demonstrates the use of wastewater studies as a relatively
inexpensive and straightforward method for investigating national outbreak dynamics, especially
in areas where case diagnosis is complicated. In contrast, clinical diagnostic testing traditionally
used to assess the number of cases in a community typically underestimates actual infection
rates
181
, as this approach primarily focuses on symptomatic individuals, so asymptomatic
cases are less likely to be captured. However, combining clinical diagnostics with wastewater-
based surveillance can potentially provide a more comprehensive community-level profile of both
symptomatic and asymptomatic cases, enabling identification of hospital capacity needs
172,182–188
.
Additionally, an important advantage of wastewater monitoring is the ability to detect early-stage
outbreaks before they become widespread
169,173,189,190
. In contrast to NAAT-based methods such
as real time reverse transcription polymerase chain reaction (RT-PCR)-based analysis of SARS-
CoV-2, metagenomic sequencing allows for characterization of the prevalent SARS-CoV-2
genomic variants in a defined local region and reveals geospatial SARS-CoV-2 genotype
distribution
178,191
. Wastewater samples can be used to identify circulating lineages in the community
and to complement genomic epidemiology analyses; for example, such analysis has already helped
to detect B.1.1.7 strains in the US and Switzerland
192
.
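The reported agreement between wastewater RNA concentration and clinically diagnosed cases can be quantified with a correlation coefficient. The following is a minimal sketch, assuming paired weekly measurements for a single sewershed. The weekly values are hypothetical and are not data from the Valencia or Paris studies, and the cited studies may have used different statistics.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical weekly values for one sewershed, for illustration only.
rna_copies_per_liter = [1.0e3, 4.0e3, 9.0e3, 2.0e4, 1.5e4, 7.0e3]
diagnosed_cases = [12, 40, 95, 180, 160, 70]

r = pearson_r(rna_copies_per_liter, diagnosed_cases)
print(f"Pearson r = {r:.2f}")
```

Because wastewater signals may lead clinical case counts, a real analysis would likely also compare the series at several time lags rather than only week by week.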
Despite the numerous advantages of wastewater-based virus surveillance, many potential
improvements would result in more reliable and extended applications in public health decision-
making. Currently, wastewater-based methods require calibration and validation because they only
provide a raw measure of the number of cases in a population
173
. Additionally, wastewater-based
monitoring lacks the granularity of clinical diagnostic testing and cannot discern a particular area
of an outbreak when the wastewater treatment plant serves a large population. Sampling at a higher
spatial resolution within the sewer system or even at a building-level scale could potentially
provide early indications of viral outbreaks and help monitor their progression
193
. This effort could
also include areas with large numbers of septic tank systems that are not feeding municipal
wastewater systems.
6.7 Genomics in clinical applications
Viral genomics can also aid in vaccine development and investigations of how viral evolution
impacts clinical outcomes and treatment
194
. While the majority of known SARS-CoV-2 mutations
have no effect on viral replication and transmission
195
, some substitutions have been linked to
phenotypic changes that could influence outbreak dynamics. For example, patients in Singapore
infected with ∆382 SARS-CoV-2 variants, which have a 382-nucleotide deletion in ORF8,
exhibited milder symptoms compared to patients with viruses that lacked this deletion
196
.
However, the ∆382 SARS-CoV-2 variant is very rare globally and appears to have died out in
Singapore. More alarmingly, the highly prevalent D614G mutation may increase transmissibility
and infectivity in natural populations, giving variants harboring this mutation a marked selective
advantage, although the best evidence to date comes from laboratory and simulation studies
only
126,153,197
. Lastly, the viral variants B.1.1.7, B.1.351, and P.1, which were detected in late 2020,
showed significantly increased transmissibility, heightening concern from a public health
perspective
140,198,199
. For vaccine development, understanding the degree to which different
regions of the viral genome are prone to mutation is important, as it is necessary to understand
whether rising immunity in humans will result in antigenic drift and consequent vaccine escape.
These evolutionary effects are commonly seen, for example, in human influenza viruses and
endemic coronaviruses. Analyses of the current genomic variability of SARS-CoV-2 suggest that
prospective COVID-19 vaccines should be cross-protective for the majority of currently known
viral variants
131,166,200
, although some minor variants (<1% natural occurrence frequency) have
been shown to alter the antigenicity of SARS-CoV-2
201
. For vaccine development, determining
the structures of SARS-CoV-2 antigens and their mutants is also crucial for the maximization of
vaccine efficacy
202
. The online COVID-3D resource allows for the exploration of the structural
distribution of genetic variation in SARS-CoV-2
203
.
Antigenic drift may also reduce the effectiveness of NAAT-based testing: when mutations
occur in primer regions, test sensitivity drops because the primers lose affinity for their targets. Newly
emerging mutations should therefore be monitored so that NAAT tests can be updated accordingly.
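Checking whether a circulating variant has acquired a mutation inside a primer-binding region can be sketched as a position-by-position comparison. The example below assumes that the primer and the binding site are written in the same orientation (5' to 3') and have equal length. The sequences are hypothetical and do not correspond to any published SARS-CoV-2 assay, and the five-base 3' window is an illustrative assumption.

```python
def primer_mismatches(primer, binding_site):
    """Return 0-based positions where the primer differs from its binding site."""
    if len(primer) != len(binding_site):
        raise ValueError("primer and binding site must have the same length")
    return [
        index
        for index, (p, s) in enumerate(zip(primer.upper(), binding_site.upper()))
        if p != s
    ]


def has_three_prime_mismatch(primer, binding_site, window=5):
    """True if any mismatch lies within the last `window` bases of the primer."""
    start = max(0, len(primer) - window)
    return any(index >= start for index in primer_mismatches(primer, binding_site))


# Hypothetical sequences, for illustration only.
primer = "ACGTTGCA"
reference_site = "ACGTTGCA"  # binding site in the reference genome
variant_site = "ACGTTGTA"    # same site carrying a single substitution

print(primer_mismatches(primer, reference_site))
print(primer_mismatches(primer, variant_site))
print(has_three_prime_mismatch(primer, variant_site))
```

Mismatches near the 3' end of a primer are generally the most disruptive to extension, which is why the sketch reports them separately.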
The first lab-confirmed case of COVID-19 re-infection was detected in Hong Kong using a
genomic analysis approach
204
, after which additional re-infection cases were detected in Belgium,
Ecuador, and the US
205–207
. Phylogenetic analyses of longitudinal SARS-CoV-2 genomic
sequences for all these patients distinguished between patient re-infection and persistent viral
shedding from the initial infection. The findings in all four cases suggest that SARS-CoV-2 may
persist in the global human population, despite herd immunity due to natural infection, which can
complicate vaccine development and efficacy
204
. However, current data suggests that SARS-CoV-
2 re-infection is rare, and it has been proposed that immunity against reinfection can last for at
least several months after the primary infection
208
.
Clinical manifestations of SARS-CoV-2 infection vary greatly, ranging from a lack of symptoms
to irreversible pulmonary damage
209–211
. Adaptive immune responses, such as early CD8+ and
CD4+ T cell responses, have been associated with positive patient outcomes
212
. Next-generation
sequencing of T and B cell receptor repertoires from COVID-19 patients has also revealed
differences in immune response characteristics between patients with a mild or severe disease
course
213
. Schultheiß et al. detected more than 14 million T and B cell receptors in blood samples
of infected patients from 70 time points, compiling a valuable resource that can inform new
therapeutic approaches and vaccine development. For example, their study revealed that
knowledge of host immunopathology obtained through sequencing can permit the early detection
of clinical biomarkers and aid in the identification of patients at risk for severe disease
213
.
6.8 Integrating clinical and genomics data
Many SARS-CoV-2 genomic studies to date have been conducted in the absence of substantial
clinical data collection and/or integration with viral sequence data. Conversely, numerous studies
have critically evaluated extensive clinical data alone, without assessing corresponding genomic
data
214
. Even investigations yielding large genomic (e.g., GISAID
86,94
) and clinical datasets
215
have
not performed integrated analyses of both data types. This substantial limitation of current
practices results from distinctions between the fields of bioinformatics (genomic data analyses)
and medical informatics (clinical data analyses). The COVID-19 pandemic promises to unite
researchers from both of these fields to integrate these seemingly disparate data sources, especially
in prospective studies
216
. Finding significant associations between genomic and clinical features
of the virus will ultimately support more targeted interventions by public health officials.
6.9 Discussion
The unprecedented density and volume of available SARS-CoV-2 genomic and clinical data
enabled the prompt and effective characterization of both SARS-CoV-2 genomes and COVID-19
epidemiology compared to those of previous outbreaks. The numerous successful efforts across
various parts of the globe utilizing genomic data for addressing the COVID-19 outbreak created a
solid foundation for the standardization of using SARS-CoV-2 genomic data. High-income
countries sequenced more SARS-CoV-2 genomes per capita than countries with low,
middle-low and middle income. However, the countries of Africa with low and middle-low income
demonstrated remarkably better preparedness to collect SARS-CoV-2 genomes than low and
middle-low income countries from other continents (Figure 3b). This preparedness can be
attributed to earlier global initiatives that supported African countries in mitigating previous
outbreaks of other viruses and, in doing so, expanded the sequencing capacity of the region. Africa
thus provides a remarkable example of the value of international cooperation, which should be
implemented in other parts of the globe for better worldwide epidemic control.
At the same time, the unprecedented volume of SARS-CoV-2 genome sequencing, which reached
one million viral genome sequences, challenged the current practices of viral data storage,
processing, and bioinformatics analysis
217–219
. While the importance of genome-based viral
surveillance systems was widely recognized and the principles of such systems had been conceptualized,
technological barriers to creating them remained, as such systems were still in the early
stages of development when the pandemic started. However, the unprecedented mobilization of
financial, scientific, and development resources during the course of COVID-19 allowed for fast
development, deployment, and scaling of numerous global surveillance systems which provide
resources for outbreak response using SARS-CoV-2 genome analysis.
CONCLUSION
Genomic data analyzed by computational tools can be used to effectively tackle important
biological problems such as detecting novel alternative splicing on specific exons to explore
disease progression, or characterizing viral genotypes as done for the COVID-19 pandemic. When
rigorously studied, benchmarked, and standardized, viral genomic surveillance systems enable
reliable and timely detection of the presence of circulating and emerging pathogens similar to
SARS-CoV-2, providing us a robust shield from current and newly emerging outbreaks
220
. With
sufficient sampling, genomic analysis will enable sentinel surveillance efforts capable of
effectively locating the geographic source of outbreaks, elucidating transmission chains, and
ultimately limiting the spread of the pathogens globally
157,199,221–225
. As science and technology
modernize, a plethora of scientific discoveries has yet to be unleashed in the near future, as
the possibilities are endless.
METHODS
Over the course of my Master's, I have been involved in many varied projects, all of which
required me to learn a new skill. At the start of my Master's, I focused on introducing
myself to the field of bioinformatics as I did not have much exposure to it previously due to
being trained in cell biology. I learned how to properly read and analyze existing publications,
write a peer-reviewed paper, write and submit a pre-submission letter to journals such as Nature,
and how to submit a paper for final review by the journals. While writing these papers, I
connected and consulted with co-authors from different countries and areas of scientific expertise. My
experience in consulting and collaborating with various bioinformatics professionals around the
world provided me with new insight and methodologies I had not previously thought of. My
communication and presentation skills have vastly improved as a result of project/research
presentations created/shared during weekly lab meetings. In addition, I have assisted with
drafting two educational grants on teaching data science for the National Science Foundation
(NSF), making me think about effective methods to teach data science for biomedical
researchers. Over the past year, I have also gained skills critical to creating
visualizations on different platforms. For example, I have used Adobe Illustrator, Python
libraries such as Seaborn and matplotlib, and GUI-based tools such as Tableau to create an
assortment of visualizations. Furthermore, I have learned how to navigate GitHub and manage
version control. Using software hosted on GitHub and on academic webpages, I
learned to install and run various RNA-seq tools both from the UNIX
command line and through bash scripts. Being a part of this lab exposed me to the
flexibility and accessibility of data-driven research and to its implications for
clinical outcomes in the field of pharmaceutical sciences.
REFERENCES
1. Van Noorden, R., Maher, B. & Nuzzo, R. The top 100 papers. Nature vol. 514 550–553
(2014).
2. Wren, J. D. Bioinformatics programs are 31-fold over-represented among the highest
impact scientific papers of the past two decades. Bioinformatics 32, 2686–2691 (2016).
3. Mangul, S. et al. How bioinformatics and open data can boost basic science in countries and
universities with limited resources. Nat. Biotechnol. 37, 324–326 (2019).
4. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening.
Nature 577, 89–94 (2020).
5. Achenbach, S. Coronary CT angiography—future directions. Cardiovascular Diagnosis
and Therapy vol. 7 432–438 (2017).
6. IDx-DR Overview: Close Care Gaps, Prevent Blindness. https://dxs.ai/products/idx-dr/idx-
dr-overview-2/.
7. Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-
approved medical devices and algorithms: an online database. NPJ Digit Med 3, 118 (2020).
8. [No title]. https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-
Intelligence-and-Machine-Learning-Discussion-Paper.pdf.
9. Harvey, H. B. & Gowda, V. How the FDA Regulates AI. Acad. Radiol. 27, 58–61 (2020).
10. Wells, D. K. et al. Key Parameters of Tumor Epitope Immunogenicity Revealed Through a
Consortium Approach Improve Neoantigen Prediction. Cell 183, 818–834.e13 (2020).
11. Hoof, I. et al. NetMHCpan, a method for MHC class I binding prediction beyond humans.
Immunogenetics 61, 1–13 (2009).
12. Hundal, J. et al. pVAC-Seq: A genome-guided in silico approach to identifying tumor
neoantigens. Genome Med. 8, 11 (2016).
13. O’Donnell, T. J. et al. MHCflurry: Open-Source Class I MHC Binding Affinity Prediction.
Cell Syst 7, 129–132.e4 (2018).
14. FDA Approves Foundation Medicine’s FoundationOne®Liquid CDx, a Comprehensive
Pan-Tumor Liquid Biopsy Test with Multiple Companion Diagnostic Indications for
Patients with Advanced Cancer. https://www.foundationmedicine.com/press-
releases/445c1f9e-6cbb-488b-84ad-5f133612b721.
15. El Naqa, I., Haider, M. A., Giger, M. L. & Ten Haken, R. K. Artificial Intelligence:
reshaping the practice of radiological sciences in the 21st century. Br. J. Radiol. 93,
20190855 (2020).
16. Dlamini, Z., Francies, F. Z., Hull, R. & Marima, R. Artificial intelligence (AI) and big data
in cancer and precision oncology. Comput. Struct. Biotechnol. J. 18, 2300–2311 (2020).
17. Chabon, J. J. et al. Integrating genomic features for non-invasive early lung cancer
detection. Nature 580, 245–251 (2020).
18. Moding, E. J. et al. Circulating tumor DNA dynamics predict benefit from consolidation
immunotherapy in locally advanced non-small-cell lung cancer. Nature Cancer vol. 1 176–
183 (2020).
19. Azad, T. D. et al. Circulating Tumor DNA Analysis for Detection of Minimal Residual
Disease After Chemoradiotherapy for Localized Esophageal Cancer. Gastroenterology 158,
494–505.e6 (2020).
20. West, J. et al. Towards Multidrug Adaptive Therapy. Cancer Res. 80, 1578–1589 (2020).
21. Pritchard, J. R. et al. Defining principles of combination drug mechanisms of action. Proc.
Natl. Acad. Sci. U. S. A. 110, E170–9 (2013).
22. Jonsson, V. D. et al. Novel computational method for predicting polytherapy switching
strategies to overcome tumor heterogeneity and evolution. Sci. Rep. 7, 44206 (2017).
23. Irurzun-Arana, I., McDonald, T. O., Trocóniz, I. F. & Michor, F. Pharmacokinetic Profiles
Determine Optimal Combination Treatment Schedules in Computational Models of Drug
Resistance. Cancer Res. 80, 3372–3382 (2020).
24. UCLA Computational Medicine. https://compmed.ucla.edu/.
25. Computational and Quantitative Medicine. https://www.cityofhope.org/research/beckman-
research-institute/research-departments-and-divisions/computational-and-quantitative-
medicine.
26. Data Science and Biotechnology Institute. https://gladstone.org/science/data-science-and-
biotechnology-institute.
27. Institute for Computational Medicine. https://med.nyu.edu/departments-
institutes/computational-medicine/computational-medicine.
28. Computational Medicine and Bioinformatics.
https://medicine.umich.edu/dept/computational-medicine-bioinformatics (2016).
29. Markowetz, F. All biology is computational biology. PLOS Biology vol. 15 e2002050
(2017).
30. Data sharing and the future of science. Nat. Commun. 9, 2817 (2018).
31. Hoadley, K. A. et al. Cell-of-Origin Patterns Dominate the Molecular Classification of
10,000 Tumors from 33 Types of Cancer. Cell 173, 291–304.e6 (2018).
32. Ding, L. et al. Perspective on Oncogenic Processes at the End of the Beginning of Cancer
Genomics. Cell 173, 305–320.e10 (2018).
33. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and
Mutations. Cell 174, 1034–1035 (2018).
34. Elinoff, J. M. et al. Meta-analysis of blood genome-wide expression profiling studies in
pulmonary arterial hypertension. Am. J. Physiol. Lung Cell. Mol. Physiol. 318, L98–L111
(2020).
35. Van Keuren-Jensen, K., Keats, J. J. & Craig, D. W. Bringing RNA-seq closer to the clinic.
Nat. Biotechnol. 32, 884–885 (2014).
36. Brito, J. J. et al. Recommendations to enhance rigor and reproducibility in biomedical
research. Gigascience 9, (2020).
37. Mangul, S. Interpreting and integrating big data in the life sciences.
doi:10.7287/peerj.preprints.27603.
38. Shastry, K. A., Aditya Shastry, K. & Sanjay, H. A. Machine Learning for Bioinformatics.
Statistical Modelling and Machine Learning Principles for Bioinformatics Techniques,
Tools, and Applications 25–39 (2020) doi:10.1007/978-981-15-2445-5_3.
39. RNA Sequencing. https://www.illumina.com/techniques/sequencing/rna-sequencing.html.
40. Morganti, S. et al. Next Generation Sequencing (NGS): A Revolutionary Technology in
Pharmacogenomics and Personalized Medicine in Cancer. Translational Research and
Onco-Omics Applications in the Era of Cancer Personal Genomics 9–30 (2019)
doi:10.1007/978-3-030-24100-1_2.
41. Sequencing Technologies and Analyses: Where Have We Been and Where Are We Going?
iScience 18, 37–41 (2019).
42. How it works. http://nanoporetech.com/how-it-works.
43. Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science vol.
323 133–138 (2009).
44. From RNA to full-length transcripts: The PacBio Iso-Seq method for transcriptome analysis
and genome annotation. https://www.pacb.com/proceedings/from-rna-to-full-length-
transcripts-the-pacbio-iso-seq-method-for-transcriptome-analysis-and-genome-annotation/
(2021).
45. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 1–
19 (2016).
46. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21
(2012).
47. Liao, Y., Smyth, G. K. & Shi, W. The Subread aligner: fast, accurate and scalable read
mapping by seed-and-vote. Nucleic Acids Res. 41, e108 (2013).
48. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome Biol. 14, 1–13 (2013).
49. Huang, S. et al. SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from
RNA-Seq Data. Front. Genet. 2, 46 (2011).
50. Wang, K. et al. MapSplice: accurate mapping of RNA-seq reads for splice junction
discovery. Nucleic Acids Res. 38, e178 (2010).
51. Au, K. F., Jiang, H., Lin, L., Xing, Y. & Wong, W. H. Detection of splice junctions from
paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
52. BLAST TOPICS.
https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TY
PE=BlastHelp.
53. FASTQ files explained. https://support.illumina.com/bulletins/2016/04/fastq-files-
explained.html.
54. [No title]. https://samtools.github.io/hts-specs/SAMv1.pdf.
55. [No title]. https://samtools.github.io/hts-specs/SAMv1.pdf.
56. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol. 10, 1–10 (2009).
57. Sedlazeck, F. J., Rescheneder, P. & von Haeseler, A. NextGenMap: fast and accurate read
mapping in highly polymorphic genomes. Bioinformatics 29, 2790–2791 (2013).
58. Cibiv. Cibiv/NextGenMap. https://github.com/Cibiv/NextGenMap.
59. Barro, R., Ursúa, J. & Weng, J. The Coronavirus and the Great Influenza Pandemic:
Lessons from the ‘Spanish Flu’ for the Coronavirus’s Potential Effects on Mortality and
Economic Activity. (2020) doi:10.3386/w26866.
60. Lu, J. et al. Genomic Epidemiology of SARS-CoV-2 in Guangdong Province, China. Cell
(2020) doi:10.1016/j.cell.2020.04.023.
61. Wang, C., Horby, P. W., Hayden, F. G. & Gao, G. F. A novel coronavirus outbreak of
global health concern. The Lancet vol. 395 470–473 (2020).
62. Wu, Z. & McGoogan, J. M. Characteristics of and Important Lessons From the Coronavirus
Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72 314 Cases From
the Chinese Center for Disease Control and Prevention. JAMA (2020)
doi:10.1001/jama.2020.2648.
63. Coronavirus disease (COVID-19) – World Health Organization.
https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
64. Grubaugh, N. D. et al. Tracking virus outbreaks in the twenty-first century. Nat Microbiol
4, 10–19 (2019).
65. Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist
genomic epidemiology. Nat Microbiol 5, 1403–1407 (2020).
66. The Rockefeller Foundation Releases New Action Plan to Accelerate Development of a
National System for Gathering and Sharing Information on SARS-CoV-2 Genomic Variants
and Other Pathogens. https://www.rockefellerfoundation.org/news/the-rockefeller-
foundation-releases-new-action-plan-to-accelerate-development-of-a-national-system-for-
gathering-and-sharing-information-on-sars-cov-2-genomic-variants-and-other-pathogens/.
67. Ren, L.-L. et al. Identification of a novel coronavirus causing severe pneumonia in human:
a descriptive study. Chin. Med. J. 133, 1015–1024 (2020).
68. Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat
origin. Nature 579, 270–273 (2020).
69. Tang, X. et al. On the origin and continuing evolution of SARS-CoV-2. Natl Sci Rev 7,
1012–1023 (2020).
70. Lu, R. et al. Genomic characterisation and epidemiology of 2019 novel coronavirus:
implications for virus origins and receptor binding. Lancet 395, 565–574 (2020).
71. Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal
origin of SARS-CoV-2. Nat. Med. 26, 450–452 (2020).
72. Li, X. et al. Emergence of SARS-CoV-2 through recombination and strong purifying
selection. Science Advances eabb9153 (2020).
73. Morens, D. M. & Fauci, A. S. Emerging Pandemic Diseases: How We Got to COVID-19.
Cell vol. 182 1077–1092 (2020).
74. Wu, A. et al. Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV)
Originating in China. Cell Host Microbe 27, 325–328 (2020).
75. Zhu, N. et al. A Novel Coronavirus from Patients with Pneumonia in China, 2019. N. Engl.
J. Med. 382, 727–733 (2020).
76. Boni, M. F. et al. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible
for the COVID-19 pandemic. Nat Microbiol (2020) doi:10.1038/s41564-020-0771-4.
77. Zhang, T., Wu, Q. & Zhang, Z. Probable Pangolin Origin of SARS-CoV-2 Associated with
the COVID-19 Outbreak. Curr. Biol. 30, 1578 (2020).
78. Zhou, H. et al. A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural
Insertions at the S1/S2 Cleavage Site of the Spike Protein. Curr. Biol. 30, 3896 (2020).
79. Wacharapluesadee, S. et al. Evidence for SARS-CoV-2 related coronaviruses circulating in
bats and pangolins in Southeast Asia. Nat. Commun. 12, 972 (2021).
80. Cui, J., Li, F. & Shi, Z.-L. Origin and evolution of pathogenic coronaviruses. Nat. Rev.
Microbiol. 17, 181–192 (2019).
81. Zhu, Z. et al. From SARS and MERS to COVID-19: a brief summary and comparison of
severe acute respiratory infections caused by three highly pathogenic human coronaviruses.
Respir. Res. 21, 224 (2020).
82. Lam, T. T.-Y. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins.
Nature (2020) doi:10.1038/s41586-020-2169-0.
83. Rando, H. M. et al. Pathogenesis, Symptomatology, and Transmission of SARS-CoV-2
through analysis of Viral Genomics and Structure. arXiv [q-bio.QM] (2021).
84. Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen
surveillance system. Nat. Rev. Genet. 19, 9–20 (2018).
85. Wang, D. et al. Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel
Coronavirus–Infected Pneumonia in Wuhan, China. JAMA 323, 1061–1069 (2020).
86. Elbe, S. & Buckland-Merrett, G. Data, disease and diplomacy: GISAID’s innovative
contribution to global health. Glob Chall 1, 33–46 (2017).
87. Kalinich, C. C. et al. Real-time public health communication of local SARS-CoV-2
genomic epidemiology. PLoS Biol. 18, e3000869 (2020).
88. Thanh Le, T. et al. The COVID-19 vaccine development landscape. Nat. Rev. Drug Discov.
19, 305–306 (2020).
89. Amanat, F. & Krammer, F. SARS-CoV-2 Vaccines: Status Report. Immunity 52, 583–589
(2020).
90. Chen, W.-H., Strych, U., Hotez, P. J. & Bottazzi, M. E. The SARS-CoV-2 Vaccine
Pipeline: an Overview. Curr Trop Med Rep 1–4 (2020).
91. Sheridan, C. Coronavirus and the race to distribute reliable diagnostics. Nature
Biotechnology vol. 38 382–384 (2020).
92. Kudo, E. et al. Detection of SARS-CoV-2 RNA by multiplex RT-qPCR. PLoS Biol. 18,
e3000867 (2020).
93. Hadfield, J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34,
4121–4123 (2018).
94. Shu, Y. & McCauley, J. GISAID: Global initiative on sharing all influenza data--from
vision to reality. Eurosurveillance 22, 30494 (2017).
95. An integrated national scale SARS-CoV-2 genomic surveillance network. The Lancet
Microbe (2020) doi:10.1016/S2666-5247(20)30054-9.
96. Fernandes, J. D. et al. The UCSC SARS-CoV-2 Genome Browser. Nat. Genet. 52, 991–998
(2020).
97. Kuiken, C., Korber, B. & Shafer, R. W. HIV sequence databases. AIDS Rev. 5, 52–61
(2003).
98. Foley, B. T. et al. HIV Sequence Compendium 2018. https://www.osti.gov/biblio/1458915
(2018) doi:10.2172/1458915.
99. Inzaule, S. C., Tessema, S. K., Kebede, Y., Ogwell Ouma, A. E. & Nkengasong, J. N.
Genomic-informed pathogen surveillance in Africa: opportunities and challenges. Lancet
Infect. Dis. (2021) doi:10.1016/S1473-3099(20)30939-7.
100. Wu, A. et al. Mutations, Recombination and Insertion in the Evolution of 2019-nCoV.
bioRxiv (2020) doi:10.1101/2020.02.29.971101.
101. Mathew, D. et al. Deep immune profiling of COVID-19 patients reveals distinct
immunotypes with therapeutic implications. Science 369, (2020).
102. Fang, S. et al. GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences.
Nucleic Acids Research (2020) doi:10.1093/nar/gkaa808.
103. Koyama, T., Platt, D. & Parida, L. Variant analysis of SARS-CoV-2 genomes. Bull. World
Health Organ. 98, 495–504 (2020).
104. Zhao, Z. et al. Moderate mutation rate in the SARS coronavirus genome and its
implications. BMC Evol. Biol. 4, 21 (2004).
105. Holmes, E. C., Dudas, G., Rambaut, A. & Andersen, K. G. The evolution of Ebola virus:
Insights from the 2013–2016 epidemic. Nature 538, 193–200 (2016).
106. Sanjuán, R. & Domingo-Calap, P. Mechanisms of viral mutation. Cellular and Molecular
Life Sciences vol. 73 4433–4448 (2016).
107. Di Giorgio, S., Martignano, F., Torcia, M. G., Mattiuz, G. & Conticello, S. G. Evidence for
host-dependent RNA editing in the transcriptome of SARS-CoV-2. Sci Adv 6, eabb5813
(2020).
108. Lofgren, E., Fefferman, N. H., Naumov, Y. N., Gorski, J. & Naumova, E. N. Influenza
seasonality: underlying causes and modeling theories. J. Virol. 81, 5429–5436 (2007).
109. Cuevas, J. M., Geller, R., Garijo, R., López-Aldeguer, J. & Sanjuán, R. Extremely High
Mutation Rate of HIV-1 In Vivo. PLoS Biol. 13, e1002251 (2015).
110. Hoenen, T., Groseth, A., Safronetz, D., Wollenberg, K. & Feldmann, H. Response to
Comment on ‘Mutation rate and genotype variation of Ebola virus from Mali case
sequences’. Science vol. 353 658 (2016).
111. Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E. & Adami, C. Evolution of digital
organisms at high mutation rates leads to survival of the flattest. Nature 412, 331–333
(2001).
112. Shen, Z. et al. Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients.
Clin. Infect. Dis. (2020) doi:10.1093/cid/ciaa203.
113. Moreno, G. K. et al. Limited SARS-CoV-2 diversity within hosts and following passage in
cell culture. bioRxiv 2020.04.20.051011 (2020) doi:10.1101/2020.04.20.051011.
114. Karamitros, T. et al. SARS-CoV-2 exhibits intra-host genomic plasticity and low-frequency
polymorphic quasispecies. bioRxiv 2020.03.27.009480 (2020)
doi:10.1101/2020.03.27.009480.
115. Lythgoe, K. A. et al. Shared SARS-CoV-2 diversity suggests localised transmission of
minority variants. bioRxiv 2020.05.28.118992 (2020) doi:10.1101/2020.05.28.118992.
116. Jary, A. et al. Evolution of viral quasispecies during SARS-CoV-2 infection. Clin.
Microbiol. Infect. (2020) doi:10.1016/j.cmi.2020.07.032.
117. Kuipers, J. et al. Within-patient genetic diversity of SARS-CoV-2. Cold Spring Harbor
Laboratory 2020.10.12.335919 (2020) doi:10.1101/2020.10.12.335919.
118. James, S. E. et al. High Resolution analysis of Transmission Dynamics of Sars-Cov-2 in
Two Major Hospital Outbreaks in South Africa Leveraging Intrahost Diversity. medRxiv
(2020) doi:10.1101/2020.11.15.20231993.
119. Sashittal, P., Luo, Y., Peng, J. & El-Kebir, M. Characterization of SARS-CoV-2 viral
diversity within and across hosts. bioRxiv (2020).
120. Avanzato, V. A. et al. Case Study: Prolonged Infectious SARS-CoV-2 Shedding from an
Asymptomatic Immunocompromised Individual with Cancer. Cell vol. 183 1901–1912.e9
(2020).
121. Choi, B. et al. Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host.
N. Engl. J. Med. 383, 2291–2293 (2020).
122. Kemp, S. A. et al. SARS-CoV-2 evolution during treatment of chronic infection. Nature
(2021) doi:10.1038/s41586-021-03291-y.
123. Geoghegan, J. L. & Holmes, E. C. The phylogenomics of evolving virus virulence. Nat.
Rev. Genet. 19, 756–769 (2018).
124. van Dorp, L. et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-
2. Infect. Genet. Evol. 104351 (2020).
125. Zhang, Y.-Z. & Holmes, E. C. A Genomic Perspective on the Origin and Emergence of
SARS-CoV-2. Cell 181, 223–227 (2020).
126. Korber, B. et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases
Infectivity of the COVID-19 Virus. Cell (2020) doi:10.1016/j.cell.2020.06.043.
127. van Dorp, L. et al. No evidence for increased transmissibility from recurrent mutations in
SARS-CoV-2. doi:10.1101/2020.05.21.108506.
128. Baric, R. S. Emergence of a Highly Fit SARS-CoV-2 Variant. N. Engl. J. Med. (2020)
doi:10.1056/NEJMcibr2032888.
129. Hou, Y. J. et al. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and
transmission in vivo. Science 370, 1464–1468 (2020).
130. Volz, E. et al. Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from
linking epidemiological and genetic data. medRxiv 2020.12.30.20249034 (2021).
131. Mahase, E. Covid-19: What have we learnt about the new variant in the UK? BMJ m4944
(2020) doi:10.1136/bmj.m4944.
132. Fiorentini, S. et al. First detection of SARS-CoV-2 spike protein N501 mutation in Italy in
August, 2020. Lancet Infect. Dis. (2021) doi:10.1016/S1473-3099(21)00007-4.
133. Peacock, T. P. et al. The furin cleavage site of SARS-CoV-2 spike protein is a key
determinant for transmission due to enhanced replication in airway cells. Cold Spring
Harbor Laboratory 2020.09.30.318311 (2020) doi:10.1101/2020.09.30.318311.
134. Chan, K. K., Tan, T. J. C., Narayanan, K. K. & Procko, E. An engineered decoy receptor for
SARS-CoV-2 broadly binds protein S sequence variants. bioRxiv (2020)
doi:10.1101/2020.10.18.344622.
135. Starr, T. N. et al. Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain
Reveals Constraints on Folding and ACE2 Binding. Cell 182, 1295–1310.e20 (2020).
136. Horby, P. et al. NERVTAG note on B.1.1.7 severity. New & Emerging Threats Advisory
Group, Jan 21, (2021).
137. Bal, A. et al. Screening of the H69 and V70 deletions in the SARS-CoV-2 spike protein
with a RT-PCR diagnosis assay reveals low prevalence in Lyon, France. medRxiv
2020.11.10.20228528 (2020).
138. Chand, M. & Others. Investigation of novel SARS-COV-2 variant: Variant of Concern
202012/01 (PDF). Public Health England. PHE (2020).
139. Tegally, H. et al. Emergence and rapid spread of a new severe acute respiratory syndrome-
related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa.
medRxiv (2020).
140. Mwenda, M. et al. Detection of B.1.351 SARS-CoV-2 Variant Strain — Zambia, December
2020. MMWR. Morbidity and Mortality Weekly Report vol. 70 (2021).
141. Toovey, O. T. R., Harvey, K. N., Bird, P. W. & Tang, J. W.-T. Introduction of
Brazilian SARS-CoV-2 484K.V2 related variants into the UK. Journal of Infection (2021)
doi:10.1016/j.jinf.2021.01.025.
142. Naveca, F. et al. Phylogenetic relationship of SARS-CoV-2 sequences from Amazonas with
emerging Brazilian variants harboring mutations E484K and N501Y in the Spike protein.
Virological.org. Available at: https://virological.org/t/phylogenetic-relationship-of-sars-cov-2-sequences-from-amazonas-with-emerging-brazilian-variants-harboring-mutations-e484k-and-n501y-in-the-spike-protein/585 (2021).
143. Faria, N. R. et al. Genomic characterisation of an emergent SARS-CoV-2 lineage in
Manaus: preliminary findings. (2021).
144. Greaney, A. J. et al. Comprehensive mapping of mutations to the SARS-CoV-2 receptor-
binding domain that affect recognition by polyclonal human serum antibodies. Cold Spring
Harbor Laboratory 2020.12.31.425021 (2021) doi:10.1101/2020.12.31.425021.
145. CDC. Cases, Data, and Surveillance. https://www.cdc.gov/coronavirus/2019-ncov/cases-
updates/variant-surveillance/variant-info.html (2021).
146. Maxmen, A. Massive Google-funded COVID database will track variants and immunity.
Nature (2021) doi:10.1038/d41586-021-00490-5.
147. Blair, C. & Ané, C. Phylogenetic Trees and Networks Can Serve as Powerful and
Complementary Approaches for Analysis of Genomic Data. Syst. Biol. 69, 593–601 (2020).
148. Martin, M. A., VanInsberghe, D. & Koelle, K. Insights from SARS-CoV-2 sequences.
Science 371, 466–467 (2021).
149. Fauver, J. R. et al. Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the
United States. Cell (2020) doi:10.1016/j.cell.2020.04.021.
150. Gámbaro, F. et al. Introductions and early spread of SARS-CoV-2 in France.
doi:10.1101/2020.04.24.059576.
151. Miller, D. et al. Full genome viral sequences inform patterns of SARS-CoV-2 spread into
and within Israel. medRxiv (2020).
152. Thielen, P. M. et al. Genomic Diversity of SARS-CoV-2 During Early Introduction into the
United States National Capital Region. medRxiv (2020) doi:10.1101/2020.08.13.20174136.
153. McNamara, R. P. et al. High-Density Amplicon Sequencing Identifies Community Spread
and Ongoing Evolution of SARS-CoV-2 in the Southern United States. Cell Rep. 33,
108352 (2020).
154. Nadeau, S. A., Vaughan, T. G., Scire, J., Huisman, J. S. & Stadler, T. The origin and early
spread of SARS-CoV-2 in Europe. Proc. Natl. Acad. Sci. U. S. A. 118, (2021).
155. Worobey, M. et al. The emergence of SARS-CoV-2 in Europe and North America. Science
(2020) doi:10.1126/science.abc8169.
156. Ladner, J. T. et al. An Early Pandemic Analysis of SARS-CoV-2 Population Structure and
Dynamics in Arizona. MBio 11, (2020).
157. Gonzalez-Reiche, A. S. et al. Introductions and early spread of SARS-CoV-2 in the New
York City area. Science (2020) doi:10.1126/science.abc1917.
158. Candido, D. D. S. et al. Routes for COVID-19 importation in Brazil. J. Travel Med. 27,
(2020).
159. Badaoui, B., Sadki, K., Talbi, C., Driss, S. & Tazi, L. Genetic Diversity and Genomic
Epidemiology of SARS-COV-2 in Morocco. doi:10.1101/2020.06.23.165902.
160. Rockett, R. J. et al. Revealing COVID-19 transmission in Australia by SARS-CoV-2
genome sequencing and agent-based modeling. Nat. Med. 26, 1398–1404 (2020).
161. Kupferschmidt, K. Fast-spreading U.K. virus variant raises alarms. Science 371, 9–10
(2021).
162. Washington, N. L. et al. Genomic epidemiology identifies emergence and rapid
transmission of SARS-CoV-2 B.1.1.7 in the United States. medRxiv (2021)
doi:10.1101/2021.02.06.21251159.
163. Nepomuceno, M. R. et al. Besides population age structure, health and other demographic
factors can contribute to understanding the COVID-19 burden. Proceedings of the National
Academy of Sciences of the United States of America vol. 117 13881–13883 (2020).
164. Biswas, S. K. & Mudi, S. R. Genetic variation in SARS-CoV-2 may explain variable
severity of COVID-19. Med. Hypotheses 143, 109877 (2020).
165. Pachetti, M. et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-
dependent-RNA polymerase variant. J. Transl. Med. 18, 179 (2020).
166. Rausch, J. W., Capoferri, A. A., Katusiime, M. G., Patro, S. C. & Kearney, M. F. Low
genetic diversity may be an Achilles heel of SARS-CoV-2. Proceedings of the National
Academy of Sciences 202017726 (2020) doi:10.1073/pnas.2017726117.
167. Fahrenfeld, N. & Bisceglia, K. J. Emerging investigators series: sewer surveillance for
monitoring antibiotic use and prevalence of antibiotic resistance: urban sewer
epidemiology. Environmental Science: Water Research & Technology vol. 2 788–799
(2016).
168. Castiglioni, S., Senta, I., Borsotti, A., Davoli, E. & Zuccato, E. A novel approach for
monitoring tobacco use in local communities by wastewater analysis. Tobacco Control vol.
24 38–42 (2015).
169. Sims, N. & Kasprzyk-Hordern, B. Future perspectives of wastewater-based epidemiology:
Monitoring infectious disease spread and resistance to the community level. Environ. Int.
139, 105689 (2020).
170. Hellmér, M. et al. Detection of pathogenic viruses in sewage provided early warnings of
hepatitis A virus and norovirus outbreaks. Appl. Environ. Microbiol. 80, 6771–6781 (2014).
171. Chen, Y. et al. The presence of SARS-CoV-2 RNA in the feces of COVID-19 patients. J.
Med. Virol. 92, 833–840 (2020).
172. Peccia, J. et al. Measurement of SARS-CoV-2 RNA in wastewater tracks community
infection dynamics. Nat. Biotechnol. (2020) doi:10.1038/s41587-020-0684-z.
173. Farkas, K., Hillary, L. S., Malham, S. K., McDonald, J. E. & Jones, D. L. Wastewater and
public health: the potential of wastewater surveillance for monitoring COVID-19. Current
Opinion in Environmental Science & Health vol. 17 14–20 (2020).
174. Adriaenssens, E. M. et al. Viromic Analysis of Wastewater Input to a River Catchment
Reveals a Diverse Assemblage of RNA Viruses. mSystems 3, (2018).
175. Carbo, E. C. et al. Coronavirus discovery by metagenomic sequencing: a tool for pandemic
preparedness. J. Clin. Virol. 131, 104594 (2020).
176. Bedford, J. et al. A new twenty-first century science for effective epidemic response.
Nature 575, 130–136 (2019).
177. Nieuwenhuijse, D. F. et al. Setting a baseline for global urban virome surveillance in
sewage. Sci. Rep. 10, 13748 (2020).
178. Crits-Christoph, A. et al. Genome sequencing of sewage detects regionally prevalent SARS-
CoV-2 variants. doi:10.1101/2020.09.13.20193805.
179. Randazzo, W., Cuevas-Ferrando, E., Sanjuán, R., Domingo-Calap, P. & Sánchez, G.
Metropolitan Wastewater Analysis for COVID-19 Epidemiological Surveillance. SSRN
Electronic Journal doi:10.2139/ssrn.3586696.
180. Wurtzer, S. et al. Evaluation of lockdown impact on SARS-CoV-2 dynamics through viral
genome quantification in Paris wastewaters. doi:10.1101/2020.04.12.20062679.
181. Wu, F. et al. SARS-CoV-2 Titers in Wastewater Are Higher than Expected from Clinically
Confirmed Cases. mSystems 5, (2020).
182. Weidhaas, J. et al. Correlation of SARS-CoV-2 RNA in wastewater with COVID-19
disease burden in sewersheds. (2020).
183. Medema, G., Heijnen, L., Elsinga, G., Italiaander, R. & Brouwer, A. Presence of SARS-
Coronavirus-2 RNA in Sewage and Correlation with Reported COVID-19 Prevalence in the
Early Stage of the Epidemic in The Netherlands. Environ. Sci. Technol. Lett. 7, 511–516
(2020).
184. Ahmed, W. et al. First confirmed detection of SARS-CoV-2 in untreated wastewater in
Australia: A proof of concept for the wastewater surveillance of COVID-19 in the
community. Sci. Total Environ. 728, 138764 (2020).
185. Gonzalez, R. et al. COVID-19 surveillance in Southeastern Virginia using wastewater-
based epidemiology. Water Res. 186, 116296 (2020).
186. Medema, G., Heijnen, L., Elsinga, G., Italiaander, R. & Brouwer, A. Presence of SARS-
Coronavirus-2 in sewage. doi:10.1101/2020.03.29.20045880.
187. Wu, F. et al. SARS-CoV-2 titers in wastewater foreshadow dynamics and clinical
presentation of new COVID-19 cases. doi:10.1101/2020.06.15.20117747.
188. Karthikeyan, S. et al. High throughput wastewater SARS-CoV-2 detection enables
forecasting of community infection dynamics in San Diego county.
doi:10.1101/2020.11.16.20232900.
189. Larsen, D. A. & Wigginton, K. R. Tracking COVID-19 with wastewater. Nat. Biotechnol.
(2020) doi:10.1038/s41587-020-0690-1.
190. Schmidt, C. Watcher in the wastewater. Nat. Biotechnol. 38, 917–920 (2020).
191. Izquierdo Lara, R. W. et al. Monitoring SARS-CoV-2 circulation and diversity through
community wastewater sequencing. Public and Global Health (2020)
doi:10.1101/2020.09.21.20198838.
192. Jahn, K. et al. Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of
wastewater samples. medRxiv (2021).
193. Bogler, A. et al. Rethinking wastewater risks and monitoring in light of the COVID-19
pandemic. Nature Sustainability (2020) doi:10.1038/s41893-020-00605-2.
194. Burioni, R. & Topol, E. J. Assessing the human immune response to SARS-CoV-2 variants.
Nat. Med. (2021) doi:10.1038/s41591-021-01290-0.
195. Zhang, X. et al. Viral and host factors related to the clinical outcome of COVID-19. Nature
583, 437–440 (2020).
196. Young, B. E. et al. Effects of a major deletion in the SARS-CoV-2 genome on the severity
of infection and the inflammatory response: an observational cohort study. Lancet (2020)
doi:10.1016/S0140-6736(20)31757-8.
197. Zhang, L. et al. The D614G mutation in the SARS-CoV-2 spike protein reduces S1
shedding and increases infectivity. bioRxiv (2020) doi:10.1101/2020.06.12.148726.
198. Grubaugh, N. D., Hodcroft, E. B., Fauver, J. R., Phelan, A. L. & Cevik, M. Public health
actions to control new SARS-CoV-2 variants. Cell (2021) doi:10.1016/j.cell.2021.01.044.
199. Mashe, T. et al. Surveillance of SARS-CoV-2 in Zimbabwe shows dominance of variants of
concern. Lancet Microbe (2021) doi:10.1016/S2666-5247(21)00061-6.
200. Dearlove, B. et al. A SARS-CoV-2 vaccine candidate would likely match all currently
circulating variants. Proc. Natl. Acad. Sci. U. S. A. (2020) doi:10.1073/pnas.2008281117.
201. Li, Q. et al. The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and
Antigenicity. Cell 182, 1284–1294.e9 (2020).
202. Walls, A. C. et al. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike
Glycoprotein. Cell 181, 281–292.e6 (2020).
203. Portelli, S. et al. Exploring the structural distribution of genetic variation in SARS-CoV-2
with the COVID-3D online resource. Nat. Genet. (2020) doi:10.1038/s41588-020-0693-3.
204. To, K. K.-W. et al. Coronavirus Disease 2019 (COVID-19) Re-infection by a
Phylogenetically Distinct Severe Acute Respiratory Syndrome Coronavirus 2 Strain
Confirmed by Whole Genome Sequencing. Clinical Infectious Diseases (2020)
doi:10.1093/cid/ciaa1275.
205. Van Elslande, J. et al. Symptomatic SARS-CoV-2 reinfection by a phylogenetically distinct
strain. Clin. Infect. Dis. (2020) doi:10.1093/cid/ciaa1330.
206. Prado-Vivar, B., Becerra-Wong, M., Guadalupe, J. J. & Others. COVID-19 re-infection by
a phylogenetically distinct SARS-CoV-2 variant, first confirmed event in South America.
SSRN 2020; published online Sept 8.
207. Iwasaki, A. What reinfections mean for COVID-19. The Lancet infectious diseases vol. 21
3–5 (2021).
208. Abu-Raddad, L. J. et al. Assessment of the risk of SARS-CoV-2 reinfection in an intense
re-exposure setting. bioRxiv (2020) doi:10.1101/2020.08.24.20179457.
209. Grant, M. C. et al. The prevalence of symptoms in 24,410 adults infected by the novel
coronavirus (SARS-CoV-2; COVID-19): A systematic review and meta-analysis of 148
studies from 9 countries. PLoS One 15, e0234765 (2020).
210. Fu, L. et al. Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: A
systematic review and meta-analysis. Journal of Infection vol. 80 656–665 (2020).
211. Gavriatopoulou, M. et al. Organ-specific manifestations of COVID-19 infection. Clin. Exp.
Med. 20, 493–506 (2020).
212. Channappanavar, R., Zhao, J. & Perlman, S. T cell-mediated immune response to
respiratory coronaviruses. Immunologic Research vol. 59 118–128 (2014).
213. Schultheiß, C. et al. Next-Generation Sequencing of T and B Cell Receptor Repertoires
from COVID-19 Patients Showed Signatures Associated with Severity of Disease.
Immunity vol. 53 442–455.e4 (2020).
214. Williamson, E. J. et al. Factors associated with COVID-19-related death using
OpenSAFELY. Nature (2020) doi:10.1038/s41586-020-2521-4.
215. Haendel, M., Chute, C. & Gersing, K. The National COVID Cohort Collaborative (N3C):
Rationale, Design, Infrastructure, and Deployment. J. Am. Med. Inform. Assoc. (2020)
doi:10.1093/jamia/ocaa196.
216. Meredith, L. W. et al. Rapid implementation of SARS-CoV-2 sequencing to investigate
cases of health-care associated COVID-19: a prospective genomic surveillance study. The
Lancet Infectious Diseases (2020) doi:10.1016/s1473-3099(20)30562-4.
217. Van Noorden, R. Scientists call for fully open sharing of coronavirus genome data. Nature
590, 195–196 (2021).
218. Hodcroft, E. B. et al. Want to track pandemic variants faster? Fix the bioinformatics
bottleneck. Nature 591, 30–33 (2021).
219. Maxmen, A. One million coronavirus sequences: popular genome site hits mega milestone.
Nature (2021) doi:10.1038/d41586-021-01069-w.
220. Status of environmental surveillance for SARS-CoV-2 virus. https://www.who.int/news-
room/commentaries/detail/status-of-environmental-surveillance-for-sars-cov-2-virus.
221. Watson, C. How countries are using genomics to help avoid a second coronavirus wave.
Nature 582, 19 (2020).
222. Deng, X. et al. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into
Northern California. Science 369, 582–587 (2020).
223. Chaguza, C., Nyaga, M. M., Mwenda, J. M., Esona, M. D. & Jere, K. C. Using genomics to
improve preparedness and response of future epidemics or pandemics in Africa. The Lancet
Microbe vol. 1 e275–e276 (2020).
224. Robert, A. Lessons from New Zealand’s COVID-19 outbreak response. The Lancet. Public
health vol. 5 e569–e570 (2020).
225. Oude Munnink, B. B. et al. Rapid SARS-CoV-2 whole-genome sequencing and analysis for
informed public health decision-making in the Netherlands. Nat. Med. 26, 1405–1410
(2020).
226. Deshpande, D. et al. RNA-seq data science: From raw data to effective interpretation. arXiv
[q-bio.GN] (2020)
227. Knyazev, S. et al. Unlocking capacities of viral genomics for the COVID-19 pandemic
response. ArXiv (2021).
S.4 Table. Landscape of current computational tools for RNA-seq analysis
Type of URL: 1 = web services designed to host source code; 2 = others (e.g., personal and/or university web services).
Tool | Year published | Notable features | Programming language | Package manager | Required expertise | Software (URL) | Type of URL
a. Data quality control
iSeqQC [1] | 2020 | Expression-based raw data QC tool that detects outliers | R | N/A | ++ | https://github.com/gkumar09/iSeqQC | 1
qsmooth [2] | 2018 | Adaptive smooth quantile normalization | R | Bioconductor | ++ | http://bioconductor.org/packages/release/bioc/html/qsmooth.html | 1
FastQC [3] | 2018 | Raw data QC tool for high-throughput sequence data | Java | Anaconda | ++ | https://github.com/s-andrews/fastqc/ | 1
QC3 [4] | 2014 | Raw data QC tool detecting batch effect and cross contamination | Perl, R | Anaconda | ++ | https://github.com/slzhao/QC3 | 1
kPAL [5] | 2014 | Alignment-free raw data QC tool that assesses data by analyzing k-mer frequencies | Python | Anaconda | ++ | https://github.com/LUMC/kPAL | 1
HTQC [6] | 2013 | Raw data QC read assessment and filtration | C++ | N/A | +++ | https://sourceforge.net/projects/htqc/ | 1
Trimmomatic [7] | 2014 | Trimming of reads and removal of adapters | Java | Anaconda | ++ | http://www.usadellab.org/cms/index.php?page=trimmomatic | 2
Skewer [8] | 2014 | Adapter trimming of reads | C++ | Anaconda | ++ | https://sourceforge.net/projects/skewer | 1
Flexbar [9] | 2012 | Trimming of reads and adaptor removal | C++ | Anaconda | ++ | https://github.com/seqan/flexbar | 1
QuaCRS [10] | 2014 | Post QC tool that performs meta-analyses on QC metrics across large numbers of samples | Python | N/A | +++ | https://github.com/kwkroll32/QuaCRS | 1
BlackOPs [11] | 2013 | Post QC tool that simulates experimental RNA-seq derived from the reference genome, aligns these sequences, and outputs a blacklist of positions and alleles caused by mismapping | Perl | N/A | +++ | https://sourceforge.net/projects/rnaseqvariantbl/ | 1
RSeQC [12] | 2012 | Post QC evaluation of different aspects of RNA-seq experiments, such as sequence quality, GC bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure | Python, C | Anaconda | ++ | http://rseqc.sourceforge.net/ | 1
RNA-SeQC [13] | 2012 | RNA-seq metrics for post-quality control and process optimization | Java | Anaconda | ++ | https://software.broadinstitute.org/cancer/cga/rna-seqc | 2
Seqbias [14] | 2012 | Post QC tool using a graphical model to increase accuracy of de novo gene annotation, uniformity of read coverage, consistency of nucleotide frequencies and agreement with qRT-PCR | R | Anaconda, Bioconductor | ++ | http://master.bioconductor.org/packages/devel/bioc/html/seqbias.html | 1
SAMStat [15] | 2011 | Post QC tool which plots nucleotide overrepresentation and other statistics in mapped and unmapped reads in an HTML page | C | N/A | +++ | http://samstat.sourceforge.net | 1
Samtools [16] | 2009 | Post QC tool using a generic alignment format for storing read alignments against reference sequences and to visualize the Binary/Alignment Map (BAM) | C, Perl | Anaconda | + | https://github.com/samtools/samtools | 1
b. Read alignment
deSALT [17] | 2019 | Long transcriptomic read alignment with de Bruijn graph-based index | C | Anaconda | ++ | https://github.com/ydLiu-HIT/deSALT | 1
Magic-BLAST [18] | 2018 | Aligner for long and short reads through optimization of a spliced alignment score | C++ | N/A | +++ | https://ncbi.github.io/magicblast/ | 1
Minimap2 [19] | 2018 | Alignment using seed chain alignment procedure | C, Python | Anaconda | ++ | https://github.com/lh3/minimap2 | 1
DART [20] | 2018 | Burrows-Wheeler Transform based aligner which adopts partitioning strategy to divide a read into two groups | C/C++ | Anaconda | ++ | https://github.com/hsinnan75/DART | 1
MMR [21] | 2016 | Resolves the mapping location of multi-mapping reads, optimising for locally smooth coverage | C++ | N/A | +++ | https://github.com/ratschlab/mmr | 1
ContextMap 2 [22] | 2015 | Allows parallel mapping against several reference genomes | Java | N/A | +++ | http://www.bio.ifi.lmu.de/ContextMap | 2
HISAT [23] | 2015 | Aligning reads using an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index | C++ | Anaconda | ++ | http://www.ccb.jhu.edu/software/hisat/index.shtml | 2
Segemehl [24] | 2014 | Multi-split mapping for circular RNA, trans-splicing, and fusion events in addition to performing splice alignment | C, C++, Perl, Python, Shell (Bash) | Anaconda | ++ | http://www.bioinf.uni-leipzig.de/Software/segemehl/ | 2
JAGuaR [25] | 2014 | Uses a modified GTF (Gene Transfer Format) of known splice sites to build the complete sequence from all reads mapped to the transcript | Python | N/A | +++ | https://www.bcgsc.ca/resources/software/jaguar | 2
CRAC [26] | 2013 | Uses double K-mer indexing and profiling approach to map reads, predict SNPs, gene fusions, repeat borders | C++ | Anaconda | ++ | http://crac.gforge.inria.fr/ | 2
STAR [27] | 2013 | Aligns long reads against genome reference database | C++ | Anaconda | ++ | https://github.com/alexdobin/STAR | 1
Subread [28] | 2013 | Mapping reads to a reference genome using multi-seed strategy, called seed-and-vote | C, R | Anaconda, Bioconductor | ++ | https://bioconductor.org/packages/release/bioc/html/Rsubread.html | 1
TopHat2 [29] | 2013 | Alignment of transcriptomes in the presence of insertions, deletions and gene fusions | C++, Python | Anaconda | ++ | http://ccb.jhu.edu/software/tophat/index.shtml | 2
OSA [30] | 2012 | K-mer profiling approach to map reads | C# | N/A | +++ | http://www.arrayserver.com/wiki/index.php?title=OSA | 2
PASSion [31] | 2012 | Pattern growth pipeline for splice junction detection | C++, Perl, Shell (Bash) | N/A | +++ | https://trac.nbic.nl/passion/ | 2
RUM [32] | 2011 | Comparative analysis of RNA-seq alignment algorithms and the RNA-seq unified mapper | Perl, Python | N/A | +++ | http://www.cbil.upenn.edu/RUM/ | 2
SOAPSplice [33] | 2011 | Ab initio detection of splice junctions | Perl | Anaconda | ++ | http://soap.genomics.org.cn/soapsplice.html | 2
MapSplice [34] | 2010 | De novo detection of splice junctions | C++ | Anaconda | ++ | https://github.com/LiuBioinfo/MapSplice | 1
SpliceMap [35] | 2010 | De novo detection of splice junctions and RNA-seq alignment | C++ | Anaconda | ++ | http://web.stanford.edu/group/wonglab/SpliceMap/ | 2
Supersplat [36] | 2010 | De novo detection of splice junctions | C++ | N/A | +++ | http://mocklerlab.org/tools/1/manual | 2
HMMSplicer [37] | 2010 | Detection of splice junctions of short sequence reads | Python | N/A | +++ | http://derisilab.ucsf.edu/software/hmmsplicer | 2
QPALMA [38] | 2008 | Spliced alignments of short sequence reads | C++, Python | N/A | +++ | http://www.raetschlab.org/suppl/qpalma | 2
c. Gene annotations
SQANTI [39] | 2018 | Analyses quality of long reads transcriptomes and removes artefacts | Python | Anaconda | ++ | https://github.com/ConesaLab/SQANTI | 1
Annocript [40] | 2015 | Databases are downloaded to annotate protein coding transcripts with the prediction of putative long non-coding RNAs in whole transcriptomes | Perl, Python, R | N/A | +++ | https://github.com/frankMusacchia/Annocript | 1
CIRI [41] | 2015 | De novo circular RNA identification | Perl | N/A | +++ | https://sourceforge.net/projects/ciri/ | 1
TSSAR [42] | 2014 | Automated de novo TSS annotation from differential RNA-seq data | Java, Perl, R | Anaconda | ++ | http://rna.tbi.univie.ac.at/TSSAR | 2
d. Transcriptome assembly
FLAIR
43
2020
Full-length
alternative
isoform analysis
of RNA
Python Anaconda ++ https://github.com/Broo
ksLabUCSC/FLAIR
1
Scallop
44
2017
Splice-graph-
decomposition
algorithm which
optimizes two
competing
objectives while
satisfying all
phasing
constraints posed
by reads spanning
multiple vertices
C++ Anaconda +++ https://github.com/King
sford-Group/scallop
1
CLASS2
45
2016
Splice variant
annotation
C++,
Perl,
Shell
Anaconda ++ https://sourceforge.net/
projects/splicebox/
1
StringTie
46
2015
Applies a
network flow
algorithm
originally
developed in
optimization
theory, together
with optional de
novo assembly, to
assemble
transcripts
C++ N/A +++ http://ccb.jhu.edu/softw
are/stringtie
2
Bridger
47
2015
De novo
transcript
assembler using a
mathematical
model, called the
minimum path
cover
C++, Perl N/A +++ https://sourceforge.net/
projects/rnaseqassembl
y/files/?source=navbar
1
Bayesembler
48
2014
Reference
genome guided
transcriptome
assembly built on
a Bayesian model
C++ N/A +++ https://github.com/bioin
formatics-
centre/bayesembler.
1
SEECER
49
2013
De novo
transcriptome
assembly using
hidden Markov
Model (HMM)
based method
C++ N/A +++ http://sb.cs.cmu.edu/see
cer/
2
BRANCH
50
2013
De novo
transcriptome
assemblies by
using genomic
information that
can be partial or
complete genome
sequences from
the same or a
related organism.
C++ N/A +++ https://github.com/baoe
/BRANCH
1
EBARDenovo
51
2013
De novo
transcriptome
assembly uses an
efficient chimera-
detection function
C# N/A +++ https://sourceforge.net/
projects/ebardenovo/
1
Oases
52
2012
De novo
transcriptome
assembly using k-
mer profiling and
building a de
Bruijn graph
C N/A +++ https://github.com/dzer
bino/oases/tree/master
1
Cufflinks
53
2012
Ab initio
transcript
assembly,
estimates their
abundances, and
tests for
differential
expression
C++ N/A +++ https://github.com/cole-
trapnell-lab/cufflinks
1
IsoInfer
54
2011
Infer isoforms
from short reads
C/C++ N/A +++ http://www.cs.ucr.edu/
~jianxing/IsoInfer.html
2
IsoLasso
55
2011
Reference
genome guided
using LASSO
regression
approach
C++ N/A +++ http://alumni.cs.ucr.edu
/~liw/isolasso.html
2
Trinity
56
2011
De novo
transcriptome
assembly
C++,
Java,
Perl, R,
Shell
(Bash)
Anaconda ++ https://github.com/trinit
yrnaseq/trinityrnaseq/w
iki
1
Trans-ABySS
57
2010
De novo short-
read
transcriptome
assembly and can
also be used for
fusion detection
Python N/A +++ https://github.com/bcgs
c/transabyss
1
Scripture
58
2010
Ab initio
reconstruction of
transcriptomes of
pluripotent and
lineage
committed cells
Java N/A +++ www.broadinstitute.org
/software/Scripture/
2
e. Transcriptome quantification
TALON
59
2019
Long-read
transcriptome
discovery and
quantification
Python N/A +++ https://github.com/dew
yman/TALON
1
Salmon
60
2017
Composed of:
lightweight-
mapping model,
an online phase
that estimates
initial expression
levels and model
parameters, and
an offline phase
that refines
expression
estimates models,
and measures
sequence-
specific, fragment
GC, and
positional biases
C++ Anaconda ++ https://github.com/CO
MBINE-lab/Salmon
1
Kallisto
61
2016
K-mer based
pseudoalignment
for alignment-free
transcript and
gene expression
quantification
C, C++,
Perl
Anaconda ++ https://github.com/pach
terlab/kallisto
1
Wub
62
2016
Sequence and
error simulation
tool to calculate
read and genome
assembly
accuracy.
Python Anaconda ++ https://github.com/nano
poretech/wub
1
Rcount
63
2015
GUI based tool
used for
quantification
using counts per
feature
Web
based
tool
N/A + https://github.com/MW
Schmid/Rcount
1
Ht-seq
64
2015
Calculates gene
counts by
counting number
of reads
overlapping
genes
Python pip ++ https://htseq.readthedoc
s.io/en/release_0.11.1/o
verview.html
2
EMSAR
65
2015
Estimation by
mappability-
based
segmentation and
reclustering using
a joint Poisson
model
C N/A +++ https://github.com/parkl
ab/emsar
1
Maxcounts
66
2014
Quantify the
expression
assigned to an
exon as the
maximum of its
per-base counts
C++ N/A +++ http://sysbiobig.dei.unip
d.it/?q=Software#MAX
COUNTS
2
FIXSEQ
67
2014
A nonparametric
and universal
method for
processing per-
base sequencing
read count data.
R N/A ++ https://bitbucket.org/tha
shim/fixseq/src/master/
1
Sailfish
68
2014
EM based
quantification
using statistical
coupling between
k-mers.
C, C++ Anaconda ++ https://github.com/king
sfordgroup/sailfish
1
Casper
69
2014
Bayesian
modeling
framework to
quantify
alternative
splicing.
R Anaconda,
Bioconductor
++ http://www.bioconduct
or.org/packages/release
/bioc/html/casper.html
1
MaLTA
70
2014
Simultaneous
transcriptome
assembly and
quantification
from Ion Torrent
RNA-Seq data.
C++ N/A +++ http://alan.cs.gsu.edu/N
GS/?q=malta
2
Featurecounts
71
2014
Read
summarization
program for
counting reads
generated.
R N/A ++ http://subread.sourcefor
ge.net/
1
MITIE
72
2013
Transcript
reconstruction
and assembly
from RNA-Seq
data using mixed
integer
optimisation.
MATLAB, C++
N/A +++ https://github.com/ratsc
hlab/MiTie
1
iReckon
73
2013
EM-based
method to
accurately
estimate the
abundances of
known and novel
isoforms.
Java N/A +++ http://compbio.cs.toront
o.edu/ireckon/
2
eXpress
74
2013
Online EM based
algorithm for
quantification
which considers
one read at a
time.
C++,
Shell
(Bash)
Anaconda ++ https://pachterlab.githu
b.io/eXpress/manual.ht
ml
1
BitSeq
75
2012
Bayesian
transcript
expression
quantification and
differential
expression.
C++, R Anaconda, Bioconductor ++ http://bitseq.github.io/
1
IQSeq
76
2012
Integrated
isoform
quantification
analysis.
C++ N/A +++ http://archive.gersteinla
b.org/proj/rnaseq/IQSe
q/
2
CEM
77
2012
Statistical
framework for
both
transcriptome
assembly and
isoform
expression level
estimation.
Python N/A +++ http://alumni.cs.ucr.edu
/~liw/cem.html
2
SAMMate
78
2011
Analysis of
differential gene
and isoform
expression.
Java N/A +++ http://sammate.sourcefo
rge.net/
1
Isoformex
79
2011
Estimation
method to
estimate the
expression levels
of transcript
isoforms.
N/A N/A N/A http://bioinformatics.wi
star.upenn.edu/isoforme
x
2
IsoEM
80
2011
EM based method
for inference of
isoform and gene-
specific
expression levels
Java N/A +++ http://dna.engr.uconn.e
du/software/IsoEM/.
2
RSEM
81
2011
Ab initio EM
based method for
inference of
isoform and gene-
specific
expression levels
C++,
Perl,
Python, R
Anaconda ++ https://github.com/dew
eylab/RSEM
1
EDASeq
82
2011
Within-lane GC-
content
normalization,
between-sample
normalization,
visualization.
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/devel/bioc/ht
ml/EDASeq.html
1
MMSEQ
83
2011
Haplotype and
isoform specific
expression
estimation
C++, R,
Ruby,
Shell
(Bash)
N/A ++ https://github.com/eturr
o/mmseq
1
MISO
84
2010
Statistical model
that estimates
expression of
alternatively
spliced exons and
isoforms
C, Python N/A +++ https://miso.readthedoc
s.io/en/fastmiso/#latest-
version-from-github
2
SOLAS
85
2010
Prediction of
alternative
isoforms from
exon expression
levels
R N/A ++ http://cmb.molgen.mpg.
de/2ndGenerationSeque
ncing/Solas/
2
Rseq
86
2009
Statistical
inferences for
isoform
expression
C++ N/A +++ http://www-
personal.umich.edu/~jia
nghui/rseq/#download
2
rQuant
87
2009
Estimating
density biases and
considering the
read coverages at
each nucleotide
independently
using quadratic
programming
Matlab,
Shell
(Bash),
Javascript
N/A +++ https://galaxy.inf.ethz.c
h/?tool_id=rquantweb&
version=2.2&__identife
r=3iuqb8nb3wf
2
ERANGE
88
2008
Mapping and
quantifying
mammalian
transcripts
Python N/A +++ http://woldlab.caltech.e
du/rnaseq
2
f. Differential expression
Swish
89
2019
Non-parametric
model for
differential
expression
analysis using
inferential
replicate counts
R Bioconductor ++ https://bioconductor.org
/packages/release/bioc/
html/fishpond.html
1
Yanagi
90
2019
Transcriptome
segment analysis
Python/C++
N/A +++ https://github.com/HCB
ravoLab/yanagi
1
Whippet
91
2018
Quantification of
transcriptome
structure and
gene expression
analysis using
EM.
Julia N/A +++
https://github.com/timbitz/
Whippet.jl
1
ReQTL
92
2018
Identifies
correlations
between SNVs
and gene
expression from
RNA-seq data
R N/A ++ https://github.com/Horv
athLab/ReQTL
1
vast-tools
93
2017
Profiling and
comparing
alternative
splicing events in
RNA-Seq data
and for
downstream
analyses of
alternative
splicing.
R, Perl N/A +++
https://github.com/vastgro
up/vast-tools
1
Ballgown
94
2015
Linear model–
based differential
expression
analyses
R Anaconda,
Bioconductor
++ https://github.com/alyss
afrazee/ballgown
1
Limma/Voom
95
2014
Linear model-
based differential
expression and
differential
splicing analyses
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/limma.html
1
rMATS
96
2014
Detect major
differential
alternative
splicing types in
RNA-seq data
with replicates.
Python,
C++
Anaconda ++
http://rnaseq-
mats.sourceforge.net/rma
ts3.2.5/
1
DESeq2
97
2014
Differential
analysis of count
data, using
shrinkage
estimation for
dispersions and
fold changes
R Bioconductor,
CRAN
++ https://bioconductor.org
/packages/release/bioc/
html/DESeq2.html
1
Corset
98
2014
Differential gene
expression
analysis for de
novo assembled
transcriptomes
C++ Anaconda ++ https://github.com/Oshl
ack/Corset/wiki
1
BADGE
99
2014
Bayesian model
for accurate
abundance
quantification and
differential
analysis
Matlab N/A +++ http://www.cbil.ece.vt.e
du/software.htm
2
compcodeR
100
2014
Benchmarking of
differential
expression
analysis methods
R Anaconda,
Bioconductor
++
https://www.bioconduct
or.org/packages/compc
odeR/
1
metaRNASeq
101
2014
Differential meta-
analyses of RNA-
seq data
R Anaconda, CRAN ++ http://cran.r-
project.org/web/packag
es/metaRNASeq
1
Characteristic
Direction
102
2014
Geometrical
multivariate
approach to
identify
differentially
expressed genes
R,
Python,
MATLAB
N/A ++ http://www.maayanlab.
net/CD
2
HTSFilter
103
2013
Filter-replicated
high-throughput
transcriptome
sequencing data
R Anaconda,
Bioconductor
++ http://www.bioconduct
or.org/packages/release
/bioc/html/HTSFilter.ht
ml
1
NPEBSeq
104
2013
Nonparametric
empirical
Bayesian-based
procedure for
differential
expression
analysis
R N/A ++ http://bioinformatics.wi
star.upenn.edu/NPEBse
q
2
EBSeq
105
2013
Identifying
differentially
expressed
isoforms.
R Anaconda,
Bioconductor
++ http://bioconductor.org/
packages/release/bioc/h
tml/EBSeq.html
1
sSeq
106
2013
Shrinkage
estimation of
dispersion in
Negative
Binomial models
R Anaconda,
Bioconductor
++ http://bioconductor.org/
packages/release/bioc/h
tml/sSeq.html
1
Cuffdiff2
107
2013
Differential
analysis at
transcript
resolution
C++,
Python
N/A +++ http://cole-trapnell-
lab.github.io/cufflinks/c
uffdiff/
1
SAMseq
108
2013
Nonparametric
method with
resampling to
account for the
different
sequencing
depths
R CRAN ++ https://rdrr.io/cran/samr
/man/SAMseq.html
1
DSGseq
109
2013
NB-statistic
method that can
detect
differentially
spliced genes
between two
groups of samples
without using a
prior knowledge
on the annotation
of alternative
splicing.
R N/A ++ http://bioinfo.au.tsinghu
a.edu.cn/software/DSG
seq/
2
NOISeq
110
2011
Uses a non-
parametric
approach for
differential
expression
analysis and can
work in absence
of replicates
R Bioconductor ++ https://bioconductor.org
/packages/release/bioc/
html/NOISeq.html
1
EdgeR
111
2010
Examining
differential
expression of
replicated count
data and
differential exon
usage
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/edgeR.html
1
DEGseq
112
2010
Identify
differentially
expressed genes
or isoforms for
RNA-seq data
from different
samples.
R Bioconductor ++ http://bioconductor.org/
packages/release/bioc/h
tml/DEGseq.html
1
g. RNA splicing
LeafCutter
113
2018
Detects
differential
splicing and maps
quantitative trait
loci (sQTLs).
R Anaconda ++ https://github.com/davi
daknowles/leafcutter.git
1
MAJIQ-SPEL
114
2018
Visualization,
interpretation,
and experimental
validation of both
classical and
complex splicing
variation and
automated RT-
PCR primer
design.
C++,
Python
N/A +++ https://galaxy.biociph
ers.org/galaxy/root?t
ool_id=majiq_spel
2
MAJIQ
115
2016
Web-tool that
takes as input
local splicing
variations (LSVs)
quantified from
RNA-seq data
and provides
users with a
visualization
package
(VOILA) and
quantification of
gene isoforms.
C++,
Python
N/A +++ https://majiq.biociphers
.org/commercial.php
2
SplAdder
116
2016
Identification,
quantification,
and testing of
alternative
splicing events
Python N/A +++ http://github.com/ratsch
lab/spladder
1
SplicePie
117
2015
Detection of
alternative, non-
sequential and
recursive splicing
Perl, R N/A +++ https://github.com/puly
akhina/splicing_analysi
s_pipeline
1
SUPPA
118
2015
Alternative
splicing analysis
Python, R Anaconda ++
https://github.com/comprn
a/SUPPA
1
SNPlice
119
2015
Identifying
variants that
modulate Intron
retention
Python N/A +++ https://code.google.com
/p/snplice/
1
IUTA
120
2014
Detecting
differential
isoform usage
R N/A ++ http://www.niehs.nih.g
ov/research/resources/s
oftware/biostatistics/iut
a/index.cfm.
1
SigFuge
121
2014
Identifying
genomic loci
exhibiting
differential
transcription
patterns
R Anaconda,
Bioconductor
++ http://bioconductor.org/
packages/release/bioc/h
tml/SigFuge.html
1
FineSplice
122
2014
Splice junction
detection and
quantification
Python N/A +++ https://sourceforge.net/
p/finesplice/
1
PennSeq
123
2014
Statistical method
that allows each
isoform to have
its own non-
uniform read
distribution
Perl N/A +++ http://sourceforge.net/p
rojects/pennseq
1
FlipFlop
124
2014
RNA isoform
identification and
quantification
with network
flows
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/flipflop.html
1
GESS
125
2014
Graph-based
exon-skipping
scanner for de
novo detection of
skipping event
sites
N/A N/A N/A http://jinlab.net/GESS_
Web/
2
spliceR
126
2013
Classification of
alternative
splicing and
prediction of
coding potential
R Anaconda,
Bioconductor
++ http://www.bioconduct
or.org/packages/2.13/bi
oc/html/spliceR.html
1
RNASeq-MATS
127
2013
Detects and
analyzes
differential
alternative
splicing events
C, Python N/A +++ http://rnaseq-
mats.sourceforge.net/
1
SplicingCompass
128
2013
Differential
splicing detection
R N/A ++ http://www.ichip.de/sof
twa
2
DiffSplice
129
2013
Genome-wide
detection of
differential
splicing
C++ N/A +++ http://www.netlab.uky.
edu/p/bioinfo/DiffSplic
e
2
DEXSeq
130
2012
Statistical method
to test for
differential exon
usage.
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/DEXSeq.html
1
SpliceSeq
131
2012
Identifies
differential
splicing events
between test and
control groups.
Java N/A +++ http://bioinformatics.m
danderson.org/main/Spl
iceSeq:Overview.
2
JuncBASE
132
2011
Identification and
quantification of
alternative
splicing,
including
unannotated
splicing
Python N/A +++ https://github.com/anbr
ooks/juncBASE
1
ALEXA-seq
133
2010
Alternative
expression
analysis.
Perl, R,
Shell
(Bash)
N/A +++ http://www.alexaplatfor
m.org/alexa_seq/.
1
h. Cell deconvolution
TIMER2.0
134
2020
Web server for
comprehensive
analysis of
Tumor-
Infiltrating
Immune Cells.
Web-tool
R,
Javascript
N/A + https://github.com/taiw
enli/TIMER
1
CIBERSORTx
135
2019
Impute gene
expression
profiles and
provide an
estimation of the
abundances of
member cell
types in a mixed
cell population.
Web-tool
Java, R
N/A + https://cibersortx.stanfo
rd.edu/
2
quanTIseq
136
2019
Quantify the
fractions of ten
immune cell
types from bulk
RNA-sequencing
data.
R, Shell
(Bash)
N/A + https://icbi.i-
med.ac.at/quantiseq
2
Immunedeconv
137
2019
Benchmarking of
transcriptome-
based cell-type
quantification
methods for
immuno-
oncology
R Anaconda ++ https://github.com/icbi-
lab/immunedeconv
1
Linseed
138
2019
Deconvolution of
cellular mixtures
based on linearity
of transcriptional
signatures.
C++, R N/A ++ https://github.com/ctlab
/LinSeed
1
deconvSEQ
139
2019
Deconvolution of
cell mixture
distribution based
on a generalized
linear model.
R N/A ++ https://github.com/rose
du1/deconvSeq
1
CDSeq
140
2019
Simultaneously
estimate both
cell-type
proportions and
cell-type-specific
expression
profiles.
MATLAB, R
N/A ++ https://github.com/kkan
g7/CDSeq_R_Package
1
Dtangle
141
2019
Estimates cell
type proportions
using publicly
available, often
cross-platform,
reference data.
R Anaconda, CRAN ++ https://github.com/gjhu
nt/dtangle
1
GEDIT
142
2019
Estimate cell type
abundances.
Web
based
tool
Python, R
N/A + http://webtools.mcdb.uc
la.edu/
2
SaVant
143
2017
Web based tool
for sample level
visualization of
molecular
signatures in gene
expression
profiles.
Javascript
, R
N/A +++ http://newpathways.mc
db.ucla.edu/savant
2
EPIC
144
2017
Simultaneously
estimates the
fraction of cancer
and immune cell
types.
R N/A ++ http://epic.gfellerlab.org
/
2
WSCUnmix
145
2017
Automated
deconvolution of
structured
mixtures.
MATLAB
N/A +++ https://github.com/tedro
man/WSCUnmix
1
Infino
146
2017
Deconvolves bulk
RNA-seq into cell
type abundances
and captures gene
expression
variability in a
Bayesian model
to measure
deconvolution
uncertainty.
R, Python N/A ++ https://github.com/ham
merlab/infino
1
MCP-counter
147
2016
Estimating the
population
abundance of
tissue-infiltrating
immune and
stromal cell
populations.
R N/A + https://github.com/ebec
ht/MCPcounter
1
CellCode
148
2015
Latent variable
approach to
differential
expression
analysis for
heterogeneous
cell populations.
R N/A ++ http://www.pitt.edu/~m
chikina/CellCODE/
2
PERT
149
2012
Probabilistic
expression
deconvolution
method.
MATLAB
N/A +++ https://github.com/gquo
n/PERT
1
i. Immune repertoire profiling
ImReP
150
2018
Profiling
immunoglobulin
repertoires across
multiple human
tissues.
Python N/A +++ https://github.com/man
dricigor/imrep/wiki
1
TRUST (T cell)
151
2016
Landscape of
tumor-infiltrating
T cell repertoire
of human
cancers.
Perl N/A +++ https://github.com/liula
b-dfci/TRUST4
1
V’DJer
152
2016
Assembly-based
inference of B-
cell receptor
repertoires from
short reads with
V’DJer.
C, C++ Anaconda ++ https://github.com/moz
ack/vdjer
1
IgBlast-based
pipeline
153
2016
Statistical
inference of a
convergent
antibody
repertoire
response.
C++ N/A +++ https://www.ncbi.nlm.n
ih.gov/igblast/
1
MiXCR
154
2015
Processes big
immunome data
from raw
sequences to
quantitated
clonotypes.
Java Anaconda ++ https://github.com/mila
boratory/mixcr
1
j. Allele specific expression
EAGLE
155
2017
Bayesian model
for identifying
GxE interactions
based on
associations
between
environmental
variables and
allele-specific
expression.
C++, R N/A ++ https://github.com/davi
daknowles/eagle
1
ANEVA-DOT/ANEVA
156
2019
Identify ASE
outlier genes /
Quantify genetic
variation in gene
dosage from ASE
data.
R N/A ++ https://github.com/PejL
ab/ANEVA-DOT
1
aFC
157
2017
Quantifying the
regulatory effect
size of cis-acting
genetic variation
Python N/A +++ https://github.com/seca
stel/aFC
1
phASER
158
2016
Uses readback
phasing to
produce
haplotype level
ASE data (as
opposed to SNP
level)
Python N/A +++ https://github.com/seca
stel/phaser
1
RASQUAL
159
2016
Maps QTLs for
sequenced based
cellular traits by
combining
population and
allele-specific
signals.
C, R N/A +++
https://github.com/nats
uhiko/rasqual
1
allelecounter
160
2015
Generate ASE
data from
RNAseq data and
a genotype file.
Python N/A +++ https://github.com/seca
stel/allelecounter
1
WASP
161
2015
Unbiased allele-
specific read
mapping and
discovery of
molecular QTLs
C, Python
Anaconda ++ https://github.com/bmv
dgeijn/WASP/
1
Mamba
162
2015
Compares
different patterns
of ASE across
tissues
R N/A ++ http://www.well.ox.ac.u
k/~rivas/mamba/.
2
MBASED
163
2014
Allele-specific
expression
detection in
cancer tissues and
cell lines
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/MBASED.html
1
Allim
164
2013
Estimates allele-
specific gene
expression.
Python, R N/A +++ https://sourceforge.net/
projects/allim/
1
AlleleSeq
165
2011
Identifies allele-
specific events in
mapped reads
between maternal
and paternal
alleles.
Python,
Shell
N/A +++ http://alleleseq.gersteinl
ab.org/
2
k. Viral detection
ROP
166
2018
Dumpster diving
in RNA-
sequencing to
find the source of
1 trillion reads
across diverse
adult human
tissues
Python,
Shell
(Bash)
Anaconda ++ https://github.com/sma
ngul1/rop
1
RNA
CoMPASS
167
2014
Simultaneous
analysis of
transcriptomes
and
metatranscriptomes
from diverse
biological
specimens.
Perl,
Shell,
Java
N/A ++ http://rnacompass.sourc
eforge.net/
1
VirusSeq
168
2013
Identify viruses
and their
integration sites
using next-
generation
sequencing of
human cancer
tissues
Perl,
Shell
(Bash)
N/A +++ http://odin.mdacc.tmc.e
du/∼xsu1/VirusSeq.ht
ml
2
VirusFinder
169
2013
Detection of
Viruses and Their
Integration Sites
in Host Genomes
through Next
Generation
Sequencing Data
Perl N/A +++ http://bioinfo.mc.vande
rbilt.edu/VirusFinder/
2
l. Fusion detection
INTEGRATE-Vis
170
2017
Generates plots
focused on
annotating each
gene fusion at the
transcript-
and protein-level
and assessing
expression across
samples.
Python N/A +++ https://github.com/Chri
sMaherLab/INTEGRA
TE-Vis
1
INTEGRATE-Neo
171
2017
Gene fusion
neoantigen
discovery tool,
which uses RNA-
Seq reads and is
capable of
reporting tumor-
specific peptides
recognizable by
immune
molecules.
Python,
C++
N/A +++ https://github.com/Chri
sMaherLab/INTEGRA
TE-Neo
1
INTEGRATE
172
2016
Capable of
integrating
aligned RNA-seq
and WGS reads
and characterizes
the quality of
predictions.
C++ N/A +++ https://sourceforge.net/
projects/integrate-
fusion/
1
TRUP
173
2015
Combines split-
read and read-pair
analysis with de
novo regional
assembly for the
identification of
chimeric
transcripts in
cancer specimens.
C++,
Perl, R
N/A +++ https://github.com/rupi
ng/TRUP
1
PRADA
174
2014
Detect gene
fusions but also
performs
alignments,
transcriptome
quantification;
mainly integrated
genome/transcriptome
read
mapping.
Python Anaconda ++ http://sourceforge.net/p
rojects/prada/
1
Pegasus
175
2014
Annotation and
prediction of
biologically
functional gene
fusion candidates.
Java,
Perl,
Python,
Shell
(Bash)
N/A +++ https://github.com/Raba
danLab/Pegasus
1
FusionCatcher
176
2014
Finding somatic
fusion genes
Python Anaconda ++ https://sourceforge.net/
projects/fusioncatcher/
1
FusionQ
177
2013
Gene fusion
detection and
quantification
from paired-end
RNA-seq
C++,
Perl, R
N/A +++ http://www.wakehealth.
edu/CTSB/Software/So
ftware.htm
2
Barnacle
178
2013
Detecting and
characterizing
tandem
duplications and
fusions in de
novo
transcriptome
assemblies
Python,
Perl
N/A +++ http://www.bcgsc.ca/pl
atform/bioinfo/software
/barnacle
2
Dissect
179
2012
Detection and
characterization
of structural
alterations in
transcribed
sequences
C N/A +++ http://dissect-
trans.sourceforge.net
1
BreakFusion
180
2012
Targeted
assembly-based
identification of
gene fusions
C++, Perl N/A +++ https://bioinformatics.m
danderson.org/public-
software/breakfusion/
2
EricScript
181
2012
Identification of
gene fusion
products in
paired-end RNA-
seq data.
Perl, R,
Shell
(Bash)
Anaconda ++ http://ericscript.sourcef
orge.net
1
Bellerophontes
182
2012
Chimeric
transcripts
discovery based
on fusion model.
Java,
Perl,
Shell
(Bash)
N/A +++ http://eda.polito.it/belle
rophontes/
2
GFML
183
2012
Standard format
for organizing
and representing
the significant
features of gene
fusion data.
XML N/A + http://code.google.com/
p/gfml-prototype/
1
FusionHunter
184
2011
Identifies fusion
transcripts from
transcriptional
analysis.
C++ N/A +++
https://github.com/ma-
compbio/FusionHunter
1
ChimeraScan
185
2011
Identifying
chimeric
transcription.
Python Anaconda ++ https://code.google.com
/archive/p/chimerascan/
downloads
1
TopHat-fusion
186
2011
Discovery of
novel fusion
transcripts.
C++,
Python
N/A +++ http://ccb.jhu.edu/softw
are/tophat/fusion_index
.shtml
2
deFuse
187
2011
Fusion discovery
in tumor RNA-
seq data.
C++,
Perl, R
Anaconda ++ https://github.com/amc
pherson/defuse/blob/ma
ster/README.md
1
m. Detecting circRNA
CIRIquant
188
2020
Accurate
quantification and
differential
expression
analysis of
circRNAs.
Python N/A ++ https://sourceforge.net/pr
ojects/ciri/files/CIRIquant
1
CIRI-vis
189
2020
Visualization of
circRNA
structures.
Java N/A +++ https://sourceforge.net/pr
ojects/ciri/files/CIRI-vis
1
Ularcirc
190
2019
Analysis and
visualisation of
canonical and
back splice
junctions.
R Bioconductor ++ https://github.com/VCCR
I/Ularcirc
1
CLEAR
191
2019
Circular and
Linear RNA
expression
analysis.
Python N/A +++ https://github.com/YangL
ab/CLEAR
1
CIRI-full
192
2019
Reconstruct and
quantify full-
length circular
RNAs.
Java N/A +++ https://sourceforge.net/pr
ojects/ciri/files/CIRI-full
1
circAST
193
2019
Full-length
assembly and
quantification of
alternatively
spliced isoforms
in Circular RNAs
Python N/A +++ https://github.com/xiaofe
ngsong/CircAST
1
CIRI2
194
2018
De novo circRNA
identification
Perl N/A +++ https://sourceforge.net/pr
ojects/ciri/files/CIRI2
1
Sailfish-cir
195
2017
Quantification of
circRNAs using
model-based
framework
Python N/A +++ https://github.com/zerodel
/sailfish-cir
1
CirComPara
196
2017
Multi-method
detection of
circRNAs
R, Python N/A +++ http://github.com/egaffo/
CirComPara
1
UROBORUS
197
2016
Computationally
identifying
circRNAs from
RNA-seq data
Perl N/A +++ https://github.com/WG
Lab/UROBORUS/tree/
master/bin
1
PTESFinder
198
2016
Identification of
non-co-linear
transcripts
Shell,
Java
N/A +++ https://sourceforge.net/
projects/ptesfinder-v1/
1
NCLscan
199
2016
Identification of
non-co-linear
transcripts
(fusion, trans-
splicing and
circular RNA)
Python N/A +++ https://github.com/Tree
sLab/NCLscan
1
DCC
200
2016
Specific
identification and
quantification of
circRNA
Python N/A +++ https://github.com/diete
rich-lab/DCC
1
CIRI-AS
201
2016
Identification of
internal structure
and alternative
splicing events in
circRNA
Perl N/A +++ https://sourceforge.net/
projects/ciri/files/CIRI-
AS
1
circTest
200
2016
Differential
expression
analysis and
plotting of
circRNAs
R N/A +++ https://github.com/diete
rich-lab/CircTest
1
CIRCexplorer2
202
2016
Annotation and
de novo assembly
of circRNAs
Python Anaconda ++ https://github.com/Yan
gLab/CIRCexplorer2
1
KNIFE
203
2015
Statistically based
detection of
circular and linear
isoforms from
RNA-seq data
Perl,
Python, R
N/A +++ https://github.com/linda
szabo/KNIFE
1
circRNA_finder
204
2014
Identification of
circRNAs from
RNA-seq data
Perl N/A +++ https://github.com/orze
choj/circRNA_finder
1
find_circ
205
2013
Identification of
circRNAs based
on head-to-tail
spliced
sequencing reads
Python N/A ++ https://github.com/marv
in-jens/find_circ
1
n. Small RNA detection
miRTrace
206
2018
Quality control of
miRNA-seq data,
identifies cross-
species
contamination.
Java Anaconda +++ https://github.com/fried
landerlab/mirtrace
1
sRNAbench
207
2015
Expression
profiling of small
RNAs, prediction
of novel
microRNAs,
analysis of
isomiRs, genome
mapping and read
length statistics.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/srnabench/
2
sRNAde
207
2015
Detection of
differentially
expressed small
RNAs based on
three programs.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/srnade/
2
sRNAblast
207
2015
Aimed to
determine the
origin of
unmapped or
unassigned reads
by means of a
blast search
against several
remote databases.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/srnablast/
2
miRNAconsTarget
207
2015
Consensus target
prediction on user
provided input
data.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/amirconstar
get/
2
sRNAjBrowser
207
2015
Visualization of
sRNA expression
data in a genome
context.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/srnajbrowse
r/
2
sRNAjBrowserDE
207
2015
Visualization of
differential
expression as a
function of read
length in a
genome context.
Web
based
tool
N/A + https://bioinfo5.ugr.es/s
rnatoolbox/srnajbrowse
rde/
2
ShortStack
208
2013
Analyzes
reference-aligned
sRNA-seq data
and performs
comprehensive de
novo annotation
and quantification
of the inferred
sRNA genes.
Perl Anaconda +++ https://github.com/Mik
eAxtell/ShortStack
1
mirTools 2.0
209
2013
Detect, identify
and profile
various types,
functional
annotation and
differentially
expressed
sRNAs.
Web
based
tool
N/A + http://www.wzgenomic
s.cn/mr2_dev/
2
UEA sRNA
Workbench
210
2012
Complete
analysis of single
or multiple-
sample small
RNA datasets.
Web
based
tool,
C++,
Java
N/A +++ https://sourceforge.net/
projects/srnaworkbench
/
1
miRDeep2
211
2011
Discovers known
and novel
miRNAs,
quantifies
miRNA
expression.
Perl Anaconda +++ https://github.com/raje
wsky-lab/mirdeep2
1
miRanalyzer
212
2011
Detection of
known and
prediction of new
microRNAs in
high-throughput
sequencing
experiments.
Web
based
tool
N/A + http://bioinfo2.ugr.es/m
iRanalyzer/miRanalyze
r.php
2
SeqBuster
213
2010
Provides an
automatized pre-
analysis for
sequence
annotation for
analysing small
RNA data from
Illumina
sequencing.
Web
based
tool
Anaconda + http://estivill_lab.crg.es
/seqbuster
2
DARIO
214
2010
Allows the study of
short read data
and provides a
wide range of
analysis features,
including quality
control, read
normalization,
and
quantification.
Web
based
tool
N/A + http://dario.bioinf.uni-
leipzig.de/index.py
2
o. Visualization tools
BEAVR
215
2020
Facilitates
interactive
analysis and
exploration of
RNA-seq data,
allowing
statistical testing
and visualization
of the table of
differentially
expressed genes
obtained.
R N/A ++ https://github.com/deve
loperpiru/BEAVR
1
coseq
216
2018
Co-expression
analysis of
sequencing data
R Anaconda,
Bioconductor
++ https://bioconductor.org
/packages/release/bioc/
html/coseq.html
1
ReadXplorer
217
2016
Read mapping
analysis and
visualization
Java N/A +++ https://www.uni-
giessen.de/fbz/fb08/Inst
/bioinformatik/software
/ReadXplorer
2
Integrated
Genome
Browser
218
2016
An interactive
tool for visually
analyzing tiling
array data and
enables
quantification of
alternative
splicing
Java N/A +++ http://www.bioviz.org/ 2
Sashimi
plots
219
2015
Quantitative
visualization
comparison of
exon usage
Python N/A ++ http://miso.readthedocs.
org/en/fastmiso/sashimi
.html
2
ASTALAVISTA
220
2015
Reports all
alternative
splicing events
reflected by
transcript
annotations
Java Anaconda ++ http://astalavista.samme
th.net/
2
RNASeqBrowser
221
2015
Incorporates and
extends the
functionality of
the UCSC
genome browser
Java N/A +++ http://www.australianpr
ostatecentre.org/researc
h/software/rnaseqbrows
er
2
SplicePlot
222
2014
Visualizing
splicing
quantitative trait
loci
Python N/A +++ http://montgomerylab.st
anford.edu/spliceplot/in
dex.html
2
RNASeqViewer
223
2014
Compare gene
expression and
alternative
splicing
Python N/A +++ https://sourceforge.net/
projects/rnaseqbrowser/
1
PrimerSeq
224
2014
Systematic design
and visualization
of RT-PCR
primers using
RNA seq data
Java,
C++,
Python
N/A +++ http://primerseq.sourcef
orge.net/
1
Epiviz
225
2014
Combining
algorithmic-
statistical analysis
and interactive
visualization
R Anaconda,
Bioconductor
++ https://epiviz.github.io/ 1
RNAbrowse
226
2014
RNA-seq De
Novo Assembly
Results Browser
N/A N/A N/A http://bioinfo.genotoul.f
r/RNAbrowse
2
ZENBU
227
2014
Interactive
visualization and
analysis of large-
scale sequencing
datasets
C++,
Javascript
N/A + https://fantom.gsc.riken.j
p/zenbu/
2
CummeRbund
53
2012
Navigate through
data produced
from a Cuffdiff
RNA-seq
differential
expression
analysis
R Anaconda,
Bioconductor
++ http://bioconductor.org/
packages/devel/bioc/ht
ml/cummeRbund.html
1
Splicing
Viewer
228
2012
Visualization of
splice junctions
and alternative
splicing
Java N/A +++ http://bioinformatics.zj.
cn/splicingviewer.
2
Table 1: Landscape of current computational methods for RNA-seq analysis. We categorized RNA-seq tools published from 2008 to
2020 by their step in the RNA-seq analysis workflow: data quality control, read alignment, gene annotation,
transcriptome assembly, transcriptome quantification, differential expression, RNA splicing, cell deconvolution, immune repertoire
profiling, allele-specific expression, viral detection, fusion detection, circRNA detection, small RNA detection, and visualization.
The third column (“Notable Features”) presents key functionalities and methods used. The fourth column (“Programming Language”)
presents the interface mode (e.g., GUI, web-based, programming language). The fifth column (“Package Manager”) indicates whether the
tool is available through a package manager such as Anaconda, Bioconductor, CRAN, Docker Hub, pip, or PyPI. The sixth column
(“Required Expertise”) gives the assumed expertise level as +, ++, or +++. A “+” represents little to no required
expertise and was assigned to GUI-based or web-interface tools. “++” was assigned to tools that require R and/or multiple
programming languages and whose software is available on Anaconda, Bioconductor, CRAN, Docker Hub, pip, or PyPI. “+++” was
assigned to tools that require expertise in languages such as C, C++, Java, Python, Perl, or Shell (Bash), whether or not a
package manager is available. For each tool, we provide the link where the published software can be found and downloaded
(“Software”). In the final column (“Type of URL”), each tool was assigned a “1” for web services designed to host source code or
a “2” for others (e.g., personal and/or university web services).
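The assignment rules described in the caption can be sketched in code. The following is a minimal illustration only, not the procedure used to build Table 1: the function names, the `has_gui_or_web` flag, and the `CODE_HOSTS` list are hypothetical, and the list of code-hosting domains is an assumption inferred from the rows of the table rather than a definition given in the text. Cases the caption does not cover (for example, an R tool without a package manager) fall through to “+++” here, which is also an assumption.

```python
from urllib.parse import urlparse

# Package managers named in the Table 1 caption.
PACKAGE_MANAGERS = {"Anaconda", "Bioconductor", "CRAN", "Docker Hub", "pip", "PyPI"}

# Hypothetical list of code-hosting domains (an assumption, not from the thesis).
CODE_HOSTS = ("github.com", "github.io", "sourceforge.net", "bioconductor.org")


def expertise_level(languages, package_managers, has_gui_or_web=False):
    """Return "+", "++", "+++", or "N/A" following the caption's rules."""
    if has_gui_or_web:
        # GUI-based or web-interface tools need little to no expertise.
        return "+"
    if not languages:
        # No language information recorded for the tool.
        return "N/A"
    on_manager = any(pm in PACKAGE_MANAGERS for pm in package_managers)
    if on_manager and ("R" in languages or len(languages) > 1):
        # R and/or multiple languages, distributed through a package manager.
        return "++"
    # Everything else: command-line expertise assumed (assumption for uncovered cases).
    return "+++"


def url_type(url):
    """Return 1 for code-hosting services, 2 for other (e.g., university) sites."""
    host = urlparse(url).hostname or ""
    is_code_host = any(host == d or host.endswith("." + d) for d in CODE_HOSTS)
    return 1 if is_code_host else 2
```

Under these assumptions, a call such as `expertise_level(["R"], ["Anaconda", "Bioconductor"])` corresponds to the Epiviz row, and `url_type("https://fantom.gsc.riken.jp/zenbu/")` to the ZENBU row.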
References
1. Kumar, G., Ertel, A., Feldman, G., Kupper, J. & Fortina, P. iSeqQC: a tool for expression-based quality control in RNA
sequencing. BMC Bioinformatics 21, 56 (2020).
2. Hicks, S. C. et al. Smooth quantile normalization. Biostatistics 19, 185–198 (2018).
3. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data.
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
4. Guo, Y. et al. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics 103, 323–328 (2014).
5. Anvar, S. Y. et al. Determining the quality and complexity of next-generation sequencing data without a reference genome.
Genome Biol. 15, 555 (2014).
6. Yang, X. et al. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 14, 33 (2013).
7. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–
2120 (2014).
8. Jiang, H., Lei, R., Ding, S.-W. & Zhu, S. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end
reads. BMC Bioinformatics 15, 182 (2014).
9. Dodt, M., Roehr, J., Ahmed, R. & Dieterich, C. FLEXBAR—Flexible Barcode and Adapter Processing for Next-Generation
Sequencing Platforms. Biology 1, 895–905 (2012).
10. Kroll, K. W. et al. Quality Control for RNA-Seq (QuaCRS): An Integrated Quality Control Pipeline. Cancer Inform. 13, 7–14
(2014).
11. Cabanski, C. R. et al. BlackOPs: increasing confidence in variant detection through mappability filtering. Nucleic Acids Res. 41,
e178 (2013).
12. Wang, L., Wang, S. & Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).
13. DeLuca, D. S. et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 28, 1530–1532
(2012).
14. Jones, D. C., Ruzzo, W. L., Peng, X. & Katze, M. G. A new approach to bias correction in RNA-Seq. Bioinformatics 28, 921–
928 (2012).
15. Lassmann, T., Hayashizaki, Y. & Daub, C. O. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics
27, 130–131 (2011).
16. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
17. Liu, B. et al. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index.
doi:10.1101/612176.
18. Boratyn, G. M., Thierry-Mieg, J., Thierry-Mieg, D., Busby, B. & Madden, T. L. Magic-BLAST, an accurate DNA and RNA-seq
aligner for long and short reads. doi:10.1101/390013.
19. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
20. Lin, H.-N. & Hsu, W.-L. DART: a fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics 34, 190–
197 (2018).
21. Kahles, A., Behr, J. & Rätsch, G. MMR: a tool for read multi-mapper resolution. Bioinformatics 32, 770–772 (2016).
22. Bonfert, T., Kirner, E., Csaba, G., Zimmer, R. & Friedel, C. C. ContextMap 2: fast and accurate context-based RNA-seq
mapping. BMC Bioinformatics 16, 122 (2015).
23. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–
360 (2015).
24. Hoffmann, S. et al. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection. Genome Biol.
15, R34 (2014).
25. Butterfield, Y. S. et al. JAGuaR: junction alignments to genome for RNA-seq reads. PLoS One 9, e102398 (2014).
26. Philippe, N., Salson, M., Commes, T. & Rivals, E. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome
Biol. 14, R30 (2013).
27. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
28. Liao, Y., Smyth, G. K. & Shi, W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids
Res. 41, e108 (2013).
29. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome
Biol. 14, R36 (2013).
30. Hu, J., Ge, H., Newman, M. & Liu, K. OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics 28, 1933–1934
(2012).
31. Zhang, Y. et al. PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data.
Bioinformatics 28, 479–486 (2012).
32. Grant, G. R. et al. Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM).
Bioinformatics 27, 2518–2528 (2011).
33. Huang, S. et al. SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from RNA-Seq Data. Front. Genet. 2, 46
(2011).
34. Wang, K. et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178
(2010).
35. Au, K. F., Jiang, H., Lin, L., Xing, Y. & Wong, W. H. Detection of splice junctions from paired-end RNA-seq data by
SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
36. Bryant, D. W., Jr, Shen, R., Priest, H. D., Wong, W.-K. & Mockler, T. C. Supersplat--spliced RNA-seq alignment.
Bioinformatics 26, 1500–1505 (2010).
37. Dimon, M. T., Sorber, K. & DeRisi, J. L. HMMSplicer: a tool for efficient and sensitive discovery of known and novel splice
junctions in RNA-Seq data. PLoS One 5, e13875 (2010).
38. De Bona, F., Ossowski, S., Schneeberger, K. & Rätsch, G. Optimal spliced alignments of short sequence reads. Bioinformatics
24, i174–80 (2008).
39. Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length
transcriptome identification and quantification. Genome Res. (2018) doi:10.1101/gr.222976.117.
40. Musacchia, F., Basu, S., Petrosino, G., Salvemini, M. & Sanges, R. Annocript: a flexible pipeline for the annotation of
transcriptomes able to identify putative long noncoding RNAs. Bioinformatics 31, 2199–2201 (2015).
41. Gao, Y., Wang, J. & Zhao, F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome
Biol. 16 (2015).
42. Amman, F. et al. TSSAR: TSS annotation regime for dRNA-seq data. BMC Bioinformatics 15, 89 (2014).
43. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals
downregulation of retained introns. doi:10.1101/410183.
44. Shao, M. & Kingsford, C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat. Biotechnol. 35,
1167–1169 (2017).
45. Song, L., Sabunciyan, S. & Florea, L. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic
Acids Res. 44, e98 (2016).
46. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–
295 (2015).
47. Chang, Z. et al. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 16
(2015).
48. Maretty, L., Sibbesen, J. A. & Krogh, A. Bayesian transcriptome assembly. Genome Biol. 15, 501 (2014).
49. Le, H.-S., Schulz, M. H., McCauley, B. M., Hinman, V. F. & Bar-Joseph, Z. Probabilistic error correction for RNA sequencing.
Nucleic Acids Res. 41, e109 (2013).
50. Bao, E., Jiang, T. & Girke, T. BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences.
Bioinformatics 29, 1250–1259 (2013).
51. Chu, H.-T. et al. EBARDenovo: highly accurate de novo assembly of RNA-Seq with efficient chimera-detection. Bioinformatics
29, 1004–1010 (2013).
52. Schulz, M. H., Zerbino, D. R., Vingron, M. & Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of
expression levels. Bioinformatics 28, 1086–1092 (2012).
53. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat.
Protoc. 7, 562–578 (2012).
54. Feng, J., Li, W. & Jiang, T. Inference of isoforms from short sequence reads. J. Comput. Biol. 18, 305–321 (2011).
55. Li, W., Feng, J. & Jiang, T. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol.
18, 1693–1707 (2011).
56. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29,
644–652 (2011).
57. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912 (2010).
58. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic
structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
59. Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification.
doi:10.1101/672931.
60. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of
transcript expression. Nat. Methods 14, 417–419 (2017).
61. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34,
525–527 (2016).
62. nanoporetech. nanoporetech/wub. GitHub https://github.com/nanoporetech/wub.
63. Schmid, M. W. & Grossniklaus, U. Rcount: simple and flexible RNA-Seq read counting. Bioinformatics 31, 436–437 (2015).
64. Anders, S., Pyl, P. T. & Huber, W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics
31, 166–169 (2015).
65. Lee, S., Seo, C. H., Alver, B. H., Lee, S. & Park, P. J. EMSAR: estimation of transcript abundance from RNA-seq data by
mappability-based segmentation and reclustering. BMC Bioinformatics 16, 278 (2015).
66. Finotello, F. et al. Reducing bias in RNA sequencing data: a novel approach to compute counts. BMC Bioinformatics 15, S7
(2014).
67. Hashimoto, T. B., Edwards, M. D. & Gifford, D. K. Universal count correction for high-throughput sequencing. PLoS Comput.
Biol. 10, e1003494 (2014).
68. Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using
lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014).
69. Rossell, D., Stephan-Otto Attolini, C., Kroiss, M. & Stöcker, A. Quantifying alternative splicing from paired-end
RNA-sequencing data. Ann. Appl. Stat. 8, 309–330 (2014).
70. Mangul, S. et al. Transcriptome assembly and quantification from Ion Torrent RNA-Seq data. BMC Genomics 15 Suppl 5, S7
(2014).
71. Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic
features. Bioinformatics 30, 923–930 (2014).
72. Behr, J. et al. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples.
Bioinformatics 29, 2529–2538 (2013).
73. Mezlini, A. M. et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 23,
519–529 (2013).
74. Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10,
71–73 (2013).
75. Glaus, P., Honkela, A. & Rattray, M. Identifying differentially expressed transcripts from RNA-seq data with biological
variation. Bioinformatics 28, 1721–1728 (2012).
76. Du, J. et al. IQSeq: integrated isoform quantification analysis based on next-generation sequencing. PLoS One 7, e29175 (2012).
77. Li, W. & Jiang, T. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics
28, 2914–2921 (2012).
78. Xu, G. et al. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format. Source Code Biol. Med. 6, 2
(2011).
79. Kim, H., Bi, Y., Pal, S., Gupta, R. & Davuluri, R. V. IsoformEx: isoform level gene expression estimation using weighted non-
negative least squares from mRNA-Seq data. BMC Bioinformatics 12, 305 (2011).
80. Nicolae, M., Mangul, S., Măndoiu, I. I. & Zelikovsky, A. Estimation of alternative splicing isoform frequencies from RNA-Seq
data. Algorithms Mol. Biol. 6, 9 (2011).
81. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC
Bioinformatics 12 (2011).
82. Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics 12, 1–17
(2011).
83. Turro, E. et al. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol. 12,
R13 (2011).
84. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying
isoform regulation. Nat. Methods 7, 1009–1015 (2010).
85. Richard, H. et al. Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res. 38,
e112 (2010).
86. Jiang, H. & Wong, W. H. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25, 1026–1032 (2009).
87. Bohnert, R., Behr, J. & Rätsch, G. Transcript quantification with RNA-Seq data. BMC Bioinformatics 10 (2009).
88. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by
RNA-Seq. Nat. Methods 5, 621–628 (2008).
89. Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R. & Love, M. I. Nonparametric expression analysis using inferential replicate
counts. Nucleic Acids Res. 47, e105 (2019).
90. Gunady, M. K., Mount, S. M. & Corrada Bravo, H. Yanagi: Fast and interpretable segment-based alternative splicing and gene
expression analysis. BMC Bioinformatics 20, 421 (2019).
91. Sterne-Weiler, T., Weatheritt, R. J., Best, A. J., Ha, K. C. H. & Blencowe, B. J. Efficient and Accurate Quantitative Profiling of
Alternative Splicing Patterns of Any Complexity on a Laptop. Mol. Cell 72, 187–200.e6 (2018).
92. Spurr, L. et al. ReQTL – an allele-level measure of variation-expression genomic relationships. doi:10.1101/464206.
93. Tapial, J. et al. An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes
that simultaneously express multiple major isoforms. Genome Res. 27, 1759–1768 (2017).
94. Frazee, A. C. et al. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat. Biotechnol. 33, 243–
246 (2015).
95. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read
counts. Genome Biol. 15, R29 (2014).
96. Shen, S. et al. rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl.
Acad. Sci. U. S. A. 111, E5593–601 (2014).
97. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biol. 15, 550 (2014).
98. Davidson, N. M. & Oshlack, A. Corset: enabling differential gene expression analysis for de novo assembled transcriptomes.
Genome Biol. 15 (2014).
99. Gu, J., Wang, X., Halakivi-Clarke, L., Clarke, R. & Xuan, J. BADGE: a novel Bayesian model for accurate abundance
quantification and differential analysis of RNA-Seq data. BMC Bioinformatics 15 Suppl 9, S6 (2014).
100. Soneson, C. compcodeR--an R package for benchmarking differential expression methods for RNA-seq data. Bioinformatics
30, 2517–2518 (2014).
101. Rau, A., Marot, G. & Jaffrézic, F. Differential meta-analysis of RNA-seq data from multiple studies. BMC Bioinformatics 15, 91
(2014).
102. Clark, N. R. et al. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC
Bioinformatics 15, 79 (2014).
103. Rau, A., Gallopin, M., Celeux, G. & Jaffrézic, F. Data-based filtering for replicated high-throughput transcriptome sequencing
experiments. Bioinformatics 29, 2146–2152 (2013).
104. Bi, Y. & Davuluri, R. V. NPEBseq: nonparametric empirical bayesian-based procedure for differential expression analysis of
RNA-seq data. BMC Bioinformatics 14, 262 (2013).
105. Leng, N. et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–
1043 (2013).
106. Yu, D., Huber, W. & Vitek, O. Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with
small sample size. Bioinformatics 29, 1275–1282 (2013).
107. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53
(2013).
108. Li, J. & Tibshirani, R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq
data. Stat. Methods Med. Res. 22, 519–536 (2013).
109. Wang, W., Qin, Z., Feng, Z., Wang, X. & Zhang, X. Identifying differentially spliced genes from two groups of RNA-seq
samples. Gene 518, 164–170 (2013).
110. Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A. & Conesa, A. Differential expression in RNA-seq: a matter of depth.
Genome Res. 21, 2213–2223 (2011).
111. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital
gene expression data. Bioinformatics 26, 139–140 (2010).
112. Wang, L., Feng, Z., Wang, X., Wang, X. & Zhang, X. DEGseq: an R package for identifying differentially expressed genes from
RNA-seq data. Bioinformatics 26, 136–138 (2010).
113. Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
114. Green, C. J., Gazzara, M. R. & Barash, Y. MAJIQ-SPEL: web-tool to interrogate classical and complex splicing variations from
RNA-Seq data. Bioinformatics 34, 300–302 (2018).
115. Vaquero-Garcia, J. et al. A new view of transcriptome complexity and regulation through the lens of local splicing variations.
Elife 5, e11752 (2016).
116. Kahles, A., Ong, C. S., Zhong, Y. & Rätsch, G. SplAdder: identification, quantification and testing of alternative splicing events
from RNA-Seq data. Bioinformatics 32, 1840–1847 (2016).
117. Pulyakhina, I. et al. SplicePie: a novel analytical approach for the detection of alternative, non-sequential and recursive splicing.
Nucleic Acids Res. 43, 11068 (2015).
118. Alamancos, G. P., Pagès, A., Trincado, J. L., Bellora, N. & Eyras, E. Leveraging transcript quantification for fast computation of
alternative splicing profiles. RNA 21, 1521–1531 (2015).
119. Mudvari, P. et al. SNPlice: variants that modulate Intron retention from RNA-sequencing data. Bioinformatics 31, 1191–1198
(2015).
120. Niu, L., Huang, W., Umbach, D. M. & Li, L. IUTA: a tool for effectively detecting differential isoform usage from RNA-Seq
data. BMC Genomics 15, 862 (2014).
121. Kimes, P. K. et al. SigFuge: single gene clustering of RNA-seq reveals differential isoform usage among cancer samples. Nucleic
Acids Res. 42, e113 (2014).
122. Gatto, A. et al. FineSplice, enhanced splice junction detection and quantification: a novel pipeline based on the assessment of
diverse RNA-Seq alignment solutions. Nucleic Acids Res. 42, e71 (2014).
123. Hu, Y. et al. PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read
distribution. Nucleic Acids Res. 42, e20 (2014).
124. Bernard, E., Jacob, L., Mairal, J. & Vert, J.-P. Efficient RNA isoform identification and quantification from RNA-Seq data with
network flows. Bioinformatics 30, 2447–2455 (2014).
125. Ye, Z. et al. Computational analysis reveals a correlation of exon-skipping events with splicing, transcription and epigenetic
factors. Nucleic Acids Res. 42, 2856–2869 (2014).
126. Vitting-Seerup, K., Porse, B. T., Sandelin, A. & Waage, J. E. spliceR: An R package for classification of alternative splicing and
prediction of coding potential from RNA-seq data. doi:10.7287/peerj.preprints.80.
127. Park, J. W., Tokheim, C., Shen, S. & Xing, Y. Identifying differential alternative splicing events from RNA sequencing data
using RNASeq-MATS. Methods Mol. Biol. 1038, 171–179 (2013).
128. Aschoff, M. et al. SplicingCompass: differential splicing detection using RNA-seq data. Bioinformatics 29, 1141–1148 (2013).
129. Hu, Y. et al. DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 41, e39
(2013).
130. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017
(2012).
131. Ryan, M. C., Cleland, J., Kim, R., Wong, W. C. & Weinstein, J. N. SpliceSeq: a resource for analysis and visualization of RNA-
Seq data on alternative splicing and its functional impacts. Bioinformatics 28, 2385–2387 (2012).
132. Brooks, A. N. et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res. 21, 193–202
(2011).
133. Griffith, M. et al. Alternative expression analysis by RNA sequencing. Nat. Methods 7, 843–847 (2010).
134. Li, T. et al. TIMER2.0 for analysis of tumor-infiltrating immune cells. Nucleic Acids Res. (2020) doi:10.1093/nar/gkaa407.
135. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol.
37, 773–782 (2019).
136. Finotello, F. et al. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of
RNA-seq data. Genome Med. 11, 34 (2019).
137. Sturm, G. et al. Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology.
Bioinformatics 35, i436–i445 (2019).
138. Zaitsev, K., Bambouskova, M., Swain, A. & Artyomov, M. N. Complete deconvolution of cellular mixtures based on linearity of
transcriptional signatures. Nat. Commun. 10, 2209 (2019).
139. Du, R., Carey, V. & Weiss, S. T. deconvSeq: deconvolution of cell mixture distribution in sequencing data. Bioinformatics 35,
5095–5102 (2019).
140. Kang, K. et al. CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression
data. PLoS Comput. Biol. 15, e1007510 (2019).
141. Hunt, G. J., Freytag, S., Bahlo, M. & Gagnon-Bartsch, J. A. dtangle: accurate and robust cell type deconvolution. Bioinformatics
35, 2093–2099 (2019).
142. Nadel, B. et al. The Gene Expression Deconvolution Interactive Tool (GEDIT): Accurate Cell Type Quantification from Gene
Expression Data. doi:10.1101/728493.
143. Lopez, D. et al. SaVanT: a web-based tool for the sample-level visualization of molecular signatures in gene expression profiles.
BMC Genomics 18 (2017).
144. Racle, J., de Jonge, K., Baumgaertner, P., Speiser, D. E. & Gfeller, D. Simultaneous enumeration of cancer and immune cell
types from bulk tumor gene expression data. Elife 6 (2017).
145. Roman, T., Xie, L. & Schwartz, R. Automated deconvolution of structured mixtures from heterogeneous tumor genomic data.
PLoS Comput. Biol. 13, e1005815 (2017).
146. Zaslavsky, M., Novik, J. B., Chang, E. & Hammerbacher, J. Infino: a Bayesian hierarchical model improves estimates of immune
infiltration into tumor microenvironment. doi:10.1101/221671.
147. Becht, E. et al. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene
expression. Genome Biol. 17, 218 (2016).
148. Chikina, M., Zaslavsky, E. & Sealfon, S. C. CellCODE: a robust latent variable approach to differential expression analysis for
heterogeneous cell populations. Bioinformatics 31, 1584–1591 (2015).
149. Qiao, W. et al. PERT: A Method for Expression Deconvolution of Human Blood Samples from Varied Microenvironmental and
Developmental Conditions. PLoS Comput. Biol. 8, e1002838 (2012).
150. Mangul, S. et al. Profiling immunoglobulin repertoires across multiple human tissues by RNA Sequencing. doi:10.1101/089235.
151. Li, B. et al. Landscape of tumor-infiltrating T cell repertoire of human cancers. Nat. Genet. 48, 725–732 (2016).
152. Mose, L. E. et al. Assembly-based inference of B-cell receptor repertoires from short read RNA sequencing data with V’DJer.
Bioinformatics 32, 3729–3734 (2016).
153. Strauli, N. B. & Hernandez, R. D. Statistical inference of a convergent antibody repertoire response to influenza vaccine. Genome
Med. 8, 60 (2016).
154. Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
155. Knowles, D. A. et al. Allele-specific expression reveals interactions between genetic variation and environment. Nat. Methods
14, 699–702 (2017).
156. Mohammadi, P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science 366,
351–356 (2019).
157. Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic
variation using allelic fold change. doi:10.1101/078717.
158. Castel, S. E., Mohammadi, P., Chung, W. K., Shen, Y. & Lappalainen, T. Rare variant phasing and haplotypic expression from
RNA sequencing with phASER. Nat. Commun. 7, 12817 (2016).
159. Kumasaka, N., Knights, A. J. & Gaffney, D. J. Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. Nat. Genet. 48,
206–213 (2016).
160. Castel, S. E., Levy-Moonshine, A., Mohammadi, P., Banks, E. & Lappalainen, T. Tools and best practices for data processing in
allelic expression analysis. Genome Biol. 16, 195 (2015).
161. van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative
trait locus discovery. Nat. Methods 12, 1061–1063 (2015).
162. Pirinen, M. et al. Assessing allele-specific expression across multiple tissues from RNA-seq read data. Bioinformatics 31, 2497–
2504 (2015).
163. Mayba, O. et al. MBASED: allele-specific expression detection in cancer tissues and cell lines. Genome Biol. 15, 405 (2014).
164. Pandey, R. V., Franssen, S. U., Futschik, A. & Schlötterer, C. Allelic imbalance metre (Allim), a new tool for measuring allele-
specific gene expression with RNA-seq data. Mol. Ecol. Resour. 13, 740–745 (2013).
165. Rozowsky, J. et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst.
Biol. 7, 522 (2011).
166. Mangul, S. et al. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human
tissues. Genome Biol. 19, 36 (2018).
167. Xu, G. et al. RNA CoMPASS: a dual approach for pathogen and host transcriptome analysis of RNA-seq datasets. PLoS One 9,
e89445 (2014).
168. Chen, Y. et al. VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human
cancer tissue. Bioinformatics 29, 266–267 (2013).
169. Wang, Q., Jia, P. & Zhao, Z. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in
host genomes through next generation sequencing data. PLoS One 8, e64465 (2013).
170. Zhang, J., Gao, T. & Maher, C. A. INTEGRATE-Vis: a tool for comprehensive gene fusion visualization. Sci. Rep. 7, 17808
(2017).
171. Zhang, J., Mardis, E. R. & Maher, C. A. INTEGRATE-neo: a pipeline for personalized gene fusion neoantigen discovery.
Bioinformatics 33, 555–557 (2017).
172. Zhang, J. et al. INTEGRATE: gene fusion discovery using whole genome and transcriptome data. Genome Res. 26, 108–118
(2016).
173. Fernandez-Cuesta, L. et al. Identification of novel fusion genes in lung cancer using breakpoint assembly of transcriptome
sequencing data. Genome Biol. 16, 7 (2015).
174. Torres-García, W. et al. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics 30, 2224–2226 (2014).
175. Abate, F. et al. Pegasus: a comprehensive annotation and prediction tool for detection of driver gene fusions in cancer. BMC Syst.
Biol. 8, 97 (2014).
176. Nicorici, D. et al. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data.
doi:10.1101/011650.
177. Liu, C., Ma, J., Chang, C. J. & Zhou, X. FusionQ: a novel approach for gene fusion detection and quantification from paired-end
RNA-Seq. BMC Bioinformatics 14, 193 (2013).
178. Swanson, L. et al. Barnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies. BMC
Genomics 14, 550 (2013).
179. Yorukoglu, D. et al. Dissect: detection and characterization of novel structural alterations in transcribed sequences.
Bioinformatics 28, i179–87 (2012).
180. Chen, K. et al. BreakFusion: targeted assembly-based identification of gene fusions in whole transcriptome paired-end
sequencing data. Bioinformatics 28, 1923–1924 (2012).
181. Benelli, M. et al. Discovering chimeric transcripts in paired-end RNA-seq data by using EricScript. Bioinformatics 28, 3232–
3239 (2012).
182. Abate, F. et al. Bellerophontes: an RNA-Seq data analysis framework for chimeric transcripts discovery based on accurate fusion
model. Bioinformatics 28, 2114–2121 (2012).
183. Kalyana-Sundaram, S., Shanmugam, A. & Chinnaiyan, A. M. Gene Fusion Markup Language: a prototype for exchanging gene
fusion data. BMC Bioinformatics 13, 269 (2012).
184. Li, Y., Chien, J., Smith, D. I. & Ma, J. FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq.
Bioinformatics 27, 1708–1710 (2011).
185. Iyer, M. K., Chinnaiyan, A. M. & Maher, C. A. ChimeraScan: a tool for identifying chimeric transcription in sequencing data.
Bioinformatics 27, 2903–2904 (2011).
186. Kim, D. & Salzberg, S. L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12, R72 (2011).
187. McPherson, A. et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput. Biol. 7, e1001138
(2011).
188. Zhang, J., Chen, S., Yang, J. & Zhao, F. Accurate quantification of circular RNAs identifies extensive circular isoform switching
events. Nat. Commun. 11, 90 (2020).
189. Zheng, Y. & Zhao, F. Visualization of circular RNAs and their internal splicing events from transcriptomic data. Bioinformatics
36, 2934–2935 (2020).
190. Humphreys, D. T., Fossat, N., Demuth, M., Tam, P. P. L. & Ho, J. W. K. Ularcirc: visualization and enhanced analysis of circular
RNAs via back and canonical forward splicing. Nucleic Acids Res. 47, e123 (2019).
191. Ma, X.-K. et al. A CLEAR pipeline for direct comparison of circular and linear RNA expression. doi:10.1101/668657.
192. Zheng, Y., Ji, P., Chen, S., Hou, L. & Zhao, F. Reconstruction of full-length circular RNAs enables isoform-level quantification.
Genome Med. 11, 2 (2019).
193. Wu, J. et al. CircAST: Full-length Assembly and Quantification of Alternatively Spliced Isoforms in Circular RNAs. Genomics
Proteomics Bioinformatics 17, 522–534 (2019).
194. Gao, Y., Zhang, J. & Zhao, F. Circular RNA identification based on multiple seed matching. Brief. Bioinform. 19, 803–810
(2018).
195. Li, M. et al. Quantifying circular RNA expression from RNA-seq data using model-based framework. Bioinformatics 33, 2131–
2139 (2017).
196. Gaffo, E., Bonizzato, A., Kronnie, G. & Bortoluzzi, S. CirComPara: A Multi-Method Comparative Bioinformatics Pipeline to
Detect and Study circRNAs from RNA-seq Data. Non-Coding RNA 3, 8 (2017).
197. Song, X. et al. Circular RNA profile in gliomas revealed by identification tool UROBORUS. Nucleic Acids Res. 44, e87 (2016).
198. Izuogu, O. G. et al. PTESFinder: a computational method to identify post-transcriptional exon shuffling (PTES) events. BMC
Bioinformatics 17, 31 (2016).
199. Chuang, T.-J. et al. NCLscan: accurate identification of non-co-linear transcripts (fusion, trans-splicing and circular RNA) with a
good balance between sensitivity and precision. Nucleic Acids Res. 44, e29 (2016).
200. Cheng, J., Metge, F. & Dieterich, C. Specific identification and quantification of circular RNAs from sequencing data.
Bioinformatics 32, 1094–1096 (2016).
201. Gao, Y. et al. Comprehensive identification of internal structure and alternative splicing events in circular RNAs. Nat. Commun.
7, 12060 (2016).
202. Zhang, X.-O. et al. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 26, 1277–
1287 (2016).
203. Szabo, L. et al. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA
during human fetal development. Genome Biol. 16, 126 (2015).
204. Westholm, J. O. et al. Genome-wide analysis of drosophila circular RNAs reveals their structural and sequence properties and
age-dependent neural accumulation. Cell Rep. 9, 1966–1980 (2014).
205. Memczak, S. et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495, 333–338 (2013).
206. Kang, W. et al. miRTrace reveals the organismal origins of microRNA sequencing data. Genome Biol. 19, 213 (2018).
207. Rueda, A. et al. sRNAtoolbox: an integrated collection of small RNA research tools. Nucleic Acids Res. 43, W467–73 (2015).
208. Axtell, M. J. ShortStack: comprehensive annotation and quantification of small RNA genes. RNA 19, 740–751 (2013).
209. Wu, J. et al. mirTools 2.0 for non-coding RNA discovery, profiling, and functional annotation based on high-throughput
sequencing. RNA Biol. 10, 1087–1092 (2013).
210. Stocks, M. B. et al. The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing
microRNA and small RNA datasets. Bioinformatics 28, 2059–2061 (2012).
211. Friedländer, M. R., Mackowiak, S. D., Li, N., Chen, W. & Rajewsky, N. miRDeep2 accurately identifies known and hundreds of
novel microRNA genes in seven animal clades. Nucleic Acids Res. 40, 37–52 (2012).
212. Hackenberg, M., Rodríguez-Ezpeleta, N. & Aransay, A. M. miRanalyzer: an update on the detection and analysis of microRNAs
in high-throughput sequencing experiments. Nucleic Acids Res. 39, W132–8 (2011).
213. Pantano, L., Estivill, X. & Martí, E. SeqBuster, a bioinformatic tool for the processing and analysis of small RNAs datasets,
reveals ubiquitous miRNA modifications in human embryonic cells. Nucleic Acids Res. 38, e34 (2010).
214. Fasold, M., Langenberger, D., Binder, H., Stadler, P. F. & Hoffmann, S. DARIO: a ncRNA detection and analysis tool for
next-generation sequencing experiments. Nucleic Acids Res. 39, W112–W117 (2011).
215. Perampalam, P. & Dick, F. A. BEAVR: a browser-based tool for the exploration and visualization of RNA-seq data. BMC
Bioinformatics 21, 221 (2020).
216. Rau, A. & Maugis-Rabusseau, C. Transformation and model choice for RNA-seq co-expression analysis. Brief. Bioinform. 19,
425–436 (2018).
217. Hilker, R. et al. ReadXplorer 2—detailed read mapping analysis and visualization from one single source. Bioinformatics 32,
3702–3708 (2016).
218. Freese, N. H., Norris, D. C. & Loraine, A. E. Integrated genome browser: visual analytics platform for genomics. Bioinformatics
32, 2089–2095 (2016).
219. Katz, Y. et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics 31, 2400–2402
(2015).
220. Foissac, S. & Sammeth, M. Analysis of alternative splicing events in custom gene datasets by AStalavista. Methods Mol. Biol.
1269, 379–392 (2015).
221. An, J. et al. RNASeqBrowser: a genome browser for simultaneous visualization of raw strand specific RNAseq reads and UCSC
genome browser custom tracks. BMC Genomics 16, 145 (2015).
222. Wu, E., Nance, T. & Montgomery, S. B. SplicePlot: a utility for visualizing splicing quantitative trait loci. Bioinformatics 30,
1025–1026 (2014).
223. Rogé, X. & Zhang, X. RNAseqViewer: visualization tool for RNA-Seq data. Bioinformatics 30, 891–892 (2014).
224. Tokheim, C., Park, J. W. & Xing, Y. PrimerSeq: Design and visualization of RT-PCR primers for alternative splicing using
RNA-seq data. Genomics Proteomics Bioinformatics 12, 105–109 (2014).
225. Chelaru, F., Smith, L., Goldstein, N. & Bravo, H. C. Epiviz: interactive visual analytics for functional genomics data. Nat.
Methods 11, 938–940 (2014).
226. Mariette, J. et al. RNAbrowse: RNA-Seq de novo assembly results browser. PLoS One 9, e96821 (2014).
227. Severin, J. et al. Interactive visualization and analysis of large-scale sequencing datasets using ZENBU. Nat. Biotechnol. 32,
217–219 (2014).
228. Liu, Q. et al. Detection, annotation and visualization of alternative splicing from RNA-Seq data with SplicingViewer. Genomics
99, 178–182 (2012).
Abstract
Over the past decade, next-generation sequencing (NGS) technologies coupled with novel computational methods have revolutionized the field of genomics. The bioinformatics community has worked together to unlock the capacities of genomic datasets, enabling scientific advances such as the identification of novel biomarkers and the discovery of new biological pathways. This work showcases the role of genomics, computational methods, and next-generation sequencing technologies across six chapters. The first chapter discusses the importance of data-driven research and how, although computational data-driven research faces various challenges, it is on the rise and gaining independence. The second chapter covers the Illumina, Nanopore, and PacBio next-generation sequencing technologies. The third chapter presents a survey of RNA-seq tools and discusses their archival stability and usability. The fourth chapter covers the challenges of read alignment along with the file formats associated with it. The fifth chapter covers the functionality of Bowtie and NextGenMap, two read alignment tools. Lastly, the sixth chapter discusses the key role genomics played in helping to address the COVID-19 pandemic.
Asset Metadata
Creator
Chhugani, Karishma (author)
Core Title
Unlocking capacities of genomics datasets through effective computational methods
School
School of Pharmacy
Degree
Master of Science
Degree Program
Pharmaceutical Sciences
Degree Conferral Date
2021-08
Publication Date
07/18/2021
Defense Date
07/15/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatic tools,bioinformatics,computational methods,dry lab,genomics,Illumina,Nanopore,OAI-PMH Harvest,PacBio,read alignment,RNA sequencing,sequencing,wet lab
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Mangul, Serghei (committee chair), Okamoto, Curtis (committee member), Schmidt, Ryan (committee member)
Creator Email
chhugani@usc.edu,kckcgreen@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15609012
Unique identifier
UC15609012
Legacy Identifier
etd-ChhuganiKa-9782
Document Type
Thesis
Rights
Chhugani, Karishma
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu